GoEmotions: A Dataset of Fine-Grained Emotions

Dorottya Demszky 1* Dana Movshovitz-Attias 2 Jeongwoo Ko 2 Alan Cowen 2 Gaurav Nemade 2 Sujith Ravi 3*
1 Stanford Linguistics  2 Google Research  3 Amazon Alexa
ddemszky@stanford.edu  {danama, jko, acowen, gnemade}@google.com  sravi@sravi.org

arXiv:2005.00547v1 [cs.CL] 1 May 2020

Abstract

Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks. We introduce GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral. We demonstrate the high quality of the annotations via Principal Preserved Component Analysis. We conduct transfer learning experiments with existing emotion benchmarks to show that our dataset generalizes well to other domains and different emotion taxonomies. Our BERT-based model achieves an average F1-score of .46 across our proposed taxonomy, leaving much room for improvement.[1]

1 Introduction

Emotion expression and detection are central to the human experience and social interaction. With as many as a handful of words we are able to express a wide variety of subtle and complex emotions, and it has thus been a long-term goal to enable machines to understand affect and emotion (Picard, 1997).

In the past decade, NLP researchers made available several datasets for language-based emotion classification for a variety of domains and applications, including for news headlines (Strapparava and Mihalcea, 2007), tweets (CrowdFlower, 2016; Mohammad et al., 2018), and narrative sequences (Liu et al., 2019), to name just a few. However, existing available datasets are (1) mostly small, containing up to several thousand instances, and (2) cover a limited emotion taxonomy, with coarse classification into Ekman (Ekman, 1992b) or Plutchik (Plutchik, 1980) emotions.

* Work done while at Google Research.
[1] Data and code available at https://github.com/google-research/google-research/tree/master/goemotions

Sample Text | Label(s)
OMG, yep!!! That is the final answer. Thank you so much! | gratitude, approval
I'm not even sure what it is, why do people hate it | confusion
Guilty of doing this tbph | remorse
This caught me off guard for real. I'm actually off my bed laughing | surprise, amusement
I tried to send this to a friend but [NAME] knocked it away. | disappointment

Table 1: Example annotations from our dataset.

Recently, Bostan and Klinger (2018) have aggregated 14 popular emotion classification corpora under a unified framework that allows direct comparison of the existing resources. Importantly, their analysis suggests annotation quality gaps in the largest manually annotated emotion classification dataset, CrowdFlower (2016), containing 40K tweets labeled for one of 13 emotions. While their work enables such comparative evaluations, it highlights the need for a large-scale, consistently labeled emotion dataset over a fine-grained taxonomy, with demonstrated high-quality annotations.

To this end, we compiled GoEmotions, the largest human annotated dataset of 58k carefully selected Reddit comments, labeled for 27 emotion categories or Neutral, with comments extracted from popular English subreddits. Table 1 shows an illustrative sample of our collected data. We design our emotion taxonomy considering related work in psychology and coverage in our data. In contrast to Ekman's taxonomy, which includes only one positive emotion (joy), our taxonomy includes a large number of positive, negative, and ambiguous emotion categories, making it suitable for downstream


conversation understanding tasks that require a subtle understanding of emotion expression, such as the analysis of customer feedback or the enhancement of chatbots.

We include a thorough analysis of the annotated data and the quality of the annotations. Via Principal Preserved Component Analysis (Cowen et al., 2019b), we show strong support for reliable dissociation among all 27 emotion categories, indicating the suitability of our annotations for building an emotion classification model.

We perform hierarchical clustering on the emotion judgments, finding that emotions related in intensity cluster together closely, and that the top-level clusters correspond to sentiment categories. These relations among emotions allow for their potential grouping into higher-level categories, if desired for a downstream task.

We provide a strong baseline for modeling fine-grained emotion classification over GoEmotions. By fine-tuning a BERT-base model (Devlin et al., 2019), we achieve an average F1-score of .46 over our taxonomy, .64 over an Ekman-style grouping into six coarse categories, and .69 over a sentiment grouping. These results leave much room for improvement, showcasing that this task is not yet fully addressed by current state-of-the-art NLU models.

We conduct transfer learning experiments with existing emotion benchmarks to show that our data can generalize to different taxonomies and domains, such as tweets and personal narratives. Our experiments demonstrate that, given limited resources to label additional emotion classification data for specialized domains, our data can provide baseline emotion understanding and contribute to increasing model accuracy for the target domain.

2 Related Work

2.1 Emotion Datasets

Ever since Affective Text (Strapparava and Mihalcea, 2007), the first benchmark for emotion recognition, was introduced, the field has seen several emotion datasets that vary in size, domain and taxonomy (cf. Bostan and Klinger, 2018). The majority of emotion datasets are constructed manually, but tend to be relatively small. The largest manually labeled dataset is CrowdFlower (2016), with 39k labeled examples, which were found by Bostan and Klinger (2018) to be noisy in comparison with other emotion datasets. Other datasets are automatically weakly-labeled, based on emotion-related hashtags on Twitter (Wang et al., 2012; Abdul-Mageed and Ungar, 2017). We build our dataset manually, making it the largest human-annotated dataset, with multiple annotations per example for quality assurance.

Several existing datasets come from the domain of Twitter, given its informal language and expressive content, such as emojis and hashtags. Other datasets annotate news headlines (Strapparava and Mihalcea, 2007), dialogs (Li et al., 2017), fairy-tales (Alm et al., 2005), movie subtitles (Öhman et al., 2018), sentences based on FrameNet (Ghazi et al., 2015), or self-reported experiences (Scherer and Wallbott, 1994), among other domains. We are the first to build on Reddit comments for emotion prediction.

2.2 Emotion Taxonomy

One of the main aspects distinguishing our dataset is its emotion taxonomy. The vast majority of existing datasets contain annotations for minor variations of the six basic emotion categories (joy, anger, fear, sadness, disgust, and surprise) proposed by Ekman (1992a), and/or along affective dimensions (valence and arousal) that underpin the circumplex model of affect (Russell, 2003; Buechel and Hahn, 2017).

Recent advances in psychology have offered new conceptual and methodological approaches to capturing the more complex "semantic space" of emotion (Cowen et al., 2019a) by studying the distribution of emotion responses to a diverse array of stimuli via computational techniques. Studies guided by these principles have identified 27 distinct varieties of emotional experience conveyed by short videos (Cowen and Keltner, 2017), 13 by music (Cowen et al., in press), 28 by facial expression (Cowen and Keltner, 2019), 12 by speech prosody (Cowen et al., 2019b), and 24 by nonverbal vocalization (Cowen et al., 2018). In this work, we build on these methods and findings to devise our granular taxonomy for text-based emotion recognition, and study the dimensionality of language-based emotion space.

2.3 Emotion Classification Models

Both feature-based and neural models have been used to build automatic emotion classification models. Feature-based models often make use of hand-built lexicons, such as the Valence Arousal Dominance Lexicon (Mohammad, 2018). Using representations from BERT (Devlin et al., 2019), a transformer-based model with language model pre-training, has recently been shown to reach state-of-the-art performance on several NLP tasks, including emotion prediction: the top-performing models in the EmotionX Challenge (Hsu and Ku, 2018) all employed a pre-trained BERT model. We also use the BERT model in our experiments, and we find that it outperforms our biLSTM model.

3 GoEmotions

Our dataset is composed of 58K Reddit comments, labeled for one or more of 27 emotions or Neutral.

3.1 Selecting & Curating Reddit comments

We use a Reddit data dump originating in the reddit-data-tools project,[2] which contains comments from 2005 (the start of Reddit) to January 2019. We select subreddits with at least 10k comments and remove deleted and non-English comments.

Reddit is known for a demographic bias leaning towards young male users (Duggan and Smith, 2013), which is not reflective of a globally diverse population. The platform also introduces a skew towards toxic, offensive language (Mohan et al., 2017). Thus, Reddit content has been used to study depression (Pirina and Çöltekin, 2018) and microaggressions (Breitfeller et al., 2019), and Yanardag and Rahwan (2018) have shown the effect of using biased Reddit data by training a "psychopath" bot. To address these concerns, and to enable building broadly representative emotion models using GoEmotions, we take a series of data curation measures to ensure our data does not reinforce general nor emotion-specific language biases.

We identify harmful comments using predefined lists containing offensive/adult, vulgar (mildly offensive profanity), identity, and religion terms (included as supplementary material). These are used for data filtering and masking, as described below. Lists were internally compiled, and we believe they are comprehensive and widely useful for dataset curation; however, they may not be complete.

Reducing profanity. We remove subreddits that are not safe for work[3] and where 10%+ of comments include offensive/adult and vulgar tokens. We remove remaining comments that include offensive/adult tokens. Vulgar comments are preserved, as we believe they are central to learning about negative emotions. The dataset includes the list of filtered tokens.

[2] https://github.com/dewarim/reddit-data-tools
[3] http://redditlist.com/nsfw

Manual review. We manually review identity comments and remove those offensive towards a particular ethnicity, gender, sexual orientation, or disability, to the best of our judgment.

Length filtering. We apply NLTK's word tokenizer and select comments 3-30 tokens long, including punctuation. To create a relatively balanced distribution of comment length, we perform downsampling, capping by the number of comments with the median token count (12).

Sentiment balancing. We reduce sentiment bias by removing subreddits with little representation of positive, negative, ambiguous, or neutral sentiment. To estimate a comment's sentiment, we run our emotion prediction model trained on a pilot batch of 2.2k annotated examples. The mapping of emotions into sentiment categories is found in Figure 2. We exclude subreddits consisting of more than 30% neutral comments, or less than 20% of negative, positive, or ambiguous comments.

Emotion balancing. We assign a predicted emotion to each comment using the pilot model described above. Then, we reduce emotion bias by downsampling the weakly-labelled data, capping by the number of comments belonging to the median emotion count.

Subreddit balancing. To avoid over-representation of popular subreddits, we perform downsampling, capping by the median subreddit count.
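The balancing steps above share one pattern: group the comments by some key, then cap every group at the median group size. A minimal sketch of that pattern (illustrative only; the helper name and toy data are ours, not the authors' pipeline):

```python
import random
from collections import defaultdict

def downsample_to_median(items, key):
    """Cap each group at the median group size, as in the
    emotion/subreddit balancing steps described above (a sketch)."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sizes = sorted(len(g) for g in groups.values())
    cap = sizes[len(sizes) // 2]  # median group size
    kept = []
    for group in groups.values():
        kept.extend(random.sample(group, cap) if len(group) > cap else group)
    return kept

# Toy data: group sizes 10, 4, 2 -> median cap 4, so "a" is downsampled to 4.
comments = [{"subreddit": s} for s in ["a"] * 10 + ["b"] * 4 + ["c"] * 2]
balanced = downsample_to_median(comments, key=lambda c: c["subreddit"])
```

The same helper covers emotion balancing by keying on the predicted emotion instead of the subreddit.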

From the remaining 315k comments (from 482 subreddits), we randomly sample for annotation.

Masking. We mask proper names referring to people with a [NAME] token, using a BERT-based Named Entity Tagger (Tsai et al., 2019). We mask religion terms with a [RELIGION] token. The list of these terms is included with our dataset. Note that raters viewed unmasked comments during rating.

3.2 Taxonomy of Emotions

When creating the taxonomy, we seek to jointly maximize the following objectives:

1. Provide greatest coverage in terms of emotions expressed in our data. To address this, we manually labeled a small subset of the data and ran a pilot task where raters could suggest emotion labels on top of the pre-defined set.

2. Provide greatest coverage in terms of kinds of emotional expression. We consult psychology literature on emotion expression and recognition (Plutchik, 1980; Cowen and Keltner, 2017; Cowen et al., 2019b). Since, to our knowledge, there has not been research that identifies principal categories for emotion recognition in the domain of text (see Section 2.2), we consider those emotions that are identified as basic in other domains (video and speech) and that we can assume to apply to text as well.

3. Limit overlap among emotions and limit the number of emotions. We do not want to include emotions that are too similar, since that makes the annotation task more difficult. Moreover, combining similar labels with high coverage would result in an explosion in annotated labels.

The final set of selected emotions is listed in Table 4 and Figure 1. See Appendix B for more details on our multi-step taxonomy selection procedure.

3.3 Annotation

We assigned three raters to each example. For those examples where no raters agree on at least one emotion label, we assigned two additional raters. All raters are native English speakers from India.[4]

Instructions. Raters were asked to identify the emotions expressed by the writer of the text, given pre-defined emotion definitions (see Appendix A) and a few example texts for each emotion. Raters were free to select multiple emotions, but were asked to only select those for which they were reasonably confident that the emotion is expressed in the text. If raters were not certain about any emotion being expressed, they were asked to select Neutral. We included a checkbox for raters to indicate if an example was particularly difficult to label, in which case they could select no emotions. We removed all examples for which no emotion was selected.

The rater interface. Reddit comments were presented with no additional metadata (such as the author or subreddit). To help raters navigate the large space of emotions in our taxonomy, they were presented with a table containing all emotion categories, aggregated by sentiment (by the mapping in Figure 2) and by whether that emotion is generally expressed towards something (e.g., disapproval) or is

[4] Cowen et al. (2019b) find that emotion judgments in Indian and US English speakers largely occupy the same dimensions.

Number of examples: 58,009
Number of emotions: 27 + Neutral
Number of unique raters: 82
Number of raters / example: 3 or 5
Marked unclear or difficult to label: 1.6%
Number of labels per example: 1: 83%; 2: 15%; 3: 2%; 4+: 0.2%
Number of examples w/ 2+ raters agreeing on at least 1 label: 54,263 (94%)
Number of examples w/ 3+ raters agreeing on at least 1 label: 17,763 (31%)

Table 2: Summary statistics of our labeled data.

more of an intrinsic feeling (e.g., joy). The instructions highlighted that this separation of categories was by no means clear-cut, but captured general tendencies, and we encouraged raters to ignore the categorization whenever they saw fit. Emotions with a straightforward mapping onto emojis were shown with an emoji in the UI, to further ease their interpretation.

4 Data Analysis

Table 2 shows summary statistics for the data. Most of the examples (83%) have a single emotion label, and most have at least two raters agreeing on a single label (94%). The Neutral category makes up 26% of all emotion labels; we exclude that category from the following analyses, since we do not consider it to be part of the semantic space of emotions.

Figure 1 shows the distribution of emotion labels. We can see a large disparity in terms of emotion frequencies (e.g., admiration is 30 times more frequent than grief), despite the emotion and sentiment balancing steps taken during data selection. This is expected, given the disparate frequencies of emotions in natural human expression.

4.1 Interrater Correlation

We estimate rater agreement for each emotion via interrater correlation (Delgado and Tibau, 2019).[5] For each rater r ∈ R, we calculate the Spearman correlation between r's judgments and the mean of other raters' judgments, for all examples that r rated. We then take the average of these rater-level correlation scores. In Section 4.3, we show that each emotion has significant interrater correlation after controlling for several potential confounds.

[5] We use correlations as opposed to Cohen's kappa (Cohen, 1960) because the former is a more interpretable metric, and it is also more suitable for measuring agreement among a variable number of raters rating different examples. In Appendix C, we report Cohen's kappa values as well, which correlate highly with the values obtained from interrater correlation (Pearson r = 0.85, p < 0.001).

Figure 1: Our emotion categories, ordered by the number of examples where at least one rater uses a particular label. The color indicates the interrater correlation.
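The leave-one-rater-out correlation described above can be sketched as follows. For simplicity, this toy version assumes a dense matrix where every rater rated every example (in the actual dataset, each rater rates a different subset); the variable names and synthetic data are ours:

```python
import numpy as np
from scipy.stats import spearmanr

def interrater_correlation(ratings):
    """ratings: (n_raters, n_examples) array of judgments for one emotion.
    For each rater, correlate their judgments with the mean of the other
    raters' judgments, then average the rater-level scores (a sketch of
    the procedure described above)."""
    corrs = []
    for r in range(ratings.shape[0]):
        others_mean = np.delete(ratings, r, axis=0).mean(axis=0)
        rho, _ = spearmanr(ratings[r], others_mean)
        corrs.append(rho)
    return float(np.mean(corrs))

# Synthetic check: five raters observing the same signal plus small noise
# should show high interrater correlation.
rng = np.random.default_rng(0)
signal = rng.random(200)
raters = signal + 0.1 * rng.random((5, 200))
avg_corr = interrater_correlation(raters)
```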

Figure 1 shows that gratitude, admiration and amusement have the highest, and grief and nervousness have the lowest interrater correlation. Emotion frequency correlates with interrater agreement, but the two are not equivalent. Infrequent emotions can have relatively high interrater correlation (e.g., fear), and frequent emotions can have relatively low interrater correlation (e.g., annoyance).

4.2 Correlation Among Emotions

To better understand the relationship between emotions in our data, we look at their correlations. Let N be the number of examples in our dataset. We obtain N-dimensional vectors for each emotion by averaging raters' judgments for all examples labeled with that emotion. We calculate Pearson correlation values between each pair of emotions. The heatmap in Figure 2 shows that emotions that are related in intensity (e.g., annoyance and anger, joy and excitement, nervousness and fear) have a strong positive correlation. On the other hand, emotions that have the opposite sentiment are negatively correlated.
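The computation above reduces to a correlation matrix over per-emotion rating vectors. A minimal sketch with toy data (three hypothetical emotions and six examples; not the dataset's actual ratings):

```python
import numpy as np

# judgments[e][i] = mean rater judgment for emotion e on example i.
# Rows 0 and 1 mimic intensity-related emotions (e.g., annoyance/anger);
# row 2 mimics an opposite-sentiment emotion (e.g., joy).
judgments = np.array([
    [1.0, 0.8, 0.0, 0.1, 0.9, 0.0],
    [0.9, 0.7, 0.1, 0.0, 1.0, 0.1],
    [0.0, 0.1, 0.9, 1.0, 0.0, 0.8],
])
corr = np.corrcoef(judgments)  # (3, 3) pairwise Pearson correlations
```

With real data, `corr` is the matrix visualized in the Figure 2 heatmap.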

Figure 2: The heatmap shows the correlation between ratings for each pair of emotions. The dendrogram represents a hierarchical clustering of the ratings, with emotions ordered from positive (amusement, excitement, joy, love, desire, optimism, caring, pride, admiration, gratitude, relief, approval) through ambiguous (realization, surprise, curiosity, confusion) to negative (fear, nervousness, remorse, embarrassment, disappointment, sadness, grief, disgust, anger, annoyance, disapproval). The sentiment labeling was done a priori, and it shows that the clusters closely map onto sentiment groups.

We also perform hierarchical clustering to uncover the nested structure of our taxonomy. We use correlation as a distance metric and ward as a linkage method, applied to the averaged ratings. The dendrogram on the top of Figure 2 shows that emotions related by intensity are neighbors, and that larger clusters map closely onto sentiment categories. Interestingly, emotions that we labeled as "ambiguous" in terms of sentiment (e.g., surprise) are closer to the positive than to the negative category. This suggests that, in our data, ambiguous emotions are more likely to occur in the context of positive sentiment than that of negative sentiment.
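The clustering step above can be sketched with SciPy. Note one caveat: Ward linkage formally assumes Euclidean distances, so applying it to a correlation-based distance (1 − Pearson correlation), as described, is a pragmatic choice rather than a strict use of the method. Toy data and names are ours:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy rating vectors: two emotions sharing one latent signal ("positive")
# and two sharing another ("negative"), plus small noise.
rng = np.random.default_rng(1)
base_pos, base_neg = rng.random(100), rng.random(100)
ratings = np.stack([
    base_pos + 0.05 * rng.random(100),  # e.g. joy
    base_pos + 0.05 * rng.random(100),  # e.g. excitement
    base_neg + 0.05 * rng.random(100),  # e.g. anger
    base_neg + 0.05 * rng.random(100),  # e.g. annoyance
])

dist = 1.0 - np.corrcoef(ratings)                      # correlation as distance
Z = linkage(squareform(dist, checks=False), method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")        # cut into 2 clusters
```

With the real averaged ratings, `Z` yields the dendrogram drawn on top of Figure 2.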

4.3 Principal Preserved Component Analysis

To better understand agreement among raters and the latent structure of the emotion space, we apply Principal Preserved Component Analysis (PPCA) (Cowen et al., 2019b) to our data. PPCA extracts linear combinations of attributes (here, emotion judgments) that maximally covary across two sets of data that measure the same attributes (here, randomly split judgments for each example). Thus, PPCA allows us to uncover latent dimensions of emotion that have high agreement across raters.

Unlike Principal Component Analysis (PCA), PPCA examines the cross-covariance between datasets, rather than the variance-covariance matrix within a single dataset. We obtain the principal preserved components (PPCs) of two datasets (matrices) X, Y ∈ R^(N×|E|), where N is the number of examples and |E| is the number of emotions, by calculating the eigenvectors of the symmetrized cross-covariance matrix X^T Y + Y^T X.

Algorithm 1: Leave-One-Rater-Out PPCA
1: R ← set of raters
2: E ← set of emotions
3: C ∈ R^(|R|×|E|)
4: for all raters r ∈ 1..|R| do
5:   n ← number of examples annotated by r
6:   J ∈ R^(n×|R|×|E|) ← all ratings for the examples annotated by r
7:   J_−r ∈ R^(n×(|R|−1)×|E|) ← all ratings in J, excluding r
8:   J_r ∈ R^(n×|E|) ← all ratings by r
9:   X, Y ∈ R^(n×|E|) ← randomly split J_−r and average ratings across raters for both sets
10:  W ∈ R^(|E|×|E|) ← result of PPCA(X, Y)
11:  for all components† w_i, i ∈ 1..|E|, in W do
12:    v_i^r ← projection‡ of J_r onto w_i
13:    v_i^−r ← projection‡ of J_−r onto w_i
14:    C_{r,i} ← correlation between v_i^r and v_i^−r, partialing out v_k^−r for all k ∈ 1..i−1
15:  end for
16: end for
17: C′ ← Wilcoxon signed-rank test on C
18: C″ ← Bonferroni correction on C′ (α = 0.05)
† in descending order of eigenvalue
‡ we demean vectors before projection
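The eigendecomposition described above fits in a few lines of NumPy. This is a minimal sketch of the definition (demean, symmetrize the cross-covariance, take eigenvectors); the toy data and function name are ours:

```python
import numpy as np

def ppca(X, Y):
    """Principal preserved components: eigenvectors of the symmetrized
    cross-covariance X^T Y + Y^T X, for two (N, |E|) judgment matrices
    from two rater splits (a sketch of the definition above)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)  # demean each attribute
    M = Xc.T @ Yc + Yc.T @ Xc                        # symmetrized cross-covariance
    eigvals, eigvecs = np.linalg.eigh(M)             # M is symmetric -> eigh
    order = np.argsort(eigvals)[::-1]                # descending eigenvalues
    return eigvals[order], eigvecs[:, order]

# Toy check: rank-1 structure shared by both splits should dominate the
# leading component, while independent noise contributes little.
rng = np.random.default_rng(0)
shared = rng.random((500, 1)) @ rng.random((1, 4))
X = shared + 0.1 * rng.random((500, 4))
Y = shared + 0.1 * rng.random((500, 4))
eigvals, eigvecs = ppca(X, Y)
```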

Extracting significant dimensions. We remove examples labeled as Neutral, and keep those examples that still have at least 3 ratings after this filtering step. We then determine the number of significant dimensions using a leave-one-rater-out analysis, as described by Algorithm 1.

We find that all 27 PPCs are highly significant. Specifically, Bonferroni-corrected p-values are less than 1.5e-6 for all dimensions (corrected α = 0.0017), suggesting that the emotions were highly dissociable. Such a high degree of significance for all dimensions is nontrivial: for example, Cowen et al. (2019b) find that only 12 out of their 30 emotion categories are significantly dissociable.

t-SNE projection. To better understand how the examples are organized in the emotion space, we apply t-SNE, a dimension reduction method that seeks to preserve distances between data points, using the scikit-learn package (Pedregosa et al., 2011). The dataset can be explored in our interactive plot,[6] where one can also look at the texts and the annotations. The color of each data point is the weighted average of the RGB values representing those emotions that at least half of the raters selected.

4.4 Linguistic Correlates of Emotions

We extract the lexical correlates of each emotion by calculating the log odds ratio, informative Dirichlet prior (Monroe et al., 2008) of all tokens for each emotion category, contrasting to all other emotions. Since the log odds are z-scored, all values greater than 3 indicate highly significant (>3 std) association with the corresponding emotion. We list the top 5 tokens for each category in Table 3. We find that those emotions that are highly significantly associated with certain tokens (e.g., gratitude with "thanks", amusement with "lol") tend to have the highest interrater correlation (see Figure 1). Conversely, emotions that have fewer significantly associated tokens (e.g., grief and nervousness) tend to have low interrater correlation. These results suggest certain emotions are more verbally implicit and may require more context to be interpreted.
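The Monroe et al. (2008) statistic can be sketched as follows: the log odds of a token in the target emotion versus all other emotions, smoothed by corpus-wide counts as the Dirichlet prior, divided by an approximate standard deviation. This is our illustrative implementation of the published formula, not the authors' code, and the toy counts are invented:

```python
import math
from collections import Counter

def log_odds_dirichlet(counts_i, counts_j, prior):
    """z-scored log odds ratio with informative Dirichlet prior
    (Monroe et al., 2008). counts_i: token counts in the target emotion;
    counts_j: token counts in all other emotions; prior: corpus-wide
    token counts used as the prior."""
    n_i = sum(counts_i.values())
    n_j = sum(counts_j.values())
    a0 = sum(prior.values())
    z = {}
    for w, a_w in prior.items():
        y_i, y_j = counts_i.get(w, 0), counts_j.get(w, 0)
        # Difference of smoothed log odds between the two groups.
        delta = (math.log((y_i + a_w) / (n_i + a0 - y_i - a_w))
                 - math.log((y_j + a_w) / (n_j + a0 - y_j - a_w)))
        # Approximate variance of the estimate, then z-score.
        var = 1.0 / (y_i + a_w) + 1.0 / (y_j + a_w)
        z[w] = delta / math.sqrt(var)
    return z

gratitude = Counter({"thanks": 50, "the": 40, "much": 20})
others = Counter({"thanks": 5, "the": 400, "much": 60})
z = log_odds_dirichlet(gratitude, others, gratitude + others)
```

Tokens with z > 3, like "thanks" here, are the significantly associated words reported in Table 3.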

5 Modeling

We present a strong baseline emotion prediction model for GoEmotions.

5.1 Data Preparation

To minimize the noise in our data, we filter out emotion labels selected by only a single annotator. We keep examples with at least one label after this filtering is performed; this amounts to 93% of the original data. We randomly split this data into train (80%), dev (10%) and test (10%) sets. We only evaluate on the test set once the model is finalized.
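The preparation step above can be sketched as: drop labels chosen by fewer than two annotators, keep examples that still have a label, then split 80/10/10. A toy version (the data format and function name are ours, not the released dataset's schema):

```python
import random

def prepare(examples, seed=42):
    """Filter out single-annotator labels, then split 80/10/10
    (a sketch of the data preparation described above)."""
    filtered = []
    for ex in examples:
        votes = {}
        for rater_labels in ex["ratings"]:
            for label in rater_labels:
                votes[label] = votes.get(label, 0) + 1
        labels = [l for l, v in votes.items() if v >= 2]  # 2+ annotators agree
        if labels:  # keep only examples that still have a label
            filtered.append({"text": ex["text"], "labels": labels})
    random.Random(seed).shuffle(filtered)
    n = len(filtered)
    train = filtered[: int(0.8 * n)]
    dev = filtered[int(0.8 * n): int(0.9 * n)]
    test = filtered[int(0.9 * n):]
    return train, dev, test

examples = [
    {"text": "Thank you so much!", "ratings": [["gratitude"], ["gratitude"], ["joy"]]},
    {"text": "hmm", "ratings": [["confusion"], ["surprise"], ["neutral"]]},
]
train, dev, test = prepare(examples)
# The second example has no label with 2+ votes, so only one example survives.
```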

Even though we filter our data for the baseline experiments, we see particular value in the 4K examples that lack agreement. This subset of the data likely contains edge/difficult examples for the emotion domain (e.g., emotion-ambiguous text) and presents challenges for further exploration. That is why we release all 58K examples with all annotators' ratings.

[6] https://nlp.stanford.edu/~ddemszky/goemotions/tsne.html

admiration: great (42), awesome (32), amazing (30), good (28), beautiful (23)
amusement: lol (66), haha (32), funny (27), lmao (21), hilarious (18)
approval: agree (24), not (13), don't (12), yes (12), agreed (11)
caring: you (12), worry (11), careful (9), stay (9), your (8)
anger: fuck (24), hate (18), fucking (18), angry (11), dare (10)
annoyance: annoying (14), stupid (13), fucking (12), shit (10), dumb (9)
disappointment: disappointing (11), disappointed (10), bad (9), disappointment (7), unfortunately (7)
disapproval: not (16), don't (14), disagree (9), nope (8), doesn't (7)
confusion: confused (18), why (11), sure (10), what (10), understand (8)
desire: wish (29), want (8), wanted (6), could (6), ambitious (4)
excitement: excited (21), happy (8), cake (8), wow (8), interesting (7)
gratitude: thanks (75), thank (69), for (24), you (18), sharing (17)
joy: happy (32), glad (27), enjoy (20), enjoyed (12), fun (12)
disgust: disgusting (22), awful (14), worst (13), worse (12), weird (9)
embarrassment: embarrassing (12), shame (11), awkward (10), embarrassment (8), embarrassed (7)
fear: scared (16), afraid (16), scary (15), terrible (12), terrifying (11)
grief: died (6), rip (4)
curiosity: curious (22), what (18), why (13), how (11), did (10)
love: love (76), loved (21), favorite (13), loves (12), like (9)
optimism: hope (45), hopefully (19), luck (18), hoping (16), will (8)
pride: proud (14), pride (4), accomplishment (4)
relief: glad (5), relieved (4), relieving (4), relief (4)
nervousness: nervous (8), worried (8), anxiety (6), anxious (4), worrying (4)
remorse: sorry (39), regret (9), apologies (7), apologize (6), guilt (5)
sadness: sad (31), sadly (16), sorry (15), painful (10), crying (9)
realization: realize (14), realized (12), realised (7), realization (6), thought (6)
surprise: wow (23), surprised (21), wonder (15), shocked (12), omg (11)

Table 3: Top 5 words associated with each emotion (grouped by sentiment: positive, negative, ambiguous). The rounded z-scored log odds ratios in the parentheses, with the threshold set at 3, indicate significance of association.

Grouping emotions. We create a hierarchical grouping of our taxonomy, and evaluate the model performance on each level of the hierarchy. A sentiment level divides the labels into 4 categories (positive, negative, ambiguous and Neutral), with the Neutral category intact and the rest of the mapping as shown in Figure 2. The Ekman level further divides the taxonomy, using the Neutral label and the following 6 groups: anger (maps to: anger, annoyance, disapproval), disgust (maps to: disgust), fear (maps to: fear, nervousness), joy (all positive emotions), sadness (maps to: sadness, disappointment, embarrassment, grief, remorse) and surprise (all ambiguous emotions).
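The Ekman-level grouping above can be written as a lookup table. The coarse-to-fine map below follows the text directly (with "all positive emotions" and "all ambiguous emotions" expanded using the sentiment groups shown in Figure 2); the dictionary names are ours:

```python
# Ekman-level grouping of the 27 emotions + neutral, as described above.
EKMAN_MAP = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "joy": ["amusement", "excitement", "joy", "love", "desire", "optimism",
            "caring", "pride", "admiration", "gratitude", "relief", "approval"],
    "sadness": ["sadness", "disappointment", "embarrassment", "grief", "remorse"],
    "surprise": ["realization", "surprise", "curiosity", "confusion"],
    "neutral": ["neutral"],
}

# Invert the map so fine-grained predictions can be coarsened in one lookup.
TO_EKMAN = {fine: coarse for coarse, fines in EKMAN_MAP.items() for fine in fines}
```

A sentiment-level table works the same way, with the positive/negative/ambiguous groups from Figure 2.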

5.2 Model Architecture

We use the BERT-base model (Devlin et al., 2019) for our experiments. We add a dense output layer on top of the pretrained model for the purposes of finetuning, with a sigmoid cross-entropy loss function to support multi-label classification. As an additional baseline, we train a bidirectional LSTM.
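The multi-label setup reduces to an independent sigmoid cross-entropy per emotion label, rather than a single softmax over all labels. A minimal NumPy sketch of that loss in its numerically stable form (this illustrates the loss function only, not the authors' training code; the example logits are invented):

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Per-label sigmoid cross-entropy for multi-label classification,
    in the numerically stable form max(x, 0) - x*z + log(1 + exp(-|x|))."""
    x = np.asarray(logits, dtype=float)
    z = np.asarray(labels, dtype=float)
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

# One example, three emotion labels: the second and third are active,
# so each label contributes its own binary loss term.
logits = np.array([-2.0, 3.0, 0.5])
labels = np.array([0.0, 1.0, 1.0])
losses = sigmoid_cross_entropy(logits, labels)
```

Because each label has its own sigmoid, an example can receive several emotions at once, matching the multi-label annotations.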

53 Parameter SettingsWhen finetuning the pre-trained BERT model wekeep most of the hyperparameters set by Devlinet al (2019) intact and only change the batch sizeand learning rate We find that training for at least4 epochs is necessary for learning the data buttraining for more epochs results in overfitting Wealso find that a small batch size of 16 and learningrate of 5e-5 yields the best performance

For the biLSTM, we set the hidden layer dimensionality to 256, the learning rate to 0.1, with a decay rate of 0.95. We apply a dropout of 0.7.

5.4 Results

Table 4 summarizes the performance of our best model, BERT, on the test set, which achieves an average F1-score of .46 (std=.19). The model obtains the best performance on emotions with overt lexical markers, such as gratitude (.86), amusement (.80) and love (.78). The model obtains the lowest F1-score on grief (0), relief (.15) and realization (.21), which are the lowest frequency emotions. We find that less frequent emotions tend to be confused by the model with more frequent emotions related in sentiment and intensity (e.g. grief with sadness, pride with admiration, nervousness with fear); see Appendix G for a more detailed analysis.

Table 5 and Table 6 show results for a sentiment-grouped model (F1-score = .69) and an Ekman-grouped model (F1-score = .64), respectively. The significant performance increase in the transition from the full to the Ekman-level taxonomy indicates that this grouping mitigates confusion among inner-group, lower-level categories.

The biLSTM model performs significantly worse than BERT, obtaining an average F1-score of .41 for the full taxonomy, .54 for an Ekman-grouped model and .60 for a sentiment-grouped model.

6 Transfer Learning Experiments

We conduct transfer learning experiments on existing emotion benchmarks in order to show our data generalizes across domains and taxonomies. The goal is to demonstrate that, given little labeled data in a target domain, one can utilize GoEmotions as baseline emotion understanding data.

[Figure 3: three panels of F1-score versus training set size (100, 200, 500, 1000, max), for ISEAR (self-reported experiences), Emotion-Stimulus (FrameNet-based sentences) and EmoInt (tweets), each comparing Baseline, Transfer Learning w/ Freezing Layers and Transfer Learning w/o Freezing Layers.]

Figure 3: Transfer learning results in terms of average F1-scores across emotion categories. The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data.

Emotion         Precision  Recall  F1
admiration      0.53       0.83    0.65
amusement       0.70       0.94    0.80
anger           0.36       0.66    0.47
annoyance       0.24       0.63    0.34
approval        0.26       0.57    0.36
caring          0.30       0.56    0.39
confusion       0.24       0.76    0.37
curiosity       0.40       0.84    0.54
desire          0.43       0.59    0.49
disappointment  0.19       0.52    0.28
disapproval     0.29       0.61    0.39
disgust         0.34       0.66    0.45
embarrassment   0.39       0.49    0.43
excitement      0.26       0.52    0.34
fear            0.46       0.85    0.60
gratitude       0.79       0.95    0.86
grief           0.00       0.00    0.00
joy             0.39       0.73    0.51
love            0.68       0.92    0.78
nervousness     0.28       0.48    0.35
neutral         0.56       0.84    0.68
optimism        0.41       0.69    0.51
pride           0.67       0.25    0.36
realization     0.16       0.29    0.21
relief          0.50       0.09    0.15
remorse         0.53       0.88    0.66
sadness         0.38       0.71    0.49
surprise        0.40       0.66    0.50
macro-average   0.40       0.63    0.46
std             0.18       0.24    0.19

Table 4: Results based on GoEmotions taxonomy.
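The per-emotion F1 column is the harmonic mean of precision and recall, which can be checked directly against the rows of Table 4 (e.g. gratitude and relief). Note that the macro-average F1 is the mean of the per-class F1 scores, not the harmonic mean of the macro-averaged precision and recall.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0 (cf. grief)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```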

6.1 Emotion Benchmark Datasets

We consider the nine benchmark datasets from Bostan and Klinger (2018)'s Unified Dataset, which vary in terms of their size, domain, quality and taxonomy. In the interest of space, we only discuss three of these datasets here, chosen based on their diversity of domains. In our experiments, we observe similar trends for the additional benchmarks, and all are included in Appendix H.

Sentiment       Precision  Recall  F1
ambiguous       0.54       0.66    0.60
negative        0.65       0.76    0.70
neutral         0.64       0.69    0.67
positive        0.78       0.87    0.82
macro-average   0.65       0.74    0.69
std             0.09       0.10    0.09

Table 5: Results based on sentiment-grouped data.

Ekman Emotion   Precision  Recall  F1
anger           0.50       0.65    0.57
disgust         0.52       0.53    0.53
fear            0.61       0.76    0.68
joy             0.77       0.88    0.82
neutral         0.66       0.67    0.66
sadness         0.56       0.62    0.59
macro-average   0.60       0.68    0.64
std             0.10       0.12    0.11

Table 6: Results using Ekman's taxonomy.

The International Survey on Emotion Antecedents and Reactions (ISEAR) (Scherer and Wallbott, 1994) is a collection of personal reports on emotional events, written by 3,000 people from different cultural backgrounds. The dataset contains 8k sentences, each labeled with a single emotion. The categories are anger, disgust, fear, guilt, joy, sadness and shame.

EmoInt (Mohammad et al., 2018) is part of the SemEval 2018 benchmark, and it contains crowdsourced annotations for 7k tweets. The labels are intensity annotations for anger, joy, sadness and fear. We obtain binary annotations for these emotions by using .5 as the cutoff.
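The binarization step can be sketched as below; whether the paper's .5 cutoff is inclusive is not stated, so the `>=` comparison here is an assumption.

```python
def binarize_intensities(intensities, cutoff=0.5):
    """Turn per-emotion intensity scores in [0, 1] into a binary label set,
    keeping every emotion at or above the cutoff (inclusiveness assumed)."""
    return {emotion for emotion, score in intensities.items()
            if score >= cutoff}
```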

Emotion-Stimulus (Ghazi et al., 2015) contains annotations for 2.4k sentences generated based on FrameNet's emotion-directed frames. Their taxonomy is anger, disgust, fear, joy, sadness, shame and surprise.

6.2 Experimental Setup

Training set size. We experiment with varying amounts of training data from the target domain dataset, including 100, 200, 500, 1000, and 80% (named "max") of dataset examples. We generate 10 random splits for each train set size, with the remaining examples held as a test set.
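A sketch of this split procedure, assuming plain uniform shuffles (the paper does not specify the sampling mechanics, so the seeding scheme here is illustrative):

```python
import random

def make_splits(examples, train_size, n_splits=10, seed=0):
    """For each of n_splits random shuffles, take `train_size` examples for
    training and hold out the rest as the test set. "max" uses 80% of the
    data, following the setup described above."""
    splits = []
    for i in range(n_splits):
        rng = random.Random(seed + i)  # one reproducible shuffle per split
        shuffled = list(examples)
        rng.shuffle(shuffled)
        k = int(0.8 * len(shuffled)) if train_size == "max" else train_size
        splits.append((shuffled[:k], shuffled[k:]))
    return splits
```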

We report the results of the finetuning experiments detailed below for each data size, with confidence intervals based on repeated experiments using the splits.

Finetuning. We compare three different finetuning setups. In the BASELINE setup, we finetune BERT only on the target dataset. In the FREEZE setup, we first finetune BERT on GoEmotions, then perform transfer learning by replacing the final dense layer, freezing all layers besides the last layer, and finetuning on the target dataset. The NOFREEZE setup is the same as FREEZE, except that we do not freeze the bottom layers. We hold the batch size at 16, learning rate at 2e-5, and number of epochs at 3 for all experiments.
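The three setups differ only in which parameters stay trainable and whether the output head is replaced. The sketch below uses a hypothetical `ToyModel` with two parameter groups as a stand-in for a real BERT encoder plus dense head; it is a conceptual illustration, not the authors' implementation.

```python
class ToyModel:
    """Hypothetical stand-in: an encoder plus a final dense output layer."""
    def __init__(self):
        self.trainable = {"encoder": True, "final_dense": True}

    def replace_head(self, n_labels):
        # Fresh dense layer sized for the target dataset's taxonomy.
        self.n_labels = n_labels

def configure(model, setup, n_target_labels):
    """BASELINE: target-only finetuning, nothing frozen, original head.
    FREEZE: new head after GoEmotions pretraining; only the head trains.
    NOFREEZE: new head, but all layers remain trainable."""
    if setup in ("freeze", "nofreeze"):
        model.replace_head(n_target_labels)
    if setup == "freeze":
        model.trainable["encoder"] = False
    return model
```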

6.3 Results

The results in Figure 3 suggest that our dataset generalizes well to different domains and taxonomies, and that a model using GoEmotions can help in cases where there is limited data from the target domain, or limited resources for labeling.

Given limited target domain data (100 or 200 examples), both FREEZE and NOFREEZE yield significantly higher performance than the BASELINE, for all three datasets. Importantly, NOFREEZE results show significantly higher performance for all training set sizes, except for "max", where NOFREEZE and BASELINE perform similarly.

7 Conclusion

We present GoEmotions, a large, manually annotated, carefully curated dataset for fine-grained emotion prediction. We provide a detailed data analysis, demonstrating the reliability of the annotations for the full taxonomy. We show the generalizability of the data across domains and taxonomies via transfer learning experiments. We build a strong baseline by fine-tuning a BERT model; however, the results suggest much room for future improvement. Future work can explore the cross-cultural robustness of emotion ratings and extend the taxonomy to other languages and domains.

Data Disclaimer. We are aware that the dataset contains biases and is not representative of global diversity. We are aware that the dataset contains potentially problematic content. Potential biases in the data include: inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels, and the fact that annotators were all native English speakers from India. All these likely affect labeling, precision, and recall for a trained model. The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team. Anyone using this dataset should be aware of these limitations of the dataset.

Acknowledgments

We thank the three anonymous reviewers for their constructive feedback. We would also like to thank the annotators for their hard work.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718–728.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579–586, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104–2119.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664–1674.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019a. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90.

Alan S. Cowen, Hillary Anger Elfenbein, Petri Laukka, and Dacher Keltner. 2018. Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6):698–712.

Alan S. Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. In press. What music makes us feel: At least thirteen dimensions organize subjective experiences associated with music across cultures. Proceedings of the National Academy of Sciences.

Alan S. Cowen and Dacher Keltner. 2017. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900–E7909.

Alan S. Cowen and Dacher Keltner. 2019. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist.

Alan S. Cowen, Petri Laukka, Hillary Anger Elfenbein, Runjing Liu, and Dacher Keltner. 2019b. The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4):369.

CrowdFlower. 2016. https://www.figure-eight.com/data/sentiment-analysis-emotion-text.

Rosario Delgado and Xavier-Andoni Tibau. 2019. Why Cohen's kappa should be avoided as performance measure in classification. PLoS ONE, 14(9).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Maeve Duggan and Aaron Smith. 2013. 6% of online adults are reddit users. Pew Internet & American Life Project, 3:1–10.

Paul Ekman. 1992a. Are there basic emotions? Psychological Review, 99(3):550–553.

Paul Ekman. 1992b. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting emotion stimuli in emotion-bearing sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 152–165. Springer.

Chao-Chun Hsu and Lun-Wei Ku. 2018. SocialNLP 2018 EmotionX challenge overview: Recognizing emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 27–31, Melbourne, Australia. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.

Chen Liu, Muhammad Osama, and Anderson De Andrade. 2019. DENS: A dataset for multi-class emotion analysis. arXiv preprint arXiv:1910.11769.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 task 1: Affect in tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 1–17, New Orleans, Louisiana. Association for Computational Linguistics.

Saif M. Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 246–255. Association for Computational Linguistics.

Saif M. Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. 2015. Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480–499.

Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of reddit communities. In Canadian Conference on Artificial Intelligence, pages 51–56. Springer.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Emily Öhman, Kaisla Kajava, Jörg Tiedemann, and Timo Honkela. 2018. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 24–30.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Rosalind W. Picard. 1997. Affective Computing. MIT Press.

Inna Pirina and Çağrı Çöltekin. 2018. Identifying depression on reddit: The effect of training data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, pages 9–12.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In Theories of Emotion, pages 3–33. Elsevier.

James A. Russell. 2003. Core affect and the psychological construction of emotion. Psychological Review, 110(1):145.

Klaus R. Scherer and Harald G. Wallbott. 1994. Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66(2):310.

Hendrik Schuff, Jeremy Barnes, Julian Mohme, Sebastian Padó, and Roman Klinger. 2017. Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 13–23.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, Prague, Czech Republic. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Association for Computational Linguistics.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In EMNLP 2019.

Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P. Sheth. 2012. Harnessing twitter "big data" for automatic emotion identification. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pages 587–592. IEEE.

P. Yanardag, M. Cebrian, and I. Rahwan. 2018. Norman: World's first psychopath AI.

A Emotion Definitions

admiration: Finding something impressive or worthy of respect.
amusement: Finding something funny or being entertained.
anger: A strong feeling of displeasure or antagonism.
annoyance: Mild anger, irritation.
approval: Having or expressing a favorable opinion.
caring: Displaying kindness and concern for others.
confusion: Lack of understanding, uncertainty.
curiosity: A strong desire to know or learn something.
desire: A strong feeling of wanting something or wishing for something to happen.
disappointment: Sadness or displeasure caused by the nonfulfillment of one's hopes or expectations.
disapproval: Having or expressing an unfavorable opinion.
disgust: Revulsion or strong disapproval aroused by something unpleasant or offensive.
embarrassment: Self-consciousness, shame, or awkwardness.
excitement: Feeling of great enthusiasm and eagerness.
fear: Being afraid or worried.
gratitude: A feeling of thankfulness and appreciation.
grief: Intense sorrow, especially caused by someone's death.
joy: A feeling of pleasure and happiness.
love: A strong positive emotion of regard and affection.
nervousness: Apprehension, worry, anxiety.
optimism: Hopefulness and confidence about the future or the success of something.
pride: Pleasure or satisfaction due to one's own achievements or the achievements of those with whom one is closely associated.
realization: Becoming aware of something.
relief: Reassurance and relaxation following release from anxiety or distress.
remorse: Regret or guilty feeling.
sadness: Emotional pain, sorrow.
surprise: Feeling astonished, startled by something unexpected.

B Taxonomy Selection amp Data Collection

We selected our taxonomy through a careful multi-round process. In the first pilot round of data collection, we used emotions that were identified to be salient by Cowen and Keltner (2017), making sure that our set includes Ekman's emotion categories as used in previous NLP work. In this round, we also included an open input box where annotators could suggest emotion(s) that were not among the options. We annotated 3k examples in the first round. We updated the taxonomy based on the results of this round (see details below). In the second pilot round of data collection, we repeated this process with 2k new examples, once again updating the taxonomy.

While reviewing the results from the pilot rounds, we identified and removed emotions that were scarcely selected by annotators and/or had low interrater agreement due to being very similar to other emotions or being too difficult to detect from text. These emotions were boredom, doubt, heartbroken, indifference and calmness. We also identified and added those emotions to our taxonomy that were frequently suggested by raters and/or seemed to be represented in the data upon manual inspection. These emotions were desire, disappointment, pride, realization, relief and remorse. In this process, we also refined the category names (e.g. replacing ecstasy with excitement) to ones that seemed interpretable to annotators. This is how we arrived at the final set of 27 emotions + Neutral. Our high interrater agreement in the final data can be partially explained by the fact that we took interpretability into consideration while constructing the taxonomy. The dataset we are releasing was labeled in the third round, over the final taxonomy.

C Cohen's Kappa Values

In Section 4.1, we measure agreement between raters via Spearman correlation, following considerations by Delgado and Tibau (2019). In Table 7, we report the Cohen's kappa values for comparison, which we obtain by randomly sampling two ratings for each example and calculating the Cohen's kappa between these two sets of ratings. We find that all Cohen's kappa values are greater than 0, showing rater agreement. Moreover, the Cohen's kappa values correlate highly with the interrater correlation values (Pearson r = 0.85, p < 0.001), providing corroborative evidence for the significant degree of interrater agreement for each emotion.
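Cohen's kappa corrects observed agreement for the agreement expected by chance from each rater's label distribution. A minimal sketch over two aligned rating lists:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa between two aligned nominal rating lists:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n      # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if p_e == 1.0:
        return 0.0  # degenerate case: no variation in either rating list
    return (p_o - p_e) / (1 - p_e)
```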

Emotion         Interrater Correlation  Cohen's kappa
admiration      0.535                   0.468
amusement       0.482                   0.474
anger           0.207                   0.307
annoyance       0.193                   0.192
approval        0.385                   0.187
caring          0.237                   0.252
confusion       0.217                   0.270
curiosity       0.418                   0.366
desire          0.177                   0.251
disappointment  0.186                   0.184
disapproval     0.274                   0.234
disgust         0.192                   0.241
embarrassment   0.177                   0.218
excitement      0.193                   0.222
fear            0.266                   0.394
gratitude       0.645                   0.749
grief           0.162                   0.095
joy             0.296                   0.301
love            0.446                   0.555
nervousness     0.164                   0.144
optimism        0.322                   0.300
pride           0.163                   0.148
realization     0.194                   0.155
relief          0.172                   0.185
remorse         0.178                   0.358
sadness         0.346                   0.336
surprise        0.275                   0.331

Table 7: Interrater agreement, as measured by interrater correlation and Cohen's kappa.

D Sentiment of Reddit Subreddits

In Section 3, we describe how we obtain subreddits that are balanced in terms of sentiment. Here, we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28%, STD=11%), positive (M=41%, STD=11%), negative (M=19%, STD=7%), ambiguous (M=35%, STD=8%). After filtering, the distribution of sentiments across our remaining subreddits became: neutral (M=24%, STD=5%), positive (M=35%, STD=6%), negative (M=27%, STD=4%), ambiguous (M=33%, STD=4%).

E BERT's Most Activated Layers

To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with center of gravity = 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who have also found that tasks involving high-level semantics tend to make use of all BERT layers.
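The center of gravity is the expected layer index under the softmax-normalized mixing weights: sum over layers l of l * softmax(w)_l. A sketch:

```python
import math

def center_of_gravity(mixing_weights):
    """Expected layer index under softmax-normalized scalar mixing weights
    (Tenney et al., 2019). For 13 uniformly weighted layers (0..12) this is
    exactly 6.0, close to the 6.19 reported above."""
    exps = [math.exp(w) for w in mixing_weights]
    z = sum(exps)
    return sum(layer * e / z for layer, e in enumerate(exps))
```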

Figure 4: Softmax weights of each BERT layer when trained on our dataset.

F Number of Emotion Labels Per Example

Figure 5 shows the number of emotion labels per example, before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.

Figure 5: Number of emotion labels per example, before and after filtering the labels chosen by only a single annotator.

G Confusion Matrix

Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix similarly to how we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M where M_{i,j} denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the count for M_{joy,joy}, M_{joy,pride}, M_{admiration,joy} and M_{admiration,pride}. In practice, since most of our examples only have a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.

Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g. grief and sadness, pride and admiration, nervousness and fear).

We also perform hierarchical clustering over the normalized confusion matrix, using correlation as a distance metric and ward as a linkage method. We find that the model learns relatively similar clusters as the ones in Figure 2, even though the training data only includes the subset of labels that have agreement (see Figure 5).

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets that are downloadable and have categorical emotions in the Unified Dataset (Bostan and Klinger, 2018). These datasets are: DailyDialog (Li et al., 2017), Emotion-Stimulus (Ghazi et al., 2015), Affective Text (Strapparava and Mihalcea, 2007), CrowdFlower (CrowdFlower, 2016), Electoral Tweets (Mohammad et al., 2015), ISEAR (Scherer and Wallbott, 1994), the Twitter Emotion Corpus (TEC) (Mohammad, 2012), EmoInt (Mohammad et al., 2018), and the Stance Sentiment Emotion Corpus (SSEC) (Schuff et al., 2017).

We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ~4k labeled examples with a large taxonomy of 36 emotions, FREEZE gives a significant boost of performance over the BASELINE and NOFREEZE, for all training set sizes besides "max".

For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple hundred training examples. For 500-1000 training examples, NOFREEZE tends to outperform the BASELINE, but we can see that these two setups come closer when there is more training data available. These results suggest that our dataset helps if there is limited data from the target domain.

[Figure 6 heatmap: true labels on the rows, predicted labels on the columns, with a "Proportion of Predictions" color scale from 0.00 to 0.75; labels are ordered by sentiment (positive, negative, ambiguous).]

Figure 6: A normalized confusion matrix for our model predictions. The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment.

[Figure 7: nine panels of F1-score versus training set size (100, 200, 500, 1000, max), for DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), ElectoralTweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets) and SSEC (tweets), each comparing Baseline, Transfer Learning w/ Freezing Layers and Transfer Learning w/o Freezing Layers.]

Figure 7: Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018).


conversation understanding tasks that require a subtle understanding of emotion expression, such as the analysis of customer feedback or the enhancement of chatbots.

We include a thorough analysis of the annotated data and the quality of the annotations. Via Principal Preserved Component Analysis (Cowen et al., 2019b), we show strong support for reliable dissociation among all 27 emotion categories, indicating the suitability of our annotations for building an emotion classification model.

We perform hierarchical clustering on the emotion judgments, finding that emotions related in intensity cluster together closely and that the top-level clusters correspond to sentiment categories. These relations among emotions allow for their potential grouping into higher-level categories, if desired for a downstream task.

We provide a strong baseline for modeling fine-grained emotion classification over GoEmotions. By fine-tuning a BERT-base model (Devlin et al., 2019), we achieve an average F1-score of .46 over our taxonomy, .64 over an Ekman-style grouping into six coarse categories, and .69 over a sentiment grouping. These results leave much room for improvement, showcasing that this task is not yet fully addressed by current state-of-the-art NLU models.

We conduct transfer learning experiments with existing emotion benchmarks to show that our data can generalize to different taxonomies and domains, such as tweets and personal narratives. Our experiments demonstrate that, given limited resources to label additional emotion classification data for specialized domains, our data can provide baseline emotion understanding and contribute to increasing model accuracy for the target domain.

2 Related Work

2.1 Emotion Datasets

Ever since Affective Text (Strapparava and Mihalcea, 2007), the first benchmark for emotion recognition, was introduced, the field has seen several emotion datasets that vary in size, domain and taxonomy (cf. Bostan and Klinger, 2018). The majority of emotion datasets are constructed manually, but tend to be relatively small. The largest manually labeled dataset is CrowdFlower (2016), with 39k labeled examples, which were found by Bostan and Klinger (2018) to be noisy in comparison with other emotion datasets. Other datasets are automatically weakly-labeled, based on emotion-related hashtags on Twitter (Wang et al., 2012; Abdul-Mageed and Ungar, 2017). We build our dataset manually, making it the largest human annotated dataset, with multiple annotations per example for quality assurance.

Several existing datasets come from the domain of Twitter, given its informal language and expressive content, such as emojis and hashtags. Other datasets annotate news headlines (Strapparava and Mihalcea, 2007), dialogs (Li et al., 2017), fairy-tales (Alm et al., 2005), movie subtitles (Öhman et al., 2018), sentences based on FrameNet (Ghazi et al., 2015), or self-reported experiences (Scherer and Wallbott, 1994), among other domains. We are the first to build on Reddit comments for emotion prediction.

2.2 Emotion Taxonomy

One of the main aspects distinguishing our dataset is its emotion taxonomy. The vast majority of existing datasets contain annotations for minor variations of the 6 basic emotion categories (joy, anger, fear, sadness, disgust, and surprise) proposed by Ekman (1992a) and/or along affective dimensions (valence and arousal) that underpin the circumplex model of affect (Russell, 2003; Buechel and Hahn, 2017).

Recent advances in psychology have offered new conceptual and methodological approaches to capturing the more complex "semantic space" of emotion (Cowen et al., 2019a), by studying the distribution of emotion responses to a diverse array of stimuli via computational techniques. Studies guided by these principles have identified 27 distinct varieties of emotional experience conveyed by short videos (Cowen and Keltner, 2017), 13 by music (Cowen et al., in press), 28 by facial expression (Cowen and Keltner, 2019), 12 by speech prosody (Cowen et al., 2019b), and 24 by nonverbal vocalization (Cowen et al., 2018). In this work, we build on these methods and findings to devise our granular taxonomy for text-based emotion recognition, and study the dimensionality of language-based emotion space.

2.3 Emotion Classification Models

Both feature-based and neural models have been used to build automatic emotion classification models. Feature-based models often make use of hand-built lexicons, such as the Valence Arousal Dominance Lexicon (Mohammad, 2018). BERT (Devlin et al., 2019), a transformer-based model with language model pre-training, has recently been shown to reach state-of-the-art performance on several NLP tasks, including emotion prediction: the top-performing models in the EmotionX Challenge (Hsu and Ku, 2018) all employed a pre-trained BERT model. We also use the BERT model in our experiments, and we find that it outperforms our biLSTM model.

3 GoEmotions

Our dataset is composed of 58k Reddit comments, labeled for one or more of 27 emotion(s) or Neutral.

3.1 Selecting & Curating Reddit comments

We use a Reddit data dump originating in the reddit-data-tools project [2], which contains comments from 2005 (the start of Reddit) to January 2019. We select subreddits with at least 10k comments and remove deleted and non-English comments.

Reddit is known for a demographic bias leaning towards young male users (Duggan and Smith, 2013), which is not reflective of a globally diverse population. The platform also introduces a skew towards toxic, offensive language (Mohan et al., 2017). Thus, Reddit content has been used to study depression (Pirina and Coltekin, 2018) and microaggressions (Breitfeller et al., 2019), and Yanardag and Rahwan (2018) have shown the effect of using biased Reddit data by training a "psychopath" bot. To address these concerns and enable building broadly representative emotion models using GoEmotions, we take a series of data curation measures to ensure our data does not reinforce general nor emotion-specific language biases.

We identify harmful comments using pre-defined lists containing offensive/adult, vulgar (mildly offensive profanity), identity, and religion terms (included as supplementary material). These are used for data filtering and masking, as described below. Lists were internally compiled, and we believe they are comprehensive and widely useful for dataset curation; however, they may not be complete.

Reducing profanity. We remove subreddits that are not safe for work3 and those where 10%+ of comments include offensive/adult and vulgar tokens. We remove remaining comments that include offensive/adult tokens. Vulgar comments are preserved, as we believe they are central to learning about negative emotions. The dataset includes the list of filtered tokens.

2 https://github.com/dewarim/reddit-data-tools
3 http://redditlist.com/nsfw

Manual review. We manually review identity comments and remove those offensive towards a particular ethnicity, gender, sexual orientation, or disability, to the best of our judgment.

Length filtering. We apply NLTK's word tokenizer and select comments 3-30 tokens long, including punctuation. To create a relatively balanced distribution of comment length, we perform downsampling, capping by the number of comments with the median token count (12).
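The length filtering and median-capped downsampling just described can be sketched as follows. This is a simplified illustration, not the actual pipeline code: a plain whitespace split stands in for NLTK's word tokenizer, and the function names are our own.

```python
import random

def length_filter(comments, min_len=3, max_len=30, seed=0):
    """Keep comments 3-30 tokens long, then downsample so that no
    token-count bucket exceeds the size of the median-count bucket."""
    # Simple whitespace split as a stand-in for NLTK's word tokenizer.
    kept = [(c, len(c.split())) for c in comments
            if min_len <= len(c.split()) <= max_len]

    # Bucket comments by token count.
    buckets = {}
    for c, n in kept:
        buckets.setdefault(n, []).append(c)

    # Cap every bucket at the size of the median-token-count bucket.
    counts = sorted(buckets)
    median_count = counts[len(counts) // 2]
    cap = len(buckets[median_count])
    rng = random.Random(seed)
    out = []
    for n in counts:
        bucket = buckets[n]
        out.extend(bucket if len(bucket) <= cap else rng.sample(bucket, cap))
    return out
```

The same cap-at-the-median pattern is reused for the emotion and subreddit balancing steps below, with the bucket key swapped for the predicted emotion or the subreddit name.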

Sentiment balancing. We reduce sentiment bias by removing subreddits with little representation of positive, negative, ambiguous, or neutral sentiment. To estimate a comment's sentiment, we run our emotion prediction model, trained on a pilot batch of 22k annotated examples. The mapping of emotions into sentiment categories is found in Figure 2. We exclude subreddits consisting of more than 30% neutral comments or less than 20% of negative, positive, or ambiguous comments.

Emotion balancing. We assign a predicted emotion to each comment using the pilot model described above. Then, we reduce emotion bias by downsampling the weakly-labelled data, capping by the number of comments belonging to the median emotion count.

Subreddit balancing. To avoid over-representation of popular subreddits, we perform downsampling, capping by the median subreddit count.

From the remaining 315k comments (from 482 subreddits), we randomly sample for annotation.

Masking. We mask proper names referring to people with a [NAME] token, using a BERT-based Named Entity Tagger (Tsai et al., 2019). We mask religion terms with a [RELIGION] token. The list of these terms is included with our dataset. Note that raters viewed unmasked comments during rating.

3.2 Taxonomy of Emotions

When creating the taxonomy, we seek to jointly maximize the following objectives:

1. Provide greatest coverage in terms of emotions expressed in our data. To address this, we manually labeled a small subset of the data and ran a pilot task where raters can suggest emotion labels on top of the pre-defined set.

2. Provide greatest coverage in terms of kinds of emotional expression. We consult psychology literature on emotion expression and recognition (Plutchik, 1980; Cowen and Keltner, 2017; Cowen et al., 2019b). Since, to our knowledge, there has not been research that identifies principal categories for emotion recognition in the domain of text (see Section 2.2), we consider those emotions that are identified as basic in other domains (video and speech) and that we can assume to apply to text as well.

3. Limit overlap among emotions and limit the number of emotions. We do not want to include emotions that are too similar, since that makes the annotation task more difficult. Moreover, combining similar labels with high coverage would result in an explosion in annotated labels.

The final set of selected emotions is listed in Table 4 and Figure 1. See Appendix B for more details on our multi-step taxonomy selection procedure.

3.3 Annotation

We assigned three raters to each example. For those examples where no raters agree on at least one emotion label, we assigned two additional raters. All raters are native English speakers from India.4

Instructions. Raters were asked to identify the emotions expressed by the writer of the text, given pre-defined emotion definitions (see Appendix A) and a few example texts for each emotion. Raters were free to select multiple emotions, but were asked to only select those for which they were reasonably confident that the emotion is expressed in the text. If raters were not certain about any emotion being expressed, they were asked to select Neutral. We included a checkbox for raters to indicate if an example was particularly difficult to label, in which case they could select no emotions. We removed all examples for which no emotion was selected.

The rater interface. Reddit comments were presented with no additional metadata (such as the author or subreddit). To help raters navigate the large space of emotions in our taxonomy, they were presented with a table containing all emotion categories, aggregated by sentiment (by the mapping in Figure 2) and by whether that emotion is generally expressed towards something (e.g., disapproval) or is more of an intrinsic feeling (e.g., joy). The instructions highlighted that this separation of categories was by no means clear-cut, but captured general tendencies, and we encouraged raters to ignore the categorization whenever they saw fit. Emotions with a straightforward mapping onto emojis were shown with an emoji in the UI to further ease their interpretation.

4 Cowen et al. (2019b) find that emotion judgments in Indian and US English speakers largely occupy the same dimensions.

Number of examples: 58,009
Number of emotions: 27 + neutral
Number of unique raters: 82
Number of raters / example: 3 or 5
Marked unclear or difficult to label: 1.6%
Number of labels per example: 1: 83%, 2: 15%, 3: 2%, 4+: .2%
Number of examples w/ 2+ raters agreeing on at least 1 label: 54,263 (94%)
Number of examples w/ 3+ raters agreeing on at least 1 label: 17,763 (31%)

Table 2: Summary statistics of our labeled data.

4 Data Analysis

Table 2 shows summary statistics for the data. Most of the examples (83%) have a single emotion label and have at least two raters agreeing on a single label (94%). The Neutral category makes up 26% of all emotion labels; we exclude that category from the following analyses, since we do not consider it to be part of the semantic space of emotions.

Figure 1 shows the distribution of emotion labels. We can see a large disparity in terms of emotion frequencies (e.g., admiration is 30 times more frequent than grief), despite our emotion and sentiment balancing steps taken during data selection. This is expected, given the disparate frequencies of emotions in natural human expression.

4.1 Interrater Correlation

We estimate rater agreement for each emotion via interrater correlation (Delgado and Tibau, 2019).5

For each rater r ∈ R, we calculate the Spearman correlation between r's judgments and the mean of other raters' judgments, for all examples that r rated. We then take the average of these rater-level correlation scores. In Section 4.3, we show that each emotion has significant interrater correlation, after controlling for several potential confounds.

5 We use correlations as opposed to Cohen's kappa (Cohen, 1960) because the former is a more interpretable metric, and it is also more suitable for measuring agreement among a variable number of raters rating different examples. In Appendix C, we report Cohen's kappa values as well, which correlate highly with the values obtained from interrater correlation (Pearson r = 0.85, p < 0.001).

Figure 1: Our emotion categories, ordered by the number of examples where at least one rater uses a particular label. The color indicates the interrater correlation.

Figure 1 shows that gratitude, admiration, and amusement have the highest, and grief and nervousness have the lowest, interrater correlation. Emotion frequency correlates with interrater agreement, but the two are not equivalent: infrequent emotions can have relatively high interrater correlation (e.g., fear), and frequent emotions can have relatively low interrater correlation (e.g., annoyance).
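The rater-level agreement computation can be illustrated in plain Python. This is a simplified sketch: here every rater scores the same example set, whereas in the actual data each rater covers only a subset of examples.

```python
def ranks(xs):
    """Fractional ranks for Spearman correlation (ties get the average rank)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx = my = (n + 1) / 2
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def interrater_correlation(judgments):
    """judgments: dict rater -> list of scores over a shared example set.
    Correlate each rater against the mean of the others, then average."""
    raters = list(judgments)
    scores = []
    for r in raters:
        others = [judgments[o] for o in raters if o != r]
        mean_others = [sum(col) / len(col) for col in zip(*others)]
        scores.append(spearman(judgments[r], mean_others))
    return sum(scores) / len(scores)
```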

4.2 Correlation Among Emotions

To better understand the relationship between emotions in our data, we look at their correlations. Let N be the number of examples in our dataset. We obtain N-dimensional vectors for each emotion by averaging raters' judgments for all examples labeled with that emotion. We calculate Pearson correlation values between each pair of emotions. The heatmap in Figure 2 shows that emotions that are related in intensity (e.g., annoyance and anger, joy and excitement, nervousness and fear) have a strong positive correlation. On the other hand, emotions that have the opposite sentiment are negatively correlated.
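Building the per-emotion judgment vectors and correlating them can be sketched as follows. This is a toy illustration under our own simplified data layout (each example as a list of per-rater label sets), not the analysis code.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

def emotion_correlations(ratings, emotions):
    """ratings: one entry per example, each a list of per-rater label sets.
    Build one vector per emotion (fraction of raters choosing it on each
    example), then correlate every pair of emotions."""
    vectors = {
        e: [sum(e in r for r in ex) / len(ex) for ex in ratings]
        for e in emotions
    }
    return {
        (a, b): pearson(vectors[a], vectors[b])
        for a in emotions for b in emotions
    }
```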


Figure 2: The heatmap shows the correlation between ratings for each emotion. The dendrogram represents a hierarchical clustering of the ratings. The sentiment labeling was done a priori, and it shows that the clusters closely map onto sentiment groups.

We also perform hierarchical clustering to uncover the nested structure of our taxonomy. We use correlation as a distance metric and ward as a linkage method, applied to the averaged ratings. The dendrogram on the top of Figure 2 shows that emotions that are related by intensity are neighbors, and that larger clusters map closely onto sentiment categories. Interestingly, emotions that we labeled as "ambiguous" in terms of sentiment (e.g., surprise) are closer to the positive than to the negative category. This suggests that, in our data, ambiguous emotions are more likely to occur in the context of positive sentiment than that of negative sentiment.
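A minimal sketch of this correlation-distance ward clustering using SciPy is shown below; the ratings matrix and emotion names are stand-ins for the real averaged judgments, and the function is our own illustration rather than the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_emotions(avg_ratings, emotions, n_clusters=2):
    """avg_ratings: (n_examples, n_emotions) matrix of averaged rater
    judgments. Use 1 - Pearson correlation as the pairwise distance
    between emotions and ward as the linkage method."""
    corr = np.corrcoef(avg_ratings.T)           # emotion x emotion
    dist = 1.0 - corr
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)  # condensed pairwise distances
    Z = linkage(condensed, method="ward")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return dict(zip(emotions, labels))
```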

4.3 Principal Preserved Component Analysis

To better understand agreement among raters and the latent structure of the emotion space, we apply Principal Preserved Component Analysis (PPCA) (Cowen et al., 2019b) to our data. PPCA extracts linear combinations of attributes (here, emotion judgments) that maximally covary across two sets of data that measure the same attributes (here, randomly split judgments for each example). Thus, PPCA allows us to uncover latent dimensions of emotion that have high agreement across raters.

Unlike Principal Component Analysis (PCA), PPCA examines the cross-covariance between datasets, rather than the variance-covariance matrix within a single dataset. We obtain the principal preserved components (PPCs) of two datasets (matrices) X, Y ∈ R^(N×|E|), where N is the number of examples and |E| is the number of emotions, by calculating the eigenvectors of the symmetrized cross-covariance matrix X^T Y + Y^T X.

Algorithm 1: Leave-One-Rater-Out PPCA
1:  R ← set of raters
2:  E ← set of emotions
3:  C ∈ R^(|R|×|E|)
4:  for all raters r ∈ 1, ..., |R| do
5:      n ← number of examples annotated by r
6:      J ∈ R^(n×|R|×|E|) ← all ratings for the examples annotated by r
7:      J−r ∈ R^(n×(|R|−1)×|E|) ← all ratings in J, excluding r
8:      Jr ∈ R^(n×|E|) ← all ratings by r
9:      X, Y ∈ R^(n×|E|) ← randomly split J−r and average ratings across raters for both sets
10:     W ∈ R^(|E|×|E|) ← result of PPCA(X, Y)
11:     for all components† w_i, i ∈ 1, ..., |E| in W do
12:         v_i^r ← projection‡ of Jr onto w_i
13:         v_i^−r ← projection‡ of J−r onto w_i
14:         C_{r,i} ← correlation between v_i^r and v_i^−r, partialing out v_k^−r for all k ∈ 1, ..., i−1
15:     end for
16: end for
17: C′ ← Wilcoxon signed-rank test on C
18: C″ ← Bonferroni correction on C′ (α = 0.05)
† in descending order of eigenvalue
‡ we demean vectors before projection

Extracting significant dimensions. We remove examples labeled as Neutral, and keep those examples that still have at least 3 ratings after this filtering step. We then determine the number of significant dimensions using a leave-one-rater-out analysis, as described by Algorithm 1.

We find that all 27 PPCs are highly significant. Specifically, Bonferroni-corrected p-values are less than 1.5e-6 for all dimensions (corrected α = 0.0017), suggesting that the emotions were highly dissociable. Such a high degree of significance for all dimensions is nontrivial: for example, Cowen et al. (2019b) find that only 12 out of their 30 emotion categories are significantly dissociable.

t-SNE projection. To better understand how the examples are organized in the emotion space, we apply t-SNE, a dimension reduction method that seeks to preserve distances between data points, using the scikit-learn package (Pedregosa et al., 2011). The dataset can be explored in our interactive plot,6 where one can also look at the texts and the annotations. The color of each data point is the weighted average of the RGB values representing those emotions that at least half of the raters selected.

4.4 Linguistic Correlates of Emotions

We extract the lexical correlates of each emotion by calculating the log odds ratio, informative Dirichlet prior (Monroe et al., 2008) of all tokens for each emotion category, contrasting to all other emotions. Since the log odds are z-scored, all values greater than 3 indicate highly significant (>3 std) association with the corresponding emotion. We list the top 5 tokens for each category in Table 3. We find that those emotions that are highly significantly associated with certain tokens (e.g., gratitude with "thanks", amusement with "lol") tend to have the highest interrater correlation (see Figure 1). Conversely, emotions that have fewer significantly associated tokens (e.g., grief and nervousness) tend to have low interrater correlation. These results suggest that certain emotions are more verbally implicit and may require more context to be interpreted.
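The z-scored log odds ratio with informative Dirichlet prior can be computed as below. This is a minimal sketch of the Monroe et al. (2008) estimator; the count dictionaries in the usage example are invented for illustration.

```python
import math

def log_odds_ratio(counts_i, counts_j, prior):
    """z-scored log odds ratio with informative Dirichlet prior
    (Monroe et al., 2008). counts_i: token counts for the target
    emotion; counts_j: counts for all other emotions; prior: background
    token counts used as the Dirichlet prior."""
    a0 = sum(prior.values())
    n_i, n_j = sum(counts_i.values()), sum(counts_j.values())
    z = {}
    for w, a_w in prior.items():
        y_iw, y_jw = counts_i.get(w, 0), counts_j.get(w, 0)
        # Difference of log odds, with pseudo-counts from the prior.
        delta = (math.log((y_iw + a_w) / (n_i + a0 - y_iw - a_w))
                 - math.log((y_jw + a_w) / (n_j + a0 - y_jw - a_w)))
        # Approximate variance of the estimate.
        var = 1.0 / (y_iw + a_w) + 1.0 / (y_jw + a_w)
        z[w] = delta / math.sqrt(var)
    return z
```

A token that is sharply over-represented in one emotion's comments (e.g., "thanks" for gratitude) ends up with z > 3, the significance threshold used in Table 3.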

5 Modeling

We present a strong baseline emotion prediction model for GoEmotions.

5.1 Data Preparation

To minimize the noise in our data, we filter out emotion labels selected by only a single annotator. We keep examples with at least one label after this filtering is performed; this amounts to 93% of the original data. We randomly split this data into train (80%), dev (10%), and test (10%) sets. We only evaluate on the test set once the model is finalized.
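The label-filtering step can be sketched as follows (a simplified illustration of the majority-vote filter, using our own data layout of one label set per rater):

```python
from collections import Counter

def filter_single_votes(examples):
    """Drop emotion labels chosen by only one annotator; keep examples
    that still have at least one label. Each example is a list of
    per-rater label sets."""
    kept = []
    for ratings in examples:
        votes = Counter(label for r in ratings for label in r)
        labels = {l for l, c in votes.items() if c >= 2}
        if labels:
            kept.append(labels)
    return kept
```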

Even though we filter our data for the baseline experiments, we see particular value in the 4K examples that lack agreement. This subset of the data likely contains edge/difficult examples for the emotion domain (e.g., emotion-ambiguous text) and presents challenges for further exploration. That is why we release all 58K examples with all annotators' ratings.

6 https://nlp.stanford.edu/~ddemszky/goemotions/tsne.html

Positive:
admiration: great (42), awesome (32), amazing (30), good (28), beautiful (23)
amusement: lol (66), haha (32), funny (27), lmao (21), hilarious (18)
approval: agree (24), not (13), don't (12), yes (12), agreed (11)
caring: you (12), worry (11), careful (9), stay (9), your (8)
desire: wish (29), want (8), wanted (6), could (6), ambitious (4)
excitement: excited (21), happy (8), cake (8), wow (8), interesting (7)
gratitude: thanks (75), thank (69), for (24), you (18), sharing (17)
joy: happy (32), glad (27), enjoy (20), enjoyed (12), fun (12)
love: love (76), loved (21), favorite (13), loves (12), like (9)
optimism: hope (45), hopefully (19), luck (18), hoping (16), will (8)
pride: proud (14), pride (4), accomplishment (4)
relief: glad (5), relieved (4), relieving (4), relief (4)

Negative:
anger: fuck (24), hate (18), fucking (18), angry (11), dare (10)
annoyance: annoying (14), stupid (13), fucking (12), shit (10), dumb (9)
disappointment: disappointing (11), disappointed (10), bad (9), disappointment (7), unfortunately (7)
disapproval: not (16), don't (14), disagree (9), nope (8), doesn't (7)
disgust: disgusting (22), awful (14), worst (13), worse (12), weird (9)
embarrassment: embarrassing (12), shame (11), awkward (10), embarrassment (8), embarrassed (7)
fear: scared (16), afraid (16), scary (15), terrible (12), terrifying (11)
grief: died (6), rip (4)
nervousness: nervous (8), worried (8), anxiety (6), anxious (4), worrying (4)
remorse: sorry (39), regret (9), apologies (7), apologize (6), guilt (5)
sadness: sad (31), sadly (16), sorry (15), painful (10), crying (9)

Ambiguous:
confusion: confused (18), why (11), sure (10), what (10), understand (8)
curiosity: curious (22), what (18), why (13), how (11), did (10)
realization: realize (14), realized (12), realised (7), realization (6), thought (6)
surprise: wow (23), surprised (21), wonder (15), shocked (12), omg (11)

Table 3: Top 5 words associated with each emotion, grouped by sentiment (positive, negative, ambiguous). The rounded z-scored log odds ratios in the parentheses, with the threshold set at 3, indicate significance of association.

Grouping emotions. We create a hierarchical grouping of our taxonomy, and evaluate the model performance on each level of the hierarchy. A sentiment level divides the labels into 4 categories (positive, negative, ambiguous, and Neutral), with the Neutral category intact and the rest of the mapping as shown in Figure 2. The Ekman level further divides the taxonomy, using the Neutral label and the following 6 groups: anger (maps to: anger, annoyance, disapproval), disgust (maps to: disgust), fear (maps to: fear, nervousness), joy (all positive emotions), sadness (maps to: sadness, disappointment, embarrassment, grief, remorse), and surprise (all ambiguous emotions).
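The Ekman-level grouping just described can be encoded as a simple lookup, with the positive and ambiguous emotion lists taken from the sentiment mapping in Figure 2 (a sketch; the label strings and function name are our own):

```python
POSITIVE = ["amusement", "excitement", "joy", "love", "desire", "optimism",
            "caring", "pride", "admiration", "gratitude", "relief", "approval"]
AMBIGUOUS = ["realization", "surprise", "curiosity", "confusion"]

# Ekman-level groups, as described in the text.
EKMAN = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "joy": POSITIVE,        # all positive emotions
    "sadness": ["sadness", "disappointment", "embarrassment",
                "grief", "remorse"],
    "surprise": AMBIGUOUS,  # all ambiguous emotions
}

def to_ekman(labels):
    """Map a set of fine-grained labels to the Ekman level;
    Neutral is kept intact."""
    inverse = {fine: coarse for coarse, fines in EKMAN.items()
               for fine in fines}
    inverse["neutral"] = "neutral"
    return {inverse[l] for l in labels}
```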

5.2 Model Architecture

We use the BERT-base model (Devlin et al., 2019) for our experiments. We add a dense output layer on top of the pretrained model for the purposes of finetuning, with a sigmoid cross entropy loss function to support multi-label classification. As an additional baseline, we train a bidirectional LSTM.
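The sigmoid cross-entropy objective treats each emotion as an independent binary decision, so several labels can be active at once (unlike softmax, which forces a single winner). It can be written in plain Python as a didactic sketch; this is not the actual training code.

```python
import math

def sigmoid_cross_entropy(logits, targets):
    """Multi-label loss: an independent sigmoid + binary cross-entropy
    term per emotion, summed over emotions."""
    loss = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))       # per-label probability
        loss -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return loss
```

A confident, correct prediction (large positive logit on a gold label, large negative elsewhere) drives the loss toward zero; confidently wrong logits are penalized heavily.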

5.3 Parameter Settings

When finetuning the pre-trained BERT model, we keep most of the hyperparameters set by Devlin et al. (2019) intact, and only change the batch size and learning rate. We find that training for at least 4 epochs is necessary for learning the data, but training for more epochs results in overfitting. We also find that a small batch size of 16 and a learning rate of 5e-5 yields the best performance.

For the biLSTM, we set the hidden layer dimensionality to 256, the learning rate to 0.1, with a decay rate of 0.95. We apply a dropout of 0.7.

5.4 Results

Table 4 summarizes the performance of our best model, BERT, on the test set, which achieves an average F1-score of .46 (std = .19). The model obtains the best performance on emotions with overt lexical markers, such as gratitude (.86), amusement (.80), and love (.78). The model obtains the lowest F1-score on grief (0), relief (.15), and realization (.21), which are the lowest frequency emotions. We find that less frequent emotions tend to be confused by the model with more frequent emotions related in sentiment and intensity (e.g., grief with sadness, pride with admiration, nervousness with fear); see Appendix G for a more detailed analysis.

Table 5 and Table 6 show results for a sentiment-grouped model (F1-score = .69) and an Ekman-grouped model (F1-score = .64), respectively. The significant performance increase in the transition from the full to the Ekman-level taxonomy indicates that this grouping mitigates confusion among inner-group lower-level categories.

The biLSTM model performs significantly worse than BERT, obtaining an average F1-score of .41 for the full taxonomy, .54 for an Ekman-grouped model, and .6 for a sentiment-grouped model.

6 Transfer Learning Experiments

We conduct transfer learning experiments on existing emotion benchmarks, in order to show our data generalizes across domains and taxonomies. The goal is to demonstrate that, given little labeled data in a target domain, one can utilize GoEmotions as baseline emotion understanding data.

[Figure 3: three panels, one per target dataset: ISEAR (self-reported experiences), Emotion-Stimulus (FrameNet-based sentences), and EmoInt (tweets). Each panel plots average F1-score against training set size (100, 200, 500, 1000, max) for three setups: Baseline, Transfer Learning w/ Freezing Layers, and Transfer Learning w/o Freezing Layers.]

Figure 3: Transfer learning results in terms of average F1-scores across emotion categories. The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data.

Emotion | Precision | Recall | F1
admiration | 0.53 | 0.83 | 0.65
amusement | 0.70 | 0.94 | 0.80
anger | 0.36 | 0.66 | 0.47
annoyance | 0.24 | 0.63 | 0.34
approval | 0.26 | 0.57 | 0.36
caring | 0.30 | 0.56 | 0.39
confusion | 0.24 | 0.76 | 0.37
curiosity | 0.40 | 0.84 | 0.54
desire | 0.43 | 0.59 | 0.49
disappointment | 0.19 | 0.52 | 0.28
disapproval | 0.29 | 0.61 | 0.39
disgust | 0.34 | 0.66 | 0.45
embarrassment | 0.39 | 0.49 | 0.43
excitement | 0.26 | 0.52 | 0.34
fear | 0.46 | 0.85 | 0.60
gratitude | 0.79 | 0.95 | 0.86
grief | 0.00 | 0.00 | 0.00
joy | 0.39 | 0.73 | 0.51
love | 0.68 | 0.92 | 0.78
nervousness | 0.28 | 0.48 | 0.35
neutral | 0.56 | 0.84 | 0.68
optimism | 0.41 | 0.69 | 0.51
pride | 0.67 | 0.25 | 0.36
realization | 0.16 | 0.29 | 0.21
relief | 0.50 | 0.09 | 0.15
remorse | 0.53 | 0.88 | 0.66
sadness | 0.38 | 0.71 | 0.49
surprise | 0.40 | 0.66 | 0.50
macro-average | 0.40 | 0.63 | 0.46
std | 0.18 | 0.24 | 0.19

Table 4: Results based on GoEmotions taxonomy.

6.1 Emotion Benchmark Datasets

We consider the nine benchmark datasets from Bostan and Klinger (2018)'s Unified Dataset, which vary in terms of their size, domain, quality, and taxonomy. In the interest of space, we only discuss three of these datasets here, chosen based on their diversity of domains. In our experiments, we observe similar trends for the additional benchmarks, and all are included in Appendix H.

Sentiment | Precision | Recall | F1
ambiguous | 0.54 | 0.66 | 0.60
negative | 0.65 | 0.76 | 0.70
neutral | 0.64 | 0.69 | 0.67
positive | 0.78 | 0.87 | 0.82
macro-average | 0.65 | 0.74 | 0.69
std | 0.09 | 0.10 | 0.09

Table 5: Results based on sentiment-grouped data.

Ekman Emotion | Precision | Recall | F1
anger | 0.50 | 0.65 | 0.57
disgust | 0.52 | 0.53 | 0.53
fear | 0.61 | 0.76 | 0.68
joy | 0.77 | 0.88 | 0.82
neutral | 0.66 | 0.67 | 0.66
sadness | 0.56 | 0.62 | 0.59
macro-average | 0.60 | 0.68 | 0.64
std | 0.10 | 0.12 | 0.11

Table 6: Results using Ekman's taxonomy.

The International Survey on Emotion Antecedents and Reactions (ISEAR) (Scherer and Wallbott, 1994) is a collection of personal reports on emotional events, written by 3,000 people from different cultural backgrounds. The dataset contains 8k sentences, each labeled with a single emotion. The categories are anger, disgust, fear, guilt, joy, sadness, and shame.

EmoInt (Mohammad et al., 2018) is part of the SemEval 2018 benchmark, and it contains crowdsourced annotations for 7k tweets. The labels are intensity annotations for anger, joy, sadness, and fear. We obtain binary annotations for these emotions by using .5 as the cutoff.

Emotion-Stimulus (Ghazi et al., 2015) contains annotations for 2.4k sentences, generated based on FrameNet's emotion-directed frames. Their taxonomy is anger, disgust, fear, joy, sadness, shame, and surprise.

6.2 Experimental Setup

Training set size. We experiment with varying amounts of training data from the target domain dataset, including 100, 200, 500, 1000, and 80% (named "max") of dataset examples. We generate 10 random splits for each train set size, with the remaining examples held as a test set.

We report the results of the finetuning experiments detailed below for each data size, with confidence intervals based on repeated experiments using the splits.

Finetuning. We compare three different finetuning setups. In the BASELINE setup, we finetune BERT only on the target dataset. In the FREEZE setup, we first finetune BERT on GoEmotions, then perform transfer learning by replacing the final dense layer, freezing all layers besides the last layer, and finetuning on the target dataset. The NOFREEZE setup is the same as FREEZE, except that we do not freeze the bottom layers. We hold the batch size at 16, the learning rate at 2e-5, and the number of epochs at 3 for all experiments.

6.3 Results

The results in Figure 3 suggest that our dataset generalizes well to different domains and taxonomies, and that a model finetuned on GoEmotions can help in cases when there is limited data from the target domain, or limited resources for labeling.

Given limited target domain data (100 or 200 examples), both FREEZE and NOFREEZE yield significantly higher performance than the BASELINE for all three datasets. Importantly, NOFREEZE results show significantly higher performance for all training set sizes, except for "max", where NOFREEZE and BASELINE perform similarly.

7 Conclusion

We present GoEmotions, a large, manually annotated, carefully curated dataset for fine-grained emotion prediction. We provide a detailed data analysis, demonstrating the reliability of the annotations for the full taxonomy. We show the generalizability of the data across domains and taxonomies via transfer learning experiments. We build a strong baseline by fine-tuning a BERT model; however, the results suggest much room for future improvement. Future work can explore the cross-cultural robustness of emotion ratings and extend the taxonomy to other languages and domains.

Data Disclaimer. We are aware that the dataset contains biases and is not representative of global diversity. We are aware that the dataset contains potentially problematic content. Potential biases in the data include: inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels, and the fact that annotators were all native English speakers from India. All these likely affect labeling, precision, and recall for a trained model. The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team. Anyone using this dataset should be aware of these limitations of the dataset.

Acknowledgments

We thank the three anonymous reviewers for their constructive feedback. We would also like to thank the annotators for their hard work.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718–728.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579–586, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104–2119.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664–1674.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019a. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90.

Alan S. Cowen, Hillary Anger Elfenbein, Petri Laukka, and Dacher Keltner. 2018. Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6):698–712.

Alan S. Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. In press. What music makes us feel: At least thirteen dimensions organize subjective experiences associated with music across cultures. Proceedings of the National Academy of Sciences.

Alan S. Cowen and Dacher Keltner. 2017. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900–E7909.

Alan S. Cowen and Dacher Keltner. 2019. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist.

Alan S. Cowen, Petri Laukka, Hillary Anger Elfenbein, Runjing Liu, and Dacher Keltner. 2019b. The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4):369.

CrowdFlower. 2016. https://www.figure-eight.com/data/sentiment-analysis-emotion-text/.

Rosario Delgado and Xavier-Andoni Tibau. 2019. Why Cohen's kappa should be avoided as performance measure in classification. PLoS ONE, 14(9).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Maeve Duggan and Aaron Smith. 2013. 6% of online adults are Reddit users. Pew Internet & American Life Project, 3:1–10.

Paul Ekman. 1992a. Are there basic emotions? Psychological Review, 99(3):550–553.

Paul Ekman. 1992b. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting emotion stimuli in emotion-bearing sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 152–165. Springer.

Chao-Chun Hsu and Lun-Wei Ku. 2018. SocialNLP 2018 EmotionX challenge overview: Recognizing emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 27–31, Melbourne, Australia. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.

Chen Liu, Muhammad Osama, and Anderson De Andrade. 2019. DENS: A dataset for multi-class emotion analysis. arXiv preprint arXiv:1910.11769.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 task 1: Affect in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 1–17, New Orleans, Louisiana. Association for Computational Linguistics.

Saif M. Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation), pages 246–255. Association for Computational Linguistics.

Saif M. Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. 2015. Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480–499.

Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of Reddit communities. In Canadian Conference on Artificial Intelligence, pages 51–56. Springer.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Emily Ohman Kaisla Kajava Jorg Tiedemann andTimo Honkela 2018 Creating a dataset formultilingual fine-grained emotion-detection usinggamification-based annotation In Proceedings ofthe 9th Workshop on Computational Approaches toSubjectivity Sentiment and Social Media Analysispages 24ndash30

F Pedregosa G Varoquaux A Gramfort V MichelB Thirion O Grisel M Blondel P PrettenhoferR Weiss V Dubourg J Vanderplas A PassosD Cournapeau M Brucher M Perrot and E Duch-esnay 2011 Scikit-learn Machine learning inPython Journal of Machine Learning Research122825ndash2830

Matthew E Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep Contextualized Word Rep-resentations In 16th Annual Conference of theNorth American Chapter of the Association for Com-putational Linguistics (NAACL)

Rosalind W Picard 1997 Affective Computing MITPress

Inna Pirina and Cagrı Coltekin 2018 Identifyingdepression on reddit The effect of training dataIn Proceedings of the 2018 EMNLP WorkshopSMM4H The 3rd Social Media Mining for HealthApplications Workshop amp Shared Task pages 9ndash12

Robert Plutchik 1980 A general psychoevolutionarytheory of emotion In Theories of emotion pages3ndash33 Elsevier

James A Russell 2003 Core affect and the psycholog-ical construction of emotion Psychological review110(1)145

Klaus R Scherer and Harald G Wallbott 1994 Evi-dence for Universality and Cultural Variation of Dif-ferential Emotion Response Patterning Journal ofpersonality and social psychology 66(2)310

Hendrik Schuff Jeremy Barnes Julian Mohme Sebas-tian Pado and Roman Klinger 2017 Annotationmodelling and analysis of fine-grained emotions ona stance and sentiment detection corpus In Pro-ceedings of the 8th Workshop on Computational Ap-proaches to Subjectivity Sentiment and Social Me-dia Analysis pages 13ndash23

Carlo Strapparava and Rada Mihalcea 2007 SemEval-2007 task 14 Affective text In Proceedings of theFourth International Workshop on Semantic Evalua-tions (SemEval-2007) pages 70ndash74 Prague CzechRepublic Association for Computational Linguis-tics

Ian Tenney Dipanjan Das and Ellie Pavlick 2019BERT Rediscovers the Classical NLP Pipeline InAssociation for Computational Linguistics

Henry Tsai Jason Riesa Melvin Johnson Naveen Ari-vazhagan Xin Li and Amelia Archer 2019 Smalland Practical BERT Models for Sequence LabelingIn EMNLP 2019

Wenbo Wang Lu Chen Krishnaprasad Thirunarayanand Amit P Sheth 2012 Harnessing twitter lsquobigdatarsquo for automatic emotion identification In 2012International Conference on Privacy Security Riskand Trust and 2012 International Confernece on So-cial Computing pages 587ndash592 IEEE

Cebrian-M Yanardag P and I Rahwan 2018 Nor-man Worlds first psychopath ai

A Emotion Definitions

admiration: Finding something impressive or worthy of respect.
amusement: Finding something funny or being entertained.
anger: A strong feeling of displeasure or antagonism.
annoyance: Mild anger, irritation.
approval: Having or expressing a favorable opinion.
caring: Displaying kindness and concern for others.
confusion: Lack of understanding, uncertainty.
curiosity: A strong desire to know or learn something.
desire: A strong feeling of wanting something or wishing for something to happen.
disappointment: Sadness or displeasure caused by the nonfulfillment of one's hopes or expectations.
disapproval: Having or expressing an unfavorable opinion.
disgust: Revulsion or strong disapproval aroused by something unpleasant or offensive.
embarrassment: Self-consciousness, shame, or awkwardness.
excitement: Feeling of great enthusiasm and eagerness.
fear: Being afraid or worried.
gratitude: A feeling of thankfulness and appreciation.
grief: Intense sorrow, especially caused by someone's death.
joy: A feeling of pleasure and happiness.
love: A strong positive emotion of regard and affection.
nervousness: Apprehension, worry, anxiety.
optimism: Hopefulness and confidence about the future or the success of something.
pride: Pleasure or satisfaction due to one's own achievements or the achievements of those with whom one is closely associated.
realization: Becoming aware of something.
relief: Reassurance and relaxation following release from anxiety or distress.
remorse: Regret or guilty feeling.
sadness: Emotional pain, sorrow.
surprise: Feeling astonished, startled by something unexpected.

B Taxonomy Selection & Data Collection

We selected our taxonomy through a careful multi-round process. In the first pilot round of data collection, we used emotions that were identified to be salient by Cowen and Keltner (2017), making sure that our set includes Ekman's emotion categories, as used in previous NLP work. In this round, we also included an open input box where annotators could suggest emotion(s) that were not among the options. We annotated 3K examples in the first round, and updated the taxonomy based on the results of this round (see details below). In the second pilot round of data collection, we repeated this process with 2k new examples, once again updating the taxonomy.

While reviewing the results from the pilot rounds, we identified and removed emotions that were scarcely selected by annotators and/or had low interrater agreement, due to being very similar to other emotions or too difficult to detect from text. These emotions were boredom, doubt, heartbroken, indifference, and calmness. We also identified and added to our taxonomy those emotions that were frequently suggested by raters and/or seemed to be represented in the data upon manual inspection. These emotions were desire, disappointment, pride, realization, relief, and remorse. In this process, we also refined the category names (e.g., replacing ecstasy with excitement) to ones that seemed interpretable to annotators. This is how we arrived at the final set of 27 emotions + Neutral. Our high interrater agreement in the final data can be partially explained by the fact that we took interpretability into consideration while constructing the taxonomy. The dataset we are releasing was labeled in the third round, over the final taxonomy.

C Cohen's Kappa Values

In Section 4.1, we measure agreement between raters via Spearman correlation, following considerations by Delgado and Tibau (2019). In Table 7, we report Cohen's kappa values for comparison, which we obtain by randomly sampling two ratings for each example and calculating Cohen's kappa between these two sets of ratings. We find that all Cohen's kappa values are greater than 0, showing rater agreement. Moreover, the Cohen's kappa values correlate highly with the interrater correlation values (Pearson r = 0.85, p < 0.001), providing corroborative evidence for the significant degree of interrater agreement for each emotion.
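The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: `cohens_kappa` and `sampled_kappa` are hypothetical helper names, and the kappa computation assumes binary (selected / not selected) ratings per emotion.

```python
import random

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length binary rating vectors."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    p_a1 = sum(a) / n                             # rater A's positive rate
    p_b1 = sum(b) / n                             # rater B's positive rate
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)   # chance agreement
    if p_e == 1.0:                                # degenerate case: no variation
        return 1.0
    return (p_o - p_e) / (1 - p_e)

def sampled_kappa(ratings_per_example, rng):
    """Randomly sample two ratings per example and compute kappa between
    the two resulting sets, mirroring the procedure described above."""
    set_a, set_b = [], []
    for ratings in ratings_per_example:
        r1, r2 = rng.sample(ratings, 2)
        set_a.append(r1)
        set_b.append(r2)
    return cohens_kappa(set_a, set_b)
```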

Emotion          Interrater Correlation   Cohen's kappa
admiration       0.535                    0.468
amusement        0.482                    0.474
anger            0.207                    0.307
annoyance        0.193                    0.192
approval         0.385                    0.187
caring           0.237                    0.252
confusion        0.217                    0.270
curiosity        0.418                    0.366
desire           0.177                    0.251
disappointment   0.186                    0.184
disapproval      0.274                    0.234
disgust          0.192                    0.241
embarrassment    0.177                    0.218
excitement       0.193                    0.222
fear             0.266                    0.394
gratitude        0.645                    0.749
grief            0.162                    0.095
joy              0.296                    0.301
love             0.446                    0.555
nervousness      0.164                    0.144
optimism         0.322                    0.300
pride            0.163                    0.148
realization      0.194                    0.155
relief           0.172                    0.185
remorse          0.178                    0.358
sadness          0.346                    0.336
surprise         0.275                    0.331

Table 7: Interrater agreement as measured by interrater correlation and Cohen's kappa.

D Sentiment of Reddit Subreddits

In Section 3, we describe how we obtain subreddits that are balanced in terms of sentiment. Here, we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28%, STD=11%), positive (M=41%, STD=11%), negative (M=19%, STD=7%), ambiguous (M=35%, STD=8%). After filtering, the distribution of sentiments across our remaining subreddits became: neutral (M=24%, STD=5%), positive (M=35%, STD=6%), negative (M=27%, STD=4%), ambiguous (M=33%, STD=4%).

E BERTrsquos Most Activated Layers

To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with a center of gravity of 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who also found that tasks involving high-level semantics tend to make use of all BERT layers.
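The center-of-gravity statistic is the expected layer index under the softmax of the learned per-layer mixing weights. A minimal sketch, with `center_of_gravity` as our own helper name:

```python
import numpy as np

def center_of_gravity(scalar_mix_weights):
    """Center of gravity of a scalar mix (Tenney et al., 2019): the expected
    layer index under the softmax of the learned per-layer weights."""
    w = np.asarray(scalar_mix_weights, dtype=float)
    p = np.exp(w - w.max())
    p /= p.sum()                    # softmax over layers
    layers = np.arange(len(p))      # layer indices 0 .. L
    return float((layers * p).sum())
```

With uniform weights over 13 layers (BERT-base: embedding layer + 12 transformer layers), the center of gravity sits at the middle, 6.0; the reported 6.19 therefore indicates a roughly uniform mix tilted slightly toward the upper layers.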

Figure 4: Softmax weights of each BERT layer when trained on our dataset (x-axis: layer 0-12; y-axis: softmax weights, 0.00-0.10).

F Number of Emotion Labels Per Example

Figure 5 shows the number of emotion labels per example, before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.

Figure 5: Number of emotion labels per example, before and after filtering the labels chosen by only a single annotator (x-axis: number of labels, 0-7; y-axis: number of examples, 0k-40k).

G Confusion Matrix

Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix similarly to how we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M where M_{i,j} denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the count for M_{joy,joy}, M_{joy,pride}, M_{admiration,joy}, and M_{admiration,pride}. In practice, since most of our examples only have a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.

Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g., grief and sadness, pride and admiration, nervousness and fear).
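The counting and row normalization described above can be sketched as follows; the toy label set and helper names are our own, not the paper's code.

```python
import numpy as np

EMOTIONS = ["joy", "admiration", "pride", "sadness"]  # toy subset for illustration

def multilabel_confusion(true_sets, pred_sets, labels):
    """Co-occurrence-style confusion matrix for multilabel predictions:
    for each true label, increment the count of every predicted label."""
    idx = {lab: i for i, lab in enumerate(labels)}
    M = np.zeros((len(labels), len(labels)))
    for true, pred in zip(true_sets, pred_sets):
        for t in true:
            for p in pred:
                M[idx[t], idx[p]] += 1
    return M

def row_normalize(M):
    """Divide each row by its sum to correct for disparate label frequencies."""
    sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M), where=sums > 0)

# The example from the text: true = {joy, admiration}, pred = {joy, pride}
M = multilabel_confusion([{"joy", "admiration"}], [{"joy", "pride"}], EMOTIONS)
```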

We also perform hierarchical clustering over the normalized confusion matrix, using correlation as a distance metric and ward as a linkage method. We find that the model learns clusters that are relatively similar to the ones in Figure 2, even though the training data only includes the subset of labels that have agreement (see Figure 5).
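A minimal sketch of this clustering step, assuming SciPy's `pdist`/`linkage` (the paper does not name its implementation), with `cluster_rows` as our own helper:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_rows(matrix, n_clusters):
    """Hierarchical clustering of matrix rows with correlation distance and
    Ward linkage, mirroring the procedure described above."""
    dist = pdist(np.asarray(matrix, dtype=float), metric="correlation")
    Z = linkage(dist, method="ward")              # condensed distances in, tree out
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Rows whose confusion profiles correlate strongly (distance near 0) end up in the same cluster.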

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets that are downloadable and have categorical emotions in the Unified Dataset (Bostan and Klinger, 2018). These datasets are: DailyDialog (Li et al., 2017), Emotion-Stimulus (Ghazi et al., 2015), Affective Text (Strapparava and Mihalcea, 2007), CrowdFlower (CrowdFlower, 2016), Electoral Tweets (Mohammad et al., 2015), ISEAR (Scherer and Wallbott, 1994), the Twitter Emotion Corpus (TEC) (Mohammad, 2012), EmoInt (Mohammad et al., 2018), and the Stance Sentiment Emotion Corpus (SSEC) (Schuff et al., 2017).

We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ~4k labeled examples with a large taxonomy of 36 emotions, FREEZE gives a significant performance boost over the BASELINE and NOFREEZE for all training set sizes besides "max".

For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple of hundred training examples. For 500-1000 training examples, NOFREEZE tends to outperform the BASELINE, but these two setups come closer when there is more training data available. These results suggest that our dataset helps if there is limited data from the target domain.

Figure 6: A normalized confusion matrix for our model predictions (true label vs. predicted label, with the 27 emotions grouped by positive, negative, and ambiguous sentiment). The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment.

Figure 7: Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018). Panels: DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), ElectoralTweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets), and SSEC (tweets). Each panel plots F-score against training set size (100, 200, 500, 1000, max) for the Baseline, Transfer Learning w/ Freezing Layers, and Transfer Learning w/o Freezing Layers setups.


transformer-based model with language model pre-training has recently been shown to reach state-of-the-art performance on several NLP tasks, including emotion prediction: the top-performing models in the EmotionX Challenge (Hsu and Ku, 2018) all employed a pre-trained BERT model. We also use the BERT model in our experiments, and we find that it outperforms our biLSTM model.

3 GoEmotions

Our dataset is composed of 58K Reddit comments, labeled for one or more of 27 emotions or Neutral.

3.1 Selecting & Curating Reddit Comments

We use a Reddit data dump originating in the reddit-data-tools project,2 which contains comments from 2005 (the start of Reddit) to January 2019. We select subreddits with at least 10k comments and remove deleted and non-English comments.

Reddit is known for a demographic bias leaning towards young male users (Duggan and Smith, 2013), which is not reflective of a globally diverse population. The platform also introduces a skew towards toxic, offensive language (Mohan et al., 2017). Thus, Reddit content has been used to study depression (Pirina and Coltekin, 2018) and microaggressions (Breitfeller et al., 2019), and Yanardag and Rahwan (2018) have shown the effect of using biased Reddit data by training a "psychopath" bot. To address these concerns, and to enable building broadly representative emotion models using GoEmotions, we take a series of data curation measures to ensure our data does not reinforce general, nor emotion-specific, language biases.

We identify harmful comments using pre-defined lists containing offensive/adult, vulgar (mildly offensive profanity), identity, and religion terms (included as supplementary material). These are used for data filtering and masking, as described below. The lists were internally compiled; we believe they are comprehensive and widely useful for dataset curation, however they may not be complete.

Reducing profanity. We remove subreddits that are not safe for work3 and those where 10%+ of comments include offensive/adult and vulgar tokens. We remove remaining comments that include offensive/adult tokens. Vulgar comments are preserved, as we believe they are central to learning about

2 https://github.com/dewarim/reddit-data-tools

3 http://redditlist.com/nsfw

negative emotions. The dataset includes the list of filtered tokens.

Manual review. We manually review identity comments and remove those that are offensive towards a particular ethnicity, gender, sexual orientation, or disability, to the best of our judgment.

Length filtering. We apply NLTK's word tokenizer and select comments 3-30 tokens long, including punctuation. To create a relatively balanced distribution of comment lengths, we perform downsampling, capping by the number of comments with the median token count (12).
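The filtering and median-capped downsampling steps can be sketched as follows. All helper names are our own, and the regex tokenizer is a dependency-free stand-in for NLTK's word tokenizer (token counts may differ slightly from NLTK's):

```python
import random
import re

def tokenize(text):
    # Crude word/punctuation split standing in for nltk.word_tokenize.
    return re.findall(r"\w+|[^\w\s]", text)

def length_filter(comments, lo=3, hi=30):
    """Keep comments whose token count (punctuation included) is in [lo, hi]."""
    return [c for c in comments if lo <= len(tokenize(c)) <= hi]

def downsample_by_median(items, key, rng):
    """Cap each bucket (e.g., token count, emotion, subreddit) at the median
    bucket size, as in the paper's balancing steps."""
    buckets = {}
    for item in items:
        buckets.setdefault(key(item), []).append(item)
    sizes = sorted(len(b) for b in buckets.values())
    cap = sizes[len(sizes) // 2]          # median bucket size
    out = []
    for b in buckets.values():
        out.extend(b if len(b) <= cap else rng.sample(b, cap))
    return out
```

The same `downsample_by_median` pattern also covers the emotion- and subreddit-balancing steps below, with a different `key` function.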

Sentiment balancing. We reduce sentiment bias by removing subreddits with little representation of positive, negative, ambiguous, or neutral sentiment. To estimate a comment's sentiment, we run our emotion prediction model, trained on a pilot batch of 22k annotated examples. The mapping of emotions into sentiment categories is found in Figure 2. We exclude subreddits consisting of more than 30% neutral comments, or less than 20% of negative, positive, or ambiguous comments.

Emotion balancing. We assign a predicted emotion to each comment using the pilot model described above. Then, we reduce emotion bias by downsampling the weakly-labelled data, capping by the number of comments belonging to the median emotion count.

Subreddit balancing. To avoid over-representation of popular subreddits, we perform downsampling, capping by the median subreddit count.

From the remaining 315k comments (from 482 subreddits), we randomly sample for annotation.

Masking. We mask proper names referring to people with a [NAME] token, using a BERT-based Named Entity Tagger (Tsai et al., 2019). We mask religion terms with a [RELIGION] token. The list of these terms is included with our dataset. Note that raters viewed unmasked comments during rating.

3.2 Taxonomy of Emotions

When creating the taxonomy, we seek to jointly maximize the following objectives:

1. Provide greatest coverage in terms of emotions expressed in our data. To address this, we manually labeled a small subset of the data and ran a pilot task where raters could suggest emotion labels on top of the pre-defined set.

2. Provide greatest coverage in terms of kinds of emotional expression. We consult psychology literature on emotion expression and recognition (Plutchik, 1980; Cowen and Keltner, 2017; Cowen et al., 2019b). Since, to our knowledge, there has not been research that identifies principal categories for emotion recognition in the domain of text (see Section 2.2), we consider those emotions that are identified as basic in other domains (video and speech) and that we can assume to apply to text as well.

3. Limit overlap among emotions and limit the number of emotions. We do not want to include emotions that are too similar, since that makes the annotation task more difficult. Moreover, combining similar labels with high coverage would result in an explosion in annotated labels.

The final set of selected emotions is listed in Table 4 and Figure 1. See Appendix B for more details on our multi-step taxonomy selection procedure.

3.3 Annotation

We assigned three raters to each example. For those examples where no raters agreed on at least one emotion label, we assigned two additional raters. All raters are native English speakers from India.4

Instructions. Raters were asked to identify the emotions expressed by the writer of the text, given pre-defined emotion definitions (see Appendix A) and a few example texts for each emotion. Raters were free to select multiple emotions, but were asked to only select those for which they were reasonably confident that the emotion is expressed in the text. If raters were not certain about any emotion being expressed, they were asked to select Neutral. We included a checkbox for raters to indicate if an example was particularly difficult to label, in which case they could select no emotions. We removed all examples for which no emotion was selected.

The rater interface. Reddit comments were presented with no additional metadata (such as the author or subreddit). To help raters navigate the large space of emotions in our taxonomy, they were presented a table containing all emotion categories, aggregated by sentiment (by the mapping in Figure 2) and by whether that emotion is generally expressed towards something (e.g., disapproval) or is

4 Cowen et al. (2019b) find that emotion judgments in Indian and US English speakers largely occupy the same dimensions.

Number of examples: 58,009
Number of emotions: 27 + Neutral
Number of unique raters: 82
Number of raters / example: 3 or 5
Marked unclear or difficult to label: 1.6%
Number of labels per example: 1: 83%, 2: 15%, 3: 2%, 4+: .2%
Number of examples w/ 2+ raters agreeing on at least 1 label: 54,263 (94%)
Number of examples w/ 3+ raters agreeing on at least 1 label: 17,763 (31%)

Table 2: Summary statistics of our labeled data.

more of an intrinsic feeling (e.g., joy). The instructions highlighted that this separation of categories was by no means clear-cut, but captured general tendencies, and we encouraged raters to ignore the categorization whenever they saw fit. Emotions with a straightforward mapping onto emojis were shown with an emoji in the UI, to further ease their interpretation.

4 Data Analysis

Table 2 shows summary statistics for the data. Most of the examples (83%) have a single emotion label, and most have at least two raters agreeing on a single label (94%). The Neutral category makes up 26% of all emotion labels; we exclude that category from the following analyses, since we do not consider it to be part of the semantic space of emotions.

Figure 1 shows the distribution of emotion labels. We can see a large disparity in terms of emotion frequencies (e.g., admiration is 30 times more frequent than grief), despite the emotion and sentiment balancing steps taken during data selection. This is expected, given the disparate frequencies of emotions in natural human expression.

4.1 Interrater Correlation

We estimate rater agreement for each emotion via interrater correlation (Delgado and Tibau, 2019).5

For each rater r ∈ R, we calculate the Spearman correlation between r's judgments and the mean

5 We use correlations as opposed to Cohen's kappa (Cohen, 1960) because the former is a more interpretable metric, and it is also more suitable for measuring agreement among a variable number of raters rating different examples. In Appendix C, we report Cohen's kappa values as well, which correlate highly with the values obtained from interrater correlation (Pearson r = 0.85, p < 0.001).


Figure 1: Our emotion categories, ordered by the number of examples where at least one rater uses a particular label. The color indicates the interrater correlation.

of other raters' judgments for all examples that r rated. We then take the average of these rater-level correlation scores. In Section 4.3, we show that each emotion has significant interrater correlation after controlling for several potential confounds.
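A sketch of this rater-level agreement measure; the helper name is ours, and it assumes a rater-by-example judgment matrix for one emotion, with NaN where a rater did not see an example:

```python
import numpy as np
from scipy.stats import spearmanr

def interrater_correlation(ratings):
    """ratings: (n_raters, n_examples) judgments for one emotion, NaN where
    unrated. For each rater, correlate their judgments with the mean of the
    other raters' judgments on the examples they rated; average over raters."""
    ratings = np.asarray(ratings, dtype=float)
    scores = []
    for r in range(ratings.shape[0]):
        mask = ~np.isnan(ratings[r])
        others = np.delete(ratings, r, axis=0)[:, mask]
        mean_others = np.nanmean(others, axis=0)      # mean of the other raters
        rho, _ = spearmanr(ratings[r, mask], mean_others)
        scores.append(rho)
    return float(np.mean(scores))
```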

Figure 1 shows that gratitude, admiration, and amusement have the highest interrater correlation, while grief and nervousness have the lowest. Emotion frequency correlates with interrater agreement, but the two are not equivalent: infrequent emotions can have relatively high interrater correlation (e.g., fear), and frequent emotions can have relatively low interrater correlation (e.g., annoyance).

4.2 Correlation Among Emotions

To better understand the relationship between emotions in our data, we look at their correlations. Let N be the number of examples in our dataset. We obtain N-dimensional vectors for each emotion by averaging raters' judgments for all examples labeled with that emotion. We calculate Pearson correlation values between each pair of emotions. The heatmap in Figure 2 shows that emotions that are related in intensity (e.g., annoyance and anger, joy and excitement, nervousness and fear) have a strong positive correlation. On the other hand, emotions that have the opposite sentiment are negatively correlated.

Figure 2: The heatmap shows the correlation between ratings for each emotion (rows and columns grouped by positive, ambiguous, and negative sentiment). The dendrogram represents a hierarchical clustering of the ratings. The sentiment labeling was done a priori, and it shows that the clusters closely map onto sentiment groups.

We also perform hierarchical clustering to uncover the nested structure of our taxonomy. We use correlation as a distance metric and ward as a linkage method, applied to the averaged ratings. The dendrogram at the top of Figure 2 shows that emotions that are related by intensity are neighbors, and that larger clusters map closely onto sentiment categories. Interestingly, emotions that we labeled as "ambiguous" in terms of sentiment (e.g., surprise) are closer to the positive than to the negative category. This suggests that, in our data, ambiguous emotions are more likely to occur in the context of positive sentiment than in that of negative sentiment.

4.3 Principal Preserved Component Analysis

To better understand agreement among raters and the latent structure of the emotion space, we apply Principal Preserved Component Analysis (PPCA) (Cowen et al., 2019b) to our data. PPCA extracts linear combinations of attributes (here, emotion judgments) that maximally covary across two sets of data that measure the same attributes (here, randomly split judgments for each example). Thus, PPCA allows us to uncover latent dimensions of

Algorithm 1: Leave-One-Rater-Out PPCA
1:  R ← set of raters
2:  E ← set of emotions
3:  C ∈ R^{|R|×|E|}
4:  for all raters r ∈ 1, ..., |R| do
5:    n ← number of examples annotated by r
6:    J ∈ R^{n×|R|×|E|} ← all ratings for the examples annotated by r
7:    J^{−r} ∈ R^{n×(|R|−1)×|E|} ← all ratings in J excluding r
8:    J^{r} ∈ R^{n×|E|} ← all ratings by r
9:    X, Y ∈ R^{n×|E|} ← randomly split J^{−r} and average ratings across raters for both sets
10:   W ∈ R^{|E|×|E|} ← result of PPCA(X, Y)
11:   for all components† w_i, i ∈ 1, ..., |E| in W do
12:     v^{r}_i ← projection‡ of J^{r} onto w_i
13:     v^{−r}_i ← projection‡ of J^{−r} onto w_i
14:     C_{r,i} ← correlation between v^{r}_i and v^{−r}_i, partialing out v^{−r}_k ∀k ∈ 1, ..., i−1
15:   end for
16: end for
17: C′ ← Wilcoxon signed rank test on C
18: C″ ← Bonferroni correction on C′ (α = 0.05)
† in descending order of eigenvalue
‡ we demean vectors before projection

emotion that have high agreement across raters.

Unlike Principal Component Analysis (PCA), PPCA examines the cross-covariance between datasets, rather than the variance-covariance matrix within a single dataset. We obtain the principal preserved components (PPCs) of two datasets (matrices) X, Y ∈ R^{N×|E|}, where N is the number of examples and |E| is the number of emotions, by calculating the eigenvectors of the symmetrized cross-covariance matrix X^T Y + Y^T X.
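A minimal numpy sketch of this eigendecomposition; the helper name is ours, and the demeaning step is an assumption matching standard covariance computation:

```python
import numpy as np

def ppca(X, Y):
    """Principal preserved components of two datasets measuring the same
    attributes: eigenvectors of the symmetrized cross-covariance matrix
    X^T Y + Y^T X, in descending order of eigenvalue."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    S = X.T @ Y + Y.T @ X                  # symmetric by construction
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: eigendecomposition of a symmetric matrix
    order = np.argsort(eigvals)[::-1]      # sort by descending eigenvalue
    return eigvals[order], eigvecs[:, order]
```

When X = Y, this reduces to (twice) the ordinary covariance eigendecomposition of PCA; the interesting case is when X and Y are the two random halves of the rater judgments, so only dimensions that replicate across raters get large eigenvalues.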

Extracting significant dimensions. We remove examples labeled as Neutral and keep those examples that still have at least 3 ratings after this filtering step. We then determine the number of significant dimensions using a leave-one-rater-out analysis, as described by Algorithm 1.

We find that all 27 PPCs are highly significant. Specifically, Bonferroni-corrected p-values are less than 1.5e-6 for all dimensions (corrected α = 0.0017), suggesting that the emotions are highly dissociable. Such a high degree of significance for all dimensions is nontrivial: for example, Cowen et al. (2019b) find that only 12 out of their 30 emotion categories are significantly dissociable.

t-SNE projection. To better understand how the examples are organized in the emotion space, we apply t-SNE, a dimension reduction method that seeks to preserve distances between data points, using the scikit-learn package (Pedregosa et al., 2011). The dataset can be explored in our interactive plot,6 where one can also look at the texts and the annotations. The color of each data point is the weighted average of the RGB values representing those emotions that at least half of the raters selected.

4.4 Linguistic Correlates of Emotions

We extract the lexical correlates of each emotion by calculating the log odds ratio, informative Dirichlet prior (Monroe et al., 2008) of all tokens for each emotion category, contrasting with all other emotions. Since the log odds are z-scored, all values greater than 3 indicate highly significant (>3 std) association with the corresponding emotion. We list the top 5 tokens for each category in Table 3. We find that those emotions that are highly significantly associated with certain tokens (e.g., gratitude with "thanks", amusement with "lol") tend to have the highest interrater correlation (see Figure 1). Conversely, emotions that have fewer significantly associated tokens (e.g., grief and nervousness) tend to have low interrater correlation. These results suggest that certain emotions are more verbally implicit and may require more context to be interpreted.
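The Monroe et al. (2008) statistic can be sketched as follows. The helper name is ours; the variance approximation is the one given in their paper:

```python
import numpy as np

def log_odds_dirichlet(counts_a, counts_b, prior):
    """Z-scored log odds ratio with informative Dirichlet prior (Monroe et
    al., 2008). counts_a/counts_b: per-token counts in the target corpus
    (e.g., one emotion) and the contrast corpus (all other emotions);
    prior: per-token background counts acting as pseudo-counts."""
    a, b, p = (np.asarray(x, dtype=float) for x in (counts_a, counts_b, prior))
    n_a, n_b, p0 = a.sum(), b.sum(), p.sum()
    # Log odds of each token in each corpus, smoothed by the prior.
    delta = (np.log((a + p) / (n_a + p0 - a - p))
             - np.log((b + p) / (n_b + p0 - b - p)))
    # Approximate variance of the log odds difference.
    var = 1.0 / (a + p) + 1.0 / (b + p)
    return delta / np.sqrt(var)   # z-scores; values > 3 taken as significant
```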

5 Modeling

We present a strong baseline emotion predictionmodel for GoEmotions

5.1 Data Preparation

To minimize the noise in our data, we filter out emotion labels selected by only a single annotator. We keep examples with at least one label after this filtering is performed; this amounts to 93% of the original data. We randomly split this data into train (80%), dev (10%), and test (10%) sets. We only evaluate on the test set once the model is finalized.
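The label-filtering step can be sketched as follows (helper name ours; each example is represented as a list of per-rater label sets):

```python
from collections import Counter

def filter_low_agreement(examples):
    """Drop emotion labels chosen by only a single annotator; keep examples
    that retain at least one label afterwards."""
    filtered = []
    for rater_labels in examples:
        counts = Counter(lab for labels in rater_labels for lab in labels)
        kept = {lab for lab, c in counts.items() if c >= 2}  # 2+ raters agree
        if kept:
            filtered.append(kept)
    return filtered
```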

Even though we filter our data for the baseline experiments, we see particular value in the 4K examples that lack agreement. This subset of the data likely contains edge/difficult examples for the emotion domain (e.g., emotion-ambiguous text) and presents challenges for further exploration. That is

6 https://nlp.stanford.edu/~ddemszky/goemotions/tsne.html

admiration: great (42), awesome (32), amazing (30), good (28), beautiful (23)
amusement: lol (66), haha (32), funny (27), lmao (21), hilarious (18)
approval: agree (24), not (13), don't (12), yes (12), agreed (11)
caring: you (12), worry (11), careful (9), stay (9), your (8)
anger: fuck (24), hate (18), fucking (18), angry (11), dare (10)
annoyance: annoying (14), stupid (13), fucking (12), shit (10), dumb (9)
disappointment: disappointing (11), disappointed (10), bad (9), disappointment (7), unfortunately (7)
disapproval: not (16), don't (14), disagree (9), nope (8), doesn't (7)
confusion: confused (18), why (11), sure (10), what (10), understand (8)
desire: wish (29), want (8), wanted (6), could (6), ambitious (4)
excitement: excited (21), happy (8), cake (8), wow (8), interesting (7)
gratitude: thanks (75), thank (69), for (24), you (18), sharing (17)
joy: happy (32), glad (27), enjoy (20), enjoyed (12), fun (12)
disgust: disgusting (22), awful (14), worst (13), worse (12), weird (9)
embarrassment: embarrassing (12), shame (11), awkward (10), embarrassment (8), embarrassed (7)
fear: scared (16), afraid (16), scary (15), terrible (12), terrifying (11)
grief: died (6), rip (4)
curiosity: curious (22), what (18), why (13), how (11), did (10)
love: love (76), loved (21), favorite (13), loves (12), like (9)
optimism: hope (45), hopefully (19), luck (18), hoping (16), will (8)
pride: proud (14), pride (4), accomplishment (4)
relief: glad (5), relieved (4), relieving (4), relief (4)
nervousness: nervous (8), worried (8), anxiety (6), anxious (4), worrying (4)
remorse: sorry (39), regret (9), apologies (7), apologize (6), guilt (5)
sadness: sad (31), sadly (16), sorry (15), painful (10), crying (9)
realization: realize (14), realized (12), realised (7), realization (6), thought (6)
surprise: wow (23), surprised (21), wonder (15), shocked (12), omg (11)

Table 3: Top 5 words associated with each emotion (positive, negative, ambiguous). The rounded z-scored log odds ratios in the parentheses, with the threshold set at 3, indicate significance of association.

why we release all 58K examples with all annotators' ratings.

Grouping emotions. We create a hierarchical grouping of our taxonomy and evaluate the model performance at each level of the hierarchy. A sentiment level divides the labels into 4 categories (positive, negative, ambiguous, and Neutral), with the Neutral category intact and the rest of the mapping as shown in Figure 2. The Ekman level further divides the taxonomy, using the Neutral label and the following 6 groups: anger (maps to: anger, annoyance, disapproval), disgust (maps to: disgust), fear (maps to: fear, nervousness), joy (all positive emotions), sadness (maps to: sadness, disappointment, embarrassment, grief, remorse), and surprise (all ambiguous emotions).
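The Ekman-level mapping can be written down directly from the description above; the membership of the "all positive emotions" and "all ambiguous emotions" groups follows the sentiment mapping in Figure 2:

```python
# Ekman-level grouping of the 27 GoEmotions labels, as described above.
EKMAN_MAP = {
    "anger":    ["anger", "annoyance", "disapproval"],
    "disgust":  ["disgust"],
    "fear":     ["fear", "nervousness"],
    "joy":      ["admiration", "amusement", "approval", "caring", "desire",
                 "excitement", "gratitude", "joy", "love", "optimism",
                 "pride", "relief"],                        # all positive emotions
    "sadness":  ["sadness", "disappointment", "embarrassment", "grief",
                 "remorse"],
    "surprise": ["confusion", "curiosity", "realization", "surprise"],  # all ambiguous
}

def to_ekman(label):
    """Map a fine-grained GoEmotions label to its Ekman-level group
    (Neutral passes through unchanged)."""
    for group, members in EKMAN_MAP.items():
        if label in members:
            return group
    return "neutral"
```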

5.2 Model Architecture

We use the BERT-base model (Devlin et al., 2019) for our experiments. We add a dense output layer on top of the pretrained model for the purposes of finetuning, with a sigmoid cross entropy loss function to support multi-label classification. As an additional baseline, we train a bidirectional LSTM.
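The key design choice here is a per-label sigmoid rather than a single softmax: each of the 28 outputs (27 emotions + Neutral) becomes an independent binary decision, so one comment can carry several labels at once. A minimal numpy sketch of sigmoid cross entropy in its numerically stable form (library built-ins such as PyTorch's `BCEWithLogitsLoss` compute the same quantity; the shapes here are illustrative):

```python
import numpy as np

def sigmoid_cross_entropy(logits, targets):
    """Element-wise sigmoid cross entropy for multi-label classification.

    Stable form: max(x, 0) - x*t + log(1 + exp(-|x|)), which equals
    -t*log(sigmoid(x)) - (1-t)*log(1 - sigmoid(x)) without overflow.
    """
    x = np.asarray(logits, dtype=float)
    t = np.asarray(targets, dtype=float)
    return np.maximum(x, 0) - x * t + np.log1p(np.exp(-np.abs(x)))

# One comment, 28 taxonomy slots, two gold labels active at once.
logits = np.zeros(28)                 # untrained head: all logits at 0
targets = np.zeros(28)
targets[[5, 11]] = 1.0                # e.g. two emotions both labeled
loss = sigmoid_cross_entropy(logits, targets).mean()
```

At zero logits, every per-label term is log 2 ≈ 0.693 regardless of the targets, which is the expected loss of an uninformed multi-label classifier.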

5.3 Parameter Settings

When finetuning the pre-trained BERT model, we keep most of the hyperparameters set by Devlin et al. (2019) intact and only change the batch size and learning rate. We find that training for at least 4 epochs is necessary for learning the data, but training for more epochs results in overfitting. We also find that a small batch size of 16 and learning rate of 5e-5 yields the best performance.

For the biLSTM, we set the hidden layer dimensionality to 256, the learning rate to 0.1, with a decay rate of 0.95. We apply a dropout of 0.7.

5.4 Results

Table 4 summarizes the performance of our best model, BERT, on the test set, which achieves an average F1-score of .46 (std = .19). The model obtains the best performance on emotions with overt lexical markers, such as gratitude (.86), amusement (.80) and love (.78). The model obtains the lowest F1-score on grief (0), relief (.15) and realization (.21), which are the lowest-frequency emotions. We find that less frequent emotions tend to be confused by the model with more frequent emotions related in sentiment and intensity (e.g., grief with sadness, pride with admiration, nervousness with fear) – see Appendix G for a more detailed analysis.

Table 5 and Table 6 show results for a sentiment-grouped model (F1-score = .69) and an Ekman-grouped model (F1-score = .64), respectively. The significant performance increase in the transition from the full to the Ekman-level taxonomy indicates that this grouping mitigates confusion among inner-group lower-level categories.

The biLSTM model performs significantly worse than BERT, obtaining an average F1-score of .41 for the full taxonomy, .54 for an Ekman-grouped model and .6 for a sentiment-grouped model.

6 Transfer Learning Experiments

We conduct transfer learning experiments on existing emotion benchmarks, in order to show our data generalizes across domains and taxonomies. The goal is to demonstrate that, given little labeled data in a target domain, one can utilize GoEmotions as baseline emotion understanding data.

[Figure 3: three panels plot average F1-score against training set size (100, 200, 500, 1000, max) for ISEAR (self-reported experiences), Emotion-Stimulus (FrameNet-based sentences) and EmoInt (tweets), comparing the Baseline, Transfer Learning w/ Freezing Layers and Transfer Learning w/o Freezing Layers setups.]

Figure 3: Transfer learning results in terms of average F1-scores across emotion categories. The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data.

Emotion          Precision  Recall  F1
admiration       0.53       0.83    0.65
amusement        0.70       0.94    0.80
anger            0.36       0.66    0.47
annoyance        0.24       0.63    0.34
approval         0.26       0.57    0.36
caring           0.30       0.56    0.39
confusion        0.24       0.76    0.37
curiosity        0.40       0.84    0.54
desire           0.43       0.59    0.49
disappointment   0.19       0.52    0.28
disapproval      0.29       0.61    0.39
disgust          0.34       0.66    0.45
embarrassment    0.39       0.49    0.43
excitement       0.26       0.52    0.34
fear             0.46       0.85    0.60
gratitude        0.79       0.95    0.86
grief            0.00       0.00    0.00
joy              0.39       0.73    0.51
love             0.68       0.92    0.78
nervousness      0.28       0.48    0.35
neutral          0.56       0.84    0.68
optimism         0.41       0.69    0.51
pride            0.67       0.25    0.36
realization      0.16       0.29    0.21
relief           0.50       0.09    0.15
remorse          0.53       0.88    0.66
sadness          0.38       0.71    0.49
surprise         0.40       0.66    0.50
macro-average    0.40       0.63    0.46
std              0.18       0.24    0.19

Table 4: Results based on the GoEmotions taxonomy.

61 Emotion Benchmark Datasets

We consider the nine benchmark datasets from Bostan and Klinger (2018)'s Unified Dataset, which vary in terms of their size, domain, qual-

Sentiment Precision Recall F1

ambiguous       0.54  0.66  0.60
negative        0.65  0.76  0.70
neutral         0.64  0.69  0.67
positive        0.78  0.87  0.82
macro-average   0.65  0.74  0.69
std             0.09  0.10  0.09

Table 5: Results based on sentiment-grouped data.

Ekman Emotion Precision Recall F1

anger           0.50  0.65  0.57
disgust         0.52  0.53  0.53
fear            0.61  0.76  0.68
joy             0.77  0.88  0.82
neutral         0.66  0.67  0.66
sadness         0.56  0.62  0.59
macro-average   0.60  0.68  0.64
std             0.10  0.12  0.11

Table 6: Results using Ekman's taxonomy.

ity and taxonomy. In the interest of space, we only discuss three of these datasets here, chosen based on their diversity of domains. In our experiments, we observe similar trends for the additional benchmarks, and all are included in Appendix H.

The International Survey on Emotion Antecedents and Reactions (ISEAR) (Scherer and Wallbott, 1994) is a collection of personal reports on emotional events, written by 3,000 people from different cultural backgrounds. The dataset contains 8k sentences, each labeled with a single emotion. The categories are anger, disgust, fear, guilt, joy, sadness and shame.

EmoInt (Mohammad et al., 2018) is part of the SemEval 2018 benchmark, and it contains crowdsourced annotations for 7k tweets. The labels are intensity annotations for anger, joy, sadness and fear. We obtain binary annotations for these emotions by using .5 as the cutoff.

Emotion-Stimulus (Ghazi et al., 2015) contains annotations for 2.4k sentences, generated based on FrameNet's emotion-directed frames. Their taxonomy is anger, disgust, fear, joy, sadness, shame and surprise.

6.2 Experimental Setup

Training set size. We experiment with varying amounts of training data from the target domain dataset, including 100, 200, 500, 1000, and 80% (named "max") of dataset examples. We generate 10 random splits for each train set size, with the remaining examples held as a test set.

We report the results of the finetuning experiments detailed below for each data size, with confidence intervals based on repeated experiments using the splits.

Finetuning. We compare three different finetuning setups. In the BASELINE setup, we finetune BERT only on the target dataset. In the FREEZE setup, we first finetune BERT on GoEmotions, then perform transfer learning by replacing the final dense layer, freezing all layers besides the last layer, and finetuning on the target dataset. The NOFREEZE setup is the same as FREEZE, except that we do not freeze the bottom layers. We hold the batch size at 16, learning rate at 2e-5, and number of epochs at 3 for all experiments.

6.3 Results

The results in Figure 3 suggest that our dataset generalizes well to different domains and taxonomies, and that a model trained on GoEmotions can help in cases when there is limited data from the target domain, or limited resources for labeling.

Given limited target domain data (100 or 200 examples), both FREEZE and NOFREEZE yield significantly higher performance than the BASELINE for all three datasets. Importantly, NOFREEZE results show significantly higher performance for all training set sizes, except for "max", where NOFREEZE and BASELINE perform similarly.

7 Conclusion

We present GoEmotions, a large, manually annotated, carefully curated dataset for fine-grained emotion prediction. We provide a detailed data analysis, demonstrating the reliability of the annotations for the full taxonomy. We show the generalizability of the data across domains and taxonomies via transfer learning experiments. We build a strong baseline by fine-tuning a BERT model; however, the results suggest much room for future improvement. Future work can explore the cross-cultural robustness of emotion ratings and extend the taxonomy to other languages and domains.

Data Disclaimer: We are aware that the dataset contains biases and is not representative of global diversity. We are aware that the dataset contains potentially problematic content. Potential biases in the data include: inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels, and the fact that annotators were all native English speakers from India. All these likely affect labeling, precision, and recall for a trained model. The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team. Anyone using this dataset should be aware of these limitations of the dataset.

Acknowledgments

We thank the three anonymous reviewers for their constructive feedback. We would also like to thank the annotators for their hard work.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-Grained Emotion Detection with Gated Recurrent Neural Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718–728.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579–586, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104–2119.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664–1674.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019a. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90.

Alan S. Cowen, Hillary Anger Elfenbein, Petri Laukka, and Dacher Keltner. 2018. Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6):698–712.

Alan S. Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. In press. What music makes us feel: At least thirteen dimensions organize subjective experiences associated with music across cultures. Proceedings of the National Academy of Sciences.

Alan S. Cowen and Dacher Keltner. 2017. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900–E7909.

Alan S. Cowen and Dacher Keltner. 2019. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist.

Alan S. Cowen, Petri Laukka, Hillary Anger Elfenbein, Runjing Liu, and Dacher Keltner. 2019b. The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4):369.

CrowdFlower. 2016. https://www.figure-eight.com/data/sentiment-analysis-emotion-text/.

Rosario Delgado and Xavier-Andoni Tibau. 2019. Why Cohen's kappa should be avoided as performance measure in classification. PLoS ONE, 14(9).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Maeve Duggan and Aaron Smith. 2013. 6% of online adults are Reddit users. Pew Internet & American Life Project, 3:1–10.

Paul Ekman. 1992a. Are there basic emotions? Psychological Review, 99(3):550–553.

Paul Ekman. 1992b. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting Emotion Stimuli in Emotion-Bearing Sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 152–165. Springer.

Chao-Chun Hsu and Lun-Wei Ku. 2018. SocialNLP 2018 EmotionX challenge overview: Recognizing emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 27–31, Melbourne, Australia. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.

Chen Liu, Muhammad Osama, and Anderson De Andrade. 2019. DENS: A dataset for multi-class emotion analysis. arXiv preprint arXiv:1910.11769.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 task 1: Affect in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 1–17, New Orleans, Louisiana. Association for Computational Linguistics.

Saif M. Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 246–255. Association for Computational Linguistics.

Saif M. Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. 2015. Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480–499.

Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of Reddit communities. In Canadian Conference on Artificial Intelligence, pages 51–56. Springer.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Emily Öhman, Kaisla Kajava, Jörg Tiedemann, and Timo Honkela. 2018. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 24–30.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Rosalind W. Picard. 1997. Affective Computing. MIT Press.

Inna Pirina and Çağrı Çöltekin. 2018. Identifying depression on Reddit: The effect of training data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, pages 9–12.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In Theories of Emotion, pages 3–33. Elsevier.

James A. Russell. 2003. Core affect and the psychological construction of emotion. Psychological Review, 110(1):145.

Klaus R. Scherer and Harald G. Wallbott. 1994. Evidence for Universality and Cultural Variation of Differential Emotion Response Patterning. Journal of Personality and Social Psychology, 66(2):310.

Hendrik Schuff, Jeremy Barnes, Julian Mohme, Sebastian Padó, and Roman Klinger. 2017. Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 13–23.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, Prague, Czech Republic. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT Rediscovers the Classical NLP Pipeline. In Association for Computational Linguistics.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and Practical BERT Models for Sequence Labeling. In EMNLP 2019.

Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P. Sheth. 2012. Harnessing Twitter 'big data' for automatic emotion identification. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pages 587–592. IEEE.

P. Yanardag, M. Cebrian, and I. Rahwan. 2018. Norman: World's first psychopath AI.

A Emotion Definitions

admiration: Finding something impressive or worthy of respect.
amusement: Finding something funny or being entertained.
anger: A strong feeling of displeasure or antagonism.
annoyance: Mild anger, irritation.
approval: Having or expressing a favorable opinion.
caring: Displaying kindness and concern for others.
confusion: Lack of understanding, uncertainty.
curiosity: A strong desire to know or learn something.
desire: A strong feeling of wanting something or wishing for something to happen.
disappointment: Sadness or displeasure caused by the nonfulfillment of one's hopes or expectations.
disapproval: Having or expressing an unfavorable opinion.
disgust: Revulsion or strong disapproval aroused by something unpleasant or offensive.
embarrassment: Self-consciousness, shame, or awkwardness.
excitement: Feeling of great enthusiasm and eagerness.
fear: Being afraid or worried.
gratitude: A feeling of thankfulness and appreciation.
grief: Intense sorrow, especially caused by someone's death.
joy: A feeling of pleasure and happiness.
love: A strong positive emotion of regard and affection.
nervousness: Apprehension, worry, anxiety.
optimism: Hopefulness and confidence about the future or the success of something.
pride: Pleasure or satisfaction due to one's own achievements or the achievements of those with whom one is closely associated.
realization: Becoming aware of something.
relief: Reassurance and relaxation following release from anxiety or distress.
remorse: Regret or guilty feeling.
sadness: Emotional pain, sorrow.
surprise: Feeling astonished, startled by something unexpected.

B Taxonomy Selection & Data Collection

We selected our taxonomy through a careful multi-round process. In the first pilot round of data collection, we used emotions that were identified to be salient by Cowen and Keltner (2017), making sure that our set includes Ekman's emotion categories, as used in previous NLP work. In this round, we also included an open input box where annotators could suggest emotion(s) that were not among the options. We annotated 3K examples in the first round. We updated the taxonomy based on the results of this round (see details below). In the second pilot round of data collection, we repeated this process with 2k new examples, once again updating the taxonomy.

While reviewing the results from the pilot rounds, we identified and removed emotions that were scarcely selected by annotators and/or had low interrater agreement, due to being very similar to other emotions or being too difficult to detect from text. These emotions were boredom, doubt, heartbroken, indifference and calmness. We also identified and added those emotions to our taxonomy that were frequently suggested by raters and/or seemed to be represented in the data upon manual inspection. These emotions were desire, disappointment, pride, realization, relief and remorse. In this process, we also refined the category names (e.g., replacing ecstasy with excitement) to ones that seemed interpretable to annotators. This is how we arrived at the final set of 27 emotions + Neutral. Our high interrater agreement in the final data can be partially explained by the fact that we took interpretability into consideration while constructing the taxonomy. The dataset we are releasing was labeled in the third round, over the final taxonomy.

C Cohen's Kappa Values

In Section 4.1, we measure agreement between raters via Spearman correlation, following considerations by Delgado and Tibau (2019). In Table 7, we report the Cohen's kappa values for comparison, which we obtain by randomly sampling two ratings for each example and calculating the Cohen's kappa between these two sets of ratings. We find that all Cohen's kappa values are greater than 0, showing rater agreement. Moreover, the Cohen's kappa values correlate highly with the interrater correlation values (Pearson r = 0.85, p < 0.001), providing corroborative evidence for the significant degree of interrater agreement for each emotion.
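For reference, Cohen's kappa between two label sequences is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. A small self-contained sketch (the function name is ours; in the paper this is applied to two ratings sampled at random for each example):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    a, b = list(a), list(b)
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 0 means agreement at chance level; 1 means perfect agreement.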

Emotion         Interrater Correlation  Cohen's kappa
admiration      0.535                   0.468
amusement       0.482                   0.474
anger           0.207                   0.307
annoyance       0.193                   0.192
approval        0.385                   0.187
caring          0.237                   0.252
confusion       0.217                   0.270
curiosity       0.418                   0.366
desire          0.177                   0.251
disappointment  0.186                   0.184
disapproval     0.274                   0.234
disgust         0.192                   0.241
embarrassment   0.177                   0.218
excitement      0.193                   0.222
fear            0.266                   0.394
gratitude       0.645                   0.749
grief           0.162                   0.095
joy             0.296                   0.301
love            0.446                   0.555
nervousness     0.164                   0.144
optimism        0.322                   0.300
pride           0.163                   0.148
realization     0.194                   0.155
relief          0.172                   0.185
remorse         0.178                   0.358
sadness         0.346                   0.336
surprise        0.275                   0.331

Table 7: Interrater agreement, as measured by interrater correlation and Cohen's kappa.

D Sentiment of Reddit Subreddits

In Section 3, we describe how we obtain subreddits that are balanced in terms of sentiment. Here, we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28%, STD=11%), positive (M=41%, STD=11%), negative (M=19%, STD=7%), ambiguous (M=35%, STD=8%). After filtering, the distribution of sentiments across our remaining subreddits became: neutral (M=24%, STD=5%), positive (M=35%, STD=6%), negative (M=27%, STD=4%), ambiguous (M=33%, STD=4%).

E BERT's Most Activated Layers

To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with center of gravity = 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who have also found that tasks involving high-level semantics tend to make use of all BERT layers.
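The center of gravity is simply the expected layer index under the softmaxed scalar mixing weights. A sketch, assuming one raw weight per mixed layer (0 through 12, matching the axis of the figure):

```python
import numpy as np

def center_of_gravity(raw_weights):
    """Expected layer index under softmax(raw_weights), per Tenney et al. (2019)."""
    w = np.exp(raw_weights - np.max(raw_weights))  # stable softmax
    w /= w.sum()
    return float((np.arange(len(w)) * w).sum())
```

With uniform weights over 13 layers the center of gravity is 6.0, so the reported 6.19 indicates a near-uniform mixture that leans only slightly toward upper layers.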


Figure 4: Softmax weights of each BERT layer when trained on our dataset.

F Number of Emotion Labels Per Example

Figure 5 shows the number of emotion labels per example, before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.


Figure 5: Number of emotion labels per example, before and after filtering the labels chosen by only a single annotator.

G Confusion Matrix

Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix similarly to how we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M, where M_ij denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the count for M_{joy,joy}, M_{joy,pride}, M_{admiration,joy} and M_{admiration,pride}. In practice, since most of our examples have only a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.

Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g., grief and sadness, pride and admiration, nervousness and fear).
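The counting-and-normalization procedure described above can be sketched as follows (the function name and toy labels are ours):

```python
import numpy as np

def multilabel_confusion(true_sets, pred_sets, labels):
    """Co-occurrence-style confusion matrix for multi-label predictions.

    For each true label of an example, increment the count of every predicted
    label; then normalize each row (true label) to sum to 1.
    """
    idx = {label: i for i, label in enumerate(labels)}
    M = np.zeros((len(labels), len(labels)))
    for true, pred in zip(true_sets, pred_sets):
        for t in true:
            for p in pred:
                M[idx[t], idx[p]] += 1
    row_sums = M.sum(axis=1, keepdims=True)
    return M / np.where(row_sums == 0, 1.0, row_sums)  # leave empty rows at 0

labels = ["joy", "admiration", "pride"]
M = multilabel_confusion([{"joy", "admiration"}], [{"joy", "pride"}], labels)
```

In this toy example, the joy row splits its mass evenly between the predicted labels joy and pride, mirroring the M_{joy,joy} / M_{joy,pride} increments in the text.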

We also perform hierarchical clustering over the normalized confusion matrix, using correlation as a distance metric and ward as a linkage method. We find that the model learns relatively similar clusters as the ones in Figure 2, even though the training data only includes a subset of the labels that have agreement (see Figure 5).

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets that are downloadable and have categorical emotions in the Unified Dataset (Bostan and Klinger, 2018). These datasets are: DailyDialog (Li et al., 2017), Emotion-Stimulus (Ghazi et al., 2015), Affective Text (Strapparava and Mihalcea, 2007), CrowdFlower (CrowdFlower, 2016), Electoral Tweets (Mohammad et al., 2015), ISEAR (Scherer and Wallbott, 1994), the Twitter Emotion Corpus (TEC) (Mohammad, 2012), EmoInt (Mohammad et al., 2018), and the Stance Sentiment Emotion Corpus (SSEC) (Schuff et al., 2017).

We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ~4k labeled examples and a large taxonomy of 36 emotions, FREEZE gives a significant boost of performance over the BASELINE and NOFREEZE for all training set sizes besides "max".

For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple hundred training examples. For 500–1000 training examples, NOFREEZE tends to outperform the BASELINE, but we can see that these two setups come closer when there is more training data available. These results suggest that our dataset helps if there is limited data from the target domain.


Figure 6: A normalized confusion matrix for our model predictions. The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment.

[Figure 7: nine panels plot F1-score against training set size (100, 200, 500, 1000, max), one per benchmark: DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), ElectoralTweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets) and SSEC (tweets), comparing the Baseline, Transfer Learning w/ Freezing Layers and Transfer Learning w/o Freezing Layers setups.]

Figure 7: Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018).


2. Provide greatest coverage in terms of kinds of emotional expression. We consult psychology literature on emotion expression and recognition (Plutchik, 1980; Cowen and Keltner, 2017; Cowen et al., 2019b). Since, to our knowledge, there has not been research that identifies principal categories for emotion recognition in the domain of text (see Section 2.2), we consider those emotions that are identified as basic in other domains (video and speech) and that we can assume to apply to text as well.

3. Limit overlap among emotions and limit the number of emotions. We do not want to include emotions that are too similar, since that makes the annotation task more difficult. Moreover, combining similar labels with high coverage would result in an explosion in annotated labels.

The final set of selected emotions is listed in Table 4 and Figure 1. See Appendix B for more details on our multi-step taxonomy selection procedure.

3.3 Annotation

We assigned three raters to each example. For those examples where no raters agree on at least one emotion label, we assigned two additional raters. All raters are native English speakers from India.4

Instructions. Raters were asked to identify the emotions expressed by the writer of the text, given pre-defined emotion definitions (see Appendix A) and a few example texts for each emotion. Raters were free to select multiple emotions, but were asked to only select those for which they were reasonably confident that they are expressed in the text. If raters were not certain about any emotion being expressed, they were asked to select Neutral. We included a checkbox for raters to indicate if an example was particularly difficult to label, in which case they could select no emotions. We removed all examples for which no emotion was selected.

The rater interface. Reddit comments were presented with no additional metadata (such as the author or subreddit). To help raters navigate the large space of emotions in our taxonomy, they were presented a table containing all emotion categories, aggregated by sentiment (by the mapping in Figure 2) and by whether that emotion is generally expressed towards something (e.g., disapproval) or is

4 Cowen et al. (2019b) find that emotion judgments in Indian and U.S. English speakers largely occupy the same dimensions.

Number of examples: 58,009
Number of emotions: 27 + Neutral
Number of unique raters: 82
Number of raters / example: 3 or 5
Marked unclear or difficult to label: 1.6%
Number of labels per example: 1: 83%, 2: 15%, 3: 2%, 4+: .2%
Number of examples w/ 2+ raters agreeing on at least 1 label: 54,263 (94%)
Number of examples w/ 3+ raters agreeing on at least 1 label: 17,763 (31%)

Table 2: Summary statistics of our labeled data.

more of an intrinsic feeling (e.g., joy). The instructions highlighted that this separation of categories was by no means clear-cut, but captured general tendencies, and we encouraged raters to ignore the categorization whenever they saw fit. Emotions with a straightforward mapping onto emojis were shown with an emoji in the UI to further ease their interpretation.

4 Data Analysis

Table 2 shows summary statistics for the data. Most of the examples (83%) have a single emotion label, and most have at least two raters agreeing on a single label (94%). The Neutral category makes up 26% of all emotion labels – we exclude that category from the following analyses, since we do not consider it to be part of the semantic space of emotions.

Figure 1 shows the distribution of emotion labels. We can see a large disparity in terms of emotion frequencies (e.g., admiration is 30 times more frequent than grief), despite the emotion and sentiment balancing steps taken during data selection. This is expected, given the disparate frequencies of emotions in natural human expression.

4.1 Interrater Correlation

We estimate rater agreement for each emotion via interrater correlation (Delgado and Tibau, 2019).5

For each rater r ∈ R, we calculate the Spearman correlation between r's judgments and the mean

5 We use correlations as opposed to Cohen's kappa (Cohen, 1960), because the former is a more interpretable metric and is also more suitable for measuring agreement among a variable number of raters rating different examples. In Appendix C, we report Cohen's kappa values as well, which correlate highly with the values obtained from interrater correlation (Pearson r = 0.85, p < 0.001).


Figure 1: Our emotion categories, ordered by the number of examples where at least one rater uses a particular label. The color indicates the interrater correlation.

of other raters' judgments for all examples that r rated. We then take the average of these rater-level correlation scores. In Section 4.3, we show that each emotion has significant interrater correlation after controlling for several potential confounds.
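The leave-one-out agreement procedure above can be sketched in a few lines. This is a simplified illustration on hypothetical toy data: it assumes all raters rated the same examples, whereas in the actual dataset each rater's correlation is computed only over the examples they rated.

```python
import numpy as np
from scipy.stats import spearmanr

def interrater_correlation(ratings):
    """ratings: array of shape (n_raters, n_examples) holding each rater's
    judgment (here a 0/1 indicator) for a single emotion. Returns the mean
    leave-one-out Spearman correlation across raters."""
    scores = []
    for r in range(ratings.shape[0]):
        # Mean judgment of all other raters on the same examples.
        others = np.delete(ratings, r, axis=0).mean(axis=0)
        rho, _ = spearmanr(ratings[r], others)
        scores.append(rho)
    return float(np.mean(scores))

# Toy data: 4 raters, 6 examples, mostly agreeing on one emotion.
toy = np.array([
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 0],
])
print(round(interrater_correlation(toy), 3))
```

High agreement between each rater and the remaining raters yields a score near 1; chance-level agreement yields a score near 0.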

Figure 1 shows that gratitude, admiration, and amusement have the highest, and grief and nervousness the lowest, interrater correlation. Emotion frequency correlates with interrater agreement, but the two are not equivalent: infrequent emotions can have relatively high interrater correlation (e.g., fear), and frequent emotions can have relatively low interrater correlation (e.g., annoyance).

4.2 Correlation Among Emotions

To better understand the relationship between emotions in our data, we look at their correlations. Let N be the number of examples in our dataset. We obtain an N-dimensional vector for each emotion by averaging raters' judgments over all examples labeled with that emotion. We then calculate Pearson correlation values between each pair of emotions. The heatmap in Figure 2 shows that emotions that are related in intensity (e.g., annoyance and anger, joy and excitement, nervousness and fear) have a strong positive correlation. On the other hand, emotions that have the opposite sentiment are negatively correlated.
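The pairwise-correlation computation reduces to a single `np.corrcoef` call on the emotion vectors. A minimal sketch with four hypothetical emotions whose average ratings are synthetically constructed so that the two negative and the two positive emotions co-vary:

```python
import numpy as np

# Hypothetical averaged judgments: entry [n, e] is the fraction of raters who
# chose emotion e for example n (synthetic, for illustration only).
rng = np.random.default_rng(0)
N = 200
base_neg = rng.random(N)  # shared "negative" signal
base_pos = rng.random(N)  # shared "positive" signal
judgments = np.stack([
    base_neg + 0.1 * rng.random(N),   # 0: anger
    base_neg + 0.1 * rng.random(N),   # 1: annoyance (co-varies with anger)
    base_pos + 0.1 * rng.random(N),   # 2: joy
    base_pos + 0.1 * rng.random(N),   # 3: excitement (co-varies with joy)
], axis=1)

corr = np.corrcoef(judgments.T)  # |E| x |E| Pearson correlation matrix
print(corr[0, 1] > corr[0, 2])   # anger~annoyance stronger than anger~joy
```

The resulting matrix is what a heatmap like Figure 2 visualizes.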


Figure 2: The heatmap shows the correlation between ratings for each emotion. The dendrogram represents a hierarchical clustering of the ratings. The sentiment labeling was done a priori, and it shows that the clusters closely map onto sentiment groups.

We also perform hierarchical clustering to uncover the nested structure of our taxonomy. We use correlation as a distance metric and Ward as a linkage method, applied to the averaged ratings. The dendrogram at the top of Figure 2 shows that emotions related by intensity are neighbors, and that larger clusters map closely onto sentiment categories. Interestingly, emotions that we labeled as "ambiguous" in terms of sentiment (e.g., surprise) are closer to the positive than to the negative category. This suggests that, in our data, ambiguous emotions are more likely to occur in the context of positive sentiment than that of negative sentiment.
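This clustering step can be reproduced in outline with scipy on toy ratings. One caveat worth flagging: scipy's Ward linkage formally assumes Euclidean distances, so feeding it a correlation-distance matrix, as described above, is an approximation rather than a textbook-correct use.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical averaged ratings for 4 emotions over N examples: the two
# "negative" emotions co-vary, and the two "positive" emotions co-vary.
rng = np.random.default_rng(1)
N = 300
neg, pos = rng.random(N), rng.random(N)
ratings = np.stack([neg + 0.05 * rng.random(N),   # anger
                    neg + 0.05 * rng.random(N),   # annoyance
                    pos + 0.05 * rng.random(N),   # joy
                    pos + 0.05 * rng.random(N)])  # excitement

dist = pdist(ratings, metric='correlation')  # condensed correlation-distance matrix
Z = linkage(dist, method='ward')             # Ward on non-Euclidean distances: approximation
clusters = fcluster(Z, t=2, criterion='maxclust')
print(clusters)  # anger/annoyance vs. joy/excitement split into two clusters
```

Plotting `Z` with `scipy.cluster.hierarchy.dendrogram` yields the kind of tree shown at the top of Figure 2.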

4.3 Principal Preserved Component Analysis

To better understand agreement among raters and the latent structure of the emotion space, we apply Principal Preserved Component Analysis (PPCA) (Cowen et al., 2019b) to our data. PPCA extracts linear combinations of attributes (here, emotion judgments) that maximally covary across two sets of data that measure the same attributes (here, randomly split judgments for each example). Thus, PPCA allows us to uncover latent dimensions of

Algorithm 1: Leave-One-Rater-Out PPCA

1: R ← set of raters
2: E ← set of emotions
3: C ∈ R^(|R|×|E|)
4: for all raters r ∈ 1, ..., |R| do
5:     n ← number of examples annotated by r
6:     J ∈ R^(n×|R|×|E|) ← all ratings for the examples annotated by r
7:     J_-r ∈ R^(n×(|R|-1)×|E|) ← all ratings in J excluding r
8:     J_r ∈ R^(n×|E|) ← all ratings by r
9:     X, Y ∈ R^(n×|E|) ← randomly split J_-r and average ratings across raters for both sets
10:    W ∈ R^(|E|×|E|) ← result of PPCA(X, Y)
11:    for all components† w_i, i ∈ 1, ..., |E| in W do
12:        v_i^r ← projection‡ of J_r onto w_i
13:        v_i^-r ← projection‡ of J_-r onto w_i
14:        C_{r,i} ← correlation between v_i^r and v_i^-r, partialing out v_k^-r for all k ∈ 1, ..., i-1
15:    end for
16: end for
17: C' ← Wilcoxon signed-rank test on C
18: C'' ← Bonferroni correction on C' (α = 0.05)

† in descending order of eigenvalue
‡ we demean vectors before projection

emotion that have high agreement across raters. Unlike Principal Component Analysis (PCA),

PPCA examines the cross-covariance between datasets, rather than the variance-covariance matrix within a single dataset. We obtain the principal preserved components (PPCs) of two datasets (matrices) X, Y ∈ R^(N×|E|), where N is the number of examples and |E| is the number of emotions, by calculating the eigenvectors of the symmetrized cross-covariance matrix X^T Y + Y^T X.
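The core PPCA step, eigendecomposition of the symmetrized cross-covariance, is a few lines of numpy. The sketch below demeans each split (mirroring the projection footnote of Algorithm 1) and uses `eigh`, which is valid because X^T Y + Y^T X is symmetric by construction; the toy data simulates two random splits of the same underlying judgments.

```python
import numpy as np

def ppca(X, Y):
    """Principal preserved components of two datasets X, Y (each N x |E|):
    eigenvectors of the symmetrized cross-covariance X^T Y + Y^T X,
    returned in descending order of eigenvalue."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    M = X.T @ Y + Y.T @ X            # symmetric, so eigh applies
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]   # descending eigenvalue order
    return vals[order], vecs[:, order]

# Toy check: two noisy views of the same 5-dimensional judgment signal.
rng = np.random.default_rng(2)
signal = rng.standard_normal((500, 5))
X = signal + 0.1 * rng.standard_normal((500, 5))
Y = signal + 0.1 * rng.standard_normal((500, 5))
vals, pcs = ppca(X, Y)
print(vals[0] > 0)  # shared structure yields a large positive leading eigenvalue
```

Dimensions that merely reflect rater-specific noise would not covary across the two splits and so receive small (or negative) eigenvalues.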

Extracting significant dimensions. We remove examples labeled as Neutral, and keep those examples that still have at least 3 ratings after this filtering step. We then determine the number of significant dimensions using a leave-one-rater-out analysis, as described by Algorithm 1.

We find that all 27 PPCs are highly significant. Specifically, Bonferroni-corrected p-values are less than 1.5e-6 for all dimensions (corrected α = 0.0017), suggesting that the emotions are highly dissociable. Such a high degree of significance for all dimensions is nontrivial: for example, Cowen et al. (2019b) find that only 12 out of their 30 emotion categories are significantly dissociable.

t-SNE projection. To better understand how the examples are organized in the emotion space, we apply t-SNE, a dimension reduction method that seeks to preserve distances between data points, using the scikit-learn package (Pedregosa et al., 2011). The dataset can be explored in our interactive plot,6 where one can also look at the texts and the annotations. The color of each data point is the weighted average of the RGB values representing those emotions that at least half of the raters selected.
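With scikit-learn, the projection itself is a one-liner over the per-example judgment vectors. A minimal sketch on hypothetical data (the perplexity value here is an illustrative choice, not the paper's setting):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical per-example emotion judgment vectors (N examples x 27 emotions).
rng = np.random.default_rng(3)
judgments = rng.random((100, 27))

# Reduce to 2D for plotting; perplexity must be smaller than the sample count.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(judgments)
print(emb.shape)  # (100, 2)
```

Each row of `emb` gives the 2D coordinates of one example; coloring points by their majority-agreed emotions reproduces the style of the interactive plot.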

4.4 Linguistic Correlates of Emotions

We extract the lexical correlates of each emotion by calculating the log odds ratio, informative Dirichlet prior (Monroe et al., 2008) of all tokens for each emotion category, contrasting each category with all other emotions. Since the log odds are z-scored, all values greater than 3 indicate a highly significant (>3 std) association with the corresponding emotion. We list the top 5 tokens for each category in Table 3. We find that those emotions that are highly significantly associated with certain tokens (e.g., gratitude with "thanks", amusement with "lol") tend to have the highest interrater correlation (see Figure 1). Conversely, emotions that have fewer significantly associated tokens (e.g., grief and nervousness) tend to have low interrater correlation. These results suggest that certain emotions are more verbally implicit and may require more context to be interpreted.
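The z-scored log-odds statistic of Monroe et al. (2008) is straightforward to compute from token counts. Below is a sketch of one common formulation (corpus-wide counts as the Dirichlet prior, delta-method variance approximation); the toy "corpora" are invented for illustration.

```python
import numpy as np
from collections import Counter

def log_odds_dirichlet(counts_i, counts_j, prior):
    """Z-scored log-odds ratio with informative Dirichlet prior
    (Monroe et al., 2008). counts_i / counts_j: token counts for the target
    emotion vs. all other emotions; prior: token counts over the whole corpus."""
    z = {}
    n_i, n_j, a0 = sum(counts_i.values()), sum(counts_j.values()), sum(prior.values())
    for w, a_w in prior.items():
        y_i, y_j = counts_i[w], counts_j[w]
        delta = (np.log((y_i + a_w) / (n_i + a0 - y_i - a_w))
                 - np.log((y_j + a_w) / (n_j + a0 - y_j - a_w)))
        var = 1.0 / (y_i + a_w) + 1.0 / (y_j + a_w)  # variance approximation
        z[w] = delta / np.sqrt(var)
    return z

# Toy corpora: "thanks" is characteristic of the gratitude split.
gratitude = Counter("thanks thanks thank you so much thanks".split())
others = Counter("the movie was fine you know".split())
z = log_odds_dirichlet(gratitude, others, gratitude + others)
print(z["thanks"] > 0 > z["the"])
```

Tokens with z-scores above 3, as in Table 3, are the ones reported as significantly associated with an emotion.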

5 Modeling

We present a strong baseline emotion prediction model for GoEmotions.

5.1 Data Preparation

To minimize the noise in our data, we filter out emotion labels selected by only a single annotator. We keep examples with at least one label after this filtering is performed; this amounts to 93% of the original data. We randomly split this data into train (80%), dev (10%), and test (10%) sets. We only evaluate on the test set once the model is finalized.
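The filter-then-split preparation can be sketched as below. This is an illustrative stand-in for the actual pipeline, on invented data; function names and the vote-dict representation are our own.

```python
import random

def prepare(examples, seed=42):
    """examples: list of (text, {label: num_annotator_votes}).
    Keep only labels chosen by 2+ annotators, drop examples left with no
    label, then split the remainder 80/10/10 into train/dev/test."""
    filtered = []
    for text, votes in examples:
        labels = [lab for lab, n in votes.items() if n >= 2]
        if labels:
            filtered.append((text, labels))
    rng = random.Random(seed)
    rng.shuffle(filtered)
    n = len(filtered)
    train = filtered[: int(0.8 * n)]
    dev = filtered[int(0.8 * n): int(0.9 * n)]
    test = filtered[int(0.9 * n):]
    return train, dev, test

# 8 examples survive the filter; the single-vote example is dropped.
data = [(f"ex{i}", {"joy": 3}) for i in range(8)] + [("odd", {"grief": 1})]
train, dev, test = prepare(data)
print(len(train), len(dev), len(test))  # 6 1 1
```

Fixing the shuffle seed makes the split reproducible across runs.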

Even though we filter our data for the baseline experiments, we see particular value in the 4K examples that lack agreement. This subset of the data likely contains edge/difficult cases for the emotion domain (e.g., emotion-ambiguous text), and presents challenges for further exploration. That is

6 https://nlp.stanford.edu/~ddemszky/goemotions/tsne.html

admiration: great (42), awesome (32), amazing (30), good (28), beautiful (23)
amusement: lol (66), haha (32), funny (27), lmao (21), hilarious (18)
approval: agree (24), not (13), don't (12), yes (12), agreed (11)
caring: you (12), worry (11), careful (9), stay (9), your (8)
anger: fuck (24), hate (18), fucking (18), angry (11), dare (10)
annoyance: annoying (14), stupid (13), fucking (12), shit (10), dumb (9)
disappointment: disappointing (11), disappointed (10), bad (9), disappointment (7), unfortunately (7)
disapproval: not (16), don't (14), disagree (9), nope (8), doesn't (7)
confusion: confused (18), why (11), sure (10), what (10), understand (8)
desire: wish (29), want (8), wanted (6), could (6), ambitious (4)
excitement: excited (21), happy (8), cake (8), wow (8), interesting (7)
gratitude: thanks (75), thank (69), for (24), you (18), sharing (17)
joy: happy (32), glad (27), enjoy (20), enjoyed (12), fun (12)
disgust: disgusting (22), awful (14), worst (13), worse (12), weird (9)
embarrassment: embarrassing (12), shame (11), awkward (10), embarrassment (8), embarrassed (7)
fear: scared (16), afraid (16), scary (15), terrible (12), terrifying (11)
grief: died (6), rip (4)
curiosity: curious (22), what (18), why (13), how (11), did (10)
love: love (76), loved (21), favorite (13), loves (12), like (9)
optimism: hope (45), hopefully (19), luck (18), hoping (16), will (8)
pride: proud (14), pride (4), accomplishment (4)
relief: glad (5), relieved (4), relieving (4), relief (4)
nervousness: nervous (8), worried (8), anxiety (6), anxious (4), worrying (4)
remorse: sorry (39), regret (9), apologies (7), apologize (6), guilt (5)
sadness: sad (31), sadly (16), sorry (15), painful (10), crying (9)
realization: realize (14), realized (12), realised (7), realization (6), thought (6)
surprise: wow (23), surprised (21), wonder (15), shocked (12), omg (11)

Table 3: Top 5 words associated with each emotion (positive, negative, ambiguous). The rounded z-scored log odds ratios in the parentheses, with the threshold set at 3, indicate significance of association.

why we release all 58K examples with all annotators' ratings.

Grouping emotions. We create a hierarchical grouping of our taxonomy, and evaluate model performance on each level of the hierarchy. A sentiment level divides the labels into 4 categories (positive, negative, ambiguous, and Neutral), with the Neutral category intact and the rest of the mapping as shown in Figure 2. The Ekman level further divides the taxonomy, using the Neutral label and the following 6 groups: anger (maps to: anger, annoyance, disapproval), disgust (maps to: disgust), fear (maps to: fear, nervousness), joy (all positive emotions), sadness (maps to: sadness, disappointment, embarrassment, grief, remorse), and surprise (all ambiguous emotions).
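The Ekman-level grouping described above can be written down directly as a lookup table; the "joy" and "surprise" memberships expand the positive and ambiguous sentiment groups of Figure 2 (variable and function names are our own).

```python
# GoEmotions label -> Ekman-level group, per the mapping in the text:
# joy absorbs all positive emotions, surprise all ambiguous ones.
EKMAN_MAPPING = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "joy": ["admiration", "amusement", "approval", "caring", "desire",
            "excitement", "gratitude", "joy", "love", "optimism", "pride",
            "relief"],
    "sadness": ["sadness", "disappointment", "embarrassment", "grief",
                "remorse"],
    "surprise": ["surprise", "realization", "confusion", "curiosity"],
}

def to_ekman(label):
    """Collapse a GoEmotions label to its Ekman-level group (Neutral intact)."""
    if label == "neutral":
        return "neutral"
    return next(g for g, members in EKMAN_MAPPING.items() if label in members)

print(to_ekman("annoyance"), to_ekman("grief"))  # anger sadness
```

The six groups partition all 27 emotion categories, so every non-Neutral label maps to exactly one group.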

5.2 Model Architecture

We use the BERT-base model (Devlin et al., 2019) for our experiments. We add a dense output layer on top of the pretrained model for the purposes of finetuning, with a sigmoid cross-entropy loss function to support multi-label classification. As an additional baseline, we train a bidirectional LSTM.
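The reason sigmoid cross-entropy supports multi-label output is that each emotion gets its own independent binary classifier, rather than competing in a softmax. A numpy sketch of the loss alone (not the actual BERT fine-tuning code):

```python
import numpy as np

def sigmoid_cross_entropy(logits, targets):
    """Per-label sigmoid cross entropy: each of the |E| output units is an
    independent binary classifier, so one example can carry several emotion
    labels at once. logits, targets: arrays of shape (batch, |E|)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

# One example, 4 labels, gold set = {0, 2}.
logits = np.array([[4.0, -4.0, 4.0, -4.0]])
targets = np.array([[1.0, 0.0, 1.0, 0.0]])
good = sigmoid_cross_entropy(logits, targets)   # confident and correct: low loss
bad = sigmoid_cross_entropy(-logits, targets)   # confident and wrong: high loss
print(good < bad)  # True
```

At prediction time, each label whose sigmoid probability exceeds a threshold is emitted, which is what allows examples like those in Table 1 to carry two labels.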

5.3 Parameter Settings

When finetuning the pretrained BERT model, we keep most of the hyperparameters set by Devlin et al. (2019) intact, and only change the batch size and learning rate. We find that training for at least 4 epochs is necessary for learning the data, but training for more epochs results in overfitting. We also find that a small batch size of 16 and a learning rate of 5e-5 yield the best performance.

For the biLSTM, we set the hidden layer dimensionality to 256 and the learning rate to 0.1, with a decay rate of 0.95. We apply a dropout of 0.7.

5.4 Results

Table 4 summarizes the performance of our best model, BERT, on the test set: it achieves an average F1-score of .46 (std = .19). The model obtains the best performance on emotions with overt lexical markers, such as gratitude (.86), amusement (.80), and love (.78). The model obtains the lowest F1-scores on grief (0), relief (.15), and realization (.21), which are the lowest-frequency emotions. We find that less frequent emotions tend to be confused by the model with more frequent emotions that are related in sentiment and intensity (e.g., grief with sadness, pride with admiration, nervousness with fear); see Appendix G for a more detailed analysis.

Table 5 and Table 6 show results for a sentiment-grouped model (F1-score = .69) and an Ekman-grouped model (F1-score = .64), respectively. The significant performance increase in the transition from the full to the Ekman-level taxonomy indicates that this grouping mitigates confusion among within-group lower-level categories.

The biLSTM model performs significantly worse than BERT, obtaining an average F1-score of .41 for the full taxonomy, .54 for an Ekman-grouped model, and .6 for a sentiment-grouped model.

6 Transfer Learning Experiments

We conduct transfer learning experiments on existing emotion benchmarks, in order to show that our data generalizes across domains and taxonomies. The goal is to demonstrate that, given little labeled data in a target domain, one can utilize GoEmotions as baseline emotion understanding data.

[Figure 3: Three panels, ISEAR (self-reported experiences), Emotion-Stimulus (FrameNet-based sentences), and EmoInt (tweets), plotting F1-score against training set size (100, 200, 500, 1000, max) for the Baseline, Transfer Learning w/ Freezing Layers, and Transfer Learning w/o Freezing Layers setups.]

Figure 3: Transfer learning results in terms of average F1-scores across emotion categories. The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data.

Emotion Precision Recall F1

admiration 0.53 0.83 0.65
amusement 0.70 0.94 0.80
anger 0.36 0.66 0.47
annoyance 0.24 0.63 0.34
approval 0.26 0.57 0.36
caring 0.30 0.56 0.39
confusion 0.24 0.76 0.37
curiosity 0.40 0.84 0.54
desire 0.43 0.59 0.49
disappointment 0.19 0.52 0.28
disapproval 0.29 0.61 0.39
disgust 0.34 0.66 0.45
embarrassment 0.39 0.49 0.43
excitement 0.26 0.52 0.34
fear 0.46 0.85 0.60
gratitude 0.79 0.95 0.86
grief 0.00 0.00 0.00
joy 0.39 0.73 0.51
love 0.68 0.92 0.78
nervousness 0.28 0.48 0.35
neutral 0.56 0.84 0.68
optimism 0.41 0.69 0.51
pride 0.67 0.25 0.36
realization 0.16 0.29 0.21
relief 0.50 0.09 0.15
remorse 0.53 0.88 0.66
sadness 0.38 0.71 0.49
surprise 0.40 0.66 0.50
macro-average 0.40 0.63 0.46
std 0.18 0.24 0.19

Table 4: Results based on the GoEmotions taxonomy.

61 Emotion Benchmark Datasets

We consider the nine benchmark datasets from Bostan and Klinger (2018)'s Unified Dataset, which vary in terms of their size, domain, qual-

Sentiment Precision Recall F1

ambiguous 0.54 0.66 0.60
negative 0.65 0.76 0.70
neutral 0.64 0.69 0.67
positive 0.78 0.87 0.82
macro-average 0.65 0.74 0.69
std 0.09 0.10 0.09

Table 5: Results based on sentiment-grouped data.

Ekman Emotion Precision Recall F1

anger 0.50 0.65 0.57
disgust 0.52 0.53 0.53
fear 0.61 0.76 0.68
joy 0.77 0.88 0.82
neutral 0.66 0.67 0.66
sadness 0.56 0.62 0.59
macro-average 0.60 0.68 0.64
std 0.10 0.12 0.11

Table 6: Results using Ekman's taxonomy.

ity, and taxonomy. In the interest of space, we only discuss three of these datasets here, chosen based on their diversity of domains. In our experiments, we observe similar trends for the additional benchmarks; all are included in Appendix H.

The International Survey on Emotion Antecedents and Reactions (ISEAR) (Scherer and Wallbott, 1994) is a collection of personal reports on emotional events, written by 3,000 people from different cultural backgrounds. The dataset contains 8K sentences, each labeled with a single emotion. The categories are anger, disgust, fear, guilt, joy, sadness, and shame.

EmoInt (Mohammad et al., 2018) is part of the SemEval 2018 benchmark, and it contains crowdsourced annotations for 7K tweets. The labels are intensity annotations for anger, joy, sadness, and fear. We obtain binary annotations for these emotions by using .5 as the cutoff.

Emotion-Stimulus (Ghazi et al., 2015) contains annotations for 2.4K sentences, generated based on FrameNet's emotion-directed frames. Their taxonomy is anger, disgust, fear, joy, sadness, shame, and surprise.

6.2 Experimental Setup

Training set size. We experiment with varying amounts of training data from the target domain dataset: 100, 200, 500, 1000, and 80% ("max") of dataset examples. We generate 10 random splits for each train set size, with the remaining examples held out as a test set.

We report the results of the finetuning experiments detailed below for each data size, with confidence intervals based on repeated experiments using the splits.

Finetuning. We compare three different finetuning setups. In the BASELINE setup, we finetune BERT only on the target dataset. In the FREEZE

setup, we first finetune BERT on GoEmotions, then perform transfer learning by replacing the final dense layer, freezing all layers besides the last layer, and finetuning on the target dataset. The NOFREEZE setup is the same as FREEZE, except that we do not freeze the bottom layers. We hold the batch size at 16, the learning rate at 2e-5, and the number of epochs at 3 for all experiments.

6.3 Results

The results in Figure 3 suggest that our dataset generalizes well to different domains and taxonomies, and that a model trained on GoEmotions can help in cases where there is limited data from the target domain, or limited resources for labeling.

Given limited target domain data (100 or 200 examples), both FREEZE and NOFREEZE yield significantly higher performance than the BASELINE for all three datasets. Importantly, NOFREEZE shows significantly higher performance for all training set sizes, except for "max", where NOFREEZE and BASELINE perform similarly.

7 Conclusion

We present GoEmotions, a large, manually annotated, carefully curated dataset for fine-grained emotion prediction. We provide a detailed data analysis, demonstrating the reliability of the annotations for the full taxonomy. We show the generalizability of the data across domains and taxonomies via transfer learning experiments. We build a strong baseline by fine-tuning a BERT model; however, the results suggest much room for future improvement. Future work can explore the cross-cultural robustness of emotion ratings, and extend the taxonomy to other languages and domains.

Data Disclaimer: We are aware that the dataset contains biases and is not representative of global diversity, and that it contains potentially problematic content. Potential biases in the data include: inherent biases in Reddit and its user base; the offensive/vulgar word lists used for data filtering; inherent or unconscious bias in the assessment of offensive identity labels; and the fact that annotators were all native English speakers from India. All of these likely affect labeling, precision, and recall for a trained model. The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team. Anyone using this dataset should be aware of these limitations.

Acknowledgments

We thank the three anonymous reviewers for their constructive feedback. We would also like to thank the annotators for their hard work.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718–728.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579–586, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104–2119.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664–1674.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019a. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90.

Alan S. Cowen, Hillary Anger Elfenbein, Petri Laukka, and Dacher Keltner. 2018. Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6):698–712.

Alan S. Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. In press. What music makes us feel: At least thirteen dimensions organize subjective experiences associated with music across cultures. Proceedings of the National Academy of Sciences.

Alan S. Cowen and Dacher Keltner. 2017. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900–E7909.

Alan S. Cowen and Dacher Keltner. 2019. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist.

Alan S. Cowen, Petri Laukka, Hillary Anger Elfenbein, Runjing Liu, and Dacher Keltner. 2019b. The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4):369.

CrowdFlower. 2016. https://www.figure-eight.com/data/sentiment-analysis-emotion-text

Rosario Delgado and Xavier-Andoni Tibau. 2019. Why Cohen's kappa should be avoided as performance measure in classification. PLoS ONE, 14(9).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Maeve Duggan and Aaron Smith. 2013. 6% of online adults are Reddit users. Pew Internet & American Life Project, 3:1–10.

Paul Ekman. 1992a. Are there basic emotions? Psychological Review, 99(3):550–553.

Paul Ekman. 1992b. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting emotion stimuli in emotion-bearing sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 152–165. Springer.

Chao-Chun Hsu and Lun-Wei Ku. 2018. SocialNLP 2018 EmotionX challenge overview: Recognizing emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 27–31, Melbourne, Australia. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.

Chen Liu, Muhammad Osama, and Anderson De Andrade. 2019. DENS: A dataset for multi-class emotion analysis. arXiv preprint arXiv:1910.11769.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 task 1: Affect in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 1–17, New Orleans, Louisiana. Association for Computational Linguistics.

Saif M. Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 246–255. Association for Computational Linguistics.

Saif M. Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. 2015. Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480–499.

Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of Reddit communities. In Canadian Conference on Artificial Intelligence, pages 51–56. Springer.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Emily Öhman, Kaisla Kajava, Jörg Tiedemann, and Timo Honkela. 2018. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 24–30.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Rosalind W. Picard. 1997. Affective Computing. MIT Press.

Inna Pirina and Çağrı Çöltekin. 2018. Identifying depression on Reddit: The effect of training data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, pages 9–12.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In Theories of Emotion, pages 3–33. Elsevier.

James A. Russell. 2003. Core affect and the psychological construction of emotion. Psychological Review, 110(1):145.

Klaus R. Scherer and Harald G. Wallbott. 1994. Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66(2):310.

Hendrik Schuff, Jeremy Barnes, Julian Mohme, Sebastian Padó, and Roman Klinger. 2017. Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 13–23.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, Prague, Czech Republic. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Association for Computational Linguistics.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In EMNLP 2019.

Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P. Sheth. 2012. Harnessing Twitter "big data" for automatic emotion identification. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pages 587–592. IEEE.

P. Yanardag, M. Cebrian, and I. Rahwan. 2018. Norman: World's first psychopath AI.

A Emotion Definitions

admiration: Finding something impressive or worthy of respect.
amusement: Finding something funny or being entertained.
anger: A strong feeling of displeasure or antagonism.
annoyance: Mild anger, irritation.
approval: Having or expressing a favorable opinion.
caring: Displaying kindness and concern for others.
confusion: Lack of understanding, uncertainty.
curiosity: A strong desire to know or learn something.
desire: A strong feeling of wanting something or wishing for something to happen.
disappointment: Sadness or displeasure caused by the nonfulfillment of one's hopes or expectations.
disapproval: Having or expressing an unfavorable opinion.
disgust: Revulsion or strong disapproval aroused by something unpleasant or offensive.
embarrassment: Self-consciousness, shame, or awkwardness.
excitement: Feeling of great enthusiasm and eagerness.
fear: Being afraid or worried.
gratitude: A feeling of thankfulness and appreciation.
grief: Intense sorrow, especially caused by someone's death.
joy: A feeling of pleasure and happiness.
love: A strong positive emotion of regard and affection.
nervousness: Apprehension, worry, anxiety.
optimism: Hopefulness and confidence about the future or the success of something.
pride: Pleasure or satisfaction due to one's own achievements or the achievements of those with whom one is closely associated.
realization: Becoming aware of something.
relief: Reassurance and relaxation following release from anxiety or distress.
remorse: Regret or guilty feeling.
sadness: Emotional pain, sorrow.
surprise: Feeling astonished, startled by something unexpected.

B Taxonomy Selection & Data Collection

We selected our taxonomy through a careful multi-round process. In the first pilot round of data collection, we used emotions that were identified to be salient by Cowen and Keltner (2017), making sure that our set includes Ekman's emotion categories, as used in previous NLP work. In this round, we also included an open input box where annotators could suggest emotion(s) that were not among the options. We annotated 3K examples in the first round, and updated the taxonomy based on the results of this round (see details below). In the second pilot round of data collection, we repeated this process with 2K new examples, once again updating the taxonomy.

While reviewing the results from the pilot rounds, we identified and removed emotions that were scarcely selected by annotators and/or had low interrater agreement, due to being very similar to other emotions or too difficult to detect from text. These emotions were boredom, doubt, heartbroken, indifference, and calmness. We also identified and added to our taxonomy those emotions that were frequently suggested by raters and/or seemed to be represented in the data upon manual inspection. These emotions were desire, disappointment, pride, realization, relief, and remorse. In this process, we also refined the category names (e.g., replacing ecstasy with excitement) to ones that seemed interpretable to annotators. This is how we arrived at the final set of 27 emotions + Neutral. Our high interrater agreement in the final data can be partially explained by the fact that we took interpretability into consideration while constructing the taxonomy. The dataset we are releasing was labeled in the third round, over the final taxonomy.

C Cohen's Kappa Values

In Section 4.1, we measure agreement between raters via Spearman correlation, following considerations by Delgado and Tibau (2019). In Table 7, we report the Cohen's kappa values for comparison, which we obtain by randomly sampling two ratings for each example and calculating the Cohen's kappa between these two sets of ratings. We find that all Cohen's kappa values are greater than 0, showing rater agreement. Moreover, the Cohen's kappa values correlate highly with the interrater correlation values (Pearson r = 0.85, p < 0.001), providing corroborative evidence for the significant degree of interrater agreement for each emotion.

Emotion | Interrater Correlation | Cohen's kappa
admiration 0.535 0.468
amusement 0.482 0.474
anger 0.207 0.307
annoyance 0.193 0.192
approval 0.385 0.187
caring 0.237 0.252
confusion 0.217 0.270
curiosity 0.418 0.366
desire 0.177 0.251
disappointment 0.186 0.184
disapproval 0.274 0.234
disgust 0.192 0.241
embarrassment 0.177 0.218
excitement 0.193 0.222
fear 0.266 0.394
gratitude 0.645 0.749
grief 0.162 0.095
joy 0.296 0.301
love 0.446 0.555
nervousness 0.164 0.144
optimism 0.322 0.300
pride 0.163 0.148
realization 0.194 0.155
relief 0.172 0.185
remorse 0.178 0.358
sadness 0.346 0.336
surprise 0.275 0.331

Table 7: Interrater agreement, as measured by interrater correlation and Cohen's kappa.

D Sentiment of Reddit Subreddits

In Section 3, we describe how we obtain subreddits that are balanced in terms of sentiment. Here, we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28%, STD=11%), positive (M=41%, STD=11%), negative (M=19%, STD=7%), ambiguous (M=35%, STD=8%). After filtering, the distribution of sentiments across our remaining subreddits became: neutral (M=24%, STD=5%), positive (M=35%, STD=6%), negative (M=27%, STD=4%), ambiguous (M=33%, STD=4%).

E BERT's Most Activated Layers

To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with center of gravity = 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who have also found that tasks involving high-level semantics tend to make use of all BERT layers.
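The center-of-gravity statistic is the expected layer index under the softmax-normalized mixing weights; a sketch under that reading (numpy only; the weight values are hypothetical):

```python
import numpy as np

def center_of_gravity(mixing_weights):
    """Expected layer index under softmax-normalized scalar mixing
    weights (Tenney et al., 2019). For 13 layers (embeddings + 12
    encoder layers), a uniform distribution gives 6.0; a value near
    the middle indicates that all layers contribute similarly."""
    w = np.exp(mixing_weights - np.max(mixing_weights))  # stable softmax
    w /= w.sum()
    return float(np.dot(np.arange(len(w)), w))
```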

Figure 4: Softmax weights of each BERT layer when trained on our dataset.

F Number of Emotion Labels Per Example

Figure 5 shows the number of emotion labels per example before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.

Figure 5: Number of emotion labels per example, before and after filtering the labels chosen by only a single annotator.

G Confusion Matrix

Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix similarly to how we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M where Mij denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the count for Mjoy,joy, Mjoy,pride, Madmiration,joy and Madmiration,pride. In practice, since most of our examples only have a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.

Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g., grief and sadness, pride and admiration, nervousness and fear).
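A minimal sketch of this counting and row-normalization procedure (numpy; the function name and input format are illustrative, not the authors' code):

```python
import numpy as np

def multilabel_confusion(true_labels, pred_labels, classes):
    """Co-occurrence-style confusion matrix for multilabel predictions:
    for each example, increment M[i, j] for every (true label i,
    predicted label j) pair, then row-normalize so each nonzero row
    sums to 1."""
    idx = {c: k for k, c in enumerate(classes)}
    M = np.zeros((len(classes), len(classes)))
    for true, pred in zip(true_labels, pred_labels):
        for t in true:
            for p in pred:
                M[idx[t], idx[p]] += 1
    row_sums = M.sum(axis=1, keepdims=True)
    return M / np.where(row_sums == 0, 1, row_sums)  # avoid 0/0 rows
```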

We also perform hierarchical clustering over the normalized confusion matrix, using correlation as a distance metric and Ward as a linkage method. We find that the model learns clusters relatively similar to the ones in Figure 2, even though the training data only includes the subset of labels that have agreement (see Figure 5).

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets that are downloadable and have categorical emotions in the Unified Dataset (Bostan and Klinger, 2018). These datasets are: DailyDialog (Li et al., 2017), Emotion-Stimulus (Ghazi et al., 2015), Affective Text (Strapparava and Mihalcea, 2007), CrowdFlower (CrowdFlower, 2016), Electoral Tweets (Mohammad et al., 2015), ISEAR (Scherer and Wallbott, 1994), the Twitter Emotion Corpus (TEC) (Mohammad, 2012), EmoInt (Mohammad et al., 2018), and the Stance Sentiment Emotion Corpus (SSEC) (Schuff et al., 2017).

We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ~4k labeled examples with a large taxonomy of 36 emotions, FREEZE gives a significant boost of performance over the BASELINE and NOFREEZE for all training set sizes besides "max".

For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple of hundred training examples. For 500-1000 training examples, NOFREEZE tends to outperform the BASELINE, but we can see that these two setups come closer when there is more training data available. These results suggest that our dataset helps if there is limited data from the target domain.

Figure 6: A normalized confusion matrix for our model predictions. The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment.

Figure 7: Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018). Panels: DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), ElectoralTweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets), and SSEC (tweets). Each panel plots F-score against training set size (100, 200, 500, 1000, max) for the Baseline, Transfer Learning with Freezing Layers, and Transfer Learning without Freezing Layers setups.


Figure 1: Our emotion categories, ordered by the number of examples where at least one rater uses a particular label. The color indicates the interrater correlation.

of other raters' judgments for all examples that r rated. We then take the average of these rater-level correlation scores. In Section 4.3 we show that each emotion has significant interrater correlation after controlling for several potential confounds.

Figure 1 shows that gratitude, admiration, and amusement have the highest, and grief and nervousness have the lowest, interrater correlation. Emotion frequency correlates with interrater agreement, but the two are not equivalent. Infrequent emotions can have relatively high interrater correlation (e.g., fear), and frequent emotions can have relatively low interrater correlation (e.g., annoyance).

4.2 Correlation Among Emotions

To better understand the relationship between emotions in our data, we look at their correlations. Let N be the number of examples in our dataset. We obtain N-dimensional vectors for each emotion by averaging raters' judgments for all examples labeled with that emotion. We calculate Pearson correlation values between each pair of emotions. The heatmap in Figure 2 shows that emotions that are related in intensity (e.g., annoyance and anger, joy and excitement, nervousness and fear) have a strong positive correlation. On the other hand, emotions that have the opposite sentiment are negatively correlated.
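Concretely, with the per-emotion averaged judgment vectors stacked into a matrix, the pairwise Pearson correlations reduce to a single call (a sketch; the variable names are illustrative):

```python
import numpy as np

def emotion_correlations(avg_ratings):
    """avg_ratings: (num_emotions, N) array, where row e holds the
    rater-averaged judgment for emotion e on each of the N examples.
    Returns the emotion-by-emotion Pearson correlation matrix used
    for the heatmap."""
    return np.corrcoef(avg_ratings)
```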

Figure 2: The heatmap shows the correlation between ratings for each emotion. The dendrogram represents a hierarchical clustering of the ratings. The sentiment labeling was done a priori, and it shows that the clusters closely map onto sentiment groups.

We also perform hierarchical clustering to uncover the nested structure of our taxonomy. We use correlation as a distance metric and Ward as a linkage method, applied to the averaged ratings. The dendrogram at the top of Figure 2 shows that emotions that are related by intensity are neighbors, and that larger clusters map closely onto sentiment categories. Interestingly, emotions that we labeled as "ambiguous" in terms of sentiment (e.g., surprise) are closer to the positive than to the negative category. This suggests that, in our data, ambiguous emotions are more likely to occur in the context of positive sentiment than that of negative sentiment.
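The clustering step can be sketched with SciPy: compute condensed correlation distances first, then apply Ward linkage to them (an assumption about the exact pipeline; the authors' code may differ):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_emotions(avg_ratings, n_clusters):
    """avg_ratings: (num_emotions, N) array of averaged ratings.
    Correlation as the distance metric, Ward as the linkage method;
    returns a flat cluster assignment with n_clusters clusters."""
    dist = pdist(avg_ratings, metric="correlation")  # condensed distances
    Z = linkage(dist, method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```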

4.3 Principal Preserved Component Analysis

To better understand agreement among raters and the latent structure of the emotion space, we apply Principal Preserved Component Analysis (PPCA) (Cowen et al., 2019b) to our data. PPCA extracts linear combinations of attributes (here, emotion judgments) that maximally covary across two sets of data that measure the same attributes (here, randomly split judgments for each example). Thus, PPCA allows us to uncover latent dimensions of

Algorithm 1: Leave-One-Rater-Out PPCA
1:  R ← set of raters
2:  E ← set of emotions
3:  C ∈ R^(|R|×|E|)
4:  for all raters r ∈ 1, ..., |R| do
5:    n ← number of examples annotated by r
6:    J ∈ R^(n×|R|×|E|) ← all ratings for the examples annotated by r
7:    J−r ∈ R^(n×(|R|−1)×|E|) ← all ratings in J, excluding r
8:    Jr ∈ R^(n×|E|) ← all ratings by r
9:    X, Y ∈ R^(n×|E|) ← randomly split J−r and average ratings across raters for both sets
10:   W ∈ R^(|E|×|E|) ← result of PPCA(X, Y)
11:   for all components† w_i, i ∈ 1, ..., |E| in W do
12:     v_i^r ← projection‡ of Jr onto w_i
13:     v_i^−r ← projection‡ of J−r onto w_i
14:     C_(r,i) ← correlation between v_i^r and v_i^−r, partialing out v_k^−r for all k ∈ 1, ..., i−1
15:   end for
16: end for
17: C′ ← Wilcoxon signed-rank test on C
18: C″ ← Bonferroni correction on C′ (α = 0.05)
† in descending order of eigenvalue
‡ we demean vectors before projection

emotion that have high agreement across raters.

Unlike Principal Component Analysis (PCA), PPCA examines the cross-covariance between datasets, rather than the variance-covariance matrix within a single dataset. We obtain the principal preserved components (PPCs) of two datasets (matrices) X, Y ∈ R^(N×|E|), where N is the number of examples and |E| is the number of emotions, by calculating the eigenvectors of the symmetrized cross-covariance matrix X^T Y + Y^T X.
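Under this definition, extracting PPCs reduces to an eigendecomposition of the symmetrized cross-covariance (a numpy sketch; demeaning the inputs is an assumption carried over from the projection step in Algorithm 1):

```python
import numpy as np

def ppca(X, Y):
    """Principal preserved components of two (N x |E|) judgment
    matrices: eigenvectors of X^T Y + Y^T X, returned in descending
    order of eigenvalue."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    C = X.T @ Y + Y.T @ X            # symmetric by construction
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]
```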

Extracting significant dimensions. We remove examples labeled as Neutral and keep those examples that still have at least 3 ratings after this filtering step. We then determine the number of significant dimensions using a leave-one-rater-out analysis, as described by Algorithm 1.

We find that all 27 PPCs are highly significant. Specifically, Bonferroni-corrected p-values are less than 1.5e-6 for all dimensions (corrected α = 0.0017), suggesting that the emotions were highly dissociable. Such a high degree of significance for all dimensions is nontrivial: for example, Cowen et al. (2019b) find that only 12 out of their 30 emotion categories are significantly dissociable.

t-SNE projection. To better understand how the examples are organized in the emotion space, we apply t-SNE, a dimension reduction method that seeks to preserve distances between data points, using the scikit-learn package (Pedregosa et al., 2011). The dataset can be explored in our interactive plot,6 where one can also look at the texts and the annotations. The color of each data point is the weighted average of the RGB values representing those emotions that at least half of the raters selected.

4.4 Linguistic Correlates of Emotions

We extract the lexical correlates of each emotion by calculating the log odds ratio, informative Dirichlet prior (Monroe et al., 2008) of all tokens for each emotion category, contrasting to all other emotions. Since the log odds are z-scored, all values greater than 3 indicate highly significant (>3 std) association with the corresponding emotion. We list the top 5 tokens for each category in Table 3. We find that those emotions that are highly significantly associated with certain tokens (e.g., gratitude with "thanks", amusement with "lol") tend to have the highest interrater correlation (see Figure 1). Conversely, emotions that have fewer significantly associated tokens (e.g., grief and nervousness) tend to have low interrater correlation. These results suggest certain emotions are more verbally implicit and may require more context to be interpreted.
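The z-scored log odds with an informative Dirichlet prior can be sketched as follows (after Monroe et al., 2008; the token counts in the usage example are toy numbers, not from the dataset):

```python
import math
from collections import Counter

def log_odds_z(counts_a, counts_b, prior):
    """Z-scored log odds ratio with an informative Dirichlet prior.
    counts_a: token counts for one emotion category; counts_b: counts
    for all other emotions; prior: background token counts acting as
    pseudo-counts. Returns a token -> z-score dict."""
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    a0 = sum(prior.values())
    z = {}
    for w in prior:
        ya, yb, aw = counts_a[w], counts_b[w], prior[w]
        delta = (math.log((ya + aw) / (n_a + a0 - ya - aw))
                 - math.log((yb + aw) / (n_b + a0 - yb - aw)))
        var = 1.0 / (ya + aw) + 1.0 / (yb + aw)   # approximate variance
        z[w] = delta / math.sqrt(var)
    return z
```

With toy counts where "thanks" is heavily overrepresented in one category, its z-score exceeds the significance threshold of 3, while a common function word does not.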

5 Modeling

We present a strong baseline emotion prediction model for GoEmotions.

5.1 Data Preparation

To minimize the noise in our data, we filter out emotion labels selected by only a single annotator. We keep examples with at least one label after this filtering is performed; this amounts to 93% of the original data. We randomly split this data into train (80%), dev (10%), and test (10%) sets. We only evaluate on the test set once the model is finalized.
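A sketch of this preparation step (the example format, the two-rater vote threshold, and the function name are assumptions for illustration only):

```python
import random

def filter_and_split(examples, seed=0):
    """examples: list of (text, label -> vote-count dict). Drop labels
    chosen by only a single annotator, keep examples that retain at
    least one label, then split 80/10/10 into train/dev/test."""
    kept = []
    for text, votes in examples:
        labels = [l for l, v in votes.items() if v >= 2]
        if labels:
            kept.append((text, labels))
    rng = random.Random(seed)
    rng.shuffle(kept)
    n = len(kept)
    train = kept[: int(0.8 * n)]
    dev = kept[int(0.8 * n): int(0.9 * n)]
    test = kept[int(0.9 * n):]
    return train, dev, test
```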

Even though we filter our data for the baseline experiments, we see particular value in the 4K examples that lack agreement. This subset of the data likely contains edge or difficult examples for the emotion domain (e.g., emotion-ambiguous text) and presents challenges for further exploration. That is

6 https://nlp.stanford.edu/~ddemszky/goemotions/tsne.html

admiration: great (42), awesome (32), amazing (30), good (28), beautiful (23)
amusement: lol (66), haha (32), funny (27), lmao (21), hilarious (18)
approval: agree (24), not (13), don't (12), yes (12), agreed (11)
caring: you (12), worry (11), careful (9), stay (9), your (8)
anger: fuck (24), hate (18), fucking (18), angry (11), dare (10)
annoyance: annoying (14), stupid (13), fucking (12), shit (10), dumb (9)
disappointment: disappointing (11), disappointed (10), bad (9), disappointment (7), unfortunately (7)
disapproval: not (16), don't (14), disagree (9), nope (8), doesn't (7)
confusion: confused (18), why (11), sure (10), what (10), understand (8)
desire: wish (29), want (8), wanted (6), could (6), ambitious (4)
excitement: excited (21), happy (8), cake (8), wow (8), interesting (7)
gratitude: thanks (75), thank (69), for (24), you (18), sharing (17)
joy: happy (32), glad (27), enjoy (20), enjoyed (12), fun (12)
disgust: disgusting (22), awful (14), worst (13), worse (12), weird (9)
embarrassment: embarrassing (12), shame (11), awkward (10), embarrassment (8), embarrassed (7)
fear: scared (16), afraid (16), scary (15), terrible (12), terrifying (11)
grief: died (6), rip (4)
curiosity: curious (22), what (18), why (13), how (11), did (10)
love: love (76), loved (21), favorite (13), loves (12), like (9)
optimism: hope (45), hopefully (19), luck (18), hoping (16), will (8)
pride: proud (14), pride (4), accomplishment (4)
relief: glad (5), relieved (4), relieving (4), relief (4)
nervousness: nervous (8), worried (8), anxiety (6), anxious (4), worrying (4)
remorse: sorry (39), regret (9), apologies (7), apologize (6), guilt (5)
sadness: sad (31), sadly (16), sorry (15), painful (10), crying (9)
realization: realize (14), realized (12), realised (7), realization (6), thought (6)
surprise: wow (23), surprised (21), wonder (15), shocked (12), omg (11)

Table 3: Top 5 words associated with each emotion (grouped in the original by sentiment color coding: positive, negative, ambiguous). The rounded z-scored log odds ratios in parentheses, with the threshold set at 3, indicate significance of association.

why we release all 58K examples with all annotators' ratings.

Grouping emotions. We create a hierarchical grouping of our taxonomy, and evaluate the model performance on each level of the hierarchy. A sentiment level divides the labels into 4 categories (positive, negative, ambiguous and Neutral), with the Neutral category intact and the rest of the mapping as shown in Figure 2. The Ekman level further divides the taxonomy, using the Neutral label and the following 6 groups: anger (maps to: anger, annoyance, disapproval), disgust (maps to: disgust), fear (maps to: fear, nervousness), joy (all positive emotions), sadness (maps to: sadness, disappointment, embarrassment, grief, remorse) and surprise (all ambiguous emotions).
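The Ekman-level mapping reads as the following table (the membership of the "all positive emotions" and "all ambiguous emotions" groups is reconstructed from the sentiment grouping shown in Figure 2):

```python
# Ekman-level grouping as described above; Neutral is kept intact.
EKMAN_MAP = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "joy": ["joy", "amusement", "approval", "excitement", "gratitude",
            "love", "optimism", "relief", "pride", "admiration",
            "desire", "caring"],
    "sadness": ["sadness", "disappointment", "embarrassment",
                "grief", "remorse"],
    "surprise": ["surprise", "realization", "confusion", "curiosity"],
}

def to_ekman(labels):
    """Map fine-grained labels to Ekman-level groups (Neutral passes
    through unchanged)."""
    inv = {e: g for g, members in EKMAN_MAP.items() for e in members}
    return sorted({inv.get(l, "neutral") for l in labels})
```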

5.2 Model Architecture

We use the BERT-base model (Devlin et al., 2019) for our experiments. We add a dense output layer on top of the pretrained model for the purposes of finetuning, with a sigmoid cross entropy loss function to support multi-label classification. As an additional baseline, we train a bidirectional LSTM.
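The per-label sigmoid cross entropy can be written in its numerically stable form (a framework-agnostic numpy sketch of the loss alone, not the training code):

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Mean per-label sigmoid cross entropy for multi-label
    classification. Stable form: max(x, 0) - x*z + log(1 + exp(-|x|)),
    where x is a logit and z the binary target."""
    x, z = np.asarray(logits), np.asarray(labels)
    return float(np.mean(np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))))
```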

5.3 Parameter Settings

When finetuning the pre-trained BERT model, we keep most of the hyperparameters set by Devlin et al. (2019) intact, and only change the batch size and learning rate. We find that training for at least 4 epochs is necessary for learning the data, but training for more epochs results in overfitting. We also find that a small batch size of 16 and learning rate of 5e-5 yields the best performance.

For the biLSTM, we set the hidden layer dimensionality to 256 and the learning rate to 0.1, with a decay rate of 0.95. We apply a dropout of 0.7.

5.4 Results

Table 4 summarizes the performance of our best model, BERT, on the test set, which achieves an average F1-score of .46 (std=.19). The model obtains the best performance on emotions with overt lexical markers, such as gratitude (.86), amusement (.80) and love (.78). The model obtains the lowest F1-score on grief (0), relief (.15) and realization (.21), which are the lowest frequency emotions. We find that less frequent emotions tend to be confused by the model with more frequent emotions related in sentiment and intensity (e.g., grief with sadness, pride with admiration, nervousness with fear); see Appendix G for a more detailed analysis.

Table 5 and Table 6 show results for a sentiment-grouped model (F1-score = .69) and an Ekman-grouped model (F1-score = .64), respectively. The significant performance increase in the transition from the full to the Ekman-level taxonomy indicates that this grouping mitigates confusion among inner-group lower-level categories.

The biLSTM model performs significantly worse than BERT, obtaining an average F1-score of .41 for the full taxonomy, .54 for an Ekman-grouped model, and .60 for a sentiment-grouped model.

6 Transfer Learning Experiments

We conduct transfer learning experiments on existing emotion benchmarks in order to show that our data generalizes across domains and taxonomies. The goal is to demonstrate that, given little labeled data in a target domain, one can utilize GoEmotions as baseline emotion understanding data.

Figure 3: Transfer learning results in terms of average F1-scores across emotion categories, for ISEAR (self-reported experiences), Emotion-Stimulus (FrameNet-based sentences), and EmoInt (tweets), across training set sizes (100, 200, 500, 1000, max). The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data.

Emotion          Precision   Recall   F1
admiration       0.53        0.83     0.65
amusement        0.70        0.94     0.80
anger            0.36        0.66     0.47
annoyance        0.24        0.63     0.34
approval         0.26        0.57     0.36
caring           0.30        0.56     0.39
confusion        0.24        0.76     0.37
curiosity        0.40        0.84     0.54
desire           0.43        0.59     0.49
disappointment   0.19        0.52     0.28
disapproval      0.29        0.61     0.39
disgust          0.34        0.66     0.45
embarrassment    0.39        0.49     0.43
excitement       0.26        0.52     0.34
fear             0.46        0.85     0.60
gratitude        0.79        0.95     0.86
grief            0.00        0.00     0.00
joy              0.39        0.73     0.51
love             0.68        0.92     0.78
nervousness      0.28        0.48     0.35
neutral          0.56        0.84     0.68
optimism         0.41        0.69     0.51
pride            0.67        0.25     0.36
realization      0.16        0.29     0.21
relief           0.50        0.09     0.15
remorse          0.53        0.88     0.66
sadness          0.38        0.71     0.49
surprise         0.40        0.66     0.50
macro-average    0.40        0.63     0.46
std              0.18        0.24     0.19

Table 4: Results based on GoEmotions taxonomy.

6.1 Emotion Benchmark Datasets

We consider the nine benchmark datasets from Bostan and Klinger (2018)'s Unified Dataset, which vary in terms of their size, domain, quality and taxonomy. In the interest of space, we only discuss three of these datasets here, chosen based on their diversity of domains. In our experiments we observe similar trends for the additional benchmarks, and all are included in Appendix H.

Sentiment        Precision   Recall   F1
ambiguous        0.54        0.66     0.60
negative         0.65        0.76     0.70
neutral          0.64        0.69     0.67
positive         0.78        0.87     0.82
macro-average    0.65        0.74     0.69
std              0.09        0.10     0.09

Table 5: Results based on sentiment-grouped data.

Ekman Emotion    Precision   Recall   F1
anger            0.50        0.65     0.57
disgust          0.52        0.53     0.53
fear             0.61        0.76     0.68
joy              0.77        0.88     0.82
neutral          0.66        0.67     0.66
sadness          0.56        0.62     0.59
macro-average    0.60        0.68     0.64
std              0.10        0.12     0.11

Table 6: Results using Ekman's taxonomy.

The International Survey on Emotion Antecedents and Reactions (ISEAR) (Scherer and Wallbott, 1994) is a collection of personal reports on emotional events, written by 3,000 people from different cultural backgrounds. The dataset contains 8k sentences, each labeled with a single emotion. The categories are anger, disgust, fear, guilt, joy, sadness, and shame.

EmoInt (Mohammad et al., 2018) is part of the SemEval 2018 benchmark, and it contains crowdsourced annotations for 7k tweets. The labels are intensity annotations for anger, joy, sadness, and fear. We obtain binary annotations for these emotions by using 0.5 as the cutoff.

Emotion-Stimulus (Ghazi et al., 2015) contains annotations for 2.4k sentences, generated based on FrameNet's emotion-directed frames. Their taxonomy is anger, disgust, fear, joy, sadness, shame and surprise.

6.2 Experimental Setup

Training set size. We experiment with varying amounts of training data from the target domain dataset, including 100, 200, 500, 1000, and 80% (named "max") of dataset examples. We generate 10 random splits for each train set size, with the remaining examples held as a test set.

We report the results of the finetuning experiments detailed below for each data size, with confidence intervals based on repeated experiments using the splits.

Finetuning. We compare three different finetuning setups. In the BASELINE setup, we finetune BERT only on the target dataset. In the FREEZE setup, we first finetune BERT on GoEmotions, then perform transfer learning by replacing the final dense layer, freezing all layers besides the last layer, and finetuning on the target dataset. The NOFREEZE setup is the same as FREEZE, except that we do not freeze the bottom layers. We hold the batch size at 16, the learning rate at 2e-5, and the number of epochs at 3 for all experiments.
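One way to read the FREEZE setup is as a parameter selection rule over the network's named parameters (a sketch; the name patterns are illustrative and in a real BERT checkpoint depend on the framework):

```python
def trainable_params(param_names, freeze=True):
    """In the FREEZE setup, only the replaced dense output layer and
    the last encoder layer are updated; NOFREEZE updates everything.
    Hypothetical name patterns for a 12-layer BERT (layers 0-11)."""
    if not freeze:
        return list(param_names)
    keep = ("classifier.", "encoder.layer.11.")
    return [n for n in param_names if n.startswith(keep)]
```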

6.3 Results

The results in Figure 3 suggest that our dataset generalizes well to different domains and taxonomies, and that a model trained on GoEmotions can help in cases when there is limited data from the target domain or limited resources for labeling.

Given limited target domain data (100 or 200 examples), both FREEZE and NOFREEZE yield significantly higher performance than the BASELINE for all three datasets. Importantly, NOFREEZE results show significantly higher performance for all training set sizes, except for "max", where NOFREEZE and BASELINE perform similarly.

7 Conclusion

We present GoEmotions, a large, manually annotated, carefully curated dataset for fine-grained emotion prediction. We provide a detailed data analysis, demonstrating the reliability of the annotations for the full taxonomy. We show the generalizability of the data across domains and taxonomies via transfer learning experiments. We build a strong baseline by fine-tuning a BERT model; however, the results suggest much room for future improvement. Future work can explore the cross-cultural robustness of emotion ratings, and extend the taxonomy to other languages and domains.

Data Disclaimer: We are aware that the dataset contains biases and is not representative of global diversity. We are aware that the dataset contains potentially problematic content. Potential biases in the data include: inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, and inherent or unconscious bias in assessment of offensive identity labels; annotators were all native English speakers from India. All these likely affect labeling, precision, and recall for a trained model. The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team. Anyone using this dataset should be aware of these limitations.

Acknowledgments

We thank the three anonymous reviewers for their constructive feedback. We would also like to thank the annotators for their hard work.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-Grained Emotion Detection with Gated Recurrent Neural Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718–728.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579–586, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104–2119.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664–1674.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019a. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90.

Alan S. Cowen, Hillary Anger Elfenbein, Petri Laukka, and Dacher Keltner. 2018. Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6):698–712.

Alan S. Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. In press. What music makes us feel: At least thirteen dimensions organize subjective experiences associated with music across cultures. Proceedings of the National Academy of Sciences.

Alan S. Cowen and Dacher Keltner. 2017. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900–E7909.

Alan S. Cowen and Dacher Keltner. 2019. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist.

Alan S. Cowen, Petri Laukka, Hillary Anger Elfenbein, Runjing Liu, and Dacher Keltner. 2019b. The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4):369.

CrowdFlower. 2016. https://www.figure-eight.com/data/sentiment-analysis-emotion-text.

Rosario Delgado and Xavier-Andoni Tibau. 2019. Why Cohen's kappa should be avoided as performance measure in classification. PLoS ONE, 14(9).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Maeve Duggan and Aaron Smith. 2013. 6% of online adults are reddit users. Pew Internet & American Life Project, 3:1–10.

Paul Ekman. 1992a. Are there basic emotions? Psychological Review, 99(3):550–553.

Paul Ekman. 1992b. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting Emotion Stimuli in Emotion-Bearing Sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 152–165. Springer.

Chao-Chun Hsu and Lun-Wei Ku. 2018. SocialNLP 2018 EmotionX challenge overview: Recognizing emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 27–31, Melbourne, Australia. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.

Chen Liu, Muhammad Osama, and Anderson De Andrade. 2019. DENS: A dataset for multi-class emotion analysis. arXiv preprint arXiv:1910.11769.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 task 1: Affect in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 1–17, New Orleans, Louisiana. Association for Computational Linguistics.

Saif M. Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 246–255. Association for Computational Linguistics.

Saif M. Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. 2015. Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480–499.

Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of reddit communities. In Canadian Conference on Artificial Intelligence, pages 51–56. Springer.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Emily Öhman, Kaisla Kajava, Jörg Tiedemann, and Timo Honkela. 2018. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 24–30.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Rosalind W. Picard. 1997. Affective Computing. MIT Press.

Inna Pirina and Çağrı Çöltekin. 2018. Identifying depression on reddit: The effect of training data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, pages 9–12.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In Theories of Emotion, pages 3–33. Elsevier.

James A. Russell. 2003. Core affect and the psychological construction of emotion. Psychological Review, 110(1):145.

Klaus R. Scherer and Harald G. Wallbott. 1994. Evidence for Universality and Cultural Variation of Differential Emotion Response Patterning. Journal of Personality and Social Psychology, 66(2):310.

Hendrik Schuff, Jeremy Barnes, Julian Mohme, Sebastian Padó, and Roman Klinger. 2017. Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 13–23.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, Prague, Czech Republic. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT Rediscovers the Classical NLP Pipeline. In Association for Computational Linguistics.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and Practical BERT Models for Sequence Labeling. In EMNLP 2019.

Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P. Sheth. 2012. Harnessing twitter 'big data' for automatic emotion identification. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pages 587–592. IEEE.

P. Yanardag, M. Cebrian, and I. Rahwan. 2018. Norman: World's first psychopath AI.

A Emotion Definitions

admiration: Finding something impressive or worthy of respect.
amusement: Finding something funny or being entertained.
anger: A strong feeling of displeasure or antagonism.
annoyance: Mild anger, irritation.
approval: Having or expressing a favorable opinion.
caring: Displaying kindness and concern for others.
confusion: Lack of understanding, uncertainty.
curiosity: A strong desire to know or learn something.
desire: A strong feeling of wanting something or wishing for something to happen.
disappointment: Sadness or displeasure caused by the nonfulfillment of one's hopes or expectations.
disapproval: Having or expressing an unfavorable opinion.
disgust: Revulsion or strong disapproval aroused by something unpleasant or offensive.
embarrassment: Self-consciousness, shame, or awkwardness.
excitement: Feeling of great enthusiasm and eagerness.
fear: Being afraid or worried.
gratitude: A feeling of thankfulness and appreciation.
grief: Intense sorrow, especially caused by someone's death.
joy: A feeling of pleasure and happiness.
love: A strong positive emotion of regard and affection.
nervousness: Apprehension, worry, anxiety.
optimism: Hopefulness and confidence about the future or the success of something.
pride: Pleasure or satisfaction due to one's own achievements or the achievements of those with whom one is closely associated.
realization: Becoming aware of something.
relief: Reassurance and relaxation following release from anxiety or distress.
remorse: Regret or guilty feeling.
sadness: Emotional pain, sorrow.
surprise: Feeling astonished, startled by something unexpected.

B Taxonomy Selection & Data Collection

We selected our taxonomy through a careful multi-round process. In the first pilot round of data collection, we used emotions that were identified to be salient by Cowen and Keltner (2017), making sure that our set includes Ekman's emotion categories as used in previous NLP work. In this round, we also included an open input box where annotators could suggest emotion(s) that were not among the options. We annotated 3K examples in the first round. We updated the taxonomy based on the results of this round (see details below). In the second pilot round of data collection, we repeated this process with 2k new examples, once again updating the taxonomy.

While reviewing the results from the pilot rounds, we identified and removed emotions that were scarcely selected by annotators and/or had low interrater agreement due to being very similar to other emotions or being too difficult to detect from text. These emotions were boredom, doubt, heartbroken, indifference, and calmness. We also identified and added those emotions to our taxonomy that were frequently suggested by raters and/or seemed to be represented in the data upon manual inspection. These emotions were desire, disappointment, pride, realization, relief, and remorse. In this process, we also refined the category names (e.g., replacing ecstasy with excitement) to ones that seemed interpretable to annotators. This is how we arrived at the final set of 27 emotions + Neutral. Our high interrater agreement in the final data can be partially explained by the fact that we took interpretability into consideration while constructing the taxonomy. The dataset we are releasing was labeled in the third round, over the final taxonomy.

C Cohen's Kappa Values

In Section 4.1, we measure agreement between raters via Spearman correlation, following considerations by Delgado and Tibau (2019). In Table 7, we report the Cohen's kappa values for comparison, which we obtain by randomly sampling two ratings for each example and calculating the Cohen's kappa between these two sets of ratings. We find that all Cohen's kappa values are greater than 0, showing rater agreement. Moreover, the Cohen's kappa values correlate highly with the interrater correlation values (Pearson r = 0.85, p < 0.001), providing corroborative evidence for the significant degree of interrater agreement for each emotion.

Emotion          Interrater Correlation   Cohen's kappa
admiration       0.535                    0.468
amusement        0.482                    0.474
anger            0.207                    0.307
annoyance        0.193                    0.192
approval         0.385                    0.187
caring           0.237                    0.252
confusion        0.217                    0.270
curiosity        0.418                    0.366
desire           0.177                    0.251
disappointment   0.186                    0.184
disapproval      0.274                    0.234
disgust          0.192                    0.241
embarrassment    0.177                    0.218
excitement       0.193                    0.222
fear             0.266                    0.394
gratitude        0.645                    0.749
grief            0.162                    0.095
joy              0.296                    0.301
love             0.446                    0.555
nervousness      0.164                    0.144
optimism         0.322                    0.300
pride            0.163                    0.148
realization      0.194                    0.155
relief           0.172                    0.185
remorse          0.178                    0.358
sadness          0.346                    0.336
surprise         0.275                    0.331

Table 7: Interrater agreement as measured by interrater correlation and Cohen's kappa.
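The per-emotion statistic above can be sketched as follows; this is a minimal illustration of Cohen's kappa between two sampled sets of categorical ratings, not the paper's exact evaluation code (the function name is ours).

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa between two equal-length lists of categorical ratings.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is chance agreement under independent marginal distributions.
    (Undefined when p_e == 1, i.e. both raters always pick the same label.)
    """
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of examples where the two ratings match.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields kappa = 1, while agreement at chance level yields kappa = 0, which is why values greater than 0 in Table 7 indicate rater agreement.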

D Sentiment of Reddit Subreddits

In Section 3, we describe how we obtain subreddits that are balanced in terms of sentiment. Here, we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28%, STD=11%), positive (M=41%, STD=11%), negative (M=19%, STD=7%), ambiguous (M=35%, STD=8%). After filtering, the distribution of sentiments across our remaining subreddits became: neutral (M=24%, STD=5%), positive (M=35%, STD=6%), negative (M=27%, STD=4%), ambiguous (M=33%, STD=4%).

E BERT's Most Activated Layers

To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with center of gravity = 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who have also found that tasks involving high-level semantics tend to make use of all BERT layers.
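The center-of-gravity statistic can be computed from the learned per-layer weights roughly as below; this is a sketch under the assumption that the raw scalar mixing weights are softmax-normalized over the 13 layers (indices 0–12), with a function name of our own choosing.

```python
import numpy as np

def center_of_gravity(raw_weights):
    """Expected layer index under the softmax of the scalar mixing weights."""
    w = np.exp(raw_weights - np.max(raw_weights))  # numerically stable softmax
    w = w / w.sum()
    # Center of gravity = sum_l l * softmax(s)_l
    return float(np.dot(np.arange(len(w)), w))
```

Uniform weights over layers 0–12 give a center of gravity of 6.0, so the observed 6.19 is close to what a model using all layers equally would produce.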

Figure 4: Softmax weights of each BERT layer when trained on our dataset. [Figure omitted: bar chart of softmax weights (0.00–0.10) for BERT layers 0–12.]

F Number of Emotion Labels Per Example

Figure 5 shows the number of emotion labels per example before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.

Figure 5: Number of emotion labels per example before and after filtering the labels chosen by only a single annotator. [Figure omitted: histogram of number of labels (0–7) vs. number of examples (0k–40k), pre-filter and post-filter.]

G Confusion Matrix

Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix similarly as we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M where M_ij denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the count for M_{joy,joy}, M_{joy,pride}, M_{admiration,joy}, and M_{admiration,pride}. In practice, since most of our examples only have a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.

Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g., grief and sadness, pride and admiration, nervousness and fear).
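The co-occurrence-style counting and row normalization described above can be sketched as follows (a minimal illustration with variable names of our own choosing):

```python
import numpy as np

def multilabel_confusion(true_sets, pred_sets, num_labels):
    """For each example, increment M[i, j] for every pair of
    (true label i, predicted label j), then row-normalize so that
    the counts for each true label sum to 1."""
    M = np.zeros((num_labels, num_labels))
    for true_set, pred_set in zip(true_sets, pred_sets):
        for i in true_set:
            for j in pred_set:
                M[i, j] += 1
    row_sums = M.sum(axis=1, keepdims=True)
    # Guard against 0/0 for labels that never occur as true labels.
    return M / np.where(row_sums == 0, 1.0, row_sums)
```

For a single-label example the inner loops add exactly one count, which is why the matrix closely resembles a standard single-label confusion matrix in practice.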

We also perform hierarchical clustering over the normalized confusion matrix, using correlation as a distance metric and Ward as a linkage method. We find that the model learns clusters relatively similar to the ones in Figure 2, even though the training data only includes the subset of labels that have agreement (see Figure 5).

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets that are downloadable and have categorical emotions in the Unified Dataset (Bostan and Klinger, 2018). These datasets are DailyDialog (Li et al., 2017), Emotion-Stimulus (Ghazi et al., 2015), Affective Text (Strapparava and Mihalcea, 2007), CrowdFlower (CrowdFlower, 2016), Electoral Tweets (Mohammad et al., 2015), ISEAR (Scherer and Wallbott, 1994), the Twitter Emotion Corpus (TEC) (Mohammad, 2012), EmoInt (Mohammad et al., 2018), and the Stance Sentiment Emotion Corpus (SSEC) (Schuff et al., 2017).

We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ~4k labeled examples with a large taxonomy of 36 emotions, FREEZE gives a significant boost of performance over the BASELINE and NOFREEZE for all training set sizes besides "max".

For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple of hundred training examples. For 500–1000 training examples, NOFREEZE tends to outperform the BASELINE, but we can see that these two setups come closer when there is more training data available. These results suggest that our dataset helps if there is limited data from the target domain.

Figure 6: A normalized confusion matrix for our model predictions. The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment. [Figure omitted: heatmap of true label (rows) vs. predicted label (columns) over all 27 emotions, grouped by sentiment (positive, negative, ambiguous); cell values give the proportion of predictions, 0.00–0.75.]

Figure 7: Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018). [Figure omitted: nine panels of F-score vs. training set size (100, 200, 500, 1000, max) for DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), ElectoralTweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets), and SSEC (tweets), comparing Baseline, Transfer Learning w/ Freezing Layers, and Transfer Learning w/o Freezing Layers.]


Algorithm 1: Leave-One-Rater-Out PPCA
 1: R ← set of raters
 2: E ← set of emotions
 3: C ∈ R^(|R|×|E|)
 4: for all raters r ∈ 1, ..., |R| do
 5:     n ← number of examples annotated by r
 6:     J ∈ R^(n×|R|×|E|) ← all ratings for the examples annotated by r
 7:     J_−r ∈ R^(n×(|R|−1)×|E|) ← all ratings in J excluding r
 8:     J_r ∈ R^(n×|E|) ← all ratings by r
 9:     X, Y ∈ R^(n×|E|) ← randomly split J_−r and average ratings across raters for both sets
10:     W ∈ R^(|E|×|E|) ← result of PPCA(X, Y)
11:     for all components† w_i, i ∈ 1, ..., |E| in W do
12:         v_i^r ← projection‡ of J_r onto w_i
13:         v_i^−r ← projection‡ of J_−r onto w_i
14:         C_ri ← correlation between v_i^r and v_i^−r, partialing out v_k^−r for all k ∈ 1, ..., i−1
15:     end for
16: end for
17: C′ ← Wilcoxon signed-rank test on C
18: C″ ← Bonferroni correction on C′ (α = 0.05)
† in descending order of eigenvalue
‡ we demean vectors before projection

emotion that have high agreement across raters.

Unlike Principal Component Analysis (PCA), PPCA examines the cross-covariance between datasets, rather than the variance–covariance matrix within a single dataset. We obtain the principal preserved components (PPCs) of two datasets (matrices) X, Y ∈ R^(N×|E|), where N is the number of examples and |E| is the number of emotions, by calculating the eigenvectors of the symmetrized cross-covariance matrix X^T Y + Y^T X.
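In code, the PPC computation reduces to an eigendecomposition of the symmetrized cross-covariance; a minimal sketch (we demean the columns here, which mirrors the demeaning in Algorithm 1's projection step, and the function name is ours):

```python
import numpy as np

def ppca(X, Y):
    """Principal preserved components of two rating matrices X, Y (N x |E|):
    eigenvectors of the symmetrized cross-covariance X^T Y + Y^T X,
    ordered by descending eigenvalue."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    S = Xc.T @ Yc + Yc.T @ Xc          # symmetric by construction, so eigh applies
    vals, vecs = np.linalg.eigh(S)     # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]     # reorder to descending, as in Algorithm 1
    return vals[order], vecs[:, order]
```

When Y = X this reduces to (twice) the ordinary covariance eigendecomposition, which is the sense in which PPCA generalizes PCA to dimensions shared between two datasets.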

Extracting significant dimensions. We remove examples labeled as Neutral and keep those examples that still have at least 3 ratings after this filtering step. We then determine the number of significant dimensions using a leave-one-rater-out analysis, as described by Algorithm 1.

We find that all 27 PPCs are highly significant. Specifically, Bonferroni-corrected p-values are less than 1.5e-6 for all dimensions (corrected α = 0.0017), suggesting that the emotions were highly dissociable. Such a high degree of significance for all dimensions is nontrivial: for example, Cowen et al. (2019b) find that only 12 out of their 30 emotion categories are significantly dissociable.
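The Bonferroni criterion amounts to shrinking the significance threshold by the number of tests; a generic sketch, not the paper's exact statistics code:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """A p-value is significant after Bonferroni correction if p < alpha / m,
    where m is the number of simultaneous tests."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]
```

With α = 0.05 and on the order of 30 tests, the corrected threshold is roughly 0.0017, matching the corrected α reported above.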

t-SNE projection. To better understand how the examples are organized in the emotion space, we apply t-SNE, a dimension reduction method that seeks to preserve distances between data points, using the scikit-learn package (Pedregosa et al., 2011). The dataset can be explored in our interactive plot,6 where one can also look at the texts and the annotations. The color of each data point is the weighted average of the RGB values representing those emotions that at least half of the raters selected.

4.4 Linguistic Correlates of Emotions

We extract the lexical correlates of each emotion by calculating the log odds ratio, informative Dirichlet prior (Monroe et al., 2008) of all tokens for each emotion category, contrasting to all other emotions. Since the log odds are z-scored, all values greater than 3 indicate highly significant (>3 std) association with the corresponding emotion. We list the top 5 tokens for each category in Table 3. We find that those emotions that are highly significantly associated with certain tokens (e.g., gratitude with "thanks", amusement with "lol") tend to have the highest interrater correlation (see Figure 1). Conversely, emotions that have fewer significantly associated tokens (e.g., grief and nervousness) tend to have low interrater correlation. These results suggest certain emotions are more verbally implicit and may require more context to be interpreted.
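The log odds ratio with informative Dirichlet prior can be sketched as follows, following Monroe et al. (2008); this is a minimal version with variable names of our own, assuming the prior is taken from counts over the full corpus:

```python
import numpy as np

def log_odds_dirichlet(counts_i, counts_rest, prior):
    """Z-scored log odds ratio with informative Dirichlet prior.

    counts_i:     token counts in the target emotion's comments
    counts_rest:  token counts in all other comments
    prior:        prior token counts (e.g., counts over the full corpus)
    """
    n_i, n_rest, a0 = counts_i.sum(), counts_rest.sum(), prior.sum()
    # Difference of smoothed log odds between the two corpora.
    delta = (np.log((counts_i + prior) / (n_i + a0 - counts_i - prior))
             - np.log((counts_rest + prior) / (n_rest + a0 - counts_rest - prior)))
    # Approximate variance of the estimate, then z-score.
    var = 1.0 / (counts_i + prior) + 1.0 / (counts_rest + prior)
    return delta / np.sqrt(var)
```

A token scores high only when it is both frequent in the target emotion and rare elsewhere, which is why the z-scores in Table 3 can double as a significance test.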

5 Modeling

We present a strong baseline emotion prediction model for GoEmotions.

5.1 Data Preparation

To minimize the noise in our data, we filter out emotion labels selected by only a single annotator. We keep examples with at least one label after this filtering is performed; this amounts to 93% of the original data. We randomly split this data into train (80%), dev (10%), and test (10%) sets. We only evaluate on the test set once the model is finalized.
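The 80/10/10 split can be sketched as below; this is a minimal illustration (for comparability, the released splits should be used rather than re-splitting):

```python
import random

def split_data(examples, seed=0):
    """Random 80/10/10 train/dev/test split of a list of examples."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])
```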

Even though we filter our data for the baseline experiments, we see particular value in the 4K examples that lack agreement. This subset of the data likely contains edge/difficult examples for the emotion domain (e.g., emotion-ambiguous text) and presents challenges for further exploration. That is

6 https://nlp.stanford.edu/~ddemszky/goemotions/tsne.html

admiration: great (42), awesome (32), amazing (30), good (28), beautiful (23)
amusement: lol (66), haha (32), funny (27), lmao (21), hilarious (18)
approval: agree (24), not (13), don't (12), yes (12), agreed (11)
caring: you (12), worry (11), careful (9), stay (9), your (8)
anger: fuck (24), hate (18), fucking (18), angry (11), dare (10)
annoyance: annoying (14), stupid (13), fucking (12), shit (10), dumb (9)
disappointment: disappointing (11), disappointed (10), bad (9), disappointment (7), unfortunately (7)
disapproval: not (16), don't (14), disagree (9), nope (8), doesn't (7)
confusion: confused (18), why (11), sure (10), what (10), understand (8)
desire: wish (29), want (8), wanted (6), could (6), ambitious (4)
excitement: excited (21), happy (8), cake (8), wow (8), interesting (7)
gratitude: thanks (75), thank (69), for (24), you (18), sharing (17)
joy: happy (32), glad (27), enjoy (20), enjoyed (12), fun (12)
disgust: disgusting (22), awful (14), worst (13), worse (12), weird (9)
embarrassment: embarrassing (12), shame (11), awkward (10), embarrassment (8), embarrassed (7)
fear: scared (16), afraid (16), scary (15), terrible (12), terrifying (11)
grief: died (6), rip (4)
curiosity: curious (22), what (18), why (13), how (11), did (10)
love: love (76), loved (21), favorite (13), loves (12), like (9)
optimism: hope (45), hopefully (19), luck (18), hoping (16), will (8)
pride: proud (14), pride (4), accomplishment (4)
relief: glad (5), relieved (4), relieving (4), relief (4)
nervousness: nervous (8), worried (8), anxiety (6), anxious (4), worrying (4)
remorse: sorry (39), regret (9), apologies (7), apologize (6), guilt (5)
sadness: sad (31), sadly (16), sorry (15), painful (10), crying (9)
realization: realize (14), realized (12), realised (7), realization (6), thought (6)
surprise: wow (23), surprised (21), wonder (15), shocked (12), omg (11)

Table 3: Top 5 words associated with each emotion (color-coded in the original as positive, negative, or ambiguous). The rounded z-scored log odds ratios in the parentheses, with the threshold set at 3, indicate significance of association.

why we release all 58K examples with all annotators' ratings.

Grouping emotions. We create a hierarchical grouping of our taxonomy and evaluate the model performance on each level of the hierarchy. A sentiment level divides the labels into 4 categories – positive, negative, ambiguous, and Neutral – with the Neutral category intact and the rest of the mapping as shown in Figure 2. The Ekman level further divides the taxonomy using the Neutral label and the following 6 groups: anger (maps to: anger, annoyance, disapproval), disgust (maps to: disgust), fear (maps to: fear, nervousness), joy (all positive emotions), sadness (maps to: sadness, disappointment, embarrassment, grief, remorse), and surprise (all ambiguous emotions).
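The Ekman-level grouping above can be written down as a mapping; the anger, disgust, fear, and sadness groups are given explicitly in the text, while the positive and ambiguous sets below follow the sentiment grouping of Figure 2, which is not reproduced here, so treat those two lists as an assumption.

```python
# Ekman-level grouping of the 27 fine-grained labels (Neutral stays intact).
# The "joy" and "surprise" member lists are assumptions taken from the
# sentiment grouping (Figure 2 is not reproduced in this appendix).
EKMAN_MAPPING = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "joy": ["admiration", "amusement", "approval", "caring", "desire",
            "excitement", "gratitude", "joy", "love", "optimism",
            "pride", "relief"],
    "sadness": ["sadness", "disappointment", "embarrassment", "grief",
                "remorse"],
    "surprise": ["confusion", "curiosity", "realization", "surprise"],
}

def to_ekman(label):
    """Map a fine-grained label to its Ekman-level group."""
    if label == "neutral":
        return "neutral"
    return next(group for group, members in EKMAN_MAPPING.items()
                if label in members)
```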

5.2 Model Architecture

We use the BERT-base model (Devlin et al., 2019) for our experiments. We add a dense output layer on top of the pretrained model for the purposes of finetuning, with a sigmoid cross entropy loss function to support multi-label classification. As an additional baseline, we train a bidirectional LSTM.
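The multi-label loss is per-label sigmoid cross entropy; a framework-agnostic sketch of the numerically stable form (not the actual training code, which runs inside the BERT fine-tuning graph):

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Mean per-label sigmoid cross entropy over independent binary labels,
    computed in the stable form max(x, 0) - x*z + log(1 + exp(-|x|))."""
    x = np.asarray(logits, dtype=float)
    z = np.asarray(labels, dtype=float)
    return float(np.mean(np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))))
```

Treating each of the 27 emotions (plus Neutral) as an independent binary decision is what lets a single example carry multiple labels, unlike a softmax over mutually exclusive classes.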

5.3 Parameter Settings

When finetuning the pre-trained BERT model, we keep most of the hyperparameters set by Devlin et al. (2019) intact and only change the batch size and learning rate. We find that training for at least 4 epochs is necessary for learning the data, but training for more epochs results in overfitting. We also find that a small batch size of 16 and learning rate of 5e-5 yields the best performance.

For the biLSTM, we set the hidden layer dimensionality to 256, the learning rate to 0.1, with a decay rate of 0.95. We apply a dropout of 0.7.

5.4 Results

Table 4 summarizes the performance of our best model, BERT, on the test set, which achieves an average F1-score of .46 (std=.19). The model obtains the best performance on emotions with overt lexical markers, such as gratitude (.86), amusement (.80), and love (.78). The model obtains the lowest F1-score on grief (0), relief (.15), and realization (.21), which are the lowest frequency emotions. We find that less frequent emotions tend to be confused by the model with more frequent emotions related in sentiment and intensity (e.g., grief with sadness, pride with admiration, nervousness with fear); see Appendix G for a more detailed analysis.

Table 5 and Table 6 show results for a sentiment-grouped model (F1-score = .69) and an Ekman-grouped model (F1-score = .64), respectively. The significant performance increase in the transition from full to Ekman-level taxonomy indicates that this grouping mitigates confusion among inner-group lower-level categories.

The biLSTM model performs significantly worse than BERT, obtaining an average F1-score of .41 for the full taxonomy, .54 for an Ekman-grouped model, and .6 for a sentiment-grouped model.

6 Transfer Learning Experiments

We conduct transfer learning experiments on existing emotion benchmarks in order to show our data generalizes across domains and taxonomies. The goal is to demonstrate that given little labeled data in a target domain, one can utilize GoEmotions as baseline emotion understanding data.

Figure 3: Transfer learning results in terms of average F1-scores across emotion categories. The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data. [Figure omitted: three panels of F-score vs. training set size (100, 200, 500, 1000, max) for ISEAR (self-reported experiences), Emotion-Stimulus (FrameNet-based sentences), and EmoInt (tweets), comparing Baseline, Transfer Learning w/ Freezing Layers, and Transfer Learning w/o Freezing Layers.]

Emotion          Precision  Recall  F1
admiration       0.53       0.83    0.65
amusement        0.70       0.94    0.80
anger            0.36       0.66    0.47
annoyance        0.24       0.63    0.34
approval         0.26       0.57    0.36
caring           0.30       0.56    0.39
confusion        0.24       0.76    0.37
curiosity        0.40       0.84    0.54
desire           0.43       0.59    0.49
disappointment   0.19       0.52    0.28
disapproval      0.29       0.61    0.39
disgust          0.34       0.66    0.45
embarrassment    0.39       0.49    0.43
excitement       0.26       0.52    0.34
fear             0.46       0.85    0.60
gratitude        0.79       0.95    0.86
grief            0.00       0.00    0.00
joy              0.39       0.73    0.51
love             0.68       0.92    0.78
nervousness      0.28       0.48    0.35
neutral          0.56       0.84    0.68
optimism         0.41       0.69    0.51
pride            0.67       0.25    0.36
realization      0.16       0.29    0.21
relief           0.50       0.09    0.15
remorse          0.53       0.88    0.66
sadness          0.38       0.71    0.49
surprise         0.40       0.66    0.50
macro-average    0.40       0.63    0.46
std              0.18       0.24    0.19

Table 4: Results based on GoEmotions taxonomy.

6.1 Emotion Benchmark Datasets

We consider the nine benchmark datasets from Bostan and Klinger (2018)'s Unified Dataset, which vary in terms of their size, domain, qual-

Sentiment        Precision  Recall  F1
ambiguous        0.54       0.66    0.60
negative         0.65       0.76    0.70
neutral          0.64       0.69    0.67
positive         0.78       0.87    0.82
macro-average    0.65       0.74    0.69
std              0.09       0.10    0.09

Table 5: Results based on sentiment-grouped data.

Ekman Emotion    Precision  Recall  F1
anger            0.50       0.65    0.57
disgust          0.52       0.53    0.53
fear             0.61       0.76    0.68
joy              0.77       0.88    0.82
neutral          0.66       0.67    0.66
sadness          0.56       0.62    0.59
macro-average    0.60       0.68    0.64
std              0.10       0.12    0.11

Table 6: Results using Ekman's taxonomy.

ity, and taxonomy. In the interest of space, we only discuss three of these datasets here, chosen based on their diversity of domains. In our experiments, we observe similar trends for the additional benchmarks, and all are included in Appendix H.

The International Survey on Emotion Antecedents and Reactions (ISEAR) (Scherer and Wallbott, 1994) is a collection of personal reports on emotional events, written by 3,000 people from different cultural backgrounds. The dataset contains 8k sentences, each labeled with a single emotion. The categories are anger, disgust, fear, guilt, joy, sadness, and shame.

EmoInt (Mohammad et al., 2018) is part of the SemEval 2018 benchmark, and it contains crowdsourced annotations for 7k tweets. The labels are intensity annotations for anger, joy, sadness, and fear. We obtain binary annotations for these emotions by using .5 as the cutoff.

Emotion-Stimulus (Ghazi et al., 2015) contains annotations for 2.4k sentences generated based on FrameNet's emotion-directed frames. Their taxonomy is: anger, disgust, fear, joy, sadness, shame, and surprise.

6.2 Experimental Setup

Training set size. We experiment with varying amounts of training data from the target domain dataset, including 100, 200, 500, 1000, and 80% (named "max") of dataset examples. We generate 10 random splits for each train set size, with the remaining examples held as a test set.

We report the results of the finetuning experiments detailed below for each data size, with confidence intervals based on repeated experiments using the splits.

Finetuning. We compare three different finetuning setups. In the BASELINE setup, we finetune BERT only on the target dataset. In the FREEZE setup, we first finetune BERT on GoEmotions, then perform transfer learning by replacing the final dense layer, freezing all layers besides the last layer, and finetuning on the target dataset. The NOFREEZE setup is the same as FREEZE, except that we do not freeze the bottom layers. We hold the batch size at 16, learning rate at 2e-5, and number of epochs at 3 for all experiments.

6.3 Results

The results in Figure 3 suggest that our dataset generalizes well to different domains and taxonomies, and that a model using GoEmotions can help in cases when there is limited data from the target domain or limited resources for labeling.

Given limited target domain data (100 or 200 examples), both FREEZE and NOFREEZE yield significantly higher performance than the BASELINE for all three datasets. Importantly, NOFREEZE results show significantly higher performance for all training set sizes, except for "max", where NOFREEZE and BASELINE perform similarly.

7 Conclusion

We present GoEmotions, a large, manually annotated, carefully curated dataset for fine-grained emotion prediction. We provide a detailed data analysis, demonstrating the reliability of the annotations for the full taxonomy. We show the generalizability of the data across domains and taxonomies via transfer learning experiments. We build a strong baseline by fine-tuning a BERT model; however, the results suggest much room for future improvement. Future work can explore the cross-cultural robustness of emotion ratings and extend the taxonomy to other languages and domains.

Data Disclaimer. We are aware that the dataset contains biases and is not representative of global diversity. We are aware that the dataset contains potentially problematic content. Potential biases in the data include: inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels; annotators were all native English speakers from India. All these likely affect labeling, precision, and recall for a trained model. The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team. Anyone using this dataset should be aware of these limitations of the dataset.

Acknowledgments

We thank the three anonymous reviewers for their constructive feedback. We would also like to thank the annotators for their hard work.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718–728.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579–586, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104–2119.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664–1674.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019a. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90.

Alan S. Cowen, Hillary Anger Elfenbein, Petri Laukka, and Dacher Keltner. 2018. Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6):698–712.

Alan S. Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. in press. What music makes us feel: At least thirteen dimensions organize subjective experiences associated with music across cultures. Proceedings of the National Academy of Sciences.

Alan S. Cowen and Dacher Keltner. 2017. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900–E7909.

Alan S. Cowen and Dacher Keltner. 2019. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist.

Alan S. Cowen, Petri Laukka, Hillary Anger Elfenbein, Runjing Liu, and Dacher Keltner. 2019b. The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4):369.

CrowdFlower. 2016. https://www.figure-eight.com/data/sentiment-analysis-emotion-text

Rosario Delgado and Xavier-Andoni Tibau. 2019. Why Cohen's kappa should be avoided as performance measure in classification. PLoS ONE, 14(9).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Maeve Duggan and Aaron Smith. 2013. 6% of online adults are Reddit users. Pew Internet & American Life Project, 3:1–10.

Paul Ekman. 1992a. Are there basic emotions? Psychological Review, 99(3):550–553.

Paul Ekman. 1992b. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting emotion stimuli in emotion-bearing sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 152–165. Springer.

Chao-Chun Hsu and Lun-Wei Ku. 2018. SocialNLP 2018 EmotionX challenge overview: Recognizing emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 27–31, Melbourne, Australia. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.

Chen Liu, Muhammad Osama, and Anderson De Andrade. 2019. DENS: A dataset for multi-class emotion analysis. arXiv preprint arXiv:1910.11769.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 task 1: Affect in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 1–17, New Orleans, Louisiana. Association for Computational Linguistics.

Saif M. Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 246–255. Association for Computational Linguistics.

Saif M. Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. 2015. Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480–499.

Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of Reddit communities. In Canadian Conference on Artificial Intelligence, pages 51–56. Springer.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Emily Ohman Kaisla Kajava Jorg Tiedemann andTimo Honkela 2018 Creating a dataset formultilingual fine-grained emotion-detection usinggamification-based annotation In Proceedings ofthe 9th Workshop on Computational Approaches toSubjectivity Sentiment and Social Media Analysispages 24ndash30

F Pedregosa G Varoquaux A Gramfort V MichelB Thirion O Grisel M Blondel P PrettenhoferR Weiss V Dubourg J Vanderplas A PassosD Cournapeau M Brucher M Perrot and E Duch-esnay 2011 Scikit-learn Machine learning inPython Journal of Machine Learning Research122825ndash2830

Matthew E Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep Contextualized Word Rep-resentations In 16th Annual Conference of theNorth American Chapter of the Association for Com-putational Linguistics (NAACL)

Rosalind W Picard 1997 Affective Computing MITPress

Inna Pirina and Cagrı Coltekin 2018 Identifyingdepression on reddit The effect of training dataIn Proceedings of the 2018 EMNLP WorkshopSMM4H The 3rd Social Media Mining for HealthApplications Workshop amp Shared Task pages 9ndash12

Robert Plutchik 1980 A general psychoevolutionarytheory of emotion In Theories of emotion pages3ndash33 Elsevier

James A Russell 2003 Core affect and the psycholog-ical construction of emotion Psychological review110(1)145

Klaus R Scherer and Harald G Wallbott 1994 Evi-dence for Universality and Cultural Variation of Dif-ferential Emotion Response Patterning Journal ofpersonality and social psychology 66(2)310

Hendrik Schuff Jeremy Barnes Julian Mohme Sebas-tian Pado and Roman Klinger 2017 Annotationmodelling and analysis of fine-grained emotions ona stance and sentiment detection corpus In Pro-ceedings of the 8th Workshop on Computational Ap-proaches to Subjectivity Sentiment and Social Me-dia Analysis pages 13ndash23

Carlo Strapparava and Rada Mihalcea 2007 SemEval-2007 task 14 Affective text In Proceedings of theFourth International Workshop on Semantic Evalua-tions (SemEval-2007) pages 70ndash74 Prague CzechRepublic Association for Computational Linguis-tics

Ian Tenney Dipanjan Das and Ellie Pavlick 2019BERT Rediscovers the Classical NLP Pipeline InAssociation for Computational Linguistics

Henry Tsai Jason Riesa Melvin Johnson Naveen Ari-vazhagan Xin Li and Amelia Archer 2019 Smalland Practical BERT Models for Sequence LabelingIn EMNLP 2019

Wenbo Wang Lu Chen Krishnaprasad Thirunarayanand Amit P Sheth 2012 Harnessing twitter lsquobigdatarsquo for automatic emotion identification In 2012International Conference on Privacy Security Riskand Trust and 2012 International Confernece on So-cial Computing pages 587ndash592 IEEE

Cebrian-M Yanardag P and I Rahwan 2018 Nor-man Worlds first psychopath ai

A Emotion Definitions

admiration: Finding something impressive or worthy of respect.
amusement: Finding something funny or being entertained.
anger: A strong feeling of displeasure or antagonism.
annoyance: Mild anger, irritation.
approval: Having or expressing a favorable opinion.
caring: Displaying kindness and concern for others.
confusion: Lack of understanding, uncertainty.
curiosity: A strong desire to know or learn something.
desire: A strong feeling of wanting something or wishing for something to happen.
disappointment: Sadness or displeasure caused by the nonfulfillment of one's hopes or expectations.
disapproval: Having or expressing an unfavorable opinion.
disgust: Revulsion or strong disapproval aroused by something unpleasant or offensive.
embarrassment: Self-consciousness, shame, or awkwardness.
excitement: Feeling of great enthusiasm and eagerness.
fear: Being afraid or worried.
gratitude: A feeling of thankfulness and appreciation.
grief: Intense sorrow, especially caused by someone's death.
joy: A feeling of pleasure and happiness.
love: A strong positive emotion of regard and affection.
nervousness: Apprehension, worry, anxiety.
optimism: Hopefulness and confidence about the future or the success of something.
pride: Pleasure or satisfaction due to one's own achievements or the achievements of those with whom one is closely associated.
realization: Becoming aware of something.
relief: Reassurance and relaxation following release from anxiety or distress.
remorse: Regret or guilty feeling.
sadness: Emotional pain, sorrow.
surprise: Feeling astonished, startled by something unexpected.

B Taxonomy Selection & Data Collection

We selected our taxonomy through a careful multi-round process. In the first pilot round of data collection, we used emotions that were identified to be salient by Cowen and Keltner (2017), making sure that our set includes Ekman's emotion categories as used in previous NLP work. In this round, we also included an open input box where annotators could suggest emotion(s) that were not among the options. We annotated 3k examples in the first round. We updated the taxonomy based on the results of this round (see details below). In the second pilot round of data collection, we repeated this process with 2k new examples, once again updating the taxonomy.

While reviewing the results from the pilot rounds, we identified and removed emotions that were scarcely selected by annotators and/or had low interrater agreement due to being very similar to other emotions or being too difficult to detect from text. These emotions were boredom, doubt, heartbroken, indifference, and calmness. We also identified and added to our taxonomy those emotions that were frequently suggested by raters and/or seemed to be represented in the data upon manual inspection. These emotions were desire, disappointment, pride, realization, relief, and remorse. In this process, we also refined the category names (e.g., replacing ecstasy with excitement) to ones that seemed interpretable to annotators. This is how we arrived at the final set of 27 emotions + Neutral. Our high interrater agreement in the final data can be partially explained by the fact that we took interpretability into consideration while constructing the taxonomy. The dataset we are releasing was labeled in the third round, over the final taxonomy.

C Cohen's Kappa Values

In Section 4.1, we measure agreement between raters via Spearman correlation, following considerations by Delgado and Tibau (2019). In Table 7, we report Cohen's kappa values for comparison, which we obtain by randomly sampling two ratings for each example and calculating Cohen's kappa between these two sets of ratings. We find that all Cohen's kappa values are greater than 0, showing rater agreement. Moreover, the Cohen's kappa values correlate highly with the interrater correlation values (Pearson r = 0.85, p < 0.001), providing corroborative evidence for the significant degree of interrater agreement for each emotion.
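The sampling procedure above can be sketched as follows. This is a minimal re-implementation under stated assumptions, not the authors' code; the `ratings` input format and the helper names `cohens_kappa`/`sampled_kappa` are hypothetical.

```python
import random


def cohens_kappa(a, b):
    """Cohen's kappa for two binary rating sequences of equal length."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    # expected agreement from each rater's marginal label frequency
    pa1, pb1 = sum(a) / n, sum(b) / n
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def sampled_kappa(ratings, emotion, seed=0):
    """ratings: a list with one entry per example, each a list of the
    label sets chosen by that example's annotators, e.g.
    [[{'joy'}, {'joy', 'love'}, {'neutral'}], ...].
    Draw two ratings per example and score them as binary
    indicators for a single emotion."""
    rng = random.Random(seed)
    a, b = [], []
    for example in ratings:
        r1, r2 = rng.sample(example, 2)
        a.append(int(emotion in r1))
        b.append(int(emotion in r2))
    return cohens_kappa(a, b)
```

Repeating this per emotion yields the per-emotion kappa column of Table 7.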

Emotion          Interrater Correlation    Cohen's kappa
admiration       0.535                     0.468
amusement        0.482                     0.474
anger            0.207                     0.307
annoyance        0.193                     0.192
approval         0.385                     0.187
caring           0.237                     0.252
confusion        0.217                     0.270
curiosity        0.418                     0.366
desire           0.177                     0.251
disappointment   0.186                     0.184
disapproval      0.274                     0.234
disgust          0.192                     0.241
embarrassment    0.177                     0.218
excitement       0.193                     0.222
fear             0.266                     0.394
gratitude        0.645                     0.749
grief            0.162                     0.095
joy              0.296                     0.301
love             0.446                     0.555
nervousness      0.164                     0.144
optimism         0.322                     0.300
pride            0.163                     0.148
realization      0.194                     0.155
relief           0.172                     0.185
remorse          0.178                     0.358
sadness          0.346                     0.336
surprise         0.275                     0.331

Table 7: Interrater agreement as measured by interrater correlation and Cohen's kappa.

D Sentiment of Reddit Subreddits

In Section 3, we describe how we obtain subreddits that are balanced in terms of sentiment. Here, we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28, STD=11), positive (M=41, STD=11), negative (M=19, STD=7), ambiguous (M=35, STD=8). After filtering, the distribution of sentiments across our remaining subreddits became: neutral (M=24, STD=5), positive (M=35, STD=6), negative (M=27, STD=4), ambiguous (M=33, STD=4).

E BERT's Most Activated Layers

To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with center of gravity = 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who have also found that tasks involving high-level semantics tend to make use of all BERT layers.
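The center of gravity is the expected layer index under the softmax of the learned scalar-mixing weights, i.e. CoG = sum over layers l of l * softmax(s)_l. A minimal sketch (the function name is ours, not from the paper's code):

```python
import math


def center_of_gravity(scalar_mix_weights):
    """Expected layer index under the softmax of the learned
    per-layer scalar-mixing weights (Tenney et al., 2019)."""
    exps = [math.exp(w) for w in scalar_mix_weights]
    z = sum(exps)
    probs = [e / z for e in exps]  # softmax over layers
    return sum(layer * p for layer, p in enumerate(probs))
```

With perfectly uniform weights over BERT-base's 13 layers (embeddings + 12 transformer layers), the center of gravity is 6.0; the reported 6.19 is close to this uniform value, which is what "all layers are similarly important" means quantitatively.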

Figure 4: Softmax weights of each BERT layer (layers 0-12, y-axis: softmax weight) when trained on our dataset.

F Number of Emotion Labels Per Example

Figure 5 shows the number of emotion labels per example, before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.

Figure 5: Number of emotion labels per example (x-axis: number of labels, 0-7; y-axis: number of examples), before and after filtering the labels chosen by only a single annotator.

G Confusion Matrix

Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix similarly to how we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M where M_ij denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the count for M_joy,joy, M_joy,pride, M_admiration,joy and M_admiration,pride. In practice, since most of our examples only have a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.

Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g., grief and sadness, pride and admiration, nervousness and fear).
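The counting and row-normalization scheme described above can be sketched as follows; this is an illustrative re-implementation (the function name is ours), not the paper's code.

```python
def multilabel_confusion(true_sets, pred_sets, labels):
    """Co-occurrence-style confusion matrix for multilabel predictions:
    for each example, every (true label, predicted label) pair adds one
    count to M[true][pred]; each row is then normalized by its sum."""
    idx = {lab: i for i, lab in enumerate(labels)}
    n = len(labels)
    m = [[0.0] * n for _ in range(n)]
    for true, pred in zip(true_sets, pred_sets):
        for t in true:
            for p in pred:
                m[idx[t]][idx[p]] += 1.0
    # divide each row by its sum so that rare and frequent true labels
    # become comparable (rows with no counts are left as zeros)
    for row in m:
        total = sum(row)
        if total > 0:
            for j in range(n):
                row[j] /= total
    return m
```

For the example in the text (true {joy, admiration}, predicted {joy, pride}), each true label's row splits its mass evenly between the two predicted labels.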

We also perform hierarchical clustering over the normalized confusion matrix, using correlation as a distance metric and ward as a linkage method. We find that the model learns relatively similar clusters as the ones in Figure 2, even though the training data only includes a subset of the labels that have agreement (see Figure 5).

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets that are downloadable and have categorical emotions in the Unified Dataset (Bostan and Klinger, 2018). These datasets are: DailyDialog (Li et al., 2017), Emotion-Stimulus (Ghazi et al., 2015), Affective Text (Strapparava and Mihalcea, 2007), CrowdFlower (CrowdFlower, 2016), Electoral Tweets (Mohammad et al., 2015), ISEAR (Scherer and Wallbott, 1994), the Twitter Emotion Corpus (TEC) (Mohammad, 2012), EmoInt (Mohammad et al., 2018), and the Stance Sentiment Emotion Corpus (SSEC) (Schuff et al., 2017).

We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ~4k labeled examples with a large taxonomy of 36 emotions, FREEZE gives a significant boost of performance over the BASELINE and NOFREEZE for all training set sizes besides "max".

For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple of hundred training examples. For 500-1000 training examples, NOFREEZE tends to outperform the BASELINE, but we can see that these two setups come closer when there is more training data available. These results suggest that our dataset helps if there is limited data from the target domain.

Figure 6: A normalized confusion matrix for our model predictions. The axes list the 27 emotion labels (true label on the y-axis, predicted label on the x-axis, grouped by sentiment: positive, negative, ambiguous); cell shading gives the proportion of predictions (0.00-0.75). The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment.

Figure 7: Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018). Panels: DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), Electoral Tweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets), SSEC (tweets). X-axis: training set size (100, 200, 500, 1000, max); y-axis: F-score. Series: Baseline, Transfer Learning w/ Freezing Layers, Transfer Learning w/o Freezing Layers.


admiration: great (42), awesome (32), amazing (30), good (28), beautiful (23)
amusement: lol (66), haha (32), funny (27), lmao (21), hilarious (18)
approval: agree (24), not (13), don't (12), yes (12), agreed (11)
caring: you (12), worry (11), careful (9), stay (9), your (8)
anger: fuck (24), hate (18), fucking (18), angry (11), dare (10)
annoyance: annoying (14), stupid (13), fucking (12), shit (10), dumb (9)
disappointment: disappointing (11), disappointed (10), bad (9), disappointment (7), unfortunately (7)
disapproval: not (16), don't (14), disagree (9), nope (8), doesn't (7)
confusion: confused (18), why (11), sure (10), what (10), understand (8)
desire: wish (29), want (8), wanted (6), could (6), ambitious (4)
excitement: excited (21), happy (8), cake (8), wow (8), interesting (7)
gratitude: thanks (75), thank (69), for (24), you (18), sharing (17)
joy: happy (32), glad (27), enjoy (20), enjoyed (12), fun (12)
disgust: disgusting (22), awful (14), worst (13), worse (12), weird (9)
embarrassment: embarrassing (12), shame (11), awkward (10), embarrassment (8), embarrassed (7)
fear: scared (16), afraid (16), scary (15), terrible (12), terrifying (11)
grief: died (6), rip (4)
curiosity: curious (22), what (18), why (13), how (11), did (10)
love: love (76), loved (21), favorite (13), loves (12), like (9)
optimism: hope (45), hopefully (19), luck (18), hoping (16), will (8)
pride: proud (14), pride (4), accomplishment (4)
relief: glad (5), relieved (4), relieving (4), relief (4)
nervousness: nervous (8), worried (8), anxiety (6), anxious (4), worrying (4)
remorse: sorry (39), regret (9), apologies (7), apologize (6), guilt (5)
sadness: sad (31), sadly (16), sorry (15), painful (10), crying (9)
realization: realize (14), realized (12), realised (7), realization (6), thought (6)
surprise: wow (23), surprised (21), wonder (15), shocked (12), omg (11)

Table 3: Top 5 words associated with each emotion (colors in the original mark positive, negative and ambiguous emotions). The rounded z-scored log odds ratios in the parentheses, with the threshold set at 3, indicate significance of association.
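The z-scored log odds ratios in Table 3 follow Monroe et al. (2008)'s weighted log-odds with an informative Dirichlet prior. A minimal sketch for a single word (function name and argument layout are ours): corpus i holds the comments of one emotion, corpus j the rest, and the prior pseudo-counts come from overall corpus frequencies.

```python
import math


def log_odds_z(count_i, total_i, count_j, total_j, prior_count, prior_total):
    """Z-scored log-odds ratio of one word between corpus i and corpus j
    with an informative Dirichlet prior (Monroe et al., 2008).
    count_*: occurrences of the word; total_*: corpus sizes in tokens;
    prior_count / prior_total: the word's pseudo-count and the total
    pseudo-count mass taken from the background corpus."""
    a, a0 = prior_count, prior_total
    # smoothed log-odds of the word in each corpus
    li = math.log((count_i + a) / (total_i + a0 - count_i - a))
    lj = math.log((count_j + a) / (total_j + a0 - count_j - a))
    delta = li - lj
    # approximate variance of the log-odds difference
    var = 1.0 / (count_i + a) + 1.0 / (count_j + a)
    return delta / math.sqrt(var)
```

A word that is much more frequent in the emotion's comments than elsewhere gets a large positive z; the table keeps words whose z exceeds 3.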

why we release all 58K examples with all annotators' ratings.

Grouping emotions. We create a hierarchical grouping of our taxonomy, and evaluate the model performance on each level of the hierarchy. A sentiment level divides the labels into 4 categories – positive, negative, ambiguous and Neutral – with the Neutral category intact and the rest of the mapping as shown in Figure 2. The Ekman level further divides the taxonomy using the Neutral label and the following 6 groups: anger (maps to: anger, annoyance, disapproval), disgust (maps to: disgust), fear (maps to: fear, nervousness), joy (all positive emotions), sadness (maps to: sadness, disappointment, embarrassment, grief, remorse) and surprise (all ambiguous emotions).
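The two groupings can be written as simple lookup tables. This is an illustrative sketch: the Ekman mapping is as given in the text, and the sentiment sets are our reading of Figure 2 of the paper; the helper name `to_group` is ours.

```python
# Sentiment-level grouping (sets follow Figure 2 of the paper).
SENTIMENT = {
    "positive": {"admiration", "amusement", "approval", "caring", "desire",
                 "excitement", "gratitude", "joy", "love", "optimism",
                 "pride", "relief"},
    "negative": {"anger", "annoyance", "disappointment", "disapproval",
                 "disgust", "embarrassment", "fear", "grief", "nervousness",
                 "remorse", "sadness"},
    "ambiguous": {"confusion", "curiosity", "realization", "surprise"},
}

# Ekman-level grouping, as described in the text.
EKMAN = {
    "anger": {"anger", "annoyance", "disapproval"},
    "disgust": {"disgust"},
    "fear": {"fear", "nervousness"},
    "joy": SENTIMENT["positive"],        # all positive emotions
    "sadness": {"sadness", "disappointment", "embarrassment", "grief",
                "remorse"},
    "surprise": SENTIMENT["ambiguous"],  # all ambiguous emotions
}


def to_group(labels, mapping):
    """Map a set of fine-grained labels to coarse groups; Neutral is intact."""
    groups = set()
    for lab in labels:
        if lab == "neutral":
            groups.add("neutral")
            continue
        for group, members in mapping.items():
            if lab in members:
                groups.add(group)
    return groups
```

Since an example can carry several fine-grained labels, the grouped task stays multi-label: `to_group({"annoyance", "grief"}, EKMAN)` yields both anger and sadness.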

5.2 Model Architecture

We use the BERT-base model (Devlin et al., 2019) for our experiments. We add a dense output layer on top of the pretrained model for the purposes of finetuning, with a sigmoid cross entropy loss function to support multi-label classification. As an additional baseline, we train a bidirectional LSTM.
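The sigmoid cross-entropy objective treats each of the 28 outputs (27 emotions + Neutral) as an independent binary decision, which is what makes the setup multi-label. A dependency-free sketch of the loss (the paper's actual TensorFlow implementation is not shown here):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets):
    """Sigmoid cross-entropy for multi-label classification: an
    independent binary cross-entropy per emotion label, averaged.
    logits, targets: lists of per-label rows, one row per example."""
    eps = 1e-12  # guard against log(0)
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for z, t in zip(logit_row, target_row):
            p = sigmoid(z)
            total += -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
            count += 1
    return total / count
```

Unlike a softmax over classes, the per-label sigmoids let the model assign high probability to several emotions at once (e.g. surprise and amusement in Table 1).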

5.3 Parameter Settings

When finetuning the pre-trained BERT model, we keep most of the hyperparameters set by Devlin et al. (2019) intact and only change the batch size and learning rate. We find that training for at least 4 epochs is necessary for learning the data, but training for more epochs results in overfitting. We also find that a small batch size of 16 and a learning rate of 5e-5 yield the best performance.

For the biLSTM, we set the hidden layer dimensionality to 256, the learning rate to 0.1, with a decay rate of 0.95. We apply a dropout of 0.7.

5.4 Results

Table 4 summarizes the performance of our best model, BERT, on the test set, which achieves an average F1-score of .46 (std = .19). The model obtains the best performance on emotions with overt lexical markers, such as gratitude (.86), amusement (.80) and love (.78). The model obtains the lowest F1-score on grief (0), relief (.15) and realization (.21), which are the lowest frequency emotions. We find that less frequent emotions tend to be confused by the model with more frequent emotions related in sentiment and intensity (e.g., grief with sadness, pride with admiration, nervousness with fear); see Appendix G for a more detailed analysis.

Table 5 and Table 6 show results for a sentiment-grouped model (F1-score = .69) and an Ekman-grouped model (F1-score = .64), respectively. The significant performance increase in the transition from the full to the Ekman-level taxonomy indicates that this grouping mitigates confusion among inner-group lower-level categories.

The biLSTM model performs significantly worse than BERT, obtaining an average F1-score of .41 for the full taxonomy, .54 for an Ekman-grouped model and .6 for a sentiment-grouped model.

6 Transfer Learning Experiments

We conduct transfer learning experiments on existing emotion benchmarks in order to show that our data generalizes across domains and taxonomies. The goal is to demonstrate that, given little labeled data in a target domain, one can utilize GoEmotions as baseline emotion understanding data.

Figure 3: Transfer learning results in terms of average F1-scores across emotion categories, for ISEAR (self-reported experiences), Emotion-Stimulus (FrameNet-based sentences) and EmoInt (tweets). X-axis: training set size (100, 200, 500, 1000, max); y-axis: F-score; series: Baseline, Transfer Learning w/ Freezing Layers, Transfer Learning w/o Freezing Layers. The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data.

Emotion          Precision   Recall   F1
admiration       0.53        0.83     0.65
amusement        0.70        0.94     0.80
anger            0.36        0.66     0.47
annoyance        0.24        0.63     0.34
approval         0.26        0.57     0.36
caring           0.30        0.56     0.39
confusion        0.24        0.76     0.37
curiosity        0.40        0.84     0.54
desire           0.43        0.59     0.49
disappointment   0.19        0.52     0.28
disapproval      0.29        0.61     0.39
disgust          0.34        0.66     0.45
embarrassment    0.39        0.49     0.43
excitement       0.26        0.52     0.34
fear             0.46        0.85     0.60
gratitude        0.79        0.95     0.86
grief            0.00        0.00     0.00
joy              0.39        0.73     0.51
love             0.68        0.92     0.78
nervousness      0.28        0.48     0.35
neutral          0.56        0.84     0.68
optimism         0.41        0.69     0.51
pride            0.67        0.25     0.36
realization      0.16        0.29     0.21
relief           0.50        0.09     0.15
remorse          0.53        0.88     0.66
sadness          0.38        0.71     0.49
surprise         0.40        0.66     0.50
macro-average    0.40        0.63     0.46
std              0.18        0.24     0.19

Table 4: Results based on the GoEmotions taxonomy.

6.1 Emotion Benchmark Datasets

We consider the nine benchmark datasets from Bostan and Klinger (2018)'s Unified Dataset, which vary in terms of their size, domain, quality and taxonomy. In the interest of space, we only discuss three of these datasets here, chosen based on their diversity of domains. In our experiments, we observe similar trends for the additional benchmarks, all of which are included in Appendix H.

Sentiment        Precision   Recall   F1
ambiguous        0.54        0.66     0.60
negative         0.65        0.76     0.70
neutral          0.64        0.69     0.67
positive         0.78        0.87     0.82
macro-average    0.65        0.74     0.69
std              0.09        0.10     0.09

Table 5: Results based on sentiment-grouped data.

Ekman Emotion    Precision   Recall   F1
anger            0.50        0.65     0.57
disgust          0.52        0.53     0.53
fear             0.61        0.76     0.68
joy              0.77        0.88     0.82
neutral          0.66        0.67     0.66
sadness          0.56        0.62     0.59
macro-average    0.60        0.68     0.64
std              0.10        0.12     0.11

Table 6: Results using Ekman's taxonomy.

The International Survey on Emotion Antecedents and Reactions (ISEAR) (Scherer and Wallbott, 1994) is a collection of personal reports on emotional events, written by 3,000 people from different cultural backgrounds. The dataset contains 8k sentences, each labeled with a single emotion. The categories are: anger, disgust, fear, guilt, joy, sadness and shame.

EmoInt (Mohammad et al., 2018) is part of the SemEval 2018 benchmark, and it contains crowdsourced annotations for 7k tweets. The labels are intensity annotations for anger, joy, sadness and fear. We obtain binary annotations for these emotions by using .5 as the cutoff.

Emotion-Stimulus (Ghazi et al., 2015) contains annotations for 2.4k sentences, generated based on FrameNet's emotion-directed frames. Their taxonomy is: anger, disgust, fear, joy, sadness, shame and surprise.

6.2 Experimental Setup

Training set size. We experiment with varying amounts of training data from the target domain dataset, including 100, 200, 500, 1000, and 80% (named "max") of dataset examples. We generate 10 random splits for each train set size, with the remaining examples held as a test set.

We report the results of the finetuning experiments detailed below for each data size, with confidence intervals based on repeated experiments using the splits.
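The split-generation procedure above can be sketched as follows; the helper name `make_splits` and the seeding scheme are ours, for illustration.

```python
import random

SIZES = [100, 200, 500, 1000, "max"]


def make_splits(n_examples, size, n_splits=10, seed=0):
    """Return n_splits (train_idx, test_idx) pairs for one training-set
    size; 'max' uses 80% of the examples for training, and the
    remaining examples always form the held-out test set."""
    n_train = int(0.8 * n_examples) if size == "max" else size
    splits = []
    for s in range(n_splits):
        rng = random.Random(seed + s)  # a distinct, reproducible split
        idx = list(range(n_examples))
        rng.shuffle(idx)
        splits.append((idx[:n_train], idx[n_train:]))
    return splits
```

Running the same finetuning setup on each of the 10 splits yields the spread behind the confidence intervals in Figure 3.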

Finetuning. We compare three different finetuning setups. In the BASELINE setup, we finetune BERT only on the target dataset. In the FREEZE setup, we first finetune BERT on GoEmotions, then perform transfer learning by replacing the final dense layer, freezing all layers besides the last layer, and finetuning on the target dataset. The NOFREEZE setup is the same as FREEZE, except that we do not freeze the bottom layers. We hold the batch size at 16, the learning rate at 2e-5, and the number of epochs at 3 for all experiments.

6.3 Results

The results in Figure 3 suggest that our dataset generalizes well to different domains and taxonomies, and that using GoEmotions can help in cases when there is limited data from the target domain or limited resources for labeling.

Given limited target domain data (100 or 200 examples), both FREEZE and NOFREEZE yield significantly higher performance than the BASELINE for all three datasets. Importantly, NOFREEZE results show significantly higher performance for all training set sizes, except for "max", where NOFREEZE and BASELINE perform similarly.

7 Conclusion

We present GoEmotions, a large, manually annotated, carefully curated dataset for fine-grained emotion prediction. We provide a detailed data analysis, demonstrating the reliability of the annotations for the full taxonomy. We show the generalizability of the data across domains and taxonomies via transfer learning experiments. We build a strong baseline by fine-tuning a BERT model; however, the results suggest much room for future improvement. Future work can explore the cross-cultural robustness of emotion ratings and extend the taxonomy to other languages and domains.

Data Disclaimer: We are aware that the dataset contains biases and is not representative of global diversity, and that it contains potentially problematic content. Potential biases in the data include: inherent biases in Reddit and user-base biases, the offensive/vulgar word lists used for data filtering, and inherent or unconscious bias in the assessment of offensive identity labels. The annotators were all native English speakers from India. All of these likely affect labeling, precision, and recall for a trained model. The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team. Anyone using this dataset should be aware of these limitations.

Acknowledgments

We thank the three anonymous reviewers for their constructive feedback. We would also like to thank the annotators for their hard work.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718-728.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579-586, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104-2119.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664-1674.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578-585.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46.

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019a. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69-90.

Alan S. Cowen, Hillary Anger Elfenbein, Petri Laukka, and Dacher Keltner. 2018. Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6):698-712.

Alan S. Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. In press. What music makes us feel: At least thirteen dimensions organize subjective experiences associated with music across cultures. Proceedings of the National Academy of Sciences.

Alan S. Cowen and Dacher Keltner. 2017. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900-E7909.

Alan S. Cowen and Dacher Keltner. 2019. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist.

Alan S. Cowen, Petri Laukka, Hillary Anger Elfenbein, Runjing Liu, and Dacher Keltner. 2019b. The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4):369.

CrowdFlower. 2016. https://www.figure-eight.com/data/sentiment-analysis-emotion-text/

Rosario Delgado and Xavier-Andoni Tibau. 2019. Why Cohen's kappa should be avoided as performance measure in classification. PLoS ONE, 14(9).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Maeve Duggan and Aaron Smith. 2013. 6% of online adults are reddit users. Pew Internet & American Life Project, 3:1-10.

Paul Ekman. 1992a. Are there basic emotions? Psychological Review, 99(3):550-553.

Paul Ekman. 1992b. An argument for basic emotions. Cognition & Emotion, 6(3-4):169-200.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting emotion stimuli in emotion-bearing sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 152-165. Springer.

Chao-Chun Hsu and Lun-Wei Ku. 2018. SocialNLP 2018 EmotionX challenge overview: Recognizing emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 27-31, Melbourne, Australia. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.

Chen Liu, Muhammad Osama, and Anderson De Andrade. 2019. DENS: A dataset for multi-class emotion analysis. arXiv preprint arXiv:1910.11769.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174-184.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 task 1: Affect in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 1-17, New Orleans, Louisiana. Association for Computational Linguistics.

Saif M. Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 246-255. Association for Computational Linguistics.

Saif M. Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. 2015. Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480-499.

Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of reddit communities. In Canadian Conference on Artificial Intelligence, pages 51-56. Springer.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372-403.

Emily Öhman, Kaisla Kajava, Jörg Tiedemann, and Timo Honkela. 2018. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 24-30.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Rosalind W. Picard. 1997. Affective Computing. MIT Press.

Inna Pirina and Çağrı Çöltekin. 2018. Identifying depression on reddit: The effect of training data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, pages 9-12.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In Theories of Emotion, pages 3-33. Elsevier.

James A. Russell. 2003. Core affect and the psychological construction of emotion. Psychological Review, 110(1):145.

Klaus R. Scherer and Harald G. Wallbott. 1994. Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66(2):310.

Hendrik Schuff, Jeremy Barnes, Julian Mohme, Sebastian Padó, and Roman Klinger. 2017. Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 13-23.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 70-74, Prague, Czech Republic. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Association for Computational Linguistics.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In EMNLP 2019.

Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P. Sheth. 2012. Harnessing twitter "big data" for automatic emotion identification. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pages 587-592. IEEE.

Pinar Yanardag, Manuel Cebrian, and Iyad Rahwan. 2018. Norman: World's first psychopath AI.

A Emotion Definitions

admiration: Finding something impressive or worthy of respect.
amusement: Finding something funny or being entertained.
anger: A strong feeling of displeasure or antagonism.
annoyance: Mild anger, irritation.
approval: Having or expressing a favorable opinion.
caring: Displaying kindness and concern for others.
confusion: Lack of understanding, uncertainty.
curiosity: A strong desire to know or learn something.
desire: A strong feeling of wanting something or wishing for something to happen.
disappointment: Sadness or displeasure caused by the nonfulfillment of one's hopes or expectations.
disapproval: Having or expressing an unfavorable opinion.
disgust: Revulsion or strong disapproval aroused by something unpleasant or offensive.
embarrassment: Self-consciousness, shame, or awkwardness.
excitement: Feeling of great enthusiasm and eagerness.
fear: Being afraid or worried.
gratitude: A feeling of thankfulness and appreciation.
grief: Intense sorrow, especially caused by someone's death.
joy: A feeling of pleasure and happiness.
love: A strong positive emotion of regard and affection.
nervousness: Apprehension, worry, anxiety.
optimism: Hopefulness and confidence about the future or the success of something.
pride: Pleasure or satisfaction due to one's own achievements or the achievements of those with whom one is closely associated.
realization: Becoming aware of something.
relief: Reassurance and relaxation following release from anxiety or distress.
remorse: Regret or guilty feeling.
sadness: Emotional pain, sorrow.
surprise: Feeling astonished, startled by something unexpected.

B Taxonomy Selection amp Data Collection

We selected our taxonomy through a careful multi-round process. In the first pilot round of data collection, we used emotions that were identified to be salient by Cowen and Keltner (2017), making sure that our set includes Ekman's emotion categories as used in previous NLP work. In this round, we also included an open input box where annotators could suggest emotion(s) that were not among the options. We annotated 3k examples in the first round. We updated the taxonomy based on the results of this round (see details below). In the second pilot round of data collection, we repeated this process with 2k new examples, once again updating the taxonomy.

While reviewing the results from the pilot rounds, we identified and removed emotions that were scarcely selected by annotators and/or had low interrater agreement due to being very similar to other emotions or being too difficult to detect from text. These emotions were boredom, doubt, heartbroken, indifference, and calmness. We also identified and added to our taxonomy those emotions that were frequently suggested by raters and/or seemed to be represented in the data upon manual inspection. These emotions were desire, disappointment, pride, realization, relief, and remorse. In this process, we also refined the category names (e.g., replacing ecstasy with excitement) to ones that seemed interpretable to annotators. This is how we arrived at the final set of 27 emotions + Neutral. Our high interrater agreement in the final data can be partially explained by the fact that we took interpretability into consideration while constructing the taxonomy. The dataset we are releasing was labeled in the third round, over the final taxonomy.

C Cohenrsquos Kappa Values

In Section 4.1, we measure agreement between raters via Spearman correlation, following considerations by Delgado and Tibau (2019). In Table 7, we report Cohen's kappa values for comparison, which we obtain by randomly sampling two ratings for each example and calculating Cohen's kappa between these two sets of ratings. We find that all Cohen's kappa values are greater than 0, showing rater agreement. Moreover, the Cohen's kappa values correlate highly with the interrater correlation values (Pearson r = 0.85, p < 0.001), providing corroborative evidence for the significant degree of interrater agreement for each emotion.
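The sampling procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: the `ratings` dict (example id to per-rater binary judgments for one emotion) is toy data, and Cohen's kappa (Cohen, 1960) is inlined rather than taken from a library.

```python
import random

def cohen_kappa(a, b):
    """Cohen's kappa between two aligned lists of ratings."""
    n = len(a)
    labels = set(a) | set(b)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the two raters' marginal label frequencies.
    p_expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)

def sampled_kappa(ratings, seed=0):
    """Randomly sample two ratings per example and compute kappa
    between the two resulting sets of ratings."""
    rng = random.Random(seed)
    first, second = [], []
    for judgments in ratings.values():
        a, b = rng.sample(judgments, 2)  # two distinct raters per example
        first.append(a)
        second.append(b)
    return cohen_kappa(first, second)

# Toy data: six examples, three binary judgments each for one emotion.
ratings = {0: [1, 1, 1], 1: [0, 0, 0], 2: [1, 1, 0],
           3: [0, 0, 1], 4: [1, 1, 1], 5: [0, 0, 0]}
print(sampled_kappa(ratings))
```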

Emotion          Interrater Correlation   Cohen's kappa
admiration       0.535                    0.468
amusement        0.482                    0.474
anger            0.207                    0.307
annoyance        0.193                    0.192
approval         0.385                    0.187
caring           0.237                    0.252
confusion        0.217                    0.270
curiosity        0.418                    0.366
desire           0.177                    0.251
disappointment   0.186                    0.184
disapproval      0.274                    0.234
disgust          0.192                    0.241
embarrassment    0.177                    0.218
excitement       0.193                    0.222
fear             0.266                    0.394
gratitude        0.645                    0.749
grief            0.162                    0.095
joy              0.296                    0.301
love             0.446                    0.555
nervousness      0.164                    0.144
optimism         0.322                    0.300
pride            0.163                    0.148
realization      0.194                    0.155
relief           0.172                    0.185
remorse          0.178                    0.358
sadness          0.346                    0.336
surprise         0.275                    0.331

Table 7: Interrater agreement as measured by interrater correlation and Cohen's kappa.

D Sentiment of Reddit Subreddits

In Section 3, we describe how we obtain subreddits that are balanced in terms of sentiment. Here, we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28, STD=11), positive (M=41, STD=11), negative (M=19, STD=7), ambiguous (M=35, STD=8). After filtering, the distribution of sentiments across our remaining subreddits became: neutral (M=24, STD=5), positive (M=35, STD=6), negative (M=27, STD=4), ambiguous (M=33, STD=4).

E BERTrsquos Most Activated Layers

To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with a center of gravity of 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who also found that tasks involving high-level semantics tend to make use of all BERT layers.
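The center-of-gravity statistic is simply the softmax-weighted average layer index. A minimal sketch, where the uniform raw weights are illustrative rather than the trained mixing weights from our model:

```python
import math

def center_of_gravity(raw_weights):
    """Softmax-weighted average layer index (Tenney et al., 2019)."""
    exps = [math.exp(w) for w in raw_weights]
    total = sum(exps)
    probs = [e / total for e in exps]  # softmax over layers
    return sum(layer * p for layer, p in enumerate(probs))

# Uniform weights over BERT's 13 mixing positions (embeddings + 12
# transformer layers) give a center of gravity at the middle, ~6.0;
# the paper's observed 6.19 similarly indicates near-uniform layer usage.
print(center_of_gravity([0.0] * 13))  # ~6.0
```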

[Figure 4 appears here: a bar plot of softmax weights (roughly 0.00-0.10) over layers 0-12.]

Figure 4: Softmax weights of each BERT layer when trained on our dataset.

F Number of Emotion Labels PerExample

Figure 5 shows the number of emotion labels per example, before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.

[Figure 5 appears here: a histogram of the number of labels (0-7) against the number of examples (0k-40k), with pre-filter and post-filter counts.]

Figure 5: Number of emotion labels per example, before and after filtering the labels chosen by only a single annotator.

G Confusion Matrix

Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix similarly to how we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M where M_ij denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the count for M_{joy,joy}, M_{joy,pride}, M_{admiration,joy}, and M_{admiration,pride}. In practice, since most of our examples have only a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.

Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g., grief and sadness, pride and admiration, nervousness and fear).
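The counting and row-normalization steps above can be sketched as follows; the helper name and the toy labels are illustrative, not the authors' code.

```python
def multilabel_confusion(true_sets, pred_sets, labels):
    """Co-occurrence-style confusion matrix for multilabel predictions:
    for every (true label, predicted label) pair of an example, M[i][j]
    is incremented; each row is then normalized to sum to 1."""
    m = {i: {j: 0 for j in labels} for i in labels}
    for true, pred in zip(true_sets, pred_sets):
        for i in true:
            for j in pred:
                m[i][j] += 1
    # Row-normalize so each true label's counts form a distribution.
    normalized = {}
    for i in labels:
        row_sum = sum(m[i].values())
        normalized[i] = {j: (m[i][j] / row_sum if row_sum else 0.0)
                         for j in labels}
    return normalized

# The paper's example: true = {joy, admiration}, pred = {joy, pride}.
labels = ["joy", "admiration", "pride"]
M = multilabel_confusion([{"joy", "admiration"}], [{"joy", "pride"}], labels)
print(M["joy"]["pride"])  # 0.5: the joy row splits its mass over joy and pride
```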

We also perform hierarchical clustering over the normalized confusion matrix, using correlation as a distance metric and ward as a linkage method. We find that the model learns clusters relatively similar to the ones in Figure 2, even though the training data only includes the subset of labels that have agreement (see Figure 5).

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets that are downloadable and have categorical emotions in the Unified Dataset (Bostan and Klinger, 2018). These datasets are DailyDialog (Li et al., 2017), Emotion-Stimulus (Ghazi et al., 2015), Affective Text (Strapparava and Mihalcea, 2007), CrowdFlower (CrowdFlower, 2016), Electoral Tweets (Mohammad et al., 2015), ISEAR (Scherer and Wallbott, 1994), the Twitter Emotion Corpus (TEC) (Mohammad, 2012), EmoInt (Mohammad et al., 2018), and the Stance Sentiment Emotion Corpus (SSEC) (Schuff et al., 2017).

We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ~4k labeled examples with a large taxonomy of 36 emotions, FREEZE gives a significant boost of performance over the BASELINE and NOFREEZE for all training set sizes besides "max".

For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple hundred training examples. For 500-1000 training examples, NOFREEZE tends to outperform the BASELINE, but these two setups come closer when there is more training data available. These results suggest that our dataset helps if there is limited data from the target domain.

[Figure 6 appears here: a heatmap with predicted labels on the x-axis, true labels on the y-axis, sentiment groupings (positive, negative, ambiguous) marked along the axes, and the proportion of predictions shown on a 0.00-0.75 color scale.]

Figure 6: A normalized confusion matrix for our model predictions. The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment.

[Figure 7 appears here: nine panels of F-score versus training set size (100, 200, 500, 1000, max) for DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), Electoral Tweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets), and SSEC (tweets), each comparing the Baseline, Transfer Learning w/ Freezing Layers, and Transfer Learning w/o Freezing Layers setups.]

Figure 7: Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018).


[Figure 3 appears here: three panels of F-score versus training set size (100, 200, 500, 1000, max) for ISEAR (self-reported experiences), Emotion-Stimulus (FrameNet-based sentences), and EmoInt (tweets), comparing the Baseline, Transfer Learning w/ Freezing Layers, and Transfer Learning w/o Freezing Layers setups.]

Figure 3: Transfer learning results in terms of average F1-scores across emotion categories. The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data.

Emotion         Precision  Recall  F1
admiration      0.53       0.83    0.65
amusement       0.70       0.94    0.80
anger           0.36       0.66    0.47
annoyance       0.24       0.63    0.34
approval        0.26       0.57    0.36
caring          0.30       0.56    0.39
confusion       0.24       0.76    0.37
curiosity       0.40       0.84    0.54
desire          0.43       0.59    0.49
disappointment  0.19       0.52    0.28
disapproval     0.29       0.61    0.39
disgust         0.34       0.66    0.45
embarrassment   0.39       0.49    0.43
excitement      0.26       0.52    0.34
fear            0.46       0.85    0.60
gratitude       0.79       0.95    0.86
grief           0.00       0.00    0.00
joy             0.39       0.73    0.51
love            0.68       0.92    0.78
nervousness     0.28       0.48    0.35
neutral         0.56       0.84    0.68
optimism        0.41       0.69    0.51
pride           0.67       0.25    0.36
realization     0.16       0.29    0.21
relief          0.50       0.09    0.15
remorse         0.53       0.88    0.66
sadness         0.38       0.71    0.49
surprise        0.40       0.66    0.50
macro-average   0.40       0.63    0.46
std             0.18       0.24    0.19

Table 4: Results based on GoEmotions taxonomy.

Sentiment      Precision  Recall  F1
ambiguous      0.54       0.66    0.60
negative       0.65       0.76    0.70
neutral        0.64       0.69    0.67
positive       0.78       0.87    0.82
macro-average  0.65       0.74    0.69
std            0.09       0.10    0.09

Table 5: Results based on sentiment-grouped data.

Ekman Emotion  Precision  Recall  F1
anger          0.50       0.65    0.57
disgust        0.52       0.53    0.53
fear           0.61       0.76    0.68
joy            0.77       0.88    0.82
neutral        0.66       0.67    0.66
sadness        0.56       0.62    0.59
macro-average  0.60       0.68    0.64
std            0.10       0.12    0.11

Table 6: Results using Ekman's taxonomy.

6.1 Emotion Benchmark Datasets

We consider the nine benchmark datasets from Bostan and Klinger (2018)'s Unified Dataset, which vary in terms of their size, domain, quality, and taxonomy. In the interest of space, we only discuss three of these datasets here, chosen based on their diversity of domains. In our experiments, we observe similar trends for the additional benchmarks, all of which are included in Appendix H.

The International Survey on Emotion Antecedents and Reactions (ISEAR) (Scherer and Wallbott, 1994) is a collection of personal reports on emotional events, written by 3,000 people from different cultural backgrounds. The dataset contains 8k sentences, each labeled with a single emotion. The categories are anger, disgust, fear, guilt, joy, sadness, and shame.

EmoInt (Mohammad et al., 2018) is part of the SemEval 2018 benchmark, and it contains crowdsourced annotations for 7k tweets. The labels are intensity annotations for anger, joy, sadness, and fear. We obtain binary annotations for these emotions by using 0.5 as the cutoff.

Emotion-Stimulus (Ghazi et al., 2015) contains annotations for 2.4k sentences, generated based on FrameNet's emotion-directed frames. Their taxonomy is anger, disgust, fear, joy, sadness, shame, and surprise.

6.2 Experimental Setup

Training set size. We experiment with varying amounts of training data from the target domain dataset: 100, 200, 500, and 1000 examples, and 80% of the dataset (named "max"). We generate 10 random splits for each training set size, with the remaining examples held out as a test set.

We report the results of the finetuning experiments detailed below for each data size, with confidence intervals based on repeated experiments using the splits.
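The split generation described above can be sketched as follows. This is an assumption-laden illustration: the seeding scheme and helper name are ours, not details from the paper.

```python
import random

def make_splits(examples, train_size, n_splits=10, seed=0):
    """Generate n_splits random train/test splits: train_size examples
    for training, the remaining examples held out as the test set."""
    splits = []
    for k in range(n_splits):
        rng = random.Random(seed + k)  # assumed: one seed per split
        shuffled = list(examples)
        rng.shuffle(shuffled)
        splits.append((shuffled[:train_size], shuffled[train_size:]))
    return splits

data = list(range(1000))
# "max" corresponds to 80% of the dataset used for training.
splits = make_splits(data, train_size=int(0.8 * len(data)))
train, test = splits[0]
print(len(train), len(test))  # 800 200
```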

Finetuning. We compare three different finetuning setups. In the BASELINE setup, we finetune BERT only on the target dataset. In the FREEZE setup, we first finetune BERT on GoEmotions, then perform transfer learning by replacing the final dense layer, freezing all layers besides the last layer, and finetuning on the target dataset. The NOFREEZE setup is the same as FREEZE, except that we do not freeze the bottom layers. We hold the batch size at 16, the learning rate at 2e-5, and the number of epochs at 3 for all experiments.
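The difference between the three setups comes down to which parameters receive gradient updates. A toy sketch of that selection, where the parameter names (layer_0 ... layer_11, classifier) are assumptions modeled on common BERT implementations, not the authors' actual code:

```python
def trainable_params(setup, param_names, last_layer="layer_11"):
    """Return the set of parameter names that receive gradient updates
    under each finetuning setup."""
    if setup in ("BASELINE", "NOFREEZE"):
        return set(param_names)  # every layer is finetuned
    if setup == "FREEZE":
        # Only the last encoder layer and the fresh dense head stay trainable.
        return {p for p in param_names
                if p.startswith(last_layer) or p.startswith("classifier")}
    raise ValueError(f"unknown setup: {setup}")

names = [f"layer_{i}.weight" for i in range(12)] + ["classifier.weight"]
print(sorted(trainable_params("FREEZE", names)))
# ['classifier.weight', 'layer_11.weight']
```

In a real PyTorch setup the same effect is achieved by setting `requires_grad = False` on the frozen parameters before constructing the optimizer.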

6.3 Results

The results in Figure 3 suggest that our dataset generalizes well to different domains and taxonomies, and that a model trained on GoEmotions can help in cases where there is limited data from the target domain or limited resources for labeling.

Given limited target domain data (100 or 200 examples), both FREEZE and NOFREEZE yield significantly higher performance than the BASELINE for all three datasets. Importantly, NOFREEZE results show significantly higher performance for all training set sizes except for "max", where NOFREEZE and BASELINE perform similarly.

7 Conclusion

We present GoEmotions, a large, manually annotated, carefully curated dataset for fine-grained emotion prediction. We provide a detailed data analysis, demonstrating the reliability of the annotations for the full taxonomy. We show the generalizability of the data across domains and taxonomies via transfer learning experiments. We build a strong baseline by fine-tuning a BERT model; however, the results suggest much room for future improvement. Future work can explore the cross-cultural robustness of emotion ratings and extend the taxonomy to other languages and domains.

Data Disclaimer: We are aware that the dataset contains biases and is not representative of global diversity. We are aware that the dataset contains potentially problematic content. Potential biases in the data include: inherent biases in Reddit and its user base, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in the assessment of offensive identity labels, and the fact that annotators were all native English speakers from India. All these likely affect labeling, precision, and recall for a trained model. The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team. Anyone using this dataset should be aware of these limitations of the dataset.

Acknowledgments

We thank the three anonymous reviewers for their constructive feedback. We would also like to thank the annotators for their hard work.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718-728.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579-586, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104-2119.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664-1674.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578-585.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46.

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019a. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69-90.

Alan S. Cowen, Hillary Anger Elfenbein, Petri Laukka, and Dacher Keltner. 2018. Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6):698-712.

Alan S. Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. In press. What music makes us feel: At least thirteen dimensions organize subjective experiences associated with music across cultures. Proceedings of the National Academy of Sciences.

Alan S. Cowen and Dacher Keltner. 2017. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900-E7909.

Alan S. Cowen and Dacher Keltner. 2019. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist.

Alan S. Cowen, Petri Laukka, Hillary Anger Elfenbein, Runjing Liu, and Dacher Keltner. 2019b. The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4):369.

CrowdFlower. 2016. https://www.figure-eight.com/data/sentiment-analysis-emotion-text

Rosario Delgado and Xavier-Andoni Tibau. 2019. Why Cohen's kappa should be avoided as performance measure in classification. PLoS ONE, 14(9).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Maeve Duggan and Aaron Smith. 2013. 6% of online adults are Reddit users. Pew Internet & American Life Project, 3:1-10.

Paul Ekman. 1992a. Are there basic emotions? Psychological Review, 99(3):550-553.

Paul Ekman. 1992b. An argument for basic emotions. Cognition & Emotion, 6(3-4):169-200.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting emotion stimuli in emotion-bearing sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 152-165. Springer.

Chao-Chun Hsu and Lun-Wei Ku. 2018. SocialNLP 2018 EmotionX challenge overview: Recognizing emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 27-31, Melbourne, Australia. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.

Chen Liu, Muhammad Osama, and Anderson De Andrade. 2019. DENS: A dataset for multi-class emotion analysis. arXiv preprint arXiv:1910.11769.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174-184.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 task 1: Affect in tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 1-17, New Orleans, Louisiana. Association for Computational Linguistics.

Saif M. Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 246-255. Association for Computational Linguistics.

Saif M. Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. 2015. Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480-499.

Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of Reddit communities. In Canadian Conference on Artificial Intelligence, pages 51-56. Springer.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372-403.

Emily Öhman, Kaisla Kajava, Jörg Tiedemann, and Timo Honkela. 2018. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 24-30.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Rosalind W. Picard. 1997. Affective Computing. MIT Press.

Inna Pirina and Çağrı Çöltekin. 2018. Identifying depression on Reddit: The effect of training data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, pages 9-12.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In Theories of Emotion, pages 3-33. Elsevier.

James A. Russell. 2003. Core affect and the psychological construction of emotion. Psychological Review, 110(1):145.

Klaus R. Scherer and Harald G. Wallbott. 1994. Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66(2):310.

Hendrik Schuff, Jeremy Barnes, Julian Mohme, Sebastian Padó, and Roman Klinger. 2017. Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 13-23.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 70-74, Prague, Czech Republic. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Association for Computational Linguistics.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In EMNLP 2019.

Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P. Sheth. 2012. Harnessing Twitter "big data" for automatic emotion identification. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pages 587-592. IEEE.

P. Yanardag, M. Cebrian, and I. Rahwan. 2018. Norman: World's first psychopath AI.


We describe the experimental setup in Sec-tion 62 which we use across all datasets Wefind that transfer learning helps in the case of alldatasets especially when there is limited train-ing data Interestingly in the case of Crowd-Flower which is known to be noisy (Bostan andKlinger 2018) and Electoral Tweets which is asmall dataset of sim4k labeled examples and a large

taxonomy of 36 emotions FREEZE gives a signifi-cant boost of performance over the BASELINE andNOFREEZE for all training set sizes besides ldquomaxrdquo

For the other datasets we find that FREEZE

tends to give a performance boost compared to theother setups only up to a couple of hundred train-ing examples For 500-1000 training examplesNOFREEZE tends to outperform the BASELINE butwe can see that these two setups come closer whenthere is more training data available These resultssuggests that our dataset helps if there is limiteddata from the target domain

Sentimentpositivenegativeambiguous

sentiment

curio

sity

conf

usio

nam

usem

ent

grat

itude

adm

iratio

npr

ide

appr

oval

real

izatio

nsu

rpris

eex

citem

ent

joy

relie

fca

ring

optim

ismde

sire

love fear

nerv

ousn

ess

grie

fsa

dnes

sre

mor

sedi

sapp

rova

ldi

sapp

oint

men

tan

ger

anno

yanc

eem

barra

ssm

ent

disg

ust

Predicted Label

curiosityconfusionamusementgratitudeadmirationprideapprovalrealizationsurpriseexcitementjoyreliefcaringoptimismdesirelovefearnervousnessgriefsadnessremorsedisapprovaldisappointmentangerannoyanceembarrassmentdisgust

True

Lab

el

Proportionof Predictions

000

015

030

045

060

075

Figure 6 A normalized confusion matrix for our model predictions The plot shows that the model confusesemotions with other emotions that are related in intensity and sentiment

02

03

04

F-sc

ore

DailyDialog (everyday conversations)

02

04

06

08

10Emotion-Stimulus (FrameNet-based sentences)

01

02

03

04

05

Affective Text (news headlines)

000

005

010

015

020

F-sc

ore

CrowdFlower (tweets)

006

008

010

012

ElectoralTweets (tweets)

00

02

04

06

ISEAR (self-reported experiences)

100 200 500 1000 maxTraining Set Size

01

02

03

04

05

F-sc

ore

TEC (tweets)

100 200 500 1000 maxTraining Set Size

02

04

06

08

EmoInt (tweets)

100 200 500 1000 maxTraining Set Size

060

065

070

SSEC (tweets)

Baseline Transfer Learning w Freezing Layers Transfer Learning wo Freezing Layers

Figure 7 Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger 2018)

Page 9: GoEmotions: A Dataset of Fine-Grained Emotions · 2020. 5. 5. · Sentiment balancing. We reduce sentiment bias by removing subreddits with little representation of positive, negative,

fear. We obtain binary annotations for these emotions by using 5 as the cutoff.

Emotion-Stimulus (Ghazi et al., 2015) contains annotations for 2.4k sentences generated based on FrameNet's emotion-directed frames. Their taxonomy is: anger, disgust, fear, joy, sadness, shame, and surprise.

6.2 Experimental Setup

Training set size. We experiment with varying amounts of training data from the target domain dataset, including 100, 200, 500, 1000, and 80% (named "max") of dataset examples. We generate 10 random splits for each train set size, with the remaining examples held as a test set.

We report the results of the finetuning experiments detailed below for each data size, with confidence intervals based on repeated experiments using the splits.
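As a rough sketch of this split procedure (the dataset size, seed, and variable names are illustrative assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # illustrative seed
n_examples = 5000                    # hypothetical target dataset size
train_size = 200                     # one of the sizes: 100, 200, 500, 1000
n_splits = 10

# For each train set size, draw 10 random train/test splits;
# the held-out remainder serves as the test set each time.
splits = []
for _ in range(n_splits):
    perm = rng.permutation(n_examples)
    splits.append((perm[:train_size], perm[train_size:]))
```

Metrics computed over the 10 splits then yield the confidence intervals reported with each data size.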

Finetuning. We compare three different finetuning setups. In the BASELINE setup, we finetune BERT only on the target dataset. In the FREEZE setup, we first finetune BERT on GoEmotions, then perform transfer learning by replacing the final dense layer, freezing all layers besides the last layer, and finetuning on the target dataset. The NOFREEZE setup is the same as FREEZE, except that we do not freeze the bottom layers. We hold the batch size at 16, the learning rate at 2e-5, and the number of epochs at 3 for all experiments.
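In PyTorch terms, the FREEZE setup can be sketched roughly as follows, with a small stand-in module in place of the pretrained BERT encoder (all names and sizes here are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

hidden = 768  # BERT-base hidden size

# Stand-in for a BERT encoder already finetuned on GoEmotions.
encoder = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh())

# Replace the final dense layer: 27 GoEmotions labels -> target taxonomy.
n_target_labels = 7  # hypothetical target dataset taxonomy size
head = nn.Linear(hidden, n_target_labels)

# Freeze all layers besides the (new) last layer.
for param in encoder.parameters():
    param.requires_grad = False

# Only the head's parameters are updated; hyperparameters follow the setup above.
optimizer = torch.optim.Adam(head.parameters(), lr=2e-5)
```

NOFREEZE would simply skip the freezing loop and pass all parameters to the optimizer.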

6.3 Results

The results in Figure 3 suggest that our dataset generalizes well to different domains and taxonomies, and that using a model trained on GoEmotions can help in cases where there is limited data from the target domain or limited resources for labeling.

Given limited target domain data (100 or 200 examples), both FREEZE and NOFREEZE yield significantly higher performance than the BASELINE for all three datasets. Importantly, NOFREEZE results show significantly higher performance for all training set sizes, except for "max", where NOFREEZE and BASELINE perform similarly.

7 Conclusion

We present GoEmotions, a large, manually annotated, carefully curated dataset for fine-grained emotion prediction. We provide a detailed data analysis demonstrating the reliability of the annotations for the full taxonomy. We show the generalizability of the data across domains and taxonomies via transfer learning experiments. We build a strong baseline by fine-tuning a BERT model; however, the results suggest much room for future improvement. Future work can explore the cross-cultural robustness of emotion ratings and extend the taxonomy to other languages and domains.

Data Disclaimer. We are aware that the dataset contains biases and is not representative of global diversity. We are aware that the dataset contains potentially problematic content. Potential biases in the data include: inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in the assessment of offensive identity labels, and the fact that annotators were all native English speakers from India. All these likely affect labeling, precision, and recall for a trained model. The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team. Anyone using this dataset should be aware of these limitations.

Acknowledgments

We thank the three anonymous reviewers for their constructive feedback. We would also like to thank the annotators for their hard work.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718–728.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579–586, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104–2119.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664–1674.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019a. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90.

Alan S. Cowen, Hillary Anger Elfenbein, Petri Laukka, and Dacher Keltner. 2018. Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6):698–712.

Alan S. Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. In press. What music makes us feel: At least thirteen dimensions organize subjective experiences associated with music across cultures. Proceedings of the National Academy of Sciences.

Alan S. Cowen and Dacher Keltner. 2017. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900–E7909.

Alan S. Cowen and Dacher Keltner. 2019. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist.

Alan S. Cowen, Petri Laukka, Hillary Anger Elfenbein, Runjing Liu, and Dacher Keltner. 2019b. The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4):369.

CrowdFlower. 2016. https://www.figure-eight.com/data/sentiment-analysis-emotion-text

Rosario Delgado and Xavier-Andoni Tibau. 2019. Why Cohen's kappa should be avoided as performance measure in classification. PLoS ONE, 14(9).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Maeve Duggan and Aaron Smith. 2013. 6% of online adults are Reddit users. Pew Internet & American Life Project, 3:1–10.

Paul Ekman. 1992a. Are there basic emotions? Psychological Review, 99(3):550–553.

Paul Ekman. 1992b. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting emotion stimuli in emotion-bearing sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 152–165. Springer.

Chao-Chun Hsu and Lun-Wei Ku. 2018. SocialNLP 2018 EmotionX challenge overview: Recognizing emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 27–31, Melbourne, Australia. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.

Chen Liu, Muhammad Osama, and Anderson De Andrade. 2019. DENS: A dataset for multi-class emotion analysis. arXiv preprint arXiv:1910.11769.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 1–17, New Orleans, Louisiana. Association for Computational Linguistics.

Saif M. Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation), pages 246–255. Association for Computational Linguistics.

Saif M. Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. 2015. Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480–499.

Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of Reddit communities. In Canadian Conference on Artificial Intelligence, pages 51–56. Springer.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Emily Öhman, Kaisla Kajava, Jörg Tiedemann, and Timo Honkela. 2018. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 24–30.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Rosalind W. Picard. 1997. Affective Computing. MIT Press.

Inna Pirina and Çağrı Çöltekin. 2018. Identifying depression on Reddit: The effect of training data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, pages 9–12.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In Theories of Emotion, pages 3–33. Elsevier.

James A. Russell. 2003. Core affect and the psychological construction of emotion. Psychological Review, 110(1):145.

Klaus R. Scherer and Harald G. Wallbott. 1994. Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66(2):310.

Hendrik Schuff, Jeremy Barnes, Julian Mohme, Sebastian Padó, and Roman Klinger. 2017. Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 13–23.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 Task 14: Affective text. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, Prague, Czech Republic. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Association for Computational Linguistics.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In EMNLP 2019.

Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P. Sheth. 2012. Harnessing Twitter "big data" for automatic emotion identification. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pages 587–592. IEEE.

P. Yanardag, M. Cebrian, and I. Rahwan. 2018. Norman: World's first psychopath AI.

A Emotion Definitions

admiration: Finding something impressive or worthy of respect.
amusement: Finding something funny or being entertained.
anger: A strong feeling of displeasure or antagonism.
annoyance: Mild anger, irritation.
approval: Having or expressing a favorable opinion.
caring: Displaying kindness and concern for others.
confusion: Lack of understanding, uncertainty.
curiosity: A strong desire to know or learn something.
desire: A strong feeling of wanting something or wishing for something to happen.
disappointment: Sadness or displeasure caused by the nonfulfillment of one's hopes or expectations.
disapproval: Having or expressing an unfavorable opinion.
disgust: Revulsion or strong disapproval aroused by something unpleasant or offensive.
embarrassment: Self-consciousness, shame, or awkwardness.
excitement: Feeling of great enthusiasm and eagerness.
fear: Being afraid or worried.
gratitude: A feeling of thankfulness and appreciation.
grief: Intense sorrow, especially caused by someone's death.
joy: A feeling of pleasure and happiness.
love: A strong positive emotion of regard and affection.
nervousness: Apprehension, worry, anxiety.
optimism: Hopefulness and confidence about the future or the success of something.
pride: Pleasure or satisfaction due to one's own achievements or the achievements of those with whom one is closely associated.
realization: Becoming aware of something.
relief: Reassurance and relaxation following release from anxiety or distress.
remorse: Regret or guilty feeling.
sadness: Emotional pain, sorrow.
surprise: Feeling astonished, startled by something unexpected.

B Taxonomy Selection & Data Collection

We selected our taxonomy through a careful multi-round process. In the first pilot round of data collection, we used emotions that were identified to be salient by Cowen and Keltner (2017), making sure that our set includes Ekman's emotion categories as used in previous NLP work. In this round, we also included an open input box where annotators could suggest emotion(s) that were not among the options. We annotated 3k examples in the first round. We updated the taxonomy based on the results of this round (see details below). In the second pilot round of data collection, we repeated this process with 2k new examples, once again updating the taxonomy.

While reviewing the results from the pilot rounds, we identified and removed emotions that were scarcely selected by annotators and/or had low interrater agreement due to being very similar to other emotions or being too difficult to detect from text. These emotions were boredom, doubt, heartbroken, indifference, and calmness. We also identified and added those emotions to our taxonomy that were frequently suggested by raters and/or seemed to be represented in the data upon manual inspection. These emotions were desire, disappointment, pride, realization, relief, and remorse. In this process, we also refined the category names (e.g., replacing ecstasy with excitement) to ones that seemed interpretable to annotators. This is how we arrived at the final set of 27 emotions + Neutral. Our high interrater agreement in the final data can be partially explained by the fact that we took interpretability into consideration while constructing the taxonomy. The dataset we are releasing was labeled in the third round, over the final taxonomy.

C Cohen's Kappa Values

In Section 4.1 we measure agreement between raters via Spearman correlation, following considerations by Delgado and Tibau (2019). In Table 7 we report the Cohen's kappa values for comparison, which we obtain by randomly sampling two ratings for each example and calculating the Cohen's kappa between these two sets of ratings. We find that all Cohen's kappa values are greater than 0, showing rater agreement. Moreover, the Cohen's kappa values correlate highly with the interrater correlation values (Pearson r = 0.85, p < 0.001), providing corroborative evidence for the significant degree of interrater agreement for each emotion.

Emotion          Interrater Correlation   Cohen's kappa
admiration       0.535                    0.468
amusement        0.482                    0.474
anger            0.207                    0.307
annoyance        0.193                    0.192
approval         0.385                    0.187
caring           0.237                    0.252
confusion        0.217                    0.270
curiosity        0.418                    0.366
desire           0.177                    0.251
disappointment   0.186                    0.184
disapproval      0.274                    0.234
disgust          0.192                    0.241
embarrassment    0.177                    0.218
excitement       0.193                    0.222
fear             0.266                    0.394
gratitude        0.645                    0.749
grief            0.162                    0.095
joy              0.296                    0.301
love             0.446                    0.555
nervousness      0.164                    0.144
optimism         0.322                    0.300
pride            0.163                    0.148
realization      0.194                    0.155
relief           0.172                    0.185
remorse          0.178                    0.358
sadness          0.346                    0.336
surprise         0.275                    0.331

Table 7: Interrater agreement as measured by interrater correlation and Cohen's kappa.
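The kappa computation described above can be sketched as follows with scikit-learn (the toy ratings are invented for illustration; the paper uses the actual annotator ratings per example):

```python
import random
from sklearn.metrics import cohen_kappa_score

random.seed(0)

# Hypothetical binary ratings (emotion applies: 1, does not: 0)
# from three raters, for ten examples.
ratings_per_example = [
    [1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0], [1, 0, 1],
    [1, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1],
]

# Randomly sample two ratings for each example, forming two rating
# sequences, then compute Cohen's kappa between them.
rater_a, rater_b = [], []
for ratings in ratings_per_example:
    a, b = random.sample(ratings, 2)
    rater_a.append(a)
    rater_b.append(b)

kappa = cohen_kappa_score(rater_a, rater_b)
```

A kappa above 0 indicates agreement beyond what chance alone would produce.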

D Sentiment of Reddit Subreddits

In Section 3 we describe how we obtain subreddits that are balanced in terms of sentiment. Here we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28%, STD=11%), positive (M=41%, STD=11%), negative (M=19%, STD=7%), ambiguous (M=35%, STD=8%). After filtering, the distribution of sentiments across our remaining subreddits became: neutral (M=24%, STD=5%), positive (M=35%, STD=6%), negative (M=27%, STD=4%), ambiguous (M=33%, STD=4%).

E BERT's Most Activated Layers

To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with center of gravity = 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who have also found that tasks involving high-level semantics tend to make use of all BERT layers.

Figure 4: Softmax weights of each BERT layer when trained on our dataset.
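Given the learned mixing weights, the center of gravity is simply the expected layer index under the weight distribution; a minimal sketch (the weight values below are made up for illustration):

```python
import numpy as np

# Hypothetical softmax mixing weights over BERT's 13 layers
# (embedding layer 0 plus 12 transformer layers), as in a scalar mix.
weights = np.array([0.06, 0.07, 0.07, 0.08, 0.08, 0.08, 0.08,
                    0.08, 0.08, 0.08, 0.08, 0.08, 0.08])
weights = weights / weights.sum()  # softmax outputs sum to 1

# Center of gravity (Tenney et al., 2019): sum over layers of l * w_l.
center_of_gravity = float(np.sum(np.arange(len(weights)) * weights))
```

Near-uniform weights give a center of gravity close to the middle layer, consistent with the reported value of 6.19.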

F Number of Emotion Labels Per Example

Figure 5 shows the number of emotion labels per example before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.

Figure 5: Number of emotion labels per example before and after filtering the labels chosen by only a single annotator.

G Confusion Matrix

Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix similarly to how we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M where Mij denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the count for Mjoy,joy, Mjoy,pride, Madmiration,joy, and Madmiration,pride. In practice, since most of our examples only have a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.

Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g., grief and sadness, pride and admiration, nervousness and fear).
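A minimal sketch of this multilabel confusion-count scheme and the row normalization (the labels and examples below are toy assumptions):

```python
import numpy as np

labels = ["joy", "admiration", "pride", "sadness"]
idx = {name: i for i, name in enumerate(labels)}

# Toy (true labels, predicted labels) pairs for three examples.
examples = [
    ({"joy", "admiration"}, {"joy", "pride"}),
    ({"pride"}, {"admiration"}),
    ({"sadness"}, {"sadness"}),
]

# For each true label, increment the count of every predicted label,
# as in a co-occurrence matrix.
M = np.zeros((len(labels), len(labels)))
for true_set, pred_set in examples:
    for t in true_set:
        for p in pred_set:
            M[idx[t], idx[p]] += 1

# Normalize each row (true label) by its sum, so rows sum to 1.
M_norm = M / M.sum(axis=1, keepdims=True)
```

With a single true and predicted label per example, this reduces to the ordinary confusion matrix.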

We also perform hierarchical clustering over the normalized confusion matrix, using correlation as a distance metric and Ward as a linkage method. We find that the model learns clusters relatively similar to the ones in Figure 2, even though the training data only includes a subset of the labels that have agreement (see Figure 5).

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets that are downloadable and have categorical emotions in the Unified Dataset (Bostan and Klinger, 2018). These datasets are: DailyDialog (Li et al., 2017), Emotion-Stimulus (Ghazi et al., 2015), Affective Text (Strapparava and Mihalcea, 2007), CrowdFlower (CrowdFlower, 2016), Electoral Tweets (Mohammad et al., 2015), ISEAR (Scherer and Wallbott, 1994), the Twitter Emotion Corpus (TEC) (Mohammad, 2012), EmoInt (Mohammad et al., 2018), and the Stance Sentiment Emotion Corpus (SSEC) (Schuff et al., 2017).

We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ~4k labeled examples with a large taxonomy of 36 emotions, FREEZE gives a significant boost in performance over the BASELINE and NOFREEZE for all training set sizes besides "max".

For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple of hundred training examples. For 500-1000 training examples, NOFREEZE tends to outperform the BASELINE, but we can see that these two setups come closer when there is more training data available. These results suggest that our dataset helps if there is limited data from the target domain.

Figure 6: A normalized confusion matrix for our model predictions (true label vs. predicted label, with cells showing the proportion of predictions; labels are grouped by sentiment: positive, negative, ambiguous). The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment.

Figure 7: Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018). Panels: DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), Electoral Tweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets), and SSEC (tweets). Each panel plots F-score against training set size (100, 200, 500, 1000, max) for the BASELINE, transfer learning with freezing layers (FREEZE), and transfer learning without freezing layers (NOFREEZE).



Carlo Strapparava and Rada Mihalcea 2007 SemEval-2007 task 14 Affective text In Proceedings of theFourth International Workshop on Semantic Evalua-tions (SemEval-2007) pages 70ndash74 Prague CzechRepublic Association for Computational Linguis-tics

Ian Tenney Dipanjan Das and Ellie Pavlick 2019BERT Rediscovers the Classical NLP Pipeline InAssociation for Computational Linguistics

Henry Tsai Jason Riesa Melvin Johnson Naveen Ari-vazhagan Xin Li and Amelia Archer 2019 Smalland Practical BERT Models for Sequence LabelingIn EMNLP 2019

Wenbo Wang Lu Chen Krishnaprasad Thirunarayanand Amit P Sheth 2012 Harnessing twitter lsquobigdatarsquo for automatic emotion identification In 2012International Conference on Privacy Security Riskand Trust and 2012 International Confernece on So-cial Computing pages 587ndash592 IEEE

Cebrian-M Yanardag P and I Rahwan 2018 Nor-man Worlds first psychopath ai

A Emotion Definitions

admiration: Finding something impressive or worthy of respect.
amusement: Finding something funny or being entertained.
anger: A strong feeling of displeasure or antagonism.
annoyance: Mild anger, irritation.
approval: Having or expressing a favorable opinion.
caring: Displaying kindness and concern for others.
confusion: Lack of understanding, uncertainty.
curiosity: A strong desire to know or learn something.
desire: A strong feeling of wanting something or wishing for something to happen.
disappointment: Sadness or displeasure caused by the nonfulfillment of one's hopes or expectations.
disapproval: Having or expressing an unfavorable opinion.
disgust: Revulsion or strong disapproval aroused by something unpleasant or offensive.
embarrassment: Self-consciousness, shame, or awkwardness.
excitement: Feeling of great enthusiasm and eagerness.
fear: Being afraid or worried.
gratitude: A feeling of thankfulness and appreciation.
grief: Intense sorrow, especially caused by someone's death.
joy: A feeling of pleasure and happiness.
love: A strong positive emotion of regard and affection.
nervousness: Apprehension, worry, anxiety.
optimism: Hopefulness and confidence about the future or the success of something.
pride: Pleasure or satisfaction due to one's own achievements or the achievements of those with whom one is closely associated.
realization: Becoming aware of something.
relief: Reassurance and relaxation following release from anxiety or distress.
remorse: Regret or guilty feeling.
sadness: Emotional pain, sorrow.
surprise: Feeling astonished, startled by something unexpected.

B Taxonomy Selection & Data Collection

We selected our taxonomy through a careful multi-round process. In the first pilot round of data collection, we used emotions that were identified to be salient by Cowen and Keltner (2017), making sure that our set includes Ekman's emotion categories as used in previous NLP work. In this round, we also included an open input box where annotators could suggest emotion(s) that were not among the options. We annotated 3k examples in the first round. We updated the taxonomy based on the results of this round (see details below). In the second pilot round of data collection, we repeated this process with 2k new examples, once again updating the taxonomy.

While reviewing the results from the pilot rounds, we identified and removed emotions that were scarcely selected by annotators and/or had low interrater agreement due to being very similar to other emotions or being too difficult to detect from text. These emotions were boredom, doubt, heartbroken, indifference, and calmness. We also identified and added to our taxonomy those emotions that were frequently suggested by raters and/or seemed to be represented in the data upon manual inspection. These emotions were desire, disappointment, pride, realization, relief, and remorse. In this process, we also refined the category names (e.g., replacing ecstasy with excitement) to ones that seemed interpretable to annotators. This is how we arrived at the final set of 27 emotions + Neutral. Our high interrater agreement in the final data can be partially explained by the fact that we took interpretability into consideration while constructing the taxonomy. The dataset we are releasing was labeled in the third round, over the final taxonomy.

C Cohen's Kappa Values

In Section 4.1, we measure agreement between raters via Spearman correlation, following considerations by Delgado and Tibau (2019). In Table 7, we report Cohen's kappa values for comparison, which we obtain by randomly sampling two ratings for each example and calculating Cohen's kappa between these two sets of ratings. We find that all Cohen's kappa values are greater than 0, showing rater agreement. Moreover, the Cohen's kappa values correlate highly with the interrater correlation values (Pearson r = 0.85, p < 0.001), providing corroborative evidence for the significant degree of interrater agreement for each emotion.

Emotion          Interrater Correlation    Cohen's kappa
admiration       0.535                     0.468
amusement        0.482                     0.474
anger            0.207                     0.307
annoyance        0.193                     0.192
approval         0.385                     0.187
caring           0.237                     0.252
confusion        0.217                     0.270
curiosity        0.418                     0.366
desire           0.177                     0.251
disappointment   0.186                     0.184
disapproval      0.274                     0.234
disgust          0.192                     0.241
embarrassment    0.177                     0.218
excitement       0.193                     0.222
fear             0.266                     0.394
gratitude        0.645                     0.749
grief            0.162                     0.095
joy              0.296                     0.301
love             0.446                     0.555
nervousness      0.164                     0.144
optimism         0.322                     0.300
pride            0.163                     0.148
realization      0.194                     0.155
relief           0.172                     0.185
remorse          0.178                     0.358
sadness          0.346                     0.336
surprise         0.275                     0.331

Table 7: Interrater agreement as measured by interrater correlation and Cohen's kappa.
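The sampling procedure described above can be sketched as follows. This is an illustrative sketch, not the authors' released code: the binary judgments are hypothetical, and Cohen's kappa is implemented directly from its definition for two raters.

```python
import random

def cohen_kappa(a, b):
    """Cohen's kappa for two aligned binary rating vectors."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n           # marginal rates of label 1
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)      # chance agreement from marginals
    return 1.0 if pe == 1.0 else (po - pe) / (1 - pe)

# Hypothetical binary judgments (1 = emotion selected) from each
# example's annotators, for a single emotion category.
ratings = [[1, 1, 0], [0, 0, 0], [1, 0, 1], [1, 1, 1], [0, 1, 0]]

# Randomly sample two distinct ratings per example, forming two pseudo-raters.
random.seed(0)
rater_a, rater_b = [], []
for judgments in ratings:
    i, j = random.sample(range(len(judgments)), 2)
    rater_a.append(judgments[i])
    rater_b.append(judgments[j])

kappa = cohen_kappa(rater_a, rater_b)
```

Repeating this per emotion yields one kappa value per category, as in Table 7.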

D Sentiment of Reddit Subreddits

In Section 3, we describe how we obtain subreddits that are balanced in terms of sentiment. Here, we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28%, STD=11%), positive (M=41%, STD=11%), negative (M=19%, STD=7%), ambiguous (M=35%, STD=8%). After filtering, the distribution of sentiments across our remaining subreddits became: neutral (M=24%, STD=5%), positive (M=35%, STD=6%), negative (M=27%, STD=4%), ambiguous (M=33%, STD=4%).
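A filtering step of this kind can be sketched as below. The subreddit names, proportions, and the `MIN_SHARE` cutoff are all hypothetical; the paper's actual criterion for removing sentiment-imbalanced subreddits is described in Section 3.

```python
import statistics

# Hypothetical per-subreddit proportions of comments whose predicted
# sentiment falls into each bucket (names and numbers are illustrative).
subreddits = {
    "sub_a": {"neutral": 0.22, "positive": 0.40, "negative": 0.05, "ambiguous": 0.33},
    "sub_b": {"neutral": 0.25, "positive": 0.33, "negative": 0.26, "ambiguous": 0.31},
    "sub_c": {"neutral": 0.28, "positive": 0.36, "negative": 0.24, "ambiguous": 0.35},
}

MIN_SHARE = 0.10  # assumed cutoff for "little representation" of a sentiment

# Drop subreddits in which any sentiment bucket is under-represented.
balanced = {name: p for name, p in subreddits.items() if min(p.values()) >= MIN_SHARE}

def mean_std(data, sentiment):
    """Mean and standard deviation of a sentiment's share across subreddits."""
    shares = [p[sentiment] for p in data.values()]
    return statistics.mean(shares), statistics.stdev(shares)
```

Summary statistics like those reported above (mean and standard deviation per sentiment, before and after filtering) then come from `mean_std` over the full and filtered dictionaries.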

E BERT's Most Activated Layers

To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with center of gravity = 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who have also found that tasks involving high-level semantics tend to make use of all BERT layers.

Figure 4: Softmax weights of each BERT layer when trained on our dataset (x-axis: layers 0-12; y-axis: softmax weights, roughly 0.00-0.10).
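The center-of-gravity statistic is just the weight-averaged layer index. A minimal sketch, using illustrative (not learned) scalar-mix logits:

```python
import math

# Hypothetical learned scalar-mix logits for the 13 BERT-base layers
# (embedding layer + 12 transformer layers); the values are illustrative.
logits = [0.1, 0.0, 0.2, 0.1, 0.3, 0.2, 0.1, 0.0, 0.1, 0.2, 0.1, 0.0, 0.1]

# Softmax turns the logits into the per-layer weights plotted in Figure 4.
exps = [math.exp(x) for x in logits]
weights = [e / sum(exps) for e in exps]

# Center of gravity (Tenney et al., 2019): the weight-averaged layer index.
center_of_gravity = sum(i * w for i, w in enumerate(weights))
```

With near-uniform weights over 13 layers, the center of gravity sits near the middle of the stack, which is why a value of 6.19 indicates that all layers contribute comparably.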

F Number of Emotion Labels Per Example

Figure 5 shows the number of emotion labels per example, before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.

Figure 5: Number of emotion labels per example, before and after filtering the labels chosen by only a single annotator (x-axis: number of labels, 0-7; y-axis: number of examples, 0k-40k).
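The label filtering can be sketched as follows. The annotations and the agreement threshold of two raters are assumptions for illustration; the figure caption only specifies that labels chosen by a single annotator are dropped.

```python
from collections import Counter

def agreed_labels(rater_labels, min_raters=2):
    """Keep labels selected by at least `min_raters` annotators
    (the exact threshold here is an assumption for illustration)."""
    counts = Counter(label for labels in rater_labels for label in set(labels))
    return sorted(label for label, c in counts.items() if c >= min_raters)

# Hypothetical annotations for two examples: one inner list per rater.
example_a = [["joy"], ["joy", "admiration"], ["joy"]]  # raters agree on "joy"
example_b = [["anger"], ["annoyance"], ["disgust"]]    # no label chosen twice
```

Applying `agreed_labels` to every example produces the "post-filter" distribution in Figure 5; examples like `example_b` end up with zero labels.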

G Confusion Matrix

Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix similarly to how we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M where Mij denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the counts for M(joy, joy), M(joy, pride), M(admiration, joy), and M(admiration, pride). In practice, since most of our examples only have a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.

Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g., grief and sadness, pride and admiration, nervousness and fear).

We also perform hierarchical clustering over the normalized confusion matrix, using correlation as a distance metric and ward as a linkage method. We find that the model learns relatively similar clusters as the ones in Figure 2, even though the training data only includes the subset of the labels that have agreement (see Figure 5).
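The construction above can be sketched on a toy label set. This is an illustrative sketch with four hypothetical examples, not the paper's evaluation code; the clustering step mirrors the stated choices (correlation distance, Ward linkage) via SciPy.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

labels = ["joy", "admiration", "pride", "sadness"]  # toy subset of the 27 emotions
idx = {name: i for i, name in enumerate(labels)}

# Toy (true labels, predicted labels) multilabel pairs.
examples = [
    ({"joy", "admiration"}, {"joy", "pride"}),
    ({"joy"}, {"joy"}),
    ({"pride"}, {"admiration"}),
    ({"sadness"}, {"sadness"}),
]

# Co-occurrence-style confusion counts: each true label is paired with
# every predicted label of the same example.
M = np.zeros((len(labels), len(labels)))
for true, pred in examples:
    for t in true:
        for p in pred:
            M[idx[t], idx[p]] += 1

# Row-normalize so each true label's counts sum to 1, regardless of frequency.
row_sums = M.sum(axis=1, keepdims=True)
M_norm = np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)

# Hierarchical clustering with correlation distance and Ward linkage.
Z = linkage(pdist(M_norm, metric="correlation"), method="ward")
```

Note how the first example increments four cells at once, which is exactly why the matrix behaves like a co-occurrence matrix when examples carry multiple labels.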

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets that are downloadable and have categorical emotions in the Unified Dataset (Bostan and Klinger, 2018). These datasets are: DailyDialog (Li et al., 2017), Emotion-Stimulus (Ghazi et al., 2015), Affective Text (Strapparava and Mihalcea, 2007), CrowdFlower (CrowdFlower, 2016), Electoral Tweets (Mohammad et al., 2015), ISEAR (Scherer and Wallbott, 1994), the Twitter Emotion Corpus (TEC) (Mohammad, 2012), EmoInt (Mohammad et al., 2018), and the Stance Sentiment Emotion Corpus (SSEC) (Schuff et al., 2017).

We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ~4k labeled examples with a large taxonomy of 36 emotions, FREEZE gives a significant boost in performance over the BASELINE and NOFREEZE for all training set sizes besides "max".

For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple hundred training examples. For 500-1000 training examples, NOFREEZE tends to outperform the BASELINE, but these two setups come closer when there is more training data available. These results suggest that our dataset helps when there is limited data from the target domain.
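The three setups differ only in initialization and in which parameters are updated. A minimal sketch, with a list of strings standing in for actual model parameters (the parameter names are hypothetical; the real setup is described in Section 6.2):

```python
# Stand-in parameter names for a BERT encoder plus a dataset-specific head.
params = ["bert.layer.0.weight", "bert.layer.1.weight", "classifier.weight"]

def trainable(setup):
    """Which parameters are updated under each transfer setup (sketch)."""
    if setup == "BASELINE":  # no transfer: train directly on the target dataset
        return set(params)
    if setup == "NOFREEZE":  # transfer: initialize from GoEmotions, fine-tune all
        return set(params)
    if setup == "FREEZE":    # transfer: keep the transferred encoder fixed,
        # train only the target-specific classification head
        return {p for p in params if not p.startswith("bert.")}
    raise ValueError(setup)
```

BASELINE and NOFREEZE update the same parameters; they differ in whether the encoder starts from GoEmotions-trained weights, which the sketch notes in comments rather than code.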

Figure 6: A normalized confusion matrix for our model predictions, with emotion labels grouped by sentiment (positive, negative, ambiguous) on both axes and cell shading indicating the proportion of predictions (0.00-0.75). The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment.

Figure 7: Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018). Each panel plots F-score against training set size (100, 200, 500, 1000, max) for three setups: the baseline, transfer learning with freezing layers, and transfer learning without freezing layers. Panels: DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), ElectoralTweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets), and SSEC (tweets).

Page 11: GoEmotions: A Dataset of Fine-Grained Emotions · 2020. 5. 5. · Sentiment balancing. We reduce sentiment bias by removing subreddits with little representation of positive, negative,

Emily Ohman Kaisla Kajava Jorg Tiedemann andTimo Honkela 2018 Creating a dataset formultilingual fine-grained emotion-detection usinggamification-based annotation In Proceedings ofthe 9th Workshop on Computational Approaches toSubjectivity Sentiment and Social Media Analysispages 24ndash30

F Pedregosa G Varoquaux A Gramfort V MichelB Thirion O Grisel M Blondel P PrettenhoferR Weiss V Dubourg J Vanderplas A PassosD Cournapeau M Brucher M Perrot and E Duch-esnay 2011 Scikit-learn Machine learning inPython Journal of Machine Learning Research122825ndash2830

Matthew E Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep Contextualized Word Rep-resentations In 16th Annual Conference of theNorth American Chapter of the Association for Com-putational Linguistics (NAACL)

Rosalind W Picard 1997 Affective Computing MITPress

Inna Pirina and Cagrı Coltekin 2018 Identifyingdepression on reddit The effect of training dataIn Proceedings of the 2018 EMNLP WorkshopSMM4H The 3rd Social Media Mining for HealthApplications Workshop amp Shared Task pages 9ndash12

Robert Plutchik 1980 A general psychoevolutionarytheory of emotion In Theories of emotion pages3ndash33 Elsevier

James A Russell 2003 Core affect and the psycholog-ical construction of emotion Psychological review110(1)145

Klaus R Scherer and Harald G Wallbott 1994 Evi-dence for Universality and Cultural Variation of Dif-ferential Emotion Response Patterning Journal ofpersonality and social psychology 66(2)310

Hendrik Schuff Jeremy Barnes Julian Mohme Sebas-tian Pado and Roman Klinger 2017 Annotationmodelling and analysis of fine-grained emotions ona stance and sentiment detection corpus In Pro-ceedings of the 8th Workshop on Computational Ap-proaches to Subjectivity Sentiment and Social Me-dia Analysis pages 13ndash23

Carlo Strapparava and Rada Mihalcea 2007 SemEval-2007 task 14 Affective text In Proceedings of theFourth International Workshop on Semantic Evalua-tions (SemEval-2007) pages 70ndash74 Prague CzechRepublic Association for Computational Linguis-tics

Ian Tenney Dipanjan Das and Ellie Pavlick 2019BERT Rediscovers the Classical NLP Pipeline InAssociation for Computational Linguistics

Henry Tsai Jason Riesa Melvin Johnson Naveen Ari-vazhagan Xin Li and Amelia Archer 2019 Smalland Practical BERT Models for Sequence LabelingIn EMNLP 2019

Wenbo Wang Lu Chen Krishnaprasad Thirunarayanand Amit P Sheth 2012 Harnessing twitter lsquobigdatarsquo for automatic emotion identification In 2012International Conference on Privacy Security Riskand Trust and 2012 International Confernece on So-cial Computing pages 587ndash592 IEEE

Cebrian-M Yanardag P and I Rahwan 2018 Nor-man Worlds first psychopath ai

A Emotion Definitions

admiration Finding something impressive orworthy of respectamusement Finding something funny or beingentertainedanger A strong feeling of displeasure or antag-onismannoyance Mild anger irritationapproval Having or expressing a favorable opin-ioncaring Displaying kindness and concern for othersconfusion Lack of understanding uncertaintycuriosity A strong desire to know or learn some-thingdesire A strong feeling of wanting something orwishing for something to happendisappointment Sadness or displeasure caused bythe nonfulfillment of onersquos hopes or expectationsdisapproval Having or expressing an unfavor-able opiniondisgust Revulsion or strong disapproval arousedby something unpleasant or offensiveembarrassment Self-consciousness shame orawkwardnessexcitement Feeling of great enthusiasm and ea-gernessfear Being afraid or worriedgratitude A feeling of thankfulness and appre-ciationgrief Intense sorrow especially caused by some-onersquos deathjoy A feeling of pleasure and happinesslove A strong positive emotion of regard andaffectionnervousness Apprehension worry anxietyoptimism Hopefulness and confidence aboutthe future or the success of somethingpride Pleasure or satisfaction due to ones ownachievements or the achievements of those withwhom one is closely associatedrealization Becoming aware of somethingrelief Reassurance and relaxation following releasefrom anxiety or distressremorse Regret or guilty feelingsadness Emotional pain sorrowsurprise Feeling astonished startled by some-thing unexpected

B Taxonomy Selection amp Data Collection

We selected our taxonomy through a careful multi-round process In the first pilot round of data col-

lection we used emotions that were identified to besalient by Cowen and Keltner (2017) making surethat our set includes Ekmans emotion categories asused in previous NLP work In this round we alsoincluded an open input box where annotators couldsuggest emotion(s) that were not among the op-tions We annotated 3K examples in the first roundWe updated the taxonomy based on the results ofthis round (see details below) In the second pilotround of data collection we repeated this processwith 2k new examples once again updating thetaxonomy

While reviewing the results from the pilotrounds we identified and removed emotions thatwere scarcely selected by annotators andor hadlow interrater agreement due to being very similarto other emotions or being too difficult to detectfrom text These emotions were boredom doubtheartbroken indifference and calmness We alsoidentified and added those emotions to our tax-onomy that were frequently suggested by ratersandor seemed to be represented in the data uponmanual inspection These emotions were desiredisappointment pride realization relief and re-morse In this process we also refined the categorynames (eg replacing ecstasy with excitement) toones that seemed interpretable to annotators Thisis how we arrived at the final set of 27 emotions+ Neutral Our high interrater agreement in the fi-nal data can be partially explained by the fact thatwe took interpretability into consideration whileconstructing the taxonomy The dataset is we arereleasing was labeled in the third round over thefinal taxonomy

C Cohenrsquos Kappa Values

In Section 41 we measure agreement betweenraters via Spearman correlation following consid-erations by Delgado and Tibau (2019) In Table 7we report the Cohenrsquos kappa values for compari-son which we obtain by randomly sampling tworatings for each example and calculating the Co-henrsquos kappa between these two sets of ratings Wefind that all Cohenrsquos kappa values are greater than 0showing rater agreement Moreover the Cohenrsquoskappa values correlate highly with the interratercorrelation values (Pearson r = 085 p lt 0001)providing corroborative evidence for the significantdegree of interrater agreement for each emotion

Emotion InterraterCorrelation

Cohenrsquoskappa

admiration 0535 0468amusement 0482 0474anger 0207 0307annoyance 0193 0192approval 0385 0187caring 0237 0252confusion 0217 0270curiosity 0418 0366desire 0177 0251disappointment 0186 0184disapproval 0274 0234disgust 0192 0241embarrassment 0177 0218excitement 0193 0222fear 0266 0394gratitude 0645 0749grief 0162 0095joy 0296 0301love 0446 0555nervousness 0164 0144optimism 0322 0300pride 0163 0148realization 0194 0155relief 0172 0185remorse 0178 0358sadness 0346 0336surprise 0275 0331

Table 7 Interrater agreement as measured by interratercorrelation and Cohenrsquos kappa

D Sentiment of Reddit Subreddits

In Section 3 we describe how we obtain subredditsthat are balanced in terms of sentiment Here wenote the distribution of sentiments across subred-dits before we apply the filtering neutral (M=28STD=11) positive (M=41 STD=11) neg-ative (M=19 STD=7) ambiguous (M=35STD=8) After filtering the distribution of sen-timents across our remaining subreddits becameneutral (M=24 STD=5) positive (M=35STD=6) negative (M=27 STD=4) ambigu-ous (M=33 STD=4)

E BERTrsquos Most Activated Layers

To better understand whether there are any layersin BERT that are particularly important for ourtask we freeze BERT and calculate the center of

gravity (Tenney et al 2019) based on scalar mixingweights (Peters et al 2018) We find that all layersare similarly important for our task with center ofgravity = 619 (see Figure 4) This is consistentwith Tenney et al (2019) who have also found thattasks involving high-level semantics tend to makeuse of all BERT layers

0 1 2 3 4 5 6 7 8 9 10 11 12

Layer

000

002

004

006

008

010

Soft

max W

eig

hts

Figure 4 Softmax weights of each BERT layer whentrained on our dataset

F Number of Emotion Labels PerExample

Figure 5 shows the number of emotion labels perexample before and after we filter for those labelsthat have agreement We use the filtered set oflabels for training and testing our models

0 1 2 3 4 5 6 7Number of Labels

0k

10k

20k

30k

40k

Num

ber o

f Exa

mpl

es

pre-filterpost-filter

Figure 5 Number of emotion labels per example be-fore and after filtering the labels chosen by only a sin-gle annotator

G Confusion Matrix

Figure 6 shows the normalized confusion matrix forour model predictions Since GoEmotions is a mul-tilabel dataset we calculate the confusion matrix

similarly as we would calculate a co-occurrencematrix for each true label we increase the countfor each predicted label Specifically we define amatrix M where Mij denotes the raw confusioncount between the true label i and the predictedlabel j For example if the true labels are joyand admiration and the predicted labels are joyand pride then we increase the count for MjoyjoyMjoypride Madmirationjoy and MadmirationprideIn practice since most of our examples only hasa single label (see Figure 5) our confusion matrixis very similar to one calculated for a single-labelclassification task

Given the disparate frequencies among the la-bels we normalize M by dividing the counts ineach row (representing counts for each true emo-tion label) by the sum of that row The heatmap inFigure 6 shows these normalized counts We findthat the model tends to confuse emotions that arerelated in sentiment and intensity (eg grief andsadness pride and admiration nervousness andfear)

We also perform hierarchical clustering over thenormalized confusion matrix using correlation as adistance metric and ward as a linkage method Wefind that the model learns relatively similar clustersas the ones in Figure 2 even though the trainingdata only includes a subset of the labels that haveagreement (see Figure 5)

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets thatare downloadable and have categorical emotionsin the Unified Dataset (Bostan and Klinger 2018)These datasets are DailyDialog (Li et al 2017)Emotion-Stimulus (Ghazi et al 2015) AffectiveText (Strapparava and Mihalcea 2007) Crowd-Flower (CrowdFlower 2016) Electoral Tweets(Mohammad et al 2015) ISEAR (Scherer andWallbott 1994) the Twitter Emotion Corpus (TEC)(Mohammad 2012) EmoInt (Mohammad et al2018) and the Stance Sentiment Emotion Corpus(SSEC) (Schuff et al 2017)

We describe the experimental setup in Sec-tion 62 which we use across all datasets Wefind that transfer learning helps in the case of alldatasets especially when there is limited train-ing data Interestingly in the case of Crowd-Flower which is known to be noisy (Bostan andKlinger 2018) and Electoral Tweets which is asmall dataset of sim4k labeled examples and a large

taxonomy of 36 emotions FREEZE gives a signifi-cant boost of performance over the BASELINE andNOFREEZE for all training set sizes besides ldquomaxrdquo

For the other datasets we find that FREEZE

tends to give a performance boost compared to theother setups only up to a couple of hundred train-ing examples For 500-1000 training examplesNOFREEZE tends to outperform the BASELINE butwe can see that these two setups come closer whenthere is more training data available These resultssuggests that our dataset helps if there is limiteddata from the target domain

Sentimentpositivenegativeambiguous

sentiment

curio

sity

conf

usio

nam

usem

ent

grat

itude

adm

iratio

npr

ide

appr

oval

real

izatio

nsu

rpris

eex

citem

ent

joy

relie

fca

ring

optim

ismde

sire

love fear

nerv

ousn

ess

grie

fsa

dnes

sre

mor

sedi

sapp

rova

ldi

sapp

oint

men

tan

ger

anno

yanc

eem

barra

ssm

ent

disg

ust

Predicted Label

curiosityconfusionamusementgratitudeadmirationprideapprovalrealizationsurpriseexcitementjoyreliefcaringoptimismdesirelovefearnervousnessgriefsadnessremorsedisapprovaldisappointmentangerannoyanceembarrassmentdisgust

True

Lab

el

Proportionof Predictions

000

015

030

045

060

075

Figure 6 A normalized confusion matrix for our model predictions The plot shows that the model confusesemotions with other emotions that are related in intensity and sentiment

02

03

04

F-sc

ore

DailyDialog (everyday conversations)

02

04

06

08

10Emotion-Stimulus (FrameNet-based sentences)

01

02

03

04

05

Affective Text (news headlines)

000

005

010

015

020

F-sc

ore

CrowdFlower (tweets)

006

008

010

012

ElectoralTweets (tweets)

00

02

04

06

ISEAR (self-reported experiences)

100 200 500 1000 maxTraining Set Size

01

02

03

04

05

F-sc

ore

TEC (tweets)

100 200 500 1000 maxTraining Set Size

02

04

06

08

EmoInt (tweets)

100 200 500 1000 maxTraining Set Size

060

065

070

SSEC (tweets)

Baseline Transfer Learning w Freezing Layers Transfer Learning wo Freezing Layers

Figure 7 Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger 2018)

Page 12: GoEmotions: A Dataset of Fine-Grained Emotions · 2020. 5. 5. · Sentiment balancing. We reduce sentiment bias by removing subreddits with little representation of positive, negative,

A Emotion Definitions

admiration Finding something impressive orworthy of respectamusement Finding something funny or beingentertainedanger A strong feeling of displeasure or antag-onismannoyance Mild anger irritationapproval Having or expressing a favorable opin-ioncaring Displaying kindness and concern for othersconfusion Lack of understanding uncertaintycuriosity A strong desire to know or learn some-thingdesire A strong feeling of wanting something orwishing for something to happendisappointment Sadness or displeasure caused bythe nonfulfillment of onersquos hopes or expectationsdisapproval Having or expressing an unfavor-able opiniondisgust Revulsion or strong disapproval arousedby something unpleasant or offensiveembarrassment Self-consciousness shame orawkwardnessexcitement Feeling of great enthusiasm and ea-gernessfear Being afraid or worriedgratitude A feeling of thankfulness and appre-ciationgrief Intense sorrow especially caused by some-onersquos deathjoy A feeling of pleasure and happinesslove A strong positive emotion of regard andaffectionnervousness Apprehension worry anxietyoptimism Hopefulness and confidence aboutthe future or the success of somethingpride Pleasure or satisfaction due to ones ownachievements or the achievements of those withwhom one is closely associatedrealization Becoming aware of somethingrelief Reassurance and relaxation following releasefrom anxiety or distressremorse Regret or guilty feelingsadness Emotional pain sorrowsurprise Feeling astonished startled by some-thing unexpected

B Taxonomy Selection amp Data Collection

We selected our taxonomy through a careful multi-round process In the first pilot round of data col-

lection we used emotions that were identified to besalient by Cowen and Keltner (2017) making surethat our set includes Ekmans emotion categories asused in previous NLP work In this round we alsoincluded an open input box where annotators couldsuggest emotion(s) that were not among the op-tions We annotated 3K examples in the first roundWe updated the taxonomy based on the results ofthis round (see details below) In the second pilotround of data collection we repeated this processwith 2k new examples once again updating thetaxonomy

While reviewing the results from the pilotrounds we identified and removed emotions thatwere scarcely selected by annotators andor hadlow interrater agreement due to being very similarto other emotions or being too difficult to detectfrom text These emotions were boredom doubtheartbroken indifference and calmness We alsoidentified and added those emotions to our tax-onomy that were frequently suggested by ratersandor seemed to be represented in the data uponmanual inspection These emotions were desiredisappointment pride realization relief and re-morse In this process we also refined the categorynames (eg replacing ecstasy with excitement) toones that seemed interpretable to annotators Thisis how we arrived at the final set of 27 emotions+ Neutral Our high interrater agreement in the fi-nal data can be partially explained by the fact thatwe took interpretability into consideration whileconstructing the taxonomy The dataset is we arereleasing was labeled in the third round over thefinal taxonomy

C Cohenrsquos Kappa Values

In Section 41 we measure agreement betweenraters via Spearman correlation following consid-erations by Delgado and Tibau (2019) In Table 7we report the Cohenrsquos kappa values for compari-son which we obtain by randomly sampling tworatings for each example and calculating the Co-henrsquos kappa between these two sets of ratings Wefind that all Cohenrsquos kappa values are greater than 0showing rater agreement Moreover the Cohenrsquoskappa values correlate highly with the interratercorrelation values (Pearson r = 085 p lt 0001)providing corroborative evidence for the significantdegree of interrater agreement for each emotion

Emotion InterraterCorrelation

Cohenrsquoskappa

admiration 0535 0468amusement 0482 0474anger 0207 0307annoyance 0193 0192approval 0385 0187caring 0237 0252confusion 0217 0270curiosity 0418 0366desire 0177 0251disappointment 0186 0184disapproval 0274 0234disgust 0192 0241embarrassment 0177 0218excitement 0193 0222fear 0266 0394gratitude 0645 0749grief 0162 0095joy 0296 0301love 0446 0555nervousness 0164 0144optimism 0322 0300pride 0163 0148realization 0194 0155relief 0172 0185remorse 0178 0358sadness 0346 0336surprise 0275 0331

Table 7 Interrater agreement as measured by interratercorrelation and Cohenrsquos kappa

D Sentiment of Reddit Subreddits

In Section 3, we describe how we obtain subreddits that are balanced in terms of sentiment. Here, we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28, STD=11), positive (M=41, STD=11), negative (M=19, STD=7), ambiguous (M=35, STD=8). After filtering, the distribution of sentiments across our remaining subreddits became: neutral (M=24, STD=5), positive (M=35, STD=6), negative (M=27, STD=4), ambiguous (M=33, STD=4).
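A minimal sketch of this kind of sentiment balancing: drop subreddits in which any sentiment category falls below a minimum share, then summarize the remaining distribution. The subreddit names, percentages, and the `MIN_SHARE` threshold are illustrative assumptions, not the paper's actual filtering criterion.

```python
import statistics

# Hypothetical per-subreddit sentiment shares (percent of comments).
subreddits = {
    "r/example_positive": {"positive": 55, "negative": 5, "neutral": 25, "ambiguous": 15},
    "r/example_negative": {"positive": 4, "negative": 60, "neutral": 16, "ambiguous": 20},
    "r/example_mixed":    {"positive": 30, "negative": 25, "neutral": 25, "ambiguous": 20},
}

MIN_SHARE = 10  # assumed cutoff: drop subreddits lacking any sentiment

balanced = {
    name: dist for name, dist in subreddits.items()
    if min(dist.values()) >= MIN_SHARE
}

def summarize(pool, sentiment):
    """Mean and population std of one sentiment's share across subreddits."""
    vals = [d[sentiment] for d in pool.values()]
    return statistics.mean(vals), statistics.pstdev(vals)
```

Filtering out skewed subreddits shrinks the spread of each sentiment's share, which matches the smaller post-filtering standard deviations reported above.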

E BERT's Most Activated Layers

To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with center of gravity = 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who have also found that tasks involving high-level semantics tend to make use of all BERT layers.
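The center-of-gravity statistic from Tenney et al. (2019) is the expected layer index under the softmax of the learned scalar mixing weights. A small sketch (the weight values are placeholders):

```python
import numpy as np

def center_of_gravity(scalar_weights):
    """Expected layer index under softmax(scalar_weights):
    sum_l l * softmax(s)_l, as defined by Tenney et al. (2019)."""
    w = np.exp(scalar_weights - np.max(scalar_weights))  # stable softmax
    w /= w.sum()
    layers = np.arange(len(scalar_weights))
    return float(np.dot(layers, w))

# Uniform weights over 13 mixing positions (embeddings + 12 layers)
# put the center of gravity at the midpoint, 6.0; the paper's 6.19
# indicates weights spread almost uniformly across layers.
uniform = np.zeros(13)
```

A center of gravity near the midpoint is how "all layers are similarly important" is quantified: heavily top-weighted mixes would pull the value toward 12.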

[Figure 4 plot: bar chart; x-axis: Layer (0-12); y-axis: Softmax Weights (0.00-0.10).]

Figure 4: Softmax weights of each BERT layer when trained on our dataset.

F Number of Emotion Labels Per Example

Figure 5 shows the number of emotion labels per example before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.
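The agreement filter (per Figure 5's caption, labels chosen by only a single annotator are dropped) can be sketched as a simple vote count per example. The example ratings and the helper name `filter_labels` are illustrative, not the released preprocessing code.

```python
from collections import Counter

def filter_labels(annotations, min_votes=2):
    """Keep labels selected by at least `min_votes` annotators
    of a single example; labels chosen by one rater are dropped."""
    votes = Counter(label for rater in annotations for label in rater)
    return sorted(l for l, c in votes.items() if c >= min_votes)

# Hypothetical example rated by three annotators.
raters = [{"joy", "admiration"}, {"joy"}, {"excitement"}]
kept = filter_labels(raters)
```

Here only "joy" survives, since "admiration" and "excitement" were each chosen by a single rater; applied corpus-wide, this shifts mass in Figure 5 toward fewer labels per example.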

[Figure 5 plot: bar chart; x-axis: Number of Labels (0-7); y-axis: Number of Examples (0k-40k); series: pre-filter, post-filter.]

Figure 5: Number of emotion labels per example, before and after filtering the labels chosen by only a single annotator.

G Confusion Matrix

Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix similarly to how we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M where M_ij denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the count for M_joy,joy, M_joy,pride, M_admiration,joy, and M_admiration,pride. In practice, since most of our examples only have a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.

Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g., grief and sadness, pride and admiration, nervousness and fear).
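The multilabel confusion counting and row normalization described above can be sketched as follows, using the paper's own worked example (true labels {joy, admiration}, predicted {joy, pride}). The label list is truncated for illustration.

```python
import numpy as np

LABELS = ["joy", "admiration", "pride", "sadness"]  # illustrative subset
IDX = {l: i for i, l in enumerate(LABELS)}

def confusion(examples, n=len(LABELS)):
    """Multilabel confusion matrix: for each true label of an example,
    increment the count of every predicted label; then row-normalize."""
    m = np.zeros((n, n))
    for true, pred in examples:
        for t in true:
            for p in pred:
                m[IDX[t], IDX[p]] += 1
    rows = m.sum(axis=1, keepdims=True)
    # Divide each row by its sum; rows with no counts stay zero.
    return np.divide(m, rows, out=np.zeros_like(m), where=rows > 0)

M = confusion([({"joy", "admiration"}, {"joy", "pride"})])
```

Each row of M is a distribution over predicted labels given the true label, so the heatmap is directly comparable across emotions of very different frequencies.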

We also perform hierarchical clustering over the normalized confusion matrix, using correlation as a distance metric and ward as a linkage method. We find that the model learns clusters relatively similar to the ones in Figure 2, even though the training data only includes the subset of labels that have agreement (see Figure 5).
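A sketch of this clustering step with SciPy, on a tiny hypothetical row-normalized confusion matrix (the four rows and their values are invented; only the metric and linkage choices follow the text). Note that Ward linkage formally assumes Euclidean distances, so applying it to correlation distances is a pragmatic choice rather than a textbook one.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical row-normalized confusion rows; emotions confused with
# each other have similar rows and should end up in the same cluster.
rows = np.array([
    [0.8, 0.1, 0.1, 0.0],   # "grief"
    [0.7, 0.2, 0.1, 0.0],   # "sadness" (confused with grief)
    [0.0, 0.1, 0.8, 0.1],   # "pride"
    [0.1, 0.0, 0.7, 0.2],   # "admiration" (confused with pride)
])

dist = pdist(rows, metric="correlation")   # 1 - Pearson r between rows
Z = linkage(dist, method="ward")           # agglomerative merge tree
clusters = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the tree at two clusters groups the grief/sadness rows together and the pride/admiration rows together, mirroring the sentiment-and-intensity clusters discussed above.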

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets that are downloadable and have categorical emotions in the Unified Dataset (Bostan and Klinger, 2018). These datasets are: DailyDialog (Li et al., 2017), Emotion-Stimulus (Ghazi et al., 2015), Affective Text (Strapparava and Mihalcea, 2007), CrowdFlower (CrowdFlower, 2016), Electoral Tweets (Mohammad et al., 2015), ISEAR (Scherer and Wallbott, 1994), the Twitter Emotion Corpus (TEC) (Mohammad, 2012), EmoInt (Mohammad et al., 2018), and the Stance Sentiment Emotion Corpus (SSEC) (Schuff et al., 2017).

We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ~4k labeled examples with a large taxonomy of 36 emotions, FREEZE gives a significant boost of performance over the BASELINE and NOFREEZE for all training set sizes besides "max".

For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple of hundred training examples. For 500-1000 training examples, NOFREEZE tends to outperform the BASELINE, but we can see that these two setups come closer when there is more training data available. These results suggest that our dataset helps if there is limited data from the target domain.

[Figure 6 plot: heatmap; x-axis: Predicted Label; y-axis: True Label; both axes list the 27 emotions grouped by sentiment (positive, negative, ambiguous); cell color shows Proportion of Predictions (0.00-0.75).]

Figure 6: A normalized confusion matrix for our model predictions. The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment.

[Figure 7 plots: nine panels of F-score vs. Training Set Size (100, 200, 500, 1000, max), one per dataset: DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), Electoral Tweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets), SSEC (tweets); series: Baseline, Transfer Learning w/ Freezing Layers, Transfer Learning w/o Freezing Layers.]

Figure 7: Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018).
