
The Nature of Emotional Expression in Social Media: Measurement, Inference and Utility

Munmun De Choudhury

Microsoft Research

Redmond, WA 98052


Scott Counts

Microsoft Research

Redmond, WA 98052


ABSTRACT

Today, social media platforms like Twitter and Facebook

enable in-the-moment reflection of people's attitudes,

attention, and emotions at a scale that was never available

before. In this paper, we present and explore how we can

measure, infer and utilize expression of human moods from

social media activity (e.g., Twitter). Motivated by literature

in psychology, we first measure more than 200 nuanced

human moods at scale on Twitter, through a popular

representation, known as the ‘circumplex model’ that

characterizes affective experience through two dimensions:

valence and activation. Second, moving beyond explicit

mood signals, we develop an automated classifier to infer

several human affective states in social media. Starting with

moods, we derive naturalistic signals from Twitter posts

about a variety of affects of individuals and deploy them in

a classification framework with promising results. Finally,

we illustrate the utility of emotion exploration in social

media via a case study that tracks behavioral change of new

mothers post child-birth. Our findings provide at-scale

naturalistic assessments and extensions of existing

conceptualizations of human mood, as well as indicate their

utility in domains of societal interest, such as public health.

Author Keywords

Affect, circumplex model, classification, emotion, mood,

social media, Twitter

ACM Classification Keywords

H.5.m [Information Systems]: Information Systems

Applications – Miscellaneous.

General Terms

Algorithms; Human Factors; Measurement.

INTRODUCTION

The dynamic nature of user-generated content today holds

formidable power in a culture that is increasingly shaped

and influenced by technology. In particular, the multitude

of various social media and social networking sites has

provided easy channels of opinion and emotional

expression to large audiences. On top of such ubiquity,

since social media sites provide a conducive platform to

provide and receive social support around issues

surrounding our daily lives, such in-the-moment emotional

expression of current experiences has been on the rise in the

past few years1: be it about the global economy broadly, the

riots in Egypt, or a personal vacation trip with one’s family.

As a timely research focus, judicious tracking and

monitoring of emotion at scale could potentially trigger a

societal shift with enough support and publicity.

Furthermore, they can enable new information-seeking

approaches; for instance, identifying search features given

an emotion attribute, or enabling effective “emotionally-

reflective” interfaces. Consequently, there is significant

value to be derived from measuring and

classifying/inferring human emotion in social media that

can potentially be utilized to impact several domains of

public interest – including health, finance, entertainment,

advertising, politics or evolution of language and culture.

What are emotions? According to literature in

psychology, emotions are complex patterns of cognitive

processes, physiological arousal, and behavioral reactions

[11]. Emotions arouse us to action and direct and sustain

that action. They also help us organize our experience by

directing attention, and by influencing our perceptions of

self, others, and our interpretation and memory of events

[21]. As renowned sociologist Herbert Blumer postulated,

what goes on around us sinks into the reservoirs of our

mind and changes how we think.

Given the critical value in understanding human emotion,

researchers have attempted to describe emotion

(alternatively affect) through a set of dimensions, typically

Positive Affect and Negative Affect, where the two

dimensions vary independently of each other [10,12].

However, research in psychology provides evidence that

rather than being independent, these dimensions are

interrelated in a highly systematic fashion [13].

Consequently, psychology researchers devised a

psychometric instrument called the circumplex [20], a

1 http://gizmodo.com/5870075/everyone-on-twitter-is-increasingly-depressed



spatial model in which affective concepts fall on a circle

and enable quantification of the cognitive structure of

affective experience. The circumplex represents each

affective concept (or, mood) via two dimensions: a

pleasure-displeasure measure, known as valence, and a

degree-of-arousal dimension, known as activation.

Motivated along these lines, in this work, we utilize the

circumplex model as a tool to measure more than 200 fine-

grained human moods at scale, as expressed on social

media (Twitter). We also analyze several attributes of

collective behavior in the light of measurement of moods:

including usage levels, linguistic diversity as well as

activity rates associated with different mood expression on

social media. We thereby reveal how we can provide at-

scale naturalistic assessments and extensions of existing

conceptualizations of moods (extended version in [4]).

However, note that although measuring these moods in

the light of human behavior can lend us useful insights,

it does not help us infer moods broadly in the scope of any

arbitrary social media content. For example, the discussion

presented so far (lexicon-driven approach) relies on the

explicit presence of a mood in a Twitter post; however,

statistically, such posts are likely to be a small percentage of all posts.

Hence an automated method of inferring the broad mood

(or affective state) might yield better coverage. Automatic

classification of affect might be useful in domains where

understanding the expressed mood of an author has some

practical utility: that could range from advertising,

recommendations, to tracking behavior of populations in

health, socio-economics or urban development domains.

Hence we utilize a wide range of moods to act as

supervisory ground truth signals in the development of an

affect classifier (see [5] for details). The classifier can infer

a range of fine-grained affective states from posts on social

media (beyond simply the valence of the affect descriptors),

that might not bear explicit signals about emotion. Such

inferred states would reflect the specific content, language

and state of the individual sharing the content, i.e., the

distinctive qualities of individuals’ affects.

Inevitably, there is considerable practical utility in

measuring and inferring human emotion at scale. For

instance, we rely on the general observation that emotion-bearing

content, such as dark (negative) postings, can serve

as signs of depression in individuals and can be utilized as

an early warning system for timely intervention, or to reduce,

non-invasively, the stigma around mental illness.

Motivated along these lines, in the final part of the paper,

we explore a particular utility domain of interest:

investigating behavioral and emotional change of new

mothers in the postnatal phase, compared to their prenatal

behavior, based on their postings on Twitter.

The rest of this paper is organized as follows. Following

a review of prior work, we discuss the three major segments

of interest in the paper in three sections: measuring moods,

automatic classification of affect, and finally a case study to

illustrate the utility of studying emotion dynamics in the

context of social media. We conclude with a discussion of

our major findings, and future research directions.

BACKGROUND LITERATURE

Considerable research in psychology has defined and

examined human emotion and mood (e.g., [8]), with basic

moods encompassing positive experiences like joy and

acceptance, negative experiences like anger and disgust,

and others like anticipation that are less clearly positive or

negative. The activation attribute of moods, together with

valence (i.e., the degree of positivity/negativity of a mood)

characterize the structure of affective experience (ref. PAD

emotional state model in [13]; also the circumplex model of

affect in [20]). In these works, authors utilized self-reports

of affective concepts to scale and order emotion types on

the pleasure-displeasure scale (valence) and the degree-of-

arousal scale (activation) based on perceived similarity

among the terms.

Turning to analyses of mood in social media, early work

focused on sentiment in weblogs. Mihalcea et al. [14]

utilized happy/sad labeled blog posts on LiveJournal to

determine temporal and semantic trends of happiness.

Similarly, Mishne et al. [16] utilized LiveJournal data to

build models that predict levels of various moods and

understand their seasonal cycles according to the language

used by bloggers. More recently, Bollen et al. [1] analyzed

trends of public moods (using a psychometric instrument

POMS to extract six mood states) in light of a variety of

social, political and economic events.

Research involving affect exploration on social websites

such as Facebook and Twitter has looked at trends of use of

positive and negative words [12]. Recently, Golder et al.

[10] studied how individual mood varies from hour-to-hour,

day-to-day, and across seasons and cultures by measuring

positive and negative affect in Twitter posts, using the

lexicon LIWC (http://www.liwc.net/).

Prior research has also tackled automatic classification of

sentiment in online domains [19]. These machine learning

techniques need extensive manual labeling of text for

creating ground truth. Some of these issues have been

tackled by utilizing emoticons present in text as labels for

sentiment [3], although they tend to perform well mostly in

the context of the two basic positive and negative affect

classes. The closest attempt towards multiclass affect

classification has been on LiveJournal data: the mood tags

associated with posts were used as ground truth [15].

An alternative that circumvents the problems of machine

learning techniques has been the use of generic sentiment

lexicons such as WordNet, LIWC, and other lists [9].

Recently, there has been a growing interest in

crowdsourcing techniques to manually rate polarity in

Twitter posts [16].


Limitations

Despite widespread interest, it is clear that the notions of

affect and sentiment have been rather simplified in current

state-of-the-art, often confined to their valence measure

(positive/negative), with the six moods in [1] being an

exception. However, as indicated, the psychology literature

suggests that understanding the inter-relatedness of valence

and activation is important in conceptualizing human

emotion. Additionally, when it comes to classifying or

inferring emotional attributes, hand-labeled ground truth for

affect classification or manually curated word lists are

likely to be unreliable and scale poorly on noisy, topically-

diverse social media data, such as Twitter.

Finally, another problem with polarity-centric sentiment

classifiers is that they typically encompass a vague notion

of polarity that includes both emotion and opinion, and lump

them all into two classes, “positive” and “negative”; see

[23]. In order to better make sense of emotional behavior on

social media, we require a principled notion of human

emotion – a contribution discussed in this paper.

MEASUREMENT OF HUMAN MOODS

We will begin with our methodology of measuring emotion,

in terms of moods, and understanding behavioral

differences found in mood expression across individuals at

scale. As pointed out earlier, we note that despite

considerable interest in mining and analyzing human

emotion, simplification of emotions to merely positive and

negative dimensions may miss important nuances in mood

expression. For instance, annoyed and frustrated are both

negative, but they express two very different emotional

states. A primary research challenge, therefore, is finding a

principled way to identify a set of words that can truly help

us measure emotional states of individuals, as well as

characterize their nuances in terms of both their valence and

arousal measure, known as activation.

Hence we study a popular representation of human mood

landscape: the ‘circumplex model’. For the purpose, we

first present a systematic method to identify moods in social

media that captures the broad range of individuals’

emotional states using mechanical turk studies and forays

into the psychology literature. Second we perform analysis

of these moods in the light of individuals’ behavioral

attributes to reveal nuances of collective mood expression.

Construction of Mood Lexicon

We begin by discussing our methodology to construct a

mood lexicon – a set of words that would indicate

individuals’ broad emotional states. We then characterize

the moods by the two dimensions of valence and activation,

and discuss how we infer them.

We utilize five established sources to develop a mood

lexicon:

1. ANEW: ANEW (Affective Norms for English Words)

that provides a set of normative emotional ratings for

~2000 words in English, including their valence,

activation and dominance ratings [2].

2. LIWC: For LIWC, we use sentiment-indicative

categories like positive/negative emotions, anxiety, anger

and sad (http://www.liwc.net/).

3. EARL: Emotion Annotation and Representation

Language dataset that classifies 48 emotions

(http://emotion-research.net/projects/humaine/earl).

4. A list of “basic emotions” provided in [18], e.g., fear,

contentment, disgust etc.

5. A list of moods provided by the blogging website

LiveJournal (http://www.livejournal.com/).

However, this large ensemble of words is likely to

contain several words that do not necessarily define a mood

(e.g., sleepy is a state of a person, rather than a mood). To

circumvent this issue, we first perform a mechanical turk

study (http://aws.amazon.com/mturk/) to narrow our

candidate set to truly mood-indicative words. In our task,

each word had a 1 – 7 Likert scale (1 indicated not a mood

at all, 7 meant definitely a mood). Only turkers from the

U.S. and having an approval rating greater than 95% were

allowed. Combining 12 different turkers’ ratings, we

construct a list of those words where the median rating was

at least 4, and the standard deviation was less than or equal

to 1. The final set of mood words contained 203 terms

(examples include: excited, nervous, quiet, grumpy,

depressed, patient, thankful, bored).
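The filtering rule described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names and the toy ratings are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was applied.

```python
from statistics import median, pstdev

def is_mood_word(ratings, min_median=4, max_std=1.0):
    """Return True if a word's 1-7 Likert ratings pass the filter:
    median at least min_median and standard deviation at most max_std.
    Population standard deviation is an assumption of this sketch."""
    return median(ratings) >= min_median and pstdev(ratings) <= max_std

def filter_candidates(ratings_by_word):
    """Keep only the candidate words whose ratings pass the filter."""
    return sorted(w for w, r in ratings_by_word.items() if is_mood_word(r))

# Toy ratings from 12 hypothetical raters (invented, not data from the study).
toy = {
    "excited": [6, 7, 6, 7, 6, 6, 7, 6, 7, 6, 6, 7],  # high, consistent
    "sleepy":  [2, 3, 2, 1, 3, 2, 2, 3, 2, 2, 1, 3],  # low median
    "blocked": [1, 7, 2, 6, 7, 1, 5, 3, 7, 2, 6, 4],  # raters disagree
}
print(filter_candidates(toy))  # prints ['excited']
```

Requiring both conditions keeps words that raters agree are moods and discards words that are rated low or rated inconsistently.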

Given the final list of representative moods, our next task

was to determine the values of the valence and activation

dimensions of each mood. For those words in the final list

that were present in the ANEW lexicon, we use the source-

provided measures of valence and activation, as these

values in the ANEW corpus had already been computed

after extensive and rigorous psychometric studies. For the

remaining words, we conduct another turk study, to

systematically collect these measurements.

Like before, we considered only those turkers who had at

least a 95% approval rating and were from the U.S. For a

given mood, each turker (12 turkers for each word) was

asked to rate the valence and activation measures, on two

different 1 – 10 Likert scales (1 indicated low valence/low

activation, while 10 indicated high values). Finally, we used

the mean ratings for each word as the final measures of its

valence and activation (Fleiss-Kappa measure was 0.65).
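The two-source assignment of valence and activation can be illustrated with a short sketch. The function name, the dictionary layout, and the numbers below are hypothetical placeholders (in particular, they are not actual ANEW norms); the sketch only shows the fallback logic: use source-provided values when a word is in the norms table, otherwise average the individual ratings.

```python
from statistics import mean

def mood_coordinates(word, anew_norms, turk_ratings):
    """Return (valence, activation) for a mood word.

    anew_norms:   {word: (valence, activation)} from a norms table.
    turk_ratings: {word: [(valence, activation), ...]}, one pair per rater.
    """
    if word in anew_norms:
        # Source-provided values take precedence.
        return anew_norms[word]
    ratings = turk_ratings[word]
    # Otherwise use the mean rating on each dimension.
    return (mean(v for v, _ in ratings), mean(a for _, a in ratings))

# Placeholder values for illustration only.
norms = {"happy": (8.0, 6.5)}
ratings = {"grumpy": [(2, 6), (3, 7), (4, 5)]}
print(mood_coordinates("happy", norms, ratings))
print(mood_coordinates("grumpy", norms, ratings))
```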

Data

For data collection, we utilized the Twitter Firehose and

focused on one year's worth of Twitter posts posted in

English from Nov 1, 2010 to Oct 31, 2011. From this

ensemble, in order to obtain Twitter posts for each of the

203 moods, we resorted to a method that could infer mood-

containing posts reasonably consistently and in a principled

manner. We conjecture that posts containing moods as

hashtags at the end are likely to capture the emotional state

of the individual, in the limited context of the post. This is

motivated by prior work where Twitter's hashtags and

smileys were used as labels for sentiment classifiers [3]. We


also referred to the study in [4], which indicated that in

83% of the cases, hashtagged moods at the end of posts

indeed captured the users' moods. For instance, “#iphone4

is going to be available on verizon soon! #excited”

expresses the mood ‘excited’. By this process, our labeled

mood dataset comprised about 10.6M tweets from about

4.1M users.
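One way to select posts whose final hashtag is a mood word is sketched below. The regular expression and the function name are assumptions made for illustration; the text does not describe the exact matching procedure used on the Firehose data.

```python
import re

# A hashtag that is the last token of the post (trailing whitespace allowed).
TRAILING_HASHTAG = re.compile(r"#(\w+)\s*$")

def trailing_mood(post, mood_lexicon):
    """Return the mood named by the post's final hashtag, or None if the
    post does not end with a hashtag from the lexicon."""
    match = TRAILING_HASHTAG.search(post)
    if match is None:
        return None
    tag = match.group(1).lower()
    return tag if tag in mood_lexicon else None

lexicon = {"excited", "bored"}  # toy subset of the 203 moods
post = "#iphone4 is going to be available on verizon soon! #excited"
print(trailing_mood(post, lexicon))  # prints excited
```

Hashtags that appear earlier in a post, such as `#iphone4` in the example, are ignored, which matches the conjecture that only end-of-post mood hashtags reflect the author's state.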

Deciphering Human Behavior through Moods

We now study various aspects of the relationship between

mood expression and human behavior in social media

(Twitter). We present four aspects of such behavior: usage

levels, diurnal patterns, diversity of language use, and

activity rate, given a mood, in the next four subsections.

Usage Levels of Moods

Our study of mood exploration on Twitter data is based on

analyzing the circumplex model of moods in terms of the

moods’ usage frequencies. We illustrate these mood usage

frequencies (count over all posts) on the circumplex model

in Figure 2, where the size of each square (i.e., mood) is

proportional to its frequency. We note that the usage of

moods differs considerably across the quadrants

(the differences between each pair of quadrants were found

to be statistically significant based on independent sample t-

tests: p<0.0001). The overall trend shows that moods in Q3

(low valence, low activation) tend to be used extensively

(sad, bored, annoyed, lazy), along with a small number of

moods in Q1, of relatively higher valence and activation

(happy, optimistic). Overall, usage frequencies of lower

valence moods exceed those of higher valence moods.
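The quadrant-level comparison can be sketched as follows. The quadrant labels follow the text (Q1: high valence, high activation; Q2: low valence, high activation; Q3: low valence, low activation; Q4: high valence, low activation). The midpoint of 5.5 used to split each axis, the function names, and the toy numbers are assumptions of this sketch, not values reported in the study.

```python
from collections import Counter

def quadrant(valence, activation, midpoint=5.5):
    """Map a (valence, activation) pair to a circumplex quadrant.
    The midpoint is an assumed threshold for a 1-10 scale."""
    high_valence = valence >= midpoint
    high_activation = activation >= midpoint
    if high_valence and high_activation:
        return "Q1"
    if not high_valence and high_activation:
        return "Q2"
    if not high_valence and not high_activation:
        return "Q3"
    return "Q4"

def usage_by_quadrant(mood_counts, coordinates, midpoint=5.5):
    """Sum post counts of all moods that fall in each quadrant."""
    totals = Counter()
    for mood, count in mood_counts.items():
        valence, activation = coordinates[mood]
        totals[quadrant(valence, activation, midpoint)] += count
    return totals

# Toy coordinates and counts, invented for illustration.
coords = {"happy": (8.0, 6.5), "sad": (2.0, 3.0), "bored": (3.0, 2.5)}
counts = {"happy": 120, "sad": 200, "bored": 150}
print(usage_by_quadrant(counts, coords))
```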

We hypothesize the presence of a “broadcasting bias”

behind these observations. Since individuals often use

Twitter to broadcast their opinions and feelings on various

topics, it is likely that the mood about some information

needs to be of sufficiently low or high valence to be worth

reporting to the audience. This appears to be particularly

true with respect to positive valence terms, with mildly

positive moods expressed only rarely. The observation that

lower valence moods are shared more often might be due to

individuals seeking social support from their audiences in

response to various happenings externally as well as in their

own lives. In a sense, via observing the usage levels of

moods, we are able to validate the topology of the

circumplex model in the social media context and show

that all moods are after all not created equal!

Temporal Patterns of Mood Use

We follow with the trail of investigating mood usage in the

light of diurnal and weekly behavioral patterns. In Figure 1

we show the usage of the 203 moods in Twitter posts over a

24 hour period, averaged over all users and all days in our

dataset (data points within the 95% confidence interval are

considered). We divide usage over the course of a day into

four different timings: morning (5am-12pm), afternoon

(12pm-5pm), evening (4pm-10pm), nightowl (10pm-5am).

The main observation from the circumplex model of the

diurnal patterns is that greater usage activity in general, in

terms of mood expression occurs during the evening and

night, with mornings having the least usage. Evenings show

Figure 2. Circumplex model showing usage frequencies of moods used as hashtags at the end of Twitter posts: larger squares represent higher frequency of usage.

Figure 1. Circumplex model showing mood use diurnally, over a 24 hour period. Each day is divided into four types of usage timings: morning (5am-12pm), afternoon (12pm-5pm), evening (4pm-10pm) and nightowl (10pm-5am).

both high use of negative and positive valence moods (e.g.,

negative moods: unhappy, tired, sad, lazy, worthless;

positive moods: win, nice, good, terrific); while at nights,

the overall activation of moods used tends to increase

compared to other times of the day (e.g., terrified, blocked,

stressed). Intuitively this appears to conform to common

expectations: people are likely to feel tired and worthless at

the end of a work day; at the same time certain others

probably feel blocked and stressed at nights (also ref. [10]).

We further present temporal patterns of mood usage

during weekdays (Monday-Thursday) and weekends

(Friday-Sunday) in Figure 4. The primary finding from the

two circumplex models is that positive moods are more

extensively used during weekends, compared to weekdays;

whereas negative ones are predominant during the

weekdays. Additionally, the bulk of the moods shared on

weekdays have higher activation than those shared on weekends. This probably

is reflective of people being busy around a work week,

while tending to be more relaxed during the weekends,

thereby showing lower activation in mood expression.
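The temporal buckets used in this subsection can be expressed as a small helper. This is a sketch under stated assumptions: the function names are hypothetical, timestamps are assumed to be in the poster's local time, and because the stated afternoon (12pm-5pm) and evening (4pm-10pm) ranges overlap, the sketch assigns 4pm-5pm to the afternoon, which is an arbitrary choice.

```python
from datetime import datetime

def time_of_day(hour):
    """Map an hour (0-23) to one of the four usage timings."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:   # assumption: evening starts at 5pm
        return "evening"
    return "nightowl"     # 10pm-5am

def week_part(timestamp):
    """Monday-Thursday count as weekdays, Friday-Sunday as weekend."""
    # datetime.weekday(): Monday is 0, Sunday is 6.
    return "weekday" if timestamp.weekday() <= 3 else "weekend"

def bucket(timestamp):
    """Return the (time of day, week part) bucket of a post."""
    return time_of_day(timestamp.hour), week_part(timestamp)

print(bucket(datetime(2011, 1, 3, 9, 30)))  # a Monday morning
print(bucket(datetime(2011, 1, 7, 23, 0)))  # a Friday night
```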

Linguistic Diversity of Moods

We now explore how the diversity of linguistic content

associated with usage of various moods relates to their

valence and activation. Like before, we show the moods (as

squares) on the circumplex model (Figure 3). The color of

each square indicates a mood’s normalized entropy, defined

as the entropy of the textual content (i.e., unigrams over all

posts associated with the mood), divided by the total

number of posts expressing the mood. In the figure, lighter

shades indicate higher entropy. We observe that squares on

the right side of the circumplex model (Q1, Q4) tend to

have higher entropy than left (Q2, Q3) (statistically

significant based on an independent sample t-test). This

indicates that while positive moods tend to be shared across

a wide array of linguistic contexts (topics, events, etc.),

negative moods tend to be shared in a limited context,

confined to limited topics.
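The normalized entropy defined above can be computed as in the following sketch. Whitespace tokenization, lowercasing, and the base-2 logarithm are assumptions of this illustration; the text specifies only that the entropy is taken over unigrams of all posts for a mood and divided by the number of those posts.

```python
import math
from collections import Counter

def normalized_entropy(posts):
    """Shannon entropy of the unigram distribution over all posts that
    express a mood, divided by the number of such posts."""
    counts = Counter(tok for post in posts for tok in post.lower().split())
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    return entropy / len(posts)

# Two toy posts over two equally frequent words: entropy is 1 bit,
# so the normalized value is 1 / 2.
print(normalized_entropy(["a b", "a b"]))  # prints 0.5
```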

Activity Rates of Moods

Next we investigate the relationship between an

individual’s activity rate and his/her mood expression. We

define a measurement of how “active” s/he is in sharing

posts: the number of posts shared per second since the time

of the individual’s account creation.
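The activity rate can be written directly from this definition. The function and argument names are hypothetical, and the choice of the observation time (for example, the end of the data collection window) is an assumption, since the text does not state the reference point.

```python
from datetime import datetime

def activity_rate(num_posts, account_created, observed_at):
    """Posts shared per second between account creation and the
    observation time."""
    elapsed = (observed_at - account_created).total_seconds()
    if elapsed <= 0:
        raise ValueError("observation time must follow account creation")
    return num_posts / elapsed

created = datetime(2010, 11, 1)
observed = datetime(2010, 11, 2)   # exactly one day (86,400 seconds) later
print(activity_rate(864, created, observed))  # prints 0.01
```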

Based on this definition, we show the circumplex model

of moods in Figure 5. The size of each square in this case is

proportional to the mean rate of activity of all individuals

who have shared the particular mood. The figure shows that

the majority of the larger squares (or moods shared by

highly “active” individuals) lie in Q1 and Q4; in other

words, high (or positive) valence moods are shared by

highly active individuals (statistically significant based on

independent sample t-tests between quadrant pairs). On the

other hand, moods of high activation (in Q2) but low

valence are shared primarily by individuals with a low

activity rate. In general, this indicates that positive moods

are associated more frequently with active individuals,

while negative and high arousal moods appear to be shared

more frequently by individuals with low activity rates.

Through the studies so far, our major contribution thus

lies in the large-scale naturalistic validation of the

circumplex model encompassing a variety of human moods

expressed online. However, we note that, on noisy and

immensely large and diverse streams like social media,

collecting explicit signals on moods (e.g. in the form of

hashtags) might pose a significant research challenge [4,

5]. In the following section, we therefore present an

automatic classification framework that can infer affective

states automatically.

AUTOMATED INFERENCE OF AFFECTIVE STATES

We discuss how the 203 moods can be utilized to infer a

wide variety of human affective states in social media in an

automated manner. The primary challenge in classifying

Figure 3. Circumplex model showing entropies of moods in terms of the content of posts: higher valence moods (shown in lighter shades) have higher entropy.

Figure 4. Circumplex model showing mood use during weekdays (Monday-Thursday) and weekends (Friday-Sunday). Weekdays show more negative, higher activation moods.

affect lies in the unavailability of ground truth: an aspect

often circumvented via manual labeling in order to create

training examples. As we move to social media domains

featuring enormous data, coupled with unavailability of

ground truth, gathering appropriate training data

necessitates a scalable alternative approach. Besides, in a

typical sentiment classification setting, two broad, general

factors – Positive Affect (PA) and Negative Affect (NA) –

have emerged reliably as the dominant dimensions of

emotional experience. However it is imperative to account

for more distinguishable and fine-grained affective states.

In this section we propose an affect classifier of social

media data, along with promising results, that does not rely

on any hand-built list of features or words, except for the

roughly 200 mood hashtags that we use as a supervisory

ground truth signal (extended results in [5]).

Inferring Affective States from Moods

The psychology literature indicates that there is an implicit

relationship between the externally observed affect and the

internal mood of an individual [22]. When affect is detected

by an individual (e.g., smile as an expression of joviality), it

is characterized as an emotion or mood. In this subsection,

we, therefore, discuss how we can utilize our previously

discussed mood lexicon to find a mapping to broad

affective states.

Affect Types. Although affect has been found to comprise

broadly positive and negative dimensions (PANAS –

positive and negative affect schedule [22]), we are

interested in more fine-grained representation of human

affect. Hence we utilize a source known as PANAS-X [22].

PANAS-X defines 11 specific affects apart from the

positive and negative dimensions – ‘fear’, ‘sadness’, ‘guilt’,

‘hostility’, ‘joviality’, ‘self-assurance’, ‘attentiveness’,

‘shyness’, ‘fatigue’, ‘surprise’, and ‘serenity’.

Affect          #moods    Affect         #moods
Joviality       30        Fear           14
Fatigue         19        Guilt          5
Hostility       17        Surprise       8
Sadness         38        Shyness        7
Serenity        12        Attentiveness  2
Self-assurance  20

Table 1. Number of moods associated with the affects.

Mood to Affect Associations. Next, based on the mapping

of moods to affects provided in the PANAS-X literature, we

derived associations for 60% of the moods in our lexicon. For

the remaining associations, we conducted a turk study [5].

Each turker (12 in all, per mood) was asked to select the

most appropriate affect that described a particular mood.

We combined the ratings per mood, and used the affect that

received the majority rating (Fleiss-Kappa was 0.7). The

distribution of #moods over affects is shown in Table 1.
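Combining the raters' choices for a mood can be sketched as a simple vote. The function name is hypothetical, and two details are assumptions of this sketch: the winning affect is the most frequently chosen one, and ties are left unresolved (returned as None), since the text does not say how ties were handled.

```python
from collections import Counter

def majority_affect(votes):
    """Return the affect chosen most often by the raters, or None when
    there are no votes or the top two affects are tied."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: ties are not resolved automatically
    return ranked[0][0]

# Twelve toy votes for one mood (invented for illustration).
votes = ["joviality"] * 7 + ["serenity"] * 3 + ["surprise"] * 2
print(majority_affect(votes))  # prints joviality
```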

Training Data Collection

Like before, we utilized Twitter Firehose data (English

language posts between Nov 1, 2010 and Oct 31, 2011) for

constructing labeled training examples for affect

classification. We begin with our mood labeled dataset (ref.

previous section) and then utilize the mood-affect mapping

to label posts corresponding to each affect. For instance,

“#iphone4 is going to be available on verizon soon!

#excited” expresses the mood ‘excited’, which can

subsequently be mapped to the affect joviality. We further

eliminated RT (retweet) posts to capture true personal affect

reflection in a post.
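The labeling step can be sketched as below. The retweet pattern (an "RT" token followed by a user mention), the function names, and the toy mapping are assumptions for illustration; only the excited-to-joviality association is taken from the example in the text.

```python
import re

# Assumed retweet convention: "RT @user" at the start or after whitespace.
RETWEET = re.compile(r"(?:^|\s)RT\s*@\w+", re.IGNORECASE)

def is_retweet(post):
    """Heuristically detect a retweet."""
    return RETWEET.search(post) is not None

def label_posts(mood_posts, mood_to_affect):
    """Turn (post, mood) pairs into (post, affect) training examples,
    dropping retweets and moods without an affect mapping."""
    labeled = []
    for post, mood in mood_posts:
        if is_retweet(post):
            continue
        affect = mood_to_affect.get(mood)
        if affect is not None:
            labeled.append((post, affect))
    return labeled

mapping = {"excited": "joviality"}  # toy subset of the mood-affect mapping
posts = [
    ("#iphone4 is going to be available on verizon soon! #excited", "excited"),
    ("RT @friend: finally friday #excited", "excited"),
    ("long day #sleepy", "sleepy"),
]
print(label_posts(posts, mapping))
```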

Classification and Experimental Results

We use a classification setup that is standard in text

classification as well as in sentiment classification. We

represent Twitter posts as vectors of unigram and bigram

features. Before feature extraction, the posts are

lowercased, numbers are normalized into a canonical form,

and URLs are removed. Finally the posts are tokenized.

After feature extraction, features that occur fewer than five

times are removed in a first step of feature reduction. We

then randomly split the data into three folds for cross-

validation. The classification algorithm is a standard

maximum entropy classifier. For each fold, we deploy this

classifier to predict the affect labels of the test portion of

the fold (33.3%), after training on the training portion

(66.6%) of the fold.
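The preprocessing and feature reduction steps can be sketched with the standard library alone. The tokenizer pattern, the `<num>` placeholder for numbers, counting occurrences over the whole corpus, and all function names are assumptions of this sketch. The classifier itself is omitted; a maximum entropy model (multinomial logistic regression) would be trained on the resulting feature vectors.

```python
import random
import re
from collections import Counter

URL = re.compile(r"https?://\S+")
TOKEN = re.compile(r"[#@]?\w+|[^\w\s]")  # words, hashtags, mentions, punctuation

def preprocess(post):
    """Lowercase, drop URLs, tokenize, and map numbers to one symbol."""
    text = URL.sub(" ", post.lower())
    tokens = TOKEN.findall(text)
    return ["<num>" if tok.isdigit() else tok for tok in tokens]

def ngram_features(tokens):
    """Unigrams followed by bigrams (adjacent token pairs)."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def build_vocabulary(token_lists, min_count=5):
    """Keep features that occur at least min_count times overall."""
    counts = Counter(f for toks in token_lists for f in ngram_features(toks))
    return {f for f, c in counts.items() if c >= min_count}

def three_fold_splits(n_items, seed=0):
    """Yield (train_indices, test_indices) for three random folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::3] for i in range(3)]
    for k in range(3):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]

print(preprocess("Check http://t.co/abc NOW 42 times #Excited"))
```

Each fold serves once as the test portion (about one third of the data) while the other two folds form the training portion.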

We begin by discussing the performance of classifying

the Twitter posts in our dataset into the 11 different affect

classes. We report the mean F1 measures for the 11 affect

classes in Table 2.
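For reference, a per-class F1 measure (the harmonic mean of precision and recall) averaged over folds can be computed as in this sketch. The function names and the toy labels are hypothetical, and treating an undefined precision or recall as zero is an assumption of the sketch.

```python
def f1_per_class(gold, predicted):
    """Compute the F1 measure of each class from parallel label lists."""
    scores = {}
    for label in set(gold) | set(predicted):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores

def mean_f1(per_fold_scores):
    """Average each class's F1 over the cross-validation folds."""
    labels = set().union(*per_fold_scores)
    n = len(per_fold_scores)
    return {l: sum(s.get(l, 0.0) for s in per_fold_scores) / n for l in labels}

# Toy labels for one fold (invented for illustration).
gold = ["joviality", "joviality", "fear", "fear"]
pred = ["joviality", "fear", "fear", "fear"]
print(f1_per_class(gold, pred))
```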

Our results show that the performance (precision/recall)

of various affect classes differs widely. Noting the mood

distributions for the various affects in Table 1, it appears

Figure 5. Circumplex model of moods showing the relationship between mood expression and activity (Twitter posts made per second by an individual sharing the mood). Larger squares indicate higher activity.

from the F1 measures in Table 2, that the good performance

for the affects joviality, fatigue, hostility and sadness can be

explained by the fact that all of them have a large number

of moods – consequently their feature space may be less

sparse, spanning a variety of topical and linguistic contexts

in Twitter posts. On the other hand, the worst performing

classes, e.g., guilt, shyness and attentiveness, are also the

ones with fewer corresponding moods. Hence it is possible

that their feature spaces are rather sparse due to the limited

contexts they are typically used in on Twitter. Moreover a

qualitative study of the posts that belong to these classes

tends to indicate significant degrees of sarcasm or irony in

them – e.g., for the guilt affect class: “I hate when ppl read

too deep into ur tweet and think it’s about them..... damn ..

#guilty”; and for the attentiveness affect class: “If a tomato

is a fruit does that mean ketchup is a smoothie?

#suspicious”. Due to such contextual mismatch between

content and the labeled affect, the classifier performs worse

for these classes.

Affect class    Mean F1    Affect class   Mean F1
Joviality       0.4644     Fear           0.2319
Fatigue         0.4146     Guilt          0.1838
Hostility       0.3270     Surprise       0.3328
Sadness         0.2885     Shyness        0.0722
Serenity        0.1833     Attentiveness  0.0203
Self-assurance  0.2642

Table 2. Mean F1 measures of 11 affect classes.

In general, the classification results indicate that

the manner in which the various affect classes are used on

Twitter (via explicit mood hashtags) has a significant

impact on the performance of the classifier. Moreover, we

established earlier that different moods have different

valence and activation measures (e.g., angry and frustrated

are both negative moods, but angry indicates higher arousal

than frustrated). These differences make the context of

affect manifestation widely diverse, in turn making affect

classification in social media a challenging problem.

UTILITY OF UNDERSTANDING EMOTION: CASE STUDY

Finally, we discuss a direct application area where these

mood and affect expression dynamics can be utilized to

promote self-reflection and social awareness.

Interest in developing applications to encourage and

promote psychological well-being of individuals is not new

to the HCI and health informatics communities [24, 25].

However, there is unique value in utilizing expressed

emotion on social media like Twitter in applications related

to well-being. This value is twofold. First,

since Twitter allows us to track the emotions of individuals

over a long time scale (years), we can enable them to observe

differences and changes in their behavior over time that

could be indicative of a potential mental disorder (self-

reflection). Second, since Twitter already has a social

network associated with it, individuals are likely to obtain

better social support from their contacts than on a different

social platform, when it comes to helping themselves in

tackling a health-related issue (social support and

awareness).

Motivated in this light, we present how mood expressions

can be used to mine nuanced behavioral changes of

individuals over time, in the light of major life events (e.g.,

birth of a child, loss of job, death related bereavement etc.).

We propose a variety of measures to quantify such change

over time, and reveal how emotion based measures can be

used to forecast major events in people’s lives that can, in

turn, promote social and health-related well-being.

Data Around A Major Life Event: Child-birth

In this paper, we focus on a particular major event in a

person's life: birth of a child. For this purpose, our

population of interest comprises new mothers, i.e., female

Twitter users who are likely to have given birth to a child in

a given timeframe. Note that we focus on new mothers

here, although we have observed that there is a fair

percentage of Twitter posts from fathers right after their

child-birth. The reason for this is that we believe new

mothers experience a significant change in their lifestyle

and habits in the postnatal period, compared to the fathers.

Identifying new mothers on social media based on their

posts, and in the absence of self-reported gender can be a

challenging problem. Hence we follow an iterative

approach: first constructing a candidate set of

likely new mothers by filtering the Twitter stream with several hand-

crafted queries, then inferring gender, and

finally identifying a set of high-probability new mothers

using ratings from Amazon Mechanical Turk workers (turkers). We

describe these steps below:

1. We first construct a list of several hand-crafted queries to

search the Twitter Firehose stream for candidate users

likely to be posting about their child-birth. For the

convenience of consistent prenatal and postnatal

behavioral comparison of new mothers, we focus on

searching the Firehose stream over a fixed two month

period between May 1, 2011 and Jun 30, 2011. The

different search queries included: “after labor born”,

“arrival baby boy/girl”, “birth pounds/inches”, “its a

boy/girl born”. These

queries were motivated by our observation that users

tend to talk about the labor in the final phase of

their pregnancy, and to report the physical

details of their newborn child, including gender,

weight and height. This phase yielded a

candidate set of 483 users.

2. Since we are interested in new mothers only, inferring the

gender of the above constructed candidate user set was an

important step. Twitter does not provide a facility for

users to report on their gender; hence we had to rely on

cues obtained from their self-declared name to infer the

gender of the users in our candidate set. To this end, we

utilized a simple lexicon-based approach that attempts to

match the first name of the Twitter user against a large

dictionary of names collated from United States

Census data, as well as a publicly available corpus of

Facebook users’ names and self-reported gender.

Because of the cross-cultural nature of Facebook,

the dictionary combining it with the Census data works fairly

well in inferring the gender of Twitter users. We tested

the accuracy of this inference mechanism by randomly

selecting 100 users and labeling them manually – the

lexicon-driven gender inference mechanism yielded

around 83% accuracy. After gender

inference, we obtained a smaller set of likely new

mothers comprising 177 users.

3. In the final step, we use the gender-inferred set of likely

new mothers in a Mechanical Turk setup, in order to rule

out false positives and thereby obtain a high-

precision dataset. For this purpose, we showed each turker

(min. 95% approval rating, English language proficient,

and familiar with Twitter) a set of 10 Twitter posts from

each user in our candidate set, such that 5 posts were

posted right before the child-birth indicative post, and 5

after it. We hoped that this would give the turker some

contextual cues to decide on whether the author of those

posts appeared to be legitimately a new mother.

Additionally, we also showed the Twitter profile bio,

picture and a link to their actual Twitter profile for each

user. The task involved answering a

yes/no/maybe multiple-choice question per user, to

evaluate whether the user was a new mother. We thus collected

five ratings per user from the turkers, and used the

majority rating as the correct label (Fleiss' kappa was

0.69). Our final dataset comprised the users with

the “yes” label: 85 new mothers.
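
The three steps above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study. The query semantics (every term must occur, with "/" separating alternatives), the tokenization, the shape of the name lexicon, and all function names are assumptions made for the example. How ties among ratings were resolved is not stated, so the sketch simply returns the first most common label.

```python
import re
from collections import Counter

# The four query strings quoted in step 1; "/" separates alternatives.
QUERIES = [
    "after labor born",
    "arrival baby boy/girl",
    "birth pounds/inches",
    "its a boy/girl born",
]


def tokens(text):
    """Lowercase word tokens; apostrophes are dropped so "it's" becomes "its"."""
    cleaned = text.lower().replace("'", "").replace("’", "")
    return set(re.findall(r"[a-z]+", cleaned))


def matches_query(post, query):
    """Assumed semantics: every term must occur; a/b means either a or b."""
    words = tokens(post)
    return all(any(alt in words for alt in term.split("/"))
               for term in query.split())


def is_candidate(post):
    """Step 1: keep a post if it matches at least one hand-crafted query."""
    return any(matches_query(post, q) for q in QUERIES)


def infer_gender(display_name, name_lexicon):
    """Step 2: look up the first name in a name -> gender dict; None if absent."""
    parts = display_name.strip().lower().split()
    return name_lexicon.get(parts[0]) if parts else None


def majority_label(ratings):
    """Step 3: most common yes/no/maybe rating (ties: first one encountered)."""
    return Counter(ratings).most_common(1)[0][0]


def fleiss_kappa(rating_rows, categories=("yes", "no", "maybe")):
    """Fleiss' kappa for equal-length rows of ratings, one row per user.

    Undefined (division by zero) when every rating falls in one category.
    """
    n_items, n_raters = len(rating_rows), len(rating_rows[0])
    counts = [[row.count(c) for c in categories] for row in rating_rows]
    # Per-item agreement P_i and overall category proportions p_j.
    p_items = [(sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
    p_cats = [sum(row[j] for row in counts) / (n_items * n_raters)
              for j in range(len(categories))]
    p_bar = sum(p_items) / n_items
    p_exp = sum(p * p for p in p_cats)
    return (p_bar - p_exp) / (1 - p_exp)
```

For example, `is_candidate("It's a girl! Born at 3am")` matches the fourth query, while a post that merely mentions a baby does not match any of them.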

Finally, for each of these 85 new mothers, we track their

Twitter timelines in the Firehose stream to collect all of

their posts in two 5-month periods, corresponding to

prenatal and postnatal phases around child-birth (Dec 1,

2010 – Apr 30, 2011 and Jul 1, 2011 – Nov 30, 2011,

respectively).
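
As a small illustration of this windowing, the sketch below assigns a post date to the prenatal or postnatal period using the dates given above. The function and constant names are hypothetical, and treating both window boundaries as inclusive is an assumption.

```python
from datetime import date

# Windows stated in the text; the May-June 2011 search window lies between them.
PRENATAL = (date(2010, 12, 1), date(2011, 4, 30))
POSTNATAL = (date(2011, 7, 1), date(2011, 11, 30))


def period_of(post_date):
    """Return "prenatal", "postnatal", or None for posts outside both windows."""
    if PRENATAL[0] <= post_date <= PRENATAL[1]:
        return "prenatal"
    if POSTNATAL[0] <= post_date <= POSTNATAL[1]:
        return "postnatal"
    return None
```

Posts from the two-month search window itself fall into neither period, so the comparison excludes the weeks immediately surrounding the birth announcement.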

In the following subsection, we will present some

comparisons of the prenatal and postnatal behavioral trends

along a number of emotional measures.

Measures to Detect Behavioral Change

We propose three different measures to quantify the

behavioral change of the new mothers. These measures are

motivated from the emotion studies discussed in the

previous two sections. Note that our measures are crafted to

determine behavioral differences that would reflect change

in lifestyle in general, rather than aggregated volumes:

hence they capture differences over the course of a day (24

hour periods).

Volume. Our first measure, called volume, is the

average normalized number of posts made per hour by the

new mothers over a 24-hour period, computed separately for the prenatal and

postnatal periods.
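
One possible reading of this measure is sketched below. The normalization is not spelled out in the text, so the sketch assumes that each mother's hourly counts are divided by her total number of posts in the period, and that the resulting 24-hour profiles are then averaged over mothers. The function name and the input format are hypothetical.

```python
from collections import defaultdict


def hourly_volume(posts):
    """posts: iterable of (user_id, hour) pairs, hour in 0..23, for one period.

    Returns a 24-element list: the mean, over users, of each user's share of
    posts falling in that hour (assumed normalization: shares sum to 1 per user).
    """
    per_user = defaultdict(lambda: [0] * 24)
    for user_id, hour in posts:
        per_user[user_id][hour] += 1

    profile = [0.0] * 24
    for counts in per_user.values():
        total = sum(counts)  # at least 1, since a user appears only via a post
        for h in range(24):
            profile[h] += counts[h] / total

    n_users = len(per_user)
    return [v / n_users for v in profile] if n_users else profile
```

Normalizing per user before averaging keeps a handful of very prolific accounts from dominating the profile. Other normalizations (for example, dividing by the number of days in the period) would be equally consistent with the description.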

Positive Affect (PA). We define a measure of positive

affect (PA) of the new mothers per hour during a 24 hour

interval, during the prenatal and postnatal periods

respectively. To compute PA, we utilize the

psycholinguistic lexicon LIWC and focus on the words in

the positive emotion category. Given a post from a new

mother posted during a certain hour of the day,

we perform simple word spotting to determine

the fraction of its words that match words in LIWC’s positive

emotion category. This fraction gives the measure of PA,

when averaged over all posts for all mothers, per hour,

corresponding to the prenatal and postnatal periods.
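
The word-spotting computation can be sketched as follows. LIWC is a licensed dictionary, so a tiny stand-in word list is used here purely for illustration. The tokenization, the exact-match rule, and the function names are assumptions, and the sketch averages over posts without distinguishing between mothers.

```python
import re
from collections import defaultdict

# Tiny stand-in word list; the real LIWC categories are licensed and far larger.
POSITIVE_WORDS = {"happy", "love", "great", "excited"}


def affect_fraction(text, lexicon):
    """Fraction of a post's word tokens that appear in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in lexicon for w in words) / len(words) if words else 0.0


def hourly_affect(posts, lexicon):
    """posts: iterable of (hour, text). Returns {hour: mean fraction over posts}."""
    by_hour = defaultdict(list)
    for hour, text in posts:
        by_hour[hour].append(affect_fraction(text, lexicon))
    return {h: sum(v) / len(v) for h, v in by_hour.items()}
```

The same two functions can be reused with a different word list, so any other lexicon category can be scored per hour in the same way.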

Negative Affect (NA). Like PA, we also define a measure

of negative affect (NA) averaged over all mothers per hour,

in a typical 24 hour interval, during the prenatal and

postnatal periods respectively. We again utilize LIWC for

its negative affect categories: ‘negative emotion’, ‘anger’,

‘anxiety’, ‘sadness’. Based on the same word spotting

technique, we measure NA per hour, averaged over all

posts for all mothers corresponding to the prenatal and

postnatal periods, over any typical 24 hour interval in a day.

Prenatal and Postnatal Behavior Comparison

Based on the three measures defined above, we now study

the behavioral differences of new mothers, in the prenatal

and postnatal periods. In particular, we are interested to

observe how behavior changes during the postnatal period,

with respect to the prenatal phase.

Figure 6. Volume (average normalized post count), PA and NA for the set of 85 new mothers during the prenatal and postnatal periods. All measures demonstrate significant change, based on one-tail paired t-tests (p<0.0001).

Figure 6 shows the average trends of the measures in the

prenatal and postnatal periods for all 85 new mothers. The

results demonstrate that overall there is considerable change

in behavior, as quantified by these measures (statistically

significant based on one-tail paired t-tests). There is a

general decrease in volume (normalized post counts) over a

typical day in the postnatal period compared to prenatal.

This probably reflects that mothers are experiencing busy

times alongside the rather frequent “maternity blues”.

Besides, lower post volume on a social media site like

Twitter might also indicate social withdrawal post child-

birth. Additionally, there is high volume during the prenatal

period in the early hours of the morning: a feature often

typical of pregnant mothers suffering from morning

sickness. This feature however disappears during the

postnatal phase. We also observe a decrease in PA over the

course of a day: likely because of a mother’s physical,

mental and emotional exhaustion [17], as well as sleep

deprivation typical of parenting a newborn (notice the low

PA in the early course of a day). Finally, in the case of NA,

we observe that the postnatal period shows significantly

higher variance throughout the day, compared to that in the

prenatal phase: an indicator of mood swings for the new

mothers [7] as well as of increased anxiety or being

overwhelmed frequently but inconsistently.
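
The paired comparison behind these significance statements can be illustrated with the sketch below, which computes the paired t statistic for two matched series. The pairing unit is not stated in the text; the sketch assumes that the prenatal and postnatal values are paired by hour of day (24 pairs per measure). The function name is hypothetical, and the one-tailed p-value would then be read from the upper tail of a Student t distribution with the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """t statistic and degrees of freedom for paired samples (before - after).

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1
```

A large positive t for volume or PA corresponds to a decrease from the prenatal to the postnatal period under this sign convention.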

We observe (not shown in the interest of space) further

significant behavioral change for new mothers on

another set of measures that encompass their social

interactions on Twitter (measures motivated from [4]):

including number of re-tweets (RTs), number of @-replies

directed towards another user, number of links shared, as

well as number of questions asked. We encounter a similar

finding here too: a general decrease in social interactions.

To summarize, a major life event like child-birth triggers

considerable behavioral change for new mothers in the

postnatal period, compared to that during the prenatal

phase; and social media activity can provide valuable

emotional indicators to track these changes systematically.

These findings also bear the potential to develop a

prediction framework that can utilize the social media based

emotional measures and forecast erratic behavior change in

the near future for new mothers.

DISCUSSION

In this paper, we provided an elaborate exploration of the

space of emotion expression in social media, in the light of

how to measure them, how to infer them automatically

when explicit signals are not available, and finally a utility-

driven case study to investigate how such measurement and

inference of emotion can be put to practical use.

Our broad findings extend existing psycholinguistics and

psychology literature in several ways. Considerable effort

in prior research has focused on studying the circumplex

model of moods; however, studying the extent to which

different individuals use these moods at a large scale has remained

a challenge because collecting such data through

limited user studies is expensive. As

Twitter continues to evolve as a mechanism of human

expression, we have made an effort to reveal dynamics of

the circumplex of moods frequently used on Twitter, not

only via their measures of valence and activation, but

through their respective usages, linguistic diversity, diurnal

patterns etc. as well.

Our work in emotion exploration and measurement in

social media provides ample scope for extensions and

future research directions. A limitation of current work is

that we have analyzed emotional behavior across all Twitter

users in general, without making distinctions between

different user types. For instance, one could conjecture that

the mechanisms of mood expression will be different for

elite users compared to ordinary individuals. Additionally,

there are likely to be differences in mood expression across

cultures, demographics such as race, gender, social status,

educational background and so on. Furthermore, studying

the dynamics of the emotions as well as their utility in real

world could reveal the susceptibility of various population

segments to emergency situations – an aspect that can

benefit governmental agencies in particular.

Finally, as a utility application domain for the insights

obtained from measuring and inferring human moods in

social media, we explored the topic of behavioral change of

individuals around major life events. While there could be

a diverse range of major life events that can trigger

behavioral change, we have explored only one domain:

postnatal behavior of new mothers. Though limited in its

scope for the time being, one could possibly expand this

work to studying a variety of emotional disorders while

investigating postnatal behavior change, such as post-

partum depression (PPD), postnatal psychosis and so on.

Plausibly, emotional markers obtained from social media

and social network data can enable better diagnosis of

these disorders, and at the same time provide mechanisms for

new mothers to track their behavioral change for their

health-related well-being.

Opportunities also remain in extending this work to

other types of life events for which tracking behavioral change

via social media activity can be fruitful. These include loss

of a job or financial instability (to track population scale

unemployment dissatisfaction, or economic indicators);

death related grief and bereavement or loss of safety after a

trauma (to help individuals cope with surprising emotions

and improve their quality of life) and so on. In the coming

years, from the perspective of HCI research, these could

help us design emotion-aware interfaces: wherein a user's

online social experience is tuned to her emotional state,

emotional needs, requirement for social support as well as

to act as a self-narrative and self-reflective feedback tool.

CONCLUSIONS

In this paper, we studied a variety of moods that frequent

Twitter posts, from the point of view of their measurement,

their inference in the absence of explicit signals, as well as

their potential in catering to practical real-world application

scenarios that can promote social and personal wellness of

individuals. We used the dimensions of valence and

activation to represent moods in the circumplex model and

studied the topology of this space with respect to mood

usage, linguistic diversity, and activity. In this manner we

provided naturalistic validation of the circumplex using

social media data, and extended the conceptualization of

human emotion at scale.

Next, through our automatic affect classifier, we were

able to expand emotion measurement and tracking to

contexts where explicit mood signals might not be

available: a contribution that can help recommender and

search interfaces greatly. Finally, we investigated a case

study in detecting behavioral changes of individuals (new

mothers) around a major life event (child-birth), via their

shared content on social media. Through this study we have

laid the foundation of a line of HCI research that can utilize

emotion signals from online activity to predict and forecast

anomalous behavior or change at the personalized level of

an individual, as well as at a collective scale

involving large populations.

ACKNOWLEDGMENTS

We thank Michael Gamon for help with the affect

classification task, and Eric Horvitz for fruitful discussions.

REFERENCES

1. Bollen, J., Mao, H., & Pepe, A. (2011). Modeling Public Mood and Emotion: Twitter Sentiment and Socio-Economic Phenomena. In Proc. ICWSM 2011.

2. Bradley, M.M., & Lang, P.J. (1999). Affective norms for English words (ANEW). Gainesville, FL. The NIMH Center for the Study of Emotion and Attention.

3. Davidov, D., Tsur, O., & Rappoport, A. (2010). Enhanced Sentiment Learning Using Twitter Hashtags and Smileys. Proceedings of the 23rd International Conference on Computational Linguistics Posters, (August), 241-249.

4. De Choudhury, M., Counts, S., & Gamon, M. (2012). Not All Moods are Created Equal! Exploring Human Emotional States in Social Media. In Proc. ICWSM 2012, to appear.

5. De Choudhury, M., Gamon, M., and Counts, S. (2012). Happy, Nervous or Surprised? Classification of Human Affective States in Social Media. In Proc. ICWSM 2012, to appear.

6. Diakopoulos, N., & Shamma, D. (2010). Characterizing debate performance via aggregated twitter sentiment. In Proc. CHI 2010. 1195-1198.

7. Edhborg, M., Lundh, W., Seimyr, L., & Widstrom, A-M. (2001). The long-term impact of postnatal depressed mood on mother-child interaction: a preliminary study. In Journal of Reproductive and Infant Psychology 19: 61–71.

8. Ekman, P. (1973). Cross-cultural studies of facial expressions. In P. Ekman (Ed.), Darwin and facial expression: A century of research in review (pp. 169-229).

9. Esuli, A., & Sebastiani, F. (2006). SentiWordNet: A publicly available lexical resource for opinion mining. Proceedings of LREC (Vol. 6, p. 417–422).

10. Golder, S. A., & Macy, M. W. (2011). Diurnal and Seasonal Mood Vary with Work, Sleep and Daylength Across Diverse Cultures. Science. 30 Sep 2011.

11. Kleinginna, P.R., & Kleinginna, A.M. (1981). A categorized list of motivation definitions with a suggestion for consensual definition. Motivation and Emotion, 263-291.

12. Kramer, A. D. I. (2010). An unobtrusive behavioral model of “gross national happiness”. In Proc. CHI 2010. 287-290.

13. Mehrabian, Albert (1980). Basic dimensions for a general psychological theory. pp. 39–53.

14. Mihalcea, R., & Liu, H. (2006). A corpus-based approach to finding happiness. In Proceedings of computational approaches for analysis of weblogs, AAAI Spring Symposium.

15. Mishne, G. (2005). Experiments with mood classification in blog posts. In Style2005 -- the 1st Workshop on Stylistic Analysis Of Text For Information Access, at SIGIR 2005.

16. Mishne, G. & de Rijke, M. (2006). Capturing Global Mood Levels using Blog Posts. In AAAI 2006 Spring Symposium on Computational Approaches to Analyzing Weblogs.

17. O’Hara, M.W. (1995). Postpartum Depression: Causes and Consequences. New York: Springer-Verlag.

18. Ortony, A., & Turner, T. J. (1990). What's basic about basic emotions? Psychological Review, 97, 315-331.

19. Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques. In Proc. EMNLP 2002, Vol. 10. 79-86.

20. Russell, James A. (1980). A circumplex model of affect. J. of Personality and Social Psychology: 39, 1161–1178.

21. Tellegen, A. (1985). Structures of mood and personality and their relevance to assessing anxiety, with an emphasis on self-report. In A. H. Tuma & J. D. Maser (Eds.), Anxiety and the anxiety disorders (pp. 681-706).

22. Watson, D., & Clark, L. A. (1994). The PANAS-X: Manual for the positive and negative affect schedule-Expanded Form. Iowa City: University of Iowa.

23. Wiebe, J., Wilson, T., Bruce, R, Bell, M., & Martin, M. (2004). Learning subjective language. Computational Linguistics, 30 (3).

24. Anhoj, J., & Jensen, A-H. (2004). Using the Internet for life style changes in diet and physical activity: a feasibility study. In J Med Internet Res 8; 6(3):e28.

25. Mamykina, L., Mynatt, E. et al. (2008). MAHI: investigation of social scaffolding for reflective thinking in diabetes management. In Proc. CHI 2008.
