The Nature of Emotional Expression in Social Media: Measurement, Inference and Utility
Munmun De Choudhury
Microsoft Research
Redmond, WA 98052
Scott Counts
Microsoft Research
Redmond, WA 98052
ABSTRACT
Today, social media platforms like Twitter and Facebook
enable in-the-moment reflection of people's attitudes,
attention, and emotions at a scale that was never available
before. In this paper, we present and explore how we can
measure, infer and utilize expression of human moods from
social media activity (e.g., Twitter). Motivated by literature
in psychology, we first measure more than 200 nuanced
human moods at scale on Twitter, through a popular
representation, known as the ‘circumplex model’ that
characterizes affective experience through two dimensions:
valence and activation. Second, moving beyond explicit
mood signals, we develop an automated classifier to infer
several human affective states in social media. Starting with
moods, we derive naturalistic signals from Twitter posts
about a variety of affects of individuals and deploy them in
a classification framework with promising results. Finally,
we illustrate the utility of emotion exploration in social
media via a case study that tracks behavioral change of new
mothers post child-birth. Our findings provide at-scale
naturalistic assessments and extensions of existing
conceptualizations of human mood, as well as indicate their
utility in domains of societal interest, such as public health.
Author Keywords
Affect, circumplex model, classification, emotion, mood,
social media, Twitter
ACM Classification Keywords
H.5.m [Information Systems]: Information Systems
Applications – Miscellaneous.
General Terms
Algorithms; Human Factors; Measurement.
INTRODUCTION
The dynamic nature of user-generated content today holds
formidable power in a culture that is increasingly shaped
and influenced by technology. In particular, the multitude
of various social media and social networking sites has
provided easy channels of opinion and emotional
expression to large audiences. On top of such ubiquity,
since social media sites provide a conducive platform to
provide and receive social support around issues
surrounding our daily lives, such in-the-moment emotional
expression of current experiences has been on the rise in the
past few years¹: be it about the global economy broadly, the
riots in Egypt, or a personal vacation trip with one’s family.
As a timely research focus, judicious tracking and
monitoring of emotion at scale could potentially trigger a
societal shift with enough support and publicity.
Furthermore, such tracking can enable new information-seeking
approaches; for instance, identifying search features given
an emotion attribute, or enabling effective “emotionally-
reflective” interfaces. Consequently, there is significant
value to be derived from measuring and
classifying/inferring human emotion in social media that
can potentially be utilized to impact several domains of
public interest – including health, finance, entertainment,
advertising, politics or evolution of language and culture.
What are emotions? According to literature in
psychology, emotions are complex patterns of cognitive
processes, physiological arousal, and behavioral reactions
[11]. Emotions arouse us to action and direct and sustain
that action. They also help us organize our experience by
directing attention, and by influencing our perceptions of
self, others, and our interpretation and memory of events
[21]. As renowned sociologist Herbert Blumer postulated,
what goes on around us sinks into the reservoirs of our
mind and changes how we think.
Given the critical value in understanding human emotion,
researchers have attempted to describe emotion
(alternatively affect) through a set of dimensions, typically
Positive Affect and Negative Affect, where the two
dimensions vary independently of each other [10,12].
However research in psychology provides evidence that
rather than being independent, these dimensions are
interrelated in a highly systematic fashion [13].
Consequently, psychology researchers devised a
psychometric instrument called the circumplex [20], a
spatial model in which affective concepts fall on a circle
and that enables quantification of the cognitive structure of
affective experience. The circumplex represents each
affective concept (or, mood) via two dimensions: a
pleasure-displeasure measure, known as valence, and a
degree-of-arousal dimension, known as activation.
¹ http://gizmodo.com/5870075/everyone-on-twitter-is-increasingly-depressed
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
Motivated along these lines, in this work, we utilize the
circumplex model as a tool to measure more than 200 fine-
grained human moods at scale, as expressed on social
media (Twitter). We also analyze several attributes of
collective behavior in the light of measurement of moods:
including usage levels, linguistic diversity as well as
activity rates associated with different mood expression on
social media. We thereby reveal how we can provide at-
scale naturalistic assessments and extensions of existing
conceptualizations of moods (extended version in [4]).
However, note that although measuring these moods in
the light of human behavior can lend us useful insights,
it does not help us infer moods broadly in the scope of any
arbitrary social media content. For example, the discussion
presented so far (lexicon-driven approach) relies on the
explicit presence of a mood in a Twitter post; however,
statistically, such posts are likely to be only a small
percentage of all posts.
Hence an automated method of inferring the broad mood
(or affective state) might yield better coverage. Automatic
classification of affect might be useful in domains where
understanding the expressed mood of an author has some
practical utility: that could range from advertising,
recommendations, to tracking behavior of populations in
health, socio-economics or urban development domains.
Hence we utilize a wide range of moods to act as
supervisory ground truth signals in the development of an
affect classifier (see [5] for details). The classifier can infer
a range of fine-grained affective states from posts on social
media (beyond simply the valence of the affect descriptors),
that might not bear explicit signals about emotion. Such
inferred states would reflect the specific content, language
and state of the individual sharing the content, i.e., the
distinctive qualities of individuals’ affects.
There is considerable practical utility in
measuring and inferring human emotion at scale. For
instance, we rely on the general observation that
emotion-bearing content, such as dark (negative) postings, can serve
as a sign of depression in individuals and can be utilized as
an early warning system for timely intervention, or to reduce,
non-invasively, the stigma around mental illness.
Motivated along these lines, in the final part of the paper,
we explore a particular utility domain of interest:
investigating behavioral and emotional change of new
mothers in the postnatal phase, compared to their prenatal
behavior, based on their postings on Twitter.
The rest of this paper is organized as follows. Following
a review of prior work, we discuss the three major segments
of interest in the paper in three sections: measuring moods,
automatic classification of affect, and finally a case study to
illustrate the utility of studying emotion dynamics in the
context of social media. We conclude with a discussion of
our major findings, and future research directions.
BACKGROUND LITERATURE
Considerable research in psychology has defined and
examined human emotion and mood (e.g., [8]), with basic
moods encompassing positive experiences like joy and
acceptance, negative experiences like anger and disgust,
and others like anticipation that are less clearly positive or
negative. The activation attribute of moods, together with
valence (i.e., the degree of positivity/negativity of a mood),
characterizes the structure of affective experience (ref. PAD
emotional state model in [13]; also the circumplex model of
affect in [20]). In these works, authors utilized self-reports
of affective concepts to scale and order emotion types on
the pleasure-displeasure scale (valence) and the degree-of-
arousal scale (activation) based on perceived similarity
among the terms.
Turning to analyses of mood in social media, early work
focused on sentiment in weblogs. Mihalcea et al. [14]
utilized happy/sad labeled blog posts on LiveJournal to
determine temporal and semantic trends of happiness.
Similarly, Mishne et al. [16] utilized LiveJournal data to
build models that predict levels of various moods and
understand their seasonal cycles according to the language
used by bloggers. More recently, Bollen et al. [1] analyzed
trends of public moods (using a psychometric instrument
POMS to extract six mood states) in light of a variety of
social, political and economic events.
Research involving affect exploration on social websites
such as Facebook and Twitter has looked at trends of use of
positive and negative words [12]. Recently, Golder et al.
[10] studied how individual mood varies from hour-to-hour,
day-to-day, and across seasons and cultures by measuring
positive and negative affect in Twitter posts, using the
lexicon LIWC (http://www.liwc.net/).
Prior research has also tackled automatic classification of
sentiment in online domains [19]. These machine learning
techniques need extensive manual labeling of text for
creating ground truth. Some of these issues have been
tackled by utilizing emoticons present in text as labels for
sentiment [3], although they tend to perform well mostly in
the context of the two basic positive and negative affect
classes. The closest attempt towards multiclass affect
classification has been on LiveJournal data: the mood tags
associated with posts were used as ground truth [15].
An alternative that circumvents the problems of machine
learning techniques has been the use of generic sentiment
lexicons such as WordNet, LIWC, and other lists [9].
Recently, there has been a growing interest in
crowdsourcing techniques to manually rate polarity in
Twitter posts [16].
Limitations
Despite widespread interest, it is clear that the notions of
affect and sentiment have been rather simplified in the current
state of the art, often confined to their valence measure
(positive/negative), with the six moods in [1] being an
exception. However, as indicated, the psychology literature
suggests that understanding the inter-relatedness of valence
and activation is important in conceptualizing human
emotion. Additionally, when it comes to classifying or
inferring emotional attributes, hand-labeled ground truth for
affect classification or manually curated word lists are
likely to be unreliable and scale poorly on noisy, topically-
diverse social media data, such as Twitter.
Finally, another problem with polarity-centric sentiment
classifiers is that they typically encompass a vague notion
of polarity that includes both emotion and opinion, and lump
them all into two classes, “positive” and “negative” (see
[23]). In order to better make sense of emotional behavior on
social media, we require a principled notion of human
emotion – a contribution discussed in this paper.
MEASUREMENT OF HUMAN MOODS
We will begin with our methodology of measuring emotion,
in terms of moods, and understanding behavioral
differences found in mood expression across individuals at
scale. As pointed out earlier, we note that despite
considerable interest in mining and analyzing human
emotion, simplification of emotions to merely positive and
negative dimensions may miss important nuances in mood
expression. For instance, annoyed and frustrated are both
negative, but they express two very different emotional
states. A primary research challenge, therefore, is finding a
principled way to identify a set of words that can truly help
us measure emotional states of individuals, as well as
characterize their nuances in terms of both their valence and
arousal measure, known as activation.
Hence we study a popular representation of the human mood
landscape: the ‘circumplex model’. For this purpose, we
first present a systematic method to identify moods in social
media that captures the broad range of individuals’
emotional states, using Mechanical Turk studies and forays
into the psychology literature. Second, we analyze
these moods in the light of individuals’ behavioral
attributes to reveal nuances of collective mood expression.
Construction of Mood Lexicon
We begin by discussing our methodology to construct a
mood lexicon – a set of words that would indicate
individuals’ broad emotional states. We then characterize
the moods by the two dimensions of valence and activation,
and discuss how we infer them.
We utilize five established sources to develop a mood
lexicon:
1. ANEW: Affective Norms for English Words, which
provides a set of normative emotional ratings for
~2000 words in English, including their valence,
activation and dominance ratings [2].
2. LIWC: For LIWC, we use sentiment-indicative
categories like positive/negative emotions, anxiety, anger
and sad (http://www.liwc.net/).
3. EARL: Emotion Annotation and Representation
Language dataset that classifies 48 emotions
(http://emotion-research.net/projects/humaine/earl).
4. A list of “basic emotions” provided in [18], e.g., fear,
contentment, disgust etc.
5. A list of moods provided by the blogging website
LiveJournal (http://www.livejournal.com/).
However, this large ensemble of words is likely to
contain several words that do not necessarily define a mood
(e.g., sleepy is a state of a person, rather than a mood). To
circumvent this issue, we first perform a Mechanical Turk
study (http://aws.amazon.com/mturk/) to narrow our
candidate set to truly mood-indicative words. In our task,
each word was rated on a 1 – 7 Likert scale (1 indicated not a mood
at all, 7 meant definitely a mood). Only turkers from the
U.S. with an approval rating greater than 95% were
allowed. Combining 12 different turkers’ ratings, we
construct a list of those words where the median rating was
at least 4, and the standard deviation was less than or equal
to 1. The final set of mood words contained 203 terms
(examples include: excited, nervous, quiet, grumpy,
depressed, patient, thankful, bored).
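As an illustration, the filtering rule described above (median rating of at least 4, standard deviation of at most 1) can be sketched as follows. The ratings shown are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import median, pstdev


def select_mood_words(ratings, min_median=4, max_std=1.0):
    """Keep words whose ratings have median >= min_median and std <= max_std.

    `ratings` maps each candidate word to its list of 1-7 Likert ratings.
    Assumption: population standard deviation (pstdev) is used.
    """
    selected = []
    for word, scores in ratings.items():
        if median(scores) >= min_median and pstdev(scores) <= max_std:
            selected.append(word)
    return sorted(selected)


# Hypothetical ratings from 12 raters per word (illustrative only).
example_ratings = {
    "excited": [7, 7, 6, 7, 6, 7, 7, 6, 7, 7, 6, 7],  # high, consistent
    "sleepy": [2, 3, 2, 1, 3, 2, 2, 3, 1, 2, 2, 3],   # low median
    "quirky": [1, 7, 4, 6, 2, 7, 5, 3, 6, 4, 7, 1],   # raters disagree
}
print(select_mood_words(example_ratings))
```

Only "excited" passes both thresholds here: "sleepy" fails the median test and "quirky" fails the agreement test.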
Given the final list of representative moods, our next task
was to determine the values of the valence and activation
dimensions of each mood. For those words in the final list
that were present in the ANEW lexicon, we use the source-
provided measures of valence and activation, as these
values in the ANEW corpus had already been computed
after extensive and rigorous psychometric studies. For the
remaining words, we conduct another turk study, to
systematically collect these measurements.
As before, we considered only those turkers who had at
least a 95% approval rating and were from the U.S. For a
given mood, each turker (12 turkers for each word) was
asked to rate the valence and activation measures on two
different 1 – 10 Likert scales (1 indicated low valence/low
activation, while 10 indicated high values). Finally, we used
the mean ratings for each word as the final measures of its
valence and activation (Fleiss-Kappa measure was 0.65).
Data
For data collection, we utilized the Twitter Firehose and
focused on one year's worth of Twitter posts posted in
English from Nov 1, 2010 to Oct 31, 2011. From this
ensemble, in order to obtain Twitter posts for each of the
203 moods, we resorted to a method that could infer mood-
containing posts reasonably consistently and in a principled
manner. We conjecture that posts containing moods as
hashtags at the end are likely to capture the emotional state
of the individual, in the limited context of the post. This is
motivated by prior work where Twitter's hashtags and
smileys were used as labels for sentiment classifiers [3]. We
also referred to the study in [4] which indicated that in
83% of the cases, hashtagged moods at the end of posts
indeed captured the users' moods. For instance, “#iphone4
is going to be available on verizon soon! #excited”
expresses the mood ‘excited’. By this process, our labeled
mood dataset comprised about 10.6M tweets from about
4.1M users.
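A minimal sketch of the labeling heuristic described above: a post is assigned a mood only if a lexicon word appears among the hashtags at the very end of the post. The regular expression, the function name `trailing_mood`, and the handling of several trailing hashtags are assumptions made for illustration; the text does not specify these details.

```python
import re

# One or more hashtags (optionally separated by whitespace) at the end of a post.
TRAILING_HASHTAGS = re.compile(r"(?:\s*#\w+)+\s*$")


def trailing_mood(post, mood_lexicon):
    """Return the first lexicon mood found among the trailing hashtags, else None."""
    match = TRAILING_HASHTAGS.search(post)
    if match is None:
        return None
    for tag in re.findall(r"#(\w+)", match.group(0)):
        if tag.lower() in mood_lexicon:
            return tag.lower()
    return None


lexicon = {"excited", "sad", "bored"}
post = "#iphone4 is going to be available on verizon soon! #excited"
print(trailing_mood(post, lexicon))
```

A hashtag in the middle of a post (for example "so #excited about the weekend") is deliberately ignored by this sketch, in line with the end-of-post conjecture.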
Deciphering Human Behavior through Moods
We now study various aspects of the relationship between
mood expression and human behavior in social media
(Twitter). We present four aspects of such behavior: usage
levels, diurnal patterns, diversity of language use, and
activity rate, given a mood, in the next four subsections.
Usage Levels of Moods
Our study of mood exploration on Twitter data is based on
analyzing the circumplex model of moods in terms of the
moods’ usage frequencies. We illustrate these mood usage
frequencies (count over all posts) on the circumplex model
in Figure 2, where the size of each square (i.e., mood) is
proportional to its frequency. We note that the usage of
moods in each of the quadrants is considerably different
(the differences between each pair of quadrants were found
to be statistically significant based on independent sample t-
tests: p<0.0001). The overall trend shows that moods in Q3
(low valence, low activation) tend to be used extensively
(sad, bored, annoyed, lazy), along with a small number of
moods in Q1, of relatively higher valence and activation
(happy, optimistic). Overall, usage frequencies of lower
valence moods exceed those of higher valence moods.
We hypothesize the presence of a “broadcasting bias”
behind these observations. Since individuals often use
Twitter to broadcast their opinions and feelings on various
topics, it is likely that the mood about some information
needs to be of sufficiently low or high valence to be worth
reporting to the audience. This appears to be particularly
true with respect to positive valence terms, with mildly
positive moods expressed only rarely. The observation that
lower valence moods are shared more often might be due to
individuals seeking social support from their audiences in
response to various happenings externally as well as in their
own lives. In a sense, by observing the usage levels of
moods, we are able to validate the topology of the
circumplex model in the social media context and show
that all moods are, after all, not created equal!
Temporal Patterns of Mood Use
We follow with the trail of investigating mood usage in the
light of diurnal and weekly behavioral patterns. In Figure 1
we show the usage of the 203 moods in Twitter posts over a
24 hour period, averaged over all users and all days in our
dataset (data points within the 95% confidence interval are
considered). We divide usage over the course of a day into
four different timings: morning (5am-12pm), afternoon
(12pm-5pm), evening (4pm-10pm), nightowl (10pm-5am).
The main observation from the circumplex model of the
diurnal patterns is that greater usage activity in general, in
terms of mood expression, occurs during the evening and
night, with mornings having the least usage. Evenings show
high use of both negative and positive valence moods (e.g.,
negative moods: unhappy, tired, sad, lazy, worthless;
positive moods: win, nice, good, terrific); while at nights,
the overall activation of moods used tends to increase
compared to other times of the day (e.g., terrified, blocked,
stressed). Intuitively this appears to conform to common
expectations: people are likely to feel tired and worthless at
the end of a work day; at the same time certain others
probably feel blocked and stressed at nights (also ref. [10]).
Figure 1. Circumplex model showing mood use diurnally – over
a 24 hour period. Each day is divided into four types of usage
timings: morning (5am-12pm), afternoon (12pm-5pm), evening
(4pm-10pm) and nightowl (10pm-5am).
Figure 2. Circumplex model showing usage frequencies of
moods used as hashtags at the end of Twitter posts: larger
squares represent higher frequency of usage.
We further present temporal patterns of mood usage
during weekdays (Monday-Thursday) and weekends
(Friday-Sunday) in Figure 4. The primary finding from the
two circumplex models is that positive moods are more
extensively used during weekends, compared to weekdays;
whereas negative ones are predominant during the
weekdays. Additionally a bulk of the moods shared on
weekdays has higher activation than others. This probably
is reflective of people being busy around a work week,
while tending to be more relaxed during the weekends,
thereby showing lower activation in mood expression.
Linguistic Diversity of Moods
We now explore how the diversity of linguistic content
associated with usage of various moods relates to their
valence and activation. As before, we show the moods (as
squares) on the circumplex model (Figure 3). The color of
each square indicates a mood’s normalized entropy, defined
as the entropy of the textual content (i.e., unigrams over all
posts associated with the mood), divided by the total
number of posts expressing the mood. In the figure, lighter
shades indicate higher entropy. We observe that squares on
the right side of the circumplex model (Q1, Q4) tend to
have higher entropy than those on the left (Q2, Q3) (statistically
significant based on an independent sample t-test). This
indicates that while positive moods tend to be shared across
a wide array of linguistic contexts (topics, events etc.),
negative moods tend to be shared in a limited context,
confined to limited topics.
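The normalized entropy defined above can be sketched as follows: the Shannon entropy of the unigram distribution pooled over all posts for a mood, divided by the number of those posts. Whitespace tokenization and the base-2 logarithm are assumptions of this sketch; the text does not state either choice.

```python
import math
from collections import Counter


def normalized_entropy(posts):
    """Unigram entropy over all posts for one mood, divided by the post count.

    Assumptions: lowercased whitespace tokens, entropy measured in bits.
    """
    counts = Counter(tok for post in posts for tok in post.lower().split())
    total = sum(counts.values())
    if not posts or total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / len(posts)


# Two posts, two equally frequent unigrams: entropy is 1 bit, normalized by 2 posts.
print(normalized_entropy(["a b", "a b"]))
```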
Activity Rates of Moods
Next we investigate the relationship between an
individual’s activity rate and his/her mood expression. We
define a measurement of how “active” s/he is in sharing
posts: the number of posts shared per second since the time
of the individual’s account creation.
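For concreteness, this activity measure can be sketched as below. The function name and the choice of reference time `as_of` (the moment at which the post count is taken) are assumptions for illustration.

```python
from datetime import datetime


def activity_rate(num_posts, account_created, as_of):
    """Posts shared per second between account creation and `as_of`."""
    elapsed = (as_of - account_created).total_seconds()
    if elapsed <= 0:
        raise ValueError("as_of must be later than account_created")
    return num_posts / elapsed


# Hypothetical user: 864 posts in the first day (86,400 seconds) after sign-up.
rate = activity_rate(864, datetime(2011, 1, 1), datetime(2011, 1, 2))
print(rate)
```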
Based on this definition, we show the circumplex model
of moods in Figure 5. The size of each square in this case is
proportional to the mean rate of activity of all individuals
who have shared the particular mood. The figure shows that
the majority of the larger squares (or moods shared by
highly “active” individuals) lie in Q1 and Q4; in other
words, high (or positive) valence moods are shared by
highly active individuals (statistically significant based on
independent sample t-tests between quadrant pairs). On the
other hand, moods of high activation (in Q2) but low
valence are shared primarily by individuals with a low
activity rate. In general, this indicates that positive moods
are associated more frequently with active individuals,
while negative and high arousal moods appear to be shared
more frequently by individuals with low activity rates.
Through the studies so far, our major contribution thus
lies in the large-scale naturalistic validation of the
circumplex model encompassing a variety of human moods
expressed online. However, we note that, on noisy and
immensely large and diverse streams like social media,
collecting explicit signals on moods (e.g., in the form of
hashtags) might pose a significant research challenge [4,
5]. In the following section, we therefore present an
automatic classification framework that can infer affective
states automatically.
AUTOMATED INFERENCE OF AFFECTIVE STATES
We discuss how the 203 moods can be utilized to infer a
wide variety of human affective states in social media in an
automated manner. The primary challenge in classifying
affect lies in the unavailability of ground truth: an aspect
often circumvented via manual labeling in order to create
training examples. As we move to social media domains
featuring enormous data, coupled with unavailability of
ground truth, gathering appropriate training data
necessitates a scalable alternative approach. Besides, in a
typical sentiment classification setting, two broad, general
factors – Positive Affect (PA) and Negative Affect (NA) –
have emerged reliably as the dominant dimensions of
emotional experience. However it is imperative to account
for more distinguishable and fine-grained affective states.
Figure 3. Circumplex model showing entropies of moods in
terms of the content of posts: higher valence moods (shown in
lighter shades) have higher entropy.
Figure 4. Circumplex model showing mood use during weekdays
(Monday-Thursday) and weekends (Friday-Sunday). Weekdays
show more negative, higher activation moods.
In this section we propose an affect classifier of social
media data, along with promising results, that does not rely
on any hand-built list of features or words, except for the
roughly 200 mood hashtags that we use as a supervision
ground truth signal (extended results in [5]).
Inferring Affective States from Moods
The psychology literature indicates that there is an implicit
relationship between the externally observed affect and the
internal mood of an individual [22]. When affect is detected
by an individual (e.g., smile as an expression of joviality), it
is characterized as an emotion or mood. In this subsection,
we therefore discuss how we can utilize our previously
discussed mood lexicon to find a mapping to broad
affective states.
Affect Types. Although affect has been found to comprise
broadly positive and negative dimensions (PANAS –
positive and negative affect schedule [22]), we are
interested in more fine-grained representation of human
affect. Hence we utilize a source known as PANAS-X [22].
PANAS-X defines 11 specific affects apart from the
positive and negative dimensions – ‘fear’, ‘sadness’, ‘guilt’,
‘hostility’, ‘joviality’, ‘self-assurance’, ‘attentiveness’,
‘shyness’, ‘fatigue’, ‘surprise’, and ‘serenity’.
Affect            #moods    Affect           #moods
Joviality         30        Fear             14
Fatigue           19        Guilt            5
Hostility         17        Surprise         8
Sadness           38        Shyness          7
Serenity          12        Attentiveness    2
Self-assurance    20
Table 1. Number of moods associated with the affects.
Mood to Affect Associations. Next, based on the mapping
of moods to affects provided in the PANAS-X literature, we
derived associations for 60% of the moods in our lexicon. For
the remaining associations, we conducted a turk study [5].
Each turker (12 in all, per mood) was asked to select the
most appropriate affect that described a particular mood.
We combined the ratings per mood, and used the affect that
received the majority rating (Fleiss-Kappa was 0.7). The
distribution of #moods over affects is shown in Table 1.
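The aggregation step above can be illustrated with a short sketch that picks the majority affect per mood and computes Fleiss' kappa over the raters' choices. The kappa formula is the standard one for a fixed number of raters per item; the function names, the tie-breaking rule (first affect encountered), and the example votes are assumptions for illustration only.

```python
from collections import Counter


def majority_affect(votes):
    """Return the affect chosen by the most raters (ties: first one seen)."""
    return Counter(votes).most_common(1)[0][0]


def fleiss_kappa(votes_per_item):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    Undefined (division by zero) if every rater picks the same single category.
    """
    n_items = len(votes_per_item)
    n_raters = len(votes_per_item[0])
    category_totals = Counter()
    agreement_sum = 0.0
    for votes in votes_per_item:
        counts = Counter(votes)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = agreement_sum / n_items  # mean observed agreement
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical votes from three raters for two moods.
votes = [["joviality", "joviality", "joviality"], ["sadness", "sadness", "sadness"]]
print([majority_affect(v) for v in votes], fleiss_kappa(votes))
```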
Training Data Collection
Like before, we utilized Twitter Firehose data (English
language posts between Nov 1, 2010 and Oct 31, 2011) for
constructing labeled training examples for affect
classification. We begin with our mood labeled dataset (ref.
previous section) and then utilize the mood-affect mapping
to label posts corresponding to each affect. For instance,
“#iphone4 is going to be available on verizon soon!
#excited” expresses the mood ‘excited’, which can
subsequently be mapped to the affect joviality. We further
eliminated RT (retweet) posts to capture true personal affect
reflection in a post.
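A minimal sketch of this labeling step follows, assuming each post has already been paired with its trailing mood hashtag. Detecting retweets by an "RT @" prefix is an assumption of the sketch, as is the function name; the text does not describe how retweets were identified.

```python
def label_posts(mood_posts, mood_to_affect):
    """Map (text, mood) pairs to (text, affect) pairs, dropping retweets.

    Assumption: a retweet is any post whose text starts with "RT @".
    """
    labeled = []
    for text, mood in mood_posts:
        if text.lstrip().lower().startswith("rt @"):
            continue  # skip retweets: they echo someone else's affect
        affect = mood_to_affect.get(mood)
        if affect is not None:
            labeled.append((text, affect))
    return labeled


mood_to_affect = {"excited": "joviality"}  # one mapping taken from the example above
posts = [
    ("#iphone4 is going to be available on verizon soon! #excited", "excited"),
    ("RT @someone: finally friday #excited", "excited"),  # hypothetical retweet
]
print(label_posts(posts, mood_to_affect))
```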
Classification and Experimental Results
We use a classification setup that is standard in text
classification as well as in sentiment classification. We
represent Twitter posts as vectors of unigram and bigram
features. Before feature extraction, the posts are
lowercased, numbers are normalized into a canonical form,
and URLs are removed. Finally the posts are tokenized.
After feature extraction, features that occur fewer than five
times are removed in a first step of feature reduction. We
then randomly split the data into three folds for cross-
validation. The classification algorithm is a standard
maximum entropy classifier. For each fold, we deploy this
classifier to predict the affect labels of the test portion of
the fold (33.3%), after training on the training portion
(66.6%) of the fold.
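The preprocessing and feature steps described above can be sketched in plain Python as follows. The regular expressions, the `NUM` placeholder for normalized numbers, counting feature occurrences over the whole corpus, and the fixed random seed are all assumptions made for illustration. The classifier itself is omitted; a maximum entropy classifier is equivalent to multinomial logistic regression and could be trained on these features with any standard library.

```python
import random
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")  # standalone numbers only
TOKEN_RE = re.compile(r"[#@]?\w+|[^\w\s]")      # words, hashtags, mentions, punctuation


def preprocess(post):
    """Lowercase, strip URLs, normalize numbers to NUM, then tokenize."""
    text = URL_RE.sub(" ", post.lower())
    text = NUMBER_RE.sub("NUM", text)
    return TOKEN_RE.findall(text)


def ngram_features(tokens):
    """Return the unigrams followed by the bigrams of a token list."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def prune_rare(feature_lists, min_count=5):
    """Drop features that occur fewer than `min_count` times in the corpus."""
    counts = Counter(f for feats in feature_lists for f in feats)
    return [[f for f in feats if counts[f] >= min_count] for feats in feature_lists]


def three_folds(n_items, seed=0):
    """Randomly split item indices into three folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::3] for i in range(3)]


tokens = preprocess("Check http://t.co/abc NOW 42 times! #Excited")
print(tokens)
print(ngram_features(tokens)[:8])
```

In each round of cross-validation, one fold serves as the test portion and the other two as the training portion.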
We begin by discussing the performance of classifying
the Twitter posts in our dataset into the 11 different affect
classes. We report the mean F1 measures for the 11 affect
classes in Table 2.
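For reference, the F1 measure for a single affect class is the harmonic mean of precision and recall. The sketch below computes it per class and averages it over folds; averaging the per-fold scores is an assumption about how the mean was taken, and the labels in the example are hypothetical.

```python
def f1_score(gold, predicted, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1(fold_results, label):
    """Average the per-fold F1 of `label`; fold_results holds (gold, predicted) pairs."""
    scores = [f1_score(gold, pred, label) for gold, pred in fold_results]
    return sum(scores) / len(scores)


gold = ["joviality", "fear", "joviality", "fear"]
pred = ["joviality", "joviality", "fear", "fear"]
print(f1_score(gold, pred, "joviality"))
```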
Our results show that the performance (precision/recall)
of various affect classes differs widely. Noting the mood
distributions for the various affects in Table 1, it appears
from the F1 measures in Table 2 that the good performance
for the affects joviality, fatigue, hostility and sadness can be
explained by the fact that all of them have a large number
of moods – consequently their feature space may be less
sparse, spanning a variety of topical and linguistic contexts
in Twitter posts. On the other hand, the worst performing
classes, e.g., guilt, shyness and attentiveness, are also the
ones with fewer corresponding moods. Hence it is possible
that their feature spaces are rather sparse due to the limited
contexts they are typically used in on Twitter. Moreover, a
qualitative study of the posts that belong to these classes
tends to indicate significant degrees of sarcasm or irony in
them – e.g., for the guilt affect class: “I hate when ppl read
too deep into ur tweet and think it’s about them..... damn ..
#guilty”; and for the attentiveness affect class: “If a tomato
is a fruit does that mean ketchup is a smoothie?
#suspicious”. Due to such contextual mismatch between
content and the labeled affect, the classifier performs worse
for these classes.
Figure 5. Circumplex model of moods showing the
relationship between mood expression and activity (Twitter
posts made per second by an individual sharing the mood).
Larger squares indicate higher activity.
Affect class      Mean F1    Affect class     Mean F1
Joviality         0.4644     Fear             0.2319
Fatigue           0.4146     Guilt            0.1838
Hostility         0.3270     Surprise         0.3328
Sadness           0.2885     Shyness          0.0722
Serenity          0.1833     Attentiveness    0.0203
Self-assurance    0.2642
Table 2. Mean F1 measures of 11 affect classes.
What the classification results indicate in general is that
the manner in which the various affect classes are used on
Twitter (via explicit mood hashtags) has a significant
impact on the performance of the classifier. Moreover, we
established earlier that different moods have different
valence and activation measures (e.g., angry and frustrated
are both negative moods, but angry indicates higher arousal
than frustrated). These differences make the context of
affect manifestation widely diverse, in turn making affect
classification in social media a challenging problem.
UTILITY OF UNDERSTANDING EMOTION: CASE STUDY
Finally, we discuss a direct application area where these
mood and affect expression dynamics can be utilized to
promote self-reflection and social awareness.
Interest in developing applications to encourage and
promote psychological well-being of individuals is not new
to the HCI and health informatics communities [24, 25].
However there is unique value in utilizing expressed
emotion on social media like Twitter in applications related
to well-being. The value can be thought to be twofold. First,
since Twitter allows us to track the emotions of individuals
over a long time scale (years), we can enable them to observe
differences and changes in their behavior over time that
could be indicative of a potential mental disorder
(self-reflection). Second, since Twitter already has a social
network associated with it, individuals are likely to obtain
better social support from their contacts than on a different
social platform when it comes to
tackling a health-related issue (social support and
awareness).
Motivated in this light, we present how mood expressions
can be used to mine nuanced behavioral changes of
individuals over time, in the light of major life events (e.g.,
birth of a child, loss of job, death related bereavement etc.).
We propose a variety of measures to quantify such change
over time, and reveal how emotion based measures can be
used to forecast major events in people’s lives that can, in
turn, promote social and health-related well-being.
Data Around A Major Life Event: Child-birth
In this paper, we focus on a particular major event in a
person's life: the birth of a child. For this purpose, our
population of interest comprises new mothers, i.e., female
Twitter users who are likely to have given birth to a child in
a given timeframe. Note that we focus on new mothers
here, although we have observed that a fair
percentage of Twitter posts come from fathers right after the
birth of their child. The reason is that we believe new
mothers experience a more significant change in their lifestyle
and habits in the postnatal period than fathers do.
Identifying new mothers on social media based on their
posts, in the absence of self-reported gender, can be a
challenging problem. Hence we follow an iterative
approach: first constructing a candidate set of
likely new mothers by filtering the Twitter stream with
several hand-crafted queries, then inferring gender, and
finally identifying a set of high-probability new mothers
using ratings from Amazon Mechanical Turk workers. We
describe these steps below:
1. We first construct a list of several hand-crafted queries to
search the Twitter Firehose stream for candidate users
likely to be posting about their child-birth. For
consistent prenatal and postnatal
behavioral comparison of new mothers, we focus on
searching the Firehose stream over a fixed two-month
period between May 1, 2011 and Jun 30, 2011. The
search queries included: “after labor born”,
“arrival baby boy/girl”, “birth pounds/inches”, “its a
boy/girl born”. These
queries were motivated by our observation that users
tend to talk about the labor in the final phase of
their pregnancy, and tend to report the physical
details of their newborn child, including gender,
weight and height. This phase yielded a
candidate set of 483 users.
2. Since we are interested in new mothers only, inferring the
gender of the candidate user set constructed above was an
important step. Twitter does not provide a facility for
users to report their gender; hence we had to rely on
cues obtained from their self-declared name to infer the
gender of the users in our candidate set. To this end, we
utilized a simple lexicon-based approach that attempts to
match the first name of the Twitter user against a large
dictionary of names collated from United States
Census data, as well as a publicly available corpus of
Facebook users’ names and self-reported gender. Because
of the cross-cultural nature of Facebook,
the dictionary amalgamating it and the Census data works
fairly well in inferring the gender of Twitter users. We tested
the accuracy of this inference mechanism by randomly
selecting 100 users and labeling them manually; the
lexicon-driven gender inference mechanism yielded
around 83% accuracy. After gender
identification, we obtained a smaller set of likely new
mothers comprising 177 users.
3. In the final step, we use the gender-inferred set of likely
new mothers in a Mechanical Turk setup, in order to rule
out false positives and thereby obtain a high-precision
dataset. For this purpose, we showed each turker
(min. 95% approval rating, English language proficient,
and familiar with Twitter) a set of 10 Twitter posts from
each user in our candidate set, such that 5 posts were
posted right before the post indicating child-birth, and 5
after it. We hoped that this would give the turker some
contextual cues to decide whether the author of those
posts appeared to be legitimately a new mother.
Additionally, we showed the Twitter profile bio,
picture and a link to the actual Twitter profile for each
user. The task involved responding to a
yes/no/maybe multiple-choice question per user, to
evaluate whether the user was a new mother. We thus collected
five ratings per user from the turkers, and used the
majority rating as the correct label (Fleiss' kappa was
0.69). For our final dataset, we considered the users with
the “yes” label; it consisted of 85 new mothers.
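Steps 2 and 3 above can be illustrated with a minimal sketch. The lexicon contents, the function names, and the handling of tied votes are all assumptions made for illustration; the text specifies only a first-name lookup, a majority rating over five ratings, and Fleiss' kappa as the agreement statistic.

```python
from collections import Counter

# Hypothetical toy lexicon; the dictionary described in the text was built
# from Census and Facebook name data and is far larger.
NAME_GENDER = {"emily": "female", "sarah": "female", "james": "male"}


def infer_gender(display_name, lexicon=NAME_GENDER):
    """Look up the first token of a self-declared name; None if unknown."""
    tokens = display_name.strip().lower().split()
    if not tokens:
        return None
    return lexicon.get(tokens[0])


def majority_label(ratings):
    """Most frequent rating, or None when the top two counts tie.

    Treating ties as 'no label' is an assumption; the text does not say
    how ties were resolved.
    """
    ranked = Counter(ratings).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def fleiss_kappa(rating_table, categories=("yes", "no", "maybe")):
    """Fleiss' kappa for per-user rating lists with equal rater counts."""
    n_items = len(rating_table)
    n_raters = len(rating_table[0])
    tallies = [Counter(ratings) for ratings in rating_table]
    # Agreement within each item: share of rater pairs that agree.
    per_item = [
        (sum(t[c] ** 2 for c in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for t in tallies
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall share of each category.
    shares = [
        sum(t[c] for t in tallies) / (n_items * n_raters) for c in categories
    ]
    expected = sum(s ** 2 for s in shares)
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings are identical")
    return (observed - expected) / (1.0 - expected)
```

A user would enter the final dataset under this sketch only if `infer_gender` returns "female" and `majority_label` over the five turker ratings returns "yes".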
Finally, for each of these 85 new mothers, we track their
Twitter timelines in the Firehose stream to collect all of
their posts in two 5-month periods, corresponding to
prenatal and postnatal phases around child-birth (Dec 1,
2010 – Apr 30, 2011 and Jul 1, 2011 – Nov 30, 2011,
respectively).
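The two collection windows can be expressed as a simple date filter. This is a sketch only: the window boundaries come from the text, while the function names and the `(date, text)` post representation are hypothetical.

```python
from datetime import date

# Window boundaries as stated in the text.
PRENATAL = (date(2010, 12, 1), date(2011, 4, 30))
POSTNATAL = (date(2011, 7, 1), date(2011, 11, 30))


def assign_phase(post_date):
    """Return 'prenatal', 'postnatal', or None for a post's calendar date."""
    if PRENATAL[0] <= post_date <= PRENATAL[1]:
        return "prenatal"
    if POSTNATAL[0] <= post_date <= POSTNATAL[1]:
        return "postnatal"
    # Falls outside both windows, e.g. the May-June 2011 search period.
    return None


def split_timeline(posts):
    """Group (date, text) pairs by phase, dropping posts outside both windows."""
    phases = {"prenatal": [], "postnatal": []}
    for post_date, text in posts:
        phase = assign_phase(post_date)
        if phase is not None:
            phases[phase].append((post_date, text))
    return phases
```

Posts from the May to June 2011 search window fall into neither phase, which keeps the announcement period itself out of the comparison.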
In the following subsection, we will present some
comparisons of the prenatal and postnatal behavioral trends
along a number of emotional measures.
Measures to Detect Behavioral Change
We propose three different measures to quantify the
behavioral change of the new mothers. These measures are
motivated by the emotion studies discussed in the
previous two sections. Note that our measures are crafted to
capture behavioral differences that would reflect change
in lifestyle in general, rather than aggregated volumes;
hence they capture differences over the course of a day
(24-hour periods).
Volume. Our first measure, called volume, is based on the
average normalized number of posts made per hour by the
new mothers over a 24-hour period, over the prenatal and
postnatal periods respectively.
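One way to read this definition is sketched below. The normalization scheme is an assumption, since the text does not spell it out: here each mother's hourly counts are divided by her total number of posts in the period, and the resulting 24-hour profiles are averaged across mothers. The function name and input format are hypothetical.

```python
from collections import defaultdict


def hourly_volume(posts):
    """Average normalized post count per hour of day.

    posts: iterable of (user_id, hour) pairs, hour in 0..23, all drawn
    from one period (prenatal or postnatal).
    Returns a 24-element list whose entries sum to 1 when posts exist.
    """
    per_user = defaultdict(lambda: [0] * 24)
    for user_id, hour in posts:
        per_user[user_id][hour] += 1

    profile = [0.0] * 24
    for counts in per_user.values():
        total = sum(counts)  # at least 1, since the user has a post
        for hour in range(24):
            profile[hour] += counts[hour] / total

    n_users = len(per_user)
    if n_users == 0:
        return profile
    return [value / n_users for value in profile]
```

Normalizing per mother first keeps a few very active accounts from dominating the daily profile.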
Positive Affect (PA). We define a measure of positive
affect (PA) of the new mothers per hour during a 24-hour
interval, during the prenatal and postnatal periods
respectively. To compute PA, we utilize the
psycholinguistic lexicon LIWC and focus on the words in
the positive emotion category. Given a post from a new
mother posted during a certain hour of the day,
we perform a simple word-spotting exercise to determine
the fraction of words that match the words in LIWC’s positive
emotion category. This fraction gives the measure of PA,
when averaged over all posts for all mothers, per hour,
corresponding to the prenatal and postnatal periods.
Negative Affect (NA). Like PA, we also define a measure
of negative affect (NA) averaged over all mothers per hour,
in a typical 24 hour interval, during the prenatal and
postnatal periods respectively. We again utilize LIWC for
its negative affect categories: ‘negative emotion’, ‘anger’,
‘anxiety’, ‘sadness’. Based on the same word spotting
technique, we measure NA per hour, averaged over all
posts for all mothers corresponding to the prenatal and
postnatal periods, over any typical 24 hour interval in a day.
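The word-spotting computation behind PA and NA can be sketched as follows. The word sets are hypothetical stand-ins: the actual LIWC categories are licensed, much larger, and include stem patterns that this sketch ignores. The tokenizer is likewise an assumption.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for the LIWC categories named in the text.
POSITIVE = {"happy", "love", "great", "excited"}
NEGATIVE = {"sad", "angry", "worried", "hate"}


def affect_fraction(text, lexicon):
    """Fraction of a post's word tokens that appear in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)


def hourly_affect(posts, lexicon):
    """Mean per-post fraction for each hour of day.

    posts: iterable of (hour, text) pairs pooled over all mothers
    within one period (prenatal or postnatal).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, text in posts:
        sums[hour] += affect_fraction(text, lexicon)
        counts[hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sums}
```

Calling `hourly_affect` once with the positive set and once with the negative set, separately for each period, gives the four daily curves that are compared in the next subsection.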
Prenatal and Postnatal Behavior Comparison
Based on the three measures defined above, we now study
the behavioral differences of new mothers, in the prenatal
and postnatal periods. In particular, we are interested in
observing how behavior changes during the postnatal period,
with respect to the prenatal phase.
Figure 6. Volume (average normalized post count), PA and NA for the set of 85 new mothers during the prenatal and postnatal periods.
All measures demonstrate significant change, based on one-tail paired t-tests (p<0.0001).
Figure 6 shows the average trends of the measures in the
prenatal and postnatal periods for all 85 new mothers. The
results demonstrate that overall there is considerable change
in behavior, as quantified by these measures (statistically
significant based on one-tail paired t-tests). There is a
general decrease in volume (normalized post counts) over a
typical day in the postnatal period compared to the prenatal
period. This probably reflects that mothers are experiencing busy
times alongside the rather frequent “maternity blues”.
Besides, lower post volume on a social media site like
Twitter might also indicate social withdrawal after
child-birth. Additionally, there is high volume during the prenatal
period in the early hours of the morning: a feature often
typical of pregnant mothers suffering from morning
sickness. This feature, however, disappears during the
postnatal phase. We also observe a decrease in PA over the
course of a day, likely because of a mother’s physical,
mental and emotional exhaustion [17], as well as sleep
deprivation typical of parenting a newborn (notice the low
PA early in the day). Finally, in the case of NA,
we observe that the postnatal period shows significantly
higher variance throughout the day, compared to that in the
prenatal phase: an indicator of mood swings for the new
mothers [7] as well as of increased anxiety or being
overwhelmed frequently but inconsistently.
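The paired comparison can be sketched with the standard library alone. The choice of pairing unit is an assumption, because the text does not state what is paired: here each hour of the day contributes one prenatal value and one postnatal value. The function name is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Paired t statistic and degrees of freedom for before minus after.

    before, after: equal-length sequences, e.g. the 24 hourly values of
    one measure in the prenatal and postnatal periods.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [b - a for b, a in zip(before, after)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_value = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1
```

A positive statistic indicates that the measure is lower after child-birth. The one-tailed p-value is the upper-tail probability of Student's t distribution at this statistic and these degrees of freedom, which a statistics package can supply.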
We observe (not shown in the interest of space) further
significant changes in behavioral trends for new mothers on
another set of measures that encompass their social
interactions on Twitter (measures motivated by [4]),
including number of re-tweets (RTs), number of @-replies
directed towards another user, number of links shared, as
well as number of questions asked. We encounter a similar
finding here too: a general decrease in social interactions.
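These interaction measures could be approximated from post text alone, as in the sketch below. The detection rules are assumptions based on common Twitter conventions (a leading "RT @" for re-tweets, a leading "@" for replies, a question mark outside URLs for questions) and are not taken from the text; the function names are hypothetical.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def interaction_features(text):
    """Heuristic interaction flags and link count for one post."""
    stripped = text.strip()
    without_urls = URL_PATTERN.sub("", stripped)
    return {
        "is_retweet": stripped.startswith("RT @"),
        "is_reply": stripped.startswith("@"),
        "n_links": len(URL_PATTERN.findall(stripped)),
        # Ignore question marks that belong to URL query strings.
        "is_question": "?" in without_urls,
    }


def count_interactions(texts):
    """Total re-tweets, replies, links and questions over a set of posts."""
    totals = {"retweets": 0, "replies": 0, "links": 0, "questions": 0}
    for text in texts:
        features = interaction_features(text)
        totals["retweets"] += int(features["is_retweet"])
        totals["replies"] += int(features["is_reply"])
        totals["links"] += features["n_links"]
        totals["questions"] += int(features["is_question"])
    return totals
```

Computing these totals separately for the prenatal and postnatal posts of each mother would give the quantities whose decrease is reported above.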
To summarize, a major life event like child-birth triggers
considerable behavioral change for new mothers in the
postnatal period, compared to that during the prenatal
phase; and social media activity can provide valuable
emotional indicators to track these changes systematically.
These findings also bear the potential to develop a
prediction framework that can utilize the social media based
emotional measures and forecast erratic behavior change in
the near future for new mothers.
DISCUSSION
In this paper, we provided an elaborate exploration of the
space of emotion expression in social media, in the light of
how to measure them, how to infer them automatically
when explicit signals are not available, and finally a utility-
driven case study to investigate how such measurement and
inference of emotion can be put to practical use.
Our broad findings extend existing psycholinguistics and
psychology literature in several ways. Considerable effort
in prior research has focused on studying the circumplex
model of moods; however, studying the extent to which
different individuals use these moods at a large scale has
remained a challenge, because collecting such data through
limited user studies is expensive. As
Twitter continues to evolve as a mechanism of human
expression, we have made an effort to reveal dynamics of
the circumplex of moods frequently used on Twitter, not
only via their measures of valence and activation, but
also through their respective usages, linguistic diversity,
diurnal patterns etc.
Our work in emotion exploration and measurement in
social media provides ample scope for extensions and
future research directions. A limitation of the current work is
that we have analyzed emotional behavior across all Twitter
users in general, without making distinctions between
different user types. For instance, one could conjecture that
the mechanisms of mood expression will be different for
elite users compared to ordinary individuals. Additionally,
there are likely to be differences in mood expression across
cultures and demographics such as race, gender, social status,
educational background and so on. Furthermore, studying
the dynamics of the emotions as well as their utility in the real
world could reveal the susceptibility of various population
segments to emergency situations, an aspect that can
benefit governmental agencies in particular.
Finally, as a utility application domain for the insights
obtained from measuring and inferring human moods in
social media, we explored the topic of behavioral change of
individuals around major life events. While there could be
a diverse range of major life events that can trigger
behavioral change, we have explored only one domain:
postnatal behavior of new mothers. Though limited in its
scope for the time being, this work could be expanded
to study a variety of emotional disorders while
investigating postnatal behavior change, such as
postpartum depression (PPD), postnatal psychosis and so on.
Plausibly, emotional markers obtained from social media
and social network data can enable better diagnosis of
these disorders, and at the same time provide mechanisms for
new mothers to track their behavioral change for their
health-related well-being.
Opportunities also remain in terms of venturing out to
other types of life events for which tracking behavioral change
via social media activity can be fruitful. These include loss
of a job or financial instability (to track population-scale
unemployment dissatisfaction, or economic indicators);
death-related grief and bereavement or loss of safety after a
trauma (to help individuals cope with surprising emotions
and improve their quality of life) and so on. In the coming
years, from the perspective of HCI research, these could
help us design emotion-aware interfaces, wherein a user's
online social experience is tuned to her emotional state,
emotional needs and requirement for social support, and which
act as a self-narrative and self-reflective feedback tool.
CONCLUSIONS
In this paper, we studied a variety of moods that frequent
Twitter posts, from the point of view of their measurement,
their inference in the absence of explicit signals, as well as
their potential in catering to practical real-world application
scenarios that can promote social and personal wellness of
individuals. We used the dimensions of valence and
activation to represent moods in the circumplex model and
studied the topology of this space with respect to mood
usage, linguistic diversity, and activity. In this manner we
provided naturalistic validation of the circumplex using
social media data, and extended the conceptualization of
human emotion at scale.
Next, through our automatic affect classifier, we were
able to expand emotion measurement and tracking to
contexts where explicit mood signals might not be
available: a contribution that can help recommender and
search interfaces greatly. Finally, we investigated a case
study in detecting behavioral changes of individuals (new
mothers) around a major life event (child-birth), via their
shared content on social media. Through this study we have
laid the foundation of a line of HCI research that can utilize
emotion signals from online activity to predict and forecast
anomalous behavior or change at the personalized level of
an individual, and at the same time at a collective scale
involving large populations.
ACKNOWLEDGMENTS
We thank Michael Gamon for help with the affect
classification task, and Eric Horvitz for fruitful discussions.
REFERENCES
1. Bollen, J., Mao, H., & Pepe, A. (2011). Modeling Public Mood and Emotion: Twitter Sentiment and Socio-Economic Phenomena. In Proc. ICWSM 2011.
2. Bradley, M.M., & Lang, P.J. (1999). Affective norms for English words (ANEW). Gainesville, FL. The NIMH Center for the Study of Emotion and Attention.
3. Davidov, D., Tsur, O., & Rappoport, A. (2010). Enhanced Sentiment Learning Using Twitter Hashtags and Smileys. Proceedings of the 23rd International Conference on Computational Linguistics Posters, (August), 241-249.
4. De Choudhury, M., Counts, S., & Gamon, M. (2012). Not All Moods are Created Equal! Exploring Human Emotional States in Social Media. In Proc. ICWSM 2012, to appear.
5. De Choudhury, M., Gamon, M., and Counts, S. (2012). Happy, Nervous or Surprised? Classification of Human Affective States in Social Media. In Proc. ICWSM 2012, to appear.
6. Diakopoulos, N., & Shamma, D. (2010). Characterizing debate performance via aggregated twitter sentiment. In Proc. CHI 2010. 1195-1198.
7. Edhborg, M., Lundh, W., Seimyr, L., & Widstrom, A-M. (2001). The long-term impact of postnatal depressed mood on mother-child interaction: a preliminary study. In Journal of Reproductive and Infant Psychology 19: 61–71.
8. Ekman, P. (1973). Cross-cultural studies of facial expressions. In P. Ekman (Ed.), Darwin and facial expression: A century of research in review (pp. 169-229).
9. Esuli, A., & Sebastiani, F. (2006). SentiWordNet: A publicly available lexical resource for opinion mining. Proceedings of LREC (Vol. 6, p. 417–422).
10. Golder, S. A., & Macy, M. W. (2011). Diurnal and Seasonal Mood Vary with Work, Sleep and Daylength Across Diverse Cultures. Science. 30 Sep 2011.
11. Kleinginna, P.R., & Kleinginna, A.M. (1981). A categorized list of motivation definitions with a suggestion for consensual definition. Motivation and Emotion, 263-291.
12. Kramer, A. D. I. (2010). An unobtrusive behavioral model of “gross national happiness”. In Proc. CHI 2010. 287-290.
13. Mehrabian, Albert (1980). Basic dimensions for a general psychological theory. pp. 39–53.
14. Mihalcea, R., & Liu, H. (2006). A corpus-based approach to finding happiness. In Proceedings of computational approaches for analysis of weblogs, AAAI Spring Symposium.
15. Mishne, G. (2005). Experiments with mood classification in blog posts. In Style2005 -- the 1st Workshop on Stylistic Analysis Of Text For Information Access, at SIGIR 2005.
16. Mishne, G. & de Rijke, M. (2006). Capturing Global Mood Levels using Blog Posts. In AAAI 2006 Spring Symposium on Computational Approaches to Analyzing Weblogs.
17. O’Hara, M.W. (1995). Postpartum Depression: Causes and Consequences. New York: Springer-Verlag.
18. Ortony, A., & Turner, T. J. (1990). What's basic about basic emotions? Psychological Review, 97, 315-331.
19. Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques. In Proc. EMNLP 2002, Vol. 10. 79-86.
20. Russell, James A. (1980). A circumplex model of affect. J. of Personality and Social Psychology: 39, 1161–1178.
21. Tellegen, A. (1985). Structures of mood and personality and their relevance to assessing anxiety, with an emphasis on self-report. In A. H. Tuma & J. D. Maser (Eds.), Anxiety and the anxiety disorders (pp. 681-706).
22. Watson, D., & Clark, L. A. (1994). The PANAS-X: Manual for the positive and negative affect schedule-Expanded Form. Iowa City: University of Iowa.
23. Wiebe, J., Wilson, T., Bruce, R, Bell, M., & Martin, M. (2004). Learning subjective language. Computational Linguistics, 30 (3).
24. Anhoj, J., & Jensen, A-H. (2004). Using the Internet for life style changes in diet and physical activity: a feasibility study. In J Med Internet Res 8; 6(3):e28.
25. Mamykina, L., Mynatt, E. et al. (2008). MAHI: investigation of social scaffolding for reflective thinking in diabetes management. In Proc. CHI 2008.