Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams and Exploiting Gender Language Differences on Twitter

Svitlana Volkova

06/04/2013

Department of Computer Science
Johns Hopkins University
3400 North Charles Street
Baltimore, MD 21218

This project fulfills a requirement for the Computer Science PhD degree.

Advisor’s signature ______________________ Date ___________
Advisor’s name: David Yarowsky


Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams

Svitlana Volkova
CLSP, Johns Hopkins University
Baltimore, MD
[email protected]

Theresa Wilson
HLTCOE, Johns Hopkins University
Baltimore, MD
[email protected]

David Yarowsky
CLSP, Johns Hopkins University
Baltimore, MD
[email protected]

Abstract

We study subjective language in social media and create Twitter-specific lexicons by bootstrapping sentiment-bearing terms from multilingual Twitter streams. Starting with a domain-independent, high-precision sentiment lexicon and a large pool of unlabeled data, we bootstrap Twitter-specific sentiment lexicons, using a small amount of labeled data to guide the process. Our experiments on English, Spanish and Russian show that the resulting lexicons are effective for sentiment classification for many under-explored languages in social media.

1 Introduction

The language that people use to express opinions and sentiment is extremely diverse. This is true for well-formed data, such as news and reviews, and it is particularly true for data from social media. Communication in social media is informal, abbreviations and misspellings abound, and the person communicating is often trying to be funny, creative, and entertaining. Topics change rapidly, and people invent new words and phrases.

The dynamic nature of social media, together with the extreme diversity of subjective language, has implications for any system that aims to analyze sentiment in this domain. General, domain-independent sentiment lexicons have low coverage. Even models trained specifically on social media data may degrade somewhat over time as topics change and new sentiment-bearing terms crop up. For example, the word “occupy” would not have been indicative of sentiment before 2011.

Most of the previous work on sentiment lexicon construction relies on existing natural language processing tools, e.g., syntactic parsers (Wiebe, 2000), information extraction (IE) tools (Riloff and Wiebe, 2003), or rich lexical resources such as WordNet (Esuli and Sebastiani, 2006). However, such tools and lexical resources are not available for many languages spoken in social media. While English is still the top language on Twitter, it is no longer the majority. Thus, the applicability of these approaches is limited. Any method for analyzing sentiment in microblogs or other social media streams must be easily adapted to (1) many low-resource languages, (2) the dynamic nature of social media, and (3) working in a streaming mode with limited or no supervision.

Although bootstrapping has been used for learning sentiment lexicons in other domains (Turney and Littman, 2002; Banea et al., 2008), it has not yet been applied to learning sentiment lexicons for microblogs. In this paper, we present an approach for bootstrapping subjectivity clues from Twitter data, and evaluate our approach on English, Spanish and Russian Twitter streams. Our approach:

• handles the informality, creativity and dynamic nature of social media;
• does not rely on language-dependent tools;
• scales to the hundreds of under-explored languages and dialects in social media;
• classifies sentiment in a streaming mode.

To bootstrap subjectivity clues from Twitter streams we rely on three main assumptions:

i. sentiment-bearing terms of similar orientation tend to co-occur at the tweet level (Turney and Littman, 2002);
ii. sentiment-bearing terms of opposite orientation do not co-occur at the tweet level (Gamon and Aue, 2005);
iii. the co-occurrence of domain-specific and domain-independent subjective terms serves as a signal of subjectivity.


2 Related Work

Mihalcea et al. (2012) classify methods for bootstrapping subjectivity lexicons into two types: corpus-based and dictionary-based.

Dictionary-based methods rely on existing lexical resources to bootstrap sentiment lexicons. Many researchers have explored using relations in WordNet (Miller, 1995), e.g., Esuli and Sebastiani (2006) and Andreevskaia and Bergler (2006) for English, Rao and Ravichandran (2009) for Hindi and French, and Perez-Rosas et al. (2012) for Spanish. Mohammad et al. (2009) use a thesaurus to aid in the construction of a sentiment lexicon for English. Other work (Clematide and Klenner, 2010; Abdul-Mageed et al., 2011) automatically expands and evaluates German and Arabic lexicons. However, the lexical resources that dictionary-based methods need do not yet exist for the majority of languages in social media. There is also a mismatch between the formality of many language resources, such as WordNet, and the extremely informal language of social media.

Corpus-based methods extract subjectivity and sentiment lexicons from large amounts of unlabeled data using different similarity metrics to measure the relatedness between words. Hatzivassiloglou and McKeown (1997) were the first to explore automatically learning the polarity of words from corpora. Early work by Wiebe (2000) identifies clusters of subjectivity clues based on their distributional similarity, using a small amount of data to bootstrap the process. Turney (2002) and Velikovich et al. (2010) bootstrap sentiment lexicons for English from the web using Pointwise Mutual Information (PMI) and a graph propagation approach, respectively. Kaji and Kitsuregawa (2007) propose a method for building a sentiment lexicon for Japanese from HTML pages. Banea et al. (2008) experiment with Latent Semantic Analysis (LSA) (Dumais et al., 1988) to bootstrap a subjectivity lexicon for Romanian. Kanayama and Nasukawa (2006) bootstrap subjectivity lexicons for Japanese by generating subjectivity candidates based on word co-occurrence patterns.

In contrast to other corpus-based bootstrapping methods, we evaluate our approach on multiple languages, specifically English, Spanish, and Russian. Also, as our approach relies only on the availability of a bilingual dictionary for translating an English subjectivity lexicon and crowdsourcing for help in selecting seeds, it is more scalable and better able to handle the informality and dynamic nature of social media. It can also be effectively used to bootstrap sentiment lexicons for any language for which a bilingual dictionary is available or can be automatically induced from parallel corpora.

3 Data

For the experiments in this paper, we use three sets of data for each language: 1M unlabeled tweets (BOOT) for bootstrapping Twitter-specific lexicons, 2K labeled tweets for development data (DEV), and 2K labeled tweets for evaluation (TEST). DEV is used for parameter tuning while bootstrapping, and TEST is used to evaluate the quality of the bootstrapped lexicons.

We take English tweets from the corpus constructed by Burger et al. (2011), which contains 2.9M tweets (excluding retweets) from 184K users.[1] English tweets are identified automatically using a compression-based language identification (LID) tool (Bergsma et al., 2012). According to LID, there are 1.8M (63.6%) English tweets, which we randomly sample to create the BOOT, DEV and TEST sets for English. Unfortunately, Burger’s corpus does not include Russian and Spanish data on the same scale as English. Therefore, for the other languages we construct a new Twitter corpus by downloading tweets from followers of region-specific news and media feeds.

Sentiment labels for tweets in the DEV and TEST sets for all languages are obtained using Amazon Mechanical Turk. For each tweet we collect annotations from five workers and use a majority vote to determine the final label for the tweet (a minimal aggregation sketch follows the examples below). Snow et al. (2008) show that for a similar task, labeling emotion and valence, on average four non-expert labelers are needed to achieve an expert level of annotation. Table 1 gives the distribution of tweets over sentiment labels for the development and test sets for English (E-DEV, E-TEST), Spanish (S-DEV, S-TEST), and Russian (R-DEV, R-TEST). Below are examples of tweets in Russian, labeled with sentiment, with English translations:

Data     Positive  Negative  Both  Neutral
E-DEV      617       357     202     824
E-TEST     596       347     195     862
S-DEV      358       354      86   1,202
S-TEST     317       387      93   1,203
R-DEV      452       463     156     929
R-TEST     488       380     149     983

Table 1: Sentiment label distribution in the development (DEV) and test (TEST) datasets across languages.

• Positive: В планах вкусный завтрак и куча фильмов (Planning for a delicious breakfast and lots of movies);

• Negative: Хочу сдохнуть, и я это сделаю (I want to die and I will do that);

• Both: Хочется написать грубее про фильм, но не буду. Хотя актеры хороши (I want to write more harshly about the movie but I will not. Although the actors are good);

• Neutral: Почему умные мысли приходят только ночью? (Why do clever thoughts come only at night?).

[1] They provided the tweet IDs, and we used the Twitter Corpus Tools to download the tweets.
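As a concrete illustration of the label-aggregation step, the following is a minimal Python sketch of majority voting over the five per-tweet judgments. The label strings and the tie handling (returning None for adjudication) are assumptions; the paper does not specify how ties among five votes are resolved.

```python
from collections import Counter

def majority_label(annotations):
    """Aggregate the five per-tweet MTurk judgments by majority vote.

    `annotations` is a list of label strings, e.g.
    ["positive", "positive", "neutral", "both", "positive"].
    Returns the plurality label, or None on a tie (an assumption;
    tied tweets would need adjudication).
    """
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the top two labels
    return counts[0][0]

print(majority_label(["positive", "positive", "neutral", "both", "positive"]))
# -> "positive"
```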

4 Lexicon Bootstrapping

To create a Twitter-specific sentiment lexicon for a given language, we start with a general-purpose, high-precision sentiment lexicon[2] and bootstrap from the unlabeled data (BOOT), using the labeled development data (DEV) to guide the process.

4.1 High-Precision Subjectivity Lexicons

For English we seed the bootstrapping process with the strongly subjective terms from the MPQA lexicon[3] (Wilson et al., 2005). These terms have previously been shown to be high-precision for recognizing subjective sentences (Riloff and Wiebe, 2003).

For the other languages, the subjective seed terms are obtained by translating the English seed terms using a bilingual dictionary, and then collecting judgments about term subjectivity from Mechanical Turk. Terms that remain strongly subjective in translation are used as seed terms in the new language, with term polarity projected from the English. Finally, we expand the lexicons with plurals and inflectional forms for adverbs, adjectives and verbs.

4.2 Bootstrapping Approach

To bootstrap, the new lexicon L_B(0) is first seeded with the strongly subjective terms from the original lexicon L_I. On each iteration i ≥ 1, tweets in the unlabeled data are labeled using the lexicon from the previous iteration, L_B(i−1). If a tweet contains one or more terms from L_B(i−1) it is considered subjective, otherwise objective. The polarity of subjective tweets is determined in a similar way: if the tweet contains ≥ 1 positive terms, taking negation into account, it is considered positive; if it contains ≥ 1 negative terms, taking negation into account, it is considered negative.[4] If it contains both positive and negative terms, it is considered to be both. Then, for every term not in L_B(i−1) with frequency ≥ θ_freq, the probability of that term being subjective is calculated as shown in Algorithm 1, line 10. The top θ_K terms with subjective probability ≥ θ_pr are then added to L_B(i). The polarity of new terms is determined by the probability of the term appearing in positive or negative tweets, as shown in line 18.[5] The bootstrapping process terminates when there are no more new terms meeting the criteria to add (a simplified sketch of one iteration follows Algorithm 1).

[2] Other work on generating domain-specific sentiment lexicons, e.g., from blog data (Jijkoun et al., 2010), also starts with a general-purpose lexicon.
[3] http://www.cs.pitt.edu/mpqa/

Algorithm 1 BOOTSTRAP(σ, θ_pr, θ_freq, θ_topK)

 1: iter = 0, σ = 0.5, L_B(θ) ← L_I(σ)
 2: while (stop ≠ true) do
 3:   L_B^iter(θ) ← ∅, ΔL_B^iter(θ) ← ∅
 4:   for each new term w ∈ {V \ L_B(θ)} do
 5:     for each tweet t ∈ T do
 6:       if w ∈ t then
 7:         UPDATE c(w, L_B(θ)), c(w, L_B^pos(θ)), c(w)
 8:       end if
 9:     end for
10:     p_subj(w) ← c(w, L_B(θ)) / c(w)
11:     p_pos(w) ← c(w, L_B^pos(θ)) / c(w, L_B(θ))
12:     L_B^iter(θ) ← w, p_subj(w), p_pol(w)
13:   end for
14:   SORT L_B^iter(θ) by p_subj(w)
15:   while (K ≤ θ_topK) do
16:     for each new term w ∈ L_B^iter(θ) do
17:       if p_subj(w) ≥ θ_pr and c(w) ≥ θ_freq then
18:         if p_pos(w) ≥ 0.5 then
19:           w_pol ← positive
20:         else
21:           w_pol ← negative
22:         end if
23:         ΔL_B^iter(θ) ← ΔL_B^iter(θ) + w_pol
24:       end if
25:     end for
26:     K = K + 1
27:   end while
28:   if ΔL_B^iter(θ) == ∅ then
29:     stop ← true
30:   end if
31:   L_B(θ) ← L_B(θ) + ΔL_B^iter(θ)
32:   iter = iter + 1
33: end while

[4] If there is a negation within the two words preceding a sentiment term, we flip its polarity.

[5] Polarity association probabilities sum to 1: p_pos(w|L_B(θ)) + p_neg(w|L_B(θ)) = 1.
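The following is a simplified Python sketch of one bootstrapping iteration following Algorithm 1, assuming tweets arrive as pre-tokenized lists of lowercased tokens and the lexicon is held as two sets of positive and negative terms. The negation word list and all helper names are illustrative, not the paper's exact resources.

```python
from collections import defaultdict

NEGATIONS = {"not", "no", "never"}  # illustrative seed set, not the paper's exact list

def label_tweet(tokens, pos, neg):
    """Rule-label one tokenized tweet with the current lexicon.
    A term's polarity is flipped if a negation occurs within the two
    preceding tokens (footnote 4). Returns 'pos', 'neg', 'both' or None."""
    has_pos = has_neg = False
    for i, tok in enumerate(tokens):
        if tok in pos or tok in neg:
            flipped = any(t in NEGATIONS for t in tokens[max(0, i - 2):i])
            is_pos = (tok in pos) != flipped
            has_pos |= is_pos
            has_neg |= not is_pos
    if has_pos and has_neg:
        return "both"
    if has_pos:
        return "pos"
    if has_neg:
        return "neg"
    return None  # objective tweet

def bootstrap_iteration(tweets, pos, neg, th_pr=0.7, th_freq=5, th_topk=50):
    """One pass of Algorithm 1: score unknown terms and return the
    top-k new (term, polarity) pairs to add to the lexicon."""
    lexicon = pos | neg
    c_total, c_subj, c_pos = defaultdict(int), defaultdict(int), defaultdict(int)
    for tokens in tweets:
        label = label_tweet(tokens, pos, neg)
        for w in set(tokens) - lexicon:
            c_total[w] += 1               # c(w)
            if label is not None:         # tweet labeled subjective
                c_subj[w] += 1            # c(w, L_B)
                if label == "pos":
                    c_pos[w] += 1         # c(w, L_B^pos)
    # p_subj(w) = c(w, L_B) / c(w), sorted descending (lines 10 and 14)
    scored = sorted(
        ((c_subj[w] / c_total[w], w) for w in c_subj if c_total[w] >= th_freq),
        reverse=True)
    new_terms = []
    for p_subj, w in scored[:th_topk]:
        if p_subj < th_pr:
            break
        p_pos = c_pos[w] / c_subj[w]      # line 11
        new_terms.append((w, "pos" if p_pos >= 0.5 else "neg"))  # line 18
    return new_terms
```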


        English          Spanish          Russian
        L_I^E   L_B^E    L_I^S   L_B^S    L_I^R   L_B^R
Pos      2.3    16.8      2.9     7.7      1.4     5.3
Neg      2.8     4.7      5.2    14.6      2.3     5.5
Total    5.1    21.5      8.1    22.3      3.7    10.8

Table 2: The original (L_I) and bootstrapped (L_B) lexicon term counts with polarity across languages, in thousands; L_I ⊂ L_B.

The set of parameters θ = [θ_pr, θ_freq, θ_topK] is optimized using a grid search on the development data, using F-measure for subjectivity classification (a sketch of this tuning step follows). As a result, for English θ = [0.7, 5, 50], meaning that on each iteration the top 50 new terms with frequency ≥ 5 and probability ≥ 0.7 are added to the lexicon. For Spanish the set of optimal parameters is θ = [0.65, 3, 50], and for Russian θ = [0.65, 3, 50]. In Table 2 we report the size and term polarity of the original L_I and the bootstrapped L_B lexicons.
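A minimal sketch of the tuning step is given below. The candidate grids and the two caller-supplied helpers (bootstrap_lexicon, which runs Algorithm 1 to convergence for one parameter setting, and f_subjectivity, which computes subjectivity F-measure on DEV) are hypothetical; the paper reports only the optima, not the search space.

```python
from itertools import product

def tune_theta(bootstrap_lexicon, f_subjectivity, boot_tweets, dev_tweets):
    """Grid search over theta = (th_pr, th_freq, th_topK), scored by
    subjectivity F-measure on the development data. Both helper
    functions are caller-supplied; the grids below are illustrative."""
    best_f, best_theta = -1.0, None
    for theta in product([0.6, 0.65, 0.7, 0.75],   # candidate th_pr
                         [3, 5, 10],               # candidate th_freq
                         [25, 50, 100]):           # candidate th_topK
        lexicon = bootstrap_lexicon(boot_tweets, *theta)
        f = f_subjectivity(lexicon, dev_tweets)    # F-measure on DEV
        if f > best_f:
            best_f, best_theta = f, theta
    return best_theta, best_f
```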

5 Lexicon Evaluations

We evaluate our bootstrapped sentiment lexicons for English (L_B^E), Spanish (L_B^S) and Russian (L_B^R) by comparing them with existing dictionary-expanded lexicons that have previously been shown to be effective for subjectivity and polarity classification (Esuli and Sebastiani, 2006; Perez-Rosas et al., 2012; Chetviorkin and Loukachevitch, 2012). For that we perform subjectivity and polarity classification using rule-based classifiers[6] on the test data E-TEST, S-TEST and R-TEST.

We consider how the various lexicons perform in rule-based classifiers for both subjectivity and polarity. The subjectivity classifier predicts that a tweet is subjective if it contains (a) at least one, or (b) at least two subjective terms from the lexicon. For the polarity classifier, we predict a tweet to be positive (negative) if it contains at least one positive (negative) term, taking negation into account. If the tweet contains both positive and negative terms, we take the majority label.
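A minimal sketch of these two rule-based classifiers, under the same assumptions as the bootstrapping sketch above (pre-tokenized tweets, an illustrative negation list, a negation scope of two preceding tokens):

```python
NEGATIONS = {"not", "no", "never"}  # illustrative, not the paper's exact list

def negated(tokens, i):
    """True if a negation occurs within the two tokens before position i."""
    return any(t in NEGATIONS for t in tokens[max(0, i - 2):i])

def classify_subjectivity(tokens, lexicon, min_terms=1):
    """Variant (a) uses min_terms=1; variant (b) uses min_terms=2."""
    hits = sum(1 for t in tokens if t in lexicon)
    return "subjective" if hits >= min_terms else "neutral"

def classify_polarity(tokens, pos, neg):
    """Majority label over negation-adjusted positive/negative hits."""
    n_pos = n_neg = 0
    for i, t in enumerate(tokens):
        if t not in pos and t not in neg:
            continue
        hit_pos = (t in pos) != negated(tokens, i)  # negation flips polarity
        if hit_pos:
            n_pos += 1
        else:
            n_neg += 1
    if n_pos == 0 and n_neg == 0:
        return None  # no lexicon hits
    if n_pos == n_neg:
        return "both"
    return "positive" if n_pos > n_neg else "negative"
```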

For English we compare our bootstrapped lexicon L_B^E against the original lexicon L_I^E and strongly subjective terms from SentiWordNet 3.0 (Esuli and Sebastiani, 2006). To make a fair comparison, we automatically expand SentiWordNet with noun plural forms and verb inflectional forms. In Figure 1 we report precision, recall and F-measure results. They show that our bootstrapped lexicon significantly outperforms SentiWordNet for subjectivity classification. For polarity classification we get a comparable F-measure but much higher recall for L_B^E compared to SWN.

[6] A similar rule-based classification approach using terms from the MPQA lexicon is described by Riloff and Wiebe (2003).

Lexicon   F_subj≥1  F_subj≥2  F_polarity
SWN         0.57      0.27      0.78
L_I^E       0.71      0.48      0.82
L_B^E       0.75      0.72      0.78

Figure 1: Precision (x-axis), recall (y-axis) and F-measure (in the table) for English, for (a) Subj ≥ 1, (b) Subj ≥ 2 and (c) Polarity: L_I^E = initial lexicon, L_B^E = bootstrapped lexicon, SWN = strongly subjective terms from SentiWordNet.

For Spanish we compare our bootstrapped lexicon L_B^S against the original L_I^S lexicon, and the full and medium strength terms from the Spanish sentiment lexicon constructed by Perez-Rosas et al. (2012). We report precision, recall and F-measure in Figure 2. We observe that our bootstrapped lexicon yields significantly better performance for subjectivity classification compared to both full and medium strength terms. However, our bootstrapped lexicon yields lower recall and similar precision for polarity classification.

Lexicon   F_subj≥1  F_subj≥2  F_polarity
SM          0.44      0.17      0.64
SF          0.47      0.13      0.66
L_I^S       0.59      0.45      0.58
L_B^S       0.59      0.59      0.55

Figure 2: Precision (x-axis), recall (y-axis) and F-measure (in the table) for Spanish, for (a) Subj ≥ 1, (b) Subj ≥ 2 and (c) Polarity: L_I^S = initial lexicon, L_B^S = bootstrapped lexicon, SF = full strength terms, SM = medium strength terms.


For Russian we compare our bootstrapped lexicon L_B^R against the original L_I^R lexicon, and the Russian sentiment lexicon constructed by Chetviorkin and Loukachevitch (2012). The external lexicon for Russian, P, was built for the domain of product reviews and does not include polarity judgments for subjective terms. As before, we expand the external lexicon with inflectional forms for adverbs, adjectives and verbs. We report results for Russian in Figure 3. We find that for subjectivity our bootstrapped lexicon shows better performance compared to the external lexicon (5K terms). However, the expanded external lexicon (17K terms) yields higher recall with a significant drop in precision. Note that for Russian we report polarity classification results for the L_B^R and L_I^R lexicons only, because P does not have polarity labels.

Lexicon   F_subj≥1  F_subj≥2  F_polarity
P           0.55      0.29       –
PX          0.62      0.47       –
L_I^R       0.46      0.13      0.73
L_B^R       0.61      0.35      0.73

Figure 3: Precision (x-axis), recall (y-axis) and F-measure (in the table) for Russian, for (a) Subj ≥ 1, (b) Subj ≥ 2 and (c) Polarity: L_I^R = initial lexicon, L_B^R = bootstrapped lexicon, P = external sentiment lexicon, PX = expanded external lexicon.

We next perform an error analysis of subjectivity and polarity classification for all languages, and identify common errors to address in future work.

For subjectivity classification we observe that applying part-of-speech tagging during the bootstrapping could improve results for all languages. We could further improve the quality of the lexicon and reduce false negative errors (subjective tweets classified as neutral) by focusing on sentiment-bearing terms such as adjectives, adverbs and verbs. However, POS taggers for Twitter are available for only a limited number of languages, such as English (Gimpel et al., 2011). Other false negative errors are often caused by misspellings.[7]

[7] For morphologically-rich languages, our approach covers different linguistic forms of terms but not their misspellings. However, this can be fixed with an edit-distance check.

We also find subjective tweets with philosophical thoughts and opinions misclassified, especially in Russian, e.g., Иногда мы бываем не готовы к исполнению заветной мечты, но все равно так не хочется ее спугнуть (Sometimes we are not ready to fulfill our dreams yet but, at the same time, we do not want to scare them away). Such tweets are difficult to classify using lexicon-based approaches and require deeper linguistic analysis.

False positive errors for subjectivity classification happen because some terms are weakly subjective and can be used in both subjective and neutral tweets, e.g., the Russian term хвастаться (brag) is often used subjectively, but in the tweet никогда не стоит хвастаться будущим (never brag about your future) it is used as neutral. Similarly, the Spanish term buenas (good) is often used subjectively but is used as neutral in the following tweet: “@Diveke me falto el buenas! jaja que onda que ha pasado” (I miss the good times we had, haha that wave has passed!).

For polarity classification, most errors happen because our approach relies on either a positive or a negative polarity score for a term, but not both.[8] However, in the real world terms may have both usages. Thus, some tweets are misclassified (e.g., “It is too warm outside”). We can fix this by summing over weighted probabilities rather than over term counts, as sketched below. Additional errors happen because tweets are very short and convey multiple messages (e.g., “What do you mean by unconventional? Sounds exciting!”). Thus, our approach can be further improved by adding word sense disambiguation and anaphora resolution.
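As a sketch of the suggested fix, the following replaces the hard per-term polarity with a weighted sum over both polarity probabilities; the polarity_probs mapping and its contents are illustrative:

```python
def polarity_score(tokens, polarity_probs):
    """Sum per-term positive/negative probabilities instead of counting
    one hard polarity per term, so a word like "warm" with p(+)=0.74
    and p(-)=0.26 contributes to both sides of the decision.
    `polarity_probs` maps term -> (p_pos, p_neg); contents assumed."""
    pos = sum(polarity_probs[t][0] for t in tokens if t in polarity_probs)
    neg = sum(polarity_probs[t][1] for t in tokens if t in polarity_probs)
    return "positive" if pos >= neg else "negative"

print(polarity_score(["it", "is", "too", "warm", "outside"],
                     {"warm": (0.74, 0.26)}))  # -> "positive"
```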

6 Conclusions

We propose a scalable and language-independent bootstrapping approach for learning subjectivity clues from Twitter streams. We demonstrate the effectiveness of the bootstrapping procedure by comparing the resulting subjectivity lexicons with state-of-the-art sentiment lexicons. We perform an error analysis to address the most common error types in the future. The results confirm that the approach can be effectively exploited and further improved for subjectivity classification for many under-explored languages in social media.

[8] During bootstrapping we calculate the probability of a term being positive and negative, e.g., p(warm|+) = 0.74 and p(warm|−) = 0.26. But during polarity classification we rely on the highest probability score and consider it to be “the” polarity of the term, e.g., positive for warm.


References

Muhammad Abdul-Mageed, Mona T. Diab, and Mohammed Korayem. 2011. Subjectivity and sentiment analysis of Modern Standard Arabic. In Proceedings of ACL/HLT.

Alina Andreevskaia and Sabine Bergler. 2006. Mining WordNet for fuzzy sentiment: Sentiment tag extraction from WordNet glosses. In Proceedings of EACL.

Carmen Banea, Rada Mihalcea, and Janyce Wiebe. 2008. A bootstrapping method for building subjectivity lexicons for languages with scarce resources. In Proceedings of LREC.

Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. 2012. Language identification for creating language-specific Twitter collections. In Proceedings of the 2nd Workshop on Language in Social Media.

John D. Burger, John C. Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on Twitter. In Proceedings of EMNLP.

Ilia Chetviorkin and Natalia V. Loukachevitch. 2012. Extraction of Russian sentiment lexicon for product meta-domain. In Proceedings of COLING.

Simon Clematide and Manfred Klenner. 2010. Evaluation and extension of a polarity lexicon for German. In Proceedings of the 1st Workshop on Computational Approaches to Subjectivity and Sentiment Analysis.

Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Scott Deerwester, and Richard Harshman. 1988. Using latent semantic analysis to improve access to textual information. In Proceedings of SIGCHI.

Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of LREC.

Michael Gamon and Anthony Aue. 2005. Automatic identification of sentiment vocabulary: Exploiting low association with known sentiment terms. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing.

Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of ACL.

Vasileios Hatzivassiloglou and Kathy McKeown. 1997. Predicting the semantic orientation of adjectives. In Proceedings of ACL.

Valentin Jijkoun, Maarten de Rijke, and Wouter Weerkamp. 2010. Generating focused topic-specific sentiment lexicons. In Proceedings of ACL.

Nobuhiro Kaji and Masaru Kitsuregawa. 2007. Building lexicon for sentiment analysis from massive collection of HTML documents. In Proceedings of EMNLP.

Hiroshi Kanayama and Tetsuya Nasukawa. 2006. Fully automatic lexicon expansion for domain-oriented sentiment analysis. In Proceedings of EMNLP.

Rada Mihalcea, Carmen Banea, and Janyce Wiebe. 2012. Multilingual subjectivity and sentiment analysis. In Proceedings of ACL.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11).

Saif Mohammad, Cody Dunne, and Bonnie Dorr. 2009. Generating high-coverage semantic orientation lexicons from overtly marked words and a thesaurus. In Proceedings of EMNLP.

Veronica Perez-Rosas, Carmen Banea, and Rada Mihalcea. 2012. Learning sentiment lexicons in Spanish. In Proceedings of LREC.

Delip Rao and Deepak Ravichandran. 2009. Semi-supervised polarity lexicon induction. In Proceedings of EACL.

Ellen Riloff and Janyce Wiebe. 2003. Learning extraction patterns for subjective expressions. In Proceedings of EMNLP.

Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast – but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP.

Peter D. Turney and Michael L. Littman. 2002. Unsupervised learning of semantic orientation from a hundred-billion-word corpus. Computing Research Repository.

Peter D. Turney. 2002. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of ACL.

Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan, and Ryan McDonald. 2010. The viability of web-derived polarity lexicons. In Proceedings of NAACL.

Janyce Wiebe. 2000. Learning subjective adjectives from corpora. In Proceedings of AAAI.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of EMNLP.


Exploiting Gender Language Differences for Sentiment Analysis in Social Media

Abstract

In this paper we investigate and quantify differences in subjective language, emoticon, and hashtag use between male and female users in microblogs for three languages: English, Spanish, and Russian. In addition, we compare gender-independent features with different gender-dependent features for sentiment classification, and show that the novel addition of author gender as a parameter leads to statistically significant improvements for sentiment classification in all investigated languages.

1 Introduction

The differences in how women and men use language have long been studied in sociolinguistics and other fields (Gefen and Ridings, 2005). One dimension in which language use differs between genders is their use of subjective language. Boneva (2001), for example, found that women are more likely than men to express their concern, support others, and share feelings in personal email communication. Another study describes how the genders communicate in the workplace and suggests that men write short, precise, more confrontational and challenging emails, whereas women write more emotional emails (Cockcroft, 2009). Given that there are gender differences in the use of subjective language, is it possible to exploit these differences and improve automatic sentiment analysis?

In this paper, we investigate gender differences in the use of subjective language, emoticons, and hashtags for English, Spanish, and Russian in the domain of social media. We study how men and women differ both in the language they use to express opinions and in what they tend to express opinions about. We find that some words are more or less likely to be positive or negative in context depending on the gender of the writer. For example, the word weakness is more likely to be used in a positive way by women (Chocolate is my weakness!) but in a negative way by men (Clearly they know our weakness. Argggg). Finally, we seek to answer a novel question: Can gender differences in the use of subjective language be exploited to improve sentiment analysis in social media? Our experiments show that taking gender into account does lead to improvements in sentiment analysis, but it is not as simple as including a binary feature to represent the gender of the writer. Instead, it is the combination of lexical features, together with set-count features representing gender-dependent sentiment terms, that leads to significant improvements, particularly for classifying message polarity.

2 Related Work

Numerous studies since the early 1970s have investigated gender-language differences in interaction, theme, and grammar, among other topics (Schiffman, 2002; Sunderland et al., 2002). More recent research has studied gender differences in telephone speech (Cieri et al., 2004; Godfrey et al., 1992) and emails (Styler, 2011). Mohammad and Yang (2011) analyzed gender differences in the expression of sentiment in love letters, hate mail, and suicide notes, as well as emotional word usage across genders in corporate email.

There has also been a considerable amount of work in subjectivity and sentiment analysis over the past decade, including, more recently, in microblogs (Barbosa and Feng, 2010; Bermingham and Smeaton, 2010; Pak and Paroubek, 2010; Bifet and Frank, 2010; Davidov et al., 2010; Li et al., 2010; Kouloumpis et al., 2011; Jiang et al., 2011; Agarwal et al., 2011a; Wang et al., 2011; Calais Guerra et al., 2011; Tan et al., 2011; Chen et al., 2012; Li et al., 2012). In spite of the surge of research in both sentiment and social media, only a limited amount of work focusing on gender identification has looked at differences in subjective language across genders within social media. Thelwall (2010) found that men and women use emoticons to differing degrees on MySpace, e.g., female users express positive emoticons more than male users. Other researchers included subjective patterns as features for gender classification of Twitter users (Rao et al., 2010). They found that the majority of emotion-bearing features, e.g., emoticons, repeated letters, exasperation, are used more by female than male users, which is consistent with results reported in other recent work (Garera and Yarowsky, 2009; Burger et al., 2011; Goswami et al., 2009; Argamon et al., 2007). Other related work is that of Otterbacher (2010), who studied stylistic differences between male and female reviewers writing product reviews, and Mukherjee and Liu (2010), who applied positive, negative and emotional connotation features as part of their work on gender classification of microblogs.

3 Data

For the experiments in this paper, we use three sets of data for each language: a large pool of data (800K tweets) labeled for gender but unlabeled for sentiment, plus 2K development data and 2K test data labeled for both sentiment and gender. We use the unlabeled data to bootstrap Twitter-specific lexicons and investigate gender differences in the use of subjective language. We use the development data for parameter tuning while bootstrapping, and the test data for sentiment classification.

For English, we download tweets from the corpus created by Burger et al. (2011). This dataset contains 2,958,103 tweets from 184K users, excluding retweets. Retweets are omitted because our focus is on the sentiment of the person tweeting; in retweets, the words originate from a different user. All users in this corpus have gender labels, which Burger et al. automatically extracted from self-reported gender on Facebook or MySpace profiles linked to by the Twitter users. English tweets are identified using a compression-based language identification (LID) tool (Bergsma et al., 2012). According to LID, there are 1,881,620 (63.6%) English tweets, from which we select a random, gender-balanced sample of 0.8M tweets. Burger’s corpus does not include Russian and Spanish data on the same scale as English. Therefore, for Russian and Spanish we construct a new Twitter corpus by downloading tweets from followers of region-specific news and media Twitter feeds. We use LID to identify the Russian and Spanish tweets, and remove retweets as before. In this data, gender is labeled automatically based on user first and last name morphology, with precision above 0.98 for all languages. Tweets from users for which the system cannot predict gender are omitted (33%).

Sentiment labels for tweets in the development and test sets are obtained using Amazon Mechanical Turk. For each tweet we collect annotations from five workers and use a majority vote to determine the final label for the tweet. Snow et al. (2008) show that for a similar task, labeling emotion and valence, on average four non-expert labelers are needed to achieve an expert level of annotation. Below are some examples of tweets labeled with sentiment:

• Pos: Happy Thanksgiving to all my friends and family! Thank you for your friendship!

• Neg: @Sunshine Yeah... she’s so not thrilled. BTW date night soon. I haz more boy drama ;P.

• Both: I am ahead on homework!!! I have studied, written essays, taken tests I am tired!

• Neut: @bagussoo u mean u have no bahasa class for a week?

Table 1 gives the distribution of tweets over sentiment labels (positive, negative, both, neutral) and gender labels for the development and test sets for English (EDEV, ETEST), Spanish (SDEV, STEST), and Russian (RDEV, RTEST).

4 Subjective Language and Gender

To study the intersection of subjective language and gender in social media, ideally we would have a large corpus labeled for both. Although our large corpus is labeled for gender, it is not labeled for sentiment. Only the 4K tweets that compose the development and test sets are labeled for both gender and sentiment. Obtaining sentiment labels for all tweets would be both impractical and expensive. Instead we use large multilingual sentiment lexicons developed specifically for Twitter, as described below. Using these lexicons we can begin to explore the relationship between subjective language and gender in the large pool of data labeled for gender but unlabeled for sentiment. We also look at the relationship between gender and the use of different hashtags and emoticons. These can be strong indicators of sentiment in social media, and in fact are sometimes used to create noisy training data for sentiment analysis in Twitter (Pak and Paroubek, 2010; Kouloumpis et al., 2011).

Data    Pos  Neg  Both   Neut     ♀      ♂
EDEV    617  357  202     824  1,176    824
ETEST   596  347  195     862  1,194    806
SDEV    358  354   86   1,202    768  1,232
STEST   317  387   93   1,203    700  1,300
RDEV    452  463  156     929  1,016    984
RTEST   488  380  149     983    910  1,090

Table 1: Gender and sentiment label distribution in the development and test sets across languages.

4.1 Sentiment Lexicons

Recent work by Banea et al. (2012) classifies methods for bootstrapping subjectivity lexicons into two types: corpus-based and dictionary-based. Corpus-based methods extract subjectivity lexicons from unlabeled data using different similarity metrics to measure the relatedness between words, e.g., Pointwise Mutual Information (PMI). Corpus-based methods have been used to bootstrap lexicons for ENGLISH (Turney, 2002) and other languages, including ROMANIAN (Banea et al., 2008) and JAPANESE (Kaji and Kitsuregawa, 2007). Dictionary-based methods rely on relations between words in existing lexical resources. For example, Rao and Ravichandran (2009) construct HINDI and FRENCH sentiment lexicons using relations in WordNet (Miller, 1995); Perez-Rosas et al. (2012) bootstrap a SPANISH lexicon using SentiWordNet (Baccianella et al., 2010) and OpinionFinder;[1] Clematide and Klenner (2010) and Abdul-Mageed et al. (2011) automatically expand and evaluate GERMAN and ARABIC lexicons, respectively.

We apply a corpus-based approach to bootstrap Twitter-specific subjectivity lexicons. To start, the new lexicon L_B(0) is seeded with terms from the original lexicon L_I. On each iteration, tweets in the unlabeled data are labeled using L_B(i−1). If a tweet contains one or more terms from L_B(i−1) it is marked subjective, otherwise objective. Tweet polarity is determined in a similar way, but takes negation into account. For every term not in L_B(i−1) with frequency ≥ θ_freq, the probability of that word appearing in a subjective sentence is calculated. The top θ_K terms with subjective probability ≥ θ_pr are then added to L_B(i). Bootstrapping continues until there are no more new terms meeting the criteria to add to the lexicon. Although more sophisticated bootstrapping methods exist, the goal of this work is not to propose or evaluate different bootstrapping methods; a straightforward approach is sufficient.

[1] www.cs.pitt.edu/mpqa/opinionfinder

        English          Spanish          Russian
        L_I^E   L_B^E    L_I^S   L_B^S    L_I^R   L_B^R
Pos      2.65   12.94     2.57   11.58     2.39    4.67
Neg      5.75   10.28     9.66   33.86     6.46    9.96
Total    8.40   23.22    12.23   45.44     8.85   14.63

Table 2: Original (L_I) and bootstrapped (L_B) lexicon term counts across languages (in thousands).

For English, the seed terms for bootstrapping are the strongly subjective terms in the MPQA lexicon (Wilson et al., 2005). For Spanish and Russian, the seed terms are obtained by translating the English seed terms using a bilingual dictionary, collecting subjectivity judgments from MTurk on the translations, filtering out translations that are not strongly subjective, and expanding the resulting word lists with plurals and inflectional forms.

To verify that bootstrapping provides a better resource than existing dictionary-expanded lexicons, we compare our bootstrapped lexicon for English to SentiWordNet (Baccianella et al., 2010) by performing simple, rule-based subjectivity classification on the development data EDEV. Specifically, we compare the original L_I^E and bootstrapped L_B^E lexicons against strongly subjective terms from SentiWordNet. To make a fair comparison, we automatically expand SentiWordNet with plurals and inflectional forms. If a tweet contains (a) ≥ 1 or (b) ≥ 2 subjective terms from the given lexicon,[2] it is considered subjective. Figure 1 shows that recall for the bootstrapped lexicon L_B^E is on average 0.45 higher than SWN with only a 0.05 drop in precision.

[2] A similar rule-based approach using terms from the MPQA lexicon is suggested by Riloff and Wiebe (2003).


Figure 1: Subjectivity classification with (a) ≥ 1 term per tweet and (b) ≥ 2 terms per tweet, comparing the original L_I^E terms, the bootstrapped L_B^E terms, and the expanded SentiWordNet (SWN) strongly subjective terms.

Figure 2: Gender-dependent vs. gender-independent subjectivity terms (+ and − indicate term polarity).

4.2 Lexical Evaluation

With our Twitter-specific sentiment lexicons, we can now investigate how the subjective use of these terms differs depending on gender for our three languages. Figure 2 illustrates what we expect to find. {F} and {M} are the sets of subjective terms used by females and males, respectively. We expect that some terms will be used by males but never by females, and vice versa. The vast majority, however, will be used by both genders. Within this set of shared terms, many words will show little difference in their subjective use when considering gender, but there will be some words for which gender has an influence. Of particular interest for our work are words for which the polarity of a term as used in context is gender-influenced, the extreme case being terms that flip their polarity depending on the gender of the user. Polarity may differ because the concept represented by the term tends to be viewed in a different light depending on gender. There are also words like weakness for which a more positive or more negative word sense tends to be used by men or women. In Figure 3 we show the distribution of gender-specific and gender-independent subjective terms from the lexicon L_B for all languages.

Figure 3: The distribution of gender-specific and gender-independent sentiment terms.

To identify gender-influenced terms in our lexicons, we start by randomly sampling 400K male and 400K female tweets for each language from the data. Next, for both genders we calculate the probability of term t_i appearing in a tweet with another subjective term (Eq. 1), and the probability of it appearing with a positive or negative term (Eq. 2-3) from the bootstrapped lexicon L_B.

$$p_{t_i}(subj \mid g) = \frac{c(t_i, P, g) + c(t_i, N, g)}{c(t_i, g)}, \qquad (1)$$

where g ∈ {F, M}, and P and N are the positive and negative sets of terms from the original lexicon L_I.

$$p_{t_i}(+ \mid g) = \frac{c(t_i, P, g)}{c(t_i, P, g) + c(t_i, N, g)} \qquad (2)$$

$$p_{t_i}(- \mid g) = \frac{c(t_i, N, g)}{c(t_i, P, g) + c(t_i, N, g)} \qquad (3)$$

Let Δp⁺ be a metric that measures polarity change across genders. For every subjective term t_i we want to maximize the difference:[3]

$$\Delta p^{+}_{t_i} = \left| p_{t_i}(+ \mid F) - p_{t_i}(+ \mid M) \right| \quad \text{s.t.} \quad \left| 1 - \frac{tf^{subj}_{t_i}(F)}{tf^{subj}_{t_i}(M)} \right| \le \lambda, \;\; tf^{subj}_{t_i}(M) \neq 0, \qquad (4)$$

where p_{t_i}(+|F) and p_{t_i}(+|M) are the probabilities that term t_i is positive for females and males, respectively; tf^subj_{t_i}(F) and tf^subj_{t_i}(M) are the corresponding term frequencies (if tf^subj_{t_i}(F) > tf^subj_{t_i}(M), the fraction is inverted); and λ is a threshold that controls the level of term-frequency similarity.[4] The terms whose polarity is most strongly gender-influenced are those with λ → 0 and Δp⁺ → 1. A small computational sketch follows.

Table 3 shows a sample of the most strongly gender-influenced terms from the original L_I and the bootstrapped L_B lexicons for all languages. A plus (+) means that the term tends to be used positively by women, and a minus (−) means that the term tends to be used positively by men.

[3] One can also maximize $\Delta p^{-}_{t_i} = |p_{t_i}(- \mid F) - p_{t_i}(- \mid M)|$.
[4] λ = 0 means term frequencies are identical for both genders; λ → 1 indicates increasing gender divergence.


English: original terms (L_I^E)         Δp⁺   λ     English: bootstrapped terms (L_B^E)   Δp⁺   λ
perfecting                  +           0.7   0.2   pleaseeeeee               +           0.7   0.0
weakened                    +           0.1   0.0   adorably                  +           0.6   0.4
saddened                    –           0.1   0.0   creatively                –           0.6   0.5
misbehaving                 –           0.4   0.0   dogfighting               –           0.7   0.5
glorifying                  –           0.7   0.5   overdressed               –           1.0   0.3

Spanish: original terms (L_I^S)                     Spanish: bootstrapped terms (L_B^S)
fiasco (fiasco)             +           0.7   0.3   cafeína (caffeine)        +           0.7   0.5
triunfar (succeed)          +           0.7   0.0   claro (clear)             +           0.7   0.3
inconsciente (unconscious)  –           0.6   0.2   cancio (dog)              –           0.3   0.3
horroriza (horrifies)       –           0.7   0.3   llevara (take)            –           0.8   0.3
groseramente (rudely)       –           0.7   0.3   recomendarlo (recommend)  –           1.0   0.0

Russian: original terms (L_I^R)                     Russian: bootstrapped terms (L_B^R)
магическая (magical)        +           0.7   0.3   мечтайте (dream!)         +           0.7   0.3
сенсационный (sensational)  +           0.7   0.3   танцуете (dancing)        +           0.7   0.3
обожаемый (adorable)        –           0.7   0.0   сложны (complicated)      –           1.0   0.0
искушение (temptation)      –           0.7   0.3   молоденькие (young)       –           1.0   0.0
заслуживать (deserve)       –           1.0   0.0   достич (achieve)          –           1.0   0.0

Table 3: Sample of subjective terms sorted by Δp⁺ to show lexical differences and polarity change across genders. The absolute value (cf. Eq. 4) is not applied, to show the direction of the polarity change.

English           Δp⁺   λ    Spanish                       Δp⁺   λ    Russian                   Δp⁺   λ
#parenting    +   0.7   0.0  #rafaelnarro (politician) +   1.0   0.0  #совет (advice)       +   1.0   0.0
#vegas        –   0.2   0.8  #amores (loves)           +   0.2   1.0  #ukrlaw               +   1.0   1.0
#horror       –   0.6   0.7  #britneyspears            +   0.1   0.3  #spartak (soccer team) –  0.7   0.9
#baseball     –   0.6   0.9  #latingrammy              –   0.5   0.1  #сны (dreams)         –   1.0   0.0
#wolframalpha –   0.7   1.0  #metallica (music band)   –   0.5   0.8  #iphones              –   1.0   1.0

Table 4: Hashtag examples with opposite polarity across genders for English, Spanish, and Russian.

For instance, in English we found that perfecting is used with negative polarity by male users but with positive polarity by female users; the term dogfighting has negative polarity for women but strong positive polarity for men.

4.3 Hashtags

People may also express positive or negative sentiment in their tweets using hashtags. From our balanced samples of 800K tweets for each language, we extracted 611, 879, and 71 unique hashtags for English, Spanish, and Russian, respectively. As we did for terms in the previous section, we evaluated the subjective use of the hashtags. Some of these clearly express sentiment (#horror), while others are topics that people are frequently opinionated about (#baseball, #latingrammy, #spartak).

Table 4 gives the hashtags, correlated with subjective language, that are most strongly gender-influenced. Analogously to the Δp⁺ values in Table 3, a plus (+) means the hashtag is more likely to be used positively by women, and a minus (−) means the hashtag is more likely to be used positively by men. For example, in English we found that male users tend to express positive sentiment in tweets mentioning #baseball, while women tend to be negative about this hashtag. The opposite is true for the hashtag #parenting.

4.4 Emoticons

In addition to lexical differences in subjective language and hashtag preference, we also investigated how emoticons are used differently by men and women on Twitter. Figure 4 shows the emoticons listed in Wikipedia.[5] The frequency of each emoticon is given to the right of each language chart, with the probability of use by a male user in that language given on the x-axis. The top 8 emoticons are the same for all languages and are sorted by English frequency.

Figure 4: Probability of gender given emoticon, p(Male|Emoticon), for English, Spanish and Russian (from left to right), with per-emoticon frequencies shown beside each chart; the emoticons shown include :) :( :-) :-& :-( :[ :-/ 8) %) () and :-o.

[5] List of emoticons from Wikipedia: http://en.wikipedia.org/wiki/List_of_emoticons
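A minimal sketch of how the Figure 4 estimates can be computed from (gender, text) pairs; the simple substring matching is an assumption, since the paper does not describe its emoticon matcher.

```python
from collections import Counter

def p_male_given_emoticon(tweets, emoticons):
    """Estimate p(Male | emoticon) from (gender, text) pairs,
    where gender is "M" or "F". Substring matching is assumed."""
    male, total = Counter(), Counter()
    for gender, text in tweets:
        for e in emoticons:
            if e in text:
                total[e] += 1
                male[e] += gender == "M"
    return {e: male[e] / total[e] for e in total}

probs = p_male_given_emoticon([("M", "great game :)"), ("F", "love it :)")],
                              [":)", ":("])
print(probs)  # {':)': 0.5}
```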


We found that emoticons in the English data are used more overall by female users, which is consistent with previous findings. In addition, we found that some emoticons like :-) (smiley face) and :-o (surprised) are used equally by both genders, at least on Twitter. When comparing English emoticon usage to the other languages, there are some similarities, but also some clear differences. In the Spanish data, several emoticons are more likely to be used by male than by female users, e.g., :-o (surprised) and :-& (tongue-tied), and the difference in probability of use by males and females is greater for these emoticons than for the same emoticons in English. Interestingly, in the Russian Twitter data emoticons tend to be used more by, or equally by, male users rather than female users.

5 Classification Experiments

The previous section showed that there are gender differences in the use of subjective language, hashtags, and emoticons in tweets. Can we exploit these differences to improve subjectivity and polarity[6] classification? To answer this question, we conduct experiments using gender-independent (GInd) and gender-dependent (GDep) features. Comparing the classification results for these features allows us to evaluate the influence of gender on sentiment classification.

We experiment with two classification approaches. The first is a simple, rule-based approach which uses only subjective terms from the lexicons. Our goal with this experiment is to see whether the gender differences in subjective language create enough of a signal to influence sentiment classification. Next we use machine learning for sentiment classification and experiment with a variety of classifiers from Weka (Hall et al., 2009). For these experiments we use lexical features as well as lexicon set-count features.[7][8]

We report results for English, Spanish, and Russian test data labeled with subjectivity and polarity as described in Section 3. Results for the machine learning experiments are averages from 10-fold cross-validation over the test data.

[6] For polarity classification we distinguish between positive and negative instances, which is the approach typically reported in the literature for recognizing polarity (Velikovich et al., 2010; Yessenalina and Cardie, 2011; Agarwal et al., 2011b; Taboada et al., 2011).
[7] A set-count feature is a count of the number of instances from a set of terms that appear in a tweet.
[8] We also experimented with repeated punctuation (!!, ??) and letters (nooo, reealy), which are often used in sentiment classification in social media. However, we found these features to be noisy, and adding them decreased performance.

5.1 Sentiment Classification

For the rule-based GInd^RB_subj classifier, tweets are labeled as subjective or neutral as follows:

$$GInd^{RB}_{subj} = \begin{cases} 1 & \text{if } \sum_{i=0}^{k} s_i \cdot f_i \ge 0.5, \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

where $\sum_{i=0}^{k} s_i \cdot f_i$ is the sum of weighted set features, e.g., terms from L_I only, emoticons E, or different part-of-speech tags from L_B, weighted using the subjectivity score $s_i = p_{f_i}(subj \mid M) + p_{f_i}(subj \mid F) = p_{f_i}(subj)$ as defined in Eq. 1. Similarly, the rule-based GInd^RB_pol model for classifying tweets as positive or negative is:

$$GInd^{RB}_{pol} = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} p^{+}_{i} \cdot f^{+}_{i} \ge \sum_{j=0}^{m} p^{-}_{j} \cdot f^{-}_{j}, \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

where $\vec{f}^{+}$ and $\vec{f}^{-}$ are feature sets that include only positive and negative features from L_I or L_B, and $p^{+}_{i}$ and $p^{-}_{i}$ are positive and negative polarity scores estimated using Eq. 2-3: $p^{+}_{i} = p_{f_i}(+ \mid M) + p_{f_i}(+ \mid F) = p_{f_i}(+)$ and $p^{-}_{i} = p_{f_i}(- \mid M) + p_{f_i}(- \mid F) = p_{f_i}(-)$.

The gender-dependent, rule-based classifiers are defined in a similar way. Specifically, $\vec{f}$ is replaced by $\vec{f}^{M}$ and $\vec{f}^{F}$ in Eq. 5, and $\vec{f}^{-}$, $\vec{f}^{+}$ are replaced by $\vec{f}^{M-}$, $\vec{f}^{F-}$ and $\vec{f}^{M+}$, $\vec{f}^{F+}$ respectively in Eq. 6. We learn the subjectivity $\vec{s}$ and polarity $\vec{p}$ score vectors using Eq. 1-3. The difference between the GInd and GDep models is that the GInd scores $p_{f_i}(subj)$, $p_{f_i}(+)$ and $p_{f_i}(-)$ are not conditioned on gender.

For gender-independent classification of subjectivity and polarity using machine learning, we build feature vectors using lexical features V, represented as term frequencies, together with set-count features from the L_I, L_B and E lexicons:

GInd_subj = [L_I, L_B, E, V];
GInd_pol = [L_I^+, L_B^+, E^+, L_I^-, L_B^-, E^-, V].

For gender-dependent classification, we try two approaches. First, we create set-count features for male and female terms from L_I, L_B and E, and use features from both genders for classifying tweets from all users (implicit encoding):

GDep^AND_subj = [L_I^M, L_B^M, E^M, L_I^F, L_B^F, E^F, V];
GDep^AND_pol = [L_I^M+, L_B^M+, E^M+, L_I^F+, L_B^F+, E^F+, L_I^M-, ..., L_B^F-, E^F-, V].


Figure 5: Rule-based (RB) and machine learning (ML) sentiment classification results for English: (a) rule-based subjectivity and (b) rule-based polarity precision-recall curves, starting from L_I and incrementally adding E, A, R, V, N; (c) ML subjectivity and polarity F-measure (GInd vs. GDep^AND) for the BLR, NB, AB, RF, J48 and SVM classifiers. L_I is the original lexicon, E denotes emoticons, and A, R, V, N are adjectives, adverbs, verbs and nouns from L_B.

We also try an approach that uses V plus only female set-count features when classifying tweets from females and, for male tweets, V plus only male set-count features (explicit encoding). These experiments are identified in the results as GDep^OR_subj and GDep^OR_pol, for subjectivity and polarity classification respectively.

5.2 Results

Figures 5a and 5b show the performance improvements for subjectivity and polarity classification under the rule-based approach when taking gender into account. The left figure shows precision-recall curves for subjective vs. neutral classification, and the middle figure shows precision-recall curves for positive vs. negative classification. We measure performance starting with features from L_I, and then incrementally add emoticon features E and features from L_B one part of speech at a time.[9] This experiment shows that there is a clear improvement for the models parameterized with gender, at least for the simple, rule-based model.

For the machine learning approach we experiment with a variety of learners. Results for English are reported in Figure 5c. For subjectivity classification, Support Vector Machines (SVM), Naive Bayes (NB) and Bayesian Logistic Regression (BLR) achieve the best results, with improvements in F-measure ranging from 0.5-5%. The polarity classifiers overall achieve much higher scores, with improvements for gender-dependent features ranging from 1-2%. BLR with a Gaussian prior is the top scorer for polarity classification with an F-measure of 82%. Only three classifiers, J48, AdaBoostM1 (AB) and Random Forest (RF), do not always show significant improvements for the gender-dependent features.[10] For the majority of classifiers, gender-dependent features outperform gender-independent features for both tasks, demonstrating the robustness of the gender-dependent features for sentiment classification.

[9] POS tags are from the Twitter POS tagger (Gimpel et al., 2011).

In Table 5 we report the best results for subjectivity and polarity classification on the English, Spanish and Russian data using:

- the BLR classifier (Genkin et al., 2007) for the GInd^AND and GDep^AND models;
- the SVM classifier (the wrapped LibSVM implementation of EL-Manzalawy and Honavar (2005)), formulated as nu-SVC with a radial basis function kernel, for the GInd^OR and GDep^OR models.

Each ∆R(%) row shows the relative percent im-provement of the GDep model compared to theGInd model in the preceeding two rows. Our re-sults show that differences in subjective languageacross genders can be exploited to improve senti-ment analysis, not only for English but for mul-tiple languages. For Spanish and Russian resultsare lower for subjectivity classification, we sus-pect, because lexical features V are already in-flected for gender and set-count features are down-weighted by the classifier. For polarity classifica-tion, on the other hand, gender-dependent featuresprovide consistent, significant improvements (1.5-2.5%) across all languages.

As a reality check, Table 5 also reports accuracies (in the AR columns) for experiments that use random permutations of male and female subjective terms, which are then encoded as gender-dependent set-count features as before. We found that all gender-dependent models, GDepAND and GDepOR, outperformed their random equivalents for both subjectivity and polarity classification.
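A minimal sketch of this permutation control, with assumed names: the gendered subjective term sets are pooled, shuffled, and re-split into pseudo-gender sets of the original sizes before being re-encoded as set-count features.

    # Illustrative sketch of the random-permutation control behind the
    # AR columns: real gender assignments of subjective terms are destroyed
    # while set sizes are preserved.

    import random

    def permuted_gender_sets(female_terms, male_terms, seed=0):
        pool = list(female_terms) + list(male_terms)
        random.Random(seed).shuffle(pool)
        pseudo_female = set(pool[:len(female_terms)])
        pseudo_male = set(pool[len(female_terms):])
        return pseudo_female, pseudo_male

If gender-specific sentiment information were not real, models built from these pseudo-gender sets should perform comparably; the AR columns show that they do not.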


English    subj vs. neutral, p(subj)=0.57           pos vs. neg, p(pos)=0.63
           P      R      F      A      AR           P      R      F      A      AR
GIndBLR    0.62   0.58   0.60   0.66   –            0.78   0.83   0.80   0.71   –
GDepAND    0.64   0.62   0.63   0.68   0.66         0.80   0.83   0.82   0.73   0.70
∆RAND,%    +3.23  +6.90  +5.00  +3.03  3.03↓        +2.56  0.00   +2.50  +2.82  4.29↓
GIndSVM    0.66   0.70   0.68   0.72   –            0.79   0.86   0.82   0.77   –
GDepOR     0.66   0.71   0.68   0.72   0.70         0.80   0.87   0.83   0.78   0.76
∆ROR,%     –0.45  +0.71  0.00   –0.14  2.85↓        +0.38  +0.23  +0.24  +0.41  2.63↓

Spanish    subj vs. neutral, p(subj)=0.40           pos vs. neg, p(pos)=0.45
           P      R      F      A      AR           P      R      F      A      AR
GIndBLR    0.67   0.71   0.68   0.61   –            0.71   0.63   0.67   0.71   –
GDepAND    0.67   0.72   0.69   0.62   0.61         0.72   0.65   0.68   0.71   0.68
∆RAND,%    0.00   +1.40  +0.58  +0.73  1.64↓        +2.53  +3.17  +1.49  0.00   4.41↓
GIndSVM    0.68   0.79   0.73   0.65   –            0.66   0.65   0.65   0.69   –
GDepOR     0.68   0.79   0.73   0.66   0.65         0.68   0.67   0.67   0.71   0.68
∆ROR,%     +0.35  +0.21  +0.26  +0.54  1.54↓        +2.43  +2.44  +2.51  +2.08  4.41↓

Russian    subj vs. neutral, p(subj)=0.51           pos vs. neg, p(pos)=0.58
           P      R      F      A      AR           P      R      F      A      AR
GIndBLR    0.66   0.68   0.67   0.67   –            0.66   0.72   0.69   0.62   –
GDepAND    0.66   0.69   0.68   0.67   0.66         0.68   0.73   0.70   0.64   0.63
∆RAND,%    0.00   +1.47  +0.75  0.00   1.51↓        +3.03  +1.39  +1.45  +3.23  1.58↓
GIndSVM    0.67   0.75   0.71   0.70   –            0.64   0.73   0.68   0.62   –
GDepOR     0.67   0.76   0.71   0.70   0.69         0.65   0.74   0.69   0.63   0.62
∆ROR,%     –0.30  +1.46  +0.56  +0.14  1.44↓        +0.93  +1.92  +1.46  +1.49  1.61↓

Table 5: ML sentiment classification results for gender-dependent and gender-independent models (P = precision, R = recall, F = F-measure, A = accuracy; AR = accuracy with randomly permuted gender term sets, where ↓ marks a drop relative to the true-gender model).

Figure 6: F-score comparison for ML subjectivity (solid line) and polarity (dashed line) classification using different feature combinations for English, Spanish and Russian (from left to right, respectively).


One question that remains is whether just encoding gender as a binary feature would be sufficient to improve sentiment classification. Figure 6 shows sentiment classification results using different feature combinations: (a) lexical features only V, (b) lexical + gender binary V+GBin, (c) gender-independent GInd, (d) gender binary and gender-independent combined GBin+GInd, (e) gender-dependent implicitly encoded GDepAND, and (f) gender binary and gender-dependent combined GBin+GDepAND (see the sketch below).
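The sketch below shows, with assumed names and dictionary-valued features, how these six combinations can be assembled for a single tweet; GInd and GDepAND are treated as opaque precomputed feature dictionaries as defined earlier.

    # Illustrative sketch: the six feature combinations compared in Figure 6.

    def feature_combinations(V, GInd, GDepAND, gender):
        # V: lexical features; GInd / GDepAND: precomputed set-count
        # features; gender: the author's gender, used as a binary feature.
        GBin = {"author_is_female": int(gender == "female")}
        return {
            "V":            dict(V),
            "V+GBin":       {**V, **GBin},
            "GInd":         dict(GInd),
            "GBin+GInd":    {**GInd, **GBin},
            "GDepAND":      dict(GDepAND),
            "GBin+GDepAND": {**GDepAND, **GBin},
        }

Training then proceeds identically for every combination, so any difference in F-score is attributable to the feature set alone.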

We train models using the BLR classifier and compare F-score, precision and recall values for every feature combination. We observe that including a gender binary feature, by itself or in combination with GDepAND, does not yield a significant improvement compared to the results obtained by GDepAND models where gender is implicitly encoded (except for Spanish subjectivity classification). We conclude that it is the combination of GDepAND and lexical features together that is needed for significant improvements.

6 Conclusions

This paper has demonstrated, in a qualitative and empirical study, that there are substantial and interesting differences in sentiment term usage between genders in social media, including cases where sentiment polarity flips completely, across a range of different term types (such as hashtag and emoticon usage) in English, Spanish and Russian.

Furthermore, it demonstrates for the first time that modeling gender-language differences and incorporating author gender as a model component can significantly improve sentiment classification in social media, for both subjectivity and polarity classification, in all investigated languages.


References

Muhammad Abdul-Mageed, Mona T. Diab, and Mohammed Korayem. 2011. Subjectivity and sentiment analysis of Modern Standard Arabic. In Proceedings of ACL/HLT.

A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau. 2011a. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media.

Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. 2011b. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Languages in Social Media.

Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler. 2007. Mining the blogosphere: Age, gender and the varieties of self-expression.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of LREC.

Carmen Banea, Rada Mihalcea, and Janyce Wiebe. 2008. A bootstrapping method for building subjectivity lexicons for languages with scarce resources. In Proceedings of LREC.

Luciano Barbosa and Junlan Feng. 2010. Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of COLING.

Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. 2012. Language identification for creating language-specific Twitter collections. In Proceedings of the 2nd Workshop on Language in Social Media.

Adam Bermingham and Alan F. Smeaton. 2010. Classifying sentiment in microblogs: Is brevity an advantage? In Proceedings of CIKM-2010.

Albert Bifet and Eibe Frank. 2010. Sentiment knowledge discovery in Twitter streaming data. In Proceedings of Discovery Science.

Bonka Boneva, Robert Kraut, and David Frohlich. 2001. Using email for personal relationships: The difference gender makes. American Behavioral Scientist, 45(3).

John D. Burger, John C. Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on Twitter. In Proceedings of EMNLP.

Pedro Henrique Calais Guerra, Adriano Veloso, Wagner Meira Jr., and Virgílio Almeida. 2011. From bias to opinion: A transfer-learning approach to real-time sentiment analysis. In Proceedings of KDD-2011.

Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, and Amit P. Sheth. 2012. Extracting diverse sentiment expressions with target-dependent polarity from Twitter. In Proceedings of ICWSM-2012.

Christopher Cieri, David Miller, and Kevin Walker. 2004. The Fisher corpus: A resource for the next generations of speech-to-text. In Proceedings of LREC.

Simon Clematide and Manfred Klenner. 2010. Evaluation and extension of a polarity lexicon for German. In Proceedings of the 1st Workshop on Computational Approaches to Subjectivity and Sentiment Analysis.

Lucy Cockcroft. 2009. Why men write short email and women write emotional messages.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Enhanced sentiment learning using Twitter hashtags and smileys. In Proceedings of COLING.

Thomas G. Dietterich. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7).

Yasser EL-Manzalawy and Vasant Honavar. 2005. WLSVM: Integrating LibSVM into Weka Environment.

Nikesh Garera and David Yarowsky. 2009. Modeling latent biographic attributes in conversational genres. In Proceedings of ACL-IJCNLP.

David Gefen and Catherine M. Ridings. 2005. If you spoke as she does, sir, instead of the way you do: A sociolinguistics perspective of gender differences in virtual communities. SIGMIS Database, 36(2).

Alexander Genkin, David D. Lewis, and David Madigan. 2007. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49.

Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of ACL.

J.J. Godfrey, E.C. Holliman, and J. McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In Proceedings of ICASSP.

Sumit Goswami, Sudeshna Sarkar, and Mayur Rustagi. 2009. Stylometric analysis of bloggers’ age and gender. In Proceedings of the AAAI Conference on Weblogs and Social Media.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. SIGKDD Explorations.


Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent Twitter sentiment classification. In Proceedings of ACL.

Nobuhiro Kaji and Masaru Kitsuregawa. 2007. Building lexicon for sentiment analysis from massive collection of HTML documents. In Proceedings of EMNLP-CoNLL.

Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. 2011. Twitter sentiment analysis: The good the bad and the OMG! In Proceedings of ICWSM.

Guangxia Li, Steven C.H. Hoi, Kuiyu Chang, and Ramesh Jain. 2010. Micro-blogging sentiment detection by collaborative online learning. In Proceedings of ICDM-2010.

Hao Li, Yu Chen, Heng Ji, Smaranda Muresan, and Dequan Zheng. 2012. Combining social cognitive theories with linguistic features for multi-genre sentiment analysis. In Proceedings of the Pacific Asia Conference on Language, Information and Computation (PACLIC-2012).

Rada Mihalcea, Carmen Banea, and Janyce Wiebe. 2012. Multilingual subjectivity and sentiment analysis. In Proceedings of ACL.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM.

Saif Mohammad and Tony Yang. 2011. Tracking sentiment in mail: How genders differ on emotional axes. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis.

Arjun Mukherjee and Bing Liu. 2010. Improving gender classification of blog authors. In Proceedings of EMNLP.

Jahna Otterbacher. 2010. Inferring gender of movie reviewers: Exploiting writing style, content and metadata. In Proceedings of CIKM, pages 369–378.

Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of LREC.

Veronica Perez-Rosas, Carmen Banea, and Rada Mihalcea. 2012. Learning sentiment lexicons in Spanish. In Proceedings of LREC.

Delip Rao and Deepak Ravichandran. 2009. Semi-supervised polarity lexicon induction. In Proceedings of EACL.

Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. 2010. Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents.

Ellen Riloff and Janyce Wiebe. 2003. Learning extraction patterns for subjective expressions. In Proceedings of EMNLP.

Harold Schiffman. 2002. Bibliography of gender and language. http://ccat.sas.upenn.edu/~haroldfs/popcult/bibliogs/gender/genbib.htm.

Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP.

Will Styler. 2011. The EnronSent Corpus. Technical report.

Jane Sunderland, Ren-Feng Duann, and Paul Baker. 2002. Gender and genre bibliography. www.ling.lancs.ac.uk/pubs/clsl/clsl122.pdf.

Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2).

Chenhao Tan, Lillian Lee, Jie Tang, Long Jiang, Ming Zhou, and Ping Li. 2011. User-level sentiment analysis incorporating social networks. In Proceedings of KDD-2011.

Mike Thelwall, David Wilkinson, and Sukhvinder Uppal. 2010. Data mining emotion in social network communication: Gender differences in MySpace. Journal of the American Society for Information Science and Technology, 61(1).

Peter D. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of ACL.

Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan, and Ryan McDonald. 2010. The viability of web-derived polarity lexicons. In Proceedings of NAACL.

Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, and Ming Zhang. 2011. Topic sentiment analysis in Twitter: A graph-based hashtag sentiment classification approach. In Proceedings of CIKM-2011.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of EMNLP.

Ainur Yessenalina and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Proceedings of EMNLP.