NAACL Tutorial Social Media Predictive Analytics Svitlana Volkova 1 , Benjamin Van Durme 1,2 , David Yarowsky 1 and Yoram Bachrach 3 1 Center for Language and Speech Processing, Johns Hopkins University, 2 Human Language Technology Center of Excellence, 3 Microsoft Research Cambridge


Page 2: NAACL TutorialSocial Media Predictive Analytics

Tutorial Schedule
Part I: Theoretical Session (2:00 – 4:30pm)
• Batch Prediction
• Online Inference
• Coffee Break (3:30 – 4:00pm)
• Dynamic Learning and Prediction
Part II: Practice Session (4:30 – 5:30pm)
• Code and Data

Page 3: NAACL TutorialSocial Media Predictive Analytics

Tutorial Materials
• Slides:
– http://www.cs.jhu.edu/~svitlana/slides.pptx
• Code and Data:
– https://bitbucket.org/svolkova/queryingtwitter
– https://bitbucket.org/svolkova/attribute
– https://bitbucket.org/svolkova/psycho-demographics
• References:
– http://www.cs.jhu.edu/~svitlana/references.pdf

Page 4: NAACL TutorialSocial Media Predictive Analytics

Social Media Obsession

• Personalized
• Diverse
• Timely
• Large volumes: billions of messages, millions of users
• Multilingual

Page 5: NAACL TutorialSocial Media Predictive Analytics

What do they think and feel?
Where do they go?
What are their demographics and personality?
What do they like?
What do they buy?

Page 6: NAACL TutorialSocial Media Predictive Analytics

First: a comment on privacy and ethics…

Page 7: NAACL TutorialSocial Media Predictive Analytics

Why is language in social media so interesting?

• Very short – 140 chars
• Lexically divergent
• Abbreviated
• Multilingual

Page 8: NAACL TutorialSocial Media Predictive Analytics

Why is language in social media so challenging?

• Data drift

• User activeness => generalization

• Topical sparsity => relationship, politics

• Dynamic streaming nature

Page 9: NAACL TutorialSocial Media Predictive Analytics

DEMO

Page 10: NAACL TutorialSocial Media Predictive Analytics

Predictive Analytics Services

• Social Network Prediction – https://apps.facebook.com/snpredictionapp/

• Twitter Psycho-Demographic Profile and Affect Inference – http://twitterpredictor.cloudapp.net (pswd: twitpredMSR2014)

• My personality Project – http://mypersonality.org/wiki/doku.php

• You Are What You Like – http://youarewhatyoulike.com/

• Psycho-demographic trait predictions – http://applymagicsauce.com/

• IBM Personality – https://watson-pi-demo.mybluemix.net

• World Well Being Project – http://wwbp.org

Page 11: NAACL TutorialSocial Media Predictive Analytics

Applications: Retail
Personalized marketing
• Detecting opinions and emotions users express about products or services within targeted populations
Personalized recommendations and search
• Making recommendations based on user emotions, demographics and personality

Page 12: NAACL TutorialSocial Media Predictive Analytics

Applications: Advertising
Online targeted advertising
• Targeting ads based on predicted user demographics
• Matching the emotional tone the user expects
Deliver ads fast vs. deliver ads to a true crowd

Page 13: NAACL TutorialSocial Media Predictive Analytics

Applications: Polling
Real-time live polling
• Mining political opinions
• Voting predictions within certain demographics
Large-scale passive polling
• Passive polling regarding products and services

Page 14: NAACL TutorialSocial Media Predictive Analytics

Applications: Health
Large-scale real-time healthcare analytics
• Identifying smokers, drug addicts, healthy eaters, people into sports (Paul and Dredze 2011)
• Monitoring flu trends, food poisonings, chronic illnesses (Culotta et al. 2015)

Page 15: NAACL TutorialSocial Media Predictive Analytics

Applications: HR
Recruitment and human resource management
• Estimating emotional stability and personality of potential and current employees
• Measuring the overall well-being of employees, e.g., life satisfaction, happiness (Schwartz et al. 2013; Volkova et al., 2015)
• Monitoring depression and stress levels (Coppersmith et al. 2014)

Page 16: NAACL TutorialSocial Media Predictive Analytics

User Attribute Prediction Task

Political Preference: Rao et al., 2010; Conover et al., 2011; Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013; Volkova et al., 2014

Gender: Garera and Yarowsky, 2009; Rao et al., 2010; Burger et al., 2011; Van Durme, 2012; Zamal et al., 2012; Bergsma and Van Durme, 2013

Age: Rao et al., 2010; Zamal et al., 2012; Cohen and Ruths, 2013; Nguyen et al., 2011, 2013; Sap et al., 2014

AAAI 2015 Demo (joint work with Microsoft Research): Income, Education Level, Ethnicity, Life Satisfaction, Optimism, Personality, Showing Off, Self-Promoting

[Figure label: Communications]

Page 17: NAACL TutorialSocial Media Predictive Analytics

Tweets Revealing User Attributes


Page 18: NAACL TutorialSocial Media Predictive Analytics

Supervised Models
Classification: binary (SVM) – gender, age, political, ethnicity
• Goswami et al., 2009; Rao et al. 2010; Burger et al. 2011; Mislove et al. 2012; Nguyen et al. 2011; Nguyen et al. 2013;
• Pennacchiotti and Popescu 2011; Conover et al. 2011; Filippova et al. 2012; Van Durme 2012; Bergsma et al. 2012, 2013; Bergsma and Van Durme 2013;
• Zamal et al. 2012; Ciot et al. 2013; Cohen and Ruths 2013;
• Schwartz et al. 2013; Sap et al., 2014; Kern et al., 2014; Golbeck et al. 2011; Kosinski et al. 2013;
• Volkova et al. 2014; Volkova et al. 2015.

Unsupervised and Generative Models
• name morphology for gender & ethnicity prediction - Rao et al. 2011;
• large-scale clustering - Bergsma et al. 2013; Culotta et al. 2015;
• demographic language variations - Eisenstein et al. 2010; O’Connor et al. 2010; Eisenstein et al. 2014.

*Rely on more than lexical features e.g., network, streaming

Page 19: NAACL TutorialSocial Media Predictive Analytics

Existing Approaches: ~1K Tweets* (tweets as a document)

Does an average Twitter user produce thousands of tweets?

*Rao et al., 2010; Conover et al., 2011; Pennacchiotti and Popescu, 2011a; Burger et al., 2011; Zamal et al., 2012; Nguyen et al., 2013

Page 20: NAACL TutorialSocial Media Predictive Analytics

How Active are Twitter Users?

Page 21: NAACL TutorialSocial Media Predictive Analytics

Attributed Social Network

User Local Neighborhoods a.k.a. Social Circles

Page 22: NAACL TutorialSocial Media Predictive Analytics

Approaches

Static (Batch) Prediction
• Offline training
• Offline predictions
+ Neighbor content

Streaming (Online) Inference
• Offline training
+ Online predictions over time
• Exploring 6 types of neighborhoods

Dynamic (Iterative) Learning and Prediction
• Online predictions
• Relying on neighbors
+ Iterative re-training
+ Active learning
+ Rationale annotation

Challenges: topical sparsity, data drift, streaming nature, model generalization

Page 23: NAACL TutorialSocial Media Predictive Analytics

Part I Outline
I. Batch Prediction
    i. How to collect and annotate data?
    ii. What models and features to use?
    iii. Which neighbors are the most predictive?
II. Online Inference
    i. How to predict from a stream?
III. Dynamic (Iterative) Learning and Prediction
    i. How to learn and predict on the fly?


Page 25: NAACL TutorialSocial Media Predictive Analytics

How to get data? Twitter API

• Twitter API: https://dev.twitter.com/overview/api
• Twitter API Status: https://dev.twitter.com/overview/status
• Twitter API Rate Limits: https://dev.twitter.com/rest/public/rate-limits

Page 26: NAACL TutorialSocial Media Predictive Analytics

Querying Twitter API
• Twitter Developer Account => access key and token
  https://dev.twitter.com/oauth/overview/application-owner-access-tokens
  twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

I. Access the 1% sample of the Twitter Firehose and sample from it
II. Query Twitter API to get:
– user timelines (up to 3200 tweets) from userIDs
– tweet json objects from tweetIDs
– lists of friendIDs (5000 per query) from userIDs
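The timeline query in step II can be sketched as a paging loop. This is a minimal sketch, assuming a client object whose `get_user_timeline(user_id=..., count=..., max_id=...)` method returns a list of tweet dicts (the Twython client created above exposes a method of that name). The names `fetch_timeline` and `FakeClient` are hypothetical and introduced only for illustration, and the page size of 200 is an assumption. The stub stands in for the real API so the loop can be run offline.

```python
def fetch_timeline(client, user_id, limit=3200, page_size=200):
    """Page backwards through one user's timeline, newest tweets first."""
    tweets, max_id = [], None
    while len(tweets) < limit:
        params = {"user_id": user_id,
                  "count": min(page_size, limit - len(tweets))}
        if max_id is not None:
            params["max_id"] = max_id  # only tweets with id <= max_id
        page = client.get_user_timeline(**params)
        if not page:
            break  # no older tweets left
        tweets.extend(page)
        max_id = min(t["id"] for t in page) - 1
    return tweets


class FakeClient:
    """Offline stand-in for the API client (hypothetical, for testing only)."""

    def __init__(self, n_tweets):
        self.ids = list(range(n_tweets, 0, -1))  # newest id first

    def get_user_timeline(self, user_id, count, max_id=None):
        ids = [i for i in self.ids if max_id is None or i <= max_id]
        return [{"id": i} for i in ids[:count]]


timeline = fetch_timeline(FakeClient(450), user_id=42)
print(len(timeline))
```

With a real client, each call counts against the rate limits linked earlier, so the loop would normally be wrapped in whatever back-off policy the collection script uses.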

Page 27: NAACL TutorialSocial Media Predictive Analytics

JSON Objects

MongoDB: http://docs.mongodb.org/manual/tutorial/getting-started/

Add predictions: sentiment, attributes, emotions

Page 28: NAACL TutorialSocial Media Predictive Analytics

How to get labeled data?
• Supervised classification in a new domain:
– Labeled data ≈ ground truth
– Costly and time consuming to get!
• Ways to get ≈ “ground truth” annotations:
– Fun psychological tests (voluntary): myPersonality project
– Profile info: Facebook e.g., relationship, gender, age, but sparse for Twitter
– Self reports: “I am a republican…” (Volkova et al. 2013), “Happy ##th/st/nd/rd birthday to me” (Zamal et al. 2012), “I have been diagnosed with …” (Coppersmith et al. 2014), “I am a writer …” (Beller et al., 2014)
– Distant supervision: following Obama vs. Romney (Zamal et al. 2012), emotion hashtags (Mohammad et al., 2014), user name (Burger et al., 2011)
– Crowdsourcing: subjective perceived annotations (Volkova et al. 2015), rationales (Bergsma et al., 2013; Volkova et al., 2014; 2015)

[Figure: attribute model ΦA(u), user sets UL and UP; annotation biases]
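The self-report route can be illustrated with a pattern match over tweet text. This is a minimal sketch, not the procedure of the cited papers: the regular expression and the function name `self_reported_age` are assumptions written for the “Happy ##th/st/nd/rd birthday to me” example above.

```python
import re

# Pattern for the "Happy ##th/st/nd/rd birthday to me" self-report.
BIRTHDAY = re.compile(
    r"\bhappy\s+(\d{1,2})(?:st|nd|rd|th)\s+birthday\s+to\s+me\b",
    re.IGNORECASE)


def self_reported_age(tweet):
    """Return the age stated in a birthday self-report, or None."""
    match = BIRTHDAY.search(tweet)
    return int(match.group(1)) if match else None


print(self_reported_age("Happy 21st birthday to me!!!"))   # 21
print(self_reported_age("Happy birthday to my sister"))    # None
```

Labels obtained this way inherit the annotation biases noted on the slide: only users who announce their age are ever labeled.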

Page 29: NAACL TutorialSocial Media Predictive Analytics

Twitter Social Graph
I. Candidate-Centric (distant supervision): 1,031 users
II. Geo-Centric (self-reports): 270 users
III. Politically Active (distant supervision)*: 371 users (Dem; Rep)
IV. Age (self-reports)*: 387 users (18 – 23; 23 – 25)
V. Gender (name)*: 384 users (Male; Female)
Balanced datasets

*Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013

Code, data and trained models for gender, age, political preference prediction: http://www.cs.jhu.edu/~svitlana/

10 - 20 neighbors of 6 types per user
What types of neighbors lead to the best attribute predictions?


Page 31: NAACL TutorialSocial Media Predictive Analytics

Classification Model
• Logistic regression = max entropy = log-linear models
– Map discrete inputs w to binary output y
• Other options: SVM, NB

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Labeled users (training); one column per vocabulary item (vocabulary size):
          hair  eat  cool  work  …  xbox
Female    1     1    0     0     …  0
Male      0     1    0     1     …  1
Male      0     0    1     1     …  1

Test user (feature vector):
          hair  eat  cool  work  …  xbox
?         0     1    0     0     …  1
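The table above can be turned into a small worked example. The slide points to scikit-learn's LogisticRegression. The sketch below instead uses a dependency-free batch gradient descent so that it runs anywhere, and it keeps only the five named vocabulary columns. The function names, learning rate and epoch count are assumptions chosen for illustration.

```python
import math

VOCAB = ["hair", "eat", "cool", "work", "xbox"]
# Binary bag-of-words rows from the slide; label 1 = Female, 0 = Male.
X_TRAIN = [[1, 1, 0, 0, 0],
           [0, 1, 0, 1, 1],
           [0, 0, 1, 1, 1]]
Y_TRAIN = [1, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """P(Female | x) under the log-linear model."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(X, y, lr=0.5, epochs=200):
    """Batch gradient descent on the logistic (cross-entropy) loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, label in zip(X, y):
            err = predict_proba(weights, bias, x) - label
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        bias -= lr * grad_b
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return weights, bias


weights, bias = train(X_TRAIN, Y_TRAIN)
test_user = [0, 1, 0, 0, 1]  # the "?" row from the slide
p_female = predict_proba(weights, bias, test_user)
print("P(Female | test user) = %.2f" % p_female)
```

Three training users are far too few for a meaningful estimate. The point is only the shape of the computation: one weight per vocabulary item, and a sigmoid over the weighted sum.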

Page 32: NAACL TutorialSocial Media Predictive Analytics

Features (I)
• Lexical:
– normalized counts/binary ngrams (Goswami et al. 2010; Rao et al. 2010; Pennacchiotti and Popescu 2011; Nguyen et al. 2013; Ciot et al. 2013; Van Durme 2012; Kern et al. 2014; Volkova et al. 2014; Volkova and Van Durme 2015)
– class-based highly predictive (Bergsma and Van Durme 2013), rationales (Volkova and Yarowsky 2014); character-based (Peersman et al. 2011), stems, co-stems, lemmas (Zamal et al. 2012; Cohen et al. 2014)
• Socio-linguistic, syntactic and stylistic:
– syntax and style (Schler et al. 2006; Cheng et al., 2011), smiles, excitement, emoticons and psycho-linguistic (Rao et al. 2010; Marquardt et al. 2014; Kokkos et al. 2014; Hovy 2015)
– lexicon features (Sap et al. 2014); linguistic inquiry and word count (LIWC) (Mukherjee et al. 2010; Fink et al. 2012)

Page 33: NAACL TutorialSocial Media Predictive Analytics

Features (II)
• Communication behavior: response/retweet/tweet frequency, retweeting tendency (Conover et al. 2011; Golbeck et al. 2011; Pennacchiotti and Popescu 2011; Preoţiuc-Pietro et al. 2015)
• Network structure: follower-following ratio, neighborhood size, in/out degree, degree of connectivity (Bamman et al. 2012; Filippova 2012; Zamal et al. 2012; Culotta et al. 2015)
• Other: likes (Bachrach et al. 2012; Kosinski et al. 2014), name or census (Burger et al. 2011; Liu and Ruths 2013), links/images (Rosenthal and McKeown 2011)
• Topics: word embeddings, LDA topics, word clusters (Preoţiuc-Pietro et al. 2015)

Extended feature vector:
          hair  eat  cool  work  …  xbox  RT   neigh  images  …
Female    1     1    0     0     …  0     0.3  30     0.5     …

Page 34: NAACL TutorialSocial Media Predictive Analytics

Batch Experiments
• Log-linear word unigram models: (I) Users vs. (II) Neighbors and (III) User-Neighbor
• Evaluate different neighborhood types:
– varying neighborhood size n=[1, 2, 5, 10] and content amount t=[5, 10, 15, 25, 50, 100, 200]
– 10-fold cross validation with 100 random restarts for every n and t parameter combination
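The experimental grid above can be enumerated as follows. This is a minimal sketch of the bookkeeping only. The helper `k_fold_indices`, the seeding scheme and the user count of 384 (taken from the gender dataset) are assumptions for illustration, and the training and evaluation step is left as a placeholder.

```python
import itertools
import random

NEIGHBORHOOD_SIZES = [1, 2, 5, 10]               # n
CONTENT_AMOUNTS = [5, 10, 15, 25, 50, 100, 200]  # t


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


grid = list(itertools.product(NEIGHBORHOOD_SIZES, CONTENT_AMOUNTS))
print(len(grid), "parameter combinations")
for n, t in grid[:2]:
    for train, test in k_fold_indices(n_items=384, k=10, seed=n * 1000 + t):
        pass  # sample n neighbors and t units of content, train, evaluate
```

A random restart would repeat the inner loop with a different seed, so that both the fold assignment and the neighbor sample change.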

Page 35: NAACL TutorialSocial Media Predictive Analytics

User Model

Train Graph Test Graph

Page 36: NAACL TutorialSocial Media Predictive Analytics

Neighbor Model

Train Graph Test Graph

Page 37: NAACL TutorialSocial Media Predictive Analytics

Joint User-Neighbor Model

Train Graph Test Graph

Learning on user and neighbor features jointly (not prefixing features)
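One way to read "not prefixing features" is that user and neighbor tokens are counted in a single shared feature space. The sketch below illustrates that reading. The whitespace tokenizer and the function names are assumptions, not the tutorial's code.

```python
from collections import Counter


def bag_of_words(tweets):
    """Unigram counts over a list of tweets (whitespace tokenization)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts


def joint_features(user_tweets, neighbor_tweets):
    """User and neighbor counts share one feature space (no prefixes)."""
    return bag_of_words(user_tweets) + bag_of_words(neighbor_tweets)


user = ["work then xbox"]
neighbors = ["xbox tonight", "cool hair"]
print(joint_features(user, neighbors)["xbox"])
```

The prefixed alternative would keep `xbox` from the user and `xbox` from a neighbor as two separate features, which doubles the vocabulary.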


Page 39: NAACL TutorialSocial Media Predictive Analytics

Gender Prediction

User: 0.82
Neighbor: 0.63
User-Neigh: 0.73

Page 40: NAACL TutorialSocial Media Predictive Analytics

Lexical Markers for Gender

Page 41: NAACL TutorialSocial Media Predictive Analytics

Gender Prediction Quality

Approach               Users   Tweets   Features         Accuracy
Rao et al., 2010       1K      405      BOW+socioling    0.72
Burger et al., 2011    184K    22       username, BOW    0.92
Zamal et al., 2012     384     10K      neighbor BOW     0.80
Bergsma et al., 2013   33.8K   −        BOW, clusters    0.90
JHU models             383     200/2K   BOW user/neigh   0.82/0.73

• This is not a direct comparison => Twitter data sharing restrictions
• Poor generalization: different datasets = different sampling and annotation biases

Page 42: NAACL TutorialSocial Media Predictive Analytics

Age Prediction (18 – 23 vs. 23 – 25)

User: 0.77
Neighbor: 0.72
User-Neigh: 0.77

Page 43: NAACL TutorialSocial Media Predictive Analytics

Lexical Markers for Age

Page 44: NAACL TutorialSocial Media Predictive Analytics

Age Prediction Quality

Approach             Users   Tweets   Groups             Features        Accuracy
Rao et al., 2010     2K      1183     <=30; >30          BOW+socioling   0.74
Zamal et al., 2012   386     10K      18 – 23; 23 – 25   neighbor BOW    0.80
JHU models           381     200/2K   18 – 23; 23 – 25   BOW/neighbors   0.77/0.74

• This is not a direct comparison!
• Performance for different age groups
• Sampling and annotation biases

Page 45: NAACL TutorialSocial Media Predictive Analytics

Political Preference

User: 0.89
Neighbor: 0.91
User-Neigh: 0.92

Page 46: NAACL TutorialSocial Media Predictive Analytics

Lexical Markers for Political Preferences

Page 47: NAACL TutorialSocial Media Predictive Analytics

Model Generalization

• Political preference classification is not easy!
• Topical sparsity: average users rarely tweet about politics

Accuracy by graph and model (bar chart; values paired with labels in extraction order):
               User   Neighbor   User-Neighbor
Geo-centric    0.57   0.67       0.69
Cand-centric   0.72   0.75       0.87
Active         0.89   0.91       0.92

Page 48: NAACL TutorialSocial Media Predictive Analytics

Political Preference Prediction Quality

Approach                           Users   Tweets   Features         Accuracy
Politically Active Users (sampling/annotation bias)
Bergsma et al., 2013               400     5K       BOW, clusters    0.82
Pennacchiotti and Popescu, 2011    10.3K   −        BOW, network     0.89
Conover et al., 2011               1K      1K       BOW, network     0.95
Zamal et al., 2012                 400     1K       neighbor BOW     0.91
JHU active                         371     200      BOW user/neigh   0.89/0.92
Random / Average Users
JHU cand-centric                   1,051   200      BOW user/neigh   0.72/0.75
JHU geo-centric                    270     200      BOW user/neigh   0.57/0.67
Cohen et al., 2013                 262     1K       BOW, network     0.68

Page 49: NAACL TutorialSocial Media Predictive Analytics

Limited Twitter API calls: querying more neighbors with fewer tweets is better than querying more tweets from the existing neighbors

Page 50: NAACL TutorialSocial Media Predictive Analytics

Optimizing Twitter API Calls
Cand-Centric Graph: Friend Circle


Page 54: NAACL TutorialSocial Media Predictive Analytics

Summary: Static Prediction
• Features: binary (political) vs. count-based features (age, gender)
• Homophily: “neighbors give you away” => users with no content
• Attribute assortativity: similarity with neighbors depends on attribute types
• Content from more neighbors per user >> additional content from the existing neighbors
• Generalization of the classifiers

[Figure labels: Follower, Friend; Retweet, Mention; Mention, Friend; N, UN]


Page 56: NAACL TutorialSocial Media Predictive Analytics

Iterative Bayesian Predictions

[Figure: predictions updated over time; the class prior is combined with the likelihood of each new tweet to give the posterior]
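The update on this slide is Bayes' rule applied once per incoming tweet, with each posterior serving as the prior for the next tweet. Below is a minimal sketch under stated assumptions: the per-tweet likelihoods are toy numbers rather than outputs of the tutorial's models, and `update_posterior` and `classify_stream` are hypothetical names. The 0.75 stopping threshold mirrors the confidence levels quoted on the following slides.

```python
def update_posterior(prior, likelihoods):
    """One Bayes step: posterior(c) is proportional to prior(c) * P(tweet | c)."""
    unnormalized = {c: prior[c] * likelihoods[c] for c in prior}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


def classify_stream(prior, tweet_likelihoods, threshold=0.75):
    """Update after every tweet; stop once one class reaches the threshold."""
    posterior = dict(prior)
    for seen, likelihoods in enumerate(tweet_likelihoods, start=1):
        posterior = update_posterior(posterior, likelihoods)
        best = max(posterior, key=posterior.get)
        if posterior[best] >= threshold:
            return best, posterior[best], seen
    return None, max(posterior.values()), len(tweet_likelihoods)


prior = {"Dem": 0.5, "Rep": 0.5}
stream = [{"Dem": 0.6, "Rep": 0.4}] * 5   # toy per-tweet likelihoods
label, confidence, n_tweets = classify_stream(prior, stream)
print(label, round(confidence, 3), n_tweets)
```

With these toy numbers the 0.75 threshold is reached after three tweets (posterior 27/35 ≈ 0.771). Raising the threshold to 0.95 leaves the user unclassified after five tweets, which is the precision versus time trade-off discussed later in the deck.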

Page 57: NAACL TutorialSocial Media Predictive Analytics

Cand-Centric Graph: Posterior Updates

[Figure: posterior updates over time]

Page 58: NAACL TutorialSocial Media Predictive Analytics

Cand-Centric: Prediction Time (1)

[Figure: users classified correctly over time, Dem vs. Rep, for the User Stream and the User-Neighbor stream]

Prediction confidence: 0.95 vs. 0.75
Democrats are easier to predict than Republicans

Page 59: NAACL TutorialSocial Media Predictive Analytics

Cand-Centric Graph: Prediction Time (2)

How much time does it take to classify 100 users with 75% confidence?

Compare: User Stream vs. Joint User-Neighbor Stream

[Bar chart: weeks (log scale, 0.001 to 100) for the Cand-centric, Geo-centric and Active graphs. Bar labels as extracted, series assignment not recoverable: 0.02, 12, 20; 0.01, 19, 8.9; 0.002, 1.2, 3.2; 0.001, 3.5, 1.1]

Page 60: NAACL TutorialSocial Media Predictive Analytics

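Batch vs. Online Performance

[Bar charts: accuracy (0.0 to 1.0) for the Cand, Geo and Active graphs; series assignment not recoverable.
Online panel (User Stream vs. User-Neighbor Stream), bar labels as extracted: 0.99, 0.84, 0.89, 0.99, 0.88, 0.99.
Batch panel (User Batch vs. Neighbor Batch), bar labels as extracted: 0.72, 0.57, 0.75, 0.75, 0.67, 0.86.]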

Page 61: NAACL TutorialSocial Media Predictive Analytics

Summary: Online Inference

• Homophily: Neighborhood content is useful*
• Lessons learned from batch predictions:
– Age: user-follower or user-mention joint stream
– Gender: user-friend joint stream
– Political: user-mention and user-retweet joint stream
• Streaming models >> batch models
• Activeness: tweeting frequency matters a lot!
• Generalization of the classifiers: data sampling and annotation biases

*Pennacchiotti and Popescu, 2011a, 2011b; Conover et al., 2011a, 2011b; Golbeck et al., 2011; Zamal et al., 2012; Volkova et al., 2014


Page 63: NAACL TutorialSocial Media Predictive Analytics

Iterative Batch Learning

[Figure: labeled and unlabeled users (R, D, ?) over time steps t1, t2, …, tm]

• Iterative Batch Retraining (IB)
• Iterative Batch with Rationale Filtering (IBR)

Page 64: NAACL TutorialSocial Media Predictive Analytics

Active Learning

[Figure: labeled and unlabeled users over time: 1-Jan-2011, 1-Feb-2011, …, 1-Nov-2011, 1-Dec-2011]

• Active Without Oracle (AWOO)
• Active With Rationale Filtering (AWR)
• Active With Oracle (AWO)

Page 65: NAACL TutorialSocial Media Predictive Analytics

Annotator Rationales

Rationales are explicitly highlighted ngrams in tweets that best justify why the annotators made their labeling decisions

feature norms (psychology), feature sparsity

Bergsma and Van Durme, 2013; Volkova and Yarowsky, 2014; Volkova and Van Durme, 2015

Page 66: NAACL TutorialSocial Media Predictive Analytics

Alternative: Rationale Weighting

• Annotator rationales for gender, age and political: http://www.cs.jhu.edu/~svitlana/rationales.html
• Multiple languages: English, Spanish
• Portable to other languages

Improving Gender Prediction of Social Media Users via Weighted Annotator Rationales. Svitlana Volkova and David Yarowsky. NIPS Workshop on Personalization: Methods and Applications 2014.

Page 67: NAACL TutorialSocial Media Predictive Analytics

Performance Metrics
• Accuracy over time:
• Find optimal models:
– Data stream type (user, friend, user + friend)
– Time (more correctly classified users faster)
– Prediction quality (better accuracy over time)
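The formula for accuracy over time did not survive extraction. The sketch below therefore rests on an assumed definition: at each checkpoint, accuracy is the share of users classified so far whose predicted label matches the gold label, reported together with the raw count of correctly classified users. The function name and the toy data are hypothetical.

```python
def accuracy_over_time(predictions, gold, checkpoints):
    """predictions: list of (time, user, label), one entry per classified user.

    Returns {checkpoint: (n_correct, accuracy)} over users classified so far.
    """
    results = {}
    for checkpoint in checkpoints:
        done = [(u, label) for t, u, label in predictions if t <= checkpoint]
        correct = sum(1 for u, label in done if gold[u] == label)
        results[checkpoint] = (correct, correct / len(done) if done else 0.0)
    return results


gold = {"a": "R", "b": "R", "c": "D", "d": "R"}
predictions = [(3, "a", "R"), (3, "b", "D"), (6, "c", "D"), (9, "d", "R")]
for month, (n_correct, acc) in accuracy_over_time(predictions, gold, [3, 6, 9]).items():
    print(month, n_correct, round(acc, 2))
```

Reporting both numbers matters for the comparison that follows: a model can classify more users correctly (higher count) while being less accurate on the users it commits to.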

Page 68: NAACL TutorialSocial Media Predictive Analytics

Results: Iterative Batch Learning

[Figure: two panels over Mar, Jun, Sep for the user stream; axes: correctly classified users (50 to 300) and accuracy (0.0 to 1.0)]

• IB: higher recall; IBR: higher precision
• Time: # correctly classified users increases over time
• IB faster, IBR slower
• Data stream selection: user + friend stream > user stream

Page 69: NAACL TutorialSocial Media Predictive Analytics

Results: Active Learning
• AWOO: higher recall; AWR: higher precision
• Time: unlike IB/IBR models, AWOO/AWR models classify more users correctly faster (in Mar) but then plateau

[Figure: two panels over Mar, Jun, Sep for the user stream; axes: correctly classified users (50 to 300) and accuracy (0.0 to 1.0)]

Page 70: NAACL TutorialSocial Media Predictive Analytics

Results: Model Quality

[Figure: accuracy (0.5 to 1.0) over Mar, Jun, Sep in four panels: IB vs. IBR (user); AWOO vs. AWR (user); IB (user + friend); AWOO (user + friend)]

• batch < active
• user + friend > user

Page 71: NAACL TutorialSocial Media Predictive Analytics

Active with Oracle Annotations

[Figure, Feb to Dec: cumulative requests to the Oracle (50 to 350), with users in training for the user-only model; correctly classified users (50 to 250) for the user and friend streams, with thousands of tweets in training]

Oracle is 100% correct

Page 72: NAACL TutorialSocial Media Predictive Analytics

Summary: Dynamic Learning and Prediction

• Active learning > iterative batch

• N, UN > U: “neighbors give you away”

• Higher confidence => higher precision, lower confidence => higher recall (as expected)

• Rationales significantly improve results

Page 73: NAACL TutorialSocial Media Predictive Analytics

Practical Recommendations: Models for Targeted Advertising

• Prediction quality (better accuracy over time): models with rationale filtering (IBR, AWR); higher confidence threshold (0.95)
• Time (correctly classified users faster): models without rationale filtering (IB, AWOO); lower confidence threshold (0.55)
• Data stream (user, friend or joint): User + Friend > User

Page 74: NAACL TutorialSocial Media Predictive Analytics

Recap: Why are these models good?
• Model the streaming nature of social media
• Limited user content => take advantage of neighbor content
• Actively learn from crowdsourced rationales
• Learn on the fly => data drift
• Predict from multiple streams => topical sparsity
• Flexible extendable framework:
– More features: word embeddings, interests, profile info, tweeting behavior

Page 75: NAACL TutorialSocial Media Predictive Analytics

Software Requirements
• Python: https://www.python.org/downloads/
  python -V
• Pip: https://pip.pypa.io/en/latest/installing.html
  python get-pip.py
• Twython: https://pypi.python.org/pypi/twython/
  pip install twython
• matplotlib 1.3.1: http://sourceforge.net/projects/matplotlib/files/matplotlib/
• numpy 1.8.0: http://sourceforge.net/projects/numpy/files/NumPy/
• scipy 0.13: http://sourceforge.net/projects/scipy/files/scipy/
• scikit-learn 0.14.1: http://sourceforge.net/projects/scikit-learn/files/

  python -c "import sklearn; print(sklearn.__version__)"
  python -c "import numpy; print(numpy.version.version)"
  python -c "import scipy; print(scipy.version.version)"
  python -c "import matplotlib; print(matplotlib.__version__)"

Page 76: NAACL TutorialSocial Media Predictive Analytics

Part II. Practice Session Outline
• Details on data collection and annotation
– JHU: gender, age and political preferences
– MSR: emotions, opinions and psycho-demographics
• Python examples for static inference
– Tweet-based: emotions
– User-based: psycho-demographic attributes
• Python examples for online inference
– Bayesian updates from multiple data streams


Page 78: NAACL TutorialSocial Media Predictive Analytics

JHU: Data Overview and Annotation Scheme

Political Preferences:
– Candidate-Centric = 1,031 users (follow candidates)
– Geo-Centric = 270 users (self-reports in DE, MD, VA)
– Politically Active* = 371 users (active & follow cand)
Age (self-reports)*: 387 users
Gender (name)*: 384 users

10 - 20 neighbors of each of 6 types

Details on Twitter data collection: http://www.cs.jhu.edu/~svitlana/data/data_collection.pdf

*Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013

Explain relationships

Page 79: NAACL TutorialSocial Media Predictive Analytics

Links to Download JHU Attribute Data

• What does the data look like?
– graph_type.neighbor_type.tsv e.g., cand-centric.follower.tsv
• JHU gender and age: http://www.cs.jhu.edu/~svitlana/data/graph_gender_age.tar.gz
• JHU politically active*: http://www.cs.jhu.edu/~svitlana/data/graph_zlr.tar.gz
• JHU candidate-centric: http://www.cs.jhu.edu/~svitlana/data/graph_cand.tar.gz
• JHU geo-centric: http://www.cs.jhu.edu/~svitlana/data/geo_cand.tar.gz

Page 80: NAACL TutorialSocial Media Predictive Analytics

Code to query Twitter API
• Repo: https://bitbucket.org/svolkova/queryingtwitter
– get lists of friends/followers for a user
– 200 recent tweets for k randomly sampled retweeted or mentioned users
– tweets for a list of userIDs

[Figure labels: userIDs/tweetIDs; Tweet Collection; JSON Objects; Extract text fields, time, #friends]

Page 81: NAACL TutorialSocial Media Predictive Analytics

Part II. Practice Session Outline
• Data and annotation schema description
– JHU: gender, age and political preferences
– MSR: emotions, opinions and psycho-demographics
• Python examples for static inference:
– Tweet-based: emotions
– User-based: psycho-demographic attributes
• Python examples for streaming inference:
– Bayesian updates from multiple data streams

Page 82: NAACL TutorialSocial Media Predictive Analytics

MSR: Psycho-Demographic Annotations via Crowdsourcing

• 5K profiles annotated by a trusted crowd ($6/hour, quality control)
• Attribute models ΦA(u): from 5K labeled users (UL) to millions of users (UP)

[Bar chart: Cohen's Kappa (2% random sample), axis 0.0 to 0.8, for Intelligence, Relationship, Religion, Political, Education, Optimism, Income, Life satisfaction, Age, Children, Gender, Ethnicity]
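Cohen's kappa, the agreement measure used on this slide, corrects the observed agreement between two annotators for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below works through the standard formula on made-up gender labels. The data and the function name are illustrative and are not taken from the MSR annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


annotator_1 = ["F", "F", "M", "M", "F", "M", "F", "M", "F", "M"]
annotator_2 = ["F", "F", "M", "M", "F", "M", "M", "M", "F", "F"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

Here the annotators agree on 8 of 10 users (p_o = 0.8) and chance agreement is p_e = 0.5, so kappa is 0.6.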

Page 83: NAACL TutorialSocial Media Predictive Analytics

MSR: Emotion Annotations via Distant Supervision

6 Ekman’s Emotions hashtags (Mohammad et al.’14) + emotion synonym hashtags

Page 84: NAACL TutorialSocial Media Predictive Analytics

Part II. Practice Session
• Data and annotation schema description
– JHU: gender, age and political preferences
– MSR: emotions, opinions and psycho-demographics
• Python examples for static inference:
– Tweet-based: emotions
– User-based: psycho-demographic attributes
• Python examples for streaming inference:
– Bayesian updates from multiple data streams

Page 85: NAACL TutorialSocial Media Predictive Analytics

How to get MSR models and code?
https://bitbucket.org/svolkova/psycho-demographics
1. Load models for 15 psycho-demographic attributes + emotions
2. Extract features from input tweets
3. Apply pre-trained models to make predictions for input tweets

Page 86: NAACL TutorialSocial Media Predictive Analytics

Predictive Models
Supervised text classification: log-linear models

User-based:
• Lexical: normalized binary/count-based ngrams
• Affect: emotions, sentiments

Tweet-based:
• BOW + Negation, Stylistic: +0.3 F1
• Socio-linguistic and stylistic:
– Elongations: Yaay, woooow
– Capitalization: COOL
– Mixed punctuation: ???!!!
– Hashtags and emoticons
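The stylistic cues listed above can be counted with a few regular expressions. This is a minimal sketch with assumed patterns, not the feature extractor from the MSR repository. In particular, an elongation is taken to be a character repeated three or more times, which catches "woooow" but not "Yaay", and the emoticon pattern covers only a handful of common forms.

```python
import re

ELONGATION = re.compile(r"(\w)\1{2,}")              # a character repeated 3+ times
MIXED_PUNCT = re.compile(r"[?!]*(?:\?!|!\?)[?!]*")  # a run containing both ? and !
HASHTAG = re.compile(r"#\w+")
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")


def stylistic_features(tweet):
    """Count simple socio-linguistic and stylistic cues in one tweet."""
    tokens = tweet.split()
    return {
        "elongations": sum(1 for tok in tokens if ELONGATION.search(tok)),
        "all_caps": sum(1 for tok in tokens
                        if len(tok) > 1 and tok.isalpha() and tok.isupper()),
        "mixed_punct": len(MIXED_PUNCT.findall(tweet)),
        "hashtags": len(HASHTAG.findall(tweet)),
        "emoticons": len(EMOTICON.findall(tweet)),
    }


print(stylistic_features("woooow that is COOL ???!!! #nlp :)"))
```

The resulting counts would be appended to the bag-of-words vector for a tweet, in the same way the behavior and network features extend the user vector in Part I.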

Page 87: NAACL TutorialSocial Media Predictive Analytics

Tweet-based: Emotion Prediction

6 classes: joy, sadness, fear, surprise, disgust, anger

F1 score by class (higher is better; labels and values paired in extraction order): Sadness 0.62, Surprise 0.64, Fear 0.77, Joy 0.79, Anger 0.80, Disgust 0.92

F1=0.78 (Roberts’12 0.67, Qadir’13 0.53, Mohammad’14 0.49)

Page 88: NAACL TutorialSocial Media Predictive Analytics

User-Based: Attribute Prediction

Gain over BOW, per attribute (attribute labels not recoverable): +0.04, +0.05, +0.11, +0.12, +0.12, +0.11, +0.08, +0.17, +0.10, +0.08, +0.11, +0.11

Page 89: NAACL TutorialSocial Media Predictive Analytics

Predicting Demographics from User Outgoing Emotions and Opinions

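[Bar chart: AUC ROC per attribute class. Bar values as extracted: 0.76, 0.76, 0.58, 0.58, 0.65, 0.66, 0.72, 0.76, 0.75, 0.76, 0.69, 0.73. Class labels as extracted: Satisfied, Optimist, Dissatisfied, Pessimist, No Kids, Below 25 y.o., Female, Male]

1/3 attributes AUC >=75%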

Page 90: NAACL TutorialSocial Media Predictive Analytics

How to get JHU models and code?

Ex1: Train and test batch models

Ex2: Train a model from a training file and save it

Ex3: Predict an attribute using a pre-trained model and plot iterative updates

Ex4: Predict and plot iterative updates for multiple attributes using pre-trained models from a single communication stream

Ex5: Predict and plot iterative updates for multiple attributes from multiple communication streams

https://bitbucket.org/svolkova/attribute

Page 91: NAACL TutorialSocial Media Predictive Analytics

Ex1. Train/Test Batch Models

• Run as e.g., for gender:

• Customize features and model type/parameters:

Accuracy

Page 92: NAACL TutorialSocial Media Predictive Analytics

Ex2. Save Pre-trained Models

• Run as e.g., age:

• Customize features (process.py), model type and parameters (predict.py)

Page 93: NAACL TutorialSocial Media Predictive Analytics

Part II. Practice Session
• Data and annotation schema description
– JHU: gender, age and political preferences
– MSR: emotions, opinions and psycho-demographics
• Python examples for static inference:
– User-based: psycho-demographics
– Tweet-based: emotions, opinions
• Python examples for streaming inference:
– Bayesian updates from multiple data streams

Page 94: NAACL TutorialSocial Media Predictive Analytics

Recap: Iterative Bayesian Updates

[Figure: predictions updated over time; the class prior is combined with the likelihood of each new tweet to give the posterior]

Page 95: NAACL TutorialSocial Media Predictive Analytics

Ex3. Iterative Updates for a Single Attribute from a Single Stream

Page 96: NAACL TutorialSocial Media Predictive Analytics

Ex4. Iterative Updates for Multiple Attributes from a Single Stream

Steps:
1. Loading models
2. Processing data
3. Setting up train/test priors
4. Making predictions
5. Plotting results

Page 97: NAACL TutorialSocial Media Predictive Analytics

Joint User-Neighbor Streams

Page 98: NAACL TutorialSocial Media Predictive Analytics

Ex5. Iterative Updates for Multiple Attributes from Joint Streams

Likelihood Posterior