Cost-Effective Personal Analytics in Social Media Using Language and Network Structure

Svitlana Volkova

07/25/2013

Department of Computer Science
Johns Hopkins University
3400 North Charles Street
Baltimore, MD 21218

This project fulfills a requirement for the Computer Science PhD degree.
Advisor's signature ______________________ Date ___________
Advisor's name: Benjamin Van Durme





Cost-Effective Personal Analytics in Social Media Using Language and Network Structure

Abstract

Limitations to public access of social media, such as increased rate throttling under the revised Twitter API, prompt rethinking current approaches for a variety of inference tasks in social networks, such as the prediction of latent author attributes. We investigate various novel network construction methods over Twitter users in order to leverage information from different interpretations of a "local" neighborhood, aimed at a cost-effective model for personal analytics. We evaluate our model as a function of the size of the neighborhood, along with the amount of content data (tweets) associated with those neighbors. We show that even when limited or no self-authored data is available, content from the local neighborhood can provide sufficient evidence for prediction. We demonstrate that data from friends, retweets and user mentions is a stronger signal in this regard than replies and shared hashtags.

1 Introduction

Interest in predicting latent author attributes automatically from personal communications and social media (e.g., emails, blog posts or public discussions) has grown with the volume of available data. This work can support a variety of applications such as: predicting communicant (e.g., Twitter user) attributes such as gender, age, ethnicity and race (Rao et al., 2011; Burger et al., 2011; Bergsma et al., 2013); commercial applications1 in personalized

1 http://www.wolframalpha.com/facebook/

recommendation systems and advertising (Bilenko and Richardson, 2011); detecting fraudulent or unhelpful product reviews (Ott et al., 2011; Jindal and Liu, 2008); healthcare analytics (Lampos et al., 2010; Paul and Dredze, 2011); large-scale, low-cost, passive polling (O'Connor et al., 2010a; Lampos et al., 2013); and real-time live polling (Resnik, 2013).

Twitter has become an attractive resource for studying the underlying properties of such informal communications because of its volume, dynamic nature, and diverse population. However, recent changes to Twitter API querying rates further restrict the speed of access to this resource, effectively reducing the amount of data that can be collected in a given time period. This is problematic for the existing state-of-the-art models for personal analytics, which were trained on thousands of tweets per author (Rao et al., 2010; Zamal et al., 2012). Furthermore, most Twitter users are less prolific than those examined by Zamal and Rao, and thus do not produce the thousands of tweets required to obtain their levels of accuracy. At the same time, supervised models for demographic profiling still require at least some text data (content) for training from communications authored by the user, whether trained in batch (Burger et al., 2011) or streaming settings (Van Durme, 2012b).

Recent models have incorporated network structure when calculating personal analytics in Twitter (Conover et al., 2011; Golbeck and Hansen, 2011; Pennacchiotti and Popescu, 2011a), e.g., by combining network structure with tweets from a network derived from Twitter user friends (Zamal et al., 2012). However, these methods still require


querying the Twitter API to obtain the network data (context). Therefore, in this work we propose models that are cognizant of these rate limits, maximizing the efficiency of the text and network data in calculating personal analytics. We make explicit the tradeoff between accuracy and cost (manifest as calls to the Twitter API), and optimize to a different tradeoff than state-of-the-art approaches, seeking maximal performance when limited data is available. In the following we:

• develop a unified formalism for representing diverse social network types from Twitter, including friends, followers, user mentions, shared hashtags, replies, and retweets;

• explore the relative utility of these six graph types for personal analytics, using as an example the prediction of political affiliation;

• demonstrate that content from the local neighborhood of such graphs can be as informative as user self-authored content;

• evaluate the quality of the local neighborhood as a function of the neighborhood size and the amount of content available per neighbor;

• determine when querying for more content from existing neighbors, or querying for more neighbors, is optimal for personal analytics.

2 Related Work

User Attribute Prediction There has been significant interest in characterizing latent user attributes in social media; e.g., Garera and Yarowsky (2009), Burger et al. (2011), Van Durme (2012b), and Rao et al. (2010) discuss how to classify user gender, age and ethnicity in Twitter. All these approaches apply standard lexical features in supervised classification settings. Bergsma et al. (2012) show that adding stylistic and syntactic information to the bag-of-words features improves the performance. Moreover, recent work by Bergsma and Van Durme (2013) presents an approach for demographic profiling which also allows bootstrapping the training data by learning lexical common-sense attributes for male and female users.

Other methods characterize Twitter users by applying limited amounts of network structure information in addition to lexical features (Conover et al., 2011; Pennacchiotti and Popescu, 2011a; Golbeck and Hansen, 2011). Conover et al. (2011) rely on identifying strong partisan clusters of Democratic and Republican users in a Twitter network based on retweet and user mention degree of connectivity, and then combine this clustering information with the follower and friend neighborhood size features. Pennacchiotti et al. (2011a; 2011b) focus on user behavior, network structure and linguistic features. Similar to our work, they assume that users from a particular class tend to reply to and retweet messages of users from the same class. We extend this assumption and study other relationship types, e.g., shared hashtags, user mentions, etc. The most similar work to ours is by Zamal et al. (2012), where the authors apply features from the tweets authored by a user's friends to infer attributes of that user. In this paper, we study different types of local neighborhoods (in addition to the friend network explored by Zamal et al.) and we are the first to propose a model for calculating personal analytics where cost-effectiveness is explicitly taken into account.

Other works on personal analytics exploit unsupervised learning methods. Bergsma et al. (2013) show that clustering of user attributes improves gender, ethnicity and location classification performance at large scale in Twitter. O'Connor et al. (2010b), following the work of Eisenstein (2010), propose a Bayesian generative model to discover demographic language variations, and Rao et al. (2011) suggest a hierarchical Bayesian model for latent user attribute prediction in Twitter.

Network structure analysis Coppersmith and Priebe (2012) investigate both network structure (context) and user (vertex) communications (content) to explore vertex nomination, based on the Enron email corpus, and demonstrate that a joint model improves performance. Other investigations studying network structure in Twitter, including information propagation (Kwak et al., 2010) and link prediction (Hasan and Zaki, 2011; Rowe et al., 2012), rely mainly on Twitter follower graphs.

3 Identifying Graphs from Twitter Data

Our work makes explicit an array of graph construction mechanisms under a unified formalism, serving as a basis for future work in tying graph-based predictive models to socio-linguistic content analysis. For example, some information about a user's personal preferences can be gleaned from looking at who their friends are, but different information may be available from other users who are not friends, but use many of the same hashtags.

Figure 1: Twitter graph example with follower, friend, user mention, reply, retweet and hashtag relationships between users of interest (blue: Democratic, red: Republican, lighter vertices: local neighborhood).

3.1 Graph Definition

Let us define an attributed, directed graph G = (V, E), where V is a set of vertices and E is a set of edges. Each vertex v_i represents someone in a communication graph, i.e., a communicant: here, a Twitter user. Each vertex is attributed with a feature vector f(v_i), which encodes communications, e.g., tweets available for a given user. Each vertex is associated with a label L(v_i); in our case it is binary, L(v_i) ∈ {0, 1}. Each edge e_ij ∈ E represents a connection between v_i and v_j, e_ij = (v_i, v_j), and defines different communication relations between Twitter users, e.g., follower (f), friend (b), user mention (m), hashtag (h), reply (y) and retweet (w). Thus, E ⊆ V^(2) × {f, b, h, m, w, y}. We denote a set of edges of a given type as φ_r(E) for r ∈ {f, b, h, m, w, y}. We denote the set of vertices adjacent to v_i by relationship type r as N_r(v_i), which is equivalent to {v_j | e_ij ∈ φ_r(E)}. We refer to N_r(v_i) as v_i's local r-neighborhood. In most cases, we only work with a sample of a neighborhood, denoted by N'_r(v_i), where |N'_r(v_i)| = k is the size of the sampled r-neighborhood for v_i. Section 3.5 details how r-neighborhoods are constructed.
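As a minimal sketch of the definition above (class and variable names are illustrative, not from the paper), the attributed graph with typed edges might be represented as:

```python
from collections import defaultdict

# Edge types from the paper: follower (f), friend (b), hashtag (h),
# user mention (m), retweet (w), reply (y).
EDGE_TYPES = {"f", "b", "h", "m", "w", "y"}

class TwitterGraph:
    """Attributed, directed graph G = (V, E) with typed edges."""
    def __init__(self):
        self.features = {}             # v_i -> feature vector f(v_i)
        self.labels = {}               # v_i -> label L(v_i) in {0, 1}
        self.edges = defaultdict(set)  # (v_i, r) -> neighbors N_r(v_i)

    def add_edge(self, vi, vj, r):
        assert r in EDGE_TYPES
        self.edges[(vi, r)].add(vj)

    def neighborhood(self, vi, r):
        """Local r-neighborhood N_r(v_i) = {v_j | e_ij in phi_r(E)}."""
        return self.edges[(vi, r)]

g = TwitterGraph()
g.add_edge("alice", "bob", "b")    # alice and bob are friends
g.add_edge("alice", "carol", "m")  # alice mentions carol
print(sorted(g.neighborhood("alice", "b")))  # -> ['bob']
```

Keying edges by (vertex, type) makes extracting a single r-neighborhood a constant-time lookup, which matches how the paper samples one neighborhood type at a time.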

Figure 1 presents an example Twitter graph. Notably, users from N_r(v_i) can be shared across the users of interest of the same or different classes: a user v_j can potentially be in both N_f(v_i) with v_i ∈ D and N_b(v_i) with v_i ∈ R, where D stands for Democratic and R for Republican users. Also note that the directionality of edges is specific to the type of edge with respect to the user of interest v_i. All follower edges are incoming (in that v_j follows our user of interest v_i). Likewise, all hashtag, user mention, retweet and reply edges are outgoing (e.g., v_i authors a tweet in which v_j is mentioned). All friend edges are bidirectional. We plan to use edge directionality in future work.

3.2 Candidate-Centric Graph

We construct a graph G_cand by looking at following relationships between users and the Democratic or Republican candidates during the 2012 US Presidential election, as shown in Figure 2. We refer to this graph as candidate-centric.

We define Democratic or Republican users as those that followed both @BarackObama and @JoeBiden, or both @MittRomney and @RepPaulRyan.2 We randomly sample n = 516 Democratic and m = 515 Republican users. We label users as Democratic if they exclusively follow both Democratic candidates but do not follow both Republican candidates, and vice versa. We collectively refer to D and R as our "users of interest".

Figure 2: Candidate-centric graph with D and R candidates; users of interest (blue: D, red: R); users not sampled; →: follower relations; ⇔: friend relations.
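The labeling rule above can be sketched as follows (a hypothetical helper; it assumes each user's set of followed screen names has already been fetched):

```python
DEM_CANDIDATES = {"BarackObama", "JoeBiden"}
REP_CANDIDATES = {"MittRomney", "RepPaulRyan"}

def label_user(follows):
    """Return 'D', 'R', or None for a set of followed screen names."""
    follows_dem = DEM_CANDIDATES <= follows  # follows both Democratic candidates
    follows_rep = REP_CANDIDATES <= follows  # follows both Republican candidates
    if follows_dem and not follows_rep:
        return "D"
    if follows_rep and not follows_dem:
        return "R"
    return None  # ambiguous or unlabeled: excluded from the sample

print(label_user({"BarackObama", "JoeBiden", "nytimes"}))  # -> D
print(label_user({"BarackObama", "JoeBiden",
                  "MittRomney", "RepPaulRyan"}))           # -> None
```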

For each such user we collect a biography, recent tweets, and randomly sampled user-local neighborhoods of followers, friends, user mentions, replies, retweets and hashtags. We describe the details of exploring these neighborhood types in Section 3.5. To the best of our knowledge, the resulting candidate-centric graph is the largest Twitter collection assembled with respect to labeled political preference (50k vertices, 60k edges), including sampled users across a diverse set of local neighborhoods.

3.3 Geo-Centric Graph

We also construct a geo-centric graph G_geo by collecting n = 135 Democratic and m = 135 Republican users from the Maryland, Virginia and Delaware region of the United States with self-reported political preference in their biographies. As with the candidate-centric graph, for each user we collect a biography, recent tweets, and randomly sampled user-local neighborhoods of followers, friends, user mentions, replies, retweets and shared hashtags.

3.4 ZLR User-Friend Graph

We also construct the graph G_ZLR from a dataset previously used for political affiliation classification (Zamal et al., 2012). This dataset consists of 200 Republican and 200 Democratic users associated with 925 tweets on average per user.3 Each user has on average 6155 friends, with 642 tweets per friend. Sharing restrictions and rate limits on the data prevented us from making a direct comparison, but we were able to recreate a semblance of the ZLR data under our limited-data regime (described in detail below), with 10 friends and 200 tweets per friend for each user of interest.

2 As of Oct 12, 2012, the number of followers for Obama, Biden, Romney and Ryan were 2m, 168k, 1.3m and 267k.

3 The original dataset was collected in 2012 and has recently been released at http://icwsm.cs.mcgill.ca/. Political labels are extracted from http://www.wefollow.com (Pennacchiotti and Popescu, 2011b).

3.5 Relationship Types

To investigate relationships4 between Twitter users for the G_cand and G_geo graphs, we collect lists of followers and friends, and extract user mentions, hashtags, replies and retweets from user communications.

To encode user-follower and user-friend relationships, we download lists of followers and friends for each user of interest. Twitter's Post API rate limits5 prevent us from downloading tweets for all followers and friends, so instead we randomly sample k = 10 followers and k = 10 friends for each user of interest. To represent user-reply relationships, we consider tweets with a filled-in reply-to field and extract the reply user information. We collect all replies for users in D and R from the set of 2k user tweets, and similarly sample k = 10 reply users.

To encode user-mention relationships, we extract all @mentions from the approx. 200k tweets authored by users of interest. We eliminate the 100 most frequent @mentions, treating them as stopwords (e.g., @youtube or @FoxNews), as well as @mentions that occur only once. For each user of interest we randomly sample a subset of k = 10 of their @mentions. Similarly, we extract user-retweet relationships from the retweets present in the set of 200k tweets, eliminating the 100 most frequently retweeted users (e.g., @BarackObama), and randomly sampling a subset of k = 10 usernames retweeted by each user of interest.

To encode user-hashtag relationships, we obtain a sample of hashtags and of users sharing a specific hashtag as follows: for each user in D ∪ R, we eliminate hashtags that occur only once, then randomly sample a set of 5 hashtags; next, we download the 100 most recent tweets per hashtag, extract the user information associated with each tweet, and randomly sample k = 10 users per hashtag.
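The neighbor-sampling step shared by the mention and retweet procedures (drop the most frequent accounts as stopwords, drop singletons, sample k of the rest) can be sketched as below; the function name and the assumption that mention counts have been extracted upstream are ours:

```python
import random
from collections import Counter

def sample_neighbors(mention_counts, k=10, n_stopwords=100, seed=0):
    """Sample k neighbors from a {username: count} map, dropping the
    n_stopwords most frequent accounts (stopwords) and singletons."""
    counts = Counter(mention_counts)
    stop = {u for u, _ in counts.most_common(n_stopwords)}
    candidates = [u for u, c in counts.items() if c > 1 and u not in stop]
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))

mentions = {"newsbot": 50, "alice": 3, "bob": 2, "once": 1}
print(sorted(sample_neighbors(mentions, k=2, n_stopwords=1)))  # -> ['alice', 'bob']
```

Here "newsbot" is discarded as a stopword and "once" as a singleton, mirroring the filtering described for @mentions and retweets.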

4 Twitter relationship types are defined here: http://personalweb.about.com/od/twitterterms/a/Twitter-Language.htm

5 Twitter's Post API restricted queries to 3600 per day from a single IP, which has since changed to 1800; the data in this paper would require at least a month to query from a single IP.

4 Models

4.1 Baseline User Model

As input we are given a set of vertices representing users of interest v_i ∈ V, along with feature vectors f(v_i) derived from content authored by the user of interest. Each user is associated with a non-zero number of publicly posted tweets along with a self-authored biography field. Our general goal is to assign each user of interest v_i to a category based on f(v_i). Here we focus on a binary assignment into the categories Democratic (D) or Republican (R). The log-linear model for such binary classification is:

Φ_vi = D if (1 + exp[−θ · f(v_i)])^(−1) ≥ 0.5, and R otherwise. (1)

We consider several types of features for our

user model, where features are normalized unigram counts extracted from v_i's:

- tweets f_t(v_i) : D × t(v_i) → R;
- biographies f_b(v_i) : D × b(v_i) → R;
- biographies and tweets f_b,t(v_i).

Similar supervised models have been successfully used before for user attribute classification, e.g., gender, age, political affiliation and ethnicity in social networks, as described in Section 2. We further extend the existing state-of-the-art models through consideration of content deemed "local" to a user under various graph constructions.
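A minimal sketch of the decision rule in Eq. 1 over normalized unigram features (the toy weight vector and helper names are ours, not learned weights from the paper):

```python
import math
from collections import Counter

def unigram_features(tweets):
    """Normalized unigram counts over a user's tweets, f_t(v_i)."""
    counts = Counter(w.lower() for t in tweets for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def classify(theta, f_vi):
    """Eq. 1: D if (1 + exp[-theta . f(v_i)])^-1 >= 0.5, else R."""
    score = sum(theta.get(w, 0.0) * x for w, x in f_vi.items())
    p_dem = 1.0 / (1.0 + math.exp(-score))
    return "D" if p_dem >= 0.5 else "R"

theta = {"healthcare": 2.0, "taxes": -2.0}  # toy learned weights
tweets = ["healthcare reform now", "healthcare for all"]
print(classify(theta, unigram_features(tweets)))  # -> D
```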

4.2 Neighborhood Model

As input we are given a user-local neighborhood N_r(v_i), where r is the neighborhood type. Here we consider six neighborhood types: follower, friend, user mention, hashtag, reply and retweet. Besides the neighborhood type r, each neighborhood is characterized by two parameters:

- the number of communications, e.g., tweets per neighbor in the neighborhood, f_t(N_r), t = {5, .., 200};

- the order of the neighborhood (the number of neighbors per user of interest), |N_r| = deg(v_i), n = {1, .., 10}.

Our goal is to classify users of interest as Democratic or Republican using evidence (e.g., communications) from their local neighborhood, Σ_n f_t[N_r(v_i)] ≡ f(N_r). The corresponding log-linear model is defined as:

Φ_Nr = D if (1 + exp[−θ · f(N_r)])^(−1) ≥ 0.5, and R otherwise. (2)

To make a fair comparison with the user model, we use normalized unigram features extracted from each user-local neighborhood N_r(v_i) (the set of n users adjacent to v_i). For example, for a neighborhood N_r of type r we consider biographies f_b(N_r), tweets f_t(N_r), and both f_b,t(N_r).
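The neighborhood model differs from the user model only in where the text comes from: unigram counts are pooled over the n neighbors before normalizing. A hedged sketch (function name ours):

```python
from collections import Counter

def neighborhood_features(neighbor_tweets):
    """Pool normalized unigram counts over all tweets from the sampled
    r-neighborhood: f(N_r) aggregates the n neighbors' content."""
    counts = Counter()
    for tweets in neighbor_tweets:  # one list of tweets per neighbor
        for tweet in tweets:
            counts.update(w.lower() for w in tweet.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

f_nr = neighborhood_features([["go vote blue"], ["vote early"]])
print(round(f_nr["vote"], 2))  # -> 0.4
```

The pooled vector can then be fed to the same decision rule as Eq. 1, now playing the role of f(N_r) in Eq. 2.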

Cost-effective Personal Analytics Our goal is to compare the performance of the user model from Eq. 1 vs. the neighborhood model from Eq. 2 given a limited amount of resources. For Twitter, these resources are measured as requests for data from the Twitter API (queries). For the nonce, we simplify the calculation of cost such that the Twitter API returns one communication at a time, e.g., a single tweet t_k for a given user.6

Our objective is to maximize classification performance subject to the cost constraints or, alternatively, to minimize the number of queries to the Twitter API needed to achieve a certain performance.

The cost for the user model is defined as:

minimize_k Σ_k t_k(v_i) (3)

The cost for the neighborhood model is:

minimize_k Σ_n Σ_k t_k[N_r(v_i)] (4)
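Under the one-tweet-per-query simplification, both costs reduce to counting fetched tweets: Eq. 3 counts the k tweets pulled from the user, and Eq. 4 counts k tweets from each of n neighbors. A toy sketch (helper names ours):

```python
def user_model_cost(tweets_per_user):
    """Eq. 3: queries needed to fetch k tweets from the user of interest."""
    return tweets_per_user

def neighborhood_model_cost(n_neighbors, tweets_per_neighbor):
    """Eq. 4: queries needed to fetch k tweets from each of n neighbors."""
    return n_neighbors * tweets_per_neighbor

# A "point of equality for the cost": 100 tweets from the user costs the
# same as 10 tweets from each of 10 neighbors.
assert user_model_cost(100) == neighborhood_model_cost(10, 10)
print(neighborhood_model_cost(10, 200))  # -> 2000
```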

4.3 Combined Model

As input we combine data from the user of interest, f(v_i), and a user-local neighborhood of a specific type r, f(N_r(v_i)). We experiment with two real-world scenarios:

- limited data is available for the user of interest, e.g., the user biography f_b(v_i) is combined with tweets f_t(N_r) from the neighborhood;

- data is available for the user of interest, e.g., all user data f_t,b(v_i) is combined with tweets from the local neighborhood f_t(N_r) of type r.

Similarly, we define a combined log-linear model Φ_c ≡ Φ_Nr,vi as shown below:

Φ_c = D if (1 + exp[−θ · f(N_r, v_i)])^(−1) ≥ 0.5, and R otherwise. (5)

6 This assumption reflects the real-world streaming scenario; e.g., the average tweets-per-day (TPD) rate is 5 tweets.


The cost for the combined model is a linear combination of the costs from Eq. 3 and Eq. 4, which is always higher than either of them separately. Thus, for cost-effective personal analytics we consider only the user and neighborhood models.

5 Experimental Setup

We design a set of experiments to compare models for the political affiliation prediction task defined in Section 4. We first ask whether communications from a user-local neighborhood can help predict political preference for the user. For that, we start with a set of exploratory experiments for the user, neighborhood and combined models on the candidate-centric graph G_cand (Experimental Setup I). We next examine the cost of calculating those personal analytics for only the user and neighborhood models on our geo-centric G_geo and candidate-centric G_cand graphs, as well as the ZLR user-friend graph described in Section 3.4 (Experimental Setup II).

Experimental Setup I. We first verify whether content from the neighborhood N_r(v_i) can be used to predict attributes for v_i, following the homophily assumption used by McPherson et al. (2001), Thelwall (2009), Bilgic and Getoor (2010), and Zamal et al. (2012). For that purpose, we train our models on a training graph G_train and predict vertex labels for a test graph G_test using 10-fold cross-validation. We use identical feature combinations for the train and test graphs; e.g., if we only use follower tweets in G_test, then we also use follower tweets to train the G_train model. This setup allows us to apply disjoint (prefixed) sets of features for biographies and tweets for the users v_i ∈ V and their local neighborhoods N_r(v_i).

Experimental Setup II. For cost-effective personal analytics we experiment with our neighborhood model defined in Eq. 2, with the aim to:

1. model neighborhood size: we change the number of neighbors in the neighborhood and try n = [1, 2, 5, 10] neighbor(s) per user;

2. model the neighbor attribute: we alternate the amount of content per neighbor and try t = [5, 10, 15, 25, 50, 100, 200] tweets.

To control for neighbor variability, we experiment with the amount of tweets only and train the best model using all neighborhood types; e.g., we only use follower tweets in G_test, but we use all six types of neighbors in G_train.7 We perform 10-fold cross-validation and run 100 random restarts for every n and t parameter combination. We compare our neighborhood model with the user baseline using the cost functions from Eq. 3 and Eq. 4. For all experiments we use LibLinear (Fan et al., 2008), integrated in the Jerboa toolkit (Van Durme, 2012a).
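The evaluation loop (10-fold cross-validation with random restarts) can be sketched as below; `train_and_score` is a hypothetical stand-in for the LibLinear classifier, here replaced by a majority-label baseline purely for illustration:

```python
import random

def train_and_score(labels, train, test):
    # Placeholder for the LibLinear classifier used in the paper:
    # predict the majority training label for every held-out example.
    majority = max(set(labels[i] for i in train),
                   key=lambda c: sum(labels[i] == c for i in train))
    return sum(labels[i] == majority for i in test) / len(test)

def cross_validate(labels, n_folds=10, restarts=100, seed=0):
    """Average accuracy over n_folds folds and `restarts` random shuffles."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    scores = []
    for _ in range(restarts):
        rng.shuffle(idx)
        folds = [idx[i::n_folds] for i in range(n_folds)]
        for held_out in folds:
            held = set(held_out)
            train = [i for i in idx if i not in held]
            scores.append(train_and_score(labels, train, held_out))
    return sum(scores) / len(scores)

# Toy run: 12 Democratic vs. 8 Republican labels; the majority baseline
# recovers the class prior.
labels = ["D"] * 12 + ["R"] * 8
print(round(cross_validate(labels, restarts=2), 2))  # -> 0.6
```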

6 Results

We present results for the user attribute prediction task for the three models from Section 4 and the two experimental setups from Section 5.

Setup I Results In Table 1 we present classification results for the user, neighborhood and combined models using different feature combinations. Each row represents features: e.g., f_t(v_i) and f_t(N_r) use tweets only, where t(v_i) = 200 tweets per user and t(N_r) = 20 tweets from each of 10 neighbors, so that Σ_10 t(N_r) = 200; f_b(v_i) and f_b(N_r) are based on all of v_i's and N_r's available biographies; f_t(N_r, v_i) and f′_t(N_r, v_i) represent features for the combined model, where 200 tweets from the user of interest are combined with 200 and 2k tweets from their neighbors, respectively.

Features        φ_f    φ_b    φ_m    φ_h    φ_y    φ_w
f_b(v_i)        0.52
f_t(v_i)        0.77
f_b(N_r)        0.69   0.64   0.50   0.58   0.61   0.39
f_t(N_r)        0.70   0.69   0.71   0.62   0.68   0.67
f_b,t(N_r)      0.70   0.70   0.70   0.61   0.65   0.62
f_t(N_r, v_i)   0.78   0.71   0.70   0.62   0.67   0.65
f′_t(N_r, v_i)  0.77   0.85   0.87   0.76   0.84   0.77

Table 1: Classification accuracy for different feature combinations for Experimental Setup I.

Table 1 results confirm that our neighborhood model can be successfully used to classify user political affiliation. We find accuracy results for the neighborhood model comparable to the user baseline. The combined model outperforms both the user and the neighborhood models. We find that friend and user mention neighborhoods yield the highest accuracy. The remaining results in this section are obtained using Experimental Setup II.

7 The purpose of Experimental Setup II is not to train the best model, but rather to explore the contribution of different neighborhood types. Thus, we train the classifier using all data available per user, including tweets from all neighborhood types.

Figure 3: Difference ∆d in Φ_vi classification decisions on a given random partition of a candidate-centric graph while modeling the amount of content per user for 5 (blue) vs. 100 (green) tweets: (A) ∆d with correctly classified (filled) and misclassified (not filled) vertices; (B) averaged accuracy for batched classification decisions; (C) ∆d between two independent runs (top), and between the average of two runs and the true class (bottom); (D) fraction of correctly classified (dark) vs. misclassified (light) vertices (users).

Modeling User Attribute We investigate the difference in classification decisions for our user model Φ_vi when changing the number of tweets per user. For that purpose, we take a random partition of the candidate-centric graph and perform two independent classification experiments (runs 1 and 2) using t(v_i) = 5 and t(v_i) = 100 tweets.

Figure 3A shows that more of the 100-tweet points are correctly classified. Moreover, many of the 100-tweet points lie close to the 0.5 decision probability, which suggests that the classifier is more uncertain. Figure 3B presents accuracy results averaged over 10 decision batches, e.g., 0-0.1. We present four fitted lines for two independent experiments of the Φ_vi classifier trained using 5 and 100 tweets, overlaid with the actual decision points for 100 and for 5 tweets.

Figure 3C shows that the difference ∆d in classification decisions between two independent runs for 5 (blue) and 100 (green) tweets, and the ∆d between the average of two runs and the true class, are higher for 5 tweets compared to 100 tweets. Finally, we present the fraction of correctly classified vs. misclassified users in Figure 3D. We observe that at every decision probability cutoff, e.g., 0.1, 0.2, the number of correctly classified points is higher than the number of misclassified points.

Modeling Neighbor Attribute Here we present the results for the neighborhood model. We study the influence of the neighborhood type r and size, Σ_n f_t(N_r), in terms of the number of neighbors n and tweets t per neighbor.

In Figure 4 we present accuracy results to demonstrate the influence of neighbor attributes on classification performance for G_cand and G_geo. We observe that for n ≥ 2 neighbors per user, all graph types except shared hashtag and reply outperform the user baseline (especially when t ≥ 250).

[Figure 4 plots omitted in extraction: accuracy (y-axis) vs. log(tweets per neighbor) (x-axis), with curves for friend, follower, hashtag, user mention, retweet, reply and the user baseline; panels (a) 1 neighbor and (b) 10 neighbors for the candidate-centric graph, (c) 1 neighbor and (d) 10 neighbors for the geo-centric graph.]

Figure 4: Modeling the neighbor attribute (# of tweets per neighbor) for t = [5, .., 200] tweets for the candidate-centric (above) and geo-centric (below) graphs.


Figure 5: Comparing accuracy for the best performing neighborhoods, (A) friend and (B) retweet for the candidate-centric graph, and (C) user mention and (D) friend for the geo-centric graph, shown with the corresponding cost in terms of the equivalent number of Twitter API calls: "20-25", "50", "100-125", "200-250" tweets; e.g., the "20-25" tweet lines demonstrate the difference in performance for 50 tweets from 1 neighbor, 25 tweets from 2 neighbors, 10 tweets from 5 neighbors, and 5 tweets from 10 neighbors.

Moreover, following Eq. 3 and 4, we spent an equal amount of Twitter API calls to obtain t(v_i) = 100 user tweets and t(N_r) = 10 tweets from n = 10 neighbors. We annotate these "points of equality for the cost" with a line on top marked with the corresponding number of tweets, t = [5, .., 2000].

We show that three of six G_cand (friend, follower, retweet) and G_geo (friend, user mention, retweet) graphs yield better accuracy compared to the user model when t ≥ 250, and comparable results otherwise. Thus, for cost-effective personal analytics, with the goal of classifying a given user v_i, it is better to take 200 tweets from each of 10 neighbors than 2k self-authored tweets from the user. For example, the best accuracy for the neighborhood model applied to G_cand is 0.75 for the friend, follower, retweet and user-mention neighborhoods,8 which is 0.03 higher than the user model; for G_geo it is 0.67 for the user-mention and 0.64 for the retweet neighborhoods, compared to 0.57 for the user model. Furthermore, for all neighborhood types, increasing the number of tweets per neighbor leads to a significant gain in performance.

Modeling Neighborhood Size In Figure 6 we present accuracy results showing the influence of neighborhood size on classification performance for Ggeo and Gcand. We demonstrate that increasing the neighborhood size improves performance across all six neighborhood types. Friend, user mention and retweet neighborhoods yield the highest accuracy for both Gcand and Ggeo. Moreover, we observe that for all neighborhood types, when the number of neighbors is n = 1 the difference in accuracy across Nr types is not significant, but for n ≥ 2 it becomes more significant.

8The difference across these neighborhood types for 200 tweets per neighbor is not statistically significant.

User vs. Neighborhood Model In Figure 5 we report accuracy results for the best neighborhoods and their cost following Eq. 4. We further confirm

Figure 6: Modeling neighborhood size (# neighbors per user) for n = [1, .., 10] on the candidate-centric (above) and geo-centric (below) graphs; panels (a) and (c) show 5 tweets per neighbor, panels (b) and (d) show 200 tweets per neighbor (x-axis: log(Number of Neighbors); y-axis: Accuracy; legend: Friend, Follower, Hashtag, User mention, Retweet, Reply).


that (1) increasing the number of tweets per neighbor and the number of neighbors per user improves performance, and (2) querying for more neighbors per user is more effective than querying for additional content from existing neighbors, e.g., 5 tweets from 10 neighbors is better than 25 tweets from 2 neighbors or 500 tweets from 1 neighbor.
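The equal-cost configurations compared here (and the per-line budgets in Figure 5) can be enumerated with a small sketch; the budget values and function name are illustrative assumptions:

```python
def equal_cost_configs(tweet_budget, neighbor_counts=(1, 2, 5, 10)):
    """All (n_neighbors, tweets_per_neighbor) pairs that spend the same
    total-tweet budget, i.e., an equivalent Twitter API cost."""
    return [(n, tweet_budget // n) for n in neighbor_counts
            if tweet_budget % n == 0]

# A 50-tweet budget yields the equal-cost line from Figure 5:
# 50 tweets from 1 neighbor, 25 from 2, 10 from 5, and 5 from 10.
```

The empirical finding is that, for a fixed budget, configurations toward the many-neighbors end of this list tend to classify users more accurately.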

Figures 4-6 show that our neighborhood model is more cost-effective than the user model, e.g., to achieve 0.72 accuracy for Gcand we need to issue at least 250 calls to get user tweets9 compared to only 150 calls for tweets from the local neighborhood. For Ggeo, the neighborhood model is as effective as the user model: to achieve 0.57 accuracy we need to query either 50 self-authored tweets or 50 tweets from the local neighborhood.

Neighborhood vs. Combined Model The neighborhood model can be effectively applied when a user recently joined Twitter and has not tweeted yet, or has a private profile; in such cases we might not have access to user tweets but instead have access to tweets from the local neighborhood. It is more cost-efficient than the user model, with the friend, retweet and user mention neighborhoods yielding the highest accuracy. However, if we have access to all user tweets and tweets from the local neighborhood, the combined model significantly outperforms the user model (0.1 accuracy gain) and the cost-efficient neighborhood model (0.17 accuracy gain), as shown in Table 1. The combined model performs best on the friend and user mention neighborhoods.

ZLR Dataset Results We apply our user and neighborhood models to the ZLR dataset following Experimental Setup II. We model the neighborhood size n and the neighbor attribute t and present the results in Figure 7. Zamal et al. report higher accuracy numbers for the baseline user model (UserOnly = 0.89) and the neighborhood model (Nbr-All = 0.92). Their models are trained using an SVM classifier and 10-fold cross-validation. However, because their models are trained on thousands of tweets per user or neighbor,10 they are more difficult to apply for cost-effective real-time personal analytics, similarly

9For the state-of-the-art user model trained with 250, 500, 1000 and 2000 tweets the accuracy changes are not statistically significant, as was discussed in (Van Durme, 2012b).

10The Nbr-All model is trained on approx. 6K friends per user with 642 tweets per friend.

Figure 7: Modeling the # of tweets per neighbor (top) and the # of neighbors per user (bottom) for GZLR; panels (a) 1 neighbor and (b) 10 neighbors (x-axis: log(Tweets Per Neighbor)), and (c) 5 tweets and (d) 200 tweets (x-axis: log(Number of Neighbors)); y-axis: Accuracy; legend: Friend, User.

to the models trained on 5K tweets per user suggested by Rao et al. (2010). In other words, they effectively spent all their queries on that small set of users, whereas our models are designed to take those queries and classify 10x more users instead, even if at a lesser rate of accuracy. For example, our results for the same dataset show that our neighborhood model outperforms our user baseline (0.03 improvement for t ≥ 100) and is more cost-effective, e.g., to achieve an accuracy of 0.74 it is better to query 10 tweets from 10 friends than 250 user tweets.

7 Conclusions and Future Work

We proposed a novel approach for cost-effective personal analytics which relies on both content and network structure, and therefore can be applied to a real-time data stream as well as to private and protected Twitter user profiles. We showed that for personal analytics: (I) content from the user-local neighborhood can be equally or more effective than self-authored content; (II) friend, user mention and retweet neighborhoods yield the best results; (III) to achieve higher accuracy while spending an equal amount of Twitter API calls, it is better to request 200 tweets from 10 neighbors than 2k self-authored tweets.


References

Shane Bergsma and Benjamin Van Durme. 2013. Using conceptual class attributes to characterize social media users. In Proceedings of ACL.

Shane Bergsma, Matt Post, and David Yarowsky. 2012. Stylometric analysis of scientific articles. In Proceedings of NAACL.

Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, and David Yarowsky. 2013. Broadly improving user classification via communication-based name and location clustering on Twitter. In Proceedings of NAACL.

Mikhail Bilenko and Matthew Richardson. 2011. Predictive client-side profiles for personalized advertising. In Proceedings of KDD.

Mustafa Bilgic and Lise Getoor. 2010. Active inference for collective classification. In Proceedings of AAAI.

John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on Twitter. In Proceedings of EMNLP.

Michael D. Conover, Bruno Goncalves, Jacob Ratkiewicz, Alessandro Flammini, and Filippo Menczer. 2011. Predicting the political alignment of Twitter users. In Proceedings of Social Computing.

Glen A. Coppersmith and Carey E. Priebe. 2012. Vertex nomination via content and context. Proceedings of Computing Research Repository.

Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of EMNLP.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research.

Nikesh Garera and David Yarowsky. 2009. Modeling latent biographic attributes in conversational genres. In Proceedings of ACL/AFNLP.

Jennifer Golbeck and Derek Hansen. 2011. Computing political preference among Twitter followers. In Proceedings of CHI.

Mohammad Al Hasan and Mohammed J. Zaki. 2011. A Survey of Link Prediction in Social Networks.

Nitin Jindal and Bing Liu. 2008. Opinion spam and analysis. In Proceedings of WSDM.

Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of WWW.

Vasileios Lampos, Tijl De Bie, and Nello Cristianini. 2010. Flu detector: tracking epidemics on Twitter. In Proceedings of ECML PKDD.

Vasileios Lampos, Daniel Preotiuc-Pietro, and Trevor Cohn. 2013. A user-centric model of voting intention from social media. In Proceedings of ACL.

Miller McPherson, Lynn Smith-Lovin, and James M. Cook. 2001. Birds of a feather: homophily in social networks. Annual Review of Sociology, 27(1).

Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010a. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of ICWSM.

Brendan O'Connor, Jacob Eisenstein, Eric P. Xing, and Noah A. Smith. 2010b. A mixture model of demographic lexical variation. In Proceedings of the Machine Learning in Computational Social Science.

Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T. Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of ACL.

Michael J. Paul and Mark Dredze. 2011. You Are What You Tweet: Analyzing Twitter for Public Health. Artificial Intelligence.

Marco Pennacchiotti and Ana-Maria Popescu. 2011a. Democrats, republicans and starbucks afficionados: user classification in twitter. In Proceedings of KDD.

Marco Pennacchiotti and Ana-Maria Popescu. 2011b. A machine learning approach to Twitter user classification. In Proceedings of ICWSM.

Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. 2010. Classifying latent user attributes in Twitter. In Proceedings of SMUC.

Delip Rao, Michael Paul, Clay Fink, David Yarowsky, Timothy Oates, and Glen Coppersmith. 2011. Hierarchical Bayesian models for latent attribute detection in social media. In Proceedings of ICWSM.

Philip Resnik. 2013. Getting real(-time) with live polling. http://vimeo.com/68210812.

Matthew Rowe, Milan Stankovic, and Harith Alani. 2012. Who will follow whom? Exploiting semantics for link prediction in attention-information networks. In Proceedings of ISWC.

Mike Thelwall. 2009. Homophily in MySpace. Journal of the American Society for Information Science and Technology, 60(2).

Benjamin Van Durme. 2012a. Jerboa: A toolkit for randomized and streaming algorithms. Technical report, Human Language Technology Center of Excellence.

Benjamin Van Durme. 2012b. Streaming analysis of discourse participants. In Proceedings of EMNLP.

Faiyaz Al Zamal, Wendy Liu, and Derek Ruths. 2012. Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In Proceedings of ICWSM.