
UNIVERSITY OF AMSTERDAM

FACULTY OF SCIENCE

MASTER THESIS

Content Recommendation in Social Media

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

IN

ARTIFICIAL INTELLIGENCE

Author: Mathias BREUSS

Supervisor: Dr. Manos TSAGKIAS

March 25, 2013


Abstract

Where some say we suffer from information overload, others say we face an information filtering failure, and yet others claim it is neither. Undoubtedly, this challenge is caused by the democratization of publishing technologies. This democratization leads to an overwhelming number of content producers, from which we are free to choose according to our personal interests. Employing additional technology to enhance our perception in this rough sea of information is not just necessary but a simple consequence of our transition into the information age.

To participate in the efforts of this transition, we propose a system that re-ranks tweets that contain URLs according to a user's personal interests. We make use of several signals to form neighborhoods of similar users from whom new items can be taken for recommendation. We employ machine learning to find the best combination of such signals. Further, content-based models using information retrieval methods allow us to directly recommend items according to the user's profile, where the profile is built from items the user has shared. We use late data fusion to combine these two methods.

We discuss several evaluation strategies to find the most suitable strategy for evaluating our models. We experiment with two different datasets, one representing average users and the other representing influential users. Given the evaluation strategies employed, our models achieve up to 400% of the baseline's performance. We thoroughly discuss these results and the underlying recommendations.


Contents

Abstract

1 Introduction
  1.1 Research Outline and Questions
  1.2 Main Contributions

2 Related Work
  2.1 Information Overload
  2.2 Timeline Reorganization / Filtering
  2.3 Recommendation
    2.3.1 Retweet prediction
    2.3.2 URL recommendation

3 Personalized Recommender Models for Social Media
  3.1 Problem Formulation
  3.2 Baseline Models
    3.2.1 Random baseline
    3.2.2 Chronological baseline
  3.3 User-based Nearest Neighbor Models
    3.3.1 Common items neighborhood
    3.3.2 User-mention neighborhood
    3.3.3 Temporal neighborhoods
    3.3.4 Intrinsic tweet features-based neighborhoods
    3.3.5 Social features-based neighborhood
    3.3.6 Item popularity based neighborhood
    3.3.7 Similarity measures
  3.4 Content-Based Models
    3.4.1 Information retrieval vector space recommendation, tf-idf
    3.4.2 Probabilistic retrieval models
  3.5 Combination of User Similarity Based Recommendation Models
    3.5.1 Combination by z-scoring
    3.5.2 Combination by supervised learning
  3.6 Late Data Fusion

4 Evaluation
  4.1 Incremental Data Splitting
  4.2 Evaluation on User Sessions
  4.3 Evaluation Metrics

5 Experimental Setup
  5.1 Experiments
  5.2 Dataset
    5.2.1 Finding the average user
    5.2.2 Influential users
    5.2.3 Preprocessing
    5.2.4 Evaluation
  5.3 Feature Selection for Supervised Machine Learning

6 Results and Analysis
  6.1 The Average Users
    6.1.1 Performance of user-based models
    6.1.2 Combination of user-based models
    6.1.3 Performance of content-based models
    6.1.4 Performance of hybrid models
    6.1.5 Session-based evaluation methodology
    6.1.6 Remaining data
  6.2 The Influential Users
    6.2.1 Performance of user-based recommendation models
    6.2.2 Performance of content-based models
    6.2.3 Performance of hybrid models

7 Discussion
  7.1 Results of the Experiments on the Average User Dataset
    7.1.1 Per user performance analysis
    7.1.2 Fusion source, where do the recommendations come from
    7.1.3 Influence of user activity on recommendation performance
  7.2 Results of the Experiments on the Influential User Dataset
    7.2.1 Per user performance analysis
    7.2.2 Fusion source, where do the recommendations come from
  7.3 Relationship Between the Two Datasets
  7.4 Algorithm Analysis
    7.4.1 Common items neighborhood recommendation
    7.4.2 User-mention neighborhood
    7.4.3 Temporal neighborhoods and intrinsic tweet features-based neighborhoods
    7.4.4 Social features-based neighborhood
    7.4.5 Item popularity-based neighborhood
    7.4.6 Combination by supervised learning
    7.4.7 Recommendation list generation
    7.4.8 Information retrieval approaches

8 Conclusion
  8.1 Main Findings
  8.2 Future Work

Page 8: Content Recommendation in Social MediaTwitter as subject of this work. Blogging means writing a weblog, also known as blog. Nardi et al. [58] provide an ethnographic investigation

8 CONTENTS

Page 9: Content Recommendation in Social MediaTwitter as subject of this work. Blogging means writing a weblog, also known as blog. Nardi et al. [58] provide an ethnographic investigation

List of Notations

λ  A coefficient to control the influence of models in a mixture model
λ_t  The parameter for the probability distribution of term t in the collection
|·|  Length or number of elements
||f||  Euclidean norm of vector f
S  A vector of random variables denoting the observed feature values (scores)
s  A particular observed feature value vector
f  Frequency vector for a seed user
f_{au,b_τ}  Frequency vector for associate user au from bin b_τ
f_{au,P_τ}  Frequency vector for associate user au from P_τ
f_{au}  Frequency vector for associate user au
f_{b_τ}  Frequency vector for the seed user from bin b_τ
f_{P_τ}  Frequency vector for a seed user from P_τ
V(d)  Frequency vector of document d
C  Set {relevant, not relevant} of classes
τ  Point in time that splits past and future
AU  Set {au_1, ..., au_n} of a seed user's associate users
au  Associate user, friend (followee) or follower
AU_{P_τ}  The seed user's associate users from P_τ
AveP_{su,τ}  The average precision for a seed user su and time τ
B  Set {b_1, ..., b_l} of bins
b  Bin that contains items (URLs)
b_τ  Bin at time τ
C  A collection of documents
c_{au,b_τ}  Number of times an associate user was mentioned by the seed user in bin b_τ
c_{au,P_τ}  Number of times an associate user was mentioned by the seed user in P_τ
d  A document in a document collection C
F_τ  Set of all items shared in the future, after time τ
I  Set {i_1, ..., i_o} of items shared by a user
i  Item, e.g. a URL
I_{au,P_τ}  The items of an associate user au from P_τ
I_{P_τ}  The items of a seed user from P_τ
I_{su,F_τ}  The relevant items of a seed user su from F_τ
idf_t  The inverse document frequency of term t
IS_i  Score of item i
J_{au,P_τ}  Jaccard similarity coefficient of a seed user and an associate user au from P_τ
L  List of recommendation lists
M  Length or number of elements of a vector
N_{P_τ}  Neighborhood of a specific seed user from P_τ
p(·)  Probability
P(k)  The precision of the recommendation list at length k
P_τ  Set of all items shared in the past, before or at time τ
P_i  All popularity indicators for item i
PI_i  Popularity indicators for item i's URL
PIN_i  Popularity indicators for item i's normalized URL
q  A query against a document collection C
r  List of recommendations
R_{P_τ}  The recommendation list for a seed user with items from P_τ
R_{su,P_τ}  A recommendation list for seed user su with items from P_τ
rank(i)  The rank of item i
rank_r(i)  The rank of item i in list r
S_{random}  Random score
S_{au,P_τ}  The score of an associate user from P_τ
S_r(i)  The score of item i in list r
score(q, d)  The score of query q given document d and a retrieval model
S^{fu1}_i  Score of item i using the first fusion approach, defined as the reciprocal of the sum of the reciprocals of weighted item scores
S^{fu2}_i  Score of item i using the second fusion approach, which corresponds to WCombSUM
sim(·, ·)  Cosine similarity
SU  Set of seed users
su  A specific seed user
SUM_{P_τ}  Set of associate users that were mentioned by the seed user
t  A term in a query or document
td_t  The normalized form of tf_{t,d}
TD  Set of time differences accounting for the time a user needs to create the second tweet after the first
TD_OFF  Set of time differences during an inactive period of a user
TD_ON  Set of time differences during an active period of a user
terms(i)  The terms of item i
tf_{t,d}  The term frequency of term t in document d
TS  Set of timestamps of tweets a user created
ts(x)  Timestamp of x
w_r  The weight of a specific recommendation list
X_t  A random variable that counts the occurrences of term t
x_{q,t}  The number of occurrences of term t in q


Chapter 1

Introduction

The number of active social media users surpassed one billion in 2011 [80]. The sheer number does not just point out the popularity of social media but also hints at its manifold incarnations. Social media is not a very specific concept, and many attempts to define and explain its meaning exist. Kaplan and Haenlein [49] discuss the manifold aspects of social media and provide clarification of what social media is. The most outstanding aspect of social media is its difference from traditional media such as television, radio and newspapers. In traditional media or mass media the content flow is directed from few producers to many consumers. In social media the means of content generation are put into the hands of the former consumers, and the way information flows is much more erratic. Consumers and producers change their roles repeatedly and in an instant. This explanation defines how the media part of social media is different from traditional media; still, it is necessary to explain why we call this new media social. The social aspect might be strongly related to the fact that content is not just generated by individual users but in collaboration with other users. Also, the term content becomes much more diverse, as social media is not just used to produce and spread content but also for users to communicate with each other.

With consumers becoming producers, the amount of content generated reaches whole new magnitudes. The areas of life it relates to, and the quality of the content, become much more diverse. This leads to several new social and technical challenges for us consumers. It becomes harder to find and choose the relevant content a specific user desires, a problem commonly known as information overload¹, which will be discussed later.

¹ Information overload describes a state where a person has difficulties in making decisions because too much information is available.

From the many different incarnations of social media we choose a microblogging service called Twitter as the subject of this work. Blogging means writing a weblog, also known as a blog. Nardi et al. [58] provide an ethnographic investigation of blogging and uncover the range of motivations driving individuals to create and maintain blogs. Microblogging stands for a form of communication in which users can describe their current status in short posts distributed by instant messages, mobile phones, email or the web [46]. Twitter was founded in March 2006 and has since become increasingly popular. Messages on Twitter, called tweets, are limited to 140 characters. A user can see the messages of the other users they follow on a timeline. The messages are ordered chronologically, with the youngest message coming first. The relationships that build up the social network of a Twitter user are directed and not necessarily reciprocal, meaning that a user can follow another user without this other user following back. In this case the user receives all the messages of the person being followed, but the followed user does not. A very important feature of Twitter is the ability of a user to easily retweet the tweet of another user. Doing so allows the dissemination of content beyond the social network of the original author. Besides many other features, Twitter provides an open platform with an API that can be used by other service providers. This openness has also attracted many researchers, as the API provides relatively easy access to real user-generated content. Similar to blogs, the reasons why people use Twitter are diverse. People talk about their daily activities and seek or share information [46]. "People" is a rather general term in this context: single individuals use the platform centered around their private lives and personal interests, while large corporations, politicians, and non-profit organizations use it to communicate with interested parties. While the reasons for people to use Twitter are quite diverse, one of the most common ways to use Twitter is to share URLs [32]. However, Twitter usage, such as the sharing of URLs, also differs with the language used by the Twitter user [82]. In our work we focus on tweets that share URLs.

1.1 Research Outline and Questions

The work in this thesis focuses on building a personalized recommendation system that helps Twitter users find the most relevant URLs shared by the people in their social network. In doing so, we address the main issue of information overload introduced above. We focus on the one type of content (URLs) that is most commonly shared via Twitter after plain tweets.

Given the above objective, we aim to answer the following research questions:

• RQ 1: Can the user experience while consuming a Twitter timeline be improved, with respect to the tweets that share URLs, by solely considering content features and 1st degree network features?

• RQ 2: Are user neighborhoods an effective source of items to recommend, and which of the different features lead to the best performing neighborhoods? Features we consider are items that were co-shared by users, mentions by seed users, temporal information of users' tweet activity, etc.

• RQ 2.1: Can user neighborhoods be combined to form a new source of items to recommend, and which combinations prove most effective?

• RQ 3: Are content-based approaches effective in recommending relevant items, and which content-based approaches perform best?

• RQ 4: Can user neighborhood- and content-based approaches be combined to improve recommendation effectiveness?

• RQ 5: Is a realistic evaluation methodology feasible given the data at hand?

This thesis is organized as follows. In Chapter 2 we give a small introduction to scientific work that revolves around Twitter and deepen our discussion with work that is closely related to this thesis. We formulate the problem and describe the models we developed in Chapter 3. In Chapter 4 we discuss how to evaluate the models most realistically. The experiments, their relationship to the research questions, and the underlying data are discussed in Chapter 5. We document and analyze the results of these experiments in Chapter 6. Further, we discuss our results and the algorithmic complexity of the developed models in Chapter 7. Finally, we conclude in Chapter 8 by revisiting and answering the research questions, and by giving an outlook on potential future work.

1.2 Main Contributions

In the following we list the main contributions of this work:

• We provide a small survey of scientific work that revolves around Twitter. The survey closes in on related work dealing with information overload, recommendation, filtering and re-ranking in the context of social media and Twitter.

• We develop a set of models that provides recommendations by solely considering features of users and their properties to find the most similar users.

• We develop another set of recommendation models that exploits the content of items shared by users.

• We discuss evaluation strategies and introduce an improved strategy that takes user sessions into account.

In this chapter we introduced the subject of this thesis, the research questions, and our main contributions. In the next chapter we discuss related work and work that revolves around Twitter as a social media infrastructure.


Chapter 2

Related Work

Twitter has attracted a lot of attention from the research community for a long time already. This attention has led to an abundance of scientific work touching various aspects of Twitter and microblogging in general.

A sample of past work done in this area can be classified into the following categories:

General work about microblogging and social networks: Soon after its launch, Twitter became subject to intense scientific research. The authors of [46] study topological and geographical properties of Twitter's social network and analyze users' intentions. They find that people use Twitter to talk about daily activities and to seek or share information. Kwak et al. [50] perform a quantitative study with a focus on information diffusion. Yang and Counts [86] perform a network analysis of information diffusion on Twitter. Their model captures the speed, scale, and range of information diffusion. Nagarajan et al. [57] analyze tweets about three different events, the Iran Election, the Health Care Reform debate and the International Semantic Web Conference, with respect to the properties of retweet behavior for the most tweeted / viral messages. In [56] information diffusion is analyzed by looking at how different contagions interact with each other while spreading through the Twitter network. The authors find that interactions (leading to suppression or promotion) cause a relative change in spreading probability of 71% on average. The authors of [43] find that Twitter usage is driven by a sparse and hidden network of connections beneath the "declared" set of friends and followers.

Search with and in microblogging: With the increasing popularity of microblogging services such as Twitter, the desire arises to make information accessible given a specific information need. Duan et al. [26] develop a model for real-time tweet search which does not only use the content relevance of a tweet, but also the user's authority and tweet-related features such as whether a URL link is included in the tweet. Using learning to rank and BM25 as the information retrieval method, experiments are conducted on a corpus of 500 tweets with relevance annotations and predefined query terms. Besides searching on a microblogging platform, microblogging platforms can be used to increase the novelty of search results in web search, as discussed in [25]. Huang et al. [42] predict a quality label for a tweet using several quality features like the length of a tweet, term similarity, ratio of unique words, authority, etc. Central to this approach is the assumption that documents similar in content might have similar quality. For evaluation, 9,000 tweets are annotated with a quality label and used for training and testing.

Conversation: Besides sharing information, Twitter is also used for conversational purposes. In [15] the conversational aspects of retweeting are discussed. For an ongoing conversation it is useful to know which dialog act (question, answer, statement) a certain tweet represents. An unsupervised model addressing this challenge is discussed in [68]. A next step is the recommendation of conversations, for which different algorithms are discussed in [19].

Topic modeling and detection: Knowing what concepts users talk about on Twitter can be useful for many follow-up tasks. Michelson and Macskassy [55] use entities mentioned in tweets to build a topic profile of a user by employing a knowledge base for entity disambiguation and classification. Another approach, [66], makes use of a partially supervised learning model described in [65] to map users and tweets into dimensions that correspond roughly to topics like substance, style, status and social characteristics of a post. These mappings are then used to re-rank a user's timeline or to recommend a new user to follow.

Contextualization: A more specific notion of a concept is a named entity. The performance of traditional natural language processing (NLP) tools like named entity recognition (NER), which were trained on news corpora, is weak on tweet corpora [69]. Ritter et al. [69] rebuild the NLP pipeline from part-of-speech tagging over chunking to NER to address the different challenges of short text. Another approach to NER is described in [48], where clusters of microtexts are built using contextual, social, semantic and temporal associations. Meij et al. [54] propose a solution to the problem of determining what a microblog post is about. They add semantics to posts by automatically identifying concepts that are semantically related to a post and generate links to the corresponding Wikipedia articles. One step further is the identification of aspects of an entity of interest, such as products, services, and competitors. Spina et al. [76] analyze methods which identify such aspects.

User Influence: To become influential on Twitter, knowing the influence and passivity of other users is helpful. An approach described in [70] determines these factors based on the information forwarding activity of web links on Twitter. Cha et al. [16] analyze three influence measures, in-degree, retweets, and mentions, and compare them across topics and time. Others introduce metrics that directly measure the influence of users, like TwitterRank [83], which is an extension of PageRank. Companies like [2] and [4] provide influence metrics as a service.

Social Link / User Recommendation: Predicting links within social networks, as described in [51], is also dealt with in the context of Twitter. Armentano et al. [12] recommend new users by exploring the topology of the network in a target user's neighborhood. Instead of considering networks, Hannon et al. [38] recommend users by profiling users based on their tweets.

Prediction of X from People's Tweets: Often the Twitter-sphere is at least partly considered an image of society, and thus many hope to use people's tweets to predict certain real-world outcomes. In [60] these outcomes are IMDb movie ratings, for which a linear regression model is used to predict their values. In the work done in [13], box-office revenues are predicted using a simple model built from the tweet rate of a particular topic. In the context of the German federal election, Tumasjan et al. [79] study Twitter usage and find that the mere number of messages mentioning a party reflects the election results.

Event Detection: Twitter's real-time nature also gives rise to applications where events need to be detected or certain aspects of specific events have to be determined. Sakaki et al. [71] consider Twitter users as sensors to estimate the centers of earthquakes and the trajectories of typhoons. In [10] events are not just detected; relevant information is also collected, filtered and semantically enriched to provide faceted search and real-time analytics.

News Stream Aggregation: Other methods and services use Twitter as an additional channel in a personal aggregation framework or to ensure the hotness of news, as in [72], where Twitter is used to capture tweets that correspond to late-breaking news. A similar service is [5], where personalization is achieved by using a user's Twitter account. Similarly, [6] uses the user's Twitter history and Google Reader to provide a personalized experience. Also [1] aggregates news from different channels, and [3] provides personalized content from RSS & social feeds.

User Modeling: Due to the importance of personalization, others focus on the notion of user modeling. A framework that enriches the semantics of tweets and identifies topics and entities is used to build user profiles in [7], [8], [78], which then allows linking to news articles.

2.1 Information Overload

Having outlined the various types of work done around Twitter, we close in on the type of work that explicitly deals with information overload.

In addition to the areas outlined above, there is work available in the area of information overload in the context of Twitter and microblogging. Grineva and Grinev [33] discuss information overload in the context of social media streams. Approaches to tackle the problem are separated into approaches that filter the noisy stream in an algorithmic fashion, approaches that divide streams into sub-streams, and approaches that make use of hand-curated streams. The work found to be closely relevant to this thesis can be classified as discussed below.

2.2 Timeline Reorganization / Filtering

Comarela et al. [22] show evidence that the information overload problem is present. They identify factors that influence users' response and retweet behavior, and develop a model to re-rank a user's timeline. The timeline is re-ranked by putting items the user will most likely interact with, by responding or retweeting, first. The features used in these models are mainly activity-related features like the time between a tweet's arrival and the time the user reacts to it. Other features include the age of a tweet, prior interaction with the sender, and the activity of the sender. Textual features like the presence of hashtags, mentions and URLs, and the size of the tweet are used as well. The authors evaluate their system using sessions where users are active and not active. The parameters were determined by a preceding analysis of sample users with a certain minimum activity. The sessions are then used to create the user's timeline, which is then reorganized according to the model.

Bernstein et al. [14] organize a user's timeline into trending topics which can then be further explored by the user. The tweets in the user's timeline are transformed into queries for an external search engine, and the query results are then used to determine popular terms to be assigned as potential topics for the tweets.

Yan et al. [85] introduce a model that ranks tweets and their authors simultaneously for tweet recommendation. The model uses the social network connecting the users, a network connecting the tweets, and a third network that ties the two together. For the first two a co-ranking algorithm is used. The links connecting the tweets are established by measuring how semantically similar any two tweets are. The model produces a ranking of tweets which is then compared against a gold-standard ranking based on whether a tweet has been retweeted or not. In addition to the gold-standard ranking, the model's ranking is also judged by humans. They used 23 human assessors as seed users for whom Twitter data was crawled, as well as their followees and followers, for whom again connected users were crawled until no new users were added. The final dataset consisted of 9M users, 250M tweets, 600M links and 55M retweets over two months. LDA topics are used for personalization. Although the authors showed the competitiveness of their model by comparing it against three models introduced by other authors, they did not reason about the computational costs and the feasibility of this approach.

To recommend tweets, Lu et al. [52] map Twitter messages to concepts from Wikipedia by employing Explicit Semantic Analysis as proposed in [31]. A user profile consists of concepts a user tweeted about, and in addition the affinity between users is measured by their interaction. To find related concepts a Markov random walk is used. Cosine similarity is then used to calculate the ranking score for a tweet, which is represented by Wikipedia concepts and related users. The model is evaluated on a set of users for whom the latest 1,000 tweets from the user's timeline were crawled, with tweets that were favorited, retweeted or replied to, as well as some of the user's own tweets, serving as relevant tweets. Although the approach is quite interesting, the evaluation strategy was not explained.
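The core matching step shared by several of the approaches above is a cosine similarity between sparse feature vectors (here, concept weights). A minimal illustrative sketch follows; the profile contents are hypothetical and this is not the implementation of [52]:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors given as dicts
    mapping a feature (e.g., a Wikipedia concept) to its weight."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical concept weights for a user profile and a candidate tweet.
user_profile = {"Machine learning": 0.8, "Twitter": 0.5, "Amsterdam": 0.2}
tweet_concepts = {"Machine learning": 0.6, "Recommender system": 0.7}
print(cosine_similarity(user_profile, tweet_concepts))  # ~0.54
```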

Gu [34] introduces an information filtering system for Twitter which focuses on Twitter's list feeds. List feeds already tend to be focused on specific topics but still contain irrelevant messages, which are filtered out by this system. The system first identifies the list's key topic and then classifies whether tweets are relevant with respect to the identified topic. For classification several types of features are used, starting with textual features, where the similarity of a post's text to the topic of the list is measured. In addition to textual features, authorship and social network features are exploited, including features like the number of followers in the list and TunkRank [11]. Temporal features like "hour of the day" and "day of the week" are used to discover patterns related to the time a post was created. Finally, the domains of hyperlinks are used as another feature. Evaluation was done using a corpus of tweets from nine lists, manually labeled with respect to relevance given the list's topic.

To select tweets from outside the user's network for recommendation, Pennacchiotti et al. [63] build user profiles from a user's tweets and the tweets of the user's friends to estimate an interestingness function. The estimation is based on co-occurring terms and pairs of terms to score the similarity of two sets of tweets (user tweets and candidate tweets). For evaluation, data from Twitter's sample stream¹ is used to build a corpus from which 250 users are selected. 90% of each user's tweets are used to build the profiles, and the remaining 10% are mixed with all the tweets from other users in the corpus for testing. Tweets from the 10% set are considered correct recommendations. A user study evaluation is also performed.

¹ The sample stream provides 1% of the tweets currently disseminated.

As a foundation for timeline reorganization and filtering, Padmanabhan and Chakraborti [61] introduce a model that finds relevant tweets given a query tweet. The authors employ features based on the social network, but also content-based similarity features. They evaluate on a set of 200 candidate tweets for each of 50 query tweets, where the relevance of the candidate tweets is judged by humans.

Chen et al. [20] use topic-level latent factors of tweets to capture users' common interests in tweet content, as well as users' social relations. The model also incorporates explicit features such as the authority of the publisher and the quality of the tweet. The authors evaluate on a dataset created by randomly selecting users and, for those users, following their followers' and followees' links for an unspecified number of steps. Finally, they only consider the small portion of users having over fifteen followees in the previously specified set. Retweeted tweets are considered positive samples and the others negative samples. Further preprocessing and selection is performed before the final training and test data is compiled.

In [41] the authors want to initiate an open discussion on how to build effective systems for ranking social updates from the perspective of LinkedIn. The problem is addressed by using learning to rank, collaborative filtering and click-through modeling. In the context of LinkedIn, the authors build models making use of linear, latent bias and latent factor models.

2.3 Recommendation

Researchers rely on retweets as implicit feedback on user interest in specific tweets. Probably for this reason, many researchers describe their work as being about retweet behavior.

2.3.1 Retweet prediction

Suh et al. [77] conduct an exploratory data analysis using Principal Component Analysis and a Generalized Linear Model to determine how features like whether a tweet contains a URL, hashtags or mentions, the number of followers and followees of a user, the age of a user account, the number of favorite tweets, the number of shared tweets, and tweeting frequency affect whether a tweet is retweeted or not. The authors find that retweetability is significantly associated with these features.

Zaman et al. [88] develop a probabilistic collaborative filtering model to predict future retweets. The features found to be most important are the identities of the source of the tweet and of the retweeter. The data is built from tweets collected within a time window of an hour, and any retweet of these tweets within an hour after the original tweet was created. The retweets are used as positive feedback for the model, and the negative feedback is derived from a retweet network consisting of active users who have retweeted or have been retweeted.

Yang et al. [87] propose a factor graph model to predict users' retweeting behavior. The paper deals with two aspects of retweeting: first, whether users will retweet a specific tweet to their friends after viewing it; second, the range of spread for a new tweet. Additionally, a statistical analysis of the tweeting and retweeting behavior of the users within their dataset is provided.

Hong et al. [40] study the characteristics of popular Twitter messages to predict whether a message will be retweeted and how often. For training, the first n-1 retweets are used as positive examples, and the last retweet plus all other messages that were not retweeted as negative examples. The number of retweets is used to train a multiclass classifier that predicts ranges of possible retweet counts for a message. The best performing combination of features includes the content and topical information of messages, graph-structural properties of users, and the temporal dynamics of users' retweet chains.

Petrovic et al. [64] conduct a human experiment on predicting whether a tweet will be retweeted. The test subjects were given pairs of randomly ordered tweets, where one tweet of each pair had been retweeted. Humans perform above chance on this task, which leads the way to a machine learning approach based on the passive-aggressive algorithm [23]. The method makes use of a global model and a local hourly model that is trained only on tweets from a certain hour of the day. Features used are social features like the number of followers, friends, statuses and favorites, and the number of times a user was listed, and tweet content-related features like whether the tweet is in English, the number of hashtags, mentions, URLs and trending words, the length of the tweet, novelty, whether the tweet is a reply, and the actual words in the tweet. One critical aspect of this work is the use of Twitter's sample stream to acquire the data. It can be assumed that many retweets were lost due to the use of the sample stream as the source.

Peng et al. [62] use Conditional Random Fields to predict the retweet decisions of users within a targeted network. The computational costs of this approach are reduced by partitioning the data (network) with a minor loss in performance. Despite the good results reported for this method, the paper is relatively vague in describing the data used for the experiment.

Uysal and Croft [81] propose two methods for predicting retweet behavior. First, incoming tweets for a specific user are ranked according to the likelihood that this user is going to retweet them. Features used for this method are based on the author of the tweet, syntactic features of the tweet itself, the information content of the tweet, and features related to the user who consumes the incoming tweets. Special attention is paid to selecting active users with enough retweets. Second, for a specific tweet, a ranking of the users who would most likely retweet this tweet is created, using similar feature sets as above. Here, special attention is paid to selecting a sample of tweets where each tweet was retweeted by at least one user.

Naveed et al. [59] develop a prediction model to forecast the likelihood of a retweet for a given tweet. This work emphasizes using only content-related features, like whether a tweet is a direct message, contains URLs, usernames and hashtags, exclamation and question marks, positive and negative terms, and emoticons. These features are called low-level features, in contrast to the so-called high-level features like sentiments and terms, where for each message the odds of being retweeted are calculated based on the terms it contains. The structure of the social network is not considered in this work. A logistic regression model is trained on the first 75% of the tweets to predict the retweet likelihood of the tweets in the remaining 25% used as testing data.
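To make this kind of setup concrete, the following is a minimal sketch of a content-only retweet likelihood model with a chronological 75%/25% split. The features and data below are hypothetical stand-ins, not the features or data of [59]:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary content features per tweet, in chronological order:
# [contains_url, contains_hashtag, contains_mention, has_exclamation, has_question]
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(1000, 5)).astype(float)  # stand-in feature matrix
y = rng.integers(0, 2, size=1000)                     # 1 = tweet was retweeted

split = int(0.75 * len(X))          # first 75% for training, rest for testing
model = LogisticRegression().fit(X[:split], y[:split])

# Predicted retweet likelihood for the chronologically held-out 25%.
likelihood = model.predict_proba(X[split:])[:, 1]
print(likelihood[:5])
```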

2.3.2 URL recommendation

In contrast to the papers above, where tweets are considered in general, the following work looks specifically at URLs or tweets containing URLs, and can be considered closest to our work.

Chen et al. [18] study a URL recommendation system for Twitter with respect to different design decisions. The system considers URLs posted by followees and URLs posted by followees of followees of a specific user. Popular Twitter URLs from a third-party service are used to support exploration by introducing URLs from outside a user's network. The interests of a specific user are determined by considering the user's tweets and the tweets of the user's followees. The topic of a specific URL is modeled using the terms accompanying the URL in the tweet. Cosine similarity is then used to match the user's profile with the URL topic. Social voting is used for ranking, giving more authority to users with many followers and users that tweet less. The effectiveness of this approach is demonstrated with a controlled field study in which 44 Twitter users receive recommendations of URLs which have to be assessed as interesting or not interesting. Pilot interviews are used to justify the impact of the different design decisions on the system's performance.

Using the followees of the followees as input for the URL candidate set might already be infeasible, or at least very expensive, given the rate limitations of the Twitter API. Our work focuses only on the followees directly associated with the users, and partly also on the users' followers. In a realistic setup, data of these directly related users can be acquired through Twitter's streaming API in real time and at low cost.

Galuba et al. [32] develop a model to predict which users will tweet which URLs, given a training set of existing URL mentions. First, they crawl Twitter's search API for tweets containing the term http, for 300 hours. For each tweet they also retrieve the author and the author's followers to build a user graph. On this data an empirical analysis is conducted. The observations made in this analysis are used to construct two models. The data is then split into 150 hours of training data, where only URLs are considered that were tweeted more than 5 times; the remaining 150 hours serve as test data, considering 100 random URLs that were mentioned at least 10 times in both the training and the test data.

Their evaluation and data acquisition strategy is less focused on testing the recommendation and re-ranking of items for users, and more on the likelihood with which a URL will be retweeted. Although it might form the basis of a recommendation system, it has no notion of user personalization.

Abel et al. [9] study how topics and user interest in topics evolve over time. This study is conducted in the context of the events around the Egyptian revolution. They find that topics discussed on Twitter are represented by different concepts (entities, hashtags) with different relevance for the given topic. They also find that the concepts used can change over time, and that the interests of a specific user likewise evolve over time. Given these findings, they develop a time-sensitive user model where concepts are weighted using the temporal distance from a specific point in time to the time the tweet related to the concept was authored. Based on their previous work, Abel et al. [8] build a Twitter-based URL recommendation system and analyze how the addition of the time-sensitive user model influences system performance. Using the URLs re-tweeted by a specific user as ground truth, they create recommendations on a daily basis for a period of ten days. Finally, they also look into the influence of a user's activity on recommendation performance.

It is not clear where their candidate URLs come from, or how they selected the sample users. It is also not explained how the sample users' friends contribute URLs to the candidate set. Our models consider URLs from friends and followers for recommendation.

In this chapter we gave a small survey of work that revolves around Twitter. We introduced related work and explained its relationship to this thesis. In the next chapter we introduce our personalized recommender models that help Twitter users find the most relevant URLs shared by the people in their social network.


Chapter 3

Personalized Recommender Models for Social Media

In this chapter we introduce the various models developed and their underlying hypotheses. We develop user-based, item-based, and hybrid recommendation models. First, we introduce models which use user-based features to build what is called in [45] user-based nearest neighbor recommendation. Second, we introduce item-based recommendation models which make use of item features, e.g., the text that accompanies a URL in a tweet, and the text of the website linked to by a URL. Finally, we show how to combine different user neighborhood-based models, and create a hybrid recommendation model that combines user neighborhood-based and item-based recommendation models.

3.1 Problem Formulation

In this section we formally define the problem, and describe the notation which is used throughout this work.

Each seed user, a Twitter user for whom we crawl friends, followers and tweets to generate recommendations, has a set of associate users (friends and followers) AU = \{au_1, \dots, au_n\}. The URL items are denoted by I = \{i_1, \dots, i_o\}. Items are shared by users at a specific point in time and are put into bins that represent a certain time interval, corresponding to the item's timestamp. These bins are denoted by B = \{b_1, \dots, b_l\}. With respect to our evaluation strategy (growing past, shrinking future) explained in Section 4.1, we define past and future as follows. We use past as a synonym for an itemset which contains the items shared up to a certain point in time, and future as a synonym for an itemset that contains the remaining items from this point in time until the time of the last item in the data collection. For a given time τ : 1 ≤ τ < l the past is denoted by:

P_\tau = \bigcup_{q=1}^{\tau} b_q \qquad (3.1)


and the future is denoted by:

F_\tau = \bigcup_{q=\tau+1}^{l} b_q \qquad (3.2)

The recommendation system uses a seed user's associate users and their items from the past to build a candidate set of items to recommend to the seed user. A neighborhood N_{P_\tau} for a specific seed user is defined as:

N_{P_\tau} = \{ \langle au, S_{au,P_\tau} \rangle \mid au \in AU_{P_\tau} \} \qquad (3.3)

where P_\tau is the past (the bins from which we take the candidate items), au is an associate user, S_{au,P_\tau} is the score of this associate user with regard to the user's similarity with the seed user, and AU_{P_\tau} is the set of associate users available from P_\tau.

Items are recommended by selecting the associate user au that is most similar to the seed user, using au's items I_{au}, and then proceeding to the second most similar associate user. In the introduction to this chapter we called this model a user-based nearest neighbor recommendation. The recommendation list R_{P_\tau} is defined as:

R_{P_\tau} = \{ I_{au_1,P_\tau}, I_{au_2,P_\tau}, \dots, I_{au_n,P_\tau} \mid S_{au_1,P_\tau} > S_{au_2,P_\tau} > \dots > S_{au_n,P_\tau},\ au \in AU_{P_\tau} \} \qquad (3.4)

where I_{au_1,P_\tau} is the set of items of an associate user au from the past P_\tau, S_{au_1,P_\tau} is the score of this associate user, and AU_{P_\tau} are the seed user's associate users. The itemsets are sorted with the itemset of the most similar associate user coming first.

In Section 3.5 we will discuss how to combine various user-based nearest neighbor recommendation models.

Items from associate users can also be directly recommended by considering properties of items rather than properties of users. We call such models item-based recommendation models. We will discuss the recommendation of items based on item properties in Section 3.2.2 and in Section 3.4.

In Section 3.6 we discuss how we use late data fusion to combine user-based nearest neighbor recommendation models with item-based recommendation models.

3.2 Baseline Models

We start by explaining two baseline models that help us to put the performance of the other models into perspective: (a) a random, and (b) a chronological baseline.

3.2.1 Random baseline

For a specific seed user we take all the associate users au ∈ AU and assign a random score between 0 and 1 to each of them. The neighborhood N_{P_\tau} of a seed user is then defined by:


N_{P_\tau} = \{ \langle au, S_{random} \rangle \mid au \in AU_{P_\tau},\ 0 \leq S_{random} \leq 1 \} \qquad (3.5)

where S_{random} is a random score between 0 and 1, and AU_{P_\tau} are the seed user's associate users from P_\tau.

To increase the stability of this random baseline the scoring is repeated 100 times and then the average score for each associate user au is calculated.

3.2.2 Chronological baseline

Tweets in a Twitter timeline are ordered chronologically with the most recent tweet showing first. Thus the shared URLs are ordered in the same way. This order is the baseline we need to improve over. The recommendation list R_{P_\tau} for a seed user is defined as:

R_{P_\tau} = \{ i_{1,P_\tau}, i_{2,P_\tau}, \dots, i_{o,P_\tau} \mid ts(i_{1,P_\tau}) \leq ts(i_{2,P_\tau}) \leq \dots \leq ts(i_{o,P_\tau}),\ i \in I_{AU,P_\tau} \} \qquad (3.6)

where ts(i_{1,P_\tau}) is the timestamp of item i_{1,P_\tau} and I_{AU,P_\tau} are the items of the seed user's associate users from P_\tau.

3.3 User-based Nearest Neighbor Models

In this section we discuss models for generating neighborhoods.

3.3.1 Common items neighborhood

The hypothesis of this model is that users who shared the same items in the past might do so in the future. We discuss our approach in the following:

The similarity of a seed user with an associate user is given by the Jaccard similarity coefficient J_{au,P_\tau}:

J_{au,P_\tau} = \frac{ |I_{P_\tau} \cap I_{au,P_\tau}| }{ |I_{P_\tau} \cup I_{au,P_\tau}| } \qquad (3.7)

where I_{P_\tau} are the items of a seed user and I_{au,P_\tau} are the items of the associate user au from P_\tau. The neighborhood N_{P_\tau} of a seed user is given by:

N_{P_\tau} = \{ \langle au, J_{au,P_\tau} \rangle \mid au \in AU_{P_\tau} \} \qquad (3.8)

where AU_{P_\tau} are the seed user's associate users.

We expect that this model provides strong evidence for user similarity; however, due to the sparsity of the data its use might be limited.
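To make the computation concrete, the following is a minimal sketch of this neighborhood in Python; the set-based item representation and the function names are our own illustration, not part of the system's implementation.

def jaccard(seed_items: set, au_items: set) -> float:
    """Jaccard similarity coefficient (Eq. 3.7) between two item sets."""
    union = seed_items | au_items
    if not union:
        return 0.0
    return len(seed_items & au_items) / len(union)

def common_items_neighborhood(seed_items: set, associates: dict) -> list:
    """Rank associate users by Jaccard similarity with the seed user.
    `associates` maps an associate user id to the set of items that user
    shared in P_tau; returns (user, score) pairs, most similar first."""
    scored = [(au, jaccard(seed_items, items)) for au, items in associates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Example: alice shares two of the seed user's three items, bob shares none.
neighborhood = common_items_neighborhood(
    {"url1", "url2", "url3"},
    {"alice": {"url2", "url3", "url4"}, "bob": {"url9"}},
)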


3.3.2 User-mention neighborhood

Interaction between users, by mentioning another user within a tweet, gives strong evidence for common interests. The more often a user mentions another user, the stronger this bond is. We model this interaction by counting how often a certain seed user mentions an associate user, and normalize this by the maximum number of mentions of any associate user. The set SUM_{P_\tau} denotes the users that were mentioned by the seed user in P_\tau:

SUM_{P_\tau} = \{ au \mid au \text{ was mentioned} \} \qquad (3.9)

c_{au,b_\tau} denotes how often a certain associate user au was mentioned by the seed user in a specific bin b_\tau. Then, c_{au,b_\tau} can be defined in one of two ways:

c_{au,b_\tau} = \begin{cases} 1 & \text{if } au \text{ was mentioned at least once} \\ n & \text{the number of times } au \text{ was mentioned in bin } b_\tau \end{cases} \qquad (3.10)

We focus on the first approach described in Equation 3.10, where c_{au,b_\tau} equals one if au was mentioned at least once, because preliminary experiments indicated better performance when using this approach. Although we do not sum all the mentions within one bin, we sum mentions over several bins to establish how strong the bond between seed user and associate user is. c_{au,P_\tau} denotes the number of mentions of an associate user au by the seed user in P_\tau:

c_{au,P_\tau} = \sum_{q=1}^{\tau} c_{au,b_q} \qquad (3.11)

The score we use to sort the neighborhood is defined as follows:

S_{au,P_\tau} = \frac{ c_{au,P_\tau} }{ \max(\{ c_{au_{all},P_\tau} \mid au_{all} \in SUM_{P_\tau} \}) } \qquad (3.12)

where max gives the biggest element of a set, and SUM_{P_\tau} is the set of mentioned associate users. Although we could sort the users in the neighborhood without normalized scores, we still normalize to create a notion of similarity, comparable to the Jaccard similarity coefficient defined in the previous section.

The neighborhood then follows as:

N_{P_\tau} = \{ \langle au, S_{au,P_\tau} \rangle \mid au \in AU \} \qquad (3.13)
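The binary variant of this scoring can be sketched as follows in Python; the representation of bins as a list of sets is our own simplification.

from collections import defaultdict

def mention_scores(bins: list) -> dict:
    """`bins` is a list of sets; each set holds the associate users the seed
    user mentioned at least once in that bin (the binary c_{au,b} of Eq. 3.10).
    Returns the normalized scores S_{au,P_tau} of Eq. 3.12."""
    counts = defaultdict(int)
    for mentioned in bins:
        for au in mentioned:      # c_{au,b} = 1 per bin,
            counts[au] += 1       # summed over bins (Eq. 3.11)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {au: c / max_count for au, c in counts.items()}

# Example: alice is mentioned in two bins, bob in one.
assert mention_scores([{"alice", "bob"}, {"alice"}]) == {"alice": 1.0, "bob": 0.5}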

User mentions can be evidence for an ongoing conversation or any other statement about or towards the user mentioned. A retweet, which usually also contains user mentions, serves to disseminate a specific tweet, indicating that the user who is retweeting finds this content to be interesting for their own followers. We hypothesize that if a user finds content interesting for their own followers, the user also finds the content to be interesting. This leads us to another model, which is a specialization of the model described above, with the difference that we only consider user mentions within a retweet. Retweets are usually marked with RT followed by @user to mention the user the tweet originated from.

These two models will indicate the type and the intensity of user interactions, which should provide good evidence for shared interests.

3.3.3 Temporal neighborhoods

Another hypothesis is that users who share the same temporal activity patterns also might share the same interests. For example, think of the difference in daily routine between employees and school students. Employees might, in general, have a more rigid daily routine than students. While an employee, e.g., a cashier in a supermarket, will continuously deal with customers over an extended period of time, and thus be unable to author tweets, a school student will have regular breaks in which she can author tweets.

We make use of two features that lead to two neighborhood models. First, we consider the hour of the day (hod) for a given tweet, and second, we consider the day of the week (dow) for a given tweet.

Both features can be represented by a vector of length 24 and length 7, respectively. Let f_{au,b_\tau} be such a vector for an associate user au for a certain bin b_\tau, and let f_{b_\tau} be the feature vector for a seed user. For a specific bin b_\tau each element corresponds to either an hour in case of hod, or a day in case of dow. E.g., in case of hod, f^{6}_{au,b_\tau} represents the number of tweets that were created at 6am in bin b_\tau by the associate user au.

Let f_{au,P_\tau} and f_{P_\tau} denote the vectors whose elements are accumulated element-wise for a certain past:

f_{au,P_\tau} = \sum_{q=1}^{\tau} f_{au,b_q} \qquad (3.14)

f_{P_\tau} = \sum_{q=1}^{\tau} f_{b_q} \qquad (3.15)

where f_{au,b_q} is the frequency vector for associate user au from bin b_q, and f_{b_q} is the frequency vector for the seed user, counting the number of tweets published at a certain time.

We use these feature vectors to calculate the score S_{au} of each associate user with respect to its seed user as described in Section 3.3.7.

We can define the neighborhood for a specific seed user as follows:

N_{P_\tau} = \{ \langle au, S_{au,P_\tau} \rangle \mid au \in AU \} \qquad (3.16)

These models should help to discriminate groups of users depending on their tweeting habits with respect to time.
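Building the hour-of-day and day-of-week vectors is straightforward; the following sketch assumes tweet timestamps are available as Python datetime objects, which is our own illustration.

from datetime import datetime

def hod_vector(timestamps: list) -> list:
    """24-element vector; element h counts tweets created in hour h."""
    vec = [0] * 24
    for ts in timestamps:
        vec[ts.hour] += 1
    return vec

def dow_vector(timestamps: list) -> list:
    """7-element vector; element d counts tweets created on weekday d
    (0 = Monday in Python's convention)."""
    vec = [0] * 7
    for ts in timestamps:
        vec[ts.weekday()] += 1
    return vec

# Accumulating over several bins (Eq. 3.14-3.15) amounts to element-wise
# vector addition: [a + b for a, b in zip(v1, v2)].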


3.3.4 Intrinsic tweet features-based neighborhoods

In [22] the influence of intrinsic characteristics, such as whether a Twitter message contains hashtags, URLs, and mentions, is analyzed with respect to the reply and retweet behavior of users. It is found that tweets containing URLs and hashtags are more likely to be retweeted, in contrast to tweets which contain mentions. We hypothesize that these features do not just influence retweet rate, but also define the user's writing style, and thus can be used to find similar users based on their usage of hashtags, mentions, and URLs in tweets.

These models are similar to the models defined in the previous section. We consider three features: the number of mentions (nmentions), the number of hashtags (nhashtag), and the number of URLs (nurl) within a tweet. For each feature a separate neighborhood is defined.

All three features are represented by a vector of length 40. Similar to the previous section, we calculate the score S_{au} of each associate user with respect to its seed user as described in Section 3.3.7. This score is then used to form the neighborhood.

These models should allow us to find users that are similar with respect to the way they use Twitter, e.g., by using hashtags, URLs, and mentions.

3.3.5 Social features-based neighborhood

Petrovic et al. [64] develop a model to predict whether a tweet will be retweeted. They make use of author-related features like the number of followers, friends, statuses, and favorites, the number of times the user was listed, whether the user is verified, and whether the user's language is English. They call these features social features and find that the number of followers, the number of friends, and the number of times the user was listed perform best. Our hypothesis is that users which are similar with respect to such author-related features, as described below, share similar interests. The features used in this model are described in Table 3.1.

Table 3.1: Features extracted from users' Twitter profiles for the social features-based neighborhood.

Name             Description
tweetcount       Number of tweets created by this user
favouritescount  Number of favorites
followerscount   Number of followers
friendscount     Number of friends
listedcount      Number of times the user is listed
utcoffset        The UTC offset of the user's timezone
locationKey      The location of the user (enumerated)
langKey          Account language from Twitter (enumerated)
verified         Whether an account is verified by Twitter (1), or not (0)

Similar to the previous two models, we represent these features by a feature vector in euclidean space. However, in contrast to the other models, this neighborhood is considered static with respect to the values of the features, because the feature values of users which have already been active for a longer time only change slowly. With static we mean that those values remain the same independent of the specific past and future (Section 3.1).

We use these feature vectors to calculate the score S_{au} of each associate user with respect to its seed user as described in Section 3.3.7.

We can define the neighborhood for a specific seed user as follows:

N = \{ \langle au, S_{au} \rangle \mid au \in AU \} \qquad (3.17)

The degree of activity, the timezone, the location, and the language of a Twitter user should provide useful clues for similarity.

3.3.6 Item popularity based neighborhood

Often, popular items are deemed to be relevant items. Thus, our hypothesis is that taking into account popularity indicators of items will improve system performance.

We make use of Facebook’s popularity indicators like number of shares, likes and comments. Wemake use of these indicators for a specific URL, and a normalized version of the URL where a possi-ble query string has been removed. The scores for the URL are denoted with PIi , the scores for thenormalized URL are denoted with PINi , and the combination of both is denoted with Pi:

PI_i = \langle nLikes_i, nShares_i, nComments_i \rangle \qquad (3.18)

PIN_i = \langle nLikesN_i, nSharesN_i, nCommentsN_i \rangle \qquad (3.19)

P_i = PI_i \cup PIN_i \qquad (3.20)

the final popularity score IS_i for a specific item i is then calculated as follows:

IS_i = \frac{1}{MaxSum} \sum_{p \in P_i} p \qquad (3.21)

where MaxSum denotes the sum of likes, shares, and comments observed for the link with the highest sum in the entire corpus. We define the score for the neighborhood as follows:

S_{au,P_\tau} = \frac{1}{|I_{au,P_\tau}|} \sum_{i \in I_{au,P_\tau}} IS_i \qquad (3.22)

where I_{au,P_\tau} are the items of an associate user au in P_\tau, and we can define the neighborhood as:

N_{P_\tau} = \{ \langle au, S_{au,P_\tau} \rangle \mid au \in AU \} \qquad (3.23)

Users that share many popular items have a good chance to be the right source for recommending items in general. However, Facebook popularity indicators (likes, comments, and shares) are not the only clues for popularity, and thus might not give a complete picture of the popularity of an item.
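A compact sketch of Equations 3.21 and 3.22 in Python; the dictionaries of indicator counts stand in for values that would come from Facebook's API, and the function names are our own.

def item_score(indicators: dict, max_sum: float) -> float:
    """IS_i (Eq. 3.21): sum of likes/shares/comments for the URL and its
    normalized form, divided by the largest such sum in the corpus."""
    return sum(indicators.values()) / max_sum

def popularity_score(items: list, max_sum: float) -> float:
    """S_{au,P_tau} (Eq. 3.22): mean item popularity over an associate
    user's items; `items` is a list of indicator dicts."""
    if not items:
        return 0.0
    return sum(item_score(ind, max_sum) for ind in items) / len(items)

# Example with two items and a corpus-wide MaxSum of 1000.
score = popularity_score(
    [{"nLikes": 120, "nShares": 30, "nComments": 10},
     {"nLikes": 5, "nShares": 1, "nComments": 0}],
    max_sum=1000,
)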


3.3.7 Similarity measures

The temporal neighborhoods (Section 3.3.3), the intrinsic tweet features-based neighborhoods (Section 3.3.4), and the social features-based neighborhood (Section 3.3.5) make use of a vector representation. We can measure similarity by using euclidean distance or cosine similarity.

Euclidean distance

The euclidean distance for two vectors with M components is defined as follows:

d_{au} = d(\mathbf{f}, \mathbf{f}_{au}) = \sqrt{ \sum_{i=1}^{M} (f_i - f_{i_{au}})^2 } \qquad (3.24)

where f_i is the i-th element in vector \mathbf{f}, a seed user's features, and f_{i_{au}} is the i-th element in vector \mathbf{f}_{au}, an associate user's features. The distance is then normalized by:

nd_{au} = \frac{ d_{au} }{ \max(\{ d_{au_{all}} \mid au_{all} \text{ is associated with the seed user} \}) } \qquad (3.25)

the final score is then calculated by:

S_{au} = 1 - nd_{au} \qquad (3.26)

We have rescaled the euclidean distance to maintain a certain notion of similarity.

Cosine similarity

Cosine similarity for two vectors with M components is defined as follows:

sim(\mathbf{f}, \mathbf{f}_{au}) = \frac{ \mathbf{f} \cdot \mathbf{f}_{au} }{ \|\mathbf{f}\| \, \|\mathbf{f}_{au}\| } \qquad (3.27)

where \mathbf{f} \cdot \mathbf{f}_{au} is the dot product and \|\mathbf{f}\| is the euclidean norm, both defined as follows:

\mathbf{f} \cdot \mathbf{f}_{au} = \sum_{i=1}^{M} f_i f_{au_i} \qquad (3.29)

\|\mathbf{f}\| = \sqrt{ \sum_{i=1}^{M} f_i^2 } \qquad (3.30)

\|\mathbf{f}_{au}\| = \sqrt{ \sum_{i=1}^{M} f_{au_i}^2 } \qquad (3.31)


We rescale the resulting similarity to get the score S_{au}:

max = \max(\{ sim(\mathbf{f}, \mathbf{f}_{au_{all}}) \mid au_{all} \text{ is an associate of the seed user} \}) \qquad (3.33)

min = \min(\{ sim(\mathbf{f}, \mathbf{f}_{au_{all}}) \mid au_{all} \text{ is an associate of the seed user} \}) \qquad (3.34)

offset = \begin{cases} |min| & \text{if } min < 0 \\ 0 & \text{else} \end{cases} \qquad (3.35)

S_{au} = \frac{ sim(\mathbf{f}, \mathbf{f}_{au}) + offset }{ max + offset } \qquad (3.36)
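The two score rescalings can be sketched as follows; plain Python lists stand in for the feature vectors, and the guards against division by zero are our own additions.

import math

def euclidean_scores(seed: list, vectors: dict) -> dict:
    """Normalized euclidean distance turned into a similarity (Eq. 3.24-3.26)."""
    dist = {au: math.dist(seed, v) for au, v in vectors.items()}
    d_max = max(dist.values()) or 1.0   # avoid division by zero
    return {au: 1.0 - d / d_max for au, d in dist.items()}

def cosine_scores(seed: list, vectors: dict) -> dict:
    """Cosine similarity rescaled to [0, 1] (Eq. 3.27 and 3.33-3.36)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.hypot(*a) * math.hypot(*b)
        return dot / norm if norm else 0.0
    sims = {au: cos(seed, v) for au, v in vectors.items()}
    offset = max(0.0, -min(sims.values()))     # Eq. 3.35
    denom = (max(sims.values()) + offset) or 1.0
    return {au: (s + offset) / denom for au, s in sims.items()}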

3.4 Content-Based Models

Besides looking at properties of users, as done with the models described in Section 3.3, we also look at properties of items and their contents. The models described in Section 3.3 found the associate users most similar to a specific seed user. In this section we develop models that provide the items from associate users which are most similar to the items of a specific seed user. In the following we model this problem as an information retrieval problem: given an information need, relevant information is retrieved from a collection of information resources. We make use of Apache Lucene [28] as information retrieval system. As described in [28], Lucene combines a boolean model of information retrieval with a vector-space model (VSM) of information retrieval. In VSM, documents and queries are represented as weighted vectors in a multidimensional space, where each distinct index term is a dimension [28]. For generating the weights of the vectors multiple schemes are available. First, we will explain our use of the vector-space model with tf-idf weighting. Second, we will introduce probabilistic retrieval models.

3.4.1 Information retrieval vector space recommendation, tf-idf

The hypothesis of this approach is that the problem of finding relevant items to recommend can be modeled as an information retrieval problem: the terms of an item i to be recommended from an associate user are considered as the terms of a document d, and the terms of the items of a related seed user are considered as the query q. A query q_{P_\tau} for a specific seed user is defined as:

q_{P_\tau} = \bigcup_{i \in I_{P_\tau}} terms(i) \qquad (3.37)

where I_{P_\tau} are the items of a seed user from P_\tau, and terms(i) are the terms of an item.

The resulting recommendation list is then ordered by the rank of an item in the underlying information retrieval system's result list:

R_{P_\tau} = \{ i_{1,P_\tau}, i_{2,P_\tau}, \dots, i_{o,P_\tau} \mid rank(i_{1,P_\tau}) \leq rank(i_{2,P_\tau}) \leq \dots \leq rank(i_{o,P_\tau}),\ i \in I_{au,P_\tau},\ au \in AU_{P_\tau} \} \qquad (3.38)

where rank(i_{1,P_\tau}) is the rank of item i_{1,P_\tau}, and I_{au,P_\tau} is the set of items of associate user au in a seed user's set of associate users AU_{P_\tau}.


The corresponding VSM score is calculated by using cosine similarity as introduced in Section 3.3.7:

sim(q, d) = \frac{ V(q) \cdot V(d) }{ \|V(q)\| \, \|V(d)\| } \qquad (3.39)

where V(q) is the term frequency vector of query q and V(d) is the term frequency vector of document d. According to [28] the scoring formula of Lucene is as follows:

score(q, d) = coordfactor(q, d) \times \frac{ V(q) \cdot V(d) }{ \|V(q)\| } \times doclennorm(d) \qquad (3.40)

where coordfactor(q, d) is a factor based on how many of the query terms are found in the specified document, rewarding documents that contain more query terms, and doclennorm(d) normalizes to a vector equal to or larger than the unit vector.

The tf-idf factors are defined as follows [28]:

tf_{t,d} = \sqrt{frequency} \qquad (3.41)

idf_t = 1 + \log \frac{ numDocs }{ docFreq + 1 } \qquad (3.42)

where tf_{t,d} stands for the term frequency of term t in document d, idf_t stands for the inverse document frequency of term t, numDocs is the total number of documents in the collection, and docFreq is the number of documents in which the term t appears. The final practical scoring function using tf-idf for the score of query q and document d is defined as:

score(q, d) = coordfactor(q, d) \times \sum_{t \in q} tf_{t,d} \times idf_t^2 \times doclennorm(d) \qquad (3.43)

where the fraction of Equation 3.40 has been replaced by the tf-idf terms, omitting normalization by the query.

Lucene allows the creation of indexes with different fields. These fields can be queried independently or in combination. We make use of two fields, where one field contains the terms that surround the URL within a tweet, and the other field contains the terms of the URL's website.
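For illustration, a minimal Python sketch of the practical scoring function of Equation 3.43, following the formulas quoted from [28]; the document length normalization is simplified to 1 here, so this is not a faithful reimplementation of Lucene.

import math

def tfidf_score(query_terms: list, doc_tf: dict, num_docs: int, doc_freq: dict) -> float:
    """Eq. 3.43 with doclennorm(d) = 1. `doc_tf` maps a term to its raw
    frequency in the document; `doc_freq` maps a term to the number of
    documents containing it."""
    matched = [t for t in query_terms if t in doc_tf]
    coord = len(matched) / len(query_terms)                   # coordfactor(q, d)
    score = 0.0
    for t in matched:
        tf = math.sqrt(doc_tf[t])                             # Eq. 3.41
        idf = 1 + math.log(num_docs / (doc_freq.get(t, 0) + 1))  # Eq. 3.42
        score += tf * idf ** 2
    return coord * score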

3.4.2 Probabilistic retrieval models

Besides tf-idf weighting, many other approaches to information retrieval have been developed. Some of them provide greater flexibility for adaptation to specific needs. We will briefly discuss some of them here.

The basic idea of the language model approach to information retrieval is to estimate a language model for every document in the collection, and then rank documents according to the likelihood that the query was generated by a document. Given document d and a query q, the document with the highest conditional probability p(d|q) is ranked on top:

p(d|q) \propto p(q|d)\, p(d) \qquad (3.44)

where p(q|d) is the query likelihood given a document, telling us how well the document fits the query, and p(d) is the prior probability of d being relevant.

The scoring function is defined as:

score(q, d) = \prod_{t \in q} p(t|d) \qquad (3.45)

where t is a term in the query q and p(t|d) is the term probability given document d. A straightforward solution to estimate p(t|d) is to use the maximum likelihood estimate P_{mle}(t|d) defined as:

P_{mle}(t|d) = \frac{ tf_{t,d} }{ N_d } \qquad (3.46)

where tf_{t,d} is the term frequency of term t in document d and N_d is the total number of terms in document d. If a term t of a query q does not exist in a document d, the maximum likelihood estimate would give a probability of zero for the term t, and the resulting query score would be zero as well [75]. To prevent this, smoothing techniques have been developed.

Smoothing deals with unseen words and ensures that a non-zero probability is assigned to those words. Zhai and Lafferty [89] study the problem of language model smoothing. Among other smoothing techniques, Zhai and Lafferty [89] report on the Jelinek-Mercer method, which involves a linear interpolation of the maximum likelihood model with the collection model:

p_\lambda(t|d) = (1 - \lambda)\, p_{ml}(t|d) + \lambda\, p(t|C) \qquad (3.47)

where the coefficient λ is used to control the influence of the maximum likelihood model and the collection model, t denotes a term, d denotes a document, and C denotes the collection.

Zhai and Lafferty [89] conclude that the Jelinek-Mercer method tends to perform better for long queries than for short queries. They state that λ strongly correlates with the query type and propose a value of 0.1 for concise title queries and 0.7 for verbose queries. As the queries that will be used in this work are quite verbose, we hope that this model will provide an improvement over tf-idf.
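A minimal sketch of query-likelihood scoring with Jelinek-Mercer smoothing (Equations 3.45-3.47), computed in log space to avoid numerical underflow; all names and the handling of collection-unseen terms are our own illustration.

import math

def jm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.7):
    """Sum of log p_lambda(t|d) over the query terms, with
    p_lambda(t|d) = (1 - lam) * tf(t,d)/|d| + lam * tf(t,C)/|C| (Eq. 3.47).
    lam = 0.7 follows the suggestion in [89] for verbose queries."""
    score = 0.0
    for t in query_terms:
        p_ml = doc_tf.get(t, 0) / doc_len if doc_len else 0.0
        p_coll = coll_tf.get(t, 0) / coll_len
        p = (1 - lam) * p_ml + lam * p_coll
        if p > 0:          # terms unseen in the whole collection are skipped
            score += math.log(p)
    return score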

Our last set of models, a family of information-based models for information retrieval, was introduced by Clinchant and Gaussier [21]. These models draw their inspiration from the idea that the difference in the behavior of a word at document and collection level brings information on the significance of the word for the document. The authors state that their models outperform language models and Okapi BM25, and are on par with divergence from randomness models, while being conceptually simpler. The retrieval function score(q, d) is of the form:


score(q, d) = \sum_{t \in q} -x_t^q \log p(X_t \geq t_d^t \mid \lambda_t) \qquad (3.48)

where q is the query, d is the document, x_t^q is the number of occurrences of term t in q, t_d^t is the normalized form of tf_{t,d} (the number of occurrences of t in d), \lambda_t is the parameter for the probability distribution of t in the collection, and X_t is a random variable that counts the occurrences of term t.

The probability distribution used to model term occurrence can be set either to a log-logistic (LL) or to a smoothed power-law (SPL) distribution. The parameter λ_w for this distribution can be set to the average number of documents where t occurs (DF) or to the average number of occurrences of t in the collection (TTF). For normalization, a uniform distribution of the term frequency can be chosen (H1), a model in which the term frequency is inversely related to the length (H2), Dirichlet priors normalization (H3), or Pareto-Zipf normalization (Z).

We believe that due to the short length of tweets compared to traditional documents, different choices in normalization might be useful to improve performance.

3.5 Combination of User Similarity Based Recommendation Models

Our hypothesis is that the combination of different models will yield better performance than the performance of the models on their own. This section introduces ways to combine models based on user similarity.

3.5.1 Combination by z-scoring

As a means to compare different measurements, z-scores, also called standard scores, can be used. The z-score indicates by how many standard deviations an observation is above or below the mean. The score is defined as:

z = \frac{ x - \mu }{ \sigma } \qquad (3.49)

where x is the raw score, μ is the mean of the population, and σ is the standard deviation of the population.

In our case, z is calculated by letting x be the original score of the associate user in a neighborhood, μ be the mean of all user scores in this neighborhood, and accordingly σ the standard deviation of the scores in this neighborhood.

After the original score has been replaced with the z-score for each neighborhood, the neighborhoods are merged. In case a user exists in more than one neighborhood, the highest z-score is assigned to this user in the merged neighborhood. The users are then ordered by their z-scores, and items from the users are recommended in the same order.
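The z-score merge can be sketched as follows; each neighborhood is represented as a dict from associate user to raw score, and we assume non-empty neighborhoods.

import statistics

def zscores(neighborhood: dict) -> dict:
    """Replace raw scores with z-scores (Eq. 3.49) within one neighborhood."""
    mu = statistics.mean(neighborhood.values())
    sigma = statistics.pstdev(neighborhood.values()) or 1.0
    return {au: (s - mu) / sigma for au, s in neighborhood.items()}

def merge_by_zscore(neighborhoods: list) -> list:
    """Merge neighborhoods; a user in more than one neighborhood keeps the
    highest z-score. Returns (user, z) pairs, best first."""
    merged = {}
    for nb in neighborhoods:
        for au, z in zscores(nb).items():
            merged[au] = max(merged.get(au, float("-inf")), z)
    return sorted(merged.items(), key=lambda pair: pair[1], reverse=True)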


3.5.2 Combination by supervised learning

We hypothesize that machine learning will allow us to combine the models introduced in Section 3.3, by using their scores and a relevance annotation of a user to train a classifier. In this section we show how we acquire the training data for learning, and define our naive Bayes model.

With P_{\tau-1} we denote a past that was before the past denoted with P_\tau. The future is denoted with F_\tau.

\underbrace{ P_{\tau-1} \rightarrow P_\tau }_{Training} \rightarrow F_\tau \qquad (3.50)

For the neighborhoods under consideration, we acquire users and their neighborhood scores for P_{\tau-1}. The neighborhood scores that previously (Section 3.3) were used to sort the neighborhood of associate users by similarity will now be used as features / attributes of training instances. A user that provided at least one item that is relevant in P_\tau is considered a positive training instance; all other users and their corresponding neighborhood scores are negative training instances. A training instance thus looks like this:

\langle S_{au,P_{\tau-1},1}, \dots, S_{au,P_{\tau-1},n}, true/false \rangle \qquad (3.51)

where n is the number of neighborhoods in place.

The classifier is trained on instances with scores, labeled by whether the user is relevant. The users and their scores are taken from P_{\tau-1}, and the class label is established by checking whether the users provided relevant items in P_\tau. At time τ = 0 there is no past P_{\tau-1}, so we back off to the approach described in the previous section.

Assuming that our features are conditionally independent given the relevance of a user, we use a naive Bayes classifier to compute the probability p(C = relevant | S = s) of the observed feature values being in the relevant class:

p(C = relevant \mid \mathbf{S} = \mathbf{s}) = \frac{ p(C = relevant)\, p(\mathbf{S} = \mathbf{s} \mid C = relevant) }{ p(\mathbf{S} = \mathbf{s}) } \qquad (3.52)

where C is a random variable denoting the class of an instance (relevant, or not relevant), S is a vector of random variables denoting the observed feature values (scores), and s is a particular observed feature value vector.

The resulting class probability can be directly used to rank users in a neighborhood from the highest to the lowest probability of providing relevant items. The classifier learns how similar an associate user has to be to the seed user, with respect to a certain neighborhood, to provide relevant items.
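A hedged sketch of this training-and-ranking step; scikit-learn's GaussianNB is one possible naive Bayes implementation, chosen here purely for illustration and not prescribed by this work.

from sklearn.naive_bayes import GaussianNB

def train_and_rank(train_scores, train_labels, test_scores, test_users):
    """train_scores: rows of neighborhood scores from P_{tau-1} (Eq. 3.51);
    train_labels: True if the user provided a relevant item in P_tau.
    Returns the test users ranked by P(relevant | scores)."""
    clf = GaussianNB().fit(train_scores, train_labels)
    relevant_col = list(clf.classes_).index(True)
    probs = clf.predict_proba(test_scores)[:, relevant_col]
    return sorted(zip(test_users, probs), key=lambda p: p[1], reverse=True)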


3.6 Late Data Fusion

In Section 3.3 we defined models that create a neighborhood of users from which items are recommended to a specific seed user. In the previous section we showed how to combine the models from Section 3.3. Besides recommending items from a ranked list of users, we can also recommend items directly based on the contents of items (Section 3.4). Our hypothesis is that a combination of both approaches is possible, and effective.

The models described previously provide different recommendation lists. We make use of late data fusion techniques to combine those recommendation lists into one list. Different methods to combine results from various divergent search schemes and document collections are described in [73].

Our two flavors of a fusion approach are defined as follows. Given a list L of ranked recommendation lists r, weights w_r for a specific recommendation list, and the score S_r(i) of item i in list r:

S_r(i) = \frac{1}{rank_r(i)} \qquad (3.53)

the first fusion approach is defined as the reciprocal of the sum of the reciprocals of the weighted item scores:

S_{fu1}(i) = \frac{1}{ \sum_{r \in L} \frac{1}{ w_r \times S_r(i) } } = \frac{1}{ \sum_{r \in L} \frac{ rank_r(i) }{ w_r } } \qquad (3.54)

and the second method corresponds to WCombSUM, the sum of weighted scores introduced in [39]:

S_{fu2}(i) = \sum_{r \in L} w_r \times S_r(i) = \sum_{r \in L} \frac{ w_r }{ rank_r(i) } \qquad (3.55)

Instead of normalizing scores, we directly use the reciprocal of the rank of an item in a list as a score.
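Both fusion methods can be sketched compactly; representing each ranked list as a dict from item to rank (1 = best) is an assumption of this illustration.

def fuse(ranked_lists, weights, method="wcombsum"):
    """Late data fusion over ranked lists (Eq. 3.54-3.55).
    ranked_lists: list of dicts item -> rank; weights: one float per list.
    Returns (item, score) pairs, highest fused score first."""
    contributions = {}
    for ranks, w in zip(ranked_lists, weights):
        for item, rank in ranks.items():
            contributions.setdefault(item, []).append((w, rank))
    fused = {}
    for item, pairs in contributions.items():
        if method == "wcombsum":            # Eq. 3.55: sum of w_r / rank_r(i)
            fused[item] = sum(w / r for w, r in pairs)
        else:                               # Eq. 3.54: 1 / sum of rank_r(i) / w_r
            fused[item] = 1.0 / sum(r / w for w, r in pairs)
    return sorted(fused.items(), key=lambda p: p[1], reverse=True)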

In this chapter we gave a problem formulation, defined our baseline models, and presented models that provide recommendations by forming neighborhoods of associate users, sorted according to their similarity with the corresponding seed user. We showed how these models can be combined by using supervised machine learning. We also introduced models that consider the contents of items instead of user properties. Finally, we showed how to use late data fusion to combine user neighborhood-based and content-based models. In the next chapter we will discuss how we evaluate these models.


Chapter 4

Evaluation

So far we looked at models that provide recommendations of items shared by a seed user's associate users. An important question is how to evaluate the models that were defined in the previous chapter. In this chapter we first look at different ways to split the data into candidate, training, and testing sets. Finally, we define our evaluation metric.

In personalized content recommendation the ideal ground truth would be built upon explicit user feedback. This user feedback would not necessarily be acquired in an invasive form, in the sense that the user is forced to annotate a consumed piece of content. Instead, a recommendation system could infer relevance from link usage, which might be noisy; however, it has been shown that the noise can be dealt with [24]. If a user is interested in a tweet's URL, then the user will most likely click on it. In the next paragraph we discuss how links in Twitter are used, and how click information can be collected.

Links on Twitter are mostly shared by using URL shortener services, due to Twitter's 140-character limitation. A user of such a service can provide the service with a URL to a website found to be share-worthy, and in return the user gets a URL that is significantly shorter than the original URL, thus leaving more space for text within a tweet. When calling the shortened URL, the service provider redirects the caller to the original URL. This allows the service provider to create comprehensive statistics about the usage of the URL. In 2010, Twitter introduced its own URL shortener service called t.co. Since then Twitter is able to produce statistics about URL usage, but also statistics about users' interests in specific URLs, if the user makes use of the Twitter app or the Twitter website.

As Twitter does not provide access to these usage statistics, we rely on another form of implicit user feedback. We consider URLs shared by a specific user as the relevant URLs for this user, which in turn provide positive training instances for the models we developed and also the foundation for our evaluation strategies.

Many different ways of evaluating the models developed in this work are conceivable. The goal is to choose the evaluation strategy that is most realistic given the underlying real-world application, but also most insightful. First, besides choosing evaluation metrics, the way the data is split into candidate set and testing set needs to be defined. Second, in case we use supervised learning, we also need to define a training set, which will be acquired from the candidate set as described in Section 3.5.2. As we want to recommend potentially unseen items from the past, we also need to take into consideration the time when a tweet was authored.

4.1 Incremental Data Splitting

The cheapest and simplest way of splitting the data would be to choose a certain ratio of candidate set vs. test set. However, this approach is unrealistic, because in reality the amount of data available to the system changes over time and depends on the users involved. Another problem is that by splitting the data into only two sets, we ignore many potentially successful recommendation events and eventually try to verify outdated recommendations. We visualize the following example with Figure 4.1a. Let us consider 30 days of collected tweets, and assume we split this data into one candidate and one testing set, each with 15 days of data. The candidate set would provide a recommendation from day three, which in turn would be validated against the testing set which starts 12 days later. Twelve days is a long time, and most likely the recommendation would already be outdated. However, it could be that this recommended item is re-shared at day five. Day five is in the candidate set, thus we miss a potentially successful recommendation.

A better strategy is to split the data into bins containing the items within a certain timespan, e.g., one day. In the following example 6 bins would represent 6 days. In the following we will discuss two approaches that make use of these bins in an incremental way. In the first approach, depicted in Figure 4.1b, a time window is shifted over the data, where the first half of the time window is used as past to extract the candidate items, which are then evaluated with the second half (the future).

[Figure 4.1: Schemes for splitting data into candidate set (which includes the training set) and testing set: (a) fixed ratio split evaluation, (b) shifting window evaluation, (c) growing past - shrinking future.]

The second approach, depicted in Figure 4.1c, just moves the border between past and future from left to right. On the one hand, the first approach is cheap because it limits the amount of data used; on the other hand, it cannot use the potential of the data it discards. In the second approach the experiments can use the full data available, and thus allow us to get a more complete picture of the models evaluated.
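A minimal sketch of the growing past / shrinking future split over such bins; the list-of-lists bin representation is our own illustration.

def growing_past_splits(bins: list):
    """bins: chronologically ordered lists of items, one list per time bin.
    For each tau = 1 .. len(bins) - 1, yields (past, future): the first tau
    bins form the candidate set, the remaining bins the test set."""
    for tau in range(1, len(bins)):
        past = [item for b in bins[:tau] for item in b]
        future = [item for b in bins[tau:] for item in b]
        yield past, future

# Example with 4 daily bins: yields 3 successive (past, future) splits.
for past, future in growing_past_splits([["a"], ["b"], ["c"], ["d"]]):
    print(len(past), len(future))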

4.2 Evaluation on User Sessions

The approaches discussed so far do not account for sessions, i.e., time intervals where a certain seed user actively consumes items and possibly reshares consumed items or shares new items. If user sessions are ignored, situations can arise where the data is split in a way that the phases of a user consuming items and sharing items end up in the same bin. The next bin then might not hold any of the seed user's activity. Putting the phases of consumption and sharing into separate bins would be much more realistic and most likely also lead to better evaluation results. Also, the durations of these sessions of activity differ from user to user. We model these sessions by considering time intervals where the user is active, called ON, and time intervals where the user is inactive, called OFF. We illustrate this scheme in Figure 4.2.

[Figure 4.2: Splitting data by considering user sessions based on intervals where a user is active (ON) and inactive (OFF), for two example users A and B.]

Similar to [22], we develop an ON / OFF session model. For a specific seed user we collect the time differences from an ordered list of timestamps ts ∈ TS of the |TS| seed user's tweets. The values of the time differences in

TD = \bigcup_{j=0}^{|TS|-1} \langle t_{j+1} - t_j \rangle

are then clustered by agglomerative hierarchical clustering using Ward's minimum variance method [84]. By using Ward's minimum variance method we make sure that the time difference values within a cluster exhibit small variance, while the means of two different clusters are allowed to differ greatly. An example cluster dendrogram can be seen in Figure 4.3.

We denote the resulting two clusters with TD_ON, containing the time differences that occur between two tweets when a user is active, and TD_OFF, which contains the time differences that occur between two tweets in a period of inactivity. The correct assignment of the cluster names TD_ON and TD_OFF can easily be determined by comparing the means of the two clusters.

Next, we describe how we use the clusters to assign items to the past and to the future. From the cluster that accounts for the inactivity between two sessions (TD_OFF), we take the smallest time difference, denoted by td_{min,OFF}. For all items I = \{ i_1, i_2, \dots, i_m \mid ts(i_1) \leq ts(i_2) \leq \dots \leq ts(i_m) \} of a specific seed user, where ts is the timestamp, we calculate the time difference td_{j+1,j} = ts(i_{j+1}) - ts(i_j) for j : 1 \leq j < m. If td_{j+1,j} \geq td_{min,OFF}, then ts(i_{j+1}) marks the beginning of a new session. Items older than ts(i_{j+1}) are assigned to the past P_\tau and the remaining items belong to the future F_\tau, where τ corresponds to the session boundaries. The split between candidate set and test set does not move bin-wise as explained previously, but instead moves according to the session boundaries.

[Figure 4.3: Clusters of time differences for seed user CompellemB, with two clusters, left TD_OFF and right TD_ON. The label at each leaf indicates the time difference between two tweets in minutes, followed by # and an id.]

We will see in Section 6.1.5 how these session boundaries differ over a set of users, and that splitting according to such boundaries naturally leads to better performance for the chronological baseline, but also for the other models we develop.
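A hedged sketch of this session detection using SciPy's Ward-linkage hierarchical clustering on the inter-tweet gaps; the thesis specifies Ward's method, while the particular SciPy calls and the two-cluster cut are our own illustration.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def session_starts(timestamps):
    """timestamps: sorted tweet times (e.g., in minutes). Returns the indices
    where a new session begins, using the smallest gap of the OFF cluster
    (td_min_OFF) as the threshold."""
    gaps = np.diff(np.asarray(timestamps, dtype=float))
    labels = fcluster(linkage(gaps.reshape(-1, 1), method="ward"),
                      t=2, criterion="maxclust")
    # TD_OFF is the cluster with the larger mean gap.
    off = max(set(labels), key=lambda c: gaps[labels == c].mean())
    td_min_off = gaps[labels == off].min()
    return [j + 1 for j, g in enumerate(gaps) if g >= td_min_off]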

4.3 Evaluation Metrics

Besides organizing the data into candidate and testing sets, a choice with respect to the evaluation metric needs to be made. The system's goal is to provide a ranked list of URLs, where the URLs at the top of the list are most interesting to the user. A very common metric used in such a situation is average precision. Due to our data splitting strategy we have multiple tests using different candidate and testing sets for a specific seed user; we also perform tests over a set of seed users. We use the mean of the average precisions calculated in these tests to evaluate the overall performance of the different models. The average precision AveP_{su,τ} for a seed user su and time τ is defined as follows:

AveP_{su,\tau} = \frac{ \sum_{k=1}^{|R_{su,P_\tau}|} P(k) \times rel(k) }{ |I_{su,F_\tau}| } \qquad (4.1)

where I_{su,F_\tau} is the set of relevant items (the items we test against from future F_\tau), R_{su,P_\tau} are the recommendations, P(k) is the precision of the recommendation list at size k, and rel(k) is the indicator function equaling 1 if the item at rank k is a relevant item, zero otherwise. I_{su,F_\tau} is chosen to be the set of items that have been shared by a seed user's associate users and reshared by the seed user.

Mean average precision MAP is then defined as the mean of AveP over all seed users su ∈ SU and all times τ : 1 ≤ τ < l for the bins B = \{b_1, \dots, b_l\}:

MAP(SU, B) = \frac{1}{|SU|} \sum_{su \in SU} \frac{1}{|B_{su}|} \sum_{\tau=1}^{l} AveP_{su,\tau} \qquad (4.2)
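A minimal sketch of the average precision of Equation 4.1; the ordered recommendation list and the relevant set are the only inputs, and MAP then follows by averaging as in Equation 4.2.

def average_precision(recommendations: list, relevant: set) -> float:
    """Eq. 4.1: AveP of an ordered recommendation list against the set of
    relevant items I_{su,F_tau}."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, item in enumerate(recommendations, start=1):
        if item in relevant:        # rel(k) = 1
            hits += 1
            total += hits / k       # P(k) at this rank
    return total / len(relevant)

# Example: relevant items appear at ranks 1 and 3 of four recommendations.
assert average_precision(["a", "x", "b", "y"], {"a", "b"}) == (1/1 + 2/3) / 2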

Because the growing past - shrinking future evaluation strategy provides a good trade-off between cost and realism, we make use of it most of the time. We will also show that evaluation on user sessions has its merits and allows for an even more realistic assessment of the models involved. We use MAP to describe the system's overall performance, and will indicate the relative improvement over the baselines with respect to MAP.

In this chapter we introduced several evaluation strategies, which differ in how closely they model reality and in the costs involved. We also introduced our evaluation metric. In the next chapter we will outline the experiments we perform and the data that is used in those experiments.


Chapter 5

Experimental Setup

In this chapter we describe the experiments we conduct, and the underlying data and evaluation strategy that is used.

5.1 Experiments

We will conduct the following experiments to answer our research questions stated in Section 1.1.

Performance of user-based recommendation models: With these experiments we seek to answer RQ 2, which asks if user neighborhoods are an effective recommendation source and how the different features perform in comparison with each other. We answer this question by testing the models from Section 3.3 with the growing past - shrinking future evaluation strategy introduced in Section 4.1. We compare the results with respect to MAP with each other and with the baselines. We will run these experiments on one dataset representing average users (Section 5.2.1) and one dataset that represents influential users (Section 5.2.2).

Combination of user-based recommendation models: This set of experiments answers RQ 2.1, asking whether user neighborhoods can be combined and which combinations maximize performance. We combine different sets of models defined in Section 3.3 by using z-scoring and machine learning as defined in Section 3.5. We conduct the tests using the growing past - shrinking future evaluation strategy introduced in Section 4.1 and compare the results with respect to MAP. We will run these experiments on one dataset representing average users (Section 5.2.1); due to time constraints we will not repeat this set of experiments on influential users (Section 5.2.2).

Performance of content-based models: With RQ 3 we ask whether content-based approaches can be effective in recommendation, and which perform best. We make use of the growing past - shrinking future evaluation strategy with information retrieval models (Section 3.4) on the content of tweets and the sites of URLs shared in tweets. We compare the different approaches with each other and with the baselines. We will run these experiments on one dataset representing average users (Section 5.2.1) and a reduced number of experiments on a dataset that represents influential users (Section 5.2.2).


Performance of hybrid recommendation models: Whether user neighborhood- and content-based approaches can be combined (RQ 4) will be answered by conducting a series of experiments using late data fusion as described in Section 3.6. We will conduct those experiments on both datasets using the growing past - shrinking future evaluation strategy.

Session-based evaluation methodology: RQ 5 asks whether a more realistic evaluation methodology is feasible. We use clustering to determine session borders for each individual seed user. These session borders allow us to adjust the splitting of data into candidate data and testing data more appropriately. We examine this evaluation strategy (introduced in Section 4.2) by using it to test the combination model based on machine learning introduced in Section 3.5. We run these experiments on a small sample of the average user dataset.

5.2 Dataset

We made use of the Twitter API to acquire data for our experiments. In Twitter's API Terms [44], Twitter asks API users not to sell, rent, lease, sublicense, redistribute, or syndicate access to the Twitter API or Twitter content to any third party without prior written approval from Twitter. Because of this, but also because of the specific goal of this research, it is not possible to use existing data. As we are interested in providing personalized recommendations, we will make use of a set of users, called seed users, to validate our methods. We make use of two different sets of seed users: one set should represent the average user, the other set represents elite users with high influence.

The following assumptions govern the selection of seed users:

• The models have to cope with different groups of users described by quantitative properties like the number of tweets, and the number of friends and followers.

• The friends of a user are more useful as recommendation input, because the user explicitly indicated interest by following them.

• Followers of a user might also be interesting, because a follower might share the same interests.

• Using Twitter's search API, tweets can be fetched for all users, with limited loss of tweets in the timeline.

5.2.1 Finding the average user

In this section we describe how we find a sample of active Twitter users and how the data was crawled, and we provide some statistics about the data at hand.

Seed user selection

To get active seed users we make use of Twitter's public stream.1 The public stream offers samples of the public data flowing through Twitter, a small sample of the tweets currently disseminated. These tweets contain tweet-related content and information about the authors. We captured this stream for several days and then analyzed a random sample with respect to the number of tweets a user has authored, the number of users that a user follows, and the number of the user's followers. We give a statistical summary of our sample in Table 5.1.

1https://dev.twitter.com/docs/streaming-apis/streams/public

Table 5.1: Statistical summary of our sample of tweets crawled from Twitter's public stream for seed user finding. E.g., the column denoted with Mean shows the average number of tweets authored by users in our sample.

Counts            1st Qu.   Median   Mean     3rd Qu.   Max.
Nr. of tweets     573       2,732    22,400   11,910    34,650,000
Nr. of friends    91        222      833      579       1,116,000
Nr. of followers  61        180      1,849    554       7,486,000

The mean is biased by some extreme outliers, as can be seen when looking at the other quantiles. Thus we only look at users satisfying constraints with respect to properties such as the number of tweets, the number of friends, and the number of followers.

We first group depending on the number of friends. The groups are constructed by taking users within a range of ±10% of the 1st quartile, median, and mean. The effect on the remaining properties of filtering the users on the number of friends can be seen in Figures 5.1a and 5.1b. We observe less extreme values for the number of followers and the number of tweets than before constraining on the number of friends.

[Figure 5.1: (a) Boxplot of the number of followers for the 3 groups in the ranges ±10% of 1st Qu., Median and Mean of the number of friends. (b) Boxplot of the number of tweets for the same 3 groups.]

Next we filter on the number of followers. We constrain the number of followers to the range of ±10% of its median, given the existing groups defined previously.


Finally, we are also interested in having different groups depending on the activity of users with respect to the number of tweets they author. For that we double the number of groups by having one set of groups with the number of tweets constrained to ±20% of the median, and another set of groups with the number of tweets constrained to ±20% of the mean. The resulting groups can be found in Table 5.2.

Table 5.2: Statistics of seed users. Groups are defined by constraining on a seed user's number of friends, number of followers, and number of tweets. For example, group G1-1 consists of seed users for which the number of friends is ≥ 82.8 and < 101.2, while the number of followers is ≥ 60.8 and < 91.2, and the number of tweets is ≥ 2,206 and < 3,309.

Nr. Friends              Nr. Followers          Group   Nr. Tweets
82.8 ≤ x < 101.2         60.8 ≤ y < 91.2        G1-1    780 ≤ z < 1,170
                                                G1-2    2,206 ≤ z < 3,309
162 ≤ x < 198            146.4 ≤ y < 219.6      G2-1    2,184 ≤ z < 3,276
                                                G2-2    5,154 ≤ z < 7,731
1,536.3 ≤ x < 1,877.7    604 ≤ y < 906          G3-1    10,912 ≤ z < 16,368
                                                G3-2    23,344 ≤ z < 35,016

Only users having posted in English will be considered for the actual crawling process. Twitter provides language information in a user's profile and for tweets; both turn out to be not very accurate. For this reason we did additional language detection using [74] to filter out users with a public stream tweet in a language other than English. In Figure 5.2 we see the number of users from our stream sample that account for the groups defined, where "all" refers to all users captured and "en" to users that were creating English tweets identified with [74].

[Figure 5.2: Number of users assigned to the different activity groups within our sample from Twitter's public stream, for all users ("all") and for users tweeting in English ("en").]


Crawling

Given our seed user selection process, we randomly select 20 users from each of the six groups defined in Table 5.2. For the selected seed users, we retrieve friends and followers by querying Twitter's API. For all users (seed users and their friends and followers) all tweets available through Twitter's search API are crawled. For each tweet the corresponding URL entities are extracted, and used to fetch the corresponding resource. Resources larger than 3 MB are discarded. In this process we keep the original URL, as written in the tweet, and the final target URL (the URL after redirections are resolved).

In total 31M tweets were crawled, in which 6.3M URLs were shared. Crawling the URLs resulted in 230GB of data. 10% of the crawls failed for multiple reasons. The top reasons for crawl failure are depicted in Figure 5.3a; most failures are caused by timeouts. For another part of the crawls, fetching the site succeeded, but the content of the site is not valid due to several reasons depicted in Figure 5.3b.

[Figure 5.3: Most common issues during and after crawling. (a) Most common errors that occurred during the crawling of websites shared within tweets, preventing the successful download of the site (e.g., timeouts waiting for a connection, unknown hosts, connection resets, invalid or circular redirects). (b) Most common reasons that lead to prohibitive website content (e.g., YouTube and Ask.fm rate limits, pages not found, Facebook login walls, gateway timeouts).]

The most common domains used in the shared URLs are visualized in Figure 5.4a. 2% of the successful crawls were not of type HTML but of another content type, e.g., gif, jpeg, png, pdf. The services used most often by the average seed users and their associate users are YouTube and Instagram. The usage of domains in the shared URLs exhibits Zipf's law behavior, as can be seen in Figure 5.4b.

5.2.2 Influential users

Having the average case covered is essential for a realistic assessment of our models. We also believe that using a set of seed users that is very different with respect to popularity and influence will provide us with further insight. In this section we describe how those users were selected and how the data was crawled, and we provide some statistics about the data at hand.


(a) Most common domains used in shared URLs; www.youtube.com, instagram.com, at.mtv.com, ask.fm, and www.facebook.com are among the most frequent.

(b) Log-log scale plot of domain frequency and domain rank, observing Zipf's law in domain use; the fitted slope of −1.336 lies close to the reference slope of −1.

Figure 5.4: Domain usage of URLs shared within tweets.

Seed user selection

As mentioned in Chapter 2, some services around Twitter try to estimate a user's influence within the Twitter-sphere. One of those services is Peerreach2. Peerreach informs the user about the reach of influence in general, but also on a topic level. The most prominent topics are:

• Arts
• Blogger
• Film
• Business
• Journalists
• Marketing
• Music
• Politics
• Science
• Sports
• Television
• Webtech

We select top users from each of the above topics as our seed users. In Table 5.3 we provide a statistical summary of the seed user sample with respect to the number of tweets created, and the number of followers and friends.

Table 5.3: Statistical summary of the sample of influential users.

Counts             1st Qu.    Median     Mean        3rd Qu.     Max.
Nr. of tweets      2,978      7,323      15,912      21,888      112,273
Nr. of friends     199        519        9,655       1,244       667,416
Nr. of followers   155,545    877,411    2,238,953   2,196,272   30,970,245

Cha et al. [16] state that popular users who have high in-degree (many followers) are not necessarily influential in terms of spawning retweets or mentions. However, given our influential seed users from Peerreach, we can observe that these users tend to have noticeably more followers than the average users, which can be seen when comparing Table 5.1 for the average users with Table 5.3 for the influential users.

2http://peerreach.com


Crawling

For the seed users, we retrieve their friends by querying Twitter's API. We only consider friends because these seed users tend to have many more followers than we can handle in crawling, due to rate limits. For all users (seed users and their friends), all tweets available through Twitter's search API are crawled. For each tweet the corresponding URL entities are extracted and used to fetch the corresponding resource. Resources larger than 3 MB are discarded. In this process we keep the original URL, as written in the tweet, and the final target URL (the URL after redirections are resolved).

We crawled data for 247 influential seed users spread over the topics defined earlier. In total 21M tweets were crawled, in which 6.8M URLs were shared. Crawling the URLs resulted in 300GB of data. We experienced similar crawling failures as discussed for the previous dataset, but at a lower rate. In Figure 5.5a we show the most common domains of the URLs shared by the influential seed users and their friends.

(a) Most common domains used in shared URLs; instagram.com, www.youtube.com, www.huffingtonpost.com, www.facebook.com, and twitpic.com are among the most frequent.

(b) Log-log scale plot of domain frequency and domain rank, observing Zipf's law in domain use; the fitted slope of −1.237 lies close to the reference slope of −1.

Figure 5.5: Domain usage of URLs shared within tweets.

Similar to the previous dataset, the most popular services are YouTube and Instagram; however, they switched positions. The usage of domains in the shared URLs exhibits Zipf's law behavior, as can be seen in Figure 5.5b.

5.2.3 Preprocessing

To strip off unwanted HTML markup we made use of the Apache Tika library [30], which extracts the plain text from HTML and other documents.
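
As a minimal sketch, assuming the tika-python bindings around Apache Tika (the pipeline itself may call the Java library directly):

    from tika import parser  # starts/uses a local Tika server under the hood

    def extract_text(path: str) -> str:
        """Extract the plain text from an HTML (or other) document."""
        parsed = parser.from_file(path)
        return parsed.get("content") or ""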


The necessity of another preprocessing step stems from the problems that occurred during crawling. Although the Twitter API is well documented and provides relatively smooth access to an abundance of user-generated content, many details and technicalities need to be considered. One of those technicalities is rate limiting. To maintain the responsiveness of their systems and prevent misuse, rate limits are enforced. These rate limits restrict the amount of data that can be retrieved within a certain time interval. While our crawler for Twitter honoured those rate limits, we did not implement any strategy to deal with the rate limits of the services providing the websites we tried to crawl. This led to a certain amount of failed crawls. Some of these limits were announced with a message telling the consumer that a rate limit was hit. To prevent our models from being biased by such messages, we removed the related items.

Another issue is blocking imposed by legal limitations. The server used for crawling was rented from a German company. Some service providers, e.g., YouTube, do not provide certain content to consumers in Germany, but instead provide a message informing the user about these restrictions. Other crawls failed because the related site could only be viewed when authenticated, e.g., Facebook, and thus a login screen was provided. We also filtered out these items to prevent any bias.

We also performed stop word removal with a small set of English stop words.

5.2.4 Evaluation

For most of our experiments, where we employ the growing past - shrinking future evaluation strategy introduced in Section 4.1, we chose a bin size of 24 hours. Items that were shared within the same 24 hours are thus placed into the same bin. We deviate from this bin size for the experiment employing the user session-based evaluation strategy, using 6 hours instead of 24 hours, as we need a more fine-grained splitting of the data.
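
The following sketch illustrates the binning and the growing past - shrinking future splits; the data structures are illustrative, while the bin size and the splitting scheme follow the text.

    from datetime import datetime, timedelta

    def bin_index(t: datetime, start: datetime, bin_hours: int = 24) -> int:
        """Index of the fixed-size time bin that timestamp `t` falls into."""
        return int((t - start) / timedelta(hours=bin_hours))

    def splits(items, start: datetime, n_bins: int, bin_hours: int = 24):
        """Yield (past, future) pairs for a growing past and shrinking
        future; `items` is a list of (timestamp, item) tuples."""
        binned = [(bin_index(t, start, bin_hours), it) for t, it in items]
        for k in range(1, n_bins):
            past = [it for b, it in binned if b < k]
            future = [it for b, it in binned if b >= k]
            yield past, future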

Besides reporting on MAP and the relative improvement over the baselines with respect to MAP, we also perform statistical significance tests using a two-tailed paired t-test and the Wilcoxon rank-sum test. We indicate significant improvements with ▲ (α = .01) or △ (α = .05), and significant declines with ▼ (α = .01) or ▽ (α = .05).
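
A sketch of these tests with SciPy, assuming aligned arrays of per-user average precision scores for a model and a baseline:

    from scipy.stats import ttest_rel, ranksums

    def significance(model_ap, baseline_ap):
        """Two-tailed paired t-test and Wilcoxon rank-sum test p-values."""
        _, p_t = ttest_rel(model_ap, baseline_ap)  # paired, two-tailed
        _, p_w = ranksums(model_ap, baseline_ap)   # rank-sum, two-tailed
        return p_t, p_w

    # p < .01 or p < .05 then maps to the corresponding marker in the tables.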

5.3 Feature Selection for Supervised Machine Learning

When using real-world data, limitations apply with respect to the features that can be extracted from the data. We describe our models and the underlying features in Section 3.3. Most often it is just not possible to select features based solely on their discriminative power. Often, domain knowledge or preexisting experience with features is helpful to select a useful set of features. Still, feature selection remains a non-trivial task. While features that perfectly correlate with each other just do not add additional information, high variable correlation does not mean absence of variable complementarity [35]. Variables useless by themselves can be useful when put together with others [35]. To find the most useful set of features, several feature selection techniques have been developed. Hall [37] introduced an approach that builds on the hypothesis that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other.

For our supervised learning approach (Section 3.5.2) we use a naive Bayes classifier. Naive Bayes provides competitive performance when compared to more sophisticated classification techniques, regardless of the strong independence assumption it makes.


This characteristic of naive Bayes has been the subject of previous research. Rish [67] provides an explanation that no matter how strong the dependencies among features are, naive Bayes can still be optimal if the dependencies are distributed evenly over classes, or if the dependencies cancel each other out. Similarly, Zhang [90] shows that naive Bayes works best with completely independent features, but also with functionally dependent features. Our strategy towards feature selection is to run naive Bayes on all features and then iteratively remove the features that contribute least to the classifier's performance. We do this to identify the smallest useful set of features.
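
A sketch of this elimination strategy, using scikit-learn's Gaussian naive Bayes and cross-validation as illustrative stand-ins for our setup:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    def backward_elimination(X, y, feature_names):
        """Return the nested feature sets produced by repeatedly dropping
        the feature whose removal hurts the classifier least."""
        remaining = list(feature_names)
        history = [list(remaining)]
        def score(features):
            cols = [feature_names.index(f) for f in features]
            return cross_val_score(GaussianNB(), X[:, cols], y, cv=5).mean()
        while len(remaining) > 1:
            # the feature whose removal keeps performance highest
            drop = max(remaining,
                       key=lambda f: score([g for g in remaining if g != f]))
            remaining.remove(drop)
            history.append(list(remaining))
        return history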

In this chapter we described our experiments and the data used for them. In the next chapter we report on and analyze the results of our experiments.


Chapter 6

Results and Analysis

In this chapter we report on the results of the experiments defined in Section 5.1. The models are abbreviated as stated in Table 6.1. We look at each dataset separately: first we look at the average users, and then at the influential users collected from Peerreach. For each dataset we report on the performance of the user-based nearest neighborhood models, the combinations of user-based models, the content-based models, and the model using late data fusion to combine user-based and content-based models.

6.1 The Average Users

In this section we report the results of the experiments conducted on the average user dataset introduced in Section 5.2.1. For the following experiments we make use of one quarter of the data, which we will also use to train the weights for the late data fusion model.

6.1.1 Performance of user-based models

We turn to our first experiment, which aims to answer RQ 2: whether user neighborhoods are an effective recommendation source, and how the different features perform. We describe our models and the underlying features in Section 3.3.

Figure 6.1 illustrates the effectiveness of the baseline models and the user-based recommendation models introduced in Section 3.3, in terms of MAP. For the models which use distance metrics we provide results for both euclidean and cosine distance. We mark cosine with the suffix cos.
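
As a sketch, assuming each user is represented by a numeric feature vector, the two variants differ only in how candidate neighbors are ranked:

    import numpy as np

    def euclidean_neighbors(target: np.ndarray, others: np.ndarray):
        """Indices of candidate users, closest first."""
        return np.argsort(np.linalg.norm(others - target, axis=1))

    def cosine_neighbors(target: np.ndarray, others: np.ndarray):
        """Indices of candidate users, most similar first."""
        sims = (others @ target) / (
            np.linalg.norm(others, axis=1) * np.linalg.norm(target) + 1e-12)
        return np.argsort(-sims)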

In Table 6.2 we report on the MAP scores for the models under consideration, on the relative difference of the models to their baseline models, and on the statistical significance of the difference towards the baseline models. We see that the performance of ten models is around the random baseline. Four models improve over the chronological baseline. The best performing model is "item", followed by "mention" and "retweet". This suggests that the hypothesis that users who shared the same items in the past might do so in the future is a very strong one, and leads to a very solid model. We see that "mention" performs slightly better than "retweet", which indicates that many links are shared with mentions of users, and that not all of them are retweets.


Table 6.1: Abbreviations for the models used in the experiments and the related sections describing the models.

Abbreviation                 Description                                               Section
random                       Random user neighborhood                                  3.2.1
chron                        Chronological item order                                  3.2.2
item                         Traditional user-based nearest neighbour recommendation   3.3.1
mention                      User mention neighborhood                                 3.3.2
retweet                      Retweet neighborhood                                      3.3.2
hod, dow                     Temporal neighborhoods                                    3.3.3
nmentions, nhashtag, nurl    Intrinsic tweet feature-based neighborhoods               3.3.4
social                       Social feature-based neighborhood                         3.3.5
fbitem                       Item popularity-based neighborhood                        3.3.6


Figure 6.1: MAP for 16 user-based recommendation models. Performance of the random baseline is indicated by the red line, and performance of the chronological baseline by the orange line.

6.1.2 Combination of user-based models

With our second experiment we seek an answer to RQ 2.1: whether user neighborhoods can be combined, and which combinations maximize performance.


Table 6.2: MAP and relative model performance compared to the random and chronological baseline, for different user-based recommendation models. Statistical significance tested against both the random and the chronological baseline (t-test and Wilcoxon each).

Model           MAP       % random   % chron   Significance
random          0.00095
chron           0.00273
nurl-cos        0.00014   15         5         ▼ ▼ ▼ ▼
hod-cos         0.00043   46         16        ▼ ▼ ▼ ▼
dow-cos         0.00066   70         24        ▼ ▼ ▼
dow             0.00069   73         25        ▼ ▼ ▼
hod             0.00070   73         25        ▼ ▼ ▼
nmentions       0.00078   82         29        ▼ ▼ ▼
social-cos      0.00083   88         31        ▼ ▼ ▼
nhashtag        0.00093   98         34        ▼ ▽ ▼
social          0.00112   119        41        ▲ ▽ ▼
nurl            0.00113   119        41        ▲ ▽ ▼
nhashtag-cos    0.00120   127        44        ▲ ▽ ▼
nmentions-cos   0.00245   258        89        ▲ ▼
fbitem          0.00308   325        113       △ △
retweet         0.00464   490        170       △ ▲ ▲
mention         0.00590   622        216       ▲ △
item            0.00706   745        258       ▲ △ △

Table 6.3 lists which features are contained in each naive Bayes model. Where applicable, we use cosine similarity instead of the euclidean distance; this choice is based on preliminary experiments in which cosine similarity outperformed the euclidean similarity measure. We also report on MAP. For comparison we included the common items neighborhood, abbreviated "item".

Besides the combination by z-scoring, abbreviated "merge" (Section 3.5.1), we report on models that combine different sets of features by using naive Bayes for the supervised learning approach (Section 3.5.2).
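
A minimal sketch of the z-scoring combination, assuming per-item score arrays that are aligned across models:

    import numpy as np

    def zscore_merge(score_lists):
        """Standardize each model's scores and sum them per item."""
        combined = np.zeros_like(score_lists[0], dtype=float)
        for scores in score_lists:
            std = scores.std()
            combined += (scores - scores.mean()) / (std if std > 0 else 1.0)
        return combined

    # Items are then ranked by the combined score, highest first:
    # ranking = np.argsort(-zscore_merge([item_scores, mention_scores]))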

Figure 6.2 illustrates the MAP for the baseline models, the merge approach, and the supervised learning approach. For the latter, we report on the model that uses all features, "nb-1"; consecutive models from which an attribute is removed are indicated by previous∖attribute. In addition, we report on the results of an experiment that used naive Bayes with kernel density estimation instead of a single Gaussian for continuous features, as introduced in [47].

In Table 6.4 we report on MAP and test the significance of the results against the random and chronological baseline. For the naive Bayes approach with kernel density estimation, denoted "nb-1+kde", we also test for significance against the naive Bayes approach denoted "nb-1". We see that the naive Bayes classification approach using all features boosts overall performance significantly. The combination by z-scoring does not perform well; it actually performs worse than the "item" model on its own. We also see that stripping away features only leads to a moderate decrease in performance. The classifier still provides a significant improvement even when only considering the "fbitem" and "social" features. This suggests that the popularity of the items that a user shares, and the user's social aspects, provide useful clues of similarity between users.


Figure 6.2: MAP for different combinations of user-based recommendation models using naive Bayes with kernel density estimation (nb-1+kde) and without (nb-1), ranked by the features contributing most to the overall performance. Performance of the random baseline is indicated by the red line, performance of the chronological baseline by the orange line, and performance of the naive Bayes model (nb-1) by the black line.

The naive Bayes method using kernel density estimation instead of a single Gaussian to model features provides a slight improvement, which is not statistically significant.

6.1.3 Performance of content-based models

This set of experiments is related to RQ 3: Can content-based approaches be effective in recommendation, and which perform best?

Figure 6.3 shows the performance of the information retrieval approaches (Section 3.4) using tweet content. In Table 6.5, we report on the MAP scores for tf-idf (Section 3.4.1), for information-based models using different distributions to model term occurrence, and for language models (Section 3.4.2). We report on the relative difference of the models to their baseline models, and on the statistical significance of the differences.

Similar to the above, we also show results for the experiments where we use the website content instead of the tweet content. The results are given in Figure 6.4 and Table 6.6. We find that for tweets six of the information retrieval approaches perform above the random baseline, and four even close to the chronological baseline, but for sites only two models improve over the random baseline. Also, we see that stemming does not have any considerable effect on performance. The use of abbreviations and possible misspellings might wreck the merit of language-dependent techniques such as stemming.


Table 6.3: List of features that were excluded step by step in the different combinations by naive Bayes of user similarity-based models. The model that uses all features is denoted "nb-1", and consecutive models are denoted "previous∖attribute", where an attribute is removed from the previous model.

Model    Features
nb-1     all
nb-2     nb-1 ∖ retweet
nb-3     nb-2 ∖ hod
nb-4     nb-3 ∖ nmentions
nb-5     nb-4 ∖ dow
nb-6     nb-5 ∖ nurl
nb-7     nb-6 ∖ item
nb-8     nb-7 ∖ nhashtag
nb-9     nb-8 ∖ mention
nb-10    fbitem
nb-11    social

Figure 6.3: MAP for information retrieval-based models on tweets. Performance of the random baseline is indicated by the red line, performance of the chronological baseline by the orange line.

Further, some of the additional weighting schemes outperform tf-idf, with term frequency normalization having a considerable impact on the performance for tweets. In general, information-based retrieval models provide an improvement over tf-idf for tweets, while language model-based retrieval performs worse than tf-idf. For sites the differences between the retrieval models are much smaller, and traditional methods like bm25 still perform best.


Table 6.4: MAP and relative model performance compared to the random and chronological baseline, for different combinations of user similarity-based models using naive Bayes with kernel density estimation (nb-1+kde) and without (nb-1), ranked by the features contributing most to the overall performance. Significance tested against the random and chronological baseline (t-test and Wilcoxon each).

Model      MAP       % random   % chron   Significance
random     0.00095
chron      0.00273
item       0.00706   745        258       ▲ △ △
merge-1    0.00406   428        149       ▲
nb-1+kde   0.02609   2753       955       ▲ ▲ ▲ ▲
nb-1       0.02471   2607       904       ▲ ▲ ▲ ▲
nb-2       0.02471   2607       904       ▲ ▲ ▲ ▲
nb-3       0.02457   2593       899       ▲ ▲ ▲ ▲
nb-4       0.02382   2513       872       ▲ ▲ ▲ ▲
nb-5       0.02440   2575       893       ▲ ▲ ▲ ▲
nb-6       0.02293   2420       839       ▲ ▲ ▲ ▲
nb-7       0.02113   2230       773       ▲ ▲ ▲ ▲
nb-8       0.01987   2097       727       ▲ ▲ ▲ ▲
nb-9       0.01433   1512       524       ▲ ▲ ▲ ▲
nb-10      0.00907   957        332       ▲ ▲
nb-11      0.00813   858        297       ▲ ▲

Figure 6.4: MAP for information retrieval-based models on sites. Performance of the random baseline is indicated by the red line, performance of the chronological baseline by the orange line.


Table 6.5: MAP and relative model performance compared to the random and chronological baseline, for information retrieval-based models on tweets. Statistical significance tested against both the random and the chronological baseline (t-test and Wilcoxon each).

Model                    MAP       % random   % chron   Significance
random                   0.00095
chron                    0.00273
ib-H1-SPL-TTF            0.00029   31         11        ▼ ▼ ▼ ▼
ib-H1-SPL-DF             0.00031   32         11        ▼ ▼ ▼ ▼
tfIdf                    0.00060   63         22        ▽ ▼ ▼ ▼
tfIdf-stemming           0.00066   70         24        ▼ ▼ ▼
lmjm-verylong            0.00073   78         27        ▼ ▼ ▼
lmjm-long                0.00092   97         33        ▼ ▽ ▼
ib-H1-LL-DF              0.00122   129        45        ▲ ▼
ib-H1-LL-TTF             0.00123   130        45        ▲ ▼
ib-H2-LL-TTF             0.00214   226        78        ▲ ▼
ib-H2-LL-TTF-stemming    0.00216   227        79        ▲ ▼
ib-Z-LL-TTF-stemming     0.00222   235        81        ▲ ▼
ib-Z-LL-TTF              0.00223   235        81        ▲ ▼

6.1.4 Performance of hybrid models

To answer RQ 4 we conduct experiments which show how the combinations of user-based models and content-based models perform.

We report on the performance of our hybrid recommendation models. We start by determining the best fusion weights. In Figure 6.5, we vary the weight for the naive Bayes model "nb-1" within 0.2 ≤ wnb−1 ≤ 0.9999999 and set the weight for the IR models to wir = 1 − wnb−1. Sfu1 denotes the first fusion method, and Sfu2 the second; both were introduced in Section 3.6. Fusion is performed on the naive Bayes model denoted "nb-1" and the content-based models using tweets and sites. The result for the fusion approach is illustrated in Figure 6.6, where the baselines and the best performing naive Bayes model (nb-1) are shown for comparison.
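
A sketch of this weight sweep as a weighted linear combination of the two (normalized) score vectors; evaluate_map is an assumed helper, and only the weight range is taken from the text:

    import numpy as np

    def fuse(nb_scores, ir_scores, w_nb):
        """Late fusion with w_ir = 1 - w_nb."""
        return w_nb * nb_scores + (1.0 - w_nb) * ir_scores

    def sweep(nb_scores, ir_scores, relevance, evaluate_map):
        """Return (best weight, best MAP) over a grid of fusion weights."""
        best_w, best_map = None, -1.0
        for w_nb in np.linspace(0.2, 0.9999999, 20):
            m = evaluate_map(fuse(nb_scores, ir_scores, w_nb), relevance)
            if m > best_map:
                best_w, best_map = w_nb, m
        return best_w, best_map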

In Table 6.7 we report on the MAP scores for the random baseline, the chronological baseline, the naive Bayes model "nb-1", and the fusion approach Sfu1-tweet. We report on the relative difference of the models to their baseline models, and on the statistical significance of the differences. For models where no statistical significance can be determined we report the resulting p-value of the significance test. We find that Sfu1 consistently outperforms Sfu2 (WCombSUM). We also see that the differences of fusion with tweets, sites, or both are very small, and that using a different IR weighting scheme instead of tf-idf does not improve the performance of the fusion model. We also see that the combination of the content-based model on tweets and the naive Bayes model "nb-1" reaches 127% of the performance of the naive Bayes model, but this difference is deemed to not be statistically significant.

Figure 6.6: MAP for the best performing fusion approach Sfu1-tweet. Performance of the random baseline is indicated by the red line, performance of the chronological baseline by the orange line, and performance of the naive Bayes model (nb-1) by the black line.


Table 6.6: MAP and relative model performance compared to the random and chronological baseline, for information retrieval-based models on sites. Statistical significance tested against both the random and the chronological baseline (t-test and Wilcoxon each).

Model            MAP       % random   % chron   Significance
random           0.00095
chron            0.00273
lmjm-long        0.00063   66         23        ▼ ▼ ▼
lmjm-verylong    0.00064   67         23        ▼ ▼ ▼
tfIdf-stemming   0.00066   70         24        ▼ ▼ ▼
tfIdf            0.00067   71         24        ▼ ▼ ▼
ib-H1-SPL-TTF    0.00072   77         27        ▼ ▼ ▼
ib-H1-SPL-DF     0.00077   81         28        ▼ ▼ ▼
ib-H1-LL-DF      0.00094   99         34        ▼ ▽ ▼
ib-H1-LL-TTF     0.00100   106        37        ▲ ▽ ▼
bm25             0.00104   110        38        ▲ ▽ ▼

Figure 6.5: Parameter optimization for different fusion models using information retrieval-based models on tweets, sites, and tweets + sites, combined with the naive Bayes approach "nb-1" (legend: Sfu1_tweet, Sfu1_site, Sfu1_tweet+site, Sfu2_tweet, Sfu1_tweet_ib_Z_LL_TTF). Sfu2_tweet accounts for WCombSUM with "nb-1" and tweets.

6.1.5 Session-based evaluation methodology

Next, we turn to our final research question, RQ 5: Is a more realistic evaluation methodology feasible?

In this section we report on experimental results using the session-based evaluation strategy introduced in Section 4.2, with a bin size of 6 hours instead of the previously used 24 hours.




In Figure 6.7 we see the resulting two clusters per user, accounting for the user being active (ON) and not being active (OFF). This figure shows that different users do indeed exhibit different tweeting behavior with respect to the delays that occur in between their tweet activities.
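
A sketch of this session detection: a user's inter-tweet pauses are clustered into two groups with k-means (scikit-learn), the cluster with the shorter mean pause being ON; the two-cluster choice follows the text, the rest is illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def on_off_pauses(timestamps_hours):
        """`timestamps_hours`: sorted tweet times of one user, in hours.
        Returns (mean ON pause, mean OFF pause)."""
        gaps = np.diff(np.asarray(timestamps_hours)).reshape(-1, 1)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(gaps)
        means = sorted(gaps[labels == c].mean() for c in (0, 1))
        return means[0], means[1]  # shorter mean pause = ON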

To compare how this evaluation strategy affects the performance of our models, we make use of the baseline models and the model "nb-1" introduced in the previous experiments in Section 6.1.2. Looking at Figure 6.8 we see that, compared to the previous results (Section 6.1.2), the performance of the random baseline doubled. However, the performance of the chronological baseline improved by nearly a factor of 5 compared to the previous evaluation strategy, while the performance of the naive Bayes model "nb-1" only doubled.

Looking at Table 6.8 we see that the performance of the naive Bayes model "nb-1" is now only 378% of the chronological baseline, compared to 904% under the previous evaluation strategy, while the performance relative to the random baseline remains similar, with 2,194% compared to 2,607%. The fact that the improvement of the chronological baseline is much higher than the improvement of the random baseline indicates that using user-specific session boundaries to separate the data into candidate and testing sets is useful, as it puts the results of the remaining models into a more realistic perspective.


Table 6.7: MAP and relative model performance compared to the random and chronological baseline, for the best performing fusion approach Sfu1-tweet. Statistical significance tested against the standard baselines and the naive Bayes model "nb-1". We report the p-value if the difference is not significant.

Model        MAP       % random   % chron   % NB   Significance (random, chron)   vs. nb-1 (t-test, Wilcoxon)
random       0.00095
chron        0.00273
nb-1         0.02471   2607       904              ▲ ▲ ▲ ▲
Sfu1-tweet   0.03132   3305       1146      127    ▲ ▲ ▲ ▲                        p = 0.13493, p = 0.78875

Figure 6.7: Average ON and OFF times (pause time between two tweets) in hours per user. The line segments on each bar indicate the minimum and the maximum pause time in the cluster.

6.1.6 Remaining data

The previous results were acquired by performing the experiments on a quarter of the data. We attempt to improve the validity of our results by considering a larger sample, and report on the results of some selected methods.

We visualize these results in Figure 6.9 and report the scores in Table 6.9. We find that the most prominent difference between the performance of the models on the small sample and on the remaining data is the decrease in relative model performance compared to the random baseline model and the chronological baseline model. Still, the results remain statistically significant, and provide a performance of 410% of the chronological baseline, compared to 1,146% on the small sample.


Figure 6.8: MAP for the random and chronological baselines and the naive Bayes model "nb-1" using the user session-based evaluation strategy. The MAP results using the growing past - shrinking future evaluation strategy are indicated for the random baseline (red line), the chronological baseline (orange line), and the naive Bayes combination "nb-1" (black line).

Figure 6.9: MAP for the random and chronological baselines, the naive Bayes model "nb-1", and the best performing fusion model Sfu1-tweet on the remaining data of the average user dataset.


Table 6.8: MAP and relative model performance compared to the random and chronological baseline, for the naive Bayes model "nb-1" under the user session-based evaluation strategy.

Model    MAP       % random   % chron   Significance
random   0.00218
chron    0.01262
nb-1     0.04776   2194       378       ▲ ▲ ▲ ▲

Table 6.9: MAP and relative model performance compared to the baselines, for the naive Bayes model "nb-1" and the best performing fusion model Sfu1-tweet on the remaining data of the average user dataset. Statistical significance tested against the standard baselines and the naive Bayes model "nb-1". We report the p-value if the difference is not significant.

Model        MAP       % random   % chron   % NB   Significance (random, chron)   vs. nb-1 (t-test, Wilcoxon)
random       0.00138
chron        0.00574
nb-1         0.01830   1331       319              ▲ ▲ ▲ ▲
Sfu1-tweet   0.02349   1708       410       128    ▲ ▲ ▲ ▲                        p = 0.07283, p = 0.79317

6.2 The Influential Users

In this section we report on the results of the experiments that we conducted on the influential-user dataset introduced in Section 5.2.2. Due to time and space constraints we only consider a small sample of 30 randomly selected users for the following experiments.

6.2.1 Performance of user-based recommendation models

This experiment refers to RQ 2: whether user neighborhoods are an effective recommendation source, and how the different features perform.

Figure 6.10 illustrates the effectiveness of the baseline models and the user-based recommendation models. For the models which use distance metrics we provide results for both euclidean and cosine distance. We mark cosine with the suffix cos.

In Table 6.10 we report on the MAP scores for the models under consideration, on the relative difference of the models to their baseline models, and on the statistical significance of the differences. We find that half of the models tested do not improve over the random baseline while the other half does, and that only the "fbitem" model improves over the chronological baseline. Still, "mention" and "item" provide a solid improvement over the random baseline.


Figure 6.10: MAP for user-based recommendation models using the influential users dataset. Performance of the random baseline is indicated by the red line, and performance of the chronological baseline by the orange line.

6.2.2 Performance of content-based models

With RQ 3 we want to determine whether content-based approaches can be effective in recommendation and which perform best. We perform this experiment in a slightly different fashion on our new dataset, as we have learned from our previous experiments that alternative information retrieval models only provide moderate improvement over tf-idf.

In Figure 6.11 we show the performance of the tf-idf information retrieval approach introduced in Section 3.4 on both tweets and sites. In Table 6.11 we report on the MAP scores for both models and for the random and chronological baselines. We report on the relative difference of the models to the random and chronological baselines, and on the statistical significance of the differences. We find that the information retrieval-based approach applied to tweets does not outperform the random baseline, reaching 67% of the random baseline's performance. Applied to sites, the information retrieval-based approach does outperform the random baseline, and achieves 91% of the performance of the chronological baseline.

6.2.3 Performance of hybrid models

In the previous section, we saw that the content-based models exhibit a different performance on the dataset of influential users compared to the average user dataset. Thus we also need to reconsider our answer to RQ 4, whether user neighborhood and content-based approaches can be combined. In this section we report on the performance of our hybrid recommendation models. We start by determining the best parameters. In Figure 6.12, we vary the parameter for the naive Bayes model "nb-1" within 0.2 ≤ wnb−1 ≤ 0.9999999, and set the parameter for the information retrieval-based models to wir = 1 − wnb−1.


Table 6.10: MAP and relative model performance compared to the baselines, for user-based recommendation models. Statistical significance tested against both the random and the chronological baseline (t-test and Wilcoxon each).

Model           MAP       % random   % chron   Significance
random          0.00346
chron           0.01257
hod             0.00120   35         10        ▼ ▼ ▼ ▼
nhashtag-cos    0.00161   46         13        ▼ ▼ ▼ ▼
social          0.00170   49         14        ▼ ▼ ▼ ▼
dow             0.00194   56         15        ▽ ▼ ▼ ▼
nhashtag        0.00230   67         18        ▼ ▼ ▼
social-cos      0.00245   71         20        ▼ ▼ ▼
nurl            0.00284   82         23        ▼ ▼ ▼
hod-cos         0.00286   83         23        ▼ ▼ ▼
dow-cos         0.00593   172        47        ▲ ▼ ▼
nurl-cos        0.00595   172        47        ▲ ▼ ▼
nmentions-cos   0.00672   195        53        △ ▲ ▼ ▼
retweet         0.00705   204        56        ▲ ▲ ▼ ▼
nmentions       0.00847   245        67        ▲ ▲ ▽ ▼
item            0.00883   256        70        ▲ ▼
mention         0.01107   320        88        ▲ ▲ ▼
fbitem          0.01413   409        112       ▲ ▲ ▲

Figure 6.11: MAP for information retrieval approaches on tweets and sites. Performance of the random baseline is indicated by the red line, performance of the chronological baseline by the orange line.

Sfu1 denotes the first fusion method, and Sfu2 the second; both were introduced in Section 3.6. Fusion is performed on the naive Bayes model denoted "nb-1" and the content-based models using tweets and sites.


Table 6.11: MAP and relative model performance compared to the baselines, for information retrieval-based models on tweets and sites. Statistical significance tested against both the random and the chronological baseline (t-test and Wilcoxon each). We report the p-value if the difference is not significant.

Model         MAP       % random   % chron   Significance
random        0.00346
chron         0.01257
tfIdf-tweet   0.00233   67         19        ▽ ▼ ▼ ▼
tfIdf-site    0.01146   332        91        ▲ ▲ ▼

Figure 6.12: Parameter optimization for different fusion models using information retrieval-based methods on tweets and sites, combined with the naive Bayes approach "nb-1" (legend: Sfu1_tweet, Sfu1_site, Sfu2_tweet, Sfu2_site). Sfu2_tweet accounts for WCombSUM with "nb-1" and tweets.

The results for the different fusion approaches are illustrated in Figure 6.13, where the baselines and the best performing naive Bayes model (nb-1) are shown for comparison. In Table 6.12 we report on the MAP scores for the models under consideration, on the relative difference of the models to their baseline models, and on the statistical significance of the differences. For models where no statistical significance can be determined we report the resulting p-value of the significance test. We find that the first fusion approach (Sfu1), which makes use of the naive Bayes model "nb-1" and the information retrieval-based approach using sites, performs best. The best weight given the data is 0.5, giving each source an equal share.

In Section 6.1 we saw how our models perform with respect to average users.


Figure 6.13: MAP for the baseline models, the naive Bayes model "nb-1", and the best performing fusion model "Sfu1-site", which combines "nb-1" with the information retrieval-based approach applied to sites.

Table 6.12: Results for the fusion approaches. Statistical significance tested against the standard baselines and the naive Bayes model "nb-1". We report the p-value if the difference is not significant.

Model       MAP       % random   % chron   % NB   Significance (random, chron)   vs. nb-1 (t-test, Wilcoxon)
random      0.00346
chron       0.01257
nb-1        0.03062   886        244              ▲ ▲ ▲ ▲
Sfu1-site   0.04572   1323       364       149    ▲ ▲ ▲ ▲                        ▲, p = 0.55788

We learned that the naive Bayes model, using all features, provides a performance of 319% of the chronological baseline. The content-based models have limited use on their own; however, when combined with the naive Bayes model by using late data fusion, we achieve a performance of 410% of the chronological baseline.

In Section 6.2 we saw how our models perform with respect to influential users. The most interesting aspect is that the content-based models perform much better when applied to websites than when applied to tweets, compared to the results from the average users. The naive Bayes model, using all features, provides a performance of 244% of the chronological baseline. With late data fusion we achieve a performance of 364% of the chronological baseline, by combining the naive Bayes model with the content-based model applied to websites.

In the next chapter we take a closer look at our results in order to better understand them.


Chapter 7

Discussion

In this chapter we discuss the experimental results, and deepen the analysis of our models with respect to their performance and computational costs.

7.1 Results of the Experiments on the Average User Dataset

In this section we discuss the results reported previously in Chapter 6 for the experiments performed on the data representing average users. We look at the results from a different perspective and discuss our findings.

7.1.1 Per user performance analysis

In Figures 7.1 and 7.2, we see the performance of users individually, given the best performing fusion model introduced in Section 3.6, for which we performed initial experiments in Section 6.1.4.

First, we see that the low average performance values reported in general stem from the fact that many users do not perform well at all. The second observation we can make, when looking at both figures, is that the remaining data contains even more very low performing users than the small sample, even when taking the different sample sizes into account. This leads to a lower average performance on the remaining data, as reported in Section 6.1.6.

We observe few users with MAP ≥ 0.05. In the following, we investigate the top performing users. We first start by looking at the small sample, for which all users are depicted in Figure 7.1. Then, in Figure 7.3, we depict the performance of the selected top users with respect to average precision for the different splits of our data into candidate and testing set. Below, we give a small summary of items that were successfully recommended to a seed user and achieved high average precision:

• Its_Angeliique: A user that retweets tweets of musicians containing links to their newest songs.

• mearn: Tweeted the URL to a blog which was recommended 3 weeks earlier. Beside that, mearn and his associate users are also users of klout1. This service measures user influence and produces automatic tweets when a user gives a "K+", which is a statement about another user's influence. In these automatic tweets, which are created on the user's behalf, a link to the website of the service is included.

1http://klout.com


Figure 7.1: MAP per user from the small sample.

Figure 7.2: MAP per user from the remaining data.


• JEDWICKED: Similar to mearn, JEDWICKED and his associates make use of a service named unfollowers.me2 that produces automatic tweets to give the user a quantitative overview of the following relationships (e.g., increase and decrease in the number of users following the user, or followed by the user). JEDWICKED also shared a link to a video on YouTube a week after one of his associate users shared this link.

• PhkRandy: Shared a link to a music blog article without using retweet, and shared a link to a video using retweet.

Figure 7.3: Average precision of selected users (mearn, PhkRandy, Its_Angeliique, and JEDWICKED) per split of the data into candidate and testing set. With the growing past - shrinking future evaluation strategy, a split of 5 means that the first 5 bins are part of the past and the remaining bins represent the future.

Second, we look at the top performing users of the remaining data, for which all users are depicted in Figure 7.2, and with Figure 7.4 we depict the performance of the selected top users with respect to average precision for the different splits of our data into candidate and testing set. Below, we give a small summary of items that were successfully recommended to a seed user and achieved high average precision:

• FOREVER_LOYAL12: Similar to the user JEDWICKED described previously, this user and the user's associate users make use of a service to keep track of people who newly follow or unfollowed the user. The automatic tweets generated by this service contain the URL of the website of this service. For this user the recommendations are dominated by those automatic tweets.

2http://unfollowers.me/


• OhMyDaysItsElle: Like FOREVER_LOYAL12, the recommendations for this user are dominated by an unfollower tracker, but also by a service named twuration3 that provides statistics of the user's Twitter usage with automatic tweets. It also turns out that this user runs a second Twitter account on which a link to http://ask.fm/ohmydaysitselle was shared, which was then shared again 3 days later through this account.

• afifafahkah_: Similar to the previous users, this user also makes use of a service that tracks new followers and users that stopped following. The most interesting aspect here is that while the user was active in the first third of the time, activity completely ceased for the remaining time, as can be seen in Figure 7.4.

• ElishaTheWanted: This user retweeted a personally directed tweet containing a URL to a service called twitlonger, which allows users to circumvent the 140-character limit of ordinary tweets. Another promotional tweet containing a URL to a music video was retweeted with the intention of making the lyrics of the song available if the tweet received 25 retweets.

• Eharrioso: This user retweets tweets that contain real-time analysis of the 2012 Presidential Election on Twitter, in addition to the URL directed to the website of the analysis service. Other retweets show the user's interest in the elections.

• Rakarizz: This user retweets URLs to a music website and shares favorite songs on myspace4. Further, a service that tracks new followers and users that stopped following is used.

Besides the very diverse use of Twitter, we find the usage of unfollower trackers5 most intriguing. On the one hand, it shows how users starve for acknowledgement and want to be aware of those people who refuse to provide it. On the other hand, we see that such automated tweets might bias a recommendation system, as it is not clear whether a user is equally interested in the track record of the associate users. We also find that the way the data is split into candidate and test sets has a dramatic influence on the evaluation results, as can be seen in Figure 7.3 and Figure 7.4. For different users different splits are necessary, making traditional evaluation strategies, where data is just split into two halves, infeasible. We will reason about the low performing users later, in Section 7.1.3.

7.1.2 Fusion source, where do the recommendations come from

Another perspective on the system is given by Figure 7.5 and Figure 7.6, where instead of MAP the number of relevant items per user is shown. Although the mere number of relevant recommendations is not a good performance indicator, as many of the relevant items recommended might rank very low and thus be overlooked by the user, the source6 of relevant items and the number per source are still insightful.

In Figure 7.5, the user mom22girlz has been provided with many more recommendation items from the information retrieval-based (IR) approach than the remaining users. When comparing mom22girlz with other seed users, by looking at the tweets of this user and the user's associates, we see that many of them promote sales events. Similarly, we see in Figure 7.6 that the users endarken and colotweet come with unusually large numbers of relevant items from the information retrieval-based (IR) approach.

3http://www.twuration.com/
4http://www.myspace.com/
5Services that track by whom the user recently was added to be followed and who unfollowed the user.
6The model which recommended the item.


Figure 7.4: Average precision of selected users (Eharrioso, OhMyDaysItsElle, afifafahkah_, FOREVER_LOYAL12, Rakarizz, and ElishaTheWanted) per split of the data into candidate and testing set. With the growing past - shrinking future evaluation strategy, a split of 5 means that the first 5 bins are part of the past and the remaining bins represent the future.


In Table 7.1 we look closer at some example tweets of those users. We find that these users provide very specific context to the URLs they share. Instead of being general statements, these tweets contain words that are very discriminative given the URLs shared, and thus are better suited for an information retrieval-based approach. However, as the underlying fusion approach that combines the user-based models with the content-based models gives much more weight to the user-based approaches (see Section 6.1.4), the potentially good recommendations get ranked lower. This observation is supported by the fact that the users analyzed show a lower performance with respect to MAP, as shown in Figures 7.1 and 7.2. A way to improve recommendation for these users would be to have an individual weighting scheme for each user instead of a global weighting scheme.

7.1.3 Influence of user activity on recommendation performance

In Section 5.2, we stated assumptions related to the activity of a user and formed groups of users with different scales of activity. In Figure 7.7a, we see the performance with respect to MAP for each of those groups. In Figure 7.7b we added bins that account for low performance [0, 0.0001], medium performance (0.0001, 0.5], and high performance (0.5, 1].

The general trend that can be seen in Figure 7.7a and Figure 7.7b is that the model works better for more active users. However, more tweets alone do not seem to be sufficient, as can be seen in Figure 7.7a when looking at the groups G2_1 and G2_2. On the other hand, with more friends and followers we get more tweets, and so the performance improves. This, of course, is also an artifact of our evaluation strategy, which depends on a user sharing URLs in tweets.


Figure 7.5: Number of relevant items recommended per user, per recommendation source, on the small sample of the average user dataset. SF stands for the combination model of user neighborhoods "nb-1", IR for the information retrieval-based approach, and Both if the item was recommended by both models.

[Figure: one bar per remaining user of the average user dataset, showing the number of items recommended and relevant (0-700), broken down by recommendation source (Both, IR, SF).]

Figure 7.6: Number of relevant items recommended per user, per source. SF stands for the combination model of user neighborhoods "nb-1" and IR for the information retrieval-based approach; Both, if the item was recommended by both models. We look at the remaining users from the average user dataset.

we get more tweets, and so the performance improves. This, of course, is also an artifact of our evaluation


Table 7.1: Example tweets of users that exhibit an unusually large number of items recommended by the information retrieval-based approach on tweets.

User        Tweet content
----------  -------------------------------------------------------------------
mom22girlz  GIVEAWAY: Crayola and Target Back to School Goodies! #MyBlogSpark | 3 Boys And A Dog http://t.co/gTzxh6qb
            Enter to win a $100 @Shopbop giftcard on Latina On a Mission: http://t.co/IHWplJfn #SMLatinas #Giveaway #Fashion #FNO Ends 9/23
endarken    @nytimes 5 NZ soldier deaths in August also... report that? http://t.co/wt2MPLLp @OccupyNZ #ows
            RT @WeMeantWell: Good News! We Sorta Now at War in #Guatemala. A foreign policy problems solved by military force. http://t.co/qRo...
colotweet   #datacentre Interxion Launches Second Cloud Testlab at Its City of London... http://t.co/GHtrFpl3
            #datacentre Dutch get serious about data center efficiency: A Dutch research organization currently developing a... http://t.co/Zh4JeKVv

[Figure: two panels over the activity groups G1_1, G1_2, G2_1, G2_2, G3_1, G3_2. (a) MAP per group (0.00-0.05). (b) Counts of users per MAP bin, with three bins for MAP values that account for low performance [0,0.0001], medium performance (0.0001,0.5], and high performance (0.5,1].]

Figure 7.7: MAP per group of all users in the average user dataset, for groups defined in Table 5.2.

strategy that depends on a user sharing URLs in tweets.

This relationship is visible in Figure 7.8a, where we plot the average number of URLs that were reshared by the users, by retweeting or by independently sharing, against the MAP of these users. These two values correlate according to Pearson's correlation in log-log space with r = 0.62523. The average number of URLs a user reshares correlates with the average number of URLs a user shares with r = 0.75451 in linear space. This relationship is visualized in Figure 7.8b, where we plot the number of URLs shared against the number of URLs reshared.
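This computation is straightforward to reproduce. The following minimal Java sketch computes Pearson's r after log-transforming both variables, matching the log-log space used above; the input arrays are hypothetical placeholders for the per-user averages and are assumed to contain only positive values.

    // Minimal sketch: Pearson's r between two log-transformed value series.
    // The arrays are hypothetical placeholders for the per-user averages and
    // must contain positive values only, since we take logarithms.
    static double pearsonLogLog(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        double[] lx = new double[n], ly = new double[n];
        for (int i = 0; i < n; i++) {
            lx[i] = Math.log(x[i]);      // move both axes to log space
            ly[i] = Math.log(y[i]);
            mx += lx[i];
            my += ly[i];
        }
        mx /= n;
        my /= n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (lx[i] - mx) * (ly[i] - my);
            vx += (lx[i] - mx) * (lx[i] - mx);
            vy += (ly[i] - my) * (ly[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }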


[Figure: two log-log scatter plots. (a) Seed users and their average number of URLs that were reshared, by retweeting or by independently sharing, against MAP. (b) Seed users and the average number of URLs shared against the average number of URLs reshared.]

Figure 7.8: Correlation between MAP and number of shared URLs.

With this reasoning we can filter out results that can only lead to an average precision of zero because the number of reshared items is zero. In such cases we can decide not to perform an evaluation. The resulting MAP values for the different activity groups are given in Figure 7.9. We see that in the best case our system achieves a MAP of up to 0.27590, as depicted for group G2_1 in Figure 7.9.
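A minimal sketch of this filtering step is given below, assuming the per-user average precision values and test-set sizes are available in (hypothetical) maps; users with an empty test set are skipped instead of being counted as zero.

    import java.util.Map;

    // Sketch: aggregate MAP while skipping users whose test set is empty.
    // apByUser and testSetSize are hypothetical placeholders for the per-user
    // average precision and the number of reshared items in the test set.
    static double filteredMap(Map<String, Double> apByUser,
                              Map<String, Integer> testSetSize) {
        double sum = 0;
        int evaluated = 0;
        for (Map.Entry<String, Double> e : apByUser.entrySet()) {
            if (testSetSize.getOrDefault(e.getKey(), 0) == 0)
                continue;             // nothing reshared: AP can only be zero
            sum += e.getValue();
            evaluated++;
        }
        return evaluated == 0 ? 0.0 : sum / evaluated;
    }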

[Figure: bar chart of MAP (0.00-0.25) over the activity groups G1_1 through G3_2.]

Figure 7.9: MAP per group after removing test cases which have an empty test set.

7.2 Results of the Experiments on the Influential User Dataset

In this section we discuss results reported previously in Chapter 6 for the experiments performed on the data representing influential users. We will look at the results from a different perspective and discuss


our findings.

7.2.1 Per user performance analysis

Similar to what was done with the previous dataset, we look again at how users perform individually. In Figure 7.10 we see the per-user performance, given the best performing fusion model, introduced in Section 3.6, that combines the naive Bayes model nb-1 with the information retrieval-based approach on sites.

[Figure: bar chart of MAP (0.00-0.35) for each user of the small sample of the influential user dataset, including fredwilson, KevinSpacey, and UncleRUSH.]

Figure 7.10: MAP per user from the small sample.

In Figure 7.11, we depict the performance of the selected top users with respect to average precision for the different splits of our data into candidate and testing set. We discuss those users and the items that were recommended to them in the following:

• fredwilson: A New York City-based venture capitalist and blogger who promotes his blog entries, which are also promoted by some of his friends.

• KevinSpacey: An American actor, director, screenwriter, and producer, who retweets and reshares links about an upcoming theatre event at the Old Vic Theatre for promotional purposes.

• UncleRUSH: Russell Simmons, an American business magnate and co-founder of the hip-hop label Def Jam, retweets and comments on a tweet that contains a link to a peacekeeping initiative, http://thepeacekeepers.org/.

For the top users, many recommendations are a consequence of self-promotion, or of promotion for others that attempts to make use of the user's influence.


[Figure: three panels (fredwilson, UncleRUSH, KevinSpacey), each plotting average precision (AveP) against the split number (0-40).]

Figure 7.11: Average precision of selected users per split of the data into candidate and testing set.

7.2.2 Fusion source: where do the recommendations come from?

Again we look at the system with respect to the number of relevant items per user, given in Figure 7.12. Compared to the average user dataset (see Section 7.1.2), we see that the origin of relevant items is much more diverse. This is a consequence of the information retrieval-based model having more weight, which stems from the fact that the weight was optimized accordingly. However, again we see that a personalized weighting scheme would be beneficial, as some users do not have relevant items recommended from the information retrieval-based model.

[Figure: one bar per user of the influential user dataset, showing the number of items recommended and relevant (0-150), broken down by recommendation source (Both, IR, SF).]

Figure 7.12: Number of relevant items recommended per user, per source. SF stands for the combination model of user neighborhoods "nb-1" and IR for the information retrieval-based approach; Both, if the item was recommended by both models.


7.3 Relationship Between the Two Datasets

The two datasets at hand differ significantly with respect to the content the seed users spread. Looking at Figure 5.4a for the average users and Figure 5.5a for the influential users, we see that except for the top ranking YouTube and Instagram domains, the remaining domains differ, with four times more domains for newspaper websites. For the influential users we find more newspapers, whereas for the average users more domains refer to services that provide multimedia content rather than textual content. We believe that this is the reason why we achieve a much higher performance for the information retrieval-based approach on sites for the influential seed users, while there seems to be no merit for the same approach with average users.

Looking at the results of the user-based recommendation models for the two datasets (Sections 6.1.1 and 6.2.1), we see that the ordering of the best performing features is different. We believe that, although averaged over several users, this order is not stable and might change with changing user sets. However, as the recommendation is personalized, the naive Bayes combination model will learn the most useful features for a specific user.

7.4 Algorithm Analysis

In this section we briefly discuss time and space complexity (where appropriate) for the models defined in Chapter 3. Even though we report on the asymptotic behaviour of the algorithms involved, it is important to note that the algorithms themselves provide potential for improvement with respect to efficiency, and also that efficiency might not be of concern given the limitations inherent to this application. One of these limitations is the number of users a user realistically follows, which is relatively low.

Helpful references on data-structure operations and their algorithmic complexity for Python [27] and for Java [17] were consulted in the following analysis.

7.4.1 Common items neighborhood recommendation

We assume that the items have already been established, meaning we do not need to go through all related tweets to find those items. The set operations necessary for the calculation of the Jaccard coefficient defined in Equation 3.7 then have the following time complexities:

A ∩ B : O(|A| × |B|)    (7.1)

A ∪ B : O(|A| + |B|)    (7.2)

With the Jaccard coefficient using both operations, it can be seen that the term O(|A| × |B|) + O(|A| + |B|) is dominated by O(|A| × |B|). To create the neighborhood scores for a specific seed user (Equation 3.8), we need to iterate over all associate users au in AU_su, so the total running time is O(|AU_su| × |A| × |B|), where A and B are the item sets of the seed user and the associate users.
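To make the computation concrete, the following Java sketch calculates the Jaccard coefficient for two item sets and derives neighborhood scores over a map of (hypothetical) associate users. Note that with hash sets the intersection costs O(min(|A|,|B|)) on average, below the worst-case bound used above.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch of the Jaccard coefficient (Equation 3.7) over two item sets.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);                    // A ∩ B
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                              // A ∪ B
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // Neighborhood scores in the spirit of Equation 3.8: one Jaccard score
    // per associate user; itemsByAssociate is a hypothetical placeholder.
    static Map<String, Double> neighborhoodScores(Set<String> seedItems,
                                                  Map<String, Set<String>> itemsByAssociate) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : itemsByAssociate.entrySet())
            scores.put(e.getKey(), jaccard(seedItems, e.getValue()));
        return scores;
    }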


7.4.2 User-mention neighborhood

We can assume constant running time for extracting the mentions from tweets because a tweet is limited to 140 characters in size. Assuming constant running time for the data structures involved (e.g. a hashtable) and a limited chance of rehashing due to the limited number of associate users, the running time is O(|T_su|), where T_su are the tweets of a specific seed user su.
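A minimal sketch of this single pass over the seed user's tweets, counting mentions in a hashtable, is given below; the regular expression is a rough approximation of Twitter's username rules, not the exact tokenizer used in our system.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: count @-mentions over the tweets T_su in one pass, O(|T_su|).
    // The pattern approximates Twitter's username rules.
    static Map<String, Integer> mentionCounts(Iterable<String> tweets) {
        Pattern mention = Pattern.compile("@(\\w{1,15})");
        Map<String, Integer> counts = new HashMap<>();
        for (String tweet : tweets) {                 // each tweet <= 140 chars
            Matcher m = mention.matcher(tweet);
            while (m.find())
                counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }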

7.4.3 Temporal neighborhoods and intrinsic tweet features-based neighborhoods

Again, assuming constant running time for extracting the corresponding information from tweets and for the data structures involved, the running time is O(|T_su,AU|) + O(|AU_su|), where T_su,AU are all the tweets of the seed user and the associate users. The term O(|AU_su|) accounts for the calculation of the distance metric, which needs to be performed for every associate user. The calculation of the distance metric itself can be considered constant with respect to running time because the Euclidean space is limited to 7 or 24 dimensions. We can assume that |T_su,AU| >> |AU_su|, and so the final running time is O(|T_su,AU|).
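As an illustration, the distance step could look as follows, assuming each user's activity has already been aggregated into a fixed-size profile (e.g. 24 hour-of-day bins or 7 day-of-week bins); the fixed dimension is what makes each distance computation constant-time.

    // Sketch: Euclidean distance between two fixed-size activity profiles.
    // With 7 or 24 dimensions, the cost per associate user is constant.
    static double euclidean(double[] seedProfile, double[] associateProfile) {
        double sum = 0;
        for (int i = 0; i < seedProfile.length; i++) {
            double d = seedProfile[i] - associateProfile[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }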

7.4.4 Social features-based neighborhood

This approach comes at an even lower price, as the involved features are aggregated by Twitter's platform, and we only need to calculate the distance between the seed user and the associate users |AU_su| times. So the final running time is O(|AU_su|).

7.4.5 Item popularity-based neighborhood

To calculate the score of an associate user in this approach, we need to sum up the scores of all the items I_su,au an associate user has shared. Thus the running time is O(|I_su,au|).
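A minimal sketch of this summation, where the item-to-popularity map is a hypothetical placeholder for the precomputed popularity scores:

    import java.util.Map;
    import java.util.Set;

    // Sketch: score an associate user by summing the popularity scores of the
    // items I_su,au they shared, in O(|I_su,au|); popularityByItem is a
    // hypothetical placeholder for the precomputed item scores.
    static double popularityScore(Set<String> sharedItems,
                                  Map<String, Double> popularityByItem) {
        double score = 0;
        for (String item : sharedItems)
            score += popularityByItem.getOrDefault(item, 0.0);
        return score;
    }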

7.4.6 Combination by supervised learning

For this approach we use Weka [36], which implements a naive Bayes classifier and an extension of it that uses kernel density estimation for continuous attributes instead of a single Gaussian. The algorithmic complexity of both is described by John and Langley [47] and provided in Table 7.2.

Table 7.2: Algorithmic complexity for naive Bayes with Gaussian and kernel density estimation, given n training cases and k features.

                     Naive Bayes           Naive Bayes + KDE
Operation            Time       Space      Time       Space
Train on n cases     O(nk)      O(k)       O(nk)      O(nk)
Test on m cases      O(mk)      -          O(mnk)     -
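In Weka, switching between the two variants of Table 7.2 is a single flag. The following sketch shows the training step; the ARFF file name is a hypothetical placeholder, and we assume the class label is the last attribute.

    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Sketch: train Weka's naive Bayes with kernel density estimation enabled
    // for continuous attributes (the KDE variant of Table 7.2).
    // "neighborhood-features.arff" is a hypothetical training file.
    public class TrainCombinationModel {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("neighborhood-features.arff");
            data.setClassIndex(data.numAttributes() - 1); // label = last attribute
            NaiveBayes nb = new NaiveBayes();
            nb.setUseKernelEstimator(true);               // KDE instead of one Gaussian
            nb.buildClassifier(data);                     // train: O(nk) time, O(nk) space
        }
    }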

7.4.7 Recommendation list generation

The creation of the recommendation list happens by sorting the user neighborhood according to the neighborhood scores using a TreeSet, which provides guaranteed O(log(n)) for basic operations (add, remove


and contains), where n is the number of associate users. Because the add operation is executed n times, the total running time is O(n × log(n)). The associate users' items are then added to the recommendation list accordingly, at a total cost of O(n × m), where m is the number of items.
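A minimal sketch of this list generation, with hypothetical score and item maps, is given below; the comparator falls back to the user name on equal scores, since a TreeSet would otherwise silently drop users whose scores tie.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    // Sketch: rank the neighborhood with a TreeSet (n adds, O(log n) each),
    // then collect the associate users' items in neighborhood order.
    static List<String> recommendationList(Map<String, Double> scores,
                                           Map<String, Set<String>> itemsByUser) {
        Comparator<String> byScoreDesc = Comparator
            .comparing((String u) -> scores.get(u)).reversed()
            .thenComparing(Comparator.naturalOrder());    // break ties, keep all users
        TreeSet<String> ranked = new TreeSet<>(byScoreDesc);
        ranked.addAll(scores.keySet());                   // total O(n log n)
        Set<String> items = new LinkedHashSet<>();        // dedupe, keep order
        for (String user : ranked)                        // total O(n * m)
            items.addAll(itemsByUser.getOrDefault(user, Collections.emptySet()));
        return new ArrayList<>(items);
    }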

7.4.8 Information retrieval approaches

For this work, an algorithmic analysis of the involved information retrieval system is out of scope. Instead we provide a rough overview with respect to performance. For indexing, Lucene can process over 120 GB/hour on modern hardware (CPU: 2 Xeon X5680, overclocked @ 4.0 GHz, IO: index stored on a 240 GB OCZ Vertex 3, 350 MB RAM buffer) [53]. The index needs roughly 20-30% of the size of the text indexed [29].
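For completeness, a minimal indexing sketch with Lucene is shown below. The API follows recent Lucene releases; constructor signatures differ slightly in the Lucene 4.0 version referenced above, and the field values are hypothetical placeholders.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Sketch: index the extracted text of a shared site so it can later be
    // matched against a user's profile. Field values are placeholders.
    public class SiteIndexer {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("site-index"));
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                Document doc = new Document();
                doc.add(new TextField("url", "http://example.org/article", Field.Store.YES));
                doc.add(new TextField("content", "extracted site text ...", Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }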


Chapter 8

Conclusion

This thesis approached the problem of information overload by re-ranking tweets containing URLs based on two different types of models. First, we developed user neighborhoods that make use of user-centered features and thus allow a user-centered notion of similarity. Second, we employed techniques from information retrieval to find tweets and sites of associate users (friends and followers) that reflect the user's interests as sketched by the tweets and sites the user has shared. We combined those two types of models, leading to a further improvement in performance. The models developed here solely consider data that is realistically available to a user's personal recommendation engine, without relying on full access to the Twitter platform. In the following, we revisit the research questions stated in Section 1.1 and provide the answers that were gained through this work.

8.1 Main Findings

We first restate each question, and then answer it below.

RQ 1: Can the user experience while consuming a Twitter timeline be improved, with respect to the tweets that share URLs, by solely considering content features and 1st degree network features?

We have shown that both content features and 1st degree network features can be sufficient to provide an improvement of up to 400% over the chronological baseline.

RQ 2: Are user neighborhoods an effective source for items to recommend, and which of the different features lead to the best performing neighborhoods?

We created user neighborhoods, lists of ranked users where the ordering is given by the similarity of a user to the seed user. To calculate this similarity, user-based and content-based features can be exploited. Items recommended from those users can achieve higher performance compared to the baselines. The features that lead to the best performing neighborhoods are common items and mentions.

RQ 2.1: Can user neighborhoods be effectively combined to form a new source for items to recommend, and which combinations prove most fruitful?


Our experiments have shown that machine learning with naive Bayes provides an effective way of combining different user neighborhoods into a user neighborhood that provides personalized recommendation. We have learned that naive Bayes can deal with the various challenges our features pose with respect to redundancy and complementarity. We have also shown that by employing a learning method we can strip away features and still maintain a useful level of performance.

RQ 3: Are content-based approaches effective in recommending relevant items, and which content-based approaches perform best?

We found that content-based approaches employing methods from information retrieval do have merit. However, their usefulness heavily depends on the type of content shared by the users. While the tweet content that accompanies the URL provides limited evidence for relevant items, we see improved usefulness for site content when applied to suitable data.

RQ 4: Can user neighborhood and content-based approaches be combined to provide an effective recommendation system?

We used weighted late data fusion to combine user neighborhoods and content-based approaches. For our first dataset, which represents the average user, we find that applying the content-based approaches to the tweet content proves useful when combined with the user neighborhoods, while the weight gives a strong preference towards the user neighborhood. The situation is different for the second dataset, which represents influential users. Here, we learned that late data fusion with equal shares of both methods provides the best performance. However, instead of the tweet content, the content of the sites referred to by the URL was used. We have learned that the optimal source needs to be determined for each user individually.

RQ 5: Is a realistic evaluation methodology feasible given the data at hand?

We have shown that the traditional way of splitting data into a candidate set and a testing set is not the most suitable approach. A more realistic evaluation strategy tries to imitate the scenario the recommendation engine faces when recommending items to a user. A user shows a certain amount of activity for a certain time and then becomes inactive for a longer time. During the user's inactivity, new items for recommendation accumulate. These recommendation items should then be tested against the next period of user activity. We have also learned that those user sessions are very different for different users.

Having answered our research questions, we proceed with a discussion of open issues.

8.2 Future Work

In this work we saw that many different signals can be used and combined to provide a remedy for the information overload problem. Still, we believe that the way those signals are modeled can be refined to increase their usefulness. New signals can also be added. The data that was collected in this work also shows how a user's social network changes over time. Unfollowed users and their content might be used as a signal for poor content and low similarity, while newly followed users indicate a certain trend in a user's


interest. This potential has not been exploited yet. Another important aspect is the combination of those signals. We saw that late fusion has its merit, but when applied globally it might be inappropriate for some users. This situation can be improved by learning the fusion weights for each user individually. Further work might also deal more directly with prohibitive content that could bias the personalized recommendation. We believe that to a large extent the methods developed in this work can be applied to other social media infrastructures besides Twitter. We also indicated the need for more realistic approaches to evaluation, and thus provided several possible strategies. The datasets we built allow future research to pursue the challenge of information overload.


Bibliography

[1] Fever - red hot. well read. http://feedafever.com/, Jan. 2013.

[2] klout - discover and be recognized for how you influence the world. http://klout.com, Feb. 2013.

[3] my6sense - helping you get personalized! http://www.my6sense.com/, Jan. 2013.

[4] peerreach - compare and assess your peergroups, levels of influence, and join conversations with like-minded people. http://peerreach.com/, Feb. 2013.

[5] The tweeted times - real-time personalized newspaper from your twitter account. http://tweetedtimes.com/, Jan. 2013.

[6] Zite - discover your interesting. http://zite.com/, Jan. 2013.

[7] F. Abel, Q. Gao, G.-J. Houben, and K. Tao. Analyzing user modeling on twitter for personalized news recommendations. In Proceedings of the 19th international conference on User modeling, adaption, and personalization, UMAP'11, pages 1-12, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 978-3-642-22361-7. URL http://dl.acm.org/citation.cfm?id=2021855.2021857.

[8] F. Abel, Q. Gao, G.-J. Houben, and K. Tao. Semantic enrichment of twitter posts for user profile construction on the social web. In Proceedings of the 8th extended semantic web conference on The semantic web: research and applications - Volume Part II, ESWC'11, pages 375-389, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 978-3-642-21063-1. URL http://dl.acm.org/citation.cfm?id=2017936.2017967.

[9] F. Abel, Q. Gao, G.-J. Houben, and K. Tao. Analyzing temporal dynamics in twitter profiles for personalized recommendations in the social web. In ACM WebSci'11, pages 1-8, June 2011. URL http://journal.webscience.org/428/. WebSci Conference 2011.

[10] F. Abel, C. Hauff, G.-J. Houben, R. Stronkman, and K. Tao. Twitcident: fighting fire with information from social web streams. In Proceedings of the 21st international conference companion on World Wide Web, WWW '12 Companion, pages 305-308, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1230-1. doi: 10.1145/2187980.2188035. URL http://doi.acm.org/10.1145/2187980.2188035.

[11] TunkRank algorithm by Daniel Tunkelang. tunkrank - a tool for measuring influence on twitter based on how much attention your followers can actually give you. http://tunkrank.com/, Feb. 2013.


[12] M. Armentano, D. Godoy, and A. Amandi. Topology-based recommendation of users in micro-blogging communities. J. Comput. Sci. Technol., 27(3):624-634, 2012. URL http://dblp.uni-trier.de/db/journals/jcst/jcst27.html#ArmentanoGA12.

[13] S. Asur and B. A. Huberman. Predicting the future with social media. In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT '10, pages 492-499, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4191-4. doi: 10.1109/WI-IAT.2010.63. URL http://dx.doi.org/10.1109/WI-IAT.2010.63.

[14] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi. Eddi: interactive topic-based browsing of social status streams. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, UIST '10, pages 303-312, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0271-5. doi: 10.1145/1866029.1866077. URL http://doi.acm.org/10.1145/1866029.1866077.

[15] D. Boyd, S. Golder, and G. Lotan. Tweet, tweet, retweet: Conversational aspects of retweeting on twitter. In Proceedings of the 2010 43rd Hawaii International Conference on System Sciences, HICSS '10, pages 1-10, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-3869-3. doi: 10.1109/HICSS.2010.412. URL http://dx.doi.org/10.1109/HICSS.2010.412.

[16] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring user influence in twitter: The million follower fallacy. In ICWSM '10: Proceedings of the international AAAI Conference on Weblogs and Social Media, 2010.

[17] Java collections cheatsheet. Java datastructures time complexity. http://www.coderfriendly.com/2009/05/23/java-collections-cheatsheet-v2/, Feb. 2013.

[18] J. Chen, R. Nairn, L. Nelson, M. Bernstein, and E. Chi. Short and tweet: experiments on recommending content from information streams. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 1185-1194, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-929-9. doi: 10.1145/1753326.1753503. URL http://doi.acm.org/10.1145/1753326.1753503.

[19] J. Chen, R. Nairn, and E. Chi. Speak little and well: recommending conversations in online social streams. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, pages 217-226, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0228-9. doi: 10.1145/1978942.1978974. URL http://doi.acm.org/10.1145/1978942.1978974.

[20] K. Chen, T. Chen, G. Zheng, O. Jin, E. Yao, and Y. Yu. Collaborative personalized tweet recommendation. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, SIGIR '12, pages 661-670, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1472-5. doi: 10.1145/2348283.2348372. URL http://doi.acm.org/10.1145/2348283.2348372.

[21] S. Clinchant and E. Gaussier. Information-based models for ad hoc IR. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '10, pages 234-241, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0153-4. doi: 10.1145/1835449.1835490. URL http://doi.acm.org/10.1145/1835449.1835490.


[22] G. Comarela, M. Crovella, V. Almeida, and F. Benevenuto. Understanding factors that affect response rates in twitter. In Proceedings of the 23rd ACM conference on Hypertext and social media, HT '12, pages 123-132, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1335-3. doi: 10.1145/2309996.2310017. URL http://doi.acm.org/10.1145/2309996.2310017.

[23] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. J. Mach. Learn. Res., 7:551-585, Dec. 2006. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1248547.1248566.

[24] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th international conference on World Wide Web, WWW '07, pages 271-280, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-654-7. doi: 10.1145/1242572.1242610. URL http://doi.acm.org/10.1145/1242572.1242610.

[25] A. Dong, R. Zhang, P. Kolari, J. Bai, F. Diaz, Y. Chang, Z. Zheng, and H. Zha. Time is of the essence: improving recency ranking using twitter data. In Proceedings of the 19th international conference on World wide web, WWW '10, pages 331-340, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-799-8. doi: 10.1145/1772690.1772725. URL http://doi.acm.org/10.1145/1772690.1772725.

[26] Y. Duan, L. Jiang, T. Qin, M. Zhou, and H.-Y. Shum. An empirical study on learning to rank of tweets. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 295-303, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1873781.1873815.

[27] P. S. Foundation. Python datastructures time complexity. http://wiki.python.org/moin/TimeComplexity, Feb. 2013.

[28] T. A. S. Foundation. Apache lucene - high-performance, full-featured text search engine library written entirely in java. https://lucene.apache.org/core/4_0_0/index.html, Feb. 2013.

[29] T. A. S. Foundation. Apache lucene - high-performance, full-featured text search engine library written entirely in java. https://lucene.apache.org/core/, Feb. 2013.

[30] T. A. S. Foundation. Apache tika - a content analysis toolkit. http://tika.apache.org/, Feb. 2013.

[31] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In Proceedings of the 20th international joint conference on Artificial intelligence, IJCAI'07, pages 1606-1611, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1625275.1625535.

[32] W. Galuba, K. Aberer, D. Chakraborty, Z. Despotovic, and W. Kellerer. Outtweeting the twitterers - predicting information cascades in microblogs. In Proceedings of the 3rd conference on Online social networks, WOSN'10, pages 3-3, Berkeley, CA, USA, 2010. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1863190.1863193.

[33] M. Grineva and M. Grinev. Information overload in social media streams and the approaches to solve it. 2012. URL http://www2012.wwwconference.org/.


[34] B. Gu. Information Filtering on Micro-blogging Services. Master's thesis. URL http://e-collection.ethbib.ethz.ch/eserv/eth:1802/eth-1802-01.pdf.

[35] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157-1182, Mar. 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944968.

[36] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: an update. SIGKDD Explor. Newsl., 11(1):10-18, Nov. 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278. URL http://doi.acm.org/10.1145/1656274.1656278.

[37] M. A. Hall. Correlation-based Feature Subset Selection for Machine Learning. PhD thesis, University of Waikato, Hamilton, New Zealand, 1998.

[38] J. Hannon, M. Bennett, and B. Smyth. Recommending twitter users to follow using content and collaborative filtering approaches. In Proceedings of the fourth ACM conference on Recommender systems, RecSys '10, pages 199-206, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-906-0. doi: 10.1145/1864708.1864746. URL http://doi.acm.org/10.1145/1864708.1864746.

[39] D. He and D. Wu. Toward a robust data fusion for document retrieval. In Natural Language Processing and Knowledge Engineering, 2008. NLP-KE '08. International Conference on, pages 1-8, Oct. 2008.

[40] L. Hong, O. Dan, and B. D. Davison. Predicting popular messages in twitter. In Proceedings of the 20th international conference companion on World wide web, WWW '11, pages 57-58, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0637-9. doi: 10.1145/1963192.1963222. URL http://doi.acm.org/10.1145/1963192.1963222.

[41] L. Hong, R. Bekkerman, J. Adler, and B. D. Davison. Learning to rank social update streams. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, SIGIR '12, pages 651-660, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1472-5. doi: 10.1145/2348283.2348371. URL http://doi.acm.org/10.1145/2348283.2348371.

[42] M. Huang, Y. Yang, and X. Zhu. Quality-biased ranking of short texts in microblogging services. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 373-382, Chiang Mai, Thailand, November 2011. Asian Federation of Natural Language Processing. URL http://www.aclweb.org/anthology/I11-1042.

[43] B. A. Huberman, D. M. Romero, and F. Wu. Social networks that matter: Twitter under the microscope. CoRR, abs/0812.1045, 2008.

[44] T. Inc. Api-terms. https://dev.twitter.com/terms/api-terms, Feb. 2013.

[45] D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich. Recommender Systems: An Introduction. Cambridge University Press, 1 edition, Sept. 2010. ISBN 0521493366. URL http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0521493366.

[46] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, WebKDD/SNA-KDD '07, pages 56-65, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-848-0. doi: 10.1145/1348549.1348556. URL http://doi.acm.org/10.1145/1348549.1348556.


[47] G. H. John and P. Langley. Estimating continuous distributions in bayesian classifiers. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, UAI'95, pages 338-345, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc. ISBN 1-55860-385-9. URL http://dl.acm.org/citation.cfm?id=2074158.2074196.

[48] J. J. Jung. Online named entity recognition method for microtexts in social networking services: A case study of twitter. Expert Syst. Appl., 39(9):8066-8070, July 2012. ISSN 0957-4174. doi: 10.1016/j.eswa.2012.01.136. URL http://dx.doi.org/10.1016/j.eswa.2012.01.136.

[49] A. M. Kaplan and M. Haenlein. Users of the world, unite! the challenges and opportunities of social media. Business Horizons, 53(1):59-68, January 2010. URL http://ideas.repec.org/a/eee/bushor/v53y2010i1p59-68.html.

[50] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, WWW '10, pages 591-600, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-799-8. doi: 10.1145/1772690.1772751. URL http://doi.acm.org/10.1145/1772690.1772751.

[51] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Proceedings of the twelfth international conference on Information and knowledge management, CIKM '03, pages 556-559, New York, NY, USA, 2003. ACM. ISBN 1-58113-723-0. doi: 10.1145/956863.956972. URL http://doi.acm.org/10.1145/956863.956972.

[52] C. Lu, W. Lam, and Y. Zhang. Twitter user modeling and tweets recommendation based on wikipedia concept graph, 2012. URL https://www.aaai.org/ocs/index.php/WS/AAAIW12/paper/view/5262/5629.

[53] M. McCandless. Lucene nightly benchmarks. http://people.apache.org/~mikemccand/lucenebench/, Feb. 2013.

[54] E. Meij, W. Weerkamp, and M. de Rijke. Adding semantics to microblog posts. In Proceedings of the fifth ACM international conference on Web search and data mining, WSDM 2012, 2012.

[55] M. Michelson and S. A. Macskassy. Discovering users' topics of interest on twitter: a first look. In Proceedings of the fourth workshop on Analytics for noisy unstructured text data, AND '10, pages 73-80, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0376-7. doi: 10.1145/1871840.1871852. URL http://doi.acm.org/10.1145/1871840.1871852.

[56] S. A. Myers and J. Leskovec. Clash of the contagions: Cooperation and competition in information diffusion. In M. J. Zaki, A. Siebes, J. X. Yu, B. Goethals, G. I. Webb, and X. Wu, editors, ICDM, pages 539-548. IEEE Computer Society, 2012. ISBN 978-1-4673-4649-8. URL http://dblp.uni-trier.de/db/conf/icdm/icdm2012.html#MyersL12.

[57] M. Nagarajan, H. Purohit, and A. Sheth. A qualitative examination of topical tweet and retweet practices, 2010. URL http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/view/1484.

[58] B. A. Nardi, D. J. Schiano, M. Gumbrecht, and L. Swartz. Why we blog. Commun. ACM, 47(12):41-46, Dec. 2004. ISSN 0001-0782. doi: 10.1145/1035134.1035163. URL http://doi.acm.org/10.1145/1035134.1035163.


[59] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi. Bad news travel fast: A content-based analysis of interestingness on twitter. In ACM WebSci'11, pages 1-7, June 2011. URL http://journal.webscience.org/435/. WebSci Conference 2011.

[60] A. Oghina, M. Breuss, E. Tsagkias, and M. de Rijke. Predicting imdb movie ratings using social media. In ECIR 2012: 34th European Conference on Information Retrieval, pages 503-507, Barcelona, Spain, 2012. Springer-Verlag.

[61] D. Padmanabhan and S. Chakraborti. Finding relevant tweets. In H. Gao, L. Lim, W. Wang, C. Li, and L. Chen, editors, Web-Age Information Management, volume 7418 of Lecture Notes in Computer Science, pages 228-240. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-32280-8. doi: 10.1007/978-3-642-32281-5_23. URL http://dx.doi.org/10.1007/978-3-642-32281-5_23.

[62] H.-K. Peng, J. Zhu, D. Piao, R. Yan, and Y. Zhang. Retweet modeling using conditional random fields. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops, ICDMW '11, pages 336-343, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4409-0. doi: 10.1109/ICDMW.2011.146. URL http://dx.doi.org/10.1109/ICDMW.2011.146.

[63] M. Pennacchiotti, F. Silvestri, H. Vahabi, and R. Venturini. Making your interests follow you on twitter. In Proceedings of the 21st ACM international conference on Information and knowledge management, CIKM '12, pages 165-174, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1156-4. doi: 10.1145/2396761.2396786. URL http://doi.acm.org/10.1145/2396761.2396786.

[64] S. Petrovic, M. Osborne, and V. Lavrenko. Rt to win! predicting message propagation in twitter, 2011. URL http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2754.

[65] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled lda: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09, pages 248-256, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics. ISBN 978-1-932432-59-6. URL http://dl.acm.org/citation.cfm?id=1699510.1699543.

[66] D. Ramage, S. T. Dumais, and D. J. Liebling. Characterizing microblogs with topic models. In W. W. Cohen and S. Gosling, editors, ICWSM. The AAAI Press, 2010. URL http://dblp.uni-trier.de/db/conf/icwsm/icwsm2010.html#RamageDL10.

[67] I. Rish. An empirical study of the naive Bayes classifier. In IJCAI-01 workshop on "Empirical Methods in AI". URL http://www.intellektik.informatik.tu-darmstadt.de/~tom/IJCAI01/Rish.pdf.

[68] A. Ritter, C. Cherry, and B. Dolan. Unsupervised modeling of twitter conversations. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 172-180, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. ISBN 1-932432-65-5. URL http://dl.acm.org/citation.cfm?id=1857999.1858019.


[69] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1524-1534, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. ISBN 978-1-937284-11-4. URL http://dl.acm.org/citation.cfm?id=2145432.2145595.

[70] D. M. Romero, W. Galuba, S. Asur, and B. A. Huberman. Influence and passivity in social media. In Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part III, ECML PKDD'11, pages 18-33, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 978-3-642-23807-9. URL http://dl.acm.org/citation.cfm?id=2034161.2034164.

[71] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World wide web, WWW '10, pages 851-860, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-799-8. doi: 10.1145/1772690.1772777. URL http://doi.acm.org/10.1145/1772690.1772777.

[72] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. Twitterstand: news in tweets. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '09, pages 42-51, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-649-6. doi: 10.1145/1653771.1653781. URL http://doi.acm.org/10.1145/1653771.1653781.

[73] J. A. Shaw and E. A. Fox. Combination of multiple searches. In The Second Text REtrieval Conference (TREC-2), pages 243-252, 1994.

[74] N. Shuyo. Language detection library for java, 2010. URL http://code.google.com/p/language-detection/.

[75] F. Song and W. B. Croft. A general language model for information retrieval. In Proceedings of the eighth international conference on Information and knowledge management, CIKM '99, pages 316-321, New York, NY, USA, 1999. ACM. ISBN 1-58113-146-1. doi: 10.1145/319950.320022. URL http://doi.acm.org/10.1145/319950.320022.

[76] D. Spina, E. Meij, M. de Rijke, A. Oghina, M. T. Bui, and M. Breuss. Identifying entity aspects in microblog posts. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, SIGIR '12, pages 1089-1090, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1472-5. doi: 10.1145/2348283.2348483. URL http://doi.acm.org/10.1145/2348283.2348483.

[77] B. Suh, L. Hong, P. Pirolli, and E. H. Chi. Want to be retweeted? large scale analytics on factors impacting retweet in twitter network. In Proceedings of the 2010 IEEE Second International Conference on Social Computing, SOCIALCOM '10, pages 177-184, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4211-9. doi: 10.1109/SocialCom.2010.33. URL http://dx.doi.org/10.1109/SocialCom.2010.33.

[78] K. Tao, F. Abel, Q. Gao, and G.-J. Houben. Tums: twitter-based user modeling service. In Proceedings of the 8th international conference on The Semantic Web, ESWC'11, pages 269-283, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 978-3-642-25952-4. doi: 10.1007/978-3-642-25953-1_22. URL http://dx.doi.org/10.1007/978-3-642-25953-1_22.


[79] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe. Predicting elections with twitter: What 140 characters reveal about political sentiment. In W. W. Cohen and S. Gosling, editors, ICWSM. The AAAI Press, 2010. URL http://dblp.uni-trier.de/db/conf/icwsm/icwsm2010.html#TumasjanSSW10.

[80] I. T. Union. Trends in telecommunication reform 2012. http://www.itu.int/ITU-D/treg/publications/trends12.html, May 2012.

[81] I. Uysal and W. B. Croft. User oriented tweet ranking: a filtering approach to microblogs. In Proceedings of the 20th ACM international conference on Information and knowledge management, CIKM '11, pages 2261-2264, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0717-8. doi: 10.1145/2063576.2063941. URL http://doi.acm.org/10.1145/2063576.2063941.

[82] W. Weerkamp, S. Carter, and E. Tsagkias. How people use twitter in different languages. In Web Science 2011, Koblenz, 2011. ACM.

[83] J. Weng, E.-P. Lim, J. Jiang, and Q. He. Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of the third ACM international conference on Web search and data mining, WSDM '10, pages 261-270, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-889-6. doi: 10.1145/1718487.1718520. URL http://doi.acm.org/10.1145/1718487.1718520.

[84] Wikipedia. Ward's method. http://en.wikipedia.org/wiki/Ward%27s_method, Feb. 2013.

[85] R. Yan, M. Lapata, and X. Li. Tweet recommendation with graph co-ranking. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 516-525, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=2390524.2390597.

[86] J. Yang and S. Counts. Predicting the speed, scale, and range of information diffusion in twitter, 2010. URL http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/view/1468.

[87] Z. Yang, J. Guo, K. Cai, J. Tang, J. Li, L. Zhang, and Z. Su. Understanding retweeting behaviors in social networks. In Proceedings of the 19th ACM international conference on Information and knowledge management, CIKM '10, pages 1633-1636, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0099-5. doi: 10.1145/1871437.1871691. URL http://doi.acm.org/10.1145/1871437.1871691.

[88] T. R. Zaman, R. Herbrich, J. V. Gael, and D. Stern. Predicting information spreading in twitter, 2010.

[89] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '01, pages 334-342, New York, NY, USA, 2001. ACM. ISBN 1-58113-331-6. doi: 10.1145/383952.384019. URL http://doi.acm.org/10.1145/383952.384019.

[90] H. Zhang. The Optimality of Naive Bayes. In V. Barr and Z. Markov, editors, FLAIRS Conference. AAAI Press, 2004. URL http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf.