
IEEE INFOCOM Workshops 2009, Rio de Janeiro, Brazil, April 19-25, 2009


Detecting Spammers and Content Promoters in Online Video Social Networks

Fabrício Benevenuto*, Tiago Rodrigues, Jussara Almeida, Marcos Gonçalves and Virgílio Almeida
Computer Science Department

Federal University of Minas Gerais, Brazil{fabricio, tiagorm, jussara, mgoncalv, virgilio}@dcc.ufmg.br

Abstract—Online video social networks provide features that allow users to post a video as a response to a discussion topic. These features open opportunities for users to introduce polluted content into the system. For instance, spammers may post an unrelated video as a response to a popular one, aiming to increase the likelihood of the response being viewed by a larger number of users. Moreover, opportunistic users, which we call promoters, may try to gain visibility for a specific video by posting a large number of responses to boost the rank of the responded video, making it appear in the top lists maintained by the system. In this paper, we address the issue of detecting video spammers and promoters. Towards that end, we build a test collection of real YouTube users. Using this collection, we investigate the feasibility of using a supervised classification algorithm to detect spammers and promoters. We find that our approach correctly identifies the majority of the promoters while misclassifying only a small percentage of legitimate users. In contrast, although we are able to detect a significant fraction of spammers, they prove much harder to distinguish from legitimate users.

I. INTRODUCTION

With Internet video sharing sites gaining popularity at a dazzling speed, the Web is being transformed into a major channel for the delivery of multimedia. As a consequence, various Web 2.0 services are offering video-based functions as alternatives to text-based ones, such as video reviews for products, video ads, and video responses.

By allowing users to publicize and share videos, video social networks become susceptible to different types of opportunistic user actions. As an example, YouTube provides features that allow users to post a video as a response to a discussion topic. Although appealing as a mechanism to enrich online interaction, these features open opportunities for users to introduce polluted content into the system. For example, users, which we call spammers, may post an unrelated video as a response to a popular one, aiming to increase the likelihood of the response being viewed by a larger number of users. Moreover, opportunistic users, namely promoters, may try to gain visibility for a specific video by posting a large number of (potentially unrelated) responses to boost the rank of the responded video, making it appear in the top lists maintained by YouTube. Promoters and spammers are motivated to pollute for several reasons, such as spreading advertisements, disseminating pornography (often as an advertisement for a Web site), or simply compromising the system's reputation.

Content pollution may compromise user patience and satisfaction with the system, since users cannot easily identify polluted content before watching at least a segment of it, which also consumes system resources, especially bandwidth. Additionally, promoters can have a further negative impact on the system, since promoted videos that quickly reach high rankings are strong candidates to be kept in caches or in content distribution networks [2].

* Fabrício is supported by UOL (www.uol.com.br), through the UOL Bolsa Pesquisa program, process number 20080125143100a.

In this paper, we address the issue of detecting video spammers and promoters. To that end, we crawled a large user data set from the YouTube site. Then, we created a labeled collection with users manually classified as legitimate users, spammers, and promoters. Lastly, we investigated the feasibility of applying a supervised learning method to identify polluters.

II. USER TEST COLLECTION

In order to evaluate our proposed approach to detect video spammers and promoters in online video social networking systems, we need a test collection of users pre-classified into the target categories, namely spammers, promoters, and legitimate users. Thus, our first step consists of collecting a sample of users who participate in interactions through video responses, i.e., who post or receive video responses. These interactions can be represented by a video response user graph G = (X, Y), where X is the set of all users who posted or received video responses until a certain instant of time, and (x1, x2) is a directed arc in Y if user x1 ∈ X has responded to a video contributed by user x2 ∈ X [1]. To obtain a representative sample of the graph (X, Y), we built a crawler that collects an entire weakly connected component of (X, Y) by following links of video responses and responded videos. The sampling started from a set of 88 seeds, consisting of the owners of the top 100 most responded videos of all time, as listed by YouTube. The crawler ran for one week (January 11-18, 2008), gathering a total of 264,460 users, 381,616 responded videos, and 701,950 video responses.
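The weakly connected component crawl described above amounts to a breadth-first traversal that follows arcs in both directions. A minimal sketch of that idea, using an in-memory edge list with hypothetical user IDs (the real crawler instead follows links on the YouTube site):

```python
from collections import defaultdict, deque

def build_graph(responses):
    """Build the directed video response user graph G = (X, Y).

    Each (x1, x2) pair records that user x1 responded to a video
    contributed by user x2."""
    nodes, arcs = set(), set()
    for x1, x2 in responses:
        nodes.update((x1, x2))
        arcs.add((x1, x2))
    return nodes, arcs

def weakly_connected_component(nodes, arcs, seed):
    """BFS from a seed user, treating each arc as undirected.

    This mirrors a crawler that follows both video responses and
    responded videos, collecting one weakly connected component."""
    undirected = defaultdict(set)
    for x1, x2 in arcs:
        undirected[x1].add(x2)
        undirected[x2].add(x1)
    seen, queue = {seed}, deque([seed])
    while queue:
        u = queue.popleft()
        for v in undirected[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen
```

Seeding the traversal from the owners of highly responded videos, as in the paper, makes it likely that the collected component covers a large share of the active responders.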

To create the test collection, we defined three strategies for selecting users to be classified by volunteers. To minimize the impact of human error, three volunteers analyzed all video responses of each selected user and independently classified her into one of the three categories. In case of a tie (i.e., each volunteer chose a different class), a fourth independent volunteer was consulted. Each user was then classified based on majority voting. Volunteers were instructed to favor legitimate users: if one was not confident that a video response was unrelated to the responded video, she should consider it legitimate. Video responses containing people chatting or expressing their opinions were classified as legitimate, as we chose not to evaluate the expressed opinions.
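The majority-voting rule with a fourth tie-breaking volunteer can be sketched as follows (label names are illustrative, not the authors' code):

```python
from collections import Counter

def classify_user(votes, tie_breaker=None):
    """Decide a user's class from three volunteer labels.

    If at least two volunteers agree, that label wins; if all three
    differ (a three-way tie), the fourth volunteer's label decides."""
    label, count = Counter(votes).most_common(1)[0]
    if count > 1:
        return label
    return tie_breaker  # three-way tie: defer to the fourth volunteer
```

With three voters and three classes, a tie can only be a three-way split, so a single extra opinion is always enough to break it.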

978-1-4244-3968-3/09/$25.00 ©2009


The three user selection strategies are as follows. (1) To select users with different levels of interaction through video responses, we first defined four groups of users based on their in- and out-degrees in the video response user graph. Group 1 consists of users with low (≤ 10) in- and out-degrees, and group 2 contains users with high (> 10) in-degree and low out-degree. Group 3 consists of users with low in-degree and high out-degree, whereas users with high in- and out-degrees fall into group 4. One hundred users were randomly selected from each group¹ and manually classified, yielding a total of 382 legitimate users, 10 spammers, and no promoters. The remaining 8 users were discarded, as their accounts had been suspended. (2) Aiming to populate the test collection with polluters, we searched for them where they are most likely to be found. We first note that, in YouTube, a video can be posted as a response to at most one video at a time. Thus, we conjecture that spammers post their video responses more often to popular videos, so as to make each spam visible to a larger community of users. Moreover, some video promoters might eventually be successful and have their targets listed among the most popular videos. Thus, we browsed the video responses posted to the top 100 most responded videos of all time, selecting a number of suspect users². The classification of these suspect users added 7 legitimate users, 118 spammers, and 28 promoters to the test collection. (3) To minimize a possible bias introduced by strategy (2), we randomly selected 300 users who posted video responses to the top 100 most responded videos of all time, finding 252 new legitimate users, 29 new spammers, and 3 new promoters (16 users with closed accounts were discarded). In total, our test collection contains 855 users: 641 classified as legitimate, 157 as spammers, and 31 as promoters. These users posted 20,644 video responses to 9,796 unique responded videos.

III. DETECTING SPAMMERS AND PROMOTERS

To assess the effectiveness of our classification strategy, we use the standard information retrieval metrics of recall, precision, Micro-F1, and Macro-F1 [4]. We chose a Support Vector Machine (SVM) classifier [3], and the classification experiments were performed using 5-fold cross-validation. With 95% confidence, results do not differ from the reported averages by more than 5%.
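Micro-F1 pools true positives, false positives, and false negatives over all classes before computing F1, whereas Macro-F1 averages the per-class F1 values, so it weights rare classes (here, promoters) as heavily as common ones. A minimal sketch of how these summary metrics are computed (not the authors' code):

```python
def f1_scores(y_true, y_pred, classes):
    """Per-class F1 plus Micro-F1 and Macro-F1 from raw labels.

    Micro-F1 pools counts across classes; Macro-F1 is the unweighted
    mean of the per-class F1 values."""
    per_class = {}
    tp_all = fp_all = fn_all = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro_p = tp_all / (tp_all + fp_all) if tp_all + fp_all else 0.0
    micro_r = tp_all / (tp_all + fn_all) if tp_all + fn_all else 0.0
    micro = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    macro = sum(per_class.values()) / len(classes)
    return per_class, micro, macro
```

Note that in a single-label multiclass setting, every false positive for one class is a false negative for another, so Micro-F1 reduces to overall accuracy.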

The features used by the classifier are organized into three sets. The first set consists of attributes of the videos owned by the user. We consider three groups of such videos: the first group contains aggregate information on all videos uploaded by the user, being useful to capture how others see the (video) contributions of this user; the second group considers only her video responses, which may be pollution; and the last group considers only the target videos to which she posted video responses. For each video group, we consider the average and the sum of the following attributes: duration, numbers of views and of comments received, ratings, number of times the video was selected as favorite, and numbers of honors and of external links, summing up to 42 video attributes per user. The second set of attributes consists of individual characteristics of user behavior: average time between video uploads, maximum number of videos uploaded in 24 hours, and numbers of friends, videos uploaded, videos watched, videos added as favorite, video responses posted and received, subscriptions, and subscribers. The third set captures the social relationships established between users via video response interactions: clustering coefficient, betweenness, reciprocity, assortativity, and UserRank.

¹Groups 1, 2, 3, and 4 contain 162,546, 2,333, 3,189, and 1,154 users, respectively; thus, selecting the same number of users from each group yields a bias towards group 4.

²As an example, the owner of a video with a pornographic picture as its thumbnail, posted to a political debate video discussion topic.
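The per-group aggregation behind the 42 video attributes (7 per-video attributes × 2 statistics × 3 video groups) can be sketched as follows; the attribute names are illustrative:

```python
def video_group_features(videos, attrs):
    """For one group of videos, emit the sum and the average of each
    per-video attribute (e.g. duration, views, comments, ratings)."""
    feats = {}
    for a in attrs:
        values = [v[a] for v in videos]
        feats[f"sum_{a}"] = sum(values)
        feats[f"avg_{a}"] = sum(values) / len(values) if values else 0.0
    return feats
```

Running this once per video group (all uploads, video responses, responded videos) and concatenating the results yields the full video-attribute portion of a user's feature vector.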

As a result, approximately 96% of promoters, 57% of spammers, and 95% of legitimate users were correctly classified. Moreover, no promoter was classified as a legitimate user, whereas only a small fraction of promoters (3.87%) were erroneously classified as spammers. By manually inspecting these promoters, we found that the videos they targeted (i.e., the promoted videos) had actually acquired a certain popularity. In that case, it is harder to distinguish them from spammers, who more often target very popular videos. A significant fraction (almost 42%) of spammers were misclassified as legitimate users. These spammers exhibit a dual behavior, sharing a reasonable number of legitimate (non-spam) videos and posting legitimate video responses, thus presenting themselves as legitimate users most of the time but occasionally posting video spam. Therefore, as opposed to promoters, distinguishing spammers from legitimate users is much harder. Summarizing the classification results, the Micro-F1 value is 87.5, whereas the per-class F1 values are 63.7, 90.8, and 92.3 for spammers, promoters, and legitimate users, respectively, resulting in a Macro-F1 of 82.2. As a first approach, our proposed classification provides significant benefits, being effective in identifying polluters in the system.

IV. CONCLUSIONS AND ONGOING RESEARCH

In this work, we proposed an effective solution to the problem of detecting polluters that can guide system administrators to spammers and promoters in online video social networks. As future work, we aim to reduce the cost of the labeling process by studying the viability of semi-supervised learning methods to detect spammers and promoters.

REFERENCES

[1] F. Benevenuto, F. Duarte, T. Rodrigues, V. Almeida, J. Almeida, and K. Ross. Understanding video interactions in YouTube. In Proc. ACM Multimedia (MM), 2008.

[2] M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: Analyzing the world's largest user generated content video system. In Proc. Internet Measurement Conference (IMC), 2007.

[3] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proc. European Conference on Machine Learning (ECML), 1998.

[4] Y. Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1, 1999.