2
The Social Honeypot Project: Protecting Online Communities from Spammers * Kyumin Lee Department of Computer Science and Engineering Texas A&M University College Station, TX, USA [email protected] James Caverlee Department of Computer Science and Engineering Texas A&M University College Station, TX, USA [email protected] Steve Webb College of Computing Georgia Institute of Technology Atlanta, GA, USA [email protected] ABSTRACT We present the conceptual framework of the Social Honey- pot Project for uncovering social spammers who target on- line communities and initial empirical results from Twitter and MySpace. Two of the key components of the Social Hon- eypot Project are: (1) The deployment of social honeypots for harvesting deceptive spam profiles from social network- ing communities; and (2) Statistical analysis of the prop- erties of these spam profiles for creating spam classifiers to actively filter out existing and new spammers. Categories and Subject Descriptors: H.3.5 [Online In- formation Services]: Web-based services; J.4 [Computer Ap- plications]: Social and behavioral sciences General Terms: Design, Experimentation, Security Keywords: social media, social honeypots, spam 1. OVERALL FRAMEWORK Spammers are increasingly targeting Web-based social sys- tems (like Facebook, MySpace, YouTube, etc.) as part of phishing attacks, to disseminate malware and commercial spam messages, and to promote affiliate websites. Success- fully defending against these social spammers is important to improve the quality of experience for community mem- bers, to lessen the system load of dealing with unwanted and sometimes dangerous content, and to positively impact the overall value of the social system going forward. However, little is known about these social spammers, their level of sophistication, or their strategies and tactics. In our ongoing research, we are developing approaches for uncovering and investigating social spammers through a prototype system called the Social Honeypot Project. Con- cretely, the Social Honeypot Project is designed to (i) auto- matically harvest spam profiles from social networking com- munities; (ii) develop robust statistical user models for dis- tinguishing between social spammers and legitimate users; and (iii) actively filter out unknown spammers based on these user models. Drawing inspiration from security re- searchers who have used honeypots to observe and analyze * This work is partially supported by a Google Research Award and by faculty startup funds from Texas A&M Uni- versity and the Texas Engineering Experiment Station. Copyright is held by the author/owner(s). WWW 2010, April 26–30, 2010, Raleigh, North Carolina, USA. ACM 978-1-60558-799-8/10/04. malicious activity (e.g., [1]), the Social Honeypot Project de- ploys and maintains social honeypots for trapping evidence of spam profile behavior. In practice, we deploy a social honeypot consisting of a legitimate profile and an associated bot to detect social spam behavior. If the social honeypot detects suspicious user activity (e.g., the honeypot’s profile receiving an unsolicited friend request) then the social hon- eypot’s bot collects evidence of the spam candidate (e.g., by crawling the profile of the user sending the unsolicited friend request plus hyperlinks from the profile to pages on the Web-at-large). What entails suspicious user behavior can be optimized for the particular community and updated based on new observations of spammer activity. While social honeypots alone are a potentially valuable tool for gathering evidence of social spam attacks and sup- porting a greater understanding of spam strategies, it is the goal of the Social Honeypot Project to support ongoing and active automatic detection of new and emerging spammers (See Figure 1). As the social honeypots collect spam ev- idence, we extract observable features from the collected candidate spam profiles (e.g., number of friends, text on the profile, age, etc.). Coupled with a set of known legitimate (non-spam) profiles which are more populous and easy to ex- tract from social networking communities, these spam and legitimate profiles become part of the initial training set of a spam classifier. Through iterative refinement of the features selected and the particular classifier used (e.g., Naive Bayes, SVM), the spam classifier can be optimized over the known spam and legitimate profiles. In our design of the overall architecture of the Social Honeypot Project we include hu- man inspectors in-the-loop for validating the quality of these extracted spam candidates. 2. SOCIAL SPAM DETECTION RESULTS Based on the overall social honeypot framework, we se- lected two social networking communities – Myspace and Twitter – to evaluate the effectiveness of the proposed spam defense mechanism. Both MySpace and Twitter support public access to their profiles, so all data collection can rely on purely public data capture. MySpace Social Honeypot Deployment: We created 51 generic honeypot profiles within the MySpace community for attracting spammer activity so that we can identify and analyze the characteristics of social spam profiles (fully de- scribed in [2]). Based on a four month evaluation period (October 2007 to January 2008), we collected 1,570 profiles that sent unsolicited friend requests to the honeypots.

The Social Honeypot Project: Protecting Online …faculty.cs.tamu.edu/caverlee/pubs/lee10 Social Honeypot Project: Protecting Online Communities from Spammers Kyumin Lee Department

  • Upload
    donhan

  • View
    220

  • Download
    1

Embed Size (px)

Citation preview

Page 1: The Social Honeypot Project: Protecting Online …faculty.cs.tamu.edu/caverlee/pubs/lee10 Social Honeypot Project: Protecting Online Communities from Spammers Kyumin Lee Department

The Social Honeypot Project: Protecting OnlineCommunities from Spammers∗

Kyumin LeeDepartment of ComputerScience and Engineering

Texas A&M UniversityCollege Station, TX, USA

[email protected]

James CaverleeDepartment of ComputerScience and Engineering

Texas A&M UniversityCollege Station, TX, USA

[email protected]

Steve WebbCollege of Computing

Georgia Institute ofTechnology

Atlanta, GA, [email protected]

ABSTRACTWe present the conceptual framework of the Social Honey-pot Project for uncovering social spammers who target on-line communities and initial empirical results from Twitterand MySpace. Two of the key components of the Social Hon-eypot Project are: (1) The deployment of social honeypotsfor harvesting deceptive spam profiles from social network-ing communities; and (2) Statistical analysis of the prop-erties of these spam profiles for creating spam classifiers toactively filter out existing and new spammers.

Categories and Subject Descriptors: H.3.5 [Online In-formation Services]: Web-based services; J.4 [Computer Ap-plications]: Social and behavioral sciences

General Terms: Design, Experimentation, Security

Keywords: social media, social honeypots, spam

1. OVERALL FRAMEWORKSpammers are increasingly targeting Web-based social sys-

tems (like Facebook, MySpace, YouTube, etc.) as part ofphishing attacks, to disseminate malware and commercialspam messages, and to promote affiliate websites. Success-fully defending against these social spammers is importantto improve the quality of experience for community mem-bers, to lessen the system load of dealing with unwanted andsometimes dangerous content, and to positively impact theoverall value of the social system going forward. However,little is known about these social spammers, their level ofsophistication, or their strategies and tactics.

In our ongoing research, we are developing approachesfor uncovering and investigating social spammers through aprototype system called the Social Honeypot Project. Con-cretely, the Social Honeypot Project is designed to (i) auto-matically harvest spam profiles from social networking com-munities; (ii) develop robust statistical user models for dis-tinguishing between social spammers and legitimate users;and (iii) actively filter out unknown spammers based onthese user models. Drawing inspiration from security re-searchers who have used honeypots to observe and analyze

∗This work is partially supported by a Google ResearchAward and by faculty startup funds from Texas A&M Uni-versity and the Texas Engineering Experiment Station.

Copyright is held by the author/owner(s).WWW 2010, April 26–30, 2010, Raleigh, North Carolina, USA.ACM 978-1-60558-799-8/10/04.

malicious activity (e.g., [1]), the Social Honeypot Project de-ploys and maintains social honeypots for trapping evidenceof spam profile behavior. In practice, we deploy a socialhoneypot consisting of a legitimate profile and an associatedbot to detect social spam behavior. If the social honeypotdetects suspicious user activity (e.g., the honeypot’s profilereceiving an unsolicited friend request) then the social hon-eypot’s bot collects evidence of the spam candidate (e.g.,by crawling the profile of the user sending the unsolicitedfriend request plus hyperlinks from the profile to pages onthe Web-at-large). What entails suspicious user behaviorcan be optimized for the particular community and updatedbased on new observations of spammer activity.

While social honeypots alone are a potentially valuabletool for gathering evidence of social spam attacks and sup-porting a greater understanding of spam strategies, it is thegoal of the Social Honeypot Project to support ongoing andactive automatic detection of new and emerging spammers(See Figure 1). As the social honeypots collect spam ev-idence, we extract observable features from the collectedcandidate spam profiles (e.g., number of friends, text on theprofile, age, etc.). Coupled with a set of known legitimate(non-spam) profiles which are more populous and easy to ex-tract from social networking communities, these spam andlegitimate profiles become part of the initial training set of aspam classifier. Through iterative refinement of the featuresselected and the particular classifier used (e.g., Naive Bayes,SVM), the spam classifier can be optimized over the knownspam and legitimate profiles. In our design of the overallarchitecture of the Social Honeypot Project we include hu-man inspectors in-the-loop for validating the quality of theseextracted spam candidates.

2. SOCIAL SPAM DETECTION RESULTSBased on the overall social honeypot framework, we se-

lected two social networking communities – Myspace andTwitter – to evaluate the effectiveness of the proposed spamdefense mechanism. Both MySpace and Twitter supportpublic access to their profiles, so all data collection can relyon purely public data capture.

MySpace Social Honeypot Deployment: We created 51generic honeypot profiles within the MySpace communityfor attracting spammer activity so that we can identify andanalyze the characteristics of social spam profiles (fully de-scribed in [2]). Based on a four month evaluation period(October 2007 to January 2008), we collected 1,570 profilesthat sent unsolicited friend requests to the honeypots.

Page 2: The Social Honeypot Project: Protecting Online …faculty.cs.tamu.edu/caverlee/pubs/lee10 Social Honeypot Project: Protecting Online Communities from Spammers Kyumin Lee Department

Figure 1: The Social Honeypot Project: Overall Framework

Twitter Social Honeypot Deployment: Similarly, we cre-ated and deployed a mix of honeypots within the Twittercommunity to track unsolicited “followers.” From August2009 to September 2009, these social honeypots collected500 users’ data.

Since social honeypots are triggered by spam behaviorsonly, it is unclear if the corresponding profiles engaging inthe spam behavior also exhibit clearly observable spam sig-nals. If there are clear patterns, then by training a clas-sifier on the observable signals, we may be able to predictnew spam even in the absence of triggering spam behaviors.We consider four broad classes of user attributes that aretypically observable (unlike, say, private messaging betweentwo users) in the social network: (i) user demographics: in-cluding age, gender, location, and other descriptive informa-tion about the user; (ii) user-contributed content: includ-ing “About Me” text, blog posts, comments posted on otheruser’s profiles, tweets, etc.; (iii) user activity features: in-cluding posting rate, tweet frequency; (iv) user connections:including number of friends in the social network, followers,following. The classification experiments were performedin the Weka [3] using 10-fold cross-validation to improvethe reliability of classifier evaluations and evaluated usingstandard metrics such as precision, recall, accuracy, the F1

measure, false positive and true positive.

Table 1: Spam Classification Results

Accuracy F1 FPMySpace 99.21% 0.992 0.7%Twitter 88.98% 0.888 5.7%

In Table 1, we report the results for spam classificationover both MySpace and Twitter.1 We additionally consid-ered different training mixtures of spam and legitimate train-ing data (from 10% spam / 90% legitimate to 90% spam /10% legitimate); we find that the classification metrics arerobust across these changes in training data. We addition-ally find that some features are stronger spam predictorsthan others (see Figure 2 for ROC curves for the Twitterdataset); in this case features like tweets per day and ac-count age are not strong spam signals, whereas number ofURLs per tweet and inter-tweet similarity are strong signals.

1Additional experimental details available from http://infolab.tamu.edu/projects/social_honeypots/

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Positive Rate

Tru

e P

ositi

ve R

ate

F−F Ratio|@| per tweet|URL| per tweet|tweets| per dayAccount Age|unique @| per tweet|unique URL| per tweetBi−friendsTweets SimilarityTweets

Figure 2: Twitter – Feature Comparison

3. CONCLUSIONWe find strong evidence that social honeypots can attract

spam behaviors that are strongly correlated with observablefeatures of the spammer’s profiles and their activity in thenetwork (e.g., tweet frequency). These results hold acrosstwo fundamentally different communities and confirm thehypothesis that spammers engage in behavior that is cor-related with observable features that distinguish them fromlegitimate users. In addition, we find that some of these sig-nals may be difficult for spammers to obscure (e.g., contentcontaining a sales pitch or deceptive content), so that the re-sults are encouraging for ongoing effective spam detection.

4. REFERENCES[1] L. Spitzner. Honeypots: Tracking Hackers.

Addison-Wesley Longman Publishing Co., Inc., Boston,MA, USA, 2002.

[2] S. Webb, J. Caverlee, and C. Pu. Social honeypots:Making friends with a spammer near you. InProceedings of the Fifth Conference on Email andAnti-Spam (CEAS 2008), 2008.

[3] I. H. Witten and E. Frank. Data Mining: PracticalMachine Learning Tools and Techniques (SecondEdition). Morgan Kaufmann, June 2005.