Social Honeypots: Making Friends With A Spammer Near You

Steve Webb, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, [email protected]

James Caverlee, Dept. of Computer Science, Texas A&M University, College Station, TX 77843, [email protected]

Calton Pu, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, [email protected]

Abstract

Social networking communities have become an important communications platform, but the popularity of these communities has also made them targets for a new breed of social spammers. Unfortunately, little is known about these social spammers, their level of sophistication, or their strategies and tactics. Thus, in this paper, we provide the first characterization of social spammers and their behaviors. Concretely, we make two contributions: (1) we introduce social honeypots for tracking and monitoring social spam, and (2) we report the results of an analysis performed on spam data that was harvested by our social honeypots. Based on our analysis, we find that the behaviors of social spammers exhibit recognizable temporal and geographic patterns and that social spam content contains various distinguishing characteristics. These results are quite promising and suggest that our analysis techniques may be used to automatically identify social spam.

1 Introduction

Over the past few years, social networking communities have experienced unprecedented growth, both in terms of size and popularity. In fact, of the top-20 most visited World Wide Web destinations, six are now social networks, which is five more than the list from only three years ago [2]. This flood of activity is remaking the Web into a “social Web” where users and their communities are the centers for online growth, commerce, and information sharing. Unfortunately, the rapid growth of these communities has made them prime targets for attacks by malicious individuals.
Most notably, these communities are being bombarded by social spam [14, 24].

Some of the spam in social networking communities is quite familiar. For example, message spam within a community is similar in form and function to email spam on the wider Internet, and comment spam on social networking profiles manifests itself in a similar fashion to blog spam. Defenses against these familiar forms of spam can be easily adapted to target their social networking analogs [14]. However, other forms of social spam are new and have risen out of the very fabric of these communities. One of the most important examples of this new generation of spam is deceptive spam profiles, which attempt to manipulate user behavior. These deceptive profiles are inserted into the social network by spammers in an effort to prey on innocent community users and to pollute these communities. Although fake profiles (or fakesters) have been a “fun” part of online social networks from their earliest days [22], growing evidence suggests that spammers are deploying deceptive profiles in increasing numbers and with more intent to do harm. For example, deceptive profiles can be used to drive legitimate users to Web spam pages, to distribute malware, and to disrupt the quality of community-based knowledge by spreading disinformation [24].

Understanding different types of social spam and deception is the first step towards countering these vulnerabilities. Hence, in this paper, we propose a novel technique for harvesting deceptive spam profiles from social networking communities using social honeypots. Then, we provide a characterization of the spam profiles that we collected with our social honeypots. To the best of our knowledge, this is the first use of honeypots in the social networking environment as well as the first characterization of deceptive spam profiles.
Our social honeypots draw inspiration from security researchers who have used honeypots to observe and analyze malicious activity. Specifically, honeypots have already been used to characterize malicious hacker activity [21], to generate intrusion detection signatures [16], and to observe email address harvesters [19]. In our current research, we create honeypot profiles within a community to attract spammer activity so that we can identify and analyze the characteristics of social spam profiles. Concretely, we constructed 51 honeypot profiles and associated them with distinct geographic locations in MySpace, the largest and most active social networking community. After creating our social honeypots, we deployed them and collected all of the traffic they received (via friend requests). Based on a four-month evaluation period from October 1, 2007 to February 1, 2008, we have conducted a sweeping characterization of the harvested spam profiles from our social honeypots. A few of the most interesting findings from this analysis are:

  • The spamming behaviors of spam profiles follow distinct temporal patterns.

  • The most popular spamming targets are Midwestern states, and the most popular location for spam profiles is California.

  • The geographic locations of spam profiles almost never overlap with the locations of their targets.

  • 57.2% of the spam profiles obtain their “About me” content from another profile.

  • Many of the spam profiles exhibit distinct demographic characteristics (e.g., age, relationship status, etc.).

  • Spam profiles use thousands of URLs and various redirection techniques to funnel users to a handful of destination Web pages.

The rest of the paper is organized as follows. Section 2 summarizes related work. Section 3 provides background information about social networking communities and describes the social spam that is currently plaguing these communities. In Section 4, we present our methodology for creating social honeypots and collecting deceptive spam profiles. In Section 5, we report the results of an analysis we performed on the spam profiles that we collected in these honeypots. Section 6 concludes the paper.

2 Related Work

Due to the explosive growth and popularity of social networking communities, a great deal of research has been done to study various aspects of these communities. Specifically, these studies have focused on usage patterns [7, 11], information revelation patterns [7, 15], and social implications [8, 9] of the most popular communities. Work has also been done to characterize the growth of these communities [17] and to predict new friendships [18] and group formations [3].

Recently, researchers have also begun investigating the darker side of these communities. For example, numerous studies have explored the privacy threats associated with public information revelation in the communities [1, 3, 4, 12]. Aside from privacy risks, researchers have also identified attacks that are directed at these communities (e.g., social spam) [14]. In our previous work [24], we showed that social networking communities are susceptible to two broad classes of attacks: traditional attacks that have been adapted to these communities (e.g., malware propagation) and new attacks that have emerged from within the communities (e.g., deceptive spam profiles).

Unfortunately, very little work has been done to address the emerging security threats in social networking communities. Heymann et al. [14] presented a framework for addressing these threats, and Zinman and Donath [25] attempted to use machine learning techniques to classify profiles. In our previous research [6], we proposed the SocialTrust framework to provide tamper-resilient trust establishment in these communities. However, the research community desperately needs real-world examples and characterizations of malicious activity to inspire new solutions. Thus, to help address this problem, we present a novel technique for collecting deceptive spam profiles in social networking communities that relies on social honeypot profiles. Additionally, we provide the first characterization of deceptive spam profiles in an effort to stimulate research progress.

3 Social Spam

Social networking communities, such as MySpace, provide an online platform for people to manage existing relationships, form new ones, and participate in various social interactions. To facilitate these interactions, a user’s online presence in the community is represented by a profile, which is a user-controlled Web page that contains a picture of the user and various pieces of personal information. Additionally, a user’s profile also contains a list of links to the profiles of that user’s friends. Each of these friend links is bidirectional and established only after the user has received and accepted a friend request from another user.

Aside from friend requests, MySpace also provides numerous communication facilities that enable users to communicate with each other within the community (e.g., messaging, commenting, and blogging systems). Unfortunately, spammers have already begun exploiting these systems by propagating spam (e.g., message spam, comment spam, etc.) through them [24]. Even more troubling is the fact that spammers are now polluting the communities with deceptive spam profiles that aim to manipulate legitimate users.

Figure 1: An example of a deceptive spam profile.

An example of a MySpace spam profile1 is shown in Figure 1. As the figure shows, spam profiles contain a wealth of information and various deceptive properties. Most notably, these profiles typically use a provocative image of a woman to entice users to view them. Then, once the profiles have attracted visitors, they direct those visitors to perform an action of some sort (e.g., visiting a Web page outside of the community) by using a seductive story in their “About me” sections. For example, the profile in Figure 1 provides a link to another Web page and promises that “bad girl pics” will be found there.

After spammers have constructed their deceptive profiles, they must attract visitors. To generate this traffic for their profiles, spammers typically employ two strategies. First, spammers keep their profiles logged in to MySpace for long periods of time. This strategy generates attention because many of the MySpace searching mechanisms give preferential treatment to profiles that are currently logged in to the system. Consequently, when users are browsing through profiles, the spam profiles will be prominently displayed. The second strategy is much more aggressive and involves sending friend requests to MySpace users. Figure 2 shows an example friend request that corresponds to the profile shown in Figure 1. Unlike the first strategy, which passively relies on users to visit the spam profiles, this strategy actively contacts users and deceives them into believing the profile’s creator wants to befriend them.

1 All of the provocative images in the paper have been blurred so as not to offend anyone.

Figure 2: An example of a spam friend request.

4 Social Honeypots

For years, researchers have been deploying honeypots to capture examples of nefarious activities [16, 19, 21]. In this paper, we utilize honeypots to collect deceptive spam profiles in social networking communities. Specifically, we created 51 MySpace profiles to serve as our social honeypots. To observe any geographic artifacts of spamming behavior, each of these profiles was given a specific geographic location (i.e., one honeypot was assigned to each of the U.S. states and Washington, D.C.). With the exception of the D.C. honeypot, each profile’s city was chosen based on the most populated city in a given state. For example, Atlanta has the largest population in Georgia, and as a result, it was the city used for the Georgia honeypot. We used this strategy because we assumed that spammers would target larger cities due to their larger populations of potential victims.

All 51 of our honeypot profiles are identical except for their geographic information (see Figure 3 for an example). Each profile has the same name, gender, and birthday. Additionally, all of the demographic information was chosen to make the profiles appear attractive to spammers. Specifically, all of the profiles share the same relationship status (single), body type (athletic), and ethnicity (White / Caucasian). These demographic characteristics were also among the most popular in our previous large-scale characterization of MySpace profiles [7].

To collect timely information and increase the likelihood of being targeted by spammers, we created custom MySpace bots to ensure that all of our profiles are logged in to MySpace 24 hours a day, 7 days a week.2

Figure 3: An example of a social honeypot.

In addition to keeping our honeypot profiles logged in to the community, our bots also monitor any spamming activity that is directed at the profiles. Specifically, the bots are constantly checking the profiles for newly received friend requests. To avoid burdening MySpace with excessive traffic (and to avoid being labeled as a spam bot), each of our bots follows a polling policy that employs random sleep timers and an exponential backoff algorithm, which fluctuates sleep times based on the current amount of spamming activity (with a minimum and maximum sleep time of five minutes and one hour, respectively). Therefore, when a honeypot profile is receiving spam, its corresponding bot polls MySpace more aggressively than when the profile is not receiving spam.

After one of our honeypot profiles receives a new friend request, the bot responsible for that profile performs various tasks. First, the bot downloads the spam profile that sent the friend request,3 storing a copy of the profile along with a honeypot-specific identifier and a timestamp that corresponds to the time when the friend request was sent to the honeypot profile. Then, after storing a local copy of the profile, the bot rejects the friend request. We decided to reject the friend requests for two reasons. First, we wanted to identify spam profiles that are repeat offenders (i.e., they continuously send friend requests until they are accepted). Second, we did not want our honeypot profiles to be mistaken as spam profiles. If we blindly accept all of the spam friend requests, our honeypot profiles will appear to be helping the spam profiles in a manner similar to a Web spam page that participates in a link exchange or link farm [13]. Thus, to avoid suspicion by MySpace, our honeypot profiles conservatively reject the friend requests that they receive.

2 We experienced a few short outages (on the order of hours) due to MySpace system updates, which forced us to slightly modify our bots.

3 Band profiles also send unsolicited friend requests; however, our bots simply reject these requests without processing them.
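The polling policy described in Section 4 can be sketched as follows. The paper specifies only the five-minute and one-hour bounds, random sleep timers, and exponential backoff; the doubling rule, the jitter range, and the class name below are our own illustrative assumptions:

```python
import random

MIN_SLEEP = 5 * 60     # five minutes, the paper's stated minimum
MAX_SLEEP = 60 * 60    # one hour, the paper's stated maximum

class PollingPolicy:
    """Backoff-style polling: poll aggressively while spam is arriving,
    and back off exponentially during quiet periods."""

    def __init__(self):
        self.sleep_time = MIN_SLEEP

    def next_sleep(self, new_requests: int) -> float:
        if new_requests > 0:
            # Activity observed: reset to the aggressive minimum.
            self.sleep_time = MIN_SLEEP
        else:
            # No activity: double the sleep time, capped at one hour.
            self.sleep_time = min(self.sleep_time * 2, MAX_SLEEP)
        # Random jitter (+/- 10%) so the bots do not poll in lockstep.
        jitter = random.uniform(0.9, 1.1)
        return min(max(self.sleep_time * jitter, MIN_SLEEP), MAX_SLEEP)

policy = PollingPolicy()
quiet = [policy.next_sleep(0) for _ in range(6)]   # idle honeypot backs off
busy = policy.next_sleep(3)                        # spam arrives: reset
```

Under this sketch, an idle honeypot's polling interval doubles until it hits the one-hour cap, and a single incoming friend request snaps it back to five minutes.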

Many of the spam profiles contain links in their “About me” sections that direct users to Web pages outside of the social networking community. We wanted to study the characteristics of these Web pages; hence, in addition to storing the spam profiles, our bots also crawl the pages that are being advertised by these profiles. Specifically, after one of our bots stores a local copy of a profile, the bot parses the profile’s “About me” section and extracts its URLs. Then, the bot crawls the Web pages corresponding to those URLs, storing them along with their associated spam profile.
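One way to perform this URL extraction step is sketched below; the regular expression, the helper name, and the sample “About me” snippet are our own illustration, not the authors' implementation:

```python
import re

# Matches absolute http(s) URLs in href attributes or bare text.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+', re.IGNORECASE)

def extract_urls(about_me_html: str) -> list:
    """Return the distinct URLs advertised in an 'About me' section,
    preserving first-seen order."""
    seen, urls = set(), []
    for url in URL_PATTERN.findall(about_me_html):
        url = url.rstrip('.,)')  # trim trailing punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

# Hypothetical "About me" fragment in the style described in Section 3.
sample = ('<p>Find my bad girl pics at '
          '<a href="http://example.com/pics">here</a> '
          'or http://example.com/pics today!</p>')
found = extract_urls(sample)
```

Deduplicating here matters because spam profiles often repeat the same link in both anchor tags and bare text.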

Not surprisingly, almost all of the URLs that are advertised by spam profiles are entrances to sophisticated redirection chains. To identify the final destinations in these chains, our bots follow every redirect. First, the bots attempt to access each of the URLs being advertised by a spam profile. If a bot encounters HTTP redirects (i.e., 3xx HTTP status codes), the bot follows them until it accesses a URL that does not return a redirect. Then, the corresponding Web page is stored and parsed for HTML/JavaScript redirection techniques using the redirection detection algorithm we presented in our previous research [23]. If our algorithm extracts redirection URLs, our bots attempt to access them. Once again, if the bots encounter HTTP redirects, they follow the redirects until they find URLs that do not return redirects. Finally, the corresponding Web pages are stored. After completing this process, we are left with a collection of final destination pages and the intermediary pages that were crawled along the way.
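The redirect-chasing logic can be sketched as below. We substitute a simple meta-refresh check for the authors' full HTML/JavaScript redirection detector [23], and the function names, the `fetch` callback shape, and the five-hop cap are our assumptions:

```python
import re

# Minimal stand-in for the paper's HTML/JavaScript redirection detector:
# here we only recognize <meta http-equiv="refresh" ... url=...> redirects.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*url=([^"\'>]+)',
    re.IGNORECASE)

def resolve_chain(url, fetch, max_hops=5):
    """Follow a redirection chain to its final destination page.

    `fetch(url)` must return (status_code, location_or_body); 3xx statuses
    carry a Location URL, 200s carry the page body.  Returns the list of
    URLs visited, ending at the final destination.
    """
    chain = [url]
    for _ in range(max_hops):
        status, payload = fetch(chain[-1])
        if 300 <= status < 400:                   # HTTP redirect
            chain.append(payload)
        else:
            match = META_REFRESH.search(payload)  # HTML redirect
            if match:
                chain.append(match.group(1).strip())
            else:
                break                             # final destination reached
    return chain

# A fake spam site: an HTTP hop, then a meta-refresh hop, then the payload.
site = {
    "http://spam.example/a": (301, "http://spam.example/b"),
    "http://spam.example/b": (200, '<meta http-equiv="refresh" '
                                   'content="0; url=http://dest.example/">'),
    "http://dest.example/": (200, "<html>final destination</html>"),
}
chain = resolve_chain("http://spam.example/a", lambda u: site[u])
```

Separating the fetch callback from the chain-walking logic also makes it easy to replay stored pages offline, which matches the paper's approach of storing every intermediary page.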

5 Social Honeypot Data Analysis

In this section, we investigate various characteristics of the 1,570 friend requests (and corresponding spam profiles) that we collected in our social honeypots during a four-month evaluation period from October 1, 2007 to February 1, 2008. First, we characterize the temporal distribution of the spam friend requests that our honeypots received. Then, we analyze the geographic properties of social spam. Next, we investigate duplication in spam profiles and identify five popular groups of spam profiles. After our duplication analysis, we identify interesting demographic characteristics of spam profiles. Finally, we analyze the Web pages that are advertised by spam profiles.

5.1 Temporal Distribution of Spam Friend Requests

Since all of our honeypot profiles were constantly logged into MySpace during our four-month evaluation period, we were able to observe various temporal patterns for spamming activity. In Figure 4(a), we present the number of friend requests that our honeypots received on each day of this four-month period. This figure is interesting for a few reasons. First, we observe three peaks in spamming activity that occur around holidays. Specifically, our honeypots received the most friend requests the day before, the day of, and the day after Columbus Day (79), Halloween (95), and Thanksgiving (90). One possible explanation for these spikes is that legitimate users might spend more time online during these holiday periods, giving spammers a larger audience for their deceptive profiles.

Figure 4: Temporal distributions of the spam friend requests received by our social honeypots. (a) Monthly distribution; (b) Hourly distribution.

Another intriguing observation from Figure 4(a) is that our honeypots began receiving significantly fewer friend requests after December 2. In fact, of the 1,570 friend requests that our honeypots received, only 299 (19.0%) of them were received after this date. We are still investigating the reasons behind this reduced activity, but one hypothesis is that spammers realized the underlying purpose of our honeypot profiles. Since all of our honeypots reject friend requests after the corresponding spam profiles have been stored, spammers should eventually recognize that each of the honeypots represents a wasted friend request. As we explained in Section 4, we decided to reject spam friend requests because we wanted to avoid having our honeypots labeled as spam by MySpace. As part of our ongoing research, we are revisiting this decision to investigate whether it affects the spamming activity we observe.
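Daily and hourly distributions like those in Figure 4 reduce to simple timestamp bucketing; the timestamp format and the sample values below are our own illustration:

```python
from collections import Counter
from datetime import datetime

def temporal_distributions(timestamps):
    """Bucket friend-request timestamps by calendar day and by hour of day,
    mirroring the daily and hourly views of Figure 4."""
    daily, hourly = Counter(), Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        daily[dt.date().isoformat()] += 1  # one bucket per calendar day
        hourly[dt.hour] += 1               # one bucket per hour of day
    return daily, hourly

# Hypothetical sample: three requests around Halloween 2007.
daily, hourly = temporal_distributions([
    "2007-10-31 14:05:12",
    "2007-10-31 22:40:01",
    "2007-11-01 02:15:33",
])
```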

To analyze finer-grained temporal patterns, Figure 4(b) shows the hourly distribution of the friend requests that our honeypots received.4 As the figure shows, for every hour of the day, our honeypots received at least 35 friend requests from spammers. Additionally, distinct hourly patterns emerge from the figure. Most notably, spamming activity is at its peak around 2pm and from 10pm to 1am, and it is at its lowest levels between 4am and 9am. These patterns are particularly interesting because they mirror previous results about the communication patterns of legitimate users in social networking communities [11]. The similarities between legitimate and spam activity patterns are somewhat intuitive for at least two reasons. First, spammers want to be active when their targets are active because they want to increase the chances of successfully deceiving those users. Second, by blending their traffic in with legitimate traffic, spammers reduce the risk of being identified by the operators of these communities.

4 All of the times are normalized based on the time zone that corresponds to the honeypot profile’s location.

5.2 Geographic Distribution of Spam Friend Requests

Since each of our honeypot profiles claims to be in a unique geographic location (i.e., one honeypot is in each of the fifty U.S. states and Washington, D.C.), we are able to analyze the geographic properties of spamming behavior. Figure 5(a) shows a color-coded map of the United States, which represents the relative popularity of our geographically dispersed honeypots. States with darker shades of green represent honeypots that received more friend requests than the honeypots in states with lighter shades of green. As the figure shows, a large fraction of the spamming activity was directed at the Midwestern states. In fact, the five most popular targets were the honeypots in Omaha, Nebraska (80 friend requests), Kansas City, Missouri (58 friend requests), Milwaukee, Wisconsin (56 friend requests), Louisville, Kentucky (56 friend requests), and Minneapolis, Minnesota (53 friend requests). In our previous research [7], we found that MySpace users from Midwestern states began using MySpace considerably later than users from Western states because MySpace was founded in California. As a result, one explanation for why Midwestern users are frequently targeted by spammers is that those users might be less MySpace-savvy and thus a more attractive target for deception by spammers. This hypothesis is also supported by the fact that our Los Angeles, California honeypot received fewer friend requests (10) than any of the other honeypots.

Figure 5: Geographic distributions of spam profiles and their targets. (a) Target distribution; (b) Originator distribution.

Figure 5(b) shows another color-coded map of the United States. However, unlike Figure 5(a), this figure shows the relative popularity of various states as locations for spam profiles. States with darker shades of green have more spam profiles affiliated with them than states with lighter shades of green. As the figure shows, the most popular locations for the spam profiles were California and Southeastern states. Specifically, the five most popular states for spam profiles were California (186 spam profiles), Florida (92 spam profiles), Georgia (78 spam profiles), Arkansas (74 spam profiles), and Alabama (73 spam profiles).

After investigating the most popular originating and target locations for spam profiles, an obvious question is how much overlap (if any) exists for those locations. Concretely, we wanted to know how often the declared location of a spam profile matches the declared location of a targeted profile. Our original hypothesis was that we would identify a significant number of matches because we believed victims might be hesitant to accept a friend request from someone outside of their city or state. However, we were surprised to find that 1,534 (97.7%) of the friend requests were from spam profiles that reported a location that did not match the city or state associated with the honeypot profile that received them.

One explanation for this disconnect between the locations of spam profiles and their victims is that a clear tension exists between increasing the deceptive properties of a spam profile and making the profile broadly applicable to a large number of potential victims. Obviously, spammers would prefer to create personalized spam profiles for every potential victim because that would greatly increase the likelihood of a successful deception. However, it is costly to create personalized profiles for every potential victim, and as a result, spammers focus on casting as wide a net as possible.

5.3 Spam Profile Duplication

While investigating the geographic distribution of the friend requests that our honeypot profiles received, we noticed that many of our honeypots received a friend request from the same spam profile. In fact, 65 spam profiles sent a friend request to more than one of our honeypots, generating a total of 148 friend requests. 40 (78.4%) of our honeypots received at least one of these duplicate friend requests, and the honeypots that received the most friend requests (i.e., the Omaha, Nebraska honeypot and the Kansas City, Missouri honeypot) also received the most duplicates (11 duplicates and 8 duplicates, respectively). Surprisingly, none of our honeypots received more than one friend request from a given spam profile (i.e., none of the spam profiles were repeat offenders). Thus, after one of our honeypots rejected a spam profile’s friend request, that profile was intelligent enough not to send the honeypot another friend request.

After we identified the existence of duplicate friend requests, we wanted to determine how much lag time (if any) exists between the first arrival of a friend request and the arrivals of its duplicates. To quantify the delays between duplicate friend requests, we created 65 time series – one for each set of the duplicate friend requests. Then, for each time series, we computed the size of the time window that includes the first and last point. Based on this analysis, we found that 63 (96.9%) of the time windows close in less than 4 minutes, and 53 (81.5%) of the time windows close in less than a minute. Therefore, when these spam profiles sent friend requests, they sent a large number of them in a short period of time (i.e., they were not particularly stealthy).
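The time-window computation above amounts to taking max minus min per duplicate group and counting how many windows close within a limit; the helper names and the made-up arrival offsets below are our own illustration, not the study's data:

```python
def window_seconds(arrival_times):
    """Size of the window containing a duplicate group's first and last
    friend-request arrivals (timestamps in seconds since some epoch)."""
    return max(arrival_times) - min(arrival_times)

def fraction_within(groups, limit_seconds):
    """Fraction of duplicate groups whose window closes within the limit."""
    closed = sum(1 for g in groups if window_seconds(g) < limit_seconds)
    return closed / len(groups)

# Hypothetical duplicate groups (arrival offsets in seconds).
groups = [
    [0, 12, 45],      # closes in under a minute
    [0, 30, 55],      # closes in under a minute
    [0, 90, 200],     # closes in under four minutes
    [0, 600],         # a slow outlier
]
under_one_minute = fraction_within(groups, 60)
under_four_minutes = fraction_within(groups, 240)
```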

Once we determined the number of unique profiles in our collection (1,487), we wanted to know how many of those profiles possess content that is a duplicate (or a near-duplicate) of another profile’s content. In our previous work [23], we found that only one-third of Web spam pages are unique, and we wanted to determine if the same level of duplication exists among spam profiles. To quantify the amount of content duplication in our collection of 1,487 unique spam profiles, we used the shingling algorithm from our previous work [23] on all of their HTML content to construct equivalence classes of duplicate and near-duplicate profiles.

First, we preprocessed each profile by replacing its HTML tags with white space and tokenizing it into a collection of words (where a word is defined as an uninterrupted series of alphanumeric characters). Then, for every profile, we created a fingerprint for each of its n words using a Rabin fingerprinting function [20] (with a degree 64 primitive polynomial pA). Once we had the n word fingerprints, we combined them into 5-word phrases. The collection of word fingerprints was treated like a circle (i.e., the first fingerprint follows the last fingerprint) so that every fingerprint started a phrase, and as a result, we obtained n 5-word phrases. Next, we generated n phrase fingerprints for the n 5-word phrases using a Rabin fingerprinting function (with a degree 64 primitive polynomial pB). After we obtained the n phrase fingerprints, we applied 84 unique Rabin fingerprinting functions (with degree 64 primitive polynomials p1, ..., p84) to each of the n phrase fingerprints. For every one of the 84 functions, we stored the smallest of the n fingerprints, and once this process was complete, each spam profile was reduced to 84 fingerprints, which are referred to as that profile's shingles.

Once all of the profiles were converted to collections of 84 shingles, we clustered the profiles into equivalence classes (i.e., clusters of duplicate or near-duplicate profiles). Two profiles were considered duplicates if all of their shingles matched, and they were near-duplicates if their shingles agreed in two out of the six possible non-overlapping collections of 14 shingles. For a more detailed description of this shingling algorithm, please consult [5, 10].
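The shingling procedure above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: salted MD5 hashes stand in for the degree-64 Rabin fingerprinting functions, only 6 minimum fingerprints are kept instead of 84 (so the near-duplicate test uses 6 groups of 1 shingle rather than 6 groups of 14), and the example text is invented:

```python
import hashlib
import re

def fingerprint(text, salt):
    # Stand-in for a Rabin fingerprinting function keyed by one
    # primitive polynomial (here: a salted 64-bit MD5 prefix).
    return int(hashlib.md5(f"{salt}:{text}".encode()).hexdigest()[:16], 16)

def shingles(html, num_funcs=6, phrase_len=5):
    # 1. Replace HTML tags with white space; tokenize into
    #    uninterrupted runs of alphanumeric characters.
    words = re.findall(r"[A-Za-z0-9]+", re.sub(r"<[^>]+>", " ", html))
    word_fps = [fingerprint(w, "A") for w in words]
    n = len(word_fps)
    # 2. Treat the fingerprints as a circle and form n 5-word phrases.
    phrases = [tuple(word_fps[(i + j) % n] for j in range(phrase_len))
               for i in range(n)]
    phrase_fps = [fingerprint(str(p), "B") for p in phrases]
    # 3. For each fingerprinting function, keep the minimum fingerprint.
    return tuple(min(fingerprint(str(fp), k) for fp in phrase_fps)
                 for k in range(num_funcs))

def near_duplicates(s1, s2, groups=6):
    # Near-duplicates: at least 2 of the non-overlapping shingle
    # groups must match exactly (duplicates match on all shingles).
    size = len(s1) // groups
    matches = sum(s1[g * size:(g + 1) * size] == s2[g * size:(g + 1) * size]
                  for g in range(groups))
    return matches >= 2

a = shingles("<p>free pills cheap discount boyfriend bought them now</p>")
b = shingles("<div>free pills cheap discount boyfriend bought them now</div>")
print(a == b)  # tags are stripped, so identical text yields identical shingles
```

Because the minimum fingerprint under each function is stable across small edits, two profiles that share most of their 5-word phrases will agree on most shingles, which is what makes the equivalence-class clustering work.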

After this clustering was complete, we were left with 1,261 unique clusters of duplicate and near-duplicate profiles. Thus, only 226 (15.2%) of the profiles have the same (or nearly the same) HTML content as one of the remaining 1,261 profiles. This level of duplication is significantly less than what we observed with Web spam pages; however, we do not believe this is an accurate measure of spam profile duplication. Since most of a spam profile's deceptive text is found in the "About me" section, a more reasonable metric for profile duplication is actually "About me" duplication. Hence, in addition to running our shingling algorithm over all of the HTML content in a profile, we also extracted the "About me" content and built equivalence classes using that data. This "About me" clustering generated 637 unique clusters, which means 850 (57.2%) of the profiles have the same (or nearly the same) "About me" content as one of the remaining 637 profiles.

Based on the results of our content duplication analysis, we can conclude that duplication among spam profiles is on par with duplication among Web spam pages: 15.2% of the spam profiles obtain all of their HTML content from another profile, and 57.2% of the spam profiles obtain their "About me" content from another profile. This observation is quite encouraging because it implies that the problem of identifying all spam profiles can actually be reduced to the problem of identifying a much smaller set of unique profiles.

5.4 Spam Profile Examples

After we completed our content duplication analysis, we manually investigated the profiles in our various clusterings. Based on this investigation, we discovered that most of our spam profiles fall into one of five categories:

• Click Traps: Each of these profiles contains a background image that is also a link to another Web page. If users click anywhere on the profile, they are immediately directed to the link's corresponding Web site. One of the most popular (and most deceptive) examples displays a fake list of friends, which is actually a collection of provocative images that direct users to a nefarious Web page (see Figure 6 for an example).


Figure 6: An example of a Click Trap.

• Friend Infiltrators: These profiles do not have any overtly deceptive elements (aside from their images, and even those are innocuous in some cases). The purpose of these profiles is to befriend as many users as possible so that they can infiltrate the users' circles of friends and bypass any communication restrictions imposed on non-friends. Once a user accepts a friend request from one of these profiles, the profile begins spamming that user through every available communication system (e.g., message spam, comment spam, etc.).

• Pornographic Storytellers: Each of these profiles has an "About me" section that consists of randomized pornographic stories, which are bookended by links that lead to pornographic Web pages. The anchor text used in these profiles is extremely similar, even though the rest of the "About me" text is almost completely randomized.

• Japanese Pill Pushers: These profiles contain a sales pitch for male enhancement pills in their "About me" sections. According to the pitch, the attractive woman pictured in the profile has a boyfriend who purchased these pills at an incredible discount, and if you act now, you can do the same. An example is shown in Figure 7.

• Winnies: All of these profiles have the same headline: "Hey its winnie." However, despite this headline, none of the profiles are actually named "Winnie." In addition to the shared headline, each of the profiles also includes a link to a Web page where users can see the pictured female's pornographic pictures. An example of one of these profiles was shown in Section 3 (Figure 1).

Figure 7: An example of a Japanese Pill Pusher.

5.5 Spam Profile Demographics

In our content duplication analysis, we analyzed the HTML and "About me" sections of spam profiles in a general sense. To observe more specific features of these profiles, we investigated demographic characteristics of the 1,487 spam profiles that we captured in our honeypots. These characteristics include traditional demographics (e.g., age, gender, etc.) as well as profile-specific features (e.g., number of friends, headlines, etc.).

Our first observation from this demographic analysis is that many of the spam profiles share various demographic characteristics. Specifically, all of the profiles are female and between the ages of 17 and 34 (85.9% of the profiles state an age between 21 and 27). Additionally, 1,476 (99.3%) of the profiles report that they are single. None of these characteristics are particularly surprising because they all reinforce the deceptive nature of these profiles. Specifically, these demographic features make each profile appear as though it was created by a young, "available" woman.

Our second observation is that many spam profiles include additional personal information to enhance their deceptive properties. The profiles that are most adept at leveraging personal information to their advantage are the Japanese Pill Pushers. These profiles are the only ones that claim to be in a relationship, but this relationship status is warranted because the profiles mention a boyfriend in their "About me" sections. Additionally, these profiles list "less than $30,000" as their annual income. This is the lowest allowable option on MySpace, and as a result, this annual income value makes it seem like the profile's creator is not particularly wealthy, which reinforces the affordability of the male enhancement pills that these profiles are advertising.

Figure 8: Distribution of the number of friends associated with spam profiles (log-log scale; x-axis: number of friends, y-axis: number of spam profiles).

Our final observation is that many of the spam profiles successfully befriended legitimate users. Figure 8 shows the distribution of friends associated with each of the spam profiles. It is important to note that this distribution is skewed towards the low end of the spectrum because our bots visited and stored the spam profiles very quickly (and potentially before any other users had a chance to accept the spam friend requests). However, despite this fact, 455 (30.6%) of the profiles had more than one friend when our bots collected them. Thus, almost a third of the spam profiles were already attracting victims when our bots visited them.

5.6 Advertised Web Pages

Since the purpose of spam profiles is to deceive users into performing an action (e.g., visiting a Web page), most of the profiles contain links to Web pages outside of the community. Specifically, 1,245 (83.7%) of the profiles contain at least one link in their "About me" sections. The remaining 242 profiles are all examples of Friend Infiltrators, and as a result, they postpone their promotional activities until after they have befriended users.

From the 1,245 profiles that contain links, our bots were able to extract and successfully access 1,048 profile URLs. Of these 1,048 profile URLs, only 482 (46.0%) were unique, which means more than half of the URLs that appear in spam profiles are duplicates. When our bots attempted to crawl the profile URLs, 339 (32.3%) of them returned a total of 657 HTTP redirects. After following these HTTP redirect chains, our original 482 unique profile URLs funneled our bots to only 148 unique destination URLs. Therefore, of the 1,048 Web pages that our bots ultimately obtained with the profile URLs, 900 (85.9%) have duplicate URLs.
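Resolving redirect chains to count unique destinations can be sketched offline as follows; the URL-to-target map stands in for live HTTP 3xx responses, and all URLs here are invented for illustration:

```python
def resolve(url, redirects, max_hops=10):
    """Follow a chain of HTTP redirects (given as a url -> target map)
    until a URL that does not redirect is reached."""
    seen = set()
    while url in redirects and url not in seen and len(seen) < max_hops:
        seen.add(url)  # guard against redirect loops
        url = redirects[url]
    return url

# Hypothetical profile URLs and redirect map (illustrative, not real data).
redirect_map = {
    "http://a.example/p1": "http://hub.example/r",
    "http://b.example/p2": "http://hub.example/r",
    "http://hub.example/r": "http://dest.example/landing",
}
profile_urls = ["http://a.example/p1", "http://b.example/p2",
                "http://c.example/p3"]

destinations = {resolve(u, redirect_map) for u in profile_urls}
print(len(destinations))  # -> 2: many URLs funnel to few destinations
```

The same funneling effect, at much larger scale, is what collapses the 482 unique profile URLs down to 148 destination URLs above.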

To investigate this duplication even further, we performed a shingling analysis on the HTML content of these 1,048 Web pages. Based on this analysis, we discovered only 6 unique clusters of duplicate and near-duplicate Web pages. Thus, 1,042 (99.4%) of the Web pages contain content that was duplicated from the other 6 Web pages. Three of these clusters, which account for 93.3% of the pages, contain pages that act as intermediary redirection pages (i.e., the pages immediately redirect users using HTML/javascript redirection techniques). Two of the clusters, which account for 6.6% of the pages, contain pornographic Web pages, and the last cluster contains a single Web page, which executes a phishing attack against MySpace.

Since 93.3% of the pages employ redirection techniques, we parsed those pages for HTML/javascript redirects using our redirection detection algorithm from previous research [23]. Based on this redirection analysis, we identified redirects that use HTML meta refresh tags, javascript location variable assignments, and HTML iframe tags. In total, our algorithm identified 1,307 redirection URLs. However, of those 1,307 URLs, only 136 (10.4%) were unique; hence, over 90% of the redirection URLs are duplicates.
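Detection of these three redirect types can be sketched with regular expressions; the patterns below are illustrative approximations, not the authors' actual algorithm from [23], and the sample page is invented:

```python
import re

# Simplified patterns for the three redirect types described above:
# meta refresh tags, javascript location assignments, and iframes.
PATTERNS = [
    re.compile(r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*url=([^"\'>]+)',
               re.I),
    re.compile(r'(?:window\.|document\.)?location(?:\.href)?\s*=\s*'
               r'["\']([^"\']+)["\']', re.I),
    re.compile(r'<iframe[^>]+src=["\']([^"\']+)["\']', re.I),
]

def redirect_urls(html):
    """Extract candidate redirection URLs from a page's HTML."""
    urls = []
    for pat in PATTERNS:
        urls.extend(pat.findall(html))
    return urls

page = ('<meta http-equiv="refresh" content="0; url=http://x.example/a">'
        '<script>window.location = "http://x.example/b";</script>'
        '<iframe src="http://x.example/c"></iframe>')
print(redirect_urls(page))
```

A production detector would need to handle obfuscated javascript and attribute-order variations, but this captures the three basic techniques named above.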

When our bots crawled these redirection URLs, 959 (73.4%) of them returned a total of 1,288 HTTP redirects. After following these HTTP redirect chains, our bots were eventually funneled to only 15 unique URLs. Thus, of the 1,307 Web pages that our bots crawled using the redirection URLs, only 15 (1.1%) have unique URLs. Even more striking is the fact that only 5 domain names are used in those 15 unique URLs, and of those 5 domain names, fling.com and amateurmatch.com Web pages account for 975 (74.6%) of the Web pages. An example of one of the amateurmatch.com Web pages is shown in Figure 9.

Based on the results of our Web page analysis, we can conclude that all of the URLs advertised in spam profiles point to an extremely small number of destination pages. Specifically, 1,048 profile URLs funneled our bots to only 6 destinations, and 1,307 redirection URLs funneled our bots to only 5 destinations. This observation is quite valuable because it significantly reduces the problem of identifying the Web pages that are advertised by spam profiles: instead of dealing with 2,355 URLs, we must simply identify 11 destinations.

Figure 9: An example of a Web page that is advertised by a spam profile.

6 Conclusions

In this paper, we presented a novel technique for automatically collecting deceptive spam profiles in social networking communities. Specifically, our approach deploys honeypot profiles and collects all of the spam profiles associated with the spam friend requests that they receive. We also provided the first characterization of deceptive spam profiles using the data that we collected in our 51 social honeypots over a four month evaluation period. This characterization covered various topics, including temporal and geographic distributions of spamming activity, content duplication, an analysis of profile demographics, and an evaluation of the Web pages that are advertised by spam profiles. In our ongoing work, we are using our analysis results to automatically identify social spam.

References

[1] A. Acquisti and R. Gross. Imagined communities: Awareness, information sharing, and privacy on the Facebook. In Proc. of PET '06, 2006.

[2] Alexa. Alexa top 500 sites. http://www.alexa.com/site/ds/top_sites?ts_mode=global, 2008.

[3] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou r3579x?: Anonymized social networks, hidden patterns, and structural steganography. In Proc. of WWW '07, 2007.

[4] D. Boyd. Social network sites: Public, private, or what? The Knowledge Tree: An e-Journal of Learning Innovation, 2007.

[5] A. Broder et al. Syntactic clustering of the web. In Proc. of WWW '97, 1997.

[6] J. Caverlee, L. Liu, and S. Webb. SocialTrust: Tamper-resilient trust establishment in online communities. In Proc. of JCDL '08, 2008.

[7] J. Caverlee and S. Webb. A large-scale study of MySpace: Observations and implications for online social networks. In Proc. of ICWSM '08, 2008.

[8] J. Donath and D. Boyd. Public displays of connection. BT Technology Journal, 22(4), 2004.

[9] N. Ellison, C. Steinfield, and C. Lampe. Spatially bounded online social networks and social capital: The role of Facebook. In Proc. of ICA '06, 2006.

[10] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proc. of SIGIR '05, 2005.

[11] S. A. Golder, D. Wilkinson, and B. A. Huberman. Rhythms of social interaction: Messaging within a massive online network. In Proc. of CT '07, 2007.

[12] R. Gross and A. Acquisti. Information revelation and privacy in online social networks. In Proc. of WPES '05, 2005.

[13] Z. Gyongyi and H. Garcia-Molina. Link spam alliances. In Proc. of VLDB '05, 2005.

[14] P. Heymann, G. Koutrika, and H. Garcia-Molina. Fighting spam on social web sites: A survey of approaches and future challenges. IEEE Internet Computing, 11(6), 2007.

[15] S. Hinduja and J. W. Patchin. Personal information of adolescents on the internet: A quantitative content analysis of MySpace. Journal of Adolescence, 31(1), 2008.

[16] C. Kreibich and J. Crowcroft. Honeycomb: Creating intrusion detection signatures using honeypots. ACM SIGCOMM Computer Communication Review, 34(1), 2004.

[17] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In Proc. of KDD '06, 2006.

[18] C. Lampe, N. Ellison, and C. Steinfield. A familiar Face(book): Profile elements as signals in an online social network. In Proc. of CHI '07, 2007.

[19] M. Prince et al. Understanding how spammers steal your e-mail address: An analysis of the first six months of data from Project Honey Pot. In Proc. of CEAS '05, 2005.

[20] M. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.

[21] L. Spitzner. The Honeynet Project: Trapping the hackers. IEEE Security & Privacy, 1(2), 2003.

[22] R. Wade. Fakesters: On MySpace, you can be friends with Burger King. This is social networking? http://www.technologyreview.com/Infotech/17713/.

[23] S. Webb, J. Caverlee, and C. Pu. Characterizing web spam using content and HTTP session analysis. In Proc. of CEAS '07, 2007.

[24] S. Webb, J. Caverlee, and C. Pu. Granular computing system vulnerabilities: Exploring the dark side of social networking communities. Encyclopedia of Complexity and System Science (to appear), 2008.

[25] A. Zinman and J. Donath. Is Britney Spears spam? In Proc. of CEAS '07, 2007.