Topic Modeling of Freelance Job Postings to Monitor Web Service Abuse

Do-kyum Kim, Marti Motoyama, Geoffrey M. Voelker and Lawrence Saul
UCSD CSE


Web Service Abuse

• Many Web services are free/open access
  • To attract large numbers of users
  • To attract user-generated content
• But openness invites abuse:
  • Exploitation of free resources
    E.g., sending spam from Web-based email accounts
  • Unsanctioned advertising channels
    E.g., spamming links on blog comments

Crowdsourcing for Web Service Abuse

• Widely used:
  • 30% of jobs on Freelancer.com [Motoyama et al., 2011]
  • 40% of jobs on Mechanical Turk [Ipeirotis, 2010]
• Example posting on Freelancer.com:
  Title: Open 10 blog accounts and write/publish 10 posts
  Desc.: I need someone to open a free email account (i.e yahoo, hotmail, gmail) Then use that email to open 20 free blog accounts (excluding blogger, word press, blog) This project is to open 20 free blog accounts (all different sites) than post a single blog post (20 in total) one each blog account …
  Keywords: Blog

Why Crowdsource Abuse Jobs?

• Cost effective
  Workers come from low-wage regions
• Agile
  Buyers can find technically skilled workers
• Scalable
  Freelancer.com has over one million workers

Example: Account Creation – Form Filling

• Scenario: Abuser wants to send spam via Web email
• Prerequisite: Bulk accounts on Gmail

“i need gmail captcha entry agent immediately. 1000's new captcha entrys per week”

Example: Account Creation – Varying IP

• Problem: Google detects mass account creation
• Solution: Purchase IP proxy services

Example: Account Creation – Being Verified

• Problem: Google implements phone verification
• Solution: Buy telephone numbers

How to Identify Abuse Jobs?

• Previous work [Motoyama et al., 2011]:
  • Manually inspect 2k postings on Freelancer.com
  • Identify 22 job categories
  • Label 10k+ jobs to train an SVM classifier
• Can we use this approach in operation?
  No, too much manual labor.
• Challenges:
  • How to scale?
  • How to discover new job categories?

Our Approach: Topic Modeling

• Unsupervised vs. supervised
  Job categories are discovered automatically from raw posts
  No need for manual labeling
• Large-scale vs. small-scale
  Data-driven from 7 years of posts on Freelancer.com
  Categories identified from 355K (versus 2K) posts
• Principled vs. heuristic
  Topics are collections of co-occurring words
  Postings and users have distributions over topics

Background: Freelancer.com

• Freelancer.com
  • One of the largest and oldest freelancing sites
  • Over 2 million users from 200+ countries
  • Queryable by API
• How it works:
  1. Buyers/employers post jobs
  2. Workers bid on jobs
  3. Buyers select workers

Background: Data Set

• Job/user data from 2004 to 2011:
  • 840k job descriptions
  • 815k user profiles
  • 12 million bids

[Figure: buyers post projects (e.g. “Open 10 blog account”, “Get 1k likes on Facebook”, “Write articles on car”) and workers bid on them.]

Topic Models

Topic Modeling

• Automatic, data-driven approach for analyzing large corpora of text
  1. Discovers hidden topics in the corpus
  2. Represents each document as a collection of topics
  3. Models each topic as a distribution over words

Example: articles from a sports magazine, passed through a topic model

Topic  Top Frequent Words               Label
1      nfl, quarterback, touchdown, …   Football
2      driver, club, birdie, …          Golf
3      mlb, hit, sox, …                 Baseball

Latent Dirichlet Allocation (LDA)

• First properly Bayesian model for topic modeling
• Assumptions:
  • Model each document as a mixture of topics
  • Model each topic as a distribution over words
  • Model each word as drawn from a particular topic

Main Parameters of LDA

• # Topics: K
• # Words in vocabulary:
• Word distributions as topics: a matrix holding one word distribution per topic

Topic: Football
Word         Probability
nfl          0.05
quarterback  0.02
touchdown    0.018
…            …

Topic: Golf
Word    Probability
driver  0.021
club    0.017
birdie  0.014
…       …

Generative Process for LDA

• For each document in the corpus:
  1. Pick the topic proportions from a Dirichlet distribution.
  2. For each word in the document:
     a) Pick a topic from the proportions in (1).
     b) Pick a word based on the topic in (2a).
• In our problem:
  Document = Job posting
  Topic = Job category
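The generative process can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the toy vocabulary, the three topic distributions, the Dirichlet parameter `alpha` and all function names are assumptions made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw topic proportions by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_posting(topics, vocab, alpha, n_words, rng):
    """Generate one synthetic job posting following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)        # step 1: topic proportions
    assignments, words = [], []
    for _ in range(n_words):                    # step 2: for each word
        z = sample_categorical(theta, rng)      # step 2a: pick a topic
        w = sample_categorical(topics[z], rng)  # step 2b: pick a word from that topic
        assignments.append(z)
        words.append(vocab[w])
    return theta, assignments, words

# Hypothetical toy model: 3 topics (job categories) over a 6-word vocabulary.
VOCAB = ["blog", "post", "account", "gmail", "write", "articl"]
TOPICS = [
    [0.5, 0.4, 0.1, 0.0, 0.0, 0.0],  # blog-posting-like topic
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],  # account-creation-like topic
    [0.0, 0.1, 0.0, 0.0, 0.5, 0.4],  # article-writing-like topic
]

rng = random.Random(0)
theta, assignments, words = generate_posting(TOPICS, VOCAB, [0.5, 0.5, 0.5], 8, rng)
print(theta, words)
```

Fitting LDA runs this process in reverse: the words are observed, and the topic proportions and topic-word distributions are inferred.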

Fitting the Model Parameters

• Given observed words, discover the topics that best explain the documents in the corpus
• But some calculations are intractable
• We use variational methods for approximation
• For details, see Blei et al., 2003.

LDA Workflow

• Input: unlabeled docs, e.g. “Open 10 blog accounts and write/publish 10 posts …”
• Params: # topics: K
• Output:
  • K topics, each a distribution over words, e.g.
    blog 0.03, post 0.02, forum 0.01, …
    account 0.02, gmail 0.01, hotmail 0.01, …
    write 0.03, word 0.01, English 0.01, …
  • Docs annotated w/ topics, e.g. topic proportions 0.4, 0.3, 0.3 for the posting above

Evaluation

Preprocessing

• Constructs a document from each project: title, description, keywords
• Lower-cases, splits at punctuation, removes stopwords, applies stemming
• Filters infrequent terms
• Filters buyers and workers in fewer than 20 projects
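A minimal sketch of the text-side preprocessing steps, under loud assumptions: the stopword list is a tiny stand-in, `crude_stem` is a simplistic suffix stripper rather than the stemmer the authors used (the slides do not name one), and the `min_count` threshold and the project field names are hypothetical.

```python
import re
from collections import Counter

# Assumption: tiny illustrative stopword list, not the one used in the study.
STOPWORDS = {"a", "an", "and", "the", "to", "i", "is", "on", "of", "that", "then", "this", "in"}

def crude_stem(token):
    """Stand-in for a real stemmer: strip one common suffix if enough of the word remains."""
    for suffix in ("ing", "es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tokenize(project):
    """Build one document from title, description and keywords, then normalize it."""
    text = " ".join([project["title"], project["description"], project["keywords"]])
    tokens = re.split(r"[^a-z0-9]+", text.lower())  # lower-case, split at punctuation/space
    return [crude_stem(t) for t in tokens if t and t not in STOPWORDS]

def filter_infrequent(docs, min_count):
    """Drop terms that occur fewer than min_count times in the whole corpus."""
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]

project = {
    "title": "Open 10 blog accounts",
    "description": "Write/publish 10 posts on the blogs.",
    "keywords": "Blog",
}
tokens = tokenize(project)
print(tokens)
```

Filtering buyers and workers with fewer than 20 projects would be a separate pass over the project metadata and is not shown here.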

Term-document Matrix for LDA

• W_ij: # of occurrences of word i in document j
• Terms: 27k+
• Documents: 355k+
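With 27k+ terms and 355k+ documents, most entries of the matrix are zero, so a sparse representation is the natural choice. The sketch below stores only nonzero counts in a dictionary keyed by (term index, document index); the function name and the dictionary layout are assumptions for illustration, not the authors' data structure.

```python
from collections import Counter

def build_term_document_counts(docs):
    """Return (vocab, counts) where counts[(i, j)] = # of times term i occurs in document j."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    counts = {}
    for j, doc in enumerate(docs):
        for term, n in Counter(doc).items():
            counts[(index[term], j)] = n  # only nonzero entries are stored
    return vocab, counts

# Two tiny, already-preprocessed documents.
docs = [["blog", "post", "blog"], ["captcha", "post"]]
vocab, counts = build_term_document_counts(docs)
print(vocab)
print(counts)
```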

How Many Topics?

• LDA discovers a prespecified # of topics.
• How to choose this number?
  1. Train models with varying # of topics
  2. Measure likelihood of held-out data
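Held-out likelihood is often reported as perplexity, the form used in Blei et al., 2003. Whether the authors normalized it exactly this way is an assumption; the symbols below (M held-out documents, w_d the words of document d, N_d its length) are introduced here for illustration.

```latex
\mathrm{perplexity}(D_{\text{test}})
  = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```

Lower perplexity means higher held-out likelihood, so one would prefer the number of topics at which perplexity stops improving.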

Categories of Abuse

Label                      Top Frequent Words                                           Ratio
SEO Content Generation1    articl writer Articles write copyscap ‘Article Rewriting’    5%
SEO Content Generation2    articl keyword rewrit copyscap word rewritten paragraph      3%
SEO Whitehat               link pr site page anchor websit nofollow farm googl          3%
CAPTCHA Solving            ‘Data Entry’ data entri team captcha Excel fast worker hr    3%
Click/CPA/Leads/Signups1   market sale traffic promot affili lead Marketing commiss     2%
Ad Posts/Accounts          ad account craigslist post pva poster gmail cl ip proxi      2%
SEO Unknown                seo keyword googl search rank engin SEO optim adword         2%
Bulk Emailing              email list address excel mail newslett e-mail spreadsheet    2%
OSN Linking                fan facebook member profil friend Facebook myspac            2%
SEO Greyhat1               blog post forum comment ‘Link Building’ thread phpbb         2%
SEO Greyhat2               submiss directori review social bookmark submit copi         2%
Click/CPA/Leads/Signups2   sign signup citi countri uk up usa canada travel adult       1%

Highlighted terms in the original slide mark companies or countries and worker methodologies.

Annotated Job Postings

Topics: SEO Greyhat (0.276), Ad Posting / Account Creation (0.211), SEO Content Generation (0.169), Bulk Emailing (0.082)

False positive: reveals limitations of LDA

Title: Open 10 blog accounts and write/publish 10 posts
Desc.: I need someone to open a free email account (i.e yahoo, hotmail, gmail) Then use that email to open 20 free blog accounts (excluding blogger, word press, blog) This project is to open 20 free blog accounts (all different sites) than post a single blog post (20 in total) one each blog account. Each free blog account to be on a separate free blog service. …
Keyword: Blog

Word Trends Reveal Target Trends

Top 10 terms in the “OSN Linking” topic, by year:

Rank  2005     2006     2007      2008      2009      2010               2011
1     member   myspac   profil    myspac    facebook  fan                fan
2     group    friend   myspac    profil    friend    facebook           facebook
3     friend   profil   friend    friend    twitter   account            page
4     profil   member   member    member    profil    page               account
5     event    account  bot       account   myspac    Facebook           Facebook
6     myspac   peopl    group     facebook  account   friend             real
7     invit    group    account   bot       follow    follow             follow
8     bot      invit    facebook  event     group     real               twitter
9     meet     bot      invit     group     fan       twitter            usa
10    account  paid     event     invit     member    Social Networking  like

Manual vs. Automatic Discovery of Topics

• What share of the postings in each manual category is assigned to each topic?

[Figure: manual categories vs. automatically discovered topics.]
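One way such a comparison could be computed is to average the LDA topic proportions of all postings that share a manual label. This is a sketch under assumptions: the slides do not say how the figure was produced, and the function name, labels and toy proportions below are made up for illustration.

```python
from collections import defaultdict

def category_topic_shares(labels, doc_topics):
    """Average the topic proportions of the postings within each manual category."""
    sums = {}
    counts = defaultdict(int)
    for label, theta in zip(labels, doc_topics):
        if label not in sums:
            sums[label] = [0.0] * len(theta)
        for k, p in enumerate(theta):
            sums[label][k] += p
        counts[label] += 1
    return {label: [s / counts[label] for s in row] for label, row in sums.items()}

# Hypothetical manual labels and per-posting topic proportions (2 topics).
labels = ["OSN Linking", "OSN Linking", "Bulk Emailing"]
doc_topics = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
shares = category_topic_shares(labels, doc_topics)
print(shares)
```

A manual category whose mass concentrates on a single topic suggests that the topic model rediscovered that category on its own.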

Manual vs. Automatic Tracking of Trend

[Figure: two panels plotting normalized volume of projects (0 to 1) against time (Jan05 to Jan11). One panel compares class “SEO Content Generation” with topics “SEO Content Generation 1, 2”; the other compares classes “Verified Accounts, Account Registration and Ad Posting” with topic “AdPosts/Accounts”.]

• Very close agreement

Correlation of Worker Topic Proportion

• Pearson correlation matrix
• Indicates mergeable topics
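A minimal sketch of the Pearson correlation matrix over worker topic proportions, assuming each worker is represented by one row of topic proportions. The function names and the toy numbers are illustrative assumptions; a topic whose proportion is identical for all workers would need special handling because its variance is zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def topic_correlation_matrix(worker_topics):
    """Correlate every pair of topics across workers (rows = workers, columns = topics)."""
    n_topics = len(worker_topics[0])
    columns = [[row[k] for row in worker_topics] for k in range(n_topics)]
    return [[pearson(a, b) for b in columns] for a in columns]

# Hypothetical topic proportions for three workers over three topics.
worker_topics = [
    [0.5, 0.4, 0.1],
    [0.3, 0.2, 0.5],
    [0.1, 0.1, 0.8],
]
corr = topic_correlation_matrix(worker_topics)
for row in corr:
    print([round(v, 2) for v in row])
```

In this toy data the first two topics rise and fall together across workers, which is the kind of signal that would mark them as candidates for merging.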

Summary

• Explored LDA to identify and monitor abuse job postings
• What LDA automates:
  • Discovery of topics as frequently co-occurring words
  • Labeling of individual postings by topics
  • Word-by-word annotations
• What remains:
  • Interpretation of topics
  • Merging/splitting of discovered categories

Future Directions

• Other applications of LDA to unstructured text
  • IRC channels
  • Underground Internet forums
• More sophisticated topic models
  • Author-reader topic models: incorporating buyers and workers
  • Dynamic topic models: tracking how topics evolve over time
  • Online LDA: continuous modeling of streaming projects

Q&A