Demographics and Weblog Hackathon – Case Study

Page 1: Demographics and Weblog  Hackathon  – Case Study

Demographics and Weblog Hackathon – Case Study

5.3% of Motley Fool visitors are subscribers. Design a classification model for insight into which variables are important for strategies to increase the subscription rate.

Learn by Doing

Page 2: Demographics and Weblog  Hackathon  – Case Study

http://www.meetup.com/HandsOnProgrammingEvents/

Page 3: Demographics and Weblog  Hackathon  – Case Study

Data Mining Hackathon

Page 4: Demographics and Weblog  Hackathon  – Case Study

Funded by Rapleaf

• With Motley Fool’s data
• App note for Rapleaf/Motley Fool
• Template for other hackathons
• Did not use AWS; R on individual PCs
• Logistics: Rapleaf funded prizes and food for 2 weekends for ~20-50 people. Venue was free

Page 5: Demographics and Weblog  Hackathon  – Case Study

Getting more subscribers

Page 6: Demographics and Weblog  Hackathon  – Case Study

Headline Data, Weblog

Page 7: Demographics and Weblog  Hackathon  – Case Study

Demographics

Page 8: Demographics and Weblog  Hackathon  – Case Study

Cleaning Data

• training.csv (201,000), headlines.tsv (811MB), entry.tsv (100k), demographics.tsv
• Feature Engineering
• Github:

Page 9: Demographics and Weblog  Hackathon  – Case Study

Ensemble Methods

• Bagging, Boosting, randomForests
• Overfitting
• Stability (small changes make large prediction changes)
• Previously none of these worked at scale
• Small-scale results using R; large-scale versions exist in proprietary implementations (Google, Amazon, etc.)
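As a rough illustration of the bagging idea listed above, here is a minimal Python sketch: each model is a one-split "stump" fitted on a bootstrap resample, and the ensemble predicts by majority vote. The function names, the toy data, and the choice of stumps are all assumptions made for illustration; the hackathon itself used R, and this is not its code.

```python
import random

def fit_stump(xs, ys):
    """Return the threshold t with the fewest training errors for 'predict 1 if x > t'."""
    # Candidate thresholds: one below every point, plus each observed value.
    candidates = [min(xs) - 1] + sorted(set(xs))
    best_t, best_err = candidates[0], len(xs) + 1
    for t in candidates:
        err = sum((x > t) != (y == 1) for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bagged_stumps(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def predict(thresholds, x):
    """Majority vote over the stumps."""
    votes = sum(x > t for t in thresholds)
    return 1 if 2 * votes > len(thresholds) else 0

# Toy, cleanly separable data (hypothetical): label is 1 when x >= 10.
xs = list(range(20))
ys = [0] * 10 + [1] * 10
model = bagged_stumps(xs, ys)
print(predict(model, 0), predict(model, 19))
```

Averaging over resamples is what addresses the stability bullet: any single stump moves with its sample, while the vote moves much less.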

Page 10: Demographics and Weblog  Hackathon  – Case Study

ROC Curves

Binary Classifier Only!
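For readers new to ROC curves, the following Python sketch shows one common way to build the curve and its area (AUC) from binary labels and classifier scores. The labels and scores are made-up toy values, not hackathon data, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, sweeping the threshold down."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Examples with the same score cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC points by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy example: 1 = subscriber, 0 = non-subscriber (hypothetical values).
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.3, 0.1]
toy_auc = auc(roc_points(labels, scores))
print(round(toy_auc, 4))
```

A classifier that gives every example the same score lands on the diagonal and gets an area of 0.5, which is the "random" baseline the later slides compare against.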

Page 11: Demographics and Weblog  Hackathon  – Case Study

Paid Subscriber ROC curve, ~61%

Page 12: Demographics and Weblog  Hackathon  – Case Study

Boosted Regression Trees Performance

• training data ROC score = 0.745
• cv ROC score = 0.737; se = 0.002
• 5.5% less performance than the winning score, without doing any data processing
• Random is 50%, or 0.50. At 0.737 we are 0.737 - 0.50 = 0.237 above random (23.7 percentage points)
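The "better than random" figure is a plain subtraction of AUC values, so it is a gap in percentage points rather than a relative improvement. A small Python sketch of that arithmetic, using the cross-validated score above and the weblogs-only score reported later in the deck (the function name is hypothetical):

```python
def margin_over_random(auc_score, random_auc=0.5):
    """Gap between a ROC score and the 0.5 score of a random classifier."""
    return auc_score - random_auc

cv_margin = margin_over_random(0.737)        # boosted regression trees, cv ROC score
weblog_margin = margin_over_random(0.6815)   # weblogs-only result from the results slides

print(f"cv: {cv_margin:.3f}, weblogs only: {weblog_margin:.4f}")
```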

Page 13: Demographics and Weblog  Hackathon  – Case Study

Contribution of predictor variables

Page 14: Demographics and Weblog  Hackathon  – Case Study

Predictive Importance

• Friedman: number of times a variable is selected for splitting, weighted by squared error or improvement to model. Measure of sparsity in data
• Fit plots remove averages of model variables

Rank  Variable  Importance
1     pageV     74.0567852
2     loc       11.0801383
3     income     4.1565597
4     age        3.1426519
5     residlen   3.0813927
6     home       2.3308287
7     marital    0.6560258
8     sex        0.6476549
9     prop       0.3817017
10    child      0.2632598
11    own        0.2030012
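The importance values on this slide can be checked with a few lines of Python: they add up to 100, so each one reads as a percentage share of the model's total improvement. The numbers are copied from the slide; treating pageV as the only weblog-derived variable is an assumption based on its name.

```python
importance = {
    "pageV": 74.0567852,
    "loc": 11.0801383,
    "income": 4.1565597,
    "age": 3.1426519,
    "residlen": 3.0813927,
    "home": 2.3308287,
    "marital": 0.6560258,
    "sex": 0.6476549,
    "prop": 0.3817017,
    "child": 0.2632598,
    "own": 0.2030012,
}

total = sum(importance.values())
top_variable = max(importance, key=importance.get)
# Share left for everything except pageV (assumed to be the weblog variable).
rest = total - importance["pageV"]

print(f"total={total:.4f}, top={top_variable}, all others={rest:.2f}")
```

On these numbers, pageV alone accounts for roughly three quarters of the total, which matches the deck's later point that behavioral data matters more than demographics.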

Page 15: Demographics and Weblog  Hackathon  – Case Study

Behavioral vs. Demographics

• Demographics are sparse
• Behavioral weblogs are the best source. Most sites aren’t using this information correctly. There is no single correct answer. Trial and error on features. The features are more important than the algorithm
• Linear vs. Nonlinear
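To make the feature point concrete, here is a hedged Python sketch of deriving a behavioral feature from weblog rows and joining it to sparse demographics. The row layout, the user ids, and the reading of pageV as a page-view count are all assumptions; the real file schemas are not shown in the deck.

```python
from collections import Counter

# Hypothetical weblog rows: (user_id, headline_id).
weblog = [("u1", "h1"), ("u1", "h2"), ("u2", "h1"), ("u1", "h3")]

# Hypothetical demographics; sparse, so many users are missing entirely.
demographics = {"u1": {"age": 42}, "u3": {"age": 31}}

# Behavioral feature: number of weblog rows per user.
page_views = Counter(user for user, _ in weblog)

def feature_row(user_id):
    """Combine the behavioral count with whatever demographics exist for the user."""
    demo = demographics.get(user_id, {})
    return {
        "user_id": user_id,
        "pageV": page_views.get(user_id, 0),  # 0 when the user never appears in the weblog
        "age": demo.get("age"),               # None when demographics are missing
    }

rows = [feature_row(u) for u in ("u1", "u2", "u3")]
print(rows)
```

Tree-based models can often work with the missing demographic values, but how they are encoded is itself a feature decision to iterate on.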

Page 16: Demographics and Weblog  Hackathon  – Case Study

Fitted Values (Crappy)

Page 17: Demographics and Weblog  Hackathon  – Case Study

Fitted Values Better

Page 18: Demographics and Weblog  Hackathon  – Case Study

Predictor Variable Interaction

• Adjusting variable interactions

Page 19: Demographics and Weblog  Hackathon  – Case Study

Variable Interactions

Page 20: Demographics and Weblog  Hackathon  – Case Study

Plot Interactions age, loc

Page 21: Demographics and Weblog  Hackathon  – Case Study

Trees vs. other methods

• Can see multiple levels, which is good for trees. Do other variables match this? Simplify the model or add more features. Iterate to a better model

• No Math. Analyst

Page 22: Demographics and Weblog  Hackathon  – Case Study

Number of Trees

Page 23: Demographics and Weblog  Hackathon  – Case Study

Data Set Number of Trees

Page 24: Demographics and Weblog  Hackathon  – Case Study

Hackathon Results

Page 25: Demographics and Weblog  Hackathon  – Case Study

Weblogs only 68.15%, 18% better than random

Page 26: Demographics and Weblog  Hackathon  – Case Study

Demographics add 1%

Page 27: Demographics and Weblog  Hackathon  – Case Study

AWS Advantages

• Running multiple instances with different algorithms and parameters using R

• Add tutorial, install Screen, R GUI bugs
• http://amazonlabs.pbworks.com/w/page/28036646/FrontPage

Page 28: Demographics and Weblog  Hackathon  – Case Study

Conclusion

• Data Mining at scale requires more development in visualization, MR algorithms, MR data preprocessing.

• Tuning using visualization. Tune 3 parameters, tc, lr, #trees. Didn’t cover 2/3.

• This isn’t reproducible in Hadoop/Mahout or any open source code I know of

• Other use cases, e.g. predicting which item will sell (eBay), search engine ranking.

• Careful with MR paradigms, Hadoop MR != Couchbase MR
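The tuning bullet above names three parameters: tc, lr, and #trees. Assuming tc and lr stand for tree complexity and learning rate (the deck does not spell them out), a parameter sweep can be sketched in Python as a simple grid. The candidate values are placeholders, not the settings used at the hackathon.

```python
from itertools import product

# Placeholder candidate values for each tuning parameter.
tree_complexity = [1, 3, 5]          # tc (assumed meaning: interaction depth of each tree)
learning_rate = [0.1, 0.01, 0.001]   # lr (assumed meaning: shrinkage per tree)
n_trees = [500, 1000, 2000]          # number of trees

grid = [
    {"tc": tc, "lr": lr, "n_trees": n}
    for tc, lr, n in product(tree_complexity, learning_rate, n_trees)
]

# Each entry is one independent model fit, so entries can be spread across machines.
for job_id, params in enumerate(grid[:3]):
    print(job_id, params)
print(len(grid), "combinations in total")
```

Because every combination is an independent fit, this is the kind of workload the AWS slide has in mind: one instance per parameter set, with the cross-validated ROC score used to compare them.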