Page 1:

Anthony Goldbloom
CEO, Kaggle
e-mail anthony.goldbloom@kaggle.com
twitter @antgoldbloom

Predictive modeling competitions

Photo by mikebaird, www.flickr.com/photos/mikebaird

making data science a sport

Page 2:

1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize

Page 3:

Global competitions

Predicting HIV viral load

State of the art     70%
1½ weeks             70.8%
Competition closes   77%

Page 4:

Mismatch between those with data and those with the skills to analyse it

Crowdsourcing

Page 5:

Countless approaches. Hard to know which will work

Page 6:

Additional slides

Not MIT, not SAS … UoL?

Page 7:

Tourism Forecasting Competition

[Chart: forecast error (MASE) over time, compared with the existing model. Time points: Aug 9, 2 weeks later, 1 month later, competition end.]
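The error metric named on this slide, MASE (mean absolute scaled error), can be sketched as follows. This is a minimal illustration of the standard definition, not the competition's scoring code; the function name `mase` and the seasonality parameter `m` are choices made here for illustration.

```python
def mase(actual, forecast, history, m=1):
    """Mean absolute scaled error.

    Divides the forecast's mean absolute error by the in-sample mean
    absolute error of a naive forecast that repeats the value observed
    m periods earlier (m=1 for non-seasonal data).
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(history) <= m:
        raise ValueError("history must contain more than m observations")

    # In-sample errors of the naive benchmark on the training history.
    naive_errors = [abs(history[i] - history[i - m]) for i in range(m, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive benchmark error is zero; MASE is undefined")

    # Out-of-sample error of the forecast being scored.
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return forecast_mae / scale
```

A value below 1 means the forecast's average error is smaller than that of the naive benchmark on the training history.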

Page 8:

Chess Ratings Competition

[Chart: error rate (RMSE) over time, compared with the existing model (Elo). Time points: Aug 4, 1 month later, 2 months later, today.]

Page 9:

Our User Base

Page 10:

Users apply different techniques

• neural networks
• logistic regression
• support vector machine
• decision trees
• ensemble methods
• AdaBoost
• Bayesian networks
• genetic algorithms
• random forest
• Monte Carlo methods
• principal component analysis
• Kalman filter
• evolutionary fuzzy modeling

Page 11:

1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize

Page 12:

Why Participants Compete

• Clean, real-world data
• Professional reputation and experience
• Interactions with experts in related fields
• Prizes

More fun than Sudoku

Page 13:

1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize

Page 14:

Competitions are judged based on predictive accuracy

Page 15:

Competition Mechanics

Competitions are judged on objective criteria
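One way to picture objective judging, as a minimal sketch: every entry is scored against the same held-back outcomes with the same formula, and entries are ranked by that score. The function names and the example team names are hypothetical, and RMSE is used only because it is the metric named on the chess slide. The slides do not show Kaggle's actual scoring code.

```python
import math


def rmse(predictions, truth):
    """Root mean squared error between predictions and true outcomes."""
    if len(predictions) != len(truth):
        raise ValueError("predictions and truth must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, truth)]
    return math.sqrt(sum(squared) / len(squared))


def leaderboard(submissions, truth):
    """Score every submission the same way and rank from lowest error up."""
    scores = {name: rmse(preds, truth) for name, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical held-back outcomes and two hypothetical entries.
truth = [1.0, 0.0, 1.0]
submissions = {
    "team_b": [0.0, 0.0, 1.0],
    "team_a": [1.0, 0.0, 1.0],
}
ranking = leaderboard(submissions, truth)
```

Because the score depends only on the predictions and the held-back outcomes, the ranking needs no human judgment about the method behind an entry.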

Page 16:

1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize

Page 17:

R on Kaggle

Page 18:

R on Kaggle among academics

Page 19:

R on Kaggle among Americans

Page 20:

Who Uses R and How

Number  Name                        Winner           Packages
4       HIV Progression Prediction  Chris Raimondi   Caret (RFE and RandomForest)
5       Informs 2010                Cole Harris      GLM, NNET
6       Chess Rating                Yannis Sismanis
7       Tourism Forecasting Part 2  Phil Brierley    Forecast
10      R Package Recommendation    Max Lin          Stats, ROCR, GGPlot, GGPlot2
13      Ford Stay Alert             Edward           Stats

Page 21:

1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize

Page 22:
Page 23:
Page 24:
Page 25:

MembId  AgeAtFirstClaim  Sex
25872   19- Oct          F

MembId  DaysInHospital
25872   0

MembId  ProviderId  Vendor   PCP     YearSvc  Specialty   Place            PayDelay  LengthOfStay  DSFS         PrimaryConditionGroup  CharIndex  ClaimID
25872   171278567   7891165  294037  Y1       Internal    Office           22                      0- 1 month   RESPR4                 1- 2       1
25872   376108719   5024957  294037  Y1       Laboratory  Independent Lab  23                      0- 1 month   MSC2a3                 0          2
25872   171278567   7891165  294037  Y1       Internal    Office           16                      1- 2 months  RESPR4                 1- 2       3
25872   171278567   7891165  294037  Y1       Internal    Office           19                      2- 3 months  RESPR4                 1- 2       4
25872   171278567   7891165  294037  Y1       Internal    Office           21                      3- 4 months  RESPR4                 1- 2       5
25872   171278567   7891165  294037  Y1       Internal    Office           21                      4- 5 months  RESPR4                 1- 2       6
25872   376108719   5024957  294037  Y1       Laboratory  Independent Lab  11                      7- 8 months  METAB3                 1- 2       7

Mmm… how do I put this into R?

Page 26:

Some SQL Magic
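The query shown on this slide did not survive extraction. As a rough sketch of the idea only, the three tables can be joined and the claims aggregated to one row per member. Everything below is an assumption for illustration: the use of SQLite from Python, the reduced set of claim columns, and in particular the definitions of `maxlos`, `numclaims`, `inhosp` and `urgent`, including the place-of-service labels they test for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Members (MembId INTEGER, AgeAtFirstClaim TEXT, Sex TEXT);
CREATE TABLE DaysInHospital (MembId INTEGER, DaysInHospital INTEGER);
CREATE TABLE Claims (MembId INTEGER, Specialty TEXT, Place TEXT, LengthOfStay TEXT);
""")

# The example member from the slide; LengthOfStay is blank on every claim.
conn.execute("INSERT INTO Members VALUES (25872, '19- Oct', 'F')")
conn.execute("INSERT INTO DaysInHospital VALUES (25872, 0)")
claims = [
    (25872, "Internal", "Office", None),
    (25872, "Laboratory", "Independent Lab", None),
    (25872, "Internal", "Office", None),
    (25872, "Internal", "Office", None),
    (25872, "Internal", "Office", None),
    (25872, "Internal", "Office", None),
    (25872, "Laboratory", "Independent Lab", None),
]
conn.executemany("INSERT INTO Claims VALUES (?, ?, ?, ?)", claims)

# Assumed feature definitions (not taken from the slide):
#   maxlos    = largest LengthOfStay value (a string comparison here; a real
#               pipeline would convert lengths of stay to numbers first)
#   numclaims = number of claims for the member
#   inhosp    = claims whose Place is 'Inpatient Hospital' (assumed label)
#   urgent    = claims whose Place is 'Urgent Care' (assumed label)
FLATTEN = """
SELECT m.MembId, d.DaysInHospital, m.AgeAtFirstClaim, m.Sex,
       MAX(c.LengthOfStay) AS maxlos,
       COUNT(c.MembId) AS numclaims,
       COALESCE(SUM(c.Place = 'Inpatient Hospital'), 0) AS inhosp,
       COALESCE(SUM(c.Place = 'Urgent Care'), 0) AS urgent
FROM Members m
JOIN DaysInHospital d ON d.MembId = m.MembId
LEFT JOIN Claims c ON c.MembId = m.MembId
GROUP BY m.MembId, d.DaysInHospital, m.AgeAtFirstClaim, m.Sex
"""
flat = conn.execute(FLATTEN).fetchall()
```

The result is one row per member, which is the shape a modeling function expects as input.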

Page 27:

Gives us a flat record

MembId  DaysInHospital  AgeAtFirstClaim  Sex  maxlos  numclaims  inhosp  urgent
25872   0               19- Oct          F            7          0       0

Page 28:

Voila, an entry!

Page 29:

Photo by gidzy, www.flickr.com/photos/gidzy

What could the world’s best analysts find in your data?

e-mail anthony.goldbloom@kaggle.com
+61438400053