Winning Kaggle 101: Mark Landry's Experience


H2O.ai Machine Intelligence

Competitive Data Science

Kaggle from a competitor’s view

Mark Landry, H2O

Competitive Data Scientist & Product Manager


Overview
• Personal background
• Iterative workflow
• Framing the problem
• Learning from other competitors
• Q&A


Background


Competitive data scientist & product manager, H2O

BS, computer science

Additional roles: data warehousing, BI, analytics

Preferred algorithm: GBM


Iterative Workflow
• Agile workflows generally outperform waterfall methodologies
• One of the most commonly cited insights from Kaggle employees regarding success


Iterative Workflow: Basics
• Work quickly to develop a reasonable model early
  o Model should be complete enough to gauge score, per competition setup
  o Simple models: see how predicting the mean or the mode scores
  o Confirms understanding of the problem
  o Confirms validity of your internal loss calculation
• Enhance model iteratively
  o Explore and add features: additional data sets and/or transformations
  o Experiment with additional model classes
  o Experiment with hyperparameters within algorithm class
  o Ensemble
  o Validate enhancements via improvement from prior leading model
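The "simple models" step above can be sketched in a few lines. This is only an illustration under stated assumptions: RMSE and accuracy stand in for whatever metric the competition actually uses, and the function names and toy data are hypothetical, not taken from the talk.

```python
from collections import Counter
from statistics import mean


def rmse(actual, predicted):
    """Root mean squared error (assumed stand-in for a regression metric)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return (sum(squared) / len(squared)) ** 0.5


def accuracy(actual, predicted):
    """Share of exact matches (assumed stand-in for a classification metric)."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def mean_baseline(train_targets, n_rows):
    """Predict the training mean for every row."""
    return [mean(train_targets)] * n_rows


def mode_baseline(train_labels, n_rows):
    """Predict the most frequent training label for every row."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_rows


# Toy regression data (hypothetical).
train_y = [3.0, 5.0, 4.0, 8.0]
valid_y = [4.0, 6.0]
baseline_rmse = rmse(valid_y, mean_baseline(train_y, len(valid_y)))

# Toy classification data (hypothetical).
train_labels = ["a", "b", "a", "a"]
valid_labels = ["a", "b", "a"]
baseline_acc = accuracy(valid_labels, mode_baseline(train_labels, len(valid_labels)))
```

If the internally computed baseline score disagrees with what the same constant submission scores on the leaderboard, the internal loss calculation or the reading of the problem is the first thing to recheck. Each later enhancement is then compared against the best score so far.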


Iterative Workflow: Benefits
• Lets the data guide which modeling approach fits best
  o Availability and quality of data may not support complex modeling ideas
• Catch mistakes or incorrect assumptions early and clearly
  o If you observe no improvement after adding what you considered to be a vital feature, you know to immediately check the accuracy of the calculations and/or question how the model already captured that information
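One possible way to follow up on a feature that brought no improvement is to check whether it nearly duplicates a feature the model already had. The sketch below uses Pearson correlation for that; the check itself, the 0.95 threshold, and all names and data are assumptions for illustration, not a method described in the talk. A low correlation does not rule out that a tree model already captured the information some other way.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def redundant_with(new_feature, existing_features, threshold=0.95):
    """Names of existing features that the new feature nearly duplicates."""
    return [
        name
        for name, values in existing_features.items()
        if abs(pearson(new_feature, values)) >= threshold
    ]


# Hypothetical columns: the "new" feature is just a rescaled price.
existing = {
    "price": [10.0, 20.0, 30.0, 40.0],
    "weekday": [1.0, 5.0, 2.0, 4.0],
}
price_with_tax = [11.0, 22.0, 33.0, 44.0]
suspects = redundant_with(price_with_tax, existing)
```

An empty result points back toward the other explanation on the slide: a mistake in how the feature was calculated.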


Framing the Problem
• Have to make the data machine learning ready
  o 1 training file
  o 1 row per target
  o Features do not require additional methodology (e.g. text, images)

• Many Kaggle competitions arrive “ML-ready”
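When the data is not delivered in that shape, getting to "1 row per target" usually means aggregating finer-grained records. A minimal sketch, assuming a hypothetical event log keyed by `customer_id` and a separate mapping of targets (neither comes from the talk):

```python
from collections import defaultdict


def one_row_per_target(events, targets):
    """Collapse event-level rows into a single feature row per target id."""
    totals = defaultdict(lambda: {"n_events": 0, "total_amount": 0.0})
    for event in events:
        agg = totals[event["customer_id"]]
        agg["n_events"] += 1
        agg["total_amount"] += event["amount"]

    rows = []
    for customer_id, target in targets.items():
        # Customers with no events still get a row, with zeroed features.
        agg = totals.get(customer_id, {"n_events": 0, "total_amount": 0.0})
        rows.append({"customer_id": customer_id, **agg, "target": target})
    return rows


# Hypothetical raw inputs: several events per customer, one target each.
events = [
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 1, "amount": 7.5},
    {"customer_id": 2, "amount": 3.0},
]
targets = {1: 1, 2: 0, 3: 0}
training_rows = one_row_per_target(events, targets)
```

The choice of aggregates (counts and sums here) is itself part of the feature work; the point is only that each target ends up with exactly one row in one training table.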


Framing the Problem, 2
• My favorite competitions are those that are not ML-ready
  o Focus more heavily on solving the data problem
  o More like solving a puzzle than tuning hyperparameters


Learning from Kaggle
• Sharing during competition
  o Kaggle Scripts
  o Discussions on the forums
• Shared after the competition
  o Most often several of the top-ranking competitors will share their methodology
  o Often a summary post, occasionally GitHub code
  o I find this the most valuable component of learning data science


Q & A
