Decision Tree Ensembles (Random Forest & Gradient Boosting)
by Devin Didericksen
Sponsors
About Me
Education:
● B.S. Math - University of Utah
● M.S. Statistics - Utah State University
Career:
● Sr. Pricing Strategist - Overstock.com
Hobbies:
● Traveling, movies, food, and my two rabbits
(Photo: My wife, Sara, and me)
What Some of the Leading Data Scientists Say
“Random Forests have been the most successful general-purpose algorithm in modern times” - Jeremy Howard, CEO of Enlitic
“Gradient Boosting Machines are the best!” - Gilberto Titericz, #1 Ranked Kaggler
“When in doubt, use GBM” - Owen Zhang, #2 Ranked Kaggler, Chief Product Officer at DataRobot
“We achieve state of the art accuracy and demonstrate improved generalization [with Random Forest]” - Microsoft Xbox Kinect Research Team on human pose recognition
Kaggle in a Nutshell
Kaggle is a data competition marketplace
● 400,000+ members across the globe
● Companies will sponsor a competition using their own data
● Prizes have ranged from an interview to $500k (but usually around $20k)
● Competition data is sliced into training, validation, and holdout sets
● Labels are hidden from the validation and holdout sets
● Winners are determined by performance against the holdout set
● Competitions are very competitive!
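The slicing diagram from the original slide is not preserved here, but the split it described can be sketched in a few lines. This is a hypothetical illustration: the 60/20/20 proportions and the `competition_split` name are assumptions for the sketch, not something the slides specify, and actual splits vary by competition.

```python
# Hypothetical sketch of the competition data slicing described above:
# a training set (labels visible), a validation set (public leaderboard),
# and a holdout set (private leaderboard). The 60/20/20 split is assumed.
import random

def competition_split(rows, seed=42):
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)            # shuffle so the slices are random
    n = len(shuffled)
    train = shuffled[:int(0.6 * n)]
    validation = shuffled[int(0.6 * n):int(0.8 * n)]
    holdout = shuffled[int(0.8 * n):]
    return train, validation, holdout

train, validation, holdout = competition_split(list(range(100)))
print(len(train), len(validation), len(holdout))  # 60 20 20
```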
Last 10 Kaggle Competitions

Competition               Prize Money  End Date  Best Single Model
SpringLeaf Marketing      $100,000     10/19     Gradient Boosting
Dato Truly Native         $10,000      10/14     Gradient Boosting
Flavours of Physics       $15,000      10/12     Gradient Boosting
Recruit Coupon Purchase   $50,000      9/30      Gradient Boosting
Caterpillar Tube Pricing  $30,000      8/31      Gradient Boosting
Grasp-and-Lift EEG        $10,000      8/31      Recurrent Convolutional Neural Net
Liberty Mutual Property   $25,000      8/28      Gradient Boosting
Drawbridge Cross-Device   $10,000      8/24      Gradient Boosting
Avito Context Ad Clicks   $20,000      7/28      Gradient Boosting
Diabetic Retinopathy      $100,000     7/27      Convolutional Neural Net
*It doesn’t take a data scientist to identify a trend here
Kaggle Competition - Titanic
Can you predict which people survived the Titanic tragedy?
Kaggle Titanic Data

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S
Kaggle Titanic Data - Training Variable Selection
Same table as above, annotated for variable selection: Survived is the label; PassengerId, Name, Ticket, and Cabin are dropped.
Kaggle Titanic Data - Training Set

Survived | Pclass | Sex    | Age | SibSp | Parch | Fare    | Embarked
0        | 3      | male   | 22  | 1     | 0     | 7.25    | S
1        | 1      | female | 38  | 1     | 0     | 71.2833 | C
1        | 3      | female | 26  | 0     | 0     | 7.925   | S
1        | 1      | female | 35  | 1     | 0     | 53.1    | S
0        | 3      | male   | 35  | 0     | 0     | 8.05    | S
0        | 3      | male   | NA  | 0     | 0     | 8.4583  | Q
0        | 1      | male   | 54  | 0     | 0     | 51.8625 | S
Decision Tree Pioneers
● Leo Breiman, PhD - University of California, Berkeley - CART, Random Forest, Gradient Boosting
● Adele Cutler, PhD - Utah State University - Random Forest
● Jerome Friedman, PhD - Stanford University - Gradient Boosting
Simple Decision Tree
Classification and Regression Tree (CART)
Titanic Survival Classification Tree
Classification and Regression Tree (CART)
Like Mr. Bean’s car, CART is
● Super Simple - Trees are often easier to interpret than linear models.
● Very Efficient - The computation cost is minimal.
● Weak - A single tree has low predictive power on its own. It’s in a class of models called “weak learners”.
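The core of CART is a greedy split search, which can be sketched in a few lines. This is a minimal illustration, not the full algorithm: the `gini` and `best_split` helpers are assumed names, and the toy rows are a handful of (Pclass, Fare, Survived) values from the table above. The tree picks the threshold that minimizes the weighted Gini impurity of the two child nodes, then recurses on each child.

```python
# A minimal sketch (assumed, not from the slides) of how CART chooses a split:
# greedily pick the threshold that minimizes the weighted Gini impurity of the
# two resulting child nodes.

def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels, f):
    """Return (threshold, weighted impurity) of the best split on numeric feature f."""
    best = (None, float("inf"))
    for t in sorted(set(r[f] for r in rows)):
        left = [y for r, y in zip(rows, labels) if r[f] <= t]
        right = [y for r, y in zip(rows, labels) if r[f] > t]
        if not left or not right:
            continue
        n = len(labels)
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

# Toy rows from the Titanic table: (Pclass, Fare) with Survived as the label
rows = [(3, 7.25), (1, 71.28), (3, 7.93), (1, 53.1), (3, 8.05), (1, 51.86)]
labels = [0, 1, 1, 1, 0, 0]
print(best_split(rows, labels, 0))  # best Pclass threshold and its impurity
```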
Random Forest

Survived | Pclass | Sex    | Age | SibSp | Parch | Fare    | Embarked
0        | 3      | male   | 22  | 1     | 0     | 7.25    | S
1        | 1      | female | 38  | 1     | 0     | 71.2833 | C
1        | 3      | female | 26  | 0     | 0     | 7.925   | S
1        | 1      | female | 35  | 1     | 0     | 53.1    | S
0        | 3      | male   | 35  | 0     | 0     | 8.05    | S
0        | 3      | male   | NA  | 0     | 0     | 8.4583  | Q
0        | 1      | male   | 54  | 0     | 0     | 51.8625 | S
0        | 3      | male   | 2   | 3     | 1     | 21.075  | S
1        | 3      | female | 27  | 0     | 2     | 11.1333 | S
1        | 2      | female | 14  | 1     | 0     | 30.0708 | C

1. Randomly sample the rows (with replacement) and the columns (without replacement) at each node, and build a tree.
2. Repeat many times (1,000+).
3. Ensemble the trees by majority vote (e.g. if 300 out of 1,000 trees predict that a given individual dies, the predicted probability of death is 30%).
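The three steps above can be sketched in plain Python. This is an assumed illustration, not the full algorithm: each "tree" here is a one-split stump built on a bootstrap sample with one randomly chosen feature per tree, whereas real random forests grow deep trees and re-sample a subset of columns at every node. The helper names and the (Pclass, Fare) toy rows are illustrative.

```python
# Sketch (assumed) of the random-forest recipe: bootstrap the rows, pick a
# random column, fit a small tree, repeat, then average the votes.
import random

def fit_stump(rows, labels, f):
    """One-split tree on feature f: a threshold plus the majority label per side."""
    best = None
    for t in set(r[f] for r in rows):
        left = [y for r, y in zip(rows, labels) if r[f] <= t]
        right = [y for r, y in zip(rows, labels) if r[f] > t]
        if not left or not right:
            continue
        ll = round(sum(left) / len(left))    # majority label, left side
        rl = round(sum(right) / len(right))  # majority label, right side
        err = sum(y != ll for y in left) + sum(y != rl for y in right)
        if best is None or err < best[0]:
            best = (err, f, t, ll, rl)
    return best

def random_forest(rows, labels, n_trees=1000, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # 1. bootstrap the rows...
        f = rng.randrange(len(rows[0]))                  #    ...and pick a random column
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], f)
        if stump is not None:                            # skip degenerate bootstraps
            trees.append(stump)
    return trees                                         # 2. repeated many times

def predict_proba(trees, row):
    """3. Majority vote: fraction of trees predicting survival (300 of 1,000 -> 0.30)."""
    votes = [(ll if row[f] <= t else rl) for _, f, t, ll, rl in trees]
    return sum(votes) / len(votes)

# Toy rows from the table: (Pclass, Fare) with Survived as the label
rows = [(3, 7.25), (1, 71.28), (3, 7.93), (1, 53.1), (3, 8.05), (3, 8.46), (1, 51.86)]
labels = [0, 1, 1, 1, 0, 0, 0]
forest = random_forest(rows, labels)
print(predict_proba(forest, (1, 60.0)))  # estimated survival probability
```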
Random Forest - Tree 1
Random Forest - Tree 2
Random Forest - Several Trees
(Each of these slides pairs the training table with a diagram of a different bootstrapped tree.)
Random Forest
Like a Honda CR-V, Random Forest is
● Versatile - It can do classification, regression, missing value imputation, clustering, and feature importance, and it works well on most data sets right out of the box.
● Efficient - Trees can be grown in parallel.
● Low Maintenance - There is essentially only one parameter to tune: the number of columns to sample at each node.
Gradient Boosting Machines (GBM)
(A sequence of slides walks through the loop: Tree 1, Boost 1, Tree 2, Boost 2, Tree 3, Probabilities. Fit a small tree, boost against the residual errors, fit the next tree, repeat, then convert the combined output into probabilities.)
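The tree/boost loop can be sketched for squared-error regression. This is an assumed, minimal illustration: the function names, toy data, and 0.1 learning rate are mine, each "tree" is a stump, and real GBMs use larger trees and gradient-based generalizations. The key idea matches the slides: each new tree is fit to the residuals of the current ensemble, and a learning rate shrinks every correction.

```python
# Sketch (assumed) of the tree/boost loop for squared-error regression.

def fit_stump(xs, residuals):
    """Best single split of a 1-D feature, predicting the mean residual per side."""
    best = None
    for t in sorted(set(xs))[:-1]:          # the max value can't split, so drop it
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1], best[2], best[3]        # threshold, left value, right value

def gbm(xs, ys, n_trees=50, lr=0.1):
    base = sum(ys) / len(ys)                # start from the mean
    pred = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, pred)]   # "Boost": what is still wrong
        t, lv, rv = fit_stump(xs, residuals)            # next tree fits the residuals
        trees.append((t, lv, rv))
        pred = [p + lr * (lv if x <= t else rv) for p, x in zip(pred, xs)]
    return trees, base

def predict(trees, base, x, lr=0.1):
    return base + sum(lr * (lv if x <= t else rv) for t, lv, rv in trees)

xs = [1, 2, 3, 4, 5, 6]
ys = [5, 6, 7, 20, 21, 22]
trees, base = gbm(xs, ys)
print(predict(trees, base, 2), predict(trees, base, 5))
```

For classification, the same loop runs on the gradient of the log-loss, and a sigmoid maps the final additive score to a probability (the "Probabilities" step).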
Gradient Boosting Machines (GBM)
Given this process, how quickly do you think this will overfit?
The surprising answer is not very fast. There is even a mathematical proof behind this.
Gradient Boosting Machines (GBM)
Like the original Hummer, GBM is
● Powerful - On most real-world data sets, it is hard to beat in predictive power. It handles missing values natively.
● High Maintenance - There are many parameters to tune, and extra precautions must be taken to prevent overfitting.
● Expensive - Boosting is inherently sequential and computationally expensive. However, GBM is a lot faster than it used to be with the arrival of XGBoost.
Deep Learning
Deep learning is like the 2008 Tesla Roadster…
The technology holds a lot of promise for the future, but in its current form it is usually not the most practical tool for most problems (other than speech and image recognition).
Kaggle Titanic - R Code
Some (simple) CART and Random Forest R code can be found here:
https://github.com/didericksen/titanic/blob/master/titanic.R