1. Smartphone User Activity Prediction HJ van Veen |
Triskelion@Kaggle MLWave.com
2. APPROACH TO KAGGLE INCLASS COMPETITIONS Get a good score as
fast as possible by: 1) Getting the raw data into a universal data
format. Mostly CSV -> NumPy array / LibSVM (SVMlight) format. 2) Using
versatile libraries: Scikit-Learn, Vowpal Wabbit, XGBoost. 3) Model
ensembling: Voting, Bagging, Boosting, Binning, Blending,
Stacking.
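As a rough illustration of the "universal data format" step, here is a minimal sketch that writes dense rows in the LibSVM / SVMlight text layout (`label index:value ...`, 1-based indices, zeros skipped). The helper name `to_libsvm_line` and the toy rows are made up for this example.

```python
def to_libsvm_line(label, features):
    """Format one dense row as 'label index:value ...' (1-based, zeros skipped)."""
    pairs = [f"{i}:{v!r}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)

# Toy rows (hypothetical): (label, dense feature list)
rows = [(1, [0.5, 0.0, 2.0]), (0, [0.0, 0.0, 1.25])]
lines = [to_libsvm_line(label, feats) for label, feats in rows]
print("\n".join(lines))
```

The sparse layout only pays off when many features are zero; for dense sensor data a plain NumPy array is usually the simpler choice.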
3. STRATEGY Try to create "machine learning" learning
algorithms and optimized pipelines which are: Data agnostic,
Problem agnostic, Solution agnostic, Automated, Memory-friendly,
Robust with good generalization.
4. FIRST OVERVIEW Problem type: Classification? Regression?
Evaluation metric. Description. Benchmark code. Predict human
activities based on their smartphone usage pattern: predict whether a
person is sitting, walking, etc., from their smartphone activity.
https://inclass.kaggle.com/c/smartphone-user-activity-prediction
5. FIRST OVERVIEW Data types: Counts, Text, Categorical, Numerical,
Dates. Quick preview:
0.28309984,-0.025501173,-0.11118051,-0.37447712,-0.099567756,-0.20296558,-0.37631066,-0.15016035,-0.18169451,-0.29308661,-0.14946642,
6. FIRST OVERVIEW Data size Number of features? Number of train
samples? Number of test samples? Online learning or offline
learning? Linear problem or Non-linear?
7. BRANCH If issues with data: Clear up issues with data
(imputing missing data, joining tables, parsing a JSON string), or
give up and join another competition. If no issues with data: Get the
raw data into NumPy arrays. We want: X_train (train set), y
(labels), X_test (test set).
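A minimal sketch of that last step, assuming purely numerical CSV files with a header row and the label in the last training column. The column names and the in-memory "files" are hypothetical stand-ins for the competition's train and test CSVs.

```python
import io
import numpy as np

# Hypothetical file contents; in practice these would be opened from disk.
train_csv = io.StringIO("f1,f2,activity\n0.28,-0.02,1\n-0.37,-0.09,2\n-0.15,-0.18,1\n")
test_csv = io.StringIO("f1,f2\n0.29,-0.14\n-0.11,0.05\n")

train = np.loadtxt(train_csv, delimiter=",", skiprows=1)
X_train = train[:, :-1]              # every column except the last
y = train[:, -1].astype(int)         # assumed: label is the last column
X_test = np.loadtxt(test_csv, delimiter=",", skiprows=1)

print(X_train.shape, y.shape, X_test.shape)
```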
10. ALGORITHMS There is a bias-variance trade-off between
simple models and complex models.
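The trade-off can be sketched on toy data: polynomials of increasing degree are fitted to a noisy sine curve. The data, the noise level and the degrees are arbitrary choices for this illustration. Training error can only go down as the model gets more complex, while test error typically stops improving at some point.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(3 * x)

# Synthetic train/test data (assumption: Gaussian noise, sd 0.3)
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-0.95, 0.95, 20)
y_train = true_fn(x_train) + rng.normal(0, 0.3, x_train.size)
y_test = true_fn(x_test) + rng.normal(0, 0.3, x_test.size)

def train_test_mse(degree):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

results = {degree: train_test_mse(degree) for degree in (1, 3, 9)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```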
11. ALGORITHMS There is No Free Lunch in machine learning. "We
show that all algorithms that search for an extremum of a cost
function perform exactly the same, when averaged over all possible
cost functions." - "No Free Lunch Theorems for Search", Wolpert,
Macready. Solution: Let algorithms play to their own strengths for
particular problems and remove their weaknesses, then combine their
predictions.
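One simple way to combine predictions is a majority vote. The sketch below uses hand-made predictions from three hypothetical models that each make a different mistake; the vote recovers all labels. Real models rarely have errors this conveniently uncorrelated.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0])
preds = np.array([
    [0, 1, 1, 0, 0, 0],   # model A: wrong on sample 4
    [0, 1, 0, 0, 1, 0],   # model B: wrong on sample 2
    [1, 1, 1, 0, 1, 0],   # model C: wrong on sample 0
])

def majority_vote(predictions):
    """Most frequent class label per sample (ties go to the lowest label)."""
    n_classes = int(predictions.max()) + 1
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return counts.argmax(axis=0)

vote = majority_vote(preds)
print("single-model accuracy:", (preds == y_true).mean(axis=1))
print("vote accuracy:", (vote == y_true).mean())
```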
12. RANDOM FORESTS 1/2 A Random Forest is an ensemble of
decision trees. "Bagging predictors is a method for generating
multiple versions of a predictor and using these to get an
aggregated predictor." - "Bagging Predictors". Breiman
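A minimal sketch of the bagging idea in the quote, with a polynomial fit standing in for the base predictor (a Random Forest would use decision trees instead). Each "version" is trained on a bootstrap sample, and the versions are averaged. Data and settings are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
true = np.sin(3 * x)
y = true + rng.normal(0, 0.3, x.size)

def fit_predict(x_fit, y_fit, x_new, degree=5):
    """Base predictor (assumption): degree-5 polynomial least squares."""
    return np.polyval(np.polyfit(x_fit, y_fit, degree), x_new)

n_models = 25
versions = []
for _ in range(n_models):
    idx = rng.integers(0, x.size, x.size)   # bootstrap: sample rows with replacement
    versions.append(fit_predict(x[idx], y[idx], x))

bagged = np.mean(versions, axis=0)          # aggregate by averaging

individual_mse = np.mean([(v - true) ** 2 for v in versions])
bagged_mse = np.mean((bagged - true) ** 2)
print(f"mean single-model MSE {individual_mse:.4f}, bagged MSE {bagged_mse:.4f}")
```

For squared error the averaged predictor is never worse than the average of its members; for classification the aggregation step would be a vote instead of a mean.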
13. RANDOM FORESTS 2/2 Strength: Relatively fast. Can be fitted
in parallel. Easy to tune. Easy to inspect. Easy to explore data
with. Good to benchmark against. One of the most powerful general
ML algorithms. You can introduce randomness. Weakness: Memory-heavy
(so use bagging). Popular (So use RGF and Extremely Randomized
Trees)
14. GBM 1/2 Gradient Boosted Decision Trees train a sequence of
weak predictors, each one fitted to the errors (residuals) that the
previous predictors left. "A method
is described for converting a weak learning algorithm [the learner
can produce an hypothesis that performs only slightly better than
random guessing] into one that achieves arbitrarily high accuracy."
"The strength of weak learnability." - Schapire
15. GBM 2/2 Strength: Can achieve very good results. Can model
very complex problems. Works on a wide variety of problems.
Weakness: Slower to run (use XGBoost). Tricky to tune (start with
max trees, tune eta, tune depth).
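To make the knobs concrete, here is a minimal sketch of gradient boosting for squared loss with depth-1 trees (stumps) on one feature. It is not how XGBoost is implemented; `eta` and `n_trees` play the role of the learning rate and the number of trees, and the data is synthetic.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:      # keeps both sides non-empty
        left = x <= threshold
        left_val, right_val = residual[left].mean(), residual[~left].mean()
        sse = ((residual - np.where(left, left_val, right_val)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_val, right_val)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_val, right_val = stump
    return np.where(x <= threshold, left_val, right_val)

rng = np.random.default_rng(0)
x = np.linspace(0, 6, 60)
y = np.sin(x) + rng.normal(0, 0.1, x.size)

eta, n_trees = 0.3, 50                       # learning rate, boosting rounds
prediction = np.full_like(y, y.mean())
errors = []
for _ in range(n_trees):
    residual = y - prediction                # negative gradient of squared loss
    stump = fit_stump(x, residual)
    prediction += eta * predict_stump(stump, x)
    errors.append(np.mean((y - prediction) ** 2))

print(f"train MSE after 1 tree: {errors[0]:.4f}, after {n_trees}: {errors[-1]:.4f}")
```

A smaller `eta` needs more trees to reach the same training error, which is one reason to fix a large tree budget first and tune `eta` afterwards.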
16. SVM Classification and regression using support vectors.
"Nothing is more practical than a good theory." - "The Nature of
Statistical Learning Theory", Vapnik. Strength: Strong theoretical
guarantees. Tuning the regularization parameter can prevent
overfitting. Uses the kernel trick: turn linear solvers into
non-linear solvers, build custom kernels. Weakness: Requires a grid
search (develop intuition or a new algo!). Too slow on large data
(use stratified subsampling).
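The kernel trick can be checked numerically on a small case. For 2-D inputs, the degree-2 polynomial kernel `(x . z)^2` gives the same number as an ordinary dot product after an explicit non-linear feature map. The function names below are made up for this sketch.

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return np.dot(x, z) ** 2

def poly2_features(x):
    """Explicit feature map for 2-D inputs: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

implicit = poly2_kernel(x, z)                        # (1*3 + 2*(-1))^2
explicit = poly2_features(x) @ poly2_features(z)     # dot product in 3-D
print(implicit, explicit)
```

A linear solver that only ever needs dot products can therefore work in the richer feature space without building it.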
17. KNN Look at distance to nearest neighbors. "The nearest
neighbor decision rule assigns to an unclassified sample point the
classification of the nearest of a set of previously classified
points." - "Nearest Neighbor Pattern Classification", Cover and Hart.
Strength: Nonlinear. Basic. Easy to tune. Different / unpopular.
Weakness: Slow and does not perform well in general (so use for
stacking or finding near-duplicates).
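A minimal brute-force sketch of the nearest neighbor rule, which also flags near-duplicates by looking at the distance itself. The toy points, the labels and the 1e-6 threshold are assumptions for the example.

```python
import numpy as np

def nearest_neighbor(X_train, X_query):
    """Index of, and Euclidean distance to, the closest training row per query row."""
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    idx = dists.argmin(axis=1)
    return idx, dists[np.arange(len(X_query)), idx]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array(["sitting", "sitting", "walking"])
X_query = np.array([[0.9, 1.1], [4.0, 4.5], [5.0, 5.0]])

idx, dist = nearest_neighbor(X_train, X_query)
predicted = y_train[idx]             # 1-NN decision rule
near_duplicates = dist < 1e-6        # query rows that (almost) equal a train row
print(predicted, near_duplicates)
```

The full distance matrix costs memory proportional to train rows times query rows, which is part of why KNN is slow on larger data.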
19. ENSEMBLING Ensembling combines multiple models to
(hopefully) outperform any individual member. Ensembling (stacked
generalization) won the $1 million Netflix Prize. Ensembling
reduces overfitting and improves generalization performance. Tips:
Use diverse models. Use many models. Don't leak any information
(use stratified out-of-fold predictions).
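A minimal sketch of stratified out-of-fold predictions: every training row gets a prediction from a model that never saw that row, and each fold keeps the class proportions. The nearest-centroid base model and the synthetic two-class data are placeholders chosen to keep the example short.

```python
import numpy as np

def stratified_folds(y, n_folds, rng):
    """Assign each sample a fold id, spreading every class evenly over folds."""
    folds = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == cls))
        folds[members] = np.arange(len(members)) % n_folds
    return folds

def centroid_predict(X_fit, y_fit, X_new):
    """Toy base model: predict the class with the nearest class mean."""
    classes = np.unique(y_fit)
    centroids = np.array([X_fit[y_fit == c].mean(axis=0) for c in classes])
    dists = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.repeat([0, 1], 30)

n_folds = 5
folds = stratified_folds(y, n_folds, rng)
oof = np.empty_like(y)
for k in range(n_folds):
    held_out = folds == k
    # Fit on the other folds only, predict the held-out fold.
    oof[held_out] = centroid_predict(X[~held_out], y[~held_out], X[held_out])

print("out-of-fold accuracy:", (oof == y).mean())
```

The `oof` vector can be fed to a second-stage model as a feature without leaking the labels of the rows it describes.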
20. Automatic stacked ensembling Combining 100s of
automatically created models to improve accuracy and generalization
performance. "Hodor!" - Hodor. Strength: - Won this Kaggle
competition :) - Robust / good generalization - No tuning -
Incremental accuracy-increasing predictions Weakness: Unwieldy,
Dim-witted, Slow, Redundant.
21. Automatic stacked ensembling Step 1 (Generalization): Create
out-of-fold predictions for the train set and predictions for the
test set for: different algorithms, different parameters, different
sampling. Step 2 (Stacking): Add preds to original features and
train a GBM or RF on this. Step 3 (Model Selection): Brute-force
averaging of predictors.
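Step 3 could look roughly like the sketch below: try every subset of predictors, average their out-of-fold probabilities, and keep the subset that scores best. The model names, the probabilities and the use of accuracy as the metric are invented for illustration; the slides do not say which metric or search was used.

```python
from itertools import combinations
import numpy as np

y = np.array([0, 0, 1, 1, 1, 0])
# Hypothetical out-of-fold probabilities of class 1 from three models.
oof = {
    "rf":  np.array([0.2, 0.6, 0.7, 0.4, 0.9, 0.1]),
    "gbm": np.array([0.1, 0.3, 0.4, 0.8, 0.6, 0.7]),
    "knn": np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5]),
}

def accuracy(prob):
    return np.mean((prob > 0.5) == y)

best_score, best_subset = -1.0, None
for size in range(1, len(oof) + 1):
    for subset in combinations(sorted(oof), size):
        averaged = np.mean([oof[name] for name in subset], axis=0)
        score = accuracy(averaged)
        if score > best_score:               # strict: first best subset wins
            best_score, best_subset = score, subset

print(best_subset, best_score)
```

The number of subsets doubles with every added model, so with 100s of models a greedy or random search would have to replace the exhaustive loop. Selecting on the same out-of-fold predictions can also overfit, so a separate holdout is safer.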
22. Automatic stacked ensembling DEMO
23. LEAKAGE "The introduction of information about the data
mining target, which should not be legitimately available to mine
from." - "Leakage in Data Mining: Formulation, Detection, and
Avoidance", Kaufman et al. "One of the top ten data mining
mistakes." - "Handbook of Statistical Analysis and Data Mining
Applications", Nisbet et al.
24. LEAKAGE Exploiting leakage. In predictive modeling
competitions: Allowed and beneficial for results. In science or
business: A very big no-no! In both: Leakage can be exploited by
accident, e.g. an RF finds leakage automatically or a KNN classifier
finds duplicates.
25. LEAKAGE 1/2 In this competition: Look at the ordering of
training sample labels: - Classes (activity) cluster together. -
Are these the different patients/subjects in the study? Exploits:
Build better CV. Use subject meta-features.
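Both ideas can be sketched in a few lines: run lengths show whether labels cluster in file order, and a leave-one-subject-out split keeps each subject on one side of the validation boundary. The label sequence and the `subjects` array are hypothetical; in the competition data the subject ids would have to be inferred.

```python
import numpy as np

def run_lengths(labels):
    """Values and lengths of consecutive runs of identical labels, in file order."""
    labels = np.asarray(labels)
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    bounds = np.concatenate(([0], change, [len(labels)]))
    return labels[bounds[:-1]], np.diff(bounds)

y = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 1, 1])
values, lengths = run_lengths(y)
print(values, lengths)      # long runs suggest the rows are not shuffled

# Hypothetical subject ids, one per training row.
subjects = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
splits = [
    (np.flatnonzero(subjects != s), np.flatnonzero(subjects == s))
    for s in np.unique(subjects)
]
for train_idx, valid_idx in splits:
    print("train rows:", train_idx, "validation rows:", valid_idx)
```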
26. LEAKAGE 2/2 In this competition: Look at the ordering of the
test prediction file: - Class predictions again cluster together. -
Is the test set not randomized? Exploits: Change sequences to be more
uniform and check whether that increases the public score consistently.
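One possible reading of "change sequences to be more uniform" is a sliding-window majority vote over the predictions in file order. This is only a guess at the kind of post-processing meant; the window size and the toy predictions are assumptions.

```python
import numpy as np

def smooth_predictions(preds, window=3):
    """Replace each prediction by the most common label in a centered window.

    Ties go to the smallest label, because np.unique returns sorted values.
    """
    preds = np.asarray(preds)
    half = window // 2
    smoothed = preds.copy()
    for i in range(len(preds)):
        neighborhood = preds[max(0, i - half): i + half + 1]
        values, counts = np.unique(neighborhood, return_counts=True)
        smoothed[i] = values[counts.argmax()]
    return smoothed

preds = np.array([1, 1, 2, 1, 1, 3, 3, 1, 3, 3])
smoothed = smooth_predictions(preds)
print(smoothed)
```

This only helps if the test rows really are ordered by activity, so any gain should be checked against the public score over several submissions.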
27. RESOURCES & FURTHER READING
http://mlwave.com/kaggle-ensembling-guide/
http://scikit-learn.org
http://hunch.net/~vw/
https://github.com/dmlc/xgboost
https://www.youtube.com/watch?v=djRh0Rkqygw [Ihler, Linear regression (5): Bias and variance]
http://www.cs.nyu.edu/~mohri/mls/lecture_8.pdf [Mohri, Foundations of Machine Learning]
http://www.researchgate.net/profile/David_Wolpert/publication/2 [Wolpert, Stacked Generalization]