Smartphone Activity Prediction

1. Smartphone User Activity Prediction HJ van Veen | Triskelion@Kaggle MLWave.com

2. APPROACH TO KAGGLE INCLASS COMPETITIONS Get a good score as fast as possible by: 1) Getting the raw data into a universal data format. Mostly CSV -> NumPy array / LibSVM (SVMlight) format. 2) Using versatile libraries: Scikit-Learn, Vowpal Wabbit, XGBoost. 3) Model ensembling: Voting, Bagging, Boosting, Binning, Blending, Stacking.
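
A minimal sketch of the CSV -> NumPy array -> LibSVM/SVMlight conversion, assuming scikit-learn is installed. The CSV content and its column layout (label in the last column) are made up for illustration and are not the competition data.

```python
import io

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Hypothetical CSV: numeric features, label in the last column.
csv_text = (
    "f1,f2,f3,label\n"
    "0.28,-0.02,-0.11,1\n"
    "-0.37,-0.09,-0.20,2\n"
    "-0.29,-0.14,0.05,1\n"
)

# CSV -> NumPy array (skip the header row).
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1].astype(int)

# NumPy array -> LibSVM/SVMlight text, written to an in-memory buffer.
buf = io.BytesIO()
dump_svmlight_file(X, y, buf, zero_based=False)

# Round-trip check: read the buffer back as a sparse matrix.
buf.seek(0)
X_back, y_back = load_svmlight_file(buf, n_features=X.shape[1], zero_based=False)
```

With both formats on disk, the same data can be fed to Scikit-Learn (arrays) and to tools that read LibSVM-style files.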

3. STRATEGY Try to create "machine learning" learning algorithms and optimized pipelines which are: Data agnostic, Problem agnostic, Solution agnostic, Automated, Memory-friendly, Robust with good generalization.

4. FIRST OVERVIEW Problem type: Classification? Regression? Evaluation metric. Description. Benchmark code. Predict human activities based on their smartphone usage pattern: predict if a person is sitting, walking, etc., using their smartphone activities. https://inclass.kaggle.com/c/smartphone-user-activity-prediction

5. FIRST OVERVIEW Data types: Counts, Text, Categorical, Numerical, Dates. Quick preview: 0.28309984,-0.025501173,-0.11118051,-0.37447712,-0.099567756,-0.20296558,-0.37631066,-0.15016035,-0.18169451,-0.29308661,-0.14946642,

6. FIRST OVERVIEW Data size: Number of features? Number of train samples? Number of test samples? Online learning or offline learning? Linear problem or non-linear?

7. BRANCH If issues with data: Clear up the issues (impute missing data, join tables, parse a JSON string), or give up and join another competition. If no issues with data: Get the raw data into NumPy arrays. We want: X_train (train set), y (labels), X_test (test set).
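
As a rough illustration of this branch, the sketch below imputes missing values and produces X_train, y and X_test. The tiny arrays, the NaN markers and the mean-imputation choice are assumptions for the example, not the competition's actual data or the author's code.

```python
import numpy as np

# Hypothetical raw tables; NaN marks a missing value.
# In train_raw the last column is the label.
train_raw = np.array([
    [0.2, np.nan, 1.0, 0],
    [0.4, 0.1, np.nan, 1],
    [0.6, 0.3, 3.0, 1],
])
test_raw = np.array([[np.nan, 0.2, 2.0]])

X_train, y = train_raw[:, :-1], train_raw[:, -1].astype(int)
X_test = test_raw.copy()

# Impute with column means computed on the train set only,
# so no information from the test set flows into the fit.
col_means = np.nanmean(X_train, axis=0)
X_train = np.where(np.isnan(X_train), col_means, X_train)
X_test = np.where(np.isnan(X_test), col_means, X_test)
```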

8. TRANSFORMS & PREPROCESSING TRANSFORMS & SCALING: TF-IDF weighting, Log scaling, Min-max and standard scaling. PREPROCESSING: Parse dates, Concatenate text fields, Impute missing values.
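
The scaling transforms named on this slide could look like the following with NumPy and Scikit-Learn. The toy arrays are invented. TF-IDF weighting is left out because it applies to text, and the preview of this data set shows numeric features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy non-negative features; the second column is heavy-tailed.
X_train = np.array([[1.0, 10.0], [2.0, 100.0], [3.0, 1000.0]])
X_test = np.array([[2.0, 10000.0]])

# Log scaling: log(1 + x) compresses large values and is defined at 0.
X_train_log = np.log1p(X_train)

# Min-max scaling: fit on train, then apply the same mapping to test.
minmax = MinMaxScaler().fit(X_train)
X_train_mm = minmax.transform(X_train)
X_test_mm = minmax.transform(X_test)  # may fall outside [0, 1]

# Standard scaling: zero mean and unit variance per column (on train).
standard = StandardScaler().fit(X_train)
X_train_std = standard.transform(X_train)
X_test_std = standard.transform(X_test)
```

Fitting each scaler on the train set and reusing it on the test set keeps both sets in the same coordinate system.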

10. ALGORITHMS There is a bias-variance trade-off between simple models and complex models.
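
For squared-error loss, the trade-off can be written down explicitly. The notation below is an assumption introduced for this note (it is not on the slide): data follow $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and $\hat{f}$ is the model fitted on a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

Simple models tend to have a large bias term and a small variance term; complex models tend to show the reverse.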

11. ALGORITHMS There is No Free Lunch in machine learning. "We show that all algorithms that search for an extremum of a cost function perform exactly the same, when averaged over all possible cost functions." - "No free lunch theorems for search", Wolpert, Macready. Solution: Let algorithms play to their own strengths for particular problems and remove their weaknesses, then combine their predictions.

12. RANDOM FORESTS 1/2 A Random Forest is an ensemble of decision trees. "Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor." - "Bagging Predictors", Breiman
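
To make the bagging idea concrete, here is a hand-rolled sketch: several decision trees, each fit on a bootstrap sample, combined by majority vote. The synthetic data, the number of trees (25) and the train/test split are assumptions for illustration; in practice Scikit-Learn's RandomForestClassifier does this work for you.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for real data.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

rng = np.random.default_rng(0)
votes = []
for _ in range(25):
    # Bootstrap sample: draw rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" adds the per-split feature randomness of a forest.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    votes.append(tree.predict(X_test))

votes = np.array(votes)  # shape: (n_trees, n_test_samples)
# Aggregate: majority vote over the trees (labels are 0/1 here).
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
bagged_acc = (bagged_pred == y_test).mean()
```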

13. RANDOM FORESTS 2/2 Strength: Relatively fast. Can be fitted in parallel. Easy to tune. Easy to inspect. Easy to explore data with. Good to benchmark against. One of the most powerful general ML algorithms. You can introduce randomness. Weakness: Memory-heavy (so use bagging). Popular (So use RGF and Extremely Randomized Trees)

14. GBM 1/2 Gradient Boosted Decision Trees train weak predictors sequentially, each one fit to the errors of the predictors before it. "A method is described for converting a weak learning algorithm [the learner can produce an hypothesis that performs only slightly better than random guessing] into one that achieves arbitrarily high accuracy." - "The strength of weak learnability", Schapire

15. GBM 2/2 Strength: Can achieve very good results. Can model very complex problems. Works on a wide variety of problems. Weakness: Slower to run (use XGBoost). Tricky to tune (start with max trees, tune eta, tune depth).
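
One way to read the tuning order above (fix the number of trees, then tune eta, then tune depth) is sketched below. It uses Scikit-Learn's GradientBoostingClassifier as a stand-in for XGBoost so the example has no extra dependency; `learning_rate` plays the role of eta. The data set, candidate values and the small tree count are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=5, random_state=0)

def cv_score(**params):
    """Mean 3-fold cross-validated accuracy for one parameter setting."""
    model = GradientBoostingClassifier(random_state=0, **params)
    return cross_val_score(model, X, y, cv=3).mean()

# 1) Fix the tree budget (kept small here so the sketch runs quickly).
n_trees = 50

# 2) Tune eta (learning rate) with the tree budget fixed.
eta_scores = {eta: cv_score(n_estimators=n_trees, learning_rate=eta)
              for eta in (0.01, 0.1, 0.3)}
best_eta = max(eta_scores, key=eta_scores.get)

# 3) Tune tree depth with the chosen eta.
depth_scores = {d: cv_score(n_estimators=n_trees, learning_rate=best_eta,
                            max_depth=d)
                for d in (2, 3, 5)}
best_depth = max(depth_scores, key=depth_scores.get)
```

Tuning one parameter at a time is cheaper than a full grid, at the cost of possibly missing interactions between eta and depth.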

16. SVM Classification and regression using support vectors. "Nothing is more practical than a good theory." - "The Nature of Statistical Learning Theory", Vapnik. Strength: Strong theoretical guarantees. Tuning the regularization parameter can prevent overfit. Uses the kernel trick: turn linear solvers into non-linear solvers, build custom kernels. Weakness: Requires a grid search (develop intuition or a new algo!). Too slow on large data (use stratified subsampling).
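
The two workarounds on this slide (grid search, stratified subsampling) can be combined as in the sketch below. The synthetic data, the subsample size of 150 and the parameter grid are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=5, random_state=0)

# Stratified subsample: keep 150 rows with the class balance of the full set.
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=150, stratify=y, random_state=0)

# Grid search over the regularization strength C and the RBF kernel width.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
grid.fit(X_sub, y_sub)

best_params = grid.best_params_
best_cv_accuracy = grid.best_score_
```

Parameters found on the subsample are a starting point; they may need a final check on more data.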

17. KNN Look at distance to nearest neighbors. "The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points." - "Nearest neighbor pattern classification", Cover and Hart. Strength: Nonlinear. Basic. Easy to tune. Different / unpopular. Weakness: Slow and does not perform well in general (so use for stacking or finding near-duplicates).
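
A small sketch of the near-duplicate use case: for each test row, look up its nearest train row and flag pairs that are almost identical. The random data, the planted duplicate and the distance threshold are all assumptions for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 5))
X_test = rng.normal(size=(10, 5))
# Plant one near-duplicate: test row 3 is train row 7 plus tiny noise.
X_test[3] = X_train[7] + 1e-9

# For every test row, find the single closest train row.
nn = NearestNeighbors(n_neighbors=1).fit(X_train)
dist, idx = nn.kneighbors(X_test)

# Hypothetical threshold below which two rows count as near-duplicates.
threshold = 1e-6
near_dupes = [(i, int(idx[i, 0]))
              for i in range(len(X_test))
              if dist[i, 0] < threshold]
```

Each entry of `near_dupes` is a (test row, train row) pair, so the train label can be looked up directly for flagged test rows.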

18. OTHERS Logistic Regression, Stochastic Gradient Descent, Ridge Regression, Naive Bayes, Artificial Neural Nets, Matrix Factorization, SVD, Quantile Regression, AdaBoosting, Genetic Algorithms, Perceptrons

19. ENSEMBLING Ensembling combines multiple models to (hopefully) outperform any individual member. Ensembling (stacked generalization) won the $1 million Netflix Prize. Ensembling reduces overfit and improves generalization performance. Tips: Use diverse models. Use many models. Don't leak any information (use stratified out-of-fold predictions).

20. Automatic stacked ensembling Combining 100s of automatically created models to improve accuracy and generalization performance. "Hodor!" - Hodor. Strength: - Won this Kaggle competition :) - Robust / good generalization - No tuning - Incremental accuracy-increasing predictions Weakness: Unwieldy, Dim-witted, Slow, Redundant.

21. Automatic stacked ensembling Step 1 (Generalization): Create out-of-fold predictions for the train set and predictions for the test set for: different algorithms, different parameters, different sampling. Step 2 (Stacking): Add the predictions to the original features and train a GBM or RF on this. Step 3 (Model Selection): Brute-force averaging of predictors.
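
The three steps can be sketched as follows. This is a simplified illustration, not the author's pipeline: it uses only two base models, synthetic data, a Random Forest as stacker, and reads "brute-force averaging" as trying every subset of base models and scoring its averaged out-of-fold prediction.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

base_models = [LogisticRegression(max_iter=1000),
               KNeighborsClassifier(n_neighbors=5)]
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Step 1: out-of-fold predictions for train, full-fit predictions for test.
oof = np.zeros((len(X_train), len(base_models)))
test_preds = np.zeros((len(X_test), len(base_models)))
for j, model in enumerate(base_models):
    for fit_idx, hold_idx in skf.split(X_train, y_train):
        model.fit(X_train[fit_idx], y_train[fit_idx])
        # Each train row is predicted by a model that never saw it.
        oof[hold_idx, j] = model.predict_proba(X_train[hold_idx])[:, 1]
    model.fit(X_train, y_train)
    test_preds[:, j] = model.predict_proba(X_test)[:, 1]

# Step 2: add the predictions to the original features, train the stacker.
stacker = RandomForestClassifier(n_estimators=100, random_state=0)
stacker.fit(np.hstack([X_train, oof]), y_train)
stacked_pred = stacker.predict(np.hstack([X_test, test_preds]))

# Step 3: brute-force search over subsets of base models to average.
best_subset, best_acc = None, -1.0
for r in range(1, len(base_models) + 1):
    for subset in combinations(range(len(base_models)), r):
        avg = oof[:, list(subset)].mean(axis=1)
        acc = ((avg >= 0.5).astype(int) == y_train).mean()
        if acc > best_acc:
            best_subset, best_acc = subset, acc
```

Because the stacker only ever sees out-of-fold predictions for the train rows, it cannot learn from predictions that already contain the label.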

22. Automatic stacked ensembling DEMO

23. LEAKAGE "The introduction of information about the data mining target, which should not be legitimately available to mine from." - "Leakage in Data Mining: Formulation, Detection, and Avoidance", Kaufman et al. "One of the top ten data mining mistakes" - "Handbook of Statistical Analysis and Data Mining Applications", Nisbet et al.

24. LEAKAGE Exploiting Leakage. In predictive modeling competitions: Allowed and beneficial for results. In science or business: A very big no-no! In both: Accidental leakage exploitation, e.g. an RF finds leakage automatically or a KNN classifier finds duplicates.

25. LEAKAGE 1/2 In this competition: Look at the ordering of the training sample labels: - Classes (activities) cluster together. - Are these the different patients/subjects in the study? Exploits: Build better CV. Use subject meta-features.
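
"Build better CV" could mean keeping all rows of one subject in the same fold, so validation scores reflect performance on unseen subjects. The sketch below assumes a subject id per row is available or has been inferred; the `subjects` array here is invented, as is the data.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 3, size=60)
# Hypothetical subject id per row: 6 subjects with 10 consecutive rows each.
subjects = np.repeat(np.arange(6), 10)

# GroupKFold never splits one group across train and validation.
gkf = GroupKFold(n_splits=3)
folds = list(gkf.split(X, y, groups=subjects))

# Sanity check: subjects shared between train and validation, per fold.
overlaps = [set(subjects[train_idx]) & set(subjects[valid_idx])
            for train_idx, valid_idx in folds]
```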

26. LEAKAGE 2/2 In this competition: Look at the ordering of the test prediction file: - Class predictions again cluster together. - Is the test set not randomized? Exploits: Change sequences to be more uniform and check whether that consistently increases the public score.
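
One possible reading of "change sequences to be more uniform" is a sliding-window majority filter over the ordered test predictions. The helper name `smooth_predictions`, the window size and the example sequence are assumptions for this sketch, not the method used in the competition.

```python
import numpy as np

def smooth_predictions(preds, window=5):
    """Replace each prediction by the majority class in a centered window.

    Ties are broken in favor of the smallest class label.
    """
    preds = np.asarray(preds)
    half = window // 2
    smoothed = preds.copy()
    for i in range(len(preds)):
        lo, hi = max(0, i - half), min(len(preds), i + half + 1)
        values, counts = np.unique(preds[lo:hi], return_counts=True)
        smoothed[i] = values[np.argmax(counts)]
    return smoothed

# Ordered predictions with a few isolated outliers inside long runs.
raw = np.array([1, 1, 1, 2, 1, 1, 3, 3, 3, 1, 3, 3])
smoothed = smooth_predictions(raw, window=5)
# smoothed is [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3]
```

This only helps if neighboring test rows really do share a class, which is why the slide suggests checking the public score for a consistent gain.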

27. RESOURCES & FURTHER READING http://mlwave.com/kaggle-ensembling-guide/ http://scikit-learn.org http://hunch.net/~vw/ https://github.com/dmlc/xgboost https://www.youtube.com/watch?v=djRh0Rkqygw [Ihler, Linear regression (5): Bias and variance] http://www.cs.nyu.edu/~mohri/mls/lecture_8.pdf [Mohri, Foundations of Machine Learning] http://www.researchgate.net/profile/David_Wolpert/publication/2 [Wolpert, Stacked Generalization]
