Machine Learning/ Data Science: Boosting Predictive Analytics Model Performance

[email protected]

Machine Learning: Boosting Analytics Model Performance

THE JOB OF DATA SCIENTISTS
Does this sound familiar to anyone?

AGENDA

1. Background: Explaining why boosting performance is relevant.
2. Strategy: How to design a strategy for boosting performance.
3. Features: How to use Feature Engineering to boost model performance.
4. Bonus Round: A collection of free resources for boosting model performance.
5. Questions: Time for questions from the audience.

BOOSTING MODEL PERFORMANCE
Section 1: Background

TIP SOURCES
Where do the recommendations originate?

• 197 Kaggle Winner Interviews: How did they win?
• 50 In-depth Case Studies: Which factors mattered?
• 25,000 Head-to-Head Tests: What made the difference?

WHERE HAVE THESE TIPS WORKED?

IMPORTANT: All views expressed are solely my own, and should not be taken as being those of current or past employers, clients or others.

TWO CATEGORIES OF TIPS
Presentation Focus

Part 1, Model Strategy: The plan, method, series of tactics or stratagems for building your model.

Part 2, Data Preparation: The process for identifying, building, developing, standardizing, normalizing and engineering the correct inputs for one or more analytics processes.

BOOSTING MODEL PERFORMANCE
Section 2: Model Strategy

TIP 1: Leverage Extreme Ensembles
Large ensembles of models with non-correlated errors consistently outperform single models and smaller ensembles.

2015 Liberty Mutual Contest, Owen Zhang
• 6-layer process
• 5 distinct data prep steps
• 31 combined feature sets
• 2 layers of 3 models each
Source: Owen Zhang, Chief Product Officer at DataRobot, https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions

2015 KDD Cup, Jeong-Yoon Lee
• 7 feature sets
• 64 component models
• 15 models in Level 1 Ensemble
• 2 models in Level 2 Ensemble
Source: Jeong-Yoon Lee, Chief Data Scientist at Conversion Logic, https://www.slideshare.net/jeongyoonlee/data-science-competition-72596610
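The intuition behind Tip 1 can be checked with a small simulation. The sketch below is not taken from either winning solution. It assumes five hypothetical component models whose errors are independent Gaussian noise, and it uses a plain average as the ensemble, which is far simpler than the multi-level stacks described above.

```python
import random
from statistics import fmean


def mse(predictions, truth):
    """Mean squared error between two equal-length sequences."""
    return fmean((p - t) ** 2 for p, t in zip(predictions, truth))


rng = random.Random(0)  # fixed seed so the run is repeatable
truth = [rng.uniform(0, 10) for _ in range(2000)]

# Hypothetical component models: each one is the truth plus its own
# independent noise, so the errors are uncorrelated by construction.
n_models = 5
model_preds = [[t + rng.gauss(0, 1.0) for t in truth] for _ in range(n_models)]

# Simplest possible ensemble: average the component predictions row by row.
ensemble_preds = [fmean(row) for row in zip(*model_preds)]

single_mses = [mse(preds, truth) for preds in model_preds]
ensemble_mse = mse(ensemble_preds, truth)

print("single-model MSEs:", [round(m, 3) for m in single_mses])
print("ensemble MSE:     ", round(ensemble_mse, 3))
```

With independent errors, the averaged prediction's error variance shrinks roughly in proportion to the number of models. If the component errors were strongly correlated, most of that gain would disappear, which is why diversity among the component models matters.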

TIP 2: Reduce the Decision Space

MARKETING: Eliminate irrelevant populations
• Seed lists
• Old, unusable lead sources
• Discontinued markets

FRAUD: Eliminate “safer” populations
• Low dollar thresholds
• “Best” customers
• Higher authentication transactions
• “Standing” transactions
• Canceled transfers

GENERAL: Other instances
• What do you already know?
• What is beyond your influence?
• Which problems can be handled separately?
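One way to apply Tip 2 in practice is to route rows through explicit exclusion rules before any model sees them. The sketch below is only an illustration for the fraud case: the field names, the rule names, and the 5.00 threshold are all hypothetical and would come from your own business rules.

```python
# Hypothetical exclusion rules for a fraud model. The 5.00 threshold and the
# field names are illustrative assumptions, not values from the talk.
EXCLUSION_RULES = {
    "low_dollar": lambda t: t["amount"] < 5.00,
    "standing": lambda t: t["is_standing_order"],
    "canceled": lambda t: t["was_canceled"],
}


def split_decision_space(transactions):
    """Return (to_model, excluded), where excluded maps rule name -> rows."""
    to_model = []
    excluded = {name: [] for name in EXCLUSION_RULES}
    for t in transactions:
        for name, rule in EXCLUSION_RULES.items():
            if rule(t):
                excluded[name].append(t)
                break  # the first matching rule claims the row
        else:
            to_model.append(t)  # no rule matched: the model must decide
    return to_model, excluded


transactions = [
    {"amount": 2.50, "is_standing_order": False, "was_canceled": False},
    {"amount": 120.00, "is_standing_order": True, "was_canceled": False},
    {"amount": 950.00, "is_standing_order": False, "was_canceled": True},
    {"amount": 430.00, "is_standing_order": False, "was_canceled": False},
]
to_model, excluded = split_decision_space(transactions)
print("rows left for the model:", len(to_model))
print("rows excluded per rule: ", {k: len(v) for k, v in excluded.items()})
```

Keeping the excluded rows grouped by rule, rather than discarding them, makes it easy to audit how much of the population each rule removes.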

TIP 3: Use Targeted AUC Instead of Total AUC
Match the model objective to the organizational objective. Example courtesy of ORACLE.

TARGETED AUC: Optimizes targeted model performance
• Less common approach
• Perfect for projects with target thresholds, such as limited marketing budgets or maximum fraud referral/turndown rates
• Sacrifices overall accuracy for accuracy at lower threshold targets

TOTAL AUC: Optimizes overall model performance
• Traditional approach
• Perfect for many Kaggle competitions
• Sacrifices accuracy at lower threshold targets for overall accuracy
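The slide does not say how targeted AUC is computed. One common way to focus AUC on a threshold range is a partial AUC restricted to low false-positive rates, and the sketch below assumes that reading. The function names `roc_points` and `auc_up_to`, the 0.2 cutoff, and the toy scores are all illustrative.

```python
def roc_points(labels, scores):
    """ROC points (fpr, tpr), starting at (0, 0), one per distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last row of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc_up_to(labels, scores, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    pts = roc_points(labels, scores)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: 4 positives followed by 6 negatives, scored by two models.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores_a = [0.80, 0.70, 0.60, 0.50, 0.90, 0.40, 0.35, 0.30, 0.20, 0.10]
scores_b = [0.95, 0.90, 0.60, 0.30, 0.80, 0.70, 0.50, 0.40, 0.20, 0.10]

for name, scores in (("A", scores_a), ("B", scores_b)):
    total = auc_up_to(labels, scores)
    targeted = auc_up_to(labels, scores, max_fpr=0.2)
    print(f"model {name}: total AUC={total:.3f}, AUC up to FPR 0.2={targeted:.3f}")
```

In this toy data, model A wins on total AUC while model B wins in the low false-positive region, so the metric you optimize decides which model you would ship.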

TIP 4: Cross-Validate Everywhere
Reducing overfitting while extracting maximum learning from your data

OUT-OF-SAMPLE VALIDATION: Traditional methodology

CROSS-VALIDATION: Used to reduce both overfitting and outlier influence
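For readers who have not built it by hand, k-fold cross-validation can be sketched in a few lines. The example below is a minimal illustration rather than a recommended pipeline: the helper name `k_fold_indices` is hypothetical, and the "model" simply predicts the training mean so the fold mechanics stay visible.

```python
import random
from statistics import fmean


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


# Toy target with one outlier. The toy "model" predicts the training mean.
y = [3.0, 5.0, 4.0, 6.0, 100.0, 5.0, 4.0, 3.0, 6.0, 5.0]

fold_errors = []
for train, valid in k_fold_indices(len(y), k=5):
    prediction = fmean(y[i] for i in train)
    fold_errors.append(fmean(abs(y[i] - prediction) for i in valid))

print("per-fold MAE:", [round(e, 2) for e in fold_errors])
print("mean MAE:    ", round(fmean(fold_errors), 2))
```

Because the outlier lands in only one validation fold, its effect shows up as a single unusual fold score instead of silently dominating one fixed holdout set.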

TIP 5: Algorithm Arsenal
Leverage a diverse modeling arsenal

• Bayesian Networks
• Gradient Boosting Machines
• Random Forests
• Logistic Regression
• Factorization Machines
• Neural Networks
• Genetic Algorithms
• Support Vector Machines

BOOSTING MODEL PERFORMANCE
Section 3: Features

TIP 7: Test Variable Transformation Functions
Features

TIPS 8-13: Univariate Tree Feature Engineering
Features

1. Derive “Stumps”: “Stumps” represent the first split in decision trees and make powerful “weak learners.” Create a derived feature for each input.

2. Bin Continuous Inputs: Using trees creates bin “boundaries” directly associated with the dependent variable, rather than a more arbitrary approach. Assign bins for each continuous input.

3. Handle Missing Values: Assigning missing values to a separate, unique category preserves their information content and eliminates arbitrary replacement approaches.

4. Derive High-Impact Flags: Calling out tree nodes with uniquely powerful splitting capabilities as derived features extracts the most benefit from single inputs.

5. Normalize Scaling: Each input, regardless of data type, can have consistent, normalized scaling by using something like NORM Sigmoid or Yule’s Q for each terminal node from each univariate tree.

6. Overall Transformation: Re-coding the original input into the values from the terminal nodes makes interpretation much easier.
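The first three steps above can be sketched without any tree library. The example below is a rough illustration, assuming a binary target and Gini impurity as the split criterion (the slides do not name one). It finds the single best split for one input, then recodes the input as a stump feature with missing values kept as their own category. All function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_stump_threshold(values, labels):
    """Threshold of the single split that minimizes weighted Gini impurity."""
    pairs = [(v, y) for v, y in zip(values, labels) if v is not None]
    distinct = sorted({v for v, _ in pairs})
    best_threshold, best_score = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighboring values
        left = [y for v, y in pairs if v <= threshold]
        right = [y for v, y in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold


def stump_feature(values, threshold):
    """Recode an input as 'low' / 'high' / 'missing' around the threshold."""
    return [
        "missing" if v is None else ("low" if v <= threshold else "high")
        for v in values
    ]


# Hypothetical input: account age in months, with two missing values.
age_months = [1, 2, 3, None, 12, 18, 24, None, 36, 48]
is_fraud = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

threshold = best_stump_threshold(age_months, is_fraud)
print("stump threshold:", threshold)
print("derived feature:", stump_feature(age_months, threshold))
```

Growing the same univariate tree deeper would yield several thresholds instead of one, which is the binning idea in step 2. The "missing" category lets a downstream model learn whether missingness itself carries signal.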

TIP 14: Think “Crafts-person-ship”
Less “Assembly Line,” More “Fine Craftsmanship”

Moving Away From… Moving Toward…

BOOSTING MODEL PERFORMANCE
Section 4: Bonus Round

Bonus Round: Patent-Application IMPACT Features
Patent application approach for transforming and combining model inputs

1. Univariate Tree Models
2. Create Common Table of Values for Each Node
3. Calculate Z-Score Across Entire Table
4. Assign New Value to New Derived Feature
5. Calculate Avg., High and Low
6. Gradient Boosting
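The slide gives only the step names, so the details of the patent-application method are not available here. The sketch below is a loose guess at what steps 2 through 5 might look like, not the author's implementation. It assumes each terminal node of each univariate tree carries a target rate, that those rates are z-scored across all nodes together, and that each row is then summarized by the average, highest, and lowest z-score of the nodes it falls into. The node rates are made-up numbers.

```python
from statistics import fmean, pstdev

# Step 1 stand-in: hypothetical terminal-node target rates from univariate
# trees, keyed by (input name, node label). These numbers are made up.
node_rates = {
    ("age", "low"): 0.30,
    ("age", "high"): 0.05,
    ("amount", "small"): 0.02,
    ("amount", "large"): 0.20,
    ("channel", "web"): 0.12,
    ("channel", "branch"): 0.03,
}

# Steps 2-3 (assumed reading): one common table covering every node,
# z-scored across the whole table rather than per input.
rates = list(node_rates.values())
mean_rate = fmean(rates)
std_rate = pstdev(rates)
node_z = {node: (rate - mean_rate) / std_rate for node, rate in node_rates.items()}


def impact_summary(row):
    """Steps 4-5 (assumed reading): recode inputs to z-scores, then summarize."""
    z_values = [node_z[(name, node)] for name, node in row.items()]
    return {"avg": fmean(z_values), "high": max(z_values), "low": min(z_values)}


# A row is described by the terminal node it lands in for each input.
row = {"age": "low", "amount": "large", "channel": "branch"}
print(impact_summary(row))
```

Under this reading, the three summary values would become new derived features for the gradient boosting model in step 6. Putting every input on one z-score scale is what would make them comparable across inputs of different types.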

5. Questions

Time for questions from the audience.

Get in Touch
See you soon...

MktgSciences
3719 Yolando Road
Baltimore, MD 21218
USA 1-443-810-8066
[email protected]

MODEL STRATEGY TIP 1
Cross-validate everywhere.

Source: Jeong-Yoon Lee, Chief Data Scientist at Conversion Logic, https://www.slideshare.net/jeongyoonlee/data-science-competition-72596610
Source: Owen Zhang, Chief Product Officer at DataRobot, https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions

THANK YOU...

BOOSTING MODEL PERFORMANCE
Appendix

DEFINITIONS

performance (noun): “the manner in which or the efficiency with which something reacts or fulfills its intended purpose.”

PERFORMANCE IS BEING MORE CLOSELY MEASURED

Moving Away From… Moving Toward…

PERFORMANCE WILL DETERMINE COMPENSATION
Like it or not, Data Science compensation will become more closely tied to model performance.
