Michał Bryś, Data Scientist @ Allegro, Complexity Garage @ Kraków, 05.02.2016
Michał Bryś, Data Scientist @ Allegro, Measure Camp @ London, 10.09.2016
Michal Brys
Data Scientist @ Allegro
Measure Camp | London, 10th September 2016
Find signal in noise.
6 steps to find value from messy data.
Michal Brys
Data Scientist @ Allegro
Specialized also in:
+ Google Analytics
+ Google Tag Manager
michalbrys.com
about.me/michal.brys
Framework for data analysis
CRISP-DM
- Cross Industry Standard Process for Data Mining
- Set up in 1996 (SPSS, Teradata, Daimler AG, NCR, OHRA)
- Still works!
Read more: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
1: Business Understanding
- Define the analysis goal
- What do you want to achieve with the analysis?
- Check the business context
- Don’t be afraid to ask questions
1: Business Understanding
I want to select the customer group with the highest probability of response (...) in order to target a marketing campaign at this group.
2: Data Understanding
- Collect data
Check:
- What each variable in the dataset means
- Are there missing values?
- Exploratory data analysis (EDA)
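The missing-values check above can be sketched in a few lines. This is only an illustration using the Python standard library: the rows and the variable names (`client_id`, `sessions`, `device`) are hypothetical and do not come from the dataset used in the talk, and `None` is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical sample rows (not the talk's data); None marks a missing value.
rows = [
    {"client_id": "a1", "sessions": 3, "device": "mobile"},
    {"client_id": "a2", "sessions": None, "device": "desktop"},
    {"client_id": "a3", "sessions": 7, "device": None},
    {"client_id": "a4", "sessions": None, "device": "mobile"},
]


def missing_counts(records):
    """Count missing (None) values for each variable across all records."""
    counts = Counter()
    for record in records:
        for name, value in record.items():
            counts[name] += int(value is None)
    return dict(counts)


print(missing_counts(rows))
```

Variables with a high share of missing values are candidates for a closer look before any modeling starts.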
2: Data Understanding
Google Analytics with client ID as a custom dimension
- Source: Cookies + JavaScript tracker
- Processed by Google Analytics
- No access to raw data
2: Data Understanding
10 000 records with 11 variables
3: Data Preparation
- Data cleaning
- Prepare new variables, transform data
- Remove missing values and outliers
- Check distributions
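One possible way to remove missing and outlying values is sketched below. The 1.5 × IQR rule is an assumption chosen for illustration, since the slides do not say which outlier criterion was used, and the sample values are hypothetical.

```python
import statistics


def drop_missing_and_outliers(values, k=1.5):
    """Drop None values, then drop values outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule and k=1.5 are illustrative assumptions, not the talk's method.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if low <= v <= high]


# Hypothetical session counts with one missing value and one extreme value.
sessions = [10, 12, 11, 13, 12, 11, 250, None]
print(drop_missing_and_outliers(sessions))
```

Whether an extreme value is an error or a genuine heavy user is a business question, so it is worth checking the distribution before dropping anything.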
3: Data Preparation
Example: fix variable types.
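The slide's original example is not preserved in this text, so the following is only a sketch of what fixing variable types might look like. The field names and values are hypothetical; the point is that exported data often arrives as text and that identifiers should stay text even when they look numeric.

```python
# Hypothetical record as it might arrive from an export: everything is a string.
raw = {
    "client_id": "123.456",
    "sessions": "12",
    "revenue": "19.90",
    "converted": "true",
}


def fix_types(record):
    """Convert each field to a type that matches its meaning."""
    return {
        "client_id": record["client_id"],  # identifier: keep as text, not a number
        "sessions": int(record["sessions"]),  # count
        "revenue": float(record["revenue"]),  # continuous amount
        "converted": record["converted"].lower() == "true",  # binary flag
    }


print(fix_types(raw))
```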
4: Modeling
- Classification problem
- Prepare models by different methods
- Training and test subset
Methods used:
- CART
- C5.0
- Logit Regression
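The training and test subset step can be sketched as a reproducible random split. The 75/25 ratio and the seed are assumptions made for this example; the slides do not state how the data was divided.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of the records and split it into (train, test).

    test_fraction and seed are illustrative defaults, not values from the talk.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Record indices stand in for the real rows.
train, test = train_test_split(range(10_000))
print(len(train), len(test))
```

Fixing the seed keeps the split identical between runs, so that different methods are compared on the same subsets.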
5: Evaluation
Model            | True Negative | True Positive | False Negative | False Positive | Total Error Rate
CART             | 5081          | 3150          | 1080           | 689            | 17.69%
C5.0             | 4089          | 2701          | 1606           | 1604           | 32.10%
Logit Regression | 5871          | 2107          | 1307           | 715            | 20.22%
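The total error rate is the share of misclassified records, (False Negative + False Positive) divided by all records. The short check below recomputes it from the counts reported for the three models; the function name is hypothetical.

```python
def total_error_rate(tn, tp, fn, fp):
    """Share of misclassified records: (FN + FP) / all records."""
    return (fn + fp) / (tn + tp + fn + fp)


# Counts as reported in the evaluation: (TN, TP, FN, FP).
results = {
    "CART": (5081, 3150, 1080, 689),
    "C5.0": (4089, 2701, 1606, 1604),
    "Logit Regression": (5871, 2107, 1307, 715),
}

for model, counts in results.items():
    print(f"{model}: {total_error_rate(*counts):.2%}")
# CART: 17.69%
# C5.0: 32.10%
# Logit Regression: 20.22%
```

On this measure CART has the lowest total error rate of the three.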
6: Deployment
- Prepare a report
- Implement in a system
- Build a product
- ...
Summary
CRISP-DM
+ Keeps the business goal in mind
+ The result answers the initial question
+ Reproducible and documented process
Image: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining#/media/File:CRISP-DM_Process_Diagram.png
More inspiration
“Data Mining Methods and Models”, Daniel T. Larose
“The Signal and the Noise”, Nate Silver
One more thing...
michalbrys.gitbooks.io/r-google-analytics/
Q&A
Michal Brys
about.me/michal.brys
github.com/michalbrys