© 2015 PayPal Inc. All rights reserved.
Data Science with Big Data in a Corporate Environment: Tasks, Challenges, and Tradeoffs
Nachum Shacham, Principal Data Scientist
PayPal
11/10/2015
Data Science From 10K ft
• ML Algorithm Design
• Optimization
• Scores
• Performance
• Charts
• Training set
• Test set(s)
• Model Fitting
• Validate, calculate KPIs
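The flow named on this slide (training set, test set, model fitting, validation with a KPI) can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, the "model" is a one-feature threshold rule, and the KPI is plain accuracy. None of these choices come from the slides.

```python
import random

def split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into a training set and a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def accuracy(rows, threshold):
    """KPI: share of rows where the rule `x >= threshold` matches the label."""
    hits = sum((x >= threshold) == bool(y) for x, y in rows)
    return hits / len(rows)

def fit_stump(train):
    """Model fitting: pick the threshold on x that maximizes training accuracy."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: accuracy(train, t))

# Synthetic data (an assumption): label is 1 when x > 0.5, with 10% label noise.
rng = random.Random(42)
data = []
for _ in range(500):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

train, test = split(data)
threshold = fit_stump(train)              # fit on the training set only
test_accuracy = accuracy(test, threshold) # validate on held-out rows
print(f"threshold={threshold:.3f} test accuracy={test_accuracy:.3f}")
```

The point of the split is that the KPI is computed on rows the fitting step never saw.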
Predictive Modeling Meets Big Data: Bridging the Gap
• Data Warehouse
• Model(s) selection and tuning
• Data Munging
• Deploy
• Feature Engineering
• Observational big data
• Data Exploration
• Validation
• Data Sourcing
• Deep, wide, distributed, sparse, partially unknown, redundant, unreliable
• Unified format (“data frame”)
• Well understood
• Statistically known and “well behaved”
Data Science with Big Data: Quantitative change → Process adjustment
Predictive Models for Serving Customers
• Rewards
• Tech Support
• Customer’s future needs
• Unbalanced
• Precision/recall
• Latency
• Personalization
• Model
• Organic
• External
• Cohort
• Info gathering vs UX
• Features
• Data
• Metrics
• Deployment
Data Exploration: Knowing What’s in the Data and Fitting the Pieces Together
• Understanding data contents & meaning
• Big data visualization
• Units
• Language
• Join keys
• Null values: codes, quantities
• Correlations
• Distributions and outliers
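A few of the exploration checks listed here (null codes and their quantities, outliers, correlations) can be sketched with the Python standard library. The set of null codes, the column names, and the sample values below are assumptions made for illustration; in practice the codes have to be discovered from the data itself.

```python
import statistics

NULL_CODES = {"", "NA", "N/A", "null", "-999"}  # assumed codes, not from the slides

def null_counts(rows):
    """Count, per column, how many values match one of the suspected null codes."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (str(value).strip() in NULL_CODES)
    return counts

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rows = [
    {"amount": "12.5", "country": "US"},
    {"amount": "NA", "country": "DE"},
    {"amount": "-999", "country": ""},
    {"amount": "14.0", "country": "US"},
]
print(null_counts(rows))      # {'amount': 2, 'country': 1}

amounts = [10, 12, 11, 13, 12, 11, 250]
print(iqr_outliers(amounts))  # [250]

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```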
Data Munging: Reshaping the Data → Decisions, Decisions
• Unify values: money, time
• Categorical variable: dummy variables, subgroups
• Missing values: unify null symbols, impute / discard
• Join keys do not match: reformat keys, ignore rows
• Irrelevant/redundant values: discard, transform, nullify
• Skewed distributions: transform, remove outliers, standardize
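Several of the munging decisions on this slide can be made concrete with a short standard-library sketch: unifying null symbols, imputing, transforming and standardizing a skewed column, and building dummy variables. The list of null spellings, the choice of median imputation, and the log transform are assumptions picked for illustration; the slide leaves each of these as a decision to be made.

```python
import math
import statistics

NULL_SYMBOLS = {"", "NA", "N/A", "null", "None", "-"}  # assumed null spellings

def unify_nulls(values):
    """Map every assumed null spelling to None; leave other values untouched."""
    return [None if str(v).strip() in NULL_SYMBOLS else v for v in values]

def impute_median(values):
    """Convert to float and replace None with the median of the observed values."""
    observed = [float(v) for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else float(v) for v in values]

def log_standardize(values):
    """Apply log1p to a non-negative skewed column, then rescale to mean 0, std 1."""
    logged = [math.log1p(v) for v in values]
    mean, std = statistics.fmean(logged), statistics.pstdev(logged)
    return [(v - mean) / std for v in logged]

def dummy_variables(values):
    """One-hot encode a categorical column into one 0/1 column per level."""
    levels = sorted(set(values))
    return {f"is_{lvl}": [int(v == lvl) for v in values] for lvl in levels}

raw_amounts = ["10", "NA", "1000", "", "30"]
amounts = impute_median(unify_nulls(raw_amounts))
print(amounts)  # [10.0, 30.0, 1000.0, 30.0, 30.0]

scaled = log_standardize(amounts)
print(scaled)

dummies = dummy_variables(["card", "bank", "card"])
print(dummies)  # {'is_bank': [0, 1, 0], 'is_card': [1, 0, 1]}
```

Discarding rows with missing values instead of imputing them is the other branch of the same decision; which one is appropriate depends on how much data is lost and why the values are missing.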
Data Munging Packages & Functions
• R
  • data.table
  • dplyr
  • lubridate
  • complete
  • {t, l, s}apply
  • {g}sub
  • grep{l}
  • ggplot2
• Python
  • pandas
  • re
  • NumPy
  • Matplotlib
  • csv
  • datetime
Feature Engineering
Transform raw data to better represent the problem to the model
Better features → Simpler models, better results
• Manual feature manipulation
  • Dummy variables
  • Text integration
  • Sub/super sampling (unbalanced sets)
• Feature learning algorithms
  • Integrated in model fitting
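Sub- and super-sampling for unbalanced sets can be sketched as follows. The function names, the `fraud` column, and the 95/5 class split are hypothetical; the sketch uses simple random resampling, which is one common option rather than a method the slides prescribe.

```python
import random
from collections import Counter

def group_by_label(rows, label_of):
    """Bucket rows by their class label."""
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    return groups

def super_sample(rows, label_of, seed=0):
    """Duplicate random rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_label(rows, label_of)
    target = max(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(g)
        out.extend(rng.choices(g, k=target - len(g)))
    rng.shuffle(out)
    return out

def sub_sample(rows, label_of, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows, label_of)
    target = min(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(rng.sample(g, target))
    rng.shuffle(out)
    return out

def fraud_label(row):
    return row["fraud"]

# Hypothetical unbalanced set: 5 positives out of 100 rows.
rows = [{"id": i, "fraud": int(i < 5)} for i in range(100)]

up = super_sample(rows, fraud_label)
down = sub_sample(rows, fraud_label)
print(Counter(fraud_label(r) for r in up))    # 95 rows per class
print(Counter(fraud_label(r) for r in down))  # 5 rows per class
```

Resampling is normally applied to the training set only, so that test-set KPIs still reflect the real class balance.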
Model Tuning
• Learning Rate
• Parameter Grid
• Ensemble: degree, type
• Interpretability vs Accuracy
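A parameter grid that includes the learning rate can be walked with `itertools.product`. The sketch below is a toy under stated assumptions: the model is a single slope fitted by gradient descent, the data is exactly `y = 3x`, and the grid values are arbitrary. It shows only the mechanics of trying every combination and scoring each on held-out data.

```python
import itertools

def fit_slope(xs, ys, learning_rate, n_steps):
    """Fit y ~ w * x by plain gradient descent on mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

def mse(xs, ys, w):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (an assumption): y = 3x, split into training and validation parts.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [3.0, 6.0, 9.0, 12.0]
valid_x, valid_y = [5.0, 6.0], [15.0, 18.0]

grid = {"learning_rate": [0.001, 0.01, 0.05], "n_steps": [10, 100]}

results = []
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    w = fit_slope(train_x, train_y, **params)
    results.append((mse(valid_x, valid_y, w), params))

# Lowest validation error wins.
best_score, best_params = min(results, key=lambda r: r[0])
print(best_params, best_score)
```

The grid grows multiplicatively with each added parameter, which is one reason the choice of platform matters for tuning on large data.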
Model Building Using H2O
• Rich library of ML algorithms
• Auto detect feature format
• Interoperate with ordinary R: split work for complete munging
• Runs through parameter grid
• Big Server
Model Evaluation
• Who should judge the value of the results?
  • Evaluation criteria
• What are the cost/value of FP, TP, FN, TN?
  • FP: PR cost
  • FN: Financial loss
  • TP: Intended gain
• Continuous model evaluation in production
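The question of what each outcome is worth can be turned into a simple calculation: count the four outcome types and weight them by their business value. The dollar amounts and the labels below are hypothetical; they stand in for the PR cost of a false positive, the financial loss of a false negative, and the intended gain of a true positive.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        key = ("T" if truth == pred else "F") + ("P" if pred == 1 else "N")
        counts[key] += 1
    return counts

def net_value(counts, value_per_outcome):
    """Total business value: each outcome count times its assumed value."""
    return sum(counts[k] * v for k, v in value_per_outcome.items())

# Hypothetical per-outcome values in dollars (assumptions, not from the slides).
VALUES = {"TP": 50.0, "FP": -5.0, "FN": -200.0, "TN": 0.0}

y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])
value = net_value(counts, VALUES)

print(dict(counts))
print(f"precision={precision:.2f} recall={recall:.2f} net value={value}")
```

With these assumed values a single false negative outweighs both true positives, so a model with reasonable precision and recall can still lose money. That is why the cost/value question has to be answered by whoever owns the business outcome.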
Model Deployment
• Model results’ deployment destinations
  • Humans (decision makers, analysts)
  • Computers (recommendations)
  • Databases (scores, leads)
• Latencies: hours to milliseconds
  • From events to model
  • From model output to action
Big Data Visualization
• Multiple aligned plots
• Large number of data points on frame
• Statistical plots
• Network plots
• Model fitting on plot data
Model Drift
• Detection of concept drift post deployment
  • Change in feature set distribution
  • Change in conditional probability P(y | X)
• Auto correction for concept drift
  • Repeat training on time window of recent data
  • Time window: fixed vs. variable size
• Change detection
  • CUSUM
  • Statistical Process Control
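CUSUM change detection can be sketched as a two-sided cumulative sum over a monitored statistic. The monitored quantity (a daily error rate), the target, the slack, and the alarm threshold below are all assumptions chosen for the example; in practice they would be calibrated on a period when the model was known to be healthy.

```python
def cusum_alarm(values, target, slack, threshold):
    """Two-sided tabular CUSUM: return the index of the first alarm, or None.

    `high` accumulates deviations above target + slack, `low` accumulates
    deviations below target - slack; both reset to zero when they go negative.
    """
    high = low = 0.0
    for i, x in enumerate(values):
        high = max(0.0, high + (x - target - slack))
        low = max(0.0, low + (target - x - slack))
        if high > threshold or low > threshold:
            return i
    return None

# Hypothetical daily error rates: stable near 0.10, then drifting to about 0.20.
stable = [0.10, 0.11, 0.09, 0.10, 0.12, 0.08, 0.10, 0.11, 0.09, 0.10]
drifted = [0.20, 0.19, 0.21, 0.20, 0.22]

alarm_index = cusum_alarm(stable + drifted, target=0.10, slack=0.02, threshold=0.2)
print(alarm_index)  # 12: the third day after the shift

print(cusum_alarm(stable, target=0.10, slack=0.02, threshold=0.2))  # None
```

An alarm like this could serve as the trigger for repeating training on a window of recent data, as the slide suggests.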
Transaction & Behavioral: Site Speed Impact
NOT ALL SPEED IMPROVEMENTS ARE CREATED EQUAL
[Chart axes: load time, bounce rate]
System Log Data: Query Cost Comparison
TCO CONSUMPTION EXEC COST
Summary: The “Extra” Steps Make The Difference
• Details: the Critical Factor
• Platform → Performance
• Decisions: understand the impact
• Delayed
• Incomplete
• Irrelevant
THANK YOU