18
© 2015 PayPal Inc. All rights reserved.. Data Science with Big Data in a Corporate Environment Tasks, Challenges, and Tradeoffs Nachum Shacham Principal Data Scientist PayPal 11/10/2015

H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

Embed Size (px)

Citation preview

Page 1: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved..

Data Science with Big Data in a Corporate Environment Tasks, Challenges, and Tradeoffs

Nachum ShachamPrincipal Data Scientist

PayPal

11/10/2015

Page 2: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

Data Science From 10K ft

2

ML Algorithm DesignOptimization

ScoresPerformance

Charts

Training set

Test set(s)

ModelFitting

Validate, calculate KPIs

Page 3: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

Predictive Modeling Meets Big Data: Bridging the Gap

3

Data Warehouse

Model(s) selection and tuning

Data Munging

Deploy

FeatureEngineering

Observational big-data

DataExploration

ValidationData

Sourcing

Deep, wide, distributed ,sparse, partially unknown,

redundant, unreliable

• Unified  format  (“data  frame”)• Well  understood• Statistically  known  and  “well  behaved”

Data Science with Big Data: Quantitative change à Process adjustment

Page 4: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved. 4

Predictive Models for Serving Customers

Rewards

Tech Support

Customer’s future needs

Unblanced

Precision/recall

latency

Personalization

Model

Organic

External

Cohort

Info gathering

vs UX

Features

DATA

Metrics

Deplyment

Page 5: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

Data Exploration: Knowing What’s in the Data and Fitting the Pieces Together

5

Understaning Data Contents & meaning

Big data visualization

Units

Language Join keys

Null values: codes, quantities

Correlations

Distributions and outliers

Page 6: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved. 6

Data Munging: Reshaping the Data à Decisions, Decisions

Unify valuesMoney, time

Dummy variables

Sub groups

categorical variable

Missing values

Join keys do not match

Unify null symbolsImpute / discard

Reformat keys

Ignore rows

DiscardTransformNullify

Irrelevant/redundant

values

TransformRemoveoutliers

Standardize

Skewed distributions

Page 7: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

•R•data.table•Dplyr•Lubridate•complete •{t, l, s}apply•{g}sub•grep{l}•ggplot2

7

Data Munging Packages & Functions

•Python•pandas• re•Numpy•Matplotlib•csv•datetime

Page 8: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved. 8

Feature Engineering Transform raw data to better represent the problem to the model

Better features à Simpler models, better results

Manual Feature Manipulation Feature learning algorithms

Integrated in model fitting

• Dummy  variables• Text  integration• Sub/super  sampling  (unbalanced  sets)

Page 9: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved. 9

Model Tuning

Learning Rate

Parameter Grid

Ensemble: degree, type

Interpretability vs Accuacy

Page 10: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

Model Building Using H2O

10

Rich library of ML algorithms Auto detect feature format

Interoperate with ordinary R: split

work for complete munging Runs through

parameter grid

Big Server

Page 11: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

•Who should judge the value of the results?•Evaluation criteria

•What are the cost/value of FP, TP, FN, TP?•FP: PR cost•FN: Financial loss•TP: Intended gain

•Continuous Model evaluation in production

11

Model Evaluation

Page 12: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

•Model results’ deployment destinations•Humans (decision makers, analysts)•Computers (recommendations)•Databases (scores, leads)

• Latencies: hours to milliseconds •From events to model•From model output to action

12

Model Deployment

Page 13: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

•Multiple aligned plots• Large number of data points on frame• Statistical plots•Network plots•Model fitting on plot data

13

Big Data Visualization

Page 14: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

•Detection of concept drift post deployment•Change in feature set distribution•Change in conditional probability P(y | X)

•Auto correction for concept drift•Repeat training on time window of recent data•Time window: fixed vs. variable size

•Change detection•CUSUM•Statistical Process Control

14

Model Drift

Page 15: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

Transaction & Behavioral: Site Speed impact

15

NOT  ALL  SPEED  IMPROVEMENTS  ARE  CREATED  EQUAL

load  time

Bounce  rate

Page 16: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

System Log Data: Query Cost Comparison

16

TCO CONSUMPTION EXEC  COST

Page 17: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

Summary: The “Extra” Steps Make The Difference

17

Details: the Critical Factor

Platform àPerformance

Decisions: understand the

impact DelayedIncomplete

Irrelevant

Page 18: H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

© 2015 PayPal Inc. All rights reserved.

THANK YOU

18