Overseeing Big Data Analytics: A Data Scientist's Advice on Managing and Using Big Data Results
Edin Hamzic, Data Scientist, Symphony
DATA SCIENCE FRAMEWORK
WHAT IS DATA SCIENCE
ANALYTICS MATURITY
ASPIRATIONAL
ADVANCED
UNDERSTANDING PATTERNS
IDENTIFYING FACTORS AND CAUSES
SIMULATIONS AND OPTIMIZATION SYSTEMS
FORECASTING AND PROBABILITIES
UNDERSTANDING SOCIAL CONTEXT AND MEANING
BUSINESS INTELLIGENCE
TRANSACTIONAL STRATEGIC
BUSINESS VALUE
DATA QUALITY
DESCRIPTIVE
DIAGNOSTICS
PREDICTIVE
SOCIAL DETERMINANTS/ SEMANTIC
PROGRAMMER
BUSINESS ANALYST
STATISTICIAN
DATA VISUALIZATION
BUSINESS ACUMEN
BIG DATA
Text analytics
Network analytics
Geospatial analytics
Social media analytics
Sentiment analytics
Images
DATA SCIENTIST
BUSINESS ANALYST
PROGRAMMER
THE LIFE CYCLE OF AN AI / DATA SCIENCE PROJECT IN HEALTH
Stages:
• ASKING THE RIGHT QUESTION (Define success metrics)
• DATA COLLECTION
• DATA CLEANING AND FEATURE ENGINEERING
• DATA ANALYSIS AND FITTING PREDICTIVE MODEL (Choose machine learning algorithm)
• EVALUATION AND INTERPRETATION OF RESULTS
• COMMUNICATION AND DECISION MAKING
Roles shown in the diagram:
• Data Scientist in collaboration with MOH
• Data Scientist, in collaboration with M&E staff at MOH
• Ministries of health and other stakeholders
• Data Engineer, Data Scientist, and M&E staff at MOH
KEY QUESTIONS FOR A SUCCESSFUL EXECUTION OF AN AI / DATA SCIENCE PROJECT
✓ Do we have all the right stakeholders?
✓ Do we have a non-ML baseline?
✓ Do we ask our question in the right way?
✓ Do we have the right data?
✓ Do we have the necessary domain expertise?
✓ How do we select the most appropriate ML algorithm?
✓ How do we evaluate DS results?
✓ What opportunities do we create for bad actors?
✓ How do we communicate DS results?
✓ How do we improve decision making?
DO WE HAVE ALL THE RIGHT STAKEHOLDERS?
✓ Need to involve program staff, not just 'data' people
✓ Ministry of Health to identify other involved stakeholders
✓ There is always someone acting as custodian of health data
✓ Champion for the work
✓ Involve program and policy makers early on
DO WE HAVE A NON-ML BASELINE?
✓ Is there an existing solution in place?
✓ How would a human solve this problem at small scale?
✓ Will the preparation needed to build a model provide 90% of the benefit without the model itself?
✓ What analytical techniques other than ML are relevant?
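One way to make the baseline question concrete is to score a trivial non-ML rule before any model is trained. The sketch below is a minimal illustration in plain Python: it measures how accurate "always predict the most common outcome" would be. The labels and the function name `majority_baseline_accuracy` are hypothetical and do not come from the slides.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for y in test_labels if y == majority_label)
    return majority_label, correct / len(test_labels)


# Hypothetical outcome labels (1 = complication, 0 = no complication).
train = [0, 0, 0, 1, 0, 0, 1, 0]
test = [0, 1, 0, 0]

label, acc = majority_baseline_accuracy(train, test)
print(f"baseline predicts {label}, accuracy {acc:.2f}")
```

Any ML model proposed later would need to beat this number by a margin that justifies its cost and complexity.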
THE TRADITIONAL APPROACH
THE MACHINE LEARNING APPROACH
Adapted from “Hands-On Machine Learning with Scikit-Learn and TensorFlow” by Aurélien Géron
DO WE ASK OUR QUESTION IN THE RIGHT WAY?
WHAT? WHY? HOW?
• BUSINESS: Can we safely send a pneumonia patient home and free up a bed?
• DATA SCIENCE: Which pneumonia patients will have complications?
• Unstated constraint: Don’t change historic behavior.
• People with asthma were wrongly graded as low-risk by an AI system designed to predict pneumonia.
• CORRELATION VS. CAUSALITY
• DESCRIPTIVE VS. PRESCRIPTIVE
DO WE HAVE THE RIGHT DATA?
● Start with the domain where you have the highest-quality data, not necessarily the most data.
● Garbage in, garbage out
● Quantity vs. quality
BIG DATA
• Data heterogeneity
• Curse of modularity
• Dirty and noisy data
• Data locality
• Feature engineering
• Bonferroni’s principle
• Processing performance
• Real-time processing / streaming
• Curse of dimensionality
• Non-linearity
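The multiple-comparisons concern behind Bonferroni's principle can be illustrated with simple arithmetic: when many features are tested against pure noise, some will look significant by chance. The sketch below is a rough illustration under the assumption that every test is run at the same significance level; the function names and the numbers are hypothetical.

```python
def expected_false_positives(num_tests, alpha):
    """Expected count of 'significant' results if every null hypothesis is true."""
    return num_tests * alpha


def bonferroni_threshold(num_tests, alpha):
    """Per-test significance level that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests


# Hypothetical screening of 10,000 candidate features at the 5% level.
num_features = 10_000
alpha = 0.05

print("expected chance findings:", expected_false_positives(num_features, alpha))
print("corrected per-test threshold:", bonferroni_threshold(num_features, alpha))
```

If the number of findings in real data is not clearly larger than the expected chance count, most of those findings are likely to be spurious.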
DO WE HAVE ALL THE NECESSARY EXPERTS?
Exploration vs. Exploitation
Point of pride - “trust the data”
Black box problem
“Applied ML is feature engineering”
COMING UP WITH FEATURES IS DIFFICULT, TIME-CONSUMING, REQUIRES EXPERT KNOWLEDGE. 'APPLIED MACHINE LEARNING' IS BASICALLY FEATURE ENGINEERING.
- ANDREW NG, MACHINE LEARNING AND AI VIA BRAIN SIMULATIONS
FEATURE ENGINEERING
• Feature extraction
• Feature importance
• Feature construction
• Feature selection
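Two of the feature engineering activities named above, construction and selection, can be sketched in a few lines of plain Python. This is a toy illustration only: the record fields (`weight_kg`, `height_m`, `site`), the derived features, and the variance-based selection rule are assumptions chosen for the example, not taken from the slides.

```python
from statistics import pvariance

# Hypothetical patient records; field names are illustrative only.
records = [
    {"weight_kg": 70.0, "height_m": 1.75, "site": "A"},
    {"weight_kg": 82.0, "height_m": 1.80, "site": "A"},
    {"weight_kg": 55.0, "height_m": 1.60, "site": "A"},
]


def construct_features(record):
    """Feature construction: derive new columns from raw fields."""
    bmi = record["weight_kg"] / record["height_m"] ** 2
    site_is_a = 1.0 if record["site"] == "A" else 0.0
    return {"bmi": bmi, "site_is_A": site_is_a}


def select_features(rows, min_variance=1e-12):
    """Feature selection: keep only columns that vary across rows."""
    names = list(rows[0].keys())
    return [n for n in names if pvariance([r[n] for r in rows]) > min_variance]


rows = [construct_features(r) for r in records]
kept = select_features(rows)
print(kept)
```

Here `site_is_A` is dropped because it is constant in this sample and therefore carries no information for a model. Deciding which raw fields are worth combining in the first place is where domain experts are needed.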
SELECTING THE MOST APPROPRIATE ML ALGORITHM
✓ Accuracy
✓ Training time
✓ Complexity of data
✓ Number of parameters
✓ Number of features
✓ Interpretation
✓ Speed
✓ Need for incremental training
MACHINE LEARNING ALGORITHMS CHEAT SHEET
THE BIAS-VARIANCE TRADE-OFF
[Figure: grid of Low Bias / High Bias against Low Variance / High Variance; plot of Error against Model Complexity with curves for Bias, Variance, and Total Error]
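For reference, the three curves in the error plot correspond to the standard decomposition of expected squared error. The notation below is an assumption introduced for this note and does not appear on the slide: the data follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and f̂ is the model fitted on a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The total error is the sum of the three terms, and only the first two depend on the choice of model.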
HOW DO WE EVALUATE DS RESULTS?
✓ Classification accuracy
✓ Logarithmic loss
✓ Confusion matrix
✓ Area under curve (AUC)
✓ F1 score
✓ Mean absolute error
✓ Mean squared error
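Most of the metrics listed above can be computed directly from their definitions. The sketch below does so in plain Python for a binary classifier and a regression example; AUC is left out for brevity. The function names and the sample labels are hypothetical, and in practice a tested library implementation would normally be preferred.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from the confusion matrix."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def log_loss(y_true, y_prob, eps=1e-15):
    """Logarithmic loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
print(mean_absolute_error([3, 5], [2, 7]), mean_squared_error([3, 5], [2, 7]))
```

Which metric matters depends on the question being asked: when missing a complication is costlier than a false alarm, recall deserves more weight than raw accuracy.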
✓ Typically multi-factor with no dominant right answer
✓ If limited to model metrics, be suspicious
✓ Being understandable and trusted matters
WHAT OPPORTUNITIES ARE WE CREATING FOR BAD ACTORS?
[Figure: stop sign] RECOGNIZED AS “45 MPH” SIGN
Sources: Panda, https://blog.openai.com/adversarial-example-research/; Stop, https://arstechnica.com/cars/2017/09/hacking-street-signs-with-stickers-could-confuse-self-driving-cars/
HOW DO WE IMPROVE DECISION MAKING?
• Speed
• Quality
• Robustness
THANK YOU
QUESTIONS?
• Offices in United States, Bosnia, Serbia and FYR Macedonia: 5
• Full-time developers and data scientists: 120+
• Our acceptance rate: 8%
• High-growth clients: 20+
• Google Ventures preferred vendor
• How much our clients have cumulatively raised: $1.5 bn
• Working on digital transformation for Fortune 500 companies with Digital McKinsey