The Role of DevOps in Data Analytics Teams
J ON THE BEACH 05/21/16
MORPHED WITH DEEP LEARNING™
TYPICAL OPS GUY (source: Reddit)
TYPICAL YOUNG DATA SCIENTIST (source: Common Sense)
My initial interests
Type Systems, Automated Proving, Abstract Program Interpretation, Functional Programming, Garbage Collection and VMs
Graph Analytics, Chess AI, Natural Language Processing, 80% Emacs / 20% Vim
So to sum it up …
I (USED TO?) BE A BIG NERD
Collaboration
CLICKERS CODERS
Software is a Human Problem
I ended up building collaborative software
For data science ....
DEV OPS && DATA
Let’s get back to the (brief) history of DevOps
Agile Conference, 2008
Scrum, and Agile in an operational context
Hey! We should have our own Velocity in Belgium
10+ Deploys per Day: Dev and Ops Cooperation at Flickr
O’Reilly Velocity, June 2009
Patrick Debois
2007
Dev
Ops
QA
DevOpsDays
Ghent, October 2009
DevOps
DevOps is the practice of operations and development
engineers participating together in the entire service lifecycle,
from design through the development
process to production support.
DevOps is also characterized by operations staff making
use of many of the same techniques as developers for
their systems work.
Invite Ops to the Dev Meeting
Oh. And let them SPEAK
Ops should know how to code
Let’s take an example: John devops from 2009
Learnt Python the Hard Way
Started with Puppet 1.0
Used EC2 before ELB and EBS !
Hegelian perspective
Conflict and Frustration
Concept Combination
Catharsis
Create Culture
Share
Create Tools
Dev + Ops
There have been ops associated with data for a while?
It’s called Business Intelligence !
History of Data Analytics (Oversimplified)
2013 2014 2015 2016 2017 2018
Moving to a world of automated decision making
DATA FOR MORE INSIGHTS
DATA FOR AUTOMATED DECISIONS
The Age Of Distributed Intelligence
Global, Personalised and Real-Time Data-Driven Services
Data, Analytics and Data Science
Conflict and Frustration
Concept Combination
Catharsis
Create Culture
Share
Create Tools
Data + Science
Welcome to Technoslavia !
Classic Business Intelligence Team Organization
Business Leader Data Consumer
Line-of-business Data Consumer
Business Project Sponsor
BI Solution Architect
Model Designer
ETL Developer
Dashboard / Report Designer
SpecsDim
Big Boss
Data Science Team Organization
Business Leader Data Consumer
Line-of-business Data Consumer
Business Project Sponsor
Data Engineer
Data Analyst
System Engineer / Data Architect
Business Needs
Data Scientist
IT Constraints
I.T.
Is there room for a new role ?
Data Plumber
Data Engineer
Data Scientist
Data Waiter
Data Cleaner
Data Analyst
REAL JOB
DREAM JOB
DevOps For Data?
Imagine a company building
a new “smart car” app: AutoFine™
“Revolutionary collaborative network that checks the quality of your driving and punishes you with virtual fines if you’re a bad driver”
Imagine a company building
a new “smart car” service: AutoFine™
10 TB of Data Every Month
Hive / Spark / Python
10 Different Predictive Models
Real-Time API / Workflow
????
????
OPERATIONS: Who is responsible for …
Checking that the newly trained model performs as expected
Checking that the product catalog and the website tags remain consistent
Checking that the Hadoop cluster scales as expected and has enough bandwidth to handle the workload
Testing the performance of the real-time API
Monitoring the performance of the model and deciding to roll back / maintain / roll out
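The last responsibility can be made concrete with a small decision rule. The sketch below is only an illustration, not something from the talk: the function name `decide`, the `margin` threshold and the idea of scoring the live, previous and candidate models on the same recent labelled data are all assumptions.

```python
def decide(live, previous=None, candidate=None, margin=0.01):
    """Pick an action from model scores measured on the same recent data.

    live      -- score of the model currently in production
    previous  -- score of the model that was replaced (hypothetical, optional)
    candidate -- score of the newly trained model (hypothetical, optional)
    margin    -- minimum gain required before changing anything (assumed value)
    """
    options = {"maintain": live}
    if previous is not None:
        options["rollback"] = previous
    if candidate is not None:
        options["rollout"] = candidate
    # On ties, max() keeps the first entry, so "maintain" wins by default.
    best = max(options, key=options.get)
    if best != "maintain" and options[best] - live >= margin:
        return best
    return "maintain"
```

Requiring a margin before acting keeps the team from swapping models on noise; who owns that margin is exactly the "who is responsible" question above.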
DATA OPS
As a Philosophy
X OPS PHILOSOPHY
Highly consensual
Highly controversial
Create an API culture
Do not share
• Random Piece of Code
• Flat File
• Email
Do share
✓ Reproducible documented workflows
✓ Clean, documented APIs
Defensive Data Programming
• Software has errors.
• You are not your software, yet you are responsible for the errors.
• You can never remove the errors, only reduce their probability.
Defensive Data Programming
• Handle the case when one of the input files is empty
• Handle the case when a new value appears
• Handle the case when two columns become completely correlated
• Handle the case when a column is 16k long
• Etc.
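A minimal sketch of what these checks could look like in plain Python, run on each incoming batch before it reaches a model. Everything named here (`check_batch`, `pearson`, `MAX_FIELD_LENGTH`, the 0.9999 threshold, the row format) is a hypothetical choice for illustration, and "a column is 16k long" is read as a single value of more than 16,000 characters.

```python
import math

MAX_FIELD_LENGTH = 16_000  # assumed limit, echoing the "16k long" case


def pearson(xs, ys):
    """Pearson correlation; returns None when either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def check_batch(rows, known_values, numeric_pairs):
    """Return a list of human-readable problems found in one input batch.

    rows          -- list of dicts, one per record
    known_values  -- {column: set of values seen at training time}
    numeric_pairs -- [(col_a, col_b), ...] pairs to test for full correlation
    """
    if not rows:
        return ["input is empty"]
    problems = []
    for column, allowed in known_values.items():
        unseen = {row[column] for row in rows} - allowed
        if unseen:
            problems.append(f"new values in {column!r}: {sorted(unseen)}")
    for col_a, col_b in numeric_pairs:
        r = pearson([row[col_a] for row in rows], [row[col_b] for row in rows])
        if r is not None and abs(r) > 0.9999:
            problems.append(f"{col_a!r} and {col_b!r} are completely correlated")
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str) and len(value) > MAX_FIELD_LENGTH:
                problems.append(f"value in {column!r} exceeds {MAX_FIELD_LENGTH} characters")
    return problems
```

The function returns problems instead of raising, so the workflow can decide whether to stop, alert, or continue with a degraded result. Note that with only two rows any two non-constant columns look fully correlated, so the correlation check only makes sense on larger batches.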
Monitoring : the alerts for people who love it
• Performance …
• Time Spent …
• Number of Errors …
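As a rough illustration of collecting time spent and number of errors per workflow step, here is a small decorator using only the standard library. The names `monitored`, `METRICS` and `score_trip` are made up for this sketch, and the 130 km/h threshold is an arbitrary example value; a real setup would ship the counters to a monitoring system rather than keep them in memory.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory counters, keyed by step name (illustration only).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "seconds": 0.0})


def monitored(step_name):
    """Record call count, error count and time spent for a workflow step."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            stats = METRICS[step_name]
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                stats["errors"] += 1
                raise  # the error is counted, never swallowed
            finally:
                stats["calls"] += 1
                stats["seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@monitored("score_trip")
def score_trip(speed_kmh):
    """Toy scoring step: flag a trip as speeding above an assumed threshold."""
    if speed_kmh < 0:
        raise ValueError("negative speed")
    return speed_kmh > 130
```

Alerts can then be derived from `METRICS`, for example when the error ratio of a step crosses a threshold agreed with the team.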
Monitoring : Business Informal Monitoring
• % Opening
• Market Spent
• Exception User Events …
Resource Allocation
I’ve got this strange error “OutOfMemory”. Do you know what it is?
Why is the Hadoop Cluster going slower than my laptop ?
The Philosophy of pre-allocating more resources than necessary
Get to the latest package culture …
Data Scientist
I need the latest version of scikit and NetworkX …
And could you repackage that to enable TensorFlow optimizations?
System Administrator
…..
The culture of containers
Developers’ Sandbox
DATA OPS
As a Job Title
Job Title : a matter of name, $$ and social ladder
Data scientist Data Ops
Developer
Statistician
Full Stack Developer
Sys Admin
DevOps
Job Role : A matter of Do or Don’t
DO: Things you really want to do
DON’T: Things you really don’t want to get into
FIGHT THE TOY PLATFORM ANTI-PATTERN
Test and Invest in Infrastructure == Skilled People
or
Go For Cloud / Packaged Infrastructure
Your brand new Hadoop cluster is perceived as slow, little used and not reliable
FIGHT THE TECHNO MISMATCH ANTI-PATTERN
Accept Being Polyglot
or
Be a Dictator
VS
VS
The Python Clan
The R Tribe
The Old Elephant Fraternity
The New Elephant Club
GETTING DATA POLITICS
> DATA NOT AVAILABLE
GETTING DATA POLITICS
THE FOX
Hunt for a Big Problem!
Convince the CEO that you can solve a business-critical problem, and use it as an excuse to get all the data you want!
THE SPIDER
Create a Network!
Create a set of trackers or addictive data collection internally, to get data on your side!
PREDICTIVE ANALYTICS DEPLOYMENT STRATEGY
Website
2000’s winners
Companies that were able to release fast
"Artificial Intelligence with Data for Internet of Things"
2010’s winners
Companies able to put intelligence in production
?
Design a way to put “PREDICTIVE MODELS” IN PRODUCTION
OWN ANONYMISATION / PRIVACY / DATA SECURITY ISSUES WITH PARTNERS
Technical Feasibility ? What can or cannot be done ?
Let’s Wrap IT Up!
A Company Building a GPS-powered automated car fine system
10 TB of Data Every Month
Hive / Spark / Python
10 Different Predictive Models
Real-Time API / Workflow
Robust Workflow With Data Quality Checks
Functional Monitoring By Business People through Slack and Dashboards
Monitoring for the API
Feature Engineering Pipeline in Python
But you, where do you stand?
???? ???? ???? ?????
What's your roll-back strategy like?
What kind of multi-variate testing or strategies do you have in place for predictive models?
How do you manage the robustness of your data flow production scripts?
How can business people monitor the performance of the application?
http://bit.ly/production-survey
Food for thought
www.dataiku.com/blog
THANK YOU!
http://bit.ly/production-survey