Your Data Scientist Hates You
Bradford Stephens
ft. help from Nick Kypreos
About Us
Data Infrastructure and Data Science at scale.
fact
71%¹ of Data Science projects fail
fact
63%² of Data Scientists quit in < 2 years
Why
• Data Scientists aren’t involved from the beginning
• No strategy
• Bad Data: more common than you think and untestable
• Pointlessness
Involvement
• Everything stems from this
• Goals need to be attainable
• Data needs to be accessible and formatted correctly
• You can’t conceive of what’s possible (or impossible)
Your Data Strategy
Your Data Strategy: Diagnostic
• Diagnostic: How did we get here?
• Understanding history and how your org drives decisions is key
• What will your org’s immune system allow?
• Infrastructure: what is currently in place and how did it happen?
• Goals: How do we drive revenue or KPIs?
Your Data Strategy: Roadmapping
• Roadmapping: What are we going to build?
• Data Architecture?
• Platform feasible?
• Who builds what when, for how much?
• How do we ensure a low-latency feedback loop? Data science is highly iterative
Your Data Strategy: Development
• Platform: What’s our stack?
• Storage: Where does data come from, where does it go, and what are the latency/throughput requirements?
• Processing: Where do we transform data? Batch? Real-time? Bounds?
• Collaboration: How do we share results, data, and APIs across the org? (always forgotten)
Bad Data
Data Science is Untestable
Data Science = Math + QA + CS + PM + Psionics
Untestable
• Data Scientists spend vast amounts of time fixing data
• …and you need to be OK with that
• Unit Testing doesn’t make sense in science
• Distribution fitting, etc.
• Can only test via simulation: a whole ‘nother process
• “Simple” things take weeks to verify
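One way to picture "testing via simulation": generate data from a distribution whose parameters you chose, run the fit, and see how far the estimates land from the truth. The sketch below is only an illustration, not anything from the talk's own tooling. The function names, the choice of a normal distribution, and the tolerances are all assumptions.

```python
import random
import statistics

def fit_normal(samples):
    """Estimate (mean, standard deviation) from a list of samples."""
    return statistics.fmean(samples), statistics.stdev(samples)

def simulate_fit_error(true_mu, true_sigma, n, trials, seed=0):
    """Draw `trials` synthetic datasets of size `n` from a known normal
    distribution and return the worst absolute error of the fitted
    mean and standard deviation across all trials."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    worst_mu_err = 0.0
    worst_sigma_err = 0.0
    for _ in range(trials):
        samples = [rng.gauss(true_mu, true_sigma) for _ in range(n)]
        mu_hat, sigma_hat = fit_normal(samples)
        worst_mu_err = max(worst_mu_err, abs(mu_hat - true_mu))
        worst_sigma_err = max(worst_sigma_err, abs(sigma_hat - true_sigma))
    return worst_mu_err, worst_sigma_err
```

Note that the check is a tolerance on a random outcome, not an exact equality, which is part of why this kind of verification is its own process rather than a unit test.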
Instrumentation
• Can you even verify your instrumentation?
• Are you collecting everything?
• Collecting the right thing?
• What if only 85% of the time?
• Systematically drop at high enough traffic?
• Someone comes to the site through a different channel from an acquisition 2 years ago?
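A rough way to check the "only 85% of the time" and "drops at high traffic" questions is to compare tracked events against an independent count, such as server request logs, hour by hour. This is a minimal sketch under that assumption. The input shape, the function names, and the 0.95 threshold are hypothetical.

```python
def capture_report(hourly_counts, min_ratio=0.95):
    """hourly_counts: list of (hour_label, server_requests, tracked_events).
    Return the hours whose tracked/server ratio falls below min_ratio."""
    flagged = []
    for hour, server, tracked in hourly_counts:
        if server == 0:
            continue  # nothing to compare against
        ratio = tracked / server
        if ratio < min_ratio:
            flagged.append((hour, round(ratio, 3)))
    return flagged

def capture_by_traffic(hourly_counts):
    """Split hours into a quiet half and a busy half by server request
    volume and return (quiet_ratio, busy_ratio), the aggregate capture
    ratio of each half. A busy ratio well below the quiet ratio suggests
    events are being dropped under load."""
    ordered = sorted(hourly_counts, key=lambda row: row[1])
    half = len(ordered) // 2
    quiet, busy = ordered[:half], ordered[half:]

    def ratio(rows):
        server = sum(r[1] for r in rows)
        tracked = sum(r[2] for r in rows)
        return tracked / server if server else float("nan")

    return ratio(quiet), ratio(busy)
```

This only works if the second count is collected through a separate path from the instrumentation being checked.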
Software is Garbage
• Remember Hadoop?
• Spark?
• MLlib bugs for years
• Wrong math won’t fail unit tests
• GIGO
• JSON, weekly microversioning, schema entropy…
• This is why DS efforts are so slow to start w/o initial involvement
• Don’t build the One True Data Platform
• One of our customers had 30 DBs, including a critical out-of-license DB2 box
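"Schema entropy" in JSON feeds can be made visible by counting how many distinct shapes the records actually take. The sketch below assumes newline-delimited JSON where each line is a flat object. The function names and the signature format are illustrative, not part of any particular tool.

```python
import json
from collections import Counter

def schema_signature(record):
    """Return a sorted tuple of (key, type name) pairs for a JSON object."""
    return tuple(sorted((k, type(v).__name__) for k, v in record.items()))

def schema_census(json_lines):
    """Count how many records share each schema signature."""
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[schema_signature(json.loads(line))] += 1
    return counts
```

More than one signature for what is supposed to be a single event type (a renamed key, an id that is sometimes a string) is the kind of drift that parses without error and surfaces much later.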
Pointlessness
Dashboards are not a strategy
“Here’s some data, just tell us what’s interesting…”
“We didn’t think that was interesting, you’re bad at your job.”
Data Must be Treated like a Product
• Build a Data Products Team
• Engineers, PMs, Design. Data Science. Not just analysts.
• KPIs, Goals, Measurability, Backlogs
• Budget
• Freedom to Innovate
• Staff of diverse backgrounds
A Data Platform will touch every part of your org