Building a scalable data strategy with IPTOP
Hugo Bowne-Anderson, @hugobowne
A bit about Hugo
➔ Hugo Bowne-Anderson, data scientist at DataCamp
◆ Undergrad in sciences/humanities (double math major)
◆ PhD in Pure Mathematics (UNSW, Sydney)
◆ Applied math research in cell biology (Yale University, Max Planck Institute)
◆ Python curriculum engineer at DataCamp
◆ Host of DataFramed, the DataCamp podcast
◆ Data & AI evangelist, strategy consultant
Joint work with
Ramnath Vaidyanathan
You can find him at @ramnath_vaidya
Ramnath leads Product Research at
Our Mission
Our mission is to democratize data science education by building the best platform to learn and teach data skills and make data fluency accessible to millions of people and businesses around the world.
Learn by doing
➔ Short videos from expert instructors
➔ In-browser coding
➔ Real-time feedback
300+ Unmatched data science courses
➔ Languages: Python, R, SQL, Git, Shell, Spreadsheets
➔ Topics: Importing & Cleaning, Data Manipulation, Visualization, Probability & Statistics, Machine Learning, and more!
Industry-leading instructors
➔ Learn from the authors of renowned code packages and the organizations that understand data science innovation
Learn by Doing
➔ Scaling your data strategy
➔ Scaling
◆ Infrastructure
◆ People
◆ Tools
◆ Organization
◆ Processes
Today’s topics of discussion
What can data science do?
We can slice data science into 3 components:
1. Descriptive analytics (Business Intelligence)
2. Predictive analytics (Machine Learning)
3. Prescriptive analytics (Decision Science)
Descriptive analytics
Different views for different business strategies
Another way to slice data work
Another telling way to slice data science:
1. Data work to inform decision making
2. Automated actions from data pipelines
3. Human-in-the-loop
POLL: What percentage of your data work is actually used?
1. 0-25%
2. 26-50%
3. 51-75%
4. 76-100%
Definition(s) of scalability
Scalability refers to the ability to take on increased demand without incurring proportional costs.
A scalable data strategy is one that can easily accommodate new projects, employees, techniques, tools, infrastructural layers, and phases of growth, among other things.
Scaling your data strategy
Figure: a chart with axes "How hard it is to do" and "How many people can do it", contrasting "Making the impossible possible" with "Making the possible widespread".
David Robinson, Principal Data Scientist, Heap
Scale your data strategy by scaling IPTOP
Infrastructure
➔ Set up a data lake
➔ Enable data discovery
People
➔ Map out roles and skills
➔ Identify skill gaps
➔ Personalize learning path
Tools
➔ Build tools to encapsulate
➔ Build frameworks to automate
Organization
➔ Embrace a hybrid model
➔ Build flexibility
Processes
➔ Standardize project structure
➔ Embrace version control
➔ Embrace notebooks
IPTOP: Infrastructure, People, Tools, Organization, Processes
Scaling: Infrastructure
Why do we need infrastructure?
Scaling infrastructure at DataCamp
Diagram: a data pipeline with stages labeled Raw Data, Data Lake, Tools, and Insights. Other labels: Campus, Sales, Assessment; Tables, Views; Metabase Visualizations, Dashboards, Knowledge Repo.
Scaling infrastructure at Netflix
Data Infrastructure at Netflix
Scaling infrastructure at Airbnb
Data Infrastructure at Airbnb
Enable data discovery
Amundsen: Lyft’s data discovery and metadata engine
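As a rough illustration of what data discovery involves, the sketch below keeps a small in-memory catalog of table metadata and searches it by keyword. It is not Amundsen's API; the `TableMeta` class, the `search` function, and the example tables are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    """Hypothetical metadata record for one table in the data lake."""
    name: str
    description: str
    owner: str
    tags: list = field(default_factory=list)


def search(catalog, keyword):
    """Return names of tables whose name, description, or tags mention keyword."""
    needle = keyword.lower()
    hits = []
    for table in catalog:
        haystack = " ".join([table.name, table.description, *table.tags]).lower()
        if needle in haystack:
            hits.append(table.name)
    return hits


# Made-up catalog entries, for illustration only.
catalog = [
    TableMeta("sales.subscriptions", "One row per paid subscription", "finance", ["revenue"]),
    TableMeta("campus.course_starts", "One row per course started by a learner", "product", ["engagement"]),
    TableMeta("assessment.scores", "Skill assessment results per learner", "product", ["skills"]),
]

learner_tables = search(catalog, "learner")
revenue_tables = search(catalog, "revenue")
```

A real discovery tool would add ownership, lineage, and usage statistics, but the core idea is the same: make metadata searchable so people can find the tables they need.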
Recap
➔ Scaling infrastructure is key to scaling data work
➔ Developing a principled, modular tech stack is essential for data discovery, online experimentation, machine learning, and more
Scaling: People
Identify roles
Map out skills by role
Identify gaps
Personalize learning paths
DataCamp: Custom Tracks
Support continuous learning
Recap
➔ Identify roles
➔ Map out skills by role
➔ Measure competencies & determine gaps
➔ Personalize learning paths & support continuous learning
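The recap steps can be sketched in a few lines of Python: map target skill levels to roles, compare them with measured competencies, and order the gaps into a learning path. The roles, skills, 0-5 scale, and function names below are assumptions made for illustration, not DataCamp's actual skill matrix.

```python
# Hypothetical target skill levels per role, on a 0-5 scale.
ROLE_SKILLS = {
    "data analyst": {"sql": 4, "visualization": 4, "statistics": 3},
    "data scientist": {"python": 4, "statistics": 4, "machine learning": 3},
}


def skill_gaps(role, measured):
    """Return {skill: shortfall} for skills below the role's target level."""
    gaps = {}
    for skill, target in ROLE_SKILLS[role].items():
        shortfall = target - measured.get(skill, 0)
        if shortfall > 0:
            gaps[skill] = shortfall
    return gaps


def learning_path(role, measured):
    """Order the missing skills by largest gap first (ties broken by name)."""
    gaps = skill_gaps(role, measured)
    return sorted(gaps, key=lambda skill: (-gaps[skill], skill))


# Made-up assessment results for one person.
measured = {"sql": 4, "visualization": 2, "statistics": 2}
gaps = skill_gaps("data analyst", measured)
path = learning_path("data analyst", measured)
```

Here the analyst already meets the SQL target, so the suggested path starts with visualization (gap of 2) and then statistics (gap of 1).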
Scaling: Tools
The data science workflow
Hadley Wickham, Chief Scientist, RStudio
Build tools
➔ datacamp (r/py)
➔ dcmetrics
➔ dcplot
➔ dcdash
➔ dcdocs
➔ dcmodels
Build frameworks
I want to track recurring revenue over the last two years, aggregated by quarter, broken down by segment and geography.
I want to track course completion rates over the last year, aggregated by week, broken down by technology, topic, and track.
Tidymetrics: Metrics in R
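Requests like the two above share one shape: a metric, a time period to aggregate by, and dimensions to break down by. A framework captures that shape once. The sketch below is a plain-Python analogue of the idea, not the tidymetrics API (which is an R package); `summarize_metric`, `quarter_of`, and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def quarter_of(d):
    """Label a date with its calendar quarter, e.g. 2019-Q1."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def summarize_metric(rows, value, period_of, dimensions):
    """Sum `value` for every (period, *dimensions) combination in rows."""
    totals = defaultdict(float)
    for row in rows:
        key = (period_of(row["date"]),) + tuple(row[d] for d in dimensions)
        totals[key] += row[value]
    return dict(totals)


# Made-up revenue records, for illustration only.
rows = [
    {"date": date(2019, 1, 15), "segment": "B2B", "geography": "US", "revenue": 100.0},
    {"date": date(2019, 2, 10), "segment": "B2B", "geography": "US", "revenue": 50.0},
    {"date": date(2019, 4, 2), "segment": "B2C", "geography": "EU", "revenue": 30.0},
]

revenue_by_quarter = summarize_metric(rows, "revenue", quarter_of, ["segment", "geography"])
```

Swapping `quarter_of` for a weekly function and changing the dimension list would cover the second request without writing new aggregation code, which is the point of building a framework rather than a one-off script.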
Airbnb’s framework for online experimentation
Tool building in machine learning
Only a small part of ML systems is the learning code itself. The rest is a vast and complex infrastructure that includes various aspects of data collection and processing. Sculley et al. (Google, Inc.)
Machine Learning workflow
Zipline: feature engineering at Airbnb
Recap
➔ Tools are key to abstracting over common data tasks
➔ Tools may be cool, but frameworks are cooler!
➔ Key for all types of data work, including descriptive analytics and predictive analytics (machine learning)
➔ The point: gains in efficiency for a one-off cost
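The trade-off between a one-off cost and ongoing efficiency gains can be made concrete with a break-even calculation. This is a minimal sketch with made-up numbers; the function name and the hour estimates are assumptions, not figures from the talk.

```python
import math


def break_even_uses(build_hours, manual_hours_per_use, tool_hours_per_use):
    """Number of uses after which building the tool has paid for itself."""
    saved_per_use = manual_hours_per_use - tool_hours_per_use
    if saved_per_use <= 0:
        raise ValueError("the tool must be faster than the manual process")
    return math.ceil(build_hours / saved_per_use)


# Made-up numbers: 40 hours to build, 2 hours per manual report,
# 0.25 hours per report once the tool exists.
uses_needed = break_even_uses(40, 2.0, 0.25)
```

With these numbers each use saves 1.75 hours, so the tool pays for itself after 23 uses; every use after that is a net gain.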
Scaling: Organization
Data team structure: centralized or decentralized?
Diagram labels: Marketing, Finance, Product, Engineering, Data Science
Data team structure: decentralized?
Diagram labels: Marketing, Finance, Product, Engineering
Pros
➔ Each team has a dedicated DS.
➔ Clear alignment due to common roadmap for the team.
➔ Data science has a more natural “seat at the table”.
➔ Fewer dependencies across teams.
Cons
➔ Harder to move DS resources between teams to handle load.
➔ Manager of the team may not have domain knowledge.
➔ Harder for DS to collaborate.
➔ Harder for DS to drive longer-term projects, with the risk of turning into a support service.
Data team structure: centralized
Diagram labels: Data Science, Marketing, Finance, Product, Engineering
Pros
➔ Allows DS to function as a center of excellence.
➔ Promotes more collaboration and better knowledge sharing.
➔ DS manager has domain knowledge.
➔ Easier to move resources to meet load.
➔ Easier to advocate for consistent technology stack and better tooling.
Cons
➔ Complicates the coordination between DS and their stakeholders.
➔ Risk of data science work not being aligned with product.
➔ DS is an extra function for the company to support.
Data team structure: hybrid
Diagram labels: Marketing, Finance, Product, Engineering, Data Science
Pros
➔ DS can function as a center of excellence.
➔ DS can drive common tech stack, tooling, frameworks, and standardization.
➔ DS can collaborate and align on organizational goals.
➔ Better alignment between DS and business units.
Cons
➔ Risk of mismatched expectations between DS leadership and business unit leadership.
➔ Everyone has at least two teams.
Recap
➔ Centralized, decentralized, and hybrid models for data teams
➔ Pros and cons of each
Scaling: Processes
1. Define project lifecycle
Microsoft Team Data Science Process
2. Standardize project structure
Project Template
Cookiecutter Data Science
3. Embrace notebooks
JupyterLab is ready for users
4. Embrace version control
5. Adopt style guides
The Tidyverse style guide, Hadley Wickham
6. Other processes to consider
➔ Code review
➔ Pair programming
➔ Data testing
➔ “Data parties”
➔ Incorporating data work into the decision function
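Of the processes listed above, data testing is the one that translates most directly into code. A minimal sketch, assuming rows arrive as dictionaries with hypothetical `user_id` and `completion_rate` fields: check that IDs are present and unique and that rates fall between 0 and 1.

```python
def check_rows(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Every row needs an ID, and IDs must not repeat.
        if row.get("user_id") is None:
            problems.append(f"row {i}: missing user_id")
        elif row["user_id"] in seen_ids:
            problems.append(f"row {i}: duplicate user_id {row['user_id']}")
        else:
            seen_ids.add(row["user_id"])
        # Rates are proportions, so they must lie in [0, 1].
        rate = row.get("completion_rate")
        if rate is None or not 0.0 <= rate <= 1.0:
            problems.append(f"row {i}: completion_rate out of range: {rate}")
    return problems


# Made-up rows with two deliberate problems.
rows = [
    {"user_id": 1, "completion_rate": 0.8},
    {"user_id": 1, "completion_rate": 0.5},
    {"user_id": 2, "completion_rate": 1.7},
]
problems = check_rows(rows)
```

Running checks like these at the start of a pipeline catches bad inputs before they reach dashboards or models.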
Recap
➔ Define project lifecycle
➔ Standardize project structure
➔ Embrace notebooks & version control
➔ Many more things!
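Standardizing project structure can be as simple as generating the same folders for every new project. The sketch below does this with the standard library only. The folder list is a hypothetical layout loosely inspired by common data science project templates, not the exact structure of any particular template, and `scaffold_project` is an illustrative name.

```python
import tempfile
from pathlib import Path

# Hypothetical layout; adjust to whatever your team agrees on.
PROJECT_DIRS = ["data/raw", "data/processed", "notebooks", "src", "reports"]


def scaffold_project(root, name):
    """Create the standard folders plus a README under root/name."""
    project = Path(root) / name
    for sub in PROJECT_DIRS:
        (project / sub).mkdir(parents=True, exist_ok=True)
    readme = project / "README.md"
    if not readme.exists():
        readme.write_text(f"# {name}\n")
    return project


# Build an example project in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    project = scaffold_project(tmp, "churn-analysis")
    created = sorted(
        p.relative_to(project).as_posix() for p in project.rglob("*") if p.is_dir()
    )
```

Because every project starts from the same skeleton, a newcomer knows where to find raw data, notebooks, and source code without asking.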
Scale data strategy by scaling IPTOP
Infrastructure
➔ Set up a data lake
➔ Enable data discovery
People
➔ Map out roles and skills
➔ Identify skill gaps
➔ Personalize learning path
Tools
➔ Build tools to encapsulate
➔ Build frameworks to automate
Organization
➔ Embrace a hybrid model
➔ Build flexibility
Processes
➔ Standardize project structure
➔ Embrace version control
➔ Embrace notebooks
IPTOP: Infrastructure, People, Tools, Organization, Processes
What’s next?
➔ April 23 (the third Thursday of the month)
DataCamp’s online conference
Thank you!
Hugo Bowne-Anderson, Data Scientist, @hugobowne