IPTOP Building a scalable data strategy with Hugo Bowne ... · Applied math research in cell biology (Yale University, Max Planck Institute) Python curriculum engineer at DataCamp

Building a scalable data strategy with IPTOP

Hugo Bowne-Anderson, @hugobowne

https://twitter.com/hugobowne


A bit about Hugo

➔ Hugo Bowne-Anderson, data scientist at DataCamp

◆ Undergrad in sciences/humanities (double math major)

◆ PhD in Pure Mathematics (UNSW, Sydney)

◆ Applied math research in cell biology (Yale University, Max Planck Institute)

◆ Python curriculum engineer at DataCamp

◆ Host of DataFramed, the DataCamp podcast

◆ Data & AI evangelist, strategy consultant

https://www.datacamp.com/community/podcast

Joint work with

Ramnath Vaidyanathan (you can find him at @ramnath_vaidya)

Ramnath leads Product Research at


Our Mission

Our mission is to democratize data science education by building the best platform to learn and teach data skills and make data fluency accessible to millions of people and businesses around the world.

Learn by doing

➔ Short videos from expert instructors

➔ In-browser coding

➔ Real-time feedback

300+ Unmatched data science courses

➔ Languages: Python, R, SQL, Git, Shell, Spreadsheets

➔ Topics: Importing & Cleaning, Data Manipulation, Visualization, Probability & Statistics, Machine Learning, and more!

Industry-leading instructors

➔ Learn from the authors of renowned code packages and the organizations that understand data science innovation


https://www.datacamp.com/instructors/

https://www.datacamp.com/learn-python-with-anaconda

Today’s topics of discussion

➔ Scaling your data strategy

➔ Scaling

◆ Infrastructure

◆ People

◆ Tools

◆ Organization

◆ Processes




What can data science do?

We can slice data science into 3 components:

1. Descriptive analytics (Business Intelligence)

2. Predictive analytics (Machine Learning)

3. Prescriptive analytics (Decision Science)

Descriptive analytics

Different views for different business strategies



Another way to slice data work

Another telling way to slice data science:

1. Data work to inform decision making

2. Automated actions from data pipelines

3. Human-in-the-loop

POLL: What percentage of your data work is actually used?

1. 0-25%

2. 26-50%

3. 51-75%

4. 76-100%

Definition(s) of scalability

Scalability refers to the ability to take on increased demand without incurring proportional costs.


A scalable data strategy is one that can easily accommodate new projects, employees, techniques, phases of growth, tools, infrastructural layers, among other things.

Scaling your data strategy

[Figure: chart with axes "How hard it is to do" and "How many people can do it"; labels: "Making the impossible possible" and "Making the possible widespread". Credit: David Robinson, Principal Data Scientist, Heap]

Scale your data strategy by scaling IPTOP

Infrastructure
➔ Set up a data lake
➔ Enable data discovery

People
➔ Map out roles and skills
➔ Identify skill gaps
➔ Personalize learning paths

Tools
➔ Build tools to encapsulate
➔ Build frameworks to automate

Organization
➔ Embrace a hybrid model
➔ Build flexibility

Processes
➔ Standardize project structure
➔ Embrace version control
➔ Embrace notebooks


➔ Scaling

◆ Infrastructure


Why do we need infrastructure?


Scaling infrastructure at DataCamp

[Diagram. Stages: Raw Data, Data Pipeline, Data Lake, Tools, Insights. Other labels: Campus, Sales, Assessment, Tables, Views, Metabase Visualizations, Dashboards, Knowledge Repo]

Scaling infrastructure at Netflix

Data Infrastructure at Netflix

https://www.tableau.com/about/blog/2017/1/tableau-cloud-netflix-original-64442

Scaling infrastructure at Airbnb

Data Infrastructure at Airbnb

https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c



Amundsen: Lyft’s data discovery and metadata engine

https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9

Recap

➔ Scaling infrastructure is key to scaling data work

➔ Developing a principled, modular tech stack is essential

➔ For data discovery, online experimentation, machine learning, and more.
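To make the data discovery point above concrete, here is a minimal sketch of a searchable table catalog in Python. It is not how Amundsen or any of the stacks above work; the catalog entries, field names, and `search_catalog` function are all hypothetical, and a real catalog would be populated from the warehouse's own metadata.

```python
# Hypothetical catalog entries (assumed for illustration only).
CATALOG = [
    {
        "table": "sales.subscriptions",
        "owner": "finance",
        "description": "One row per paid subscription",
    },
    {
        "table": "campus.course_completions",
        "owner": "product",
        "description": "Course completion events per learner",
    },
]


def search_catalog(catalog, query):
    """Return table names whose name or description mentions every query word."""
    words = query.lower().split()
    hits = []
    for entry in catalog:
        text = f"{entry['table']} {entry['description']}".lower()
        if all(word in text for word in words):
            hits.append(entry["table"])
    return hits


if __name__ == "__main__":
    print(search_catalog(CATALOG, "course completion"))
```

Even a lookup this small shows the idea: analysts find tables by what they contain, without needing to know who built them.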


➔ Scaling

◆ People


Identify roles

Map out skills by role

Measure competencies


DataCamp Signal: Data Science Assessments

https://www.datacamp.com/signal

Identify gaps

Personalize learning paths

DataCamp: Custom Tracks

https://support.datacamp.com/hc/en-us/articles/360009185093-Custom-Tracks-for-Administrators-DataCamp-Enterprise

Support continuous learning


Recap

➔ Identify roles

➔ Map out skills by role

➔ Measure competencies & determine gaps

➔ Personalize learning paths & support continuous learning
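The steps in this recap can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the roles, skills, 0-5 scoring scale, and function names are hypothetical and are not taken from DataCamp Signal or Custom Tracks.

```python
# Hypothetical skill targets per role, on an assumed 0-5 scale.
REQUIRED_SKILLS = {
    "data analyst": {"sql": 4, "visualization": 4, "statistics": 3},
    "data scientist": {"python": 4, "statistics": 4, "machine learning": 3},
}


def skill_gaps(role, measured):
    """Return {skill: shortfall} for every skill below the role's target."""
    gaps = {}
    for skill, target in REQUIRED_SKILLS[role].items():
        shortfall = target - measured.get(skill, 0)
        if shortfall > 0:
            gaps[skill] = shortfall
    return gaps


def learning_path(role, measured):
    """Order the missing skills from largest to smallest shortfall."""
    gaps = skill_gaps(role, measured)
    return sorted(gaps, key=lambda skill: (-gaps[skill], skill))


if __name__ == "__main__":
    scores = {"python": 4, "statistics": 2}  # assumed assessment results
    print(skill_gaps("data scientist", scores))
    print(learning_path("data scientist", scores))
```

Re-running the same comparison after each assessment round is one simple way to support continuous learning.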


➔ Scaling

◆ Tools


The data science workflow

[Figure credit: Hadley Wickham, Chief Scientist, RStudio]

Build tools

[Figure credit: Hadley Wickham, Chief Scientist, RStudio. Packages shown: datacamp (R/Python), dcmetrics, dcplot, dcdash, dcdocs, dcmodels]

Build frameworks

I want to track recurring revenue over the last two years, aggregated by quarter, broken down by segment and geography.

I want to track course completion rates over the last year, aggregated by week, broken down by technology, topic, and track.

Tidymetrics: Metrics in R
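The two requests above share one shape: a measure, a time window, a period to aggregate by, and dimensions to break down by. A framework can capture that shape once. The sketch below is a minimal Python illustration, not the tidymetrics API; every name and the sample rows are assumptions. It only sums a value, so a rate such as course completion would need a numerator and a denominator computed the same way.

```python
from collections import defaultdict
from datetime import date


def quarter(d):
    """Label a date with its calendar quarter, e.g. '2019-Q3'."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def iso_week(d):
    """Label a date with its ISO year and week, e.g. '2019-W27'."""
    year, week, _ = d.isocalendar()
    return f"{year}-W{week:02d}"


PERIODS = {"quarter": quarter, "week": iso_week}


def compute_metric(rows, value, period, dimensions, start, end):
    """Sum `value` per (period, *dimensions) for rows dated in [start, end]."""
    label = PERIODS[period]
    totals = defaultdict(float)
    for row in rows:
        if start <= row["date"] <= end:
            key = (label(row["date"]),) + tuple(row[d] for d in dimensions)
            totals[key] += row[value]
    return dict(totals)


# Made-up rows, purely for illustration.
rows = [
    {"date": date(2019, 1, 15), "segment": "b2b", "geography": "EU", "revenue": 100.0},
    {"date": date(2019, 2, 20), "segment": "b2b", "geography": "EU", "revenue": 50.0},
    {"date": date(2019, 5, 3), "segment": "b2c", "geography": "US", "revenue": 30.0},
]

revenue_by_quarter = compute_metric(
    rows,
    value="revenue",
    period="quarter",
    dimensions=["segment", "geography"],
    start=date(2018, 7, 1),
    end=date(2020, 6, 30),
)
```

Once the shape is fixed, a new request becomes a change of arguments instead of a new analysis.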

Airbnb’s framework for online experimentation

Tool building in machine learning

Only a small part of ML systems is the learning code itself. The rest is a vast and complex infrastructure that includes various aspects of data collection and processing. Sculley et al. (Google, Inc.)

Machine Learning workflow

Zipline: feature engineering at Airbnb

Recap

➔ Tools are key to abstracting over common data tasks

➔ Tools may be cool, but frameworks are cooler!

➔ Key for all types of data work, including descriptive analytics and predictive analytics (machine learning)

➔ The point: gains in efficiency for a one-off cost
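The one-off cost argument can be checked with a quick break-even calculation. The sketch below is illustrative only: the function name and all hour figures are assumptions, not numbers from the talk.

```python
import math


def break_even_uses(build_hours, manual_hours, tool_hours):
    """Smallest number of uses at which the tool's total cost no longer
    exceeds the cost of doing the task by hand every time."""
    saved_per_use = manual_hours - tool_hours
    if saved_per_use <= 0:
        raise ValueError("the tool must be faster than the manual approach")
    return math.ceil(build_hours / saved_per_use)


# Assumed figures: 40 hours to build the tool, 3 hours per manual analysis,
# 0.5 hours per analysis once the tool exists.
uses = break_even_uses(build_hours=40, manual_hours=3, tool_hours=0.5)
print(uses)
```

Every use beyond the break-even point is a net saving, which is why tools for recurring tasks pay off fastest.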


➔ Scaling

◆ Organization


Data team structure: centralized or decentralized?

[Diagram: business units Marketing, Finance, Product, and Engineering, shown with and without a separate Data Science team]

Data team structure: decentralized

[Diagram: Marketing, Finance, Product, Engineering]

Pros
➔ Each team has a dedicated DS.
➔ Clear alignment due to common roadmap for the team.
➔ Data science has a more natural “seat at the table”.
➔ Fewer dependencies across teams.

Cons
➔ Harder to move DS resources between teams to handle load.
➔ Manager of the team may not have domain knowledge.
➔ Harder for DS to collaborate.
➔ Harder for DS to drive longer-term projects, with the risk of turning into a support service.

Data team structure: centralized

[Diagram: Data Science alongside Marketing, Finance, Product, Engineering]

Pros
➔ Allows DS to function as a center of excellence.
➔ Promotes more collaboration and better knowledge sharing.
➔ DS manager has domain knowledge.
➔ Easier to move resources to meet load.
➔ Easier to advocate for consistent technology stack and better tooling.

Cons
➔ Complicates the coordination between DS and their stakeholders.
➔ Risk of data science work not being aligned with product.
➔ DS is an extra function for the company to support.

Data team structure: hybrid

[Diagram: Data Science with Marketing, Finance, Product, Engineering]

Pros
➔ DS can function as a center of excellence.
➔ DS can drive common tech stack, tooling, frameworks, and standardization.
➔ DS can collaborate and align on organizational goals.
➔ Better alignment between DS and business units.

Cons
➔ Risk of mismatched expectations between DS leadership and business unit leadership.
➔ Everyone has at least two teams.

Recap

➔ Centralized, decentralized, and hybrid models for data teams

➔ Pros and cons of each


➔ Scaling

◆ Processes


1. Define project lifecycle

Microsoft Team Data Science Process

https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview

2. Standardize project structure

Project Template

Cookie-Cutter Data Science
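A standard structure is easiest to keep when creating it is one command. The sketch below scaffolds a project skeleton in Python. The folder names are a hypothetical layout in the spirit of a project template, not the exact Cookiecutter Data Science tree, and `scaffold_project` is an illustrative name.

```python
from pathlib import Path
import tempfile

# Hypothetical standard layout; adapt the folder names to your team's template.
PROJECT_DIRS = ["data/raw", "data/processed", "notebooks", "src", "reports"]


def scaffold_project(root, name):
    """Create the standard folders and a README under root/name."""
    project = Path(root) / name
    for sub in PROJECT_DIRS:
        (project / sub).mkdir(parents=True, exist_ok=True)
    readme = project / "README.md"
    if not readme.exists():
        # Never overwrite a README someone has already edited.
        readme.write_text(f"# {name}\n\nDescribe the question, data, and owner here.\n")
    return project


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = scaffold_project(tmp, "churn-analysis")
        print(sorted(p.relative_to(project).as_posix() for p in project.rglob("*")))
```

Because the function is safe to run twice, it can also be used to bring older projects in line with the template.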

3. Embrace notebooks

JupyterLab is ready for users

https://blog.jupyter.org/jupyterlab-is-ready-for-users-5a6f039b8906

3. Embrace notebooks

R Markdown from RStudio

https://rmarkdown.rstudio.com/lesson-6.html

4. Embrace version control

5. Adopt style guides

The Tidyverse style guide, Hadley Wickham

https://style.tidyverse.org/index.html



6. Other processes to consider

➔ Code review

➔ Pair programming

➔ Data testing

➔ “Data parties”

➔ Incorporating data work into the decision function
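Of the processes listed above, data testing is the most direct to show in code. Below is a minimal sketch in Python, assuming rows arrive as dictionaries; the `check_rows` function, column names, and sample rows are hypothetical, and a team would more likely use a dedicated validation library.

```python
def check_rows(rows, required, unique_key):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Every required column must be present and not None.
        for column in required:
            if row.get(column) is None:
                problems.append(f"row {i}: missing {column}")
        # The key column must not repeat across rows.
        key = row.get(unique_key)
        if key in seen:
            problems.append(f"row {i}: duplicate {unique_key}={key}")
        seen.add(key)
    return problems


# Made-up rows, purely for illustration.
rows = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": None},
    {"user_id": 1, "country": "DE"},
]
problems = check_rows(rows, required=["user_id", "country"], unique_key="user_id")
print(problems)
```

Running checks like these at the start of a pipeline catches bad inputs before they reach a dashboard or a model.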

Recap

➔ Define project lifecycle

➔ Standardize project structure

➔ Embrace notebooks & version control

➔ Many more things!

Scale your data strategy by scaling IPTOP

Infrastructure
➔ Set up a data lake

People
➔ Map out roles and skills
➔ Identify skill gaps
➔ Personalize learning paths

Tools
➔ Build tools to encapsulate
➔ Build frameworks to automate

Organization
➔ Embrace a hybrid model
➔ Build flexibility

Processes
➔ Standardize project structure
➔ Embrace version control
➔ Embrace notebooks

What’s next?


➔ April 23 (the third Thursday of the month)

DataCamp’s online conference

Thank you!

Hugo Bowne-Anderson, Data Scientist, @hugobowne
