How to Build a Data-Driven Company: From Infrastructure to Insights

#datastack

Shaun

What you’re going to learn

1. How top engineering organizations are building their data infrastructure
2. The 7 core challenges of data integration
3. Why companies like Asana, Buffer, and SeatGeek choose Redshift for their analytics warehouse

...and much more!

Shaun

Data Infrastructure: Then and Now

Dillon

The traditional approach: ETL

Dillon

[Diagram: EVENT DATA and TRANSACTIONAL DATA flow through the ETL TEAM, the EDW TEAM, and the BI TEAM before reaching the END USER. Stage labels: ETL (heavy transformation), OLAP / silos with SUMMARY tables, restricted Q&A.]

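How companies are doing it today: ELT

Dillon

[Diagram: data from the database and 3rd party data is extracted and loaded, then transformed at query time by a modeling layer that feeds analytics, visualization, and exploration. Stages: Extract, Load, Transform (and Explore!).]

Example modeling-layer definition from the slide:

- name: first_purchasers
  type: single_value
  base_view: orders
  measures: [orders.customer.all]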

Benefits of this approach

1. Redshift is performant enough to handle most transformations
2. Users prefer performing transformations in a language they already use (SQL) or with a UI
3. Transformations are much simpler and more transparent
4. Performing transformations alongside raw data is great for auditability
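
To make the ELT idea concrete, here is a minimal sketch in Python. It uses the built-in sqlite3 module as a stand-in for an analytics warehouse such as Redshift, and the table, view, and column names (raw_orders, customer_revenue, amount_cents) are hypothetical. It is an illustration of "load raw, transform in SQL at query time", not a description of any vendor's implementation.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows land in the warehouse untransformed.
conn.execute(
    "CREATE TABLE raw_orders ("
    "order_id INTEGER, customer_id INTEGER, amount_cents INTEGER, status TEXT)"
)
raw_rows = [
    (1, 10, 2500, "complete"),
    (2, 10, 1000, "refunded"),
    (3, 11, 4000, "complete"),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_rows)

# Transform: a plain SQL view, evaluated at query time, living next to
# the raw data it is derived from.
conn.execute(
    """
    CREATE VIEW customer_revenue AS
    SELECT customer_id, SUM(amount_cents) / 100.0 AS revenue_dollars
    FROM raw_orders
    WHERE status = 'complete'
    GROUP BY customer_id
    """
)


def revenue_by_customer(connection):
    """Query the transformed view and return {customer_id: revenue}."""
    rows = connection.execute(
        "SELECT customer_id, revenue_dollars "
        "FROM customer_revenue ORDER BY customer_id"
    ).fetchall()
    return dict(rows)


revenue = revenue_by_customer(conn)

# The raw rows are still there, so any transformed number can be audited.
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Because the transformation is only a view, changing the business rule (for example, how refunds are treated) means editing one SQL statement rather than re-running an upstream ETL job.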

Dillon

Data infrastructure has geek cred

Shaun

What the stack looks like

Shaun

● Data Integration
● Data Warehouse
● BI/Analytics

Data Integration

Shaun

Why consolidation matters

Common data sources for internal analytics

Shaun

Quick poll

Shaun

Which five data sources are your top priority to integrate or keep integrated?

● production databases
● events
● error logs
● billing
● email marketing
● CRM
● advertising
● ERP
● A/B testing
● support

“A year ago, we were facing a lot of stability problems with our data processing. When there was a major shift in a graph, people immediately questioned the data integrity. It was hard to distinguish interesting insights from bugs. Data science is already an art so you need the infrastructure to give you trustworthy answers to the questions you ask. 99% correctness is not good enough. And on the data infrastructure team, we were spending a lot of time churning on fighting urgent fires, and that prevented us from making much long-term progress. It was painful.”

- Marco Gallotta, Asana, How to Build Stable, Accessible Data Infrastructure at a Startup

“Our story would end here if real-time processing were perfect. But it’s not: some events can come in days late, some time ranges need to be re-processed after initial ingestion due to code changes or data revisions, various components of the real-time pipeline can fail, and so on.”

- Gian Merlino, MetaMarkets, Building a Data Pipeline That Handles Billions of Events in Real-Time
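
One common way to cope with late events and re-processing is to make the batch job idempotent: recompute a whole time window from the raw event log and replace the stored result, rather than incrementing it. The sketch below illustrates that idea only; it is not the pipeline described in the quoted post, and the function names and event fields (reprocess_window, event_time) are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def reprocess_window(aggregates, events, start, end):
    """Recompute hourly event counts for [start, end) from the raw log.

    Buckets inside the window are replaced, never incremented, so running
    the job twice over the same window gives the same result.
    """
    fresh = defaultdict(int)
    for event in events:
        if start <= event["event_time"] < end:
            fresh[hour_bucket(event["event_time"])] += 1
    # Drop stale buckets in the window, then write the recomputed ones.
    for bucket in [b for b in aggregates if start <= b < end]:
        del aggregates[bucket]
    aggregates.update(fresh)
    return aggregates


noon = datetime(2015, 6, 1, 12)
window_end = noon + timedelta(hours=1)
events = [
    {"id": "a", "event_time": noon + timedelta(minutes=5)},
    {"id": "b", "event_time": noon + timedelta(minutes=50)},
]
aggregates = {}
reprocess_window(aggregates, events, noon, window_end)
first_pass = dict(aggregates)

# An event for the noon hour shows up late; re-running the window picks it
# up, and running it a second time changes nothing.
events.append({"id": "c", "event_time": noon + timedelta(minutes=30)})
reprocess_window(aggregates, events, noon, window_end)
reprocess_window(aggregates, events, noon, window_end)
```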

7 core challenges of data integration

Connections: Every API is a unique and special snowflake

Accuracy: Ordering data on a distributed system

Latency: Large object data stores (Amazon S3, Redshift) are optimized for batches, not streams

Scale: Data will grow exponentially as your company grows

Flexibility: You’re interacting with systems you don’t control

Monitoring: Notifications for expired credentials, errors, and disruptions

Maintenance: Justifying investment in ongoing maintenance/improvement
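
As an illustration of the Connections and Monitoring challenges above, here is a minimal sketch of a sync runner that retries transient failures with exponential backoff and sends a notification when credentials expire or retries run out. The exception classes, the run_sync function, and its parameters are all hypothetical; a real connector would map each API's own error codes onto categories like these.

```python
import time


class ExpiredCredentials(Exception):
    """Hypothetical error raised when a source rejects our token."""


class TransientError(Exception):
    """Hypothetical error for timeouts, rate limits, and similar blips."""


def run_sync(fetch, notify, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying transient errors with exponential backoff.

    notify(message) is called when a human needs to act: credentials have
    expired, or every retry has been used up.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ExpiredCredentials as exc:
            # Retrying cannot fix this; alert immediately.
            notify(f"credentials expired: {exc}")
            raise
        except TransientError as exc:
            if attempt == max_attempts:
                notify(f"sync failed after {attempt} attempts: {exc}")
                raise
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a fake source that is rate limited twice, then succeeds.
alerts, delays = [], []
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return ["row-1", "row-2"]


# Passing delays.append as sleep records the waits instead of sleeping.
rows = run_sync(flaky_fetch, alerts.append, sleep=delays.append)
```

Injecting the sleep and notify functions keeps the sketch testable and leaves the choice of alerting channel (email, chat, paging) open.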

Shaun

Or...try Pipeline

Shaun

[Diagram: source categories shown: Ad Platforms, Customer Support, Web Data, Marketing Automation, CRM, Payments, Ecommerce]

Warehousing Infrastructure

Shaun

Analytics warehouse

Shaun

Redshift is the most common analytics warehouse.

Chosen by: Asana, Braintree, Looker, SeatGeek, VigLink, Buffer

Why Redshift is awesome

Shaun

Airbnb experiment

                                           Hive              Redshift
Test 1: 3 billion rows of data             28 minutes        <6 minutes
Test 2: two joins with millions of rows    182 seconds       8 seconds
Cost                                       $1.29/hour/node   $0.85/hour/node
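
The ratios implied by these figures can be worked out directly. The short sketch below uses only the numbers on the slide; the variable names are ours. The slide gives "<6 minutes" for Redshift in Test 1, so that speedup is a lower bound, and because the slide does not say how many nodes each cluster used, total cost cannot be derived from the per-node prices.

```python
# Figures as shown on the slide.
hive_test1_minutes = 28
redshift_test1_minutes_max = 6  # slide says "<6 minutes": an upper bound

hive_test2_seconds = 182
redshift_test2_seconds = 8

hive_cost_per_node_hour = 1.29
redshift_cost_per_node_hour = 0.85

# Test 1: Redshift is at least this many times faster.
test1_min_speedup = hive_test1_minutes / redshift_test1_minutes_max

# Test 2: both timings are exact, so this is a point estimate.
test2_speedup = hive_test2_seconds / redshift_test2_seconds

# Per-node hourly price of Redshift relative to Hive.
cost_ratio = redshift_cost_per_node_hour / hive_cost_per_node_hour

print(f"Test 1 speedup: more than {test1_min_speedup:.2f}x")
print(f"Test 2 speedup: {test2_speedup:.2f}x")
print(f"Redshift per-node price: {cost_ratio:.0%} of Hive's")
```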

Shaun

Periscope research

Shaun

DiamondStream’s dashboard query performance

Shaun

Business Intelligence & Analytics

Dillon

A broken model

Dillon

● Feedback loop is broken
● Disparate reporting
● Non-unified decision making
● Versioning
● Reusability is lost

[Diagram: separate teams shown: Marketing, Finance, AM]

Constraints of SQL

Dillon

SQL is versatile, but it shares a weakness with “write-only” languages such as Perl:

● Can write but not read
● Promotes one-off, piecemeal analysis
● Disparate interpretation

The critical multiplier: modeling

Dillon

Any SQL Data Warehouse

Modeling Layer

What’s our most successful marketing campaign?

How does our Q4 pipeline look?

Who are our healthiest / happiest customers?
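
To show what a modeling layer buys you, here is a toy sketch in Python: dimensions and measures are defined once, and every question is compiled to SQL from those shared definitions. This is not LookML or any product's API. The MODEL structure, the build_query function, and the table and column names are hypothetical.

```python
# Hypothetical model: each business term is defined exactly once.
MODEL = {
    "table": "orders",
    "dimensions": {
        "campaign": "campaign_name",
        "quarter": "order_quarter",
    },
    "measures": {
        "revenue": "SUM(amount)",
        "order_count": "COUNT(*)",
    },
}


def build_query(model, dimensions, measures):
    """Compile a question (dimension and measure names) into SQL.

    Unknown names raise KeyError, so nobody can quietly invent their own
    definition of "revenue".
    """
    select_parts = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    select_parts += [f"{model['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select_parts)} FROM {model['table']}"
    if dimensions:
        group_cols = ", ".join(model["dimensions"][d] for d in dimensions)
        sql += f" GROUP BY {group_cols}"
    return sql


# "What's our most successful marketing campaign?" starts from this query.
query = build_query(MODEL, ["campaign"], ["revenue"])
print(query)
```

Two analysts asking about revenue by campaign and revenue by quarter get SQL generated from the same measure definition, which is the "uniform definitions" and reusability point in miniature.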

Interactive, collaborative analytics

Dillon

● Data access
● Uniform definitions
● A shared view
● Collaboration
● Analytical speed

What You Can Do

Dillon

Integrated data + analytics tools

Dillon

[Diagram: timeline labeled Week 1 and Week 2-3, showing RJMetrics Pipeline and Looker Blocks]

Looker blocks: sales & marketing

Looker blocks: event analytics

Thank you!
