How to Build a Data-Driven Company: From Infrastructure to Insights

Page 1: How to Build a Data-Driven Company: From Infrastructure to Insights

#datastack

Shaun

Page 2: How to Build a Data-Driven Company: From Infrastructure to Insights

What you’re going to learn

1. How top engineering organizations are building their data infrastructure
2. The 7 core challenges of data integration
3. Why companies like Asana, Buffer, and SeatGeek choose Redshift for their analytics warehouse

...and much more!

Shaun

Page 3: How to Build a Data-Driven Company: From Infrastructure to Insights

Data Infrastructure: Then and Now

Dillon

Page 4: How to Build a Data-Driven Company: From Infrastructure to Insights

The traditional approach: ETL

Dillon

[Diagram. Team labels: ETL TEAM, EDW TEAM, BI TEAM, END USER. Data labels: EVENT DATA, TRANSACTIONAL DATA, SUMMARY. Stage labels: ELT - Heavy Transformation, OLAP / Silos, Restricted Q&A.]

Page 5: How to Build a Data-Driven Company: From Infrastructure to Insights

How companies are doing it today: ELT

Dillon

[Diagram. Labels: Database, 3rd Party Data, Extract, Load, Modeling Layer, Transform at Query, Analytics, Viz & Exploration, Transform (and Explore!).]

Modeling layer snippet shown on the slide:

- name: first_purchasers
  type: single_value
  base_view: orders
  measures: [orders.customer.all]

Page 6: How to Build a Data-Driven Company: From Infrastructure to Insights

Benefits of this approach

1. Redshift is performant enough to handle most transformations
2. Users prefer performing transformations in a language they already use (SQL) or with a UI
3. Transformations are much simpler and more transparent
4. Performing transformations alongside raw data is great for auditability

Dillon
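
To make the load-first, transform-later idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the warehouse; Redshift itself is not involved, and the table name, view name, and sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical raw order rows, as an extract step might deliver them.
raw_orders = [
    (1, "alice", "2015-01-03", 2500),
    (2, "bob", "2015-01-04", 1200),
    (3, "alice", "2015-02-10", 4000),
]

conn = sqlite3.connect(":memory:")  # stand-in for the analytics warehouse

# Load: copy the raw rows into the warehouse untouched.
conn.execute(
    "CREATE TABLE raw_orders "
    "(id INTEGER, customer TEXT, ordered_on TEXT, amount_cents INTEGER)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# Transform: a SQL view that lives next to the raw data and is
# evaluated at query time, so the raw rows stay available for audits.
conn.execute(
    """
    CREATE VIEW first_purchases AS
    SELECT customer, MIN(ordered_on) AS first_ordered_on
    FROM raw_orders
    GROUP BY customer
    """
)

first_purchases = conn.execute(
    "SELECT customer, first_ordered_on FROM first_purchases ORDER BY customer"
).fetchall()
raw_row_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(first_purchases, raw_row_count)
```

Because the view is only a definition, changing the transformation means editing one SQL statement; nothing has to be re-extracted or re-loaded.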

Page 7: How to Build a Data-Driven Company: From Infrastructure to Insights

Data infrastructure has geek cred

Shaun

Page 11: How to Build a Data-Driven Company: From Infrastructure to Insights

What the stack looks like

Shaun

● Data Integration
● Data Warehouse
● BI/Analytics

Page 12: How to Build a Data-Driven Company: From Infrastructure to Insights

Data Integration

Shaun

Page 13: How to Build a Data-Driven Company: From Infrastructure to Insights

Why consolidation matters

Page 14: How to Build a Data-Driven Company: From Infrastructure to Insights

Common data sources for internal analytics

Shaun

Page 15: How to Build a Data-Driven Company: From Infrastructure to Insights

Quick poll

Shaun

Which five data sources are your top priority to integrate and keep integrated?

● production databases
● events
● error logs
● billing
● email marketing
● CRM
● advertising
● ERP
● A/B testing
● support

Page 16: How to Build a Data-Driven Company: From Infrastructure to Insights

“A year ago, we were facing a lot of stability problems with our data processing. When there was a major shift in a graph, people immediately questioned the data integrity. It was hard to distinguish interesting insights from bugs. Data science is already an art so you need the infrastructure to give you trustworthy answers to the questions you ask. 99% correctness is not good enough. And on the data infrastructure team, we were spending a lot of time churning on fighting urgent fires, and that prevented us from making much long-term progress. It was painful.”

- Marco Gallotta, Asana, How to Build Stable, Accessible Data Infrastructure at a Startup

Page 17: How to Build a Data-Driven Company: From Infrastructure to Insights

“Our story would end here if real-time processing were perfect. But it’s not: some events can come in days late, some time ranges need to be re-processed after initial ingestion due to code changes or data revisions, various components of the real-time pipeline can fail, and so on.”

- Gian Merlino, MetaMarkets, Building a Data Pipeline That Handles Billions of Events in Real-Time
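
One common way to cope with late and revised events is to make loads idempotent, so a time range can be re-processed without double counting. The sketch below is an illustration only, not the pipeline described in the quote; the function names, the `event_id` key, and the sample events are assumptions.

```python
from typing import Dict, List


def upsert_events(store: Dict[str, dict], events: List[dict]) -> None:
    """Insert or overwrite events by id, so replays never double count."""
    for event in events:
        store[event["event_id"]] = event


def daily_totals(store: Dict[str, dict]) -> Dict[str, int]:
    """Sum event values per calendar day (first 10 chars of the timestamp)."""
    totals: Dict[str, int] = {}
    for event in store.values():
        day = event["occurred_at"][:10]
        totals[day] = totals.get(day, 0) + event["value"]
    return totals


store: Dict[str, dict] = {}

# Initial real-time load.
upsert_events(store, [
    {"event_id": "a", "occurred_at": "2015-03-01T10:00:00", "value": 5},
    {"event_id": "b", "occurred_at": "2015-03-01T11:00:00", "value": 7},
])

# Days later: event "b" is revised and event "c" arrives late.
reprocessed = [
    {"event_id": "b", "occurred_at": "2015-03-01T11:00:00", "value": 8},
    {"event_id": "c", "occurred_at": "2015-03-01T23:59:00", "value": 1},
]
upsert_events(store, reprocessed)
upsert_events(store, reprocessed)  # replaying the same range changes nothing

totals = daily_totals(store)
print(totals)
```

Keying on a stable event id is what makes the replay safe; without one, re-processing a range would need a delete-then-insert over that range instead.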

Page 18: How to Build a Data-Driven Company: From Infrastructure to Insights

7 core challenges of data integration

Connections: Every API is a unique and special snowflake

Accuracy: Ordering data on a distributed system

Latency: Large object data stores (Amazon S3, Redshift) are optimized for batches, not streams

Scale: Data will grow exponentially as your company grows

Flexibility: You’re interacting with systems you don’t control

Monitoring: Notifications for expired credentials, errors, and disruptions

Maintenance: Justifying investment in ongoing maintenance/improvement

Shaun
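
The latency challenge above (stores optimized for batches, not streams) is often handled by micro-batching: buffer incoming events and load them in groups. The following is a minimal sketch under that assumption. `MicroBatcher` and `flush_fn` are hypothetical names, and the "load" here just appends to a list rather than writing to S3 or Redshift.

```python
from typing import Callable, List


class MicroBatcher:
    """Buffers events and hands them to `flush_fn` one batch at a time."""

    def __init__(self, flush_fn: Callable[[List[dict]], None],
                 max_batch_size: int = 3) -> None:
        self._flush_fn = flush_fn
        self._max_batch_size = max_batch_size
        self._buffer: List[dict] = []

    def add(self, event: dict) -> None:
        """Queue one event; flush automatically when the batch is full."""
        self._buffer.append(event)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, if anything, and start a new batch."""
        if self._buffer:
            self._flush_fn(list(self._buffer))
            self._buffer = []


loaded_batches: List[List[dict]] = []  # stand-in for a bulk load target
batcher = MicroBatcher(loaded_batches.append, max_batch_size=3)

for i in range(7):
    batcher.add({"event_id": i})
batcher.flush()  # push the final partial batch

batch_sizes = [len(batch) for batch in loaded_batches]
print(batch_sizes)
```

A production version would usually also flush on a timer, so a quiet stream does not leave events sitting in the buffer indefinitely.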

Page 19: How to Build a Data-Driven Company: From Infrastructure to Insights

Or...try Pipeline

Shaun

[Diagram. Source categories: Ad Platforms, Customer Support, Web Data, Marketing Automation, CRM, Payments, Ecommerce]

Page 20: How to Build a Data-Driven Company: From Infrastructure to Insights

Warehousing Infrastructure

Shaun

Page 21: How to Build a Data-Driven Company: From Infrastructure to Insights

Analytics warehouse

Shaun

Redshift is the most common analytics warehouse.

Chosen by: Asana, Braintree, Looker, SeatGeek, VigLink, Buffer

Page 22: How to Build a Data-Driven Company: From Infrastructure to Insights

Why Redshift is awesome

Shaun

Page 23: How to Build a Data-Driven Company: From Infrastructure to Insights

AirBnB experiment

                                            Hive               Redshift
Test 1: 3 billion rows of data              28 minutes         <6 minutes
Test 2: two joins with millions of rows     182 seconds        8 seconds
Cost                                        $1.29/hour/node    $0.85/hour/node

Shaun
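
As a quick sanity check on the AirBnB figures, the ratios can be worked out directly. The numbers are the ones on the slide and the variable names are mine. Since Test 1 reports Redshift as "<6 minutes", that speedup is only a lower bound, and because node counts are not given, the per-node hourly prices say nothing about total cost on their own.

```python
# Figures as reported on the slide (Hive vs. Redshift).
hive_scan_minutes = 28
redshift_scan_minutes_upper = 6  # slide says "<6 minutes"
hive_join_seconds = 182
redshift_join_seconds = 8
hive_price_per_node_hour = 1.29
redshift_price_per_node_hour = 0.85

# Test 1: lower bound on the speedup, since Redshift took under 6 minutes.
scan_speedup_lower_bound = hive_scan_minutes / redshift_scan_minutes_upper

# Test 2: both times are exact, so this is a plain ratio.
join_speedup = hive_join_seconds / redshift_join_seconds

# Per-node hourly price of Redshift relative to Hive.
price_ratio = redshift_price_per_node_hour / hive_price_per_node_hour

print(f"Test 1 speedup: more than {scan_speedup_lower_bound:.1f}x")
print(f"Test 2 speedup: {join_speedup:.2f}x")
print(f"Redshift per-node price as a share of Hive's: {price_ratio:.0%}")
```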

Page 24: How to Build a Data-Driven Company: From Infrastructure to Insights

Periscope research

Shaun

Page 25: How to Build a Data-Driven Company: From Infrastructure to Insights

DiamondStream’s dashboard query performance

Shaun

Page 26: How to Build a Data-Driven Company: From Infrastructure to Insights

Business Intelligence & Analytics

Dillon

Page 27: How to Build a Data-Driven Company: From Infrastructure to Insights

A broken model

Dillon

● Feedback loop is broken
● Disparate reporting
● Non-unified decision making
● Versioning
● Reusability is lost

[Diagram. Team labels: Marketing, Finance, AM]

Page 28: How to Build a Data-Driven Company: From Infrastructure to Insights

Constraints of SQL

Dillon

SQL is versatile, but shares the same flavor as write-only languages such as Perl

● Can write but not read
● Promotes one-off, piecemeal analysis
● Disparate interpretation

Page 29: How to Build a Data-Driven Company: From Infrastructure to Insights

The critical multiplier: modeling

Dillon

[Diagram. Labels: Any SQL Data Warehouse, Modeling Layer]

What’s our most successful marketing campaign?

How does our Q4 pipeline look?

Who are our healthiest / happiest customers?
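
The idea of a modeling layer can be sketched in a few lines: business terms are defined once, and SQL is generated from those definitions instead of being hand-written per question. This is a toy illustration, not Looker's actual modeling language. The `DIMENSIONS` and `MEASURES` dictionaries, `build_query`, and the sample `orders` table are all hypothetical, and sqlite3 stands in for the SQL warehouse.

```python
import sqlite3

# Hypothetical model: each business term is defined once, in one place.
DIMENSIONS = {"campaign": "campaign_name"}
MEASURES = {
    "order_count": "COUNT(*)",
    "total_revenue": "SUM(amount)",
}


def build_query(table: str, dimension: str, measure: str) -> str:
    """Generate SQL for one measure grouped by one dimension.

    Names are looked up in the model above, so only trusted,
    predefined SQL fragments end up in the query text.
    """
    dim_sql = DIMENSIONS[dimension]
    measure_sql = MEASURES[measure]
    return (
        f"SELECT {dim_sql} AS {dimension}, {measure_sql} AS {measure} "
        f"FROM {table} GROUP BY {dim_sql} ORDER BY {measure} DESC"
    )


conn = sqlite3.connect(":memory:")  # stand-in for any SQL data warehouse
conn.execute("CREATE TABLE orders (campaign_name TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("spring_sale", 100.0), ("spring_sale", 50.0), ("newsletter", 120.0)],
)

# "What's our most successful marketing campaign?" expressed via the model.
query = build_query("orders", "campaign", "total_revenue")
rows = conn.execute(query).fetchall()
print(rows[0])
```

Because everyone asks for `total_revenue` by name, the definition stays uniform across teams; changing it means editing one entry in the model.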

Page 30: How to Build a Data-Driven Company: From Infrastructure to Insights

Interactive, collaborative analytics

Dillon

● Data access

● Uniform definitions

● A Shared View

● Collaboration

● Analytical Speed

Page 31: How to Build a Data-Driven Company: From Infrastructure to Insights

What You Can Do

Dillon

Page 32: How to Build a Data-Driven Company: From Infrastructure to Insights

Integrated data + analytics tools

Dillon

[Diagram. Labels: Week 1, Week 2-3, RJMetrics Pipeline, BLOCKS]
Page 33: How to Build a Data-Driven Company: From Infrastructure to Insights

Looker blocks: sales & marketing

Page 35: How to Build a Data-Driven Company: From Infrastructure to Insights

Looker blocks: event analytics

Page 37: How to Build a Data-Driven Company: From Infrastructure to Insights

Thank you!