DESCRIPTION
Since its inception, the Snowplow open source event analytics platform (https://github.com/snowplow/snowplow) has always been tightly coupled to the batch-based Hadoop ecosystem, and Elastic MapReduce in particular. With the release of Amazon Kinesis in late 2013, we set ourselves the challenge of porting Snowplow to Kinesis, to give our users access to their Snowplow event stream in near-real-time. With this porting process nearing completion, Alex Dean, Snowplow Analytics co-founder and technical lead, will share Snowplow’s experiences in adopting stream processing as a complementary architecture to Hadoop and batch-based processing. In particular, Alex will explore:
• “Hero” use cases for event streaming which drove our adoption of Kinesis
• Why we waited for Kinesis, and thoughts on how Kinesis fits into the wider streaming ecosystem
• How Snowplow achieved a lambda architecture with minimal code duplication, allowing Snowplow users to choose which (or both) platforms to use
• Key considerations when moving from a batch mindset to a streaming mindset, including aggregate windows, recomputation and backpressure
Continuous data processing with Kinesis at Snowplow
Budapest DW Forum 2014
Agenda today
1. Introduction to Snowplow
2. Our batch data flow & use cases
3. Why are we excited about Kinesis?
4. Adding Kinesis support to Snowplow
5. Questions
Introduction to Snowplow
Snowplow is an open-source web and event analytics platform, first version released in early 2012
• Co-founders Alex Dean and Yali Sassoon met in 2008 at OpenX, the open-source ad technology business
• After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy
• We released Snowplow as a skunkworks prototype at the start of 2012:
github.com/snowplow/snowplow
• We started working full time on Snowplow in summer 2013
We wanted to take a fresh approach to web analytics
• Your own web event data → in your own data warehouse
• Your own event data model
• Slice / dice and mine the data in highly bespoke ways to answer your specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your data
Data pipeline → Data warehouse → Analyse your data in any analysis tool
And we saw the potential of new “big data” technologies and services to solve these problems in a scalable, low-cost manner
These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
CloudFront → Amazon S3 → Amazon EMR → Amazon Redshift
Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems
1. Trackers → A → 2. Collectors → B → 3. Enrich → C → 4. Storage → D → 5. Analytics
(A, B, C, D = standardised data protocols)
• Trackers: generate event data from any environment
• Collectors: log raw events from trackers
• Enrich: validate and enrich raw events
• Storage: store enriched events ready for analysis
• Analytics: analyze enriched events
These standardised protocols and the loose coupling they enable turned out to be critical in allowing us to evolve our technology stack
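To illustrate the idea, here is a minimal Scala sketch of subsystems that agree only on the record formats flowing between them. The types and traits are hypothetical illustrations, not Snowplow's actual interfaces:

```scala
// Hypothetical record types standing in for the standardised protocols A–D.
final case class RawEvent(payload: String)
final case class EnrichedEvent(fields: Map[String, String])

// Each subsystem depends only on the shared record types, never on a
// specific neighbour, so implementations can be swapped independently.
trait Collector { def collect(): Seq[RawEvent] }
trait Enricher  { def enrich(raw: RawEvent): Either[String, EnrichedEvent] }
trait Storage   { def store(events: Seq[EnrichedEvent]): Unit }
```

Because only the protocols are shared, a CloudFront collector can later be replaced by a stream collector, or Hadoop enrichment by a Kinesis app, without touching the other subsystems.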
Our batch data flow & use cases
By spring 2013 we had arrived at a relatively stable batch-based processing architecture
The Snowplow Hadoop data pipeline:
Website / webapp with JavaScript event tracker → CloudFront-based event collector or Clojure-based event collector → Amazon S3 → Scalding-based enrichment on Hadoop → Amazon Redshift / PostgreSQL
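To give a flavour of the Hadoop step, here is a minimal Scalding job sketch. The job structure is real Scalding, but the field names and the placeholder transformation are illustrative, not Snowplow's actual enrichment code:

```scala
import com.twitter.scalding._

// A minimal Scalding job: read raw event lines, apply a placeholder
// "enrichment", and write TSV output. Illustrative only.
class EnrichJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .filter('line) { line: String => line.nonEmpty }     // drop blank lines
    .map('line -> 'event) { line: String => line.trim }  // stand-in for real validation + enrichment
    .project('event)
    .write(Tsv(args("output")))
}
```

A job like this would typically be run on EMR via something like `hadoop jar your-assembly.jar com.twitter.scalding.Tool EnrichJob --hdfs --input s3://... --output s3://...`.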
What did people start using Snowplow for?
Warehousing their web event data, and agile aka ad hoc analytics, to enable…
• Marketing attribution modelling
• Customer lifetime value calculations
• Customer churn prediction
• RTB fraud detection
• Email product recs
These use cases tended to be characterized by a few important traits
Trait | Example
1. They use data collected over long time periods | Marketing attribution modelling
2. They demand ongoing & hands-on involvement from a BA / data scientist | Agile aka ad hoc analytics
3. They tend not to elicit synchronous/deterministic responses | RTB fraud detection
So why did we get excited about Kinesis?
A quick history lesson: the three eras of business data processing
1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+
For more see http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
The classic era, 1996+
[Diagram: in the company's own data center, the operational systems (CMS, CRM, E-comm, ERP) are narrow data silos, each with its own low-latency local loop. A nightly batch ETL process over point-to-point connections feeds a data warehouse used for management reporting, giving wide data coverage and full data history, but at high latency.]
The hybrid era, 2005+
[Diagram: systems now span the cloud vendor / own data center (CMS, CRM, E-comm, ERP, Search) and multiple SaaS vendors (email marketing, web analytics); each remains a narrow data silo with its own low-latency local loop. Data is pulled via APIs and bulk exports into high-latency batch processing – a data warehouse for management reporting and Hadoop for ad hoc analytics – alongside low-latency stream processing for product rec's and micro-batch processing for systems monitoring.]
The unified era, 2013+
[Diagram: the same systems (CMS, CRM, E-comm, ERP, Search, and SaaS email marketing) remain narrow data silos, some with low-latency local loops, but now all feed a unified log via streaming APIs / web hooks. The unified log holds a few days' data history at low latency with wide data coverage, and is continuously archived to Hadoop, which keeps the full data history at high latency. Both feed downstream applications: systems monitoring, product rec's, fraud detection, churn prevention, ad hoc analytics and management reporting.]
The unified log is Kinesis (or Kafka)
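As a concrete starting point, here is a minimal sketch of provisioning a Kinesis stream to act as the unified log, using the AWS Java SDK from Scala. The stream name and shard count are illustrative:

```scala
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.CreateStreamRequest

// Provision a Kinesis stream to act as the unified log.
object CreateUnifiedLog {
  def main(args: Array[String]): Unit = {
    val kinesis = new AmazonKinesisClient() // credentials via the default provider chain
    kinesis.createStream(new CreateStreamRequest()
      .withStreamName("unified-log") // hypothetical name
      .withShardCount(2))            // each shard: 1 MB/s writes, 2 MB/s reads
  }
}
```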
We asked: can we implement Snowplow on top of Kinesis?
What kinds of use cases can we support if we implement Snowplow on top of Kinesis?
Populating a unified log with your company's event streams, to enable…
• In-session product recs
• Holistic systems monitoring
• In-game difficulty tuning
• In-session upselling
• Ad retargeting & RTB
• … anything requiring low latency response / holistic view of our data!
Adding Kinesis support to Snowplow
Where we are heading with our Kinesis architecture
Snowplow Trackers → Scala Stream Collector → raw event stream → Enrich Kinesis app
Enrich Kinesis app → enriched event stream (invalid events go to a bad raw events stream)
Enriched event stream → S3 sink Kinesis app → S3
Enriched event stream → Redshift sink Kinesis app → Redshift
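To make the Enrich Kinesis app concrete, here is a minimal single-shard sketch using the low-level AWS SDK from Scala. The stream names and enrichment logic are illustrative; a production app would handle multiple shards, checkpointing and failover (for example via the Kinesis Client Library):

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model._

object EnrichApp {
  val kinesis = new AmazonKinesisClient() // credentials via the default provider chain

  // Stand-in for Snowplow's real validation + enrichment logic.
  def enrich(line: String): Either[String, String] =
    if (line.contains("\t")) Right(line.toLowerCase) else Left(s"unparseable: $line")

  def put(stream: String, data: String): Unit =
    kinesis.putRecord(new PutRecordRequest()
      .withStreamName(stream)
      .withPartitionKey(data.hashCode.toString)
      .withData(ByteBuffer.wrap(data.getBytes(UTF_8))))

  def main(args: Array[String]): Unit = {
    // Hypothetical stream names, as in the diagram above.
    val (raw, enriched, bad) = ("raw-events", "enriched-events", "bad-raw-events")

    var iterator = kinesis.getShardIterator(new GetShardIteratorRequest()
      .withStreamName(raw)
      .withShardId("shardId-000000000000") // single-shard stream for brevity
      .withShardIteratorType("TRIM_HORIZON")).getShardIterator

    while (true) {
      val result = kinesis.getRecords(
        new GetRecordsRequest().withShardIterator(iterator).withLimit(100))
      for (record <- result.getRecords.asScala) {
        val line = UTF_8.decode(record.getData).toString
        enrich(line) match {
          case Right(event) => put(enriched, event) // the app's "stdout"
          case Left(error)  => put(bad, error)      // the app's "stderr"
        }
      }
      iterator = result.getNextShardIterator
      Thread.sleep(1000) // respect per-shard GetRecords limits
    }
  }
}
```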
This is where we are today
[The same architecture diagram as above, indicating how far along each component is today.]
What have we and the Snowplow community learnt about Kinesis and continuous data processing so far?
1. One stream, many consuming apps is unexpected for many people (a legacy of older message queues?)
2. Think of Kinesis apps as distributed Unix commands, with streams mapping onto stdin, stdout and stderr
3. Build more complex systems by chaining simple Kinesis apps – the Kinesis stream is a really powerful primitive for continuous data flows (see the sketch below)
4. Scalability and elasticity are going to be much bigger challenges than in our batch flow
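For instance, chaining a second "command" onto the enriched stream gives you the S3 sink from the architecture above. A minimal sketch, with hypothetical bucket and stream names, a single shard and no checkpointing:

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetShardIteratorRequest}
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectMetadata

object S3SinkApp {
  def main(args: Array[String]): Unit = {
    val kinesis = new AmazonKinesisClient()
    val s3      = new AmazonS3Client()

    var iterator = kinesis.getShardIterator(new GetShardIteratorRequest()
      .withStreamName("enriched-events")   // reads the previous app's "stdout"
      .withShardId("shardId-000000000000")
      .withShardIteratorType("TRIM_HORIZON")).getShardIterator

    val buffer = ArrayBuffer.empty[String]
    while (true) {
      val result = kinesis.getRecords(
        new GetRecordsRequest().withShardIterator(iterator).withLimit(100))
      buffer ++= result.getRecords.asScala.map(r => UTF_8.decode(r.getData).toString)

      if (buffer.size >= 1000) {           // flush to S3 in batches
        val bytes = buffer.mkString("\n").getBytes(UTF_8)
        val meta  = new ObjectMetadata()
        meta.setContentLength(bytes.length.toLong)
        s3.putObject("my-archive-bucket",  // hypothetical bucket
          s"enriched/${System.currentTimeMillis}.tsv",
          new ByteArrayInputStream(bytes), meta)
        buffer.clear()
      }
      iterator = result.getNextShardIterator
      Thread.sleep(1000)
    }
  }
}
```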
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To talk offline – @alexcrdean on Twitter or [email protected]