DESCRIPTION
Since its inception, the Snowplow open source event analytics platform (https://github.com/snowplow/snowplow) has always been tightly coupled to the batch-based Hadoop ecosystem, and Elastic MapReduce in particular. With the release of Amazon Kinesis in late 2013, we set ourselves the challenge of porting Snowplow to Kinesis, to give our users access to their Snowplow event stream in near-real-time. With this porting process nearing completion, Alex Dean, Snowplow Analytics co-founder and technical lead, will share Snowplow’s experiences in adopting stream processing as a complementary architecture to Hadoop and batch-based processing. In particular, Alex will explore:
• “Hero” use cases for event streaming which drove our adoption of Kinesis
• Why we waited for Kinesis, and thoughts on how Kinesis fits into the wider streaming ecosystem
• How Snowplow achieved a lambda architecture with minimal code duplication, allowing Snowplow users to choose which (or both) platforms to use
• Key considerations when moving from a batch mindset to a streaming mindset, including aggregate windows, recomputation and backpressure
Continuous data processing with Kinesis at Snowplow
Budapest DW Forum 2014
Agenda today
1. Introduction to Snowplow
2. Our batch data flow & use cases
3. Why are we excited about Kinesis?
4. Adding Kinesis support to Snowplow
5. Questions
Introduction to Snowplow
Snowplow is an open-source web and event analytics platform, first version released in early 2012
• Co-founders Alex Dean and Yali Sassoon met in 2008 at OpenX, the open-source ad technology business
• After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy
• We released Snowplow as a skunkworks prototype at the start of 2012:
github.com/snowplow/snowplow
• We started working full time on Snowplow in summer 2013
We wanted to take a fresh approach to web analytics
• Your own web event data → in your own data warehouse
• Your own event data model
• Slice / dice and mine the data in highly bespoke ways to answer your specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your data
Data pipeline → Data warehouse → Analyse your data in any analysis tool
And we saw the potential of new “big data” technologies and services to solve these problems in a scalable, low-cost manner
These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
CloudFront → Amazon S3 → Amazon EMR → Amazon Redshift
Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems
1. Trackers → A → 2. Collectors → B → 3. Enrich → C → 4. Storage → D → 5. Analytics
(A, B, C, D = standardised data protocols)
• Trackers: generate event data from any environment
• Collectors: log raw events from trackers
• Enrich: validate and enrich raw events
• Storage: store enriched events ready for analysis
• Analytics: analyze enriched events
These standardised protocols and the loose coupling they enable turned out to be critical in allowing us to evolve our technology stack
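To illustrate the idea, here is a minimal Scala sketch of subsystems that agree only on the record formats flowing between them. The types and traits are hypothetical illustrations, not Snowplow's actual interfaces:

```scala
// Hypothetical record types standing in for the standardised protocols A–D.
final case class RawEvent(payload: String)
final case class EnrichedEvent(fields: Map[String, String])

// Each subsystem depends only on the shared record types, never on a
// specific neighbour, so implementations can be swapped independently.
trait Collector { def collect(): Seq[RawEvent] }
trait Enricher  { def enrich(raw: RawEvent): Either[String, EnrichedEvent] }
trait Storage   { def store(events: Seq[EnrichedEvent]): Unit }
```

Because only the protocols are shared, a CloudFront collector can later be replaced by a stream collector, or Hadoop enrichment by a Kinesis app, without touching the other subsystems.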
Our batch data flow & use cases
By spring 2013 we had arrived at a relatively stable batch-based processing architecture
The Snowplow Hadoop data pipeline:
Website / webapp with JavaScript event tracker → CloudFront-based event collector or Clojure-based event collector → Amazon S3 → Scalding-based enrichment on Hadoop → Amazon Redshift / PostgreSQL
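To give a flavour of the Hadoop step, here is a minimal Scalding job sketch. The job structure is real Scalding, but the field names and the placeholder transformation are illustrative, not Snowplow's actual enrichment code:

```scala
import com.twitter.scalding._

// A minimal Scalding job: read raw event lines, apply a placeholder
// "enrichment", and write TSV output. Illustrative only.
class EnrichJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .filter('line) { line: String => line.nonEmpty }     // drop blank lines
    .map('line -> 'event) { line: String => line.trim }  // stand-in for real validation + enrichment
    .project('event)
    .write(Tsv(args("output")))
}
```

A job like this would typically be run on EMR via something like `hadoop jar your-assembly.jar com.twitter.scalding.Tool EnrichJob --hdfs --input s3://... --output s3://...`.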
What did people start using Snowplow for?
Warehousing their web event data, and agile aka ad hoc analytics, to enable…
• Marketing attribution modelling
• Customer lifetime value calculations
• Customer churn prediction
• RTB fraud detection
• Email product recs
These use cases tended to be characterized by a few important traits
Trait | Example
1. They use data collected over long time periods | Marketing attribution modelling
2. They demand ongoing & hands-on involvement from a BA / data scientist | Agile aka ad hoc analytics
3. They tend not to elicit synchronous/deterministic responses | RTB fraud detection
So why did we get excited about Kinesis?
A quick history lesson: the three eras of business data processing
1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+
For more see http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
The classic era, 1996+
[Diagram: in the company's own data center, the operational systems (CMS, CRM, E-comm, ERP) are narrow data silos, each with its own low-latency local loop. A nightly batch ETL process over point-to-point connections feeds a data warehouse used for management reporting, giving wide data coverage and full data history, but at high latency.]
The hybrid era, 2005+
[Diagram: systems now span the cloud vendor / own data center (CMS, CRM, E-comm, ERP, Search) and multiple SaaS vendors (email marketing, web analytics); each remains a narrow data silo with its own low-latency local loop. Data is pulled via APIs and bulk exports into high-latency batch processing – a data warehouse for management reporting and Hadoop for ad hoc analytics – alongside low-latency stream processing for product rec's and micro-batch processing for systems monitoring.]
The unified era, 2013+
[Diagram: the same systems (CMS, CRM, E-comm, ERP, Search, and SaaS email marketing) remain narrow data silos, some with low-latency local loops, but now all feed a unified log via streaming APIs / web hooks. The unified log holds a few days' data history at low latency with wide data coverage, and is continuously archived to Hadoop, which keeps the full data history at high latency. Both feed downstream applications: systems monitoring, product rec's, fraud detection, churn prevention, ad hoc analytics and management reporting.]
The unified log is Kinesis (or Kafka)
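As a concrete starting point, here is a minimal sketch of provisioning a Kinesis stream to act as the unified log, using the AWS Java SDK from Scala. The stream name and shard count are illustrative:

```scala
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.CreateStreamRequest

// Provision a Kinesis stream to act as the unified log.
object CreateUnifiedLog {
  def main(args: Array[String]): Unit = {
    val kinesis = new AmazonKinesisClient() // credentials via the default provider chain
    kinesis.createStream(new CreateStreamRequest()
      .withStreamName("unified-log") // hypothetical name
      .withShardCount(2))            // each shard: 1 MB/s writes, 2 MB/s reads
  }
}
```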
We asked: can we implement Snowplow on top of Kinesis?
What kinds of use cases can we support if we implement Snowplow on top of Kinesis?
Populating a unified log with your company's event streams, to enable…
• In-session product recs
• Holistic systems monitoring
• In-game difficulty tuning
• In-session upselling
• Ad retargeting & RTB
• … anything requiring low latency response / holistic view of our data!
Adding Kinesis support to Snowplow
Where we are heading with our Kinesis architecture
Snowplow Trackers → Scala Stream Collector → raw event stream → Enrich Kinesis app
Enrich Kinesis app → enriched event stream (invalid events go to a bad raw events stream)
Enriched event stream → S3 sink Kinesis app → S3
Enriched event stream → Redshift sink Kinesis app → Redshift
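To make the Enrich Kinesis app concrete, here is a minimal single-shard sketch using the low-level AWS SDK from Scala. The stream names and enrichment logic are illustrative; a production app would handle multiple shards, checkpointing and failover (for example via the Kinesis Client Library):

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model._

object EnrichApp {
  val kinesis = new AmazonKinesisClient() // credentials via the default provider chain

  // Stand-in for Snowplow's real validation + enrichment logic.
  def enrich(line: String): Either[String, String] =
    if (line.contains("\t")) Right(line.toLowerCase) else Left(s"unparseable: $line")

  def put(stream: String, data: String): Unit =
    kinesis.putRecord(new PutRecordRequest()
      .withStreamName(stream)
      .withPartitionKey(data.hashCode.toString)
      .withData(ByteBuffer.wrap(data.getBytes(UTF_8))))

  def main(args: Array[String]): Unit = {
    // Hypothetical stream names, as in the diagram above.
    val (raw, enriched, bad) = ("raw-events", "enriched-events", "bad-raw-events")

    var iterator = kinesis.getShardIterator(new GetShardIteratorRequest()
      .withStreamName(raw)
      .withShardId("shardId-000000000000") // single-shard stream for brevity
      .withShardIteratorType("TRIM_HORIZON")).getShardIterator

    while (true) {
      val result = kinesis.getRecords(
        new GetRecordsRequest().withShardIterator(iterator).withLimit(100))
      for (record <- result.getRecords.asScala) {
        val line = UTF_8.decode(record.getData).toString
        enrich(line) match {
          case Right(event) => put(enriched, event) // the app's "stdout"
          case Left(error)  => put(bad, error)      // the app's "stderr"
        }
      }
      iterator = result.getNextShardIterator
      Thread.sleep(1000) // respect per-shard GetRecords limits
    }
  }
}
```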
This is where we are today
[The same architecture diagram as above, indicating how far along each component is today.]
What have we and the Snowplow community learnt about Kinesis and continuous data processing so far?
1. One stream, many consuming apps is unexpected for many people (a legacy of older message queues?)
2. Think of Kinesis apps as distributed Unix commands, with streams mapping onto stdin, stdout and stderr
3. Build more complex systems by chaining simple Kinesis apps – the Kinesis stream is a really powerful primitive for continuous data flows (see the sketch below)
4. Scalability and elasticity are going to be much bigger challenges than in our batch flow
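For instance, chaining a second "command" onto the enriched stream gives you the S3 sink from the architecture above. A minimal sketch, with hypothetical bucket and stream names, a single shard and no checkpointing:

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetShardIteratorRequest}
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectMetadata

object S3SinkApp {
  def main(args: Array[String]): Unit = {
    val kinesis = new AmazonKinesisClient()
    val s3      = new AmazonS3Client()

    var iterator = kinesis.getShardIterator(new GetShardIteratorRequest()
      .withStreamName("enriched-events")   // reads the previous app's "stdout"
      .withShardId("shardId-000000000000")
      .withShardIteratorType("TRIM_HORIZON")).getShardIterator

    val buffer = ArrayBuffer.empty[String]
    while (true) {
      val result = kinesis.getRecords(
        new GetRecordsRequest().withShardIterator(iterator).withLimit(100))
      buffer ++= result.getRecords.asScala.map(r => UTF_8.decode(r.getData).toString)

      if (buffer.size >= 1000) {           // flush to S3 in batches
        val bytes = buffer.mkString("\n").getBytes(UTF_8)
        val meta  = new ObjectMetadata()
        meta.setContentLength(bytes.length.toLong)
        s3.putObject("my-archive-bucket",  // hypothetical bucket
          s"enriched/${System.currentTimeMillis}.tsv",
          new ByteArrayInputStream(bytes), meta)
        buffer.clear()
      }
      iterator = result.getNextShardIterator
      Thread.sleep(1000)
    }
  }
}
```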
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To talk offline – @alexcrdean on Twitter or [email protected]