Introducing myself
• Alex Dean
• Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]
• Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]
[1] https://github.com/snowplow/snowplow
[2] http://manning.com/dean
A quick history lesson: the three eras of business data processing [1]
1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+
[1] http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
The classic era of business data processing, 1996+
[Diagram: inside the company's own data center, narrow data silos (CMS, CRM, e-comm, ERP), each with its own low-latency local loop, are linked by point-to-point connections. A nightly batch ETL process loads a data warehouse used for management reporting: wide data coverage and full data history, but high latency.]
The hybrid era, 2005+
[Diagram: narrow data silos with low-latency local loops now span a cloud vendor / own data center (search, e-comm, ERP, CMS) and SaaS vendors #1 to #3 (CRM, email marketing, web analytics). Data leaves them through APIs and bulk exports. Low-latency paths: stream processing (product recs) and micro-batch processing (systems monitoring). High-latency paths: batch processing into a data warehouse (management reporting) and into Hadoop (ad hoc analytics).]
The hybrid era: a surfeit of software vendors
The hybrid era: company-wide reporting and analytics ends up like Rashomon
The bandit’s story
vs.
The wife’s story
vs.
The samurai’s story
vs.
The woodcutter’s story
The unified era, 2013+
[Diagram: the silos in the cloud vendor / own data center (search, e-comm, ERP, CMS) and at the SaaS vendors (CRM, email marketing) remain narrow and keep some low-latency local loops, but they now feed a unified log through streaming APIs / web hooks. The unified log offers low latency and wide data coverage, with a few days' data history. It is archived to Hadoop, which offers wide data coverage and full data history at high latency. Applications shown consuming the event stream and the archive, directly or through APIs: systems monitoring, product recs, ad hoc analytics, management reporting, fraud detection and churn prevention.]
The unified log is Amazon Kinesis, or Apache Kafka
• Amazon Kinesis, a hosted AWS service
  • Extremely similar semantics to Kafka
• Apache Kafka, an append-only, distributed, ordered commit log
  • Developed at LinkedIn to serve as their organization’s unified log
“Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1]
[1] http://kafka.apache.org/
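To make "append-only, ordered commit log" concrete, here is a minimal in-memory sketch. It is not Kafka's or Kinesis's API: the `UnifiedLog` class, its method names and the sample events are all hypothetical, and it ignores distribution, partitioning and persistence. It only shows the two properties the slide names: records are appended in order, and each reader tracks its own offset.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class UnifiedLog:
    """Toy single-partition, append-only log (hypothetical, in memory only)."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record to the end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Tuple[int, Any]]:
        """Return up to max_records (offset, record) pairs, starting at offset."""
        chunk = self._records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


log = UnifiedLog()
log.append({"type": "page_view", "page": "/home"})
log.append({"type": "add_to_basket", "sku": "abc"})

# Two independent readers keep their own offsets; reading never removes
# records, so neither reader affects the other.
monitoring_batch = log.read(offset=0)
reporting_batch = log.read(offset=1)
```

Because reads do not consume records, adding another consumer does not require another point-to-point connection to the source system.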
So what does a unified log give us?
1. A single version of the truth
2. Our truth is now upstream from the data warehouse
3. The hairball of point-to-point connections has been unravelled
4. Local loops have been unbundled
What does a unified log let us do that we couldn’t do before?
Populating a unified log with your company’s event streams, to enable…
• Real-time management reporting
• Holistic systems monitoring
• Re-running models from Day 0
• A/B testing end-to-end pipelines
• Shipping offline models to real time
… anything requiring a low-latency response or a holistic view of our company’s data!
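As a rough illustration of "re-running models from Day 0": if the full event history is retained, any new model can be computed by replaying that history from the first event. The sketch below assumes a tiny in-memory list of archived events; the `replay` helper, the two toy models and the event fields are hypothetical, not part of any Snowplow, Kafka or Kinesis API.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

Event = Dict[str, str]


def replay(events: Iterable[Event], model: Callable[[Counter, Event], None]) -> Counter:
    """Feed every archived event, oldest first, into a model and return its state."""
    state: Counter = Counter()
    for event in events:
        model(state, event)
    return state


def count_by_type(state: Counter, event: Event) -> None:
    """Toy model 1: how many events of each type have we seen?"""
    state[event["type"]] += 1


def purchases_by_user(state: Counter, event: Event) -> None:
    """Toy model 2, written later: how many purchases has each user made?"""
    if event["type"] == "purchase":
        state[event["user"]] += 1


# Stand-in for the archived event stream (the full data history).
archived_events = [
    {"type": "page_view", "user": "u1"},
    {"type": "purchase", "user": "u1"},
    {"type": "purchase", "user": "u2"},
    {"type": "purchase", "user": "u1"},
]

by_type = replay(archived_events, count_by_type)
by_user = replay(archived_events, purchases_by_user)
```

The second model did not need to exist when the events were recorded; the same replay over two candidate models is also one simple way to compare pipelines side by side.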
But garbage in, garbage out: it’s crucial to properly model the event streams feeding into the unified log
• We are working on a semantic model for events, an “event grammar”, at Snowplow [1]
• The event grammar borrows concepts from human language:
[Diagram: an Event broken down into Subject, Verb, Direct Object, Indirect Object, Prep. Object and Context]
• A semantic model prevents business and technology assumptions leaking into the event stream, making it less brittle over time
[1] http://snowplowanalytics.com/blog/2013/08/12/towards-universal-event-analytics-building-an-event-grammar/
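One way to picture the grammatical roles named on this slide is as fields of a single event record. The sketch below is only an illustration: the class, its field names and the sample values are assumptions, not Snowplow's actual event model.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event shaped like a sentence: who did what to what."""
    subject: str                                # who performed the action
    verb: str                                   # the action itself
    direct_object: str                          # what the action was applied to
    indirect_object: Optional[str] = None       # e.g. the recipient, if any
    prepositional_object: Optional[str] = None  # e.g. where it ended up, if any
    context: Dict[str, str] = field(default_factory=dict)  # when, where, how


# "Shopper 123 adds SKU abc to basket 456", with some surrounding context.
event = Event(
    subject="shopper:123",
    verb="add",
    direct_object="sku:abc",
    prepositional_object="basket:456",
    context={"timestamp": "2015-05-20T10:00:00Z", "platform": "web"},
)

record = asdict(event)  # plain dict, ready to serialise onto the log
```

Nothing in these fields refers to a particular web framework or database table, which is the sense in which a semantic model keeps technology assumptions out of the event stream.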
We also need to store and version the schemas used to describe our events, as these will change over time
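A minimal sketch of what storing and versioning schemas could look like, assuming each event names the schema and version it was written against. The in-memory registry, the event layout and the `is_valid` helper are hypothetical; a real system would use a proper schema language and a shared repository.

```python
from typing import Any, Dict, FrozenSet, Tuple

SchemaKey = Tuple[str, int]  # (schema name, version)

# Hypothetical registry: the required fields for each schema version.
# Old versions stay registered so that historical events remain readable.
SCHEMAS: Dict[SchemaKey, FrozenSet[str]] = {
    ("add_to_basket", 1): frozenset({"sku"}),
    ("add_to_basket", 2): frozenset({"sku", "quantity"}),
}


def is_valid(event: Dict[str, Any]) -> bool:
    """Check an event against the exact schema version it declares."""
    required = SCHEMAS.get((event["schema"], event["version"]))
    if required is None:
        return False  # unknown schema or version
    return required.issubset(event["data"])


old_event = {"schema": "add_to_basket", "version": 1, "data": {"sku": "abc"}}
new_event = {"schema": "add_to_basket", "version": 2,
             "data": {"sku": "abc", "quantity": 2}}
bad_event = {"schema": "add_to_basket", "version": 2, "data": {"sku": "abc"}}
```

Here `old_event` and `new_event` both validate, each against its own version, while `bad_event` fails because it claims version 2 but lacks `quantity`.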
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To meet up or chat, @alexcrdean on Twitter or
Manning Deal of the Day today!
Discount code: dotd052015au (50% off just today)