Why your company needs a Unified Log

Unified Log London, 20th May 2015

Introducing myself

• Alex Dean

• Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]

• Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]

[1] https://github.com/snowplow/snowplow

[2] http://manning.com/dean

So what’s a Unified Log?

A quick history lesson: the three eras of business data processing [1]

1. The classic era, 1996+

2. The hybrid era, 2005+

3. The unified era, 2013+

[1] http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/

The classic era of business data processing, 1996+

[Diagram: the classic era architecture]

• OWN DATA CENTER: CMS, CRM, E-comm and ERP systems, each a narrow data silo with its own low-latency local loop

• Point-to-point connections between systems

• Nightly batch ETL process into the data warehouse, which feeds management reporting

• Data warehouse: wide data coverage, full data history, high latency

The hybrid era, 2005+

[Diagram: the hybrid era architecture]

• CLOUD VENDOR / OWN DATA CENTER: Search, E-comm, ERP and CMS, each a narrow data silo with a low-latency local loop

• SAAS VENDORS #1 to #3: CRM, email marketing and web analytics, each with its own local loop

• APIs and bulk exports move data out of these systems

• Low latency: stream processing and micro-batch processing, with product rec’s and systems monitoring

• High latency: batch processing into the data warehouse and Hadoop, with management reporting and ad hoc analytics

The hybrid era: a surfeit of software vendors


The hybrid era: company-wide reporting and analytics ends up like Rashomon

The bandit’s story

vs.

The wife’s story

vs.

The samurai’s story

vs.

The woodcutter’s story

The hybrid era: the number of data integrations is unsustainable

So how do we unravel the hairball?

The unified era, 2013+

[Diagram: the unified era architecture]

• CLOUD VENDOR / OWN DATA CENTER: Search, E-comm, ERP and CMS, still narrow data silos, with some low-latency local loops

• SAAS VENDORS #1 and #2: CRM and email marketing

• Streaming APIs / web hooks feed events into the unified log

• Unified log: low latency, wide data coverage, few days’ data history

• Archiving into Hadoop: high latency, wide data coverage, full data history

• The eventstream and the archive serve: systems monitoring, product rec’s, ad hoc analytics, management reporting, fraud detection, churn prevention

• APIs


The unified log is Amazon Kinesis, or Apache Kafka

• Amazon Kinesis, a hosted AWS service

• Extremely similar semantics to Kafka

• Apache Kafka, an append-only, distributed, ordered commit log

• Developed at LinkedIn to serve as their organization’s unified log

“Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1]

[1] http://kafka.apache.org/
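
To make "append-only, ordered commit log" concrete, here is a minimal in-memory sketch in Python. It is a toy, not the Kafka or Kinesis API: the class `UnifiedLog` and its methods `append`, `poll` and `rewind` are hypothetical names chosen for illustration, and it models a single partition on one machine, so the "distributed" part is deliberately left out. What it does show is that records are only ever appended, each record gets a monotonically increasing offset, and every consumer tracks its own position independently.

```python
from typing import Any, Dict, List


class UnifiedLog:
    """Toy single-partition log (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._records: List[Any] = []      # append-only list of records
        self._offsets: Dict[str, int] = {}  # next offset to read, per consumer

    def append(self, record: Any) -> int:
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer: str, max_records: int = 10) -> List[Any]:
        """Return the next unread records for `consumer`, advancing its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def rewind(self, consumer: str, offset: int = 0) -> None:
        """Move a consumer back so that it re-reads history from `offset`."""
        self._offsets[consumer] = offset


log = UnifiedLog()
for event in ["page_view", "add_to_basket", "checkout"]:
    log.append(event)

# Two consumers read the same ordered stream at their own pace.
reporting = log.poll("reporting")
monitoring = log.poll("monitoring", max_records=2)
```

Because reading never removes anything from the log, adding another consumer does not disturb the existing ones, and `rewind` lets a consumer reprocess whatever history the log still retains.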

So what does a unified log give us?

1. A single version of the truth

2. Our truth is now upstream from the data warehouse

3. The hairball of point-to-point connections has been unravelled

4. Local loops have been unbundled

What does a unified log let us do that we couldn’t do before?

Populating a unified log with your company’s event streams enables:

• Real-time management reporting

• Holistic systems monitoring

• Re-running models from Day 0

• A/B testing end-to-end pipelines

• Shipping offline models to real time (RT)

… anything requiring a low-latency response or a holistic view of our company’s data!

But garbage in, garbage out: it’s crucial to properly model the event streams feeding into the unified log

• At Snowplow we are working on a semantic model for events, an “event grammar” [1]

• The event grammar borrows concepts from human language: Subject, Verb, Direct Object, Indirect Object and Prepositional Object, with a Context attached to each Event

• A semantic model prevents business and technology assumptions leaking into the event stream, making it less brittle over time

[1] http://snowplowanalytics.com/blog/2013/08/12/towards-universal-event-analytics-building-an-event-grammar/
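
As a rough illustration of what an event shaped by such a grammar might look like in code, here is a small Python sketch. It is not Snowplow's actual data model: the `Entity` and `Event` classes, their field names and the example values are all assumptions made for this example. The only idea taken from the slide is that an event has a subject, a verb, a direct object, optional indirect and prepositional objects, and a context.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Entity:
    """A participant in an event, e.g. a user or a product (hypothetical)."""
    kind: str
    id: str

    def label(self) -> str:
        return f"{self.kind}:{self.id}"


@dataclass(frozen=True)
class Event:
    """An event modelled like a sentence (hypothetical field names)."""
    subject: Entity
    verb: str
    direct_object: Entity
    indirect_object: Optional[Entity] = None
    prepositional_object: Optional[Entity] = None
    context: Dict[str, Any] = field(default_factory=dict)

    def describe(self) -> str:
        """Render the event as a compact, sentence-like string."""
        parts = [self.subject.label(), self.verb, self.direct_object.label()]
        for extra in (self.indirect_object, self.prepositional_object):
            if extra is not None:
                parts.append(extra.label())
        return " ".join(parts)


# "A user adds a product to a basket", with where and when as context.
event = Event(
    subject=Entity("user", "u-123"),
    verb="add",
    direct_object=Entity("product", "sku-42"),
    prepositional_object=Entity("basket", "b-7"),
    context={"platform": "web", "timestamp": "2015-05-20T19:00:00Z"},
)
print(event.describe())
```

Nothing in this structure mentions a particular website, page layout or vendor, which is the sense in which a semantic model keeps business and technology assumptions out of the stream.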

We also need to store and version the schemas used to describe our events, as these will change over time
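
One way to picture storing and versioning schemas is a registry keyed by schema name and version, so that old events remain checkable against the schema they were written with. The Python sketch below makes that concrete under loud assumptions: `SchemaRegistry`, its methods, and the convention that an event carries `schema`, `version` and `data` keys are all hypothetical, and a "schema" here is reduced to a set of required field names rather than a full schema language.

```python
from typing import Any, Dict, Set, Tuple

SchemaKey = Tuple[str, int]  # (schema name, schema version)


class SchemaRegistry:
    """Toy in-memory schema store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._schemas: Dict[SchemaKey, Set[str]] = {}

    def register(self, name: str, version: int, required_fields: Set[str]) -> None:
        """Store a new schema version; published versions are never overwritten."""
        key = (name, version)
        if key in self._schemas:
            raise ValueError(f"schema {name} v{version} is already registered")
        self._schemas[key] = set(required_fields)

    def validate(self, event: Dict[str, Any]) -> bool:
        """Check an event against the exact schema version it declares."""
        required = self._schemas.get((event["schema"], event["version"]))
        if required is None:
            return False  # unknown schema or version
        return required.issubset(event["data"].keys())


registry = SchemaRegistry()
registry.register("add_to_basket", 1, {"sku"})
registry.register("add_to_basket", 2, {"sku", "quantity"})  # schema evolved

old_event = {"schema": "add_to_basket", "version": 1, "data": {"sku": "sku-42"}}
new_event = {"schema": "add_to_basket", "version": 2,
             "data": {"sku": "sku-42", "quantity": 3}}

old_ok = registry.validate(old_event)
new_ok = registry.validate(new_event)
```

Because each event names the version it was written against, events archived under version 1 stay valid after version 2 is introduced.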


Questions?

http://snowplowanalytics.com

https://github.com/snowplow/snowplow

@snowplowdata

To meet up or chat, @alexcrdean on Twitter or [email protected]

Manning Deal of the Day today!

Discount code: dotd052015au (50% off just today)