Asynchronous micro-services and the unified log


Crunch Conference, Budapest, 7th October 2016

Introducing myself

• Alexander Dean

• Co-founder and technical lead at Snowplow, the open-source event data pipeline

• Weekend writer of Unified Log Processing, available on the Manning Early Access Program

• Co-author at Snowplow of Iglu, our open-source schema registry system, and Sauna, our open-source decisioning and response platform

We are witnessing the convergence of two separate technology tracks towards asynchronous or event-driven micro-services

• Transactional workloads: software monoliths → synchronous micro-services → asynchronous micro-services

• Analytical workloads: classic data warehousing → hybrid data pipelines → unified log architectures

Analytical workloads, and the rise of the unified log

A quick history lesson: the three eras of business data processing [1]

1. Classic data warehousing, 1996+

2. Hybrid data pipelines, 2005+

3. Unified log architectures, 2013+

[1] http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/

The era of classic data warehousing, 1996+

[Diagram: in our own data center, siloed systems (CMS, CRM, e-comm, ERP) each run their own low-latency local loop; point-to-point connections and a nightly batch ETL process feed them into the data warehouse, which gives wide data coverage and full data history for management reporting, but at high latency.]

The era of hybrid data pipelines, 2005+

[Diagram: narrow data silos now span the cloud/own data center (search, e-comm, CRM, ERP, CMS) and multiple SaaS vendors (email marketing, web analytics), each with its own low-latency local loop. Data flows out via APIs and bulk exports into stream processing (product rec's), micro-batch processing (systems monitoring), and batch processing into Hadoop and the data warehouse for management reporting and ad hoc analytics, at a mix of low and high latencies.]

The hybrid era: a surfeit of software vendors

[The same diagram as the previous slide, highlighting the number of separate software vendors involved.]

The hybrid era: company-wide reporting and analytics ends up like Rashomon

The bandit’s story vs. the wife’s story vs. the samurai’s story vs. the woodcutter’s story

The hybrid era: the number of data integrations is unsustainable

So how do we unravel the hairball?

The advent of the unified log, 2013+

[Diagram: the narrow data silos (search, e-comm, CRM, ERP, CMS, plus SaaS email marketing) keep some low-latency local loops, but now also feed a unified log via streaming APIs / web hooks. The unified log holds a few days’ data history with wide data coverage at low latency, driving the eventstream consumers: systems monitoring, product rec’s, fraud detection and churn prevention. It is archived to Hadoop, which holds the full data history with wide coverage (at high latency) for management reporting and ad hoc analytics.]


The unified log is Amazon Kinesis, or Apache Kafka

• Apache Kafka, an append-only, distributed, ordered commit log

• Developed at LinkedIn to serve as their organization’s unified log

• Amazon Kinesis, a hosted AWS service, with extremely similar semantics to Kafka

“Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1]

[1] http://kafka.apache.org/
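The commit-log semantics described above (append-only, ordered per partition, non-destructive reads addressed by offset) can be sketched with a toy in-memory model; this is an illustration of the idea, not the Kafka or Kinesis API.

```python
class MiniLog:
    """Toy sketch of a Kafka/Kinesis-style log: append-only,
    ordered within each partition, addressed by offset."""

    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def append(self, key, value):
        # Records with the same key land in the same partition,
        # so per-key ordering is preserved.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        # Reads are non-destructive: many consumers can read the same
        # records, each tracking its own offset independently.
        return self.partitions[partition][offset:]


log = MiniLog()
p, off = log.append("user-123", {"event": "page_view"})
log.append("user-123", {"event": "add_to_cart"})
print(log.read(p, 0))  # both events, in append order
```

Because consumers keep their own offsets, adding a new consumer never disturbs existing ones, which is the property the unified log architectures above rely on.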

So what does a unified log give us?

1. A single version of the truth

2. Our truth is now upstream from the data warehouse

3. The hairball of point-to-point connections has been unravelled

4. Local loops have been unbundled

Coming up to the end of 2016, unified log architectures have seen extremely rapid and widespread adoption

Transactional data processing and the move to micro-services

In parallel, we have seen a steady (if spotty) rejection of software monoliths for transactional workloads

In a micro-services architecture, the individual capabilities of the system are split out into separate services

Synchronous communication using request and response (often using RESTful HTTP or RPC)
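The request/response coupling can be shown with a toy in-process stand-in for an HTTP or RPC call; the service names here are hypothetical.

```python
def inventory_service(sku):
    # Stand-in for a remote service behind REST/RPC.
    stock = {"sku-1": 3}
    if sku not in stock:
        raise LookupError(f"unknown sku {sku}")
    return {"sku": sku, "available": stock[sku]}


def checkout_service(sku):
    # The caller blocks on the callee and fails when it fails:
    # checkout cannot proceed without a live response from
    # inventory -- a direct runtime dependency.
    response = inventory_service(sku)
    return response["available"] > 0
```

This direct dependency between caller and callee is what the asynchronous, stream-mediated style discussed later removes.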

What do synchronous micro-services give us?

1. Strong module boundaries

• Network boundaries between modules can be helpful for larger teams

2. Independent deployment

• Deploy individual micro-services independently

• Simpler to deploy and less likely to cause whole-system failures

3. Support diversity

• Can use the best language/framework/database for the capability

• Reduces monoculture risk (anti-fragile)

Convergence on asynchronous or event-driven micro-services

When we re-architected Snowplow around the unified log 2.5 years ago, we designed it around small, composable workers

Diagram from our February 2014 Snowplow v0.9.0 release post

This was based on the insight that real-time pipelines can be composed a little like Unix pipes

We avoided monolithic Spark Streaming or Storm jobs, based on our experiences with “heavy” Hadoop jobs in our batch pipeline
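The Unix-pipes intuition above can be sketched with plain Python generators, one per worker; the stage names are illustrative, not Snowplow’s actual components.

```python
# Each worker reads from one stream and writes to the next,
# like stages in a shell pipeline: collect | enrich | sink.

def collect(raw_lines):
    # Stand-in for a collector: wrap raw input as events.
    for line in raw_lines:
        yield {"raw": line}


def enrich(events):
    # Stand-in for enrichment: derive a new field per event.
    for e in events:
        e["enriched"] = e["raw"].upper()
        yield e


def sink(events):
    # Stand-in for a sink: materialize the enriched values.
    return [e["enriched"] for e in events]


print(sink(enrich(collect(["page_view", "add_to_cart"]))))
```

Each stage is small, independently testable, and replaceable without touching its neighbours, which is exactly the property wanted from composable stream workers.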

[Diagram: small composable workers chained together by Kinesis streams.]

What we didn’t do: an inner Storm topology

We wanted to avoid the “inner topology” effect, with effectively two tiers of topology to reason about

1. Difficult to unit test the inner topologies – complex behaviours inside each unit

2. Difficult to operationalize the inner topologies – how do they handle backpressure, how do they scale, how do we upgrade them?

3. Difficult to monitor the inner topologies

Fundamental problem: the event streams in an inner topology are not first-class entities

It worked: today the Snowplow real-time pipeline is a collection of individual event-driven micro-services

Stream Collector

Stream Enrich

Kinesis S3

Kinesis Elasticsearch

Kinesis Tee

Kinesis Redshift (design stage)

User’s AWS Lambda function

User’s KCL worker app

User’s Spark Streaming job

Meanwhile, the Kafka team (now at Confluent) were seeing something interesting in the adoption of Kafka:

“the most compelling applications for stream processing are actually pretty different from what you would typically do with a Hive or Spark job—they are closer to being a kind of asynchronous microservice rather than being a faster version of a batch analytics job. … What I mean is that these stream processing apps were most often software that implemented core functions in the business rather than computing analytics about the business.” [1]

[1] http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/

And these micro-services were substituting not just for batch analytical workloads, but also for transactional workloads

Why are async micro-services replacing “classic” request/response micro-services (at least amongst Kafka users)?

To start with, asynchronous micro-services have the same benefits as synchronous micro-services…

1. Strong module boundaries

• Network boundaries between modules can be helpful for larger teams

2. Independent deployment

• Deploy individual micro-services independently

• Easier to deploy and less likely to cause whole-system failures

3. Support diversity

• Can use the best language/framework/database for the capability

• Reduces monoculture risk (anti-fragile)

… but in addition, event-driven asynchronous micro-services are extremely loosely-coupled, because they are intermediated by first class streams

Compare this to request/response synchronous micro-services,which have dependencies between upstream and downstream

[Diagram: a single-page web app calls downstream login, notifications and content services, which in turn depend on customer profile, personalization and ad services – a chain of direct upstream/downstream dependencies.]
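The loose coupling that first-class streams buy can be sketched with a toy in-memory stream; producer and consumer names are hypothetical.

```python
class Stream:
    """Toy first-class stream: producers append, any number of
    consumers read independently (in-memory, single process)."""

    def __init__(self):
        self.records = []

    def publish(self, event):
        self.records.append(event)

    def subscribe(self):
        # Each consumer gets its own independent view of the stream.
        return iter(list(self.records))


orders = Stream()
orders.publish({"order_id": 1, "total": 42})

# Adding new downstream consumers requires no change to the producer,
# and no extra load on it:
fraud_alerts = [e for e in orders.subscribe() if e["total"] > 40]
email_receipts = [e["order_id"] for e in orders.subscribe()]
```

The producer knows nothing about its consumers; each new service just subscribes to the stream, which is why the coupling stays loose.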

If you can afford the (small) latency tax, there are some clear advantages in going asynchronous

1. Much better toolkit for upgrades – schema evolution, running old and new service versions in parallel etc.

2. Adding new downstream services doesn’t increase the load on upstream services

3. Failure of individual services introduces lag into the overall system, rather than overall system failure

4. Easier to debug, because service inputs and outputs are directly inspectable in the event streams

So what’s next in this convergence?


More and more tooling for building asynchronous micro-services, including the unstoppable rise of “serverless”

• Stream processing as a library: Kinesis Client Library

• Event-driven cloud functions: AWS Lambda, IBM OpenWhisk, Azure Functions

Greater adoption of open source schema registries as the canonical source of truth for the events in our topologies

Confluent or Iglu schema registries
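An Iglu-style self-describing event carries a schema URI alongside its data, which a consumer resolves against the registry before validating; the vendor and event names below are hypothetical.

```python
# A self-describing JSON event in the Iglu style: the "schema" URI
# names the vendor, event, format and version of its JSON Schema.
event = {
    "schema": "iglu:com.acme/checkout_started/jsonschema/1-0-0",
    "data": {"cart_value": 99.5, "currency": "EUR"},
}


def schema_key(e):
    # Parse the schema URI so a consumer can fetch exactly the right
    # schema version from the registry.
    vendor, name, fmt, version = e["schema"][len("iglu:"):].split("/")
    return vendor, name, version


print(schema_key(event))
```

Because every event names its own schema, any service on the topology can validate and evolve independently, with the registry as the canonical source of truth.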

1. Request/response micro-services aren’t going away – they are just too useful

• But expect a move from slower HTTP-based to faster RPC-based options, e.g. Google’s gRPC

• These are easier to read from and write to in high-volume event-driven architectures

2. Expect wider adoption of API definition languages

• OpenAPI (Swagger), RAML, API Blueprint

3. Eventual harmonization of types?

• Using JSON for RESTful APIs, Protocol Buffers for RPC and Avro for stream processing is crazy

• Needs company-wide standardization, or dynamic translation (but this is lossy)

We also need new fabrics – or extensions of existing ones like Kubernetes – to address the challenges of running our topologies

“How do we monitor this topology, and alert if something (data loss; event lag) is going wrong?”

“How do we scale our streams and micro-services to handle event peaks and troughs smoothly?”

“How do we re-configure or upgrade our micro-services without breaking things?”

At Snowplow we are working on a unified log fabric, called Tupilak, to solve this problem

Questions?

http://snowplowanalytics.com

https://github.com/snowplow/snowplow

@snowplowdata

To meet up or chat, @alexcrdean on Twitter or alex@snowplowanalytics.com
