Asynchronous micro-services and the unified log
Crunch Conference, Budapest, 7th October 2016
Introducing myself
• Alexander Dean
• Co-founder and technical lead at Snowplow, the open-source event data pipeline
• Weekend writer of Unified Log Processing, available on the Manning Early Access Program
• Co-author at Snowplow of Iglu, our open-source schema registry system, and Sauna, our open-source decisioning and response platform
We are witnessing the convergence of two separate technology tracks towards asynchronous or event-driven micro-services:
• Transactional workloads: software monoliths → synchronous micro-services → asynchronous micro-services
• Analytical workloads: classic data warehousing → hybrid data pipelines → unified log architectures
Analytical workloads, and the rise of the unified log
A quick history lesson: the three eras of business data processing [1]
1. Classic data warehousing, 1996+
2. Hybrid data pipelines, 2005+
3. Unified log architectures, 2013+
[1] http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
The era of classic data warehousing, 1996+
[Diagram: in the company’s own data center, narrow data silos (CMS, CRM, E-comm, ERP) each run their own low-latency local loops; point-to-point connections feed a nightly batch ETL process into the data warehouse, which offers wide data coverage and full data history for management reporting, but at high latency.]
The era of hybrid data pipelines, 2005+
[Diagram: narrow data silos now span the cloud vendor / own data center (search, e-comm, CRM, ERP, CMS) plus multiple SaaS vendors (email marketing, web analytics), each with its own low-latency local loop. APIs and bulk exports feed a mix of stream processing (product rec’s), micro-batch processing (systems monitoring) and batch processing into Hadoop and the data warehouse for management reporting and ad hoc analytics; some paths are low latency, others high latency.]
The hybrid era: a surfeit of software vendors
The hybrid era: company-wide reporting and analytics ends up like Rashomon
The bandit’s story vs. the wife’s story vs. the samurai’s story vs. the woodcutter’s story
The hybrid era: the number of data integrations is unsustainable
So how do we unravel the hairball?
The advent of the unified log, 2013+
[Diagram: the remaining silos (search, e-comm, CRM, ERP, CMS, plus SaaS email marketing) feed the unified log via streaming APIs / web hooks. The log holds a few days’ data history at low latency with wide data coverage; its eventstream drives systems monitoring, product rec’s, fraud detection and churn prevention, while archiving to Hadoop provides wide data coverage and full data history (at high latency) for ad hoc analytics and management reporting.]
The unified log is Amazon Kinesis, or Apache Kafka
• Amazon Kinesis, a hosted AWS service, with extremely similar semantics to Kafka
• Apache Kafka, an append-only, distributed, ordered commit log, developed at LinkedIn to serve as their organization’s unified log
“Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1]
[1] http://kafka.apache.org/
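The core abstraction can be illustrated with a toy model. This is a hypothetical sketch only (real Kafka and Kinesis partition, replicate and persist their logs); here a topic is just an in-memory list whose positions serve as offsets:

```python
from collections import defaultdict

class UnifiedLog:
    """Toy model of an append-only, ordered commit log (Kafka/Kinesis-style)."""

    def __init__(self):
        self._topics = defaultdict(list)

    def append(self, topic, event):
        """Append an event; its offset is its position in the topic."""
        self._topics[topic].append(event)
        return len(self._topics[topic]) - 1  # offset of the new event

    def read(self, topic, from_offset=0):
        """Any number of consumers can replay the topic from any offset."""
        return list(enumerate(self._topics[topic]))[from_offset:]

log = UnifiedLog()
log.append("checkout", {"event": "add_to_basket", "sku": "sp-001"})
log.append("checkout", {"event": "purchase", "sku": "sp-001"})

history = log.read("checkout")              # full ordered history
tail = log.read("checkout", from_offset=1)  # replay from offset 1
```

The key properties the slide names, append-only and ordered, fall out of the list-with-offsets model: events are never mutated in place, and every consumer sees them in the same order.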
So what does a unified log give us?
1. A single version of the truth
2. Our truth is now upstream from the data warehouse
3. The hairball of point-to-point connections has been unravelled
4. Local loops have been unbundled
As we come up to the end of 2016, unified log architectures have seen extremely rapid and widespread adoption
Transactional data processing and the move to micro-services
In parallel, we have seen a steady (if spotty) rejection of software monoliths for transactional workloads
In a micro-services architecture, the individual capabilities of the system are split out into separate services
Synchronous communication using request and response (often using RESTful HTTP or RPC)
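The request/response pattern can be sketched in a few lines. This is a hypothetical example (the `inventory_service` and `checkout_service` names are invented for illustration); the point is that the caller blocks on the callee, so availability is coupled between upstream and downstream:

```python
# Hypothetical sketch: a synchronous call chain between two micro-services.
# In production these would be separate processes talking over HTTP or RPC;
# a plain function call models the blocking semantics.

def inventory_service(sku):
    """Downstream service: answers a stock query."""
    return {"sku": sku, "in_stock": True}

def checkout_service(sku):
    """Upstream service: cannot proceed until inventory_service responds."""
    response = inventory_service(sku)
    return "confirmed" if response["in_stock"] else "backordered"

result = checkout_service("sp-001")  # blocks until the downstream call returns
```

If `inventory_service` is slow or down, `checkout_service` is slow or down with it, which is exactly the coupling the asynchronous style relaxes later in this talk.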
What do synchronous micro-services give us?
1. Strong module boundaries
• Network boundaries between modules can be helpful for larger teams
2. Independent deployment
• Deploy individual micro-services independently
• Simpler to deploy and less likely to cause whole-system failures
3. Support diversity
• Can use the best language/framework/database for the capability
• Reduces monoculture risk (anti-fragile)
Convergence on asynchronous or event-driven micro-services
When we re-architected Snowplow around the unified log 2.5 years ago, we designed it around small, composable workers
Diagram from our February 2014 Snowplow v0.9.0 release post
This was based on the insight that real-time pipelines can be composed a little like Unix pipes
We avoided monolithic Spark Streaming or Storm jobs, based on our experiences with “heavy” Hadoop jobs in our batch pipeline
[Diagram: small composable workers chained together via Kinesis streams, each worker reading from one stream and writing to the next.]
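The Unix-pipe analogy can be sketched with Python generators, where each worker consumes one event stream and produces another. This is an illustrative toy only; the worker names (`collect`, `enrich`, `sink`) echo the Snowplow component names but the logic is invented:

```python
# Hypothetical sketch: each micro-service is a generator that consumes one
# event stream and emits another, so workers compose like Unix pipes.

def collect(raw_lines):
    """'Collector' stage: turn raw lines into event dicts."""
    for line in raw_lines:
        yield {"raw": line}

def enrich(events):
    """'Enrich' stage: add derived fields to each event."""
    for e in events:
        yield {**e, "chars": len(e["raw"])}

def sink(events):
    """Terminal stage: materialize the stream (a stand-in for a storage sink)."""
    return list(events)

raw = ["page_view /home", "add_to_basket sp-001"]
out = sink(enrich(collect(raw)))  # collector | enricher | sink
```

Because each stage only sees an iterable of events, stages can be tested in isolation and recombined freely, which is the composability argument the slide makes.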
What we didn’t do: an inner Storm topology
We wanted to avoid the “inner topology” effect, with effectively two tiers of topology to reason about
1. Difficult to unit test the inner topologies – complex behaviours inside each unit
2. Difficult to operationalize the inner topologies – how do they handle backpressure, how do they scale, how do we upgrade them?
3. Difficult to monitor the inner topologies
Fundamental problem: the event streams in an inner topology are not first-class entities
It worked: today the Snowplow real-time pipeline is a collection of individual event-driven micro-services:
• Stream Collector
• Stream Enrich
• Kinesis S3
• Kinesis Elasticsearch
• Kinesis Tee
• Kinesis Redshift (design stage)
• User’s AWS Lambda function
• User’s KCL worker app
• User’s Spark Streaming job
Meanwhile, the Kafka team (now at Confluent) were seeing something interesting in the adoption of Kafka…
“the most compelling applications for stream processing are actually pretty different from what you would typically do with a Hive or Spark job—they are closer to being a kind of asynchronous microservice rather than being a faster version of a batch analytics job. … What I mean is that these stream processing apps were most often software that implemented core functions in the business rather than computing analytics about the business.” [1]
[1] http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
And these micro-services were substituting not just for batch analytical workloads, but also for transactional workloads
Why are async micro-services replacing “classic” request/response micro-services (at least amongst Kafka users)?
To start with, asynchronous micro-services have the same benefits as synchronous micro-services…
1. Strong module boundaries
• Network boundaries between modules can be helpful for larger teams
2. Independent deployment
• Deploy individual micro-services independently
• Easier to deploy and less likely to cause whole-system failures
3. Support diversity
• Can use the best language/framework/database for the capability
• Reduces monoculture risk (anti-fragile)
… but in addition, event-driven asynchronous micro-services are extremely loosely coupled, because they are intermediated by first-class streams
Compare this to request/response synchronous micro-services, which have dependencies between upstream and downstream
[Diagram: an upstream single-page web app with synchronous dependencies fanning out through a login service, notifications service and content service to a customer profile service, personalization service and ad service downstream.]
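The decoupling effect of a first-class stream can be sketched as a fan-out: the producer writes each event once, and any number of downstream services read it independently. This is a hypothetical toy (a Python list stands in for a Kinesis stream or Kafka topic, and the consumer names are invented):

```python
# Hypothetical sketch: one producer, many independent consumers of a stream.

stream = []  # stands in for a Kinesis stream / Kafka topic

def producer(event):
    """One write to the stream, regardless of how many consumers exist."""
    stream.append(event)

def fraud_detection(events):
    """One downstream consumer: flag large purchases."""
    return [e for e in events if e.get("amount", 0) > 1000]

def management_reporting(events):
    """Another downstream consumer: total revenue, reading the same stream."""
    return sum(e.get("amount", 0) for e in events)

producer({"event": "purchase", "amount": 1500})
producer({"event": "purchase", "amount": 20})

flagged = fraud_detection(stream)     # each consumer reads independently;
total = management_reporting(stream)  # adding one adds no producer load
```

Contrast this with the synchronous diagram above: there, adding a new downstream service means another blocking call on the upstream; here, it means another reader of a stream the producer never knows about.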
If you can afford the (small) latency tax, there are some clear advantages in going asynchronous:
1. Much better toolkit for upgrades – schema evolution, running old and new service versions in parallel etc.
2. Adding new downstream services doesn’t increase the load on upstream services
3. Failure of individual services introduces lag into the overall system, rather than overall system failure
4. Easier to debug, because service inputs and outputs are directly inspectable in the event streams
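The first advantage, schema evolution with old and new service versions running in parallel, can be sketched by tagging each event with a schema version and dispatching on it. This is an illustrative toy; the handler names and version strings are invented (though the `MODEL-REVISION-ADDITION` shape echoes SchemaVer-style versioning):

```python
# Hypothetical sketch: versioned events let v1 and v2 handlers coexist on
# one stream during a rolling upgrade, with no flag day.

def handle_v1(event):
    """Old service version: only knows the original fields."""
    return {"user": event["user_id"]}

def handle_v2(event):
    """New service version: also reads a field added in schema 1-1-0."""
    return {"user": event["user_id"], "country": event.get("country", "??")}

HANDLERS = {"1-0-0": handle_v1, "1-1-0": handle_v2}

def dispatch(event):
    """Route each event to the handler matching its schema version."""
    return HANDLERS[event["schema"]](event)

old = {"schema": "1-0-0", "user_id": "u1"}
new = {"schema": "1-1-0", "user_id": "u2", "country": "HU"}
```

Because events carry their own schema version, old and new producers can write to the same stream while consumers are upgraded independently.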
So what’s next in this convergence?
More and more tooling for building asynchronous micro-services, including the unstoppable rise of “serverless”
• Stream processing as a library: Kinesis Client Library
• Event-driven cloud functions: AWS Lambda, IBM OpenWhisk, Azure Functions
Greater adoption of open source schema registries as the canonical source of truth for the events in our topologies
Confluent or Iglu schema registries
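The registry-as-source-of-truth idea can be sketched with a lookup keyed by Iglu-style schema URIs (vendor/name/format/version). This is a hypothetical toy; a real registry (Confluent or Iglu) serves full JSON Schemas over HTTP, where here validation is reduced to a required-field check:

```python
# Hypothetical sketch: a schema registry maps Iglu-style URIs to schemas,
# so every service validates events against the same canonical definitions.

REGISTRY = {
    "iglu:com.acme/checkout/jsonschema/1-0-0": {"required": ["sku", "amount"]},
}

def validate(event):
    """Look up the event's schema in the registry and check required fields."""
    schema = REGISTRY[event["schema"]]
    missing = [f for f in schema["required"] if f not in event["data"]]
    return missing == []

good = {"schema": "iglu:com.acme/checkout/jsonschema/1-0-0",
        "data": {"sku": "sp-001", "amount": 49}}
bad = {"schema": "iglu:com.acme/checkout/jsonschema/1-0-0",
       "data": {"sku": "sp-001"}}
```

Because every event names its schema, any service in the topology can validate or interpret it without out-of-band agreements between teams.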
Request/response micro-services aren’t going away – they are just too useful
1. But expect a move from slower HTTP-based to faster RPC-based options
• e.g. Google’s gRPC
• These are easier to read from and write to in high-volume event-driven architectures
2. Expect wider adoption of API definition languages
• OpenAPI (Swagger), RAML, API Blueprint
3. Eventual harmonization of types?
• Using JSON for RESTful APIs, Protocol Buffers for RPC and Avro for stream processing is crazy
• Needs company-wide standardization, or dynamic translation (but this is lossy)
We also need new fabrics – or extensions of existing ones like Kubernetes – to address the challenges of running our topologies
“How do we monitor this topology, and alert if something (data loss; event lag) is going wrong?”
“How do we scale our streams and micro-services to handle event peaks and troughs smoothly?”
“How do we re-configure or upgrade our micro-services without breaking things?”
At Snowplow we are working on a unified log fabric, called Tupilak, to solve this problem
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To meet up or chat, @alexcrdean on Twitter or alex@snowplowanalytics.com