Building a Lambda Architecture with Elasticsearch at Yieldbot

May 06, 2014

Richard Shea, CTO

@shearic

David White, Platform Architect

@dtabwhite

Slide 2

Lambda Architecture Summary

Batch computation layer (canonical example: Hadoop -> HBase)

Real-time computation layer (canonical example: Storm -> Cassandra)

Serving layer (query HBase, query Cassandra, mix and return)

Slide 3

Our Use Case

Clickstreams of Events (pageviews, impressions, clicks, etc.)

Events contain attributes

Aggregating Counts and Performance

Breakdowns by Several Dimensions

Slide 4

Our Prior Approach

Two different types of systems

Two different access patterns

Query ability limited

Batch (HBase)

Real-time (Redis)

Slide 5

Kafka

Persisted event queue

Consumers keep track of offset

Horizontally scalable, topics can be partitioned, etc.
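
The offset idea above can be illustrated with a toy in-memory model. This is a minimal sketch and not the Kafka client API: the class names PartitionedLog and Consumer, and the choice of CRC32 for partitioning, are assumptions made for illustration only.

```python
import zlib


class PartitionedLog:
    """Toy stand-in for a partitioned topic: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, event):
        # Route by key so events sharing a key stay in order within one partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(event)
        return partition

    def read(self, partition, offset, max_events=100):
        # The log never deletes on read; it only serves a slice from an offset.
        return self.partitions[partition][offset:offset + max_events]


class Consumer:
    """The consumer, not the log, remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offsets = [0] * len(log.partitions)

    def poll(self):
        events = []
        for partition, offset in enumerate(self.offsets):
            batch = self.log.read(partition, offset)
            events.extend(batch)
            self.offsets[partition] = offset + len(batch)
        return events

    def rewind(self, partition, offset):
        # Because events are persisted, a consumer can replay from any offset.
        self.offsets[partition] = offset
```

Since the events stay in the log, rewinding an offset is enough to replay them, which is the property a re-indexing job would rely on.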

Slide 6

Real-time Layer of Lambda with ES

Daily Index of “raw” events – each event is a document

Elasticsearch Kafka River to index

Real-time processing is trivial, just indexing events

Aggregation of Real-time info pushed to query-time
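
As a rough illustration of "each event is a document" in a daily index, the sketch below routes events to an index named after their day and renders them as newline-delimited bulk index lines. The "events-YYYY.MM.DD" naming, the "ts" field (epoch milliseconds), and the function names are assumptions, not details from the talk; older Elasticsearch releases also expect a document type alongside the index name.

```python
import json
from datetime import datetime, timezone


def daily_index(timestamp_ms, prefix="events"):
    """Name of the daily index that an event with this timestamp belongs to."""
    day = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{prefix}-{day:%Y.%m.%d}"


def bulk_lines(events):
    """Render events as bulk-style pairs: an action line, then the document."""
    lines = []
    for event in events:
        action = {"index": {"_index": daily_index(event["ts"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(event))
    # Bulk bodies are newline-delimited and end with a trailing newline.
    return "\n".join(lines) + "\n"
```

No aggregation happens at index time here, which matches the point that real-time processing is just indexing.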

Slide 7

Batch Layer of Lambda with ES

Monthly Index of Aggregated Data Documents

Hourly re-index of events from the archive, which covers real-time issues

Aggregate desired breakdowns into documents

When done, note most recent hour completed
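
The hourly batch steps above might look roughly like this. It is a sketch under stated assumptions: events are dicts with a "ts" field in epoch milliseconds and a "type" field, the "aggregates-YYYY.MM" index naming is hypothetical, and the state dict stands in for wherever the most recent completed hour is actually recorded.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(timestamp_ms):
    """Truncate a timestamp to its UTC hour, e.g. '2014-05-06T13'."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{moment:%Y-%m-%dT%H}"


def aggregate_hour(events, dimensions):
    """Count events per (hour, type, dimension values) and emit one doc per group."""
    counts = Counter()
    for event in events:
        key = (hour_bucket(event["ts"]), event["type"])
        key += tuple(event.get(d) for d in dimensions)
        counts[key] += 1
    docs = []
    for (hour, event_type, *values), count in counts.items():
        doc = {"hour": hour, "type": event_type, "count": count}
        doc.update(zip(dimensions, values))
        docs.append(doc)
    return docs


def monthly_index(hour, prefix="aggregates"):
    # hour looks like '2014-05-06T13'; its first seven characters are year-month.
    return f"{prefix}-{hour[:7].replace('-', '.')}"


def run_hourly_batch(archived_events, hour, dimensions, state):
    """Aggregate one hour from the archive, then record that hour as completed."""
    events = [e for e in archived_events if hour_bucket(e["ts"]) == hour]
    docs = aggregate_hour(events, dimensions)
    # A real job would index docs into monthly_index(hour) before this point.
    state["last_completed_hour"] = hour
    return docs
```

Recording the completed hour only after the documents are written gives the serving side a safe boundary between aggregated and raw data.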

Slide 8

Serving Layer of Lambda with ES

Query Aggregated Data Documents as much as possible

Query Raw events from last aggregated available to present

Combine Aggregated and Raw query results together and return

We use Node.js, natural fit
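
The combine step can be sketched as follows, in Python rather than the Node.js the slide mentions. The function names, the half-open millisecond ranges, and the "count" field on aggregated documents are assumptions for illustration; the two inputs stand in for the results of the aggregated-document query and the raw-event query.

```python
from collections import Counter


def split_range(start_ms, end_ms, aggregated_until_ms):
    """Split [start_ms, end_ms) at the point where aggregated data stops.

    Returns (aggregated_range, raw_range); either may be None if empty.
    """
    boundary = min(max(aggregated_until_ms, start_ms), end_ms)
    aggregated = (start_ms, boundary) if boundary > start_ms else None
    raw = (boundary, end_ms) if end_ms > boundary else None
    return aggregated, raw


def combine(aggregated_docs, raw_events, dimension):
    """Merge pre-aggregated counts with counts taken directly from raw events."""
    totals = Counter()
    for doc in aggregated_docs:
        totals[doc[dimension]] += doc["count"]
    for event in raw_events:
        totals[event[dimension]] += 1
    return dict(totals)
```

The split keeps the two queries from overlapping, so no event is counted twice when the results are merged.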

Slide 9

Why Elasticsearch?

Real-time

- real-time is simple

- calculations query-time and flexible

Batch

- some pre-calculation

- query-time ties it together

Serving

- queries are flexible

- batch and real-time query access patterns similar

Slide 10

More Elasticsearch Goodies

Kibana

- Mostly real-time events

- Aggregated documents useful too

Snapshotting for backups

Real-time data daily indexes are optimized

Slide 11

Future

ES Aggregations

Split cluster with Tribe Nodes

Aggregation via Spark

Slide 12

Good Lessons

Use index aliases

Build in operational plan to re-index

doc_values for raw events and high cardinality query results
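
One way index aliases help with re-indexing is that an alias can be moved from an old index to a rebuilt one in a single request. The sketch below only builds the request body for the Elasticsearch aliases endpoint (POST /_aliases); the alias and index names are hypothetical, and nothing here is taken from Yieldbot's actual setup.

```python
import json


def alias_swap(alias, old_index, new_index):
    """Body for POST /_aliases that moves an alias from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: queries keep using "aggregates-current" throughout.
body = alias_swap("aggregates-current", "aggregates-2014.05-v1", "aggregates-2014.05-v2")
print(json.dumps(body, indent=2))
```

Because both actions travel in one request, readers of the alias do not see a gap between the old and new index.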

Thank You
