2014-05-06 Presentation to Boston Elasticsearch Meetup on Yieldbot's use of Elasticsearch in a Lambda Architecture
May 06, 2014
Building a Lambda Architecture with Elasticsearch at Yieldbot
Richard Shea, CTO
@shearic
David White, Platform Architect
@dtabwhite
Slide 2
Lambda Architecture Summary
Batch computation layer (canonical e.g. Hadoop -> HBase)
Real-time computation layer (canonical e.g. Storm -> Cassandra)
Serving layer (query HBase, query Cassandra, mix and return)
Slide 3
Our Use Case
Clickstreams of Events (pageviews, impressions, clicks, etc.)
Events contain attributes
Aggregating Counts and Performance
Breakdowns by Several Dimensions
Slide 4
Our Prior Approach
Two different types of systems
Two different access patterns
Query ability limited
Batch (HBase)
Real-time (Redis)
Slide 5
Kafka
Persisted event queue
Consumers keep track of offset
Horizontally scalable, topics can be partitioned, etc.
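
As a rough illustration of the offset model on this slide, the toy below mimics a persisted, append-only log in which each consumer remembers its own read position. It is plain Python, not the Kafka client API; `PartitionLog`, `Consumer`, and the `indexer`/`archiver` roles are hypothetical names invented for this sketch.

```python
class PartitionLog:
    """Toy append-only log standing in for one partition of a topic."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Store an event and return the offset it was written at."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset, max_events=100):
        """Return up to max_events events starting at the given offset."""
        return self._events[offset:offset + max_events]


class Consumer:
    """Toy consumer: the log keeps the data, the consumer keeps the offset."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_events=100):
        batch = self.log.read(self.offset, max_events)
        self.offset += len(batch)  # advance only past what was actually read
        return batch


log = PartitionLog()
for kind in ["pageview", "impression", "click"]:
    log.append({"type": kind})

indexer = Consumer(log)    # hypothetical real-time indexer
archiver = Consumer(log)   # hypothetical second reader of the same log
first = indexer.poll(max_events=2)
rest = indexer.poll()
replay = archiver.poll()   # independent offset, so it sees all three events
```

Because the log is persisted and the position lives with the consumer, a reader that falls behind or restarts can resume from its last offset without affecting other readers.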
Slide 6
Real-time Layer of Lambda with ES
Daily Index of “raw” events – each event is a document
Elasticsearch Kafka River to index
Real-time processing is trivial, just indexing events
Aggregation of Real-time info pushed to query-time
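
The idea on this slide (store each raw event as a document in a per-day index, and defer aggregation to query time) can be sketched as follows. This is an in-memory stand-in rather than Elasticsearch itself; the index name pattern `events-YYYY.MM.DD` and the field names `ts`, `type`, and `page` are assumptions made for the example, not taken from the talk.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def daily_index_name(ts, prefix="events"):
    """Map an epoch-seconds timestamp to a per-day index name (assumed pattern)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{prefix}-{day:%Y.%m.%d}"


# In-memory stand-in for the cluster: index name -> list of event documents.
indexes = defaultdict(list)


def index_event(event):
    """Real-time 'processing' is just storing the raw event as one document."""
    indexes[daily_index_name(event["ts"])].append(event)


def count_by(index_name, field):
    """Aggregation is deferred to query time: count documents per field value."""
    return Counter(doc[field] for doc in indexes[index_name])


index_event({"ts": 1399377600, "type": "pageview", "page": "/home"})
index_event({"ts": 1399377660, "type": "click", "page": "/home"})
index_event({"ts": 1399377720, "type": "pageview", "page": "/about"})

today = daily_index_name(1399377600)
counts = count_by(today, "type")
```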
Slide 7
Batch Layer of Lambda with ES
Monthly Index of Aggregated Data Documents
Hourly re-index of events from the archive, which covers real-time issues
Aggregate desired breakdowns into documents
When done, note most recent hour completed
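
A minimal sketch of the hourly batch step described above, assuming events carry an epoch-seconds `ts` and a `type` field and that breakdowns are taken over a list of dimension fields. The document shape (`hour`, the dimension fields, and a nested `counts` object) is hypothetical; the talk does not show the actual aggregated document layout.

```python
HOUR = 3600


def aggregate_hour(events, hour_start, dimensions):
    """Roll one hour of raw events up into aggregated documents.

    Returns one document per distinct combination of dimension values.
    """
    docs = {}
    for event in events:
        if hour_start <= event["ts"] < hour_start + HOUR:
            key = tuple(event[d] for d in dimensions)
            if key not in docs:
                docs[key] = {"hour": hour_start, "counts": {}}
                docs[key].update(zip(dimensions, key))
            counts = docs[key]["counts"]
            counts[event["type"]] = counts.get(event["type"], 0) + 1
    return list(docs.values())


archived = [
    {"ts": 1399377600, "type": "pageview", "page": "/home"},
    {"ts": 1399377700, "type": "click", "page": "/home"},
    {"ts": 1399378000, "type": "pageview", "page": "/home"},
    {"ts": 1399381200, "type": "pageview", "page": "/home"},  # next hour, skipped
]

hour_start = 1399377600
docs = aggregate_hour(archived, hour_start, ["page"])

# Once the hour's documents are written, record it so the serving layer
# knows where aggregated data ends and raw events must take over.
last_completed_hour = hour_start
```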
Slide 8
Serving Layer of Lambda with ES
Query Aggregated Data Documents as much as possible
Query raw events from the last aggregated hour available to the present
Combine Aggregated and Raw query results together and return
We use Node.js, natural fit
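
The combine step above might look roughly like this. The talk's serving layer is written in Node.js and queries Elasticsearch; this sketch uses Python and in-memory lists purely to show the merge logic, and the function name, document shape, and field names are all assumptions.

```python
def serve_counts(aggregated_docs, raw_events, last_completed_hour, hour=3600):
    """Merge pre-aggregated hourly counts with counts of newer raw events."""
    boundary = last_completed_hour + hour  # first second not covered by batch
    totals = {}

    # Use aggregated documents for everything the batch layer has finished.
    for doc in aggregated_docs:
        if doc["hour"] <= last_completed_hour:
            for event_type, n in doc["counts"].items():
                totals[event_type] = totals.get(event_type, 0) + n

    # Count raw events only after the boundary to avoid double counting.
    for event in raw_events:
        if event["ts"] >= boundary:
            totals[event["type"]] = totals.get(event["type"], 0) + 1

    return totals


aggregated = [
    {"hour": 1399377600, "page": "/home", "counts": {"pageview": 2, "click": 1}},
]
raw = [
    {"ts": 1399378000, "type": "pageview"},  # already inside the aggregated hour
    {"ts": 1399381200, "type": "pageview"},  # newer than the batch boundary
    {"ts": 1399381300, "type": "click"},
]

totals = serve_counts(aggregated, raw, last_completed_hour=1399377600)
```

The boundary comes from the "most recent hour completed" marker that the batch layer records, which is what keeps the two sources from overlapping.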
Slide 9
Why Elasticsearch?
Real-time
- calculations query-time and flexible
- real-time is simple
Batch
- some pre-calculation
- query-time ties it together
Serving
- queries are flexible
- batch and real-time query access patterns similar
Slide 10
More Elasticsearch Goodies
Kibana
- Mostly real-time events
- Aggregated documents useful too
Snapshotting for backups
Real-time data daily indexes are optimized
Slide 11
Future
ES Aggregations
Split cluster with Tribe Nodes
Aggregation via Spark
Slide 12
Good Lessons
Use index aliases
Build in operational plan to re-index
doc_values for raw events and high cardinality query results
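
To make the alias and re-index lessons concrete: if queries go through an alias, a re-index can build a fresh index and then repoint the alias in one request. The sketch below only builds the request body for Elasticsearch's `_aliases` endpoint; the alias and index names are hypothetical, and sending the request is left to whatever HTTP client is in use.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves an alias in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: readers query "aggregates", never a versioned index.
body = alias_swap_body("aggregates", "aggregates-2014.05-v1", "aggregates-2014.05-v2")
payload = json.dumps(body)  # send as the request body of POST /_aliases
```

Both actions are applied together, so readers of the alias should not observe a moment where it points at neither index.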
Thank You