REAL-TIME AND BIG DATA
Mahmoud M. Jalajel
OUTLINE
• Intro: Real-time with Big Data
• The Lambda Architecture
• The Relay Model
WHY SOLVE FOR REAL-TIME
• Real-time offers more business value
• Live Web Analytics
• Recommendations
• Real-time = (semi-) realtime
• Event to index ~ single digit minutes
• Query duration ~ single digit seconds
REAL-TIME IMPLEMENTATION
• Incremental Implementation
• Stream processing / No full data context
• A real-time implementation is:
• Far more useful
• Faster
• Easily adaptable to batch mode
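A minimal sketch of what "incremental" means here, assuming a running average as the metric (the class and function names are illustrative, not from the talk). Each event updates a small piece of state, so no full data context is needed, and batch mode is just the same code replayed over stored data:

```python
from dataclasses import dataclass


@dataclass
class RunningMean:
    """Incrementally maintained mean: constant-size state per metric."""
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> None:
        # Stream processing: fold one event into the state and move on.
        self.count += 1
        self.total += value

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def batch_mean(values):
    """Batch mode: replay a stored dataset through the incremental code."""
    acc = RunningMean()
    for v in values:
        acc.update(v)
    return acc.mean
```

The reverse adaptation (batch to real-time) is usually harder, because batch code tends to assume the whole dataset is available at once.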
REAL-TIME IN HADOOP
MongoDB Query Time (optimized, single-node)
Hive Query Time (5 nodes)
Hangs, crashes, starts begging for mercy then commits suicide and weepingly dies
A few hours
2 Seconds 15 Minutes
LAMBDA ARCHITECTURE
BASIC ASSUMPTIONS
1. Query = Function(All Data)
2. Data are immutable, time-based facts
3. Append-Only (CRUD becomes CR)
4. Human Fault-Tolerance
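The assumptions above can be sketched as follows, assuming a hypothetical "where does this user live" dataset (the `Fact` fields, `record`, and `current_city` are made up for illustration). Facts are only ever appended, and a query is a plain function over all of them:

```python
from typing import NamedTuple


class Fact(NamedTuple):
    timestamp: int  # when the fact became true (illustrative integer clock)
    user: str
    city: str


master = []  # append-only master dataset


def record(fact: Fact) -> None:
    # Create only: existing facts are never updated or deleted.
    master.append(fact)


def current_city(facts, user):
    """Query = Function(All Data): the most recent fact for the user wins."""
    relevant = [f for f in facts if f.user == user]
    return max(relevant, key=lambda f: f.timestamp).city if relevant else None
```

Usage: after `record(Fact(1, "amal", "Amman"))` and `record(Fact(2, "amal", "Dubai"))`, `current_city(master, "amal")` returns `"Dubai"`, while the older fact stays in the dataset. Because raw facts are kept, a buggy query can be fixed and re-run over the same data, which is one way to read "human fault-tolerance".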
THE BATCH LAYER
• Accepts stream of data
• Appends to master dataset
• Uses: HDFS
THE SERVING LAYER
• Precomputes different views
• Works on full dataset
• Refreshes regularly offline
• Batch views are usually stored in a key-value store
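A rough sketch of precomputing a batch view, assuming page-view events shaped as `{"url": ..., "day": ...}` dicts (an assumption for illustration) and a plain `dict` standing in for the key-value store:

```python
from collections import Counter


def build_batch_view(master_dataset):
    """Recompute the whole view from scratch over the full dataset."""
    view = Counter()
    for event in master_dataset:
        # Key the view by the dimensions queries will ask for.
        view[(event["url"], event["day"])] += 1
    return dict(view)  # stand-in for a key-value store
```

Each offline refresh rebuilds the view from the full master dataset and replaces the previous one, so errors in an earlier run do not accumulate.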
CHECKPOINT
• Typical Hadoop Setup
• Slow, inefficient
• Outdated: usually lags by hours or days
• Although accurate for surveyed data
• Costly to re-run. Real-time is not an option
THE SPEED LAYER
• Works with recent data
• Complements results
• Incremental implementation
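A minimal sketch of a speed-layer view, under the same assumed event shape (`{"url": ..., "day": ...}`). The class name and the day-based expiry are simplifications for illustration, not a description of any particular product:

```python
from collections import defaultdict


class RealtimeView:
    """Holds counts only for recent events the batch views do not cover yet."""

    def __init__(self):
        self.counts = defaultdict(int)

    def on_event(self, event):
        # Incremental update: one event in, one counter bumped.
        self.counts[(event["url"], event["day"])] += 1

    def expire(self, covered_days):
        """Drop entries for days that a finished batch run now covers."""
        for key in [k for k in self.counts if k[1] in covered_days]:
            del self.counts[key]
```

Expiring covered entries keeps the real-time state small and lets the accurate batch result take over.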
THE FULL PICTURE
Query Merging
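A sketch of query merging for an additive metric such as a count, assuming both views are dict-like and cover disjoint sets of events (the function name is illustrative):

```python
def merged_count(batch_view, realtime_view, key):
    """Answer a query by combining the batch result with the recent delta."""
    # Valid only if no event is counted in both views.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)
```

Usage: with `batch_view = {"/home": 100}` and `realtime_view = {"/home": 7}`, `merged_count(batch_view, realtime_view, "/home")` returns `107`. Metrics that are not additive (unique visitors, for example) need a metric-specific merge step.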
EXAMPLE TECHNOLOGIES
DRUID EXAMPLE
REVIEWING LA
PROs
• Modular
• Flexible
• Self-Auditing
• Proven components
CONs
• Complex
• Maintainability
• Query Merging
THE RELAY MODEL
RELAY MODEL
Query Merging
THE WORKFLOW
REVIEWING RM
PROs
• Coherent, simpler than LA
• Extensible to full LA
• Cheaper
CONs
• Master Data Storage
• Query flexibility
WHY NOT HADOOP NOW?
• Too much time, no capacity
• Too soon or too late
• Too expensive
• Hammer/nail problem
CONCLUSIONS
• Think big data, now!
• No need to invest years of development to perfect a big data system.
• Start now! Gradually grow system requirements and engineering skill set
• Select scalable components
Mahmoud Jalajel – @mjalajel
Questions?
APPENDIX
Apache Kafka
Apache Storm
Apache Storm with external systems