Upload
nathan-bijnens
View
116
Download
11
Embed Size (px)
DESCRIPTION
Citation preview
A real-time architecture using Hadoop and Storm.
A real-time architecture using Hadoop & Storm. #JaxLondon 2
Speaker
Nathan Bijnens@nathan_gs
A real-time architecture using Hadoop & Storm. #JaxLondon 3
Our Vision
Big Data
test
Volume
A real-time architecture using Hadoop & Storm. #JaxLondon 4
Big Data
test
Velocity
A real-time architecture using Hadoop & Storm. #JaxLondon 5
Our Vision
Volume
test
Variety
A real-time architecture using Hadoop & Storm. #JaxLondon 6
Computing Trends
Source: Immutability Changes Everything - Pat Helland, RICON2012
Computation (CPUs) Expensive
Disk Storage Expensive
Coordination Easy(Latches Don t Often Hit)
DRAM Expensive
Computation Cheap (Many Core Computers)
Disk Storage Cheap(Cheap Commodity Disks)
Coordination Hard(Latches Stall a Lot, etc)
DRAM / SSD Getting Cheap
Past Current
A real-time architecture using Hadoop & Storm. #JaxLondon 7
Credits
Nathan MarzEx-Backtype & Twitter
Startup in Stealthmode
Storm
Cascalog
ElephantDB
manning.com/marz
A real-time architecture using Hadoop & Storm. #JaxLondon 8
A Data System
A real-time architecture using Hadoop & Storm. #JaxLondon 9
Not all information is equal. Some information is derived from other pieces of
information.
Data is more than Information
A real-time architecture using Hadoop & Storm. #JaxLondon 10
Eventually you will reach the most
This is the information you hold true, simple because it exists.
Data is more than Information
A real-time architecture using Hadoop & Storm. #JaxLondon 11
Events used to manipulate the master data.
Events - Before
A real-time architecture using Hadoop & Storm. #JaxLondon 12
Today, events are the master data.
Events - After
A real-time architecture using Hadoop & Storm. #JaxLondon 13
everything.
Data System
A real-time architecture using Hadoop & Storm. #JaxLondon 14
Data is Immutable
Events
A real-time architecture using Hadoop & Storm. #JaxLondon 15
Data is Time Based
Events
A real-time architecture using Hadoop & Storm. #JaxLondon 16
Capturing change traditionally
Person Location
Nathan Antwerp
Geert Dendermonde
John Ghent
Person Location
Nathan Ghent
Geert Dendermonde
John Ghent
A real-time architecture using Hadoop & Storm. #JaxLondon 17
Capturing change
Person Location Time
Nathan Antwerp 2005-01-01
Geert Dendermonde 2011-10-08
John Ghent 2010-05-02
Nathan Ghent 2013-02-03
Person Location Timestamp
Nathan Antwerp 2005-01-01
Geert Dendermonde 2011-10-08
John Ghent 2010-05-02
A real-time architecture using Hadoop & Storm. #JaxLondon 18
The data you query is often transformed, aggregated, ...
Query
A real-time architecture using Hadoop & Storm. #JaxLondon 19
Query
Query = function ( all data )
A real-time architecture using Hadoop & Storm. #JaxLondon 20
Number of people living in each city.
Person Location Time
Nathan Antwerp 2005-01-01
Geert Dendermonde 2011-10-08
John Ghent 2010-05-02
Nathan Ghent 2013-02-03
Location Count
Ghent 2
Dendermonde 1
A real-time architecture using Hadoop & Storm. #JaxLondon 22
Query
All Data Query
A real-time architecture using Hadoop & Storm. #JaxLondon 23
Query: Precompute
All Data QueryPrecomputed
View
A real-time architecture using Hadoop & Storm. #JaxLondon 24
Layered Architecture
Speed Layer
Batch Layer
Serving Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 25
Layered Architecture
HadoopElephant
DB
Qu
ery
Incoming Data
Cassandra
A real-time architecture using Hadoop & Storm. #JaxLondon 26
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 27
Batch Layer
HadoopElephant
DB
Incoming Data
A real-time architecture using Hadoop & Storm. #JaxLondon 28
Unrestrained computation.
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 29
No need to De-Normalize.
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 30
Horizontal scalable.
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 31
High Latency.
matter.
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 32
Functional computation, based on immutable inputs, is idempotent.
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 33
Stores master copy of data set...
Batch Layer
append only.
A real-time architecture using Hadoop & Storm. #JaxLondon 34
Batch Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 35
Batch: View generation
Master Dataset
View #1
View #3
View #2MapReduce
A real-time architecture using Hadoop & Storm. #JaxLondon 36
1. Take a large data set and divide it into subsets
2. Perform the same function on all subsets
3. Combine the output from all subsets
…
…
Output
MA
PR
ED
UC
E
MapReduce
DoWork() DoWork() DoWork()…
A real-time architecture using Hadoop & Storm. #JaxLondon 37
Serialization & Schema
Catch errors as quickly as they happen. Validation on write vs on read.
A real-time architecture using Hadoop & Storm. #JaxLondon 38
Serialization & Schema
CSV is actually a serialization language that is just poorly defined.
A real-time architecture using Hadoop & Storm. #JaxLondon 39
Serialization & Schema
Use a format with a schema.- Thrift
- Avro
- Protobuffers
A real-time architecture using Hadoop & Storm. #JaxLondon 40
Read only database.No random writes required.
Batch View Database
A real-time architecture using Hadoop & Storm. #JaxLondon 41
Every iteration produces the Views from scratch.
Batch View Database
A real-time architecture using Hadoop & Storm. #JaxLondon 42
Batch View Database
ElephantDB
Splout
Voldemort
A real-time architecture using Hadoop & Storm. #JaxLondon 44
Batch Layer
Not yet absorbed.Data absorbed into Batch Views
Time No
w
Just a few hours of data.
A real-time architecture using Hadoop & Storm. #JaxLondon 45
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 46
Overview
HadoopElephant
DB
Incoming Data
Cassandra
A real-time architecture using Hadoop & Storm. #JaxLondon 47
Stream processing.
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 48
Continuous computation.
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 49
Transactional.
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 50
Storing a limited window of data.Compensating for the last few hours of data.
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 51
All the complexity is isolated in the Speed layer.
-corrected.
Speed Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 52
CAP
You have a choice between:
Availability- Queries are eventual consistent.
Consistency- Queries are consistent.
A real-time architecture using Hadoop & Storm. #JaxLondon 53
Some algorithms are hard to implement in real time. For those cases we could
estimate the results.
Eventual accuracy
A real-time architecture using Hadoop & Storm. #JaxLondon 54
Speed Layer
Incoming Data
Real Time
View 1
Real Time
View 2
A real-time architecture using Hadoop & Storm. #JaxLondon 55
Storm
Message passing.
Distributed processing.
Horizontally scalable.
Incremental algorithms.
Fast.
Data in motion.
A real-time architecture using Hadoop & Storm. #JaxLondon 56
Storm
Nimbus Zookeeper
Worker Node
Supervisor
Executer
Executer
Executer
Worker Node
Supervisor
Executer
Executer
Executer
Worker Node
Supervisor
Executer
Executer
Executer
A real-time architecture using Hadoop & Storm. #JaxLondon 57
Storm
Tuple
Stream
A real-time architecture using Hadoop & Storm. #JaxLondon 58
Storm
Spout
Bolt
A real-time architecture using Hadoop & Storm. #JaxLondon 59
Storm
Grouping
A real-time architecture using Hadoop & Storm. #JaxLondon 60
Data Ingestion
Kafka
Flume
Scribe
*MQ
Kestrel
A real-time architecture using Hadoop & Storm. #JaxLondon 61
Speed Layer Views
The views are stored in Read & Write database.- Cassandra
- Hbase
- Redis
- MySQL
- ElasticSearch
-
Much more complex than a read only view.
A real-time architecture using Hadoop & Storm. #JaxLondon 62
Serving Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 63
Overview
HadoopElephant
DB
Qu
ery
Incoming Data
Cassandra
A real-time architecture using Hadoop & Storm. #JaxLondon 64
Random reads
Serving Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 65
This layer queries the Batch & Real Time views and merges it.
Serving Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 66
Serving Layer
Real Time Views
Merge
Batch Views
A real-time architecture using Hadoop & Storm. #JaxLondon 67
How to query an Average?
Serving Layer
A real-time architecture using Hadoop & Storm. #JaxLondon 68
Overview
A real-time architecture using Hadoop & Storm. #JaxLondon 69
Overview
HadoopElephant
DB
Qu
ery
Incoming Data
Cassandra
A real-time architecture using Hadoop & Storm. #JaxLondon 70
Lambda Architecture
A real-time architecture using Hadoop & Storm. #JaxLondon 71
Lambda Architecture
Can discard any view, batch and real time, and just recreate everything from the master
data.
A real-time architecture using Hadoop & Storm. #JaxLondon 72
Lambda Architecture
Mistakes are corrected via recomputation. Write bad data? Remove the data & recompute.
Bug in view generation? Just recompute the view.
A real-time architecture using Hadoop & Storm. #JaxLondon 73
Lambda Architecture
Data storage is highly optimized.
A real-time architecture using Hadoop & Storm. #JaxLondon 74
Lambda Architecture
Immutability changes everything.
A real-time architecture using Hadoop & Storm. #JaxLondon 75
Questions?
Questions?@nathan_gs & #BigDataCon13
A real-time architecture using Hadoop & Storm. #JaxLondon 76
DataCrunchers
We enable companies in envisioning, defining and implementing a data strategy.
A one-stop-shop for all your Big Data needs.
The first Big Data Consultancy agency in Belgium.
A real-time architecture using Hadoop & Storm. #JaxLondon 77
Thank you
Thank you@nathan_gs