jozo-kovac
How to analyze billions of events in real-time?
[email protected] & Product Manager
Lambda architecture for real-time streaming analytics
Agenda
• Goals & requirements
• Design patterns for streaming analytics
– General idea
– Lambda
– Kappa
• INFINARIO backend
• Discussion
Requirements
• VELOCITY
– Process a never-ending stream of “events” in real time
• VARIETY AT SPEED
– Analyses! Not just predefined reports
• VOLUME
– Be able to reprocess a stream; retain data
• RELIABILITY
– Never lose an event
• AVAILABILITY
– Avoid downtime
Real Time Streaming Architecture
• Source Systems: Sources, Syslog, Machine Data, External Streams, Other
• Data Collection: Flume / Custom (Agent A, Agent B, Agent N)
• Messaging System: Kafka (Topic A, Topic B, Topic N)
• Real Time Processing: Storm (Topology A, Topology B, Topology N)
• Storage
– Search: Elastic Search / Solr
– Low Latency NoSql: HBase
– Historic: Hive / HDFS
• Access
– Web Services: REST API
– Web Apps
– Analytic Tools: R / Python
– BI Tools
– Alerting Systems
Apache Kafka
• publish-subscribe messaging for real-time feeds
• retains data for a configurable period of time
• immutable message queue (events)
• high-throughput, low-latency
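The properties above (retained, immutable messages that several subscribers read independently) can be illustrated with a toy in-process model. This is only a sketch for intuition, not Kafka's API: the names `RetainedLog`, `publish`, `poll`, `expire` and `seek_to_beginning` are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Message:
    """An immutable event in the log (frozen: fields cannot be reassigned)."""
    offset: int       # position in the log, never reused
    timestamp: float  # seconds, supplied by the producer in this toy model
    payload: Any


@dataclass
class RetainedLog:
    """Toy append-only log with time-based retention and per-subscriber offsets."""
    retention_seconds: float
    _messages: List[Message] = field(default_factory=list)
    _next_offset: int = 0
    _positions: Dict[str, int] = field(default_factory=dict)

    def publish(self, payload: Any, timestamp: float) -> int:
        """Append a message and return its offset."""
        msg = Message(self._next_offset, timestamp, payload)
        self._messages.append(msg)
        self._next_offset += 1
        return msg.offset

    def expire(self, now: float) -> None:
        """Drop messages older than the retention period."""
        cutoff = now - self.retention_seconds
        self._messages = [m for m in self._messages if m.timestamp >= cutoff]

    def poll(self, subscriber: str, max_messages: int = 100) -> List[Message]:
        """Return the next unread messages for this subscriber only."""
        position = self._positions.get(subscriber, 0)
        batch = [m for m in self._messages if m.offset >= position][:max_messages]
        if batch:
            self._positions[subscriber] = batch[-1].offset + 1
        return batch

    def seek_to_beginning(self, subscriber: str) -> None:
        """Rewind a subscriber so it re-reads everything still retained."""
        self._positions[subscriber] = 0
```

Because reading never removes a message, a subscriber can rewind and reprocess whatever is still within the retention period, which is the property the reprocessing designs later in the deck rely on.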
Lambda Architecture
• New Data: Data Stream, fed to both the Batch Layer and the Speed Layer
• Batch Layer: All Data, Pre-compute Views
• Speed Layer: Stream Processing, Real Time View
• Serving Layer: Batch View, Batch View
• Data Access: Query
http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38774
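To make the layers in the diagram concrete, here is a minimal single-process sketch, assuming the view is a simple count of events per type. The class `LambdaCounts` and the event shape `{"type": ...}` are hypothetical; in a real deployment the batch and speed layers would be separate systems, and events arriving while a batch run is in progress would need more careful handling than the reset shown here.

```python
from collections import Counter
from typing import Dict, List

Event = Dict[str, str]  # hypothetical event shape, e.g. {"type": "purchase"}


class LambdaCounts:
    """Toy Lambda pipeline: counts events per type."""

    def __init__(self) -> None:
        self.master: List[Event] = []        # batch layer input: all data, append-only
        self.batch_view: Counter = Counter()     # precomputed from the master data
        self.realtime_view: Counter = Counter()  # covers events since the last batch run

    def ingest(self, event: Event) -> None:
        """New data goes to both layers."""
        self.master.append(event)                 # retained unchanged for reprocessing
        self.realtime_view[event["type"]] += 1    # speed layer: incremental update

    def run_batch(self) -> None:
        """Recompute the batch view from all data, then reset the real-time view.

        Safe here only because the sketch is single-threaded: no event can
        arrive between the recomputation and the reset.
        """
        self.batch_view = Counter(e["type"] for e in self.master)
        self.realtime_view = Counter()

    def query(self, event_type: str) -> int:
        """Serving side: merge the batch view with the real-time view."""
        return self.batch_view[event_type] + self.realtime_view[event_type]
```

The sketch also shows where the duplication criticized later comes from: the counting logic exists twice, once in `ingest` and once in `run_batch`.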
Components for Lambda
Batch layer components
Speed layer components
Serving layer components
http://lambda-architecture.net/
Lambda pros & cons
• Pros
– Combines real-time & batch processing
– Retains input data unchanged
– Allows the data to be reprocessed
– Stores intermediate stages
• Cons
– 2 apps in 2 languages that do the same thing
– Implement, maintain & debug the code twice
– Say goodbye to system-specific features
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
Kappa Architecture
• Data Source: Data Stream
• Stream Processing System: Job Version n, Job Version n + 1
• Serving DB: Output table n, Output table n + 1
• Data Access: Query
1. Use Kafka, which retains the full log of data for reprocessing and allows multiple subscribers.
2. Reprocessing: a new instance of the processing job processes the log from the start and writes to a new output table.
3. When the second job has caught up, switch the application to read from the new table.
4. Stop the old version of the job and delete the old output table.
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
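The four reprocessing steps can be sketched in a few lines, assuming the retained log is a plain Python list and each output table is a dict. Everything here (`KappaPipeline`, `job_v1`, `job_v2`, the `user` and `amount` fields) is hypothetical and only illustrates the switch-over; it does not use Kafka or any stream processor.

```python
from typing import Callable, Dict, List, Optional

Event = Dict[str, object]
Table = Dict[str, float]
Job = Callable[[Table, Event], None]


def job_v1(table: Table, event: Event) -> None:
    """Job version n: count events per user."""
    user = str(event["user"])
    table[user] = table.get(user, 0) + 1


def job_v2(table: Table, event: Event) -> None:
    """Job version n + 1: sum the amount per user instead."""
    user = str(event["user"])
    table[user] = table.get(user, 0) + float(event["amount"])


class KappaPipeline:
    def __init__(self, log: List[Event]) -> None:
        self.log = log                       # step 1: the full retained log
        self.jobs: Dict[str, Job] = {}
        self.offsets: Dict[str, int] = {}    # next log position per job version
        self.tables: Dict[str, Table] = {}   # one output table per job version
        self.live: Optional[str] = None      # version the application reads from

    def start_job(self, version: str, job: Job) -> None:
        """Step 2: a new job instance starts from offset 0 with a new table."""
        self.jobs[version] = job
        self.offsets[version] = 0
        self.tables[version] = {}
        if self.live is None:
            self.live = version

    def step(self, version: str, max_events: int) -> None:
        """Let one job version process up to max_events from its offset."""
        start = self.offsets[version]
        for event in self.log[start:start + max_events]:
            self.jobs[version](self.tables[version], event)
        self.offsets[version] = min(len(self.log), start + max_events)

    def caught_up(self, version: str) -> bool:
        return self.offsets[version] == len(self.log)

    def switch_to(self, version: str) -> None:
        """Step 3: switch reads to the new table; step 4: drop the old job and table."""
        if not self.caught_up(version):
            raise RuntimeError("new job has not caught up yet")
        old, self.live = self.live, version
        if old is not None and old != version:
            del self.jobs[old], self.offsets[old], self.tables[old]

    def query(self, key: str) -> float:
        return self.tables[self.live].get(key, 0)
```

Both tables exist side by side until the switch, so output storage is temporarily needed for two versions of the results.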
Kappa pros & cons
• Pros
– Allows people to develop, test, debug, and operate their systems on top of a single processing framework
• Cons
– Needs 2x total storage (2 versions of results)
– Requires DB with high volume writes
INFINARIO Architecture (now)
• EVENT API
• EVENT STREAM
• IN MEMORY PROCESSING (IMF™)
• PERSISTENT STORAGE (NoSQL)
• LOAD HISTORY AFTER RESTART
• QUERIES
IMF™
• “In-Memory (event processing) Framework”
• Collect, store and analyze events and players
• Distributed & scalable
– Built on NodeJS and C++
– Nodes per CPU core & proportion of RAM
– Provides API for analyses
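As an illustration of the kind of analysis an in-memory event store makes cheap, here is a small funnel calculation over events grouped by player. It is a hedged sketch only: IMF itself is built on NodeJS and C++ and its API is not shown in these slides, so `InMemoryEvents`, `track` and `funnel` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class InMemoryEvents:
    """Toy in-memory store: events kept per player as (timestamp, type) pairs."""

    def __init__(self) -> None:
        self._by_player: Dict[str, List[Tuple[float, str]]] = defaultdict(list)

    def track(self, player_id: str, event_type: str, timestamp: float) -> None:
        """Collect and store one event for a player."""
        self._by_player[player_id].append((timestamp, event_type))

    def funnel(self, steps: Sequence[str]) -> List[int]:
        """Count, per step, the players who reached it after all earlier steps, in order."""
        reached = [0] * len(steps)
        for events in self._by_player.values():
            next_step = 0
            # Walk the player's events in time order, advancing one step at a time.
            for _, event_type in sorted(events):
                if next_step < len(steps) and event_type == steps[next_step]:
                    reached[next_step] += 1
                    next_step += 1
        return reached
```

A single pass over each player's events is enough, which is why keeping all events in memory makes such ad hoc analyses fast.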
IMF Benchmarking
[Chart: time to calculate funnel (s), y-axis 0 to 1400, against # of events in database (100,000; 1,000,000; 10,000,000; 100,000,000). Legend: IMF, BlinkBytes, Mongo, TokuMX, Postgres, MySQL. Data labels in extraction order: 0.004, 0.007, 0.039, 0.349; 0.243, 2.354, 23.894, 262.784; 0.349, 2.593, 25.245, 284.803; 0.202, 2.28, 522.518; 1.609, 86.233, 1273.985.]
https://infinario.com/speedtest
Our experience
It’s lightning fast
Cheap reprocessing; no immediate results; easy life
Can reprocess an already processed stream (“streaming”)
x A code change or adding a new node means reloading IMF
x Reloads can take too long
x PB of RAM in 2015 is a joke
Reloads
• NoSQL eats too many resources (CPU time)
• Can potentially lose some events
• Reload time (NoSQL to IMF) grows fast
• Analyses are unavailable during reload
INFINARIO is like this
• Source Systems: Sources, SDKs, BULK, Frontend
• Data Collection: Custom API (Agent A, Agent B, Agent N)
• Messaging System
• Real Time Processing: IMF (Topology A, Topology B, Topology N)
• Storage
– Historic: NoSQL
• Access
– Web Services: REST API
– Web Apps
– Analytic Tools: R / Python
– BI Tools
– Alerting Systems
INFINARIO Architecture Updated
• LOW LATENCY Access
• IN MEMORY PROCESSING
• PERSISTENT STORAGE
• KAFKA
• RELOAD
• EVENT STREAM
RAW DATA HISTORY VIEW
RAW DATA HISTORY VIEW
Ad hoc
DM
APP
AngularJS developer wanted
Our designers work much faster than our frontend team. Could you help? Email us: [email protected]