How to analyze billions of events in real-time?
[email protected]
Co-Founder & Product Manager
Lambda architecture for real-time streaming analytics

Realtime streaming architecture in INFINARIO



Agenda

• Goals & requirements
• Design patterns for streaming analytics
– General idea
– Lambda
– Kappa
• INFINARIO backend
• Discussion

Example: Let's build a funnel fast
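To make the funnel example concrete, here is a minimal sketch of a funnel calculation over a list of events. The event shape `(user_id, event_type, timestamp)`, the function name `funnel_counts`, and the sample data are assumptions for illustration; this is not the INFINARIO implementation.

```python
from collections import defaultdict


def funnel_counts(events, steps):
    """Count users who reached each funnel step, in order.

    events: iterable of (user_id, event_type, timestamp) tuples (assumed shape).
    steps:  ordered list of event types that define the funnel.
    """
    progress = defaultdict(int)  # user_id -> number of funnel steps completed
    # Process events in time order so steps only count in sequence.
    for user_id, event_type, _ts in sorted(events, key=lambda e: e[2]):
        reached = progress[user_id]
        if reached < len(steps) and event_type == steps[reached]:
            progress[user_id] = reached + 1
    # counts[i] = number of users who completed steps[0..i]
    return [sum(1 for n in progress.values() if n > i) for i in range(len(steps))]


# Hypothetical sample data.
events = [
    ("u1", "visit", 1), ("u1", "signup", 2), ("u1", "purchase", 3),
    ("u2", "visit", 1), ("u2", "purchase", 2),
    ("u3", "signup", 1), ("u3", "visit", 2), ("u3", "signup", 3),
]
print(funnel_counts(events, ["visit", "signup", "purchase"]))  # [3, 2, 1]
```

The sketch scans every event once per query, which is the cost the rest of the talk is about keeping low at billions of events.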

Requirements

• VELOCITY
– Process a never-ending stream of “events” in real-time
• VARIETY AT SPEED
– Analyses! Not just predefined reports
• VOLUME
– Be able to reprocess a stream; retain data
• RELIABILITY
– Never lose an event
• AVAILABILITY
– Avoid downtime

DESIGN PATTERNS FOR REAL-TIME STREAMING ANALYTICS

LET'S LOOK OUTSIDE

Real Time Streaming Architecture

• Source Systems
– Syslog
– Machine Data
– External Streams
– Other
• Data Collection: Flume / Custom
– Agent A, Agent B, ... Agent N
• Messaging System: Kafka
– Topic A, Topic B, ... Topic N
• Real Time Processing: Storm
– Topology A, Topology B, ... Topology N
• Storage
– Search: Elastic Search / Solr
– Low Latency NoSQL: HBase
– Historic: Hive / HDFS
• Access
– Web Services
– REST API
– Web Apps
– Analytic Tools
– R / Python
– BI Tools
– Alerting Systems

Apache Kafka

• publish-subscribe messaging for real-time feeds
• retains data for a configurable period of time
• immutable message queue (events)
• high-throughput, low-latency
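The properties above can be illustrated with a toy in-memory log. This is not the Kafka API; the class `RetainedLog` and its methods are hypothetical names. It models only an append-only message list and independent per-subscriber offsets, and leaves out time-based retention, partitions and persistence.

```python
class RetainedLog:
    """Toy append-only log with per-subscriber offsets (not the Kafka API)."""

    def __init__(self):
        self._messages = []  # messages are appended, never modified
        self._offsets = {}   # subscriber name -> next offset to read

    def publish(self, message):
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, subscriber, max_messages=10):
        """Return the next batch for this subscriber and advance its offset."""
        start = self._offsets.get(subscriber, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch

    def rewind(self, subscriber, offset=0):
        """Move a subscriber back so retained messages can be reprocessed."""
        self._offsets[subscriber] = offset


log = RetainedLog()
for name in ["visit", "signup", "purchase"]:
    log.publish({"type": name})

first_read = log.poll("analytics")   # all three messages
alerts_read = log.poll("alerts", 2)  # independent offset: first two messages
log.rewind("analytics")
replayed = log.poll("analytics")     # the same three messages again
```

Because reading does not remove messages, any subscriber can rewind and reprocess, which is the property the Kappa pattern relies on.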

Lambda Architecture

• New Data
– Data Stream
• Batch Layer
– All Data
– Pre-compute Views
• Speed Layer
– Stream Processing
– Real Time View
• Serving Layer
– Batch View
– Batch View
• Data Access
– Query

http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38774
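As a rough illustration of how the Lambda layers fit together, the sketch below counts events per type. All names (`batch_view`, `SpeedLayer`, `query`) and the event shape are assumptions made for this example; real deployments would use separate batch and stream processing systems.

```python
from collections import Counter


def batch_view(all_events):
    """Batch layer: recompute the view from the complete master dataset."""
    return Counter(event["type"] for event in all_events)


class SpeedLayer:
    """Speed layer: incrementally covers events the last batch run has not seen."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["type"]] += 1


def query(batch, realtime, event_type):
    """Serving layer: merge the batch view and the real-time view at query time."""
    return batch[event_type] + realtime[event_type]


# Master dataset at the time of the last batch run (hypothetical data).
master = [{"type": "login"}, {"type": "purchase"}, {"type": "login"}]
batch = batch_view(master)

# New events arrive: stored unchanged for the next batch run
# and also fed to the speed layer.
speed = SpeedLayer()
for event in [{"type": "login"}, {"type": "purchase"}]:
    master.append(event)
    speed.on_event(event)

print(query(batch, speed.realtime_view, "login"))  # 3
```

Note that the counting logic exists twice, once in `batch_view` and once in `SpeedLayer.on_event`; at system scale this duplication is the main cost of the pattern.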

Components for Lambda

• Batch layer components
• Speed layer components
• Serving layer components

http://lambda-architecture.net/

Lambda pros & cons

• Pros
– Combines real-time & batch processing
– Retains input data unchanged
– Allows reprocessing of the data
– Stores intermediate stages
• Cons
– 2 apps in 2 languages that do the same thing
– Implement, maintain & debug the code twice
– Say goodbye to system-specific features

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

Kappa Architecture

• Data Source
– Data Stream
• Stream Processing System
– Job Version n
– Job Version n + 1
• Serving DB
– Output table n
– Output table n + 1
• Data Access
– Query

1. Use Kafka, which retains the full log of data to reprocess and allows multiple subscribers.
2. Reprocessing: a new instance of the processing job starts from the beginning of the log and outputs to a new table.
3. When the second job has caught up, switch the application to read from the new table.
4. Stop the old version of the job, and delete the old output table.

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
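The four reprocessing steps can be sketched in a few lines. The retained log is modeled as a plain list and the serving DB as a dict of tables; `run_job`, `job_v1`, `job_v2` and the table names are hypothetical and stand in for real streaming jobs and storage.

```python
def run_job(job, log):
    """Replay the whole retained log through one version of the job."""
    table = {}
    for event in log:
        job(table, event)
    return table


def job_v1(table, event):
    """Version n: count events per user."""
    table[event["user"]] = table.get(event["user"], 0) + 1


def job_v2(table, event):
    """Version n + 1: changed logic, sum the amount per user."""
    table[event["user"]] = table.get(event["user"], 0) + event.get("amount", 0)


# Step 1: the full log is retained (here a list stands in for Kafka).
log = [{"user": "a", "amount": 5}, {"user": "b"}, {"user": "a", "amount": 7}]

serving = {"output_v1": run_job(job_v1, log)}
active = "output_v1"  # table the application currently reads

# Step 2: a new job instance reprocesses from the start into a new table.
serving["output_v2"] = run_job(job_v2, log)

# Step 3: once the new job has caught up, switch reads to the new table.
active = "output_v2"

# Step 4: stop the old job and delete its output table.
del serving["output_v1"]

print(serving[active])  # {'a': 12, 'b': 0}
```

Between steps 2 and 4 both output tables exist at once, so the serving store temporarily holds two versions of the results.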

Kappa pros & cons

• Pros
– Allows people to develop, test, debug, and operate their systems on top of a single processing framework
• Cons
– Needs 2x total storage (2 versions of results)
– Requires a DB that handles high-volume writes

INFINARIO Architecture (now)

• EVENT API
• EVENT STREAM
• IN MEMORY PROCESSING (IMF™)
• PERSISTENT STORAGE (NoSQL)
• LOAD HISTORY AFTER RESTART
• QUERIES
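The "load history after restart" path in the diagram can be illustrated with a small sketch. A JSON-lines file stands in for the persistent NoSQL store, and the class `InMemoryCounts` is a hypothetical name; none of this is the actual IMF code.

```python
import json
import os
import tempfile


class InMemoryCounts:
    """Keeps per-type event counts in memory, rebuilt from stored history."""

    def __init__(self, history_path):
        self.history_path = history_path
        self.counts = {}
        self._load_history()  # replay persisted events after a (re)start

    def _load_history(self):
        if not os.path.exists(self.history_path):
            return
        with open(self.history_path) as history:
            for line in history:
                self._apply(json.loads(line))

    def _apply(self, event):
        self.counts[event["type"]] = self.counts.get(event["type"], 0) + 1

    def track(self, event):
        """Persist the event first, then update the in-memory state."""
        with open(self.history_path, "a") as history:
            history.write(json.dumps(event) + "\n")
        self._apply(event)


path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
node = InMemoryCounts(path)
for name in ["visit", "visit", "purchase"]:
    node.track({"type": name})

restarted = InMemoryCounts(path)  # simulates a process restart
print(restarted.counts)           # {'visit': 2, 'purchase': 1}
```

In this sketch a restart replays every stored event, so reload time grows with the size of the history.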

IMF™

• “In-Memory (event processing) Framework”

• Collect, store and analyze events and players

• Distributed & scalable
– Built on NodeJS and C++
– Nodes per CPU core & proportion of RAM
– Provides API for analyses

IMF Benchmarking

[Chart: time to calculate funnel (s), axis 0 to 1400, against # of events in database (100,000; 1,000,000; 10,000,000; 100,000,000). Legend: BlinkBytes, Mongo, TokuMX, Postgres, MySQL, IMF. Data labels (s): 0.004, 0.007, 0.039, 0.349; 0.243, 2.354, 23.894, 262.784; 0.349, 2.593, 25.245, 284.803; 0.202, 2.28, 522.518; 1.609, 86.233, 1273.985.]

https://infinario.com/speedtest

Our experience

It’s lightning fast

Cheap reprocess

No intermediate results

Easy life

Can process already processed stream (“streaming”)

x A code change or a new node means reloading IMF

x Reloads can take too long

x PB of RAM in 2015 is a joke

Reloads

• NoSQL eats too many resources (CPU time)

• Can potentially lose some events

• Reload time (NoSQL to IMF) grows fast

• Analyses are unavailable during reload

INFINARIO is like this

• Source Systems
– SDKs
– BULK
– Frontend
• Data Collection: Custom API
– Agent A, Agent B, ... Agent N
• Messaging System
• Real Time Processing: IMF
– Topology A, Topology B, ... Topology N
• Storage
– Historic: NoSQL
• Access
– Web Services
– REST API
– Web Apps
– Analytic Tools
– R / Python
– BI Tools
– Alerting Systems

Questions

• Lambda?

• Kappa?

• Kafka?

• Technologies for components?

INFINARIO Architecture Updated

• EVENT STREAM
• KAFKA
• RELOAD
• IN MEMORY PROCESSING
• PERSISTENT STORAGE
• RAW DATA HISTORY VIEW (×2)
• Ad hoc DM
• APP
• LOW LATENCY Access

AngularJS developer wanted

Our designers work much faster than our frontend team. Could you help? Email us: [email protected]