"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias Johansson,...

Tobias Johansson @ntjohansson

27/10/2016

Big data analytics

Einstürzenden Neudaten: Building an analytics engine from scratch

What is Valo

• Big data analytics engine
• Focusing on simplicity from a usage perspective
• Single process containing
  • Time-series repository
  • Semi-structured repository
  • Execution engine
  • Etc.
• Written in Scala/C++/Lua
• REST based

PUT /streams/sensors/environment/air

{
  "sampleTime": { "type": "datetime" },
  "sensor": { "type": "contributor" },
  "pollution": { "type": "double" }
}

POST /streams/sensors/environment/air

{
  "sampleTime": "2016/10/27 15:13:00",
  "sensor": "131e90ad-e32a",
  "pollution": 85.6
}
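As a rough illustration of the two calls above, the sketch below builds (but does not send) the PUT and POST requests with the JDK's `java.net.http` API. The base URL `http://localhost:8888` and the object name `ValoRequests` are assumptions made for the example; the slides show only the paths and bodies.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers

object ValoRequests {
  // Hypothetical base URL: the slides show only the request paths.
  val baseUrl = "http://localhost:8888"
  val streamPath = "/streams/sensors/environment/air"

  // Schema body from the PUT slide.
  val schema: String =
    """{
      |  "sampleTime": { "type": "datetime" },
      |  "sensor": { "type": "contributor" },
      |  "pollution": { "type": "double" }
      |}""".stripMargin

  // Event body from the POST slide.
  val event: String =
    """{
      |  "sampleTime": "2016/10/27 15:13:00",
      |  "sensor": "131e90ad-e32a",
      |  "pollution": 85.6
      |}""".stripMargin

  // A fresh builder per request, both sent as JSON.
  private def builder =
    HttpRequest.newBuilder(URI.create(baseUrl + streamPath))
      .header("Content-Type", "application/json")

  // PUT declares the stream schema; POST submits one event to the stream.
  val defineStream: HttpRequest = builder.PUT(BodyPublishers.ofString(schema)).build()
  val publishEvent: HttpRequest = builder.POST(BodyPublishers.ofString(event)).build()
}
```

Against a running server, either request could be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.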

• Data friendly

POST /streams/sensors/environment/air
Content-Type: application/json

POST /streams/sensors/environment/air
Content-Type: application/cbor

POST /streams/sensors/environment/air
Content-Type: application/csv

POST /streams/sensors/environment/air
Content-Type: application/bson

Time-series, semi-structured

• Real-time and historical queries

Looks simple? Trust me, it is not.

Dynamo style clustering and vector-clocks

Eventual consistency

Gossip protocols

Distributed algorithms

Distributed execution engine

Expression trees and runtime code generation

Query rewriting and optimization

Consistent hashing

Time-series repository

Semi-structured repository

Data atomicity

Back pressure

Elasticity

Advanced ML algorithms

IO

Actor systems

Data distribution

Cluster management

B+ trees

Query language

KV-store

REST API

Jump consistent hashing

Off-heap memory

Data formats

Distributed joins

Time semantics

Gap-filling

Statistical models

Distributed CRDTs

Transports

Real-time queries

Know your cluster

It will crash

• You need a cluster to run big data analytics on, but it is based on:
  • Commodity hardware which can fail
  • Unreliable network

• Issues:
  • Unreachable nodes
  • Dropped messages
  • Delayed messages
  • No response
  • Split network
    • Multiple working clusters
    • Mutable state is likely to diverge

• Accept these issues and don't try to fight them. Make life simpler by:
  • Not having a single point of failure
    • No leaders
    • No master/slave
    • No special nodes
  • Making it eventually consistent
    • Use CRDTs for sets, counters, etc.
    • Use vector clocks for configuration
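To make the last two points concrete, here is a minimal sketch of a grow-only counter CRDT and a vector clock. The names `GCounter`, `VectorClock` and `Relation` are illustrative and not taken from Valo; the point is only that replicas can be merged in any order without a leader, and that concurrent configuration changes can be detected rather than silently overwritten.

```scala
object Crdt {
  // Grow-only counter: one slot per node, merged with an element-wise max.
  final case class GCounter(counts: Map[String, Long] = Map.empty) {
    def increment(node: String, by: Long = 1L): GCounter =
      GCounter(counts.updated(node, counts.getOrElse(node, 0L) + by))

    // Merge is commutative, associative and idempotent, so replicas converge.
    def merge(other: GCounter): GCounter =
      GCounter((counts.keySet ++ other.counts.keySet).map { k =>
        k -> math.max(counts.getOrElse(k, 0L), other.counts.getOrElse(k, 0L))
      }.toMap)

    def value: Long = counts.values.sum
  }

  sealed trait Relation
  case object Same extends Relation
  case object Before extends Relation
  case object After extends Relation
  case object Concurrent extends Relation

  // Vector clock: a logical timestamp per node, used to order versions.
  final case class VectorClock(ticks: Map[String, Long] = Map.empty) {
    def tick(node: String): VectorClock =
      VectorClock(ticks.updated(node, ticks.getOrElse(node, 0L) + 1L))

    def merge(other: VectorClock): VectorClock =
      VectorClock((ticks.keySet ++ other.ticks.keySet).map { k =>
        k -> math.max(ticks.getOrElse(k, 0L), other.ticks.getOrElse(k, 0L))
      }.toMap)

    // Concurrent means neither version has seen the other: a conflict to resolve.
    def compare(other: VectorClock): Relation = {
      val keys = ticks.keySet ++ other.ticks.keySet
      val less = keys.exists(k => ticks.getOrElse(k, 0L) < other.ticks.getOrElse(k, 0L))
      val more = keys.exists(k => ticks.getOrElse(k, 0L) > other.ticks.getOrElse(k, 0L))
      (less, more) match {
        case (false, false) => Same
        case (true, false)  => Before
        case (false, true)  => After
        case (true, true)   => Concurrent
      }
    }
  }
}
```

Two nodes that each edit a configuration starting from the same version end up with clocks that compare as `Concurrent`, which is the signal that a merge or conflict resolution step is needed.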

Know your data

• Do not treat all data the same
  • Time-series repository
    • CPU data, market data, ECG
  • Semi-structured repository
    • Log files, emails
  • KV repository
    • Configuration
• Unless you are Oracle or Microsoft, make your data immutable, append-only.
• Streams are facts at points in time, and facts do not change

• Build properties into your data distribution policies. Properties which:
  • Maximise resilience
    • Avoid replicas on the same physical server rack
  • Optimise data locality
  • Minimise number of data transfers required when adding/removing nodes
  • Deterministically tell where data lives in the cluster
    • Where does data for T0 to T1 sit in the cluster?

• Consistent hashing
  • Minimises number of data transfers in the cluster
• Time-based distribution
  • Distribute data in the cluster in second, minute, hour, day buckets

[Figure: two side-by-side grids mapping nodes A to N against numbered segments, with an x marking each segment a node holds; the column positions were lost in extraction.]
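One way to combine the two ideas is sketched below, under assumptions: time is cut into fixed-width buckets, and each (stream, bucket) pair is placed with jump consistent hashing (Lamping and Veach), which the talk lists among its topics. The helpers `segmentOf` and `nodeFor` and the key-mixing step are hypothetical and are not claimed to be how Valo places data.

```scala
object Placement {
  // Jump consistent hash: maps a 64-bit key to a bucket in [0, numBuckets).
  // Growing from n to n + 1 buckets only ever moves keys into the new bucket.
  def jumpHash(key: Long, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    var k = key
    var b = -1L
    var j = 0L
    while (j < numBuckets) {
      b = j
      k = k * 2862933555777941757L + 1L
      j = ((b + 1) * ((1L << 31).toDouble / ((k >>> 33) + 1).toDouble)).toLong
    }
    b.toInt
  }

  // Hypothetical time bucketing: the segment is the timestamp divided by the bucket width.
  def segmentOf(epochSeconds: Long, bucketSeconds: Long): Long =
    Math.floorDiv(epochSeconds, bucketSeconds)

  // Deterministic answer to "where does data for this time sit in the cluster?".
  // The way stream name and segment are mixed into one key is an assumption.
  def nodeFor(stream: String, epochSeconds: Long, bucketSeconds: Long, nodes: Int): Int = {
    val key = stream.hashCode.toLong * 31L + segmentOf(epochSeconds, bucketSeconds)
    jumpHash(key, nodes)
  }
}
```

Because the placement is a pure function of stream, time and cluster size, any node can compute where a time range lives without consulting a coordinator.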

Know your algos

from historical /streams/demo/infrastructure/cpu
select avg(kernel)

[Figure: successive execution diagrams built from multiple Avg nodes.]

Init:     () -> β
Apply:    β -> 'a list -> β
Reduce:   β -> β -> β
Finalise: β -> 'r

class AverageDouble {
  def apply(value: NamedDouble): Unit
  def reset(): Unit
  def merge(state: Parser)
  def restore(state: Parser)
  def getResult: NamedDouble
  def save(gen: Generator)
}
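The Init/Apply/Reduce/Finalise shape can be illustrated with a self-contained average. This is a sketch, not Valo's `AverageDouble`: the state type `AvgState` and the object name are made up for the example. The partial state carries a sum and a count rather than an average, because averages of averages are wrong when partitions differ in size.

```scala
object MergeableAverage {
  // The partial state (β in the notation above) that travels between nodes.
  final case class AvgState(sum: Double, count: Long)

  // Init: () -> β
  val init: AvgState = AvgState(0.0, 0L)

  // Apply: β -> 'a list -> β, folds a batch of local values into the state.
  def apply(state: AvgState, values: Seq[Double]): AvgState =
    AvgState(state.sum + values.sum, state.count + values.size)

  // Reduce: β -> β -> β, combines two partial states from different nodes.
  def reduce(a: AvgState, b: AvgState): AvgState =
    AvgState(a.sum + b.sum, a.count + b.count)

  // Finalise: β -> 'r, with None when no values were seen.
  def finalise(state: AvgState): Option[Double] =
    if (state.count == 0L) None else Some(state.sum / state.count)
}
```

Each node would run `apply` over its own segments, and the partial states can then be reduced in any order or grouping before a single `finalise`.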

Travelling algos

[Figure: Avg operators shown alongside a grid of nodes A to N against numbered segments; the column positions were lost in extraction.]

from historical /streams/demo/infrastructure/itime
group by timeStamp window of 5 minutes every 5 minutes fill last, alpha
select alpha, timeStamp, last(a) as la
partition every 1 hour as implicit
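The windowing and gap-filling in a query like this can be approximated as follows. The sketch assumes that `fill last` means an empty window repeats the value of the previous window, and that windows are aligned to multiples of the window size; both are assumptions about the query language, and `Sample` and `lastPerWindow` are hypothetical names.

```scala
object GapFill {
  final case class Sample(epochSeconds: Long, value: Double)

  // Returns (windowStart, last value in window) for every window between the
  // first and last sample, repeating the previous value for empty windows.
  def lastPerWindow(samples: Seq[Sample], windowSeconds: Long): Seq[(Long, Double)] =
    if (samples.isEmpty) Seq.empty
    else {
      // Latest sample per window index.
      val byWindow: Map[Long, Double] = samples
        .groupBy(s => Math.floorDiv(s.epochSeconds, windowSeconds))
        .map { case (w, ss) => w -> ss.maxBy(_.epochSeconds).value }

      val first = byWindow.keys.min
      val last = byWindow.keys.max

      // The first window always has data, so acc is non-empty whenever a gap is filled.
      (first to last).foldLeft(Vector.empty[(Long, Double)]) { (acc, w) =>
        val v = byWindow.getOrElse(w, acc.last._2)
        acc :+ ((w * windowSeconds, v))
      }
    }
}
```

With 5-minute windows, samples at 0 s, 100 s and 700 s produce three windows, and the empty middle window inherits the value from the first.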

./valo

www.valo.io

Thank you

Meet us at the Startup Area

tobias@valo.io

@ntjohansson

Algos

MicroTickFrequency

MicroVolatility

OnlineMisraGries

Anomaly

Histogram

Bivar

Univar

Skyline

EMA

MovingKurtosis

MovingDerivative

RecursiveEMA

MovingVariance

Average

Sum

TopK

Quantiles

What has brought us here today