"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias Johansson,...

Tobias Johansson @ntjohansson

27/10/2016

Big data analytics

Einstürzenden Neudaten: Building an analytics engine from scratch

What is Valo

• Big data analytics engine
• Focusing on simplicity from a usage perspective
• Single process containing
  • Time-series repository
  • Semi-structured repository
  • Execution engine
  • Etc.
• Written in Scala/C++/Lua

What is Valo

• REST based

PUT /streams/sensors/environment/air

{
  "sampleTime": { "type": "datetime" },
  "sensor": { "type": "contributor" },
  "pollution": { "type": "double" }
}

POST /streams/sensors/environment/air

{
  "sampleTime": "2016/10/27 15:13:00",
  "sensor": "131e90ad-e32a",
  "pollution": 85.6
}
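A minimal sketch of the two requests above, built from Scala with the JDK's `java.net.http` API. The path and JSON bodies come from the slide; the base URL is a hypothetical placeholder (the slides give no host or port), and the object and value names are illustrative only.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers

object ValoRestSketch {
  // Hypothetical base URL: the slides do not state a host or port.
  val baseUrl = "http://localhost:8888"
  // Stream path as shown on the slide.
  val streamPath = "/streams/sensors/environment/air"

  // Schema body for the PUT request (field names and types from the slide).
  val schemaJson: String =
    """{
      |  "sampleTime": { "type": "datetime" },
      |  "sensor": { "type": "contributor" },
      |  "pollution": { "type": "double" }
      |}""".stripMargin

  // Event body for the POST request (values from the slide).
  val eventJson: String =
    """{
      |  "sampleTime": "2016/10/27 15:13:00",
      |  "sensor": "131e90ad-e32a",
      |  "pollution": 85.6
      |}""".stripMargin

  // Builds a JSON request against the stream path; nothing is sent here.
  private def jsonRequest(method: String, body: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(baseUrl + streamPath))
      .header("Content-Type", "application/json")
      .method(method, BodyPublishers.ofString(body))
      .build()

  val createStream: HttpRequest = jsonRequest("PUT", schemaJson)
  val publishEvent: HttpRequest = jsonRequest("POST", eventJson)
}
```

Sending is then a call such as `HttpClient.newHttpClient().send(ValoRestSketch.createStream, HttpResponse.BodyHandlers.ofString())`, assuming a node is reachable at the placeholder URL.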

What is Valo

• Data friendly

POST /streams/sensors/environment/air
Content-Type: application/json

POST /streams/sensors/environment/air
Content-Type: application/cbor

POST /streams/sensors/environment/air
Content-Type: application/csv

POST /streams/sensors/environment/air
Content-Type: application/bson

Time-series Semi-structured

What is Valo

• Real-time and historical queries

Looks simple? Trust me, it is not.

Dynamo style clustering and vector-clocks

Eventual consistency

Gossip protocols

Distributed algorithms

Distributed execution engine

Expression trees and runtime code generation

Query rewriting and optimization

Consistent hashing

Time-series repository

Semi-structured repository

Data atomicity

Back pressure

Elasticity

Advanced ML algorithms

Actor systems

Data distribution

Cluster management

B+ trees

Query language

KV-store

REST-api

Jump consistent hashing

Off-heap memory

Data formats

Distributed joins

Time semantics

Gap-filling

Statistical models

Distributed CRDTs

Transports

Realtime queries

Know your cluster

It will crash

Know your cluster

• You need a cluster to run big data analytics on, but it is based on:
  • Commodity hardware, which can fail
  • Unreliable network

Know your cluster

• Issues:
  • Unreachable nodes
  • Dropped messages
  • Delayed messages
  • No response
• Split network
  • Multiple working clusters
  • Mutable state is likely to diverge

Know your cluster

• Accept these issues and don't try to fight them. Make life simpler by:
  • Not having a single point of failure
    • No leaders
    • No master/slave
    • No special nodes
  • Making it eventually consistent
    • Use CRDTs for sets, counters, etc.
    • Use vector-clocks for configuration
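To make the last two bullets concrete, here is a minimal sketch of a grow-only counter CRDT and a vector-clock comparison. Both are textbook constructions and not a description of Valo's implementation; every name in the block is illustrative.

```scala
object ClusterStateSketch {
  // Grow-only counter CRDT: one monotonically increasing slot per node.
  final case class GCounter(slots: Map[String, Long] = Map.empty) {
    def increment(node: String, by: Long = 1L): GCounter =
      GCounter(slots.updated(node, slots.getOrElse(node, 0L) + by))

    def value: Long = slots.values.sum

    // Merge takes the per-node maximum, so it is commutative, associative
    // and idempotent: replicas converge whatever the delivery order.
    def merge(other: GCounter): GCounter =
      GCounter((slots.keySet ++ other.slots.keySet).map { n =>
        n -> math.max(slots.getOrElse(n, 0L), other.slots.getOrElse(n, 0L))
      }.toMap)
  }

  // Possible relations between two versions of a configuration.
  sealed trait Relation
  case object Before extends Relation
  case object After extends Relation
  case object Equal extends Relation
  case object Concurrent extends Relation

  final case class VectorClock(ticks: Map[String, Long] = Map.empty) {
    // A node bumps its own entry whenever it changes the configuration.
    def tick(node: String): VectorClock =
      VectorClock(ticks.updated(node, ticks.getOrElse(node, 0L) + 1L))

    // Concurrent means neither version has seen the other: a real conflict.
    def compare(other: VectorClock): Relation = {
      val nodes = ticks.keySet ++ other.ticks.keySet
      val less = nodes.exists(n => ticks.getOrElse(n, 0L) < other.ticks.getOrElse(n, 0L))
      val more = nodes.exists(n => ticks.getOrElse(n, 0L) > other.ticks.getOrElse(n, 0L))
      (less, more) match {
        case (false, false) => Equal
        case (true, false)  => Before
        case (false, true)  => After
        case (true, true)   => Concurrent
      }
    }
  }
}
```

Neither structure needs a leader: any node can update locally and exchange state later, which is what makes them a fit for a cluster with no special nodes.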

Know your data

• Do not treat all data the same

• Time-series repository
  • CPU data, market data, ECG
• Semi-structured repository
  • Log files, emails
• KV repository
  • Configuration

• Unless you are Oracle or Microsoft, make your data immutable, append only.

• Streams are facts at points in time, and facts do not change

Know your data

• Build properties into your data distribution policies. Properties which:
  • Maximise resilience
    • Avoid replicas on the same physical server rack
  • Optimise data locality
  • Minimise the number of data transfers required when adding/removing nodes
  • Deterministically tell where data lives in the cluster
    • Where does data for T0 to T1 sit in the cluster?

Know your data

• Consistent hashing
  • Minimises number of data transfers in the cluster
• Time-based distribution
  • Distribute data in the cluster in second, minute, hour, day buckets
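One way the two bullets can work together, as a rough sketch: cut time into fixed buckets, then place each bucket on a node with jump consistent hashing (the Lamping and Veach algorithm, used here as one possible consistent-hashing variant). The one-hour bucket width, the single replica per bucket and all names are assumptions for illustration, not Valo's actual policy.

```scala
object PlacementSketch {
  // Jump consistent hash: maps a 64-bit key to a bucket in [0, numBuckets)
  // and relocates only a small share of keys when numBuckets grows.
  def jumpHash(key: Long, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    var k = key
    var b = -1L
    var j = 0L
    while (j < numBuckets) {
      b = j
      k = k * 2862933555777941757L + 1L // 64-bit LCG step, wraps on overflow
      j = ((b + 1) * ((1L << 31).toDouble / ((k >>> 33) + 1).toDouble)).toLong
    }
    b.toInt
  }

  // Hypothetical bucket width: one hour of data per bucket.
  val bucketMillis: Long = 60L * 60L * 1000L

  // Index of the time bucket that contains the given timestamp.
  def bucketOf(epochMillis: Long): Long = Math.floorDiv(epochMillis, bucketMillis)

  // Nodes holding data for [t0, t1], computed without any lookup table.
  // `nodes` is assumed to be the same ordered list on every cluster member.
  def nodesFor(t0: Long, t1: Long, nodes: Vector[String]): Set[String] =
    (bucketOf(t0) to bucketOf(t1)).map(b => nodes(jumpHash(b, nodes.size))).toSet
}
```

Because placement is a pure function of the timestamp and the node list, any node can answer "where does data for T0 to T1 sit in the cluster?" locally. Adding a node only moves the buckets that hash onto the new node.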

Know your data

Node / Segment 1 2 3 4 5 6 8 9   |   Node / Segment 1 2 3 4 5 6 8 9
A  x x                           |   A  x
B  x x x                         |   B  x x
C  x x x x                       |   C  x x x
D  x x x x                       |   D  x x x
E  x x x                         |   E  x x x
F  x x                           |   F  x x x
G  x                             |   G  x x x
K  x x                           |   K  x x x
L  x x                           |   L  x x
M  x                             |   M  x

Know your data

Node / Segment 1 2 3 4 5 6 8 9   |   Node / Segment 1 2 3 4 5 6 8 9
A  x x                           |   A  x
B  x x x                         |   B  x x
C  x x x x                       |   C  x x x
D  x x x                         |   D  x x x
E  x X x                         |   E  x x x
F  x X                           |   F  x x x
G  x X                           |   G  x x x
K  x x                           |   K  x x x
L  x x                           |   L  x x
M  x                             |   M  x

Know your algos

from historical /streams/demo/infrastructure/cpu
select avg(kernel)

Know your algos

[Figure: diagram of Avg operators]

Know your algos

Init: () -> β
Apply: β -> 'a list -> β
Reduce: β -> β -> β
Finalise: β -> 'r

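class AverageDouble {
  // Method bodies are omitted; only the signatures are shown.
  def apply(value: NamedDouble): Unit   // fold one value into the running state
  def reset(): Unit                     // return to the initial state
  def merge(state: Parser): Unit        // combine with another partial state
  def restore(state: Parser): Unit      // load a previously saved state
  def getResult: NamedDouble            // produce the final value
  def save(gen: Generator): Unit        // write out the current state
}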
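The Init/Apply/Reduce/Finalise shape can be sketched for an average as follows. This is an illustrative stand-alone version, not the `AverageDouble` class itself: the state type and function names are assumptions, and plain `Double` values stand in for `NamedDouble`.

```scala
object AverageSketch {
  // Partial state β for an average: a running sum and a count.
  final case class AvgState(sum: Double, count: Long)

  // Init: () -> β
  def init(): AvgState = AvgState(0.0, 0L)

  // Apply: β -> 'a list -> β (fold a batch of local values into the state)
  def applyBatch(state: AvgState, values: Seq[Double]): AvgState =
    AvgState(state.sum + values.sum, state.count + values.size)

  // Reduce: β -> β -> β (combine the partial states of two nodes)
  def reduce(a: AvgState, b: AvgState): AvgState =
    AvgState(a.sum + b.sum, a.count + b.count)

  // Finalise: β -> 'r (None when no values were seen)
  def finalise(state: AvgState): Option[Double] =
    if (state.count == 0L) None else Some(state.sum / state.count)
}
```

Keeping sum and count separate until the final step is what makes the distributed result correct. If one node sees 1, 2, 3 and another sees 10, averaging the per-node averages gives 6.0, while reducing the partial states gives the true 4.0.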

Travelling algos

[Figure: six Avg operators]

Node / Segment 1 2 3 4 5 6 8 9

C x x x

D x x x

E x x x

F x x x

G x x x

K x x x

from historical /streams/demo/infrastructure/itime
group by timeStamp window of 5 minutes every 5 minutes fill last, alpha
select alpha, timeStamp, last(a) as la
partition every 1 hour as implicit
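A rough sketch of what the windowing and gap-filling part of such a query might compute, under one reading of `fill last`: each 5-minute window reports its last value, and a window with no samples repeats the value of the previous window. The sketch ignores the grouping by `alpha` and the hourly partitioning; the types and names are hypothetical.

```scala
object GapFillSketch {
  final case class Sample(timeMillis: Long, value: Double)

  // Tumbling windows of `windowMillis`. Returns (windowStart, lastValue) for
  // every window between the first and the last sample, inclusive.
  def lastPerWindow(samples: Seq[Sample], windowMillis: Long): Seq[(Long, Double)] = {
    require(windowMillis > 0, "windowMillis must be positive")
    if (samples.isEmpty) Seq.empty
    else {
      // Last observed value in each non-empty window, keyed by window index.
      val lastByWindow: Map[Long, Double] = samples
        .sortBy(_.timeMillis)
        .groupBy(s => Math.floorDiv(s.timeMillis, windowMillis))
        .map { case (w, ss) => w -> ss.last.value }
      val first = lastByWindow.keys.min
      val last = lastByWindow.keys.max
      // Walk the windows in order; empty ones reuse the previous output value.
      (first to last).foldLeft(Vector.empty[(Long, Double)]) { (acc, w) =>
        val v = lastByWindow.getOrElse(w, acc.last._2)
        acc :+ ((w * windowMillis, v))
      }
    }
  }
}
```

With samples in windows 0 and 3 only, windows 1 and 2 are filled with the value from window 0, so downstream consumers see a regular 5-minute series.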

./valo

www.valo.io

Thank you

Meet us at the Startup Area

tobias@valo.io

@ntjohansson

MicroTickFrequency

MicroVolatility

OnlineMisraGries

Anomaly

Histogram

Univar

Skyline

MovingKurtosis

MovingDerivative

RecursiveEMA

MovingVariance

Average

Quantiles

What has brought us here today