68
1 Building Reactive Distributed Systems For Streaming Big Data, Analytics & ML Helena Edelson @helenaedelson Reactive Summit 2016

Building Reactive Distributed Systems For Streaming Big Data, Analytics & Machine Learning

Embed Size (px)

Citation preview

1

Building Reactive Distributed Systems For Streaming Big Data, Analytics & ML

Helena Edelson @helenaedelson Reactive Summit 2016

@helenaedelson #reactivesummit

Distributed Systems & Big Data Platform Engineer, Event-Driven systems, Streaming Analytics, Machine Learning, Scala

Committer: FiloDB, Spark Cassandra Connector, Kafka Connect Cassandra

Contributor: Akka (Akka Cluster), prev: Spring Integration

Speaker: Reactive Summit, Kafka Summit, Spark Summit, Strata, QCon, Scala Days, Philly ETE

2

I play with big data

twitter.com/helenaedelson slideshare.net/helenaedelson

linkedin.com/in/helenaedelson github.com/helena

and think about FT

@helenaedelson #reactivesummit

If you’re here you’ve taken the red pill

@helenaedelson #reactivesummit

It’s Not Easy Being…

Globally Distributed

Eventually Consistent

Highly Available

While handling TBs of data

Self Healing

Fast

4

@helenaedelson #reactivesummit

Massive event spikes & bursty traffic

Fast producers / slow consumers

Network partitioning & out of sync systems

DC down

Failure before commit

DDOS'ing yourself - no backpressure

Lack of graceful lifecycle handling, i.e. data loss when auto-scaling down

5

@helenaedelson #reactivesummit

Beware the C word

Consistency?

6

Convergence?

@helenaedelson #reactivesummit 7

Complexity, or…

@helenaedelson #reactivesummit 8

http://www.slideshare.net/palvaro/ricon-keynote-outwards-from-the-middle-of-the-maze/42

The Real Topic ^Other

@helenaedelson #reactivesummit

“Everything fails, all the time”

Start with this premise

Build failure, chaos routing and intelligence into your platform infrastructure

Be fault proactive vs just reactive

9

- Werner Vogels

@helenaedelson #reactivesummit 10

The matrix is everywhere…

It is a world where humans vs machines need to be directly involved in health of systems, respond to alerts…

@helenaedelson #reactivesummit 11

Log analysis tools like Splunk perpetuate this illusion…

@helenaedelson #reactivesummit 12

Self Healing, Intelligent Platforms

@helenaedelson #reactivesummit 13

Self Healing, Intelligent Platforms

Imagine a world

where engineers designed software

while the machines did triage, failure analysis and took action to correct failures

and learned from it…

@helenaedelson #reactivesummit 14

Alerting and Failure Response

The New Boilerplate

@helenaedelson #reactivesummit

Building Systems That LearnYou have systems receiving, processing & collecting data. They can provide insight into trends, patterns, and when they start failing, your monitoring triggers alerts to humans.

Alternative

1. Automatically route data only to nodes that can process it.

2. Automatically route failure contexts to event streams with ML algorithms to learn: What is normal? What is an anomaly? When these anomalies occur, what are they indicative of?

3. Automate proactive actions when those anomalies bubble up to a fault-aware layer.

15

@helenaedelson #reactivesummit

Define Failures vs Error Fault Tolerance Pathways & Routing

Failures - more network related, connectivity, …

Errors - application errors, user input from REST requests, config load-time errors…

Escape Loops and Feedback system

Built-in failure detection

@helenaedelson #reactivesummit 17

I want to keep all my apps healthy, handle errors with a

similar strategies, and failures with a similar strategy…

…and share that data across datacenters with my multi-dc aware kernel to build smarter,

automated systems.

@helenaedelson #reactivesummit 18

In A Nutshell

Event Stream

Petabytes Multi-DC Aware Mgmt

Kernel

Clustered Apps

Clustered Apps

Clustered AppsClustered

Apps

Clustered Apps

Clustered Apps

Clustered Apps

Clustered Apps

@helenaedelson #reactivesummit 19

In A Nutshell

Multi-DC Aware Mgmt

Kernel

Clustered Apps

Clustered Apps

Clustered AppsClustered

Apps

Clustered Apps

Clustered Apps

Clustered Apps

Clustered Apps

Danger Sauce

Event Stream

Petabytes

@helenaedelson #reactivesummit

Multi-DC Aware Mgmt

Kernel

20

X-DC-Aware Kernel

Clustered Apps

Clustered Apps

Clustered AppsClustered

Apps

Clustered Apps

Clustered Apps

Clustered Apps

Clustered Apps

Event Stream

Petabytes Danger Sauce

@helenaedelson #reactivesummit 21

I want to

route failure context

(when a node is alive

enough to talk)

to my.failures.fu Kafka topic for

my X-DC kernel to process through my ML streams

to learn about usage,

traffic, memory patterns and

anomalies, to anticipate failures,

intercept and correct

Without Human Intervention

@helenaedelson #reactivesummit 22

Akka Cluster it. The cluster knows

the node is down. Many options/

mechanisms to send/publish failure

metadata to learning streams.

If it’s a network failure and my node can’t

report to the kernel?

Bring it

@helenaedelson #reactivesummit 23

I think I’ll start with a

base cluster-aware Akka Extension for

my platform.

Any non-JVM apps can share data

with these via Kafka, ZeroMQ,

Cassandra…

Sew easy.

Might sneak it in without a

PR.

@helenaedelson #reactivesummit 24

Then I can bake my fault tolerance

strategy & route awareness in every

app & node.

And make it know when to use

Kafka and when to use Akka for

communication.

@helenaedelson #reactivesummit 25

And all logic and flows for graceful handling of

node lifecycle are baked in to the framework too.

Dude, that’s tight.

Graceful Lifecycle

@helenaedelson #reactivesummit

Akka ClusterNode-aware cluster membership

Gossip protocol

Status of all member nodes in the ring

Tunable, automatic failure detection

No single point of failure or bottleneck

A level of composability above Actors & Actor Hierarchy

26

@helenaedelson #reactivesummit 27

Akka Cluster has tunable Gossip.I

can partition segments of my node ring

by role, and route data by role or

broadcast.

Tunable Gossip

@helenaedelson #reactivesummit 28

Akka Cluster has

AdaptiveLoadBalancing logic.

I can route events away from nodes in trouble and

directed to healthier nodes. Many options, highly

configurable, good granularity.

AdaptiveLoadBalancing* Router

Cluster Metrics API

@helenaedelson #reactivesummit 29

Dynamically Route by Node Health Cluster Metrics API & AdaptiveLBR

Leader

Let’s pummel my cluster with critical data !

@helenaedelson #reactivesummit 30

Dynamically Route by Node Health Cluster Metrics API & AdaptiveLBR

Oh look, slow consumers or something in GC

thrash…Leader

Uses EWMA algo to weight decay of health data

@helenaedelson #reactivesummit 31

Dynamically Route by Node Health Cluster Metrics API & AdaptiveLBR

Leader

The Cluster leader ‘knew’ the red

node was nearing capacity & routed my traffic to healthy nodes.

@helenaedelson #reactivesummit

Reactive - Proactive

32

@helenaedelson #reactivesummit

Proactive - Parallelized

Akka Cluster on nodes

Cluster LB routers

Node router Actors

Kafka replication & routing across datacenters

33

@helenaedelson #reactivesummit 34

@helenaedelson #reactivesummit 35

I drank the cool aid,

put on the costume,

now what…

@helenaedelson #reactivesummit 36

Embrace Uncertainty

@helenaedelson #reactivesummit

So many opportunities for failure in this statement

Eventually Consistent Across DCs

37

@helenaedelson #reactivesummit 38

there is no now…

Embrace Asynchrony & Location Transparency

…or where

@helenaedelson #reactivesummit

There Is No Now

39

US-East-1

MirrorMakerEU-west-1

ZK

ZK

Akka micro

services

Akka micro

services

Akka micro

services

FiloDB Cassandra

Akka micro

services

Akka micro

services

Akka micro

services

Spark

Compute Clusters

Analytics/Timeseries/ML Storage Storage Clusters

S3/ Cold

Raw Event Stream

Event Stream

DC-1

DC-2

Akka Cluster-ed apps, partitioned by Role=service Raw Event Replay

FlinkBackpressure

@helenaedelson #reactivesummit 40

US-East-1

MirrorMakerEU-west-1

ZK

ZK

Akka micro

services

Akka micro

services

Akka micro

services

FiloDB Cassandra

Akka micro

services

Akka micro

services

Akka micro

services

Spark

Compute Clusters

Analytics/Timeseries/ML Storage Storage Clusters

S3/ Cold

Raw Event Stream

Raw Event Replay

Event Stream

But I Can Replay Future[Then]

DC-1

DC-2

Akka Cluster-ed apps, partitioned by Role=service

Backpressure

@helenaedelson #reactivesummit

Replaying Data from Time (T)

General time - We updated some compute algorithms which run on raw and aggregated data. I need to replay some datasets against the updated algos

Specific time - A node crashed or there was a network partitioning event and we lost data from T1 to T2. I need to replay from T1 to T2 without having to de-dupe data

41

@helenaedelson #reactivesummit

Cassandra Bakes Clustered Ordering Into Data ModelCREATE TABLE timeseries_t ( id text, year int, month int, day int, hour int, PRIMARY KEY ((wsid), year, month, day, hour)) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

42

Now the read path queries automatically return sorted by most recent. No sorts to do or ordering in your code.

Can also easily bucket in cold storage for faster reads.

@helenaedelson #reactivesummit 43

US-East-1

MirrorMakerEU-west-1

ZK

ZK

Compute Clusters

Event Stream

Build Apps Not Clusters?App clusters

vs Infra clusters

Actors & Kafka Streams

Actors & Kafka Streams

Actors & Kafka Streams

Event Stream

Event Stream

@helenaedelson #reactivesummit

Kafka Streams Key Features

Secure stream processing using Kafka security

Elastic and highly scalable

Fault-tolerant

Stateful and stateless computations

44

@helenaedelson #reactivesummit

Kafka Streams Key FeaturesInteractive queries

Time Model & Windowing

Supports late-arriving and out-of-order data

Millisecond processing latency

At-least-once processing guarantees

exactly-once is in progress :)

45

@helenaedelson #reactivesummit

val builder = new KStreamBuilder()

// Consume from topic x in the stream, from any or one DC

val kstream: KStream[K,V] = builder.stream(des, des, “fu.input.topic")

// Do some analytics computations

// Publish to subscribers in all DCs in the stream

kstream.to(“serviceA.questionB.aggregC.topic”, …)

// Start the stream

val streams = new KafkaStreams(builder, props)

streams.start()

// Do some more work then close the stream

streams.close()

Kafka Streams: KStream

46

@helenaedelson #reactivesummit 47

Kafka Streams: KTableval locations: KTable[UserId, Location] = builder.table(“user-locations-topic”)

val userPrefs: KTable[UserId, Prefs] = builder.table(“user-preferences-topic”)

// Join detailed info from 2 streams as events stream in

val userProfiles: KTable[UserId, UserProfile] = locations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs))

// Compute statistics

val usersPerRegion: KTable[UserId, Long] = userProfiles

.filter((userId, profile) -> profile.age < 30)

.groupBy((userId, profile) -> profile.location)

.count()

@helenaedelson #reactivesummit

val streams = new KafkaStreams(builder, props)

streams.start()

// Called when a stream thread abruptly terminates

// due to an uncaught exception.

streams.setUncaughtExceptionHandler(myErrorHandler)

Kafka Streams DSL Basics

48

@helenaedelson #reactivesummit 49

Integrating IntelligenceFor Fast Data and Feedback Loops

@helenaedelson #reactivesummit

Translation Layer

50

US-East-1

MirrorMaker

EU-west-1

PubSub Apps

PubSub Apps

PubSub Apps

Compute Clusters

Event StreamDC-1Raw Data

Cassandra

Spark SQLSpark MLLib

Spark Streaming

Kafka

@helenaedelson #reactivesummit

Translation Layer

51

US-East-1

MirrorMaker

EU-west-1

PubSub Apps

PubSub Apps

PubSub Apps

Compute Clusters

Event StreamDC-1Raw Data

Cassandra

To Columnar Format

Reads chunks, translates to spark

rows

Spark MLLib Spark SQL

Spark Streaming

Kafka

FiloDB

@helenaedelson #reactivesummit

Translation Layer

52

US-East-1

MirrorMaker

EU-west-1

PubSub Apps

PubSub Apps

PubSub Apps

Compute Clusters

Event StreamDC-1Raw Data

Cassandra

To Columnar Format

Reads chunks, translates to spark

rows

Spark MLLib Spark SQL

Spark Streaming

Kafka

FiloDB

Faster Ad Hoc

Querying

Faster ML

Feedback

Instead of slowing things down, it makes it faster.

@helenaedelson #reactivesummit 53

Your Akka Cluster got in my Reactive Distributed Analytics

Database!

Walking with a jar of PB as one does…

@helenaedelson #reactivesummit

FiloDB: Reactive

Scala & SBT

Spark (also written in scala)

Akka actors & Akka Cluster for Coordinators

Futures for IO

Typesafe Config

54

@helenaedelson #reactivesummit

import filodb.spark._

KafkaUtils.createDirectStream[..](..)

.map(_._2)

.map(FuEvent(_))

.foreachRDD { rdd =>

sql.insertIntoFilo(rdd.toDF, “fu_events”, database = Some(“org1Keyspace”))}

55

FiloDB API: Append only, to existing dataset

Write: Kafka, Spark Streaming, Spark SQL & FiloDB

@helenaedelson #reactivesummit

Read: FiloDB & Spark SQL

56

import filodb.spark._

val df1 = sqlContext.read.format(“filodb.spark”) .option(“dataset”, “aggrTable”) .load()

val df2 = sqlContext.filoDataset(“aggrTable”, database = Some(“ks2”))

More typesafe than the dataframe read (df1)

@helenaedelson #reactivesummit 57

DetermineTrain the model

Train on historic data

Score

Based on Objectives

Proactive Responses

React in real time

Machine Learning

@helenaedelson #reactivesummit

What do I want to predict & what can I learn from the data?

Which attributes in datasets are valuable predictors?

Which algorithms will best uncover valuable patterns?

58

Optimization & Predictive Models

@helenaedelson #reactivesummit 59

I need fast ML feedback loops, what tools should I pick? Depends…

@helenaedelson #reactivesummit

Read from stored training data into MLLib

60

import org.apache.spark.mllib.stat.Statistics

import filodb.spark._

val df = sqlContext.filoDataset(“table”, database = Some(“ks”))

val seriesX = df.select(“NumX”).map(_.getInt(0).toDouble)

val seriesY = df.select(“NumY”).map(_.getInt(0).toDouble)

val correlation: Double = Statistics.corr(seriesX, seriesY, “pearson”)

Calculate correlation between multiple series of data

@helenaedelson #reactivesummit

Read from the stream to MLLib & store for later work in FiloDB

61

val stream = KafkaUtils.createDirectStream[..](..) .map(transformFunc) .map(LabeledPoint.parse)

val trainingData = sqlContext.filoDataset(“trainings_fu")

val model = new StreamingLinearRegressionWithSGD() .setInitialWeights(Vectors.dense(weights)) .setStepSize(0.2).setNumIterations(25) .trainOn(trainingData)

model .predictOnValues(stream.map(lp => (lp.label, lp.features))) .insertIntoFilo(“predictions_fu")

@helenaedelson #reactivesummit 62

Parting Thoughts

@helenaedelson #reactivesummit

Don’t take when someone says X doesn’t work at face value.

It may not have been the right choice for their use case

They may not have configured/deployed/wrote code for it properly for their load

All technologies are optimized for a set of things vs every thing

Invest in R&D

63

Cultivate Skepticism

@helenaedelson #reactivesummit

Modular does one thing

Decoupled knows one thing

Reusable & Extendable when not final or private for a reason

64

Keep Everything Separate

@helenaedelson #reactivesummit 65

Did you just duplicate codez?

@helenaedelson #reactivesummit 66

Versus duplicating, with disparate strategies, all over your teams & orgs

Build A Reactive / Proactive Failure & Chaos Intelligence System for platform infrastructure

@helenaedelson #reactivesummit

Byzantine Fault Tolerance?

67

Looks like I'll miss

standup

@helenaedelson #reactivesummit

twitter.com/helenaedelson slideshare.net/helenaedelson

github.com/helena

Thanks!