An Introduction to Data Stream Analytics
using Apache Flink
SeRC Big Data Workshop
Paris Carbone <[email protected]>, PhD Candidate
KTH Royal Institute of Technology
1
Motivation
• Time-critical problems / Actionable Insights
• Stock market predictions
• Fraud detection
• Network security
• Fresh customer recommendations
2
…more like First-World Problems.
How about Tsunamis?
3
(Figures) Motivation: deploy sensors, collect data on earth & wave activity, and analyse the data regularly with a query Q; the evacuation window is limited. With a standing query Q that is evaluated continuously over the incoming data, results are produced within the evacuation window.
Data Stream Paradigm
• Standing queries are evaluated continuously
• Input data is unbounded
• Queries operate on the full data stream or on the most recent views of the stream ~ windows
7
Data Stream Basics
• Events/Tuples: elements of computation that respect a schema
• Data Streams: unbounded sequences of events
• Stream Operators: consume streams and generate new ones.
• Events are consumed once - no backtracking!
8
(Figure) A stream operator f consumes input streams S1, S2 and emits new output streams S'1, S'2, …
Streaming Pipelines
9
(Figure) Sources emit stream1 and stream2, a query Q transforms them, and sinks consume the results: approximations, predictions, alerts, …
Stream Analytics Systems
10
Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure
Open Source: Flink, Storm, Samza, Spark
Programming Models
11
Compositional:
• Offer basic building blocks for composing custom operators and topologies
• Advanced behaviour such as windowing is often missing
• Custom Optimisation

Declarative:
• Expose a high-level API
• Operators are transformations on abstract data types
• Advanced behaviour such as windowing is supported
• Self-Optimisation
Introducing Apache Flink
(Chart) #unique contributor ids by git commits, July 2009 to May 2016
• A top-level Apache project
• Community-driven open source software development
• Publicly open to new contributors
Native Workload Support
Apache Flink natively supports:
• Stream Pipelines
• Batch Pipelines
• Scalable Machine Learning
• Graph Analytics
14
The Apache Flink Stack
(Stack, top to bottom) APIs (DataSet, DataStream), Execution (Distributed Dataflow), Deployment

DataSet:
• Bounded Data Sources
• Blocking Operations
• Structured Iterations

DataStream:
• Unbounded Data Sources
• Continuous Operations
• Asynchronous Iterations
The Big Picture
(Figure) The DataSet and DataStream APIs run on the Distributed Dataflow engine and the Deployment layer. Libraries on top of them include Gelly (Graph), Table, ML, SQL, CEP and Hadoop M/R compatibility.
Basic API Concept
Source → DataStream → Operator → DataStream → Sink
Source → DataSet → Operator → DataSet → Sink
Writing a Flink Program
1. Bootstrap Sources
2. Apply Operators
3. Output to Sinks
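A minimal sketch of these three steps with the Scala DataStream API; the socket source, host, port and the toy map operator are illustrative assumptions, not part of the slides. The later word-count examples assume a textStream created in a similar way.

import org.apache.flink.streaming.api.scala._

object SkeletonJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 1. Bootstrap Sources (here: a text socket, assumed for illustration)
    val textStream: DataStream[String] = env.socketTextStream("localhost", 9999)

    // 2. Apply Operators
    val upper = textStream.map(_.toUpperCase)

    // 3. Output to Sinks
    upper.print()

    env.execute("skeleton streaming job")
  }
}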
Data Streams as Abstract Data Types
• Tasks are distributed and run in a pipelined fashion.
• State is kept within tasks.
• Transformations are applied per-record or window.
• Transformations: map, flatmap, filter, union…
• Aggregations: reduce, fold, sum
• Partitioning: forward, broadcast, shuffle, keyBy
• Sources/Sinks: custom or Kafka, Twitter, Collections…
17
Example
18
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .sum(1)
  .print()
Input: “live and let live”
After flatMap: “live” “and” “let” “live”
After map: (live,1) (and,1) (let,1) (live,1)
Output of the running sum: (live,1) (and,1) (let,1) (live,2)
Working with Windows
19
Why windows? We are often interested in fresh data!
Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!
(Figure) Keyed sums (SUM #1, SUM #2, SUM #3) computed over window buckets/panes along a time axis in seconds.

1) Sliding windows:
myKeyedStream.timeWindow(Time.seconds(60), Time.seconds(20));

2) Tumbling windows:
myKeyedStream.timeWindow(Time.seconds(60));
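To illustrate the highlight above (windows under different notions of time, with late events), here is a hedged sketch in the Scala DataStream API; the Reading event type, the source and the one-minute lateness bound are assumptions made for this example:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Reading(sensorId: String, timestamp: Long, value: Double) // assumed event type

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // windows follow event time, not arrival time

val readings: DataStream[Reading] = env.socketTextStream("localhost", 9999) // assumed source
  .map { line =>
    val Array(id, ts, v) = line.split(",")
    Reading(id, ts.toLong, v.toDouble)
  }

readings
  .assignAscendingTimestamps(_.timestamp)   // extract event-time timestamps
  .keyBy(_.sensorId)
  .timeWindow(Time.minutes(5))
  .allowedLateness(Time.minutes(1))         // keep windows open for late events
  .reduce((a, b) => Reading(a.sensorId, math.max(a.timestamp, b.timestamp), a.value + b.value))
  .print()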
Example
20
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()
Counting words over windows: two inputs, “live and” and “let live”, arrive at 10:48 and 11:01; each falls into a different 5-minute window (10:45-10:50 and 11:00-11:05), and each window emits its own counts: (live,1) (and,1) and (let,1) (live,1).
Example
21
(Figure) Dataflow: flatMap → map → window sum → print, where the counts are kept in state at the window sum operator.

textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()
Example
22

(Figure) Dataflow: flatMap → map → window sum (4 parallel instances) → print.

textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .setParallelism(4)
  .print()
Making State Explicit
23
• Explicitly defined state is durable to failures
• Flink supports two types of explicit states
• Operator State - full state
• Key-Value State - partitioned state per key
• State Backends: In-memory, RocksDB, HDFS
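As a hedged illustration of partitioned key-value state in the Scala API (the RunningSum operator, its field names and defaults are assumptions for this example, not from the slides):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Keeps one running sum per key in partitioned (key-value) state.
class RunningSum extends RichFlatMapFunction[(String, Int), (String, Int)] {
  private var sum: ValueState[Integer] = _

  override def open(parameters: Configuration): Unit = {
    // Explicitly declared state: durable and restored after failures.
    sum = getRuntimeContext.getState(
      new ValueStateDescriptor[Integer]("runningSum", classOf[Integer]))
  }

  override def flatMap(in: (String, Int), out: Collector[(String, Int)]): Unit = {
    val current = if (sum.value() == null) 0 else sum.value().intValue()
    val updated = current + in._2
    sum.update(updated)
    out.collect((in._1, updated))
  }
}

It would be applied on a keyed stream, e.g. wordCounts.keyBy(0).flatMap(new RunningSum).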
Fault Tolerance
24
(Figure) The stream of events is snapshotted periodically (snap-t1 at time t1, snap-t2 at time t2).
State is not affected by failures: when failures occur, we revert computation and state back to a snapshot.
Also part of Apache Storm.
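Enabling these snapshots from user code could look roughly like this; the 5-second interval and the HDFS checkpoint path are assumptions for illustration:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(5000) // take a consistent snapshot of all operator state every 5 seconds
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints")) // RocksDB backend, checkpoints on HDFS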
Performance
• Twitter Hack Week: Flink as an in-memory data store
25
Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
So how is Flink different from Spark?
26
Two major differences
1) Stream Execution 2) Mutable State
Flink vs Spark
27
(Spark Streaming) dstream.updateStateByKey(…) puts the new states in an output RDD.
(Figure) An operator maps input In and current state S to a new state S’.

Flink: dedicated resources, mutable state.
Spark Streaming: leased resources, immutable state.
What about DataSets?
28
• Sophisticated SQL-inspired optimiser
• Efficient Join Strategies
• Managed Memory bypasses Garbage Collection
• Fast, in-memory Iterative Bulk Computations
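A small, hedged illustration of the DataSet API (the toy data sets and keys are made up); the optimiser chooses the join strategy, and iterate runs a bulk iteration over managed, in-memory data:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

val users  = env.fromElements((1, "alice"), (2, "bob"))      // (userId, name), assumed data
val visits = env.fromElements((1, "page-a"), (1, "page-b"))  // (userId, page), assumed data

// Equi-join on userId; the optimiser picks an efficient join strategy.
users.join(visits).where(0).equalTo(0).print()

// Bulk iteration: ten in-memory passes over the data set.
env.fromElements(0, 1, 2).iterate(10) { prev => prev.map(_ + 1) }.print()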
Some Interesting Libraries
29
Detecting Patterns
30
PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream, Pattern
    .begin("seismic").where(evt -> evt.motion.equals("ClassB"))
    .next("tidal").where(evt -> evt.elevation > 500));

DataStream<Alert> result = tsunamiPattern.select(
    pattern -> { return getEvacuationAlert(pattern); });
CEP Java library Example
Scala DSL coming soon
Mining Graphs with Gelly
31
• Iterative Graph Processing
• Scatter-Gather
• Gather-Sum-Apply
• Graph Transformations/Properties
• Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count, etc.
Coming Soon: Real-time graph stream support
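A rough sketch of building a graph with the Gelly Scala API and computing a simple graph property; the toy edge list is an assumption for illustration:

import org.apache.flink.api.scala._
import org.apache.flink.graph.Edge
import org.apache.flink.graph.scala.Graph
import org.apache.flink.types.NullValue

val env = ExecutionEnvironment.getExecutionEnvironment

// Toy edge list 1 -> 2, 2 -> 3 (assumed data), with no edge values
val edges = env.fromElements(
  new Edge(1L, 2L, NullValue.getInstance()),
  new Edge(2L, 3L, NullValue.getInstance()))

val graph = Graph.fromDataSet(edges, env)
graph.getDegrees().print() // degree of every vertex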
Machine Learning Pipelines
32
• Scikit-learn inspired pipelining
• Supervised: SVM, Linear Regression
• Preprocessing: Polynomial Features, Scalers
• Recommendation: ALS
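A hedged sketch of such a pipeline with FlinkML; the toy training data and the iteration count are assumptions for illustration:

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.preprocessing.StandardScaler
import org.apache.flink.ml.regression.MultipleLinearRegression

val env = ExecutionEnvironment.getExecutionEnvironment

// Toy training set: (label, feature vector), assumed data
val training = env.fromElements(
  LabeledVector(1.0, DenseVector(1.0, 2.0)),
  LabeledVector(2.0, DenseVector(2.0, 4.0)))

// scikit-learn style chaining: scale the features, then fit a linear model
val pipeline = StandardScaler().chainPredictor(MultipleLinearRegression().setIterations(10))
pipeline.fit(training)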
Relational Queries
33
Table table = tableEnv.fromDataSet(input);

Table filtered = table
    .groupBy("word")
    .select("word.count as count, word")
    .filter("count = 2");

DataSet<WC> result = tableEnv.toDataSet(filtered, WC.class);
Table API Example
SQL and Stream SQL coming soon
Real-Time Monitoring
34
…for real-time processing
Coming Soon
35
• SQL and Stream SQL
• Stream ML
• Stream Graph Processing (Gelly-Stream)
• Autoscaling
• Incremental Snapshots