An Introduction to Data Stream Analytics
using Apache Flink
SeRC Big Data Workshop
Paris Carbone <[email protected]>, PhD Candidate
KTH Royal Institute of Technology
1
Motivation
• Time-critical problems / Actionable Insights
• Stock market predictions
• Fraud detection
• Network security
• Fresh customer recommendations
2
…more like First-World Problems.
How about Tsunamis?
3
(Figures) Motivation: deploy sensors, collect data on earth & wave activity, and analyse the data regularly with a query Q; the evacuation window is limited. With a standing query Q that is evaluated continuously over the incoming data, results are produced within the evacuation window.
Data Stream Paradigm
• Standing queries are evaluated continuously
• Input data is unbounded
• Queries operate on the full data stream or on the most recent views of the stream ~ windows
7
Data Stream Basics
• Events/Tuples: elements of computation that respect a schema
• Data Streams: unbounded sequences of events
• Stream Operators: consume streams and generate new ones.
• Events are consumed once - no backtracking!
8
(Figure) A stream operator f consumes input streams S1, S2 and emits new output streams S'1, S'2, …
Streaming Pipelines
9
(Figure) Sources emit stream1 and stream2, a query Q transforms them, and sinks consume the results: approximations, predictions, alerts, …
Stream Analytics Systems
10
Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure
Open Source: Flink, Storm, Samza, Spark
Programming Models
11
Compositional:
• Offer basic building blocks for composing custom operators and topologies
• Advanced behaviour such as windowing is often missing
• Custom Optimisation

Declarative:
• Expose a high-level API
• Operators are transformations on abstract data types
• Advanced behaviour such as windowing is supported
• Self-Optimisation
Introducing Apache Flink
(Chart) #unique contributor ids by git commits, July 2009 to May 2016
• A top-level Apache project
• Community-driven open source software development
• Publicly open to new contributors
Native Workload Support
Apache Flink natively supports:
• Stream Pipelines
• Batch Pipelines
• Scalable Machine Learning
• Graph Analytics
14
The Apache Flink Stack
(Stack, top to bottom) APIs (DataSet, DataStream), Execution (Distributed Dataflow), Deployment

DataSet:
• Bounded Data Sources
• Blocking Operations
• Structured Iterations

DataStream:
• Unbounded Data Sources
• Continuous Operations
• Asynchronous Iterations
The Big Picture
(Figure) The DataSet and DataStream APIs run on the Distributed Dataflow engine and the Deployment layer. Libraries on top of them include Gelly (Graph), Table, ML, SQL, CEP and Hadoop M/R compatibility.
Basic API Concept
Source → DataStream → Operator → DataStream → Sink
Source → DataSet → Operator → DataSet → Sink
Writing a Flink Program
1. Bootstrap Sources
2. Apply Operators
3. Output to Sinks
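A minimal sketch of these three steps with the Scala DataStream API; the socket source, host, port and the toy map operator are illustrative assumptions, not part of the slides. The later word-count examples assume a textStream created in a similar way.

import org.apache.flink.streaming.api.scala._

object SkeletonJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 1. Bootstrap Sources (here: a text socket, assumed for illustration)
    val textStream: DataStream[String] = env.socketTextStream("localhost", 9999)

    // 2. Apply Operators
    val upper = textStream.map(_.toUpperCase)

    // 3. Output to Sinks
    upper.print()

    env.execute("skeleton streaming job")
  }
}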
Data Streams as Abstract Data Types
• Tasks are distributed and run in a pipelined fashion.
• State is kept within tasks.
• Transformations are applied per-record or window.
• Transformations: map, flatmap, filter, union…
• Aggregations: reduce, fold, sum
• Partitioning: forward, broadcast, shuffle, keyBy
• Sources/Sinks: custom or Kafka, Twitter, Collections…
17
Example
18
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .sum(1)
  .print()
Input: “live and let live”
After flatMap: “live” “and” “let” “live”
After map: (live,1) (and,1) (let,1) (live,1)
Output of the running sum: (live,1) (and,1) (let,1) (live,2)
Working with Windows
19
Why windows? We are often interested in fresh data!
Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!
(Figure) Keyed sums (SUM #1, SUM #2, SUM #3) computed over window buckets/panes along a time axis in seconds.

1) Sliding windows:
myKeyedStream.timeWindow(Time.seconds(60), Time.seconds(20));

2) Tumbling windows:
myKeyedStream.timeWindow(Time.seconds(60));
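To illustrate the highlight above (windows under different notions of time, with late events), here is a hedged sketch in the Scala DataStream API; the Reading event type, the source and the one-minute lateness bound are assumptions made for this example:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Reading(sensorId: String, timestamp: Long, value: Double) // assumed event type

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // windows follow event time, not arrival time

val readings: DataStream[Reading] = env.socketTextStream("localhost", 9999) // assumed source
  .map { line =>
    val Array(id, ts, v) = line.split(",")
    Reading(id, ts.toLong, v.toDouble)
  }

readings
  .assignAscendingTimestamps(_.timestamp)   // extract event-time timestamps
  .keyBy(_.sensorId)
  .timeWindow(Time.minutes(5))
  .allowedLateness(Time.minutes(1))         // keep windows open for late events
  .reduce((a, b) => Reading(a.sensorId, math.max(a.timestamp, b.timestamp), a.value + b.value))
  .print()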
Example
20
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()
Counting words over windows: two inputs, “live and” and “let live”, arrive at 10:48 and 11:01; each falls into a different 5-minute window (10:45-10:50 and 11:00-11:05), and each window emits its own counts: (live,1) (and,1) and (let,1) (live,1).
Example
21
(Figure) Dataflow: flatMap → map → window sum → print, where the counts are kept in state at the window sum operator.

textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()
Example
22

(Figure) Dataflow: flatMap → map → window sum (4 parallel instances) → print.

textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .setParallelism(4)
  .print()
Making State Explicit
23
• Explicitly defined state is durable to failures
• Flink supports two types of explicit states
• Operator State - full state
• Key-Value State - partitioned state per key
• State Backends: In-memory, RocksDB, HDFS
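As a hedged illustration of partitioned key-value state in the Scala API (the RunningSum operator, its field names and defaults are assumptions for this example, not from the slides):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Keeps one running sum per key in partitioned (key-value) state.
class RunningSum extends RichFlatMapFunction[(String, Int), (String, Int)] {
  private var sum: ValueState[Integer] = _

  override def open(parameters: Configuration): Unit = {
    // Explicitly declared state: durable and restored after failures.
    sum = getRuntimeContext.getState(
      new ValueStateDescriptor[Integer]("runningSum", classOf[Integer]))
  }

  override def flatMap(in: (String, Int), out: Collector[(String, Int)]): Unit = {
    val current = if (sum.value() == null) 0 else sum.value().intValue()
    val updated = current + in._2
    sum.update(updated)
    out.collect((in._1, updated))
  }
}

It would be applied on a keyed stream, e.g. wordCounts.keyBy(0).flatMap(new RunningSum).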
Fault Tolerance
24
(Figure) The stream of events is snapshotted periodically (snap-t1 at time t1, snap-t2 at time t2).
State is not affected by failures: when failures occur, we revert computation and state back to a snapshot.
Also part of Apache Storm.
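Enabling these snapshots from user code could look roughly like this; the 5-second interval and the HDFS checkpoint path are assumptions for illustration:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(5000) // take a consistent snapshot of all operator state every 5 seconds
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints")) // RocksDB backend, checkpoints on HDFS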
Performance
• Twitter Hack Week: Flink as an in-memory data store
25
Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
So how is Flink different from Spark?
26
Two major differences
1) Stream Execution 2) Mutable State
Flink vs Spark
27
(Spark Streaming) dstream.updateStateByKey(…) puts the new states in an output RDD.
(Figure) An operator maps input In and current state S to a new state S’.

Flink: dedicated resources, mutable state.
Spark Streaming: leased resources, immutable state.
What about DataSets?
28
• Sophisticated SQL-inspired optimiser
• Efficient Join Strategies
• Managed Memory bypasses Garbage Collection
• Fast, in-memory Iterative Bulk Computations
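A small, hedged illustration of the DataSet API (the toy data sets and keys are made up); the optimiser chooses the join strategy, and iterate runs a bulk iteration over managed, in-memory data:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

val users  = env.fromElements((1, "alice"), (2, "bob"))      // (userId, name), assumed data
val visits = env.fromElements((1, "page-a"), (1, "page-b"))  // (userId, page), assumed data

// Equi-join on userId; the optimiser picks an efficient join strategy.
users.join(visits).where(0).equalTo(0).print()

// Bulk iteration: ten in-memory passes over the data set.
env.fromElements(0, 1, 2).iterate(10) { prev => prev.map(_ + 1) }.print()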
Some Interesting Libraries
29
Detecting Patterns
30
PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream, Pattern
    .begin("seismic").where(evt -> evt.motion.equals("ClassB"))
    .next("tidal").where(evt -> evt.elevation > 500));

DataStream<Alert> result = tsunamiPattern.select(
    pattern -> { return getEvacuationAlert(pattern); });
CEP Java library Example
Scala DSL coming soon
Mining Graphs with Gelly
31
• Iterative Graph Processing
• Scatter-Gather
• Gather-Sum-Apply
• Graph Transformations/Properties
• Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count, etc.
Coming Soon: Real-time graph stream support
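A rough sketch of building a graph with the Gelly Scala API and computing a simple graph property; the toy edge list is an assumption for illustration:

import org.apache.flink.api.scala._
import org.apache.flink.graph.Edge
import org.apache.flink.graph.scala.Graph
import org.apache.flink.types.NullValue

val env = ExecutionEnvironment.getExecutionEnvironment

// Toy edge list 1 -> 2, 2 -> 3 (assumed data), with no edge values
val edges = env.fromElements(
  new Edge(1L, 2L, NullValue.getInstance()),
  new Edge(2L, 3L, NullValue.getInstance()))

val graph = Graph.fromDataSet(edges, env)
graph.getDegrees().print() // degree of every vertex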
Machine Learning Pipelines
32
• Scikit-learn inspired pipelining
• Supervised: SVM, Linear Regression
• Preprocessing: Polynomial Features, Scalers
• Recommendation: ALS
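A hedged sketch of such a pipeline with FlinkML; the toy training data and the iteration count are assumptions for illustration:

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.preprocessing.StandardScaler
import org.apache.flink.ml.regression.MultipleLinearRegression

val env = ExecutionEnvironment.getExecutionEnvironment

// Toy training set: (label, feature vector), assumed data
val training = env.fromElements(
  LabeledVector(1.0, DenseVector(1.0, 2.0)),
  LabeledVector(2.0, DenseVector(2.0, 4.0)))

// scikit-learn style chaining: scale the features, then fit a linear model
val pipeline = StandardScaler().chainPredictor(MultipleLinearRegression().setIterations(10))
pipeline.fit(training)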
Relational Queries
33
Table table = tableEnv.fromDataSet(input);

Table filtered = table
    .groupBy("word")
    .select("word.count as count, word")
    .filter("count = 2");

DataSet<WC> result = tableEnv.toDataSet(filtered, WC.class);
Table API Example
SQL and Stream SQL coming soon
Real-Time Monitoring
34
…for real-time processing
Coming Soon
35
• SQL and Stream SQL
• Stream ML
• Stream Graph Processing (Gelly-Stream)
• Autoscaling
• Incremental Snapshots