Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Hadoop Summit Europe, Dublin, Ireland. April 13th, 2016
Slim Baltagi, Director, Enterprise Architecture, Capital One Financial Corporation


Agenda

1. How is Apache Flink a multi-purpose Big Data Analytics Framework?
2. Why are streaming analytics emerging?
3. Why is Flink suitable for real-world streaming analytics?
4. What are some novel use cases enabled by Flink?
5. Who is using Flink?
6. Where do you go from here?

1. How is Apache Flink a multi-purpose Big Data Analytics Framework?

1.1. What is the Apache Flink stack?
1.2. Why is Apache Flink the 4G of Big Data Analytics?
1.3. What are Apache Flink's innovations?

1.1. What is the Apache Flink stack?

[Figure: the Apache Flink stack, listed from top to bottom]

APIs & LIBRARIES
• DataSet (Java/Scala/Python): Batch Processing
• DataStream (Java/Scala): Stream Processing
• Libraries and compatibility layers shown on top of the two APIs: Gelly, Table, Hadoop M/R, Storm, FlinkML, Apache Beam, Cascading, MRQL, Zeppelin, SAMOA, FlinkCEP, Gelly-Stream

SYSTEM
• Batch Optimizer, Stream Builder
• Distributed Streaming Dataflow Engine

DEPLOY
• Local: Single JVM, Embedded, Docker
• Cluster: Standalone, YARN, Mesos (WIP)
• Cloud: Google’s GCE, Amazon’s EC2, IBM Docker Cloud, …

STORAGE
• Files: Local, HDFS, S3, Azure, Alluxio
• Databases: MongoDB, HBase, SQL, …
• Streams: Flume, Kafka, MapR Streams, RabbitMQ, …

1.2. Why is Apache Flink the 4G of Big Data Analytics?

• 1G: MapReduce. Batch.
• 2G: Directed Acyclic Graph (DAG) dataflows. Batch, interactive.
• 3G: RDD (Resilient Distributed Datasets). Batch, interactive, near-real-time streaming (micro-batches), iterative processing.
• 4G: Cyclic dataflows. Hybrid, interactive, real-time streaming plus real-world streaming (out-of-order streams, windowing, backpressure, CEP, …), native iterative processing.

1.3. What are Apache Flink's innovations?

Apache Flink came with many innovations. Some of them are influencing features in other frameworks:

1. Custom memory management and binary processing, in Flink from day one, inspired Apache Spark to do the same in its Project Tungsten since version 1.6.
• https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
• https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

2. The DataSet API has been in Flink since its early days and inspired Apache Spark to introduce its Dataset API in version 1.6.
• https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html
• https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html

3. Flink’s rich windowing semantics for streaming:
• Flink supports windows over time, count, or sessions.
• Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns.
• Flink inspired both Apache Storm (1.0.0 was released on April 12th, 2016) and Spark Streaming (version 2.0 is expected in May 2016) to start supporting rich windowing.
• https://storm.apache.org/2016/04/12/storm100-released.html
• http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia/15

Some of Flink's innovations are not available in other open source tools:

1. Flink is the only hybrid (Real-Time Streaming + Batch) distributed data processing engine natively supporting many use cases: Batch, Real-Time streaming, Machine learning, Graph processing and Relational queries.

2. Native iterations (Iterate and DeltaIterate) dramatically boost the performance of Machine learning and Graph analytics workloads that require iterations.
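The idea behind a delta iteration can be sketched in plain Scala, without Flink's API. This is a minimal sketch under assumptions: the object name `DeltaIterationSketch`, the `components` function and the label-propagation algorithm are illustrative choices, not taken from the slides. Each round only revisits the vertices whose label changed in the previous round (the workset), instead of recomputing the whole solution.

```scala
object DeltaIterationSketch {
  // Connected components by label propagation over undirected edges.
  // Assumes every edge endpoint is contained in `vertices`.
  // Returns (component label per vertex, number of vertex updates performed).
  def components(vertices: Set[Int], edges: Set[(Int, Int)]): (Map[Int, Int], Int) = {
    val neighbours: Map[Int, Set[Int]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    var solution = vertices.map(v => v -> v).toMap // solution set: current label per vertex
    var workset = vertices                          // only vertices changed in the last round
    var updates = 0

    while (workset.nonEmpty) {
      // Changed vertices propose their label to their neighbours.
      val candidates = for {
        v <- workset.toSeq
        n <- neighbours.getOrElse(v, Set.empty[Int]).toSeq
      } yield n -> solution(v)

      // Keep only proposals that lower a neighbour's current label.
      val improved = candidates
        .groupBy(_._1)
        .map { case (n, cs) => n -> cs.map(_._2).min }
        .filter { case (n, label) => label < solution(n) }

      solution ++= improved
      updates += improved.size
      workset = improved.keySet
    }
    (solution, updates)
  }
}
```

The loop stops as soon as a round changes nothing, so work shrinks as the computation converges, which is the property the slide credits for the speed-up on iterative graph and machine learning workloads.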

[Figure: the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine natively supporting many use cases: Real-Time stream processing, Machine Learning at scale, Graph Analysis, Batch Processing]

3. Simplicity of configuration: Flink requires no memory thresholds to configure, no complicated network configurations, no serializers to be configured, …

4. Little tuning required: Flink’s optimizer can choose execution strategies automatically in any environment. According to Mike Olson, Chief Strategy Officer of

Cloudera Inc. “Spark is too knobby — it has too many tuning parameters, and they need constant adjustment as workloads, data volumes, user counts change.”

Reference: http://vision.cloudera.com/one-platform/

5. Full support of Apache Beam (for the combination of Batch and Stream): event time, sessions, …
References:
• The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, 2015. http://research.google.com/pubs/pub43864.html
• Dataflow/Beam & Spark: A Programming Model Comparison, February 3rd, 2016. https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

6. Innovations in stream processing: event time, rich streaming window operations, savepoints, …
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/

7. FlinkCEP is the Complex Event Processing library for Flink. It allows you to easily detect complex event patterns in an endless stream of data to support better insight and decision making.
• Introducing Complex Event Processing (CEP) with Apache Flink, Till Rohrmann, April 6, 2016. http://flink.apache.org/news/2016/04/06/cep-monitoring.html
• FlinkCEP - Complex event processing for Flink. https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html

8. Run legacy Big Data applications on Flink: preserve your investment in legacy Big Data applications by running the existing code on Flink’s engine through the Hadoop and Storm compatibility layers, the Cascading adapter and, probably in the future, a Spark adapter.

Run your legacy Big Data applications on Flink

• Flink’s MapReduce compatibility layer lets you run legacy Hadoop MapReduce jobs, reuse Hadoop input and output formats, and reuse functions like Map and Reduce. https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/hadoop_compatibility.html

• Cascading on Flink lets you port existing Cascading-MapReduce applications to Apache Flink with virtually no code changes. Expected advantages are a performance boost and lower resource consumption. https://github.com/dataArtisans/cascading-flink/tree/release-0.2

• Flink is compatible with Apache Storm interfaces and therefore allows reusing code that was implemented for Storm: execute existing Storm topologies using Flink as the underlying engine, or reuse legacy application code (bolts and spouts) inside Flink programs. https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/storm_compatibility.html


2. Why are streaming analytics emerging?

Stonebraker et al. predicted in 2005 that stream processing would become increasingly important and attributed this to the ‘sensorization of the real world: everything of material significance on the planet get ‘sensor-tagged’ and report its state or location in real time’.
Reference: http://cs.brown.edu/~ugur/8rulesSigRec.pdf

I think stream processing is becoming important not only because of this sensorization of the real world but also because of the following factors:

1. Data streams
2. Technology
3. Business
4. Customers

[Figure: Emergence of Streaming Analytics, driven by four factors: 1. Data Streams, 2. Technology, 3. Business, 4. Customers]

1. Data Streams

Real-world data is available as a series of events that are continuously produced by a variety of applications and disparate systems inside and outside the enterprise. Examples:
• Sensor networks data
• Web logs
• Database transactions
• System logs
• Tweets and social media data in general
• Click streams
• Mobile apps data

2. Technology

• Simplified data architecture with Apache Kafka as a major innovation and backbone of streaming architectures.
• Rapidly maturing open source streaming analytics tools: Apache Flink, Apache Spark’s Streaming module, Kafka Streams, Apache Samza, Apache Storm, Apache NiFi, …
• Cloud services for stream processing: Google Cloud Dataflow, Azure Stream Analytics, Amazon Kinesis Streams, IBM InfoSphere Streams, …
• Vendors innovating in this space: Data Artisans, DataTorrent, Striim, Databricks, MapR, Hortonworks, Confluent, StreamSets, …
• More mobile devices than human beings!

3. Business

Challenges:
• Lag between data creation and actionable insights.
• Web and mobile application growth, new types and sources of data.
• Organizations need to shift from a reactive to a more proactive approach to interactions with customers, suppliers and employees.

Opportunities:
• Embracing streaming analytics helps organizations achieve faster time to insight, competitive advantages and operational efficiency in a wide range of verticals.
• With streaming analytics, new startups are challenging or will challenge established companies. Example: Pay-As-You-Go insurance or Usage-Based Auto Insurance.
• Speed is said to have become the new currency of business.

4. Customers

• Customers increasingly demand the instant responses they are used to in social networks: Twitter, Facebook, LinkedIn, …
• The younger generation, who grew up with video gaming and are accustomed to real-time interaction, are now themselves a growing class of customers.


3. Why is Flink suitable for real-world streaming analytics?

3.1. Flink’s streaming analytics features
3.2. What are some streaming analytics use cases suitable for Flink?

3.1. Flink’s streaming analytics features

Apache Flink 1.0, which was released on March 8th, 2016, comes with a competitive set of streaming analytics features, some of which are unique in the open source domain.

Apache Flink 1.0.1 was released on April 6th, 2016.

The combination of these features makes Apache Flink a unique choice for real-world streaming analytics.

Let’s discuss some of Apache Flink's features for real-world streaming analytics.

1. Pipelined processing engine
2. Stream abstraction: DataStream as in the real world
3. Performance: Low latency and high throughput
4. Support for rich windowing semantics
5. Support for different notions of time
6. Stateful stream processing
7. Fault tolerance and correctness
8. High Availability
9. Backpressure handling
10. Expressive and easy-to-use APIs in Scala and Java
11. Support for batch
12. Integration with the Hadoop ecosystem

1. Pipelined processing engine

Flink is a pipelined (streaming) engine akin to parallel database systems, rather than a batch engine such as Spark.

‘Flink’s runtime is not designed around the idea that operators wait for their predecessors to finish before they start, but they can already consume partially generated results.’

‘This is called pipeline parallelism and means that several transformations in a Flink program are actually executed concurrently with data being passed between them through memory and network channels.’ http://data-artisans.com/apache-flink-new-kid-on-the-block/
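The difference between pipelined and stage-by-stage execution can be illustrated with a small single-threaded sketch in plain Scala. It is only an analogy: `PipelineSketch`, `parse` and `emit` are hypothetical names, and a real Flink job runs its operators concurrently across threads and machines. The trace records the order in which the two stages touch each record.

```scala
import scala.collection.mutable.ArrayBuffer

object PipelineSketch {
  // Runs a two-stage job and returns the order in which the stages saw each element.
  def run(input: Seq[Int], pipelined: Boolean): Vector[String] = {
    val trace = ArrayBuffer.empty[String]
    def parse(x: Int): Int = { trace += s"parse($x)"; x * 2 }
    def emit(x: Int): Unit = { trace += s"emit($x)" }

    if (pipelined) input.iterator.map(parse).foreach(emit) // each record flows through both stages
    else input.toList.map(parse).foreach(emit)             // stage 2 waits for all of stage 1
    trace.toVector
  }
}
```

In the pipelined run the downstream stage emits the first result before the upstream stage has parsed the second record, which is the behaviour the quotes above describe as consuming partially generated results.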

2. Stream abstraction: DataStream as in the real world

Real-world data is a series of events that are continuously produced by a variety of applications and disparate systems inside and outside the enterprise.

Flink, as a stream processing system, models streams as what they are in the real world, a series of events, and uses DataStream as an abstraction.

Spark, as a batch processing system, approximates these streams as micro-batches and uses DStream as an abstraction. This adds an artificial latency!

3. Performance: Low latency and high throughput

• The pipelined processing engine enables true low-latency streaming applications with results in milliseconds.
• High throughput: efficiently handles high-volume streams (millions of events per second).
• Tunable latency/throughput tradeoff: a tuning knob lets you navigate the latency-throughput trade-off.
• Yahoo! benchmarked Storm, Spark Streaming and Flink with a production use case (counting ad impressions grouped by campaign).
• Full Yahoo! article (the benchmark stops at low write throughput and the programs are not fault tolerant): https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

• Full Data Artisans article, which extends the Yahoo! benchmark to high volumes and uses Flink’s built-in state: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
• Flink outperformed both Spark Streaming and Storm in this benchmark modeled after a real-world application:
  • Flink achieves a throughput of 15 million messages/second on a 10-machine cluster. This is 35x higher throughput compared to Storm (80x compared to Yahoo’s runs).
  • Flink ran with exactly-once guarantees, Storm with at-least-once.
• Ultimately, you need to test the performance of your own streaming analytics application, as it depends on your own logic and on the version of your preferred stream processing tool!

4. Support for rich windowing semantics

Flink provides rich windowing semantics. A window is a grouping of events based on some function of time (all records of the last 5 minutes), count (the last 10 events) or session (all the events of a particular web user).

Window types in Flink:
• Tumbling windows (no overlap)
• Sliding windows (with overlap)
• Session windows (gap of activity)
• Custom windows (with assigners, triggers and evictors)
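How the three built-in window types group events can be sketched in plain Scala. This is a minimal illustration under assumptions, not Flink's window assigner code: `WindowSketch` and its functions are hypothetical names, timestamps are assumed to be non-negative, and windows are half-open intervals `[start, end)`.

```scala
object WindowSketch {
  // Tumbling: each timestamp falls into exactly one window [start, start + size).
  def tumbling(ts: Long, size: Long): (Long, Long) = {
    val start = ts - (ts % size)
    (start, start + size)
  }

  // Sliding: a timestamp belongs to every window of length `size` whose start
  // is a multiple of `slide` and that still covers the timestamp.
  def sliding(ts: Long, size: Long, slide: Long): Seq[(Long, Long)] = {
    val lastStart = ts - (ts % slide)
    (lastStart to (ts - size + 1) by -slide).map(s => (s, s + size))
  }

  // Session: sorted timestamps are split wherever the gap between
  // two neighbouring events exceeds `gap`.
  def sessions(sorted: Seq[Long], gap: Long): Seq[Seq[Long]] =
    sorted.foldLeft(Vector.empty[Vector[Long]]) { (acc, t) =>
      if (acc.nonEmpty && t - acc.last.last <= gap) acc.init :+ (acc.last :+ t)
      else acc :+ Vector(t)
    }
}
```

For example, with a size of 4 and a slide of 2, an event at time 7 belongs to the two overlapping windows starting at 6 and at 4, whereas a tumbling window of size 5 places it only in the window starting at 5.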

In many systems, these windows are hard-coded and connected with the system’s internal checkpointing mechanism. Flink is the first open source streaming engine that completely decouples windowing from fault tolerance, allowing for richer forms of windows, such as sessions.

Further reading:
• http://flink.apache.org/news/2015/12/04/Introducing-windows.html
• http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

5. Support for different notions of time

In a streaming program with Flink, for example when defining windows with respect to time, one can refer to different notions of time:
• Event time: when an event happened in the real world.
• Ingestion time: when data is loaded into Flink, from Kafka for example.
• Processing time: when data is processed by Flink.

In the real world, streams of events rarely arrive in the order in which they are produced, due to distributed sources, non-synced clocks, network delays, … They are said to be “out of order” streams.

Flink is the first open source streaming engine that supports out-of-order streams and is able to consistently process events according to their event time.
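Event-time processing of an out-of-order stream can be sketched in plain Scala. This is a simplified model under assumptions, not Flink's implementation: `EventTimeSketch`, `Event` and `maxDelay` are hypothetical names, and the watermark is modelled as the highest event time seen so far minus a fixed allowed delay. A window is emitted only once the watermark passes its end, so events that arrive late but within the allowed delay are still counted in the window they belong to.

```scala
object EventTimeSketch {
  final case class Event(key: String, eventTime: Long)

  // Counts events per tumbling event-time window [start, start + size).
  // Events for windows that were already emitted are dropped.
  def countPerWindow(arrivals: Seq[Event], size: Long, maxDelay: Long): Seq[(Long, Int)] = {
    var open = Map.empty[Long, Int] // window start -> count so far
    var watermark = Long.MinValue
    val out = Vector.newBuilder[(Long, Int)]

    for (e <- arrivals) {
      val start = e.eventTime - (e.eventTime % size)
      if (start + size > watermark) open += start -> (open.getOrElse(start, 0) + 1)

      // Advance the watermark and emit every window that is now complete.
      watermark = math.max(watermark, e.eventTime - maxDelay)
      val (done, pending) = open.partition { case (s, _) => s + size <= watermark }
      out ++= done.toSeq.sortBy(_._1)
      open = pending
    }
    out ++= open.toSeq.sortBy(_._1) // flush remaining windows at end of input
    out.result()
  }
}
```

With processing time, the same events would be assigned to windows by their arrival order instead, so the result would change with network delays; grouping by `eventTime` keeps the result tied to when the events actually happened.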

Further reading:
• http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
• https://ci.apache.org/projects/flink/flink-docs-master/concepts/concepts.html#time
• https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/event_time.html
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/

6. Stateful stream processing

Many operations in a dataflow simply look at one individual event at a time, for example an event parser.

Other operations, called stateful operations, need to store data at the end of a window for computations occurring in later windows.

Now, where is the state of these stateful operations maintained?

The state can be stored in memory, in the file system, or in RocksDB, which is an embedded key/value data store and not an external database.

Flink also supports state versioning through savepoints, which are checkpoints of the state of a running streaming job that can be manually triggered by the user while the job is running.

Savepoints enable:
• Code upgrades: both application and framework
• Cluster maintenance and migration
• A/B testing and what-if scenarios
• Testing and debugging
• Restarting a job with adjusted parallelism

Further reading:
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/
• https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
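The relationship between operator state and a savepoint can be sketched in plain Scala. This is an illustration under assumptions, not Flink's state API: `StateSketch`, `Counter` and `snapshot` are hypothetical names, and the state lives in a simple in-memory map rather than in the file system or RocksDB.

```scala
object StateSketch {
  // A keyed running count: the state (one counter per key) outlives single events.
  final class Counter(initial: Map[String, Long] = Map.empty) {
    private var counts = initial

    // Processes one event for `key` and returns the updated count.
    def process(key: String): Long = {
      val n = counts.getOrElse(key, 0L) + 1
      counts += key -> n
      n
    }

    // "Savepoint": an immutable copy of the state, taken while the job keeps running.
    def snapshot(): Map[String, Long] = counts
  }
}
```

A new `Counter` started from a snapshot continues counting where the snapshot was taken, independently of the original job. That is what makes the uses listed above possible, such as upgrading the code or running a what-if variant from the same starting state.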

7. Fault tolerance and correctness

How to ensure that the state is correct after failures?

Apache Flink offers a fault tolerance mechanism to consistently recover the state of data streaming applications.

This ensures that even in the presence of failures, the operators do not perform duplicate updates to their state (exactly-once guarantees). This basically means that the computed results are the same whether there are failures along the way or not.

There is a switch to downgrade the guarantees to at-least-once if the use case tolerates duplicate updates.
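Why checkpointing the source position together with the operator state yields exactly-once state updates can be sketched in plain Scala. This is a deliberately simplified, single-operator model and not Flink's distributed snapshot algorithm: `CheckpointSketch`, `interval` and `failAt` are hypothetical names, and the input is assumed to be replayable from any offset (as a Kafka topic would be).

```scala
object CheckpointSketch {
  final case class Checkpoint(offset: Int, sum: Long)

  // Sums a replayable input, checkpointing (offset, state) together every
  // `interval` records. `failAt` simulates one crash just before that offset is processed.
  def run(input: Vector[Long], interval: Int, failAt: Option[Int]): Long = {
    var checkpoint = Checkpoint(0, 0L)
    var offset = 0
    var sum = 0L
    var pendingFailure = failAt

    while (offset < input.length) {
      if (pendingFailure.contains(offset)) {
        // Crash: discard in-flight state, rewind source and state to the last checkpoint.
        pendingFailure = None
        offset = checkpoint.offset
        sum = checkpoint.sum
      } else {
        sum += input(offset)
        offset += 1
        if (offset % interval == 0) checkpoint = Checkpoint(offset, sum)
      }
    }
    sum
  }
}
```

Records between the last checkpoint and the crash are processed twice, but their first effect on the state is rolled back with the restore, so each record is reflected in the final state exactly once and the result matches a failure-free run.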

Further reading:
• High-throughput, low-latency, and exactly-once stream processing with Apache Flink. http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
• Data Streaming Fault Tolerance document: http://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
• ‘Lightweight Asynchronous Snapshots for Distributed Dataflows’, June 28, 2015. http://arxiv.org/pdf/1506.08603v1.pdf
• Distributed Snapshots: Determining Global States of Distributed Systems, February 1985 (Chandy-Lamport algorithm). http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf

8. High Availability

In the real world, streaming analytics applications need to be reliable, capable of running jobs for months, and resilient in the event of failures.

The JobManager (master) is responsible for scheduling and resource management. If it crashes, no new programs can be submitted and running programs will fail.

Flink provides a High Availability (HA) mode to recover from a JobManager crash and eliminate this Single Point Of Failure (SPOF).

Further reading: JobManager High Availability https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html

9. Backpressure handling

In the real world, there are situations where a system is receiving data at a higher rate than it can normally process. This is called backpressure.

Flink handles backpressure implicitly through its architecture, without user interaction, while backpressure handling in Spark requires manual configuration: spark.streaming.backpressure.enabled.

Flink provides backpressure monitoring to allow users to understand bottlenecks in streaming applications.

Further reading:
• How Flink handles backpressure, by Ufuk Celebi, Kostas Tzoumas and Stephan Ewen, August 31, 2015. http://data-artisans.com/how-flink-handles-backpressure/
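The general mechanism behind implicit backpressure, a bounded buffer between a fast producer and a slow consumer, can be sketched as a tick-based simulation in plain Scala. This is an illustration under assumptions, not Flink's network stack: `BackpressureSketch`, `capacity` and `consumerEvery` are hypothetical names.

```scala
import scala.collection.mutable

object BackpressureSketch {
  // The producer tries to emit one record per tick; the consumer drains one record
  // every `consumerEvery` ticks. When the bounded buffer is full the producer stalls
  // instead of dropping data or growing memory without bound.
  // Returns (ticks the producer spent stalled, records consumed).
  def simulate(total: Int, capacity: Int, consumerEvery: Int): (Int, Int) = {
    val buffer = mutable.Queue.empty[Int]
    var next = 0
    var stalled = 0
    var consumed = 0
    var tick = 0

    while (consumed < total) {
      tick += 1
      if (next < total) {
        if (buffer.size < capacity) { buffer.enqueue(next); next += 1 }
        else stalled += 1 // buffer full: the slow consumer throttles the producer
      }
      if (tick % consumerEvery == 0 && buffer.nonEmpty) {
        buffer.dequeue()
        consumed += 1
      }
    }
    (stalled, consumed)
  }
}
```

No configuration flag is involved: the producer slows down simply because there is no free buffer to write into, and every record is still delivered.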

10. Expressive and easy-to-use APIs in Scala and Java

• The high-level, expressive and easy-to-use DataStream API with flexible window semantics results in significantly less custom application logic compared to other open source stream processing solutions.
• Flink's DataStream API ports many operators from its DataSet batch processing API, such as map, reduce, and join, to the streaming world.
• In addition, it provides stream-specific operations such as window, split, connect, …
• Its support for user-defined functions eases the implementation of custom application behavior.
• The DataStream API is available in Scala and Java.

case class Word (word: String, frequency: Int)

DataStream API (streaming): Window WordCount

val env = StreamExecutionEnvironment.getExecutionEnvironment()
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
  .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
  .keyBy("word").sum("frequency")
  .print()
env.execute()

DataSet API (batch): WordCount

val env = ExecutionEnvironment.getExecutionEnvironment()
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
  .groupBy("word").sum("frequency")
  .print()
env.execute()

11. Support for batch

In Flink, batch processing is a special case of stream processing, as finite data sources are just streams that happen to end.

Flink offers a full toolset for batch processing with a dedicated DataSet API and libraries for machine learning and graph processing.

In addition, Flink contains several batch-specific optimizations, for example for scheduling, memory management, and query optimization.

Flink outperforms dedicated batch processing engines such as Spark and Hadoop MapReduce in batch use cases.
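The statement that a finite source is just a stream that happens to end can be sketched in plain Scala. This is a conceptual illustration only: `BoundedStreamSketch` and its functions are hypothetical names and say nothing about how Flink's DataSet API is implemented. One operator is written against a stream of lines, and the batch result falls out when the stream ends.

```scala
import scala.collection.mutable

object BoundedStreamSketch {
  // Streaming word count: emits the running count after every word.
  def runningCounts(lines: Iterator[String]): Iterator[(String, Int)] = {
    val counts = mutable.Map.empty[String, Int]
    lines.flatMap(_.split(" ")).filter(_.nonEmpty).map { w =>
      val n = counts.getOrElse(w, 0) + 1
      counts(w) = n
      (w, n)
    }
  }

  // "Batch" = the same operator over a stream that ends:
  // the final count of a word is its last running count.
  def batchCounts(lines: Seq[String]): Map[String, Int] =
    runningCounts(lines.iterator).toMap
}
```

The same `runningCounts` also works on an iterator that never ends, in which case a consumer simply keeps reading running counts and never reaches a final result.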

12. Integration with the Hadoop ecosystem

[Figure: Flink's integration with the Hadoop ecosystem; surviving labels: POSIX, Java/Scala Collections]

3.2. What are some streaming analytics use cases suitable for Flink?

1. Financial services
2. Telecommunications
3. Online gaming systems
4. Security & Intelligence
5. Advertisement serving
6. Sensor Networks
7. Social Media
8. Healthcare
9. Oil & Gas
10. Retail & eCommerce
11. Transportation and logistics


4. What are some novel use cases enabled by Flink?

4.1. Flink as an embedded key/value data store
4.2. Flink as a distributed CEP engine

4.1. Flink as an embedded key/value data store

The stream processor as a database: a new design pattern for data streaming applications, using Apache Flink and Apache Kafka. Applications are built directly on top of the stream processor, rather than on top of key/value databases populated by data streams.

The stateful operator features in Flink allow a streaming application to query state in the stream processor instead of in a key/value store, which is often a bottleneck. http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

The “State querying” feature is expected in the upcoming Flink 1.1. http://www.slideshare.net/JamieGrier/stateful-stream-processing-at-inmemory-speed/38

4.2. Flink as a distributed CEP engine

Flink stream processor as a CEP (Complex Event Processing) engine. Example: an application that ingests network monitoring events, identifies access patterns such as intrusion attempts using FlinkCEP, and analyzes and aggregates the identified access patterns.

Upcoming talk: ‘Streaming analytics and CEP - Two sides of the same coin’ by Till Rohrmann and Fabian Hueske at Berlin Buzzwords, June 05-07, 2016. http://berlinbuzzwords.de/session/streaming-analytics-and-cep-two-sides-same-coin

Further reading:
• Introducing Complex Event Processing (CEP) with Apache Flink, Till Rohrmann, April 6, 2016. http://flink.apache.org/news/2016/04/06/cep-monitoring.html
• FlinkCEP - Complex event processing for Flink. https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html
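The kind of pattern a CEP engine looks for in the network monitoring example can be sketched in plain Scala. This is a hand-written detector under assumptions and does not use the FlinkCEP pattern API: `CepSketch`, `Event`, the event kinds "fail" and "ok", and the rule itself (n failed attempts from one source within a time span, with no success in between) are hypothetical.

```scala
object CepSketch {
  final case class Event(source: String, kind: String, time: Long)

  // Emits an alert (source, time) whenever `n` "fail" events from one source occur
  // within `within` time units, with no "ok" event from that source in between.
  def detect(events: Seq[Event], n: Int, within: Long): Seq[(String, Long)] = {
    var recent = Map.empty[String, Vector[Long]] // source -> times of the current failure run
    val alerts = Vector.newBuilder[(String, Long)]

    for (e <- events) e.kind match {
      case "fail" =>
        // Extend the run and forget failures that are older than the time span.
        val run = (recent.getOrElse(e.source, Vector.empty[Long]) :+ e.time)
          .filter(_ > e.time - within)
        if (run.size >= n) {
          alerts += ((e.source, e.time))
          recent -= e.source // start over after raising an alert
        } else {
          recent += e.source -> run
        }
      case "ok" => recent -= e.source // a success resets the pattern
      case _    => ()
    }
    alerts.result()
  }
}
```

A CEP library lets such a rule be declared as a pattern (a sequence of conditions with a time bound) rather than coded by hand, and evaluates it per key over a distributed stream.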


5. Who is using Flink?

[Figure: logos of companies using Flink for streaming analytics, grouped by sector: Telecommunications, Retail, Financial Services, Gaming, Security]

Powered by Flink page: https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink

Twitter has its hack week and the winner, announced on December 18th, 2015, was a Flink-based streaming project! Extending the Yahoo! Streaming Benchmark and Winning Twitter Hack-Week with Apache Flink. Posted on February 2, 2016 by Jamie Grier.
• http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
• http://www.slideshare.net/JamieGrier/stateful-stream-processing-at-inmemory-speed

Yahoo! did some benchmarks to compare the performance of one of their use cases, originally implemented on Apache Storm, against Spark Streaming and Flink. Results posted on December 18, 2015.
• http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
• http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
• https://github.com/dataArtisans/yahoo-streaming-benchmark
• http://www.slideshare.net/JamieGrier/extending-the-yahoo-streaming-benchmark

Generic Streaming Analytics Architectural pattern: This is changing with Flink’s alerts, StreamSQL, state querying, FlinkCEP, …

[Figure: pipeline stages, from left to right]
• Event Producers: Apps, Devices, Sensors
• Collector: Flume, SpringXD, Logstash, NiFi, Fluentd
• Broker: Kafka, RabbitMQ, JMS, Amazon Kinesis, Google Cloud Pub/Sub, MapR Streams
• Processor: Flink, Spark, Storm, Samza, Kafka Streams
• Indexer: ElasticSearch, Solr, Cassandra, HBase, MapR DB, MongoDB, Apache Geode
• Visualizer/Search: Kibana, Custom GUI


6. Where do you go from here?

A few resources for you:
• Flink Knowledge Base: One-Stop for everything related to Apache Flink. By Slim Baltagi. http://sparkbigdata.com/component/tags/tag/27-flink
• Flink at the Apache Software Foundation: flink.apache.org/
• Free Apache Flink training from data Artisans: http://dataartisans.github.io/flink-training
• Flink Forward Conference, 12-14 September 2016, Berlin, Germany. http://flink-forward.org/ (call for submissions announced today, April 13th, 2016!)
• Free ebook from MapR: Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. https://www.mapr.com/streaming-architecture-using-apache-kafka-mapr-streams

A few takeaways:
• Apache Flink's unique capabilities enable new and sophisticated use cases, especially for real-world streaming analytics.
• Customer demand will push major Hadoop distributors to package Flink and support it.
• What would be the 5G of Big Data Analytics platforms? Guiding principles would be Unification, Simplification and Ease of use:
  • GUI to build batch and streaming applications
  • Unified API for batch and streaming
  • Single engine for batch and streaming
  • Unified storage layer (files, streams, NoSQL)
  • Unified query engine for SQL, NoSQL and structured streams

Thanks!
To all of you for attending!
Let’s keep in touch!

• [email protected]
• @SlimBaltagi
• https://www.linkedin.com/in/slimbaltagi

Any questions?