Apache Flink 1.0: A New Era for Real-World Streaming Analytics

Chicago Apache Flink Meetup, April 19th, 2016

Slim Baltagi
Director, Enterprise Architecture
Capital One Financial Corporation

Agenda

1. Origin and evolution of streaming capabilities in Flink
2. Why is Flink suitable for real-world streaming analytics?
3. What are some streaming analytics use cases suitable for Flink?
4. What are some streaming analytics use cases from companies actually using Flink?
5. What are some novel use cases enabled by Flink?
6. Where do you go from here?

1. Origin and evolution of data streaming capabilities in Flink

2009
Apache Flink has its origins in a research project called Stratosphere, the idea for which was conceived in 2009 by Professor Volker Markl of the Technische Universität Berlin in Germany.

At its core, Flink has always been a distributed dataflow streaming engine.

2012
Massively-Parallel Stream Processing under QoS Constraints with Nephele, June 12th, 2012 http://stratosphere.eu/assets/papers/massivelyParallelStreamProcessing_12.pdf

2013
Nephele Streaming: Stream Processing under QoS Constraints at Scale, August 5th, 2013 http://stratosphere.eu/assets/papers/nephele-streaming.pdf

2014
March 2014: Work on the first prototype of an API demonstrating the streaming capabilities of Stratosphere was started by Gyula Fora and Marton Balassi of the Hungarian Academy of Sciences.

April 2014: Flink joined the Apache Incubator in April 2014 and graduated as an Apache Top-Level Project (TLP) in December 2014.

June 2014: The first public mention of this prototype was on June 4th, 2014 http://2014.adattarhazforum.hu/letoltes/2014dwforum/mta_sztaki_balassi_marton.pdf

October 2014: The second public mention of this prototype was on October 7th, 2014 https://www.youtube.com/watch?v=k2AOqwm_7ts (at 10’37”) http://data-artisans.com/apache-flink-new-kid-on-the-block/

November 2014: The first talk using the name ‘Flink Streaming’ was given at ApacheCon on November 18th, 2014 http://events.linuxfoundation.org/sites/events/files/slides/flink_apachecon_small.pdf

2015
June 2015: “I would consider stream data analysis to be a major unique selling proposition for Flink. Due to its pipelined architecture Flink is a perfect match for big data stream processing in the Apache stack.” – Volker Markl. Ref.: On Apache Flink. Interview with Volker Markl, June 24th, 2015 http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/

June 2015: Flink 0.9 released on June 24, 2015: DataStream API in beta, exactly-once guarantees via checkpointing

November 2015: Flink 0.10 released on November 16th, 2015: event time support, windowing mechanism based on the Dataflow/Beam model, graduated DataStream API, high availability, state backends, new/updated connectors (Kafka, NiFi, ...), improved monitoring, …

2016
The Google paper “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing” http://research.google.com/pubs/pub43864.html influenced Flink’s rich windowing semantics.

March 2016: Flink 1.0 released on March 8th, 2016: stable DataStream API, out-of-core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support, …

April 2016: Apache Flink 1.0.1 was released on April 6th, 2016.

Flink 1.0.2 is being voted on.

Post Flink 1.0 in 2016
• Queryable state: query the state from within Flink instead of from a database. Querying the state that Flink holds while it is doing its computation will effectively replace a database! Planned for Flink 1.1
• SQL/StreamSQL and Table API
• Dynamic scaling: runtime scaling for DataStream programs
• Managed memory for streaming operators
• Security: over-the-wire encryption of RPC (Akka) and data transfers (Netty)
• Expose more runtime metrics: backpressure monitoring, spilling / out-of-core
• Additional streaming connectors: Kinesis, Cassandra, …
• Making YARN resources dynamic
• Support for Apache Mesos https://issues.apache.org/jira/browse/FLINK-1984

Further reading:
• Apache Flink Roadmap Draft, December 2015 https://docs.google.com/document/d/1ExmtVpeVVT3TIhO1JoBpC5JKXm-778DAD7eqw5GANwE/edit
• What’s next? Roadmap 2016. Robert Metzger, January 26, 2016. Berlin Apache Flink Meetup. http://www.slideshare.net/robertmetzger1/january-2016-flink-community-update-roadmap-2016/9

2. Why is Flink suitable for real-world streaming analytics?

Apache Flink 1.0, which was released on March 8th, 2016, comes with a competitive set of streaming analytics features, some of which are unique in the open source domain.

The combination of these features makes Apache Flink a unique choice for real-world streaming analytics.

Let’s discuss some of Apache Flink’s features for real-world streaming analytics.

2.1. Pipelined processing engine
2.2. Stream abstraction: DataStream as in the real world
2.3. Performance: Low latency and high throughput
2.4. Support for rich windowing semantics
2.5. Support for different notions of time
2.6. Stateful stream processing
2.7. Fault tolerance and correctness
2.8. High Availability
2.9. Backpressure handling
2.10. Expressive and easy-to-use APIs in Scala and Java
2.11. Support for batch
2.12. Integration with the Hadoop ecosystem

2.1. Pipelined processing engine

Flink is a pipelined (streaming) engine akin to parallel database systems, rather than a batch engine like Spark.

‘Flink’s runtime is not designed around the idea that operators wait for their predecessors to finish before they start, but they can already consume partially generated results.’

‘This is called pipeline parallelism and means that several transformations in a Flink program are actually executed concurrently with data being passed between them through memory and network channels.’ http://data-artisans.com/apache-flink-new-kid-on-the-block/

2.2. Stream abstraction: DataStream as in the real world

Real-world data is a series of events that are continuously produced by a variety of applications and disparate systems inside and outside the enterprise.

Flink, as a stream processing system, models streams as what they are in the real world, a series of events, and uses DataStream as an abstraction.

Spark, as a batch processing system, approximates these streams as micro-batches and uses DStream as an abstraction. This adds an artificial latency!
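The artificial latency of micro-batching can be made concrete with a little arithmetic. The sketch below is an idealized model, not Spark or Flink code: it assumes batches close at exact multiples of the batch interval and that processing itself takes no time. The class and method names are hypothetical.

```java
/** Idealized micro-batch model; a hypothetical helper, not Spark's or Flink's API. */
final class MicroBatchLatencySketch {

    /**
     * Milliseconds an event waits before its batch closes, assuming batches
     * close at exact multiples of batchIntervalMs and processing is instantaneous.
     */
    static long waitInBatchMs(long arrivalMs, long batchIntervalMs) {
        long nextBoundary = (arrivalMs / batchIntervalMs + 1) * batchIntervalMs;
        return nextBoundary - arrivalMs;
    }

    public static void main(String[] args) {
        // An event arriving 200 ms into a 1-second batch waits 800 ms.
        System.out.println(waitInBatchMs(1200, 1000));
    }
}
```

In this model an event waits up to one full batch interval before any computation starts, which is the latency floor that a record-at-a-time engine avoids.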

2.3. Performance: Low latency and high throughput

The pipelined processing engine enables true low-latency streaming applications, with results in milliseconds.

High throughput: efficiently handles high volumes of streams (millions of events per second).

Tunable latency/throughput trade-off: a tuning knob (the buffer timeout) lets you navigate the latency-throughput trade-off.

Yahoo! benchmarked Storm, Spark Streaming and Flink with a production use case (counting ad impressions grouped by campaign).

Full Yahoo! article (the benchmark stops at low write throughput, and the programs are not fault tolerant): https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

Full data Artisans article, which extends the Yahoo! benchmark to high volumes and uses Flink’s built-in state: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

Flink outperformed both Spark Streaming and Storm in this benchmark modeled after a real-world application:
• Flink achieved a throughput of 15 million messages/second on a 10-machine cluster. This is 35x higher throughput than Storm (80x compared to Yahoo’s runs).
• Flink ran with exactly-once guarantees, Storm with at-least-once.

Ultimately, you need to test the performance of your own streaming analytics application, as it depends on your own logic and the version of your preferred stream processing tool!

2.4. Support for rich windowing semantics

Flink provides rich windowing semantics. A window is a grouping of events based on some function of time (all records of the last 5 minutes), count (the last 10 events) or session (all the events of a particular web user).

Window types in Flink:
• Tumbling windows (no overlap)
• Sliding windows (with overlap)
• Session windows (gap of activity)
• Custom windows (with assigners, triggers and evictors)
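A minimal plain-Java sketch of how the first three window types assign events, assuming non-negative integer timestamps. WindowSketch and its method names are hypothetical; this is not Flink's window assigner API, only an illustration of the semantics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of window assignment; not Flink's API. */
final class WindowSketch {

    /** Tumbling: each timestamp falls into exactly one window; returns that window's start. */
    static long tumblingStart(long ts, long size) {
        return ts - (ts % size);
    }

    /** Sliding: a timestamp falls into size/slide overlapping windows; returns their starts. */
    static List<Long> slidingStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % slide);
        // Walk back while the window [start, start + size) still contains ts.
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    /** Session: sorted timestamps are split wherever the gap of inactivity exceeds maxGap. */
    static List<List<Long>> sessions(List<Long> sortedTs, long maxGap) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTs) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > maxGap) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tumblingStart(7, 5));     // window [5, 10)
        System.out.println(slidingStarts(7, 5, 1));  // five overlapping windows
        System.out.println(sessions(Arrays.asList(1L, 2L, 10L, 11L, 30L), 5));
    }
}
```

Custom windows generalize this idea: an assigner decides which windows an event belongs to, a trigger decides when a window fires, and an evictor decides which elements are dropped.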

In many systems, these windows are hard-coded and connected with the system’s internal checkpointing mechanism. Flink is the first open source streaming engine that completely decouples windowing from fault tolerance, allowing for richer forms of windows, such as sessions.

Further reading:
• http://flink.apache.org/news/2015/12/04/Introducing-windows.html
• http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

2.5. Support for different notions of time

In a streaming program with Flink, for example to define windows with respect to time, one can refer to different notions of time:
• Event time: when an event happened in the real world.
• Ingestion time: when data is loaded into Flink, from Kafka for example.
• Processing time: when data is processed by Flink.

In the real world, streams of events rarely arrive in the order in which they are produced, due to distributed sources, non-synced clocks, network delays… They are said to be “out of order” streams.

Flink is the first open source streaming engine that supports out-of-order streams and is able to consistently process events according to their event time.
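To see why the notion of time matters, here is a small plain-Java sketch that counts the same five events per 10-unit tumbling window, once by event time and once by processing (arrival) time. The timestamps are made up for illustration and EventTimeSketch is a hypothetical helper, not Flink's API.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch contrasting event time with processing time; not Flink's API. */
final class EventTimeSketch {

    /** Counts events per tumbling window (keyed by window start) for the given timestamps. */
    static Map<Long, Integer> countPerWindow(long[] timestamps, long size) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            long start = ts - (ts % size);
            counts.merge(start, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same five events; the third one (event time 3) is delayed and arrives at time 14.
        long[] eventTimes = {1, 2, 3, 11, 12};
        long[] processingTimes = {5, 6, 14, 12, 13};
        System.out.println(countPerWindow(eventTimes, 10));       // grouped by when events happened
        System.out.println(countPerWindow(processingTimes, 10));  // grouped by when events arrived
    }
}
```

Grouping by event time puts the delayed event back into the window where it happened; grouping by processing time silently moves it to a later window, so the result depends on network delays rather than on the data.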

Further reading:
• http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
• https://ci.apache.org/projects/flink/flink-docs-master/concepts/concepts.html#time
• https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/event_time.html
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/

2.6. Stateful stream processing

Many operations in a dataflow simply look at one individual event at a time, for example an event parser.

Other operations, called stateful operations, need to store data (for example, at the end of a window) for computations that occur later, such as in subsequent windows.
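The difference between the two kinds of operations can be sketched in plain Java. StatefulSketch, parseUser and RunningCount are hypothetical names chosen for illustration, and the comma-separated event format is an assumption; none of this is Flink's state API.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch of stateless vs. stateful operations; not Flink's API. */
final class StatefulSketch {

    /** Stateless: the output depends only on the current event (assumed format "user,action"). */
    static String parseUser(String rawEvent) {
        return rawEvent.split(",")[0].trim();
    }

    /** Stateful: keeps one counter per key, so each result depends on earlier events. */
    static final class RunningCount {
        private final Map<String, Long> state = new HashMap<>();

        /** Records one more event for the key and returns the updated count. */
        long update(String key) {
            return state.merge(key, 1L, Long::sum);
        }
    }

    public static void main(String[] args) {
        RunningCount counter = new RunningCount();
        String[] events = {"alice, click", "bob, click", "alice, purchase"};
        for (String event : events) {
            String user = parseUser(event);
            System.out.println(user + " -> " + counter.update(user));
        }
    }
}
```

The map inside RunningCount is exactly the kind of data that has to survive between events, which raises the question of where it is kept and how it is recovered after a failure.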

Now, where is the state of these stateful operations maintained?

The state can be stored in memory, in the file system, or in RocksDB, which is an embedded key/value data store and not an external database.

Flink also supports state versioning through savepoints, which are checkpoints of the state of a running streaming job that the user can trigger manually while the job is running.

Savepoints enable:
• Code upgrades: both application and framework
• Cluster maintenance and migration
• A/B testing and what-if scenarios
• Testing and debugging
• Restarting a job with adjusted parallelism

Further reading:
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/
• https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html

2.7. Fault tolerance and correctness

How to ensure that the state is correct after failures?

Apache Flink offers a fault tolerance mechanism to consistently recover the state of data streaming applications.

This ensures that even in the presence of failures, the operators do not perform duplicate updates to their state (exactly-once guarantees). This basically means that the computed results are the same whether there are failures along the way or not.

There is a switch to downgrade the guarantees to at least once if the use case tolerates duplicate updates.
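The effect of the two guarantees on operator state can be illustrated with a deliberately simplified single-operator model. This is not Flink's distributed snapshot algorithm: CheckpointSketch is a hypothetical class that sums records, checkpoints every few records, fails once, and then replays from the last checkpointed source offset. With a consistent snapshot the state rolls back together with the offset; without it, replayed records are applied twice.

```java
/** Simplified single-operator model of checkpoint and replay; not Flink's algorithm. */
final class CheckpointSketch {

    /**
     * Sums all records, simulating one failure after failAt records have been read.
     * A checkpoint of (state, offset) is taken every `interval` records.
     */
    static long run(long[] records, int interval, int failAt, boolean consistentSnapshot) {
        long state = 0;       // operator state: running sum
        int offset = 0;       // current source position
        long ckptState = 0;   // state captured by the last checkpoint
        int ckptOffset = 0;   // source position captured by the last checkpoint
        boolean failed = false;
        while (offset < records.length) {
            if (!failed && offset == failAt) {
                failed = true;
                // Recovery always rewinds the source to the checkpointed offset.
                offset = ckptOffset;
                // Consistent snapshot: state is rolled back to match that offset.
                // Otherwise the state keeps updates that are about to be replayed.
                if (consistentSnapshot) {
                    state = ckptState;
                }
                continue;
            }
            state += records[offset++];
            if (offset % interval == 0) {
                ckptState = state;
                ckptOffset = offset;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        long[] records = {1, 2, 3, 4, 5, 6};
        System.out.println(run(records, 2, 3, true));   // same result as a failure-free run
        System.out.println(run(records, 2, 3, false));  // record 3 is counted twice
    }
}
```

In the first run the failure is invisible in the result (the sum is 21, as without a failure); in the second run the replayed record is applied twice (24), which is the kind of duplicate update that at-least-once processing tolerates.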

Further reading:
• High-throughput, low-latency, and exactly-once stream processing with Apache Flink http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
• Data Streaming Fault Tolerance document: http://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
• ‘Lightweight Asynchronous Snapshots for Distributed Dataflows’, June 28, 2015 http://arxiv.org/pdf/1506.08603v1.pdf
• Distributed Snapshots: Determining Global States of Distributed Systems, February 1985 (the Chandy-Lamport algorithm) http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf

2.8. High Availability

In the real world, streaming analytics applications need to be reliable, capable of running jobs for months, and resilient in the event of failures.

The JobManager (master) is responsible for scheduling and resource management. If it crashes, no new programs can be submitted and running programs will fail.

Flink provides a High Availability (HA) mode to recover from a JobManager crash and eliminate this Single Point Of Failure (SPOF).

Further reading: JobManager High Availability https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html

2.9. Backpressure handling

In the real world, there are situations where a system is receiving data at a higher rate than it can normally process. This is called backpressure.

Flink handles backpressure implicitly through its architecture without user interaction while backpressure handling in Spark is through manual configuration: spark.streaming.backpressure.enabled.

Flink provides backpressure monitoring to allow users to understand bottlenecks in streaming applications.
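The mechanism behind implicit backpressure can be sketched with a bounded buffer between a fast producer and a slow consumer. This is a single-threaded, tick-based simulation in plain Java under simplifying assumptions (one send attempt per tick, one item consumed every few ticks); BackpressureSketch is a hypothetical name and the code does not reflect Flink's network stack.

```java
import java.util.ArrayDeque;

/** Tick-based simulation of a bounded buffer slowing down a producer; not Flink's API. */
final class BackpressureSketch {

    /**
     * Returns the number of ticks the producer needs to hand over all items when
     * the buffer holds at most `capacity` items and the consumer takes one item
     * every `consumeEvery` ticks.
     */
    static int ticksToSend(int items, int capacity, int consumeEvery) {
        ArrayDeque<Integer> buffer = new ArrayDeque<>();
        int sent = 0;
        int tick = 0;
        while (sent < items) {
            tick++;
            if (buffer.size() < capacity) {
                buffer.add(sent);
                sent++;
            }
            // else: the buffer is full, so the producer stalls for this tick.
            if (tick % consumeEvery == 0) {
                buffer.poll();
            }
        }
        return tick;
    }

    public static void main(String[] args) {
        System.out.println(ticksToSend(5, 2, 1));  // consumer keeps up: no stalls
        System.out.println(ticksToSend(5, 2, 3));  // slow consumer: producer is throttled
    }
}
```

No rate limit is configured anywhere: the full buffer alone forces the producer down to the consumer's pace, which is the sense in which backpressure is handled implicitly by the architecture.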

Further reading:
• How Flink handles backpressure? by Ufuk Celebi, Kostas Tzoumas and Stephan Ewen, August 31, 2015. http://data-artisans.com/how-flink-handles-backpressure/

2.10. Expressive and easy-to-use APIs in Scala and Java

High level, expressive and easy to use DataStream API with flexible window semantics results in significantly less custom application logic compared to other open source stream processing solutions.

Flink's DataStream API ports many operators from its DataSet batch processing API such as map, reduce, and join to the streaming world.

In addition, it provides stream-specific operations such as window, split, connect, …

Its support for user-defined functions eases the implementation of custom application behavior.

The DataStream API is available in Scala and Java.

DataStream API (streaming): Window WordCount

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    case class Word(word: String, frequency: Int)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val lines: DataStream[String] = env.socketTextStream(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .keyBy("word")
      // 5-second windows sliding every second (Flink 1.0 syntax)
      .timeWindow(Time.seconds(5), Time.seconds(1))
      .sum("frequency")
      .print()
    env.execute()

DataSet API (batch): WordCount

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    val lines: DataSet[String] = env.readTextFile(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word")
      .sum("frequency")
      .print()  // for a DataSet, print() itself triggers execution

2.11. Support for batch

In Flink, batch processing is a special case of stream processing, as finite data sources are just streams that happen to end.

Flink offers a full toolset for batch processing with a dedicated DataSet API and libraries for machine learning and graph processing.

In addition, Flink contains several batch-specific optimizations such as for scheduling, memory management, and query optimization.

Flink outperforms dedicated batch processing engines such as Spark and Hadoop MapReduce in batch use cases.

2.12. Integration with the Hadoop ecosystem

[Figure: Flink's integration with the Hadoop ecosystem. Recoverable labels: POSIX, Java/Scala Collections]

3. What are some streaming analytics use cases suitable for Flink?

1. Financial services
2. Telecommunications
3. Online gaming systems
4. Security & Intelligence
5. Advertisement serving
6. Sensor Networks
7. Social Media
8. Healthcare
9. Oil & Gas
10. Retail & eCommerce
11. Transportation and logistics

4. What are some streaming analytics use cases from companies actually using Flink?

Who is using Apache Flink? Some companies using Flink for streaming analytics, by sector: Telecommunications, Retail, Financial Services, Gaming, Security.

Powered by Flink (companies, software projects, universities/research institutes): https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink

Bouygues Telecom is a full-service communication operator (mobile, fixed telephony, TV, Internet, and Cloud computing) and one of the largest providers in France, with over 11 million mobile subscribers, …

Bouygues Telecom uses Flink for real-time event processing and analytics over billions of Kafka messages per day.

Stream processing at Bouygues Telecom with Apache Flink, by Mohamed Amine Abdessemed
• Blog: http://data-artisans.com/flink-at-bouygues-html/ June 1st, 2015
• Slides: http://www.slideshare.net/FlinkForward/mohamed-amine-abdessemed-realtime-data-integration-with-apache-flink-kafka
• Video: https://www.youtube.com/watch?v=hjmgZfXSi3M

Otto Group is the world’s second-largest online retailer in the end-consumer (B2C) business and Europe’s largest online retailer in the B2C fashion and lifestyle business.

“A range of exciting projects at the BI department were implemented with Apache Flink, e.g. a crowd-sourced user-agent identification, and a search session identifier.”

How we selected Apache Flink as our Stream Processing Framework at the Otto Group Business Intelligence Department, October 6, 2015
• Blog: http://data-artisans.com/how-we-selected-apache-flink-at-otto-group/
• Slides: http://www.slideshare.net/FlinkForward/christian-kreuzfeld-static-vs-dynamic-stream-processing
• Video: https://www.youtube.com/watch?v=cnqPyw_uQAQ

At king.com, Flink is used to process more than 30 billion events daily and compute real-time player statistics by leveraging Flink's stateful streaming abstractions and Complex Event Processing.

References:
• Apache Software Foundation Blog, March 8th, 2016
  • Blog: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces88
• Hadoop Summit Dublin 2016, April 13, 2016
  • Slides: http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem-hadoop-summit-2016-60887821/3
  • Video: https://www.youtube.com/watch?v=mRhCpp-p11E

Zalando(.com) is Europe’s leading online fashion platform, doing business in 15 markets and attracting well over 100 million visits per month.

“Delivering first-class shopping experiences to our +14 million customers requires moving fast and using cutting-edge, open-source technologies.”

Near-real-time business intelligence for the following use cases: business process monitoring and continuous ETL.

Apache Showdown: Flink vs. Spark, by Javier Lopez and Mihail Vieru, 31 March 2016 https://tech.zalando.com/blog/apache-showdown-flink-vs.-spark/

Capital One is a top-10 consumer and commercial banking institution that does business in the US, Canada and the UK.

Flink was used for real-time monitoring of customer activity data (audit log event details, failure and success data, …) to:
• proactively detect and resolve issues immediately
• prevent significant customer impact
• enable a flawless digital enterprise experience

Flink Case Study at Capital One, Flink Forward 2015 conference, Berlin, Germany, October 12th, 2015 http://www.slideshare.net/FlinkForward/flink-case-study-capital-one

[Figure: Real-Time Monitoring of Customer Activity]

Twitter has its hack week, and the winner, announced on December 18th, 2015, was a Flink-based streaming project! Extending the Yahoo! Streaming Benchmark and Winning Twitter Hack-Week with Apache Flink. Posted on February 2, 2016 by Jamie Grier
• http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
• http://www.slideshare.net/JamieGrier/stateful-stream-processing-at-inmemory-speed

Yahoo! did some benchmarks to compare the performance of one of its use cases, originally implemented on Apache Storm, against Spark Streaming and Flink. Results posted on December 18, 2015
• http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
• http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
• https://github.com/dataArtisans/yahoo-streaming-benchmark
• http://www.slideshare.net/JamieGrier/extending-the-yahoo-streaming-benchmark

Generic streaming analytics architectural pattern

Event Producers → Collector → Broker → Processor → Indexer → Visualizer/Search

• Event Producers: Apps, Devices, Sensors
• Collector: Flume, SpringXD, Logstash, NiFi, Fluentd
• Broker: Kafka, RabbitMQ, JMS, Amazon Kinesis, Google Cloud Pub/Sub, MapR Streams
• Processor: Flink, Spark, Storm, Samza, Kafka Streams
• Indexer: ElasticSearch, Solr, Cassandra, HBase, MapR DB, MongoDB, Apache Geode
• Visualizer/Search: Kibana, Custom GUI

This is changing with Flink’s alerts, StreamSQL, state querying, FlinkCEP, …

5. What are some novel use cases enabled by Flink?

5.1. Flink as an embedded key/value data store
5.2. Flink as a distributed CEP engine

5.1. Flink as an embedded key/value data store

The stream processor as a database: a new design pattern for data streaming applications, using Apache Flink and Apache Kafka. Applications are built directly on top of the stream processor, rather than on top of key/value databases populated by data streams.

The stateful operator features in Flink allow a streaming application to query state in the stream processor instead of in a key/value store, which is often a bottleneck. http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

The “state querying” feature is expected in the upcoming Flink 1.1 http://www.slideshare.net/JamieGrier/stateful-stream-processing-at-inmemory-speed/38

5.2. Flink as a distributed CEP engine

The Flink stream processor as a CEP (Complex Event Processing) engine. Example: an application that ingests network monitoring events, identifies access patterns such as intrusion attempts using FlinkCEP, and analyzes and aggregates the identified access patterns.

Upcoming talk: ‘Streaming analytics and CEP - Two sides of the same coin’ by Till Rohrmann and Fabian Hueske at Berlin Buzzwords, June 5-7, 2016. http://berlinbuzzwords.de/session/streaming-analytics-and-cep-two-sides-same-coin

Further reading:
• Introducing Complex Event Processing (CEP) with Apache Flink, Till Rohrmann, April 6, 2016 http://flink.apache.org/news/2016/04/06/cep-monitoring.html
• FlinkCEP - Complex event processing for Flink https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html

    import org.apache.flink.cep.pattern.Pattern;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // Warn when two consecutive temperature events reach THRESHOLD within 10 seconds.
    Pattern<MonitoringEvent, ?> warningPattern = Pattern.<MonitoringEvent>begin("First Event")
        .subtype(TemperatureEvent.class)
        .where(evt -> evt.getTemperature() >= THRESHOLD)
        .next("Second Event")
        .subtype(TemperatureEvent.class)
        .where(evt -> evt.getTemperature() >= THRESHOLD)
        .within(Time.seconds(10));
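What such a pattern matches can be sketched without the library. The plain-Java code below is an illustration only, not FlinkCEP: it assumes every reading is a temperature event, encodes each reading as a {timestampMillis, temperature} pair, and treats the time bound as inclusive, which is an arbitrary choice made here. CepSketch and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of a "two hot readings in a row within a time bound" pattern; not FlinkCEP. */
final class CepSketch {

    /**
     * readings[i] = {timestampMillis, temperature}, ordered by time.
     * Returns every index i such that readings i and i+1 both reach the
     * threshold and are at most withinMillis apart.
     */
    static List<Integer> warnings(long[][] readings, long threshold, long withinMillis) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i + 1 < readings.length; i++) {
            boolean firstHot = readings[i][1] >= threshold;
            boolean secondHot = readings[i + 1][1] >= threshold;
            boolean inTime = readings[i + 1][0] - readings[i][0] <= withinMillis;
            if (firstHot && secondHot && inTime) {
                matches.add(i);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        long[][] readings = {
            {0, 105}, {4000, 110},   // two hot readings 4 s apart: match
            {20000, 120},            // hot, but 16 s after the previous one: no match
            {21000, 90}              // below threshold
        };
        System.out.println(warnings(readings, 100, 10000));
    }
}
```

A CEP library adds what this loop lacks: declarative pattern definitions, per-key matching, and distributed, fault-tolerant evaluation over unbounded streams.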

6. Where do you go from here?

A few resources for you:

• Overview of Apache Flink: the 4G of Big Data Analytics Frameworks, Hadoop Summit Europe, April 13th, 2016
  • Slides: http://www.slideshare.net/SlimBaltagi/overview-of-apache-fink-the-4-g-of-big-data-analytics-frameworks
  • Video: https://www.youtube.com/watch?v=_BZURQn2EQI

• Flink Knowledge Base: One-Stop for everything related to Apache Flink. http://sparkbigdata.com/component/tags/tag/27-flink

• Flink at the Apache Software Foundation: flink.apache.org/

• Free Apache Flink training from data Artisans http://dataartisans.github.io/flink-training

• Flink Forward Conference, 12-14 September 2016, Berlin, Germany http://flink-forward.org/ (call for submissions announced on April 13th, 2016)

• Free ebook from MapR: Streaming Architecture: New Designs Using Apache Kafka and MapR Streams https://www.mapr.com/streaming-architecture-using-apache-kafka-mapr-streams

• Free ebook from Confluent: Making sense of stream processing http://www.confluent.io/making-sense-of-stream-processing-ebook

A few takeaways:
• Apache Flink’s unique capabilities enable new and sophisticated use cases, especially for real-world streaming analytics.
• Customer demand will push major Hadoop distributors to package Flink and support it.
• Apache Flink will enable innovations and disruptions in many verticals with its capabilities in real-world streaming analytics.

Thanks!
To all of you for attending!
Let’s keep in touch!

• @SlimBaltagi
• https://www.linkedin.com/in/slimbaltagi

Any questions?