Apache Flink Crash Course Slim Baltagi & Srini Palthepu with some materials from data-artisans.com Chicago Apache Flink Meetup


Page 1: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

Apache Flink Crash Course

Slim Baltagi & Srini Palthepu

with some materials from data-artisans.com

Chicago Apache Flink Meetup

August 4th 2015

Page 2: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

2

“One week of trials and errors can

save you up to half an hour of

reading the documentation.”

Anonymous

Page 3: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

3

For an overview of Apache Flink, see our slides at http://goo.gl/gVOSp8

[Architecture diagram: the Apache Flink stack]

• APIs & Libraries: DataSet API (Java/Scala/Python) for batch processing and DataStream API (Java/Scala) for stream processing; libraries: Gelly, Table, ML, SAMOA (WiP); compatibility layers: Hadoop M/R, Google Dataflow (WiP), MRQL, Cascading (WiP); Zeppelin on top

• Runtime: distributed streaming dataflow, fed by a Batch Optimizer and a Stream Builder

• Deploy: Local (single JVM, embedded), Docker, Cluster (Standalone, YARN, Tez, Mesos (WIP)), Cloud (Google's GCE, Amazon's EC2, IBM Docker Cloud, …)

• Storage: Files (Local, HDFS, S3, Tachyon), Databases (MongoDB, HBase, SQL, …), Streams (Flume, Kafka, RabbitMQ, …)

Page 4: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

4

In this talk, we will cover practical steps for:

• Setup and configuration of your Apache Flink environment

• Using Flink tools

• Learning Flink's APIs and domain-specific libraries through some Apache Flink program examples and free training from Data Artisans, in Java and Scala

• Writing, testing, debugging, deploying and tuning your Flink applications

Page 5: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

5

Agenda

1. How to set up and configure your Apache Flink environment?

2. How to use Apache Flink tools?

3. How to learn Apache Flink’s APIs and its domain specific libraries?

4. How to set up your IDE (IntelliJ IDEA or Eclipse) for Apache Flink?

5. How to write, test and debug your Apache Flink program in an IDE?

6. How to deploy your Apache Flink application in local, in a cluster or in the cloud?

7. How to tune your Apache Flink application?

Page 6: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

6

1. How to set up and configure your Apache Flink environment?

1.1   Local (on a single machine)

1.2   VM image (on a single machine)

1.3  Docker

1.4   Standalone Cluster 

1.5   YARN Cluster

1.6   Cloud 

Page 7: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

7

1.1   Local (on a single machine)

Flink runs on Linux, OS X and Windows. In order to execute a program on a running Flink instance (and not from within your IDE), you need to install Flink on your machine. The following steps will be detailed for both Unix-like (Linux, OS X) and Windows environments:

1.1.1 Verify requirements

1.1.2 Download

1.1.3 Unpack

1.1.4 Check the unpacked archive

1.1.5 Start a local Flink instance

1.1.6 Validate Flink is running

1.1.7 Run a Flink example

1.1.8 Stop the local Flink instance

Page 8: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

8

1.1   Local (on a single machine)

1.1.1 Verify requirements

The machine that Flink will run on must have Java 1.6.x or higher installed. In a Unix-like environment, the $JAVA_HOME environment variable must be set. Check the correct installation of Java by issuing: java -version, and check that $JAVA_HOME is set by issuing: echo $JAVA_HOME. If needed, follow the instructions for installing Java and setting JAVA_HOME here:

http://docs.oracle.com/cd/E19182-01/820-7851/inst_cli_jdk_javahome_t/index.html

Page 9: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

9

1.1   Local (on a single machine)

In a Windows environment, check the correct installation of Java by issuing: java -version. Also, the bin folder of your Java Runtime Environment must be included in Windows' %PATH% variable. If needed, follow this guide to add Java to the path variable: http://www.java.com/en/download/help/path.xml

1.1.2 Download the latest stable release of Apache Flink from http://flink.apache.org/downloads.html

For example, in a Linux-like environment, run the following command:

wget https://www.apache.org/dist/flink/flink-0.9.0/flink-0.9.0-bin-hadoop2.tgz

Page 10: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

10

1.1   Local (on a single machine)

1.1.3 Unpack the downloaded .tgz archive. Example:

$ cd ~/Downloads        # Go to download directory
$ tar -xvzf flink-*.tgz # Unpack the downloaded archive

1.1.4 Check the unpacked archive

$ cd flink-0.9.0

The resulting folder contains a Flink setup that can be executed locally without any further configuration. flink-conf.yaml under flink-0.9.0/conf contains the default configuration parameters that allow Flink to run out-of-the-box in single-node setups.

Page 11: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

11

1.1   Local (on a single machine)

Page 12: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

12

1.1   Local (on a single machine)

1.1.5 Start a local Flink instance:

• Given that you have a local Flink installation, you can start a Flink instance that runs a master and a worker process on your local machine in a single JVM. This execution mode is useful for local testing.

• On a Unix-like system you can start a Flink instance as follows:

cd /to/your/flink/installation
./bin/start-local.sh

Page 13: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

13

1.1 Local (on a single machine)

1.1.5 Start a local Flink instance (continued). On Windows you can start it either with:

• Windows batch files, by running the following commands:

cd C:\to\your\flink\installation
.\bin\start-local.bat

• or with Cygwin and Unix scripts: start the Cygwin terminal, navigate to your Flink directory and run the start-local.sh script:

$ cd /cygdrive/c
$ cd flink
$ bin/start-local.sh

Page 14: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

14

1.1   Local (on a single machine)

The JobManager (the master of the distributed system) automatically starts a web interface to observe program execution. It runs on port 8081 by default (configured in conf/flink-conf.yaml): http://localhost:8081/

1.1.6 Validate that Flink is running

You can validate that a local Flink instance is running by:

• Issuing the following command: $ jps (jps: Java Virtual Machine Process Status tool)

• Looking at the log files in ./log/: $ tail log/flink-*-jobmanager-*.log

• Opening the JobManager's web interface at http://localhost:8081

Page 15: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

15

1.1   Local (on a single machine)

1.1.7 Run a Flink example

• On a Unix-like system you can run a Flink example as follows:

cd /to/your/flink/installation
./bin/flink run ./examples/flink-java-examples-0.9.0-WordCount.jar

• On Windows, open a second terminal and run the following commands:

cd C:\to\your\flink\installation
.\bin\flink.bat run .\examples\flink-java-examples-0.9.0-WordCount.jar

1.1.8 Stop the local Flink instance

• On Unix, call ./bin/stop-local.sh

• On Windows, quit the running process with Ctrl+C
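The bundled WordCount example tokenizes its input and counts each word. The core logic it runs can be sketched in plain Java, with no Flink dependency; the class name and details here are our illustration, not Flink's code:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Mimics the bundled WordCount example: lowercase, split on
    // non-word characters, then count occurrences of each word.
    public static Map<String, Long> count(String... lines) {
        return Arrays.stream(lines)
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Each distinct word is mapped to its count.
        System.out.println(count("to be or not to be"));
    }
}
```

Running the jar above applies this same transformation, only distributed across Flink's DataSet operators.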

Page 16: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

16

1.2   VM image (on a single machine)

Download the Flink virtual machine from: https://docs.google.com/uc?id=0B-oU5Z27sz1hZ0VtaW5idFViNU0&export=download

The password is: flink

This version works with VMware Fusion on OS X since there is no VMware Player for OS X: https://www.vmware.com/products/fusion/fusion-evaluation.html

Page 17: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

17

1.3 Docker

Apache Flink cluster deployment on Docker using Docker Compose, by Romeo Kienzler. Talk at the Apache Flink Meetup Berlin planned for August 26, 2015: http://www.meetup.com/Apache-Flink-Meetup/events/223913365/

The talk will:

• Introduce the basic concepts of container isolation, exemplified with Docker

• Explain how Apache Flink is made elastic using Docker Compose

• Show how to push the cluster to the cloud, exemplified with the IBM Docker Cloud

Page 18: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

18

1.4  Standalone Cluster 

See the quick start - cluster setup: https://ci.apache.org/projects/flink/flink-docs-release-0.9/quickstart/setup_quickstart.html#cluster-setup

See instructions on how to run Flink in a fully distributed fashion on a cluster. This involves two steps:

• Installing and configuring Flink

• Installing and configuring the Hadoop Distributed File System (HDFS)

https://ci.apache.org/projects/flink/flink-docs-master/setup/cluster_setup.html

Page 19: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

19

1.5 YARN Cluster

You can easily deploy Flink on your existing YARN cluster:

• Download the Flink Hadoop 2 package: http://www.apache.org/dyn/closer.cgi/flink/flink-0.9.0/flink-0.9.0-bin-hadoop2.tgz

• Make sure your HADOOP_HOME (or YARN_CONF_DIR or HADOOP_CONF_DIR) environment variable is set so that Flink can read your YARN and HDFS configuration.

Page 20: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

20

1.5 YARN Cluster

Run the YARN client with: ./bin/yarn-session.sh

You can run the client with options -n 10 -tm 8192 to allocate 10 TaskManagers with 8 GB of memory each.

For more detailed instructions, check out the documentation: https://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html

Page 21: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

21

1.6   Cloud 

1.6.1 Google Compute Engine (GCE)

1.6.2 Amazon EMR

Page 22: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

22

1.6 Cloud

1.6.1 Google Compute Engine

Free trial for Google Compute Engine: https://cloud.google.com/free-trial/ Enjoy your $300 in GCE for 60 days!

Now, how to set up Flink with Hadoop 1 or Hadoop 2 on top of a Google Compute Engine cluster? Google's bdutil starts a cluster and deploys Flink with Hadoop. To get started, just follow the steps here:

https://ci.apache.org/projects/flink/flink-docs-master/setup/gce_setup.html

Page 23: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

23

1.6 Cloud

1.6.2 Amazon EMR

Amazon Elastic MapReduce (Amazon EMR) is a web service providing a managed Hadoop framework.

• http://aws.amazon.com/elasticmapreduce/

• http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html

• Example: Use Stratosphere with Amazon Elastic MapReduce, February 18, 2014, by Robert Metzger: https://flink.apache.org/news/2014/02/18/amazon-elastic-mapreduce-cloud-yarn.html

Page 24: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

24

1.3 Docker (continued)

Docker can be used for local development. Resource requirements on data processing clusters often exhibit high variation, and elastic deployments reduce TCO (total cost of ownership). Docker offers container-based virtualization: lightweight and portable; build once, run anywhere; easy packaging of applications; automated and scripted; isolated.

Apache Flink cluster deployment on Docker using Docker Compose: https://github.com/streamnsight/docker-flink

Page 25: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

25

2. How to use Apache Flink tools?

2.1   Command-Line Interface (CLI)

2.2   Job Client Web Interface

2.3   Job Manager Web Interface

2.4   Interactive Scala Shell

2.5   Zeppelin Notebook

Page 26: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

26

2.1   Command-Line Interface (CLI)

Example:

./bin/flink run ./examples/flink-java-examples-0.9.0-WordCount.jar

bin/flink has 4 major actions:

• run    # runs a program
• info   # displays information about a program
• list   # lists running and finished programs (-r for running, -s for scheduled), e.g. ./bin/flink list -r -s
• cancel # cancels a running program (-i <jobID>)

See more examples: https://ci.apache.org/projects/flink/flink-docs-master/apis/cli.html

Page 27: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

27

2.2   Job Client Web Interface

Flink provides a web interface to:

• Upload jobs
• Inspect their execution plans
• Execute them
• Showcase programs
• Debug execution plans
• Demonstrate the system as a whole

The web interface runs on port 8080 by default. To specify a custom port, set the webclient.port property in the ./conf/flink-conf.yaml configuration file.

Page 28: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

28

2.2   Job Client Web Interface

Start the web interface by executing: ./bin/start-webclient.sh
Stop the web interface by executing: ./bin/stop-webclient.sh

• Jobs are submitted to the JobManager specified by jobmanager.rpc.address and jobmanager.rpc.port

• For more details and further configuration options, please consult: https://ci.apache.org/projects/flink/flink-docs-release-0.9/setup/config.html#webclient

Page 29: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

29

2.3   Job Manager Web Interface

The JobManager (the master of the distributed system) starts a web interface to observe program execution. It runs on port 8081 by default (configured in conf/flink-conf.yaml). Open the JobManager's web interface at http://localhost:8081

• jobmanager.rpc.port: 6123
• jobmanager.web.port: 8081

Page 30: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

30

2.3   Job Manager Web Interface

Overall system status

Job execution details

Task Manager resource utilization

Page 31: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

31

2.3 Job Manager Web Interface

The JobManager web frontend allows you to:

• Track the progress of a Flink program, as all status changes are also logged to the JobManager's log file

• Figure out why a program failed, as it displays the exceptions of failed tasks and lets you see which parallel task failed first and caused the other tasks to cancel execution

Page 32: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

32

2.4   Interactive Scala Shell

Flink comes with an interactive Scala shell, a REPL (Read-Evaluate-Print Loop): ./bin/start-scala-shell.sh

• Interactive queries
• Lets you explore data quickly
• Complete Scala API available
• Can be used in a local setup as well as in a cluster setup
• Comes with command history and auto-completion

So far only batch mode is supported; there is a plan to add streaming in the future: https://ci.apache.org/projects/flink/flink-docs-master/scala_shell.html

Page 33: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

33

2.4   Interactive Scala Shell

bin/start-scala-shell.sh --host localhost --port 6123

Page 34: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

34

2.4   Interactive Scala Shell

Example 1:

Scala-Flink> val input = env.fromElements(1, 2, 3, 4)
Scala-Flink> val doubleInput = input.map(_ * 2)
Scala-Flink> doubleInput.print()

Example 2:

Scala-Flink> val text = env.fromElements(
  "To be, or not to be,--that is the question:--",
  "Whether 'tis nobler in the mind to suffer",
  "The slings and arrows of outrageous fortune",
  "Or to take arms against a sea of troubles,")
Scala-Flink> val counts = text.flatMap { _.toLowerCase.split("\\W+") }.map { (_, 1) }.groupBy(0).sum(1)
Scala-Flink> counts.print()
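Example 1's map(_ * 2) can be sanity-checked without a running shell. A plain-Java equivalent of that transformation (illustrative only, not the Flink API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DoubleInputSketch {
    // Same transformation as the shell session:
    // map each element of the input to twice its value.
    public static List<Integer> doubleAll(List<Integer> input) {
        return input.stream().map(x -> x * 2).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(doubleAll(Arrays.asList(1, 2, 3, 4))); // prints [2, 4, 6, 8]
    }
}
```

doubleInput.print() in the shell should show the same four values.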

Page 35: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

35

2.4   Interactive Scala Shell

Problems with the interactive Scala shell:

• No visualization
• No saving
• No replaying of written code
• No assistance as in an IDE

Page 36: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

36

2.5   Zeppelin Notebook

• Web-based interactive computation environment
• Combines rich text, executable code, plots and rich media
• Exploratory data science
• Storytelling

Page 37: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

37

2.5   Zeppelin Notebook

http://localhost:8080/

Page 38: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

38

3. How to learn Flink’s APIs and libraries?

3.1 How to run the examples in the Apache Flink bundle?

3.2 How to learn Flink Programming APIs?

3.3 How to learn Apache Flink Libraries?

Page 39: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

39

3.1 How to run the examples in the Apache Flink bundle?

3.1.1 Where are the examples?

3.1.2 Where is the related source code?

3.1.3 How to re-build these examples?

3.1.4 How to run these examples?

Page 40: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

40

3.1 How to run the examples in the Apache Flink bundle?

3.1.1    Where are the examples?

Page 41: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

41

3.1 How to run the examples in the Apache Flink bundle?

The examples provided in the Flink bundle showcase different applications of Flink, from simple word counting to graph algorithms. They illustrate the use of Flink's API and are a very good way to learn how to write Flink jobs. A good starting point would be to modify them!

Now, where is the related source code?

Page 42: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

42

3.1 How to run the examples in the Apache Flink bundle?

3.1.2    Where is the related source code?

You can find the source code of these Flink examples in the flink-java-examples or flink-scala-examples modules of the flink-examples module of the Flink source release. You can also access the source (and hence the examples) through GitHub: https://github.com/apache/flink/tree/master/flink-examples

Page 43: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

43

3.1 How to run the examples in the Apache Flink bundle?

3.1.2    Where is the related source code? (continued)

If you don't want to import the whole Flink project just to play around with the examples, you can:

• Create an empty Maven project. This script will automatically set everything up for you:
$ curl http://flink.apache.org/q/quickstart.sh | bash

• Import the "quickstart" project into Eclipse or IntelliJ. It will download all dependencies and package everything correctly.

• If you want to use an example there, just copy the Java file into the "quickstart" project.

Page 44: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

44

3.1 How to run the examples in the Apache Flink bundle?

3.1.3    How to re-build these examples?

To re-build the examples, run:

mvn clean package -DskipTests

in the flink-examples/flink-java-examples directory.

Page 45: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

45

3.1 How to run the examples in the Apache Flink bundle?

3.1.4 How to run these examples?

To display the command-line arguments:
./bin/flink info ./examples/flink-java-examples-0.9.0-WordCount.jar

Example of running an example:
./bin/flink run ./examples/flink-java-examples-0.9.0-WordCount.jar

More on the bundled examples: https://ci.apache.org/projects/flink/flink-docs-master/apis/examples.html#running-an-example

Page 46: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

46

3.2 How to learn Flink Programming APIs?

3.2.1 DataSet API

3.2.2 DataStream API

3.2.3 Table API - Relational Queries

Page 47: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

47

3.2 How to learn Flink Programming APIs?

3.2.1 DataSet API

https://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html

https://ci.apache.org/projects/flink/flink-docs-master/api/java/

FREE Apache Flink Training by Data Artisans: DataSet API basics

• Lecture:
  • Slides: http://dataartisans.github.io/flink-training/dataSetBasics/slides.html
  • Video: https://www.youtube.com/watch?v=1yWKZ26NQeU
• Exercise: http://dataartisans.github.io/flink-training/dataSetBasics/handsOn.html

Page 48: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

48

3.2 How to learn Flink Programming APIs?

3.2.1 DataSet API

DataSet API Advanced

• Lecture:
  • Slides: http://dataartisans.github.io/flink-training/dataSetAdvanced/slides.html
  • Video: https://www.youtube.com/watch?v=1yWKZ26NQeU
• Exercise: http://dataartisans.github.io/flink-training/dataSetAdvanced/handsOn.html

Page 49: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

49

3.2 How to learn Flink Programming APIs?

3.2.2 DataStream API

https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html

https://ci.apache.org/projects/flink/flink-docs-master/api/java/

Example 1: Event pattern detection with Apache Flink

This is a Flink streaming demo given by Data Artisans on July 17, 2015, titled 'Apache Flink: Unifying batch and streaming modern data analysis' at the Bay Area Apache Flink Meetup:

• Related code: https://github.com/StephanEwen/flink-demos/tree/master/streaming-state-machine

• Related slides: http://www.slideshare.net/KostasTzoumas/first-flink-bay-area-meetup

• Related video recording: https://www.youtube.com/watch?v=BJjGD8ijJcg

Page 50: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

50

3.2 How to learn Flink Programming APIs?

3.2.2 DataStream API

Example 2: Fault-tolerant streaming with Flink

• Slides 16-23: http://www.slideshare.net/AljoschaKrettek/flink-010-upcoming-features
• Code: https://github.com/aljoscha/flink-fault-tolerant-stream-example

This is a demo to show how Flink can deal with stateful streaming jobs and fault tolerance.

Example 3: Flink-Storm compatibility example
https://github.com/apache/flink/tree/master/flink-contrib/flink-storm-compatibility/flink-storm-compatibility-examples

Page 51: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

51

3.2 How to learn Flink Programming APIs?

3.2.2 DataStream API

Example 4: Data stream analytics with Flink
http://net.t-labs.tu-berlin.de/~nsemmler/blog//flink/2015/03/02/Data-Stream-Analysis-with-flink.html

Example 5: Introducing Flink Streaming
http://flink.apache.org/news/2015/02/09/streaming-example.html

Examples from the code base: flink-streaming-examples
https://github.com/apache/flink/tree/master/flink-staging/flink-streaming/flink-streaming-examples/src/main/scala/org/apache/flink/streaming/scala/examples

Page 52: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

52

3.2 How to learn Flink Programming APIs?

3.2.3 Table API - Relational Queries
https://ci.apache.org/projects/flink/flink-docs-master/libs/table.html

To use the Table API in a project:

• First set up a Flink program: https://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink

• Add this to the dependencies section of your pom.xml:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table</artifactId>
  <version>0.10-SNAPSHOT</version>
</dependency>

The Table API is not currently part of the binary distribution. You need to link it for cluster execution: https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution

Page 53: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

53

3.2 How to learn Flink Programming APIs?

3.2.3 Table API - Relational Queries

FREE Apache Flink Training by Data Artisans - Table API

• Lecture: http://www.slideshare.net/dataArtisans/flink-table
• Exercise: http://dataartisans.github.io/flink-training/tableApi/handsOn.html

See also the example in slides 36-43 on log analysis: http://www.grid.ucy.ac.cy/file/Talks/talks/DeepAnalysiswithApacheFlink_2nd_cloud_workshop.pdf

Page 54: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

54

3.3 Apache Flink Domain Specific Libraries

3.3.1 FlinkML - Machine Learning for Flink

3.3.2 Gelly - Graph Analytics for Flink

Page 55: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

55

3.3 Apache Flink Libraries

3.3.1 FlinkML - Machine Learning for Flink
https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/

FlinkML quickstart guide: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/quickstart.html

To use FlinkML in a project:

• First set up a Flink program: https://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink

• Add this to the dependencies section of your pom.xml:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-ml</artifactId>
  <version>0.10-SNAPSHOT</version>
</dependency>

Page 56: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

56

3.3 Apache Flink Libraries

3.3.1 FlinkML - Machine Learning for Flink

Quick start: run the K-Means example: https://ci.apache.org/projects/flink/flink-docs-master/quickstart/run_example_quickstart.html

Computing Recommendations at Extreme Scale with Apache Flink: http://data-artisans.com/computing-recommendations-at-extreme-scale-with-apache-flink/ and related code: https://github.com/tillrohrmann/flink-perf/blob/ALSJoinBlockingUnified/flink-jobs/src/main/scala/com/github/projectflink/als/ALSJoinBlocking.scala

Naive Bayes on Apache Flink: http://www.itshared.org/2015/03/naive-bayes-on-apache-flink.html

FlinkML is not currently part of the binary distribution. You need to link it for cluster execution: https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution

Page 57: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

57

3.3 Apache Flink Libraries

3.3.2 Gelly: Flink Graph API
https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html

To use Gelly in a project:

• First set up a Flink program: https://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink

• Add this to the dependencies section of your pom.xml:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-gelly</artifactId>
  <version>0.10-SNAPSHOT</version>
</dependency>

Page 58: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

58

3.3 Apache Flink Libraries

Gelly examples: https://github.com/apache/flink/tree/master/flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/example

Gelly exercise & solution: Gelly API - PageRank on reply graph: http://dataartisans.github.io/flink-training/exercises/replyGraphGelly.html

Gelly is not currently part of the binary distribution. You need to link it for cluster execution: https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution

Page 59: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

59

4. How to set up your IDE (IntelliJ IDEA or Eclipse) for Apache Flink?

4.1   How to set up your IDE (IntelliJ IDEA)?

4.2   How to setup your IDE (Eclipse)?

Flink uses mixed Scala/Java projects, which pose a challenge to some IDEs. Minimal requirements for an IDE are:

• Support for Java and Scala (also mixed projects)

• Support for Maven with Java and Scala

Page 60: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

60

4.1   How to set up your IDE (IntelliJ IDEA)?

IntelliJ IDEA supports Maven out of the box and offers a plugin for Scala development.

• IntelliJ IDEA download: https://www.jetbrains.com/idea/download/

• IntelliJ Scala plugin: http://plugins.jetbrains.com/plugin/?id=1347

• Check out the Setting up IntelliJ IDEA guide for details: https://github.com/apache/flink/blob/master/docs/internals/ide_setup.md#intellij-idea

• Screencast: Run Apache Flink WordCount from IntelliJ: https://www.youtube.com/watch?v=JIV_rX-OIQM

Page 61: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

61

4.2   How to set up your IDE (Eclipse)?

• For Eclipse users, Apache Flink committers recommend using Scala IDE 3.0.3, based on Eclipse Kepler.

• While this is a slightly older version, they found it to be the version that works most robustly for a complex project like Flink. One restriction, though, is that it works only with Java 7, not with Java 8.

• Check out the Eclipse setup docs: https://github.com/apache/flink/blob/master/docs/internals/ide_setup.md#eclipse

Page 62: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

62

5. How to write, test and debug your Apache Flink program in an IDE?

5.1  How to write a Flink program?

  5.1.1 How to generate a Flink project with Maven?

  5.1.2 How to import the Flink Maven project into an IDE?

  5.1.3 How to use logging?

  5.1.4 FAQs and best practices related to coding

5.2 How to test your Flink program?

5.3  How to debug your Flink program?

Page 63: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

63

5.1 How to write a Flink program in an IDE?

The easiest way to get a working setup to develop (and locally execute) Flink programs is to follow the Quick Start guides:

https://ci.apache.org/projects/flink/flink-docs-master/quickstart/java_api_quickstart.html
https://ci.apache.org/projects/flink/flink-docs-master/quickstart/scala_api_quickstart.html

They use a Maven archetype to configure and generate a Flink Maven project. This will save you time dealing with transitive dependencies! The Maven project can be imported into your IDE.

Page 64: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

64

5.1 How to write a Flink program in an IDE?

5.1.1 How to generate a skeleton Flink project with Maven?

Generate a skeleton project with Maven to get started:

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=0.9.0

(For the artifact id you can also put flink-quickstart-scala; for the version, 0.10-SNAPSHOT.)

No need to manually download any .tgz or .jar files for now.

Page 65: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

65

5.1 How to write a Flink program in an IDE?

5.1.1 How to generate a skeleton Flink project with Maven? (continued)

The generated projects are located in a folder called flink-java-project or flink-scala-project. In order to test the generated project and to download all required dependencies, run the following commands (change flink-java-project to flink-scala-project for Scala projects):

• cd flink-java-project
• mvn clean package

Maven will now download all required dependencies and build the Flink quickstart project.

Page 66: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

66

5.1 How to write a Flink program in an IDE?

5.1.2 How to import the Flink Maven project into an IDE

The generated Maven project needs to be imported into your IDE:

IntelliJ:
• Select "File" -> "Import Project"
• Select the root folder of your project
• Select "Import project from external model" and select "Maven"
• Leave the default options and finish the import

Eclipse:
• Select "File" -> "Import" -> "Maven" -> "Existing Maven Project"
• Follow the import instructions

Page 67: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

67

5.1 How to write a Flink program in an IDE?

5.1.3   How to use logging?

Logging in Flink is implemented using the slf4j logging interface, with log4j as the underlying logging framework. log4j is controlled using a property file, usually called log4j.properties. You can pass the filename and location of this file to the JVM using the -Dlog4j.configuration= parameter.

Loggers using slf4j are created by calling:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

Logger LOG = LoggerFactory.getLogger(Foobar.class);

You can also use logback instead of log4j: https://ci.apache.org/projects/flink/flink-docs-release-0.9/internals/logging.html
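slf4j is not part of the JDK, so the logger-per-class pattern above can't be run without extra dependencies. As a stdlib stand-in, the same pattern with java.util.logging (an illustration of the idiom only, not Flink's actual logging setup):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggingSketch {
    // One static logger per class, named after the class,
    // mirroring the slf4j LoggerFactory.getLogger(Foobar.class) idiom.
    private static final Logger LOG = Logger.getLogger(LoggingSketch.class.getName());

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO);
        LOG.info("Flink-style program starting");
    }
}
```

With slf4j the call sites look the same; only the factory class and configuration file differ.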

Page 68: Apache Flink Crash Course by Slim Baltagi and Srini Palthepu

68

5.1 How to write a Flink program in an IDE?
5.1.4 FAQs & best practices related to coding

Errors: http://flink.apache.org/faq.html#errors
Usage: http://flink.apache.org/faq.html#usage
Best practices: https://ci.apache.org/projects/flink/flink-docs-master/apis/best_practices.html


69

5.2 How to test your Flink program in an IDE?

Start Flink in your IDE for local development & debugging:

final ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();

Use Flink’s testing framework:

@RunWith(Parameterized.class)
class YourTest extends MultipleProgramsTestBase {
  @Test
  public void testRunWithConfiguration() {
    expectedResult = "1 11\n";
  }
}


70

5.3 How to debug your Flink program in an IDE?

Flink programs can be executed and debugged from within an IDE. This significantly eases the development process and gives a programming experience similar to working on a regular Java application.

Starting a Flink program in your IDE is as easy as starting its main() method. Under the hood, the ExecutionEnvironment will start a local Flink instance within the execution process. Hence it is also possible to put breakpoints everywhere in your code and debug it.


71

5.3 How to debug your Flink program in an IDE?

Assuming you have an IDE with a Flink quickstart project imported, you can execute and debug the example WordCount program included in the quickstart project as follows:
• Open the org.apache.flink.quickstart.WordCount class in your IDE.
• Place a breakpoint somewhere in the flatMap() method of the LineSplitter class, which is defined inline in the WordCount class.
• Execute or debug the main() method of the WordCount class using your IDE.


72

5.3 How to debug your Flink program in an IDE?

When you start a program locally with the LocalExecutor, you can place breakpoints in your functions and debug them like normal Java/Scala programs.

Accumulators are very helpful in tracking the behavior of the parallel execution. They allow you to gather information inside the program’s operations and show it after the program execution.


73

Debugging with the IDE


74

Debugging on a cluster

Good old System.out debugging:
• Get a logger and start logging:

private static final Logger LOG = LoggerFactory.getLogger(YourJob.class);

LOG.info("elementCount = {}", elementCount);

• You can also use System.out.println().


75

Getting logs on a cluster

• Non-YARN (= bare-metal installation)
  – The logs are located in each TaskManager’s log/ directory.
  – ssh there and read the logs.
• YARN
  – Make sure YARN log aggregation is enabled.
  – Retrieve the logs from YARN (once the application has finished):

$ yarn logs -applicationId <application ID>


76

Flink Logs

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager (Version: 0.9-SNAPSHOT, Rev:2e515fc, Date:27.05.2015 @ 11:24:23 CEST)

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Current user: robert

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.75-b04

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size: 736 MiBytes

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - JAVA_HOME: (not set)

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM Options:

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -XX:MaxPermSize=256m

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xms768m

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xmx768m

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog.file=/home/robert/incubator-flink/build-target/bin/../log/flink-robert-jobmanager-robert-da.log

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog4j.configuration=file:/home/robert/incubator-flink/build-target/bin/../conf/log4j.properties

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlogback.configurationFile=file:/home/robert/incubator-flink/build-target/bin/../conf/logback.xml

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Program Arguments:

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - --configDir

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - /home/robert/incubator-flink/build-target/bin/../conf

11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - --executionMode

11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - local

11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - --streamingMode

11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - batch

11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------

11:42:39,469 INFO org.apache.flink.runtime.jobmanager.JobManager - Loading configuration from /home/robert/incubator-flink/build-target/bin/../conf

11:42:39,525 INFO org.apache.flink.runtime.jobmanager.JobManager - Security is not enabled. Starting non-authenticated JobManager.

11:42:39,525 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager

11:42:39,527 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system at localhost:6123.

11:42:40,189 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started

11:42:40,316 INFO Remoting - Starting remoting

11:42:40,569 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://[email protected]:6123]

11:42:40,573 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor

11:42:40,580 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-50f75dc9-3001-4c1b-bc2a-6658ac21322b

11:42:40,581 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:51194 - max concurrent requests: 50 - max backlog: 1000

11:42:40,613 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting embedded TaskManager for JobManager's LOCAL execution mode

11:42:40,615 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka://flink/user/jobmanager#205521910.

11:42:40,663 INFO org.apache.flink.runtime.taskmanager.TaskManager - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds

11:42:40,666 INFO org.apache.flink.runtime.taskmanager.TaskManager - Temporary file directory '/tmp': total 7 GB, usable 7 GB (100.00% usable)

11:42:41,092 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).

11:42:41,511 INFO org.apache.flink.runtime.taskmanager.TaskManager - Using 0.7 of the currently free heap space for Flink managed memory (461 MB).

11:42:42,520 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-4c6f4364-1975-48b7-99d9-a74e4edb7103 for spill files.

11:42:42,523 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManger web frontend

(The annotations on this slide point out the build information, JVM details, and init messages in the log above.)


77

Get logs of a running YARN application


78

Debugging on a cluster - Accumulators

Useful to verify your assumptions about the data:

class Tokenizer extends RichFlatMapFunction<String, String> {
  @Override
  public void flatMap(String value, Collector<String> out) {
    getRuntimeContext().getLongCounter("elementCount").add(1L);
    // do more stuff
  }
}

Use “Rich*Functions” to get the RuntimeContext.


79

Debugging on a cluster - Accumulators

Where can I get the accumulator results?
• returned by env.execute()
• displayed when executed with /bin/flink
• in the JobManager web frontend

JobExecutionResult result = env.execute("WordCount");
long ec = result.getAccumulatorResult("elementCount");


80

Live Monitoring with Accumulators

In versions prior to Flink 0.10:
• Accumulators are only available after the job finishes.

In Flink 0.10:
• Accumulators are updated while the job is running.
• System accumulators (number of bytes/records processed, …)


81

In Flink 0.10, the JobManager web interface displays accumulator values live while the job is running.


82

Excursion: RichFunctions

The default functions are SAMs (Single Abstract Method): interfaces with one method, suitable for Java 8 lambdas.

There is a “Rich” variant for each function:
• RichFlatMapFunction, …
• Methods open(Configuration c) & close()
• getRuntimeContext()


83

Excursion: RichFunctions & RuntimeContext

The RuntimeContext provides some useful methods:
• getIndexOfThisSubtask() / getNumberOfParallelSubtasks() – who am I, and if yes, how many?
• getExecutionConfig()
• Accumulators
• DistributedCache


84

Attaching a remote debugger to Flink in a Cluster


85

Attaching a debugger to Flink in a cluster

Add a JVM start option in flink-conf.yaml:
env.java.opts: “-agentlib:jdwp=….”

Open an SSH tunnel to the machine:
ssh -f -N -L 5005:127.0.0.1:5005 user@host

Use your IDE to start a remote debugging session.
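The full agentlib string is elided on the slide; a typical JDWP configuration (these are standard JVM debugging options, not Flink-specific, and the port 5005 is just an example matching the SSH tunnel above) looks like:

```
# flink-conf.yaml: open debug port 5005 without suspending JVM startup
env.java.opts: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
```

With suspend=y the JVM instead waits for the debugger to attach before starting.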


86

6. How to deploy your Apache Flink application locally, in a cluster, or in the cloud?

6.1 Deploy locally

6.2 Deploy in a cluster

6.3 Deploy in the cloud


87

6. How to deploy your Apache Flink application locally, in a cluster, or in the cloud?

6.1 Deploy locally

Package your job in a jar and submit it via:
• /bin/flink (command-line interface)
• RemoteExecutionEnvironment (from a local Java app)
• Web Frontend (GUI)
• Scala Shell


88

Flink Web Submission Client

Select jobs and preview plan

Understand Optimizer choices


89

6.2 Deploy in a cluster

• You can start a cluster locally:

$ tar xzf flink-*.tgz
$ cd flink
$ bin/start-cluster.sh
Starting Job Manager
Starting task manager on host
$ jps
5158 JobManager
5262 TaskManager


90

6.3 Deploy in the cloud

Google Compute Engine (GCE)

Free trial for Google Compute Engine: https://cloud.google.com/free-trial/
Enjoy your $300 in GCE for 60 days!

http://ci.apache.org/projects/flink/flink-docs-master/setup/gce_setup.html

./bdutil -e extensions/flink/flink_env.sh deploy


91

6.3 Deploy in the cloud

Amazon EMR or any other cloud provider with preinstalled Hadoop YARN:
http://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html

Install Flink yourself on the machines:

wget http://stratosphere-bin.amazonaws.com/flink-0.9-SNAPSHOT-bin-hadoop2.tgz
tar xvzf flink-0.9-SNAPSHOT-bin-hadoop2.tgz
cd flink-0.9-SNAPSHOT/
./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096


92

7. How to tune your Apache Flink application

7.1   Tuning CPU

7.2   Tuning memory

7.3   Tuning I/O

7.4   Optimizer hints


93

7. How to tune your Apache Flink application (CPU, Memory, I/O)?

7.1 Tuning CPU: processing slots, threads, …
https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#configuring-taskmanager-processing-slots


94

Tell Flink how many CPUs you have

taskmanager.numberOfTaskSlots in flink-conf.yaml sets:
• the number of parallel job instances
• the number of pipelines per TaskManager

Recommended: the number of available CPU cores.

(figure: several parallel Map -> Reduce pipelines, one per task slot)


95

Configuring TaskManager processing slots

3 machines, each with 4 CPU cores, give a total of 12 processing slots:
• Task Managers: 3
• Total number of processing slots: 12

(figure: Task Managers 1–3, each with Slots 1–4)

flink-conf.yaml:
taskmanager.numberOfTaskSlots: 4
or
./bin/yarn-session.sh --slots 4 -n 4
(Recommended value: number of CPU cores)


96

Example 1: WordCount with parallelism = 1

When no argument is given, parallelism.default from flink-conf.yaml is used (default value = 1).

(figure: only Slot 1 of Task Manager 1 runs the pipeline Source -> flatMap, Reduce, Sink; the remaining 11 slots across the 3 Task Managers are idle)


97

Example 2: WordCount with parallelism = 2

(figure: two slots each run a full Source -> flatMap, Reduce, Sink pipeline; the remaining slots are idle)

Places to set the parallelism for a job:
• flink-conf.yaml: parallelism.default: 2
• Flink client: ./bin/flink run -p 2
• ExecutionEnvironment: env.setParallelism(2)


98

Example 3: WordCount with parallelism = 12 (using all resources)

(figure: all 12 slots across the 3 Task Managers run a full Source -> flatMap, Reduce, Sink pipeline)


99

Example 4: WordCount with parallelism = 12 and sink parallelism = 1

The parallelism of each operator can be set individually in the APIs:

counts.writeAsCsv(outputPath, "\n", " ").setParallelism(1);

(figure: 12 Source -> flatMap and Reduce instances run across all slots; a single Sink in one slot receives the data streamed from all the other slots on the other TaskManagers)


100

7. How to tune your Apache Flink application (CPU, Memory, I/O)?

7.2 Tuning memory: how to adjust memory usage on the TaskManager?


101

Memory in Flink - Theory

Memory Management (Batch API) https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525


102

Memory in Flink - Configuration

• TaskManager JVM heap: taskmanager.heap.mb (or the “-tm” argument for bin/yarn-session.sh)
• Managed memory – relative: taskmanager.memory.fraction, or absolute: taskmanager.memory.size
• Network buffers: taskmanager.network.numberOfBuffers
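Putting these options together, an illustrative flink-conf.yaml fragment (the concrete numbers below are arbitrary examples, not recommendations):

```
# Illustrative values only
taskmanager.heap.mb: 2048              # JVM heap per TaskManager (MB)
taskmanager.memory.fraction: 0.7       # share of heap used as managed memory
# taskmanager.memory.size: 1024        # ...or an absolute value in MB instead
taskmanager.network.numberOfBuffers: 2048
```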


103

Memory in Flink - OOM

2015-02-20 11:22:54 INFO JobClient:345 - java.lang.OutOfMemoryError: Java heap space at org.apache.flink.runtime.io.network.serialization.DataOutputSerializer.resize(DataOutputSerializer.java:249) at org.apache.flink.runtime.io.network.serialization.DataOutputSerializer.write(DataOutputSerializer.java:93) at org.apache.flink.api.java.typeutils.runtime.DataOutputViewStream.write(DataOutputViewStream.java:39) at com.esotericsoftware.kryo.io.Output.flush(Output.java:163) at com.esotericsoftware.kryo.io.Output.require(Output.java:142) at com.esotericsoftware.kryo.io.Output.writeBoolean(Output.java:613) at com.twitter.chill.java.BitSetSerializer.write(BitSetSerializer.java:42) at com.twitter.chill.java.BitSetSerializer.write(BitSetSerializer.java:29) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:599) at org.apache.flink.api.java.typeutils.runtime.KryoSerializer.serialize(KryoSerializer.java:155) at org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:91) at org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:30) at org.apache.flink.runtime.plugable.SerializationDelegate.write(SerializationDelegate.java:51) at org.apache.flink.runtime.io.network.serialization.SpanningRecordSerializer.addRecord(SpanningRecordSerializer.java:76) at org.apache.flink.runtime.io.network.api.RecordWriter.emit(RecordWriter.java:82) at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:88) at org.apache.flink.api.scala.GroupedDataSet$$anon$2.reduce(GroupedDataSet.scala:262) at org.apache.flink.runtime.operators.GroupReduceDriver.run(GroupReduceDriver.java:124) at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:493) at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:360) at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257) at java.lang.Thread.run(Thread.java:745)

The stack trace shows where memory is missing: the OutOfMemoryError happens on the user heap during serialization.

Fix: reduce Flink’s managed memory by lowering taskmanager.memory.fraction.


104

Memory in Flink – Network buffers

When network buffer memory is missing, managed memory will shrink automatically, but the job can still fail to deploy:

Error: java.lang.Exception: Failed to deploy the task CHAIN Reduce(org.okkam.flink.maintenance.deduplication.blocking.RemoveDuplicateReduceGroupFunction) -> Combine(org.apache.flink.api.java.operators.DistinctOperator$DistinctFunction) (15/28) - execution #0 to slot SubSlot 5 (cab978f80c0cb7071136cd755e971be9 (5) - ALLOCATED/ALIVE): org.apache.flink.runtime.io.network.InsufficientResourcesException: okkam-nano-2.okkam.it has not enough buffers to safely execute CHAIN Reduce(org.okkam.flink.maintenance.deduplication.blocking.RemoveDuplicateReduceGroupFunction) -> Combine(org.apache.flink.api.java.operators.DistinctOperator$DistinctFunction) (36 buffers missing)

Fix: increase taskmanager.network.numberOfBuffers.


105

What are these buffers needed for?

A simple MapReduce job in Flink, on a small Flink cluster with 4 processing slots (on 2 Task Managers):

(figure: TaskManager 1 and TaskManager 2, each with 2 slots; one slot runs Map -> Reduce)


106

What are these buffers needed for?

A MapReduce job with a parallelism of 2 and 2 processing slots per machine:

(figure: 4 Map and 4 Reduce instances spread over the slots of TaskManager 1 and TaskManager 2; each TaskManager holds a network buffer pool, with 8 buffers for outgoing data and 8 buffers for incoming data)


107

What are these buffers needed for?

A MapReduce job with a parallelism of 2 and 2 processing slots per machine:

Each mapper has a logical connection to a reducer.

(figure: the logical connections between the Map and Reduce instances across the two TaskManagers)


108

7. How to tune your Apache Flink application (CPU, Memory, I/O)?

7.3 Tuning I/O: specifying temporary directories for spilling


109

Disk I/O

Sometimes your data doesn’t fit into main memory, so Flink has to spill to disk:

taskmanager.tmp.dirs: /mnt/disk1,/mnt/disk2

Use real local disks only (no tmpfs or NAS).

(figure: the Task Manager runs a reader thread and a writer thread for each configured disk)


110

7. How to tune your Apache Flink application

7.4 Optimizer hints

Examples:
DataSet.join(DataSet other, JoinHint.BROADCAST_HASH_SECOND)
DataSet.join(DataSet other, JoinHint.BROADCAST_HASH_FIRST)

http://stackoverflow.com/questions/31484856/the-difference-and-benefit-of-joinwithtiny-joinwithhuge-and-joinhint


111

Consider attending the first dedicated Apache Flink conference on October 12-13, 2015 in Berlin, Germany! http://flink-forward.org/

Two parallel tracks:
• Talks: presentations and use cases
• Trainings: 2 days of hands-on training workshops by the Flink committers