
Apache Flink Training - Deployment & Operations


Page 1: Apache Flink Training - Deployment & Operations

Apache Flink® Training

Deployment & Operations

Flink v1.3.0 – 11.09.2017

Page 2: Apache Flink Training - Deployment & Operations

What you’ll learn in this session

Capacity planning

Deployment options for Flink

Deployment best practices

Tuning: Work distribution

Tuning: Memory configuration

Tuning: Checkpointing

Tuning: Serialization

Lessons learned

Page 3: Apache Flink Training - Deployment & Operations

This is an interactive session

Please interrupt me at any time if you have a question.

Page 4: Apache Flink Training - Deployment & Operations

Capacity Planning


Page 5: Apache Flink Training - Deployment & Operations

First step: do the math!

Think through the resource requirements of your problem:
• Number of keys, state per key

• Number of records, record size

• Number of state updates

• What are your SLAs? (downtime, latency, max throughput)

What resources do you have?
• Network capacity (including Kafka, HDFS, etc.)

• Disk bandwidth (RocksDB relies on the local disk)

• Memory

• CPUs

Page 6: Apache Flink Training - Deployment & Operations

Establish a baseline

Normal operation should be able to avoid back pressure

Add a margin for recovery – these resources will be used to “catch up”

Establish your baseline with checkpointing enabled


Page 7: Apache Flink Training - Deployment & Operations

Consider spiky loads

For example, operators downstream from a window won't be continuously busy.

So how much downstream parallelism is required depends on how quickly you expect to process these spikes.

Page 8: Apache Flink Training - Deployment & Operations

Example: The Setup

Data:

• Message size: 2 KB

• Throughput: 1,000,000 msg/sec

• Distinct keys: 500,000,000 (aggregation in window: 4 longs per key)

• Checkpoint every minute

Pipeline: Kafka Source → keyBy(userId) → Sliding Window (5 min size, 1 min slide) → Kafka Sink, with window state in RocksDB

Page 9: Apache Flink Training - Deployment & Operations

Example: The setup

Hardware:

• 5 machines

• 10 gigabit Ethernet

• Each machine running a Flink TaskManager

• Disks are attached via the network

Kafka is separate

(Diagram: TM 1 to TM 5, with the NAS and Kafka reachable over the network)

Page 10: Apache Flink Training - Deployment & Operations

Example: A machine’s perspective

(Diagram: per-machine data flow, Kafka Source → keyBy → window → Kafka Sink, over 10 Gigabit Ethernet, full duplex: 1250 MB/s in, 1250 MB/s out)

Kafka in: 2 KB * 1,000,000 msg/sec = 2 GB/s total; 2 GB/s / 5 machines = 400 MB/s per machine

Shuffle out: 400 MB/s / 5 receivers = 80 MB/s per receiver; 1 receiver is local, 4 are remote: 4 * 80 MB/s = 320 MB/s out

Shuffle in: 320 MB/s from the other machines

Kafka sink out: 67 MB/s

Page 11: Apache Flink Training - Deployment & Operations

Excursion 1: Window emit

How much data is the window emitting?

Recap: 500,000,000 unique users (4 longs per key), sliding window of 5 minutes with a 1 minute slide

Assumption: For each user, we emit 2 ints (user_id, window_ts) and 4 longs from the aggregation = 2 * 4 bytes + 4 * 8 bytes = 40 bytes per key

Per machine (500,000,000 keys / 5 machines = 100,000,000 keys): 100,000,000 * 40 bytes = 4 GB every minute from each machine

Page 12: Apache Flink Training - Deployment & Operations

Example: A machine’s perspective

(Same per-machine view as before: Kafka in 400 MB/s, shuffle in 320 MB/s, shuffle out 320 MB/s, on 10 Gigabit Ethernet full duplex)

Window emission: 4 GB / minute => 67 MB/s to the Kafka sink (on average)

Page 13: Apache Flink Training - Deployment & Operations

Example: Result

(Per-machine view: Kafka in 400 MB/s, shuffle in 320 MB/s, shuffle out 320 MB/s, Kafka sink out 67 MB/s)

Total In: 720 MB/s, Total Out: 387 MB/s

Page 14: Apache Flink Training - Deployment & Operations

Example: Result

(Same per-machine numbers as before: Total In 720 MB/s, Total Out 387 MB/s)

WRONG.

We forgot:

• Disk Access to RocksDB

• Checkpointing

Page 15: Apache Flink Training - Deployment & Operations

Example: Intermediate Result

(Per-machine view: Kafka in 400 MB/s, shuffle in/out 320 MB/s, Kafka sink out 67 MB/s)

Disk read: ?  Disk write: ?

Page 16: Apache Flink Training - Deployment & Operations

Excursion 2: Window state access

How is the Window operator accessing state?

Recap: 1,000,000 msg/sec, sliding window of 5 minutes with a 1 minute slide

Assumption: For each user, we store 2 ints (user_id, window_ts) and 4 longs from the aggregation = 2 * 4 bytes + 4 * 8 bytes = 40 bytes per key

State layout: key (user_id), 5 minute window (window_ts), value (long, long, long, long)

For each incoming record, we update the aggregations in 5 windows (every record falls into 5 overlapping windows)

Page 17: Apache Flink Training - Deployment & Operations

Excursion 2: Window state access

How is the Window operator accessing state?

For each key-value access, we need to retrieve 40 bytes from disk, update the aggregates, and put 40 bytes back

Per machine (1,000,000 msg/sec / 5 machines = 200,000 msg/sec): 40 bytes * 5 windows * 200,000 msg/sec = 40 MB/s

Page 18: Apache Flink Training - Deployment & Operations

Example: Intermediate Result

(Per-machine view: Kafka in 400 MB/s, shuffle in/out 320 MB/s, Kafka sink out 67 MB/s)

Disk read: 40 MB/s, Disk write: 40 MB/s

Total In: 760 MB/s, Total Out: 427 MB/s

Page 19: Apache Flink Training - Deployment & Operations

Excursion 3: Checkpointing

How much state are we checkpointing?

Per machine: 40 bytes * 5 windows * 100,000,000 keys = 20 GB

We checkpoint every minute, so: 20 GB / 60 seconds = 333 MB/s

Page 20: Apache Flink Training - Deployment & Operations

Example: Final Result

(Per-machine view: Kafka in 400 MB/s, shuffle in/out 320 MB/s, Kafka sink out 67 MB/s)

Disk read: 40 MB/s, Disk write: 40 MB/s

Checkpoints: 333 MB/s

Total In: 760 MB/s, Total Out: 760 MB/s

Page 21: Apache Flink Training - Deployment & Operations

Example: Network requirements

(Diagram: TM 1 to TM 5, NAS and Kafka)

Per TaskManager: In 760 MB/s, Out 760 MB/s

NAS: 5 x 80 MB/s = 400 MB/s

Kafka: 400 MB/s * 5 + 67 MB/s * 5 = 2335 MB/s

Overall network traffic:
2 * 760 * 5 + 400 + 2335 = 10335 MB/s = 82.68 Gigabit/s

Page 22: Apache Flink Training - Deployment & Operations

Disclaimer!

This was just a “back of the napkin” calculation

Ignored network factors

• Protocol overheads (Ethernet, IP, TCP, …)

• RPC (Flink‘s own RPC, Kafka, checkpoint store)

• Checkpointing causes network bursts

• A window emission causes bursts

• Remote disk access is not accounted for on the 10 GigE on AWS

• Other systems using the network

CPU, memory, disk access speed have all been ignored


Page 23: Apache Flink Training - Deployment & Operations

(High Availability) Deployment


Page 24: Apache Flink Training - Deployment & Operations

Flexible Deployment Options

Hadoop YARN integration
• Cloudera, Hortonworks, MapR, …
• Amazon Elastic MapReduce (EMR), Google Cloud Dataproc

Mesos & DC/OS integration

Standalone Cluster (“native”)
• provided bash scripts
• provided Docker images

See also: “Flink in containerland”, Day 3, 3:20 PM - 4:00 PM, Maschinenhaus

Page 25: Apache Flink Training - Deployment & Operations

Flexible Deployment Options

Docs and best practices coming soon for:
• Kubernetes
• Docker Swarm

Check the Flink documentation for details!

See also: “Flink in containerland”, Day 3, 3:20 PM - 4:00 PM, Maschinenhaus

Page 26: Apache Flink Training - Deployment & Operations

High Availability Deployments

Page 27: Apache Flink Training - Deployment & Operations

YARN / Mesos HA

Run only one JobManager

Restarts managed by the cluster framework

For HA on YARN, we recommend using at least Hadoop 2.5.0 (due to a critical bug in 2.4)

ZooKeeper is always required
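A sketch of the relevant flink-conf.yaml entries (the quorum address, storage path and attempt count below are illustrative placeholders, not part of the original deck):

  high-availability: zookeeper
  high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
  # where JobManager metadata for recovery is persisted
  high-availability.storageDir: hdfs:///flink/ha
  # on YARN, also allow the ApplicationMaster (JobManager) to be restarted
  yarn.application-attempts: 10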

Page 28: Apache Flink Training - Deployment & Operations

Standalone cluster HA

Run standby JobManagers

ZooKeeper manages JobManager failover and restarts

TaskManager failures are resolved by the JobManager

Use a custom tool to ensure a certain number of Job- and TaskManagers

Page 29: Apache Flink Training - Deployment & Operations

Deployment Best Practices

Things you should consider before putting a job in production


Page 30: Apache Flink Training - Deployment & Operations

Choose your state backend

RocksDBStateBackend
• Working state: local disk (tmp directory); state backup: distributed file system; snapshotting: asynchronous
• Good for state larger than available memory
• Rule of thumb: 10x slower than the memory-based backends

FsStateBackend
• Working state: JVM heap; state backup: distributed file system; snapshotting: synchronous / async
• Fast, requires a large heap

MemoryStateBackend
• Working state: JVM heap; state backup: JobManager JVM heap; snapshotting: synchronous / async
• Good for testing and experimentation with small state (locally)
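A minimal sketch of selecting a backend in the DataStream API (the HDFS checkpoint URI is a placeholder; the RocksDBStateBackend constructor may throw IOException):

  import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
  import org.apache.flink.runtime.state.filesystem.FsStateBackend;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

  // Large state: working state on local disk, backups to a distributed file system
  env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

  // Alternative for smaller, heap-sized state:
  // env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));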

Page 31: Apache Flink Training - Deployment & Operations

Asynchronous Snapshotting

(Diagram: synchronous vs. asynchronous snapshotting)

Page 32: Apache Flink Training - Deployment & Operations

Explicitly set max parallelism

Changing this parameter is painful
• requires a complete restart and loss of all checkpointed/savepointed state

0 < parallelism <= max parallelism <= 32768

Max parallelism > 128 has some impact on performance and state size
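A minimal sketch of setting it explicitly (the concrete values are illustrative only):

  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

  // Upper bound for later rescaling; choose it before taking the first savepoint
  env.setMaxParallelism(1024);

  // The runtime parallelism must stay <= max parallelism
  env.setParallelism(16);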

Page 33: Apache Flink Training - Deployment & Operations

Set UUIDs for all (stateful) operators

Operator UUIDs are needed to restore state from a savepoint

Flink will auto-generate UUIDs, but this results in fragile snapshots.

Setting UUIDs in the API:

  DataStream<String> stream = env
      .addSource(new StatefulSource())
      .uid("source-id")          // ID for the source operator
      .map(new StatefulMapper())
      .uid("mapper-id");         // ID for the mapper

  stream.print();

Page 34: Apache Flink Training - Deployment & Operations

Use the savepoint tool for deletions

Savepoint files contain only metadata and depend on the checkpoint files
• bin/flink savepoint -d :savepointPath

There is work in progress to make savepoints self-contained; deletion / relocation will then be much easier
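For example (the job ID and savepoint path below are hypothetical placeholders):

  # trigger a savepoint for a running job
  bin/flink savepoint 5e20cb6b0f357591171dfcca2eea09de

  # dispose of a savepoint via the tool, not by deleting its files manually
  bin/flink savepoint -d hdfs:///flink/savepoints/savepoint-a1b2c3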

Page 35: Apache Flink Training - Deployment & Operations

Avoid the deprecated state APIs

Using the Checkpointed interface will prevent you from rescaling your job

Use ListCheckpointed (like Checkpointed, but redistributable) or CheckpointedFunction (full flexibility) instead.
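A minimal sketch of redistributable operator state with ListCheckpointed (the counting operator is illustrative, not from the original deck):

  import java.util.Collections;
  import java.util.List;
  import org.apache.flink.api.common.functions.RichFlatMapFunction;
  import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
  import org.apache.flink.util.Collector;

  // Counts records; the count is stored as redistributable list state.
  public class CountingMapper extends RichFlatMapFunction<String, Long>
          implements ListCheckpointed<Long> {

      private long count = 0;

      @Override
      public void flatMap(String value, Collector<Long> out) {
          count++;
          out.collect(count);
      }

      @Override
      public List<Long> snapshotState(long checkpointId, long timestamp) {
          return Collections.singletonList(count);
      }

      @Override
      public void restoreState(List<Long> state) {
          count = 0;
          for (Long c : state) {
              count += c;   // state may have been redistributed across parallel instances
          }
      }
  }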

Production Readiness Checklist:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/ops/production_ready.html


Page 36: Apache Flink Training - Deployment & Operations

Tuning: CPU usage / work distribution

Page 37: Apache Flink Training - Deployment & Operations

Configure parallelism / slots

These settings influence how the work is spread across the available CPUs

1 CPU per slot is common

Multiple CPUs per slot makes sense if one slot (i.e. one parallel instance of the job) performs many CPU-intensive operations
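A sketch of the two knobs (the values and the HeavyMapper class are illustrative placeholders):

  # flink-conf.yaml: number of processing slots offered by each TaskManager
  taskmanager.numberOfTaskSlots: 8

  // In the job: a default parallelism plus a per-operator override
  env.setParallelism(40);
  stream.map(new HeavyMapper()).setParallelism(80);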

Page 38: Apache Flink Training - Deployment & Operations

Operator chaining


Page 39: Apache Flink Training - Deployment & Operations

Task slots


Page 40: Apache Flink Training - Deployment & Operations

Slot sharing (parallelism now 6)


Page 41: Apache Flink Training - Deployment & Operations

What can you do?

Number of TaskManagers vs. number of slots per TM

Set slots per TaskManager

Set parallelism per operator

Control operator chaining behavior

Set slot sharing groups to break operators into different slots (see the sketch below)
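A minimal sketch of the chaining and slot sharing hooks on a DataStream (the operator classes are hypothetical placeholders):

  stream
      .filter(new MyFilter())
      .slotSharingGroup("ingest")       // this operator (and downstream ones, by default) gets its own slot sharing group
      .map(new CpuHeavyMapper())
      .startNewChain()                  // start a new operator chain at this operator
      .map(new LightMapper())
      .disableChaining()                // never chain this operator with its neighbors
      .print();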

Page 42: Apache Flink Training - Deployment & Operations

Tuning: Memory Configuration


Page 43: Apache Flink Training - Deployment & Operations

Memory in Flink (on YARN)

(Memory layout inside a YARN container, from the outside in:)

• YARN container limit
• JVM process size
  • JVM heap (limited by the Xmx parameter), containing: network buffers, MemoryManager, internal Flink services, user code (window contents, …)
  • Other JVM allocations: classes, metadata, DirectByteBuffers, Netty, RocksDB?

Page 44: Apache Flink Training - Deployment & Operations

Example: Memory in Flink

• Container request size: TaskManager: 2000 MB on YARN (YARN container limit: 2000 MB)
• JVM heap: Xmx = 1500 MB = 2000 * 0.75 (default cutoff is 25%, “containerized.heap-cutoff-ratio”)
• Other JVM allocations: classes, metadata, stacks, …
• JVM process size: < 2000 MB
• Netty: ~64 MB; RocksDB: sized via its own configuration
• MemoryManager: up to 70% of the available heap (“taskmanager.memory.fraction”)
• Network buffers: “taskmanager.network.memory.min” (64 MB) and “.max” (1 GB)
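A sketch of the corresponding flink-conf.yaml entries (the values mirror the example above; expressing the buffer bounds in bytes is an assumption about the expected format):

  # fraction of the container memory cut off from the JVM heap (default 0.25)
  containerized.heap-cutoff-ratio: 0.25

  # fraction of the remaining heap managed by Flink's MemoryManager
  taskmanager.memory.fraction: 0.7

  # bounds of the network buffer pool (64 MB / 1 GB)
  taskmanager.network.memory.min: 67108864
  taskmanager.network.memory.max: 1073741824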

Page 45: Apache Flink Training - Deployment & Operations

RocksDB

If you have plenty of memory, be generous with RocksDB (note: RocksDB does not allocate its memory from the JVM’s heap!). When allocating more memory for RocksDB on YARN, increase the memory cutoff (= smaller heap).

RocksDB has many tuning parameters.

Flink offers predefined collections of options:
• SPINNING_DISK_OPTIMIZED_HIGH_MEM
• FLASH_SSD_OPTIMIZED
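A minimal sketch of applying one of these option sets (the checkpoint URI is a placeholder; the constructor may throw IOException):

  import org.apache.flink.contrib.streaming.state.PredefinedOptions;
  import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

  RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints");
  backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
  env.setStateBackend(backend);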

Page 46: Apache Flink Training - Deployment & Operations

Tuning: Checkpointing


Page 47: Apache Flink Training - Deployment & Operations

Checkpointing

Measure, analyze and try out!

Configure a checkpointing interval
• How much can you afford to reprocess on restore?
• How many resources are consumed by checkpointing? (cost in throughput and latency)

Fine-tuning (see the sketch below)
• “min pause between checkpoints”
• “checkpoint timeout”
• “concurrent checkpoints”

Configure exactly-once / at-least-once
• exactly-once requires barrier alignment, which buffers / spills data (can affect latency)
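A minimal sketch of these settings in the DataStream API (the interval and timeout values are illustrative):

  import org.apache.flink.streaming.api.CheckpointingMode;
  import org.apache.flink.streaming.api.environment.CheckpointConfig;

  env.enableCheckpointing(60_000);                          // checkpoint every minute
  CheckpointConfig cc = env.getCheckpointConfig();
  cc.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);  // or AT_LEAST_ONCE
  cc.setMinPauseBetweenCheckpoints(30_000);                 // "min pause between checkpoints"
  cc.setCheckpointTimeout(600_000);                         // "checkpoint timeout"
  cc.setMaxConcurrentCheckpoints(1);                        // "concurrent checkpoints"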

Page 48: Apache Flink Training - Deployment & Operations


Page 49: Apache Flink Training - Deployment & Operations

Tuning: Serialization


Page 50: Apache Flink Training - Deployment & Operations

(de)serialization is expensive

Getting this wrong can have a huge impact

But don’t overthink it


Page 51: Apache Flink Training - Deployment & Operations

Serialization in Flink

Flink has its own serialization framework, which is used for
• Basic types (Java primitives and their boxed form)
• Primitive arrays and Object arrays
• Tuples
• Scala case classes
• POJOs

Otherwise Flink falls back to Kryo

Page 52: Apache Flink Training - Deployment & Operations

A note on custom serializers / parsers

Avoid obvious anti-patterns, e.g. creating a new JSON parser for every record

Many sources (e.g. Kafka) can parse JSON directly

Avoid, if possible, shipping the schema with every record

(Diagram: two pipelines, Source → map() → keyBy()/window()/apply() → Sink; one ships raw Strings to a parsing map(), the other parses already in the source)

Page 53: Apache Flink Training - Deployment & Operations

What else?

You should register types with Kryo, e.g.
• env.registerTypeWithKryoSerializer(DateTime.class, JodaDateTimeSerializer.class)

You should register any subtypes; this can increase performance a lot

You can use serializers from other systems, like Protobuf or Thrift, with Kryo by registering the types (and serializers)

Avoid expensive types, e.g. Collections, large records

Do not change serializers or type registrations if you are restoring from a savepoint
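A minimal sketch of the registrations (DateTime / JodaDateTimeSerializer are the classes named above; MyEventSubtype is a hypothetical subtype from your job):

  // Custom Kryo serializer for a type Flink cannot handle natively
  env.registerTypeWithKryoSerializer(DateTime.class, JodaDateTimeSerializer.class);

  // Register concrete subtypes up front so the serializers only write short tags
  env.registerType(MyEventSubtype.class);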

Page 54: Apache Flink Training - Deployment & Operations

Conclusion


Page 55: Apache Flink Training - Deployment & Operations

Tuning Approaches

1. Develop / optimize the job locally
• Use a data generator / small sample dataset
• Check the logs for warnings
• Check the UI for backpressure, throughput, metrics
• Debug / profile locally

2. Optimize on the cluster
• Checkpointing, parallelism, slots, RocksDB, network config, …

Page 56: Apache Flink Training - Deployment & Operations

The usual suspects

Inefficient serialization

Inefficient dataflow graph

• Too many repartitionings; blocking I/O

Slow external systems

Slow network, slow disks

Checkpointing configuration


Page 57: Apache Flink Training - Deployment & Operations

Q & A

Let’s discuss …


Page 58: Apache Flink Training - Deployment & Operations


Thank you!

@rmetzger | @theplucas

@dataArtisans

Page 59: Apache Flink Training - Deployment & Operations

We are hiring!

data-artisans.com/careers


Page 60: Apache Flink Training - Deployment & Operations

Deployment: Security

Bonus Slides


Page 61: Apache Flink Training - Deployment & Operations

Outline

1. Hadoop delegation tokens

2. Kerberos authentication

3. SSL


Page 62: Apache Flink Training - Deployment & Operations

Hadoop delegation tokens

Quite limited:
• YARN only
• Hadoop services only
• Tokens expire

(Diagram: a job and its tasks accessing HDFS, Kafka and ZooKeeper, plus WebUI/CLI access over HTTP and Akka; the delegation token only covers the Hadoop services)

Page 63: Apache Flink Training - Deployment & Operations

Kerberos authentication

Keytab-based identity

Standalone, YARN, Mesos

Shared by all jobs

(Diagram: the same picture, but the job authenticates to HDFS, Kafka and ZooKeeper with a keytab)

Page 64: Apache Flink Training - Deployment & Operations

SSL

taskmanager.data.ssl.enabled: communication between TaskManagers

blob.service.ssl.enabled: client/server blob service

akka.ssl.enabled: Akka-based control connection between the Flink client, JobManager and TaskManagers

jobmanager.web.ssl.enabled: HTTPS for the WebUI

(Diagram: the same picture, with certificates securing the WebUI/CLI and internal connections)
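A sketch of the corresponding flink-conf.yaml switches (the keystore / truststore options also have to be configured and are omitted here):

  taskmanager.data.ssl.enabled: true
  blob.service.ssl.enabled: true
  akka.ssl.enabled: true
  jobmanager.web.ssl.enabled: true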

Page 65: Apache Flink Training - Deployment & Operations

Limitations

The clients are not authenticated to the cluster

All the secrets known to a Flink job are exposed to everyone who can connect to the cluster’s endpoint

Exploring SSL mutual authentication

Page 66: Apache Flink Training - Deployment & Operations

Network buffers


Page 67: Apache Flink Training - Deployment & Operations

Configure network buffers

TaskManagers exchange data via permanent TCP connections

Each TM needs enough buffers to concurrently serve all outgoing and incoming connections

Configuration parameter: “taskmanager.network.numberOfBuffers”

As few as possible, maybe 2x the minimum (see the sketch below)
• Avoid having too much data in buffers: delayed checkpointing / barrier alignment spilling
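A sketch of the flink-conf.yaml entry (the value is illustrative; size it to your slots and TaskManager connections):

  # total number of network buffers per TaskManager (each buffer is one memory segment, 32 KB by default)
  taskmanager.network.numberOfBuffers: 4096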

Page 68: Apache Flink Training - Deployment & Operations

What are these buffers needed for?

(Diagram: a small Flink cluster with 4 processing slots on 2 TaskManagers, and a simple job consisting of a Map and a Keyed Window operator)

Page 69: Apache Flink Training - Deployment & Operations

What are these buffers needed for?

(Diagram: the job with a parallelism of 4 and 2 processing slots per machine; each TaskManager needs 8 network buffers for outgoing data and 8 buffers for incoming data)

Page 70: Apache Flink Training - Deployment & Operations

What are these buffers needed for?

(Diagram: the same job with a parallelism of 4 and 2 processing slots per machine)