
Apache Flink Training - Metrics & Monitoring


Page 1: Apache Flink Training - Metrics & Monitoring


Apache Flink® Training

Flink v1.3 – 14.09.2017

Apache Flink® Training

Metrics and Monitoring


Metrics


<identifier, measurement>

Types

• Counter

• Meter (rate)

• Histogram

• Gauge (arbitrary value)

Exposed via MetricReporters

Also a REST API

Visualized in the WebUI


Example

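    // imports for the enclosing job class
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.metrics.Counter;

    public static class MyMap extends RichMapFunction<String, String> {
        private Counter count;

        @Override
        public void open(Configuration config) {
            // register the counter on this operator's metric group
            count = getRuntimeContext().getMetricGroup().counter("numRecordsIn");
        }

        @Override
        public String map(String input) {
            count.inc();
            return input; // placeholder: return the mapped value here
        }
    }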


Other metric types

Gauge

• The value can be any object that implements toString()

Histogram

• No default implementation, but a wrapper for Codahale/DropWizard histograms

Meter

• Your code calls meter.markEvent() or meter.markEvent(n)

• Flink counts events, and also reports the average rate
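
The meter behaviour described above (count events, report an average rate) can be sketched in plain Java. This is not Flink's Meter implementation: SimpleMeter and its rate(elapsedSeconds) method are hypothetical names used only for illustration, and the caller passes in the elapsed time so the sketch stays deterministic.

```java
// Illustrative sketch only: a hypothetical stand-in, not Flink's Meter.
final class SimpleMeter {
    private long count;

    /** Records a single event. */
    void markEvent() {
        markEvent(1);
    }

    /** Records n events at once. */
    void markEvent(long n) {
        count += n;
    }

    /** Total number of events recorded so far. */
    long getCount() {
        return count;
    }

    /** Average number of events per second over the given time span. */
    double rate(double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        return count / elapsedSeconds;
    }
}
```

In a real job the meter would be registered on a MetricGroup, and Flink would take care of the rate calculation.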


Metric Groups

Metrics are attached to MetricGroups, which provide context about what is being measured


Adding your own MetricGroups

Useful for categorizing your measurements

counter = getRuntimeContext().getMetricGroup().addGroup("MyMetrics").counter("myCounter");


127.0.0.1.taskmanager.ABCDE.MyJob.MyOperator.1.numRecordsIn

• host: 127.0.0.1

• taskmanager: taskmanager.ABCDE

• job: MyJob

• operator: MyOperator.1

• metric: numRecordsIn


Scope Formats

Dot-separated list of variables and constants

Variables are replaced at runtime

Configured in flink-conf.yaml

<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
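
To make the variable replacement concrete, here is a minimal plain-Java sketch of how a scope format could be expanded into a metric identifier. ScopeFormatSketch and its methods are hypothetical helpers, not Flink classes; Flink performs this substitution internally.

```java
import java.util.Map;

// Illustrative sketch only: hypothetical helper, not part of Flink.
final class ScopeFormatSketch {

    /** Replaces every <variable> in the format with its runtime value; constants are kept as-is. */
    static String resolve(String format, Map<String, String> variables) {
        String result = format;
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            result = result.replace("<" + entry.getKey() + ">", entry.getValue());
        }
        return result;
    }

    /** Appends the metric name to the resolved scope to form the full identifier. */
    static String identifier(String format, Map<String, String> variables, String metricName) {
        return resolve(format, variables) + "." + metricName;
    }
}
```

With host=127.0.0.1, tm_id=ABCDE, job_name=MyJob, operator_name=MyOperator and subtask_index=1, the operator format above expands to 127.0.0.1.taskmanager.ABCDE.MyJob.MyOperator.1, and appending the metric name gives 127.0.0.1.taskmanager.ABCDE.MyJob.MyOperator.1.numRecordsIn.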


Each metric is associated with one of 6 formats:

• metrics.scope.jm

• metrics.scope.jm.job

• metrics.scope.tm

• metrics.scope.tm.job

• metrics.scope.task

• metrics.scope.operator
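
As a reference point, a flink-conf.yaml fragment setting all six keys might look like the following. The values shown are the defaults as listed in the Flink metrics documentation; treat them as an assumption and check them against the documentation for your Flink version before relying on them.

```yaml
# Scope formats (values assumed to match the documented defaults)
metrics.scope.jm: <host>.jobmanager
metrics.scope.jm.job: <host>.jobmanager.<job_name>
metrics.scope.tm: <host>.taskmanager.<tm_id>
metrics.scope.tm.job: <host>.taskmanager.<tm_id>.<job_name>
metrics.scope.task: <host>.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>
metrics.scope.operator: <host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
```

Overriding one of these keys changes the identifier prefix of every metric in that scope.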


Metric Reporter

Exposes metrics to the outside world

• Ganglia

• Graphite

• JMX

• StatsD

• or roll your own …


Example

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.flink.metrics.Counter;
    import org.apache.flink.metrics.Metric;
    import org.apache.flink.metrics.MetricConfig;
    import org.apache.flink.metrics.MetricGroup;
    import org.apache.flink.metrics.reporter.MetricReporter;
    import org.apache.flink.metrics.reporter.Scheduled;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class Log4JReporter implements MetricReporter, Scheduled {
        private static final Logger LOG = LoggerFactory.getLogger(Log4JReporter.class);

        // registered counters, mapped to their fully scoped identifiers
        private final Map<Counter, String> counters = new ConcurrentHashMap<>();

        @Override
        public void open(MetricConfig config) {}

        @Override
        public void close() {}

        @Override
        public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
            if (metric instanceof Counter) {
                counters.put((Counter) metric, group.getMetricIdentifier(metricName));
            }
        }

        @Override
        public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
            if (metric instanceof Counter) {
                counters.remove(metric);
            }
        }

        // Scheduled: called periodically at the configured reporter interval
        @Override
        public void report() {
            for (Map.Entry<Counter, String> metric : counters.entrySet()) {
                // log "<identifier>: <current count>"
                LOG.info(metric.getValue() + ": " + metric.getKey().getCount());
            }
        }
    }


Configuration

metrics.reporters: log

metrics.reporter.log.class: org.apache.flink.metrics.log4j.Log4JReporter

metrics.reporter.log.interval: 5 SECONDS

https://github.com/zentol/log4jreporter/blob/master/src/main/java/org/apache/flink/metrics/log4j/Log4JReporter.java


Monitoring REST API


Some available requests

/config

/overview

/jobmanager/metrics

/jobs

/jobs/<id>/metrics

/jobs/<id>/checkpoints

/jobs/<id>/vertices/<id>/metrics?get=0.numRecordsOutPerSecond

/taskmanagers

/taskmanagers/<id>/metrics?get=<metric>

...
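
The requests above are plain HTTP GETs against the JobManager's web port. The following sketch builds one of the metric URLs and fetches it using only the JDK. MetricsUrlSketch is a hypothetical helper; the base URL (for example http://localhost:8081), job id, and vertex id are assumptions you would replace with values from your own cluster, and the shape of the returned JSON is not shown here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Scanner;

// Illustrative sketch only: hypothetical helper, not part of Flink.
final class MetricsUrlSketch {

    /** Builds /jobs/<id>/vertices/<id>/metrics?get=<metric> under the given base URL. */
    static URI vertexMetric(String baseUrl, String jobId, String vertexId, String metric) {
        return URI.create(baseUrl + "/jobs/" + jobId
                + "/vertices/" + vertexId + "/metrics?get=" + metric);
    }

    /** Issues a GET request and returns the response body as a string. */
    static String fetch(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            // "\\A" makes the scanner return the whole stream as one token
            return scanner.hasNext() ? scanner.next() : "";
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling fetch(vertexMetric(...)) against a running cluster returns the raw response body, which can then be handed to any JSON parser.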


Available metrics


Many system metrics are built into Flink, including

• CPU, memory, threads, GC

• Classloading, network, cluster, IO

• Checkpointing, throughput, latency


Latency Monitoring


Latency tracking

env.getConfig().setLatencyTrackingInterval(msec)

Latency markers are injected by the sources, and flow through the execution graph

• If records are queued in front of an operator, a marker waits its turn behind them, but it otherwise bypasses operators

Sinks track latency for each parallel source instance
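
The per-source bookkeeping at the sink can be pictured with a small plain-Java sketch. This is a simplified model under stated assumptions, not Flink's implementation: LatencySinkSketch is a hypothetical class, it keeps only the most recent latency for each source subtask, and timestamps are passed in explicitly so the example is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: hypothetical model of sink-side latency tracking.
final class LatencySinkSketch {
    // most recent observed latency in milliseconds, keyed by source subtask index
    private final Map<Integer, Long> latestLatencyMs = new HashMap<>();

    /** Called when a marker stamped at markedTimeMs by the given source subtask arrives at nowMs. */
    void onMarker(int sourceSubtask, long markedTimeMs, long nowMs) {
        latestLatencyMs.put(sourceSubtask, nowMs - markedTimeMs);
    }

    /** Returns the latest latency for a source subtask, or null if no marker has arrived yet. */
    Long latencyFor(int sourceSubtask) {
        return latestLatencyMs.get(sourceSubtask);
    }
}
```

Because markers skip the operators' processing logic, a value measured this way mostly reflects time spent waiting in queues rather than time spent processing records.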


Back Pressure Monitoring


What is Back Pressure?

Records in your job flow downstream, from sources to sinks

When a downstream operator can’t keep up, it exerts back pressure that propagates upstream


Detecting Back Pressure

OK: < 10%

LOW: 10 – 50%

HIGH: > 50%
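
The thresholds above can be read as a classification of a back pressure ratio, that is, the share of samples in which a task was found blocked. The sketch below assumes that reading. BackPressureSketch is a hypothetical helper, and the handling of the exact boundaries (10% counted as LOW, 50% counted as LOW) is an assumption, since the slide does not specify it.

```java
// Illustrative sketch only: hypothetical helper, not part of Flink.
final class BackPressureSketch {
    enum Level { OK, LOW, HIGH }

    /** Fraction of samples in which the task was blocked. */
    static double ratio(int blockedSamples, int totalSamples) {
        if (totalSamples <= 0) {
            throw new IllegalArgumentException("totalSamples must be positive");
        }
        return (double) blockedSamples / totalSamples;
    }

    /** Maps a ratio in [0, 1] to the OK / LOW / HIGH levels. */
    static Level classify(double ratio) {
        if (ratio < 0.10) {
            return Level.OK;   // less than 10%
        }
        if (ratio <= 0.50) {
            return Level.LOW;  // 10% to 50% (boundary handling is an assumption)
        }
        return Level.HIGH;     // more than 50%
    }
}
```

For example, a task blocked in 30 of 100 samples has a ratio of 0.3 and is classified as LOW.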


Configuration

jobmanager.web.backpressure.refresh-interval (60000 msec)

jobmanager.web.backpressure.num-samples (100)

jobmanager.web.backpressure.delay-between-samples (50 msec)


Slow sink?

E.g., slow database indexing may be causing backpressure

Try a discarding sink to rule out the sink
