OSMC 2014: Introduction into collectd | Florian Forster

DESCRIPTION

Periodically measuring performance metrics of production systems allows administrators and developers to analyze system behavior during and after outages, quantify performance improvements, detect trends, and take proactive measures before problems arise. Performance metrics are also useful for alerting because they can be aggregated meaningfully: an alert can be based on a group of hosts rather than on each host individually, for example. This talk gives an introduction to collectd, an open-source tool to gather, process and store performance metrics. A sample setup that aggregates a couple of metrics and stores the aggregates in Graphite is presented. Afterwards, the talk shows how the collectd-nagios utility can be used to define alerts in Icinga based on this data.

collectd: An introduction

About me

● Florian "octo" Forster

● Open-source work since 2001

● Started collectd in 2005

Agenda

● collectd

● Aggregation of metrics

● Alerting with Icinga

collectd

● Daemon

● collect metrics

● mangle / transport metrics

● store metrics (no retrieve)

collectd

● Open-source project

○ MIT and GPL licensed

● Platform independent

○ Linux, BSD, Solaris, AIX, HP-UX, …

○ Windows via SSC Serv (non-free)

collectd

● Agent-based design

○ Runs on each host

● Extensible via plugins

○ Language bindings (Perl, Python, Java)

○ "exec" plugin, e.g. shell scripts
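As an illustration of the "exec" route, here is a minimal sketch of a script that such a plugin could run. It prints values in collectd's plain-text protocol (one PUTVAL line per value) on standard output. The plugin name "demo", the type instance "load1", and the choice of the one-minute load average as the value are hypothetical; COLLECTD_HOSTNAME and COLLECTD_INTERVAL are the environment variables the exec plugin provides to the script.

```python
import os
import sys
import time


def format_putval(host, plugin, type_, value, interval, type_instance=""):
    """Build one PUTVAL line in collectd's plain-text protocol."""
    type_part = f"{type_}-{type_instance}" if type_instance else type_
    # "N" tells collectd to use the current time as the timestamp.
    return f'PUTVAL "{host}/{plugin}/{type_part}" interval={interval} N:{value}'


def main():
    host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
    interval = float(os.environ.get("COLLECTD_INTERVAL", "10"))
    while True:
        # Hypothetical metric: the one-minute load average (Unix only).
        value = os.getloadavg()[0]
        print(format_putval(host, "demo", "gauge", value, interval, "load1"))
        sys.stdout.flush()  # collectd reads the pipe line by line
        time.sleep(interval)


# Only loop forever when launched by collectd's exec plugin,
# which sets COLLECTD_INTERVAL in the environment.
if __name__ == "__main__" and os.environ.get("COLLECTD_INTERVAL"):
    main()
```

The script keeps running and emits one value per interval, which matches how the exec plugin expects long-lived helper programs to behave.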

collectd

● 95+ "read" (input) plugins

○ System metrics (e.g. CPU, memory)

○ Application metrics (e.g. MySQL)

○ Other (Xeon Phi, SNMP, OneWire)

collectd

● 15+ "write" (output) plugins

○ Graphite

○ RRDtool

○ RRDCacheD

○ Riemann

○ MongoDB

○ HTTP (generic)

collectd

# Input
LoadPlugin cpu
LoadPlugin memory
LoadPlugin df
<Plugin df>
  MountPoint "/"
  ValuesPercentage true
</Plugin>

# Output
LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "default">
    Host "graphite.example.com"
  </Node>
</Plugin>

Example configuration

collectd

● collectd's write_graphite plugin

○ Sends metrics to Graphite

○ TCP or UDP transport

○ Metric names somewhat adjustable

→ Monitoring mit Graphite (15:30 in this room, German)
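To make the write_graphite bullets concrete: Graphite's plaintext protocol expects one "path value timestamp" line per data point. The sketch below maps a collectd identifier to such a line. It is a simplified model, not the plugin's code: the function name is hypothetical, options such as prefixes are ignored, and it assumes that dots in the host name are replaced by an underscore so that the host stays a single path component.

```python
def to_graphite_line(identifier, value, timestamp, escape_char="_"):
    """Turn 'host/plugin-instance/type-instance' into a Graphite plaintext line."""
    host, plugin, type_ = identifier.split("/")
    # Assumption: dots in the host name are escaped, because Graphite
    # treats "." as the separator between path components.
    host = host.replace(".", escape_char)
    return f"{host}.{plugin}.{type_} {value} {int(timestamp)}\n"


line = to_graphite_line("example.com/cpu-0/cpu-idle", 97.5, 1416480000)
```

With these assumptions, the metric example.com/cpu-0/cpu-idle ends up under the Graphite path example_com.cpu-0.cpu-idle.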

Agenda

● collectd

● Aggregation of metrics

● Alerting with Icinga

Aggregation

● Aggregates often more useful for alerting

○ e.g. sum over CPUs, minimum RTT

● Metric storage often I/O bound

● Dashboards require "sane" amount of information

Aggregation

[Diagram with labels: collectd, Graphite, CPU, Disk, Memory, …, Aggregation]

Aggregation

● Load the Aggregation plugin

● Select (filter) applicable metrics

● Group by metric type and other fields

● Aggregate functions (e.g. sum)

Aggregation

LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
  </Aggregation>
</Plugin>

example.com/battery/percent-charged

example.com/cpu-0/cpu-idle

example.com/cpu-0/cpu-user

example.com/cpu-0/cpu-wait

example.com/cpu-1/cpu-idle

…

example.com/df-root/df_complex-free

example.com/df-root/df_complex-used

example.com/df-root/df_complex-rsvd

Load the aggregation plugin

Aggregation: Selection

● Five fields usable for selection

○ Host

○ Plugin

○ PluginInstance

○ Type (mandatory)

○ TypeInstance

Aggregation: Selection

LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
  </Aggregation>
</Plugin>

example.com/cpu-0/cpu-idle

example.com/cpu-0/cpu-user

example.com/cpu-0/cpu-wait

example.com/cpu-1/cpu-idle

example.com/cpu-1/cpu-user

example.com/cpu-1/cpu-wait

example.com/cpu-2/cpu-idle

example.com/cpu-2/cpu-user

example.com/cpu-2/cpu-wait

Select metrics

Aggregation: Grouping

● Four fields usable for grouping

○ Host

○ Plugin

○ PluginInstance

○ TypeInstance

● One or more fields left unspecified (values are aggregated across these)

Aggregation: Grouping

LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
    GroupBy Host
    GroupBy TypeInstance
  </Aggregation>
</Plugin>

example.com/cpu-???/cpu-idle

example.com/cpu-???/cpu-user

example.com/cpu-???/cpu-wait

Configure grouping

Aggregation: Functions

● Up to six aggregate functions

○ Count

○ Sum

○ Minimum

○ Maximum

○ Average

○ Standard deviation
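The six functions in the list above can be sketched for a single group of values as follows. This is an illustration only: the function name and the result keys are hypothetical, and the sketch assumes the sample standard deviation (n - 1 in the denominator), which may differ from what the plugin computes.

```python
import math


def aggregate(values):
    """Compute the six aggregates for one group of values."""
    n = len(values)
    total = sum(values)
    mean = total / n
    # Assumption: sample standard deviation; defined as 0.0 for one value.
    if n > 1:
        stddev = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    else:
        stddev = 0.0
    return {
        "count": n,
        "sum": total,
        "minimum": min(values),
        "maximum": max(values),
        "average": mean,
        "stddev": stddev,
    }


# Hypothetical idle percentages of three CPUs.
result = aggregate([90.0, 80.0, 70.0])
```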

Aggregation

LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
    GroupBy Host
    GroupBy TypeInstance
    CalculateSum true
  </Aggregation>
</Plugin>

example.com/cpu-sum/cpu-idle

example.com/cpu-sum/cpu-user

example.com/cpu-sum/cpu-wait

Select aggregate function(s)
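The three steps (select, group, aggregate) can be modeled in a few lines of Python. This is a sketch of the idea, not of the plugin's implementation: the function names and the sample values are hypothetical, and the names of the resulting metrics simply follow the slide.

```python
from collections import defaultdict


def parse_identifier(identifier):
    """Split 'host/plugin[-instance]/type[-instance]' into its five fields."""
    host, plugin_part, type_part = identifier.split("/")
    plugin, _, plugin_instance = plugin_part.partition("-")
    type_, _, type_instance = type_part.partition("-")
    return {
        "Host": host,
        "Plugin": plugin,
        "PluginInstance": plugin_instance,
        "Type": type_,
        "TypeInstance": type_instance,
    }


def aggregate_cpu_sum(metrics):
    """Model of: Plugin "cpu", Type "cpu", GroupBy Host + TypeInstance, sum."""
    sums = defaultdict(float)
    for identifier, value in metrics.items():
        fields = parse_identifier(identifier)
        # Selection: only metrics of plugin "cpu" with type "cpu".
        if fields["Plugin"] != "cpu" or fields["Type"] != "cpu":
            continue
        # Grouping: PluginInstance (the CPU number) is left unspecified,
        # so values of all CPUs fall into the same group and are summed.
        sums[(fields["Host"], fields["TypeInstance"])] += value
    return {
        f"{host}/cpu-sum/cpu-{type_instance}": total
        for (host, type_instance), total in sums.items()
    }


# Hypothetical sample values.
metrics = {
    "example.com/cpu-0/cpu-idle": 90.0,
    "example.com/cpu-1/cpu-idle": 80.0,
    "example.com/cpu-0/cpu-user": 5.0,
    "example.com/cpu-1/cpu-user": 10.0,
    "example.com/df-root/df_complex-free": 1000.0,
}
aggregated = aggregate_cpu_sum(metrics)
```

The df metric is dropped by the selection step, and the per-CPU values collapse into one metric per host and CPU state.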

Aggregation

● Creates additional metrics

● Use chains to filter out unwanted "raw" metrics

● Usable on client and/or server

Agenda

● collectd

● Aggregation of metrics

● Alerting with Icinga

Alerting

● Load the Unixsock plugin

● Query and check values with collectd-nagios

● Both come with collectd

Alerting

LoadPlugin unixsock
<Plugin unixsock>
  SocketFile "/var/run/collectd-unixsock"
  SocketGroup "collectd-nagios"
  SocketPerms "0660"
  DeleteSocket true
</Plugin>

Load the Unixsock plugin

Alerting

-> GETVAL example.com/cpu-average/cpu-wait
<- 1 Value found
<- value=8.540017e+00

Query values with the Unixsock plugin
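The same exchange can be scripted. The following sketch sends a GETVAL command over the UNIX socket and parses the reply: a status line whose leading number gives the count of value lines that follow (a negative number signals an error), then one name=value line per value. The function names are hypothetical, and error handling is kept to a minimum.

```python
import socket


def parse_getval_response(lines):
    """Parse the reply to GETVAL into a dict of name -> float."""
    status, _, message = lines[0].partition(" ")
    count = int(status)
    if count < 0:
        # A negative status means the command failed.
        raise RuntimeError(message)
    values = {}
    for line in lines[1:1 + count]:
        name, _, raw = line.partition("=")
        values[name] = float(raw)
    return values


def getval(socket_path, identifier):
    """Query one metric through collectd's unixsock plugin."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(f'GETVAL "{identifier}"\n'.encode("ascii"))
        with sock.makefile("r", encoding="ascii") as stream:
            lines = [stream.readline().rstrip("\n")]
            count = int(lines[0].split(" ", 1)[0])
            for _ in range(max(count, 0)):
                lines.append(stream.readline().rstrip("\n"))
    return parse_getval_response(lines)
```

A call such as getval("/var/run/collectd-unixsock", "example.com/cpu-average/cpu-wait") would return a dictionary with a single "value" entry, provided the calling user may access the socket.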

Alerting

● collectd-nagios queries and checks metrics

● Ranged -w (warn) and -c (critical) options

● Conforms to Icinga's best practices

Alerting

$ collectd-nagios -s /var/run/collectd-unixsock \
>   -n cpu-average/cpu-wait -H example.com \
>   -w '0:10' -c '0:25'
OKAY: 0 critical, 0 warning, 1 okay | value=8.540017;;;;

Example: collectd-nagios
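How the ranged -w and -c options lead to a status can be sketched as follows. This is a simplified model of the usual Nagios range convention, not collectd-nagios's code: it handles "min:max", open-ended forms such as "10:", and a bare maximum (treated as starting at zero), and it leaves out the inverted and "~" forms. The function names are hypothetical.

```python
import math


def parse_range(spec):
    """Parse 'min:max', 'min:', ':max' or a bare 'max' into (low, high)."""
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
        low = float(low_text) if low_text else -math.inf
        high = float(high_text) if high_text else math.inf
    else:
        # A bare number is taken as the maximum of a range starting at zero.
        low, high = 0.0, float(spec)
    return low, high


def in_range(value, spec):
    low, high = parse_range(spec)
    return low <= value <= high


def check(value, warn, crit):
    """A value outside a range triggers that level; critical wins."""
    if not in_range(value, crit):
        return "CRITICAL"
    if not in_range(value, warn):
        return "WARNING"
    return "OKAY"


status = check(8.540017, "0:10", "0:25")
```

Note that an open-ended range such as "10:" flags every value below 10, which suits metrics where low values are the problem (for example idle time) rather than metrics such as I/O wait.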

Alerting

define command{
    command_name  check_cpuio_collectd
    command_line  collectd-nagios \
                  -s /var/run/collectd-unixsock \
                  -H $HOSTNAME$ \
                  -n cpu-average/cpu-wait \
                  -w $ARG1$ -c $ARG2$
}

commands.cfg

define service{
    use                  generic-service
    host_name            example.com
    service_description  I/O wait
    check_command        check_cpuio_collectd!0:10!0:25
}

services.cfg

Alerting

● What's next?

○ Use "passive checks"

○ Let collectd push metrics to Icinga 2?

○ Bring on the patches!
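As a sketch of the "passive checks" idea: Nagios-compatible cores accept passive results as PROCESS_SERVICE_CHECK_RESULT lines written to their external command file. The helper below only formats such a line; its name, the sample values, and the assumption that the target installation has the external command file enabled are all hypothetical.

```python
def passive_check_result(timestamp, host, service, return_code, output):
    """Format one passive service check result as an external command line.

    return_code follows the plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    return (
        f"[{int(timestamp)}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}"
    )


command = passive_check_result(
    1416480000, "example.com", "I/O wait", 0, "OKAY: value=8.540017"
)
```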

Thank you!

Questions?
