OSMC 2014: Introduction into collectd | Florian Forster

collectd: An introduction

About me

● Florian "octo" Forster

● Open-source work since 2001

● Started collectd in 2005

Agenda

● collectd

● Aggregation of metrics

● Alerting with Icinga

collectd

● Daemon

● collect metrics

● mangle / transport metrics

● store metrics (no retrieval)

collectd

● Open-source project

○ MIT and GPL licensed

● Platform independent

○ Linux, BSD, Solaris, AIX, HP-UX, …

○ Windows via SSC Serv (non-free)

collectd

● Agent-based design

○ Runs on each host

● Extensible via plugins

○ Language bindings (Perl, Python, Java)

○ "exec" plugin, e.g. shell scripts
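As an illustration of the "exec" route: the exec plugin starts a program and reads PUTVAL lines from its standard output. The sketch below uses Python instead of a shell script and only formats and prints such a line. The plugin name "example", the type "gauge" and the value 42 are placeholders chosen for illustration, not part of the talk.

```python
import os
import time


def format_putval(host, plugin, type_, value, interval, timestamp=None):
    """Build one PUTVAL line as read by collectd's exec plugin.

    A timestamp of None is sent as "N", which collectd reads as "now".
    """
    when = "N" if timestamp is None else str(int(timestamp))
    return 'PUTVAL "{}/{}/{}" interval={} {}:{}'.format(
        host, plugin, type_, interval, when, value)


def main(iterations=1):
    # The exec plugin exports these two variables to the program it starts.
    host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
    interval = float(os.environ.get("COLLECTD_INTERVAL", "10"))
    for i in range(iterations):
        # "example" and the constant 42 are placeholders, not real metrics.
        print(format_putval(host, "example", "gauge", 42, interval), flush=True)
        if i + 1 < iterations:
            time.sleep(interval)


if __name__ == "__main__":
    main()
```

A real script would typically loop for as long as collectd keeps it running; here a single iteration keeps the sketch short.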

collectd

● 95+ "read" (input) plugins

○ System metrics (e.g. CPU, memory)

○ Application metrics (e.g. MySQL)

○ Other (Xeon Phi, SNMP, OneWire)

collectd

● 15+ "write" (output) plugins

○ Graphite

○ RRDtool

○ RRDCacheD

○ Riemann

○ MongoDB

○ HTTP (generic)

collectd

# Input
LoadPlugin cpu
LoadPlugin memory
LoadPlugin df
<Plugin df>
  MountPoint "/"
  ValuesPercentage true
</Plugin>

# Output
LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "default">
    Host "graphite.example.com"
  </Node>
</Plugin>

Example configuration

collectd

● collectd's write_graphite plugin

○ Sends metrics to Graphite

○ TCP or UDP transport

○ Metric names somewhat adjustable

→ Monitoring mit Graphite (15:30 in this room, German)
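To make "sends metrics to Graphite" concrete: Graphite's plaintext protocol takes one "path value timestamp" line per data point. The sketch below approximates how a collectd identifier could be turned into such a line. It assumes that dots in the host name are replaced by an escape character so they do not split the Graphite path; the prefix "collectd." in the usage example is a hypothetical setting.

```python
import time


def graphite_line(host, plugin, type_, value, timestamp=None,
                  prefix="", escape="_"):
    """Format one line of Graphite's plaintext protocol.

    Approximates write_graphite's naming: dots in the host name are
    replaced so that they do not act as Graphite path separators.
    """
    if timestamp is None:
        timestamp = time.time()
    path = "{}{}.{}.{}".format(prefix, host.replace(".", escape), plugin, type_)
    return "{} {} {}\n".format(path, value, int(timestamp))


# Made-up sample value for illustration.
line = graphite_line("example.com", "cpu-0", "cpu-idle", 97.5,
                     timestamp=1416000000, prefix="collectd.")
```

Here `line` holds the text that would be written to the TCP or UDP connection.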

Aggregation

● Aggregates are often more useful for alerting

○ e.g. sum over CPUs, minimum RTT

● Metric storage is often I/O bound

● Dashboards require a "sane" amount of information

Aggregation

[Diagram with the labels: collectd, Graphite, CPU, Disk, Memory, …, Aggregation]

Aggregation

● Load the Aggregation plugin

● Select (filter) applicable metrics

● Group by metric type and other fields

● Aggregate functions (e.g. sum)

Aggregation

LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
  </Aggregation>
</Plugin>

example.com/battery/percent-charged
example.com/cpu-0/cpu-idle
example.com/cpu-0/cpu-user
example.com/cpu-0/cpu-wait
…
example.com/df-root/df_complex-free
example.com/df-root/df_complex-used
example.com/df-root/df_complex-rsvd
…

Load the aggregation plugin

Aggregation: Selection

● Five fields usable for selection

○ Host

○ Plugin

○ PluginInstance

○ Type (mandatory)

○ TypeInstance
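The five fields map onto the identifier layout host/plugin[-instance]/type[-instance] seen in the metric names earlier. As a rough sketch of what selection means (not the plugin's actual implementation), the Python below splits identifiers into those fields and keeps the ones matching a mandatory type plus any further fields. The helper names `parse_identifier` and `select` are made up for this example.

```python
from collections import namedtuple

Identifier = namedtuple(
    "Identifier", "host plugin plugin_instance type type_instance")


def parse_identifier(text):
    """Split "host/plugin[-instance]/type[-instance]" into its five fields."""
    host, plugin_part, type_part = text.split("/")
    # The instance is everything after the first "-", or "" if there is none.
    plugin, _, plugin_instance = plugin_part.partition("-")
    type_, _, type_instance = type_part.partition("-")
    return Identifier(host, plugin, plugin_instance, type_, type_instance)


def select(identifiers, type_, **fields):
    """Keep identifiers matching the mandatory type and any other given fields."""
    result = []
    for ident in map(parse_identifier, identifiers):
        if ident.type != type_:
            continue
        if all(getattr(ident, name) == want for name, want in fields.items()):
            result.append(ident)
    return result


names = [
    "example.com/battery/percent-charged",
    "example.com/cpu-0/cpu-idle",
    "example.com/df-root/df_complex-free",
]
selected = select(names, "cpu", plugin="cpu")
```

With these three names, only the cpu-0 identifier remains in `selected`.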

Aggregation: Selection



<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
  </Aggregation>
</Plugin>

…

Select metrics

Aggregation: Grouping

● Four fields usable for grouping

○ Host

○ Plugin

○ PluginInstance

○ TypeInstance

● One or more fields left unspecified; these are aggregated over
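Grouping can be pictured as building a key from the chosen fields and collecting all values that share that key. The following is a minimal sketch of the idea, not the plugin's code. The sample values and the helper `cpu` are invented for illustration.

```python
from collections import defaultdict


def group_by(samples, fields):
    """Group (identifier, value) samples by the named identifier fields.

    Fields not named in `fields` do not take part in the group key, so
    they are the ones being aggregated over.
    """
    groups = defaultdict(list)
    for ident, value in samples:
        key = tuple(ident[name] for name in fields)
        groups[key].append(value)
    return dict(groups)


def cpu(instance, state):
    """Build a made-up identifier for one CPU core and state."""
    return {"host": "example.com", "plugin": "cpu",
            "plugin_instance": instance, "type": "cpu",
            "type_instance": state}


# Invented sample values for two cores.
samples = [
    (cpu("0", "idle"), 90.0), (cpu("1", "idle"), 80.0),
    (cpu("0", "user"), 10.0), (cpu("1", "user"), 20.0),
]
groups = group_by(samples, ["host", "type_instance"])
```

Because PluginInstance is not part of the key, the values of both cores end up in the same group per host and CPU state.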

Aggregation: Grouping



<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
    GroupBy Host
    GroupBy TypeInstance
  </Aggregation>
</Plugin>

example.com/cpu-???/cpu-idle
example.com/cpu-???/cpu-user
example.com/cpu-???/cpu-wait

Configure grouping

Aggregation: Functions

● Up to six aggregate functions

○ Count

○ Sum

○ Minimum

○ Maximum

○ Average

○ Standard deviation
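For reference, the six aggregates applied to one group of values can be sketched in a few lines of Python. This only illustrates the arithmetic. The slides do not say whether the standard deviation uses the population or the sample formula, so the population form here is an assumption.

```python
import statistics


def aggregate(values):
    """Compute the six aggregates for one group of values."""
    return {
        "count": len(values),
        "sum": sum(values),
        "minimum": min(values),
        "maximum": max(values),
        "average": statistics.mean(values),
        # Assumption: population standard deviation.
        "stddev": statistics.pstdev(values),
    }


# Invented idle percentages of two cores.
result = aggregate([90.0, 80.0])
```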

Aggregation



<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
    GroupBy Host
    GroupBy TypeInstance
    CalculateSum true
  </Aggregation>
</Plugin>

example.com/cpu-sum/cpu-idle
example.com/cpu-sum/cpu-user
example.com/cpu-sum/cpu-wait

Select aggregate function(s)

Aggregation

● Creates additional metrics

● Use chains to filter out unwanted "raw" metrics.

● Usable on client and/or server.

Alerting

● Load the Unixsock plugin

● Query and check values with collectd-nagios

● Both come with collectd

Alerting

LoadPlugin unixsock
<Plugin unixsock>
  SocketFile "/var/run/collectd-unixsock"
  SocketGroup "collectd-nagios"
  SocketPerms "0660"
  DeleteSocket true
</Plugin>

Load the Unixsock plugin

Alerting

-> GETVAL example.com/cpu-average/cpu-wait
<- 1 Value found
<- value=8.540017e+00

Query values with the Unixsock plugin
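The exchange above is plain text over a Unix domain socket, so it is easy to script. The sketch below sends GETVAL and parses the reply, assuming the reply consists of a status line that starts with the number of following lines (negative on error), followed by "name=value" lines. The function names are made up, and the socket path is the one from the configuration above.

```python
import socket


def parse_response(lines):
    """Parse a reply: a status line followed by "name=value" lines."""
    status, _, message = lines[0].partition(" ")
    count = int(status)
    if count < 0:
        # A negative status signals an error; the rest is the message.
        raise RuntimeError(message)
    values = {}
    for line in lines[1:1 + count]:
        name, _, raw = line.partition("=")
        values[name] = float(raw)
    return values


def getval(identifier, path="/var/run/collectd-unixsock"):
    """Send GETVAL over the Unix socket and return the parsed values."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        with sock.makefile("rw", encoding="utf-8", newline="\n") as stream:
            stream.write('GETVAL "{}"\n'.format(identifier))
            stream.flush()
            status = stream.readline().rstrip("\n")
            # Read as many value lines as the status line announces.
            count = max(int(status.split(" ", 1)[0]), 0)
            lines = [status]
            lines += [stream.readline().rstrip("\n") for _ in range(count)]
    return parse_response(lines)
```

On a host with a running collectd and the unixsock plugin loaded, `getval("example.com/cpu-average/cpu-wait")` would return a dictionary such as `{"value": ...}`.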

Alerting

● collectd-nagios queries and checks metrics

● Ranged -w (warn) and -c (critical) options

● Conforms to Icinga's best practices
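The ranged options can be read as "the value is fine while it stays inside min:max". The sketch below is a simplified model of that check and handles only the "min:max" form with optional bounds; it is not collectd-nagios's actual parser. The exit codes 0, 1 and 2 are the usual Nagios/Icinga plugin return codes.

```python
def in_range(value, spec):
    """Return True if value lies within a "min:max" range.

    Either bound may be left empty, meaning unbounded on that side.
    Only the "min:max" form is handled in this sketch.
    """
    if ":" not in spec:
        raise ValueError("expected a range of the form min:max")
    low_text, _, high_text = spec.partition(":")
    low = float(low_text) if low_text else float("-inf")
    high = float(high_text) if high_text else float("inf")
    return low <= value <= high


def check(value, warn, crit):
    """Map a value to a state name and plugin exit code."""
    if not in_range(value, crit):
        return "CRITICAL", 2
    if not in_range(value, warn):
        return "WARNING", 1
    return "OKAY", 0


state = check(8.540017, "0:10", "0:25")
```

With a warning range of 0:10 and a critical range of 0:25, an I/O wait of about 8.5 is inside both ranges and therefore okay.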

Alerting

$ collectd-nagios -s /var/run/collectd-unixsock \
>   -n cpu-average/cpu-wait -H example.com \
>   -w '0:10' -c '0:25'
OKAY: 0 critical, 0 warning, 1 okay | value=8.540017;;;;

Example: collectd-nagios

Alerting

commands.cfg:

define command{
  command_name  check_cpuio_collectd
  command_line  collectd-nagios \
                -s /var/run/collectd-unixsock \
                -H $HOSTNAME$ \
                -n cpu-average/cpu-wait \
                -w $ARG1$ -c $ARG2$
}

services.cfg:

define service{
  use                  generic-service
  host_name            example.com
  service_description  I/O wait
  check_command        check_cpuio_collectd!0:10!0:25
}

Alerting

● What's next?

○ Use "passive checks"

○ Let collectd push metrics to Icinga 2?

○ Bring on the patches!

Thank you!

Questions?
