Periodically measuring performance metrics of production systems allows administrators and developers to analyze system behavior during and after outages, quantify performance improvements, and detect trends and take proactive measures before problems arise. Performance metrics are also interesting for alerting, because they can be aggregated meaningfully, thereby basing an alert on a group of hosts rather than each host individually, for example. This talk will give an introduction to collectd, an open-source tool to gather, process and store performance metrics. A sample setup which aggregates a couple of metrics and stores the aggregate in Graphite will be presented. Afterwards, we will show how the collectd-nagios utility can be used to define alerts in Icinga based on this data.
collectd: An introduction
About me
● Florian "octo" Forster
● Open-source work since 2001
● Started collectd in 2005
Agenda
● collectd
● Aggregation of metrics
● Alerting with Icinga
collectd
● Daemon
● collect metrics
● mangle / transport metrics
● store metrics (no retrieval)
collectd
● Open-source project
○ MIT and GPL licensed
● Platform independent
○ Linux, BSD, Solaris, AIX, HP-UX, …
○ Windows via SSC Serv (non-free)
collectd
● Agent-based design
○ Runs on each host
● Extensible via plugins
○ Language bindings (Perl, Python, Java)
○ "exec" plugin, e.g. shell scripts
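As a rough illustration of the "exec" route, the sketch below formats the PUTVAL lines that a program started by the exec plugin writes to standard output. It uses Python instead of a shell script; the plugin instance `demo`, the type instance `load1` and the helper names are made up for this example, and the `gauge` type is assumed to exist in the local types.db.

```python
import os


def putval_line(host, plugin, type_, value, interval=10, type_instance=None):
    """Format one PUTVAL command for collectd's plain-text protocol."""
    type_part = type_ if type_instance is None else f"{type_}-{type_instance}"
    identifier = f"{host}/{plugin}/{type_part}"
    # "N" asks collectd to use the current time as the timestamp.
    return f'PUTVAL "{identifier}" interval={interval} N:{value}'


def current_load_line():
    """Build a PUTVAL line for the 1-minute load average (hypothetical metric)."""
    # The exec plugin exports these two variables to the programs it starts.
    host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
    interval = os.environ.get("COLLECTD_INTERVAL", "10")
    return putval_line(host, "exec-demo", "gauge", os.getloadavg()[0],
                       interval, "load1")
```

A real script would print such a line once per interval and flush standard output after each one, so that collectd sees the value immediately.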
collectd
● 95+ "read" (input) plugins
○ System metrics (e.g. CPU, memory)
○ Application metrics (e.g. MySQL)
○ Other (Xeon Phi, SNMP, OneWire)
collectd
● 15+ "write" (output) plugins
○ Graphite
○ RRDtool
○ RRDCacheD
○ Riemann
○ MongoDB
○ HTTP (generic)
collectd
# Input
LoadPlugin cpu
LoadPlugin memory
LoadPlugin df
<Plugin df>
  MountPoint "/"
  ValuesPercentage true
</Plugin>
# Output
LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "default">
    Host "graphite.example.com"
  </Node>
</Plugin>
Example configuration
collectd
● collectd's write_graphite plugin
○ Sends metrics to Graphite
○ TCP or UDP transport
○ Metric names somewhat adjustable
→ Monitoring mit Graphite (15:30 in this room, German)
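To make the name mapping more concrete, here is a small sketch of how a collectd identifier could be turned into a line of Graphite's plaintext protocol ("path value timestamp"). It only approximates what write_graphite does: the function name and the `prefix` and `escape` parameters are assumptions for illustration, and the plugin's actual naming options are not modeled.

```python
def graphite_line(identifier, value, timestamp, prefix="", escape="_"):
    """Turn 'host/plugin/type' plus a value into one Graphite plaintext line."""
    host, plugin, type_ = identifier.split("/")
    # Dots separate path components in Graphite, so dots inside the
    # host name are replaced to keep the host in a single component.
    host = host.replace(".", escape)
    return f"{prefix}{host}.{plugin}.{type_} {value} {int(timestamp)}\n"
```

For example, `graphite_line("example.com/cpu-0/cpu-idle", 95.5, 1400000000)` yields the path `example_com.cpu-0.cpu-idle`.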
Agenda
● collectd
● Aggregation of metrics
● Alerting with Icinga
Aggregation
● Aggregates often more useful for alerting
○ e.g. sum over CPUs, minimum RTT
● Metric storage often I/O bound
● Dashboards require "sane" amount of information
Aggregation
[Diagram: collectd and Graphite; labels: CPU, Disk, Memory, …, Aggregation]
Aggregation
● Load the Aggregation plugin
● Select (filter) applicable metrics
● Group by metric type and other fields
● Aggregate functions (e.g. sum)
Aggregation
LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
  </Aggregation>
</Plugin>
example.com/battery/percent-charged
example.com/cpu-0/cpu-idle
example.com/cpu-0/cpu-user
example.com/cpu-0/cpu-wait
example.com/cpu-1/cpu-idle
…
example.com/df-root/df_complex-free
example.com/df-root/df_complex-used
example.com/df-root/df_complex-rsvd
…
Load the aggregation plugin
Aggregation: Selection
● Five fields usable for selection
○ Host
○ Plugin
○ PluginInstance
○ Type (mandatory)
○ TypeInstance
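The five fields correspond to the parts of a metric identifier such as example.com/cpu-0/cpu-idle. The sketch below splits an identifier into those fields and checks it against a selection; it assumes the usual "host/plugin-instance/type-instance" layout, and `Identifier`, `parse_identifier` and `matches` are illustrative names, not part of collectd.

```python
from typing import NamedTuple, Optional


class Identifier(NamedTuple):
    host: str
    plugin: str
    plugin_instance: Optional[str]
    type: str
    type_instance: Optional[str]


def parse_identifier(text):
    """Split 'host/plugin[-instance]/type[-instance]' into its five fields."""
    host, plugin_part, type_part = text.split("/")
    # Only the first hyphen separates the name from its instance.
    plugin, _, plugin_instance = plugin_part.partition("-")
    type_, _, type_instance = type_part.partition("-")
    return Identifier(host, plugin, plugin_instance or None,
                      type_, type_instance or None)


def matches(ident, **wanted):
    """Return True if every requested field has the requested value."""
    return all(getattr(ident, name) == value for name, value in wanted.items())
```

With these helpers, `matches(parse_identifier("example.com/cpu-0/cpu-idle"), plugin="cpu", type="cpu")` is true, while the battery and df metrics are rejected.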
Aggregation: Selection
LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
  </Aggregation>
</Plugin>
example.com/cpu-0/cpu-idle
example.com/cpu-0/cpu-user
example.com/cpu-0/cpu-wait
example.com/cpu-1/cpu-idle
example.com/cpu-1/cpu-user
example.com/cpu-1/cpu-wait
example.com/cpu-2/cpu-idle
example.com/cpu-2/cpu-user
example.com/cpu-2/cpu-wait
…
Select metrics
Aggregation: Grouping
● Four fields usable for grouping
○ Host
○ Plugin
○ PluginInstance
○ TypeInstance
● One or more fields are left unspecified (these are aggregated over)
Aggregation: Grouping
LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
    GroupBy Host
    GroupBy TypeInstance
  </Aggregation>
</Plugin>
example.com/cpu-???/cpu-idle
example.com/cpu-???/cpu-user
example.com/cpu-???/cpu-wait
Configure grouping
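The effect of grouping by Host and TypeInstance can be sketched as follows. This is a simplified model, not the plugin's implementation: metrics are represented as plain dictionaries, and the sample values are invented.

```python
from collections import defaultdict


def group_values(samples, group_by):
    """Collect values that share the same contents in the group_by fields."""
    groups = defaultdict(list)
    for fields, value in samples:
        # Fields not listed in group_by (here: plugin_instance) are ignored,
        # so the values of all CPUs end up in the same group.
        key = tuple(fields[name] for name in group_by)
        groups[key].append(value)
    return dict(groups)


# Invented sample values for two CPUs on one host.
samples = [
    ({"host": "example.com", "plugin_instance": "0", "type_instance": "idle"}, 90.0),
    ({"host": "example.com", "plugin_instance": "0", "type_instance": "user"}, 10.0),
    ({"host": "example.com", "plugin_instance": "1", "type_instance": "idle"}, 80.0),
    ({"host": "example.com", "plugin_instance": "1", "type_instance": "user"}, 20.0),
]
groups = group_values(samples, ("host", "type_instance"))
```

Each resulting group holds one value per CPU, which is what the aggregate functions then operate on.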
Aggregation: Functions
● Up to six aggregate functions
○ Count
○ Sum
○ Minimum
○ Maximum
○ Average
○ Standard deviation
Aggregation
LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
    GroupBy Host
    GroupBy TypeInstance
    CalculateSum true
  </Aggregation>
</Plugin>
example.com/cpu-sum/cpu-idle
example.com/cpu-sum/cpu-user
example.com/cpu-sum/cpu-wait
Select aggregate function(s)
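For reference, the six aggregate functions applied to the values of one group could look like the sketch below. The dictionary keys are illustrative, and the use of the population standard deviation is an assumption; the plugin may compute it differently.

```python
import math


def aggregate(values):
    """Compute count, sum, minimum, maximum, average and standard deviation."""
    if not values:
        raise ValueError("cannot aggregate an empty group")
    count = len(values)
    total = math.fsum(values)
    average = total / count
    # Population standard deviation (an assumption, see the text above).
    variance = math.fsum((v - average) ** 2 for v in values) / count
    return {
        "count": count,
        "sum": total,
        "minimum": min(values),
        "maximum": max(values),
        "average": average,
        "stddev": math.sqrt(variance),
    }
```

For the idle values of two CPUs, `aggregate([90.0, 80.0])` gives a sum of 170.0 and an average of 85.0.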
Aggregation
● Creates additional metrics
● Use chains to filter out unwanted "raw" metrics.
● Usable on client and/or server.
Agenda
● collectd
● Aggregation of metrics
● Alerting with Icinga
Alerting
● Load the Unixsock plugin
● Query and check values with collectd-nagios
● Both come with collectd
Alerting
LoadPlugin unixsock
<Plugin unixsock>
  SocketFile "/var/run/collectd-unixsock"
  SocketGroup "collectd-nagios"
  SocketPerms "0660"
  DeleteSocket true
</Plugin>
Load the Unixsock plugin
Alerting
-> GETVAL example.com/cpu-average/cpu-wait
<- 1 Value found
<- value=8.540017e+00
Query values with the Unixsock plugin
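The same exchange can be scripted. The sketch below sends a GETVAL command over the UNIX socket and parses the reply; it assumes the reply consists of a status line whose leading number is the count of following lines (negative on error), as in the example above. The function names are made up for illustration.

```python
import socket


def parse_getval_response(lines):
    """Parse ['1 Value found', 'value=8.540017e+00'] into {'value': 8.540017}."""
    status, _, message = lines[0].partition(" ")
    count = int(status)
    if count < 0:
        # A negative status signals an error; the rest is the message.
        raise RuntimeError(message)
    values = {}
    for line in lines[1:1 + count]:
        name, _, raw = line.partition("=")
        values[name] = float(raw)
    return values


def getval(socket_path, identifier):
    """Query one metric from collectd's unixsock plugin."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        with sock.makefile("rw", encoding="ascii", newline="\n") as stream:
            stream.write(f'GETVAL "{identifier}"\n')
            stream.flush()
            lines = [stream.readline().rstrip("\n")]
            count = int(lines[0].split(" ", 1)[0])
            for _ in range(max(count, 0)):
                lines.append(stream.readline().rstrip("\n"))
    return parse_getval_response(lines)
```

A call such as `getval("/var/run/collectd-unixsock", "example.com/cpu-average/cpu-wait")` requires read and write access to the socket, which is what the SocketGroup and SocketPerms settings control.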
Alerting
● collectd-nagios queries and checks metrics
● Ranged -w (warn) and -c (critical) options
● Conforms to Icinga's best practices
Alerting
$ collectd-nagios -s /var/run/collectd-unixsock \
> -n cpu-average/cpu-wait -H example.com \
> -w '0:10' -c '0:25'
OKAY: 0 critical, 0 warning, 1 okay | value=8.540017;;;;
Example: collectd-nagios
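How the ranges are evaluated can be approximated in a few lines. The sketch follows the usual Nagios plugin conventions as far as they are understood here (inclusive bounds, "~" or an empty bound for infinity, a leading "@" to invert the range, and a bare number meaning "0:number"); it is not taken from the collectd-nagios source, and details may differ.

```python
import math


def parse_range(spec):
    """Parse a range such as '0:10', '10:', '~:5' or '@0:10'."""
    inverted = spec.startswith("@")
    body = spec[1:] if inverted else spec
    if ":" in body:
        low_text, high_text = body.split(":", 1)
        low = -math.inf if low_text in ("", "~") else float(low_text)
    else:
        # A bare number is treated as the upper bound of a range from zero.
        low, high_text = 0.0, body
    high = math.inf if high_text in ("", "~") else float(high_text)
    return low, high, inverted


def acceptable(value, spec):
    """Return True if the value does not violate the range."""
    low, high, inverted = parse_range(spec)
    inside = low <= value <= high
    return inside != inverted


def check(value, warn, crit):
    """Return the plugin exit code and status word for one value."""
    if not acceptable(value, crit):
        return 2, "CRITICAL"
    if not acceptable(value, warn):
        return 1, "WARNING"
    return 0, "OKAY"
```

With the thresholds from the example, `check(8.540017, "0:10", "0:25")` returns `(0, "OKAY")`, and a value of 12 would produce a warning.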
Alerting
define command{
    command_name check_cpuio_collectd
    command_line collectd-nagios \
        -s /var/run/collectd-unixsock \
        -H $HOSTNAME$ \
        -n cpu-average/cpu-wait \
        -w $ARG1$ -c $ARG2$
}
commands.cfg
define service{
    use                 generic-service
    host_name           example.com
    service_description I/O wait
    check_command       check_cpuio_collectd!0:10!0:25
}
services.cfg
Alerting
● What's next?
○ Use "passive checks"
○ Let collectd push metrics to Icinga 2?
○ Bring on the patches!
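One possible shape for the "passive checks" idea is sketched below: instead of Icinga running collectd-nagios, a process submits results through Icinga's external command file. The PROCESS_SERVICE_CHECK_RESULT line format is the classic Nagios/Icinga 1 external command; the helper names and the command file path in the usage note are assumptions that vary per installation.

```python
def passive_result(timestamp, host, service, return_code, output):
    """Format one passive service check result as an external command."""
    return (f"[{int(timestamp)}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


def submit(command_file, line):
    """Append the command to the external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

For instance, `submit("/var/lib/icinga/rw/icinga.cmd", passive_result(1400000000, "example.com", "I/O wait", 0, "OKAY: value=8.54"))` would report an OK state, provided the path matches the local setup and the service accepts passive checks.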
Thank you!
Questions?