Efficient monitoring and alerting

Efficient monitoring in modern environments

Tobias Schmidt - ContainerDays Hamburg 2016

@dagrobie - github.com/grobie

Introduction

About myself

Production Engineer for 5+ years

Container orchestration (in-house, Kubernetes)

Service discovery

Monitoring (Prometheus)

Production readiness

Monitoring

Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, processing times, and server lifetimes.

Site Reliability Engineering - O’Reilly 2016

Monitoring

Why monitor?

Enable automatic alerting

Analysis of long-term trends

Validate new features/experiments/implementations

Debugging

Monitoring

Blackbox vs. Whitebox

Blackbox: Externally observed

What the user sees

Whitebox: Data exposed by the system

Allows acting on imminent issues
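The difference can be sketched with a toy service. The `Service` class and `blackbox_probe` helper are hypothetical names for illustration, not part of the talk. The probe sees only the response, while the whitebox metrics show the queue filling up before any request fails.

```python
import time


class Service:
    """Toy service (hypothetical) that rejects requests once its queue is full."""

    def __init__(self, queue_capacity):
        self.queue_capacity = queue_capacity
        self.queue_length = 0

    def handle(self):
        """Return an HTTP-like status code, as an external caller would see it."""
        return 503 if self.queue_length >= self.queue_capacity else 200

    def whitebox_metrics(self):
        """Internal state that only the service itself can report."""
        return {
            "queue_length": self.queue_length,
            "queue_capacity": self.queue_capacity,
        }


def blackbox_probe(call):
    """Observe the service from outside: only status and duration are visible."""
    start = time.monotonic()
    status = call()
    return {"success": status < 500, "duration_seconds": time.monotonic() - start}


service = Service(queue_capacity=100)
service.queue_length = 90  # nearly full, but still serving

probe = blackbox_probe(service.handle)
metrics = service.whitebox_metrics()
queue_utilization = metrics["queue_length"] / metrics["queue_capacity"]
```

In this sketch the probe still reports success while the queue is 90% full, which is the kind of early signal only whitebox data can provide.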

Metrics

Metrics

Instrument everything

Host (CPU, memory, I/O, network, filesystem, …)

Container (CPU, memory, restarts, OOM, throttling, …)

Applications (throughput, latency, queues, …)

Metrics

Export detailed metrics

Attach all relevant information

Use aggregations later in alerts and dashboards
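A minimal sketch of this idea, assuming a request counter with three illustrative labels (`method`, `path`, `status`) that do not come from the talk. The counter is recorded at full detail, and aggregation happens only when the data is queried.

```python
from collections import Counter

# Label names are illustrative assumptions, not taken from the talk.
LABELS = ("method", "path", "status")

# One counter value per unique label combination.
requests_total = Counter()


def observe(method, path, status):
    """Record one request with all of its labels attached."""
    requests_total[(method, path, status)] += 1


def sum_by(counter, label):
    """Aggregate at query time, keeping only the given label."""
    index = LABELS.index(label)
    totals = Counter()
    for labels, value in counter.items():
        totals[labels[index]] += value
    return totals


observe("GET", "/users", "200")
observe("GET", "/users", "500")
observe("POST", "/orders", "200")

by_status = sum_by(requests_total, "status")  # for an error alert
by_path = sum_by(requests_total, "path")      # for a per-endpoint dashboard
```

Because the raw series keep every label, the same data can serve an alert on status codes and a dashboard broken down by path without re-instrumenting the application.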

Metrics

Four golden signals

Minimum set of metrics every service should have

Coined by Google SRE

Four golden signals

Latency

Time to serve user requests

Median doesn’t reflect user experience
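A small worked example shows why the median can hide a bad user experience. The latency values are made up for illustration, and `percentile` is a simple nearest-rank helper, not the method of any particular monitoring system.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 95 fast requests and 5 very slow ones, in milliseconds.
latencies_ms = [20] * 95 + [2000] * 5

p50 = percentile(latencies_ms, 50)  # 20
p99 = percentile(latencies_ms, 99)  # 2000
```

The median looks healthy at 20 ms even though one request in twenty takes two seconds. High percentiles make that tail visible.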

Four golden signals

Traffic

Demand placed on a system

(HTTP requests, network throughput, transactions, …)
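Traffic is often derived from a counter that only ever increases. The sketch below turns counter samples into a per-second rate, assuming samples arrive as `(timestamp_seconds, value)` pairs and that a drop in value means the process restarted. It is a simplification: functions such as Prometheus's `rate()` also adjust for resets but additionally extrapolate to the window boundaries, so their results will differ.

```python
def rate_per_second(samples):
    """Average per-second increase over the sampled window.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # Counter reset: assume it restarted from zero.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical request counter with a restart between t=10 and t=20.
request_samples = [(0, 100), (10, 150), (20, 10), (30, 60)]
requests_per_second = rate_per_second(request_samples)
```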

Four golden signals

Errors

Failure responses to user requests

Four golden signals

Saturation & Utilization

Consumption of constrained resources

(Memory, I/O, CPU slices, …)
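One way to express saturation is the current utilization together with an estimate of when the resource will run out. The sketch below assumes hourly disk usage samples (made-up values) and a plain least-squares trend. The function names are hypothetical.

```python
def utilization(used, capacity):
    """Fraction of a constrained resource currently consumed."""
    return used / capacity


def seconds_until_full(samples, capacity):
    """Estimate time until exhaustion from (timestamp_seconds, used) samples.

    Fits a least-squares line through the samples and extends its slope from
    the latest observation. Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / spread
    if slope <= 0:
        return None
    _, latest_used = samples[-1]
    return (capacity - latest_used) / slope


# Hypothetical disk usage in GB, sampled hourly, on a 100 GB volume.
disk_samples = [(0, 50), (3600, 60), (7200, 70)]
current = utilization(70, 100)
eta = seconds_until_full(disk_samples, capacity=100)
```

With these samples the disk grows by 10 GB per hour, so the estimate comes out at roughly three hours. A trend like this can be more useful than a fixed percentage because it accounts for how fast the resource is being consumed.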

Alerting

Alerting

Use symptom-based alerting

Monitor for your users

Four golden signals (traffic is tricky)

Only page if something needs immediate human intervention
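A symptom-based paging rule might look like the following sketch. It pages on the user-visible error ratio rather than on an internal cause, and only when the problem persists. The 1% limit and the three-interval duration are assumptions chosen for illustration.

```python
def error_ratio(requests, errors):
    """Share of user requests that failed in one evaluation interval."""
    return errors / requests if requests else 0.0


def should_page(windows, max_ratio=0.01, sustained=3):
    """Page only if the error ratio exceeded max_ratio in each of the last
    `sustained` intervals.

    windows: list of (requests, errors) per interval, oldest first.
    """
    recent = windows[-sustained:]
    if len(recent) < sustained:
        return False
    return all(error_ratio(r, e) > max_ratio for r, e in recent)


# A single bad interval does not page; a persistent problem does.
brief_blip = [(1000, 50), (1000, 5), (1000, 50)]
ongoing_outage = [(1000, 5), (1000, 50), (1000, 40), (1000, 30)]

page_for_blip = should_page(brief_blip)
page_for_outage = should_page(ongoing_outage)
```

Requiring the condition to hold over several intervals is one way to keep short blips from waking someone up.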

Alerting

Prevent alert fatigue

Alert grouping

Provide easy silencing

Dependencies

Avoid static thresholds
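Grouping and silencing can be sketched as simple operations on alert labels. The label names and the matching rules below are assumptions for illustration and do not reproduce the behavior of any specific alerting tool.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "cluster")):
    """Bundle alerts that share the same values for the group_by labels,
    so one notification can cover many affected instances."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def is_silenced(alert, silences):
    """A silence is a non-empty dict of label matchers; it applies when
    every matcher equals the alert's value for that label."""
    return any(
        silence and all(alert.get(k) == v for k, v in silence.items())
        for silence in silences
    )


# Hypothetical alerts: three instances fail in one cluster, one elsewhere.
alerts = [
    {"alertname": "HighErrorRate", "cluster": "eu", "instance": "a"},
    {"alertname": "HighErrorRate", "cluster": "eu", "instance": "b"},
    {"alertname": "HighErrorRate", "cluster": "eu", "instance": "c"},
    {"alertname": "HighErrorRate", "cluster": "us", "instance": "d"},
]
silences = [{"cluster": "us"}]  # e.g. planned maintenance

active = [a for a in alerts if not is_silenced(a, silences)]
notifications = group_alerts(active)
```

In this example four firing alerts result in a single notification: the `us` alert is silenced and the three `eu` alerts are grouped together.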

Alerting

Use a ticketing system

Avoid email spam

Warnings are tasks like new features

Alerting

Provide runbooks (playbooks)

Keep them concise

Explanation, hints, links

Dynamic - include recent observations

Discuss with non-experts

Alerting

Practice outages

“Game days”

Repeat regularly

Thank you

May the queries flow, and your pagers be quiet.
