Efficient monitoring in modern environments Tobias Schmidt - ContainerDays Hamburg 2016 @dagrobie - github.com/grobie

Efficient monitoring and alerting

Page 1: Efficient monitoring and alerting

Efficient monitoring in modern environments

Tobias Schmidt - ContainerDays Hamburg 2016

@dagrobie - github.com/grobie

Page 2: Efficient monitoring and alerting

Introduction

About myself

Production Engineer for 5+ years

Container orchestration (in-house, Kubernetes)

Service discovery

Monitoring (Prometheus)

Production readiness

Page 3: Efficient monitoring and alerting

Monitoring

Page 4: Efficient monitoring and alerting

Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, processing times, and server lifetimes.

Site Reliability Engineering - O’Reilly 2016

Monitoring

Page 5: Efficient monitoring and alerting

Monitoring

Page 6: Efficient monitoring and alerting

Monitoring

Why monitor?

Enable automatic alerting

Analysis of long-term trends

Validate new features/experiments/implementations

Debugging

Page 7: Efficient monitoring and alerting

Monitoring

Blackbox vs. Whitebox

Blackbox: Externally observed

What the user sees

Whitebox: Data exposed by the system

Allows acting on imminent issues

Page 8: Efficient monitoring and alerting

Metrics

Page 9: Efficient monitoring and alerting

Metrics

Instrument everything

Host (CPU, memory, I/O, network, filesystem, …)

Container (CPU, memory, restarts, OOM, throttling, …)

Applications (throughput, latency, queues, …)

Page 10: Efficient monitoring and alerting

Metrics

Export detailed metrics

Attach all relevant information

Use aggregations later in alerts and dashboards
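The idea of exporting detailed metrics and aggregating only when needed can be sketched in a few lines. This is a minimal, standard-library-only illustration, not Prometheus itself; the label set (service, method, status) and the helper names `observe` and `sum_by` are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical in-memory metric store: one counter per full label combination.
requests = Counter()

def observe(service, method, status):
    """Count one request, keeping every label as part of the key."""
    requests[(service, method, status)] += 1

def sum_by(counter, index):
    """Aggregate a labelled counter over a single label position."""
    totals = Counter()
    for labels, value in counter.items():
        totals[labels[index]] += value
    return totals

# Record detailed observations first ...
observe("api", "GET", 200)
observe("api", "GET", 500)
observe("api", "POST", 200)
observe("web", "GET", 200)

# ... and decide on the aggregation later, per dashboard or alert.
by_service = sum_by(requests, 0)
by_status = sum_by(requests, 2)
print(dict(by_service))
print(dict(by_status))
```

Because every label is kept at export time, the same raw data can answer both "requests per service" and "requests per status code" without re-instrumenting the application.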

Page 11: Efficient monitoring and alerting

Metrics

Four golden signals

Minimum set of metrics every service should have

Coined by Google SRE

Page 12: Efficient monitoring and alerting

Four golden signals

Latency

Time to serve user requests

Median doesn’t reflect user experience
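A small worked example may help show why the median can hide a poor user experience. The latency samples below are invented for illustration, and `percentile` is a hypothetical helper using the simple nearest-rank method (monitoring systems typically estimate percentiles from histograms instead).

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up data: 90 fast requests and 10 very slow ones (milliseconds).
latencies_ms = [20] * 90 + [2000] * 10

median = statistics.median(latencies_ms)
p99 = percentile(latencies_ms, 99)

print("median:", median)
print("p99:", p99)
```

In this invented data set one in ten requests takes two seconds, yet the median still reports 20 ms; only the high percentile exposes the slow tail.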

Page 13: Efficient monitoring and alerting

Four golden signals

Traffic

Demand placed on a system

(HTTP requests, network throughput, transactions, …)

Page 14: Efficient monitoring and alerting

Four golden signals

Errors

Failure responses to user requests
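Traffic and errors are often derived from monotonically increasing counters sampled at two points in time. The sketch below assumes that model; the counter values, the 60 second interval, and the function names `rate` and `error_ratio` are illustrative assumptions, and counter resets (for example after a restart) are deliberately not handled.

```python
def rate(previous, current, interval_seconds):
    """Per-second increase of a counter between two samples (no reset handling)."""
    return (current - previous) / interval_seconds

def error_ratio(error_rate, total_rate):
    """Share of requests that failed; 0.0 when there was no traffic."""
    return error_rate / total_rate if total_rate else 0.0

# Hypothetical counter samples taken 60 seconds apart.
total_rps = rate(10_000, 10_600, 60)   # traffic: requests per second
error_rps = rate(40, 70, 60)           # errors: failed requests per second

ratio = error_ratio(error_rps, total_rps)
print(f"traffic={total_rps:.1f} req/s, errors={ratio:.1%}")
```

Expressing errors as a ratio of traffic rather than as an absolute count keeps the signal comparable between quiet and busy periods.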

Page 15: Efficient monitoring and alerting

Four golden signals

Saturation & Utilization

Consumption of constrained resources

(Memory, I/O, CPU slices, …)

Page 16: Efficient monitoring and alerting

Alerting

Page 17: Efficient monitoring and alerting

Alerting

Use symptom-based alerting

Monitor for your users

Four golden signals (traffic is tricky)

Only page if something needs immediate human intervention

Page 18: Efficient monitoring and alerting

Alerting

Prevent alert fatigue

Alert grouping

Provide easy silencing

Dependencies

Avoid static thresholds
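Alert grouping, one of the measures listed above, can be illustrated with a short sketch: alerts that share the same values for a chosen set of labels are bundled into a single notification. The alert dictionaries, label names, and `group_alerts` function are hypothetical and only loosely modelled on how alert managers group by label; this is not the API of any particular tool.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bundle alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# Hypothetical firing alerts: three instances of one problem, one unrelated alert.
alerts = [
    {"alertname": "HighLatency", "cluster": "eu", "instance": "api-1"},
    {"alertname": "HighLatency", "cluster": "eu", "instance": "api-2"},
    {"alertname": "HighLatency", "cluster": "eu", "instance": "api-3"},
    {"alertname": "DiskFilling", "cluster": "us", "instance": "db-1"},
]

grouped = group_alerts(alerts, ["alertname", "cluster"])
for key, members in grouped.items():
    print(key, "->", len(members), "alert(s) in one notification")
```

With grouping on `alertname` and `cluster`, the four firing alerts collapse into two notifications, so a single incident affecting many instances pages once instead of once per instance.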

Page 19: Efficient monitoring and alerting

Alerting

Use a ticketing system

Avoid email spam

Warnings are tasks like new features

Page 20: Efficient monitoring and alerting

Alerting

Provide runbooks (playbooks)

Keep them concise

Explanation, hints, links

Dynamic - include recent observations

Discuss with non-experts

Page 21: Efficient monitoring and alerting

Alerting

Practice outages

“Game days”

Repeat regularly

Page 23: Efficient monitoring and alerting

Thank you

May the queries flow, and your pagers be quiet.

Tobias Schmidt - ContainerDays Hamburg 2016

@dagrobie - github.com/grobie