Efficient monitoring and alerting

Efficient monitoring in modern environments

Tobias Schmidt - ContainerDays Hamburg 2016

@dagrobie - github.com/grobie

Introduction

About myself

Production Engineer for 5+ years

Container orchestration (in-house, Kubernetes)

Service discovery

Monitoring (Prometheus)

Production readiness

Monitoring

Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, processing times, and server lifetimes.

Site Reliability Engineering - O’Reilly 2016

Monitoring

Why monitor?

Enable automatic alerting

Analysis of long-term trends

Validate new features/experiments/implementations

Debugging

Monitoring

Blackbox vs. Whitebox

Blackbox: Externally observed

What the user sees

Whitebox: Data exposed by the system

Allows acting on imminent issues

Metrics

Instrument everything

Host (CPU, memory, I/O, network, filesystem, …)

Container (CPU, memory, restarts, OOM, throttling, …)

Applications (throughput, latency, queues, …)

Metrics

Export detailed metrics

Attach all relevant information

Use aggregations later in alerts and dashboards
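To make "export detailed, aggregate later" concrete, here is a minimal sketch in plain Python. It is not tied to any monitoring library; the label names (method, status, instance) and the helper `sum_by` are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical detailed metric: one counter per (method, status, instance).
requests_total = Counter()


def observe(method, status, instance):
    """Record one request with all of its labels attached."""
    requests_total[(method, status, instance)] += 1


def sum_by(samples, label_index):
    """Aggregate detailed samples down to a single label at query time."""
    aggregated = Counter()
    for labels, value in samples.items():
        aggregated[labels[label_index]] += value
    return aggregated


observe("GET", "200", "web-1")
observe("GET", "200", "web-2")
observe("POST", "500", "web-1")
observe("GET", "404", "web-2")

by_status = sum_by(requests_total, 1)    # view for an error dashboard
by_instance = sum_by(requests_total, 2)  # view for per-host debugging
```

Because the raw samples keep every label, both views come from the same data. Had only a pre-aggregated total been exported, neither breakdown could be recovered afterwards.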

Metrics

Four golden signals

Minimum set of metrics every service should have

Coined by Google SRE

Four golden signals

Latency

Time to serve user requests

Median doesn’t reflect user experience
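A small worked example of why the median can hide a bad user experience. The latency values are made up for illustration, and `percentile` is a simple nearest-rank helper written for this sketch, not a library function.

```python
import statistics


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of q * n / 100, avoiding floating point rounding.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical sample: 95 fast requests and 5 very slow ones (milliseconds).
latencies_ms = [20] * 95 + [2000] * 5

median_ms = statistics.median(latencies_ms)
p99_ms = percentile(latencies_ms, 99)
```

In this sample the median is 20 ms while the 99th percentile is 2000 ms: one request in twenty takes two seconds, and the median does not show it.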

Four golden signals

Traffic

Demand placed on a system

(HTTP requests, network throughput, transactions, …)

Four golden signals

Errors

Failure responses to user requests
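One common way to express this signal is as a ratio of failed responses to all responses. The sketch below assumes an HTTP service and counts only 5xx status codes as failures; both choices are assumptions made here, and other services may need a different definition of "failure".

```python
def error_ratio(responses_by_status):
    """Fraction of responses that failed, given counts keyed by HTTP status."""
    total = sum(responses_by_status.values())
    if total == 0:
        return 0.0  # no traffic, so no observed failures
    errors = sum(
        count
        for status, count in responses_by_status.items()
        if 500 <= status <= 599  # assumption: only server errors count
    )
    return errors / total


# Hypothetical counts over some time window.
window = {200: 970, 404: 20, 500: 7, 503: 3}
ratio = error_ratio(window)
```

A ratio stays comparable as traffic grows or shrinks, which a raw error count does not.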

Four golden signals

Saturation & Utilization

Consumption of constrained resources

(Memory, I/O, CPU slices, …)

Alerting

Use symptom-based alerting

Monitor for your users

Four golden signals (traffic is tricky)

Only page if something needs immediate human intervention

Alerting

Prevent alert fatigue

Alert grouping

Provide easy silencing

Dependencies

Avoid static thresholds
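One possible reading of "avoid static thresholds" is to compare a value against its own recent history instead of a fixed number. The sketch below flags values more than `k` standard deviations from the recent mean; the function name, the default `k=3.0`, and the sample data are assumptions for illustration, not the approach of any particular tool.

```python
import statistics


def deviates_from_baseline(history, current, k=3.0):
    """Return True if current is more than k standard deviations
    away from the mean of the recent history."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) > k * stdev


# Hypothetical requests-per-second samples from the last few intervals.
recent_rps = [100, 102, 98, 101, 99]

normal = deviates_from_baseline(recent_rps, 103)
spike = deviates_from_baseline(recent_rps, 150)
```

The same check keeps working when the service's normal load doubles, whereas a fixed threshold would have to be retuned by hand.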

Alerting

Use a ticketing system

Avoid email spam

Warnings are tasks like new features

Alerting

Provide runbooks (playbooks)

Keep them concise

Explanation, hints, links

Dynamic - include recent observations

Discuss with non-experts

Alerting

Practice outages

“Game days”

Repeat regularly

Acknowledgements

Matt T. Proud, Julius Volz, Björn Rabenstein, Matthias Rampke

Philosophy on Alerting - Rob Ewaschuk

Thank you

May the queries flow, and your pagers be quiet.
