StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis

Introduction to Monitoring

Monitoring is both the process and the set of tools for finding problems before your users do, minimizing the monetary impact of failure, and enabling fast recovery.

Efficient monitoring aims at notifying the right person at the right time (and at the right time only) with the most precise information.

What monitoring is

Measure, Aggregate & Visualize, Alert

[Diagram: Webapp and DB]

What to Measure?

End user experience / performance

End User Monitoring
• Validates that our application is running from “outside”
• Measures “real user” performance
• Geo-distributed, including real latency
• Many tools offer such solutions
  – Measure, visualize, alert

End User Monitoring
• When is a page fully loaded?
• Take care: some tools are biased

End User Monitoring
• Measure yourself
• Using
  – Resource Timing API
  – User Timing API
  – Custom JS
• Send metrics from browsers to your own sync server
  – All users / samples

End User Monitoring
What to measure
• Measure page load time (as you define it)
• Measure loading errors
• Measure number of page views
• Group by geo & application
• Group by browser
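The measurements and groupings above can be sketched as a small server-side aggregation. This is only an illustration: it assumes each browser sample arrives as a record with the hypothetical fields `geo`, `browser`, `load_ms` and `error`, none of which come from the talk.

```python
from collections import defaultdict
from statistics import median

def summarize_page_loads(samples):
    """Group browser samples by (geo, browser) and summarize each group.

    Each sample is a dict with hypothetical keys: geo, browser,
    load_ms (page load time as you define it) and error (bool).
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["geo"], sample["browser"])].append(sample)

    summary = {}
    for key, group in groups.items():
        # Only successful loads contribute to the load-time statistic.
        load_times = [s["load_ms"] for s in group if not s["error"]]
        summary[key] = {
            "page_views": len(group),
            "loading_errors": sum(1 for s in group if s["error"]),
            "median_load_ms": median(load_times) if load_times else None,
        }
    return summary

samples = [
    {"geo": "US", "browser": "Chrome", "load_ms": 1200, "error": False},
    {"geo": "US", "browser": "Chrome", "load_ms": 1800, "error": False},
    {"geo": "US", "browser": "Chrome", "load_ms": 0, "error": True},
    {"geo": "DE", "browser": "Firefox", "load_ms": 900, "error": False},
]
summary = summarize_page_loads(samples)
```

Keeping the group key as a tuple makes it easy to add another dimension, such as application, without changing the rest of the function.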

End User Monitoring
Alert on
• Sudden drop in traffic from a certain geo
• Sudden increase in traffic
• Increase in loading times
• Increase in errors
  – From a specific browser

What to Measure?

Is Alive?

Is Alive
• Measure a process's liveness
  – Is the process running?
• Measure a process's responsiveness
  – Does the process respond to a request?
• Alert on instance down
  – And auto-restart it
• Alert on all instances down

Is Alive
• A variety of great tools
• Tools that perform “ping” tests
• Tools that call a designated URL for responsiveness tests
• Is alive != Availability
  – Is alive is per host
  – Availability is about the system as a whole
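A minimal sketch of the per-host is-alive check described above. The probe is passed in as a function so the example stays self-contained. In practice it would call a designated URL on each host; the host names and `fake_probe` here are hypothetical.

```python
def is_alive(probe, host):
    """Per-host check: True only if the probe returns a truthy answer.

    Any exception (timeout, connection refused, ...) counts as not alive.
    """
    try:
        return bool(probe(host))
    except Exception:
        return False

def alive_hosts(probe, hosts):
    """Return the hosts that answered the is-alive probe."""
    return [h for h in hosts if is_alive(probe, h)]

def fake_probe(host):
    # Stand-in for an HTTP call to a designated is-alive URL (assumption).
    if host == "web-3":
        raise TimeoutError("no response")
    return host != "web-2"

hosts = ["web-1", "web-2", "web-3"]
up = alive_hosts(fake_probe, hosts)
instances_down = [h for h in hosts if h not in up]  # alert per instance
all_down = not up                                   # alert on all instances down
```

Note that this answers "is alive" per host only. Whether the system as a whole is available is a separate question that needs the per-host results combined.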

What to Measure?

Request performance

Request Monitoring
• Measure how your application performs
  – Regardless of networking to the user
  – Regardless of latency
• Measuring on the server, per server
• Many tools provide such solutions
  – Measure, visualize, alert

Request Monitoring
• But many tools miss the branching point
  – Branching point: the point at which your code decides which branch of execution to perform for a request
• Issues with aggregation, what is monitored, alert flexibility
• But still, there are some great tools

Request Monitoring
What to measure
• Measure request rate
• Measure performance histogram
• Measure error rate, by error type and HTTP response code
• Group by request type (as you define it)
• Group by host, application, data center
• Group by error type (as you define it)

Do not use Average
• Don’t use average for performance
• Instead, use the median, 95th percentile and 99th percentile
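A small worked example of why the average misleads. The latency numbers are invented for illustration, and the percentile uses the simple nearest-rank definition, which is one of several common conventions.

```python
import math
from statistics import mean

def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds: most requests are fast,
# a few are slow, and two are very slow.
latencies_ms = [100] * 90 + [400] * 8 + [5000] * 2

average = mean(latencies_ms)        # 222: matches no real request
p50 = percentile(latencies_ms, 50)  # 100: the typical request
p95 = percentile(latencies_ms, 95)  # 400: the slow requests
p99 = percentile(latencies_ms, 99)  # 5000: the worst requests
```

The average of 222 ms describes neither the typical request nor the slow ones, while the three percentiles together show the shape of the distribution.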

Request Monitoring
What to Visualize
• Request rate (RPM)
• Request performance
  – Median, 95th percentile and 99th percentile on a moving window

Request Monitoring
What to Visualize
• Errors
  – Rate, percent (compared to request rate)
  – Top X errors by percent
  – Separate system and application errors
  – You will always have application errors
  – You should have exactly 0 system errors
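The error views above can be sketched as follows. The record shape, a `(kind, error_type)` pair per request with `kind` being `"ok"`, `"application"` or `"system"`, and the error names are assumptions made for this example.

```python
from collections import Counter

def error_report(requests, top_x=3):
    """Summarize errors relative to the request rate.

    requests: list of (kind, error_type) pairs, where kind is one of
    "ok", "application" or "system" (a hypothetical classification).
    """
    total = len(requests)
    errors = Counter(err for kind, err in requests if kind != "ok")
    return {
        # Errors as a percent of all requests, not an absolute count.
        "error_percent": 100 * sum(errors.values()) / total if total else 0.0,
        # Top X error types, each as a percent of all requests.
        "top_errors": [(name, 100 * n / total)
                       for name, n in errors.most_common(top_x)],
        # System errors are tracked separately: the target is exactly 0.
        "system_errors": sum(1 for kind, _ in requests if kind == "system"),
    }

requests = (
    [("ok", None)] * 190
    + [("application", "InvalidInput")] * 6
    + [("application", "NotFound")] * 3
    + [("system", "DbTimeout")] * 1
)
report = error_report(requests)
should_alert = report["system_errors"] > 0
```

Here 5% of requests fail, which may be normal for application errors, but the single system error is enough to raise an alert.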

Request Monitoring
Alert on
• Big changes in traffic
• Increase in response times
• Increase in errors
• System errors

What to Measure?

Resource Utilization

Resources
• System resources
  – CPU, memory, IO, storage, network
• Resource pools
  – Database connection pools
  – HTTP connection pools
  – Thread pools
  – Other resource pools

Resource Monitoring
What to measure
• Measure resource utilization
  – Percent of resource used
• Measure resource acquisition queue
  – Time to acquire
  – Acquire timeouts
  – Usage timeouts
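One way to collect these pool measurements is to wrap the pool itself. The class below is a hypothetical sketch built on Python's standard `queue.Queue`. It records utilization, time to acquire and acquire timeouts, and leaves usage timeouts out for brevity.

```python
import queue
import time

class MeasuredPool:
    """A fixed-size resource pool that records acquisition metrics
    (illustrative only, not a production pool)."""

    def __init__(self, resources):
        resources = list(resources)
        self.size = len(resources)
        self._free = queue.Queue()
        for resource in resources:
            self._free.put(resource)
        self.acquire_times = []    # seconds waited per successful acquire
        self.acquire_timeouts = 0  # acquires that gave up waiting

    def acquire(self, timeout):
        start = time.monotonic()
        try:
            resource = self._free.get(timeout=timeout)
        except queue.Empty:
            self.acquire_timeouts += 1
            raise
        self.acquire_times.append(time.monotonic() - start)
        return resource

    def release(self, resource):
        self._free.put(resource)

    def utilization_percent(self):
        """Percent of the pool currently in use."""
        in_use = self.size - self._free.qsize()
        return 100 * in_use / self.size

pool = MeasuredPool(["conn-1", "conn-2"])
conn = pool.acquire(timeout=0.1)
utilization = pool.utilization_percent()
```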

Resource Monitoring
What to measure
• Group by resource type and pool
• Group by host, application, data center
• Group by error type (as you define it)

Alert on
• Resource over-utilization: average usage over XX% in a time window
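A sketch of the over-utilization rule above, using a fixed-size window of the most recent samples. The 80% threshold and the window of three samples are placeholder values chosen for the example, since the slide leaves the threshold as XX%.

```python
from collections import deque

class UtilizationAlert:
    """Fires when the average usage over the last N samples exceeds a threshold."""

    def __init__(self, threshold_percent, window_size):
        self.threshold = threshold_percent
        self.window = deque(maxlen=window_size)  # old samples fall off automatically

    def observe(self, usage_percent):
        """Record a sample; return True if the alert should fire."""
        self.window.append(usage_percent)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data for a full window yet
        return sum(self.window) / len(self.window) > self.threshold

alert = UtilizationAlert(threshold_percent=80, window_size=3)
results = [alert.observe(u) for u in [50, 99, 50, 85, 90, 95]]
```

The single spike to 99% does not fire the alert. Only the sustained run of 85, 90 and 95 pushes the window average above the threshold.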

What to Measure?

Database Monitor

Database monitoring
Depends on the database, but still:
• Storage
• Replication “lag”
• Slow operations
• Resource usage

Monitoring at Wix

Precise information

Alert the right person

Automation

Service is alive
• Is my application alive on the minimum number of instances required by my SLA?
• 2 out of 5 instances of my-app are not responding to isAlive
• my-app requires a minimum of 3 instances to meet the SLA
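The example above can be worked through in a few lines. The instance names are hypothetical, and the function only illustrates the comparison between alive instances and the SLA minimum, not how the check is implemented at Wix.

```python
def sla_status(instance_alive, min_required):
    """Compare the number of responding instances with the SLA minimum.

    instance_alive: mapping of instance name to its isAlive result.
    """
    alive = sum(1 for ok in instance_alive.values() if ok)
    down = sorted(name for name, ok in instance_alive.items() if not ok)
    return {
        "alive": alive,
        "down": down,
        "meets_sla": alive >= min_required,
        "spare_instances": alive - min_required,
    }

status = sla_status(
    {
        "my-app-1": True,
        "my-app-2": False,
        "my-app-3": True,
        "my-app-4": False,
        "my-app-5": True,
    },
    min_required=3,
)
```

With 2 of 5 instances down and a minimum of 3, the service still meets its SLA but has no spare instances left: one more failure would break it.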

[Diagram: is-alive alerting]
• Nginx: service load balancer, is-alive
• ZooKeeper: planned configuration
• Sensu: queries Nginx, alert & SLA
• Alert: service owner


Precise information

Automation

Service anomalies
• Backend anomalies
• Identify unhealthy KPIs per endpoint
• Abnormal increase in error rate for class.method.get
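To make "abnormal increase" concrete, here is a deliberately simple stand-in: a z-score test against recent history. This is not how any particular anomaly detection product works; it is a substitute technique chosen for illustration, and the error rates and threshold are invented.

```python
from statistics import mean, pstdev

def is_abnormal_increase(history, current, z_threshold=3.0):
    """True if `current` is more than z_threshold standard deviations
    above the mean of `history` (a simple z-score test)."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat history: any increase at all is unusual.
        return current > baseline
    return (current - baseline) / spread > z_threshold

# Hypothetical errors per minute for one endpoint.
error_rates = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0]

normal_minute = is_abnormal_increase(error_rates, 1.3)
bad_minute = is_abnormal_increase(error_rates, 5.0)
```

A fixed z-score ignores seasonality and trends, which is one reason to evaluate each endpoint's KPI against its own history rather than a global static threshold.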

[Diagram: backend anomaly alerting]
• JVM servers: metrics library, metrics / 1m
• statsd: stats aggregation, forwarding metrics
• Anodot: time series anomaly detection, alerts & graphs
• Outputs: anomaly alert, graphs
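For readers unfamiliar with the statsd step in this pipeline, the sketch below formats metrics in the statsd line protocol (`<name>:<value>|<type>`) and sends them over UDP. The talk's servers are JVM-based and use a metrics library, so this Python version is only an illustration; the metric names, host and port are assumptions (8125 is the usual statsd default).

```python
import socket

def statsd_timing_line(name, millis):
    """Format a timer metric: <name>:<value>|ms"""
    return f"{name}:{millis}|ms"

def statsd_counter_line(name, count=1):
    """Format a counter metric: <name>:<value>|c"""
    return f"{name}:{count}|c"

def send_to_statsd(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumptions."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))

# Hypothetical metric names for one endpoint.
timing = statsd_timing_line("my-app.class.method.get.time", 42)
errors = statsd_counter_line("my-app.class.method.get.errors")
```

Because the transport is UDP, a slow or missing statsd server does not block the application that emits the metrics.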

Precise information


Automation

Service anomalies
• Frontend anomalies
• Browser (client) generated KPIs
• User experience: are users affected or not? How and where?

[Diagram: frontend anomaly alerting]
• Client: JS in browser, events
• Logger: Flume, events
• Storm & Esper: realtime streaming processing, metrics / 1m
• Anodot: time series anomaly detection, alerts & graphs
• Outputs: anomaly alert, graphs

Precise information

Alert the right person

Automation

Alert management

• What are the active alerts?

• What is the root cause?

• Is it correlated to a change?
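The last question can be approximated with a simple time-window lookup: which changes happened shortly before the alert fired? The record fields, timestamps (plain seconds) and the 10-minute window below are assumptions for the sketch, not a description of any specific tool.

```python
def changes_before_alert(alert_time, changes, window_seconds=600):
    """Return changes made within window_seconds before the alert,
    most recent first. Changes after the alert are ignored."""
    candidates = [
        c for c in changes
        if 0 <= alert_time - c["time"] <= window_seconds
    ]
    return sorted(candidates, key=lambda c: c["time"], reverse=True)

# Hypothetical change log: deployments, config uploads, feature toggles.
changes = [
    {"time": 1000, "kind": "deployment", "what": "my-app"},
    {"time": 1700, "kind": "chef upload", "what": "nginx cookbook"},
    {"time": 1850, "kind": "feature toggle", "what": "new-editor"},
    {"time": 2100, "kind": "deployment", "what": "other-app"},
]
suspects = changes_before_alert(alert_time=2000, changes=changes)
```

Proximity in time is only a hint at the root cause, but listing the most recent change first gives the person on call a concrete place to start.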

[Diagram: central alert management]
• Alerts: New Relic, Sensu, Nagios, Pingdom
• Changes: deployments, Chef uploads, A/B, F-Toggle, Exp.
• BigPanda: central alerts & changes
• Outputs: alert, web UI

Precise information


Automation

Questions?
