Introduction to Monitoring

StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis


Monitoring is both the process and the set of tools for finding problems before your users do, minimizing the monetary impact of failure, and enabling fast recovery.


Efficient monitoring aims to notify the right person at the right time (and only at the right time) with the most precise information.


What monitoring is

Measure, Aggregate & Visualize, Alert

[Diagram: Webapp, DB]

What to Measure?

End user experience / performance


End User Monitoring
• Validates that our application is running from “outside”
• Measures “real user” performance
• Geo-distributed, including real latency
• Many tools offer such solutions
– Measure, visualize, alert


End User Monitoring
• When is a page fully loaded?
• Take care: some tools are biased


End User Monitoring
• Measure yourself
• Using
– Resource Timing API
– User Timing API
– Custom JS
• Send metrics from browsers to your own sync server
– All users / samples


End User Monitoring
What to measure
• Measure page load time (as you define it)
• Measure loading errors
• Measure number of page views
• Group by geo & application
• Group by browser
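The grouping described on this slide could be sketched on the collecting server roughly as follows. This is only an illustration: the beacon fields (`geo`, `app`, `browser`, `load_ms`, `error`) and the function name are assumptions, not something the talk specifies.

```python
from collections import defaultdict

def aggregate_beacons(beacons):
    """Group browser beacons by (geo, app, browser) and summarize each group."""
    groups = defaultdict(lambda: {"page_views": 0, "errors": 0, "load_times_ms": []})
    for beacon in beacons:
        key = (beacon["geo"], beacon["app"], beacon["browser"])
        group = groups[key]
        group["page_views"] += 1
        if beacon.get("error"):
            group["errors"] += 1            # page failed to load
        elif beacon.get("load_ms") is not None:
            group["load_times_ms"].append(beacon["load_ms"])
    return dict(groups)

# Hypothetical beacons, as a browser might report them.
beacons = [
    {"geo": "US", "app": "editor", "browser": "chrome", "load_ms": 1200},
    {"geo": "US", "app": "editor", "browser": "chrome", "load_ms": 1800},
    {"geo": "US", "app": "editor", "browser": "chrome", "error": True},
    {"geo": "DE", "app": "viewer", "browser": "firefox", "load_ms": 900},
]
summary = aggregate_beacons(beacons)
```

Keeping the raw load times per group (rather than a running average) leaves room to compute percentiles later.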


End User Monitoring
Alert on
• Sudden drop in traffic from a certain geo
• Sudden increase in traffic
• Increase in loading times
• Increase in errors
– From a specific browser
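One simple way to express the traffic alerts above is to compare each geo's current page views against a baseline. The sketch below assumes fixed ratio thresholds (`drop_ratio`, `spike_ratio`), which are illustrative values and not taken from the talk.

```python
def traffic_alerts(baseline, current, drop_ratio=0.5, spike_ratio=2.0):
    """Compare per-geo traffic against a baseline and flag drops and spikes."""
    alerts = []
    for geo, expected in sorted(baseline.items()):
        if expected <= 0:
            continue  # no meaningful baseline for this geo
        observed = current.get(geo, 0)
        ratio = observed / expected
        if ratio <= drop_ratio:
            alerts.append((geo, "drop", observed, expected))
        elif ratio >= spike_ratio:
            alerts.append((geo, "spike", observed, expected))
    return alerts

# Hypothetical page views per geo for the same time slot.
baseline = {"US": 1000, "DE": 400, "BR": 250}
current = {"US": 980, "DE": 120, "BR": 600}
alerts = traffic_alerts(baseline, current)
```

Here `DE` is flagged as a drop and `BR` as a spike, while `US` stays quiet.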

What to Measure?

Is Alive?


Is Alive
• Measure a process's liveness
– Is the process running?
• Measure a process's responsiveness
– Does the process respond to a request?
• Alert on instance down
– And auto-restart it
• Alert on all instances down
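A responsiveness check of the kind listed above can be sketched with the Python standard library. The health URL, the timeout, and the `summarize` helper are assumptions made for illustration; the talk does not prescribe a specific implementation.

```python
import http.client
import urllib.error
import urllib.request

def is_responsive(url, timeout=2.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False  # refused, timed out, or answered with an error status

def summarize(instance_status):
    """instance_status maps instance name -> True (responding) or False."""
    down = sorted(name for name, ok in instance_status.items() if not ok)
    all_down = bool(instance_status) and len(down) == len(instance_status)
    return {"down": down, "all_down": all_down}

# Hypothetical probe results for two instances.
report = summarize({"web-1": True, "web-2": False})
```

A single entry in `down` maps to the "instance down" alert (and an auto-restart), while `all_down` maps to the more severe "all instances down" alert.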


Is Alive
• A variety of great tools
• Tools that perform “ping” tests
• Tools that call a designated URL for responsiveness tests
• Is alive != availability
– Is alive is per host
– Availability is about the system as a whole

What to Measure?

Request performance


Request Monitoring
• Measure how your application performs
– Regardless of networking to the user
– Regardless of latency
• Measuring on the server, per server
• Many tools provide such solutions
– Measure, visualize, alert


Request Monitoring
• But many tools miss the branching point
– Branching point: the point in your code at which it decides which branch of execution to perform for a request
• Issues with aggregation, what is monitored, alert flexibility
• But still, there are some great tools


Request Monitoring
What to measure
• Measure request rate
• Measure performance histogram
• Measure error rate, by error type, HTTP response code
• Group by request type (as you define it)
• Group by host, application, data center
• Group by error type (as you define it)
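The measurements above could be collected with a small per-request-type recorder, sketched below. The bucket boundaries, the request type strings, and the class name are assumptions chosen for illustration.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

# Upper bounds of the histogram buckets in milliseconds (assumed values).
BUCKETS_MS = [50, 100, 250, 500, 1000]

class RequestStats:
    def __init__(self):
        self.count = 0
        # One slot per bucket plus a final overflow slot for slower requests.
        self.histogram = [0] * (len(BUCKETS_MS) + 1)
        self.errors = Counter()

    def record(self, duration_ms, error_type=None):
        self.count += 1
        self.histogram[bisect_left(BUCKETS_MS, duration_ms)] += 1
        if error_type is not None:
            self.errors[error_type] += 1

stats_by_type = defaultdict(RequestStats)

def record_request(request_type, duration_ms, error_type=None):
    """Record one request under the request type the application defines."""
    stats_by_type[request_type].record(duration_ms, error_type)

# Hypothetical requests.
record_request("GET /sites", 40)
record_request("GET /sites", 120)
record_request("GET /sites", 3000, error_type="HTTP 500")
record_request("POST /save", 250)
```

Dividing `count` by the length of the reporting interval gives the request rate; the same structure can be keyed by host or data center instead of request type.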


Do not use average
• Don’t use the average for performance
• Instead, use the median, 95th percentile and 99th percentile
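A small worked example shows why the average misleads. The latencies below are made up: 98 fast requests and 2 very slow ones. The `percentile` helper uses the nearest-rank method, which is one common definition among several, and expects a non-empty list.

```python
from statistics import mean

def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) without floats
    return ordered[rank - 1]

# Hypothetical latencies: 98 requests at 100 ms, 2 requests at 10 seconds.
latencies_ms = [100] * 98 + [10_000] * 2

average = mean(latencies_ms)          # 298
median = percentile(latencies_ms, 50) # 100
p95 = percentile(latencies_ms, 95)    # 100
p99 = percentile(latencies_ms, 99)    # 10000
```

The average of 298 ms describes no real request: almost everyone saw 100 ms, and the 99th percentile exposes the 10 second outliers that the average blurs.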


Request Monitoring
What to Visualize
• Request rate (RPM)
• Request performance
– Median, 95th percentile and 99th percentile on a moving window
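A moving window over request timings might look like the sketch below. It assumes samples arrive in timestamp order and that timestamps are plain seconds; the class name and the 60 second window are illustrative choices.

```python
from collections import deque

class MovingWindow:
    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, duration_ms), oldest first

    def add(self, timestamp, duration_ms):
        self.samples.append((timestamp, duration_ms))

    def _evict(self, now):
        # Drop samples that fell out of the window ending at `now`.
        while self.samples and self.samples[0][0] <= now - self.window_seconds:
            self.samples.popleft()

    def requests_per_minute(self, now):
        self._evict(now)
        return len(self.samples) * 60 / self.window_seconds

    def percentile(self, p, now):
        """Nearest-rank percentile of the durations still in the window."""
        self._evict(now)
        ordered = sorted(duration for _, duration in self.samples)
        if not ordered:
            return None
        rank = (p * len(ordered) + 99) // 100
        return ordered[rank - 1]

# Hypothetical samples: (seconds, duration in ms).
window = MovingWindow(60)
for timestamp, duration in [(0, 500), (10, 100), (20, 120), (70, 110)]:
    window.add(timestamp, duration)
rpm = window.requests_per_minute(now=75)
p50 = window.percentile(50, now=75)
p99 = window.percentile(99, now=75)
```

At `now=75` only the samples from seconds 20 and 70 remain, so the slow request from second 0 no longer influences the graph.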


Request Monitoring
What to Visualize
• Errors
– Rate, percent (compared to request rate)
– Top X errors by percent
– Separate system and application errors
– You will always have application errors
– You should have exactly 0 system errors
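The error views above can be derived from a list of classified errors, as in this sketch. The `(kind, name)` representation and the error names are assumptions; how an application separates system from application errors is up to its own definition.

```python
from collections import Counter

def error_report(total_requests, errors, top_x=3):
    """errors is a list of (kind, name), kind being 'system' or 'application'."""
    def percent(count):
        return round(100 * count / total_requests, 2) if total_requests else 0.0

    by_name = Counter(name for _, name in errors)
    by_kind = Counter(kind for kind, _ in errors)
    return {
        "error_percent": percent(len(errors)),
        "top_errors": [(name, percent(n)) for name, n in by_name.most_common(top_x)],
        "system_errors": by_kind["system"],
        "application_errors": by_kind["application"],
    }

# Hypothetical errors observed during 200 requests.
errors = (
    [("application", "InvalidLogin")] * 6
    + [("application", "QuotaExceeded")] * 3
    + [("system", "DbConnectionTimeout")] * 1
)
report = error_report(200, errors)
```

Following the slide, any non-zero `system_errors` value is worth an alert on its own, while application errors are judged by their percentage.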


Request Monitoring
Alert on
• Big changes in traffic
• Increase in response times
• Increase in errors
• System errors

What to Measure?

Resource Utilization


Resources
• System resources
– CPU, memory, IO, storage, network
• Resource pools
– Database connection pools
– HTTP connection pools
– Thread pools
– Other resource pools


Resource Monitoring
What to measure
• Measure resource utilization
– Percent of resource used
• Measure resource acquisition queue
– Time to acquire
– Acquire timeouts
– Usage timeouts
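For resource pools, these measurements can be taken by wrapping the pool itself. The sketch below is a toy pool built on `queue.Queue`; the class and its attribute names are assumptions, and it covers utilization, time to acquire and acquire timeouts but not usage timeouts.

```python
import queue
import time

class InstrumentedPool:
    def __init__(self, resources):
        self.size = len(resources)
        self._free = queue.Queue()
        for resource in resources:
            self._free.put(resource)
        self.acquire_times = []     # seconds spent waiting per successful acquire
        self.acquire_timeouts = 0   # acquires that gave up waiting

    def acquire(self, timeout):
        start = time.monotonic()
        try:
            resource = self._free.get(timeout=timeout)
        except queue.Empty:
            self.acquire_timeouts += 1
            raise
        self.acquire_times.append(time.monotonic() - start)
        return resource

    def release(self, resource):
        self._free.put(resource)

    def utilization_percent(self):
        # qsize() is approximate under concurrency, which is fine for a metric.
        in_use = self.size - self._free.qsize()
        return 100 * in_use / self.size

# Hypothetical pool of two connections.
pool = InstrumentedPool(["conn-1", "conn-2"])
first = pool.acquire(timeout=0.1)
second = pool.acquire(timeout=0.1)
busy = pool.utilization_percent()   # both resources are in use
try:
    pool.acquire(timeout=0.05)      # nothing free, so this times out
except queue.Empty:
    pass
pool.release(first)
```

A growing time to acquire usually shows up before acquire timeouts do, which makes it the earlier warning signal of the two.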


Resource Monitoring
What to measure
• Group by resource type and pool
• Group by host, application, data center
• Group by error type (as you define it)

Alert on
• Resource over-utilization: average usage over XX% in a time window
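The over-utilization rule can be written as a window average compared with a threshold. The slide leaves the threshold as XX%, so the 80% used below is purely an example value, as are the sample data.

```python
def over_utilized(samples, now, window_seconds, threshold_percent):
    """samples: list of (timestamp, percent_used).

    Returns True when the average usage inside the window ending at `now`
    is above the threshold.
    """
    recent = [used for ts, used in samples if now - window_seconds < ts <= now]
    if not recent:
        return False
    return sum(recent) / len(recent) > threshold_percent

# Hypothetical utilization samples taken once a minute.
samples = [(0, 40), (60, 85), (120, 95), (180, 90)]
short_window = over_utilized(samples, now=180, window_seconds=180, threshold_percent=80)
long_window = over_utilized(samples, now=180, window_seconds=300, threshold_percent=80)
```

The three most recent samples average 90%, so the short window alerts; the longer window still includes the quiet 40% sample, averages 77.5%, and stays silent. The window length is therefore as much a tuning knob as the threshold.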

What to Measure?

Database Monitor


Database monitoring
Depends on the database, but still:
• Storage
• Replication “lag”
• Slow operations
• Resource usage


Monitoring at Wix


Precise information

Alert the right person

Automation


Service is alive
• Is my application alive on the minimum number of instances required by my SLA?
• 2 out of 5 instances of my-app are not responding to isAlive
• my-app requires a minimum of 3 instances to meet the SLA
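The my-app example above can be turned into a small check. This is a sketch of the rule only: the function name and the instance names are hypothetical, and it says nothing about how Wix actually implements the check.

```python
def service_alive_alert(service, is_alive_by_instance, min_instances):
    """Compare the number of responding instances with the SLA minimum."""
    total = len(is_alive_by_instance)
    down = sorted(name for name, ok in is_alive_by_instance.items() if not ok)
    alive_count = total - len(down)
    return {
        "service": service,
        "down": down,
        "sla_met": alive_count >= min_instances,
        "message": (
            f"{len(down)} out of {total} instances of {service} "
            f"are not responding to isAlive; {service} requires a minimum "
            f"of {min_instances} instances to meet the SLA"
        ),
    }

# The situation from the slide: 5 instances, 2 not responding, SLA minimum 3.
status = {"app-1": True, "app-2": False, "app-3": True, "app-4": False, "app-5": True}
alert = service_alive_alert("my-app", status, min_instances=3)

# One more failure drops the service below its SLA.
status["app-5"] = False
breach = service_alive_alert("my-app", status, min_instances=3)
```

With three instances still alive the SLA holds, so this is a warning with precise information; the fourth line of defense is gone once `breach["sla_met"]` turns false.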


Alert (diagram)
• Sensu: queries Nginx, alert & SLA
• ZooKeeper: planned configuration
• Nginx: service load balancer
• Is-alive
• Service owner


Alert the right person

Precise information

Automation


Service anomalies
• Backend anomalies

• Identify unhealthy KPIs per endpoint

• Abnormal increase in error rate for class.method.get
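To make "abnormal increase" concrete, the sketch below uses a plain z-score against recent history. This is a deliberately simple stand-in chosen for illustration; it is not the anomaly detection method used by the tools named in the talk, and the 3-sigma threshold is an assumption.

```python
from statistics import mean, pstdev

def is_abnormal_increase(history, current, min_sigma=3.0):
    """Flag `current` when it sits at least `min_sigma` deviations above history."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return current > baseline  # flat history: any increase stands out
    return (current - baseline) / spread >= min_sigma

# Hypothetical errors per minute for one endpoint, e.g. class.method.get.
history = [10, 12, 9, 11, 10, 8]
normal_minute = is_abnormal_increase(history, 11)
bad_minute = is_abnormal_increase(history, 50)
```

Tracking one such history per endpoint is what allows an alert to name the specific endpoint instead of reporting a service-wide error increase.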


Anomaly alert (diagram)
• JVM servers: metrics library, metrics / 1m
• statsd: stats aggregation, forwarding metrics
• Anodot: time series anomaly detection, alerts & graphs
• Graphs
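For readers unfamiliar with statsd, metrics reach it as short plain-text lines, usually over UDP on port 8125. The sketch below formats and sends such lines from Python; the servers in the talk are JVM-based and use a metrics library, so this only illustrates the wire format, and the metric names are hypothetical.

```python
import socket

def statsd_line(name, value, metric_type):
    """Format one metric as `name:value|type` (c = counter, ms = timer, g = gauge)."""
    if metric_type not in ("c", "ms", "g"):
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"

def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line to a statsd daemon over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))

# Hypothetical metric names for one endpoint.
counter = statsd_line("my-app.class.method.get.requests", 1, "c")
timer = statsd_line("my-app.class.method.get.duration", 320, "ms")
```

Because UDP is connectionless, `send_metric` does not block the request path or fail when the aggregator is down, which is a large part of why this style of metric forwarding is cheap to add.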


Precise information

Alert the right person

Automation


Service anomalies
• Frontend anomalies

• Browser (client) generated KPIs

• User experience: are users affected or not? How and where?


Anomaly alert (diagram)
• Client: JS in browser, events
• Logger: flume, events
• Storm & Esper: realtime streaming processing, metrics / 1m
• Anodot: time series anomaly detection, alerts & graphs
• Graphs


Precise information

Alert the right person

Automation


Alert management

• What are the active alerts?

• What is the root cause?

• Is it correlated to a change?
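The last question, whether an alert is correlated to a change, can be approximated by looking for changes to the same service shortly before the alert fired. The sketch below is an assumption-laden illustration: the record fields, the 15 minute lookback, and the matching rule are invented here and do not describe how any particular alert management product works.

```python
def correlate(alerts, changes, lookback_seconds=900):
    """For each alert, list changes to the same service in the preceding window."""
    result = {}
    for alert in alerts:
        result[alert["id"]] = [
            change["description"]
            for change in changes
            if change["service"] == alert["service"]
            and 0 <= alert["time"] - change["time"] <= lookback_seconds
        ]
    return result

# Hypothetical alerts and changes; times are seconds on a shared clock.
alerts = [
    {"id": "a1", "service": "my-app", "time": 1000},
    {"id": "a2", "service": "billing", "time": 1000},
]
changes = [
    {"service": "my-app", "time": 700, "description": "deployment"},
    {"service": "my-app", "time": 50, "description": "chef upload"},
    {"service": "search", "time": 900, "description": "feature toggle"},
]
suspects = correlate(alerts, changes)
```

Alert `a1` is matched with the recent deployment, while the older Chef upload and the change to another service are left out; `a2` has no candidate change, which points the investigation elsewhere.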


Alert (diagram)
• Changes: deployments, Chef uploads, A/B, F-Toggle, Exp.
• Alerts: NewRelic, Sensu, Nagios, PingDom
• BigPanda: central alerts & changes
• Web UI


Precise information

Alert the right person

Automation


Questions?