statscraft
Introduction to Monitoring
Monitoring is both the process and the set of tools for finding problems before your users do, minimizing the monetary impact of failures, and enabling fast recovery.
Efficient monitoring aims to notify the right person at the right time (and only at the right time) with the most precise information.
What monitoring is
• Measure
• Aggregate & Visualize
• Alert
(Diagram: Webapp and DB)
What to Measure?
End user experience / performance
End User Monitoring
• Validates that our application is running, from “outside”
• Measures “real user” performance
• Geo-distributed, including real latency
• Many tools offer such solutions
– Measure, visualize, alerts
End User Monitoring
• When is a page fully loaded?
• Take care: some tools are biased
End User Monitoring
• Measure yourself
• Using
– Resource Timing API
– User Timing API
– Custom JS
• Send metrics from browsers to your own sync server
– All users / samples
End User Monitoring
What to measure
• Measure page load time (as you define it)
• Measure loading errors
• Measure number of page views
• Group by geo & application
• Group by browser
End User Monitoring
Alert on
• Sudden drop in traffic from a certain geo
• Sudden increase in traffic
• Increase in loading times
• Increase in errors
– From a specific browser
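The traffic alerts above can be sketched as a comparison of current page views per geo against a baseline. This is only an illustration: the function name, the ratio thresholds, and the sample numbers are assumptions, not taken from the slides.

```python
from typing import Dict, List


def traffic_alerts(baseline: Dict[str, float], current: Dict[str, int],
                   drop_ratio: float = 0.5, spike_ratio: float = 2.0) -> List[str]:
    """Compare current page views per geo with a baseline and return alert messages.

    drop_ratio and spike_ratio are hypothetical thresholds chosen for the example.
    """
    alerts = []
    for geo, expected in sorted(baseline.items()):
        if expected <= 0:
            continue  # no meaningful baseline for this geo
        observed = current.get(geo, 0)
        ratio = observed / expected
        if ratio <= drop_ratio:
            alerts.append(f"traffic drop in {geo}: {observed} page views vs about {expected:.0f} expected")
        elif ratio >= spike_ratio:
            alerts.append(f"traffic spike in {geo}: {observed} page views vs about {expected:.0f} expected")
    return alerts


# Hypothetical per-minute page views, grouped by geo
baseline = {"US": 1000.0, "DE": 400.0, "BR": 250.0}
current = {"US": 980, "DE": 120, "BR": 700}
alerts = traffic_alerts(baseline, current)
```

With these sample numbers, DE is reported as a drop and BR as a spike, while US stays silent. A real baseline would normally account for time of day and day of week.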
What to Measure?
Is Alive?
Is Alive
• Measure process liveness
– Is the process running?
• Measure process responsiveness
– Does the process respond to a request?
• Alert on instance down
– And auto-restart it
• Alert on all instances down
Is Alive
• A variety of great tools
• Tools that perform “ping” tests
• Tools that call a designated URL for responsiveness tests
• Is alive != Availability
– Is alive is per host
– Availability is about the system as a whole
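A responsiveness test that calls a designated URL could look like the sketch below. It uses only the Python standard library; the `opener` parameter is an assumption added so the check can be exercised without a network, and any health URL you pass in is your own.

```python
import urllib.request
from typing import Callable


def is_alive(url: str, timeout: float = 2.0,
             opener: Callable = urllib.request.urlopen) -> bool:
    """Return True if the instance answers the given URL with an HTTP 2xx in time.

    Connection errors, timeouts and HTTP error statuses all count as "not alive".
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError
        return False
```

This answers the per-host question only. Whether the system as a whole is available depends on how many instances pass the check, which is a separate decision.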
What to Measure?
Request performance
Request Monitoring
• Measure how your application performs
– Regardless of networking to the user
– Regardless of latency
• Measuring on the server, per server
• Many tools provide such solutions
– Measure, visualize, alerts
Request Monitoring
• But many tools miss the branching point
– Branching point: the point in your code at which your code decides what branch of execution to perform for a request
• Issues with aggregation, what is monitored, alert flexibility
• But still, there are some great tools
Request Monitoring
What to measure
• Measure request rate
• Measure performance histogram
• Measure error rate, by error type and HTTP response code
• Group by request type (as you define it)
• Group by host, application, data center
• Group by error type (as you define it)
Do not use Average
• Don’t use the average for performance
• Instead, use the median, 95th percentile and 99th percentile
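A small worked example shows why the average misleads. The sketch below uses a nearest-rank percentile (one of several common percentile definitions) on made-up response times; the sample data and the `percentile` helper are assumptions for illustration.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 100 hypothetical response times in ms: 94 fast requests and 6 very slow ones
latencies_ms = [20.0] * 94 + [2000.0] * 6

summary = {
    "mean": mean(latencies_ms),           # 138.8 ms, a value no request actually had
    "p50": percentile(latencies_ms, 50),  # 20 ms, the typical request
    "p95": percentile(latencies_ms, 95),  # 2000 ms, the slow tail becomes visible
    "p99": percentile(latencies_ms, 99),  # 2000 ms
}
```

The mean lands between the two groups and describes neither, while the median together with the 95th and 99th percentiles describes both the typical request and the tail.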
Request Monitoring
What to Visualize
• Request rate (RPM)
• Request performance
– Median, 95th percentile and 99th percentile on a moving window
Request Monitoring
What to Visualize
• Errors
– Rate, percent (compared to request rate)
– Top X errors by percent
– Separate system and application errors
– You will always have application errors
– You should have exactly 0 system errors
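The error views above reduce to a few counters. The following sketch computes the error percent relative to the request rate, the top X errors by percent, and separate system and application counts. The `(kind, error_type)` tuple layout and the error names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def error_summary(total_requests: int, errors: Iterable[Tuple[str, str]], top: int = 3) -> Dict:
    """Summarize errors for one time window.

    Each error is a (kind, error_type) pair, where kind is "system" or "application".
    """
    by_type: Counter = Counter()
    by_kind: Counter = Counter()
    for kind, error_type in errors:
        by_type[error_type] += 1
        by_kind[kind] += 1

    def pct(count: int) -> float:
        return 100.0 * count / total_requests if total_requests else 0.0

    return {
        "error_percent": pct(sum(by_type.values())),
        "top_errors": [(name, pct(count)) for name, count in by_type.most_common(top)],
        "system_errors": by_kind["system"],
        "application_errors": by_kind["application"],
    }


# Hypothetical window: 1000 requests, 50 of which failed
window_errors = (
    [("application", "ValidationError")] * 30
    + [("application", "NotFound")] * 15
    + [("system", "DbTimeout")] * 5
)
report = error_summary(1000, window_errors)
```

In this sample the window has a 5% error rate, and the non-zero `system_errors` count is the value that should trigger an alert.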
Request Monitoring
Alert on
• Big changes in traffic
• Increase in response times
• Increase in errors
• System errors
What to Measure?
Resource Utilization
Resources
• System resources
– CPU, memory, IO, storage, network
• Resource pools
– Database connection pools
– HTTP connection pools
– Thread pools
– Other resource pools
Resource Monitoring
What to measure
• Measure resource utilization
– Percent of resource used
• Measure resource acquisition queue
– Time to acquire
– Acquire timeouts
– Usage timeouts
Resource Monitoring
What to measure
• Group by resource type and pool
• Group by host, application, data center
• Group by error type (as you define it)
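For a resource pool, utilization, time to acquire, and acquire timeouts can be recorded at the point where callers take a resource. The sketch below wraps a standard-library queue; the `MeasuredPool` class and the connection names are assumptions, not a real pool library.

```python
import queue
import time
from typing import Any, List


class MeasuredPool:
    """A fixed-size resource pool that records acquisition metrics."""

    def __init__(self, resources: List[Any]):
        self.capacity = len(resources)
        self._free: queue.Queue = queue.Queue()
        for resource in resources:
            self._free.put(resource)
        self.acquire_seconds: List[float] = []  # time spent waiting, per attempt
        self.acquire_timeouts = 0

    def acquire(self, timeout: float) -> Any:
        started = time.monotonic()
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            self.acquire_timeouts += 1  # nothing became free in time
            raise
        finally:
            self.acquire_seconds.append(time.monotonic() - started)

    def release(self, resource: Any) -> None:
        self._free.put(resource)

    def utilization_pct(self) -> float:
        """Percent of the pool currently in use."""
        return 100.0 * (self.capacity - self._free.qsize()) / self.capacity


pool = MeasuredPool(["conn-1", "conn-2"])  # hypothetical connections
first = pool.acquire(timeout=0.05)
second = pool.acquire(timeout=0.05)
full_utilization = pool.utilization_pct()  # both connections are in use
try:
    pool.acquire(timeout=0.05)  # pool is exhausted, so this attempt times out
except queue.Empty:
    pass
pool.release(first)
```

The recorded values would then be reported grouped by pool, host, and application, as the slide suggests.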
Alert on
• Resource over-utilization
– Average usage over XX% in a time window
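The over-utilization rule can be sketched as an average over a sliding time window. The class name, the 80% threshold, and the 60-second window below are assumptions standing in for the "XX%" and the window left open in the slide.

```python
from collections import deque
from typing import Deque, Tuple


class UtilizationAlert:
    """Fires when average utilization over a time window exceeds a threshold."""

    def __init__(self, threshold_pct: float, window_seconds: float):
        self.threshold_pct = threshold_pct
        self.window_seconds = window_seconds
        self._samples: Deque[Tuple[float, float]] = deque()  # (timestamp, percent used)

    def record(self, timestamp: float, used: int, capacity: int) -> bool:
        """Add a sample and return True if the window average is over the threshold."""
        self._samples.append((timestamp, 100.0 * used / capacity))
        # Drop samples that have fallen out of the time window
        while self._samples[0][0] <= timestamp - self.window_seconds:
            self._samples.popleft()
        average = sum(pct for _, pct in self._samples) / len(self._samples)
        return average > self.threshold_pct


# Hypothetical connection pool of 10, sampled every 20 seconds
alert = UtilizationAlert(threshold_pct=80.0, window_seconds=60.0)
states = [
    alert.record(0.0, used=5, capacity=10),    # window average 50%
    alert.record(20.0, used=9, capacity=10),   # window average 70%
    alert.record(40.0, used=10, capacity=10),  # window average 80%, not over the threshold
    alert.record(60.0, used=10, capacity=10),  # oldest sample expires, average rises above 80%
]
```

Averaging over a window keeps a single short burst from paging anyone, at the cost of reacting a little later to sustained pressure.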
What to Measure?
Database Monitor
Database monitoring
Depends on the database, but still:
• Storage
• Replication “lag”
• Slow operations
• Resource usage
Monitoring at Wix
Precise information
Alert the right person
Automation
Service is alive
• Is my application alive on the minimum number of instances required by my SLA?
• 2 out of 5 instances of my-app are not responding to isAlive
• my-app requires a minimum of 3 instances to meet the SLA
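The example above can be reproduced with a short check that turns per-instance is-alive results into one precise message. The severity levels ("ok", "warning", "critical") are an assumption made for this sketch; the slides do not say how the situation is graded.

```python
from typing import Dict, Tuple


def sla_status(service: str, alive_by_instance: Dict[str, bool],
               min_instances: int) -> Tuple[str, str]:
    """Return (level, message) for a service given per-instance is-alive results.

    Levels are hypothetical: "critical" below the SLA minimum, "warning" when
    some instances are down but the minimum is still met, "ok" otherwise.
    """
    total = len(alive_by_instance)
    alive = sum(1 for ok in alive_by_instance.values() if ok)
    down = total - alive
    if alive < min_instances:
        level = "critical"
    elif down > 0:
        level = "warning"
    else:
        level = "ok"
    message = (
        f"{down} out of {total} instances of {service} are not responding to isAlive; "
        f"{service} requires a minimum of {min_instances} instances to meet the SLA"
    )
    return level, message


# The situation from the slide: 5 instances, 2 not responding, minimum of 3
results = {"host-1": True, "host-2": True, "host-3": True, "host-4": False, "host-5": False}
level, message = sla_status("my-app", results, min_instances=3)
```

With these inputs the service still meets its SLA but has no spare instance left, which is why the sketch reports it as a warning rather than staying silent.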
Diagram: Alert
• Sensu: queries Nginx, alert & SLA
• ZooKeeper: planned configuration
• Nginx: service load balancer, is-alive
• Alert: service owner
Service anomalies
• Backend anomalies
• Identify unhealthy KPIs per endpoint
• Abnormal increase in error rate for class.method.get
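To make "abnormal increase" concrete, here is a deliberately simple z-score check against recent history. It is not the algorithm of any particular anomaly detection product, and the threshold and sample error rates are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_abnormal_increase(history: Sequence[float], latest: float,
                         z_threshold: float = 3.0, min_std: float = 1e-9) -> bool:
    """Return True if latest is more than z_threshold standard deviations above the mean.

    Only increases are flagged; min_std avoids division by zero on flat history.
    """
    baseline = mean(history)
    spread = max(pstdev(history), min_std)
    return (latest - baseline) / spread > z_threshold


# Hypothetical per-minute error rate of one endpoint over the last 8 minutes
error_rate_history = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.012, 0.010]
normal_minute = is_abnormal_increase(error_rate_history, 0.012)  # within usual variation
bad_minute = is_abnormal_increase(error_rate_history, 0.050)     # far above the baseline
```

A production detector would also need to handle seasonality and trends, which a fixed window of recent samples does not capture.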
Diagram: Anomaly Alert
• JVM servers: metrics library, metrics / 1m
• statsd: stats aggregation, forwarding metrics
• Anodot: time series anomaly detection, alerts & graphs
• Graphs
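The first hop of this pipeline, sending metrics to statsd, uses a plain-text line format of `name:value|type` over UDP. The slides use a JVM metrics library; the Python sketch below only illustrates the wire format, and the metric names, host, and port arguments are assumptions.

```python
import socket

# statsd metric type codes: counter, timer (milliseconds), gauge
COUNTER, TIMER_MS, GAUGE = "c", "ms", "g"


def format_metric(name: str, value: float, metric_type: str) -> bytes:
    """Build one statsd line, e.g. b"my-app.requests:1|c"."""
    if metric_type not in (COUNTER, TIMER_MS, GAUGE):
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value:g}|{metric_type}".encode("ascii")


def send_metric(name: str, value: float, metric_type: str,
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric over UDP, fire-and-forget (8125 is the usual statsd port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_metric(name, value, metric_type), (host, port))


# Hypothetical metrics for one endpoint
request_line = format_metric("my-app.class.method.get.requests", 1, COUNTER)
latency_line = format_metric("my-app.class.method.get.latency", 42.5, TIMER_MS)
```

Because UDP is fire-and-forget, a missing or slow statsd daemon does not block the application, at the price of possibly losing samples.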
Service anomalies
• Frontend anomalies
• Browser (client) generated KPIs
• User experience: are users affected or not? How and where?
Diagram: Anomaly Alert
• Client: JS in browser, events
• Logger: flume, events
• Storm & Esper: realtime streaming processing, metrics / 1m
• Anodot: time series anomaly detection, alerts & graphs
• Graphs
Alert management
• What are the active alerts?
• What is the root cause?
• Is it correlated to a change?
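The last question, whether an alert is correlated to a change, can be approximated by listing recent changes to the same service. The data classes, field names, and the 30-minute window below are assumptions for illustration and do not reflect any tool's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    timestamp: float  # seconds since some epoch
    kind: str         # e.g. "deployment", "chef upload", "feature toggle"
    target: str       # service the change applies to


@dataclass
class Alert:
    timestamp: float
    source: str       # monitoring tool that raised the alert
    service: str


def changes_before(alert: Alert, changes: List[Change],
                   window_seconds: float = 1800.0) -> List[Change]:
    """Changes to the alerting service within the window before the alert, newest first."""
    candidates = [
        change for change in changes
        if change.target == alert.service
        and 0 <= alert.timestamp - change.timestamp <= window_seconds
    ]
    return sorted(candidates, key=lambda change: change.timestamp, reverse=True)


# Hypothetical timeline
changes = [
    Change(1000.0, "deployment", "my-app"),      # too old for the window
    Change(1500.0, "chef upload", "my-app"),
    Change(2500.0, "feature toggle", "my-app"),
    Change(2600.0, "deployment", "other-app"),   # different service
    Change(3100.0, "chef upload", "my-app"),     # happened after the alert
]
alert = Alert(3000.0, "Sensu", "my-app")
suspects = changes_before(alert, changes)
```

Here the feature toggle and the Chef upload are returned as suspects, newest first. Proximity in time only suggests a root cause and does not prove one.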
Diagram: Alert
• BigPanda: central alerts & changes, web UI
• Changes: deployments, Chef uploads, A/B, F-Toggle, Exp.
• Alerts: NewRelic, Sensu, Nagios, PingDom
Questions?