franklin-angulo
DESCRIPTION
We'd like to share our experience building a scalable solution using Graphite, Grafana, Collectd, Nagios, Logstash, Elasticsearch and Kibana, among others. We believe that it is easy to collect metrics from any one system, and to define alerts on single metrics. We already have this capability in place. However, in complex systems, the real operational challenges arise from the way system components interact. Some of these components live inside our data center, some live outside it, and all are updated on differing timelines. The functionality and performance of every component can change from one day to the next.
Our challenge is to identify patterns and correlations across multiple systems in our stack. We need to integrate top-down and bottom-up analysis, so we can see, for example, that trial subscription signups (a user metric) fell off at the same time that an internal API call began to fail (an application metric), and that both were caused by a database host falling offline (a system metric).
When collecting so much data, there is a risk of being overwhelmed and unable to make sense of it all: in essence, a risk of collecting data without producing intelligence. We combat this risk by converting our accumulated data into the most visually information-dense format available: graphs. Then we make graphs easy to compare and easy to share. We make them informative at a glance and easy for the team to keep watching.
Finally, once we are regularly identifying patterns across our graphs, we should have an automated way to "watch the graphs" in our absence. This is not an AI or a pattern-recognition "black box"; it should simply automate patterns that humans have first validated as meaningful.
SCALABLE MONITORING & ALERTING
NEW YORK CITY
EXPERIENCE
THE BEGINNING
DEPLOYMENTS
DDOS ATTACKS
THE VISION
● Easy publishing of metrics
● Single source of truth
● Democratized access
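To illustrate what "easy publishing of metrics" can look like in practice, here is a minimal sketch that sends one datapoint to Carbon using its plaintext protocol (one `path value timestamp` line per datapoint, by default on TCP port 2003). The host name `graphite.example.com` and the metric path are hypothetical placeholders, not values from the talk.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one Carbon plaintext line: path, value, Unix timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def publish(path, value, host="graphite.example.com", port=2003, timestamp=None):
    """Send a single datapoint to a Carbon listener over TCP.

    The default host is a hypothetical placeholder; 2003 is Carbon's
    default plaintext port.
    """
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Show the wire format without opening a network connection.
    print(format_metric("signups.trial.count", 42, timestamp=1400000000), end="")
```

Because the protocol is a single text line, any application that can open a socket can publish a metric, which is one plausible reading of why publishing can be kept this simple.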
SEPTEMBER 2013
CARBON-AGGREGATOR
CARBON-CACHE
WHISPER
FILESYSTEM (rotating disks)
GRAPHITE WEBAPP
JANUARY 2014
LOAD BALANCER
3 × (16 CARBON-RELAY, 10 CARBON-CACHE, WHISPER, FILESYSTEM (rotating), WEBAPP)
CULTURE
SUPER BOWL 2014
THE PROBLEMS
● Reads didn't scale
● No usage insight
● Dumb dashboards
APRIL 2014
LOAD BALANCER
3 × 8 CARBON-RELAY
10 CARBON-CACHE
WHISPER
FILESYSTEM (SSDs)
WEBAPP
MAY 2014
LOAD BALANCER
3 × 8 CARBON-RELAY
10 CARBON-CACHE
WHISPER
FILESYSTEM (SSDs)
GRAPHITE API
JUNE 2014
LOAD BALANCER
3 × 8 CARBON-RELAY
3 × 10 CARBON-AGGREGATOR
10 CARBON-CACHE
WHISPER
FILESYSTEM (SSDs)
GRAPHITE API
USAGE INFORMATION
GRAPHITE API + [image] + [image]
GRAFANA
ALERTING
Core Infrastructure
THRESHOLDS
CRITICAL THRESHOLD
WARNING THRESHOLD
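As an illustration of how a warning and a critical threshold partition a metric's range, the sketch below classifies a single value. It assumes a metric where higher values are worse; the function name `classify` and the sample numbers are hypothetical, not taken from the talk.

```python
def classify(value, warning, critical):
    """Return 'OK', 'WARNING' or 'CRITICAL' for a metric where higher is worse.

    Assumption: the warning threshold sits at or below the critical one.
    """
    if warning > critical:
        raise ValueError("warning threshold must not exceed critical threshold")
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for sample in (50, 75, 95):
        print(sample, classify(sample, warning=70, critical=90))
```

For metrics where lower is worse (free disk space, for example), the comparisons would be reversed.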
ALERTING MATH
Frequency: 10 sec
Read: 30 (300 secs)
Skip: 5 (50 secs)
Alarm: 9 (90 secs)
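The Read, Skip, Alarm and Frequency figures above multiply out as count × frequency: 30 × 10 = 300 secs, 5 × 10 = 50 secs, 9 × 10 = 90 secs. The sketch below reproduces that arithmetic and adds one possible evaluation rule. The rule is an assumption, since the slides do not spell out the semantics: read the last 30 datapoints, ignore the newest 5 (which may still be incomplete), and alarm when at least 9 of the remainder exceed a threshold.

```python
FREQUENCY_SECS = 10  # one datapoint every 10 seconds (from the slide)
READ = 30            # datapoints read per evaluation (from the slide)
SKIP = 5             # newest datapoints skipped (from the slide)
ALARM = 9            # breaching datapoints needed to alarm (from the slide)


def window_secs(points, frequency=FREQUENCY_SECS):
    """Convert a datapoint count into the time span it covers."""
    return points * frequency


def should_alarm(datapoints, threshold, read=READ, skip=SKIP, alarm=ALARM):
    """ASSUMED semantics, not confirmed by the source.

    Take the last `read` datapoints, drop the newest `skip`, and alarm
    when at least `alarm` of the remaining values exceed `threshold`.
    None values (missing data) are ignored.
    """
    window = datapoints[-read:]
    if skip:
        window = window[:-skip]
    breaches = sum(1 for v in window if v is not None and v > threshold)
    return breaches >= alarm


if __name__ == "__main__":
    print(window_secs(READ), window_secs(SKIP), window_secs(ALARM))
```

Under this reading, a breach has to persist for roughly 90 seconds inside the evaluated window before anyone is paged, which filters out single-datapoint spikes.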
NAGIOS INTEGRATION
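One common way to connect Graphite data to Nagios is a check plugin that reads a series and reports through the standard plugin conventions: exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN, plus a one-line status message with optional performance data after a `|`. The sketch below is an assumption about how such a check could look, not the integration shown in the talk. It operates on data shaped like the Graphite render API's `format=json` output; the helper names and the sample series are hypothetical.

```python
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def latest_value(render_json):
    """Return the newest non-null value from render-style JSON, else None.

    Expected shape: [{"target": str, "datapoints": [[value, timestamp], ...]}]
    The first series that contains any non-null value is used.
    """
    for series in render_json:
        for value, _timestamp in reversed(series["datapoints"]):
            if value is not None:
                return value
    return None


def check(render_json, warning, critical, label="metric"):
    """Return (exit_code, message) following Nagios plugin conventions.

    Assumes a metric where higher values are worse.
    """
    value = latest_value(render_json)
    if value is None:
        status = "UNKNOWN"
        return EXIT_CODES[status], f"{status} - no datapoints for {label}"
    if value >= critical:
        status = "CRITICAL"
    elif value >= warning:
        status = "WARNING"
    else:
        status = "OK"
    perfdata = f"{label}={value};{warning};{critical}"
    return EXIT_CODES[status], f"{status} - {label} is {value} | {perfdata}"


if __name__ == "__main__":
    sample = [{"target": "servers.db01.load",
               "datapoints": [[0.8, 1400000000], [None, 1400000010]]}]
    print(check(sample, warning=4, critical=8, label="load"))
```

A real plugin script would print the message and pass the code to `sys.exit`, so that Nagios picks up both the state and the text.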
NEXT STEPS
● Replication
● Scale even more
● More robust alerting logic
REPLICATION
LOAD BALANCER
3 × 8 CARBON-RELAY
3 × 10 CARBON-AGGREGATOR
10 CARBON-CACHE
WHISPER
FILESYSTEM (SSDs)
GRAPHITE API
ALERTING LOGIC
= Google Forms Throttling
ALERTING LOGIC
User: Trial subscription signups
Application: Internal subscription API calls
System: Database host offline
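The user, application and system layers above can be tied together by a rule that reports when checks in more than one layer are breached at the same time. The following is only a sketch of that idea: the function name `correlate`, the input shape, and the "two or more layers" rule are all assumptions made for illustration, not the alerting logic described in the talk.

```python
LAYERS = ("user", "application", "system")


def correlate(breaches):
    """Summarize simultaneous breaches across layers.

    `breaches` maps a layer name to a list of currently breached checks.
    ASSUMED rule: report only when at least two layers are affected.
    Returns a summary string, or None when nothing correlates.
    """
    affected = [layer for layer in LAYERS if breaches.get(layer)]
    if len(affected) < 2:
        return None
    parts = [f"{layer}: {', '.join(breaches[layer])}" for layer in affected]
    return "Correlated breach across " + "; ".join(parts)


if __name__ == "__main__":
    print(correlate({
        "user": ["trial signups down"],
        "application": ["subscription API failing"],
        "system": ["database host offline"],
    }))
```

Keeping the rule explicit and small matches the stated goal of automating patterns that humans have already validated, rather than relying on an opaque detector.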
LEARN MORE
QUESTIONS