SCALABLE MONITORING & ALERTING NEW YORK CITY

DESCRIPTION

We'd like to share our experience building a scalable solution using Graphite, Grafana, Collectd, Nagios, Logstash, Elasticsearch and Kibana, among others. It is easy to collect metrics from any one system and to define alerts on single metrics, and we already have this capability in place. In complex systems, however, the real operational challenges arise from the way components interact. Some of these components live inside our data center, some live outside it, and all are updated on different timelines. The functionality and performance of every component can change every day.

Our challenge is to identify patterns and correlations across multiple systems in our stack. We need to integrate top-down and bottom-up analysis so we can see, for example, that trial subscription signups (a user metric) fell off at the same time that an internal API call began to fail (an application metric), and that both were caused by a database host going offline (a system metric).

When collecting so much data, there is a risk of being overwhelmed and unable to make sense of it all. In essence, this is the risk of collecting data without producing intelligence. We combat this risk by converting our accumulated data into the most visually information-dense format available: graphs. Then we make graphs easy to compare and easy to share, informative at a glance, and easy for the team to keep watching.

Finally, once we are regularly identifying patterns across our graphs, we should have an automated way to "watch the graphs" in our absence. This is not an AI or a pattern-recognition "black box"; it should simply automate patterns that humans have first validated to be meaningful.

Page 1: Scalable Monitoring & Alerting

SCALABLE MONITORING & ALERTING

NEW YORK CITY

Page 2: Scalable Monitoring & Alerting

EXPERIENCE

Page 3: Scalable Monitoring & Alerting

THE BEGINNING

Page 4: Scalable Monitoring & Alerting

DEPLOYMENTS

Page 5: Scalable Monitoring & Alerting

DDOS ATTACKS

Page 6: Scalable Monitoring & Alerting

THE VISION

● Easy publishing of metrics

● Single source of truth

● Democratized access
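
To illustrate what "easy publishing of metrics" can look like with Graphite, here is a minimal sketch that formats and sends one datapoint using Carbon's plaintext protocol (`<path> <value> <timestamp>` followed by a newline, on port 2003 by default). The helper names, the `localhost` host, and the metric path in the usage note are hypothetical and do not come from the talk.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one line of Carbon's plaintext protocol: '<path> <value> <timestamp>\n'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(line: str, host: str = "localhost", port: int = 2003) -> None:
    """Send a formatted line to a Carbon listener (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

A caller might publish a counter with `send_metric(format_metric("signups.trial.count", 42))`, where the metric path is purely illustrative.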

Page 7: Scalable Monitoring & Alerting

CARBON-AGGREGATOR

CARBON-CACHE

WHISPER

FILESYSTEM (rotating disks)

SEPTEMBER 2013

GRAPHITE WEBAPP

Page 8: Scalable Monitoring & Alerting

JANUARY 2014

LOAD BALANCER

Three identical stacks, each:

16 CARBON-RELAY

10 CARBON-CACHE

WHISPER

FILESYSTEM (rotating)

WEBAPP

Page 9: Scalable Monitoring & Alerting

CULTURE

Page 10: Scalable Monitoring & Alerting

CULTURE

Page 11: Scalable Monitoring & Alerting

SUPER BOWL 2014

Page 12: Scalable Monitoring & Alerting

Page 13: Scalable Monitoring & Alerting

THE PROBLEMS

● Reads didn't scale

● No usage insight

● Dumb dashboards

Page 14: Scalable Monitoring & Alerting

8 CARBON-RELAY

LOAD BALANCER

APRIL 2014

8 CARBON-RELAY 8 CARBON-RELAY

10 CARBON-CACHE

WHISPER

FILESYSTEM (SSDs)

WEBAPP

Page 15: Scalable Monitoring & Alerting

8 CARBON-RELAY

LOAD BALANCER

MAY 2014

8 CARBON-RELAY 8 CARBON-RELAY

10 CARBON-CACHE

WHISPER

FILESYSTEM (SSDs)

GRAPHITE API

Page 16: Scalable Monitoring & Alerting

8 CARBON-RELAY

LOAD BALANCER

JUNE 2014

8 CARBON-RELAY

8 CARBON-RELAY

10 CARBON-CACHE

WHISPER

FILESYSTEM (SSDs)

GRAPHITE API

10 CARBON-AGGREGATOR 10 CARBON-AGGREGATOR

10 CARBON-AGGREGATOR

Page 17: Scalable Monitoring & Alerting

USAGE INFORMATION

GRAPHITE API

Page 18: Scalable Monitoring & Alerting

Page 19: Scalable Monitoring & Alerting

GRAFANA

Page 20: Scalable Monitoring & Alerting

ALERTING

Page 21: Scalable Monitoring & Alerting

Core Infrastructure

THRESHOLDS

CRITICAL THRESHOLD

WARNING THRESHOLD
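
The two thresholds on this slide can be sketched as a small classifier. This is only an illustration: the function name and the assumption that higher values are worse are hypothetical, while the numeric status codes follow the standard Nagios plugin convention (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin return codes


def classify(value: float, warning: float, critical: float) -> int:
    """Map a metric value to a Nagios-style status, assuming higher is worse."""
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK
```

Checking the critical threshold first matters: a value above both thresholds should be reported as CRITICAL, not WARNING.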

Page 22: Scalable Monitoring & Alerting

ALERTING MATH

Page 23: Scalable Monitoring & Alerting

ALERTING MATH

Skip, Read: 30, 5

Alarm: 9

Page 24: Scalable Monitoring & Alerting

ALERTING MATH

Skip, Read: 30, 5

Frequency: 10 sec

30 × 10 sec = 300 secs; 5 × 10 sec = 50 secs

Alarm: 9

9 × 10 sec = 90 secs
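
The arithmetic on this slide multiplies a datapoint count by the 10-second collection frequency. The sketch below reproduces that conversion and adds a hypothetical `should_alarm` function. The slide does not make clear which of 30 and 5 belongs to Skip and which to Read, so the `read`, `skip`, and `alarm` parameters reflect one plausible reading (examine a window of 30 datapoints, ignore the newest 5, alarm when at least 9 of the rest breach the threshold), not the authors' confirmed logic.

```python
from typing import Optional, Sequence


def to_seconds(datapoints: int, frequency_secs: int = 10) -> int:
    """Convert a datapoint count into a duration at the given collection frequency."""
    return datapoints * frequency_secs


def should_alarm(
    series: Sequence[Optional[float]],
    threshold: float,
    read: int = 30,
    skip: int = 5,
    alarm: int = 9,
) -> bool:
    """Assumed reading of the slide: take the last `read` datapoints, drop the
    newest `skip` (they may still be incomplete), and alarm when at least
    `alarm` of the remaining datapoints exceed `threshold`."""
    window = list(series[-read:])
    considered = window[:-skip] if skip else window
    # Missing datapoints (None) are ignored rather than counted as breaches.
    breaches = sum(1 for v in considered if v is not None and v > threshold)
    return breaches >= alarm
```

With the slide's numbers, `to_seconds(30)`, `to_seconds(5)`, and `to_seconds(9)` give 300, 50, and 90 seconds respectively.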

Page 25: Scalable Monitoring & Alerting

NAGIOS INTEGRATION

Page 26: Scalable Monitoring & Alerting

NEXT STEPS

● Replication

● Scale even more

● More robust alerting logic

Page 27: Scalable Monitoring & Alerting

8 CARBON-RELAY

LOAD BALANCER

REPLICATION

8 CARBON-RELAY

8 CARBON-RELAY

10 CARBON-CACHE

WHISPER

FILESYSTEM (SSDs)

GRAPHITE API

10 CARBON-AGGREGATOR 10 CARBON-AGGREGATOR

10 CARBON-AGGREGATOR

Page 28: Scalable Monitoring & Alerting

ALERTING LOGIC

= Google Forms Throttling

Page 29: Scalable Monitoring & Alerting

ALERTING LOGIC

Trial subscription signups

Internal subscription API calls

Database host offline

User

Application

System
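
The three layers on this slide (user, application, system) suggest alerting logic that correlates alerts across layers instead of treating each one in isolation. As a rough sketch under stated assumptions, the hypothetical function below applies one simple rule: when several layers are alerting at once, report the lowest failing layer as the probable cause. The rule and all names are illustrative and are not taken from the talk.

```python
from typing import Dict, Optional

# Top-down layer order, following the slide.
LAYERS = ["user", "application", "system"]


def probable_cause(alerts: Dict[str, bool]) -> Optional[str]:
    """Return the lowest alerting layer, or None if no layer is alerting."""
    failing = [layer for layer in LAYERS if alerts.get(layer, False)]
    return failing[-1] if failing else None
```

For the example in the description, where signups drop, the subscription API fails, and a database host is offline, all three layers alert and the function returns `"system"`.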

Page 30: Scalable Monitoring & Alerting

LEARN MORE

Page 31: Scalable Monitoring & Alerting

QUESTIONS
