MONITORING @ SCALE
CLOUD SCALE
Sławomir Skowron, DevOps @ BaseCRM
Devops Kraków 2015
OUTLINE
• What is Graphite?
• Graphite architecture
• Additional components
• Current production setup
• Writes
• Reads
• BaseCRM graphite evolution
• Data migrations and recovery
• Multi region
• Dashboards management
• Future work
WHAT IS GRAPHITE?
A monitoring system focused on:
• Simple storage of time-series metrics
• Rendering graphs from time series on demand
• An API with functions
• Dashboards
• A huge number of tools and 3rd-party products built on Graphite
WHY ALL THIS?
LET’S LOOK AT AN EXAMPLE IN GRAFANA
GRAPHITE ARCHITECTURE
• Graphite-Web: a Django web application with a JS frontend
  • dashboards (a DB to save dashboards)
  • API with functions, server-side graph rendering
• Carbon: Twisted daemons
  • carbon-relay: hashes / routes metrics
  • carbon-aggregator: aggregates metrics
  • carbon-cache: “memory cache”, persists metrics to disk
• Whisper: a simple time-series DB
  • seconds-per-point resolution
Data points are sent as: metric name + value + Unix epoch timestamp, for example:
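A minimal Python sketch of pushing one data point over Carbon's plaintext protocol (default port 2003); the host and metric name are hypothetical examples:

import socket
import time

# Carbon's plaintext protocol: one "<metric path> <value> <unix timestamp>\n"
# line per data point. Host and metric name are hypothetical examples.
CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003

def send_metric(name, value, timestamp=None):
    timestamp = int(timestamp or time.time())
    line = "%s %s %d\n" % (name, value, timestamp)
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

send_metric("servers.web01.load.avg_1min", 0.42)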
ADDITIONAL COMPONENTS
• Diamond - https://github.com/BrightcoveOS/Diamond
  • Python daemon
  • over 120 collectors
  • simple collector development (see the sketch below)
  • used for OS and generic service monitoring
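As referenced above, a minimal sketch of a custom collector, based on Diamond's documented collector API (subclass Collector, implement collect(), report via self.publish); the metric and value are hypothetical:

import diamond.collector

class LoggedInUsersCollector(diamond.collector.Collector):
    """Hypothetical example collector."""
    def collect(self):
        # A real collector would read the value from the OS here
        # (e.g. parse the output of `who`); a constant keeps the
        # sketch short.
        users = 1
        self.publish("users.logged_in", users)

Dropped into Diamond's collector path, this gets scheduled and flushed to Graphite like any built-in collector.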
• Statsd - https://github.com/etsy/statsd/
  • Node.js daemon
  • aggregates counters, sets, gauges, and timers, and sends them to Graphite (see the sketch below)
  • many client libraries
  • used for in-app metrics
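As referenced above, statsd accepts metrics over UDP in a simple line format; a minimal sketch using only the Python standard library, with a hypothetical host and metric names:

import socket

# Statsd line protocol: "<metric>:<value>|<type>" where type is
# c (counter), ms (timer), g (gauge) or s (set).
STATSD_ADDR = ("statsd.example.com", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

sock.sendto(b"app.signups:1|c", STATSD_ADDR)          # counter
sock.sendto(b"app.request_time:320|ms", STATSD_ADDR)  # timer, in ms
sock.sendto(b"app.queue_depth:42|g", STATSD_ADDR)     # gauge

Statsd aggregates these over its flush interval and forwards the rollups to Graphite.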
CURRENT PRODUCTION SETUP
WRITES
ALL PROVISIONED BY ANSIBLE
Which components failed to work at scale?
• carbon-relay: switched to carbon-c-relay
• carbon-cache: switched to PyPy
REPLACEMENT
• Carbon-c-relay - https://github.com/grobian/carbon-c-relay
  • written in C
  • a replacement for the Python carbon-relay
  • high performance
  • multi-cluster support (traffic replication)
  • traffic load balancing
  • traffic hashing
  • aggregation and rewrites (see the example config below)
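As referenced above, a minimal carbon-c-relay configuration in the spirit of this setup (not our production config): a consistent-hash cluster with replication 2, with hypothetical store hostnames:

cluster stores
    carbon_ch replication 2
        store01.example.com:2003
        store02.example.com:2003
        store03.example.com:2003
    ;

match *
    send to stores
    stop
    ;

carbon_ch reproduces the original carbon-relay's consistent hashing, so whisper files land on the same hosts after the switch.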
IMPROVE
Carbon-cache: switched to PyPy (2.4, currently 2.5)
40-50% less CPU usage on carbon-cache
CURRENT PRODUCTION SETUP - WRITES
Write path:
• VMs report to an ELB
• round-robin to the top relays
• consistent hashing with replication factor 2
• to any carbon-cache on each store instance
• written to the local Whisper store volume
450+ instances as clients (diamond, statsd, other), reporting at 30-second intervals

carbon-c-relay as top relay:
• 3.5-4 mln metrics/min
• 20% CPU usage on each
• batch send (20k metrics)
• queue of 10 mln metrics

carbon-c-relay as local relay:
• 7-8 mln metrics/min
• batch send (20k metrics)
• queue of 5 mln metrics

carbon-cache with PyPy (50% less CPU):
• 7-8 mln metrics/min
• each point update takes 0.13-0.15 ms

250K-350K write IOPS; 5-6 mln Whisper DB files (2 copies)
Graphite consistent hashing: the cluster's maximum space and performance are bounded by the weakest host in the cluster
• Minimise the number of other processes and their CPU usage
• CPU offload:
  • carbon-c-relay uses little CPU
  • batch writes
  • separate web hosts for clients, away from the store hosts
  • focus on carbon-cache (write) + graphite-web (read)
• Leverage OS memory for carbon-cache
• RAID0 for more write performance (we have a replica)
• Focus on IOPS and low service time
• Time must always be in sync
CURRENT PRODUCTION SETUP
READS
CURRENT PRODUCTION SETUP - READS
Web frontend dashboard based on Graphite-Web
Graphite-web Django backend as an API for Grafana
Couchbase as a cache for graphite-web metrics
Each store exposes an API via graphite-web
Average response <300 ms
Nginx on top, behind an ELB
The web tier calculates functions; the stores serve raw metrics (CPU offload) - see the render example below
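A sketch of how a client (or Grafana) hits the render API; the host and target are hypothetical examples, and the functions (perSecond, aliasByNode) are evaluated on the web tier:

import requests  # third-party HTTP client

resp = requests.get(
    "https://graphite.example.com/render",
    params={
        "target": "aliasByNode(perSecond(stats.web01.requests), 1)",
        "from": "-1h",
        "format": "json",
    },
    timeout=10,
)
for series in resp.json():
    # Each series is {"target": ..., "datapoints": [[value, ts], ...]}
    print(series["target"], series["datapoints"][:3])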
BASECRM GRAPHITE EVOLUTION
• PoC with external EBS and C3 instances
  • graphite-web, carbon-relay, carbon-cache, whisper files on EBS
• Production started on i2.xlarge
  • 5 store instances: 800GB SSD, 30GB RAM, 4x CPU
  • 4 carbon-caches on each store
  • same software as in the PoC
  • problems with machine replacement and migration to a bigger cluster
  • dash-maker to manage complicated dashboards
• Next, i2.4xlarge
  • 5 store instances: 2x800GB in RAID0, 60GB RAM, 8x CPU
  • 8 carbon-caches on each store
  • carbon-c-relay as top and local relay
  • a recovery tool to recover data from the old cluster
  • Grafana as a second dashboard interface
• Current, i2.8xlarge (latest bump)
  • 5 store instances: 4x800GB in RAID0, 120GB RAM, 16x CPU
  • 16 carbon-caches on each store
DATA MIGRATION & RECOVERY
1. Replicate traffic to both the old and the new cluster
2. Copy old whisper files, driven by the files the new cluster creates; 5 instances with 1 Gbit/s each, recovery tops out at 4.5 Gbit/s over HTTP (see the back-fill sketch below)
3. Switch over on the ELB
4. Remove the old cluster
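As referenced in step 2, a minimal back-fill sketch using the whisper library that ships with Graphite (the actual recovery tool is not shown in this talk); it assumes both files share the same retention, and the paths are hypothetical:

import whisper  # Graphite's whisper library

def backfill(old_path, new_path, from_time):
    # Fetch the same window from both files; with identical retentions
    # the two value lists line up point for point.
    (start, end, step), old_vals = whisper.fetch(old_path, from_time)
    _, new_vals = whisper.fetch(new_path, from_time)
    points = []
    for i, old in enumerate(old_vals):
        # Only fill gaps; never overwrite what the new cluster wrote.
        if old is not None and new_vals[i] is None:
            points.append((start + i * step, old))
    if points:
        whisper.update_many(new_path, points)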
MULTI REGION
METRICS COLLECTING
DASH-MAKER
INTERNAL DASHBOARDS MANAGEMENT
• Manage dashboards like never before
• Template everything with Jinja2 (all Jinja2 features)
• Dashboard config: one YAML file with Jinja2 support
• Reusable graphs: JSON definitions like graphite-web's, with Jinja2 support
• Global key=value pairs for Jinja2
• Dynamic Jinja2 vars expanded from Graphite (a trailing * in a metric name)
• Many dashboard variants from one config, based on loop vars
• Supports graphite 0.9.12, 0.9.12 (evernote), 0.9.13, 0.10.0
$ dash-maker -f rabbitmq-server.yml
23:54:55 - dash-maker - main():Line:292 - INFO - Time [templating: 0.023229 build: 0.000177 save: 2.546407] Dashboard dm.us-east-1.production.rabbitmq.server saved with success
23:54:58 - dash-maker - main():Line:292 - INFO - Time [templating: 0.017746 build: 0.000057 save: 2.549711] Dashboard dm.us-east-1.sandbox.rabbitmq.server saved with success
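dash-maker itself is internal, so here is only a toy sketch of the idea (not its actual code): render a Jinja2 graph template once per loop var and assemble a graphite-web style dashboard. The template, queue names, and dashboard name are made-up examples:

import json
import yaml                # PyYAML
from jinja2 import Template

# One reusable graph definition: JSON with Jinja2 placeholders.
graph_template = Template(json.dumps({
    "title": "{{ queue }} depth",
    "target": ["stats.gauges.rabbitmq.{{ queue }}.messages"],
}))

# One YAML config drives many graphs via loop vars.
config = yaml.safe_load("""
name: dm.us-east-1.production.rabbitmq.server
queues: [billing, mailer, sync]
""")

dashboard = {
    "name": config["name"],
    "graphs": [json.loads(graph_template.render(queue=q))
               for q in config["queues"]],
}
print(json.dumps(dashboard, indent=2))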
FUTURE WORK AND PROBLEMS
• Dash-maker with Grafana support
• Out-of-band fast aggregation with anomaly detection
• Graphite with hashing is not elastic - InfluxDB? production-ready by March?
• In the future, one dashboard: Grafana + InfluxDB?