MONITORING @ SCALE - CLOUD SCALE
Sławomir Skowron, DevOps @ BaseCRM
DevOps Kraków 2015

MONITORING @ SCALE - files.meetup.com/10485232/graphite.pdf


Page 1:

MONITORING @ SCALE
CLOUD SCALE

Sławomir Skowron Devops @ BaseCRM

Devops Kraków 2015

Page 2:

OUTLINE

• What is Graphite?

• Graphite architecture

• Additional components

• Current production setup

• Writes

• Reads

• BaseCRM graphite evolution

• Data migrations and recovery

• Multi region

• Dashboards management

• Future work

Page 3:

WHAT IS GRAPHITE?

A monitoring system focused on:

• Simple storage of time-series metrics

• Rendering graphs from time series on demand

• An API with functions

• Dashboards

• A huge number of tools and 3rd-party products built on Graphite

Page 4:

WHY ALL THIS?

LET'S LOOK AT AN EXAMPLE IN GRAFANA

Page 5:

Page 6:

GRAPHITE ARCHITECTURE

Page 7:

GRAPHITE ARCHITECTURE

• Graphite-Web - Django web application with a JS frontend

• Dashboards (DB for saved dashboards)

• API with functions, server-side graph rendering

• Carbon - Twisted daemon

• carbon-relay - hashes / routes metrics

• carbon-aggregator - aggregates metrics

• carbon-cache - "memory cache" that persists metrics to disk

• Whisper - simple time-series DB

• fixed seconds-per-point resolution

Data points are sent as: metric name + value + Unix epoch timestamp
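The "metric name + value + Unix epoch timestamp" format above is Graphite's plaintext protocol, one line per data point over TCP. A minimal sketch of building and sending such a line (host, port, and metric name are illustrative):

```python
import socket
import time

def format_metric(name, value, timestamp=None):
    # Graphite plaintext protocol: "<metric.path> <value> <unix_epoch>\n"
    ts = int(timestamp if timestamp is not None else time.time())
    return "%s %s %d\n" % (name, value, ts)

def send_metric(host, port, name, value):
    # carbon-cache / carbon-relay accept this protocol on TCP (2003 by default)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(format_metric(name, value).encode("ascii"))

# format_metric("servers.web01.cpu.idle", 87.5, 1432000000)
# -> "servers.web01.cpu.idle 87.5 1432000000\n"
```

Collectors like diamond and statsd speak exactly this protocol when reporting into the relay tier.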

Page 8:

GRAPHITE ARCHITECTURE

Page 9:

ADDITIONAL COMPONENTS

Page 10:

ADDITIONAL COMPONENTS

• Diamond - https://github.com/BrightcoveOS/Diamond

• Python daemon

• over 120 collectors

• simple collector development

• used for OS and generic service monitoring

Page 11:

ADDITIONAL COMPONENTS

• Statsd - https://github.com/etsy/statsd/

• Node.js daemon

• aggregates counters, sets, gauges, and timers, then sends them to Graphite

• many client libraries

• used for in-app metrics
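StatsD's wire format is a tiny UDP line protocol, which is what makes in-app metrics cheap: sends are fire-and-forget and a down daemon never blocks the application. A minimal sketch (metric names are illustrative):

```python
import socket

def statsd_packet(name, value, metric_type):
    # StatsD line format: "<name>:<value>|<type>"
    # where type is "c" (counter), "g" (gauge), "ms" (timer) or "s" (set)
    return "%s:%s|%s" % (name, value, metric_type)

def statsd_send(packet, host="127.0.0.1", port=8125):
    # UDP, fire-and-forget: no connection, no blocking on a dead daemon
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet.encode("ascii"), (host, port))
    finally:
        sock.close()

statsd_send(statsd_packet("app.logins", 1, "c"))           # counter
statsd_send(statsd_packet("app.request_time", 320, "ms"))  # timer
statsd_send(statsd_packet("app.queue_depth", 42, "g"))     # gauge
```

StatsD then flushes the aggregated values to Graphite on an interval, so the app emits many cheap packets while Graphite stores only the rollups.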

Page 12:

CURRENT PRODUCTION SETUP

WRITES

Page 13:

ALL PROVISIONED BY ANSIBLE

Page 14:

Page 15:

Which components failed to work at scale?

• carbon-relay - switched to carbon-c-relay

• carbon-cache - switched to PyPy

Page 16:

REPLACEMENT

• carbon-c-relay - https://github.com/grobian/carbon-c-relay

• written in C, a replacement for the Python carbon-relay

• high performance

• multi-cluster support (traffic replication)

• traffic load-balancing

• traffic hashing

• aggregation and rewrites
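The hashing and replication features above map directly onto carbon-c-relay's config language. A sketch of a cluster using consistent hashing with replication factor 2 (hostnames and ports are placeholders):

```
# carbon-c-relay config sketch (store-N hostnames are illustrative)
cluster graphite
    carbon_ch replication 2
        store-1:2103
        store-2:2103
        store-3:2103
        store-4:2103
        store-5:2103
    ;

match *
    send to graphite
    stop
    ;
```

With `replication 2`, every metric is written to two distinct stores on the ring, which is what later makes RAID0 volumes and whole-host replacement survivable.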

Page 17:

IMPROVE: CARBON-CACHE

Switched to PyPy (2.4, currently 2.5)

40-50% less CPU usage on carbon-cache

Page 18:

CURRENT PRODUCTION SETUP - WRITES

Page 19:

CURRENT PRODUCTION SETUP - WRITES

VMs report to an ELB

Round-robin to the top relays

Consistent hashing with replication 2

Any of the carbon-caches on each store instance

Writes go to the local Whisper store volume
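The "consistent hash with replication 2" step is what pins every metric to the same two store hosts on every send. A toy illustration of the idea (this is NOT carbon-c-relay's exact carbon_ch algorithm, just a sketch of consistent hashing with replication; hostnames are placeholders):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring with virtual nodes and replication."""

    def __init__(self, nodes, vnodes=100):
        # place vnodes points per host on the ring, sorted by hash
        self.ring = sorted(
            (int(hashlib.md5(("%s:%d" % (node, i)).encode()).hexdigest(), 16), node)
            for node in nodes
            for i in range(vnodes)
        )

    def nodes_for(self, metric, replication=2):
        # walk clockwise from the metric's hash, collecting distinct hosts
        h = int(hashlib.md5(metric.encode()).hexdigest(), 16)
        idx = bisect(self.ring, (h, ""))
        chosen = []
        while len(chosen) < replication:
            node = self.ring[idx % len(self.ring)][1]
            if node not in chosen:
                chosen.append(node)
            idx += 1
        return chosen

ring = HashRing(["store-1", "store-2", "store-3", "store-4", "store-5"])
# the same metric name always maps to the same two stores
print(ring.nodes_for("servers.web01.cpu.idle"))
```

Because the mapping is a pure function of the metric name, any relay instance routes a given metric to the same pair of stores, and reads know exactly which hosts to query.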

Page 20:

CURRENT PRODUCTION SETUP - WRITES

450+ instances as clients (diamond, statsd, other), reporting in 30-second intervals

carbon-c-relay as top relay:
• 3.5-4 mln metrics/min
• 20% CPU usage on each
• batch send (20k metrics)
• queue of 10 mln metrics

carbon-c-relay as local relay:
• 7-8 mln metrics/min
• batch send (20k metrics)
• queue of 5 mln metrics

carbon-cache with PyPy (50% less CPU):
• 7-8 mln metrics/min
• each point update takes 0.13-0.15 ms

250K-350K write IOPS, 5-6 mln Whisper DB files (2 copies)

Page 21:

CURRENT PRODUCTION SETUP - WRITES

Graphite hashing: the cluster's maximum space/performance is bounded by the weakest host in the cluster

Page 22:

CURRENT PRODUCTION SETUP - WRITES

Page 23:

CURRENT PRODUCTION SETUP - WRITES

• Minimise the number of other processes and their CPU usage (CPU offload)

• carbon-c-relay keeps relay CPU low

• Batch writes

• Separate web hosts for clients from the store hosts

• Store hosts focus on carbon-cache (write) + graphite-web (read)

• Leverage OS memory for carbon-cache

• RAID0 for more write performance - we have a replica

• Focus on IOPS - low service time

• Time must always be in sync
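The RAID0 point can be sketched with standard Linux tooling (device names and mount point are illustrative). RAID0 trades redundancy for throughput, which is acceptable here only because the hash ring already keeps a second copy of every metric on another host:

```
# Stripe two local SSDs into RAID0 for write throughput
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
mkfs.ext4 /dev/md0
# noatime avoids an extra metadata write per whisper-file read
mount -o noatime /dev/md0 /opt/graphite/storage
```

If any striped disk fails the whole volume is lost, so the recovery path is to replace the host and backfill from its replica.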

Page 24:

CURRENT PRODUCTION SETUP

READS

Page 25:

CURRENT PRODUCTION SETUP - READS

Page 26:

CURRENT PRODUCTION SETUP - READS

Web frontend dashboard based on graphite-web

graphite-web Django backend serves as the API for Grafana

Couchbase as a cache for graphite-web metrics

Each store exposes an API via graphite-web

Average response <300 ms

Nginx on top, behind an ELB

The webs calculate functions; the stores serve raw metrics

(CPU offload)
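A read against this stack goes through graphite-web's render API. An illustrative query (hostname is a placeholder): the web tier evaluates the function over raw series it fetches from the store hosts, which is the CPU offload described above.

```
# Sum CPU user time across all servers over the last hour, as JSON
curl 'http://graphite.example.com/render?target=sumSeries(servers.*.cpu.user)&from=-1h&format=json'
```

Grafana issues the same kind of render/metrics requests against the graphite-web backend, with Couchbase caching the metric lookups.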

Page 27:

BASECRM GRAPHITE

EVOLUTION

Page 28:

BASECRM GRAPHITE EVOLUTION

• PoC with external EBS and C3 instances
  • graphite-web, carbon-relay, carbon-cache, whisper files on EBS

• Production started on i2.xlarge
  • 5 store instances - 800GB SSD, 30GB RAM, 4 CPUs
  • 4 carbon-caches on each store
  • same software as in the PoC
  • problems with machine replacement and migrations to a bigger cluster
  • dash-maker to manage complicated dashboards

• Next, i2.4xlarge
  • 5 store instances - 2x800GB in RAID0, 60GB RAM, 8 CPUs
  • 8 carbon-caches on each store
  • carbon-c-relay as top and local relay
  • recovery tool to recover data from the old cluster
  • Grafana as a second dashboard interface

• Current, i2.8xlarge - latest bump
  • 5 store instances - 4x800GB in RAID0, 120GB RAM, 16 CPUs
  • 16 carbon-caches on each store

Page 29:

DATA MIGRATION &

RECOVERY

Page 30:

DATA MIGRATION & RECOVERY

Replicate traffic to both clusters

Copy the old whisper files into the layout the new cluster creates

Page 31:

DATA MIGRATION & RECOVERY

5 instances with 1 Gbit/s each - recovery tops out at 4.5 Gbit/s over HTTP

Page 32:

DATA MIGRATION & RECOVERY

Switch traffic over on the ELB

Page 33:

DATA MIGRATION & RECOVERY

Remove old cluster

Page 34:

MULTI REGION

METRICS COLLECTING

Page 35:

MULTI REGION

Page 36:

DASH-MAKER

INTERNAL DASHBOARDS MANAGEMENT

Page 37:

DASH-MAKER

• Manage dashboards like never before

• Template everything with Jinja2 (all Jinja2 features)

• Dashboard config - one YAML file with Jinja2 support

• Reusable graphs - JSON definitions like graphite-web's, with Jinja2 support

• Global key=value pairs for Jinja2

• Dynamic Jinja2 vars expanded from Graphite (trailing * in a metric name)

• Many dashboard variants from one config, based on loop vars

• Supports graphite 0.9.12, 0.9.12 (evernote), 0.9.13, 0.10.0
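dash-maker is an internal BaseCRM tool, so its real schema is not shown here; the following is a purely hypothetical sketch of the "one YAML with Jinja2 and loop vars" idea, with every field name invented for illustration:

```
# Hypothetical dash-maker style config (invented schema)
name: "dm.{{ region }}.{{ env }}.rabbitmq.server"
vars:
  region: us-east-1
  env: [production, sandbox]        # loop var -> one dashboard per value
graphs:
  - template: rabbitmq-queue.json   # reusable graphite-web style graph JSON
    targets:
      - "servers.{{ env }}.rabbitmq.*.queue_depth"
```

One config rendered over the loop vars would then produce the per-environment dashboards seen in the output on the next slide.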

Page 38:

DASH-MAKER

$ dash-maker -f rabbitmq-server.yml
23:54:55 - dash-maker - main():Line:292 - INFO - Time [templating: 0.023229 build: 0.000177 save: 2.546407] Dashboard dm.us-east-1.production.rabbitmq.server saved with success
23:54:58 - dash-maker - main():Line:292 - INFO - Time [templating: 0.017746 build: 0.000057 save: 2.549711] Dashboard dm.us-east-1.sandbox.rabbitmq.server saved with success

Page 39:

FUTURE WORK AND PROBLEMS

Page 40:

FUTURE WORK AND PROBLEMS

• Dash-maker with Grafana support

• Out-of-band fast aggregation with anomaly detection

• Graphite with hashing is not elastic - InfluxDB? production-ready in March?

• In the future, one dashboard - Grafana + InfluxDB?