MONITORING @ SCALE
CLOUD SCALE
Sławomir Skowron, DevOps @ BaseCRM
Devops Kraków 2015
OUTLINE
• What is Graphite?
• Graphite architecture
• Additional components
• Current production setup
• Writes
• Reads
• BaseCRM graphite evolution
• Data migrations and recovery
• Multi region
• Dashboards management
• Future work
WHAT IS GRAPHITE?
A monitoring system focused on:
• Simple storage of time-series metrics
• Rendering graphs from time series on demand
• An API with functions
• Dashboards
• A huge number of tools and 3rd-party products built on Graphite
WHY ALL THIS?
LET’S LOOK AT AN EXAMPLE IN GRAFANA
GRAPHITE ARCHITECTURE
• Graphite-Web: a Django web application with a JS frontend
  • dashboards (a DB to save dashboards)
  • API with functions, server-side graph rendering
• Carbon: Twisted daemons
  • carbon-relay: hashes / routes metrics
  • carbon-aggregator: aggregates metrics
  • carbon-cache: “memory cache”, persists metrics to disk
• Whisper: a simple time-series DB
  • seconds-per-point resolution
Data points are sent as: metric name + value + Unix epoch timestamp, for example:
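A minimal Python sketch of pushing one data point over Carbon's plaintext protocol (default port 2003); the host and metric name are hypothetical examples:

import socket
import time

# Carbon's plaintext protocol: one "<metric path> <value> <unix timestamp>\n"
# line per data point. Host and metric name are hypothetical examples.
CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003

def send_metric(name, value, timestamp=None):
    timestamp = int(timestamp or time.time())
    line = "%s %s %d\n" % (name, value, timestamp)
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

send_metric("servers.web01.load.avg_1min", 0.42)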
ADDITIONAL COMPONENTS
• Diamond - https://github.com/BrightcoveOS/Diamond
  • Python daemon
  • over 120 collectors
  • simple collector development (see the sketch below)
  • used for OS and generic service monitoring
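As referenced above, a minimal sketch of a custom collector, based on Diamond's documented collector API (subclass Collector, implement collect(), report via self.publish); the metric and value are hypothetical:

import diamond.collector

class LoggedInUsersCollector(diamond.collector.Collector):
    """Hypothetical example collector."""
    def collect(self):
        # A real collector would read the value from the OS here
        # (e.g. parse the output of `who`); a constant keeps the
        # sketch short.
        users = 1
        self.publish("users.logged_in", users)

Dropped into Diamond's collector path, this gets scheduled and flushed to Graphite like any built-in collector.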
• Statsd - https://github.com/etsy/statsd/
  • Node.js daemon
  • aggregates counters, sets, gauges, and timers, and sends them to Graphite (see the sketch below)
  • many client libraries
  • used for in-app metrics
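As referenced above, statsd accepts metrics over UDP in a simple line format; a minimal sketch using only the Python standard library, with a hypothetical host and metric names:

import socket

# Statsd line protocol: "<metric>:<value>|<type>" where type is
# c (counter), ms (timer), g (gauge) or s (set).
STATSD_ADDR = ("statsd.example.com", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

sock.sendto(b"app.signups:1|c", STATSD_ADDR)          # counter
sock.sendto(b"app.request_time:320|ms", STATSD_ADDR)  # timer, in ms
sock.sendto(b"app.queue_depth:42|g", STATSD_ADDR)     # gauge

Statsd aggregates these over its flush interval and forwards the rollups to Graphite.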
CURRENT PRODUCTION SETUP
WRITES
ALL PROVISIONED BY ANSIBLE
Which components failed to work at scale?
• carbon-relay: switched to carbon-c-relay
• carbon-cache: switched to PyPy
REPLACEMENT
• Carbon-c-relay - https://github.com/grobian/carbon-c-relay
  • written in C
  • a replacement for the Python carbon-relay
  • high performance
  • multi-cluster support (traffic replication)
  • traffic load balancing
  • traffic hashing
  • aggregation and rewrites (see the example config below)
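As referenced above, a minimal carbon-c-relay configuration in the spirit of this setup (not our production config): a consistent-hash cluster with replication 2, with hypothetical store hostnames:

cluster stores
    carbon_ch replication 2
        store01.example.com:2003
        store02.example.com:2003
        store03.example.com:2003
    ;

match *
    send to stores
    stop
    ;

carbon_ch reproduces the original carbon-relay's consistent hashing, so whisper files land on the same hosts after the switch.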
IMPROVE
Carbon-cache: switched to PyPy (2.4, currently 2.5)
40-50% less CPU usage on carbon-cache
CURRENT PRODUCTION SETUP - WRITES
Write path:
• VMs report to an ELB
• round-robin to the top relays
• consistent hashing with replication factor 2
• to any carbon-cache on each store instance
• written to the local Whisper store volume
450+ instances as clients (diamond, statsd, other), reporting at 30-second intervals

carbon-c-relay as top relay:
• 3.5-4 mln metrics/min
• 20% CPU usage on each
• batch send (20k metrics)
• queue of 10 mln metrics

carbon-c-relay as local relay:
• 7-8 mln metrics/min
• batch send (20k metrics)
• queue of 5 mln metrics

carbon-cache with PyPy (50% less CPU):
• 7-8 mln metrics/min
• each point update takes 0.13-0.15 ms

250K-350K write IOPS; 5-6 mln Whisper DB files (2 copies)
Graphite consistent hashing: the cluster's maximum space and performance are bounded by the weakest host in the cluster
• Minimise the number of other processes and their CPU usage
• CPU offload:
  • carbon-c-relay uses little CPU
  • batch writes
  • separate web hosts for clients, away from the store hosts
  • focus on carbon-cache (write) + graphite-web (read)
• Leverage OS memory for carbon-cache
• RAID0 for more write performance (we have a replica)
• Focus on IOPS and low service time
• Time must always be in sync
CURRENT PRODUCTION SETUP
READS
CURRENT PRODUCTION SETUP - READS
Web frontend dashboard based on Graphite-Web
Graphite-web Django backend as an API for Grafana
Couchbase as a cache for graphite-web metrics
Each store exposes an API via graphite-web
Average response <300 ms
Nginx on top, behind an ELB
The web tier calculates functions; the stores serve raw metrics (CPU offload) - see the render example below
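A sketch of how a client (or Grafana) hits the render API; the host and target are hypothetical examples, and the functions (perSecond, aliasByNode) are evaluated on the web tier:

import requests  # third-party HTTP client

resp = requests.get(
    "https://graphite.example.com/render",
    params={
        "target": "aliasByNode(perSecond(stats.web01.requests), 1)",
        "from": "-1h",
        "format": "json",
    },
    timeout=10,
)
for series in resp.json():
    # Each series is {"target": ..., "datapoints": [[value, ts], ...]}
    print(series["target"], series["datapoints"][:3])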
BASECRM GRAPHITE EVOLUTION
• PoC with external EBS and C3 instances
  • graphite-web, carbon-relay, carbon-cache, whisper files on EBS
• Production started on i2.xlarge
  • 5 store instances: 800GB SSD, 30GB RAM, 4x CPU
  • 4 carbon-caches on each store
  • same software as in the PoC
  • problems with machine replacement and migration to a bigger cluster
  • dash-maker to manage complicated dashboards
• Next, i2.4xlarge
  • 5 store instances: 2x800GB in RAID0, 60GB RAM, 8x CPU
  • 8 carbon-caches on each store
  • carbon-c-relay as top and local relay
  • a recovery tool to recover data from the old cluster
  • Grafana as a second dashboard interface
• Current, i2.8xlarge (latest bump)
  • 5 store instances: 4x800GB in RAID0, 120GB RAM, 16x CPU
  • 16 carbon-caches on each store
DATA MIGRATION & RECOVERY
1. Replicate traffic to both the old and the new cluster
2. Copy old whisper files, driven by the files the new cluster creates; 5 instances with 1 Gbit/s each, recovery tops out at 4.5 Gbit/s over HTTP (see the back-fill sketch below)
3. Switch over on the ELB
4. Remove the old cluster
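As referenced in step 2, a minimal back-fill sketch using the whisper library that ships with Graphite (the actual recovery tool is not shown in this talk); it assumes both files share the same retention, and the paths are hypothetical:

import whisper  # Graphite's whisper library

def backfill(old_path, new_path, from_time):
    # Fetch the same window from both files; with identical retentions
    # the two value lists line up point for point.
    (start, end, step), old_vals = whisper.fetch(old_path, from_time)
    _, new_vals = whisper.fetch(new_path, from_time)
    points = []
    for i, old in enumerate(old_vals):
        # Only fill gaps; never overwrite what the new cluster wrote.
        if old is not None and new_vals[i] is None:
            points.append((start + i * step, old))
    if points:
        whisper.update_many(new_path, points)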
MULTI REGION
METRICS COLLECTING
DASH-MAKER
INTERNAL DASHBOARDS MANAGEMENT
• Manage dashboards like never before
• Template everything with Jinja2 (all Jinja2 features)
• Dashboard config: one YAML file with Jinja2 support
• Reusable graphs: JSON definitions like graphite-web's, with Jinja2 support
• Global key=value pairs for Jinja2
• Dynamic Jinja2 vars expanded from Graphite (a trailing * in a metric name)
• Many dashboard variants from one config, based on loop vars
• Supports graphite 0.9.12, 0.9.12 (evernote), 0.9.13, 0.10.0
$ dash-maker -f rabbitmq-server.yml
23:54:55 - dash-maker - main():Line:292 - INFO - Time [templating: 0.023229 build: 0.000177 save: 2.546407] Dashboard dm.us-east-1.production.rabbitmq.server saved with success
23:54:58 - dash-maker - main():Line:292 - INFO - Time [templating: 0.017746 build: 0.000057 save: 2.549711] Dashboard dm.us-east-1.sandbox.rabbitmq.server saved with success
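dash-maker itself is internal, so here is only a toy sketch of the idea (not its actual code): render a Jinja2 graph template once per loop var and assemble a graphite-web style dashboard. The template, queue names, and dashboard name are made-up examples:

import json
import yaml                # PyYAML
from jinja2 import Template

# One reusable graph definition: JSON with Jinja2 placeholders.
graph_template = Template(json.dumps({
    "title": "{{ queue }} depth",
    "target": ["stats.gauges.rabbitmq.{{ queue }}.messages"],
}))

# One YAML config drives many graphs via loop vars.
config = yaml.safe_load("""
name: dm.us-east-1.production.rabbitmq.server
queues: [billing, mailer, sync]
""")

dashboard = {
    "name": config["name"],
    "graphs": [json.loads(graph_template.render(queue=q))
               for q in config["queues"]],
}
print(json.dumps(dashboard, indent=2))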
FUTURE WORK AND PROBLEMS
• Dash-maker with Grafana support
• Out-of-band fast aggregation with anomaly detection
• Graphite with hashing is not elastic - InfluxDB? production-ready by March?
• In the future, one dashboard: Grafana + InfluxDB?