Monitoring Elasticsearch Performance and Capacity
Aneel Lakhani, Mahdi Ben Hamida
SignalFlow™
Streaming & Historical Analytics
• Real-time visibility and correlation across the stack
• Compare incoming patterns against historical patterns in real time
• No query language needed
• Intelligent & dynamic alerting
• Resolution down to 1s
• Use existing investments in metrics, events, and logs
• Prebuilt integrations and content
SYSTEM METRICS & EVENTS
APP METRICS & EVENTS
USER METRICS & EVENTS
BUSINESS METRICS & EVENTS
WHY SIGNALFX: MONITORING FOR MODERN INFRASTRUCTURE
Elasticsearch at SignalFx
• Used for storing metadata about metrics, events, and other objects in the system
• Source of truth is Cassandra; Elasticsearch allows us to do ad-hoc queries and full-text search
• 4 clusters in production (plus more in testing/staging)
• Biggest cluster has 75 nodes (72 data nodes + 3 dedicated master nodes)
• ~20TB of data, half a billion documents and growing!
• 24 shards with 2 replicas (moving to 168 shards as we speak)
• Running in EC2 across 3 availability zones
Monitoring Elasticsearch
• Metrics are collected from ES nodes using the open source collectd agent
• collectd uses the ES REST API to fetch metrics at a fixed, configurable interval
• Metrics are sent to SignalFx
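A minimal sketch of that collection loop: poll the Elasticsearch node-stats REST endpoint on a fixed interval, pull out a few gauges, and hand them to a sender. The `/_nodes/_local/stats` path and the field names are the standard Elasticsearch API; `send_to_signalfx` is a hypothetical stand-in for the agent's real write path.

```python
import json
import time
import urllib.request

STATS_URL = "http://localhost:9200/_nodes/_local/stats"


def extract_metrics(stats: dict) -> dict:
    """Pull a few key gauges out of a /_nodes/stats response body."""
    node = next(iter(stats["nodes"].values()))  # single node under _local
    return {
        "jvm.heap_used_percent": node["jvm"]["mem"]["heap_used_percent"],
        "indexing.index_total": node["indices"]["indexing"]["index_total"],
        "search.query_total": node["indices"]["search"]["query_total"],
    }


def send_to_signalfx(metrics: dict) -> None:
    # Hypothetical stand-in for the agent's write path; just log here.
    print(metrics)


def poll_forever(interval_s: int = 10) -> None:
    """Fetch stats at a fixed, configurable interval and forward them."""
    while True:
        with urllib.request.urlopen(STATS_URL) as resp:
            send_to_signalfx(extract_metrics(json.load(resp)))
        time.sleep(interval_s)
```

In practice the collectd Elasticsearch plugin does this for you; the sketch only shows the shape of the poll-parse-forward cycle.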
• By default, SignalFx creates dashboards showing the most important Elasticsearch metrics
• We monitor infrastructure-, cluster-, node-, and index-level metrics
• We have alerts set up to notify us when something is wrong
Key Performance Metrics
• CPU load
• JVM heap, garbage collection
• Indexing and query rates, and their respective latencies
• Segment merges
• Thread pool queues and rejections
• Filter and field data cache sizes
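Elasticsearch exposes most of these as cumulative counters (for example `index_total` and `index_time_in_millis` in the node stats), so rates and latencies have to be derived from the delta between two polls. A sketch of that derivation, using the real node-stats field names:

```python
def rate_and_latency(prev: dict, curr: dict, interval_s: float):
    """Derive an indexing rate (ops/s) and mean latency (ms/op) from
    two snapshots of Elasticsearch's cumulative indexing counters."""
    ops = curr["index_total"] - prev["index_total"]
    millis = curr["index_time_in_millis"] - prev["index_time_in_millis"]
    rate = ops / interval_s                  # operations per second
    latency_ms = millis / ops if ops else 0.0  # mean ms per operation
    return rate, latency_ms


# Example: 600 index ops taking 1200 ms total over a 10 s interval.
prev = {"index_total": 1000, "index_time_in_millis": 5000}
curr = {"index_total": 1600, "index_time_in_millis": 6200}
rate, latency = rate_and_latency(prev, curr, 10.0)
print(rate, latency)  # 60.0 ops/s, 2.0 ms/op
```

The same delta trick applies to query counters (`query_total`, `query_time_in_millis`) and merge counters.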
Key Alerts
• High CPU load, low disk storage
• Master node availability
• Cluster state (green/yellow/red)
• Unassigned shards
• Sustained thread pool rejections
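A few of the cluster-level conditions above can be evaluated directly against a `/_cluster/health` response, whose `status` and `unassigned_shards` fields are part of the real API. The rejection check here is a simplified stand-in for a proper "sustained over N minutes" detector:

```python
def cluster_alerts(health: dict, rejections: int, sustained_s: int) -> list:
    """Evaluate a few alert conditions against a /_cluster/health body.
    `rejections` and `sustained_s` would come from thread-pool stats
    tracked over time; here they are passed in for simplicity."""
    alerts = []
    if health["status"] == "red":
        alerts.append("cluster status is red")
    if health["unassigned_shards"] > 0:
        alerts.append(f"{health['unassigned_shards']} unassigned shards")
    if rejections > 0 and sustained_s >= 300:  # rejected for 5+ minutes
        alerts.append("sustained thread pool rejections")
    return alerts
```

A healthy cluster (`status: green`, no unassigned shards, no rejections) produces an empty list.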
DEMO
T H A N K Y O U !
SIGN UP FOR A TRIAL AT:
signalfx.com
APPENDIX
MODERN APPS ARE FUNDAMENTALLY DIFFERENT
More scale-out, more open source, and more ephemeral infrastructure
LEGACY APPS: Monolithic, scale-up, running on enterprise-grade infrastructure managed by IT
MODERN APPS: Elastic, scale-out, running on ephemeral infrastructure in a public/private cloud (with self-service APIs)
(Diagram: a single monolithic app on one VM vs. a checkout service spread across many VMs.)
HOST-SPECIFIC ALERTS GENERATE NOISE
Noisy, reactive monitoring
CHALLENGE
• Too many alerts fire at once for a cluster-wide problem
• Is the machine down because we scaled down the cluster or because we had a real problem?
• Do we even care if a single node is down?
• Very high overhead to set up and reconfigure monitoring every time you add/remove nodes in a cluster
What matters? Where to start?
BUT A CENTRALIZED VIEW IS CRITICAL
2/3 OF MACHINES DOWN (CAPACITY DOWN TO 1/3)
LOAD INCREASED BY 2X
YOU WANT TO BE ALERTED!
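One way to get that centralized view is to aggregate per-host signals into a single cluster-level condition, so one alert fires for the scenario above instead of one per dead host. A minimal sketch, with the 1/3-capacity and 2x-load figures taken from the slide (the thresholds are illustrative, not a recommendation):

```python
def capacity_alert(nodes_up: int, nodes_total: int,
                   load_now: float, load_baseline: float) -> bool:
    """Fire one cluster-level alert when serving capacity has collapsed
    while load has spiked, instead of one alert per unreachable host."""
    capacity_ratio = nodes_up / nodes_total   # fraction of fleet serving
    load_ratio = load_now / load_baseline     # load vs. normal baseline
    return capacity_ratio <= 1 / 3 and load_ratio >= 2.0


# 2/3 of machines down while load doubled: alert.
print(capacity_alert(nodes_up=1, nodes_total=3,
                     load_now=200.0, load_baseline=100.0))  # True
```

The point is that the alert condition is defined on the aggregate, so it survives nodes being added or removed without reconfiguration.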
USE ANALYTICS TO CALCULATE THE NUMBER OF DAYS OF DISK CAPACITY YOU HAVE LEFT ACROSS A SHARDED DATA STORE – ALERT WHEN YOU HAVE < 7 DAYS
(Chart: DISK USAGE over time, climbing from 0% toward 100%, with an alert threshold at 83%.)
BUILD ACTIONABLE & TIMELY ALERTS
Alert here, before the disk fills up!
It is the only way to do quality alerting
PROACTIVELY DISCOVER A DISK ISSUE BEFORE IT CRIPPLES YOUR SYSTEM
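The days-of-capacity-left analytic described above can be sketched as a simple linear fit of recent disk-usage growth projected forward to the disk's capacity. The function names and the `(day, used_gb)` sample shape are illustrative, not any particular product's API:

```python
def days_of_disk_left(samples, disk_total_gb: float) -> float:
    """Estimate days until the store fills up from (day, used_gb)
    samples, assuming roughly linear growth between first and last."""
    (d0, u0), (d1, u1) = samples[0], samples[-1]
    growth_per_day = (u1 - u0) / (d1 - d0)
    if growth_per_day <= 0:
        return float("inf")  # usage flat or shrinking: never fills
    return (disk_total_gb - u1) / growth_per_day


def should_alert(samples, disk_total_gb: float,
                 threshold_days: float = 7.0) -> bool:
    """Alert when fewer than `threshold_days` of capacity remain."""
    return days_of_disk_left(samples, disk_total_gb) < threshold_days


# 900 GB used of 1000 GB, growing 10 GB/day: 10 days left, no alert yet.
print(days_of_disk_left([(0, 800.0), (10, 900.0)], 1000.0))  # 10.0
```

Summing usage and capacity across all shards of the store before applying this turns a per-host disk check into the single proactive cluster-level alert the slide describes.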
GET STARTED QUICKLY WITH INTEGRATIONS
For platforms, technologies, and 3rd-party business processes
GROWING AND VIBRANT ECOSYSTEM, PRE-BUILT CONTENT USING ANALYTICS