Monitoring Elasticsearch Performance and Capacity
Aneel Lakhani, Mahdi Ben Hamida
SignalFlow™
Streaming & Historical Analytics
• Real-time visibility and correlation across the stack
• Compare incoming patterns against historical patterns in real time
• No query language needed
• Intelligent & dynamic alerting
• Resolution down to 1s
• Use existing investments in metrics, events, and logs
• Prebuilt integrations and content
SYSTEM METRICS & EVENTS
APP METRICS & EVENTS
USER METRICS & EVENTS
BUSINESS METRICS & EVENTS
WHY SIGNALFX: MONITORING FOR MODERN INFRASTRUCTURE
Elasticsearch at SignalFx
• Used for storing metadata about metrics, events, and other objects in the system
• Source of truth is Cassandra; Elasticsearch allows us to do ad-hoc queries and full-text search
• 4 clusters in production (plus more in testing/staging)
• Biggest cluster has 75 nodes (72 data nodes + 3 dedicated master nodes)
• ~20TB of data, half a billion documents and growing!
• 24 shards with 2 replicas (moving to 168 shards as we speak)
• Running in EC2 across 3 availability zones
Monitoring Elasticsearch
• Metrics are collected from ES nodes using the open source collectd agent
• collectd uses the ES REST API to fetch metrics at a fixed, configurable interval
• Metrics are sent to SignalFx
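A minimal sketch of that collection loop: poll the Elasticsearch node-stats REST endpoint on a fixed interval, pull out a few gauges, and hand them to a sender. The `/_nodes/_local/stats` path and the field names are the standard Elasticsearch API; `send_to_signalfx` is a hypothetical stand-in for the agent's real write path.

```python
import json
import time
import urllib.request

STATS_URL = "http://localhost:9200/_nodes/_local/stats"


def extract_metrics(stats: dict) -> dict:
    """Pull a few key gauges out of a /_nodes/stats response body."""
    node = next(iter(stats["nodes"].values()))  # single node under _local
    return {
        "jvm.heap_used_percent": node["jvm"]["mem"]["heap_used_percent"],
        "indexing.index_total": node["indices"]["indexing"]["index_total"],
        "search.query_total": node["indices"]["search"]["query_total"],
    }


def send_to_signalfx(metrics: dict) -> None:
    # Hypothetical stand-in for the agent's write path; just log here.
    print(metrics)


def poll_forever(interval_s: int = 10) -> None:
    """Fetch stats at a fixed, configurable interval and forward them."""
    while True:
        with urllib.request.urlopen(STATS_URL) as resp:
            send_to_signalfx(extract_metrics(json.load(resp)))
        time.sleep(interval_s)
```

In practice the collectd Elasticsearch plugin does this for you; the sketch only shows the shape of the poll-parse-forward cycle.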
• By default, SignalFx creates dashboards showing the most important Elasticsearch metrics
• We monitor infrastructure-, cluster-, node-, and index-level metrics
• We have alerts set up to notify us when something is wrong
Key Performance Metrics
• CPU load
• JVM heap, garbage collection
• Indexing and query rates, and their respective latencies
• Segment merges
• Thread pool queues and rejections
• Filter and field data cache sizes
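Elasticsearch exposes most of these as cumulative counters (for example `index_total` and `index_time_in_millis` in the node stats), so rates and latencies have to be derived from the delta between two polls. A sketch of that derivation, using the real node-stats field names:

```python
def rate_and_latency(prev: dict, curr: dict, interval_s: float):
    """Derive an indexing rate (ops/s) and mean latency (ms/op) from
    two snapshots of Elasticsearch's cumulative indexing counters."""
    ops = curr["index_total"] - prev["index_total"]
    millis = curr["index_time_in_millis"] - prev["index_time_in_millis"]
    rate = ops / interval_s                  # operations per second
    latency_ms = millis / ops if ops else 0.0  # mean ms per operation
    return rate, latency_ms


# Example: 600 index ops taking 1200 ms total over a 10 s interval.
prev = {"index_total": 1000, "index_time_in_millis": 5000}
curr = {"index_total": 1600, "index_time_in_millis": 6200}
rate, latency = rate_and_latency(prev, curr, 10.0)
print(rate, latency)  # 60.0 ops/s, 2.0 ms/op
```

The same delta trick applies to query counters (`query_total`, `query_time_in_millis`) and merge counters.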
Key Alerts
• High CPU load, low disk storage
• Master node availability
• Cluster state (green/yellow/red)
• Unassigned shards
• Sustained thread pool rejections
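A few of the cluster-level conditions above can be evaluated directly against a `/_cluster/health` response, whose `status` and `unassigned_shards` fields are part of the real API. The rejection check here is a simplified stand-in for a proper "sustained over N minutes" detector:

```python
def cluster_alerts(health: dict, rejections: int, sustained_s: int) -> list:
    """Evaluate a few alert conditions against a /_cluster/health body.
    `rejections` and `sustained_s` would come from thread-pool stats
    tracked over time; here they are passed in for simplicity."""
    alerts = []
    if health["status"] == "red":
        alerts.append("cluster status is red")
    if health["unassigned_shards"] > 0:
        alerts.append(f"{health['unassigned_shards']} unassigned shards")
    if rejections > 0 and sustained_s >= 300:  # rejected for 5+ minutes
        alerts.append("sustained thread pool rejections")
    return alerts
```

A healthy cluster (`status: green`, no unassigned shards, no rejections) produces an empty list.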
DEMO
T H A N K Y O U !
SIGN UP FOR A TRIAL AT:
signalfx.com
APPENDIX
MODERN APPS ARE FUNDAMENTALLY DIFFERENT
More scale-out, more open source, and more ephemeral infrastructure
LEGACY APPS: Monolithic, scale-up, running on enterprise-grade infrastructure managed by IT
MODERN APPS: Elastic, scale-out, running on ephemeral infrastructure in a public/private cloud (with self-service APIs)
(Diagram: a single monolithic app on one VM vs. a checkout service spread across many VMs.)
HOST-SPECIFIC ALERTS GENERATE NOISE
Noisy, reactive monitoring
CHALLENGE
• Too many alerts fire at once for a cluster-wide problem
• Is the machine down because we scaled down the cluster or because we had a real problem?
• Do we even care if a single node is down?
• Very high overhead to set up and reconfigure monitoring every time you add/remove nodes in a cluster
What matters? Where to start?
BUT A CENTRALIZED VIEW IS CRITICAL
2/3 OF MACHINES DOWN (CAPACITY DOWN TO 1/3)
LOAD INCREASED BY 2X
YOU WANT TO BE ALERTED!
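One way to get that centralized view is to aggregate per-host signals into a single cluster-level condition, so one alert fires for the scenario above instead of one per dead host. A minimal sketch, with the 1/3-capacity and 2x-load figures taken from the slide (the thresholds are illustrative, not a recommendation):

```python
def capacity_alert(nodes_up: int, nodes_total: int,
                   load_now: float, load_baseline: float) -> bool:
    """Fire one cluster-level alert when serving capacity has collapsed
    while load has spiked, instead of one alert per unreachable host."""
    capacity_ratio = nodes_up / nodes_total   # fraction of fleet serving
    load_ratio = load_now / load_baseline     # load vs. normal baseline
    return capacity_ratio <= 1 / 3 and load_ratio >= 2.0


# 2/3 of machines down while load doubled: alert.
print(capacity_alert(nodes_up=1, nodes_total=3,
                     load_now=200.0, load_baseline=100.0))  # True
```

The point is that the alert condition is defined on the aggregate, so it survives nodes being added or removed without reconfiguration.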
USE ANALYTICS TO CALCULATE THE NUMBER OF DAYS OF DISK CAPACITY YOU HAVE LEFT ACROSS A SHARDED DATA STORE – ALERT WHEN YOU HAVE < 7 DAYS
(Chart: DISK USAGE over time, climbing from 0% toward 100%, with an alert threshold at 83%.)
BUILD ACTIONABLE & TIMELY ALERTS
Alert here, before the disk fills up!
It is the only way to do quality alerting
PROACTIVELY DISCOVER A DISK ISSUE BEFORE IT CRIPPLES YOUR SYSTEM
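The days-of-capacity-left analytic described above can be sketched as a simple linear fit of recent disk-usage growth projected forward to the disk's capacity. The function names and the `(day, used_gb)` sample shape are illustrative, not any particular product's API:

```python
def days_of_disk_left(samples, disk_total_gb: float) -> float:
    """Estimate days until the store fills up from (day, used_gb)
    samples, assuming roughly linear growth between first and last."""
    (d0, u0), (d1, u1) = samples[0], samples[-1]
    growth_per_day = (u1 - u0) / (d1 - d0)
    if growth_per_day <= 0:
        return float("inf")  # usage flat or shrinking: never fills
    return (disk_total_gb - u1) / growth_per_day


def should_alert(samples, disk_total_gb: float,
                 threshold_days: float = 7.0) -> bool:
    """Alert when fewer than `threshold_days` of capacity remain."""
    return days_of_disk_left(samples, disk_total_gb) < threshold_days


# 900 GB used of 1000 GB, growing 10 GB/day: 10 days left, no alert yet.
print(days_of_disk_left([(0, 800.0), (10, 900.0)], 1000.0))  # 10.0
```

Summing usage and capacity across all shards of the store before applying this turns a per-host disk check into the single proactive cluster-level alert the slide describes.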
GET STARTED QUICKLY WITH INTEGRATIONS
For platforms, technologies, and 3rd-party business processes
GROWING AND VIBRANT ECOSYSTEM, PRE-BUILT CONTENT USING ANALYTICS