Agenda

Emergency Response

Managing Incidents

• Prioritize. Stop the bleeding, restore service, and preserve the evidence for root-causing.
• Prepare. Develop and document your incident management procedures in advance, in consultation with incident participants.
• Trust. Give full autonomy within the assigned role to all incident participants.
• Introspect. Pay attention to your emotional state while responding to an incident. If you start to feel panicky or overwhelmed, solicit more support.
• Consider alternatives. Periodically consider your options and re-evaluate whether it still makes sense to continue what you're doing or whether you should be taking another tack in incident response.
• Practice. Use the process routinely so it becomes second nature.
• Change it around. Were you incident commander last time? Take on a different role this time. Encourage every team member to acquire familiarity with each role.

Troubleshooting

Troubleshooting - Report & Triage

Troubleshooting - Examine

Troubleshooting - Diagnose

Troubleshooting - Test / Treat

A Cassandra example

A Cassandra example - Stabilize cluster

Monitoring Cassandra (tpstats)

Pool Name                     Active   Pending   Completed  Blocked  All time blocked
ReadStage                          0         0        1073        0                 0
MiscStage                          0         0           0        0                 0
CompactionExecutor                 2         2        1759        0                 0
MutationStage                    128    615267   118435822        0                 0
MemtableReclaimMemory              0         0         210        0                 0
PendingRangeCalculator             0         0          45        0                 0
GossipStage                        0         0       12390        0                 0
SecondaryIndexManagement           0         0           0        0                 0
HintsDispatcher                    1        22          10        0                 0
RequestResponseStage               1         5   519510274        0                 0
Native-Transport-Requests          1         0    38354372        0          21184990
ReadRepairStage                    0         0           1        0                 0
CounterMutationStage               0         0           0        0                 0
MigrationStage                     0         0          65        0                 0
MemtablePostFlush                  1         1         231        0                 0
PerDiskMemtableFlushWriter_0       1         1         210        0                 0
ValidationExecutor                 0         0           0        0                 0
Sampler                            0         0           0        0                 0
MemtableFlushWriter                1         1         210        0                 0
InternalResponseStage              0         0     2817415        0                 0
ViewMutationStage                  0         0           0        0                 0
AntiEntropyStage                   0         0           0        0                 0
CacheCleanupExecutor               0         0           0        0                 0
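The pool table above is easier to watch when a script does the reading. The following is a minimal sketch, assuming the output of `nodetool tpstats` is available as text with one pool per line; the function name `flag_pools` and the `pending_limit` threshold are illustrative choices, not Cassandra defaults.

```python
import re

# A few rows copied from the tpstats output above, one pool per line.
SAMPLE = """\
Pool Name                     Active   Pending   Completed  Blocked  All time blocked
ReadStage                          0         0        1073        0                 0
CompactionExecutor                 2         2        1759        0                 0
MutationStage                    128    615267   118435822        0                 0
Native-Transport-Requests          1         0    38354372        0          21184990
"""

# A pool row is a name followed by exactly five integer columns.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")


def flag_pools(tpstats_text, pending_limit=1000):
    """Return {pool: reason} for pools with a backlog or a blocked history.

    pending_limit is a hypothetical threshold chosen for illustration.
    """
    flagged = {}
    for line in tpstats_text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header, blank line, or a row from another table
        pool = match.group(1)
        pending = int(match.group(3))
        all_time_blocked = int(match.group(6))
        if pending > pending_limit:
            flagged[pool] = f"pending={pending}"
        elif all_time_blocked > 0:
            flagged[pool] = f"all_time_blocked={all_time_blocked}"
    return flagged


if __name__ == "__main__":
    for pool, reason in flag_pools(SAMPLE).items():
        print(pool, reason)
```

On the sample rows this flags MutationStage for its pending backlog and Native-Transport-Requests for its all-time-blocked count, which are the two rows that stand out in the slide.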

Monitoring Cassandra (tpstats): dropped messages

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                 23
PAGED_RANGE                  0
BINARY                       0
READ                     10434
MUTATION                  4948
_TRACE                       0
REQUEST_RESPONSE             6
COUNTER_MUTATION             0
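Dropped messages are worth extracting on their own, since any non-zero count means a node gave up on work. The following is a rough sketch that assumes the two-column "Message type / Dropped" layout shown above (some Cassandra versions print extra columns, which this sketch does not handle); `dropped_messages` is a hypothetical helper name.

```python
# The dropped-message section of the tpstats output above.
SAMPLE = """\
Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                 23
PAGED_RANGE                  0
BINARY                       0
READ                     10434
MUTATION                  4948
_TRACE                       0
REQUEST_RESPONSE             6
COUNTER_MUTATION             0
"""


def dropped_messages(tpstats_text):
    """Return {message_type: count} for every type with a non-zero drop count."""
    counts = {}
    in_table = False
    for line in tpstats_text.splitlines():
        parts = line.split()
        if parts[:3] == ["Message", "type", "Dropped"]:
            in_table = True  # rows after this header belong to the table
            continue
        if in_table and len(parts) == 2 and parts[1].isdigit():
            dropped = int(parts[1])
            if dropped > 0:
                counts[parts[0]] = dropped
    return counts


if __name__ == "__main__":
    for message_type, count in dropped_messages(SAMPLE).items():
        print(f"{message_type}: {count} dropped")
```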

Monitoring Cassandra (status)

Datacenter: us-west
=============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns  Host ID                               Rack
UN  10.65.XX.XXX  108.77 GB  256     ?     e462bc9f-9df7-4342-b987-52a86d29c7f4  1a
UN  10.65.XX.XXX  116.28 GB  256     ?     93530c86-3cb3-4d4e-a005-9f02ed4c0b3a  1c
UN  10.65.XX.XXX  109.17 GB  256     ?     ab779176-1513-4849-8531-6ff39037e078  1a
UN  10.65.XX.XXX  103.1 GB   256     ?     cd112339-3224-4b8f-9be0-de26edb3a0d1  1a
UN  10.65.XX.XXX  111.45 GB  256     ?     3bfa406f-63f6-47e7-8798-6f650726ba23  1c
UN  10.65.XX.XXX  110.09 GB  256     ?     5b39c8c2-4896-48b5-940d-d48b12157acf  1a
UN  10.65.XX.XXX  105.18 GB  256     ?     467e03e4-0cdd-4088-b122-6b0e6848f7ed  1c
UN  10.65.XX.XXX  112.22 GB  256     ?     a48b999f-4473-4e85-83b2-1208fa63223c  1a
UN  10.65.XX.XXX  107.69 GB  256     ?     9e48a874-57ca-40df-8053-dfb141389c09  1a
UN  10.65.XX.XXX  109.21 GB  256     ?     cb20eaa4-ba95-452f-9ac0-5ff41010b702  1c
UN  10.65.XX.XXX  119.29 GB  256     ?     3cf1cd91-26ed-4057-b09b-9092c01e03ec  1c
UN  10.65.XX.XXX  109.08 GB  256     ?     d7aff1c4-0ace-46c2-b7db-a18f285fcdc4  1c
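The status listing can be checked mechanically for any node that is not Up/Normal. The following is a minimal sketch, assuming `nodetool status` output with one node per line and a load printed as a value plus unit (so eight whitespace-separated fields per node row); `parse_status` and `needs_attention` are hypothetical names.

```python
# Header and first three node rows from the status output above.
SAMPLE = """\
Datacenter: us-west
=============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns  Host ID                               Rack
UN  10.65.XX.XXX  108.77 GB  256     ?     e462bc9f-9df7-4342-b987-52a86d29c7f4  1a
UN  10.65.XX.XXX  116.28 GB  256     ?     93530c86-3cb3-4d4e-a005-9f02ed4c0b3a  1c
UN  10.65.XX.XXX  109.17 GB  256     ?     ab779176-1513-4849-8531-6ff39037e078  1a
"""


def parse_status(status_text):
    """Return one dict per node row found in the status output."""
    nodes = []
    for line in status_text.splitlines():
        parts = line.split()
        if len(parts) != 8:
            continue  # not a node row in the layout assumed here
        code = parts[0]
        # First letter: Up/Down. Second: Normal/Leaving/Joining/Moving.
        if len(code) == 2 and code[0] in "UD" and code[1] in "NLJM":
            nodes.append({
                "state": code,
                "address": parts[1],
                "load": f"{parts[2]} {parts[3]}",
                "tokens": int(parts[4]),
                "host_id": parts[6],
                "rack": parts[7],
            })
    return nodes


def needs_attention(nodes):
    """Return the nodes whose state is anything other than Up/Normal."""
    return [node for node in nodes if node["state"] != "UN"]


if __name__ == "__main__":
    nodes = parse_status(SAMPLE)
    print(len(nodes), "nodes,", len(needs_attention(nodes)), "need attention")
```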

Monitoring Cassandra (Metrics)

Metric | Description | Frequency
**Node Status | Nodes DOWN should be investigated immediately. | Continuous, with alerting
**Client read latency | Latency per read query over your threshold. | Continuous, with alerting
**Client write latency | Latency per write query over your threshold. | Continuous, with alerting
CF read latency | Local CF read latency per read, useful if some CFs are particularly latency sensitive. | Continuous if required
Tombstones per read | A large number of tombstones per read indicates possible performance problems: compactions may not be keeping up or may require tuning. | Weekly checks
SSTables per read | A high number (>5) indicates data is spread across too many ... | Weekly checks
**Pending compactions | Sustained pending compactions (>20) indicate compactions are not keeping up. This will have a performance impact. | Continuous, with alerting
Pending repairs | | Continuous, when running
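The two numeric thresholds in the table (>5 SSTables per read, sustained >20 pending compactions) can be expressed as a small check. This is only a sketch: the metric names, the list-of-samples input shape, and the reading of "sustained" as "the last three samples all exceed the limit" are assumptions made for illustration.

```python
SSTABLES_PER_READ_LIMIT = 5      # ">5" from the metrics table
PENDING_COMPACTIONS_LIMIT = 20   # ">20" from the metrics table


def sustained_above(samples, limit, min_samples=3):
    """True if the most recent min_samples readings all exceed limit.

    min_samples is an assumed definition of "sustained".
    """
    recent = samples[-min_samples:]
    return len(recent) == min_samples and all(value > limit for value in recent)


def check(metrics):
    """Return alert strings for a dict of metric name -> list of samples."""
    alerts = []
    pending = metrics.get("pending_compactions", [])
    if sustained_above(pending, PENDING_COMPACTIONS_LIMIT):
        alerts.append("pending compactions sustained above 20")
    sstables = metrics.get("sstables_per_read", [])
    if sstables and sstables[-1] > SSTABLES_PER_READ_LIMIT:
        alerts.append("SSTables per read above 5")
    return alerts


if __name__ == "__main__":
    print(check({"pending_compactions": [25, 30, 22], "sstables_per_read": [3]}))
```

A single spike in pending compactions does not trigger the alert here, which matches the table's emphasis on sustained values rather than momentary ones.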

Cluster Health Checks (Logs)

WARN [Native-Transport-Requests:3683972] 2015-03-02 00:20:30,639 BatchStatement.java (line 223) Batch of prepared statements for [prod.network_traffic] is of size 195456, exceeding specified threshold of 5120 by 190336.
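Large-batch warnings like the one above can be pulled out of the log with a regular expression. The following is a sketch that assumes the exact message wording shown in this log line; other Cassandra versions may phrase it differently, and `parse_batch_warning` is a hypothetical helper name.

```python
import re

# The warning from the slide, joined onto one line.
LINE = (
    "WARN [Native-Transport-Requests:3683972] 2015-03-02 00:20:30,639 "
    "BatchStatement.java (line 223) Batch of prepared statements for "
    "[prod.network_traffic] is of size 195456, exceeding specified "
    "threshold of 5120 by 190336."
)

BATCH_WARNING = re.compile(
    r"Batch of prepared statements for \[(?P<tables>[^\]]+)\] "
    r"is of size (?P<size>\d+), exceeding specified threshold of "
    r"(?P<threshold>\d+) by (?P<excess>\d+)\."
)


def parse_batch_warning(line):
    """Return the table name and sizes from a batch warning, or None."""
    match = BATCH_WARNING.search(line)
    if not match:
        return None
    size = int(match.group("size"))
    threshold = int(match.group("threshold"))
    return {
        "tables": match.group("tables"),
        "size": size,
        "threshold": threshold,
        "excess": int(match.group("excess")),
        "ratio": size / threshold,  # how many times over the threshold
    }


if __name__ == "__main__":
    info = parse_batch_warning(LINE)
    print(f"{info['tables']}: {info['ratio']:.1f}x the threshold")
```

For this line the batch is roughly 38 times the configured threshold, which is the kind of figure worth surfacing in a weekly log review.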

Cluster Health Checks

Backups

netstats

~ $ nodetool netstats
Mode: JOINING
Bootstrap 24b26bf0-bc05-11e6-a95a-5d59c4606c05
    /52.22.XXX.XXX (using /10.224.XXX.XXX)
    /52.22.XXX.XXX (using /10.224.XXX.XXX)
        Receiving 360 files, 40875561944 bytes total. Already received 1 files, 195299513 bytes total
            instametrics/events_raw_5m 195295154/278140764 bytes(70%) received from idx:0/52.22.XXX.XXX
            instametrics/host 4359/4359 bytes(100%) received from idx:0/52.22.XXX.XXX
    /52.55.XXX.XXX (using /10.224.130.6)
        Receiving 101 files, 34477437769 bytes total. Already received 4 files, 483917865 bytes total
            instametrics/events_raw_5m 4898307/4898307 bytes(100%) received from idx:0/52.55.XXX.XXX
            instametrics_rollup/events_rollup_3600 277979189/277979189 bytes(100%) received from idx:0/52.55.XXX.XXX
            instametrics_rollup/events_rollup_86400 1652187/1652187 bytes(100%) received from idx:0/52.55.XXX.XXX
            instametrics/host 3560/3560 bytes(100%) received from idx:0/52.55.XXX.XXX
            instametrics_rollup/events_rollup_300 199384622/11291788462 bytes(1%) received from idx:0/52.55.XXX.XXX
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name          Active   Pending   Completed   Dropped
Large messages        n/a        20           0         0
Small messages        n/a         1          50         0
Gossip messages       n/a         0       69238         0
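Overall bootstrap progress can be estimated from the "Receiving ... Already received ..." summary lines in the netstats output. The following is a minimal sketch that assumes the summary wording shown above; `streaming_progress` is a hypothetical helper name.

```python
import re

# The two per-peer summary lines from the netstats output above.
SAMPLE = """\
Receiving 360 files, 40875561944 bytes total. Already received 1 files, 195299513 bytes total
Receiving 101 files, 34477437769 bytes total. Already received 4 files, 483917865 bytes total
"""

SUMMARY = re.compile(
    r"Receiving (\d+) files, (\d+) bytes total\. "
    r"Already received (\d+) files, (\d+) bytes total"
)


def streaming_progress(netstats_text):
    """Return (received_bytes, expected_bytes, fraction) summed over all peers."""
    expected = 0
    received = 0
    for match in SUMMARY.finditer(netstats_text):
        expected += int(match.group(2))
        received += int(match.group(4))
    fraction = received / expected if expected else 0.0
    return received, expected, fraction


if __name__ == "__main__":
    received, expected, fraction = streaming_progress(SAMPLE)
    print(f"{received} of {expected} bytes ({fraction:.2%})")
```

Running this periodically and comparing successive fractions gives a rough view of whether a joining node is still making progress or has stalled.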

Some final tips

How Instaclustr can help
