MongoDB Operational Best Practices (mongosf2012)

Operational Best Practices

Tales from the field

The Plan
● Review support cases
  ○ Taken from real issues
  ○ Names/IPs/dates changed to protect identities
● Analyze reported issues
● Distill best practices
● Summarize takeaways
● Repeat...

Scenario 1
● Fire, it is on fire!
● Users notice response times of 1-3 sec
● App logs show timeouts
● Server logs show socket exceptions

Scenario 1 - Diagnostics
● Logs
● Understanding the timeouts
  ○ Client read timeout set
  ○ Connection closed/discarded
  ○ Symptom, not cause
● Server connection exceptions
  ○ Match timing of client timeouts
  ○ Symptom, not cause
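The "match timing" step above can be sketched as a small correlation helper. This is only an illustration: the function name, the 2-second window, and the timestamps are hypothetical and not taken from the incident.

```python
from datetime import datetime, timedelta

def correlate(client_timeouts, server_exceptions, window=timedelta(seconds=2)):
    """Pair each client timeout with the server exceptions logged within `window` of it."""
    matches = []
    for t in client_timeouts:
        near = [s for s in server_exceptions if abs(s - t) <= window]
        matches.append((t, near))
    return matches

# Hypothetical timestamps parsed from app and server logs.
client = [datetime(2012, 5, 4, 10, 0, 3), datetime(2012, 5, 4, 10, 5, 0)]
server = [datetime(2012, 5, 4, 10, 0, 4), datetime(2012, 5, 4, 10, 0, 9)]

pairs = correlate(client, server)
# Client timeouts with no nearby server exception deserve a separate look.
unmatched = [t for t, near in pairs if not near]
```

If nearly every client timeout lines up with a server exception, both are likely symptoms of the same underlying cause.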

Scenario 1 - Monitoring
Graphs speak a thousand words

Scenario 1 - Takeaways
● Monitor logs
  ○ Alert, escalate
  ○ Correlate
● Disk
  ○ Monitor
  ○ Moved to RAID (10)
● Instrument/monitor the app
● Know your application and its (write) characteristics

Scenario 2
● Alerts warn that server is running hot
● Random (small) slowdowns
● Increased traffic/queries

Scenario 2 - Symptoms
● High CPU use
● Similar query pattern

Scenario 2 - Diagnostics
● Turn on DB profiling
● Look at logs
● Identify the query patterns that take longest or run most often, and run explain on them
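One way to find the longest-running and most frequent patterns is to group profiler entries by query "shape" (field names without values). The sketch below assumes each entry carries `query` and `millis` fields, as profiler documents of that era did; the sample entries and helper names are hypothetical.

```python
from collections import defaultdict

def query_shape(query):
    """Reduce a query document to its sorted field names, ignoring the values."""
    return tuple(sorted(query))

def summarize(profile_entries):
    """Return (shape, stats) pairs sorted by total time spent, slowest first."""
    stats = defaultdict(lambda: {"count": 0, "total_millis": 0})
    for entry in profile_entries:
        s = stats[query_shape(entry["query"])]
        s["count"] += 1
        s["total_millis"] += entry["millis"]
    return sorted(stats.items(), key=lambda kv: kv[1]["total_millis"], reverse=True)

# Hypothetical profiler entries; real ones would be read from the profile collection.
entries = [
    {"query": {"user": "a", "status": 1}, "millis": 99},
    {"query": {"status": 2, "user": "b"}, "millis": 120},
    {"query": {"_id": 7}, "millis": 1},
]
report = summarize(entries)
```

The shapes at the top of `report` are the first candidates for explain.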

Scenario 2 - Explain
db.scenario2.find({...}).sort({...}).explain()
{
    "cursor" : "BtreeCursor ABC",
    "nscanned" : 160677,
    "nscannedObjects" : 12015,
    "n" : 55,
    "millis" : 99,
    "scanAndOrder" : true,
    "indexBounds" : {...}
}
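Two things stand out in this output: `scanAndOrder` is true, so results are sorted in memory rather than read in index order, and far more index entries are scanned than documents returned. A minimal sketch of checking for both follows; the helper name and the ratio threshold of 10 are assumptions chosen for illustration, not a MongoDB rule.

```python
def explain_warnings(explain, max_scan_ratio=10):
    """Flag common signs of a poorly matched index in an explain() document."""
    warnings = []
    if explain.get("scanAndOrder"):
        warnings.append("scanAndOrder: results are sorted in memory, not via the index")
    returned = max(explain.get("n", 0), 1)  # avoid dividing by zero
    ratio = explain.get("nscanned", 0) / returned
    if ratio > max_scan_ratio:
        warnings.append(f"scanned {ratio:.1f} index entries per returned document")
    return warnings

# Values copied from the explain output on the slide (indexBounds omitted).
explain = {
    "cursor": "BtreeCursor ABC",
    "nscanned": 160677,
    "nscannedObjects": 12015,
    "n": 55,
    "millis": 99,
    "scanAndOrder": True,
}
warnings = explain_warnings(explain)
```

For the numbers on the slide, the ratio is 160677 / 55, roughly 2921 index entries scanned per document returned.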

Scenario 2 - Diagnostics
● Create a compound index
  ○ Used for criteria and sort
  ○ Reduced CPU dramatically
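A compound index that serves both criteria and sort is commonly built with the equality criteria fields first and the sort fields after them. The sketch below only builds the key specification; the field names are hypothetical because the slide elides the real query, and range criteria may call for a different order.

```python
ASCENDING, DESCENDING = 1, -1  # the same numeric values MongoDB uses for index direction

def compound_index_spec(criteria_fields, sort_spec):
    """Criteria fields first, then sort fields in sort order, skipping duplicates."""
    spec = [(field, ASCENDING) for field in criteria_fields]
    seen = set(criteria_fields)
    for field, direction in sort_spec:
        if field not in seen:
            spec.append((field, direction))
            seen.add(field)
    return spec

# Hypothetical field names standing in for the elided find({...}) and sort({...}).
spec = compound_index_spec(["user_id", "status"], [("created_at", DESCENDING)])
```

A list of `(field, direction)` pairs like `spec` is the form that driver index-creation calls such as pymongo's `create_index` accept.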

Scenario 2 - Takeaways
● Performance test/analyze system behavior
● Load test before deployment
● Alert on abnormal states
● High CPU is often a sign of poorly indexed queries
● Rolling upgrade for indexes

Scenario 3
● General slowdown on login
● High disk utilization

Scenario 3 - Diagnostics
iostat
Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz     await    svctm   %util
sdp        0.00    0.00  0.50  0.00   27.86    0.00     56.00    149.58  20320.00  2010.00  100.00
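The iostat row shows a saturated device: 100% utilization and an average wait (`await`, in milliseconds) of about 20 seconds. A sketch of turning such a row into alerts is below; the thresholds and function names are hypothetical and should be tuned per system.

```python
HEADER = "Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util"
ROW = "sdp 0.00 0.00 0.50 0.00 27.86 0.00 56.00 149.58 20320.00 2010.00 100.00"

def parse_iostat(header, row):
    """Map one iostat device row onto the column names from the header line."""
    names = header.split()[1:]  # drop the leading "Device:" label
    device, *values = row.split()
    return device, dict(zip(names, map(float, values)))

def disk_alerts(metrics, max_util=90.0, max_await_ms=100.0):
    """Return alert strings for utilization and wait time above the given thresholds."""
    alerts = []
    if metrics["%util"] >= max_util:
        alerts.append(f"utilization at {metrics['%util']:.0f}%")
    if metrics["await"] >= max_await_ms:
        alerts.append(f"average wait {metrics['await']:.0f} ms")
    return alerts

device, metrics = parse_iostat(HEADER, ROW)
alerts = disk_alerts(metrics)
```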

Scenario 3
$ blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw  8096   512  4048          0   1099494850560   /dev/sdp

Huge read-ahead of 4MB
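The 4MB figure follows from the RA column, which blockdev reports in 512-byte sectors. A short sketch of the arithmetic, using the row from the report above (the helper name is hypothetical):

```python
SECTOR_BYTES = 512  # blockdev reports read-ahead (RA) in 512-byte sectors

def readahead_bytes(report_row):
    """Convert the RA column of a `blockdev --report` row to bytes."""
    fields = report_row.split()
    return int(fields[1]) * SECTOR_BYTES  # columns: RO RA SSZ BSZ StartSec Size Device

row = "rw 8096 512 4048 0 1099494850560 /dev/sdp"
ra = readahead_bytes(row)      # 8096 * 512 = 4145152 bytes
ra_mib = ra / (1024 * 1024)    # just under 4 MiB
```

With a read-ahead this large, even a small random read can pull close to 4MB from disk, which is consistent with the saturated device seen in iostat.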

Scenario 3 - Takeaways
● Pay attention to disk configurations
● Load testing would have found this early
● MongoDB depends on the OS a lot
● Connect the dots from disproportionate effects

Best Practices Learned
● System provisioning
  ○ Capacity
  ○ Performance
  ○ Scale
  ○ Configuration
● Logs
  ○ Review
  ○ Alert
  ○ Rotate and collect (per cluster)

Best Practices Learned
● Query/Index Analysis
  ○ Database Profiler
  ○ Run explain periodically (sampled)
  ○ Instrument code, generate metrics
● Plan/test rollouts
  ○ Rolling upgrade for Replica Set
  ○ Generate indexes on secondaries first
  ○ Name services, use redirection

Thanks, more refs
Please take a look at http://mongodb.org (docs)
● Ask on the mongodb-user group
● Use MMS or historic monitoring
  ○ Watch for trends
  ○ Create alerts
  ○ Forecast capacity for provisioning
● logrotate unix command
● Monitor disk - munin or the like
● iostat, dstat, vmstat, free, netstat
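"Forecast capacity for provisioning" can start as simply as fitting a straight line to historic disk usage. The sketch below assumes roughly linear growth; the sample numbers and function names are hypothetical, and real monitoring data may need a better model.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def days_until_full(days, used_gb, capacity_gb):
    """Days after the last sample until the fitted trend reaches capacity, or None."""
    slope, intercept = linear_fit(days, used_gb)
    if slope <= 0:
        return None  # usage is flat or shrinking, so no forecast
    return (capacity_gb - intercept) / slope - days[-1]

# Hypothetical weekly samples of disk usage in GB.
days = [0, 7, 14, 21]
used = [100.0, 114.0, 128.0, 142.0]
remaining = days_until_full(days, used, capacity_gb=200.0)
```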

Questions
