Building and Monitoring Services at Lithium (fault tolerance, resiliency and monitoring)
Paul Cichonski, Senior Software Engineer
@paulcichonski
Services at Lithium Use:
Failure is a Constant, Need to Avoid Cascading Failure
Image Source: Netflix Hystrix: https://github.com/Netflix/Hystrix/wiki
We All Know How to Simulate Failure:
But how do we develop code to deal with failure?
Need to build fault-tolerant and resilient services... How?
Clustering for high availability is not enough to protect against cascading failure
#1 Fail Fast: use timeouts aggressively
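One way to read "fail fast" is to put a hard deadline on every call to a dependency and return a fallback when the deadline passes. The sketch below uses only the Python standard library; `call_with_timeout`, `slow_dependency` and the pool size are hypothetical names and values chosen for illustration, not anything from the talk.

```python
import concurrent.futures
import time

# Shared worker pool for outbound calls (the size is an arbitrary assumption).
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn on the pool; give up and return fallback after timeout_s seconds."""
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The caller stops waiting here. A call that is already running is not
        # interrupted; cancel() only prevents a still-queued call from starting.
        future.cancel()
        return fallback


def slow_dependency():
    """Hypothetical stand-in for a remote call that takes too long."""
    time.sleep(0.5)
    return "real answer"
```

Calling `call_with_timeout(slow_dependency, 0.05, "fallback")` returns the fallback after roughly 50 ms instead of blocking the caller for the full half second.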
#2 Use circuit breakers on network calls
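A circuit breaker can be sketched in a few lines. This is a simplified illustration of the pattern, not Hystrix's implementation or API: after a run of consecutive failures it rejects calls immediately, then lets a single trial call through once a cool-down has elapsed. `CircuitBreaker`, `CircuitOpenError` and the default thresholds are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each network call as `breaker.call(lambda: client.get(url))` (with a hypothetical `client`) means a dead dependency costs one exception per request rather than one timeout per request.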
#3 Use async communication when possible
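The slide does not say which form of async communication is meant (non-blocking calls, message queues, or both). As one possible reading, the sketch below issues several non-blocking calls concurrently with `asyncio`, so total latency tracks the slowest call rather than the sum. `fetch_profile` and `fetch_all` are hypothetical names, and the sleep stands in for real network I/O.

```python
import asyncio


async def fetch_profile(user_id):
    """Hypothetical remote call; the sleep stands in for network latency."""
    await asyncio.sleep(0.05)
    return {"user": user_id}


async def fetch_all(user_ids):
    """Start every call at once and wait for all of them to finish.

    gather() returns results in the same order as its arguments.
    """
    return await asyncio.gather(*(fetch_profile(u) for u in user_ids))
```

Run it with `asyncio.run(fetch_all([1, 2, 3]))`; the three calls overlap instead of running back to back.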
#4 Have well thought-out backpressure mechanisms
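One simple backpressure mechanism is a bounded queue that refuses new work when it is full, so overload surfaces at the edge of the service instead of as unbounded memory growth. The sketch below is one such scheme under assumed names (`BoundedInbox`, `offer`, `take`); the slide does not say which mechanism Lithium uses.

```python
import queue


class BoundedInbox:
    """Work queue with a fixed capacity that sheds load when full."""

    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)
        self.rejected = 0  # how many items were turned away

    def offer(self, item):
        """Try to enqueue without blocking; return False if the inbox is full."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        """Remove and return the oldest item; raises queue.Empty if none."""
        return self._q.get_nowait()
```

A caller that gets `False` from `offer` can answer its own client with an explicit "try again later" response, which pushes the pressure upstream rather than hiding it.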
#5 Use cross-region (or cross-datacenter) replication
#6 Failure models should be built into the business requirements of a service
Read:
Even with all of that, your app will still fail, so how do you recover quickly?
DevOps/CloudOps Model: OODA (Observe, Orient, Decide, Act)
Observe and Orient: you need metrics and dashboards
You Need Metrics
• Reduce “map/territory” confusion
• We use Yammer Metrics
  – Timers
  – Meters
  – Histograms
• We use them a lot
  – Every class has at least one metric, most have multiple
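To make the three metric types concrete, here is a minimal standard-library stand-in. It is not the Yammer Metrics API (that library is Java); the class and method names below are assumptions chosen to echo the concepts: a meter counts events, a histogram records a distribution of values, and a timer combines the two around a block of code.

```python
import time
from contextlib import contextmanager


class Meter:
    """Counts how many times an event happened."""

    def __init__(self):
        self.count = 0

    def mark(self, n=1):
        self.count += n


class Histogram:
    """Records values and reports nearest-rank percentiles."""

    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def percentile(self, p):
        """Nearest-rank percentile for integer p in 1..100; needs at least one value."""
        ordered = sorted(self.values)
        rank = max(1, -(-p * len(ordered) // 100))  # integer ceil(p * n / 100)
        return ordered[rank - 1]


class Timer:
    """Times a block of code, recording its duration and the call count."""

    def __init__(self):
        self.durations = Histogram()
        self.calls = Meter()

    @contextmanager
    def time(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations.update(time.perf_counter() - start)
            self.calls.mark()
```

Usage looks like `with request_timer.time(): handle(request)`, where `request_timer` and `handle` are hypothetical; a real metrics library would also keep rates over time and bound the memory used by the histogram.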
You Need to Visualize the Metrics
You Need Dashboards Keyed to Business Functionality
Use alerting as a last resort (because sometimes we need to sleep)
Decide and Act: you need robust CI and fast code roll-outs
Rinse and Repeat