Building and Monitoring Services at Lithium (fault tolerance, resiliency and monitoring)

Paul Cichonski, Senior Software Engineer

@paulcichonski

Services at Lithium Use:

Failure is a Constant, Need to Avoid Cascading Failure

Image Source: Netflix Hystrix: https://github.com/Netflix/Hystrix/wiki

We All Know How to Simulate Failure:

But how do we develop code to deal with failure?

Need to build fault-tolerant and resilient services... How?

Clustering for high availability is not enough to protect against cascading failure

#1 Fail Fast: use timeouts aggressively
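
As an illustration of failing fast, here is a minimal Python sketch that caps how long a caller waits on a dependency and falls back to a default value when the cap is hit. The names `call_with_timeout`, `slow_dependency`, `fast_dependency`, the pool size and the timeout values are all hypothetical and are not taken from the talk.

```python
import concurrent.futures
import time

# Shared worker pool; the size is a hypothetical value for illustration.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() but wait at most timeout_s seconds; return fallback on timeout."""
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: cancel() cannot stop a call that is already running.
        future.cancel()
        return fallback


def slow_dependency():
    time.sleep(0.5)  # stands in for a hung downstream service
    return "fresh"


def fast_dependency():
    return "fresh"
```

Calling `call_with_timeout(slow_dependency, 0.05, "cached")` returns the fallback after roughly 50 ms instead of blocking for the full half second. The worker thread still runs to completion in the background, so a real service would also set timeouts on the underlying socket or client.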

#2 Use circuit breakers on network calls
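
The circuit breaker idea can be sketched in a few lines of Python. This is a simplified illustration, not the Hystrix implementation: the class name, the `failure_threshold` and `reset_timeout_s` parameters, and the injectable `clock` are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately and the struggling dependency receives no traffic, which is what stops one failure from cascading upstream.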

#3 Use async communication when possible
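
One way to picture async communication is a caller that issues several downstream requests concurrently and gives each its own deadline. The sketch below uses Python's `asyncio`; the `fetch` coroutine, the service names and the delays are hypothetical stand-ins for real network calls.

```python
import asyncio


async def fetch(name, delay_s):
    """Hypothetical downstream call; the sleep stands in for network I/O."""
    await asyncio.sleep(delay_s)
    return f"{name}:ok"


async def fetch_all(timeout_s=0.2):
    calls = {"profile": fetch("profile", 0.01), "search": fetch("search", 1.0)}

    async def guarded(coro):
        # Each call gets its own deadline; a slow one yields None, not an error.
        try:
            return await asyncio.wait_for(coro, timeout=timeout_s)
        except asyncio.TimeoutError:
            return None

    results = await asyncio.gather(*(guarded(c) for c in calls.values()))
    return dict(zip(calls.keys(), results))
```

Running `asyncio.run(fetch_all())` completes in about `timeout_s` seconds, because the slow `search` call is abandoned rather than holding up the fast `profile` call.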

#4 Have well thought-out backpressure mechanisms
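
A simple form of backpressure is a bounded queue that refuses new work when it is full, so producers learn about overload immediately instead of the service running out of memory. The following Python sketch assumes a hypothetical `BoundedInbox` class; rejecting (load shedding) is only one possible policy, and blocking the producer is another.

```python
import queue


class BoundedInbox:
    """Work queue with a hard capacity; rejects new items when full."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0  # count of shed items, useful as a metric

    def offer(self, item):
        """Try to enqueue without blocking; return False when the inbox is full."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        """Remove and return the oldest item; raises queue.Empty if none."""
        return self._queue.get_nowait()
```

A caller that sees `offer` return `False` can retry later, drop the request, or return an error to its own client, which pushes the pressure back toward the source of the load.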

#5 Use cross-region (or cross-datacenter) replication

#6 Failure models should be built into the business requirements of a service

Read:

Even with all of that, your app will still fail, so how do you recover quickly?

DevOps/CloudOps Model: OODA (Observe, Orient, Decide, Act)

Observe and Orient: you need metrics and dashboards

You Need Metrics

• Reduce “map/territory” confusion
• We use Yammer Metrics
  – Timers
  – Meters
  – Histograms
• We use them a lot
  – Every class has at least one metric, most have multiple
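
To make the three metric types concrete, here is a rough Python sketch of what a meter, a histogram and a timer each record. It is not the Yammer Metrics API (a Java library); the class and method names below are assumptions chosen for illustration, and the histogram keeps every sample, where a production library would typically keep a bounded sample.

```python
import math
import time


class Meter:
    """Counts events and reports a mean rate since creation."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def mean_rate(self):
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0


class Histogram:
    """Records the distribution of a stream of values."""

    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def percentile(self, p):
        """Nearest-rank percentile for p in (0, 100]."""
        if not self.values:
            raise ValueError("no samples recorded")
        ordered = sorted(self.values)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


class Timer:
    """Context manager: a histogram of durations plus a meter of call rate."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.durations = Histogram()
        self.calls = Meter(clock)

    def __enter__(self):
        self._started = self._clock()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.durations.update(self._clock() - self._started)
        self.calls.mark()
        return False  # never swallow exceptions from the timed block
```

Wrapping a request handler in `with timer:` then yields both a latency distribution (`timer.durations.percentile(99)`) and a throughput figure (`timer.calls.mean_rate()`), the kind of numbers a dashboard would plot.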

You Need to Visualize the Metrics

You Need Dashboards Keyed to Business Functionality

Use alerting as a last resort (because sometimes we need to sleep)

Decide and Act: you need robust CI and fast code roll-outs

Rinse and Repeat