Fault Tolerance in a High Volume, Distributed System


More information can be found at http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html


Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen

Dozens of dependencies.

One going down takes everything down.

99.99%^30 = 99.7% uptime
0.3% of 1 billion = 3,000,000 failures

2+ hours downtime/month
even if all dependencies have excellent uptime.

Reality is generally worse.
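The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it assumes that dependencies fail independently and that a month has 30 days. The slide rounds the combined figure to 99.7%, so the exact results come out slightly below 3,000,000 failures and slightly above 2 hours.

```java
// Illustrative sketch (hypothetical names): compound uptime of a request
// that needs every one of its dependencies to succeed.
final class CompoundAvailability {

    /** Probability that all dependencies succeed, assuming independent failures. */
    static double combinedUptime(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    public static void main(String[] args) {
        double uptime = combinedUptime(0.9999, 30);
        // Expected failures out of one billion requests.
        double failuresPerBillion = (1 - uptime) * 1_000_000_000d;
        // Expected downtime in a 30-day month, in hours.
        double downtimeHoursPerMonth = (1 - uptime) * 30 * 24;
        System.out.printf("uptime=%.4f failures=%.0f downtime=%.2fh%n",
                uptime, failuresPerBillion, downtimeHoursPerMonth);
    }
}
```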


No single dependency should take down the entire app.

Fail fast. Fail silent. Fallback.

Shed load.


Options

Aggressive Network Timeouts

Semaphores (Tryable)

Separate Threads

Circuit Breaker


Semaphores (Tryable): Limited Concurrency

    TryableSemaphore executionSemaphore = getExecutionSemaphore();
    // acquire a permit
    if (executionSemaphore.tryAcquire()) {
        try {
            return executeCommand();
        } finally {
            executionSemaphore.release();
        }
    } else {
        circuitBreaker.markSemaphoreRejection();
        // permit not available so return fallback
        return getFallback();
    }
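The same pattern can be tried out with the standard library. The sketch below is not the Netflix implementation: it substitutes java.util.concurrent.Semaphore for the TryableSemaphore on the slide, and the SemaphoreGuard class and its execute method are hypothetical names. The key point is that tryAcquire() never blocks, so a caller that cannot get a permit goes straight to the fallback.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative sketch (hypothetical class): limit concurrent calls to a
// dependency and return a fallback instead of waiting when the limit is hit.
final class SemaphoreGuard {
    private final Semaphore permits;

    SemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        // tryAcquire() returns immediately; it never queues the caller.
        if (permits.tryAcquire()) {
            try {
                return command.get();
            } finally {
                permits.release();
            }
        }
        // No permit available: shed the load and use the fallback.
        return fallback.get();
    }
}
```

For example, a guard created with new SemaphoreGuard(10) would let at most ten callers into the dependency at once; the eleventh concurrent caller gets the fallback without waiting.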



Separate Threads: Limited Concurrency

    try {
        if (!threadPool.isQueueSpaceAvailable()) {
            // we are at the property defined max so want to throw the
            // RejectedExecutionException to simulate reaching the real max
            // and go through the same codepath and behavior
            throw new RejectedExecutionException(
                "Rejected command because thread-pool queueSize is at rejection threshold.");
        }

        ... define Callable that performs executeCommand() ...

        // submit the work to the thread-pool
        return threadPool.submit(command);
    } catch (RejectedExecutionException e) {
        circuitBreaker.markThreadPoolRejection();
        // rejected so return fallback
        return getFallback();
    }
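A minimal, self-contained version of this idea can be built on ThreadPoolExecutor with a bounded queue. This is a sketch under assumptions, not the code behind the slide: ThreadPoolGuard is a hypothetical class, it relies on the executor's default AbortPolicy to throw RejectedExecutionException when both the worker threads and the queue are full, and it leaves out the separate property-defined threshold check (isQueueSpaceAvailable()) shown above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative sketch (hypothetical class): run dependency calls on their own
// bounded thread pool and return a fallback when the pool is saturated.
final class ThreadPoolGuard {
    private final ThreadPoolExecutor pool;

    ThreadPoolGuard(int threads, int queueSize) {
        // Fixed number of threads plus a bounded queue; the default
        // AbortPolicy rejects work once both are full.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    <T> Future<T> submit(Callable<T> command, T fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            // Pool and queue are full: answer immediately with the fallback.
            return CompletableFuture.completedFuture(fallback);
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Because each dependency gets its own pool, a slow dependency can exhaust only its own threads and queue slots rather than the caller's request threads.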


Separate Threads: Timeout

Override of Future.get()

    public K get() throws CancellationException, InterruptedException, ExecutionException {
        try {
            long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds();
            return get(timeout, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // report timeout failure
            circuitBreaker.markTimeout(System.currentTimeMillis() - startTime);

            // retrieve the fallback
            return getFallback();
        }
    }
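The timeout behavior can be reproduced with any standard Future. The sketch below uses a hypothetical TimedCall helper rather than an override of Future.get(), and it adds one step that is not on the slide: cancelling the timed-out future so the worker thread can be interrupted. Treat that cancellation as an assumption about what a caller might want, not as part of the original design.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch (hypothetical helper): wait a bounded time for a result
// and "give up and move on" with a fallback when the time runs out.
final class TimedCall {

    static <T> T getOrFallback(Future<T> future, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Assumption, not shown on the slide: ask the task to stop.
            future.cancel(true);
            return fallback;
        }
    }
}
```

The caller thread is released after at most timeoutMillis, regardless of how long the dependency itself takes to respond.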



Circuit Breaker

    if (circuitBreaker.allowRequest()) {
        return executeCommand();
    } else {
        // short-circuit and go directly to fallback
        circuitBreaker.markShortCircuited();
        return getFallback();
    }
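The slides do not show how allowRequest() decides, so the following is only one possible sketch. It assumes a simple policy: the circuit opens after a fixed number of consecutive failures and lets requests through again once a sleep window has passed. The SimpleCircuitBreaker class, its thresholds, and the injected clock are all hypothetical; the dashboard later in the deck suggests the real implementation works from a rolling error percentage instead.

```java
import java.util.function.LongSupplier;

// Illustrative sketch (hypothetical class and policy): open after N consecutive
// failures, then allow requests again once a sleep window has elapsed.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock; // injected so the behavior can be tested
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0L;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (consecutiveFailures < failureThreshold) {
            return true; // circuit closed
        }
        // Circuit open: only allow traffic after the sleep window.
        return clock.getAsLong() - openedAtMillis >= sleepWindowMillis;
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0; // closes the circuit
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAtMillis = clock.getAsLong(); // (re)start the sleep window
        }
    }
}
```

In production code the clock would typically be System::currentTimeMillis. A failed trial request after the window restarts the window, so a dependency that is still down keeps being short-circuited.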


Netflix uses all 4 in combination


Tryable semaphores for “trusted” clients and fallbacks

Separate threads for “untrusted” clients

Aggressive timeouts on threads and network calls to “give up and move on”

Circuit breakers as the “release valve”


Benefits of Separate Threads

Protection from client libraries

Lower risk to accept new/updated clients

Quick recovery from failure

Client misconfiguration

Client service performance characteristic changes

Built-in concurrency

Drawbacks of Separate Threads

Some computational overhead

Load on machine can be pushed too far

...

Benefits outweigh drawbacks when clients are “untrusted”

Visualizing Circuits in Realtime (generally sub-second latency)

Video available at https://vimeo.com/33576628

Rolling 10 second counter – 1 second granularity

Latency: Median, Mean, 90th, 99th, 99.5th

Counters: Latent, Error, Timeout, Rejected

Error Percentage = (error + timeout + rejected) / (success + latent success + error + timeout + rejected)
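A rolling counter with one-second buckets and the error-percentage formula above can be sketched as follows. The RollingCounter and CircuitHealth classes are hypothetical, the caller passes in the current time in milliseconds, and the ratio is multiplied by 100 to express it as a percentage. This is not the Netflix implementation, only one way to realize "rolling 10 second counter, 1 second granularity".

```java
import java.util.Arrays;

// Illustrative sketch (hypothetical class): counts events over the last
// N seconds using one bucket per second in a ring buffer.
final class RollingCounter {
    private final long[] bucketSecond; // which second each slot currently holds
    private final long[] counts;

    RollingCounter(int windowSeconds) {
        bucketSecond = new long[windowSeconds];
        counts = new long[windowSeconds];
        Arrays.fill(bucketSecond, -1L);
    }

    synchronized void increment(long nowMillis) {
        long second = nowMillis / 1000;
        int slot = (int) (second % counts.length);
        if (bucketSecond[slot] != second) { // slot holds an expired second: reuse it
            bucketSecond[slot] = second;
            counts[slot] = 0;
        }
        counts[slot]++;
    }

    synchronized long sum(long nowMillis) {
        long second = nowMillis / 1000;
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            // Only include buckets that still fall inside the window.
            if (bucketSecond[i] >= 0 && second - bucketSecond[i] < counts.length) {
                total += counts[i];
            }
        }
        return total;
    }
}

// Illustrative sketch (hypothetical class): the error percentage from the slide.
final class CircuitHealth {
    static double errorPercentage(long success, long latentSuccess,
                                  long error, long timeout, long rejected) {
        long failures = error + timeout + rejected;
        long total = success + latentSuccess + failures;
        return total == 0 ? 0.0 : 100.0 * failures / total;
    }
}
```

One counter per outcome (success, error, timeout, rejected, and so on) would feed errorPercentage, and a circuit breaker could compare the result against a threshold.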


Netflix DependencyCommand Implementation


Fallbacks

Cache
Eventual Consistency
Stubbed Data
Empty Response
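Two of these fallback styles, a cached value and an empty response, can be combined in a few lines. The sketch below is hypothetical throughout: CachedFallbackCommand, its in-memory cache, and the remoteCall function are illustration names, not part of the DependencyCommand shown in the slides. It serves the last known good result when the dependency fails ("fallback") and an empty list when nothing is cached ("fail silent").

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch (hypothetical class): fall back to the last cached
// result, and to an empty response when no cached result exists.
final class CachedFallbackCommand {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> remoteCall;

    CachedFallbackCommand(Function<String, List<String>> remoteCall) {
        this.remoteCall = remoteCall;
    }

    List<String> run(String key) {
        try {
            List<String> fresh = remoteCall.apply(key);
            cache.put(key, fresh); // remember the last good answer
            return fresh;
        } catch (RuntimeException e) {
            List<String> stale = cache.get(key);
            // Possibly stale data beats an error; empty beats an exception.
            return stale != null ? stale : Collections.emptyList();
        }
    }
}
```

Whether stale or empty data is acceptable depends on the use case; that choice is a product decision, which this sketch does not try to settle.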


Rolling Number
Realtime Stats and Decision Making

Request Collapsing
Take advantage of resiliency to improve efficiency


Fail fast. Fail silent. Fallback.

Shed load.


Questions & More Information

Fault Tolerance in a High Volume, Distributed System
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html

Making the Netflix API More Resilient
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

Ben Christensen
@benjchristensen

http://www.linkedin.com/in/benjchristensen
