Failure Self Defense: Defend your App against failures in a (micro) services world

Tony Fabeen

[Diagram, built up over several slides: APP in the middle of a growing set of dependencies: CRM, Email, Billing, Payments, Database, Search, Queue]

WELCOME TO THE DEV PARTY

[Diagram: several APP instances sharing a Database; APP calling the Products, Orders and Payments services]

module Api
  class Products
    def initialize(host = nil)
      @client = HTTPClient.new(host)
    end

    def all
      @client.get('/products')
    end
  end
end

def send_checkout
  params = { email: @email, token: @token, ssl_version: :SSLv3 }

  RestClient.post(checkout_url, checkout_xml,
                  params: params,
                  content_type: "application/xml") { |resp, request, result| resp }
end

[Diagram: APP depending on Orders and Payments]

CASCADING FAILURE

MAP THE DEPENDENCIES THAT IMPACT YOUR SYSTEM

[Table, built up over several slides: Feature x Service dependency matrix]
Services: Products, Orders, Payments
Features: Catalog, Checkout, Pack, Route, Authorize, Charge
Each cell shows the state of a feature when a service fails: Up, Degraded or Down

https://github.com/Shopify/toxiproxy

Toxiproxy.populate([{ name: 'redis', listen: '127.0.0.1:22222', upstream: '127.0.0.1:6379' }])

context 'when service UP' do
  before { Cache.put('key', 'value') }

  it 'saves value' do
    expect(Cache.get('key')).to eq('value')
  end
end

context 'when service DOWN' do
  it 'raises an error' do
    # Toxiproxy takes the proxied Redis connection down for the duration of the block
    Toxiproxy[:redis].down do
      expect { Cache.put('key', 'value') }
        .to raise_error(Redis::CannotConnectError)
    end
  end
end


Fault Tolerance

An application with an average Response Time of 60 ms can process 1,000 Requests Per Minute (RPM) per Thread.

How many Threads do we need to handle 100,000 RPM of Throughput?

100

Imagine that 1% of the traffic times out on a Service after 30 seconds: the average Response Time rises to about 360 ms.

How many Threads do we need to handle 100,000 RPM of Throughput?

600
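
The arithmetic behind those two answers can be sketched in a few lines of Ruby. The method names are illustrative, not from the slides, and the model assumes each thread handles one request at a time.

```ruby
# Threads needed to sustain target_rpm when each request occupies a thread
# for avg_response_ms milliseconds (a minute has 60,000 ms).
def threads_needed(target_rpm, avg_response_ms)
  (target_rpm * avg_response_ms / 60_000.0).ceil
end

# Average response time when a fraction of the calls hang until a timeout.
def avg_response_ms(normal_ms:, timeout_ms:, timeout_ratio:)
  (1 - timeout_ratio) * normal_ms + timeout_ratio * timeout_ms
end

healthy_threads = threads_needed(100_000, 60)

# 99% of calls at 60 ms plus 1% at 30,000 ms is roughly 359.4 ms,
# which the slides round to 360 ms.
degraded_ms = avg_response_ms(normal_ms: 60, timeout_ms: 30_000, timeout_ratio: 0.01)
degraded_threads = threads_needed(100_000, 360)

puts "healthy: #{healthy_threads} threads"
puts "degraded: #{degraded_threads} threads at ~#{degraded_ms.round(1)} ms"
```

A 1% timeout rate is enough to multiply the required thread count by six, which is why a slow dependency hurts more than a dead one.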

[Diagram: a Service answering every call in 0.01 s handles 600 RPM; when most calls slow down to 0.10 s, the same Service handles 60 RPM]

High Response Time

Less Throughput

Fail Fast

Low timeouts

Connection timeout
Socket Read timeout
Resource acquisition
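
As a minimal sketch of the first two timeouts, Ruby's standard Net::HTTP exposes `open_timeout` (connection) and `read_timeout` (socket read). The helper name, the host and the one and two second values are assumptions chosen for illustration; resource acquisition timeouts (for example, waiting for a pooled connection) depend on the pool library and are not shown.

```ruby
require 'net/http'

# Hypothetical helper: builds an HTTP client that gives up quickly
# instead of tying up a thread on a slow dependency.
def fail_fast_client(host, port = 80, connect_timeout: 1, read_timeout: 2)
  http = Net::HTTP.new(host, port)   # does not connect yet
  http.open_timeout = connect_timeout # seconds to wait for the TCP connection
  http.read_timeout = read_timeout    # seconds to wait for each socket read
  http
end

# 'products.example.internal' is a placeholder host.
client = fail_fast_client('products.example.internal')
```

When a limit is exceeded, Net::HTTP raises `Net::OpenTimeout` or `Net::ReadTimeout`, which the caller can rescue to fall back gracefully.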

Fail Gracefully

class Cache
  # `service` is not shown in the slides; it is assumed to return a Redis client.
  def self.put(key, value)
    service.set(key, value)
  end

  def self.get(key)
    service.get(key)
  end
end

Cache.put('key', 'value')
Cache.get('key')

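class Cache
  # Returns true when the value was stored, false when Redis is unreachable.
  def self.put(key, value)
    service.set(key, value)
    true
  rescue Redis::CannotConnectError => error
    AwesomeLogger.log(error)
    false
  end

  # Returns the cached value, or fallback_value when Redis is unreachable.
  def self.get(key, fallback_value = nil)
    service.get(key)
  rescue Redis::CannotConnectError => error
    AwesomeLogger.log(error)
    fallback_value
  end
end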

Don't try if you can't succeed

Circuit Breakers

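[Diagram, built up over several slides: Client -> Circuit Breaker -> Service. While the circuit is Closed, calls reach the Service. After errors the circuit is Open and the Client gets an Error immediately, without the Service being called. When the Service is reachable again, the circuit is Closed again.]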
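
The Closed/Open behaviour can be sketched as a small Ruby class. This is an illustration under assumptions, not the implementation used in the talk: the class name, the failure threshold, the reset timeout and the injectable clock are all made up for the example.

```ruby
# Minimal circuit breaker sketch (illustrative names and defaults).
class CircuitBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(failure_threshold: 3, reset_timeout: 5,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout # seconds to stay open before a trial call
    @clock = clock
    @failures = 0
    @state = :closed
    @opened_at = nil
  end

  def call
    if @state == :open
      # Don't try if you can't succeed: reject without touching the service.
      raise OpenError, 'circuit is open' if @clock.call - @opened_at < @reset_timeout
      @state = :half_open # reset timeout elapsed, let one trial call through
    end

    begin
      result = yield
    rescue StandardError
      record_failure
      raise
    end
    reset
    result
  end

  private

  def record_failure
    @failures += 1
    return unless @state == :half_open || @failures >= @failure_threshold

    @state = :open
    @opened_at = @clock.call
  end

  def reset
    @failures = 0
    @state = :closed
  end
end
```

A caller would wrap each service call, for example `breaker.call { Cache.get('key') }`, and rescue `CircuitBreaker::OpenError` to return a fallback.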

Slow services

High timeouts

Bulkheads

[Diagram, built up over several slides: a Bulkhead limits how many concurrent requests may reach a dependency. FeatureA, FeatureB and FeatureC each take a slot (2 available, 1 available, no requests available). Once no slots remain, further requests get an Error.]
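
A bulkhead can be sketched as a counter of in-flight calls guarded by a mutex. The class and method names below are assumptions for illustration; the slides do not name a library.

```ruby
# Minimal bulkhead sketch: at most max_concurrent calls may be in flight.
class Bulkhead
  class RejectedError < StandardError; end

  def initialize(max_concurrent)
    @max_concurrent = max_concurrent
    @in_flight = 0
    @lock = Mutex.new
  end

  # Number of slots still free.
  def available
    @lock.synchronize { @max_concurrent - @in_flight }
  end

  # Runs the block if a slot is free, otherwise rejects immediately.
  def call
    acquire
    begin
      yield
    ensure
      release
    end
  end

  private

  def acquire
    @lock.synchronize do
      raise RejectedError, 'no requests available' if @in_flight >= @max_concurrent

      @in_flight += 1
    end
  end

  def release
    @lock.synchronize { @in_flight -= 1 }
  end
end
```

Giving each dependency its own bulkhead keeps one slow service from consuming every thread in the application.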

Monitor Service Calls

Timeout rate
Rejected call rate
Short circuit rate
Failure/Success rate
Response Times
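
As a rough sketch of how some of these numbers could be collected in process, the wrapper below counts outcomes and records response times. The class name and outcome labels are hypothetical; a real setup would more likely forward them to a metrics system.

```ruby
require 'timeout'

# Minimal in-memory monitor for calls to one service (illustrative only).
class ServiceCallMonitor
  attr_reader :counts, :durations

  def initialize
    @counts = Hash.new(0) # outcome => number of calls
    @durations = []       # response times in seconds
  end

  # Runs the block, records its outcome and duration, and re-raises errors.
  def measure
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    @counts[:success] += 1
    result
  rescue Timeout::Error
    @counts[:timeout] += 1
    raise
  rescue StandardError
    @counts[:failure] += 1
    raise
  ensure
    @durations << Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end

  # Share of all recorded calls that ended with the given outcome.
  def rate(outcome)
    total = @counts.values.sum
    total.zero? ? 0.0 : @counts[outcome].to_f / total
  end
end
```

Rejected and short-circuited calls could be counted the same way by rescuing the errors raised by a bulkhead or circuit breaker.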

Summary

Know your dependencies: improve your test suite
Fail Fast: timeouts
Fail Gracefully: fallbacks
Don't try if you can't succeed: Circuit Breakers and Bulkheads are friends
Monitor Service Calls: notice problems

?

https://github.com/tonyfabeen

https://twitter.com/tonyfabeen

https://linkedin.com/tonyfabeen