How to Build High-Volume, Scalable, and Resilient APIs (EXP18038)

Resilient APIs

@JPENNINKHOF

Challenge

“If you add up all the smartphones and the tablets and the digital televisions and the PCs... we see a large opportunity of perhaps 3 billion to 4 billion units per annum, but we see an embedded market that’s maybe 30 billion to 40 billion units per annum”

- ARM CEO Warren East

Problem definition

For example, an application that depends on 30 services, each with 99.99% uptime, gets:

99.99%^30 ≈ 99.7% uptime

0.3% of 1 million requests = 3,000 failures

2+ hours downtime/month even if all dependencies have excellent uptime.

Reality is generally worse.
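The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes the 30 dependencies fail independently and that a month has 30 days; the function name is illustrative only.

```python
def compound_uptime(per_service_uptime: float, n_services: int) -> float:
    """Availability when every one of n independent dependencies must be up."""
    return per_service_uptime ** n_services

uptime = compound_uptime(0.9999, 30)
failures_per_million = (1 - uptime) * 1_000_000
downtime_hours_per_month = (1 - uptime) * 30 * 24  # assumes a 30-day month

print(f"{uptime:.2%} uptime")
print(f"{failures_per_million:.0f} failures per million requests")
print(f"{downtime_hours_per_month:.1f} hours of downtime per month")
```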

API vulnerability

API Fallbacks

Design principles

• Restrict any single dependency from using up all user threads.

• Shed load and fail fast instead of queueing.

• Provide fallbacks wherever feasible to protect users from failure.

• Use isolation techniques (such as bulkhead, swimlane and circuit breaker patterns) to limit the impact of any one dependency.

• Optimize for time-to-discovery through near real-time metrics, monitoring and alerting.

• Optimize for time-to-recovery with low-latency propagation of configuration changes and support for dynamic property changes in virtually all aspects of Hystrix, allowing real-time operational modifications with low-latency feedback loops.

• Protect against failures in the entire dependency client execution, not just in the network traffic.

Use timeouts

Time out calls that take longer than defined thresholds. A default exists, but for most dependencies the threshold is custom-set via properties to be just slightly higher than the measured 99.5th percentile performance for each dependency.
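As an illustration of the timeout idea, here is a minimal Python sketch using the standard library. The helper name, the pool size and the example dependency are assumptions made for this example; it is not how Hystrix itself implements timeouts.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)  # pool size is an arbitrary choice

def call_with_timeout(fn, timeout_s, *args):
    """Run fn in a worker thread; stop waiting after timeout_s seconds."""
    future = _executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort: a call that is already running keeps going
        raise TimeoutError(f"dependency call exceeded {timeout_s}s")

def slow_dependency():
    """Hypothetical dependency that answers too slowly."""
    time.sleep(0.5)
    return "late answer"

try:
    call_with_timeout(slow_dependency, 0.05)
except TimeoutError as exc:
    print("timed out:", exc)
```

Note that the caller is released after the threshold, but the worker thread stays busy until the slow call returns, which is one reason to pair timeouts with a bounded pool per dependency.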

Bulkheads

Maintain a small thread pool (or semaphore) for each dependency; if it becomes full, commands are rejected immediately instead of being queued. A dependency with a clogged thread pool shouldn’t hinder access to other dependencies.
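The semaphore variant of a bulkhead can be sketched as follows. The class and exception names are made up for this example; the key point is the non-blocking acquire, which rejects work instead of queueing it.

```python
import threading

class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already in use."""

class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args):
        # Non-blocking acquire: fail fast rather than wait for a free slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead full, request rejected")
        try:
            return fn(*args)
        finally:
            self._slots.release()

# One bulkhead per dependency keeps a slow one from starving the others.
ratings_bulkhead = Bulkhead(max_concurrent=10)
print(ratings_bulkhead.run(lambda: "ratings payload"))
```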

Circuit breakers

Trip a circuit breaker automatically or manually to stop all requests to that service for a period of time if the error percentage passes a threshold.
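A deliberately simplified circuit breaker might look like the sketch below. It assumes a plain error ratio rather than a rolling window, has no half-open trial state, and is not thread-safe; all names and default values are illustrative.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is short-circuited because the breaker is open."""

class CircuitBreaker:
    def __init__(self, error_threshold=0.5, min_calls=5, open_seconds=5.0,
                 clock=time.monotonic):
        self.error_threshold = error_threshold  # error ratio that trips the breaker
        self.min_calls = min_calls              # do not trip on too small a sample
        self.open_seconds = open_seconds        # how long to reject requests
        self._clock = clock
        self._calls = 0
        self._errors = 0
        self._opened_at = None

    def force_open(self):
        """Manual trip, e.g. by an operator."""
        self._opened_at = self._clock()

    def call(self, fn, *args):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.open_seconds:
                raise CircuitOpenError("circuit open, request short-circuited")
            # Open period is over: close the circuit and count afresh.
            self._opened_at = None
            self._calls = self._errors = 0
        self._calls += 1
        try:
            return fn(*args)
        except Exception:
            self._errors += 1
            if (self._calls >= self.min_calls
                    and self._errors / self._calls >= self.error_threshold):
                self._opened_at = self._clock()
            raise
```

Injecting the clock keeps the open period testable without real sleeps.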

Fallback logic

Perform fallback logic when a request fails, is rejected, times out or is short-circuited.
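In its simplest form, a fallback is a second function that runs whenever the primary call raises. The sketch below uses hypothetical recommendation functions as the example; what a sensible fallback returns (a cached value, a static default, an empty result) depends on the dependency.

```python
def with_fallback(primary, fallback):
    """Return primary(); if it raises for any reason, return fallback() instead."""
    try:
        return primary()
    except Exception:
        return fallback()

def personalized_recommendations():
    """Hypothetical dependency call that is currently failing."""
    raise TimeoutError("recommendation service timed out")

def popular_titles():
    """Hypothetical static fallback that needs no remote call."""
    return ["popular title 1", "popular title 2"]

print(with_fallback(personalized_recommendations, popular_titles))
```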

Measure

Measure successes, failures (exceptions thrown by the client), timeouts, and thread rejections.
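The four outcome types can be counted with a thin wrapper around each dependency call. This is a sketch only: the class names are invented here, and it assumes timeouts surface as the built-in TimeoutError and rejections as a dedicated exception type.

```python
from collections import Counter

class RejectedError(Exception):
    """Hypothetical signal that a thread pool or semaphore rejected the call."""

class CommandMetrics:
    def __init__(self):
        self.counts = Counter()

    def record(self, fn, *args):
        """Run fn, classify the outcome, and re-raise any error unchanged."""
        try:
            result = fn(*args)
        except TimeoutError:
            self.counts["timeout"] += 1
            raise
        except RejectedError:
            self.counts["rejected"] += 1
            raise
        except Exception:
            self.counts["failure"] += 1
            raise
        self.counts["success"] += 1
        return result

    def error_percentage(self):
        total = sum(self.counts.values())
        if total == 0:
            return 0.0
        return 100.0 * (total - self.counts["success"]) / total
```

An error percentage computed this way is the kind of signal a circuit breaker threshold can be compared against.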

Request collapsing

Collapse multiple concurrent user requests into a single backend dependency call (within a short time window of, e.g., 10 ms).
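One way to picture request collapsing is an asyncio helper that gathers keys for a short window and then issues one batch call. The class, the batch function contract (list of keys in, dict of results out) and the 10 ms default are assumptions for this sketch.

```python
import asyncio

class RequestCollapser:
    def __init__(self, batch_fn, window_s=0.010):
        self._batch_fn = batch_fn   # async function: list of keys -> {key: value}
        self._window_s = window_s
        self._pending = {}          # key -> Future shared by every caller of that key
        self._flush_task = None

    async def get(self, key):
        loop = asyncio.get_running_loop()
        if key not in self._pending:
            self._pending[key] = loop.create_future()
        future = self._pending[key]
        if self._flush_task is None:
            # First request in this window starts the timer.
            self._flush_task = asyncio.ensure_future(self._flush())
        return await future

    async def _flush(self):
        await asyncio.sleep(self._window_s)
        batch, self._pending = self._pending, {}
        self._flush_task = None
        try:
            results = await self._batch_fn(list(batch))
            for key, future in batch.items():
                future.set_result(results[key])
        except Exception as exc:
            for future in batch.values():
                if not future.done():
                    future.set_exception(exc)

backend_calls = []  # records each simulated backend call

async def fetch_titles(keys):
    """Hypothetical batch endpoint on the backend dependency."""
    backend_calls.append(sorted(keys))
    return {k: k.upper() for k in keys}

async def main():
    collapser = RequestCollapser(fetch_titles)
    return await asyncio.gather(
        collapser.get("a"), collapser.get("b"), collapser.get("a"))

titles = asyncio.run(main())
```

The trade-off is a small added latency (up to the window length) in exchange for fewer backend calls.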

Request caching

Reduce the number of requests sent to the backend dependencies by caching and de-duplicating requests.
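A request-scoped cache is enough to de-duplicate repeated lookups within one user request. The sketch below assumes synchronous loaders and that the cache object is discarded when the request ends; the names are illustrative.

```python
class RequestCache:
    """Cache scoped to one incoming user request; discard it when the request ends."""

    def __init__(self):
        self._values = {}

    def get(self, key, loader):
        # Only the first lookup for a key reaches the backend.
        if key not in self._values:
            self._values[key] = loader(key)
        return self._values[key]

backend_hits = []  # records each simulated backend call

def load_title(title_id):
    """Hypothetical backend lookup."""
    backend_hits.append(title_id)
    return {"id": title_id}

cache = RequestCache()
first = cache.get("t-1", load_title)
second = cache.get("t-1", load_title)  # served from the cache
```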

Define a pipeline and context

Many services share base functionality such as authentication. Defining a clear request pipeline and context optimizes shared logic and prevents repeated calls (e.g. getCustomer).
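The pipeline-and-context idea can be sketched as a context object that is filled once and then passed through each stage. Everything here (the stage names, the context fields, the get_customer stand-in for getCustomer) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

lookups = []  # records each simulated getCustomer call

def get_customer(token):
    """Hypothetical stand-in for the shared getCustomer lookup."""
    lookups.append(token)
    return {"id": "c-1", "token": token}

@dataclass
class RequestContext:
    token: str
    customer: Optional[dict] = None
    response: dict = field(default_factory=dict)

def authenticate(ctx):
    ctx.customer = get_customer(ctx.token)  # the only lookup in the pipeline
    return ctx

def add_queue(ctx):
    ctx.response["queue_owner"] = ctx.customer["id"]  # reuses the context
    return ctx

def add_ratings(ctx):
    ctx.response["ratings_owner"] = ctx.customer["id"]  # reuses the context
    return ctx

def run_pipeline(ctx, stages):
    for stage in stages:
        ctx = stage(ctx)
    return ctx

ctx = run_pipeline(RequestContext(token="t-123"),
                   [authenticate, add_queue, add_ratings])
```

Later stages read the customer from the context, so the lookup happens once per request regardless of how many stages need it.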

Don’t lock the bonnet

Make it possible to switch on logging and to direct certain traffic to a specific node.

REST vs Experience API

/users/<id>/ratings/title

/users/<id>/queues

/users/<id>/queues/instant

/users/<id>/recommendations

/catalog/titles/movie

/catalog/titles/series

/catalog/people

VS

Example: /phone/homescreen

User Interface Rendering

Data gathering, formatting and delivery
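To make the contrast concrete, an experience endpoint such as /phone/homescreen could gather several resource lookups on the server and return one device-specific payload. The sketch below is an assumption about how that might look; the fetcher functions are stubs standing in for resource calls like /users/<id>/queues/instant and /users/<id>/recommendations.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_instant_queue(user_id):
    """Stub for a call to /users/<id>/queues/instant."""
    return [f"queued title for {user_id}"]

def fetch_recommendations(user_id):
    """Stub for a call to /users/<id>/recommendations."""
    return [f"recommended title for {user_id}"]

def phone_homescreen(user_id, fetchers):
    """Build one payload for the device from several resource lookups."""
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        # Start all lookups concurrently, then collect the results by name.
        futures = {name: pool.submit(fn, user_id) for name, fn in fetchers.items()}
        return {name: future.result() for name, future in futures.items()}

payload = phone_homescreen("u1", {
    "queue": fetch_instant_queue,
    "recommendations": fetch_recommendations,
})
print(payload)
```

The device makes one round trip and only renders the result, while data gathering and formatting stay on the server side.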


Thanks for listening!
