Borg-Like Resilience for Your Microservices


Philip Lombardi, Engineer

datawire.io

Background...

1. Philip Lombardi @ Datawire.io (twitter: @TheBigLombowski)

2. Datawire.io is building a Microservices Development Kit to enable developers to build resilient microservice applications.

3. Check us out after the talk: app.datawire.io


What is a Microservice?


Common Microservice Definitions

It’s a service that is...

● Small
● Self contained
● Narrow in scope
● Bounded context
● Independent
● Loosely coupled

Sort of…

All of these describe attributes of a microservice rather than define one.


A (simpler) Microservices Definition

A Microservice is a unit of business logic.

A Microservice application is a distributed composition of business logic via services.


Microservices Benefits and Tradeoffs

● Easier to reason about the individual components that make up the system.

● Easy to add new biz logic.

● More difficult to deploy than a classic monolith.

● More difficult to operate.


Combine to build AWESOME!


“Death Star Topology”


Awesome… until someone puts a torpedo down the vent shaft!


In reality, they rarely seem to explode...

1. When was the last time you can remember Netflix being down? Or Uber? Or Yelp? Twitter?

2. The actual Death Stars were brittle and had a clear single point of failure. But these Death Star topologies are NOT brittle.


These systems are very resilient...

They survive whole classes of problems...

● Hardware and network issues
● Software bugs
● Security exploits

Engineers:

● Find and fix issues without stopping the system.
● Add new features to the product the biz logic represents.
● Alter the system multiple times a day, causing the topology to shift and change constantly.


These systems are a lot more like The Borg


The Borg...

1. A collective hive of drones (biz logic) that are loosely controlled by The Queen (orchestration, discovery).

2. Nearly impossible to stop:

   a. Routinely take on numerous adversaries (bugs, security threats).

   b. Continue to make progress even when The Collective cannot communicate with all Drones, for example while some are off on secondary objectives (hardware failures, network outages).

   c. Forced evolution by adopting best-of-breed technologies. What doesn’t kill them just makes them stronger (continuous integration and improvement).

3. The Borg assimilate new cultures and tech to strengthen their operational efficiency and resiliency.


How did these companies become Borg-like?

1. New Architecture!

2. The new architecture made them extremely resilient to infrastructure failure AND software bugs.


Failure Types...

● There are the kind everyone always engineers for…

○ Network failures
○ Server failures
○ Storage failures
○ Resource Limits

● And then there are the kinds we often think we’re engineering for…

○ Integration Bugs
○ Functional Bugs


Integration Bugs… Your new worst enemy.

● In a Microservices app your biggest issue will be the integration bug.

● At a certain app size it’s nearly impossible to get the whole system up and running just to run integration tests.

● There’s also no compiler to save you from yourself and there is no type safety at service boundaries.

● Integration bugs are a way of life in a Microservices app because of how the system is decomposed into many small independent units.


The new architecture

● Born from a need to be resilient to integration bugs

● Lets adopters move quickly, because they found a way to be resilient to both infrastructure-level issues AND software integration bugs.


Traditional Architecture

[Diagram: Client, DNS, Load Balancer, Server; arrows labeled "resolve" and "traffic"]

Routing table is (relatively) static

Routing policy is global


Load Balancers are designed to protect against infrastructure failures first and foremost.


Infrastructure is NOT the biggest cause of bugs and system failure in 2016...


Biggest issue is integration bugs...

[Diagram: Server instances labeled 1.1, 1.0, 1.0, 1.0]


New service is returning faulty JSON / XML / CSV to the Client.


LB thinks everything is OK. Client is exploding.
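A minimal sketch of how both statements can be true at once, assuming the load balancer's health check only looks at the HTTP status code (a common setup, but an assumption here). The names `Response`, `lb_health_check` and `client_parse`, and the "price" field, are hypothetical and exist only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    body: str


def lb_health_check(resp: Response) -> bool:
    # Assumption: the load balancer only checks for a 2xx status code.
    return 200 <= resp.status < 300


def client_parse(resp: Response) -> bool:
    # The client needs valid JSON with a "price" field (hypothetical contract).
    try:
        payload = json.loads(resp.body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and "price" in payload


v1_0 = Response(200, '{"price": 10}')
v1_1 = Response(200, '{"cost": 10')  # new version: truncated JSON, renamed field

for name, resp in [("1.0", v1_0), ("1.1", v1_1)]:
    print(name, "LB healthy:", lb_health_check(resp), "client OK:", client_parse(resp))
```

Version 1.1 passes the status-code check, so the load balancer keeps sending it traffic, while every client that receives its payload fails.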


But some smart folks figured out a better way...

● The architecture involves using “Smart Endpoints”.

● Each node is “smart” because:

○ Each node knows how to communicate with every other node without a load balancer. The intelligence of a central load balancer lives on each node.

○ When an integration bug happens, the client node can blacklist the misbehaving service node and keep using the remaining nodes.

● The end result is a mesh of intelligent intercommunicating service nodes (like the Borg Collective).


Smart Endpoints Architecture

[Diagram: Client, Discovery, Server, Smart Endpoint; arrows labeled "heartbeats" and "routes"]


Servers send their addresses to Discovery, followed by periodic heartbeats, which protects you against infrastructure issues.
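The heartbeat idea can be sketched as a toy in-memory registry. This is an illustration under stated assumptions, not the Datawire Discovery implementation: the `Discovery` class, its method names and the 3-second TTL are all hypothetical.

```python
import time


class Discovery:
    """Toy in-memory registry; entries expire when heartbeats stop."""

    def __init__(self, ttl_seconds: float = 3.0):
        self.ttl = ttl_seconds
        self.last_seen = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address, now=None):
        # In this sketch, registration and heartbeat are the same message.
        self.last_seen[(service, address)] = time.monotonic() if now is None else now

    def live_addresses(self, service, now=None):
        now = time.monotonic() if now is None else now
        return sorted(
            addr
            for (svc, addr), seen in self.last_seen.items()
            if svc == service and now - seen <= self.ttl
        )


discovery = Discovery(ttl_seconds=3.0)
discovery.heartbeat("pricing", "10.0.0.1:8080", now=0.0)
discovery.heartbeat("pricing", "10.0.0.2:8080", now=0.0)
discovery.heartbeat("pricing", "10.0.0.1:8080", now=2.0)  # only this server keeps beating

print(discovery.live_addresses("pricing", now=2.5))  # both still within the TTL
print(discovery.live_addresses("pricing", now=4.0))  # the silent server has expired
```

A server that crashes or loses its network simply stops sending heartbeats and drops out of the list, with no explicit deregistration step.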


Discovery pushes addresses to clients, which keep the server addresses in a local hash table. Discovery is not a single point of failure (SPOF) because it’s just a broker: each Client owns its own independent routing table.
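A rough sketch of that client-side routing table, assuming round-robin selection (the slides do not say which policy is used). `RoutingTable`, `on_push` and `resolve` are hypothetical names, not the MDK API.

```python
import itertools


class RoutingTable:
    """Client-side copy of the routes, kept in a plain dict (hash table)."""

    def __init__(self):
        self.routes = {}    # service name -> list of addresses
        self._cursors = {}  # service name -> round-robin counter

    def on_push(self, service, addresses):
        # Called whenever Discovery pushes a fresh address list.
        self.routes[service] = list(addresses)
        self._cursors[service] = itertools.count()

    def resolve(self, service):
        # Purely local lookup: no call to Discovery or to a load balancer.
        addresses = self.routes.get(service)
        if not addresses:
            raise LookupError(f"no known addresses for {service!r}")
        return addresses[next(self._cursors[service]) % len(addresses)]


table = RoutingTable()
table.on_push("pricing", ["10.0.0.1:8080", "10.0.0.2:8080"])

# Even if Discovery became unreachable at this point, the client would
# keep routing from its last known table.
picks = [table.resolve("pricing") for _ in range(3)]
print(picks)
```

Because `resolve` never leaves the process, losing Discovery only means the table stops receiving updates; existing routes keep working.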


In the Smart Endpoint model, when a Client talks to our buggy service and fails due to a software bug, it blacklists the node!


Failure is detected QUICKLY. Some traffic will still fail, but compared to the LB model it’s a tiny, tiny amount.
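To put rough numbers on "tiny", here is a deliberately simplified simulation. It assumes a single client, round-robin routing over four nodes, a 1.1 node that fails every request, and blacklisting after the first failure; none of these details come from the slides, and `simulate` is a hypothetical helper.

```python
def simulate(requests: int, blacklist_on_failure: bool) -> int:
    nodes = ["1.1", "1.0", "1.0", "1.0"]  # index 0 is the buggy 1.1 node
    blacklisted = set()
    failures = 0
    cursor = 0
    for _ in range(requests):
        # Round-robin over the nodes that are not blacklisted.
        candidates = [i for i in range(len(nodes)) if i not in blacklisted]
        node = candidates[cursor % len(candidates)]
        cursor += 1
        if nodes[node] == "1.1":  # the buggy version fails every time
            failures += 1
            if blacklist_on_failure:
                blacklisted.add(node)
    return failures


print("LB model failures:", simulate(1000, blacklist_on_failure=False))
print("Smart Endpoint failures:", simulate(1000, blacklist_on_failure=True))
```

Under these assumptions the LB model fails a quarter of all requests for as long as the bad node stays in rotation, while the blacklisting client fails once. With many clients, each one would pay that one-time cost separately.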


It’s mostly about Circuit Breakers...


● Smart Endpoints as an architecture works because of a technique called Circuit Breakers.

● Nodes independently track their usage of remote services. When a node encounters a failure, whether due to software or infrastructure, it blacklists the remote service.

● It’s important to understand circuit breakers are local and not global. Each service in your system might have a different concept of “working”.
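A small sketch of what "local, not global" could look like in practice. The `CircuitBreaker` and `Client` classes, the threshold of three consecutive failures, and the two example clients are assumptions made for illustration, not the MDK's actual behavior.

```python
class CircuitBreaker:
    """Per-client, per-node breaker: trips after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def open(self) -> bool:
        # "Open" means this client has blacklisted the node.
        return self.consecutive_failures >= self.threshold

    def record(self, success: bool) -> None:
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


class Client:
    def __init__(self, name, is_working, threshold=3):
        self.name = name
        self.is_working = is_working  # this client's own definition of "working"
        self.threshold = threshold
        self.breakers = {}            # node address -> CircuitBreaker

    def observe(self, node, response):
        breaker = self.breakers.setdefault(node, CircuitBreaker(self.threshold))
        breaker.record(self.is_working(response))

    def blacklisted(self, node):
        return node in self.breakers and self.breakers[node].open


node = "10.0.0.9:8080"
faulty_response = {"status": "ok"}  # hypothetical 1.1 payload missing "price"

billing = Client("billing", lambda r: "price" in r)
dashboard = Client("dashboard", lambda r: r.get("status") == "ok")

for _ in range(3):
    billing.observe(node, faulty_response)
    dashboard.observe(node, faulty_response)

print("billing blacklists node:", billing.blacklisted(node))
print("dashboard blacklists node:", dashboard.blacklisted(node))
```

The same node is broken from billing's point of view and perfectly fine from the dashboard's, which is why the decision has to live on each client rather than in one global place.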


Circuit Breakers are really powerful...

● Circuit breakers provide safety from both infrastructure AND software issues

● Timeouts, network partitions, and server failures are all transient failures that come and go. A node can be temporarily blacklisted when a failure is due to an infrastructure issue.

● A software bug, such as the integration bug on the remote service described earlier, is not going to go away on its own. That node can be permanently blacklisted.
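One possible way to encode the temporary versus permanent distinction, sketched under the assumption that infrastructure problems surface as `TimeoutError` or `ConnectionError` and that anything else is a software bug. The `Blacklist` class and the 30-second cooldown are hypothetical.

```python
import math

TRANSIENT = (TimeoutError, ConnectionError)  # infrastructure-style failures


class Blacklist:
    def __init__(self, cooldown_seconds: float = 30.0):
        self.cooldown = cooldown_seconds
        self.until = {}  # node -> time at which the node may be retried

    def record_failure(self, node, error: Exception, now: float) -> None:
        if isinstance(error, TRANSIENT):
            expiry = now + self.cooldown  # retry after a cooldown
        else:
            expiry = math.inf             # treated as a software bug: never retry
        # Never shorten an existing, longer blacklisting.
        self.until[node] = max(self.until.get(node, 0.0), expiry)

    def is_blacklisted(self, node, now: float) -> bool:
        return now < self.until.get(node, 0.0)


bl = Blacklist(cooldown_seconds=30.0)
bl.record_failure("10.0.0.1:8080", TimeoutError("slow"), now=0.0)
bl.record_failure("10.0.0.2:8080", ValueError("malformed JSON"), now=0.0)

print(bl.is_blacklisted("10.0.0.1:8080", now=10.0))  # still cooling down
print(bl.is_blacklisted("10.0.0.1:8080", now=31.0))  # cooldown over, retry allowed
print(bl.is_blacklisted("10.0.0.2:8080", now=31.0))  # software bug, still blacklisted
```

A real system would presumably clear a "permanent" entry once a fixed version registers itself; that step is left out of this sketch.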


Smart Endpoints In a Nutshell

Two big but very simple things!

1. Each service maintains its own record of addresses in the environment. A service can in theory talk to any other service.

2. Circuit breakers prevent catastrophic failures by blacklisting misbehaving nodes. Blacklisting is done on a per-node basis.


Smart Endpoints Advantages...

● Smart Endpoints allow us to do quick integration testing because we don’t have to worry about catastrophic cascade failure (blacklist the misbehaving node and talk to known working nodes).

● Smart Endpoints make integration bugs far less dangerous and therefore enable faster development cycles.

● Still prevent classic infrastructure issues from being the downfall of your app.

● Do you see where this is going?


Back to Borg-like Architecture...

● If Smart endpoints prevent catastrophic software failure then the following things become true:

○ Change becomes a way of life. Modifying the system with new services becomes something developers, management and operations are comfortable with.

○ Change means new technology can be integrated, features added, bugs fixed and security holes patched without fear that a million- or billion-dollar line-of-business application fails and costs the company huge $$$.

○ The end result is an application that is in a continual state of improvement.


To learn more

Contact me:

● plombardi@datawire.io
● Twitter: @TheBigLombowski

Jobs

● https://www.datawire.io/careers/
○ Java, Python, Go and Kubernetes Developers Wanted!

Try

● https://github.com/datawire/mdk
● https://www.datawire.io
