Redundancy Doesn't Always Mean "HA" or "Cluster"

A cautionary tale against using hammers to solve all redundancy and resiliency problems ...

Randy Bias, @randybias, CTO, Cloudscaling
Dan Sneddon, @dxs, Sr. Engineer, Cloudscaling

OpenStack Design Summit – Oct 2012
Thursday, October 18, 2012

CCA - NoDerivs 3.0 Unported License - Usage OK, no modifications, full attribution*
* All unlicensed or borrowed works retain their original licenses

Our Journey Today

1. “HA” pairs are not the only type of redundancy
2. Alternative redundancy patterns for HA
3. Redundancy patterns in Open Cloud System*

* Cloudscaling’s OpenStack-powered cloud operating system (“distribution”)

What Do We Mean By “HA”?

We mean what most people mean ...

Two servers or network devices that look like one

“HA HA”?

HA pairs come in a couple of flavors

Active / Passive

“HA HA”?

People like this flavor best, but it’s not always possible ...

Active / Active

“HA HA HA HA HA”??

Many people wish they could get it more like this ...

HA cluster, aka ‘massive operational nightmare’

Cluster<bleep>!

Imagine this was 4 or 6 nodes in the cluster

• 4 network technologies
• 7 NICs / node
• A million different ways to break

“HA” Pairs Are One Type of Redundancy

Herein lies the problem ...

The Problem With “HA”-mmers

There are many, but these two matter most ...

• Catastrophic failures
• No scale out

HA Pairs Have Binary Failures

Either working or dead, nothing in between

What is Scale-out?

Scale-up - Make boxes bigger (usually an HA pair)

Scale-out - Make moar boxes

[Diagram: pairs of boxes labeled A B for scale-up; a row of boxes labeled A B C D ... N for scale-out]

Scaling out is a mindset

Scaling up is like treating your servers as pets
    bowzer.company.com

Servers *are* cattle
    web001.company.com

HA Pair Failures* - 100% down

Hardware rarely fails; operators fail, software fails

Who          Type         Year   Why       Duration
Apple        Switch       2005   Bug       2 hrs
Flexiscale   SAN          2007   Ops Err   24 hrs
Vendio       NAS          2008   Ops Err   8 hrs
UOL Brazil   SAN          2011   Bug       72 hrs
Twitter      Datacenter   2012   Bug+Ops   2 hrs

* This is a handful of examples as a baseline; I’m sure you can find many more

“HA” Pairs Are an All-in Move

They better not fail ...

Risk Reduction

Many small failure domains are usually better

Big failure domains vs. small

Would you rather have the whole cloud down or just a small bit for a short period of time?

Still a scale-up pattern ... wouldn’t you rather scale-out?

Pair vs. Scale-out Load Balancing

No scale-out

State Sync (100% loss)

Shared-nothing Architecture (20% loss)
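The "100% loss" and "20% loss" labels can be read as the share of capacity that disappears when one failure domain goes down. The sketch below is only an illustration of that arithmetic. It assumes that a state-synced pair behaves as a single failure domain (a bug or operator error hits both halves) and that the shared-nothing side has five equal nodes, which is one way to arrive at 20%; the slide itself does not state the node count, and `capacity_lost` is a hypothetical helper name.

```python
def capacity_lost(failure_domains, failed_domains=1):
    """Fraction of total capacity lost when `failed_domains` equal-sized domains fail."""
    if failure_domains <= 0:
        raise ValueError("need at least one failure domain")
    if not 0 <= failed_domains <= failure_domains:
        raise ValueError("failed_domains must be between 0 and failure_domains")
    return failed_domains / failure_domains


# Assumption: a state-synced HA pair fails as one unit, so it is one failure domain.
ha_pair_loss = capacity_lost(failure_domains=1)

# Assumption: five independent shared-nothing nodes, one of which fails.
scale_out_loss = capacity_lost(failure_domains=5)

print(f"state-synced pair: {ha_pair_loss:.0%} loss")
print(f"shared-nothing:    {scale_out_loss:.0%} loss")
```

Adding more shared-nothing nodes shrinks the loss per failure further, which is the point of the comparison.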


What’s Usually an “HA” Pair in OpenStack?

Everything ...

• Service Endpoints (APIs)
• Messaging System (RPC)
• Worker Threads (e.g. Scheduler, Networking)
• Database (MySQL)

What needs to be an HA pair?

Not much needs state synchronization

• Service Endpoints (APIs)
• Messaging System (RPC)
• Worker Threads (e.g. Scheduler, Networking)
• Database (MySQL)

Fault Tolerance Methodologies

Fault Tolerance in OCS

Service Distribution

Resilient Stateless Scale-out

High Availability Without Compromise

Service Distribution

Combines Standard Networking Technologies

• OSPF
• Anycast
• Load-Balancing Proxy

/etc/quagga/ospfd.conf

    router ospf
     ospf router-id 10.1.1.1
     network 10.1.255.1 area 0.0.0.0

/etc/quagga/zebra.conf

    interface lo:2
     description Pound listening address
     ip address 10.1.255.1/32

/etc/pound/pound.conf

    ListenHTTP
        Address 10.1.255.1
        Port 8774
        xHTTP 1
        Service
            BackEnd
                Address 10.1.1.1
                Port 8774
            End
            BackEnd
                Address 10.1.1.2
                Port 8774
            End
        End
    End
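Because every proxy node carries the same pound.conf apart from its list of back ends, the file lends itself to being generated. The following is a minimal sketch of that idea, not something taken from the deck: `render_pound_listener` is a hypothetical helper, and it emits only the directives that appear on the slide (ListenHTTP, Address, Port, xHTTP, Service, BackEnd, End).

```python
def render_pound_listener(listen_addr, port, backends):
    """Render a ListenHTTP stanza with one BackEnd block per (address, port) pair."""
    lines = [
        "ListenHTTP",
        f"    Address {listen_addr}",
        f"    Port {port}",
        "    xHTTP 1",
        "    Service",
    ]
    for backend_addr, backend_port in backends:
        lines += [
            "        BackEnd",
            f"            Address {backend_addr}",
            f"            Port {backend_port}",
            "        End",
        ]
    # Close the Service block, then the ListenHTTP block.
    lines += ["    End", "End"]
    return "\n".join(lines) + "\n"


# The addresses and port below are the ones shown on the slide.
config = render_pound_listener(
    "10.1.255.1", 8774, [("10.1.1.1", 8774), ("10.1.1.2", 8774)]
)
print(config)
```

Adding a back end then means adding one tuple to the list and redistributing the rendered file to each proxy node.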

Resilient OpenStack

Horizontally Scalable, No Single Point Of Failure

• Service Endpoints (APIs): Service Distribution
• Messaging System (RPC): ZeroMQ
• Worker Threads (e.g. Scheduler, Networking): Service Distribution
• Database (MySQL): MMR + HA

Service Distribution Advantages

What Makes This a Superior Solution?

• True horizontal scalability with no centralized controller
• Services are always running, failover is nearly instant
• Reduced complexity, fewer idle resources
• No need for separate load balancers

[Diagram: Failover vs. Distributed Services, each drawn as a row of servers]

Perfect For Site Resiliency

Service Distribution Works With Multiple Sites

• Traditional HA pairs do not support cross-site resiliency
• Service Distribution fails over across sites without DNS redirections

Service Distribution in Action

Example: Distributed Load Balancing

[Diagram, built up over three slides: HTTP proxies running Quagga send OSPF advertisements to the OSPF router(s), with back-end servers behind the proxies. Numbered labels:]

1) OSPF
2) ECMP per-flow load balancing
3) Load-balancing HTTP proxy
4) Unlimited # of back-end servers
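The per-flow ECMP step can be pictured as hashing each flow's addressing fields and using the result to pick one of the proxies currently advertising the anycast address. The sketch below models only that idea. Real routers use their own vendor-specific hash functions, so the SHA-256 hash, the `pick_next_hop` name, and the client address are assumptions for illustration; the proxy addresses and port come from the configuration slide.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Choose a next hop for a flow by hashing its 5-tuple (a model of per-flow ECMP)."""
    if not next_hops:
        raise ValueError("no next hops are advertising the anycast address")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Proxies that currently advertise 10.1.255.1 via OSPF.
proxies = ["10.1.1.1", "10.1.1.2"]

# (source IP, source port, destination IP, destination port, protocol); client is made up.
flow = ("192.0.2.10", 40000, "10.1.255.1", 8774, "tcp")

chosen = pick_next_hop(flow, proxies)
print("flow pinned to", chosen)

# If that proxy stops advertising, the router hashes over the remaining next hops.
remaining = [p for p in proxies if p != chosen]
print("after withdrawal:", pick_next_hop(flow, remaining))
```

Packets of one flow always land on the same proxy while the set of advertisers is unchanged, and a withdrawn advertisement simply removes that proxy from the candidate list.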

Failure Resiliency

[Diagram, built up over three slides: Clients, four Load Balancer/Proxy nodes numbered 1 to 4, and a row of Servers labeled "10% Load Each Server". A failure is then marked with an X, after which the label reads "10% Increased Server Load".]
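One way to read the "10% Load Each Server" and "10% Increased Server Load" labels is as load redistribution among equal servers. The sketch below works through that reading under stated assumptions: ten identical servers sharing traffic evenly, one of which fails. The slide does not give the server count, and `share_per_server` is a hypothetical helper.

```python
from fractions import Fraction


def share_per_server(total_servers, failed=0):
    """Share of total traffic each healthy server carries when load is spread evenly."""
    healthy = total_servers - failed
    if healthy <= 0:
        raise ValueError("no healthy servers left")
    return Fraction(1, healthy)


# Assumption: ten equal servers, so each carries 10% of the traffic.
before = share_per_server(10)
after = share_per_server(10, failed=1)
relative_increase = (after - before) / before

print(f"before failure: {float(before):.1%} per server")
print(f"after failure:  {float(after):.1%} per server")
print(f"relative increase: {float(relative_increase):.1%}")

# Contrast: in a two-node pair, the survivor must absorb everything.
pair_increase = (share_per_server(2, failed=1) - share_per_server(2)) / share_per_server(2)
print(f"pair survivor increase: {float(pair_increase):.0%}")
```

Under these assumptions each surviving server picks up roughly a tenth more work, while the survivor of a pair has its load doubled.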

OCS NAT Service

Example: Scale-out Network Address Translation

[Diagram labels: VMs, NAT, Service Distribution, BGP, multiple ISP providers]

Brokerless Messaging With ZeroMQ

Avoiding RabbitMQ’s Single Point Of Failure

[Diagram, built up over two slides: under "RabbitMQ (Brokered)", Nova-Compute, Nova-Scheduler, and Nova-API connect through the RabbitMQ broker, marked "Single Point Of Failure"; under "ZeroMQ (Peer To Peer)", Nova-Compute, Nova-Scheduler, and Nova-API connect to each other directly.]
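The difference between the two topologies can be checked with a small connectivity model. This is a sketch of the graph argument only and does not use RabbitMQ or ZeroMQ; the node names and the `reachable_pairs` helper are hypothetical labels chosen to mirror the diagram.

```python
from itertools import combinations

SERVICES = ["nova-api", "nova-compute", "nova-scheduler"]
BROKER = "rabbitmq-broker"


def reachable_pairs(links, failed=frozenset()):
    """Return the service pairs that can still exchange messages after nodes fail."""
    nodes = {node for link in links for node in link} - set(failed)
    adjacency = {node: set() for node in nodes}
    for a, b in links:
        if a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)

    def component(start):
        # Depth-first search over the surviving links.
        seen, stack = {start}, [start]
        while stack:
            for neighbor in adjacency[stack.pop()]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        return seen

    alive = sorted(node for node in nodes if node != BROKER)
    return {(a, b) for a, b in combinations(alive, 2) if b in component(a)}


brokered = [(service, BROKER) for service in SERVICES]  # every service talks via the broker
peer_to_peer = list(combinations(SERVICES, 2))          # every service talks to every other

print(len(reachable_pairs(brokered)))                               # all pairs connected
print(len(reachable_pairs(brokered, failed={BROKER})))              # broker down
print(len(reachable_pairs(peer_to_peer, failed={"nova-scheduler"})))  # one peer down
```

Losing the broker disconnects every pair at once, whereas losing one peer in the peer-to-peer layout only removes the pairs that involved that peer.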

What did we learn today?

1. HA-mmers are for nails
2. Scale-out rules for redundancy
3. Design-for-failure is a mentality, not a pair
4. Resiliency over redundancy

Q&A

Randy Bias, @randybias, CTO, Cloudscaling
Dan Sneddon, @dxs, Sr. Engineer, Cloudscaling

OCS 2.0: Public Cloud Benefits | Private Cloud Control | Open Cloud Economics