38
CCA - NoDerivs 3.0 Unported License - Usage OK, no modifications, full attribution* * All unlicensed or borrowed works retain their original licenses Redundancy Doesn't Always Mean "HA" or "Cluster" A cautionary tale against using hammers to solve all redundancy and resiliency problems ... 1 Randy Bias @randybias CTO, Cloudscaling Dan Sneddon @dxs Sr. Engineer, Cloudscaling OpenStack Design Summit – Oct 2012 Thursday, October 18, 12

OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Embed Size (px)

DESCRIPTION

Redundancy Doesn't Always Mean "HA" or "Cluster," presentation by Randy Bias and Dan Sneddon at OpenStack Summit fall 2012.

Citation preview

Page 1: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

CCA - NoDerivs 3.0 Unported License - Usage OK, no modifications, full attribution** All unlicensed or borrowed works retain their original licenses

Redundancy Doesn't Always Mean "HA" or "Cluster"A cautionary tale against using hammers to solve all redundancy and resiliency problems ...

1

Randy Bias@randybiasCTO, Cloudscaling

Dan Sneddon@dxsSr. Engineer, Cloudscaling

OpenStack Design Summit – Oct 2012

Thursday, October 18, 12

Page 2: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

1. “HA” pairs are not the only type of redundancy

2. Alternative redundancy patterns for HA

3. Redundancy patterns in Open Cloud System*

Our Journey Today

2

* Cloudscaling’s OpenStack-powered cloud operating system (“distribution”)

Thursday, October 18, 12

Page 3: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

We mean what most people mean ...What Do We Mean By “HA”?

3

Two servers or network devices that look like one

Thursday, October 18, 12

Page 4: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

HA pairs come in a couple flavors“HA HA”?

4

Active / Passive

Thursday, October 18, 12

Page 5: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

People like this flavor best, but it’s not always possible...

“HA HA”?

5

Active / Active

Thursday, October 18, 12

Page 6: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Many people wish they could get it more like this ...“HA HA HA HA HA”??

6

HA cluster aka ‘massive operational nightmare’

Thursday, October 18, 12

Page 7: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Imagine this was 4 or 6 nodes in the clusterCluster<bleep>!

7

• 4 network tech.

• 7 NICs / node

• A million different ways to break

Thursday, October 18, 12

Page 8: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Herein lies the problem ...“HA” Pairs Are One Type of Redundancy

8

Thursday, October 18, 12

Page 9: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

There are many, but these two matter most ...

• Catastrophic failures

• No scale out

The Problem With “HA”-mmers

9

Thursday, October 18, 12

Page 10: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Either working or dead, nothing in-betweenHA Pairs Have Binary Failures

10

Thursday, October 18, 12

Page 11: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

What is Scale-out?

11

A B

A B

A B C D N

Scale-up - Make boxes bigger (usually an HA pair)

Scale-out - Make moar boxes

Thursday, October 18, 12

Page 12: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Scaling out is a mindset

12

bowzer.company.com web001.company.com

Servers *are* cattle

Scaling up is like treating your servers as pets

Thursday, October 18, 12

Page 13: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Hardware rarely fails, operators fail, software failsHA Pair Failures* - 100% down

13

Who Type Year Why Duration

Apple Switch 2005 Bug 2 hrs

Flexiscale SAN 2007 Ops Err 24 hrs

Vendio NAS 2008 Ops Err 8 hrs

UOL Brazil SAN 2011 Bug 72 hrs

Twitter Datacenter 2012 Bug+Ops 2 hrs

* This is a handful of examples as a baseline; I’m sure you can find many more

Thursday, October 18, 12

Page 14: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

They better not fail ...“HA” Pairs Are an All-in Move

14

Thursday, October 18, 12

Page 15: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Many small failure domains is usually betterRisk Reduction

15

Thursday, October 18, 12

Page 16: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Would you rather have the whole cloud down or just a small bit for a short period of time?

Big failure domains vs. small

16

Still a scale-up pattern ... wouldn’t you rather scale-out?

Thursday, October 18, 12

Page 17: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

No scale-outPair vs. Scale-out Load Balancing

17

State Sync

(100% loss)

Shared-nothing Architecture

(20% loss)

Thursday, October 18, 12

Page 18: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

No scale-outPair vs. Scale-out Load Balancing

17

State Sync

(100% loss)

Shared-nothing Architecture

(20% loss)

Thursday, October 18, 12

Page 19: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Everything ...What’s Usually an “HA” Pair in OpenStack?

18

Service Endpoints(APIs)

Messaging System(RPC)

Worker Threads(e.g. Scheduler,

Networking)

Database(MySQL)

Thursday, October 18, 12

Page 20: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Not much needs state synchronizationWhat needs to be an HA pair?

19

Service Endpoints(APIs)

Messaging System(RPC)

Worker Threads(e.g. Scheduler,

Networking)

Database(MySQL)

Thursday, October 18, 12

Page 21: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Fault Tolerance Methodologies

20

Thursday, October 18, 12

Page 22: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Fault Tolerance in OCS

21

Thursday, October 18, 12

Page 23: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Service Distribution

22

Resilient Stateless Scale-out

High Availability Without Compromise

Thursday, October 18, 12

Page 24: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Combines Standard Networking TechnologiesService Distribution

23

OSPF

Anycast

Load-BalancingProxy

/etc/quagga/ospfd.confrouter ospf ospf router-id 10.1.1.1 network 10.1.255.1 area 0.0.0.0

/etc/quagga/zebra.confinterface lo:2 description Pound listening address ip address 10.1.255.1/32

/etc/pound/pound.conf

ListenHTTP Address 10.1.255.1 Port 8774 xHTTP 1 Service BackEnd Address 10.1.1.1 Port 8774 End BackEnd Address 10.1.1.2 Port 8774 End EndEnd

Thursday, October 18, 12

Page 25: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Horizontally Scalable, No Single Point Of FailureResilient OpenStack

Service Endpoints(APIs)

Messaging System(RPC)

Worker Threads(e.g. Scheduler,

Networking)

Database(MySQL)

Service Distribution ZeroMQ

MMR + HAService Distribution

Thursday, October 18, 12

Page 26: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

What Makes This a Superior Solution?Service Distribution Advantages

25

• True horizontal scalability with no centralized controller

• Services are always running, failover is nearly instant

• Reduced complexity, fewer idle resources

• No need for separate load balancers

Server Server Server Server Server Server Server ...Failover Distributed Servicesvs.

Thursday, October 18, 12

Page 27: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Service Distribution Works With Multiple Sites

Perfect For Site Resiliency

26

• Traditional HA pairs do not support cross-site resiliency

• Service Distribution fail across sites without DNS redirections

Thursday, October 18, 12

Page 28: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

27

Example: Distributed Load Balancing

HTTP Proxy

V

HTTP Proxy

OSPF Router(s)

OSPF1)

OSPFadvertisement

OSPFadvertisement

Quagga Quagga

Service Distribution in Action

Thursday, October 18, 12

Page 29: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

28

Example: Distributed Load BalancingService Distribution in Action

HTTP Proxy

V

HTTP Proxy

OSPF Router(s)

OSPF1)

OSPFadvertisement

OSPFadvertisement

Quagga Quagga

Per-FlowLoad

Balancing

ECMP Per-flowLoad Balancing

2)

HTTP ProxyHTTP Proxy

2)

Load-balancingHTTP Proxy

3)

1)

Thursday, October 18, 12

Page 30: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

29

Example: Distributed Load BalancingService Distribution in Action

HTTP Proxy

V

HTTP Proxy

OSPF Router(s)

OSPF1)

OSPFadvertisement

OSPFadvertisement

Quagga Quagga

Per-FlowLoad

Balancing

ECMP Per-flowLoad Balancing

2)

HTTP ProxyHTTP Proxy

2)

Load-balancingHTTP Proxy

3)

1)

HTTP ProxyHTTP Proxy

Unlimited #of Back-EndServers

4)

Server ServerServer Server

Thursday, October 18, 12

Page 31: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

30

10% LoadEach

Server

Load Balancer/Proxy

Load Balancer/Proxy

Load Balancer/Proxy

Load Balancer/Proxy

Server Server Server Server

Client

1 2 3 4

1 2 3 4

Failure Resiliency

Load Balancer/Proxy

Server

Server Server Server Server Server

Client Client Client

Thursday, October 18, 12

Page 32: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

31

10% LoadEach

Server

Load Balancer/Proxy

Load Balancer/Proxy

Load Balancer/Proxy

Load Balancer/Proxy

Server Server Server Server

Client

1 2 3 4

1 2 3 4

Failure Resiliency

Load Balancer/Proxy

Server

Server Server Server Server Server

Client Client Client

X1

Thursday, October 18, 12

Page 33: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

32

Load Balancer/Proxy

Load Balancer/Proxy

Load Balancer/Proxy

Load Balancer/Proxy

Server Server Server Server

Client

1 2 3 4

1 2 3 4

Failure Resiliency

Load Balancer/Proxy

Server

Server Server Server Server Server

Client Client Client

X 10% Increased

Server Load

Thursday, October 18, 12

Page 34: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Example: Scale-out Network Address TranslationOCS NAT Service

33

NAT

VMs

BGP Multiple ISPproviders

ServiceDistribution

Thursday, October 18, 12

Page 35: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Avoiding RabbitMQ’s Single Point Of FailureBrokerless Messaging With ZeroMQ

34

Nova-Compute

Nova-Scheduler Nova-API

RabbitMQBroker

RabbitMQ(Brokered)

Single Point Of Failure

Thursday, October 18, 12

Page 36: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Avoiding RabbitMQ’s Single Point Of FailureBrokerless Messaging With ZeroMQ

35

Nova-Compute

Nova-Scheduler Nova-API

RabbitMQBroker

RabbitMQ(Brokered)

Single Point Of Failure

Nova-Compute

Nova-Scheduler Nova-API

vs. ZeroMQ(Peer To Peer)

Thursday, October 18, 12

Page 37: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

What did we learn today?

36

1. HA-mmers are for nails

2. Scale-out rules for redundancy

3. Design-for-failure is a mentality, not a pair

4. Resiliency over redundancy

Thursday, October 18, 12

Page 38: OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Q&A

37

Randy Bias@randybiasCTO, Cloudscaling

Dan Sneddon@dxsSr. Engineer, Cloudscaling

OCS 2.0Public Cloud Benefits | Private Cloud Control | Open Cloud Economics

Thursday, October 18, 12