Redundancy Doesn't Always Mean "HA" or "Cluster," presentation by Randy Bias and Dan Sneddon at OpenStack Summit fall 2012.
CC Attribution-NoDerivs 3.0 Unported License: usage OK, no modifications, full attribution. All unlicensed or borrowed works retain their original licenses.
Redundancy Doesn't Always Mean "HA" or "Cluster"
A cautionary tale against using hammers to solve all redundancy and resiliency problems ...
Randy Bias (@randybias), CTO, Cloudscaling
Dan Sneddon (@dxs), Sr. Engineer, Cloudscaling
OpenStack Design Summit, October 2012
Thursday, October 18, 12
Our Journey Today
1. “HA” pairs are not the only type of redundancy
2. Alternative redundancy patterns for HA
3. Redundancy patterns in Open Cloud System*
* Cloudscaling’s OpenStack-powered cloud operating system (“distribution”)
What Do We Mean By “HA”?
We mean what most people mean: two servers or network devices that look like one.
“HA HA”?
HA pairs come in a couple of flavors. The first is Active / Passive.
“HA HA”?
People like the Active / Active flavor best, but it’s not always possible ...
“HA HA HA HA HA”??
Many people wish they could get it more like this ...
HA cluster, aka ‘massive operational nightmare’
Cluster<bleep>!
Imagine if this were 4 or 6 nodes in the cluster:
• 4 network technologies
• 7 NICs / node
• A million different ways to break
“HA” Pairs Are One Type of Redundancy
Herein lies the problem ...
The Problem With “HA”-mmers
There are many problems, but these two matter most:
• Catastrophic failures
• No scale-out
HA Pairs Have Binary Failures
Either working or dead, nothing in-between.
What is Scale-out?
Scale-up: make the boxes bigger (usually an HA pair).
Scale-out: make moar boxes (A, B, C, D ... N).
Scaling out is a mindset
Scaling up is like treating your servers as pets (bowzer.company.com).
Servers *are* cattle (web001.company.com).
HA Pair Failures*: 100% down
Hardware rarely fails; operators fail, software fails.

Who        | Type       | Year | Why     | Duration
-----------|------------|------|---------|---------
Apple      | Switch     | 2005 | Bug     | 2 hrs
Flexiscale | SAN        | 2007 | Ops Err | 24 hrs
Vendio     | NAS        | 2008 | Ops Err | 8 hrs
UOL Brazil | SAN        | 2011 | Bug     | 72 hrs
Twitter    | Datacenter | 2012 | Bug+Ops | 2 hrs

* This is a handful of examples as a baseline; I’m sure you can find many more
“HA” Pairs Are an All-in Move
They better not fail ...
Risk Reduction
Many small failure domains are usually better.
Big failure domains vs. small
Would you rather have the whole cloud down, or just a small bit down for a short period of time?
A big failure domain is still a scale-up pattern ... wouldn’t you rather scale out?
Pair vs. Scale-out Load Balancing
An HA pair with state sync doesn’t scale out, and a failure means 100% loss.
A shared-nothing architecture loses only 20% when one of five nodes fails.
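The loss figures above are easy to check with a toy model (a sketch added here, not from the deck): an HA pair with state sync fails as a single unit, while a shared-nothing pool loses only the failed node's share of capacity.

```python
def capacity_lost(total_nodes, failed_nodes, shared_state=False):
    """Fraction of capacity lost when nodes fail.

    shared_state=True models an HA pair with state sync, which fails
    as one unit: any failure costs 100%.
    shared_state=False models a shared-nothing pool, where each node
    is its own failure domain.
    """
    if failed_nodes == 0:
        return 0.0
    if shared_state:
        return 1.0
    return failed_nodes / total_nodes

print(capacity_lost(2, 1, shared_state=True))  # HA pair: 1.0 (100% loss)
print(capacity_lost(5, 1))                     # five-node pool: 0.2 (20% loss)
```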
What’s Usually an “HA” Pair in OpenStack?
Everything ...
• Service Endpoints (APIs)
• Messaging System (RPC)
• Worker Threads (e.g. Scheduler, Networking)
• Database (MySQL)
What needs to be an HA pair?
Not much needs state synchronization.
• Service Endpoints (APIs)
• Messaging System (RPC)
• Worker Threads (e.g. Scheduler, Networking)
• Database (MySQL)
Fault Tolerance Methodologies
Fault Tolerance in OCS
Service Distribution
Resilient Stateless Scale-out: High Availability Without Compromise
Service Distribution
Combines standard networking technologies: OSPF + Anycast + Load-Balancing Proxy
/etc/quagga/ospfd.conf:
    router ospf
      ospf router-id 10.1.1.1
      network 10.1.255.1/32 area 0.0.0.0

/etc/quagga/zebra.conf:
    interface lo:2
      description Pound listening address
      ip address 10.1.255.1/32

/etc/pound/pound.conf:
    ListenHTTP
      Address 10.1.255.1
      Port 8774
      xHTTP 1
      Service
        BackEnd
          Address 10.1.1.1
          Port 8774
        End
        BackEnd
          Address 10.1.1.2
          Port 8774
        End
      End
    End
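One piece this setup needs that the configs don't show is a health check: if Pound dies but the loopback alias stays up, OSPF keeps advertising a dead proxy. A minimal Python sketch of that watchdog, assuming the addresses above and the Linux `ip` command (the `sync_route` commands are illustrative and need root):

```python
import socket
import subprocess

ANYCAST_IP = "10.1.255.1"  # the lo:2 address from zebra.conf
SERVICE_PORT = 8774        # the Pound listener port

def service_alive(host, port, timeout=1.0):
    """True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def sync_route(alive):
    """Add or drop the anycast loopback alias so zebra advertises the
    /32 only while Pound is healthy (illustrative; requires root)."""
    action = "add" if alive else "del"
    subprocess.run(
        ["ip", "addr", action, f"{ANYCAST_IP}/32", "dev", "lo"],
        check=False,
    )
```

Run in a loop (or from cron), calling `sync_route(service_alive(ANYCAST_IP, SERVICE_PORT))`, so a failed proxy withdraws itself from the anycast pool within seconds.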
Resilient OpenStack
Horizontally scalable, no single point of failure:
• Service Endpoints (APIs): Service Distribution
• Messaging System (RPC): ZeroMQ
• Worker Threads (e.g. Scheduler, Networking): Service Distribution
• Database (MySQL): MMR + HA
Service Distribution Advantages
What makes this a superior solution?
• True horizontal scalability with no centralized controller
• Services are always running; failover is nearly instant
• Reduced complexity, fewer idle resources
• No need for separate load balancers
(Diagram: a failover pair vs. distributed services running across many servers)
Perfect For Site Resiliency
Service Distribution works with multiple sites:
• Traditional HA pairs do not support cross-site resiliency
• Service Distribution fails over across sites without DNS redirection
Service Distribution in Action
Example: Distributed Load Balancing
1) Quagga on each HTTP proxy node sends an OSPF advertisement for the shared address to the OSPF router(s).
2) The routers spread incoming flows across the HTTP proxies using ECMP per-flow load balancing.
3) Each HTTP proxy load-balances the requests it receives.
4) An unlimited number of back-end servers sit behind the proxies.
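Step 2's "per-flow" behavior comes from the router hashing each connection's 5-tuple, so every packet of a flow lands on the same proxy and the router keeps no connection state. A rough Python illustration (real routers use a hardware hash, not SHA-256; the proxy names are made up):

```python
import hashlib

def ecmp_pick(src_ip, src_port, dst_ip, dst_port, proto, next_hops):
    """Choose a next hop by hashing the flow's 5-tuple. The hash input
    is identical for every packet of a flow, so the whole flow sticks
    to one next hop."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[bucket % len(next_hops)]

proxies = ["proxy-a", "proxy-b"]  # hypothetical HTTP proxy nodes
flow = ("198.51.100.7", 52311, "10.1.255.1", 8774, "tcp")
# Same flow always hashes to the same proxy:
assert ecmp_pick(*flow, proxies) == ecmp_pick(*flow, proxies)
```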
Failure Resiliency
With scale-out load balancer/proxy nodes in front of a pool of back-end servers, each server carries about 10% of the load. When one load balancer fails, its clients shift to the survivors, and each remaining server absorbs only about a 10% increase in load.
OCS NAT Service
Example: Scale-out Network Address Translation. VM traffic flows out through scale-out NAT nodes fronted by Service Distribution, with BGP sessions to multiple ISP providers.
Brokerless Messaging With ZeroMQ
Avoiding RabbitMQ’s single point of failure:
• RabbitMQ (brokered): Nova-Compute, Nova-Scheduler, and Nova-API all pass messages through a central RabbitMQ broker, a single point of failure.
• ZeroMQ (peer to peer): the same services message each other directly, with no broker to fail.
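The difference can be made concrete with a toy reachability model (an illustration added here; a real deployment would use the actual RabbitMQ or ZeroMQ transports):

```python
def talking_pairs(services, broker=None, failed=()):
    """Count ordered (sender, receiver) pairs that can still exchange
    messages. Brokered: every message traverses the broker, so losing
    it silences everything. Peer to peer (broker=None): only failed
    peers go quiet."""
    if broker is not None and broker in failed:
        return 0
    live = [s for s in services if s not in failed]
    return len(live) * (len(live) - 1)

services = ["nova-api", "nova-scheduler", "nova-compute"]
# Brokered: the broker dies and all messaging stops.
print(talking_pairs(services, broker="rabbitmq", failed=("rabbitmq",)))  # 0
# Peer to peer: one service dies, the other two still talk.
print(talking_pairs(services, failed=("nova-compute",)))                 # 2
```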
What did we learn today?
1. HA-mmers are for nails
2. Scale-out rules for redundancy
3. Design-for-failure is a mentality, not a pair
4. Resiliency over redundancy
Q&A
Randy Bias (@randybias), CTO, Cloudscaling
Dan Sneddon (@dxs), Sr. Engineer, Cloudscaling
OCS 2.0: Public Cloud Benefits | Private Cloud Control | Open Cloud Economics