Outage-Proof Your Cloud Applications
Brian Adler, Sr. Services Architect
Roberto Monge, Cloud Solutions Engineer
RightScale
December 18, 2012
# 2
Cloud Management
# 2
#rightscale
Your Panel Today
Presenting
• Brian Adler, Sr. Services Architect, RightScale
• Roberto Monge, Cloud Solutions Engineer, RightScale
Q&A
• Spencer Adams, Account Manager, RightScale
• Noel Cohen, Account Manager, RightScale
Please use the “Questions” window to ask questions any time!
Agenda
• High Availability and Disaster Recovery
• Terminology/Level-Setting
• Designing for Failure
• Cloud and component definitions
• HA and DR configurations
• Conclusions / Q&A
Terminology
• Fault Tolerance: Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fail
• High Availability (HA): Fault-tolerant systems are measured by their availability in terms of planned and unplanned service outages for end users
• Disaster Recovery (DR): The process, policies, and procedures related to restoring critical systems after a catastrophic event
Designing for Failure
Large scale failures in the cloud are rare but do happen
Need to balance cost and complexity of HA efforts against risks you are willing to bear
Application owners are ultimately responsible for availability and recoverability
Cloud infrastructure has made DR and HA remarkably affordable
• Multi-server
• Multi-Zone
• Multi-Region
• Multi-Cloud
Cloud Isolation Definitions

                              Region                                       Zone
Resources                     One or more geographically proximate Zones   Datacenter with separate power source
API endpoint, control plane   Shared                                       Shared
Local Area Network            Shared                                       Shared

Clouds                        Region                                       Zone
Amazon Web Services           Region                                       Availability Zone
Rackspace                     Region                                       -
Windows Azure                 Region                                       -
Google Cloud Platform         Region                                       Availability Group
CloudStack                    Region                                       Zone
OpenStack                     Zone                                         Availability Zone
Multi-Zone HA
[Diagram: two zones, US-EAST 1a and US-EAST 1b. Components shown: DNS (172.168.7.31, 172.168.8.62), load balancers, autoscaling app servers, master DB, slave DB, replication, snapshots, EBS, S3]
• Snapshot data volume for backups so the database can be readily recovered within the region.
• Place Slave databases in one or more zones for failover.
• Consider local storage for additional slave database to remove dependency on attached volume
• Consider distributed NoSQL databases with the same distribution considerations.
• Spread primary and replica nodes across multiple zones. Place as many as you need for required resiliency.
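The placement advice above can be sketched in a few lines. This is an illustration only, not a RightScale or cloud-provider API: the helper name `spread_across_zones`, the node names, and the zone identifiers are all assumptions. It assigns nodes round-robin, so a primary and its replicas land in different zones whenever enough zones are available.

```python
from itertools import cycle


def spread_across_zones(nodes, zones):
    """Assign nodes to zones round-robin (hypothetical helper, illustration only)."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    # Each node takes the next zone in the rotation, wrapping around as needed.
    return {node: next(zone_cycle) for node in nodes}


# Hypothetical node and zone names.
placement = spread_across_zones(
    ["master-db", "slave-db-1", "slave-db-2"],
    ["us-east-1a", "us-east-1b"],
)
for node, zone in placement.items():
    print(f"{node} -> {zone}")
```

With only two zones, the third node necessarily shares a zone with the first; adding zones (or fewer nodes per tier) is what buys extra resiliency.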
Multi-Region/Cloud DR Options

Option           Note                 Cost   Downtime   Availability
Cold DR          (Most Common)        $      > 1 Hour   99%
Warm DR          (Recommended)        $$     < 1 Hour   99.5%
Hot DR           (Least Common)       $$$    < 5 Mins   99.9%
Multi-Cloud HA   (Live/Live Config)   $$$$   0          99.999%
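To make the availability figures above concrete, the following sketch converts an availability percentage into the downtime budget it implies over a year. The function name and the 365-day year are assumptions made for illustration; the arithmetic is simply the unavailable fraction multiplied by the hours in a year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(availability_percent):
    """Return the yearly downtime budget implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.5, 99.9, 99.999):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Under these assumptions, 99% availability allows roughly 87.6 hours of downtime per year, while 99.999% allows roughly 5.3 minutes.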
Multi-Region Cold DR
[Diagram: two regions, DALLAS and CHICAGO. Components shown: DNS (172.168.7.31), load balancers, app servers, master DB, slave DBs, replication, snapshots, CBS, Cloud Files]
Staged Server Configuration and generally no staged data
• Not recommended if rapid recovery is required
• Slow to replicate data to other cloud and bring database online
Multi-Region Warm DR
[Diagram: two regions, DALLAS and CHICAGO. Components shown: DNS (172.168.7.31), load balancers, app servers, master DB, slave DBs, replication, snapshots, CBS, Cloud Files]
Staged Server Configuration, pre-staged data and running Slave Database Server
• Generally recommended DR solution
• Minimal additional cost and allows fairly rapid recovery
Multi-Region Hot DR
[Diagram: two regions, DALLAS and CHICAGO. Components shown: DNS (172.168.7.31), load balancers, app servers, master DB, slave DBs, replication, snapshots, CBS, Cloud Files]
Parallel Deployment with all servers running but all traffic going to primary
• Not recommended
• Very high additional cost to allow rapid recovery
Multi-Cloud HA
[Diagram: two clouds/regions, CHICAGO and US-EAST. Components shown: DNS (172.168.7.31, 172.168.8.62), load balancers, app servers, master DB, slave DBs, replication, snapshots, EBS, S3, SWIFT]
Live/Live configuration. Geo-target IP services to direct traffic to regional LBs.
• Possible, but not recommended (more to follow…)
• Max additional cost and max availability, but complex to implement and manage
Multi-Cloud HA
[Diagram: two clouds/regions, CHICAGO and US-EAST. Components shown: DNS (172.168.7.31, 172.168.8.62), load balancers, app servers, master DB, slave DBs, replication, snapshots, EBS volume, S3, SWIFT]
Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared
• You need DNS management or a global load balancer.
• Security is an issue as security groups are Region-specific.
• Machine Images are specific to the cloud/region.
In the Dashboard
• Multi-region or cloud
• Multi-region Warm DR
• Staged servers
• Cost forecasting for DR environment
Automating HA and DR
• Use dynamic DNS for your database servers
  • Allow app servers to use a single FQDN.
  • Use a low TTL to allow rapid failover in the case of a change in master database
• Automatic connection of app servers to load balancing servers
  • App servers can connect to all load balancers automatically at launch
  • No manual intervention
  • No DNS modifications
• Automated promotion of slave to master
  • Process is automated
  • Decision to run process is manual
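The reason a low TTL speeds up failover can be shown with a toy simulation. This is a minimal sketch under stated assumptions, not a real DNS client or any provider's API: `DynamicDns`, `CachingResolver`, the FQDN, and the addresses are all hypothetical, and time is passed in explicitly as seconds so the behavior is deterministic.

```python
class DynamicDns:
    """Toy in-memory stand-in for a dynamic DNS provider (hypothetical)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.records = {}

    def update(self, fqdn, address):
        self.records[fqdn] = address

    def lookup(self, fqdn):
        # Every answer carries the TTL the client is allowed to cache it for.
        return self.records[fqdn], self.ttl_seconds


class CachingResolver:
    """Client-side resolver that honors the TTL returned with each answer."""

    def __init__(self, dns):
        self.dns = dns
        self.cache = {}  # fqdn -> (address, expires_at)

    def resolve(self, fqdn, now):
        cached = self.cache.get(fqdn)
        if cached and now < cached[1]:
            return cached[0]  # cached answer is still within its TTL
        address, ttl = self.dns.lookup(fqdn)
        self.cache[fqdn] = (address, now + ttl)
        return address


dns = DynamicDns(ttl_seconds=60)
dns.update("master-db.example.com", "10.0.1.10")  # original master
app_server = CachingResolver(dns)

before = app_server.resolve("master-db.example.com", now=0)
dns.update("master-db.example.com", "10.0.2.20")  # slave promoted to master
during = app_server.resolve("master-db.example.com", now=30)  # still cached
after = app_server.resolve("master-db.example.com", now=60)  # cache expired
print(before, during, after)
```

In this model an app server keeps using the old master address until its cached record expires, so the TTL is an upper bound on how long a client can lag behind a change of master.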
How RightScale makes it possible
MultiCloud Images
• MultiCloud Images can be launched across regions and clouds without modification
1. ServerTemplate contains a list of MultiCloud Images (MCIs)
2. When the Server is created, a specific MCI is chosen.
3. The appropriate RightImage is used at launch.
[Diagram labels: MultiCloud Images ("Cloud A, B, Image 1"; "Cloud A C, Image 2"; "Cloud B, Image 1"); chosen MCI "Cloud A, B, Image 1"; launched on Cloud B with Image 1; RightImage: stability across clouds]
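The three steps above amount to a two-level lookup: pick one MCI from the list, then pick that MCI's image for the target cloud. The sketch below models this with plain data structures. It is not the RightScale API; the `MultiCloudImage` class, the `choose_image` function, and the image identifiers are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MultiCloudImage:
    """Hypothetical model of an MCI: a name plus one image ID per cloud."""
    name: str
    images_by_cloud: dict


def choose_image(server_template_mcis, mci_name, cloud):
    """Pick the named MCI from the template's list, then its image for the cloud."""
    for mci in server_template_mcis:
        if mci.name == mci_name:
            if cloud not in mci.images_by_cloud:
                raise LookupError(f"{mci_name} has no image for {cloud}")
            return mci.images_by_cloud[cloud]
    raise LookupError(f"ServerTemplate does not list {mci_name}")


# Step 1: the ServerTemplate carries a list of MCIs (made-up image IDs).
mcis = [
    MultiCloudImage("Image 1", {"Cloud A": "image-1-for-a", "Cloud B": "image-1-for-b"}),
    MultiCloudImage("Image 2", {"Cloud A": "image-2-for-a", "Cloud C": "image-2-for-c"}),
]

# Steps 2 and 3: a specific MCI is chosen, then resolved for the launch cloud.
print(choose_image(mcis, "Image 1", "Cloud B"))
```

Because the server definition refers to the MCI rather than to a cloud-specific image, the same definition can be launched on another cloud without being edited.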
How RightScale makes it possible
ServerTemplates, Tags, and Inputs
• Automated load balancer registration and database connections
• Autoscaling across zones
• Dynamic configuration
DR Cost Comparison Example

Multi-Region Cold DR: Total $4480 / month
• Running: $4470 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 1 Slave DB (2XLarge))
• Staged: $0 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Slave DB (2XLarge))
• Replication: $10 / month (25GB / day cross-zone)

Multi-Region Warm DR: Total $5630 / month
• Running: $5540 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge))
• Staged: $0 / month (3 Load Balancers (Large), 6 App Servers (XLarge))
• Replication: $90 / month (25GB / day cross-region)

Multi-Region Hot DR: Total $8800 / month
• Running: $8440 / month (6 Load Balancers (Large), 12 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge))
• Replication: $360 / month (100GB / day cross-region)
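The totals in this comparison are the sum of the running, staged, and replication lines. A short check of that arithmetic, plus the premium each option carries over Cold DR, might look like the following. The figures are copied from the comparison above; the dictionary layout and function names are assumptions, and the staged cost for Hot DR is taken as 0 because no staged servers are listed for it.

```python
# Monthly figures in USD, copied from the comparison above.
COSTS = {
    "Cold DR": {"running": 4470, "staged": 0, "replication": 10},
    "Warm DR": {"running": 5540, "staged": 0, "replication": 90},
    # No staged servers are listed for Hot DR, so 0 is assumed here.
    "Hot DR": {"running": 8440, "staged": 0, "replication": 360},
}


def monthly_total(option):
    """Sum the running, staged, and replication costs for one DR option."""
    return sum(COSTS[option].values())


def premium_over_cold(option):
    """Extra monthly cost of an option relative to Cold DR, as a fraction."""
    cold = monthly_total("Cold DR")
    return (monthly_total(option) - cold) / cold


for option in COSTS:
    print(f"{option}: ${monthly_total(option)} / month "
          f"({premium_over_cold(option):.1%} over Cold DR)")
```

On these numbers, Warm DR costs about a quarter more than Cold DR, while Hot DR roughly doubles the monthly bill.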
Most Common Observed Cloud Outages
• Outage of specific services in a zone
  • Degraded performance
  • E.g. EBS, ELB, RDS
• Outage of specific services in a region
  • Control plane error or cascading problems
  • E.g. EBS
• Outage of power or network in a zone
  • No connectivity
  • E.g. EC2, Azure
• Capacity availability in a region during an outage
  • Not possible to provision instances, volumes, or other services
Outage-Proofing Best Practices
Resource Placement
• Place in >1 zone: load balancers, app servers, databases
• Maintain capacity to absorb zone or region failures
Application Design
• Replicate data across zones
• Design stateless apps for resilience to reboot / relaunch
Replication and Failover
• Replicate data across zones
• Backup across regions & clouds
• Monitor, alert, and automate operations to speed up failover
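"Maintain capacity to absorb zone or region failures" can be turned into a simple sizing rule: run enough servers in each zone that the zones left standing still cover peak load. The sketch below is one way to express that rule; the function name and the example numbers are assumptions, and it treats all servers as interchangeable with load spread evenly across zones.

```python
import math


def servers_per_zone(servers_needed_at_peak, zone_count, zones_lost=1):
    """Servers to run in each zone so the surviving zones still cover peak load."""
    surviving = zone_count - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(servers_needed_at_peak / surviving)


# Hypothetical example: 6 app servers are needed to serve peak traffic.
for zones in (2, 3):
    per_zone = servers_per_zone(6, zones)
    print(f"{zones} zones: {per_zone} per zone, {per_zone * zones} total")
```

Under these assumptions, two zones require full duplication (12 servers for a peak of 6), while three zones need only 9, which is one reason spreading across more zones can lower the cost of headroom.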
Next Steps
• Learn: Building Scalable Applications in the Cloud Whitepaper
  • http://www.rightscale.com/info_center/white-papers/building-scalable-applications-in-the-cloud.php
• Analyze: Deployment review of your environment
  • http://www.rightscale.com/about_us/contact_us.php
• Try: Free Edition
  • www.rightscale.com/free
Contact RightScale
(866) 720-0208