Protect your app from Outages Nati Shalom CTO GigaSpaces @natishalom May 2013

Avoiding Cloud Outage


DESCRIPTION

Building cross-region and cross-cloud high availability into your app: a real-life use case by GigaSpaces. Nati Shalom, Founder & CTO, GigaSpaces.

Achieving high levels of availability and disaster recovery in a cloud environment requires patterns and practices that introduce redundancy through multi-zone, multi-region, and multi-cloud deployments. As we move toward higher availability, we cannot escape a direct increase in the accidental complexity of the deployment architecture, which results from a lack of cloud portability and deployment lifecycle automation. We present how high availability and disaster recovery were achieved in practice using the Cloudify open source framework on top of AWS. The approach applies not just to AWS but also to other public clouds and to private cloud environments such as Eucalyptus. The resulting reference architecture provides portable PostgreSQL replication and disaster recovery, as well as application-tier scalability across zones, regions, and public/private clouds, through a unified deployment workflow.


Page 1: Avoiding Cloud Outage

Protect your app from Outages
Nati Shalom, CTO, GigaSpaces
@natishalom

May 2013

Page 2: Avoiding Cloud Outage

AGENDA

• AWS and outages
• Outage impact
• Disaster Recovery – it’s all about redundancy!
• Cloudify as a solution for redundancy
• Demo with Cloudify on EC2

® Copyright 2013 GigaSpaces Ltd. All Rights Reserved

Page 3: Avoiding Cloud Outage


AWS USAGE

Managing Big Data on the Cloud

• AWS – around 0.5M servers
• Facebook – less than 0.1M servers
• Google – around 1M servers

Page 4: Avoiding Cloud Outage


THE OUTAGE PROBLEM

Page 5: Avoiding Cloud Outage


OUTAGE – APRIL 21, 2011

Page 6: Avoiding Cloud Outage


OUTAGE - JUNE 29, 2012

Page 7: Avoiding Cloud Outage


OUTAGE - OCTOBER 22, 2012

Page 8: Avoiding Cloud Outage


OUTAGE - CHRISTMAS EVE 2012

Page 9: Avoiding Cloud Outage


NOT ONLY AMAZON

28 December 2012 - some owners of Microsoft's XBox 360 gaming console were unable to access some of their cloud-based storage files.

26 July 2012 - Service for Microsoft’s Windows Azure Europe region went down for more than two hours

29 February 2012 - The ultimate result was service impacts of 8-10 hours for users of Azure data centers in Dublin, Ireland, Chicago, and San Antonio.

Page 10: Avoiding Cloud Outage

THAT’S WHAT YOU EXPECT?

• 99% - 3.65 days downtime per year
• 99.9% - 8.76 hours downtime per year
• 99.99% - 53 minutes downtime per year
• 99.999% - 5.26 minutes downtime per year
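The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative and not part of the original slides):

```python
# Yearly downtime budget implied by an availability percentage.
HOURS_PER_YEAR = 365 * 24  # 8760 hours; a non-leap year is assumed


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100 - availability_percent) / 100 * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each additional "nine" cuts the yearly downtime budget by a factor of ten, which is why the step from 99.9% to 99.99% moves the budget from hours to minutes.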

Page 11: Avoiding Cloud Outage


OUTAGE IMPACT – DESIGN FOR FAILURES

Outage could cost…
• $89K per hour for Amadeus
• $225K per hour for PayPal!
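To get a feel for what these hourly figures mean over a year, they can be multiplied by the downtime allowed at a given availability level. The sketch below is purely illustrative arithmetic: it assumes a 365-day year and that cost scales linearly with downtime, neither of which the slides state.

```python
HOURS_PER_YEAR = 365 * 24  # non-leap year assumed

# Hourly outage costs quoted on the slide, in US dollars.
COST_PER_HOUR = {"Amadeus": 89_000, "PayPal": 225_000}


def yearly_outage_cost(cost_per_hour: float, availability_percent: float) -> float:
    """Cost of the downtime allowed per year at the given availability.

    Assumes cost grows linearly with downtime (an illustrative simplification).
    """
    downtime_hours = (100 - availability_percent) / 100 * HOURS_PER_YEAR
    return cost_per_hour * downtime_hours


if __name__ == "__main__":
    for company, cost in COST_PER_HOUR.items():
        for pct in (99.9, 99.99):
            print(f"{company} at {pct}%: ${yearly_outage_cost(cost, pct):,.0f} per year")
```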

Page 12: Avoiding Cloud Outage


DISASTER RECOVERY

Page 13: Avoiding Cloud Outage


MULTI CLOUD


Page 14: Avoiding Cloud Outage

PREPARE FOR DISASTER RECOVERY

• Dedicated expert for DR architecture
• Define target recovery time & point
• Assume every tier can fail
• Use monitoring and alerts
• Document your operational processes
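The "target recovery time & point" bullet is commonly expressed as a recovery time objective (RTO) and a recovery point objective (RPO). A minimal sketch of checking measured values against such targets follows; the class, function, and example numbers are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    """Hypothetical recovery targets for one application tier."""
    rto: timedelta  # recovery time objective: longest acceptable time to restore service
    rpo: timedelta  # recovery point objective: longest acceptable window of lost data


def violations(targets: RecoveryTargets,
               measured_recovery: timedelta,
               replication_lag: timedelta) -> list:
    """Return a list of human-readable violations; an empty list means targets are met."""
    problems = []
    if measured_recovery > targets.rto:
        problems.append(f"recovery took {measured_recovery}, RTO is {targets.rto}")
    if replication_lag > targets.rpo:
        problems.append(f"replication lag is {replication_lag}, RPO is {targets.rpo}")
    return problems


if __name__ == "__main__":
    # Example numbers are made up for illustration.
    db_targets = RecoveryTargets(rto=timedelta(minutes=15), rpo=timedelta(minutes=1))
    print(violations(db_targets, timedelta(minutes=20), timedelta(seconds=30)))
```

A check like this can feed the monitoring and alerting mentioned in the list, for example by running it after every DR drill.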

Page 15: Avoiding Cloud Outage


CHAOS MONKEY


Page 16: Avoiding Cloud Outage


It’s all about REDUNDANCY!

Page 17: Avoiding Cloud Outage

CLONE YOUR ENVIRONMENT

Page 18: Avoiding Cloud Outage

CLONE YOUR DATA

• RDS Read Replica
• More to come…

Page 19: Avoiding Cloud Outage

Automating your DR Processes

Page 20: Avoiding Cloud Outage

Leverage Existing Automation Frameworks

• Configuration Centric
• APP Centric (PaaS)

Page 21: Avoiding Cloud Outage

CLONE YOUR ENV - HOW DOES IT WORK?

Page 22: Avoiding Cloud Outage

BUILT IN SUPPORT FOR MANAGING DATA IN THE CLOUD

• Real Time: Storm, Elastic Caching (XAP)
• Relational DB Clusters: MySQL, Postgres
• NoSQL Clusters: MongoDB, Cassandra, Couchbase, ElasticSearch
• Hadoop: Hadoop (Hive, Pig, ..), ZooKeeper

Page 23: Avoiding Cloud Outage


Real Life Scenario

Page 24: Avoiding Cloud Outage

VERIFI (CURRENT) DEPLOYMENT ARCHITECTURE

[Diagram: one availability region (US-West: Oregon) with four EC2 instances running mod_cluster (Internet-facing), JBoss, PostgreSQL, and Cassandra, plus two data volumes. The deployment uses 4 recipes.]

Page 25: Avoiding Cloud Outage

TARGET ARCHITECTURE

[Diagram: two availability regions. US-West (Oregon): EC2 instances running mod_cluster (Internet-facing), JBoss, the Postgres master, and Cassandra, with data volumes. US-East (Virginia): EC2 instances running mod_cluster, JBoss, the Postgres slave, and Cassandra, with data volumes. A link labeled "replication" connects the two regions.]

Bootstrap two EC2 clouds in different regions and install the “verifi” application on each. The second cloud has a slightly modified (extended) Postgres recipe that makes it act as a slave, and it runs no app servers. When the primary region fails, the second cloud spins up app server instances and turns its data instance into the master; it then bootstraps another “slave” cloud in another region.
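The idea of one application definition plus a small set of overrides for the standby cloud can be sketched as a recursive merge of two dictionaries. This is not Cloudify recipe syntax; every key, instance count, and function name below is a hypothetical stand-in, and only the region identifiers (`us-west-2` for Oregon, `us-east-1` for Virginia) are real AWS names.

```python
from copy import deepcopy

# Hypothetical base description of the primary deployment (not a Cloudify recipe).
BASE_APP = {
    "name": "verifi",
    "region": "us-west-2",  # US-West (Oregon)
    "services": {
        "mod_cluster": {"instances": 1},
        "jboss": {"instances": 1},
        "postgres": {"instances": 1, "role": "master"},
        "cassandra": {"instances": 1},
    },
}

# Overrides for the standby cloud: Postgres runs as a slave, no app servers run.
STANDBY_OVERRIDES = {
    "region": "us-east-1",  # US-East (Virginia)
    "services": {
        "jboss": {"instances": 0},
        "postgres": {"role": "slave"},
    },
}


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides merged in, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


if __name__ == "__main__":
    standby = apply_overrides(BASE_APP, STANDBY_OVERRIDES)
    print(standby["region"], standby["services"]["postgres"])
```

Keeping the overrides small means the standby cloud stays a near-clone of the primary, so a change to the base definition reaches both deployments.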

Page 26: Avoiding Cloud Outage

FAILOVER SCENARIO

[Diagram: failover across three clouds. Cloud #1: Region US-West (Oregon), app servers and PostgreSQL. Cloud #2: Region US-East (Virginia), PostgreSQL, with app servers started once cloud #1 fails (marked X). Cloud #3: Region US-West (California), PostgreSQL. Liveness polls run between the clouds.]

0. Upon initial deployment, the primary deployment of the application is bootstrapped onto cloud #1. Another, slightly modified application recipe is bootstrapped as cloud #2, which polls cloud #1 for failure and acts as a PostgreSQL DB slave.
1. Region failure occurs.
2. Turn the Postgres slave into master and start app server instances.*
3. Bootstrap another cloud in a different region, using the same application recipe used to bootstrap cloud #2 above.*
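The failover scenario above can be sketched as a small polling loop on the standby cloud. This is a simulation under stated assumptions, not Cloudify code: the class and callback names are hypothetical, and the rule of waiting for several consecutive missed polls before failing over is an added assumption that the slides do not specify.

```python
from typing import Callable


class StandbyCloud:
    """Standby deployment that polls the primary and takes over when it stops responding."""

    def __init__(self,
                 is_primary_alive: Callable[[], bool],
                 promote_db: Callable[[], None],
                 start_app_servers: Callable[[], None],
                 bootstrap_standby: Callable[[], None],
                 max_missed_polls: int = 3) -> None:
        self._is_primary_alive = is_primary_alive
        self._promote_db = promote_db
        self._start_app_servers = start_app_servers
        self._bootstrap_standby = bootstrap_standby
        self.max_missed_polls = max_missed_polls  # assumed threshold, not from the slides
        self.missed_polls = 0
        self.active = False  # True once this cloud has taken over as primary

    def poll_once(self) -> bool:
        """Run one liveness poll; return True if this poll triggered the failover."""
        if self.active:
            return False
        if self._is_primary_alive():
            self.missed_polls = 0
            return False
        self.missed_polls += 1
        if self.missed_polls < self.max_missed_polls:
            return False
        self._promote_db()          # turn the Postgres slave into the master
        self._start_app_servers()   # start app server instances
        self._bootstrap_standby()   # bootstrap a new slave cloud in another region
        self.active = True
        return True


if __name__ == "__main__":
    responses = iter([True, False, False, False])  # simulated primary: fails after one poll
    cloud2 = StandbyCloud(
        is_primary_alive=lambda: next(responses),
        promote_db=lambda: print("promote Postgres slave to master"),
        start_app_servers=lambda: print("start app servers"),
        bootstrap_standby=lambda: print("bootstrap cloud #3 as new slave"),
    )
    for _ in range(4):
        cloud2.poll_once()
```

Requiring consecutive missed polls trades a slower reaction for fewer false failovers caused by a transient network glitch; the right threshold depends on the recovery time target.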

Page 27: Avoiding Cloud Outage


NEXT STEPS

[Diagram: deployment options ordered by availability.]

• Across clouds (AWS, Rackspace, Azure, etc.): 1 application + overrides, several cloud drivers
• Across AWS regions: 1 application + overrides, 1 cloud driver
• Across AWS zones: 1 application + overrides, 1 cloud driver

Diagram label: Supported by Verifi phase #1

Page 28: Avoiding Cloud Outage


EVOLUTION PATH

[Chart: axes labeled Availability and Complexity, with four stages: multi instance, multi zone, multi region, multi cloud/provider.]

Page 29: Avoiding Cloud Outage

SUMMARY

• AWS and outages
• Outage impact
• Disaster Recovery – it’s all about redundancy!
  • Cloning your environment – app stack
  • Cloning your DB – Replication
• Cloudify as a solution for Redundancy
  • Use recipes to work on any cloud
  • Fast and customized data replication
• Demo with Cloudify on EC2

Page 30: Avoiding Cloud Outage

QUESTIONS & ANSWERS

Thank You!
@natishalom