Managing RightScale on RightScale


RightScale Webinar: February 1, 2011 – Just like our customers, RightScale runs in the cloud and requires the best platform to automate operations. As such, RightScale uses RightScale to manage RightScale. Our complete infrastructure – development, testing, staging, and production – consists of servers that are configured, launched and managed by the RightScale Platform.


1

Managing RightScale on RightScale

February 1, 2011

2

Your Panel Today

Presenting

• Rafael H. Saavedra – VP, Engineering at RightScale

• Chris Horne – Director, Product Marketing at RightScale

Q&A

• Douglas Johnson – Operations Manager at RightScale

Please use the questions window to ask questions any time!

3

Topics

• Managing RightScale on RightScale (Dev, Staging, Prod & Meta)

• RightScale Meta manages RightScale Production

• Production System Overview

• Monitoring Production – Quis Custodiet Ipsos Custodes

• Our Favorite RightScale Features

• Our Not-so-favorite Features

• Deploying RightScale – Cloud Best Practices

4

Managing RightScale on RightScale

[Diagram labels: RightScale Production; Customer A, Customer B, Customer C, Customer D; RightScale Staging; RightScale Development (two systems)]

5

RS Production is managed by RS Meta

[Diagram labels: RightScale Meta Production; RightScale Production; Customer A, Customer D; RightScale Staging; RightScale Development (two systems)]

6

A multitude of RightScale systems

• Meta Production manages the Production system

• Meta currently lives outside the cloud containing production

• Meta is extremely secure, accessible only by a handful of operations folks

• The Production system is my.rightscale.com

• We are reaching 200 servers with a large fraction in EC2 US-East

• Servers are located in every cloud to achieve high availability

• Servers are allocated in well defined availability zones

• A few staging systems are used for integration and QA

• Ad hoc systems for performance testing, demos, betas, etc.

• Many development systems with simplified configurations

• Development systems are available at the click of a button

7

Significant increase in cloud usage

[Two charts: EC2 usage by month, November 2008 to October 2010]

8

Some interesting RightScale numbers

• 2M servers launched by RightScale

• RightScale continuously monitors more than 70k servers

• Every day at RightScale:

• 2,000 array resize actions are executed

• 35,000 alert escalations are triggered

• 20,000 escalation emails are sent to users

• 9.0TB of monitoring data is exchanged with our servers

• 1.6TB of logging data is sent to our servers

9

RightScale production (simplified)

[Diagram labels: Front Ends; Main App; dashboard; API; others; daemons; databases (DB Master, DB Slave); mirrors; logging; monitoring]

10

What do our users do?

• Dashboard, API, monitoring graphs & event notifications

• Most requests are monitoring updates: 85% of requests (70% of bandwidth)

• Dashboard and API calls are heavier requests; they represent 7% of requests but 26% of bandwidth

Distribution by Requests: Monitoring 85%, Notifications 8%, API 6%, Dashboard 1%

Distribution by Bandwidth: Monitoring 70%, Notifications 4%, API 15%, Dashboard 11%
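
The 7% and 26% figures on this slide follow directly from the two distributions. As a quick check, the sketch below sums the API and Dashboard shares and also derives how heavy each request type is relative to the average request. The variable and function names are illustrative only; the percentages are the ones shown on the slide.

```python
# Shares taken from the two pie charts on this slide (percent).
requests = {"Monitoring": 85, "Notifications": 8, "API": 6, "Dashboard": 1}
bandwidth = {"Monitoring": 70, "Notifications": 4, "API": 15, "Dashboard": 11}

def combined_share(shares, categories):
    """Sum the percentage share of the given categories."""
    return sum(shares[c] for c in categories)

interactive = ("API", "Dashboard")
req_share = combined_share(requests, interactive)   # 6 + 1
bw_share = combined_share(bandwidth, interactive)   # 15 + 11

# Bandwidth share divided by request share: a ratio above 1 means a request
# of this type uses more bandwidth than the average request.
relative_weight = {name: bandwidth[name] / requests[name] for name in requests}

print(f"API + Dashboard: {req_share}% of requests, {bw_share}% of bandwidth")
for name, weight in relative_weight.items():
    print(f"{name}: {weight:.2f}x the average bandwidth per request")
```

On these numbers a Dashboard request is by far the heaviest per call, while a monitoring update is lighter than average, which is consistent with the slide's point that Dashboard and API calls are the heavier requests.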

11

We eat our own dog food

• Production servers are organized into independent deployments

• Core servers: frontends, core/api servers, databases, daemons

12

We eat our own dog food

• We use security groups extensively to isolate servers

• ServerTemplates are versioned for each major release

• This preserves the ability to launch exact configurations of past versions

13

Monitoring, alerts & escalations

• We monitor as much relevant data as possible and display it in insightful ways to quickly detect patterns and abnormalities

• We proactively eliminate the conditions that raise critical alerts

• "No broken windows" policy: no critical alert can remain unresolved.

[Graphs: API Network Activity; Dashboard Network Activity]

14

How to monitor hundreds of servers?

15

How to monitor hundreds of servers?

• We leverage a monitoring data warehouse to develop heat maps & stacked graphs
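
The slides do not describe the warehouse schema or the tooling behind the heat maps, so the following is only a minimal sketch of the general idea: bucket raw samples into a server-by-time grid of averages, then shade each cell. The sample tuples, the function names, and the text rendering are all assumptions made for illustration.

```python
from collections import defaultdict

def build_heat_grid(samples, bucket_seconds):
    """Average values into a {server: {bucket_index: mean}} grid."""
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for server, timestamp, value in samples:
        cell = sums[server][timestamp // bucket_seconds]
        cell[0] += value
        cell[1] += 1
    return {
        server: {b: total / count for b, (total, count) in buckets.items()}
        for server, buckets in sums.items()
    }

def render(grid, shades=" .:*#", max_value=100.0):
    """Render one row per server; denser characters mean higher values."""
    all_buckets = sorted({b for row in grid.values() for b in row})
    lines = []
    for server in sorted(grid):
        row = ""
        for b in all_buckets:
            value = grid[server].get(b)
            if value is None:
                row += "?"  # no data for this server in this bucket
            else:
                idx = min(int(value / max_value * len(shades)), len(shades) - 1)
                row += shades[idx]
        lines.append(f"{server:<8}{row}")
    return "\n".join(lines)

# Made-up CPU utilisation samples: (server, unix_seconds, percent).
samples = [
    ("fe-1", 0, 10), ("fe-1", 30, 20), ("fe-1", 60, 95),
    ("db-1", 0, 50), ("db-1", 60, 55), ("db-1", 120, 60),
]
grid = build_heat_grid(samples, bucket_seconds=60)
print(render(grid))
```

With hundreds of servers, one row per server and one column per time bucket lets an outlier (a single dense row or column) stand out at a glance, which is the property that makes heat maps useful here.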

16

Quis Custodiet Ipsos Custodes?*

• We monitor the monitoring and alerting systems

• We extensively use alerts to monitor the responsiveness of all RightScale servers

• When you have hundreds of cloud servers, you statistically see more instance failures. Instance and EBS failures can cause headaches. Be prepared to grab a new instance.

• The meta & production monitoring and alerting systems are fully decoupled from each other

* Who watches the watchmen?
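
One common way to "monitor the monitors" is a heartbeat cross-check, where each monitoring system reports on its peers but never on itself. The slides do not say how RightScale implements this, so the sketch below is an assumption-laden illustration: the system names, timestamps, and functions are all hypothetical.

```python
def stale_sources(last_heartbeat, now, max_age_seconds):
    """Return the names whose latest heartbeat is older than max_age_seconds."""
    return sorted(
        name for name, seen in last_heartbeat.items()
        if now - seen > max_age_seconds
    )

def cross_check(views, now, max_age_seconds):
    """Each watcher reports on the other systems only, never on itself."""
    report = {}
    for watcher, last_heartbeat in views.items():
        others = {n: t for n, t in last_heartbeat.items() if n != watcher}
        report[watcher] = stale_sources(others, now, max_age_seconds)
    return report

# Hypothetical data: when each system last heard from its peers (unix seconds).
views = {
    "meta-monitoring": {"prod-monitoring": 940, "prod-alerting": 700},
    "prod-monitoring": {"meta-monitoring": 990},
}
print(cross_check(views, now=1000, max_age_seconds=120))
```

Because the two watchers keep separate views, a failure of one does not silence the check on the other, which mirrors the decoupling between the meta and production systems described above.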

17

Our favorite RightScale features

• RightImages – Resist the temptation to build custom images. Leverage pure base images to avoid introducing surprises.

• Input Inheritance – Makes it easy to keep configurations in sync for dozens of servers

• ServerTemplates – Make it very easy to reproduce configurations across production, staging and development. You have to fully automate configuration to manage a high number of servers.

• Component Library – There are always new assets (RightScripts, ServerTemplates, Macros, etc.) that can be adapted to our needs

• Monitoring – It’s easy to make collectd plugins to monitor just about anything
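
To illustrate the last point, collectd's Exec plugin accepts values that a script writes to standard output as `PUTVAL` lines in collectd's plain-text protocol. The sketch below formats one such line. The metric (entries in a spool directory), the plugin name `spool`, and the helper functions are hypothetical; the `COLLECTD_HOSTNAME` and `COLLECTD_INTERVAL` environment variables are, to my understanding, the ones the Exec plugin provides, with fallbacks used here so the script also runs standalone.

```python
import os

def putval_line(host, plugin, type_name, type_instance, value, interval):
    """Format one PUTVAL command in collectd's plain-text protocol."""
    identifier = f"{host}/{plugin}/{type_name}-{type_instance}"
    # "N" asks collectd to timestamp the value with the current time.
    return f'PUTVAL "{identifier}" interval={interval} N:{value}'

def queue_depth(spool_dir):
    """Hypothetical metric: number of entries waiting in a spool directory."""
    return len(os.listdir(spool_dir))

# Fall back to defaults when not started by collectd.
host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
interval = os.environ.get("COLLECTD_INTERVAL", "10")

line = putval_line(host, "spool", "gauge", "pending", queue_depth("."), interval)
print(line, flush=True)  # flush so the value is not held in a buffer
```

A real plugin would typically repeat the read-and-print step in a loop, sleeping for the interval between samples; the single sample here keeps the sketch short.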

18

Our not-so-favorite features

• ServerTemplate Inputs – Powerful, but too many of them make templates difficult to use. Document them well for others.

• Revision Management – Still a way to go to make users aware of new versions and how to update

• Component Library – Finding new resources in the library is neither easy nor intuitive

• Alerts – They work pretty well but are not easy to configure, custom ones in particular

19

Best practices for upgrading RightScale

• In the cloud, the cost of duplicating servers is minimal

• Avoid upgrading existing servers (a non-cloud approach). Launch fresh ones with new software instead (fail forward).

• Old servers can take over in case something goes wrong

• Launch additional slaves to capture recovery points

• One slave continues to replicate in case of master failure

• Another slave is frozen at upgrade point – can rollback by failing over

• Don’t forget to take snapshots in case of major failure

20

Upgrading RightScale Step-by-Step

[Diagram labels: Front Ends; Main App (two); Databases; DB Master; DB Slave (two)]

1) Servers with current code

2) Servers with new code

3) Add second slave

4) Cut access to site

5) Stop all access to databases

6) Stop replication

7) Take snapshot at cutoff

8) Update schema

9) Reconnect all servers

10) Open access to site
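
The ten steps on this slide can be read as an ordered runbook in which each step must succeed before the next begins. The sketch below is one way to express that, not RightScale's actual tooling: the step wording for steps 1 and 2 is my paraphrase of the diagram labels, and `run_runbook` and `simulated_execute` are hypothetical names.

```python
UPGRADE_STEPS = [
    "Servers with current code stay up",
    "Launch servers with new code",
    "Add second slave",
    "Cut access to site",
    "Stop all access to databases",
    "Stop replication",
    "Take snapshot at cutoff",
    "Update schema",
    "Reconnect all servers",
    "Open access to site",
]

def run_runbook(steps, execute):
    """Run steps in order; stop at the first failure.

    Returns (completed_step_numbers, failed_step) where failed_step is
    None on success or a (number, description) tuple on failure.
    """
    completed = []
    for number, description in enumerate(steps, start=1):
        if not execute(number, description):
            return completed, (number, description)
        completed.append(number)
    return completed, None

def simulated_execute(number, description):
    """Stand-in for real work: pretend the schema update (step 8) fails."""
    print(f"{number:>2}) {description}")
    return number != 8

completed, failed = run_runbook(UPGRADE_STEPS, simulated_execute)
if failed is not None:
    print(f"Stopped at step {failed[0]}: {failed[1]}")
```

In the simulated failure, the run stops after the snapshot and before the site reopens, so the servers with current code and the slave frozen at the cutoff are still available for a rollback, as the best-practices slide describes.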

21

Upgrading RightScale Step-by-Step

[Diagram labels: Front Ends; Main App (servers with new code); Main App (servers with old code); Databases; DB Master; DB Slave (two); Cutoff Snapshot]

22

Q&A / Getting Started

Have a project and want to discuss how RightScale can help?

• Contact sales@rightscale.com or (866) 720-0208

Ready to get started?

• Sign up for our Free Edition: www.RightScale.com/Free

• Call us for a VIP trial of our paid editions

Need to learn more?

• TCO calculator: www.RightScale.com/tco-calculator

• User Conference Videos: www.RightScale.com/conference

• Webinar archive: www.RightScale.com/webinars

• White papers: www.RightScale.com/whitepapers

23

Thank You!
