Rainya Mosher, Dev Manager, Deploy Infrastructure IRC: rainya on freenode Twitter: @rainyamosher Learning to Scale OpenStack: A Case Study in Rackspace's Open Cloud Deployment April 17, 2013 at 4:30pm

Learning to Scale OpenStack

DESCRIPTION

"Learning to Scale OpenStack: A Case Study in Rackspace's Open Cloud Deployment" was presented at the OpenStack Design Summit in Portland, OR on April 17, 2013. A recording of the presentation is available on YouTube: http://www.youtube.com/watch?v=3x8X6f5mnzc

Page 1: Learning to Scale OpenStack

Rainya Mosher, Dev Manager, Deploy Infrastructure
IRC: rainya on freenode
Twitter: @rainyamosher

Learning to Scale OpenStack: A Case Study in Rackspace's Open Cloud Deployment

April 17, 2013 at 4:30pm

Page 2: Learning to Scale OpenStack

RACKSPACE® HOSTING | WWW.RACKSPACE.COM

It is not the critic who counts; not the man who points out how the strong man stumbles, or where the doer of deeds could have done them better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; . . . who at best knows in the end the triumph of high achievement, and who at worst, if he fails, at least fails while daring greatly.

Theodore Roosevelt

The Man in the Arena, April 1910

In the Arena

Page 3: Learning to Scale OpenStack

What does "At Scale" Mean?

[Diagram: a Global Cloud contains Regions, each Region contains Cells, and each Cell contains hypervisors (HVs). Scale labels on the diagram: Hundreds of HVs, Thousands of HVs, Tens of Thousands of HVs, Hundreds of Thousands of HVs.]

Page 4: Learning to Scale OpenStack

What is the Control Plane Release Strategy?

Code → Package → Deploy → Verify
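As a rough illustration of the four stages above, the release flow can be pictured as an ordered list of steps that stops at the first failure. This is a minimal sketch, not Rackspace's tooling: `run_release` and the stand-in stage callables are hypothetical names introduced here.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_release(stages: List[Stage]) -> List[str]:
    """Run stages in order and return the names of the stages that completed.

    Execution stops at the first stage that reports failure, so a failed
    deploy never reaches verification.
    """
    completed: List[str] = []
    for name, action in stages:
        if not action():
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Stand-in actions (assumptions); real stages would build, ship, and check code.
    stages = [
        ("code", lambda: True),
        ("package", lambda: True),
        ("deploy", lambda: True),
        ("verify", lambda: True),
    ]
    print(run_release(stages))
```

Keeping the stages as plain data makes it easy to time each one, which matters once a single deploy starts taking hours.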

Page 5: Learning to Scale OpenStack

First Scaling Hurdle – Deploy Mechanism

• Aug 2012

– Rackspace launches Open Cloud

– Frequent releases to fine tune

• Sep 2012 thru Nov 2012

– Deploying code that is two weeks from trunk takes about two hours

– Begin designing new deploy mechanism at October Summit

• Dec 2012

– Code deploys take 4 - 6 hours

– Deploy team says, bleary-eyed, they aren’t doing it again

• Jan 2013

– Deploy again

– Takes more than 6 hours

– Accept that it is no longer “reasonable” and temporarily stop deploying code releases

– Focus on the deploy mechanism

[Chart: Internal Code Releases, with a linear trend line, and Capacity, by month from Aug 2012 to Feb 2013.]

Page 6: Learning to Scale OpenStack

Improving the Deploy Mechanism

• Package: switched from Debian packages to virtual environments
• Distribute: used torrent for the package, pssh for fact files, and mcollective for actions
• Execute: moved from a centralized puppet master to decentralized masterless puppet

Page 7: Learning to Scale OpenStack

Second Scaling Hurdle – Catch up to Trunk

• March 2013

– Production code is 2 months behind trunk

– Trunk as of 2/28 becomes our “v152” and bakes in preprod

– Prep for impacting DB migrations in production

– Re-enable our CI process

• April 2013

– Deploy v152 to production

– 10x increase in DB traffic

– Community works to fix

– Re-deploy v152 with Community fixes

– Attend Summit in Portland and share the story

[Chart: DB throughput over time, with four annotated points.]

1 – Normal DB throughput; 2 – First installation of v152; 3 – Disabled several periodic tasks; 4 – Re-installed v152 with patches from the Community and turned periodic tasks back on

Page 8: Learning to Scale OpenStack

How Can We Adapt for Scale Issues?

• Testing & Environments
– More robust testing coverage
– Deployer-specific testing further upstream
– Production-like dev environments
– Simulate production compute numbers on non-production hardware

• Database & Code Management
– Non-disruptive DB migration patterns
– DB calls with 6 million rows in mind, not just 60
– Code optimization paths for large datasets

• Process & Community
– Stay close to trunk, even though it is hard
– Explore options for a continuously deployable trunk
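One way to read "DB calls with 6 million rows in mind, not just 60" is to avoid loading a whole table in a single query and instead walk it in bounded batches keyed on the primary key. The sketch below shows that pattern with Python's built-in `sqlite3` module. It is an illustration under stated assumptions: the `instances` table, its columns, and all function names are hypothetical and are not taken from the OpenStack schema or from the fixes described in this talk.

```python
import sqlite3
from typing import Dict, Iterator, List, Tuple


def iter_rows_in_batches(
    conn: sqlite3.Connection, batch_size: int = 1000
) -> Iterator[List[Tuple[int, str]]]:
    """Yield rows in primary-key order, at most batch_size rows per query."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, state FROM instances WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # Resume after the last id seen instead of using a growing OFFSET.
        last_id = rows[-1][0]


def count_by_state(conn: sqlite3.Connection, batch_size: int = 1000) -> Dict[str, int]:
    """Aggregate over the whole table while holding only one batch in memory."""
    counts: Dict[str, int] = {}
    for batch in iter_rows_in_batches(conn, batch_size):
        for _row_id, state in batch:
            counts[state] = counts.get(state, 0) + 1
    return counts


def make_demo_db(n: int) -> sqlite3.Connection:
    """Build an in-memory table with n rows; every third row is 'deleted'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, state TEXT)")
    conn.executemany(
        "INSERT INTO instances (id, state) VALUES (?, ?)",
        ((i, "active" if i % 3 else "deleted") for i in range(1, n + 1)),
    )
    return conn


if __name__ == "__main__":
    print(count_by_state(make_demo_db(10), batch_size=4))
```

The same code path runs whether the table holds 60 rows or 6 million; only the number of batches changes, so memory use and per-query cost stay bounded.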

Page 9: Learning to Scale OpenStack

Backup Slides

Many of these backup slides were first presented on 4/16/2013 during the OpenStack Summit session “Deploying from OpenStack Trunk” and are included here for reference.

Page 10: Learning to Scale OpenStack

Merge and Branch Strategy

• The most recent Rackspace release branch took over 50 minor tags to make it work in production

• The Rackspace Development branch is about 40 patches on top of OpenStack trunk for internal service compatibility

Page 11: Learning to Scale OpenStack

Package and Distribute Strategy

• Package
– per-project venv
– .tar of project venvs + configs

• Distribute
– seed .torrent
– distribute fact files
– verify completion

• Execute
– switch version
– sync databases
– run puppet
– verify completion
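The "Package" and "switch version" steps can be sketched with the Python standard library. The slides do not say how the version switch is implemented, so the symlink approach below is an assumption chosen for illustration, as are the function names, the `current` link name, and the directory layout. It relies on POSIX rename semantics.

```python
import os
import tarfile


def package_release(src_dir: str, tar_path: str) -> None:
    """Bundle a release directory (for example venvs plus configs) into one .tar."""
    with tarfile.open(tar_path, "w") as tar:
        # Store paths relative to the release directory name, not absolute paths.
        tar.add(src_dir, arcname=os.path.basename(src_dir))


def switch_version(releases_dir: str, version: str, link_name: str = "current") -> None:
    """Point releases_dir/link_name at releases_dir/version.

    A temporary symlink is created and then renamed over the live one, so
    readers see either the old version or the new one, never a missing link.
    """
    target = os.path.join(releases_dir, version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    link = os.path.join(releases_dir, link_name)
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(version, tmp_link)  # relative link, so the tree can be moved
    os.replace(tmp_link, link)  # atomic rename on POSIX filesystems
```

A deploy step might call `switch_version("/opt/releases", "v152")` after the unpacked release has been verified on the node. The path here is hypothetical.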

Page 12: Learning to Scale OpenStack

Deploy and Test Strategy

• Dev
– pre-code check-in validation

• Integration
– smoke tests
– unit tests

• QA
– functional tests
– integration tests

• Pre-Prod
– regression tests
– build tests

• Production
– smoke tests
– build tests

Page 13: Learning to Scale OpenStack

Benefits and Challenges of Trunk Deploys

Why We Do It (Benefits)

• Issue Resolution
– Early detection of issues and conflicts
– Shorter feedback loop within the community
– Faster resolution of issues

• Early Feature Delivery
– Smaller, incremental periodic releases
– More stable release candidates at end of cycle

Why It’s Hard (Challenges)

• Code Management
– Merge conflicts with local patches
– Disruptive DB migrations
– Service restarts
– Temporary version skew

• Testing
– Devstack-based testing vs testing at scale
– Rework when issues found in RAX deploy pipeline

• Process
– CI/CD vs Release methodology
– Time to merge patches

Page 14: Learning to Scale OpenStack

Scale of Deploy Pipeline

• Dev: DevStack
• Integration & QA: 10s of Nodes
• PreProd: 100s of Nodes
• Production: 1,000s of Nodes

Page 15: Learning to Scale OpenStack

RACKSPACE® HOSTING | 5000 WALZEM ROAD | SAN ANTONIO, TX 78218

US SALES: 1-800-961-2888 | US SUPPORT: 1-800-961-4454 | WWW.RACKSPACE.COM

RACKSPACE® HOSTING | © RACKSPACE US, INC. | RACKSPACE® AND FANATICAL SUPPORT® ARE SERVICE MARKS OF RACKSPACE US, INC. REGISTERED IN THE UNITED STATES AND OTHER COUNTRIES. | WWW.RACKSPACE.COM