© 2014 VMware Inc. All rights reserved. OpenStack: Toward a More Resilient Cloud Or: “How I Learned to Stop Worrying & Embrace My Inner Cloudiness” Mark T. Voelker OpenStack Architect Percona University Smart Data Event February 12, 2015


Page 1: OpenStack: Toward a More Resilient Cloud

© 2014 VMware Inc. All rights reserved.

OpenStack: Toward a More Resilient Cloud

Or: “How I Learned to Stop Worrying & Embrace My Inner Cloudiness”

Mark T. Voelker

OpenStack Architect

Percona University Smart Data Event

February 12, 2015

Page 2: OpenStack: Toward a More Resilient Cloud

• Who is this guy?

• A little background on OpenStack

• Building more resilient clouds

– Withstanding failures

– Quickly recovering from failures

• Questions?

Page 3: OpenStack: Toward a More Resilient Cloud

@marktvoelker

• OpenStack Architect @ VMware, OpenStack ATC, ex-StackForge Puppet core dev, Triangle OpenStack Meetup founder, OS Foundation Member #54

• Fact: can be bribed with doughnuts

• Currently works in VMware’s Software Defined Datacenter R&D group

• In copious (hah!) spare time: OpenStack solutions, Big Data, Massively Scalable Data Centers, Devops, making sawdust with extreme prejudice

Page 4: OpenStack: Toward a More Resilient Cloud

• Tech lead, manager, software developer, architect

• Started in OpenStack in 2011 at the Diablo Design Summit

Page 5: OpenStack: Toward a More Resilient Cloud

….I’ve built a few clouds.

Page 6: OpenStack: Toward a More Resilient Cloud

Today’s talk won’t be overly formal….

Page 7: OpenStack: Toward a More Resilient Cloud

…because I tend to get excited by this stuff.

Page 8: OpenStack: Toward a More Resilient Cloud

• Who is this guy?

• A little background on OpenStack

• Building more resilient clouds

– Withstanding failures

– Quickly recovering from failures

• Questions?

Page 9: OpenStack: Toward a More Resilient Cloud

“OpenStack is a global collaboration of developers and cloud computing technologists producing the ubiquitous open source cloud computing platform for public and private clouds. The project aims to deliver solutions for all types of clouds by being simple to implement, massively scalable, and feature rich. The technology consists of a series of interrelated projects delivering various components for a cloud infrastructure solution.”

-- openstack.org

Basically, it’s software to run cloud services (including compute, network, storage, and security) and the community behind that software.

Page 13: OpenStack: Toward a More Resilient Cloud

• IRC Channels and Mailing Lists

• User/Meetup Groups

• Social Networking

– Twitter

– LinkedIn

– Facebook

– Ohloh

• Code in cgit, mirrored on GitHub, Bugs/Milestones in Launchpad

– For now…may move to StoryBoard in future

• Over 20 million lines of code by over 1,419 contributors

• Two Annual Design Summit/Conferences (coinciding roughly w/releases)

• Want to contribute? Start here.

Page 14: OpenStack: Toward a More Resilient Cloud

• Don’t be intimidated.

• HolycrapthingsmovereallyreallyfastinOpenStack

• Jump in feet first: be agile and flexible.

• This is going to feel a little different for some of you.

Page 16: OpenStack: Toward a More Resilient Cloud

OpenStack services and their rough AWS counterparts:

• Horizon: AWS Management Console

• Nova: EC2

• Neutron: VPC

• Swift (Object Storage): S3

• Cinder (Block storage): EBS

Other services shown:

• Glance (VM Image Service)

• Keystone (Identity Service)

• Ceilometer (Telemetry Service)

• Trove (Database Service)

• Heat (Orchestration Service)

• Sahara (Data Processing Service)

Page 17: OpenStack: Toward a More Resilient Cloud

Library Projects

Supporting Projects

Documentation

Oslo (common code libraries)

Client libraries

Incubated Projects (may become core components in the future)

Designate (DNS service)

Zaqar (queuing service)

Gating Projects

CI & Infrastructure

DevStack (deployment script)

Tempest (integration test)

Barbican (key management)

Manila (shared FS as a service)

Page 19: OpenStack: Toward a More Resilient Cloud

• Who is this guy?

• A little background on OpenStack

• Building more resilient clouds

– Withstanding failures

– Quickly recovering from failures

• Questions?

Page 20: OpenStack: Toward a More Resilient Cloud

What’s a “resilient” cloud?

re·sil·ient /rəˈzilyənt/

(adjective) Able to withstand or recover quickly from difficult conditions.

Page 21: OpenStack: Toward a More Resilient Cloud

• Today we’ll primarily focus on the cloud itself

• Workloads running *in* clouds are another story…but we’ve only got one hour!

Page 22: OpenStack: Toward a More Resilient Cloud

8am: “Uh-oh. Something tells me it’s going to be an interesting day in the datacenter….”

Page 23: OpenStack: Toward a More Resilient Cloud

• Hardware Failures

• OpenStack software bugs (yep, those exist)

• Underpinning software failures (database, message queue, etc)

• Operating system failures

• Network/storage/power failures

• Planned maintenance windows

• Hackers and malcontents

• Upgrades

• Automation failures

• “Whoops, did I do that?”

Page 24: OpenStack: Toward a More Resilient Cloud

Some causes of outages in the past year

….did you plan for these?


Page 25: OpenStack: Toward a More Resilient Cloud

Sometimes things break (in *any* system).


Withstand what you can. Quickly recover from the rest.

Because you don’t look this cute when your cloud is down.

Page 27: OpenStack: Toward a More Resilient Cloud

High Availability? Sounds great--I’ll take two!

Page 28: OpenStack: Toward a More Resilient Cloud

General Premise:

Assume hardware and software fail.

(because, shockingly, that actually happens)


Page 29: OpenStack: Toward a More Resilient Cloud

What Does “HA” Mean in an OpenStack Cloud?

• Compute

– Multiple clusters

– Consider segmenting with Availability Zones, Host Aggregates, etc.

– Consider ability to live migrate instances for hypervisor node maintenance

– Ensure some capacity buffer for maintenance operations

• Storage

– Avoid single points of failure

– Multiple technologies can be used…but each has its own limitations

– Don’t think just Cinder here…your Glance backend and compute storage matter too!

• Network

– Network disruptions will inevitably occur, so plan for them

– Design for control plane disruption (and pick technology accordingly)

• Control Plane

– May depend on the other things above

– Essential to keeping the cloud operational

• Data Plane

– Stuff that workloads running in the cloud depend on
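The "capacity buffer for maintenance" point can be checked with simple arithmetic. The sketch below is an illustration only, not something from the talk: the function name, the host names, and the choice of RAM as the only resource are assumptions. It reports which hypervisors could not be emptied because the remaining hosts lack free capacity.

```python
def can_drain_any_host(hosts):
    """Return the hosts that could NOT be emptied for maintenance.

    `hosts` maps a (hypothetical) host name to a (total_gb, used_gb) tuple.
    Simplification: only aggregate free RAM is compared, so per-instance
    fragmentation is ignored.
    """
    stuck = []
    for name, (_, used) in hosts.items():
        free_elsewhere = sum(
            total - in_use
            for other, (total, in_use) in hosts.items()
            if other != name
        )
        if used > free_elsewhere:
            stuck.append(name)
    return stuck


# Hypothetical three-node compute cluster, RAM in GB.
cluster = {
    "compute-1": (256, 128),
    "compute-2": (256, 160),
    "compute-3": (256, 200),
}
print(can_drain_any_host(cluster))
```

An empty result means any single host can be drained. A real check would also need to consider CPU, disk, and whether individual instances fit on individual hosts.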

Page 30: OpenStack: Toward a More Resilient Cloud

High Availability Is Part of the Story….

….we need to think a bit about architecture.

(I’ll use a reference architecture from VMware Integrated OpenStack as an example)


Page 31: OpenStack: Toward a More Resilient Cloud

VIO Architecture – Logical Topology-Management

Page 32: OpenStack: Toward a More Resilient Cloud

Notice something?

There’s a lot of stuff in there that isn’t OpenStack, but upon which OpenStack depends.


Page 33: OpenStack: Toward a More Resilient Cloud

VIO Architecture – RabbitMQ

• RabbitMQ is a messaging broker - an intermediary for messaging. It gives applications a common platform to send and receive messages, and the messages a safe place to live until received.

• RabbitMQ is the default AMQP server used by OpenStack services (Qpid is also an option, and there is some support for 0mq). In production clouds, this should be a highly available infrastructure component.

• The OpenStack subcomponents (nova-scheduler to nova-compute, for example) communicate among themselves using this hosted message queue service. They also utilize the hosted Memcached services for caching authentication tokens etc. As always, they persist data to the Database.

• Component-to-component communication (Nova -> Neutron) is done via REST.

• For more details about the HA implementation of RabbitMQ, please click here.

Page 34: OpenStack: Toward a More Resilient Cloud

VIO Architecture – Database

• The database cluster is at the heart of the infrastructure. Typically MySQL or MariaDB is used, but other options such as PostgreSQL are also supported.

• The VIO MariaDB implementation makes use of a 3-node Galera cluster, which in itself is Active-Active-Active. However, since some OpenStack services enforce table locking, reads and writes are directed to a single node via the Load Balancers.

• Note that this database is for management plane data. OpenStack services that store data as part of their purpose may use additional DBs. For example: Ceilometer may store meter data to MySQL, MongoDB, PostgreSQL, HBase, or DB2.

Page 35: OpenStack: Toward a More Resilient Cloud

VIO Architecture – Load Balancers

• Most OpenStack services run on the Controllers; they are mirrored on each controller VM and load-balanced, and are accessible via the internal virtual IP.

• Some of the services, such as the Dashboard, compute-api, glance-api, keystone, cinder, neutron and novncproxy, are exposed to the end users via the load balancer’s public virtual IP.

• Likewise, the hosted Message Queue (RabbitMQ) and Memcached services are also load-balanced between 2 VMs.

• For the Database Service, the load balancer is configured to use a primary DB VM. In case of failure it will switch to one of the two backup DB VMs.

• Load Balancers used here are HAProxy with Keepalived for high availability
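The primary-plus-backups arrangement for the database can be expressed with HAProxy's `backup` keyword on `server` lines. The sketch below only renders such a section as text. It is an assumption about how one might write it, not the VIO configuration: the listener name, node names, and addresses are hypothetical, and a plain TCP `check` only verifies that the port answers (production setups often use a Galera-aware health check instead).

```python
def render_galera_listener(vip, port, nodes):
    """Render an HAProxy `listen` section with one active DB node.

    The first entry in `nodes` receives all traffic; the rest carry the
    `backup` keyword, so HAProxy only uses them if the primary's health
    check fails. Names and addresses are hypothetical.
    """
    lines = [
        "listen galera",
        f"    bind {vip}:{port}",
        "    mode tcp",
    ]
    for index, (name, address) in enumerate(nodes):
        role = "" if index == 0 else " backup"
        lines.append(f"    server {name} {address}:{port} check{role}")
    return "\n".join(lines)


print(render_galera_listener(
    "192.0.2.10", 3306,
    [("db-1", "192.0.2.11"), ("db-2", "192.0.2.12"), ("db-3", "192.0.2.13")],
))
```

Generating the section from a node list keeps the "exactly one primary" rule in one place instead of relying on hand-edited config files.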

Page 36: OpenStack: Toward a More Resilient Cloud

Etc, etc, etc

• There’s a network connecting all that stuff

• It’s running (as virtual machines) on servers which have operating systems

• Things may get wonky if NTP fails and clocks are out of sync

• If DNS can’t resolve, Bad Things ™ will probably happen
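Both the NTP and DNS points lend themselves to cheap automated checks. The following is a minimal sketch under stated assumptions: the function names are made up, how clock samples are gathered from each node is left out, and acceptable skew depends on your deployment.

```python
import socket


def unresolved_names(names, resolver=socket.gethostbyname):
    """Return the names that the resolver fails to look up."""
    failed = []
    for name in names:
        try:
            resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


def max_clock_skew(timestamps):
    """Return the largest difference, in seconds, between node clocks.

    `timestamps` maps node name to a Unix time sampled at (roughly) the
    same moment on each node; how the samples are collected is left out.
    """
    values = list(timestamps.values())
    return max(values) - min(values)


# Hypothetical samples from three controllers.
print(max_clock_skew({"ctl-1": 100.0, "ctl-2": 100.5, "ctl-3": 99.0}))
```

The resolver is injectable so the check can be tested without a network, and so a node-specific resolver can be swapped in later.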

Page 37: OpenStack: Toward a More Resilient Cloud

• Consider whether you want active/active or active/passive

• Setup and tooling differ a bit, but I generally like active/active

• Note that docs.openstack.org has an HA Guide

• Currently undergoing lots of updates…patches welcome!

• Prioritize HA for the control plane

• That also means thinking about your database, network, and RPC bus

• Note: HA == more hardware

• Some components need at least 3 nodes

• Mitigate by virtualizing control plane
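The "at least 3 nodes" requirement usually comes from components that need a majority of members to agree before accepting writes (Galera is one such case). The arithmetic below is a generic majority-quorum sketch, not code from any OpenStack component, and the function names are assumptions.

```python
def quorum_size(nodes):
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Note that two nodes tolerate no failures at all under this rule, which is why the jump from one node goes straight to three.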

Page 38: OpenStack: Toward a More Resilient Cloud

• Stuff OpenStack needs to run: message brokers

• Check out RabbitMQ clustering and mirrored queues

• Check out Galera for MySQL/MariaDB

• I often see Percona XtraDB in the wild

• Frontend with an HAProxy/Keepalived pair

• Memcached for caching

Page 39: OpenStack: Toward a More Resilient Cloud

• Don’t do rabbit clustering over a WAN

• Be aware of the SELECT…FOR UPDATE issue

Page 40: OpenStack: Toward a More Resilient Cloud

• Long story short: Neutron and some parts of Nova invoke an SQL pattern known as “SELECT…FOR UPDATE” which Galera doesn’t support due to issues with cross-node locking.

• Can cause deadlock-like symptoms due to locks not being replicated.

• Neutron/nova code being refactored, but will likely not be done soon.

• Meanwhile: use HAProxy to send writes to a single Galera node and you should be fine

• With the obvious scalability bottleneck

• More info here, here, & here.

• Thank Jay Pipes of Mirantis & Peter Boros of Percona for the find!
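To see why unreplicated locks produce deadlock-like errors, and why pinning writes to one node helps, here is a deliberately tiny toy model. It is not Galera's implementation: the class names are invented, and the only behavior modeled is that row locks are local to a node while conflicting writes are detected at commit time.

```python
class CertificationConflict(Exception):
    """Stands in for the deadlock-style error a client sees at commit."""


class ToyCluster:
    """Toy model: row locks are per node, conflicts surface at commit."""

    def __init__(self):
        self.version = {}  # row -> committed version (cluster-wide)
        self.locks = {}    # (node, row) -> name of txn holding the lock
        self.seen = {}     # (txn, row) -> version read under the lock

    def select_for_update(self, txn, node, row):
        """Try to lock `row` on `node` only; True if the lock was granted."""
        if self.locks.get((node, row), txn) != txn:
            return False  # same node: a real database would block here
        self.locks[(node, row)] = txn
        self.seen[(txn, row)] = self.version.get(row, 0)
        return True

    def commit(self, txn, node, row):
        """Certify the write against what other nodes already committed."""
        self.locks.pop((node, row), None)
        if self.version.get(row, 0) != self.seen.pop((txn, row)):
            raise CertificationConflict(f"{txn} lost the race on {row!r}")
        self.version[row] = self.version.get(row, 0) + 1


cluster = ToyCluster()
# Writers on different nodes: both "locks" are granted.
print(cluster.select_for_update("t1", "node-a", "port-42"))
print(cluster.select_for_update("t2", "node-b", "port-42"))
cluster.commit("t1", "node-a", "port-42")
try:
    cluster.commit("t2", "node-b", "port-42")
except CertificationConflict as exc:
    print("rolled back:", exc)

# Writers pinned to one node: the second has to wait its turn.
print(cluster.select_for_update("t3", "node-a", "port-42"))
print(cluster.select_for_update("t4", "node-a", "port-42"))
```

With all writers on a single node, the lock does its job and the second transaction waits instead of failing at commit, which is the effect the HAProxy workaround aims for.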

Page 41: OpenStack: Toward a More Resilient Cloud

• Pick a highly available storage backend for Glance

• Pick a highly available storage backend for Cinder too

– SAN, distributed, software defined, plethora of options here

• Use Keepalived/HAProxy to front-end multiple API servers

• Or another load balancer technology of your choice

• Can be deployed as dedicated nodes for scale, or cohabitate

• Data plane network: DVR & Provider Network Extensions

• Distributed Virtual Routers are a new experimental feature in Juno (not yet ready for production)

• Please go test it and report/fix bugs!

• Provider networks essentially punt the availability issue to your physical network

• Allows you to use standard tools like virtual port channels and VRRP

• Also highly performant

Page 42: OpenStack: Toward a More Resilient Cloud

• Network: consider your backend technology

– Neutron offers a variety of plugins for various open source and vendor-supplied network technology

– Physical networks need usual redundancy protections

– Overlays are popular for segmentation/isolation; some scale better than others

– Shameless plug: check out VMware NSX which has been used in some very large OpenStack deployments!

Page 43: OpenStack: Toward a More Resilient Cloud

• Actually, most of these techniques and technologies are things that seasoned developers and sysadmins have used before.

• It doesn’t take a genius to learn lessons from the past and apply them, tweak them, and tune them (but it’s a fair amount of work).

Page 44: OpenStack: Toward a More Resilient Cloud

Simple Rules of Thumb

Planning for availability can go to extreme levels, so start simple:

– Can I take any one [server | switch | storage unit] out of service in my control plane and still be operational?

– For all of the above, what’s the impact?

• Performance hit?

• Capacity loss?

• World is broken?!?

– For all the above, how easy is it to reintroduce a repaired/replaced $thing?

• Is there a recovery period that will further impact performance?

• Is it a complex procedure?

• Does the procedure cause more $things to be temporarily unavailable?

– For all of the above, how can I monitor & alert for failure?

Once you have that down, dig deeper to your heart’s content.
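The first rule of thumb ("can I take any one thing out of service?") can be turned into a mechanical check once you have an inventory. A minimal sketch, assuming a hand-maintained mapping of services to the units they run on; the function name and the inventory below are hypothetical.

```python
def single_points_of_failure(placement, minimum=1):
    """List (unit, service) pairs where losing `unit` breaks `service`.

    `placement` maps a service name to the set of units (servers, switches,
    storage units) it runs on. A service counts as broken when fewer than
    `minimum` of its units survive.
    """
    units = set().union(*placement.values())
    findings = []
    for unit in sorted(units):
        for service, hosts in sorted(placement.items()):
            if len(hosts - {unit}) < minimum:
                findings.append((unit, service))
    return findings


# Hypothetical control plane inventory.
inventory = {
    "api": {"ctl-1", "ctl-2"},
    "database": {"db-1", "db-2", "db-3"},
    "dns": {"ctl-1"},
}
print(single_points_of_failure(inventory))
```

Raising `minimum` (for example to 2 for a three-node database that needs a majority) extends the same check to quorum-based components.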


Page 45: OpenStack: Toward a More Resilient Cloud

Recover Quickly


Page 46: OpenStack: Toward a More Resilient Cloud

Rule 1: Assume You Will Need to Change Stuff

• Change can be a lot of things:

– Hardware or software upgrades/patches/replacements

– Configuration tweaks

– Adding or subtracting capacity

– All systems change over time; OpenStack clouds are no exception.


“Change in OpenStack? Yeah, I’m good with that…”

Page 47: OpenStack: Toward a More Resilient Cloud

Rule 2: Assume You Can’t Manually Log In To All The Nodes To Make Those Changes

• OpenStack is a series of cooperating distributed systems

– That means you could (potentially) have a lot of nodes

– Software & config must often be placed on many machines

– Manual changes == slow changes != quick recovery


“I guess multitasking only speeds things up to a certain extent…”
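Rule 2 is essentially an argument for fanning changes out in parallel rather than node by node. The sketch below shows the shape of that using only the standard library. It is an illustration under assumptions: `change` is a placeholder for whatever actually touches a node (SSH, an agent, an API call), and the node names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_everywhere(nodes, change, max_workers=16):
    """Run `change(node)` on every node in parallel.

    Returns a dict of node -> result, with exceptions captured as values
    so one bad node does not hide the outcome on the others.
    """
    def attempt(node):
        try:
            return change(node)
        except Exception as exc:  # report, don't abort the whole run
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(nodes, pool.map(attempt, nodes)))


# Placeholder change: a real one would push config or restart a service.
results = apply_everywhere(["compute-1", "compute-2"], lambda node: f"{node}: ok")
print(results)
```

In practice a configuration management tool does this for you, along with retries, ordering, and reporting.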

Page 48: OpenStack: Toward a More Resilient Cloud

Rule 3: Assume You Will Need To Test Stuff

• It’s a good idea to have a small test cloud where you can examine the impact of changes

• When possible, roll out changes to a portion of your cloud and evaluate before rolling out the rest

– Note: this means you need tests and monitoring…otherwise you don’t know what “ok” looks like


“It’s 3am and I’m still debugging in production…maybe I should have taken the time to set up a test environment and automate some testing after all…”
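Rolling a change out to a portion of the cloud first can be sketched as a small control loop. This is a hypothetical outline, not a tool from the talk: `deploy` and `healthy` are caller-supplied placeholders, and the 10% default is an arbitrary assumption.

```python
def staged_rollout(nodes, deploy, healthy, canary_fraction=0.1):
    """Deploy to a small slice first; continue only if it stays healthy.

    `deploy(node)` applies the change and `healthy(node)` runs whatever
    tests or monitoring checks define "ok". Both are caller-supplied.
    Returns the list of nodes that were changed.
    """
    canary_count = max(1, int(len(nodes) * canary_fraction))
    canaries, rest = nodes[:canary_count], nodes[canary_count:]

    for node in canaries:
        deploy(node)
    if not all(healthy(node) for node in canaries):
        return canaries  # stop here; the rest of the cloud is untouched

    for node in rest:
        deploy(node)
    return canaries + rest


changed = []
touched = staged_rollout(
    [f"compute-{i}" for i in range(10)],
    deploy=changed.append,
    healthy=lambda node: False,  # pretend the canary failed its checks
)
print(touched)
```

The `healthy` callback is where the tests and monitoring mentioned on the slide plug in; without them the function has no way to know what "ok" looks like.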

Page 49: OpenStack: Toward a More Resilient Cloud

Pile of Bash Scripts

Page 50: OpenStack: Toward a More Resilient Cloud

• Software developers and operators are increasingly the same people.

– Agile development

– Automate (almost) everything

– Treat config & changes as you would code

– Continuous integration, testing, deployment

– Incremental change & iteration

– Unified tooling & versioning

– Critical approach to working at scale

– Really useful for building resilient clouds

Image courtesy of Rajiv Pant (http://en.wikipedia.org/wiki/File:Devops.png)

Page 51: OpenStack: Toward a More Resilient Cloud

How Configuration Management Tools Help

• Can orchestrate deployment….and re-deployment.

• Most can idempotently check configuration (no-op if everything is ok)

• Can touch many nodes in parallel

• Can type much faster and more accurately than you

• Are a great way to collaborate amongst teams of operators

• Most have strong communities within the OpenStack universe

– Using a commercial OpenStack? Most vendors are using one of these tools to deploy and manage your cloud, whether you know it or not.

– Rolling your own? Check out StackForge for tons of Ansible/Puppet/Chef modules you can use today

• Allow you to manage other things besides OpenStack itself
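The "idempotently check configuration" bullet is the property that makes these tools safe to run repeatedly. A minimal sketch of the idea in plain Python, not how Ansible, Puppet, or Chef implement it: the function name and the setting names are hypothetical, and INI sections are ignored for brevity.

```python
def ensure_setting(lines, key, value):
    """Make sure `key = value` is present; return (new_lines, changed).

    Running it a second time on its own output changes nothing, which is
    the idempotence property configuration management tools rely on.
    """
    wanted = f"{key} = {value}"
    result, found = [], False
    for line in lines:
        if line.split("=", 1)[0].strip() == key:
            if not found:
                result.append(wanted)  # rewrite the first match only
            found = True               # later duplicates are dropped
        else:
            result.append(line)
    if not found:
        result.append(wanted)
    return result, result != lines


config = ["debug = False", "workers = 4"]
config, changed = ensure_setting(config, "workers", "8")
print(config, changed)
config, changed = ensure_setting(config, "workers", "8")
print(config, changed)
```

The second call reports no change, so a tool built on this pattern can be run on every node on a schedule and will only act where something has drifted.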


Page 54: OpenStack: Toward a More Resilient Cloud

• I’ve worked on a lot of OpenStack clouds and almost everyone has their own preferred monitoring toolset.

• One possible exception: almost everybody seems to love Graphite.

• The golden rule is: use the tools that work for you!

• Very often this will be whatever you’re using in the rest of your infrastructure.

• Break it down into at least two buckets:

• Up/down and alerting (ex: Nagios or its derivatives…yes, there are OpenStack plugins out there on NagiosExchange)

• Trending data collection/plotting (ex: collectd/statsd feeding graphite)

• Don’t forget logging

• LogInsight, Logstash, etc.

• Also: use your peers!

• Operators are often willing to share, so ask on the openstack-operators list.
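The two monitoring buckets map onto two very small pieces of code. The sketch below assumes the conventional Nagios plugin exit codes and Graphite's plaintext line format; the function names, thresholds, and metric path are made up for illustration.

```python
import time

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def check_threshold(value, warn, crit):
    """Up/down bucket: map a measurement to a Nagios-style status code."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def graphite_line(path, value, timestamp=None):
    """Trending bucket: format one sample in Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


# Hypothetical API latency sample, in milliseconds.
latency_ms = 182
print(check_threshold(latency_ms, warn=250, crit=1000))
print(graphite_line("cloud.api.nova.latency_ms", latency_ms, 1423742400), end="")
```

The same measurement feeds both buckets: the status code drives alerting now, while the Graphite line builds the history you need to spot trends later.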

Page 55: OpenStack: Toward a More Resilient Cloud

• Ok, this could take another hour, so I’ll just hit a few highlights…

• Make use of OpenStack’s segregation features

– Availability zones, host aggregates, regions, server groups for compute

– Regions and zones for Swift

• Plan to make infrastructure maintenance less impacting

– Put API servers behind load balancers

– Virtualize tenant-facing parts of the control plane for greater scale and mobility

– Use host evacuation and live migration to reduce impact

– OpenStack is extremely pluggable…choose your backends wisely

• You should know how to operate, monitor, and troubleshoot them

• Understand how their drivers interact with OpenStack

• You should be comfortable with their failure and recovery modes
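The segregation features exist so that related instances do not share a failure domain. The sketch below imitates the effect of an anti-affinity server group spread across availability zones. It is not how the Nova scheduler works, and the function, zone, host, and instance names are all hypothetical.

```python
def place_anti_affinity(instances, hosts_by_zone):
    """Spread instances so that no two share a host, rotating across zones.

    `hosts_by_zone` maps a zone name to a list of free hosts. Returns a
    dict of instance -> (zone, host); raises ValueError if hosts run out.
    """
    free = {zone: list(hosts) for zone, hosts in sorted(hosts_by_zone.items())}
    placement = {}
    for instance in instances:
        candidates = [zone for zone, hosts in free.items() if hosts]
        if not candidates:
            raise ValueError(f"no unused host left for {instance}")
        # Prefer the zone that currently holds the fewest of our instances.
        used = [zone for zone, _ in placement.values()]
        zone = min(candidates, key=used.count)
        placement[instance] = (zone, free[zone].pop(0))
    return placement


zones = {"az1": ["host-1", "host-2"], "az2": ["host-3"]}
print(place_anti_affinity(["db-1", "db-2", "db-3"], zones))
```

Failing loudly when hosts run out mirrors the strict form of anti-affinity: it is better to refuse a placement than to quietly put two members of a cluster on the same hypervisor.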

Page 56: OpenStack: Toward a More Resilient Cloud

Questions?

@marktvoelker

http://openstack.org/

http://www.vmware.com/products/openstack/

Page 57: OpenStack: Toward a More Resilient Cloud

Thank you!