Nova states summit

Moving to structured state management in OpenStack

Yahoo! and NTT Data

Deployer use cases

• As a deployer I want to ensure that an instance is reserved & provisioned without falling back and/or reporting internal OpenStack errors to users.

• As a deployer I want to be able to allocate, schedule and reserve resources before they are consumed so that I can make advanced/complex/custom scheduling decisions using the combination of those resources as a whole.

• I want to convey to my users that OpenStack is a reliable and dependable system that is resilient to API outages, resource failures…

Developer use cases

• I want to be able to add new (and improved!) states to OpenStack and know what the impacts will be on the other states in OpenStack in an easy to understand manner.

• I want to be able to undo (and redo) resource allocation decisions in a transactional and verifiably correct manner on errors or on other ‘smart’ algorithmic placement logic.

• I want to be able to quickly and easily understand an API request from start to finish & I want other developers to have a single place to understand the same.
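The "undo resource allocation decisions in a transactional manner" use case can be sketched as a list of reversible steps that are rolled back in reverse order when one of them fails. This is only an illustration: `Step`, `run_with_rollback` and the step names are hypothetical and are not taken from Nova or from the prototype.

```python
class Step:
    """Hypothetical reversible step: apply() does the work, revert() undoes it."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail  # set to True to simulate an error in this step

    def apply(self, log):
        if self.fail:
            raise RuntimeError(f"{self.name} failed")
        log.append(f"apply {self.name}")

    def revert(self, log):
        log.append(f"revert {self.name}")


def run_with_rollback(steps, log):
    """Apply steps in order; on any error revert the completed ones in reverse."""
    done = []
    try:
        for step in steps:
            step.apply(log)
            done.append(step)
    except Exception:
        for step in reversed(done):
            step.revert(log)
        return False
    return True


# Simulated request: network and volume succeed, the spawn step fails.
log = []
ok = run_with_rollback(
    [Step("network"), Step("volume"), Step("spawn", fail=True)], log
)
```

After the failed run, `log` records both allocations followed by their reverts, so nothing is left half-allocated.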

User use cases

• I want to ensure that my instances are reliably brought up without involving myself to resolve (or raise to support) errors inside of OpenStack.

• I want to ensure that my instances (and associated resources) are optimally scheduled in a reliable and correct manner or not have them scheduled to begin with.

• I want my resources to be fully utilized, and not have zombie resources being ‘locked’ due to the lack of transactional semantics (and recovery) in the underlying code.

The problem

• Hard to [follow, recover from, debug, ensure reliability, correctness, extend, audit…] ad-hoc distributed state transitions.

– Created by continual placement of new features without revisiting the underlying state management system.

• The never ending battle between new hotness vs. stability

– Majority of focus (understandably) on getting OpenStack operational.

– Typical technical debt.

• Acceptable for a new project like OpenStack to get off the ground, but now is the time to focus on features that add stability/scalability...

The problem

• Inter-state ‘cutting’ results in instances which require manual or periodic tasks to recover.

– Distributed systems should always be able to automatically recover from failures, and not require manual/periodic intervention.

• Continually adding local [solutions, fixes, patches]

• Lack of [focus, time, desire] to fix the system as a whole?

• How many inter-state race conditions are hiding underneath the covers??

– Can verification even be done with the current codebase (in a reasonable time period)?
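One way to picture automatic recovery from a ‘cut’ is to record each completed step in a journal, so that a restarted run skips what is already done instead of waiting for a manual or periodic cleanup. The sketch below is an assumption made for illustration, not the mechanism proposed in this talk; `run_resumable`, the in-memory `journal` list and the step names are all hypothetical (a real system would persist the journal).

```python
def run_resumable(steps, journal):
    """Run (name, action) steps in order, skipping names already journaled."""
    for name, action in steps:
        if name in journal:
            continue  # finished before the cut, do not repeat it
        action()
        journal.append(name)  # recorded only after the step succeeded


calls = []
crash = {"pending": True}


def spawn():
    # Simulate the process being cut the first time this step runs.
    if crash["pending"]:
        crash["pending"] = False
        raise RuntimeError("cut during spawn")
    calls.append("spawn")


steps = [("network", lambda: calls.append("network")), ("spawn", spawn)]
journal = []

try:
    run_resumable(steps, journal)  # first attempt is cut at "spawn"
except RuntimeError:
    pass

run_resumable(steps, journal)  # second attempt resumes at "spawn"
```

The network step runs once even though the request was attempted twice.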

[Figure: CREATE SERVER API (admin/user). A request enters nova-api; the components shown are keystone, glance, MySQL, RabbitMQ, nova-scheduler, nova-compute, Libvirt, Volume Service and Network Service, connected by steps numbered 1 to 16.]

Create Server - Transitions and States

ID  Service            Operation              vm_state  task_state            power_state
1   Nova API           Initial State          -         -                     -
2   Keystone           Authenticate user      -         -                     -
3   Nova API/Glance    Show image             -         -                     -
4   Nova API/MySQL     Create entry           BUILDING  SCHEDULING            -
5   Nova API/RabbitMQ  Cast to Scheduler      BUILDING  SCHEDULING            -
6   Scheduler          Received at Scheduler  BUILDING  SCHEDULING            -
7   Scheduler/RabbitMQ Cast to Compute        BUILDING  SCHEDULING            -
8   Compute            Received at Compute    BUILDING  SCHEDULING            -
9   Compute/Glance     Show image             BUILDING  SCHEDULING            -
10  Compute/MySQL      Update DB              BUILDING  NETWORKING            -
11  Compute/RabbitMQ   Call on Network        BUILDING  NETWORKING            -
12  Network            Allocate Network       BUILDING  NETWORKING            -
13  Compute/Volume     Attach volume          BUILDING  BLOCK_DEVICE_MAPPING  -
14  Compute/MySQL      Update DB              BUILDING  SPAWNING              -
15  Compute/Libvirt    Spawn instance         BUILDING  SPAWNING              -
16  Compute/MySQL      Update DB              ACTIVE    None                  RUNNING

What happens if we cut here??

Or here??

Or here??

https://wiki.openstack.org/w/images/a/a9/Curr-run-instance.png
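The task_state column of the table above can be written down as an explicit transition map, which gives one place to check whether a requested transition is legal. This is a minimal sketch: the state strings are copied from the table, while `ALLOWED`, `InvalidTransition` and `advance` are hypothetical names and not Nova code.

```python
# task_state order taken from the create-server table:
# SCHEDULING -> NETWORKING -> BLOCK_DEVICE_MAPPING -> SPAWNING -> None
ALLOWED = {
    "SCHEDULING": {"NETWORKING"},
    "NETWORKING": {"BLOCK_DEVICE_MAPPING"},
    "BLOCK_DEVICE_MAPPING": {"SPAWNING"},
    "SPAWNING": {None},  # None means no task in progress (instance ACTIVE)
}


class InvalidTransition(Exception):
    """Raised when a task_state change is not listed in ALLOWED."""


def advance(current, new):
    """Return the new task_state if current -> new is allowed, else raise."""
    if new not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current} -> {new}")
    return new


state = "SCHEDULING"
for target in ("NETWORKING", "BLOCK_DEVICE_MAPPING", "SPAWNING", None):
    state = advance(state, target)
```

Skipping a step, for example calling `advance("SCHEDULING", "SPAWNING")`, raises `InvalidTransition` instead of silently writing an unexpected state.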

Solutions solutions solutions

• Nova has mostly stabilized (code-wise)

– It appears to be a good time to rethink and rework some of the foundations (with as minimal an impact as we can)

– Eventually, as other core components (Quantum) stabilize, similar analysis can be done there (if needed)

• Prototype a potential solution and discuss next steps with the community.

– That’s why we are here folks

Create request without orchestration

https://docs.google.com/document/d/1xpUszQFEtKmRAf1Wz_XpwyJslhI5X6siM29amPnKifE



Create request with orchestration




Key Benefits

• Less scattering of state management

– Makes it easier to understand…

• Less scattering of recovery scenarios

– Clearly defined rollbacks…

• Faster and more dependable resource acquisition

– Compute node will perform initialization and final acquisition of resources.

– Reservations and initial acquisitions will be done before the request to provision instances, hence faster VM spawns.

• Scheduler can make better ‘overall’ scheduling decisions.

– Ex. no need for compute <-> scheduler retry hacks

– Can make advanced scheduling decisions based on volume choices, locality, network choices... When you are able to acquire/release resources before their use, anything is possible…

– No more need for 'hinting'...

• Creates a single place where others can extend or alter nova state transitions to plug in their own ‘custom/internal’ state transitions.
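Reserving resources before they are consumed can be sketched as a two-phase pool: `reserve` holds capacity, `commit` consumes it and `release` gives it back, so a scheduler can take a combination of resources as a whole or not at all. Everything here is an assumption made for illustration; `Pool`, `reserve_all`, the resource names and the capacities are hypothetical.

```python
class Pool:
    """Hypothetical resource pool with separate reserve and commit phases."""

    def __init__(self, name, capacity):
        self.name = name
        self.free = capacity
        self.reserved = 0

    def reserve(self, amount):
        if amount > self.free:
            raise RuntimeError(f"not enough {self.name}")
        self.free -= amount
        self.reserved += amount

    def release(self, amount):
        # Give a reservation back without consuming it.
        self.reserved -= amount
        self.free += amount

    def commit(self, amount):
        # The reservation becomes real usage (e.g. the instance was spawned).
        self.reserved -= amount


def reserve_all(requests):
    """Reserve every (pool, amount) pair, or release them all and return False."""
    held = []
    try:
        for pool, amount in requests:
            pool.reserve(amount)
            held.append((pool, amount))
    except RuntimeError:
        for pool, amount in reversed(held):
            pool.release(amount)
        return False
    return True


vcpus = Pool("vcpus", 4)
volume_gb = Pool("volume_gb", 10)

too_big = reserve_all([(vcpus, 2), (volume_gb, 20)])  # volume does not fit
fits = reserve_all([(vcpus, 2), (volume_gb, 5)])      # both fit
```

The first request leaves both pools untouched because the vcpu reservation is released when the volume reservation fails; the second holds both until they are committed or released.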

DEMO AND DISCUSSION

https://etherpad.openstack.org/the-future-of-orch









