Embracing Failure: Self-Healing, Decentralized Resource Management for Apache CloudStack
John Burwell, Vice President, Software Engineering
[email protected] | @john_burwell
@shapeblue #ccceu
About Me
VP of Software Engineering @ ShapeBlue
Member, Apache CloudStack PMC (June 2013)
Ran operations and designed automated provisioning for analytic/virtualization clouds
Led architectural design and server-side development of a SaaS physical security platform
About ShapeBlue
“ShapeBlue are expert builders of public & private clouds. They are the leading global Apache CloudStack integrator & consultancy”
…and we’re hiring!
Bang ups and Hang Ups Can Happen to You
Derive the normative operation design from failure recovery
What is a Resource?
Control Plane (Desired State)
Devices (Actual State)
Resource (Converges Desired with Actual State)
Eventually, the desired and actual states will be consistent
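The convergence idea on this slide can be illustrated with a minimal sketch. The `Device` class and the `converge` function are hypothetical names introduced for illustration only; they are not part of CloudStack or of the design presented here.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical stand-in for a device that holds the actual state."""
    state: dict = field(default_factory=dict)

    def apply(self, key, value):
        self.state[key] = value

    def remove(self, key):
        self.state.pop(key, None)


def converge(desired: dict, device: Device) -> int:
    """Drive the device toward the desired state; return the number of changes."""
    changes = 0
    # Create or correct every setting the control plane asks for.
    for key, value in desired.items():
        if device.state.get(key) != value:
            device.apply(key, value)
            changes += 1
    # Remove settings the control plane no longer wants.
    for key in list(device.state):
        if key not in desired:
            device.remove(key)
            changes += 1
    return changes


desired = {"cpu": 4, "memory_mb": 8192}
device = Device(state={"cpu": 2, "stale_flag": True})
first_pass = converge(desired, device)
second_pass = converge(desired, device)
```

Running the loop a second time makes no changes, which is the property that lets a resource be re-converged safely after a failure.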
CloudStack partitions resources into zones, pods, and clusters
Failure Modes
Resource status information is stale or lost
Resource definitions conflict with device state
Entropy
Consistency
Availability
Partition Tolerance
Pick 2
Orchestration operations are available and eventually consistent
... but device modifications must be consistent.
Orchestration Tier: AP
Automation Control Tier: CP
Desired Resource State: AP
Actual Resource State: CP
Scheduling: AP
State Convergence: CP
Resource Offers, Resource Status, State Transitions
Hoke
Hoke Design Goals
Simple
Self-contained
Locality
Non-persistent
Runtime Resource View
[Diagram. Components: Management Process, Queue, Resource FSM, Monitor Process, Device. Messages: State Transition, ResourceOffer, ResourceStatus. Cardinality labels: 1 and 1.]
Actor Model
An actor represents state and behavior
Actors communicate by message passing: each actor has a dedicated queue or mailbox
Each actor is allocated a lightweight thread, which acts as an implicit lock
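A minimal sketch of these three properties follows. `CounterActor` is a hypothetical example class, and it uses an ordinary Python thread where an actor runtime would use a lightweight one. Because only the actor's own thread touches `_count`, no explicit lock is needed.

```python
import queue
import threading


class CounterActor:
    """Hypothetical actor: private state, one mailbox, one thread."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # only ever touched by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message, reply_to=None):
        """Enqueue a message; callers never touch actor state directly."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        # Messages are processed one at a time, in arrival order.
        while True:
            message, reply_to = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)

    def join(self):
        self._thread.join()


actor = CounterActor()
for _ in range(100):
    actor.send("increment")
reply = queue.Queue()
actor.send("get", reply_to=reply)
total = reply.get(timeout=5)
actor.send("stop")
actor.join()
```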
Resource Graph
All resources are represented in a directed acyclic graph
The root node of the graph is the region, and partitions are organized as region -> zone -> pod -> cluster
Each resource is a child of the partition node that owns it
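One way to picture the graph is the sketch below, which models the simplest case where each node has a single owning parent. The `Node` class and all names such as `host-42` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical graph node: a partition or a resource it owns."""
    name: str
    kind: str
    # Excluded from repr and comparison to avoid parent/child recursion.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: list = field(default_factory=list)

    def add(self, name, kind):
        """Attach a child node owned by this node and return it."""
        child = Node(name, kind, parent=self)
        self.children.append(child)
        return child

    def path(self):
        """Walk up to the root and render the ownership chain."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " -> ".join(reversed(parts))


region = Node("region-1", "region")
zone = region.add("zone-a", "zone")
pod = zone.add("pod-1", "pod")
cluster = pod.add("cluster-1", "cluster")
host = cluster.add("host-42", "resource")
host_path = host.path()
```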
Inspiration from Omega
Google’s resource scheduler
Transactional shared state model enabling sophisticated, global decision making
Supports both high churn and low churn workloads
Multiple, pluggable schedulers working in parallel
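The shared-state idea can be sketched as optimistic concurrency: each scheduler works from a snapshot and its commit is rejected if the state changed in the meantime. This is a single-threaded illustration under assumed names (`SharedState`, `snapshot`, `commit`), not Omega's actual interface; a real implementation would need the commit step to be atomic.

```python
class SharedState:
    """Hypothetical shared cluster state with a version counter per host."""

    def __init__(self, free_cpus):
        self.free_cpus = dict(free_cpus)
        self.versions = {host: 0 for host in free_cpus}

    def snapshot(self, host):
        """Return the free capacity and the version a scheduler plans against."""
        return self.free_cpus[host], self.versions[host]

    def commit(self, host, cpus, seen_version):
        """Apply a claim only if the host is unchanged since the snapshot."""
        if self.versions[host] != seen_version:
            return False  # another scheduler committed first; caller retries
        if self.free_cpus[host] < cpus:
            return False  # not enough capacity left
        self.free_cpus[host] -= cpus
        self.versions[host] += 1
        return True


state = SharedState({"host-1": 8})
# Two schedulers plan in parallel against the same snapshot.
_, version_a = state.snapshot("host-1")
_, version_b = state.snapshot("host-1")
a_ok = state.commit("host-1", 4, version_a)
b_ok = state.commit("host-1", 4, version_b)  # stale snapshot, rejected
# Scheduler B refreshes its view and tries again.
_, version_b = state.snapshot("host-1")
b_retry_ok = state.commit("host-1", 4, version_b)
```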
Inspiration from Mesos
Two-level scheduler
Resource Offers
Pessimistic Locking
Pluggable
Geared towards high churn workloads
Hybrid Scheduler
Best Effort shared-state scheduler
Multiple parallel schedulers distributed by partition
Combines allocators and planners
Pluggable
Auto Scaling, Self Healing
Partition controllers spawn system VMs for their child partitions as needed to address scheduler busyness and reliability guarantees
Parent partition controllers monitor the health of their child partition controllers and re-spawn them as necessary
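The parent/child monitoring relationship resembles a supervision tree. The sketch below is an illustration only: `PartitionController`, its `healthy` flag, and `check_children` are hypothetical, and a real health check would probe a running system VM rather than read a boolean.

```python
class PartitionController:
    """Hypothetical controller that supervises its child controllers."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.children = {}
        self.respawns = 0

    def spawn_child(self, name):
        """Start a controller for a child partition and track it."""
        self.children[name] = PartitionController(name)
        return self.children[name]

    def check_children(self):
        """Replace any child that fails its health check."""
        for name, child in list(self.children.items()):
            if not child.healthy:
                self.children[name] = PartitionController(name)
                self.respawns += 1


zone = PartitionController("zone-a")
pod = zone.spawn_child("pod-1")
pod.healthy = False      # simulate a failed child controller
zone.check_children()    # the parent notices and re-spawns it
```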
Next Steps
Evaluate implementing the concepts in the Orleans paper to reduce the number of active actors required
Determine the best approach to causality tracking for state transitions (e.g. version vectors)
Create a library implementing these concepts to demonstrate viability, separate concerns, and support performance testing
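For readers unfamiliar with version vectors, the sketch below shows how they distinguish ordered from concurrent updates. The function names and the `controller-1`/`controller-2` identifiers are assumptions for illustration; the talk does not commit to this approach.

```python
def increment(vector, node):
    """Return a copy of the vector with this node's counter advanced."""
    updated = dict(vector)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Classify a relative to b: equal, before, after, or concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge(a, b):
    """Combine two vectors by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


base = increment({}, "controller-1")
left = increment(base, "controller-1")    # a later transition on controller-1
right = increment(base, "controller-2")   # an independent one on controller-2
ordering = compare(base, left)
conflict = compare(left, right)
merged = merge(left, right)
```

Here `base` precedes `left`, while `left` and `right` are concurrent, which signals a conflict that state convergence would have to resolve.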
References
Gilbert, Seth; Lynch, Nancy. Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. 2002.
Schwarzkopf, Malte; Konwinski, Andy; et al. Omega: flexible, scalable schedulers for large compute clusters. 2013.
Hindman, Benjamin; Konwinski, Andy; et al. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. 2011.
Bernstein, Philip; Bykov, Sergey; et al. Orleans: Distributed Virtual Actors for Programmability and Scalability. 2014.
Questions
Comments
Thank you