
Fabien Hermenier — placing rectangles since 2006

fabien.hermenier@nutanix.com

@fhermeni

https://fhermeni.github.io

Looking for the perfect VM scheduler

2006 - 2010: PhD - Postdoc. VM scheduling, green computing. PhD thesis: "Dynamic task management in clusters: a virtual machine based approach".

2011: Postdoc. "How to design a better testbed: Lessons from a decade of network experiments".

2011 - 2016: Associate professor. VM scheduling, resource management, virtualization.

Enterprise cloud company

“Going beyond hyperconverged infrastructures”

Inside a private cloud

clusters: from 2 to x physical servers

isolated applications: virtual machines, containers

storage layer: SAN based (converged infrastructure) or shared over the nodes (hyper-converged infrastructure)

monitoring data

VM queue

actuators

VM scheduler

decisions

model

find a server for every VM to run on

VM scheduling

Such that: compatible hardware, enough pCPU, enough RAM, enough storage, enough whatever.

While minimizing or maximizing something (see the sketch below).
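As a rough sketch, the whole decision fits in a few constraints. The snippet uses Choco, the Java constraint library BtrPlace builds on; the demands, capacities and the binPacking formulation are illustrative, not BtrPlace's actual model:

```java
import org.chocosolver.solver.Model;
import org.chocosolver.solver.variables.IntVar;

public class Placement {
    public static void main(String[] args) {
        int[] cpu = {2, 4, 1};           // pCPU demand per VM (invented)
        int[] mem = {2, 3, 2};           // RAM demand per VM (invented)
        int nbNodes = 2, cpuCapa = 8, memCapa = 8;

        Model m = new Model("vm-scheduling");
        // host[v]: the node that will run VM v
        IntVar[] host = m.intVarArray("host", cpu.length, 0, nbNodes - 1);
        // "enough pCPU, enough RAM": bounded bin loads, one packing per dimension
        IntVar[] cpuLoad = m.intVarArray("cpuLoad", nbNodes, 0, cpuCapa);
        IntVar[] memLoad = m.intVarArray("memLoad", nbNodes, 0, memCapa);
        m.binPacking(host, cpu, cpuLoad, 0).post();
        m.binPacking(host, mem, memLoad, 0).post();

        if (m.getSolver().solve()) {
            for (IntVar h : host) {
                System.out.println(h.getName() + " -> node " + h.getValue());
            }
        }
    }
}
```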

A good VM scheduler provides:

bigger business value, same infrastructure

same business value, smaller infrastructure

KEEP CALM AND CONSOLIDATE AS HELL

VDI workload: 12+ vCPU / 1 pCPU

1 node = 100+ VMs / server

static schedulers: consider the VM queue; deployed everywhere [1,2,3,4]; fragmentation issues

dynamic schedulers: live-migrations [5] to address fragmentation; costly (storage, migration latency); thousands of articles [10-13]; over-hyped? [9]; but used in private clouds [6,7,8] (steady workloads?)

Placement constraints

various concerns: performance, security, power efficiency, legal agreements, high-availability, fault-tolerance …

manipulated concepts: state, placement, resource allocation, action schedule, counters, etc.

dimension: spatial or temporal

enforcement level: hard or soft

discrete constraints

spread(VM[1,2]); ban(VM1, N1); ban(VM2, N2)

[diagram: VM1 and VM2 relocated across nodes N1..N3; the constraints hold on the final placement only]

a "simple" spatial problem

continuous constraints [15]

>>spread(VM[1,2]); ban(VM1, N1); ban(VM2, N2)

a harder scheduling problem (think about action interleaving)

hard constraints: must be satisfied; an all-or-nothing approach that is not always meaningful

soft constraints: satisfiable or not; internal or external penalty model; harder to implement and scale; hard to standardise?

hard: spread(VM[1..50])

soft: mostlySpread(VM[1..50], 4, 6) [6]

High-availability

x-FT: the VMs must survive the crash of any x nodes (0-FT, 1-FT, …)

exact approach: solve n placement problems [17]

The VMware DRS way

slot based: take the x biggest nodes, then check the remaining free slots (see the sketch below)

simple and scalable, but wasteful with heterogeneous VMs

cluster based
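A back-of-the-envelope sketch of the slot-based check (a hypothetical helper, not VMware code): every VM is sized as one slot equal to the biggest demand, the x biggest nodes are discarded, and the surviving free slots are counted. The pessimistic slot size is exactly where the waste with heterogeneous VMs comes from:

```java
import java.util.Arrays;

public class SlotCheck {
    /** true if the VMs still fit after losing the x biggest nodes. */
    static boolean survivesXFailures(int[] nodeCapa, int[] vmDemand, int x) {
        // one slot = the biggest VM demand: simple, scalable, pessimistic
        int slot = Arrays.stream(vmDemand).max().orElse(1);
        int[] capa = nodeCapa.clone();
        Arrays.sort(capa);                          // ascending order
        int freeSlots = 0;
        for (int i = 0; i < capa.length - x; i++) { // skip the x biggest nodes
            freeSlots += capa[i] / slot;
        }
        return freeSlots >= vmDemand.length;
    }

    public static void main(String[] args) {
        int[] nodes = {16, 16, 32};   // node capacities (invented)
        int[] vms = {2, 2, 4, 8};     // VM demands (invented)
        System.out.println(survivesXFailures(nodes, vms, 1)); // 1-FT check
    }
}
```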

The constraint catalog evolves:

2009?: Dynamic Power Management (DRS 3.1)

2010?: VM-VM affinity (DRS)

mar. 2011: Dedicated instances (EC2)

apr. 2011: VM-host affinity (DRS 4.1)

sep. 2012: MaxVMsPerServer (DRS 5.1)

2016: the constraint needed in 2014

The objective, provider side

atomic objectives: min(x) or max(x), e.g. min(penalties), min(Total Cost of Ownership), min(unbalance)

composite objectives using weights: min(αx + βy)

useful to model something you don't understand? How to estimate the coefficients?

€ as a common quantifier: min(α·TCO + β·violations), max(revenues)

Optimize or satisfy?

satisfy: threshold based; composable; verifiable; requires domain-specific expertise

optimize: min(…) or max(…); composable through weighting magic; hardly provable; easy to say

Acropolis Dynamic Scheduler [18]: hotspot mitigation

trigger: 85% thresholds on CPU and storage-CPU

maintain: affinity constraints, resource demand (from machine learning)

minimize: Σ mig. cost

adapt the VM placement depending on pluggable expectations

network and memory-aware migration scheduler, VM-(VM|PM) affinities, resource matchmaking, node state manipulation, counter based restrictions, energy efficiency, discrete or continuous restrictions

spread(VM[2..3]); preserve(VM1,’cpu’, 3); offline(@N4);

0'00 to 0'02: relocate(VM2, N2)
0'00 to 0'04: relocate(VM6, N2)
0'02 to 0'05: relocate(VM4, N1)
0'04 to 0'08: shutdown(N4)
0'05 to 0'06: allocate(VM1, 'cpu', 3)

The reconfiguration plan

BtrPlace

interaction through a DSL, an API, or JSON messages

the right model for the right problem

deterministic composition, high-level constraints

an open-source Java library for constraint programming

the BtrPlace core CSP models a reconfiguration plan

1 transition model per element; action durations as constants *

D(v) ∈ ℕ
st(v) ∈ [0, H − D(v)]
ed(v) = st(v) + D(v)
d(v) = ed(v) − st(v)
d(v) = D(v)
ed(v) < H
d(v) < H
h(v) ∈ {0, …, |N| − 1}
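A sketch of how these variables could be laid out with Choco (illustrative only; BtrPlace's actual transition classes differ):

```java
import org.chocosolver.solver.Model;
import org.chocosolver.solver.variables.IntVar;

public class TransitionVars {
    public static void main(String[] args) {
        int H = 10;        // horizon of the reconfiguration plan
        int D = 3;         // constant duration D(v) of the action on v
        int nbNodes = 4;   // |N|

        Model m = new Model("core-csp");
        IntVar st = m.intVar("st(v)", 0, H - D);     // st(v) ∈ [0, H − D(v)]
        IntVar ed = m.intVar("ed(v)", 0, H - 1);     // ed(v) < H
        m.arithm(ed, "-", st, "=", D).post();        // ed(v) = st(v) + D(v)
        IntVar h = m.intVar("h(v)", 0, nbNodes - 1); // hosting node of v

        System.out.println(m.getSolver().solve() + " " + st + " " + h);
    }
}
```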

boot(v ∈ V) ::= …
relocatable(v ∈ V) ::= …
shutdown(v ∈ V) ::= …
suspend(v ∈ V) ::= …
resume(v ∈ V) ::= …
kill(v ∈ V) ::= …
bootable(n ∈ N) ::= …
haltable(n ∈ N) ::= …

Views bring additional concerns: new variables and relations

ShareableResource(r) ::=

Network() ::= …

Power() ::= …

High-Availability() ::= …

Constraints state new relations

vector packing problem

a generalisation of the bin packing problem: items with a finite volume to place inside finite bins

the basis to model the infrastructure: 1 dimension = 1 resource

[diagram: VM1 and VM3 packed on N1, VM2 and VM4 on N2, along the cpu and mem dimensions]

NP-hard problem
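For intuition, a toy first-fit decreasing heuristic over the two dimensions (invented demands; BtrPlace itself solves the packing with constraint programming rather than this greedy pass):

```java
import java.util.Arrays;
import java.util.Comparator;

public class VectorPacking {
    /** place[i] = node of VM i, or -1 if no node fits. */
    static int[] firstFitDecreasing(int[][] demand, int nbNodes, int cpuCapa, int memCapa) {
        int[] cpuLeft = new int[nbNodes], memLeft = new int[nbNodes];
        Arrays.fill(cpuLeft, cpuCapa);
        Arrays.fill(memLeft, memCapa);
        // consider the biggest items first (classic FFD ordering)
        Integer[] order = new Integer[demand.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingInt(i -> -(demand[i][0] + demand[i][1])));
        int[] place = new int[demand.length];
        Arrays.fill(place, -1);
        for (int i : order) {
            for (int n = 0; n < nbNodes; n++) {
                if (cpuLeft[n] >= demand[i][0] && memLeft[n] >= demand[i][1]) {
                    cpuLeft[n] -= demand[i][0];
                    memLeft[n] -= demand[i][1];
                    place[i] = n;
                    break;
                }
            }
        }
        return place;
    }

    public static void main(String[] args) {
        int[][] vms = {{2, 3}, {4, 2}, {1, 1}, {3, 3}}; // {cpu, mem} per VM (invented)
        System.out.println(Arrays.toString(firstFitDecreasing(vms, 2, 6, 6)));
    }
}
```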

how to support migrations?

temporarily, resources are used on both the source and the destination nodes

[plot: migration duration (min.) vs. allocated bandwidth (Mbit/s), for workloads 1000*200K, 1000*100K and 1000*10K]

Migrations are costly

dynamic schedulers using vector packing [10,12]

[diagram: VM1..VM6 packed on nodes N1..N4 along cpu and mem]

min(#onlineNodes) = 3

sol #1: 1m, 1m, 2m

sol #2: 1m, 2m, 1m

lower MTTR (faster)

dynamic scheduling using vector packing [10, 12]

offline(N2) + no CPU sharing

[diagram: VM1..VM7 on nodes N1..N5; N2 must be emptied]

Dependency management

[diagram: VM1..VM7 on nodes N1..N5]

1) migrate VM2, migrate VM4, migrate VM5

2) shutdown(N2), migrate VM7

coarse-grain staging delays actions:

stage 1: mig(VM2), mig(VM4), mig(VM5)

stage 2: off(N2), mig(VM7)

[Gantt chart: the same actions scheduled per node over time (t = 0, 3, 4, 8)]

Resource-Constrained Project Scheduling Problem [14]

1 resource per (node × dimension), with a bounded capacity

tasks model the VM lifecycle: the height models a consumption, the width models a duration

at any moment, the cumulated task consumption on a resource cannot exceed its capacity (see the sketch below)

comfortable to express continuous optimisation

NP-hard problem

duration may be longer
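A minimal cumulative sketch with Choco, using invented durations and heights; the real model posts one such constraint per (node × dimension):

```java
import org.chocosolver.solver.Model;
import org.chocosolver.solver.variables.IntVar;
import org.chocosolver.solver.variables.Task;

public class Rcpsp {
    public static void main(String[] args) {
        int H = 10;            // scheduling horizon
        int[] D = {3, 3, 4};   // migration durations (invented)
        int[] h = {2, 2, 3};   // cpu consumed on the node (invented)

        Model m = new Model("rcpsp");
        Task[] tasks = new Task[D.length];
        IntVar[] heights = new IntVar[D.length];
        for (int i = 0; i < D.length; i++) {
            IntVar st = m.intVar("st" + i, 0, H - D[i]); // st(v) ∈ [0, H − D(v)]
            IntVar ed = m.intVar("ed" + i, 0, H);        // ed(v) = st(v) + D(v)
            tasks[i] = new Task(st, m.intVar(D[i]), ed); // width = duration
            heights[i] = m.intVar(h[i]);                 // height = consumption
        }
        // at any moment, the cumulated consumption cannot exceed the capacity (4)
        m.cumulative(tasks, heights, m.intVar(4)).post();

        if (m.getSolver().solve()) {
            for (Task t : tasks) System.out.println(t.getStart());
        }
    }
}
```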

0:3 - migrate VM4
0:3 - migrate VM5
0:4 - migrate VM2
3:8 - migrate VM7
4:8 - shutdown(N2)

convert to an event-based schedule (see the sketch below):

- : migrate VM4
- : migrate VM5
- : migrate VM2
!migrate(VM2) & !migrate(VM4): shutdown(N2)
!migrate(VM5): migrate VM7
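A sketch of executing that event-based plan: each action fires once all its prerequisites have completed. Action names mirror the plan above; a real executor reacts to completion events instead of polling:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class EventSchedule {
    public static void main(String[] args) {
        // prerequisites of each action; an empty list means "start immediately"
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("migrate(VM4)", List.of());
        deps.put("migrate(VM5)", List.of());
        deps.put("migrate(VM2)", List.of());
        deps.put("shutdown(N2)", List.of("migrate(VM2)", "migrate(VM4)"));
        deps.put("migrate(VM7)", List.of("migrate(VM5)"));

        Set<String> done = new HashSet<>();
        while (done.size() < deps.size()) {
            deps.forEach((action, pre) -> {
                if (!done.contains(action) && done.containsAll(pre)) {
                    System.out.println("run " + action); // completion event
                    done.add(action);
                }
            });
        }
    }
}
```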

From a theoretical to a practical solution

Extensibility in practice: looking for a better migration scheduler

BtrPlace vanilla: network and workload blind [btrplace vanilla, entropy, cloudsim, …]

[chart: migration durations (sec.) of VM1..VM8, spread between 0 and 90 sec.]

BtrPlace + migration scheduler [16]: network and workload aware

[chart: migration durations (sec.) of VM1..VM8]

Extensibility in practice: solver-side

network model: heterogeneous network, cumulative constraints; +/- 300 SLOC

migration model: memory and network aware; +/- 200 SLOC

constraints model: restrict the migration models; +/- 100 SLOC

[diagram: migration bandwidth (bw) over time (t); VM1..VM3 sharing core and switch links]

placement: vector packing problem

scheduling: multi-mode resource-constrained project scheduling problem

NP-hard scaling problems

1000 VMs / 10 nodes -> 10^1000 possible assignments

Nobody’s perfect

exact approaches:

heuristic approaches: fast but approximate

the search heuristic: per objective, it guides Choco to instantiations of interest at each search node (see the sketch and trace below)

1. which variable to focus on
2. which value to try

it does not alter the theoretical problem
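In Choco, both choices plug in as a search strategy. The sketch below picks the variable with the smallest domain and tries the lowest node id first; it is illustrative, not BtrPlace's actual heuristic:

```java
import org.chocosolver.solver.Model;
import org.chocosolver.solver.search.strategy.Search;
import org.chocosolver.solver.search.strategy.selectors.values.IntDomainMin;
import org.chocosolver.solver.search.strategy.selectors.variables.FirstFail;
import org.chocosolver.solver.variables.IntVar;

public class Heuristic {
    public static void main(String[] args) {
        Model m = new Model("search");
        IntVar[] host = m.intVarArray("host", 6, 0, 3);
        // ... packing constraints elided ...
        m.getSolver().setSearch(Search.intVarSearch(
                new FirstFail(m),    // 1. variable to focus on: smallest domain first
                new IntDomainMin(),  // 2. value to try: the lowest node id
                host));
        System.out.println(m.getSolver().solve());
    }
}
```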

.[1/2] relocatable(vm#0).dSlice_hoster = {31}

..[1/2] relocatable(vm#1).dSlice_hoster = {31}

...[1/2] relocatable(vm#2).dSlice_hoster = {31}

....[1/2] relocatable(vm#3).dSlice_hoster = {31}

.....[1/2] relocatable(vm#4).dSlice_hoster = {31}

......[1/2] relocatable(vm#5).dSlice_hoster = {31}

.........[1/2] shutdownableNode(node#3).start = {0}

..........[1/2] shutdownableNode(node#2).start = {0}

...........[1/2] shutdownableNode(node#1).start = {0}

............[1/2] shutdownableNode(node#0).start = {0}

..............[1/2] relocatable(vm#97).cSlice_end = {1}

..................[2/2] relocatable(vm#202).cSlice_end \ {2}

...................[1/2] relocatable(vm#202).cSlice_end = {4}

....................[1/2] relocatable(vm#203).cSlice_end = {2}

manage only the supposedly mis-placed VMs; beware of under-estimations!

spread({VM3,VM2,VM8}); lonely({VM7}); preserve({VM1},’ucpu’, 3); offline(@N6); ban($ALL_VMS,@N8); fence(VM[1..7],@N[1..4]); fence(VM[8..12],@N[5..8]);

scheduler.doRepair(true)

static model analysis 101

spread({VM3,VM2,VM8}); lonely({VM7}); preserve({VM1},’ucpu’, 3); offline(@N6); ban($ALL_VMS,@N8); fence(VM[1..7],@N[1..4]); fence(VM[8..12],@N[5..8]);

independent sub-problems solved in parallel; beware of resource fragmentation! (both knobs are sketched below)

s.setInstanceSolver(new StaticPartitioning())
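A minimal sketch of wiring the repair knob into BtrPlace; the infrastructure model and the constraints are elided, and a concrete partitioning strategy would be set the same way as shown above:

```java
import java.util.ArrayList;
import java.util.List;
import org.btrplace.model.DefaultModel;
import org.btrplace.model.Model;
import org.btrplace.model.constraint.SatConstraint;
import org.btrplace.plan.ReconfigurationPlan;
import org.btrplace.scheduler.choco.ChocoScheduler;
import org.btrplace.scheduler.choco.DefaultChocoScheduler;

public class RepairDemo {
    public static void main(String[] args) {
        Model mo = new DefaultModel();                 // infrastructure (elided)
        List<SatConstraint> cstrs = new ArrayList<>(); // placement constraints (elided)

        ChocoScheduler scheduler = new DefaultChocoScheduler();
        scheduler.doRepair(true); // only reconsider the supposedly mis-placed VMs
        ReconfigurationPlan plan = scheduler.solve(mo, cstrs);
        System.out.println(plan);
    }
}
```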

2013 perf numbers… (/!\ non-Nutanix workloads)

[plot, repair benefits: solving time (sec) vs. virtual machines (x 1,000), series LI, NR, LI-filter, NR-filter]

[plot, partitioning benefits: solving time (sec) vs. partition size (servers), series LI-filter, NR-filter, LI-repair, NR-repair]

Master the problem: understand the workload, tune the model, tune the solver, tune the heuristics

(benchmarked on my laptop; /!\ non-Nutanix workloads)

"current" performance (xeon servers; /!\ non-Nutanix workloads)

[plot: solving time (sec) vs. virtual machines (x 1,000), series li and nr]

The right filtering algorithm for the right workload

very high load: small but hard instances; ok when non-solvable, but no evidence

[chart: timeouts on 32 vs. 16 instances]

the costly knapsack filtering to the rescue: trigger based

smarter but slower: higher memory consumption, bigger constants

RECAP

The VM scheduler makes cloud benefits real

think about what is costly

static scheduling for a peaceful life

dynamic scheduling to seize the day

no holy grail

master the problem

with great power comes great responsibility

BtrPlace: http://www.btrplace.org

production ready: live demo, stable user API, documented tutorials, issue tracker, support chat room

WE WANT YOU (once graduated)

Member of Technical Staff: San Jose, California

2 yrs. postdoc: Sophia, France. "Efficiently connecting CLOUD & EDGE": resource management in edge computing

References

1. Omega: flexible, scalable schedulers for large compute clusters. EuroSys 2013
2. Sparrow: distributed, low latency scheduling. SOSP 2013
3. Large-scale cluster management at Google with Borg. EuroSys 2015
4. Firmament: fast, centralized cluster scheduling at scale. OSDI 2016
5. Live migration of virtual machines. NSDI 2005
6. VMware DRS. 2006
7. OpenStack Watcher. 2016
8. Nutanix Acropolis Dynamic Scheduler. 2017
9. Virtual machine consolidation in the wild. Middleware 2014
10. Entropy: a consolidation manager for clusters. VEE 2009
11. pMapper: power and migration cost aware application placement in virtualized systems. Middleware 2009
12. Memory Buddies: exploiting page sharing for smart colocation in virtualized data centers. VEE 2009
13. Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing. FGCS 2012
14. BtrPlace: a flexible consolidation manager for highly available applications. TDSC 2013
15. Higher SLA satisfaction in datacenters with continuous VM placement constraints. HotDep 2013
16. Scheduling live migrations for fast, adaptable and energy-efficient relocation operations. UCC 2015
17. Guaranteeing high availability goals for virtual machine placement. ICDCS 2011
18. The Acropolis Dynamic Scheduler. http://nutanixbible.com/