Predicting Web Service Levels During VM Live Migrations · Predicting Web Service Levels During VM Live Migrations Experiment Experiment Experiment • 2 servers, 1 VM, migrating

Predicting Web Service Levels During VM Live Migrations

Predicting Web Service Levels During VM Live

Migrations

5th International DMTF Academic Alliance Workshop on Systems

and Virtualization Management: Standards and the Cloud

Helmut Hlavacs, Thomas Treutner

24. 10. 2011

1 / 31


Table of Contents

1 Scenario

2 Experiment

3 Data overview

4 Model selection

5 Conclusions

2 / 31


Abstract

• VM live migration important for energy efficiency• Establish energy efficient target distribution of VMs

• No perceivable downtime while live migrating?

• Live migration is resource intensive (iterative page copying)

• Experiments: Influence on service levels while migrating?• Modelling: Predict service levels based on utilization?

3 / 31


Scenario

Scenario

Scenario

• Virtualized data center, static consolidation (P2V)

• Provisioning for peak load, still bad energy efficiency

• E.g., 9-5-cycles, very low utilization at night

• Live migration enables dynamic consolidation

• But: Seldom used, fear of possible side effects

• ⇒ Identify and quantify effects on (web) service levels

4 / 31


Scenario

Scenario

5 / 31


Experiment

Experiment

Experiment

• 2 servers, 1 VM, migrating forth and back

• VM disk image on central node (Gbit, open-iscsi)

• qemu-kvm VM: Linux, Apache2, PHP5, MediaWiki

• SQL VM and load generation on an extra nodes

• Logging utilization of servers and VM, >100 variables

• Rise load from 50 to 600 concurrent virtual users and back

• Migrate every 15min, track response time of last 5min (SLA=1s)

6 / 31


Experiment

Experiment

7 / 31


Data overview

CPU utilization

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0 20 40 60 80 100 120 140 160 180 200

CP

U U

tiliz

atio

n (

VM

no

rme

d)

Interval

Abbildung: Effect of increasing/decreasing HTTP workload in equal steps on

the VM’s CPU utilization

8 / 31


Data overview

Load average

0

1

2

3

4

5

6

0 20 40 60 80 100 120 140 160 180 200

Lo

ad

5 A

vera

ge

Interval

Abbildung: Effect of increasing/decreasing HTTP workload in equal steps on

the VM’s UNIX load average.

9 / 31


Data overview

Markov Queues

• UNIX load as approximation to Q, the average number of jobs

in an M/M/1 queue, CPU utilization as ρ, the system utilization

• QM/M/1 = ρ2

1−ρ

• UNIX load exponentially averaged by definition, service times

are not necessarily exactly exponentially distributed, a

systematic deviation from the theoretical solution is expectable.

• QM/G/1 = ρ2

1−ρ · (1+c2

B)2

, cB: coefficient of service time variation

• Exponentially distributed service times: c2

B = 1

• Deterministic service times: c2

B = 0

• Simple linear regression: c2

B = 0.42

10 / 31


Data overview

0

1

2

3

4

5

6

7

8

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Lo

ad

5 A

vera

ge

CPU Utilization (VM normed)

Measured

M/M/1

M/D/1

M/G/1, c2B=0.42

Abbildung: Queuing issues are rapidly increasing at 60-70% CPU utilization.

The measured data follow the trend of the theoretical solution very closely.

11 / 31


Data overview

Response Time

0

5000

10000

15000

20000

25000

06 12 18 00 06 12 18 00 06 12

Re

spo

nse

Tim

e [

ms]

Hours

Abbildung: HTTP response times are massively increased during phases

with higher workload.

12 / 31


Data overview

Service Level

0.82

0.84

0.86

0.88

0.9

0.92

0.94

0.96

0.98

1

0 20 40 60 80 100 120 140 160 180 200

Se

rvic

e L

eve

l R

atio

Interval

Abbildung: Service levels decrease significantly during phases of high

workload. The maximum allowed response time is 1 s, which is extremely

conservative.13 / 31


Data overview

Load average vs Service Level

0.82

0.84

0.86

0.88

0.9

0.92

0.94

0.96

0.98

1

0 1 2 3 4 5 6

Se

rvic

e L

eve

l

Load5 Average

MeasuredTheoretic

Abbildung: Higher load average is an indicator for queuing issues which are

causing decreased service level ratios. The measured data follow the

theoretical solution closely for higher load values.14 / 31


Data overview

Response Time: Low Load

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0.95 0.96 0.97 0.98 0.99 1

Resp

onse

Tim

e [s]

SLA

Ratio

Time

HTTP Response TimeMigration started

Migration finishedSLA Ratio

Maximum allowed response time

Abbildung: Influence of live migration on HTTP response time and service

level during low load.

15 / 31


Data overview

Response Time: Medium Load

0

2

4

6

8

10

12

14

16

0.85

0.9

0.95

1

Resp

onse

Tim

e [s]

SLA

Ratio

Time




Abbildung: Influence of Live Migration on HTTP Response Time and service

level during medium load.

16 / 31


Data overview

Response Time: High Load

0

5

10

15

20

0.5

0.6

0.7

0.8

0.9

1

Resp

onse

Tim

e [s]

SLA

Ratio

Time




Abbildung: Influence of Live Migration on HTTP Response Time and service

level during high load.

17 / 31


Model selection

Model selection

Stepwise: AIC

• Akaike Information Criterion (AIC), lower values are better

• AIC = 2k − 2 ln(L)• k : the number of independent parameters of a model

• L: the maximum likelihood

• Stepwise model selection approach

• Find trade-off between number of parameters (model size) and

goodness of fit (model quality)

• Start with empty model, in each step a variable can be

added/removed, until the AIC can not be decreased further.

• Does not guarantee to find a global optimum, but typically gives a

near-optimum result within reasonable time.

18 / 31


Model selection

Model selection

Exhaustive: LEAPS

• Comparison with the stepwise AIC approach

• Exhaustive all-subsets-regression

• Find best of all possible models for given range of model size

• Computationally intensive even if number of variables is limited

19 / 31


Model selection

AIC

-4550

-4500

-4450

-4400

-4350

-4300

-4250

-4200

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

AIC

va

lue

Step

AIC value in current stepFinal AIC value

Abbildung: The AIC value quickly improves during the first steps and then

slowly converges to the final value, thus allowing a trade-off between model

complexity and precision.20 / 31


Model selection

AIC

0.013

0.014

0.015

0.016

0.017

0.018

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Sta

nd

ard

De

via

tion

of

Re

sid

ua

ls

Step

Best value

Abbildung: The residuals’ standard deviation quickly improves during the first

steps and then slowly converges to the final value, when using stepwise AIC.

21 / 31


Model selection

AIC

Abbildung: QQ-Plot of standardized residuals for the model produced by

stepwise AIC.

22 / 31


Model selection

AIC

Abbildung: There is no visible correlation between the residuals’ variance

and the time series.

23 / 31


Model selection

LEAPS

0.9

0.91

0.92

0.93

0.94

0.95

0.96

0.97

1 2 3 4 5 6 7 8 9 10 11 12

R2 A

dj

Model Size

Best value

Abbildung: Higher model complexity yields better model quality when using

LEAPS, but the increase is rather small.

24 / 31


Model selection

LEAPS

0.015

0.016

0.017

0.018

0.019

0.02

0.021

0.022

0.023

0.024

0.025

1 2 3 4 5 6 7 8 9 10 11 12

Sta

nd

ard

De

via

tio

n o

f R

esid

ua

ls

Model Size

Best value

Abbildung: Higher model complexity yields better regression error when

using LEAPS.

25 / 31


Model selection

AIC: Most influencing variables of final model

Variable Meaning Estimate Std. Error Pr(> |t|)Intercept 2.395e+00 5.069e-01 3.00e-06

wp01 load5 VM UNIX load5 -1.871e-02 2.627e-03 3.67e-12

wp01 swapUsed VM swap used -7.656e-07 8.809e-08 < 2e-16

wp01 residentSize SQ The squared amount of res-

ident memory used by the

qemu-kvm process.

-4.652e-14 1.652e-14 0.00506

src host cpu proc s Tasks created/s, source

host.

-2.475e+00 9.166e-01 0.00716

src host cpu proc s SQ 1.091e+00 4.133e-01 0.00856

wp01 cpu util vmnorm SQ -1.328e-01 3.187e-02 3.63e-05

wp01 cpu util vmnorm CPU util measured inside

VM

9.517e-02 2.316e-02 4.64e-05

wp01 load5 SQ The squared UNIX load5of the VM.

-1.140e-03 1.918e-04 5.22e-09

wp01 freeMemRatio SQ The squared ratio of free

memory inside the VM.

1.976e-02 9.462e-03 0.03727

Tabelle: The most influencing and significant variables of the final model

produced by stepwise AIC.

26 / 31


Conclusions

Conclusions

Qualitative

1 Impact of live migration on SL depends on amount of workload

2 Tighter SLAs can be fulfilled for low workload situations

3 High workload: SL decreases massively

4 ⇒ Live migrating when VM has only medium load

5 Avoid multiple live migrations within short time

6 ⇒ Inertia effect for recently migrated VMs

7 Avoid senseless flip-flop migrations and stabilize service level

27 / 31


Conclusions

Conclusions

Quantitative

1 SL variance during a live migration to 90% predictable using only

a single variable, the UNIX load5 average.

2 Models with 12 variables can explain 95% of variance

3 Models selected by AIC/LEAPS are of comparable quality and

robust to statistical tests

4 AIC is way faster, so LEAPS is not really necessary

28 / 31


Conclusions

Conclusions

Technical

1 Systems using live migration as a mechanism to realize a more

energy efficient target distribution need to consider the UNIXload average, at least for web servers and comparable

2 Typical hypervisors do not collect and export this information

3 Needs to be done by VM introspection

4 Related efforts by qemu-kvm and libvirt developers to

pass-through the VMs’ memory utilization (free mem)

5 Should be extended to export load information

29 / 31


Conclusions

Conclusions

Future Work

1 Influence of additional VMs

• idle

• utilized

• and a mixed scenario

2 Linux and qemu-kvm: Kernel Samepage Merging (KSM)

3 Database VM migration

4 Bandwidth limits, maximum allowed downtime

5 Migration delay, energy consumption, service downtime

30 / 31


Conclusions

Q&A

Thank you for your attention!

31 / 31

Documents

Predicting Web Service Levels During VM Live Migrations · Predicting Web Service Levels During VM Live Migrations Experiment Experiment Experiment • 2 servers, 1 VM, migrating