Predicting Web Service Levels During VM Live Migrations

5th International DMTF Academic Alliance Workshop on Systems
and Virtualization Management: Standards and the Cloud

Helmut Hlavacs, Thomas Treutner
24 October 2011
Table of Contents
1 Scenario
2 Experiment
3 Data overview
4 Model selection
5 Conclusions
Abstract
• VM live migration important for energy efficiency
• Establish energy-efficient target distribution of VMs
• No perceivable downtime while live migrating?
• Live migration is resource intensive (iterative page copying)
• Experiments: influence on service levels while migrating?
• Modelling: predict service levels based on utilization?
Scenario
• Virtualized data center, static consolidation (P2V)
• Provisioning for peak load, but still poor energy efficiency
• E.g., 9-to-5 cycles: very low utilization at night
• Live migration enables dynamic consolidation
• But: Seldom used, fear of possible side effects
• ⇒ Identify and quantify effects on (web) service levels
Experiment
• 2 servers, 1 VM, migrating forth and back
• VM disk image on central node (Gbit, open-iscsi)
• qemu-kvm VM: Linux, Apache2, PHP5, MediaWiki
• SQL VM and load generation on extra nodes
• Logging utilization of servers and VM, >100 variables
• Raise load from 50 to 600 concurrent virtual users and back
• Migrate every 15min, track response time of last 5min (SLA=1s)
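The bookkeeping behind the last bullet — a service-level ratio over the last five minutes against a 1 s SLA — can be sketched as follows. This is a minimal illustration; the function name and sample data are made up, not the authors' tooling.

```python
# Minimal sketch: service-level ratio over a sliding window, as described
# above (last 5 min of response times against a 1 s SLA). Illustrative only.

def service_level_ratio(samples, now_s, window_s=300, sla_ms=1000.0):
    """Fraction of requests in the last window_s seconds that met the SLA.

    samples: list of (timestamp_s, response_ms) pairs.
    """
    recent = [ms for (t, ms) in samples if now_s - window_s <= t <= now_s]
    if not recent:
        return 1.0  # no traffic in the window: treat as fully compliant
    return sum(1 for ms in recent if ms <= sla_ms) / len(recent)

samples = [(0, 200.0), (60, 900.0), (120, 1500.0), (180, 400.0)]
print(service_level_ratio(samples, now_s=200))  # → 0.75 (3 of 4 under 1 s)
```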
Data overview
CPU utilization
[Plot: CPU utilization (VM normed) over measurement intervals]

Figure: Effect of increasing/decreasing HTTP workload in equal steps on the VM's CPU utilization.
Load average
[Plot: load5 average over measurement intervals]

Figure: Effect of increasing/decreasing HTTP workload in equal steps on the VM's UNIX load average.
Markov Queues
• UNIX load as approximation to Q, the average number of jobs in an M/M/1 queue; CPU utilization as ρ, the system utilization
• Q_{M/M/1} = ρ² / (1 − ρ)
• UNIX load is exponentially averaged by definition, and service times are not necessarily exactly exponentially distributed, so a systematic deviation from the theoretical solution is to be expected.
• Q_{M/G/1} = ρ² / (1 − ρ) · (1 + c_B²) / 2, where c_B is the coefficient of variation of the service time
• Exponentially distributed service times: c_B² = 1
• Deterministic service times: c_B² = 0
• Simple linear regression: c_B² = 0.42
[Plot: load5 average vs. CPU utilization (VM normed); curves: Measured, M/M/1, M/D/1, M/G/1 with c_B² = 0.42]

Figure: Queuing issues increase rapidly at 60-70% CPU utilization. The measured data follow the trend of the theoretical solution very closely.
Response Time
[Plot: HTTP response time in ms over hours]

Figure: HTTP response times are massively increased during phases with higher workload.
Service Level
[Plot: service-level ratio over measurement intervals]

Figure: Service levels decrease significantly during phases of high workload. The maximum allowed response time is 1 s, which is extremely conservative.
Load average vs Service Level
[Plot: service level vs. load5 average; curves: Measured, Theoretic]

Figure: A higher load average is an indicator for queuing issues, which cause decreased service-level ratios. The measured data follow the theoretical solution closely for higher load values.
Response Time: Low Load
[Plot: HTTP response time and SLA ratio over time; markers: migration started/finished, maximum allowed response time]

Figure: Influence of live migration on HTTP response time and service level during low load.
Response Time: Medium Load
[Plot: HTTP response time and SLA ratio over time; markers: migration started/finished, maximum allowed response time]

Figure: Influence of live migration on HTTP response time and service level during medium load.
Response Time: High Load
[Plot: HTTP response time and SLA ratio over time; markers: migration started/finished, maximum allowed response time]

Figure: Influence of live migration on HTTP response time and service level during high load.
Model selection
Stepwise: AIC
• Akaike Information Criterion (AIC); lower values are better
• AIC = 2k − 2 ln(L)
• k: the number of independent parameters of a model
• L: the maximum likelihood
• Stepwise model selection approach
• Find a trade-off between the number of parameters (model size) and goodness of fit (model quality)
• Start with the empty model; in each step a variable can be added or removed, until the AIC cannot be decreased further.
• Does not guarantee a global optimum, but typically gives a near-optimum result within reasonable time.
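The steps above can be sketched as a greedy add/remove loop. The authors presumably used a standard statistics package (e.g. R's step()); the version below is an illustrative Python reimplementation with synthetic data, using the Gaussian AIC for an OLS fit.

```python
import numpy as np

def aic(y, X):
    """Gaussian AIC (up to an additive constant) of an OLS fit of y on X."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * X.shape[1]

def stepwise_aic(y, X):
    """Greedily add/remove columns of X until the AIC stops decreasing."""
    n, p = X.shape
    chosen = []                 # start with the empty model
    ones = np.ones((n, 1))      # intercept-only design
    best = aic(y, ones)
    improved = True
    while improved:
        improved = False
        for j in range(p):
            # candidate move: add an unused column, or drop a chosen one
            cand = chosen + [j] if j not in chosen else [k for k in chosen if k != j]
            design = np.hstack([ones, X[:, cand]]) if cand else ones
            score = aic(y, design)
            if score < best - 1e-9:
                best, chosen, improved = score, cand, True
    return chosen, best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.1 * rng.normal(size=200)
chosen, best = stepwise_aic(y, X)
print(sorted(chosen))  # the signal columns 0 and 3 should be selected
```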
Exhaustive: LEAPS
• Comparison with the stepwise AIC approach
• Exhaustive all-subsets-regression
• Find best of all possible models for given range of model size
• Computationally intensive even if the number of variables is limited
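All-subsets regression can be sketched in a few lines; again this is an illustration with synthetic data, not the paper's pipeline (the R package leaps uses a branch-and-bound search rather than the brute force shown here).

```python
import itertools
import numpy as np

def adj_r2(y, X):
    """Adjusted R^2 of an OLS fit; X already includes the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (rss / (n - k)) / (tss / (n - 1))

def best_subsets(y, X, max_size):
    """For each model size, fit every subset and keep the best adjusted R^2."""
    n, p = X.shape
    ones = np.ones((n, 1))
    best = {}
    for size in range(1, max_size + 1):
        for cols in itertools.combinations(range(p), size):
            score = adj_r2(y, np.hstack([ones, X[:, cols]]))
            if size not in best or score > best[size][1]:
                best[size] = (cols, score)
    return best

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X[:, 1] + 0.5 * X[:, 4] + 0.2 * rng.normal(size=100)
best = best_subsets(y, X, 3)
for size, (cols, score) in best.items():
    print(size, cols, round(score, 3))
```

The exhaustive loop fits 2^p − 1 models in the worst case, which is why the slide notes it is expensive even for a limited number of variables.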
AIC
[Plot: AIC value per selection step, with the final AIC value marked]

Figure: The AIC value quickly improves during the first steps and then slowly converges to the final value, thus allowing a trade-off between model complexity and precision.
AIC
[Plot: standard deviation of residuals per selection step, with the best value marked]

Figure: The residuals' standard deviation quickly improves during the first steps and then slowly converges to the final value, when using stepwise AIC.
AIC
Figure: QQ plot of standardized residuals for the model produced by stepwise AIC.
AIC
Figure: There is no visible correlation between the residuals' variance and the time series.
LEAPS
[Plot: adjusted R² vs. model size, with the best value marked]

Figure: Higher model complexity yields better model quality when using LEAPS, but the increase is rather small.
LEAPS
[Plot: standard deviation of residuals vs. model size, with the best value marked]

Figure: Higher model complexity yields a smaller regression error when using LEAPS.
AIC: Most influencing variables of final model
Variable                   Meaning                                       Estimate     Std. Error   Pr(>|t|)
(Intercept)                                                               2.395e+00   5.069e-01    3.00e-06
wp01 load5                 VM UNIX load5                                 -1.871e-02   2.627e-03    3.67e-12
wp01 swapUsed              VM swap used                                  -7.656e-07   8.809e-08    < 2e-16
wp01 residentSize SQ       Squared resident memory of the
                           qemu-kvm process                              -4.652e-14   1.652e-14    0.00506
src host cpu proc s        Tasks created/s, source host                  -2.475e+00   9.166e-01    0.00716
src host cpu proc s SQ                                                    1.091e+00   4.133e-01    0.00856
wp01 cpu util vmnorm SQ                                                  -1.328e-01   3.187e-02    3.63e-05
wp01 cpu util vmnorm       CPU utilization measured inside the VM         9.517e-02   2.316e-02    4.64e-05
wp01 load5 SQ              Squared UNIX load5 of the VM                  -1.140e-03   1.918e-04    5.22e-09
wp01 freeMemRatio SQ       Squared ratio of free memory inside the VM     1.976e-02   9.462e-03    0.03727

Table: The most influencing and significant variables of the final model produced by stepwise AIC.
Conclusions
Qualitative
1 Impact of live migration on SL depends on amount of workload
2 Tighter SLAs can be fulfilled for low workload situations
3 High workload: SL decreases massively
4 ⇒ Live migrate only while the VM is under at most medium load
5 Avoid multiple live migrations within short time
6 ⇒ Inertia effect for recently migrated VMs
7 Avoid senseless flip-flop migrations and stabilize service level
Quantitative
1 SL variance during a live migration is 90% predictable using only
a single variable, the UNIX load5 average
2 Models with 12 variables can explain 95% of variance
3 Models selected by AIC/LEAPS are of comparable quality and
robust to statistical tests
4 Stepwise AIC is much faster, so LEAPS is not really necessary
Technical
1 Systems using live migration as a mechanism to realize a more
energy-efficient target distribution need to consider the UNIX
load average, at least for web servers and comparable services
2 Typical hypervisors do not collect and export this information
3 Needs to be done by VM introspection
4 Related efforts by qemu-kvm and libvirt developers to
pass-through the VMs’ memory utilization (free mem)
5 Should be extended to export load information
Future Work
1 Influence of additional VMs
• idle
• utilized
• and a mixed scenario
2 Linux and qemu-kvm: Kernel Samepage Merging (KSM)
3 Database VM migration
4 Bandwidth limits, maximum allowed downtime
5 Migration delay, energy consumption, service downtime
Q&A
Thank you for your attention!