1
Predicting Web Service Levels During VM Live Migrations Helmut Hlavacs and Thomas Treutner Research Group Entertainment Computing, University of Vienna, Austria Introduction I VM live migration important for energy efficiency I Enables us to establish energy efficient target distribution of VMs I Supposedly no perceivable service downtime while live migrating I Live migration is resource intensive (iterative page copying) I Experiments: Influence on service levels while migrating? I Modelling: Predict service levels based on utilization? Scenario I Virtualized data center, static consolidation (P2V) I Provisioning for peak load, still bad energy efficiency I E.g., 9-5-cycles, very low utilization at night I Live migration enables dynamic consolidation I But: Seldom used, fear of possible side effects I Identify and quantify effects on (web) service levels I Find most influencing utilization metrics Experiment I Two servers, a single VM, migrating forth and back I VM disk image on central node (Gbit, open-iscsi) I qemu-kvm VM: Linux, Apache2, PHP5, MediaWiki I SQL VM and load generation on an extra nodes I Logging utilization of servers and VM, more than 100 variables I Rise load from 50 to 600 concurrent virtual users and back I Migrate every 15min, track response time of last 5min I Maximum allowed response time: 1s Data Overview 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 20 40 60 80 100 120 140 160 180 200 CPU Utilization (VM normed) Interval 0 1 2 3 4 5 6 0 20 40 60 80 100 120 140 160 180 200 Load5 Average Interval 0 5000 10000 15000 20000 25000 06 12 18 00 06 12 18 00 06 12 Response Time [ms] Hours 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98 1 0 20 40 60 80 100 120 140 160 180 200 Service Level Ratio Interval 0 1 2 3 4 5 6 7 8 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Load5 Average CPU Utilization (VM normed) Measured M/M/1 M/D/1 M/G/1, c 2 B =0.42 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98 1 0 1 2 3 4 5 6 Service Level Load5 Average Measured Theoretic I We can interpret the UNIX load as approximation to Q, the average number of jobs in a Markovian M/M/1 queue, and the VM’s CPU utilization as ρ, the system utilization: Q M/M/1 = ρ 2 1-ρ I UNIX load is exponentially averaged by definition and the service times are not necessarily exactly exponentially distributed: Q M/G/1 = ρ 2 1-ρ · (1+c 2 B ) 2 I For deterministic service times c 2 B =0, resulting in Q M/D/1 = ρ 2 2(1-ρ) I Simple linear regression delivers the coefficient c 2 B =0.42 I P(T x) = F T (x) = 1 - e -μ(1-ρ)x , the theoretical probability that the response time T is lower than or equal to a limit x for a given service rate μ Service Levels for Low/Medium/High Workload Scenarios I Low (top right): Slightly increased response times during live migration, seldom response time violations I Medium (bottom left): SLA ratio generally satisfies the 97% limit I High (bottom right): Often and heavy violations, unacceptable low service levels, typically decreased by 20-25% percentage points 0 2 4 6 8 10 12 14 16 0.85 0.9 0.95 1 Response Time [s] SLA Ratio Time HTTP Response Time Migration started Migration finished SLA Ratio Maximum allowed response time 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 0.95 0.96 0.97 0.98 0.99 1 Response Time [s] SLA Ratio Time HTTP Response Time Migration started Migration finished SLA Ratio Maximum allowed response time 0 5 10 15 20 0.5 0.6 0.7 0.8 0.9 1 Response Time [s] SLA Ratio Time HTTP Response Time Migration started Migration finished SLA Ratio Maximum allowed response time Model Selection I Stepwise model selection: Akaike Information Criterion I Finds trade-off between number of parameters (model size) and goodness of fit (model quality) I For comparison: Exhaustive all-subsets-regression (LEAPS) I LEAPS: Find best of all possible models for given range of model size I Computationally intensive even if number of variables is limited I R 2 Adj increases slightly with increased model complexity. UNIX load5 contributes R 2 Adj of 90% 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 1 2 3 4 5 6 7 8 9 10 11 12 R 2 Adj Model Size Best value Most Influencing Model Parameters Variable Meaning Estimate Std. Error Pr(> |t|) Intercept 2.395e+00 5.069e-01 3.00e-06 wp01 load5 VM UNIX load5 -1.871e-02 2.627e-03 3.67e-12 wp01 swapUsed VM swap used -7.656e-07 8.809e-08 < 2e-16 wp01 residentSize SQ Suared amount of resident memory used by the qemu-kvm process. -4.652e-14 1.652e-14 0.00506 src host cpu proc s Tasks created/s, source host. -2.475e+00 9.166e-01 0.00716 src host cpu proc s SQ 1.091e+00 4.133e-01 0.00856 wp01 cpu util vmnorm SQ -1.328e-01 3.187e-02 3.63e-05 wp01 cpu util vmnorm CPU util measured inside VM 9.517e-02 2.316e-02 4.64e-05 wp01 load5 SQ Squared UNIX load5 of the VM. -1.140e-03 1.918e-04 5.22e-09 wp01 freeMemRatio SQ Squared ratio of free memory inside the VM. 1.976e-02 9.462e-03 0.03727 Conclusions I Impact of live migration on SL depends on amount of workload I Tighter SLAs can be fulfilled during low and medium workload I Migrating during high load causes massive decrease of service level I Service level variance during a live migration to 90% predictable using only a single variable, the UNIX load5 average, models with 12 variables can explain 95% of variance I Systems using live migration as a mechanism to realize a more energy efficient target distribution and have service level targets need to consider the UNIX load average, but typical hypervisors do not collect/export this information I Hypervisors should be extended to export load information (cf. free memory) Future Work I Influence of additional VMs (idle, utilized, mixed) I Linux and qemu-kvm: Kernel Samepage Merging (KSM) I Database VM migration, currently taboo due to potentially severe influence I qemu-kvm parameters: Bandwidth limits, maximum allowed downtime I Predict migration delay, energy consumption, service downtime Created with L A T E Xbeamerposter http://www-i6.informatik.rwth-aachen.de/ ~ dreuw/latexbeamerposter.php http://cs.univie.ac.at/research/research-groups/entertainment-computing/ <firstname>.<lastname>@univie.ac.at

Predicting Web Service Levels During VM Live Migrations · I )Identify and quantify e ects on (web) service levels I )Find most in uencing utilization metrics Experiment I Two servers,

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Predicting Web Service Levels During VM Live Migrations · I )Identify and quantify e ects on (web) service levels I )Find most in uencing utilization metrics Experiment I Two servers,

Predicting Web Service LevelsDuring VM Live Migrations

Helmut Hlavacs and Thomas TreutnerResearch Group Entertainment Computing, University of Vienna, Austria

Introduction

I VM live migration important for energy efficiencyI Enables us to establish energy efficient target distribution of VMsI Supposedly no perceivable service downtime while live migratingI Live migration is resource intensive (iterative page copying)I Experiments: Influence on service levels while migrating?I Modelling: Predict service levels based on utilization?

Scenario

I Virtualized data center, static consolidation (P2V)I Provisioning for peak load, still bad energy efficiencyI E.g., 9-5-cycles, very low utilization at nightI Live migration enables dynamic consolidationI But: Seldom used, fear of possible side effectsI⇒ Identify and quantify effects on (web) service levelsI⇒ Find most influencing utilization metrics

Experiment

I Two servers, a single VM, migrating forth and backI VM disk image on central node (Gbit, open-iscsi)I qemu-kvm VM: Linux, Apache2, PHP5, MediaWikiI SQL VM and load generation on an extra nodesI Logging utilization of servers and VM, more than 100 variablesI Rise load from 50 to 600 concurrent virtual users and backI Migrate every 15min, track response time of last 5minI Maximum allowed response time: 1s

Data Overview

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0 20 40 60 80 100 120 140 160 180 200

CP

U U

tiliz

atio

n (

VM

no

rme

d)

Interval

0

1

2

3

4

5

6

0 20 40 60 80 100 120 140 160 180 200

Lo

ad

5 A

ve

rag

e

Interval

0

5000

10000

15000

20000

25000

06 12 18 00 06 12 18 00 06 12

Re

sp

on

se

Tim

e [

ms]

Hours

0.82

0.84

0.86

0.88

0.9

0.92

0.94

0.96

0.98

1

0 20 40 60 80 100 120 140 160 180 200

Se

rvic

e L

eve

l R

atio

Interval

0

1

2

3

4

5

6

7

8

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Lo

ad

5 A

ve

rag

e

CPU Utilization (VM normed)

Measured

M/M/1

M/D/1

M/G/1, c2B=0.42

0.82

0.84

0.86

0.88

0.9

0.92

0.94

0.96

0.98

1

0 1 2 3 4 5 6

Se

rvic

e L

eve

l

Load5 Average

MeasuredTheoretic

I We can interpret the UNIX load as approximation to Q, the average numberof jobs in a Markovian M/M/1 queue, and the VM’s CPU utilization as ρ,

the system utilization: QM/M/1 = ρ2

1−ρI UNIX load is exponentially averaged by definition and the service times are

not necessarily exactly exponentially distributed: QM/G/1 = ρ2

1−ρ ·(1+c2

B)

2

I For deterministic service times c2B = 0, resulting in QM/D/1 = ρ2

2(1−ρ)

I Simple linear regression delivers the coefficient c2B = 0.42

I P(T ≤ x) = FT(x) = 1− e−µ(1−ρ)x, the theoretical probability that theresponse time T is lower than or equal to a limit x for a given service rate µ

Service Levels for Low/Medium/High Workload Scenarios

I Low (top right): Slightly increasedresponse times during live migration,seldom response time violations

I Medium (bottom left): SLA ratiogenerally satisfies the 97% limit

I High (bottom right): Often andheavy violations, unacceptable lowservice levels, typically decreased by20-25% percentage points

0

2

4

6

8

10

12

14

16

0.85

0.9

0.95

1

Respo

nse T

ime

[s]

SL

A R

atio

Time

HTTP Response TimeMigration started

Migration finishedSLA Ratio

Maximum allowed response time

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

0.95

0.96

0.97

0.98

0.99

1

Respo

nse T

ime

[s]

SL

A R

atio

Time

HTTP Response TimeMigration started

Migration finishedSLA Ratio

Maximum allowed response time

0

5

10

15

20

0.5

0.6

0.7

0.8

0.9

1

Respo

nse T

ime

[s]

SL

A R

atio

Time

HTTP Response TimeMigration started

Migration finishedSLA Ratio

Maximum allowed response time

Model Selection

I Stepwise model selection:Akaike Information Criterion

I Finds trade-off between number ofparameters (model size) andgoodness of fit (model quality)

I For comparison: Exhaustiveall-subsets-regression (LEAPS)

I LEAPS: Find best of all possiblemodels for given range of model size

I Computationally intensive even ifnumber of variables is limited

I R2Adj increases slightly with increased

model complexity. UNIX load5

contributes R2Adj of ∼ 90%

0.9

0.91

0.92

0.93

0.94

0.95

0.96

0.97

1 2 3 4 5 6 7 8 9 10 11 12

R2 A

dj

Model Size

Best value

Most Influencing Model ParametersVariable Meaning Estimate Std. Error Pr(> |t|)Intercept 2.395e+00 5.069e-01 3.00e-06

wp01 load5 VM UNIX load5 -1.871e-02 2.627e-03 3.67e-12

wp01 swapUsed VM swap used -7.656e-07 8.809e-08 < 2e-16

wp01 residentSize SQ Suared amount of resident memory used by

the qemu-kvm process.

-4.652e-14 1.652e-14 0.00506

src host cpu proc s Tasks created/s, source host. -2.475e+00 9.166e-01 0.00716

src host cpu proc s SQ 1.091e+00 4.133e-01 0.00856

wp01 cpu util vmnorm SQ -1.328e-01 3.187e-02 3.63e-05

wp01 cpu util vmnorm CPU util measured inside VM 9.517e-02 2.316e-02 4.64e-05

wp01 load5 SQ Squared UNIX load5 of the VM. -1.140e-03 1.918e-04 5.22e-09

wp01 freeMemRatio SQ Squared ratio of free memory inside the VM. 1.976e-02 9.462e-03 0.03727

Conclusions

I Impact of live migration on SL depends on amount of workloadI Tighter SLAs can be fulfilled during low and medium workloadI Migrating during high load causes massive decrease of service levelI Service level variance during a live migration to 90% predictable

using only a single variable, the UNIX load5 average, models with 12variables can explain 95% of variance

I Systems using live migration as a mechanism to realize a more energy efficienttarget distribution and have service level targets need to consider the UNIX

load average, but typical hypervisors do not collect/export this informationI Hypervisors should be extended to export load information (cf. free memory)

Future Work

I Influence of additional VMs (idle, utilized, mixed)I Linux and qemu-kvm: Kernel Samepage Merging (KSM)I Database VM migration, currently taboo due to potentially severe influenceI qemu-kvm parameters: Bandwidth limits, maximum allowed downtimeI Predict migration delay, energy consumption, service downtime

Created with LATEXbeamerposter http://www-i6.informatik.rwth-aachen.de/~dreuw/latexbeamerposter.php

http://cs.univie.ac.at/research/research-groups/entertainment-computing/ <firstname>.<lastname>@univie.ac.at