
Autonomous Resource Provision in Virtual Data Centers

Presented by: Noha Elprince

noha.elprince@uwaterloo.ca

IFIP/IEEE DANMS, 31 May 2013

Static Data Centers vs. Dynamic

[Figure: two resources-vs-time plots (demand vs. capacity), one for a static data center with unused resources and one for a dynamic one. Figure: RAD Lab, UC Berkeley]

Static:
•  Fixed, pre-assigned resources (provisioned for peak)
•  Static environment
•  Manual change of configurations

Dynamic:
•  Cloud elasticity ~ "pay as you go"
•  Virtualized environment
•  Automated, "self-service" change of configurations

Cloud Elasticity problems…

•  Under-provisioning => heavy penalty: lost revenue, lost users

[Figure: three resources-vs-time panels (days 1-3) comparing demand and capacity. Figures: RAD Lab, UC Berkeley]

Cloud Elasticity …

•  Over-provisioning => unused resources (underutilization)

[Figure: resources vs. time with capacity above demand; the gap is unused resources. Figure: RAD Lab, UC Berkeley]

Virtualization (Cloud Foundation)

•  Virtualization allows a computational resource to be partitioned into multiple isolated execution environments (VMs).
•  Turning the machine into a "virtual image" gives a degree of self-immunity from:
   Ø  Hardware breakdowns
   Ø  Running out of resources

Challenge: Service Differentiation

Problem
q  Over- and under-provisioning still occur, because of the difficulty of estimating actual needs under a time-varying and diverse workload.
q  Enabling service differentiation in a virtualized environment.

Methodology

•  Develop and implement an autonomic resource-management controller that:
   Ø  Effectively optimizes resources by predicting current resource needs.
   Ø  Continuously self-tunes resources to accommodate load variations and enforce service differentiation during resource allocation.
•  Test the proposed prototype on real traces.

Motivation

•  Help data centers manage resources effectively.
•  Promote cloud computing (more cloud users => lower cost).
•  Optimize resources (green IT!).

Related Work

v  Approaches to autonomic resource management:
   –  Utility-based self-optimizing approach
   –  Model-based approach built on performance modeling
   –  Machine-learning approach
   –  Fuzzy-logic approach

Proposed Solution Architecture: System Modeling

[Figure: architecture block diagram (output labeled r(t+1)).]

I. System Modeling : Data set

•  Idea: learn from successful jobs (normal termination, fulfilling the client's anticipated performance).
•  Data: a real computing-center trace from Los Alamos National Laboratory (LANL).
•  LANL is a United States Department of Energy (DOE) national laboratory.
•  LANL conducts multidisciplinary research in fields such as national security, space exploration, renewable energy, medicine, nanotechnology and supercomputing.
•  System: 1024-node Connection Machine CM-5 from Thinking Machines.
•  Jobs: 201,387; Duration: 2 years.

I. System Modeling : Data Preprocessing

v  Feature Selection
•  Use stepwise regression to sort the variables and keep the more influential ones in the model.
•  Result: out of 18 features, 5 were selected:
   => run_time, wait_time, Avg_cpu_time, used_mem, status

v  Filter
•  Remove jobs with status = unsuccessful (failed / aborted).
•  Discard records with average_cpu_time_used <= 0 and used_mem <= 0.

v  Data Cleaning
•  Normalize the data to remove noise. (See the preprocessing sketch below.)
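A minimal pandas sketch of these preprocessing steps, assuming the LANL trace has already been loaded into a DataFrame; the column names and the min-max normalization are illustrative choices, since the talk does not specify them.

```python
# Minimal preprocessing sketch (pandas). Column names and the min-max
# normalization are assumptions for illustration, not taken from the talk.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only the features retained by stepwise regression.
    features = ["run_time", "wait_time", "avg_cpu_time", "used_mem", "status"]
    df = df[features].copy()

    # Filter: drop unsuccessful jobs (failed / aborted).
    df = df[~df["status"].isin(["failed", "aborted"])]

    # Filter: discard records with non-positive CPU time or memory usage.
    df = df[(df["avg_cpu_time"] > 0) & (df["used_mem"] > 0)]

    # Data cleaning: normalize numeric columns to [0, 1] to reduce scale effects.
    numeric = ["run_time", "wait_time", "avg_cpu_time", "used_mem"]
    df[numeric] = (df[numeric] - df[numeric].min()) / (df[numeric].max() - df[numeric].min())
    return df
```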

I. System Modeling : Statistical Analysis


I. System Modeling : Model I/O

Cascaded classifiers (MISO model). (See the sketch below.)
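The slide only names the model structure, so the following is a hedged illustration of what a cascade of multi-input single-output (MISO) regressors could look like: each stage feeds its prediction forward as an extra input to the next stage. The number of stages, the targets and the wiring are assumptions for illustration, not taken from the talk.

```python
# Illustrative cascade of three MISO regressors (C1, C2, C3): each stage's
# prediction is appended to the feature set of the next stage. The wiring and
# the choice of targets are assumptions for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class CascadedMISO:
    def __init__(self, n_stages=3):
        self.stages = [DecisionTreeRegressor(max_depth=5) for _ in range(n_stages)]

    def fit(self, X, targets):
        # targets: array of shape (n_samples, n_stages), one column per stage.
        X_aug = X
        for model, y in zip(self.stages, targets.T):
            model.fit(X_aug, y)
            # Feed this stage's prediction forward as an extra input feature.
            X_aug = np.column_stack([X_aug, model.predict(X_aug)])
        return self

    def predict(self, X):
        X_aug, outputs = X, []
        for model in self.stages:
            y_hat = model.predict(X_aug)
            outputs.append(y_hat)
            X_aug = np.column_stack([X_aug, y_hat])
        return np.column_stack(outputs)
```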

I. System Modeling : ML approaches

§  Linear Regression
§  Sugeno Fuzzy Inference System (FCM, SUB)
§  Regression Tree (REP-Tree)
§  Model Tree (M5P)
§  Boosting (REP-Tree, M5P)
§  Bagging (REP-Tree, M5P)

Why ML?
-  The non-linear nature of the data.
-  Ability to deal with the complex nature of the data.
-  Detects dependencies between inputs and outputs efficiently.

Bagging vs. Boosting Classifiers

•  Bagging (bootstrap aggregating) uses bootstrap sampling.
•  Trains k classifiers, one on each bootstrap sample.
•  Combines the k learned classifiers by a (weighted) majority vote, using equal weights.

•  Boosting: weak classifiers are combined into a final strong classifier.
•  After a weak learner is added, the data is reweighted:
   Ø  misclassified examples => gain weight
   Ø  correctly classified examples => lose weight
•  Thus future learners focus more on the data that previous weak learners misclassified. (A sketch comparing the two follows.)
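As a concrete contrast, the sketch below trains a bagged and a boosted ensemble on the same base learner. scikit-learn's DecisionTreeRegressor stands in for the Weka REP-Tree/M5P learners used in the talk, and the synthetic data is only there to make the example runnable; it is not the LANL trace.

```python
# Sketch comparing bagging and boosting with the same base learner.
# DecisionTreeRegressor is a stand-in for REP-Tree / M5P (Weka learners).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((2000, 4))                      # stand-in for job features
y = X @ np.array([0.4, 0.3, 0.2, 0.1]) + 0.05 * rng.standard_normal(2000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = DecisionTreeRegressor(max_depth=4)
bagging = BaggingRegressor(base, n_estimators=50, random_state=0).fit(X_tr, y_tr)
boosting = AdaBoostRegressor(base, n_estimators=50, random_state=0).fit(X_tr, y_tr)

for name, model in [("Bagging", bagging), ("Boosting", boosting)]:
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name:8s} RMSE = {rmse:.4f}")
```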

II. Resource Predictor

v  The client requests hosting of a specific type of application with a pre-specified response time.
v  An initial resource estimate is generated.
v  Predictions are issued at the rate at which clients arrive at the data center.

Validation: Performance Measures for Different Prediction Models

Classifier Type      Classifier   RMSE     MAE      RAE       CC
Linear Reg.          C1           0.0024   0.0008   50.33%    0.70
                     C2           0.0023   0.0001   57.29%    0.71
                     C3           0.0026   0.0003   58.15%    0.98
Sugeno FIS (SUB)     C1           0.0021   0.0009   44.89%    0.66
                     C2           0.0012   0.0002   51.06%    0.66
                     C3           0.0011   0.0002   53.93%    0.85
Boosting (M5P)       C1           0.0020   0.0006   34.59%    0.80
                     C2           0.0018   0.0007   39.20%    0.84
                     C3           0.0003   0.0001   10.99%    0.99
Bagging Tree (M5P)   C1           0.0018   0.0005   32.57%    0.84
                     C2           0.0017   0.0007   36.38%    0.84
                     C3           0.0003   0.0001   11.82%    0.99

(A sketch of how these measures are computed follows.)
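For reference, the four measures in the table can be computed as below. This sketch assumes RAE is the relative absolute error (the model's absolute error relative to that of predicting the mean) and CC is the Pearson correlation coefficient, which is how Weka reports regression results; the talk does not spell out the definitions.

```python
# Minimal numpy sketch of the reported measures (RMSE, MAE, RAE, CC),
# assuming Weka-style definitions of RAE and CC.
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true
    rmse = np.sqrt(np.mean(err ** 2))                                  # root mean squared error
    mae = np.mean(np.abs(err))                                         # mean absolute error
    rae = np.sum(np.abs(err)) / np.sum(np.abs(y_true - y_true.mean())) # relative absolute error
    cc = np.corrcoef(y_true, y_pred)[0, 1]                             # correlation coefficient
    return {"RMSE": rmse, "MAE": mae, "RAE": f"{100 * rae:.2f}%", "CC": cc}
```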

Resource Predictor: Learning Time Comparison

III. Resource Allocator

1.  The Resource Allocator initially allocates resources (based on the prediction model).
2.  It checks the error reported by the tuner.
3.  The tuner calculates the normalized error in resource allocation:

    RespTimeError(k) = (RespTime_ref(k) - RespTime_obs(k)) / RespTime_ref(k)

4.  The allocator takes the feedback from the tuner (ResAdjustment) and sends a command to the VC in the VM with the appropriate decision. (A sketch of this loop follows.)
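A schematic of steps 1-4 as code. `prediction_model`, `tuner` and `send_to_vc` are placeholder components (their interfaces are assumptions); only the normalized response-time error follows the formula given on this slide, and adding the adjustment to the initial allocation is one plausible reading of step 4.

```python
# Schematic of the allocator loop (steps 1-4). The components passed in are
# placeholders; only RespTimeError matches the formula on the slide.
def allocation_step(prediction_model, tuner, send_to_vc,
                    client_request, resp_time_ref, resp_time_obs, vc_resources):
    # 1. Initial allocation from the prediction model.
    allocation = prediction_model.predict(client_request)

    # 2-3. Normalized response-time error, as computed by the tuner.
    resp_time_error = (resp_time_ref - resp_time_obs) / resp_time_ref

    # 4. Feedback from the tuner (ResAdjustment) becomes a command to the VC.
    res_adjustment = tuner.adjust(resp_time_error, client_request)
    send_to_vc(allocation + res_adjustment, vc_resources)
    return allocation, res_adjustment
```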

IV. Resource Tuner : Rule-Based Fuzzy System

Inputs:  RespTimeError, ClientClass, Status
Output:  ResDirection
Controller: ResController (Mamdani)

IV. Resource Tuner: Rule-Based Fuzzy System

[Fuzzy rule table: ResDirection as a function of RespTimeError (Low / Medium / High), Client Class (Gold / Silver / Bronze) and provisioning status (over- / under-provision); entries are speed-up (SUL / SUM / SUH) and step-down (SDL / SDM / SDH) actions.]

-  Total number of rules: 18.
-  The grades of membership of each attribute (high, medium, low) are adjusted by experts in the data center.
-  ResDir:
   •  reflects a percentage of the resource that should be utilized in the VC (ResAdjust = ResDir x ResWt x VCres).
   •  ranges over [-1, +1], with MFs (low, med, high) for:
      Ø  speed up (+ve side)
      Ø  step down (-ve side)
   (A simplified tuner sketch follows.)
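A simplified sketch of the tuner. Triangular membership functions and weighted-average defuzzification stand in for the full 18-rule Mamdani controller; the breakpoints, consequent values and the small rule excerpt are illustrative, with only the Gold/Silver/Bronze "medium error, under-provision" rows mirroring the examples shown later in the deck.

```python
# Simplified sketch of the rule-based tuner. Not the actual 18-rule Mamdani
# controller: breakpoints, consequents and most rules are illustrative.
def tri(x, a, b, c):
    """Triangular membership function with corners a, b, c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def resp_time_error_mfs(e):
    # RespTimeError in [0, 1]: low / medium / high (illustrative breakpoints).
    return {"low": tri(e, -0.01, 0.0, 0.4), "medium": tri(e, 0.2, 0.5, 0.8),
            "high": tri(e, 0.6, 1.0, 1.01)}

# Illustrative crisp ResDirection values on [-1, +1]:
# speed-up (SU*) for under-provision, step-down (SD*) for over-provision.
CONSEQUENT = {"SUL": 0.25, "SUM": 0.5, "SUH": 1.0,
              "SDL": -0.25, "SDM": -0.5, "SDH": -1.0, "noAction": 0.0}

# Excerpt of the rule base: (error level, client class, status) -> action.
# Only the "medium ... under" rows mirror the validation examples in the deck.
RULES = {
    ("medium", "gold", "under"): "SUM", ("high", "gold", "under"): "SUH",
    ("medium", "silver", "under"): "SUM", ("medium", "bronze", "under"): "noAction",
    ("medium", "gold", "over"): "SDM", ("high", "gold", "over"): "SDH",
}

def res_direction(resp_time_error, client_class, status):
    """Fire the matching rules and defuzzify by weighted average."""
    degrees = resp_time_error_mfs(resp_time_error)
    num = den = 0.0
    for (level, cls, st), action in RULES.items():
        if cls == client_class and st == status and degrees[level] > 0:
            num += degrees[level] * CONSEQUENT[action]
            den += degrees[level]
    return num / den if den else 0.0

def res_adjust(res_dir, res_wt, vc_res):
    # ResAdjust = ResDir x ResWt x VCres (from the slide).
    return res_dir * res_wt * vc_res

# Example: medium error, Gold client, under-provisioned -> speed up medium (0.5).
print(res_direction(0.5, "gold", "under"))
```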

V. Adaptive Learning

New incoming data is fed into the prediction model in different ways, depending on the prediction model used:
-  Directly, via clustering (if clustering is used, as in TS-FIS) => online learning.
-  Or it is stored in the database until a certain threshold is reached; then an ECA rule fires, initiating re-modeling => offline learning (see the sketch below).
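The offline path can be sketched as an ECA-style trigger: on each new record (event), check whether the buffer has reached the threshold (condition) and, if so, re-model (action). The threshold value and the `retrain` callback are placeholders, not values from the talk.

```python
# Sketch of the offline adaptive-learning path as an ECA-style trigger.
class OfflineAdaptiveLearner:
    def __init__(self, retrain, threshold=1000):
        self.retrain = retrain          # action: rebuild the prediction model
        self.threshold = threshold      # condition: enough new records stored
        self.buffer = []                # "database" of new incoming records

    def on_new_record(self, record):    # event: a new job/record arrives
        self.buffer.append(record)
        if len(self.buffer) >= self.threshold:
            self.retrain(self.buffer)   # fire the ECA rule: re-model offline
            self.buffer.clear()
```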

V. Adaptive Learning: Update Rules in the Fuzzy Tuner FIS

[Figure: Rule Editor.]

Resource Tuner Validation - Example

[Fuzzy rule table repeated from the Resource Tuner slide.]

Method: Testing cases using the fuzzy rule viewer.

Resource Tuner Validation - Example

Inputs: RespTimeError: medium, Client class: Gold, Status: underprovision
Output: ResDirection: SUM (speed up medium)
(Rule viewer values: RespTimeError = 0.5, ClientClass = 0.9, Status = 0.2, ResDirection = 0.5)

Resource Tuner Validation - Example

Inputs: RespTimeError: medium, Client class: Silver, Status: underprovision
Output: ResDirection: SUM (speed up medium)
(Rule viewer values: RespTimeError = 0.5, ClientClass = 0.5, Status = 0.2, ResDirection = 0.5)

Resource Tuner Validation - Example

Inputs: RespTimeError: medium, Client class: Bronze, Status: underprovision
Output: ResDirection: noAction
(Rule viewer values: RespTimeError = 0.5, ClientClass = 0.19, Status = 0.2, ResDirection = 0.01)

Conclusions

•  The proposed ML model predicts the right amount of resources (Bagging/Boosting is promising).
•  The fuzzy tuner:
   -  accommodates deviations in workload characteristics.
   -  enforces service differentiation.
•  Adaptive learning guarantees an up-to-date model, lowering future SLA violations.

Questions ?
