
Page 1: Noha danms13 talk_final

Autonomous Resource Provision in Virtual Data Centers

Presented by: Noha Elprince

[email protected]

IFIP/IEEE DANMS, 31 May 2013

Page 2: Noha danms13 talk_final

Static Data Centers vs. Dynamic

[Figure: demand vs. capacity over time for static and dynamic provisioning; the gap between capacity and demand is unused resources. Source: RAD Lab, UC Berkeley]

Static:
•  Fixed, pre-assigned resources (provisioned for peak)
•  Static environment
•  Manual change of configurations

Dynamic:
•  Cloud elasticity ~ "pay as you go"
•  Virtualized environment
•  Automated, "self-service" change of configurations

Page 3: Noha danms13 talk_final

Cloud Elasticity problems…

•  Under-provisioning => heavy penalty: lost revenue, lost users

[Figures: demand vs. capacity over time (days 1-3). Source: RAD Lab, UC Berkeley]

Page 4: Noha danms13 talk_final

Cloud Elasticity …

•  Over-provisioning => underutilization (unused resources)

[Figure: demand vs. capacity over time; the gap is unused resources. Source: RAD Lab, UC Berkeley]

Page 5: Noha danms13 talk_final

Virtualization (Cloud Foundation)

•  Virtualization allows a computational resource to be partitioned into multiple isolated execution environments (VMs).
•  Turning the machine into a "virtual image" gives it a degree of self-immunity from:
Ø  Hardware breakdowns
Ø  Running out of resources

Challenge: Service Differentiation

Page 6: Noha danms13 talk_final

Problem

q  Over- and under-provisioning, owing to the difficulty of estimating actual needs under time-varying and diverse workloads.
q  Enabling service differentiation in a virtualized environment.

Page 7: Noha danms13 talk_final

Methodology

•  Develop and implement an autonomic resource-management controller that:
Ø  Effectively optimizes resources by predicting current resource needs.
Ø  Continuously self-tunes resources to accommodate load variations and enforce service differentiation during resource allocation.
•  Test the proposed prototype on real traces.

Page 8: Noha danms13 talk_final

Motivation

•  Help data centers manage resources effectively.
•  Promote cloud computing (more cloud users => lower cost).
•  Optimize resources (green IT!).

Page 9: Noha danms13 talk_final

Related Work

v  Approaches for autonomic resource management:
–  Utility-based self-optimizing approach
–  Model-based approach built on performance modeling
–  Machine-learning approach
–  Fuzzy-logic approach

Page 10: Noha danms13 talk_final

Proposed Solution Architecture: Sys. Modeling

[Figure: controller architecture; the prediction model outputs the next-step resource estimate r(t+1)]

Page 11: Noha danms13 talk_final

I. System Modeling : Data set

•  Idea: learn from successful jobs (normal termination, fulfilling the client's anticipated performance).
•  A real computing-center trace from Los Alamos National Laboratory (LANL).
•  LANL is a United States Department of Energy (DOE) national laboratory.
•  LANL conducts multidisciplinary research in fields such as national security, space exploration, renewable energy, medicine, nanotechnology, and supercomputing.
•  System: 1024-node Connection Machine CM-5 from Thinking Machines.
•  Jobs: 201,387; duration: 2 years.

Page 12: Noha danms13 talk_final

I. System Modeling : Data Preprocessing

v  Feature Selection
•  Use stepwise regression to sort variables, keeping the more influential ones in the model.
•  Result: of 18 features, 5 were selected => run_time, wait_time, Avg_cpu_time, used_mem, status

v  Filter
•  Remove jobs with status = unsuccessful (failed/aborted).
•  Discard records with average_cpu_time_used <= 0 and used_mem <= 0.

v  Data Cleaning
•  Normalize the data to remove noise.
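A minimal sketch of the filter and normalization steps in Python. The record fields mirror the selected trace features above; the sample values and the min-max normalization are illustrative choices, not necessarily those used in the paper:

```python
# Hypothetical job records shaped like the selected LANL trace fields.
jobs = [
    {"run_time": 120.0, "wait_time": 30.0, "avg_cpu_time": 100.0,
     "used_mem": 512.0, "status": "success"},
    {"run_time": 80.0, "wait_time": 10.0, "avg_cpu_time": -1.0,
     "used_mem": -1.0, "status": "success"},
    {"run_time": 200.0, "wait_time": 50.0, "avg_cpu_time": 180.0,
     "used_mem": 1024.0, "status": "failed"},
]

# Filter: drop unsuccessful jobs, and records whose CPU time and
# memory are both non-positive (empty/invalid accounting entries).
clean = [j for j in jobs
         if j["status"] == "success"
         and not (j["avg_cpu_time"] <= 0 and j["used_mem"] <= 0)]

def min_max_normalize(records, field):
    """Rescale one numeric field to [0, 1] (one common normalization)."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = hi - lo or 1.0  # avoid division by zero for constant fields
    for r in records:
        r[field] = (r[field] - lo) / span
    return records

min_max_normalize(clean, "run_time")
```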

Page 13: Noha danms13 talk_final


I. System Modeling : Statistical Analysis

Page 14: Noha danms13 talk_final

I. System Modeling : Model I/O

Cascaded classifiers (MISO model)

Page 15: Noha danms13 talk_final

I. System Modeling : ML approaches

§  Linear Regression
§  Sugeno Fuzzy Inference System (FCM, SUB)
§  Regression Tree (REP-Tree)
§  Model Tree (M5P)
§  Boosting (REP-Tree, M5P)
§  Bagging (REP-Tree, M5P)

Why ML?
-  The data is non-linear in nature.
-  ML can deal with the complex nature of the data.
-  ML detects dependencies between inputs and outputs efficiently.

Page 16: Noha danms13 talk_final

Bagging vs. Boosting Classifiers

•  Bagging (bootstrap aggregating) uses bootstrap sampling.
•  It trains one classifier on each of k bootstrap samples.
•  The k learned classifiers are combined by majority voting (averaging, for regression) with equal weights.

•  Boosting: weak classifiers are combined into a final strong classifier.
•  After a weak learner is added, the data is reweighted:
Ø  misclassified examples gain weight
Ø  correctly classified examples lose weight
•  Thus future learners focus on the data that previous weak learners misclassified.
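As a toy illustration of the bagging side (not the paper's actual REP-Tree/M5P setup), the following sketch bootstraps a small regression set, fits a depth-1 stump on each sample, and averages the predictions with equal weights; all names and data are invented for the example:

```python
import random
import statistics

def fit_stump(sample):
    """Fit a depth-1 regression stump: split x at the sample median,
    predict the mean y on each side."""
    xs = [x for x, _ in sample]
    split = statistics.median(xs)
    left = [y for x, y in sample if x <= split] or [0.0]
    right = [y for x, y in sample if x > split] or [0.0]
    l_mean, r_mean = statistics.mean(left), statistics.mean(right)
    return lambda x: l_mean if x <= split else r_mean

def bag_predict(data, x, k=25, seed=0):
    """Bagging: train k stumps on bootstrap samples, then average
    their predictions (the regression analogue of equal-weight voting)."""
    rng = random.Random(seed)
    preds = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in data]  # bootstrap sample
        preds.append(fit_stump(sample)(x))
    return statistics.mean(preds)

data = [(x, 2.0 * x) for x in range(10)]  # toy linear relation y = 2x
print(round(bag_predict(data, 8.0), 2))
```

Averaging over bootstrap replicas is what reduces the variance of the individual stumps; boosting would instead reweight the data after each stump.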

Page 17: Noha danms13 talk_final

II. Res. Predictor

v  The client requests hosting of a specific application type with a pre-specified response time.
v  An initial resource estimate is generated.
v  A new prediction is made each time a client arrives at the data center.

Page 18: Noha danms13 talk_final

Validation: Performance Measures for different prediction models

Classifier           Type   RMSE     MAE     RAE      CC
Linear Reg.          C1     0.0024   0.0008  50.33%   0.70
                     C2     0.0023   0.0001  57.29%   0.71
                     C3     0.0026   0.0003  58.15%   0.98
Sugeno FIS (SUB)     C1     0.0021   0.0009  44.89%   0.66
                     C2     0.0012   0.0002  51.06%   0.66
                     C3     0.0011   0.0002  53.93%   0.85
Boosting (M5P)       C1     0.0020   0.0006  34.59%   0.80
                     C2     0.0018   0.0007  39.20%   0.84
                     C3     0.0003   0.0001  10.99%   0.99
Bagging Tree (M5P)   C1     0.0018   0.0005  32.57%   0.84
                     C2     0.0017   0.0007  36.38%   0.84
                     C3     0.0003   0.0001  11.82%   0.99
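For reference, the four measures in the table can be computed as follows (stdlib-only sketch; RAE is taken relative to always predicting the mean of the targets, the convention used by WEKA-style evaluators, which we assume here):

```python
import math
import statistics

def metrics(y_true, y_pred):
    """RMSE, MAE, RAE, and correlation coefficient (CC) for one model."""
    n = len(y_true)
    errs = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errs) / n)
    mae = sum(abs(e) for e in errs) / n
    # RAE: total absolute error relative to the mean predictor.
    mean_t = statistics.mean(y_true)
    rae = sum(abs(e) for e in errs) / sum(abs(t - mean_t) for t in y_true)
    # CC: Pearson correlation between targets and predictions.
    mean_p = statistics.mean(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    cc = cov / math.sqrt(sum((t - mean_t) ** 2 for t in y_true)
                         * sum((p - mean_p) ** 2 for p in y_pred))
    return rmse, mae, rae, cc
```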

Page 19: Noha danms13 talk_final


Resource Predictor: Learning Time Comparison

Page 20: Noha danms13 talk_final

III. Resource Allocator

1.  The Resource Allocator initially allocates resources based on the prediction model.
2.  It checks the error reported by the tuner.
3.  The tuner calculates the normalized error in resource allocation:

RespTimeError(k) = ( RespTime_ref(k) - RespTime_obs(k) ) / RespTime_ref(k)

4.  The allocator takes the feedback from the tuner (ResAdjustment) and sends a command to the VC in the VM with the appropriate decision.
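The normalized error above is a one-line computation; a small sketch (the function name is ours):

```python
def resp_time_error(resp_time_ref, resp_time_obs):
    """Normalized response-time error at step k:
    (reference - observed) / reference.
    Positive => faster than required (over-provisioned);
    negative => slower than required (under-provisioned)."""
    return (resp_time_ref - resp_time_obs) / resp_time_ref

print(resp_time_error(2.0, 2.5))  # observed slower than the 2.0 s target: -0.25
```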

Page 21: Noha danms13 talk_final

IV. Resource Tuner : Rule-Based Fuzzy System

i/ps: RespTimeError, ClientClass, Status
o/p:  ResDirection

ResController (Mamdani)
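A toy sketch of the Mamdani-style ingredients: triangular membership functions and min-based rule firing. The breakpoints below are illustrative only; in the system they are adjusted by datacenter experts:

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b over [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Illustrative membership grades on a [0, 1] universe.
def med_error(e):   return tri(e, 0.25, 0.5, 0.75)   # RespTimeError "medium"
def gold_class(c):  return tri(c, 0.7, 0.9, 1.1)     # ClientClass near "gold"
def under_prov(s):  return tri(s, 0.0, 0.2, 0.4)     # Status ~ under-provision

# One Mamdani rule: IF error is medium AND class is gold AND status is
# under-provision THEN ResDirection is SUM (speed up medium).
# Firing strength = min over the antecedent memberships.
strength = min(med_error(0.45), gold_class(0.85), under_prov(0.25))
print(round(strength, 2))  # → 0.75
```

In a full Mamdani controller the firing strength clips the output membership for SUM, and the clipped sets from all fired rules are aggregated and defuzzified (e.g. centroid) into a crisp ResDirection.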

Page 22: Noha danms13 talk_final

IV. Resource Tuner: Rule-Based Fuzzy System

Rule table (ResDirection; each cell gives the action for over-provision / under-provision; SU = speed up, SD = step down, L/M/H = low/medium/high):

                  RespTimeError
Client Class      Low         Medium      High
Gold              -- / SUL    SDM / SUM   SDH / SUH
Silver            SDL / --    SDM / SUM   SDH / SUH
Bronze            SDL / --    SDM / --    SDH / SUH

-  Total # rules: 18
-  The grades of membership of each attribute (high, medium, low) are adjusted by experts in the datacenter.
-  ResDir:
•  reflects the percentage of the resource that should be utilized in the VC (ResAdjust = ResDir x ResWt x VCres)
•  ranges [-1, +1] with MFs (low, med, high) for:
Ø  speed up (+ve side)
Ø  step down (-ve side)
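The ResAdjust relation above is a single multiplication; a sketch (names mirror the slide, and the weight value is illustrative):

```python
def res_adjust(res_dir, res_wt, vc_res):
    """Resource adjustment: ResAdjust = ResDir x ResWt x VCres.
    res_dir in [-1, +1]: positive => speed up, negative => step down."""
    assert -1.0 <= res_dir <= 1.0, "ResDirection is a signed fraction"
    return res_dir * res_wt * vc_res

# e.g. SUM (speed up medium, ResDir = 0.5) on a VC holding 100 resource
# units with an assumed weight of 0.8 => grant 40 more units.
print(res_adjust(0.5, 0.8, 100))  # → 40.0
```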

Page 23: Noha danms13 talk_final

V. Adaptive Learning

New incoming data is fed into the prediction model in different ways, depending on the prediction model used:
-  Directly, via clustering (if clustering is used, as in TS-FIS) => online learning
-  OR it is stored in the database until a certain threshold is reached; then an ECA rule fires, initiating re-modeling => offline learning
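The offline path can be sketched as a simple event-condition-action loop; the threshold value and class/method names below are illustrative:

```python
class OfflineLearner:
    """Buffer new samples; when the buffer crosses a threshold,
    an ECA-style rule fires and triggers re-modeling."""

    def __init__(self, threshold=100):
        self.threshold = threshold
        self.buffer = []
        self.remodel_count = 0

    def on_new_sample(self, sample):            # Event: new data arrives
        self.buffer.append(sample)
        if len(self.buffer) >= self.threshold:  # Condition: threshold reached
            self.remodel()                      # Action: re-model

    def remodel(self):
        # Placeholder for re-fitting the prediction model on the buffer.
        self.remodel_count += 1
        self.buffer.clear()

learner = OfflineLearner(threshold=3)
for s in range(7):
    learner.on_new_sample(s)
print(learner.remodel_count, len(learner.buffer))  # → 2 1
```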

Page 24: Noha danms13 talk_final

V. Adaptive Learning: update Rules in Fuzzy Tuner FIS


Rule Editor

Page 25: Noha danms13 talk_final

Resource Tuner validation - Example

Method: test cases using the fuzzy rule viewer against the rule table on Page 22 (over-provision / under-provision actions per RespTimeError level and client class).

Page 26: Noha danms13 talk_final

Resource Tuner Validation - Example

i/ps: RespTimeError: medium, ClientClass: Gold, Status: under-provision
o/p:  ResDirection: SUM (speed up medium)

RespTimeError = 0.5, ClientClass = 0.9, Status = 0.2  =>  ResDirection = 0.5

Page 27: Noha danms13 talk_final

Resource Tuner Validation - Example

i/ps: RespTimeError: medium, ClientClass: Silver, Status: under-provision
o/p:  ResDirection: SUM (speed up medium)

RespTimeError = 0.5, ClientClass = 0.5, Status = 0.2  =>  ResDirection = 0.5

Page 28: Noha danms13 talk_final

Resource Tuner Validation - Example

i/ps: RespTimeError: medium, ClientClass: Bronze, Status: under-provision
o/p:  ResDirection: noAction

RespTimeError = 0.5, ClientClass = 0.19, Status = 0.2  =>  ResDirection = 0.01

Page 29: Noha danms13 talk_final

Conclusions

•  The proposed ML model predicts the right amount of resources (bagging/boosting is promising).
•  The fuzzy tuner:
-  Accommodates deviations in workload characteristics.
-  Enforces service differentiation.
•  Adaptive learning guarantees an up-to-date model, lowering future SLA violations.

Page 30: Noha danms13 talk_final

Questions ?
