Summary of the work performed by Dr. Srikumar Venugopal (UNSW) and his team on various aspects of cloud elasticity.
Towards a Unified View of Elasticity
Srikumar Venugopal & Team
School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
Acknowledgements
• Basem Suleiman
• Han Li
• Reza Nouri
• Freddie Sunarso
• Richard Gow
Agenda
• Introduction to elasticity and its challenges
• Performance Modeling of Elasticity Rules
• Autonomic Decentralised Elasticity Management of Cloud Applications
• Efficient Bootstrapping for Decentralised Shared-nothing Key-value Stores
Simple Service Deployment on Cloud
Elasticity
The ability of a system to change its capacity in direct response to the workload demand
Different Views of Elasticity
• Performance View
  – When to scale, and by how much?
• Application View
  – Does the architecture accommodate scaling?
  – How is state managed?
• Configuration View
  – Does scaling require configuration changes?
Elastic Deployment Architecture
Elasticising Application Layer
Trigger – Controller – Action
• Trigger: a threshold breach
• Controller: the intelligence/logic
• Action: add or remove capacity
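To make the loop concrete, here is a minimal sketch in Python; the monitoring and actuation hooks (`get_cpu_utilization`, `add_server`, `remove_server`) are hypothetical stubs, not a real provider API.

```python
import time

def get_cpu_utilization():
    """Stub: return the current average CPU utilisation in [0, 1].
    A real platform would query the provider's monitoring service."""
    return 0.5

def add_server():
    print("scale out: adding one server")

def remove_server():
    print("scale in: removing one server")

def trigger(cpu, high=0.85, low=0.30):
    """Trigger: detect a threshold breach in either direction."""
    if cpu >= high:
        return "scale_out"
    if cpu <= low:
        return "scale_in"
    return None

def controller(event):
    """Controller: map a trigger event to a capacity action."""
    return {"scale_out": add_server, "scale_in": remove_server}.get(event)

def control_loop(interval_seconds=60):
    while True:
        action = controller(trigger(get_cpu_utilization()))
        if action:
            action()                   # Action: add or remove capacity
        time.sleep(interval_seconds)   # one measuring interval
```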
State-of-the-art in Auto-scaling
Product/Project       | Trigger                      | Controller                | Actions
----------------------|------------------------------|---------------------------|-----------------------------
Amazon Auto Scaling   | CloudWatch metrics/threshold | Rule-based/schedule-based | Add/remove capacity
WASABi                | Azure Diagnostics/threshold  | Rule-based                | Add/remove capacity, custom
RightScale/Scalr      | Load monitoring              | Rule-based/schedule-based | Add/remove capacity, custom
Google Compute Engine | CPU load, etc.               | Rule-based                | Add/remove capacity
CloudScale (academic) | Demand prediction            | Control theory            | Voltage scaling
Cataclysm (academic)  | Threshold-based              | Queueing model            | Admission control
IBM Unity (academic)  | Application utility          | Utility functions/RL      | Add/remove capacity
Summary
• The most popular auto-scaling mechanisms today are rule-based
• The effectiveness of rule-based auto-scaling is determined by its trigger conditions
• So, how do we know how to set the right triggers?
Performance Modeling of Elasticity Rules
Basem Suleiman
Elasticity (Auto-Scaling) Rules
Examples (see the sketch below):
• If CPU Utilization ≥ 85% for 7 min, add 1 server (scale out)
• If RespTimeSLA ≥ 95% for 10 min, remove 1 server (scale in)
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
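The "for N minutes" clause is essential: the condition must hold over a sustained window, not just for a single sample. A minimal sketch of such a windowed check (class and parameter names are illustrative, not from the paper):

```python
from collections import deque

class SustainedCondition:
    """Fires only when a predicate holds for every sample in a window,
    e.g. "CPU utilisation >= 85% for 7 consecutive minutes"."""
    def __init__(self, predicate, window_minutes):
        self.predicate = predicate
        self.samples = deque(maxlen=window_minutes)  # one sample per minute

    def update(self, value):
        """Feed one per-minute sample; return True when the rule fires."""
        self.samples.append(self.predicate(value))
        return len(self.samples) == self.samples.maxlen and all(self.samples)

# The two example rules above:
scale_out = SustainedCondition(lambda cpu: cpu >= 0.85, window_minutes=7)
scale_in = SustainedCondition(lambda sla: sla >= 0.95, window_minutes=10)
```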
Performance of Different Elasticity Rules
• How well do elasticity rules perform in terms of SLA satisfaction, CPU utilization, cost, and percentage of served requests?
Rule  | Elasticity Rules
CPU75 | If CPU Util. > 75% for 5 min, add 1 server; if CPU Util. ≤ 30% for 5 min, remove 1 server
CPU80 | If CPU Util. > 80% for 5 min, add 1 server; if CPU Util. ≤ 30% for 5 min, remove 1 server
CPU85 | If CPU Util. > 85% for 5 min, add 1 server; if CPU Util. ≤ 30% for 5 min, remove 1 server
SLA90 | If SLA < 90% for 5 min, add 1 server; if SLA ≥ 90% for 5 min, remove 1 server
SLA95 | If SLA < 95% for 5 min, add 1 server; if SLA ≥ 95% for 5 min, remove 1 server
B. Suleiman, S. Sakr, S. Venugopal, W. Sadiq, Trade-off Analysis of Elasticity Approaches for Cloud-Based Business Applications, Proc. WISE 2012.
Cloud Testbed for Collecting Metrics
[Figure: testbed. TPC-W application servers on EC2 instances behind an Elastic Load Balancer, with a TPC-W database on EC2. Metrics collected: % SLA satisfaction, avg. CPU utilization, server costs, % served requests, response time.]
B. Suleiman, S. Sakr, S. Venugopal, W. Sadiq, Trade-off Analysis of Elasticity Approaches for Cloud-Based Business Applications, Proc. WISE 2012.
Performance Evaluation - Different Elasticity Rules
[Figure: box plots (min, Q1, median, mean, Q3, max) of server costs ($0.00–$2.50) and average CPU utilization (0–90%) for rules CPU75, CPU80, CPU85, SLA90 and SLA95.]
B. Suleiman, S. Sakr, S. Venugopal, W. Sadiq, Trade-off Analysis of Elasticity Approaches for Cloud-Based Business Applications, Proc. WISE 2012.
The Challenges of Thresholds
You must be at least this tall to scale up!
• Threshold values determine performance and cost
  – E.g. a low CPU-utilization threshold ⇒ higher cost, better performance
• Thresholds vary from one application to another
• Empirically determining thresholds is expensive
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
Can we construct a model that allows us to establish the right thresholds?
Queueing Model of a 3-Tier Application
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
Establishing Rule Thresholds
• Developed a model based on the M/M/m queueing model, capturing:
  – Simultaneous session initiations on one server
  – The provider's provisioning lag time
  – The cool-down interval after an elasticity action
  – Algorithms modelling scale-in and scale-out
  – Request mix
• Compared model fidelity with actual cloud executions of the TPC-W workload
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
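For reference, the textbook M/M/m quantities that such a model builds on can be computed directly; a sketch using the standard Erlang C formula (the paper's model layers provisioning lag, cool-down, and request mix on top of this):

```python
import math

def erlang_c(m, lam, mu):
    """Probability that an arriving request must wait in an M/M/m queue
    (m servers, arrival rate lam, per-server service rate mu)."""
    a = lam / mu                        # offered load in Erlangs
    rho = a / m                         # per-server utilisation
    assert rho < 1, "unstable: not enough servers"
    top = a**m / (math.factorial(m) * (1 - rho))
    bottom = sum(a**k / math.factorial(k) for k in range(m)) + top
    return top / bottom

def mean_response_time(m, lam, mu):
    """Mean response time = service time + expected queueing delay."""
    return 1 / mu + erlang_c(m, lam, mu) / (m * mu - lam)

# e.g. 3 servers, 1200 req/min arriving, each serving 500 req/min:
print(mean_response_time(3, 1200, 500))   # ~0.0042 minutes
```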
Experiments: Methodology
• Run the TPC-W workload on Amazon cloud resources using the chosen thresholds
• Simulate the model in MATLAB with the same thresholds
• Compare the simulation results to the results from the actual execution
  – If both are equivalent, then we are good :)
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
Experiments: Testbed
[Figure: testbed. TPC-W user emulation on an extra-large Linux EC2 instance; TPC-W application tier on small/medium Linux EC2 instances (JBoss/JSDK) behind an Elastic Load Balancer; TPC-W database on an extra-large Linux EC2 instance (MySQL).]
Experiments: Input Workload
[Figure: input workload. Request arrival rate (req/min, up to ~2,400) over time (minutes, 0–570).]
Workload
• Used the TPC-W browsing profile (95% reads)
• Stress is on the application tier
• Number of concurrent users: Zipf-distributed
• Inter-arrival times: Poisson
Experiments: Elasticity Rules
Rule  | Rule Expansion
CPU75 | If CPU Util. > 75% for 5 min, add 1 server; if CPU Util. < 30% for 5 min, remove 1 server
CPU80 | If CPU Util. > 80% for 5 min, add 1 server; if CPU Util. < 30% for 5 min, remove 1 server

Common parameters:
• Waiting time: 10 min; measuring interval: 1 min

Metrics captured:
• Average CPU utilization across all servers
• Average response time in each time interval
• Number of servers in operation at any point in time
Results
CPU Utilization
[Figure: average CPU utilization for CPU75 and CPU80, model (M) vs. empirical (E).]

Average Response Time
[Figure: average response time (sec) for CPU75 and CPU80, model (M) vs. empirical (E).]

CPU Utilization over Time
[Figure: average CPU utilization (%) over time (minutes) for CPU80, model (M) vs. empirical (E).]

Number of Servers Initialized
[Figure: number of app-tier servers over time (minutes) for CPU75 and CPU80, model (M) vs. empirical (E).]
Summary
• Developed a queueing model that can be used to reason about elasticity
• The model captures the effects of thresholds and can be used to test different rules
• Evaluations show that the model closely approximates real-world behaviour
• Future work: handling initial bursts in workload
Autonomic Decentralised Elasticity Management of Cloud Applications
Reza Nouri and Han Li
Cons of Rule-based Autoscaling
• Commercial products are rule-based
  – Gives users an "illusion of control"
  – Leads to the problem of defining the "right" thresholds
• Centralised controllers
  – Communication overhead increases with size
  – Processing overhead also increases (Big Data!)
• One application/VM at a time
Challenges of large-scale elasticity
• Large numbers of instances and apps
  – Deriving solutions takes time
• Dynamic conditions
  – Apps go critical all the time
• Shifting bottlenecks
  – Greedy solutions may create bottlenecks elsewhere
• Network partitions, fault tolerance, ...
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
Initial Conditions
[Figure: Instance1 (App Server1) hosts app1 and app2; Instance2 (App Server2) hosts app3 and app4, on an IaaS provider.]
A Critical Event
[Figure: the same deployment; one application (app1) experiences a critical event.]
Placement 1
[Figure: app2 is moved to Instance2 (now hosting app2, app3, app4), leaving app1 alone on Instance1.]
Placement 2
[Figure: a new Instance3 (App Server3) is provisioned at extra cost ($$) to host app1; app2 remains on Instance1; Instance2 hosts app3 and app4.]
Placements 3 & 4
[Figure: app1 is duplicated instead of moved — either onto the existing instances or onto a new Instance3.]
Problems for Automatic Placement
• Provisioning
  – Find the smallest number of servers required to satisfy the resource requirements of all applications (a bin-packing problem; see the sketch below)
• Dynamic Placement
  – Distribute applications so as to maximise utilisation while meeting each app's response-time and availability requirements
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
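Provisioning with the fewest servers is essentially bin packing, which is NP-hard, so heuristics are the norm. A sketch of first-fit-decreasing over a single CPU dimension (purely illustrative; this is not the controller the paper proposes):

```python
def first_fit_decreasing(app_demands, server_capacity=1.0):
    """Pack application CPU demands onto as few servers as possible.
    Returns a list of servers, each a list of (app, demand) pairs."""
    servers = []
    for app, demand in sorted(app_demands.items(),
                              key=lambda kv: kv[1], reverse=True):
        for srv in servers:                       # first server that fits
            if sum(d for _, d in srv) + demand <= server_capacity:
                srv.append((app, demand))
                break
        else:
            servers.append([(app, demand)])       # open a new server
    return servers

# Two servers suffice here: [app1+app3] and [app2+app4]
print(first_fit_decreasing({"app1": 0.6, "app2": 0.5, "app3": 0.4, "app4": 0.3}))
```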
Co-ordinated Control of Elasticity
• Instances control their own utilisation
  – Monitoring, management and feedback
• Local controllers are learning agents
  – Reinforcement learning (see the sketch below)
• Controllers learn from each other
  – Share their knowledge and update their own
• Servers are linked by a DHT
  – Agility, flexibility, co-ordination
H. Li, S. Venugopal, “Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform”, Proceedings of 8th ICAC '11.
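As a rough illustration of the learning-agent idea, here is a generic one-step Q-learning sketch; the paper's state, action, and reward design is richer, and the averaging in `merge_knowledge` is an assumed sharing rule, not the paper's method:

```python
import random
from collections import defaultdict

ACTIONS = ["create!", "terminate!", "find!", "move!", "duplicate!", "merge!"]

class LocalController:
    """One learning agent per instance; states might be discretised
    utilisation bands (an assumption made for this sketch)."""
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)            # (state, action) -> value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        if random.random() < self.epsilon:     # explore occasionally
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Standard one-step Q-learning update."""
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

    def merge_knowledge(self, peer):
        """Share learned values with a peer controller (simple averaging)."""
        for key in set(self.q) | set(peer.q):
            self.q[key] = (self.q[key] + peer.q[key]) / 2
```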
Abstract View of the Control Scheme
H. Li, S. Venugopal, “Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform”, Proceedings of 8th ICAC '11.
Fuzzy Thresholds
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
Basic Actions
Actions and rewards (reward in parentheses):
• Instance-level: create! (−3.5), terminate! (3.5), find! (3.5)
• Application-level: move! (0.5), duplicate! (0.5), merge! (0.5)
Co-ordination using find!
• The server looks up the other servers with the least load
  – A DHT lookup
• It sends a move! message to the selected server
• The target replies with accept or reject!
  – accept carries a positive reward (see the sketch below)
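A sketch of the handshake (the DHT lookup and messaging are stubbed out; the capacity rule and the +0.5 reward are illustrative assumptions):

```python
class Server:
    """Minimal stand-in for an app-server node linked into the DHT."""
    def __init__(self, name, load):
        self.name, self._load, self.apps = name, load, []

    def load(self):
        return self._load

    def request_move(self, app, demand=0.2):
        """Accept only if taking the app keeps us under capacity."""
        return "accept" if self._load + demand < 1.0 else "reject"

def find_and_move(local, peers, app):
    """find!: pick the least-loaded peer (a DHT lookup in the real
    system) and send it a move! message; accept earns a reward."""
    peer = min(peers, key=Server.load)
    if peer.request_move(app) == "accept":
        peer.apps.append(app)             # transfer the application
        return peer.name, +0.5            # positive reward for accept
    return None, 0.0                      # rejected: try something else

print(find_and_move(Server("s1", 0.9),
                    [Server("s2", 0.3), Server("s3", 0.6)], "app1"))
```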
Shrinking
• The controller is always reward-maximising
  – The highest reward is for merge + terminate
• A controller initiates its own shutdown
  – When the load on its applications is low
• It acquires an exclusive lock on termination
  – Only one instance can terminate at a time
• It transfers state before shutdown (see the sketch below)
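The lock-guarded shutdown can be sketched as follows; the threading lock stands in for a cluster-wide lock, `move` is the find!/move! handshake from the previous sketch, and the instance object is assumed to expose `apps` and `terminate()`:

```python
import threading

termination_lock = threading.Lock()     # stand-in for a cluster-wide lock

def try_shutdown(instance, peers, move):
    """Shrink step: move all apps to peers, then terminate this instance.
    The lock ensures only one instance terminates at a time."""
    if not termination_lock.acquire(blocking=False):
        return False                      # another instance is terminating
    try:
        for app in list(instance.apps):   # transfer state before shutdown
            move(instance, peers, app)
        instance.terminate()              # merge+terminate: highest reward
        return True
    finally:
        termination_lock.release()
```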
Experiments
• Six web applications
  – Test application: hotel management (Search → Book → Confirm)
• Five were subjected to a background load
  – Uniform random
• One was subjected to the test load
• Application thresholds: 200 and 500 ms
• Metrics: average response time, drop rate, number of servers
H. Li, S. Venugopal, “Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform”, Proceedings of 8th ICAC '11.
Experimental Results (EC2)
Elasticising Persistence Layer
Efficient Bootstrapping for Decentralised Shared-nothing Key-value Stores
Han Li
Key-value Stores
• The standard component for cloud data management
• Increasing workload → node bootstrapping
  – Incorporate a new, empty node as a member of the KVS
• Decreasing workload → node decommissioning
  – Remove an existing member, whose data is now redundant, from the KVS
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
Research Questions
• As the system scales, how do we efficiently incorporate or remove data nodes?
  – Load balancing, migration overheads, etc.
• How do we partition and place data replicas while the system is elastic?
  – Data consistency, durability, availability, etc.
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
Elasticity in Key-Value Stores
• Minimise the overhead of data movement
  – How to partition/store data?
• Balance the load at node bootstrapping
  – Both data volume and workload
  – How to place/allocate data?
• Maintain data consistency and availability
  – How to execute data movement?
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
Split-Move Approach
[Figure: Split-Move bootstrapping over a key space with ranges A–I. ① An existing node splits its partition B into B1 and B2; ② the new node takes over B2, and the master/slave replica assignments are updated; ③ the now-redundant copies are marked "to be deleted". Partitioning happens at node bootstrapping.]
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
Virtual-Node Approach
[Figure: Virtual-Node bootstrapping. The key space (A–I) is partitioned at system startup; each of the four nodes holds several partitions with master and slave replicas, and a new node bootstraps by taking over whole partitions from many existing nodes.]
Data skew: e.g., the majority of data is stored in a minority of partitions. Moving around giant partitions is not a good idea.
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
Our Solution
• Virtual-node based movement
  – Each data partition is stored in separate files
  – Reduced data-movement overhead
  – Many existing nodes can participate in bootstrapping
• Automatic sharding (see the sketches below)
  – Split and merge partitions at runtime
  – Each partition stores a bounded volume of data
  – Easy to reallocate data, easy to balance the load
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
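To illustrate the virtual-node idea, here is a generic consistent-hashing ring with virtual nodes (not ElasCass's actual code; partition-to-file storage and bounded partition sizes are handled separately):

```python
import hashlib
from bisect import bisect_right

def h(s):
    """Hash a string onto the key space."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class VirtualNodeRing:
    """Each physical node owns many small virtual partitions; a new node
    bootstraps by taking over whole partitions from many existing nodes,
    instead of splitting a single neighbour's range."""
    def __init__(self, nodes, vnodes_per_node=8):
        self.ring = sorted((h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes_per_node))

    def owner(self, key):
        """Walk clockwise from the key's hash to the next virtual node."""
        points = [p for p, _ in self.ring]
        idx = bisect_right(points, h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = VirtualNodeRing(["node1", "node2", "node3"])
print(ring.owner("user:42"))
```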
The timing for data partitioning
• Shard partitions at writes (inserts and deletes)
  – Split Pi when Size(Pi) ≥ Θmax
  – Merge Pi and Pi+1 when Size(Pi) + Size(Pi+1) ≤ Θmin
• Choosing Θmax ≥ 2Θmin avoids oscillation: the two halves of a freshly split partition cannot immediately merge back
[Figure: examples of insert-driven splits (B into B1, B2) and delete-driven merges (adjacent small partitions into M).]
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
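A sketch of the write-time check under these thresholds; the sizes, key ranges, and midpoint split are illustrative assumptions, and with Θmax ≥ 2Θmin a fresh split can never trigger an immediate merge:

```python
THETA_MAX = 64 * 2**20    # split threshold (64 MB, illustrative)
THETA_MIN = 16 * 2**20    # merge threshold; THETA_MAX >= 2 * THETA_MIN

def after_write(partitions, i):
    """Re-check partition i after an insert or delete.
    partitions: list of ((lo_key, hi_key), size_bytes), sorted by key."""
    (lo, hi), size = partitions[i]
    if size >= THETA_MAX:                   # too big: split at the midpoint
        mid = (lo + hi) // 2
        partitions[i:i + 1] = [((lo, mid), size // 2),
                               ((mid, hi), size - size // 2)]
    elif (i + 1 < len(partitions)
          and size + partitions[i + 1][1] <= THETA_MIN):
        (_, hi2), size2 = partitions[i + 1]  # both tiny: merge with neighbour
        partitions[i:i + 2] = [((lo, hi2), size + size2)]
```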
Sharding coordination
• Solution: election-based coordination
  – Step 1: Election — the nodes holding the partition are ordered in a sorted list (e.g. C, E, ..., A, ..., B) and a coordinator is chosen
  – Step 2: The coordinator enforces the split/merge and the data/node mapping
  – Step 3: The replicas finish the split/merge in turn (1st, 2nd, 3rd, 4th)
  – Step 4: The coordinator announces the outcome to all nodes
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
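The election step can be made deterministic so that every replica independently picks the same coordinator without extra messages; a sketch (the candidate ordering here is taken as given, standing in for the system's own sorted list):

```python
def elect_coordinator(candidates, is_alive):
    """Every node walks the same sorted candidate list and takes the
    first live entry, so all replicas agree on the coordinator."""
    for node in candidates:
        if is_alive(node):
            return node
    raise RuntimeError("no live replica to coordinate the shard operation")

candidates = ["Node-C", "Node-E", "Node-A", "Node-B"]   # the sorted list
# Node-C heads the list but is down, so Node-E becomes coordinator:
print(elect_coordinator(candidates, is_alive=lambda n: n != "Node-C"))
```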
Node failover during sharding
[Figure: flowcharts for node failure before, during, and after execution of a shard operation on Pi. Recoverable elements: failures are detected via gossip; a failed candidate is removed from the candidate list and re-appended if it resurrects; if a non-coordinator stays dead, its replicas are replaced and execution continues without it; if the coordinator fails, a timeout triggers election of a new coordinator, otherwise Pi is invalidated on the affected node.]
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
Evaluation Setup
• ElasCass: an implementation of auto-sharding built on Apache Cassandra (version 1.0.5), which itself uses the Split-Move approach
• Key-value stores compared: ElasCass vs. Cassandra (v1.0.5)
• Testbed: Amazon EC2, m1.large instances (2 CPU cores, 8 GB RAM)
• Benchmark: YCSB
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
Evaluation – Bootstrap Time
• Start from 1 node holding 100 GB of data (replication factor R = 2) and scale up to 10 nodes
• In Split-Move, the data volume transferred halves from 3 nodes onwards
• In ElasCass, the data volume transferred stays below 10 GB from 2 nodes onwards
• Bootstrap time is determined by the data volume transferred; ElasCass exhibits consistent performance at all scales
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
Conclusions
• We have designed and implemented a decentralised auto-sharding scheme that:
  – consolidates each partition replica into a single transferable unit, enabling efficient data movement;
  – automatically shards partitions into bounded ranges to address data skew;
  – reduces node bootstrap time, balances load better, and improves query-processing performance.
A Unified View of Elasticity (?)
Final Thoughts
• Elasticising application logic is done
  – How do we eliminate thresholds?
  – Should it be more autonomic?
• Application view of elasticity
  – Managing state is the big challenge
  – Decoupling of components (service-oriented model)
  – How would you scale interconnected components?