Dynamic Capacity Provisioning of Server Farm
Yuan Chen Senior Research Scientist Sustainable Ecosystems Research Group Hewlett Packard Laboratories
Resource Provisioning in Data Centers
Consume lots of power
• 100 billion kWh in 2011
• $7.4 billion
Have to meet SLAs
• SLA violations lead to revenue loss
• 20% traffic loss for an additional 500 ms delay in Google search
How to allocate servers to meet SLA requirements while minimizing power consumption?
Workload Traces
• SAP: an SAP deployment that hosts enterprise applications, such as customer relationship management applications, for small and medium-sized businesses
• VDR: a high-availability, multi-tier, business-critical HP application serving both external customers and HP users on six continents
• Web 2.0: a popular HP Web service application with more than 85 million registered users in 22 countries (over 10 million users daily)
Variability
(a) SAP (b) VDR (c) Web
Workload demand for a single day for (a) the SAP trace; (b) the VDR trace; and (c) the Web trace
Variability in the workload demands
Observation: The workload demands
typically have significant variability.
Implication: A single size cannot fit all, and
will result in either over-provisioning or
under-provisioning
Variability
(a) SAP (b) VDR (c) Web
Workload demand for two hours for (a) the SAP trace; (b) the VDR trace; and (c) the
Web trace
Observation: Workload demands can change abruptly during short intervals.
Implication: To handle such variations, a provisioning mechanism is
required at short time-scales.
Periodicity
Observation: The workloads exhibit prominent daily patterns
(a) SAP (b) VDR (c) Web
Time-series and periodogram for (a) the SAP trace; (b) the VDR trace; and (c) the Web trace.
Prior Work
Predictive [Krioukov, …, Culler, Katz '10] [Chen, He, …, Zhao '08] [Bobroff, Kochut, Beaty '07] [Chen, Das, …, Gautam '05] [Castellanos et al. '05]
• Every few hours
• Stable and easy
• Cannot react to changes
Reactive [Leite, Kusic, Mosse '10] [Nathuji, Kansal, Ghaffarkhah '10] [Wang, Chen '08] [Fan, Weber, Barroso '07] [Wood, Shenoy, … '07]
• Every few minutes
• Can react quickly
• Unstable and expensive
Observations
• Demand varies, but there are periodic patterns
• There will be deviations from these patterns
• Provisioning is not free; there are various associated costs and risks
– Turning servers on can take a significant amount of time and consume a lot of power
– "Wear and tear"
[Figure panels: SAP trace; VDR trace; Web trace]
Our Approach
Combines predictive and reactive control to allocate resources at multi-time scales
• Identify long-term sustained patterns --- a “base” workload
• Predictive provisioning proactively handles the estimated base workload (every few hours)
• Reactive provisioning handles any excess workload (every few minutes)
Hybrid Provisioning
[Figure: hybrid provisioning architecture. Components: Base Workload Predictor, Predictive Controller, Reactive Controller, Coordinator, Server pool 1, Server pool 2. Flow labels: historic workload traces, predicted base workload, live workload trace, actual workload, workload not exceeding base, excess workload, base provisioning, noise provisioning.]
Base Workload Prediction
1. Periodicity analysis
• Use Fast Fourier Transform (FFT)
2. Workload prediction
• Autoregressive model
3. Workload discretization
…
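The periodicity-analysis step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a textbook radix-2 FFT written in plain Python, and the function names (`fft`, `dominant_period`) and the synthetic demand series are assumptions made for the example.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_period(samples):
    """Return the period (in samples) of the strongest non-zero frequency."""
    n = len(samples)
    mean = sum(samples) / n
    spectrum = fft([s - mean for s in samples])  # subtract the mean (DC term)
    # Periodogram: power at frequency bins 1 .. n/2.
    power = [abs(c) ** 2 for c in spectrum[1 : n // 2 + 1]]
    k = power.index(max(power)) + 1  # bin with the most energy
    return n / k


# Hypothetical demand: 8 "days" of 32 samples each with a clean daily cycle.
demand = [100 + 40 * math.sin(2 * math.pi * t / 32) for t in range(256)]
period = dominant_period(demand)
```

On this synthetic series the strongest bin corresponds to a period of 32 samples, i.e. one "day". Real traces are noisier, so the peak is usually read off the periodogram rather than trusted blindly.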
Workload Discretization
– Discretize the demand into consecutive, disjoint time intervals, with a single representative demand value in each interval
– Given the demand time-series X on the domain [s, t], a time-series Y on the same domain is a workload characterization of X if [s, t] can be partitioned into n successive disjoint time intervals, {[s, t1], [t1, t2], ..., [tn-1, t]}, such that Y(j) = ri for all j in the i-th interval, [ti-1, ti].
– A good characterization should:
1. Keep the deviation from the actual demand small
2. Avoid having too many intervals
Workload Discretization
Optimization problem: minimize (Error + c × #Changes), where Error is the mean-squared error between the demand X and its characterization Y, and #Changes is the number of changes in Y
• For a given partition, setting ri to be the mean of the time-series values on that partition minimizes the mean-squared error
• The optimal solution for the domain [t0, tn] contains the optimal solution for the domain [t0, tn-1]
• The problem can therefore be solved with dynamic programming
• Pick the normalization constant c
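The dynamic program can be sketched as below. This is an illustrative version under stated assumptions: the error is the sum of squared deviations (not normalized by length), each interval after the first counts as one change, and the function name `discretize` and the sample demand are hypothetical.

```python
def discretize(x, c):
    """Piecewise-constant fit minimizing squared error + c * (number of changes).

    Returns a list of (start, end, level) tuples with end exclusive.
    """
    n = len(x)
    # Prefix sums give the squared error of any interval in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(x):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def sse(i, j):
        """Squared error of x[i:j] around its mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = [0.0] + [float("inf")] * n  # best[j]: optimal cost for x[:j]
    prev = [0] * (n + 1)               # prev[j]: start of the last interval
    for j in range(1, n + 1):
        for i in range(j):
            # Every interval after the first one costs one change.
            cost = best[i] + sse(i, j) + (c if i > 0 else 0.0)
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Walk back through prev to recover the intervals and their means.
    intervals, j = [], n
    while j > 0:
        i = prev[j]
        intervals.append((i, j, (s1[j] - s1[i]) / (j - i)))
        j = i
    return intervals[::-1]


demand = [10, 11, 10, 30, 31, 29, 30, 12, 11]  # hypothetical demand samples
steps = discretize(demand, c=5.0)
```

A larger `c` merges intervals and yields fewer provisioning changes; a smaller `c` tracks the demand more closely. The double loop is O(n²), which is acceptable for a day of samples at minute granularity.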
Different Discretization Techniques
[Figure panels: Mean, 90th percentile, Max; Mean/1hr, Mean/3hrs, Mean/6hrs; DP, SAX, K-means]
Dynamic Provisioning
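Predictive Controller
How to derive the base server provisioning from the predicted base workload?
$$\text{Num. servers}(t) = \frac{\text{arrival rate}(t)}{\dfrac{1}{\text{job size}} - \dfrac{1}{\text{response time}}}$$
• Response time: given by the SLA
• Job size: derived experimentally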
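As a rough sketch of the sizing rule, assume it reads servers(t) = arrival rate(t) / (1/job size − 1/response time), rounded up to a whole server. The function name, units and numbers below are assumptions made for illustration.

```python
import math


def servers_needed(arrival_rate, job_size, target_response_time):
    """Servers needed so the mean response time stays within the target.

    arrival_rate: requests per second for the whole farm
    job_size: mean service time of one request on one server, in seconds
    target_response_time: SLA response-time target, in seconds
    """
    if target_response_time <= job_size:
        raise ValueError("target must exceed the bare service time")
    # Request rate one server can absorb while still meeting the target.
    per_server_rate = 1.0 / job_size - 1.0 / target_response_time
    return math.ceil(arrival_rate / per_server_rate)


# Hypothetical numbers: 20 ms jobs, 100 ms SLA target, 390 requests/s.
n = servers_needed(arrival_rate=390.0, job_size=0.02, target_response_time=0.1)
```

With these numbers each server can absorb 40 requests/s, so 390 requests/s needs 10 servers. Tightening the response-time target toward the job size drives the per-server rate toward zero, which is why the guard rejects targets at or below the service time.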
Reactive Controller
• noise = actual − base
• Simple feedback model: noise(interval(t)) = noise(interval(t−1))
• Interval length = 10 mins
• Can use sophisticated control-theoretic models; not the focus of this work
• Noise provisioning, computed in real time:
$$\text{Num. servers}(t) = \frac{\text{arrival rate}(t)}{\dfrac{1}{\text{job size}} - \dfrac{1}{\text{response time}}}$$
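The simple feedback model can be sketched as follows: the noise observed in one interval is used as the estimate for the next. The function name, the flooring of negative noise at zero, and the fixed `per_server_rate` are assumptions for this example.

```python
import math


def reactive_plan(actual, base, per_server_rate):
    """Noise servers per interval, sized from the previous interval's excess."""
    plan = []
    last_noise = 0.0  # no history before the first interval
    for a, b in zip(actual, base):
        # Provision for this interval using the noise seen in the last one.
        plan.append(math.ceil(last_noise / per_server_rate))
        # noise = actual - base, floored at zero (assumption).
        last_noise = max(0.0, a - b)
    return plan


actual = [100.0, 140.0, 180.0, 120.0]  # hypothetical request rates per interval
base = [100.0, 100.0, 100.0, 100.0]    # predicted base workload
plan = reactive_plan(actual, base, per_server_rate=40.0)
```

Here the plan is `[0, 0, 1, 2]`: the controller always lags the demand by one interval, which is the price of this very simple model.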
Coordinator
Forwards incoming requests to one of the two server farms, based on the predicted base demand and the actual demand
– Load Balancing Dispatcher
• Load-balance the incoming requests among all servers of two server farms
– Priority Dispatcher
• Forward and load balance the job requests to the base provisioning servers as long as the request rate is below the forecasted base workload request rate
• Requests that exceed the forecasted request rate are forwarded to the reactive provisioning servers
• Isolate the base workload from the noise workload
• Provide stronger performance guarantees for the base workload
• Dispatch the important jobs to the (more robust) base workload server farm
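A priority dispatcher along these lines could look like the sketch below. It is an illustration under assumptions: the forecast is expressed as a request budget per interval, both pools use round-robin load balancing, and the class and method names are hypothetical.

```python
import itertools


class PriorityDispatcher:
    """Route requests to the base pool up to a per-interval budget."""

    def __init__(self, base_servers, reactive_servers):
        # Round-robin iterators over each (non-empty) pool.
        self._base = itertools.cycle(base_servers)
        self._reactive = itertools.cycle(reactive_servers)
        self._budget = 0
        self._seen = 0

    def start_interval(self, base_budget):
        """Reset the counter; base_budget is the forecast request count."""
        self._budget = base_budget
        self._seen = 0

    def route(self):
        """Return the server that should handle the next request."""
        self._seen += 1
        if self._seen <= self._budget:
            return next(self._base)   # within the forecast base workload
        return next(self._reactive)   # excess goes to the reactive pool


d = PriorityDispatcher(["b1", "b2"], ["r1"])
d.start_interval(base_budget=3)
routes = [d.route() for _ in range(5)]
```

The first three requests go to the base pool (`b1`, `b2`, `b1`) and the remaining two to `r1`, so the base servers never see more than the forecast load.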
Put It All Together
[Figure: hybrid provisioning architecture. Components: Base Workload Predictor, Predictive Controller, Reactive Controller, Coordinator, Server pool 1, Server pool 2. Flow labels: actual workload, workload not exceeding base, excess workload.]
Trace-driven Simulation
Provision resources for a single-tier Web server farm
Different policies:
• Predictive: 24 hour, 6 hour, 4 hour and variable length
• Reactive: 10 minutes
• Hybrid: predictive (fixed length and variable length) plus reactive
Metrics:
• Percentage of SLA violations
• Power consumption
• Number of provisioning changes
Results for SAP Trace
Trace-based analysis results for the SAP trace showing the
SLA violations, power consumption and number of
provisioning changes.
Time-series for the demand, SLA violations, power consumption and the
number of servers for the SAP trace.
Results for Web 2.0 and World Cup Traces
Trace-based analysis results for the Web trace showing the SLA
violations, power consumption and number of provisioning
changes.
Trace-based analysis results for the World Cup 98 trace showing the
SLA violations, power consumption and number of provisioning
changes.
Experimental Setup
• 10-server test bed (web server farm)
• Multiple workload traces (SAP, VDR, Web)
• Workload generator (httperf) + load balancer (Apache) + back-end servers
• Provisioning strategies:
– Predictive (every 1 hr)
– Reactive (every 10 mins)
– Hybrid (base provisioning + noise provisioning)
Experiment Results
[Figure panels: Web trace; VDR trace]
• Hybrid reduces response times by as much as 40% compared to Predictive.
• Hybrid provides better response times than Predictive and Reactive, and invokes fewer changes than Reactive.
Conclusion
• Hybrid (Predictive + Reactive) server provisioning
• Good performance, low power consumption, and very few changes across various traces
• It is important to have a good "base workload": use dynamic programming
References
1. Anshul Gandhi, Yuan Chen, Daniel Gmach, Martin Arlitt, and Manish Marwah. Minimizing Data Center SLA Violations and Power Consumption via Hybrid Resource Provisioning. Proceedings of the Second International Green Computing Conference (IGCC 2011), July 2011.
2. Anshul Gandhi, Yuan Chen, Daniel Gmach, Martin Arlitt, and Manish Marwah. Hybrid Resource Provisioning for Minimizing Data Center SLA Violations and Power Consumption. Journal of Sustainable Computing: Informatics and Systems (SUSCOM), 2012. (Extended version of the IGCC paper.)