View
222
Download
1
Category
Preview:
Citation preview
DEMAND FORECAST, RESOURCE
ALLOCATION AND PRICING FOR
MULTIMEDIA DELIVERY FROM THE CLOUD
by
Di Niu
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy,
Department of Electrical and Computer Engineering,
at the University of Toronto.
Copyright
c� 2013 by Di Niu.
All Rights Reserved.
Demand Forecast, Resource Allocation and Pricingfor Multimedia Delivery from the Cloud
Doctor of Philosophy ThesisEdward S. Rogers Sr. Dept. of Electrical and Computer Engineering
University of Toronto
by Di NiuMay 6, 2013
Abstract
Video tra�c constitutes a major part of the Internet tra�c nowadays. Yet most video de-
livery services remain best-e↵ort, relying on server bandwidth over-provisioning to guar-
antee Quality of Service (QoS). Cloud computing is changing the way that video services
are o↵ered, enabling elastic and e�cient resource allocation through auto-scaling. In
this thesis, we propose a new framework of cloud workload management for multimedia
delivery services, incorporating demand forecast, predictive resource allocation and qual-
ity assurance, as well as resource pricing as inter-dependent components. Based on the
trace analysis of a production Video-on-Demand (VoD) system, we propose time-series
techniques to predict video bandwidth demand from online monitoring, and determine
bandwidth reservations from multiple data centers and the related load direction policy.
We further study how such quality-guaranteed cloud services should be priced, in both
a game theoretical model and an optimization model.Particularly, when multiple video
providers coexist to use cloud resources, we use pricing to control resource allocation
ii
in order to maximize the aggregate network utility, which is a standard network util-
ity maximization (NUM) problem with coupled objectives. We propose a novel class of
iterative distributed solutions to such problems with a simple economic interpretation
of pricing. The method proves to be more e�cient than the conventional approach of
dual decomposition and gradient methods for large-scale systems, both in theory and in
trace-driven simulations.
iii
Dedicated to my parents
iv
Acknowledgments
First and foremost, I would like to sincerely thank my thesis supervisor, Professor
Baochun Li, for his invaluable guidance throughout my Ph.D. studies. I deeply appre-
ciate his strict training and precious advice in all aspects of my academic development,
which are life-long treasures for myself. I feel very fortunate to have Professor Li as my
supervisor, and cherish all the opportunities to explore my favourite research topics and
exert my own strengths. These would not have been possible without the unwavering
support of Professor Li. The dedication, diligence and enthusiasm he has for an aca-
demic career was contagious and motivational for me, especially during tough times in
my Ph.D. pursuit. I am also thankful for the excellent example Professor Li has provided
as a successful scholar and professor in computer engineering and science.
The members of the iQua research group have contributed immensely to both my
personal and professional time at the University of Toronto. The group has been a
source of good advice and collaboration as well as friendships. I am especially grateful
for the valuable discussions and collaborations with Chen Feng, Hong Xu and Zimu
Liu. Through the projects we worked on and the papers we coauthored, I very much
appreciated their enthusiasm, intelligence, and amazing abilities in mathematical and
experimental studies. I would like to thank all the other colleagues in the iQua research
group, for precious friendships that have supported me throughout the past four years:
Jin Jin, Wei Wang, Yuefei Zhu, Yiwei Pu, Yuan Feng, Junqi Yu, and Elias Kehdi. I
would also like to acknowledge the visiting students in our group: Fangming Liu, Anh
Tuan Nguyen, Zhi Wang, and Boyang Wang for the numerous nights we spent together
in the lab. Other past group members that I have had the pleasure to work alongside
of are graduate students Mea Wang, Chuan Wu, and Hassan Shajonia; visiting scholars
v
Hui Wang and Ruixuan Li; and the numerous summer and rotation students who have
come through the lab.
Lastly, I would like to thank my parents for raising me with a passion for science and
engineering, and for their love and encouragement in all my pursuits. Thank you!
Di Niu
University of Toronto
May 6, 2013
vi
Contents
Abstract iii
iv
Acknowledgments v
List of Tables xii
List of Figures xviii
1 Introduction 1
1.1 Challenges in Multimedia Delivery Systems . . . . . . . . . . . . . . . . . 1
1.2 In the Era of Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Related Work 10
2.1 Measurement, Prediction and Learning in VoD Systems . . . . . . . . . . 10
2.2 Cloud Workload Management and Resource Allocation . . . . . . . . . . 13
2.3 Cloud Resource Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
vii
CONTENTS CONTENTS
2.4 Algorithms for Distributed and Parallel Optimization . . . . . . . . . . . 16
3 Forecast in Video-on-Demand Systems 18
3.1 Monitoring UUSee Video-on-Demand System . . . . . . . . . . . . . . . . 20
3.2 Some Prediction Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Modeling and Forecasting the Demand . . . . . . . . . . . . . . . . . . . 24
3.3.1 Population Prediction via Seasonal ARIMA Processes . . . . . . . 27
3.3.2 Forecasting the Aggregate Bandwidth Demand . . . . . . . . . . . 32
3.3.3 Inferring Initial Demand using Mixtures of Gaussians . . . . . . . 34
3.4 Predicting Peer Contribution . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Applications in a P2P System . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Volatility Prediction and Dimension Reduction 48
4.1 Modeling Demand Volatility . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Volatility Estimation and Resource Allocation . . . . . . . . . . . . . . . 52
4.2.1 Comparing Five Volatility Models . . . . . . . . . . . . . . . . . . 53
4.2.2 Utilization as a Volatility Indicator . . . . . . . . . . . . . . . . . 56
4.3 Reducing Volatility through Diversification . . . . . . . . . . . . . . . . . 57
4.4 Dimension Reduction using PCA . . . . . . . . . . . . . . . . . . . . . . 61
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Cloud Bandwidth Reservation for VoD Applications 70
5.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Load Direction and Bandwidth Reservation . . . . . . . . . . . . . . . . 75
5.2.1 The Optimal Load Direction . . . . . . . . . . . . . . . . . . . . . 77
viii
CONTENTS CONTENTS
5.2.2 Suboptimal heuristics with Limited Replication . . . . . . . . . . 81
5.3 Experiments based on Real-World Traces . . . . . . . . . . . . . . . . . . 83
5.3.1 A Channel Interleaving Scheme . . . . . . . . . . . . . . . . . . . 84
5.3.2 Algorithms for Comparison . . . . . . . . . . . . . . . . . . . . . 85
5.3.3 Assumption Validation . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.4 Predictive Auto-Scaling vs. Reactive Provisioning . . . . . . . . . 88
5.3.5 Theoretical Optimal vs. Replication-Limited Heuristics . . . . . . 91
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6 The Bandwidth Reservation Market 94
6.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.1.1 The Cloud and Tenants . . . . . . . . . . . . . . . . . . . . . . . 98
6.1.2 The Broker and Load Direction Matrix . . . . . . . . . . . . . . . 99
6.1.3 Bandwidth Pricing in the Presence of a Broker . . . . . . . . . . . 101
6.2 Main Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Optimal Bandwidth Saving . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3.1 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4 Pricing Region in Controlled Markets . . . . . . . . . . . . . . . . . . . . 108
6.4.1 Broker Profit and Price Discounts in the Pricing Region . . . . . 113
6.5 Pricing in Free Markets: A Game Theoretical Analysis . . . . . . . . . . 115
6.5.1 Nash Equilibrium: Existence and Uniqueness . . . . . . . . . . . . 116
6.5.2 Market Properties and Equilibrium Bandwidth Costs . . . . . . . 120
6.6 Trace-Driven Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
ix
CONTENTS CONTENTS
7 Algorithmic Cloud Bandwidth Pricing 126
7.1 A New Tenant-Cloud Agreement . . . . . . . . . . . . . . . . . . . . . . 128
7.2 Pricing Towards Social Welfare Maximization . . . . . . . . . . . . . . . 132
7.2.1 An Equivalent Pricing Problem . . . . . . . . . . . . . . . . . . . 134
7.2.2 No Multiplexing vs. Multiplexing All . . . . . . . . . . . . . . . . 138
7.2.3 Economic Implications . . . . . . . . . . . . . . . . . . . . . . . . 140
7.3 Distributed Optimization with Coupled Objectives . . . . . . . . . . . . 141
7.4 A Pricing Framework for Distributed Optimization . . . . . . . . . . . . 143
7.5 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.5.1 Dual Decomposition and Gradient Methods . . . . . . . . . . . . 145
7.5.2 New Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.5.3 Relationship to the Nonlinear Jacobi Algorithm . . . . . . . . . . 150
7.6 Convergence Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.6.1 Contraction Mapping . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.6.2 Convergence of Algorithm 1 . . . . . . . . . . . . . . . . . . . . . 153
7.6.3 Convergence of Algorithm 2 . . . . . . . . . . . . . . . . . . . . . 158
7.6.4 Convergence of Gradient Methods for Dual Problem . . . . . . . . 160
7.6.5 Quadratic Programming: Convergence Speed Comparison . . . . 162
7.6.6 Handling General Coupled Objectives . . . . . . . . . . . . . . . . 165
7.7 Distributed Solutions to Cloud Pricing . . . . . . . . . . . . . . . . . . . 168
7.7.1 Equation Update . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.8 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
x
CONTENTS CONTENTS
8 Concluding Remarks 180
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
8.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Bibliography 185
xi
List of Tables
3.1 Frequently used notations (for a particular channel). . . . . . . . . . . . . 21
3.2 Prediction of population at the initial stage based on mixtures of Gaussians
for flash-crowd video channels. . . . . . . . . . . . . . . . . . . . . . . . 36
5.1 The performance of five schemes averaged over each test period, in terms
of QoS, resource utilization, and replication overhead. . . . . . . . . . . . 89
xii
List of Figures
1.1 Bandwidth auto-scaling with quality assurance, as compared to provision-
ing for the peak demand and pay-as-you-go. . . . . . . . . . . . . . . . . 5
3.1 The evolution of several time series in a typical flash crowd video channel
(with channel ID: A55F) since the channel was released. Each time unit
represents 10 minutes. All rates are average quantities for the channel.
The video bit rate is R = 79.6 KB/s. . . . . . . . . . . . . . . . . . . . . 24
3.2 Population evolution in steady-state channels and various kinds of flash
crowd channels. The samples of steady-state channels start from 2008-
08-08 14:51:58. The samples of flash-crowd channels start from the time
that the corresponding video is released. The video release times are:
channels 4872 and B830, before 2008-08-08 14:51:58; channel F190, 2008-
08-17 00:21:03; channel FA5C, 2008-08-19 01:47:20; channels 7A40 and
E856, 2008-08-12 20:57:34; channel 2E0B, 2008-08-17 23:43:43; channel
D55A, 2008-08-09 00:03:26. . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 The power spectrum of online population Nt
for various channels. To
better view periodicity, the x axis is set to be the period measured in
timestamps, which is the inverse of frequency. . . . . . . . . . . . . . . . 27
xiii
LIST OF FIGURES LIST OF FIGURES
3.4 One day-ahead population prediction based on seasonal ARIMA models
for three di↵erent channels. The model is trained based on the data of
first few days. Prediction for Nt
only makes use of the observations {N⌧
:
0 < ⌧ t � 144} up to time t � 144. The two flash-crowd channels have
di↵erent daily decreasing of Nt
. The validation and training sets overlap
for 1 day, since in a seasonal model, the predictor requires the first 144
validation samples to be the initial input. . . . . . . . . . . . . . . . . . 31
3.5 The autocorrelation function (ACF) and partial autocorrelation function
(PACF) of N(t) = r144Nt
for the training data of channel B830. . . . . . 32
3.6 10-minutes-ahead prediction for the aggregate bandwidth demand Dt
in
a popular video channel A55FF released at time period 264, compared
against the trace data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7 Prediction of initial demand using a mixture of Gaussians. The fitting
function is trained with the first 360 population samples in channel 8331
released on 2008-08-10 19:46:29. Test data 1 is from channel 57F7 released
on 2008-08-18 20:16:35, and test data 2 is from channel EDAF released
on 2008-08-18 20:49:39. R-square1 = 0.5864, R-square2 = 0.1374, RMSE1
=74.44, RMSE2 = 273.55. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.8 Fitting AR(10) to {ut
} of the first 9 days in channel A55F, and making 2.5
hour (15-step) ahead prediction for the rest of the days. The prediction
for each ut
is conditioned on {u⌧
; 0 < ⌧ t � 15}. . . . . . . . . . . . . . 39
xiv
LIST OF FIGURES LIST OF FIGURES
3.9 A fluid chart illustrating the dependency of ct
on Nt
through the underlying
system dynamics. The white arrows denote the transition rates, while the
black arrows denote the dependencies. N c
t
is the number of cached peers
and uc
(t) is the average upload rate of each cached peer at time t. . . . . 39
3.10 The h-step prediction of average receiving rate ct
and total receiving rate
ct
Nt
from cached peers in channel A55F. Prediction for ct
or ct
Nt
only
makes use of the observations of N⌧
and c⌧
up to time t � h. We choose
d1 = d2 = 0, p = 54, q = 0 in the experiments. The validation and training
sets overlap for 1 day, since in a seasonal model, the predictor requires the
first 144 validation samples to be the initial input. . . . . . . . . . . . . . 42
3.11 Progressive model learning and 2.5 hours ahead (15-step) prediction for
the server bandwidth required by the peers in channel A55F over a 16-day
period, compared with the real server bandwidth requirement calculated
from the traces, and the actual server bandwidth used by the channel in the
traces. The channel is released on 2008-08-10 10:47:39, and the predicted
values start from 2008-08-11 16:47:39 (timestamp 181). The predictor is
retrained with the most recent data on a daily basis. . . . . . . . . . . . 46
3.12 Progressive model learning and the estimation for the average receiving
rate rt
of playing peers in channel A55F over a 12-day period. There is a
2.5-hour (15-step) delay in data collection. A comparison with the real rt
and the server rate st
allocated to each peer is provided. . . . . . . . . . 47
4.1 Online bandwidth demand monitoring and reservation. The reserved band-
width should match the sum of the expected demand and a risk premium
that tolerates volatility. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
xv
LIST OF FIGURES LIST OF FIGURES
4.2 The ACFs of the disturbance Zt
and Z2t
obtained from fitting the seasonal
ARIMA model (3.6) with d = 0 and p = q = 20 to bandwidth demand
{Dt
} in video channel A55FF. . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 One-step-ahead forecast errors for bandwidth demand Dt
in video channel
A55FF, and predicted conditional standard deviations for the forecast errors. 52
4.4 The empirical CDF of the utilization of booked bandwidth and the ratio
of time periods where the booked bandwidth is insu�cient. Bandwidth
reservation is performed in 169 popular video channels independently, each
with a test period of 2 days. . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Bandwidth reservation with “probabilistic GARCH” in channel A55FF,
tested on a period of 3 days. . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6 The mean and standard deviation of U when di↵erent numbers of video
channels are randomly combined for bandwidth reservation. The achieved
average e is less than 4% in all cases. . . . . . . . . . . . . . . . . . . . . 59
4.7 Bandwidth reservation with “probabilistic GARCH” for the mixed tra�c
of all the 93 channels in the 6-day period from time 264 to 1127. The first
3 days are the training periods, and the other 3 days (time 697-1127) are
the test periods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.8 Bandwidth consumption time series of 5 representative channels over a
2.5-day period. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.9 The first 3 principal components C1—C3 in bandwidth consumption series
of 468 channels during 2 days via principal component analysis. . . . . . 63
4.10 Variance explained in bandwidth consumption series of 468 channels in 2
days. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
xvi
LIST OF FIGURES LIST OF FIGURES
4.11 Data projected onto the first 2 principal components. Each point is one
of the 468 channels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.12 Root mean squared errors (RMSEs) of the two prediction methods over
test period 1680-1860 (1.25 days), compared to mean bandwidth consump-
tion in each channel. The three largest channels are channels 295, 241 and
317. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.13 10-minute-ahead conditional mean prediction in channel 172 over a test
period of 1.25 days. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.14 The departure of actual bandwidth consumption from its conditional mean
forecast and the predicted standard deviation of bandwidth consumption
in channel 172. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.15 Q-Q plot of conditional mean forecast errors in channel 172 over the test
period 1680-1860 (1.25 days), in reference to Gaussian quantiles. . . . . 68
5.1 The system decides the bandwidth reservation from each data center and
a matrix W = [wsi
] every �t minutes, where wsi
is the proportion of video
channel i’s requests directed to data center s. DC: data center. . . . . . . 72
5.2 Using demand correlation between channels, we can save the total band-
width reservation, even within each 10-minute period, while still providing
quality assurance to each channel. DC: data center. . . . . . . . . . . . . 73
5.3 The conditional mean demand prediction for virtual new channel 11, with
a test period of 1.5 days from time 1585 to 1800. . . . . . . . . . . . . . . 85
5.4 QQ plot of innovations for t =1562—1640 vs. normal distribution. . . . . 87
xvii
LIST OF FIGURES LIST OF FIGURES
5.5 Predictive vs. reactive bandwidth provisioning for a typical peak period
702–780. There are 35 data centers available, each with capacity 300
Mbps, and 91 channels, including 52 popular channels, 24 small channels,
15 non-zero new channels. K = 2, k = 10. . . . . . . . . . . . . . . . . . 88
5.6 Workload portfolio selection vs. random load direction for a typical peak
usage period from time 1562 to 1640. K = 2, k = 10. . . . . . . . . . . . 92
6.1 A system of two cloud providers and two tenants. Random variables are
labeled with “r.v.”. The data centers are possibly owned by di↵erent cloud
providers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2 The region of Pi
(·) in a good pricing policy {Pi
(·)}. Pi
(·) is between P ⇤i
(·)
and P 0i
(·), and satisfies P 0i
(1) � P ⇤0i
(1). . . . . . . . . . . . . . . . . . . . 113
6.3 The aggregate bandwidthP
s
As
(t) booked by the broker, compared to the
aggregate bandwidthP
i
Bi
(t) needed if each channel books individually
and the real aggregate demandP
i
Di
(t). . . . . . . . . . . . . . . . . . . 123
6.4 The histogram of payment discounts of all channels at all times in equi-
librium, i.e., the histogram of 1 � P ⇤i
(1, t)/P 0i
(1, t) for all i and t. . . . . . 124
7.1 Iterative updates of prices and guaranteed portions. . . . . . . . . . . . 135
7.2 Behavior of Equation Update in a one dimensional case. “o” represents
the starting point (w(0)i
, k(0)i
). . . . . . . . . . . . . . . . . . . . . . . . . 172
7.3 CDF of convergence iterations in all 81 10-minute periods. . . . . . . . . 176
7.4 Final objective values SW(x⇤) (expected social welfare) produced by dif-
ferent algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
7.5 The evolution of kw(t) � w(t � 1)k1 in the first experiment. . . . . . . . 178
xviii
Chapter 1
Introduction
1.1 Challenges in Multimedia Delivery Systems
Video delivery systems, such as Youtube, Netflix, Hulu, etc., have gained unprecedented
popularity on the Internet nowadays. According to “Cisco Visual Networking Index:
Forecast and Methodology,” globally, Internet video tra�c will be 55% of all consumer
Internet tra�c in 2016, up from 51% in 2011. This does not include video exchanged
through peer-to-peer (P2P) file sharing. Most Internet users watch live and on-demand
videos through a web-browser and http-based video streaming, others may watch videos
through client software downloadable from the video service provider. Other video de-
livery systems include on-line video gaming services such as OnLive [7].
Despite the popularity of Internet video and the ever-increasing demand for improved
video quality, most Internet video services remain best-e↵ort systems. Since video flows
are delay-sensitive, to guarantee the Quality of Experience (QoE) for an end-user, the
video must be delivered from media servers to the end-user at a rate no less than the
video bit rate (at least in the long run). However, QoE is usually not guaranteed in
1
1.2. IN THE ERA OF CLOUD COMPUTING 2
current Internet VoD systems, mainly due to the bounded egress bandwidth from video
servers.1 Most video service providers over-provision the bandwidth capacity of their
streaming servers to provide quality assurance. However, over-provisioning is costly and
even ine↵ective sometimes, since a large amount of server capacity is unused during non-
peak hours, whereas in the event of a flash crowd, when a large number users join the
system, the provisioned capacity may not even be su�cient.
Certain VoD systems adopt a Peer-to-Peer (P2P) architecture (e.g., CoolStreaming
[47], PPLive [8], UUSee [9]), where end-users can help the servers deliver video content
to each other. While leveraging user upload bandwidth can alleviate the burden on
media servers to some extent, the user resources are not dedicated and their contribution
is not reliable. As a result, most P2P video systems are in fact peer-assisted systems,
where the servers still play a major role in streaming, and thus face the same issue of
how much server capacity should be provisioned. Because of the above reasons, in order
to guarantee the QoE, what is missing from today’s video delivery systems is a refined
scheme to accurately predict the online user demand together with a flexible allocation
mechanism that can economically vary the resource provisioning over time.
1.2 In the Era of Cloud Computing
Cloud computing delivers Infrastructure as a Service (IaaS) that integrates computation,
storage and network resources in a virtualized environment. It represents a new business
model where applications as tenants of the cloud can dynamically reserve instances on
1Due to the practice of over-provisioning, Quality-of-Service (QoS) is usually guaranteed at the coreof the Internet, with very small losses and delays.
1.2. IN THE ERA OF CLOUD COMPUTING 3
demand. Cloud computing is changing the way that multimedia content providers op-
erate their businesses, including video-on-demand (VoD) and online gaming companies.
Traditionally, video companies invest in commodity servers as business grows and acquire
monthly bandwidth deals from Internet Service Providers (ISPs). Nowadays, they can be
freed from the complexity of hardware maintenance and network administration, relying
on the cloud for service.
One particular example is Netflix [6], which moved its streaming servers, encoding
software, search engines, and huge data stores to Amazon Web Services (AWS) in 2010
[12, 13]. Moreover, many interactive gaming companies, e.g., OnLive [7], also depend on
the cloud to stream their content to online players.
One of the most important economic appeals of cloud computing is its elasticity
and “auto-scaling” in resource provisioning. As has been pointed out, to accommodate
the peak workload, over-provisioning is traditionally a common practice, leading to low
resource utilization during non-peak hours. In contrast, in the cloud, the number of
computing instances launched can be changed adaptively at a fine granularity with a
lead time of hours or even minutes. This converts the up-front capital infrastructure
investment to operating expenses charged by cloud providers. Since the cloud’s auto-
scaling ability enhances resource utilization by closely matching supply with demand,
overall expenses of the enterprise may be reduced.
However, unlike web servers or scientific computing, VoD is a network-bound service
with stringent bandwidth requirements. A major risk to the video applications using
cloud services is that unlike CPU and memory, bandwidth is not guaranteed in current-
generation cloud platforms (e.g., Amazon EC2), leading to unpredictable network perfor-
mance [17, 63]. A lack of bandwidth guarantee to some degree impedes cloud adoption by
1.3. THESIS SCOPE 4
applications that require such guarantees, such as video-on-demand (VoD) applications
[12] and transaction processing web applications [45]. The utility of tenants running
these applications depends not only on the bandwidth usage, but more importantly on
how many of their end-user requests are served with guaranteed performance.
With an ever-increasing demand for performance predictability, a recent trend in
networking research is to augment cloud computing to explicitly account for network
resources. In fact, data center engineering techniques have been developed to expand the
tenant-cloud interface to allow bandwidth reservation for tra�c flowing from a virtual
machine (VM) in the cloud to the Internet [19, 39]. We envision that in future cloud
platforms, bandwidth reservation will be a value-added feature that attracts tenants,
such as video providers, who seek bandwidth guarantees.
1.3 Thesis Scope
In this thesis, we aim to leverage the auto-scaling ability of the cloud to save the op-
erational cost for multimedia delivery services, while providing quality assurance. Since
bandwidth, as compared to CPU and memory, is a major factor that a↵ects the video
delivery cost and quality, we focus on bandwidth auto-scaling. The benefits of bandwidth
auto-scaling can be intuitively envisioned. As shown in Fig. 1.1(a), traditionally, a VoD
provider acquires a monthly plan from ISPs, in which a fixed bandwidth capacity, e.g., 1
Gbps, is guaranteed to accommodate the anticipated peak demand. As a result, resource
utilization is low during non-peak hours and demand troughs. Alternatively, a usage-
based pay-as-you-go model is adopted by current cloud providers as shown in Fig. 1.1(b),
where a VoD provider pays for the total amount of bytes transferred. However, the band-
width capacity available to the VoD provider is subject to variation due to contention
1.3. THESIS SCOPE 5
Band
wid
th
Demand
Capacity
Days(a) Provisioning for peak demand
1 20 Days
Demand
Capacity
(b) Pay as you go1 20 Days
DemandCapacity
(c) Auto-scaling1 20
Figure 1.1: Bandwidth auto-scaling with quality assurance, as compared to provisioningfor the peak demand and pay-as-you-go.
from other applications, incurring unpredictable performance and QoS issues. Fig. 1.1(c)
illustrates bandwidth auto-scaling and reservation to match demand with su�cient re-
sources, leading to both high resource utilization and quality guarantees. Apparently,
the more frequently the rescaling happens, the more closely resource supply will match
the demand.
The topics that will be covered in this thesis include demand prediction, resource
allocation based on prediction, as well as the economic issues related to cloud resource
pricing for video providers. Our contributions can mainly be divided into the following
four parts:
Demand Forecast. Accurate demand forecast serves as the first step towards a
flexible resource allocation scheme to replace traditional capacity over-provisioning. Tar-
geting multimedia applications, we employ a set of established statistical learning and
time series analysis techniques to automatically forecast demand and infer performance
in large-scale VoD applications. Our work is backed up by 400 GB of operational traces
collected from UUSee2, a major commercial VoD provider based in China. Although
many measurement studies of VoD systems exist in the literature, few practical workload
2Although UUSee adopts a peer-assisted video streaming architecture, our collected data containsthe download bandwidth of all online users sampled every 10 minutes, which reflects user bandwidthdemands regardless of the service architecture adopted.
1.3. THESIS SCOPE 6
forecast mechanisms have been developed for these systems. We utilize the daily peri-
odicity and trends inherent in historical workload data to predict the future population
and bandwidth demand in video services, study the demand volatility, and use a factor
model to reduce the dimension and complexity involved in statistical learning.
Cloud Resource Allocation. Based on such demand forecast, we have developed
resource allocation mechanisms for data centers in the cloud to satisfy video application
requirements with the minimum cost. We study a new type of service where companies
like Netflix can make reservations for (outgoing) bandwidth guarantees from the cloud.
The objective is to reserve as little bandwidth as possible while still guaranteeing the
video delivery quality. We have proposed an auto-scaling system that can adjust resource
allocation for each cloud tenant (a video company or channel) to closely match its short-
term demand prediction, accommodating both the estimated demand expectation and
unexpected demand variation. Borrowing insights from financial risk management, we
have proposed a mean-variance optimization framework to statistically mix negatively
correlated (anti-correlated) workloads to save the bandwidth reservation cost. When
multiple data centers coexist in the cloud, we have derived the optimal direction policy
of tenant demands to di↵erent data centers, proposed practical heuristic solutions and
evaluated our methods through extensive trace-driven simulations.
The Bandwidth Reservation Market. Based on demand forecast and the pro-
posed workload management mechanisms, we further propose new pricing models for
cloud bandwidth resources. The current usage-based cloud pricing or pay-as-you-go is
insu�cient for multimedia applications that require bandwidth guarantees. We study
the pricing of cloud bandwidth reservations for this kind of tenant. Since tenants may
lack expertise in demand estimation, we propose a new business model in which a tenant
1.3. THESIS SCOPE 7
just needs to specify a percentage of its demand, called the guaranteed portion, to be
satisfied with guaranteed service, while the cloud takes the responsibility to fulfill the
guarantee. Leveraging abundant computing power and workload monitoring, the cloud
can predict tenant demands and make corresponding reservations of actual bandwidth
for the tenants.
Under the assumptions that each tenant always chooses a guaranteed portion of 100%
(i.e., each video provider as a cloud tenant always requires the cloud to guarantee all its
end-user requests) and that pricing cannot a↵ect such choices, we have made the following
contributions. First, we characterize the unique Nash equilibrium in the bandwidth
reservation market, where the price that each tenant is willing to pay is a function of its
tra�c burstiness and its demand correlation to the aggregate market demand. Second,
when multiple cloud providers coexist, we propose a profitable cloud brokerage service
similar to “Groupon,” which sells bandwidth guarantees to tenants and jointly reserves
bandwidth from the clouds for the mixed demand while directing requests to di↵erent
cloud providers. We analytically characterize the broker pricing region in which a profit-
driven broker helps enhance cloud resource utilization and lower bandwidth prices. Third,
we develop a proof-of-concept bandwidth trading system driven by real-world demand
traces, incorporating demand estimation, workload management and pricing as the key
components.
Algorithmic Pricing for Cloud Resources. We further consider the case that
a tenant’s choice of guaranteed portion may be a↵ected by the pricing policy and may
not always be 100%. For example, a video provider may require the cloud to guarantee
only 95% of all its end-user requests if such bandwidth guarantees are costly. In this
model, our objective is to find the optimal bandwidth pricing policy to maximize the
1.4. THESIS ORGANIZATION 8
expected social welfare, that is the sum of the profit of all the entities in the system
including the tenants and the cloud provider. The problem is related to network utility
maximization (NUM) with a coupled objective function, which is traditionally solved
through Lagrangian dual decomposition and gradient/subgradient methods. However,
due to the large scale of our algorithmic pricing problem, the gradient method usually
achieves slow convergence because of its over-sensitivity to stepsize choices. In order
to e�ciently solve such a large-scale distributed optimization problem within a small
number of iterations, we propose a novel class of nonlinear distributed and iterative
algorithms, which achieve faster convergence than gradient methods both in theory and
in trace-driven simulations.
Besides bandwidth, bu↵ering and jitter handling are also important in video flow
management. With bu↵ering on the user side, more data blocks in a particular video flow
can be prefetched and stored when bandwidth is abundant, so that the playback can rely
on the bu↵er when bandwidth conditions degrade. For example, the applications (e.g.,
Netflix) that use HTML5, Microsoft Silverlight and HTTP-based streaming all perform
bu↵ering. In this thesis, we do not consider these control mechanisms for individual video
flows. Instead, we focus on satisfying all the end users of a video provider or a video
channel as a whole in the long run instead of individual video flows. Nevertheless, any
individual flow management and bu↵ering scheme can be combined with our cloud-based
video workload management mechanism to achieve good video delivery performance.
1.4 Thesis Organization
The remainder of this thesis is organized as follows. The related literature is reviewed in
Chapter 2. We present e↵ective time-series methods for demand forecast in Chapter 3,
1.4. THESIS ORGANIZATION 9
and study the demand volatility estimation and the dimensionality reduction problem
involved in the statistical learning in Chapter 4. In Chapter 5, we study the problem of
cloud bandwidth allocation and reservation for video-on-demand applications based on
prediction, together with the related load direction problem when multiple data centers
coexist. We then study the bandwidth pricing issues in a broker-assisted cloud band-
width reservation market in Chapter 6, under the assumption that pricing will not a↵ect
tenant demands. In Chapter 7, we further study the algorithmic cloud bandwidth pric-
ing problem when pricing can a↵ect tenant demands, and solve the underlying utility
maximization problem with e�cient new distributed algorithms.
Chapter 2
Related Work
2.1 Measurement, Prediction and Learning in VoD
Systems
Video-on-Demand systems have gained enormous popularity on today’s Internet. Some
examples of production systems include Netflix [6], Hulu [3], Justin.tv [4], Youtube [10],
Metacafe [5], etc. Since the inception of using a mesh-like P2P architecture as a solution
to live media streaming, exemplified in CoolStreaming [75], peer assistance has also
been introduced into video-on-demand services to increase system scalability. Significant
research e↵orts [35], [41], [42], [50], [52] have been devoted to the measurements of video
streaming systems, with a focus on the user behavior, scalability and system performance.
A number of coding and ad hoc optimization techniques [15], [31], [50] have also been
proposed to enhance their performance.
The importance of bandwidth demand estimation to capacity planning in Internet
VoD systems has been recognized recently. It is shown that estimating time-varying
10
2.1. MEASUREMENT, PREDICTION AND LEARNING IN VOD SYSTEMS 11
demands in a large-scale IPTV network can help the system optimally place content on
its geographically distributed servers [16]. Toward this goal, the recent demand history
is used as an estimate of future demand in each video channel [16] and demands for
previously released movies are used as the demand prediction for newly released movies.
Apparently, this simple method does not yield accurate forecasts. Since measurements
show that video workload demonstrates regular diurnal periodicity [16, 59, 70, 72], various
techniques have recently been proposed to forecast large-scale VoD tra�c. An Autore-
gressive Integrated Moving Average (ARIMA) model is introduced in [70] to predict
the population evolution in video streaming systems. To account for diurnal e↵ects,
seasonal ARIMA models have been introduced in our previous work [58, 59] to predict
non-stationary demand evolution at a fine granularity.
However, most existing forecast methods for video demand assume a constant fore-
cast error variance, and thus provision a constant amount of resource cushion for quality
assurance. In fact, our measurements of the UUSee system show that bandwidth de-
mand is subject to rapid changes in some periods, while remaining tranquil and highly
predictable in other periods. We therefore introduce Generalized Autoregressive Condi-
tional Heteroskedastic (GARCH) models [24, 28, 33] from econometrics research to model
the volatility persistence phenomenon — the bandwidth demand at a certain time period
tends to exhibit similar volatility as in recent time periods.
Volatility reduction in the mixed tra�c of multiple channels is similar to the idea of
statistical multiplexing and resource overbooking [67] in shared hosting platforms, where
the resources are booked to satisfy a certain percentile of demand in each application
instead of its worst-case demand, so as to enhance resource utilization. However, the
volatility reduction discussed here is novel in three aspects. First, we are concerned with
2.1. MEASUREMENT, PREDICTION AND LEARNING IN VOD SYSTEMS 12
forward-looking resource allocation and volatility forecasts for future demand, while in
[67] the resource usage of each application is profiled in an o✏ine and fixed manner, ig-
noring the change of demand patterns over time. Second, our study focuses on large-scale
VoD systems, where the concurrent number of users can ramp up by several hundreds
or thousands in tens of minutes. In this scenario, any fixed resource usage profiling for
small video channels (e.g., those with a user population of 20 in [67]) will be insu�cient.
Last but not least, we do not assume independence between the demands of channels.
Instead, we accurately quantify the conditional demand variance in each channel, which
enables the use of financial instruments such as hedging and diversification to achieve
cost-e↵ective server management with service level guarantees.
Principal Component Analysis (PCA) has been proposed in [40] to extract video
demand evolution patterns over longer periods (of weeks or months) and forecast coarse-
grained daily populations. In this thesis, we utilize an approach to find the common
factors that drive the demand evolution of all coexisting video channels using PCA at
a fine granularity. We then make forecasts for individual video channels as regressions
from factor forecasts obtained from the seasonal ARIMA model. Such an approach
combines the strengths of both PCA and the seasonal ARIMA model. Unlike [40], our
approach makes short-term predictions with a lead time of 10 minutes, enabling fine-
grained autoscaling of resource allocation.
Recently, there has been an increasing interest in applying statistical learning tools
to diagnose, predict and improve the reliability of large-scale distributed systems. Bahl
et al. [18] propose Sherlock, which discovers the dependencies among services, software
and physical components in a large enterprise network based on network tra�c, and
leverages such dependencies to detect and localize problems. Chen et al. [30] further
2.2. CLOUD WORKLOAD MANAGEMENT AND RESOURCE ALLOCATION 13
study the performance and limitations of automated dependency discovery from network
tra�c, and propose Orion that discovers dependencies in an enterprise network, using
the delay information between every pair of network flows. Mahimkar et al. [53] focus
on characterizing performance issues in a large-scale IPTV network, and propose Giza,
which applies multi-resolution data analysis to localize regions and physical components
in the IPTV distribution hierarchy that are experiencing performance problems. This
thesis represents one of the first attempts to characterize the dependencies among the
demand statistics and performance metrics in a large-scale video delivery system. We
introduce a systematic time-series analysis approach that learns the user behavior and
system dynamics from online measurements, and utilizes the learned rules to predict
demand and performance for the future.
2.2 Cloud Workload Management and Resource Al-
location
Cloud bandwidth reservation is becoming technically feasible. There have been proposals
on data center tra�c engineering to o↵er elastic bandwidth guarantees for egress tra�c
from virtual machines (VMs) [39]. The idea of virtual networks has also been proposed
to connect the VMs of the same tenant in a virtual network with bandwidth guarantees
[19, 39]. Further, explicit rate control has been proposed to apportion bandwidth accord-
ing to flow deadlines [69]. Such research progress has made the cloud more attractive
to bandwidth-intensive applications such as video-on-demand and MapReduce compu-
tations that rely on the network to transfer large amounts of data at high rates [73].
Netflix, as a major VoD provider, moved its data store, video encoding, and streaming
2.2. CLOUD WORKLOAD MANAGEMENT AND RESOURCE ALLOCATION 14
servers to Amazon AWS [2] in 2010 [12].
Virtualization techniques for supporting cloud-based IPTV services are also being
developed by major U.S. VoD providers such as AT&T [14]. Novel flow control algorithms
and improvements over TCP have been presented in [34] to ensure QoE-aware video
delivery from cloud data centers. Furthermore, video demand forecasting techniques have
been proposed, such as the non-stationary time series models introduced in [58, 59, 60],
and video access pattern extraction via principal component analysis in [40]. All these
recent advances will help the realization of video delivery from data centers, and will
enable e�cient and quality-assured management of video workload in the cloud.
Predictive and dynamic resource provisioning has been proposed mostly for VMs
and web applications with respect to CPU utilization [23, ?, 36, 37, 66] and power con-
sumption [46, 48]. VM consolidation with dynamic bandwidth demand has also been
considered in [68]. Our work exploits the unique characteristics of VoD bandwidth de-
mands and is distinct from the foregoing works in three aspects. First, our bandwidth
workload consolidation is as simple as solving convex optimization for a load direction
matrix. We leverage the fact that unlike VM, demand of a VoD channel can be fraction-
ally split into video requests. Second, our system forecasts not only the expected demand
but also the demand volatility, and thus can control the risk factors more accurately. In
contrast, most previous works [37, 36] assume a constant demand variance. Third, we
exploit the statistical correlation between bandwidth demands of di↵erent video channels
to save resource reservation while previous work, such as [68], considers VM consolidation
with independent random bandwidth demands.
The idea of statistical multiplexing and resource overbooking has been empirically
evaluated for a shared hosting platform in [67]. Our novelty is that we formulate the
2.3. CLOUD RESOURCE PRICING 15
quality-assured resource minimization problem using Value at Risk (VaR), a useful risk
measure in financial asset management [54], with the aid of accurate demand correla-
tion forecasts. We believe that our theoretically grounded approach provides stronger
robustness against the demand volatility in practice.
2.3 Cloud Resource Pricing
Cloud computing, e.g., Amazon EC2, is usually o↵ered with usage-based pricing (pay-
as-you-go) [17, 38]. In addition, cloud resources can also be sold with auction type of
pricing policies in a spot instance market [64, 74], to reflect the underlying demand-
supply relationship in the computing instance market. Di↵erent from pay-as-you-go,
resource reservation involves paying a negotiated cost to have the resource over a time
period, whether or not the resource is used. Although suitable for delay-insensitive
applications, pay-as-you-go is insu�cient as a business model for bandwidth-intensive
and quality-stringent applications like VoD, since no performance guarantees are provided
in general. To support guaranteed cloud services, we need new policies to price not only
the bandwidth usage but also bandwidth reservations.
Cloud brokers, e.g., Zimory [11], have recently emerged as intermediaries connecting
buyers and sellers of computing resources. The engineering aspects of using brokerage to
interconnect clouds into a global cloud market have been discussed in [29]. We propose a
new type of cloud brokerage that multiplexes bandwidth reservations to save cost while
providing quality guarantees to customers.
Amazon Cluster Compute (as of 2011) [1] allows tenants to reserve, at a high cost,
a dedicated 10 Gbps network with no multiplexing. Instead of over-provisioning a fixed
amount of capacity, our proposed broker dynamically books resources in response to
2.4. ALGORITHMS FOR DISTRIBUTED AND PARALLEL OPTIMIZATION 16
demand changes, exempting tenants from demand estimation, for which they have no
expertise. We aim to find the pricing policies under which a selfish broker can also
enhance cloud resource e�ciency and save money for VoD providers. Di↵erent from
usage-based pricing [17], our proposed pricing policy depends on demand statistics such
as burstiness and correlation.
Our bandwidth reservation pricing model is partly inspired by pricing electric power
consumption and capacity reservation under demand uncertainty [62]. However, due
to the computing capability and abundant workload data in the cloud, our bandwidth
reservation pricing theory is essentially a distributed computational problem based on
prediction.
2.4 Algorithms for Distributed and Parallel Opti-
mization
The optimal cloud pricing problem presented in Chapter 7 is related to network utility
maximization (NUM), which has been extensively studied in the past [32, 61, 49]. A num-
ber of primal/dual decomposition methods have been proposed to solve such problems in
a distributed way. Most NUM problems in the literature are concerned with uncoupled
utilities, where the local variables of one network element do not a↵ect the utilities of
other network elements. Unfortunately, this does not apply to systems with competition
or cooperation, where utilities are often coupled, such as in our cloud resource reservation
model in Chapter 7.
Traditional distributed solutions to NUM problems with coupled objectives seek de-
composition from Lagrangian dual problems with auxiliary variables, which are then
2.4. ALGORITHMS FOR DISTRIBUTED AND PARALLEL OPTIMIZATION 17
solved with gradient/subgradient methods. This approach has a simple economic in-
terpretation of consistency pricing [61, 65]. However, gradient methods are sensitive to
the choice of stepsizes, leading to slow and unstable convergence. Furthermore, existing
faster numerical convex problem solvers, e.g., Newton’s method [20, 27], can not be ap-
plied in a distributed way, since they involve a lot of message-passing beyond gradient
information, which is hard to implement and justify physically in reality. In Chapter 7,
we propose a new class of distributed algorithms to solve NUM problems with coupled
objective functions in general. The new methods are e�cient in terms of convergence
performance without complicating the message-passing that is involved.
The nonlinear Jacobi algorithm [21] solves convex optimization problems by opti-
mizing the objective function for each variable in parallel while holding other variables
unchanged, and converges under certain contraction conditions. Yet it cannot be ap-
plied to our pricing problem, since no network element has global information of all the
utilities. However, we will show in Chapter 7 that the nonlinear Jacobi algorithm is a
limiting case of our Algorithm 1 proposed in Sec. 7.5 (executed asynchronously), while
our proposed Algorithm 1 can be understood as a distributed version of the non-linear
Jacobi algorithm.
Chapter 3
Forecast in Video-on-Demand
Systems
Automated demand and performance prediction can help video services instantaneously
estimate their operational cost, avoid performance issues, place servers optimally and
allocate resources dynamically. In this chapter, we investigate the feasibility of automated
forecasting in on-demand video streaming systems, by analyzing the operational traces
that we have collected from UUSee Inc., one of the leading media content providers in
China, during the 2008 Summer Olympic Games.
The first step towards performance prediction is to understand the evolution of de-
mand, which is not only a deciding factor for the resource consumption, but also has
strong links with various system dynamics. From the real-world traces, we discover that
the demand evolution for a video channel is highly predictable — even in the long term
— as users exhibit diurnal behavior and regular habits that can be learned from mea-
surements. To model the diurnal periodicity and trend in the demand evolution, we use
seasonal Autoregressive Integrated Moving Average (ARIMA) models [25], which can
18
CHAPTER 3. FORECAST IN VIDEO-ON-DEMAND SYSTEMS 19
yield excellent prediction results for non-stationary demand evolution.
Surprisingly, not only is the demand of a channel predictable based on the auto-
correlation with its history, videos with similar properties, e.g., popular Olympics videos
released around the same time of day, also demonstrate similar demand evolution pat-
terns. This enables the inference of the initial demand of a newly released channel
based on the statistics of similar channels that were released earlier. We propose a gen-
erative model based on the regression of mixtures of Gaussians to account for such a
phenomenon.
When it comes to the estimation of the demand for server bandwidth and performance
at end users, the peer-assisted architecture adopted by many VoD services, including
UUSee, has posed great challenges to such tasks. Unlike a client-server based system,
a peer-assisted service partly relies on the uploads from end users, which are highly
dynamic, volatile and hard to predict. We adopt a general class of algorithms that apply
non-linear transformations to the performance-related time series, and make predictions
leveraging both the serial dependencies within each series and the interdependencies
among di↵erent series.
We also demonstrate the application of our time-series modeling techniques in the
prediction of server bandwidth requirement and the estimation of peer playback per-
formance at system run-time. In this chapter, we conduct extensive evaluation of the
proposed schemes based on the traces of 40 channels (with simultaneously online peer
population up to 8000) in a 21-day period spanning the entire 2008 Beijing Olympics,
corroborating the feasibility of implementing forecasting functions in real systems.
3.1. MONITORING UUSEE VIDEO-ON-DEMAND SYSTEM 20
3.1 Monitoring UUSee Video-on-Demand System
UUSee is one of the leading commercial multimedia solution providers in China, featur-
ing o�cial online broadcasting rights to the 2008 Beijing Olympics. It simultaneously
broadcasts thousands of on-demand video channels to millions of users distributed across
over 40 countries in the world. Videos in our UUSee dataset are encoded into di↵er-
ent streaming bit rates, ranging from 264 kbps to 1.3 Mbps, corresponding to di↵erent
streaming qualities.
The UUSee on-demand streaming system organizes users that are interested in the
same video into a random mesh network. Once a peer joins a video channel, an initial
set of partner nodes (up to 50) is supplied by one of the tracking servers. The peer then
requests media blocks from its partners based on periodically exchanged block avail-
ability information. In such a mesh network, there are three types of nodes, namely
media servers, playing peers, i.e., peers that are currently watching the video, and cached
peers, i.e., online peers that have previously watched the video and cached it in their
file systems. A number of algorithms and coding schemes are also incorporated to allow
peers to dynamically and optimally alter their partners, and to facilitate e�cient media
stream dissemination [50]. In general, the UUSee network forms a large-scale distributed
application with highly complex and time-varying system dynamics.
To monitor the run-time behavior of the system, we have implemented detailed mea-
surement and reporting capabilities within the UUSee client. Each online peer collects
its vital statistics, and encapsulates them into “heartbeat” reports, which are sent to our
dedicated logging servers every 10 minutes via UDP. Such statistics include the video it
is playing, its instantaneous aggregate download and upload throughput from and to all
its partners, its download and upload bandwidth capacities, as well as a list of all of its
3.1. MONITORING UUSEE VIDEO-ON-DEMAND SYSTEM 21
partners, with the number of blocks sent to or received from each partner within the past
10 minutes, and the type of nodes from which it downloads blocks.
The measurements used for validation in this chapter feature a set of traces collected
from 40 video channels during a 21-day period, from 14:51:58, August 8, 2008 (GMT+8)
to 23:43:56, August 28, 2008 (GMT+8), spanning the entire period of the 2008 Summer
Olympics. Among these channels, 32 videos were published during this period, while
others were released earlier. We believe this trace set best captures the user demand
patterns for di↵erent types of videos and UUSee on-demand streaming performance in a
wide range of scenarios, including large flash crowds gathered to watch popular Olympic
videos as soon as they were published.
Since our analysis focuses on the collective behavior of peers in individual channels,
we process and aggregate the peer heartbeat reports into time series concerning channel
average statistics using MySQL. Notations for frequently used quantities are listed in
Table 3.1. With a sampling interval of 10 minutes, there are 144 samples in each time
series per day.
Table 3.1: Frequently used notations (for a particular channel).R the playback rate, i.e., the bit rate of the videoN
t
the number of playing peers in the channel at time t, i.e., the online populationN c
t
the number of cached peers in the channel at time tD
t
the aggregate bandwidth demand from end users in the channel at time tSt
the aggregate server sending rate in the channel at time tSr
t
the aggregate server bandwidth required by the channel at time t to maintain its playback qualityst
the average server sending rate per playing peer at time tut
the average sending rate of playing peers at time tct
the aggregate sending rate of cached peers at time t divided by Nt
rt
the average receiving rate of playing peers at time t
3.2. SOME PREDICTION PROBLEMS 22
3.2 Some Prediction Problems
We observe that, in a complex peer-assisted on-demand streaming system, there exist
patterns in di↵erent kinds of time series, which contain information regarding human
factors and underlying system dynamics. Such information, if mined and learned from
measurements, will make demand forecast and performance prediction possible. In this
chapter, we consider a number of prediction/estimation problems that have practical
applications in peer-assisted on-demand streaming systems.
Demand Forecast and Inference. Given the statistics of the online population denoted
by {Nt
, Nt�1, . . .} up to time t, the demand forecast problem is to make an h-step predic-
tion of the future population Nt+h
. Furthermore, for a newly released channel with very
limited channel statistics {N0, . . . , Nl
} collected (l > 0 is small), the demand inference
problem is to infer the demand evolution pattern in the first n days, i.e., the evolution
pattern of {Nl+1, . . . , N144n}.
A similar problem is to make an h-step prediction of the future aggregate bandwidth
demand Dt+h
in a video channel, given bandwidth demand series {Dt
, Dt�1, . . .} up to
time t.
Being aware of the presence of peer contribution, we define Sr
t
as the total server
bandwidth required by the channel at time t in order to maintain satisfactory perfor-
mance. Clearly, as a minimum requirement, the average receiving rate of the playing
peers rt
should be at least the video bit rate R. This motivates the following problem:
Predicting Server Bandwidth Requirements. Given observations collected up to time
t, the problem is to make an h-step prediction of the total server bandwidth Sr
t+h
required
by a channel so that the achieved average receiving rate rt
� R. We can first make predic-
tions for the online population Nt+h
, the average receiving rates from playing peers ut+h
3.2. SOME PREDICTION PROBLEMS 23
and from cached peers ct+h
to obtain Nt+h
, ut+h
, and ct+h
. The total server bandwidth
required, Sr
t+h
, can then be forecasted as
Sr
t+h
= Nt+h
(R � ut+h
� ct+h
). (3.1)
In a video channel within the streaming system, the current total server rate St
allocated to the channel and the current online population Nt
can be instantaneously
monitored at the media servers and tracking servers. However, there could be a significant
delay of l timestamps to collect and process the statistics {rt
}, {ct
} and {ut
} from online
peers. This leads to the following problem:
Playback Performance Estimation. Estimate the average receiving rate rt
of playing
peers at the current time t, provided that we have monitored St
, Nt
, and have collected
{rt�l
, rt�l�1, . . .}, {c
t�l
, ct�l�1, . . .} and {u
t�l
, ut�l�1, . . .}. This again can be achieved by
first estimating ct
and ut
given the observations of these series up to time t � l. The
average receiving rate rt
of playing peers can then be estimated as
rt
=St
Nt
+ ut
+ ct
, (3.2)
where rt
, ut
, and ct
are the estimates of rt
, ut
, and ct
, respectively.
Demand and performance forecast can benefit a peer-assisted on-demand streaming
system in performance optimization, risk management, capacity planning, and resource
allocation, especially when thousands of channels coexist in a complex peer-assisted VoD
network. For example, the evolution of di↵erent time series are plotted in Fig. 3.1 for a
typical channel that attracts a flash crowd upon its release. Clearly, there is a performance
issue before t = 300, where the average peer receiving rate rt
is significantly lower than
3.3. MODELING AND FORECASTING THE DEMAND 24
200 400 600 800 1000 1200Time t
0
20
40
60
80
100
120
140 Average Peer Receiving Rate (KB/s)Population (Normalized)
200 400 600 800 1000 1200Time t
0
20
40
60
80
100Receive from Server (KB/s)Receive from Cached Peers (KB/s)Receive from Playing Peers (KB/s)
Figure 3.1: The evolution of several time series in a typical flash crowd video channel(with channel ID: A55F) since the channel was released. Each time unit represents 10minutes. All rates are average quantities for the channel. The video bit rate is R = 79.6KB/s.
the video playback rate 79.6 KB/s1. This is largely due to the insu�cient st
and ct
,
representing the average receiving rates from servers and cached peers, respectively.
A forecasting mechanism can infer the performance issues in a channel by estimating
the current rt
even in the presence of measurement delay, and advise system administra-
tors to take actions accordingly. It can also predict the server requirement in di↵erent
channels at a future time and do the capacity planning in advance to avoid problems.
However, we focus on tackling the unique challenges posed by the prediction, inference
and estimation problems themselves, while the investigation of their applications is not
the focus of this chapter.
3.3 Modeling and Forecasting the Demand
We start with analyzing the real-world traces, to investigate the feasibility of demand
inference and prediction. The demand of each video, which represents the interests of
users, follows a clear pattern and is highly predictable. In Fig. 3.2, we plot the number of
1Throughout this thesis, we use KB/s to represent kbytes per second, and use kbps to represent kbitsper second.
3.3. MODELING AND FORECASTING THE DEMAND 25
online playing peers, i.e., population, in several typical video channels since these videos
were released. According to their population patterns, channels in our trace set can be
categorized into steady-state channels and flash-crowd channels.
Steady-state channels, as shown in Fig. 3.2(a), usually have been released for a long
time and attract a small number of online users. Although the population in these
channels is roughly steady on a daily basis, it clearly exhibits a diurnal pattern with
a period of 144 (one day). To take a closer look at the periodicity in user demands,
we apply FFT to {Nt
} of the channels in Fig. 3.2(a), and plot their spectral densities
in Fig. 3.3(a), which shows that crests of power density are around period 72 and 144
in both channels. These two cycles correspond to the two demand peaks each day in
Fig. 3.2(a), which are usually around noon and midnight. Note that channel B830 and
4872 reach their peak populations at di↵erent times of day, due to the distinct nature
of the videos, e.g., news attracts more audience at specific times of day, while movies
and entertainment channels are popular during night hours. Fig. 3.3(a) also shows the
existence of high-frequency variations in Nt
.
Flash crowds often immediately follow the release of a popular video, e.g., Olympic
game videos released during our measurement period. Unlike the steady-state channels
in Fig. 3.2(a), flash-crowd channels, as shown in Fig. 3.2(b)(c)(d), attract hundreds or
thousands of interested online peers shortly after their release, with user interest gradually
diminishing over time.
Depending on the popularity, properties and release time of the video, flash-crowd
channels exhibit di↵erent patterns in the initial demand evolution, as well as di↵erent
decreasing speeds in Nt
. For example, a video published after midnight, e.g., F190 and
FA5C, is likely to have its first peak population on the second day around noon, whereas
3.3. MODELING AND FORECASTING THE DEMAND 26
200 400 600 800 10000
50
100
150
200
Time
Popu
latio
n
Channel_4872Channel_B830
(a) Steady-state channels
250 500 750 10000
500
1000
1500
2000
2500
3000
3500
TimePo
pula
tion
Channel_F190Channel_FA5C
(b) Flash-crowd channels
250 500 750 10000
500
1000
1500
2000
Time
Popu
latio
n
Channel_7A40Channel_E856
(c) Flash-crowd channels
250 500 750 10000
2000
4000
6000
8000
Time
Popu
latio
n
Channel_2E0BChannel_D55A
(d) Flash-crowd channels
Figure 3.2: Population evolution in steady-state channels and various kinds of flash crowdchannels. The samples of steady-state channels start from 2008-08-08 14:51:58. The sam-ples of flash-crowd channels start from the time that the corresponding video is released.The video release times are: channels 4872 and B830, before 2008-08-08 14:51:58; chan-nel F190, 2008-08-17 00:21:03; channel FA5C, 2008-08-19 01:47:20; channels 7A40 andE856, 2008-08-12 20:57:34; channel 2E0B, 2008-08-17 23:43:43; channel D55A, 2008-08-0900:03:26.
3.3. MODELING AND FORECASTING THE DEMAND 27
0 72 144 288 432Period (timestamps)
102
103
104
105
106
107
108
109
1010|F
FT(N
t)|2
(log
scale)
Channel 4872
Channel B830
(a) Steady-state channels
0 72 144 288 432Period (timestamps)
103
104
105
106
107
108
109
1010
1011
|FFT(N
t)|2
(log
scale)
Channel A55F
Channel F190
(b) Flash-crowd channels
Figure 3.3: The power spectrum of online population Nt
for various channels. To betterview periodicity, the x axis is set to be the period measured in timestamps, which is theinverse of frequency.
a video released before midnight, e.g., 7A40, will have its first peak population around
midnight. The spectral densities of two flash crowd channels A55F and F190 are plotted
in Fig. 3.3(b), from which we also find the periodicity with cycles 72 and 144.
By learning the periodicity and trend that exist in the past demand evolution of a
channel, we can make predictions of its future demand, even for several days. Moreover,
due to the interesting observation that videos with similar properties and similar release
times also share similar demand evolution patterns, we can infer a channel’s population
for the first few days, simply based on the statistics from similar channels that were
released earlier.
3.3.1 Population Prediction via Seasonal ARIMA Processes
If the population time series of a channel has been observed for a few days, we can
make fine-grained prediction of its future demand, leveraging the trend, periodicity and
autocorrelation exhibited in its own history. Due to the seasonality (diurnal pattern) in
3.3. MODELING AND FORECASTING THE DEMAND 28
all the channels and the decreasing trend in flash-crowd channels, Nt
of any channel is
clearly non-stationary over the long run2. However, we can eliminate both seasonality
and trend in the series via di↵erencing to obtain a stationary yet autocorrelated residual
series, which can then be characterized by models for stationary processes such as ARMA
(autoregressive moving-average) models. Such a model for non-stationary time series is
referred to as the seasonal ARIMA (autoregressive integrated moving-average) model [25]
and is an established time-series analysis technique in statistical learning [25, 28]. In the
following, we briefly describe how such a methodology can be applied to our time-series
forecasting problem.
Given a time series of interest {Yt
}, define the backward shift operator B by BYt
=
Yt�1, and define the lag-1 di↵erence operator r by
rYt
= Yt
� Yt�1 = (1 � B)Y
t
. (3.3)
Powers of B and r are defined in the obvious way, i.e., Bj(Yt
) = Yt�j
and rj(Yt
) =
r(rj�1(Yt
)), j � 1, with r0(Yt
) = Yt
. We further introduce the lag-d di↵erence operator
rd
defined by
rd
Yt
= Yt
� Yt�d
= (1 � Bd)Yt
. (3.4)
For steady-state channels, the population time series {N(t)} has a period of 144 (one
day) and no trend. We therefore remove the seasonality of {N(t)} to obtain a stationary
residual series Nt
= r144Nt
. {Nt
} can then be modeled as an ARMA(p, q) process:
Nt
= �1Nt�1 + . . . + �p
Nt�p
+ Zt
+ ✓1Zt�1 + . . . + ✓q
Zt�q
,
2A process {Yt} is (weakly) stationary if its mean E[Yt] and its covariance function Cov(Yt+h, Yt) ateach lag h are independent of t.
3.3. MODELING AND FORECASTING THE DEMAND 29
where {Zt
} ⇠ WN(0, �2) denotes the uncorrelated white noise with zero mean, and
�1, . . . , �p
, ✓1, . . . , ✓q are the model parameters. The ARMA process degenerates into an
autoregressive process AR(p) if q = 0, and becomes a moving-average process MA(q) if
p = 0. Equivalently, {Nt
} is modeled by the seasonal ARIMA model,
�(B)r144Nt
= ✓(B)Zt
, (3.5)
where {Zt
} ⇠ WN(0, �2), and �(B) = 1��1B�. . .��p
Bp and ✓(B) = 1+✓1B+. . .+✓q
Bq
are polynomial operators in B of degrees p and q.
For flash-crowd channels, however, both trend and periodicity exist in {Nt
}. The
fluctuation in the data decreases as t increases. We therefore first apply a logarithmic
transformation to {Nt
} to equalize the fluctuation. We then apply the transformation
r144 to {log(Nt
)} to remove seasonality, and di↵erence r144 log(Nt
) for d times to re-
move the trend, obtaining the stationary residual N(t) = rdr144 log(Nt
), which is well
explained by an ARMA(p, q) process. The corresponding seasonal ARIMA model derived
for {Nt
} is
�(B)rdr144 log(Nt
) = ✓(B)Zt
, d 2 {0, 1} (3.6)
where {Zt
} ⇠ WN(0, �2), �(B) = 1��1B � . . .��p
Bp and ✓(B) = 1+ ✓1B + . . .+ ✓q
Bq.
The value of d is chosen from {0, 1}, depending on how fast the daily population decreases
in trend.
Once the parameters of a seasonal ARIMA model are determined based on the training
sequence, the prediction for future Nt
is performed as follows. Let Pt
Nt+h
denote the best
predictor of Nt+h
, h > 0, in terms of the values {N1, . . . , Nt
}. To predict Nt+h
, h > 0,
we first obtain Pt
Nt+h
, the minimum mean square error predictor for the corresponding
3.3. MODELING AND FORECASTING THE DEMAND 30
residual Nt+h
. If q = 0, i.e., Nt
is modeled by a pure autoregressive process AR(p), we
have (e.g., refer to [25])
Pt
Nt+h
= �1Pt
Nt+h�1 + . . . + �
p
Pt
Nt+h�p
. (3.7)
Pt
Nt+h
can thus be calculated by solving (3.7) recursively for Pt
Nt+1, . . . , Pt
Nt+h
. Pt
Nt+h
can finally be obtained by retransforming Pt
Nt+h
using the inverse of the corresponding
di↵erence operators.
Let us illustrate the modeling and prediction of {Nt
} in a typical steady-state channel
B830. A model of the form (3.5) is fit to the training data from 5 consecutive days in
channel B830 and is used to make demand prediction for the next 9 days, as shown in
Fig. 3.4(a). By checking the autocorrelation function (ACF) of the residual Nt
= r144Nt
of the training data in Fig. 3.5(a), we can see that N(t) is not uncorrelated white noise and
needs to be explained by an ARMA process. Since the partial autocorrelation function
(PACF) of N(t) cuts o↵ at lag 144, it suggests an AR(144) model with p = 144, q = 0
for N(t) [25].
We use the standard Yule-Walker procedure [25] to obtain preliminary estimates of the
parameters �1, . . . , �p
in (3.5) and obtain final estimates through a maximum likelihood
estimator [25]. The trained model is used to make one day-ahead (144-step) demand
prediction, i.e., to predict each Nt+144 based on {N
⌧
; 0 ⌧ t}, as shown in Fig. 3.4(a).
We can see with p = 144, (3.5) is able to capture the slight demand variations from day
to day, which cannot be captured by a smaller p in our experiments.
To predict demand for a flash-crowd channel based on seasonal ARIMA models, we
train the model (3.6) based on the data of the first 2 days since a video was released,
i.e., {Nt
; t1 < t t1 + 288}, where t1 72 is to exclude the initial samples that do not
3.3. MODELING AND FORECASTING THE DEMAND 31
500 1000 1500 20000
50
100
150
Time
Popu
lation
Validation data144 step!ahead predictionTraining data
(a) Steady state channel B830
200 400 600 800 10000
500
1000
1500
2000
2500
3000
Time
Popu
lation
Validation data144 step!ahead predictionTraining data
(b) Flash-crowd channel A55F
200 400 600 800 1000 12000
200
400
600
800
1000
1200
Time
Popu
lation
Validation data144 step!ahead predictionTraining data
(c) Flash-crowd channel 7A40
Figure 3.4: One day-ahead population prediction based on seasonal ARIMA modelsfor three di↵erent channels. The model is trained based on the data of first few days.Prediction for N
t
only makes use of the observations {N⌧
: 0 < ⌧ t � 144} up totime t � 144. The two flash-crowd channels have di↵erent daily decreasing of N
t
. Thevalidation and training sets overlap for 1 day, since in a seasonal model, the predictorrequires the first 144 validation samples to be the initial input.
comply with the later evolution pattern (the inference of initial demand pattern based
on statistics of similar channels will be discussed in Sec. 3.3.3).
For channels with a slow decrease of the daily population as illustrated in Fig. 3.4(b),
we set d = 0 and do not di↵erence r144 log(Nt
), while for channels with faster daily
population decrease as illustrated in Fig. 3.4(c), we set d = 1. Similarly, having observed
the PACF of the residual N(t) = rdr144 log(Nt
), which cuts o↵ at lag 36, we choose
p = 36 and q = 0 as the model order, and plot one day-ahead (144-step) prediction
3.3. MODELING AND FORECASTING THE DEMAND 32
0 50 100 150 200−0.5
0
0.5
1
Lag(a) Sample ACF
0 50 100 150 200−0.5
0
0.5
1
Lag(b) Sample PACF
Figure 3.5: The autocorrelation function (ACF) and partial autocorrelation function(PACF) of N(t) = r144Nt
for the training data of channel B830.
starting from the 3rd day in Fig. 3.4(b) and 3.4(c) for two typical flash-crowd channels
with di↵erent decreasing speeds of daily population.
3.3.2 Forecasting the Aggregate Bandwidth Demand
We can also use the seasonal ARIMA model to predict the aggregate bandwidth demand
in each video channel directly. We use the series {Dt
} to denote the total bandwidth
demand from end users in a particular video channel. We first apply r144 to {Dt
} to
remove the daily periodicity, and then apply di↵erencing to r144Dt
for d times to remove
the trend, obtaining a stationary series D(t) = rdr144Dt
, which is well explained by
an ARMA(p, q) process. The corresponding seasonal ARIMA model [25] for the original
series {Dt
} is thus
�(B)rdr144Dt
= ✓(B)Zt
, d 2 {0, 1}, (3.8)
where {Zt
} ⇠ WN(0, �2) denotes the uncorrelated white noise with zero mean, and
�(B) = 1 � �1B � . . . � �p
Bp and ✓(B) = 1 + ✓1B + . . . + ✓q
Bq are polynomial operators
3.3. MODELING AND FORECASTING THE DEMAND 33
264 408 552 696 840 9840
1
2
3
4
5
6 x 104
Time (unit: 10 minutes)
KB/s
Trace data1−step predictionTraining data
(a) One-step-ahead prediction of Dt
697 769 841 913 985 1057−1
−0.5
0
0.5
1 x 104
Time (unit: 10 minutes)
KB/s
(b) Prediction errors
Figure 3.6: 10-minutes-ahead prediction for the aggregate bandwidth demand Dt
in apopular video channel A55FF released at time period 264, compared against the tracedata.
in B of degrees p and q. The di↵erencing order d is chosen from {0, 1}, depending on
whether a trend exists in the daily population variation.
As an example, we make 10-minute-ahead (one-step) prediction of the aggregate band-
width demand {Dt
} in a popular video channel released at time period t0 = 264 (2008-
08-10 10:47:39). The channel has a maximum online population of 2664. The demand
series of the first 3 days is used as the training data, excluding the initial 80 time periods
right after the release of the video, which may not conform to later evolution patterns.
The prediction is tested on the data of 3 days following the training period. We fit model
(3.8) to the training data and obtain parameter estimates through a maximum likelihood
estimator [25]. As shown in Fig. 3.6(a), with d = 0, p = 20, q = 20, model (3.8) can yield
3.3. MODELING AND FORECASTING THE DEMAND 34
prediction results that are close to the real bandwidth demand in the video channel.
3.3.3 Inferring Initial Demand using Mixtures of Gaussians
We now infer the initial demand evolution pattern in a newly released video channel,
where no past observation is available yet. As shown in the UUSee data, usually, several
irregular surges in the online population evolution are observed before it exhibits clearer
seasonality and trends. Therefore, the inference of the initial demand evolution cannot
make use of seasonal ARIMA models as in Sec. 3.3.1. However, the inference can be done
based on the fact that popular videos released around the same time of day (possibly
on di↵erent dates) demonstrate a similar demand evolution pattern in the first several
days, as we have observed in Fig. 3.2(b)(c)(d). Again, the underlying reason for this
phenomenon is that human users access the VoD system following a tractable diurnal
pattern.
For example, in Fig. 3.2(b), both channels F190 and FA5C are released after midnight,
it is expected that the first peak of demand comes right after their release due to a small
portion of users with the habit of staying up late. The second demand peak comes the
second day around noon, featuring a much larger population of daily users, and the third
demand peak is at midnight of the second day. As another example, Fig. 3.7 shows the
initial demands of 3 di↵erent channels released around 7-9 PM on di↵erent dates. They
also have a similar demand evolution pattern: unlike F190 and FA5C, the first peak
came around midnight after the videos were published, while the second peak came in
the second day around noon.
Assume t = 0 upon the release of a video v. The population {Nt
; 0 t 144n} in
the video channel for the first n days (e.g., n = 2.5) can be modeled as a realization of
3.3. MODELING AND FORECASTING THE DEMAND 35
the process Nt
= f(t, v) + Zt
, where f(t, v) is the signal underlying the evolution of Nt
in channel v, and Zt
represents a zero-mean random noise. We further model f(t, v) as
a mixture of Gaussians and obtain
Nt
= P (v)kX
j=1
⇡jq
2⇡�2j
exp
✓� (t � µ
j
)2
2�2j
◆+ Z
t
, (3.9)
withP
k
j=1 ⇡j
= 1, where P (v) is a popularity index associated with each video v.
We now group all the videos into a number of classes by their release times, and
assume that
Gt
:=f(t, v)
P (v)=
kX
j=1
⇡jq
2⇡�2j
exp
✓� (t � µ
j
)2
2�2j
◆
is the same pattern that underlies the population evolution of all the videos in the same
class. In other words, all videos in one class share the same parameters ⇡j
, µj
and �j
.
There is a strong probability rationale for the model proposed above, explained by
analyzing how a user chooses the time u(v) at which she starts to watch the video v,
although the success of the model does not depend on such an analysis. Assume each
user only watches the video once and her viewing span is negligible as compared to 144n.
Let µj
, j = 1, 2, . . . , k, represent the j-th peak timestamp (usually at midnight or noon)
after v is published. In reality, a user will watch v around time µ1 < µ2 < . . . with
probabilities ⇡1 > ⇡2 > . . ., since a video is more likely to attract audience when it is
just published. Conditioned on a user choosing to watch v around µj
, u(v) is normally
distributed with density N (µj
, �2j
). Thus, we have
Pr
✓t u(v) < t + 1
◆⇡
kX
j=1
⇡j
· 1q2⇡�2
j
exp
✓� (t � µ
j
)2
2�2j
◆· 1,
3.3. MODELING AND FORECASTING THE DEMAND 36
Table 3.2: Prediction of population at the initial stage based on mixtures of Gaussiansfor flash-crowd video channels.
Video Release Time 11PM - 1AM 1 - 4AM 7 - 11AM 3 - 5PM 5 - 7PM 7 - 9PM 9 - 11PMNumber of Training Videos 2 2 2 1 3 4 2
Number of Test Videos 2 2 1 1 3 4 2Mean R-Square of Test Videos 0.8029 0.8448 0.6056 0.7873 0.7153 0.7112 0.4193
Max RMSE of Test Videos 1071.30 205.98 227.92 420.86 149.81 259.09 465.99Peak Population in Test Set 8612 2307 1339 4287 1404 1855 3643
Let P (v) be the total number of potential users that are interested in video v. When
P (v) is large, we have Nt
, the number of users that choose to watch the video v at time
t, given by (3.9).
In our experiments, we categorize all 32 flash-crowd channels into 7 classes based
on their release times, as shown in Table 3.2. For each class, we train the mixture of
Gaussians (3.9) with k = 5 using the Expectation-Maximization (EM) algorithm [22]
based on the data {Nt
; 0 < t 144n} from the training channels. For each test channel
in the same class, {Nt
; l < t 144n} are inferred with the trained model, given very
limited initial observations {Nt
; 0 t l} of its own.
{Nt
; 0 t l} is used to avoid estimating P (v) for each video v individually. SinceP
l
t=1 Zt
⇡ 0, we have
P (v)P
l
t=1 Nt
=P (v)
P (v)P
l
t=1 Gt
+P
l
t=1 Zt
=1
Pl
t=1 Gt
, (3.10)
which is the same for all the videos in the same class. Thus, dividing both sides of (3.9)
byP
l
t=1 Nt
and letting ⇡0j
= ⇡j
P (v)/P
l
t=1 Nt
converts the original problem of estimating
P (v), ⇡j
, µj
and �j
into a problem of estimating ⇡0j
, µj
and �j
that are shared by all the
channels in the same class. Such an estimation can be approached by the EM algorithm
for standard mixtures of Gaussians [22].
3.4. PREDICTING PEER CONTRIBUTION 37
0 50 100 150 200 250 300 3500
200
400
600
800
1000
1200
1400
Time
Popu
latio
n
Training dataTest data 1Test data 2Fit to training dataPrediction of test data 1Prediction of test data 2
Figure 3.7: Prediction of initial demand using a mixture of Gaussians. The fittingfunction is trained with the first 360 population samples in channel 8331 released on 2008-08-10 19:46:29. Test data 1 is from channel 57F7 released on 2008-08-18 20:16:35, andtest data 2 is from channel EDAF released on 2008-08-18 20:49:39. R-square1 = 0.5864,R-square2 = 0.1374, RMSE1 =74.44, RMSE2 = 273.55.
We choose n = 2.5 and l = 30. For each test channel, the R-square and root mean
squared error (RMSE) are calculated. R-square is a measure of model fitness ranging in
[0, 1] with 1 being a perfect fit to data. We can see that the proposed method based on
mixtures of Gaussians achieves satisfactory performance as a best-e↵ort coarse-grained
predictor in most classes, with relatively small RMSEs compared to the peak population
sizes in the test set. An example of the prediction for two channels based on the model
trained from only one channel is plotted in Fig. 3.7.
3.4 Predicting Peer Contribution
In contrast to a client-server based solution, a major reason that a↵ects the stability
of peer-assisted video streaming is that peer bandwidth contributions are much more
dynamic and less reliable than with dedicated servers. Such bandwidth contributions
3.4. PREDICTING PEER CONTRIBUTION 38
are good to have as they reduce server costs, but are di�cult to predict. Modeling peer
contribution {ct
} and {ut
} plays an important role in understanding the dynamics of
a peer-assisted video service. The prediction of {ct
} and {ut
} is also critical in various
performance prediction problems outlined in Sec. 3.2. Since steady-state channels have
similar daily population evolution, the performance time series in such channels also
exhibit steady and normal behavior. In this section, let us focus on the challenging case
of flash-crowd channels.
As illustrated in the right-hand-side of Fig. 3.1, we find that in almost all channels,
the average sending rate of playing peers {ut
} is a stationary noisy process (“Receive from
Playing Peers (KB/s)” in that figure). By checking its partial autocorrelation function,
we find that it can be modeled as a low order AR(p) process,
�(B)(ut
� µu
) = Zt
, (3.11)
where Zt
is white noise and µu
is the sample mean of {ut
}. For example, Fig. 3.8 shows
the prediction performance of an AR(10) model fitted to {ut
} in a flash-crowd channel.
In contrast, the average receiving rate from cached peers ct
is much harder to predict.
As shown in the right-hand-side of Fig. 3.1, ct
increases from 0 to a relatively stationary
value over time, as more playing peers finish downloading and become cached peers. As
{ct
} has a slow increasing trend, the first intuition is to transform ct
into rct
, which
actually has an ACF similar to white noise. However, such a model (�(B)rct
= Zt
with a small order p for �(B)) is ine↵ective in predicting future values of ct
. The model
yields a linear predictor that forecasts future ct
’s around a slowly increasing line, which
is actually an under-fit of {ct
} and misinterprets a large part of the system dynamics
as noise. Moreover, it fails to capture the slightly concave trend in ct
as t grows — ct
3.4. PREDICTING PEER CONTRIBUTION 39
500 1000 1500 2000 25000
10
20
30
40
50
Time
KB/s
Validation dataPredictionTraining data
Figure 3.8: Fitting AR(10) to {ut
} of the first 9 days in channel A55F, and making 2.5hour (15-step) ahead prediction for the rest of the days. The prediction for each u
t
isconditioned on {u
⌧
; 0 < ⌧ t � 15}.
will stop increasing when the channel evolves into its steady state after several days.
Therefore, ct
is hardly predictable only based on its own history.
Cached Peers
Playing PeersNt N c
t
ct =uc(t)N c
t
Nt
�1(t)Abort
�2(t)Leave�1(t)Join
Finish downloading
�2(t)Nt
Figure 3.9: A fluid chart illustrating the dependency of ct
on Nt
through the underlyingsystem dynamics. The white arrows denote the transition rates, while the black arrowsdenote the dependencies. N c
t
is the number of cached peers and uc
(t) is the averageupload rate of each cached peer at time t.
However, we note that {ct
} has a strong dependency on {Nt
}, which we can make
use of in our prediction. Let N c
t
denote the number of cached peers in the channel and
uc
(t) denote the average upload rate of each cached peer at time t. Clearly, we have
ct
= uc
(t)N c
t
/Nt
. The fluid chart in Fig. 3.9 illustrates how {Nt
} is a deciding factor for
{N c
t
} and thus for ct
through the underlying system dynamics. Assume that each playing
3.4. PREDICTING PEER CONTRIBUTION 40
peer finishes downloading with rate �2(t), then N c
t
= N c
t�1+�2(t)Nt
��2(t). Since Nt
can
be expressed as an autoregressive function of its past values, N c
t
is able to be predicted
as a function of {N c
t�1, Nc
t�2, . . .} and {Nt
, Nt�1, . . .}, and so is c
t
= uc
(t)N c
t
/Nt
.
Such a dependency of ct
on {ct�1, ct�2, . . .} and {N
t
, Nt�1, . . .} is non-linear in nature.
However, we can transform both {ct
} and {Nt
} into stationary series by a logarithm
transformation and the di↵erencing transformation used in Sec. 3.3.1, so that the trans-
formed series {ct
} and {Nt
} can be linked together through a linear time-invariant (LTI)
system of the form
ct
=pX
i=1
�i
ct�i
+qX
i=0
!i
Nt�i
+ Zt
, (3.12)
or more concisely, using back shift operator B,
�(B)ct
= !(B)Nt
+ Zt
, (3.13)
where Zt
is white noise, �(B) = 1� �1B � . . .� �p
Bp and !(B) = !0 +!1B + . . .+!q
Bq.
Specifically, we take the logarithm of ct
and Nt
to suppress fluctuation, remove sea-
sonality and trends by di↵erencing, and subtract the means to get the mean-corrected
stationary processes ct
, Nt
. We can thus obtain a transfer function model [25] with added
noise as follows, 8>>>>><
>>>>>:
ct
= ��1(B)!(B)Nt
+ �(B)�1Zt
,
Nt
= rd1r144 log(Nt
) � µ1, d1 2 {0, 1},
ct
= rd2r144 log(ct
) � µ2, d2 2 {0, 1},
(3.14)
where {Zt
} ⇠ WN(0, �2), and µ1 and µ2 are the means of rd
Nr144 log(Nt
) and rd
cr144 log(ct
),
respectively. As a result, by appropriately choosing d1, d2 2 {0, 1}, for any flash-crowd
channel, the relationship between {ct
} and {Nt
} can be characterized by the model (3.14).
3.4. PREDICTING PEER CONTRIBUTION 41
Given the training data, the maximum likelihood estimates of �’s and !’s in (3.14) can
be obtained by minimizing the conditional sum-of-squares function [25] or by a non-
linear least squares algorithm [51], which are standard estimation algorithms in system
identification theory.
Experiments show that the transfer function model (3.14) yields very accurate pre-
diction for ct
. We illustrate the prediction procedure with a typical flash-crowd channel
A55F. We choose d1 = d2 = 0 and learn the model parameters using the data of the
first few days. For a lead time h, each ct
in the validation data is predicted based on
{ct�h
, ct�h�1, . . .} and {N
t�h
, Nt�h�1, . . .}. We choose d1 = d2 = 0, p = 54, q = 0. From
Fig. 3.10(a), we see that a model trained from the first 6 days of data yields very accurate
150 minute-ahead (15-step) prediction for ct
.
We also evaluate the h-step prediction for the total receiving rate from cached peers
ct
Nt
shown in Fig. 3.10(b) and Fig. 3.10(c). ct
Nt
is predicted as ct
Nt
, where the predic-
tions ct
and Nt
are calculated using the models (3.14) and (3.6) in Sec. 3.3.1, respectively,
based on observations up to time t � h. Fig. 3.10(b) shows that the proposed method
can predict the total receiving rate from cached peers accurately for 2.5 hours into the
future. An interesting phenomenon is that although ct
is increasing over time, the peak
of ct
Nt
is around the 5th day due to a decreasing number of Nt
. The proposed method is
able to predict such a peak in ct
Nt
1 hour ahead of time, simply with a training period
of the first 3 days, as shown in Fig. 3.10(c), although with lower accuracy as compared
to Fig. 3.10(b).
3.4. PREDICTING PEER CONTRIBUTION 42
500 1000 1500 2000 25000
20
40
60
80
100
Time
KB/s
Validation dataPredictionTraining data
(a) ct, with a 6-day training period, h = 15
500 1000 1500 2000 25000
1
2
3
4
5
6x 104
Time
KB/s
Validation dataPredictionTraining data
(b) ctNt, with a 5-day training period, h = 15
500 1000 1500 2000 25000
1
2
3
4
5
6x 104
Time
KB/s
Validation dataPredictionTraining data
(c) ctNt, with a 3-day training period, h = 6
Figure 3.10: The h-step prediction of average receiving rate ct
and total receiving ratect
Nt
from cached peers in channel A55F. Prediction for ct
or ct
Nt
only makes use of theobservations of N
⌧
and c⌧
up to time t � h. We choose d1 = d2 = 0, p = 54, q = 0 inthe experiments. The validation and training sets overlap for 1 day, since in a seasonalmodel, the predictor requires the first 144 validation samples to be the initial input.
3.5. APPLICATIONS IN A P2P SYSTEM 43
3.5 Applications in a P2P System
Having elaborated the algorithms for modeling and predicting the online population {Nt
},
the average receiving rates {ut
} and {ct
} from playing and cached peers, we are ready
to solve the problems of server bandwidth requirement prediction and peer playback
performance estimation proposed in Sec. 3.2.
Given observations collected up to time t, we want to make an h-step prediction of the
total server bandwidth Sr
t+h
required by a channel so that the achieved rt
� R. This can
be done by making individual predictions for Nt+h
, ut+h
, and ct+h
, and predicting Sr
t+h
using (3.1). In the presence of data collection delay l, we also aim to estimate the average
receiving rate rt
of playing peers at the current time t, provided that we have monitored St
and Nt
, and collected {rt�l
, rt�l�1, . . .}, {c
t�l
, ct�l�1, . . .} and {u
t�l
, ut�l�1, . . .}. Similarly,
we can first infer ut
and ct
, based on which rt
can be estimated by (3.2).
We propose an iterative procedure for progressive model learning and performance
prediction that can be practically applied in real-world on-demand streaming channels.
Specifically, after the release of a channel, we first train the models (3.6), (3.11) and
(3.14) for {Nt
}, {ut
} and {ct
} using the data of the first 1-2 days. This preliminary
training period should be more than 1 day in order to incorporate the daily variation
of the data into the models. The models are retrained periodically during the system
operation, with a progressively increased order, based on all the measurements collected.
We illustrate such a procedure in a typical flash-crowd channel A55F. The preliminary
model is trained using the data of the first 1.25 days (180 timestamp) assuming an AR(p)
for (3.6), (3.11) and (3.14) with p = 10. p is increased by 10 (except for (3.11)) each
time, as the model is retrained every 24 hours. The prediction/estimation starts at time
181 and is always made with the latest trained model.
3.5. APPLICATIONS IN A P2P SYSTEM 44
We make 2.5 hours ahead (h=15) prediction for the server bandwidth Sr
t+h
required
by the channel and plot the results in Fig. 3.11. We can see that the predicted per-peer
server bandwidth required, R�ut+h
�ct+h
, is quite close to its real value R�ut+h
�ct+h
and
so is the total server requirement Sr
t+h
to its real value Sr
t+h
. We also see that the actual
server bandwidth used by the channel is much less than the required minimum amount
at many points, especially in the first 5 days. This accounts for the performance issues
after the channel is just released, as observed in Fig. 3.1. Incorporating our prediction
mechanism into the real system would forecast the server bandwidth requirement 2.5
hours ahead of time and suggest actions for server reallocation. Fig. 3.11(b) also shows
that the mechanism is able to predict an abnormal surge in the bandwidth requirement
at time 2075 because the seasonal ARIMA model (3.6) accurately predicts a surge in Nt
at this time. Moreover, the entire simulation of the above procedure takes less than 90
seconds on a Macbook Pro laptop with a 2.26 GHz duo-core processor.
Fig. 3.12 shows the estimation for rt
if the measurement collection is delayed by 2.5
hours (l = 15). The first 1.25 days (180 timestamps) are the preliminary training period.
We can see that the concave trend of rt
is successfully forecasted, and the estimation
becomes more accurate as the model is progressively refined with more data available.
Again, the proposed mechanism is able to discover the insu�cient rt
as compared to
the video bit rate (79.6 KB/s) starting from time 181, and would advise the system to
increase the server bandwidth allocated to the channel to compensate for the shortage
before time 500.
3.6. SUMMARY 45
3.6 Summary
In this chapter, we have addressed the issues of demand forecast and performance pre-
diction in on-demand streaming services either with or without peer assistance. Despite
the complexity of such large-scale distributed systems, we find that there exist clear pat-
terns in the evolution of user demands as well as dependencies within the time-series
related to various performance metrics. With the aid of online measurement collection, a
class of non-stationary time-series analysis techniques is used to enable demand predic-
tion and performance forecast. We also shed light on how such a monitoring-prediction
functionality can be incorporated in real-world systems at run time.
3.6. SUMMARY 46
180 468 756 1044 1332 1620 19080
20
40
60
80
Time
KB/s
Server bandwidth required per peerPredicted server bandwidth required per peerActual server bandwidth used per peer
(a) Prediction for the server bandwidth required per peer
180 468 756 1044 1332 1620 1908 2196 24840
2
4
6
8
10 x 104
Time
KB/s
Total server bandwidth requiredPredicted total server bandwidth requiredActual total server bandwidth used
(b) Prediction for the total server bandwidth required by the channel
Figure 3.11: Progressive model learning and 2.5 hours ahead (15-step) prediction for theserver bandwidth required by the peers in channel A55F over a 16-day period, comparedwith the real server bandwidth requirement calculated from the traces, and the actualserver bandwidth used by the channel in the traces. The channel is released on 2008-08-10 10:47:39, and the predicted values start from 2008-08-11 16:47:39 (timestamp 181).The predictor is retrained with the most recent data on a daily basis.
3.6. SUMMARY 47
0 288 576 864 1152 1440 17280
50
100
150
Time
KB/s
Server sending rate per playing peerActual average peer receiving ratePredicted average peer receiving rate
Figure 3.12: Progressive model learning and the estimation for the average receiving ratert
of playing peers in channel A55F over a 12-day period. There is a 2.5-hour (15-step)delay in data collection. A comparison with the real r
t
and the server rate st
allocatedto each peer is provided.
Chapter 4
Volatility Prediction and Dimension
Reduction
Demand forecast based on past observations enables the system to allocate the right
amount of bandwidth to match the demand. An online bandwidth monitoring and reser-
vation framework, as shown in Fig. 4.1, can be implemented and incorporated into ex-
isting (non-P2P) VoD systems. It monitors the bandwidth demand in a particular video
channel periodically, e.g., every 10 minutes, learns models to forecast future demand,
and judiciously decides the amount of server bandwidth to be provisioned in the next
time period.
However, as the predictor in Chapter 3 only forecasts the conditional mean demand,
which is the expected future demand conditioned on the history, the real demand may
vary around the predicted conditional mean. Due to the existence of prediction errors,
we need to provision an additional “risk premium” to tolerate demand fluctuation, which
depends on the variance of prediction errors. Prediction errors of bandwidth demand
48
4.1. MODELING DEMAND VOLATILITY 49
Demand Forecast
Expected Demand
Server UsageMonitoring
Risk Premium
Volatility PredictionModel
Learning
Bandwidth Reservation
Figure 4.1: Online bandwidth demand monitoring and reservation. The reserved band-width should match the sum of the expected demand and a risk premium that toleratesvolatility.
usually do not have a constant variance: there are periods where the prediction is rel-
atively accurate alongside periods with less trustworthy forecasts. In other words, the
bandwidth demand is highly predictable in some periods, but becomes highly variable
and unpredictable in other periods. In this chapter, we will present methods to estimate
the changing variance of the prediction error, an important factor to consider in resource
allocation.
The predictor presented in Chapter 3 has another weakness in that a di↵erent model
needs to be trained for each video channel, which is not scalable to a large number of
video channels in practice. In this chapter, we will also propose methods to reduce the
computational complexity of the model training based on principal component analysis
(PCA). In this chapter, we focus on non-P2P systems.
4.1 Modeling Demand Volatility
The changing demand volatility poses great challenges to e�cient resource provisioning.
As traditional time-series techniques assume a constant variance for the disturbance,
4.1. MODELING DEMAND VOLATILITY 50
0 5 10 15 20−0.5
0
0.5
1
Lag
ACF
(a) ACF of Zt
0 5 10 15 20−0.5
0
0.5
1
Lag
ACF
(b) ACF of Z2t
Figure 4.2: The ACFs of the disturbance Zt
and Z2t
obtained from fitting the seasonalARIMA model (3.6) with d = 0 and p = q = 20 to bandwidth demand {D
t
} in videochannel A55FF.
the “risk premium”1 provisioned will be a function of the fixed prediction error vari-
ance averaged over the long run. However, such an approach will necessarily result in
over-provisioning when demand is less volatile, but insu�ciency when demand varies
drastically. To achieve e�cient resource allocation, we need a conditional variance fore-
cast mechanism to estimate the time-changing variance of demand around its conditional
mean, and adjust the amount of risk premium provisioned accordingly.
Such a “risk premium” should increase when we are less certain about the accuracy
of the conditional mean demand forecast, and decrease otherwise, when the conditional
variance of future demand is low. In contrast, the unconditional variance, i.e., the long-
run estimate of forecast error variance averaged over time, would not be important if we
care about the instantaneous “risk premium” needed.
Although the seasonal ARIMA model (3.8) is a good conditional mean model that
predicts the expectation of bandwidth demand Dt+h
conditioned on {Dt
, Dt�1, . . .}, it
fails to capture the serial dependency within the disturbance series {Zt
} obtained from
1The additional resources allocated to guard against demand fluctuation.
4.1. MODELING DEMAND VOLATILITY 51
fitting (3.6) to {Dt
}. To check such dependency, we plot the autocorrelation functions
(ACFs) of Zt
and Z2t
in Fig. 4.2. We can see that although {Zt
} is an uncorrelated
white noise, it is not independent and identically distributed (IID) — the variance term
Z2t
clearly depends on Z2t�1, Z
2t�2, . . .. In fact, we can observe a persistence of volatility
for Zt
from Fig. 3.6(b): Zt
tends to exhibit a similar conditional variance as in recent
periods.
To include past variances in the explanation of future variances, we model Zt
using the
GARCH (generalized autoregressive conditional heteroscedasticity) process [24], which
has been successfully applied to modeling the volatility of stock data for the past decade.
Specifically, we model the disturbance Zt
obtained from fitting model (3.8) to {Dt
} as a
GARCH(P, Q) process:
8><
>:
Zt
=p
ht
et
, {et
} ⇠ IID N (0, 1),
ht
= ↵0 +P
P
i=1 ↵i
Z2t�i
+P
Q
j=1 �j
ht�j
,
(4.1)
where ↵0 > 0 and ↵j
, �j
� 0, j = 1, 2, . . ., and ht
is the conditional variance of Zt
given
its history {Zs
; s < t}. The GARCH model reflects the evolution of the variance in data
by incorporating correlation in the sequence {ht
} of conditional variances.
Taking channel A55FF as an example, we fit a GARCH(1, 1) model to the one-step-
ahead prediction errors for the bandwidth demand {Dt
} shown in Fig. 3.6(b). The model
parameters are obtained using maximum likelihood estimation (pp. 417, [25]) based on
the prediction errors of model (3.8) during the training period. We predict the conditional
standard deviations {p
ht
} of the prediction errors for the test period using the trained
model, and plot the results in Fig. 4.3. We can see that the predicted error standard
deviation is larger when the demand prediction errors are highly variable.
4.2. VOLATILITY ESTIMATION AND RESOURCE ALLOCATION 52
697 769 841 913 985 1057
−5000
0
5000
10000
Time (unit: 10 minutes)
KB/s
Prediction errorPredicted error standard deviation
Figure 4.3: One-step-ahead forecast errors for bandwidth demand Dt
in video channelA55FF, and predicted conditional standard deviations for the forecast errors.
With the GARCH model, we are able to forecast by how much the real demand will
deviate from the predicted conditional mean produced by model (3.8). It allows us to
quantify our certainty about bandwidth demand forecast so that server provisioning can
leverage this fact to enhance resource utilization. When we are certain about the Dt
in
the next time period, the server bandwidth allocated for the channel should be close to
the predicted conditional mean. On the other hand, during periods where Dt
is subject
to rapid changes and less predictable, we need to provision a higher risk premium to
tolerate the demand volatility in the video channel.
4.2 Volatility Estimation and Resource Allocation
In this section, we present the application of volatility forecasts to resource allocation. To
achieve service level guarantees to users without over-provisioning, the resource allocated
should match the future demand with a conditional mean forecast plus a “risk premium”
that tolerates tra�c volatility.
In general, the e↵ectiveness of a resource allocation scheme can be evaluated by two
performance metrics: 1) insu�ciency ratio e, which is the ratio of time periods where
4.2. VOLATILITY ESTIMATION AND RESOURCE ALLOCATION 53
the booked resource is lower than the actual demand over all the test periods; and 2)
time-averaged utilization U , which is the average utilization of the allocated resource over
all the test periods.
To provide quality assurance to users, it is expected to maintain the insu�ciency
ratio e to a low level. On the other hand, for the cost-e↵ectiveness of servers, we should
keep the average utilization U at a high level by booking resources sparingly. Striking
a balance between the two conflicting objectives, the key to successful resource booking
is to decide the minimum necessary “risk premium” Rt+1 at time period t + 1 given
observations up to time t that achieves a target insu�ciency ratio, e.g., e 2%, under
an appropriate volatility model.
4.2.1 Comparing Five Volatility Models
We propose five proactive server bandwidth reservation schemes for (non-P2P) VoD sys-
tems, each based on a di↵erent volatility model including GARCH and other heuristics.
To reserve bandwidth for a video channel at time t + 1, all these schemes periodically
monitor the bandwidth demand {D1, . . . , Dt
} of this channel by checking server logs (e.g.,
at a 10-minute frequency), and predict the conditional mean demand Pt
Dt+1 using the
method described in Sec. 3.3.2. However, they incorporate di↵erent volatility models to
determine the “risk premium” provisioned. Specifically, assuming e = 2%, these schemes
are described as follows:
Constant variance (baseline method) assumes a constant variance �2 in demand
forecast errors, which can be learned from training data. As demand forecast errors
exhibit Gaussian distribution in the traces, the risk premium is the 98th percentile of
the normal distribution N (0, �2), i.e., the value below which 98% of samples from the
4.2. VOLATILITY ESTIMATION AND RESOURCE ALLOCATION 54
distribution N (0, �2) fall.
Probabilistic GARCH predicts the conditional variance ht+1 for the demand fore-
cast error using the GARCH model. The risk premium is the 98th percentile of the
normal distribution N (0, ht+1).
Deterministic GARCH predicts the conditional variance ht+1 for the forecast error
using the GARCH model. The risk premium is ⌘p
ht+1, where ⌘ is a positive constant
determined as the ⌘ that achieves an insu�ciency ratio e = 2% in the training data.
Recent variance is a heuristic method that calculates the sample variance �2⌧
of
demand forecast errors in the ⌧ most recent time periods. The risk premium is the 98th
percentile of the normal distribution N (0, �2⌧
).
Maximum absolute error is a heuristic method that decides the risk premium as
the maximum absolute error of demand forecasts in recent ⌧ time periods.
We test the performance of the above schemes in terms of resource utilization and in-
su�ciency ratio through simulations driven by the real-world traces of 169 popular video
channels. For each channel, we train the conditional mean model (3.8) and conditional
variance model (4.1) based on the data of 1.5 days from the 50th to the 266th time period
after the channel is released. The first 500 minutes (50 time periods) are excluded from
training as the initial demands may not conform to the later evolution patterns. To test
the generalizability of our proposed methods to any VoD demands, we only assume a
simple seasonal ARIMA model (3.8) with d = 0, p = q = 1 for demand forecast. For the
two GARCH-based methods, we train a GARCH(1, 1) model for demand forecast errors
in the training data. We then use the above 5 methods with the trained parameters to
perform one-step-ahead bandwidth reservation in each channel for a test period of 2 days
following the training period.
4.2. VOLATILITY ESTIMATION AND RESOURCE ALLOCATION 55
0 10 20 30 40 50 60 70 80 900
0.2
0.4
0.6
0.8
1
Resource Utilization (%)
CD
F
Constant varianceProbablistic GARCHDeterministic GARCHRecent varianceMax absolute error
(a) CDF of resource utilization U in 169 channels
0 2 4 6 8 10 120
0.2
0.4
0.6
0.8
1
Insufficiency Ratio (%)
CD
F
Constant varianceProbablistic GARCHDeterministic GARCHRecent varianceMax absolute error
(b) CDF of bandwidth insu�ciency ratio e in 169channels
Figure 4.4: The empirical CDF of the utilization of booked bandwidth and the ratioof time periods where the booked bandwidth is insu�cient. Bandwidth reservation isperformed in 169 popular video channels independently, each with a test period of 2days.
We calculate the average resource utilization U and insu�ciency ratio e in each of the
169 channels over its corresponding test period, and plot the empirical cumulative distri-
bution functions (CDFs) of these U ’s and e’s in Fig. 4.4(a) and Fig. 4.4(b), respectively.
We can see that “constant variance,” as the baseline method, achieves the lowest resource
utilization as well as the lowest e, because it aggressively books a high risk premium as-
suming a large constant variance for demand forecast errors. In contrast, “probabilistic
GARCH” can adjust the risk premium dynamically based on the changing forecast error
variance. From Fig. 4.4(b), we see that it achieves an insu�ciency ratio e 2% in 80%
of the 169 channels, which is close to “constant variance,” and an insu�ciency ratio
e 4% in more than 90% of the channels, which is even better than “constant variance.”
“Probabilistic GARCH” achieves better resource utilization, shown in Fig. 4.4(a), since
it only books the necessary risk premium to tolerate the instantaneous demand volatility
instead of the long-run variance.
Although “deterministic GARCH” further enhances utilization by booking the risk
premium more conservatively, its average e exceeds the target of 2%. Furthermore,
4.2. VOLATILITY ESTIMATION AND RESOURCE ALLOCATION 56
the “recent variance” and “maximum absolute error” heuristics, when compared with
the first three methods, are advantageous, since they are fully adaptive during the test
period, while the first three use fixed model parameters estimated from the training
data for prediction. Even in such an unfair comparison, both heuristics perform worse
than GARCH-based methods, resulting in an e way beyond the target of 2%. Since they
inherently lack a mechanism to quantitatively tune the tradeo↵ between U and e, they are
not appealing for the sake of quality assurance. We conjecture that a fully online version
of ARIMA and GARCH, which adaptively relearns model parameters as the simulation
proceeds in the entire test period, can yield even better utilization while being able to
constrain e within the target range.
4.2.2 Utilization as a Volatility Indicator
From the above comparison, we find that the best bandwidth reservation scheme that
strikes a balance in the U -e tradeo↵ is “probabilistic GARCH,” thanks to its superior
ability to adjust the risk premium provisioned as the volatility changes. As an example,
we apply “probabilistic GARCH” to a popular channel A55FF over a test period of 3 days,
with models learned from the preceding 3 days. The bandwidth demand is forecasted
as a seasonal ARIMA process with d = 0, p = q = 20 driven by GARCH(1, 1) forecast
errors. We plot the bandwidth provisioned by “probabilistic GARCH” in Fig. 4.5(a) and
the achieved utilization in Fig. 4.5(b) under a target insu�ciency ratio of e = 5%. The
achieved e is 3.71% and U = 75.04% on the test data.
Just as tra�c volatility limits the resource utilization e�ciency, the best achievable
utilization U produced by one-step-ahead resource booking, given a target insu�ciency
ratio e (e.g., e = 2%), is essentially a quantitative indicator of tra�c volatility in the
4.3. REDUCING VOLATILITY THROUGH DIVERSIFICATION 57
697 769 841 913 985 10570
1
2
3
4 x 104
Time (unit: 10 minutes)
KB/s
Test workloadResource provisioned
(a) The bandwidth reserved by “probabilistic GARCH.”
697 769 841 913 985 105725
50
75
100
Time (unit: 10 minutes)
Perc
enta
ge (%
)
UtilizationMean Utilization
(b) Utilization of the reserved bandwidth.
Figure 4.5: Bandwidth reservation with “probabilistic GARCH” in channel A55FF,tested on a period of 3 days.
long term. A high U means that less “risk premium” is needed and the tra�c is likely
to evolve smoothly with tractable variation. In contrast, a low U implies that the tra�c
inherently has rapid and unpredictable variations, and thus more “risk premium” must
be booked to tolerate demand fluctuation. In other words, U evaluates volatility by the
ratio of the actual bandwidth usage over the sum of expected usage and the volatile
part of usage. Therefore, in the following analysis, we use the time-average utilization U
achieved by resource booking based on “probabilistic GARCH” to evaluate the demand
volatility in the long run.
4.3 Reducing Volatility through Diversification
In this section, we consider the combined or mixed tra�c of multiple video channels. We
observe that the aggregate tra�c volatility decreases as the number of channels to be
4.3. REDUCING VOLATILITY THROUGH DIVERSIFICATION 58
combined increases. Furthermore, such volatility reduction is not merely a consequence
of a higher volume of tra�c, but more importantly is due to tra�c mixing between
channels that may exhibit diverse variations. From the previous section, we see that
U evaluates volatility by computing the ratio of the actual bandwidth usage over the
sum of the expected usage and usage uncertainty. Therefore, in this section, we evaluate
the demand volatility of mixed channels by the achievable U in one-step-ahead resource
booking, given a target of e = 2%.
To study how the number of combined channels a↵ects tra�c volatility, we conduct
a series of bandwidth reservation simulations driven by the traces of 173 video channels
in UUSee over 21 days. We divide the entire 21 days into 7 periods, each of 3 days, and
study the tra�c volatility of random combinations of video channels in each 3-day test
period (except the first 3 days, which are only used for model training).
Now we illustrate the test with the second 3-day period (days 4-6). Suppose that
the number of channels to be combined is n. We randomly choose n channels that exist
in this test period and its previous 3-day training period, and consider their aggregate
tra�c. We perform bandwidth reservation based on “probabilistic GARCH” in this test
period (days 4-6), using the models trained from the previous 3 days (days 1-3). We
assume a simple conditional mean model (3.8) with d = 0, p = q = 1 and a conditional
variance model (4.1) of GARCH(1, 1). We calculate U and e in this test period of days
4-6. For this particular value of n, the above experiment is repeated for 60 times to
allow for di↵erent combinations of channels. The mean and standard deviation of the
achieved 60 U ’s and e’s are calculated. The above procedure applies to the other 3-day
test periods, each with its preceding 3 days as the training period.
From Fig. 4.6, we see that as bandwidth is reserved for an increasing number of
4.3. REDUCING VOLATILITY THROUGH DIVERSIFICATION 59
30
40
50
60
70
80
90
Number of video channels
Mea
n U
tiliz
atio
n (%
)
21 22 23 24 25 26 27 28
Days 4−6Days 7−9Days 10−12Days 13−15Days 16−18Days 19−21
(a) Mean of utilization U
02468
101214161820
Number of video channels
Util
izat
ion
Stan
dard
Dev
iatio
n (%
)
21 22 23 24 25 26 27 28
Days 4−6Days 7−9Days 10−12Days 13−15Days 16−18Days 19−21
(b) Standard deviation of U
Figure 4.6: The mean and standard deviation of U when di↵erent numbers of videochannels are randomly combined for bandwidth reservation. The achieved average e isless than 4% in all cases.
combined video channels, the mean of the achieved U ’s increases significantly up to close
to 90%, and the standard deviation of U ’s decreases in all of the test days 4-21. As
the resource utilization in one-step-ahead bandwidth reservation essentially evaluates
the degree of volatility in data, the above phenomenon implies that the aggregate tra�c
of multiple channels demonstrates less volatility. Bandwidth provisioned to multiple
channels can thus match the demand more closely. A further check into Fig. 4.6(b) shows
that when the number of channels is small, the U ’s have a high standard deviation.
This means that although all channels follow diurnal evolution in trend, there exist
di↵erent degrees of correlation between the demand forecast errors of di↵erent channels:
the volatility of the combined tra�c is amplified when they are positively correlated and
suppressed when they are negatively correlated.
To verify that volatility reduction is not a consequence of an increased amount of total
tra�c, but because of mixing channels with di↵erent statistical properties, we perform
bandwidth reservation for all the 93 channels in a test period from time 697 to 1127,
4.4. DIMENSION REDUCTION USING PCA 60
697 769 841 913 985 10570
1
2
3
4 x 104
Time (unit: 10 minutes)KB
/s
Test workloadResource provisioned
(a) The mixed tra�c and the provisioned bandwidth.
697 769 841 913 985 1057
−4000
−2000
0
2000
4000
Time (unit: 10 minutes)
KB/s
Prediction errorPredicted error standard deviation
(b) Conditional standard deviations for forecast errors.
Figure 4.7: Bandwidth reservation with “probabilistic GARCH” for the mixed tra�c ofall the 93 channels in the 6-day period from time 264 to 1127. The first 3 days are thetraining periods, and the other 3 days (time 697-1127) are the test periods.
which is the same test period as in Fig. 3.6, Fig. 4.3, and Fig. 4.5(a), where bandwidth
reservation is performed for a single channel A55FF. However, instead of considering the
aggregate tra�c of all 93 channels (including channel A55FF), we consider a mixture of
them by taking 1/10 of user requests from each channel and adding them up. Comparing
Fig. 4.7(a) with Fig. 4.5(a), we see that the resulting mixed tra�c is roughly the same in
size as that of the single channel A55FF. But the mixed tra�c evolves more smoothly,
and its forecast errors shown in Fig. 4.7(b) not only have a smaller variance than that
of channel A55FF shown in Fig. 4.3, but also exhibits less heteroscedastic property, i.e.,
less time variation in terms of variance.
4.4. DIMENSION REDUCTION USING PCA 61
1500 1550 1600 1650 1700 1750 1800 18500
200
400
600
Time (unit: 10 minutes)
Band
wid
th (M
bps)
Channel 295Channel 241Channel 317
(a) Three large channels
1500 1550 1600 1650 1700 1750 1800 18500
10
20
30
Time (unit: 10 minutes)
Band
wid
th (M
bps)
Channel 22Channel 298
(b) Two small channels
Figure 4.8: Bandwidth consumption time series of 5 representative channels over a 2.5-day period.
4.4 Dimension Reduction using PCA
In this section, we propose methods to reduce the model training complexity involved
in demand prediction. We consider one single time period, where N video channels are
present. Suppose that in this period, video channel i’s aggregate bandwidth demand is a
random variable Di
(Mbps) with mean µi
= E[Di
] and variance �2i
= var[Di
]. Since the
random demands D1, . . . , DN
of di↵erent channels may be correlated, we denote by ⇢ij
the
correlation coe�cient of Di
and Dj
, with ⇢ii
⌘ 1. For convenience, let µµµ = [µ1, . . . , µN
]T
and ⌃ = [�ij
] be the N ⇥ N symmetric demand covariance matrix, with �ii
= �2i
and
�ij
= ⇢ij
�i
�j
for i 6= j.
Let us first review some key characteristics of the video workload dataset from UUSee.
Fig. 4.8 shows the aggregate bandwidth demand in each channel for 5 representative
channels over a 2.5-day period. We have four observations. First, the workload dataset
consists of a large number of small unpopular channels, such as those in Fig. 4.8(b),
dominated by a small number of large popular channels, such as those in Fig. 4.8(a).
Second, there is a diurnal periodicity in the access pattern of each video channel. Third,
a channel’s popularity evolution may follow some trends over days. For example, band-
width consumption in channels 241 and 317 exhibits a downward trend over the 2.5 days
4.4. DIMENSION REDUCTION USING PCA 62
in Fig. 4.8(a), with 144 time periods representing one day. Finally, both the diurnal
periodicity and daily trends become vague in small channels, such as in channel 22 in
Fig. 4.8(b).
Chapter 3 has proposed to use seasonal ARIMA processes to predict bandwidth series
in each individual channel [59] at a fine granularity of 10 minutes. The prediction is
based on a regression of historical demand in the most recent time periods, as well as
demand around the same time in previous days. However, this method is not without
shortcomings. First, a separate statistical model needs to be trained for every video
channel. As a result, the method does not scale to a large number of channels. Second,
this approach performs poorly for small channels, e.g., channel 22, or ill-behaved channels,
e.g., channel 317. In both types of video channels, the daily repetition pattern is obscured
by various random factors.
To tackle the problems mentioned above, in this section, we further use a factor
model to account for demand evolutions, i.e., the demand series {Di
(t)} of each channel
i can be viewed as driven by M uncorrelated underlying factors C1(t), . . . , CM
(t) with a
zero-mean random shock e(t):
Di
(t) = ↵i1C1(t) + . . . + ↵
iM
CM
(t) + e(t), 8 i. (4.2)
If the coe�cients ↵i1, . . . , ↵iM
can be learned statistically, we will be able to forecast
{Di
(t)} by predicting factor movements first. As noted from Fig. 4.8, channel demands
exhibit co-movements. This inspires us to mine the factors while learning their coe�cients
from the collective demand history of all the channels.
We use principal component analysis (PCA) to find such underlying factors. Given
N bandwidth demand series {D1(t)}, . . . , {DN
(t)} from N video channels, PCA applies
4.4. DIMENSION REDUCTION USING PCA 63
1500 1550 1600 1650 1700 1750 1787
−0.1
−0.05
0
0.05
0.1
0.15
0.2
Time (unit: 10 minutes)
Mbp
s
C3 C2 C1
Figure 4.9: The first 3 principal components C1—C3 in bandwidth consumption series of468 channels during 2 days via principal component analysis.
an orthogonal transformation to these N demand series to obtain a small number of
uncorrelated time series {C1(t)}, . . ., {CM
(t)} called the principal components. This
transformation is defined in such a way that data projected onto the first principal
component has as high a variance as possible (that is, accounts for as much of the
variability in data as possible), and each succeeding component in turn has the highest
variance possible.
We perform PCA for all the 468 channels that are online in a 2-day period (time
1500-1787). Fig. 4.9 plots the first 3 principal component series in the data. We can see
the first component C1 explains the diurnal periodicity shared by all the channels. The
second component C2 accounts for the downward daily trend, which is salient in channel
317 as popularity diminishes and less salient in channel 295. A further check of Fig. 4.10
reveals that the first 10 principal component series explain 99% of the data variability,
which are su�cient to model the factors underlying all the demand evolution.
However, di↵erent channels have di↵erent dependencies on each factor. Fig. 4.11
shows the 468 demand series projected onto the first 2 components, i.e., the point
4.4. DIMENSION REDUCTION USING PCA 64
1 2 3 4 5 6 7 8 9 100102030405060708090
100
Principal Component
Varia
nce
Expl
aine
d (%
)
Figure 4.10: Variance explained in band-width consumption series of 468 channelsin 2 days.
0 1000 2000 3000−1500
−1000
−500
0
500
1000
1500
1st Principal Component
2nd
Prin
cipa
l Com
pone
nt
Channel 295
Channel 241
Channel 317
Channel 22
Channel 298
Figure 4.11: Data projected onto the first 2principal components. Each point is one ofthe 468 channels.
(↵i1, ↵i2) for all i. Without surprise, the dependence on the first component, ↵
i1, in-
dicates how large the channel is. In contrast, the dependence on the second component,
↵i2, accounts for how fast end-users may lose interest in channel i. Channel 295 has a
low ↵i2, indicating almost no decrease in popularity over days. Channel 241 has a mod-
erate ↵i2, showing a slightly downward trend. Channel 317 has a large ↵
i2, exhibiting a
dramatic decrease of popularity just on the second day.
To predict the demand means µµµ(t) and the covariance matrix ⌃(t) for {Di
(t) : i =
1, . . . , N} at time t, we first predict the principal components C1(t), . . . , CM
(t) for M = 10
based on the history and obtain forecasts about their means C(t) = [C1(t), . . . , CM
(t)]T
and their covariances ⌃C
(t). We further predict the error series e(t) to obtain its forecast
e(t) and error variance �2e
(t). Note that ⌃C
(t) is a diagonal matrix because the principal
components are uncorrelated. Denote the coe�cient matrix as AN⇥M
= [↵im
], i =
4.4. DIMENSION REDUCTION USING PCA 65
100 101 102
1
2357
101520305070
100160
Channel Index in Log Scale (Sorted by Channel Size)
Mbp
s (L
og S
cale
)
468
Ch 295
Ch 241
Ch 317
Mean Bandwidth ConsumptionRMSE of PCA−based PredictionRMSE of Individual Channel Prediction
Figure 4.12: Root mean squared errors (RMSEs) of the two prediction methods overtest period 1680-1860 (1.25 days), compared to mean bandwidth consumption in eachchannel. The three largest channels are channels 295, 241 and 317.
1, . . . , N , m = 1, . . . , M . We can therefore forecast µµµ(t) and ⌃(t) as
µµµ(t) =AC(t) + e(t) · 1, (4.3)
⌃(t) =A⌃C
(t)AT + �2e
(t) · [1, . . . ,1], (4.4)
where 1 is an all-one column vector of length N and [1, . . . ,1] is an all-one matrix of
size N ⇥ N .
We model each principal component series using a low-order seasonal ARIMA model
[25]. Since {C1(t)} clearly shows daily periodicity, we model {C1(t) � C1(t � 144)} as
an ARMA(1, 1) process, so that the forecast C1(t) is regressed from both the previous
value C1(t � 1), the values one day before C1(t � 144), C1(t � 145), and random noise
terms. All other components {Ci
(t)} for i � 2 do not exhibit periodicity. We thus use
ARMA(1, 1) processes to model these principal components. The conditional variances
4.4. DIMENSION REDUCTION USING PCA 66
1680 1730 1780 1830 18600
50
100
Time (unit: 10 minutes)
Band
wid
th (M
bps)
Original dataPCA−based Prediction
1680 1730 1780 1830 18600
50
100
150
Time (unit: 10 minutes)
Band
widt
h (M
bps)
Original dataIndividual Prediction
Figure 4.13: 10-minute-ahead conditional mean prediction in channel 172 over a testperiod of 1.25 days.
of all the component series are forecasted using GARCH(1, 1) models [25, 58]. Since the
components are orthogonal, we do not need to forecast their covariances.
We compare PCA-based prediction with individual channel prediction over a test pe-
riod of 1.25 days. Each prediction is made based on the training data of only the previous
1.25 days, which are a little more than one day to incorporate periodicity. The root mean
squared errors (RMSEs) of both approaches for all the 468 channels are summarized in
Fig. 4.12. The channel indices are sorted in descending order of the channel size. We
can see that the PCA-based approach outperforms individual predictions regardless of
channel sizes. For large channels, the ratio of RMSE over mean bandwidth consumption
is less than 15% in most cases using the PCA-based approach.
To zoom in, we take channel 172 as an example. Fig. 4.13 compares the conditional
mean predictions produced by the PCA-based approach with those produced by individ-
ual prediction. We observe that individual prediction tends to oscillate drastically, while
the PCA-based approach can better identify both periodicity and downward trends. One
4.4. DIMENSION REDUCTION USING PCA 67
1680 1730 1780 1830 1860
−70−50−30−10
103050
Time (unit: 10 minutes)
Mbp
s
Departure from Mean ForecastPredicted Standard Deviation
(a) PCA-based Prediction
1680 1730 1780 1830 1860
−110−90−70−50−30−10
10305070
Time (unit: 10 minutes)
Mbp
s
Departure from Mean ForecastPredicted Standard Deviation
(b) Individual Prediction
Figure 4.14: The departure of actual bandwidth consumption from its conditional meanforecast and the predicted standard deviation of bandwidth consumption in channel 172.
reason is that the driving factors found by PCA are weighted averages over all the chan-
nels, with channel-specific erratic noises smoothed out, exhibiting co-movements of all
the channels.
The standard deviation forecast of channel 172 is plotted in Fig. 4.14. Even though
there is a big gap between real demand and its conditional mean forecast around time
1700, the GARCH(1, 1) model is able to forecast a larger demand variance at this point,
which will be leveraged by the video service provider to allocate more capacity to guard
against performance risks. It is worth noting that we do not assume that demand can al-
ways be perfectly forecasted: the entire point of variance or volatility forecast via GARCH
is to estimate the deviation of actual demand from the conditional mean prediction and
enable risk management in a probabilistic sense. Fig. 4.15 shows the Q-Q plot of forecast
errors. We observe that with PCA, the actual demand will oscillate around its condi-
tional mean forecast more like a Gaussian process. This also substantiates the belief that
each Di
behaves like a Gaussian random variable.
4.5. SUMMARY 68
−4 −2 0 2 4
−60
−40
−20
0
20
40
60
Standard Normal Quantiles
Fore
cast
Erro
r Qua
ntile
s
(a) PCA-based Prediction
−4 −2 0 2 4
−60
−40
−20
0
20
40
60
Standard Normal Quantiles
Fore
cast
Erro
r Qua
ntile
s
(b) Individual Prediction
Figure 4.15: Q-Q plot of conditional mean forecast errors in channel 172 over the testperiod 1680-1860 (1.25 days), in reference to Gaussian quantiles.
Last but not least, the PCA-based approach has a lower complexity: it involves
training a seasonal ARIMA model for each of the 10 principal components, together
with finding these components from the 468 channels using PCA. Once the models are
trained, forecasting is simply a linear regression with negligible running time. In contrast,
individual prediction has to train a seasonal ARIMA model for each of the 468 channels
separately, leading to a much higher complexity.
4.5 Summary
In this chapter, we have focused on demand volatility forecasts in large-scale operational
VoD systems, with the objective of dynamically and e�ciently provisioning bandwidth
resources in VoD servers. We introduce GARCH models originated from econometrics to
predict demand volatility based on resource usage monitoring. We propose and compare
five volatility-aware resource provisioning schemes, based on GARCH modeling and other
4.5. SUMMARY 69
volatility heuristics. It is shown that GARCH models yield the best tradeo↵ in terms of
resource utilization and the service level provided to users. We further study the volatility
reduction due to diversification when tra�c of multiple video channels is mixed. As a
result, allocating resources for a well diversified collection of video channels can improve
resource utilization by up to 3 times as opposed to single-channel allocation. We have also
used factor models and principal component analysis (PCA) to reduce the computational
complexity involved in the statistical forecast model training, when a large number of
video channels coexist in a real-world VoD system.
Chapter 5
Cloud Bandwidth Reservation for
VoD Applications
Cloud computing is redefining the way many Internet services are operated and provided,
including Video-on-Demand (VoD). Instead of buying racks of servers and building pri-
vate data centers, it is now feasible for VoD companies to use computing and bandwidth
resources of cloud service providers. As an example, Netflix moved its streaming servers,
encoding software, data stores and other customer-oriented APIs to Amazon Web Ser-
vices (AWS) in 2010 [12]. As has been mentioned in Chapter 1, it is now becoming
feasible to make egress1 bandwidth reservation from a VM in the cloud to the Internet
[19, 39]. Within this context, we analyze the benefits and address open challenges of
cloud bandwidth auto-scaling for VoD applications.
As has been explained in Chapter 1, bandwidth auto-scaling can potentially achieve
high resource utilization and quality assurance at the same time. However, a number of
1In this thesis, we assume the connection bottleneck lies in the outgoing bandwidth of data centers,while the download bandwidth at end users and the paths in the Internet are well provisioned.
70
CHAPTER 5. CLOUD BANDWIDTH RESERVATION FOR VOD APPLICATIONS71
important challenges need to be addressed to realize bandwidth auto-scaling for a VoD
provider. First, since resource rescaling requires a delay of at least several minutes to
update configuration and launch instances, it is best to predict the demand with a lead
time greater than the update interval, and scale the capacity to meet anticipated demand.
Such a proactive, rather than passive, strategy for resource provisioning needs to take
into account demand fluctuations in order to avoid bandwidth insu�ciency. Second, since
statistical multiplexing can smooth tra�c, a VoD provider can reserve less bandwidth
to guard against fluctuations by jointly reserving bandwidth for all its video channels.
However, to serve geographically distributed end users, a VoD provider usually has its
collection of video channels served by multiple data centers, which are geographically
distributed and are possibly owned by di↵erent cloud providers. The question is — how
should the VoD provider optimally split and direct its workload across data centers to
save the overall bandwidth reservation cost?
In this chapter, we propose a bandwidth auto-scaling facility that dynamically reserves
resources from multiple data centers for VoD providers, with several distinct features.
First, it is predictive. The facility tracks the history of bandwidth demand in each
video channel using cloud monitoring services, and periodically estimates the expectation,
volatility and correlations of demands in all video channels for the near future using time-
series techniques proposed in the previous chapters. We propose a channel interleaving
scheme that can even predict demand for new videos that lack historical demand data.
Second, it provides quality assurance by judiciously deciding the minimum bandwidth
reservation needed to satisfy the demand with high probability. Third, it optimally mixes
anti-correlated (negatively correlated) demands to save the aggregate bandwidth capacity
reserved from all data centers, while confining risks of under-provisioning.
5.1. SYSTEM ARCHITECTURE 72
DC 1
DC 2
DC
w11
w21
ws1
Demand Predictor
BandwidthUsageMonitor
Load Optimizer
Demand HistoryChannel 1
Demand HistoryChannel
W
i
s
w1i
w2i
wsi
. . .
. . .
Load Direction
Demand Projections
BandwidthReservation
Figure 5.1: The system decides the bandwidth reservation from each data center anda matrix W = [w
si
] every �t minutes, where wsi
is the proportion of video channel i’srequests directed to data center s. DC: data center.
We formulate the bandwidth reservation minimization problem given the predicted
demand statistics as input, derive the theoretically optimal load direction policy across
data centers, and propose heuristic solutions that balance bandwidth and storage costs.
The proposed facility is evaluated through extensive trace-driven simulations based on a
large data set of 1693 video channels collected from UUSee, a production VoD system,
over a 21-day period.
5.1 System Architecture
Consider a VoD provider with N video channels, relying on S data centers for service,
which are possibly owned by di↵erent cloud service providers. We propose an unobtrusive
auto-scaling system that draws beliefs about future demands of all channels and reserves
minimal resources from multiple data centers to satisfy the demand. Our system archi-
tecture, shown in Fig. 5.1, consists of three key components: bandwidth usage monitor,
5.1. SYSTEM ARCHITECTURE 73
Channel 1
Band
wid
th
Reserved Capacity A1 A2
(a) (b)
DC 1 DC 2Asum Asum/2
(d) (e)(c)
Asum/2
Time 10 Minutes0
Channel 2
Reserved Capacity
Time 10 Minutes0 Time 10 Minutes0 Time 10 Minutes0 Time 10 Minutes0
Figure 5.2: Using demand correlation between channels, we can save the total bandwidthreservation, even within each 10-minute period, while still providing quality assurance toeach channel. DC: data center.
demand predictor and load optimizer. Bandwidth rescaling happens proactively every �t
minutes, with the following three steps:
First, before time t, the system collects bandwidth demand history of all channels up
to time t, which can easily be obtained from cloud monitoring services. As an example,
Amazon CloudWatch provides a free resource monitoring service to AWS customers at a
5-minute frequency [2].
Second, the bandwidth demand history of all channels is fed into the demand predictor
to predict bandwidth requirement of each video channel in the next �t minutes, i.e., in
the period [t, t + �t). Our predictor not only forecasts the expected demand, but also
outputs a volatility estimate, which represents the degree that demand will be fluctuating
around its expectation, as well as the demand correlations between di↵erent channels in
this period. Our volatility and correlation estimation is based on multivariate GARCH
models [24], which gained success in stock modeling in the past decade.
Finally, the load optimizer takes predicted statistics as the input, and calculates the
bandwidth capacity to be reserved from each data center. It also outputs a load direction
matrix W = [wsi
], where wsi
represents the portion of video channel i’s workload directed
to data center s. Apparently, we should haveP
s
wsi
= 1 if the aggregate data center
capacity is su�cient. The matrix W also indicates the content placement decision: video
5.1. SYSTEM ARCHITECTURE 74
i is replicated to data center s only if wsi
> 0. In practice, the load direction W can
be readily implemented by routing the requests for video channel i to data center s with
probability wsi
.
The system finishes the above three steps before time t, so that a new bandwidth
reservation can be performed at time t for the period [t, t + �t), after which the above
process is repeated for the next period [t + �t, t + 2�t).
Bandwidth Reservation vs. Load Balancing. One may be tempted to think
that periodic bandwidth reservation is unnecessary, since requests can be flexibly directed
to whichever data center has available capacity by a load balancer. However, the latter
will exactly fall in the range of pay-as-you-go model with no quality guarantee to VoD
users, whereas bandwidth reservation ensures that the provisioned resource exceeds the
projected demand with high probability.
Furthermore, a major di�culty of load balancing is that data centers might be owned
by di↵erent cloud providers. As a result, it is complicated if not infeasible to implement
a gateway that can continuously watch resources of each cloud provider (instead of every
5 minutes as o↵ered by free cloud watch services) and redirect requests instantaneously
across cloud providers. Even if requests can be redirected instantaneously to lightly
loaded data centers when the playback quality degrades, significant engineering e↵orts
are required to monitor the video playback quality at end users.
Quality-Assured Bandwidth Multiplexing. The bandwidth demand of each
video channel can fluctuate drastically even at small time scales. To avoid performance
risks, the bandwidth reservation made for each channel in each �t period should accom-
modate such fluctuations, inevitably leading to low utilization at troughs, as illustrated
in Fig. 5.2(a) and (b). Trough filling within a short period such as 10 minutes is hard,
5.2. LOAD DIRECTION AND BANDWIDTH RESERVATION 75
since there are too many random shocks in demand in such a short period.
However, our load optimizer strives to enhance utilization even when �t is as small
as 10 minutes by multiplexing demands based on their correlations. The usefulness of
anti-correlation is illustrated in Fig. 5.2(c): if we jointly book capacity for two negatively
correlated channels, the total reserved capacity is Asum < A1 + A2. Besides aggregation,
we can also take a part of the demand for each channel, mix them and reserve bandwidth
for the mixed demands from multiple data centers. As an example, in Fig. 5.2(d) and
(e), the aggregate demand of two channels is split into two data centers, each serving a
mixture of demands, which still leads to a total bandwidth reservation of Asum. In each
�t period, we leverage the estimated demand correlations to optimally direct workloads
across data centers so that the total bandwidth reservation necessary to guarantee quality
is minimized.
5.2 Load Direction and Bandwidth Reservation
In this section, we focus on the design of the load optimizer, which has been presented
in our published work [60]. Suppose before time t, we have obtained the estimates about
demands in the coming period [t, t + �t). Our objective is to decide load direction W
so as to minimize the total bandwidth reservation while controlling the under-provision
risk in each data center.
We first introduce a few useful notations. Since we are considering an individual time
period, without loss of generality, we drop subscript t in our notations. Recall that the
VoD provider runs N video channels. The bandwidth demand of channel i is a random
variable Di
with mean µi
and variance �2i
. For convenience, let D = [D1, . . . , DN
]T,
µµµ = [µ1, . . . , µN
]T and ��� = [�1, . . . , �N
]T.
5.2. LOAD DIRECTION AND BANDWIDTH RESERVATION 76
Note that the random demands D1, . . . , DN
may be highly correlated due to the
correlation between video genres, viewer preferences and video release times. Denote ⇢ij
the correlation coe�cient of Di
and Dj
, with ⇢ii
⌘ 1. Let ⌃ = [�ij
] be the N ⇥ N
symmetric demand covariance matrix, with �ii
= �2i
and �ij
= ⇢ij
�i
�j
for i 6= j.
The VoD provider will book resources from S data centers. Denote Cs
the upper
bound on the bandwidth capacity that can be reserved from data center s, for s =
1, . . . , S. Cs
may be limited by the available instantaneous outgoing bandwidth at data
center s, or may be intentionally set by the VoD provider to geographically spread its
workload and avoid booking resources from a single data center. Let Csum =P
s
Cs
be the
aggregate utilizable bandwidth capacity of all S data centers. Throughout the chapter,
we assume that Csum is su�ciently large to satisfy all the demands in the system.2
We define a load direction decision as a weight matrix W = [wsi
], s = 1, . . . , S,
i = 1, . . . , N , where wsi
represents the portion of video i’s demand directed to and served
by data center s, with 0 wsi
1 andP
s
wsi
= 1. We observe that ws
= [ws1, . . . , wsN
]T
represents the workload portfolio of data center s. Given ws
, the aggregate bandwidth
load imposed on data center s is a random variable
Ls
=P
i
wsi
Di
= wTs
D. (5.1)
We use As
to denote the amount of bandwidth reserved from data center s for this
period. Clearly, we must have As
Cs
. Let A =: [A1, . . . , AS
]T. To control the under-
provision risk, we require the load imposed on data center s to be no more than the
2A rigorous condition for supply exceeding demand is given in Theorem 1.
5.2. LOAD DIRECTION AND BANDWIDTH RESERVATION 77
reserved bandwidth As
with high probability, i.e.,
Pr(Ls
> As
) ✏, 8 s, (5.2)
where ✏ > 0 is a small constant, called the under-provision probability.
5.2.1 The Optimal Load Direction
Given demand expectations µµµ and covariances ⌃, and the available capacities C1, . . . , CS
,
the load optimizer can decide the optimal bandwidth reservation A⇤ and load direction
W⇤ by solving the following optimization problem:
minW,A
Ps
As
(5.3)
s.t. As
Cs
, 8 s, (5.4)
Pr(Ls
> As
) ✏, 8 s, (5.5)
Ps
wsi
= 1, 8 i. (5.6)
Through reasonable aggregation, we believe that Ls
follows a Gaussian distribution.
We will empirically justify this assumption in Sec. 5.3 using real-world traces. When Ls
is Gaussian-distributed, constraint (5.2) is equivalent to
As
� E[Ls
] + ✓pvar[L
s
], with ✓ := F�1(1 � ✏), (5.7)
where F (·) is the CDF of the normal distribution N (0, 1). For example, when ✏ = 2%,
5.2. LOAD DIRECTION AND BANDWIDTH RESERVATION 78
we have ✓ = 2.05. Since
E[Ls
] = µ1ws1 + . . . + µN
wsN
= µµµTws
,
var[Ls
] =P
i,j
⇢ij
�i
�j
wsi
wsj
= wTs
⌃ws
,
it follows that (5.2) is equivalent to
As
� µµµTws
+ ✓p
wTs
⌃ws
. (5.8)
Therefore, the bandwidth minimization problem (5.3) is now converted to
minW
Ps
As
(5.9)
As
= µµµTws
+ ✓pwT
s
⌃ws
, (5.10)
s.t. µµµTws
+ ✓p
wTs
⌃ws
Cs
, 8 s, (5.11)
Ps
ws
= 1, (5.12)
0 � ws
� 1, 8 s, (5.13)
where 1 = [1, . . . , 1]T and 0 = [0, . . . , 0]T are N -dimensional column vectors. We can
derive nearly closed-form solutions to problem (5.9) in the following theorem:
Theorem 1 When Csum � µµµT1 + ✓p1T⌃1, an optimal load direction matrix [w⇤
si
] is
given by
w⇤si
= ↵s
, 8 i, s = 1, . . . , S, (5.14)
5.2. LOAD DIRECTION AND BANDWIDTH RESERVATION 79
where ↵1, . . . , ↵S
can be any solution to
X
s
↵s
= 1, 0 ↵s
min
⇢1,
Cs
µµµT1 + ✓p1T⌃1
�, 8 s. (5.15)
When Csum < µµµT1+✓p1T⌃1, there is no feasible solution that satisfies constraints (5.11)
to (5.13).
Proof: The function f(ws
) =p
wTs
⌃ws
is a cone and a convex function. We have
f
0
B@w1 + w2
2
1
CA f(w1) + f(w2)
2, (5.16)
or equivalently,
p(w1 + w2)T⌃(w1 + w2)
qwT
1⌃w1 +q
wT2⌃w2.
Applying the above inequality iteratively, we can prove
X
s
pwT
s
⌃ws
�s
(X
s
wTs
)⌃(X
s
ws
) =pwT⌃w. (5.17)
If w = 1 is feasible, by (5.11) and (5.17) we have
X
s
Cs
� µµµT1 + ✓pwT⌃w = µµµT1 + ✓
p1T⌃1. (5.18)
IfP
s
Cs
� µµµT1 + ✓p1T⌃1, it is easy to verify (5.15) is feasible. When w⇤
si
= ↵s
given
by (5.14), we find (5.11), (5.12) and (5.13) are all satisfied. Hence, (5.14) is a feasible
5.2. LOAD DIRECTION AND BANDWIDTH RESERVATION 80
solution and w = 1 is feasible. By (5.17), the objective function (5.9) satisfies
X
s
(µµµTws
+ ✓p
wTs
⌃ws
) � µµµTw + ✓pwT⌃w = µµµT1 + ✓
p1T⌃1. (5.19)
We find that [w⇤si
] given by (5.14) achieves the above inequality with equality, and thus
is also an optimal solution to (5.9). ut
Theorem 1 implies that in the optimal solution, each video channel should split and
direct its workload to S data centers following the same weights ↵1, . . . , ↵S
, which can be
found by solving the linear constraints (5.15). Moreover, the optimal workload portfolio
of each data center s has a similar structure of ws
= ↵s
1, where ↵s
depends on its
available capacity Cs
through the constraints (5.15).
Under the optimal load direction, the aggregate bandwidth reservation reaches its
minimum value:
Ps
A⇤s
=P
s
(µµµTw⇤s
+ ✓pw⇤T
s
⌃w⇤s
) = µµµT1 + ✓p1T⌃1,
which does not depend on S, the number of data centers. This means that under the
optimal load direction, having demand served by multiple data centers instead of one big
data center does not increase bandwidth reservation cost as long as we have wsi
= ↵s
,
8 i given by (5.14). In other words, the multiplexing gain stems from combining all
workloads from di↵erent video channels and serving them aggregately, regardless of the
number of data centers used to serve these workloads. Therefore, the load optimizer can
first aggregate all the demands and then split the aggregated demand into di↵erent data
centers subject to their capacities.
5.2. LOAD DIRECTION AND BANDWIDTH RESERVATION 81
5.2.2 Suboptimal heuristics with Limited Replication
Although solution (5.14) is optimal and e�cient, it encounters two major obstacles in
practice. First, as long as ↵s
> 0, w⇤si
= ↵s
> 0 for all i, which means that data center
s has to store all N videos. In other words, a video has to be replicated at all the data
centers with ↵s
> 0. This incurs significant additional storage fees at the VoD provider
charged by data centers. Second, each video channel i splits its workload into S data
centers according to the weights ↵1, . . . , ↵S
. When S is large and Di
is small, such fine-
grained splitting will not be technically feasible. Therefore, in practice, we need to limit
the replication degree of each video, or equivalently, limiting the number of videos stored
in each data center.
To achieve the above goal, we propose suboptimal solutions to problem (5.9) that
addresses replication concerns. First, we need the following heuristic to bridge the optimal
load direction to replication-limited load direction:
Heuristic 1 Per-DC Optimal. The algorithm iteratively outputs w⇤⇤1 , . . . ,w⇤⇤
S
for one
data center after another. Initially, set b = 1. Repeat the following for s = 1, . . . , S:
1) Solve the following problem to obtain w⇤⇤s
:
maxws
µµµTws
(5.20)
s.t. µµµTws
+ ✓p
wTs
⌃ws
As
Cs
, (5.21)
0 ws
b. (5.22)
2) Replace b in (5.22) by b � w⇤⇤s
.
The program terminates if b = 0.
5.2. LOAD DIRECTION AND BANDWIDTH RESERVATION 82
Heuristic 1 packs the random demands into each data center, one after another, by
maximizing the expected demand µµµTws
that each data center s can accommodate subject
to the probabilistic performance guarantee in (5.21). As a result, the total amount of re-
sources needed to guard against demand variability is reduced. Clearly, under Heuristic 1,
the aggregate bandwidth reservation from all data centers is
Ps
A⇤⇤s
=P
S
s=1(µµµTw⇤⇤
s
+ ✓pw⇤⇤T
s
⌃w⇤⇤s
). (5.23)
Note that Heuristic 1 is also computationally e�cient since (5.20) is a standard second-
order cone program.
Now we can introduce the replication-limited load direction, which only requires the
VoD provider to upload k videos to each data center. We modify Heuristic 1 to cope
with this constraint as follows:
Heuristic 2 Per-DC Limited Channels. The algorithm iteratively outputs w01, . . . ,w
0S
for one data center after another. Initially, set b = 1. Repeat the following for s =
1, . . . , S:
1) Solve problem (5.20) to obtain w⇤⇤s
.
2) Choose the top k channels with the largest weights and solve problem (5.20) again
only for these k channels to obtain w0s
.
3) Replace b in (5.22) by b � w0s
.
The program terminates if b = 0.
5.3. EXPERIMENTS BASED ON REAL-WORLD TRACES 83
Under Heuristic 2, the aggregate bandwidth reserved is
Ps
A0s
=P
S
s=1(µµµTw0
s
+ ✓pw0T
s
⌃w0s
). (5.24)
In Sec. 5.3, we will show through trace-driven simulations that Heuristic 2, though sub-
optimal, e↵ectively limits the content replication degree, thus balancing the savings on
storage cost and bandwidth reservation cost for VoD providers.
5.3 Experiments based on Real-World Traces
We conduct a series of simulations to evaluate the performance of our bandwidth auto-
scaling system. The simulations are driven by the replay of the workload traces of UUSee
video-on-demand system over a 21-day period during the 2008 Summer Olympics [50].
The dataset contains performance snapshots taken at a 10-minute frequency of 1693
video channels, including sports events, movies, TV episodes and other genres. The
statistics we use in this chapter are the time-averaged total bandwidth demand in each
video channel in each 10-minute period. There are 144 time periods in a day. We ask
the question—what the performance would have been if UUSee had all its workload in
this period served by cloud services through our bandwidth auto-scaling system?
We assume that the bandwidth demand of channel i at any point in the period
[t, t + �t) can be represented by the same random variable Dit
. This is a reasonable
assumption when �t is small. Similarly, let µµµt
= [µ1t, . . . , µNt
] and ⌃t
= [�ijt
] represent
the demand expectation vector and demand covariance matrix for all N channels in
[t, t + �t). We assume that before time t, the system has already collected all the
demand history from cloud monitoring services with a sampling interval of �t. We
5.3. EXPERIMENTS BASED ON REAL-WORLD TRACES 84
implement statistical learning and demand prediction techniques presented in Chapter 3
and Chapter 4 to forecast the expected demands µµµt
and demand covariance matrix ⌃t
every 10 minutes. The model parameters are retrained daily, with training data being the
bandwidth demand series {Di⌧
} in the recent 1.25 days of each channel i. Once trained,
the models will be used for the next 24 hours.
We conduct performance evaluation for 4 typical time spans which are near the be-
ginning, middle and end of the 21-day duration. Although video users may join or quit a
channel unexpectedly, our prediction mechanism is still e↵ective, since it deals with the
aggregate demand in the channel, which features diurnal evolution patterns. We assume
that there is a pool of data centers from which UUSee can reserve bandwidth. To spread
the load across data centers, we set Cs
= 300 Mbps for each s. The QoS parameter
✓ := F�1(1 � ✏) is set to ✓ = 2.05 to confine the under-provision probability to ✏ = 2%.
5.3.1 A Channel Interleaving Scheme
There are two practical challenges with regard to demand prediction. First, many of
the 1693 video channels are released during the 21 days: a new video does not have
su�cient demand history for statistical model learning. Second, there are many small
channels with only a few users online for which prediction is hard. We propose a channel
interleaving scheme to circumvent these obstacles. It has been shown from the traces
[59] that videos released on di↵erent dates but around the same time of day exhibit
similar initial demand evolution patterns, though with possibly di↵erent popularities.
The main reason is that most users watch VoD channels around several peak times of a
day. Therefore, it is possible to predict demands for new videos based on earlier videos.
We define virtual new channel k as a combination of all video channels with an
5.3. EXPERIMENTS BASED ON REAL-WORLD TRACES 85
433 577 721 865 1009 1153 1297 1441 1585 18000
200
400
600
Time (unit: 10 minutes)
Aggr
egat
e ba
ndw
idth
(Mbp
s)
Trace data 1−step prediction Training data
Figure 5.3: The conditional mean demand prediction for virtual new channel 11, with atest period of 1.5 days from time 1585 to 1800.
age less than 1.25 days and released in hour k 2 {1, . . . , 24} on any date. Upon release, a
new video joins virtual new channel k based on its release hour k. Similarly, we aggregate
small video channels and set up 24 virtual small channels. When a video reaches the
age of 1.25 days, it quits its virtual new channel. If its demand never exceeded a threshold
(e.g., 40 Mbps) in the first 1.25 days, it will join one of the virtual small channels in a
round robin fashion. Otherwise, it becomes a mature channel.
Each mature or virtual channel is deemed as an entity to which predictions and
optimizations are applied. For example, we make 10-minutes-ahead conditional mean
prediction for virtual new channel 11 and plot results in Fig. 5.3. The bandwidth demand
exhibits repetition of a similar pattern because the videos in this virtual channel are all
released in hour 11 (possibly on di↵erent dates). Although conditional mean prediction
is subject to errors, the GARCH model can forecast the changing error variance, which
contributes to the risk constraint (5.11) in resource minimization (5.9).
5.3.2 Algorithms for Comparison
We compare our optimal load direction (5.14), Heuristic 1 and Heuristic 2, with the
following benchmark algorithms:
5.3. EXPERIMENTS BASED ON REAL-WORLD TRACES 86
Reactive without Prediction: Initially replicate each video to K random data
centers. This limits the initial content replication degree to K. Each client requesting
channel i is randomly directed to a data center that has video i and idle bandwidth
capacity. A request is dropped if there is no such data center. In this case, the algorithm
reacts by replicating video i to an additional data center chosen randomly that has idle
capacity. Replicating content is not instant, and thus we assume that the replication
involves a delay of one period of time.
Random with Prediction: Initially, let s = 1 and b = 1. Second, randomly
generate ws
in (0,b) and rescale it so that the QoS constraint (5.11) is achieved with
equality for s. Update b to b � ws
and update s to s + 1. Go to the second step unless
b = 0 or s = S + 1, in which case the program terminates.
The reactive scheme represents provisioning for peak demand in Fig. 1.1 in some way,
with limited replication. It does not leverage prediction or bandwidth reservation. We
assume in Reactive, the total cloud capacity allocated is always the minimum capacity
needed to meet the peak demand in the system. The random scheme leverages prediction
and makes bandwidth reservation, but randomly directs workloads instead of using anti-
correlation to minimize bandwidth reservation.
5.3.3 Assumption Validation
First, we verify that the demand Di⌧
approximately follows Gaussian distribution in each
10-minute period. As an example, we apply one-day-lagged di↵erences (the lag is 144
when �t = 10 minutes) onto {Di⌧
} to remove daily periodicity to obtain the transformed
5.3. EXPERIMENTS BASED ON REAL-WORLD TRACES 87
−3 −2 −1 0 1 2 3−15
−10
−5
0
5
10
15
Standard Normal Quantiles
Qua
ntile
s of
Inpu
t Sam
ple
(a) Zit of a typical channel
−3 −2 −1 0 1 2 3−1500
−1000
−500
0
500
1000
1500
Standard Normal Quantiles
Qua
ntile
s of
Inpu
t Sam
ple
(b)P
i Zit of all channels combined
Figure 5.4: QQ plot of innovations for t =1562—1640 vs. normal distribution.
series {D0i⌧
:= Di⌧
�Di⌧�144}, which can be modeled as a low-order autoregressive moving-
average (ARMA) process:
8><
>:
D0i⌧
� �i
D0i⌧�1 = N
i⌧
+ �i
Ni⌧�1,
D0i⌧
= Di⌧
� Di⌧�144,
(5.25)
where {Ni⌧
} ⇠ WN(0, �2) denotes the noise.
For each channel i, given conditional mean prediction µit
at time t, the innovation
is Zit
:= Dit
� µit
. Fig. 5.4(a) shows the QQ plot of Zit
for a typical channel i = 121
from time period 1562 to 1640, which indicates {Zit
} sampled at 10-minute intervals is
a Gaussian process. Thus, it is reasonable to assume Dit
follows a Gaussian distribution
within the 10 minutes following t, with mean µit
. Fig. 5.4(b) shows the QQ plot ofP
i
Zit
,
which indicates that the aggregated demandP
i
Dit
tends to Gaussian even if Djt
is not
Gaussian for some channel j. Since the load Ls
of each data center is aggregated from
many videos, it is reasonable to assume Ls
is Gaussian.
5.3. EXPERIMENTS BASED ON REAL-WORLD TRACES 88
702 712 722 732 742 752 762 772 7800
10
20
30
40
50
Time (unit: 10 minutes)
Uns
atis
fied
chan
nels
Per−DC optimalPer−DC limited channelsReactive without prediction
(a) Channels with unsatisfied requests
702 712 722 732 742 752 762 772 78040
60
80
100
120
Time (unit: 10 minutes)
Util
izat
ion
(%)
Per−DC optimalPer−DC limited channelsReactive without prediction
(b) Resource Utilization
702 712 722 732 742 752 762 772 7802
4
6
8
10
12
14
Time (unit: 10 minutes)
Rep
licat
ion
degr
ee
Per−DC optimalPer−DC limited channelsReactive without prediction
(c) Replication Degree
Figure 5.5: Predictive vs. reactive bandwidth provisioning for a typical peak period 702–780. There are 35 data centers available, each with capacity 300 Mbps, and 91 channels,including 52 popular channels, 24 small channels, 15 non-zero new channels. K = 2,k = 10.
5.3.4 Predictive Auto-Scaling vs. Reactive Provisioning
We implement all of the five schemes discussed above, and present their performance
comparison in Table 5.1 for each of the four time spans. Note that the channels in the
table include mature channels, virtual new and virtual small channels. The number of
videos in each virtual channel can vary over time. As new videos are introduced, more
channels are present in later test periods. We evaluate the performance with regard to
QoS, bandwidth resource occupied, and replication cost.
5.3. EXPERIMENTS BASED ON REAL-WORLD TRACES 89
Table 5.1: The performance of five schemes averaged over each test period, in terms ofQoS, resource utilization, and replication overhead.
PeriodsTime periods 702—780 (91 mature and virtual channels)
Peak demand 6.56 Gbps, mean demand 5.19 GbpsShort Drop Util Rep Booked Over-prov
Optimal 0.2 Chs 0.66% 92.9% 91.0 6.57 Gbps 108.5%Per-DC Opt 1.0 Chs 0.37% 90.0% 8.5 6.79 Gbps 112.2%Per-DC Lim 0.3 Chs 0.06% 85.7% 2.6 7.13 Gbps 117.8%
Random 5.9 Chs 0.02% 83.3% 3.8 7.33 Gbps 121.2%Reactive 7.9 Chs 0.47% 77.2% 4.3 7.91 Gbps 132.4%
PeriodsTime periods 1422—1480 (161 mature and virtual channels)
Peak demand 6.81 Gbps, mean demand 4.91 GbpsShort Drop Util Rep Booked Over-prov
Optimal 0.1 Chs 0.25% 91.1% 161.0 6.38 Gbps 110.3%Per-DC Opt 1.2 Chs 0.13% 88.6% 6.9 6.56 Gbps 113.4%Per-DC Lim 0.2 Chs 0.03% 84.6% 2.4 6.86 Gbps 118.8%
Random 7.6 Chs 0.00% 82.2% 3.0 7.08 Gbps 122.4%Reactive 7.2 Chs 0.34% 70.4% 3.6 8.20 Gbps 146.0%
PeriodsTime periods 1562—1640 (176 mature and virtual channels)
Peak demand 7.55 Gbps, mean demand 5.62 GbpsShort Drop Util Rep Booked Over-prov
Optimal 0.1 Chs 0.31% 91.1% 176.0 7.27 Gbps 110.4%Per-DC Opt 0.7 Chs 0.16% 88.3% 7.3 7.51 Gbps 114.0%Per-DC Lim 1.4 Chs 0.00% 83.9% 2.4 7.89 Gbps 119.9%
Random 6.2 Chs 0.00% 80.4% 3.3 8.28 Gbps 125.4%Reactive 5.9 Chs 0.27% 72.7% 3.5 9.08 Gbps 140.4%
PeriodsTime periods 2402—2500 (199 mature and virtual channels)
Peak demand 9.19 Gbps, mean demand 7.62 GbpsShort Drop Util Rep Booked Over-prov
Optimal 0.0 Chs 0.11% 85.4% 199.0 10.54 Gbps 118.1%Per-DC Opt 1.0 Chs 0.09% 82.7% 6.3 10.87 Gbps 121.8%Per-DC Lim 20.7 Chs 0.17% 82.3% 2.5 10.95 Gbps 122.6%
Random 33.4 Chs 0.02% 77.9% 4.5 11.54 Gbps 129.3%Reactive 15.8 Chs 0.43% 74.6% 3.6 12.01 Gbps 140.3%
Acronym Meanings:
Optimal : The optimal load direction. Per-DC Opt : Heuristic 1 (Per-DC Optimal). Per-DC Lim: Heuristic 2 (Per-DC Limited Channels). Random: Random with Prediction.Reactive: Reactive without Prediction.
Short : # channels with dropped requests; Drop: the request drop rate; Util : utilization ofallocated resources; Rep: replication degree; Booked : the booked (reserved) bandwidth;Over-prov : over-provisioning ratio.
5.3. EXPERIMENTS BASED ON REAL-WORLD TRACES 90
Table 5.1 shows that Reactive generally has more QoS problems than all four pre-
dictive schemes in terms of both the number of unsatisfied channels and request drop
rate, demonstrating the benefit of utilizing demand prediction. Fig. 5.5 presents a more
detailed comparison for a typical peak period from time 702 to 780. Without surprise,
Reactive has many unfulfilled requests at the beginning. Since the videos are randomly
replicated to K = 2 data centers (shown in Fig. 5.5(c) at t = 702) and requests are
randomly directed, it is likely that a channel does not acquire enough capacity to meet
its demand. As Reactive detects the QoS problem, videos are replicated to more data
centers to acquire more capacity, with a gradually increasing replication degree over time,
as in Fig. 5.5(c). We can see that after 140 minutes, when the replication degree exceeds
4, the QoS of Reactive becomes relatively stable in Fig. 5.5(a). However, around time
763, Reactive su↵ers from QoS problems again, due to a sudden ramp-up of demand.
In contrast, the predictive schemes foresee and prepare for demand changes, resulting in
much better QoS, even in the event of drastic demand increase.
The predictive schemes also achieve higher resource utilization. Utilization of a pre-
dictive scheme is the ratio between the actual used bandwidth and the total booked
bandwidth in all data centers. For Reactive, the utilization is the actual bandwidth de-
mand divided by the peak demand. Although Fig. 5.5(b) shows that Reactive achieves
a high utilization for the peak demand around time 763, its average utilization is merely
77.19% in the test period from 702 to 780. Predictive auto-scaling enhances utilization to
85.67% with Per-DC Limited Channels, to 89.99% with Per-DC Optimal, and to 92.9%
with the theoretical optimal solution. In addition, the prediction and optimization in
predictive methods are computationally e�cient, e.g., prediction and Per-DC Optimal
finish in 2 minutes, well before the deadline of 10 minutes.
5.3. EXPERIMENTS BASED ON REAL-WORLD TRACES 91
5.3.5 Theoretical Optimal vs. Replication-Limited Heuristics
Now we focus on each of the four predictive schemes. Among them, as shown in Table 5.1,
Optimal books the minimum necessary bandwidth and achieves the highest bandwidth
utilization, yet with the highest replication overhead: a video is replicated to every data
center. The VoD provider thus needs to pay a high storage fee to the cloud.
Per-DC Optimal can reduce the replication degree while maintaining other perfor-
mance metrics. By further imposing a channel number constraint on each data center,
Per-DC Limited Channels strikes a balance between replication overhead and bandwidth
utilization. It aggressively reduces the replication degree to a very small value of 2.4-
2.6 copies/video, which is the smallest among all four schemes, with an extremely low
drop rate and an over-provisioning ratio only slightly higher than Optimal and Per-DC
Optimal. Random achieves the lowest utilization, since it is blind to the correlation
information in workload selection and direction.
We further show a detailed comparison between the three predictive heuristics from
time 1562 to 1640 in Fig. 5.6. The e�ciency of predictive bandwidth booking can be
evaluated by the cushion bandwidth (“risk premium”) needed, which is the gap between
the booked bandwidth and actual required bandwidth. Fig. 5.6(a) plots the cushion
bandwidth. While achieving a similar QoS level, Random load direction uses a cushion
bandwidth up to 3 Gbps compared to a mean demand of 5.62 Gbps, representing sig-
nificant over-provisioning. Using Per-DC Optimal, the cushion bandwidth can be saved
by 50% on average, as shown in Fig. 5.6(b). Even Per-DC Limited Channels, with a
replication degree of 2.4 copies/video, can save cushion bandwidth by around 30% as
compared to Random, which has a higher replication degree of 3.3 copies/video.
5.3. EXPERIMENTS BASED ON REAL-WORLD TRACES 92
1562 1582 1602 1622 1640−1000
0
1000
2000
3000
4000
Time (unit: 10 minutes)
Cus
hion
Ban
dwid
th (M
bps)
Per−DC optimalPer−DC limited channelsRandom
(a) Cushion Bandwidth
1562 1582 1602 1622 16400
20
40
60
80
100
120
140
Time (unit: 10 minutes)
Impr
ovem
ent (
%)
Per−DC optimal vs. RandomPer−DC limited channels vs. Random
(b) Savings on Cushion Bandwidth
1562 1582 1602 1622 1640
100
120
140
160
180
Time (unit: 10 minutes)
Ove
r−pr
ovisi
on ra
tio (%
)
Per−DC optimalPer−DC limited channelsRandom
(c) Over-Provisioning Ratio
Figure 5.6: Workload portfolio selection vs. random load direction for a typical peakusage period from time 1562 to 1640. K = 2, k = 10.
QoS problems occur if bandwidth is under-provisioned, leading to a cushion band-
width below 0 and an over-provisioning ratio less than 100%. From Fig. 5.6(a) and
Fig. 5.6(c), we observe that QoS problems occur occasionally for Per-DC Optimal but
seldom for Per-DC Limited Channels from time 1562 to 1640, because the latter scheme
conservatively books more cushion bandwidth. In addition, we note that request drop
rates in Table 5.1 are significantly lower than the frequency of under-provisioning in the
figures, because when under-provisioning happens, most user requests are still served.
Only the demand exceeding the booked capacity is dropped. From the above analysis,
we conclude that Per-DC Limited Channels achieves the best tradeo↵ in the domain of
5.4. SUMMARY 93
utilization, QoS and replication overhead.
5.4 Summary
In this chapter, we have proposed an unobtrusive, predictive and elastic cloud bandwidth
auto-scaling system for VoD providers. Operated at a 10-minute frequency, the system
automatically predicts the expected future demand as well as demand volatility in each
video channel through ARIMA and GARCH time-series forecasting techniques based on
history. Leveraging demand prediction, the system jointly makes load direction to and
bandwidth reservations from multiple data centers to satisfy the projected demands with
high probability. The system can save the resource reservation cost for VoD providers
with regard to both bandwidth and storage.
We exploit the predictable anti-correlation between demands to enhance resource
utilization, and derive the optimal load direction policy that minimizes the aggregate
bandwidth resource reservation while confining under-provisioning risks. Two suboptimal
heuristics have also been proposed to limit the storage cost. From extensive simulations
driven by the demand traces of a large-scale real-world VoD system, we observe that
suboptimal heuristics have practical appeal due to their ability to balance the costs of
bandwidth and storage.
Chapter 6
The Bandwidth Reservation Market
Current cloud providers charge their tenants (including VoD providers like Netflix) a
bandwidth fee in a pay-as-you-go model based on the number of bytes transferred in
each hour [17]. This business model is not suitable for QoS-sensitive applications where
the value of the service critically depends on the guarantee provided. We believe that,
when multiple content providers (such as Netflix or OnLive) use cloud services from
cloud providers (such as Amazon), there will be a market between content providers and
cloud providers, and commodities to be traded in such a market consist of bandwidth
reservations, so that media streaming performance can indeed be guaranteed.
As a buyer in such a market, each tenant in the cloud can periodically make reserva-
tions for bandwidth capacity to satisfy its demand projection. Specifically, each tenant
should reserve capacity that accommodates both its expected demand and unexpected
demand variation in the predictable future. The expectation and variance of its future
demand can be estimated online using historical demand information, which is easily
obtainable from cloud monitoring services (e.g., as has been mentioned in Chapter 5,
Amazon CloudWatch provides a free resource monitoring service to AWS customers at a
94
CHAPTER 6. THE BANDWIDTH RESERVATION MARKET 95
5-minute frequency [2]).
However, since demands at all tenants are unlikely to surge simultaneously, the ag-
gregate reserved cloud bandwidth — that the tenants have paid for — are under-utilized
most of the time. A classical way to improve utilization is to encourage statistical mul-
tiplexing and reserve resources for multiple tenants together to accommodate their ag-
gregate peak load instead of each tenant’s peak load individually. As diversification and
anti-correlation come into e↵ect, less cushion capacity needs to be reserved to guard
against unexpected demand fluctuation. Apparently, improving utilization is a win-win
scenario: better utilization of bandwidth at cloud providers may lead to reduced pric-
ing for bandwidth reservations in the market, eventually benefiting content providers.
Given multiple cloud providers in the market, we ask the question — can we use a simple
mechanism inherent in the market to optimally mix tenant demands, save the total cloud
bandwidth reservation cost for tenants and thus lower the market bandwidth price?
The answer is “yes”. Cloud resource brokerage has recently emerged as a cloud service
intermediator connecting buyers and sellers of computing resources. For example, Zimory
[11], as a spino↵ of Deutsche Telekom, claims to be the first online marketplace for cloud
computing. In this chapter, we propose an entirely new type of brokerage that collects
bandwidth demand statistics from content providers, selling probabilistic performance
guarantees to them, while reserving as few resources as possible from multiple cloud
providers for them. The broker charges each VoD provider a fee for the service guarantee
provided according to a certain pricing policy, while paying cloud providers their fees for
bandwidth reservations. In a healthy market, the broker should be able to: (1) make
a profit, an incentive for all selfish entities; (2) save the total resource reservation by
optimally mixing demands; and (3) lower bandwidth prices billed to content providers
CHAPTER 6. THE BANDWIDTH RESERVATION MARKET 96
to attract business.
We present an in-depth analysis of bandwidth pricing policies in such a broker-assisted
market, and make a number of original contributions as follows:
First, the broker needs to decide a load direction matrix from all tenants to all
cloud providers. Load direction should be such that tenant demands are mixed based
on anti-correlation to save the aggregate capacity reservation from all cloud providers.
We formulate this problem as a convex optimization and find this problem has the same
form as the problem of finding the optimal load direction policy in Chapter 5, with the
only di↵erence that now the broker (instead of a cloud provider) is in charge of directing
the loads of di↵erent video providers to all the data centers (possibly owned by di↵erent
cloud providers).
Second, however, a selfish broker may not have the incentive to mix and direct de-
mands optimally; it may even reject demand if accepting it is not lucrative — the broker
is always driven by profit after all. Our second contribution is therefore to study how to
enforce a good pricing policy (to be adopted by the broker), such that a profit-maximizing
broker behaves in the same way as an altruistic broker that saves the cloud bandwidth
reservation through optimal load direction. We give a necessary and su�cient condition
for all such “good” pricing policies. Such a pricing region characterization also helps
us to understand the maximum price discount that each tenant can enjoy in a healthy
market.
Third, we study a free market where the pricing policy is not enforced authoritatively.
Instead, each tenant can submit a pricing strategy at its own will, while the broker
makes demand admission decisions to each tenant based on all the submitted prices.
Assuming that all the tenants wish to have their demands fully served, we prove that the
6.1. SYSTEM MODEL 97
selfishness of tenants will inevitably drive their chosen pricing strategies into a unique
Nash equilibrium, at which each tenant enjoys the lowest possible price in the “good”
pricing region. As a highlight, we derive the equilibrium payment of each tenant in a
closed-form function of its expected demand, demand burstiness as well as its demand
correlation to the aggregate demand in the market.
Finally, we conduct online bandwidth trading simulations driven by the real-world
workload of more than 200 video channels from an operational VoD system over two 800-
minute periods. We have developed a novel dynamic pricing scheme in a broker-assisted
market that varies the price charged to each tenant periodically, based on prediction
of tenant demand statistics. Such demand-statistics-based pricing is di↵erent from the
traditional usage-based pricing. We observe that with the help of a broker, our theory
can lower the market price for cloud bandwidth reservations by around 50% on average,
and reduce the usage of cloud resources by over 30%. Most content in this chapter has
been published in our work [55].
6.1 System Model
With broker-assisted bandwidth reservation, the system operates periodically on a short-
term basis. At the beginning of a short-term period, each tenant makes a bandwidth
reservation at the broker to satisfy its future bandwidth demand. The broker attempts to
statistically multiplex bandwidth demands from the tenants to save the total bandwidth
reservation cost, while providing each tenant with a certain level of service guarantee.
At the end of the short-term period, the broker charges each tenant a fee for the service
guarantee.
6.1. SYSTEM MODEL 98
From the perspective of tenants, having the ability to reserve bandwidth on a short-
term basis o↵ers a unique advantage. Many user-facing multimedia workloads, such as
VoD bandwidth demand and online gaming, involve diurnal periodicity and are highly
predictable even at a fine granularity of 10-minute intervals [58, 59]. Tenants running
predictable workloads are able to dynamically reserve (and pay for) bandwidth that just
su�ces to accommodate the predicted short-term demand, leading to a cost-e↵ective
operation mode. From the perspective of cloud providers, o↵ering users the ability to
make bandwidth reservations constitutes an important performance guarantee, which is a
desirable feature that makes the cloud provider more attractive to QoS-sensitive tenants.
6.1.1 The Cloud and Tenants
We now present our system model in more detail. We start from a description of cloud
providers and tenants.
Cloud Providers. We assume there are S cloud providers in the system, each of
which has an outgoing bandwidth capacity Cs
, s = 1, . . . , S. Let Csum be the aggregate
bandwidth capacity of all the cloud providers, i.e., Csum =P
s
Cs
. Throughout this
chapter, we assume that Csum is su�ciently large to satisfy all demands in the system.
This assumption is justified by the “illusion of infinite capacity” [17] and the fact that
costs of high-end routers are dropping more quickly than before.
Tenants. We assume that there are N tenants in the system, each of which reserves
bandwidth on a short-term basis. Without loss of generality, we consider one such short-
term period. Suppose that tenant i’s bandwidth demand in this period is a random
variable Di
with mean µi
and variance �2i
. To ensure that its demand Di
is satisfied with
a high probability 1 � ✏, tenant i needs to reserve bandwidth Bi
= f✏
(Di
), which is 1 � ✏
6.1. SYSTEM MODEL 99
percentile of the distribution of Di
, i.e., we have
Pr(Di
> Bi
) ✏,
where ✏ is defined as the risk factor. Let D = [D1, . . . , DN
]T, µµµ = [µ1, . . . , µN
]T and
��� = [�1, . . . , �N
]T.
As has been discussed in Chapter 5, the random demands D1, . . . , DN
may be highly
correlated due to the correlation between video genres, viewer preferences and video
release times. Following the notations used in Chapter 5, denote ⇢ij
the correlation
coe�cient of Di
and Dj
, with ⇢ii
⌘ 1. Let ⌃ = [�ij
]N⇥N
be the N ⇥ N symmetric
demand covariance matrix, with �ii
= �2i
, and �ij
= ⇢ij
�i
�j
, for i 6= j.
For convenience, denote �M
the standard deviation of the “market demand”P
i
Di
,
and �iM
the covariance between Di
and the market demandP
i
Di
. Clearly, we have
�iM
=E[(Di
� µi
)(X
j
Dj
�X
j
µj
)] =NX
j=1
�ij
, (6.1)
�M
=
svar[
X
i
Di
] =p1T⌃1 =
sX
i,j
�ij
. (6.2)
We further denote ⇢iM
= �iM
/�i
�M
2 [�1, 1] as the correlation coe�cient between Di
and the market demandP
i
Di
.
6.1.2 The Broker and Load Direction Matrix
As shown in Fig. 6.1, the essence of the broker is a load direction weight matrix W =
[wsi
]S⇥N
, where wsi
represents the proportion of tenant i’s demand Di
served by cloud
provider s, for s = 1, . . . , S and i = 1, . . . , N , with 0 wsi
1. At the cloud provider
6.1. SYSTEM MODEL 100
side, given [wsi
], the bandwidth load imposed on cloud provider s is a random variable
Ls
=P
i
wsi
Di
.
Similar to Bi
= f✏
(Di
), we define As
= f✏
(Ls
) (As
Cs
) as the amount of bandwidth
reserved from cloud provider s to ensure that the load Ls
is satisfied with a high proba-
bility 1 � ✏. Clearly, the reserved bandwidth As
must not exceed the capacity of cloud
provider s, i.e., As
Cs
, or equivalently,
Pr(Ls
> Cs
) ✏, 8 s. (6.3)
For convenience, let L = [L1, . . . , LS
]T and let ws
= [ws1, . . . , wsN
]T. Note that ws
can
be interpreted as the workload portfolio of cloud provider s. From tenant i’s point of
view, given [wsi
], the total fraction of its demand Di
served by all cloud providers is
wi
=P
s
wsi
1. For convenience, let w = [w1, . . . , wN
]. Clearly, we have w =P
s
ws
.
The broker takes demand estimates µµµ, ⌃ and cloud provider capacities C1, . . . , CS
as inputs, and outputs a load direction weight matrix W⇤ = [w⇤si
] to achieve a certain
goal in favor of cloud providers, tenants, or the broker, or all of them. The output W⇤
corresponds to a way that N inter-correlated random demands D1, . . . , DN
are packed
into S cloud providers of limited capacities. For convenience, let w⇤i
=P
s
w⇤si
and
let w⇤s
= [w⇤s1, . . . , w
⇤sN
]. A “good” broker should satisfy the demand of tenants while
reserving as little bandwidth resource as possible from cloud providers. In other words,
the direction matrix W⇤ should be chosen such that: 1) w⇤i
= 1, for all tenants i =
1, . . . , N ; and 2) the total bandwidth reservationP
s
As
is minimized.
6.1. SYSTEM MODEL 101
w11
w21 w12
w22
C2
D1 D2
w1 = w11 + w21 w2 = w12 + w22
L1 =P
i Diw1i L2 =P
i Diw2i
L2
A2
L1
C1Capacity
r.v. r.v.
r.v. r.v.
A1Booked Bandwidth
Load
Demands
Service fraction
Load Direction
VoD Providers
Broker
DataCenters
Figure 6.1: A system of two cloud providers and two tenants. Random variables arelabeled with “r.v.”. The data centers are possibly owned by di↵erent cloud providers.
6.1.3 Bandwidth Pricing in the Presence of a Broker
Without loss of generality, we assume that each cloud provider charges a unit bandwidth
fee of $1 for every unit of bandwidth reserved in each short-term period. The broker needs
to pay $1 ·P
s
As
in total to the cloud providers for booking a total amount ofP
s
As
bandwidth. On the other hand, it charges tenant i a fee of Pi
(wi
) for accommodating wi
portion of its bandwidth demand Di
according to some pricing strategy Pi
(·). Under a
given W, the broker profit is thus
R(W) =P
i
Pi
(wi
) �P
s
As
. (6.4)
We make minimum assumptions on the pricing strategy Pi
(·) adopted by the broker as
follows:
Definition 1 A bandwidth pricing strategy Pi
(·) relative to tenant i is a concave func-
tion Pi
(wi
) of the service portion wi
2 [0, 1], with Pi
(0) = 0. The collection of bandwidth
pricing strategies {Pi
(·) : i = 1, . . . , N} forms a pricing policy.
6.2. MAIN OBJECTIVES 102
We observe that these assumptions on Pi
(·) are quite reasonable. Pi
(·) is concave
because in real life wholesale is cheaper than retail, i.e., the more a user buys some
goods, the lower the unit price P (wi
)/wi
she has to pay. Pi
(0) = 0 because a tenant
should pay nothing when receiving no service. In our system model, the pricing policy
{Pi
(·)} can be authoritatively enforced in a controlled market, or negotiated between the
broker and each tenant in a free market. As a special case, the broker may adopt the
cloud pricing policy {P 0i
(·)}:
P 0i
(wi
) = f✏
(Di
wi
) · 1, (6.5)
with P 0i
(1) = f✏
(Di
) = Bi
, 8 i. In other words, the broker charges the same amount of
money as cloud providers do. Note that a “good” pricing policy {Pi
(·)} should help ten-
ants to save money, i.e., Pi
(wi
) P 0i
(wi
), 8 wi
2 [0, 1], 8 i 2 1, . . . , N . Otherwise,
tenants will have little incentive to reserve bandwidth through a broker instead of from
cloud providers directly.
6.2 Main Objectives
In this section, we outline our objectives in this chapter.
Cloud Bandwidth Saving. From the cloud and social wellness point of view, a
basic goal is to enhance resource e�ciency in the cloud. When aggregate capacity supply
is su�cient, this means to serve all given demands D using as little bandwidth resourceP
s
As
as possible, i.e.,
minW
Ps
As
(6.6)
6.2. MAIN OBJECTIVES 103
s.t. As
Cs
, 8 s, wi
= 1, 8 i. (6.7)
This leads to the first question we ask in this chapter:
(Q1) How do we e�ciently determine the load direction matrix W⇤ to minimize the
aggregate bandwidth reservation from all cloud providers?
Chapter 5 has already shown that the statistical anti-correlation between tenant demands
can be utilized to optimally consolidate the workload. We provided a simple criterion to
quickly check if all given demands can be served or not, without having to solve (6.6). If
yes, we have given nearly closed-form solutions to (6.6). In Sec. 6.3, we will review these
results, for the consistency of presentation in this Chapter.
Pricing in Controlled Markets. Note that objective (6.6) requires the broker to
be altruistic to the cloud and tenants. However, a selfish broker has no obligation to
accommodate all demands for service, i.e., it may output w⇤i
< 1 for some i. Neither will
it necessarily optimize load direction to maximize cloud bandwidth saving. Instead, the
broker may simply decide W⇤ by maximizing its profit R(W) as follows
maxW
R(W) =P
i
Pi
(wi
) �P
s
As
(6.8)
s.t. As
Cs
, 8 s. (6.9)
We refer to such a broker as a profit-driven broker. We observe that even a profit-driven
broker is willing to participate in the controlled market, since R(W⇤) � R(W)
����w
si
=0, 8 s,i
=
0 for all possible pricing policies. Although there is always some incentive for the broker
to operate regardless of the pricing policy {Pi
(·)}, the key question is
(Q2) Under what pricing policy {Pi
(·)} is a profit-driven broker an altruistic broker that
6.2. MAIN OBJECTIVES 104
maximizes cloud bandwidth saving while accommodating all demands for service?
In other words, under what pricing policy {Pi
(·)} are problems (6.8) and (6.6)
equivalent?
An answer to this question is significant, in that it enables the optimal allocation
of cloud resources simply through a money-driven broker, by enforcing a good pricing
policy {Pi
(·)} to the market. In Sec. 6.4, we give a necessary and su�cient condition
for such good pricing policies. As has been mentioned, broker pricing {Pi
(·)} should
be upper-bounded by the cloud pricing {P 0i
(·)}. Since the broker can turn a profit from
bandwidth saving through arbitrage (out of no investment), tenants may expect to obtain
price discounts. In Sec. 6.4, we also find the fair discount each tenant can enjoy, without
hurting the interests of the cloud and other tenants.
Pricing in Free Markets. The good pricing policies found by answering question
(Q2) are enforced in the market by a supervisory agency other than the broker or tenants.
However, in a free market, a selfish tenant has the freedom to negotiate with the broker
on the price it should pay. We assume each tenant i can submit to the broker its own
pricing strategy Pi
(·). In return, based on all the submitted pricing strategies {Pi
(·)},
the broker decides a service fraction w⇤i
for each i by maximizing its own profit. The
submitted prices can a↵ect all tenants’ utilities. We assume each tenant expects to be
fully served (i.e., w⇤i
= 1, 8 i), though striving to lower its payment. Intuitively, if Pi
(·)
is too low, the broker will not serve Di
completely out of profit concerns. Ironically, if
Pi
(·) is su�ciently high, Di
may not get fully served either, depending on the prices other
tenants submit. We ask the question:
(Q3) In a free market where each selfish tenant competes for service by submitting its
own pricing strategy Pi
(·), will {Pi
(·)} reach an equilibrium point?
6.3. OPTIMAL BANDWIDTH SAVING 105
In Sec. 6.5, we use game theory to show that {Pi
(·)} will converge to a unique Nash
equilibrium {P ⇤i
(·)}. It turns out {P ⇤i
(·)} forms the lower border of the good pricing
region, which means that without intervention, the free market itself will lead to the
optimal load direction, where the tenants receive the maximum price discounts. A key
finding of this chapter is the following theorem:
Theorem 2 In equilibrium market, tenant i pays
$ P ⇤i
(w⇤i
) = µi
+ [f✏
(Di
) � µi
]⇢iM
= µi
+ ✓ · �i
⇢iM
(6.10)
for bandwidth reservation, where ⇢iM
2 [�1, 1] is the correlation coe�cient between Di
andP
i
Di
.
In Sec. 6.5, we discuss the connections between demand statistics and bandwidth
pricing.
6.3 Optimal Bandwidth Saving
For consistency, in this section, we briefly review our answer to question (Q1) — what are
the load direction weights W⇤ that achieve the most bandwidth saving? This question has
already been answered in Chapter 5. We assume each random demand Di
is Gaussian-
distributed. This assumption has already been verified by real-world traces in Chapter 5
(please refer to Fig. 5.4).
Recall from Chapter 5 that for a Gaussian random variable X, we have
f✏
(X) = E[X] + ✓pvar[X], ✓ = F�1(1 � ✏), (6.11)
6.3. OPTIMAL BANDWIDTH SAVING 106
where F (·) is the CDF of normal distribution N (0, 1). If each Di
is Gaussian-distributed,
so is Ls
=P
i
wsi
Di
, and we have
8><
>:
E[Ls
] = µ1ws1 + . . . + µN
wsN
= µµµTws
,
var[Ls
] =P
i,j
⇢ij
�i
�j
wsi
wsj
= wTs
⌃ws
.
It follows that
Bi
= f✏
(Di
) = µi
+ ✓�i
,
As
= f✏
(Ls
) = µµµTws
+ ✓pwT
s
⌃ws
.
Therefore, cloud resource minimization (6.6) has the following form under Gaussian
demands:
minW
Ps
(µµµTws
+ ✓p
wTs
⌃ws
), (6.12)
s.t. µµµTws
+ ✓p
wTs
⌃ws
Cs
, 8 s, (6.13)
w =P
s
ws
= 1, (6.14)
0 � ws
� 1, 8 s, (6.15)
where 1 = [1, . . . , 1]T and 0 = [0, . . . , 0]T are N -dimensional column vectors. Constraint
(6.13) is equivalent to As
Cs
and thus to (6.3).Note that problem (6.12) has the same
form as problem (5.9) in Chapter 5, with the only di↵erence being that now the broker
is responsible for load direction.
Recall that when capacity supply is su�cient, nearly closed-form solutions to prob-
lem (6.12) can be derived, as has been shown in Theorem 1 in Chapter 5. We rewrite
Theorem 1 of Chapter 5 here for ease of presentation in the sections to follow:
6.3. OPTIMAL BANDWIDTH SAVING 107
Theorem 3 When Csum � µµµT1 + ✓p1T⌃1, an optimal solution [w⇤
si
] is given by
w⇤si
= ↵s
, 8 i, s = 1, . . . , S, (6.16)
where ↵1, . . . , ↵S
can be any solution to
X
s
↵s
= 1, 0 ↵s
min
8><
>:1,
Cs
µµµT1 + ✓p1T⌃1
9>=
>;, 8 s. (6.17)
When Csum < µµµT1+✓p1T⌃1, there is no feasible solution that satisfies constraints (6.13)
to (6.15).
6.3.1 Discussions
Using Theorem 3, the broker can quickly check if all demands can be served, simply by
comparing the total cloud capacity Csum with total bandwidth required for all demands
combined:
f✏
(X
i
Di
) = E[X
i
Di
] + ✓
svar[
X
i
Di
] = µµµT1 + ✓p1T⌃1.
If Csum < µµµT1+ ✓p1T⌃1, it is infeasible to pack all demands into the limited resources
at a risk factor of ✏. Otherwise, all demands can be accommodated.
When supply is su�cient, as has been mentioned in Chapter 5, Theorem 3 implies
that in the optimal solution, each tenant should direct its workload into S cloud providers
following the same weights ↵1, . . . , ↵S
, withP
s
↵s
= 1. An altruistic broker can e�ciently
find a feasible set of ↵1, . . . , ↵S
subject to linear constraints (6.17), and promise each
6.4. PRICING REGION IN CONTROLLED MARKETS 108
tenant i to serve all its demand, under the same agreement that ↵s
of Di
will be directed
to cloud provider s.
The maximum bandwidth saving from joint bandwidth booking as compared to book-
ing bandwidth individually for each tenant is
�B(W⇤) =P
i
Bi
�P
s
As
=P
i
(µi
+ ✓�i
) �P
s
(µµµTw⇤s
+ ✓p
w⇤Ts
⌃w⇤s
)
= ✓(���T1 �p1T⌃1) = ✓(
Pi
�i
� �M
), (6.18)
which is ✓ times the gap between the sum of all demand standard deviations and the
standard deviation of all demands combined. Furthermore, the number of cloud providers
S does not a↵ect bandwidth saving: theoretically, there is no loss of resource e�ciency
if a big cloud provider is substituted by numerous small cloud providers with the same
aggregate capacity.
6.4 Pricing Region in Controlled Markets
In this section, we answer question (Q2)—if the broker is profit-driven, what pricing
policies need be enforced in the market in order to benefit cloud resource e�ciency and
the interests of tenants? Theorem 3 provides a necessary and su�cient condition for
supply to exceed demand. We thus assume ⌃s
Cs
� µµµT1 + ✓p1T⌃1 throughout the
chapter.
We assume a selfish broker determines W⇤ by maximizing its own profit via (6.8).
Similarly, under Gaussian demands, we can translate (6.8) into the following problem:
maxW
Pi
Pi
(wi
) �P
s
(µµµTws
+ ✓p
wTs
⌃ws
), (6.19)
6.4. PRICING REGION IN CONTROLLED MARKETS 109
s.t. µµµTws
+ ✓p
wTs
⌃ws
Cs
, 8 s, (6.20)
w =P
s
ws
� 1, (6.21)
0 � ws
� 1, 8 s, (6.22)
A selfish broker does not only have a di↵erent objective (6.19) than (6.12), but also
has a constraint (6.21) di↵erent from (6.14), since a profit-driven broker has no obligation
to accommodate all demands for service, if it cannot gain a higher profit by doing so.
Instead, the broker decides the service fraction w⇤i
1 for each tenant depending on the
pricing policy {Pi
(·)}. We aim to find all pricing policies such that problem (6.19) is
equivalent to problem (6.12).
The following theorem gives a necessary and su�cient condition for a good pricing
policy:
Theorem 4 Broker profit maximization (6.19) and cloud bandwidth resource minimiza-
tion (6.12) have the same optimal solution (6.16), if and only if
P 0i
(1) � µi
+ ✓ · �iM
�M
, 8 i, (6.23)
where �iM
is the covariance between Di
andP
i
Di
given by (6.1) and �M
is the standard
deviation ofP
i
Di
given by (6.2). Furthermore, if P 0i
(1) < µi
+ ✓�iM
/�M
for some i,
then w⇤ 6= 1.
Proof: We first show that if P 0i
(1) � µi
+ ✓�iM
/�M
, 8 i, (6.16) is also an optimal
solution to (6.19). Using the bound (5.17), problem (6.19) can be relaxed to the following
6.4. PRICING REGION IN CONTROLLED MARKETS 110
concave problem:
maxw
Ru
(w) :=P
i
Pi
(wi
) � (µµµTw + ✓pwT⌃w), (6.24)
s.t. µµµTw + ✓pwT⌃w
Ps
Cs
, (6.25)
w � 1, (6.26)
where Ru
(w) � R(W) is a concave function, and (6.25) is a relaxation of constraints
(6.20). Since µµµT1 + ✓p1T⌃1
Ps
Cs
, w = 1 is a feasible solution to (6.24).
The derivative of Ru
(w) with respect to wi
is given by
@Ru
(w)
@wi
����w=1
=P 0i
(wi
)
����w
i
=1
� µi
�✓P
N
j=1 �ij
wjp
wT⌃w
����w=1
=P 0i
(1) � µi
�✓Cov[D
i
,P
j
Dj
]p1T⌃1
=P 0i
(1) � µi
� �1M
�M
(6.27)
If (6.23) is satisfied, then @Ru
(w)/@wi
|w=1 � 0, 8 i. According to the gradient ascent
algorithm for concave maximization, w = 1 is the optimal solution to (6.24). If wsi
= ↵s
given by (6.16), we have the following 3 facts:
• All constraints (6.20)-(6.22) are satisfied;
• Ru
(w) = R(W);
• Ru
(w) = Ru
(1) in problem (6.24) reaches optimality.
Therefore, (6.16) is an optimal solution to (6.19).
6.4. PRICING REGION IN CONTROLLED MARKETS 111
Now we prove the converse. If there exists an i such that
P 0i
(1) < µi
� �1M
�M
, (6.28)
then the optimal solution w⇤ to the relaxed problem (6.24) must satisfy w⇤i
< 1. Then
the original problem (6.19) has an optimal solution w⇤si
= �s
, 8 i, 8 s, where �s
solves
8><
>:
0 �s
min{1, C
s
µ
µ
µ
Tw⇤+✓
pw⇤T⌃w⇤}, 8 s,
Ps
�s
= 1.
Let w⇤⇤si
= ↵s
, 8 i, 8 s be given by (6.16). Since
R
✓[w⇤⇤
si
]S⇥N
◆= R
u
(1) < Ru
(w⇤) = R
✓[w⇤
si
]S⇥N
◆,
we have proved that w⇤⇤si
given by (6.16) is not an optimal solution to (6.19). ut
We depict the region of good pricing policies in Corollary 5, which relies on the
following lemma.
Lemma 1 Let f(x) be a concave function on [0, 1] with f(0) = 0. Let k be a (constant)
real number. If f 0(1) � k, then for each x 2 (0, 1], f(x) � kx. In this case, if f(x0) = kx0
for some x0 2 (0, 1], then f(x) ⌘ kx, 8 x 2 [0, 1].
Proof: Note that the function f(x) is bounded by
f 0(1)x f(x) f 0(1)(x � 1) + f(1), 8 x 2 (0, 1], (6.29)
6.4. PRICING REGION IN CONTROLLED MARKETS 112
where the inequality on the left comes from the fact
f(x) =
Zx
0
f 0(t)dt + f(0) �Z
x
0
f 0(1)dt = f 0(1)x (6.30)
and the inequality on the right comes from the concavity.
Hence, if f 0(1) � k, then we have
f(x) � f 0(1)x � kx, 8 x 2 (0, 1].
In this case, if f(x0) = kx0 for some x0 2 (0, 1], then we have
kx0 � f 0(1)x0 � kx0.
It follows that f 0(1) = k. Further, note that
f(x0) =
Zx
0
0
f 0(x)dx �Z
x
0
0
kdx � kx0,
with equality achieved if and only if f 0(x) = k, 8 x 2 (0, x0]. Thus, we have f 0(x) = k,
for all x 2 (0, 1]. That is, f(x) ⌘ kx, 8 x 2 [0, 1]. ut
Corollary 5 In a good pricing policy {Pi
(·) : i = 1, . . . , N}, each Pi
(·) must satisfy for
all wi
2 [0, 1],
(µi
+ ✓ · �iM
�M
)wi
Pi
(wi
) (µi
+ ✓�i
)wi
. (6.31)
Proof: In a good pricing policy, each Pi
(·) cannot exceed cloud provider pricing
P 0i
(·), otherwise, the VoD providers would have booked bandwidth from cloud providers
directly. Therefore, P 0i
(wi
) = (µi
+✓�i
)wi
is the upper bound for Pi
(·). Consider another
6.4. PRICING REGION IN CONTROLLED MARKETS 113
wi
Pi(wi)
O 1
µi + � · �iM
�M
µi + ��i
upper border
lower border
P �i (w�
i )
good pricing
P 0i (1)
Figure 6.2: The region of Pi
(·) in a good pricing policy {Pi
(·)}. Pi
(·) is between P ⇤i
(·)and P 0
i
(·), and satisfies P 0i
(1) � P ⇤0i
(1).
pricing policy {P ⇤i
(·)}, where
P ⇤i
(wi
) = (µi
+ ✓ · �iM
�M
)wi
, 8 i. (6.32)
By Theorem 4 and Lemma 1, in a good pricing policy, for each i, Pi
(wi
) � (µi
+
✓�iM
/�M
)wi
, 8 wi
2 (0, 1], with equality achieved only if Pi
(·) ⌘ P ⇤i
(·). ut
6.4.1 Broker Profit and Price Discounts in the Pricing Region
Fig. 6.2 illustrates the region of Pi
(·) in a good pricing policy. Each good Pi
(·) is a
concave function between P ⇤i
(·) and P 0i
(·) (inclusive) with P 0i
(1) � µi
+ ✓�iM
/�M
. In
the good pricing region, a selfish broker behaves like an altruistic broker that optimizes
cloud bandwidth saving while serving all demands. When pricing policy falls out of this
region, the broker will reject a part of demands to chase a profit, i.e., there exists a j
such that w⇤j
< 1, which is undesirable when supply is su�cient. Note that Pi
(·) a↵ects
not only w⇤i
but also w⇤j
for all j. This phenomenon will be understood in Sec. 6.5.
We further analyze the broker profit R(W⇤) in the pricing region. Under any good
pricing policy, the optimal solution to (6.19) is w⇤si
= ↵s
given by (6.16) and w⇤i
= 1.
6.4. PRICING REGION IN CONTROLLED MARKETS 114
Thus, the broker’s payment to cloud providers is
X
s
As
=X
s
(µµµTw⇤s
+ ✓p
w⇤Ts
⌃w⇤s
) = µµµT1 + ✓p1T⌃1 =
X
i
µi
+ ✓�M
, (6.33)
Using Corollary 5, we obtain
R(W⇤) X
i
(µi
+ ✓�i
)w⇤i
�X
s
As
= ✓(X
i
�i
� �M
) = �B(W⇤), (6.34)
with equality achieved when Pi
(·) ⌘ P 0i
(·), and
R(W⇤) �X
i
(µi
+✓�
iM
�M
)w⇤i
�X
s
As
= ✓ ·P
i
�iM
�M
� ✓�M
= 0 (6.35)
with equality achieved when Pi
(·) ⌘ P ⇤i
(·).
The above bounds show that the maximum broker profit is essentially the maximum
achievable bandwidth saving �B(W⇤) from joint bandwidth booking given by (6.18),
which depends on the gap between the sum of all demand standard deviations and the
standard deviation of market demand. Moreover, the maximum profit is achieved when
the broker adopts the same pricing policy {P 0i
(·)} as cloud providers do. On the other
hand, when Pi
(·) ⌘ P ⇤i
(·), the broker profit is zero, which means all the profit made from
cloud bandwidth multiplexing has been rewarded to tenants as price discounts.
In a controlled market, the lowest price that tenant i should be charged for having
all its demand served is
P ⇤i
(1) = µi
+ ✓ · �iM
�M
. (6.36)
If any tenant i is charged less than P ⇤i
(1), the broker will deny a part of all demands to
turn a higher profit.
6.5. PRICING IN FREE MARKETS: A GAME THEORETICAL ANALYSIS 115
6.5 Pricing in Free Markets: A Game Theoretical
Analysis
In Sec. 6.4, good pricing is enforced as a policy by a third party other than the broker
and tenants, which is di�cult to implement in reality. To closely model a free market, we
assume each selfish tenant can freely bargain with the broker to establish a bandwidth
price. Each tenant i simply submits to the broker any pricing strategy Pi
(·) it prefers and
accepts the service fraction w⇤i
returned by the broker. Based on {Pi
(·)} collected from
all tenants, the broker determines the load direction matrix W⇤ and thus the service
fraction w⇤i
for each i by maximizing its own profit. We answer question (Q3) — what
will the prices eventually look like in such a free market of selfish players?
We define a game played by all tenants, each with an independent strategy Pi
(·), i.e.,
the price it submits. Under a set of submitted prices {Pi
(·)}, we define a utility function
associated with each tenant i as
Ui
[P1(·), . . . , PN
(·)] =
8><
>:
�Pi
(w⇤i
), if w⇤i
= 1,
�1, if w⇤i
< 1,(6.37)
where the broker determines w⇤i
by solving (6.19).
The rationale behind (6.37) is that in a system of su�cient capacity, a selfish tenant
always: 1) expects to get fully served1; and 2) tries to reduce its price if condition 1) is
met. In other words, unlike the case of scarce metal (e.g., gold), it is not profitable for
a tenant to deliberately deny user requests to trade for a lower unit bandwidth cost pi
.
In reality, popularity matters more to a tenant than instantaneous profit per user. In a
1This assumption will be relaxed in Chapter 7.
6.5. PRICING IN FREE MARKETS: A GAME THEORETICAL ANALYSIS 116
market of su�cient supply, if a tenant is found sacrificing user requests to chase cheaper
bandwidth deals, it will lose reputation and revenue in the long run.
To find the equilibrium market price, we just need to find the Nash equilibrium in
the above game, where no tenant i can get a better utility by unilaterally changing Pi
(·).
Intuitively, tenant i cannot submit too low a Pi
(·), beyond which the broker will be unable
to guarantee w⇤i
= 1 via (6.19), leading to Ui
= �1. From the analysis in Sec. 6.4, one
may conjecture that {P ⇤i
(·)} is a Nash equilibrium. However, is this the unique Nash
equilibrium? Could there be another equilibrium point where each tenant submits a very
low price and gets a utility of �1 without being able to be better o↵ by changing its
own pricing? In other words, can market collapse happen?
6.5.1 Nash Equilibrium: Existence and Uniqueness
Theorem 6 If tenants have utility (6.37) and the broker decides W⇤ by maximizing its
profit through (6.19), then {Pi
(·)} will converge to a unique Nash equilibrium {P ⇤i
(·)},
where
P ⇤i
(wi
) = (µi
+ ✓ · �iM
�M
)wi
, 0 wi
1, (6.38)
where �iM
and �M
are given by (6.1) and (6.2), respectively.
Let P�i
(·) represent the pricing strategies of all tenants except for tenant i. The proof
of Theorem 6 relies on the following two lemmas and Theorem 4 that characterize the
solution structure of (6.19) given {Pi
(·)}.
Lemma 2 If P 0i
(1) < µi
+ ✓�iM
/�M
, then w⇤i
< 1 regardless of P�i
(·).
6.5. PRICING IN FREE MARKETS: A GAME THEORETICAL ANALYSIS 117
Proof: It su�ces to show that
@Ru
(w)
@wi
����w
i
=1
< 0 for all w with wi
= 1,
since in this case we have w⇤i
= 1 regardless of {wj
}j 6=i
by the gradient ascent algorithm.
Recall that@R
u
(w)
@wi
= P 0i
(wi
) � µi
�✓P
N
j=1 �ij
wjp
wT⌃w.
If P 0i
(wi
) < µi
+ ✓�iM
/�M
, then
@Ru
(w)
@wi
����w
i
=1
< ✓
�iM
�M
�P
N
j=1 �ij
wjp
wT⌃w
!����w
i
=1
.
Now construct a function g(w1, . . . , wN
) as
g(w1, . . . , wN
) =
PN
j=1 �ij
wjp
wT⌃w.
It is easy to see that g(1, . . . , 1) = �iM
/�M
. Hence, it su�ces to show that
g(w1, . . . , wN
) � g(1, . . . , 1) for all w with wi
= 1.
Without loss of generality, we shall need only to prove the case when N = 2 and i = 1,
that is,�21 + ⇢12�1�2w2p
�21 + �2
2w22 + 2⇢12�1�2w2
� �1(�1 + ⇢12�2)p�21 + �2
2 + 2⇢12�1�2
.
It is easy to check that the above inequality indeed holds. Hence, @Ru
(w)/@wi
|w
i
=1 < 0
for all w with wi
= 1, completing the proof. ut
Lemma 3 If Pi
(wi
) = P 0i
(wi
) = (µi
+ ✓�i
)wi
for wi
2 [0, 1], then w⇤i
= 1, regardless of
6.5. PRICING IN FREE MARKETS: A GAME THEORETICAL ANALYSIS 118
P�i
(·).
Proof: It su�ces to show that
@Ru
(w)
@wi
����w
i
=1
� 0 for all w with wi
= 1,
since in this case we have w⇤i
= 1 regardless of {wj
}j 6=i
.
Recall that@R
u
(w)
@wi
= P 0i
(wi
) � µi
�✓P
N
j=1 �ij
wjp
wT⌃w.
If Pi
(wi
) = (µi
+ ✓�i
)wi
, then P 0i
(wi
) = µi
+ ✓�i
. Thus,
@Ru
(w)
@wi
����w
i
=1
= ✓
�i
�P
N
j=1 �ij
wjp
wT⌃w
!����w
i
=1
.
Now construct a random variable X as X =P
j 6=i
wj
Dj
+Di
. By the Cauchy-Schwarz
inequality, we have
E2[(Di
� µi
)(X � E[X])] E[(Di
� µi
)2] · E[(X � E[X])2]. (6.39)
That is, �2i
+X
j 6=i
�ij
wj
!2
�2i
pwT⌃w
����w
i
=1
.
It follows immediately that
�i
�P
N
j=1 �ij
wjp
wT⌃w
����w
i
=1
.
Hence, @Ru
(w)/@wi
|w
i
=1 � 0 for all w with wi
= 1. ut
Proof of Theorem 6: We first show that {P ⇤i
(·)} is indeed a Nash equilibrium. We
need to show that at {P ⇤i
(·)}, any tenant i cannot increase its utility Ui
by unilaterally
6.5. PRICING IN FREE MARKETS: A GAME THEORETICAL ANALYSIS 119
changing Pi
(·), i.e.,
Ui
[P ⇤i
(·), P ⇤�i
(·)] � Ui
[Pi
(·), P ⇤�i
(·)] 8 i. (6.40)
We exhaust the possibilities of Pi
(·) by considering the range of P 0i
(1). First, if
P 0i
(1) < µi
+ ✓�iM
/�M
,
by Lemma 2, w⇤i
< 1, and by (6.37), we have
Ui
[Pi
(·), P ⇤�i
(·)] = �1 < �P ⇤i
(1) = Ui
[P ⇤i
(·), P ⇤�i
(·)].
Second, if P 0i
(1) � µi
+ ✓�iM
/�M
, while P�i
(·) ⌘ P ⇤�i
(·), according to Theorem 4, we
have w⇤i
=P
s
w⇤si
=P
s
↵s
= 1. Hence, we have
Ui
[Pi
(·), P ⇤�i
(·)] = �Pi
(1) �P ⇤i
(1) = Ui
[P ⇤i
(·), P ⇤�i
(·)].
The inequality above is due to Lemma 1 and the fact that P ⇤i
(wi
) is a linear function in
wi
that passes through (0, 0). We have thus proved (6.40). Therefore, {P ⇤i
(·)} is indeed
a Nash equilibrium.
Now we show that {P ⇤i
(·)} is the unique Nash equilibrium, i.e., if {Pi
(·)} is a Nash
equilibrium, then Pi
(·) ⌘ P ⇤i
(·). First of all, let us show that the following inequality
holds:
P 0i
(1) � µi
+✓�
iM
�M
, 8 i. (6.41)
We prove (6.41) by contradiction. Assume there exists an i such that P 0i
< µi
+✓�iM
/�M
.
6.5. PRICING IN FREE MARKETS: A GAME THEORETICAL ANALYSIS 120
By Lemma 2, w⇤i
< 1 and thus Ui
= �1. If tenant i uses another strategy P 0i
(wi
) =
(µi
+ ✓�i
)wi
, then by Lemma 3, we have w⇤i
= 1, regardless of P�i
(·). Hence,
Ui
[P 0i
(·), P�i
(·)] = �P 0i
(1) > �1 = Ui
[Pi
(·), P�i
(·)],
contradicting the definition of Nash equilibrium that unilaterally changing Pi
(·) cannot
increase Ui
. Therefore, (6.41) must hold.
If (6.41) holds, by Theorem 4, we have w⇤i
=P
s
w⇤si
=P
s
↵s
= 1, 8 i, and thus
Ui
[Pi
(·), P�i
(·)] = �Pi
(1), 8 i. If (6.41) holds, by Lemma 1, Pi
(1) � (µi
+ ✓�
iM
�
M
) · 1 =
P ⇤i
(1), with equality achieved only if Pi
(wi
) ⌘ (µi
+ ✓�
iM
�
M
)wi
⌘ P ⇤i
(wi
) for wi
2 [0, 1]. In
other words, if Pi
(·) is not P ⇤i
(·),
Ui
[Pi
(·), P�i
(·)] = �Pi
(1) < �P ⇤i
(1) = Ui
[P ⇤i
(·), P�i
(·)],
contradicting the fact that {Pi
(·)} is a Nash equilibrium. Therefore, Pi
(·) must be P ⇤i
(·).
In other words, {P ⇤i
(·)} is the unique Nash equilibrium. ut
6.5.2 Market Properties and Equilibrium Bandwidth Costs
Theorem 6 shows that even without a supervisory party, the selfishness of tenants nec-
essarily drives {Pi
(·)} into an equilibrium {P ⇤i
(·)}, which is exactly the lower border of
all good pricing policies. Therefore, in a free market, a profit-driven broker is naturally
a workload consolidator that optimizes the cloud resource e�ciency. In equilibrium mar-
kets, we have Pi
(·) ⌘ P ⇤i
(·) and broker profit R(W⇤) = 0 according to (6.35). This is
reasonable since in the presence of competition, the broker would be unable to exploit
the profit on no investment gained from bandwidth multiplexing. Yet the broker can
6.6. TRACE-DRIVEN SIMULATIONS 121
turn a positive profit through agent fees, membership fees, ads and other means. In
addition, the free market automatically guarantees fairness on bandwidth costs across
di↵erent tenants.
Let us now analyze the equilibrium bandwidth cost for each tenant. At market
equilibrium, w⇤i
= 1 and P ⇤i
(w⇤i
) = P ⇤i
(1) given by (6.36). Rewriting (6.36), we have
$ P ⇤i
(w⇤i
) = µi
+ ✓ · �i
⇢iM
= µi
+ [f✏
(Di
) � µi
] · ⇢iM
,
proving Theorem 2.
The payment consists of two parts: $1 per unit of bandwidth for mean demand µi
, and
$⇢iM
per unit of bandwidth reservation exceeding the expected demand µi
. Apparently,
the lower the ⇢iM
, the more discount tenant i receives. In particular, if ⇢iM
< 0, Di
is
negatively correlated with market demand. Surprisingly, except for a charge on mean
demand, now tenant i earns a bonus of �✓�i
⇢iM
> 0 to be in the system. The reason is
that tenant i serves as a risk neutralizer: whenever market demand has a random increase,
Di
will decrease to release occupied resources to accommodate the market surge. This
helps the broker save the bandwidth reservation and hedge under-provision risks.
6.6 Trace-Driven Simulations
In this section, we simulate a broker-assisted cloud bandwidth trading system, evaluate
its benefit quantitatively through trace-driven simulations, and verify assumptions. Our
workload traces come from UUSee [50], an operational large-scale VoD service based in
China. The dataset contains the bandwidth demand in UUSee video channels sampled
every 10 minutes during the 2008 Summer Olympics. We consider two time spans in the
6.6. TRACE-DRIVEN SIMULATIONS 122
traces, time periods 702—780 and 1562—1640, containing 91 and 176 concurrent video
channels, respectively. We let each video channel i represent tenant i, with Di
equal to
the channel’s demand, assuming the same workload (or demand) is served by the clouds
in a broker-assisted market.
In our system, bandwidth reservation and trading are carried out online every �t = 10
minutes. (The same methodology can be applied to any other values of �t, e.g., 1
hour.) Before time t, the broker should have obtained estimates µµµt
= [µ1t, . . . , µNt
]
and ⌃t
= [�ijt
] for all tenants in the coming period [t, t + �t). Such statistics can
be predicted accurately based on historical demand data, since VoD demand follows
repeated daily patterns and is highly predictable [40, 58, 59]. Some established time
series forecasting tools in econometrics (e.g., [24, 25]) may be used for prediction. In
particular, we implement an online version of seasonal ARIMA and GARCH models
introduced in Chapter 3 and Chapter 4 to predict µµµt
and ⌃t
. In practice, historical
demand data can be easily obtained from cloud providers through their logging services
or from cloud providers themselves. Once µµµt
and ⌃t
are obtained, the broker maximizes
its profit by solving (6.19), making load direction decision W⇤ and bandwidth reservation
As
from each cloud provider s. The broker serves as a gateway that directs w⇤si
fraction
of tenant i’s incoming video requests to cloud provider s for service during [t, t + �t).
The above process is repeated for the next period [t + �t, t + 2�t).
To evaluate the benefit of such a system, we quantify the equilibrium point in the
market. Fig. 6.3 evaluates the saving on cloud bandwidth resources in equilibrium market.
Due to multiplexing, the aggregate bandwidthP
s
Ast
booked by the broker is less than
the aggregate bandwidthP
i
Bit
needed if each channel books individually. We set ✏ = 5%
and ✏ = 1% for time periods 702—780 and 1562—1640, respectively, resulting in an actual
6.6. TRACE-DRIVEN SIMULATIONS 123
702 722 742 762 78002468
101214
Time Periods
Gbp
s
Mean Bandwidth Saving =35%
Bandwidth Booked IndividuallyBandwidth Booked via BrokerAggregate Bandwidth Demand
(a) ✏ = 5%, ✏0 = 5.06%
1562 1582 1602 1622 164002468
10121416
Time Periods
Gbp
s
Mean Bandwidth Saving =36%
Bandwidth Booked IndividuallyBandwidth Booked via BrokerAggregate Bandwidth Demand
(b) ✏ = 1%, ✏0 = 1.27%
Figure 6.3: The aggregate bandwidthP
s
As
(t) booked by the broker, compared to theaggregate bandwidth
Pi
Bi
(t) needed if each channel books individually and the realaggregate demand
Pi
Di
(t).
under-provision probability of ✏0 = 5.06% and ✏0 = 1.27% in Fig. 6.3(a) and Fig. 6.3(b),
respectively. AlthoughP
i
Bit
always exceeds the real aggregate demandP
i
Dit
, it does
not mean there is no outage when booking individually: each individual channel is still
subject to an under-provision probability ✏. The broker can save bandwidth reservation
by more than 30% on average with controllable quality risks.
We further evaluate the price discount each tenant enjoys. At time t, tenant i pays
$ P 0it
(1) = µit
+ ✓�it
if reserving bandwidth individually, but pays $ P ⇤it
(1) = µit
+
✓�it
⇢iMt
in the equilibrium market with a broker. The discount it receives at time t is
1 � P ⇤it
(1)/P 0it
(1). Fig. 6.4 plots the histogram of discounts 1 � P ⇤it
(1)/P 0it
(1) for all i and
t, showing that the mean price discount depends on demand statistics and even exceeds
50% during the second time span.
An interesting finding is that discount is over 100% for some rare cases, which means
that the payment of some tenant i, P ⇤it
(1) = µit
+ ✓�it
⇢iMt
< 0, at some time t. We
call such tenants “bonus winners,” since the broker would even not mind paying to have
6.7. SUMMARY 124
0 20 40 60 80 100 1200
200
400
600
800
Discounts (%)
� Mean=44%
� Bonus� Winners
(a) t =702—780, 91 channels
0 20 40 60 80 100 1200
200
400
600
800
Discounts (%)
� Mean=53%
� Bonus� Winners
(b) t =1562—1640, 176 channels
Figure 6.4: The histogram of payment discounts of all channels at all times in equilibrium,i.e., the histogram of 1 � P ⇤
i
(1, t)/P 0i
(1, t) for all i and t.
them in the system. As pointed out in Sec. 6.5, this is because “bonus winners” have
demands negatively correlated to the market and thus are risk neutralizers: whenever
there is an increase in market demand, the demand of “bonus winners” drops to make
resources available for other tenants in the market.
6.7 Summary
In this chapter, we consider a scenario in which multiple tenants running user-facing
multimedia workloads can make reservations for bandwidth guarantees from cloud service
providers in order to support continuous media streaming. Such tenants (video providers)
attract inter-correlated random demands that can be directed to multiple cloud providers,
each with a bandwidth capacity. We introduce a profit-driven broker that statistically
mixes demands based on anti-correlation while controlling the resource shortage risks.
We study the behavior of the broker and how each tenant should be charged in such a
broker-assisted bandwidth reservation market.
6.7. SUMMARY 125
The region of all good pricing policies is characterized, such that making a profit
at the broker is equivalent to maximizing cloud resource utilization through optimal
load direction from tenants to the cloud providers. In a free market where tenants can
negotiate and bargain for bandwidth cost, we prove that the bandwidth prices charged to
the tenants will converge to a unique Nash equilibrium, which forms the lower bound of
the good pricing region. Furthermore, the equilibrium bandwidth price billed to a tenant
critically depends on its demand expectation, burstiness and its demand correlation to the
market. Real-world traces verify that the presence of a broker can lower the market price
for cloud bandwidth reservation by around 50% on average and save cloud bandwidth
resources by over 30%.
Chapter 7
Algorithmic Cloud Bandwidth
Pricing
In Chapter 6, we have shown that the bandwidth price each video company (as a cloud
tenant) has to pay in a cloud-based bandwidth reservation market depends on its demand
statistics. Yet we have assumed that each cloud tenant requires 100% of its demand
serviced no matter how expensive it is. However, in reality, a video company does have
an incentive to choose a “guaranteed portion” of its demand to be less than 100%, while
leaving a small amount of demand served with best e↵ort, if guaranteeing this small
additional demand is too costly.
We consider the following guaranteed service. A tenant can simply specify a percent-
age (e.g., 95%) of its (bandwidth) demand to be served with guaranteed performance,
which we call the guaranteed portion, while the rest of its demand will be served with
best e↵ort. It is then the cloud provider’s responsibility to satisfy the guaranteed portion
of the tenant with a high probability. Since the cloud provider has vast historical work-
load data, it can leverage statistical learning to predict tenant demands and make actual
126
CHAPTER 7. ALGORITHMIC CLOUD BANDWIDTH PRICING 127
bandwidth reservations for the tenants. The cloud can also multiplex tenant demands to
save bandwidth reservation cost.
In this chapter, we continue to study the pricing of such guaranteed services, but
assume the more realistic case that the bandwidth pricing can actually a↵ect tenant
demands, i.e., their choices of the guaranteed portion may vary according to the prices
set. On the other hand, such demand changes will in turn a↵ect the pricing decisions,
leading to the challenge of handling potentially unstable iterations. Moreover, due to
multiplexing, the absolute amount of bandwidth reserved for each tenant is unknown.
It is a challenging question to find out each tenant’s fair share in the aggregate service
cost, especially when pricing and demand depend on each other and both change in a
dynamic way.
To overcome these di�culties, we define the reservation fee of each tenant as a func-
tion of its specified guaranteed portion instead of the absolute amount of bandwidth
reserved. We also express each tenant’s utility as a function of its guaranteed portion,
which essentially measures the Quality of Service (QoS) at the tenant. For example,
a guaranteed portion of 95% for a video provider implies that 95% of all its end user
requests will receive a guaranteed service.
We study a cloud provider whose objective is to maximize the social welfare of the
system, i.e., the total expected tenant utility minus the aggregate service cost. Although
the cloud cannot know the exact form of utility at each tenant, it can a↵ect each tenant’s
choice of guaranteed portion indirectly through pricing, and thus control the social wel-
fare achieved. To expedite the computation of the optimal pricing policy in the presence
of a large number of tenants, we propose novel methods to e�ciently solve the underlying
large-scale distributed optimization problem with a coupled objective function. These
7.1. A NEW TENANT-CLOUD AGREEMENT 128
new methods are step-size-free and proved to be more e�cient than traditional gradi-
ent/subgradient methods based on Lagrangian dual decomposition in a decentralized
setting.
Finally, we evaluate the proposed algorithms on the workload traces of a real-world
VoD system, namely, UUSee [9]. We conduct trace-driven simulations of bandwidth
reservation and algorithmic pricing based on the demand prediction proposed in Chap-
ters 3 and 4. The system is shown to operate e↵ectively at a fine granularity (of as small
as 10 minutes).
Pricing in distributed systems has also been approached with game theory [43, 44].
However, the objective of this chapter is concerned with achieving fast and robust con-
vergence in large-scale systems, while incentive issues are beyond our focus.
7.1 A New Tenant-Cloud Agreement
Our system model is a generalization of the operation mode of the current cloud. Current
cloud providers charge tenants a usage fee based on the number of bytes transferred in
the past hour, and do not provide bandwidth guarantees. We extend this model to allow
tenants to make reservations for (egress) bandwidth guarantees explicitly, assuming that
the network reservation technologies proposed in [19, 39, 71] can be realized in cloud
computing systems. Our system operates on a short-term basis, e.g., based on hours or
tens of minutes. At the beginning of each short period, each tenant specifies a guaranteed
portion to guard against performance risks. The cloud decides the actual bandwidth
reservation for all the tenants based on both demand estimation and the submitted
guaranteed portions, and charges both a usage and reservation fee. We now describe our
system model in detail.
7.1. A NEW TENANT-CLOUD AGREEMENT 129
We consider one such short period, where N bandwidth-sensitive tenants are present.
Suppose that in this period, tenant i’s bandwidth demand is a random variable Di
(Mbps)
with mean µi
= E[Di
] and variance �2i
= var[Di
]. We assume the cloud can predict µi
and �i
based on demand history and share them with tenant i before the period starts.
In our proposed market, the key commodity traded is a notion called the guaranteed
portion instead of the absolute amount of bandwidth. Specifically, the tenants and cloud
will comply to the following service agreement Si
(wi
, ✏, Ri
):
• Before the period starts, each tenant i specifies a guaranteed portion wi
2 [0, 1];
• The cloud guarantees wi
fraction of demand Di
with a high probability 1 � ✏;
outage is allowed to happen with a small probability ✏, during which the bandwidth
allocated to tenant i is limited to Ri
.
The parameters ✏ and Ri
are a part of a service level agreement (SLA) advertised by the
cloud provider. We introduce the risk factor ✏, because for random demand, no matter
how much bandwidth is allocated, there exists a small risk of resource shortage.
Let qi
denote the actual bandwidth usage (realized data rate) in this period. Under
a guaranteed portion wi
, service Si
(wi
, ✏, Ri
) is supposed to lead to the following actual
usage of tenant i:
qi
(wi
) =
8><
>:
wi
Di
, w.p. 1 � ✏,
min{wi
Di
, Ri
}, w.p. ✏,
(7.1)
i.e., with probability 1� ✏ the actual usage qi
is a realization of wi
fraction of its demand
Di
, while during outage (which happens with a small probability ✏), the actual usage qi
is rationed by Ri
.
Clearly, tenant i will choose wi
based on both the utility Ui
and the price of guarantee-
ing wi
portion of its demand Di
. Unlike most prior work on network utility maximization
7.1. A NEW TENANT-CLOUD AGREEMENT 130
(NUM) [32], which assumes that the utility Ui
depends on a single variable such as rate,
we model utility Ui
(qi
, Di
) of tenant i as a function of both actual usage qi
and the de-
mand Di
. For example, a video content provider (or a VoD company) may have a linear
utility gain (or revenue) ↵i
qi
from usage qi
and a convexly increasing utility loss eAi
(Di
�q
i
)
for the denied requests Di
� qi
, with ↵i
, Ai
being tenant-specific parameters:
Ui
✓qi
(wi
), Di
◆= ↵
i
qi
(wi
) � eAi
(Di
�q
i
(wi
)), (7.2)
where the utility loss term can model the reputation degradation and potential revenue
loss due to unfulfilled demand1. We assume Ui
is concave and monotonically increasing
in qi
.
The price for tenant i to use service Si
(wi
, ✏, Ri
) is divided into two parts: a usage fee
and a reservation fee. As most current cloud providers do, we assume uniform pricing for
usage: each tenant pays $p for every unit of bandwidth consumed. As a key departure
from current clouds, we introduce a reservation fee, which is a function of the guaran-
teed portion instead of the absolute amount of bandwidth reservation: each tenant i is
charged a price of $ki
wi
for having wi
portion of its demand guaranteed. We price the
guaranteed portion wi
rather than the absolute bandwidth, because tenants usually have
no idea about how much bandwidth they need. Instead, they can intuitively know what
percentage of guarantee is desired. This new business model frees each tenant from the
computational burden of demand prediction: it simply submits its desired guaranteed
portion wi
, while the cloud provider computes the actual bandwidth reservation as well
as the reservation fee ki
wi
for each tenant.
1Even though the cloud provider may still be able to fulfill Di � qi in a best-e↵ort fashion, the tenantwill have no knowledge if this is the case, and will not be able to factor it into its expected utility.
7.1. A NEW TENANT-CLOUD AGREEMENT 131
We define the surplus of tenant i as its utility minus its price:
Si
(wi
, p, ki
) := Ui
✓qi
(wi
), Di
◆� pq
i
� ki
wi
. (7.3)
Given prices p and ki
, a rational tenant will choose a wi
to maximize its surplus Si
.
Tenant i will not always choose wi
= 1 because when its demand is bursty, the price to
guarantee 100% of Di
may be high. In this case, tenant i will choose a wi
close to 1
instead of being exactly 1, while the rest of its demand (1 � wi
)Di
will be served with
best e↵ort.
Based on tenant-specified guaranteed portions w1, . . . , wN
, the cloud should guarantee
the demands w1D1, . . . , wN
DN
for service. Denote w := [w1, . . . , wN
]T. To realize the
above service guarantees, the cloud provider needs to reserve a total bandwidth capacity of
K(w). Depending on the technology used, the value of K could vary significantly from
one case to another. For example, a simple non-multiplexing technology is to reserve
Ri
capacity for each tenant i individually such that demand wi
Di
is satisfied with high
probability, i.e.,
Pr(wi
Di
> Ri
) < ✏, (7.4)
and correspondingly, reserve capacity K =P
i
Ri
in total. In contrast, a multiplexing
technology will reserve capacity K for the tenants altogether such that the aggregate
demand is satisfied with high probability, i.e.,
Pr(X
i
wi
Di
> K) < ✏, (7.5)
and during outage (whenP
i
wi
Di
> K), the usage qi
of tenant i is rationed to Ri
withP
i
Ri
= K. In both cases, K is an implicit function of w defined by the probabilistic
7.2. PRICING TOWARDS SOCIAL WELFARE MAXIMIZATION 132
constraints (7.4) and (7.5), respectively. To determine K(w), the cloud provider must
estimate the future demand statistics of all the tenants, and convert tenant-specified
guaranteed portions w into the actual total bandwidth reservation K.
Similarly, the cloud provider has two kinds of service costs: usage and reservation
costs. We assume tier-1 ISPs charge the cloud provider $b for every unit of bandwidth
actually used. Furthermore, reserving bandwidth capacity K will incur a reservation cost
of $c(K). Due to multiplexing gain, to guarantee a similar service level, multiplexing will
incur a lower K and thus a lower reservation cost c(K) than without multiplexing.
We define the cloud profit ⇧ as the di↵erence between its total revenue and total cost,
i.e.,
⇧(w) :=X
i
✓pq
i
(wi
) + ki
wi
◆� c
✓K(w)
◆�X
i
bqi
(wi
). (7.6)
7.2 Pricing Towards Social Welfare Maximization
We study a cloud provider as a social planner whose objective is to maximize social
welfare W (w), which is defined as the total tenant utility minus the total service cost:
W (w) :=X
i
Ui
� c
✓K(w)
◆�X
i
bqi
(wi
)
= ⇧(w) +X
i
Si
(wi
, p, ki
) (7.7)
Under random demands, the cloud aims to decide a set of optimal guaranteed portions
w⇤ = [w⇤1, . . . , w
⇤N
]T for the tenants to maximize the expected social welfare by solving
maxw E[W (w)]
s.t. 0 w 1.(7.8)
7.2. PRICING TOWARDS SOCIAL WELFARE MAXIMIZATION 133
To solve (7.8), we first derive the expected social welfare in a simple approximated
form. Note that E[Ui
] is bounded as follows:
E[Ui
] � (1 � ✏)E[Ui
(wi
Di
, Di
)] + ✏E[Ui
(0, Di
)],
E[Ui
] (1 � ✏)E[Ui
(wi
Di
, Di
)] + ✏E[Ui
(Ri
, Di
)].
When the risk factor ✏ is small, we have
E[Ui
(qi
, Di
)] ⇡ (1 � ✏)E[Ui
(wi
Di
, Di
)]. (7.9)
To simplify notations, we define
Ui
(wi
) := E[Ui
(wi
Di
, Di
)], (7.10)
which turns out to be monotonically increasing and concave in wi
under very mild tech-
nical conditions. Similarly, the expected usage of tenant i is
E[qi
(wi
)] ⇡ (1 � ✏)E[wi
Di
] = (1 � ✏)wi
µi
. (7.11)
Therefore, the expected surplus of tenant i is
E[Si
(wi
, p, ki
)] =E[Ui
] � pE[qi
] � ki
wi
= (1 � ✏)
✓U
i
(wi
) � pwi
µi
◆� k
i
wi
, (7.12)
7.2. PRICING TOWARDS SOCIAL WELFARE MAXIMIZATION 134
and the expected profit of the cloud provider is
E[⇧(w)] = (p � b)(1 � ✏)X
i
wi
µi
+X
i
ki
wi
� c
✓K(w)
◆. (7.13)
Substituting the above into (7.7) gives the expected social welfare as
E[W (w)] = (1 � ✏)X
i
✓U
i
(wi
) � bwi
µi
◆� c
✓K(w)
◆. (7.14)
7.2.1 An Equivalent Pricing Problem
In reality, although the cloud provider has full knowledge about its service cost c(K(w)),
it does not know the utility function of each tenant. In other words, maximizing E[W (w)]
in terms of w requires the cloud to know the utility Ui
of each tenant and is infeasible.
We now convert problem (7.8) into an equivalent pricing problem.
Note that the expected social welfare is also the sum of the expected cloud profit and
the total expected tenant surplus, i.e.,
E[W (w)] = E[⇧(w)] +X
i
E[Si
(wi
, p, ki
)]. (7.15)
Furthermore, when charged with prices p, ki
and facing a random demand Di
, a rational
tenant will choose a guaranteed portion wi
to maximize its expected surplus, i.e.,
wi
= arg maxw
i
E[Si
(wi
, p, ki
)], (7.16)
which defines an implicit function wi
(p, ki
) of the prices. The cloud can a↵ect guaranteed
portion choices w = [w1, . . . , wN
]T via appropriate choices of prices p, k1, . . . , kN
, and
7.2. PRICING TOWARDS SOCIAL WELFARE MAXIMIZATION 135
Tenant i
Cloud Provider
Update
wi(p, ki) p, ki
wi(p, ki) = arg maxwi
E�Si(wi, p, ki)
�
Tenant 1 Tenant N... ...
p, ki to increase social welfare
Figure 7.1: Iterative updates of prices and guaranteed portions.
control the corresponding expected social welfare E[W (w)].
Therefore, the social welfare maximization problem (7.8) is converted into an equiv-
alent optimal pricing problem:
maxp,k1,...,k
N
E[W (w)] = E[⇧(w)] +X
i
E[Si
(wi
, p, ki
)], (7.17)
which, by combining (7.14) and (7.30), can be rewritten as
maxp,k1,...,k
N
(1 � ✏)X
i
✓U
i
(wi
) � bwi
µi
◆� c
✓K(w)
◆, (7.18)
where wi
= wi
(p, ki
) is determined in a distributed fashion by each tenant i via surplus
maximization (7.16). Such a distributed optimization is illustrated in Fig. 7.1. Now the
cloud provider does not need to know Ui
: it simply charges tenant i the usage price p
and reservation price ki
, and expects a guaranteed portion w(p, ki
) chosen by tenant i.
We denote the optimal prices that solve problem (7.18) as p⇤, k⇤1, . . ., k⇤
N
. The
optimal pricing problem (7.18) is equivalent to the original problem (7.8), because by
adjusting p and ki
, wi
(p, ki
) can take any value in [0, 1]. In other words, the guaranteed
portion wi
(p⇤, k⇤i
) chosen by tenant i under optimal pricing p⇤, k⇤i
is exactly the guaranteed
7.2. PRICING TOWARDS SOCIAL WELFARE MAXIMIZATION 136
portion w⇤i
that maximizes the expected social welfare, i.e., we have
w⇤i
= wi
(p⇤, k⇤i
). (7.19)
Therefore, once a set of optimal prices is obtained, we essentially have found a de-
centralized solution to expected social welfare maximization (7.8), which was originally
impossible to solve in a centralized way.
Nonetheless, the optimal pricing problem (7.18) is not easy to solve either. At first
glance, (7.18) can be understood as a network utility maximization (NUM) problem
[32] that may be solved via decomposition among the tenants. A closer look at (7.18)
suggests that the term c(K(w)) in the objective function may be coupled among all wi
’s,
so that (7.18) cannot be decomposed into a set of subproblems, each solved at a tenant
in a distributed fashion. Coupling happens in the cost term when the cloud multiplexes
tenant demands and books a capacity K(w) for the aggregated demand. As shown in
Fig. 7.1, the key to the solution is that the cloud provider must be able to update p
and ki
iteratively towards the direction that increases E[W (w)]. And an e�cient price
update algorithm should require fewer rounds of message-passing between the cloud and
tenants before reaching optimality.
Before presenting the distributed solutions to (7.18), let us first provide a number of
insights on how to make pricing policies, by checking the KKT conditions [20] that the
optimal prices p⇤, k⇤1, . . . , k
⇤N
must satisfy.
Proposition 7 The optimal prices p⇤, k⇤1, . . . , k
⇤N
must satisfy
(1 � ✏)(p⇤ � b)µi
+ k⇤i
� c0✓
K(w)
◆· @K
@wi
����w=w
= 0, 8 i, (7.20)
7.2. PRICING TOWARDS SOCIAL WELFARE MAXIMIZATION 137
where w = [w1, . . . , wN
]T with wi
= wi
(p⇤, k⇤i
) given by (7.16).
Proof: First, applying the first order conditions @E[Si
]/@wi
= 0 to (7.16) and using
the approximation (7.12), we obtain
(1 � ✏)U0i
(wi
) = p(1 � ✏)µi
+ ki
, (7.21)
which defines an implicit function wi
(p, ki
), with partial derivatives given by
@wi
@p=
µi
U00i
(wi
)< 0,
@wi
@ki
=1
(1 � ✏)U00i
(wi
)< 0. (7.22)
The partial derivatives are negative because U00i
(wi
) < 0 by the concavity of Ui
(·). This
means that when the usage and reservation prices are higher, the tenant will choose a
lower guaranteed portion.
Furthermore, applying KKT conditions [20] @E[W ]/@ki
= 0 and @E[W ]/@p = 0 to
(7.18), we obtain
✓(1 � ✏)
✓U
0i
(wi
) � bµi
◆� c0(K)@K
@wi
����w=w
◆@w
i
@ki
= 0, 8 i,
X
i
✓(1 � ✏)
✓U
0i
(wi
) � bµi
◆� c0(K)@K
@wi
����w=w
◆@w
i
@p= 0.
Since @wi
/@p < 0 and @wi
/@ki
< 0, we have
(1 � ✏)
✓U
0i
(wi
) � bµi
◆� c0(K)@K
@wi
����w=w
= 0. (7.23)
Combining with (7.21), we have derived the proposition. ut
7.2. PRICING TOWARDS SOCIAL WELFARE MAXIMIZATION 138
An inspection of (7.20) reveals that one set of optimal prices is given by
8>><
>>:
p⇤ = b,
k⇤i
= @c(K(w))@w
i
����w=w
, 8 i.(7.24)
Although (7.24) is not the only set of optimal prices, our finding complies with the
economic intuition that a welfare-maximizing cloud provider should charge marginal cost
for both tra�c usage and guaranteed reservation. In (7.24), we can also observe that k⇤i
depends on w, which in turn depends on p, k⇤1, . . . , k
⇤N
. Due to such coupling, (7.24) is
not yet a closed-form solution for the reservation price k⇤i
.
7.2.2 No Multiplexing vs. Multiplexing All
To draw insights, we take a look at two special service technologies that may be adopted
by the cloud: non-multiplexing and multiplexing across all the tenants. For simplicity,
we assume a linear reservation cost (which will be relaxed later): c(K) = �K.
When multiplexing is not used, we can derive the optimal prices in a closed form
from (7.24). Without multiplexing, recall that the capacity Ri
is reserved for each tenant
i individually, such that Pr(wi
Di
> Ri
) < ✏ and K =P
i
Ri
. When Di
is a Gaussian
random variable, it is easy to check that
Ri
(wi
) =
✓µi
+ ✓(✏)�i
◆w
i
, (7.25)
where ✓(✏) = F�1(1 � ✏) is a constant, with F (·) being the CDF of normal distribution
N (0, 1). Since the cost function is naturally decoupled among tenants, according to
7.2. PRICING TOWARDS SOCIAL WELFARE MAXIMIZATION 139
(7.24), the optimal prices are immediately given in a closed form by p⇤ = b and
k⇤i
= �
✓µi
+ ✓(✏)�i
◆, 8 i. (7.26)
When multiplexing is used, however, optimal prices have no explicit solutions. Recall
that with multiplexing, a capacity K is reserved to accommodate all the tenants together,
such that the aggregate (instead of individual) demand is satisfied with high probability:
Pr(P
i
wi
Di
> K) < ✏.
Since the random demands D1, . . . , DN
of di↵erent tenants may be correlated, we
denote ⇢ij
the correlation coe�cient of Di
and Dj
, with ⇢ii
⌘ 1. For convenience, let
µµµ = [µ1, . . . , µN
]T and ⌃ = [�ij
] be the N ⇥ N symmetric demand covariance matrix,
with �ii
= �2i
and �ij
= ⇢ij
�i
�j
for i 6= j.
Under Gaussian demands, K can be written as
K(w) =E[P
i
wi
Di
] + ✓(✏)pvar[
Pi
wi
Di
]
=µµµTw + ✓(✏)pwT⌃⌃⌃w. (7.27)
Substituting the above K(w) into (7.24) gives p⇤ = b and
k⇤i
= �
✓µi
+ ✓(✏) · @pwT⌃⌃⌃w
@wi
����w=w⇤
◆, 8 i, (7.28)
where w⇤i
= wi
(p⇤, k⇤i
). Clearly, with multiplexing, k⇤i
is not given in a closed form yet,
due to the coupled cost function.
We note that whether with or without multiplexing, K(w) is a convex function and
so is c(K(w)). In fact, we can relax the linear cost assumption c(K) = �K, as long as
7.2. PRICING TOWARDS SOCIAL WELFARE MAXIMIZATION 140
c(K(w)) is strictly convex and monotonically increasing in terms of each wi
2 [0, 1].
There is an interesting connection between the non-multiplexing and multiplexing
cases: the optimal solution for non-multiplexing can be used to bound the optimal solu-
tion for the multiplexing case. Specifically, the optimal prices {k⇤i
} of non-multiplexing
upper-bound {k⇤i
} of the multiplexing case, whereas the optimal portions {w⇤i
} of non-
multiplexing lower-bound {w⇤i
} of the multiplexing case. This is intuitive because multi-
plexing leads to a reduced cost c(K), stimulating tenants to increase their choices of the
guaranteed portion. We will use this connection in our distributed algorithms.
7.2.3 Economic Implications
Condition (7.20) has several economic implications, which apply to a general cost func-
tion, although we may use the non-multiplexing case for explanation due to its simplicity.
First, merely adopting a usage price p cannot maximize social welfare: when ki
= 0
for all i, there is no p that can satisfy (7.20). In other words, a positive reservation
fee ki
> 0 is necessary to achieve welfare optimality, since in the presence of demand
uncertainty, only ki
can incorporate a risk factor (e.g., �i
in (7.26)) into pricing. This
reveals that current cloud bandwidth pricing schemes are ine�cient in terms of providing
service guarantee against demand fluctuation. On the other hand, a usage price p is not
necessary: even if p = 0, the expected welfare is maximized as long as ki
satisfies (7.20).
In this case, the reservation fee can be raised to compensate for no usage fee.
Furthermore, heterogeneous reservation prices k1, . . . , kN
are necessary to achieve op-
timality, with each ki
depending on the statistical characteristics of tenant i’s demand Di
.
This conforms to the intuition that tenants have di↵erent degrees of demand volatility,
incurring di↵erent costs for service guarantees. For example, without multiplexing, ki
7.3. DISTRIBUTED OPTIMIZATION WITH COUPLED OBJECTIVES 141
depends on �i
in (7.26): the more bursty a tenant’s demand Di
, the more capacity that
must be reserved to guard against fluctuation, and thus the higher the price. In contrast,
in terms of usage pricing, it is e�cient enough to charge a homogeneous price p for every
unit of bandwidth consumed.
7.3 Distributed Optimization with Coupled Objec-
tives
To solve (7.8), we find that the social welfare maximization problem (7.8) can be written
as a basic resource allocation problem of the following form2:
maxx
nX
i=1
Ui
(xi
) � C(x) (7.29)
s.t. xi
2 [ai
, bi
], i = 1, . . . , n,
for some real numbers ai
, bi
, where x = (x1, . . . , xn
) represents resource allocation, each
Ui
is a strictly concave and monotonically increasing utility function, and C(x) is a
strictly convex cost function that is monotonically increasing in each xi
. Assume C and
Ui
are twice continuously di↵erentiable. In Sec. 7.6.6, we will see that based on duality
theory, problems with general coupled objectives can be transformed into (7.29).
Problem (7.29) also arises in many other networking systems, where each user i is
associated with an utility Ui
(xi
) by using xi
units of some resource, while the users
incur a total cost C(x) at the resource provider, which may be coupled among xi
’s in an
arbitrary way depending on the technology used. The goal of the resource provider is to
2The notations used from Sec. 7.3 to Sec. 7.6 will be independent of the notations used elsewhere inthis chapter.
7.3. DISTRIBUTED OPTIMIZATION WITH COUPLED OBJECTIVES 142
make resource allocation decision x wisely, so that the social welfareP
i
Ui
(xi
) � C(x) is
maximized. Apparently, granting the maximum resources to users will maximize the sum
of their utilities, but will also incur a high cost C(x). As a convex optimization problem,
(7.29) has a unique solution x⇤ = (x⇤1, . . . , x
⇤n
). However, it is undesirable to solve (7.29)
in a centralized way using conventional algorithms such as interior point methods [20],
because the utility Ui
is not known to the resource provider or other users, and the cost
function C is unknown to the users.
Before solving the cloud social welfare maximization problem (7.8), we propose new
distributed approaches to the general resource allocation problem (7.29) with coupled
objective functions, with a goal towards fast and robust convergence in large-scale sys-
tems. Let us illustrate the basic idea of the new approach with the simple example of
maximizing the sum of user utilities minus a coupled provider cost. The algorithm pro-
ceeds in the following iterative steps: given all the resource allocation requirements from
users, the provider computes a price vector as a weighted average of the marginal cost
under current resource requirements and the previous price; each user then computes its
resource requirement under the new price by maximizing its surplus, that is, its utility
minus the price.
The new approach distinguishes itself from traditional gradient methods in three as-
pects. First, while the performance of gradient methods is sensitive to stepsize selection,
the parameter in our approach is a weight between 0 and 1, which is easier to set and
achieves robustness. Second, our approach can be understood as a novel distributed
version of the nonlinear Jacobi algorithm [21], and only passes the gradient information
of certain utilities, e.g., the cost function, instead of all the utilities. By exploiting the
7.4. A PRICING FRAMEWORK FOR DISTRIBUTED OPTIMIZATION 143
structural di↵erence between utility functions, our approach can achieve much faster con-
vergence in certain scenarios. Third, in certain engineering applications, unlike gradient
methods, our approach satisfies a contraction assumption [21], and thus converges even if
executed asynchronously among network elements, further increasing convergence speed.
While enjoying these advantages, the new approach only feeds back gradient information,
and has the same economic interpretation as pricing.
We provide asynchronous convergence conditions of the new approach and compare
its convergence speed with that of gradient methods both analytically and through ex-
periments based on real-world trace data. Furthermore, we show how to generalize the
above basic idea, with the aid of duality theory, to solve a general class of problems with
additive coupled objective functions, which is also referred to as consensus optimization
[26] in statistical and machine learning. The proposed approach also turns out to be a
fast decentralized solution to machine learning problems that involve high-dimensional
data and a large number of training samples.
7.4 A Pricing Framework for Distributed Optimiza-
tion
Let us first introduce a pricing framework, based on which we interpret various distributed
algorithms. Distributed solutions to problem (7.29) are feasible via pricing. The basic
idea is that the provider should charge each user i a price pi
xi
for using resource xi
, where
the unit price pi
can be heterogeneous among users. Given a price vector p = (p1, . . . , pn),
7.4. A PRICING FRAMEWORK FOR DISTRIBUTED OPTIMIZATION 144
problem (7.29) is equivalent to
maxx2
Qi
[ai
,b
i
]
nX
i=1
✓Ui
(xi
) � pi
xi
◆+
✓pTx � C(x)
◆, (7.30)
where Ui
(xi
) � pi
xi
is the surplus of user i for using resource xi
, and pTx � C(x) is the
profit of the provider.
We can classify pricing-based iterative solutions to (7.30) into two types, depending
on who updates the price. In a type I algorithm, the provider sets the price p, under
which each user will maximize its surplus and choose to use resource
xi
(pi
) := arg maxx
i
2[ai
,b
i
]Ui
(xi
) � pi
xi
, 8 i. (7.31)
The provider aims to update p iteratively, such that x(p) = (x1(p1), . . . , xn
(pn
)) produced
by the surplus-maximizing users will converge to the optimal resource allocation x⇤.
We assume each network element can only use its local information and the available
feedback. Therefore, when the resource provider sets the price, the price update for each
user i is a function of the current price p and the returned user resource requirements
x1(p1), . . . , xn
(pn
) given by (7.31). The provider has full knowledge of C but no knowledge
of Ui
.
In a type II algorithm, each user submits a price (or bid) pi
to the provider, who
determines resource allocation x(p) by maximizing its own profit:
x(p) := arg maxx2
Qi
[ai
,b
i
]pTx � C(x), (7.32)
with x(p) = (x1(p), . . . , xn
(p)), and feeds xi
(p) back to each user i, who then updates its
7.5. ALGORITHMS 145
bid pi
as a function of its current bid pi
and the resource allocation xi
(p) determined by
the provider through (7.32). The bid updating at user i can make use of Ui
, but not C
or Uj
for j 6= i.
In both cases, the price update can be rewritten as a function
pi
:= Ti
(p), (7.33)
with T = (T1, . . . , Tn
). The above locality assumptions on the structure of T enforce price
updating under the minimum message-passing overhead. Clearly, when message-passing
is limited to only prices p and resource allocations x(p), any algorithm that requires
information beyond gradients, e.g., Newton methods, can hardly be decentralized.
7.5 Algorithms
Our main objective is to propose e↵ective price updating rules T in both types of algo-
rithms, so that x(p(t)) can converge to the optimal resource allocation x⇤ within a small
number of iterations. We propose two classes of iterative algorithms, depending on who
updates the price p. Before presenting the new algorithms, let us review the existing
method based on Lagrangian dual decomposition with gradient algorithms.
7.5.1 Dual Decomposition and Gradient Methods
We now briefly describe Lagrangian dual decomposition and interpret it as a price up-
dating algorithm of type I introduced in Sec. 7.3. Problem (7.29) is equivalent to
maxx,y
nX
i=1
Ui
(xi
) � C(y) (7.34)
7.5. ALGORITHMS 146
s.t. x = y, xi
2 [ai
, bi
], i = 1, . . . , n.
It is easy to check that the Lagrangian dual problem of (7.34) is
minp
q(p) =nX
i=1
maxx
i
2[ai
,b
i
]{U
i
(xi
) � pi
xi
} + maxy
{pTy � C(y)}.
Since problem (7.34) is convex, the duality gap is zero: if p⇤ solves the dual problem,
then x⇤i
= arg maxx
i
2[ai
,b
i
] Ui
(xi
) � p⇤i
xi
solves the original problem.
Thus, we only need to solve the dual problem. It is known [20, 21, 32] that the
gradient of q(p) is
rq(p) = y(p) � x(p), (7.35)
where y(p) = (y1(p), . . . , yn
(p)) is given by
y(p) = arg maxy
pTy � C(y), (7.36)
and x(p) = (x1(p1), . . . , xn
(pn
)) is defined as follows:
xi
(pi
) = arg maxx
i
2[ai
,b
i
]Ui
(xi
) � pi
xi
. (7.37)
Applying a gradient algorithm to the dual problem yields the following price updating
rule:
pi
:= Ti
(p) = pi
� �
✓yi
(p) � xi
(pi
)
◆, (7.38)
where � is a su�ciently small positive stepsize.
Obviously, (7.38) can be interpreted as a type I algorithm: given a p set by the resource
7.5. ALGORITHMS 147
provider, each user returns xi
(pi
) to the provider by maximizing its surplus. The provider
then computes y(p) locally by maximizing pTy �C(y) and updates p according to (7.38).
Apparently, price update here only makes use of the current price p, the returned x(p),
and the local cost function C at the provider. Similarly, (7.38) can also be explained as
a type II algorithm, which we omit here. We will compare the convergence conditions
and speeds of our new algorithms against the above dual decomposition with gradient
methods.
7.5.2 New Algorithms
We propose two types of new algorithms for solving problem (7.29), which satisfy the
locality assumptions on T in Sec. 7.3.
Algorithm 1: Assume the resource provider updates the price p. The price update
rule is, for all i,
pi
:= Ti
(p) = (1 � �)pi
+ �ri
C
✓x1(p1), . . . , xn
(pn
)
◆, (7.39)
where � 2 (0, 1] is a relaxation parameter, and
xi
(pi
) := arg maxx
i
2[ai
,b
i
]Ui
(xi
) � pi
xi
(7.40)
is the resource requirement collected from each user i under the current price p. This
algorithm has been presented in our published work [57], and a variation of this algorithm
has been presented in our work [56].
Algorithm 2: Assume each user updates its bid pi
. The price update rule is, for all
7.5. ALGORITHMS 148
i,
pi
:= Ti
(p) = (1 � �)pi
+ �U 0i
✓xi
(p)
◆, (7.41)
where � 2 (0, 1] is a relaxation parameter, and
(x1(p), . . . , xn
(p)) = x(p) := arg maxx2Q
i
[ai
, bi
]pTx � C(x) (7.42)
is the resource allocation determined by the provider under the current submitted bids
p.
Instead of relying on dual decomposition, Algorithms 1 and 2 actually observe the
optimality conditions of (7.29). For ease of explanation, let us ignore the constraint
xi
2 [ai
, bi
] for now. By first-order conditions, if x⇤ solves (7.29), we have
U 0i
(x⇤i
) = ri
C(x⇤), 8 i. (7.43)
Let us consider Algorithm 1 when � = 1. Assume the iteration (7.39) has a fixed point
p⇤. Then at p⇤, we have
U 0i
✓xi
(p⇤i
)
◆= p⇤
i
= ri
C
✓x1(p
⇤1), . . . , xn
(p⇤n
)
◆, 8 i, (7.44)
where the first equality is due to the first-order condition of (7.40). Thus, (x1(p⇤1), . . . , xn
(p⇤n
))
satisfies the conditions (7.43) and is an optimal solution x⇤, as our problem is convex.
Similarly, Algorithm 2 executes price updates in the reverse direction of Algorithm 1,
also aiming to achieve (7.44).
The following proposition shows that as long as Algorithm 1 or Algorithm 2 converges,
it will solve the optimization problem (7.29).
7.5. ALGORITHMS 149
Proposition 8 Suppose that Algorithm 1 (or Algorithm 2) has a fixed point p⇤ and an
associated x⇤. Then, x⇤ solves (7.29).
Proof: One approach is to show the fixed point x⇤ satisfies the KKT conditions of
(7.29). Here, we present a simpler proof based on the following lemma, the proof of which
can be found in [21] (pp. 210, Proposition 3.1):
Lemma 4 Suppose F is convex on a convex set X. Then, x 2 X minimizes F over X,
if and only if (y � x)TrF (x) � 0 for every y 2 X.
Suppose Algorithm 1 has a fixed point p⇤ and an associated x⇤. Then we have
8><
>:
p⇤i
= ri
C(x⇤), 8 i,
x⇤i
= arg maxx
i
2[ai
,b
i
] Ui
(xi
) � p⇤i
xi
, 8 i.
(7.45)
Since x⇤i
maximizes Ui
(xi
) � p⇤i
xi
on [ai
, bi
], by Lemma 4,
(yi
� x⇤i
)(U 0i
(x⇤i
) � p⇤i
) 0, 8 yi
2 [ai
, bi
], 8 i.
Since p⇤i
= ri
C(x⇤), we have
Pi
(yi
� x⇤i
)(U 0i
(x⇤i
) � ri
C(x⇤)) 0, 8 y 2Q
i
[ai
, bi
], (7.46)
which, by Lemma 4, means x⇤ solves (7.29).
Suppose Algorithm 2 has a fixed point p⇤ and an associated x⇤. Then, we have
8><
>:
p⇤i
= U 0i
(x⇤i
), 8 i,
x⇤ = arg maxx2
Qi
[ai
,b
i
] p⇤Tx � C(x).
(7.47)
7.5. ALGORITHMS 150
Applying Lemma 4 to the second equation above yields
Pi
(yi
� x⇤i
)(p⇤i
� ri
C(x⇤)) 0, 8 y 2Q
i
[ai
, bi
].
Substituting p⇤i
= U 0i
(x⇤i
) into the equation above leads to (7.46), which, by Lemma 4,
means x⇤ solves (7.29). ut
In Algorithms 1 and 2, surplus (profit) maximization is performed at either users or
the provider, whereas in gradient methods (7.38), such maximizations must be performed
at both users and the provider to obtain x(p) and y(p). First, this greatly reduces the
computational complexity when n is large. In particular, Algorithm 1 can scale to a
very large n without having to solve convex optimization at the provider side. More
importantly, we will see that the new algorithms can better exploit the structures of Ui
and C to achieve faster convergence rates. We will also see that it is easier to choose �
in the new algorithms since it does not have to be very small.
7.5.3 Relationship to the Nonlinear Jacobi Algorithm
All the above algorithms have synchronous and asynchronous versions, depending on
whether updates of pi
and xi
(p) are performed for all users i synchronously or asyn-
chronously. For example, in an asynchronous Algorithm 1, pi
and xi
(pi
) can be updated
for an user i for any finite number of iterations, before other xj
(pj
) for j 6= i are updated.
We have mentioned in Chapter 2 that the nonlinear Jacobi algorithm [21] is a limiting
case of the asynchronous version of Algorithm 1, while Algorithms 1 and 2 can be un-
derstood as novel decentralizations of the nonlinear Jacobi algorithm. In the following,
we show the connection of our proposed algorithms to the nonlinear Jacobi algorithm.
The nonlinear Jacobi algorithm consists of iteratively optimizing in a parallel fashion
7.6. CONVERGENCE CONDITIONS 151
with respect to each variable xi
while keeping the rest of the variables xj
(j 6= i) fixed.
Formally, for problem (7.29), it is defined as
xi
:= arg maxx
i
2[ai
,b
i
]Ui
(xi
) � C(x1, . . . , xi
, . . . , xn
) +X
j 6=i
Uj
(xj
). (7.48)
And the above update is to be conducted for all i in parallel. Note that the above non-
linear Jacobi algorithm cannot be applied to our problem, because the resource provider
does not know any user utility Ui
, and the users do not know the cost function C. Neither
can the nonlinear Jacobi algorithm be interpreted as a pricing-based algorithm.
However, the nonlinear Jacobi algorithm is a limiting case of Algorithm 1 executed
asynchronously. If for each user i, we perform (7.39) and (7.40) of Algorithm 1 for an
infinite number of iterations before updating the other variables xj
(for all j 6= i) in
equation (7.39) for this user, then the xi
obtained this way is the same as the one given
by (7.48), and thus Algorithm 1 becomes the nonlinear Jacobi algorithm. In other words,
the Algorithm 1 proposed here, with a convenient interpretation as pricing, is essentially
a novel distributed version of the nonlinear Jacobi algorithm.
7.6 Convergence Conditions
We analyze the convergence of the above algorithms, and in particular, aim to find su�-
cient conditions under which a certain algorithm is a contraction mapping [21]. Although
more general convergence conditions may be found for an algorithm when executed syn-
chronously, contraction mapping can ensure its convergence even if executed completely
asynchronously [21]. Note that an asynchronous algorithm is much better than its syn-
chronous counterpart, since message-passing between network elements usually incurs
7.6. CONVERGENCE CONDITIONS 152
heterogeneous delays in reality. Based on the derived conditions, we will extensively
compare the algorithm robustness to the choice of � and their convergence speeds in
Sec. 7.8.
7.6.1 Contraction Mapping
As some preliminaries, let us introduce contraction mapping and its properties. Many
iterative algorithms can be written as
x(t + 1) := T (x(t)), t = 0, 1, . . . , (7.49)
where x(t) 2 X ⇢ <n, and T : X 7! X is a mapping from X into itself. The mapping T
is called a contraction if
kT (x) � T (y)k ↵kx � yk, 8 x, y 2 X, (7.50)
where k ·k is some norm, and the constant ↵ 2 [0, 1) is called the modulus of T . Further-
more, the mapping T is called a pseudo-contraction if T has a fixed point x⇤ 2 X (such
that x⇤ = T (x⇤)) and
kT (x) � x⇤k ↵kx � x⇤k, 8 x 2 X. (7.51)
The following theorem establishes the geometric convergence of both contractions and
pseudo-contractions:
Theorem 9 (Geometric Convergence) Suppose that X ⇢ <n and the mapping T : X 7!
X is a contraction, or a pseudo-contraction with a fixed point x⇤ 2 X. Suppose the
7.6. CONVERGENCE CONDITIONS 153
modulus of T is ↵ 2 [0, 1). Then, T has a unique fixed point x⇤ and the sequence {x(t)}
generated by x(t + 1) := T
✓x(t)
◆satisfies
kx(t) � x⇤k ↵tkx(0) � x⇤k, 8 t � 0, (7.52)
for every choice of the initial vector x(0) 2 X. In particular, {x(t)} converges to x⇤
geometrically.
To simplify notations, we may use ci
(x) to stand for ri
C(x) and ui
(xi
) for U 0i
(xi
).
We denote @x
j
x
i
C(x) = rj
ci
(x) as the second order partial derivatives of C.
7.6.2 Convergence of Algorithm 1
We analyze the convergence of Algorithm 1 under di↵erent choices of �. First, suppose
� = 1. We rewrite Algorithm 1 in another form amenable to analysis. Denote [xi
]+i
the
projection of xi
2 < onto the interval [ai
, bi
], i.e.,
[xi
]+i
= arg minz2[a
i
,b
i
]|z � x
i
|, i = 1, . . . , n. (7.53)
It is easy to check that (7.40) is equivalent to
xi
(pi
) := [arg maxx
i
Ui
(xi
) � pi
xi
]+i
= [u�1i
(pi
)]+i
(7.54)
Therefore, letting � = 1 and substituting (7.39) into (7.54) yields
xi
:= Ti
(x) = [u�1i
(ci
(x))]+i
, i = 1, . . . , n, (7.55)
7.6. CONVERGENCE CONDITIONS 154
which is an equivalent iteration to Algorithm 1 when � = 1. The following result gives a
su�cient condition for the convergence of Algorithm 1 with � = 1.
Proposition 10 Suppose � = 1. If, for all i, we have
nX
j=1
|@x
j
x
i
C(x)| < minz
i
|U 00i
(zi
)|, 8 x 2Q
i
[ai
, bi
], (7.56)
then T = (T1, . . . , Tn
) given by (7.55) is a contraction and the sequence {x(p(t))} gener-
ated by Algorithm 1 converges geometrically to the optimal solution x⇤ to (7.29), given
any initial price vector p(0).
Proof: For each i, define a function gi
: [0, 1] 7! < by
gi
(t) = u�1i
✓ci
(z(t))
◆= u�1
i
✓ci
(tx + (1 � t)y)
◆. (7.57)
Note that gi
is continuously di↵erentiable. We have
|Ti
(x) � Ti
(y)| = |[u�1i
(ci
(x))]+i
� [u�1i
(ci
(y))]+i
|
|gi
(1) � gi
(0)| =
����Z 1
0
dgi
(t)
dtdt
����
Z 1
0
����dg
i
(t)
dt
����dt maxt2[0,1]
����dg
i
(t)
dt
����,
where the first inequality is because |[xi
]+i
� [yi
]+i
| |xi
� yi
| for all xi
, yi
2 <. Further-
more, the chain rule yields
����dg
i
(t)
dt
����=����P
n
j=1rj
u�1i
� ci
(tx + (1 � t)y) · (xj
� yj
)
����
����(u
�1i
)0(ci
(z(t)))
���� ·P
n
j=1
����rj
ci
(z(t))
���� · |xj
� yj
|,
7.6. CONVERGENCE CONDITIONS 155
where the function u�1i
� ci
(x) is equivalent to u�1i
(ci
(x)). If condition (7.56) holds, we
have for all x 2Q
i
[ai
, bi
],
Pn
j=1|rj
ci
(x)| < minz
i
|u0i
(zi
)| ����u
0i
✓u�1i
(ci
(x))
◆����
=1
|(u�1i
)0(ci
(x))|(7.58)
Therefore, there exists a positive ↵ < 1 such that
����dg
i
(t)
dt
���� ↵ maxj
|xj
� yj
| = ↵kx � yk1, 8 t 2 [0, 1], 8 i.
Therefore, we have shown that
kT (x) � T (y)k1 ↵kx � yk1, 8 x, y 2Q
i
[ai
, bi
],
and T is a contraction with modulus ↵ with respect to the maximum norm inQ
i
[ai
, bi
].
By Theorem 9, Algorithm 1 converges geometrically to x⇤. ut
Next, we consider the convergence condition when � < 1, which is much harder to
analyze in general. In many scenarios, the optimal solution x⇤ is in the interior of the boxQ
i
[ai
, bi
], i.e., ai
< x⇤i
< bi
for all i. If the x(p(t)) generated by (7.39) and (7.40) in each
iteration t is always in the interior ofQ
i
[ai
, bi
], we can rewrite (7.40) as xi
(pi
) := u�1i
(pi
),
and Algorithm 1 is equivalent to the following iteration:
xi
:= Ti
(x) = u�1i
✓(1 � �)u
i
(xi
) + �ci
(x)
◆, 8 i. (7.59)
The following result gives a convergence condition of Algorithm 1 when � < 1.
7.6. CONVERGENCE CONDITIONS 156
Proposition 11 Suppose ai
< x⇤i
< bi
for all i. Set xi
(pi
(0)) to be ai
for all i, or bi
for
all i. If we have 8><
>:
0 < � ✓
1 +@
x
i
x
i
C(x)
|U 00i
(xi
)|
◆�1
,
Pj 6=i
@x
j
x
i
C(x) 0,
(7.60)
for all x between x(p(0)) and x⇤, for all i, then T = (T1, . . . , T2) given by (7.59) is
a pseudo-contraction between x(p(0)) and x⇤, and the sequence {x(p(t))} generated by
Algorithm 1 converges to x⇤ geometrically.
Proof: For each i, define a function gi
: [0, 1] 7! < by
gi
(t) = u�1i
✓(1 � �)u
i
(zi
(t)) + �ci
(z(t))
◆, (7.61)
where z(t) = tx + (1 � t)x⇤.
Now suppose x < x⇤. Let us show that xi
< Ti
(x) x⇤i
for all i. We have
Ti
(x) � Ti
(x⇤) = gi
(1) � gi
(0) =
Z 1
0
dgi
(t)
dtdt, (7.62)
where dgi
(t)/dt is given by
dgi
(t)
dt= (u�1
i
)0✓
(1 � �)ui
(zi
(t)) + �ci
(z(t))
◆·✓
(1 � �)
u0i
(zi
(t))(xi
� x⇤i
) + �P
j
rj
ci
(z(t))(xj
� x⇤j
)
◆
=
✓u0i
(Ti
(z(t)))
◆�1
·✓
((1 � �)u0i
(zi
(t)) + �ri
ci
(z(t)))
(xi
� x⇤i
) + �P
j 6=i
rj
ci
(z(t))(xj
� x⇤j
)
◆(7.63)
7.6. CONVERGENCE CONDITIONS 157
Because Ui
is strictly concave, u0i
(x) < 0 for all x. Hence, under the condition (7.60), if
x < x⇤, we have Ti
(x) x⇤i
for all i, from which we immediately have
ui
✓Ti
(x)
◆= (1 � �)u
i
(xi
) + �ci
(x) � ui
(x⇤i
) = ci
(x⇤) > ci
(x),
which leads to ui
(xi
) > ci
(x) and thus
ui
(xi
) > (1 � �)ui
(xi
) + �ci
(x), 8 � > 0. (7.64)
Applying u�1i
(·) to both sides yields xi
< Ti
(x) for all i. Therefore, there exists an
↵ 2 [0, 1), which depends on �, ui
, ci
and x⇤, such that
|Ti
(x) � x⇤i
| ↵|xi
� x⇤i
|, 8 xi
2 [ai
, x⇤i
], 8 i, (7.65)
or equivalently, kT (x) � x⇤k1 ↵kx � x⇤k1, for all x between a = (a1, . . . , an
) and x⇤.
In other words, we have shown that T is a pseudo-contraction inQ
i
[ai
, x⇤i
].
Similarly, if x > x⇤, we can show that xi
> Ti
(x) � x⇤i
for all i, and T is a pseudo-
contraction inQ
i
[x⇤i
, bi
]. Thus, Algorithm 1 converges geometrically. ut
In many engineering problems, users may have a priori knowledge of (ai
, bi
) that
contains its optimal x⇤i
, which justifies the assumption that x⇤ resides in the interior ofQ
i
[ai
, bi
]. Furthermore, setting xi
(pi
(0)) to ai
(or bi
) is easy: the resource provider only
needs to set a very high (or very low) pi
(0) to push the user resource requirement xi
(pi
)
to its corresponding boundary value. Conditions (7.60) essentially ensure that x(p(t))
increases (or decreases) from a (or b) to x⇤ monotonically.
7.6. CONVERGENCE CONDITIONS 158
7.6.3 Convergence of Algorithm 2
The convergence of Algorithm 2 is much harder to analyze in general when there are
constraints ai
xi
bi
, since x(p) in (7.42) cannot be decoupled as projections onto
intervals [ai
, bi
]. In the following, we give su�cient convergence conditions of Algorithm
2 in the case of unconstrained optimization.
In this case, Algorithm 2 is rewritten as
8><
>:
pi
:= Ti
(p) = (1 � �)pi
+ �ui
✓xi
(p)
◆, 8 i,
x(p) := arg maxx
pTx � C(x)
(7.66)
Proposition 12 Suppose that problem (7.29) has no constraints and that the Hessian
matrix of C(x) is H(x) = [@x
j
x
i
C(x)]n⇥n
. Denote the inverse of H(x) as P (x) =
[H(x)]�1 = [Pij
(x)]n⇥n
.
If � = 1 and the following condition holds:
Pn
j=1|Pij
(x)| < |U 00i
(xi
)|�1, 8 x, 8 i, (7.67)
or if 0 < � (1 + Pii
(x) · |U 00i
(xi
)|)�1, for all x and i, and
Pj 6=i
|Pij
(x)| < Pii
(x) + |U 00i
(xi
)|�1, 8 x, 8 i, (7.68)
then T = (T1, . . . , Tn
) given by (7.66) is a contraction and the sequence {x(p(t))} gener-
ated by Algorithm 2 converges geometrically to the optimal solution to (7.29).
7.6. CONVERGENCE CONDITIONS 159
Proof: For each i, define a function gi
: [0, 1] 7! < by
gi
(t) = (1 � �)(tpi
+ (1 � t)qi
) + �ui
✓xi
(tp + (1 � t)q)
◆. (7.69)
Similarly, we have the following bound:
|Ti
(p) � Ti
(q)| = |gi
(1) � gi
(0)| maxt2[0,1]
|dgi
(t)/dt|,
Similar to the proof of Proposition 10, it su�ces to bound |dgi
(t)/dt|. Let r = tp+(1�t)q.
We have
����dg
i
(t)
dt
����=����(1 � �)(p
i
� qi
) + �P
j
rj
ui
� xi
✓r(t)
◆(p
j
� qj
)
����
=
����
✓1 � � + �u0
i
(xi
(r))ri
xi
(r)
◆(p
i
� qi
)
+�u0i
(xi
(r))P
j 6=i
rj
xi
(r)(pj
� qj
)
����
maxj
|pj
� qj
| ·✓����
✓1 � � + �u0
i
(xi
(r))ri
xi
(r)
◆����+
+�|u0i
(xi
(r))|P
j 6=i
|rj
xi
(r)|◆
(7.70)
If either of the following two conditions holds: 1) � = 1 and
Pn
j=1|rj
xi
(r)| < 1/|u0i
(xi
(r))|, 8 r, 8 i, (7.71)
or 2) 0 < � (1 + ri
xi
(r) · |u0i
(xi
(r))|)�1 and
Pj 6=i
|rj
xi
(r)| < ri
xi
(r) + |u0i
(xi
(r))|�1, 8 r, 8 i, (7.72)
7.6. CONVERGENCE CONDITIONS 160
then there exists a constant ↵ 2 (0, 1) such that
����dg
i
(t)
dt
���� ↵ maxj
|pj
� qj
| = ↵kp � qk1, 8 t 2 [0, 1], 8 i,
or in other words, T is a contraction with modulus ↵.
We then derive rj
xi
(r) based on C(x). Applying the first-order conditions to the
definition of x(r) in (7.66), we have ri
� ci
(x(r)) = 0, for all i. Thus, the chain rule
yields
1 =@c
i
(x(r))
@ri
=nX
l=1
rl
ci
(x(r)) · @xl
(r)
@ri
, 8 i,
0 =@c
i
(x(r))
@rj
=nX
l=1
rl
ci
(x(r)) · @xl
(r)
@rj
, 8 i, j : i 6= j
Hence, we have [rj
xi
(r)]n⇥n
= [rj
ci
(x(r))]�1n⇥n
= [H(x(r))]�1, or in other words,
rj
xi
(r) = Pij
(x(r)). Substituting rj
xi
(r) into (7.71) and (7.72) leads to (7.67) and
(7.68), respectively. ut
We can also establish for Algorithm 2 a convergence condition similar to (7.60), which
can handle constrained optimization with x⇤ residing in the interior ofQ
i
[ai
, bi
] by making
x(p(t)) move to x⇤ always in the increasing or decreasing direction. On the other hand,
however, it turns out to be technically di�cult to establish a condition like (7.68) for
Algorithm 1 when � < 1.
7.6.4 Convergence of Gradient Methods for Dual Problem
There are many results on the convergence of gradient algorithms. For example, for
a su�ciently small constant stepsize �, the algorithm is guaranteed to converge to the
7.6. CONVERGENCE CONDITIONS 161
optimal value [20]. For comparison, here we derive a su�cient condition similar to (7.68),
under which the gradient algorithm applied to the dual problem, i.e., iteration (7.38), is
a contraction for the unconstrained version of problem (7.29).
Proposition 13 Suppose that problem (7.29) has no constraints and that the Hessian
matrix of C(x) is H(x) = [@x
j
x
i
C(x)]n⇥n
. Denote the inverse of H(x) as P (x) =
[H(x)]�1 = [Pij
(x)]n⇥n
.
If 0 < � (Pii
(x) + |U 00i
(xi
)|�1)�1, for all x and i, and
Pj 6=i
|Pij
(x)| < Pii
(x) + |U 00i
(xi
)|�1, 8 x, 8 i, (7.73)
then T = (T1, . . . , Tn
) given by (7.38) is a contraction and the sequence {x(p(t))} con-
verges geometrically to x⇤.
Proof: Ti
(p) in (7.38) can be written in the form Ti
(p) = pi
� �fi
(p), with fi
being
continuously di↵erentiable. By Proposition 1.11 of [21] (pp. 194), Ti
(p) is a contraction
if 8><
>:
0 < � 1/ri
fi
(p), 8 p, 8 i,
Pj 6=i
|rj
fi
(p)| < ri
fi
(p), 8 p, 8 i.
(7.74)
Similar to Proposition 12, we can derive rj
fi
(p) from the Hessian matrix of C(x) and
U(x). In particular,
ri
fi
(p) = ri
yi
(p) � ri
xi
(p) = Pii
(x(p)) �✓
U 00i
(xi
(p))
◆�1
rj
fi
(p) = rj
yi
(p) � rj
xi
(p) = Pij
(x(p)), 8 j 6= i.
Substituting the above into (7.74) and utilizing the fact U 00i
(xi
) < 0 will prove the
proposition. ut
7.6. CONVERGENCE CONDITIONS 162
We observe that (7.73) is exactly the same as the convergence condition (7.68) for
Algorithm 2 when � < 1, except for the choice of �. When constraints ai
xi
bi
are considered, the contraction property is hard to analyze in general, since in this case,
fi
(p) = yi
(p) � [u�1i
(pi
)]+i
involves projection and is not continuously di↵erentiable.
As a final note, when an update rule T is a contraction or pseudo-contraction with
respect to the maximum norm k · k1, asynchronous convergence of the corresponding
algorithm can be established using Proposition 2.1 in [21] (pp. 431). In other words, if
the su�cient conditions in this section hold, the corresponding algorithms also converge
asynchronously.
7.6.5 Quadratic Programming: Convergence Speed Compari-
son
We compare the convergence conditions and speeds of di↵erent algorithms using an ex-
ample of quadratic programming. A quadratic cost C and utilities Ui
simplify the con-
ditions derived in Sec. 7.6, allowing us to draw insights on the algorithm convergence
performance.
Suppose Ui
(xi
) = 12(�i
x2i
+ ⇢i
xi
+ ⌧i
) and C(x) = 12x
TAx, where �i
< 0 for all i, and
A = [aij
]n⇥n
is an n⇥n positive semi-definite symmetric matrix. Then the unconstrained
version of problem (7.29) becomes
maxx
1
2
✓Pn
i=1(�i
x2i
+ ⇢i
xi
+ ⌧i
) � xTAx
◆. (7.75)
It is easy to check that U 00i
(xi
) = �i
and @x
j
x
i
C(x) = aij
, for all x. Furthermore, the
Hessian matrix of C(x) is H(x) = A, and the inverse of H(x) is P (x) = [Pij
(x)]n⇥n
= A�1.
7.6. CONVERGENCE CONDITIONS 163
If we denote A�1 as [a0ij
]n⇥n
, then Pij
(x) = a0ij
, for all x.
According to Proposition 10 and Proposition 11, Algorithm 1 will converge to x⇤, if
� = 1 andP
n
j=1|aij
| < |�i
|, 8 i, or if
8><
>:
0 < � (1 + aii
/|�i
|)�1,
Pj 6=i
aij
0,8 i. (7.76)
According to Proposition 12, Algorithm 2 will converge to x⇤, if � = 1 andP
n
j=1 |a0ij
| <
|�i
|�1, 8 i, or if 8><
>:
0 < � (1 + a0ii
|�i
|)�1,
Pj 6=i
|a0ij
| < a0ii
+ |�i
|�1,
8 i. (7.77)
According to Proposition 13, the gradient algorithm applied to the dual decomposition
will converge to x⇤, if 8><
>:
0 < � (a0ii
+ |�i
|�1)�1,
Pj 6=i
|a0ij
| < a0ii
+ |�i
|�1,
8 i. (7.78)
Since the convergence speed of a contraction is governed by its modulus ↵, we now
derive ↵1(�), ↵2(�) and ↵D
(�) for Algorithm 1, Algorithm 2 and dual decomposition,
respectively, under di↵erent choices of parameter �.
If � = 1, from the proof of Proposition 10 and 12, we obtain
↵1(1) = maxi
Pn
j=1|aij
|/|�i
|, ↵2(1) = maxi
Pn
j=1|a0ij
| · |�i
|.
If � < 1, let us consider Algorithm 2 first. Substituting u0i
and rj
xi
into (7.70), it is
7.6. CONVERGENCE CONDITIONS 164
not hard to check that
↵2(�) = maxi
|1 � � + ��i
a0ii
| + �|�i
| ·P
j 6=i
|a0ij
|
� maxi
Pj 6=i
|a0ij
|/(a0ii
+ |�i
|�1), (7.79)
where the equality is achieved when a di↵erent �i
is allowed for each user i and �i
=
(1 + a0ii
|�i
|)�1.
Similarly, for dual decomposition, we have
↵D
(�) = maxi
|1 � �(|�i
|�1 + a0ii
)| + �|P
j 6=i
|a0ij
|
� maxi
Pj 6=i
|a0ij
|/(a0ii
+ |�i
|�1), (7.80)
where the equality is achieved when a di↵erent �i
is allowed for each user i and �i
=
(a0ii
+ |�i
|�1)�1.
Let us compare the convergence speeds of di↵erent algorithms. Note that the smaller
the modulus ↵ is, the faster the corresponding contraction converges. Algorithm 1 may
potentially achieve much faster convergence if |�i
| is big relative to |aij
|, which is often
the case in communication systems, where the service cost C is low due to multiplexing
or low unit cost for resources. On the contrary, if |�i
| is small relative to |aij
|, Algorithm
2 and gradient algorithms for the dual problem will converge faster. Therefore, when
gradient methods for dual decomposition fail to be a contraction (and thus a very small
stepsize must be chosen), we may resort to Algorithm 1 to achieve fast and asynchronous
convergence.
Another obvious strength of the new algorithms is their robustness and simplicity
in parameter selection. Under certain cases, choosing � = 1 su�ces to achieve fast
7.6. CONVERGENCE CONDITIONS 165
convergence for the new algorithms. Even if � < 1, it is also much easier to set in the
new algorithms. Let us compare Algorithm 2 with dual decomposition, both of which,
when � < 1, share the same su�cient convergence conditions (7.77) and (7.78), i.e.,
diagonal dominance for the inverse of Hessian matrix A�1 or a su�ciently large |�i
|�1.
They also share the same best convergence speed (7.79) and (7.80). However, for dual
decomposition to achieve its best convergence, � should be set around (a0ii
+ |�i
|�1)�1.
This requires one to have good estimates about the absolute values of a0ii
, |�i
|; otherwise,
� usually needs to be set to a small value to guarantee convergence, which, however,
leads to slow convergence.
In contrast, for Algorithm 2 to achieve its best convergence, � should be set to around
(1+a0ii
|�i
|)�1, which only depends on the relative values of a0ii
and |�i
| and is thus easier to
estimate. For example, if a0ii
|�i
|�1 for all i, then (1 + a0ii
|�i
|)�1 � 0.5, 8 i. Therefore,
setting � = 0.5 su�ces to achieve good performance if a0ii
and |�i
|�1 are close. Similarly,
for Algorithm 1, to choose a � < 1 that satisfies condition (7.76) is easy: if aii
|�i
|, for
all i, which is often the case when the cost is low, we can set � to 0.5.
7.6.6 Handling General Coupled Objectives
The proposed algorithms do not only solve (7.29). We now show how to use Algorithm
1 to approach a general form of optimization problems with coupled objective functions,
referred to as consensus optimization [26]:
minx
mX
i=0
Fi
(x) (7.81)
s.t. x 2 Pi
, i = 0, 1, ..., m,
7.6. CONVERGENCE CONDITIONS 166
where Fi
: <n 7! < are strictly convex functions, and Pi
are bounded polyhedral subsets
of <n. We assume at least F0 depends on all n coordinates of the vector x and P0 ⇢ Pi
.
Problem (7.81) not only models distributed resource allocation with arbitrary coupling
among utility functions, but also finds wide applications in machine learning problems
that involve a large number of training samples. Note that (7.29) is a special case of
(7.81) in that each Ui
only depends on the ith coordinate of x, while C plays the role of
F0.
We show that based on duality theory, consensus optimization (7.81) can be converted
into an equivalent form of (7.29). Introducing auxiliary variables xi
2 <n, (7.81) is
equivalent to
minx
F0(x) +mX
i=1
Fi
(xi
) (7.82)
s.t. x 2 P0, xi
= x, xi
2 Pi
, i = 1, ..., m.
It is easy to check that the dual problem of (7.82) is
maxp
q(p) = q0
✓mX
i=1
pi
◆+
mX
i=1
qi
(pi
) (7.83)
where pi
2 <n are the Lagrangian multipliers, and
q0
✓mX
i=1
pi
◆= min
x2P0
⇢F0(x) �
✓mX
i=1
pi
◆T
x
�, (7.84)
qi
(pi
) = minx
i
2Pi
{Fi
(xi
) + pTi
xi
}. (7.85)
7.6. CONVERGENCE CONDITIONS 167
The dual function q(p) is continuously di↵erentiable and
@q(p)
@pi
=@q0(p)
@pi
+@q
i
(pi
)
@pi
= �x(p) + xi
(pi
), (7.86)
where x(p) and xi
(pi
) are the unique minimizing vectors in (7.84) and (7.85), respectively.
A traditional gradient method would update p in the direction of its gradient rq(p).
Alternatively, we can apply Algorithm 1 to the dual problem (7.83) with p as the
variable by setting Ui
(pi
) = qi
(pi
) and C(p) = �q0(p), leading to the alternating iterations
for all i:
xi
:= (1 � �)xi
+ �ri
C(p) = (1 � �)xi
+ �x(p) (7.87)
pi
:= arg maxp
i
qi
(pi
) � pTi
xi
, (7.88)
where xi
is the price set for Fi
. We can further simplify the update (7.88) for pi
. We
assume P0 ⇢ Pi
for all i. Since x(p) 2 P0, from (7.87), we have xi
2 P0 ⇢ Pi
. Clearly, pi
is the solution to the system of equations @q
i
(pi
)@p
i
� xi
= 0, or equivalently, xi
(pi
)� xi
= 0.
When xi
(pi
) = xi
is in the interior of Pi
, from the first-order conditions of (7.85), we have
pi
= �@Fi
(xi
)
@xi
. (7.89)
Therefore, Algorithm 1, when applied to consensus optimization (7.81), is fully described
by the iterations (7.87) and (7.89), where pi
are variables, xi
are prices, F0 plays the role
of a “resource provider” and each Fi
is a “user”. Each “user” updates its variable pi
in
a distributed fashion based on the price xi
computed and sent by the “provider” until
convergence.
7.7. DISTRIBUTED SOLUTIONS TO CLOUD PRICING 168
Note that P0 ⇢ Pi
for all i is a condition to use Algorithm 1, since otherwise, xi
may
reside outside of Pi
and the gradient of qi
(pi
)� pTi
xi
, which is xi
(pi
)� xi
, cannot be zero.
In this case, (7.88) is not well defined. When P0 ⇢ Pi
, we can apply Algorithm 1 to the
unconstrained dual problem (7.83), which is equivalent to (7.81).
7.7 Distributed Solutions to Cloud Pricing
We now use the proposed methods to solve the large-scale cloud bandwidth resource
reservation and pricing problem (7.8) introduced in Sec. 7.2. We will continue to adopt
the notations used in Sec. 7.2. Recall that the social welfare maximization problem (7.8)
is
maxw
E[W (w)] = (1 � ✏)X
i
✓U
i
(wi
) � bwi
µi
◆� c
✓K(w)
◆(7.90)
s.t. 0 w 1, (7.91)
which is equivalent to the pricing problem (7.18). In this section, we propose a novel step-
size-free algorithm for price updates, based on the distributed optimization techniques
proposed in Sec. 7.5. We call the algorithm Equation Update, which does not rely on
decomposition at all: instead, it resorts to iterative equation updates based on the KKT
conditions (7.20). Our algorithm applies to a general convex cost function c(K(w)) (in
terms of w) under any service technology (e.g., multiplexing tenant demands in groups),
and can be executed asynchronously among all cloud tenants.
7.7. DISTRIBUTED SOLUTIONS TO CLOUD PRICING 169
7.7.1 Equation Update
If we apply Algorithm 1 of Sec. 7.5 with � = 1 to problem (7.8), we obtain the following
Equation Update algorithm, which is based on alternated phases of price updates via
the cost-price relationship (7.20) and the tenant surplus maximization equation (7.16).
Recall that p is the uniform usage price and ki
is the reservation price for tenant i.
Equation Update. (Algorithm 1 for Problem (7.8).) Denote w(t) := [w(t)1 , . . . , w
(t)N
]T.
Set p ⌘ b. Set k(0)i
:= �(µi
+ ✓(✏)�i
) for all i and w(0) = 1. For t = 0, 1, . . ., repeat
(1) Distributed Surplus Maximization. Pass p and k(t)i
to each tenant i, which
returns:
w(t+1)i
:= arg max0w
i
1E[S
i
(wi
, p, k(t)i
)], 8 i. (7.92)
(2) Price Update. Set
k(t+1)i
:= c0✓
K(w(t+1))
◆· @K(w)
@wi
����w=w(t+1)
, 8 i. (7.93)
(3) If kw(t+1) � w(t)k ⇠, return w⇤ = w(t+1), k⇤i
= k(t+1)i
.
The above algorithm starts by setting k(0)i
to be the k⇤i
in the cost function without
multiplexing. It then updates ki
and wi
alternately by setting the current prices k(t)i
to
be the marginal reservation cost with the current w(t), and by collecting the next w(t+1)
from tenants who maximize their surpluses given the current prices. Applying the KKT
conditions to (7.16), step (1) can also be viewed as solving an equation
(1 � ✏)U0i
(w(t+1)i
) = p(1 � ✏)µi
+ k(t)i
, (7.94)
7.7. DISTRIBUTED SOLUTIONS TO CLOUD PRICING 170
for w(t+1)i
. Since each tenant maximizes its expected surplus locally, the cloud provider
does not have to know the utility function of each tenant, leading to an iterative dis-
tributed solution.
Compared with Lagrangian dual decomposition based on consistency pricing [32, 65],
Equation Update represents a new way of handling coupled objectives. Since price up-
dates are based on equation (7.93) rather than on updating Lagrangian multipliers, the
algorithm is not concerned with the choice of step sizes that are required by gradi-
ent/subgradient methods.
We observe that the sequence {k(t)i
} produced by equations (7.93) and (7.94) could
demonstrate significantly di↵erent behavior under di↵erent initial values w(0)i
and dif-
ferent forms of utility and cost functions. In other words, our algorithm demonstrates
chaotic behavior, whose eventual outcome is sensitive to initial conditions and the struc-
ture of updating equations.
Our objective going forward is to analyze the behavior of {k(t)i
} and {w(t)} and find
out the conditions under which the algorithm can achieve fast convergence. To simplify
notations, we define
hi
(wi
) ⌘ (1 � ✏)
✓U
0i
(wi
) � bµi
◆, (7.95)
⌫i
(w) ⌘@c
✓K(w)
◆
@wi
. (7.96)
Recall that w⇤ := [w⇤1, . . . , w
⇤N
]T, where w⇤i
= wi
(p⇤, k⇤i
) is the guaranteed portion cho-
sen by tenant i under optimal pricing. By the definition of w⇤, the optimality condition
7.7. DISTRIBUTED SOLUTIONS TO CLOUD PRICING 171
is
hi
(w⇤i
) = k⇤i
= ⌫i
(w⇤), 8 i. (7.97)
Since we have set p ⌘ b, (7.94) can be written as
hi
(w(t+1)i
) := (1 � ✏)
✓U
0i
(w(t+1)i
) � bµi
◆= k
(t)i
. (7.98)
Thus, the updating rules (7.93) and (7.94) in Equation Update can be rewritten as
hi
(w(t+1)i
) := k(t)i
:= ⌫i
(w(t)), 8 i. (7.99)
Let us illustrate the algorithm behavior using the special case of a single tenant.
There are three scenarios where the algorithm can produce dramatically di↵erent results,
as shown in Fig. 7.2. Since utility ui
(wi
) is strictly concave and monotonically increasing
in [0, 1], and cost c
✓K(w)
◆is strictly convex and monotonically increasing in [0, 1], we
have for all i:
8><
>:
hi
(wi
) > �bµi
, h0i
(wi
) < 0, 8 wi
2 [0, 1]
⌫i
(w) > 0, @⌫
i
(w)@w
i
> 0, 8 w : 0 � w � 1.
(7.100)
All three cases in Fig. 7.2 satisfy (7.100). In Fig. 7.2(a), {w(t)1 } always converges to w⇤
1 for
any initial value w(0)1 2 [0, 1], whereas in Fig. 7.2(b), {w
(t)1 } always diverges regardless of
its initial starting point. In Fig. 7.2(c), however, the behavior of {w(t)1 } critically depends
on its initial value w(0)1 . If w
(0)1 takes “initial value 1”, w
(t)1 will eventually hop between
two values alternately, without being able to approach w⇤1. On the other hand, if w
(0)1
takes “initial value 2”, w(t)1 will converge to w⇤
1.
7.7. DISTRIBUTED SOLUTIONS TO CLOUD PRICING 172
h(wi) �(wi)
wiw(0)iw(1)
i w(2)iw(3)
i w(4)i
w�i
k(0)i
k(1)i
k(2)i
k(3)i
k�i
(a) Convergence to Global Optimality
h(wi) �(wi)
wiw(0)iw(1)
i w�i
k(0)i
k(1)i
k�i
(b) Divergence
h(wi) �(wi)
wiw�i
k�i
Initial Value 1
Initial Value 2
(c) 1: Periodic; 2: Converging.
Figure 7.2: Behavior of Equation Update in a one dimensional case. “o” represents thestarting point (w(0)
i
, k(0)i
).
7.8. PERFORMANCE EVALUATION 173
Directly applying Proposition 10 derived in Sec. 7.6 to problem (7.8), we obtain a
su�cient condition for the convergence of Equation Update in Theorem 14:
Theorem 14 If for each i = 1, . . . , N , we have
minx
i
(1 � ✏)|U 00i
(xi
)| >
NX
j=1
�������
@2c
✓K(w)
◆
@wi
@wj
�������, (7.101)
for all w between w(0) = 1 and w(1), then using Equation Update, k(t)i
converges to k⇤i
and w(t)i
converges to w⇤i
.
The economic implication behind Theorem 14 is that the algorithm will converge
when the marginal utility gain decreases faster than the marginal cost increases, as wi
increases. This technical assumption can be easily justified, since the marginal cost
c0(K) = � for adding network capacity (routers and switches) is decreasing at a fast pace
in our economy.
Moreover, we can also apply Algorithm 1 of Sec. 7.5 with � < 1 or Algorithm 2 of
Sec. 7.5 to problem (7.8), and obtain similar but di↵erent iterative distributed solutions
to Problem (7.8).
7.8 Performance Evaluation
In this section, we simulate a computerized cloud bandwidth reservation and trading
environment based on the proposed algorithms. We compare our proposed algorithms
with the gradient method, in terms of the convergence speed and optimization accuracy.
We assume that the reservation of egress network bandwidth from cloud data centers has
already been enabled by the detailed engineering techniques proposed in [19, 39, 71].
7.8. PERFORMANCE EVALUATION 174
Suppose the random demands D1, . . . , DN
of N tenants are Gaussian with predictable
means µµµ = (µ1, . . . , µN
), standard deviations ��� = (�1, . . . , �N
), and the covariance ma-
trix ⌃⌃⌃ = [�ij
]N⇥N
. We consider utility functions of the form (7.2). Under a Gaussian
approximation of Di
(which has been verified in Sec. 5.3.3), each tenant will have an
expected utility
E[Ui
(wi
)] = ↵i
wi
µi
� eAi
(1�w
i
)µi
+ 12A
2i
(1�w
i
)2�2i . (7.102)
The first term on the righthand side corresponds to the expected revenue of each tenant
made from serving the demand wi
Di
, while the second term models a reputation loss
which is convex and increasing in terms of the unfulfilled demand. Recall that under
Gaussian demands, the bandwidth reservation K is
K(w) = µµµTw + ✓(✏)pwT⌃⌃⌃w, (7.103)
where ✓(✏) = F�1(1 � ✏) is a constant, F being the CDF of normal distribution N (0, 1).
It is now clear that the social welfare maximization problem (7.8) is a convex problem
and can be solved using Algorithm 1 (or Equation Update), i.e., the cloud can charge
user i a reservation price ki
wi
to control the user choice of the guaranteed portion wi
towards expected social welfare maximization.
Our study is based on a video demand dataset from a real-world commercial VoD
system called UUSee. The data used here contains the bandwidth demands of N = 468
video channels, measured every 10 minutes, over an 810-minute period during the 2008
Olympics. We assume each video channel is a tenant that relies on the cloud for servicing
the video requests from its end-users. We input such demand traces to our pricing
framework and check the algorithm e�ciency in the challenging case that prediction and
7.8. PERFORMANCE EVALUATION 175
optimization are to be carried out every 10 minutes. If the algorithms work for a 10-
minute frequency, they will be competent for lower operating frequencies, such as on an
hourly basis.
Assume the demand statistics µµµ, ⌃⌃⌃ of tenants remain the same within each 10-minute
period. We can estimate µµµ, ⌃⌃⌃ for the upcoming 10 minutes based on historical demands
using the time-series techniques presented in Chapters 3 and 4. Once estimates about
µµµ, ⌃⌃⌃ are obtained, we solve problem (7.8) for an optimal allocation w in this period.
After the 10-minute period has passed, the system proceeds to estimate µµµ, ⌃⌃⌃ for the
next 10 minutes and computes a new w from (7.8). Since the cloud does not know user
utility Ui
and the tenants do not know the multiplexed cost K, (7.8) must be solved in
a distributed fashion with the minimum message-passing overhead. Furthermore, (7.8)
must be solved within at most a few seconds in order not to delay the entire resource
reservation process performed every 10 minutes.
In our simulation, we set ↵i
= 1 and Ai
= 0.5. Since di↵erent tenants have di↵erent µi
and �i
, their utilities are heterogeneous. We set the marginal cost of allocating bandwidth
capacity to be � := c0(K) = 0.5, and assume that the cloud provider has an outage
probability of ✏ = 0.01.
We apply di↵erent algorithms to (7.8) for the 81 consecutive 10-minute periods. To
select algorithms, for the first period, we find that the condition (7.60) is satisfied, and
(7.56) is satisfied for most but not all i. We may thus tentatively believe Algorithm 1
is contracting when � = 0.5 and probably contracting when � = 1 (Equation Update).
However, conditions (7.67), (7.68), and (7.73) are not satisfied, since the Hessian of K(w)
is small due to multiplexing and its inverse is big. Thus, Algorithm 2 and gradient meth-
ods for dual decomposition are not contractions. We therefore choose to use Algorithm
7.8. PERFORMANCE EVALUATION 176
2 10 20 40 60 80 1000
0.2
0.4
0.6
0.8
1
Convergence Iteration of the Last Tenant
Empi
rical
CD
F
Alg 1: � = 1Alg 1: � = 0.5Gradient: � = 0.01Gradient: � = 0.02Gradient Sync: � = 0.01Gradient Sync: � = 0.008
Figure 7.3: CDF of convergence iterations in all 81 10-minute periods.
1 instead of Algorithm 2. As a benchmark algorithm, gradient methods for the dual
problem, although not contracting, can still converge theoretically when stepsize � is
su�ciently small. Since in our particular problem, the cloud provider has superior com-
putation power, the delay is mainly due to the iterative message passing of prices and
guaranteed portions between tenants and the cloud.
Since the objective value of the expected social welfare SW(w) is unknown to the
cloud, we let an algorithm stop at iteration t if either kw(t) � w(t � 1)k1 < 10�2 or
t = 100. Since Algorithm 1 is a contraction, which even converges asynchronously, we
adopt an asynchronous stop criterion: if |wi
(t) � wi
(t � 1)| < 10�2, then tenant i stops
updating wi
. In contrast, since the gradient method is not a contraction, we evaluate
both synchronous and asynchronous stop criteria. We do not assume heterogeneous
message-passing delays, which may further strengthen the benefit of an asynchronous
algorithm.
Fig. 7.3 plots the CDF of convergence iterations needed in all 81 experiments, one
for each 10-minute test period. Algorithm 1 always converges within 10 iterations, when
� = 0.5. If � = 1, the convergence rate can be further speeded up to 5 iterations on
7.8. PERFORMANCE EVALUATION 177
1700 1720 1740 1760 1780800
1000
1200
1400
1600
1800
Time Periods (unit: 10 minutes)
Fina
l Exp
ecte
d So
cial
Wel
fare
Alg 1: � = 1Alg 1: � = 0.5Gradient: � = 0.01Gradient: � = 0.02Gradient Synch: � = 0.01Gradient Synch: � = 0.008
Figure 7.4: Final objective values SW(x⇤) (expected social welfare) produced by di↵erentalgorithms.
average, although it does not converge in one single test period, due to violation of the
contraction assumption. In general, gradient methods need much longer time to converge,
and also face a dilemma in selecting the stepsize �. Since algorithms will stop when w
is changing by less than 10�2, we cannot set too small a �, in which case, the algorithm
will stop when t = 2 due to too small a change in the price vector. Fig. 7.3 shows that
gradient methods with all considered � produce erroneous outputs in many cases, since
they stop at t = 2. Increasing � reduces errors, but will lead to even longer convergence
time. Fig. 7.3 suggests that unlike Algorithm 1, no constant stepsize � for gradient
methods can achieve correct outputs and fast convergence at the same time for all 81
experiments.
A further check into Fig. 7.4 shows that the final objective SW(w⇤) produced by
Algorithm 1 is also higher than those of the gradient methods. Gradient methods with
asynchronous stop rules, although converged, may output wrong answers because they
are not contractions and in theory, should not be applied with asynchronous stop rules.
Gradient methods with synchronous stopping achieve slightly higher objective values if
7.8. PERFORMANCE EVALUATION 178
100 101 102
10�2
10�1
100
Iteration t
Dist
ance
bet
wee
n x(
t) an
d x(
t�1)
Alg 1: � = 1Alg 1: � = 0.75Alg 1: � = 0.5Gradient: � = 0.01Gradient: � = 0.02Gradient Sync: � = 0.01
Figure 7.5: The evolution of kw(t) � w(t � 1)k1 in the first experiment.
not stopping at t = 2. However, since they do not converge within 100 iterations, they
are still inferior to Algorithm 1.
Let us zoom into the first 10-minute test period and see how convergence happens for
this experiment. From Fig. 7.5, we see that Algorithm 1 has an increasing convergence
speed as t increases. In contrast, gradient methods perform well in the first two iterations,
but its convergence slows down as the gradient becomes smaller. It is confirmed that
gradient methods are extremely sensitive to the stepsize �. Although it is possible to
vary the stepsize � dynamically as the optimization iterations proceed, the design of such
dynamic stepsizes is a non-trivial technical challenge.
Finally, the above analysis does not take into account the additional benefit of Algo-
rithm 1 regarding the computational complexity at each node. Both algorithms have the
same complexity at each user, which is a 1-dimensional surplus maximization. However,
in Algorithm 1, the provider only needs to compute a cost gradient rK(w) straight-
forwardly, whereas in gradient methods, the provider needs to solve a 468-dimensional
profit maximization problem.
7.9. SUMMARY 179
7.9 Summary
In this chapter, we propose a guaranteed cloud service model, where each tenant does
not have to estimate the absolute amount of bandwidth it needs to reserve—it simply
specifies a percentage of its demand from end-users that it wishes to serve with guaranteed
performance, which we call the guaranteed portion, while the rest of the demand will
be served with best e↵ort as the current cloud providers do. The cloud provider will
estimate tenant demands through workload analysis and guarantee the performance in a
probabilistic sense. The above process is repeated in small periods such as hours or tens
of minutes.
Our main contribution is to fairly price such guaranteed services in each period. In
contrast to the uniform usage pricing model, we price bandwidth reservations heteroge-
neously for tenants based on their workload statistics such as burstiness and correlation.
It turns out to be computationally challenging to find the optimal prices that maximize
the expected social welfare under demand uncertainty.
To address this challenge, we propose new algorithms for large-scale distributed opti-
mization with coupled objectives, and analyze the asynchronous algorithm convergence
conditions. Trace-driven simulations show that the proposed method can speed up con-
vergence by 5 times over gradient methods in the real-time cloud bandwidth resource
reservation and pricing problem with better robustness to parameter changes. It remains
open to explore the applicability of the new algorithms to problems with coupled con-
straints. Incentive issues (especially of Algorithm 2) with game theoretical modeling can
further open up new avenues of investigation to ensure that tenants follow the designed
protocol out of self-interest.
Chapter 8
Concluding Remarks
8.1 Conclusions
This thesis has focused on improving the Quality-of-Service (QoS) and reducing the
operational cost in large-scale multimedia delivery systems, such as Internet Video-on-
Demand (VoD) systems, especially in the context of emerging cloud-based services. The
thesis spans the areas of cloud computing, multimedia delivery, and economics. Our
contribution is a coherent cloud management framework for multimedia delivery systems
that involves statistical learning, demand forecast, optimized resource allocation and load
direction, as well as algorithmic resource pricing as interdependent components, mainly
targeting large-scale VoD systems. Our study has been backed up by data mining and
the analytics of real-world resource usage traces collected from a production Internet
VoD company named UUSee.
We first identify that automated demand forecast will be the key to intelligent and
e�cient workload management in an age of cloud computing. We propose a series of
statistical learning and time-series analysis techniques to forecast the expected video
180
8.2. FUTURE DIRECTIONS 181
bandwidth tra�c demand in the short-term future, as well as to estimate the demand
volatility, i.e., how confident we can be about such expected demand forecasts. We
also studied the complexity and dimension reduction issues related to such statistical
learning, in the presence of a large number of video channels. Furthermore, based on
automated demand forecast, we propose predictive resource reservation schemes together
with optimized load direction schemes in a multi-data center network, analyzing the
most cost-e↵ective solutions both in theory and in practice. Finally, we investigate the
bandwidth resource pricing issues and related business models for VoD providers to use
the cloud services, with two approaches adopted. In the first approach, we use game
theory to characterize the economic properties of the bandwidth reservation market when
multiple VoD providers rely on multiple cloud providers for services. In the second
approach, we formulate the bandwidth pricing as a distributed optimization problem,
for which we propose new distributed algorithms other than the traditional gradient
algorithm, to ensure fast and robust convergence in very large systems. In addition, the
second approach also turns out to bring new technical contributions to distributed and
parallel optimization techniques in large-scale systems.
8.2 Future Directions
Cloud workload management continues to be an exciting area of research, by con-
sidering even more diverse application requirements and by extending the scope from
request routing among data centers to VM placement within data centers. As has been
noted, virtual machines (VMs) can be placed to optimize application performance, by
spreading workload evenly over the underlying topology, or to save power consumption,
by consolidating workload onto as few servers as possible. As the number of VMs per
8.2. FUTURE DIRECTIONS 182
server increases, the power e�ciency increases linearly, whereas performance degrades
in a nonlinear fashion due to VM interference. What is the sweet spot in performing
workload consolidation that balances performance and power e�ciency? Given diverse
application requirements on di↵erent resources that may or may not be correlated, we
can mix VMs based on workload statistics balancing performance and cost. Machine
learning and demand prediction based on monitoring will be useful to serve this purpose.
However, unlike user-oriented web applications, the resource and power consumption of
many cloud applications may not follow a regular pattern. More advanced learning tech-
niques such as those based on dynamic neural networks can be developed for workload
analysis.
Tra�c management in the cloud. In today’s multi-tenant data centers, band-
width sharing leads to unpredictable network performance. However, the good news is
that unlike in the Internet, cloud operators can control not only tra�c routing but also
the physical positions to place the source-destination pairs, which are essentially pairs of
VMs in data centers. This leads to a new problem of joint VM placement and inter-VM
tra�c routing, so as to reduce the network burden in the cloud and to optimize application
performance. Furthermore, in a data center, VMs often need to be migrated dynamically
due to application requirements and the need for performance optimization or workload
consolidation. Dynamic VM placement introduces a multi-period network optimization
problem that requires statistical demand predictions as input, with VM repositioning
cost as an optimization constraint. Competitive online algorithms and approximations
need to be developed to solve this type of problem. From a systems perspective, a sta-
tistical analysis and optimization framework can be integrated into current data centers
to diagnose tra�c and guide management decisions.
8.2. FUTURE DIRECTIONS 183
Pricing in the cloud. Cloud computing has brought about a new business model
of resource leasing. Currently, Amazon EC2 mainly o↵ers three types of computing
instances: on-demand, spot and reserved instances. What is the advantage of having
spot and reserved instances? What is the impact of the cloud service menu with its
pricing on both tenants and the cloud? I hypothesize that more diversified instance
types would suit the flavors of more tenants, bringing a win-win situation to both cloud
providers and tenants. Two related questions need to be answered: (1) facing di↵erent
types of computing instances together with their prices, how each tenant, with her unique
utility and patience characteristics, should choose her instance type, and how much to
bid if she chooses the spot instance; and (2) how the cloud provider should price di↵erent
instances to maximize revenue. By studying these questions, we will be able to analyze
the aggregate social welfare under di↵erent markets and pricing policies. The good news is
that a dataset of AWS spot and on-demand instances pricing history since 2009 is publicly
available. We can use statistical learning to model and advise tenant strategies, infer
resource availability, optimize cloud pricing policies, and study the interaction between
tenant and cloud strategies.
Derivatives in cloud computing. Cloud pricing largely remains in a primitive
stage. A natural expansion of the cloud service menu is to introduce derivatives to the
computing market, so that tenants can hedge against the uncertainty of future prices for
computing resources. In this sense, the reserved instance of AWS cloud is essentially a
derivative product. As a popular derivative in stock markets, a futures contract involves
selling goods for a price agreed today with delivery occurring at a specified future date.
As another example, an option will give a buyer the right, but not the obligation, to
buy goods at a reference price in the future. Viewing computing resources as goods
8.2. FUTURE DIRECTIONS 184
and tenants as buyers, many derivatives and new pricing structures can be introduced,
modified and adapted to the cloud to benefit both tenants and cloud providers. On the
other hand, automated workload analysis and demand prediction based on statistical
learning need to be developed for companies that rely on cloud services to take the right
investment positions in the computing market.
Bibliography
[1] Amazon Cluster Compute, 2011. http://aws.amazon.com/ec2/hpc- applications/.
[2] Amazon Web Services. http://aws.amazon.com/.
[3] Hulu. http://www.hulu.com/.
[4] Justintv. http://www.justin.tv/.
[5] Metacafe. http://www.metacafe.com/.
[6] Netflix. http://www.netflix.com/.
[7] OnLive. http://www.onlive.com/.
[8] PPLive. http://www.pplive.com.
[9] UUSee Inc. [Online]. Available: http://www.uusee.com.
[10] Youtube. http://www.youtube.com/.
[11] Zimory Cloud Computing. http://www.zimory.com/.
[12] Four Reasons We Choose Amazon’s Cloud as Our Computing Platform. The Netflix
“Tech” Blog, Dec. 14 2010.
185
BIBLIOGRAPHY 186
[13] V. Adhikari, Y. Guo, F. Hao, M. Varvello, V. Hilt, M. Steiner, and Z.-L. Zhang.
Unreeling netflix: Understanding and improving multi-cdn movie delivery. In Proc.
of IEEE INFOCOM, Orlando, Florida, 2012.
[14] V. Aggarwal, X. Chen, V. Gopalakrishnan, R. Jana, K. K. Ramakrishnan, and V. A.
Vaishampayan. Exploiting Virtualization for Delivering Cloud-based IPTV Services.
In Proc. of IEEE INFOCOM Workshop on Cloud Computing, 2011.
[15] S. Annapureddy, S. Guha, C. Gkantsidis, D. Gunawardena, and P. Rodriguez. Ex-
ploring VoD in P2P Swarming Systems. In Proc. INFOCOM’07, pages 2571–2575,
May 2007.
[16] D. Applegate, A. Archer, V. Gopalakrishnan, S. Lee, and K. K. Ramakrishnan. Op-
timal Content Placement for a Large-Scale VoD System. In Proc. ACM International
Conference on Emerging Networking Experiments and Technologies (CoNEXT),
2010.
[17] M. Armbrust, A. Fox, R. Gri�th, A. D. Joseph, R. Katz, A. Konwinski, G. Lee,
D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. A View of Cloud Computing.
Communications of the ACM, 53(4):50–58, 2010.
[18] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang. To-
wards Highly Reliable Enterprise Network Services Via Inference of Multi-level De-
pendencies. In Proc. of SIGCOMM’07, Kyoto, Japan, 2007.
[19] H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron. Towards Predictable Data-
center Networks. In Proc. of SIGCOMM’11, Toronto, ON, Canada, 2011.
BIBLIOGRAPHY 187
[20] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar. Convex Analysis and Optimization.
Athena Scientific, 2003.
[21] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numer-
ical Methods. Prentice Hall, 1989.
[22] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[23] N. Bobro↵, A. Kochut, and K. Beaty. Dynamic Placement of Virtual Machines for
Managing SLA Violations. In Proc. 10th IFIP/IEEE International Symposium on
Integrated Network Management, 2007.
[24] T. Bollerslev. Generalized Autoregressive Conditional Heteroskedasticity. Journal
of Econometrics, 31:307–327, 1986.
[25] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting
and Control. Wiley, 2008.
[26] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimiza-
tion and Statistical Learning via the Alternating Direction Method of Multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[27] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,
2004.
[28] P. J. Brockwell and R. A. Davis. Introduction to Time Series and Forecasting.
Springer, 2002.
BIBLIOGRAPHY 188
[29] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic. Cloud Computing
and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing
as the 5th Utility. Future Generation Computer Systems, 25(6):599–616, 2009.
[30] X. Chen, M. Zhang, Z. Morley Mao, and P. Bahl. Automating Network Application
Dependency Discovery: Experiences, Limitations, and New Solutions. In Proc. of
OSDI ’08, 2008.
[31] B. Cheng, X. Liu, Z. Zhang, H. Jin, L. Stein, and X. Liao. Evaluation and Optimiza-
tion of a Peer-to-Peer Video-on-Demand System. J. Syst. Archit., 54(7):651–663,
Jul. 2008.
[32] M. Chiang, S. H. Low, A. R. Calderbank, and J. C. Doyle. Layering as Optimization
Decomposition: A Mathematical Theory of Network Architectures. Proceedings of
the IEEE, 95(1):255–312, January 2007.
[33] W. Enders. Applied Econometric Time Series. Wiley, Hoboken, NJ, third edition,
2010.
[34] V. Gabale, P. Dutta, R. Kokku, and S. Kalyanaraman. InSite: QoE-aware Video
Delivery From Cloud Data Centers. In Proc. of International Workshop on Quality
of Service (IWQoS), 2012.
[35] P. Gill, M. Arlitt, Z. Li, and A. Mahanti. YouTube Tra�c Characterization: A View
From the Edge. In Proc. of ACM SIGCOMM Internet Measurement Conference
(IMC), San Diego, USA, October 2007.
BIBLIOGRAPHY 189
[36] D. Gmach, J. Rolia, L. Cherkasova, and A. Kemper. Workload Analysis and De-
mand Prediction of Enterprise Data Center Applications. In Proc. of IEEE Symp.
Workload Characterization, 2007.
[37] Z. Gong, X. Gu, and J. Wilkes. PRESS: PRedictive Elastic ReSource Scaling for
Cloud Systems. In Proc. of IEEE International Conference on Network and Services
Management (CNSM), 2010.
[38] R. L. Grossman. The Case for Cloud Computing. IT Professional, 11(2):23–27,
March-April 2009.
[39] C. Guo, G. Lu, H. J. Wang, S. Yang, C. Kong, P. Sun, W. Wu, and Y. Zhang.
SecondNet: a Data Center Network Virtualization Architecture with Bandwidth
Guarantees. In Proc. of ACM International Conference on Emerging Networking
Experiments and Technologies (CoNEXT), 2010.
[40] G. Gursun, M. Crovella, and I. Matta. Describing and Forecasting Video Access
Patterns. In Proc. of IEEE INFOCOM Mini-Conference, 2011.
[41] C. Huang, J. Li, and K. W. Ross. Can Internet Video-on-Demand be Profitable? In
Proc. of SIGCOMM’07, Kyoto, Japan, August 2007.
[42] Y. Huang, T. Z. J. Fu, D.-M. Chiu, J. C. S. Lui, and C. Huang. Challenges, Design
and Analysis of a Large-scale P2P-VoD System. In Proc. of SIGCOMM’08, Seattle,
Washington, August 2008.
[43] R. Johari and J. N. Tsitsiklis. E�ciency Loss in a Network Resource Allocation
Game. Mathematics of Operations Research, 29(3):407–435, August 2004.
BIBLIOGRAPHY 190
[44] F. P. Kelly. Charging and Rate Control for Elastic Tra�c. European Transactions
on Telecommunications, 8:33–37, 1997.
[45] D. Kossmann, T. Kraska, and S. Loesing. An Evaluation of Alternative Architectures
for Transaction Processing in the Cloud. In Proc. of International Conference on
Management of Data (SIGMOD), 2010.
[46] D. Kusic, J. O. Kephart, J. E. Hanson, N. Kandasamy, and G. Jiang. Power and
Performance Management of Virtualized Computing Environments via Lookahead
Control. Cluster computing, 12(1):1–15, March 2009.
[47] B. Li, S.-S. Xie, G. Y. Keung, J.-C. Liu, I. Stoica, H. Zhang, and X.-Y. Zhang. An
Empirical Study of the Coolstreaming+ System. IEEE Journal on Selected Areas
in Communications, Special Issue on Advances in Peer-to-Peer Streaming System,
25(9):1627–1639, December 2007.
[48] M. Lin, A. Wierman, L. L. H. Andrew, and E. Thereska. Dynamic Right-Sizing for
Power-Proportional Data Centers. In Proc. of IEEE INFOCOM, 2011.
[49] X. Lin, N. B. Shro↵, and R. Srikant. A Tutorial on Cross-Layer Optimization in
Wireless Networks. IEEE Journal on Selected Areas in Communications, 24(8):1452–
1463, August 2006.
[50] Z. Liu, C. Wu, B. Li, and S. Zhao. UUSee: Large-Scale Operational On-Demand
Streaming with Random Network Coding. In Proc. of IEEE INFOCOM, 2010.
[51] L. Ljung. System Identification: Theory for the User. Prentice Hall PTR, 2 edition,
1999.
BIBLIOGRAPHY 191
[52] J.-G. Luo, Q. Zhang, Y. Tang, and S.-Q. Yang. A Trace-Driven Approach to Eval-
uate the Scalability of P2P-Based Video-on-Demand Service. IEEE Trans. Parallel
Distrib. Syst., 20(1):59–70, Jan. 2009.
[53] A. Mahimkar, Z. Ge, A. Shaikh, J. Wang, J. Yates, Y. Zhang, and Q. Zhao. To-
wards Automated Performance Diagnosis in a Large IPTV Network. In Proc. of
SIGCOMM’09, Barcelona, Spain, 2009.
[54] A. McNeil, R. Frey, and P. Embrechts. Quantitative Risk Management: Concepts
Techniques and Tools. Princeton University Press, 2005.
[55] D. Niu, C. Feng, and B. Li. A Theory of Cloud Bandwidth Pricing for Video-on-
Demand Providers. In Proc. of IEEE INFOCOM, 2012.
[56] D. Niu, C. Feng, and B. Li. Pricing Cloud Bandwidth Reservations under Demand
Uncertainty. In Proc. of ACM SIGMETRICS, 2012.
[57] D. Niu and B. Li. An E�cient Distributed Algorithm for Resource Allocation in
Large-Scale Coupled Systems. In Proc. of IEEE INFOCOM, 2013.
[58] D. Niu, B. Li, and S. Zhao. Understanding Demand Volatility in Large VoD Sys-
tems. In Proc. of the 21st International workshop on Network and Operating Systems
Support for Digital Audio and Video (NOSSDAV), 2011.
[59] D. Niu, Z. Liu, B. Li, and S. Zhao. Demand Forecast and Performance Prediction
in Peer-Assisted On-Demand Streaming Systems. In Proc. of IEEE INFOCOM
Mini-Conference, 2011.
BIBLIOGRAPHY 192
[60] D. Niu, H. Xu, B. Li, and S. Zhao. Quality-Assured Cloud Bandwidth Auto-Scaling
for Video-on-Demand Applications. In Proc. of IEEE INFOCOM, Orlando, Florida,
2012.
[61] D. P. Palomar and M. Chiang. A Tutorial on Decomposition Methods for Net-
work Utility Maximization. IEEE Journal on Selected Areas in Communications,
24(8):1439–1451, August 2006.
[62] J. C. Panzar and D. S. Sibley. Public Utility Pricing under Risk: The Case of
Self-Rationing. The American Economic Review, 68(5):888–895, Dec. 1978.
[63] J. Schad, J. Dittrich, and J.-A. Quiane-Ruiz. Runtime Measurements in the Cloud:
Observing, Analyzing, and Reducing Variance. In Proc. of VLDB, 2010.
[64] Y. Song, M. Zafer, and K.-W. Lee. Optimal Bidding in Spot Instance Market. In
Proc. of IEEE INFOCOM, 2012.
[65] C. W. Tan, D. P. Palomar, and M. Chiang. Distributed Optimization of Coupled Sys-
tems With Applications to Network Utility Maximization. In Proc. of IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing, Toulouse, France,
2006.
[66] C. Tang, M. Steinder, M. Spreitzer, and G. Pacifici. A Scalable Application Place-
ment Controller for Enterprise Data Centers. In Proc. ACM WWW, 2007.
[67] B. Urgaonkar, P. Shenoy, and T. Roscoe. Resource Overbooking and Application
Profiling in Shared Hosting Platforms. In Proc. of USENIX Symposium on Operating
Systems Design and Implementation (OSDI), 2002.
BIBLIOGRAPHY 193
[68] M. Wang, X. Meng, and L. Zhang. Consolidating Virtual Machines with Dynamic
Bandwidth Demand in Data Centers. In Proc. of IEEE INFOCOM 2011 Mini-
Conference, 2011.
[69] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowstron. Better Never Than Late:
Meeting Deadlines in Datacenter Networks . In Proc. of SIGCOMM, 2011.
[70] C. Wu, B. Li, and S. Zhao. Multi-Channel Live P2P Streaming: Refocusing on
Servers. In Proc. of IEEE INFOCOM, 2008.
[71] D. Xie, N. Ding, Y. C. Hu, and R. Kompella. The Only Constant is Change:
Incorporating Time-Varying Network Reservations in Data Centers. In Proc. of
ACM SIGCOMM, 2012.
[72] H. Yin, X. Liu, F. Qiu, N. Xia, C. Lin, H. Zhang, V. Sekar, and G. Min. Inside the
Bird’s Nest: Measurements of Large-Scale Live VoD from the 2008 Olympics. In
Proc. ACM Internet Measurement Conference (IMC), 2009.
[73] M. Zaharia, A. Konwinski, A. D. Joseph, Y. Katz, and I. Stoica. Improving MapRe-
duce Performance in Heterogeneous Environments. In Proc. of OSDI, 2008.
[74] H. Zhang, B. Li, H. Jiang, F. Liu, A. V. Vasilakos, and J. Liu. A Framework for
Truthful Online Auctions in Cloud Computing with Heterogeneous User Demands.
In Proc. of IEEE INFOCOM, 2013.
[75] X. Zhang, J. Liu, B. Li, and T.-S. P. Yum. CoolStreaming/DONet: A Data-Driven
Overlay Network for E�cient Live Media Streaming. In Proc. of IEEE INFOCOM,
2005.
Recommended