iCostale: Adaptive Cost Optimization for Storage Clouds
Sandip Agarwala, Divyesh Jadav
IBM Almaden Research Center, San Jose, CA 95120
Email: {sagarwala, divyesh}@us.ibm.com
Luis A Bathen
University of California, Irvine, CA 92697
Email: [email protected]
Abstract—The unprecedented volume of data generated by contemporary business users and consumers has created enormous data storage and management challenges. In order to control data storage cost, many users are moving their data to online storage clouds, and applying data transformation techniques that reduce capacity usage, like de-duplication, compression, and transcoding. These give rise to several challenges, such as which cloud to choose, and what data transformation techniques to apply for optimizing cost.
This paper presents an integrated storage service called iCostale that reduces the overall cost of data storage through automatic selection and placement of users' data into one of many storage clouds. Further, it intelligently transforms data based on its type, access frequency, transformation overhead, and the cost model of the storage cloud providers. We demonstrate the efficacy of iCostale through a series of micro- and application-level benchmarks. Our experimental results show that, through intelligent data placement and transformation, iCostale can reduce the overall cost of data storage by more than 50%.
I. INTRODUCTION
Two recent phenomena have created an interesting set of
challenges at the storage layer of distributed systems. First,
the amount of disparate data types and quantities created,
transferred and shared by web applications (blogs, social
media, games etc.) has exploded, along with the multitude of
end-point devices (traditional computers, smart phones, tablets
and game consoles) used to create and manipulate such data.
Second, there is a growing number of providers [1], [2], [3],
[4], [5] of the cloud or utility computing model [6], and a
widespread adoption of this model, whereby clients can run
compute jobs and/or store their data in remote data centers that
are owned and operated by a potentially separate organization.
Customers are charged using a pay-as-you-go model, based on
various combinations of compute cycles, network bandwidth
and/or storage consumed, and/or transactions executed.
The trends described above have implications: the
widespread generation, transmission and storage of ever in-
creasing types and volumes of data increases the load on the
networking and storage infrastructure. The utility computing
model is suitable for certain workload and data usage pat-
terns [7], has the advantages of lower capital expenditure, and
potentially lower operational expenditure. However, separation
of the ownership of the data and the resources used to
manipulate or store the data creates new issues in privacy,
security, durability, availability, and access performance. One
effect of the above trends is increased operational cost: the
explosion in the number of data sources and destinations, file
types and the variety of file sizes increases the overall cost
to manage them. Furthermore, different utility providers have
different pricing models, and an incomplete understanding of
the pricing model can lead to a high operational cost.
In order to reduce the cost of cloud data storage, two
orthogonal approaches can be applied: moving computation
closer to the data [8] and data transformation (deduplication,
compression and transcoding). In this paper, we focus on using
adaptive compression for storage cost reduction.
Typical data stored in clouds are characterized by: large ca-
pacity, disparate I/O access patterns (some objects are accessed
frequently, others are not), soft performance requirements,
online access from geographically dispersed locations, low
management overhead, and a preference for lower pricing
over richer functionality. Compression algorithms typically
have parameters that control the compression level; these levels
trade off resource consumption (memory, CPU cycles) for
compression ratios. Finding the optimal compression algo-
rithm that minimizes cloud storage cost is a difficult theoretical
problem, complicated further by the potential for change in
pricing model and performance of the service over time.
This paper describes iCostale, an integrated storage service
framework that is interposed between multiple clients and mul-
tiple cloud storage providers. Storage cloud providers support
various data access interfaces. For example, Amazon’s Simple
Storage Service (S3) [1] supports SOAP, REST and BitTor-
rent [9]. Other providers support file-based protocols [10].
The techniques described in this paper can be implemented
in a multitude of interfaces, but we restrict the discussion to
a REST-like interface. iCostale makes the following technical
contributions: First, it provides a single unified interface to
end-users to store their data into different clouds. Second, it
builds a comprehensive cost model to evaluate and compare
different compression methods and placement alternatives.
Third, its adaptive compression and placement algorithm effi-
ciently computes a cost-effective placement/compression com-
bination based on end-user requirements, pricing models, data
type and access frequency. Finally, a large set of experiments
evaluates the effectiveness of iCostale with large datasets and
different pricing models from multiple providers.
II. RELATED WORK
Case studies from contemporary cloud providers [11], [12],
[13] provide insight into how applications and users exercise
contemporary commercial clouds. The authors of [7] provided an excellent
perspective on the definition, historical evolution, taxonomy
and near-term and long-term obstacles and opportunities in the
realm of cloud computing. As one of the early cloud offerings,
Amazon EC2 and S3 have been studied extensively for cost,
availability and performance [14], [9]. The CloudCmp [15]
framework aids a customer to select a cloud provider using
a set of benchmarking tools. It showed that different storage
services can offer different performance at different data scales
with different operations. In contrast, iCostale transparently
places data at the cloud provider that best matches the cus-
tomer workload and budget, and also provides a dynamic
mechanism to move data between clouds, should the pricing
and/or the reference pattern change.
Wang et al. [16] proposed that due to the decoupling
between the owners of cloud data (customers), and the owners
of the hardware and software used to store or process the
data (cloud providers), with a pricing scheme as the bridge,
cloud computing has fundamentally changed the landscape
of system design and optimization. We accept this premise
and focus on reducing customer cost through data storage
and cross-cloud placement optimizations. Like iCostale, [17]
studied the tradeoff between storing end results on the one
hand, and storing only the provenance data and re-computing
results when needed, on the other. We agree with the philosophy of using computation in
place of storage for rarely used data, but differ in that we
provide a concrete cost model that identifies when to store as-
is and when to transform, and avoid the uncertainty associated
with forecasting future pricing by dynamically placing data
to take advantage of observed cloud provider pricing and
performance.
Much work has been done in the fields of compression and
transcoding, each addressing different aspects, and at different
levels of the systems software stack. Source compression can
be the most effective level, as it reduces network transmission
cost and storage footprint. However, resource limitations of
source devices (CPU, battery life, scratch space) can limit the
cost reductions of compression. In the networking domain,
these include tradeoffs between transmitting compressed vs.
uncompressed data [18], dynamically choosing which copy to
send [19], etc. However, optimizing for transmission footprint
helps only at the moment data is actually stored or accessed, as
opposed to the recurring costs of data storage. Cumulus [20]
showed that for a cloud filesystem backup, changes to the
storage efficiency have a more substantial effect on total cost
than changes in bandwidth efficiency. iCostale focuses on
intelligent use of data transformation at the storage systems
layer of a cloud computing environment. There has been recent
interest in using an appliance to perform compression at the
storage subsystem level, either in-line [21] or out-of-band [22].
iCostale differs in that it uses selective compression to reduce
storage footprint, and by using access frequency to guide
whether to compress or not.
The disparate interfaces imposed by different cloud opera-
tors have spurred the development of cloud API libraries [23],
[24]. They aid user transparency and uniformity by providing
protocol-independence and common authentication mecha-
nisms. The proxy approach to accelerate applications or hide
the complexity of cloud APIs has attracted recent interest in
the academic [25] and commercial [10] arenas. iCostale is tar-
geted at this space, but differs from other proxy and cloud API
library approaches in that it combines inter-cloud placement
with transformation techniques, and (optional) user provided
access hints with dynamically tracked access frequency, to
reduce end-user cloud storage cost.
III. DESIGN OVERVIEW
In a storage cloud environment, users can store their data
directly on a cloud provider like Amazon S3, or they may use
third party online storage services like Dropbox, Slideshare,
SmugMug, etc. The latter provide value added services like
backups, content sharing, collaboration, etc. to their users and
may store user data into other cloud storage providers (e.g.
Amazon S3).
In iCostale, we implement a third party online storage
service that provides a REST and SOAP-based API similar
to Amazon S3. Instead of managing its own storage, iCostale uses storage from one of the many cloud providers. Users
upload and access their contents in iCostale, which in turn
stores them in the backend cloud. We assume that iCostale is
hosted in a compute cloud like Amazon EC2 (Elastic Compute
Cloud). The goal of iCostale is to reduce the overall cost of data storage for its users.
In order to optimize data storage performance and resource
usage, application developers and system designers have used
many different techniques, as discussed in Section II. Each
of these techniques has their own set of advantages and
disadvantages. For the purpose of reducing cost, iCostale focuses on adaptive data placement and transformation. The
data transformation techniques that we will discuss in this
paper include different kinds of lossy and lossless compression
and transcoding. Other techniques like caching, de-duplication,
etc. are complementary to our techniques and can be applied
in conjunction with iCostale.
Fig. 1. iCostale Architecture Diagram

Figure 1 shows an architectural overview of the iCostale
storage service. It is interposed between client and storage
cloud providers. Users communicate via REST and SOAP-
based API with iCostale, which in turn forwards those requests
to one of the backend storage clouds after some processing.
For write requests, iCostale compresses the data based on its
type, user preferences, etc. and stores it in a storage cloud such
that overall cost for the content owner is reduced. For read
requests, iCostale fetches data from the storage cloud, applies
the corresponding decompression or decoding algorithm and
returns the resultant data back to the user. In the case of
lossless compression, the data returned is an exact copy of the
original data stored by the user. In the case of lossy compression,
the data returned is in the same format as the original and meets
the user's quality requirement.
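The paper specifies this interface only as "REST and SOAP, similar to Amazon S3". As a rough illustration of the client-facing contract, assuming an S3-like key/object model, it could be reduced to something like the following Java sketch (all names here are ours, not iCostale's published API):

// Hypothetical client-facing contract; iCostale performs compression,
// decompression and cloud placement behind these calls.
public interface ObjectStore {
    /* Store an object; iCostale transparently transforms and places it. */
    void put(String bucket, String key, byte[] data, String contentType);
    /* Fetch an object; iCostale decompresses/decodes before returning it. */
    byte[] get(String bucket, String key);
    /* Remove the object and its location mapping. */
    void delete(String bucket, String key);
}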
IV. COST OPTIMIZATION
For cost reduction purposes, several factors need to be
carefully considered. The first is the overhead due to the
data transformation technique. On the one hand, compression
reduces data size (in most cases), which reduces capacity and
bandwidth usage, while on the other hand, it consumes CPU
cycles, which increases the compute cost. iCostale selects the compression algorithm that reduces the total sum of compute, storage and I/O cost. The second factor is the data access
pattern. For infrequently accessed data, the compute cost is low
because of the reduced number of transformations. For frequently
accessed data, transformation cost may become quite signifi-
cant compared to storage cost. The third factor relates to the
data type. Some types of data are more compression friendly
than others. Depending on the data type, the effectiveness of
compression algorithms may vary. Fourth, different storage
cloud providers have different cost models that may range
from flat monthly pricing to complex tiered and itemized
pricing. Finally, in addition to reducing cost, different users
may have additional requirements in terms of performance,
availability and resiliency. In the next few sub-sections, we
discuss each of these factors in greater detail, and show how
we address them with iCostale.
A. Choice of Data Transformation Techniques
There is a wide variety of compression techniques with
varying characteristics. Different compression algorithms are
optimized for different criteria like compression ratio, com-
pression speed, decompression speed, memory consumption,
etc. Also, their performance may vary from one dataset to
the other. For example, deflate-based (e.g. gzip) compression
methods are optimized for both speed and compression ratio,
while algorithms like PAQ [26] focus primarily on compres-
sion ratio.
Compression techniques can be broadly grouped into two
categories: lossy and lossless. Lossy compression is generally
applied to multimedia datasets like images, audio and video,
where quality can be compromised for higher compression
ratio. A common theme across all compression techniques
is that there is an inherent tradeoff between compression
ratio, compression/decompression speed, quality, memory con-
sumption, etc. An aggressive compression method may lower
bandwidth and storage capacity usage, but it may consume
more CPU cycles. This results in lower storage cost and higher
compute cost. Further, it may not meet the response time and
quality requirements of the user.
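To make this tradeoff concrete, the following self-contained Java sketch (our illustration, not part of iCostale) uses the standard java.util.zip.Deflater to compare output size and CPU time across deflate levels on the same input:

import java.util.zip.Deflater;

public class LevelTradeoff {
    // Compress 'input' at the given deflate level (0-9) and report
    // compressed size and elapsed time: the ratio-vs-CPU tradeoff.
    static void tryLevel(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64];
        long start = System.nanoTime();
        int compressed = 0;
        while (!deflater.finished()) {
            compressed += deflater.deflate(buf);
        }
        long micros = (System.nanoTime() - start) / 1_000;
        deflater.end();
        System.out.printf("level=%d size=%d bytes time=%d us%n",
                          level, compressed, micros);
    }

    public static void main(String[] args) {
        byte[] sample = "some repetitive sample text ".repeat(10_000).getBytes();
        for (int level : new int[] {1, 5, 9}) {
            tryLevel(sample, level);
        }
    }
}

Higher levels typically shrink the output further while consuming more CPU time, which is exactly the storage-versus-compute tradeoff that Figure 2 quantifies.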
Figure 2 shows the tradeoff between annual storage cost
and compute cost for storing about 80 GB of text data (100K
text books of average size 0.8 MB each). Pricing is based on
Amazon’s S3 and EC2.

Fig. 2. Storage and Compute Cost Comparison

Storage cost dominates in all the cases,
but the compute overhead in some cases offsets the savings
due to compression. Nanozip [27], for example, achieves the
maximum compression in this analysis, but the compute cost
drives the total cost higher than other compression algorithms.
To reduce the total cost, iCostale monitors the performance
and savings due to different compression techniques and for
different types of data, and applies the one that meets the user
requirements (response time, quality, etc.) so as to achieve
maximum savings.
B. Data Access Pattern
Different workloads have different access patterns. Data
generated by workloads like backups, archival, logs, etc. are
seldom accessed frequently. On the other hand, a popular web
document or media content may be frequently accessed. For
every read access, iCostale fetches the compressed data from
the storage cloud, decompresses or decodes it to its original
format, and sends it back to the user. The decompression
overhead and the associated compute cost add up for fre-
quently accessed data objects and may offset the savings due
to compression.
Fig. 3. Total cost (excluding transfer cost) with different compression (Y-axis: cost in $; X-axis: read requests in millions; compression methods: NONE, GZIP, NANOZIP, BMF, JPEG2000)
Figure 3 shows the cost of hosting 100GB of BMP image
data (100,000 images of roughly 1 MB size each) on an Ama-
zon S3 like environment. We applied different compression
techniques and computed the cost for a varying number of read
accesses. The X-axis represents the total number of read requests
across all the images. The cost (on the Y-axis) includes the
storage capacity cost and the compute cost for decompression
(based on Amazon EC2 pricing). Transfer bandwidth cost is
not included, in order to isolate the compute overhead at
different I/O access levels. From the figure, as the number of accesses
increases, the cost of storing uncompressed data remains the
same. For different compression algorithms, the cost initially
goes down due to reduced space usage. But with a larger number
of accesses, the cost of decompression goes up and offsets
the savings due to the reduction in space usage. For jpeg2000
and nanozip, the cost of transforming the data overtakes the
cost of just storing the uncompressed data at around 1M and
10M IOs respectively. For gzip and bmf, the cost blows up
at around 100M IOs. This graph shows the importance of
the access pattern in determining the appropriate compression
method. For very hot data, it would be more cost-effective to
store data in its original form. For others, compression would
be beneficial. iCostale adapts the compression methods based
on how frequently an object is accessed.
C. Cost Model
Storage cloud providers are coming up with unique sets
of features and pricing models to differentiate themselves
and attract customers. For selecting appropriate providers and
overall cost optimization, iCostale models the cost that a user
would pay for storing data in the cloud.
Fig. 4. iCostale Cost Components (the client sends uncompressed data of size s to iCostale at transfer cost rt1 per GB; iCostale runs in a compute cloud at compute cost rc per hour and stores compressed data of size sc in a storage cloud at transfer cost rt2 per GB, cost rr per request, and storage cost rs per GB)
Figure 4 shows a high-level view of iCostale along with
some of the cost components. In this model, iCostale is
deployed in a compute cloud (like Amazon EC2) and stores
compressed data in a storage cloud (like Amazon S3, Google
storage cloud, etc.) iCostale charges back the cost of using
the compute cloud and the storage. The charge is proportional
to the resources used on behalf of the user. Table I shows the
parameters in our cost model.

TABLE I
ICOSTALE COST MODEL PARAMETERS

Symbol  Description
s       Size of original data (in GB)
sc      Size of compressed data (in GB)
CR      Compression ratio (sc/s)
rt1     Cost ($/GB) to transfer data between client and iCostale service
rt2     Cost ($/GB) to transfer data between iCostale service and the storage cloud
rr      Cost per request (get, put, etc.) from iCostale service to the storage cloud
rc      Compute cost / hour
rs      Storage cost / GB / month
tca,d   Average time taken to compress 1 GB of data 'd' with algorithm 'a'
tda,d   Average time taken to decompress 1 GB of data 'd' with algorithm 'a'
CCPU    Compute cost
CIO     IO cost
CS      Storage cost
CT      Total cost
nrd     No. of read requests/month for data 'd'
nwd     No. of write requests/month for data 'd'
nd      Total no. of requests/month for data 'd' (nrd + nwd)
From Table I, we can define the total cost for a particular
data object ‘d’ as follows:
CCPU = nwd * tca,d * rc * s + nrd * tda,d * rc * s
     = [nwd * tca,d + nrd * tda,d] * rc * s
CIO = [s * rt1 + sc * rt2 + rr] * nd
CS = sc * rs
CT = CCPU + CIO + CS
The above cost model is a simplified representation of the
pricing model of many popular IaaS (Infrastructure as a Service) providers. It can be adapted very easily to accommodate
the tiered and other pricing models of the providers.
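For concreteness, the model transcribes directly into code. The following Java sketch (ours, not the authors' implementation; compression and decompression times are expressed in hours per GB so that they combine with the hourly compute rate) computes the total cost for one data object:

// Direct transcription of the Section IV-C cost model (our sketch).
public final class CostModel {
    double s;   // size of original data (GB)
    double sc;  // size of compressed data (GB)
    double rt1; // $/GB transfer, client <-> iCostale
    double rt2; // $/GB transfer, iCostale <-> storage cloud
    double rr;  // $ per request to the storage cloud
    double rc;  // compute cost ($/hour)
    double rs;  // storage cost ($/GB/month)
    double tc;  // hours to compress 1 GB of 'd' with algorithm 'a' (tca,d)
    double td;  // hours to decompress 1 GB of 'd' with algorithm 'a' (tda,d)
    long nw;    // write requests per month (nwd)
    long nr;    // read requests per month (nrd)

    double cpuCost()     { return (nw * tc + nr * td) * rc * s; }          // CCPU
    double ioCost()      { return (s * rt1 + sc * rt2 + rr) * (nw + nr); } // CIO
    double storageCost() { return sc * rs; }                               // CS
    double totalCost()   { return cpuCost() + ioCost() + storageCost(); }  // CT
}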
D. User Profile
Users can specify a set of requirements that are later used
by iCostale to decide the placement and compression of
their data. For each type of data (e.g. document, text files,
source code, executables, images, audio, video, etc.), a user
can specify the following attributes: Response time, Quality
(for lossy compression), Preferred storage cloud provider,
Availability of the storage provider and Expected number of
monthly accesses. The last attribute is just an initial estimate
for the iCostale algorithm. The actual runtime statistics are
maintained by the iCostale optimization algorithm, which is dis-
cussed in the next section.
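Such a profile could be represented per data type roughly as follows (a hypothetical sketch using Java 16+ record syntax; the field names are ours):

// One profile entry per data type, mirroring the attributes listed above.
public record TypeProfile(
        String dataType,             // e.g. "image", "text", "audio"
        double maxResponseMs,        // response time requirement
        double minQuality,           // quality floor for lossy compression
        String preferredProvider,    // preferred storage cloud, may be null
        double minAvailability,      // required provider availability, e.g. 0.999
        long expectedMonthlyAccesses // user's initial estimate only
) {}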
E. iCostale Optimization Algorithm
Fig. 5. Sample compression options for a BMP file

For every write request, the iCostale optimization algorithm
determines the following three items:
• Where to place data
• What compression algorithm to apply
• Level of compression for the above algorithm
This is a hard problem to solve, as there are many
compression algorithms and they may accept many fine-
tuning parameters. Figure 5 shows the transformation
possibilities for an image file. For a single BMP file, there
are many different options as to what transformation to apply.
It can be lossless or lossy; for each type, we can choose from
a wide range of algorithms; and for each algorithm, we must
choose the right set of flags to get the desired size reduction
while keeping the computational overhead low. A brute force
way to achieve optimum compression is to apply all possible
compression algorithms with different parameters. This may
achieve the best compression result, but may impose a lot of
CPU overhead and/or miss response time requirements.
Algorithm 1 iCostale algorithm

for each data write request d of size s do
    f ← type of data d
    U ← profile of user associated with the request
    P ← list of storage cloud providers that satisfy the user's
        availability requirements
    K ← set of compression techniques applicable to data of
        type/format 'f' from the knowledge base
    for each provider p ∈ P do
        from the pricing published by provider p, determine rt2, rr, rs
        for each compression method k ∈ K do
            if the combination of provider p's latency, tca,d, and tda,d
            does not meet the response time requirements then
                continue
            end if
            compute CT
            keep track of the minimum CT and the associated
            compression method and placement in Cmin
        end for
    end for
end for

In order to implement a more practical solution, we build a
knowledge base that contains the performance characteristics
of different compression techniques for different data types
and formats. We ran thousands of compression experiments
against a large dataset consisting of files of different types
and formats, and recorded the result in our knowledge base.
For example, we computed average compression ratio, average
compression time, average decompression time, etc. for all
*.txt files in our dataset for different compression techniques.
The timings were normalized for a file size of 1 GB. For
simplicity, we assumed that for the same file type and format,
compression and decompression costs are proportional to the
original data size. This knowledge base becomes the key
for the iCostale optimization algorithm, which is given in
Algorithm 1.
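Folding the Section IV-C cost formulas into the Algorithm 1 loop gives roughly the following Java rendering (our sketch; the Provider and KnowledgeBase types are hypothetical stand-ins, not the authors' implementation):

import java.util.List;

public class Optimizer {
    interface Provider {
        double latencyHrs(); // provider access latency, in hours
        double rt2();        // $/GB transfer, iCostale <-> provider
        double rr();         // $ per request
        double rs();         // $/GB/month storage
    }
    interface KnowledgeBase {
        List<String> methodsFor(String format);   // K: techniques for format 'f'
        double tc(String method, String format);  // hours to compress 1 GB
        double td(String method, String format);  // hours to decompress 1 GB
        double cr(String method, String format);  // expected ratio sc/s
    }
    record Choice(Provider provider, String method, double cost) {}

    static Choice select(double sizeGb, String format, double maxRespHrs,
                         long nr, long nw, double rt1, double rc,
                         List<Provider> providers, KnowledgeBase kb) {
        Choice best = null;
        for (Provider p : providers) {
            for (String k : kb.methodsFor(format)) {
                double tc = kb.tc(k, format), td = kb.td(k, format);
                // Skip combinations that miss the response time requirement.
                if (p.latencyHrs() + sizeGb * (tc + td) > maxRespHrs) continue;
                double sc = kb.cr(k, format) * sizeGb;  // predicted compressed size
                double cCpu = (nw * tc + nr * td) * rc * sizeGb;
                double cIo = (sizeGb * rt1 + sc * p.rt2() + p.rr()) * (nr + nw);
                double cT = cCpu + cIo + sc * p.rs();   // CT = CCPU + CIO + CS
                if (best == null || cT < best.cost()) best = new Choice(p, k, cT);
            }
        }
        return best; // null => store unmodified in the default provider
    }
}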
The algorithm assumes that the type (e.g. image, text, audio,
etc.) and the format (e.g. .bmp, .txt. .doc, .mp3, etc.) of
the data can be determined from the metadata (e.g. content-
type attribute, filename, etc.) contained in the write request.
After the algorithm finishes, Cmin contains the placement and
compression information that needs to be applied to the request
‘d’. iCostale compresses the data object ‘d’ and stores it in the
storage cloud specified by Cmin. If no suitable compression
method or placement is found, iCostale stores the unmodified
data in a default provider. It also records a location mapping
(key, location) along with a few other attributes (compression
method used, reference to user profile, etc.) in its database for
future read requests. The ‘key’ is the same as the one specified
in the user’s write request.
Read requests are simpler to handle. iCostale looks up
location information in its database and fetches data from the
storage cloud provider. It decompresses the data into its original
form (if required) and sends it back to the user. It also records
and updates a few metadata associated with this object like last
access time, access frequency, etc. iCostale stores the location
mapping, user profiles and other metadata in a shared database
like Amazon SimpleDB.
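The persisted mapping could take roughly the following shape (a hypothetical sketch of the attributes listed above; names are ours):

// One row per stored object, kept in a shared database such as SimpleDB.
public record LocationEntry(
        String key,               // the key from the user's write request
        String provider,          // storage cloud currently holding the object
        String compressionMethod, // how to decode on read ("none" if unmodified)
        String userProfileRef,    // reference to the owner's profile
        long lastAccessEpochMs,   // updated on each read
        long monthlyAccessCount   // drives adaptation decisions
) {}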
The exact type and number of compute cloud instances
for hosting iCostale would depend on the I/O traffic and the
compute load. In the experimental section, we will show how
changing the number of cores impacts the performance of
iCostale.
V. EXPERIMENTAL RESULTS
This section discusses the experimental evaluation of
iCostale and shows its effectiveness compared to non-adaptive
techniques.
A. Experimental Setup
iCostale has been implemented in Java, and runs as a storage
service. It provides SOAP and REST APIs similar to those of
Amazon S3. In order to mimic the storage cloud provider, we
implemented a server that provided a basic key-value object
store interface and stored data in its local disks. iCostale sits
between the clients and the object store servers. All experi-
ments were run on machines with two quad-core processors,
4 GB of memory, and the 2.6.34 Linux kernel. The required
compression libraries are installed on the iCostale node.
B. Compression Knowledge base
The goal of the first set of our experiments was to compute
performance statistics of different compression algorithms for
a wide variety of datasets and record them in our compres-
sion knowledge base that was discussed in the previous section.
The httperf [28] client was used to generate the workload.
Figure 6 shows the results for some types of data with a sub-
set of compression techniques (and their associated flags). For
each transformation we computed the compression ratio (CR),
the compression time (CT) and decompression time (DT). We
define CR as the ratio between the size of compressed data and
uncompressed data. Aside from the standard gzip (g-9 in the
graph), bzip2 (b-9), lzma [29] (lz-9) compressors, we included
several lesser-known but very powerful compressors,
such as the PAQ8 series, which have been shown to out-compress
many popular techniques at the cost of processing time and
memory consumption. In general, the LZ based compressors
are the fastest, but yield poor CR. In the case of lossless
multimedia data such as BMPs, BMF achieved the overall best
result, both in CR as well as compression/decompression time.
In some cases, lossless compression resulted in CR greater
than 1.
Fig. 6. Compression knowledge base for lossless techniques. Panels: ASCII books, Word docs, BMP files, Web data, PDF files, JPEG files. (Left Y-axis shows the compression ratio, right Y-axis shows the compression/decompression time in seconds, X-axis shows the compression methods, e.g. g-9, b2-9, LZ-9, p8PX-8, p8L-8, bmf-Q1, jp2k, pjg.)

Fig. 7. Compression knowledge base for lossy techniques. (Left Y-axis shows the compression ratio, X-axis shows the quality parameter, right Y-axis shows: 1) the encoding/decoding time in ms, 2) PSNR, 3) RMSE.)

Figure 7 shows the compression ratios (CR), encoding
(ET) and decoding (DT) times, as well as two different
quality metrics, peak signal to noise ratio (PSNR) and root
mean square error (RMSE), for the JPEG and JPEG2000
lossy encoders. As expected, higher quality resulted in lower
compression, higher PSNR and lower RMSE. Unlike lossless
compressors, these two lossy algorithms saw less variation
in their encoding/decoding time as a function of the quality
since in most cases the time to build the internal models is
about the same; the only difference is how the data is partitioned.
For instance, in JPEG2000, quality is affected by the number
of quantization steps (Δ) used by the encoder to divide
the coefficients. Users can specify their tolerance for quality
deterioration for their multimedia data and iCostale uses that
to make run-time transformation decisions.
C. iCostale without Adaptive Placement
In order to evaluate the effectiveness of iCostale’s adaptive
compression, we performed a series of experiments and computed
the cost of storing different types of data under the following
scenarios: (i) No compression, (ii) Compression with gzip,
(iii) adaptive compression with iCostale and, (iv) adaptive
compression with an oracle algorithm. Since iCostale’s opti-
mization is based on the average performance statistics stored
in its knowledge base, it is possible that the best compression
method from the knowledge base perspective may not be the
best for the actual data. We developed an oracle version of
iCostale that has prior knowledge of the best compression
method for all objects in the data set. This helped us determine
the effectiveness of using average performance statistics from
our knowledge base.
Figure 8 shows the result with six types of data. We varied
the number of read requests and computed the percentage
annual cost savings compared to no compression. In each
experiment, we assumed a total of 100K data objects. The
cost is based on pricing models of Amazon S3 and EC2.
For fairness with respect to gzip, iCostale and oracle didn’t
perform lossy compression. For gzip compression, the storage
cost saving remains the same as the number of read requests
goes up, but the compute overhead increases. This results
in decreasing cost savings with gzip. For jpeg and audio
data, gzip barely achieved any compression. Therefore, as the
number of requests went up, it resulted in negative cost
savings for those data types. iCostale was able to achieve
much greater savings. On average, the savings were 40%
greater than with vanilla gzip compression.

Fig. 8. Percentage cost improvement for different access patterns (Y-axis: % cost reduction relative to no compression, X-axis: number of read requests)

In some cases iCostale was able to achieve greater than 60% savings, but as
the number of IOs increases, the transfer and compute costs
offset the savings due to iCostale compression. The oracle
algorithm was only slightly better than iCostale. This is
because, in most cases, the compression method chosen by
the oracle was the same as that chosen by iCostale. This shows
that using a knowledge base is a viable approach for
implementing adaptive compression.
Table II shows the iCostale performance for a mixed dataset
consisting of 80% reads and 20% writes. As the number
of CPU cores is increased, the throughput of iCostale in-
creases almost linearly. Note that these numbers are for an
unoptimized implementation of iCostale. These can be further
improved with caching and performance tuning.

TABLE II
ICOSTALE PERFORMANCE

No. of CPU Cores    1    2    4    8
Throughput (Mbps)   195  405  817  1609
D. iCostale with Adaptive Placement
The next series of experiments demonstrate the effective-
ness of combined adaptive placement and compression. We
performed these experiments with a mixed dataset consisting
of 100K objects of total size 234 GB and compared the cost of
storing them on 8 storage cloud providers (Amazon S3 [1],
Nirvanix [5], Windows Azure [3], Diomede [30], Google [2],
TopHostingCenter [31]). For each cloud offering, we chose
the cheapest of the different cost models they had available.
Figure 9 shows the normalized cost of storing data in different
clouds relative to Amazon S3. Amazon S3 is represented by
the horizontal line (at 1) in the three graphs. The first graph
in the figure just compares the cost of storing uncompressed
data and shows the change in cost due to different numbers of
I/Os. Google and Azure have similar pricing. Diomede was the
most inexpensive of all the cloud offerings (in terms of storage
cost). For larger numbers of I/Os, TopHostingCenter had a lower
total cost. From the graph, it is clear that there is no one
provider whose total cost is consistently low. iCostale takes
advantage of this fact to determine the appropriate placement
for its workload. In a real-world scenario, price is not the only
criterion for selecting a cloud provider. In this experiment,
we focus only on the cost aspects for provider selection.
The second graph in the figure shows the cost of storing
data in different clouds with iCostale’s adaptive compression.
The line labeled ‘iCostale’ represents combined adaptive
compression and placement. In this experiment, we assume
that iCostale is hosted in Amazon EC2 and use EC2 pricing
to determine the compute cost. As expected, the EC2/S3
combination is the cheapest solution as pushing data onto the
other cloud providers incurs a transfer cost to/from EC2 (at
both ends), whereas transfers between EC2 and S3 are free
(assuming they are in the same zone). If iCostale was hosted
on a cloud owned by a different provider, the result would
be different as shown in the third graph, where the ‘iCostale’
adaptive placement changed the provider for different levels of
IOs. For a low IO load, the cloud providers with the smaller
storage cost would be preferred, and as the IO load increases,
the storage providers with the lower transfer cost would be
preferred.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we presented iCostale, an intelligent inter-
mediary between consumers and providers of cloud storage
service. iCostale combines well-known data compression tech-
niques, a knowledge base of compression algorithms and cloud
pricing schemes, and history of access patterns to reduce
the end-user cost of using cloud storage. We do this in a
manner that is transparent to client applications, and provide
the added benefit of avoiding storage cloud provider lock-in.
Fig. 9. Storage cost with different placements (Y-axis shows the cost in dollars; X-axis, same for all graphs, shows the IO load; K=thousand, M=million)

The results of benchmarking several known compression algo-
rithms were presented for different data types, for both lossy
and lossless data compression. We showed that the choice of
transformation algorithm and its parameter values depend on
the data type and access pattern. We presented the results of
using iCostale in both non-adaptive and adaptive placement
scenarios. For non-adaptive placement (i.e. restricted to a given
cloud), iCostale outperformed the no-transformation and fixed
lossless (gzip) transformation approaches. For multimedia
data, greater savings were achieved when the user allowed for
quality deterioration. In either case, iCostale analyzes
the complex universe of data type, reference pattern and
tradeoffs in substituting storage cost with computation cost,
to dynamically find the sweet spot for a given workload.
iCostale was able to save greater than 50% of the end-user
cost for many data types. An ‘oracle’ algorithm with a priori
knowledge of the best compression method was able to do
only a few percent better than iCostale. We also showed
how iCostale can leverage knowledge of pricing schemes of
multiple cloud providers to adaptively place user data so as to
reduce user cost.
There are a few things that we would like to address as
part of future work. First, iCostale service failure implications
were not addressed in this paper. The key component that
would render user data inaccessible in a failure scenario is
the iCostale location database. Since this mapping can be
stored in an online database like Amazon SimpleDB, it should
be easy to implement a client program that the user can
run to access their data. Second, we want to experiment with
more sophisticated ways to do data migration between the
clouds that minimize cost and provide better availability and
performance. Finally, we used compression as the primary
technique to reduce data storage and transfer cost. However,
we did not address using caching at the iCostale layer to
store frequently accessed data. Caching hot data can avoid
the need to store it in a remote cloud, and reduce the need for
unnecessary transformations at the caching layer. These may
further reduce overall costs.
REFERENCES
[1] Amazon, "Web Services." [Online]. Available: http://aws.amazon.com
[2] Google, "AppEngine." [Online]. Available: http://code.google.com/appengine
[3] Microsoft, "Windows Azure." [Online]. Available: www.microsoft.com/windowsazure
[4] Rackspace, "Cloud Service." [Online]. Available: http://www.rackspacecloud.com
[5] Nirvanix, "Cloud Service." [Online]. Available: http://www.nirvanix.com/products-services/index.aspx
[6] L. M. Vaquero et al., "A break in the clouds: towards a cloud definition," SIGCOMM Comput. Commun. Rev., vol. 39, pp. 50–55, 2008.
[7] M. Armbrust, A. Fox, et al., "Above the clouds: A Berkeley view of cloud computing," UC Berkeley Technical Report EECS-2009-28, 2009.
[8] D. Borthakur, "The Hadoop Distributed File System: Architecture and Design." [Online]. Available: http://hadoop.apache.org/common/docs/r0.18.0/hdfs_design.pdf
[9] M. R. Palankar et al., "Amazon S3 for science grids: a viable solution?" in Workshop on Data-Aware Distributed Computing, 2008.
[10] G. Orenstein, "Show me the gateway – taking storage to the cloud," June 2010.
[11] Amazon, "Case Studies." [Online]. Available: http://aws.amazon.com/solutions/case-studies
[12] Google, "App Engine Developer Profiles." [Online]. Available: http://code.google.com/appengine/casestudies.html
[13] Microsoft, "Azure Case Studies." [Online]. Available: http://www.microsoft.com/azure/casestudies.mspx
[14] S. Garfinkel, "An evaluation of Amazon's grid computing services: EC2, S3 and SQS," Technical Report TR-08-07, Harvard University, 2007.
[15] A. Li et al., "CloudCmp: shopping for a cloud made easy," in Proc. of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010.
[16] H. Wang et al., "Distributed systems meet economics: pricing in the cloud," in USENIX Conference on Hot Topics in Cloud Computing, 2010.
[17] I. F. Adams et al., "Maximizing efficiency by trading storage for computation," in Conference on Hot Topics in Cloud Computing, 2009.
[18] C. Pu and L. Singaravelu, "Fine-grain adaptive compression in dynamically variable networks," in ICDCS, 2005, pp. 685–694.
[19] S. Sucu and C. Krintz, "ACE: A resource-aware adaptive compression environment," in Proc. of the ITCC, 2003.
[20] M. Vrable, S. Savage, and G. M. Voelker, "Cumulus: Filesystem backup to the cloud," Trans. Storage, vol. 5, pp. 14:1–14:28, December 2009.
[21] IBM, "Storwize Technology Overview." [Online]. Available: http://www.storwize.com/Products_Technology.asp
[22] Ocarina, "Ocarina Networks." [Online]. Available: http://www.ocarinanetworks.com/technology/technology-homepage-menu
[23] "Cloudloop Wiki." [Online]. Available: http://wiki.java.net/bin/view/Projects/CloudloopWiki
[24] "Multi-Cloud Data Access." [Online]. Available: http://code.google.com/p/smestorage/
[25] R. S. et al., "Using Proxies to Accelerate Cloud Applications," in USENIX HotCloud, 2009.
[26] M. Mahoney, "Florida Institute of Technology Technical Report CS-2005-16." [Online]. Available: http://mattmahoney.net/dc/#paq
[27] NanoZip, "NanoZip 0.08 alpha." [Online]. Available: http://code.google.com/p/smestorage/
[28] D. Mosberger and T. Jin, "httperf: A tool for measuring web server performance," SIGMETRICS Perform. Eval. Rev., vol. 26, 1998.
[29] LZMA, "SDK." [Online]. Available: http://www.7-zip.org/sdk.html
[30] Diomede, "Diomede Storage." [Online]. Available: http://www.diomedestorage.com/
[31] TopHostingCenter, "Cloud Storage." [Online]. Available: http://www.tophostingcenter.com/cloudstorage.htm