iCostale: Adaptive Cost Optimization for Storage Clouds

Sandip Agarwala, Divyesh Jadav

IBM Almaden Research Center, San Jose, CA 95120

Email: {sagarwala, divyesh}@us.ibm.com

Luis A Bathen

University of California, Irvine, CA 92697

Email: [email protected]

Abstract—The unprecedented volume of data generated by contemporary business users and consumers has created enormous data storage and management challenges. In order to control data storage cost, many users are moving their data to online storage clouds, and applying capacity-reducing data transformation techniques like de-duplication, compression, and transcoding. These give rise to several challenges, such as which cloud to choose, and what data transformation techniques to apply for optimizing cost.

This paper presents an integrated storage service called iCostale that reduces the overall cost of data storage through automatic selection and placement of users' data into one of many storage clouds. Further, it intelligently transforms data based on its type, access frequency, transformation overhead, and the cost model of the storage cloud providers. We demonstrate the efficacy of iCostale through a series of micro- and application-level benchmarks. Our experimental results show that, through intelligent data placement and transformation, iCostale can reduce the overall cost of data storage by more than 50%.

I. INTRODUCTION

Two recent phenomena have created an interesting set of challenges at the storage layer of distributed systems. First, the amount of disparate data types and quantities created, transferred and shared by web applications (blogs, social media, games, etc.) has exploded, along with the multitude of end-point devices (traditional computers, smart phones, tablets and game consoles) used to create and manipulate such data. Second, there is a growing number of providers [1], [2], [3], [4], [5] of the cloud or utility computing model [6], and a widespread adoption of this model, whereby clients can run compute jobs and/or store their data in remote data centers that are owned and operated by a potentially separate organization. Customers are charged using a pay-as-you-go model, based on various combinations of compute cycles, network bandwidth and/or storage consumed, and/or transactions executed.

The trends described above have implications: the widespread generation, transmission and storage of ever increasing types and volumes of data increases the load on the networking and storage infrastructure. The utility computing model is suitable for certain workload and data usage patterns [7], and has the advantages of lower capital expenditure and potentially lower operational expenditure. However, separation of the ownership of the data from the resources used to manipulate or store the data creates new issues in privacy, security, durability, availability, and access performance. One effect of the above trends is increased operational cost: the explosion in the number of data sources and destinations, file types and the variety of file sizes increases the overall cost to manage them. Furthermore, different utility providers have different pricing models, and an incomplete understanding of the pricing model can lead to a high operational cost.

In order to reduce the cost of cloud data storage, two orthogonal approaches can be applied: moving computation closer to the data [8] and data transformation (deduplication, compression and transcoding). In this paper, we focus on using adaptive compression for storage cost reduction.

Typical data stored in clouds are characterized by: large capacity, disparate I/O access patterns (some objects are accessed frequently, others are not), soft performance requirements, online access from geographically different locations, low management overhead, and a preference for lower pricing over richer functionality. Compression algorithms typically have parameters that control the compression level; the levels trade off resource consumption (memory, CPU cycles) for compression ratio. Finding the optimal compression algorithm that minimizes cloud storage cost is a difficult theoretical problem, complicated further by the potential for change in the pricing model and performance of the service over time.

This paper describes iCostale, an integrated storage service framework that is interposed between multiple clients and multiple cloud storage providers. Storage cloud providers support various data access interfaces. For example, Amazon's Simple Storage Service (S3) [1] supports SOAP, REST and BitTorrent [9]. Other providers support file-based protocols [10]. The techniques described in this paper can be implemented for a multitude of interfaces, but we restrict the discussion to a REST-like interface. iCostale makes the following technical contributions: First, it provides a single unified interface for end-users to store their data into different clouds. Second, it builds a comprehensive cost model to evaluate and compare different compression methods and placement alternatives. Third, its adaptive compression and placement algorithm efficiently computes a cost-effective placement/compression combination based on end-user requirements, pricing models, data type and access frequency. Finally, a large set of experiments evaluates the effectiveness of iCostale with large datasets and different pricing models from multiple providers.

II. RELATED WORK

Case studies from contemporary cloud providers [11], [12], [13] provide insight into how applications and users exercise contemporary commercial clouds. [7] provided an excellent perspective on the definition, historical evolution, taxonomy and near-term and long-term obstacles and opportunities in the realm of cloud computing. As one of the early cloud offerings, Amazon EC2 and S3 have been studied extensively for cost, availability and performance [14], [9]. The CloudCmp [15] framework helps a customer select a cloud provider using a set of benchmarking tools. It showed that different storage services can offer different performance at different data scales and with different operations. In contrast, iCostale transparently places data at the cloud provider that best matches the customer workload and budget, and also provides a dynamic mechanism to move data between clouds, should the pricing and/or the reference pattern change.

Wang et al. [16] proposed that due to the decoupling between the owners of cloud data (customers) and the owners of the hardware and software used to store or process the data (cloud providers), with a pricing scheme as the bridge, cloud computing has fundamentally changed the landscape of system design and optimization. We accept this premise and focus on reducing customer cost by defining data storage and cross-cloud placement optimizations. Like iCostale, [17] studied the tradeoff between storing end results on the one hand, and storing only the provenance data and re-computing results when needed on the other hand. We agree with the philosophy of using computation in place of storage for rarely used data, but differ in that we provide a concrete cost model that identifies when to store as-is and when to transform, and we avoid the uncertainty associated with forecasting future pricing by dynamically placing data to take advantage of observed cloud provider pricing and performance.

Much work has been done in the fields of compression and transcoding, each addressing different aspects and at different levels of the systems software stack. Compression at the source can be the most effective, as it reduces both network transmission cost and storage footprint. However, resource limitations of source devices (CPU, battery life, scratch space) can limit the cost reductions of compression. In the networking domain, relevant techniques include tradeoffs between transmitting compressed vs. uncompressed data [18], dynamically choosing which copy to send [19], etc. However, optimizing for transmission footprint helps only when data is actually transferred, whereas storage costs recur for as long as the data is kept. Cumulus [20] showed that for a cloud filesystem backup, changes to storage efficiency have a more substantial effect on total cost than changes in bandwidth efficiency. iCostale focuses on intelligent use of data transformation at the storage systems layer of a cloud computing environment. There has been recent interest in using an appliance to perform compression at the storage subsystem level, either in-line [21] or out-of-band [22]. iCostale differs in that it uses selective compression to reduce storage footprint, and uses access frequency to guide whether to compress or not.

The disparate interfaces imposed by different cloud operators have spurred the development of cloud API libraries [23], [24]. They aid user transparency and uniformity by providing protocol independence and common authentication mechanisms. The proxy approach to accelerating applications or hiding the complexity of cloud APIs has attracted recent interest in both the academic [25] and commercial [10] arenas. iCostale is targeted at this space, but differs from other proxy and cloud API library approaches in that it combines inter-cloud placement with transformation techniques, and (optional) user-provided access hints with dynamically tracked access frequency, to reduce end-user cloud storage cost.

III. DESIGN OVERVIEW

In a storage cloud environment, users can store their data directly on a cloud provider like Amazon S3, or they may use third-party online storage services like Dropbox, Slideshare, SmugMug, etc. The latter provide value-added services like backups, content sharing, collaboration, etc. to their users, and may store user data in other cloud storage providers (e.g. Amazon S3).

In iCostale, we implement a third-party online storage service that provides a REST- and SOAP-based API similar to Amazon S3. Instead of managing its own storage, iCostale uses storage from one of the many cloud providers. Users upload and access their contents in iCostale, which in turn stores them in the backend cloud. We assume that iCostale is hosted in a compute cloud like Amazon EC2 (Elastic Compute Cloud). The goal of iCostale is to reduce the overall cost of data storage for its users.

In order to optimize data storage performance and resource usage, application developers and system designers have used many different techniques, as discussed in Section II. Each of these techniques has its own set of advantages and disadvantages. For the purpose of reducing cost, iCostale focuses on adaptive data placement and transformation. The data transformation techniques that we discuss in this paper include different kinds of lossy and lossless compression and transcoding. Other techniques like caching, de-duplication, etc. are complementary to ours and can be applied in conjunction with iCostale.

Fig. 1. iCostale Architecture Diagram

Figure 1 shows an architectural overview of the iCostale storage service. It is interposed between clients and storage cloud providers. Users communicate with iCostale via the REST- and SOAP-based API, and iCostale in turn forwards those requests to one of the backend storage clouds after some processing. For write requests, iCostale compresses the data based on its type, user preferences, etc., and stores it in a storage cloud such that the overall cost for the content owner is reduced. For read requests, iCostale fetches the data from the storage cloud, applies the corresponding decompression or decoding algorithm, and returns the resultant data back to the user. In the case of lossless compression, the data returned to the user is an exact copy of the original data stored by the user. In the case of lossy compression, the data returned is in the same format as the original and meets the quality requirement of the user.
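
To make the client-facing contract concrete, the following minimal Java sketch shows what such a unified, S3-like key/value interface could look like. The interface and method names are illustrative assumptions on our part, not the actual iCostale API.

    // Illustrative sketch of a unified S3-like object interface; the names below
    // are our assumptions, not the actual iCostale API.
    public interface UnifiedObjectStore {
        // Store an object under a user-chosen key; iCostale transparently decides
        // the transformation and the backend storage cloud.
        void put(String userId, String key, byte[] data, String contentType);

        // Retrieve the object in its original format (bit-identical for lossless
        // transformations; same format and within the quality bound for lossy ones).
        byte[] get(String userId, String key);
    }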

IV. COST OPTIMIZATION

For cost reduction purposes, several factors need to be carefully considered. The first is the overhead due to the data transformation technique. On the one hand, compression reduces data size (in most cases), which reduces capacity and bandwidth usage; on the other hand, it consumes CPU cycles, which increases the compute cost. iCostale selects the compression algorithm that reduces the total sum of compute, storage and I/O cost. The second factor is the data access pattern. For infrequently accessed data, the compute cost is low because of the reduced number of transformations; for frequently accessed data, the transformation cost may become quite significant compared to the storage cost. The third factor relates to the data type. Some types of data are more compression friendly than others, and depending on the data type, the effectiveness of compression algorithms may vary. Fourth, different storage cloud providers have different cost models that may range from flat monthly pricing to complex tiered and itemized pricing. Finally, in addition to reducing cost, different users may have additional requirements in terms of performance, availability and resiliency. In the next few sub-sections, we discuss each of these factors in greater detail and show how we address them in iCostale.

A. Choice of Data Transformation Techniques

There is a wide variety of compression techniques with varying characteristics. Different compression algorithms are optimized for different criteria like compression ratio, compression speed, decompression speed, memory consumption, etc. Also, their performance may vary from one dataset to another. For example, deflate-based (e.g. gzip) compression methods are optimized for both speed and compression ratio, while algorithms like PAQ [26] focus primarily on compression ratio.

Compression techniques can be broadly grouped into two categories: lossy and lossless. Lossy compression is generally applied to multimedia datasets like images, audio and video, where quality can be compromised for a higher compression ratio. A common theme across all compression techniques is that there is an inherent tradeoff between compression ratio, compression/decompression speed, quality, memory consumption, etc. An aggressive compression method may lower bandwidth and storage capacity usage, but it may consume more CPU cycles. This results in lower storage cost and higher compute cost. Further, it may not meet the response time and quality requirements of the user.

Fig. 2. Storage and Compute Cost Comparison

Figure 2 shows the tradeoff between annual storage cost and compute cost for storing about 80 GB of text data (100K text books of average size 0.8 MB each). Pricing is based on Amazon's S3 and EC2. Storage cost dominates in all the cases, but the compute overhead in some cases offsets the savings due to compression. Nanozip [27], for example, achieves the maximum compression in this analysis, but its compute cost drives the total cost higher than that of other compression algorithms. To reduce the total cost, iCostale monitors the performance and savings of different compression techniques for different types of data, and applies the one that meets the user requirements (response time, quality, etc.) so as to achieve maximum savings.

B. Data Access Pattern

Different workloads have different access patterns. Data generated by workloads like backups, archival, logs, etc. is seldom accessed. On the other hand, a popular web document or media content may be frequently accessed. For every read access, iCostale fetches the compressed data from the storage cloud, decompresses or decodes it to its original format, and sends it back to the user. The decompression overhead and the associated compute cost add up for frequently accessed data objects and may offset the savings due to compression.

Fig. 3. Total cost (excluding transfer cost) with different compression methods (X-axis: read requests in millions; Y-axis: cost in $; series: NONE, GZIP, NANOZIP, BMF, JPEG2000)

Figure 3 shows the cost of hosting 100 GB of BMP image data (100,000 images of roughly 1 MB each) in an Amazon S3-like environment. We applied different compression techniques and computed the cost for a varying number of read accesses. The X-axis represents the total number of read requests across all the images. The cost (on the Y-axis) includes the storage capacity cost and the compute cost for decompression (based on Amazon EC2 pricing). Transfer bandwidth cost is not included, in order to isolate the effect of the compute overhead at different I/O access levels. As the figure shows, the cost of storing uncompressed data remains the same as the number of accesses increases. For the compression algorithms, the cost is initially lower because of reduced space usage, but with a larger number of accesses, the cost of decompression goes up and offsets the savings due to the reduction in space usage. For jpeg2000 and nanozip, the cost of transforming the data overtakes the cost of simply storing the uncompressed data at around 1M and 10M IOs respectively; for gzip and bmf, the crossover happens at around 100M IOs. This graph shows the importance of the access pattern in determining the appropriate compression method. For very hot data, it is more cost-effective to store data in its original form; for other data, compression is beneficial. iCostale adapts the compression method based on how frequently an object is accessed.
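
As a back-of-the-envelope illustration (our own reading of this tradeoff, not a formula from the paper), using the cost-model symbols defined in Section IV-C below, the monthly decompression compute cost nr_d · td_{a,d} · rc · s equals the monthly storage savings (s − sc) · rs at roughly

    nr_d ≈ [(s − sc) · rs] / [td_{a,d} · rc · s]

reads per month (transfer costs ignored); beyond approximately that access rate, storing the object uncompressed becomes cheaper than compressing it with algorithm 'a'.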

C. Cost Model

Storage cloud providers are coming up with unique sets of features and pricing models to differentiate themselves and attract customers. For selecting appropriate providers and for overall cost optimization, iCostale models the cost that a user would pay for storing data in the cloud.

Fig. 4. iCostale Cost Components (client ↔ iCostale: transfer cost rt1 per GB for uncompressed data of size s; iCostale ↔ storage cloud: transfer cost rt2 per GB and cost rr per request for compressed data of size sc; compute cloud: compute cost rc per hour; storage cloud: storage cost rs per GB)

Figure 4 shows a high-level view of iCostale along with some of the cost components. In this model, iCostale is deployed in a compute cloud (like Amazon EC2) and stores compressed data in a storage cloud (like Amazon S3, Google storage cloud, etc.). iCostale charges back the cost of using the compute cloud and the storage; the charge is proportional to the resources used on behalf of the user. Table I shows the parameters in our cost model.

TABLE I

ICOSTALE COST MODEL PARAMETERS

Symbol      Description
s           Size of original data (in GB)
sc          Size of compressed data (in GB)
CR          Compression ratio (sc/s)
rt1         Cost ($/GB) to transfer data between client and iCostale service
rt2         Cost ($/GB) to transfer data between iCostale service and the storage cloud
rr          Cost per request (get, put, etc.) from iCostale service to the storage cloud
rc          Compute cost / hour
rs          Storage cost / GB / month
tc_{a,d}    Average time taken to compress 1 GB of data 'd' with algorithm 'a'
td_{a,d}    Average time taken to decompress 1 GB of data 'd' with algorithm 'a'
C_CPU       Compute cost
C_IO        IO cost
C_S         Storage cost
C_T         Total cost
nr_d        No. of read requests/month for data 'd'
nw_d        No. of write requests/month for data 'd'
n_d         Total no. of requests/month for data 'd' (nr_d + nw_d)

From Table I, we can define the total cost for a particular data object 'd' as follows:

C_CPU = nw_d · tc_{a,d} · rc · s + nr_d · td_{a,d} · rc · s
      = [nw_d · tc_{a,d} + nr_d · td_{a,d}] · rc · s
C_IO  = [s · rt1 + sc · rt2 + rr] · n_d
C_S   = sc · rs
C_T   = C_CPU + C_IO + C_S

The above cost model is a simplified representation of the pricing model of many popular IaaS (Infrastructure as a Service) providers. It can be adapted very easily to accommodate the tiered and other pricing models of the providers.
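
The cost model translates directly into code. The following minimal Java sketch mirrors the equations above under the stated simplifying assumptions (per-object monthly costs, flat rather than tiered prices); the class and field names are ours, not taken from the iCostale implementation.

    // Minimal sketch of the cost model above, assuming flat (non-tiered) prices.
    // Class and field names are illustrative, not the actual iCostale code.
    final class CostModel {
        final double rt1, rt2, rr; // $/GB client<->iCostale, $/GB iCostale<->cloud, $/request
        final double rc, rs;       // $/hour compute, $/GB/month storage

        CostModel(double rt1, double rt2, double rr, double rc, double rs) {
            this.rt1 = rt1; this.rt2 = rt2; this.rr = rr; this.rc = rc; this.rs = rs;
        }

        // s, sc in GB; tc, td in hours per GB; nr, nw are requests per month.
        double computeCost(double s, double tc, double td, double nr, double nw) {
            return (nw * tc + nr * td) * rc * s;          // C_CPU
        }
        double ioCost(double s, double sc, double nr, double nw) {
            return (s * rt1 + sc * rt2 + rr) * (nr + nw); // C_IO
        }
        double storageCost(double sc) {
            return sc * rs;                               // C_S (per month)
        }
        double totalCost(double s, double sc, double tc, double td, double nr, double nw) {
            return computeCost(s, tc, td, nr, nw) + ioCost(s, sc, nr, nw) + storageCost(sc);
        }
    }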

D. User Profile

Users can specify a set of requirements that are later used by iCostale to decide the placement and compression of their data. For each type of data (e.g. documents, text files, source code, executables, images, audio, video, etc.), a user can specify the following attributes: response time, quality (for lossy compression), preferred storage cloud provider, required availability of the storage provider, and expected number of monthly accesses. The last attribute is just an initial estimate for the iCostale algorithm; the actual runtime statistics are maintained by the iCostale optimization algorithm, which is discussed in the next section.
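
As an illustration, such a per-data-type profile could be captured by a simple record like the sketch below; the field names and types are our assumptions, not the actual iCostale schema.

    // Hypothetical per-data-type user profile mirroring the attributes listed above.
    record DataTypeProfile(
        String dataType,              // e.g. "image", "text", "audio"
        long maxResponseTimeMs,       // response-time requirement
        double minQuality,            // quality bound for lossy compression
        String preferredProvider,     // preferred storage cloud, if any
        double minAvailability,       // required provider availability, e.g. 0.999
        long expectedMonthlyAccesses  // initial estimate, later superseded by observed stats
    ) {}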

E. iCostale Optimization Algorithm

Fig. 5. Sample compression options for a BMP file

For every write request, the iCostale optimization algorithm determines the following three items:

  • Where to place data
  • What compression algorithm to apply
  • Level of compression for the above algorithm

This is a hard problem to solve, as there are many compression algorithms and they may accept many fine-tuning parameters. Figure 5 shows the transformation possibilities for an image file. As we can see, for a single BMP file there are many different options as to what transformation to apply, be it lossless or lossy; for each lossy/lossless type, we can choose from a wide range of algorithms; and for each algorithm, we must choose the right set of flags in order to obtain the desired size reduction while keeping the computational overhead low. A brute-force way to achieve optimum compression is to apply all possible compression algorithms with different parameters. This may achieve the best compression result, but may impose a lot of CPU overhead and/or miss response time requirements.

Algorithm 1 iCostale algorithm

for each data write request d of size s do
    f ← type of data d
    U ← profile of user associated with the request
    P ← list of storage cloud providers that satisfy the user's availability requirements
    K ← set of compression techniques applicable to data of type/format 'f' from the knowledge base
    for each provider p ∈ P do
        from the pricing published by provider p, determine rt2, rr, rs
        for each compression method k ∈ K do
            if the combination of provider p's latency, tc_{k,d}, and td_{k,d} does not meet the response time requirements then
                continue
            end if
            compute C_T
            keep track of the minimum C_T and the associated compression method and placement in Cmin
        end for
    end for
end for

In order to implement a more practical solution, we build a knowledge base that contains the performance characteristics of different compression techniques for different data types and formats. We ran thousands of compression experiments against a large dataset consisting of files of different types and formats, and recorded the results in our knowledge base. For example, we computed the average compression ratio, average compression time, average decompression time, etc. for all *.txt files in our dataset for different compression techniques. The timings were normalized to a file size of 1 GB. For simplicity, we assumed that for the same file type and format, compression and decompression costs are proportional to the original data size. This knowledge base is the key input to the iCostale optimization algorithm, which is given in Algorithm 1.
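
To make the search loop concrete, here is a compact Java sketch of Algorithm 1 that reuses the CostModel class sketched in Section IV-C. The helper records and the response-time check are our own illustrative assumptions (the per-request latency estimate in particular is a rough stand-in), not iCostale's actual classes.

    // Sketch of Algorithm 1 in Java; helper records and the response-time check
    // are illustrative assumptions, not the actual iCostale implementation.
    import java.util.List;

    final class PlacementOptimizer {
        record Stats(double compressionRatio, double tcHoursPerGB, double tdHoursPerGB) {}
        record Method(String name, Stats stats) {}      // knowledge-base entry for this data format
        record Provider(String name, CostModel pricing, boolean meetsAvailability, double latencyMs) {}
        record Choice(Provider provider, Method method, double totalCost) {}

        // Returns the minimum-cost (provider, compression) pair that meets the
        // response-time requirement, or null if none qualifies (store as-is then).
        static Choice choose(double sizeGB, double readsPerMonth, double writesPerMonth,
                             double maxResponseMs, List<Provider> providers, List<Method> methods) {
            Choice best = null;
            for (Provider p : providers) {
                if (!p.meetsAvailability()) continue;
                for (Method m : methods) {
                    double transformMs = (m.stats().tcHoursPerGB() + m.stats().tdHoursPerGB())
                                         * sizeGB * 3_600_000.0;   // crude per-request overhead estimate
                    if (p.latencyMs() + transformMs > maxResponseMs) continue;
                    double sc = sizeGB * m.stats().compressionRatio();
                    double ct = p.pricing().totalCost(sizeGB, sc, m.stats().tcHoursPerGB(),
                                                      m.stats().tdHoursPerGB(),
                                                      readsPerMonth, writesPerMonth);
                    if (best == null || ct < best.totalCost()) best = new Choice(p, m, ct);
                }
            }
            return best;
        }
    }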

The algorithm assumes that the type (e.g. image, text, audio, etc.) and the format (e.g. .bmp, .txt, .doc, .mp3, etc.) of the data can be determined from the metadata (e.g. the content-type attribute, filename, etc.) contained in the write request. After the algorithm finishes, Cmin contains the placement and compression information that needs to be applied to the request 'd'. iCostale compresses the data object 'd' and stores it in the storage cloud specified by Cmin. If no suitable compression method or placement is found, iCostale stores the unmodified data in a default provider. It also records a location mapping (key, location), along with a few other attributes (compression method used, reference to the user profile, etc.), in its database for future read requests. The 'key' is the same as the one specified in the user's write request.

Read requests are simpler to handle. iCostale looks up the location information in its database and fetches the data from the storage cloud provider. It decompresses the data to its original form (if required) and sends it back to the user. It also records and updates a few metadata attributes associated with the object, such as last access time and access frequency. iCostale stores the location mapping, user profiles and other metadata in a shared database like Amazon SimpleDB.
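
The location mapping could be as simple as the record sketched below; the field names are our assumptions about what such an entry might contain, not the actual iCostale/SimpleDB schema.

    // Illustrative location-mapping entry, written on puts and consulted on gets.
    record LocationEntry(
        String userKey,          // key supplied in the user's write request
        String provider,         // backend storage cloud holding the object
        String objectLocation,   // provider-specific object identifier
        String compressionUsed,  // compression method applied (or "none")
        String userProfileRef,   // reference to the owning user's profile
        long lastAccessMillis,   // updated on every read
        long accessCount         // running access-frequency counter
    ) {}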

The exact type and number of compute cloud instances needed for hosting iCostale depend on the I/O traffic and the compute load. In the experimental section, we show how changing the number of cores impacts the performance of iCostale.

V. EXPERIMENTAL RESULTS

This section discusses the experimental evaluation of iCostale and shows its effectiveness compared to non-adaptive techniques.

A. Experimental Setup

iCostale has been implemented in Java and runs as a storage service. It provides SOAP and REST APIs similar to Amazon S3. In order to mimic a storage cloud provider, we implemented a server that provides a basic key-value object store interface and stores data on its local disks. iCostale sits between the clients and the object store servers. All experiments were run on machines with two quad-core processors and 4 GB of memory, running the 2.6.34 Linux kernel. The required compression libraries are installed on the iCostale node.

B. Compression Knowledge base

The goal of the first set of experiments was to compute performance statistics of different compression algorithms for a wide variety of datasets and record them in the compression knowledge base discussed in the previous section. The httperf [28] client was used to generate the workload.

Figure 6 shows the results for some types of data with a subset of compression techniques (and their associated flags). For each transformation we computed the compression ratio (CR), the compression time (CT) and the decompression time (DT). We define CR as the ratio between the size of the compressed data and the uncompressed data. Aside from the standard gzip (g-9 in the graphs), bzip2 (b-9) and lzma [29] (lz-9) compressors, we included several lesser-known but very powerful compressors such as the PAQ8 series, which have been shown to out-compress many popular techniques at the cost of processing time and memory consumption. In general, the LZ-based compressors are the fastest, but yield a poor CR. In the case of lossless multimedia data such as BMPs, BMF achieved the overall best result, both in CR as well as compression/decompression time. In some cases, lossless compression resulted in a CR greater than 1.

Fig. 6. Compression knowledge base for lossless techniques (panels: ASCII Books, Word Docs, BMP files, Web Data, PDF files, JPEG files; left Y-axis shows the compression ratio, right Y-axis shows the compression/decompression time in seconds, X-axis shows the compression methods)

Fig. 7. Compression knowledge base for lossy techniques (left Y-axis shows the compression ratio, X-axis shows the quality parameter, right Y-axis shows: 1) the encoding/decoding time in ms, 2) PSNR, 3) RMSE)

Figure 7 shows the compression ratios (CR), encoding (ET) and decoding (DT) times, as well as two different quality metrics, peak signal to noise ratio (PSNR) and root mean square error (RMSE), for the JPEG and JPEG2000 lossy encoders. As expected, higher quality resulted in lower compression, higher PSNR and lower RMSE. Unlike the lossless compressors, these two lossy algorithms saw less variation in their encoding/decoding time as a function of the quality, since in most cases the time to build the internal models is about the same; the only difference is how the data is partitioned. For instance, in JPEG2000, quality is affected by the number of quantization steps (Δ) used by the encoder to divide the coefficients. Users can specify their tolerance for quality deterioration for their multimedia data, and iCostale uses that to make run-time transformation decisions.

C. iCostale without Adaptive Placement

In order to evaluate the effectiveness of iCostale's adaptive compression, we performed a series of experiments and computed the cost of storing different types of data under the following scenarios: (i) no compression, (ii) compression with gzip, (iii) adaptive compression with iCostale, and (iv) adaptive compression with an oracle algorithm. Since iCostale's optimization is based on the average performance statistics stored in its knowledge base, it is possible that the best compression method from the knowledge base's perspective may not be the best for the actual data. We therefore developed an oracle version of iCostale that has prior knowledge of the best compression method for all objects in the data set. This helped us determine the effectiveness of using average performance statistics from our knowledge base.

Fig. 8. Percentage cost improvement for different access patterns (Y-axis: % cost reduction relative to no compression, X-axis: number of read requests)

Figure 8 shows the results with six types of data. We varied the number of read requests and computed the percentage annual cost savings compared to no compression. In each experiment, we assumed a total of 100K data objects. The cost is based on the pricing models of Amazon S3 and EC2. For fairness with respect to gzip, iCostale and the oracle did not perform lossy compression. For gzip compression, the storage cost saving remains the same as the number of read requests goes up, but the compute overhead increases; this results in decreasing cost savings with gzip. For jpeg and audio data, gzip barely achieved any compression, so as the number of requests went up, it resulted in negative cost savings for those data types. iCostale was able to achieve much greater savings: on average, the savings were 40% more than with vanilla gzip compression. In some cases it was able to achieve greater than 60% savings, but with an increasing number of IOs, the transfer and compute costs offset the savings due to iCostale's compression. The oracle algorithm was only slightly better than iCostale, because in most cases the compression method chosen by the oracle was the same as that chosen by iCostale. This shows that using a knowledge base is a viable approach for implementing adaptive compression.

Table II shows iCostale's performance for a mixed dataset consisting of 80% reads and 20% writes. As the number of CPU cores is increased, the throughput of iCostale increases almost linearly. Note that these numbers are for an unoptimized implementation of iCostale; they can be further improved with caching and performance tuning.

TABLE II
ICOSTALE PERFORMANCE

No. of CPU Cores      1     2     4      8
Throughput (Mbps)   195   405   817   1609

D. iCostale with Adaptive Placement

The next series of experiments demonstrates the effectiveness of combined adaptive placement and compression. We performed these experiments with a mixed dataset consisting of 100K objects of total size 234 GB, and compared the cost of storing them on eight storage cloud providers, including Amazon S3 [1], Nirvanix [5], Windows Azure [3], Diomede [30], Google [2], and TopHostingCenter [31]. For each cloud offering, we chose the cheapest of the different cost models available. Figure 9 shows the normalized cost of storing the data in the different clouds relative to Amazon S3; Amazon S3 is represented by the horizontal line (at 1) in the three graphs. The first graph in the figure compares just the cost of storing uncompressed data and shows how the cost changes with different numbers of I/Os. Google and Azure have similar pricing. Diomede was the most inexpensive of all the cloud offerings in terms of storage cost; for larger numbers of I/Os, TopHostingCenter had a lower total cost. From the graph, it is clear that no single provider's total cost is consistently the lowest. iCostale takes advantage of this fact to determine the appropriate placement for its workload. In a real-world scenario, price is not the only criterion for selecting a cloud provider; in this experiment, we focus only on the cost aspects of provider selection.

The second graph in the figure shows the cost of storing data in the different clouds with iCostale's adaptive compression. The line labeled 'iCostale' represents combined adaptive compression and placement. In this experiment, we assume that iCostale is hosted in Amazon EC2 and use EC2 pricing to determine the compute cost. As expected, the EC2/S3 combination is the cheapest solution, since pushing data onto the other cloud providers incurs a transfer cost to/from EC2 (at both ends), whereas transfers between EC2 and S3 are free (assuming they are in the same zone). If iCostale were hosted on a cloud owned by a different provider, the result would be different, as shown in the third graph, where the 'iCostale' adaptive placement changed the provider for different levels of IO: for a low IO load, the cloud providers with the smaller storage cost are preferred, and as the IO load increases, the storage providers with the lower transfer cost are preferred.

Fig. 9. Storage cost with different placements (Y-axis shows the cost in dollars; X-axis, same for all graphs, shows the IO load (K = thousand, M = million))

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we presented iCostale, an intelligent intermediary between consumers and providers of cloud storage services. iCostale combines well-known data compression techniques, a knowledge base of compression algorithms and cloud pricing schemes, and a history of access patterns to reduce the end-user cost of using cloud storage. We do this in a manner that is transparent to client applications, and provide the added benefit of avoiding storage cloud provider lock-in. The results of benchmarking several known compression algorithms were presented for different data types, for both lossy and lossless data compression. We showed that the choice of transformation algorithm and its parameter values depends on the data type and access pattern. We presented the results of using iCostale in both non-adaptive and adaptive placement scenarios. For non-adaptive placement (i.e. restricted to a given cloud), iCostale outperformed the no-transformation and fixed lossless (gzip) transformation approaches. For multimedia data, greater savings were achieved when the user allowed for quality deterioration. In either case, iCostale analyzes the complex universe of data types, reference patterns and the tradeoffs in substituting storage cost with computation cost, to dynamically find the sweet spot for a given workload. iCostale was able to save more than 50% of the end-user cost for many data types. An 'oracle' algorithm with a priori knowledge of the best compression method was able to do only a few percent better than iCostale. We also showed how iCostale can leverage knowledge of the pricing schemes of multiple cloud providers to adaptively place user data so as to reduce user cost.

There are a few things that we would like to address as part of future work. First, the implications of an iCostale service failure were not addressed in this paper. The key component that would render user data inaccessible in a failure scenario is the iCostale location database. Since this mapping can be stored in an online database like Amazon SimpleDB, it should be easy to implement a client program that users can run to access their data directly. Second, we want to experiment with more sophisticated ways to migrate data between clouds that minimize cost and provide better availability and performance. Finally, we used compression as the primary technique to reduce data storage and transfer cost, but we did not address caching at the iCostale layer to store frequently accessed data. Caching hot data can avoid repeated accesses to the remote cloud and reduce the need for unnecessary transformations. These may further reduce overall costs.

REFERENCES

[1] Amazon, "Web Services." [Online]. Available: http://aws.amazon.com
[2] Google, "AppEngine." [Online]. Available: http://code.google.com/appengine
[3] Microsoft, "Windows Azure." [Online]. Available: www.microsoft.com/windowsazure
[4] Rackspace, "Cloud Service." [Online]. Available: http://www.rackspacecloud.com
[5] Nirvanix, "Cloud Service." [Online]. Available: http://www.nirvanix.com/products-services/index.aspx
[6] L. M. Vaquero et al., "A break in the clouds: towards a cloud definition," SIGCOMM Comput. Commun. Rev., vol. 39, pp. 50–55, 2008.
[7] M. Armbrust, A. Fox, et al., "Above the clouds: A Berkeley view of cloud computing," UC Berkeley Technical Report EECS-2009-28, 2009.
[8] D. Borthakur, "The Hadoop Distributed File System: Architecture and Design." [Online]. Available: http://hadoop.apache.org/common/docs/r0.18.0/hdfs_design.pdf
[9] M. R. Palankar et al., "Amazon S3 for science grids: a viable solution?" in Workshop on Data-Aware Distributed Computing, 2008.
[10] G. Orenstein, "Show me the gateway – taking storage to the cloud," June 2010.
[11] Amazon Case Studies. [Online]. Available: http://aws.amazon.com/solutions/case-studies
[12] Google, "App Engine Developer Profiles." [Online]. Available: http://code.google.com/appengine/casestudies.html
[13] "Azure Case Studies." [Online]. Available: http://www.microsoft.com/azure/casestudies.mspx
[14] S. Garfinkel, "An evaluation of Amazon's grid computing services: EC2, S3 and SQS," Technical Report TR-08-07, Harvard University, 2007.
[15] A. Li et al., "CloudCmp: shopping for a cloud made easy," in Proc. of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010.
[16] H. Wang et al., "Distributed systems meet economics: pricing in the cloud," in USENIX Conference on Hot Topics in Cloud Computing, 2010.
[17] I. F. Adams et al., "Maximizing efficiency by trading storage for computation," in Conference on Hot Topics in Cloud Computing, 2009.
[18] C. Pu and L. Singaravelu, "Fine-grain adaptive compression in dynamically variable networks," in ICDCS, 2005, pp. 685–694.
[19] S. Sucu and C. Krintz, "ACE: A resource-aware adaptive compression environment," in Proc. of the ITCC, 2003.
[20] M. Vrable, S. Savage, and G. M. Voelker, "Cumulus: Filesystem backup to the cloud," Trans. Storage, vol. 5, pp. 14:1–14:28, December 2009.
[21] IBM, "Storwize Technology Overview." [Online]. Available: http://www.storwize.com/Products Technology.asp
[22] Ocarina, "Ocarina Networks." [Online]. Available: http://www.ocarinanetworks.com/technology/technology-homepage-menu
[23] "Cloudloop Wiki." [Online]. Available: http://wiki.java.net/bin/view/Projects/CloudloopWiki
[24] "Multi-Cloud Data Access." [Online]. Available: http://code.google.com/p/smestorage/
[25] R. S. et al., "Using Proxies to Accelerate Cloud Applications," in USENIX HotCloud, 2009.
[26] M. Mahoney, "Florida Institute of Technology Technical Report CS-2005-16." [Online]. Available: http://mattmahoney.net/dc/#paq
[27] NanoZip, "NanoZip 0.08 alpha." [Online]. Available: http://code.google.com/p/smestorage/
[28] D. Mosberger and T. Jin, "httperf: A tool for measuring web server performance," SIGMETRICS Perform. Eval. Rev., vol. 26, 1998.
[29] LZMA, "SDK." [Online]. Available: http://www.7-zip.org/sdk.html
[30] Diomede, "Diomede Storage." [Online]. Available: http://www.diomedestorage.com/
[31] "Cloud Storage." [Online]. Available: http://www.tophostingcenter.com/cloudstorage.htm
