
arXiv:1403.5007v1 [cs.NI] 20 Mar 2014

On Throughput-Delay Optimal Access to Storage Clouds via Load Adaptive Coding and Chunking

Guanfeng Liang, Member, IEEE, and Ulaş C. Kozat, Senior Member, IEEE

Abstract—Recent literature, including our past work, provides analysis and solutions for using (i) erasure coding, (ii) parallelism, or (iii) variable slicing/chunking (i.e., dividing an object of a specific size into a variable number of smaller chunks) to speed up the I/O performance of storage clouds. However, a comprehensive approach that considers all three dimensions together to achieve the best throughput-delay trade-off curve has been lacking. This paper presents the first set of solutions that can pick the best combination of coding rate and object chunking/slicing options as the load dynamically changes. Our specific contributions are as follows: (1) We establish via measurements that combining variable coding rate and chunking is mostly feasible over a popular public cloud. (2) We relate the delay optimal values for chunking level and code rate to the queue backlogs via an approximate queuing analysis. (3) Based on this analysis, we propose TOFEC, which adapts the chunking level and coding rate against the queue backlogs. Our trace-driven simulation results show that TOFEC's adaptation mechanism converges to an appropriate code that provides the optimal throughput-delay trade-off without reducing system capacity. Compared to a non-adaptive strategy optimized for throughput, TOFEC delivers 2.5× lower latency under light workloads; compared to a non-adaptive strategy optimized for latency, TOFEC can scale to support over 3× as many requests. (4) We propose a simpler greedy solution that performs on a par with TOFEC in average delay, but exhibits significantly larger performance variations.

Index Terms—FEC, Cloud storage, Queueing, Delay

I. INTRODUCTION

Cloud storage has gained wide adoption as an economical, scalable, and reliable means of providing the data storage tier for applications and services. Typical cloud storage systems are implemented as key-value stores in which data objects are stored and retrieved via their unique keys. To provide a high degree of availability, scalability, and data durability, each object is replicated several times within the internal distributed file system and sometimes also further protected by erasure codes to use the storage capacity more efficiently while attaining very high durability guarantees [1].

Cloud storage providers usually implement a variety of optimization mechanisms, such as load balancing and caching/prefetching, internally to improve performance. Despite all such efforts, evaluations of large-scale systems still indicate a high degree of randomness in delay performance [2]. Thus, services that require more robust and predictable Quality of Service (QoS) must deploy their own external solutions, such as sending multiple/redundant requests (in parallel or sequentially), chunking large objects into smaller ones and reading/writing each chunk through parallel connections, replicating the same object under multiple distinct keys in a coded or uncoded fashion, etc.

G. Liang and U.C. Kozat are with DOCOMO Innovations Inc., Palo Alto, California, USA. G. Liang is the contact author. E-mail: [email protected]

Fig. 1. Mean total delay (msec) vs. arrival rate (req/sec) for downloading 3MB files using fixed MDS codes (1,1), (2,1), (2,2), (4,2), (3,3), (6,3)


In this paper, we present black box solutions¹ that provide much better throughput-delay performance for reading and writing files on cloud storage by utilizing (i) parallelism, (ii) erasure coding, and (iii) chunking. To the best of our knowledge, our work is the first that adaptively picks the best erasure coding rate and chunk size to minimize the expected latency without sacrificing the supportable rate region (i.e., maximum requests per second) of the storage tier. The presented solutions can be deployed on a proxy tier external to the cloud storage tier, or can be utilized internally by the cloud provider to improve the performance of its storage services for all tenants or for a subset of tenants with higher priority.

A. State of the Art

Among the vast body of research on improving cloud storage systems' delay performance that has emerged in the past few years, two groups of work in particular are closely related to the work presented in this paper:

Erasure Coding with Redundant Requests: As proposed by the authors of [3], [4], [5], files (or objects) are divided into a pre-determined number k of chunks, each of which is 1/k the size of the original file, and encoded into n > k "coded chunks" using an (n, k) Maximum Distance Separable (MDS) code, or more generally a Forward Error Correction (FEC) code. Downloading/uploading of the original file is accomplished by downloading/uploading n coded chunks over parallel connections simultaneously, and the file is deemed served when the download/upload of any k coded chunks completes.

¹They use only the APIs provided by storage clouds and do not require any modification of or knowledge about the internal implementation of the storage clouds.


Such mechanisms significantly improve the delay performance under light workload. However, as shown in our previous work [3] and later reconfirmed by [5], system capacity is reduced due to the overhead of using smaller chunks and redundant requests. This phenomenon is illustrated in Fig. 1, where we plot the throughput-delay trade-offs of different MDS codes from our simulations using delay traces collected on Amazon S3. Codes with different k are grouped in different colors. Using a code with a high level of chunking and redundancy, in this case a (6, 3) code, although it delivers a 2× gain in delay at light workload, reduces system capacity to only 30% of that of the original basic strategy without chunking and redundancy, i.e., the (1, 1) code!

This problem is partially addressed in [3], where we present strategies that adjust n according to the workload level to achieve a near-optimal throughput-delay trade-off for a predetermined k. For example, if k = 3 is used, the strategies in [3] achieve the lower envelope of the red curves in Fig. 1. Yet, they still suffer from an almost 60% loss in system capacity.

Dynamic Job Sizing: It has been observed in [2], [6] that in key-value storage systems such as Amazon S3 and Microsoft's Azure Storage, throughput is dramatically higher when they receive a small number of storage access requests for large jobs (or objects) than when they receive a large number of requests for small jobs (or objects), because each storage request incurs overheads such as networking delay, protocol processing, lock acquisitions, transaction log commits, etc. The authors of [6] developed Stout, in which requests are dynamically batched to improve the throughput-delay trade-off of key-value storage systems. Based on the observed congestion, Stout increases or reduces the batching size. Thus, at high congestion, a larger batch size is used to improve throughput, while at low congestion a smaller batch size is adopted to reduce delay.

B. Main Contributions

Our work unifies the ideas of redundant requests with erasure coding and dynamic job sizing in one solution framework. Our major contributions are as follows.

• Providing dynamic job sizing while maintaining parallelism and erasure coding gains is a non-trivial undertaking. Key-value stores map an object key to one or more physical storage nodes (if replication is used). Depending on the implementation, a request for a key might always go to the same physical node or be load-balanced across all replicas. As detailed in Section III, one has the option of using a unique key for each chunk of an object or sharing the same key across chunks while assigning them different byte ranges. The former wastes significant storage capacity, whereas the latter will likely exhibit higher correlation across parallel reads/writes of distinct chunks of the same object. Nonetheless, our measurements in different regions of a popular public cloud establish that sharing the same key in fact results in reasonably weak correlations, enabling parallelism and coding gains. However, our measurements also indicate that universally good performance is not guaranteed, as one region fails to deliver this weak correlation.

• Exact analysis for computing the optimal code rate and chunking level is far from trivial. In Sections IV-A to IV-C, we relate the delay optimal values for chunking level and code rate to the queue backlogs via an approximate queuing analysis.

• Using this analysis, in Section IV-D we introduce TOFEC (Throughput Optimal FEC Cloud), which implements dynamic adjustment of chunking and redundancy levels to provide the optimal throughput-delay trade-off. In other words, TOFEC achieves the lower envelope of the curves in all colors in Fig. 1.

• The primary novelty of TOFEC is its backlog-based adaptive algorithm for dynamically adjusting the chunk size as well as the number of redundant requests issued to fulfill storage access requests. This algorithm of variable chunk sizing can be viewed as a novel integration of prior observations from the two bodies of work discussed above. Based on the observed backlog level as an indicator of the workload, TOFEC increases or decreases the chunk size, as well as the number of redundant requests. In our trace-driven evaluations, we demonstrate that: (1) TOFEC successfully adapts to the full range of workloads, delivering 3× lower average delay than the basic static strategy without chunking under light workloads, and over 3× the throughput of a static strategy with high chunking and redundancy levels optimized for service delay under heavy workloads; and (2) TOFEC provides good QoS guarantees as it delivers low delay variations.

• Although TOFEC does not need any explicit information about the internal operations of the storage cloud, it needs to log latency performance and model the cumulative distribution of the delay performance of the storage cloud. We also propose a greedy heuristic that does not need to build such a model; via trace-driven simulations we show that its average latency is on a par with that of TOFEC, but it exhibits significantly higher performance variations.

II. SYSTEM MODELS

A. Basic Architecture and Functionality

The basic system architecture captures how Internet services today utilize public or private storage clouds. The architecture consists of proxy servers in the front-end and a key-value store, referred to as storage cloud, in the back-end. Users interact with the proxy through a high-level API and/or user interfaces. The proxy translates every high-level user request (to read or write a file) into a set of n ≥ 1 tasks. Each task is essentially a basic storage access operation, such as put, get, delete, etc., that will be accomplished using the low-level APIs provided by the storage cloud. The proxy maintains a certain number of parallel connections to the storage cloud, and each task is executed over one of these connections. After a certain number of tasks complete successfully, the user request is considered accomplished and the proxy responds to the user with an acknowledgment. The solutions we present are deployed on the proxy server side, transparent to the storage cloud.


Fig. 2. System Model


For a read request, we assume that the file is pre-coded into nmax ≥ n coded chunks with an (nmax, k) MDS code and stored on the cloud. Completion of downloading any k coded chunks provides sufficient data to reconstruct the requested file. Thus, the proxy decodes the requested file from the k downloaded chunks and replies to the client. The n − k unfinished and/or not-yet-started tasks are then canceled and removed from the system.

For a write request, the file to be uploaded is divided and encoded into n coded chunks using an (n, k) MDS code, and hence completion of uploading any k coded chunks means that sufficient data has been stored on the cloud. Thus, upon receiving k successful responses from the storage cloud, the proxy sends a speculative success response to the client, without waiting for the remaining (n − k) tasks to finish. Such speculative execution is a commonly practiced optimization technique to reduce client-perceived delay in many computer systems, such as databases and replicated state machines [7]. Depending on the subsequent read profile of the same file, the proxy can (1) continue serving the remaining tasks until all n tasks finish, or (2) change them to low-priority jobs that will be served only when system utilization is low, or (3) cancel them preemptively. The proxy can even (4) run a daemon program in the background that generates all nmax coded chunks from the already uploaded chunks when the system is not busy.
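As an illustration of this write path, the sketch below issues n upload tasks and acknowledges as soon as any k succeed, then cancels tasks that have not yet started. It is a minimal sketch assuming a hypothetical upload_chunk callable and a thread pool standing in for the proxy's parallel connections; it is not our actual proxy implementation.

```python
import concurrent.futures

def speculative_write(coded_chunks, k, upload_chunk, executor):
    """Upload n = len(coded_chunks) coded chunks; return once any k
    succeed (speculative success), canceling not-yet-started tasks.
    upload_chunk is a hypothetical callable that stores one chunk."""
    futures = [executor.submit(upload_chunk, c) for c in coded_chunks]
    succeeded = 0
    for f in concurrent.futures.as_completed(futures):
        f.result()                 # propagate upload errors, if any
        succeeded += 1
        if succeeded == k:         # enough data is on the cloud
            break
    for f in futures:
        f.cancel()                 # only affects tasks not yet running
    return True
```

The remaining futures could instead be demoted to low priority or left to finish, matching options (1)-(3) above.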

Accordingly, we model the proxy by the queueing system shown in Fig. 2. There are two FIFO (first-in-first-out) queues: (i) the request queue, which buffers all incoming user requests, and (ii) the task queue, which is a multi-server queue and holds all tasks waiting to be executed. L threads², representing the set of parallel connections to the storage cloud, are attached to the task queue. The adaptation module of TOFEC monitors the state of the queues and the threads, and decides what coding parameters (n, k) are to be used for each request. Without loss of generality, we assume that the head-of-line (HoL) request leaves the request queue only when there is at least one idle thread and the task queue is empty. A batch of n tasks is then created for that request and injected into the task queue. As soon as any k tasks complete successfully, the request is considered completed. Such a queue system is work conserving, since no thread is left idle as long as there is any request or task pending.

²We avoid the term "server" that is commonly used in the queueing theory literature, to prevent confusion.

Fig. 3. Example of supporting multiple chunk sizes with the Shared Key approach: the 3MB file is divided and encoded into a coded file of 6MB consisting of 12 strips, each of 0.5MB. Downloading the file using a (2, 1) MDS code is accomplished by creating two read tasks: one for strips 1-6, and the other for strips 7-12.


B. Basics of Erasure Codes

An (n, k) MDS code (e.g., a Reed-Solomon code) encodes k data chunks, each of B bits, into a codeword consisting of n B-bit coded chunks. The coded chunks can sustain up to (n − k) erasures, such that the k original data chunks can be efficiently reconstructed from any subset of k coded chunks. n and k are called the length and dimension of the MDS code. We also define r = n/k as the redundancy ratio of an (n, k) MDS code. The erasure-resistant property of MDS codes has been utilized in prior works [3], [4], [5], as well as in this paper, to improve the delay of cloud storage systems. Essentially, a coded chunk experiencing long delay is treated as an erasure.

In this paper, we make use of another interesting property of MDS codes to implement the variable chunk sizing of TOFEC in a storage-efficient manner: an MDS code of high length and dimension for a small chunk size can be used as an MDS code of smaller length and dimension for a larger chunk size. To be more specific, consider any (N, K) MDS code for chunks of b bits. To avoid confusion, we will refer to these b-bit chunks as strips. A different MDS code of length n = N/m, dimension k = K/m and chunk size B = bm for some m > 1 can be constructed by simply batching every m data/coded strips into one data/coded chunk. The resulting code is an (n, k) MDS code for B-bit chunks because any k coded chunks cover mk = K coded strips, which is sufficient to reconstruct the original file of Bk = bm × K/m = bK bits. This property is illustrated with an example in Fig. 3. In this example, a 3MB file is divided into 6 strips of 0.5MB and encoded into 12 coded strips of total size 6MB using a (12, 6) MDS code. This code can then be used as a (2, 1) code for 3MB chunks, a (4, 2) code for 1.5MB chunks and a (6, 3) code for 1MB chunks simultaneously, by batching 6, 3 and 2 strips into a chunk.
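The strip-batching arithmetic is easy to express in code. The sketch below, written purely for illustration, computes the (n, k) code obtained by batching every m strips of an (N, K) code, and the strip ranges that make up each coded chunk; it reproduces the numbers of the Fig. 3 example.

```python
def batched_code(N, K, m):
    """Reuse an (N, K) MDS code over b-bit strips as an (N/m, K/m)
    code over (b*m)-bit chunks by batching every m strips."""
    assert N % m == 0 and K % m == 0
    return N // m, K // m

def chunk_strip_ranges(n, num_strips):
    """Strips composing each of the n coded chunks (strips numbered from 1)."""
    m = num_strips // n
    return {c: list(range(c * m + 1, (c + 1) * m + 1)) for c in range(n)}

# The (12, 6) code over 0.5MB strips from Fig. 3:
for m in (6, 3, 2):
    print(batched_code(12, 6, m))    # (2, 1), (4, 2), (6, 3)
print(chunk_strip_ranges(2, 12))     # chunk 0 = strips 1-6, chunk 1 = strips 7-12
```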

C. Definitions of Different Delays

The delay experienced by a user request consists of two components: queueing delay (Dq) and service delay (Ds). Both are defined with respect to the request queue: (i) the queueing delay is the amount of time a request spends waiting in the request queue, and (ii) the service delay is the period of time between when the request leaves the request queue (i.e., is admitted into the task queue and starts being served by at least one thread) and when it finally leaves the system (i.e., the first time when any k of the corresponding tasks complete). In addition, we also consider the task delay (Dt), which is the time it takes for a thread to serve a task, assuming it is not terminated or canceled preemptively. To clarify these definitions, consider a request served with an (n, k) MDS code, with TA being its arrival time and T1 ≤ T2 ≤ · · · ≤ Tn the starting times of the corresponding n tasks³. Then the queueing delay is Dq = T1 − TA. Suppose Dt,1, · · · , Dt,n are the corresponding task delays; then the completion times of these tasks will be X = {T1 + Dt,1, · · · , Tn + Dt,n} if none is canceled. The request leaves the system at time X(k), which denotes the k-th smallest value in X, i.e., the time when k tasks complete. The service delay of this request is then Ds = X(k) − T1.
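These definitions translate directly into a short computation; the following illustrative sketch computes Dq and Ds for one request from its arrival time, task start times, and task delays.

```python
def request_delays(TA, starts, task_delays, k):
    """Dq and Ds for a request served with an (n, k) code: starts[i]
    is T_{i+1} (sorted ascending), task_delays[i] is D_{t,i+1}, and the
    request completes at X_(k), the k-th smallest completion time,
    assuming no task is canceled early."""
    completions = sorted(T + D for T, D in zip(starts, task_delays))
    X_k = completions[k - 1]       # k-th smallest value in X
    Dq = starts[0] - TA            # T_1 - T_A
    Ds = X_k - starts[0]           # X_(k) - T_1
    return Dq, Ds

# a (3, 2) request arriving at t = 0 whose tasks start at 1, 2, 3:
print(request_delays(0.0, [1.0, 2.0, 3.0], [5.0, 1.5, 4.0], 2))  # (1.0, 5.0)
```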

III. VARIABLE CHUNK SIZING

In this section, we discuss implementation issues as well as the pros and cons of two potential approaches, namely Unique Key and Shared Key, for supporting erasure-code-based access to files on the storage cloud with a variety of chunk sizes. Suppose the maximum desired redundancy ratio is r; then these approaches implement variable chunk sizing as follows:

• Unique Key: For every choice of chunk size (or equivalently k), a separate batch of rk coded chunks is created, and each coded chunk is stored as an individual object with its own unique key on the storage cloud. Access to different chunks is implemented through the basic get and put storage cloud APIs.

• Shared Key: A coded file is first obtained by stacking together the coded strips obtained by applying a high-dimension (N = rK, K) MDS code to the original file, as described in Section II-B and illustrated in Fig. 3. For reads, the coded file is stored on the cloud as one object. Access to chunks of variable size is realized by downloading segments of the coded file corresponding to batches of a corresponding number of strips, using the same key with more advanced "partial read" storage cloud APIs. Similarly, for writes, the file is uploaded in parts using "partial write" APIs and later merged into one object in the cloud.

A. Implementation and Comparison of the two Approaches

1) Storage cost: When the user request is to write a file, the storage costs of Unique Key and Shared Key are not very different. However, to support variable chunk sizing for read requests, Shared Key is significantly more cost-efficient than Unique Key. With Shared Key, a single coded file stored on the cloud can be reused to support essentially an arbitrary number of different chunk sizes, as long as the strip size is small enough. On the other hand, it seems impossible to achieve similar reuse with the Unique Key approach, where different chunks of the same file are treated as individual objects. So with Unique Key, every additional chunk size to be supported requires an extra storage cost of r× the file size. Such linear growth of the storage cost easily makes it prohibitively expensive to support even a small number of chunk sizes.

³We assume Ti = ∞ if the i-th task is never started.


2) Diversity in delays: The success of TOFEC and other proposals that use redundant requests (either with erasure coding or replication) for delay improvement relies on diversity in cloud storage access delays. In particular, TOFEC, as well as [3], [4], [5], requires access delays for different chunks of the same file to be weakly correlated.

With Unique Key, since different chunks are treated as individual objects, there is no inherent connection among them from the storage cloud system's perspective. So, depending on the internal implementation of the object placement policy of the storage cloud system, chunks of a file can be stored in different storage units (disks or servers) on the same rack, in different racks in the same data center, or even in different data centers at distant geographical locations. Hence it is quite likely that delays for accessing different chunks of the same file show very weak correlation.

On the other hand, with Shared Key, since coded chunks are combined into one coded file and stored as one object in the cloud, it is very likely that the whole coded file, and hence all coded chunks/strips, is stored in the same storage unit, unless the storage cloud system internally divides the coded file into pieces and distributes them to different units. Although many distributed storage systems do divide files into parts and store them separately, this is normally done only for larger files. For example, the popular Hadoop distributed file system by default does not divide files smaller than 64MB. When different chunks are stored on the same storage unit, we can expect higher correlation in their access delays. It then remains to be verified that the correlation between different chunks with the Shared Key approach is still weak enough for our coding solution to be beneficial.

3) Universal support: Unique Key is the approach adopted in our previous work [3] to support erasure-code-based file access with one predetermined chunk size. A benefit of Unique Key is that it only requires the basic get and put APIs that every storage cloud system must provide. So it is readily supported by all storage cloud systems and can be implemented on top of any of them.

On the other hand, Shared Key requires more advanced APIs that allow the proxy to download or upload only a targeted segment of an object. Such advanced APIs are not yet supported by all storage cloud systems. For example, to the best of our knowledge, Microsoft's Azure Storage currently provides methods only for "partial read"⁴ but none for "partial write". On the contrary, Amazon S3 provides partial access for both read and write: the proxy can download a specific inclusive byte range within an object stored on S3 by calling getObject(request, destination)⁵; and for uploading, an uploadPart method to upload segments of an object and a completeMultipartUpload method to merge the uploaded segments are provided. We expect more service providers to introduce both partial read and write APIs in the near future.

⁴E.g., DownloadRangeToStream(target, offset, length) downloads a segment of length bytes starting from the offset-th byte of the target object (or "blob" in Azure's jargon).

⁵The byte range is set by calling request.setRange(start, end).

Fig. 4. CCDF of task delays at individual threads with 1MB chunks and n = 4, measured on May 1st, 2013, for Unique Key and Shared Key: (a) Region US Standard; (b) Region US Standard at a different time; (c) Region North California.

Fig. 5. CCDF of service delay for reading 3MB files with 1MB chunks (n = 3, 4, 5, 6): (a) Unique Key; (b) Shared Key.
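For illustration, a ranged read of one coded chunk under Shared Key could look as follows using Amazon's Python SDK (boto3) rather than the Java-style getObject call described above; the bucket and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

def read_chunk(bucket, key, first_byte, last_byte):
    """Download one coded chunk of a Shared Key object via a ranged GET.
    The Range header takes an inclusive byte range, matching S3's
    partial-read semantics; bucket/key are hypothetical names."""
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={first_byte}-{last_byte}",
    )
    return resp["Body"].read()

# e.g., strips 1-6 (the first 3MB) of the 6MB coded file from Fig. 3:
# read_chunk("my-bucket", "coded/file.bin", 0, 3 * 2**20 - 1)
```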

B. Measurements on Amazon S3

To understand the trade-off between Unique Key and Shared Key, we ran measurements over Amazon EC2 and S3. An EC2 instance served as the proxy in our system model. We instantiated an extra-large EC2 instance with high I/O capability in the same availability region as the S3 bucket that stores our objects. We conducted experiments on different weekdays from May to July 2013, with various chunk sizes between 0.5MB and 3MB and up to n = 12 coded chunks per file. For each value of n, we allow L = n simultaneously active threads, with the i-th thread responsible for downloading the i-th coded chunk of each file. Each experiment lasted longer than 24 hours. We alternated between different settings to capture similar time-of-day characteristics across all settings.

The experiments were conducted within all 8 availability regions of Amazon S3. Except for the "US Standard" availability region, all other 7 regions demonstrate similar performance statistics that are consistent over different times and days. On the other hand, the performance of "US Standard" demonstrated significant variation even at different times in the same day, as illustrated in Fig. 4(a) and Fig. 4(b). We conjecture that the different and inconsistent behavior of "US Standard" might be due to the fact that it targets a slightly different usage pattern and may employ a different implementation for that reason⁶. We will exclude "US Standard" from subsequent discussions. For conciseness, we only show a limited subset of findings for availability region "North California" that are representative of the regions other than "US Standard":

(1) In both Unique Key and Shared Key, the task delay distributions observed by different threads are almost identical. The two approaches are indistinguishable even beyond the 99.9th percentile. Fig. 4(c) shows the complementary cumulative distribution function (CCDF) of task delays observed by individual threads for 1MB chunks and n = 4. Both approaches demonstrate a large delay spread in all regions.

(2) Task delays for different threads in Unique Key show close to zero correlation, while they demonstrate slightly higher correlation in Shared Key, as expected. Across all settings, the cross-correlation coefficient between different threads stays below 0.05 in Unique Key and ranges from 0.11 to 0.17 in Shared Key. Both approaches achieve significant service delay improvements. Fig. 5 plots the CCDF of service delays for downloading 3MB files with 1MB chunks (k = 3) and n = 3 to 6, assuming all n tasks in a batch start at the same time. In this setting, both approaches reduce the 99th percentile delays by roughly 50%, 65% and 80% by downloading 1, 2 and 3 extra coded chunks, respectively. Although Shared Key demonstrates up to 3 times higher cross-correlation coefficients, there is no meaningful statistical distinction in service delay between the two approaches until beyond the 99th percentile.

⁶See http://docs.aws.amazon.com/general/latest/gr/rande.html#s3region


Fig. 6. Task delay statistics vs. chunk size (MB), with least-squares fitted lines for Unique Key and Shared Key: (a) Mean (msec); (b) Standard deviation (msec).

All availability regions experience different degrees of degradation at high percentiles with Shared Key, due to the higher correlation. Significant degradation emerges from around the 99.9th percentile and beyond in all regions except for "Sao Paulo", in which degradation appears around the 99th percentile.

(3) Task delays are always lower bounded by some constant ∆ ≥ 0 that grows roughly linearly as the chunk size increases. This constant part of the delay cannot be reduced by using more threads: see the flat segment at the beginning of the CCDF curves in Fig. 4 and Fig. 5. Since this constant portion of task delays is unavoidable, it leads to a negative effect of using larger n: there is a minimum system resource cost of n∆ (time × threads) that grows linearly in n. This cost leads to a reduced capacity region when more redundant tasks are used, as illustrated in the example of Fig. 1. We observe that the two approaches deliver almost identical total delays (queueing + service) at all arrival rates, in spite of the degraded service delay of Shared Key at very high percentiles. So we only plot the results with Shared Key in Fig. 1.

(4) Both the mean and the standard deviation of task delays grow roughly linearly as the chunk size increases. Fig. 6 plots the measured mean and standard deviation of task delays for both approaches at different chunk sizes, together with least-squares fitted lines. As the figures show, the performance of Unique Key and Shared Key is also comparable in terms of how the delay statistics scale as functions of the chunk size. Notice that the extrapolations at chunk size 0 are all greater than zero. We believe this observation reflects the costs of non-I/O-related operations in the storage cloud that do not scale proportionally to object size, for example, the cost of locating the requested object. We also believe such costs contribute partially to the minimum task delay constant ∆.
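Observation (4) motivates the least-squares linear fits overlaid in Fig. 6. A minimal sketch of such a fit with numpy follows; the data points are made-up placeholders rather than our measured values.

```python
import numpy as np

# chunk sizes (MB) and task-delay statistics (msec); illustrative
# placeholder numbers, not the paper's measurements
B    = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
mean = np.array([40.0, 62.0, 85.0, 110.0, 130.0, 155.0])
std  = np.array([8.0, 11.0, 15.0, 18.0, 21.0, 25.0])

# least-squares lines: intercepts estimate the size-independent overhead,
# slopes the per-MB growth (cf. the fitted lines in Fig. 6)
slope_m, icept_m = np.polyfit(B, mean, 1)
slope_s, icept_s = np.polyfit(B, std, 1)
print(icept_m, slope_m)   # nonzero intercept, as observed
```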

SUMMARY: Our measurement study shows that dynamic chunking while preserving weak correlation across different chunks is realizable through both Unique Key and Shared Key. We believe Shared Key is a reasonable choice for implementing dynamic chunking, given that it delivers delay performance comparable to Unique Key at a much lower cost in storage capacity. We turn our attention to how to pick the best choices of chunking and FEC rate in the remaining parts of the paper.

C. Model of Task Delays

For the analysis presented in the next section, we model the task delays as independently distributed random variables whose mean and standard deviation grow linearly as the chunk size B increases. More specifically, we assume the task delay Dt for chunk size B follows a distribution of the form

Dt(B) ∼ ∆(B) + exp(µ(B)),   (1)

where ∆(B) = ∆ + ∆̃B captures the lower bound of task delays as in observation (3), and exp(µ(B)) is an exponential random variable that models the tail of the CCDF, whose mean and standard deviation both equal 1/µ(B) = Ψ + Ψ̃B. With this model, the constants ∆ and Ψ together capture the non-zero extrapolations of the mean and standard deviation of task delays at chunk size 0; similarly, the constants ∆̃ and Ψ̃ capture the rates at which the mean and standard deviation grow as the chunk size increases, as in observation (4).
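For simulation or sanity-checking, task delays under the model of Eq. (1) can be drawn as a deterministic floor plus an exponential tail. The sketch below is a minimal illustration; the parameter values in the example are arbitrary assumptions, not the constants fitted from our traces.

```python
import numpy as np

def sample_task_delay(B, Delta, Delta_t, Psi, Psi_t, rng):
    """One draw of Dt(B) ~ Delta(B) + exp(mu(B)) per Eq. (1):
    a floor Delta + Delta_t*B plus an exponential tail with mean
    1/mu(B) = Psi + Psi_t*B. Parameters are illustrative assumptions
    (msec, and msec per MB for the *_t slopes)."""
    floor = Delta + Delta_t * B            # minimum achievable delay
    tail = rng.exponential(Psi + Psi_t * B)
    return floor + tail

rng = np.random.default_rng(0)
delays = [sample_task_delay(1.0, 30.0, 25.0, 10.0, 8.0, rng)
          for _ in range(10000)]           # 1MB chunks
```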

IV. DESIGN OF TOFEC

For the analysis in this section, we group requests into classes according to the tuple (type, size). Here type can be read or write, and can potentially be other types of operations supported by the cloud storage. Each type of operation has its own set of delay parameters ∆, ∆̃, Ψ, Ψ̃. Subscripts will be used to indicate variables associated with each class. We use ni, ki and ri to denote the code length, dimension and redundancy ratio of the code used to serve class i requests. Also let pi denote the fraction of total arrivals contributed by class i. We use vectors n, k, r and p to denote the collections of the corresponding variables for all classes.

The system throughput is defined as the average number of successfully served requests per unit time. The static code capacity Csta(p, k, r) is defined as the maximum deliverable throughput, assuming p, k, and r are fixed. The full capacity C(p) is then defined as the maximum static code capacity over all possible choices of (k, r) with p fixed.


For a given request arrival rate λ, the system throughput equals the smaller of λ and the (static or full) capacity.

A. Problem Formulation and Main Result for Static Strategy

Given total arrival rate λ and composition of requests p, we want to find the best choice of FEC code for each class such that the average expected total delay is minimized. Relaxing the requirement that ni and ki be integers, this is formulated as the following minimization problem⁷:

min_{k,r}  Dq + Σi pi Ds,i   (∗)
s.t.  ki ≥ 0, ri ≥ 1 ∀i,
      λ < Csta(p, k, r).

In the above formulation, we use k and r as the optimizing variables instead of the more intuitive choice of n and k. This choice simplifies the analysis, because k and r can be treated as independent variables, while n is subject to the constraint n ≥ k. In subsequent sections, we first introduce approximations for the expected queueing and service delays, assuming that the FEC code used to serve requests of each class is predetermined and fixed (Section IV-B). Then we show that optimal solutions to this non-convex optimization problem exhibit the following property (Section IV-C):

The optimal values of ni, ki and ri can all be expressed as functions solely determined by Q – the expected length of the request queue: ni = Ni(Q), ki = Ki(Q) and ri = Ri(Q), where Ni, Ki and Ri are all strictly decreasing functions of Q.

This finding is then used as the guideline in the design of our backlog-driven adaptive strategy TOFEC (Section IV-D).

B. Approximated Analysis of Static Strategies

Denote by Ji the file size of class i. Consider a request of class i served with an (ni, ki) MDS code, i.e., Bi = Ji/ki. First suppose all ni tasks start at the same time, i.e., T1 = Tni. In this case, given our model for task delays, it is straightforward to show that the expected service delay equals

Ds,i = ∆i(Ji/ki) + (1/µi(Ji/ki)) Σ_{j=ni−ki+1}^{ni} 1/j
     ≈ ∆i(Ji/ki) + (1/µi(Ji/ki)) ln(ni/(ni − ki))
     = ∆i + ∆̃iJi/ki + (Ψi + Ψ̃iJi/ki) ln(ri/(ri − 1)).   (2)

For the analysis, we approximate the summation Σ_{j=ni−ki+1}^{ni} 1/j by its integral upper bound ∫_{ni−ki}^{ni} (1/x) dx = ln(ni/(ni − ki)). The approximation gap is always upper bounded by the Euler-Mascheroni constant ≈ 0.577 for any ni − ki ≥ 1 and quickly diminishes to 0 as ni gets large, as illustrated in Fig. 7. Although the gap goes to ∞ as ni − ki → 0, this does not really matter for the purpose of this paper, since any optimal solution with ni close to ki only means we should set ni = ki.

⁷Notice that all classes share the same queueing delay. Also, we require ki ≥ 0 instead of ki ≥ 1 as a technicality to simplify the proof of the uniqueness of the optimal solution. We require ri ≥ 1 since ni ≥ ki. λ < Csta(p, k, r) is imposed for queue stability.

Fig. 7. Comparison of the summation term in Eq. (2) and its integral approximation, for k = 6 and code length n from 7 to 12.
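The quality of this approximation is easy to check numerically; the snippet below compares the harmonic-sum term of Eq. (2) against ln(n/(n − k)) for k = 6, mirroring Fig. 7.

```python
import math

def harmonic_tail(n, k):
    """Tail term of Eq. (2): sum_{j=n-k+1}^{n} 1/j."""
    return sum(1.0 / j for j in range(n - k + 1, n + 1))

def log_approx(n, k):
    """Integral upper bound ln(n / (n - k))."""
    return math.log(n / (n - k))

k = 6
for n in range(7, 13):  # the range plotted in Fig. 7
    gap = log_approx(n, k) - harmonic_tail(n, k)
    print(n, round(gap, 3))  # gap < Euler-Mascheroni (~0.577), shrinking in n
```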

Also define the system usage (or simply "cost") of a request as the sum of the amounts of time each of its tasks is served by a thread⁸. When all tasks start at the same time, the expected system usage is (see Section IV of [3] for a detailed derivation)

Ui = ni∆i(Ji/ki) + ki/µi(Ji/ki)
   = ∆ikiri + ∆̃iJiri + Ψiki + Ψ̃iJi.   (3)

Given that class i contributes a fraction pi of the total arrivals, the average cost per request is Ū = Σi piUi. With L simultaneously active threads, the departure rate of the system, as well as of the request queue, is L/Ū (requests/unit time). In light of this observation, we approximate the request queue with an M/M/1 queue with service rate L/Ū⁹. In other words, the static code capacity for a given p and fixed code choice (k, r) is approximated by

Csta(p, k, r) = L/Ū.   (4)

Let

λ̃ = λŪ = λ Σi pi(∆ikiri + ∆̃iJiri + Ψiki + Ψ̃iJi)   (5)

represent the arrival rate of system usage imposed by the request arrivals. Then the last inequality constraint of the optimization problem (∗) becomes

λ̃ < L.   (6)

⁸The time a task j is served is Dt,j if it completes successfully; (X(k) − Tj) if it starts but is terminated preemptively; and 0 if it is canceled while waiting in the task queue.

⁹This M/M/1 approximation is a special case of the M/G/1 approximation used in [3]: Dq = βλŪ²/(L(L − λŪ)), with β = 1. Our findings in this paper readily generalize to accommodate the M/G/1 approximation.


With the M/M/1 queue approximation, the queueing delay in the original system at total arrival rate λ is approximated by

Dq = 1/(L/Ū − λ) − 1/(L/Ū) = λŪ²/(L(L − λŪ)).   (7)
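Putting Eqs. (3), (4) and (7) together gives a small calculator for the approximate capacity and queueing delay of a static strategy. The sketch below is illustrative; every input is an assumed parameter, not a value from our traces.

```python
def expected_cost(p, k, r, J, Delta, Delta_t, Psi, Psi_t):
    """Average per-request system usage (U-bar), Eqs. (3) and (5);
    inputs are per-class lists of the fractions p_i, code parameters
    (k_i, r_i), file sizes J_i, and assumed delay constants."""
    return sum(pi * (D * ki * ri + Dt * Ji * ri + P * ki + Pt * Ji)
               for pi, ki, ri, Ji, D, Dt, P, Pt
               in zip(p, k, r, J, Delta, Delta_t, Psi, Psi_t))

def static_capacity(L, U_bar):
    """Approximate static code capacity, Eq. (4)."""
    return L / U_bar

def queueing_delay(lam, L, U_bar):
    """M/M/1 queueing-delay approximation, Eq. (7)."""
    assert lam * U_bar < L, "arrival rate exceeds approximate capacity"
    return lam * U_bar ** 2 / (L * (L - lam * U_bar))
```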

Noticing that, given p, the (approximated) static code capacity Csta(p, k, r) = L/Ū is maximized when ki = 1, ri = 1 ∀i, we approximate the full capacity as C(p) = Csta(p, 1, 1), where 1 denotes the all-one vector. We acknowledge that the above approximations are quite coarse, especially because tasks of the same batch do not in general start at the same time. However, recall that the main objective of this paper is to develop a practical solution that achieves the optimal throughput-delay trade-off. According to the simulation results, these approximations are sufficiently good for this purpose.

C. Optimal Static Strategy

Even with the above approximations, the minimization problem (∗) is not a convex optimization problem: the feasible region is not a convex set, due to the kiri terms in λ̃. In general, non-convex optimization problems are difficult to solve. Fortunately, we are able to prove the following theorem, according to which this non-convex optimization problem can be solved numerically with great efficiency.

Theorem 1: The optimal solutions to (∗) must satisfy the following equations, regardless of λ and p:

ki = Φi(ri) ≜ [∆iΓi − Ψ̃iJi + √((∆iΓi − Ψ̃iJi)² + 4Ψi∆̃iJiΓi)] / (2Ψi), ∀i,   (8)

L(Ψiki + Ψ̃iJi) / [kiri(ri − 1)(∆iki + ∆̃iJi)] = L(Ψjkj + Ψ̃jJj) / [kjrj(rj − 1)(∆jkj + ∆̃jJj)], ∀i, j,   (9)

where Γi = [Jiri(ri − 1)/(∆iri + Ψi)] (∆̃i + Ψ̃i ln(ri/(ri − 1))). Moreover, when λ and p are given, the optimal solution is the unique solution to the above equations together with

(L/(L − λ̃))² − 1 = L(Ψiki + Ψ̃iJi) / [kiri(ri − 1)(∆iki + ∆̃iJi)], ∀i.   (10)

Proof: See Appendix.

The importance of Theorem 1 is two-fold:

1) With m different classes of requests, the seemingly 2m-dimensional optimization problem is in fact one-dimensional: according to Eq. (8), the optimal ki is fully determined by the optimal ri (and vice versa). Moreover, according to Eq. (9), the optimal ri further fully determines the optimal choices of kj and rj for all other j ≠ i. In other words, knowledge of the optimal choice of any ri (or ki) is sufficient to derive the complete optimal choice of (k, r).

2) The optimal solution (ni, ki and ri) is fully determined by λ̃, hence it is virtually independent of the particular λ and p: λ and p appear in the above equations only in the form of λ̃ in Eq. (10). So for any two different pairs (λ, p) and (λ′, p′), as long as λ̃ = λ̃′, they share the same optimal choice of codes! An implication of this is that the m-class optimization problem can be solved by solving a set of m independent single-class subproblems: the i-th subproblem solves for the optimal (ki, ri) with class-i-only arrivals at rate λi such that λiUi = λ̃, because it is equivalent to the m-class problem with λ′ = λ̃/Ui and p′ such that p′i = 1 and p′j = 0 ∀j ≠ i.

The second observation above is of particular interest for the purpose of this paper. It suggests that the adaptation of different classes can be done separately, as if the only arrivals were those of the class under consideration. This significantly simplifies the design of our adaptive strategy TOFEC, resulting in great computational efficiency and flexibility.

D. Adaptive Strategy TOFEC

Despite being the mathematical foundation for the design of TOFEC, Theorem 1 in its current formulation is not very useful in practice. This is because code adaptation per Theorem 1 requires knowledge of the total workload λ and the popularity distribution p of the different classes. In practice, both quantities usually exhibit a high degree of volatility, making accurate on-the-fly estimation quite difficult and/or unreliable. To achieve effective code adaptation, a more robust system metric that is easy to measure with high accuracy is therefore desirable.

Observe that the expected length of the request queue is

Q = λDq = (λŪ)²/(L(L − λŪ)) = λ̃²/(L(L − λ̃)),   (11)

which can be rewritten as

λ̃ = L(√(Q² + 4Q) − Q)/2.   (12)

It is immediate and intuitive that Q is a strictly increasing function of λ̃ and vice versa. On the other hand, it is not hard to verify from Theorem 1 that the optimal ni, ki and ri are all strictly decreasing functions of λ̃. Replacing λ̃ with Eq. (12), we conclude the following corollary:

Corollary 1: The optimal values of ni, ki and ri can all be expressed as strictly decreasing functions of Q:

ni = Ni(Q), ki = Ki(Q) and ri = Ri(Q).   (13)
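Eq. (12) is the inverse map that lets the backlog stand in for the load. A direct transcription:

```python
import math

def usage_rate_from_backlog(Q, L):
    """Invert Eq. (11): the usage arrival rate (lambda-tilde) that
    yields an expected request-queue length Q with L threads, Eq. (12)."""
    return L * (math.sqrt(Q * Q + 4.0 * Q) - Q) / 2.0

# sanity check against Eq. (11): with L = 16 and lambda-tilde = 8,
# Q = 64 / (16 * 8) = 0.5, and the inverse recovers 8
assert abs(usage_rate_from_backlog(0.5, 16) - 8.0) < 1e-9
```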

The findings of Corollary 1 conform to the following intuition:

• At light workload (small λ̃), there is little backlog in the request queue (small Q) and the service delay dominates the total delay. In this case, the system is not operating in the capacity-limited regime, so it is beneficial to increase the level of chunking and redundancy (large ki and ri) to reduce delay.

• At heavy workload (large λ̃), there is a large backlog in the request queue (large Q) and the queueing delay dominates the total delay. In this case, the system operates in the capacity-limited regime, so it is better to reduce the level of chunking and redundancy (small ki and ri) to support higher throughput.

More importantly, it suggests that it is sufficient to choose the FEC code solely based on the length of the request queue – a very robust and easy-to-obtain system metric – instead of less reliable estimations of λ and p. As will be discussed later, queue length has other advantages over arrival rate in a dynamic setting.

The basic idea of TOFEC is to choose ni = Ni(q) and ki = Ki(q) for a request of class i, where q is the queue length upon the arrival of the request. When this is done for all request arrivals, the average code lengths (and dimensions) and the expected queue length Q can be expected to satisfy Eq. (13), and hence the optimal delay is achieved. In TOFEC, this is implemented with a threshold-based algorithm that can be executed very efficiently. For each class i, we first compute the expected queue length given that ni ∈ {1, . . . , n_i^max} is the optimal code length:

Q^N_{i,ni} = N_i^{−1}(ni).   (14)

Here n_i^max is the maximum number of tasks allowed for a class i request. Since Ni is a strictly decreasing function, its inverse N_i^{−1} is a well-defined strictly decreasing function. As a result, we have Q^N_{i,1} > Q^N_{i,2} > · · · > Q^N_{i,n_i^max} > 0. Since our goal is to use code length n when the queue length q is around Q^N_{i,n}, we want a set of thresholds H^N_{i,n} such that

H^N_{i,1} > Q^N_{i,1} > H^N_{i,2} > Q^N_{i,2} > · · · > H^N_{i,n_i^max} > Q^N_{i,n_i^max} > H^N_{i,n_i^max+1} = 0,

and we use n such that q ∈ [H^N_{i,n+1}, H^N_{i,n}). In our current implementation of TOFEC, we use H^N_{i,n} = (Q^N_{i,n} + Q^N_{i,n−1})/2 for n = 2, · · · , n_i^max and H^N_{i,1} = ∞. A set of thresholds H^K_{i,k} for the adaptation of ki is found in a similar fashion.

Algorithm 1: TOFEC (Throughput Optimal FEC Cloud)
Initialization: q̄ = 0. When request arrives:
1: q ← queue length upon arrival of request;
2: i ← class that request belongs to;
3: q̄ ← αq̄ + (1 − α)q;
4: find k ≤ k_i^max such that q̄ ∈ [H^K_{i,k+1}, H^K_{i,k});
5: find n ≤ n_i^max such that q̄ ∈ [H^N_{i,n+1}, H^N_{i,n});
6: n ← min(r_i^max · k, n);
7: serve request with an (n, k) code when it becomes HoL.

The adaptation mechanism of TOFEC is summarized in pseudo-code as Algorithm 1. Note that in Step 6 we reduce n to r_i^max · k if the redundancy ratio of the code chosen in the previous steps is higher than r_i^max – the maximum allowed redundancy ratio for class i. Also, instead of comparing q directly with the thresholds, we compare an exponential moving average q̄ = αq̄ + (1 − α)q, with a memory factor 0 ≤ α ≤ 1, against the thresholds to determine n and k. The moving average mitigates transient variations in the queue length so that n and k do not change too frequently. Setting α = 0 recovers adaptation on the instantaneous queue length q, since in that case q̄ = q.
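To make the threshold mechanism concrete, the sketch below builds the midpoint thresholds H^N_{i,n} from a set of illustrative Q^N values and selects a code length from the smoothed queue length; all numbers are placeholders.

```python
def thresholds(Q_vals):
    """Build H_{1} > Q_{1} > H_{2} > ... from strictly decreasing
    Q_{n} = N^{-1}(n): midpoints between consecutive Q values, as in
    TOFEC's implementation, with H_{1} = inf and H_{nmax+1} = 0."""
    H = [float("inf")]
    H += [(Q_vals[n] + Q_vals[n - 1]) / 2 for n in range(1, len(Q_vals))]
    H.append(0.0)
    return H

def pick(q_ema, H):
    """Return code length n such that q_ema is in [H_{n+1}, H_{n})."""
    for n in range(1, len(H)):
        if H[n] <= q_ema < H[n - 1]:
            return n
    return len(H) - 1

Q = [9.0, 4.0, 1.5, 0.4]          # illustrative Q^N values for n = 1..4
H = thresholds(Q)                  # [inf, 6.5, 2.75, 0.95, 0.0]
q_bar = 0.99 * 3.0 + 0.01 * 5.0    # one EMA update with alpha = 0.99
print(pick(q_bar, H))              # -> 2
```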

It is worth pointing out that TOFEC's threshold-based adaptation is:

1) Independent of p: the thresholds for each class are computed a priori without any knowledge or assumption of p. Once computed, the thresholds can be reused for all realizations of p, even if p is time-varying;

2) Independent across classes: for a class i, computation of its thresholds requires knowledge of neither the number nor the delay parameters of the other classes. The adaptation of class i is also independent of those of the other classes.

These two properties of independence are a direct result of the implication of Theorem 1 discussed above. Thanks to these nice properties, it is very easy in TOFEC to add support for a new class incrementally: simply compute the thresholds for the new class, leaving the thresholds for the existing, say m, classes untouched. The old and new thresholds together then produce the optimal choice of codes for the incremented set of m + 1 classes.

V. EVALUATION

We now demonstrate the benefits of TOFEC's adaptation mechanism. We evaluate TOFEC's adaptation strategy and show that it outperforms static strategies under both constant and changing workloads, as well as a simple greedy heuristic that will be introduced later.

A. Simulation Setup

We conducted trace-driven simulations for performance evaluation in both single-class and multi-class scenarios with both read and write requests of different file sizes. Due to lack of space, we only show results for the scenario with one class (read, 3MB). But we must emphasize that this scenario is representative enough that the findings discussed in this section remain valid for other settings (different file sizes, write requests, and multiple classes). We assume that the system supports up to L = 16 simultaneously active threads. We set the maximum code dimension and redundancy ratio to kmax = 6 and rmax = 2, because we observe negligible gain in service delay beyond this chunking and redundancy level in our measurements. We use traces collected in May and June 2013 in availability region "North California". In order to compute the thresholds for TOFEC, we need estimates of the delay parameters ∆, ∆̃, Ψ, Ψ̃. For this, we first filter out the worst 10% of task delays in the traces; then we compute the delay parameters from the least-squares linear approximation of the mean and standard deviation of the remaining task delays. We use memory factor α = 0.99 in TOFEC.

In addition to the static strategies, we also develop a simple Greedy heuristic strategy for the purpose of comparison.


Fig. 8. Delay performance in the read-only scenario, comparing Best Static, Static (1,1), Static (2,1), Fixed k=6, Greedy, and TOFEC (delay in msec vs. arrival rate in req/sec): (a) Mean Delay; (b) Median Delay; (c) 90th Percentile Delay; (d) 99th Percentile Delay.

Unlike the adaptive strategy in TOFEC, Greedy does not require prior knowledge of the distribution of task delays; yet, as the results will reveal, it achieves competitive mean delay performance. In Greedy, the code used to serve a request of class i is determined by the number of idle threads upon its arrival: suppose there are l idle threads; then

ki = 1 if l = 0, and ki = min(k_i^max, l) otherwise;

and similarly

ni = 1 if l = 0, and ni = min(r_i^max · ki, l) otherwise.

The idea of Greedy is to first maximize the level of chunking with the available idle threads, and then to increase the redundancy ratio as long as idle threads remain.
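Greedy's rule reduces to a few lines of code. A direct transcription, using the kmax = 6 and rmax = 2 limits of our setup:

```python
def greedy_code(idle_threads, k_max=6, r_max=2):
    """Greedy's (n, k) choice for a class given l idle threads:
    maximize chunking first, then spend leftover threads on redundancy."""
    l = idle_threads
    if l == 0:
        return 1, 1                  # (n, k) = (1, 1) when fully busy
    k = min(k_max, l)
    n = min(r_max * k, l)
    return n, k

print(greedy_code(0))    # (1, 1)
print(greedy_code(4))    # (4, 4): all threads spent on chunking
print(greedy_code(16))   # (12, 6): full chunking plus redundancy
```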

B. Throughput-Delay Trade-Off

Fig. 8 shows the mean, median, 90th percentile and 99th percentile delays of TOFEC and Greedy with Poisson arrivals at different arrival rates λ. We also run simulations with static strategies for all possible combinations of (n, k) at every arrival rate. In a brute-force fashion, we find the best mean, median, 90th and 99th percentile delays achieved with static strategies and use them as the baseline. Fig. 8(a) and Fig. 8(b) also plot the mean and median delay performance of the basic static strategy with no chunking and no replication, i.e., the (1, 1) code; the simple replication static strategy with a (2, 1) code; and the backlog-based adaptive strategy from [3] with fixed code dimension k = 6 and n ≤ 12.

As we can see, both TOFEC and Greedy successfully support the full capacity region – the one supported by basic static – while achieving almost optimal mean and median delays throughout the full capacity region. At light workload, TOFEC delivers about a 2.5× improvement in mean delay when compared with the basic static strategy, and about 2× when compared with simple replication (from 205ms and 151ms to 84ms). It also reduces the median delay by about 2× relative to both basic and simple replication (from 156ms and 138ms to 74ms). Meanwhile, Greedy achieves about a 2× improvement in both mean (89ms) and median (79ms) delays over basic.

With heavier workloads, both TOFEC and Greedy successfully adapt their codes to keep track with the best static strategies in terms of mean and median delays. It is clear from the figures that both TOFEC and Greedy achieve our primary goal of retaining the full system capacity supported by the basic static strategy. On the contrary, although simple replication has slightly better mean and median delays than basic under light workload, it fails to support arrival rates beyond 70% of the capacity of basic. Meanwhile, the adaptive strategy from [3] with fixed code dimension k = 6 can only support less than 30% of the original capacity region, although it achieves the best delay at very light workloads.


While the two adaptive strategies have similar performance in mean and median, TOFEC outperforms Greedy significantly at high percentiles. As Fig. 8(c) and Fig. 8(d) demonstrate, TOFEC is on a par with the best static strategies at the 90th and 99th percentile delays throughout the whole capacity region. On the other hand, Greedy fails to keep track of the best static performance at lower arrival rates. At light workload, TOFEC is over 2× and 2.5× better than Greedy at the 90th and 99th percentiles, respectively. Less interesting is the case of heavy workload, when the system is capacity-limited: both strategies converge to the basic static strategy using mostly the (1, 1) code, which is optimal in this regime.

C. Delay Variation and Choice of Codes

We further compare the standard deviation (STD) of TOFEC, Greedy and the best static strategy. STD is a very important performance metric because it directly relates to whether customers receive consistent QoS. In certain applications, such as video streaming, maintaining low STD in delay can be even more critical than achieving low mean delay. As we can see in Fig. 9, for the region of interest with light to medium workload, TOFEC delivers 2× to 3× lower STD than Greedy does. Moreover, in spite of its dynamic adaptive nature, TOFEC in fact matches the best static strategy very well throughout the full capacity region. This suggests that the code choice in TOFEC indeed converges to the optimal.

The convergence to the optimal becomes more obvious when we look into the fraction of requests served by each choice of code. In Fig. 10 we plot the compositions of requests served by different code dimensions k. At each arrival rate, the two bars represent TOFEC and Greedy. For each bar, blocks in different colors represent the fraction of requests served with code dimension 1 through 6, from bottom to top. TOFEC's choice of k demonstrates a high concentration around the optimal value: at all arrival rates, over 80% of requests are served by the 2 neighboring values of k around the optimal, and this fraction quickly diminishes to 0 for codes further from the optimal. Moreover, as the arrival rate varies from low to high, TOFEC's choice of k transitions quite smoothly as (5, 6) → (3, 4) → (2, 3) → (1, 2) and eventually converges to the single value 1 as the workload approaches system capacity.

On the contrary, Greedy tends to round-robin across all possible choices of k, and the majority of requests are served by either k = 1 or k = 6. Greedy is effectively alternating between the two extremes of no chunking and very high chunking, instead of staying around the optimal. Such “all or nothing” behavior results in the 2× to 3× worse STD shown in Fig. 9. Thus TOFEC provides a much better QoS guarantee.

D. Adapting to Changing Workload

We further examine how well the two adaptive strategies adjust to changes in workload. In Fig. 11 we plot the total delay experienced by requests arriving at different times within a 600-second period, as well as the choice of code in the same period. The 600-second period is divided into 3 phases, each lasting 200 seconds. The arrival rate is 10 requests/second in phases

Fig. 9. Comparison of standard deviation (standard deviation of delay in msec vs. arrival rate in req/sec; curves: Best Static, Greedy, TOFEC)

Fig. 10. Composition of k (fraction of requests served with each code dimension vs. arrival rate). Left: TOFEC, Right: Greedy

1 and 3, and 80 requests/second (slightly above C) in phase 2. The corresponding optimal choices of codes (n, k) are (10, 5) for phases 1 and 3, and (1, 1) for phase 2. For the purpose of comparison, we also implement, as the baseline, an “Ideal” rate-driven strategy that has perfect knowledge of the arrival rate of each phase and picks the optimal code accordingly. We can see that both TOFEC and Greedy are quite agile to changes in arrival rate and quickly converge to a good composition of codes that delivers optimal mean delays within each phase, comparable to that of Ideal.

From Fig. 11(b) we can further observe that TOFEC is especially responsive in the face of a workload surge (from phase 1 to 2). This is because the suddenly increased arrival rate immediately builds up a large backlog, which in turn forces TOFEC to pick a code with the smallest k = 1. When the arrival rate drops (from phase 2 to 3), instead of immediately switching back to codes with k = 5, TOFEC gradually transitions to the optimal value of k = 5. Such “smoothing” behavior when the workload reduces is actually beneficial, because the request queue has been built up during the preceding period of heavy workload. If k were set to 5 right after the arrival rate drops, the resulting throughput would be so low that it would take a much longer time to reduce the queue length to the desired level, and requests arriving during this period would suffer from long queueing delay even though they are served with the optimal code. TOFEC's queue-length-driven adaptation instead sticks with a smaller k, which


Fig. 11. Adaptation to changing workload (delay in msec and code dimension k vs. time in sec; curves: Ideal, Greedy, TOFEC). (a) Sampled delay; (b) Sampled code dimension k; (c) Zoom-in delays for t = 400 to 410 sec.

delivers higher throughput, to drain the queue much faster to the desired level. As we can see in Fig. 11(c), which plots the delay traces for requests arriving in the first 10 seconds of phase 3, TOFEC and Greedy both reduce their delay to the optimal almost 1.8× faster than Ideal does after the workload decreases. This is another advantage of using queue length instead of arrival rate to drive code adaptation.
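To make this mechanism concrete, here is a minimal Python sketch of queue-length-driven code selection in the spirit of TOFEC. The threshold values are purely illustrative placeholders; TOFEC computes its actual thresholds from the measured delay profile.

```python
def pick_code_dimension(backlog: int, thresholds: list[int]) -> int:
    """Map the current backlog to a code dimension k.

    `thresholds` is assumed precomputed offline (here: made-up numbers)
    and sorted in ascending order. A short queue allows a large k (more
    chunking); a long queue forces k down toward 1 so that the higher
    per-request throughput can drain the backlog quickly.
    """
    k_max = len(thresholds) + 1
    for j, q in enumerate(thresholds):
        if backlog <= q:
            return k_max - j          # j thresholds exceeded: step k down
    return 1                          # backlog beyond all thresholds

# Illustrative thresholds for k_max = 6:
thresholds = [2, 5, 10, 20, 40]
print([pick_code_dimension(q, thresholds) for q in (0, 4, 15, 100)])  # [6, 5, 3, 1]
```

Because the decision depends only on the instantaneous backlog, a rate surge that inflates the queue immediately pushes k to 1, and k climbs back only as the queue actually drains, reproducing the behavior observed above.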

We can also see that TOFEC's choice of code is much more stable than that of Greedy. While TOFEC shows little variation around the optimal in each phase, Greedy keeps oscillating between k = 1 and k = 6 when the optimal is 1! This is consistent with the “all or nothing” behavior of Greedy observed in Fig. 10.

VI. RELATED WORK

FEC in connection with multiple paths and/or multiple servers is a well investigated topic in the literature [8], [9], [10], [11]. However, very little attention has been devoted to the queueing delays. FEC in the context of network coding or coded scheduling has also been a popular topic from the perspectives of throughput (or network utility) maximization and throughput vs. service delay trade-offs [12], [13], [14], [15]. Although some of these works incorporate queueing delay analysis, the treatment is largely for broadcast wireless channels with quite different system characteristics and constraints. FEC has also been extensively studied in the context of distributed storage from the points of view of high durability and availability while attaining high storage efficiency [16], [17], [18].

The authors of [4] conducted a theoretical study of cloud storage systems using FEC in a similar fashion as we did in our work [3]. Given that exact mathematical analysis of the general case is very difficult, the authors of [4] considered a very simple case with a fixed code of k = 2 tasks. Shah et al. [5] generalize the results from [4] to k > 2. Both works rely on the assumption of exponential task delays, which hardly captures the reality; therefore, some of their theoretical results cannot be applied in practice. For example, under the assumption of exponential task delays, Shah et al. prove that it is optimal to always use the largest n possible throughout the full capacity region C, contradicting simulation results using real-world measurements in [3] and this paper.

VII. CONCLUSION

This paper presents the first set of solutions for achieving the optimal throughput-delay trade-off for scalable key-value storage access using erasure codes with variable chunk sizing and rate adaptation. We establish the viability of this approach through an extensive measurement study over the popular public cloud storage service Amazon S3. We develop two adaptation strategies: TOFEC and Greedy. TOFEC monitors the local backlog and compares it against a set of thresholds to dynamically determine the optimal code length and dimension. Our trace-driven simulation shows that TOFEC is on a par with the best static strategy in terms of mean, median, 90th, and 99th percentile delays, as well as delay variation. To compute the thresholds, TOFEC requires knowledge of the mean and variance of cloud storage access delays, which is usually obtained by maintaining a log of delay traces. On the other hand, Greedy does not require any knowledge of the delay profile or logging, yet is able to achieve mean and median delays comparable to those of TOFEC. However, it falls short in important QoS metrics such as higher percentile delays and delay variation. It is part of our ongoing work to develop a strategy that matches TOFEC's high percentile delay performance without prior knowledge and logging.
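As a side note on the logging requirement: since the threshold computation only needs the first two moments of the task delays, the log can be summarized online rather than stored and replayed in full. The sketch below uses Welford's algorithm and is our illustration, not necessarily how the authors' logger is implemented; presumably one such estimator would be kept per chunk size, since the task delay distribution depends on the chunk size.

```python
class DelayStats:
    """Running mean and variance of observed task delays (Welford)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # sum of squared deviations from the mean

    def record(self, delay_ms: float) -> None:
        self.n += 1
        delta = delay_ms - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (delay_ms - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Feed in per-task delays as they complete; query the moments at any time.
stats = DelayStats()
for d in (84.0, 151.0, 74.0, 205.0):     # sample delays in msec
    stats.record(d)
print(stats.mean, stats.variance)
```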

REFERENCES

[1] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, “Erasure Coding in Windows Azure Storage,” in USENIX ATC, 2012.

[2] S. L. Garfinkel, “An Evaluation of Amazon’s Grid Computing Services: EC2, S3 and SQS,” Harvard University, Tech. Rep., 2007.

[3] G. Liang and U. C. Kozat, “FAST CLOUD: Pushing the Envelope on Delay Performance of Cloud Storage with Coding,” IEEE/ACM Trans. Networking, preprint, 13 Nov. 2013, doi: 10.1109/TNET.2013.2289382.

[4] L. Huang, S. Pawar, H. Zhang, and K. Ramchandran, “Codes Can Reduce Queueing Delay in Data Centers,” in IEEE ISIT, 2012.

[5] N. B. Shah, K. Lee, and K. Ramchandran, “The MDS Queue: Analysing Latency Performance of Codes and Redundant Requests,” arXiv:1211.5405, Apr. 2013.

[6] J. C. McCullough, J. Dunagan, A. Wolman, and A. C. Snoeren, “Stout: an Adaptive Interface to Scalable Cloud Storage,” in USENIX ATC, 2010.

[7] R. Kotla, L. Alvisi, M. Dahlin, A. Clement, and E. Wong, “Zyzzyva: Speculative Byzantine Fault Tolerance,” ACM Transactions on Computer Systems, vol. 27, pp. 7:1–7:39, 2010.

[8] V. Sharma, S. Kalyanaraman, K. Kar, K. K. Ramakrishnan, and V. Subramanian, “MPLOT: A Transport Protocol Exploiting Multipath Diversity Using Erasure Codes,” in IEEE INFOCOM, 2008.


[9] E. Gabrielyan, “Fault-Tolerant Real-Time Streaming with FEC thanks to Capillary Multi-Path Routing,” Computing Research Repository, 2006.

[10] J. W. Byers, M. Luby, and M. Mitzenmacher, “Accessing Multiple Mirror Sites in Parallel: Using Tornado Codes to Speed Up Downloads,” in IEEE INFOCOM, 1999.

[11] R. Saad, A. Serhrouchni, Y. Begliche, and K. Chen, “Evaluating Forward Error Correction Performance in BitTorrent Protocol,” in IEEE LCN, 2010.

[12] A. Eryilmaz, A. Ozdaglar, M. Medard, and E. Ahmed, “On the Delay and Throughput Gains of Coding in Unreliable Networks,” IEEE Trans. Inf. Theory, 2008.

[13] W.-L. Yeow, A. T. Hoang, and C.-K. Tham, “Minimizing Delay for Multicast-Streaming in Wireless Networks with Network Coding,” in IEEE INFOCOM, 2009.

[14] T. K. Dikaliotis, A. G. Dimakis, T. Ho, and M. Effros, “On the Delay of Network Coding over Line Networks,” Computing Research Repository, 2009.

[15] U. C. Kozat, “On the Throughput Capacity of Opportunistic Multicasting with Erasure Codes,” in IEEE INFOCOM, 2008.

[16] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, “Network Coding for Distributed Storage Systems,” IEEE Trans. Inf. Theory, 2010.

[17] R. Rodrigues and B. Liskov, “High Availability in DHTs: Erasure Coding vs. Replication,” in 4th International Workshop, IPTPS, 2005.

[18] J. Li, S. Yang, X. Wang, and B. Li, “Tree-Structured Data Regeneration in Distributed Storage Systems with Regenerating Codes,” in IEEE INFOCOM, 2010.

APPENDIX

PROOF OF THEOREM 1

Proof: It is easy to verify that the objective function (∗) is continuous and differentiable everywhere within the feasible region, and the partial derivatives are

\frac{\partial(*)}{\partial k_i} = \left( \frac{L}{(L - \lambda U)^2} - \frac{1}{L} \right) p_i (\Delta_i r_i + \Psi_i) - \frac{p_i J_i}{k_i^2} \left( \Delta_i + \Psi_i \ln \frac{r_i}{r_i - 1} \right) \quad (15)

and

\frac{\partial(*)}{\partial r_i} = \left( \frac{L}{(L - \lambda U)^2} - \frac{1}{L} \right) p_i (\Delta_i k_i + \Delta_i J_i) - p_i \left( \Psi_i + \frac{\Psi_i J_i}{k_i} \right) \frac{1}{r_i (r_i - 1)}. \quad (16)

Notice that over the whole feasible region, including the boundary, (∗) is always lower bounded by 0, so there must exist at least one globally optimal solution (k∗, r∗) that minimizes (∗). Moreover, (∗) goes to ∞ if and only if the operating point (k, r) approaches the boundary, i.e., k_i → 0, or r_i → 1, or λ → C_sta(p, k, r). Since (∗) is ∞ on the whole boundary, the global optimum (k∗, r∗) must reside strictly within the feasible region. As a result, both Eq. 15 and Eq. 16 must evaluate to 0 at (k∗, r∗). In the subsequent discussion, we prove that the system Eq. 15 = 0, Eq. 16 = 0 has a unique solution within the feasible region; as a result, (k∗, r∗) equals this solution and is also unique.

From Eq. 16 = 0 we have:

\left( \frac{L}{(L - \lambda U)^2} - \frac{1}{L} \right) = \frac{\Psi_i k_i + \Psi_i J_i}{k_i r_i (r_i - 1)(\Delta_i k_i + \Delta_i J_i)}. \quad (17)

Plugging the above into Eq. 15, we have:

\frac{k_i (\Psi_i k_i + \Psi_i J_i)}{\Delta_i k_i + \Delta_i J_i} = \frac{J_i r_i (r_i - 1)}{\Delta_i r_i + \Psi_i} \left( \Delta_i + \Psi_i \ln \frac{r_i}{r_i - 1} \right). \quad (18)

Notice that if r_i is fixed, Eq. 18 is in fact a quadratic equation in k_i. Let Γ_i(r_i) (or Γ_i for short) be the right hand side of Eq. 18. Then we have

k_i = \frac{\Delta_i \Gamma_i - \Psi_i J_i + \sqrt{(\Delta_i \Gamma_i - \Psi_i J_i)^2 + 4 \Psi_i \Delta_i J_i \Gamma_i}}{2 \Psi_i} \quad (19)
\triangleq \Omega_i(r_i). \quad (20)

We do not consider the other root of the quadratic because it is always ≤ 0. It is easy to verify that Ω_i is a strictly increasing function of r_i within the feasible region.

Substituting k_i with Ω_i(r_i), the right hand side of Eq. 17 can be written as a function π_i(r_i), which can be shown to be strictly decreasing. Notice that Eq. 17 must be satisfied for all i, and its left hand side is the same for every class. Then

\pi_i(r_i) = \pi_j(r_j), \quad \forall i, j. \quad (21)

Note that π_i and π_j are strictly decreasing functions of r_i and r_j, respectively. This means that there is a one-to-one mapping between any r_i and r_j at the optimal solutions, and r_j is a strictly increasing function of r_i, namely r_j = Υ_{j,i}(r_i).

Now with r_j = Υ_{j,i}(r_i) and k_j = Ω_j(r_j) = Ω_j(Υ_{j,i}(r_i)), Eq. 17 becomes an equation in the single variable r_i. It is then not hard to show that for any given λ and p, the left hand side of Eq. 17 is a strictly increasing function of r_i, while the right hand side is π_i(r_i), which is strictly decreasing. As a result, the two sides can be equal for at most one value of r_i. In other words, Eq. 18 and Eq. 17 together have at most one solution. The existence of a solution is guaranteed by the existence of (k∗, r∗), so the solution must be unique. This completes the proof.
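The uniqueness argument can also be checked numerically. The Python sketch below does so for a single class: it treats the left hand side of Eq. 17 as a given constant c, builds Γ_i and Ω_i from Eqs. 18–20, and bisects the strictly decreasing π_i for the unique r_i. All parameter values (∆, Ψ, J, and c) are arbitrary placeholders chosen only to exercise the formulas, not values from the paper.

```python
import math

# Arbitrary placeholder parameters for one class (not from the paper).
DELTA, PSI, J = 1.0, 2.0, 0.5

def gamma(r: float) -> float:
    """Right hand side of Eq. 18, a function of r alone."""
    return (J * r * (r - 1) / (DELTA * r + PSI)
            * (DELTA + PSI * math.log(r / (r - 1))))

def omega(r: float) -> float:
    """Positive root of the quadratic Eq. 18 in k, i.e. Eqs. 19-20."""
    g = gamma(r)
    a = DELTA * g - PSI * J
    return (a + math.sqrt(a * a + 4.0 * PSI * DELTA * J * g)) / (2.0 * PSI)

def pi(r: float) -> float:
    """Right hand side of Eq. 17 with k = Omega(r); strictly decreasing."""
    k = omega(r)
    return (PSI * k + PSI * J) / (k * r * (r - 1) * (DELTA * k + DELTA * J))

def solve_r(c: float, lo: float = 1.0 + 1e-9, hi: float = 1e6) -> float:
    """Bisect pi(r) = c on (1, hi); monotonicity gives at most one root."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if pi(mid) > c else (lo, mid)
    return 0.5 * (lo + hi)

r_star = solve_r(c=0.05)
print(r_star, omega(r_star))   # the unique (r*, k*) for this class
```

Consistent with the proof, π(r) diverges as r → 1 and vanishes as r → ∞, so the bisection always brackets exactly one crossing for any c > 0.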

Guanfeng Liang (S’06-M’12) received his B.E. degree from the University of Science and Technology of China, Hefei, Anhui, China, in 2004, M.A.Sc. degree in Electrical and Computer Engineering from the University of Toronto, Canada, in 2007, and Ph.D. degree in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign, in 2012. He currently works at DOCOMO Innovations (formerly DOCOMO USA Labs), Palo Alto, CA, as a Research Engineer.

Ulas C. Kozat (S’97-M’04-SM’10) received his B.Sc. degree in Electrical and Electronics Engineering from Bilkent University, Ankara, Turkey, in 1997, M.Sc. degree in Electrical Engineering from the George Washington University, Washington, DC, in 1999, and Ph.D. degree in Electrical and Computer Engineering from the University of Maryland, College Park, in 2004. He currently works at DOCOMO Innovations (formerly DOCOMO USA Labs), Palo Alto, CA, as a Principal Researcher.