
QBox: Guaranteeing I/O Performance on Black Box Storage Systems

Dimitris Skourtis [email protected]
Shinpei Kato [email protected]
Scott Brandt [email protected]

Department of Computer Science, University of California, Santa Cruz

ABSTRACT
Many storage systems are shared by multiple clients with different types of workloads and performance targets. To achieve performance targets without over-provisioning, a system must provide isolation between clients. Throughput-based reservations are challenging due to the mix of workloads and the stateful nature of disk drives, leading to low reservable throughput, while existing utilization-based solutions require specialized I/O scheduling for each device in the storage system.

QBox is a new utilization-based approach for generic black box storage systems that enforces utilization (and, indirectly, throughput) requirements and provides isolation between clients, without specialized low-level I/O scheduling. Our experimental results show that QBox provides good isolation and achieves the target utilizations of its clients.

Categories and Subject Descriptors
D.4.2 [Operating Systems]: Storage Management; D.4.8 [Operating Systems]: Performance

Keywords
Storage virtualization, quality of service, resource allocation, performance

1. INTRODUCTION
During the past decade there has been a significant growth of data with no signs of slowing. Due to that growth there is a real need for storage devices to be shared efficiently by different applications, avoiding the extra cost of having more and more under-utilized devices dedicated to specific applications. In environments such as cloud systems, where multiple “clients”, i.e., streams of requests, compete for the same storage device, it is especially important to manage the performance of each client. Failure to do so leads to low performance for some or all clients depending on complex factors such as the I/O schedulers used, the mix of client workloads, as well as storage-specific characteristics.


Figure 1: Given that we have no access to the storage device, we place a controller between the clients and the device to provide performance management to the clients, i.e., the request streams. [Diagram: client streams 1 through n (e.g., a database, a media stream) send requests through the controller to a closed storage device containing disks 1 through m.]

Unfortunately, due to the nature of storage devices, managing the performance of each client and isolating them from each other is a non-trivial task. In a shared system, each client may have a different workload and each workload may affect the performance of the rest in undesirable and possibly unpredictable ways. A typical example would be a stream of random requests reducing the performance of a sequential or semi-sequential stream, mostly due to the storage device performing unnecessary seeks. The above is the result of storage devices trying to be equally fair to all requests by providing similar throughput to every stream. Of course, not all requests are equally costly, with sequential requests taking only a small fraction of a millisecond and random requests taking several milliseconds, 2-3 orders of magnitude longer. Note that a sequential stream does not have to be perfectly sequential (none ever truly are) and that real workloads often exhibit such behavior.

Providing a solution to the above problem may require using specific I/O schedulers for every disk drive or node in a clustered storage system. Moreover, it could require changes to current infrastructure, such as the replacement of the I/O scheduler of every client. Such changes may create compatibility issues preventing upgrades or other modifications from being applied to the storage system. Instead of making modifications to the infrastructure of an existing system, it is often easier and in practice cheaper to deploy a solution between the clients and the storage. We call that the black box approach since it imposes minimal requirements on the clients and storage, and because it is agnostic to the specifications of either side.


Our approach partly fits the grey-box framework for systems presented in [1]; however, QBox requires fewer algorithmic assumptions about the underlying system. In this paper we take an almost agnostic approach and target the following problem: given a set of clients and a storage device, our goal is to manage the performance of each client's request stream in terms of disk-time utilization and provide each client with a pre-specified proportion of the device's time, while having no internal control of either the clients or the storage device and without requiring any modifications to the infrastructure of either side.

Clients want throughput reservations. However, except for highly regular workloads, throughput varies by orders of magnitude depending upon workload (Figure 2) and only a fixed fraction of the (highly variable) total may be guaranteed. By isolating each stream from the rest, utilization reservations allow a system to indirectly guarantee a specific throughput (not just a share of the total) based on direct or inferred knowledge about the workload of an individual stream, independent of any other workloads on the system, and can allow much greater total throughput than throughput-based reservations [17]. Our utilization-based approach can work with Service Level Agreements (SLAs); requirements can be converted to utilization as demonstrated in [18], and as long as we can guarantee utilization, we can guarantee the throughput provided by an SLA.

To our knowledge there is no prior work on utilization-based performance guarantees for black box storage devices. Most work that is close to our scenario, such as [15, 11], is based on throughput and latency requirements, which are hard to reserve directly without under-utilizing the storage for a number of reasons, such as the orders-of-magnitude cost difference between best- and worst-case requests. Moreover, throughput-based solutions create other challenges such as admission control. Without very specific knowledge about the workloads, the system must make worst-case assumptions, leading to extremely low reservable throughput. On the other hand, existing solutions based on disk utilization [18, 17, 10] only support single drives, and if used in a clustered storage system they require their scheduler to be present on every node.

In this paper, we present a novel method for managing the performance of multiple clients on a storage device in terms of disk-time utilization. Unlike the management of a single drive, in the black box scenario it is hard to measure the service time of each request. Instead, our solution is based on the periodic estimation of the average cost of sequential and random requests, as well as the observation that their costs have an orders-of-magnitude difference. We observe the throughput of each request type in consecutive time windows and maintain separate moving estimates for the cost of sequential and random requests. By taking into account the desired utilization of each client we schedule their requests by assigning them deadlines and dispatching them to the storage device according to the Earliest Deadline First (EDF) algorithm [13, 21]. Our results show that the desired utilization rates are achieved closely enough, providing both good performance guarantees and isolation. Those results hold over any combination of random, sequential, and semi-sequential workloads. Moreover, due to our utilization-based approach, it is easy to decide whether a new client may be admitted to the storage system, possibly by modifying the rates of other clients. Finally, all clients may access any file on the storage device and we make no assumptions about the location of the data on a per-client basis.

Figure 2: QBox (top) provides isolation, while the introduction of a random stream makes the throughput of sequential streams drop dramatically with throughput-based scheduling (bottom). [Plots: throughput (IOPS) over time (seconds) for Stream A (Seq. 50%-0%), Stream B (Seq. 50%-0%), and Stream C (Ran. 0%-100%) under utilization-based and throughput-based scheduling.]

2. SYSTEM MODEL
Our basic scenario consists of a set of clients, each associated with a stream of requests, and a single storage device containing multiple disks. Clients send requests to the storage device and each stream uses a proportion of the device's execution time. We call that proportion the utilization rate of a stream; it is provided either by the client or, in practice, by a broker, which is part of our controller and translates SLAs into throughput and latency requirements as in [17, 18] or [10]. Briefly, to translate an SLA to utilization, we measure the aggregate throughput of the system for sequential and random requests separately over small numbers of requests (e.g., 20) and set a confidence level (e.g., 95%) to avoid treating all requests as outliers. Details about arrival patterns and issues such as head/track switches and bad layout are presented in [19].
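To make the translation step concrete, the following is a minimal C++ sketch; the additive formula, capacity numbers, and names are illustrative assumptions rather than the exact procedure of [17, 18]. The utilization a client needs is the fraction of device time its sequential and random request rates would consume, given the measured per-type capacities.

    // Sketch: convert a throughput SLA into a utilization rate.
    // The capacities are aggregate IOPS measured by the broker over small
    // batches of requests (e.g., 20) at a chosen confidence level.
    #include <iostream>

    struct SLA {
        double seq_iops;    // sequential IOPS requested by the client
        double rand_iops;   // random IOPS requested by the client
    };

    // Each request type consumes roughly 1/capacity of the device's time per
    // request, so the required utilization is the sum of the per-type ratios.
    double sla_to_utilization(const SLA& sla,
                              double seq_capacity_iops,
                              double rand_capacity_iops) {
        return sla.seq_iops / seq_capacity_iops +
               sla.rand_iops / rand_capacity_iops;
    }

    int main() {
        SLA sla{4000.0, 50.0};  // hypothetical client requirements
        double u = sla_to_utilization(sla, 20000.0, 120.0);  // measured capacities
        std::cout << "required utilization: " << u << "\n";  // ~0.62
    }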

The main characteristic of our scenario is that we treat the storage device as a black box. In other words, we only interact with the storage device by passing it client requests and receiving responses.


Figure 3: The controller architecture. [Diagram: requests from multiple clients enter QBox, which holds them in per-stream FIFO queues, assigns deadlines into a deadline queue, and feeds completion statistics into the estimation module before dispatching requests to disks 1, 2, 3, and so on.]

For example, we cannot modify or replace the device's scheduler, as is the case in [18], and we do not assume it uses a particular scheduler. Moreover, we cannot control which disk(s) are going to execute each request and do not restrict clients to specific parts of the storage device. Due to those requirements the natural choice is to place a controller between the clients and the storage device. Hence, all client requests go through the controller, where they are scheduled and eventually dispatched to the device. As we will see, this setup lets us gather only limited information regarding the disk execution times, which turns scheduling, and therefore black box management, into a challenge.

We manage the performance of the streams in a time-based manner. After a request reaches our controller, we assign it a deadline by keeping an estimate of the expected execution time e for each type of request (sequential or random) and by using the stream's rate r provided by the broker. Using e and r we compute the request's relative deadline as d = e/r. The absolute deadline of a request coming from stream s is set to D_s = T_s + d, where T_s is the sum of all the relative deadlines assigned so far to the requests of stream s. Although we are using “deadlines” for scheduling, our goal is not to strictly satisfy deadlines. Instead, it is the relative values that matter with regards to the dispatching order. On the other hand, if we used a stricter dispatching approach, e.g., [18], then the absolute times would be important for replacing the expected cost with the actual cost after the request was completed. In this paper we do not focus on urgent requests; however, it is possible to place such requests ahead of others in the corresponding stream queue by simply assigning them earlier deadlines.
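As a small illustration of this bookkeeping (a sketch with hypothetical names and units; the controller's actual state is not shown in the paper), each stream carries the running sum T_s, and each arriving request receives the relative deadline d = e/r for its type and the absolute deadline T_s + d:

    enum class ReqType { Sequential, Random };

    struct CostEstimates {
        double seq_us  = 200.0;    // assumed moving estimate for a sequential request
        double rand_us = 8000.0;   // assumed moving estimate for a random request
    };

    struct Stream {
        double rate;                 // utilization rate r, in (0, 1]
        double deadline_sum_us = 0;  // T_s: sum of relative deadlines so far
    };

    // Compute the absolute deadline D_s = T_s + e/r and advance T_s.
    double assign_deadline(Stream& s, ReqType type, const CostEstimates& est) {
        double e = (type == ReqType::Sequential) ? est.seq_us : est.rand_us;
        double d = e / s.rate;       // relative deadline d = e / r
        s.deadline_sum_us += d;      // T_s <- T_s + d
        return s.deadline_sum_us;    // absolute deadline for this request
    }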

Although we do not assume the storage device is using a specific disk scheduler, it is better to have a scheduler which tries to avoid starvation and orders the requests in a reasonable manner (as most do). A stricter dispatching policy such as [18] can be used on the controller side to avoid starvation by placing more emphasis on satisfying the assigned deadlines instead of overall performance. The next section presents our method for estimating execution times and managing the performance of each stream in terms of time.

Figure 4: When our controller sends the requests in the order they are received from the clients, the system fails to provide the desired rates. [Plot: average utilization per 150ms window for a sequential and a random stream when the desired rates are (50%, 50%).]

We also discuss practical issues we faced while applying our method and how we addressed them.

3. PERFORMANCE MANAGEMENT
In QBox we maintain a FIFO queue for each stream and a deadline queue, which may contain requests from any stream. The deadline queue is ordered according to the Earliest Deadline First (EDF) scheduler and the deadlines are computed as described in the previous section. Whenever we are ready to dispatch a request to the storage device, the request with the smallest absolute deadline out of all the stream queues is moved to the deadline queue. To find the earliest-deadline request it suffices to look at the oldest request from each stream queue, since any other request before that has either arrived at a later time or is less urgent. Next, the request with the earliest deadline is removed from the deadline queue and dispatched to the device.
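The dispatching step described above could be sketched as follows (hypothetical types and names; error handling and per-window statistics omitted): only the oldest request of each stream queue is examined, the earliest-deadline one is staged into the deadline queue, and the earliest request overall is dispatched.

    #include <deque>
    #include <map>
    #include <optional>
    #include <queue>
    #include <vector>

    struct Request {
        int stream_id;
        double deadline;   // absolute deadline assigned on arrival
    };

    struct LaterDeadline {
        bool operator()(const Request& a, const Request& b) const {
            return a.deadline > b.deadline;   // makes the priority_queue a min-heap
        }
    };

    struct Controller {
        std::map<int, std::deque<Request>> stream_queues;  // per-stream FIFOs
        std::priority_queue<Request, std::vector<Request>, LaterDeadline> deadline_queue;

        std::optional<Request> next_to_dispatch() {
            // Only the head of each FIFO can be the earliest-deadline request.
            std::deque<Request>* best = nullptr;
            for (auto& [id, q] : stream_queues)
                if (!q.empty() && (!best || q.front().deadline < best->front().deadline))
                    best = &q;
            if (best) {                        // stage it in the deadline queue
                deadline_queue.push(best->front());
                best->pop_front();
            }
            if (deadline_queue.empty()) return std::nullopt;
            Request r = deadline_queue.top();  // earliest deadline overall (EDF)
            deadline_queue.pop();
            return r;                          // dispatch r to the storage device
        }
    };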

3.1 Estimating execution times
As mentioned earlier, we aim to provide performance management through a controller placed between the clients and the storage device. We wish to achieve this goal without knowledge of how the storage device schedules and distributes the requests among its disks and without access to the storage system internals. Most importantly, we are unaware of the time each request takes on a single disk, which we could otherwise measure by looking at the time difference between two consecutive responses, i.e., the inter-arrival time. In our case, the time between two consecutive responses does not necessarily reflect the time spent by the device executing the second request, because those two requests may have been satisfied by different disks.

On the other hand, we know the number of requests executed from each stream on the storage device. If all requests had the same cost, then we could take the average over a time window T, i.e., e = T/n, where n is the number of requests completed in T. Clearly, that would not solve the problem, since random requests are orders-of-magnitude more expensive than sequential requests, i.e., the disk has to spend significantly more time to complete a random request.


Figure 5: The intersection of the two lines from (3) gives us the average cost x of a sequential and y of a random request in the time windows z_i and z_j.

Based on that observation, for each stream we classify its requests into sequential and random while keeping track of the number of requests completed by type per window. Assuming the clients saturate the device and that the cost x of the average sequential request and the cost y of the average random request remain the same across two time windows z_i and z_j, we are led to the following system of linear equations:

    α_i x + β_i y = z_i
    α_j x + β_j y = z_j,    (1)

where α_i is the number of sequential requests completed in window i, and similarly for the number of random requests denoted by β_i. Often, j will be equal to i + 1. Solving the above system gives us the sequential and random average request costs for windows z_i and z_j:

    x = (z_j β_i − β_j z_i) / (α_j β_i − α_i β_j),    y = z_i/β_i − (α_i/β_i) x.    (2)

The above equations may give us negative solutions due to system noise and other factors. Since execution costs may only be positive, we restrict the solutions to positive (x, y) pairs (Figure 5), i.e., satisfying:

    z_i/α_i < z_j/α_j,  z_i/β_i > z_j/β_j,  α_i/β_i > α_j/β_j
    or
    z_i/α_i > z_j/α_j,  z_i/β_i < z_j/β_j,  α_i/β_i < α_j/β_j.    (3)

Intuitively, setting z_i equal to z_j in (3) would require that if the number of completed sequential requests goes down in window j, then the number of random requests has to go up (and vice versa). Otherwise, the intersection would contain a negative component. By focusing on the case where every time window has the same length we reduce the chances of getting highly volatile solutions and make the analytical solution simpler to understand intuitively. In that case the solution becomes

    x = z / (α_i + β_i λ),    y = λ x,    (4)

where

    λ = (α_i − α_j) / (β_j − β_i).    (5)

Figure 6: Counting sequential and random completions per window lets us estimate their average cost. [Diagram: two equal time windows z_i and z_j with α_i = 15, β_i = 2 and α_j = 19, β_j = 1 sequential/random completions.]

From (5) we see that the intersection solutions are expected to be volatile if the window size is small. On the other hand, if the window size is large and the throughput does not change, the intersection will often be negative, i.e., it will happen in a negative quadrant, since the two lines from Figure 5 will often have a similar slope. It would be easy to ignore negative solutions by skipping windows. However, depending on the window size and workload it is possible to get negative solutions more often than positive ones. That leads to fewer updates and therefore a slower convergence to a stable estimate. To face that issue we looked in two directions. One is to observe that if the window size is small enough, it is not important whether we take the intersection of the current window with the previous one or with some other window not too far in the past. Based on that observation we consider the positive intersections of the current window with a number of the previous ones and take the average. That method increases the chance of getting a valid solution. In addition, updating more frequently allows the moving estimate to converge more quickly without giving a large weight to any of the individual estimates.
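A sketch of this estimation step (our own simplified code; window bookkeeping, weighting, and noise handling are omitted): solve the 2x2 system (1) for a pair of windows, discard non-positive intersections, and average the valid intersections of the current window with its k most recent predecessors.

    #include <optional>
    #include <utility>
    #include <vector>

    struct Window {
        double seq;   // alpha: sequential completions in the window
        double rnd;   // beta:  random completions in the window
        double len;   // z: window length, in the same units as the costs
    };

    // Solve alpha_i*x + beta_i*y = z_i and alpha_j*x + beta_j*y = z_j as in (2);
    // keep only positive (x, y) pairs, as required by condition (3).
    std::optional<std::pair<double, double>>
    intersect(const Window& wi, const Window& wj) {
        double det = wj.seq * wi.rnd - wi.seq * wj.rnd;
        if (det == 0.0 || wi.rnd == 0.0) return std::nullopt;   // degenerate case
        double x = (wj.len * wi.rnd - wj.rnd * wi.len) / det;   // sequential cost
        double y = (wi.len - wi.seq * x) / wi.rnd;              // random cost
        if (x <= 0.0 || y <= 0.0) return std::nullopt;
        return std::make_pair(x, y);
    }

    // Average the valid intersections of the current window with its k most
    // recent predecessors to obtain a single per-window estimate.
    std::optional<std::pair<double, double>>
    window_estimate(const std::vector<Window>& history, const Window& cur, size_t k) {
        double sx = 0, sy = 0;
        size_t n = 0;
        for (size_t i = history.size(); i-- > 0 && n < k; ) {
            if (auto p = intersect(history[i], cur)) {
                sx += p->first;
                sy += p->second;
                ++n;
            }
        }
        if (n == 0) return std::nullopt;
        return std::make_pair(sx / n, sy / n);
    }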

The other way we propose to face negative solutions is to compute the projection of the previous estimate onto the current window, assuming the x/y ratio remains the same across those two windows. In particular, we may assume that α_i/β_i is close to α_j/β_j. In that case, we can project the previous intersection point or estimate onto the line describing the second window. The projection is given by

    x = α_j z_i / (α_i (μ β_j + α_j)),    y = μ x,    (6)

where

    μ = (1/β_i) (z_i/x − α_i).    (7)

The idea is that if both the number of completed sequential and random requests in a window drops (or increases) proportionally, the cost must have shifted accordingly. Although we observed that the projection method works especially well, its correctness depends on the previous estimate. It could still be used, when some intersection is invalid, to keep updating the estimate, but we leave it as future work to determine whether it can enhance our estimates.
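A small sketch of this projection fallback based on equations (6) and (7); zero counts and other corner cases are not handled, and the names are ours: μ is the cost ratio implied by the previous estimate on window i, and the new point is its projection onto window j's line.

    #include <utility>

    // Projection fallback following equations (6) and (7): x_prev is the previous
    // sequential-cost estimate; (ai, bi, zi) are window i's sequential count,
    // random count, and length, and (aj, bj) are window j's counts.
    std::pair<double, double>
    project_estimate(double x_prev, double ai, double bi, double zi,
                     double aj, double bj) {
        double mu = (zi / x_prev - ai) / bi;            // eq. (7)
        double x  = (aj * zi) / (ai * (mu * bj + aj));  // eq. (6)
        return {x, mu * x};                             // (sequential, random) cost
    }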

3.2 Estimation error and seek times
A key assumption is that the request costs are the same among windows.


Assuming that at some point we have the true (x, y) cost and that the cost in the next window is not exactly the same due to system noise, we expect to have some error. To compute that error we replace z_j in the solution for x in (2) by its definition, i.e., α_j x + β_j y, and denote that expression by x′. Taking the difference between x and x′ gives

    |x − x′| = β_i / |α_j β_i − α_i β_j| · |(α_j x + β_j y) − z_j|    (8)

and

    |y − y′| = (α_i/β_i) |x′ − x|.    (9)

So far we have not considered seek times between streams and how they might affect our estimates. In the typical case where m random requests are executed by a disk followed by n sequential requests, the first request out of the sequential ones will incur a seek. That seek is not fully charged to either type of request in our model, simply because doing so is either hard or impossible in our scenario. Intuitively, the total seek cost of a window is distributed across both request types: firstly, because fewer requests of both types will end up being executed in that window, and secondly, due to the error formula (9) for y. In particular, assuming the delayed requests in some window i would also follow the α_i/β_i ratio, we now show that seeks do not affect our scheduling.

Let α′_i = α_i − δ^(α)_i and β′_i = β_i − δ^(β)_i, where δ^(τ)_i is the number of requests of type τ that are not executed in window i due to seek events. From the above assumption, δ^(β)_i = (β_i/α_i) δ^(α)_i. Then α′_i/β′_i = (α_i − δ^(α)_i)/(β_i − δ^(β)_i) = (α_i − δ^(α)_i) / ((β_i/α_i)(α_i − δ^(α)_i)) = α_i/β_i, and similarly for window j. Using the original solution (2) for the sequential and random costs, consider the ratio y/x as well as y′/x′, which uses α′ instead of α and similarly for β. By substituting, we get that y/x = y′/x′ = −α′_i/β′_i, which is independent of the number of seeks δ and, by the above, is equal to −α_i/β_i.

From the above, we conclude that seeks do not affect the relative estimation costs and consequently our schedule. The reason the ratios are negative can be seen from Figure 5: fixing every variable in (2) while increasing the x-cost reduces the y-cost and vice versa. Therefore, the slope y/x is negative whether we have seeks or not.

3.3 Write support and estimating in practice
In this work, we only deal with read requests. Since writes typically respond immediately, it is harder to approximate the disk throughput over small time intervals. On the other hand, if a system is busy enough, the write throughput over large intervals (e.g., 5 seconds) is expected to have a smaller variance and be closer to the true throughput. Preliminary results suggest the above holds. There are still some challenges, such as the effect of writes on reads when there is significant write activity, which may be addressed by dispatching writes in groups. Adding support for writes is a priority for future work and is expected to lead to a more general solution supporting SSDs and hybrid systems.

In our implementation we took the approach of having small windows, e.g., 100ms, to increase the frequency of estimates and to give a small weight to each of them. As we compute intersections we keep a moving average and weight each estimate depending on its distance from the previous one. Due to the frequent updates, if there is a shift in the cost, the moving estimate will reach that value quickly. Moreover, to improve estimates, for each window we find its intersection with a number of the previous windows (e.g., 10). Finally, if the λ cost ratio as defined in (5) is too small or too large we ignore that pair of costs. We set the bounds to what we consider safe values, in that they will only take out clearly wrong intersections.
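The update might be organized along the following lines (a rough sketch; the weighting function and the λ bounds shown are placeholders we chose, not the values used in our implementation): implausible cost pairs are dropped, and accepted ones nudge the moving estimate, with estimates far from the current value receiving less weight.

    #include <cmath>

    struct MovingEstimate {
        double seq_cost  = 200.0;    // assumed initial guess (microseconds)
        double rand_cost = 8000.0;   // assumed initial guess (microseconds)

        // Ignore cost pairs whose random/sequential ratio (lambda) is clearly wrong.
        static bool plausible(double seq, double rnd) {
            double lambda = rnd / seq;
            return lambda > 5.0 && lambda < 500.0;   // illustrative bounds
        }

        // Blend a per-window estimate into the moving value; estimates far from
        // the current one get a smaller weight, so no single window dominates.
        void update(double seq, double rnd) {
            if (seq <= 0.0 || rnd <= 0.0 || !plausible(seq, rnd)) return;
            double w = 0.1 / (1.0 + std::fabs(rnd - rand_cost) / rand_cost);
            seq_cost  = (1.0 - w) * seq_cost  + w * seq;
            rand_cost = (1.0 - w) * rand_cost + w * rnd;
        }
    };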

Figure 7: Using one disk and a mixture of sequential and random streams the rates are achieved and convergence happens quickly. [Plot: average rates per 150ms window for a random stream (50%) and sequential streams (20%, 20%, 10%) on Disk 2.]

4. EXPERIMENTAL EVALUATION
In this section, we evaluate QBox in terms of utilization and throughput management. We first verify that the sequential and random request cost estimates are accurate enough and that the desired stream rates are satisfied in different scenarios. Next, we show that the achieved throughput is to a large degree in agreement with the target rates of each stream.

4.1 Prototype
In all our experiments we use up to four disks (different models) or a software RAID 0 over two disks. We forward stream requests to the disks asynchronously using Kernel AIO. We avoided using threads in order to keep a large number of requests queued up (e.g., 200) and to avoid race conditions leading to inaccurate inter-arrival time measurements. Up to subsection 4.4 we are interested in evaluating QBox in a time-based manner. For that purpose we avoid hitting the filesystem cache by enabling O_DIRECT and do not use Native Command Queuing (NCQ) on any of the disks. Moreover, we send requests in a RAID 0 fashion rather than using a true RAID. The above allows us to know which disk each request targets, which consequently lets us compute the service times by measuring the inter-arrival times and compare those with our estimates. That extra information is not used by our method, since it is normally unavailable; it is used only for evaluation purposes. Starting from subsection 4.4 we gradually remove all the above restrictions and evaluate QBox implicitly in a throughput-based manner.
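For reference, a single read dispatched in this style could look roughly like the following (a minimal sketch, not the QBox prototype code; the file name, sizes, and error handling are illustrative, and it should be compiled with g++ and linked with -laio): the file is opened with O_DIRECT, an aligned buffer is allocated, and the request is submitted with Linux kernel AIO so that many requests can stay queued without threads.

    #include <libaio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        // O_DIRECT bypasses the page cache; buffers and offsets must be aligned.
        int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void* buf = nullptr;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;   // 4KB-aligned buffer

        io_context_t ctx = 0;
        if (io_setup(128, &ctx) < 0) { perror("io_setup"); return 1; }

        struct iocb cb;
        struct iocb* cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);    // asynchronous 4KB read at offset 0

        if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, nullptr);   // wait for the completion
        std::printf("read returned %ld bytes\n", (long)ev.res);

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }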

We evaluate QBox both with synthetic and real workloads depending on the goal of the experiment. All synthetic requests are reads of size 4KB unless we are using a RAID over two disks, in which case they are 8KB. For the synthetic workload, each disk contains a hundred 1GB files. We use a subset of the Deasna2 [3] NFS trace with request sizes typically being 32KB or 64KB.


Figure 8: Using two disks and our scheduling and estimation method we achieve the desired rates most of the time relatively well. In the above we have three sequential streams and a random one. [Plots: average rates per 100ms window for targets (Sequential 30%, Random 70%), (50%, 50%), and (70%, 30%), with 20 < L < 175.]

Figure 9: Using two disks (d1, d2) and our estimation method we maintain a moving estimate of the average random execution cost on the storage device. [Plots: per-window random request cost averages for d1 and d2 and the moving estimate, for the three target sets of Figure 8.]

Except for a workload containing idle time, we assume there are always requests queued up, since that is the most interesting scenario, and we bound the number of pending requests on the storage by a constant, e.g., 200. Finally, we do not assume a specific I/O scheduler is used by the storage device. In our experiments the “Deadline Scheduler” was used; however, we have tried other schedulers and observed similar results.

4.2 Sequential and random streams
Our approach is based on the differentiation between sequential and random requests, and so the first step in evaluating QBox is to consider a workload of fully sequential and random streams with the goal of providing isolation between them. Note that to provide isolation a prerequisite is that our cost estimates for the average sequential and random request are close enough to the true values, which are not known, and it is not possible to explicitly measure them in our black box scenario. In this set of experiments, the workload consists of three sequential streams and one random. Each sequential stream starts at a different file to ensure there are inter-stream seeks. Each request of the random stream targets a file and offset uniformly at random. For each stream we measure the average utilization provided by the storage device. We look into three sets of desired utilizations. Figure 8(a) shows a random stream with a utilization target of 70%, while each sequential stream has a target of 10% for a total of 30%. As the experiment runs, the cost estimates take values within a small range and the average achieved utilization converges. In Figures 8(b) and (c) the sequential streams are given higher utilizations. In all three cases, the achieved utilizations are close to the desired ones. Again, in Figures 8(b) and (c) the initial estimate was relatively close to the actual cost, so the moving rates approach the converging rates more quickly.

From Figure 9 we notice that estimates get above the average cost when there are many sequential requests, even if the utilization targets are achieved (Figure 8). The main reason is that we keep track of and store (in memory) large amounts of otherwise unnecessary statistics per request. Therefore, if in a window of, e.g., 100ms there is a very large number of request completions, i.e., when the rate of sequential streams is high, 10ms (10μs·1000 requests) may be given to that processing and therefore the estimates are scaled up. Of course, those operations can be optimized or eliminated without affecting QBox. As expected, a similar effect happens with the estimated cost of sequential requests (not shown); therefore the ratio of the costs stays valid, leading to proper scheduling as shown in Figure 8.

Besides the initial estimates, the convergence rate also depends on the window size, since a smaller size implies more frequent updates and faster convergence. However, if the window size becomes too small, the number of completed requests becomes too low and the estimate may not be accurate enough due to the significant noise. Note that whether the window size is considered too small depends on the number of disks in the storage system, as having more disks implies that a greater number of requests complete per window.


Figure 10: Using two disks and our scheduling and estimation method we achieve the desired rates most of the time relatively well. Stream A requires 20% of the disk time and sends 25 random requests every 475 sequential requests. Similarly for the rest of the streams. [Plots: achieved stream rates per 75ms window for three target sets over streams A (25r/475s), B (7r/693s), C (all sequential), and D (all random).]

Figure 11: Using two disks (d1, d2) and our estimation method we maintain a moving estimate of the average random execution cost on the storage device. [Plots: per-window random request cost averages and the moving estimate for experiments B, C, and D.]

The window size we picked in the above experiment (Figure 8) is 100ms. Other values such as 150ms provide similar estimation quality, and later we look at smaller windows of 75ms. Note that in Figure 9 there are a number of recorded averages that are 0, because random streams with low target rates are more likely to have no arrivals in a window. Not having any completed random requests in a window implies that we can compute a new sequential estimate more easily.

Finally, in the above, we assume there are always enough queued requests from all streams. Without any modification to our method, we see from Figure 12 that under idle time it is still possible to manage the rates. In particular, every 5000 requests (on average) dispatched to the storage device, we delay dispatching the next request(s) for a uniformly random amount of time between 0.5 and 1 second. From Figure 12 we see that the rates are still achieved, while there is slightly more noise in the estimates compared to Figure 9. We noted that if the idle times are larger than the window size, then our method is less affected. That was expected, since idling over a number of consecutive windows implies that new requests will be scheduled according to the previous estimates, as the estimates will not be updated. Finally, although the start and end of the idle time window may affect the estimate, the effect is not significant since the estimate moves only by a small amount on each update and most updates are not affected.

4.3 Mixed-workload streams
In practice, most streams are not perfectly sequential. For example, a stream of requests may consist of m random requests for every n sequential requests, where m is often significantly smaller than n. To face that issue, instead of characterizing each stream as either sequential or random we classify each request. Note that the first request of a sequential group of requests after m random ones is considered random if m is large enough. Although not all random requests cost exactly the same, we do not differentiate between them since we work on top of the filesystem and do not assume we have access to the logical block number of each file. Therefore, we do not have a real measure of sequentiality for any two I/O requests. However, as long as the cost of random requests does not vary significantly between streams, we expect to achieve the desired utilization for each stream. Indeed, as has been observed in [10], good utilization management can still be provided when random requests are assumed to cost the same. Moreover, from [3] we see that requests from common workloads are usually either almost sequential or fully random. Differentiating between cost estimates on a per-stream basis is expected to improve the management quality and we leave it as future work.
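A per-request classifier in this spirit might look like the following (a simplified sketch; the exact contiguity rule is not spelled out above, so the test shown, which marks a request sequential only when it continues the previous request of the same stream on the same file, is an assumption):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>

    struct IoRequest {
        int         stream_id;
        std::string file;      // we work on top of the filesystem, per file
        uint64_t    offset;    // byte offset within the file
        uint32_t    length;    // request size in bytes
    };

    enum class ReqClass { Sequential, Random };

    class Classifier {
        // Last (file, end-of-request offset) seen for each stream.
        std::map<int, std::pair<std::string, uint64_t>> last_;
    public:
        ReqClass classify(const IoRequest& r) {
            auto it = last_.find(r.stream_id);
            bool sequential = it != last_.end() &&
                              it->second.first == r.file &&
                              it->second.second == r.offset;  // continues previous one
            last_[r.stream_id] = {r.file, r.offset + r.length};
            return sequential ? ReqClass::Sequential : ReqClass::Random;
        }
    };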

From Figure 10 we see that the targets are achieved in the presence of semi-sequential streams.


Figure 12: Using two disks (d1, d2), the desired rates are achieved well enough (reach 45% quickly) even when there is idle time in the workload. [Plots: average rates and random request cost estimates per 150ms window for targets Sequential 50%, Random 50% with idle time.]

In particular, in Figure 10(a), stream A sends 25 random requests for every 475 sequential ones. Stream B sends 7 random requests for every 693 sequential ones, while streams C and D are purely sequential and random, respectively. The other target sets in Figure 10 are satisfied equally well. Note that each group of requests does not have to be completed before the next one is sent. Instead, requests are continuously dequeued and scheduled.

So far we have seen scenarios with fixed target rates. Our method supports changing the target rates online as long as the rates sum to at most 100%. Depending on the new target rates, the cost estimation updates can be crucial in achieving those rates. For example, increasing the rate of a random stream decreases the average cost of a random request, and our estimates are adjusted automatically to reflect that. Figure 13(a) illustrates that the utilization rates are satisfied and Figure 13(b) shows how the random estimate changes as the clients adjust their desired utilization rates every thirty seconds. For this experiment we set the number of disks to four to illustrate that our method works with a higher number of disks and to support our claim that it can work with any number of disks. The same experiment was run with two disks, giving nearly identical results (figure omitted).

As explained earlier, the disk queue depth is set to one for evaluation purposes. However, since a large queue depth can improve the disk throughput, we implicitly evaluate QBox by comparing the throughput achieved when the depth is 1 and 31, while the target rates change. In particular, we look at semi-sequential and random streams. As expected and illustrated in Figure 15, having a depth of 31 achieves a higher throughput over a range of rates. Although this does not verify our method works perfectly, due to lack of information, it provides evidence that it works and, as we will see in the next subsection, that is indeed the case.

4.4 RAID utilization management
In our experiments so far, we have been sending requests to disks manually in a striping fashion instead of using an actual RAID device. That was done for evaluation purposes. Here, we use a (software) RAID 0 device and instead evaluate QBox indirectly. The RAID configuration consists of two disks with a chunk size of 4KB to match our previous experiments, while requests have a size of 8KB.

In the first experiment we focus on the throughput achieved by two (semi-)sequential streams as we vary their desired rates. Moreover, we add a random stream to make it more realistic and challenging. We fix the target rate of the random stream since otherwise it would have a variable effect on the sequential streams' throughput and make the evaluation uncertain.

Figure 13: Using four disks and desired rates that shift over time, the rates are still achieved quickly under semi-sequential and random workloads. [Plots: achieved stream rates per 75ms window for streams A-D with shifting targets, and the corresponding random request cost estimate.]

As long as the throughput achieved by each of the sequential streams varies in a linear fashion, we are able to conclude that our method works. Indeed, from Figure 14, stream A starts with a target rate of 0.5 and goes down to 0, while stream B moves in the opposite direction. As the throughput of stream A goes down, the difference is provided to stream B. Moreover, in Figure 16 we see that having two random streams and a sequential one fixed at 50% (not plotted) produces similar behavior. The difference between those two cases is the drop in the total throughput in the first case, with streams A and B having a lower throughput when their rates get closer to each other. That is due to the more balanced number of requests being executed from each sequential stream, leading to a greater number of seeks between them. Since seeks are relatively expensive compared to the typical sequential request, the overall throughput drops slightly. If that effect was not observed in Figure 14, then the random stream (C) would be getting a smaller amount of the storage time, which would go against its performance targets. Instead, the random stream throughput remains unchanged.


Figure 14: Using RAID 0 the throughput achieved by each sequential stream is in agreement with their target rates. Stream A has a varied target rate from 50% to 0 and the opposite for B. Random stream C requires a fixed rate of 50% of the storage time. Similarly for a large disk queue depth (NCQ). [Plots: throughput (IOPS) of streams A, B (sequential), their total, and C (random) for rate pairs from (0.5, 0) to (0, 0.5), with no NCQ and with NCQ 31.]

On the other hand, in Figure 16 there is no drop in the total throughput, which is expected since the cost of seeks between random requests is similar to the typical cost of a random request. Therefore, the total throughput remains constant. Moreover, the sequential stream (not plotted) reaches an average throughput of 3600 and 5060 IOPS with a depth of 1 and 31, respectively, as in Figure 14. Finally, note that whether we use no NCQ or a depth of 31, the throughput behavior is similar in both figures (14 and 16), which is desired since a large depth can provide a higher throughput in certain cases [26], along with other benefits such as reducing power consumption [24].

4.5 Evaluation using traces
To strengthen our evaluation, besides synthetic workloads we run QBox using two different days of the Deasna2 [3] trace as two of the three read streams, while the third stream sends random requests. Deasna2 contains semi-sequential traces of email and workloads from Harvard's division of engineering and applied sciences. As the requests wait to be dispatched, we classify them as either sequential or random depending on the other requests in their queue.

Unlike time, evaluating a method by comparing throughput values is hard because the achieved throughput depends on the stream workloads. However, by looking at the throughput achieved using QBox in Figure 17 and the results of throughput-based scheduling in Figure 18, it is easy to conclude that QBox provides a significantly higher degree of isolation and that the target rates of the streams are respected well enough. Moreover, looking more closely at Figure 17, we see that wherever the throughput is not in perfect accordance with the targets of streams A and B, there is an increase of random requests coming from the same streams. That effect is valid and due to the trace itself. On the other hand, Figure 18 demonstrates the destructive interference inherent in throughput-based reservation schemes, with semi-sequential streams receiving a very low throughput.


Figure 15: The throughput with an NCQ of 1 and 31 is maintained while the desired rates vary. [Plots: total throughput (IOPS) of (mostly) sequential and (mostly) random streams in Exp. F with three disks, for rate pairs from (S:0.3, R:0.7) to (0.7, 0.3), with disk queue depths 1 and 31.]


Figure 16: Using RAID 0 the throughput of each random stream is in agreement with its target. Stream A has a varied target rate from 50% to 0 and B from 0 to 50%. The sequential stream (not plotted) has a utilization of 50%, leading to an average of 3600 and 5060 IOPS with no NCQ and a depth of 31, respectively.

Figure 17: QBox provides streams of real traces the throughput corresponding to their rate close enough, even in the presence of a random stream. [Plot: throughput (IOPS) over time for trace streams A and B (semi-seq., 50%-0%), random stream C (0%-100%), and the combined random and average throughput of A+B.]

Figure 18: Random stream C affects throughput-scheduling, leading to a low throughput for A and B. [Plot: throughput (IOPS) over time for the same streams under throughput-based scheduling.]

4.6 Caches
So far our experiments have skipped the file system cache, to more easily evaluate our method and to send requests asynchronously, since without O_DIRECT they become blocking requests. Although applications such as databases may avoid file system caches, we are interested in QBox being applicable in a general setting. For our purposes, request completions resulting from cache hits could be ignored or accounted differently. From our experiments, detection of random cache hits seems reliable, and the well-known relation (as explained in [7]) between the queue size and the average latency may also be useful to improve accuracy, as may grey-box methods [1]. However, we cannot say the same for sequential requests due to prefetching. Moreover, since random workloads may cover a large segment of the storage, hits are not as likely. Hence, in this paper we treat hits as regular completions for simplicity.

Without modifying QBox, we enable the file system cache and see from Figure 19 that although the throughput is noisier than in the previous experiments due to the nature of cache hits, we still manage to achieve throughput rates that are in accordance with the target rates. Scheduling based on throughput (Figure 20) gives similar results to Figure 18, supporting our position on throughput-based scheduling. Finally, using synthetic workloads we get an output of the same form as Figure 14 with a maximum sequential throughput of 1700 IOPS (figure omitted).

4.7 Overhead
The computational overhead is trivial. We know the most urgent request in each stream queue, and thus picking the next request to dispatch requires as many operations as the number of streams. Since the number of streams is expected to be low, that cost is trivial. In addition, on a request completion we increase a fixed number of counters, and at the end of each window we compute a fixed, small number of intersections. The time it takes to compute each intersection is insignificant. Finally, updating the moving estimate only requires computing the new estimate weight. In total, the procedure at the end of each window takes less than 10μs.



Figure 19: The throughput achieved by QBox in the presence of caches follows the target rates. [Plot: throughput (IOPS) over time for trace streams A and B (semi-seq., 50%-0%), random stream C (0%-100%), and the A+B random and average throughput, with caches enabled.]

Figure 20: Throughput-based scheduling fails to isolate stream performance, leading to a low throughput for semi-sequential streams. [Plot: throughput (IOPS) over time for the same streams with caches and throughput-based scheduling.]

5. RELATED WORK
A large body of literature exists related to providing guarantees over storage devices. Typically, these solutions either aim to satisfy throughput or latency requirements, or attempt to proportionally distribute throughput. Most solutions do not distinguish between sequential and random workloads, which leads to the storage being under-utilized. Avoiding that distinction also leads to charging semi-sequential streams unfairly, due to the significant cost difference between sequential and random requests. Instead, QBox uses disk service time rather than IOPS or Bytes/s to solve that problem.

Stonehenge [8] clusters storage systems in terms of bandwidth, capacity and latency; however, being based on bandwidth, its reservations cover only a fraction of the disk performance. Other proposed solutions based on bandwidth include [11, 15, 2], which take advantage of the relation (as was later explained in [7]) between the queue length and average latency to throttle requests. mClock [6] does not provide performance insulation, while both [6] and pClock [5] do not differentiate between sequential and random requests. Other solutions such as [9, 11, 15, 16] do not provide insulation either. On the other hand, Argon [22] provides insulation; however, workload changes may affect its provided soft bounds. In [14], distribution-based QoS is provided to a percentage of the workload to avoid over-provisioning. [12] attempts to predict response times rather than service times through statistical models. PARDA [4] provides guarantees by assuming a specific scheduler resides on each host, unlike QBox, which does not assume access to the hosts/clients.

Facade [15] aims to provide performance guarantees described by an SLA for each virtual storage device. It places a virtual store controller between a set of hosts and storage devices in a network and throttles client requests so that the devices do not saturate. In particular, it adjusts the queue size dynamically, which affects the latency of each workload. However, a single set of low latency requests may decrease the queue size of the system, and it is hard to determine whether a new workload may be admitted.

YouChoose [27] tries to provide the performance of reference storage systems by measuring their performance off-line and mimicking it online. It is based on an off-line machine-learning process, similar to [23], which can be hard to prepare due to the difficulty of selecting a representative set of training data. Moreover, the safe admission of new virtual storage devices can be challenging.

Solutions based on execution time estimates, such as [18, 17, 10], assume low-level control over each hard drive. Moreover, Horizon [18] showed that such a solution can be used in distributed storage systems with single-disk nodes running the Horizon scheduler. Our work is also based on disk-time utilization and deadline assignment; however, we treat the storage device as a black box and therefore do not assume our own scheduler sits in front of every hard drive. Finally, [25, 20] reserve I/O rates using worst-case execution times and can therefore only reserve a fraction of the storage device's time.

6. CONCLUSIONS

In this paper, we targeted the problem of providing isolation and performance guarantees in terms of storage device utilization to multiple clients with different types of workloads. We proposed a "plug-n-play" method for isolating the performance of clients accessing a single file-level storage device treated as a black box. Our solution is based on a novel method for estimating the expected execution times of sequential and random requests, as well as on assigning deadlines and scheduling requests using the Earliest Deadline First (EDF) scheduling algorithm. Our experiments show that QBox provides isolation between streams with different characteristics and changing needs, and that it does so on storage systems with a variable number of disks.
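As a hypothetical illustration of the scheme summarized above, deadlines can be spaced by the estimated request cost divided by the stream's utilization target and requests served in EDF order; the names and the exact spacing rule below are assumptions for illustration, not the precise formulas used by QBox.

    from dataclasses import dataclass

    @dataclass
    class Stream:
        target: float             # utilization target, e.g. 0.3 for 30% of device time
        last_deadline: float = 0.0

    def assign_deadline(stream, now, cost_estimate):
        # Spacing deadlines by estimated service time / utilization target means
        # that meeting every deadline keeps the stream near its share of device time.
        stream.last_deadline = max(stream.last_deadline, now) + cost_estimate / stream.target
        return stream.last_deadline

    # Example: with a 5 ms estimate for a random request, a stream reserved 30% of
    # the device receives a deadline about 16.7 ms after the previous one, while a
    # 0.1 ms sequential request on a 70% stream advances its deadline by about
    # 0.14 ms. Requests from all streams are then dispatched in EDF order.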

There are multiple directions for future work. Extensions include support for SSDs, based on the cost difference between reads and writes, as well as hybrid systems. Adding support for writes and RAID levels 4 and 5 is another direction. Technical improvements include a better use of the request history when computing estimates and more sophisticated methods to detect sudden and stable changes. Finally, we would like to verify that QBox works on Network Attached Storage, test it at the hypervisor level in a virtualized environment, and explore the case where there are multiple controllers and storage devices.

7. REFERENCES

[1] A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau. Information and control in gray-box systems. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, SOSP '01, pages 43–56, New York, NY, USA, 2001. ACM.


[2] D. D. Chambliss, G. A. Alvarez, P. Pandey, D. Jadav, J. Xu, R. Menon, and T. P. Lee. Performance virtualization for large-scale storage systems. In Proceedings of the 22nd International Symposium on Reliable Distributed Systems (SRDS '03), pages 109–118, 2003.

[3] D. Ellard, J. Ledlie, P. Malkani, and M. Seltzer. Passive NFS tracing of email and research workloads. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies, FAST '03, pages 203–216, Berkeley, CA, USA, 2003. USENIX Association.

[4] A. Gulati, I. Ahmad, and C. A. Waldspurger. PARDA: proportional allocation of resources for distributed storage access. In Proceedings of the 7th Conference on File and Storage Technologies, pages 85–98, Berkeley, CA, USA, 2009. USENIX Association.

[5] A. Gulati, A. Merchant, and P. J. Varman. pClock: an arrival curve based approach for QoS guarantees in shared storage systems. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '07, pages 13–24, New York, NY, USA, 2007. ACM.

[6] A. Gulati, A. Merchant, and P. J. Varman. mClock: handling throughput variability for hypervisor IO scheduling. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, pages 1–7, Berkeley, CA, USA, 2010. USENIX Association.

[7] A. Gulati, G. Shanmuganathan, I. Ahmad, C. Waldspurger, and M. Uysal. Pesto: online storage performance management in virtualized datacenters. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC '11, pages 19:1–19:14, New York, NY, USA, 2011. ACM.

[8] L. Huang, G. Peng, and T.-c. Chiueh. Multi-dimensional storage virtualization. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '04/Performance '04, pages 14–24, New York, NY, USA, 2004. ACM.

[9] W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. SIGMETRICS Perform. Eval. Rev., 32:37–48, June 2004.

[10] T. Kaldewey, T. Wong, R. Golding, A. Povzner, S. Brandt, and C. Maltzahn. Virtualizing disk performance. In Real-Time and Embedded Technology and Applications Symposium, RTAS '08, pages 319–330, April 2008.

[11] M. Karlsson, C. Karamanolis, and X. Zhu. Triage: Performance differentiation for storage systems using adaptive control. Trans. Storage, 1:457–480, November 2005.

[12] T. Kelly, I. Cohen, M. Goldszmidt, and K. Keeton. Inducing models of black-box storage arrays. Technical Report HPL-SSP-2004-108, 2004.

[13] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. J. ACM, 20:46–61, January 1973.

[14] L. Lu, P. Varman, and K. Doshi. Graduated QoS by decomposing bursts: Don't let the tail wag your server. In 29th IEEE International Conference on Distributed Computing Systems, ICDCS '09, pages 12–21, June 2009.

[15] C. R. Lumb, A. Merchant, and G. A. Alvarez. Facade: Virtual storage devices with performance guarantees. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies, pages 131–144, Berkeley, CA, USA, 2003. USENIX Association.

[16] A. Merchant, M. Uysal, P. Padala, X. Zhu, S. Singhal, and K. Shin. Maestro: quality-of-service in large disk arrays. In Proceedings of the 8th ACM International Conference on Autonomic Computing, ICAC '11, pages 245–254, New York, NY, USA, 2011. ACM.

[17] A. Povzner, T. Kaldewey, S. Brandt, R. Golding, T. M. Wong, and C. Maltzahn. Efficient guaranteed disk request scheduling with Fahrrad. In Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008, EuroSys '08, pages 13–25, New York, NY, USA, 2008. ACM.

[18] A. Povzner, D. Sawyer, and S. Brandt. Horizon: efficient deadline-driven disk I/O management for distributed storage systems. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 1–12, New York, NY, USA, 2010. ACM.

[19] A. S. Povzner. Efficient guaranteed disk I/O performance management. PhD thesis, University of California at Santa Cruz, Santa Cruz, CA, USA, 2010. AAI3429522.

[20] L. Reuther and M. Pohlack. Rotational-position-aware real-time disk scheduling using a dynamic active subset (DAS). In Proceedings of the 24th IEEE Real-Time Systems Symposium (RTSS 2003), page 374. IEEE Computer Society, 2003.

[21] M. Spuri, G. Buttazzo, and S. S. S. Anna. Scheduling aperiodic tasks in dynamic priority systems. Real-Time Systems, 10:179–210, 1996.

[22] M. Wachs, M. Abd-El-Malek, E. Thereska, and G. R. Ganger. Argon: performance insulation for shared storage servers. In Proceedings of the 5th USENIX Conference on File and Storage Technologies, pages 5–5, Berkeley, CA, USA, 2007. USENIX Association.

[23] M. Wang, K. Au, A. Ailamaki, A. Brockwell, C. Faloutsos, and G. R. Ganger. Storage device performance prediction with CART models. SIGMETRICS Perform. Eval. Rev., 32:412–413, June 2004.

[24] Y. Wang. NCQ for power efficiency. White paper, February 2006.

[25] T. M. Wong, R. A. Golding, C. Lin, and R. A. Becker-Szendy. Zygaria: storage performance as a managed resource. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS '06), pages 125–134, 2006.

[26] Y. J. Yu, D. I. Shin, H. Eom, and H. Y. Yeom. NCQ vs. I/O scheduler: Preventing unexpected misbehaviors. Trans. Storage, 6:2:1–2:37, April 2010.

[27] X. Zhang, Y. Xu, and S. Jiang. YouChoose: Choosing your storage device as a performance interface to consolidated I/O service. Trans. Storage, 7:9:1–9:18, October 2011.
