
Enabling Cost-Effective Flash based Caching

with an Array of Commodity SSDs

Yongseok Oh, University of Seoul
Eunjae Lee, University of Seoul
Choulseung Hyun, University of Seoul
Jongmoo Choi, Dankook University
Donghee Lee, University of Seoul
Sam H. Noh, UNIST (http://next.unist.ac.kr)

ABSTRACT

SSD based cache solutions are being widely utilized to improve performance in network storage systems. With a goal of providing a cost-effective, high performing SSD cache solution, we propose a new caching solution called SRC (SSD RAID as a Cache) for an array of commodity SSDs. In designing SRC, we borrow both the well-known RAID technique and the log-structured approach and adopt them into the cache layer. In so doing, we explore a wide variety of design choices such as flush issue frequency, write units, forming stripes without parity, and garbage collection through copying rather than destaging that become possible as we make use of RAID and a log-structured approach at the cache level. Using an implementation in Linux under the Device Mapper framework, we quantitatively present and analyze results of the design space options that we considered in our design. Our experiments using realistic workload traces show that SRC performs at least 2 times better in terms of throughput than existing open source solutions. We also consider cost-effectiveness of SRC with a variety of SSD products. In particular, we compare SRC configured with MLC and TLC SATA SSDs and a single high-end NVMe SSD. We find that SRC configured as RAID-5 with low-cost MLC and TLC SATA SSDs generally outperforms that configured with a single high-end SSD in terms of both performance and lifetime per dollar spent.

Categories and Subject Descriptors

D.4.2 [Storage Management]: Storage hierarchies; Secondary storage

General Terms

Design, Performance, Experimentation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Middleware’15, December 07-11, 2015, Vancouver, BC, Canada
© 2015 ACM. ISBN 978-1-4503-3618-5/15/12 ...$15.00

DOI: http://dx.doi.org/10.1145/2814576.2814814

Keywords

Flash Based Caching, SSD Array, Log-Structured Caching Layout, Cost-effective Storage

1. INTRODUCTION

NAND flash memory based Solid State Drives (SSDs) are being used as data caches to improve performance of back-end storage such as DAS, NAS, and SAN in computer systems [6, 18, 27, 29, 40, 43]. SSD cache solutions are expected to attain SSD-like performance at HDD-like price. For example, the Seagate Momentus XT integrates raw flash memory chips into hard disks for desktop environments to boost boot-up time and application loading [47]. For enterprise configurations, storage vendors are providing customers with high performance solutions based on PCI-e interface SSDs [15].

Besides the performance and price aspects, one important trait of the SSD cache is non-volatility, with which data can survive power failures or host side crashes. This trait also appears to allow one to employ the write-back policy as a high performance SSD cache solution. However, in reality, if the SSD cache is the only entity with up-to-date data and the SSD fails, then data will be lost. To avoid such risky scenarios, applying RAID techniques to the SSD cache has been considered. In practice, EnhanceIO [11] and HGST ServerCache [17] use RAID-1, while LSI ElasticCache [36] and HP SmartCache [19] use RAID-4 or -5.

In this paper, we propose a new caching solution called SRC (SSD RAID as a Cache) for an array of commodity SSDs with the goal of providing a high performing, yet cost-effective solution. In designing such a system, we borrow from both the well-known RAID technique and the log-structured approach [42, 44]. RAID systems have traditionally been deployed for primary storage, providing benefits such as high bandwidth and reliability, while the log-structured approach has been known to perform well for flash based storage [28, 34]. We adopt these two well-known technologies into the SSD cache layer by making various design and optimization choices. In so doing, we take into account a technological trend of modern SSDs, in particular, the growth in the erase group size, which can informally be defined as the write unit at which performance is optimized and sustained in modern SSDs [22, 26, 34, 50]. We find that this size is becoming larger, and due to this trend, adoption into an array of SSDs is not obvious.

Using our implementation in the Linux kernel 3.11.7 under the Device Mapper (DM) framework, we explore a number of design choices in SRC through a variety of experiments. For example, we consider combining garbage collection and destaging based on the hotness of the data and the current utilization of the system. We also consider distinguishing clean and dirty data when forming a stripe in RAID, as clean data in a cache may be lost without compromising reliability. SRC is compared with Bcache5 and Flashcache5, that is, Bcache and Flashcache with experimental settings where the RAID-5 configured underlying SSD cache layer is used. Results show that SRC is at least 2 times better than these solutions in terms of performance. We also compare SRC configured with MLC and TLC SATA SSDs and that configured with a single high-end NVMe SSD (without parity). We find that SRC configured as RAID-5 with low-cost TLC SATA SSDs is generally superior in terms of performance per dollar, while that configured with MLC SATA SSDs is superior in terms of lifetime per dollar spent.

The remainder of this paper is organized as follows. In Section 2, we first review the internal workings of contemporary flash based SSDs. Then, we review existing work related to this study. In Section 3, we specifically look into SSD cache solutions that are available as open source. We look at the limitations of these works and issues that could influence the performance of SSD based caching solutions. In Section 4, based on the findings of the previous section, we present the design of the SRC scheme. In Section 5, we describe the implementation of SRC and present experimental results looking at the various parameter choices that are made in SRC. Comparative evaluations with RAID-5 accommodated versions of Bcache and Flashcache and an evaluation of the cost-effectiveness issue are also given in Section 5. Finally, Section 6 concludes the paper.

2. BACKGROUND AND MOTIVATION

In this section, we first describe the internal workings of modern flash memory based SSDs. Then, we review some of the work that is related to our study, specifically, SSD based caching and RAID.

2.1 SSD Architecture

Today's commercial SSDs use NAND flash memory as their storage medium. Flash memory is composed of memory cells where bits are stored. The cells are organized in page units, in which the basic read and program operations are performed. A set of pages then constitutes a block, which is the unit in which the erase operation is performed. The number of pages in a block typically ranges between 32 and 512. The blocks are organized in planes. Then, two or more planes comprise a die, which typically forms a flash package. Some packages have more than one die, and depending on the number of dies per package, these are termed single die package (SDP), double die package (DDP), and quadruple die package (QDP) [24, 33].

For high performance and to support large capacity, state-of-the-art SSDs integrate multiple flash chips. The chips, along with SRAM, DRAM, and an internal processor, are interconnected through a bus. The SRAM stores data for the processor and the DRAM buffers the flash memory data. Multiple read/write requests stored in DRAM can be served simultaneously along multiple buses and chips when possible. Generally, the size of the DRAM buffer is determined based on the internal parallelism that is to be supported. Firmware called the Flash Translation Layer (FTL) controls the workings among the various components of the SSD. To optimize performance, SSD designers go through considerable effort to design the FTL firmware so that it can make full use of all the available hardware resources and features such as the chips, buses, and buffers through interleaving and parallelism [20, 38]. SSDs are connected to hosts through an interface such as SATA or PCI-e like any other disk.

To increase density, manufacturers have made efforts to shrink the silicon feature size and store more bits per flash memory cell. These efforts have been realized in 1x-nm processing, MLC and TLC flash memory, and 3D NAND technologies. However, higher density has resulted in increased negative traits in terms of performance, cell endurance, data retention time, and energy consumption [7, 8, 9, 21, 31]. Furthermore, these negative characteristics are manifested even further with increased program/erase cycles [1, 16]. At the most basic level, flash controllers employ ECC engines to help remedy these issues. The FTL firmware detects the blocks with faulty pages that exceed the number of correctable bits designated by ECC and adopts a bad block management algorithm. Various other advanced techniques are used to alleviate data retention errors and other reliability issues [7, 8, 21, 31, 41].

2.2 Related Work

In this section, we review three aspects that are related to our work. The first is reducing the effect of garbage collection in SSDs. Several studies have addressed I/O alignment to minimize or eliminate garbage collection (GC) in SSDs. Lee et al. present an LFS-like file system called F2FS for flash storage. This file system sequentializes random writes to minimize internal SSD GC operations [28]. Previously, work by Hyun et al. and Kim et al. has also considered changes to the Linux file system and scheduler to align writes with flash erase blocks [22, 26].

Next, we review studies on SSD based caching. Tang et al. propose RIPQ (Restricted Insertion Priority Queue), a novel SSD caching framework to mimic advanced cache algorithms (e.g., sLRU and GDSF) [50]. Specifically, they first reemphasize the fact that the erase group size has increased to up to 512MB to achieve sustained performance and improve lifetime for cutting edge SSDs. For SSD caching, the size of the replacement unit must be enlarged accordingly, leading to hit ratio reduction. Consequently, RIPQ implements priority queues on SSDs and operates on top of these priority queues to increase the hit ratio and minimize write amplification of the SSD cache. Also, they gather clean data in the memory buffer and store the data and the metadata together in SSDs for cached data recovery. However, RIPQ does not support the write-back policy as the focus is mainly on read intensive data such as photos.

Saxena et al. suggest a cache architecture, namely FlashTier, where the cache layer is embedded into the FTL of the SSD [45]. To communicate with the operating system, they provide well-defined interfaces to SSDs to exploit their characteristics and to reduce cache block management costs. With such a design, FlashTier can incorporate cache replacement with garbage collection in the FTL and also provide a mechanism called silent eviction that allows for simple invalidation of cold data instead of GC. Added benefits such as improved memory efficiency by offloading the mapping table to the FTL and avoiding re-filling of hot data from slow disks upon crash through logging of metadata of cached data in the OOB (out-of-band) area are provided. Interestingly, the ideas presented through simulations were also realized through OpenSSD based real prototype implementations [46].

Table 1: Prototype development environment

  Type             Specifics
  OS               Ubuntu 13.10 (Linux-3.11.7)
  Processor        Intel Xeon CPU (E5-2640)
  RAM              Samsung DDR3 PC3-10600 32GB
  SSD Cache        4 × Samsung 840 Pro 128GB
  Primary Storage  HDD based 1Gbps iSCSI storage (RAID10) composed of 8 2TB 7.2K RPM disks

Another recent study was presented by Li et al. [30]. Here, Li et al. propose a new cache architecture called Nitro for primary storage systems that combines both compression and de-duplication techniques to further increase caching capacity, leading to increased hit ratio. They report that deduplication and compression improve hit ratio by up to 19% and 9%, respectively, and hence, deduplication is more effective than compression because primary workloads involve many identical contents. Also, they suggest a new replacement unit, WEU (Write Evict Unit), that aligns with the SSD block size to improve SSD lifetime. Specifically, Nitro checks whether the incoming file extent is the same as the content that is stored in the SSD cache. Then, the extent is compressed and packed into a WEU along with the metadata. When Nitro runs out of space, it makes new WEUs through eviction to primary storage to avoid data migration from/to SSDs. These ideas are validated with commodity SSDs and with simulation based customized SSDs. The results show that performance and lifetime benefits can be achieved by a combination of capacity and flash optimizations.

The work that we present in this paper is similar to these studies in that we also exploit the use of large replacement sizes to fully take advantage of SSDs' sustained performance. However, our work is unique in that we explore a cost-effective caching solution by making use of an array of commodity SSDs that does not require elaborate new interfaces.

Another line of work on SSD based caching has been on the use of SSD caching in networked storage systems. Byan et al. introduce a host-side cache approach where a write-through SSD cache is deployed in the enterprise Hypervisor environment [6]. To overcome consistency limitations of the write-back caching scheme, Koller et al. provide consistent write-back policies [27]. Holland et al. study various configurations and policies for host-side caches [18]. Unlike the study by Koller et al., they insist that the write-through policy is more reasonable than the write-back policy.

Qin et al. propose write-back based caching policies that integrate the general cache flush command to ensure data integrity on primary storage [43]. The workings of the policies are based on the client side SSD failure models. For example, when the flash device is assumed to be unrecoverable, the policy flushes all dirty data from the SSD to primary storage upon an application flush command. Our work is different from the study by Qin et al. in that SRC bundles cached data together with metadata and parity in large segment units and then distributes the segment among all SSDs to protect them from disruptive SSD failures. This allows SRC to avoid flushing the data to primary storage.

Table 2: FIO benchmark 4KB write performance comparison between write-through (WT) and write-back (WB) policies with a single SSD (MB/s)

  Type        WT    WB     Improvement (unit: times)
  Bcache      15.3  65.9   4.3
  Flashcache  5.7   100.3  17.5

Table 3: Impact of flush command

  Workload    No flush  flush   Reduction (unit: times)
  Sequential  402MB/s   96MB/s  4.1
  Random      249MB/s   30MB/s  8.3

The final aspect related to our work is RAID. RAID has been introduced to provide reliable storage [42]. RAID-5, in particular, is a cost-effective and reliable storage system that has been widely deployed. RAID-5, though effective for large sequential writes, suffers from the small write problem. That is, when a small write request arrives, the modified data and its related parity must be updated through either the read-modify-write or reconstruct-write mechanism. These operations incur extra read and write operations, leading to significant performance degradation.

To overcome the small write problem, many techniques have been introduced. Stodolsky et al. develop a parity logging scheme that sequentially logs the updated parity into a log region [49]. The up-to-date parities are moved to their original locations when the log region runs out of space. Wilkes et al. propose AutoRAID, a hybrid approach where frequently updated data are stored in RAID-1, while infrequently updated data are stored in RAID-5 so that the parity overhead of RAID-5 is minimized [51]. Our study is similar to these studies in that a log-structure is employed to reduce parity overhead. However, our study differs in that our primary use of RAID is as a reliable cache and not as primary storage. By exploiting the fact that RAID is being used as a cache and the characteristics of the SSDs, instead of HDDs, that comprise RAID, we are able to come up with various mechanisms to enhance the effect of the cache.

Finally, this paper is an extension of our previous work [40] where we presented our basic idea with experiments conducted via DiskSim-based simulations. In this paper, we present new observations and enhancements that were made while implementing our scheme with real SSDs. Naturally, we also report new experimental results that were obtained from the Device Mapper based prototype that we implemented in Linux.

3. STUDY OF EXISTING SOLUTIONS

In this section, we discuss in detail two existing open source SSD cache solutions, namely, Bcache and Flashcache. We look at the limitations of these works and issues that could influence the performance of SSD based caching solutions.

3.1 Existing SSD Cache Solutions

In this section, we present observations made from experiments on two well-known open source flash memory cache solutions. Specifically, Flashcache is a Facebook solution that has been publicly open since 2011, while Bcache is a solution introduced by Overstreet in 2010 and included in Linux since 2013 [4, 13].

Figure 1: Performance comparison of Bcache and Flashcache on various RAID levels (y-axis: performance in MB/s; bars: RAID-0, RAID-1, RAID-4, and RAID-5)

Flashcache is designed to optimize read performance by using a set-associative policy to map data in SSDs. Specifically, it divides caching space into multiple sets, which in turn consist of 4KB blocks. (The set size ranges from 128KB to 4MB with a default size of 2MB.) Flashcache keeps metadata and cached data in separate partitions. To maintain the consistency of modified data in the SSD cache, it writes the metadata of dirty data to the SSD cache, while, for clean data, it keeps the metadata only in memory. As a result, loading clean data into Flashcache from the underlying primary storage is accomplished with a clean data write to the cache and an in-memory update. Flashcache always ignores flush commands from the upper layer and immediately returns acknowledgements in order to improve performance. However, this approach is vulnerable to file system inconsistency.
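To make the placement policy concrete, the minimal sketch below shows how a set-associative SSD cache of this kind maps a block address to a set. The constants and the simple modulo hash are illustrative assumptions, not code taken from Flashcache.

    /*
     * Minimal sketch of set-associative placement in a Flashcache-style
     * cache (illustrative assumptions, not actual Flashcache code).
     */
    #include <stdint.h>

    #define CACHE_BLOCK_SIZE  4096u                /* 4KB cache block       */
    #define SET_SIZE          (2u * 1024 * 1024)   /* default set size: 2MB */
    #define BLOCKS_PER_SET    (SET_SIZE / CACHE_BLOCK_SIZE)

    /* Map a 512-byte sector address to the set it may be cached in. */
    static inline uint32_t sector_to_set(uint64_t sector, uint32_t num_sets)
    {
        uint64_t block = (sector * 512) / CACHE_BLOCK_SIZE;  /* 4KB block number */
        return (uint32_t)(block % num_sets);
    }

A lookup then only needs to scan the BLOCKS_PER_SET entries of that one set, which is what keeps the metadata search cheap.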

Bcache partitions caching space into multiple buckets, whose size is 4MB by default. The bucket size can range from 4KB to 16MB. To optimize random writes, Bcache collects small write requests in memory and writes them to a bucket sequentially. Also, to optimize metadata management, Bcache employs a B+tree-like index mechanism. When a write miss occurs in the cache, Bcache first writes dirty data to the cache, and then logs metadata into the journal area with a flush command. Later, metadata in the journal area are copied to their locations in the metadata area. By virtue of this journaling mechanism, we expect Bcache to outperform Flashcache, particularly for random write workloads. Like Flashcache, Bcache keeps metadata only in memory for clean data to minimize management cost. As a result, cached clean data disappears upon power or SSD failure.

To improve performance for random writes on SSDs, both Bcache and Flashcache employ a write-back-like policy, but with a parameter, termed writeback_percent and dirty_thresh_pct, respectively, that controls the dirty data ratio, that is, the ratio of dirty data that may reside in the cache. (These values are set to 10% and 20% for Bcache and Flashcache, respectively, by default.) Higher ratios reflect policies closer to write-back, though more data may be lost if the cache is lost. While Bcache destages dirty data immediately when the dirty data ratio exceeds writeback_percent, Flashcache is more tolerant to missing the threshold. Hence, for busy systems using Flashcache, the dirty data ratio may actually exceed dirty_thresh_pct. However, write-through mode is still recommended by default due to the possibility of dirty data loss upon SSD failure.

Figure 2: Increased Erase Group Unit in Samsung 840 Pro SSD (y-axis: throughput in MB/s; x-axis: write request size; one curve per OPS setting from 0% to 50%)

Table 4: Comparison of Storage Devices

  Metric         SSD-A (SATA 3.0)      SSD-B (PCI-e Gen 3.0)
  Capacity (GB)  128   256   512       400    800    1600   2000
  Price ($)      129   206   435       922    1398   3796   4250
  SR (MB/s)      530   540   540       2700   2800   2800   2800
  SW (MB/s)      390   520   520       1080   1900   1900   2000
  RR (KIOPS)     97    100   100       450    460    450    450
  RW (KIOPS)     90    90    90        75     90     150    175

To compare the write-back and write-through policies, we perform a set of experiments under the setting presented in Table 1. This setting will be used throughout the experiments undertaken for this paper, with one SSD or four SSDs being employed depending on the experiment of interest. For these experiments, we make use of the FIO (Flexible I/O) benchmark, which measures 4KB I/O performance with a Uniform Random distribution [12]. The Uniform Random distribution generates random write requests that evenly access all blocks. The request size, iodepth, and number of threads are set to 4KB, 32, and 4, respectively, and the workload is generated for 10 minutes.

Table 2 shows the bandwidth results, where WT and WB denote the write-through and write-back policy, respectively. The results show that write-back outperforms write-through by up to 17 times. Even though Bcache employs a log-structured write scheme, it performs worse than Flashcache for the write-back policy due to the flush command.

To more clearly quantify the impact of the flush command, we perform a simple, separate set of experiments using random and sequential workloads with and without issuing flush commands. For the sequential workload, a flush command is issued after every 512KB, while for the random workload a flush command is issued for every 32 4KB requests. We see from the results in Table 3 that with the flush command, performance deteriorates considerably.

In summary, the results here tell us 1) that the write-back policy is beneficial in terms of performance and 2) that the flush command used with write-through is the key source of the performance bottleneck in caching systems using commodity SSDs. Hence, the flush command should be used sparingly and judiciously.

Table 5: Comparison of SSD Caching Solutions

  Name                      SSD Type   Erase Group  Write-Back  flush         Clean Data   RAID     Open
                                       Alignment    Support     Optimization  Persistency  Support  Source
  FlashTier [45]            Custom     Yes          Yes         N/A           Yes          No       No
  Nitro [30]                Custom     Yes          Yes         N/A           Yes          No       No
  Reliable Write-Back [43]  Commodity  No           Yes         Yes           Yes          No       No
  RIPQ [50]                 Commodity  Yes          No          N/A           Yes          No       No
  FlashCache [13]           Commodity  No           Yes         No            No           No       Yes
  Bcache [4]                Commodity  No           Yes         No            No           No       Yes
  SRC                       Commodity  Yes          Yes         Yes           Yes          Yes      Yes

3.2 Exploiting RAID for SSD Caching

As just seen, using the write-back policy is beneficial. However, previous studies have shown that SSDs are unreliable as flash memory wears out [16]. Data retention issues, read disturbs, and other SSD vulnerabilities may also be reason enough to just dismiss the write-back policy for SSD based caches [7, 8, 21, 31].

To resolve this issue, Flashcache suggests a mirroring volume (e.g., RAID-1) containing two SSDs [14]. Bcache also claims that it will provide enhanced data reliability through mirroring of dirty data and metadata in the future [5]. In this paper, we first take these approaches and also explore other combinations of RAID levels (e.g., RAID-4 and -5). With RAID, faults can be tolerated with redundancy, and maintainability improves through on-the-fly replacement of failed (or about-to-fail) drives. RAID also provides performance benefits by utilizing disk parallelism. Hence, RAID is a natural option to be explored.

To see the effect of RAID on existing SSD cache solutions, we set up the experiments so that Bcache and Flashcache individually make use of the RAID configured underlying SSD cache layer. The base experiment setup is the same as the single SSD cache case (Table 1), but this time making use of four 128GB SSDs connected to the host system through SATA 3.0 ports. All configurations use the write-back policy for caching. Specifically, RAID-0 stripes data among four SSDs without data redundancy such that the cache capacity is 4×128GB. We regard this option as the superior configuration even though data is not protected from failures. RAID-1 mirrors data and thus, cache capacity is 2×128GB, which is half the total capacity. For the RAID-4 and -5 configurations, three quarters of the SSDs are for data while one quarter is for parity and thus, cache capacity is 3×128GB. In RAID-4, one SSD is dedicated to storing parity, while parities are evenly distributed in RAID-5. For the workload, we use the FIO benchmarks.

Figure 1 shows the bandwidth (MB/s) results. It shows that the RAID-0 configuration performs the best compared to other configurations due to no redundancy. The RAID-1 configuration shows roughly halved performance because half the SSDs are utilized as mirrored space. For the RAID-0 and RAID-1 configurations, Bcache is worse than Flashcache due to the flush command. With parity-based configurations (e.g., RAID-4 and -5), Flashcache shows degraded performance due to parity updates. In contrast, Bcache shows better performance compared to Flashcache as it employs the log-structured approach.

In summary, the results here show that RAID can enhance data reliability, but hurt performance. Also, the log-structured approach provides a nice opportunity to eliminate read-modify-write operations. Finally, we once again find that the flush command is harmful to performance.

3.3 Erase Group Size and Cost

The effect of garbage collection (GC) on SSD performance is considerable. Hence, studies have been made to minimize the cost of GC. Previous studies have addressed this issue by generating write requests to an SSD that are sufficiently large such that flash blocks are allocated together and erased together [22, 26, 34, 50]. The particular size choice is referred to as the erase group size. Recent studies have shown that the erase group size is becoming large.

We also perform a set of experiments to extract the erase group size from the commodity SSD used in our prototype. This is done for OPS (Over-Provisioned Space) sizes ranging from 0% to 50% for each experiment set [39]. Note that increasing the OPS size has the effect of reducing the usable storage space of the SSD.

Figure 2 shows the results. The y-axis in this figure is the throughput (MB/s), while the x-axis is the write request size. As the size of each write request reaches 256MB, the bandwidth of the SSD reaches roughly 400MB/s independent of the OPS size. This tells us that the erase group size is 256MB, which we use throughout our experiments later.
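A rough way to reproduce this kind of measurement is sketched below: sequentially write a fixed volume of data with progressively larger request sizes and report the sustained bandwidth at each size. The device path, the 4GB volume per data point, and the omission of OPS control and preconditioning are assumptions for illustration; this is not the tool used to produce Figure 2.

    /* Sketch: estimate the erase group size by measuring sustained
     * sequential-write throughput for increasing request sizes.
     * Run as root; /dev/sdX is a hypothetical test device that will
     * be overwritten. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        const char *dev = "/dev/sdX";                 /* hypothetical device   */
        const unsigned long long total = 4ULL << 30;  /* 4GB per data point    */
        size_t req_sizes_mb[] = { 2, 8, 32, 128, 256, 512 };

        int fd = open(dev, O_WRONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        for (size_t i = 0; i < sizeof(req_sizes_mb) / sizeof(req_sizes_mb[0]); i++) {
            size_t req = req_sizes_mb[i] << 20;
            void *buf;
            if (posix_memalign(&buf, 4096, req)) return 1;
            memset(buf, 0xA5, req);

            double start = now_sec();
            for (unsigned long long written = 0; written < total; written += req)
                if (pwrite(fd, buf, req, (off_t)written) < 0) {
                    perror("pwrite");
                    return 1;
                }
            fsync(fd);
            double secs = now_sec() - start;

            printf("req=%zuMB  %.1f MB/s\n", req_sizes_mb[i],
                   (double)(total >> 20) / secs);
            free(buf);
        }
        close(fd);
        return 0;
    }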

Finally, we discuss the issue of cost. One of the goals of this project is to provide a cost-effective caching solution. For this, we gather the price and performance values of commodity SSDs of a product line from two particular manufacturers, which we summarize in Table 4. The performance numbers are from specifications obtained from the manufacturers' websites, while the price is the cheapest available on the web that we could find. Here, we find that the price of each SSD is proportional to its capacity but the performance is nearly the same within the same product line. We also find that the key deciding factor for price is the host interface. The cost of a PCI-e product is substantially higher than that of a SATA product. This is because the host interface is the performance limiting factor in modern SSDs.

Flash cache solutions are already available as end products. However, in reality, they are expensive because they use proprietary techniques. Our goal is to provide a cost-effective cache solution balancing performance and price by making use of off-the-shelf commodity SSDs. The performance/cost findings presented above tell us that aggregating multiple low-cost, low-capacity SSDs may be better than purchasing a single high-end, high-cost product so long as the aggregate performance of the former can match that of the latter [42]. This also provides added support for the use of the RAID approach that we undertake.

Figure 3: Architecture and Data Layout of SRC. (a) Architecture; (b) Data Layout, where block types are D (dirty data), C (clean data), MS (segment metadata, start), ME (segment metadata, end), and P (parity).

Finally, we summarize the various flash solutions that are found in the literature and their properties in comparison to our SRC scheme in Table 5.

4. SSD RAID AS A CACHE

In this section, we present our host-side SSD cache solution, which we call SRC (SSD RAID as a Cache). Our design goals are to provide an SSD cache solution using an array of commodity SSDs that 1) is cost-effective, 2) employs the write-back policy for high performance, and 3) is optimized for contemporary SSDs with a large erase group size.

Specifically, for cost-effectiveness, SRC comprises an array of low-cost commodity SSDs that use a general block interface such as SATA as the caching device. Also for cost-effectiveness, SRC maintains a DRAM-only buffer to pack clean as well as dirty data for segment writes, even though many storage vendors employ an NVRAM-based buffer atop SSDs to keep dirty data safe. The issue regarding safety of dirty data is discussed below.

For high performance, we employ the write-back policy. When using write-back for a cache, one must consider the tradeoff between durability and performance. The flush command could be used to enhance durability, but as discussed previously, this will result in performance degradation. Unlike Flashcache, which ignores the flush command, and Bcache, which issues the command for every metadata write, we take an approach similar to the one taken by Qin et al. [43].

As previously discussed, properly determining the erase group size has a considerable effect on the performance and lifetime of SSDs. In SRC, to align the write size with the large erase group size of modern SSDs, we borrow the log-structured scheme that stores data blocks along with metadata and parity blocks all at once in SSDs when constructing stripes [32, 35, 51]. This has the positive effect of not only removing all read-modify-write operations that are detrimental to RAID performance, but also enhancing the lifetime of SSDs [2, 39]. To avoid inconsistency with data blocks, we calculate the checksum of the data and store it in metadata blocks. The metadata block is itself also checksummed. We describe this in more detail in the next section.

Figure 3 shows the architecture and data layout of SRC. SRC provides a caching solution on top of primary storage connected through a network interface (e.g., the iSCSI protocol). In the following subsections, we discuss in detail the workings of SRC.

4.1 Segment Group: A Log Based Approach

Segment Group Layout: Cache space in SRC is determined by the number of SSDs M that comprise the array of SSDs and the size S of each SSD as shown in Figure 3(a). (From here on, our discussion will be made in the context of our implementation, that is, M = 4 and S = 128GB.) In SRC, this total space is divided into N Segment Groups (SGs) as shown in Figure 3(b)(1). The SG size is determined by the erase group size of the underlying SSDs, which is 256MB in our case. Then, an SG is 1GB as we have 4 SSDs. Each SG is also divided into smaller units called segments, across the SSDs. A segment is composed of 512KB units from all 4 SSDs. (The 512KB is an implementation choice made as it is the largest unit in which data can be transferred to the storage device.) Thus, the segment size is 512KB × 4, that is, 2MB in our case.

Writes in SRC are made to SGs. For maximum performance that makes full use of the erase group size, the write unit should be 1GB. However, as this may be too large in a realistic setting, SRC writes in segment units, that is, in 2MB units. Metadata and parity information is included in a segment write as shown in Figures 3(b)(2) and (3). We refer to the SG to which writes are made as the active SG, and there is only one active SG at any time.
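The arithmetic behind these numbers is summarized in the short sketch below. The constants follow the configuration described above (4 SSDs, a 256MB erase group, and 512KB per-SSD transfer units), while the helper name and the exact per-SSD placement are illustrative assumptions rather than SRC's actual code.

    /* Layout arithmetic for SRC-style Segment Groups (illustrative sketch). */
    #include <stdint.h>

    #define NUM_SSDS          4ULL
    #define ERASE_GROUP_SIZE  (256ULL << 20)   /* per-SSD erase group: 256MB   */
    #define CHUNK_SIZE        (512ULL << 10)   /* per-SSD transfer unit: 512KB */

    #define SG_SIZE           (NUM_SSDS * ERASE_GROUP_SIZE)  /* 1GB per SG      */
    #define SEGMENT_SIZE      (NUM_SSDS * CHUNK_SIZE)        /* 2MB per segment */
    #define SEGMENTS_PER_SG   (SG_SIZE / SEGMENT_SIZE)       /* 512 segments    */

    /* Byte offset, on every SSD, of segment `seg` of Segment Group `sg`:
     * each segment contributes one 512KB chunk per SSD, and all chunks of an
     * SG stay within that SSD's 256MB erase group. */
    static inline uint64_t segment_offset_on_ssd(uint64_t sg, uint64_t seg)
    {
        return sg * ERASE_GROUP_SIZE + seg * CHUNK_SIZE;
    }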

In our implementation, the very first SG is used to hold the superblock that records important information such as the magic number, creation time, device size, and so on. The superblock is only 4KB in size and is never modified; hence, we fill the rest of the space with dummy data and set this SG to be read-only. When free SGs run out, SRC generates empty SGs by performing GC as in LFS [44]. We will discuss how SRC performs this operation in more detail later in Section 4.2.

Segment Buffer: SRC keeps an in-memory structure called the segment buffer, whose size is the same as the segment size. SRC maintains two separate segment buffers, one for clean data and one for dirty data. (Note Figure 3(a).) Upon write requests, SRC collects data in the segment buffer for dirty data. When this segment buffer fills up, the entire buffer is written to an unused segment in the active SG. This write may be done with or without a flush command. We discuss this matter in detail later. Then, write requests from the upper layer are acknowledged when the Host-side Cache returns its acknowledgement to SRC.

In the case of read misses, SRC first fetches the requested data from primary storage to the temporary staging buffer, at which time an acknowledgement is sent to the upper layer. The data in the staging buffer is later moved to the segment buffer for clean data, and later, when this segment buffer is full, SRC performs a full segment write to an unused segment.
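The buffering behavior just described can be sketched as follows. The structure and helper below are illustrative assumptions (the real implementation lives inside a Device Mapper target); the point is simply that 4KB blocks accumulate in RAM and are written out only as full 2MB segments.

    /* Illustrative sketch of an SRC-style segment buffer (not actual SRC code). */
    #include <stdint.h>
    #include <string.h>

    #define SEGMENT_SIZE  (2u << 20)   /* 2MB segment */
    #define BLOCK_SIZE    4096u        /* 4KB blocks  */

    struct segment_buffer {
        uint8_t  data[SEGMENT_SIZE];
        uint64_t lba[SEGMENT_SIZE / BLOCK_SIZE];  /* back-pointer per block */
        uint32_t used;                            /* bytes currently filled */
    };

    /* Hypothetical helper: write the buffered data, its metadata (MS/ME), and
     * parity as one striped segment into the active Segment Group. */
    void write_segment_to_active_sg(const struct segment_buffer *sb);

    /* Append one 4KB block; flush as a full segment once the buffer fills.
     * SRC keeps one such buffer for dirty data and one for clean data. */
    void segbuf_append(struct segment_buffer *sb, uint64_t lba, const void *blk)
    {
        memcpy(sb->data + sb->used, blk, BLOCK_SIZE);
        sb->lba[sb->used / BLOCK_SIZE] = lba;
        sb->used += BLOCK_SIZE;

        if (sb->used == SEGMENT_SIZE) {
            write_segment_to_active_sg(sb);
            sb->used = 0;
        }
    }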

Partial Segment: Timeouts are used to enhance durability of dirty data. A timeout occurs when no write has arrived for TWAIT time, which is set to 20 microseconds in this study, since the last write that incurred a segment write. However, the segment buffer for dirty data may not be full upon a timeout. This is similar to the situation that arises in LFS [44]. Note that the issue of partial writes does not occur for the clean data segment buffer as clean data brought into the buffer have copies in primary storage and can be lost without consequences.

flush Command Control: Previously, we explained that SRC durability is enforced by making use of the flush command issued by the upper layer. To further ensure data and metadata consistency, SRC provides means of issuing flush commands at two particular points; specifically, at every segment write point or at every SG write point, that is, when the current active SG is full and SRC is about to move on to a new active SG. We discuss the implications of these choices with experimental results later in Section 5.
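The two issue points can be expressed as a small piece of control logic, sketched below; issue_flush() is a hypothetical stand-in for the block-layer flush that SRC would send to the cache SSDs, not SRC's actual interface.

    /* Illustrative sketch of the two flush-issue policies. */
    enum flush_mode { FLUSH_PER_SEGMENT, FLUSH_PER_SEGMENT_GROUP };

    void issue_flush(void);   /* hypothetical: flush the volatile caches of all SSDs */

    /* Called after each 2MB segment has been submitted to the active SG. */
    void after_segment_write(enum flush_mode mode, int active_sg_is_full)
    {
        if (mode == FLUSH_PER_SEGMENT)
            issue_flush();                    /* durability point after every 2MB    */
        else if (mode == FLUSH_PER_SEGMENT_GROUP && active_sg_is_full)
            issue_flush();                    /* durability point once per 1GB SG    */
    }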

Metadata management: In each segment, SRC stores the metadata along with the data and parity as depicted in Figures 3(b)(2) and (3). The metadata is an extension of the LFS summary structure [28, 44]. Specifically, to verify the validity of the summary information, SRC maintains the checksum, signature, and version information of the metadata blocks. Also, for every data block, the LBA and the checksum information is kept in the metadata blocks.

To enable fast recovery and drive scaling, we distribute the metadata blocks across all SSDs. Specifically, SRC maintains dedicated metadata blocks to indicate the corresponding data blocks stored in each SSD. Also, metadata blocks are stored at both the beginning and the end of a segment. We denote the former as MS and the latter as ME.

SRC also maintains an in-memory mapping table to quickly translate logical addresses to physical addresses. The memory overhead for this, in our implementation, is a 16 byte entry per 4KB, which is roughly 0.3% of the storage capacity. Specifically, for a 512GB SSD, the requirement would be roughly 1.5GB. We believe this is acceptable for today's servers; the system used for our experiments has 32GB RAM.

Failure Handling: Let us now describe the recovery process of the on-SSD metadata blocks. Upon a power failure, SRC scans the metadata blocks to restore consistency. If the generation numbers of MS and ME match, SRC considers the whole segment as being consistent. Otherwise, it searches the partial metadata blocks or discards the segment and returns it as a free segment.

Besides drive failure handling, SRC also takes into account silent data corruption [3]. To ensure data integrity, SRC compares the original and calculated checksums when reading data from SSDs. If a checksum mismatch is found, SRC recovers the data through parity calculations or by re-fetching the data from primary storage, depending on the state of the data.

4.2 Free Space Reclamation

In a typical cache, when space runs out, a reclaiming process is invoked, generally referred to as garbage collection (GC). Generally, this involves evicting (or destaging) selected victim blocks from the cache to primary storage. This process is called S2D (SSD to Disk) GC in this paper. Another way to perform GC in systems that take the log-structured approach for storage management is to migrate valid data as is done in LFS [44]. In particular, SRC, which is also log-structure based, can reclaim Segment Groups (SGs) by migrating data. As this movement of data is among SSDs, we refer to this form of GC as S2S (SSD to SSD) GC [44].

At first glance, since we are employing these policies at the cache level, S2D GC seems more effective than S2S GC because S2S GC needs to copy the valid data in the to-be-cleaned SGs to an empty SG, while S2D GC simply removes clean data or writes dirty data to primary storage, which must eventually happen anyway. Also, S2D may possibly improve SSD performance as the utilization of the SSDs is lowered. However, S2D does not take into account the hotness of the data in the cache. If the data to be removed or destaged are hot, keeping them in the cache in spite of the GC cost may be beneficial. This is the main benefit that could be brought in by S2S GC.

Specifically, SRC provides two GC policies. One is pure S2D GC. The other is a variant of S2S GC that we refer to as Sel-GC. Sel-GC applies S2S or S2D depending on the cache utilization and the hotness of the victim blocks. Sel-GC monitors cache utilization and, if utilization exceeds UMAX, it triggers S2D GC to reclaim space in the cache; S2D GC is the only means used when utilization is higher than this threshold. However, if cache utilization is below UMAX, it employs S2S GC in a selective manner. Specifically, for dirty data and hot clean data, SRC performs S2S GC. However, clean cold data is simply discarded as moving it may be for naught. Hotness of data is determined by a per-page based bitmap stored in RAM.
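The per-block decision that Sel-GC makes can be summarized as in the sketch below; the helper name and the way the threshold is passed in are illustrative, not SRC's actual code.

    /* Illustrative sketch of the Sel-GC decision for a victim block. */
    #include <stdbool.h>

    enum gc_action { GC_DESTAGE_S2D, GC_COPY_S2S, GC_DISCARD };

    enum gc_action sel_gc_decide(double cache_util, double u_max,
                                 bool is_dirty, bool is_hot)
    {
        if (cache_util > u_max)
            return GC_DESTAGE_S2D;  /* over the threshold: reclaim via S2D
                                       (dirty blocks are destaged to primary
                                        storage, clean blocks simply dropped) */
        if (is_dirty || is_hot)
            return GC_COPY_S2S;     /* keep useful data: copy into an empty SG */
        return GC_DISCARD;          /* clean and cold: drop it, a copy exists
                                       on primary storage                      */
    }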

SRC also provides two victim SG selection policies, namely,FIFO and Greedy. Specifically, the FIFO policy simply se-lects the victim SG in the order in which SGs are used. Thisis a simple policy that does well on SSDs as it always makessequential writes. The Greedy policy, on the other hand,chooses the least utilized SG among the SGs to minimizeGC cost. We will compare the performance of these policiesin Section 5.

4.3 Clean Data Redundancy

Though parity is an essential part of RAID, another option becomes possible when RAID is used as a cache. Specifically, as clean data in the cache is a copy of data in primary storage, redundancy for clean data within the cache is not required. Therefore, another design choice that can be made is to omit the parity for clean data in the cache. With this choice, without compromising robustness, come two advantages: one is reduced parity write overhead and the other is increased cache capacity.

In SRC, the administrator can select one of two modes, namely Parity for Clean (PC) and No-Parity for Clean (NPC). (See Figure 3(b)(3).) In PC mode, SRC writes parities for clean data, while in NPC mode no parities are kept. Hence, in NPC mode, clean data will disappear when an SSD fails, resulting in degraded read performance until hot clean data is reloaded to the SSD cache from primary storage. In contrast, in PC mode, the caching service is not disrupted by SSD failure as clean data can be recovered on the fly with the parity information. We will compare and discuss the quantitative performance implications of these modes in Section 5.

Table 6: Characteristics of Trace Sets

  Set    Trace   Req Size (KB)  Size (GB)  Read Ratio (%)
  Write  prxy0   7.07           84.44      3
         exch9   21.06          110.46     31
         mds0    9.59           11.08      29
         mds1    9.59           11.08      29
         stg0    11.95          23.16      31
         msn0    21.73          31.28      6
         msn1    17.84          37.80      44
         src12   29.25          53.23      16
         src20   7.59           11.28      12
         src22   56.31          62.12      36
  Mixed  rsrch0  9.07           12.41      11
         exch5   18.02          85.628     31
         hm0     8.88           33.84      32
         fin0    6.86           34.91      19
         web0    15.29          29.60      58
         prn0    12.53          66.79      19
         msn4    21.73          31.28      6
  Read   ts0     9.28           15.95      26
         usr0    22.81          48.694     72
         proj3   9.75           20.87      87
         src21   59.31          37.20      99
         msn5    10.01          124        75

Table 7: SRC Design Space (defaults marked with *)

  Parameter                Values
  Erase Group Size (MB)    2, ..., 64, 128, 256*, 512
  Free Space Management    S2D, Sel-GC*
  UMAX for Sel-GC          10% to 95% (90%*)
  Victim Selection Policy  FIFO, Greedy*
  Clean Data Redundancy    PC, NPC*
  RAID Level               0, 4, 5*
  Flush Command Control    Per Segment, Per SG*

5. SRC IMPLEMENTATION AND RESULTS

In this section, we first present the experimental platform that we use to implement and experiment with SRC. Then, we present experimental results regarding the design space issues that we discussed in the previous section. We also compare SRC with RAID-5 accommodated versions of Bcache and Flashcache and consider the cost-effectiveness issue of SRC.

5.1 Experiment Platform

To implement SRC, we modify Akira's DM-Writeboost [10] as a block-level caching solution running under the Device Mapper (DM) framework distributed with the Linux kernel 3.11.7. Specifically, DM-Writeboost is designed for write caching using a single device and hence, thousands of lines of code have been modified to implement our SRC scheme. We plan to stabilize SRC and make it available as open source (https://bitbucket.org/yongseokoh/dm-src).

Figure 4: Impact of Erase Group Size, where UMAX is 90%. (a) Throughput (MB/s); (b) I/O Amplification; results shown for the Write, Mixed, and Read workloads with erase group sizes from 2MB to 1024MB.

In our current SRC implementation, the Segment Group size is set to 1GB as we have 4 SSDs, each with a 256MB erase group size, in a RAID configuration. The segment size is set to 2MB such that a Segment Group is divided into 512 segments. For fair and consistent measurement, we initialize the SSDs with the TRIM command before we conduct the experiments. To quickly reach the steady state where GC is triggered in the SSDs, we sequentially fill the SSDs with dummy data until only the OPS size remains [48]. For our experiments that make use of 128GB SSDs, our OPS size is set to 2GB such that 126GB are sequentially filled with dummy data. Of these 126GB, we utilize only 18GB as our cache space to expedite our experiments.

To evaluate long-term performance, we make use of several block-level traces from the Microsoft Production Server (MPS) and Microsoft Cambridge Server (MCS) [25, 37] that are summarized in Table 6. The Exchange trace from MPS is a random I/O workload obtained from the Microsoft employee e-mail server. This trace is composed of 9 volumes, of which we use the traces of volumes 5 and 9, denoted exch5 and exch9, respectively. The MSN trace, also from MPS, is extracted from 4 RAID-10 volumes on an MSN storage back-end file store. We use the traces of volumes 0, 1, 4, and 5, denoted msn0, msn1, msn4, and msn5, respectively. All other traces are from MCS. The MCS traces are those collected from 36 volumes in 16 servers of a small scale data center. Each volume is based on RAID-1 for a boot volume or RAID-5 for a data volume, and the traces cover a one week period. We chose some of these traces to be included in our experiments. The concatenated name and number, such as prxy0, refer to the server (proxy server) the trace was collected from and the volume number (0 in this example) of the trace.

As shown in Table 6, the traces are organized into three groups, namely the Write, Mixed, and Read groups, whose names represent their main characteristics. The specific traces from MPS and MCS were chosen so that the working set of each of the trace groups would be similar in size; specifically, roughly 50GB. Finally, since the workloads are traces and SRC is a real working system in Linux, we develop a trace replay tool (https://bitbucket.org/yongseokoh/trace-replay) that generates real I/O requests based on the workload traces to run the experiments. Each group of traces is run separately, but with all traces within each group running simultaneously and each trace being replayed by four threads. Unless otherwise stated, all results reported for the experiments are those accumulated over 10 minutes of execution. Hence, note that each experimented system will be at a different stage of execution when the experiments are terminated and the numbers reported.

Table 8: Free Space Management Performance (MB/s) (numbers in parentheses refer to I/O amplification)

  GC Scheme      S2D                      Sel-GC
  Victim Policy  FIFO        Greedy       FIFO        Greedy
  Write          301 (1.26)  312 (1.28)   522 (1.74)  507 (1.65)
  Mixed          491 (1.34)  466 (1.34)   581 (1.55)  547 (1.53)
  Read           480 (1.15)  596 (1.14)   619 (1.24)  725 (1.21)

Figure 5: Impact of UMAX on Sel-GC, where the Erase Group Size is 256MB. (a) Throughput (MB/s); (b) I/O Amplification; results shown for the Write, Mixed, and Read workloads with UMAX set to 30, 50, 70, 90, and 95%.

The key evaluation metrics that we observe are the throughput (MB/s), which is used to represent the overall performance of the caching system, and I/O amplification, which is used to evaluate how much the I/O requests are amplified at the cache layer. The latter is calculated by dividing the I/Os observed at the cache layer by the I/Os actually requested. We also consider the cost-effectiveness of the various solutions.
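Restating this definition as a formula:

    I/O amplification = (I/Os observed at the cache layer) / (I/Os requested by the workload)

For example, if a workload issues 100 4KB requests and the cache layer ends up performing 165 4KB I/Os (including metadata, parity, and GC traffic), the I/O amplification is 1.65.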

5.2 Exploration of Design Space

We now consider the effect of the design space parameters for SRC that are summarized in Table 7.

Erase Group Size: To evaluate the impact of the erase group size on performance, we vary the size from 2MB to 1024MB while other parameters are set to their default values. Figure 4 shows that performance improves as the erase group size increases. Among the large sizes, performance differs slightly depending on the type of workload. Overall, the 256MB size that we chose for SRC based on the observation made in Section 3.3 looks appropriate. In contrast, I/O amplification is minimized when the size is 2MB as a small unit is more likely to be fully utilized.

Free Space Management: Let us turn our attention to free space management. There are two policy choices that need to be made for free space management, that is, the policy for GC and the policy for choosing the victim of GC. For GC, we compare the S2D and Sel-GC policies, whose results are obtained with UMAX set to 90% and shown in Table 8. The results show that Sel-GC considerably outperforms S2D for both the FIFO and Greedy victim selection policies. This shows that conserving hot data via S2S copying during GC has a substantial positive effect on performance. In contrast, S2D shows smaller I/O amplification as it makes fewer copies among the SSDs.

Table 9: PC and NPC Mode Performance (MB/s) (numbers in parentheses refer to I/O amplification)

  Trace  Parity for Clean (PC mode)  No-Parity for Clean (NPC mode)
  Write  431.13 (1.80)               507.89 (1.65)
  Mixed  520.95 (1.56)               547.36 (1.53)
  Read   669.67 (1.26)               725.95 (1.21)

Table 10: RAID Level Performance (MB/s) (numbers in parentheses refer to I/O amplification)

  Trace  RAID-0         RAID-4         RAID-5
  Write  650.49 (1.30)  482.05 (1.63)  507.89 (1.65)
  Mixed  686.30 (1.21)  521.06 (1.51)  547.36 (1.53)
  Read   790.74 (1.08)  699.31 (1.20)  725.95 (1.21)

As for victim selection, Table 8 shows that the FIFO andGreedy policies give-and-take based on the workload and GCscheme used. Overall, the FIFO policy slightly outperformsthe Greedy policy for the Write and Mixed workloads. Thereason for this is that when the workload is write intensivethe FIFO policy has a natural e↵ect of pushing cold blocksto front of the FIFO queue as hot rewritten blocks will beadded to the back end because of the log-structured natureof SRC. In contrast, we see that the Greedy policy showsbetter performance than the FIFO policy by up to 1.24 timesfor the Read workload because the Greedy policy identifiesthe least utilized SGs and evicts them through GC.

UMAX Threshold: We now consider the effect of UMAX on Sel-GC. Recall that Sel-GC makes use of S2S when utilization is below UMAX and S2D when above. Figure 5 shows that the performance of SRC increases with UMAX, peaking at 90%, then dropping off as UMAX is increased further. This shows that effort to keep hot data in the SSDs is worthwhile, but being too aggressive can be harmful. Note, however, that I/O amplification also increases with UMAX.
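
The threshold test itself is simple; the following sketch shows the decision we assume Sel-GC makes, with illustrative names.

    /* Sel-GC decision sketch: copy hot blocks SSD-to-SSD (S2S) while cache
     * utilization is below UMAX, otherwise destage to the backing store (S2D).
     * umax is a fraction, e.g. 0.90 for the 90% default used here.            */
    enum gc_action { GC_S2S, GC_S2D };

    static enum gc_action select_gc_action(double cache_utilization, double umax)
    {
        return (cache_utilization < umax) ? GC_S2S : GC_S2D;
    }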

Clean Data Redundancy: We now consider the effect of not keeping parity for clean data, that is, the use of NPC mode. Table 9 presents measurement results showing NPC mode outperforming PC mode for all three workloads. In particular, NPC mode performs much better than PC mode for the Write workload, improving performance by around 18%. The reason NPC mode is more effective with write intensive workloads is that the small amount of space saved has a compounding effect as data is continuously written, whereas for read intensive workloads data sits still once written.
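
For illustration, the sketch below shows one way the NPC-mode decision could look when a stripe is closed: parity is computed only if the stripe contains dirty data, since clean data can always be re-read from primary storage after an SSD failure. The stripe width, segment size, and structure are illustrative assumptions, not the actual SRC data structures.

    #include <stdbool.h>
    #include <string.h>

    #define STRIPE_WIDTH 3          /* illustrative: 3 data segments + parity on 4 SSDs */
    #define SEG_BYTES    (2u << 20) /* illustrative 2MB segment size                     */

    struct stripe {
        unsigned char data[STRIPE_WIDTH][SEG_BYTES];
        unsigned char parity[SEG_BYTES];
        bool          has_dirty;    /* true if any block in the stripe is dirty          */
    };

    static void close_stripe_npc(struct stripe *s)
    {
        if (!s->has_dirty)
            return;                 /* clean-only stripe: skip parity (NPC mode) */

        /* Dirty stripe: XOR the data segments to form the parity segment. */
        memset(s->parity, 0, SEG_BYTES);
        for (int d = 0; d < STRIPE_WIDTH; d++)
            for (unsigned i = 0; i < SEG_BYTES; i++)
                s->parity[i] ^= s->data[d][i];
    }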

RAID Protection Level: We now consider the different RAID protection levels, specifically, RAID-0, -4, and -5. Table 10 shows the effect of the RAID levels for the three workload types, with the other parameters set to their defaults. As expected, RAID-0, which has no redundancy, shows the best performance as it fully utilizes all SSDs. We see that

Table 11: Influence of flush Command on Performance as Command is Issued per Segment Write and per Segment Group Write (MB/s) (Numbers in parentheses refer to I/O amplification)

Trace   Per Segment     Per Segment Group
Write   462.53 (1.60)   507.89 (1.65)
Mixed   480.74 (1.52)   547.36 (1.53)
Read    418.03 (1.23)   725.95 (1.21)

Table 12: Comparison of SATA and NVMe SSDs

Metric            Company A         Company B         Company C
NAND Type         MLC      TLC      MLC      TLC      MLC
Host Interface    SATA 3.0          SATA 3.0          NVMe
Endurance         3K       1K       3K       1K       3K
Capacity (GB)     4×128    4×120    4×128    4×128    1×400
SSD(s) Cost ($)   418      272      374      225      469
GB/$              1.22     1.76     1.36     2.27     0.85
Year Released     2012     2013     2014     2014     2015

the performance of RAID-5 is off from that of RAID-0 by around 20%. Also, we see that RAID-5 slightly outperforms RAID-4, as distributing parity among all the SSDs has a smoothing effect on performance.

flush Command Control: Recall that in SRC the flush command can be issued per Segment write or per Segment Group write, aside from the flush commands issued from the upper layer. We observe how this choice affects performance for the workloads that we considered. The results in Table 11 show that, as expected, issuing the commands more frequently incurs considerable overhead. For the Write workloads, issuing flush commands per segment write degrades performance by approximately 10% compared to the default per-SG case. For the Read workloads, however, performance drops by more than 40% when the command is issued per segment write. This shows that frequent flush commands, though tied to writing data, also have a strong negative influence on reading data. A sketch of the two issue policies is given below.
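
In the sketch that follows, issue_flush() is a hypothetical placeholder for however the implementation forwards a flush command to the SSDs (for instance, a flush request sent from the Device Mapper target); it is not a real kernel API call.

    #include <stdbool.h>

    enum flush_policy { FLUSH_PER_SEGMENT, FLUSH_PER_SEGMENT_GROUP };

    /* Hypothetical stub: forward a flush command to every SSD in the array. */
    static void issue_flush(void) { }

    /* Called after each segment has been written; sg_complete indicates
     * whether the segment just written completed its Segment Group.       */
    static void on_segment_written(enum flush_policy p, bool sg_complete)
    {
        if (p == FLUSH_PER_SEGMENT)
            issue_flush();              /* flush after every segment write      */
        else if (sg_complete)
            issue_flush();              /* flush only when the whole SG is full */
    }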

5.3 Cost-effectiveness

As mentioned previously, one goal of SRC is to provide a cost-effective caching system. To consider this issue, in this section we evaluate how commodity SSDs that use new technology would fare with SRC. Specifically, we consider SRC with three types of SSDs, that is, SRC configured as RAID-5 with MLC and TLC SSDs, and SRC configured with a single high-cost, high-end NVMe SSD (hence, with no parity).

For the experiments, we take SSDs from three separate companies and run the same experiments conducted in Section 5.2 with all parameters set to their default values. Table 12 shows the costs of the SSDs (as obtained from the web). Note that for this set of experiments five different sets of SSDs were used, unlike the previous experiments. TLC SSDs are generally more economical in terms of capacity per dollar, as exemplified by their higher GB/$ values.

Figure 6 shows the results of our experiments, with the prefix letter of the legend referring to the originating company. (For example, A-MLC(SATA) refers to SRC with SATA based MLC products from company A.) Comparing SRC based on MLC and TLC SSDs, we see that the MLC SSD based configurations outperform the TLC SSD based

[Figure omitted: four panels, (a) Throughput (MB/s), (b) Lifetime (days), (c) Throughput per dollar ((MB/s)/$), and (d) Lifetime per dollar (days/$), each over the Write, Mixed, and Read workloads; legend: A-MLC(SATA), A-TLC(SATA), B-MLC(SATA), B-TLC(SATA), C-MLC(NVMe).]
Figure 6: Performance and Lifetime Comparisons of SATA SSD based SRCs and an NVMe SSD.

configurations in terms of performance (Figure 6(a)). However, except for the Mixed workload making use of company A products, the TLC SSD based configurations are better in terms of performance per dollar, as shown in Figure 6(c), which plots the throughput (in MB/s) attained per dollar spent.

Similarly, Figures 6(b) and (d) show the lifetime effect of the SSDs. The lifetime numbers shown in Figure 6(b) represent the expected days to live and are calculated using the lifetime estimation equation [23], based on the endurance values in Table 12 and assuming that 512GB of write data issued by the workloads is processed per day. The numbers given in Figure 6(d) are obtained from Figure 6(b) by taking into account the cost of the SSDs. For example, for A-MLC(SATA) with the Write workload, the value 2140 from Figure 6(b) is divided by $418, the cost, resulting in 5.12. For lifetime, we observe that SRC making use of MLC SSDs is always better than TLC based SRC.
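
For intuition, a simplified, back-of-the-envelope version of this estimate (our own approximation, not the exact model of [23], which additionally accounts for dynamic program and erase scaling) can be written as

\[
\text{Lifetime (days)} \approx \frac{C \cdot E}{W_{\text{day}} \cdot A},
\qquad
\text{Lifetime per dollar} \approx \frac{\text{Lifetime (days)}}{\text{Price (\$)}},
\]

where \(C\) is the aggregate SSD capacity, \(E\) the rated P/E endurance from Table 12, \(W_{\text{day}}\) the 512GB of writes issued per day, and \(A\) the write amplification incurred at the cache layer.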

We now compare the performance and endurance of SRC configured with a single high-cost, high-end NVMe SSD, represented by the rightmost bar in Figure 6, with the RAID-5 configurations. As a single NVMe SSD is used and no parity is maintained for protection against SSD failure, the overall performance of the NVMe SSD configuration is slightly better than that of the RAID-5 configurations (Figure 6(a)). However, in terms of cost-effectiveness the RAID-5 configurations are generally comparable, trading wins depending on the workload and type of SSD, as exemplified in Figure 6(c). In terms of lifetime, Figures 6(b) and (d) show that the RAID-5 configurations are superior to the NVMe SSD configuration. Note also that a single NVMe SSD, though superior in performance, is a fail-stop solution, making it vulnerable to failures, whereas on-the-fly replacement of failed SSDs is possible in the RAID-5 configurations.

5.4 SRC vs Existing Solutions

In this section, we compare the performance of SRC with Bcache and Flashcache deployed over RAID-5. (We do not consider RAID-4 as the results are similar.) Recall that Bcache and Flashcache are originally single-SSD solutions.

[Figure omitted: three panels, (a) Throughput (MB/s), (b) I/O Amplification, and (c) Hit Ratio, each over the Write, Mixed, and Read workloads; legend: SRC, SRC-S2D, Bcache5, Flashcache5.]
Figure 7: Comparison of SRC with Bcache5 and Flashcache5.

As with the settings in Section 3, the experimental setting is such that Bcache and Flashcache each make use of the RAID-5 configured underlying SSD cache layer. We denote Bcache and Flashcache with these settings as Bcache5 and Flashcache5, respectively. We also evaluate two versions of SRC: one with default settings, denoted SRC, and the other (denoted SRC-S2D) also with default settings except that the GC policy is S2D instead of Sel-GC. For Bcache5 and Flashcache5, we set the RAID chunk size to 4KB, which is optimal for 4K random write workloads. Also, the cache set size in Flashcache5 and the bucket size in Bcache5 are both set to 2MB, while writeback_percent in Bcache5 and dirty_thresh_pct in Flashcache5 are both set to 90%.

Figure 7(a) compares the performance of the four schemes. We see that SRC shows much better performance than the other schemes for all the workload groups. Specifically, SRC outperforms Bcache5 by 2.83, 2.92, and 3.09 times for the Write, Mixed, and Read groups, respectively, and Flashcache5 by 2.50, 2.75, and 2.34 times for the same groups. Though a direct comparison with Bcache and Flashcache may not be entirely fair, the results show that the features introduced in SRC bring substantial performance gains.

Of the four schemes, we find that Bcache5 performs the worst. The major reason behind the low performance of Bcache5 is its frequent issuing of the flush command. Note that, in contrast to Figure 1, Flashcache does better than Bcache here. The main reason is that, unlike the experiments of Figure 1 where all requests are 4KB in size, the I/O requests of the real workload traces are much larger. Hence, the negative effect of small writes is reduced considerably.

Comparing SRC and SRC-S2D, we see that SRC does better. However, SRC incurs higher I/O amplification, as shown in Figure 7(b), because Sel-GC copies hot data among the SSDs during GC. As a consequence, the hit ratio of Sel-GC is higher than that of S2D, as shown in Figure 7(c).

6. CONCLUSION

RAID systems have traditionally been deployed for primary storage, providing benefits such as high bandwidth and reliability, while NAND flash memory based SSDs are now being employed as data caches for various storage systems. In this paper, we looked into the workings of existing caching solutions and observed the technical trends of modern SSDs, in particular, the growth in the erase group size. Based on these observations, we designed and implemented SRC (SSD RAID as a Cache), an SSD-based RAID system to be used as a write-back cache for primary storage that aligns data on erase group size boundaries.

Using our implementation in the Linux kernel 3.11.7 under the Device Mapper (DM) framework, we explored a number of design choices in SRC through a wide set of experiments. We found that combining garbage collection and destaging based on the hotness of the data and the current utilization of the system resulted in enhanced performance. Also, distinguishing clean and dirty data when forming a stripe in RAID resulted in better management of the cache without compromising reliability. SRC was compared with Bcache5 and Flashcache5, that is, Bcache and Flashcache configured to use the RAID-5 configured underlying SSD cache layer. The results showed that SRC is substantially better than these solutions in terms of performance.

As future work, we are concentrating on adding features to SRC to ease management of SSDs for fast recovery and scalability, which is another goal of SRC. Our initial results are promising, and we expect to provide a stable means to expand or contract the number of SSDs in RAID-5 in a smooth and seamless manner while providing sustained performance. Other, more efficient policies are also being considered. For example, Sel-GC currently does not distinguish hot clean data from dirty data when using S2S; separating these two types of data is being considered, as are other victim SG selection policies. We also plan to implement advanced flash-based caching schemes (e.g., [30], [43], [45], and [50]) and compare them with SRC. Quantitative measurements of SRC based on state-of-the-art flash devices such as NVMe SSDs are also planned. Finally, as mentioned previously, we plan to stabilize SRC and soon make it available as open source.

7. ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their detailed comments that helped to improve the presentation of the paper. We also thank Heejin Park for providing useful comments. This research was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2012R1A2A2A01045733) and by the Seoul Creative Human Development Program (CAC15106). Sam H. Noh is the corresponding author of this paper.

8. REFERENCES

[1] M. Abraham. Architectural Considerations for Optimizing SSDs. In Proc. of Flash Memory Summit, 2014.
[2] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy. Design Tradeoffs for SSD Performance. In Proc. of USENIX ATC, 2008.
[3] L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. An Analysis of Data Corruption in the Storage Stack. In Proc. of USENIX FAST, 2008.
[4] Bcache. http://bcache.evilpiepirate.org.
[5] Bcache Mirroring Issue. https://lwn.net/Articles/497140/.
[6] S. Byan, J. Lentini, A. Madan, L. Pabon, M. Condict, J. Kimmel, S. Kleiman, C. Small, and M. Storer. Mercury: Host-Side Flash Caching for the Data Center. In Proc. of IEEE MSST, 2012.
[7] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai. Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis. In Proc. of DATE, 2012.
[8] Y. Cai, Y. Luo, S. Ghose, E. F. Haratsch, K. Mai, and O. Mutlu. Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation and Recovery. In Proc. of IEEE DSN, 2015.
[9] S. Cho, C. Park, Y. Won, S. Kang, J. Cha, S. Yoon, and J. Choi. Design Tradeoffs of SSDs: From Energy Consumption's Perspective. ACM TOS, 11(2), 2015.
[10] DM-Writeboost. https://github.com/akiradeveloper/dm-writeboost.
[11] EnhanceIO. https://github.com/stec-inc/EnhanceIO.
[12] FIO - Flexible IO Tester. http://git.kernel.dk/?p=fio.git.
[13] Flashcache. https://github.com/facebook/flashcache/.
[14] Flashcache Mirroring Issue. https://goo.gl/KVUH8u.
[15] Fusion-IO ioCache. http://www.fusionio.com/products/iocache.
[16] L. M. Grupp, J. D. Davis, and S. Swanson. The Bleak Future of NAND Flash Memory. In Proc. of USENIX FAST, 2012.
[17] HGST ServerCache. http://www.hgst.com/software/HGST-server-cache.
[18] D. A. Holland, E. Angelino, G. Wald, and M. I. Seltzer. Flash Caching on the Storage Client. In Proc. of USENIX ATC, 2013.
[19] HP SmartCache. http://www8.hp.com/us/en/products/server-software/product-detail.html?oid=5364342.
[20] Y. Hu, H. Jiang, D. Feng, L. Tian, H. Luo, and S. Zhang. Performance Impact and Interplay of SSD Parallelism Through Advanced Commands, Allocation Strategy and Data Granularity. In Proc. of ACM ICS, 2011.
[21] P. Huang, G. Wu, X. He, and W. Xiao. An Aggressive Worn-out Flash Block Management Scheme to Alleviate SSD Performance Degradation. In Proc. of ACM EuroSys, 2014.
[22] C. Hyun, Y. Oh, J. Choi, D. Lee, and S. H. Noh. A Performance Model and File System Space Allocation Scheme for SSDs. In Proc. of IEEE MSST, 2010.
[23] J. Jeong, S. S. Hahn, S. Lee, and J. Kim. Lifetime Improvement of NAND Flash-based Storage Systems Using Dynamic Program and Erase Scaling. In Proc. of USENIX FAST, 2014.
[24] M. Jung, E. Wilson, D. Donofrio, J. Shalf, and M. Kandemir. NANDFlashSim: Intrinsic Latency Variation Aware NAND Flash Memory System Modeling and Simulation at Microarchitecture Level. In Proc. of IEEE MSST, 2012.
[25] S. Kavalanekar, B. Worthington, Q. Zhang, and V. Sharda. Characterization of Storage Workload Traces from Production Windows Servers. In Proc. of IEEE IISWC, 2008.
[26] J. Kim, Y. Oh, J. Choi, D. Lee, and S. H. Noh. Disk Schedulers for Solid State Drives. In Proc. of ACM EMSOFT, 2009.
[27] R. Koller, L. Marmol, R. Rangaswami, S. Sundararaman, N. Talagala, and M. Zhao. Write Policies for Host-side Flash Caches. In Proc. of USENIX FAST, 2013.
[28] C. Lee, D. Sim, J. Hwang, and S. Cho. F2FS: A New File System for Flash Storage. In Proc. of USENIX FAST, 2015.
[29] E. Lee, Y. Oh, and D. Lee. SSD Caching to Overcome Small Write Problem of Disk-based RAID in Enterprise Environments. In Proc. of ACM SAC, 2015.
[30] C. Li, P. Shilane, F. Douglis, H. Shim, S. Smaldone, and G. Wallace. Nitro: A Capacity-Optimized SSD Cache for Primary Storage. In Proc. of USENIX ATC, 2014.
[31] R.-S. Liu, C.-L. Yang, and W. Wu. Optimizing NAND Flash-based SSDs via Retention Relaxation. In Proc. of USENIX FAST, 2012.
[32] J. Menon. A Performance Comparison of RAID-5 and Log-Structured Arrays. In Proc. of IEEE HPDC, 1995.
[33] Micron. http://www.micron.com/products/nand-flash/mlc-nand/mlc-nand-part-catalog/.
[34] C. Min, K. Kim, H. Cho, S.-W. Lee, and Y. I. Eom. SFS: Random Write Considered Harmful in Solid State Drives. In Proc. of USENIX FAST, 2012.
[35] K. Mogi and M. Kitsuregawa. Dynamic Parity Stripe Reorganizations for RAID5 Disk Arrays. In Proc. of PDIS, 1994.
[36] D. Mridha and L. Bert. Elastic Cache With Single Parity, 2014. US Patent App. 13/605,254.
[37] MSR Cambridge Traces. http://iotta.snia.org/traces/388.
[38] E. H. Nam, B. Kim, H. Eom, and S. L. Min. Ozone (O3): An Out-of-Order Flash Memory Controller Architecture. IEEE TC, 60(5), May 2011.
[39] Y. Oh, J. Choi, D. Lee, and S. H. Noh. Caching Less for Better Performance: Balancing Cache Size and Update Cost of Flash Memory Cache in Hybrid Storage Systems. In Proc. of USENIX FAST, 2012.
[40] Y. Oh, J. Choi, D. Lee, and S. H. Noh. Improving Performance and Lifetime of the SSD RAID-based Host Cache Through a Log-structured Approach. In Proc. of ACM INFLOW, 2013.
[41] H. Park, J. Kim, J. Choi, D. Lee, and S. H. Noh. Incremental Redundancy to Reduce Data Retention Errors in Flash-based SSDs. In Proc. of IEEE MSST, 2015.
[42] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proc. of ACM SIGMOD, 1988.
[43] D. Qin, A. D. Brown, and A. Goel. Reliable Writeback for Client-side Flash Caches. In Proc. of USENIX ATC, 2014.
[44] M. Rosenblum and J. K. Ousterhout. The Design and Implementation of a Log-Structured File System. ACM TOCS, 10(1), 1992.
[45] M. Saxena, M. M. Swift, and Y. Zhang. FlashTier: A Lightweight, Consistent and Durable Storage Cache. In Proc. of ACM EuroSys, 2012.
[46] M. Saxena, Y. Zhang, M. M. Swift, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Getting Real: Lessons in Transitioning Research Simulations into Hardware Systems. In Proc. of USENIX FAST, 2013.
[47] Seagate Momentus XT. http://www.seagate.com/docs/pdf/datasheet/disc/ds_momentus_xt.pdf.
[48] SNIA Solid State Storage (SSS) Performance Test Specification (PTS) Enterprise Version 1.1.
[49] D. Stodolsky, G. Gibson, and M. Holland. Parity Logging Overcoming the Small Write Problem in Redundant Disk Arrays. In Proc. of ISCA, 1993.
[50] L. Tang, Q. Huang, W. Lloyd, S. Kumar, and K. Li. RIPQ: Advanced Photo Caching on Flash for Facebook. In Proc. of USENIX FAST, 2015.
[51] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID Hierarchical Storage System. ACM TOCS, 14(1), 1996.