Austere Flash Caching with Deduplication and Compression
Qiuping Wang*, Jinhong Li*, Wen Xia#, Erik Kruus^, Biplob Debnath^, Patrick P. C. Lee*
*The Chinese University of Hong Kong (CUHK)   #Harbin Institute of Technology, Shenzhen   ^NEC Labs



  • Austere Flash Caching with Deduplication and Compression

    Qiuping Wang*, Jinhong Li*, Wen Xia#, Erik Kruus^, Biplob Debnath^, Patrick P. C. Lee*

    *The Chinese University of Hong Kong (CUHK)   #Harbin Institute of Technology, Shenzhen

    ^NEC Labs

    1

  • Flash Caching

    Ø Flash-based solid-state drives (SSDs)
      • ✓ Faster than hard disk drives (HDDs)
      • ✓ Better reliability
      • ✗ Limited capacity and endurance

    Ø Flash caching
      • Accelerates HDD storage by caching frequently accessed blocks in flash

    2

  • Deduplication and Compression

    Ø Reduce storage and I/O overheads

    Ø Deduplication (coarse-grained)
      • Operates in units of chunks (fixed- or variable-size)
      • Computes a fingerprint (FP), e.g., SHA-1, from chunk content
      • References identical logical chunks (same FP) to a single physical copy

    Ø Compression (fine-grained)
      • Operates in units of bytes
      • Transforms chunks into fewer bytes (a write-path sketch follows below)

    3
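    To make the deduplication-plus-compression write path concrete, the following C++ sketch (illustrative only) mirrors the LBA-index/FP-index mappings described later in this deck; the fingerprint function is a stand-in for SHA-1, the compressor is a placeholder, and none of the names come from the actual AustereCache code.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative types only; AustereCache uses SHA-1 fingerprints and real compressors.
using Fingerprint = size_t;                           // stand-in for a 20-byte SHA-1 digest
struct CacheEntry { uint64_t ca; uint32_t length; };  // flash address + compressed length

std::unordered_map<uint64_t, Fingerprint> lba_index;   // LBA -> FP
std::unordered_map<Fingerprint, CacheEntry> fp_index;  // FP  -> (CA, length)

Fingerprint fingerprint(const std::vector<uint8_t>& chunk) {
    // Placeholder: a real system hashes the full chunk content with SHA-1.
    return std::hash<std::string>{}(std::string(chunk.begin(), chunk.end()));
}

std::vector<uint8_t> compress(const std::vector<uint8_t>& chunk) {
    return chunk;  // placeholder; a real system would use an actual compressor
}

// Write one fixed-size chunk at a given LBA.
void write_chunk(uint64_t lba, const std::vector<uint8_t>& chunk) {
    Fingerprint fp = fingerprint(chunk);
    if (fp_index.find(fp) == fp_index.end()) {
        // New content: compress it and store it at a fresh cache address.
        std::vector<uint8_t> packed = compress(chunk);
        uint64_t ca = /* allocate space in the flash cache */ fp_index.size();
        fp_index[fp] = CacheEntry{ca, static_cast<uint32_t>(packed.size())};
        // ... write `packed` to the data region at `ca` ...
    }
    // Duplicate or not, the logical address now references this fingerprint.
    lba_index[lba] = fp;
}
```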

  • Deduplicated and Compressed Flash Cache

    Ø LBA: chunk address in the HDD; FP: chunk fingerprint

    Ø CA: chunk address in the flash cache (after deduplication and compression)

    4

    [Architecture figure: reads/writes are split into fixed-size chunks and pass through deduplication and compression. RAM holds the LBA-index (LBA → FP), the FP-index (FP → CA, length), and a dirty list. The SSD cache stores variable-size compressed chunks (after deduplication) in front of the HDD.]

  • Memory Amplification for Indexing

    Ø Example: 512-GiB flash cache with a 4-TiB HDD working set

    Ø Conventional flash cache
      • Memory overhead: 256 MiB

    Ø Deduplicated and compressed flash cache
      • LBA-index: 3.5 GiB
      • FP-index: 512 MiB
      • Memory amplification: 16x (and it can be higher)

    5

    Conventional flash cache: LBA (8 B) → CA (8 B)

    LBA-index: LBA (8 B) → FP (20 B)

    FP-index: FP (20 B) → CA (8 B) + Length (4 B)
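    These figures follow directly from the per-entry sizes above, assuming the 32 KiB default chunk size stated later in the evaluation setup; the back-of-the-envelope check below is our own illustration, not text from the paper.

```cpp
#include <cstdio>

int main() {
    const double KiB = 1024.0, MiB = KiB * KiB, GiB = MiB * KiB, TiB = GiB * KiB;
    const double chunk = 32 * KiB;                    // assumed default chunk size
    const double cache_chunks = 512 * GiB / chunk;    // 16M chunks in the flash cache
    const double hdd_chunks   = 4 * TiB / chunk;      // 128M chunks in the HDD working set

    double conventional = cache_chunks * (8 + 8);       // LBA -> CA
    double lba_index    = hdd_chunks   * (8 + 20);      // LBA -> FP
    double fp_index     = cache_chunks * (20 + 8 + 4);  // FP -> CA + length

    std::printf("conventional:  %.0f MiB\n", conventional / MiB);  // 256 MiB
    std::printf("LBA-index:     %.1f GiB\n", lba_index / GiB);     // 3.5 GiB
    std::printf("FP-index:      %.0f MiB\n", fp_index / MiB);      // 512 MiB
    std::printf("amplification: %.0fx\n",
                (lba_index + fp_index) / conventional);            // 16x
    return 0;
}
```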

  • Related Work

    Ø Nitro [Li et al., ATC'14]
      • First work to study deduplication and compression in flash caching
      • Manages compressed data in Write-Evict Units (WEUs)

    Ø CacheDedup [Li et al., FAST'16]
      • Proposes dedup-aware algorithms for flash caching to improve hit ratios

    6

    They both suffer from memory amplification!

  • Our Contribution

    7

    Ø AustereCache: a deduplicated and compressed flash cache with austere, memory-efficient management
      • Bucketization
        - No memory overhead for address mappings: chunks are hashed to storage locations
      • Fixed-size compressed data management
        - No in-memory tracking of the compressed lengths of chunks
      • Bucket-based cache replacement
        - Cache replacement performed per bucket
        - Count-Min Sketch [Cormode 2005] for low-memory reference counting

    ØExtensive trace-driven evaluation and prototype experiments

  • Bucketization

    Ø Main idea
      • Use hashing to partition the index and cache space
        - (RAM) LBA-index and FP-index
        - (SSD) metadata region and data region
      • Store partial keys (hash prefixes) in memory for memory savings

    Ø Layout
      • Hash entries into equal-sized buckets
      • Each bucket has a fixed number of slots

    8

    [Layout figure: a bucket is an array of fixed-size slots, each holding a mapping or data entry.]

  • (RAM) LBA-index and FP-index

    Ø (RAM) LBA-index and FP-index
      • Locate buckets by hash suffixes
      • Match slots by hash prefixes (see the lookup sketch below)
      • Each slot in the FP-index corresponds to a storage location in flash

    9

    [Index layout figure: each LBA-index slot stores an LBA-hash prefix, an FP-hash, and a flag; each FP-index slot stores an FP-hash prefix and a flag.]
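    A minimal sketch of such a prefix/suffix lookup, assuming (purely for illustration) 64-bit hashes, 16-bit suffixes selecting among 2^16 buckets, and 16-bit prefixes stored as slot tags; the real bit widths and slot contents in AustereCache may differ.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Illustrative parameters; the real index sizes and widths are configuration-dependent.
constexpr int kNumBuckets = 1 << 16;   // bucket chosen by the low 16 bits (suffix) of a hash
constexpr int kSlotsPerBucket = 32;

struct Slot {
    uint16_t prefix = 0;   // high 16 bits (prefix) of the key hash, the "partial key"
    bool valid = false;
    uint32_t value = 0;    // e.g., FP-hash (LBA-index) or flash slot location (FP-index)
};

using Bucket = std::array<Slot, kSlotsPerBucket>;
std::array<Bucket, kNumBuckets> index_buckets;

// Locate the bucket with the hash suffix, then match a slot with the hash prefix.
std::optional<uint32_t> lookup(uint64_t key_hash) {
    Bucket& b = index_buckets[key_hash & (kNumBuckets - 1)];
    uint16_t prefix = static_cast<uint16_t>(key_hash >> 48);
    for (const Slot& s : b) {
        if (s.valid && s.prefix == prefix)
            return s.value;   // a prefix match; still needs validation on flash
    }
    return std::nullopt;
}
```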

  • (SSD) Metadata and Data Regions

    Ø (SSD) Metadata region and data region
      • Each slot has the full FP and a list of full LBAs in the metadata region
        - Used for validation against prefix collisions (see the validation sketch below)
      • Cached chunks reside in the data region

    10

    [On-flash layout figure: each metadata-region entry stores the full FP and a list of LBAs; the corresponding data-region slot stores the cached chunk.]
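    To illustrate why the full keys live on flash, here is a hypothetical validation step that would follow an in-memory prefix match; the struct layout and helper names are assumptions for this sketch, not AustereCache's actual on-flash format.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical on-flash metadata entry (one per cache slot).
struct MetadataEntry {
    std::array<uint8_t, 20> full_fp{};   // full SHA-1 fingerprint
    std::vector<uint64_t> lbas;          // full LBAs referencing this chunk
};

// Assumed helper: read the metadata entry for a flash slot.
MetadataEntry read_metadata(uint32_t /*flash_slot*/) {
    return MetadataEntry{};   // stub; a real implementation issues a small SSD read
}

// A prefix hit in RAM is only a candidate; confirm it against the full keys on flash.
bool validate_hit(uint32_t flash_slot, const std::array<uint8_t, 20>& fp, uint64_t lba) {
    MetadataEntry m = read_metadata(flash_slot);
    return m.full_fp == fp &&
           std::find(m.lbas.begin(), m.lbas.end(), lba) != m.lbas.end();
}
```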

  • Fixed-size Compressed Data Management

    Ø Main idea
      • Slice and pad each compressed chunk into fixed-size subchunks (sketched below)

    Ø Advantages
      • Compatible with bucketization: each subchunk is stored in one slot
      • Allows per-chunk management for cache replacement

    11

    [Example figure: a 32 KiB chunk compresses to 20 KiB, which is then sliced and padded into 8 KiB subchunks.]
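    A small sketch of the slice-and-pad arithmetic under the example sizes above (8 KiB subchunks); the helper name is illustrative.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr size_t kSubchunk = 8 * 1024;   // fixed subchunk (slot) size in the example

// Slice a compressed chunk into fixed-size subchunks, padding the last one with zeros.
std::vector<std::vector<uint8_t>> slice_and_pad(const std::vector<uint8_t>& compressed) {
    size_t n = (compressed.size() + kSubchunk - 1) / kSubchunk;   // ceiling division
    std::vector<std::vector<uint8_t>> subchunks(n, std::vector<uint8_t>(kSubchunk, 0));
    for (size_t i = 0; i < compressed.size(); ++i)
        subchunks[i / kSubchunk][i % kSubchunk] = compressed[i];
    return subchunks;
}

int main() {
    std::vector<uint8_t> compressed(20 * 1024);          // the 20 KiB example above
    auto parts = slice_and_pad(compressed);
    std::printf("%zu subchunks, %zu KiB written\n",      // 3 subchunks, 24 KiB written
                parts.size(), parts.size() * kSubchunk / 1024);
}
```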

  • Fixed-size Compressed Data Management

    Ø Layout
      • One chunk occupies multiple consecutive slots
      • No additional memory needed for compressed lengths

    12

    [Layout figure: each FP-index slot (FP-hash prefix, flag) in RAM maps to consecutive subchunk slots in the SSD data region; the SSD metadata region stores the full FP, the list of LBAs, and the compressed length.]

  • Bucket-based Cache Replacement

    Ø Main idea
      • Perform cache replacement in each bucket independently
      • Eliminates priority-based structures for cache decisions

    13

    [Figure: each bucket tracks its own slots; the FP-index keeps a reference counter per slot, while the LBA-index orders slots from old to recent.]

    • Combine recency and deduplication (see the sketch below)
      - LBA-index: least-recently-used policy
      - FP-index: least-referenced policy
      - Weighted reference counting based on recency in the LBA-index
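    As a rough illustration of per-bucket, least-referenced eviction in the FP-index, consider the sketch below; the weighting of reference counts by recency is a simplified stand-in for the policy described in the deck, not the exact formula.

```cpp
#include <array>
#include <cstdint>

constexpr int kSlotsPerBucket = 32;

struct FpSlot {
    bool valid = false;
    uint16_t prefix = 0;       // FP-hash prefix (partial key)
    uint32_t ref_count = 0;    // estimated references, e.g., from a Count-Min Sketch
    bool recent = false;       // set if a recently accessed LBA still maps to this chunk
};

struct FpBucket {
    std::array<FpSlot, kSlotsPerBucket> slots;

    // Pick a victim inside this bucket only: prefer an invalid slot, otherwise evict the
    // slot with the smallest weighted reference count (recency boosts the weight).
    int pick_victim() const {
        int victim = 0;
        uint64_t best = UINT64_MAX;
        for (int i = 0; i < kSlotsPerBucket; ++i) {
            if (!slots[i].valid) return i;   // free slot: use it directly
            uint64_t weight =
                slots[i].ref_count * (slots[i].recent ? 2 : 1);   // assumed weighting
            if (weight < best) { best = weight; victim = i; }
        }
        return victim;
    }
};
```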

  • Sketch-based Reference Counting

    Ø Complete reference counting has high memory overhead
      • One counter for every FP-hash

    Ø Count-Min Sketch [Cormode 2005] (see the sketch below)
      • Fixed memory usage with provable error bounds

    14

    [Figure: a Count-Min Sketch is a w × h counter array; each update increments one counter per row, and the estimated count of an FP-hash is the minimum counter indexed by (i, H_i(FP-hash)) over all rows i.]
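    For concreteness, a minimal Count-Min Sketch in C++; the width, depth, and per-row hash used here are arbitrary illustrations rather than the parameters used in AustereCache.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Minimal Count-Min Sketch: H rows of W counters; error bounds depend on W and H.
template <int W = 1024, int H = 4>
class CountMinSketch {
public:
    void increment(uint64_t key) {
        for (int i = 0; i < H; ++i)
            counters_[i][hash(key, i) % W]++;
    }
    uint32_t estimate(uint64_t key) const {
        uint32_t m = UINT32_MAX;
        for (int i = 0; i < H; ++i)
            m = std::min(m, counters_[i][hash(key, i) % W]);
        return m;   // minimum over rows: an overestimate with bounded error
    }
private:
    static uint64_t hash(uint64_t key, int row) {
        // Simple per-row mixing (illustrative only, not a vetted hash family).
        uint64_t x = key + 0x9e3779b97f4a7c15ULL * (row + 1);
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
        return x;
    }
    std::array<std::array<uint32_t, W>, H> counters_{};
};
```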

  • Evaluation

    Ø Implemented AustereCache as a user-space block device
      • ~4.5K lines of C++ code on Linux

    Ø Traces
      • FIU traces: WebVM, Homes, Mail
      • Synthetic traces: varying I/O dedup ratio and write-to-read ratio
        - I/O dedup ratio: fraction of duplicate written chunks among all written chunks

    Ø Schemes
      • AustereCache: AC-D, AC-DC
      • CacheDedup: CD-LRU-D, CD-ARC-D, CD-ARC-DC

    15

  • Memory Overhead

    16

    Ø AC-D uses 69.9-94.9% and 70.4-94.7% less memory than CD-LRU-D and CD-ARC-D, respectively, across all traces.

    Ø AC-DC uses 87.0-97.0% less memory than CD-ARC-DC.

    [Figure 5: Exp#1 (Overall memory usage) — memory (MiB, log scale) vs. cache capacity (% of WSS) for AC-D, AC-DC, CD-LRU-D, CD-ARC-D, and CD-ARC-DC on (a) WebVM, (b) Homes, (c) Mail.]

    … new SHA-1 fingerprint for the 32 KiB chunk. Table 2 shows the basic statistics of each regenerated FIU trace on 32 KiB chunks.

    The original FIU traces have no compression details. Thus, for each chunk fingerprint, we set its compressibility ratio (i.e., the ratio of raw bytes to compressed bytes) following a normal distribution with mean 2 and variance 0.25 as in [24].

    Synthetic. For throughput measurement (§5.5), we build a synthetic trace generator to account for different access patterns. Each synthetic trace is configured by two parameters: (i) the I/O deduplication ratio, which specifies the fraction of writes that can be removed on the write path due to deduplication; and (ii) the write-to-read ratio, which specifies the ratio of writes to reads.

    We generate a synthetic trace as follows. First, we randomly generate a working set by choosing arbitrary LBAs within the primary storage. Then we generate an access pattern based on the given write-to-read ratio, such that the write and read requests each follow a Zipf distribution. We derive the chunk content of each write request based on the given I/O deduplication ratio as well as the compressibility ratio, as in the FIU trace generation (see above). Currently, our evaluation fixes the working set size as 128 MiB, the primary storage size as 5 GiB, and the Zipf constant as 1.0; all of these parameters are configurable.
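    The Zipf-based access pattern can be illustrated with a small sampler; this is a generic sketch under the stated parameters (e.g., Zipf constant 1.0), not the authors' actual trace generator.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Sample chunk indices from a Zipf distribution over n items (exponent s),
// by inverting the cumulative distribution with a binary search.
class ZipfSampler {
public:
    explicit ZipfSampler(std::size_t n, double s = 1.0) : cdf_(n) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            sum += 1.0 / std::pow(static_cast<double>(i + 1), s);
            cdf_[i] = sum;
        }
        for (double& c : cdf_) c /= sum;   // normalize to [0, 1]
    }
    std::size_t sample(std::mt19937_64& rng) const {
        double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
        return std::lower_bound(cdf_.begin(), cdf_.end(), u) - cdf_.begin();
    }
private:
    std::vector<double> cdf_;
};

// Usage sketch: a 128 MiB working set of 32 KiB chunks gives 4096 candidate chunks;
// each sampled index would be mapped to an LBA of the working set and marked as a
// read or write according to the configured write-to-read ratio.
// ZipfSampler zipf(4096); std::mt19937_64 rng(42); std::size_t idx = zipf.sample(rng);
```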

    5.2 Setup

    Testbed. We conduct our experiments on a machine running Ubuntu 18.04 LTS with Linux kernel 4.15. The machine is equipped with a 10-core 2.2 GHz Intel Xeon E5-2630 v4 CPU, 32 GiB DDR4 RAM, a 1 TiB Seagate ST1000DM010-2EP1 SATA HDD as the primary storage, and a 128 GiB Intel SSDSC2BW12 SATA SSD as the flash cache.

    Default setup. For both AustereCache and CacheDedup, we configure the size of the FP-index based on a fraction of the working set size (WSS) of each trace, and fix the size of the LBA-index as four times that of the FP-index. We store both the LBA-index and the FP-index in memory for high performance. For AustereCache, we set the default chunk size and subchunk size as 32 KiB and 8 KiB, respectively. For CD-ARC-DC in CacheDedup, we set the WEU size as 2 MiB (the default in [26]).

    5.3 Comparative Analysis

    We compare AustereCache and CacheDedup in terms of memory usage, read hit ratios, and write reduction ratios using the FIU traces.

    Exp#1 (Overall memory usage). We compare the memory usage of different schemes. We vary the flash cache size from 12.5% to 100% of the WSS of each FIU trace, and configure the LBA-index and the FP-index based on our default setup (§5.2). To obtain the actual memory usage (rather than the allocated memory space for the index structures), we call malloc_trim at the end of each trace replay to return all unallocated memory from the process heap to the operating system, and check the resident set size (RSS) from /proc/self/stat as the memory usage.

    Figure 5 shows that AustereCache significantly reduces memory usage compared to CacheDedup. For the non-compression schemes (i.e., AC-D, CD-LRU-D, and CD-ARC-D), AC-D incurs 69.9-94.9% and 70.4-94.7% less memory across all traces than CD-LRU-D and CD-ARC-D, respectively. For the compression schemes (i.e., AC-DC and CD-ARC-DC), AC-DC incurs 87.0-97.0% less memory than CD-ARC-DC.

    AustereCache achieves higher memory savings than CacheDedup in compression mode, since CD-ARC-DC needs to additionally maintain the lengths of all compressed chunks, while AC-DC eliminates such information. If we compare the memory overhead with and without compression, CD-ARC-DC incurs 78-194% more memory usage than CD-ARC-D across all traces, implying that compression comes with a high memory usage penalty in CacheDedup. On the other hand, AC-DC incurs only 2-58% more memory than AC-D.

    Exp#2 (Impact of design techniques on memory savings). We study how the different design techniques of AustereCache contribute to its memory savings. We mainly focus on bucketization (§3.1) and bucket-based cache replacement (§3.3); for fixed-size compressed data management (§3.2), we refer readers to Exp#1 for our analysis.

    We choose CD-LRU-D of CacheDedup as our baseline and compare it with AC-D (both are non-compressed versions), and add individual techniques to see how they contribute to the memory savings of AC-D. We consider four variants:

  • Read Hit Ratios

    17

    Ø AC-D has up to 39.2% higher read hit ratio than CD-LRU-D, and a read hit ratio similar to CD-ARC-D

    Ø AC-DC has up to 30.7% higher read hit ratio than CD-ARC-DC

    [Figure 7: Exp#3 (Read hit ratio) — read hit ratio (%) vs. cache capacity (% of WSS) for AC-D, AC-DC, CD-LRU-D, CD-ARC-D, and CD-ARC-DC on (a) WebVM, (b) Homes, (c) Mail.]

    Figure 7 shows the results. AustereCache generally achieves higher read hit ratios than the different CacheDedup algorithms. For the non-compression schemes, AC-D increases the read hit ratio of CD-LRU-D by up to 39.2%. The reason is that CD-LRU-D is only aware of request recency and fails to clean stale chunks in time (§2.4), while AustereCache favors evicting chunks with small reference counts. On the other hand, AC-D achieves read hit ratios similar to CD-ARC-D, and in particular has a higher read hit ratio (up to 13.4%) when the cache size is small in WebVM (12.5% of WSS), by keeping highly referenced chunks in cache. For the compression schemes, AC-DC has higher read hit ratios than CD-ARC-DC, by 0.5-30.7% in WebVM, 0.7-9.9% in Homes, and 0.3-6.2% in Mail. Note that CD-ARC-DC shows a lower read hit ratio than CD-ARC-D although it intuitively stores more chunks with compression, mainly because it cannot quickly evict stale chunks due to the WEU-based organization (§2.4).

    Exp#4 (Write reduction ratio). We further evaluate the different schemes in terms of the write reduction ratio, defined as the fraction of reduction in bytes written to the cache due to both deduplication and compression. A high write reduction ratio implies less data written to the flash cache and hence improved performance and endurance.

    Figure 8 shows the results. For the non-compression schemes, AC-D, CD-LRU-D, and CD-ARC-D show marginal differences in WebVM and Homes, while in Mail, AC-D has lower write reduction ratios than CD-LRU-D by up to 17.5%. We find that CD-LRU-D tends to keep more stale chunks in cache, thereby saving the writes that hit the stale chunks. For example, when the cache size is 12.5% of WSS in Mail, 17.1% of the write reduction in CD-LRU-D comes from writes to stale chunks, while in WebVM and Homes, the corresponding numbers are only 3.6% and 1.1%, respectively. AC-D achieves lower write reduction ratios than CD-LRU-D, but achieves much higher read hit ratios (by up to 39.2%) by favoring the eviction of chunks with small reference counts (Exp#3).

    For the compression schemes, both CD-ARC-DC and AC-DC have much higher write reduction ratios than the non-compression schemes due to compression. However, AC-DC shows a slightly lower write reduction ratio than CD-ARC-DC, by 7.7-14.5%. The reason is that AC-DC pads the last subchunk of each variable-size compressed chunk, thereby incurring extra writes. As we show later in Exp#5 (§5.4), a smaller subchunk size can reduce the padding overhead, although the memory usage also increases.
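    To see where the padding overhead comes from, consider the deck's earlier example: a 32 KiB chunk that compresses to 20 KiB occupies three 8 KiB subchunks (24 KiB written, i.e., 4 KiB of padding), whereas with 4 KiB subchunks the same chunk fills five subchunks exactly (20 KiB, no padding). This arithmetic is our own illustration of the mechanism, not a measurement from the paper.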

    5.4 Sensitivity to Parameters

    We evaluate AustereCache under different parameter settings using the FIU traces.

    Exp#5 (Impact of chunk sizes and subchunk sizes). We evaluate AustereCache on different chunk sizes and subchunk sizes. We focus on the Homes trace and vary the chunk sizes and subchunk sizes as described in §5.1. For varying chunk sizes, we fix the subchunk size as one-fourth of the chunk size; for varying subchunk sizes, we fix the chunk size as 32 KiB. We focus on comparing AC-DC and CD-ARC-DC, fixing the cache size as 25% of WSS. Note that CD-ARC-DC is unaffected by the subchunk size.


  • Write Reduction Ratios

    18

    Ø AC-D is comparable to CD-LRU-D and CD-ARC-D

    Ø AC-DC is slightly lower (by 7.7-14.5%) than CD-ARC-DC
      • Due to padding in fixed-size compressed data management


    [Figure 8: Exp#4 (Write reduction ratio) — write reduction (%) vs. cache capacity (% of WSS) for AC-D, AC-DC, CD-LRU-D, CD-ARC-D, and CD-ARC-DC on (a) WebVM, (b) Homes, (c) Mail.]

    AC-DC maintains its significant memory savings compared to CD-ARC-DC, by 92.8-95.3% for varying chunk sizes (Figure 9(a)) and 93.1-95.1% for varying subchunk sizes (Figure 9(b)). It also maintains higher read hit ratios than CD-ARC-DC, by 5.0-12.3% for varying chunk sizes (Figure 9(c)) and 7.9-10.4% for varying subchunk sizes (Figure 9(d)). AC-DC incurs a slightly lower write reduction ratio than CD-ARC-DC due to padding, by 10.0-14.8% for varying chunk sizes (Figure 9(e)); the results are consistent with those in Exp#4. Nevertheless, using a smaller subchunk size can mitigate the padding overhead. As shown in Figure 9(f), the write reduction ratio of AC-DC approaches that of CD-ARC-DC as the subchunk size decreases. When the subchunk size is 4 KiB, AC-DC has only a 6.2% lower write reduction ratio than CD-ARC-DC. Note that if we change the subchunk size from 8 KiB to 4 KiB, the memory usage increases from 14.5 MiB to 17.3 MiB (by 18.8%), since the number of buckets is doubled in the FP-index (while the LBA-index remains the same).

    Exp#6 (Impact of LBA-index sizes). We study the impact of LBA-index sizes. We vary the LBA-index size from 1× to 8× of the FP-index size (recall that the default is 4×), and fix the cache size as 12.5% of WSS.

    Figure 10 depicts the memory usage and read hit ratios; we omit the write reduction ratio as there is nearly no change for varying LBA-index sizes. When the LBA-index size increases, the memory usage increases by 17.6%, 111.5%, and 160.9% in WebVM, Homes, and Mail, respectively (Figure 10(a)), as we allocate more buckets in the LBA-index. Note that the increase in memory usage in WebVM is less

    [Figure 9: Exp#5 (Impact of chunk sizes and subchunk sizes) on the Homes trace with the cache size fixed at 25% of WSS — (a) memory usage vs. chunk size, (b) memory usage vs. subchunk size, (c) read hit ratio vs. chunk size, (d) read hit ratio vs. subchunk size, (e) write reduction ratio vs. chunk size, (f) write reduction ratio vs. subchunk size.]

    than those in Homes and Mail, mainly because the WSS of WebVM is small and incurs a small actual increase in total memory usage. Also, the read hit ratio increases with

  • Throughput

    19

    Ø AC-DC has the highest throughput
      • Due to its high write reduction ratio and high read hit ratio

    Ø AC-D has slightly lower throughput than CD-ARC-D
      • AC-D needs to access the metadata region during indexing

    [Figure 10: Exp#6 (Impact of LBA-index sizes) — (a) memory usage, (b) read hit ratio.]

    [Figure 11: Exp#7 (Throughput) for AC-D, AC-DC, CD-LRU-D, CD-ARC-D, and CD-ARC-DC — (a) throughput vs. I/O dedup ratio (write-to-read ratio 7:3), (b) throughput vs. write-to-read ratio (I/O dedup ratio 50%).]

    the LBA-index size, until the LBA-index reaches 4× of the FP-index size (Figure 10(b)). In particular, for WebVM, the read hit ratio grows from 36.7% (1×) to 70.4% (8×), while for Homes and Mail, the read hit ratios increase by only 4.3% and 5.3%, respectively. The reason is that when the LBA-index size increases, WebVM shows a higher increase in the total reference counts of the cached chunks than Homes and Mail, implying that more reads can be served by the cached chunks (i.e., higher read hit ratios).

    5.5 Throughput and CPU Overhead

    We measure the throughput and CPU overhead of AustereCache. We conduct the evaluation on synthetic traces with varying I/O deduplication ratios and write-to-read ratios. We focus on the write-back policy (§2.2), in which AustereCache first persists the written chunks to the flash cache and flushes the chunks to the HDD when they are evicted from the cache. We use direct I/O to remove the impact of the page cache. We report the averaged results over five runs; the standard deviations are small (less than 2.7%) and hence omitted.

    Exp#7 (Throughput). We compare AustereCache and CacheDedup in throughput using synthetic traces. We fix the cache size as 50% of the 128 MiB WSS. Both systems work in single-threaded mode.

    Figures 11(a) and 11(b) show the results for varying I/O deduplication ratios (with a fixed write-to-read ratio of 7:3, which represents a write-intensive workload as in the FIU traces) and varying write-to-read ratios (with a fixed I/O deduplication ratio of 50%), respectively. For the non-compression schemes, AC-D achieves 18.5-86.6% higher throughput than CD-LRU-D in all cases except when the write-to-read ratio is 1:9 (slightly slower, by 2.3%). Compared to CD-ARC-D, AC-D is slower by 1.1-24.5%, since both AC-D and CD-ARC-D have similar read hit ratios and write reduction ratios (§5.3), while AC-D issues additional reads and writes to the metadata region (CD-ARC-D keeps all indexing information in memory). AC-D achieves throughput similar to CD-ARC-D when there are more duplicate chunks (i.e., under high I/O deduplication ratios). For the compression schemes, AC-DC achieves 6.8-99.6% higher throughput than CD-ARC-DC.

    [Figure 12: Exp#8 (CPU overhead).]

    [Figure 13: Exp#9 (Throughput of multi-threading).]

    Overall, AC-DC achieves the highest throughput among all schemes for two reasons. First, AustereCache generally achieves higher or similar read hit ratios compared to the CacheDedup algorithms (§5.3). Second, AustereCache incorporates deduplication awareness into cache replacement by caching chunks with high reference counts, thereby absorbing more writes in the SSD and reducing writes to the slow HDD.

    Exp#8 (CPU overhead). We study the CPU overhead of deduplication and compression in AustereCache along the I/O path. We measure the latencies of four computation steps: fingerprint computation, compression, index lookup, and index update. Specifically, we run the WebVM trace with a cache size of 12.5% of WSS, and collect the statistics of 100 non-duplicate write requests. We also compare their latencies with those of 32 KiB chunk write requests to the SSD and the HDD using the fio benchmark tool [2].

    Figure 12 depicts the results. Fingerprint computation has the highest latency (15.5 µs) among the four steps. In total, AustereCache adds around 31.2 µs of CPU overhead. On the other hand, the latencies of 32 KiB writes to the SSD and the HDD are 85 µs and 5,997 µs, respectively. Note that the CPU overhead can be suppressed via multi-threaded processing, as shown in Exp#9.

    Exp#9 (Throughput of multi-threading). We evaluate the throughput gain of AustereCache when it enables multi-threading and issues concurrent requests to multiple buckets (§4). We use synthetic traces with a write-to-read ratio of 7:3, and consider I/O deduplication ratios of 50% and 80%.

    Figure 13 shows the throughput versus the number of threads configured in AustereCache. As the number of threads increases, AustereCache shows a higher throughput gain under the 80% I/O deduplication ratio (from 93.8 MiB/s to 235.5 MiB/s, or 2.51×) than under the 50% I/O deduplication ratio (from 60.0 MiB/s to 124.9 MiB/s, or 2.08×). A higher I/O deduplication ratio implies less I/O to flash, and AustereCache benefits more from multi-threading on parallelizing


  • CPU Overhead and Multi-threading

    20

    Ø Latency (32 KiB chunk write)
      • HDD: 5,997 µs; SSD: 85 µs
      • AustereCache: 31.2 µs (fingerprinting: 15.5 µs)
      • Latency hidden via multi-threading

    Ø Multi-threading (write-to-read ratio 7:3)
      • 50% I/O dedup ratio: 2.08x speedup
      • 80% I/O dedup ratio: 2.51x speedup
      • Higher I/O dedup ratio implies less I/O to flash → more computation savings via multi-threading (see the locking sketch below)
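    The deck mentions issuing concurrent requests to multiple buckets; one plausible way to structure this is per-bucket locking, sketched below purely as an illustration (the lock granularity and types are assumptions, not details from the paper).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>

constexpr int kNumBuckets = 1 << 16;

// One lock per bucket: requests that hash to different buckets proceed in parallel,
// which is what lets multi-threading hide the per-request CPU cost.
std::array<std::mutex, kNumBuckets> bucket_locks;

template <typename Fn>
void with_bucket(uint64_t key_hash, Fn&& fn) {
    std::size_t b = key_hash & (kNumBuckets - 1);
    std::lock_guard<std::mutex> guard(bucket_locks[b]);
    fn(b);   // perform the index lookup/update for bucket b under its lock
}
```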


  • Conclusion

    21

    Ø AustereCache achieves memory efficiency in deduplicated and compressed flash caching via
      • Bucketization
      • Fixed-size compressed data management
      • Bucket-based cache replacement

    ØSource code: http://adslab.cse.cuhk.edu.hk/software/austerecache

  • Thank You! Q & A

    22