Jingwei Ma, Rebecca J. Stones, Yuxiang Ma, Jingui Wang, Junjie...

Lazy Exact Deduplication

Jingwei Ma, Rebecca J. Stones, Yuxiang Ma,Jingui Wang, Junjie Ren, Gang Wang, Xiaoguang Liu

College of Computer and Control Engineering,Nankai University, China.

5 May 2016

Lazy exact deduplication

Lead author: Jingwei Ma, PhD student at Nankai University(supervisor: Prof. Gang Wang).

Lead author: Jingwei Ma, PhD student at Nankai University(supervisor: Prof. Gang Wang). Couldn’t get USA visa in time=⇒ I will present this work.

Credit where credit is due: Jingwei Ma did the lion’s share of thiswork (development, implementation, experimentation, etc.).

Lazy deduplication: ‘Lazy’ in the sense that we postpone disklookups, until we can do them as a batch.

Lazy deduplication: ‘Lazy’ in the sense that we postpone disklookups, until we can do them as a batch. (Lazy is exact.)

Deduplication: What usually happens...

We have a large amount of data, with lots of duplicate data(e.g. weekly backups).

We read through the data, and if we see something we’ve seenbefore, we replace it with an index entry (saving disk space).

The data is broken up into chunks (Rabin Hash).

The chunks are fingerprinted (SHA1): same fingerprint =⇒duplicate chunk.

Disk bottleneck: Most fingerprints are stored on disk =⇒lots of disk reads (“have I seen this before?”) =⇒ slow.

Caching and prefetching reduce the disk bottleneck problem:

Caching and prefetching reduce the disk bottleneck problem:fingerprints

cache miss

· · · fA fB fC fD · · ·

fA fB fC fD

The first time we see fingerprints fA, fB , ...

Caching and prefetching reduce the disk bottleneck problem:fingerprints

cache miss

· · · fA fB fC fD · · ·

fA fB fC fD

The first time we see fingerprints fA, fB , ...

fingerprints

cache miss prefetching

cache hit

· · · fA fB fC fD · · ·

fA fB fC fD

The second time we see fingerprints fA, fB , ...

Lazy deduplication...

fingerprints

· · · fA fB fC fD · · ·

fC fA fB , fD

fA fB fC fD

Bloom filter: identifies manyuniques (not all). [Commonlyused.]

fingerprints

Bloomfilter

· · · fA fB fC fD · · ·

fC fA fB , fD

fA fB fC fD

buffer: stores fingerprints inhash buckets; searched lateron disk (“lazy”)—when full,whole buckets are searched inone go (stored on-disk in hashbuckets)

fingerprints

Bloomfilter

buffer

· · · fA fB fC fD · · ·

fC fA fB , fD

fA fB fC fD

post-lookup: searching thecache after buffering (maybemultiple times)

fingerprints

Bloomfilter

post-lookup

buffer

· · · fA fB fC fD · · ·

fC fA fB , fD

fA fB fC fD

pre-lookup: searching thecache before buffering [notshown]

fingerprints

Bloomfilter

post-lookup

buffer

· · · fA fB fC fD · · ·

fC fA fB , fD

fA fB fC fD

pre-lookup: searching thecache before buffering [notshown]

prefetching: bidirectional;triggers post-lookup

fingerprints

Bloomfilter

post-lookup

prefetching

buffer

· · · fA fB fC fD · · ·

fC fA fB , fD

fA fB fC fD

Prefetching...

Ordinarily, we prefetch the subsequent on-disk fingerprints after aduplicate is found on disk

Prefetching...

Ordinarily, we prefetch the subsequent on-disk fingerprints after aduplicate is found on disk—these will probably be the next incomingfingerprints.

Prefetching...

Ordinarily, we prefetch the subsequent on-disk fingerprints after aduplicate is found on disk—these will probably be the next incomingfingerprints. But this doesn’t work with the lazy method (wherefingerprints are buffered).

Prefetching...

To overcome this obstacle, each buffered fingerprint is given a...

Prefetching...

rank, used to determine the on-disk search range;

Prefetching...

rank, used to determine the on-disk search range; and abuffer cycle, indicating where duplicates might be on-disk.

Prefetching...

rank, used to determine the on-disk search range; and abuffer cycle, indicating where duplicates might be on-disk.

It looks like this:

rank r :

fingerprints

0 1 2 3 4 5 6 7 8

fingerprintsstored on disk

on-disklookup

· · ·

2048 fingerprints

incoming unique on-disk unique buffered / on-disk match

Experimental results...

(See our paper for the details and further experiments.)

Experimental results...

(See our paper for the details and further experiments.)

The time it takes to deduplicate a dataset (on SSD):

Vm (220GB) Src (343GB) FSLHomes (3.58TB)

eager way 282 sec. 476 sec. 5824 sec.

lazy way 151 sec. 226 sec. 3939 sec.

(eager = non-lazy [exact] way—i.e., no buffering before accessingthe disk)

Conclusion: Lazy is faster.

On-disk lookups...

Disk access time (sec.) on SSD:

Vm Src FSLHomes

eager lazy eager lazy eager lazy

on-disk lookup 176 20 325 45 4598 1639

prefetching 46 60 52 68 298 655

other 59 71 99 113 928 1645

total disk access 222 80 377 113 4896 2294

total dedup. 282 151 476 226 5824 3939

Conclusion: Lazy reduces the disk bottleneck.

Throughput...

20 70 120 170 220 270 320 370 420

data size (GB)

Src on SSD

eager lazy

20 70 120 170 220 270 320 370 420

data size (GB)

Src on HDD

eager lazy

Conclusion: Lazy has better throughput on both SSD and HDD,but moreso on slower HDD.

Jingwei Ma, Rebecca J. Stones, Yuxiang Ma, Jingui Wang, Junjie...

Documents

REBECCA SLAYTON - Amazon Web Services · Slayton, Rebecca. 2013. Arguments that Count: Physics, Computing, and Missile Defense, 1949-2012 (Cambridge, MA: MIT Press). Winner of the

Copyright 2015 Jingwei Zhu

Early Childhood Foundation Skills Birth – 5 years Where does this journey begin? Rebecca Coakley, MA Children’s Vision Rehabilitation Project West Virginia

Rebecca Rollins Stone, Ph.D. (also published under Rebecca Stone

Tests Received 11/1/2018 - 11/30/2018 Laurels Web.pdf · Rebecca Li Yedda Ma Anastasiya Rodriguez Lucy Yang Preliminary Moves In The Field Rebecca Li Kimberly Phan Anastasiya Rodriguez

To book a consultation with About Rebecca - Rebecca

PATRIOTS Rachael Coonce Rebecca McAllister Rebecca Riesinger Sarah Krebs

Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liyin Tang

On behalf of MA Department of Public Health and MA Adult …€¦ · 9:45am MDPH Update: Susan Lett, MD and Rebecca Vanucci, MA 10:30am Question and Answer Period 11:00am Networking

CS746 Software Architecture Organizational Meeting Instructor: Prof. Richard C. Holt TA: Jingwei Wu

Ingram, Rebecca Suzanne - Faience and Glass Beads From the Late Bronze Age Shipwreck at Uluburun ( MA Thesis. 2005)

REBECCA SAUNDERS Rebecca Saunders teaches global

Helping Your Members to Get a Job in this Economy a.k.a. Managing Expectations ILT March 2013 Rebecca Daniel-Burke, PhD Danielle, Irving, MA

Rebecca Pointer Final MA thesis

Legislature Tour!!!! Legislature Tour!!!! By Rebecca Lee By Rebecca Lee

Rebecca Bierlein Rebecca Bierlein The story of my life!!

ELD and SDAIE Strategies Putting the standards and laws into practice Presented by Rebecca Tomasini MA in Renaissance Studies University of London MA MACET

Leg Lipoblastomatosis of the Lower Lipoblastoma and...Mimicking a Hemangioma", Pediatrics, 2000 Publication Ma, Jingwei Zhou, Jie Gao, Yanping. "Distributed event-triggered control

Jingwei Gao Boehringer Ingelheim PharmaSUG China 2016 Paper PharmaSUG China 2016 … · 2016-11-29 · Jingwei Gao Boehringer Ingelheim PharmaSUG China 2016 Paper PharmaSUG China

Rebecca Louise Thomas - D-Scholarship@Pittd-scholarship.pitt.edu/37130/1/ETD Thesis Rebecca Louise Thomas... · Rebecca Louise Thomas . BS, University of Notre Dame, 2014 . MA, University