Mul‐Level Comparison of Data Deduplicaon in a Backup Scenario · Mul‐Level Comparison of Data Deduplicaon in a Backup Scenario Dirk Meister André Brinkmann Paderborn Center for

Mul$‐Level Comparison of Data Deduplica$on in a Backup Scenario

Dirk Meister André Brinkmann

Paderborn Center for Parallel Compu$ng University of Paderborn

Structure

•  Introduc$on •  Large‐Scale: Real life system –  Internal Redundancy – Temporal Redundancy

•  Small‐Scale: –  Influence of Small Modifica$ons

– Different File Formats

•  Conclusion

2

Mo$va$on

•  Backups cri$cal for enterprises •  26 full backups 26x storage demands

•  Few changes high redundancies

•  data deduplica$on := Removal of redundant data to store only unique data

3

!"#$%&

!"#$%'

!"#$%(

)$*+,#"-./ "01

232/$4

5/06.7$%232/$4

4

•  Fingerprint‐based Approaches •  Delta‐Encoding‐based Approaches

!"#$%&

!"#$%'

!"#$%(

)$*+,#"-./ "01

232/$4

5/06.7$%232/$4

Fingerprin$ng‐based Approaches

•  Chunking

•  Duplicate Detec$on

•  Storage

5

Chunking

•  Crucial for redundancy detec$on •  Approaches –  Sta$c Chunking –  Content‐Defined Chunking – Whole‐File Chunking – Applica$on‐Specific Chunking

6

Data stream / file /...

Chunk Chunk Chunk

Sta$c Chunking

•  Chunks with a fixed size, e.g. 8 KB •  Very fast •  Problems with shi]ings / inser$ons

7

14159 26535 89793 23846 26433 83279 50288 41971 37510 14159 A2653 58979 32384 62643 38327 95028 84197 13751 0

Content‐defined Chunking

•  Build chunks based on contents only – Fingerprint fp for all substrings of size w – Chunk boundary at posi$on p if fp at the posi$on holds that:

•  Chunk size – Expected value: n Byte (configurable)

•  Also called: Rabin Fingerprint Chunking

8

€

fp modn = c for a constant 0 < c ≤ n

Other Approaches

•  Whole‐File Chunking – Fingerprint over whole file contents – Finds only exact file duplicates

•  Applica$on‐Specific Chunking – Use knowledge about specific file types – e.g. split mail files exactly add abachment boundaries

9

1. PART: LARGE‐SCALE COMPARISION (OF FINGERPRINTING‐BASED APPROACHES)

10

Real Life System

•  User directories of the University of Paderborn •  Weekly measurements

•  Key Facts – 450 GB data – 3.6M files

– 8.000 ac$ve users

11

Internal Redundancy

12

CDC = Content‐defined Chunking SC = Sta$c Chunking Whole‐File = Whole‐File Chunking

0%

5%

10%

15%

20%

25%

30%

35%

40%

CDC (8 KB) CDC (16 KB) SC (8 KB) SC (16 KB) Whole‐File

Int. Redundancy (file types)

0

10

20

30

40

50

60

70

.jpg .doc .ppt .xls .pdf .avi .zip

Redu

ndan

cy in %

CDC 8 SC 8 Whole‐File

13

Int. Redundancy (file size)

0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50%

< 4K

> 4K

> 8K

> 16

K

> 32

K

> 64

K

> 12

8K

> 25

6K

> 51

2K

> 1M

> 2M

> 4M

> 8M

> 16

M

> 32

M

> 64

M

> 12

8M

> 25

6M

> 51

2M

> 1G

> 2G

CDC‐8 CDC‐16 SC‐8 SC‐16 Datei

14

Whole‐File

Temporal Redundancy

15

0

1

2

3

4

5

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Patchsize (in

%)

CDC‐8 CDC‐16 SC‐8 SC‐16 Whole‐File

2. PART: SMALL‐SCALE COMPARISION

16

Small‐Scale Experiments

•  Common Assump$on: Small content modifica8on Small change in file

•  Experiments with various file types –  .zip Archive –  .ppt Presenta$on – More in full paper

17

Structure .zip file

Header File 1

Data File 1

Header File 2

Data File 2

...

Header File n

Data File n

Central Directory

Signature (4 Byte)

Headerdata (22 Byte)

Filename (variable)

Extra fields (variable)

18

.ZIP File Experiment

•  Experiment : – Archive of CIFS source code (Linux kernel) – Modifica$on of AUTHORS file

19

Finder:

WinZIP:

.PPT File Experiment

•  Experiment – Presenta$on without images

– Name of author modified

•  Powerpoint 2003/2007/2008: < 3% redundant

20

Powerpoint 2008 (Excerpt):

Conclusion

•  Large‐Scale: – ≈ 35% internal redundancy (CDC‐8) – Advantage for Content‐defined Chunking – > 98,5% temporal redundancy (all) – High varia$on between weeks

•  Small‐Scale: – Assump$on o]en not correct

– Applica$on‐ and version‐specific behavior

21

QUESTIONS?

22

SOFTWARE AVAILABLE UNDER CODE.GOOGLE.COM/P/FS‐C/

23

Shi]ings

•  Shi]ings only have local influence

24

14159 26535 89793 23846 26433 83279 50288 41971 37510 14159 A265 3589793 23846 26433 83279 50288 41971 37510

14159 26535 89793 23846 26433 83279 50288 41971 37510 14159 A2653589793 23846 26433 83279 50288 41971 37510

14159 26535 89793 23846 26433 83279 50288 41971 37510 14159 A26535 89793 23846 26433 83279 50288 41971 37510

Duplicate Detec$on

•  Check for every chunk: Duplicate –  (O]en) no exact comparision of contents

•  Compare‐by‐Hash: – Fingerprint for each chunk (typically: SHA‐1) – Check only if fingerprint is known

•  Approach debated: – Data loss in case of hash collision

25

Comparision with Incr. Backup

26

0

1

2

3

4

5

6

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Patchsize (in

%)

CDC‐8 Whole‐File Inc. Backup

Caching Strategies

•  Only 10% improvement in archival system Ven$ •  Backup scenario: – most chunks are redundant

–  reduced write accesses •  Caching also reduces read accesses – might have greater impact

•  Comparison of (simple) caching strategies

27

LRU Cache

28

0

10

20

30

40

50

60

70

80

1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M

Cache Hit RaM

o / Internal Red

unda

ncy

(in %)

LRU CDC‐8 LRU CDC‐16 LRU SC‐8 LRU SC‐8

MU Cache

29

0

10

20

30

40

50

60

1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M

Cache Hit RaM

o / Internal Red

unda

ncy

(in %)

MU CDC‐8 MU CDC‐16 MU SC‐8 MU SC‐8

Related Work

•  Policoniades and Prab (2004): – Mirror of sunsite.org.uk: 35 GB

– Personal user files: 2.9 GB – Scratch directories: 100 GB – No temporal redundancies evaluated

•  You and Karamanolis (2004): – 15.5 GB on various data sets not including personal user files

30

Related Work

•  Mandagere et al. (2009): – 450 GB personal user files – 16 users – No temporal redundancies –  Includes throughput measurements

•  Most publica$ons about deduplica$on: –  Include evalua$ons using various, o]en smaller data sets

– Only some include personal user files and temporal redundancies

31

File size distribu$on

32

0 50

100 150 200 250 300 350 400 450 500

< 1

< 16

< 25

6

< 4K

< 64

K

< 1M

< 16

M

< 25

6M

< 2G

File Cap

acity (in

GB)

Capacity Cumula$ve Capacity

0

500

1000

1500

2000

2500

3000

3500

4000

< 1

< 16

< 25

6 < 4K

< 64

K < 1M

< 16

M

< 25

6M

< 2G

File Cou

nt (in T)

Count Cumula$ve Count

Difference to CDC‐8

33

0

20

40

60

80

100

120

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

% Patchsize (D

ifferen

ce to

CDC‐8)

CDC‐16 SC‐8 SC‐16 Whole‐File

File Types (Excerpt) Rank File type File Count Storage

1 .jpg 3% (114 K) 13% (59.9 GB)

2 N/A 36% (1261 K) 11% (48.4 GB)

3 .doc 2% (85 K) 8% (36.8 GB)

4 .ppt 0% (9 K) 8% (33.2 GB)

5 .xls 1% (24 K) 8% (33.0 GB)

6 .pdf 1% (40 K) 7% (32.1 GB)

7 .avi 0% (0.6K) 3% (14.0 GB)

8 .mov 0% (0.2K) 3% (13.5 GB)

9 .dat 3% (97K) 2% (10.4 GB)

...

13 .zip 0% (8,6 K) 2% (7.3 GB)

15 .gz 0% (2,8 K) 1% (5.4 GB)

TOTAL 3520 K 436,8 GB

34

.PPTX

•  New Powerpoint format – XML‐Files (Content / Metadata)

– Binary Files (Images / Videos) – ZIP‐Container

•  Some files changed (minimally)

•  Compressed data changed

35

Powerpoint 2008:

Documents

Mul‐Level Comparison of Data Deduplicaon in a Backup Scenario · Mul‐Level Comparison of Data Deduplicaon in a Backup Scenario Dirk Meister André Brinkmann Paderborn Center for