Multi-Level Comparison of Data Deduplication in a Backup Scenario
Dirk Meister, André Brinkmann
Paderborn Center for Parallel Computing, University of Paderborn
Structure
• Introduction
• Large-Scale: Real-life system
  – Internal Redundancy
  – Temporal Redundancy
• Small-Scale:
  – Influence of Small Modifications
  – Different File Formats
• Conclusion
Motivation
• Backups are critical for enterprises
• 26 full backups → 26x storage demands
• Few changes → high redundancies
• Data deduplication := removal of redundant data to store only unique data
[Figure: File 1, File 2, File 3 → Deduplication system → Storage system]
• Fingerprint-based Approaches
• Delta-Encoding-based Approaches
Fingerprinting-based Approaches
• Chunking
• Duplicate Detection
• Storage
Chunking
• Crucial for redundancy detection
• Approaches
  – Static Chunking
  – Content-Defined Chunking
  – Whole-File Chunking
  – Application-Specific Chunking
[Figure: a data stream / file / ... is split into Chunk, Chunk, Chunk]
Static Chunking
• Chunks with a fixed size, e.g. 8 KB
• Very fast
• Problems with shifts / insertions
14159 26535 89793 23846 26433 83279 50288 41971 37510
14159 A2653 58979 32384 62643 38327 95028 84197 13751 0
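The digit example can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' tool: the helper name `static_chunks` and the chunk size of 5 bytes are chosen here only to mirror the digits on the slide.

```python
def static_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


# The digit stream from the slide, with the spaces removed.
original = b"".join(b"14159 26535 89793 23846 26433 83279 50288 41971 37510".split())
# Insert a single byte right after the first chunk.
modified = original[:5] + b"A" + original[5:]

before = static_chunks(original, 5)
after = static_chunks(modified, 5)

# Chunks of the original stream that reappear unchanged after the insertion.
shared = set(before) & set(after)
print(len(before), len(after), sorted(shared))
```

Every chunk after the insertion point is shifted by one byte, so only the first chunk is still recognized as a duplicate.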
Content‐defined Chunking
• Build chunks based on contents only
  – Fingerprint fp for all substrings of size w
  – Chunk boundary at position p if the fingerprint fp at that position satisfies:
      fp mod n = c, for a constant 0 ≤ c < n
• Chunk size
  – Expected value: n bytes (configurable)
• Also called: Rabin Fingerprint Chunking
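The boundary rule can be sketched as follows. This is an illustration under stated assumptions: it uses a simple polynomial rolling hash instead of a true Rabin fingerprint, and the function name `cdc_chunks` and the parameter values (w = 16, n = 64, c = 1) are made up for the example.

```python
import random


def cdc_chunks(data: bytes, w: int = 16, n: int = 64, c: int = 1) -> list[bytes]:
    """Content-defined chunking with a polynomial rolling hash (not Rabin)."""
    base, mod = 257, (1 << 61) - 1
    top = pow(base, w - 1, mod)          # weight of the byte leaving the window
    chunks, start, fp = [], 0, 0
    for p, byte in enumerate(data):
        if p >= w:
            fp = (fp - data[p - w] * top) % mod   # drop the oldest byte
        fp = (fp * base + byte) % mod             # append the newest byte
        # Boundary after position p once the window is full and fp mod n == c.
        if p + 1 >= w and fp % n == c:
            chunks.append(data[start:p + 1])
            start = p + 1
    if start < len(data):
        chunks.append(data[start:])               # trailing chunk
    return chunks


rng = random.Random(7)
data = bytes(rng.getrandbits(8) for _ in range(4000))
modified = data[:2000] + b"A" + data[2000:]       # one-byte insertion

before = cdc_chunks(data)
after = cdc_chunks(modified)
new_chunks = set(after) - set(before)             # chunks that must be stored again
print(len(before), len(after), len(new_chunks))
```

Because a boundary depends only on the w bytes in the window, an insertion can disturb boundary decisions at no more than w positions, so the number of chunks that change is bounded by w + 1 regardless of the stream length.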
Other Approaches
• Whole-File Chunking
  – Fingerprint over whole file contents
  – Finds only exact file duplicates
• Application-Specific Chunking
  – Use knowledge about specific file types
  – e.g. split mail files exactly at attachment boundaries
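Whole-file chunking amounts to grouping files by a fingerprint of their complete contents. The sketch below assumes SHA-1 as the fingerprint; the file names and contents are hypothetical.

```python
import hashlib
from collections import defaultdict


def whole_file_fingerprint(contents: bytes) -> str:
    """Fingerprint over the whole file contents (SHA-1 assumed here)."""
    return hashlib.sha1(contents).hexdigest()


# Hypothetical files: two identical copies and one with a single changed byte.
files = {
    "a/report.txt": b"quarterly numbers\n",
    "b/report-copy.txt": b"quarterly numbers\n",
    "c/report-v2.txt": b"quarterly numbers!\n",
}

groups = defaultdict(list)
for name, contents in files.items():
    groups[whole_file_fingerprint(contents)].append(name)

duplicates = [names for names in groups.values() if len(names) > 1]
print(duplicates)
```

The third file differs by one byte and therefore shares nothing with the others under this scheme, which is the limitation the slide points out.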
PART 1: LARGE-SCALE COMPARISON (OF FINGERPRINTING-BASED APPROACHES)
Real Life System
• User directories of the University of Paderborn
• Weekly measurements
• Key Facts
  – 450 GB data
  – 3.6M files
  – 8,000 active users
Internal Redundancy
CDC = Content-defined Chunking, SC = Static Chunking, Whole-File = Whole-File Chunking
[Bar chart: internal redundancy (y-axis 0% to 40%) for CDC (8 KB), CDC (16 KB), SC (8 KB), SC (16 KB), Whole-File]
Int. Redundancy (file types)
[Bar chart: redundancy in % (y-axis 0 to 70) per file type (.jpg, .doc, .ppt, .xls, .pdf, .avi, .zip); series: CDC 8, SC 8, Whole-File]
Int. Redundancy (file size)
[Chart: internal redundancy (y-axis 0% to 50%) by file size class (< 4K, > 4K, > 8K, ..., > 1G, > 2G, doubling at each step); series: CDC-8, CDC-16, SC-8, SC-16, Whole-File]
Temporal Redundancy
[Line chart: patch size in % (y-axis 0 to 5) over x-axis values 1 to 15; series: CDC-8, CDC-16, SC-8, SC-16, Whole-File]
PART 2: SMALL-SCALE COMPARISON
Small‐Scale Experiments
• Common assumption: small content modification → small change in file
• Experiments with various file types
  – .zip archive
  – .ppt presentation
  – More in full paper
Structure of a .zip file
Archive layout: Header File 1 | Data File 1 | Header File 2 | Data File 2 | ... | Header File n | Data File n | Central Directory
File header: Signature (4 bytes) | Headerdata (22 bytes) | Filename (variable) | Extra fields (variable)
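To see this layout in an actual archive, the per-file header can be decoded with Python's `struct` module. In the standard ZIP format the fixed part of a local file header is 30 bytes: the 4-byte signature, 22 bytes of header data, and two 2-byte length fields for the file name and the extra fields. That the slide's 22 bytes exclude these two length fields is an assumption. The archive is built in memory and the member contents are made up.

```python
import io
import struct
import zipfile

# Build a tiny archive in memory; the member contents are illustrative only.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("AUTHORS", "Jane Doe\n")
raw = buf.getvalue()

# Fixed part of a local file header (little-endian):
# signature, version, flags, method, mod time, mod date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
FIXED = struct.Struct("<4s5H3I2H")
fields = FIXED.unpack_from(raw, 0)
signature, name_len, extra_len = fields[0], fields[9], fields[10]

# The variable-length file name follows directly after the fixed part.
name = raw[FIXED.size:FIXED.size + name_len].decode("ascii")
print(signature, FIXED.size, name, extra_len)
```

The file data starts after the name and the extra fields, so a change in a header field or in the compressed data of one member changes the bytes that a chunker sees at that position.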
.ZIP File Experiment
• Experiment:
  – Archive of CIFS source code (Linux kernel)
  – Modification of AUTHORS file
[Figures labeled "Finder:" and "WinZIP:"]
.PPT File Experiment
• Experiment
  – Presentation without images
  – Name of author modified
• PowerPoint 2003/2007/2008: < 3% redundant
[Figure: PowerPoint 2008 (excerpt)]
Conclusion
• Large-Scale:
  – ≈ 35% internal redundancy (CDC-8)
  – Advantage for Content-defined Chunking
  – > 98.5% temporal redundancy (all)
  – High variation between weeks
• Small-Scale:
  – Assumption often not correct
  – Application- and version-specific behavior
QUESTIONS?
SOFTWARE AVAILABLE UNDER CODE.GOOGLE.COM/P/FS-C/
Shifts
• Shifts only have local influence
14159 26535 89793 23846 26433 83279 50288 41971 37510 14159 A265 3589793 23846 26433 83279 50288 41971 37510
14159 26535 89793 23846 26433 83279 50288 41971 37510 14159 A2653589793 23846 26433 83279 50288 41971 37510
14159 26535 89793 23846 26433 83279 50288 41971 37510 14159 A26535 89793 23846 26433 83279 50288 41971 37510
Duplicate Detection
• Check for every chunk: is it a duplicate?
  – (Often) no exact comparison of contents
• Compare-by-Hash:
  – Fingerprint for each chunk (typically: SHA-1)
  – Check only whether the fingerprint is known
• Approach debated:
  – Data loss in case of hash collision
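A compare-by-hash index can be sketched as a dictionary keyed by fingerprint. The class name `ChunkStore` and its methods are hypothetical; a real system would keep the index on disk rather than in a Python dict.

```python
import hashlib


class ChunkStore:
    """Toy compare-by-hash store mapping SHA-1 fingerprint to chunk data."""

    def __init__(self) -> None:
        self.chunks = {}
        self.duplicates = 0

    def put(self, chunk: bytes) -> bytes:
        fp = hashlib.sha1(chunk).digest()
        if fp in self.chunks:
            # Known fingerprint: treated as a duplicate, contents are NOT compared.
            self.duplicates += 1
        else:
            self.chunks[fp] = chunk
        return fp

    def get(self, fp: bytes) -> bytes:
        return self.chunks[fp]


store = ChunkStore()
# The list of fingerprints is enough to rebuild the original stream.
recipe = [store.put(c) for c in (b"alpha", b"beta", b"alpha", b"alpha")]
restored = b"".join(store.get(fp) for fp in recipe)
print(len(store.chunks), store.duplicates, restored)
```

The debated point is visible in `put`: if two different chunks ever produced the same fingerprint, the second would be dropped silently and a later restore would return the wrong data.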
Comparison with Incr. Backup
[Line chart: patch size in % (y-axis 0 to 6) over x-axis values 1 to 15; series: CDC-8, Whole-File, Inc. Backup]
Caching Strategies
• Only 10% improvement in archival system Venti
• Backup scenario:
  – most chunks are redundant
  – reduced write accesses
• Caching also reduces read accesses
  – might have greater impact
• Comparison of (simple) caching strategies
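One simple strategy is an LRU cache placed in front of the chunk index. The sketch below assumes that the cache holds chunk fingerprints and that the hit ratio is measured per lookup; the class name `LRUFingerprintCache` and the tiny capacity are illustrative only.

```python
from collections import OrderedDict


class LRUFingerprintCache:
    """LRU cache of chunk fingerprints that counts lookups and hits."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def lookup(self, fp: bytes) -> bool:
        """Return True on a hit; in both cases fp becomes the most recent entry."""
        self.lookups += 1
        hit = fp in self.entries
        if hit:
            self.hits += 1
            self.entries.move_to_end(fp)
        else:
            self.entries[fp] = None
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # evict least recently used
        return hit


cache = LRUFingerprintCache(capacity=2)
stream = [b"a", b"b", b"a", b"c", b"b", b"a"]
results = [cache.lookup(fp) for fp in stream]
hit_ratio = cache.hits / cache.lookups
print(results, hit_ratio)
```

In this stream four of the six fingerprints are redundant, but the cache of size 2 catches only one of them, which illustrates why the hit ratio is reported relative to the internal redundancy.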
LRU Cache
[Line chart: cache hit ratio / internal redundancy in % (y-axis 0 to 80) over x-axis values 1K to 1M; series: LRU CDC-8, LRU CDC-16, LRU SC-8, LRU SC-8]
MU Cache
[Line chart: cache hit ratio / internal redundancy in % (y-axis 0 to 60) over x-axis values 1K to 1M; series: MU CDC-8, MU CDC-16, MU SC-8, MU SC-8]
Related Work
• Policroniades and Pratt (2004):
  – Mirror of sunsite.org.uk: 35 GB
  – Personal user files: 2.9 GB
  – Scratch directories: 100 GB
  – No temporal redundancies evaluated
• You and Karamanolis (2004):
  – 15.5 GB on various data sets not including personal user files
Related Work
• Mandagere et al. (2009):
  – 450 GB personal user files
  – 16 users
  – No temporal redundancies
  – Includes throughput measurements
• Most publications about deduplication:
  – Include evaluations using various, often smaller data sets
  – Only some include personal user files and temporal redundancies
File size distribution
[Chart: file capacity in GB (y-axis 0 to 500) by file size class (< 1, < 16, < 256, < 4K, < 64K, < 1M, < 16M, < 256M, < 2G); series: Capacity, Cumulative Capacity]
[Chart: file count in thousands (y-axis 0 to 4000) by the same file size classes; series: Count, Cumulative Count]
Difference to CDC‐8
[Line chart: patch size as % difference to CDC-8 (y-axis 0 to 120) over x-axis values 1 to 15; series: CDC-16, SC-8, SC-16, Whole-File]
File Types (Excerpt)
Rank   File type   File Count      Storage
1      .jpg        3% (114 K)      13% (59.9 GB)
2      N/A         36% (1261 K)    11% (48.4 GB)
3      .doc        2% (85 K)       8% (36.8 GB)
4      .ppt        0% (9 K)        8% (33.2 GB)
5      .xls        1% (24 K)       8% (33.0 GB)
6      .pdf        1% (40 K)       7% (32.1 GB)
7      .avi        0% (0.6 K)      3% (14.0 GB)
8      .mov        0% (0.2 K)      3% (13.5 GB)
9      .dat        3% (97 K)       2% (10.4 GB)
...
13     .zip        0% (8.6 K)      2% (7.3 GB)
15     .gz         0% (2.8 K)      1% (5.4 GB)
TOTAL              3520 K          436.8 GB
.PPTX
• New PowerPoint format
  – XML files (Content / Metadata)
  – Binary files (Images / Videos)
  – ZIP container
• Some files changed (minimally)
• Compressed data changed
[Figure: PowerPoint 2008]