Supporting Content-Addressable Caching with CZIP Compression
KyoungSoo Park, Sunghwan Ihm, Mic Bowman* and Vivek Pai
Princeton University, *Intel Research
KyoungSoo Park USENIX 2007 2
Content-Based Naming (CBN)
• Naming scheme based on an object's content
  • Name = one-way hash (content)
  • Hashing function: MD5, SHA-1, etc.
  • Rabin's fingerprint for chunk detection
• Redundancy elimination
  • Network-traffic/storage systems
  • Research/commercial systems
  • Special-purpose systems
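As a minimal sketch of content-based naming, the Python snippet below names each chunk by its SHA-1 digest and keeps identical chunks only once. The `cbn_name` helper and the `ChunkStore` class are hypothetical names chosen for illustration; they are not part of CZIP.

```python
import hashlib


def cbn_name(content: bytes, algorithm: str = "sha1") -> str:
    """Return the content-based name: a one-way hash of the bytes."""
    return hashlib.new(algorithm, content).hexdigest()


class ChunkStore:
    """Hypothetical in-memory store keyed by content name."""

    def __init__(self) -> None:
        self._chunks = {}

    def put(self, content: bytes) -> str:
        name = cbn_name(content)
        # Identical content maps to the same name, so it is kept once.
        self._chunks.setdefault(name, content)
        return name

    def get(self, name: str) -> bytes:
        return self._chunks[name]

    def __len__(self) -> int:
        return len(self._chunks)


store = ChunkStore()
names = [store.put(chunk) for chunk in (b"A", b"A", b"C", b"B", b"B")]
print(len(names), "chunks referenced,", len(store), "chunks stored")
```

Five chunks are referenced but only three are stored, which is the redundancy elimination the slide refers to.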
Where Can CBN Be Applied?
• Similar file distribution
  • Linux distribution mirror
  • DVD ISO contains all CD ISOs
• Virtual machine image migration
  • Base OS takes up majority of content
  • httpd VM vs. httpd+mysqld VM
• Uncacheable Web content
  • Some dynamic content doesn't change
Contribution of This Work
• Generic CBN tool
  • Easy to build new systems
  • Easy to upgrade existing non-CBN systems
• CZIP compression + CZIP-aware apps
  • Can be used on existing platforms
  • Provides benefit to non-CZIP apps
• Demonstrate sample systems
  • Reduces FC6 mirror memory footprint by half
  • Comparable compression speed to GZIP's
  • 2x throughput for CZIP-aware Apache
  • 4x origin server BW reduction for CZIP-aware CDN
CZIP Compression
• Compression scheme like GZIP, BZIP2
• Exports CBN information in the header
[Figure: CZIP converts a file made of repeated chunks (A, B, C) into a CZIP file whose header holds global fields and chunk indexes 1-5; UNCZIP reverses the conversion.]
CZIP Header
• Header = global attributes + chunk info
• Global attributes
  • One-way hash function (SHA-1/MD5)
  • Chunk data compression (GZIP/BZIP2)
  • Convergent encryption (on/off)
  • Header CRC, File Hash, etc.
• Chunk information
  • Content hash, start offset, chunk size
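The header layout above can be sketched as two small Python records. This is an illustration under stated assumptions, not the CZIP on-disk format: the class and field names are hypothetical, start offsets are assumed to refer to the original (uncompressed) file, and the header CRC is left out.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ChunkInfo:
    content_hash: str   # content-based name of the chunk
    start_offset: int   # assumed: offset in the original file
    chunk_size: int     # chunk length in bytes


@dataclass
class CzipHeader:
    hash_function: str = "sha1"          # SHA-1 or MD5
    chunk_compression: str = "gzip"      # GZIP or BZIP2
    convergent_encryption: bool = False  # on/off
    file_hash: str = ""                  # hash of the whole file
    chunks: list = field(default_factory=list)


def build_header(chunks: list) -> CzipHeader:
    """Build a header for an already-chunked file (list of bytes)."""
    header = CzipHeader()
    whole_file = hashlib.sha1()
    offset = 0
    for chunk in chunks:
        name = hashlib.sha1(chunk).hexdigest()
        header.chunks.append(ChunkInfo(name, offset, len(chunk)))
        whole_file.update(chunk)
        offset += len(chunk)
    header.file_hash = whole_file.hexdigest()
    return header


header = build_header([b"aaaa", b"bb", b"aaaa"])
for info in header.chunks:
    print(info.start_offset, info.chunk_size, info.content_hash[:8])
```

The first and third entries carry the same content hash, so a reader of the header alone can tell that the chunk body is needed only once.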
Deployment Scenario
• CZIP-aware server
[Figure: clients A and B, a CBN cache, and a CZIP-aware server. file1.cz is read as a header plus chunk A (xyzlo5g) and chunk B (asdfghk); for file2.cz, whose chunks are A, B and C (qoiertty), the steps are "read header" and "read chunk C".]
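The read-header-then-read-missing-chunks flow in this scenario can be sketched as follows. The function name `fetch_file`, the dictionary standing in for the server, and the chunk bodies are assumptions for illustration; the chunk names are the placeholder hashes from the figure.

```python
from typing import Callable, Dict, List, Tuple


def fetch_file(
    chunk_names: List[str],
    cache: Dict[str, bytes],
    fetch_chunk: Callable[[str], bytes],
) -> Tuple[bytes, List[str]]:
    """Rebuild a file from its header's chunk names.

    Only chunks missing from the CBN cache are fetched.
    Returns the file bytes and the names that had to be fetched.
    """
    fetched = []
    parts = []
    for name in chunk_names:
        if name not in cache:
            cache[name] = fetch_chunk(name)
            fetched.append(name)
        parts.append(cache[name])
    return b"".join(parts), fetched


# Stand-in for the server: chunk name -> chunk body (hypothetical bodies).
server_chunks = {
    "xyzlo5g": b"chunk A",
    "asdfghk": b"chunk B",
    "qoiertty": b"chunk C",
}
file1 = ["xyzlo5g", "asdfghk"]              # header of file1.cz
file2 = ["xyzlo5g", "asdfghk", "qoiertty"]  # header of file2.cz

cache = {}
_, first = fetch_file(file1, cache, server_chunks.__getitem__)
data2, second = fetch_file(file2, cache, server_chunks.__getitem__)
print("file1.cz fetched:", first)
print("file2.cz fetched:", second)
```

After file1.cz has populated the cache, file2.cz costs a single chunk transfer.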
Deployment Scenario
• CZIP-aware client-side proxy
[Figure: clients A and B reach the server through a proxy with a CBN cache. The cache holds chunk A (xyzlo5g) and chunk B (asdfghk) from file1.cz; for file2.cz only chunk C (qoiertty) is read from the server, with the request below.]
GET /file2.cz
Range: bytes=1000-1999
X-SHA-1: qoiertty
1. X-SHA-1 field helps CZIP-aware server
2. Browser cache can support CBN too!
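A chunk entry from the header (start offset, size, hash) is enough to build the request shown on this slide. The sketch below is one way to do that; the function names are hypothetical, and the HTTP/1.0 request line is an assumption since the slide does not show a protocol version.

```python
def chunk_request_headers(start_offset: int, chunk_size: int, sha1_name: str) -> dict:
    """Headers asking for one chunk of a .cz file by byte range."""
    # HTTP byte ranges are inclusive on both ends.
    end_offset = start_offset + chunk_size - 1
    return {
        "Range": f"bytes={start_offset}-{end_offset}",
        "X-SHA-1": sha1_name,  # lets a CZIP-aware server or cache match by content
    }


def format_request(path: str, headers: dict) -> str:
    """Render a plain-text GET request (protocol version assumed)."""
    lines = [f"GET {path} HTTP/1.0"]
    lines += [f"{key}: {value}" for key, value in headers.items()]
    return "\r\n".join(lines) + "\r\n\r\n"


headers = chunk_request_headers(1000, 1000, "qoiertty")
print(format_request("/file2.cz", headers))
```

A server that does not understand `X-SHA-1` can still answer from the `Range` header alone, so the request degrades to an ordinary byte-range fetch.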
Compressibility
• Fedora Core 6 ISOs / All files / Wikipedia DB
[Figure: data compression ratio (axis 0 to 1) for FC6_i386_ISOs.tar (6.7 GB), FC6_All_files.tar (49.7 GB) and Wikipedia_DB.tar (7.9 GB) under CZIP+plain, CZIP+gzip, CZIP+bzip2, GZIP and BZIP2. Bar labels in extraction order: 3.3, 3.2, 3.2, 6.5, 6.5, 20.3, 48.5, 48.3, 19.6, 19.9, 7.9, 2.7, 2.5, 2.5, 1.9.]
Compression Speed
• On Pentium D 2.8GHz with 4GB memory
[Figure: normalized time (0 to 1) for BZIP2, GZIP, CZIP+bzip2, CZIP+gzip and CZIP+plain on FC6_i386_ISOs.tar, FC6_All_files.tar and Wikipedia_DB.tar. Time labels in extraction order: 3,964 secs, 29,004 secs, 3,151 secs.]
Virtual Machine Images
• Server consolidation/management
• Much redundancy among similar VMs
  • Xen FC4 base image (X)
  • X + httpd (Y) / Y + mysqld (Z)
• Investigating content overlap over
  • Chunk size
  • Chunking methods
    • Rabin's fingerprint vs. fixed-sized
  • After extensive use
Chunk Size / Chunking Methods
Compare three VM images
Base = Xen FC4 image / Apache = Base + httpd / Both = Apache + mysqld
[Figure: content overlap (%, 0 to 100) vs. chunk size (4, 8, 16, 32, 60 KB) for Base vs. Apache and Apache vs. Both, each plotted with Rabin's fingerprint and with fixed-sized chunking.]
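To see why the chunking method matters, the sketch below compares fixed-sized chunking with a content-defined chunker on two inputs that differ by one inserted byte. It is a simplified illustration: the boundary test uses a plain polynomial rolling hash over a sliding window as a stand-in for Rabin's fingerprint, and the window size, mask and test data are arbitrary assumptions, not values from the talk.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int) -> list:
    """Cut the data every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0xFF) -> list:
    """Cut wherever a rolling hash of the last `window` bytes matches a pattern.

    Stand-in for Rabin's fingerprint: boundaries depend on local content,
    not on absolute position in the file.
    """
    base, mod = 257, 1_000_000_007
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        if i >= window - 1 and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def overlap(old_chunks: list, new_chunks: list) -> float:
    """Fraction of the new file's bytes found in chunks of the old file."""
    known = {hashlib.sha1(c).digest() for c in old_chunks}
    shared = sum(len(c) for c in new_chunks if hashlib.sha1(c).digest() in known)
    total = sum(len(c) for c in new_chunks)
    return shared / total if total else 0.0


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"!" + original  # one byte inserted at the front

fixed_overlap = overlap(fixed_chunks(original, 256), fixed_chunks(shifted, 256))
cdc_overlap = overlap(content_defined_chunks(original), content_defined_chunks(shifted))
print(f"fixed-sized: {fixed_overlap:.1%}, content-defined: {cdc_overlap:.1%}")
```

The single inserted byte shifts every fixed-sized chunk, so almost nothing matches, while the content-defined boundaries realign after the first chunk or two.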
Real VM Images
EC1 ~ EC5: VMs based on Xen FC-4 + standard tools
Used daily by five different engineers for three weeks
[Figure: content overlap (%, axis from 88 to 99) vs. chunk size (4, 8, 16, 32, 60 KB) for EC1 vs. EC2 and EC3 vs. EC4, each with fixed-sized and Rabin chunking.]
Dynamic Web Pages
• Observed the front page of these sites
  • Google News
  • CNN
  • Slashdot
  • Digg.com
  • Fark.com
  • New York Times
• All of them non-cacheable
  • "no-cache", "no-store" or "private"
Average Content Overlap
Downloaded pages every 10 minutes for 18 days
[Figure: content overlap (%, 0 to 100) vs. chunk size (1, 2, 4, 8, 16, 32 KB) for CNN.com, Fark.com, Slashdot, NYTimes.com, Google News and Digg.com.]
Potential Data Savings via CZIP
[Figure: total transferred data (MB, 0 to 400) without CZIP and with CZIP for Google News, Slashdot, CNN, Digg.com, Fark.com and NY Times. Percentage labels in extraction order: 37%, 57%, 90%, 24%, 61%, 39%.]
Summary So Far
• CZIP is comparable to GZIP in speed and performance
  • CZIP is far better on files with much redundancy
• Redundancy decreases as chunk size increases
  • Rabin's fingerprint exposes a good deal of redundancy regardless of chunk size
  • Optimal chunk size varies with workload
  • Bigger chunk size is better for network transfer
• Dynamic content also exposes redundancy
  • CZIP can save 24-90% of BW compared with GZIP
Server Performance
• CZIP Apache Module
• Test scenario (FC mirror simulation)
  • 1.5 GB from FC6 DVD
  • 1.5 GB is split into three 0.5 GB images
  • Each file is requested in round-robin fashion
  • 100-300 clients simulated by six machines in LAN
  • Server is 2.8GHz Pentium D w/ 2GB physical memory and 2 Gbps-NICs
CZIP Apache Module
[Figure: cumulative distribution (0 to 1) of client throughput (0 to 300 Mbps) for CZIP-aware Apache and normal Apache. Annotations: median 2.07 times, 90% 2.56 times.]
Worst client in CZIP-aware Apache is faster than 91% of normal Apache clients
CBN-Aware Content Distribution
• CoBlitz large-file CDN [NSDI'06]
  • Serving 1-2 TB every day on PlanetLab
  • http://coblitz.codeen.org/URL
  • University channel – podcast/vodcast
  • Fedora Core mirror, Citeseer etc.
• Chunk is basic caching unit
  • Parallel chunk requests/responses
  • Chunk request in HTTP byte-range query
Making CoBlitz CZIP-Aware
• CoBlitz's chunk request
  GET /coblitz.codeen.org/www.cs.princeton.edu/bigfile.cz,start=1000,end=1999 HTTP/1.0
  Host: coblitz.codeen.org
• CZIP-aware CoBlitz (C-CoBlitz) request
  GET /czip.codeen.org/Chunk_SHA-1_Hash HTTP/1.0
  Host: czip.codeen.org
  X-URL: www.cs.princeton.edu/bigfile.cz
  X-Range: byte=1000-1999
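The C-CoBlitz request moves the chunk hash into the request path and the origin URL and byte range into headers. The sketch below builds such a request as text; `c_coblitz_request` and `request_line` are hypothetical helpers, and `otherfile.cz` is an invented second file used only to show that two files sharing a chunk would presumably produce the same request path.

```python
def c_coblitz_request(chunk_hash: str, origin_url: str, start: int, end: int,
                      cdn_host: str = "czip.codeen.org") -> str:
    """Render a C-CoBlitz style chunk request, following the slide's layout."""
    lines = [
        f"GET /{cdn_host}/{chunk_hash} HTTP/1.0",
        f"Host: {cdn_host}",
        f"X-URL: {origin_url}",          # where to get the chunk on a miss
        f"X-Range: byte={start}-{end}",  # which bytes of that file
    ]
    return "\r\n".join(lines) + "\r\n\r\n"


def request_line(request: str) -> str:
    """First line of the request, the part a URL-keyed cache would see."""
    return request.split("\r\n", 1)[0]


req_a = c_coblitz_request("Chunk_SHA-1_Hash", "www.cs.princeton.edu/bigfile.cz", 1000, 1999)
# Hypothetical second file that contains the same chunk at another offset.
req_b = c_coblitz_request("Chunk_SHA-1_Hash", "www.cs.princeton.edu/otherfile.cz", 5000, 5999)

same_cache_key = request_line(req_a) == request_line(req_b)
print(req_a)
print("same request line for both files:", same_cache_key)
```

With the regular CoBlitz request the file name and range are part of the path, so the same chunk in two files would be requested under two different names.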
CZIP-Aware CoBlitz Testing
• Two content-overlapping files
• Simultaneously fetch from 100 PlanetLab nodes
• Origin server is at Princeton
• Testing cases
  • Regular: Download original files by regular CoBlitz
  • File-CZIP: Download CZIP'ed files by regular CoBlitz
  • CZIP-CDN: Download CZIP'ed files by C-CoBlitz
100 MB File Downloading
[Figure: bar chart. Regular: 388 MB; File-CZIP: 273 MB, 29.6%; CZIP-CDN: 191 MB, 29.7%.]
50 MB File Downloading
[Figure: bar chart. Regular: 183 MB; File-CZIP: 92 MB, 49.7%; CZIP-CDN: 24 MB, 73.9%.]
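The percentages on the 50 MB chart appear to be reductions relative to the preceding bar rather than to the Regular case; that reading is an inference from the numbers, not something the slide states. A short check in Python, with `reduction_percent` as a hypothetical helper:

```python
def reduction_percent(before_mb: float, after_mb: float) -> float:
    """Percentage saved when going from before_mb to after_mb."""
    return round(100.0 * (before_mb - after_mb) / before_mb, 1)


regular, file_czip, czip_cdn = 183, 92, 24  # MB, from the 50 MB chart

file_czip_vs_regular = reduction_percent(regular, file_czip)
czip_cdn_vs_file_czip = reduction_percent(file_czip, czip_cdn)
czip_cdn_vs_regular = reduction_percent(regular, czip_cdn)

print("File-CZIP vs. Regular:", file_czip_vs_regular)    # matches the 49.7% label
print("CZIP-CDN vs. File-CZIP:", czip_cdn_vs_file_czip)  # matches the 73.9% label
print("CZIP-CDN vs. Regular:", czip_cdn_vs_regular)
```

Under the same reading, the 100 MB chart's 388 MB to 273 MB step gives 29.6%, while 273 MB to 191 MB gives 30.0% from the rounded values, close to the 29.7% label.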
Conclusion
• CZIP is a generic compression tool providing CBN benefits
• CZIP is comparable to GZIP in compression performance
• CZIP helps greatly reduce memory footprint in serving similar files
• It is very easy to support CZIP and the benefit is transparent
Thank you!
More information can be found at http://codeen.cs.princeton.edu/czip/
CZIP code will be released soon!
200/300 Clients
[Figure: two panels, 200 clients and 300 clients. Labels in extraction order: 65%, 80%; median 1.95 times, 90% 2.27 times; median 1.84 times, 90% 2.11 times.]