DESCRIPTION
Data compression is an amazing topic. Even in today’s world, with fast networks and almost unlimited storage, data compression is still relevant, especially for mobile devices and countries with poor Internet connections. For better or worse, GZIP is the de facto lossless method for compressing text data on websites. It is neither the fastest nor the best, but it provides an excellent tradeoff between speed and compression ratio. The way the Internet works also makes it difficult to adopt newer compression methods. This talk examines how GZIP works internally, explaining the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. Different implementations, such as GNU GZIP, 7-ZIP and zopfli, will be compared, focusing on why and how some of them perform better than others. Finally, we will try to go beyond GZIP by preprocessing our data to achieve better results, for example by transposing JSON.
HOW GZIP COMPRESSION WORKS
RAUL FRAILE
JSCONF EU BERLIN
ABOUT ME
• PHP / JS SOFTWARE DEVELOPER
• MS (RES) STUDENT IN COMPUTING TECHNOLOGIES.
• MADE IN SPAIN.
DATA COMPRESSION
NOT AN EXPERT*
DATA COMPRESSION IS AN AMAZING TOPIC
REALLY!
IT CAN BE SEEN LIKE… MAGIC
flickr.com/photos/jeffkrause/6799254170
flickr.com/photos/t_e_brown/8677750589
… IT’S NOT
INFORMATION THEORY
CLAUDE SHANNON
ENTROPY
flickr.com/photos/95303997@N07/10074330416
$H = -\sum_{x} p(x) \log_2 p(x)$
AVERAGE AMOUNT OF INFORMATION CONTAINED IN EACH MESSAGE
≈ NUMBER OF BITS TO REPRESENT THE MESSAGE
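As a rough illustration of the entropy formula, the sketch below estimates the entropy of a short message from its own character frequencies. The function name `shannon_entropy` is made up for this example, and treating each character as an independent symbol is a simplifying assumption.

```python
import math
from collections import Counter

def shannon_entropy(message: str) -> float:
    """Return H = -sum(p(x) * log2(p(x))) in bits per symbol."""
    counts = Counter(message)
    total = len(message)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

message = "HELLO WORLD"
h = shannon_entropy(message)
# h is the average number of bits per character; h * len(message)
# approximates the bits needed for the whole message.
print(f"{h:.3f} bits/symbol, about {h * len(message):.1f} bits in total")
```

Under this simple model no symbol-by-symbol code can use fewer bits on average than the entropy, which gives a baseline for judging the codes discussed later.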
225 days/year 62 %
17 days/year 6 %
flickr.com/photos/aigle_dore/5952296478
flickr.com/photos/mariano-mantel/13955110319
HUMAN BRAIN IS DESIGNED TO COMPRESS DATA
flickr.com/photos/birthintobeing/11841180046
flickr.com/photos/neolao/3105372669
flickr.com/photos/tommiephotography/6840025942
flickr.com/photos/earlysound/2186172726
MORSE CODE
SHORTER SEQUENCES FOR COMMON CHARACTERS
flickr.com/photos/amboo213/9044879245
DATA COMPRESSION IN HTTP
GET index.html
Accept-Encoding: gzip, deflate
GZIP + HTTP
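The negotiation above can be sketched on the server side as follows. `encode_body` is a hypothetical helper, not part of any framework, and it deliberately ignores q-values such as `gzip;q=0` that a real server would need to honour.

```python
import gzip

def encode_body(body: bytes, accept_encoding: str):
    """Return (payload, extra_headers) for a response body.

    Hypothetical helper: gzip is used only when the client lists it
    in Accept-Encoding; q-values are ignored for brevity.
    """
    offered = {item.split(";")[0].strip().lower() for item in accept_encoding.split(",")}
    if "gzip" in offered:
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}

html = b"<p>hello gzip</p>" * 200
payload, headers = encode_body(html, "gzip, deflate")
print(len(html), "->", len(payload), headers)
```

The client learns from the `Content-Encoding: gzip` response header that it must decompress the payload before using it.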
GZIP COMPRESSION
GZIP
• USES THE DEFLATE ALGORITHM
• DEFLATE WAS DESIGNED BY PHIL KATZ
• DEFLATE IS ALSO USED IN HTTP, PNG AND PDF
DEFLATE
LZ77 + HUFFMAN CODING
LZ77 (VARIATION)
SEARCH BUFFER (UP TO 32 KB) | LOOK-AHEAD
THIS FILE IS HUGE! THAT'S BECAUSE THE FILE IS NOT COMPRESSED
<33,9>
LITERALS · LENGTHS · DISTANCES
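A minimal sketch of the matching step, assuming a greedy longest-match search and reading the pair `<33,9>` as `<distance, length>`. The 32 KB window and the 3 to 258 match lengths are DEFLATE's limits; the function names and the brute-force search are illustrative only (real implementations use hash chains).

```python
def lz77_tokens(data: str, window: int = 32 * 1024, min_len: int = 3, max_len: int = 258):
    """Greedy LZ77: emit literals (str) or (distance, length) back-references."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force scan of the search buffer for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens

def lz77_decode(tokens) -> str:
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])  # copy from `distance` characters back
        else:
            out.append(token)
    return "".join(out)

text = "THIS FILE IS HUGE! THAT'S BECAUSE THE FILE IS NOT COMPRESSED"
tokens = lz77_tokens(text)
print(tokens)
```

For this sentence the second " FILE IS " becomes the back-reference `(33, 9)`: go back 33 characters and copy 9.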
HUFFMAN CODING
FIXED-LENGTH CODES
HELLO WORLD
01001000 01000101 01001100 01001100 01001111 00100000 01010111 01001111 01010010 01001100 01000100
88 BITS
H 000 · E 001 · L 010 · O 011 · W 100 · R 101 · D 110 · _ 111
000 001 010 010 011 111 100 011 101 010 110
33 BITS
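The two bit counts can be reproduced with a short sketch. The symbol order in `SYMBOLS` follows the table above, with `_` standing for the space; the variable names are only illustrative.

```python
message = "HELLO WORLD"

# 8-bit ASCII: every character costs 8 bits.
ascii_bits = "".join(format(byte, "08b") for byte in message.encode("ascii"))

# Fixed-length code sized for the 8 distinct symbols actually used.
SYMBOLS = "HELOWRD "  # same order as the table; the last entry is the space
width = (len(SYMBOLS) - 1).bit_length()  # 3 bits are enough for 8 symbols
table = {ch: format(i, f"0{width}b") for i, ch in enumerate(SYMBOLS)}
encoded = "".join(table[ch] for ch in message)

print(len(ascii_bits), "bits as ASCII")
print(len(encoded), "bits with", width, "bits per symbol")
```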
HUFFMAN CODING
VARIABLE-LENGTH CODES
CHARACTER FREQUENCY:
CHARACTER  FREQUENCY  CODE
L          3          0
O          2          1
H          1          00
E          1          01
W          1          10
R          1          11
D          1          000
_          1          001
HELLO WORLD
0001001001101110000
19 BITS
IT’S AMBIGUOUS
H EL H OD O…
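To see why these 19 bits are ambiguous, the sketch below counts how many different messages decode to the same bit string under the variable-length codes listed above. The dynamic-programming helper `count_decodings` is an illustrative addition, not something from the talk.

```python
CODES = {"L": "0", "O": "1", "H": "00", "E": "01",
         "W": "10", "R": "11", "D": "000", " ": "001"}

def count_decodings(bits: str) -> int:
    """Count the ways `bits` can be split into valid code words."""
    ways = [1] + [0] * len(bits)  # ways[n] = number of parses of bits[:n]
    for end in range(1, len(bits) + 1):
        for code in CODES.values():
            if bits.endswith(code, 0, end):  # does bits[:end] end with this code?
                ways[end] += ways[end - len(code)]
    return ways[len(bits)]

bits = "".join(CODES[ch] for ch in "HELLO WORLD")
print(bits, len(bits), "bits,", count_decodings(bits), "possible decodings")
```

Even the two bits `00` have two readings, `H` or `LL`, because the code for `L` is a prefix of the code for `H`.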
HUFFMAN CODING
CHARACTER  FREQUENCY  CODE
L          3          10
O          2          111
H          1          001
E          1          1100
W          1          011
R          1          000
D          1          1101
_          1          010
HELLO WORLD
001 1100 10 10 111 010 011 111 000 10 1101
32 BITS
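A compact sketch of Huffman's construction using a priority queue. The exact code words depend on how ties between equal frequencies are broken, so they may differ from the table above, but the total length for this message is 32 bits either way. `huffman_codes` is an illustrative name.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(message: str) -> dict:
    """Build a prefix-free code by repeatedly merging the two rarest subtrees."""
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(freq, next(tiebreak), {ch: ""}) for ch, freq in Counter(message).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend one bit: 0 for the first subtree, 1 for the second.
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

message = "HELLO WORLD"
codes = huffman_codes(message)
encoded = "".join(codes[ch] for ch in message)
print(codes)
print(len(encoded), "bits")
```

Because no code word is a prefix of another, the bit stream can be decoded unambiguously without separators.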
HUFFMAN CODING
TABLE 1: LITERALS + LENGTHS
TABLE 2: DISTANCES
BLOCKS
BLOCK 1 · BLOCK 2 · … · BLOCK N
M · M · M · M
MODE 1: NO COMPRESSION
MODE 2: FIXED CODE TABLES
MODE 3: GENERATED CODE TABLES
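The three modes correspond to the 2-bit block type (BTYPE) stored in each DEFLATE block header: 0 for stored data, 1 for fixed tables, 2 for generated (dynamic) tables. The sketch below, which assumes Python 3.6+ so that `zlib.Z_FIXED` is available, peeks at the first block of a raw DEFLATE stream. The helper names are illustrative, and which mode zlib picks by default depends on the data.

```python
import zlib

def raw_deflate(data: bytes, level: int = 6, strategy: int = zlib.Z_DEFAULT_STRATEGY) -> bytes:
    # wbits=-15 asks zlib for a raw DEFLATE stream without zlib/gzip headers.
    comp = zlib.compressobj(level, zlib.DEFLATED, -15, 8, strategy)
    return comp.compress(data) + comp.flush()

def first_block_type(stream: bytes) -> int:
    """Bit 0 of the first byte is BFINAL; bits 1-2 are BTYPE."""
    return (stream[0] >> 1) & 0b11

data = b"THIS FILE IS HUGE! THAT'S BECAUSE THE FILE IS NOT COMPRESSED " * 50
names = {0: "no compression", 1: "fixed code tables", 2: "generated code tables"}
for label, stream in [
    ("level 0", raw_deflate(data, level=0)),
    ("Z_FIXED strategy", raw_deflate(data, strategy=zlib.Z_FIXED)),
    ("default settings", raw_deflate(data)),
]:
    print(f"{label}: {names[first_block_type(stream)]}, {len(stream)} bytes")
```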
flickr.com/photos/functoruser/2436979033
GZIP COMPRESSION
IMPLEMENTATIONS
GNU GZIP · 7-ZIP · ZOPFLI
MODE FAST · MODE NORMAL · MODE HIGH COMPRESSION
GENERAL RULE: MORE TIME, BETTER COMPRESSION RATIO
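The general rule can be tried with the compression levels of Python's standard `gzip` module, used here as a stand-in for the fast, normal and high modes. 7-Zip and Zopfli are not part of the standard library and are not measured; the synthetic test data is an assumption, and timings will vary by machine.

```python
import gzip
import random
import time

rng = random.Random(0)  # fixed seed so the sample text is reproducible
words = ["gzip", "deflate", "huffman", "lz77", "block", "window", "match", "literal"]
data = " ".join(rng.choice(words) for _ in range(50_000)).encode("ascii")

results = {}
for level in (1, 6, 9):  # roughly: fast, normal, high compression
    start = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    results[level] = len(compressed)
    print(f"level {level}: {len(compressed)} bytes in {elapsed * 1000:.1f} ms")
```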
GZIP COMPRESSION
WHY GZIP?
TRADEOFF
• GOOD COMPRESSION RATIO.
• FAST TO (UN)COMPRESS.
• IN THE WORST CASE, EXPANDS THE DATA SLIGHTLY.
• MEMORY USE INDEPENDENT OF THE INPUT DATA.
• FREE IMPLEMENTATIONS THAT AVOID PATENTS.
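The worst-case point can be checked with incompressible input. In this sketch pseudo-random bytes stand in for data that has no redundancy, and the output grows only by a small, roughly constant overhead.

```python
import gzip
import random

rng = random.Random(0)  # fixed seed so the run is reproducible
noise = bytes(rng.getrandbits(8) for _ in range(100_000))

packed = gzip.compress(noise)
overhead = len(packed) - len(noise)
print(f"{len(noise)} -> {len(packed)} bytes ({overhead:+d})")
```

The extra bytes come from the gzip header and trailer plus the per-block bookkeeping that DEFLATE needs when it falls back to storing data uncompressed.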
NEWER ALGORITHMS
ISSUES TRYING TO ADD BZIP2 SUPPORT TO CHROME
GZIP COMPRESSION
BEYOND GZIP
PREPROCESS DATA TO OPTIMIZE MATCHES
GZIP(T(DATA)) < GZIP(DATA)
TRANSPOSING JSON
{ "name": "John", "country": "USA" }, { "name": "Stephan", "country": "Germany" }, { "name": "Rob", "country": "USA" }
{ "name": [ "John", "Stephan", "Rob" ], "country": [ "USA", "Germany", "USA" ] }
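A minimal sketch of the transposition, assuming every record has the same keys. `transpose` and `untranspose` are illustrative names; with only three records the gzipped sizes are dominated by fixed overhead, so the benefit is easier to see on larger data sets.

```python
import gzip
import json

def transpose(records):
    """Turn a list of same-keyed objects into one object of lists."""
    keys = list(records[0])
    return {key: [record[key] for record in records] for key in keys}

def untranspose(columns):
    """Inverse transformation, applied after decompression."""
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*columns.values())]

records = [
    {"name": "John", "country": "USA"},
    {"name": "Stephan", "country": "Germany"},
    {"name": "Rob", "country": "USA"},
]
columns = transpose(records)

for label, value in (("rows", records), ("columns", columns)):
    raw = json.dumps(value).encode("utf-8")
    print(label, len(raw), "bytes raw,", len(gzip.compress(raw)), "bytes gzipped")
```

Grouping values of the same field together places similar strings close to each other, which tends to give LZ77 longer and nearer matches.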
XML/HTML ATTRIBUTES ORDER
<input id='f1' class='field' name="f1" type="text" /> <input class="field" id="f2" type="text" name="f2" />
17.76 %
<input id="f1" class="field" name="f1" type="text" /> <input class="field" id="f2" type="text" name="f2" />
27.10 %
<input id="f1" class="field" name="f1" type="text" /> <input id="f2" class="field" name="f2" type="text" />
38.32 %
<input type="text" class="field" id="f1" name="f1" /> <input type="text" class="field" id="f2" name="f2" />
38.32 %
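One way to apply this idea automatically is to rewrite each tag with a single quote style and a fixed attribute order. The regex-based `normalize_tag` below is a hypothetical helper that only handles simple tags like the ones above (no unquoted values, no quotes inside values), and alphabetical order is just one possible convention. The printed sizes include zlib's small wrapper, so they are not expected to match the percentages above.

```python
import re
import zlib

ATTR = re.compile(r"""([\w-]+)=(["'])(.*?)\2""")

def normalize_tag(tag: str) -> str:
    """Rewrite one tag with double-quoted attributes sorted by name."""
    name = re.match(r"<\s*([\w-]+)", tag).group(1)
    attrs = sorted((key, value) for key, _quote, value in ATTR.findall(tag))
    parts = [name] + [f'{key}="{value}"' for key, value in attrs]
    closing = " />" if tag.rstrip().endswith("/>") else ">"
    return "<" + " ".join(parts) + closing

def deflated_size(text: str) -> int:
    return len(zlib.compress(text.encode("utf-8"), 9))

tags = [
    """<input id='f1' class='field' name="f1" type="text" />""",
    """<input class="field" id="f2" type="text" name="f2" />""",
]
normalized = [normalize_tag(tag) for tag in tags]

print(deflated_size(" ".join(tags)), "bytes before normalizing")
print(deflated_size(" ".join(normalized)), "bytes after normalizing")
```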
http://goo.gl/GgMw26
REFERENCES
“Compressor Head”, Colt McAnlis
“Data Compression: The Complete Reference”, David Salomon
“A Universal Algorithm for Sequential Data Compression”, Jacob Ziv & Abraham Lempel
“A Method for the Construction of Minimum-Redundancy Codes”, David A. Huffman
THANK YOU
Raúl Fraile · @raulfraile