Upload
buidien
View
220
Download
3
Embed Size (px)
Citation preview
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Information and CodingPart I: Data Compression
Marek Rychlik
Department of MathematicsUniversity of Arizona
February 25, 2016
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
What is coding — informally?
Definition (Coding)
Coding is algorithmic rewriting of data, while achievingdesirable objectives.
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Three main types of objectives
1 Reducing the cost of storage and transmission timerequired (data compression);
2 Preventing data loss due to random causes, such asdefects in physical media (error correcting codes);
3 Achieving privacy of comunications (data encryption).
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Without data compression we would not have
1 high-definition digital streaming video;2 cellular telephony;3 high-definition digital cinema.
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Coriolanus — the play by William Shakespeare
Figure: W. Shakespeare
“Before we proceed any further,hear me speak. Speak, speak. Youare all resolved rather to die than tofamish? Resolved. resolved. First,you know Caius Marcius is chiefenemy to the people. We know’t,we know’t. Let us kill him, andwe’ll have corn at our own price. Is’ta verdict? No more talking on’t; letit be done: away, away! One word,good citizens... [total of 141,830characters]”
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
An early printed scientific paper
Figure: Nikolaus Kopernikus, 1543, “Polish astronomer”(Encyclopedia Britannica)
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Character encoding
Characters are numbered.The oldest standard is ASCII (American Standard Codefor Information Interchange).ASCII is pronounced “as key”.Encodes characters as numbers 0–255.Successor to ASCII is Unicode . It aims at encoding allcharacters in all human alphabets.Only ASCII is used in the current talk.
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Bits and Bytes
The term bit stands for a binary digit and it is either 0 or 1.Bits are digits of base-2 (binary) representation ofnumbers, e.g. 6 in decimal is 101 in binary.
(101)2 = 1 ⋅ 22+ 0 ⋅ 21
+ 1 ⋅ 1 = (6)10
It is a normalized unit of information; we may think thatinformation comes as a sequence of answers to Yes/Noquestions. A fraction of a bit makes sense, statisticallyspeaking (e.g. tossing a biased coin).A byte is a sequence of 8 bits. A byte can also be thoughtof as an unsigned integer in the range 0–255.
(10000100)2 = 1 ⋅ 27+ 1 ⋅ 22
= 128 + 4 = (123)10
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Hexadecimal and Octal notation for integers
1 byte = 8 bits = 2 groups of 4 bits; 24 = 16 “hexadecimal” =base 1616 digits required for hexadecimal: 0–9, A,B,C,D,E,F(letters used as digits)
(1010 0100)2 = (8 + 2) ⋅ 16 + 4 = (A4)16 = /xA4
Octal notation uses groups of 3 bits. 8 digits required: 0–7.
(10 100 100)2 = (244)8 = /o244
NOTE: 8 is not divisible by 3, the highest octal digit of abyte is 3.
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
ASCII codes
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Extended ASCII codes
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Reading some characters of “Coriolanus” in MATLAB
1 %% F i l e : p1 .m2 % Read f i r s t 10 charac te rs from Cor io lanus3 f =fopen ( ’ TextCompression / c o r i o l a n . t x t ’ , ’ r ’ ) ;4 A=f read ( f ,10 , ’ u i n t 8 ’ ) ;5 f c l o s e ( f ) ;6 A ’ % P r i n t i n decimal7 dec2hex (A) % P r i n t i n hexadecimal8 dec2bin (A) % P r i n t i n b ina ry
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Output I
>> p1
ans =
66 101 102 111 114 101 32 119 101 32
ans =
4265666F
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Output II
726520776520
ans =
1000010110010111001101101111
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Output III
111001011001010100000111011111001010100000
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
An important paradigm of data compression
Data Compression = Data Modeling + Source Coding
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
A data model of “Coriolanus”
Example (Data Modeling)
Out model of “Coriolanus” is a random data model: Everycharacter is drawn at random from the alphabet, according tothe probability distribution
P = (p1,p2, . . . ,pn)
Numbers pj are considered parameters, to be estimated.
George Box, a statistician, circa 1979
All models are wrong but some are useful.
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Arithmetic coding as a source coding approach
Example (Source Coding)
Our source coding technique is arithmetic coding, to beexplained later. “Coriolanus” will be encoded as a single binaryfraction with approximately 640,000 binary digits (bits) after thebinary point!
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
The alphabet and “table” of Coriolanus in MATLAB
1 %% F i l e : p2 .m2 %% Read a l l charac te rs o f Cor io lanus as ASCII codes3 f =fopen ( ’ TextCompression / c o r i o l a n . t x t ’ , ’ r ’ ) ;4 A=f read ( f , i n f , ’ u i n t 8 ’ ) ;5 f c l o s e ( f ) ;6 %% Conduct frequency ana lys i s ( b u i l d " t ab l e " )7 % C i s the alphabet : as per documentation8 % of unique , IA and IC are index ar rays s . t .9 % C = A( IA ) and A = C( IC )
10 [C, IA , IC ]= unique (A) ;11 t ab l e = h i s t coun ts ( IC , 1 : ( leng th (C) +1) ) ;12 %% Bar p l o t s o f a lphabet and tab le13 bar (C) ; bar ( t ab l e )
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
The alphabet and “table” bar plots
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Entropy of a probability distribution
Definition (Shannon Entropy)
Figure: C. Shannon
Let P = (p1,p2, . . . ,pn) bea probability distribution on numbers 1–n.The number H(P) defined by the formula:
H(P) =n∑j=1(− log2 pj) ⋅ pj
is calledthe Shannon entropy of the distribution P.
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Entropy of the “Coriolanus” distribution
1 %% F i l e : p3 .m2 %% Read data , f i n d tab le3 p2 ;4 %% Find entropy5 p= tab le / sum( tab le ) ;6 entropy=sum(−p . * log2 ( p ) )
>> p3
entropy =
4.5502
>>
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Meaning of entropy in data compression
Entropy as mean number of bits per symbolIf entropy is h = 4.5502 bits, we expect h bits to be issued in thecompressed message, per every symbol of the originalmessage. Since ASCII encoding uses 1 byte (=8 bits), persymbol, se expect compression ratio to be:
84.5502
= 1.7582
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Calculation of arithmetic code in MATLAB
1 %% F i l e : p4 .m2 %% Read t e x t f i l e i n t o ar ray A, compute tab le ,
alphabet , index ar ray IC3 p2 ;4 %% Use a r i t h m e t i c coding to a r i t h m e t i c a l l y encode
the sequence of i nd i ces5 code=ar i thenco ( IC , t ab l e ) ;6 compress ion_rat io =(8* leng th (A) ) / l eng th ( code )
>> p4
compression_ratio =
1.7581
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
What is arithmetic code?
Arithmetic codeArithmetic code is a binary code which represents digits of areal number x ∈ [0,1] uniquely assigned to the originalmessage (Shakespeare’s play “Coriolanus”). The number hasapproximately
H(P) ⋅N = 645354
binary digits after the binary point:
x = 0.d1d2 . . .
where dj ∈ {0,1}.
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Commentary on arithmetic coding
The idea of arithmetic coding comes fromShannon-Fano-Elias coding.The algorithm was implemented in a practical manner byIBM around 1980.IBM obtained a patend for arithmetic coding and defendedit vigorously for approximately 20 years. The value of thealgorithm as intellectual property was estimated at tens ofbillions of dollars.Now there are many freely available implementations ofarithmetic coding, including the MATLAB implementation.
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
The principle of arithmetic coding I
1 Let m = (a0,a1, . . . ,aN) be a message with
aj ∈ {1,2, . . . ,n}
2 Let P = (p1,p2, . . . ,pn) be a relative frequency table of thesymbols in a message.
3 Every partial message mK = (a0,a1, . . . ,aK ) is assigned asemi-open subinterval of I = [0,1).
4 m0 (the empty message) is assigned I0 = I.5 If mK−1 is assigned interval IK−1, interval IK ⊆ IK−1 and the
length of IK is proportional to paK (the probability of theK -th symbol).
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
The principle of arithmetic coding II6 As the length of the interval
∣IN ∣ =N∏j=0
paj
− log2 ∣IN ∣ =N∑j=0(− log paj ) = N ⋅
n∑l=1(− log pl)
Nl
N
= N ⋅n∑l=1(− log pl)pl = N ⋅H(P).
where Nl is the number of times l appears in the message,and N is the length of the message.
7 We select a number in IN with the smallest number ofbinary digits, approximately N ⋅H(P).
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Useful definitions I
An alphabet is any finite set A.
A message m = a0a1 . . .an (or m = (a0,a1, . . . ,an)) over thealphabet A is any finite sequence of elements of A. The setof all messages is denoted A+. We have this disjoint unionof Cartesian powers of the alphabet:
A+ =∞⊍n=0
n⨉j=1
A
Acode is a map C ∶ A+ → B+.
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Useful definitions II
A symbol code is a map C ∶ A→ B+, i.e. a map that mapssymbols to one alphabet A to messages (or blocks ofsymbols) in another alphabet B. The symbol code extendsto C ∶ A+ → B+ of all messages over alphabet A to allmessages over alphabet B, by concatenation:
C(a0a1 . . .an) = C(a0)C(a1) . . .C(an)
The length of a message is its number of symbols; Thefunction ` ∶ A+ → {0,1, . . .} maps messages to their lengths.A binary (symbol) code is a symbols code C ∶ A→ {0,1}.A code C ∶ A+ → B+ is lossless, if it is 1:1 as a map.A symbol code C ∶ A→ B+ is called uniform iff ` ○ C = const..
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Morse code — an old, binary, non-uniform code withsome compression
Figure: Samuel Morse, 1840, “American inventor and painter”Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
An Excercise — “Coriolanus” in Morse Code
Problem (Coriolanus to Morse Code)Write a program which will encode Shakespeare’s “Coriolanus”in Morse code. Comparse the output length to that of theoriginal message (141,830 characters). Also, estimate theduration in “units” of the transmission of “Coriolanus”, assumingthat the “unit” is 1/4 second.
Speed of Morse Code “Benchmark”“High-speed telegraphy contests are held; according to theGuinness Book of Records in June 2005 at the InternationalAmateur Radio Union’s 6th World Championship in High SpeedTelegraphy in Primorsko, Bulgaria, Andrei Bindasov of Belarustransmitted 230 morse code marks of mixed text in one minute.”
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Kraft inequalityUpper Bound
The expected length of the code
Definition (Expected length of a binary symbol code)
Given a probability distribution P ∶ A→ [0,1] on the alphabet A,the expected length of a binary symbol code C ∶ A→ {0,1}+ isdefined as:
E(` ○ C) = ∑a∈A
`(C(a)) ⋅P(a).
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Kraft inequalityUpper Bound
Shannon source coding theorem
Theorem (Shannon source coding theorem — Shannon, 1948)
If a binary symbol code C ∶ A→ {0,1}+ is lossless then
E(` ○ C) ≥ H(P)
where H(P) is the Shannon entropy of the distribution P:
H(P) = ∑a∈A(− log2 P(a)) ⋅P(a)
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Kraft inequalityUpper Bound
Kraft inequality (extended variant)
Theorem (Kraft Inequality)
If A is countable and C ∶ A→ {0,1}+ is a lossless symbol codethen
∑a∈A
2−D(a)≤ 1.
where D(a) = `(C(a)).
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Kraft inequalityUpper Bound
The existence of nearly optimal codes
Theorem (Shannon-Fano, 1948)
For every alphabet A and a distribution function P ∶ A→ (0,1]there exists a binary code C ∶ A→ {0,1}+ such that:
H(P) ≤ E(` ○ C) ≤ H(P) + 1.
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Kraft inequalityUpper Bound
Full encoder in MATLAB I
1 %% Speci fy i npu t and output f i lenames2 i n f i l e = ’ c o r i o l a n . t x t ’ ;3 o u t f i l e = ’ c o r i o l a n . zzz ’ ;4 %% Read uncompressed t e x t f i l e5 f i l e I D =fopen ( i n f i l e ) ;6 A= f read ( f i l e I D , i n f , ’ u i n t 8 ’ ) ;7 f c l o s e ( f i l e I D ) ;8 d isp ( ’ Enter ing RTG DEMO t e x t encoder . . . ’ ) ;9 %% Conduct frequency ana lys i s ( b u i l d " t ab l e " )
10 % C i s the alphabet : as per documentation o f unique ,11 % IA and IC are index ar rays C = A( IA ) and A = C( IC )12 [C, IA , IC ]= unique (A) ;13 tab le = h i s t coun ts ( IC , 1 : ( leng th (C) +1) ) ;14 %% Find entropy15 p= tab le / sum( tab le ) ;16 entropy=sum(−p . * log2 ( p ) ) ;17 d isp ( [ ’ Expect ing ’ , num2str ( entropy ) , ’ b i t s per symbol ’ ] ) ;18 %% Use a r i t h m e t i c coding to encode the sequence19 code=ar i thenco ( IC , t ab l e ) ;20 %% Calcu la te b i t s per symbol21 b i tspersymbol= leng th ( code ) / leng th (A) ;22 d isp ( [ ’ Achieved ’ , num2str ( b i tspersymbol ) , ’ b i t s per symbol ’ ] ) ;23 %% Save compressed data24 f =fopen ( o u t f i l e , ’wb+ ’ ) ;25 f w r i t e ( f , ’_RTG ’ ) ;26 %% Wri te data to f i l e
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Kraft inequalityUpper Bound
Full encoder in MATLAB II
27 % −−−−−−−−−−−−−−−− Wri te metadata −−−−−−−−−−−−−−−−28 % Wri te unencoded data leng th29 f w r i t e ( f , l eng th (A) , ’ u in t32 ’ ) ;30 % Wri te alphabet leng th31 f w r i t e ( f , l eng th (C) , ’ u i n t 8 ’ ) ;32 % Wri te alphabet33 f w r i t e ( f ,C, ’ u i n t 8 ’ ) ;34 % Wri te t ab le35 b i t s _ p e r _ t a b l e _ e n t r y = f l o o r ( log2 (max( t ab l e ) ) ) +1;36 by tes_per_ tab le_ent ry= c e i l ( b i t s _ p e r _ t a b l e _ e n t r y / 8 ) ;37 i f by tes_per_ tab le_ent ry > 238 count_c lass= ’ u in t32 ’ ;39 e l s e i f by tes_per_ tab le_ent ry > 140 count_c lass= ’ u in t16 ’ ;41 e lse42 count_c lass= ’ u i n t8 ’ ;43 end44 fh= s t r2 func ( count_c lass ) ;45 counts= fh ( t ab l e ) ;46 f w r i t e ( f , by tes_per_ tab le_ent ry , ’ u i n t 8 ’ ) ;47 f w r i t e ( f , counts , count_c lass ) ;48 % −−−−−−−−−−−−−−−− Wri te b inary code −−−−−−−−−−−−−−−−49 % Padd code to a m u l t i p l e o f 8 b i t s50 padd= c e i l ( l eng th ( code ) / 8 ) *8− l eng th ( code ) ;51 code =[ code ; zeros ( padd , 1 ) ] ;52 % Pack code i n t o bytes53 packed_code=zeros ( leng th ( code ) /8 ,1 , ’ u i n t 8 ’ ) ;
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Kraft inequalityUpper Bound
Full encoder in MATLAB III
54 f o r i =1: leng th ( packed_code )55 packed_code ( i , 1 ) =bi2de ( code ( (1 + 8 * ( i −1) ) : ( 8 + 8 * ( i −1) ) ) ’ ) ;56 end57 f w r i t e ( f , l eng th ( packed_code ) , ’ u in t32 ’ ) ;58 f w r i t e ( f , packed_code , ’ u i n t 8 ’ ) ;59 %−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−60 f c l o s e ( f ) ;61 d isp ( ’ E x i t i n g RTG t e x t encoder . ’ ) ;
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Kraft inequalityUpper Bound
Full decoder in MATLAB I
1 disp ( ’ Enter ing RTG t e x t decoder . . . ’ ) ;2 i n f i l e = ’ c o r i o l a n . zzz ’ ;3 o u t f i l e = ’ co r io lan_d . t x t ’ ;4 %% Decoder5 f =fopen ( i n f i l e , ’ r ’ ) ;6 %% Read and check magic number7 magic= f read ( f , 4 , ’ * char ’ ) ’ ;8 i f a l l ( magic== ’_RTG ’ )9 d isp ( ’ Great ! Magic number matches . ’ ) ;
10 e lse11 disp ( ’ Magic number does not match ’ ) ;12 r e t u r n ;13 end14 %% Read unencoded data leng th15 len= f read ( f , 1 , ’ u in t32 ’ ) ;16 %% Read alphabet leng th17 alphabet_ len= f read ( f , 1 , ’ u i n t 8 ’ ) ;18 %% Read alphabet19 C=f read ( f , a lphabet_len , ’ u i n t 8 ’ ) ;20 %% Read tab le21 by tes_per_ tab le_ent ry= f read ( f , 1 , ’ u i n t 8 ’ ) ;22 i f by tes_per_ tab le_ent ry > 223 count_c lass= ’ u in t32 ’ ;24 e l s e i f by tes_per_ tab le_ent ry > 125 count_c lass= ’ u in t16 ’ ;26 e lse
Marek Rychlik CGB
Definition of codingCompression of plain (ASCII) text
Shannon lower bound
Kraft inequalityUpper Bound
Full decoder in MATLAB II
27 count_c lass= ’ u i n t8 ’ ;28 end29 recovered_tab le= f read ( f , a lphabet_len , count_c lass ) ’ ;30 %% Read packed code31 packed_code_len= f read ( f , 1 , ’ u in t32 ’ ) ;32 packed_code= f read ( f , packed_code_len , ’ u i n t 8 ’ ) ;33 %% Unpack the code34 code_len=packed_code_len * 8 ;35 recovered_code=zeros ( code_len ,1 , ’ double ’ ) ;36 recovered_code = de2bi ( packed_code , 8 ) ’ ;37 recovered_code = recovered_code ( : ) ;38 %% Recover index ar ray39 IC=ar i thdeco ( recovered_code , recovered_table , len ) ;40 %% Apply alphabet41 recovered=C( IC ) ;42 %% Wri te the f i l e43 f =fopen ( o u t f i l e , ’w ’ ) ;44 f w r i t e ( f , recovered ) ;45 f c l o s e ( f ) ;46 %−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−47 disp ( ’ E x i t i n g RTG t e x t decoder . . . ’ ) ;
Marek Rychlik CGB