
Information and Coding - Home Page of Marek Rychlik, alamos.math.arizona.edu/RTG16/DataCompression.pdf


Definition of coding
Compression of plain (ASCII) text
Shannon lower bound

Information and Coding
Part I: Data Compression

Marek Rychlik

Department of Mathematics
University of Arizona

February 25, 2016

Marek Rychlik CGB


What is coding — informally?

Definition (Coding)

Coding is algorithmic rewriting of data, while achieving desirable objectives.


Three main types of objectives

1 Reducing the cost of storage and the transmission time required (data compression);

2 Preventing data loss due to random causes, such as defects in physical media (error correcting codes);

3 Achieving privacy of communications (data encryption).


Without data compression we would not have

1 high-definition digital streaming video;

2 cellular telephony;

3 high-definition digital cinema.


Coriolanus — the play by William Shakespeare

Figure: W. Shakespeare

“Before we proceed any further, hear me speak. Speak, speak. You are all resolved rather to die than to famish? Resolved. resolved. First, you know Caius Marcius is chief enemy to the people. We know’t, we know’t. Let us kill him, and we’ll have corn at our own price. Is’t a verdict? No more talking on’t; let it be done: away, away! One word, good citizens... [total of 141,830 characters]”


An early printed scientific paper

Figure: Nikolaus Kopernikus, 1543, “Polish astronomer” (Encyclopedia Britannica)


Character encoding

Characters are numbered.
One of the oldest standards still in use is ASCII (American Standard Code for Information Interchange).
ASCII is pronounced “as key”.
ASCII proper encodes characters as numbers 0–127 (7 bits); 8-bit “extended ASCII” variants use the full byte range 0–255.
The successor to ASCII is Unicode. It aims at encoding all characters in all human alphabets.
Only ASCII is used in the current talk.


Bits and Bytes

The term bit stands for binary digit; a bit is either 0 or 1.
Bits are the digits of the base-2 (binary) representation of numbers, e.g. 5 in decimal is 101 in binary:

$$(101)_2 = 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 1 = (5)_{10}$$

The bit is a normalized unit of information; we may think that information comes as a sequence of answers to Yes/No questions. A fraction of a bit makes sense, statistically speaking (e.g. tossing a biased coin).
A byte is a sequence of 8 bits. A byte can also be thought of as an unsigned integer in the range 0–255:

$$(10000100)_2 = 1 \cdot 2^7 + 1 \cdot 2^2 = 128 + 4 = (132)_{10}$$


Hexadecimal and Octal notation for integers

1 byte = 8 bits = 2 groups of 4 bits; 2^4 = 16, hence “hexadecimal” = base 16.
16 digits are required for hexadecimal: 0–9, A, B, C, D, E, F (letters used as digits).

$$(1010\,0100)_2 = (8 + 2) \cdot 16 + 4 = (\mathrm{A4})_{16} = \texttt{/xA4}$$

Octal notation uses groups of 3 bits. 8 digits are required: 0–7.

$$(10\,100\,100)_2 = (244)_8 = \texttt{/o244}$$

NOTE: Since 8 is not divisible by 3, the leading octal digit of a byte covers only 2 bits, so it is at most 3.
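The conversions above can be checked with a few lines of MATLAB. This is a minimal sketch using only the base functions `bin2dec`, `dec2hex`, `dec2base` and `dec2bin`; the variable names are chosen here for illustration and do not come from the slides.

```matlab
% Base conversions for one byte, using built-in MATLAB functions.
x = bin2dec('10100100');          % binary string -> number
hex_digits = dec2hex(x);          % hexadecimal digits of x
oct_digits = dec2base(x, 8);      % octal digits of x
bin_digits = dec2bin(x, 8);       % binary digits, zero-padded to 8 bits
fprintf('%d = (%s)_2 = (%s)_16 = (%s)_8\n', x, bin_digits, hex_digits, oct_digits);

% The largest byte value shows why the leading octal digit never exceeds 3.
max_byte_octal = dec2base(255, 8);
fprintf('255 = (%s)_8\n', max_byte_octal);
```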


ASCII codes


Extended ASCII codes


Reading some characters of “Coriolanus” in MATLAB

%% File: p1.m
% Read first 10 characters from Coriolanus
f=fopen('TextCompression/coriolan.txt','r');
A=fread(f,10,'uint8');
fclose(f);
A'           % Print in decimal
dec2hex(A)   % Print in hexadecimal
dec2bin(A)   % Print in binary


Output I

>> p1

ans =

66 101 102 111 114 101 32 119 101 32

ans =

42
65
66
6F


Output II

72
65
20
77
65
20

ans =

1000010
1100101
1100110
1101111


Output III

1110010
1100101
0100000
1110111
1100101
0100000


An important paradigm of data compression

Data Compression = Data Modeling + Source Coding


A data model of “Coriolanus”

Example (Data Modeling)

Our model of “Coriolanus” is a random data model: every character is drawn independently at random from the alphabet, according to the probability distribution

$$P = (p_1, p_2, \ldots, p_n)$$

The numbers $p_j$ are considered parameters, to be estimated.

George Box, a statistician, circa 1979

All models are wrong but some are useful.


Arithmetic coding as a source coding approach

Example (Source Coding)

Our source coding technique is arithmetic coding, to be explained later. “Coriolanus” will be encoded as a single binary fraction with approximately 640,000 binary digits (bits) after the binary point!


The alphabet and “table” of Coriolanus in MATLAB

%% File: p2.m
%% Read all characters of Coriolanus as ASCII codes
f=fopen('TextCompression/coriolan.txt','r');
A=fread(f,inf,'uint8');
fclose(f);
%% Conduct frequency analysis (build "table")
% C is the alphabet: as per documentation
% of unique, IA and IC are index arrays s.t.
% C = A(IA) and A = C(IC)
[C,IA,IC]=unique(A);
table=histcounts(IC,1:(length(C)+1));
%% Bar plots of alphabet and table
% (separate figures, so the second plot does not overwrite the first)
figure; bar(C);
figure; bar(table)
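The roles of `C`, `IA` and `IC` are easier to see on a short input. The sketch below applies the same `unique` call to the string 'abracadabra', which is an illustrative stand-in for the play; the counts are built with `accumarray`, a base-MATLAB alternative assumed here in place of the `histcounts` call used in p2.m.

```matlab
% Toy input: a column of character codes, shaped like the output of fread(...,'uint8').
A = double('abracadabra')';
[C, IA, IC] = unique(A);           % C: sorted distinct codes, with A = C(IC)
counts = accumarray(IC(:), 1)';    % counts(k) = number of occurrences of C(k)

% Print the alphabet with its frequency table.
for k = 1:numel(C)
    fprintf('%c (%3d): %d\n', char(C(k)), C(k), counts(k));
end
```

Each entry of `IC` is the position in the alphabet `C` of the corresponding input character, which is why the index array, rather than the raw text, is what gets passed to the arithmetic coder later on.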


The alphabet and “table” bar plots


Entropy of a probability distribution

Definition (Shannon Entropy)

Figure: C. Shannon

Let $P = (p_1, p_2, \ldots, p_n)$ be a probability distribution on numbers 1–n. The number $H(P)$ defined by the formula:

$$H(P) = \sum_{j=1}^{n} (-\log_2 p_j) \cdot p_j$$

is called the Shannon entropy of the distribution P.
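As a small worked instance of the formula (the two coin distributions are illustrative choices, not taken from the slides), a fair coin carries exactly one bit per toss, while a coin biased 90/10 carries less than half a bit:

```latex
\[
H\left(\tfrac12, \tfrac12\right)
  = -\tfrac12 \log_2 \tfrac12 - \tfrac12 \log_2 \tfrac12
  = 1 \text{ bit},
\]
\[
H(0.9,\, 0.1)
  = -0.9 \log_2 0.9 - 0.1 \log_2 0.1
  \approx 0.137 + 0.332
  = 0.469 \text{ bits}.
\]
```

This is the sense in which a fraction of a bit is meaningful: the more predictable the source, the lower its entropy.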


Entropy of the “Coriolanus” distribution

%% File: p3.m
%% Read data, find table
p2;
%% Find entropy
p=table/sum(table);
entropy=sum(-p.*log2(p))

>> p3

entropy =

4.5502

>>


Meaning of entropy in data compression

Entropy as mean number of bits per symbol
If the entropy is h = 4.5502 bits, we expect h bits to be issued in the compressed message for every symbol of the original message. Since ASCII encoding uses 1 byte (= 8 bits) per symbol, we expect the compression ratio to be:

$$\frac{8}{4.5502} = 1.7582$$


Calculation of arithmetic code in MATLAB

%% File: p4.m
%% Read text file into array A, compute table, alphabet, index array IC
p2;
%% Use arithmetic coding to arithmetically encode the sequence of indices
code=arithenco(IC,table);
compression_ratio=(8*length(A))/length(code)

>> p4

compression_ratio =

1.7581


What is arithmetic code?

Arithmetic code
Arithmetic code is a binary code which represents the digits of a real number x ∈ [0,1) uniquely assigned to the original message (Shakespeare’s play “Coriolanus”). The number has approximately

$$H(P) \cdot N \approx 645354$$

binary digits after the binary point:

$$x = 0.d_1 d_2 \ldots$$

where $d_j \in \{0,1\}$.
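The size estimate can be reproduced from the figures quoted earlier, N = 141,830 characters and H(P) = 4.5502 bits per character. The arithmetic below uses these rounded values, so the last digits differ slightly from the count on the slide:

```latex
\[
N \cdot H(P) \approx 141830 \times 4.5502 \approx 6.45 \times 10^{5} \text{ bits}
\approx 8.07 \times 10^{4} \text{ bytes},
\]
\[
\frac{141830 \text{ bytes}}{8.07 \times 10^{4} \text{ bytes}} \approx 1.76 .
\]
```

The ratio agrees with the compression ratio of about 1.758 obtained from 8 / H(P) and from the MATLAB run.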


Commentary on arithmetic coding

The idea of arithmetic coding comes from Shannon-Fano-Elias coding.
The algorithm was implemented in a practical manner by IBM around 1980.
IBM obtained a patent for arithmetic coding and defended it vigorously for approximately 20 years. The value of the algorithm as intellectual property was estimated at tens of billions of dollars.
Now there are many freely available implementations of arithmetic coding, including the MATLAB implementation.


The principle of arithmetic coding I

1 Let $m = (a_1, a_2, \ldots, a_N)$ be a message of length N with $a_j \in \{1, 2, \ldots, n\}$.

2 Let $P = (p_1, p_2, \ldots, p_n)$ be the relative frequency table of the symbols in the message.

3 Every partial message $m_K = (a_1, a_2, \ldots, a_K)$ is assigned a semi-open subinterval $I_K$ of $I = [0,1)$.

4 $m_0$ (the empty message) is assigned $I_0 = I$.

5 If $m_{K-1}$ is assigned the interval $I_{K-1}$, then $m_K$ is assigned an interval $I_K \subseteq I_{K-1}$ whose length is $|I_{K-1}| \cdot p_{a_K}$, i.e. proportional to the probability of the K-th symbol.

The principle of arithmetic coding II

6 As the length of the final interval is

$$|I_N| = \prod_{j=1}^{N} p_{a_j},$$

we have

$$-\log_2 |I_N| = \sum_{j=1}^{N} (-\log_2 p_{a_j}) = N \cdot \sum_{l=1}^{n} (-\log_2 p_l) \frac{N_l}{N} = N \cdot \sum_{l=1}^{n} (-\log_2 p_l)\, p_l = N \cdot H(P),$$

where $N_l$ is the number of times l appears in the message, and N is the length of the message, so that $p_l = N_l / N$.

7 We select a number in $I_N$ with the smallest number of binary digits, approximately $N \cdot H(P)$.
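The interval narrowing in the steps above can be traced on a toy message. The following is a minimal floating-point sketch, not the algorithm behind `arithenco`: practical coders use integer arithmetic with renormalization, and the message and variable names here are made up for illustration. Each symbol is assumed to own the sub-interval that starts at the cumulative probability of the symbols before it.

```matlab
% Toy message over the alphabet {1,2,3}; entries are alphabet indices.
msg = [1 2 1 3 1 1 2 1];
n = 3;
counts = accumarray(msg(:), 1, [n 1])';   % N_l: occurrences of each symbol
p = counts / sum(counts);                  % relative frequency table P
cum = [0 cumsum(p)];                       % cum(l) = p_1 + ... + p_(l-1)

lo = 0; width = 1;                         % I_0 = [0,1)
for a = msg
    lo = lo + width * cum(a);              % move to the sub-interval of symbol a
    width = width * p(a);                  % |I_K| = |I_(K-1)| * p_(a_K)
end

N = numel(msg);
H = sum(-p .* log2(p));                    % Shannon entropy of P
fprintf('I_N = [%.10f, %.10f)\n', lo, lo + width);
fprintf('-log2|I_N| = %.4f, N*H(P) = %.4f\n', -log2(width), N * H);
```

The two printed quantities coincide because P is the relative frequency table of this very message, which is exactly the identity derived in step 6.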


Useful definitions I

An alphabet is any finite set A.

A message $m = a_0 a_1 \ldots a_n$ (or $m = (a_0, a_1, \ldots, a_n)$) over the alphabet A is any finite sequence of elements of A. The set of all messages is denoted $A^+$. It is the disjoint union of the Cartesian powers of the alphabet:

$$A^+ = \bigsqcup_{n=0}^{\infty} \prod_{j=1}^{n} A$$

A code is a map $C : A^+ \to B^+$.


Useful definitions II

A symbol code is a map $C : A \to B^+$, i.e. a map that maps symbols of one alphabet A to messages (or blocks of symbols) over another alphabet B. The symbol code extends to a map $C : A^+ \to B^+$ from all messages over alphabet A to all messages over alphabet B, by concatenation:

$$C(a_0 a_1 \ldots a_n) = C(a_0)\, C(a_1) \ldots C(a_n)$$

The length of a message is its number of symbols; the function $\ell : A^+ \to \{0, 1, \ldots\}$ maps messages to their lengths.
A binary (symbol) code is a symbol code $C : A \to \{0,1\}^+$.
A code $C : A^+ \to B^+$ is lossless if it is 1:1 (injective) as a map.
A symbol code $C : A \to B^+$ is called uniform iff $\ell \circ C = \mathrm{const}$.
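The extension by concatenation is short to write down in MATLAB. In this sketch the four-letter alphabet and its codewords are hypothetical examples chosen for illustration; they are not a code used elsewhere in the talk.

```matlab
% Hypothetical binary symbol code on the alphabet {a,b,c,d}.
alphabet = 'abcd';
codebook = {'0', '10', '110', '111'};    % C(a), C(b), C(c), C(d)

msg = 'abacad';
[~, idx] = ismember(msg, alphabet);      % position of each symbol in the alphabet
encoded = [codebook{idx}];               % C(a0)C(a1)...C(an), by concatenation

lens = cellfun(@length, codebook);       % codeword lengths, i.e. the map l o C
is_uniform = all(lens == lens(1));       % uniform iff all lengths are equal
fprintf('%s -> %s (%d bits), uniform: %d\n', msg, encoded, length(encoded), is_uniform);
```

Because no codeword in this table is a prefix of another, the concatenated string can be split back into codewords in only one way, so the extended code is lossless even though it is not uniform.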


Morse code: an old, binary, non-uniform code with some compression

Figure: Samuel Morse, 1840, “American inventor and painter”


An Exercise: “Coriolanus” in Morse Code

Problem (Coriolanus to Morse Code)
Write a program which will encode Shakespeare’s “Coriolanus” in Morse code. Compare the output length to that of the original message (141,830 characters). Also, estimate the duration, in “units”, of the transmission of “Coriolanus”, assuming that the “unit” is 1/4 second.

Speed of Morse Code “Benchmark”
“High-speed telegraphy contests are held; according to the Guinness Book of Records in June 2005 at the International Amateur Radio Union’s 6th World Championship in High Speed Telegraphy in Primorsko, Bulgaria, Andrei Bindasov of Belarus transmitted 230 morse code marks of mixed text in one minute.”
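One possible starting point for the exercise is sketched below on the opening sentence of the play only. Several choices are assumptions made here rather than part of the problem statement: only the letters A–Z are encoded and all other characters are dropped, and the duration uses the conventional timing of 1 unit per dot, 3 per dash, 1 between marks of a letter, 3 between letters and 7 between words.

```matlab
% International Morse code for A-Z (digits and punctuation are ignored here).
letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
morse = {'.-','-...','-.-.','-..','.','..-.','--.','....','..','.---', ...
         '-.-','.-..','--','-.','---','.--.','--.-','.-.','...','-', ...
         '..-','...-','.--','-..-','-.--','--..'};

text = 'Before we proceed any further, hear me speak.';
words = strsplit(upper(strtrim(text)), ' ');

encoded = {};                          % one Morse string per non-empty word
units = 0;                             % total duration in units
for w = 1:numel(words)
    [tf, idx] = ismember(words{w}, letters);
    codes = morse(idx(tf));            % Morse strings of the letters of this word
    if isempty(codes), continue; end
    marks = [codes{:}];                % all dots and dashes of the word
    n_marks = numel(marks);
    n_letters = numel(codes);
    mark_units = sum(marks == '.') + 3 * sum(marks == '-');
    gap_units = (n_marks - n_letters) + 3 * (n_letters - 1);
    if ~isempty(encoded)
        units = units + 7;             % gap between consecutive words
    end
    units = units + mark_units + gap_units;
    encoded{end+1} = strjoin(codes, ' ');
end

morse_text = strjoin(encoded, ' / ');
seconds = units / 4;                   % one unit lasts 1/4 second
fprintf('%s\n%d units, %.2f seconds\n', morse_text, units, seconds);
```

Replacing `text` with the contents of the full play, read as in p2.m, turns this into an answer to the exercise under the same assumptions.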


Kraft inequality
Upper Bound

The expected length of the code

Definition (Expected length of a binary symbol code)

Given a probability distribution $P : A \to [0,1]$ on the alphabet A, the expected length of a binary symbol code $C : A \to \{0,1\}^+$ is defined as:

$$E(\ell \circ C) = \sum_{a \in A} \ell(C(a)) \cdot P(a).$$


Shannon source coding theorem

Theorem (Shannon source coding theorem — Shannon, 1948)

If a binary symbol code $C : A \to \{0,1\}^+$ is lossless then

$$E(\ell \circ C) \geq H(P)$$

where $H(P)$ is the Shannon entropy of the distribution P:

$$H(P) = \sum_{a \in A} (-\log_2 P(a)) \cdot P(a)$$
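The bound can be checked numerically on a small case. The distribution and the two codes below are hypothetical examples: a variable-length prefix code whose lengths match the probabilities, and a uniform 2-bit code for comparison.

```matlab
% Hypothetical distribution on a four-symbol alphabet.
p = [1/2 1/4 1/8 1/8];

% Two lossless binary symbol codes for it.
prefix_code  = {'0', '10', '110', '111'};
uniform_code = {'00', '01', '10', '11'};

H = sum(-p .* log2(p));                               % Shannon entropy H(P)
EL_prefix  = sum(cellfun(@length, prefix_code)  .* p);  % expected length E(l o C)
EL_uniform = sum(cellfun(@length, uniform_code) .* p);

fprintf('H(P) = %.4f, prefix code: %.4f, uniform code: %.4f bits per symbol\n', ...
        H, EL_prefix, EL_uniform);
```

Here the prefix code meets the bound with equality because every probability is a power of 1/2, while the uniform code spends a quarter of a bit more per symbol.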


Kraft inequality (extended variant)

Theorem (Kraft Inequality)

If A is countable and $C : A \to \{0,1\}^+$ is a lossless symbol code then

$$\sum_{a \in A} 2^{-D(a)} \leq 1,$$

where $D(a) = \ell(C(a))$.
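Two small sets of codeword lengths, chosen here for illustration, show how the inequality is used. Lengths (1, 2, 3, 3) satisfy it, and are realized for instance by the codewords 0, 10, 110, 111; lengths (1, 1, 2) violate it, so no lossless binary symbol code can have them:

```latex
\[
2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = \tfrac12 + \tfrac14 + \tfrac18 + \tfrac18 = 1 \leq 1,
\]
\[
2^{-1} + 2^{-1} + 2^{-2} = \tfrac12 + \tfrac12 + \tfrac14 = \tfrac54 > 1 .
\]
```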


The existence of nearly optimal codes

Theorem (Shannon-Fano, 1948)

For every alphabet A and every distribution function $P : A \to (0,1]$ there exists a lossless binary symbol code $C : A \to \{0,1\}^+$ such that:

$$H(P) \leq E(\ell \circ C) \leq H(P) + 1.$$
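A standard way to see why such a code exists is to give each symbol a codeword of length ceil(-log2 P(a)); these lengths satisfy the Kraft inequality, so a prefix code with them can be built. The sketch below checks the resulting bounds numerically for a hypothetical distribution; it computes only the lengths and does not construct the codewords.

```matlab
% Hypothetical distribution with all probabilities positive.
p = [0.4 0.3 0.2 0.1];

lens = ceil(-log2(p));             % candidate codeword lengths
H = sum(-p .* log2(p));            % Shannon entropy H(P)
EL = sum(lens .* p);               % expected length with these lengths
kraft = sum(2 .^ (-lens));         % Kraft sum; at most 1 for these lengths

fprintf('lengths = %s\n', mat2str(lens));
fprintf('H(P) = %.4f <= E = %.4f <= H(P) + 1 = %.4f, Kraft sum = %.4f\n', ...
        H, EL, H + 1, kraft);
```

For this distribution the lengths are 2, 2, 3, 4, giving an expected length of 2.4 bits against an entropy of about 1.85 bits. Rounding up each length costs less than one bit per symbol, which is where the "+ 1" in the theorem comes from.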


Full encoder in MATLAB I

%% Specify input and output filenames
infile='coriolan.txt';
outfile='coriolan.zzz';
%% Read uncompressed text file
fileID=fopen(infile);
A=fread(fileID,inf,'uint8');
fclose(fileID);
disp('Entering RTG DEMO text encoder...');
%% Conduct frequency analysis (build "table")
% C is the alphabet: as per documentation of unique,
% IA and IC are index arrays C = A(IA) and A = C(IC)
[C,IA,IC]=unique(A);
table=histcounts(IC,1:(length(C)+1));
%% Find entropy
p=table/sum(table);
entropy=sum(-p.*log2(p));
disp(['Expecting ',num2str(entropy),' bits per symbol']);
%% Use arithmetic coding to encode the sequence
code=arithenco(IC,table);
%% Calculate bits per symbol
bitspersymbol=length(code)/length(A);
disp(['Achieved ',num2str(bitspersymbol),' bits per symbol']);
%% Save compressed data
f=fopen(outfile,'wb+');
fwrite(f,'_RTG');
%% Write data to file


Full encoder in MATLAB II

% ---------------- Write metadata ----------------
% Write unencoded data length
fwrite(f,length(A),'uint32');
% Write alphabet length
fwrite(f,length(C),'uint8');
% Write alphabet
fwrite(f,C,'uint8');
% Write table
bits_per_table_entry=floor(log2(max(table)))+1;
bytes_per_table_entry=ceil(bits_per_table_entry/8);
if bytes_per_table_entry > 2
    count_class='uint32';
elseif bytes_per_table_entry > 1
    count_class='uint16';
else
    count_class='uint8';
end
fh=str2func(count_class);
counts=fh(table);
fwrite(f,bytes_per_table_entry,'uint8');
fwrite(f,counts,count_class);
% ---------------- Write binary code ----------------
% Pad code to a multiple of 8 bits
padd=ceil(length(code)/8)*8-length(code);
code=[code;zeros(padd,1)];
% Pack code into bytes
packed_code=zeros(length(code)/8,1,'uint8');


Full encoder in MATLAB III

for i=1:length(packed_code)
    packed_code(i,1)=bi2de(code((1+8*(i-1)):(8+8*(i-1)))');
end
fwrite(f,length(packed_code),'uint32');
fwrite(f,packed_code,'uint8');
%------------------------------------------------
fclose(f);
disp('Exiting RTG text encoder.');


Full decoder in MATLAB I

disp('Entering RTG text decoder...');
infile='coriolan.zzz';
outfile='coriolan_d.txt';
%% Decoder
f=fopen(infile,'r');
%% Read and check magic number
magic=fread(f,4,'*char')';
if all(magic=='_RTG')
    disp('Great! Magic number matches.');
else
    disp('Magic number does not match');
    fclose(f);   % do not leave the input file open on failure
    return;
end
%% Read unencoded data length
len=fread(f,1,'uint32');
%% Read alphabet length
alphabet_len=fread(f,1,'uint8');
%% Read alphabet
C=fread(f,alphabet_len,'uint8');
%% Read table
bytes_per_table_entry=fread(f,1,'uint8');
if bytes_per_table_entry > 2
    count_class='uint32';
elseif bytes_per_table_entry > 1
    count_class='uint16';
else


Full decoder in MATLAB II

    count_class='uint8';
end
recovered_table=fread(f,alphabet_len,count_class)';
%% Read packed code
packed_code_len=fread(f,1,'uint32');
packed_code=fread(f,packed_code_len,'uint8');
fclose(f);   % close the input file before f is reused for the output file
%% Unpack the code
code_len=packed_code_len*8;
recovered_code=zeros(code_len,1,'double');
recovered_code=de2bi(packed_code,8)';
recovered_code=recovered_code(:);
%% Recover index array
IC=arithdeco(recovered_code,recovered_table,len);
%% Apply alphabet
recovered=C(IC);
%% Write the file
f=fopen(outfile,'w');
fwrite(f,recovered);
fclose(f);
%-----------------------------------------
disp('Exiting RTG text decoder...');
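The padding, packing and unpacking steps of the encoder and decoder can be tried in isolation. The sketch below is an assumption-laden stand-in: it uses only base MATLAB (a weighted sum and `bitget`) instead of the `bi2de` and `de2bi` calls in the listings, with the first bit of each group of 8 treated as the least significant, and the short bit vector is made up for illustration.

```matlab
% Hypothetical bit stream (column of 0/1 values), length not a multiple of 8.
code = [1 0 1 1 0 0 1 0 1 1 1]';
nbits = length(code);

% Pad with zeros to a whole number of bytes.
padd = ceil(nbits / 8) * 8 - nbits;
padded = [code; zeros(padd, 1)];

% Pack: each column of B holds the 8 bits of one byte, least significant first.
B = reshape(padded, 8, []);
weights = 2 .^ (0:7);
packed = uint8(weights * B)';

% Unpack: bitget extracts bit k (k = 1 is the least significant) of every byte.
U = zeros(8, numel(packed));
for k = 1:8
    U(k, :) = double(bitget(packed, k))';
end
unpacked = U(:);
recovered = unpacked(1:nbits);           % drop the padding bits

fprintf('packed bytes: %s, round trip ok: %d\n', mat2str(packed'), isequal(recovered, code));
```

In the decoder listing the padding bits are not removed explicitly; the original message length `len` stored in the file header is passed to `arithdeco` along with the unpacked bits.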
