Source: luc.devroye.org/HenriMertens-Entropy.pdf


ENTROPY

(Notes by Henri Mertens)

THE BASICS OF INFORMATION THEORY

Shannon's theory from 1948.

Shannon's view

Lower bounds for compression

Entropy E = Σ_i p_i log₂(1/p_i)

The global view

Back to prefix codes

Lempel-Ziv compression

Shannon's view

[Figure: an input file is fed through a compressor to produce a compressed file, viewed as a (lossless compression) path.]

Expected length of compressed file = Σ_i p_i ℓ_i,
where p_i = probability of seeing input file i and ℓ_i = length of its compressed version.

So, the best compression method, given the p_i's, is the Huffman code.

BUT . . . the tree is too large!

The p_i's are often not known precisely.

Nevertheless,

we know a lot about the best compression method:

Shannon's theorem.

E ≤ min over all binary trees of Σ_i p_i ℓ_i ≤ E + 1,

where E = Σ_i p_i log₂(1/p_i) = (binary) entropy.
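The entropy E = Σ_i p_i log₂(1/p_i) can be computed directly. A minimal Python sketch (the function name is mine, and terms with p_i = 0 are dropped by the usual convention 0 · log(1/0) = 0):

```python
import math

def entropy(p):
    """Binary entropy E = sum_i p_i * log2(1/p_i) of a distribution p."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

# A fair coin carries one bit of entropy per symbol:
print(entropy([0.5, 0.5]))        # 1.0
# Four equally likely symbols carry two bits each:
print(entropy([0.25] * 4))        # 2.0
```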

KRAFT'S INEQUALITY

Let ℓ_i be the depths of the leaves in a binary tree.

Then

Σ_i 2^(−ℓ_i) ≤ 1.

Proof: by induction (exercise).

[Figure: a small binary tree illustrating the leaf depths.]
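The inequality is easy to check numerically from a list of leaf depths; a small sketch (the function name is mine):

```python
def kraft_sum(depths):
    """Sum of 2^(-l_i) over the leaf depths l_i of a binary tree."""
    return sum(2.0 ** -l for l in depths)

# A full binary tree with 4 leaves at depth 2 meets Kraft with equality:
print(kraft_sum([2, 2, 2, 2]))           # 1.0
# A "comb" tree with leaf depths 1, 2, 3, 3 also satisfies the inequality:
print(kraft_sum([1, 2, 3, 3]) <= 1.0)    # True
```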

PROOF OF SHANNON'S THEOREM

(LOWER BOUND)

Σ_i p_i ℓ_i = Σ_i p_i log₂(2^(ℓ_i))
           = Σ_i p_i log₂(p_i / 2^(−ℓ_i)) + Σ_i p_i log₂(1/p_i)
           = Σ_i p_i log₂(p_i / 2^(−ℓ_i)) + E
           ≥ (1/ln 2) Σ_i (p_i − 2^(−ℓ_i)) + E
           = (1/ln 2) (1 − Σ_i 2^(−ℓ_i)) + E
           ≥ E   (by Kraft's inequality).

[Side note: the inequality step uses ln x ≥ 1 − 1/x for x > 0, i.e. log₂ x ≥ (1 − 1/x)/ln 2, applied with x = p_i / 2^(−ℓ_i).]

(UPPER BOUND)

By the converse of Kraft: if Σ_i 2^(−ℓ_i) ≤ 1, then there exists a binary tree with leaves at these depths (exercise).

So, given the p_i, set

ℓ_i = ⌈log₂(1/p_i)⌉.

Then Σ_i 2^(−ℓ_i) ≤ Σ_i p_i = 1.

So, we can use these ℓ_i to make a tree. Let that tree define the code; it is called the Shannon-Fano code. The expected length is

Σ_i p_i ℓ_i = Σ_i p_i ⌈log₂(1/p_i)⌉ ≤ Σ_i p_i log₂(1/p_i) + Σ_i p_i = E + 1.

So, there is a code with expected length ≤ E + 1.
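The upper-bound argument can be checked numerically: compute the Shannon-Fano lengths ℓ_i = ⌈log₂(1/p_i)⌉, verify Kraft (so a tree exists), and verify E ≤ Σ p_i ℓ_i ≤ E + 1. A small sketch with an arbitrary distribution of my choosing:

```python
import math

def shannon_fano_lengths(p):
    """Codeword lengths l_i = ceil(log2(1/p_i)) from the upper-bound proof."""
    return [math.ceil(math.log2(1.0 / pi)) for pi in p]

p = [0.5, 0.25, 0.15, 0.10]
lengths = shannon_fano_lengths(p)
E = sum(pi * math.log2(1.0 / pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))
assert sum(2.0 ** -l for l in lengths) <= 1.0   # Kraft holds, so a tree exists
assert E <= L <= E + 1                          # expected length within 1 bit of entropy
```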

The global view (unrealistic)

If the input consists of n independent symbols from an alphabet A, with symbol probabilities p_i, then each symbol can be coded via Huffman, and we have a total expected length ≤ n × (E + 1), where E = entropy of one symbol.

Lower bound: ≥ n × E (entropy of file = sum of entropies of the symbols).

So . . . it helps to group the symbols in groups of k (k a small number), and Huffman-code each group.

Solution 0: One could use Lempel-Ziv compression (see later).

Solution 1: Huffman coding on groups of characters.

Ex: Group letters in sets of 3: (abc), (bca), . . .

Get p_i by counting occurrences in a file.

Construct the Huffman code.

Code + decode as for prefix codes.
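Huffman's construction itself (repeatedly merge the two least likely subtrees) can be sketched in a few lines; this uses Python's `heapq` and is one standard implementation, not the notes' own. The distribution is an arbitrary example of mine:

```python
import heapq, itertools, math

def huffman_code(freq):
    """Build a prefix code from {symbol: probability} by Huffman's algorithm.
    Returns {symbol: bitstring}."""
    tiebreak = itertools.count()   # breaks probability ties; avoids comparing lists
    heap = [(p, next(tiebreak), [(s, "")]) for s, p in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two least likely subtrees...
        p1, _, c1 = heapq.heappop(heap)
        merged = [(s, "0" + b) for s, b in c0] + [(s, "1" + b) for s, b in c1]
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))  # ...merged
    return dict(heap[0][2])

p = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}
code = huffman_code(p)
E = sum(pi * math.log2(1 / pi) for pi in p.values())
L = sum(p[s] * len(code[s]) for s in p)
assert E <= L <= E + 1   # Huffman is optimal, so Shannon's bounds apply
```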

RECALL

Start at root and detect a leaf. Decode. Repeat.

[Worked example: a compressed sequence 0000 1111 01010101 is decoded bit by bit against the code tree. Time to decode = length of compressed sequence.]

The decoder: given

the compressed sequence s;

the binary tree t with the code: its root is t, and its leaves contain the symbols of the alphabet A.

x ← t   (traveling pointer in t)
while |s| > 0:
    b ← get next bit from s
    if b = 0 then x ← left[x] else x ← right[x]
    if x is a leaf then output key[x]; x ← t
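The same loop in Python, under an assumed tree representation of my choosing (an internal node is a `(left, right)` tuple, a leaf is its symbol):

```python
def decode(tree, bits):
    """Walk the code tree bit by bit; emit a symbol at each leaf, restart at root."""
    out = []
    x = tree                           # traveling pointer, starts at the root
    for b in bits:
        x = x[0] if b == "0" else x[1]
        if not isinstance(x, tuple):   # reached a leaf
            out.append(x)
            x = tree                   # back to the root
    return "".join(out)

# Code tree for a = 0, b = 10, c = 11:
t = ("a", ("b", "c"))
print(decode(t, "010110"))   # prints "abca"
```

Each bit advances the pointer exactly once, so the time to decode is the length of the compressed sequence, as stated above.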

LEMPEL-ZIV COMPRESSION (Lempel and Ziv, 1977) (Solution 0)

Examples: zip, jpeg, most compression methods.

Feature: under generous input file assumptions, the expected length of the compressed sequence is close to E.

Method: parse the input into smallest pieces never seen before.

[Worked example: an input string over {a, b, c} is parsed into pieces numbered 0 to 12. Piece 0 is the empty piece; every later piece is some earlier piece extended by one new last symbol.]

THE BINARY SEQUENCE

For the k-th piece, we have:

a symbol from the alphabet;

an integer in {0, . . . , k − 1} (the index of the earlier piece it extends).

The integer needs ⌈log₂ k⌉ bits.

[Table: pieces 1 to 12 with, for each, the index of the earlier piece it extends.]

The symbol needs a fixed # of bits: ⌈log₂ |A|⌉.

In the output, all bits are clearly identified.
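Since the decoder knows k when it reads the k-th piece, the bit boundaries are unambiguous, and the total output size is a simple sum. A sketch of the count (the function name is mine):

```python
import math

def lz_bit_cost(num_pieces, alphabet_size):
    """Total output bits: the k-th piece costs ceil(log2 k) bits for the
    back-reference plus a fixed ceil(log2 |A|) bits for the new symbol."""
    sym_bits = math.ceil(math.log2(alphabet_size))
    return sum(math.ceil(math.log2(k)) + sym_bits
               for k in range(1, num_pieces + 1))

# 12 pieces over the alphabet {a, b, c}:
print(lz_bit_cost(12, 3))   # 57
```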

DATA STRUCTURE FOR CODING / DECODING: THE DIGITAL SEARCH TREE (an "|A|-ary trie")

[Figure: the example input, its pieces 0 to 12, and the corresponding digital search tree; each node of the trie is one piece.]

In the parsing phase: start at the root and descend; add a symbol (and a new leaf) to add a piece.

Exercise: If the input is of size n, then write the parsing algorithm that produces (a) the tree and (b) the sequence (0a)(1a)(0b) . . . and show that it takes time O(n).

In the decoding phase: keep a table of pointers.

[Table: for each piece #, the index of the earlier piece it extends and its new symbol, with a pointer back to that piece.]

Decode piece 10 as (6, a) → ((0, c), a), i.e. the string "c a".

Exercise: Write an O(n) algorithm for decoding.
