An introduction to Data Compression


Page 1: An introduction to Data Compression

An introduction to Data Compression

Gabriele Monfardini - Multimedia Databases course (Corso di Basi di Dati Multimediali), a.y. 2005-2006

Page 2: An introduction to Data Compression


General information

Requirements:
- some programming skills (not that much...)
- knowledge of data structures
- ... some work!

Office hours: ... please write me an email at [email protected]

Page 3: An introduction to Data Compression


What is compression?

Intuitively compression is a method “to press something into a smaller space”.

In our domain a better definition is “to make information shorter”.

Page 4: An introduction to Data Compression


Some basic questions

What is information?

How can we measure the amount of information?

Why is compression useful?

How do we compress?

How much can we compress?

Page 5: An introduction to Data Compression


What is information? - I

Commonly the term information refers to the knowledge of some fact, circumstance or thought.

For example, think about reading a newspaper: the news is the information. Two levels can be distinguished:

syntax: letters, punctuation marks, white spaces, grammar rules, ...

semantics: the meaning of the words and of the sentences

Page 6: An introduction to Data Compression


What is information? - II

In our domain, information is merely the syntax, i.e. we are interested in the symbols of the alphabet used to express the information.

In order to give a mathematical definition of information we need some principles of Information Theory.

Page 7: An introduction to Data Compression


The fundamental concept

A key concept in Information Theory is that information is conveyed by randomness.

What information does a biased coin whose outcome is always heads give us?

What about another biased coin, whose outcome is heads with 90% probability?

We need a way to measure quantitatively the amount of information in some mathematical sense

Page 8: An introduction to Data Compression


The Uncertainty - I

Suppose we have a discrete random variable $X$ and $x$ is a particular outcome with probability $p(x)$. The uncertainty (self-information) of $x$ is

$$h(x) = -\log p(x) = \log \frac{1}{p(x)}$$

The units are given by the base of the logarithm: base 2 gives bits, base $e$ gives nats.
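As a quick numeric illustration (a minimal Python sketch, not part of the slides; the function name is mine):

```python
import math

def uncertainty(p: float, base: float = 2.0) -> float:
    """Self-information log(1/p) of an outcome with probability p.
    Base 2 gives bits, base e gives nats."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log(1 / p, base)

print(uncertainty(0.5))   # 1.0 bit: a fair coin flip
print(uncertainty(0.9))   # ~0.152 bits: a likely outcome is less informative
print(uncertainty(1.0))   # 0.0 bits: a certain outcome carries no information
```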

Page 9: An introduction to Data Compression


The Uncertainty - II

Suppose the random variable $X$ takes values in $\{0, 1\}$.

If $p(0) = p(1) = 0.5$, each outcome has 1 bit of information.

If $p(0) = 1$ and $p(1) = 0$, the outcome 0 gives no information at all, while if the outcome is 1 the information is infinite.

Page 10: An introduction to Data Compression


The Entropy

More useful is the entropy of a random variable $X$ with values in a space $\mathcal{X}$:

$$H(X) = E[\text{uncertainty}] = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

The entropy is a measure of the average uncertainty of the random variable.

Page 11: An introduction to Data Compression


The Entropy - examples

Consider again a r.v. $X$ with only two possible outcomes: $X = 1$ with probability $p$ and $X = 0$ with probability $1 - p$, so that $H(X) = H(p)$.

For $p = 0.5$: $H(p) = -(0.5 \log 0.5 + 0.5 \log 0.5) = 1$ bit

For $p = 0.9$: $H(p) = -(0.9 \log 0.9 + 0.1 \log 0.1) = 0.469$ bits
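These values are easy to check numerically; a minimal sketch, assuming logarithms in base 2:

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -(p log2 p + (1-p) log2 (1-p)); by convention H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy(0.5))   # 1.0 bit
print(binary_entropy(0.9))   # ~0.469 bits
```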

Page 12: An introduction to Data Compression


Compression and loss

lossless: the decompressed message (file) is an exact copy of the original. Useful for text compression.

lossy: some information is lost in the decompressed message (file). Useful for image and sound compression.

For now we ignore lossy compression.

Page 13: An introduction to Data Compression


Definitions - I

A source code $C$ for a r.v. $X$ is a mapping from $\mathcal{X}$ to $D^*$, the set of finite-length strings from a $D$-ary alphabet.

$C(x)$ is the codeword for $x$, and $l(x)$ is the length of $C(x)$.

Page 14: An introduction to Data Compression


Definitions - II

non-singular code (... trivial ...): every element of $\mathcal{X}$ is mapped to a different string of $D^*$: $x_i \ne x_j \Rightarrow C(x_i) \ne C(x_j)$

extension of a code: $C^*(x_1 x_2 \ldots x_n) = C(x_1) C(x_2) \ldots C(x_n)$

uniquely decodable code: a code whose extension is non-singular

Page 15: An introduction to Data Compression


Definitions - III

prefix (better: prefix-free) or instantaneous code: no codeword is a prefix of any other codeword. The advantage is that decoding needs no look-ahead.

Example: with codewords a → 11 and b → 110, after reading 11 the decoder must look ahead to tell an 'a' from the start of a 'b'.

Page 16: An introduction to Data Compression


Examples

X    Code 1    Code 2    Code 3    Code 4
1    01        0         10        0
2    110       010       00        10
3    010       10        11        110
4    110       01        110       111

Code 1: singular (symbols 2 and 4 share the codeword 110)
Code 2: non-singular, but not uniquely decodable
Code 3: uniquely decodable, but not instantaneous
Code 4: instantaneous
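For illustration, two of these properties can be checked mechanically; a small sketch (function names are mine, the codes are those in the table):

```python
def is_singular(code: dict) -> bool:
    """Singular: two source symbols share the same codeword."""
    words = list(code.values())
    return len(set(words)) != len(words)

def is_prefix_free(code: dict) -> bool:
    """Instantaneous: no codeword is a prefix of another. After sorting,
    a codeword that prefixes another is immediately followed by one it prefixes."""
    words = sorted(code.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

code1 = {1: "01", 2: "110", 3: "010", 4: "110"}   # Code 1 from the table
code4 = {1: "0", 2: "10", 3: "110", 4: "111"}     # Code 4 from the table
print(is_singular(code1))      # True: symbols 2 and 4 share "110"
print(is_prefix_free(code4))   # True: instantaneous
```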

Page 17: An introduction to Data Compression


Kraft Inequality - I

Theorem (Kraft Inequality). For any instantaneous code over an alphabet of size $D$, the codeword lengths $l_1, l_2, \ldots, l_m$ must satisfy

$$\sum_{i=1}^{m} D^{-l_i} \le 1$$

Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.
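A one-line check of the inequality (a sketch; the name kraft_sum is mine):

```python
def kraft_sum(lengths, D=2):
    """Sum of D^-l_i over the codeword lengths; at most 1 exactly when an
    instantaneous D-ary code with these lengths exists."""
    return sum(D ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> an instantaneous code exists
print(kraft_sum([1, 1, 2]))      # 1.25 -> impossible for a binary code
```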

Page 18: An introduction to Data Compression


Kraft Inequality - II

Consider a complete $D$-ary tree: at level $k$ there are $D^k$ nodes, and a node at level $p < k$ has $D^{k-p}$ descendants at level $k$.

[Figure: a complete binary tree showing levels 0 through 3.]

Page 19: An introduction to Data Compression


Kraft Inequality - III

Proof. Consider a $D$-ary tree (not necessarily complete) representing the codewords: each path down the tree is a sequence of symbols, and each leaf (with its unique path) is a codeword. Let $l_{\max}$ be the length of the longest codeword.

A codeword of length $l_i \le l_{\max}$, being a leaf, implies that at level $l_{\max}$ there are $D^{l_{\max} - l_i}$ missing nodes (the descendants it would otherwise have).

Page 20: An introduction to Data Compression


Kraft Inequality - IV

The total number of possible nodes at level $l_{\max}$ is $D^{l_{\max}}$.

Summing over all codewords:

$$\sum_{i=1}^{m} D^{l_{\max} - l_i} \le D^{l_{\max}}$$

Dividing by $D^{l_{\max}}$:

$$\sum_{i=1}^{m} D^{-l_i} \le 1$$

Page 21: An introduction to Data Compression


Kraft Inequality - V

Proof (converse). Suppose (without loss of generality) that the codewords are ordered by length, i.e. $l_1 \le l_2 \le \ldots \le l_m$.

Consider a $D$-ary tree and start assigning each codeword to a node, beginning with $l_1$.

For a generic codeword of length $l_i$, consider the set $K$ of codewords of length $l_k \le l_i$, excluding $i$ itself.

Suppose there is no available node at level $l_i$. That is,

$$\sum_{k \in K} D^{l_i - l_k} \ge D^{l_i}$$

Page 22: An introduction to Data Compression


Kraft Inequality - VI

But, dividing by $D^{l_i}$, this means that

$$\sum_{k \in K} D^{-l_k} \ge 1$$

and then, adding the term $D^{-l_i} > 0$,

$$\sum_{j \in K \cup \{i\}} D^{-l_j} > 1,$$

which is absurd. Hence the tree obtained represents an instantaneous code with the desired codeword lengths.
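The converse proof is constructive; a sketch of the same idea for binary codes (assign each codeword the next free node at its level, treating codewords as binary numerals):

```python
def canonical_code(lengths):
    """Build a binary prefix-free code from lengths satisfying Kraft's
    inequality: assign codewords in order of increasing length, taking
    the 'next free' node at each level of the code tree."""
    assert sum(2 ** -l for l in lengths) <= 1, "Kraft inequality violated"
    codewords, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= l - prev_len        # descend to the deeper level
        codewords.append(format(value, f"0{l}b"))
        value += 1                    # next free node at this level
        prev_len = l
    return codewords

print(canonical_code([2, 1, 3, 3]))   # ['0', '10', '110', '111']
```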

Page 23: An introduction to Data Compression


Models and coders

The model supplies the probabilities of the symbols (or of the group of symbols, as we will see later)

The coder encodes and decodes starting from these probabilities

[Diagram: text → encoder → compressed text → decoder → text, with a model on each side supplying the probabilities to the encoder and to the decoder.]

Page 24: An introduction to Data Compression


Good modeling is crucial

What happens if the true probabilities of the symbols to be coded are $p_i$, but we use $q_i$?

Simply, the compressed text will be longer, i.e. the average number of bits/symbol will be greater.

The difference in bits/symbol between the two probability mass functions $p$ and $q$ can be computed; it is known as the relative entropy

$$D(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}$$
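A numeric illustration (a sketch; the distributions p and q are made up): the average bits/symbol under the wrong model exceeds the entropy by exactly the relative entropy.

```python
import math

def entropy(p):
    """Best achievable bits/symbol under the true distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average bits/symbol paid when the model q assigns the code lengths."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true probabilities
q = [1/3, 1/3, 1/3]     # the (wrong) model
print(entropy(p))                        # 1.5 bits/symbol
print(cross_entropy(p, q))               # ~1.585 bits/symbol actually paid
print(cross_entropy(p, q) - entropy(p))  # ~0.085 bits/symbol wasted: D(p||q)
```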

Page 25: An introduction to Data Compression


Finite-context models

In English text $p(x_i = \text{'u'}) \approx 0.02$ ...

... but $p(x_i = \text{'u'} \mid x_{i-1} = \text{'q'}) \approx 0.95$

A finite-context model of order $m$ uses the previous $m$ symbols to make the prediction.

Better modeling, but we need to estimate many more probabilities.
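A minimal sketch of such a model, estimated by simple counting (class and method names are mine, the training text is made up):

```python
from collections import Counter, defaultdict

class ContextModel:
    """Order-m model: estimate p(next symbol | previous m symbols) by
    counting what followed each length-m context in the training text."""
    def __init__(self, m):
        self.m = m
        self.counts = defaultdict(Counter)

    def train(self, text):
        for i in range(self.m, len(text)):
            self.counts[text[i - self.m:i]][text[i]] += 1

    def prob(self, context, symbol):
        c = self.counts[context[-self.m:]]
        total = sum(c.values())
        return c[symbol] / total if total else 0.0

model = ContextModel(m=1)
model.train("the queen quietly quit the quiz")
print(model.prob("q", "u"))   # 1.0: in this text 'q' is always followed by 'u'
print(model.prob("t", "h"))   # 0.5: 'h' follows 't' in two of four occurrences
```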

Page 26: An introduction to Data Compression


Finite-state models

Although potentially more powerful (e.g. they can model whether an odd or even number of a's have occurred consecutively), they are not so popular.

Obviously the decoder uses the same model, so encoder and decoder are always in the same state.

[State diagram: two states, with transitions labeled a 0.5 / b 0.5 and a 0.99 / b 0.01.]
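A sketch of the odd/even example (illustrative only; the state assignment and probabilities are assumptions loosely following the diagram):

```python
# Hypothetical two-state model: the state remembers whether an even or odd
# number of consecutive a's has been seen, so predictions can depend on it.
PROBS = {
    "even": {"a": 0.5, "b": 0.5},    # after an even run of a's
    "odd":  {"a": 0.99, "b": 0.01},  # after an odd run of a's
}

def next_state(state, symbol):
    if symbol == "a":
        return "odd" if state == "even" else "even"
    return "even"   # a 'b' ends the run of a's

state = "even"
for sym in "aab":
    print(state, sym, PROBS[state][sym])
    state = next_state(state, sym)   # the decoder follows the same transitions
```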

Page 27: An introduction to Data Compression


Static models

A model is static if we set up a reasonable probability distribution and use it for all the texts to be coded.

Poor performance in case of different kinds of sources (English text, financial data...).

One solution is to have K different models and to send the index of the model used.

... but cf. the book Gadsby by E. V. Wright, a novel written entirely without the letter 'e': no fixed set of models fits every source.

Page 28: An introduction to Data Compression


Adaptive models

In order to solve the problems of static modeling, adaptive (or dynamic) models begin with a bland probability distribution that is refined as more symbols of the text are seen.

The encoder and the decoder have the same initial distribution, and the same rules to alter it.

There can be adaptive models of order m > 0.
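A minimal sketch of an adaptive order-0 model (names are mine): encoder and decoder both start from the same counts and apply the same update after each coded symbol.

```python
class AdaptiveModel:
    """Adaptive order-0 model over a fixed alphabet: start every count
    at 1 (the 'bland' distribution) and refine it symbol by symbol."""
    def __init__(self, alphabet):
        self.counts = {s: 1 for s in alphabet}

    def prob(self, symbol):
        return self.counts[symbol] / sum(self.counts.values())

    def update(self, symbol):
        self.counts[symbol] += 1   # the decoder applies the same rule

model = AdaptiveModel("ab")
for sym in "aaab":
    print(sym, round(model.prob(sym), 3))   # probability used to code sym
    model.update(sym)
```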

Page 29: An introduction to Data Compression


The zero-frequency problem

The situation in which a symbol is predicted with probability zero must be avoided, as such a symbol cannot be coded.

One solution: the total number of symbols in the text is increased by 1, and this 1/total probability is divided among all the unseen symbols.

Another solution: augment the count of every symbol by 1.

Many more solutions exist... Which is the best? If the text is sufficiently long, the resulting compression is similar.
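A sketch of the two solutions side by side (function names are mine):

```python
def add_one(counts, alphabet):
    """Second solution: augment the count of every symbol by 1."""
    total = sum(counts.values()) + len(alphabet)
    return {s: (counts.get(s, 0) + 1) / total for s in alphabet}

def escape(counts, alphabet):
    """First solution: add 1 to the total count and split that 1/total
    probability evenly among the unseen symbols (assumes at least one)."""
    total = sum(counts.values()) + 1
    unseen = [s for s in alphabet if s not in counts]
    return {s: counts[s] / total if s in counts
            else 1 / (total * len(unseen)) for s in alphabet}

counts = {"a": 3, "b": 1}       # 'c' has never been seen
print(add_one(counts, "abc"))   # a: 4/7, b: 2/7, c: 1/7
print(escape(counts, "abc"))    # a: 3/5, b: 1/5, c: 1/5
```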

Page 30: An introduction to Data Compression


Symbolwise and dictionary models

The set of all possible symbols of a source is called the alphabet

Symbolwise models provide an estimated probability for each symbol in the alphabet.

Dictionary models instead replace substrings in a text with codewords that identify each substring in a collection, called a dictionary or codebook.
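A toy contrast with symbolwise coding (a sketch with a fixed, made-up codebook; real dictionary schemes such as LZ77/LZ78 build the dictionary from the text itself instead of fixing it in advance):

```python
# Toy dictionary coder: greedily replace known substrings with their
# index in a fixed codebook, falling back to literals.
DICTIONARY = ["the ", "compress", "ion", "data "]

def encode(text):
    out, i = [], 0
    while i < len(text):
        for idx, entry in enumerate(DICTIONARY):
            if text.startswith(entry, i):
                out.append(idx)        # codeword: an index into the dictionary
                i += len(entry)
                break
        else:
            out.append(text[i])        # literal fallback for unknown symbols
            i += 1
    return out

print(encode("the data compression"))   # [0, 3, 1, 2]
```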