Source Coding - Compression. Most topics from Digital Communications, Simon Haykin, Chapter 9, Sections 9.1-9.4.


Page 1: Source Coding- Compression Most Topics from Digital Communications- Simon Haykin Chapter 9 9.1~9.4

Source Coding-Compression

Most Topics from Digital Communications-Simon Haykin

Chapter 9, Sections 9.1-9.4

Page 2:

Fundamental Limits on Performance

Given an information source and a noisy channel:

1) Limit on the minimum number of bits per symbol
2) Limit on the maximum rate for reliable communication

Shannon’s theorems

Page 3:

Information Theory

Let the source alphabet be S = {s0, s1, ..., s(K-1)}, with the probabilities of occurrence

    P(S = sk) = pk,  k = 0, 1, ..., K-1,  where Σ(k=0..K-1) pk = 1

Assume a discrete memoryless source (DMS).

What is the measure of information?

Page 4:

Uncertainty, Information, and Entropy (cont’)

Interrelations between information and uncertainty or surprise:

No surprise, no information. If A is a surprise and B is another, independent surprise, then the total information of A and B occurring simultaneously is the sum of the individual amounts:

    Info(A and B) = Info(A) + Info(B)

The amount of information is related to the inverse of the probability of occurrence:

    Info ∝ 1 / Prob.

These two requirements lead to the logarithmic measure

    I(sk) = log(1 / pk)

Page 5:

Property of Information

Properties (custom is to use a logarithm of base 2, so information is measured in bits):

1) I(sk) = 0 for pk = 1
2) I(sk) ≥ 0 for 0 ≤ pk ≤ 1
3) I(sk) > I(si) for pk < pi
4) I(sk si) = I(sk) + I(si) if sk and si are statistically independent

Page 6:

Entropy (DMS)

Def.: a measure of the average information content per source symbol; the mean value of I(sk) over S:

    H(S) = E[I(sk)] = Σ(k=0..K-1) pk I(sk) = Σ(k=0..K-1) pk log2(1/pk)

It satisfies 0 ≤ H(S) ≤ log2 K, where K is the radix (number of distinct symbols).

The properties of H:
1) H(S) = 0 iff pk = 1 for some k and pi = 0 for all other i  (no uncertainty)
2) H(S) = log2 K iff pk = 1/K for all k  (maximum uncertainty)
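The definitions above can be checked numerically. A minimal sketch (the `entropy` helper name is ours, not from the slides):

```python
import math

def entropy(probs):
    """H(S) = sum of p_k * log2(1/p_k) over all symbols, in bits per symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Uniform source over K = 4 symbols: maximum uncertainty, H(S) = log2 K = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# Degenerate source (one symbol certain): no uncertainty, H(S) = 0.
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0
```

The `if p > 0` guard follows the usual convention 0 · log2(1/0) = 0.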

Page 7:

Extension of DMS (Entropy)

Consider blocks of symbols rather than individual symbols. Coding efficiency can increase if higher-order extensions of the DMS are used. H(S^n) refers to a source with K^n distinct symbols, where K is the number of distinct symbols in the original alphabet. For a DMS,

    H(S^n) = n H(S)

Second-order extension means H(S^2). Consider a source alphabet S having 3 symbols, i.e. {s0, s1, s2}. Then S^2 has 9 symbols, i.e. {s0s0, s0s1, s0s2, s1s0, ..., s2s2}.
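The relation H(S^n) = n H(S) can be verified numerically; the 3-symbol probabilities below are hypothetical, chosen only for illustration:

```python
import math
from itertools import product

def entropy(probs):
    """H(S) = sum of p_k * log2(1/p_k), in bits per symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical 3-symbol DMS (any probabilities summing to 1 would do).
p = {'s0': 0.5, 's1': 0.25, 's2': 0.25}

# Second-order extension S^2: 3^2 = 9 block symbols; because the source is
# memoryless, each block's probability is the product of its symbols' probabilities.
p2 = {a + b: p[a] * p[b] for a, b in product(p, repeat=2)}

print(len(p2))                    # 9 distinct block symbols
print(entropy(p.values()))        # H(S)   = 1.5 bits
print(entropy(p2.values()))       # H(S^2) = 3.0 bits = 2 * H(S)
```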

Page 8:

Average Length

For a code C with associated probabilities p(c), the average length is defined as

    la(C) = Σ(c in C) p(c) l(c)

We say that a prefix code C is optimal if la(C) ≤ la(C') for all prefix codes C'.

Page 9:

Relationship to Entropy

Theorem (lower bound): For any probability distribution p(S) with associated uniquely decodable code C,

    H(S) ≤ la(C)

Theorem (upper bound): For any probability distribution p(S) with associated optimal prefix code C,

    la(C) ≤ H(S) + 1

Page 10:

Coding Efficiency

Coding efficiency is defined as η = Lmin / La, where La is the average code-word length. From Shannon's source-coding theorem, La ≥ H(S), so Lmin = H(S). Thus

    η = H(S) / La
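A small sketch tying these quantities together, using the probabilities and codeword lengths of the Huffman example later in these slides (a=000, b=001, c=01, d=1 for p = .1, .2, .2, .5):

```python
import math

def entropy(probs):
    """H(S) = sum of p_k * log2(1/p_k), in bits per symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Probabilities and codeword lengths from the Huffman example in these slides.
probs   = [0.1, 0.2, 0.2, 0.5]
lengths = [3, 3, 2, 1]

La  = sum(p * l for p, l in zip(probs, lengths))  # average code-word length, ~1.8
eta = entropy(probs) / La                         # coding efficiency = H(S)/La
print(La, eta)
```

Since the probabilities here are not all powers of 2, the efficiency comes out slightly below 1.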

Page 11:

Kraft-McMillan Inequality

Theorem (Kraft-McMillan): For any uniquely decodable code C,

    Σ(c in C) 2^(-l(c)) ≤ 1

Also, for any set of lengths L such that

    Σ(l in L) 2^(-l) ≤ 1

there is a prefix code C such that l(ci) = li for i = 1, ..., |L|.

NOTE: the Kraft-McMillan inequality does not tell us whether a given code is prefix-free or not.
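The inequality is a one-line check in code; a quick sketch, using the lengths of the prefix code a=0, b=110, c=111, d=10 shown on a later slide:

```python
def kraft_sum(lengths):
    """Sum of 2^-l over all codeword lengths; <= 1 for any uniquely decodable code."""
    return sum(2.0 ** -l for l in lengths)

# Lengths of the prefix code a=0, b=110, c=111, d=10 from these slides:
print(kraft_sum([1, 3, 3, 2]))  # 1.0 -> inequality met with equality (complete code)
# A set of lengths violating the inequality: no uniquely decodable code exists.
print(kraft_sum([1, 1, 2]))     # 1.25 > 1
```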

Page 12:

Uniquely Decodable Codes

A variable-length code assigns a bit string (codeword) of variable length to every message value.

e.g. a = 1, b = 01, c = 101, d = 011

What if you get the sequence of bits 1011? Is it aba, ca, or ad? A uniquely decodable code is a variable-length code in which every bit string can be decomposed into codewords in only one way.

Page 13:

Prefix Codes

A prefix code is a variable-length code in which no codeword is a prefix of another codeword.

e.g. a = 0, b = 110, c = 111, d = 10

Can be viewed as a binary tree with message values at the leaves and 0s or 1s on the edges.

[Code tree: the 0-edge from the root leads to leaf a; the 1-edge leads to an internal node whose 0-edge leads to leaf d and whose 1-edge leads to a node with leaves b (0) and c (1).]

Page 14:

Some Prefix Codes for Integers

    n    Binary    Unary      Split
    1    ..001     0          1|
    2    ..010     10         10|0
    3    ..011     110        10|1
    4    ..100     1110       110|00
    5    ..101     11110      110|01
    6    ..110     111110     110|10

Many other fixed prefix codes: Golomb, phased-binary, subexponential, ...

Page 15:

Data compression implies sending or storing a smaller number of bits. Although many methods are used for this purpose, in general these methods can be divided into two broad categories: lossless and lossy methods.

Data compression methods

Page 16:

Run Length Coding

Page 17:

Introduction – What is RLE?

A compression technique that represents data as (value, run length) pairs, where a run length is the number of consecutive equal values.

e.g. 1110011111  --RLE-->  values 1, 0, 1 with run lengths 3, 2, 5

Page 18:

Introduction

Compression effectiveness depends on the input: it must contain consecutive runs of values for compression to help.

Best case: all values the same – a run of any length is represented by just two values.

Worst case: no repeating values – the compressed data is twice the length of the original!

RLE should therefore only be used when the data is known to contain repeating values.
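The (value, run length) scheme described above can be sketched in a few lines (function names are ours):

```python
def rle_encode(data):
    """Run-length encode a sequence into (value, run_length) pairs."""
    runs = []
    for x in data:
        if runs and runs[-1][0] == x:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([x, 1])       # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]

bits = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
encoded = rle_encode(bits)
print(encoded)                        # [(1, 3), (0, 2), (1, 5)]
assert rle_decode(encoded) == bits
```

Note the worst case is visible directly: input with no repeats yields one pair per value, i.e. twice the original length.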

Page 19:

Run-length encoding example

Page 20:

Run-length encoding for two symbols

Page 21:

Encoder – Results

Input:  4,5,5,2,7,3,6,9,9,10,10,10,10,10,10,0,0
Output: 4,1,5,2,2,1,7,1,3,1,6,1,9,2,10,6,0,2,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,…

Best case:
Input:  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Output: 0,16,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,…

Worst case:
Input:  0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
Output: 0,1,1,1,2,1,3,1,4,1,5,1,6,1,7,1,8,1,9,1,10,1,11,1,12,1,13,1,14,1,15,1

(The valid output ends where the -1 entries begin; -1 marks unused positions in the fixed-length output buffer.)

Page 22:

Huffman Coding

Page 23:

Huffman Codes

Invented by Huffman as a class assignment in 1951. Used in many, if not most, compression algorithms such as gzip, bzip, jpeg (as an option), and fax compression.

Properties:
- Generates optimal prefix codes
- Cheap to generate codes
- Cheap to encode and decode
- la = H(S) if all probabilities are (negative) powers of 2

Page 24:

Huffman Codes

Huffman Algorithm:
Start with a forest of trees, each consisting of a single vertex corresponding to a message s, with weight p(s).
Repeat:
  Select the two trees whose roots have minimum weights p1 and p2.
  Join them into a single tree by adding a root with weight p1 + p2.
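The forest-merging loop above is naturally expressed with a min-heap. A sketch (tie-breaking between equal weights is arbitrary, so the exact 0/1 labels may differ from the worked example that follows, though the codeword lengths agree):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Huffman's algorithm: repeatedly merge the two minimum-weight trees."""
    tiebreak = count()  # keeps heap comparisons away from unorderable tree nodes
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)   # two minimum-weight roots
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: recurse with 0/1 labels
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:                             # leaf: record the codeword
            codes[tree] = prefix or '0'
    walk(heap[0][2], '')
    return codes

codes = huffman_code({'a': 0.1, 'b': 0.2, 'c': 0.2, 'd': 0.5})
print(codes)  # lengths 3, 3, 2, 1 for a, b, c, d
```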

Page 25:

Example

p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

Step 1: merge a(.1) and b(.2) into a tree of weight (.3), leaving (.3), c(.2), d(.5)
Step 2: merge (.3) and c(.2) into a tree of weight (.5), leaving (.5), d(.5)
Step 3: merge (.5) and d(.5) into the final tree of weight (1.0)

Reading off the 0/1 edge labels gives the codewords:

a = 000, b = 001, c = 01, d = 1

Page 26:

Encoding and Decoding

Encoding: Start at the leaf of the Huffman tree for the message and follow the path to the root. Reverse the order of the bits and send.

Decoding: Start at the root of the Huffman tree and take a branch for each bit received. On reaching a leaf, output the message and return to the root.

[Huffman tree from the previous example, with 0/1 edge labels: a(.1) and b(.2) under (.3); (.3) and c(.2) under (.5); (.5) and d(.5) under the root (1.0).]

There are even faster methods that can process 8 or 32 bits at a time

Page 27:

Huffman Codes – Pros & Cons

Pros:
- The Huffman algorithm generates an optimal prefix code.

Cons:
- If the ensemble changes, the frequencies and probabilities change, and the optimal code changes with them; e.g. in text compression, symbol frequencies vary with context.
- Re-computing the Huffman code means running through the entire file in advance.
- The code itself must also be saved or transmitted.

Page 28:

Lempel-Ziv (LZ77)

Page 29:

Lempel-Ziv Algorithms

LZ77 (Sliding Window)
  Variants: LZSS (Lempel-Ziv-Storer-Szymanski)
  Applications: gzip, Squeeze, LHA, PKZIP, ZOO

LZ78 (Dictionary Based)
  Variants: LZW (Lempel-Ziv-Welch), LZC (Lempel-Ziv-Compress)
  Applications: compress, GIF, CCITT (modems), ARC, PAK

Traditionally LZ77 was better but slower, but the gzip version is almost as fast as any LZ78 implementation.

Page 30:

Lempel Ziv encoding

Lempel Ziv (LZ) encoding is an example of a category of algorithms called dictionary-based encoding. The idea is to create a dictionary (a table) of strings used during the communication session. If both the sender and the receiver have a copy of the dictionary, then previously-encountered strings can be substituted by their index in the dictionary to reduce the amount of information transmitted.

Page 31:

Compression

In this phase there are two concurrent events: building an indexed dictionary and compressing a string of symbols. The algorithm extracts the smallest substring that cannot be found in the dictionary from the remaining uncompressed string. It then stores a copy of this substring in the dictionary as a new entry and assigns it an index value. Compression occurs when the substring, except for the last character, is replaced with the index found in the dictionary. The process then inserts the index and the last character of the substring into the compressed string.
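The compression phase described above can be sketched as a simplified LZ78 encoder, emitting (dictionary index, next character) pairs; the function name and the sample string are ours:

```python
def lz78_compress(text):
    """LZ78: emit (dictionary index, next character) pairs while building the dictionary."""
    dictionary = {}          # substring -> index (1-based; 0 means "no known prefix")
    output = []
    w = ''
    for ch in text:
        if w + ch in dictionary:
            w += ch          # keep extending until the substring is unknown
        else:
            # smallest substring not in the dictionary: store it as a new entry,
            # and emit the index of its known prefix plus the last character
            output.append((dictionary.get(w, 0), ch))
            dictionary[w + ch] = len(dictionary) + 1
            w = ''
    if w:                    # leftover prefix that is already in the dictionary
        output.append((dictionary[w], ''))
    return output

print(lz78_compress('BAABABBBAABBBBAA'))
# [(0, 'B'), (0, 'A'), (2, 'B'), (3, 'B'), (1, 'A'), (4, 'B'), (5, 'A')]
```

Compression shows up as the dictionary fills: later pairs cover ever-longer substrings with a single index.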

Page 32:

An example of Lempel Ziv encoding

Page 33:

Decompression

Decompression is the inverse of the compression process. The decoder extracts the (index, character) pairs from the compressed string and replaces each index with the corresponding dictionary entry, rebuilding the same dictionary as it goes, starting from an empty one. The idea is that when an index is received, the dictionary already contains an entry corresponding to that index.
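The inverse process can be sketched likewise (again a simplified LZ78; the sample pairs are the encoder output for 'BAABABBBAABBBBAA'):

```python
def lz78_decompress(pairs):
    """Rebuild the text: each (index, char) pair names a known dictionary entry plus one character."""
    entries = ['']           # index 0 is the empty string
    out = []
    for index, ch in pairs:
        phrase = entries[index] + ch   # the index always refers to an earlier entry
        entries.append(phrase)         # grow the dictionary exactly as the encoder did
        out.append(phrase)
    return ''.join(out)

pairs = [(0, 'B'), (0, 'A'), (2, 'B'), (3, 'B'), (1, 'A'), (4, 'B'), (5, 'A')]
print(lz78_decompress(pairs))  # BAABABBBAABBBBAA
```

Because both sides append entries in the same order, the received index is always already present, as the text notes.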

Page 34:

An example of Lempel Ziv decoding