Lempel-Ziv Coding
Prof. Ja-Ling Wu
Department of Computer Science and Information Engineering
National Taiwan University
J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Trans. Information Theory, 1978.
Algorithm: The source sequence is sequentially parsed into strings that have not appeared so far. For example, 1011010100010… will be parsed as

  1, 0, 11, 01, 010, 00, 10, …

After every comma, we look along the input sequence until we come to the shortest string that has not been marked off before. Since this is the shortest such string, all its prefixes must have occurred earlier. In particular, the string consisting of all but the last bit of this string must have occurred earlier.
We code this phrase by giving the location of the prefix and the value of the last bit. Let C(n) be the number of phrases in the parsing of the input n-sequence. We need log C(n) bits to describe the location of the prefix of the phrase and 1 bit to describe the last bit. In the above example, the code sequence is:

  (000,1), (000,0), (001,1), (010,1), (100,0), (010,0), (001,0)

where the first number of each pair gives the index of the prefix and the second number gives the last bit of the phrase.
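The parsing and coding steps described above can be sketched in Python (a minimal illustration; the function names are my own, not from the paper):

```python
def lz78_parse(bits):
    """Sequentially split the input into the shortest strings not seen before."""
    phrases, seen, cur = [], set(), ""
    for b in bits:
        cur += b
        if cur not in seen:          # shortest new string: close the phrase here
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:                          # leftover tail (a repeat of an earlier phrase)
        phrases.append(cur)
    return phrases

def lz78_encode(phrases):
    """Code each phrase as (pointer to its prefix, last bit)."""
    index = {p: i + 1 for i, p in enumerate(phrases)}   # pointer 0 = empty prefix
    width = max(1, (len(phrases) - 1).bit_length())     # ~ log C(n) bits per pointer
    return [(format(index.get(p[:-1], 0), "0%db" % width), p[-1]) for p in phrases]

phrases = lz78_parse("1011010100010")
# phrases == ['1', '0', '11', '01', '010', '00', '10']
codes = lz78_encode(phrases)
# codes == [('000','1'), ('000','0'), ('001','1'), ('010','1'),
#           ('100','0'), ('010','0'), ('001','0')]
```

The output reproduces the code sequence on this slide: 7 phrases, so each pointer takes 3 bits.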
Decoding the coded sequence is straightforward, and we can recover the source sequence without error!

Two-pass algorithm:
  pass 1: parse the string and calculate C(n)
  pass 2: calculate the pointers and produce the code string

The two-pass algorithm allots an equal number of bits to all the pointers. This is not necessary, since the range of the pointers is smaller at the initial portion of the string.
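Decoding simply replays the pointers; a sketch in Python (the (pointer, last-bit) pair format follows the example above, and the function name is my own):

```python
def lz78_decode(codes):
    """Rebuild each phrase as (phrase at pointer) + last bit, then concatenate."""
    phrases = []
    for ptr, last in codes:
        i = int(ptr, 2)                            # 0 means the empty prefix
        phrases.append((phrases[i - 1] if i else "") + last)
    return "".join(phrases)

codes = [("000", "1"), ("000", "0"), ("001", "1"), ("010", "1"),
         ("100", "0"), ("010", "0"), ("001", "0")]
# lz78_decode(codes) == "1011010100010"
```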
• T. A. Welch, “A technique for high-performance data compression,” Computer, 1984.
• Bell, Cleary, Witten, Text Compression, Prentice-Hall, 1990.
Dictionary construction
Pattern Matching
Definition: A parsing S of a binary string x1 x2 … xn is a division of the string into phrases, separated by commas. A distinct parsing is a parsing such that no two phrases are identical. The L-Z algorithm described above gives a distinct parsing of the source sequence. Let C(n) denote the number of phrases in the L-Z parsing of a sequence of length n. The compressed sequence (after applying the L-Z algorithm) consists of a list of C(n) pairs of numbers, each pair consisting of a pointer to the previous occurrence of the prefix of the phrase (log C(n) bits) and the last bit of the phrase (1 bit).
The total length of the compressed sequence is C(n)(log C(n) + 1).

We will show that

  C(n)(log C(n) + 1) / n → H(X)

for a stationary ergodic sequence X1, X2, …, Xn.
Lemma 1 (Lempel and Ziv): The number of phrases C(n) in a distinct parsing of a binary sequence X1, X2, …, Xn satisfies

  C(n) ≤ n / ((1 − ε_n) log n)

where ε_n → 0 as n → ∞.
Proof: Let

  n_k = Σ_{j=1}^{k} j·2^j = (k−1)·2^(k+1) + 2

be the sum of the lengths of all distinct strings of length less than or equal to k. The number of phrases C(n) in a distinct parsing of a sequence of length n is maximized when all the phrases are as short as possible.

If n = n_k, all the phrases are of length ≤ k; thus

  C(n_k) ≤ Σ_{j=1}^{k} 2^j = 2^(k+1) − 2 ≤ 2^(k+1) ≤ n_k / (k−1).

If n_k ≤ n < n_(k+1), we write n = n_k + Δ; then Δ = n − n_k < n_(k+1) − n_k = (k+1)·2^(k+1).
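The closed form for n_k and the resulting phrase-count bound can be checked numerically (a quick sanity check, not part of the proof):

```python
# n_k = sum_{j=1}^k j*2^j should equal (k-1)*2^(k+1) + 2, and the maximal
# number of distinct phrases C(n_k) = sum_{j=1}^k 2^j is at most n_k/(k-1).
for k in range(2, 16):
    n_k = sum(j * 2**j for j in range(1, k + 1))
    assert n_k == (k - 1) * 2**(k + 1) + 2
    c_max = sum(2**j for j in range(1, k + 1))   # = 2^(k+1) - 2
    assert c_max == 2**(k + 1) - 2
    assert c_max <= n_k / (k - 1)
```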
Then the parsing into shortest phrases has each of the phrases of length ≤ k, plus Δ/(k+1) phrases of length k+1. Thus

  C(n) ≤ C(n_k) + Δ/(k+1) ≤ n_k/(k−1) + Δ/(k+1) ≤ (n_k + Δ)/(k−1) = n/(k−1)

  ⇒ C(n) ≤ n/(k−1).

We now bound the size of k for a given n. Let n_k ≤ n < n_(k+1). Then

  n ≥ n_k = (k−1)·2^(k+1) + 2 ≥ 2^k  →  k ≤ log n,

and

  n ≤ n_(k+1) = k·2^(k+2) + 2 ≤ (k+2)·2^(k+2) ≤ (log n + 2)·2^(k+2)

  ⇒ k + 2 ≥ log( n / (log n + 2) ) = log n − log(log n + 2).
Therefore, for all n ≥ 4,

  k − 1 ≥ log n − log(log n + 2) − 3
        ≥ log n − log(2 log n) − 3        (since log n + 2 ≤ 2 log n for n ≥ 4)
        = log n − log log n − 4
        = (1 − ε_n) log n

where ε_n = (log log n + 4)/log n → 0 as n → ∞. Hence

  C(n) ≤ n/(k−1) ≤ n / ((1 − ε_n) log n).
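Lemma 1 can be sanity-checked against the actual L-Z parse of a random string (an illustration only; the phrase-counting helper is my own, and the bound is informative only once (1 − ε_n) > 0):

```python
import math
import random

def lz_phrase_count(bits):
    """Number of phrases in the incremental (distinct) parsing of bits."""
    seen, cur, c = set(), "", 0
    for b in bits:
        cur += b
        if cur not in seen:
            seen.add(cur)
            c += 1
            cur = ""
    return c + (1 if cur else 0)

random.seed(1)
n = 1 << 16                                            # n = 65536, so log n = 16
bits = "".join(random.choice("01") for _ in range(n))
c = lz_phrase_count(bits)
eps = (math.log2(math.log2(n)) + 4) / math.log2(n)     # eps_n = (log log n + 4)/log n
assert c <= n / ((1 - eps) * math.log2(n))             # Lemma 1's bound
```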
Lemma 2: Let Z be a positive-integer-valued random variable with mean μ. Then the entropy H(Z) is bounded by

  H(Z) ≤ (μ+1) log(μ+1) − μ log μ.

Proof: exercise (HW).
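The bound in Lemma 2 is exactly the entropy of a geometric distribution on {0, 1, 2, …} with mean μ, so a numeric check is easy (a sketch; the truncation at 2000 terms is mine):

```python
import math

def lemma2_bound(mu):
    """(mu+1)log(mu+1) - mu*log(mu), the bound of Lemma 2 (in bits)."""
    return (mu + 1) * math.log2(mu + 1) - mu * math.log2(mu)

mu = 4.0

# Geometric on {0,1,2,...} with mean mu attains the bound with equality.
q = mu / (mu + 1)
H = -sum((1 - q) * q**k * math.log2((1 - q) * q**k) for k in range(2000))
assert abs(H - lemma2_bound(mu)) < 1e-9

# A positive-integer-valued Z with the same mean has smaller entropy,
# so the lemma's bound holds for it as well.
r = 1 - 1 / mu                     # geometric on {1,2,...} with mean mu
H_pos = -sum((1 - r) * r**(k - 1) * math.log2((1 - r) * r**(k - 1))
             for k in range(1, 2000))
assert H_pos < lemma2_bound(mu)
```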
Let {X_i}_{i=−∞}^{∞} be a stationary ergodic process with pmf p(x1, x2, …, xn). For a fixed integer k, define the k-th order Markov approximation to P as

  Q_k(x_{−(k−1)}, …, x_0, x_1, …, x_n) ≜ P(x_{−(k−1)}^0) · Π_{j=1}^{n} P(x_j | x_{j−k}^{j−1})

where x_i^j ≜ (x_i, x_{i+1}, …, x_j), i ≤ j, and the initial state x_{−(k−1)}^0 will be part of the specification of Q_k. Since P(x_n | x_{n−k}^{n−1}) is itself an ergodic process, we have

  −(1/n) log Q_k(X_1, X_2, …, X_n | X_{−(k−1)}^0)
    = −(1/n) Σ_{j=1}^{n} log P(X_j | X_{j−k}^{j−1})
    → −E[ log P(X_j | X_{j−k}^{j−1}) ]
    = H(X_j | X_{j−k}^{j−1}).
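The almost-sure convergence of −(1/n) log Q_k to the conditional entropy can be illustrated for k = 1 with a two-state Markov chain (the transition probabilities below are my own example values, not from the slides):

```python
import math
import random

# First-order binary Markov chain: P[prev][next] (example values)
P = {"0": {"0": 0.6, "1": 0.4},
     "1": {"0": 0.3, "1": 0.7}}

random.seed(0)
n = 200_000
prev, neg_log_q = "0", 0.0
for _ in range(n):
    nxt = "1" if random.random() < P[prev]["1"] else "0"
    neg_log_q -= math.log2(P[prev][nxt])   # accumulate -log Q_1 term by term
    prev = nxt

def h(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Stationary distribution: pi_0 = 0.3/0.7, pi_1 = 0.4/0.7, so
# H(X_j | X_{j-1}) = pi_0 * h(0.4) + pi_1 * h(0.3)  (about 0.92 bits)
H_cond = (0.3 / 0.7) * h(0.4) + (0.4 / 0.7) * h(0.3)

assert abs(neg_log_q / n - H_cond) < 0.02   # empirical rate ~ conditional entropy
```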
We will bound the rate of the L-Z code by the entropy rate H(X_j | X_{j−k}^{j−1}) of the k-th order Markov approximation, for all k. The entropy rate of the Markov approximation converges to the entropy rate of the process as k → ∞, and this will prove the result.
Suppose X_{−(k−1)}^0 = x_{−(k−1)}^0, and suppose that x_1^n is parsed into c distinct phrases y_1, y_2, …, y_c. Let ν_i be the index of the start of the i-th phrase, i.e., y_i = x_{ν_i}^{ν_{i+1}−1}. For each i = 1, 2, …, c, define s_i = x_{ν_i−k}^{ν_i−1}. Thus s_i is the k bits of x preceding y_i; of course, s_1 = x_{−(k−1)}^0.

Let c_{l,s} be the number of phrases y_i with length l and preceding state s_i = s, for l = 1, 2, … and s ∈ X^k. We then have

  Σ_{l,s} c_{l,s} = c   and   Σ_{l,s} l·c_{l,s} = n.
Lemma 3 (Ziv’s inequality): For any distinct parsing (in particular, the L-Z parsing) of the string x1 x2 … xn, we have

  log Q_k(x_1, x_2, …, x_n | s_1) ≤ − Σ_{l,s} c_{l,s} log c_{l,s}.

Note that the right-hand side does not depend on Q_k.

Proof: We write

  Q_k(x_1, x_2, …, x_n | s_1) = Π_{i=1}^{c} P(y_i | s_i).
Hence

  log Q_k(x_1, x_2, …, x_n | s_1)
    = Σ_{i=1}^{c} log P(y_i | s_i)
    = Σ_{l,s} Σ_{i: |y_i|=l, s_i=s} log P(y_i | s_i)
    = Σ_{l,s} c_{l,s} Σ_{i: |y_i|=l, s_i=s} (1/c_{l,s}) log P(y_i | s_i)
    ≤ Σ_{l,s} c_{l,s} log( (1/c_{l,s}) Σ_{i: |y_i|=l, s_i=s} P(y_i | s_i) )

where the last step follows from Jensen’s inequality (log is concave). Now, since the y_i are distinct, we have

  Σ_{i: |y_i|=l, s_i=s} P(y_i | s_i) ≤ 1

  ⇒ log Q_k(x_1, x_2, …, x_n | s_1) ≤ Σ_{l,s} c_{l,s} log(1/c_{l,s}) = − Σ_{l,s} c_{l,s} log c_{l,s}.
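Ziv’s inequality can be verified numerically for a toy first-order (k = 1) Markov Q and the example parsing from the earlier slide (the transition probabilities are arbitrary illustration values of my own):

```python
import math
from collections import Counter

# Arbitrary first-order Markov model Q_1: P[prev][next]
P = {"0": {"0": 0.6, "1": 0.4}, "1": {"0": 0.3, "1": 0.7}}

x = "1011010100010"
s1 = "0"                                              # assumed initial state
phrases = ["1", "0", "11", "01", "010", "00", "10"]   # distinct parsing of x

# Left side: log Q_1(x | s1) = sum_j log P(x_j | x_{j-1})
prev, log_q = s1, 0.0
for b in x:
    log_q += math.log2(P[prev][b])
    prev = b

# Right side: -sum_{l,s} c_{l,s} log c_{l,s}, where s_i is the bit preceding y_i
c = Counter()
pos = 0
for y in phrases:
    s = s1 if pos == 0 else x[pos - 1]
    c[(len(y), s)] += 1
    pos += len(y)
rhs = -sum(v * math.log2(v) for v in c.values())

assert log_q <= rhs        # Ziv's inequality
```

Here three phrases (11, 00, 10) have length 2 and preceding state 0, so the right side is −3 log 3 ≈ −4.75, while log Q_1 is far more negative.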
Theorem: Let {X_n} be a stationary ergodic process with entropy rate H(X), and let C(n) be the number of phrases in a distinct parsing of a sample of length n from this process. Then

  limsup_{n→∞} C(n) log C(n) / n ≤ H(X)

with probability 1.
Proof: Writing π_{l,s} = c_{l,s}/c, we have

  Σ_{l,s} π_{l,s} = 1   and   Σ_{l,s} l·π_{l,s} = n/c.

From Ziv’s inequality,

  log Q_k(x_1, x_2, …, x_n | s_1) ≤ − Σ_{l,s} c_{l,s} log c_{l,s}
    = −c Σ_{l,s} π_{l,s} (log π_{l,s} + log c)
    = −c log c − c Σ_{l,s} π_{l,s} log π_{l,s}.
We now define random variables U, V such that Pr(U = l, V = s) = π_{l,s}. Thus E(U) = n/c, and

  log Q_k(x_1, …, x_n | s_1) ≤ −c log c + c·H(U,V)
  ⇒ −(1/n) log Q_k(x_1, …, x_n | s_1) ≥ (c/n) log c − (c/n) H(U,V).

Now H(U,V) ≤ H(U) + H(V), and H(V) ≤ log |X|^k = k. From Lemma 2, we have

  H(U) ≤ (EU + 1) log(EU + 1) − EU log EU
       = (n/c + 1) log(n/c + 1) − (n/c) log(n/c)
       = log(n/c) + (n/c + 1) log(1 + c/n)
       ≤ log(n/c) + O(1).

Thus, H(U,V) ≤ log(n/c) + k + O(1).
For a given n, the maximum of (C/n) log(n/C) is attained for the maximum value of C (for C ≤ n/e). But from Lemma 1, C ≤ n/((1−ε_n) log n). Thus

  (C/n) log(n/C) ≤ O( log log n / log n ),

and therefore (C/n) H(U,V) → 0 as n → ∞.
Therefore,

  C(n) log C(n) / n ≤ −(1/n) log Q_k(x_1, x_2, …, x_n | s_1) + ε_k(n)

where ε_k(n) → 0 as n → ∞. Hence, with probability 1,

  limsup_{n→∞} C(n) log C(n) / n
    ≤ lim_{n→∞} −(1/n) log Q_k(X_1, X_2, …, X_n | X_{−(k−1)}^0)
    = H(X_0 | X_{−k}^{−1})
    → H(X)   as k → ∞.
Theorem: Let {X_i}_{i=−∞}^{∞} be a stationary ergodic stochastic process. Let l(X_1, X_2, …, X_n) be the L-Z codeword length associated with X_1, X_2, …, X_n. Then

  limsup_{n→∞} (1/n) l(X_1, X_2, …, X_n) ≤ H(X)   with probability 1.

Proof:

  l(X_1, X_2, …, X_n) = C(n)(log C(n) + 1).

By Lemma 1, limsup_{n→∞} C(n)/n = 0. Therefore

  limsup_{n→∞} (1/n) l(X_1, X_2, …, X_n)
    = limsup_{n→∞} ( C(n) log C(n)/n + C(n)/n )
    ≤ H(X)   with probability 1.

The L-Z code is a universal code: the code does not depend on the distribution of the source!
Optimal Variable-Length-to-Fixed-Length Code

Algorithm:
  step 1. Always split the leaf node with the largest probability.
  step 2. Stop when there are 2^L leaf nodes.

Example: Source data input: A B C C B A A A C, with P_A = 0.7, P_B = 0.2, P_C = 0.1.

Code tree (node probabilities):
  root → A 0.7, B 0.2, C 0.1
  A 0.7 → AA 0.49, AB 0.14, AC 0.07
  AA 0.49 → AAA 0.343, AAB 0.098, AAC 0.049
  AAA 0.343 → AAAA, AAAB, AAAC
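The greedy tree construction can be sketched as follows (a minimal version; the stopping rule is parameterized by a target leaf count, and max_leaves = 9 reproduces the tree on this slide):

```python
import heapq

def tunstall_leaves(probs, max_leaves):
    """Repeatedly split the most probable leaf until another split
    would exceed max_leaves leaves."""
    heap = [(-p, s) for s, p in probs.items()]     # max-heap via negated probs
    heapq.heapify(heap)
    while len(heap) + len(probs) - 1 <= max_leaves:
        negp, s = heapq.heappop(heap)              # most probable leaf
        for sym, p in probs.items():
            heapq.heappush(heap, (negp * p, s + sym))
    return sorted(s for _, s in heap)

def tunstall_parse(data, leaves):
    """Greedy longest-match parsing of the source into leaf words."""
    by_len = sorted(leaves, key=len, reverse=True)
    out, i = [], 0
    while i < len(data):
        w = next(w for w in by_len if data.startswith(w, i))
        out.append(w)
        i += len(w)
    return out

leaves = tunstall_leaves({"A": 0.7, "B": 0.2, "C": 0.1}, max_leaves=9)
# leaves == ['AAAA', 'AAAB', 'AAAC', 'AAB', 'AAC', 'AB', 'AC', 'B', 'C']
words = tunstall_parse("ABCCBAAAC", leaves)
# words == ['AB', 'C', 'C', 'B', 'AAAC']
```

The three splits (A, then AA, then AAA) match the code tree above; each leaf word would then be mapped to a fixed-length codeword.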
Codeword assignment (4 bits):
  A    1011
  B    1010
  C    1001
  AA   1000
  AB   0111
  AC   0110
  AAA  0101
  AAB  0100
  AAC  0011
  AAAA 0010
  AAAB 0001
  AAAC 0000

The input A B C C B A A A C parses into AB, C, C, B, AAAC, which is encoded as

  0111 1001 1001 1010 0000

Tunstall ’67, Georgia Tech Ph.D. thesis.