Lempel-Ziv Coding
Prof. Ja-Ling Wu
Department of Computer Science and Information Engineering
National Taiwan University
J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Trans. Information Theory, 1978.
Algorithm: The source sequence is sequentially parsed into strings that have not appeared so far. For example, 1011010100010… will be parsed as

  1, 0, 11, 01, 010, 00, 10, …

After every comma, we look along the input sequence until we come to the shortest string that has not been marked off before. Since this is the shortest such string, all its prefixes must have occurred earlier. In particular, the string consisting of all but the last bit of this string must have occurred earlier.
We code this phrase by giving the location of the prefix and the value of the last bit. Let C(n) be the number of phrases in the parsing of the input n-sequence. We need log C(n) bits to describe the location of the prefix of the phrase and 1 bit to describe the last bit. In the above example, the code sequence is:

  (000,1), (000,0), (001,1), (010,1), (100,0), (010,0), (001,0)

where the first number of each pair gives the index of the prefix and the second number gives the last bit of the phrase.
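The parsing and coding steps described above can be sketched in Python (a minimal illustration; the function names are my own, not from the paper):

```python
def lz78_parse(bits):
    """Sequentially split the input into the shortest strings not seen before."""
    phrases, seen, cur = [], set(), ""
    for b in bits:
        cur += b
        if cur not in seen:          # shortest new string: close the phrase here
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:                          # leftover tail (a repeat of an earlier phrase)
        phrases.append(cur)
    return phrases

def lz78_encode(phrases):
    """Code each phrase as (pointer to its prefix, last bit)."""
    index = {p: i + 1 for i, p in enumerate(phrases)}   # pointer 0 = empty prefix
    width = max(1, (len(phrases) - 1).bit_length())     # ~ log C(n) bits per pointer
    return [(format(index.get(p[:-1], 0), "0%db" % width), p[-1]) for p in phrases]

phrases = lz78_parse("1011010100010")
# phrases == ['1', '0', '11', '01', '010', '00', '10']
codes = lz78_encode(phrases)
# codes == [('000','1'), ('000','0'), ('001','1'), ('010','1'),
#           ('100','0'), ('010','0'), ('001','0')]
```

The output reproduces the code sequence on this slide: 7 phrases, so each pointer takes 3 bits.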
Decoding the coded sequence is straightforward, and we can recover the source sequence without error!

Two-pass algorithm:
  pass 1: parse the string and calculate C(n)
  pass 2: calculate the pointers and produce the code string

The two-pass algorithm allots an equal number of bits to all the pointers. This is not necessary, since the range of the pointers is smaller at the initial portion of the string.
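Decoding simply replays the pointers; a sketch in Python (the (pointer, last-bit) pair format follows the example above, and the function name is my own):

```python
def lz78_decode(codes):
    """Rebuild each phrase as (phrase at pointer) + last bit, then concatenate."""
    phrases = []
    for ptr, last in codes:
        i = int(ptr, 2)                            # 0 means the empty prefix
        phrases.append((phrases[i - 1] if i else "") + last)
    return "".join(phrases)

codes = [("000", "1"), ("000", "0"), ("001", "1"), ("010", "1"),
         ("100", "0"), ("010", "0"), ("001", "0")]
# lz78_decode(codes) == "1011010100010"
```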
• T. A. Welch, “A technique for high-performance data compression,” Computer, 1984.
• Bell, Cleary, Witten, Text Compression, Prentice-Hall, 1990.
Dictionary construction
Pattern Matching
Definition: A parsing S of a binary string x1 x2 … xn is a division of the string into phrases, separated by commas. A distinct parsing is a parsing such that no two phrases are identical. The L-Z algorithm described above gives a distinct parsing of the source sequence. Let C(n) denote the number of phrases in the L-Z parsing of a sequence of length n. The compressed sequence (after applying the L-Z algorithm) consists of a list of C(n) pairs of numbers, each pair consisting of a pointer to the previous occurrence of the prefix of the phrase (log C(n) bits) and the last bit of the phrase (1 bit).
The total length of the compressed sequence is C(n)(log C(n) + 1).

We will show that

  C(n)(log C(n) + 1) / n → H(X)

for a stationary ergodic sequence X1, X2, …, Xn.
Lemma 1 (Lempel and Ziv): The number of phrases C(n) in a distinct parsing of a binary sequence X1, X2, …, Xn satisfies

  C(n) ≤ n / ((1 − ε_n) log n)

where ε_n → 0 as n → ∞.
Proof: Let

  n_k = Σ_{j=1}^{k} j·2^j = (k−1)·2^(k+1) + 2

be the sum of the lengths of all distinct strings of length less than or equal to k. The number of phrases C(n) in a distinct parsing of a sequence of length n is maximized when all the phrases are as short as possible.

If n = n_k, all the phrases are of length ≤ k; thus

  C(n_k) ≤ Σ_{j=1}^{k} 2^j = 2^(k+1) − 2 ≤ 2^(k+1) ≤ n_k / (k−1).

If n_k ≤ n < n_(k+1), we write n = n_k + Δ; then Δ = n − n_k < n_(k+1) − n_k = (k+1)·2^(k+1).
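The closed form for n_k and the resulting phrase-count bound can be checked numerically (a quick sanity check, not part of the proof):

```python
# n_k = sum_{j=1}^k j*2^j should equal (k-1)*2^(k+1) + 2, and the maximal
# number of distinct phrases C(n_k) = sum_{j=1}^k 2^j is at most n_k/(k-1).
for k in range(2, 16):
    n_k = sum(j * 2**j for j in range(1, k + 1))
    assert n_k == (k - 1) * 2**(k + 1) + 2
    c_max = sum(2**j for j in range(1, k + 1))   # = 2^(k+1) - 2
    assert c_max == 2**(k + 1) - 2
    assert c_max <= n_k / (k - 1)
```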
Then the parsing into shortest phrases has each of the phrases of length ≤ k, plus Δ/(k+1) phrases of length k+1. Thus

  C(n) ≤ C(n_k) + Δ/(k+1) ≤ n_k/(k−1) + Δ/(k+1) ≤ (n_k + Δ)/(k−1) = n/(k−1)

  ⇒ C(n) ≤ n/(k−1).

We now bound the size of k for a given n. Let n_k ≤ n < n_(k+1). Then

  n ≥ n_k = (k−1)·2^(k+1) + 2 ≥ 2^k  →  k ≤ log n,

and

  n ≤ n_(k+1) = k·2^(k+2) + 2 ≤ (k+2)·2^(k+2) ≤ (log n + 2)·2^(k+2)

  ⇒ k + 2 ≥ log( n / (log n + 2) ) = log n − log(log n + 2).
Therefore, for all n ≥ 4,

  k − 1 ≥ log n − log(log n + 2) − 3
        ≥ log n − log(2 log n) − 3        (since log n + 2 ≤ 2 log n for n ≥ 4)
        = log n − log log n − 4
        = (1 − ε_n) log n

where ε_n = (log log n + 4)/log n → 0 as n → ∞. Hence

  C(n) ≤ n/(k−1) ≤ n / ((1 − ε_n) log n).
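Lemma 1 can be sanity-checked against the actual L-Z parse of a random string (an illustration only; the phrase-counting helper is my own, and the bound is informative only once (1 − ε_n) > 0):

```python
import math
import random

def lz_phrase_count(bits):
    """Number of phrases in the incremental (distinct) parsing of bits."""
    seen, cur, c = set(), "", 0
    for b in bits:
        cur += b
        if cur not in seen:
            seen.add(cur)
            c += 1
            cur = ""
    return c + (1 if cur else 0)

random.seed(1)
n = 1 << 16                                            # n = 65536, so log n = 16
bits = "".join(random.choice("01") for _ in range(n))
c = lz_phrase_count(bits)
eps = (math.log2(math.log2(n)) + 4) / math.log2(n)     # eps_n = (log log n + 4)/log n
assert c <= n / ((1 - eps) * math.log2(n))             # Lemma 1's bound
```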
Lemma 2: Let Z be a positive-integer-valued random variable with mean μ. Then the entropy H(Z) is bounded by

  H(Z) ≤ (μ+1) log(μ+1) − μ log μ.

Proof: exercise (HW).
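The bound in Lemma 2 is exactly the entropy of a geometric distribution on {0, 1, 2, …} with mean μ, so a numeric check is easy (a sketch; the truncation at 2000 terms is mine):

```python
import math

def lemma2_bound(mu):
    """(mu+1)log(mu+1) - mu*log(mu), the bound of Lemma 2 (in bits)."""
    return (mu + 1) * math.log2(mu + 1) - mu * math.log2(mu)

mu = 4.0

# Geometric on {0,1,2,...} with mean mu attains the bound with equality.
q = mu / (mu + 1)
H = -sum((1 - q) * q**k * math.log2((1 - q) * q**k) for k in range(2000))
assert abs(H - lemma2_bound(mu)) < 1e-9

# A positive-integer-valued Z with the same mean has smaller entropy,
# so the lemma's bound holds for it as well.
r = 1 - 1 / mu                     # geometric on {1,2,...} with mean mu
H_pos = -sum((1 - r) * r**(k - 1) * math.log2((1 - r) * r**(k - 1))
             for k in range(1, 2000))
assert H_pos < lemma2_bound(mu)
```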
Let {X_i}_{i=−∞}^{∞} be a stationary ergodic process with pmf p(x1, x2, …, xn). For a fixed integer k, define the k-th order Markov approximation to P as

  Q_k(x_{−(k−1)}, …, x_0, x_1, …, x_n) ≜ P(x_{−(k−1)}^0) · Π_{j=1}^{n} P(x_j | x_{j−k}^{j−1})

where x_i^j ≜ (x_i, x_{i+1}, …, x_j), i ≤ j, and the initial state x_{−(k−1)}^0 will be part of the specification of Q_k. Since P(x_n | x_{n−k}^{n−1}) is itself an ergodic process, we have

  −(1/n) log Q_k(X_1, X_2, …, X_n | X_{−(k−1)}^0)
    = −(1/n) Σ_{j=1}^{n} log P(X_j | X_{j−k}^{j−1})
    → −E[ log P(X_j | X_{j−k}^{j−1}) ]
    = H(X_j | X_{j−k}^{j−1}).
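The almost-sure convergence of −(1/n) log Q_k to the conditional entropy can be illustrated for k = 1 with a two-state Markov chain (the transition probabilities below are my own example values, not from the slides):

```python
import math
import random

# First-order binary Markov chain: P[prev][next] (example values)
P = {"0": {"0": 0.6, "1": 0.4},
     "1": {"0": 0.3, "1": 0.7}}

random.seed(0)
n = 200_000
prev, neg_log_q = "0", 0.0
for _ in range(n):
    nxt = "1" if random.random() < P[prev]["1"] else "0"
    neg_log_q -= math.log2(P[prev][nxt])   # accumulate -log Q_1 term by term
    prev = nxt

def h(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Stationary distribution: pi_0 = 0.3/0.7, pi_1 = 0.4/0.7, so
# H(X_j | X_{j-1}) = pi_0 * h(0.4) + pi_1 * h(0.3)  (about 0.92 bits)
H_cond = (0.3 / 0.7) * h(0.4) + (0.4 / 0.7) * h(0.3)

assert abs(neg_log_q / n - H_cond) < 0.02   # empirical rate ~ conditional entropy
```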
We will bound the rate of the L-Z code by the entropy rate H(X_j | X_{j−k}^{j−1}) of the k-th order Markov approximation, for all k. The entropy rate of the Markov approximation converges to the entropy rate of the process as k → ∞, and this will prove the result.
Suppose X_{−(k−1)}^0 = x_{−(k−1)}^0, and suppose that x_1^n is parsed into c distinct phrases y_1, y_2, …, y_c. Let ν_i be the index of the start of the i-th phrase, i.e., y_i = x_{ν_i}^{ν_{i+1}−1}. For each i = 1, 2, …, c, define s_i = x_{ν_i−k}^{ν_i−1}. Thus s_i is the k bits of x preceding y_i; of course, s_1 = x_{−(k−1)}^0.

Let c_{l,s} be the number of phrases y_i with length l and preceding state s_i = s, for l = 1, 2, … and s ∈ X^k. We then have

  Σ_{l,s} c_{l,s} = c   and   Σ_{l,s} l·c_{l,s} = n.
Lemma 3 (Ziv’s inequality): For any distinct parsing (in particular, the L-Z parsing) of the string x1 x2 … xn, we have

  log Q_k(x_1, x_2, …, x_n | s_1) ≤ − Σ_{l,s} c_{l,s} log c_{l,s}.

Note that the right-hand side does not depend on Q_k.

Proof: We write

  Q_k(x_1, x_2, …, x_n | s_1) = Π_{i=1}^{c} P(y_i | s_i).
Hence

  log Q_k(x_1, x_2, …, x_n | s_1)
    = Σ_{i=1}^{c} log P(y_i | s_i)
    = Σ_{l,s} Σ_{i: |y_i|=l, s_i=s} log P(y_i | s_i)
    = Σ_{l,s} c_{l,s} Σ_{i: |y_i|=l, s_i=s} (1/c_{l,s}) log P(y_i | s_i)
    ≤ Σ_{l,s} c_{l,s} log( (1/c_{l,s}) Σ_{i: |y_i|=l, s_i=s} P(y_i | s_i) )

where the last step follows from Jensen’s inequality (log is concave). Now, since the y_i are distinct, we have

  Σ_{i: |y_i|=l, s_i=s} P(y_i | s_i) ≤ 1

  ⇒ log Q_k(x_1, x_2, …, x_n | s_1) ≤ Σ_{l,s} c_{l,s} log(1/c_{l,s}) = − Σ_{l,s} c_{l,s} log c_{l,s}.
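Ziv’s inequality can be verified numerically for a toy first-order (k = 1) Markov Q and the example parsing from the earlier slide (the transition probabilities are arbitrary illustration values of my own):

```python
import math
from collections import Counter

# Arbitrary first-order Markov model Q_1: P[prev][next]
P = {"0": {"0": 0.6, "1": 0.4}, "1": {"0": 0.3, "1": 0.7}}

x = "1011010100010"
s1 = "0"                                              # assumed initial state
phrases = ["1", "0", "11", "01", "010", "00", "10"]   # distinct parsing of x

# Left side: log Q_1(x | s1) = sum_j log P(x_j | x_{j-1})
prev, log_q = s1, 0.0
for b in x:
    log_q += math.log2(P[prev][b])
    prev = b

# Right side: -sum_{l,s} c_{l,s} log c_{l,s}, where s_i is the bit preceding y_i
c = Counter()
pos = 0
for y in phrases:
    s = s1 if pos == 0 else x[pos - 1]
    c[(len(y), s)] += 1
    pos += len(y)
rhs = -sum(v * math.log2(v) for v in c.values())

assert log_q <= rhs        # Ziv's inequality
```

Here three phrases (11, 00, 10) have length 2 and preceding state 0, so the right side is −3 log 3 ≈ −4.75, while log Q_1 is far more negative.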
Theorem: Let {X_n} be a stationary ergodic process with entropy rate H(X), and let C(n) be the number of phrases in a distinct parsing of a sample of length n from this process. Then

  limsup_{n→∞} C(n) log C(n) / n ≤ H(X)

with probability 1.
Proof: Writing π_{l,s} = c_{l,s}/c, we have

  Σ_{l,s} π_{l,s} = 1   and   Σ_{l,s} l·π_{l,s} = n/c.

From Ziv’s inequality,

  log Q_k(x_1, x_2, …, x_n | s_1) ≤ − Σ_{l,s} c_{l,s} log c_{l,s}
    = −c Σ_{l,s} π_{l,s} (log π_{l,s} + log c)
    = −c log c − c Σ_{l,s} π_{l,s} log π_{l,s}.
We now define random variables U, V such that Pr(U = l, V = s) = π_{l,s}. Thus E(U) = n/c, and

  log Q_k(x_1, …, x_n | s_1) ≤ −c log c + c·H(U,V)
  ⇒ −(1/n) log Q_k(x_1, …, x_n | s_1) ≥ (c/n) log c − (c/n) H(U,V).

Now H(U,V) ≤ H(U) + H(V), and H(V) ≤ log |X|^k = k. From Lemma 2, we have

  H(U) ≤ (EU + 1) log(EU + 1) − EU log EU
       = (n/c + 1) log(n/c + 1) − (n/c) log(n/c)
       = log(n/c) + (n/c + 1) log(1 + c/n)
       ≤ log(n/c) + O(1).

Thus, H(U,V) ≤ log(n/c) + k + O(1).
For a given n, the maximum of (C/n) log(n/C) is attained for the maximum value of C (for C ≤ n/e). But from Lemma 1, C ≤ n/((1−ε_n) log n). Thus

  (C/n) log(n/C) ≤ O( log log n / log n ),

and therefore (C/n) H(U,V) → 0 as n → ∞.
Therefore,

  C(n) log C(n) / n ≤ −(1/n) log Q_k(x_1, x_2, …, x_n | s_1) + ε_k(n)

where ε_k(n) → 0 as n → ∞. Hence, with probability 1,

  limsup_{n→∞} C(n) log C(n) / n
    ≤ lim_{n→∞} −(1/n) log Q_k(X_1, X_2, …, X_n | X_{−(k−1)}^0)
    = H(X_0 | X_{−k}^{−1})
    → H(X)   as k → ∞.
Theorem: Let {X_i}_{i=−∞}^{∞} be a stationary ergodic stochastic process. Let l(X_1, X_2, …, X_n) be the L-Z codeword length associated with X_1, X_2, …, X_n. Then

  limsup_{n→∞} (1/n) l(X_1, X_2, …, X_n) ≤ H(X)   with probability 1.

Proof:

  l(X_1, X_2, …, X_n) = C(n)(log C(n) + 1).

By Lemma 1, limsup_{n→∞} C(n)/n = 0. Therefore

  limsup_{n→∞} (1/n) l(X_1, X_2, …, X_n)
    = limsup_{n→∞} ( C(n) log C(n)/n + C(n)/n )
    ≤ H(X)   with probability 1.

The L-Z code is a universal code: the code does not depend on the distribution of the source!
Optimal Variable-Length-to-Fixed-Length Code

Algorithm:
  step 1. Always split the leaf node with the largest probability.
  step 2. Stop when there are 2^L leaf nodes.

Example: Source data input: A B C C B A A A C, with P_A = 0.7, P_B = 0.2, P_C = 0.1.

Code tree (node probabilities):
  root → A 0.7, B 0.2, C 0.1
  A 0.7 → AA 0.49, AB 0.14, AC 0.07
  AA 0.49 → AAA 0.343, AAB 0.098, AAC 0.049
  AAA 0.343 → AAAA, AAAB, AAAC
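The greedy tree construction can be sketched as follows (a minimal version; the stopping rule is parameterized by a target leaf count, and max_leaves = 9 reproduces the tree on this slide):

```python
import heapq

def tunstall_leaves(probs, max_leaves):
    """Repeatedly split the most probable leaf until another split
    would exceed max_leaves leaves."""
    heap = [(-p, s) for s, p in probs.items()]     # max-heap via negated probs
    heapq.heapify(heap)
    while len(heap) + len(probs) - 1 <= max_leaves:
        negp, s = heapq.heappop(heap)              # most probable leaf
        for sym, p in probs.items():
            heapq.heappush(heap, (negp * p, s + sym))
    return sorted(s for _, s in heap)

def tunstall_parse(data, leaves):
    """Greedy longest-match parsing of the source into leaf words."""
    by_len = sorted(leaves, key=len, reverse=True)
    out, i = [], 0
    while i < len(data):
        w = next(w for w in by_len if data.startswith(w, i))
        out.append(w)
        i += len(w)
    return out

leaves = tunstall_leaves({"A": 0.7, "B": 0.2, "C": 0.1}, max_leaves=9)
# leaves == ['AAAA', 'AAAB', 'AAAC', 'AAB', 'AAC', 'AB', 'AC', 'B', 'C']
words = tunstall_parse("ABCCBAAAC", leaves)
# words == ['AB', 'C', 'C', 'B', 'AAAC']
```

The three splits (A, then AA, then AAA) match the code tree above; each leaf word would then be mapped to a fixed-length codeword.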
Codeword assignment (4 bits):
  A    1011
  B    1010
  C    1001
  AA   1000
  AB   0111
  AC   0110
  AAA  0101
  AAB  0100
  AAC  0011
  AAAA 0010
  AAAB 0001
  AAAC 0000

The input A B C C B A A A C parses into AB, C, C, B, AAAC, which is encoded as

  0111 1001 1001 1010 0000

Tunstall ’67, Georgia Tech Ph.D. thesis.