7

Click here to load reader

OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

Embed Size (px)

DESCRIPTION

Data compression refers to reducing the amount of space needed to store data or reducing the amount of time needed to transmit data. Many data compression techniques allow encoding the compressed form of data with different compression ratio. In particular, in the case of LZ77 technique, it reduces the data concurrency of an input file. In the output of this technique it conveys more information that is actually not needed in practical. Removing the extra information from the encoded file that makes this algorithm more optimal. Our task is to identify how much extra information it conveys and how can we minimize it so that there is no trouble at the time of decoding. Basically the encoded output of LZ77 is the sequence of triplets (a structure of encoded output) that is in binary and having fix size. For making the triplets of fix size, sometimes we are creating unnecessary information. We present the method of variable triplet size as a way to improve LZ77 compression and demonstrate it through many experiments. In our optimization algorithm we are getting more compression ratio compare to the conventional LZ77 data compression algorithm.

Citation preview

Page 1: OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),

ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME

42

OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

Hemraj Kumawat CSE,IIT Jodhpur, Jodhpur, India

Jitendra Chaudhary

CSE,IIT Jodhpur, Jodhpur, India

ABSTRACT

Data compression refers to reducing the amount of space needed to store data or reducing the

amount of time needed to transmit data. Many data compression techniques allow encoding the

compressed form of data with different compression ratio. In particular, in the case of LZ77

technique, it reduces the data concurrency of an input file. In the output of this technique it conveys

more information that is actually not needed in practical. Removing the extra information from the

encoded file that makes this algorithm more optimal. Our task is to identify how much extra

information it conveys and how can we minimize it so that there is no trouble at the time of

decoding. Basically the encoded output of LZ77 is the sequence of triplets (a structure of encoded

output) that is in binary and having fix size. For making the triplets of fix size, sometimes we are

creating unnecessary information. We present the method of variable triplet size as a way to improve

LZ77 compression and demonstrate it through many experiments. In our optimization algorithm we

are getting more compression ratio compare to the conventional LZ77 data compression algorithm.

Keywords: Look-ahead Buffer: The look-ahead buffer contains characters yet to be encoded. This buffer starts

where the Search buffer ends and during the algorithm the Search buffer extends into the look-ahead

buffer.

Match Length: The Match Length is the length of largest matching block in the look-ahead buffer.

These pairs are called triplets, consisting of offset, matching length and code word of character. If

the character is matching then next character code word is used, otherwise same character code word

is used.

Offset: The actual distance between the current position of the pointer and the look-ahead buffer is

known as offset.

Search Buffer: The Search Buffer represents the most recently encoded characters.

Sliding Window: The Structure for Data manipulation, in which the Data is held The Sliding

Window, is divided into two parts as Search buffer and look-ahead buffer.

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING &

TECHNOLOGY (IJCET)

ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) Volume 4, Issue 5, September – October (2013), pp. 42-48 © IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2013): 6.1302 (Calculated by GISI) www.jifactor.com

IJCET

© I A E M E

Page 2: OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),

ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME

43

1. INTRODUCTION

LZ77 algorithm achieves compression by replacing repeated occurrences of data with

references to a single copy of that data existing earlier in the input (uncompressed) data stream. A

match is encoded by a pair of numbers called a length-distance pair. Some common convention and

definition of the words that we are using in this paper.

2. CONVENTIONAL LZ77 ALGORITHM

LZ77 compression algorithm exploits the fact that words and phrases within a text file are

likely to be repeated. When there is repetition, they can be encoded as a pointer to an earlier

occurrence, with the pointer accompanied by the number of characters to be matched. It is a very

simple adaptive scheme that requires no prior knowledge of the source and seems to require no

assumptions about the characteristics of the source.

In the LZ77 approach the dictionary is simply a portion of the previously encoded sequence.

The encoder examines the input sequence through a sliding window which consists of two parts: a

search buffer that contains a portion of the recently encoded sequence and a look ahead buffer that

contains the next portion of the sequence to be encoded. The algorithm searches the sliding window

for the longest match with the beginning of the look-ahead buffer and outputs a reference (a pointer)

to that match. It is possible that there is no match at all, so the output cannot contain just pointers. In

LZ77 the reference is always represented as a triplet<o,l,c>, where ‘o’ is an offset to the match, ‘l’ is

length of the match and ‘c’ is the next symbol after the match. If there is no match, the algorithm

outputs a null-pointer (both the offset and the match length equal to 0) and the first symbol in the

look-ahead buffer. The values of an offset to a match and length must be limited to some maximum

constant. For this algorithm we have to define the length of the look-ahead buffer, search buffer. The

symbol is usually encoded in 8 bit. More over the compression performance of LZ77 mainly depends

on these values. Generally the search buffer length is more than the look-ahead-buffer size. So the

total triplet size:

While ( look-ahead Buffer not empty) {

get a reference (position, length) to longest match;

if (length > 0)

{

output (position, length, next symbol);

shift the window length+1 positions along;

}

else {

output (0, 0, first symbol in the look-ahead buffer);

shift the window 1 character along;

}

}

ST= [⌈log2(search buffer length)⌉] +[⌈log2(look-ahead buffer length)⌉]+8

Page 3: OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),

ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME

44

We can have better understanding with an example- “aacaacabcabaaac” For this example the size of

look-ahead buffer is 6 and search buffer is 4.

Triple Binary -- <0, 0, a> 00000011000001

-- <1, 1, c> 00100111000011

-- < 3, 4, b> 01110011000010

--< 3, 3, a> 01101111000001

-- <1, 2, c> 00101011000011

Sliding window( Size: 6 )

Longest match

Next Character

The triplet length for this example is 14(3+3+8). So here the encoded binary string of this example.

0000001100000100100111000011011100110000100110111100000100101011000011

triplet triplet triplet triplet triplet

-------------------|-------------------|-----------------------|----------------------|--------------------|

The decoding is much faster than the encoding in this process because we have to move our

pointer with fixed length (triple length-14 for this example) and it is one of the important features of

this process. From this way we get the triplets .Now we can easily decode to original data by

reversing the encoding process.

3. OPTIMIZATION OF LZ77

As we have described earlier the triplets of LZ77 algorithm have fix size. In the case, when

offset is equal to the matching length we can modify the structure of triplets and represent it with the

new structure that have only <l,c> where l is the matching length and ‘c’ is the next symbol after the

match .We called this new structure as “doublet” in this paper .All the things are same as the

conventional LZ77 algorithm except replacing the triplet with doublet in the case of matching length

equal to offset length.

��= log2(look-ahead buffer length)⌉] + 8

By replacing the triplet with doublet we are saving [⌈log2(search buffer length)⌉] number of

bits per matching case. The exact algorithms is described in the block

a a c a a c a b c a b a a a c

a a c a a c a b c a b a a a c

a a c a a c a b c a b a a a c

a a c a a c a b c a b a a a c

a a c a a c a b c a b a a a c

Page 4: OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),

ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME

45

While ( look-ahead Buffer not empty) {

get a reference (position, length) to longest match;

if (length > 0)

{

If (length== position){

output (length, next symbol);

shift the window length+1 positions along;

}

else{

output (position, length, next symbol);

shift the window length+1 positions along;

}}

else {

output (0, first symbol in the look-ahead buffer);

shift the window 1 character along;

}

}

We will have better understanding of this optimized algorithm with the previous section

example- “aacaacabcabaaac”

Triplet/doublet Binary

--<0, a> 00011000001

--<1, c> 00111000011

--< 3, 4, b> 01110011000010

--< 3, a> 01111000001

--<1, 2, c> 00101011000011

The length of triplet for this example is 14(3+3+8) and the doublet length is 11(3+8). So here

the encoded binary string of this example

0001100000100111000011011100110000100111100000100101011000011

doublet doublet triplet doublet triplet

-----------------|------------------|---------------|-----------------|-----------------------|

The decoding process is slower than the conventional LZ77 Decoding. In this algorithm we

have to move our pointer with the variable size (doublet and triplet length) .But the problem in

decoding is how we will identify which one is doublet and which one is triplet. So the decoding

process is described in the next section. Once we decode encoded file to triplets and doublet then we

can easily get to original data by reversing the encoding (data to triplet/doublet) process.

a a c a A c A b c a b a A a c

a a c a A c A b c a b a A a c

a a c a A c a b c a b a A a c

a a c a A c A b c a b a a a c

a a c a A c A b c a b a A a c

Page 5: OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),

ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME

46

4. DECODING PROCESS

For decoding the binary string that contains doublets and triplets, first we have to identify

which one is doublet’s binary string and which one is triplet binary string .To identify this we are

using the concept of delimiter that will depend on the size of the sliding window length. We will put

the delimiter before the doublet binary string.

The effective doublet length is: doublet size + delimiter size.

Now each binary substring is starting from the delimiter .Now for decoding we move the pointer

from starting of the binary string and check whether the first consecutive k bits are equal to the

delimiter or not where k is the length of delimiter .If it is equal then move the current pointer with

the effective doublet length ahead and make this substring to the doublet substring otherwise move

the pointer with the triplet length and make this substring to the triplet substring.

So the decoding algorithm is given below

Effective doublet length = doublet size + delimiter length

//delimiter is a binary string that depends on the offset length

Starting the moving pointer from 0

While(the moving pointer m<total binary length){

if((m to m+ delimiter’s length substring)==delimiter )

{

Move the pointer m with effective doublet length ahead.

Get the doublet binary substring form (m + delimiter’s length ) to (m + effective

length)

}

else{

Move the pointer m with triplet length ahead.

Get the triplet binary substring form m to (m + triplet length)}

So here the above example’s binary string after using the delimiter before the doublets.(here we are

using the delimiter=”1”)

1000110000011001110000110111001100001010111100000100101011000011

doublet double triplet doublet triplet

------------------|----------------|----------------|----------------------|----------------------|

So the substring of doublet length just after the blue one is the doublet binary substring and the rest

substrings of triple size are the triplet binary string .We can easily identify the delimiter (blue one for

this example) by moving the pointer with appropriate length according the doublet and triplets.

But what will happen when the offset’s first k bits are equal to the delimiter then this algorithm is not

valid, where k is the length of delimiter .We cannot decompress the binary string from this delimiter.

Let’s see this from an example:

For maximum search buffer length=31 and look-ahead buffer=7 and delimiter =”111”

Binary string

1110001100000111100011000010111010101100110111100111000111

|___|---doublet----||___|--doublet----||___|----triplet----------||___|--doublet---

From the starting we will see that first two strings are doublet string. The third substring is

actually triplet substring but from our algorithm it reads out as the doublet substring because the first

Page 6: OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),

ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME

47

3 bits are equal to the delimiter. This will affect the rest decoding process .Eventually we will get the

wrong decoding output.

So we are bounding our delimiter size by:

Delimiter size = no. of first consecutive 1’s in binary representation of sliding window length +1

All the bits in the delimiter are 1’s. From this Delimiter string we are getting the correct decoding

output.

5. ALGORITHMS ANALYSIS

C: Total no of Matching (offset=Matching Length)

CR: Compression ratio

Op LZ77: Optimized LZ77 Algorithm

LZ77: Conventional LZ77 Algorithm

D: % Difference between conventional LZ77 and Optimized LZ77= (Op LZ77 – LZ77) / LZ77

Max. Offset Value=maximum offset value= search buffer length

6. CONCLUSION

As we can see from the above analysis that for maximum offset value (111) the improved

compression ratio is 1.896 and for maximum offset value (223) the improved compression ratio is

1.154. From this analysis result we can see that maximum offset value is increasing, the improved

compression ratio is decreasing. When we increase the search buffer length then the total number of

matching (offset=matching length) will be decreased. The improved compression is directly

proportional to the matching. The improved compression will be more when the matching is more.

So generally this optimal algorithm is more effective, where either conventional LZ77 algorithm is

not so optimal or the input has less repetition.

Serial

Number File size

(in bytes)

Comparison between conventional LZ77 and Optimized LZ77

Max. Offset Value=111 Max. Offset Value=223 Max. Offset Value=447

C CR

C CR

C CR

LZ77 Op

LZ77 D LZ77

Op

LZ77 D LZ77

Op

LZ77 D

1 61,305 2208 1.180 1.200 2.280 1099 1.293 1.313 1.547 621 1.400 1.417 1.130

2 4,21,144 11417 1.210 1.230 1.770 5514 1.318 1.333 1.153 3104 1.420 1.433 0.842

3 7,79,959 17044 1.197 1.213 1.370 6412 1.310 1.319 0.700 3710 1.386 1.393 0.521

4 73,246 2820 1.189 1.169 1.690 1398 1.265 1.275 0.800 742 1.334 1.338 0.341

5 1,09,684 3755 1.200 1.226 2.180 1857 1.317 1.336 1.469 1063 1.423 1.440 1.080

6 6,04,919 19281 1.264 1.291 2.160 8246 1.390 1.407 1.258 4053 1.507 1.519 0.800

7 1,51,610 3489 1.236 1.255 1.497 1430 1.349 1.360 0.830 779 1.453 1.462 0.591

8 1,30,725 3309 1.222 1.242 1.635 1494 1.347 1.360 1.020 750 1.465 1.475 0.660

9 4,27,180 14369 1.254 1.282 2.253 6673 1.389 1.409 1.437 3388 1.517 1.531 0.950

10 6,78,036 23537 1.157 1.182 2.130 10763 1.263 1.279 1.320 5683 1.362 1.374 0.903

Average 10123 1.211 1.229 1.896 4488 1.324 1.339 1.154 2389 1.426 1.438 0.783

Page 7: OPTIMIZATION OF LZ77 DATA COMPRESSION ALGORITHM

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),

ISSN 0976 - 6375(Online), Volume 4, Issue 5, September - October (2013), © IAEME

48

REFERENCES

[1] IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-23, NO. 3, MAY 1977

337A Universal Algorithm for Sequential Data Compression

[2] International Symposium on Information Theory and its Applications, ISITA2006 Seoul,

Korea, October 29–November 1, 2006 Improving LZ77 Data Compression using Bit Recycling

[3] International Journal of Wisdom Based Computing, Vol. 1 (3), December 2011 68 A

Comparative Study Of Text Compression Algorithms

[4] Northwestern University Department of Electrical and Computer Engineering ECE 428:

Information Theory Spring 2004

[5] http://www.stringology.org/DataCompression/lz77/index_en.html

[6] http://www.zlib.net/feldspar.html

• Links of the text files used in Analysis are given below (sorted by the index of the table) 1. http://www.gutenberg.org/cache/epub/28466/pg28466.txt

2. http://www.gutenberg.org/cache/epub/16728/pg16728.txt

3. http://www.gutenberg.org/cache/epub/9173/pg9173.txt

4. http://www.gutenberg.org/cache/epub/32482/pg32482.txt

5. http://www.gutenberg.org/files/25731/25731-0.txt

6. http://www.gutenberg.org/cache/epub/26598/pg26598.txt

7. http://www.gutenberg.org/cache/epub/28569/pg28569.txt

8. http://www.gutenberg.org/cache/epub/32962/pg32962.txt

9. http://www.gutenberg.org/cache/epub/23319/pg23319.txt

10. http://www.gutenberg.org/cache/epub/101/pg101.txt