Jadavpur University Presentation on : Data Compression Using Burrows-Wheeler Transform Presented by: Suvendu Rup


04/20/23 2

Introduction :

What is Data Compression ?

- Data compression is often referred to as coding, where coding is a general term encompassing any special representation of data that satisfies a given need.

- Data compression may be viewed as a branch of information theory in which the primary objective is to minimize the amount of data to be transmitted.

- Data compression has important applications in the areas of data transmission and data storage.

- Compressing data to be stored or transmitted reduces storage and communication costs.

Types of Data Compression :

- Lossless Data Compression

- Lossy Data Compression


Lossless Data Compression :

- Lossless data compression is a class of data compression that allows the exact original data to be reconstructed from the compressed data.

Most lossless compression schemes use two different kinds of algorithm:

- Statistical modeling

    - Burrows-Wheeler Transform

    - LZ77

- Encoding algorithm

    - Huffman coding

    - Arithmetic coding

Lossy Data Compression :

- A lossy data compression method is one where compressing data and then decompressing it retrieves data that may well be different from the original but is close enough to be useful in some way.


Advantages of Data Compression :

- More disk space.

- Faster file upload and download.

- More file storage options.

Disadvantages of Data Compression :

- Added complication.

- Effect of errors in transmission.

- Slower processing with sophisticated methods.

- Need to decompress the previous data.


DATA COMPRESSION USING BURROWS WHEELER TRANSFORM :

Why BWT ?

Consider the patterns :

A B B A B A

A A A B B B

The second pattern compresses better. Both patterns contain three A's and three B's, so symbol frequency is not the only issue: the context of each symbol is also important. The more regular the structure, the better the compression.
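A small sketch of this point, assuming a plain run-length view of the two patterns (the helper name `run_lengths` is hypothetical and only serves the illustration): both strings have identical symbol counts, yet the regular one collapses into far fewer runs.

```python
from itertools import groupby

def run_lengths(s):
    """Return (symbol, run length) pairs for consecutive repeats in s."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]

first, second = "ABBABA", "AAABBB"

# Same symbol frequencies in both patterns...
assert sorted(first) == sorted(second)

# ...but a very different number of runs.
print(run_lengths(first))   # five runs
print(run_lengths(second))  # two runs
```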

- Wheeler devised the transformation in 1983. Burrows and Wheeler later observed that it is quite suitable for data compression.

- The Burrows-Wheeler Transform transforms a block of data into a format that is well suited for compression.


DATA COMPRESSION USING BURROWS WHEELER TRANSFORM (CONTD.)

Burrows-Wheeler compression is a relatively new approach to lossless compression, first presented by Burrows and Wheeler in 1994.

Steps For Compression

Source Text → Forward BWT → Move-to-Front Encoding → Huffman Compression → Compressed File

Steps For Decompression

Compressed File → Huffman Decompression → Reverse Move-to-Front → Reverse BWT → Original Source Text


Algorithm for forward transformation :

1. [Sort rotations]

Given an input string S of N characters, form an N × N matrix M whose elements are characters and whose rows are the rotations (cyclic shifts) of S, sorted in lexicographic order. At least one of the rows of M contains the original string S. Let I be the index of the first such row, numbering from 0.

2. [Find last characters of rotations]

Let the string L be the last column of M.


The Burrows-Wheeler forward transformation:

1) Write the input as the first row of a matrix, one symbol per column.

2) Form all cyclic permutations of that row and write them as the other rows of the matrix.

3) Sort the matrix rows in lexicographic order.

4) Take as output the final column of the sorted matrix, together with the number of the row that corresponds to the original input.


Example:-

Let’s encode the sequence D R D O B B S

We start with all the cyclic permutations of this sequence. As there are a total of 7 characters, there are 7 permutations. Now let’s sort these sequences in lexicographic (dictionary) order. The sequence L in this case is

L: O B R S D D B


0 D R D O B B S

1 R D O B B S D

2 D O B B S D R

3 O B B S D R D

4 B B S D R D O

5 B S D R D O B

6 S D R D O B B

Cyclic permutations of D R D O B B S


0 B B S D R D O

1 B S D R D O B

2 D O B B S D R

3 D R D O B B S

4 O B B S D R D

5 R D O B B S D

6 S D R D O B B

(Sequences sorted into lexicographic order)

The original sequence appears as sequence number 3 in the sorted list. The first and last columns are tagged F and L.
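The example above can be reproduced with a naive sketch of the forward transformation. It builds and sorts every rotation explicitly, so it follows the matrix description directly rather than any efficient implementation; the function name `bwt_forward` is an assumption made for illustration and is not taken from the presented program.

```python
def bwt_forward(s):
    """Naive forward BWT: return (L, I) for the input string s."""
    n = len(s)
    rotations = [s[i:] + s[:i] for i in range(n)]  # all cyclic shifts of s
    rotations.sort()                               # lexicographic order
    last_column = "".join(row[-1] for row in rotations)
    index = rotations.index(s)                     # first row equal to the input
    return last_column, index

print(bwt_forward("DRDOBBS"))  # ('OBRSDDB', 3)
```

Sorting all N rotations of N characters costs a lot of memory for large blocks; the sketch is meant only to mirror the steps of the algorithm.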


REVERSE BWT

We can decode the original sequence by using the sequence L and the index I of the original sequence in the sorted list.

The sequence F is simply the sequence L in lexicographic order.

In the example, F: B B D D O R S

Let's call the sorted array A and the cyclically shifted array As, where each row of As is the corresponding row of A rotated one position to the right.

    A         As
0   BBSDRDO   OBBSDRD
1   BSDRDOB   BBSDRDO
2   DOBBSDR   RDOBBSD
3   DRDOBBS   SDRDOBB
4   OBBSDRD   DOBBSDR
5   RDOBBSD   DRDOBBS
6   SDRDOBB   BSDRDOB


REVERSE BWT (contd.)

The first elements of the lines of A form the sequence F, while the first elements of the lines of As form the sequence L.

For example, row 0 in As corresponds to row 4 of A.

Let's store this information in the array T: row T[j] of A is the same as row j of As.

Thus T[0]=4 and T={4 0 5 6 2 3 1}

We define two operators F and L, where F[j] is the first element in the jth row of A and L[j] is the last element in the jth row of A. Because row T[j] of A is the same as row j of As, and row j of As begins with L[j], we have

F[T[j]]=L[j]
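The array T can be built from L alone. The sketch below assumes the usual stable-sort argument, which the slides do not state explicitly: rows of As that start with the same symbol appear in the same relative order as the rows of A that start with that symbol, so the k-th occurrence of a symbol in L pairs with its k-th occurrence in F. The function name `build_t` is hypothetical.

```python
def build_t(last_column):
    """T[j] = row of the sorted array A that equals row j of the shifted array As."""
    n = len(last_column)
    # Stable sort of the positions of L by symbol: entry i is the row of As
    # that matches row i of A (assumption: equal symbols keep their order).
    order = sorted(range(n), key=lambda j: last_column[j])
    t = [0] * n
    for row_in_a, row_in_as in enumerate(order):
        t[row_in_as] = row_in_a
    return t

L = "OBRSDDB"
F = "".join(sorted(L))
T = build_t(L)
print(T)  # [4, 0, 5, 6, 2, 3, 1]

# The relation F[T[j]] = L[j] holds for every row j.
assert all(F[T[j]] == L[j] for j in range(len(L)))
```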


Move to front coding

A coding scheme that takes advantage of long runs of identical symbols is the move to front (mtf) coding.

We start with some initial listing of the source alphabet.

The symbol at the top of the list is assigned the number 0, the next one is assigned the number 1 and so on.


Example

Let's encode L = O B R S D D B. Let's assume that the source alphabet is given by

A={B D O R S}

We start out with the assignment

0 1 2 3 4

B D O R S

The first element of L is O, which gets encoded as a 2. We then move O to the top of the list, which gives us

0 1 2 3 4

O B D R S

The next letter, B, is encoded as 1 and moved to the top of the list:

0 1 2 3 4

B O D R S

The next letter is R, which is encoded as 3. Moving R to the top of the list, we get

0 1 2 3 4

R B O D S

The next letter is S, so that gets encoded as a 4 and moved to the front of the list. Continuing in this fashion we get the sequence

2 1 3 4 4 0 3
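The walk-through above can be sketched in a few lines. This is an illustration of the technique only, not the presented program; the function name `mtf_encode` and the choice to pass the initial alphabet as a string are assumptions.

```python
def mtf_encode(sequence, alphabet):
    """Encode each symbol as its current list position, then move it to the front."""
    symbols = list(alphabet)            # initial listing of the source alphabet
    output = []
    for ch in sequence:
        position = symbols.index(ch)    # current position of the symbol
        output.append(position)
        symbols.insert(0, symbols.pop(position))  # move the symbol to the top
    return output

print(mtf_encode("OBRSDDB", "BDORS"))  # [2, 1, 3, 4, 4, 0, 3]
```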


Implementation :

The implementation performs a forward BWT on an input file stream and sends the result to an output file. The input file is New.txt, and the corresponding output files are p.txt and t.txt, for the cyclic shifts and the lexicographically sorted sequences respectively. Let’s encode a small sequence stored in the file New.txt:

this$is$the

We start with all the cyclic permutations of this sequence. As there are a total of 11 characters, there are 11 permutations. Now let’s sort these sequences in lexicographic (dictionary) order. The sequence L in this case is

L: sshtth$ii$e


0 t h i s $ i s $ t h e

1 h i s $ i s $ t h e t

2 i s $ i s $ t h e t h

3 s $ i s $ t h e t h i

4 $ i s $ t h e t h i s

5 i s $ t h e t h i s $

6 s $ t h e t h i s $ i

7 $ t h e t h i s $ i s

8 t h e t h i s $ i s $

9 h e t h i s $ i s $ t

10 e t h i s $ i s $ t h

Cyclic permutations of this$is$the


0 $ i s $ t h e t h i s

1 $ t h e t h i s $ i s

2 e t h i s $ i s $ t h

3 h e t h i s $ i s $ t

4 h i s $ i s $ t h e t

5 i s $ i s $ t h e t h

6 i s $ t h e t h i s $

7 s $ i s $ t h e t h i

8 s $ t h e t h i s $ i

9 t h e t h i s $ i s $

10 t h i s $ i s $ t h e

(Sequences sorted into lexicographic order)

The original sequence appears as sequence number 10 in the sorted list. The first and last columns are tagged F and L.


Move to Front Coding :

- This scheme performs a move-to-front encoding on the input file stream New.txt and sends the result to the output file move.txt.

- An MTF encoder encodes each character using the count of distinct characters seen since that character's last appearance.

- Each new input character is encoded with its current position in the array. The character is then moved to position 0 in the array, and all the characters that were ahead of it are moved down by one position to make room.

- Both the encoder and the decoder have to start with the order array initialized to the same values.

- The program takes two arguments: an input file and an output file.


Example

Let's encode L = sshtth$ii$e. Let's assume that the source alphabet is given by

A={$, e, h, i, s, t}

We start out with the assignment

0 1 2 3 4 5

$ e h i s t

The first element of L is s, which gets encoded as a 4. We then move s to the top of the list, which gives us

0 1 2 3 4 5

s $ e h i t

The next s is encoded as 0. Because s is already at the top of the list, we do not need to make any changes. The next letter is h, which we encode as 3. We then move h to the top of the list:

0 1 2 3 4 5

h s $ e i t

The next letter is t, which is encoded as 5. Moving t to the top of the list, we get

0 1 2 3 4 5

t h s $ e i

The next letter is also t, so that gets encoded as a 0. Continuing in this fashion we get the sequence

4 0 3 5 0 1 3 5 0 1 5


Huffman Compression :

The compression program compresses and decompresses files.

- The basic compression idea is:

1) Count the occurrences of each character.

2) Sort by occurrence, highest first.

3) Build the Huffman tree.

4) Characters with higher probability end up near the top of the tree, and the others nearer the bottom. After constructing the tree, label the branches:

0= left.

1= right.


COMPRESSION USING HUFFMAN CODING

Design a Huffman code for the sequence 4 0 3 5 0 1 3 5 0 1 5

The source alphabet is {0 1 3 4 5}

The probabilities of the occurrences are

P(0)=3/11

P(1)=2/11

P(3)=2/11

P(4)=1/11

P(5)=3/11
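The probabilities above can be turned into a code with the standard Huffman construction, which repeatedly merges the two least probable nodes. The following is a minimal sketch using Python's `heapq`; the function name `huffman_code` and the tie-breaking order are assumptions, so the individual code words it prints may differ from those of any particular hand-drawn tree, although the total length of an optimal code for this sequence is fixed.

```python
import heapq
from collections import Counter

def huffman_code(sequence):
    """Return {symbol: bit string}, built by repeatedly merging the two rarest nodes."""
    counts = Counter(sequence)
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)    # rarest node: branch 0 (left)
        w1, _, right = heapq.heappop(heap)   # second rarest: branch 1 (right)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]

mtf_output = [4, 0, 3, 5, 0, 1, 3, 5, 0, 1, 5]
code = huffman_code(mtf_output)
total_bits = sum(len(code[s]) for s in mtf_output)
print(code)
print(total_bits)  # 25
```

A sequence with only one distinct symbol would receive an empty code word in this sketch; a real coder needs a special case for that.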

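CONSTRUCTION OF HUFFMAN TREE

[Figure: Huffman tree for the symbols 4, 1, 3, 0 and 5. The leaves have weights 1/11, 2/11, 2/11, 3/11 and 3/11; the internal nodes have weights 3/11, 5/11 and 8/11, and the root has weight 1. Each pair of branches is labelled 0 and 1.]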

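HUFFMAN COMPRESSION

The code words for the symbols 1 4 3 0 5 are 0000 0001 001 01 1 respectively, so the sequence 4 0 3 5 0 1 3 5 0 1 5 is compressed to

0001 01 001 1 01 0000 001 1 01 0000 1

This code is prefix-free and takes 27 bits for the 11 symbols. A strict Huffman construction, which always merges the two least probable nodes, would need 25 bits for the same sequence.

HUFFMAN DECOMPRESSION

From the compressed form 0001 01 001 1 01 0000 001 1 01 0000 1 we get back the original sequence 4 0 3 5 0 1 3 5 0 1 5.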
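Decoding with the code words listed above can be sketched as follows. Because no code word is a prefix of another, the decoder can emit a symbol as soon as the bits read so far match a code word. The function name `prefix_decode` is hypothetical, and the bit stream is held as a text string of '0' and '1' purely for readability.

```python
def prefix_decode(bits, code):
    """Decode a string of '0'/'1' characters with a prefix-free code {symbol: word}."""
    by_word = {word: sym for sym, word in code.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in by_word:          # a complete code word has been read
            decoded.append(by_word[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a code word")
    return decoded

code = {1: "0000", 4: "0001", 3: "001", 0: "01", 5: "1"}
bits = "0001" "01" "001" "1" "01" "0000" "001" "1" "01" "0000" "1"
print(prefix_decode(bits, code))  # [4, 0, 3, 5, 0, 1, 3, 5, 0, 1, 5]
```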


REVERSE MOVE-TO-FRONT CODING

By applying this technique we get back the encoded sequence, that is, L.

From the sequence 4 0 3 5 0 1 3 5 0 1 5 we get back sshtth$ii$e,

where A={$, e, h, i, s, t}
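A sketch of the reverse step, under the same assumption as the encoder that both sides start from the same initial list: each number is looked up as a position in the list, the symbol found there is output, and the symbol is moved to the front. The function name `mtf_decode` is hypothetical.

```python
def mtf_decode(codes, alphabet):
    """Invert move-to-front coding, starting from the same initial alphabet order."""
    symbols = list(alphabet)
    output = []
    for position in codes:
        ch = symbols.pop(position)   # the symbol currently at this position
        output.append(ch)
        symbols.insert(0, ch)        # move it to the front, as the encoder did
    return "".join(output)

print(mtf_decode([4, 0, 3, 5, 0, 1, 3, 5, 0, 1, 5], "$ehist"))  # sshtth$ii$e
```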


REVERSE BWT

We can decode the original sequence by using the sequence L and the index I of the original sequence in the sorted list.

The sequence F is simply the sequence L in lexicographic order.

In the example, F: $$ehhiisstt

Let's call the sorted array A and the cyclically shifted array As, where each row of As is the corresponding row of A rotated one position to the right.

     A             As
0    $is$thethis   s$is$thethi
1    $thethis$is   s$thethis$i
2    ethis$is$th   hethis$is$t
3    hethis$is$t   thethis$is$
4    his$is$thet   this$is$the
5    is$is$theth   his$is$thet
6    is$thethis$   $is$thethis
7    s$is$thethi   is$is$theth
8    s$thethis$i   is$thethis$
9    thethis$is$   $thethis$is
10   this$is$the   ethis$is$th


REVERSE BWT (CONTD.):

The first elements of the lines of A form the sequence F, while the first elements of the lines of As form the sequence L.

For example, row 0 in As corresponds to row 7 of A.

Let's store this information in the array T: row T[j] of A is the same as row j of As.

Thus T[0]=7 and T={7 8 3 9 10 4 0 5 6 1 2}

We define two operators F and L, where F[j] is the first element in the jth row of A and L[j] is the last element in the jth row of A. Because row T[j] of A is the same as row j of As, and row j of As begins with L[j], we have

F[T[j]]=L[j]


ALGORITHM

We can write the procedure as follows for a sequence of length N.

D[N] = L[I]
k ← I
for j = 1 to N-1
{
    k ← T[k]
    D[N-j] = L[k]
}

where D contains the decoded sequence. D is indexed from 1 to N, while L and T are indexed from 0.


Explanation : Let us use the decoding algorithm to recover the original sequence from the sequence L in the previous example, using

T=[7 8 3 9 10 4 0 5 6 1 2]

Note that

L=[s s h t t h $ i i $ e]

k=I=10, and N=11

We start with the last symbol:

D[11]=L[10]=e

Updating k:

k←T[10]=2, D[10]=L[2]=h

Continuing in this fashion:

k←T[2]=3, D[9]=L[3]=t
k←T[3]=9, D[8]=L[9]=$
k←T[9]=1, D[7]=L[1]=s
k←T[1]=8, D[6]=L[8]=i
k←T[8]=6, D[5]=L[6]=$
k←T[6]=0, D[4]=L[0]=s
k←T[0]=7, D[3]=L[7]=i
k←T[7]=5, D[2]=L[5]=h
k←T[5]=4, D[1]=L[4]=t

And we have decoded the entire sequence: D = this$is$the.
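The trace above can be reproduced with a short sketch of the whole reverse transformation. It uses 0-based indexing for D, so D[N] in the procedure becomes `decoded[n - 1]` here, and it builds T with a stable sort of L, which is an assumption about how T is obtained rather than something the slides specify. The function name `bwt_inverse` is hypothetical.

```python
def bwt_inverse(last_column, index):
    """Recover the original string from L and the row index I of the original."""
    n = len(last_column)
    # Build T: row T[j] of the sorted array A equals row j of the shifted array As.
    order = sorted(range(n), key=lambda j: last_column[j])  # stable sort of L
    t = [0] * n
    for row_in_a, row_in_as in enumerate(order):
        t[row_in_as] = row_in_a
    # Decode from the last symbol backwards, following T.
    decoded = [""] * n
    k = index
    decoded[n - 1] = last_column[k]
    for j in range(1, n):
        k = t[k]
        decoded[n - 1 - j] = last_column[k]
    return "".join(decoded)

print(bwt_inverse("sshtth$ii$e", 10))  # this$is$the
```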


Result :

- To test compression using the BWT followed by Huffman compression, I took 7 files from the Calgary corpus.

- To compress a file, the command is

Encode <original file name> <compressed file name>

- For decompression

encode <compressed file name> <decompressed file name> /d


File name   Size (KB)   Compressed file size (KB)   Percentage   Compress time (ms)   Decompress time
thelp       13          6                           40           219                  109
file        50          12                          24           656                  125
geo         63          14                          22           875                  154
local       128         109                         85           641                  250
aa          86          44                          51           172                  125
thesis      872         587                         67           4625                 969
book        880         597                         67           5234                 906

Second step of BWT by alternative Move to Front :

- Much research has focused on refining the MTF stage, mostly by controlling when to move a symbol to the front of the stack. Instead of attempting to improve the MTF algorithm, this approach examines two relatively simple methods based on models.

- To assess an alternative method, deterministic measures can be used to calculate the complexity, information and entropy associated with a string. These are referred to as T-complexity, T-information and T-entropy.

- The T-complexity of a string is represented by the number of T-augmentation steps required.

- The T-information of a string s is the deterministic information content of the string, referred to as I_det(s). It is calculated as the inverse logarithmic integral of the T-complexity C_det(s) of the string:

I_det(s) = li^{-1}(C_det(s))

- T-entropy is represented as TE = ∆Ti/∆l. It is the rate of change of the T-information Ti with position l along a string, and is used to measure the average T-entropy over a file.


Dual modeling of Move to Front data stream :

- The entire MTF data stream is conventionally represented in one model.

- One method suggested by Fenwick relies on a cache system, where the most probable symbols are stored in a prominent foreground model and the bulk of the remaining symbols are stored in a larger background model.

- In a dual-model representation of the MTF data, successful compression depends on giving the decoder a means of knowing when to switch between models.

- If a background symbol is encountered, the encoder emits a special ESCAPE symbol from the foreground model, informing the decoder to switch models, and then encodes the symbol in the background model.

- Balkenhol has suggested a similar approach, in which the background symbols are encoded in the original ASCII alphabet instead of as conventional MTF ranks.

- Further enhancements are made to the cache provisions to ensure that a symbol is only moved to the very front and assigned zero.
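The ESCAPE mechanism described above can be sketched as a simple split of the MTF ranks into two streams. Everything specific in this sketch is an assumption made for illustration: the threshold `foreground_size`, the `ESCAPE` marker value, and the function names are hypothetical, and the sketch does not claim to reproduce Fenwick's or Balkenhol's actual models.

```python
ESCAPE = "ESC"  # hypothetical marker that tells the decoder to switch models

def split_models(mtf_ranks, foreground_size=2):
    """Split MTF ranks into a foreground stream (with escapes) and a background stream."""
    foreground, background = [], []
    for rank in mtf_ranks:
        if rank < foreground_size:       # frequent, small ranks stay in the foreground
            foreground.append(rank)
        else:                            # anything else is escaped to the background
            foreground.append(ESCAPE)
            background.append(rank)
    return foreground, background

def merge_models(foreground, background):
    """Decoder side: replace each ESCAPE with the next background symbol."""
    rest = iter(background)
    return [next(rest) if item == ESCAPE else item for item in foreground]

ranks = [4, 0, 3, 5, 0, 1, 3, 5, 0, 1, 5]
fg, bg = split_models(ranks)
print(fg)
print(bg)
assert merge_models(fg, bg) == ranks
```

Each stream would then be coded with its own model; the foreground alphabet stays small, which is the point of the split.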


Further Work :

Image compression :

- BWT can be used for waveform coding, and the wavelet transform could be used before the BWT to improve the compression performance for many classes of signals.

Wave form coding using the BWT :

- Many natural waveforms also have repetitions. After uniform sampling and quantization, the repetitions often remain in the digitized sequences, which can then be compressed using the BWT.

- This scheme will not work well for many real-world signals, because almost all real-world signals are noisy. Noise destroys the perfect repetition required by the BWT.

Image compression using the DWT and the BWT :

- The wavelet transform is a powerful analysis tool and has been successfully used in image compression.


Base line Algorithm :

- The baseline image compression algorithm using the DWT and BWT works as follows.

- The images are first wavelet transformed and quantized.

- The next step is to convert the 2D image to a 1D sequence by a zigzag scan.

- Next, BWT and MTF are performed on the sequence.

- Then Huffman coding is performed to compress it.

Block diagram of Base line Image Compression:

Wavelet Transform → Zigzag → BWT → MTF → Huffman Coding
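The zigzag step of this pipeline can be sketched as below. The traversal order is an assumption (the JPEG-style zigzag over anti-diagonals, alternating direction), since the slides do not say which variant is used, and the function name `zigzag_scan` is hypothetical.

```python
def zigzag_scan(matrix):
    """Flatten a 2D list into a 1D list by walking its anti-diagonals in zigzag order."""
    rows, cols = len(matrix), len(matrix[0])
    out = []
    for s in range(rows + cols - 1):            # s = row index + column index
        first_row = max(0, s - cols + 1)
        last_row = min(s, rows - 1)
        row_indices = range(first_row, last_row + 1)
        if s % 2 == 0:                          # even diagonals run bottom-left to top-right
            row_indices = reversed(row_indices)
        out.extend(matrix[i][s - i] for i in row_indices)
    return out

block = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(zigzag_scan(block))  # [1, 2, 4, 7, 5, 3, 6, 8, 9]
```

The resulting 1D sequence is what the BWT and MTF stages would then operate on.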

Multiprocessor Approach to Data Compression :

- Computer network bandwidth has created the need for increased speed in compressing data and transferring the compressed data over telecommunication lines.

- Single-processor dictionary compression techniques include LZ77 and LZ78.

- The alternative to single-processor techniques, however, is the use of modern multiprocessor computers.

- With the recent growth in processor speeds and the increased availability of multiprocessor computers, software compression solutions may be fast enough for many applications.

- The parallel process involves arranging processors in a pipeline fashion and passing data between them.

- In this approach, the data being compressed is referred to as a symbol or a symbol string.

- Symbols originate from the uncompressed data set and are 8 bits in length, with 256 possible values.

- A code word is a bit string that is substituted for a symbol or a symbol string.


Parallel Processor Dictionary Compression Technique :

- James Storer has been a pioneer in the use of parallel processors to implement data compression within very large scale integration (VLSI).

- His design chained together up to 3840 processors in a pipeline. The design was fabricated into a VLSI chip and was able to process 160 million bits per second.

- Each pipeline processor performs the following tasks.

[Figure: one pipeline processor, showing the buffers Buf A and Buf B on the data stream between data input and data output, the storage locations Stor A and Stor B, and the processor index.]


Parallel Processor Dictionary Compression Technique (contd.):

1) Buf A and Buf B operate as the data stream that the symbols pass through.

2) Stor A and Stor B operate as library locations, each of which stores a single symbol or code word.

3) The processor index functions as the code word that is associated with the library contents.

4) The processor embodies logic to compare the stored library contents with the data stream.

- Data steps through each processor one symbol at a time, right to left.

- The library is built when a special signal called the “leader” is sent to a processor, telling it to store the contents of buf A and buf B into stor A and stor B.

- The data symbols in buf A and buf B are compared with the library contents in stor A and stor B.

- If a match occurs between the buffers and the storage locations, the processor index is substituted directly into the data stream for both of the symbols currently in buf A and buf B.

- This replaces two 8-bit symbols with one 12-bit code word.


CONCLUSION

Thirteen years have passed since the introduction of the BWT. In these 13 years our understanding of several theoretical and practical issues related to the BWT has significantly increased. The BWT has many interesting facets, and it is going to deeply influence the field of lossless data compression.

The biggest drawback of BWT-based algorithms is that they are not online, that is, they must process a large portion of the input data before a single output bit can be produced. The issue of developing online counterparts of BWT-based compressors has been addressed, but further work is still needed in this direction.


BIBLIOGRAPHY :

1. ARNOLD, R., AND BELL, T. 2000. The Canterbury corpus home page.

2. BENTLEY, J., SLEATOR, D., TARJAN, R., AND WEI, V. 1986. A locally adaptive data compression scheme. Commun.

3. BURROWS, M., AND WHEELER, D.J. 1994. A block sorting lossless data compression algorithm.

4. CLEARY, J.G., AND TEAHAN, W.J. 1997. Unbounded length contexts for PPM.

5. CORMACK, G.V., AND HORSPOOL, R.N.S. 1987. Data compression using dynamic Markov modeling.

6. EFFROS, M. 1999. Universal lossless source coding with the Burrows-Wheeler transform.

7. Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif.

8. FENWICK, P. 1996a. Block sorting text compression - final report.

9. FENWICK, P. 1996b. The Burrows-Wheeler transform for block sorting text compression.

10. BELL, T.C., CLEARY, J.G., AND WITTEN, I.H. Text Compression.

11. DONOHO, D.L. De-noising by soft-thresholding. Transactions on Information Theory.

12. BALKENHOL, B., AND KURTZ, S. Universal data compression based on the Burrows-Wheeler transformation.

13. LARSSON, N.J. The context trees of block sorting compression.

14. BALKENHOL, B. One attempt of a compression algorithm using the BWT.
