A TEXT COMPRESSION ALGORITHM_AdeGarba.pdf

7/29/2019 A TEXT COMPRESSION ALGORITHM_AdeGarba.pdf


TEXT COMPRESSION ALGORITHM: A NEW APPROACH

S. E. Adewumi

Department of Mathematics

University Of Jos Nigeria

[email protected]@unijos.edu.ng

E.J.D. Garba

Department of Mathematics

University Of Jos Nigeria

Abstract

This paper presents a new version of our earlier algorithm [1]. It overcomes the difficulty
of factoring large numbers encountered in the earlier work. For each letter (character) in
the text, the positions at which it occurs are expressed in binary, each with the same
length as the binary form of that letter's last position of occurrence. This is achieved by
adding zeros in front of any binary number that is shorter than the binary form of the last
position. The binary values for each letter are then written as one continuous chain, which
is converted to a decimal number and stored. This process constitutes compression.
Decompression is achieved by converting each decimal number back to its binary form, using
the stored length for that letter to partition the binary string into the original
concatenated values, and converting each value back to the decimal position at which the
letter occurs. This lossless compression algorithm has the capacity of reducing a document
of several pages to a maximum of two pages.

Keywords: Data Compression Algorithm, Decompression Algorithm, Lossless, Lossy, Redundant,
Compactness.

Introduction

Data compression converts data to a compact form by removing redundant elements from the
input stream. The main purpose of compressing data is to improve the efficiency with which
data is stored or transmitted [2]. It allows large files or texts to be temporarily squeezed
so that they take less space and less network transmission time. For them to have meaning
again, they must be decompressed. An analogy from the physical world is fruit juice: it is
compressed by putting it in concentrated form and decompressed by adding water to it.

Data compression can be classified into two types: lossless and lossy. A lossless technique
means that the restored data file is identical to the original uncompressed data, while a
lossy method generates only an approximation of the original data. Lossless compression is
necessary for the many types of data that must revert exactly to their original form after
decompression. The new algorithm described below is a lossless data compression method in
which the compressed data reverts to the original document [3].



The Compression Algorithm

The new compression algorithm is described below:

1. Take each letter l_i of the alphabet (character) that makes up the text document.
2. Find the positions where each of these letters occurs.
3. Convert each position to a binary number.
4. The length k of the binary string representing the last position of occurrence is used
   as the standard length for each binary string. This means that if the binary form of any
   other position is shorter than k, it has to be padded with zeros on the left to make it
   k-length compliant.
5. Concatenate the binary strings for each letter (character); the result is in turn
   converted to a decimal number to complete the compression.
6. Store the length k of the binary string of the last occurrence of each letter
   (character) for use during decompression.

l_i  Positions of  Binary values  Concatenated   Decimal number     k, the length of the
     occurrence    for each       binary string  equivalent of the  binary string of the
                   position       of each l_i    concatenated       last occurrence of a
                                                 binary string      particular letter

Table 1: The new compression model.
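The six steps above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name `compress` and the toy input "banana" are hypothetical, positions are counted from 1, and every character (including a space, if present) is treated as a letter.

```python
from collections import defaultdict

def compress(text):
    """Return {character: (decimal value, k)} following steps 1 to 6."""
    positions = defaultdict(list)
    for pos, ch in enumerate(text, start=1):   # steps 1-2: 1-based positions
        positions[ch].append(pos)

    table = {}
    for ch, occ in positions.items():
        k = occ[-1].bit_length()               # step 4: bit length of the last position
        # steps 3-5: binary form of each position, left-padded to k bits, then joined
        bits = "".join(format(p, "b").zfill(k) for p in occ)
        table[ch] = (int(bits, 2), k)          # steps 5-6: decimal value and stored k
    return table

# Toy input (an assumption for illustration): "a" occurs at 2, 4, 6, so k = 3
# and the chain 010 100 110 is the decimal number 166.
print(compress("banana")["a"])  # (166, 3)
```

Because positions are visited in increasing order, the last entry of each list is the largest, so no other position can need more than k bits.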

The Decompression Algorithm

1. Take each letter l_i of the alphabet (character).
2. Convert each decimal number to its binary equivalent, adding zeros on the left if needed
   so that its length is a multiple of k.
3. Using the length k of the last occurrence for each l_i, partition the binary string into
   its positional binary values.
4. Convert each positional binary value to its decimal equivalent.
5. Write the letters in their respective positions.
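A hedged sketch of steps 2 to 4 for a single letter follows. The helper name `positions_from` is hypothetical. The one detail the sketch makes explicit is that converting the chain to a decimal number drops any leading zeros, so the binary form is padded back to the next multiple of k before it is split; this is safe because every position is at least 1, which means fewer than k leading zeros can be lost.

```python
def positions_from(value, k):
    """Recover the list of positions encoded in `value` with group length k."""
    bits = format(value, "b")                   # step 2: decimal to binary
    padded_len = -(-len(bits) // k) * k         # round length up to a multiple of k
    bits = bits.zfill(padded_len)               # restore the dropped leading zeros
    groups = [bits[i:i + k] for i in range(0, len(bits), k)]   # step 3: partition
    return [int(g, 2) for g in groups]          # step 4: back to decimal positions

# The row for the letter b in Table 3 stores 42164 with k = 6.
print(positions_from(42164, 6))  # [10, 18, 52]
```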

In summary, if the positions of occurrence of a letter are n_1, n_2, n_3, ..., n_(j-1), n_j
and n_j has k binary digits, then the binary string representing n_1, n_2, n_3, ..., n_j
has a length of k × j. This string is in turn converted to its decimal equivalent.

Application of this scheme

The example below demonstrates the use of this scheme to compress and decompress a text
document.

If we wish to compress the sentence:

When the bush is burning grasshoppers dont wait to bid farewell



Then the six stages in the compression algorithm are represented in the tables below, where
the actual compression is represented by Table 4.

Position   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
Letter     w  h  e  n  _  t  h  e  _  b  u  s  h  _  i  s  _  b  u  r  n

Position  22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
Letter     i  n  g  _  g  r  a  s  s  h  o  p  p  e  r  s  _  d  o  n  t

Position  43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
Letter     _  w  a  i  t  _  t  o  _  b  i  d  _  f  a  r  e  w  e  l  l

(_ marks a space.)

Table 2: Position of each letter
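The position table can be reproduced mechanically. The sketch below assumes the sentence is lower-cased and written without an apostrophe, and that spaces occupy positions but are not listed as letters; the variable names are ours.

```python
sentence = "when the bush is burning grasshoppers dont wait to bid farewell"

positions = {}
for pos, ch in enumerate(sentence, start=1):   # positions are counted from 1
    if ch != " ":                              # a space takes a position but is not a letter
        positions.setdefault(ch, []).append(pos)

print(len(sentence))     # 63
print(positions["e"])    # [3, 8, 35, 59, 61]
```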



l_i  Positions of        Binary values for              Concatenated binary string      Decimal    k
     occurrence          positions of occurrence        for each l_i                    equivalent
a    28, 45, 57          11100,101101,111001            011100101101111001              117625     6
b    10, 18, 52          1010,10010,110100              001010010010110100              42164      6
d    39, 54              100111,110110                  100111110110                    2550       6
e    3, 8, 35, 59, 61    11,1000,100011,111011,111101   000011001000100011111011111101  52575997   6
f    56                  111000                         111000                          56         6
g    24, 26              11000,11010                    1100011010                      794        5
h    2, 7, 13, 31        10,111,1101,11111              00010001110110111111            73151      5
i    15, 22, 46, 53      1111,10110,101110,110101       001111010110101110110101        4025269    6
l    62, 63              111110,111111                  111110111111                    4031       6
n    4, 21, 23, 41       100,10101,10111,101001         000100010101010111101001        1136105    6
o    32, 40, 50          100000,101000,110010           100000101000110010              133682     6
p    33, 34              100001,100010                  100001100010                    2146       6
r    20, 27, 36, 58      10100,11011,100100,111010      010100011011100100111010        5355834    6
s    12, 16, 29, 30, 37  1100,10000,11101,11110,100101  001100010000011101011110100101  205641637  6
t    6, 42, 47, 49       110,101010,101111,110001       000110101010101111110001        1747953    6
u    11, 19              1011,10011                     0101110011                      371        5

k = the length of the binary string of the last occurrence of a particular letter.

Table 3: Analysis of each letter
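Any row of the table can be recomputed from its list of positions, which is a quick check on the arithmetic. The helper `analyse` below is a hypothetical name for a minimal sketch that returns the four derived columns for one letter.

```python
def analyse(occ):
    """Return (binary values, concatenated string, decimal value, k) for one letter."""
    k = occ[-1].bit_length()                      # length of the last position in binary
    values = [format(p, "b") for p in occ]        # unpadded binary value of each position
    chain = "".join(v.zfill(k) for v in values)   # pad each value to k bits, then join
    return values, chain, int(chain, 2), k

# Row for the letter h, which occurs at positions 2, 7, 13 and 31.
print(analyse([2, 7, 13, 31]))
# (['10', '111', '1101', '11111'], '00010001110110111111', 73151, 5)
```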



l_i  Decimal equivalent of       k
     concatenated binary string
a    117625                      6
b    42164                       6
d    2550                        6
e    52575997                    6
f    56                          6
g    794                         5
h    73151                       5
i    4025269                     6
l    4031                        6
n    1136105                     6
o    133682                      6
p    2146                        6
r    5355834                     6
s    205641637                   6
t    1747953                     6
u    371                         5

k = the length of the binary string of the last occurrence of a particular letter.

Table 4: The actual compressed table



To decompress, we take Table 4, find the binary equivalent of the decimal value attached to
each letter, and break each binary string into k-length pieces, one for each position. The
original text is then recovered. This is shown in Table 5.

l_i  Decimal    Concatenated binary string      Binary values for              k  Positions of
     equivalent for each l_i                    positions of occurrence           occurrence
a    117625     011100101101111001              11100,101101,111001            6  28, 45, 57
b    42164      001010010010110100              1010,10010,110100              6  10, 18, 52
d    2550       100111110110                    100111,110110                  6  39, 54
e    52575997   000011001000100011111011111101  11,1000,100011,111011,111101   6  3, 8, 35, 59, 61
f    56         111000                          111000                         6  56
g    794        1100011010                      11000,11010                    5  24, 26
h    73151      00010001110110111111            10,111,1101,11111              5  2, 7, 13, 31
i    4025269    001111010110101110110101        1111,10110,101110,110101       6  15, 22, 46, 53
l    4031       111110111111                    111110,111111                  6  62, 63
n    1136105    000100010101010111101001        100,10101,10111,101001         6  4, 21, 23, 41
o    133682     100000101000110010              100000,101000,110010           6  32, 40, 50
p    2146       100001100010                    100001,100010                  6  33, 34
r    5355834    010100011011100100111010        10100,11011,100100,111010      6  20, 27, 36, 58
s    205641637  001100010000011101011110100101  1100,10000,11101,11110,100101  6  12, 16, 29, 30, 37
t    1747953    000110101010101111110001        110,101010,101111,110001       6  6, 42, 47, 49
u    371        0101110011                      1011,10011                     5  11, 19

k = the length of the binary string of the last occurrence of a particular letter.

Table 5: Decompression table.
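To see that the scheme is lossless end to end, the following sketch compresses the example sentence and rebuilds it from the stored pairs alone. It is an illustration under our own assumptions rather than the authors' code: the names `compress` and `decompress` are hypothetical, the sentence is lower-cased, and the space is treated as one more character so that every position is covered.

```python
def compress(text):
    """Return {character: (decimal value, k)} using 1-based positions."""
    positions = {}
    for pos, ch in enumerate(text, start=1):
        positions.setdefault(ch, []).append(pos)
    table = {}
    for ch, occ in positions.items():
        k = occ[-1].bit_length()                          # bits in the last position
        chain = "".join(format(p, "b").zfill(k) for p in occ)
        table[ch] = (int(chain, 2), k)
    return table

def decompress(table):
    """Invert compress(); assumes the table covers every position of the text."""
    placed = {}
    for ch, (value, k) in table.items():
        bits = format(value, "b")
        bits = bits.zfill(-(-len(bits) // k) * k)         # restore lost leading zeros
        for i in range(0, len(bits), k):
            placed[int(bits[i:i + k], 2)] = ch            # write the letter at its position
    return "".join(placed[p] for p in range(1, max(placed) + 1))

sentence = "when the bush is burning grasshoppers dont wait to bid farewell"
table = compress(sentence)
print(decompress(table) == sentence)  # True
```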



SUMMARY/CONCLUSION

We have demonstrated compression and decompression using this scheme. It has an advantage
over our earlier scheme in that factorization has been done away with. We believe that this
scheme will provide better compression than known text compression schemes and may open
new ground for multimedia compression techniques.

REFERENCES

[1] Adewumi, S. E. and Garba, E. J. D. (2006) New Text Compression Algorithm. Journal of
Information and Communication Technology (ICT), EBSU Abakaliki, Vol. 2, No. 1, May 2006.
ISSN 0794-6910.

[2] Beekman, G. (1999) Computer Confluence. Addison-Wesley Longman Inc., California.

[3] Lelewer, D. and Hirschberg, D. (2001) Data Compression.
http://www1.ics.uci.edu/~dan/pubs/DataCompression.html
