7/29/2019 A TEXT COMPRESSION ALGORITHM_AdeGarba.pdf
TEXT COMPRESSION ALGORITHM: A NEW APPROACH
S. E. Adewumi
Department of Mathematics
University Of Jos Nigeria
[email protected]@unijos.edu.ng
E.J.D. Garba
Department of Mathematics
University Of Jos Nigeria
Abstract
This paper presents a new version of our earlier algorithm [1]. It overcomes the
difficulty of factoring large numbers encountered in the earlier work. For each
character of the alphabet, the positions at which it occurs in the text are expressed in
binary, each with the same length as the binary form of the position of its last
occurrence. This is achieved by adding leading zeros to any binary number that is shorter
than that length. The binary forms of the positions of each character are then written as
one continuous chain, which is converted to decimal and stored. This process constitutes
compression. Decompression is achieved by converting each decimal number back to its
binary form, using the stored length for each character to partition the binary string
into the original fixed-length numbers, and converting these back to decimal to obtain
each position of occurrence of the character. This lossless compression algorithm has the
capacity of reducing a document of several pages to a maximum of two pages.
Keywords: Data Compression Algorithm, Decompression Algorithm, Lossless, Lossy,
Redundant, Compactness.
Introduction
Data compression converts data to a compact form by removing redundant elements from the
input stream. The main purpose of compressing data is to improve the efficiency with
which data is stored or transmitted [2]. It allows large files or text to be temporarily
squeezed so that they take less space and less network transmission time. To be
meaningful again they must be decompressed. A physical analogy is fruit juice: it is
compressed by concentrating it and decompressed by adding water.
Data compression falls into two classes: lossless and lossy. With a lossless technique
the restored data is identical to the original data, while a lossy method only generates
an approximation of the original data. Lossless compression is necessary for the many
types of data that must revert to their exact original form after decompression. The new
algorithm described below is a lossless data compression method in which the compressed
data reverts to the original document [3].
The Compression Algorithm
The new compression algorithm is described below:
1. Take each letter l_i of the alphabet (each character) that makes up the text document.
2. Find the positions where each of these letters occurs.
3. Convert each position to a binary number.
4. The length k of the binary string representing the last position of occurrence is used
as the standard length for every binary string of that letter. If the binary form of
another position is shorter than k, it is padded with zeros on the left to make it k
digits long.
5. Concatenate the binary strings for each letter (character); the result is in turn
converted to a decimal number to complete the compression.
6. Store the length k of the binary string of the last occurrence of each letter
(character) for use during decompression.
| l_i | Positions of occurrence | Binary value of each position | Concatenated binary string for l_i | Decimal equivalent of the concatenated binary string | k, the length of the binary string of the last occurrence of l_i |
Table 1: The new compression model.
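The six steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name compress, the dictionary used for the output table, the lowercasing of the input and the skipping of spaces are all assumptions made here so that the sketch matches the worked example in this paper.

```python
from collections import defaultdict


def compress(text):
    """Return {character: (decimal value, k)} for every non-space character.

    Assumptions (not stated in the paper): input is lowercased, positions
    are 1-based, and spaces are counted as positions but not stored.
    """
    positions = defaultdict(list)
    for index, char in enumerate(text.lower(), start=1):
        if not char.isspace():
            positions[char].append(index)

    table = {}
    for char, places in positions.items():
        # k is the binary length of the last (largest) position.
        k = places[-1].bit_length()
        # Left-pad every position to k bits, then concatenate.
        bits = "".join(format(p, "b").zfill(k) for p in places)
        table[char] = (int(bits, 2), k)
    return table


sentence = "When the bush is burning grasshoppers dont wait to bid farewell"
table = compress(sentence)
print(table["a"])  # (117625, 6)
print(table["h"])  # (73151, 5)
```

The two printed pairs are the decimal value and k for the letters a and h of the example sentence used later in this paper.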
The Decompression Algorithm
1. Take each letter l_i of the alphabet (each character).
2. Convert its decimal number to the binary equivalent.
3. Use the length k of the last occurrence for l_i to partition the binary string into
its positional binary values, after restoring any leading zeros lost in the decimal form
so that the length of the string is a multiple of k.
4. Convert each positional binary value to its decimal equivalent.
5. Write the letter at each of the resulting positions.
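A minimal Python sketch of the decompression steps is given below. The names positions_from and decompress are hypothetical, and two points are assumptions rather than statements from the paper: leading zeros are restored by padding the binary string to the next multiple of k (safe because every position is at least 1, so the first k-bit group is never all zeros), and any position with no stored character is treated as a space.

```python
def positions_from(value, k):
    """Split a stored decimal value back into its k-bit positions."""
    bits = format(value, "b")
    # Round the length up to a multiple of k to restore leading zeros.
    width = -(-len(bits) // k) * k
    bits = bits.zfill(width)
    return [int(bits[start:start + k], 2) for start in range(0, width, k)]


def decompress(table):
    """Rebuild the text from {character: (decimal value, k)}."""
    placed = {}
    for char, (value, k) in table.items():
        for position in positions_from(value, k):
            placed[position] = char
    # Assumption: positions with no stored character are spaces.
    return "".join(placed.get(i, " ") for i in range(1, max(placed) + 1))


print(positions_from(117625, 6))                   # [28, 45, 57]
print(decompress({"a": (12, 3), "b": (11, 2)}))    # abba
```

In the second call, a is stored as 001 100 (positions 1 and 4, k = 3) and b as 10 11 (positions 2 and 3, k = 2), so the four positions together spell the word abba.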
In summary, if the positions of occurrence of a letter are n_1, n_2, n_3, ..., n_(j-1),
n_j and if n_j has k binary digits, then the binary string representing n_1, n_2,
n_3, ..., n_j has a length of k × j. This string is in turn converted to its decimal
equivalent.
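The same statement can be written as a formula. The symbol N for the stored decimal value is introduced here only for illustration; it does not appear in the original text.

```latex
N = \sum_{i=1}^{j} n_i \, 2^{\,k(j-i)},
\qquad k = \lfloor \log_2 n_j \rfloor + 1,
\qquad n_i = \left\lfloor \frac{N}{2^{\,k(j-i)}} \right\rfloor \bmod 2^{k}.
```

The first expression is the compression step (concatenating j blocks of k bits each), the second gives the block length, and the third recovers each position during decompression.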
Application of this scheme
The example below demonstrates the use of this scheme to compress and decompress a text
document.
If we wish to compress the sentence:
When the bush is burning grasshoppers dont wait to bid farewell
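Then the six stages of the compression algorithm are illustrated in the tables below; the
actual compressed output is Table 4.
| Position |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| Letter   |  w |  h |  e |  n |    |  t |  h |  e |    |  b |  u |  s |  h |    |  i |  s |

| Position | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
| Letter   |    |  b |  u |  r |  n |  i |  n |  g |    |  g |  r |  a |  s |  s |  h |  o |

| Position | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 |
| Letter   |  p |  p |  e |  r |  s |    |  d |  o |  n |  t |    |  w |  a |  i |  t |    |

| Position | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |
| Letter   |  t |  o |    |  b |  i |  d |    |  f |  a |  r |  e |  w |  e |  l |  l |
Table 2: Position of each letter (blank cells are the spaces between words)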
| l_i | Positions of occurrence | Binary values for positions of occurrence | Concatenated binary string for l_i | Decimal equivalent of concatenated binary string | k = length of the binary string of the last occurrence of the letter |
| a | 28, 45, 57 | 11100, 101101, 111001 | 011100101101111001 | 117625 | 6 |
| b | 10, 18, 52 | 1010, 10010, 110100 | 001010010010110100 | 42164 | 6 |
| d | 39, 54 | 100111, 110110 | 100111110110 | 2550 | 6 |
| e | 3, 8, 35, 59, 61 | 11, 1000, 100011, 111011, 111101 | 000011001000100011111011111101 | 52575997 | 6 |
| f | 56 | 111000 | 111000 | 56 | 6 |
| g | 24, 26 | 11000, 11010 | 1100011010 | 794 | 5 |
| h | 2, 7, 13, 31 | 10, 111, 1101, 11111 | 00010001110110111111 | 73151 | 5 |
| i | 15, 22, 46, 53 | 1111, 10110, 101110, 110101 | 001111010110101110110101 | 4025269 | 6 |
| l | 62, 63 | 111110, 111111 | 111110111111 | 4031 | 6 |
| n | 4, 21, 23, 41 | 100, 10101, 10111, 101001 | 000100010101010111101001 | 1136105 | 6 |
| o | 32, 40, 50 | 100000, 101000, 110010 | 100000101000110010 | 133682 | 6 |
| p | 33, 34 | 100001, 100010 | 100001100010 | 2146 | 6 |
| r | 20, 27, 36, 58 | 10100, 11011, 100100, 111010 | 010100011011100100111010 | 5355834 | 6 |
| s | 12, 16, 29, 30, 37 | 1100, 10000, 11101, 11110, 100101 | 001100010000011101011110100101 | 205641637 | 6 |
| t | 6, 42, 47, 49 | 110, 101010, 101111, 110001 | 000110101010101111110001 | 1747953 | 6 |
| u | 11, 19 | 1011, 10011 | 0101110011 | 371 | 5 |
Table 3: Analysis of each alphabet
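As a check on one row of Table 3, the decimal value for the letter a can be worked out by hand. Each position occupies k = 6 bits, so successive positions are weighted by powers of 2^6 = 64. The layout below is only an illustrative calculation.

```latex
\underbrace{011100}_{28}\,\underbrace{101101}_{45}\,\underbrace{111001}_{57}
\;=\; 28 \cdot 64^{2} + 45 \cdot 64 + 57
\;=\; 114688 + 2880 + 57
\;=\; 117625 .
```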
| l_i | Decimal equivalent of concatenated binary string | k = length of the binary string of the last occurrence of the letter |
| a | 117625 | 6 |
| b | 42164 | 6 |
| d | 2550 | 6 |
| e | 52575997 | 6 |
| f | 56 | 6 |
| g | 794 | 5 |
| h | 73151 | 5 |
| i | 4025269 | 6 |
| l | 4031 | 6 |
| n | 1136105 | 6 |
| o | 133682 | 6 |
| p | 2146 | 6 |
| r | 5355834 | 6 |
| s | 205641637 | 6 |
| t | 1747953 | 6 |
| u | 371 | 5 |
Table 4: The actual compressed table
To decompress, we simply take Table 4, find the binary equivalent of the decimal value
attached to each letter, and break each binary string into k-length groups, one for each
position; the original text is then recovered. This is shown in Table 5.
| l_i | Decimal equivalent of binary string | Concatenated binary string for l_i | Binary values for positions of occurrence | k = length of the binary string of the last occurrence of the letter | Positions of occurrence |
| a | 117625 | 011100101101111001 | 11100, 101101, 111001 | 6 | 28, 45, 57 |
| b | 42164 | 001010010010110100 | 1010, 10010, 110100 | 6 | 10, 18, 52 |
| d | 2550 | 100111110110 | 100111, 110110 | 6 | 39, 54 |
| e | 52575997 | 000011001000100011111011111101 | 11, 1000, 100011, 111011, 111101 | 6 | 3, 8, 35, 59, 61 |
| f | 56 | 111000 | 111000 | 6 | 56 |
| g | 794 | 1100011010 | 11000, 11010 | 5 | 24, 26 |
| h | 73151 | 00010001110110111111 | 10, 111, 1101, 11111 | 5 | 2, 7, 13, 31 |
| i | 4025269 | 001111010110101110110101 | 1111, 10110, 101110, 110101 | 6 | 15, 22, 46, 53 |
| l | 4031 | 111110111111 | 111110, 111111 | 6 | 62, 63 |
| n | 1136105 | 000100010101010111101001 | 100, 10101, 10111, 101001 | 6 | 4, 21, 23, 41 |
| o | 133682 | 100000101000110010 | 100000, 101000, 110010 | 6 | 32, 40, 50 |
| p | 2146 | 100001100010 | 100001, 100010 | 6 | 33, 34 |
| r | 5355834 | 010100011011100100111010 | 10100, 11011, 100100, 111010 | 6 | 20, 27, 36, 58 |
| s | 205641637 | 001100010000011101011110100101 | 1100, 10000, 11101, 11110, 100101 | 6 | 12, 16, 29, 30, 37 |
| t | 1747953 | 000110101010101111110001 | 110, 101010, 101111, 110001 | 6 | 6, 42, 47, 49 |
| u | 371 | 0101110011 | 1011, 10011 | 5 | 11, 19 |
Table 5: Decompression table.
SUMMARY/CONCLUSION
We have demonstrated compression and decompression using this scheme. It has an
advantage over our earlier scheme in that factorization has been done away with. We
believe that this scheme will provide better compression than any known text compression
scheme and may open new ground for multimedia compression techniques.
REFERENCES
[1] Adewumi, S. E. and Garba, E. J. D. (2006) New Text Compression Algorithm. Journal
of Information and Communication Technology (ICT), EBSU Abakaliki, Vol. 2, No. 1,
May 2006. ISSN 0794-6910.
[2] Beekman, G. (1999) Computer Confluence. Addison-Wesley Longman Inc., California.
[3] Lelewer, D. and Hirschberg, D. (2001) Data Compression.
http://www1.ics.uci.edu/~dan/pubs/DataCompression.html