A TEXT COMPRESSION ALGORITHM_AdeGarba.pdf

  • 7/29/2019 A TEXT COMPRESSION ALGORITHM_AdeGarba.pdf

    TEXT COMPRESSION ALGORITHM: A NEW APPROACH

    S. E. Adewumi

    Department of Mathematics

    University Of Jos Nigeria

    [email protected]@unijos.edu.ng

    E.J.D. Garba

    Department of Mathematics

    University Of Jos Nigeria

    Abstract

    This paper presents a new version of our earlier algorithm [1]. It overcomes the difficulty
    of factoring large numbers that was encountered in the earlier work. For each letter of the
    alphabet (character) in the text, the positions at which it occurs are expressed in binary,
    all with the same length as the binary form of that letter's last position of occurrence.
    This is achieved by adding zeros in front of any binary number that is shorter than the
    binary form of the last position. The binary forms of the occurrences of each letter are
    then written as one continuous chain, which is converted to decimal and stored. This process
    constitutes compression. Decompression is achieved by converting each stored decimal number
    back to binary, using the stored length for that letter to partition the binary string into
    the original concatenated numbers, and converting these back to the decimal positions at
    which the letter occurs. This lossless compression algorithm has the capacity of reducing a
    document of several pages to a maximum of two pages.

    Keywords: Data Compression Algorithm, Decompression Algorithm, Lossless, Lossy, Redundant,
    Compactness.

    Introduction

    Data compression allows the conversion of data to a compact form by removing redundant
    elements from the input stream. The main purpose of compressing data is to improve the
    efficiency with which data is stored or transmitted [2]. It allows large files or texts to
    be temporarily squeezed so that they take less storage space and less network transmission
    time. To be meaningful again they must be decompressed. In the physical world, we achieve
    compression by concentrating juice, and decompression by adding water to the concentrate.

    Data compression can be classified into two types: lossless and lossy. A lossless technique
    restores a data file that is identical to the original, uncompressed data, while a lossy
    method generates only an approximation of the original data. Lossless compression is
    necessary for the many types of data that must revert to their original form after
    decompression. The new algorithm described below is a lossless data compression algorithm
    in which the compressed data reverts to the original document [3].

    The Compression Algorithm

    The new compression algorithm is described below:

    1. Take each letter l_i of the alphabet (characters) that makes up the text document;
    2. Find the positions where each of these letters occurs;
    3. Convert each position to a binary number;
    4. The length k of the binary string representing the last position of occurrence is used
       as the standard length for each binary string. This means that if the binary form of
       another position is shorter than k, it has to be padded on the left with zeros to make
       it k-length compliant;
    5. Concatenate the binary strings for each letter (character); the result is in turn
       converted to a decimal number to complete the compression;
    6. Store the length k of the binary string of the last occurrence of each letter
       (character) for use during decompression.

    l_i | Positions of occurrence | Binary value of each position | Concatenated binary string of each l_i | Decimal equivalent of the concatenated binary string | k

    k = the length of the binary string representing the last occurrence of a particular alphabet.

    Table 1: The new compression model.
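
    The compression steps and the model of Table 1 can be sketched in Python as follows. This
    is a minimal illustration under stated assumptions, not the authors' implementation: the
    function names compress_letter and compress are hypothetical, positions are counted from 1,
    and the short text "banana" is an assumed example that does not appear in the paper.

```python
def compress_letter(text: str, letter: str) -> tuple[int, int]:
    """Return (decimal value, k) for one letter of the text."""
    # Step 2: 1-based positions at which the letter occurs.
    positions = [i for i, ch in enumerate(text, start=1) if ch == letter]
    if not positions:
        raise ValueError(f"{letter!r} does not occur in the text")
    # Steps 3-4: k is the number of binary digits of the last position.
    k = positions[-1].bit_length()
    # Step 5: left-pad every position to k bits and concatenate the results.
    chain = "".join(format(p, f"0{k}b") for p in positions)
    # Step 5-6: the chain is stored as a decimal integer together with k.
    return int(chain, 2), k


def compress(text: str) -> dict[str, tuple[int, int]]:
    """Build the compressed table for every distinct character of the text."""
    return {ch: compress_letter(text, ch) for ch in sorted(set(text))}


print(compress("banana"))
# {'a': (166, 3), 'b': (1, 1), 'n': (29, 3)}
```

    Here the letter a occurs at positions 2, 4 and 6, so k = 3 and the chain 010100110 is
    stored as 166.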

    The Decompression Algorithm

    1. Take each letter l_i of the alphabet (character);
    2. Convert its stored decimal number to the equivalent binary number;
    3. Using the length k of the last occurrence for each l_i, partition the binary string into
       its positional binary values, restoring any leading zeros that the decimal form dropped
       so that the string length is a multiple of k;
    4. Convert each positional binary value to its decimal equivalent;
    5. Write the letter at each of these positions.
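
    The decompression steps can be sketched in the same way. Again this is only an
    illustration: decompress_letter and decompress are hypothetical names, and the small table
    passed to decompress is an assumed example for the text "banana". The one detail the sketch
    makes explicit is that leading zeros, which a decimal number cannot carry, are restored
    before the string is cut into k-bit groups.

```python
def decompress_letter(value: int, k: int) -> list[int]:
    """Recover the positions of one letter from its decimal value and k."""
    bits = format(value, "b")
    # Restore dropped leading zeros so the length is a multiple of k.
    if len(bits) % k:
        bits = bits.zfill(len(bits) + k - len(bits) % k)
    # Cut into k-bit groups and read each group as a 1-based position.
    return [int(bits[i:i + k], 2) for i in range(0, len(bits), k)]


def decompress(table: dict[str, tuple[int, int]]) -> str:
    """Write every letter back at its positions (table must cover all positions)."""
    slots = {}
    for letter, (value, k) in table.items():
        for pos in decompress_letter(value, k):
            slots[pos] = letter
    return "".join(slots[p] for p in range(1, max(slots) + 1))


print(decompress_letter(166, 3))                               # [2, 4, 6]
print(decompress({"a": (166, 3), "b": (1, 1), "n": (29, 3)}))  # banana
```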

    In summary, if the positions of occurrence of a letter are n_1, n_2, n_3, ..., n_(j-1), n_j,
    and n_j has k binary digits, then the binary string representing n_1, n_2, n_3, ..., n_j
    has a length of k × j. This string is in turn converted to its decimal equivalent.
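
    The summary can also be written as a formula. The symbol N for the stored decimal value is
    our own notation, introduced here only for illustration; the worked instance uses the
    letter a of the example in this paper, which occurs at positions 28, 45 and 57.

```latex
% Assumed notation: N is the decimal value stored for one letter.
k = \lfloor \log_2 n_j \rfloor + 1, \qquad
N = \sum_{i=1}^{j} n_i \, 2^{k(j-i)}, \qquad
n_i = \left\lfloor \frac{N}{2^{k(j-i)}} \right\rfloor \bmod 2^{k}.

% Worked instance for the letter a (k = 6, j = 3):
N = 28 \cdot 2^{12} + 45 \cdot 2^{6} + 57 = 114688 + 2880 + 57 = 117625.
```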

    Application of this scheme

    The example below demonstrates the use of this scheme to compress and decompress a text
    document.

    If we wish to compress the sentence:

    When the bush is burning grasshoppers dont wait to bid farewell

    Then the six stages of the compression algorithm are illustrated in the tables below, where
    the actual compressed output is given in Table 4.

    Position    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
    Character   w  h  e  n     t  h  e     b  u  s  h     i  s     b  u  r  n

    Position   22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
    Character   i  n  g     g  r  a  s  s  h  o  p  p  e  r  s     d  o  n  t

    Position   43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
    Character      w  a  i  t     t  o     b  i  d     f  a  r  e  w  e  l  l

    Blank cells are the spaces between words, which also occupy positions.

    Table 2: Position of each letter

    l_i | Positions          | Binary values of positions        | Concatenated binary string     | Decimal equivalent | k
    ----+--------------------+-----------------------------------+--------------------------------+--------------------+---
    a   | 28, 45, 57         | 11100, 101101, 111001             | 011100101101111001             | 117625             | 6
    b   | 10, 18, 52         | 1010, 10010, 110100               | 001010010010110100             | 42164              | 6
    d   | 39, 54             | 100111, 110110                    | 100111110110                   | 2550               | 6
    e   | 3, 8, 35, 59, 61   | 11, 1000, 100011, 111011, 111101  | 000011001000100011111011111101 | 52575997           | 6
    f   | 56                 | 111000                            | 111000                         | 56                 | 6
    g   | 24, 26             | 11000, 11010                      | 1100011010                     | 794                | 5
    h   | 2, 7, 13, 31       | 10, 111, 1101, 11111              | 00010001110110111111           | 73151              | 5
    i   | 15, 22, 46, 53     | 1111, 10110, 101110, 110101       | 001111010110101110110101       | 4025269            | 6
    l   | 62, 63             | 111110, 111111                    | 111110111111                   | 4031               | 6
    n   | 4, 21, 23, 41      | 100, 10101, 10111, 101001         | 000100010101010111101001       | 1136105            | 6
    o   | 32, 40, 50         | 100000, 101000, 110010            | 100000101000110010             | 133682             | 6
    p   | 33, 34             | 100001, 100010                    | 100001100010                   | 2146               | 6
    r   | 20, 27, 36, 58     | 10100, 11011, 100100, 111010      | 010100011011100100111010       | 5355834            | 6
    s   | 12, 16, 29, 30, 37 | 1100, 10000, 11101, 11110, 100101 | 001100010000011101011110100101 | 205641637          | 6
    t   | 6, 42, 47, 49      | 110, 101010, 101111, 110001       | 000110101010101111110001       | 1747953            | 6
    u   | 11, 19             | 1011, 10011                       | 0101110011                     | 371                | 5

    Decimal equivalent = the decimal number equivalent of the concatenated binary string.
    k = the length of the binary string representing the last occurrence of a particular alphabet.

    Table 3: Analysis of each alphabet
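
    Individual rows of Table 3 can be checked mechanically. The sketch below assumes the
    sentence is taken in lower case, that spaces occupy positions, and that "dont" is written
    without an apostrophe, which is what the position numbering in Table 2 implies. The name
    table_row is hypothetical.

```python
SENTENCE = "when the bush is burning grasshoppers dont wait to bid farewell"


def table_row(text: str, letter: str) -> tuple[list[int], str, int, int]:
    """Return (positions, concatenated binary string, decimal value, k)."""
    positions = [i for i, ch in enumerate(text, start=1) if ch == letter]
    k = positions[-1].bit_length()          # binary length of the last position
    chain = "".join(format(p, f"0{k}b") for p in positions)
    return positions, chain, int(chain, 2), k


for letter in "ahsu":
    print(letter, *table_row(SENTENCE, letter))
# a [28, 45, 57] 011100101101111001 117625 6
# h [2, 7, 13, 31] 00010001110110111111 73151 5
# s [12, 16, 29, 30, 37] 001100010000011101011110100101 205641637 6
# u [11, 19] 0101110011 371 5
```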

    l_i | Decimal equivalent | k
    ----+--------------------+---
    a   | 117625             | 6
    b   | 42164              | 6
    d   | 2550               | 6
    e   | 52575997           | 6
    f   | 56                 | 6
    g   | 794                | 5
    h   | 73151              | 5
    i   | 4025269            | 6
    l   | 4031               | 6
    n   | 1136105            | 6
    o   | 133682             | 6
    p   | 2146               | 6
    r   | 5355834            | 6
    s   | 205641637          | 6
    t   | 1747953            | 6
    u   | 371                | 5

    Decimal equivalent = the decimal number equivalent of the concatenated binary string.
    k = the length of the binary string representing the last occurrence of a particular alphabet.

    Table 4: The actual compressed table

    To decompress, we simply take Table 4, find the binary equivalent of the decimal value
    attached to each letter, and break each binary string into k-length groups, one for each
    position. The original text is then recovered. This is shown in Table 5.

    l_i | Decimal equivalent | Concatenated binary string     | Binary values of positions        | k | Positions
    ----+--------------------+--------------------------------+-----------------------------------+---+-------------------
    a   | 117625             | 011100101101111001             | 11100, 101101, 111001             | 6 | 28, 45, 57
    b   | 42164              | 001010010010110100             | 1010, 10010, 110100               | 6 | 10, 18, 52
    d   | 2550               | 100111110110                   | 100111, 110110                    | 6 | 39, 54
    e   | 52575997           | 000011001000100011111011111101 | 11, 1000, 100011, 111011, 111101  | 6 | 3, 8, 35, 59, 61
    f   | 56                 | 111000                         | 111000                            | 6 | 56
    g   | 794                | 1100011010                     | 11000, 11010                      | 5 | 24, 26
    h   | 73151              | 00010001110110111111           | 10, 111, 1101, 11111              | 5 | 2, 7, 13, 31
    i   | 4025269            | 001111010110101110110101       | 1111, 10110, 101110, 110101       | 6 | 15, 22, 46, 53
    l   | 4031               | 111110111111                   | 111110, 111111                    | 6 | 62, 63
    n   | 1136105            | 000100010101010111101001       | 100, 10101, 10111, 101001         | 6 | 4, 21, 23, 41
    o   | 133682             | 100000101000110010             | 100000, 101000, 110010            | 6 | 32, 40, 50
    p   | 2146               | 100001100010                   | 100001, 100010                    | 6 | 33, 34
    r   | 5355834            | 010100011011100100111010       | 10100, 11011, 100100, 111010      | 6 | 20, 27, 36, 58
    s   | 205641637          | 001100010000011101011110100101 | 1100, 10000, 11101, 11110, 100101 | 6 | 12, 16, 29, 30, 37
    t   | 1747953            | 000110101010101111110001       | 110, 101010, 101111, 110001       | 6 | 6, 42, 47, 49
    u   | 371                | 0101110011                     | 1011, 10011                       | 5 | 11, 19

    Decimal equivalent = the decimal number equivalent of the concatenated binary string.
    k = the length of the binary string representing the last occurrence of a particular alphabet.

    Table 5: Decompression table.
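
    A full round trip over the example sentence can be sketched as below. This is an
    illustration under our own assumptions rather than the authors' program: the names compress
    and decompress are hypothetical, the sentence is taken in lower case without an apostrophe,
    and every distinct character receives an entry, including the space and the letter w, which
    the tables above do not list.

```python
SENTENCE = "when the bush is burning grasshoppers dont wait to bid farewell"


def compress(text: str) -> dict[str, tuple[int, int]]:
    """Map each distinct character to (decimal value, k)."""
    table = {}
    for ch in set(text):
        positions = [i for i, c in enumerate(text, start=1) if c == ch]
        k = positions[-1].bit_length()
        chain = "".join(format(p, f"0{k}b") for p in positions)
        table[ch] = (int(chain, 2), k)
    return table


def decompress(table: dict[str, tuple[int, int]]) -> str:
    """Rebuild the text by writing each character back at its positions."""
    slots = {}
    for ch, (value, k) in table.items():
        bits = format(value, "b")
        # Pad on the left to the next multiple of k (restores leading zeros).
        bits = bits.zfill(-(-len(bits) // k) * k)
        for i in range(0, len(bits), k):
            slots[int(bits[i:i + k], 2)] = ch
    return "".join(slots[p] for p in sorted(slots))


table = compress(SENTENCE)
print(table["t"])                      # (1747953, 6)
print(len(table))                      # 18
print(decompress(table) == SENTENCE)   # True
```

    The 18 entries are the 16 letters of Table 4 together with w and the space character.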

    SUMMARY/CONCLUSION

    We have demonstrated compression and decompression using this scheme. Its advantage over
    our earlier scheme is that factorization is no longer required. We believe that this scheme
    will provide better compression than any known text compression scheme and may, in a way,
    open new ground for multimedia compression techniques.

    REFERENCES

    [1] Adewumi, S. E. and Garba, E. J. D. (2006) New Text Compression Algorithm. Journal of
        Information and Communication Technology (ICT), EBSU Abakaliki, Vol. 2, No. 1, May 2006.
        ISSN 0794-6910.

    [2] Beekman, G. (1999) Computer Confluence. Addison-Wesley Longman Inc., California.

    [3] Lelewer, D. and Hirschberg, D. (2001) Data Compression.
        http://www1.ics.uci.edu/~dan/pubs/DataCompression.html