Lossless Compression Basics



    Data Compression

    By

    Ziya Arnavut

    Department of Computer and Information Sciences

    SUNY Fredonia

    12-20-2011


A tremendous amount of data is communicated every day.

Example: the World-Wide-Web. Many people surf the net, and people communicate over the Internet using software such as Skype.

The transmission time is related to a) the amount (size) of the data and b) the channel's capacity.

    Can we reduce the transmission time?

    Of course.

a) Reduce the size of the data. How? Use a suitable compression technique.

b) Increase the channel capacity. For example, from 100 MB to 1 GB.

    c) Or, utilize both (a) and (b)


    Can we reduce the size of data?

    Yes. Using compression techniques.

    Two main data compression techniques:

1. Lossless (noiseless) techniques.

Examples: text, medical and remote-sensing imagery.

2. Lossy (noisy) techniques.

Examples: images, sound and some other multimedia applications.

    Lossy Techniques

    Data compressed by lossy techniques are not exactly recoverable.

    In many applications, this feature helps to increase the channel throughput.

For example, JPEG images: the compressed image may show significant loss relative to the original, but this does not cause a problem, since humans can often comprehend things even when there is noise.

Hence, depending on the application, lossy techniques may be used to increase the channel throughput.


    Original Image

    Size on disk: 2.25 Mbytes

    85% JPEG compressed image.

    Size on disk: 267460 Bytes


    A Lossy + Lossless technique:

    Color-mapped (Palette) Images

To meet transmission, storage or, most often, display restrictions, there is sometimes a need to restrict the number of colors in an image. Since images are usually acquired with a high number of different colors, a color-reduction step is applied. This step is known as color quantization, and several techniques have been proposed.

A color-quantized image is a matrix of indices, where each index i corresponds to a triplet (Ri, Gi, Bi) in the color-map table of the image. Color-quantized images are also known as pseudo-color, color-mapped or palette images.

    Index R G B

    0 28 0 1

    1 19 2 5

    2 34 1 1

    3 39 2 3

    4 44 0 2

    .. .. .. ..

    254 193 211 223

    255 206 212 222


Examples: Graphics Interchange Format (GIF) compressed images

Frymier image
Size: 1,238,678 bytes
GIF: 229,930 bytes

Ben and Jerry
Size: 28,326 bytes
GIF: 4,387 bytes

Yahoo image
Size: 28,110 bytes
GIF: 6,968 bytes


    Lossless Coding

A lossless compression scheme has two components:
1) Modeling
2) Coding

First, I will address coding.

Let A be an alphabet, a collection of distinct symbols.

Let S = s1 s2 ... sn be a sequence over an alphabet A. That is, S is a data string.

Example: Over the English alphabet {a..z, A..Z, space}, "This is an example" is a data string.


Assigning binary sequences to individual alphabet elements is called encoding.

The set of binary sequences resulting from an encoding is called a code, C = {c1, c2, ..., cn}.

An element of a code is called a code-word (i.e., ci ∈ C).

For example, the ASCII code consists of 128 code-words. Each code-word has 7 bits. An 8th bit is appended for parity checking or other control purposes.

Example: A → 1000001, B → 1000010


Fixed-Length Codes:

Example: ASCII codes.

Do we gain by using fixed-length codes? No! Why not?

Instead, use techniques similar to Morse telegraph codes: assign short code-words (fewer bits) to the characters that appear more often, and longer code-words to the characters that appear less frequently.

This is called variable-length coding.


    Question: Why does this work?

Most often, the frequency distribution of the letters in a data string is far from uniform. Example: in English text the most frequently occurring letter is e.

If a data source or string has a uniform distribution, variable-length coding techniques do not help.

For an independent data source S with probabilities of occurrence p(s1), ..., p(sm), the zero-order entropy is

H(S) = -Σi p(si) * log2 p(si)  bps


The entropy of a source yields a lower bound on the encoding cost.

Two well-known variable-length coding techniques are Huffman and Arithmetic Coding. They can code a data string close to or equal to its entropy.

Example: Consider a data string with 64 characters from the alphabet {a, ..., z}:

S = aaaaaaaaaabbbbbbbccegiidffgiiaaabaaaabbccccaaccaaaabbaaaaaaabeee

The zero-order entropy of this string is 2.26 bps. Hence, at best we can code the data in 64 * 2.26 = 144.64 bits.
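As a quick check, here is a minimal Python sketch (an illustration, not part of the original slides) that computes the zero-order entropy of this string; it should print roughly 2.26 bps and a lower bound of about 145 bits.

    from collections import Counter
    from math import log2

    S = ("aaaaaaaaaabbbbbbbccegiidffgiiaaabaaaabb"
         "ccccaaccaaaabbaaaaaaabeee")

    counts = Counter(S)                       # symbol frequencies
    n = len(S)                                # 64 characters
    # zero-order entropy: H(S) = -sum p(s) * log2 p(s)
    H = -sum((c / n) * log2(c / n) for c in counts.values())
    print(f"n = {n}, H(S) = {H:.2f} bps, lower bound = {n * H:.1f} bits")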


Huffman Coding: merge the two least probable characters into one, and repeat this process until only one character remains.

For our string, here is the probability distribution:

Freq  si  P(si)
30    a   0.469
13    b   0.203
8     c   0.125
1     d   0.016
4     e   0.063
4     i   0.063
2     g   0.031
2     f   0.031
64        1.000

After sorting:

si  P(si)
a   0.469
b   0.203
c   0.125
e   0.063
i   0.063
g   0.031
f   0.031
d   0.016


Building the Huffman Tree

Repeatedly merge the two least probable entries:

(f 0.031, d 0.016) → 0.047
(0.047, g 0.031) → 0.078
(e 0.063, i 0.063) → 0.126
(c 0.125, 0.078) → 0.203
(0.203, 0.126) → 0.329
(0.329, b 0.203) → 0.532
(0.532, a 0.469) → 1.000


    Huffman Tree & Code

si  P(si)  Code    Bits
a   0.469  1        30
b   0.203  01       26
c   0.125  0000     32
e   0.063  0010     16
i   0.063  0011     16
g   0.031  00011    10
f   0.031  000100   12
d   0.016  000101    6

Total # of bits: 148  (a rate of 148/64 ≈ 2.31 bps)

Hence, entropy gives us a lower bound on the number of bits needed to encode a data string.
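A minimal Huffman sketch in Python (my illustration, not the slides' code) that rebuilds code lengths for this distribution with a heap; tie-breaking may produce different codewords than the table above, but the total should come out to 148 bits.

    import heapq
    from collections import Counter

    S = ("aaaaaaaaaabbbbbbbccegiidffgiiaaabaaaabb"
         "ccccaaccaaaabbaaaaaaabeee")
    freq = Counter(S)

    # heap items: (weight, tie-breaker, {symbol: code length so far})
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)

    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)        # two least probable subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}   # one level deeper
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1

    lengths = heap[0][2]                       # symbol -> code length
    print(sum(freq[s] * lengths[s] for s in freq))             # 148 bits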


    Can we do better?

Consider attaching integer values to the symbols in S, and then apply the difference operator H:

HS = s1, s2 - s1, s3 - s2, ..., s64 - s63.

Note: S can be recovered easily (by cumulative sums).

However, the frequency distribution of the new sequence HS is:

value      0   1   2  -1  -2   3  -5  -8
frequency 42   9   6   2   1   1   1   1


Using Huffman coding we may encode HS as follows:

0 → 0, 1 → 10, 2 → 110, -1 → 1110, -2 → 111100, 3 → 111101, -8 → 111110, and -5 → 111111.

Total bits: 42*1 + 9*2 + ... + 1*6 + 1*6 + 1*6 = 110

    A rate of r = 110/64 = 1.71875 bps.

The H (difference) operator is an example of a decorrelation step, which is used in modeling the data.
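For illustration (not from the slides), a sketch of the difference transform and its inverse; the slides do not specify the letter-to-integer mapping, so a = 1, b = 2, ... is assumed here, and the exact difference counts may therefore differ slightly from the table above.

    from collections import Counter

    def diff_transform(values):
        # HS = s1, s2 - s1, s3 - s2, ...
        return [values[0]] + [values[k] - values[k - 1] for k in range(1, len(values))]

    def inverse_diff(hs):
        out = [hs[0]]
        for d in hs[1:]:
            out.append(out[-1] + d)            # cumulative sum recovers the data
        return out

    S = ("aaaaaaaaaabbbbbbbccegiidffgiiaaabaaaabb"
         "ccccaaccaaaabbaaaaaaabeee")
    vals = [ord(c) - ord('a') + 1 for c in S]  # assumed mapping a=1, b=2, ...
    hs = diff_transform(vals)
    assert inverse_diff(hs) == vals            # the transform is lossless
    print(Counter(hs))                         # distribution is heavily skewed toward 0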


The aim of the decorrelation step is to remove redundancies from the data. Note that this yields a further gain of approximately 0.60 bps.

Hence, by considering relationships among the data elements, we can obtain better compression.

Research in lossless compression has focused on modeling the data source in order to exploit the correlation among data elements.


    Arithmetic Coding

Suppose we have an alphabet (a, b, c) with probabilities of occurrence (0.7, 0.1, 0.2). Each symbol may be assigned the following range based on its probability:

Sample Symbol Ranges

Symbol  Probability  Range
a       70%          [0.00, 0.70)
b       10%          [0.70, 0.80)
c       20%          [0.80, 1.00)


    Encoding with Arithmetic Coder

The pseudo code below illustrates how additional symbols may be added to an encoded string by restricting the string's range bounds.

lower bound = 0
upper bound = 1

while there are still symbols to encode
    current range = upper bound - lower bound
    upper bound = lower bound + (current range * upper bound of new symbol)
    lower bound = lower bound + (current range * lower bound of new symbol)
end while

Any value between the computed lower and upper probability bounds now encodes the input string.


Example: encoding "abc"

Encode 'a':
current range = 1 - 0 = 1
upper bound = 0 + (1 * 0.70) = 0.70
lower bound = 0 + (1 * 0.00) = 0.00

Encode 'b':
current range = 0.7 - 0.0 = 0.7
upper bound = 0.0 + (0.7 * 0.80) = 0.56
lower bound = 0.0 + (0.7 * 0.70) = 0.49

Encode 'c':
current range = 0.56 - 0.49 = 0.07
upper bound = 0.49 + (0.07 * 1.00) = 0.56
lower bound = 0.49 + (0.07 * 0.80) = 0.546

The string "abc" may be encoded by any value within the probability range [0.546, 0.56), for example 0.55.


Encoding the string "abc" with AC: the interval [0.0, 1.0) narrows to [0.00, 0.70) after 'a', then to [0.49, 0.56) after 'b', and finally to [0.546, 0.56) after 'c'; at each step the current interval is subdivided among a, b and c in proportion to their probabilities.


Decoding Strings

encoded value = encoded input

while string is not fully decoded
    identify the symbol containing the encoded value within its range
    // remove the effects of the symbol from the encoded value
    current range = upper bound of new symbol - lower bound of new symbol
    encoded value = (encoded value - lower bound of new symbol) / current range
end while
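A matching decoder sketch (again an illustration with my own names); since this toy example has no end-of-message symbol, the message length is passed in explicitly. Decoding 0.55 for three symbols should yield "abc".

    from fractions import Fraction as F

    ranges = {'a': (F(0), F('0.7')), 'b': (F('0.7'), F('0.8')), 'c': (F('0.8'), F(1))}

    def ac_decode(value, length):
        out = []
        for _ in range(length):
            # identify the symbol whose range contains the current value
            for sym, (sym_low, sym_high) in ranges.items():
                if sym_low <= value < sym_high:
                    out.append(sym)
                    # remove the effect of this symbol from the value
                    value = (value - sym_low) / (sym_high - sym_low)
                    break
        return "".join(out)

    print(ac_decode(F('0.55'), 3))             # abc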


Decoding 0.55 with AC yields "abc": 0.55 falls in a's range [0.00, 0.70), the rescaled value falls in b's range [0.70, 0.80), and the next rescaled value falls in c's range [0.80, 1.00).


    Universal Lossless Compressors

Dictionary-Based Algorithm (Ziv-Lempel): Encoding

1. Initialize the dictionary to contain all blocks of length one (D = {a, b}).

2. Search for the longest block W which has already appeared in the dictionary.

3. Encode W by its index in the dictionary.

4. Add W followed by the first symbol of the next block to the dictionary.

5. Go to Step 2.


The following example illustrates how the encoding is performed.

Data:   a b b a a b b a a b a b b a a a a b a a b b a
Output: 0 1 1 0 2 4 2 6 5 5 7 3 0

Dictionary

Index  Entry      Index  Entry
0      a          7      b a a
1      b          8      a b a
2      a b        9      a b b a
3      b b        10     a a a
4      b a        11     a a b
5      a a        12     b a a b
6      a b b      13     b b a
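The sketch below (my illustration, not the original code) implements this encoder in Python; on the data above it should reproduce the output 0 1 1 0 2 4 2 6 5 5 7 3 0 and dictionary entries 2 through 13.

    def lz_encode(data, alphabet=("a", "b")):
        dictionary = {sym: i for i, sym in enumerate(alphabet)}      # step 1
        output, i = [], 0
        while i < len(data):
            # step 2: longest block W starting at i that is already in the dictionary
            w = data[i]
            while i + len(w) < len(data) and data[i:i + len(w) + 1] in dictionary:
                w = data[i:i + len(w) + 1]
            output.append(dictionary[w])                             # step 3
            if i + len(w) < len(data):                               # step 4
                dictionary[data[i:i + len(w) + 1]] = len(dictionary)
            i += len(w)
        return output, dictionary

    codes, d = lz_encode("abbaabbaababbaaaabaabba")
    print(codes)   # [0, 1, 1, 0, 2, 4, 2, 6, 5, 5, 7, 3, 0]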


The size of the dictionary can grow infinitely large.

    In practice, the dictionary size is limited. Once thelimit is reached, no more entries are added.

    For example, a dictionary of size 4096. This corresponds to 12 bits per index.

Various implementations of the Ziv-Lempel algorithm exist. Gzip (GNU zip) is free software available on the Internet.


    Burrows-Wheeler Transformation

Let w = [3, 1, 3, 1, 2] be a data string. Construct

     3 1 3 1 2
     1 3 1 2 3
M =  3 1 2 3 1
     1 2 3 1 3
     2 3 1 3 1

by forming the successive rows of M, which are consecutive cyclic left-shifts of w.


By sorting the rows of M lexicographically we transform it to

     1 2 3 1 3
     1 3 1 2 3
M =  2 3 1 3 1
     3 1 2 3 1
     3 1 3 1 2

Let the last column of M be denoted by L.


Note that the original data string w is the 5th row of M.

Given the row index I = 5 of w in M and L = [3, 3, 1, 1, 2], we can recover w. How?

     1 _ _ _ 3
     1 _ _ _ 3
M =  2 _ _ _ 1
     3 _ _ _ 1
     3 _ _ _ 2

Note that the first column of M is simply L sorted; repeatedly prepending L and re-sorting reconstructs M, and row I then gives back w.
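A naive Python sketch of the forward and inverse transform (illustrative only; practical implementations use suffix sorting instead of building the full matrix). On w = [3, 1, 3, 1, 2] it should give L = [3, 3, 1, 1, 2] and I = 5, and then recover w.

    def bwt(w):
        n = len(w)
        rotations = sorted(w[i:] + w[:i] for i in range(n))   # sorted cyclic shifts = rows of M
        L = [row[-1] for row in rotations]                    # last column
        I = rotations.index(list(w)) + 1                      # 1-based row index of w in M
        return L, I

    def inverse_bwt(L, I):
        n = len(L)
        table = [[] for _ in range(n)]
        for _ in range(n):                                    # prepend L and re-sort, n times
            table = sorted([L[i]] + table[i] for i in range(n))
        return table[I - 1]                                   # row I of M is the original string

    L, I = bwt([3, 1, 3, 1, 2])
    print(L, I)                   # [3, 3, 1, 1, 2] 5
    print(inverse_bwt(L, I))      # [3, 1, 3, 1, 2]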


    Is the transformation enough?

    Of course not!

The transformation collects similar elements nearby.

To achieve better compression we need to use some other technique, like the H (difference) operator. In this case we can use Move-to-Front (Recency Rank) coding, or Inversion coding/transformation.


    Move To Front Coding

    (Recency Ranking)

Introduced by Bentley et al. (1986), and independently discovered by Elias (1987).

Move-to-Front coding is an adaptive technique which is used when the data have locality of reference.

When an MTF coder is implemented for an 8-bit data string, the list is initialized to the identity permutation of the set {0, ..., 255}.


Example: Let {a, b, c, d} be our alphabet.

Let S = bbbaaaddddccc be our data string. The MTF encoding is performed as follows:

Input:     b     b     b     a     a     a     d    ...
Positions: 0123  0123  0123  0123  0123  0123  0123 ...
List:      abcd  bacd  bacd  bacd  abcd  abcd  abcd ...
Output:    1     0     0     1     0     0     3    ...

Output: 1001003000300


MTF decoding of 1001003000300 is done as follows:

Input:     1     0     0     1     0     0     3    ...
Positions: 0123  0123  0123  0123  0123  0123  0123 ...
List:      abcd  bacd  bacd  bacd  abcd  abcd  abcd ...
Output:    b     b     b     a     a     a     d    ...

Output: bbbaaaddddccc

Why do we use MTF? If the data have locality of reference, the MTF-transformed data yield a better (more skewed) distribution for encoding.
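A compact Python sketch of MTF encoding and decoding (my illustration); on S = bbbaaaddddccc with alphabet a, b, c, d it should reproduce the output 1 0 0 1 0 0 3 0 0 0 3 0 0.

    def mtf_encode(s, alphabet):
        lst = list(alphabet)                   # starts as the identity permutation
        out = []
        for ch in s:
            i = lst.index(ch)                  # rank = current position of the symbol
            out.append(i)
            lst.insert(0, lst.pop(i))          # move the symbol to the front
        return out

    def mtf_decode(codes, alphabet):
        lst = list(alphabet)
        out = []
        for i in codes:
            ch = lst[i]
            out.append(ch)
            lst.insert(0, lst.pop(i))
        return "".join(out)

    codes = mtf_encode("bbbaaaddddccc", "abcd")
    print(codes)                               # [1, 0, 0, 1, 0, 0, 3, 0, 0, 0, 3, 0, 0]
    print(mtf_decode(codes, "abcd"))           # bbbaaaddddccc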


Linear Index Permutation (LIT)

For example, let w = [1, 3, 1, 3, 2].

Index:            1 2 3 4 5
w:                1 3 1 3 2
l:                1 4 2 5 3
l⁻¹:              1 3 5 2 4
w sorted by l⁻¹:  1 1 2 3 3

Note that the inverse permutation of the LIT l is l⁻¹ = [1, 3, 5, 2, 4]. l⁻¹ is called the Canonical Sorting Permutation of w.

Also, the elements of w are sorted in non-decreasing order by l⁻¹, and the sorted sequence consists of m blocks of different sizes (one block per distinct symbol).

Sorted data can be encoded cheaply.
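A small sketch (my illustration) of how the canonical sorting permutation can be obtained as a stable argsort; for w = [1, 3, 1, 3, 2] it should give l⁻¹ = [1, 3, 5, 2, 4] and l = [1, 4, 2, 5, 3] (1-based).

    def canonical_sorting_permutation(w):
        # stable argsort: positions of w listed in non-decreasing order of their values
        l_inv = sorted(range(1, len(w) + 1), key=lambda i: w[i - 1])
        l = [0] * len(w)
        for pos, idx in enumerate(l_inv, start=1):   # invert the permutation
            l[idx - 1] = pos
        return l, l_inv

    w = [1, 3, 1, 3, 2]
    l, l_inv = canonical_sorting_permutation(w)
    print(l_inv)                      # [1, 3, 5, 2, 4]
    print(l)                          # [1, 4, 2, 5, 3]
    print([w[i - 1] for i in l_inv])  # [1, 1, 2, 3, 3]  (w in non-decreasing order)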


Hence, the problem is to encode the canonical sorting permutation.

Interval ranking: except for the first appearance of each element in the data string, every occurrence is assigned a rank, which is simply the count of the elements between two successive occurrences of the same element.

Let H be the difference operator on a sequence; then it is easy to prove that, in terms of first-order entropy,

H(H l⁻¹) ≤ H(Interval Rank)

(here the outer H denotes entropy and the inner H the difference operator).


    Inversion Coding

Let π = [π1, π2, ..., πn] be an arbitrary permutation of an n-set S of positive integers. The Left Bigger (LB) inversion vector associated with π is the sequence [I1, I2, ..., In] of non-negative integers defined as follows:

Ik = |{ j : 1 ≤ j < k and πj > πk }|
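An illustrative sketch under that reading of the definition (the slide text is cut off here, so the standard count-of-larger-elements-to-the-left interpretation is assumed); the input permutation is just a demo value.

    def lb_inversion_vector(perm):
        # I_k = number of positions j < k whose value is bigger than perm[k]
        return [sum(1 for j in range(k) if perm[j] > perm[k]) for k in range(len(perm))]

    print(lb_inversion_vector([1, 4, 2, 5, 3]))   # [0, 0, 1, 0, 2]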


When the H operator (difference operator) is applied to an inversion vector, all of the values except m-1 of them (recall that there are m blocks) will be positive (or all negative).

The value |It - It-1 - 1| counts how many bigger (or smaller) elements occur between the previous and the most recent occurrence of a symbol from the alphabet.

Hence, the decorrelation of the inversion-vector elements yields a value which is called the inversion rank (distance).


Elias proved that Recency Ranking (MTF) yields better compression than Interval Ranking.

It is easy to prove that Inversion Ranking yields better compression than Interval Ranking. While it is theoretically hard to relate MTF and Inversion Coding, simulation results have shown that inversion coding yields better compression than MTF coding.


The Bzip2 compression scheme utilizes:
1) the BWT transformation,
2) an MTF coder,
3) variable-length coding (a Huffman coder).

Currently, Bzip2 is one of the best universal compression schemes.

My contributions in this area:
Theoretical settings of the BWT (1997).
A new and faster transformation than the BWT, the Linear Order Transformation (1999).
Inversion coding for large data files (2004).

BWIC is available from www.cs.fredonia.edu/arnavut/research.html

It yields better compression than bzip2 on several different kinds of data files, for example large text files, pseudo-color images, audio files and images.


Results in bps using an arithmetic coder:

Data File   Size      MTF    IC     BSC    BSWIC
Bib         111261    5.94   5.68   2.11   2.17
Book1       768768    5.12   4.84   2.61   2.52
Book2       610856    5.24   4.95   2.22   2.19
Geo         102400    6.03   6.16   4.83   4.97
News        377109    5.55   5.32   2.65   2.70
Obj1        21504     6.06   5.70   4.02   4.30
Obj2        246814    6.15   6.09   2.58   2.77
Paper1      53161     5.46   5.42   2.65   2.74
Paper2      82199     5.23   5.13   2.61   2.65
Pic         513216    1.09   1.03   0.84   0.81
Progc       39611     5.59   5.67   2.67   2.82
Progl       71646     4.93   4.96   1.88   1.91
Progp       49379     5.12   5.30   1.86   1.96
Trans       93695     5.55   5.49   1.63   1.77
Bible       4047392   5.04   4.56   1.71   1.62
Calag.tar   3276813   4.75   4.45   2.44   2.28
E.coli      4638690   2.25   2.14   2.21   2.10
World192    2473400   5.34   5.03   1.49   1.47
Avg.                  5.02   4.88   2.39   2.43
W. Avg.               4.23   3.96   2.04   1.96


    THANK YOU!

    Questions?
