Multimedia – Data Compression
Dr. Lina A. Nimri
Lebanese University
Faculty of Economic Sciences and Business Administration, 1st branch
Why compress data?
• Nowadays, the computing power of processors grows more quickly than storage capacities, and much more quickly than network bandwidth, because increasing bandwidth requires enormous changes to the telecommunication infrastructure.
• Thus, to compensate, it is usual to reduce the size of the data by exploiting the computing power of processors, rather than by increasing storage and transmission capacities.
What is data compression?
• Compression consists in reducing the physical size of blocks of information.
• A compressor uses an algorithm that optimizes the data, using considerations suited to the type of data being compressed.
• A decompressor is thus necessary to reconstruct the original data, using an algorithm that inverts the one used for compression.
• The compression method depends essentially on the type of data to be compressed: an image is not compressed in the same way as an audio file.
What is data compression?
• A compression that does not lead to loss of information is lossless.
• A compression that leads to loss of information is lossy.
• Compression can be characterized by the compression factor,
▫ that is, the number of bits in the compressed image divided by the number of bits in the original image.
• The compression ratio, which is often used,
▫ is the inverse of the compression factor;
▫ it is usually expressed as a percentage.
• Finally, the compression gain,
▫ also expressed as a percentage,
▫ is equal to 1 minus the compression factor.
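These three definitions can be checked with a minimal Python sketch (the function name and the example sizes, 19 bytes compressed to 5 bytes, are our own illustration):

```python
def compression_metrics(original_bits, compressed_bits):
    factor = compressed_bits / original_bits  # compressed / original
    ratio = original_bits / compressed_bits   # inverse of the factor
    gain = 1 - factor                         # usually given as a percentage
    return factor, ratio, gain

factor, ratio, gain = compression_metrics(19 * 8, 5 * 8)
print(f"factor = {factor:.3f}, ratio = {ratio:.2f}, gain = {gain:.1%}")
# factor = 0.263, ratio = 3.80, gain = 73.7%
```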
Types of compression and methods
• Physical and logical compression
▫ Physical compression acts directly on the data: it exploits the redundancy from one bit pattern to the next, without interpreting the information.
▫ Logical compression, on the other hand, is carried out by logical reasoning, substituting the information with equivalent information.
Types of compression and methods
• Symmetric and asymmetric compression
▫ In symmetric compression, the same method is used to compress and to decompress the data. The same amount of work is thus needed for each of these operations. This type of compression is generally used in data transmission.
▫ Asymmetric compression requires more work for one of the two operations. It is usual to seek algorithms for which compression is slower than decompression. Algorithms for which compression is faster than decompression may instead be needed for data files that are seldom accessed (for security reasons, for example), because they create compact files quickly.
Lossy compression
• Lossy compression, as opposed to lossless compression, eliminates some information in order to achieve the best possible compression ratio, while keeping the result as close as possible to the original data.
▫ This is the case, for example, of certain image or sound compressions, such as the MP3 format.
▫ Since this type of compression removes information contained in the data being compressed, such methods are called irreversible.
▫ Executable files, for example, cannot be compressed with this method, because they need to preserve their integrity in order to run: it is not conceivable to roughly reconstruct a program by omitting some bits and adding others.
▫ On the other hand, multimedia data (audio, video) can tolerate a certain level of degradation without the sensory organs (eye, eardrum, etc.) perceiving any significant loss.
Adaptive, semi-adaptive and non-adaptive encoding
• Certain compression algorithms are based on dictionaries built for a specific type of data: these are non-adaptive encoders.
▫ The frequency of letters in a text file, for example, depends on the language in which it is written.
• An adaptive encoder adapts to the data it has to compress;
▫ it does not start out with a dictionary already prepared for a given type of data.
• A semi-adaptive encoder builds a dictionary according to the data to be compressed:
▫ it builds the dictionary in a first pass through the file, then compresses the file in a second pass.
RLE Compression
• The RLE compression method (Run-Length Encoding, sometimes written RLC for Run-Length Coding)
▫ is used by many image formats (BMP, PCX, TIFF);
▫ is based on the repetition of consecutive elements.
RLE Compression: basics
• The basic principle consists in coding each run of elements by giving the number of repetitions of a value, followed by the value to be repeated.
▫ Thus, according to this principle, the string "AAAAAHHHHHHHHHHHHHH" (19 characters) compresses to "5A14H" (5 characters). The compression gain is thus (19 - 5)/19, that is, approximately 73.7%.
▫ On the other hand, for the string "CORRECTLY", where there is little character repetition, the result of the compression is "1C1O2R1E1C1T1L1Y". Compression proves very expensive here, with a negative compression gain of (9 - 16)/9, that is, about -78%!
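This basic principle fits in a few lines of Python (`rle_encode` is an illustrative name; note that digits in the data would make this textual form ambiguous, which is why real formats use binary counts instead):

```python
def rle_encode(s):
    """Basic run-length encoding: each run becomes <count><value>."""
    out = []
    i = 0
    while i < len(s):
        run = 1                       # measure the run starting at i
        while i + run < len(s) and s[i + run] == s[i]:
            run += 1
        out.append(f"{run}{s[i]}")    # emit count, then the repeated value
        i += run
    return "".join(out)

print(rle_encode("AAAAAHHHHHHHHHHHHHH"))  # -> 5A14H
print(rle_encode("CORRECTLY"))            # -> 1C1O2R1E1C1T1L1Y
```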
RLE Compression: compression rules
• In practice, RLE compression is governed by particular rules which allow compression to be carried out when it pays off, and the string to be left as it is when compression would be wasteful. These rules are the following:
▫ If three or more elements are repeated consecutively, the RLE compression method is used.
▫ If not, a control character (00) is inserted, followed by the number of elements of the non-compressed string and then the string itself.
▪ If the number of elements of the string is odd, the control character (00) is added at the end.
▫ Finally, specific control characters are defined in order to code:
▪ an end of line (00 01);
▪ the end of the image (00 00);
▪ a pointer displacement over the image of XX columns and YY rows in the reading direction (00 02 XX YY).
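These escape rules can be sketched in Python for a single row of byte values (a hedged sketch only: the end-of-line, end-of-image and displacement markers are omitted, the 255-element limit on literal chains is not enforced, and the function name is our own):

```python
def rle8_encode_row(row):
    """Sketch of the rules above: runs of >= 3 identical bytes become
    (count, value); everything else is collected into a literal chain
    introduced by the control byte 00, padded with 00 if odd-length."""
    out, literal = [], []

    def flush_literal():
        if len(literal) >= 3:
            out.extend([0, len(literal)] + literal)
            if len(literal) % 2:       # odd-length chain: pad with 00
                out.append(0)
        else:                          # too short for a literal chain:
            for b in literal:          # fall back to (1, value) pairs
                out.extend([1, b])
        literal.clear()

    i = 0
    while i < len(row):
        run = 1
        while i + run < len(row) and row[i + run] == row[i] and run < 255:
            run += 1
        if run >= 3:
            flush_literal()
            out.extend([run, row[i]])  # compressed run: count, value
        else:
            literal.extend(row[i:i + run])
        i += run
    flush_literal()
    return out

print(rle8_encode_row([5, 5, 5, 5, 1, 2, 3, 4, 7, 7, 7]))
# -> [4, 5, 0, 4, 1, 2, 3, 4, 3, 7]
```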
RLE Compression: compression rules
• Thus, RLE compression only makes sense for data containing many consecutive repeated elements,
▫ in particular images with large uniform areas.
• This method nevertheless has the advantage of being easy to implement.
▫ There are alternatives in which the image is encoded by blocks of pixels, in rows, or even in zigzag.
Huffman coding
• In 1952, David Huffman proposed a statistical method
▫ that assigns a binary code word to each of the symbols to be compressed (pixels or characters, for example).
• The length of the code words is not the same for all symbols: the most frequent symbols (those which appear most often) are coded with short code words, while the rarest symbols receive longer binary codes. The expression Variable-Length Code (VLC) is used for this type of coding; since no code word is the prefix of another, decoding is unambiguous.
• Thus, the final sequence of variable-length code words is on average shorter than the one obtained with a constant-length coding.
Huffman coding: algorithm
A bottom-up approach:
1. Initialization: put all symbols on a list sorted according to their frequency counts.
2. Repeat until the list has only one symbol left:
 (1) From the list, pick the two symbols with the lowest frequency counts. Form a Huffman subtree that has these two symbols as child nodes, and create a parent node.
 (2) Assign the sum of the children's frequency counts to the parent, and insert it into the list such that the order is maintained.
 (3) Delete the children from the list.
3. Assign a code word to each leaf based on the path from the root.
Huffman coding
• The Huffman coder creates an ordered tree from all the symbols and their frequencies of appearance.
▫ The branches are built recursively, starting with the least frequent symbols.
• Example: the word "wikipedia"
• The frequencies of the letters:

 i: 3, a: 1, d: 1, e: 1, k: 1, p: 1, w: 1

• In binary: 101 11 011 11 100 010 001 11 000, i.e. 24 bits instead of the 72 needed with ASCII codes.
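The bottom-up algorithm above can be sketched in Python, using a binary heap for the sorted list (`huffman_code` is our own illustrative function; the exact 0/1 assignment may differ from the slides, but the code lengths, and hence the 24-bit total for "wikipedia", are the same):

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Bottom-up Huffman construction: repeatedly merge the two
    lowest-frequency subtrees. Returns a dict symbol -> code word."""
    tie = count()  # tie-breaker so heapq never compares the dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two lowest counts
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}   # left branch
        merged.update({s: "1" + c for s, c in right.items()})  # right
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

freqs = {"i": 3, "a": 1, "d": 1, "e": 1, "k": 1, "p": 1, "w": 1}
code = huffman_code(freqs)
total = sum(freqs[s] * len(code[s]) for s in freqs)
print(total)  # 24 bits for "wikipedia", versus 9 * 8 = 72 in ASCII
```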
Huffman coding
• Consider the following sentence: "COMMENT_CA_MARCHE". The letter frequencies are:

 M: 3, C: 3, A: 2, E: 2, _: 2, O: 1, N: 1, T: 1, R: 1, H: 1

• This is the corresponding tree: (figure)
Huffman coding
• Example: Huffman tree for the sentence "this is an example of a Huffman tree" (figure).
Huffman coding
• Coding tree for "HELLO" using the Huffman algorithm (figure).
Huffman coding: variants
• There are three variants of the Huffman algorithm, each of which defines a method for building the tree.
• Static: each byte has a code predefined by the software.
▫ The tree does not need to be transmitted for decompression.
▫ The compression is not well suited to all types of files (e.g., in a French text the frequency of occurrence of "e" is very high, so this letter should have a very short code).
Static Huffman Coding
• Uses codes predefined according to the frequencies of letters in a given language.
• Example:

 Letter:     Z   K   M   C   U   D   L    E
 Frequency:  2   7  24  32  37  42  42  120
Static Huffman Coding
• Tree construction: the letters above are the leaves; the two lowest-frequency subtrees are merged repeatedly until a single tree remains (figures).
 Letter  Freq  Code    Bits
 C         32  1110    4
 D         42  101     3
 E        120  0       1
 K          7  111101  6
 L         42  110     3
 M         24  11111   5
 U         37  100     3
 Z          2  111100  6
Decode 1011001110111101
• Coding "DEED" gives: 101 0 0 101.
• Decoding 1011001110111101 gives: 101 = D, 100 = U, 1110 = C, 111101 = K, i.e. "DUCK".
• Cost in bits per letter: 2.57
 = (1 * 120 + 3 * 121 + 4 * 32 + 5 * 24 + 6 * 9) / 306
 = 785 / 306 = 2.57
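The arithmetic above can be verified directly from the frequency/length table (a small Python check; the table literal is our own transcription):

```python
# (frequency, code length in bits) for each letter of the static table
table = {"C": (32, 4), "D": (42, 3), "E": (120, 1), "K": (7, 6),
         "L": (42, 3), "M": (24, 5), "U": (37, 3), "Z": (2, 6)}

total_bits = sum(f * bits for f, bits in table.values())  # weighted sum
total_freq = sum(f for f, _ in table.values())            # total letters
print(total_bits, total_freq, round(total_bits / total_freq, 2))
# 785 306 2.57
```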
Huffman coding: variants
• Semi-adaptive:
▫ The file is first read to count the occurrences of each byte.
▫ Then, the tree is constructed according to the weight of each byte.
▫ This tree remains the same until the end of the compression.
▫ The tree must be transmitted in order to decompress the file.
• Adaptive:
▫ This method provides the best compression rates, because the tree is constructed dynamically as the stream is compressed.
▫ Its main disadvantage is that the tree must be modified continuously, which makes compression rather slow.
▫ On the other hand, the compression is always optimal: the file type does not need to be known before compression, and the file does not need to be read beforehand.
▫ There is no need to transmit or store the table of symbol frequencies.
Adaptive Huffman Coding
• Statistics are gathered and updated dynamically as the data stream arrives.
Adaptive Huffman Coding
• Initial code: assigns symbols some initially agreed codes, without any prior knowledge of the frequency counts.
• Update tree: constructs an adaptive Huffman tree. It basically does two things:
▫ (a) increments the frequency counts for the symbols (including any new ones);
▫ (b) updates the configuration of the tree.
• The encoder and decoder must use exactly the same initial code and the same update-tree routine.
Adaptive Huffman Coding
• Nodes are numbered in order from left to right, bottom to top. The numbers in parentheses indicate the counts.
• The tree must always maintain its sibling property, i.e., all nodes (internal and leaf) are arranged in order of increasing counts.
▫ If the sibling property is about to be violated, a swap procedure is invoked to update the tree by rearranging the nodes.
• When a swap is necessary, the farthest node with count N is swapped with the node whose count has just been increased to N + 1.
Adaptive Huffman Coding: example
• This example illustrates the implementation details more concretely: we show exactly which bits are sent, as opposed to simply stating how the tree is updated.
• An additional rule: when any character/symbol is sent for the first time, it must be preceded by a special symbol, NEW.
▫ The initial code for NEW is 0. The count for NEW is always kept at 0 (the count is never increased).
• Initial code assignment for AADCCDD using adaptive Huffman coding.
Tree Construction
• Tree construction for the input AADCCDD (figures).
Adaptive Huffman Coding
• Sequence of symbols and codes sent to the decoder: (table)
• It is important to emphasize that the code for a particular symbol changes during the adaptive Huffman coding process.
▫ For example, after AADCCDD, when the character D overtakes A as the most frequent symbol, its code changes from 101 to 0.
Huffman Coding: properties
• Unique prefix property: no Huffman code is a prefix of any other Huffman code.
▫ This precludes any ambiguity in decoding.
• Optimality: Huffman coding is a minimum-redundancy code, proved optimal for a given data model (i.e., a given, accurate probability distribution):
▫ The two least frequent symbols have Huffman codes of the same length, differing only in the last bit.
▫ Symbols that occur more frequently have shorter Huffman codes than symbols that occur less frequently.
Huffman Coding: usage
• The coding is independent of the type of data being compressed.
• It simply codes a sequence of bits in the most compact form.
▫ Many compressors use this coding as a second stage, re-compressing with Huffman what was already compressed using another technique.
▫ This is the case for JPEG, MPEG, gzip and WinZip.
Dictionary-based Coding
• History
▫ Abraham Lempel and Jacob Ziv created the LZ77 compressor in 1977. This compressor was then used for archiving (the ZIP, ARJ and LHA formats use it).
▫ In 1978 they created the LZ78 compressor, suited to compressing images (or any binary file).
▫ In 1984, Terry Welch of Unisys modified it for use in hard-drive controllers; the initial of his surname was added to the LZ abbreviation, yielding LZW.
• LZW is a very fast algorithm, both for compression and for decompression, based on the multiple occurrences of character sequences in the string to be encoded. Its principle consists in substituting patterns with an index code, progressively building a dictionary.
Dictionary-based Coding
• LZW works on bits and not on bytes, so it does not depend on the way the processor codes information.
▫ It is one of the most popular algorithms and is used in particular in the TIFF and GIF formats.
▫ PNG images, by contrast, use the patent-free LZ77 algorithm rather than LZW (which was patented by Unisys).
• LZW uses fixed-length codes to represent variable-length strings of symbols/characters that commonly occur together, e.g., words in English text.
• The LZW encoder and decoder build up the same dictionary dynamically while receiving the data.
• LZW places longer and longer repeated entries into the dictionary, and then emits the code for an element, rather than the string itself, once the element has been placed in the dictionary.
LZW Compression: algorithm
Construction of the dictionary:
• The dictionary is initialized with the 256 values of the ASCII table. The file to be compressed is split into strings of bytes; each of these strings is compared with the dictionary, and is added to it if not found there.

BEGIN
  s = next input character;
  while not EOF
  {
    c = next input character;
    if s + c exists in the dictionary
      s = s + c;
    else
    {
      output the code for s;
      add string s + c to the dictionary with a new code;
      s = c;
    }
  }
  output the code for s;
END
LZW compression for the string "ABABBABCABABBA"
• Let's start with a very simple dictionary (also referred to as a "string table"), initially containing only 3 characters, with codes as follows:

 code  string
 ----  ------
 1     A
 2     B
 3     C

• Now, if the input string is "ABABBABCABABBA", the LZW compression algorithm works as follows:
LZW compression for the string "ABABBABCABABBA"

 s    c    output  code  string
 ------------------------------
                   1     A
                   2     B
                   3     C
 ------------------------------
 A    B    1       4     AB
 B    A    2       5     BA
 A    B
 AB   B    4       6     ABB
 B    A
 BA   B    5       7     BAB
 B    C    2       8     BC
 C    A    3       9     CA
 A    B
 AB   A    4       10    ABA
 A    B
 AB   B
 ABB  A    6       11    ABBA
 A    EOF  1

• The output codes are: 1 2 4 5 2 3 4 6 1. Instead of sending 14 characters, only 9 codes need to be sent (compression ratio = 14/9 = 1.56).
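The trace above can be reproduced with a short Python version of the algorithm (`lzw_encode` is our own name; we start from the 3-symbol dictionary of the example rather than the full 256-entry ASCII table):

```python
def lzw_encode(text, alphabet):
    """LZW compression following the pseudocode above."""
    dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}
    next_code = len(dictionary) + 1
    s, out = "", []
    for c in text:
        if s + c in dictionary:        # known string: keep extending
            s = s + c
        else:
            out.append(dictionary[s])  # emit code for the known prefix
            dictionary[s + c] = next_code  # learn the new string
            next_code += 1
            s = c
    out.append(dictionary[s])          # flush the final string
    return out

print(lzw_encode("ABABBABCABABBA", "ABC"))
# -> [1, 2, 4, 5, 2, 3, 4, 6, 1]
```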
LZW Decompression: algorithm (simple version)
• During decompression, the algorithm rebuilds the dictionary in the opposite direction; the dictionary thus does not need to be stored or transmitted.

BEGIN
  s = NIL;
  while not EOF
  {
    k = next input code;
    entry = dictionary entry for k;
    output entry;
    if (s != NIL)
      add string s + entry[0] to the dictionary with a new code;
    s = entry;
  }
END
LZW decompression for the string "ABABBABCABABBA"
• The input codes to the decoder are 1 2 4 5 2 3 4 6 1.
• The initial string table is identical to the one used by the encoder.

 s    k    entry/output  code  string
 ------------------------------------
                         1     A
                         2     B
                         3     C
 ------------------------------------
 NIL  1    A
 A    2    B             4     AB
 B    4    AB            5     BA
 AB   5    BA            6     ABB
 BA   2    B             7     BAB
 B    3    C             8     BC
 C    4    AB            9     CA
 AB   6    ABB           10    ABA
 ABB  1    A             11    ABBA
 A    EOF

• As expected, the output string is "ABABBABCABABBA": a truly lossless result!
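The decoder side, in the same sketch style (this simple version, like the pseudocode above, assumes every received code is already in the dictionary, which holds for this example; the general algorithm needs one extra case for a code that has not yet been entered):

```python
def lzw_decode(codes, alphabet):
    """Simple LZW decompression: rebuild the encoder's dictionary
    one entry behind, using the first character of each new entry."""
    dictionary = {i + 1: ch for i, ch in enumerate(alphabet)}
    next_code = len(dictionary) + 1
    s, out = None, []
    for k in codes:
        entry = dictionary[k]
        out.append(entry)
        if s is not None:
            dictionary[next_code] = s + entry[0]  # encoder's new entry
            next_code += 1
        s = entry
    return "".join(out)

print(lzw_decode([1, 2, 4, 5, 2, 3, 4, 6, 1], "ABC"))
# -> ABABBABCABABBA
```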