
Introduction for Data Compression

Page 1: Introduction for Data Compression

MCS10 204(A) Data Compression

Chapter: Introduction

Manish T I, Associate Professor, Department of CSE, MET’s School of Engineering, Mala. E-mail: [email protected]

Page 2: Introduction for Data Compression

Definition

• Data compression is the process of converting an input data stream (the source stream, or the original raw data) into another data stream (the output, or compressed, stream) that has a smaller size. A stream is either a file or a buffer in memory.
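A minimal illustration of this definition (not from the slides; it uses Python's standard zlib module): the input stream here is a buffer in memory, and the compressed output stream is smaller.

```python
import zlib

# Input stream: raw, unencoded data held in a memory buffer.
raw = b"abracadabra " * 1000

# Output stream: the same information in compressed (smaller) form.
packed = zlib.compress(raw)

print(len(raw), "->", len(packed))       # the compressed stream is much smaller
assert zlib.decompress(packed) == raw    # and the original data can be recovered exactly
```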

Page 3: Introduction for Data Compression

Data compression is popular for two reasons:

(1) People like to accumulate data and hate to throw anything away.

(2) People hate to wait a long time for data transfers.

Page 4: Introduction for Data Compression

Data compression is often called source coding.

The input symbols are emitted by a certain information source and have to be coded before being sent to their destination. The source can be memoryless, or it can have memory.

In the former case, each symbol is independent of its predecessors. In the latter case, each symbol depends on some of its predecessors and, perhaps, also on its successors, so they are correlated.

A memoryless source is also termed “independent and identically distributed,” or IID.
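To make the distinction concrete, the sketch below (an illustration, not taken from the slides) estimates two entropies from symbol counts: the order-0 entropy, which treats the data as memoryless, and the order-1 conditional entropy, which lets each symbol depend on its predecessor. For correlated data such as English text, the conditional entropy is noticeably lower.

```python
import math
from collections import Counter

def order0_entropy(data):
    """Bits per symbol when the source is assumed memoryless (IID)."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def order1_entropy(data):
    """Conditional entropy H(X_i | X_{i-1}), estimated from digram counts."""
    pairs = Counter(zip(data, data[1:]))
    prefixes = Counter(data[:-1])
    n = len(data) - 1
    return -sum(c / n * math.log2(c / prefixes[a]) for (a, b), c in pairs.items())

text = "the quick brown fox jumps over the lazy dog " * 50
print(order0_entropy(text))   # higher: ignores correlation between neighbouring symbols
print(order1_entropy(text))   # lower: symbols in text depend on their predecessors
```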

Page 5: Introduction for Data Compression

• The compressor or encoder is the program that compresses the raw data in the input stream and creates an output stream with compressed (low-redundancy) data.

• The decompressor or decoder converts in the opposite direction.

• The term codec is sometimes used to describe both the encoder and decoder.

• “Stream” is a more general term because the compressed data may be transmitted directly to the decoder, instead of being written to a file and saved.

Page 6: Introduction for Data Compression

• A non-adaptive compression method is rigid and does not modify its operations, its parameters, or its tables in response to the particular data being compressed.

• In contrast, an adaptive method examines the raw data and modifies its operations and/or its parameters accordingly.

• In a 2-pass algorithm, the first pass reads the input stream to collect statistics on the data to be compressed, and the second pass does the actual compressing using parameters set by the first pass. Such a method may be called semi-adaptive.

• A data compression method can also be locally adaptive, meaning it adapts itself to local conditions in the input stream and varies this adaptation as it moves from area to area in the input.
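A rough sketch of the adaptive idea (my own illustration, not a specific algorithm from the slides): the model below updates its symbol counts after every symbol it processes, so its probability estimates follow the data; a non-adaptive method would instead use a fixed table throughout.

```python
from collections import Counter

class AdaptiveModel:
    """Order-0 adaptive model over bytes: its counts, and hence its probability
    estimates, are updated as the input is processed."""

    def __init__(self, alphabet_size=256):
        self.counts = Counter()
        self.total = 0
        self.alphabet_size = alphabet_size

    def probability(self, symbol):
        # Laplace-smoothed estimate based only on the symbols seen so far.
        return (self.counts[symbol] + 1) / (self.total + self.alphabet_size)

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1

model = AdaptiveModel()
for byte in b"mississippi":
    p = model.probability(byte)  # an encoder would code `byte` using this estimate,
    model.update(byte)           # then encoder and decoder both update their models
```

A semi-adaptive (2-pass) method would instead compute all the counts in a first pass and keep them fixed during the second, compressing pass.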

Page 7: Introduction for Data Compression

• For the original input stream, we use the terms unencoded, raw, or original data.

• The contents of the final, compressed stream are considered the encoded or compressed data.

• The term bit stream is also used in the literature to indicate the compressed stream.

• Lossy/lossless compression

• Cascaded compression: The difference between lossless and lossy codecs can be illuminated by considering a cascade of compressions.

Page 8: Introduction for Data Compression

• Perceptive compression: A lossy encoder must take advantage of the special type of data being compressed. It should delete only data whose absence would not be detected by our senses.

• It employs algorithms based on our understanding of psychoacoustic and psychovisual perception, so it is often referred to as a perceptive encoder.

• Symmetrical compression is the case where the compressor and decompressor use basically the same algorithm but work in “opposite” directions. Such a method makes sense for general work, where the same number of files are compressed as are decompressed.

Page 9: Introduction for Data Compression

• A data compression method is called universal if the compressor and decompressor do not know the statistics of the input stream. A universal method is optimal if the compressor can produce compression factors that asymptotically approach the entropy of the input stream for long inputs.

• The term file differencing refers to any method that locates and compresses the differences between two files. Imagine a file A with two copies that are kept by two users. When a copy is updated by one user, it should be sent to the other user, to keep the two copies identical (a sketch follows this list).

• Most compression methods operate in the streaming mode, where the codec inputs a byte or several bytes, processes them, and continues until an end-of-file is sensed.
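File differencing can be sketched with the standard-library difflib (the file names below are placeholders): only the differences between the two copies are compressed and transmitted, not the whole updated file.

```python
import difflib
import zlib

# Hypothetical file names: two versions of the same document kept by two users.
with open("report_v1.txt") as f:
    old_lines = f.readlines()
with open("report_v2.txt") as f:
    new_lines = f.readlines()

# Locate the differences between the two copies ...
delta = "".join(difflib.unified_diff(old_lines, new_lines,
                                     fromfile="report_v1.txt",
                                     tofile="report_v2.txt"))

# ... and compress only the differences before sending them to the other user.
patch = zlib.compress(delta.encode("utf-8"))
print(len(patch), "bytes to transmit instead of the whole updated file")
```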

Page 10: Introduction for Data Compression

• In the block mode, the input stream is read block by block and each block is encoded separately. The block size in this case should be a user-controlled parameter, since its size may greatly affect the performance of the method (see the sketch after this list).

• Most compression methods are physical. They look only at the bits in the input stream and ignore the meaning of the data items in the input. Such a method translates one bit stream into another, shorter, one. The only way to make sense of the output stream (to decode it) is by knowing how it was encoded.

• Some compression methods are logical. They look at individual data items in the source stream and replace common items with short codes. Such a method is normally special purpose and can be used successfully on certain types of data only.
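A minimal sketch of block-mode operation (the file name and block size are placeholders): the input is read block by block and each block is encoded on its own, so the block size is an explicit, user-controlled parameter.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # user-controlled parameter; it may greatly affect performance

def compress_in_blocks(path, block_size=BLOCK_SIZE):
    """Read the input stream block by block and encode each block separately."""
    blocks = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:        # end of file sensed
                break
            blocks.append(zlib.compress(block))
    return blocks

# compressed_blocks = compress_in_blocks("input.dat")   # "input.dat" is a placeholder
```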

Page 11: Introduction for Data Compression

Compression performance

The compression ratio is defined as

Compression ratio = size of the output stream / size of the input stream

A value of 0.6 means that the data occupies 60% of its original size after compression.

Values greater than 1 mean an output stream bigger than the input stream (negative compression).

The compression ratio can also be called bpb (bits per bit), since it equals the number of bits needed, on average, in the compressed stream to represent one bit of the input stream. Related measures are bpp (bits per pixel), used for images, and bpc (bits per character), used for text.

The term bitrate (or “bit rate”) is a general term for bpb and bpc.
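A worked example with illustrative numbers (assuming one byte, i.e. 8 bits, per character):

```python
input_size = 100_000    # bytes in the input stream (illustrative numbers)
output_size = 60_000    # bytes in the compressed output stream

compression_ratio = output_size / input_size   # 0.6: the data occupies 60% of its original size
bpc = 8 * output_size / input_size             # 4.8 compressed bits per input character

print(compression_ratio, bpc)
```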

Page 12: Introduction for Data Compression

• Compression factor = size of the input stream/size of the output stream.

• The expression 100 × (1 − compression ratio) is also a reasonable measure of compression performance. A value of 60 means that the output stream occupies 40% of its original size (or that the compression has resulted in savings of 60%).

• The unit of the compression gain is called percent log ratio.

• The speed of compression can be measured in cycles per byte (CPB). This is the average number of machine cycles it takes to compress one byte. This measure is important when compression is done by special hardware.
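Continuing the same illustrative numbers, the compression factor and the savings measure come out as:

```python
input_size = 100_000
output_size = 60_000

compression_factor = input_size / output_size      # about 1.67
savings = 100 * (1 - output_size / input_size)     # 40: the compression saved 40% of the size

print(round(compression_factor, 2), savings)
```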

Page 13: Introduction for Data Compression

• Mean square error (MSE) and peak signal to noise ratio (PSNR) are used to measure the distortion caused by lossy compression of images and movies (a sketch follows this list).

• Relative compression is used to measure the compression gain in lossless audio compression methods.

• The Calgary Corpus is a set of 18 files traditionally used to test data compression programs. They include text, image, and object files, for a total of more than 3.2 million bytes. The corpus can be downloaded by anonymous ftp.
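A minimal NumPy sketch of the MSE and PSNR measures mentioned above, assuming 8-bit images stored as arrays (the random images are only for illustration):

```python
import numpy as np

def mse(original, reconstructed):
    """Mean square error between two images of the same shape."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    return np.mean(diff ** 2)

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB, for 8-bit images."""
    error = mse(original, reconstructed)
    if error == 0:
        return float("inf")   # identical images: no distortion
    return 10 * np.log10(max_value ** 2 / error)

# Illustrative use: an "original" image and a slightly perturbed "reconstruction".
a = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
b = np.clip(a.astype(np.int16) + np.random.randint(-3, 4, a.shape), 0, 255).astype(np.uint8)
print(psnr(a, b))
```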

Page 14: Introduction for Data Compression

The Canterbury Corpus is another collection of files, introduced in 1997 to provide an alternative to the Calgary corpus for evaluating lossless compression methods.

1. The Calgary corpus has been used by many researchers to develop, test, and compare many compression methods, and there is a chance that new methods would unintentionally be fine-tuned to that corpus. They may do well on the Calgary corpus documents but poorly on other documents.

2. The Calgary corpus was collected in 1987 and is getting old. “Typical” documents change during a decade (e.g., html documents did not exist until recently), and any body of documents used for evaluation purposes should be examined from time to time.

3. The Calgary corpus is more or less an arbitrary collection of documents, whereas a good corpus for algorithm evaluation should be selected carefully.

Page 15: Introduction for Data Compression

Probability Model

• This concept is important in statistical data compression methods.

• When such a method is used, a model for the data has to be constructed before compression can begin.

• A typical model is built by reading the entire input stream, counting the number of times each symbol appears, and computing the probability of occurrence of each symbol.

• The data stream is then input again, symbol by symbol, and is compressed using the information in the probability model.
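A minimal sketch of such a two-pass statistical model (my own illustration): the first pass counts symbols and turns the counts into probabilities, and the second pass charges each symbol its ideal cost of −log2 p(symbol) bits, which is what entropy coders such as Huffman or arithmetic coding approximate.

```python
import math
from collections import Counter

data = b"this is an example of a statistically modelled input stream"

# Pass 1: read the entire input, count each symbol, and compute its probability.
counts = Counter(data)
total = len(data)
model = {symbol: count / total for symbol, count in counts.items()}

# Pass 2: read the input again, symbol by symbol; an entropy coder would spend
# roughly -log2 p(symbol) bits on each symbol. Here we only total that ideal cost.
ideal_bits = sum(-math.log2(model[symbol]) for symbol in data)

print(total * 8, "bits raw ->", round(ideal_bits), "bits (ideal, order-0 model)")
```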

Page 16: Introduction for Data Compression

References

• David Salomon, Data Compression: The Complete Reference, Springer Science & Business Media, 2004.