Text Operations: Preprocessing and Compression
Introduction
• Document preprocessing
  – to improve the precision of the documents retrieved
  – lexical analysis, stopword elimination, stemming, index term selection, thesauri (building a thesaurus)
• Text compression
  – to improve the efficiency of the retrieval process
  – statistical methods vs. dictionary methods
  – inverted file compression
Document Preprocessing
• Lexical analysis of the text
  – digits, hyphens, punctuation marks, the case of letters
• Elimination of stopwords
  – filtering out words that are useless for retrieval purposes
• Stemming
  – dealing with the syntactic variations of query terms
• Index term selection
  – determining the terms to be used as index terms
• Thesauri
  – the expansion of the original query with related terms
The Process of Preprocessing
[Figure: the preprocessing pipeline – documents pass through structure recognition, accents/spacing normalization, stopword removal, noun-group detection, stemming, and automatic or manual indexing; the output ranges from document structure, to full text, to a set of index terms.]
Lexical Analysis of the Text
• Four particular cases: numbers, hyphens, punctuation marks, and the case of letters
• Numbers
  – usually not good index terms because of their vagueness
  – need some advanced lexical analysis procedure
    • ex) 510B.C., 4105-1201-2310-2213, 2000/2/12, …
• Hyphens
  – breaking up hyphenated words might be useful
    • ex) state-of-the-art → state of the art (good)
    • but B-49 → B 49 (?)
  – need to adopt a general rule and to specify exceptions on a case-by-case basis
Lexical Analysis of the Text
• Punctuation marks
  – removed entirely
    • ex) 510B.C. → 510BC
    • if the query contains '510B.C.', removing the dot in both the query term and the documents will not affect retrieval performance
  – require the preparation of a list of exceptions
    • ex) val.id → valid (?)
• The case of letters
  – convert all the text to either lower or upper case
  – part of the semantics might be lost
    • ex) Northwestern University → northwestern university (?)
Elimination of Stopwords
• Basic concept
  – filtering out words with very low discrimination values (sketched below)
    • ex) a, the, this, that, where, when, …
• Advantage
  – reduces the size of the indexing structure considerably
• Disadvantage
  – might reduce recall as well
    • ex) "to be or not to be"
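A minimal sketch of this filtering step, assuming a tiny illustrative stopword list (real systems use lists of hundreds of words):

    import re

    STOPWORDS = {"a", "the", "this", "that", "where", "when", "to", "be", "or", "not"}

    def index_terms(text):
        # crude lexical analysis (lowercase, split on non-letters), then drop stopwords
        tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
        return [t for t in tokens if t not in STOPWORDS]

    print(index_terms("To be or not to be"))        # [] -- the famous query disappears
    print(index_terms("Elimination of stopwords"))  # ['elimination', 'of', 'stopwords']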
Stemming
• What is the "stem"?
  – the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes)
  – ex) 'connect' is the stem for the variants 'connected', 'connecting', 'connection', 'connections'
• Effect of stemming (a simplified sketch follows)
  – reduces variants of the same root to a common concept
  – reduces the size of the indexing structure
  – there is controversy about the benefits of stemming
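A deliberately simplified suffix-stripping sketch (production systems typically use the Porter stemmer; this toy rule set only handles examples like 'connect'):

    SUFFIXES = ("ions", "ion", "ings", "ing", "ed", "s")

    def stem(word):
        # strip the first matching suffix (listed longest first),
        # keeping a stem of at least 4 letters
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 4:
                return word[: -len(suffix)]
        return word

    for w in ("connected", "connecting", "connection", "connections"):
        print(w, "->", stem(w))   # all four variants reduce to 'connect'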
Index Term Selection
• Index term selection
  – not all words are equally significant for representing the semantics of a document
• Manual selection
  – selection of index terms is usually done by specialists
• Automatic selection of index terms
  – most of the semantics is carried by the nouns
  – clustering nouns which appear nearby in the text into a single indexing component (or concept)
  – ex) computer science
Thesauri

• What is a "thesaurus"?
  – a list of important words in a given domain of knowledge
  – a set of related words derived from a synonymity relationship
  – a controlled vocabulary for indexing and searching
• Main purposes
  – provide a standard vocabulary for indexing and searching
  – assist users in locating terms for proper query formulation
  – provide classified hierarchies that allow the broadening and narrowing of the current query request
Thesauri
• Thesaurus index terms
  – denote a concept, which is the basic semantic unit
  – can be individual words, groups of words, or phrases
    • ex) building, teaching, ballistic missiles, body temperature
  – frequently it is necessary to complement a thesaurus entry with a definition or an explanation
    • ex) seal (marine animals), seal (documents)
• Thesaurus term relationships (a small sketch follows)
  – mostly composed of synonyms and near-synonyms
  – BT (Broader Term), NT (Narrower Term), RT (Related Term)
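A small sketch of how BT/NT/RT relationships might be represented and used for query reformulation; the entries shown are hypothetical:

    # Hypothetical thesaurus fragment: term -> {relationship -> related terms}
    THESAURUS = {
        "missiles": {"BT": ["weapons"], "NT": ["ballistic missiles"], "RT": ["rockets"]},
        "seal (documents)": {"RT": ["signature", "stamp"]},
    }

    def expand(term, relation):
        # broaden (BT), narrow (NT), or laterally expand (RT) a query term
        return THESAURUS.get(term, {}).get(relation, [])

    print(expand("missiles", "BT"))  # ['weapons'] -- broadening the query
    print(expand("missiles", "NT"))  # ['ballistic missiles'] -- narrowing it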
Text Operations: Coding and Compression Methods
Text Compression

• Motivation
  – finding ways to represent the text in fewer bits
  – reducing the costs associated with space requirements, I/O overhead, and communication delays
  – obstacle: IR systems need to access the text randomly
    • to access a given word in some forms of compressed text, the entire text must be decoded from the beginning until the desired word is reached
• Two strategies
  – statistical methods
  – dictionary methods
Statistical Methods: Basic Concepts

• Modeling: a probability is estimated for each symbol
• Coding: a code is assigned to each symbol based on the model
  – shorter codes are assigned to the most likely symbols
• Relationship between probabilities and codes (a worked check follows)
  – source code theorem (Claude Shannon): a symbol that occurs with probability p should be assigned a code of length log2(1/p) bits
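A quick numerical check of the theorem, counting only the words of the "for each rose, a rose is a rose" example used later in these notes:

    import math

    # word counts in "for each rose a rose is a rose"
    freqs = {"rose": 3, "a": 2, "for": 1, "each": 1, "is": 1}
    total = sum(freqs.values())  # 8
    for sym, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
        p = f / total
        print(f"{sym}: p = {f}/{total}, ideal length = {math.log2(1 / p):.2f} bits")
    # rose: 1.42 bits; a: 2.00 bits; for/each/is: 3.00 bits --
    # the most frequent word deserves the shortest code, as in the Huffman trees below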
Statistical Methods: Compression Models

• Adaptive model: progressively learns the statistical distribution as the compression process goes on
  – decompression of a file has to start from its beginning
• Static model: assumes an average distribution for all input texts
  – poor compression ratios when the data deviates from the assumed distribution
• Semi-static model: learns the distribution in a first pass, then compresses the data in a second pass using a fixed code derived from the learned distribution
  – information on the data distribution must be stored
Statistical Methods
• Word-based compression model
  – takes words instead of characters as symbols
• Reasons to use this model in an IR context
  – much better compression rates
  – words carry a lot of meaning in natural languages, and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
  – words are the atoms on which most IR systems are built
  – word frequencies are useful in answering queries involving combinations of words, since the best strategy is to start with the least frequent words first
Statistical Methods
• Coding
  – the task of obtaining the representation of a symbol based on a probability distribution given by a model
  – main goal: assign short codes to likely symbols and long codes to unlikely ones
• Two statistical coding strategies
  – Huffman coding
    • a variable-length bit encoding for each symbol
    • relatively fast; allows random access
  – Arithmetic coding
    • uses an interval of real numbers between 0 and 1
    • much slower; does not allow random access
Huffman Coding
• Building a Huffman tree (see the sketch below)
  – for each symbol of the alphabet, create a node containing the symbol and its probability
  – the two nodes with the smallest probabilities become children of a newly created parent node
  – the parent node is assigned a probability equal to the sum of the probabilities of its two children
  – the operation is repeated, ignoring nodes that are already children, until only one node remains
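A compact sketch of this construction using a priority queue (symbol counts stand in for probabilities; tie-breaking may differ from the figures below, so individual codes can differ while the code lengths remain optimal):

    import heapq
    from collections import Counter

    def huffman_codes(tokens):
        counts = Counter(tokens)
        # heap entries: (weight, tiebreak id, leaf symbol or (left, right) subtree)
        heap = [(w, i, sym) for i, (sym, w) in enumerate(counts.items())]
        heapq.heapify(heap)
        uid = len(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)    # the two smallest-probability nodes...
            w2, _, right = heapq.heappop(heap)
            uid += 1
            heapq.heappush(heap, (w1 + w2, uid, (left, right)))  # ...get a new parent
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: recurse into children
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                                # leaf: record the accumulated code
                codes[node] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    print(huffman_codes("for each rose , a rose is a rose".split()))
    # 'rose' receives the shortest code, 'a' the next shortest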
Author's Huffman Coding

Example: "for each rose, a rose is a rose"

[Figure: Huffman tree with leaves rose, a, ",", is, each, for; edges labeled 1 and 0.]

Encoded: 0110 0100 1 0101 00 1 0111 00 1
         (for = 0110, each = 0100, rose = 1, "," = 0101, a = 00, is = 0111)
Better Huffman Coding
Example: "for each rose, a rose is a rose"

[Figure: an alternative Huffman tree over the same symbols, edges labeled 1 and 0.]

Encoded: 010 000 11 001 10 11 011 10 11
         (for = 010, each = 000, rose = 11, "," = 001, a = 10, is = 011)
Author's Canonical Huffman Coding

• The height of the left subtree is never smaller than the height of the right subtree
• S: an ordered sequence of pairs (xi, yi), one per level of the tree, where
  – xi = the number of symbols at that level
  – yi = the numerical value of the first code at that level

[Figure: canonical Huffman tree over the symbols rose, a, ",", is, each, for; edges labeled 1 and 0.]

Encoded: 0010 0000 1 0001 00 1 0011 00 1
         (for = 0010, each = 0000, rose = 1, "," = 0001, a = 00, is = 0011)

S = ((1, 1), (1, 1), (0, ∞), (4, 0))
Byte-Oriented Huffman Coding
• The tree has a branching factor of 256
• To ensure no empty nodes in the higher levels of the tree, the number of elements at the bottom level is set to 1 + ((v − 256) mod 255) (checked below)
• Characteristics
  – decompression is faster than for plain (bit-oriented) Huffman coding
  – compression ratios are better than those of the Ziv-Lempel family of codings
  – allows direct searching on the compressed text
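A quick check of that bottom-level formula, assuming v denotes the vocabulary size (the number of distinct symbols):

    def bottom_level_elements(v):
        # leaves placed at the deepest level of a 256-ary Huffman tree
        # so that no internal node in the upper levels is left empty
        return 1 + ((v - 256) % 255)

    for v in (256, 257, 510, 511, 1000):
        print(v, bottom_level_elements(v))
    # 256 -> 1, 257 -> 2, 510 -> 255, 511 -> 1, 1000 -> 235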
Dictionary Methods
• Basic concepts
  – replace groups of consecutive symbols with a pointer to an entry in a dictionary
  – the pointers are references to entries in a dictionary composed of a list of symbols that are expected to occur frequently
  – pointers to the dictionary entries are chosen so that they need less space than the phrases they replace
  – there is no separate modeling and coding step
  – no explicit probabilities are associated with phrases
Dictionary Methods
• Static dictionary methods
  – selected pairs of letters are replaced with codewords
  – ex) digram coding (see the sketch below)
    • at each step, the next two characters are inspected to check whether they form a digram in the dictionary
    • if so, they are coded together and the coding position shifts by two characters; otherwise, the single character is represented by its normal code and the position shifts by one character
  – main problem
    • a dictionary suitable for one text might be unsuitable for another
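A minimal digram-coding sketch; the digram set is an illustrative sample, and a real scheme would emit compact codewords rather than Python strings:

    DIGRAMS = {"th", "he", "in", "er", "an"}  # hypothetical frequent pairs

    def digram_encode(text):
        # emit either a digram token or a single character, per the rule above
        out, i = [], 0
        while i < len(text):
            pair = text[i:i + 2]
            if pair in DIGRAMS:
                out.append(pair)     # coded together; shift by two
                i += 2
            else:
                out.append(text[i])  # normal single-character code; shift by one
                i += 1
        return out

    print(digram_encode("the man in the ring"))
    # ['th', 'e', ' ', 'm', 'an', ' ', 'in', ' ', 'th', 'e', ' ', 'r', 'in', 'g']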
Dictionary Methods
• Adaptive dictionary methods
  – Ziv-Lempel
    • replaces strings of characters with a reference to a previous occurrence of the string
    • if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces, compression is achieved
Ziv-Lempel Code
• Characteristics
  – identifies each text segment the first time it appears, then simply points back to this first occurrence rather than repeating the segment
  – an adaptive model of coding, with increasingly long text segments encoded as the text is scanned
  – pointers require less space than the repeated text segments
  – higher compression than the Huffman codes
  – codes of roughly 4 bits per character
Ziv-Lempel Code
• LZ77 (the encoding used by gzip)
  – the code consists of a sequence of triples <a, b, c>
    • a identifies how far back in the decoded text to look for the upcoming text segment
    • b tells how many characters to copy for the upcoming segment
    • c is a new character to add to complete the next segment
    • ex) <8,2,r>, <0,0,p>
Ziv-Lempel Code
• An example (decoding; a small decoder sketch follows the trace)
    <0,0,p>  p
    <0,0,e>  pe
    <0,0,t>  pet
    <2,1,r>  peter
    <0,0,_>  peter_
    <6,1,i>  peter_pi
    <8,2,r>  peter_piper
    <6,3,c>  peter_piper_pic
    <0,0,k>  peter_piper_pick
    <7,1,d>  peter_piper_picked
    …
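A small decoder sketch for these triples (a simplification of real LZ77, which works over a bounded sliding window); running it on the trace above reproduces the text:

    def lz77_decode(triples):
        # each triple (back, length, char): copy `length` chars starting
        # `back` positions before the end of the output, then append `char`
        out = []
        for back, length, char in triples:
            start = len(out) - back
            for k in range(length):
                out.append(out[start + k])  # copying may overlap the region being written
            out.append(char)
        return "".join(out)

    triples = [(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"),
               (0, 0, "_"), (6, 1, "i"), (8, 2, "r"), (6, 3, "c"),
               (0, 0, "k"), (7, 1, "d")]
    print(lz77_decode(triples))  # peter_piper_picked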
Dictionary Methods
• Adaptive dictionary methods
  – disadvantage relative to the Huffman method
    • no random access: decoding cannot start in the middle of a compressed file
• Dictionary schemes are popular for their speed and low memory use, but statistical methods are more common in an IR environment
Comparing Text Compression Techniques
                                 Arithmetic   Character Huffman   Word Huffman   Ziv-Lempel
    Compression ratio            very good    poor                very good      good
    Compression speed            slow         fast                fast           very fast
    Decompression speed          slow         fast                very fast      very fast
    Memory space                 low          low                 high           moderate
    Compressed pattern matching  no           yes                 yes            yes
    Random access                no           yes                 yes            no
Introduction
Document preprocessingndash to improve the precision of documents
retrievedndash lexical analysis stopwords elimination
stemming index term selection thesaurindash build a thesaurus
Text compressionndash to improve the efficiency of the retrieval
processndash statistical methods vs dictionary methodsndash inverted file compression
Document Preprocessing
Lexical analysis of the textndash digits hyphens punctuation marks the case of
letters
Elimination of stopwordsndash filtering out the useless words for retrieval purposes
Stemmingndash dealing with the syntactic variations of query terms
Index terms selectionndash determining the terms to be used as index terms
Thesaurindash the expansion of the original query with related term
The Process of Preprocessing
structure
Accentsspacing stopwords
Noungroups stemming
Manual indexingDocs
structure Full text Index terms
Lexical Analysis of the Text
Four particular casesNumbers
bull usually not good index terms because of their vagueness
bull need some advanced lexical analysis procedurendash ex) 510BC 4105-1201-2310-2213 2000212 hellip
Hyphensbull breaking up hyphenated words might be useful
ndash ex) state-of-the-art state of the art (Good)ndash but B-49 B 49 ()
bull need to adopt a general rule and to specify exceptions on a case by case basis
Lexical Analysis of the Text
Punctuation marksndash removed entirely
bull ex) 510BC 1048774 510BCbull if the query contains lsquo510BCrsquo removal of the dot both in
query term and in the documents will not affect retrieval performance
ndash require the preparation of a list of exceptionsbull ex) valid 1048774 valid ()
The case of lettersndash converts all the text to either lower or upper casendash part of the semantics might be lost
bull Northwestern University 1048774 northwestern university ()
Elimination of Stopwords
Basic conceptndash filtering out words with very low discrimination
valuesbull ex) a the this that where when hellip
Advantagendash reduce the size of the indexing structure
considerably
Disadvantagendash might reduce recall as well
bull ex) to be or not to be
Stemming
What is the ldquostemrdquondash the portion of a word which is left after the removal of
its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo
lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo
Effect of stemmingndash reduce variants of the same root to a common
conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming
Index Term Selection
Index terms selectionndash not all words are equally significant for representing
the semantics of a document
Manual selectionndash selection of index terms is usually done by specialist
Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into
a single indexing component (or concept)ndash ex) computer science
ThesauriWhat is the ldquothesaurusrdquo
ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity
relationshipndash a controlled vocabulary for the indexing and searching
Main purposesndash provide a standard vocabulary for indexing and
searchingndash assist users with locating terms for proper query
formulationndash provide classified hierarchies that allow the
broadening and narrowing of the current query request
Thesauri
Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases
bull ex) building teaching ballistic missiles body temperature
ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation
bull ex) seal (marine animals) seal (documents)
Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT
(Related Term)
Text OperationsCoding Compression Methods
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical MethodsCompression models
ndash adaptive model progressively learn about the statistical distribution as the compression process goes on
bull decompression of a file has to start from its beginning
ndash static model assume an average distribution for all input texts
bull poor compression ratios when data deviates from initial distribution assumptions
ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned
bull information on the data distribution must be stored
Statistical Methods
Word-based compression modelndash take words instead of characters as symbols
Reasons to use this model in an IR contextndash much better compression rates
ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
ndash words are the atoms on which most IR systems are built
ndash word frequencies are useful in answering queries involving combinations of words
ndash the best strategy is to start with the least frequent words first
Statistical Methods
Codingndash the task of obtaining the representation of a symbol
based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and
long codes to unlikely ones
Two statistical coding strategiesndash Huffman coding
bull a variable-length encoding in bits for each symbolbull relatively fast allows random access
ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access
Huffman Coding
Building a Huffman treendash for each symbol of the alphabet create a node
containing the symbol and its probabilityndash the two nodes with the smallest probabilities become
children of a newly created parent nodendash the parent node is associated a probability equal to
the sum of the probabilities of the two chosen children
ndash the operation is repeated ignoring nodes that are already children until there is only one node
Authorrsquos Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
10
10
10 10
0110 0100 1 0101 00 1 0111 00 1
for each rose a rose is a rose
Better Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
1010
10 10
010 000 11 001 10 11 011 10 11
for each rose a rose is a rose
Authorrsquos Canonical Huffman Coding
bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree
wherexi = symbols
yi = numerical value of first symbol
rose
isldquo ldquo
a
each for
10
10
10
10 10
0010 0000 1 0001 00 1 0011 00 1
for each rose a rose is a rose
S = ((1 1) (1 1) (0 infin) (4 0)
Byte-Oriented Huffman Coding
Tree has branching factor of 256
Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)
Characteristicsndash Decompression is faster than for plain Huffman
codingndash Compression ratios are better than for Ziv-Lempel
family of codingsndash Allows direct searching on compressed text
Dictionary Methods
Basic conceptsndash replacing groups of consecutive symbols with a
pointer to an entry in a dictionaryndash the pointer representations are references to entries
in a dictionary composed of a list of symbols that are expected to occur frequently
ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace
ndash modeling and coding does not existndash there are no explicit probabilities associated to
phrases
Dictionary Methods
Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding
bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary
bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character
ndash main problembull the dictionary might be suitable for one text but unsuitable
for another
Dictionary Methods
Adaptive dictionary methodsndash Ziv-Lempel
bull placing strings of characters with a reference to a previous occurrence of the string
bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved
Ziv-Lempel Code
Characteristicsndash identifying each text segment the first time it appears
and then simply pointing back to this first occurrence rather than repeating the segment
ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned
ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character
Ziv-Lempel Code
LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt
bull a identifies how far back in the decoded text to look for the upcoming text segment
bull b tells how many characters to copy for the upcoming segment
bull c a new character to add to complete the next segment
bull ex) lt82rgt lt00pgt
Ziv-Lempel Code
An example (decoding)lt00pgt p
lt00egt pe
lt00tgt pet
lt21rgt peter
lt00_gt peter_
lt61igt peter_pi
lt82rgt peter_piper
lt63cgt peter_piper_pic
lt00kgt peter_piper_pick
lt71dgt peter_piper_picked
helliphelliphellip
Dictionary Methods
Adaptive dictionary methodsndash Disadvantages over the Huffman method
bull no random access does not allow decoding to start in the middle of a compression file
Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment
Comparing Text Compression Techniques
Arithmetic Character Huffman
Word Huffman
Ziv-Lempel
Compression Ratio
very good poor very good good
Compression Speed
slow fast fast very fast
Decompression Speed
slow fast very fast very fast
Memory Space low low high moderate
Compressed pattern matching
no yes yes yes
Random Access
no yes yes no
Document Preprocessing
Lexical analysis of the textndash digits hyphens punctuation marks the case of
letters
Elimination of stopwordsndash filtering out the useless words for retrieval purposes
Stemmingndash dealing with the syntactic variations of query terms
Index terms selectionndash determining the terms to be used as index terms
Thesaurindash the expansion of the original query with related term
The Process of Preprocessing
structure
Accentsspacing stopwords
Noungroups stemming
Manual indexingDocs
structure Full text Index terms
Lexical Analysis of the Text
Four particular casesNumbers
bull usually not good index terms because of their vagueness
bull need some advanced lexical analysis procedurendash ex) 510BC 4105-1201-2310-2213 2000212 hellip
Hyphensbull breaking up hyphenated words might be useful
ndash ex) state-of-the-art state of the art (Good)ndash but B-49 B 49 ()
bull need to adopt a general rule and to specify exceptions on a case by case basis
Lexical Analysis of the Text
Punctuation marksndash removed entirely
bull ex) 510BC 1048774 510BCbull if the query contains lsquo510BCrsquo removal of the dot both in
query term and in the documents will not affect retrieval performance
ndash require the preparation of a list of exceptionsbull ex) valid 1048774 valid ()
The case of lettersndash converts all the text to either lower or upper casendash part of the semantics might be lost
bull Northwestern University 1048774 northwestern university ()
Elimination of Stopwords
Basic conceptndash filtering out words with very low discrimination
valuesbull ex) a the this that where when hellip
Advantagendash reduce the size of the indexing structure
considerably
Disadvantagendash might reduce recall as well
bull ex) to be or not to be
Stemming
What is the ldquostemrdquondash the portion of a word which is left after the removal of
its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo
lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo
Effect of stemmingndash reduce variants of the same root to a common
conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming
Index Term Selection
Index terms selectionndash not all words are equally significant for representing
the semantics of a document
Manual selectionndash selection of index terms is usually done by specialist
Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into
a single indexing component (or concept)ndash ex) computer science
ThesauriWhat is the ldquothesaurusrdquo
ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity
relationshipndash a controlled vocabulary for the indexing and searching
Main purposesndash provide a standard vocabulary for indexing and
searchingndash assist users with locating terms for proper query
formulationndash provide classified hierarchies that allow the
broadening and narrowing of the current query request
Thesauri
Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases
bull ex) building teaching ballistic missiles body temperature
ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation
bull ex) seal (marine animals) seal (documents)
Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT
(Related Term)
Text OperationsCoding Compression Methods
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical MethodsCompression models
ndash adaptive model progressively learn about the statistical distribution as the compression process goes on
bull decompression of a file has to start from its beginning
ndash static model assume an average distribution for all input texts
bull poor compression ratios when data deviates from initial distribution assumptions
ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned
bull information on the data distribution must be stored
Statistical Methods
Word-based compression modelndash take words instead of characters as symbols
Reasons to use this model in an IR contextndash much better compression rates
ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
ndash words are the atoms on which most IR systems are built
ndash word frequencies are useful in answering queries involving combinations of words
ndash the best strategy is to start with the least frequent words first
Statistical Methods
Codingndash the task of obtaining the representation of a symbol
based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and
long codes to unlikely ones
Two statistical coding strategiesndash Huffman coding
bull a variable-length encoding in bits for each symbolbull relatively fast allows random access
ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access
Huffman Coding
Building a Huffman treendash for each symbol of the alphabet create a node
containing the symbol and its probabilityndash the two nodes with the smallest probabilities become
children of a newly created parent nodendash the parent node is associated a probability equal to
the sum of the probabilities of the two chosen children
ndash the operation is repeated ignoring nodes that are already children until there is only one node
Authorrsquos Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
10
10
10 10
0110 0100 1 0101 00 1 0111 00 1
for each rose a rose is a rose
Better Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
1010
10 10
010 000 11 001 10 11 011 10 11
for each rose a rose is a rose
Authorrsquos Canonical Huffman Coding
bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree
wherexi = symbols
yi = numerical value of first symbol
rose
isldquo ldquo
a
each for
10
10
10
10 10
0010 0000 1 0001 00 1 0011 00 1
for each rose a rose is a rose
S = ((1 1) (1 1) (0 infin) (4 0)
Byte-Oriented Huffman Coding
Tree has branching factor of 256
Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)
Characteristicsndash Decompression is faster than for plain Huffman
codingndash Compression ratios are better than for Ziv-Lempel
family of codingsndash Allows direct searching on compressed text
Dictionary Methods
Basic conceptsndash replacing groups of consecutive symbols with a
pointer to an entry in a dictionaryndash the pointer representations are references to entries
in a dictionary composed of a list of symbols that are expected to occur frequently
ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace
ndash modeling and coding does not existndash there are no explicit probabilities associated to
phrases
Dictionary Methods
Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding
bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary
bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character
ndash main problembull the dictionary might be suitable for one text but unsuitable
for another
Dictionary Methods
Adaptive dictionary methodsndash Ziv-Lempel
bull placing strings of characters with a reference to a previous occurrence of the string
bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved
Ziv-Lempel Code
Characteristicsndash identifying each text segment the first time it appears
and then simply pointing back to this first occurrence rather than repeating the segment
ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned
ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character
Ziv-Lempel Code
LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt
bull a identifies how far back in the decoded text to look for the upcoming text segment
bull b tells how many characters to copy for the upcoming segment
bull c a new character to add to complete the next segment
bull ex) lt82rgt lt00pgt
Ziv-Lempel Code
An example (decoding)lt00pgt p
lt00egt pe
lt00tgt pet
lt21rgt peter
lt00_gt peter_
lt61igt peter_pi
lt82rgt peter_piper
lt63cgt peter_piper_pic
lt00kgt peter_piper_pick
lt71dgt peter_piper_picked
helliphelliphellip
Dictionary Methods
Adaptive dictionary methodsndash Disadvantages over the Huffman method
bull no random access does not allow decoding to start in the middle of a compression file
Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment
Comparing Text Compression Techniques
Arithmetic Character Huffman
Word Huffman
Ziv-Lempel
Compression Ratio
very good poor very good good
Compression Speed
slow fast fast very fast
Decompression Speed
slow fast very fast very fast
Memory Space low low high moderate
Compressed pattern matching
no yes yes yes
Random Access
no yes yes no
The Process of Preprocessing
structure
Accentsspacing stopwords
Noungroups stemming
Manual indexingDocs
structure Full text Index terms
Lexical Analysis of the Text
Four particular casesNumbers
bull usually not good index terms because of their vagueness
bull need some advanced lexical analysis procedurendash ex) 510BC 4105-1201-2310-2213 2000212 hellip
Hyphensbull breaking up hyphenated words might be useful
ndash ex) state-of-the-art state of the art (Good)ndash but B-49 B 49 ()
bull need to adopt a general rule and to specify exceptions on a case by case basis
Lexical Analysis of the Text
Punctuation marksndash removed entirely
bull ex) 510BC 1048774 510BCbull if the query contains lsquo510BCrsquo removal of the dot both in
query term and in the documents will not affect retrieval performance
ndash require the preparation of a list of exceptionsbull ex) valid 1048774 valid ()
The case of lettersndash converts all the text to either lower or upper casendash part of the semantics might be lost
bull Northwestern University 1048774 northwestern university ()
Elimination of Stopwords
Basic conceptndash filtering out words with very low discrimination
valuesbull ex) a the this that where when hellip
Advantagendash reduce the size of the indexing structure
considerably
Disadvantagendash might reduce recall as well
bull ex) to be or not to be
Stemming
What is the ldquostemrdquondash the portion of a word which is left after the removal of
its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo
lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo
Effect of stemmingndash reduce variants of the same root to a common
conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming
Index Term Selection
Index terms selectionndash not all words are equally significant for representing
the semantics of a document
Manual selectionndash selection of index terms is usually done by specialist
Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into
a single indexing component (or concept)ndash ex) computer science
ThesauriWhat is the ldquothesaurusrdquo
ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity
relationshipndash a controlled vocabulary for the indexing and searching
Main purposesndash provide a standard vocabulary for indexing and
searchingndash assist users with locating terms for proper query
formulationndash provide classified hierarchies that allow the
broadening and narrowing of the current query request
Thesauri
Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases
bull ex) building teaching ballistic missiles body temperature
ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation
bull ex) seal (marine animals) seal (documents)
Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT
(Related Term)
Text OperationsCoding Compression Methods
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical MethodsCompression models
ndash adaptive model progressively learn about the statistical distribution as the compression process goes on
bull decompression of a file has to start from its beginning
ndash static model assume an average distribution for all input texts
bull poor compression ratios when data deviates from initial distribution assumptions
ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned
bull information on the data distribution must be stored
Statistical Methods
Word-based compression modelndash take words instead of characters as symbols
Reasons to use this model in an IR contextndash much better compression rates
ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
ndash words are the atoms on which most IR systems are built
ndash word frequencies are useful in answering queries involving combinations of words
ndash the best strategy is to start with the least frequent words first
Statistical Methods
Codingndash the task of obtaining the representation of a symbol
based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and
long codes to unlikely ones
Two statistical coding strategiesndash Huffman coding
bull a variable-length encoding in bits for each symbolbull relatively fast allows random access
ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access
Huffman Coding
Building a Huffman treendash for each symbol of the alphabet create a node
containing the symbol and its probabilityndash the two nodes with the smallest probabilities become
children of a newly created parent nodendash the parent node is associated a probability equal to
the sum of the probabilities of the two chosen children
ndash the operation is repeated ignoring nodes that are already children until there is only one node
Authorrsquos Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
10
10
10 10
0110 0100 1 0101 00 1 0111 00 1
for each rose a rose is a rose
Better Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
1010
10 10
010 000 11 001 10 11 011 10 11
for each rose a rose is a rose
Authorrsquos Canonical Huffman Coding
bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree
wherexi = symbols
yi = numerical value of first symbol
rose
isldquo ldquo
a
each for
10
10
10
10 10
0010 0000 1 0001 00 1 0011 00 1
for each rose a rose is a rose
S = ((1 1) (1 1) (0 infin) (4 0)
Byte-Oriented Huffman Coding
Tree has branching factor of 256
Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)
Characteristicsndash Decompression is faster than for plain Huffman
codingndash Compression ratios are better than for Ziv-Lempel
family of codingsndash Allows direct searching on compressed text
Dictionary Methods
Basic conceptsndash replacing groups of consecutive symbols with a
pointer to an entry in a dictionaryndash the pointer representations are references to entries
in a dictionary composed of a list of symbols that are expected to occur frequently
ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace
ndash modeling and coding does not existndash there are no explicit probabilities associated to
phrases
Dictionary Methods
Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding
bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary
bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character
ndash main problembull the dictionary might be suitable for one text but unsuitable
for another
Dictionary Methods
Adaptive dictionary methodsndash Ziv-Lempel
bull placing strings of characters with a reference to a previous occurrence of the string
bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved
Ziv-Lempel Code
Characteristicsndash identifying each text segment the first time it appears
and then simply pointing back to this first occurrence rather than repeating the segment
ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned
ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character
Ziv-Lempel Code
LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt
bull a identifies how far back in the decoded text to look for the upcoming text segment
bull b tells how many characters to copy for the upcoming segment
bull c a new character to add to complete the next segment
bull ex) lt82rgt lt00pgt
Ziv-Lempel Code
An example (decoding)lt00pgt p
lt00egt pe
lt00tgt pet
lt21rgt peter
lt00_gt peter_
lt61igt peter_pi
lt82rgt peter_piper
lt63cgt peter_piper_pic
lt00kgt peter_piper_pick
lt71dgt peter_piper_picked
helliphelliphellip
Dictionary Methods
Adaptive dictionary methodsndash Disadvantages over the Huffman method
bull no random access does not allow decoding to start in the middle of a compression file
Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment
Comparing Text Compression Techniques
Arithmetic Character Huffman
Word Huffman
Ziv-Lempel
Compression Ratio
very good poor very good good
Compression Speed
slow fast fast very fast
Decompression Speed
slow fast very fast very fast
Memory Space low low high moderate
Compressed pattern matching
no yes yes yes
Random Access
no yes yes no
Lexical Analysis of the Text
Four particular casesNumbers
bull usually not good index terms because of their vagueness
bull need some advanced lexical analysis procedurendash ex) 510BC 4105-1201-2310-2213 2000212 hellip
Hyphensbull breaking up hyphenated words might be useful
ndash ex) state-of-the-art state of the art (Good)ndash but B-49 B 49 ()
bull need to adopt a general rule and to specify exceptions on a case by case basis
Lexical Analysis of the Text
Punctuation marksndash removed entirely
bull ex) 510BC 1048774 510BCbull if the query contains lsquo510BCrsquo removal of the dot both in
query term and in the documents will not affect retrieval performance
ndash require the preparation of a list of exceptionsbull ex) valid 1048774 valid ()
The case of lettersndash converts all the text to either lower or upper casendash part of the semantics might be lost
bull Northwestern University 1048774 northwestern university ()
Elimination of Stopwords
Basic conceptndash filtering out words with very low discrimination
valuesbull ex) a the this that where when hellip
Advantagendash reduce the size of the indexing structure
considerably
Disadvantagendash might reduce recall as well
bull ex) to be or not to be
Stemming
What is the ldquostemrdquondash the portion of a word which is left after the removal of
its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo
lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo
Effect of stemmingndash reduce variants of the same root to a common
conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming
Index Term Selection
Index terms selectionndash not all words are equally significant for representing
the semantics of a document
Manual selectionndash selection of index terms is usually done by specialist
Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into
a single indexing component (or concept)ndash ex) computer science
ThesauriWhat is the ldquothesaurusrdquo
ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity
relationshipndash a controlled vocabulary for the indexing and searching
Main purposesndash provide a standard vocabulary for indexing and
searchingndash assist users with locating terms for proper query
formulationndash provide classified hierarchies that allow the
broadening and narrowing of the current query request
Thesauri
Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases
bull ex) building teaching ballistic missiles body temperature
ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation
bull ex) seal (marine animals) seal (documents)
Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT
(Related Term)
Text OperationsCoding Compression Methods
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical MethodsCompression models
ndash adaptive model progressively learn about the statistical distribution as the compression process goes on
bull decompression of a file has to start from its beginning
ndash static model assume an average distribution for all input texts
bull poor compression ratios when data deviates from initial distribution assumptions
ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned
bull information on the data distribution must be stored
Statistical Methods
Word-based compression modelndash take words instead of characters as symbols
Reasons to use this model in an IR contextndash much better compression rates
ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
ndash words are the atoms on which most IR systems are built
ndash word frequencies are useful in answering queries involving combinations of words
ndash the best strategy is to start with the least frequent words first
Statistical Methods
Codingndash the task of obtaining the representation of a symbol
based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and
long codes to unlikely ones
Two statistical coding strategiesndash Huffman coding
bull a variable-length encoding in bits for each symbolbull relatively fast allows random access
ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access
Huffman Coding
Building a Huffman treendash for each symbol of the alphabet create a node
containing the symbol and its probabilityndash the two nodes with the smallest probabilities become
children of a newly created parent nodendash the parent node is associated a probability equal to
the sum of the probabilities of the two chosen children
ndash the operation is repeated ignoring nodes that are already children until there is only one node
Authorrsquos Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
10
10
10 10
0110 0100 1 0101 00 1 0111 00 1
for each rose a rose is a rose
Better Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
1010
10 10
010 000 11 001 10 11 011 10 11
for each rose a rose is a rose
Authorrsquos Canonical Huffman Coding
bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree
wherexi = symbols
yi = numerical value of first symbol
rose
isldquo ldquo
a
each for
10
10
10
10 10
0010 0000 1 0001 00 1 0011 00 1
for each rose a rose is a rose
S = ((1 1) (1 1) (0 infin) (4 0)
Byte-Oriented Huffman Coding
Tree has branching factor of 256
Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)
Characteristicsndash Decompression is faster than for plain Huffman
codingndash Compression ratios are better than for Ziv-Lempel
family of codingsndash Allows direct searching on compressed text
Dictionary Methods
Basic conceptsndash replacing groups of consecutive symbols with a
pointer to an entry in a dictionaryndash the pointer representations are references to entries
in a dictionary composed of a list of symbols that are expected to occur frequently
ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace
ndash modeling and coding does not existndash there are no explicit probabilities associated to
phrases
Dictionary Methods
Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding
bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary
bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character
ndash main problembull the dictionary might be suitable for one text but unsuitable
for another
Dictionary Methods
Adaptive dictionary methodsndash Ziv-Lempel
bull placing strings of characters with a reference to a previous occurrence of the string
bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved
Ziv-Lempel Code
Characteristicsndash identifying each text segment the first time it appears
and then simply pointing back to this first occurrence rather than repeating the segment
ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned
ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character
Ziv-Lempel Code
LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt
bull a identifies how far back in the decoded text to look for the upcoming text segment
bull b tells how many characters to copy for the upcoming segment
bull c a new character to add to complete the next segment
bull ex) lt82rgt lt00pgt
Ziv-Lempel Code
An example (decoding)lt00pgt p
lt00egt pe
lt00tgt pet
lt21rgt peter
lt00_gt peter_
lt61igt peter_pi
lt82rgt peter_piper
lt63cgt peter_piper_pic
lt00kgt peter_piper_pick
lt71dgt peter_piper_picked
helliphelliphellip
Dictionary Methods
Adaptive dictionary methodsndash Disadvantages over the Huffman method
bull no random access does not allow decoding to start in the middle of a compression file
Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment
Comparing Text Compression Techniques
Arithmetic Character Huffman
Word Huffman
Ziv-Lempel
Compression Ratio
very good poor very good good
Compression Speed
slow fast fast very fast
Decompression Speed
slow fast very fast very fast
Memory Space low low high moderate
Compressed pattern matching
no yes yes yes
Random Access
no yes yes no
Lexical Analysis of the Text
Punctuation marksndash removed entirely
bull ex) 510BC 1048774 510BCbull if the query contains lsquo510BCrsquo removal of the dot both in
query term and in the documents will not affect retrieval performance
ndash require the preparation of a list of exceptionsbull ex) valid 1048774 valid ()
The case of lettersndash converts all the text to either lower or upper casendash part of the semantics might be lost
bull Northwestern University 1048774 northwestern university ()
Elimination of Stopwords
Basic conceptndash filtering out words with very low discrimination
valuesbull ex) a the this that where when hellip
Advantagendash reduce the size of the indexing structure
considerably
Disadvantagendash might reduce recall as well
bull ex) to be or not to be
Stemming
What is the ldquostemrdquondash the portion of a word which is left after the removal of
its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo
lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo
Effect of stemmingndash reduce variants of the same root to a common
conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming
Index Term Selection
Index terms selectionndash not all words are equally significant for representing
the semantics of a document
Manual selectionndash selection of index terms is usually done by specialist
Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into
a single indexing component (or concept)ndash ex) computer science
ThesauriWhat is the ldquothesaurusrdquo
ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity
relationshipndash a controlled vocabulary for the indexing and searching
Main purposesndash provide a standard vocabulary for indexing and
searchingndash assist users with locating terms for proper query
formulationndash provide classified hierarchies that allow the
broadening and narrowing of the current query request
Thesauri
Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases
bull ex) building teaching ballistic missiles body temperature
ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation
bull ex) seal (marine animals) seal (documents)
Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT
(Related Term)
Text OperationsCoding Compression Methods
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical Methods

Compression models
– adaptive model: progressively learns the statistical distribution as the compression process goes on
  • decompression of a file has to start from its beginning
– static model: assumes an average distribution for all input texts
  • poor compression ratios when the data deviates from the initial distribution assumptions
– semi-static model: learns the distribution in a first pass, then compresses the data in a second pass using a fixed code derived from the learned distribution
  • information on the data distribution must be stored
Statistical Methods

Word-based compression model
– takes words, instead of characters, as symbols

Reasons to use this model in an IR context
– much better compression ratios
– words carry a lot of meaning in natural language, and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
– words are the atoms on which most IR systems are built
– word frequencies are useful in answering queries involving combinations of words, where the best strategy is to start with the least frequent words first
Statistical Methods

Coding
– the task of obtaining the representation of a symbol based on a probability distribution given by a model
– main goal: assign short codes to likely symbols and long codes to unlikely ones

Two statistical coding strategies
– Huffman coding
  • a variable-length encoding in bits for each symbol
  • relatively fast; allows random access
– Arithmetic coding
  • uses an interval of real numbers between 0 and 1
  • much slower; does not allow random access
Huffman Coding

Building a Huffman tree
– for each symbol of the alphabet, create a node containing the symbol and its probability
– the two nodes with the smallest probabilities become children of a newly created parent node
– the parent node is assigned a probability equal to the sum of the probabilities of its two children
– the operation is repeated, ignoring nodes that are already children, until only one node remains
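A minimal word-based Huffman construction sketch in Python, following the steps above (function and variable names are illustrative):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        """Build Huffman codes for the words of `text` (word-based model)."""
        freq = Counter(text.split())
        total = sum(freq.values())
        # each heap entry: (probability, tie-breaker, tree); a tree is a
        # word (leaf) or a (left, right) pair (internal node)
        heap = [(n / total, i, w) for i, (w, n) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            # the two least likely nodes become children of a new parent
            p1, _, left = heapq.heappop(heap)
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, tie, (left, right)))
            tie += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    print(huffman_codes("for each rose a rose is a rose"))
    # 'rose', the most frequent word, receives the shortest code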
Author's Huffman Coding

Example: "for each rose a rose is a rose"

[Huffman tree diagram: leaves 'rose', '" "', 'a', 'is', 'each', 'for'; edges labeled 0 and 1]

Encoded text:
0110 0100 1 0101 00 1 0111 00 1
for each rose a rose is a rose
Better Huffman Coding

Example: "for each rose a rose is a rose"

[Huffman tree diagram: same leaves, arranged in a shallower tree]

Encoded text:
010 000 11 001 10 11 011 10 11
for each rose a rose is a rose
Author's Canonical Huffman Coding

• the height of the left subtree is never shorter than that of the right subtree
• S: ordered sequence of pairs (xi, yi), one for each level in the tree, where
  xi = number of symbols at that level
  yi = numerical value of the first symbol at that level

[Canonical Huffman tree diagram: leaves 'rose', '" "', 'a', 'is', 'each', 'for']

Encoded text:
0010 0000 1 0001 00 1 0011 00 1
for each rose a rose is a rose

S = ((1, 1), (1, 1), (0, ∞), (4, 0))
Byte-Oriented Huffman Coding

The tree has a branching factor of 256

To ensure no empty nodes in the higher levels of the tree:
# of bottom-level elements = 1 + ((v − 256) mod 255)

Characteristics
– decompression is faster than for plain Huffman coding
– compression ratios are better than for the Ziv-Lempel family of codings
– allows direct searching on the compressed text
Dictionary Methods

Basic concepts
– replace groups of consecutive symbols with a pointer to an entry in a dictionary
– the pointer representations are references to entries in a dictionary composed of a list of symbols that are expected to occur frequently
– pointers to the dictionary entries are chosen so that they need less space than the phrases they replace
– there is no separate modeling and coding step, and no explicit probabilities are associated with phrases
Dictionary Methods

Static dictionary methods
– selected pairs of letters are replaced with codewords
– ex) digram coding
  • at each step, the next two characters are inspected to verify whether they correspond to a digram in the dictionary
  • if so, they are coded together and the coding position is shifted by two characters; otherwise, the single character is represented by its normal code and the position is shifted by one character
– main problem
  • a dictionary that is suitable for one text may be unsuitable for another
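A minimal digram-coding sketch in Python; the dictionary below is a tiny illustrative sample (a real one would be derived from representative text):

    # Illustrative digram dictionary mapping frequent pairs to spare byte codes.
    DIGRAMS = {"th": 0x80, "he": 0x81, "in": 0x82, "er": 0x83}

    def digram_encode(text):
        out, i = [], 0
        while i < len(text):
            pair = text[i:i + 2]
            if pair in DIGRAMS:           # two characters coded together
                out.append(DIGRAMS[pair])
                i += 2
            else:                         # normal code for a single character
                out.append(ord(text[i]))
                i += 1
        return out

    print(digram_encode("the mother"))    # 'th' (twice) and 'er' collapse to single codes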
Dictionary Methods

Adaptive dictionary methods
– Ziv-Lempel
  • replaces strings of characters with a reference to a previous occurrence of the string
  • if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces, compression is achieved
Ziv-Lempel Code

Characteristics
– identifies each text segment the first time it appears, and then simply points back to this first occurrence rather than repeating the segment
– an adaptive model of coding, with increasingly long text segments encoded as the text is scanned
– pointers require less space than the repeated text segments
– higher compression than the Huffman codes; roughly 4 bits per character
Ziv-Lempel Code

LZ77 – Gzip encoding
– the code consists of a sequence of triples <a, b, c>
  • a: identifies how far back in the decoded text to look for the upcoming text segment
  • b: tells how many characters to copy for the upcoming segment
  • c: a new character to add to complete the next segment
  • ex) <8, 2, r>, <0, 0, p>
Ziv-Lempel Code

An example (decoding)
<0, 0, p>  p
<0, 0, e>  pe
<0, 0, t>  pet
<2, 1, r>  peter
<0, 0, _>  peter_
<6, 1, i>  peter_pi
<8, 2, r>  peter_piper
<6, 3, c>  peter_piper_pic
<0, 0, k>  peter_piper_pick
<7, 1, d>  peter_piper_picked
…
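The decoding rule above is mechanical, so it fits in a few lines of Python (a sketch; triples are given as (back, length, char) tuples):

    def lz77_decode(triples):
        """Decode LZ77 triples <back, length, char> into the original text."""
        out = ""
        for back, length, char in triples:
            start = len(out) - back
            for k in range(length):   # char-by-char copy handles overlapping runs
                out += out[start + k]
            out += char               # the literal completing the segment
        return out

    print(lz77_decode([(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"),
                       (0, 0, "_"), (6, 1, "i"), (8, 2, "r"), (6, 3, "c"),
                       (0, 0, "k"), (7, 1, "d")]))
    # -> peter_piper_picked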
Dictionary Methods

Adaptive dictionary methods
– disadvantages compared with the Huffman method
  • no random access: decoding cannot start in the middle of a compressed file

Dictionary schemes are popular for their speed and low memory use, but statistical methods are more common in an IR environment.
Comparing Text Compression Techniques

                              Arithmetic   Character Huffman   Word Huffman   Ziv-Lempel
Compression ratio             very good    poor                very good      good
Compression speed             slow         fast                fast           very fast
Decompression speed           slow         fast                very fast      very fast
Memory space                  low          low                 high           moderate
Compressed pattern matching   no           yes                 yes            yes
Random access                 no           yes                 yes            no
Elimination of Stopwords
Basic conceptndash filtering out words with very low discrimination
valuesbull ex) a the this that where when hellip
Advantagendash reduce the size of the indexing structure
considerably
Disadvantagendash might reduce recall as well
bull ex) to be or not to be
Stemming
What is the ldquostemrdquondash the portion of a word which is left after the removal of
its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo
lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo
Effect of stemmingndash reduce variants of the same root to a common
conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming
Index Term Selection
Index terms selectionndash not all words are equally significant for representing
the semantics of a document
Manual selectionndash selection of index terms is usually done by specialist
Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into
a single indexing component (or concept)ndash ex) computer science
ThesauriWhat is the ldquothesaurusrdquo
ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity
relationshipndash a controlled vocabulary for the indexing and searching
Main purposesndash provide a standard vocabulary for indexing and
searchingndash assist users with locating terms for proper query
formulationndash provide classified hierarchies that allow the
broadening and narrowing of the current query request
Thesauri
Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases
bull ex) building teaching ballistic missiles body temperature
ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation
bull ex) seal (marine animals) seal (documents)
Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT
(Related Term)
Text OperationsCoding Compression Methods
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical MethodsCompression models
ndash adaptive model progressively learn about the statistical distribution as the compression process goes on
bull decompression of a file has to start from its beginning
ndash static model assume an average distribution for all input texts
bull poor compression ratios when data deviates from initial distribution assumptions
ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned
bull information on the data distribution must be stored
Statistical Methods
Word-based compression modelndash take words instead of characters as symbols
Reasons to use this model in an IR contextndash much better compression rates
ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
ndash words are the atoms on which most IR systems are built
ndash word frequencies are useful in answering queries involving combinations of words
ndash the best strategy is to start with the least frequent words first
Statistical Methods
Codingndash the task of obtaining the representation of a symbol
based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and
long codes to unlikely ones
Two statistical coding strategiesndash Huffman coding
bull a variable-length encoding in bits for each symbolbull relatively fast allows random access
ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access
Huffman Coding
Building a Huffman treendash for each symbol of the alphabet create a node
containing the symbol and its probabilityndash the two nodes with the smallest probabilities become
children of a newly created parent nodendash the parent node is associated a probability equal to
the sum of the probabilities of the two chosen children
ndash the operation is repeated ignoring nodes that are already children until there is only one node
Authorrsquos Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
10
10
10 10
0110 0100 1 0101 00 1 0111 00 1
for each rose a rose is a rose
Better Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
1010
10 10
010 000 11 001 10 11 011 10 11
for each rose a rose is a rose
Authorrsquos Canonical Huffman Coding
bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree
wherexi = symbols
yi = numerical value of first symbol
rose
isldquo ldquo
a
each for
10
10
10
10 10
0010 0000 1 0001 00 1 0011 00 1
for each rose a rose is a rose
S = ((1 1) (1 1) (0 infin) (4 0)
Byte-Oriented Huffman Coding
Tree has branching factor of 256
Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)
Characteristicsndash Decompression is faster than for plain Huffman
codingndash Compression ratios are better than for Ziv-Lempel
family of codingsndash Allows direct searching on compressed text
Dictionary Methods
Basic conceptsndash replacing groups of consecutive symbols with a
pointer to an entry in a dictionaryndash the pointer representations are references to entries
in a dictionary composed of a list of symbols that are expected to occur frequently
ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace
ndash modeling and coding does not existndash there are no explicit probabilities associated to
phrases
Dictionary Methods
Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding
bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary
bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character
ndash main problembull the dictionary might be suitable for one text but unsuitable
for another
Dictionary Methods
Adaptive dictionary methodsndash Ziv-Lempel
bull placing strings of characters with a reference to a previous occurrence of the string
bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved
Ziv-Lempel Code
Characteristicsndash identifying each text segment the first time it appears
and then simply pointing back to this first occurrence rather than repeating the segment
ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned
ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character
Ziv-Lempel Code
LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt
bull a identifies how far back in the decoded text to look for the upcoming text segment
bull b tells how many characters to copy for the upcoming segment
bull c a new character to add to complete the next segment
bull ex) lt82rgt lt00pgt
Ziv-Lempel Code
An example (decoding)lt00pgt p
lt00egt pe
lt00tgt pet
lt21rgt peter
lt00_gt peter_
lt61igt peter_pi
lt82rgt peter_piper
lt63cgt peter_piper_pic
lt00kgt peter_piper_pick
lt71dgt peter_piper_picked
helliphelliphellip
Dictionary Methods
Adaptive dictionary methodsndash Disadvantages over the Huffman method
bull no random access does not allow decoding to start in the middle of a compression file
Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment
Comparing Text Compression Techniques
Arithmetic Character Huffman
Word Huffman
Ziv-Lempel
Compression Ratio
very good poor very good good
Compression Speed
slow fast fast very fast
Decompression Speed
slow fast very fast very fast
Memory Space low low high moderate
Compressed pattern matching
no yes yes yes
Random Access
no yes yes no
Stemming
What is the ldquostemrdquondash the portion of a word which is left after the removal of
its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo
lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo
Effect of stemmingndash reduce variants of the same root to a common
conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming
Index Term Selection
Index terms selectionndash not all words are equally significant for representing
the semantics of a document
Manual selectionndash selection of index terms is usually done by specialist
Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into
a single indexing component (or concept)ndash ex) computer science
ThesauriWhat is the ldquothesaurusrdquo
ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity
relationshipndash a controlled vocabulary for the indexing and searching
Main purposesndash provide a standard vocabulary for indexing and
searchingndash assist users with locating terms for proper query
formulationndash provide classified hierarchies that allow the
broadening and narrowing of the current query request
Thesauri
Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases
bull ex) building teaching ballistic missiles body temperature
ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation
bull ex) seal (marine animals) seal (documents)
Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT
(Related Term)
Text OperationsCoding Compression Methods
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical MethodsCompression models
ndash adaptive model progressively learn about the statistical distribution as the compression process goes on
bull decompression of a file has to start from its beginning
ndash static model assume an average distribution for all input texts
bull poor compression ratios when data deviates from initial distribution assumptions
ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned
bull information on the data distribution must be stored
Statistical Methods
Word-based compression modelndash take words instead of characters as symbols
Reasons to use this model in an IR contextndash much better compression rates
ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
ndash words are the atoms on which most IR systems are built
ndash word frequencies are useful in answering queries involving combinations of words
ndash the best strategy is to start with the least frequent words first
Statistical Methods
Codingndash the task of obtaining the representation of a symbol
based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and
long codes to unlikely ones
Two statistical coding strategiesndash Huffman coding
bull a variable-length encoding in bits for each symbolbull relatively fast allows random access
ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access
Huffman Coding
Building a Huffman treendash for each symbol of the alphabet create a node
containing the symbol and its probabilityndash the two nodes with the smallest probabilities become
children of a newly created parent nodendash the parent node is associated a probability equal to
the sum of the probabilities of the two chosen children
ndash the operation is repeated ignoring nodes that are already children until there is only one node
Authorrsquos Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
10
10
10 10
0110 0100 1 0101 00 1 0111 00 1
for each rose a rose is a rose
Better Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
1010
10 10
010 000 11 001 10 11 011 10 11
for each rose a rose is a rose
Authorrsquos Canonical Huffman Coding
bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree
wherexi = symbols
yi = numerical value of first symbol
rose
isldquo ldquo
a
each for
10
10
10
10 10
0010 0000 1 0001 00 1 0011 00 1
for each rose a rose is a rose
S = ((1 1) (1 1) (0 infin) (4 0)
Byte-Oriented Huffman Coding
Tree has branching factor of 256
Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)
Characteristicsndash Decompression is faster than for plain Huffman
codingndash Compression ratios are better than for Ziv-Lempel
family of codingsndash Allows direct searching on compressed text
Dictionary Methods
Basic conceptsndash replacing groups of consecutive symbols with a
pointer to an entry in a dictionaryndash the pointer representations are references to entries
in a dictionary composed of a list of symbols that are expected to occur frequently
ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace
ndash modeling and coding does not existndash there are no explicit probabilities associated to
phrases
Dictionary Methods
Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding
bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary
bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character
ndash main problembull the dictionary might be suitable for one text but unsuitable
for another
Dictionary Methods
Adaptive dictionary methodsndash Ziv-Lempel
bull placing strings of characters with a reference to a previous occurrence of the string
bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved
Ziv-Lempel Code
Characteristicsndash identifying each text segment the first time it appears
and then simply pointing back to this first occurrence rather than repeating the segment
ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned
ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character
Ziv-Lempel Code
LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt
bull a identifies how far back in the decoded text to look for the upcoming text segment
bull b tells how many characters to copy for the upcoming segment
bull c a new character to add to complete the next segment
bull ex) lt82rgt lt00pgt
Ziv-Lempel Code
An example (decoding)lt00pgt p
lt00egt pe
lt00tgt pet
lt21rgt peter
lt00_gt peter_
lt61igt peter_pi
lt82rgt peter_piper
lt63cgt peter_piper_pic
lt00kgt peter_piper_pick
lt71dgt peter_piper_picked
helliphelliphellip
Dictionary Methods
Adaptive dictionary methodsndash Disadvantages over the Huffman method
bull no random access does not allow decoding to start in the middle of a compression file
Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment
Comparing Text Compression Techniques
Arithmetic Character Huffman
Word Huffman
Ziv-Lempel
Compression Ratio
very good poor very good good
Compression Speed
slow fast fast very fast
Decompression Speed
slow fast very fast very fast
Memory Space low low high moderate
Compressed pattern matching
no yes yes yes
Random Access
no yes yes no
Index Term Selection
Index terms selectionndash not all words are equally significant for representing
the semantics of a document
Manual selectionndash selection of index terms is usually done by specialist
Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into
a single indexing component (or concept)ndash ex) computer science
ThesauriWhat is the ldquothesaurusrdquo
ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity
relationshipndash a controlled vocabulary for the indexing and searching
Main purposesndash provide a standard vocabulary for indexing and
searchingndash assist users with locating terms for proper query
formulationndash provide classified hierarchies that allow the
broadening and narrowing of the current query request
Thesauri
Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases
bull ex) building teaching ballistic missiles body temperature
ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation
bull ex) seal (marine animals) seal (documents)
Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT
(Related Term)
Text OperationsCoding Compression Methods
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical MethodsCompression models
ndash adaptive model progressively learn about the statistical distribution as the compression process goes on
bull decompression of a file has to start from its beginning
ndash static model assume an average distribution for all input texts
bull poor compression ratios when data deviates from initial distribution assumptions
ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned
bull information on the data distribution must be stored
Statistical Methods
Word-based compression modelndash take words instead of characters as symbols
Reasons to use this model in an IR contextndash much better compression rates
ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
ndash words are the atoms on which most IR systems are built
ndash word frequencies are useful in answering queries involving combinations of words
ndash the best strategy is to start with the least frequent words first
Statistical Methods
Codingndash the task of obtaining the representation of a symbol
based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and
long codes to unlikely ones
Two statistical coding strategiesndash Huffman coding
bull a variable-length encoding in bits for each symbolbull relatively fast allows random access
ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access
Huffman Coding
Building a Huffman treendash for each symbol of the alphabet create a node
containing the symbol and its probabilityndash the two nodes with the smallest probabilities become
children of a newly created parent nodendash the parent node is associated a probability equal to
the sum of the probabilities of the two chosen children
ndash the operation is repeated ignoring nodes that are already children until there is only one node
Authorrsquos Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
10
10
10 10
0110 0100 1 0101 00 1 0111 00 1
for each rose a rose is a rose
Better Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
1010
10 10
010 000 11 001 10 11 011 10 11
for each rose a rose is a rose
Authorrsquos Canonical Huffman Coding
bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree
wherexi = symbols
yi = numerical value of first symbol
rose
isldquo ldquo
a
each for
10
10
10
10 10
0010 0000 1 0001 00 1 0011 00 1
for each rose a rose is a rose
S = ((1 1) (1 1) (0 infin) (4 0)
Byte-Oriented Huffman Coding
Tree has branching factor of 256
Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)
Characteristicsndash Decompression is faster than for plain Huffman
codingndash Compression ratios are better than for Ziv-Lempel
family of codingsndash Allows direct searching on compressed text
Dictionary Methods
Basic conceptsndash replacing groups of consecutive symbols with a
pointer to an entry in a dictionaryndash the pointer representations are references to entries
in a dictionary composed of a list of symbols that are expected to occur frequently
ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace
ndash modeling and coding does not existndash there are no explicit probabilities associated to
phrases
Dictionary Methods
Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding
bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary
bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character
ndash main problembull the dictionary might be suitable for one text but unsuitable
for another
Dictionary Methods
Adaptive dictionary methodsndash Ziv-Lempel
bull placing strings of characters with a reference to a previous occurrence of the string
bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved
Ziv-Lempel Code
Characteristicsndash identifying each text segment the first time it appears
and then simply pointing back to this first occurrence rather than repeating the segment
ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned
ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character
Ziv-Lempel Code
LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt
bull a identifies how far back in the decoded text to look for the upcoming text segment
bull b tells how many characters to copy for the upcoming segment
bull c a new character to add to complete the next segment
bull ex) lt82rgt lt00pgt
Ziv-Lempel Code
An example (decoding)lt00pgt p
lt00egt pe
lt00tgt pet
lt21rgt peter
lt00_gt peter_
lt61igt peter_pi
lt82rgt peter_piper
lt63cgt peter_piper_pic
lt00kgt peter_piper_pick
lt71dgt peter_piper_picked
helliphelliphellip
Dictionary Methods
Adaptive dictionary methodsndash Disadvantages over the Huffman method
bull no random access does not allow decoding to start in the middle of a compression file
Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment
Comparing Text Compression Techniques
Arithmetic Character Huffman
Word Huffman
Ziv-Lempel
Compression Ratio
very good poor very good good
Compression Speed
slow fast fast very fast
Decompression Speed
slow fast very fast very fast
Memory Space low low high moderate
Compressed pattern matching
no yes yes yes
Random Access
no yes yes no
ThesauriWhat is the ldquothesaurusrdquo
ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity
relationshipndash a controlled vocabulary for the indexing and searching
Main purposesndash provide a standard vocabulary for indexing and
searchingndash assist users with locating terms for proper query
formulationndash provide classified hierarchies that allow the
broadening and narrowing of the current query request
Thesauri
Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases
bull ex) building teaching ballistic missiles body temperature
ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation
bull ex) seal (marine animals) seal (documents)
Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT
(Related Term)
Text OperationsCoding Compression Methods
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical MethodsCompression models
ndash adaptive model progressively learn about the statistical distribution as the compression process goes on
bull decompression of a file has to start from its beginning
ndash static model assume an average distribution for all input texts
bull poor compression ratios when data deviates from initial distribution assumptions
ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned
bull information on the data distribution must be stored
Statistical Methods
Word-based compression modelndash take words instead of characters as symbols
Reasons to use this model in an IR contextndash much better compression rates
ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
ndash words are the atoms on which most IR systems are built
ndash word frequencies are useful in answering queries involving combinations of words
ndash the best strategy is to start with the least frequent words first
Statistical Methods
Codingndash the task of obtaining the representation of a symbol
based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and
long codes to unlikely ones
Two statistical coding strategiesndash Huffman coding
bull a variable-length encoding in bits for each symbolbull relatively fast allows random access
ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access
Huffman Coding
Building a Huffman treendash for each symbol of the alphabet create a node
containing the symbol and its probabilityndash the two nodes with the smallest probabilities become
children of a newly created parent nodendash the parent node is associated a probability equal to
the sum of the probabilities of the two chosen children
ndash the operation is repeated ignoring nodes that are already children until there is only one node
Authorrsquos Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
10
10
10 10
0110 0100 1 0101 00 1 0111 00 1
for each rose a rose is a rose
Better Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
1010
10 10
010 000 11 001 10 11 011 10 11
for each rose a rose is a rose
Authorrsquos Canonical Huffman Coding
bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree
wherexi = symbols
yi = numerical value of first symbol
rose
isldquo ldquo
a
each for
10
10
10
10 10
0010 0000 1 0001 00 1 0011 00 1
for each rose a rose is a rose
S = ((1 1) (1 1) (0 infin) (4 0)
Byte-Oriented Huffman Coding
Tree has branching factor of 256
Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)
Characteristicsndash Decompression is faster than for plain Huffman
codingndash Compression ratios are better than for Ziv-Lempel
family of codingsndash Allows direct searching on compressed text
Dictionary Methods
Basic conceptsndash replacing groups of consecutive symbols with a
pointer to an entry in a dictionaryndash the pointer representations are references to entries
in a dictionary composed of a list of symbols that are expected to occur frequently
ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace
ndash modeling and coding does not existndash there are no explicit probabilities associated to
phrases
Dictionary Methods
Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding
bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary
bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character
ndash main problembull the dictionary might be suitable for one text but unsuitable
for another
Dictionary Methods
Adaptive dictionary methodsndash Ziv-Lempel
bull placing strings of characters with a reference to a previous occurrence of the string
bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved
Ziv-Lempel Code
Characteristicsndash identifying each text segment the first time it appears
and then simply pointing back to this first occurrence rather than repeating the segment
ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned
ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character
Ziv-Lempel Code
LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt
bull a identifies how far back in the decoded text to look for the upcoming text segment
bull b tells how many characters to copy for the upcoming segment
bull c a new character to add to complete the next segment
bull ex) lt82rgt lt00pgt
Ziv-Lempel Code
An example (decoding)lt00pgt p
lt00egt pe
lt00tgt pet
lt21rgt peter
lt00_gt peter_
lt61igt peter_pi
lt82rgt peter_piper
lt63cgt peter_piper_pic
lt00kgt peter_piper_pick
lt71dgt peter_piper_picked
helliphelliphellip
Dictionary Methods
Adaptive dictionary methodsndash Disadvantages over the Huffman method
bull no random access does not allow decoding to start in the middle of a compression file
Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment
Comparing Text Compression Techniques
Arithmetic Character Huffman
Word Huffman
Ziv-Lempel
Compression Ratio
very good poor very good good
Compression Speed
slow fast fast very fast
Decompression Speed
slow fast very fast very fast
Memory Space low low high moderate
Compressed pattern matching
no yes yes yes
Random Access
no yes yes no
Thesauri
Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases
bull ex) building teaching ballistic missiles body temperature
ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation
bull ex) seal (marine animals) seal (documents)
Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT
(Related Term)
Text OperationsCoding Compression Methods
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical MethodsCompression models
ndash adaptive model progressively learn about the statistical distribution as the compression process goes on
bull decompression of a file has to start from its beginning
ndash static model assume an average distribution for all input texts
bull poor compression ratios when data deviates from initial distribution assumptions
ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned
bull information on the data distribution must be stored
Statistical Methods
Word-based compression modelndash take words instead of characters as symbols
Reasons to use this model in an IR contextndash much better compression rates
ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
ndash words are the atoms on which most IR systems are built
ndash word frequencies are useful in answering queries involving combinations of words
ndash the best strategy is to start with the least frequent words first
Statistical Methods
Codingndash the task of obtaining the representation of a symbol
based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and
long codes to unlikely ones
Two statistical coding strategiesndash Huffman coding
bull a variable-length encoding in bits for each symbolbull relatively fast allows random access
ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access
Huffman Coding
Building a Huffman treendash for each symbol of the alphabet create a node
containing the symbol and its probabilityndash the two nodes with the smallest probabilities become
children of a newly created parent nodendash the parent node is associated a probability equal to
the sum of the probabilities of the two chosen children
ndash the operation is repeated ignoring nodes that are already children until there is only one node
Authorrsquos Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
10
10
10 10
0110 0100 1 0101 00 1 0111 00 1
for each rose a rose is a rose
Better Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
1010
10 10
010 000 11 001 10 11 011 10 11
for each rose a rose is a rose
Authorrsquos Canonical Huffman Coding
bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree
wherexi = symbols
yi = numerical value of first symbol
rose
isldquo ldquo
a
each for
10
10
10
10 10
0010 0000 1 0001 00 1 0011 00 1
for each rose a rose is a rose
S = ((1 1) (1 1) (0 infin) (4 0)
Byte-Oriented Huffman Coding
Tree has branching factor of 256
Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)
Characteristicsndash Decompression is faster than for plain Huffman
codingndash Compression ratios are better than for Ziv-Lempel
family of codingsndash Allows direct searching on compressed text
Dictionary Methods
Basic conceptsndash replacing groups of consecutive symbols with a
pointer to an entry in a dictionaryndash the pointer representations are references to entries
in a dictionary composed of a list of symbols that are expected to occur frequently
ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace
ndash modeling and coding does not existndash there are no explicit probabilities associated to
phrases
Dictionary Methods
Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding
bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary
bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character
ndash main problembull the dictionary might be suitable for one text but unsuitable
for another
Dictionary Methods
Adaptive dictionary methodsndash Ziv-Lempel
bull placing strings of characters with a reference to a previous occurrence of the string
bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved
Ziv-Lempel Code
Characteristicsndash identifying each text segment the first time it appears
and then simply pointing back to this first occurrence rather than repeating the segment
ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned
ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character
Ziv-Lempel Code
LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt
bull a identifies how far back in the decoded text to look for the upcoming text segment
bull b tells how many characters to copy for the upcoming segment
bull c a new character to add to complete the next segment
bull ex) lt82rgt lt00pgt
Ziv-Lempel Code
An example (decoding)lt00pgt p
lt00egt pe
lt00tgt pet
lt21rgt peter
lt00_gt peter_
lt61igt peter_pi
lt82rgt peter_piper
lt63cgt peter_piper_pic
lt00kgt peter_piper_pick
lt71dgt peter_piper_picked
helliphelliphellip
Dictionary Methods
Adaptive dictionary methodsndash Disadvantages over the Huffman method
bull no random access does not allow decoding to start in the middle of a compression file
Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment
Comparing Text Compression Techniques
Arithmetic Character Huffman
Word Huffman
Ziv-Lempel
Compression Ratio
very good poor very good good
Compression Speed
slow fast fast very fast
Decompression Speed
slow fast very fast very fast
Memory Space low low high moderate
Compressed pattern matching
no yes yes yes
Random Access
no yes yes no
Text OperationsCoding Compression Methods
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical MethodsCompression models
ndash adaptive model progressively learn about the statistical distribution as the compression process goes on
bull decompression of a file has to start from its beginning
ndash static model assume an average distribution for all input texts
bull poor compression ratios when data deviates from initial distribution assumptions
ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned
bull information on the data distribution must be stored
Statistical Methods
Word-based compression modelndash take words instead of characters as symbols
Reasons to use this model in an IR contextndash much better compression rates
ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
ndash words are the atoms on which most IR systems are built
ndash word frequencies are useful in answering queries involving combinations of words
ndash the best strategy is to start with the least frequent words first
Statistical Methods
Codingndash the task of obtaining the representation of a symbol
based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and
long codes to unlikely ones
Two statistical coding strategiesndash Huffman coding
bull a variable-length encoding in bits for each symbolbull relatively fast allows random access
ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access
Huffman Coding
Building a Huffman treendash for each symbol of the alphabet create a node
containing the symbol and its probabilityndash the two nodes with the smallest probabilities become
children of a newly created parent nodendash the parent node is associated a probability equal to
the sum of the probabilities of the two chosen children
ndash the operation is repeated ignoring nodes that are already children until there is only one node
Authorrsquos Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
10
10
10 10
0110 0100 1 0101 00 1 0111 00 1
for each rose a rose is a rose
Better Huffman Coding
Example ldquofor each rose a rose is a roserdquo
rose
isldquo ldquo
a
each for
10
1010
10 10
010 000 11 001 10 11 011 10 11
for each rose a rose is a rose
Authorrsquos Canonical Huffman Coding
bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree
wherexi = symbols
yi = numerical value of first symbol
rose
isldquo ldquo
a
each for
10
10
10
10 10
0010 0000 1 0001 00 1 0011 00 1
for each rose a rose is a rose
S = ((1 1) (1 1) (0 infin) (4 0)
Byte-Oriented Huffman Coding
Tree has branching factor of 256
Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)
Characteristicsndash Decompression is faster than for plain Huffman
codingndash Compression ratios are better than for Ziv-Lempel
family of codingsndash Allows direct searching on compressed text
Dictionary Methods
Basic conceptsndash replacing groups of consecutive symbols with a
pointer to an entry in a dictionaryndash the pointer representations are references to entries
in a dictionary composed of a list of symbols that are expected to occur frequently
ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace
ndash modeling and coding does not existndash there are no explicit probabilities associated to
phrases
Dictionary Methods
Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding
bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary
bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character
ndash main problembull the dictionary might be suitable for one text but unsuitable
for another
Dictionary Methods
Adaptive dictionary methodsndash Ziv-Lempel
bull placing strings of characters with a reference to a previous occurrence of the string
bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved
Ziv-Lempel Code
Characteristicsndash identifying each text segment the first time it appears
and then simply pointing back to this first occurrence rather than repeating the segment
ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned
ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character
Ziv-Lempel Code
LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt
bull a identifies how far back in the decoded text to look for the upcoming text segment
bull b tells how many characters to copy for the upcoming segment
bull c a new character to add to complete the next segment
bull ex) lt82rgt lt00pgt
Ziv-Lempel Code
An example (decoding)lt00pgt p
lt00egt pe
lt00tgt pet
lt21rgt peter
lt00_gt peter_
lt61igt peter_pi
lt82rgt peter_piper
lt63cgt peter_piper_pic
lt00kgt peter_piper_pick
lt71dgt peter_piper_picked
helliphelliphellip
Dictionary Methods
Adaptive dictionary methodsndash Disadvantages over the Huffman method
bull no random access does not allow decoding to start in the middle of a compression file
Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment
Comparing Text Compression Techniques
Arithmetic Character Huffman
Word Huffman
Ziv-Lempel
Compression Ratio
very good poor very good good
Compression Speed
slow fast fast very fast
Decompression Speed
slow fast very fast very fast
Memory Space low low high moderate
Compressed pattern matching
no yes yes yes
Random Access
no yes yes no
Text CompressionMotivation
ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements
IO overhead communication delaysndash obstacle need for IR systems to access text
randomly bull to access a given word in some forms of compressed text
the entire text must be decoded from the beginning until the desired word is reached
Two strategiesndash statistical methodsndash dictionary methods
Statistical MethodsBasic concepts
ndash Modeling a probability is estimated for each symbol
ndash Coding a code is assigned to each symbol based on the model
ndash shorter codes are assigned to the most likely symbols
Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)
bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits
Statistical MethodsCompression models
ndash adaptive model progressively learn about the statistical distribution as the compression process goes on
bull decompression of a file has to start from its beginning
ndash static model assume an average distribution for all input texts
bull poor compression ratios when data deviates from initial distribution assumptions
ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned
bull information on the data distribution must be stored
Statistical Methods
Word-based compression modelndash take words instead of characters as symbols
Reasons to use this model in an IR contextndash much better compression rates
ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters
ndash words are the atoms on which most IR systems are built
ndash word frequencies are useful in answering queries involving combinations of words
ndash the best strategy is to start with the least frequent words first
Statistical Methods
Codingndash the task of obtaining the representation of a symbol
based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and
long codes to unlikely ones
Two statistical coding strategiesndash Huffman coding
bull a variable-length encoding in bits for each symbolbull relatively fast allows random access
ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access
Huffman Coding

Building a Huffman tree
– for each symbol of the alphabet, create a node containing the symbol and its probability
– the two nodes with the smallest probabilities become children of a newly created parent node
– the parent node is assigned a probability equal to the sum of the probabilities of its two children
– the operation is repeated, ignoring nodes that are already children, until only one node remains
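The construction above is easy to express with a priority queue. The following is an illustrative sketch, not the slides' own code; helper names are my own, and tie-breaking between equal frequencies means the resulting codes may differ from the examples below while remaining optimal:

import heapq
from itertools import count

def huffman_codes(freqs):
    order = count()                       # tie-breaker so dicts are never compared
    heap = [(f, next(order), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two smallest probabilities...
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        # ...become children of a new parent whose weight is their sum
        heapq.heappush(heap, (f1 + f2, next(order), merged))
    return heap[0][2]

print(huffman_codes({"rose": 3, "a": 2, ",": 1, "each": 1, "for": 1, "is": 1}))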
Author's Huffman Coding

Example: "for each rose, a rose is a rose"
(the separator "," is treated as a symbol of its own)

[Huffman tree diagram: internal nodes with edges labelled 1/0; leaves: rose, a, ",", is, each, for]

Resulting codes: rose=1, a=00, each=0100, ","=0101, for=0110, is=0111

encoded text:
0110 0100 1    0101 00   1    0111 00   1
for  each rose ,    a    rose is   a    rose
Better Huffman Coding

Example: "for each rose, a rose is a rose"

[Huffman tree diagram: a more balanced tree over the same leaves rose, a, ",", is, each, for]

Resulting codes: rose=11, a=10, each=000, ","=001, for=010, is=011

encoded text:
010 000  11   001 10   11   011 10   11
for each rose ,   a    rose is  a    rose
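Because no codeword is a prefix of another, the bit stream above can be decoded greedily, left to right. A small sketch (the code table is taken from this example; function and variable names are my own):

code = {"for": "010", "each": "000", "rose": "11", ",": "001", "a": "10", "is": "011"}
decode = {bits: sym for sym, bits in code.items()}

def decode_bits(stream):
    out, buf = [], ""
    for bit in stream:
        buf += bit
        if buf in decode:            # no codeword is a prefix of another,
            out.append(decode[buf])  # so the first match is the right one
            buf = ""
    return out

print(decode_bits("0100001100110110111011"))
# ['for', 'each', 'rose', ',', 'a', 'rose', 'is', 'a', 'rose']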
Author's Canonical Huffman Coding

• the height of the left subtree is never shorter than that of the right subtree
• S: an ordered sequence of pairs (xi, yi), one per level of the tree (from the root down), where
  xi = the number of symbols at that level
  yi = the numerical value of the first code at that level

[Tree diagram: the same leaves rose, a, ",", is, each, for, rearranged into canonical form]

Resulting codes: rose=1, a=01, each=0000, ","=0001, for=0010, is=0011

encoded text:
0010 0000 1    0001 01   1    0011 01   1
for  each rose ,    a    rose is   a    rose

S = ((1, 1), (1, 1), (0, ∞), (4, 0))
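Canonical codes can be assigned level by level without storing the tree, starting from the deepest level. A sketch under the slides' convention (codes at the deepest level start at 0; the order of symbols within a level is a free choice, so it may differ from the example above; names are my own):

def canonical_codes(lengths):
    # lengths: {symbol: code length in bits}, as produced by a Huffman tree.
    by_len = {}
    for sym, n in lengths.items():
        by_len.setdefault(n, []).append(sym)
    out, code = {}, 0
    for n in range(max(by_len), 0, -1):   # deepest level first
        for sym in sorted(by_len.get(n, [])):
            out[sym] = format(code, f"0{n}b")
            code += 1
        code >>= 1                        # move one level up the tree
    return out

print(canonical_codes({"rose": 1, "a": 2, "each": 4, ",": 4, "for": 4, "is": 4}))
# {',': '0000', 'each': '0001', 'for': '0010', 'is': '0011', 'a': '01', 'rose': '1'}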
Byte-Oriented Huffman Coding

Tree has a branching factor of 256

Ensure no empty nodes in the higher levels of the tree
– number of elements at the bottom level = 1 + ((v − 256) mod 255), where v is the vocabulary size

Characteristics
– decompression is faster than for plain (binary) Huffman coding
– compression ratios are better than for the Ziv-Lempel family of codings
– allows direct searching on the compressed text
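A quick check of the bottom-level formula (the vocabulary sizes v below are hypothetical, chosen only for illustration):

for v in (256, 1000, 100_000):
    # number of symbols placed at the deepest level so no internal node is left empty
    print(v, 1 + ((v - 256) % 255))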
Dictionary Methods

Basic concepts
– replace groups of consecutive symbols with a pointer to an entry in a dictionary
– the pointer representations are references to entries in a dictionary composed of a list of symbols that are expected to occur frequently
– pointers to the dictionary entries are chosen so that they need less space than the phrases they replace
– there is no separate modeling and coding step: no explicit probabilities are associated with phrases
Dictionary Methods

Static dictionary methods
– selected pairs of letters are replaced with codewords
– ex) digram coding
  • at each step, the next two characters are inspected to check whether they correspond to a digram in the dictionary
  • if so, they are coded together and the coding position is shifted by two characters; otherwise, the single character is represented by its normal code and the position is shifted by one character
– main problem
  • the dictionary might be suitable for one text but unsuitable for another
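A minimal sketch of digram coding (the dictionary contents and names below are my own, for illustration only):

DIGRAMS = {"th": 0, "he": 1, "in": 2, "er": 3}   # hypothetical static dictionary

def digram_encode(text):
    out, i = [], 0
    while i < len(text):
        pair = text[i:i+2]
        if pair in DIGRAMS:
            out.append(("D", DIGRAMS[pair]))   # coded together, shift by two
            i += 2
        else:
            out.append(("C", text[i]))         # normal single-character code, shift by one
            i += 1
    return out

print(digram_encode("the weather"))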
Dictionary Methods

Adaptive dictionary methods
– Ziv-Lempel
  • replaces strings of characters with a reference to a previous occurrence of the string
  • if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces, compression is achieved
Ziv-Lempel Code

Characteristics
– identifies each text segment the first time it appears, and thereafter simply points back to this first occurrence rather than repeating the segment
– an adaptive model of coding, with increasingly long text segments encoded as the text is scanned
– pointers require less space than the repeated text segments, giving higher compression than the Huffman codes: roughly 4 bits per character
Ziv-Lempel Code

LZ77 – Gzip encoding
– the code consists of a sequence of triples <a, b, c>
  • a: identifies how far back in the decoded text to look for the upcoming text segment
  • b: tells how many characters to copy for the upcoming segment
  • c: a new character to add to complete the next segment
  • ex) <8,2,r>, <0,0,p>
Ziv-Lempel Code

An example (decoding):
  <0,0,p>   p
  <0,0,e>   pe
  <0,0,t>   pet
  <2,1,r>   peter
  <0,0,_>   peter_
  <6,1,i>   peter_pi
  <8,2,r>   peter_piper
  <6,3,c>   peter_piper_pic
  <0,0,k>   peter_piper_pick
  <7,1,d>   peter_piper_picked
  ...
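The decoding procedure above is straightforward to implement. A minimal sketch (function and variable names are my own) that reproduces the example:

def lz77_decode(triples):
    text = ""
    for back, length, ch in triples:
        start = len(text) - back
        for i in range(length):          # byte-by-byte copy also handles overlaps
            text += text[start + i]
        text += ch
    return text

triples = [(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
           (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d")]
print(lz77_decode(triples))   # -> peter_piper_picked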
Dictionary Methods

Adaptive dictionary methods
– disadvantages over the Huffman method
  • no random access: decoding cannot start in the middle of a compressed file

Dictionary schemes are popular for their speed and low memory use, but statistical methods are more common in an IR environment.
Comparing Text Compression Techniques

                               Arithmetic | Character Huffman | Word Huffman | Ziv-Lempel
  Compression ratio            very good  | poor              | very good    | good
  Compression speed            slow       | fast              | fast         | very fast
  Decompression speed          slow       | fast              | very fast    | very fast
  Memory space                 low        | low               | high         | moderate
  Compressed pattern matching  no         | yes               | yes          | yes
  Random access                no         | yes               | yes          | no