Text Operations: Preprocessing and Compression


Introduction

Document preprocessing
– to improve the precision of the documents retrieved
– lexical analysis, stopword elimination, stemming, index term selection, thesauri
– build a thesaurus

Text compression
– to improve the efficiency of the retrieval process
– statistical methods vs. dictionary methods
– inverted file compression

Document Preprocessing

Lexical analysis of the text
– digits, hyphens, punctuation marks, the case of letters

Elimination of stopwords
– filtering out words that are useless for retrieval purposes

Stemming
– dealing with the syntactic variations of query terms

Index term selection
– determining the terms to be used as index terms

Thesauri
– expansion of the original query with related terms

The Process of Preprocessing

[Figure: the preprocessing pipeline. Documents pass through structure recognition, accent/spacing normalization, stopword elimination, noun-group detection, stemming, and (optionally) manual indexing; the outputs are document structure, full text, and index terms.]

Lexical Analysis of the Text

Four particular cases

Numbers
• usually not good index terms because of their vagueness
• need some advanced lexical analysis procedure
  – ex) 510B.C., 4105-1201-2310-2213, 2000/2/12, …

Hyphens
• breaking up hyphenated words might be useful
  – ex) state-of-the-art → state of the art (good)
  – but B-49 → B 49 (?)
• need to adopt a general rule and to specify exceptions on a case-by-case basis

Lexical Analysis of the Text

Punctuation marks
– removed entirely
  • ex) 510B.C. → 510BC
  • if the query contains '510B.C.', removing the dot both in the query term and in the documents will not affect retrieval performance
– require the preparation of a list of exceptions
  • ex) val.id → valid (?)

The case of letters
– convert all the text to either lower or upper case
– part of the semantics might be lost
  • ex) Northwestern University → northwestern university (?)
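Taken together, these rules are easy to prototype. Below is a minimal sketch in Python of the lexical-analysis policy just described (fold case, split hyphenated words, strip punctuation, drop bare numbers); the regular expression and the absence of an exception list (e.g., for B-49) are our simplifications, not the slides' actual procedure.

    import re

    def tokenize(text):
        """Toy lexical analyzer: case folding, hyphen splitting,
        punctuation removal, and elimination of bare numbers."""
        tokens = []
        for raw in text.split():
            for part in raw.split("-"):                           # break up hyphenated words
                term = re.sub(r"[^0-9a-zA-Z]", "", part).lower()  # strip punctuation, fold case
                if term and not term.isdigit():                   # drop purely numeric terms
                    tokens.append(term)
        return tokens

    print(tokenize("State-of-the-art IR, since 510B.C.!"))
    # ['state', 'of', 'the', 'art', 'ir', 'since', '510bc']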

Elimination of Stopwords

Basic concept
– filtering out words with very low discrimination values
  • ex) a, the, this, that, where, when, …

Advantage
– reduces the size of the indexing structure considerably

Disadvantage
– might reduce recall as well
  • ex) "to be or not to be" (see the sketch below)
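A minimal sketch of stopword elimination (the stopword list below is a tiny hypothetical sample; production systems use much larger curated lists). Note how it reproduces the recall problem from the slide: the famous query disappears entirely.

    STOPWORDS = {"a", "the", "this", "that", "where", "when",
                 "to", "be", "or", "not"}  # hypothetical, deliberately small

    def remove_stopwords(tokens):
        """Filter out words with very low discrimination value."""
        return [t for t in tokens if t.lower() not in STOPWORDS]

    print(remove_stopwords("to be or not to be".split()))  # [] - nothing left to match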

Stemming

What is a "stem"?
– the portion of a word that is left after the removal of its affixes (i.e., prefixes and suffixes)
– ex) 'connect' is the stem for the variants 'connected', 'connecting', 'connection', 'connections'

Effects of stemming
– reduces variants of the same root to a common concept (a sketch follows below)
– reduces the size of the indexing structure
– there is controversy about the benefits of stemming
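A minimal sketch of suffix-stripping stemming, assuming a tiny hand-picked suffix list (this is not the Porter algorithm, only an illustration of the idea); it maps the slide's 'connect' variants to a single stem.

    SUFFIXES = ["ions", "ing", "ion", "ed", "s"]  # checked longest-first

    def stem(word):
        """Strip the first matching suffix, keeping a stem of >= 3 letters."""
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    for w in ["connected", "connecting", "connection", "connections"]:
        print(w, "->", stem(w))  # every variant maps to 'connect'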

Index Term Selection

Index term selection
– not all words are equally significant for representing the semantics of a document

Manual selection
– selection of index terms is usually done by a specialist

Automatic selection of index terms
– most of the semantics is carried by the nouns
– clustering nouns which appear nearby in the text into a single indexing component (or concept)
– ex) computer science

Thesauri

What is a "thesaurus"?
– a list of important words in a given domain of knowledge
– a set of related words derived from a synonymity relationship
– a controlled vocabulary for indexing and searching

Main purposes
– provide a standard vocabulary for indexing and searching
– assist users in locating terms for proper query formulation
– provide classified hierarchies that allow broadening and narrowing of the current query request

Thesauri

Thesaurus index terms
– denote a concept, which is the basic semantic unit
– can be individual words, groups of words, or phrases
  • ex) building, teaching, ballistic missiles, body temperature
– frequently it is necessary to complement a thesaurus entry with a definition or an explanation
  • ex) seal (marine animal) vs. seal (document)

Thesaurus term relationships
– mostly composed of synonyms and near-synonyms
– BT (Broader Term), NT (Narrower Term), RT (Related Term); an illustrative entry follows below
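As a purely hypothetical illustration of these relationship types (the entry below is invented, not taken from any real thesaurus):

    missiles
      BT: weapons
      NT: ballistic missiles, cruise missiles
      RT: rockets, launchers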

Text Operations: Coding and Compression Methods

Text Compression

Motivation
– finding ways to represent the text in fewer bits
– reducing costs associated with space requirements, I/O overhead, and communication delays
– obstacle: IR systems need to access text randomly
  • to access a given word in some forms of compressed text, the entire text must be decoded from the beginning until the desired word is reached

Two strategies
– statistical methods
– dictionary methods

Statistical Methods

Basic concepts
– Modeling: a probability is estimated for each symbol
– Coding: a code is assigned to each symbol based on the model
– shorter codes are assigned to the most likely symbols

Relationship between probabilities and codes
– Source coding theorem (Claude Shannon)
  • a symbol that occurs with probability p should be assigned a code of length log2(1/p) bits
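As a worked instance of the theorem (the numbers are ours, not from the slides): a symbol occurring with probability p = 1/2 should receive a code of log2(2) = 1 bit, while a rarer symbol with p = 1/8 should receive log2(8) = 3 bits. An optimal code therefore has an average length close to the entropy of the source.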

Statistical Methods

Compression models
– adaptive model: progressively learns the statistical distribution as the compression process goes on
  • decompression of a file has to start from its beginning
– static model: assumes an average distribution for all input texts
  • poor compression ratios when the data deviates from the initial distribution assumptions
– semi-static model: learns the distribution in a first pass, then compresses the data in a second pass using a fixed code derived from the learned distribution
  • information on the data distribution must be stored

Statistical Methods

Word-based compression model
– takes words instead of characters as symbols

Reasons to use this model in an IR context
– much better compression rates
– words carry a lot of meaning in natural languages, and their distribution is much more related to the semantic structure of the text than the distribution of individual letters is
– words are the atoms on which most IR systems are built
– word frequencies are useful in answering queries involving combinations of words
– the best strategy is to start with the least frequent words first

Statistical Methods

Coding
– the task of obtaining the representation of a symbol based on a probability distribution given by a model
– main goal: assign short codes to likely symbols and long codes to unlikely ones

Two statistical coding strategies
– Huffman coding
  • a variable-length encoding in bits for each symbol
  • relatively fast; allows random access
– Arithmetic coding
  • uses an interval of real numbers between 0 and 1
  • much slower; does not allow random access

Huffman Coding

Building a Huffman tree (a runnable sketch follows below)
– for each symbol of the alphabet, create a node containing the symbol and its probability
– the two nodes with the smallest probabilities become children of a newly created parent node
– the parent node is assigned a probability equal to the sum of the probabilities of its two children
– the operation is repeated, ignoring nodes that are already children, until only one node remains
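A minimal sketch of this construction in Python, using a heap to repeatedly merge the two least likely nodes (word-based symbols, as in the examples that follow; the tie-breaking counter is our own detail, needed only to keep heap entries comparable):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        """Build a Huffman code for the words of `text`."""
        freq = Counter(text.split())
        # Heap entries: (weight, tie-breaker, tree); a tree is a leaf symbol
        # or a (left, right) pair for an internal node.
        heap = [(n, i, word) for i, (word, n) in enumerate(freq.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            # The two smallest-weight nodes become children of a new parent
            # whose weight is the sum of theirs.
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, count, (left, right)))
            count += 1
        codes = {}
        def walk(tree, prefix):  # label edges 0 (left) and 1 (right)
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    print(huffman_codes("for each rose a rose is a rose"))
    # {'is': '00', 'a': '01', 'for': '100', 'each': '101', 'rose': '11'}
    # (an equally optimal code; the exact bit patterns depend on tie-breaking)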

Author's Huffman Coding

Example: "for each rose, a rose is a rose"

[Figure: Huffman tree with leaves rose, a, ",", is, each, for; the edges out of each internal node are labeled 1 and 0.]

Resulting codes: rose = 1, a = 00, each = 0100, "," = 0101, for = 0110, is = 0111

Encoded text:
for    each   rose   ","    a    rose   is     a    rose
0110   0100   1      0101   00   1      0111   00   1

Better Huffman Coding

Example: "for each rose, a rose is a rose"

[Figure: an alternative Huffman tree for the same frequencies; a different shape, but an equally optimal total length.]

Resulting codes: rose = 11, a = 10, each = 000, "," = 001, for = 010, is = 011

Encoded text:
for   each   rose   ","   a    rose   is    a    rose
010   000    11     001   10   11     011   10   11

Author's Canonical Huffman Coding

• The height of the left subtree is never shorter than that of the right subtree
• S: ordered sequence of pairs (xi, yi), one per level of the tree, where
    xi = the number of symbols at that level
    yi = the numerical value of the first code at that level

[Figure: canonical Huffman tree for "for each rose, a rose is a rose".]

Resulting codes: rose = 1, a = 01, each = 0000, "," = 0001, for = 0010, is = 0011

Encoded text:
for    each   rose   ","    a    rose   is     a    rose
0010   0000   1      0001   01   1      0011   01   1

S = ((1, 1), (1, 1), (0, ∞), (4, 0))

Byte-Oriented Huffman Coding

The tree has a branching factor of 256.

To ensure no empty nodes in the higher levels of the tree:
  number of bottom-level elements = 1 + ((v − 256) mod 255)

Characteristics
– decompression is faster than for plain (binary) Huffman coding
– compression ratios are better than for the Ziv-Lempel family of codings
– allows direct searching on the compressed text
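A worked instance of the formula (our numbers, assuming v denotes the vocabulary size): for v = 1000 symbols, the bottom level holds 1 + ((1000 − 256) mod 255) = 1 + 234 = 235 elements.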

Dictionary Methods

Basic concepts
– replacing groups of consecutive symbols with a pointer to an entry in a dictionary
– the pointers are references to entries in a dictionary composed of a list of symbols that are expected to occur frequently
– pointers to the dictionary entries are chosen so that they need less space than the phrases they replace
– there is no separation between modeling and coding
– there are no explicit probabilities associated with phrases

Dictionary Methods

Static dictionary methods
– selected pairs of letters are replaced with codewords
– ex) digram coding (a sketch follows below)
  • at each step, the next two characters are inspected to check whether they correspond to a digram in the dictionary
  • if so, they are coded together and the coding position is shifted by two characters; otherwise, the single character is represented by its normal code and the position is shifted by one character
– main problem
  • the dictionary might be suitable for one text but unsuitable for another
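A minimal sketch of digram coding, assuming a tiny hypothetical dictionary of frequent letter pairs (a real static dictionary would be built from corpus statistics):

    DIGRAMS = {"th": 0, "he": 1, "in": 2, "er": 3}  # hypothetical dictionary

    def digram_encode(text):
        """Emit ('D', id) for dictionary digrams, ('C', ch) for plain characters."""
        out, i = [], 0
        while i < len(text):
            pair = text[i:i + 2]
            if pair in DIGRAMS:
                out.append(("D", DIGRAMS[pair]))  # coded together: shift by two
                i += 2
            else:
                out.append(("C", text[i]))        # normal code: shift by one
                i += 1
        return out

    print(digram_encode("the winner"))
    # [('D', 0), ('C', 'e'), ('C', ' '), ('C', 'w'), ('D', 2), ('C', 'n'), ('D', 3)]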

Dictionary Methods

Adaptive dictionary methods
– Ziv-Lempel
  • replaces strings of characters with a reference to a previous occurrence of the string
  • if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces, compression is achieved

Ziv-Lempel Code

Characteristics
– identifies each text segment the first time it appears, and thereafter simply points back to this first occurrence rather than repeating the segment
– an adaptive model of coding, with increasingly long text segments encoded as the text is scanned
– the pointers require less space than the repeated text segments
– higher compression than the Huffman codes: roughly 4 bits per character

Ziv-Lempel Code

LZ77 – the encoding used by Gzip
– the code consists of a sequence of triples <a, b, c>
  • a: how far back in the decoded text to look for the upcoming text segment
  • b: how many characters to copy for the upcoming segment
  • c: a new character to add to complete the next segment
  • ex) <8, 2, r>, <0, 0, p>

Ziv-Lempel Code

An example (decoding; a runnable sketch follows below):
<0,0,p>   p
<0,0,e>   pe
<0,0,t>   pet
<2,1,r>   peter
<0,0,_>   peter_
<6,1,i>   peter_pi
<8,2,r>   peter_piper
<6,3,c>   peter_piper_pic
<0,0,k>   peter_piper_pick
<7,1,d>   peter_piper_picked
…
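A minimal sketch of a decoder for these triples (our code, matching the <back, length, char> convention above; copying character by character lets a segment overlap the text being produced):

    def lz77_decode(triples):
        out = []
        for back, length, char in triples:
            start = len(out) - back
            for k in range(length):
                out.append(out[start + k])  # copy from the already-decoded text
            out.append(char)                # then append the new character
        return "".join(out)

    triples = [(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"),
               (0, 0, "_"), (6, 1, "i"), (8, 2, "r"), (6, 3, "c"),
               (0, 0, "k"), (7, 1, "d")]
    print(lz77_decode(triples))  # peter_piper_picked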

Dictionary Methods

Adaptive dictionary methods
– disadvantage relative to the Huffman method
  • no random access: decoding cannot start in the middle of a compressed file

Dictionary schemes are popular for their speed and low memory use, but statistical methods are more common in an IR environment.

Comparing Text Compression Techniques

                              Arithmetic   Character Huffman   Word Huffman   Ziv-Lempel
Compression ratio             very good    poor                very good      good
Compression speed             slow         fast                fast           very fast
Decompression speed           slow         fast                very fast      very fast
Memory space                  low          low                 high           moderate
Compressed pattern matching   no           yes                 yes            yes
Random access                 no           yes                 yes            no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 2: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Introduction

Document preprocessingndash to improve the precision of documents

retrievedndash lexical analysis stopwords elimination

stemming index term selection thesaurindash build a thesaurus

Text compressionndash to improve the efficiency of the retrieval

processndash statistical methods vs dictionary methodsndash inverted file compression

Document Preprocessing

Lexical analysis of the textndash digits hyphens punctuation marks the case of

letters

Elimination of stopwordsndash filtering out the useless words for retrieval purposes

Stemmingndash dealing with the syntactic variations of query terms

Index terms selectionndash determining the terms to be used as index terms

Thesaurindash the expansion of the original query with related term

The Process of Preprocessing

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text Index terms

Lexical Analysis of the Text

Four particular casesNumbers

bull usually not good index terms because of their vagueness

bull need some advanced lexical analysis procedurendash ex) 510BC 4105-1201-2310-2213 2000212 hellip

Hyphensbull breaking up hyphenated words might be useful

ndash ex) state-of-the-art state of the art (Good)ndash but B-49 B 49 ()

bull need to adopt a general rule and to specify exceptions on a case by case basis

Lexical Analysis of the Text

Punctuation marksndash removed entirely

bull ex) 510BC 1048774 510BCbull if the query contains lsquo510BCrsquo removal of the dot both in

query term and in the documents will not affect retrieval performance

ndash require the preparation of a list of exceptionsbull ex) valid 1048774 valid ()

The case of lettersndash converts all the text to either lower or upper casendash part of the semantics might be lost

bull Northwestern University 1048774 northwestern university ()

Elimination of Stopwords

Basic conceptndash filtering out words with very low discrimination

valuesbull ex) a the this that where when hellip

Advantagendash reduce the size of the indexing structure

considerably

Disadvantagendash might reduce recall as well

bull ex) to be or not to be

Stemming

What is the ldquostemrdquondash the portion of a word which is left after the removal of

its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo

lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo

Effect of stemmingndash reduce variants of the same root to a common

conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming

Index Term Selection

Index terms selectionndash not all words are equally significant for representing

the semantics of a document

Manual selectionndash selection of index terms is usually done by specialist

Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into

a single indexing component (or concept)ndash ex) computer science

ThesauriWhat is the ldquothesaurusrdquo

ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity

relationshipndash a controlled vocabulary for the indexing and searching

Main purposesndash provide a standard vocabulary for indexing and

searchingndash assist users with locating terms for proper query

formulationndash provide classified hierarchies that allow the

broadening and narrowing of the current query request

Thesauri

Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases

bull ex) building teaching ballistic missiles body temperature

ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation

bull ex) seal (marine animals) seal (documents)

Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT

(Related Term)

Text OperationsCoding Compression Methods

Text CompressionMotivation

ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements

IO overhead communication delaysndash obstacle need for IR systems to access text

randomly bull to access a given word in some forms of compressed text

the entire text must be decoded from the beginning until the desired word is reached

Two strategiesndash statistical methodsndash dictionary methods

Statistical MethodsBasic concepts

ndash Modeling a probability is estimated for each symbol

ndash Coding a code is assigned to each symbol based on the model

ndash shorter codes are assigned to the most likely symbols

Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)

bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits

Statistical MethodsCompression models

ndash adaptive model progressively learn about the statistical distribution as the compression process goes on

bull decompression of a file has to start from its beginning

ndash static model assume an average distribution for all input texts

bull poor compression ratios when data deviates from initial distribution assumptions

ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned

bull information on the data distribution must be stored

Statistical Methods

Word-based compression modelndash take words instead of characters as symbols

Reasons to use this model in an IR contextndash much better compression rates

ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters

ndash words are the atoms on which most IR systems are built

ndash word frequencies are useful in answering queries involving combinations of words

ndash the best strategy is to start with the least frequent words first

Statistical Methods

Codingndash the task of obtaining the representation of a symbol

based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and

long codes to unlikely ones

Two statistical coding strategiesndash Huffman coding

bull a variable-length encoding in bits for each symbolbull relatively fast allows random access

ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access

Huffman Coding

Building a Huffman treendash for each symbol of the alphabet create a node

containing the symbol and its probabilityndash the two nodes with the smallest probabilities become

children of a newly created parent nodendash the parent node is associated a probability equal to

the sum of the probabilities of the two chosen children

ndash the operation is repeated ignoring nodes that are already children until there is only one node

Authorrsquos Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

10

10

10 10

0110 0100 1 0101 00 1 0111 00 1

for each rose a rose is a rose

Better Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

1010

10 10

010 000 11 001 10 11 011 10 11

for each rose a rose is a rose

Authorrsquos Canonical Huffman Coding

bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree

wherexi = symbols

yi = numerical value of first symbol

rose

isldquo ldquo

a

each for

10

10

10

10 10

0010 0000 1 0001 00 1 0011 00 1

for each rose a rose is a rose

S = ((1 1) (1 1) (0 infin) (4 0)

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 3: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Document Preprocessing

Lexical analysis of the textndash digits hyphens punctuation marks the case of

letters

Elimination of stopwordsndash filtering out the useless words for retrieval purposes

Stemmingndash dealing with the syntactic variations of query terms

Index terms selectionndash determining the terms to be used as index terms

Thesaurindash the expansion of the original query with related term

The Process of Preprocessing

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text Index terms

Lexical Analysis of the Text

Four particular casesNumbers

bull usually not good index terms because of their vagueness

bull need some advanced lexical analysis procedurendash ex) 510BC 4105-1201-2310-2213 2000212 hellip

Hyphensbull breaking up hyphenated words might be useful

ndash ex) state-of-the-art state of the art (Good)ndash but B-49 B 49 ()

bull need to adopt a general rule and to specify exceptions on a case by case basis

Lexical Analysis of the Text

Punctuation marksndash removed entirely

bull ex) 510BC 1048774 510BCbull if the query contains lsquo510BCrsquo removal of the dot both in

query term and in the documents will not affect retrieval performance

ndash require the preparation of a list of exceptionsbull ex) valid 1048774 valid ()

The case of lettersndash converts all the text to either lower or upper casendash part of the semantics might be lost

bull Northwestern University 1048774 northwestern university ()

Elimination of Stopwords

Basic conceptndash filtering out words with very low discrimination

valuesbull ex) a the this that where when hellip

Advantagendash reduce the size of the indexing structure

considerably

Disadvantagendash might reduce recall as well

bull ex) to be or not to be

Stemming

What is the ldquostemrdquondash the portion of a word which is left after the removal of

its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo

lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo

Effect of stemmingndash reduce variants of the same root to a common

conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming

Index Term Selection

Index terms selectionndash not all words are equally significant for representing

the semantics of a document

Manual selectionndash selection of index terms is usually done by specialist

Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into

a single indexing component (or concept)ndash ex) computer science

ThesauriWhat is the ldquothesaurusrdquo

ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity

relationshipndash a controlled vocabulary for the indexing and searching

Main purposesndash provide a standard vocabulary for indexing and

searchingndash assist users with locating terms for proper query

formulationndash provide classified hierarchies that allow the

broadening and narrowing of the current query request

Thesauri

Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases

bull ex) building teaching ballistic missiles body temperature

ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation

bull ex) seal (marine animals) seal (documents)

Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT

(Related Term)

Text OperationsCoding Compression Methods

Text CompressionMotivation

ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements

IO overhead communication delaysndash obstacle need for IR systems to access text

randomly bull to access a given word in some forms of compressed text

the entire text must be decoded from the beginning until the desired word is reached

Two strategiesndash statistical methodsndash dictionary methods

Statistical MethodsBasic concepts

ndash Modeling a probability is estimated for each symbol

ndash Coding a code is assigned to each symbol based on the model

ndash shorter codes are assigned to the most likely symbols

Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)

bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits

Statistical MethodsCompression models

ndash adaptive model progressively learn about the statistical distribution as the compression process goes on

bull decompression of a file has to start from its beginning

ndash static model assume an average distribution for all input texts

bull poor compression ratios when data deviates from initial distribution assumptions

ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned

bull information on the data distribution must be stored

Statistical Methods

Word-based compression modelndash take words instead of characters as symbols

Reasons to use this model in an IR contextndash much better compression rates

ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters

ndash words are the atoms on which most IR systems are built

ndash word frequencies are useful in answering queries involving combinations of words

ndash the best strategy is to start with the least frequent words first

Statistical Methods

Codingndash the task of obtaining the representation of a symbol

based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and

long codes to unlikely ones

Two statistical coding strategiesndash Huffman coding

bull a variable-length encoding in bits for each symbolbull relatively fast allows random access

ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access

Huffman Coding

Building a Huffman treendash for each symbol of the alphabet create a node

containing the symbol and its probabilityndash the two nodes with the smallest probabilities become

children of a newly created parent nodendash the parent node is associated a probability equal to

the sum of the probabilities of the two chosen children

ndash the operation is repeated ignoring nodes that are already children until there is only one node

Authorrsquos Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

10

10

10 10

0110 0100 1 0101 00 1 0111 00 1

for each rose a rose is a rose

Better Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

1010

10 10

010 000 11 001 10 11 011 10 11

for each rose a rose is a rose

Authorrsquos Canonical Huffman Coding

bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree

wherexi = symbols

yi = numerical value of first symbol

rose

isldquo ldquo

a

each for

10

10

10

10 10

0010 0000 1 0001 00 1 0011 00 1

for each rose a rose is a rose

S = ((1 1) (1 1) (0 infin) (4 0)

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 4: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

The Process of Preprocessing

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text Index terms

Lexical Analysis of the Text

Four particular casesNumbers

bull usually not good index terms because of their vagueness

bull need some advanced lexical analysis procedurendash ex) 510BC 4105-1201-2310-2213 2000212 hellip

Hyphensbull breaking up hyphenated words might be useful

ndash ex) state-of-the-art state of the art (Good)ndash but B-49 B 49 ()

bull need to adopt a general rule and to specify exceptions on a case by case basis

Lexical Analysis of the Text

Punctuation marksndash removed entirely

bull ex) 510BC 1048774 510BCbull if the query contains lsquo510BCrsquo removal of the dot both in

query term and in the documents will not affect retrieval performance

ndash require the preparation of a list of exceptionsbull ex) valid 1048774 valid ()

The case of lettersndash converts all the text to either lower or upper casendash part of the semantics might be lost

bull Northwestern University 1048774 northwestern university ()

Elimination of Stopwords

Basic conceptndash filtering out words with very low discrimination

valuesbull ex) a the this that where when hellip

Advantagendash reduce the size of the indexing structure

considerably

Disadvantagendash might reduce recall as well

bull ex) to be or not to be

Stemming

What is the ldquostemrdquondash the portion of a word which is left after the removal of

its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo

lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo

Effect of stemmingndash reduce variants of the same root to a common

conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming

Index Term Selection

Index terms selectionndash not all words are equally significant for representing

the semantics of a document

Manual selectionndash selection of index terms is usually done by specialist

Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into

a single indexing component (or concept)ndash ex) computer science

ThesauriWhat is the ldquothesaurusrdquo

ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity

relationshipndash a controlled vocabulary for the indexing and searching

Main purposesndash provide a standard vocabulary for indexing and

searchingndash assist users with locating terms for proper query

formulationndash provide classified hierarchies that allow the

broadening and narrowing of the current query request

Thesauri

Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases

bull ex) building teaching ballistic missiles body temperature

ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation

bull ex) seal (marine animals) seal (documents)

Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT

(Related Term)

Text OperationsCoding Compression Methods

Text CompressionMotivation

ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements

IO overhead communication delaysndash obstacle need for IR systems to access text

randomly bull to access a given word in some forms of compressed text

the entire text must be decoded from the beginning until the desired word is reached

Two strategiesndash statistical methodsndash dictionary methods

Statistical MethodsBasic concepts

ndash Modeling a probability is estimated for each symbol

ndash Coding a code is assigned to each symbol based on the model

ndash shorter codes are assigned to the most likely symbols

Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)

bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits

Statistical MethodsCompression models

ndash adaptive model progressively learn about the statistical distribution as the compression process goes on

bull decompression of a file has to start from its beginning

ndash static model assume an average distribution for all input texts

bull poor compression ratios when data deviates from initial distribution assumptions

ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned

bull information on the data distribution must be stored

Statistical Methods

Word-based compression modelndash take words instead of characters as symbols

Reasons to use this model in an IR contextndash much better compression rates

ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters

ndash words are the atoms on which most IR systems are built

ndash word frequencies are useful in answering queries involving combinations of words

ndash the best strategy is to start with the least frequent words first

Statistical Methods

Codingndash the task of obtaining the representation of a symbol

based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and

long codes to unlikely ones

Two statistical coding strategiesndash Huffman coding

bull a variable-length encoding in bits for each symbolbull relatively fast allows random access

ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access

Huffman Coding

Building a Huffman treendash for each symbol of the alphabet create a node

containing the symbol and its probabilityndash the two nodes with the smallest probabilities become

children of a newly created parent nodendash the parent node is associated a probability equal to

the sum of the probabilities of the two chosen children

ndash the operation is repeated ignoring nodes that are already children until there is only one node

Authorrsquos Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

10

10

10 10

0110 0100 1 0101 00 1 0111 00 1

for each rose a rose is a rose

Better Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

1010

10 10

010 000 11 001 10 11 011 10 11

for each rose a rose is a rose

Authorrsquos Canonical Huffman Coding

bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree

wherexi = symbols

yi = numerical value of first symbol

rose

isldquo ldquo

a

each for

10

10

10

10 10

0010 0000 1 0001 00 1 0011 00 1

for each rose a rose is a rose

S = ((1 1) (1 1) (0 infin) (4 0)

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 5: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Lexical Analysis of the Text

Four particular casesNumbers

bull usually not good index terms because of their vagueness

bull need some advanced lexical analysis procedurendash ex) 510BC 4105-1201-2310-2213 2000212 hellip

Hyphensbull breaking up hyphenated words might be useful

ndash ex) state-of-the-art state of the art (Good)ndash but B-49 B 49 ()

bull need to adopt a general rule and to specify exceptions on a case by case basis

Lexical Analysis of the Text

Punctuation marksndash removed entirely

bull ex) 510BC 1048774 510BCbull if the query contains lsquo510BCrsquo removal of the dot both in

query term and in the documents will not affect retrieval performance

ndash require the preparation of a list of exceptionsbull ex) valid 1048774 valid ()

The case of lettersndash converts all the text to either lower or upper casendash part of the semantics might be lost

bull Northwestern University 1048774 northwestern university ()

Elimination of Stopwords

Basic conceptndash filtering out words with very low discrimination

valuesbull ex) a the this that where when hellip

Advantagendash reduce the size of the indexing structure

considerably

Disadvantagendash might reduce recall as well

bull ex) to be or not to be

Stemming

What is the ldquostemrdquondash the portion of a word which is left after the removal of

its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo

lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo

Effect of stemmingndash reduce variants of the same root to a common

conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming

Index Term Selection

Index terms selectionndash not all words are equally significant for representing

the semantics of a document

Manual selectionndash selection of index terms is usually done by specialist

Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into

a single indexing component (or concept)ndash ex) computer science

ThesauriWhat is the ldquothesaurusrdquo

ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity

relationshipndash a controlled vocabulary for the indexing and searching

Main purposesndash provide a standard vocabulary for indexing and

searchingndash assist users with locating terms for proper query

formulationndash provide classified hierarchies that allow the

broadening and narrowing of the current query request

Thesauri

Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases

bull ex) building teaching ballistic missiles body temperature

ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation

bull ex) seal (marine animals) seal (documents)

Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT

(Related Term)

Text OperationsCoding Compression Methods

Text CompressionMotivation

ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements

IO overhead communication delaysndash obstacle need for IR systems to access text

randomly bull to access a given word in some forms of compressed text

the entire text must be decoded from the beginning until the desired word is reached

Two strategiesndash statistical methodsndash dictionary methods

Statistical MethodsBasic concepts

ndash Modeling a probability is estimated for each symbol

ndash Coding a code is assigned to each symbol based on the model

ndash shorter codes are assigned to the most likely symbols

Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)

bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits

Statistical MethodsCompression models

ndash adaptive model progressively learn about the statistical distribution as the compression process goes on

bull decompression of a file has to start from its beginning

ndash static model assume an average distribution for all input texts

bull poor compression ratios when data deviates from initial distribution assumptions

ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned

bull information on the data distribution must be stored

Statistical Methods

Word-based compression modelndash take words instead of characters as symbols

Reasons to use this model in an IR contextndash much better compression rates

ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters

ndash words are the atoms on which most IR systems are built

ndash word frequencies are useful in answering queries involving combinations of words

ndash the best strategy is to start with the least frequent words first

Statistical Methods

Codingndash the task of obtaining the representation of a symbol

based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and

long codes to unlikely ones

Two statistical coding strategiesndash Huffman coding

bull a variable-length encoding in bits for each symbolbull relatively fast allows random access

ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access

Huffman Coding

Building a Huffman treendash for each symbol of the alphabet create a node

containing the symbol and its probabilityndash the two nodes with the smallest probabilities become

children of a newly created parent nodendash the parent node is associated a probability equal to

the sum of the probabilities of the two chosen children

ndash the operation is repeated ignoring nodes that are already children until there is only one node

Authorrsquos Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

10

10

10 10

0110 0100 1 0101 00 1 0111 00 1

for each rose a rose is a rose

Better Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

1010

10 10

010 000 11 001 10 11 011 10 11

for each rose a rose is a rose

Authorrsquos Canonical Huffman Coding

bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree

wherexi = symbols

yi = numerical value of first symbol

rose

isldquo ldquo

a

each for

10

10

10

10 10

0010 0000 1 0001 00 1 0011 00 1

for each rose a rose is a rose

S = ((1 1) (1 1) (0 infin) (4 0)

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding):
<0,0,p>  p
<0,0,e>  pe
<0,0,t>  pet
<2,1,r>  peter
<0,0,_>  peter_
<6,1,i>  peter_pi
<8,2,r>  peter_piper
<6,3,c>  peter_piper_pic
<0,0,k>  peter_piper_pick
<7,1,d>  peter_piper_picked
…
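The trace above can be reproduced with a minimal decoder sketch (the function name lz77_decode is an illustrative assumption):

def lz77_decode(triples):
    # Each triple <a, b, c> copies b characters starting a positions back
    # in the already-decoded text, then appends the literal character c.
    text = ""
    for back, length, char in triples:
        start = len(text) - back
        for k in range(length):         # copy one character at a time so
            text += text[start + k]     # overlapping copies work correctly
        text += char
    return text

triples = [(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
           (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d")]
print(lz77_decode(triples))             # peter_piper_picked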

Dictionary Methods

Adaptive dictionary methods
– disadvantage over the Huffman method
  • no random access: decoding cannot start in the middle of a compressed file

Dictionary schemes are popular for their speed and low memory use, but statistical methods are more common in an IR environment.

Comparing Text Compression Techniques

                              Arithmetic   Character Huffman   Word Huffman   Ziv-Lempel
Compression ratio             very good    poor                very good      good
Compression speed             slow         fast                fast           very fast
Decompression speed           slow         fast                very fast      very fast
Memory space                  low          low                 high           moderate
Compressed pattern matching   no           yes                 yes            yes
Random access                 no           yes                 yes            no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 6: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Lexical Analysis of the Text

Punctuation marksndash removed entirely

bull ex) 510BC 1048774 510BCbull if the query contains lsquo510BCrsquo removal of the dot both in

query term and in the documents will not affect retrieval performance

ndash require the preparation of a list of exceptionsbull ex) valid 1048774 valid ()

The case of lettersndash converts all the text to either lower or upper casendash part of the semantics might be lost

bull Northwestern University 1048774 northwestern university ()

Elimination of Stopwords

Basic conceptndash filtering out words with very low discrimination

valuesbull ex) a the this that where when hellip

Advantagendash reduce the size of the indexing structure

considerably

Disadvantagendash might reduce recall as well

bull ex) to be or not to be

Stemming

What is the ldquostemrdquondash the portion of a word which is left after the removal of

its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo

lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo

Effect of stemmingndash reduce variants of the same root to a common

conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming

Index Term Selection

Index terms selectionndash not all words are equally significant for representing

the semantics of a document

Manual selectionndash selection of index terms is usually done by specialist

Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into

a single indexing component (or concept)ndash ex) computer science

ThesauriWhat is the ldquothesaurusrdquo

ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity

relationshipndash a controlled vocabulary for the indexing and searching

Main purposesndash provide a standard vocabulary for indexing and

searchingndash assist users with locating terms for proper query

formulationndash provide classified hierarchies that allow the

broadening and narrowing of the current query request

Thesauri

Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases

bull ex) building teaching ballistic missiles body temperature

ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation

bull ex) seal (marine animals) seal (documents)

Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT

(Related Term)

Text OperationsCoding Compression Methods

Text CompressionMotivation

ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements

IO overhead communication delaysndash obstacle need for IR systems to access text

randomly bull to access a given word in some forms of compressed text

the entire text must be decoded from the beginning until the desired word is reached

Two strategiesndash statistical methodsndash dictionary methods

Statistical MethodsBasic concepts

ndash Modeling a probability is estimated for each symbol

ndash Coding a code is assigned to each symbol based on the model

ndash shorter codes are assigned to the most likely symbols

Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)

bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits

Statistical MethodsCompression models

ndash adaptive model progressively learn about the statistical distribution as the compression process goes on

bull decompression of a file has to start from its beginning

ndash static model assume an average distribution for all input texts

bull poor compression ratios when data deviates from initial distribution assumptions

ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned

bull information on the data distribution must be stored

Statistical Methods

Word-based compression modelndash take words instead of characters as symbols

Reasons to use this model in an IR contextndash much better compression rates

ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters

ndash words are the atoms on which most IR systems are built

ndash word frequencies are useful in answering queries involving combinations of words

ndash the best strategy is to start with the least frequent words first

Statistical Methods

Codingndash the task of obtaining the representation of a symbol

based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and

long codes to unlikely ones

Two statistical coding strategiesndash Huffman coding

bull a variable-length encoding in bits for each symbolbull relatively fast allows random access

ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access

Huffman Coding

Building a Huffman treendash for each symbol of the alphabet create a node

containing the symbol and its probabilityndash the two nodes with the smallest probabilities become

children of a newly created parent nodendash the parent node is associated a probability equal to

the sum of the probabilities of the two chosen children

ndash the operation is repeated ignoring nodes that are already children until there is only one node

Authorrsquos Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

10

10

10 10

0110 0100 1 0101 00 1 0111 00 1

for each rose a rose is a rose

Better Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

1010

10 10

010 000 11 001 10 11 011 10 11

for each rose a rose is a rose

Authorrsquos Canonical Huffman Coding

bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree

wherexi = symbols

yi = numerical value of first symbol

rose

isldquo ldquo

a

each for

10

10

10

10 10

0010 0000 1 0001 00 1 0011 00 1

for each rose a rose is a rose

S = ((1 1) (1 1) (0 infin) (4 0)

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 7: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Elimination of Stopwords

Basic conceptndash filtering out words with very low discrimination

valuesbull ex) a the this that where when hellip

Advantagendash reduce the size of the indexing structure

considerably

Disadvantagendash might reduce recall as well

bull ex) to be or not to be

Stemming

What is the ldquostemrdquondash the portion of a word which is left after the removal of

its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo

lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo

Effect of stemmingndash reduce variants of the same root to a common

conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming

Index Term Selection

Index terms selectionndash not all words are equally significant for representing

the semantics of a document

Manual selectionndash selection of index terms is usually done by specialist

Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into

a single indexing component (or concept)ndash ex) computer science

ThesauriWhat is the ldquothesaurusrdquo

ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity

relationshipndash a controlled vocabulary for the indexing and searching

Main purposesndash provide a standard vocabulary for indexing and

searchingndash assist users with locating terms for proper query

formulationndash provide classified hierarchies that allow the

broadening and narrowing of the current query request

Thesauri

Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases

bull ex) building teaching ballistic missiles body temperature

ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation

bull ex) seal (marine animals) seal (documents)

Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT

(Related Term)

Text OperationsCoding Compression Methods

Text CompressionMotivation

ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements

IO overhead communication delaysndash obstacle need for IR systems to access text

randomly bull to access a given word in some forms of compressed text

the entire text must be decoded from the beginning until the desired word is reached

Two strategiesndash statistical methodsndash dictionary methods

Statistical MethodsBasic concepts

ndash Modeling a probability is estimated for each symbol

ndash Coding a code is assigned to each symbol based on the model

ndash shorter codes are assigned to the most likely symbols

Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)

bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits

Statistical MethodsCompression models

ndash adaptive model progressively learn about the statistical distribution as the compression process goes on

bull decompression of a file has to start from its beginning

ndash static model assume an average distribution for all input texts

bull poor compression ratios when data deviates from initial distribution assumptions

ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned

bull information on the data distribution must be stored

Statistical Methods

Word-based compression modelndash take words instead of characters as symbols

Reasons to use this model in an IR contextndash much better compression rates

ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters

ndash words are the atoms on which most IR systems are built

ndash word frequencies are useful in answering queries involving combinations of words

ndash the best strategy is to start with the least frequent words first

Statistical Methods

Codingndash the task of obtaining the representation of a symbol

based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and

long codes to unlikely ones

Two statistical coding strategiesndash Huffman coding

bull a variable-length encoding in bits for each symbolbull relatively fast allows random access

ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access

Huffman Coding

Building a Huffman treendash for each symbol of the alphabet create a node

containing the symbol and its probabilityndash the two nodes with the smallest probabilities become

children of a newly created parent nodendash the parent node is associated a probability equal to

the sum of the probabilities of the two chosen children

ndash the operation is repeated ignoring nodes that are already children until there is only one node

Authorrsquos Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

10

10

10 10

0110 0100 1 0101 00 1 0111 00 1

for each rose a rose is a rose

Better Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

1010

10 10

010 000 11 001 10 11 011 10 11

for each rose a rose is a rose

Authorrsquos Canonical Huffman Coding

bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree

wherexi = symbols

yi = numerical value of first symbol

rose

isldquo ldquo

a

each for

10

10

10

10 10

0010 0000 1 0001 00 1 0011 00 1

for each rose a rose is a rose

S = ((1 1) (1 1) (0 infin) (4 0)

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 8: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Stemming

What is the ldquostemrdquondash the portion of a word which is left after the removal of

its affixes (ie prefixes and suffixes)ndash ex) lsquoconnectrsquo is the stem for the variants lsquoconnectedrsquo

lsquoconnectingrsquo lsquoconnectionrsquo lsquoconnectionsrsquo

Effect of stemmingndash reduce variants of the same root to a common

conceptndash reduce the size of the indexing structurendash controversy about the benefits of stemming

Index Term Selection

Index terms selectionndash not all words are equally significant for representing

the semantics of a document

Manual selectionndash selection of index terms is usually done by specialist

Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into

a single indexing component (or concept)ndash ex) computer science

ThesauriWhat is the ldquothesaurusrdquo

ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity

relationshipndash a controlled vocabulary for the indexing and searching

Main purposesndash provide a standard vocabulary for indexing and

searchingndash assist users with locating terms for proper query

formulationndash provide classified hierarchies that allow the

broadening and narrowing of the current query request

Thesauri

Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases

bull ex) building teaching ballistic missiles body temperature

ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation

bull ex) seal (marine animals) seal (documents)

Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT

(Related Term)

Text OperationsCoding Compression Methods

Text CompressionMotivation

ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements

IO overhead communication delaysndash obstacle need for IR systems to access text

randomly bull to access a given word in some forms of compressed text

the entire text must be decoded from the beginning until the desired word is reached

Two strategiesndash statistical methodsndash dictionary methods

Statistical MethodsBasic concepts

ndash Modeling a probability is estimated for each symbol

ndash Coding a code is assigned to each symbol based on the model

ndash shorter codes are assigned to the most likely symbols

Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)

bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits

Statistical MethodsCompression models

ndash adaptive model progressively learn about the statistical distribution as the compression process goes on

bull decompression of a file has to start from its beginning

ndash static model assume an average distribution for all input texts

bull poor compression ratios when data deviates from initial distribution assumptions

ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned

bull information on the data distribution must be stored

Statistical Methods

Word-based compression modelndash take words instead of characters as symbols

Reasons to use this model in an IR contextndash much better compression rates

ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters

ndash words are the atoms on which most IR systems are built

ndash word frequencies are useful in answering queries involving combinations of words

ndash the best strategy is to start with the least frequent words first

Statistical Methods

Codingndash the task of obtaining the representation of a symbol

based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and

long codes to unlikely ones

Two statistical coding strategiesndash Huffman coding

bull a variable-length encoding in bits for each symbolbull relatively fast allows random access

ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access

Huffman Coding

Building a Huffman treendash for each symbol of the alphabet create a node

containing the symbol and its probabilityndash the two nodes with the smallest probabilities become

children of a newly created parent nodendash the parent node is associated a probability equal to

the sum of the probabilities of the two chosen children

ndash the operation is repeated ignoring nodes that are already children until there is only one node

Authorrsquos Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

10

10

10 10

0110 0100 1 0101 00 1 0111 00 1

for each rose a rose is a rose

Better Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

1010

10 10

010 000 11 001 10 11 011 10 11

for each rose a rose is a rose

Authorrsquos Canonical Huffman Coding

bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree

wherexi = symbols

yi = numerical value of first symbol

rose

isldquo ldquo

a

each for

10

10

10

10 10

0010 0000 1 0001 00 1 0011 00 1

for each rose a rose is a rose

S = ((1 1) (1 1) (0 infin) (4 0)

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 9: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Index Term Selection

Index terms selectionndash not all words are equally significant for representing

the semantics of a document

Manual selectionndash selection of index terms is usually done by specialist

Automatic selection of index termsndash most of the semantics is carried by the noun wordsndash clustering nouns which appear nearby in the text into

a single indexing component (or concept)ndash ex) computer science

ThesauriWhat is the ldquothesaurusrdquo

ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity

relationshipndash a controlled vocabulary for the indexing and searching

Main purposesndash provide a standard vocabulary for indexing and

searchingndash assist users with locating terms for proper query

formulationndash provide classified hierarchies that allow the

broadening and narrowing of the current query request

Thesauri

Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases

bull ex) building teaching ballistic missiles body temperature

ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation

bull ex) seal (marine animals) seal (documents)

Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT

(Related Term)

Text OperationsCoding Compression Methods

Text CompressionMotivation

ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements

IO overhead communication delaysndash obstacle need for IR systems to access text

randomly bull to access a given word in some forms of compressed text

the entire text must be decoded from the beginning until the desired word is reached

Two strategiesndash statistical methodsndash dictionary methods

Statistical MethodsBasic concepts

ndash Modeling a probability is estimated for each symbol

ndash Coding a code is assigned to each symbol based on the model

ndash shorter codes are assigned to the most likely symbols

Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)

bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits

Statistical MethodsCompression models

ndash adaptive model progressively learn about the statistical distribution as the compression process goes on

bull decompression of a file has to start from its beginning

ndash static model assume an average distribution for all input texts

bull poor compression ratios when data deviates from initial distribution assumptions

ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned

bull information on the data distribution must be stored

Statistical Methods

Word-based compression modelndash take words instead of characters as symbols

Reasons to use this model in an IR contextndash much better compression rates

ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters

ndash words are the atoms on which most IR systems are built

ndash word frequencies are useful in answering queries involving combinations of words

ndash the best strategy is to start with the least frequent words first

Statistical Methods

Codingndash the task of obtaining the representation of a symbol

based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and

long codes to unlikely ones

Two statistical coding strategiesndash Huffman coding

bull a variable-length encoding in bits for each symbolbull relatively fast allows random access

ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access

Huffman Coding

Building a Huffman treendash for each symbol of the alphabet create a node

containing the symbol and its probabilityndash the two nodes with the smallest probabilities become

children of a newly created parent nodendash the parent node is associated a probability equal to

the sum of the probabilities of the two chosen children

ndash the operation is repeated ignoring nodes that are already children until there is only one node

Authorrsquos Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

10

10

10 10

0110 0100 1 0101 00 1 0111 00 1

for each rose a rose is a rose

Better Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

1010

10 10

010 000 11 001 10 11 011 10 11

for each rose a rose is a rose

Authorrsquos Canonical Huffman Coding

bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree

wherexi = symbols

yi = numerical value of first symbol

rose

isldquo ldquo

a

each for

10

10

10

10 10

0010 0000 1 0001 00 1 0011 00 1

for each rose a rose is a rose

S = ((1 1) (1 1) (0 infin) (4 0)

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 10: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

ThesauriWhat is the ldquothesaurusrdquo

ndash list of important words in a given domain of knowledgendash a set of related words derived from a synonymity

relationshipndash a controlled vocabulary for the indexing and searching

Main purposesndash provide a standard vocabulary for indexing and

searchingndash assist users with locating terms for proper query

formulationndash provide classified hierarchies that allow the

broadening and narrowing of the current query request

Thesauri

Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases

bull ex) building teaching ballistic missiles body temperature

ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation

bull ex) seal (marine animals) seal (documents)

Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT

(Related Term)

Text OperationsCoding Compression Methods

Text CompressionMotivation

ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements

IO overhead communication delaysndash obstacle need for IR systems to access text

randomly bull to access a given word in some forms of compressed text

the entire text must be decoded from the beginning until the desired word is reached

Two strategiesndash statistical methodsndash dictionary methods

Statistical MethodsBasic concepts

ndash Modeling a probability is estimated for each symbol

ndash Coding a code is assigned to each symbol based on the model

ndash shorter codes are assigned to the most likely symbols

Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)

bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits

Statistical MethodsCompression models

ndash adaptive model progressively learn about the statistical distribution as the compression process goes on

bull decompression of a file has to start from its beginning

ndash static model assume an average distribution for all input texts

bull poor compression ratios when data deviates from initial distribution assumptions

ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned

bull information on the data distribution must be stored

Statistical Methods

Word-based compression modelndash take words instead of characters as symbols

Reasons to use this model in an IR contextndash much better compression rates

ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters

ndash words are the atoms on which most IR systems are built

ndash word frequencies are useful in answering queries involving combinations of words

ndash the best strategy is to start with the least frequent words first

Statistical Methods

Codingndash the task of obtaining the representation of a symbol

based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and

long codes to unlikely ones

Two statistical coding strategiesndash Huffman coding

bull a variable-length encoding in bits for each symbolbull relatively fast allows random access

ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access

Huffman Coding

Building a Huffman treendash for each symbol of the alphabet create a node

containing the symbol and its probabilityndash the two nodes with the smallest probabilities become

children of a newly created parent nodendash the parent node is associated a probability equal to

the sum of the probabilities of the two chosen children

ndash the operation is repeated ignoring nodes that are already children until there is only one node

Authorrsquos Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

10

10

10 10

0110 0100 1 0101 00 1 0111 00 1

for each rose a rose is a rose

Better Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

1010

10 10

010 000 11 001 10 11 011 10 11

for each rose a rose is a rose

Authorrsquos Canonical Huffman Coding

bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree

wherexi = symbols

yi = numerical value of first symbol

rose

isldquo ldquo

a

each for

10

10

10

10 10

0010 0000 1 0001 00 1 0011 00 1

for each rose a rose is a rose

S = ((1 1) (1 1) (0 infin) (4 0)

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 11: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Thesauri

Thesaurus index termsndash denote a concept which is the basic semantic unitndash can be individual words groups of words or phrases

bull ex) building teaching ballistic missiles body temperature

ndash frequently it is necessary to complement a thesaurus entry with a definition or an explanation

bull ex) seal (marine animals) seal (documents)

Thesaurus term relationshipsndash mostly composed of synonyms and near-synonymsndash BT (Broader Term) NT (Narrower Term) RT

(Related Term)

Text OperationsCoding Compression Methods

Text CompressionMotivation

ndash finding ways to represent the text in fewer bitsndash reducing costs associated with space requirements

IO overhead communication delaysndash obstacle need for IR systems to access text

randomly bull to access a given word in some forms of compressed text

the entire text must be decoded from the beginning until the desired word is reached

Two strategiesndash statistical methodsndash dictionary methods

Statistical MethodsBasic concepts

ndash Modeling a probability is estimated for each symbol

ndash Coding a code is assigned to each symbol based on the model

ndash shorter codes are assigned to the most likely symbols

Relationship between probabilities and codesndash Source code theorem (by Claude Shannon)

bull a symbol that occurs with probability p should be assigned a code of length log2 (1p) bits

Statistical MethodsCompression models

ndash adaptive model progressively learn about the statistical distribution as the compression process goes on

bull decompression of a file has to start from its beginning

ndash static model assume an average distribution for all input texts

bull poor compression ratios when data deviates from initial distribution assumptions

ndash semi-static model learn a distribution in a first pass compress the data in a second pass by using a fixed code derived from the distribution learned

bull information on the data distribution must be stored

Statistical Methods

Word-based compression modelndash take words instead of characters as symbols

Reasons to use this model in an IR contextndash much better compression rates

ndash words carry a lot of meaning in natural languages and their distribution is much more related to the semantic structure of the text than is the distribution of individual letters

ndash words are the atoms on which most IR systems are built

ndash word frequencies are useful in answering queries involving combinations of words

ndash the best strategy is to start with the least frequent words first

Statistical Methods

Codingndash the task of obtaining the representation of a symbol

based on a probability distribution given by a modelndash main goal assign short codes to likely symbols and

long codes to unlikely ones

Two statistical coding strategiesndash Huffman coding

bull a variable-length encoding in bits for each symbolbull relatively fast allows random access

ndash Arithmetic coding bull use an interval of real numbers between 0-1bull much slower does not allow random access

Huffman Coding

Building a Huffman treendash for each symbol of the alphabet create a node

containing the symbol and its probabilityndash the two nodes with the smallest probabilities become

children of a newly created parent nodendash the parent node is associated a probability equal to

the sum of the probabilities of the two chosen children

ndash the operation is repeated ignoring nodes that are already children until there is only one node

Authorrsquos Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

10

10

10 10

0110 0100 1 0101 00 1 0111 00 1

for each rose a rose is a rose

Better Huffman Coding

Example ldquofor each rose a rose is a roserdquo

rose

isldquo ldquo

a

each for

10

1010

10 10

010 000 11 001 10 11 011 10 11

for each rose a rose is a rose

Authorrsquos Canonical Huffman Coding

bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree

wherexi = symbols

yi = numerical value of first symbol

rose

isldquo ldquo

a

each for

10

10

10

10 10

0010 0000 1 0001 00 1 0011 00 1

for each rose a rose is a rose

S = ((1 1) (1 1) (0 infin) (4 0)

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristics
– identifies each text segment the first time it appears, and thereafter simply points back to this first occurrence rather than repeating the segment
– an adaptive model of coding, with increasingly long text segments encoded as the text is scanned
– pointers require less space than the repeated text segments
– higher compression than the Huffman codes: roughly 4 bits per character

Ziv-Lempel Code

LZ77 – Gzip encoding
– the code consists of a sequence of triples <a, b, c>
  • a: identifies how far back in the decoded text to look for the upcoming text segment
  • b: tells how many characters to copy for the upcoming segment
  • c: a new character to add to complete the next segment
  • ex) <8,2,r>, <0,0,p>

Ziv-Lempel Code

An example (decoding):
<0,0,p>  p
<0,0,e>  pe
<0,0,t>  pet
<2,1,r>  peter
<0,0,_>  peter_
<6,1,i>  peter_pi
<8,2,r>  peter_piper
<6,3,c>  peter_piper_pic
<0,0,k>  peter_piper_pick
<7,1,d>  peter_piper_picked
…
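The decoding loop above is short enough to sketch directly. This assumes each triple is <back, length, literal> as described, with the copy done character by character so a copied segment may overlap the text being produced:

    def lz77_decode(triples):
        out = []
        for back, length, char in triples:
            start = len(out) - back        # how far back to look
            for i in range(length):        # copy that many characters
                out.append(out[start + i])
            out.append(char)               # append the new literal
        return "".join(out)

    triples = [(0,0,'p'), (0,0,'e'), (0,0,'t'), (2,1,'r'), (0,0,'_'),
               (6,1,'i'), (8,2,'r'), (6,3,'c'), (0,0,'k'), (7,1,'d')]
    print(lz77_decode(triples))  # peter_piper_picked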

Dictionary Methods

Adaptive dictionary methods
– disadvantage relative to the Huffman method
  • no random access: decoding cannot start in the middle of a compressed file

Dictionary schemes are popular for their speed and low memory use, but statistical methods are more common in an IR environment.

Comparing Text Compression Techniques

                              Arithmetic   Character    Word         Ziv-Lempel
                                           Huffman      Huffman
Compression ratio             very good    poor         very good    good
Compression speed             slow         fast         fast         very fast
Decompression speed           slow         fast         very fast    very fast
Memory space                  low          low          high         moderate
Compressed pattern matching   no           yes          yes          yes
Random access                 no           yes          yes          no


Authorrsquos Canonical Huffman Coding

bull Height of left tree is never shorter than right treebull S ordered sequence of pairs (xi yi) for each level in tree

wherexi = symbols

yi = numerical value of first symbol

rose

isldquo ldquo

a

each for

10

10

10

10 10

0010 0000 1 0001 00 1 0011 00 1

for each rose a rose is a rose

S = ((1 1) (1 1) (0 infin) (4 0)

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 22: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Byte-Oriented Huffman Coding

Tree has branching factor of 256

Ensure no empty nodes in higher levels of tree of bottom level elements = 1 + ((v ndash 256) mod 255)

Characteristicsndash Decompression is faster than for plain Huffman

codingndash Compression ratios are better than for Ziv-Lempel

family of codingsndash Allows direct searching on compressed text

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 23: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Dictionary Methods

Basic conceptsndash replacing groups of consecutive symbols with a

pointer to an entry in a dictionaryndash the pointer representations are references to entries

in a dictionary composed of a list of symbols that are expected to occur frequently

ndash pointers to the dictionary entries are chosen so that they need less space than the phrase they replace

ndash modeling and coding does not existndash there are no explicit probabilities associated to

phrases

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 24: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Dictionary Methods

Static dictionary methodsndash selected pairs of letters are replaced with codewordsndash ex) Digram coding

bull at each step the next two characters are inspected and verified if they correspond to a digram in the dictionary

bull if so they are coded together and the coding position is shifted by two characters otherwise the single character is represented by its normal code and the position is shifted by one character

ndash main problembull the dictionary might be suitable for one text but unsuitable

for another

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 25: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Dictionary Methods

Adaptive dictionary methodsndash Ziv-Lempel

bull placing strings of characters with a reference to a previous occurrence of the string

bull if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces compression is achieved

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 26: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Ziv-Lempel Code

Characteristicsndash identifying each text segment the first time it appears

and then simply pointing back to this first occurrence rather than repeating the segment

ndash an adaptive model of coding with increasingly long text segments encoded as the text is scanned

ndash require less space than the repeated text segmentsndash higher compression than the Huffman codes ndash codes of roughly 4 bits per character

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 27: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Ziv-Lempel Code

LZ77 ndash Gzip encodingndash the code consists of a set of triples ltabcgt

bull a identifies how far back in the decoded text to look for the upcoming text segment

bull b tells how many characters to copy for the upcoming segment

bull c a new character to add to complete the next segment

bull ex) lt82rgt lt00pgt

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 28: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Ziv-Lempel Code

An example (decoding)lt00pgt p

lt00egt pe

lt00tgt pet

lt21rgt peter

lt00_gt peter_

lt61igt peter_pi

lt82rgt peter_piper

lt63cgt peter_piper_pic

lt00kgt peter_piper_pick

lt71dgt peter_piper_picked

helliphelliphellip

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 29: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Dictionary Methods

Adaptive dictionary methodsndash Disadvantages over the Huffman method

bull no random access does not allow decoding to start in the middle of a compression file

Dictionary schemes are popular for their speed and low memory use but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques
Page 30: Text Operations: Preprocessing and Compression. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis,

Comparing Text Compression Techniques

Arithmetic Character Huffman

Word Huffman

Ziv-Lempel

Compression Ratio

very good poor very good good

Compression Speed

slow fast fast very fast

Decompression Speed

slow fast very fast very fast

Memory Space low low high moderate

Compressed pattern matching

no yes yes yes

Random Access

no yes yes no

  • Text Operations Preprocessing and Compression
  • Introduction
  • Document Preprocessing
  • The Process of Preprocessing
  • Lexical Analysis of the Text
  • Slide 6
  • Elimination of Stopwords
  • Stemming
  • Index Term Selection
  • Thesauri
  • Slide 11
  • Text Operations Coding Compression Methods
  • Text Compression
  • Statistical Methods
  • Slide 15
  • Slide 16
  • Slide 17
  • Huffman Coding
  • Authorrsquos Huffman Coding
  • Better Huffman Coding
  • Authorrsquos Canonical Huffman Coding
  • Byte-Oriented Huffman Coding
  • Dictionary Methods
  • Slide 24
  • Slide 25
  • Ziv-Lempel Code
  • Slide 27
  • Slide 28
  • Slide 29
  • Comparing Text Compression Techniques