Speech & NLP (Fall 2014): Information Retrieval

Page 1: Speech & NLP (Fall 2014): Information Retrieval

Speech & NLP

www.vkedco.blogspot.com

Information Retrieval

Texts as Feature Vectors, Vector Spaces,

Vocabulary Normalization through Stemming & Stoplisting,

Porter’s Algorithm for Suffix Stripping,

Term Weighting, Query Expansion, Precision & Recall

Vladimir Kulyukin

Page 2: Speech & NLP (Fall 2014): Information Retrieval

Outline

● Texts as Feature Vectors

● Vector Space Model

● Vocabulary Normalization through Stemming & Stoplisting

● Porter’s Algorithm for Suffix Stripping (aka Porter’s Stemmer)

● Term Weighting

● Query Expansion

● Precision & Recall

Page 3: Speech & NLP (Fall 2014): Information Retrieval

Texts as Feature Vectors

Page 4: Speech & NLP (Fall 2014): Information Retrieval

Text as Collection of Words

● Any text can be viewed as a collection of words (collections, unlike sets, allow for duplicates)
● Various techniques can be designed to compute different properties of texts: most frequent word, least frequent word, frequency of a word in a text, word n-grams, word co-occurrence probabilities, parts of speech, etc.
● Each such technique is a feature extractor: it extracts specific features from a text (e.g., a single word) and assigns to them specific weights (e.g., the frequency of that word in the text) or symbols (e.g., a part of speech)
● Feature extraction turns a text from a collection of words into a feature vector
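To make this concrete, here is a minimal sketch of a frequency-based feature extractor over a fixed vocabulary; the code and its names are illustrative, not from the lecture, and the three-word universe anticipates the 3D example a few slides below.

```python
from collections import Counter

def extract_features(text, vocabulary):
    # Count word occurrences, then read off one coordinate per
    # vocabulary word: the text becomes a feature vector.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

vocab = ["w1", "w2", "w3"]
print(extract_features("w1 w1 w2", vocab))   # [2, 1, 0]
print(extract_features("w3 w2", vocab))      # [0, 1, 1]
print(extract_features("w3 w3 w1", vocab))   # [1, 0, 2]
```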

Page 5: Speech & NLP (Fall 2014): Information Retrieval

Information Retrieval

● Information Retrieval (IR) is an area of NLP that deals with the storage and retrieval of digital media
● The primary focus of IR has been digital texts
● Other media, such as images, videos, and audio files, have received increasing attention more recently

Page 6: Speech & NLP (Fall 2014): Information Retrieval

Basic IR Terminology

● Document is an indexable and retrievable unit of digital text
● Collection is a set of documents that can be searched by users
● Term is a wordform that occurs in a collection
● Query is a set of terms

Page 7: Speech & NLP (Fall 2014): Information Retrieval

Vector Space Model

Page 8: Speech & NLP (Fall 2014): Information Retrieval

Background

● The Vector Space Model of IR was invented by G. Salton in the early 1970s
● A document collection is a vector space
● Terms found in texts are the dimensions of that vector space
● Documents are vectors in the vector space
● Term weights are coordinates along specific dimensions

Page 9: Speech & NLP (Fall 2014): Information Retrieval

Example: A 3D Feature Vector Space

● Suppose that all texts in our universe consist of three words w1, w2, and w3
● Suppose that there are three texts T1, T2, and T3 such that
  – T1 = “w1 w1 w2”
  – T2 = “w3 w2”
  – T3 = “w3 w3 w1”
● Suppose that our feature extraction procedure takes each word in a text and maps it to its frequency in that text
● Since there are three words, each feature vector has 3 dimensions; hence, we have a 3D vector space

Page 10: Speech & NLP (Fall 2014): Information Retrieval

Vector Space as Feature Vector Table

     w1   w2   w3
T1    2    1    0
T2    0    1    1
T3    1    0    2

$T_i$ is a text document; $\vec{T}_i$ is its feature vector

Page 11: Speech & NLP (Fall 2014): Information Retrieval

3D Vector Space

[Figure: the three axes w1, w2, w3 with the document vectors T1 = (2, 1, 0), T2 = (0, 1, 1), and T3 = (1, 0, 2)]

Page 12: Speech & NLP (Fall 2014): Information Retrieval

Another Example: A 3D Feature Vector Space

● Suppose that all texts in our universe consist of three words w1, w2, and w3
● Suppose that there are three texts T1, T2, and T3 such that
  – T1 = “w1 w1 w2”
  – T2 = “w3 w2”
  – T3 = “w3 w3 w1”
● Suppose that our feature extraction procedure takes each word in a text and simply records its presence (1) or absence (0) in the document

Page 13: Speech & NLP (Fall 2014): Information Retrieval

Vector Space as Binary Feature Vector Table

     w1   w2   w3
T1    1    1    0
T2    0    1    1
T3    1    0    1

Page 14: Speech & NLP (Fall 2014): Information Retrieval

Matching Queries Against Vector Tables

● Let twf be a term weighting function that assigns a numerical weight to a specific term in a specific document
● For example, if the query q = “w1 w3”, i.e., the user enters “w1 w3”, then $\vec{q} = (twf(q, w_1), twf(q, w_2), twf(q, w_3))$
● If the feature vector table is binary, then $\vec{q} = (1, 0, 1)$
● One similarity coefficient that can be used to rank binary documents is the dot product:

$sim(\vec{q}, \vec{T}_i) = \sum_{k=1}^{n} twf(q, w_k) \cdot twf(T_i, w_k)$, where n is the dimension of the vector space (e.g., n = 3)
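A minimal sketch of this dot-product ranking, assuming whitespace-tokenized texts and the binary twf just described; all names are illustrative. The printed scores match the worked example on the next slide.

```python
def twf_binary(doc_words, term):
    # Binary term weighting: 1 if the term occurs in the text, else 0.
    return 1 if term in doc_words else 0

def sim_dot(query, doc, vocabulary, twf=twf_binary):
    # Dot product of the query and document weight vectors.
    return sum(twf(query, w) * twf(doc, w) for w in vocabulary)

vocab = ["w1", "w2", "w3"]
q = set("w1 w3".split())
docs = {"T1": set("w1 w1 w2".split()),
        "T2": set("w3 w2".split()),
        "T3": set("w3 w3 w1".split())}
for name, d in docs.items():
    print(name, sim_dot(q, d, vocab))   # T1 -> 1, T2 -> 1, T3 -> 2
```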

Page 15: Speech & NLP (Fall 2014): Information Retrieval

Matching Queries Against Vector Tables

● Suppose the query q = “w1 w3” and the feature vector table is binary; then $\vec{q} = (1, 0, 1)$
● Below are the binary (dot product) similarity coefficients for each document in our 3D document collection (n = 3):

$sim(\vec{q}, \vec{T}_1) = \sum_{k=1}^{3} twf(q, w_k) \cdot twf(T_1, w_k) = 1 \cdot 1 + 0 \cdot 1 + 1 \cdot 0 = 1$

$sim(\vec{q}, \vec{T}_2) = \sum_{k=1}^{3} twf(q, w_k) \cdot twf(T_2, w_k) = 1 \cdot 0 + 0 \cdot 1 + 1 \cdot 1 = 1$

$sim(\vec{q}, \vec{T}_3) = \sum_{k=1}^{3} twf(q, w_k) \cdot twf(T_3, w_k) = 1 \cdot 1 + 0 \cdot 0 + 1 \cdot 1 = 2$

Page 16: Speech & NLP (Fall 2014): Information Retrieval

Matching Queries Against Vector Tables

● Another common metric is the cosine, which equals 1 for identical vectors and 0 for orthogonal vectors:

$sim(\vec{q}, \vec{T}_i) = \frac{\sum_{k=1}^{n} twf(q, w_k) \cdot twf(T_i, w_k)}{\sqrt{\sum_{k=1}^{n} twf(q, w_k)^2} \cdot \sqrt{\sum_{k=1}^{n} twf(T_i, w_k)^2}}$
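A sketch of the cosine computation over explicit weight vectors (a hypothetical helper, not code from the lecture):

```python
import math

def sim_cosine(qv, dv):
    # Normalized dot product: 1 for vectors pointing the same way,
    # 0 for orthogonal vectors.
    dot = sum(a * b for a, b in zip(qv, dv))
    nq = math.sqrt(sum(a * a for a in qv))
    nd = math.sqrt(sum(b * b for b in dv))
    return dot / (nq * nd) if nq and nd else 0.0

print(sim_cosine([1, 0, 1], [1, 1, 0]))   # q vs. binary T1: 0.5
print(sim_cosine([1, 0, 1], [2, 0, 2]))   # same direction: 1.0
```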

Page 17: Speech & NLP (Fall 2014): Information Retrieval

Two Principal Tasks for the Vector Space Model

● If the vector space model is to be used, we have to
  – determine how to compute terms (vocabulary normalization)
  – determine how to assign weights to terms in individual documents (term weighting)

Page 18: Speech & NLP (Fall 2014): Information Retrieval

Vocabulary Normalization through Stemming & Stoplisting

Page 19: Speech & NLP (Fall 2014): Information Retrieval

Vocabulary Normalization

● Texts contain many words that are morphologically related: CONNECT, CONNECTED, CONNECTING, CONNECTION, CONNECTIONS
● Most texts also contain many words that do not distinguish them from other texts: TO, UP, FROM, UNTIL, THE, A, BY, etc.
● Stemming is the operation of conflating different wordforms into a single wordform, called the stem; CONNECTED, CONNECTING, CONNECTION, and CONNECTIONS are all conflated to CONNECT
● Stoplisting is the operation of removing wordforms that do not distinguish texts from each other
● Stemming & stoplisting are vocabulary normalization procedures

Page 20: Speech & NLP (Fall 2014): Information Retrieval

Vocabulary Normalization

● Stemming & stoplisting are the two most common vocabulary normalization procedures
● Both procedures are aimed at standardizing the indexing vocabulary
● Both procedures reduce the size of the indexing vocabulary, which greatly reduces time and space costs
● After vocabulary normalization is done, the remaining words are called terms
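As an illustration, both normalization steps take only a few lines with off-the-shelf tools. The sketch below assumes the NLTK package is installed and its stopwords corpus has been downloaded (nltk.download('stopwords')); the function name is illustrative.

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def normalize(text):
    # Stoplist first, then stem what remains; the survivors are terms.
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in text.lower().split() if w not in stop]

print(normalize("the connection was connected by connecting connections"))
# e.g. ['connect', 'connect', 'connect', 'connect'] after
# 'the', 'was', and 'by' are stoplisted
```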

Page 21: Speech & NLP (Fall 2014): Information Retrieval

Porter’s Algorithm for Suffix Stripping

Martin Porter’s original paper is at http://tartarus.org/martin/PorterStemmer/def.txt

Source code in various languages is at http://tartarus.org/martin/PorterStemmer/

Page 22: Speech & NLP (Fall 2014): Information Retrieval

Suffix Stripping Approaches

● Use a stem list

● Use a suffix list

● Use a set of rules that match wordforms & remove suffixes under specified conditions

Page 23: Speech & NLP (Fall 2014): Information Retrieval

Pros & Cons of Suffix Stripping

● Suffix stripping is done not for linguistic reasons but to improve retrieval performance & storage efficiency
● It is reasonable when wordform conflation does not lose information (e.g., CONNECTOR & CONNECTION)
● It does not seem reasonable when conflation is lossy (e.g., when RELATIVE & RELATIVITY are conflated)

Page 24: Speech & NLP (Fall 2014): Information Retrieval

Pros & Cons of Suffix Stripping

● Suffix stripping is never 100% correct
● The same rule set conflates SAND and SANDER, which is OK, but it also conflates WAND and WANDER, which may not be OK
● With any set of rules there comes a point when adding more rules actually worsens performance
● Exceptions are important but may not be worth the trouble

Page 25: Speech & NLP (Fall 2014): Information Retrieval

Consonants & Vowels

● A consonant is a letter other than A, E, I, O, U, and other than Y preceded by a consonant
● Y is a consonant when it is preceded by A, E, I, O, or U: in TOY, Y is a consonant; in BY, it is a vowel
● A vowel is any letter that is not a consonant

Page 26: Speech & NLP (Fall 2014): Information Retrieval

Consonants & Vowels

● A consonant is denoted by c and a vowel by v
● A sequence of one or more consonants (e.g., c, cc, ccc, cccc, etc.) is denoted by C
● A sequence of one or more vowels (e.g., v, vv, vvv, etc.) is denoted by V

Page 27: Speech & NLP (Fall 2014): Information Retrieval

Porter’s Insight: Wordform Representation

● Any wordform can be represented in one of four forms:
  – CVCV … C
  – CVCV … V
  – VCVC … C
  – VCVC … V
● These forms are condensed into one form: [C]VCVC … [V] (square brackets denote sequences of zero or more consonants or vowels)
● This form can be rewritten as [C](VC)^m[V], m >= 0

Page 28: Speech & NLP (Fall 2014): Information Retrieval

Porter’s Insight: Wordform Representation

● In the formula [C](VC)^m[V], m >= 0, m is called the measure of a word
● Examples:
  – m = 0: TR, EE, TREE, Y, BY
  – m = 1: TROUBLE, OATS, TREES, IVY
    ● TROUBLE: [C] TR; (VC) OUBL; [V] E
    ● OATS: [C] NULL; (VC) OATS; [V] NULL
    ● TREES: [C] TR; (VC) EES; [V] NULL
  – m = 2: TROUBLES, PRIVATE
    ● TROUBLES: [C] TR; (VC)^2 (OUBL)(ES); [V] NULL
    ● PRIVATE: [C] PR; (VC)^2 (IV)(AT); [V] E
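The measure can be computed directly from this definition. Below is an illustrative sketch (not Porter's original code): it classifies each letter, collapses the word into an alternating c/v pattern, and counts the VC pairs.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    # A letter is a consonant unless it is a, e, i, o, u, or a y
    # preceded by a consonant (so the y in 'by' is a vowel,
    # while the y in 'toy' is a consonant).
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    # Count m in the form [C](VC)^m[V]: collapse each run of
    # consonants/vowels to one symbol, then count 'vc' pairs.
    word = word.lower()
    pattern = []
    for i in range(len(word)):
        sym = "c" if is_consonant(word, i) else "v"
        if not pattern or pattern[-1] != sym:
            pattern.append(sym)
    return "".join(pattern).count("vc")

for w in ["tree", "by", "trouble", "oats", "ivy", "troubles", "private"]:
    print(w, measure(w))
# tree 0, by 0, trouble 1, oats 1, ivy 1, troubles 2, private 2
```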

Page 29: Speech & NLP (Fall 2014): Information Retrieval

Morphological Rules

● Suffix removal rules have the form
  – (condition) S1 → S2
● If a wordform ends with suffix S1 and the stem before S1 satisfies the (optional) condition, then S1 is replaced with S2
● Example:
  – (m > 1) EMENT → NULL
  – S1 is EMENT; S2 is NULL
  – This rule maps REPLACEMENT to REPLAC

Page 30: Speech & NLP (Fall 2014): Information Retrieval

Morphological Rules: Condition Specification

● Conditions can be specified as follows:
  – (m > n), where n is a number
  – *X – the stem ends with the letter X
  – *v* – the stem contains a vowel
  – *d – the stem ends with a double consonant (e.g., -TT)
  – *o – the stem ends in cvc, where the second c is not W, X, or Y (e.g., -WIL, -HOP)
● Logical AND, OR, and NOT operators are also allowed:
  – ((m > 1) AND (*S OR *T))

Page 31: Speech & NLP (Fall 2014): Information Retrieval

Length-Based Rule Matching

● If several rules match, the one with the longest S1 wins
● Consider this rule set with null conditions:
  – SSES → SS
  – IES → I
  – SS → SS
  – S → NULL
● Given this rule set, CARESSES → CARESS because SSES is the longest match, and CARES → CARE
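A sketch of longest-suffix-wins matching for these condition-free rules; the rule table and function names are illustrative, not Porter's own code.

```python
# Porter's step 1A rules, written as (S1, S2) pairs; "" plays NULL.
RULES_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_longest(word, rules):
    # Among all rules whose suffix S1 matches, pick the longest S1.
    best = None
    for s1, s2 in rules:
        if word.endswith(s1) and (best is None or len(s1) > len(best[0])):
            best = (s1, s2)
    if best is None:
        return word                # no rule applies: leave unchanged
    s1, s2 = best
    return word[: len(word) - len(s1)] + s2

print(apply_longest("caresses", RULES_1A))  # caress (SSES -> SS wins)
print(apply_longest("cares", RULES_1A))     # care   (only S -> NULL matches)
```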

Page 32: Speech & NLP (Fall 2014): Information Retrieval

Five Rule Sets

● In the original paper by M. F. Porter, there are five steps comprising eight sets of rules: 1A, 1B, 1C, 2, 3, 4, 5A, 5B
● A wordform passes through each rule set one by one, starting from 1A and ending at 5B, in that order
● If no rule in a rule set is applicable, the wordform comes out unmodified

W → 1A → 1B → 1C → 2 → 3 → 4 → 5A → 5B → W’

Page 33: Speech & NLP (Fall 2014): Information Retrieval

Example

GENERALIZATIONS
→ GENERALIZATION (1A: S → NULL)
→ GENERALIZE (2: (m > 0) IZATION → IZE)
→ GENERAL (3: (m > 0) ALIZE → AL)
→ GENER (4: (m > 0) AL → NULL)

Page 34: Speech & NLP (Fall 2014): Information Retrieval

Term Weighting

Page 35: Speech & NLP (Fall 2014): Information Retrieval

Term Weighting in Documents

● Term weighting has a large influence on the performance of IR systems
● In general, two design factors bear on term weighting:
  – How important is a term within a given document?
  – How important is a term within a given collection?
● A common measure of term importance within a single document is its frequency in that document (commonly referred to as term frequency, or tf)

Page 36: Speech & NLP (Fall 2014): Information Retrieval

Term Weighting in Collections

● Terms that occur in every document or in many documents in a given collection are not useful as document discriminators
● Terms that occur in relatively few documents in a given collection are useful as document discriminators
● Generally, collection-wide term weighting approaches value terms that occur in relatively few documents

Page 37: Speech & NLP (Fall 2014): Information Retrieval

Inverse Document Frequency

● Suppose that we have some document collection C
● Let N be the total number of documents in C
● Let $n_i$ be the number of documents in C that contain at least one occurrence of the i-th term $t_i$
● Then the inverse document frequency of $t_i$ is:

$idf(t_i, C) = \log \frac{N}{n_i}$

Page 38: Speech & NLP (Fall 2014): Information Retrieval

Example: IDF

T1 = “W1 W1 W2 W3”
T2 = “W3 W3 W3 W3”
T3 = “W2 W2 W2 W1 W3”
T4 = “W3 W3 W3 W1 W1”

C = {T1, T2, T3, T4}

$idf(W1, C) = \log \frac{4}{3}$, because N = 4 and $n_1 = 3$
$idf(W2, C) = \log \frac{4}{2}$, because N = 4 and $n_2 = 2$
$idf(W3, C) = \log \frac{4}{4} = 0$, because N = 4 and $n_3 = 4$
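A sketch that reproduces these idf values (the slide does not fix a log base; natural log is used here, and any fixed base works):

```python
import math

def idf(term, collection):
    # idf(t, C) = log(N / n_t): N documents, n_t of them contain t.
    N = len(collection)
    n_t = sum(1 for doc in collection if term in doc.split())
    return math.log(N / n_t)

C = ["W1 W1 W2 W3", "W3 W3 W3 W3", "W2 W2 W2 W1 W3", "W3 W3 W3 W1 W1"]
for t in ["W1", "W2", "W3"]:
    print(t, round(idf(t, C), 3))   # log(4/3), log(4/2), log(4/4) = 0
```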

Page 39: Speech & NLP (Fall 2014): Information Retrieval

TF*IDF: Combining Local and Global Weights

● Suppose that we have some document collection C
● Let N be the total number of documents in C
● Let $n_i$ be the number of documents in C that contain at least one occurrence of the i-th term $t_i$
● Let $tf(t_i, T_j, C)$ be the frequency of the term $t_i$ in the document $T_j$ of collection C
● Let $idf(t_i, C)$ be the inverse document frequency of the term $t_i$ in collection C
● Then the tf*idf measure of $t_i$ in $T_j$ of C is:

$tfidf(t_i, T_j, C) = tf(t_i, T_j, C) \cdot idf(t_i, C)$

Page 40: Speech & NLP (Fall 2014): Information Retrieval

Example: TF*IDF

T1 = “W1 W1 W2 W3”
T2 = “W3 W3 W3 W3”
T3 = “W2 W2 W2 W1 W3”
T4 = “W3 W3 W3 W1 W1”

$tfidf(W1, T1, C) = tf(W1, T1, C) \cdot idf(W1, C) = 2 \cdot \log \frac{4}{3}$
$tfidf(W2, T1, C) = tf(W2, T1, C) \cdot idf(W2, C) = 1 \cdot \log \frac{4}{2}$
$tfidf(W3, T1, C) = tf(W3, T1, C) \cdot idf(W3, C) = 1 \cdot \log \frac{4}{4} = 0$
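The same example in code: the tf*idf weight is the local count times the global idf. A self-contained sketch (illustrative helper names, natural log as above):

```python
import math

def tf(term, doc):
    # Raw frequency of the term in the document.
    return doc.split().count(term)

def idf(term, collection):
    n_t = sum(1 for d in collection if term in d.split())
    return math.log(len(collection) / n_t)

def tfidf(term, doc, collection):
    # Local weight (tf) times global weight (idf).
    return tf(term, doc) * idf(term, collection)

C = ["W1 W1 W2 W3", "W3 W3 W3 W3", "W2 W2 W2 W1 W3", "W3 W3 W3 W1 W1"]
for t in ["W1", "W2", "W3"]:
    print(t, round(tfidf(t, C[0], C), 3))
# W1 -> 2*log(4/3) ~ 0.575, W2 -> 1*log(4/2) ~ 0.693, W3 -> 0
```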

Page 41: Speech & NLP (Fall 2014): Information Retrieval

User Query Expansion

Page 42: Speech & NLP (Fall 2014): Information Retrieval

Improving User Queries

● Typically, we cannot change the content of the indexed documents: once a collection is indexed, we can add documents to it or remove documents from it, but we cannot change the weights of the terms within the vector space model
● What we can do is improve the user query
● But how? We can dynamically change the weights of the terms in the user query to move it closer to the more relevant documents
● The standard method of doing this in the vector space model is called relevance feedback

Page 43: Speech & NLP (Fall 2014): Information Retrieval

How Relevance Feedback Works

● The user types in a query
● The system retrieves a set of documents
● The user specifies whether each document is relevant to the query or not: this can be done for every document in the retrieved set or for a small subset of documents
● The system dynamically increases the weights of the query terms found in the relevant documents and decreases the weights of the query terms found in the non-relevant documents
● Over several iterations, the user query vector ends up being pushed closer to the relevant documents and further from the non-relevant documents

Page 44: Speech & NLP (Fall 2014): Information Retrieval

Rocchio Relevance Feedback Formula

● Suppose that we have some document collection C
● Let $\vec{q}_i$ be the user query vector at the i-th iteration (i.e., $\vec{q}_0$ is the original user query vector)
● Let us assume that $R = \{\vec{r}_1, \ldots, \vec{r}_{|R|}\}$ is the set of relevant document vectors from C and $NR = \{\vec{nr}_1, \ldots, \vec{nr}_{|NR|}\}$ is the set of non-relevant document vectors from C
● The query vector on the next iteration is:

$\vec{q}_{i+1} = \vec{q}_i + \frac{\beta}{|R|} \sum_{j=1}^{|R|} \vec{r}_j - \frac{\gamma}{|NR|} \sum_{k=1}^{|NR|} \vec{nr}_k$, where $\beta + \gamma = 1$

Page 45: Speech & NLP (Fall 2014): Information Retrieval

Rocchio Relevance Feedback Formula

● The query vector on the next iteration is:

$\vec{q}_{i+1} = \vec{q}_i + \frac{\beta}{|R|} \sum_{j=1}^{|R|} \vec{r}_j - \frac{\gamma}{|NR|} \sum_{k=1}^{|NR|} \vec{nr}_k$, where $\beta + \gamma = 1$

● Expanding the sums term by term:

$\vec{q}_{i+1} = \vec{q}_i + \frac{\beta}{|R|} \vec{r}_1 + \cdots + \frac{\beta}{|R|} \vec{r}_{|R|} - \frac{\gamma}{|NR|} \vec{nr}_1 - \cdots - \frac{\gamma}{|NR|} \vec{nr}_{|NR|}$
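A minimal sketch of one Rocchio iteration over plain weight lists. The beta and gamma values below are illustrative choices subject to beta + gamma = 1, and the function name is hypothetical.

```python
def rocchio(q, relevant, nonrelevant, beta=0.75, gamma=0.25):
    # q_{i+1} = q_i + (beta/|R|) * sum(r_j) - (gamma/|NR|) * sum(nr_k),
    # applied coordinate by coordinate.
    out = list(q)
    for j in range(len(q)):
        if relevant:
            out[j] += beta * sum(r[j] for r in relevant) / len(relevant)
        if nonrelevant:
            out[j] -= gamma * sum(nr[j] for nr in nonrelevant) / len(nonrelevant)
    return out

q0 = [1, 0, 1]
R = [[2, 1, 0]]            # user marked T1 relevant
NR = [[0, 1, 1]]           # user marked T2 non-relevant
print(rocchio(q0, R, NR))  # [2.5, 0.5, 0.75]: pulled toward T1, away from T2
```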

Page 46: Speech & NLP (Fall 2014): Information Retrieval

Thesaurus-Based Query Expansion

● Another commonly used strategy is to have a thesaurus
● The thesaurus is used to expand the user query by adding terms to it (e.g., synonyms or correlated terms)
● Thesauri are typically collection dependent and do not generalize across different collections

Page 47: Speech & NLP (Fall 2014): Information Retrieval

Performance Evaluation

● There are two commonly used measures of relevance in IR
● Recall = (number of relevant documents retrieved) / (total number of relevant documents in collection C)
● Precision = (number of relevant documents retrieved) / (number of documents retrieved)
● Typically, recall and precision are inversely related: as precision increases, recall drops, and vice versa
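Both measures in code, over sets of document ids; the run below is a hypothetical example, not data from the lecture.

```python
def precision_recall(retrieved, relevant):
    # Precision = |retrieved & relevant| / |retrieved|
    # Recall    = |retrieved & relevant| / |relevant|
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# 4 documents retrieved, 5 relevant documents exist in the collection,
# and 2 of the retrieved ones are relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d3", "d5", "d7", "d9"])
print(p, r)   # precision 0.5, recall 0.4
```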

Page 48: Speech & NLP (Fall 2014): Information Retrieval

References

1. M. Porter. “An Algorithm for Suffix Stripping.” Program, 14(3), pp. 130-137, July 1980.
2. D. Jurafsky & J. Martin. Speech and Language Processing, Ch. 17.