Speech & NLP (Fall 2014): Information Retrieval

Page 1: Speech & NLP (Fall 2014): Information Retrieval

Speech & NLP

www.vkedco.blogspot.com

Information Retrieval

Texts as Feature Vectors, Vector Spaces,

Vocabulary Normalization through Stemming & Stoplisting,

Porter’s Algorithm for Suffix Stripping,

Term Weighting, Query Expansion, Precision & Recall

Vladimir Kulyukin

Page 2: Speech & NLP (Fall 2014): Information Retrieval

Outline

● Texts as Feature Vectors

● Vector Space Model

● Vocabulary Normalization through Stemming & Stoplisting

● Porter’s Algorithm for Suffix Stripping (aka Porter’s Stemmer)

● Term Weighting

● Query Expansion

● Precision & Recall

Page 3: Speech & NLP (Fall 2014): Information Retrieval

Texts as Feature Vectors

Page 4: Speech & NLP (Fall 2014): Information Retrieval

Text as Collection of Words

● Any text can be viewed as a collection of words (collections, unlike sets, allow for duplicates)
● Various techniques can be designed to compute different properties of texts: most frequent word, least frequent word, frequency of a word in a text, word n-grams, word co-occurrence probabilities, parts of speech, etc.
● Each such technique is a feature extractor: it extracts specific features from a text (e.g., a single word) and assigns to them specific weights (e.g., the frequency of that word in the text) or symbols (e.g., a part of speech)
● Feature extraction turns a text from a collection of words into a feature vector
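To make this concrete, here is a minimal sketch of a frequency-based feature extractor over a fixed vocabulary; the code and its names are illustrative, not from the lecture, and the three-word universe anticipates the 3D example a few slides below.

```python
from collections import Counter

def extract_features(text, vocabulary):
    # Count word occurrences, then read off one coordinate per
    # vocabulary word: the text becomes a feature vector.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

vocab = ["w1", "w2", "w3"]
print(extract_features("w1 w1 w2", vocab))   # [2, 1, 0]
print(extract_features("w3 w2", vocab))      # [0, 1, 1]
print(extract_features("w3 w3 w1", vocab))   # [1, 0, 2]
```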

Page 5: Speech & NLP (Fall 2014): Information Retrieval

Information Retrieval

● Information Retrieval (IR) is an area of NLP that deals with the storage and retrieval of digital media
● The primary focus of IR has been digital texts
● Other media, such as images, videos, and audio files, have received increasing attention more recently

Page 6: Speech & NLP (Fall 2014): Information Retrieval

Basic IR Terminology

● Document is an indexable and retrievable unit of digital text
● Collection is a set of documents that can be searched by users
● Term is a wordform that occurs in a collection
● Query is a set of terms

Page 7: Speech & NLP (Fall 2014): Information Retrieval

Vector Space Model

Page 8: Speech & NLP (Fall 2014): Information Retrieval

Background

● The Vector Space Model of IR was invented by G. Salton in the early 1970s
● A document collection is a vector space
● Terms found in texts are the dimensions of that vector space
● Documents are vectors in the vector space
● Term weights are coordinates along specific dimensions

Page 9: Speech & NLP (Fall 2014): Information Retrieval

Example: A 3D Feature Vector Space

● Suppose that all texts in our universe consist of three words w1, w2, and w3
● Suppose that there are three texts T1, T2, and T3 such that
  – T1 = “w1 w1 w2”
  – T2 = “w3 w2”
  – T3 = “w3 w3 w1”
● Suppose that our feature extraction procedure takes each word in a text and maps it to its frequency in that text
● Since there are three words, each feature vector has 3 dimensions; hence, we have a 3D vector space

Page 10: Speech & NLP (Fall 2014): Information Retrieval

Vector Space as Feature Vector Table

     w1   w2   w3
T1    2    1    0
T2    0    1    1
T3    1    0    2

$T_i$ is a text document; $\vec{T}_i$ is its feature vector

Page 11: Speech & NLP (Fall 2014): Information Retrieval

3D Vector Space

[Figure: the three axes w1, w2, w3 with the document vectors T1 = (2, 1, 0), T2 = (0, 1, 1), and T3 = (1, 0, 2)]

Page 12: Speech & NLP (Fall 2014): Information Retrieval

Another Example: A 3D Feature Vector Space

● Suppose that all texts in our universe consist of three words w1, w2, and w3
● Suppose that there are three texts T1, T2, and T3 such that
  – T1 = “w1 w1 w2”
  – T2 = “w3 w2”
  – T3 = “w3 w3 w1”
● Suppose that our feature extraction procedure takes each word in a text and simply records its presence (1) or absence (0) in the document

Page 13: Speech & NLP (Fall 2014): Information Retrieval

Vector Space as Binary Feature Vector Table

     w1   w2   w3
T1    1    1    0
T2    0    1    1
T3    1    0    1

Page 14: Speech & NLP (Fall 2014): Information Retrieval

Matching Queries Against Vector Tables

● Let twf be a term weighting function that assigns a numerical weight to a specific term in a specific document
● For example, if the query q = “w1 w3”, i.e., the user enters “w1 w3”, then $\vec{q} = (twf(q, w_1), twf(q, w_2), twf(q, w_3))$
● If the feature vector table is binary, then $\vec{q} = (1, 0, 1)$
● One similarity coefficient that can be used to rank binary documents is the dot product:

$sim(\vec{q}, \vec{T}_i) = \sum_{k=1}^{n} twf(q, w_k) \cdot twf(T_i, w_k)$, where n is the dimension of the vector space (e.g., n = 3)
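A minimal sketch of this dot-product ranking, assuming whitespace-tokenized texts and the binary twf just described; all names are illustrative. The printed scores match the worked example on the next slide.

```python
def twf_binary(doc_words, term):
    # Binary term weighting: 1 if the term occurs in the text, else 0.
    return 1 if term in doc_words else 0

def sim_dot(query, doc, vocabulary, twf=twf_binary):
    # Dot product of the query and document weight vectors.
    return sum(twf(query, w) * twf(doc, w) for w in vocabulary)

vocab = ["w1", "w2", "w3"]
q = set("w1 w3".split())
docs = {"T1": set("w1 w1 w2".split()),
        "T2": set("w3 w2".split()),
        "T3": set("w3 w3 w1".split())}
for name, d in docs.items():
    print(name, sim_dot(q, d, vocab))   # T1 -> 1, T2 -> 1, T3 -> 2
```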

Page 15: Speech & NLP (Fall 2014): Information Retrieval

Matching Queries Against Vector Tables

● Suppose the query q = “w1 w3” and the feature vector table is binary; then $\vec{q} = (1, 0, 1)$
● Below are the binary (dot product) similarity coefficients for each document in our 3D document collection (n = 3):

$sim(\vec{q}, \vec{T}_1) = \sum_{k=1}^{3} twf(q, w_k) \cdot twf(T_1, w_k) = 1 \cdot 1 + 0 \cdot 1 + 1 \cdot 0 = 1$

$sim(\vec{q}, \vec{T}_2) = \sum_{k=1}^{3} twf(q, w_k) \cdot twf(T_2, w_k) = 1 \cdot 0 + 0 \cdot 1 + 1 \cdot 1 = 1$

$sim(\vec{q}, \vec{T}_3) = \sum_{k=1}^{3} twf(q, w_k) \cdot twf(T_3, w_k) = 1 \cdot 1 + 0 \cdot 0 + 1 \cdot 1 = 2$

Page 16: Speech & NLP (Fall 2014): Information Retrieval

Matching Queries Against Vector Tables

● Another common metric is the cosine, which equals 1 for identical vectors and 0 for orthogonal vectors:

$sim(\vec{q}, \vec{T}_i) = \frac{\sum_{k=1}^{n} twf(q, w_k) \cdot twf(T_i, w_k)}{\sqrt{\sum_{k=1}^{n} twf(q, w_k)^2} \cdot \sqrt{\sum_{k=1}^{n} twf(T_i, w_k)^2}}$
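A sketch of the cosine computation over explicit weight vectors (a hypothetical helper, not code from the lecture):

```python
import math

def sim_cosine(qv, dv):
    # Normalized dot product: 1 for vectors pointing the same way,
    # 0 for orthogonal vectors.
    dot = sum(a * b for a, b in zip(qv, dv))
    nq = math.sqrt(sum(a * a for a in qv))
    nd = math.sqrt(sum(b * b for b in dv))
    return dot / (nq * nd) if nq and nd else 0.0

print(sim_cosine([1, 0, 1], [1, 1, 0]))   # q vs. binary T1: 0.5
print(sim_cosine([1, 0, 1], [2, 0, 2]))   # same direction: 1.0
```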

Page 17: Speech & NLP (Fall 2014): Information Retrieval

Two Principal Tasks for the Vector Space Model

● If the vector space model is to be used, we have to
  – determine how to compute terms (vocabulary normalization)
  – determine how to assign weights to terms in individual documents (term weighting)

Page 18: Speech & NLP (Fall 2014): Information Retrieval

Vocabulary Normalization through Stemming & Stoplisting

Page 19: Speech & NLP (Fall 2014): Information Retrieval

Vocabulary Normalization

● Texts contain many words that are morphologically related: CONNECT, CONNECTED, CONNECTING, CONNECTION, CONNECTIONS
● Most texts also contain many words that do not distinguish them from other texts: TO, UP, FROM, UNTIL, THE, A, BY, etc.
● Stemming is the operation of conflating different wordforms into a single wordform, called the stem; CONNECTED, CONNECTING, CONNECTION, and CONNECTIONS are all conflated to CONNECT
● Stoplisting is the operation of removing wordforms that do not distinguish texts from each other
● Stemming & stoplisting are vocabulary normalization procedures

Page 20: Speech & NLP (Fall 2014): Information Retrieval

Vocabulary Normalization

● Stemming & stoplisting are the two most common vocabulary normalization procedures
● Both procedures are aimed at standardizing the indexing vocabulary
● Both procedures reduce the size of the indexing vocabulary, which greatly reduces time and space costs
● After vocabulary normalization is done, the remaining words are called terms
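As an illustration, both normalization steps take only a few lines with off-the-shelf tools. The sketch below assumes the NLTK package is installed and its stopwords corpus has been downloaded (nltk.download('stopwords')); the function name is illustrative.

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def normalize(text):
    # Stoplist first, then stem what remains; the survivors are terms.
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in text.lower().split() if w not in stop]

print(normalize("the connection was connected by connecting connections"))
# e.g. ['connect', 'connect', 'connect', 'connect'] after
# 'the', 'was', and 'by' are stoplisted
```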

Page 21: Speech & NLP (Fall 2014): Information Retrieval

Porter’s Algorithm for Suffix Stripping

Martin Porter’s original paper is at http://tartarus.org/martin/PorterStemmer/def.txt

Source code in various languages is at http://tartarus.org/martin/PorterStemmer/

Page 22: Speech & NLP (Fall 2014): Information Retrieval

Suffix Stripping Approaches

● Use a stem list

● Use a suffix list

● Use a set of rules that match wordforms & remove suffixes under specified conditions

Page 23: Speech & NLP (Fall 2014): Information Retrieval

Pros & Cons of Suffix Stripping

● Suffix stripping is done not for linguistic reasons but to improve retrieval performance & storage efficiency
● It is reasonable when wordform conflation does not lose information (e.g., CONNECTOR & CONNECTION)
● It does not seem reasonable when conflation is lossy (e.g., when RELATIVE & RELATIVITY are conflated)

Page 24: Speech & NLP (Fall 2014): Information Retrieval

Pros & Cons of Suffix Stripping

● Suffix stripping is never 100% correct
● The same rule set conflates SAND and SANDER, which is OK, but it also conflates WAND and WANDER, which may not be OK
● With any set of rules there comes a point when adding more rules actually worsens performance
● Exceptions are important but may not be worth the trouble

Page 25: Speech & NLP (Fall 2014): Information Retrieval

Consonants & Vowels

● A consonant is a letter other than A, E, I, O, U, and other than Y preceded by a consonant
● Y is a consonant when it is preceded by A, E, I, O, or U: in TOY, Y is a consonant; in BY, it is a vowel
● A vowel is any letter that is not a consonant

Page 26: Speech & NLP (Fall 2014): Information Retrieval

Consonants & Vowels

● A consonant is denoted by c and a vowel by v
● A sequence of one or more consonants (e.g., c, cc, ccc, cccc, etc.) is denoted by C
● A sequence of one or more vowels (e.g., v, vv, vvv, etc.) is denoted by V

Page 27: Speech & NLP (Fall 2014): Information Retrieval

Porter’s Insight: Wordform Representation

● Any wordform can be represented in one of four forms:
  – CVCV … C
  – CVCV … V
  – VCVC … C
  – VCVC … V
● These forms are condensed into one form: [C]VCVC … [V] (square brackets denote sequences of zero or more consonants or vowels)
● This form can be rewritten as [C](VC)^m[V], m >= 0

Page 28: Speech & NLP (Fall 2014): Information Retrieval

Porter’s Insight: Wordform Representation

● In the formula [C](VC)^m[V], m >= 0, m is called the measure of a word
● Examples:
  – m = 0: TR, EE, TREE, Y, BY
  – m = 1: TROUBLE, OATS, TREES, IVY
    ● TROUBLE: [C] TR; (VC) OUBL; [V] E
    ● OATS: [C] NULL; (VC) OATS; [V] NULL
    ● TREES: [C] TR; (VC) EES; [V] NULL
  – m = 2: TROUBLES, PRIVATE
    ● TROUBLES: [C] TR; (VC)^2 (OUBL)(ES); [V] NULL
    ● PRIVATE: [C] PR; (VC)^2 (IV)(AT); [V] E
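The measure can be computed directly from this definition. Below is an illustrative sketch (not Porter's original code): it classifies each letter, collapses the word into an alternating c/v pattern, and counts the VC pairs.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    # A letter is a consonant unless it is a, e, i, o, u, or a y
    # preceded by a consonant (so the y in 'by' is a vowel,
    # while the y in 'toy' is a consonant).
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    # Count m in the form [C](VC)^m[V]: collapse each run of
    # consonants/vowels to one symbol, then count 'vc' pairs.
    word = word.lower()
    pattern = []
    for i in range(len(word)):
        sym = "c" if is_consonant(word, i) else "v"
        if not pattern or pattern[-1] != sym:
            pattern.append(sym)
    return "".join(pattern).count("vc")

for w in ["tree", "by", "trouble", "oats", "ivy", "troubles", "private"]:
    print(w, measure(w))
# tree 0, by 0, trouble 1, oats 1, ivy 1, troubles 2, private 2
```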

Page 29: Speech & NLP (Fall 2014): Information Retrieval

Morphological Rules

● Suffix removal rules have the form
  – (condition) S1 → S2
● If a wordform ends with suffix S1 and the stem before S1 satisfies the (optional) condition, then S1 is replaced with S2
● Example:
  – (m > 1) EMENT → NULL
  – S1 is EMENT; S2 is NULL
  – This rule maps REPLACEMENT to REPLAC

Page 30: Speech & NLP (Fall 2014): Information Retrieval

Morphological Rules: Condition Specification

● Conditions can be specified as follows:
  – (m > n), where n is a number
  – *X – the stem ends with the letter X
  – *v* – the stem contains a vowel
  – *d – the stem ends with a double consonant (e.g., -TT)
  – *o – the stem ends in cvc, where the second c is not W, X, or Y (e.g., -WIL, -HOP)
● Logical AND, OR, and NOT operators are also allowed:
  – ((m > 1) AND (*S OR *T))

Page 31: Speech & NLP (Fall 2014): Information Retrieval

Length-Based Rule Matching

● If several rules match, the one with the longest S1 wins
● Consider this rule set with null conditions:
  – SSES → SS
  – IES → I
  – SS → SS
  – S → NULL
● Given this rule set, CARESSES → CARESS because SSES is the longest match, and CARES → CARE
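A sketch of longest-suffix-wins matching for these condition-free rules; the rule table and function names are illustrative, not Porter's own code.

```python
# Porter's step 1A rules, written as (S1, S2) pairs; "" plays NULL.
RULES_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_longest(word, rules):
    # Among all rules whose suffix S1 matches, pick the longest S1.
    best = None
    for s1, s2 in rules:
        if word.endswith(s1) and (best is None or len(s1) > len(best[0])):
            best = (s1, s2)
    if best is None:
        return word                # no rule applies: leave unchanged
    s1, s2 = best
    return word[: len(word) - len(s1)] + s2

print(apply_longest("caresses", RULES_1A))  # caress (SSES -> SS wins)
print(apply_longest("cares", RULES_1A))     # care   (only S -> NULL matches)
```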

Page 32: Speech & NLP (Fall 2014): Information Retrieval

Five Rule Sets

● In the original paper by M. F. Porter, there are five steps comprising eight sets of rules: 1A, 1B, 1C, 2, 3, 4, 5A, 5B
● A wordform passes through each rule set one by one, starting from 1A and ending at 5B, in that order
● If no rule in a rule set is applicable, the wordform comes out unmodified

W → 1A → 1B → 1C → 2 → 3 → 4 → 5A → 5B → W’

Page 33: Speech & NLP (Fall 2014): Information Retrieval

Example

GENERALIZATIONS
→ GENERALIZATION (1A: S → NULL)
→ GENERALIZE (2: (m > 0) IZATION → IZE)
→ GENERAL (3: (m > 0) ALIZE → AL)
→ GENER (4: (m > 0) AL → NULL)

Page 34: Speech & NLP (Fall 2014): Information Retrieval

Term Weighting

Page 35: Speech & NLP (Fall 2014): Information Retrieval

Term Weighting in Documents

● Term weighting has a large influence on the performance of IR systems
● In general, two design factors bear on term weighting:
  – How important is a term within a given document?
  – How important is a term within a given collection?
● A common measure of term importance within a single document is its frequency in that document (commonly referred to as term frequency, or tf)

Page 36: Speech & NLP (Fall 2014): Information Retrieval

Term Weighting in Collections

● Terms that occur in every document or in many documents in a given collection are not useful as document discriminators
● Terms that occur in relatively few documents in a given collection are useful as document discriminators
● Generally, collection-wide term weighting approaches value terms that occur in relatively few documents

Page 37: Speech & NLP (Fall 2014): Information Retrieval

Inverse Document Frequency

● Suppose that we have some document collection C
● Let N be the total number of documents in C
● Let $n_i$ be the number of documents in C that contain at least one occurrence of the i-th term $t_i$
● Then the inverse document frequency of $t_i$ is:

$idf(t_i, C) = \log \frac{N}{n_i}$

Page 38: Speech & NLP (Fall 2014): Information Retrieval

Example: IDF

T1 = “W1 W1 W2 W3”
T2 = “W3 W3 W3 W3”
T3 = “W2 W2 W2 W1 W3”
T4 = “W3 W3 W3 W1 W1”

C = {T1, T2, T3, T4}

$idf(W1, C) = \log \frac{4}{3}$, because N = 4 and $n_1 = 3$
$idf(W2, C) = \log \frac{4}{2}$, because N = 4 and $n_2 = 2$
$idf(W3, C) = \log \frac{4}{4} = 0$, because N = 4 and $n_3 = 4$
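A sketch that reproduces these idf values (the slide does not fix a log base; natural log is used here, and any fixed base works):

```python
import math

def idf(term, collection):
    # idf(t, C) = log(N / n_t): N documents, n_t of them contain t.
    N = len(collection)
    n_t = sum(1 for doc in collection if term in doc.split())
    return math.log(N / n_t)

C = ["W1 W1 W2 W3", "W3 W3 W3 W3", "W2 W2 W2 W1 W3", "W3 W3 W3 W1 W1"]
for t in ["W1", "W2", "W3"]:
    print(t, round(idf(t, C), 3))   # log(4/3), log(4/2), log(4/4) = 0
```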

Page 39: Speech & NLP (Fall 2014): Information Retrieval

TF*IDF: Combining Local and Global Weights

● Suppose that we have some document collection C
● Let N be the total number of documents in C
● Let $n_i$ be the number of documents in C that contain at least one occurrence of the i-th term $t_i$
● Let $tf(t_i, T_j, C)$ be the frequency of the term $t_i$ in the document $T_j$ of collection C
● Let $idf(t_i, C)$ be the inverse document frequency of the term $t_i$ in collection C
● Then the tf*idf measure of $t_i$ in $T_j$ of C is:

$tfidf(t_i, T_j, C) = tf(t_i, T_j, C) \cdot idf(t_i, C)$

Page 40: Speech & NLP (Fall 2014): Information Retrieval

Example: TF*IDF

T1 = “W1 W1 W2 W3”
T2 = “W3 W3 W3 W3”
T3 = “W2 W2 W2 W1 W3”
T4 = “W3 W3 W3 W1 W1”

$tfidf(W1, T1, C) = tf(W1, T1, C) \cdot idf(W1, C) = 2 \cdot \log \frac{4}{3}$
$tfidf(W2, T1, C) = tf(W2, T1, C) \cdot idf(W2, C) = 1 \cdot \log \frac{4}{2}$
$tfidf(W3, T1, C) = tf(W3, T1, C) \cdot idf(W3, C) = 1 \cdot \log \frac{4}{4} = 0$
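The same example in code: the tf*idf weight is the local count times the global idf. A self-contained sketch (illustrative helper names, natural log as above):

```python
import math

def tf(term, doc):
    # Raw frequency of the term in the document.
    return doc.split().count(term)

def idf(term, collection):
    n_t = sum(1 for d in collection if term in d.split())
    return math.log(len(collection) / n_t)

def tfidf(term, doc, collection):
    # Local weight (tf) times global weight (idf).
    return tf(term, doc) * idf(term, collection)

C = ["W1 W1 W2 W3", "W3 W3 W3 W3", "W2 W2 W2 W1 W3", "W3 W3 W3 W1 W1"]
for t in ["W1", "W2", "W3"]:
    print(t, round(tfidf(t, C[0], C), 3))
# W1 -> 2*log(4/3) ~ 0.575, W2 -> 1*log(4/2) ~ 0.693, W3 -> 0
```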

Page 41: Speech & NLP (Fall 2014): Information Retrieval

User Query Expansion

Page 42: Speech & NLP (Fall 2014): Information Retrieval

Improving User Queries

● Typically, we cannot change the content of the indexed documents: once a collection is indexed, we can add documents to it or remove documents from it, but we cannot change the weights of the terms within the vector space model
● What we can do is improve the user query
● But how? We can dynamically change the weights of the terms in the user query to move it closer to the more relevant documents
● The standard method of doing this in the vector space model is called relevance feedback

Page 43: Speech & NLP (Fall 2014): Information Retrieval

How Relevance Feedback Works

● The user types in a query
● The system retrieves a set of documents
● The user specifies whether each document is relevant to the query or not: this can be done for every document in the retrieved set or for a small subset of documents
● The system dynamically increases the weights of the query terms found in the relevant documents and decreases the weights of the query terms found in the non-relevant documents
● Over several iterations, the user query vector ends up being pushed closer to the relevant documents and further from the non-relevant documents

Page 44: Speech & NLP (Fall 2014): Information Retrieval

Rocchio Relevance Feedback Formula

● Suppose that we have some document collection C
● Let $\vec{q}_i$ be the user query vector at the i-th iteration (i.e., $\vec{q}_0$ is the original user query vector)
● Let us assume that $R = \{\vec{r}_1, \ldots, \vec{r}_{|R|}\}$ is the set of relevant document vectors from C and $NR = \{\vec{nr}_1, \ldots, \vec{nr}_{|NR|}\}$ is the set of non-relevant document vectors from C
● The query vector on the next iteration is:

$\vec{q}_{i+1} = \vec{q}_i + \frac{\beta}{|R|} \sum_{j=1}^{|R|} \vec{r}_j - \frac{\gamma}{|NR|} \sum_{k=1}^{|NR|} \vec{nr}_k$, where $\beta + \gamma = 1$

Page 45: Speech & NLP (Fall 2014): Information Retrieval

Rocchio Relevance Feedback Formula

● The query vector on the next iteration is:

$\vec{q}_{i+1} = \vec{q}_i + \frac{\beta}{|R|} \sum_{j=1}^{|R|} \vec{r}_j - \frac{\gamma}{|NR|} \sum_{k=1}^{|NR|} \vec{nr}_k$, where $\beta + \gamma = 1$

● Expanding the sums term by term:

$\vec{q}_{i+1} = \vec{q}_i + \frac{\beta}{|R|} \vec{r}_1 + \cdots + \frac{\beta}{|R|} \vec{r}_{|R|} - \frac{\gamma}{|NR|} \vec{nr}_1 - \cdots - \frac{\gamma}{|NR|} \vec{nr}_{|NR|}$
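A minimal sketch of one Rocchio iteration over plain weight lists. The beta and gamma values below are illustrative choices subject to beta + gamma = 1, and the function name is hypothetical.

```python
def rocchio(q, relevant, nonrelevant, beta=0.75, gamma=0.25):
    # q_{i+1} = q_i + (beta/|R|) * sum(r_j) - (gamma/|NR|) * sum(nr_k),
    # applied coordinate by coordinate.
    out = list(q)
    for j in range(len(q)):
        if relevant:
            out[j] += beta * sum(r[j] for r in relevant) / len(relevant)
        if nonrelevant:
            out[j] -= gamma * sum(nr[j] for nr in nonrelevant) / len(nonrelevant)
    return out

q0 = [1, 0, 1]
R = [[2, 1, 0]]            # user marked T1 relevant
NR = [[0, 1, 1]]           # user marked T2 non-relevant
print(rocchio(q0, R, NR))  # [2.5, 0.5, 0.75]: pulled toward T1, away from T2
```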

Page 46: Speech & NLP (Fall 2014): Information Retrieval

Thesaurus-Based Query Expansion

● Another commonly used strategy is to have a thesaurus
● The thesaurus is used to expand the user query by adding terms to it (e.g., synonyms or correlated terms)
● Thesauri are typically collection dependent and do not generalize across different collections

Page 47: Speech & NLP (Fall 2014): Information Retrieval

Performance Evaluation

● There are two commonly used measures of relevance in IR
● Recall = (number of relevant documents retrieved) / (total number of relevant documents in collection C)
● Precision = (number of relevant documents retrieved) / (number of documents retrieved)
● Typically, recall and precision are inversely related: as precision increases, recall drops, and vice versa
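Both measures in code, over sets of document ids; the run below is a hypothetical example, not data from the lecture.

```python
def precision_recall(retrieved, relevant):
    # Precision = |retrieved & relevant| / |retrieved|
    # Recall    = |retrieved & relevant| / |relevant|
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# 4 documents retrieved, 5 relevant documents exist in the collection,
# and 2 of the retrieved ones are relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d3", "d5", "d7", "d9"])
print(p, r)   # precision 0.5, recall 0.4
```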

Page 48: Speech & NLP (Fall 2014): Information Retrieval

References

1. M. Porter. “An Algorithm for Suffix Stripping.” Program, 14(3), pp. 130-137, July 1980.
2. D. Jurafsky & J. Martin. Speech and Language Processing, Ch. 17.