Computational Linguistics Course Instructor: Professor Cercone Presenter: Morteza Zihayat Information Retrieval and Vector Space Model


Computational Linguistics Course

Instructor: Professor Cercone

Presenter: Morteza Zihayat

Information Retrieval and Vector Space Model


Outline

• Introduction to IR
• IR System Architecture
• Vector Space Model (VSM)
• How to Assign Weights?
• TF-IDF Weighting
• Example
• Advantages and Disadvantages of VS Model
• Improving the VS Model


Introduction to IR

The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.

(Lyman & Varian, 2000)


Growth of textual information

How can we help manage and exploit all the information?

[Figure: sources of textual information: literature, email, WWW, desktop, news, intranet, blogs]


Information overflow


What is Information Retrieval (IR)?

• Narrow sense:
  ◦ IR = search engine technologies (IR = Google, library information systems)
  ◦ IR = text matching/classification
• Broad sense: IR = text information management
  ◦ General problem: how to manage text information?
  ◦ How to find useful information? (retrieval)
    Example: Google
  ◦ How to organize information? (text classification)
    Example: automatically assign emails to different folders
  ◦ How to discover knowledge from text? (text mining)
    Example: discover correlation of events


Formalizing IR Tasks

• Vocabulary: V = {w_1, w_2, …, w_T} of a language
• Query: q = q_1, q_2, …, q_m, where q_i ∈ V
• Document: d_i = d_i1, d_i2, …, d_im_i, where d_ij ∈ V
• Collection: C = {d_1, d_2, …, d_N}
• Relevant document set: R(q) ⊆ C
  ◦ Generally unknown and user-dependent
  ◦ Query provides a “hint” on which documents should be in R(q)
• IR: find the approximate relevant document set R’(q)

Source: This slide is borrowed from [1]


Evaluation measures

The quality of many retrieval systems depends on how well they manage to rank relevant documents.

How can we evaluate rankings in IR?

• IR researchers have developed evaluation measures specifically designed to evaluate rankings.
• Most of these measures combine precision and recall in a way that takes account of the ranking.


Precision & Recall

Source: This slide is borrowed from [1]

In other words:

Precision is the percentage of items in the returned set that are relevant.

Recall is the percentage of all relevant documents in the collection that are in the returned set.
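The two definitions can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the function name `precision_recall` and the document identifiers are hypothetical, and the measures are computed on unranked sets, not on a ranking.

```python
def precision_recall(returned, relevant):
    """Compute set-based precision and recall for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    # Guard against empty sets; returning 0.0 in that case is a choice of this sketch.
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: 4 documents returned, 3 documents truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)  # precision is 2/4, recall is 2/3
```

Two of the four returned documents are relevant, and two of the three relevant documents were found, which is why the two numbers differ.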


Evaluating Retrieval Performance

Source: This slide is borrowed from [1]

IR System Architecture

[Figure: IR system architecture. Components: INTERFACE, INDEXING, SEARCHING, QUERY MODIFICATION. Labels: User, query, judgments, docs, results, QueryRep, DocRep, Ranking, Feedback.]

Indexing Document

Searching

• Given a query, score documents efficiently
• The basic question: given a query, how do we know if document A is more relevant than B?
  ◦ If document A uses more query words than document B
  ◦ Word usage in document A is more similar to that in the query
  ◦ …
• We should find a way to compute the relevance between the query and documents


The Notion of Relevance

Relevance
• Δ(Rep(q), Rep(d)): similarity (different representations and similarity measures)
  ◦ Vector space model (Salton et al., 75)
  ◦ Prob. distr. model (Wong & Yao, 89)
• P(r=1|q,d), r ∈ {0,1}: probability of relevance
  ◦ Regression model (Fox 83)
  ◦ Generative model
    - Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    - Query generation: LM approach (Ponte & Croft, 98), (Lafferty & Zhai, 01a)
• P(d→q) or P(q→d): probabilistic inference (different inference systems)
  ◦ Prob. concept space model (Wong & Yao, 95)
  ◦ Inference network model (Turtle & Croft, 91)

Today’s lecture: the vector space model

Relevance = Similarity

• Assumptions
  ◦ Query and document are represented similarly
  ◦ A query can be regarded as a “document”
  ◦ Relevance(d,q) ∝ similarity(d,q)
• R(q) = {d ∈ C | f(d,q) > θ}, f(q,d) = Δ(Rep(q), Rep(d))
• Key issues
  ◦ How to represent query/document? Vector Space Model (VSM)
  ◦ How to define the similarity measure Δ?


Vector Space Model (VSM)

• The vector space model is one of the most widely used models for ad-hoc retrieval
• Used in information filtering, information retrieval, indexing and relevancy rankings


VSM

• Represent a doc/query by a term vector
  ◦ Term: basic concept, e.g., word or phrase
  ◦ Each term defines one dimension
  ◦ N terms define a high-dimensional space
  ◦ E.g., d = (x_1, …, x_N), where x_i is the “importance” of term i
• Measure relevance by the distance between the query vector and document vector in the vector space
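As a rough sketch of the term-vector idea, the following Python maps texts to vectors with one dimension per term. The helper names, the whitespace tokenizer, and the toy texts are all assumptions made for illustration; raw counts stand in for the “importance” x_i, which the slides refine later with weighting schemes.

```python
from collections import Counter

def build_vocabulary(texts):
    """Assign one dimension (index) to each distinct term, in sorted order."""
    terms = sorted({term for text in texts for term in text.lower().split()})
    return {term: i for i, term in enumerate(terms)}

def to_vector(text, vocabulary):
    """Map a text to d = (x_1, ..., x_N); here x_i is the raw count of term i."""
    counts = Counter(text.lower().split())
    vec = [0] * len(vocabulary)
    for term, idx in vocabulary.items():
        vec[idx] = counts[term]  # Counter returns 0 for absent terms
    return vec

# Toy collection (hypothetical texts, not from the slides).
docs = ["java coffee java", "microsoft java", "starbucks coffee"]
vocab = build_vocabulary(docs)
doc_vectors = [to_vector(d, vocab) for d in docs]
query_vector = to_vector("java coffee", vocab)  # a query is treated like a document
print(vocab)
print(doc_vectors, query_vector)
```

Because the query goes through the same `to_vector` function, it lives in the same space as the documents and can be compared with them directly.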


VS Model: illustration

[Figure: documents D1 to D11 and the query plotted in a term space with axes Java, Microsoft, and Starbucks; question marks accompany D1, D2, and D3.]

Some Issues about VS Model

• There is no consistent definition of a basic concept (term)
• The model does not specify how to assign weights to words
• The weight in a query indicates the importance of a term


How to Assign Weights?

• Different terms have different importance in a text
• A term weighting scheme plays an important role for the similarity measure
• Higher weight = greater impact
• We now turn to the question of how to weight words in the vector space model


There are three components in a weighting scheme:

• g_i: the global weight of the ith term
• t_ij: the local weight of the ith term in the jth document
• d_j: the normalization factor for the jth document
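The slide does not state how the three components are combined. A common convention, assumed here and not taken from the deck, is to multiply them, so that the weight w_ij of term i in document j is:

```latex
w_{ij} = g_i \, t_{ij} \, d_j
```

Under this assumption, TF-IDF is the special case where t_ij is a term-frequency function, g_i is an inverse document frequency, and d_j = 1 when no length normalization is applied.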


TF Weighting

• Idea: a term is more important if it occurs more frequently in a document
• Formulas: let f(t,d) be the frequency count of term t in doc d
  ◦ Raw TF: TF(t,d) = f(t,d)
  ◦ Log TF: TF(t,d) = log f(t,d)
  ◦ Maximum frequency normalization: TF(t,d) = 0.5 + 0.5*f(t,d)/MaxFreq(d)
• Normalization of TF is very important!
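The three TF formulas can be compared side by side with a short sketch. The function name `tf_variants` is hypothetical, the natural logarithm is an assumption (the slide does not give a base), and returning 0.0 for the log variant when the term is absent is a choice of this sketch, since log 0 is undefined.

```python
import math
from collections import Counter

def tf_variants(term, doc_tokens):
    """Return (raw TF, log TF, max-frequency-normalized TF) for one term."""
    counts = Counter(doc_tokens)
    f = counts[term]                     # f(t,d): frequency count of the term
    max_freq = max(counts.values())      # MaxFreq(d); assumes a non-empty document
    raw_tf = f
    # log f(t,d) is undefined for f = 0; 0.0 is used here as an assumption.
    log_tf = math.log(f) if f > 0 else 0.0
    max_norm_tf = 0.5 + 0.5 * f / max_freq
    return raw_tf, log_tf, max_norm_tf

tokens = "travel information map travel".split()
print(tf_variants("travel", tokens))
print(tf_variants("map", tokens))
```

Note that the plain log variant gives weight 0 to a term that occurs exactly once, which is one reason many systems use a shifted form; that is outside what the slide states.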


TF Methods


IDF Weighting

• Idea: a term is more discriminative if it occurs in fewer documents
• Formula: IDF(t) = 1 + log(n/k)
  ◦ n: total number of docs
  ◦ k: number of docs with term t (doc freq)
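A minimal sketch of this IDF formula follows. The function name `idf`, the natural logarithm, and the handling of terms that occur in no document (returning 0.0) are assumptions of the sketch; the toy documents are only illustrative.

```python
import math

def idf(term, docs):
    """IDF(t) = 1 + log(n/k); docs is a list of token sets."""
    n = len(docs)                                      # total number of docs
    k = sum(1 for tokens in docs if term in tokens)    # doc frequency of the term
    if k == 0:
        return 0.0  # assumption: the slide's formula is undefined for unseen terms
    return 1.0 + math.log(n / k)

docs = [set(text.split()) for text in (
    "travel information map travel",
    "information retrieval search engine information",
    "government president congress",
)]
print(idf("information", docs))  # occurs in 2 of 3 docs
print(idf("map", docs))          # occurs in 1 of 3 docs, so it scores higher
```

The rarer term gets the larger value, which is the discriminative effect the slide describes.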


IDF weighting Methods


TF Normalization

• Why?
  ◦ Document length variation
  ◦ “Repeated occurrences” are less informative than the “first occurrence”
• Two views of document length
  ◦ A doc is long because it uses more words
  ◦ A doc is long because it has more contents
• Generally penalize long docs, but avoid over-penalizing


TF-IDF Weighting

• TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
  ◦ Common in doc → high TF → high weight
  ◦ Rare in collection → high IDF → high weight
• Imagine a word count profile: what kind of terms would have high weights?
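To make the product concrete, here is a small sketch that computes weight(t,d) = TF(t,d) * IDF(t) for every term of every document. It assumes raw TF and IDF(t) = 1 + log(n/k) with the natural logarithm; the function name `tfidf_weights` and the toy sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document; docs is a list of token lists."""
    n = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in docs for term in set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency
        weights.append({t: tf[t] * (1.0 + math.log(n / df[t])) for t in tf})
    return weights

docs = [text.split() for text in ("the cat sat", "the dog sat", "the cat cat purred")]
w = tfidf_weights(docs)
print(w[2])
```

In the third document, "the" occurs in every document and gets the lowest weight, while "cat" (frequent in the document) and "purred" (rare in the collection) score higher.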


How to Measure Similarity?

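\[ D_i = (w_{i1}, \ldots, w_{iN}) \]

\[ Q = (w_{q1}, \ldots, w_{qN}), \quad w = 0 \text{ if a term is absent} \]

Dot product similarity:

\[ SC(Q, D_i) = \sum_{j=1}^{N} w_{qj}\, w_{ij} \]

Cosine (normalized dot product):

\[ sim(Q, D_i) = \frac{\sum_{j=1}^{N} w_{qj}\, w_{ij}}{\sqrt{\sum_{j=1}^{N} w_{qj}^{2}}\;\sqrt{\sum_{j=1}^{N} w_{ij}^{2}}} \]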
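Both similarity measures are short to write down in code. The sketch below assumes the query and the document are already weight vectors of equal length N; the function names and the toy vectors are illustrative, and returning 0.0 for a zero-length vector is a choice of this sketch.

```python
import math

def dot(q, d):
    """Dot product similarity: sum over j of w_qj * w_ij."""
    return sum(wq * wd for wq, wd in zip(q, d))

def cosine(q, d):
    """Cosine similarity: the dot product divided by both vector lengths."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / norm if norm else 0.0  # assumption: 0.0 for an all-zero vector

q = [1.0, 1.0, 0.0]
d1 = [2.0, 2.0, 0.0]   # same direction as q, twice as long
d2 = [0.0, 0.0, 3.0]   # shares no term with q
print(dot(q, d1), cosine(q, d1))
print(dot(q, d2), cosine(q, d2))
```

The dot product grows with vector length, whereas the cosine depends only on direction, so `d1` scores (up to rounding) 1 under cosine despite being longer than the query.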


VS Example: Raw TF & Dot Product

query = “information retrieval”

doc1: information retrieval search engine information
doc2: travel information map travel
doc3: government president congress

Each cell shows the raw TF with the TF*IDF weight in parentheses; the IDF values are fake.

Term        IDF (fake)  Doc1     Doc2     Doc3     Query
Info.       2.4         2(4.8)   1(2.4)   -        1(2.4)
Retrieval   4.5         1(4.5)   -        -        1(4.5)
Travel      2.8         -        2(5.6)   -        -
Map         3.3         -        1(3.3)   -        -
Search      2.1         1(2.1)   -        -        -
Engine      5.4         1(5.4)   -        -        -
Govern.     2.2         -        -        1(2.2)   -
President   3.2         -        -        1(3.2)   -
Congress    4.3         -        -        1(4.3)   -

Sim(q,doc1) = 4.8*2.4 + 4.5*4.5

Sim(q,doc2) = 2.4*2.4

Sim(q,doc3) = 0
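The three similarity values on this slide can be checked with a few lines of Python. The weights below are copied from the slide's example (with its fake IDF values); the dictionary layout and the function name `sim` are assumptions of this sketch.

```python
# TF*IDF weights taken from the slide's example (IDF values are fake).
query = {"info": 2.4, "retrieval": 4.5}
docs = {
    "doc1": {"info": 4.8, "retrieval": 4.5, "search": 2.1, "engine": 5.4},
    "doc2": {"info": 2.4, "travel": 5.6, "map": 3.3},
    "doc3": {"govern": 2.2, "president": 3.2, "congress": 4.3},
}

def sim(q, d):
    """Dot product over the query's terms; absent terms contribute 0."""
    return sum(w * d.get(term, 0.0) for term, w in q.items())

scores = {name: sim(query, vec) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 2))
```

Only terms shared with the query contribute, so doc3 scores 0 and doc1 ranks first.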


Example

Q: “gold silver truck”
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Document Frequency of the jth term (df_j)
• Inverse Document Frequency (idf) = log10(n / df_j)

Tf*idf is used as term weight here


Example (Cont’d)

Id  Term      df  idf
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176


Example (Cont’d)

Tf*idf is used here

SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) = 0.031

SC(Q, D2) = 0.486

SC(Q, D3) = 0.062

The ranking would be D2, D3, D1.

• This SC uses the dot product.
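The scores in this example can be reproduced with a short script. It is a sketch under assumptions: whitespace tokenization with lowercasing, raw TF, idf = log10(n / df), and D1 taken as “Shipment of gold damaged in a fire” to match the term table (which lists "damaged"); the helper names are hypothetical.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"

def tokenize(text):
    return text.lower().split()

n = len(docs)
# Document frequency: each term counted once per document.
df = Counter(t for text in docs.values() for t in set(tokenize(text)))
idf = {t: math.log10(n / k) for t, k in df.items()}

def weights(text):
    """tf*idf weight for every term of a text; unseen terms get weight 0."""
    tf = Counter(tokenize(text))
    return {t: f * idf.get(t, 0.0) for t, f in tf.items()}

def score(query_w, doc_w):
    """Dot product over the query's terms."""
    return sum(w * doc_w.get(t, 0.0) for t, w in query_w.items())

q = weights(query)
scores = {name: score(q, weights(text)) for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 3))
```

Rounded to three decimals, the script gives 0.486 for D2, 0.062 for D3, and 0.031 for D1, in agreement with the values on the slide.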


Advantages of VS Model

• Empirically effective! (Top TREC performance)
• Intuitive
• Easy to implement
• Well-studied/most evaluated
• The SMART system
  ◦ Developed at Cornell: 1960-1999
  ◦ Still widely used
• Warning: Many variants of TF-IDF!


Disadvantages of VS Model

• Assumes term independence
• Assumes that a query and a document can be treated as the same kind of object
• Lots of parameter tuning!


Improving the VSM Model

We can improve the model by:
• Reducing the number of dimensions
  ◦ eliminating all stop words and very common terms
  ◦ stemming terms to their roots
  ◦ Latent Semantic Analysis
• Not retrieving documents below a defined cosine threshold
• Normalizing the frequency of a term i in document j [1]
  ◦ Normalized document frequencies
  ◦ Normalized query frequencies


Stop List

• Function words do not bear useful information for IR
  ◦ of, not, to, or, in, about, with, I, be, …
• Stop list: contains stop words, which are not used as index terms
  ◦ Prepositions
  ◦ Articles
  ◦ Pronouns
  ◦ Some adverbs and adjectives
  ◦ Some frequent words (e.g. document)
• The removal of stop words usually improves IR effectiveness
• A few “standard” stop lists are commonly used.
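Stop-word removal amounts to filtering tokens against a fixed set. The sketch below uses the function words named on the slide plus the three English articles, which are added here as an assumption; it is far smaller than any standard stop list, and the function name is hypothetical.

```python
# Tiny illustrative stop list: the slide's examples plus the articles (an assumption).
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be",
              "a", "an", "the"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in the stop list."""
    return [token for token in text.lower().split() if token not in stop_words]

index_terms = remove_stop_words("Shipment of gold arrived in a truck")
print(index_terms)
```

Only the content-bearing words remain as index terms, which also reduces the number of dimensions in the vector space.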


Stemming

• Reason: different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
• Stemming: removing some endings of a word
  ◦ dancer, dancers, dance, danced, dancing → dance


Stemming (Cont’d)

Two main methods:
• Linguistic/dictionary-based stemming
  ◦ high stemming accuracy
  ◦ high implementation and processing costs and higher coverage
• Porter-style stemming
  ◦ lower stemming accuracy
  ◦ lower implementation and processing costs and lower coverage
  ◦ Usually sufficient for IR
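To give a feel for rule-based suffix stripping, here is a deliberately naive toy stemmer. It is not the Porter algorithm: the suffix list, the minimum stem length, and the function name `toy_stem` are all assumptions chosen so that the word forms from the earlier slide collapse to one stem.

```python
# Suffixes tried in order, longest first; a hypothetical list for illustration only.
SUFFIXES = ("ers", "ing", "er", "ed", "es", "s", "e")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

forms = ["dancer", "dancers", "dance", "danced", "dancing"]
print({form: toy_stem(form) for form in forms})
print(toy_stem("search"), toy_stem("searching"))
```

As with Porter-style stemmers, the resulting stem ("danc") need not be a dictionary word; it only has to be the same for all related forms. A rule set this small will conflate unrelated words, which illustrates the lower accuracy the slide mentions.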


Latent Semantic Indexing (LSI) [3]

• Reduces the dimensions of the term-document space
• Attempts to solve the synonymy and polysemy problems
• Uses Singular Value Decomposition (SVD)
  ◦ identifies patterns in the relationships between the terms and concepts contained in an unstructured collection of text
• Based on the principle that words that are used in the same contexts tend to have similar meanings


LSI Process

In general, the process involves:
• constructing a weighted term-document matrix
• performing a Singular Value Decomposition on the matrix
• using the matrix to identify the concepts contained in the text

LSI statistically analyses the patterns of word usage across the entire document collection
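The steps above correspond to the standard truncated SVD, written out here as a sketch. The symbols are assumptions for illustration and do not appear on the slides: A is the weighted term-document matrix (terms × documents), k is the number of retained dimensions ("concepts"), and q is a query vector in term space.

```latex
A = U \Sigma V^{T}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

Here U_k and V_k keep only the first k columns of U and V, and Σ_k keeps the k largest singular values. Documents are represented by the rows of V_k, and the query is folded into the same k-dimensional space as \hat{q}, where it can be compared with them using the cosine measure.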


References

Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf

https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt

https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_model-updated.ppt

https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt

Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299


Thanks For Your Attention ….
