INFORMATION RETRIEVAL AND WEB SEARCH CC437 (Based on original material by Udo Kruschwitz)




INFORMATION RETRIEVAL

GOAL: Find the documents most relevant to a certain QUERY

Latest development: WEB SEARCH
– Use the Web as the collection of documents

Related:
– QUESTION-ANSWERING
– DOCUMENT CLASSIFICATION


INFORMATION RETRIEVAL: SUBTASKS

INDEX the documents in the collection
– (offline)

PROCESS the query

EVALUATE SIMILARITY and find RANKs
– Find documents most closely matching the query

DISPLAY results / enter a DIALOGUE
– E.g., user may refine the query


DOCUMENTS AS BAGS OF WORDS

DOCUMENT

broad tech stock rally may signal trend - traders.

technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.

INDEX

broad
may
rally
rallied
signal
stock
stocks
tech
technology
traders
traders
trend
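As a minimal sketch of the bag-of-words idea, the snippet below reduces the headline to an unordered multiset of lowercase tokens. The function name `bag_of_words` and the tokenization rule (runs of letters only) are assumptions made for illustration, not something specified on the slide.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


doc1 = "broad tech stock rally may signal trend - traders."
bag = bag_of_words(doc1)

# Word order is gone; only the terms and their counts remain.
print(sorted(bag))
```

A `Counter` keeps term counts, so a word that occurs twice in a document is still recorded twice even though its position is lost.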


SUBTASKS I: INDEXING

– PREPROCESSING
– Deletion of STOPWORDS
– STEMMING
– Selection of INDEX TERMS


INDEXING I: PREPROCESSING

PUNCTUATION REMOVAL
– (Crestani et al)

CASE FOLDING
– London → london
– LONDON → london

DIGIT REMOVAL
– But: SPARCStation 5
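The three preprocessing steps can be sketched as follows. The function `preprocess` and its `remove_digits` flag are hypothetical names chosen for this example; digit removal is made optional because, as the SPARCStation 5 case shows, digits sometimes carry meaning.

```python
import string


def preprocess(text, remove_digits=False):
    """Case-fold, strip punctuation, optionally strip digits, then tokenize."""
    text = text.lower()  # case folding
    text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_digits:
        text = text.translate(str.maketrans("", "", string.digits))
    return text.split()


print(preprocess("London, LONDON!"))
print(preprocess("SPARCStation 5"))
print(preprocess("SPARCStation 5", remove_digits=True))
```

With digit removal switched on, the model number disappears from the token list, which is exactly the loss the slide warns about.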


INDEXING II: STOPWORD REMOVAL

Very frequent words are not good discriminators
– Many of these are CLOSED CLASS words

INQUERY’s list of stop words beginning with letter “a”:

a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, amongst, am, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anywhere, apart, are, around, as, at

Domain-specific stopwords
– search, webmaster, copyright, www
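A minimal sketch of stopword filtering, assuming the tokens have already been case-folded. `STOPWORDS` here is only a small excerpt of the words listed above, and the names `DOMAIN_STOPWORDS` and `remove_stopwords` are illustrative.

```python
# Excerpt of the "a" stopwords listed above, not the full INQUERY list.
STOPWORDS = {"a", "about", "above", "across", "after", "all", "also",
             "an", "and", "any", "are", "as", "at"}

# Domain-specific stopwords for a web collection.
DOMAIN_STOPWORDS = {"search", "webmaster", "copyright", "www"}


def remove_stopwords(tokens, stopwords=STOPWORDS | DOMAIN_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


print(remove_stopwords(["a", "rally", "across", "all", "sectors"]))
print(remove_stopwords(["webmaster", "copyright", "rally"]))
```

Using a set makes each membership test a constant-time lookup, which matters when the filter runs over every token in the collection.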


INDEXING III: STEMMING

Simplest: suffix stripping

PORTER STEMMER: inflectional & derivational morphology
– develop → develop
– developing → develop
– development → develop
– developments → develop
– BUT: photography → photographi

The effectiveness of stemming:
– For English: the increase in recall doesn’t compensate for the loss in precision
– For other languages: necessary

E.g., Abdul Goweder’s dissertation
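The "simplest" approach, plain suffix stripping, can be sketched as below. This is deliberately not the Porter stemmer: the suffix list, the `min_stem` threshold, and the function name `strip_suffix` are all assumptions chosen so that the develop examples above come out the same way.

```python
# Illustrative suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ments", "ment", "ing", "s")


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["develop", "developing", "development", "developments"]:
    print(w, "->", strip_suffix(w))
```

A rule list this crude leaves "photography" untouched, whereas Porter's rules rewrite it to "photographi"; a real stemmer applies several ordered rule phases rather than a single pass.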


STORAGE

Requirements
– Huge amounts of data
– Lots of redundancy
– Quick random access necessary

Indexing techniques:
– Inverted index files
– Suffix trees / suffix arrays
– Signature files


STORAGE TECHNIQUES: INVERTED INDEX

DOCUMENT 1

broad tech stock rally may signal trend - traders.

DOCUMENT 2

technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.

INVERTED INDEX

broad {1}
gain {2}
rally {1,2}
score {2}
signal {1}
stock {1,2}
tech {1}
technology {2}
traders {1,2}
trend {1}
tuesday {2}
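A minimal sketch of how such an inverted index might be built: each term maps to the set of documents that contain it. The token lists below are hand-prepared to match the stemmed index terms of the example (an assumption; the preprocessing itself is not shown), and `build_inverted_index` is a name chosen for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


# Index terms of the two example documents, already stopped and stemmed by hand.
docs = {
    1: ["broad", "tech", "stock", "rally", "signal", "trend", "traders"],
    2: ["technology", "stock", "rally", "tuesday", "gain", "score", "traders"],
}

index = build_inverted_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
```

Looking up a query term is then a single dictionary access, which gives the quick random access the storage requirements call for.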


SIMILARITY MODELS

– Boolean model
– Probabilistic model
– Vector-space model


THE BOOLEAN MODEL

Each index term is either present or absent

Documents are either RELEVANT or NOT RELEVANT (no grading of results)

Advantages
– Clean formalism, simple to implement

Disadvantages
– Exact matching only
– All index terms equal weight
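Boolean retrieval over an inverted index reduces to set operations on postings, as in the rough sketch below. The `INDEX` contents are a hand-picked excerpt of the postings for the two example documents, and `postings` is a hypothetical helper.

```python
# Excerpt of the postings for the two example documents.
INDEX = {
    "rally": {1, 2},
    "tech": {1},
    "technology": {2},
    "tuesday": {2},
}


def postings(term):
    """Return the set of document ids containing the term (empty if unseen)."""
    return INDEX.get(term, set())


both = postings("rally") & postings("tuesday")      # rally AND tuesday
either = postings("tech") | postings("technology")  # tech OR technology
without = postings("rally") - postings("tuesday")   # rally AND NOT tuesday
unseen = postings("stocks")                         # no exact match in the index

print(both, either, without, unseen)
```

The result is an unranked set: a document either satisfies the expression or it does not, and a query term that is not an exact index entry matches nothing.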


THE VECTOR SPACE MODEL

Query and documents represented as vectors of index terms, assigned non-binary WEIGHTS

Similarity calculated using vector algebra: COSINE (cf. lexical similarity models)
– RANKED similarity

Most popular of all models (cf. Salton and Lesk’s SMART)


SIMILARITY IN VECTOR SPACE MODELS: THE COSINE MEASURE

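$$\mathrm{sim}(\vec{q}_k, \vec{d}_j) = \cos\theta = \frac{\vec{q}_k \cdot \vec{d}_j}{|\vec{q}_k| \, |\vec{d}_j|} = \frac{\sum_{i=1}^{N} w_{i,k} \, w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,k}^{2}} \; \sqrt{\sum_{i=1}^{N} w_{i,j}^{2}}}$$

where $\theta$ is the angle between the query vector $\vec{q}_k$ and the document vector $\vec{d}_j$, and $w_{i,k}$, $w_{i,j}$ are the weights of index term $i$ in the query and in the document.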
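The cosine measure can be sketched in a few lines, assuming the query and the document are given as equal-length lists of term weights over the same index terms. The function name `cosine` and the convention of returning 0.0 for an all-zero vector are choices made for this example.

```python
import math


def cosine(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(w_q * w_d for w_q, w_d in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention for an empty vector (assumption)
    return dot / (norm_q * norm_d)


print(cosine([1, 0], [1, 0]))  # same direction
print(cosine([1, 0], [0, 1]))  # no terms in common
print(cosine([1, 1], [1, 0]))  # partial overlap
```

Because the score is normalized by vector length, a long document does not outrank a short one merely by containing more words.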


TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE

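$$\mathrm{tfidf}_{i,k} = f_{i,k} \times \log \frac{N}{df_i}$$

where
– $f_{i,k}$: FREQUENCY of term i in document k
– $df_i$: number of documents with term i
– $N$: total number of documents in the collection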
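A small worked sketch of the TF.IDF weight, using the two example documents as the whole collection. The natural logarithm is assumed here because the slide does not state the base, and the function name `tf_idf` is illustrative.

```python
import math


def tf_idf(term, doc_tokens, collection):
    """Term frequency in the document times log(N / document frequency)."""
    f = doc_tokens.count(term)
    df = sum(1 for tokens in collection if term in tokens)
    if df == 0:
        return 0.0  # term occurs nowhere in the collection
    return f * math.log(len(collection) / df)


doc1 = ["broad", "tech", "stock", "rally", "may", "signal", "trend", "traders"]
doc2 = ["technology", "stocks", "rallied", "tuesday", "gains", "traders"]
collection = [doc1, doc2]

print(tf_idf("tech", doc1, collection))     # occurs in one document only
print(tf_idf("traders", doc1, collection))  # occurs in every document
```

A term that appears in every document gets weight zero, which is the formula's way of saying it cannot discriminate between documents.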


EVALUATION

One of the most important contributions of IR to NLE has been the development of better ways of evaluating systems than simple accuracy


Simplest quantitative evaluation metrics

ACCURACY: percentage correct (against some gold standard)
– e.g., tagger gets 96.7% of tags correct when evaluated using the Penn Treebank
– Problem with accuracy: only really useful when classes are of approximately equal size (not the case in IR)

ERROR: percentage wrong
– ERROR REDUCTION most typical metric in ASR
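To see why accuracy misleads when classes are unequal, consider a hypothetical collection of 1000 documents of which only 10 are relevant (numbers invented for illustration). A system that retrieves nothing at all still scores very well:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all decisions (relevant / not relevant) that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical collection: 1000 documents, 10 relevant, nothing retrieved.
# All 990 irrelevant documents are correctly rejected, all 10 relevant ones missed.
score = accuracy(tp=0, tn=990, fp=0, fn=10)
print(score)
```

The score is 0.99 even though the system is useless, because the huge class of irrelevant documents dominates the count.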


A more general form of evaluation: precision & recall



Positives and negatives

[Figure: the collection partitioned into TRUE POSITIVES (TP), FALSE POSITIVES (FP), FALSE NEGATIVES and TRUE NEGATIVES]


Precision and recall

PRECISION: proportion correct AMONG SELECTED ITEMS

RECALL: proportion of correct items selected
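Both measures follow directly from set overlap, as this sketch shows. The document ids and the function name `precision_recall` are made up for the example, and returning 0.0 for an empty set is a convention assumed here.

```python
def precision_recall(selected, relevant):
    """Precision and recall of a set of selected items against the relevant set."""
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


selected = {1, 2, 3, 4}  # documents the system returned (hypothetical)
relevant = {2, 4, 5}     # documents that are actually relevant (hypothetical)

p, r = precision_recall(selected, relevant)
print(p, r)
```

Here two of the four returned documents are relevant (precision 1/2) and two of the three relevant documents were found (recall 2/3).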


The tradeoff between precision and recall

Easy to get high precision: never classify anything

Easy to get high recall: return everything

Really need to report BOTH, or F-measure

$$F = \frac{2PR}{P + R}$$
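The balanced F-measure is the harmonic mean of precision P and recall R, sketched below with illustrative values. The function name `f_measure` and the zero-division convention are assumptions.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f_measure(0.5, 0.5))   # balanced system
print(f_measure(0.01, 1.0))  # "return everything": perfect recall, poor precision
```

Because the harmonic mean is pulled toward the smaller value, a system cannot earn a high F-measure by maximizing only one of the two.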


WEB SEARCH

In many senses, just a form of IR

But:
– Further information one has to take into account
  – Markup
  – Hyperlinks
  – Meta tags
– Extra problems
  – Documents highly heterogeneous
  – Multimedia
  – Quality of data
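As a rough sketch of how the extra information on a web page might be collected, the snippet below pulls hyperlinks and named meta tags out of an HTML string with Python's standard `html.parser` module. The class name `LinkAndMetaParser` and the sample page are invented for illustration.

```python
from html.parser import HTMLParser


class LinkAndMetaParser(HTMLParser):
    """Collect hyperlink targets and named meta tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")


# Made-up page used only to exercise the parser.
page = (
    '<html><head><meta name="keywords" content="ir, search"></head>'
    '<body><a href="http://example.org/">a link</a></body></html>'
)

parser = LinkAndMetaParser()
parser.feed(page)
print(parser.links, parser.meta)
```

The collected links are the raw material for link analysis, while meta tags and markup can feed additional index terms or weights.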


GOOGLE

Key aspects of Google’s search algorithm (as far as we know!)

– Analyze link structure: PAGE RANK
– Exploit visual presentation

Page Rank used to rank retrieved documents in addition to similarity measures

Page Rank motivations:
– Most important papers are those cited most often
– Not all sources of citations are equally reliable


PAGE RANK

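$$\mathrm{PageRank}(p) = q + (1 - q) \sum_{i=1}^{k} \frac{\mathrm{PageRank}(p_i)}{C(p_i)}$$

where
– $p$: the page being ranked
– $q$: probability of randomly jumping to that page
– $p_1, \dots, p_k$: pages pointing to $p$
– $C(p_i)$: number of outgoing links of page $p_i$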
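The Page Rank recurrence can be approximated by simple iteration, as in this sketch. The three-page link graph, the value q = 0.15, the fixed iteration count, and the function name `page_rank` are all assumptions for illustration; the sketch also assumes every page has at least one outgoing link.

```python
def page_rank(links, q=0.15, iterations=50):
    """Iterate PageRank(p) = q + (1 - q) * sum(PageRank(p_i) / C(p_i)).

    links maps each page to the list of pages it points to.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            incoming = [s for s in pages if p in links[s]]
            new_rank[p] = q + (1 - q) * sum(
                rank[s] / len(links[s]) for s in incoming
            )
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(links)
print(ranks)
```

Page C ends up ranked highest because it is pointed to by both other pages, and A benefits in turn from being the only page C links to.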


READINGS AND REFERENCES

Jurafsky and Martin, chapter 10.1-10.4

Other references

– Brin, S. and Page, L., 1998, “The anatomy of a large-scale hypertextual web search engine”, in Proc. of the 7th WWW Conference (WWW7), Brisbane

– Crestani, F. et al., 1998, “Is this document relevant? …probably”, ACM Computing Surveys, 30(4):528-552

– Goweder, A., 2004, The role of stemming in IR: the case of Arabic, PhD dissertation, University of Essex

– Porter, M.F., 1980, “An algorithm for suffix stripping”, Program, 14(3):130-137

– Salton, G. and Lesk, M.E., 1968, “Computer evaluation of indexing and text processing”, Journal of the ACM, 15(1):8-36