INFORMATION RETRIEVAL AND WEB SEARCH CC437 (Based on original material by Udo Kruschwitz)

INFORMATION RETRIEVAL

GOAL: Find the documents most relevant to a certain QUERY

Latest development: WEB SEARCH
– Use the Web as the collection of documents

Related:
– QUESTION-ANSWERING
– DOCUMENT CLASSIFICATION

INFORMATION RETRIEVAL: SUBTASKS

INDEX the documents in the collection
– (offline)

PROCESS the query

EVALUATE SIMILARITY and find RANKs
– Find documents most closely matching the query

DISPLAY results / enter a DIALOGUE
– E.g., user may refine the query

DOCUMENTS AS BAGS OF WORDS

DOCUMENT

broad tech stock rally may signal trend - traders.

technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.

INDEX

broad
may
rally
rallied
signal
stock
stocks
tech
technology
traders
traders
trend
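A minimal sketch of the bag-of-words idea in Python, applied to the headline above. The function name `bag_of_words` and the letters-only tokenization are assumptions made for illustration, not part of the original material.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

headline = "broad tech stock rally may signal trend - traders."
bag = bag_of_words(headline)

# Word order is discarded; only the terms and their counts remain.
print(sorted(bag))
```

The counts are what a weighting scheme would later use; the order of words in the document plays no role.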

SUBTASKS I: INDEXING

PREPROCESSING

Deletion of STOPWORDS

STEMMING

Selection of INDEX TERMS

INDEXING I: PREPROCESSING

PUNCTUATION REMOVAL
– (Crestani et al)

CASE FOLDING
– London → london
– LONDON → london

DIGIT REMOVAL
– But: SPARCStation 5
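The three preprocessing steps can be sketched in a few lines of Python. The function name `preprocess` and its `remove_digits` flag are hypothetical, chosen only to show why digit removal is risky for a term such as "SPARCStation 5".

```python
import string

def preprocess(text, remove_digits=True):
    """Apply case folding, punctuation removal and (optionally) digit removal."""
    text = text.lower()  # case folding
    drop = string.punctuation + (string.digits if remove_digits else "")
    text = text.translate(str.maketrans("", "", drop))  # delete unwanted characters
    return text.split()

print(preprocess("London, LONDON and SPARCStation 5!"))
# With digit removal switched off, the model number survives.
print(preprocess("SPARCStation 5", remove_digits=False))
```

With digit removal on, the "5" disappears and the product name can no longer be told apart from other SPARCStation models.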

INDEXING II: STOPWORD REMOVAL

Very frequent words are not good discriminators
– Many of these are CLOSED CLASS words

INQUERY’s list of stop words beginning with letter “a”:

a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, amongst, am, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anywhere, apart, are, around, as, at

Domain-specific stopwords
– search, webmaster, copyright, www
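Stopword removal is a simple filter over the token list. The sketch below assumes a tiny stopword set copied from the "a" entries above plus the domain-specific examples; a real list would be far longer, and the names `STOPWORDS`, `DOMAIN_STOPWORDS` and `remove_stopwords` are illustrative.

```python
# A few entries from the INQUERY "a" list; a real list is much longer.
STOPWORDS = {"a", "about", "above", "across", "after", "all", "an", "and", "as", "at"}
# Domain-specific stopwords for a web collection, taken from the examples above.
DOMAIN_STOPWORDS = {"search", "webmaster", "copyright", "www"}

def remove_stopwords(tokens, stopwords=STOPWORDS | DOMAIN_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = ["copyright", "all", "about", "a", "stock", "rally"]
kept = remove_stopwords(tokens)
print(kept)  # ['stock', 'rally']
```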

INDEXING III: STEMMING

Simplest: suffix stripping

PORTER STEMMER: inflectional & derivational morphology
– develop → develop
– developing → develop
– development → develop
– developments → develop
– BUT: photography → photographi

The effectiveness of stemming:
– For English: increase in recall doesn’t compensate for the loss in precision
– For other languages: necessary
  – E.g., Abdul Goweder’s dissertation
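To make "suffix stripping" concrete, here is a deliberately naive sketch. It is not the Porter stemmer, which applies several ordered rule phases with conditions on the stem; the suffix list and the `min_stem` threshold are assumptions picked to cover the "develop" examples.

```python
SUFFIXES = ["ments", "ment", "ing", "s"]  # illustrative only, longest first

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["develop", "developing", "development", "developments"]:
    print(w, "->", strip_suffix(w))

# Over-stemming: "news" is conflated with "new", which hurts precision.
print("news", "->", strip_suffix("news"))
```

The last line shows the kind of unwanted conflation that costs precision: a query for "news" would now match documents about anything "new".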

STORAGE

Requirements
– Huge amounts of data
– Lots of redundancy
– Quick random access necessary

Indexing techniques:
– Inverted index files
– Suffix trees / suffix arrays
– Signature files

STORAGE TECHNIQUES: INVERTED INDEX

DOCUMENT 1

broad tech stock rally may signal trend - traders.

DOCUMENT 2

technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.

INVERTED INDEX

broad {1}
gain {2}
rally {1,2}
score {2}
signal {1}
stock {1,2}
tech {1}
technology {2}
traders {1,2}
trend {1}
tuesday {2}
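A minimal sketch of building an inverted index for the two example documents. It assumes letters-only tokenization and applies neither stopword removal nor stemming, so its postings differ from the index shown above (for example, "rally" and "rallied" stay separate terms). The function name `build_inverted_index` is illustrative.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = {
    1: "broad tech stock rally may signal trend - traders.",
    2: "technology stocks rallied on tuesday, with gains scored broadly "
       "across many sectors, amid what some traders called a recovery "
       "from recent doldrums.",
}
index = build_inverted_index(docs)
print(index["traders"])  # [1, 2]
print(index["rally"])    # [1]  (document 2 only matches after stemming "rallied")
```

Looking up a query term is then a single dictionary access, which is what makes the structure suitable for quick random access.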

SIMILARITY MODELS

Boolean model

Probabilistic model

Vector-space model

THE BOOLEAN MODEL

Each index term is either present or absent

Documents are either RELEVANT or NOT RELEVANT (no grading of results)

Advantages
– Clean formalism, simple to implement

Disadvantages
– Exact matching only
– All index terms equal weight
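Boolean retrieval maps directly onto set operations over postings. The sketch below uses a small hand-written postings table (an assumption for illustration) and shows AND, OR and AND NOT; note that the result is an unordered set, with no ranking.

```python
# Toy postings (term -> set of document ids), as an inverted index would store them.
postings = {
    "stock":   {1, 2},
    "rally":   {1, 2},
    "tech":    {1},
    "tuesday": {2},
}

def docs_with(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return postings.get(term, set())

both = docs_with("stock") & docs_with("tuesday")    # stock AND tuesday
either = docs_with("tech") | docs_with("tuesday")   # tech OR tuesday
without = docs_with("stock") - docs_with("tech")    # stock AND NOT tech

print(both, either, without)
```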

THE VECTOR SPACE MODEL

Query and documents represented as vectors of index terms, assigned non-binary WEIGHTS

Similarity calculated using vector algebra: COSINE (cf. lexical similarity models)
– RANKED similarity

Most popular of all models (cf. Salton and Lesk’s SMART)

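SIMILARITY IN VECTOR SPACE MODELS: THE COSINE MEASURE

$$\cos\theta = \frac{\vec{d_j} \cdot \vec{q_k}}{|\vec{d_j}| \, |\vec{q_k}|}$$

$$sim(\vec{d_j}, \vec{q_k}) = \frac{\sum_{i=1}^{N} w_{i,j} \, w_{i,k}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^{2}} \; \sqrt{\sum_{i=1}^{N} w_{i,k}^{2}}}$$

Here θ is the angle between the document vector d_j and the query vector q_k, w_{i,j} and w_{i,k} are the weights of index term i in each, and N is the number of index terms.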
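The cosine measure translates directly into code. This is a minimal sketch assuming documents and queries are already equal-length lists of term weights; returning 0.0 for an all-zero vector is a convention chosen here, not something the original specifies.

```python
import math

def cosine(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_d * norm_q)

doc = [1.0, 1.0, 0.0]
query = [1.0, 0.0, 0.0]
print(round(cosine(doc, query), 3))  # 0.707
```

Because the measure normalizes by vector length, a long document is not favoured merely for containing more words.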

TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE

$$tfidf_{i,k} = f_{i,k} \cdot \log\frac{N}{df_i}$$

– f_{i,k}: FREQUENCY of term i in document k
– df_i: number of documents with term i
– N: total number of documents in the collection
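A minimal sketch of the tf.idf weighting over already tokenized documents. The natural logarithm is an assumption (the base only rescales the weights), and the function name `tfidf` is illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, with weight = f * log(N / df)."""
    n_docs = len(docs)
    # df[term] = number of documents that contain the term at least once
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: f * math.log(n_docs / df[term]) for term, f in Counter(doc).items()}
        for doc in docs
    ]

docs = [
    ["stock", "rally", "tech"],
    ["stock", "stock", "tuesday"],
]
weights = tfidf(docs)
print(weights[0]["stock"])    # 0.0: "stock" occurs in every document
print(weights[1]["tuesday"])  # log(2): occurs in one of two documents
```

A term that appears in every document gets weight zero, however often it occurs, which is the same intuition as stopword removal expressed as a weight.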

EVALUATION

One of the most important contributions of IR to NLE has been the development of better ways of evaluating systems than simple accuracy

Simplest quantitative evaluation metrics

ACCURACY: percentage correct (against some gold standard)
– e.g., tagger gets 96.7% of tags correct when evaluated using the Penn Treebank
– Problem with accuracy: only really useful when classes are of approximately equal size (not the case in IR)

ERROR: percentage wrong
– ERROR REDUCTION most typical metric in ASR
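A small worked example of why accuracy misleads when classes are unbalanced. The collection size and number of relevant documents are made-up figures for illustration.

```python
def accuracy(predicted, gold):
    """Fraction of items whose predicted label equals the gold label."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Hypothetical collection: 10 relevant documents out of 1000.
gold = [True] * 10 + [False] * 990
# A useless system that never retrieves anything.
retrieve_nothing = [False] * 1000

print(accuracy(retrieve_nothing, gold))  # 0.99
```

The system finds none of the relevant documents and still scores 99%, because almost every document is correctly labelled "not relevant".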

A more general form of evaluation: precision & recall

Positives and negatives

[Figure: items divided into TRUE POSITIVES (TP), FALSE POSITIVES (FP), FALSE NEGATIVES and TRUE NEGATIVES]

Precision and recall

PRECISION: proportion correct AMONG SELECTED ITEMS

RECALL: proportion of correct items selected
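Both measures can be computed from two sets: the items the system selected and the items that are correct in the gold standard. This is a minimal sketch; the function name and the choice to return 0.0 when a set is empty are assumptions.

```python
def precision_recall(selected, relevant):
    """selected: items the system returned; relevant: correct items in the gold standard."""
    tp = len(selected & relevant)  # true positives: selected and correct
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

selected = {1, 2, 3, 4}
relevant = {2, 4, 6, 8, 10}
p, r = precision_recall(selected, relevant)
print(p, r)  # 0.5 0.4
```

Two of the four selected items are correct (precision 0.5), and two of the five correct items were found (recall 0.4).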

The tradeoff between precision and recall

Easy to get high precision: select almost nothing (only the items you are most sure of)

Easy to get high recall: return everything

Really need to report BOTH, or F-measure

$$F = \frac{2PR}{P + R}$$
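A short sketch of the F-measure as the harmonic mean of precision and recall. Returning 0.0 when both inputs are zero is a convention chosen here.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.5, 0.4), 3))   # 0.444
# "Return everything": perfect recall, very low precision.
print(round(f_measure(0.01, 1.0), 3))  # 0.02
```

The second call shows why the harmonic mean is used: a system that returns everything gets perfect recall but an F-measure close to its poor precision.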

WEB SEARCH

In many senses, just a form of IR

But:
– Further information one has to take into account
  – Markup
  – Hyperlinks
  – Meta tags
– Extra problems
  – Documents highly heterogeneous
  – Multimedia
  – Quality of data

GOOGLE

Key aspects of Google’s search algorithm (as far as we know!)

– Analyze link structure: PAGE RANK
– Exploit visual presentation

Page Rank used to rank retrieved documents in addition to similarity measures

Page Rank motivations:
– Most important papers are those cited most often
– Not all sources of citations are equally reliable

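PAGE RANK

$$PageRank(p) = q + (1 - q) \cdot \sum_{i=1}^{k} \frac{PageRank(p_i)}{C(p_i)}$$

– p: the page being ranked
– q: probability of randomly jumping to that page
– p_1, …, p_k: pages pointing to p
– C(p_i): number of links going out of page p_i (as in Brin and Page, 1998)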
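The PageRank definition is recursive, so in practice it is computed by repeated substitution until the values settle. The sketch below assumes a tiny hand-made link graph, a jump probability q = 0.15 and a fixed number of iterations; all three are illustrative choices, and this is not a description of Google's production system.

```python
def page_rank(links, q=0.15, iterations=50):
    """links maps each page to the list of pages it points to.

    Iterates PageRank(p) = q + (1 - q) * sum(PageRank(p_i) / C(p_i))
    over the pages p_i that point to p, where C(p_i) = len(links[p_i]).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each page pointing to p passes on an equal share of its own rank.
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in links.items()
                if p in targets
            )
            new_rank[p] = q + (1 - q) * incoming
        rank = new_rank
    return rank

# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = page_rank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page c ends up ranked highest because it is pointed to by both other pages, and page a outranks b because its single incoming link comes from the highly ranked c, which matches the "not all sources of citations are equally reliable" motivation.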

READINGS AND REFERENCES

Jurafsky and Martin, chapter 10.1-10.4

Other references
– Brin, S. and Page, L., 1998, “The anatomy of a large-scale hypertextual web search engine”, in Proc. of the 7th WWW Conference (WWW7), Brisbane
– Crestani, F. et al, 1998, “Is this document relevant? …probably”, ACM Computing Surveys, 30(4):528-552
– Goweder, A., 2004, The role of stemming in IR: the case of Arabic, PhD dissertation, University of Essex
– Porter, M.F., 1980, “An algorithm for suffix stripping”, Program, 14(3):130-137
– Salton, G. and Lesk, M.E., 1968, “Computer evaluation of indexing and text processing”, Journal of the ACM, 15(1):8-36