INFORMATION RETRIEVAL AND WEB SEARCH
CC437
(Based on original material by Udo Kruschwitz)
INFORMATION RETRIEVAL
GOAL: Find the documents most relevant to a certain QUERY
Latest development: WEB SEARCH
– Use the Web as the collection of documents
Related:
– QUESTION-ANSWERING
– DOCUMENT CLASSIFICATION
INFORMATION RETRIEVAL: SUBTASKS
INDEX the documents in the collection
– (offline)
PROCESS the query
EVALUATE SIMILARITY and find RANKs
– Find documents most closely matching the query
DISPLAY results / enter a DIALOGUE
– E.g., user may refine the query
DOCUMENTS AS BAGS OF WORDS
broad tech stock rally may signal trend - traders.
technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.
DOCUMENT → INDEX:
broad, may, rally, rallied, signal, stock, stocks, tech, technology, traders, traders, trend
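A minimal sketch of the bag-of-words idea, using the headline above. The function name `bag_of_words` and the choice to strip only `.`, `,` and `-` are assumptions made for illustration; word order is discarded and only term counts are kept.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lower-cased token to the number of times it occurs."""
    # Strip leading/trailing punctuation from each whitespace-separated token.
    tokens = [t.strip(".,-") for t in text.lower().split()]
    # Drop tokens that were pure punctuation (e.g. the lone "-").
    return Counter(t for t in tokens if t)

headline = "broad tech stock rally may signal trend - traders."
bag = bag_of_words(headline)
print(sorted(bag))
```

Sorting the keys gives an alphabetical term list like the one on the slide; the counts record how often each term occurred.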
SUBTASKS I: INDEXING
PREPROCESSING
Deletion of STOPWORDS
STEMMING
Selection of INDEX TERMS
INDEXING I: PREPROCESSING
PUNCTUATION REMOVAL
– (Crestani et al)
CASE FOLDING
– London → london
– LONDON → london
DIGIT REMOVAL
– But: SPARCStation 5
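The three preprocessing steps can be sketched as below. This is an illustrative sketch, not the procedure of any particular system: the function name `preprocess` and the `drop_digits` switch are assumptions. The switch is there because digit removal can lose information, as the SPARCStation 5 example suggests.

```python
import re

def preprocess(text, drop_digits=True):
    """Case-fold, remove punctuation, and optionally remove digit-only tokens."""
    text = text.lower()                        # case folding
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation removal
    if drop_digits:
        text = re.sub(r"\b\d+\b", " ", text)   # digit removal (standalone numbers only)
    return text.split()

print(preprocess("London, LONDON!"))
print(preprocess("SPARCStation 5"))
print(preprocess("SPARCStation 5", drop_digits=False))
```

With digit removal on, the model number in "SPARCStation 5" disappears from the index terms, which is the risk the slide points at.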
INDEXING II: STOPWORD REMOVAL
Very frequent words are not good discriminators
– Many of these are CLOSED CLASS words
INQUERY’s list of stop words beginning with letter “a”:
a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, amongst, am, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anywhere, apart, are, around, as, at
Domain-specific stopwords: search, webmaster, copyright, www
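Stopword removal amounts to filtering tokens against a set. In the sketch below, `STOPWORDS` is only a small subset of the INQUERY list shown above, and the names `STOPWORDS`, `DOMAIN_STOPWORDS` and `remove_stopwords` are assumptions for illustration.

```python
# A few entries from the "a" section of the stop list, not the full list.
STOPWORDS = {"a", "about", "above", "across", "after", "all", "also",
             "among", "an", "and", "are", "as", "at"}
# Domain-specific stopwords for a Web collection.
DOMAIN_STOPWORDS = {"search", "webmaster", "copyright", "www"}

def remove_stopwords(tokens, stopwords=STOPWORDS | DOMAIN_STOPWORDS):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["a", "rally", "across", "www", "sectors"]))
```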
INDEXING III: STEMMING
Simplest: suffix stripping
PORTER STEMMER: inflectional & derivational morphology
– develop → develop
– developing → develop
– development → develop
– developments → develop
– BUT: photography → photographi
The effectiveness of stemming:
– For English: the increase in recall does not compensate for the loss in precision
– For other languages: necessary
– E.g., Abdul Goweder's dissertation
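To make "suffix stripping" concrete, here is a deliberately naive sketch. It is NOT the Porter algorithm: the suffix list `SUFFIXES`, the minimum stem length, and the function name `strip_suffix` are all assumptions chosen so that the "develop" examples above come out as shown.

```python
# Longest suffixes first, so "developments" loses "ments" rather than just "s".
SUFFIXES = ("ments", "ment", "ing", "s")

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, provided a stem of min_stem letters remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["develop", "developing", "development", "developments"]:
    print(w, "->", strip_suffix(w))
```

A real stemmer applies many more rules, with conditions on the shape of the remaining stem; this sketch only shows the basic mechanism.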
STORAGE
Requirements
– Huge amounts of data
– Lots of redundancy
– Quick random access necessary
Indexing techniques:
– Inverted index files
– Suffix trees / suffix arrays
– Signature files
STORAGE TECHNIQUES: INVERTED INDEX
DOCUMENT 1
broad tech stock rally may signal trend - traders.
DOCUMENT 2
technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.
INVERTED INDEX
broad {1}
gain {2}
rally {1,2}
score {2}
signal {1}
stock {1,2}
tech {1}
technology {2}
traders {1,2}
trend {1}
tuesday {2}
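The inverted index above can be built with a dictionary from terms to document identifiers. The sketch assumes the documents have already been preprocessed and stemmed into the index terms listed on the slide; `build_inverted_index` and `docs` are illustrative names.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

# Index terms per document, assumed to be the output of preprocessing.
docs = {
    1: ["broad", "tech", "stock", "rally", "signal", "trend", "traders"],
    2: ["technology", "stock", "rally", "tuesday", "gain", "score", "traders"],
}
index = build_inverted_index(docs)
for term, ids in index.items():
    print(term, ids)
```

Looking up a query term is then a single dictionary access, which is what makes this structure suitable for quick random access.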
SIMILARITY MODELS
Boolean model
Probabilistic model
Vector-space model
THE BOOLEAN MODEL
Each index term is either present or absent
Documents are either RELEVANT or NOT RELEVANT (no grading of results)
Advantages
– Clean formalism, simple to implement
Disadvantages
– Exact matching only
– All index terms equal weight
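Under the Boolean model, a query is a set expression over postings lists. The sketch below uses a small hand-written index (`INDEX`, a hypothetical excerpt of the two-document example) and Python set operators for AND, OR and NOT.

```python
# Term -> set of ids of the documents that contain it (illustrative excerpt).
INDEX = {"rally": {1, 2}, "tech": {1}, "technology": {2}, "traders": {1, 2}}

def postings(term):
    """Documents containing the term; an unknown term matches nothing."""
    return INDEX.get(term, set())

rally_and_tech = postings("rally") & postings("tech")           # AND
tech_or_technology = postings("tech") | postings("technology")  # OR
rally_not_tech = postings("rally") - postings("tech")           # AND NOT

print(rally_and_tech, tech_or_technology, rally_not_tech)
```

The result is an unordered set: a document either satisfies the query or it does not, which is the "no grading of results" limitation noted above.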
THE VECTOR SPACE MODEL
Query and documents represented as vectors of index terms, assigned non-binary WEIGHTS
Similarity calculated using vector algebra: COSINE (cf. lexical similarity models)
– RANKED similarity
Most popular of all models (cf. Salton and Lesk's SMART)
SIMILARITY IN VECTOR SPACE MODELS: THE COSINE MEASURE
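$$\cos\theta = \frac{\vec{q}_k \cdot \vec{d}_j}{|\vec{q}_k| \, |\vec{d}_j|}$$

$$sim(q_k, d_j) = \frac{\sum_{i=1}^{N} w_{i,k} \, w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,k}^{2}} \; \sqrt{\sum_{i=1}^{N} w_{i,j}^{2}}}$$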
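A minimal sketch of the cosine measure for two weight vectors over the same N index terms. The function name `cosine` and the convention of returning 0.0 for an all-zero vector are assumptions; the measure itself is undefined in that case.

```python
import math

def cosine(q, d):
    """Cosine of the angle between query vector q and document vector d."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (norm_q * norm_d)

print(cosine([1, 0], [0, 1]))        # no shared terms
print(cosine([1, 2, 3], [2, 4, 6]))  # same direction, different length
```

Because the vectors are length-normalised, a long document is not favoured over a short one merely for containing more term occurrences.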
TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE
$$tfidf_{i,k} = f_{i,k} \cdot \log\frac{N}{df_i}$$

– f_{i,k}: FREQUENCY of term i in document k
– N: total number of documents
– df_i: number of documents with term i
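The weighting can be sketched as follows, with term frequency taken as the raw count and the natural logarithm used for idf. Both choices are assumptions, since the slide does not fix the base of the logarithm or any normalisation; `tfidf` and `docs` are illustrative names.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return {doc_id: {term: f * log(N / df)}} for a dict of token lists."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # raw frequency of each term in this document
        weights[doc_id] = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
    return weights

docs = {
    1: ["tech", "stock", "rally"],
    2: ["technology", "stock", "rally", "rally"],
}
weights = tfidf(docs)
print(weights[1])
```

A term that occurs in every document, such as "rally" here, gets weight 0: it cannot discriminate between documents, however frequent it is.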
EVALUATION
One of the most important contributions of IR to NLE has been the development of better ways of evaluating systems than simple accuracy
Simplest quantitative evaluation metrics
ACCURACY: percentage correct (against some gold standard)
– e.g., tagger gets 96.7% of tags correct when evaluated using the Penn Treebank
– Problem with accuracy: only really useful when classes are of approximately equal size (not the case in IR)
ERROR: percentage wrong
– ERROR REDUCTION most typical metric in ASR
A more general form of evaluation: precision & recall
Positives and negatives
[Figure: regions labelled TP, FP, FALSE NEGATIVES, TRUE NEGATIVES]
Precision and recall
PRECISION: proportion correct AMONG SELECTED ITEMS
RECALL: proportion of correct items selected
The tradeoff between precision and recall
Easy to get high precision: select almost nothing (only the items you are surest of)
Easy to get high recall: return everything
Really need to report BOTH, or F-measure
$$F = \frac{2PR}{P + R}$$
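Precision, recall and F can be computed from the set of selected items and the set of correct items. In this sketch the function name, the example sets, and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def precision_recall_f(selected, relevant):
    """Return (precision, recall, F) for two sets of item ids."""
    tp = len(selected & relevant)  # true positives: selected AND correct
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

selected = {1, 2, 3, 4}        # items the system returned
relevant = {2, 4, 5, 6, 7, 8}  # items that are actually correct
p, r, f = precision_recall_f(selected, relevant)
print(p, r, f)
```

Here two of the four selected items are correct (precision 1/2), but only two of the six correct items were found (recall 1/3); F combines the two into a single figure.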
WEB SEARCH
In many senses, just a form of IR
But:
– Further information one has to take into account
  Markup
  Hyperlinks
  Meta tags
– Extra problems
  Documents highly heterogeneous
  Multimedia
  Quality of data
Key aspects of Google's search algorithm (as far as we know!)
– Analyze link structure: PAGE RANK
– Exploit visual presentation
Page Rank used to rank retrieved documents in addition to similarity measures
Page Rank motivations:
– Most important papers are those cited most often
– Not all sources of citations are equally reliable
PAGE RANK
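$$PageRank(p) = q + (1 - q) \cdot \sum_{i=1}^{k} \frac{PageRank(p_i)}{C(p_i)}$$

– p: the page being ranked
– q: probability of randomly jumping to that page
– p_1 ... p_k: pages pointing to p
– C(p_i): number of outgoing links of page p_i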
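The formula is recursive, so it is usually evaluated by repeated substitution until the values settle. The sketch below does this for a toy three-page graph. The graph, the value q = 0.15, the fixed iteration count and the function name `pagerank` are all assumptions, and the code assumes every page has at least one outgoing link and every link target is itself a key of `links`.

```python
def pagerank(links, q=0.15, iterations=50):
    """Iterate PageRank(p) = q + (1 - q) * sum(PageRank(s) / C(s)) over pages s linking to p."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Pages pointing to p, each passing on an equal share of its rank.
            incoming = [s for s in pages if p in links[s]]
            new_rank[p] = q + (1 - q) * sum(rank[s] / len(links[s]) for s in incoming)
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
print(rank)
```

Page C is linked from both A and B and ends up ranked above B, which is linked only from A. A link counts for more when it comes from a well-ranked page with few outgoing links.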
READINGS AND REFERENCES
Jurafsky and Martin, chapter 10.1-10.4
Other references
– Brin, S. and Page, L., 1998, "The anatomy of a large-scale hypertextual web search engine", in Proc. of the 7th WWW Conference (WWW7), Brisbane
– Crestani, F. et al, 1998, "Is this document relevant? …probably", ACM Computing Surveys, 30(4):528-552
– Goweder, A., 2004, The role of stemming in IR: the case of Arabic, PhD dissertation, University of Essex
– Porter, M.F., 1980, "An algorithm for suffix stripping", Program, 14(3):130-137
– Salton, G. and Lesk, M.E., 1968, "Computer evaluation of indexing and text processing", Journal of the ACM, 15(1):8-36