
Text Algorithms (4AP): Information Retrieval

Jaak Vilo

Fall 2008
MTAT.03.190 Text Algorithms

Materials

• Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
  – http://people.ischool.berkeley.edu/~hearst/irbook/
  – http://www.amazon.co.uk/Modern-Information-Retrieval-ACM-Press/dp/020139829X/ref=sr_1_1?ie=UTF8&s=books&qid=1228237684&sr=8-1
  – http://books.google.com/books?id=HLyAAAAACAAJ&dq=modern+information+retrieval
  – New edition in May 2009

• Google Books: Information Retrieval
  – http://books.google.com/books?q=information+retrieval

• ESSCaSS'08: Ricardo Baeza-Yates and Filippo Menczer
  – http://courses.cs.ut.ee/schools/esscass2008/Main/Materials


• Given a set of documents, find those relevant to topic X.
• The user formulates a query; documents are returned and retrieved by the user.

• Looking at the first 100 results – how many are relevant to the topic, and how many of all the relevant documents fall within those first 100?

• Given an interesting document (one?), how to find similar ones? 

• Which keywords characterise documents similar to other documents? 

• How to present the answer to user? 

– Topic hierarchies 

– Self organising maps (see WebSom) 

– ... 


What was searched for?

car, bus, train, tram, trolleybus, petrol, diesel, wood, hydrogen, natural gas, electricity

• Which document should be the most similar?

  – Similarity between the document and the query

  – Ranking the documents

  – Inflected word forms (declensions/conjugations)

  – Ontologies (structure of concepts)

  – Importance of the document itself (e.g. PageRank)

Information retrieval (IR)

• Finding relevant information

– From unstructured document database(s)

• Relevance, measures

• Presenting information (UI, relevance)

• Free text queries (Natural Language Processing)

• User feedback

• http://www.google.com/search?q=define:information+retrieval

– Information Retrieval is the "science of search"

– The study of systems for indexing, searching, and recalling data, particularly text or other unstructured forms


History of IR

• 1960–70's: Small text retrieval systems; basic boolean and vector-space retrieval models

• 1980’s: Large document database systems, many run by companies: (e.g. Lexis‐Nexis, Dialog, MEDLINE)

• 1990’s: Searching FTPable documents on the Internet (e.g. Archie, WAIS); Searching the World Wide Web (e.g. Lycos, Yahoo, Altavista)

History cont.

• 2000's:
  – Link analysis for Web Search (e.g. Google)

– Automated Information Extraction (e.g. Whizbang, Fetch, Burning Glass)

– Question Answering (e.g. TREC Q/A track, Ask Jeeves)

– Multimedia IR (Image, Video, Audio and music)

– Cross‐Language IR (e.g. DARPA Tides)

– Document Summarization


Cont …

• 2000's:
  – Recommender Systems (e.g. MovieLens, Pandora, LastFM)
  – Automated Text Categorization & Clustering
    • iTunes "Top Songs"
    • Amazon "people who bought this also bought…"
    • Bloglines "similar blogs"
    • Del.icio.us "most popular" bookmarks
    • Flickr.com "most viewed pictures"
    • NYTimes "most emailed articles"

• IR is the discipline that deals with:
  – retrieval:
    • representation
    • storage
    • organization
    • access
  – of structured, semi-structured and unstructured data (information objects)
  – in response to a query (topic statement):
    • structured (e.g. boolean expression)
    • unstructured (e.g. sentence, document)


Concepts

Information Retrieval – the study of systems for representing, indexing (organising), searching (retrieving), and recalling (delivering) data.

Information Filtering ‐ given a large amount of data, return the data that the user wants to see

Information Need – what the user really wants to know; a query is an approximation to the information need.

Query ‐ a string of words that characterizes the information that the user seeks

Browsing ‐ a sequence of user interaction tasks that characterizes the information that the user seeks

• The process of applying algorithms over unstructured, semi-structured or structured data in order to satisfy a given (explicit) information query.

• Efficiency with respect to:
  – algorithms
  – query building
  – data organization/structure


Data vs. Information Retrieval

• Information Retrieval:
  – Set of keywords (loose semantics)
  – Semantics of the information need
  – Errors are tolerable

• Data Retrieval:
  – Regular expression (well-defined query)
  – Constraints for the objects in the answer set
  – A single error results in a failure

Summary: the retrieval task – compare the information need with the information and generate a ranking which reflects relevance.

[Diagram: Information Need → User Query → IR System → Ranked list of documents, with a feedback loop from the results back to the query.]


IR introduction – IR research issues – applications of IR:

1. IR Models
2. IR Query Languages & Operations
3. Searcher Feedback
4. Language Modeling for IR
5. Search Engines
6. Semantics in IR
8. Multimedia IR
9. Structured Content

• classification and categorization (catalogues)
• systems and languages (NL-based systems)
• user interfaces and visualization

• The Web phenomenon:
  – universal repository of knowledge
  – free (low cost) universal access
  – no central editorial board

• IR – the key to finding the solutions


Logical View of Documents

[Diagram: documents (text + structure) are transformed step by step – structure recognition, accents and spacing, stop-word removal, noun groups, stemming, automatic or manual indexing – moving from full text with structure, to full text, to index terms.]

• Document representation continuum

• Intermediate representations (transformations)

• Text operations to reduce complexity of documents

The Retrieval Process

[Diagram: the user specifies the need as text through the user interface; text operations produce the logical view of the query and of the documents; query operations generate the query; indexing, run by the DB Manager Module over the text database, builds the inverted file; searching retrieves matching documents from the index; ranking orders the retrieved docs; user feedback may change the query.]


Inverted index: document level
http://en.wikipedia.org/wiki/Inverted_index

• T0 = "it is what it is"
  T1 = "what is it"
  T2 = "it is a banana"

• Index:
  "a": {2}
  "banana": {2}
  "is": {0, 1, 2}
  "it": {0, 1, 2}
  "what": {0, 1}

• Q: "what", "is", "it"
  – {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}

Inverted index: word level

• T0 = "it is what it is"
  T1 = "what is it"
  T2 = "it is a banana"

• Index:
  "a": {(2, 2)}
  "banana": {(2, 3)}
  "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
  "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
  "what": {(0, 2), (1, 0)}

• Q: "what is it"
  – "what": {(0, 2), (1, 0)}; "is": {(0, 1), (0, 4), (1, 1), (2, 1)}; "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
  – documents 0 and 1 contain all three terms; the positions show the phrase occurring consecutively in document 1 (positions 0, 1, 2).


• The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index is developed, which stores lists of words per document, it is next inverted to develop an inverted index. Querying the forward index would require sequential iteration through each document and each word to verify a matching document. The time, memory, and processing resources to perform such a query are not always technically realistic. Instead of listing the words per document in the forward index, the inverted index data structure is developed, which lists the documents per word.
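As an illustration (not part of the original slides), a minimal Python sketch of building both index levels for the example collection above; the function names are invented for this example:

from collections import defaultdict

docs = ["it is what it is", "what is it", "it is a banana"]

def build_indexes(docs):
    """Build a document-level and a word-level (positional) inverted index."""
    doc_index = defaultdict(set)   # term -> set of doc ids
    pos_index = defaultdict(set)   # term -> set of (doc id, position) pairs
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.split()):
            doc_index[term].add(doc_id)
            pos_index[term].add((doc_id, pos))
    return doc_index, pos_index

def boolean_and(doc_index, terms):
    """Documents containing all query terms (intersection of postings lists)."""
    postings = [doc_index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

doc_index, pos_index = build_indexes(docs)
print(boolean_and(doc_index, ["what", "is", "it"]))   # {0, 1}
print(sorted(pos_index["is"]))                        # [(0, 1), (0, 4), (1, 1), (2, 1)]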


Measures

• Precision is the fraction of the documents retrieved that are relevant to the user's information need.

• Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.

Measures: Precision & Recall

                 Retrieved    Not retrieved
  Relevant       TP           FN             (user need)
  Irrelevant     FP           TN             (not needed)
                 TP+FP        TN+FN

TP – True Positive, TN – True Negative, FP – False Positive, FN – False Negative

                  TP         |Relevant ∩ Retrieved|
  Precision = ––––––––– = ––––––––––––––––––––––––
                TP + FP           |Retrieved|

                  TP         |Relevant ∩ Retrieved|
  Recall    = ––––––––– = ––––––––––––––––––––––––
                TP + FN           |Relevant|


Measures: Precision & Recall

                 Retrieved    Not retrieved
  Relevant       TP           FN             (user need)
  Irrelevant     FP           TN             (not needed)
                 TP+FP        TN+FN

  Precision = TP / (TP + FP)
  Recall    = TP / (TP + FN)   (sensitivity)
  (Specificity is the corresponding measure over the irrelevant documents, TN / (TN + FP).)

Measure: F-Measure

• The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score, is:

                   2 × Precision × Recall
  F-Measure = ––––––––––––––––––––––––––
                   Precision + Recall

• The F2 measure weights recall twice as much as precision; the F0.5 measure weights precision twice as much as recall.
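A small illustrative sketch (not from the slides) of these measures in Python; the TP/FP/FN counts are made-up numbers:

def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that were retrieved."""
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta=1 is the balanced F-score; beta=2 weights recall higher,
    beta=0.5 weights precision higher."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Hypothetical outcome: 30 relevant retrieved, 20 irrelevant retrieved,
# 10 relevant documents missed.
tp, fp, fn = 30, 20, 10
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_measure(p, r))        # 0.6  0.75  0.666...
print(f_measure(p, r, beta=2.0))    # recall-heavy F2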


ROC – Receiver Operating Characteristic
AUC – Area Under Curve

[Figure: ROC curves of three systems compared; the horizontal axis counts false positives (irrelevant documents), the vertical axis true positives (relevant documents).]

Vector space model

• http://en.wikipedia.org/wiki/Vector_space_model

• Document: a vector of words

– A sparse vector over all possible words…

• Similarity between query and document:

– Scalar product

– An angle between the two vectors


Scalar product

• Query Q is a document with perhaps just a single word.

• Similarity of query and document:
  M(Q, Di) = Q ∙ Di

• X ∙ Y = ∑i xi yi

Weighted version

• The more often a word occurs, the more relevant the document.

• Same word vectors, with entries counting occurrences:
  M(X, Y) = ∑i wq,i wd,i

• w is different for a word in each document.

• Extension: add weight for a word in a "more important" context.

• Can you add term weights on query words?
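A minimal sketch (illustrative only, not from the lecture) of the plain scalar-product similarity over sparse word-count vectors; the sample texts are invented:

from collections import Counter

def vectorize(text):
    """Bag-of-words vector: word -> occurrence count."""
    return Counter(text.lower().split())

def dot(x, y):
    """Scalar product of two sparse vectors: sum over shared words."""
    return sum(x[w] * y[w] for w in x if w in y)

doc1 = vectorize("the cat sat on the mat")
doc2 = vectorize("the dog sat on the log")
query = vectorize("cat mat")

print(dot(query, doc1))  # 2 -- both query words occur once in doc1
print(dot(query, doc2))  # 0 -- no overlap with the query terms
# With per-document weights w_{d,i} (e.g. tf-idf) the same sum
# becomes M(Q, D) = sum_i w_{q,i} * w_{d,i}.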


Limitations of vector space

• Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality).

• Search keywords must precisely match document terms; word substrings might result in a "false positive match".

• Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".

• The order in which the terms appear in the document is lost in the vector space representation.

Term Weight Calculation

• Quantification of intra-document (intra-cluster) contents (similarity) = tf factor
  – the term frequency within a document – how well a term describes a document

• Quantification of inter-document (inter-cluster) separation (dissimilarity) = idf factor
  – the inverse document frequency – frequency of the term in docs of the collection

  wij = tf(i,j) × idf(i)


TF and IDF Factors

• Let
  – N be the total number of docs in the collection
  – ni be the number of docs which contain ki
  – freq(i,j) be the raw frequency of ki within dj

• A normalized frequency (tf factor) is given by:
  – f(i,j) = freq(i,j) / maxl freq(l,j)
  – where the max is computed over all terms occurring in doc dj

• The idf factor is computed as:
  – idf(i) = log(N / ni)
  – the log makes the tf and idf values comparable; it can also be seen as the amount of information associated with term ki

• The vector model with tf-idf weights is a good ranking strategy in general collections, and it is simple and fast to compute.
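An illustrative sketch (the toy corpus and names are invented) of the tf and idf factors exactly as defined above, combined into wij = f(i,j) · idf(i):

import math
from collections import Counter

docs = [
    "information retrieval finds relevant information",
    "text algorithms process text",
    "retrieval of text documents",
]

def tfidf_vectors(docs):
    """w_ij = f(i,j) * idf(i), with f(i,j) = freq(i,j) / max_l freq(l,j)
    and idf(i) = log(N / n_i)."""
    N = len(docs)
    tokenized = [d.split() for d in docs]
    df = Counter()                      # n_i: number of docs containing term i
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        freq = Counter(tokens)          # freq(i,j): raw counts in this doc
        max_freq = max(freq.values())
        vectors.append({t: (f / max_freq) * math.log(N / df[t])
                        for t, f in freq.items()})
    return vectors

for vec in tfidf_vectors(docs):
    print(vec)   # terms occurring in every doc get weight 0 (idf = log 1)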


Pros & Cons

• Advantages:
  – term-weighting improves quality of the answer set
  – partial matching allows retrieval of docs that approximate the query conditions
  – cosine ranking formula sorts documents according to degree of similarity to the query

• Disadvantages:
  – assumes independence of index terms
  – not clear whether this is bad though

Ontology

• Ontology: a conceptualisation of things…

• An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.

[Example hierarchy: vehicles (sõidukid) → watercraft (veesõidukid), cars (autod), airplanes (lennukid); a seaplane (vesilennuk) sits under both watercraft and airplanes.]


Ontology driven search

• Query => map to an ontology

– Use ontology to guide what you “really want”

• Map documents to the same ontology

• Fetch the documents most relevant to the term, the ontology, etc…

• Example: GoPubMed


Importance of a document

• Can we say that some documents are a priori more important than others?

• Type of a document /law, news, chat,…/

• “Good source”

• Relevant (often cited, popular)

What is a Markov Chain?

• A Markov chain has two components:
  1) A network structure much like a web site, where each node is called a state. So the complete web is the set of all possible states.
  2) A transition probability of traversing a link given that the chain is in a state.
     – For each state the sum of outgoing probabilities is one.

• A sequence of steps through the chain is called a random walk.


The Random Surfer

• Assume the web is a Markov chain.

• Surfers randomly click on links, where the probability of an outlink from page A is 1/m, where m is the number of outlinks from A.

• The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page.

• Using the theory of Markov chains it can be shown that, if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.
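An illustrative Monte-Carlo sketch of the random surfer (the three-page graph and the parameters are made up); with enough steps the visit frequencies approach the PageRank values:

import random

# Hypothetical web graph: page -> list of outlinks
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d = 0.15            # teleportation probability
steps = 100_000

visits = {p: 0 for p in pages}
page = random.choice(pages)
for _ in range(steps):
    if random.random() < d or not links[page]:
        page = random.choice(pages)          # teleport (also handles dangling pages)
    else:
        page = random.choice(links[page])    # follow a random outlink
    visits[page] += 1

for p in pages:
    print(p, visits[p] / steps)   # estimated probability of visiting p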

Dangling Pages

• Problem: A and B have no outlinks.
  [Diagram: three pages A, C and B; A and B are dangling.]

• Solution: Assume A and B have links to all web pages with equal probability.


Rank Sink

• Problem: Pages in a loop accumulate rank but do not distribute it.

• Solution: Teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.

PageRank (PR) – Definition

• P is a web page

  PR(P) = d/N + (1 − d) · [ PR(P1)/O(P1) + PR(P2)/O(P2) + … + PR(Pn)/O(Pn) ]

• Pi are the web pages that have a link to P

• O(Pi) is the number of outlinks from Pi

• d is the teleportation probability

• N is the size of the web


Example Web Graph

[Figure: three pages A, B and C; A links to B and C, B links to C, and C links to A.]

Iteratively Computing PageRank

• Replace d/N in the definition of PR(P) by d, so PR will take values between 1 and N.

• d is normally set to 0.15, but for simplicity let's set it to 0.5.

• Set initial PR values to 1.

• Solve the following equations iteratively:

  PR(A) = 0.5 + 0.5 · PR(C)
  PR(B) = 0.5 + 0.5 · (PR(A)/2)
  PR(C) = 0.5 + 0.5 · (PR(A)/2 + PR(B))


Example Computation of PR

  Iteration    PR(A)         PR(B)         PR(C)
  0            1             1             1
  1            1             0.75          1.125
  2            1.0625        0.765625      1.1484375
  3            1.07421875    0.76855469    1.15283203
  4            1.07641602    0.76910400    1.15365601
  5            1.07682800    0.76920700    1.15381050
  …            …             …             …
  12           1.07692308    0.76923077    1.15384615
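The worked example can be reproduced with a short iteration (a sketch; the graph A→B, A→C, B→C, C→A and d = 0.5 follow from the equations above; updating the values in place, page by page, matches the table):

# Graph from the example: A links to B and C, B links to C, C links to A.
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.5                      # simplified form: PR(P) = d + (1 - d) * sum(PR(Pi)/O(Pi))
pr = {p: 1.0 for p in outlinks}

for it in range(1, 13):
    # update in place, page by page (this reproduces the table above)
    for p in outlinks:
        pr[p] = d + (1 - d) * sum(pr[q] / len(outlinks[q])
                                  for q in outlinks if p in outlinks[q])
    print(it, round(pr["A"], 8), round(pr["B"], 8), round(pr["C"], 8))
# Iteration 12: PR(A) ≈ 1.07692308, PR(B) ≈ 0.76923077, PR(C) ≈ 1.15384615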

Large Matrix Computation

• Computing PageRank can be done via matrix multiplication, where the matrix has 30 million rows and columns.

• The matrix is sparse, as the average number of outlinks is between 7 and 8.

• Setting d = 0.15 or above requires at most 100 iterations to convergence.

• Researchers are still trying to speed up the computation.


PageRank – Motivation

• A link from page A to page B is a vote of the author of A for B, or a recommendation of the page.

• The number of incoming links to a page is a measure of the importance and authority of the page.

• Also take into account the quality of the recommendation, so a page is more important if the sources of its incoming links are important.


The Anatomy of a Large‐Scale Hypertextual Web Search Engine

• http://infolab.stanford.edu/~backrub/google.html

• In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/


• Personalized PageRank
  – Teleportation to a set of pages defining the preferences of a particular user

• Topic‐sensitive PageRank [Haveliwala 02]

– Teleportation to a set of pages defining a particular topic

• TrustRank [Gyöngyi 04]

– Teleportation to “trustworthy” pages

• Many papers on analyzing PageRank and numerical methodsfor efficient computation


Future? Or current?

• Recommendations (Tagging)

• Common behaviour (news/epidemics spread)

• Social networks

• Focus

• Generalisation

• Rich get richer; Googlearchy?; …

• Your contribution?