

Text Algorithms (4AP): Information Retrieval

Jaak Vilo

2008 fall

MTAT.03.190 Text Algorithms, Jaak Vilo

Materials

• Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.
  – http://people.ischool.berkeley.edu/~hearst/irbook/
  – http://www.amazon.co.uk/Modern-Information-Retrieval-ACM-Press/dp/020139829X/ref=sr_1_1?ie=UTF8&s=books&qid=1228237684&sr=8-1
  – http://books.google.com/books?id=HLyAAAAACAAJ&dq=modern+information+retrieval
  – New edition in May 2009

• Google Books: Information Retrieval
  – http://books.google.com/books?q=information+retrieval

• ESSCaSS'08: Ricardo Baeza-Yates and Filippo Menczer
  – http://courses.cs.ut.ee/schools/esscass2008/Main/Materials

• Given a set of documents,  find those relevant to topic X 

• User formulates a query, documents are returned and retrieved by user 

• Looking at the first 100 results – how many are relevant to the topic, and how many of all the relevant documents fit in the first 100?

• Given an interesting document (one?), how to find similar ones? 

• Which keywords characterise documents similar to other documents? 

• How to present the answer to the user?

– Topic hierarchies 

– Self organising maps (see WebSom) 

– ... 


What was searched for?

• Which document should be the most similar?
  (example query terms: car, bus, train, tram, trolleybus, petrol, diesel, wood, hydrogen, natural gas, electricity)

  – Similarity of the document and the query

  – Ranking of documents

  – Inflected forms (declensions/conjugations)

  – Ontologies (structure of concepts)

  – Importance of the document itself (e.g. PageRank)

Information retrieval (IR)

• Finding relevant information

– From unstructured document database(s)

• Relevance, measures

• Presenting information (UI, relevance)

• Free text queries (Natural Language Processing)

• User feedback

• http://www.google.com/search?q=define:information+retrieval

– Information Retrieval is the "science of search"

– The study of systems for indexing, searching, and recalling data, particularly text or other unstructured forms

History of IR

• 1960-70's: Small text retrieval systems; basic boolean and vector-space retrieval models

• 1980's: Large document database systems, many run by companies (e.g. Lexis-Nexis, Dialog, MEDLINE)

• 1990’s: Searching FTPable documents on the Internet (e.g. Archie, WAIS); Searching the World Wide Web (e.g. Lycos, Yahoo, Altavista)

History cont.

• 2000's:
  – Link analysis for Web Search (e.g. Google)

– Automated Information Extraction (e.g. Whizbang, Fetch, Burning Glass)

– Question Answering (e.g. TREC Q/A track, Ask Jeeves)

– Multimedia IR (Image, Video, Audio and music)

– Cross‐Language IR (e.g. DARPA Tides)

– Document Summarization


Cont …

• 2000’s: 

– Recommender Systems (e.g. MovieLens, Pandora, LastFM)

– Automated Text Categorization & Clustering

• iTunes “Top Songs”

• Amazon “people who bought this also bought…”

• Bloglines “similar blogs”

• Del.icio.us “most popular” bookmarks

• Flickr.com "most viewed pictures"

• NYTimes “most emailed articles”

• IR is the discipline that deals with the retrieval – representation, storage, organization and access – of structured, semi-structured and unstructured data (information objects)

• ...in response to a query (topic statement), which may be:

  – structured (e.g. a boolean expression)

  – unstructured (e.g. a sentence, a document)

Concepts

Information Retrieval – the study of systems for representing, indexing (organising), searching (retrieving), and recalling (delivering) data.

Information Filtering ‐ given a large amount of data, return the data that the user wants to see

Information Need ‐ what the user really wants to know; a query is an approximation to the information need. 

Query ‐ a string of words that characterizes the information that the user seeks

Browsing ‐ a sequence of user interaction tasks that characterizes the information that the user seeks

• The process of applying algorithms over unstructured, semi-structured or structured data in order to satisfy a given (explicit) information query

• Efficiency with respect to:
  – algorithms

– query building

– data organization/structure

Data vs. Information Retrieval

• Information Retrieval:
  – Set of keywords (loose semantics)
  – Semantics of the information need
  – Errors are tolerable

• Data Retrieval:
  – Regular expression (well-defined query)
  – Constraints for the objects in the answer set
  – A single error results in a failure

Summary

• The retrieval task: compare the information need with the information in the documents and generate a ranking which reflects relevance.

[Diagram: Information Need → User Query → IR System → Ranked list of documents, with user feedback flowing back to the query.]

(Source slides: Lecture 2: Query Languages & Operations, 2ID10: Information Retrieval (2005-2006), Lora Aroyo)


Course topics: IR introduction, IR research issues, applications of IR

1. IR Models
2. IR Query Languages & Operations
3. Searcher Feedback
4. Language Modeling for IR
5. Search Engines
6. Semantics in IR
8. Multimedia IR
9. Structured Content

• classification and categorization (catalogues)

• systems and languages (NL‐based systems)

• user interfaces and visualization 

• The Web phenomena

– universal repository of knowledge 

– free (low cost) universal access

– no central editorial board

• IR – the key to finding the solutions

Logical View of Documents

• Document representation continuum: from the full text (+ structure) down to a set of index terms

• Intermediate representations (transformations)

• Text operations to reduce complexity of documents

[Diagram: Documents → accents/spacing → stop-words → noun groups → stemming → automatic or manual indexing; structure recognition yields text + structure.]

The Retrieval Process

[Diagram: the User Interface lets the user specify a need and give feedback to change the query; Text Operations define the logical view of both the documents and the query; Query Operations generate the query; Indexing builds an inverted-file Index over the Text Database (via the DB Manager Module); Searching returns retrieved docs; Ranking orders the retrieved docs before they are shown to the user.]


Inverted index: document level
(http://en.wikipedia.org/wiki/Inverted_index)

• T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"

"a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} " h " {0 1}

• Q:  "what", "is" "it"

– {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2}  = {0,1}

"what": {0, 1}

Inverted index: word level

• T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"

"a": {(2, 2)} "banana": {(2, 3)} "is": {(0, 1), (0, 4), (1, 1), (2, 1)} "it": {(0, 0), (0, 3), (1, 2), (2, 0)} "what": {(0, 2), (1, 0)}

• Q: "what is it"
  "what": {(0, 2), (1, 0)}
  "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
  "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
  – the positions line up consecutively only in document 1, starting at position 0

• The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index is developed, which stores lists of words per document, it is next inverted to develop an inverted index. Querying the forward index would require sequential iteration through each document and each word to verify a matching document. The time, memory, and processing resources to perform such a query are not always technically realistic. Instead of listing the words per document in the forward index, the inverted index data structure is developed, which lists the documents per word.
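As a complement to the document-level sketch above, here is a hedged sketch (my own) of the word-level (positional) index: each posting stores (document id, position), which also lets us answer phrase queries such as "what is it":

from collections import defaultdict

docs = ["it is what it is", "what is it", "it is a banana"]

pos_index = defaultdict(list)            # term -> list of (doc_id, position)
for doc_id, text in enumerate(docs):
    for pos, term in enumerate(text.split()):
        pos_index[term].append((doc_id, pos))

def phrase_query(phrase):
    # documents where the phrase occurs as consecutive words
    terms = phrase.split()
    hits = set()
    for doc_id, start in pos_index.get(terms[0], []):
        if all((doc_id, start + k) in set(pos_index.get(t, []))
               for k, t in enumerate(terms[1:], start=1)):
            hits.add(doc_id)
    return hits

print(phrase_query("what is it"))        # -> {1}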

Measures

• Precision is the fraction of the documents retrieved that are relevant to the user's information need.

• Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.

Measures: Precision & Recall

             | Retrieved | Not retrieved
  Relevant   |    TP     |      FN        (user need)
  Irrelevant |    FP     |      TN        (not needed)
  Total      |  TP+FP    |    TN+FN

TP – True Positive
TN – True Negative
FP – False Positive
FN – False Negative

Precision = TP / (TP + FP) = |Relevant ∩ Retrieved| / |Retrieved|

Recall = TP / (TP + FN) = |Relevant ∩ Retrieved| / |Relevant|
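A tiny worked example (my own toy relevance judgements, not from the slides) of these two definitions:

relevant  = {1, 3, 5, 7, 9}          # documents judged relevant to the query
retrieved = [1, 2, 3, 4, 5]          # documents returned by the system

tp = len(set(retrieved) & relevant)  # relevant AND retrieved
precision = tp / len(retrieved)      # 3 / 5 = 0.6
recall    = tp / len(relevant)       # 3 / 5 = 0.6
print(precision, recall)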


Measures: Precision & Recall

             | Retrieved | Not retrieved
  Relevant   |    TP     |      FN
  Irrelevant |    FP     |      TN
  Total      |  TP+FP    |    TN+FN

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)   (also called Sensitivity)

(Specificity is the corresponding measure on the irrelevant class: TN / (TN + FP).)

Measure: F‐Measure

• The weighted harmonic mean of precision and recall, the traditional F‐measure or balanced F‐score is:

• The F2 measure weights recall twice as much as precision, and the F0.5 measure weights precision twice as much as recall.

F-Measure = 2 × Precision × Recall / (Precision + Recall)
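A small sketch (my own) of the balanced F-measure and the general F_beta form behind the F2 / F0.5 variants mentioned above:

def f_measure(precision, recall, beta=1.0):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 is the balanced F-score
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.6, 0.6))        # F1   = 0.6
print(f_measure(0.9, 0.3, 2))     # F2   favours recall
print(f_measure(0.9, 0.3, 0.5))   # F0.5 favours precision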

ROC – Receiver Operating Characteristic
AUC – Area Under Curve

[Figure: ROC curves comparing 3 systems; x-axis: false positives (irrelevant documents retrieved), y-axis: true positives (relevant documents retrieved).]

Vector space model

• http://en.wikipedia.org/wiki/Vector_space_model

• Document: a vector of words

– A sparse vector over all possible words…

• Similarity between query and document:

– Scalar product

– An angle between the two vectors

Scalar product 

• Query Q is a document with perhaps just a single word. 

• Similarity of query and document 

M(Q, Di) = Q ∙ Di

• X ∙ Y = Σ_i x_i · y_i

Weighted version

• The more often a word occurs, the more relevant the document.

• Same word vectors, but count occurrences.

M(Q, D) = Σ_i w_q,i · w_d,i

• The weight w is different for each word in each document.

• Extend: add weight for a word in a "more important" context.

• Can you add term weights to the query words?
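A minimal sketch (my own) of the weighted similarity M(Q, D) = Σ_i w_q,i · w_d,i, here simply using raw occurrence counts as the weights, plus the cosine-normalised variant mentioned earlier:

import math
from collections import Counter

def weights(text):
    return Counter(text.split())            # term -> occurrence count (w_d,i)

def dot(q, d):
    return sum(w * d.get(t, 0) for t, w in q.items())

def cosine(q, d):
    norm = math.sqrt(sum(w * w for w in q.values())) * \
           math.sqrt(sum(w * w for w in d.values()))
    return dot(q, d) / norm if norm else 0.0

query = weights("banana")
doc   = weights("it is a banana")
print(dot(query, doc), cosine(query, doc))  # 1 and 0.5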


Limitations of vector space

• Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)

• Search keywords must precisely match document terms; word substrings might result in a "false positive match"

• Semantic sensitivity; documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".

• The order in which the terms appear in the document is lost in the vector space representation.

Term Weight Calculation

• quantification of intra-document (-cluster) contents (similarity) = tf factor – the term frequency within a document
  – how well a term describes a document

• quantification of inter-document (-cluster) separation (dissimilarity) = idf factor – the inverse document frequency – the frequency of the term in the docs of the collection

w_ij = tf(i,j) * idf(i)

TF and IDF Factors

• Let
  – N be the total number of docs in the collection
  – n_i be the number of docs which contain k_i
  – freq(i,j) the raw frequency of k_i within d_j

• A normalized frequency (tf factor) is given by:
  – f(i,j) = freq(i,j) / max_l freq(l,j)
  – where the max is computed over all terms occurring in doc d_j

• The idf factor is computed as:
  – idf(i) = log(N / n_i)
  – the log makes the tf and idf values comparable; it can also be read as the amount of information associated with term k_i

vector model with tf-idf weights

• The term weights combine both factors defined on the previous slide: w_ij = f(i,j) * idf(i)

• The vector model with tf-idf weights is:
  – a good ranking strategy in general collections
  – simple and fast to compute
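A brief illustrative sketch (my own, reusing the toy documents from the inverted-index example, not from the slides) that follows the slide's formulas f(i,j) = freq(i,j) / max_l freq(l,j), idf(i) = log(N / n_i) and w_ij = f(i,j) * idf(i):

import math
from collections import Counter

docs = ["it is what it is", "what is it", "it is a banana"]
N = len(docs)
doc_counts = [Counter(d.split()) for d in docs]
n = Counter()                                  # n_i: number of docs containing term i
for counts in doc_counts:
    n.update(counts.keys())

def tf_idf(term, j):
    counts = doc_counts[j]
    if term not in counts:
        return 0.0
    tf = counts[term] / max(counts.values())   # normalised term frequency f(i,j)
    idf = math.log(N / n[term])                # inverse document frequency idf(i)
    return tf * idf

print(tf_idf("banana", 2), tf_idf("is", 2))    # the rarer term gets the higher weight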

Pros & Cons

• Advantages:
  – term-weighting improves quality of the answer set
  – partial matching allows retrieval of docs that approximate the query conditions
  – cosine ranking formula sorts documents according to degree of similarity to the query

• Disadvantages:
  – assumes independence of index terms
  – not clear whether this is bad though

Ontology

• Ontology: a conceptualisation of things…

• An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.

[Diagram: example concept hierarchy – Vehicles (Sõidukid), with subclasses watercraft (veesõidukid), cars (autod) and aircraft (lennukid), plus seaplane (Vesilennuk).]


Ontology driven search

• Query => map to an ontology

– Use ontology to guide what you “really want”

• Map documents to the same ontology

• Fetch most relevant to term, ontology, etc…

• Example: GoPubMed

Importance of a document

• Can we say that some documents are a priori more important than others?

• Type of a document /law, news, chat,…/

• “Good source”

• Relevant (often cited, popular)

What is a Markov Chain?

• A Markov chain has two components:

  1) A network structure much like a web site, where each node is called a state. So the complete web is the set of all possible states.

  2) A transition probability of traversing a link, given that the chain is in a state.

– For each state the sum of outgoing probabilities is one.

• A sequence of steps through the chain is called a random walk.

The Random Surfer

• Assume the web is a Markov chain.

• Surfers randomly click on links, where the probability of an outlink from page A is 1/m, where m is the number of outlinks from A.

• The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page.

• Using the theory of Markov chains it can be shown that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.
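An illustrative simulation (my own toy three-page graph, not from the slides) of the random surfer: with probability d the surfer teleports to a random page, otherwise follows a random outlink; the visit frequencies then approximate PageRank:

import random

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical toy web graph
d = 0.15                                            # teleportation probability
visits = {page: 0 for page in graph}
page = random.choice(list(graph))

for _ in range(100_000):
    visits[page] += 1
    if random.random() < d or not graph[page]:      # bored surfer, or dangling page
        page = random.choice(list(graph))           # teleport to a random page
    else:
        page = random.choice(graph[page])           # follow a random outlink

total = sum(visits.values())
print({p: round(v / total, 3) for p, v in visits.items()})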

Dangling Pages

[Diagram: three pages A, C and B; A and B have no outlinks.]

• Problem: A and B have no outlinks.

• Solution: Assume A and B have links to all web pages with equal probability.


Rank Sink

• Problem: Pages in a loop accumulate rank but do not distribute it.

• Solution: Teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.

PageRank (PR) ‐ Definition

PR(P) = d/N + (1 - d) * ( PR(P1)/O(P1) + PR(P2)/O(P2) + ... + PR(Pn)/O(Pn) )

• P is a web page

• Pi are the web pages that have a link to P

• O(Pi) is the number of outlinks from Pi

• d is the teleportation probability

• N is the size of the web

Example Web Graph

[Diagram: a small example web graph with three pages A, B and C – A links to B and C, B links to C, and C links to A – used in the computation below.]

Iteratively Computing PageRank

• Replace d/N in the def. of PR(P) by d, so PR will take values between 1 and N.

• d is normally set to 0.15, but for simplicity let's set it to 0.5

• Set initial PR values to 1

• Solve the following equations iteratively:

PR(A) = 0.5 + 0.5 * PR(C)
PR(B) = 0.5 + 0.5 * (PR(A) / 2)
PR(C) = 0.5 + 0.5 * (PR(A) / 2 + PR(B))

Example Computation of PR

Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.75         1.125
2           1.0625       0.765625     1.1484375
3           1.07421875   0.76855469   1.15283203
4           1.07641602   0.76910400   1.15365601
5           1.07682800   0.76920700   1.15381050
…           …            …            …
12          1.07692308   0.76923077   1.15384615
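A short sketch (my own) of exactly this iteration; updating the values in place within each pass reproduces the numbers in the table above:

pr = {"A": 1.0, "B": 1.0, "C": 1.0}                 # initial PR values
for i in range(12):
    pr["A"] = 0.5 + 0.5 * pr["C"]
    pr["B"] = 0.5 + 0.5 * (pr["A"] / 2)
    pr["C"] = 0.5 + 0.5 * (pr["A"] / 2 + pr["B"])
    print(i + 1, pr["A"], pr["B"], pr["C"])
# converges to roughly A = 1.0769, B = 0.7692, C = 1.1538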

Large Matrix Computation

• Computing PageRank can be done via matrix multiplication, where the matrix has 30 million rows and columns.

• The matrix is sparse, as the average number of outlinks is between 7 and 8.

• Setting d = 0.15 or above requires at most 100 iterations to convergence.

• Researchers are still trying to speed up the computation.
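For illustration, a hedged sketch (my own, on a small hypothetical graph) of the general computation; each pass corresponds to multiplying the sparse link matrix by the current PageRank vector, with dangling pages treated as linking to every page as on the earlier slide:

def pagerank(links, d=0.15, iterations=100):
    """links: page -> list of outlinked pages; returns PR values summing to 1."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: d / n for p in pages}              # teleportation term
        for p, outs in links.items():
            targets = outs if outs else pages        # dangling page links everywhere
            share = (1 - d) * pr[p] / len(targets)
            for q in targets:
                new[q] += share
        pr = new
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))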


PageRank - Motivation

• A link from page A to page B is a vote of the author of A for B, or a recommendation of the page.

• The number of incoming links to a page is a measure of the importance and authority of the page.

• Also take into account the quality of the recommendation, so a page is more important if the sources of its incoming links are important.

The Anatomy of a Large‐Scale Hypertextual Web Search Engine

• http://infolab.stanford.edu/~backrub/google.html

• In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/


• Personalized PageRank

– Teleportation to a set of pages defining the preferences of a particular user

• Topic‐sensitive PageRank [Haveliwala 02]

– Teleportation to a set of pages defining a particular topic

• TrustRank [Gyöngyi 04]

– Teleportation to “trustworthy” pages

• Many papers on analyzing PageRank and on numerical methods for efficient computation

Future? Or current?

• Recommendations (Tagging)

• Common behaviour (news/epidemics spread)

• Social networks

• Focus

• Generalisation

• Rich get richer; Googlearchy?; …

• Your contribution?