3.12.2008
Text Algorithms (4AP): Information Retrieval
Jaak Vilo
2008 fall
MTAT.03.190 Text Algorithms, Jaak Vilo
Materials
• Modern Information Retrieval by Ricardo Baeza‐Yates and Berthier Ribeiro‐Neto
  – http://people.ischool.berkeley.edu/~hearst/irbook/
  – http://www.amazon.co.uk/Modern‐Information‐Retrieval‐ACM‐Press/dp/020139829X/ref=sr_1_1?ie=UTF8&s=books&qid=1228237684&sr=8‐1
  – http://books.google.com/books?id=HLyAAAAACAAJ&dq=modern+information+retrieval
  – New edition in May 2009
• Google Books: Information Retrieval
  – http://books.google.com/books?q=information+retrieval
• ESSCaSS’08: Ricardo Baeza‐Yates and Filippo Menczer
  – http://courses.cs.ut.ee/schools/esscass2008/Main/Materials
• Given a set of documents, find those relevant to topic X
• User formulates a query; documents are returned and retrieved by the user
• Looking at the first 100 results – how many are relevant to the topic, and how many of all relevant documents fit in the first 100?
• Given an interesting document (just one?), how to find similar ones?
• Which keywords characterise documents similar to other documents?
• How to present the answer to the user?
  – Topic hierarchies
  – Self‐organising maps (see WebSOM)
  – ...
What was searched for? (Mida otsiti?)
• Which document should be the most similar?
  – example query terms (in Estonian): auto (car), buss (bus), rong (train), tramm (tram), troll (trolleybus), bensiin (petrol), diisel (diesel), puit (wood), vesinik (hydrogen), maagaas (natural gas), elekter (electricity)
  – Similarity of document and query
  – Ranking of documents
  – Inflections (cases/conjugations)
  – Ontologies (structure of concepts)
  – Importance of the document itself (e.g. PageRank)
Information retrieval (IR)
• Finding relevant information
– From unstructured document database(s)
• Relevance, measures
• Presenting information (UI, relevance)
• Free text queries (Natural Language Processing)
• User feedback
• http://www.google.com/search?q=define:information+retrieval
– Information Retrieval is the "science of search”
– The study of systems for indexing, searching, and recalling data, particularly text or other unstructured forms
History of IR
• 1960‐70’s: Small text retrieval systems; basic boolean and vector‐space retrieval models
• 1980’s: Large document database systems, many run by companies (e.g. Lexis‐Nexis, Dialog, MEDLINE)
• 1990’s: Searching FTPable documents on the Internet (e.g. Archie, WAIS); Searching the World Wide Web (e.g. Lycos, Yahoo, Altavista)
History cont.
• 2000’s:
  – Link analysis for Web Search (e.g. Google)
  – Automated Information Extraction (e.g. Whizbang, Fetch, Burning Glass)
  – Question Answering (e.g. TREC Q/A track, Ask Jeeves)
– Multimedia IR (Image, Video, Audio and music)
– Cross‐Language IR (e.g. DARPA Tides)
– Document Summarization
Cont …
• 2000’s:
– Recommender Systems • (e.g. MovieLens, Pandora, LastFM)
– Automated Text Categorization & Clustering
• iTunes “Top Songs”
• Amazon “people who bought this also bought…”
• Bloglines “similar blogs”
• Del.icio.us “most popular” bookmarks
• Flickr.com “most viewed pictures”
• NYTimes “most emailed articles”
• IR – discipline that deals with:
  – retrieval:
    • representation
    • storage
    • organization
    • access
  – of structured, semi‐structured and unstructured data (information objects)
  – in response to a query (topic statement):
    • structured (e.g. boolean expression)
    • unstructured (e.g. sentence, document)
Concepts
Information Retrieval ‐ the study of systems for representing, indexing (organising), searching (retrieving), and recalling (delivering) data.
Information Filtering ‐ given a large amount of data, return the data that the user wants to see
Information Need ‐ what the user really wants to know; a query is an approximation to the information need.
Query ‐ a string of words that characterizes the information that the user seeks
Browsing ‐ a sequence of user interaction tasks that characterizes the information that the user seeks
• The process of applying algorithms over unstructured, semi‐structured or structured data in order to satisfy a given (explicit) information query
• Efficiency with respect to:
  – algorithms
  – query building
  – data organization/structure
Data vs. Information Retrieval
• Information Retrieval:
  – Set of keywords (loose semantics)
  – Semantics of the information need
  – Errors are tolerable
• Data Retrieval:
  – Regular expression (well‐defined query)
  – Constraints for the objects in the answer set
  – A single error results in a failure
Summary
• Compare the information need with the retrieval task
• Generate a ranking which reflects relevance
[Diagram: the user turns an Information Need into a Query; the IR System returns a ranked list of documents; user feedback refines the query]
Lecture 2: Query Languages & Operations, 2ID10: Information Retrieval (2005‐2006), Lora Aroyo
IR introduction, IR research issues, applications of IR
1. IR Models
2. IR Query Languages & Operations
3. Searcher Feedback
4. Language Modeling for IR
5. Search Engines
6. Semantics in IR
8. Multimedia IR
9. Structured Content
• classification and categorization (catalogues)
• systems and languages (NL‐based systems)
• user interfaces and visualization
• The Web phenomenon
– universal repository of knowledge
– free (low cost) universal access
– no central editorial board
• IR – the key to finding the solutions
Logical View of Documents
• Document representation continuum
• Intermediate representations (transformations)
• Text operations to reduce the complexity of documents
[Diagram: Documents pass through text operations (accents and spacing, stop‐words, noun groups, stemming, automatic or manual indexing); the logical view ranges from full text + structure, through full text, to index terms; structure is handled separately]
Lecture 1: Introduction, 2ID10: Information Retrieval (2005‐2006)
The Retrieval Process ...
[Diagram: (1) the text of the Text Database defines the logical view of the documents; (2) the DB Manager Module builds the Index (an inverted file); (3) the user specifies a need through the User Interface; (4) Text and Query Operations generate the query; Searching retrieves docs from the index; Ranking orders the retrieved docs; user feedback changes the query]
Inverted index: document level
http://en.wikipedia.org/wiki/Inverted_index
• T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"
• "a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}
• Q: "what", "is", "it"
  – {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}
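The document-level index above can be sketched in a few lines of Python; the documents and the conjunctive query are the ones from the example, while the function and variable names are only illustrative:

```python
from functools import reduce

# The three example documents from the slide
docs = ["it is what it is", "what is it", "it is a banana"]

# Document-level inverted index: word -> set of document ids
index = {}
for doc_id, text in enumerate(docs):
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def boolean_and(*terms):
    """Answer a conjunctive query by intersecting the posting sets."""
    return reduce(set.intersection, (index.get(t, set()) for t in terms))

print(sorted(boolean_and("what", "is", "it")))  # [0, 1]
```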
Inverted index: word level
• T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"
• "a": {(2, 2)}, "banana": {(2, 3)}, "is": {(0, 1), (0, 4), (1, 1), (2, 1)}, "it": {(0, 0), (0, 3), (1, 2), (2, 0)}, "what": {(0, 2), (1, 0)}
• Q: "what is it" (as a phrase)
  – "what": {(0, 2), (1, 0)}; "is": {(0, 1), (0, 4), (1, 1), (2, 1)}; "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
  – the consecutive positions (1, 0), (1, 1), (1, 2) occur only in T1, so the phrase matches document 1
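A word-level (positional) index supports phrase queries by requiring consecutive positions within the same document. A minimal Python sketch over the same example documents (helper names are illustrative):

```python
# The three example documents from the slide
docs = ["it is what it is", "what is it", "it is a banana"]

# Word-level (positional) inverted index: word -> set of (doc_id, position)
pindex = {}
for doc_id, text in enumerate(docs):
    for pos, word in enumerate(text.split()):
        pindex.setdefault(word, set()).add((doc_id, pos))

def phrase_query(phrase):
    """Return ids of documents that contain the words of `phrase` consecutively."""
    words = phrase.split()
    # Start from the postings of the first word, then demand that each
    # following word occurs at the next position in the same document.
    hits = pindex.get(words[0], set())
    for offset, word in enumerate(words[1:], start=1):
        hits = {(d, p) for (d, p) in hits
                if (d, p + offset) in pindex.get(word, set())}
    return sorted({d for (d, _) in hits})

print(phrase_query("what is it"))  # [1]
```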
• The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index is developed, which stores lists of words per document, it is next inverted to develop an inverted index. Querying the forward index would require sequential iteration through each document and each word to verify a matching document. The time, memory, and processing resources to perform such a query are not always technically realistic. Instead of listing the words per document, as in the forward index, the inverted index data structure lists the documents per word.
Measures
• Precision is the fraction of the retrieved documents that are relevant to the user's information need.
• Recall is the fraction of the documents relevant to the query that are successfully retrieved.
Measures: Precision & Recall

             Retrieved   Not retrieved
User need    TP          FN              Relevant
Not needed   FP          TN              Irrelevant
             TP+FP       TN+FN

TP – True Positive, TN – True Negative
FP – False Positive, FN – False Negative

                TP       |Relevant ∩ Retrieved|
Precision = ––––––––– = ––––––––––––––––––––––
              TP+FP           |Retrieved|

                TP       |Relevant ∩ Retrieved|
Recall    = ––––––––– = ––––––––––––––––––––––
              TP+FN           |Relevant|
Measures: Precision & Recall (cont.)
Same contingency table; in classification terms:
Precision = TP / (TP+FP)
Recall = TP / (TP+FN), also known as Sensitivity
Specificity = TN / (TN+FP)
Measure: F‐Measure
• The weighted harmonic mean of precision and recall; the traditional F‐measure or balanced F‐score is:

  F-Measure = 2 × Precision × Recall / (Precision + Recall)

• The F2 measure weights recall twice as much as precision, and the F0.5 measure weights precision twice as much as recall.
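These measures are straightforward to compute from the contingency counts. A small Python sketch using the general F_beta formula, of which F1, F2 and F0.5 are special cases; the example counts are made up for illustration:

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that were retrieved."""
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta=1 is the balanced F-score; beta=2 weights recall twice as much
    as precision; beta=0.5 weights precision twice as much as recall.
    """
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Made-up run: 100 docs retrieved, 50 of them relevant; 50 relevant docs missed
tp, fp, fn = 50, 50, 50
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_measure(p, r))  # 0.5 0.5 0.5
```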
ROC – Receiver Operating Characteristic
AUC – Area Under Curve
[Figure: ROC curves of 3 systems compared; x-axis: FP (irrelevant documents), y-axis: TP (relevant documents)]
Vector space model
• http://en.wikipedia.org/wiki/Vector_space_model
• Document: a vector of words
– A sparse vector over all possible words…
• Similarity between query and document:
– Scalar product
– An angle between the two vectors
Scalar product
• Query Q is a document with perhaps just a single word.
• Similarity of query and document
M(Q, Di) = Q ∙ Di
• X ∙ Y = ∑i xiyi
Weighted version
• The more often the word occurs, the more relevant the document
• Same word vectors, but count occurrences
• M(Q, D) = ∑i wq,i wd,i
• w is different for each word in each document
• Extend: add weight for a word in a "more important" context
• Can you add term weights on query words?
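The weighted scalar product can be sketched with raw term counts as the weights (a minimal illustration reusing the earlier toy documents; names are illustrative):

```python
from collections import Counter

def vectorize(text):
    """Sparse term-count vector: the more often a word occurs, the larger its weight."""
    return Counter(text.split())

def similarity(q, d):
    """Scalar product M(Q, D) = sum over i of w_q,i * w_d,i."""
    return sum(w * d[t] for t, w in q.items())  # missing terms count as 0

docs = ["it is what it is", "what is it", "it is a banana"]
query = vectorize("what is it")
scores = {i: similarity(query, vectorize(t)) for i, t in enumerate(docs)}
print(scores)  # {0: 5, 1: 3, 2: 2}
```

Note that T0 scores highest mainly because its terms repeat; this length bias is one of the limitations of the plain vector space model discussed below.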
Limitations of vector space
• Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
• Search keywords must precisely match document terms; word substrings might result in a "false positive match"
• Semantic sensitivity; documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".
• The order in which the terms appear in the document is lost in the vector space representation.
Term Weight Calculation
• quantification of intra-document (-cluster) contents (similarity) = tf factor – the term frequency within a document – how well a term describes a document
• quantification of inter-document (-cluster) separation (dissimilarity) = idf factor – the inverse document frequency – how frequent the term is in the docs of the collection
• wij = tf(i,j) × idf(i)
• Let
  – N be the total number of docs in the collection
  – ni be the number of docs which contain ki
  – freq(i,j) be the raw frequency of ki within dj
TF and IDF Factors
• A normalized frequency (tf factor) is given by:
  – f(i,j) = freq(i,j) / maxl freq(l,j)
  – where the max is computed over all terms occurring in doc dj
• The idf factor is computed as:
  – idf(i) = log (N / ni)
  – the log makes the tf and idf values comparable; it can also be seen as the amount of information associated with term ki
Vector model with tf-idf weights
• a good ranking strategy in general collections
• simple and fast to compute
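The tf and idf formulas above can be combined into a small Python sketch (the toy documents are reused for illustration; note how idf drives the weight of a term that occurs in every document to zero):

```python
import math
from collections import Counter

docs = ["it is what it is", "what is it", "it is a banana"]
N = len(docs)
tokenized = [d.split() for d in docs]

# n[k]: number of docs in the collection that contain term k
n = Counter(term for toks in tokenized for term in set(toks))

def tf(term, toks):
    """Normalized term frequency: freq(i,j) / max_l freq(l,j)."""
    freqs = Counter(toks)
    return freqs[term] / max(freqs.values())

def idf(term):
    """Inverse document frequency: log(N / n_i)."""
    return math.log(N / n[term])

def weight(term, doc_id):
    """w_ij = tf(i,j) * idf(i)."""
    return tf(term, tokenized[doc_id]) * idf(term)

print(weight("banana", 2))  # log(3): rare term, high weight
print(weight("is", 0))      # 0.0: "is" occurs in every doc, so idf = log(1) = 0
```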
Pros & Cons
• Advantages:
  – term-weighting improves quality of the answer set
  – partial matching allows retrieval of docs that approximate the query conditions
  – cosine ranking formula sorts documents according to degree of similarity to the query
• Disadvantages:
  – assumes independence of index terms
  – not clear whether this is bad, though
Ontology
• Ontology: a conceptualisation of things…
• An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.
[Example hierarchy, in Estonian: Sõidukid (vehicles) → veesõidukid (watercraft), autod (cars), lennukid (aircraft); Vesilennuk (seaplane)]
Ontology driven search
• Query => map to an ontology
  – Use the ontology to guide what you “really want”
• Map documents to the same ontology
• Fetch the most relevant by term, ontology, etc…
• Example: GoPubMed
Importance of a document
• Can we say that some documents are a priori more important than others?
• Type of a document /law, news, chat,…/
• “Good source”
• Relevant (often cited, popular)
What is a Markov Chain?
• A Markov chain has two components:
  1) A network structure much like a web site, where each node is called a state; the complete web is the set of all possible states.
  2) A transition probability of traversing a link given that the chain is in a state.
     – For each state the sum of outgoing probabilities is one.
• A sequence of steps through the chain is called a random walk.
The Random Surfer
• Assume the web is a Markov chain.
• Surfers randomly click on links, where the probability of an outlink from page A is 1/m, where m is the number of outlinks from A.
• The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page.
• Using the theory of Markov chains it can be shown that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.
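The random-surfer model can be simulated directly; visit frequencies then approximate PageRank. A sketch over a made-up three-page graph (d is the teleportation probability):

```python
import random

# A made-up three-page web graph: page -> list of outlinks
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d = 0.15  # teleportation probability

def surf(steps, seed=0):
    """Random walk on the graph; visit frequencies approximate PageRank."""
    rng = random.Random(seed)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < d or not links[page]:
            page = rng.choice(pages)        # bored: teleport to a random page
        else:
            page = rng.choice(links[page])  # follow a random outlink
    return {p: v / steps for p, v in visits.items()}

print(surf(100_000))
```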
Dangling Pages
[Figure: pages A, B and C; A and B have no outlinks]
• Problem: A and B have no outlinks.
• Solution: Assume A and B have links to all web pages with equal probability.
Rank Sink
• Problem: Pages in a loop accumulate rank but do not distribute it.
• Solution: Teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.
PageRank (PR) ‐ Definition

  PR(P) = d/N + (1 − d) · ( PR(P1)/O(P1) + PR(P2)/O(P2) + … + PR(Pn)/O(Pn) )

• P is a web page
• Pi are the web pages that have a link to P
• O(Pi) is the number of outlinks from Pi
• d is the teleportation probability
• N is the size of the web
Example Web Graph: Iteratively Computing PageRank
• Replace d/N in the definition of PR(P) by d, so PR will take values between 1 and N.
• d is normally set to 0.15, but for simplicity let's set it to 0.5
• Set initial PR values to 1
• Solve the following equations iteratively:

  PR(A) = 0.5 + 0.5 · PR(C)
  PR(B) = 0.5 + 0.5 · (PR(A)/2)
  PR(C) = 0.5 + 0.5 · (PR(A)/2 + PR(B))
Example Computation of PR

Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.75         1.125
2           1.0625       0.765625     1.1484375
3           1.07421875   0.76855469   1.15283203
4           1.07641602   0.76910400   1.15365601
5           1.07682800   0.76920700   1.15381050
…           …            …            …
12          1.07692308   0.76923077   1.15384615
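The table can be reproduced with a short loop. Note the updates are sequential (each equation uses the freshest values already computed in the current iteration), which matches the numbers above:

```python
# d = 0.5; d/N is replaced by d as on the slide; initial PR values are 1
pr = {"A": 1.0, "B": 1.0, "C": 1.0}

for _ in range(12):
    # Sequential (Gauss-Seidel style) updates, as in the table
    pr["A"] = 0.5 + 0.5 * pr["C"]
    pr["B"] = 0.5 + 0.5 * (pr["A"] / 2)
    pr["C"] = 0.5 + 0.5 * (pr["A"] / 2 + pr["B"])

print({p: round(v, 8) for p, v in pr.items()})
# {'A': 1.07692308, 'B': 0.76923077, 'C': 1.15384615}
```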
Large Matrix Computation
• Computing PageRank can be done via matrix multiplication, where the matrix has 30 million rows and columns.
• The matrix is sparse, as the average number of outlinks is between 7 and 8.
• Setting d = 0.15 or above requires at most 100 iterations to convergence.
• Researchers are still trying to speed up the computation.
PageRank - Motivation
• A link from page A to page B is a vote of the author of A for B, or a recommendation of the page.
• The number of incoming links to a page is a measure of the importance and authority of the page.
• Also take into account the quality of the recommendation: a page is more important if the sources of its incoming links are important.
The Anatomy of a Large‐Scale Hypertextual Web Search Engine
• http://infolab.stanford.edu/~backrub/google.html
• “In this paper, we present Google, a prototype of a large‐scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/”
• Personalized PageRank
– Teleportation to a set of pages defining the preferences of a particular user
• Topic‐sensitive PageRank [Haveliwala 02]
  – Teleportation to a set of pages defining a particular topic
• TrustRank [Gyöngyi 04]
– Teleportation to “trustworthy” pages
• Many papers on analyzing PageRank and numerical methods for efficient computation
Future? Or current?
• Recommendations (Tagging)
• Common behaviour (news/epidemics spread)
• Social networks
• Focus
• Generalisation
• Rich get richer; Googlearchy?; …
• Your contribution?