Searching the Web
Basic Information Retrieval

Who I Am
Associate Professor at UCLA Computer Science
Ph.D. from Stanford in Computer Science
B.S. from SNU in Physics
Got involved in early Web-search engine projects, particularly in the Web-crawling part
Research on search engines and the social Web
Brief Overview of the Course
Basic principles and theories behind Web-search engines
Not much discussion of implementation or tools, but happy to discuss them if there are questions
Topics:
Basic IR models, data structures, and algorithms
Topic-based models: Latent Semantic Indexing, Latent Dirichlet Allocation
Link-based ranking
Search-engine architecture
Issues of scale, Web crawling
Who Are You?
Background, expectations, career goals
Today's Topic: Basic Information Retrieval (IR)
Three approaches to computer-based information management
Bag-of-words assumption
Boolean model: string-matching algorithm, inverted index
Vector-space model: document-term matrix, TF-IDF vectors and cosine similarity
Phrase queries
Spell correction
Computer-based Information Management
Basic problem: how can we use computers to help humans store, organize, and retrieve information?
What approaches have been taken, and which have been successful?
Three Major Approaches
Database approach
Expert-system approach
Information-retrieval approach
Database Approach
Information is stored in a highly structured way: data resides in relational tables as tuples
Simple data model and query language: the relational model and SQL
Clear interpretation of data and queries; no ambition to be "intelligent" like humans
Main focus on highly efficient systems: "performance, performance, performance"
It has been hugely successful: all major businesses use an RDB system (> $20B market)
What are the pros and cons?
Expert-System Approach
Information is stored as a set of logical predicates: Bird(x), Cat(x), Fly(x), ...
Given a query, the system infers the answer through logical inference: Bird(Ostrich), so Fly(Ostrich)?
Popular approach in the 80s, but has not been successful for general information retrieval
What are the pros and cons?
Information-Retrieval Approach
Uses existing text documents as the information source; no special structuring or database construction required
Text-based query language: keyword-based or natural-language queries
The system returns the best-matching documents given the query
Had limited appeal until the Web became popular
What are the pros and cons?
Main Challenge of the IR Approach
Relational model: the interpretation of query and data is straightforward
  Student(name, birthdate, major, GPA)
  SELECT * FROM Student WHERE GPA > 3.0
Information retrieval: both queries and data are "fuzzy"
  Unstructured text and "natural language" queries: what documents are good matches for a query?
Computers do not "understand" the documents or the queries
Developing a computational "model" is essential to implement this approach
Bag of Words: Major Simplification
Consider each document as a "bag of words" ("bag" vs "set": ignore word ordering, but keep word counts)
Consider queries as bags of words as well
A great oversimplification, but it works adequately in many cases
The limitation still shows up in current search engines: "John loves only Jane" vs "Only John loves Jane"
Still, how do we match documents and queries?
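As a quick illustration (a sketch, not from the slides), the bag-of-words representation can be written with Python's Counter; the two example sentences above become indistinguishable once word order is discarded:

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Represent text as a multiset (bag) of lowercase words:
    word order is lost, but word counts are kept."""
    return Counter(text.lower().split())

d1 = bag_of_words("John loves only Jane")
d2 = bag_of_words("Only John loves Jane")
print(d1 == d2)  # True: the two bags are identical
```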
Boolean Model
Return all documents that contain the words in the query
Simplest model for information retrieval
No notion of "ranking": a document is either a match or a non-match
Q: How do we find and return matching documents? Basic algorithm? Useful data structure?
String-Matching Algorithm
Given the string "abcde", find which documents contain it
Q: Computational complexity of naive matching of a string of length m over a document of length n?
Q: Any efficient way?
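For reference, a minimal sketch of the naive algorithm the question refers to; at each of up to n starting positions it may compare up to m characters, giving O(n·m) in the worst case:

```python
def naive_match(doc: str, word: str) -> int:
    """Return the first index where word occurs in doc, or -1.

    Worst-case O(n * m): for each of the n candidate starting
    positions we may compare up to m characters.
    """
    n, m = len(doc), len(word)
    for start in range(n - m + 1):
        if doc[start:start + m] == word:
            return start
    return -1

# The running example from the next slides:
print(naive_match("ABCABABABC", "ABABC"))  # 5
```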
String Matching Example (1)
    m: 0123456789
    D: ABCABABABC  (doc)
    W: ABABC       (word)
    i: 01234
Two cursors: m = 2, i = 1
  m: beginning of the matching part in D
  i: location of the matching character in W

String Matching Example (2)
    m: 0123456789
    D: ABCABABABC  (doc)
    W: ABABC       (word)
    i: 01234
Mismatch at m = 0, i = 2. Q: What can we do? Start again at m = 1, i = 0?
Mismatch at m = 3, i = 4. Q: What can we do? Start at m = 7, i = 0?
String Matching Example (3)
Algorithm KMP
If no substring of W is self-repeated, we can slide W "completely" past the matched portion:
    m <- m + i
    i <- 0
If a suffix of the matched part equals a prefix of W, we have to slide back a little:
    m <- m + i - x    // x is how much to slide back
    i <- x
The exact value of x depends on the length of the prefix of W matching the suffix of the matched part
T[0...m]: "slide-back" table recording the x values

Algorithm KMP
  W: string to look for
  D: document
  T: "slide-back" table in case of mismatch

  while (m + i) < |D| do:
      if W[i] = D[m + i]:
          let i = i + 1
          if i = |W|, return m
      otherwise:
          let m = m + i - T[i]
          if i > 0, let i = T[i]
  return no-match
Algorithm KMP: T[i] Table
    W: ABCDABD  (word)
    i: 0123456
m <- m + i - T[i]
T[0] = -1, T[1] = 0
Q: What should T[i] be for i = 2...6?
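One possible Python rendering of the pseudocode above (a sketch; T[i] is the length of the longest proper prefix of W that is also a suffix of W[0:i], with T[0] = -1 as on the slide):

```python
def build_table(w: str) -> list:
    """Build the KMP "slide-back" table T for word w."""
    t = [-1] + [0] * (len(w) - 1)
    k = 0  # length of the current matched prefix
    for i in range(2, len(w)):
        # extend the longest prefix-suffix of w[0:i-1] by w[i-1]
        while k > 0 and w[i - 1] != w[k]:
            k = t[k]
        if w[i - 1] == w[k]:
            k += 1
        t[i] = k
    return t

def kmp_search(d: str, w: str) -> int:
    """Find w in d in O(n + m) time; return first index or -1."""
    t = build_table(w)
    m = i = 0
    while m + i < len(d):
        if w[i] == d[m + i]:
            i += 1
            if i == len(w):
                return m
        else:
            m = m + i - t[i]   # slide W forward by i - T[i]
            i = max(t[i], 0)   # T[0] = -1 means restart at i = 0
    return -1

print(build_table("ABCDABD"))              # [-1, 0, 0, 0, 0, 1, 2]
print(kmp_search("ABCABABABC", "ABABC"))   # 5
```

So for the slide's question, T[2...6] for W = ABCDABD is 0, 0, 0, 1, 2.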
Data Structure for Quick Document Matching
Boolean model: find all documents that contain the keywords in Q
Q: What data structure will be useful to do this fast?

Inverted Index
Allows quick lookup of the document ids containing a particular word
Q: How can we use this to answer the query "UCLA Physics"?

  lexicon/dictionary DIC     postings lists
  Stanford -> PL(Stanford):  3 8 10 13 16 20
  UCLA     -> PL(UCLA):      1 2 3 9 16 18
  MIT      -> PL(MIT):       4 5 8 10 13 19 20 22
  ...
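A multi-word Boolean query like "UCLA Physics" is answered by intersecting the postings lists of the query words. A minimal sketch, using the postings lists from the figure (the word-to-list assignment is assumed from the layout):

```python
def boolean_and(index: dict, query_words: list) -> set:
    """Boolean AND query: docs containing every query word.

    index maps each word to its (sorted) postings list of doc ids.
    """
    postings = [set(index.get(w, [])) for w in query_words]
    return set.intersection(*postings) if postings else set()

index = {
    "Stanford": [3, 8, 10, 13, 16, 20],
    "UCLA": [1, 2, 3, 9, 16, 18],
    "MIT": [4, 5, 8, 10, 13, 19, 20, 22],
}
print(sorted(boolean_and(index, ["Stanford", "UCLA"])))  # [3, 16]
```

In practice the sorted lists are merged with two cursors rather than converted to sets, so the intersection runs in time linear in the list lengths.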
Size of Inverted Index (1)
100M docs, 10KB/doc, 1000 unique words/doc, 10B/word, 4B/docid
Q: Document collection size?
Q: Inverted index size?
Heaps' law: vocabulary size = k * n^b, with 30 < k < 100 and 0.4 < b < 1
  k = 50 and b = 0.5 are a good rule of thumb
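Working the slide's numbers through (a back-of-the-envelope sketch; taking n in Heaps' law to be the total number of word occurrences in the corpus):

```python
# Numbers from the slide.
docs = 100_000_000            # 100M documents
doc_bytes = 10_000            # 10KB per document
unique_words_per_doc = 1000
docid_bytes = 4
word_bytes = 10

# Collection size: 100M * 10KB = 1 TB.
collection = docs * doc_bytes

# Postings: one 4B docid per (doc, unique word) pair = 400 GB.
postings = docs * unique_words_per_doc * docid_bytes

# Heaps' law vocabulary estimate, k * n^b with k=50, b=0.5,
# n = total word occurrences = 1 TB / 10B per word = 10^11.
n = collection // word_bytes
vocab = int(50 * n ** 0.5)    # roughly 16M distinct words

print(collection)  # 1_000_000_000_000 bytes (1 TB)
print(postings)    # 400_000_000_000 bytes (400 GB)
print(vocab)
```

So the postings lists alone are a sizable fraction of the collection itself, while the dictionary (≈16M words at 10B each, about 160MB) is comparatively tiny.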
Size of Inverted Index (2)
Q: Between the dictionary and the postings lists, which one is larger?
Q: Lengths of the postings lists?
Zipf's law: collection term frequency ∝ 1 / frequency rank
Q: How do we construct an inverted index?
Inverted Index Construction
  C: set of all documents (corpus)
  DIC: dictionary of the inverted index
  PL(w): postings list of word w

  1: For each document d ∈ C:
  2:     Extract all words in content(d) into W
  3:     For each w ∈ W:
  4:         If w ∉ DIC, then add w to DIC
  5:         Append id(d) to PL(w)
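The construction loop above can be sketched in a few lines of Python (an in-memory sketch; scanning documents in id order keeps each postings list sorted):

```python
from collections import defaultdict

def build_inverted_index(corpus: dict) -> dict:
    """Build word -> postings list, following the pseudocode above.

    corpus maps doc id -> document text.
    """
    index = defaultdict(list)          # DIC and PL(w) together
    for doc_id in sorted(corpus):      # for each document d in C
        words = set(corpus[doc_id].lower().split())  # extract W
        for w in words:
            index[w].append(doc_id)    # append id(d) to PL(w)
    return dict(index)

# Hypothetical three-document corpus:
corpus = {1: "UCLA physics", 2: "Stanford physics", 3: "UCLA computer science"}
index = build_inverted_index(corpus)
print(index["ucla"])     # [1, 3]
print(index["physics"])  # [1, 2]
```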
Q: What if the index is larger than main memory?

Inverted-Index Construction for a Large Text Corpus
Block-sorted-based construction: partition and merge
Evaluation: Precision and Recall
Q: Are all matching documents what users want?
Basic idea: a model is good if it returns a document if and only if it is "relevant"
  R: set of "relevant" documents
  D: set of documents returned by the model
  Precision = |R ∩ D| / |D|
  Recall = |R ∩ D| / |R|
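The two definitions above, directly in code, with a hypothetical example (10 relevant documents, of which the model returns 4 among 5 results):

```python
def precision_recall(relevant: set, returned: set):
    """Precision = |R ∩ D| / |D|, Recall = |R ∩ D| / |R|."""
    hits = len(relevant & returned)
    return hits / len(returned), hits / len(relevant)

R = set(range(1, 11))   # 10 relevant docs (hypothetical)
D = {1, 2, 3, 4, 99}    # model returns 5 docs, 4 of them relevant
p, r = precision_recall(R, D)
print(p, r)  # 0.8 0.4
```

Note the tension: returning everything gives recall 1 but terrible precision, while returning only one sure hit gives precision 1 but low recall.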
Vector-Space Model
Main problem of the Boolean model: too many matching documents when the corpus is large
Any way to "rank" documents?
Matrix interpretation of the Boolean model: a document-term matrix with a Boolean 0-or-1 value for each entry
Basic idea: assign real-valued weights to the matrix entries depending on the importance of the term ("the" vs "UCLA")
Q: How should we assign the weights?
TF-IDF Vector
A term t is important for document d if t appears many times in d, or if t is a "rare" term
TF: term frequency — # occurrences of t in d
DF: document frequency — # documents containing t
IDF: inverse document frequency — log(N/DF) for a corpus of N documents
TF-IDF weighting: TF × IDF
Q: How do we use it to compute query-document relevance?
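A minimal sketch of the weighting, showing how the same term frequency yields very different weights for rare vs common terms (corpus size and document frequencies are hypothetical):

```python
import math

def tfidf(tf: int, df: int, n_docs: int) -> float:
    """TF-IDF weight: term frequency times log(N / document frequency)."""
    return tf * math.log(n_docs / df)

# Hypothetical corpus of 1,000,000 documents:
print(tfidf(tf=3, df=10, n_docs=1_000_000))       # rare term: large weight
print(tfidf(tf=3, df=900_000, n_docs=1_000_000))  # common term: near-zero weight
```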
Cosine Similarity
Represent both the query and the document as TF-IDF vectors
Take the inner product of the two normalized vectors to compute their similarity:
  sim(Q, D) = (Q · D) / (|Q| |D|)
Note: |Q| does not matter for document ranking; division by |D| penalizes longer documents.
Cosine Similarity: Example
idf(UCLA) = 10, idf(good) = 0.1, idf(university) = idf(car) = idf(racing) = 1
Q = (UCLA, university), D = (car, racing)
Q = (UCLA, university), D = (UCLA, good)
Q = (UCLA, university), D = (university, good)
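Working the example through (a sketch; since each word occurs once, the vector weight of a term is assumed to be just its IDF):

```python
import math

def cosine(q: dict, d: dict) -> float:
    """Cosine similarity of two sparse TF-IDF vectors (word -> weight)."""
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    norm = math.sqrt(sum(w * w for w in q.values())) * \
           math.sqrt(sum(w * w for w in d.values()))
    return dot / norm

idf = {"UCLA": 10, "good": 0.1, "university": 1, "car": 1, "racing": 1}

def vec(*words):
    # one occurrence of each word, so the TF-IDF weight is the IDF itself
    return {w: idf[w] for w in words}

Q = vec("UCLA", "university")
print(cosine(Q, vec("car", "racing")))       # 0.0: no shared terms
print(cosine(Q, vec("UCLA", "good")))        # ~0.995: shares the rare term
print(cosine(Q, vec("university", "good")))  # ~0.099: shares only a common term
```

The high-IDF term "UCLA" dominates the ranking, which is exactly the intended effect of the weighting.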
Finding High Cosine-Similarity Documents
Q: Under the vector-space model, do precision/recall make sense?
Q: How do we find the documents with the highest cosine similarity in the corpus?
Q: Any way to avoid a complete scan of the corpus?
Inverted Index for TF-IDF
Q · di = 0 if di contains no query words, so consider only the documents with query words
Inverted index: word -> documents

  Lexicon stores (word, IDF); each postings-list entry stores (docid, TF)
  e.g. words Stanford, UCLA, MIT with IDF values like 1/3530, 1/9860, 1/937
       postings entries like (D1, 2), (D14, 308), (D376, ...)
(TF may be normalized by document size)
Phrase Queries
"Harvard University Boston" exactly as a phrase
Q: How can we support this query?
Two approaches: biword index, positional index
Q: Pros and cons of each approach?
Rule of thumb: 2x-4x size increase for a positional index compared to a docid-only index
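The positional-index approach can be sketched as follows: store, for every word, the positions at which it occurs in each document, then require the query words to appear at consecutive positions (the corpus here is hypothetical):

```python
from collections import defaultdict

def build_positional_index(corpus: dict) -> dict:
    """word -> {doc_id: [positions]} — a minimal positional index."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in corpus.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_query(index: dict, phrase: str) -> set:
    """Docs where the phrase's words occur at consecutive positions."""
    words = phrase.lower().split()
    result = set()
    for doc_id, positions in index.get(words[0], {}).items():
        for p in positions:
            # word k of the phrase must sit at position p + k in this doc
            if all(p + k in index.get(w, {}).get(doc_id, [])
                   for k, w in enumerate(words[1:], start=1)):
                result.add(doc_id)
                break
    return result

corpus = {1: "Harvard University Boston campus",
          2: "Boston University near Harvard"}
idx = build_positional_index(corpus)
print(phrase_query(idx, "Harvard University Boston"))  # {1}
```

A biword index would instead index pairs like "harvard university" directly: faster lookups for two-word phrases, but a much larger dictionary and no help for longer phrases without chaining.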
Spell Correction
Q: What may the user have truly intended by the query "Britnie Spears"? How can we find the correct spelling?
Given a user-typed word w, find its correct spelling c
Probabilistic approach: find the c with the highest probability P(c|w). Q: How do we estimate it?
Bayes' rule: P(c|w) = P(w|c)P(c)/P(w)
Q: What are these probabilities and how can we estimate them?
Rule of thumb: ~3/4 of misspellings are within edit distance 1; 98% are within edit distance 2
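Given the rule of thumb above, a common way to generate candidate corrections c is to enumerate everything within edit distance 1 of w (a sketch of the standard construction, not something specified on the slides; a real corrector would then pick the candidate maximizing P(w|c)P(c), e.g. using corpus word counts for P(c)):

```python
import string

def edits1(word: str) -> set:
    """All strings within edit distance 1 of word:
    deletions, adjacent transpositions, replacements, insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | transposes | replaces | inserts

# e.g. the misspelling "speers" is one replacement away from "spears":
print("spears" in edits1("speers"))  # True
```

Candidates at edit distance 2 are simply edits1 applied to every string in edits1(w), which is why the 98% figure makes this two-level search practical.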
Summary
Boolean model
Vector-space model: TF-IDF weights, cosine similarity
String-matching: Algorithm KMP
Inverted index: Boolean model, TF-IDF model
Phrase queries
Spell correction