8/14/2019 Search Engine Technology (1)
Search Engine Technology
(1)
Prof. Dragomir R. Radev
radev@cs.columbia.edu
SET FALL 2009
1. Introduction
Examples of search engines
Conventional (library catalog). Search by keyword, title, author, etc.
Text-based (Lexis-Nexis, Google, Yahoo!). Search by keywords. Limited search using queries in natural language.
Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, ...).
Question answering systems (Ask, NSIR, Answerbus). Search in (restricted) natural language.
Clustering systems (Vivísimo, Clusty)
Research systems (Lemur, Nutch)
What does it take to build a search engine?
Decide what to index
Collect it
Index it (efficiently)
Keep the index up to date
Provide user-friendly query facilities
What else?
Understand the structure of the web for efficient crawling
Understand user information needs
Preprocess text and other unstructured data
Cluster data
Classify data
Evaluate performance
Goals of the course
Understand how search engines work
Understand the limits of existing search technology
Learn to appreciate the sheer size of the Web
Learn to write code for text indexing and retrieval
Learn about the state of the art in IR research
Learn to analyze textual and semi-structured data sets
Learn to appreciate the diversity of texts on the Web
Learn to evaluate information retrieval
Learn about standardized document collections
Learn about text similarity measures
Learn about semantic dimensionality reduction
Learn about the idiosyncrasies of hyperlinked document collections
Learn about web crawling
Learn to use existing software
Understand the dynamics of the Web by building appropriate mathematical models
Build working systems that assist users in finding useful information on the Web
Course logistics
Thursdays 6:10-8:00
Office hours: TBA
URL: http://www.cs.columbia.edu/~cs6998
Instructor: Dragomir Radev
Email: radev@cs.columbia.edu
TAs: Yves Petinot (ypetinot@cs.columbia.edu), Kaushal Lahankar (knl2102@columbia.edu)
Course outline
Classic document retrieval: storing, indexing, retrieval.
Web retrieval: crawling, query processing.
Text and web mining: classification, clustering.
Network analysis: random graph models, centrality, diameter and clustering coefficient.
Syllabus
Introduction. Queries and Documents. Models of Information retrieval. The Boolean model. The Vector model.
Document preprocessing. Tokenization. Stemming. The Porter algorithm.
Storing, indexing and searching text. Inverted indexes.
Word distributions. The Zipf distribution. The Benford distribution. Heaps' law. TF*IDF.
Vector space similarity and ranking.
Retrieval evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences.
Automated indexing/labeling. Compression and coding. Optimal codes.
String matching. Approximate matching.
Query expansion. Relevance feedback.
Text classification. Naive Bayes. Feature selection. Decision trees.
Syllabus
Linear classifiers. k-nearest neighbors. Perceptron. Kernel methods. Maximum-margin classifiers. Support vector machines. Semi-supervised learning.
Lexical semantics and Wordnet.
Latent semantic indexing. Singular value decomposition.
Vector space clustering. k-means clustering. EM clustering.
Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution.
Social network analysis. Small worlds and scale-free networks. Power law distributions. Centrality.
Graph-based methods. Harmonic functions. Random walks.
PageRank. Hubs and authorities. Bipartite graphs. HITS.
Models of the Web.
Syllabus
Crawling the web. Webometrics. Measuring the size of the web. The Bow-tie method.
Hypertext retrieval. Web-based IR. Document closures. Focused crawling.
Question answering.
Burstiness. Self-triggerability.
Information extraction.
Adversarial IR. Human behavior on the web.
Text summarization.
POSSIBLE TOPICS
Discovering communities, spectral clustering
Semi-supervised retrieval
Natural language processing. XML retrieval. Text tiling. Human behavior on the web.
Readings
required: Information Retrieval by Manning, Schuetze, and Raghavan (http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html), freely available, hard copy for sale
optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN: 0-470-84906-1 (http://ibook.ics.uci.edu).
papers from SIGIR, WWW and journals (to be announced in class).
Prerequisites
Linear algebra: vectors and matrices.
Calculus: finding extrema of functions.
Probabilities: random variables, discrete and continuous distributions, Bayes theorem.
Programming: experience with at least one web-aware programming language such as Perl (highly recommended) or Java in a UNIX environment.
Required CS account
Course requirements
Three assignments (30%)
  Some of them will be in Perl. The rest can be done in any appropriate language. All will involve some data analysis and evaluation.
Final project (30%)
  Research paper or software system.
Class participation (10%)
Final exam (30%)
Final project format
Research paper - using the SIGIR format. Students will be in charge of problem formulation, literature survey, hypothesis formulation, experimental design, implementation, and possibly submission to a conference like SIGIR or WWW.
Software system - develop a working system or API. Students will be responsible for identifying a niche problem, implementing it and deploying it, either on the Web or as an open-source downloadable tool. The system can be either stand alone or an extension to an existing one.
Project ideas
Build a question answering system.
Build a language identification system.
Social network analysis from the Web.
Participate in the Netflix challenge.
Query log analysis.
Build models of Web evolution.
Information diffusion in blogs or web.
Author-topic models of web pages.
Using the web for machine translation.
Building evolving models of web documents.
News recommendation system.
Compress the text of Wikipedia (losslessly).
Spelling correction using query logs.
Automatic query expansion.
List of projects from the past
Document Closures for Indexing
Tibet - Table Structure Recognition Library
Ruby Blog Memetracker
Sentence decomposition for more accurate information retrieval
Extracting Social Networks from LiveJournal
Google Suggest Programming Project (Java Swing Client and Lucene Ba
Leveraging Social Networks for Organizing and Browsing Shared Photog
Media Bias and the Political Blogosphere
Measuring Similarity between search queries
Extracting Social Networks and Information about the people within them
LSI + dependency trees
http://www1.cs.columbia.edu/~cs6998/final_reports/ca2269-report.pdf
http://www1.cs.columbia.edu/~cs6998/final_reports/hc2311-report.txt
http://www1.cs.columbia.edu/~cs6998/final_reports/pcd2104-report.pdf
http://www1.cs.columbia.edu/~cs6998/final_reports/bmf2103-report.pdf
http://www1.cs.columbia.edu/~cs6998/final_reports/kmh2124-report.pdf
http://www1.cs.columbia.edu/~cs6998/final_reports/dtf2110-report.pdf
http://www1.cs.columbia.edu/~cs6998/final_reports/lsk20-report.pdf
http://www1.cs.columbia.edu/~cs6998/final_reports/dlm2132-report.pdf
http://www1.cs.columbia.edu/~cs6998/final_reports/ts2379-report.pdf
http://www1.cs.columbia.edu/~cs6998/final_reports/ss3067-report.pdf
http://www1.cs.columbia.edu/~cs6998/final_reports/dys4-report.pdf
Available corpora
Netflix challenge
AOL query logs
Blogs
Bio papers
AAN
Email
Generifs
Web pages
Political science corpus
VAST
del.icio.us
SMS
News data: aquaint, tdt, nantc, reuters, setimes, trec, tipster
Europarl multilingual
US congressional data
DMOZ
Pubmedcentral
DUC/TAC
Timebank
Wikipedia
wt2g/wt10g/wt100g
dotgov
RTE
Paraphrases
GENIA
Hansards
IMDB
MTA/MTC
nie
cnnsumm
Poliblog
Sentiment
xml
epinions
Enron
Related courses elsewhere
Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze)
Cornell (Jon Kleinberg)
CMU (Yiming Yang and Jamie Callan)
UMass (James Allan)
UTexas (Ray Mooney)
Illinois (Chengxiang Zhai)
Johns Hopkins (David Yarowsky)
For a long list of courses related to Search Engines, Natural Language Processing, Machine Learning look here:
http://tangra.si.umich.edu/clair/clair/courses.html
SET FALL 2009
2. Models of Information retrieval
The Vector model
The Boolean model
Sample queries (from Excite)
In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running
Fun things to do with search engines
Googlewhack
Reduce document set size to 1
Find query that will bring given URL in the top 10
Key Terms Used in IR
QUERY: a representation of what the user is looking for - can be a list of words or a phrase.
DOCUMENT: an information entity that the user wants to retrieve
COLLECTION: a set of documents
INDEX: a representation of information that makes querying easier
TERM: word or concept that appears in a document or a query
Mappings and abstractions
Reality -> Data
Information need -> Query
From Robert Korfhage's book
Documents
Not just printed paper
Can be records, pages, sites, images, people, movies
Document encoding (Unicode)
Document representation
Document preprocessing
Characteristics of user queries
Sessions: users revisit their queries.
Very short queries: typically 2 words long.
A large number of typos.
A small number of popular queries. A long tail of infrequent ones.
Almost no use of advanced query operators with the exception of double quotes
Queries as documents
Advantages:
Mathematically easier to manage
Problems:
Different lengths
Syntactic differences
Repetitions of words (or lack thereof)
Document representations
Term-document matrix (m x n)
Document-document matrix (n x n)
Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m)
Typical example on the Web: n=30,000,000,000, m=1,000,000
Boolean vs. integer-valued matrices
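As a small illustration of Boolean versus integer-valued matrices, the sketch below builds an m x n term-document matrix for a toy collection. The three documents, the whitespace tokenizer, and the helper name `term_document_matrix` are assumptions made for this example only.

```python
from collections import Counter

def term_document_matrix(docs, boolean=False):
    """Return (terms, matrix): one row per term, one column per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]  # naive whitespace tokenizer
    terms = sorted(set().union(*counts))                     # the m distinct terms
    matrix = [
        [(1 if c[t] else 0) if boolean else c[t] for c in counts]
        for t in terms
    ]
    return terms, matrix

docs = ["cleveland ohio ohio", "cleveland michigan", "ohio"]  # toy collection, n = 3
terms, counts = term_document_matrix(docs)
_, incidence = term_document_matrix(docs, boolean=True)
for term, row, bits in zip(terms, counts, incidence):
    print(term, row, bits)
```

The integer-valued row for "ohio" records that it occurs twice in the first document, while the Boolean row only records presence.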
Storage issues
Imagine a medium-sized collection with n=3,000,000 and m=50,000
How large a term-document matrix will be needed?
Is there any way to do better? Any heuristic?
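One way to approach the question is a back-of-the-envelope calculation. The sketch below uses the n and m from the slide; the cell sizes (1 bit for a Boolean matrix, 4 bytes for an integer one) and the figure of roughly 100 distinct terms per document are assumptions chosen for illustration, not values from the course.

```python
n_docs = 3_000_000   # n, from the slide
n_terms = 50_000     # m, from the slide

cells = n_docs * n_terms          # entries in the full m x n matrix
bool_bytes = cells // 8           # assumed: 1 bit per cell
int_bytes = cells * 4             # assumed: 4-byte integer counts

# Assumption for illustration: about 100 distinct terms per document,
# so only that many cells per column are non-zero.
nonzero = n_docs * 100
sparse_bytes = nonzero * 4        # assumed: one 4-byte document id per posting

print(f"cells:          {cells:,}")
print(f"Boolean matrix: {bool_bytes / 1e9:.2f} GB")
print(f"integer matrix: {int_bytes / 1e9:.0f} GB")
print(f"postings only:  {sparse_bytes / 1e9:.1f} GB")
```

Under these assumptions almost all cells are zero, which is why storing only the non-zero entries is so much smaller than the dense matrix.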
Inverted index
Instead of an incidence vector, use a posting table
  CLEVELAND: D1, D2, D6
  OHIO: D1, D5, D6, D7
Use linked lists to be able to insert new document postings in order and to remove existing postings.
Keep everything sorted! This gives you a logarithmic improvement in access.
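A minimal sketch of a posting table follows. It uses sorted Python lists with binary search instead of the linked lists mentioned on the slide, which keeps lookups logarithmic but makes insertion linear in the list length. The class name `InvertedIndex`, the integer document ids, and the toy documents are assumptions; the documents are chosen so that the postings match the CLEVELAND and OHIO example.

```python
from bisect import bisect_left
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(list)   # term -> sorted list of document ids

    def add(self, doc_id, text):
        """Insert doc_id into the posting list of every term, keeping each list sorted."""
        for term in set(text.upper().split()):
            plist = self.postings[term]
            i = bisect_left(plist, doc_id)
            if i == len(plist) or plist[i] != doc_id:
                plist.insert(i, doc_id)

    def contains(self, term, doc_id):
        """Binary search: O(log x) for a posting list of length x."""
        plist = self.postings.get(term, [])
        i = bisect_left(plist, doc_id)
        return i < len(plist) and plist[i] == doc_id

index = InvertedIndex()
# Documents are added out of order on purpose; the lists stay sorted.
for doc_id, text in [(6, "Cleveland Ohio"), (1, "Cleveland Ohio"), (7, "Ohio"),
                     (2, "Cleveland"), (5, "Ohio")]:
    index.add(doc_id, text)

print(index.postings["CLEVELAND"])  # [1, 2, 6]
print(index.postings["OHIO"])       # [1, 5, 6, 7]
```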
Basic operations on inverted indexes
Conjunction (AND): iterative merge of the two postings, O(x+y), where x and y are the lengths of the two posting lists
Disjunction (OR): very similar
Negation (NOT): can we still do it in O(x+y)?
  Example: MICHIGAN AND NOT OHIO
  Example: MICHIGAN OR NOT OHIO
Recursive operations
Optimization: start with the smallest sets
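The merge operations above can be sketched as follows, assuming posting lists are sorted lists of integer document ids. The function names and the MICHIGAN postings are made up for the example; the OHIO postings reuse the earlier D1, D5, D6, D7 list.

```python
def intersect(a, b):
    """AND: walk both sorted lists once, O(x + y)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: the same merge, but every id is kept once."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def and_not(a, b):
    """a AND NOT b: keep ids of a that are absent from b, still O(x + y)."""
    j = 0
    out = []
    for doc in a:
        while j < len(b) and b[j] < doc:
            j += 1
        if j == len(b) or b[j] != doc:
            out.append(doc)
    return out

michigan = [2, 4, 6, 9]   # hypothetical postings
ohio = [1, 5, 6, 7]
print(intersect(michigan, ohio))  # [6]
print(union(michigan, ohio))      # [1, 2, 4, 5, 6, 7, 9]
print(and_not(michigan, ohio))    # [2, 4, 9]
```

MICHIGAN OR NOT OHIO is different in kind: NOT OHIO on its own is the complement of a posting list, so the result depends on the size of the whole collection and not only on x and y.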
Major IR models
Boolean
Vector
Probabilistic
Language modeling
Fuzzy retrieval
Latent semantic indexing
Boolean queries
Operators: AND, OR, NOT, parentheses
Example: CLEVELAND AND NOT OHIO
Example: (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)
Ambiguous uses of AND and OR in human language
  Exclusive vs. inclusive OR
  Restrictive operator: AND or OR?
Canonical forms of queries
De Morgan's Laws:
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
Normal forms
Conjunctive normal form (CNF)
Disjunctive normal form (DNF)
Reference librarians prefer CNF - why?
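De Morgan's laws can be checked exhaustively, since two Boolean variables have only four assignments. The helper name `de_morgan_holds` is an assumption for this sketch.

```python
from itertools import product

def de_morgan_holds():
    """Check both laws on every assignment of truth values to A and B."""
    for a, b in product([False, True], repeat=2):
        if (not (a and b)) != ((not a) or (not b)):
            return False
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True

print(de_morgan_holds())  # True
```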
Evaluating Boolean queries
Incidence vectors:
CLEVELAND: 1100010
OHIO: 1000111
Examples:
CLEVELAND AND OHIO
CLEVELAND AND NOT OHIO
CLEVELAND OR OHIO
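The three example queries can be evaluated with bitwise operations on the incidence vectors. This sketch stores each vector as a Python integer and assumes the leftmost bit stands for the first of seven documents; the helper `bits` is introduced only for printing.

```python
n_docs = 7
mask = (1 << n_docs) - 1          # keeps results within the 7 document positions

cleveland = int("1100010", 2)     # incidence vectors from the slide
ohio = int("1000111", 2)

def bits(x):
    """Format a vector as a 7-character bit string."""
    return format(x & mask, f"0{n_docs}b")

print(bits(cleveland & ohio))     # CLEVELAND AND OHIO     -> 1000010
print(bits(cleveland & ~ohio))    # CLEVELAND AND NOT OHIO -> 0100000
print(bits(cleveland | ohio))     # CLEVELAND OR OHIO      -> 1100111
```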
Exercise
D1 = computer information retrieval
D2 = computer retrieval
D3 = information
D4 = computer information
Q1 = information AND retrieval
Q2 = information AND NOT computer
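One way to check an answer to this exercise is to treat each term as the set of documents that contain it, so that AND becomes set intersection and AND NOT becomes set difference. The helper name `having` is an assumption for this sketch.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

def having(term):
    """Set of document ids whose text contains the term."""
    return {d for d, text in docs.items() if term in text.split()}

q1 = having("information") & having("retrieval")   # information AND retrieval
q2 = having("information") - having("computer")    # information AND NOT computer
print(sorted(q1))  # ['D1']
print(sorted(q2))  # ['D3']
```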
Exercise
0 (none)
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Milton Shakespeare
7 Milton Shakespeare Swift
8 Chaucer
9 Chaucer Swift
10 Chaucer Shakespeare
11 Chaucer Shakespeare Swift
12 Chaucer Milton
13 Chaucer Milton Swift
14 Chaucer Milton Shakespeare
15 Chaucer Milton Shakespeare Swift
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
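The query can be evaluated against all sixteen rows by reading each row number as four bits, which matches the table: Chaucer = 8, Milton = 4, Shakespeare = 2, Swift = 1. The function name `query` and the bit-decoding loop are assumptions of this sketch.

```python
def query(chaucer, milton, shakespeare, swift):
    """The Boolean query from the exercise, written with Python operators."""
    return ((chaucer or milton) and not swift) or (
        (not chaucer) and (swift or shakespeare)
    )

matches = []
for row in range(16):
    # Bit 3 is Chaucer, bit 2 Milton, bit 1 Shakespeare, bit 0 Swift.
    flags = [bool((row >> (3 - k)) & 1) for k in range(4)]
    if query(*flags):
        matches.append(row)

print(matches)  # [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14]
```

The rows that fail are 0 and the four rows that contain both Chaucer and Swift (9, 11, 13, 15).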
How to deal with?
Multi-word phrases?
Document ranking?
The Vector model
Term 1
Term 2
Term 3
Doc 1
Doc 2
Doc 3
Vector queries
Each document is represented as a vector
Non-efficient representation
Dimensional compatibility
W1 W2 W3 W4 W5 W6 W7 W8 W9 W10
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
The matching process
Document space
Matching is done between a document and a query (or between two documents)
Distance vs. similarity measures.
Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
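The measures named above can be sketched for documents represented as equal-length lists of term weights. The function names and the three toy vectors are assumptions for illustration; the Jaccard coefficient here is computed on the sets of terms with non-zero weight, which is one common convention.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def jaccard(u, v):
    """Jaccard coefficient on the sets of terms with non-zero weight."""
    su = {i for i, a in enumerate(u) if a}
    sv = {i for i, b in enumerate(v) if b}
    return len(su & sv) / len(su | sv) if su | sv else 0.0

d1 = [1, 2, 0]   # toy vectors, not the ones from any exercise in the slides
d2 = [2, 4, 0]
d3 = [0, 0, 3]

for name, other in [("d2", d2), ("d3", d3)]:
    print(name, round(cosine(d1, other), 3), round(euclidean(d1, other), 3),
          manhattan(d1, other), jaccard(d1, other))
```

Here d2 points in the same direction as d1, so its cosine is about 1 even though the Euclidean distance is not zero, which shows how a similarity measure and a distance measure can disagree about "closeness".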
Exercise
Compute the cosine scores (D1,D2) and (D1,D3) for the documents: D1 = , D2 = and D3 =
Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
Readings
(1): MRS1, MRS2, MRS5 (Zipf)
(2): MRS7, MRS8
Recommended