Search Engine Technology (1)


    Prof. Dragomir R. Radev

    [email protected]


    SET FALL 2009

    1. Introduction


    Examples of search engines

    Conventional (library catalog): search by keyword, title, author, etc.

    Text-based (Lexis-Nexis, Google, Yahoo!): search by keywords; limited search using queries in natural language.

    Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...).

    Question answering systems (Ask, NSIR, AnswerBus): search in (restricted) natural language.

    Clustering systems (Vivisimo, Clusty).

    Research systems (Lemur, Nutch).


    What does it take to build a search engine?

    Decide what to index

    Collect it

    Index it (efficiently)

    Keep the index up to date

    Provide user-friendly query facilities


    What else?

    Understand the structure of the web for efficient crawling

    Understand user information needs

    Preprocess text and other unstructured data

    Cluster data

    Classify data

    Evaluate performance


    Goals of the course

    Understand how search engines work

    Understand the limits of existing search technology

    Learn to appreciate the sheer size of the Web

    Learn to write code for text indexing and retrieval

    Learn about the state of the art in IR research

    Learn to analyze textual and semi-structured data sets

    Learn to appreciate the diversity of texts on the Web

    Learn to evaluate information retrieval

    Learn about standardized document collections

    Learn about text similarity measures

    Learn about semantic dimensionality reduction

    Learn about the idiosyncrasies of hyperlinked document collections

    Learn about web crawling

    Learn to use existing software

    Understand the dynamics of the Web by building appropriate mathematical models

    Build working systems that assist users in finding useful information on the Web


    Course logistics

    Thursdays 6:10-8:00

    Office hours: TBA

    URL: http://www.cs.columbia.edu/~cs6998

    Instructor: Dragomir Radev

    Email: [email protected]

    TA: Yves Petinot ([email protected])

    Kaushal Lahankar ([email protected])


    Course outline

    Classic document retrieval: storing, indexing, retrieval.

    Web retrieval: crawling, query processing.

    Text and web mining: classification, clustering.

    Network analysis: random graph models, centrality, diameter and clustering coefficient.


    Syllabus

    Introduction. Queries and documents. Models of information retrieval. The Boolean model. The Vector model.

    Document preprocessing. Tokenization. Stemming. The Porter algorithm.

    Storing, indexing and searching text. Inverted indexes.

    Word distributions. The Zipf distribution. The Benford distribution. Heaps' law.

    TF*IDF. Vector space similarity and ranking.

    Retrieval evaluation. Precision and recall. F-measure. Reference collections. The TREC conferences.

    Automated indexing/labeling. Compression and coding. Optimal codes.

    String matching. Approximate matching.

    Query expansion. Relevance feedback.

    Text classification. Naive Bayes. Feature selection. Decision trees.


    Syllabus

    Linear classifiers. k-nearest neighbors. Perceptron. Kernel methods. Maximum-margin classifiers. Support vector machines. Semi-supervised learning.

    Lexical semantics and WordNet. Latent semantic indexing. Singular value decomposition.

    Vector space clustering. k-means clustering. EM clustering.

    Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution.

    Social network analysis. Small worlds and scale-free networks. Power law distributions. Centrality.

    Graph-based methods. Harmonic functions. Random walks. PageRank. Hubs and authorities. Bipartite graphs. HITS. Models of the Web.


    Syllabus

    Crawling the web. Webometrics. Measuring the size of the web. The bow-tie method.

    Hypertext retrieval. Web-based IR. Document closures. Focused crawling.

    Question answering. Burstiness. Self-triggerability. Information extraction.

    Adversarial IR. Human behavior on the web. Text summarization.

    POSSIBLE TOPICS

    Discovering communities, spectral clustering.

    Semi-supervised retrieval.

    Natural language processing. XML retrieval. Text tiling. Human behavior on the web.


    Readings

    Required: Information Retrieval by Manning, Schuetze, and Raghavan (http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html), freely available; hard copy for sale.

    Optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN: 0-470-84906-1 (http://ibook.ics.uci.edu).

    Papers from SIGIR, WWW and journals (to be announced in class).


    Prerequisites

    Linear algebra: vectors and matrices.

    Calculus: finding extrema of functions.

    Probabilities: random variables, discrete and continuous distributions, Bayes' theorem.

    Programming: experience with at least one web-aware programming language such as Perl (highly recommended) or Java in a UNIX environment.

    A CS account is required.


    Course requirements

    Three assignments (30%)

    Some of them will be in Perl. The rest can be done in any appropriate language. All will involve some data analysis and evaluation.

    Final project (30%)

    Research paper or software system.

    Class participation (10%)

    Final exam (30%)


    Final project format

    Research paper - using the SIGIR format. Students will be in charge of problem formulation, literature survey, hypothesis formulation, experimental design, implementation, and possibly submission to a conference like SIGIR or WWW.

    Software system - develop a working system or API. Students will be responsible for identifying a niche problem, implementing it, and deploying it, either on the Web or as an open-source downloadable tool. The system can be either stand-alone or an extension to an existing one.


    Project ideas

    Build a question answering system.
    Build a language identification system.
    Social network analysis from the Web.
    Participate in the Netflix challenge.
    Query log analysis.
    Build models of Web evolution.
    Information diffusion in blogs or the web.
    Author-topic models of web pages.
    Using the web for machine translation.
    Building evolving models of web documents.
    News recommendation system.
    Compress the text of Wikipedia (losslessly).
    Spelling correction using query logs.
    Automatic query expansion.


    List of projects from the past

    Document Closures for Indexing
    Tibet - Table Structure Recognition Library
    Ruby Blog Memetracker
    Sentence decomposition for more accurate information retrieval
    Extracting Social Networks from LiveJournal
    Google Suggest Programming Project (Java Swing Client and Lucene Ba...)
    Leveraging Social Networks for Organizing and Browsing Shared Photog...
    Media Bias and the Political Blogosphere
    Measuring Similarity between search queries
    Extracting Social Networks and Information about the people within them
    LSI + dependency trees


    Available corpora

    Netflix challenge, AOL query logs, blogs, bio papers, AAN, email, Generifs, web pages, political science corpus, VAST, del.icio.us, SMS, news data (aquaint, tdt, nantc, reuters, setimes, trec, tipster), Europarl multilingual, US congressional data, DMOZ, Pubmedcentral, DUC/TAC, Timebank, Wikipedia, wt2g/wt10g/wt100g, dotgov, RTE, paraphrases, GENIA, Hansards, IMDB, MTA/MTC, nie, cnnsumm, Poliblog, sentiment, xml, epinions, Enron


    Related courses elsewhere

    Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze)

    Cornell (Jon Kleinberg)

    CMU (Yiming Yang and Jamie Callan)

    UMass (James Allan)

    UTexas (Ray Mooney)

    Illinois (Chengxiang Zhai)

    Johns Hopkins (David Yarowsky)

    For a long list of courses related to Search Engines, Natural Language Processing, and Machine Learning, look here:

    http://tangra.si.umich.edu/clair/clair/courses.html


    SET FALL 2009

    2. Models of Information retrieval

    The Vector model

    The Boolean model


    Sample queries (from Excite)

    In what year did baseball become an offical sport?

    play station codes . com

    birth control and depression

    government

    "WorkAbility I"+conference

    kitchen appliances

    where can I find a chines rosewood

    tiger electronics

    58 Plymouth Fury

    How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?

    emeril Lagasse

    Hubble

    M.S Subalaksmi

    running


    Fun things to do with search engines

    Googlewhack: reduce the document set size to 1.

    Find a query that will bring a given URL into the top 10.


    Key Terms Used in IR

    QUERY: a representation of what the user is looking for - can be a list of words or a phrase.

    DOCUMENT: an information entity that the user wants to retrieve.

    COLLECTION: a set of documents.

    INDEX: a representation of information that makes querying easier.

    TERM: word or concept that appears in a document or a query.


    Mappings and abstractions

    Reality -> Data

    Information need -> Query

    (From Robert Korfhage's book)


    Documents

    Not just printed paper

    Can be records, pages, sites, images, people, movies

    Document encoding (Unicode)

    Document representation

    Document preprocessing


    Characteristics of user queries

    Sessions: users revisit their queries.

    Very short queries: typically 2 words long.

    A large number of typos.

    A small number of popular queries. A long tail of infrequent ones.

    Almost no use of advanced query operators, with the exception of double quotes.


    Queries as documents

    Advantages:

    Mathematically easier to manage

    Problems:

    Different lengths

    Syntactic differences

    Repetitions of words (or lack thereof)


    Document representations

    Term-document matrix (m x n)

    Document-document matrix (n x n)

    Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m)

    Typical example on the Web:

    n=30,000,000,000, m=1,000,000

    Boolean vs. integer-valued matrices
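
    As a concrete illustration (a sketch, not from the slides), a term-document count matrix can be built directly from the tiny four-document collection used in the Boolean exercise later in the deck:

        # Term-document count matrix (m terms x n documents) for a toy collection.
        from collections import Counter

        docs = {
            "D1": "computer information retrieval",
            "D2": "computer retrieval",
            "D3": "information",
            "D4": "computer information",
        }

        counts = {d: Counter(text.split()) for d, text in docs.items()}
        terms = sorted({t for c in counts.values() for t in c})

        # matrix[i][j] = frequency of terms[i] in the j-th document
        matrix = [[counts[d][t] for d in docs] for t in terms]
        for t, row in zip(terms, matrix):
            print(f"{t:12s}", row)

    With Boolean (0/1) entries this is the incidence matrix; with counts it feeds the vector model later in the deck.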


    Storage issues

    Imagine a medium-sized collection with n=3,000,000 and m=50,000.

    How large a term-document matrix will be needed?

    Is there any way to do better? Any heuristic?
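
    A back-of-the-envelope calculation (a sketch; the average number of distinct terms per document is an assumed figure, not from the slides) suggests why the full matrix is never stored:

        # Dense term-document matrix for n=3,000,000 documents and m=50,000 terms.
        n_docs, n_terms = 3_000_000, 50_000
        cells = n_docs * n_terms                    # 1.5e11 entries
        print(cells / 8 / 2**30, "GiB at one bit per Boolean cell")  # ~17.5 GiB

        # Most cells are zero: if a document has only a few hundred distinct terms,
        # storing just the nonzero (term, document) pairs is far smaller.
        avg_distinct_terms = 300                    # assumption for illustration
        print(n_docs * avg_distinct_terms / cells)  # density ~0.006

    The heuristic is to exploit that sparsity, which is exactly what the inverted index on the next slide does.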


    Inverted index

    Instead of an incidence vector, use a posting table:

    CLEVELAND: D1, D2, D6

    OHIO: D1, D5, D6, D7

    Use linked lists to be able to insert new document postings in order and to remove existing postings.

    Keep everything sorted! This gives you a logarithmic improvement in access.
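
    A minimal Python sketch of such a posting table, using the CLEVELAND/OHIO postings above; sorted lists stand in for the linked lists, and the helper name is illustrative:

        import bisect

        # Inverted index: term -> sorted posting list of document IDs.
        index = {
            "CLEVELAND": ["D1", "D2", "D6"],
            "OHIO": ["D1", "D5", "D6", "D7"],
        }

        def add_posting(index, term, doc_id):
            """Insert doc_id into the term's posting list, keeping it sorted."""
            postings = index.setdefault(term, [])
            pos = bisect.bisect_left(postings, doc_id)  # binary search, O(log n)
            if pos == len(postings) or postings[pos] != doc_id:
                postings.insert(pos, doc_id)

        add_posting(index, "OHIO", "D3")
        print(index["OHIO"])   # ['D1', 'D3', 'D5', 'D6', 'D7']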


    Basic operations on inverted

    indexes

    Conjunction (AND): iterative merge of the two postings: O(x+y)

    Disjunction (OR): very similar

    Negation (NOT): can we still do it in O(x+y)?

    Example: MICHIGAN AND NOT OHIO

    Example: MICHIGAN OR NOT OHIO

    Recursive operations

    Optimization: start with the smallest sets
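
    A sketch of the O(x+y) iterative merge for AND on two sorted posting lists (OR walks the lists the same way but keeps every document seen):

        def intersect(p1, p2):
            """AND: merge two sorted posting lists in O(x + y) time."""
            i = j = 0
            result = []
            while i < len(p1) and j < len(p2):
                if p1[i] == p2[j]:
                    result.append(p1[i]); i += 1; j += 1
                elif p1[i] < p2[j]:
                    i += 1
                else:
                    j += 1
            return result

        print(intersect(["D1", "D2", "D6"], ["D1", "D5", "D6", "D7"]))  # ['D1', 'D6']

    Starting with the smallest posting lists keeps the intermediate results, and hence all later merges, as short as possible.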


    Major IR models

    Boolean

    Vector

    Probabilistic

    Language modeling

    Fuzzy retrieval

    Latent semantic indexing


    Boolean queries

    Operators: AND, OR, NOT, parentheses

    Example: CLEVELAND AND NOT OHIO

    (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)

    Ambiguous uses of AND and OR in human language

    Exclusive vs. inclusive OR

    Restrictive operator: AND or OR?


    Canonical forms of queries

    De Morgan's laws:

    NOT (A AND B) = (NOT A) OR (NOT B)

    NOT (A OR B) = (NOT A) AND (NOT B)

    Normal forms

    Conjunctive normal form (CNF)

    Disjunctive normal form (DNF)

    Reference librarians prefer CNF - why?


    Evaluating Boolean queries

    Incidence vectors:

    CLEVELAND: 1100010

    OHIO: 1000111

    Examples:

    CLEVELAND AND OHIO

    CLEVELAND AND NOT OHIO

    CLEVELAND OR OHIO
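
    A sketch of evaluating these queries with the incidence vectors held as Python integers, so AND, OR and NOT become bitwise operations (the mask keeps NOT inside the 7-document collection):

        # One bit per document, leftmost bit = first document.
        N = 7
        CLEVELAND = int("1100010", 2)
        OHIO      = int("1000111", 2)
        mask = (1 << N) - 1

        def show(bits):
            return format(bits, f"0{N}b")

        print(show(CLEVELAND & OHIO))          # CLEVELAND AND OHIO      -> 1000010
        print(show(CLEVELAND & ~OHIO & mask))  # CLEVELAND AND NOT OHIO  -> 0100000
        print(show(CLEVELAND | OHIO))          # CLEVELAND OR OHIO       -> 1100111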


    Exercise

    D1 = computer information retrieval

    D2 = computer retrieval

    D3 = information

    D4 = computer information

    Q1 = information AND retrieval

    Q2 = information AND NOT computer


    Exercise

    0 (no authors)

    1 Swift

    2 Shakespeare

    3 Shakespeare Swift

    4 Milton

    5 Milton Swift

    6 Milton Shakespeare

    7 Milton Shakespeare Swift

    8 Chaucer

    9 Chaucer Swift

    10 Chaucer Shakespeare

    11 Chaucer Shakespeare Swift

    12 Chaucer Milton

    13 Chaucer Milton Swift

    14 Chaucer Milton Shakespeare

    15 Chaucer Milton Shakespeare Swift

    ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
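
    As a sanity check (a sketch; the bit encoding below simply mirrors the pattern of the table, where each document's number encodes which of the four authors it contains), the query can be evaluated mechanically over all sixteen documents:

        # Document i (0..15) contains the authors whose bit is set in i:
        # bit 0 = swift, bit 1 = shakespeare, bit 2 = milton, bit 3 = chaucer.
        authors = ["swift", "shakespeare", "milton", "chaucer"]
        docs = {i: {a for b, a in enumerate(authors) if i >> b & 1} for i in range(16)}

        def matches(d):
            # ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
            return (("chaucer" in d or "milton" in d) and "swift" not in d) or \
                   ("chaucer" not in d and ("swift" in d or "shakespeare" in d))

        print(sorted(i for i, d in docs.items() if matches(d)))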


    How to deal with?

    Multi-word phrases?

    Document ranking?


    The Vector model

    [Figure: Doc 1, Doc 2, and Doc 3 shown as vectors in a space whose axes are Term 1, Term 2, and Term 3.]


    Vector queries

    Each document is represented as a vector

    Inefficient representation

    Dimensional compatibility

    [Figure: a document vector with terms W1 ... W10 and corresponding weights C1 ... C10.]


    The matching process

    Document space

    Matching is done between a document and a query (or between two documents).

    Distance vs. similarity measures.

    Euclidean distance, Manhattan distance, word overlap, Jaccard coefficient, etc.


    Exercise

    Compute the cosine scores (D1,D2) and (D1,D3) for the documents D1 = ..., D2 = ..., and D3 = ...

    Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
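
    The document vectors above did not survive conversion, so here is a sketch of the four measures over hypothetical term-count vectors (swap in the vectors given in class):

        import math

        def cosine(x, y):
            dot = sum(a * b for a, b in zip(x, y))
            return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

        def euclidean(x, y):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

        def manhattan(x, y):
            return sum(abs(a - b) for a, b in zip(x, y))

        def jaccard(x, y):
            # Over term-count vectors: terms present in both, over terms present in either.
            both = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)
            either = sum(1 for a, b in zip(x, y) if a > 0 or b > 0)
            return both / either

        d1, d2, d3 = [1, 1, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]   # hypothetical vectors
        print(cosine(d1, d2), cosine(d1, d3))
        print(euclidean(d1, d2), manhattan(d1, d2), jaccard(d1, d2))

    Note that cosine and Jaccard are similarities (larger means closer), while the Euclidean and Manhattan measures are distances.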


    Readings

    (1): MRS 1, MRS 2, MRS 5 (Zipf) - chapters of the required Manning/Raghavan/Schuetze text

    (2): MRS 7, MRS 8