Search Engine Technology (1)


    Prof. Dragomir R. Radev

    [email protected]


    SET FALL 2009

    1. Introduction


    Examples of search engines

    Conventional (library catalog): search by keyword, title, author, etc.

    Text-based (Lexis-Nexis, Google, Yahoo!): search by keywords; limited search using queries in natural language.

    Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...).

    Question answering systems (Ask, NSIR, AnswerBus): search in (restricted) natural language.

    Clustering systems (Vivisimo, Clusty).

    Research systems (Lemur, Nutch).


    What does it take to build a search engine?

    Decide what to index

    Collect it

    Index it (efficiently)

    Keep the index up to date

    Provide user-friendly query facilities


    What else?

    Understand the structure of the web for efficient crawling

    Understand user information needs

    Preprocess text and other unstructured data

    Cluster data

    Classify data

    Evaluate performance


    Goals of the course

    Understand how search engines work

    Understand the limits of existing search technology

    Learn to appreciate the sheer size of the Web

    Learn to write code for text indexing and retrieval

    Learn about the state of the art in IR research

    Learn to analyze textual and semi-structured data sets

    Learn to appreciate the diversity of texts on the Web

    Learn to evaluate information retrieval

    Learn about standardized document collections

    Learn about text similarity measures

    Learn about semantic dimensionality reduction

    Learn about the idiosyncrasies of hyperlinked document collections

    Learn about web crawling

    Learn to use existing software

    Understand the dynamics of the Web by building appropriate mathematical models

    Build working systems that assist users in finding useful information on the Web


    Course logistics

    Thursdays 6:10-8:00

    Office hours: TBA

    URL: http://www.cs.columbia.edu/~cs6998

    Instructor: Dragomir Radev

    Email: [email protected]

    TA: Yves Petinot ([email protected])

    Kaushal Lahankar ([email protected])


    Course outline

    Classic document retrieval: storing, indexing, retrieval.

    Web retrieval: crawling, query processing.

    Text and web mining: classification, clustering.

    Network analysis: random graph models, centrality, diameter and clustering coefficient.


    Syllabus

    Introduction. Queries and documents. Models of information retrieval. The Boolean model. The Vector model.

    Document preprocessing. Tokenization. Stemming. The Porter algorithm.

    Storing, indexing and searching text. Inverted indexes.

    Word distributions. The Zipf distribution. The Benford distribution. Heaps' law.

    TF*IDF. Vector space similarity and ranking.

    Retrieval evaluation. Precision and recall. F-measure. Reference collections. The TREC conferences.

    Automated indexing/labeling. Compression and coding. Optimal codes.

    String matching. Approximate matching.

    Query expansion. Relevance feedback.

    Text classification. Naive Bayes. Feature selection. Decision trees.


    Syllabus

    Linear classifiers. k-nearest neighbors. Perceptron. Kernel methods. Maximum-margin classifiers. Support vector machines. Semi-supervised learning.

    Lexical semantics and WordNet. Latent semantic indexing. Singular value decomposition.

    Vector space clustering. k-means clustering. EM clustering.

    Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution.

    Social network analysis. Small worlds and scale-free networks. Power law distributions. Centrality.

    Graph-based methods. Harmonic functions. Random walks. PageRank. Hubs and authorities. Bipartite graphs. HITS. Models of the Web.


    Syllabus

    Crawling the web. Webometrics. Measuring the size of the web. The bow-tie method.

    Hypertext retrieval. Web-based IR. Document closures. Focused crawling.

    Question answering. Burstiness. Self-triggerability. Information extraction.

    Adversarial IR. Human behavior on the web. Text summarization.

    POSSIBLE TOPICS

    Discovering communities, spectral clustering.

    Semi-supervised retrieval.

    Natural language processing. XML retrieval. Text tiling. Human behavior on the web.


    Readings

    Required: Information Retrieval by Manning, Schuetze, and Raghavan (http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html), freely available; hard copy for sale.

    Optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN: 0-470-84906-1 (http://ibook.ics.uci.edu).

    Papers from SIGIR, WWW and journals (to be announced in class).


    Prerequisites

    Linear algebra: vectors and matrices.

    Calculus: finding extrema of functions.

    Probabilities: random variables, discrete and continuous distributions, Bayes' theorem.

    Programming: experience with at least one web-aware programming language such as Perl (highly recommended) or Java in a UNIX environment.

    A CS account is required.


    Course requirements

    Three assignments (30%)

    Some of them will be in Perl. The rest can be done in any appropriate language. All will involve some data analysis and evaluation.

    Final project (30%)

    Research paper or software system.

    Class participation (10%)

    Final exam (30%)


    Final project format

    Research paper - using the SIGIR format. Students will be in charge of problem formulation, literature survey, hypothesis formulation, experimental design, implementation, and possibly submission to a conference like SIGIR or WWW.

    Software system - develop a working system or API. Students will be responsible for identifying a niche problem, implementing it, and deploying it, either on the Web or as an open-source downloadable tool. The system can be either stand-alone or an extension to an existing one.


    Project ideas

    Build a question answering system.
    Build a language identification system.
    Social network analysis from the Web.
    Participate in the Netflix challenge.
    Query log analysis.
    Build models of Web evolution.
    Information diffusion in blogs or the web.
    Author-topic models of web pages.
    Using the web for machine translation.
    Building evolving models of web documents.
    News recommendation system.
    Compress the text of Wikipedia (losslessly).
    Spelling correction using query logs.
    Automatic query expansion.


    List of projects from the past

    Document Closures for Indexing
    Tibet - Table Structure Recognition Library
    Ruby Blog Memetracker
    Sentence decomposition for more accurate information retrieval
    Extracting Social Networks from LiveJournal
    Google Suggest Programming Project (Java Swing Client and Lucene Ba...)
    Leveraging Social Networks for Organizing and Browsing Shared Photog...
    Media Bias and the Political Blogosphere
    Measuring Similarity between search queries
    Extracting Social Networks and Information about the people within them
    LSI + dependency trees


    Available corpora

    Netflix challenge, AOL query logs, blogs, bio papers, AAN, email, Generifs, web pages, political science corpus, VAST, del.icio.us, SMS, news data (aquaint, tdt, nantc, reuters, setimes, trec, tipster), Europarl multilingual, US congressional data, DMOZ, Pubmedcentral, DUC/TAC, Timebank, Wikipedia, wt2g/wt10g/wt100g, dotgov, RTE, paraphrases, GENIA, Hansards, IMDB, MTA/MTC, nie, cnnsumm, Poliblog, sentiment, xml, epinions, Enron


    Related courses elsewhere

    Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze)

    Cornell (Jon Kleinberg)

    CMU (Yiming Yang and Jamie Callan)

    UMass (James Allan)

    UTexas (Ray Mooney)

    Illinois (Chengxiang Zhai)

    Johns Hopkins (David Yarowsky)

    For a long list of courses related to Search Engines, Natural Language Processing, and Machine Learning, look here:

    http://tangra.si.umich.edu/clair/clair/courses.html


    SET FALL 2009

    2. Models of Information retrieval

    The Vector model

    The Boolean model


    Sample queries (from Excite)

    In what year did baseball become an offical sport?

    play station codes . com

    birth control and depression

    government

    "WorkAbility I"+conference

    kitchen appliances

    where can I find a chines rosewood

    tiger electronics

    58 Plymouth Fury

    How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?

    emeril Lagasse

    Hubble

    M.S Subalaksmi

    running


    Fun things to do with search engines

    Googlewhack: reduce the document set size to 1.

    Find a query that will bring a given URL into the top 10.


    Key Terms Used in IR

    QUERY: a representation of what the user is looking for - can be a list of words or a phrase.

    DOCUMENT: an information entity that the user wants to retrieve.

    COLLECTION: a set of documents.

    INDEX: a representation of information that makes querying easier.

    TERM: word or concept that appears in a document or a query.


    Mappings and abstractions

    Reality -> Data

    Information need -> Query

    (From Robert Korfhage's book)


    Documents

    Not just printed paper

    Can be records, pages, sites, images, people, movies

    Document encoding (Unicode)

    Document representation

    Document preprocessing


    Characteristics of user queries

    Sessions: users revisit their queries.

    Very short queries: typically 2 words long.

    A large number of typos.

    A small number of popular queries. A long tail of infrequent ones.

    Almost no use of advanced query operators, with the exception of double quotes.


    Queries as documents

    Advantages:

    Mathematically easier to manage

    Problems:

    Different lengths

    Syntactic differences

    Repetitions of words (or lack thereof)


    Document representations

    Term-document matrix (m x n)

    Document-document matrix (n x n)

    Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m)

    Typical example on the Web:

    n=30,000,000,000, m=1,000,000

    Boolean vs. integer-valued matrices
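
    As a concrete illustration (a sketch, not from the slides), a term-document count matrix can be built directly from the tiny four-document collection used in the Boolean exercise later in the deck:

        # Term-document count matrix (m terms x n documents) for a toy collection.
        from collections import Counter

        docs = {
            "D1": "computer information retrieval",
            "D2": "computer retrieval",
            "D3": "information",
            "D4": "computer information",
        }

        counts = {d: Counter(text.split()) for d, text in docs.items()}
        terms = sorted({t for c in counts.values() for t in c})

        # matrix[i][j] = frequency of terms[i] in the j-th document
        matrix = [[counts[d][t] for d in docs] for t in terms]
        for t, row in zip(terms, matrix):
            print(f"{t:12s}", row)

    With Boolean (0/1) entries this is the incidence matrix; with counts it feeds the vector model later in the deck.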


    Storage issues

    Imagine a medium-sized collection with n=3,000,000 and m=50,000.

    How large a term-document matrix will be needed?

    Is there any way to do better? Any heuristic?
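
    A back-of-the-envelope calculation (a sketch; the average number of distinct terms per document is an assumed figure, not from the slides) suggests why the full matrix is never stored:

        # Dense term-document matrix for n=3,000,000 documents and m=50,000 terms.
        n_docs, n_terms = 3_000_000, 50_000
        cells = n_docs * n_terms                    # 1.5e11 entries
        print(cells / 8 / 2**30, "GiB at one bit per Boolean cell")  # ~17.5 GiB

        # Most cells are zero: if a document has only a few hundred distinct terms,
        # storing just the nonzero (term, document) pairs is far smaller.
        avg_distinct_terms = 300                    # assumption for illustration
        print(n_docs * avg_distinct_terms / cells)  # density ~0.006

    The heuristic is to exploit that sparsity, which is exactly what the inverted index on the next slide does.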


    Inverted index

    Instead of an incidence vector, use a posting table:

    CLEVELAND: D1, D2, D6

    OHIO: D1, D5, D6, D7

    Use linked lists to be able to insert new document postings in order and to remove existing postings.

    Keep everything sorted! This gives you a logarithmic improvement in access.
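
    A minimal Python sketch of such a posting table, using the CLEVELAND/OHIO postings above; sorted lists stand in for the linked lists, and the helper name is illustrative:

        import bisect

        # Inverted index: term -> sorted posting list of document IDs.
        index = {
            "CLEVELAND": ["D1", "D2", "D6"],
            "OHIO": ["D1", "D5", "D6", "D7"],
        }

        def add_posting(index, term, doc_id):
            """Insert doc_id into the term's posting list, keeping it sorted."""
            postings = index.setdefault(term, [])
            pos = bisect.bisect_left(postings, doc_id)  # binary search, O(log n)
            if pos == len(postings) or postings[pos] != doc_id:
                postings.insert(pos, doc_id)

        add_posting(index, "OHIO", "D3")
        print(index["OHIO"])   # ['D1', 'D3', 'D5', 'D6', 'D7']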


    Basic operations on inverted

    indexes

    Conjunction (AND): iterative merge of the two postings: O(x+y)

    Disjunction (OR): very similar

    Negation (NOT): can we still do it in O(x+y)?

    Example: MICHIGAN AND NOT OHIO

    Example: MICHIGAN OR NOT OHIO

    Recursive operations

    Optimization: start with the smallest sets
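
    A sketch of the O(x+y) iterative merge for AND on two sorted posting lists (OR walks the lists the same way but keeps every document seen):

        def intersect(p1, p2):
            """AND: merge two sorted posting lists in O(x + y) time."""
            i = j = 0
            result = []
            while i < len(p1) and j < len(p2):
                if p1[i] == p2[j]:
                    result.append(p1[i]); i += 1; j += 1
                elif p1[i] < p2[j]:
                    i += 1
                else:
                    j += 1
            return result

        print(intersect(["D1", "D2", "D6"], ["D1", "D5", "D6", "D7"]))  # ['D1', 'D6']

    Starting with the smallest posting lists keeps the intermediate results, and hence all later merges, as short as possible.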


    Major IR models

    Boolean

    Vector

    Probabilistic

    Language modeling

    Fuzzy retrieval

    Latent semantic indexing


    Boolean queries

    Operators: AND, OR, NOT, parentheses

    Example: CLEVELAND AND NOT OHIO

    (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)

    Ambiguous uses of AND and OR in human language

    Exclusive vs. inclusive OR

    Restrictive operator: AND or OR?


    Canonical forms of queries

    De Morgan's laws:

    NOT (A AND B) = (NOT A) OR (NOT B)

    NOT (A OR B) = (NOT A) AND (NOT B)

    Normal forms

    Conjunctive normal form (CNF)

    Disjunctive normal form (DNF)

    Reference librarians prefer CNF - why?


    Evaluating Boolean queries

    Incidence vectors:

    CLEVELAND: 1100010

    OHIO: 1000111

    Examples:

    CLEVELAND AND OHIO

    CLEVELAND AND NOT OHIO

    CLEVELAND OR OHIO
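
    A sketch of evaluating these queries with the incidence vectors held as Python integers, so AND, OR and NOT become bitwise operations (the mask keeps NOT inside the 7-document collection):

        # One bit per document, leftmost bit = first document.
        N = 7
        CLEVELAND = int("1100010", 2)
        OHIO      = int("1000111", 2)
        mask = (1 << N) - 1

        def show(bits):
            return format(bits, f"0{N}b")

        print(show(CLEVELAND & OHIO))          # CLEVELAND AND OHIO      -> 1000010
        print(show(CLEVELAND & ~OHIO & mask))  # CLEVELAND AND NOT OHIO  -> 0100000
        print(show(CLEVELAND | OHIO))          # CLEVELAND OR OHIO       -> 1100111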


    Exercise

    D1 = computer information retrieval

    D2 = computer retrieval

    D3 = information

    D4 = computer information

    Q1 = information AND retrieval

    Q2 = information AND NOT computer


    Exercise

    0 (no authors)

    1 Swift

    2 Shakespeare

    3 Shakespeare Swift

    4 Milton

    5 Milton Swift

    6 Milton Shakespeare

    7 Milton Shakespeare Swift

    8 Chaucer

    9 Chaucer Swift

    10 Chaucer Shakespeare

    11 Chaucer Shakespeare Swift

    12 Chaucer Milton

    13 Chaucer Milton Swift

    14 Chaucer Milton Shakespeare

    15 Chaucer Milton Shakespeare Swift

    ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
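
    As a sanity check (a sketch; the bit encoding below simply mirrors the pattern of the table, where each document's number encodes which of the four authors it contains), the query can be evaluated mechanically over all sixteen documents:

        # Document i (0..15) contains the authors whose bit is set in i:
        # bit 0 = swift, bit 1 = shakespeare, bit 2 = milton, bit 3 = chaucer.
        authors = ["swift", "shakespeare", "milton", "chaucer"]
        docs = {i: {a for b, a in enumerate(authors) if i >> b & 1} for i in range(16)}

        def matches(d):
            # ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
            return (("chaucer" in d or "milton" in d) and "swift" not in d) or \
                   ("chaucer" not in d and ("swift" in d or "shakespeare" in d))

        print(sorted(i for i, d in docs.items() if matches(d)))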


    How to deal with?

    Multi-word phrases?

    Document ranking?


    The Vector model

    [Figure: Doc 1, Doc 2, and Doc 3 shown as vectors in a space whose axes are Term 1, Term 2, and Term 3.]


    Vector queries

    Each document is represented as a vector

    Inefficient representation

    Dimensional compatibility

    [Figure: a document vector with terms W1 ... W10 and corresponding weights C1 ... C10.]


    The matching process

    Document space

    Matching is done between a document and a query (or between two documents).

    Distance vs. similarity measures.

    Euclidean distance, Manhattan distance, word overlap, Jaccard coefficient, etc.


    Exercise

    Compute the cosine scores (D1,D2) and (D1,D3) for the documents D1 = ..., D2 = ..., and D3 = ...

    Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
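
    The document vectors above did not survive conversion, so here is a sketch of the four measures over hypothetical term-count vectors (swap in the vectors given in class):

        import math

        def cosine(x, y):
            dot = sum(a * b for a, b in zip(x, y))
            return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

        def euclidean(x, y):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

        def manhattan(x, y):
            return sum(abs(a - b) for a, b in zip(x, y))

        def jaccard(x, y):
            # Over term-count vectors: terms present in both, over terms present in either.
            both = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)
            either = sum(1 for a, b in zip(x, y) if a > 0 or b > 0)
            return both / either

        d1, d2, d3 = [1, 1, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]   # hypothetical vectors
        print(cosine(d1, d2), cosine(d1, d3))
        print(euclidean(d1, d2), manhattan(d1, d2), jaccard(d1, d2))

    Note that cosine and Jaccard are similarities (larger means closer), while the Euclidean and Manhattan measures are distances.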


    Readings

    (1): MRS 1, MRS 2, MRS 5 (Zipf) - chapters of the required Manning/Raghavan/Schuetze text

    (2): MRS 7, MRS 8