CS473: Course Review. Luo Si, Department of Computer Science, Purdue University


Page 1:

CS473: Course Review

Luo Si

Department of Computer Science

Purdue University

Page 2:

Basic Concepts of IR: Outline

Basic Concepts of Information Retrieval:

Task definition of Ad-hoc IR

Terminologies and concepts

Overview of retrieval models

Text representation

Indexing

Text preprocessing

Evaluation

Evaluation methodology

Evaluation metrics

Page 3:

Ad-hoc IR: Terminologies

Terminologies:

Query

Representative data of user’s information need: text (default) and other media

Document

Data candidate to satisfy user’s information need: text (default) and other media

Database|Collection|Corpus

A set of documents

Corpora

A set of databases

Valuable corpora from TREC (Text REtrieval Conference)

Page 4:

Some core concepts of IR

[Diagram: an information need is represented as a query; documents are represented as indexed objects; the retrieval model matches the query against the indexed objects to produce retrieved objects, which are returned as results and assessed through evaluation/feedback]

Page 5:

Text Representation: Indexing

Statistical Properties of Text

Zipf’s law: relates a term’s frequency to its rank

Rank all terms by frequency in descending order. For the term at rank r, let f_r be its term frequency and p_r = f_r / N its relative frequency, where N is the total number of words.

Zipf’s law (by observation): p_r = A / r with A ≈ 0.1, so f_r = A*N / r and log(f_r) = log(A*N) - log(r)

So Rank x Frequency = Constant
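A quick way to see this law in practice (not part of the slides) is to count term frequencies in any sizable text, sort them, and check that rank times relative frequency stays roughly constant. This is a minimal sketch; the file name sample.txt is only a placeholder.

```python
import re
from collections import Counter

def zipf_table(text, top=20):
    # Tokenize crudely on letter runs and count term frequencies.
    tokens = re.findall(r"[a-z]+", text.lower())
    n = len(tokens)                      # total number of words N
    counts = Counter(tokens)
    # Rank terms by descending frequency and report rank * relative frequency,
    # which Zipf's law predicts to be roughly constant (around 0.1 for English).
    rows = []
    for rank, (term, f) in enumerate(counts.most_common(top), start=1):
        p = f / n                        # relative frequency p_r = f_r / N
        rows.append((rank, term, f, rank * p))
    return rows

if __name__ == "__main__":
    text = open("sample.txt", encoding="utf-8").read()  # any large text file
    for rank, term, f, rp in zipf_table(text):
        print(f"{rank:>3}  {term:<15} f={f:<6} rank*p={rp:.3f}")
```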

Page 6:

Text Representation: Text Preprocessing

Text Preprocessing: extract representative index terms

Parse query/document for useful structure

E.g., title, anchor text, link, tag in XML...

Tokenization

For most Western languages, words are separated by spaces; deal with punctuation, capitalization, hyphenation

For Chinese, Japanese: more complex word segmentation...

Remove stopwords (e.g., “the”, “is”, ...; standard lists exist)

Morphological analysis (e.g., stemming):

Stemming: determine the stem form of given inflected forms

Other: extract phrases; decompounding for some European languages
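As a rough illustration of this pipeline (a toy sketch, not the slides’ method), the code below uses a regex tokenizer, a tiny stopword list, and a crude suffix-stripping “stemmer”; a real system would use a standard stopword list and a proper stemmer such as Porter’s.

```python
import re

# Toy stopword list; real systems use a much larger standard list.
STOPWORDS = {"the", "is", "a", "an", "of", "and", "or", "to", "in", "are"}

def preprocess(text):
    # Tokenization: lowercase and split on letter runs (handles punctuation
    # and capitalization for most Western languages).
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopword removal against the (toy) list above.
    tokens = [t for t in tokens if t not in STOPWORDS]

    # Very crude stemming: strip a few common English suffixes.
    def stem(t):
        for suf in ("ing", "ies", "es", "s", "ed"):
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                return t[: -len(suf)]
        return t

    return [stem(t) for t in tokens]

print(preprocess("The retrieved documents are indexed and ranked."))
# -> ['retriev', 'document', 'index', 'rank']
```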

Page 7:

Evaluation

Evaluation criteria

Effectiveness: favor returned ranked lists with more relevant documents at the top

Objective measures (for documents in a subset of a ranked list, if we know the ground truth):

Recall and Precision

Precision = (relevant docs retrieved) / (retrieved docs)

Recall = (relevant docs retrieved) / (relevant docs)

Mean average precision

Rank-based precision
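As a concrete illustration (an example added here, not from the slides), set-based precision and recall for a single query can be computed from the retrieved list and the set of judged relevant documents:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    retrieved: list of doc ids returned by the system
    relevant:  set of doc ids judged relevant
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)          # relevant docs retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved docs are relevant, out of 4 relevant overall.
print(precision_recall(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5", "d9"}))
# -> (0.6, 0.75)
```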

Page 8:

Evaluation

Pooling Strategy

Retrieve documents using multiple methods

Judge top n documents from each method

The whole judged set is the union of the top retrieved documents from all methods

Problem: the judged set of relevant documents may not be complete

It is possible to estimate the number of truly relevant documents by random sampling

Page 9:

Evaluation

Single value metrics

Mean average precision

Calculate precision at each relevant document; average over all precision values

11-point interpolated average precision

Calculate precision at standard recall points (e.g., 10%, 20%, ...); smooth the values; estimate the 0% recall point by interpolation; average the results

Rank-based precision

Calculate precision at top ranked documents (e.g., top 5, 10, 15, ...)

Desirable when users care more about top ranked documents
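A minimal sketch of two of these single-value metrics for one query follows (mean average precision is then just the mean of average precision over all queries); the doc ids are made up for illustration.

```python
def average_precision(ranked, relevant):
    """Average precision for one query: average the precision values measured
    at the rank of each retrieved relevant document (missed relevant docs
    contribute zero)."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)     # precision at this relevant doc
    return sum(precisions) / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k):
    """Rank-based precision: precision within the top k documents."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k

ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d4"}
print(average_precision(ranked, relevant))   # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
print(precision_at_k(ranked, relevant, 5))   # 3/5 = 0.6
```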

Page 10:

Evaluation

Sample Results

Page 11:

Retrieval Models: Outline

Retrieval Models

Exact-match retrieval method

Unranked Boolean retrieval method

Ranked Boolean retrieval method

Best-match retrieval method

Vector space retrieval method

Latent semantic indexing

Page 12:

Retrieval Models: Unranked Boolean

Unranked Boolean: Exact match method

Selection Model

Retrieve a document iff it matches the precise query

Often returns unranked documents (or documents in chronological order)

Operators

Logical operators: AND, OR, NOT

Proximity operators: #1(white house) (i.e., within one word distance, a phrase); #sen(Iraq weapon) (i.e., within a sentence)

String matching operators: wildcard (e.g., ind* for India and Indonesia)

Field operators: title(information and retrieval)...
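The logical operators map directly onto set operations over an inverted index’s posting lists, as in the sketch below (a toy example, not a full system; proximity operators such as #1 would additionally need positional postings, which are not shown).

```python
from collections import defaultdict

def build_index(docs):
    # Inverted index: term -> set of doc ids containing the term.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "white house statement on trade",
    2: "the white cat sat near the house",
    3: "house prices rise again",
}
index = build_index(docs)

# AND / OR / NOT are plain set operations on posting lists.
print(index["white"] & index["house"])     # white AND house      -> {1, 2}
print(index["white"] | index["prices"])    # white OR prices      -> {1, 2, 3}
print(index["house"] - index["white"])     # house AND NOT white  -> {3}
```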

Page 13:

Retrieval Models: Unranked Boolean

Advantages:

Work well if the user knows exactly what to retrieve

Predictable; easy to explain

Very efficient

Disadvantages:

It is difficult to design the query: high recall and low precision for a loose query; low recall and high precision for a strict query

Results are unordered; hard to find the useful ones

Users may be too optimistic about strict queries: a few very relevant documents are returned, but many more are missing

Page 14:

Retrieval Models: Ranked Boolean

Ranked Boolean: Exact match

Similar to unranked Boolean, but documents are ordered by some criterion

Reflect the importance of a document by its words

Query: (Thailand AND stock AND market)

Retrieve docs from the Wall Street Journal collection

Which word is more important? Many docs contain “stock” and “market”, but fewer contain “Thailand”; the rarer word may be more indicative

Term Frequency (TF): number of occurrences in the query/doc; a larger number means more important

Inverse Document Frequency (IDF): (total number of docs) / (number of docs containing the term); larger means more important

There are many variants of TF and IDF: e.g., consider document length

Page 15:

Retrieval Models: Ranked Boolean

Ranked Boolean: Calculate doc score

Term evidence: evidence from term i occurring in doc j: (tf_ij) or (tf_ij * idf_i)

AND weight: minimum of the argument weights

OR weight: maximum of the argument weights

Example: Query (Thailand AND stock AND market), with term evidence 0.2, 0.6, 0.4 for the three terms in a document:

AND: min(0.2, 0.6, 0.4) = 0.2

OR: max(0.2, 0.6, 0.4) = 0.6
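A small sketch of this scoring rule follows. The slide’s example uses pre-given evidence values (0.2, 0.6, 0.4); here the evidence is derived as tf*idf from made-up statistics, which is only one of the possible variants mentioned above.

```python
import math

def ranked_boolean_and(query_terms, doc_tf, df, n_docs):
    """Score a document for an AND query: the minimum tf*idf evidence
    among the query terms (an OR query would take the maximum instead)."""
    evidence = []
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        idf = math.log(n_docs / df[term]) if df.get(term) else 0.0
        evidence.append(tf * idf)
    return min(evidence)

# Toy statistics: term frequencies in one document, document frequencies,
# and a collection of 1000 documents (all numbers are made up).
doc_tf = {"thailand": 2, "stock": 10, "market": 8}
df = {"thailand": 20, "stock": 400, "market": 350}
print(ranked_boolean_and(["thailand", "stock", "market"], doc_tf, df, 1000))
```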

Page 16:

Retrieval Models: Ranked Boolean

Advantages:

All the advantages of the unranked Boolean algorithm: works well when the query is precise; predictable; efficient

Results in a ranked list (not just an unordered set); easier to browse and find the most relevant documents than with unranked Boolean

The ranking criterion is flexible: e.g., different variants of term evidence

Disadvantages:

Still an exact match (document selection) model: inverse correlation between recall and precision for strict vs. loose queries

Predictability makes users overestimate retrieval quality

Page 17:

Retrieval Models: Vector Space Model

Vector space model

Any text object can be represented by a term vector

Documents, queries, passages, sentences

A query can be seen as a short document

Similarity is determined by distance in the vector space

Example: cosine of the angle between two vectors

The SMART system

Developed at Cornell University: 1960-1999

Still quite popular

Page 18:

Retrieval Models: Vector Space Model

Vector representation

[Diagram: the query and documents D1, D2, D3 represented as vectors in a term space whose axes are “Java”, “Sun”, and “Starbucks”]

Page 19:

Retrieval Models: Vector Space Model

Given two vectors, a query q = (q_1, q_2, ..., q_n) and a document d_j = (d_{j,1}, d_{j,2}, ..., d_{j,n}), calculate their similarity.

Cosine similarity: the cosine of the angle between the two vectors:

sim(q, d_j) = cos(theta(q, d_j)) = (q_1*d_{j,1} + q_2*d_{j,2} + ... + q_n*d_{j,n}) / ( sqrt(q_1^2 + ... + q_n^2) * sqrt(d_{j,1}^2 + ... + d_{j,n}^2) )
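A minimal implementation of this formula over sparse term vectors (an illustration added here, with made-up weights) looks like this:

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    q_norm = math.sqrt(sum(w * w for w in q.values()))
    d_norm = math.sqrt(sum(w * w for w in d.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)

query = {"java": 1.0, "sun": 1.0}
doc = {"java": 2.0, "sun": 1.0, "starbucks": 3.0}
print(cosine_similarity(query, doc))   # ≈ 0.567
```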

Page 20:

Retrieval Models: Vector Space Model

Vector Coefficients

The coefficients (vector elements) represent term evidence / term importance

Each coefficient is derived from several components:

Document term weight: evidence for the term in the document/query

Collection term weight: importance of the term from observation of the collection

Length normalization: reduce document length bias

Naming convention for coefficients: q_k * d_{j,k} is written as a pair of triples, DCL.DCL (Document term weight, Collection term weight, Length normalization); the first triple describes the query term weight, the second the document term weight

Page 21:

Retrieval Models: Vector Space Model

Common vector weight components:

lnc.ltc: widely used term weight

“l”: log(tf)+1

“n”: no weight/normalization

“t”: log(N/df)

“c”: cosine normalization

With lnc weighting on the document side and ltc weighting on the query side, the similarity is:

sim(q, d_j) = sum_k [ (log(tf_{k,q}) + 1) * log(N/df_k) * (log(tf_{k,j}) + 1) ] / ( sqrt(sum_k [ (log(tf_{k,q}) + 1) * log(N/df_k) ]^2) * sqrt(sum_k [ log(tf_{k,j}) + 1 ]^2) )

where tf_{k,q} and tf_{k,j} are the frequencies of term k in the query and in document j, df_k is the number of documents containing term k, and N is the total number of documents.
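A small sketch of this weighting scheme follows; the term and document frequencies are made up, and the collection size of 1000 is only illustrative.

```python
import math

def ltc_weights(tf, df, n_docs):
    # "ltc" (query side): (log(tf)+1) * log(N/df), then cosine-normalize.
    w = {t: (math.log(f) + 1) * math.log(n_docs / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()} if norm else w

def lnc_weights(tf):
    # "lnc" (document side): log(tf)+1, no collection weight, cosine-normalize.
    w = {t: math.log(f) + 1 for t, f in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()} if norm else w

def score(query_tf, doc_tf, df, n_docs):
    q = ltc_weights(query_tf, df, n_docs)
    d = lnc_weights(doc_tf)
    return sum(w * d.get(t, 0.0) for t, w in q.items())   # dot product

# Made-up term frequencies and document frequencies for a 1000-doc collection.
df = {"thailand": 20, "stock": 400, "market": 350}
print(score({"thailand": 1, "stock": 1, "market": 1},
            {"thailand": 2, "stock": 10, "market": 8}, df, 1000))
```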

Page 22:

Retrieval Models: Vector Space Model

Advantages:

Best-match method; it does not need a precise query

Generates ranked lists; easy to explore the results

Simplicity: easy to implement

Effectiveness: often works well

Flexibility: can utilize different types of term weighting methods

Used in a wide range of IR tasks: retrieval, classification, summarization, content-based filtering...

Page 23:

Retrieval Models: Vector Space Model

Disadvantages:

Hard to choose the dimensions of the vector (the “basic concepts”); terms may not be the best choice

Assumes independence among terms

Heuristic choices of vector operations:

Choice of term weights

Choice of similarity function

Assumes a query and a document can be treated in the same way

Page 24:

Retrieval Models: Latent Semantic Indexing

Latent Semantic Indexing (LSI): explore the correlation between terms and documents

Two terms are correlated (may share similar semantic concepts) if they often co-occur

Two documents are correlated (share similar topics) if they have many common words

Latent Semantic Indexing (LSI): associate each term and document with a small number of semantic concepts/topics
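LSI is usually computed with a truncated singular value decomposition of the term-document matrix; the sketch below (an added illustration using numpy and a toy matrix) keeps k = 2 concepts.

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
terms = ["java", "sun", "coffee", "starbucks"]
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 3, 1],
              [0, 0, 1, 2]], dtype=float)

# LSI: truncated SVD  A ≈ U_k S_k V_k^T  with a small number k of concepts.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

doc_concepts = S_k @ Vt_k        # each column: a document in concept space
term_concepts = U_k              # each row: a term in concept space
print(np.round(doc_concepts, 2))
print(np.round(term_concepts, 2))
```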

Page 25:

Query Expansion: Outline

Query Expansion via Relevance Feedback

Relevance Feedback

Blind/Pseudo Relevance Feedback

Query Expansion via External Resources

Thesaurus

“Industrial Chemical Thesaurus”, “Medical Subject Headings” (MeSH)

Semantic network

WordNet

Page 26:

Query Expansion: Relevance Feedback in the Vector Space Model

Goal: move the new query closer to the relevant documents and farther away from the irrelevant documents

Approach: the new query is a weighted combination of the original query and the relevant and non-relevant document vectors

Rocchio formula: q' = q + (1/|R|) * sum over d_i in R of d_i - (1/|NR|) * sum over d_i in NR of d_i, where R is the set of relevant documents and NR the set of non-relevant documents
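A minimal sketch of this update over sparse vectors follows (the full Rocchio formula often carries tunable weights on the three components; the unit weights here match the version above, and the example vectors are made up).

```python
def rocchio(query, relevant_docs, nonrelevant_docs):
    """Rocchio update with unit weights.  All vectors are dicts term -> weight."""
    new_q = dict(query)
    terms = set(new_q)
    for d in relevant_docs + nonrelevant_docs:
        terms |= set(d)
    for t in terms:
        # Add the centroid of relevant docs, subtract the centroid of non-relevant docs.
        pos = sum(d.get(t, 0.0) for d in relevant_docs) / len(relevant_docs) if relevant_docs else 0.0
        neg = sum(d.get(t, 0.0) for d in nonrelevant_docs) / len(nonrelevant_docs) if nonrelevant_docs else 0.0
        new_q[t] = new_q.get(t, 0.0) + pos - neg
    return new_q

q = {"jaguar": 1.0}
rel = [{"jaguar": 2.0, "car": 3.0}]        # user marked this document relevant
nonrel = [{"jaguar": 1.0, "cat": 4.0}]     # and this one non-relevant
print(rocchio(q, rel, nonrel))
# e.g. {'jaguar': 2.0, 'car': 3.0, 'cat': -4.0} (key order may vary)
```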

Page 27:

Web Search

Spam Detection

Content spam; link spam; ...

Source size estimation

Capture-Recapture Model

What is the assumption?

How to calculate? (See the sketch after this list.)

Duplicate detection
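The slides leave the capture-recapture calculation as a review question. One standard answer (an assumption here, not stated on the slides) is the Lincoln-Petersen estimator: sample the source twice with independent queries and use the overlap between the two samples to estimate the source size.

```python
def capture_recapture(sample1, sample2):
    """Lincoln-Petersen estimate of source size from two independent samples
    of document ids: N ≈ |S1| * |S2| / |S1 ∩ S2|.
    Assumes both samples are independent, uniform draws from the source."""
    overlap = len(set(sample1) & set(sample2))
    if overlap == 0:
        raise ValueError("no overlap; cannot estimate size")
    return len(set(sample1)) * len(set(sample2)) / overlap

# Toy example: 40 and 50 distinct docs sampled, 10 appear in both samples.
s1 = [f"d{i}" for i in range(40)]
s2 = [f"d{i}" for i in range(30, 80)]
print(capture_recapture(s1, s2))   # -> 200.0
```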

Page 28:

Text Categorization (I)

Outline

Introduction to the task of text categorization

Manual vs. automatic text categorization

Text categorization applications

Evaluation of text categorization

K nearest neighbor text categorization method

Page 29:

Text Categorization

Automatic text categorization

Learn an algorithm to automatically assign predefined categories to text documents/objects

Automatic or semi-automatic

Procedures

Training: given a set of categories and labeled document examples, learn a method to map a document to the correct category (categories)

Testing: predict the category (categories) of a new document

Automatic or semi-automatic categorization can significantly reduce manual effort

Page 30:

K-Nearest Neighbor Classifier

Idea: find your language by what language your neighbors speak

[Diagram: the same point classified by its single nearest neighbor (k=1) and by a vote of its five nearest neighbors (k=5)]

Use the K nearest neighbors to vote

Page 31:

K Nearest Neighbor: Technical Elements

Document representation

Document distance measure: closer documents should have similar labels; neighbors speak the same language

Number of nearest neighbors (value of K)

Decision threshold

Page 32:

K Nearest Neighbor: Framework

Training data: D = {(x_i, y_i)}, where each x_i in R^M is a document vector and y_i in {0, 1} is its label

Test data: x in R^M

Document representation: each x_i uses tf.idf weighting for each dimension

The neighborhood D_k(x) is the set of the k training documents closest to x (D_k(x) is a subset of D)

Scoring function: y(x) = (1/k) * sum over x_i in D_k(x) of sim(x, x_i) * y_i

Classification: predict 1 if y(x) >= t, and 0 otherwise (t is the decision threshold)
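A compact sketch of this framework follows, using cosine similarity over sparse tf.idf-style vectors; the training examples, weights, and threshold are made up for illustration.

```python
import math

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_score(x, training, k):
    """y(x) = (1/k) * sum of sim(x, x_i) * y_i over the k nearest neighbors."""
    neighbors = sorted(training, key=lambda ex: cosine(x, ex[0]), reverse=True)[:k]
    return sum(cosine(x, xi) * yi for xi, yi in neighbors) / k

def knn_classify(x, training, k, t=0.5):
    # Decision threshold t on the score.
    return 1 if knn_score(x, training, k) >= t else 0

# Tiny training set of (tf.idf-style vector, label) pairs; label 1 = "sports".
training = [
    ({"game": 2.0, "score": 1.0}, 1),
    ({"match": 1.0, "goal": 2.0}, 1),
    ({"election": 2.0, "vote": 1.0}, 0),
    ({"budget": 1.0, "tax": 2.0}, 0),
]
x = {"game": 1.0, "goal": 1.0}
print(knn_classify(x, training, k=3, t=0.3))   # -> 1
```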

Page 33:

Filtering System

User profile: information needs are stable

The system should make a delivery decision on the fly when a document “arrives”

Content-Based Filtering

Page 34:

Content-Based Filtering

Information needs are stable

The system should make a delivery decision on the fly when a document “arrives”

User profiles are built from some initialization information and the historical feedback from a user (e.g., relevant concepts extracted from relevant documents)

The delivery decision is made by analyzing the content of a document

Page 35:

Three Main Problems in CBF

Three problems

Initialization: initialize the user profile with several keywords or very few examples

Delivery decision making: for a new document, decide yes/no based on the current user profile

Learning: utilize relevance feedback from users to update the user profile (feedback is available only on “recommended” documents)

Page 36:

Collaborative Filtering

Outline

Introduction to collaborative filtering

Main framework

Memory-based collaborative filtering approach

Model-based collaborative filtering approach

Aspect model & Two-way clustering model

Flexible mixture model

Decouple model

Unified filtering by combining content and collaborative filtering

Page 37:

Federated Search

Outline

Introduction to federated search

Main research problems

Resource Representation

Resource Selection

Results Merging

A unified utility maximization framework for federated search

Modeling search engine effectiveness

Page 38:

Components of a Federated Search System and Two Important Applications

[Diagram: Engine 1, Engine 2, ..., Engine N, connected through three components: (1) Resource Representation, (2) Resource Selection, (3) Results Merging]

Information source recommendation: recommend information sources for users’ text queries (e.g., completeplanet.com): steps 1 and 2

Federated document retrieval: also search the selected sources and merge the individual ranked lists into a single list: steps 1, 2 and 3

Federated Search

Page 39:

Clustering

Document clustering

– Motivations

– Document representations

– Success criteria

Clustering algorithms

– K-means

– Model-based clustering (EM clustering)
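A minimal sketch of the K-means algorithm listed above follows (standard K-means on dense vectors, not the EM clustering variant; the toy document vectors are made up).

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two dense vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20, seed=0):
    """Plain K-means on dense vectors (lists of floats)."""
    random.seed(seed)
    centroids = random.sample(vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each vector goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k), key=lambda c: dist2(v, centroids[c]))
            clusters[j].append(v)
        # Update step: recompute each centroid as the mean of its cluster.
        for j, cluster in enumerate(clusters):
            if cluster:
                centroids[j] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return centroids, clusters

docs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]  # toy 2-D doc vectors
centroids, clusters = kmeans(docs, k=2)
print(centroids)
print(clusters)
```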

Page 40:

Link Analysis

Outline

The characteristics of Web structure (small world)

Hub & Authority Algorithms

Authority Value, Hubness Value

PageRank Algorithm (PageRank value)

Relation to the computation of eigenvectors
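To make the eigenvector connection concrete, the sketch below computes PageRank by power iteration on a toy link graph: repeatedly applying the damped link matrix converges to its principal eigenvector, which is the PageRank vector. The graph, damping factor, and iteration count are only illustrative.

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank.  links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new_rank = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)     # spread rank over outgoing links
                for q in outs:
                    new_rank[q] += d * share
            else:                               # dangling page: spread uniformly
                for q in pages:
                    new_rank[q] += d * rank[p] / n
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(toy_web))
```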