
EDI 2009 - Advanced Search: What’s Under the Hood of your Favorite Search System?


Page 1: EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?

The Advanced E-Discovery Institute, November 12-13, 2009

What’s Under the Hood of your Favorite Search System?

Ellen Voorhees [email protected]

Page 2: EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?

So you want to build a search engine
- What is the collection to be searched?
- How will the content (text, other media) be represented? [indexing]
- How will the information need be represented? [query language]
- How will the respective representations be matched? [retrieval model]
- How effective is the search?


Page 3: EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?

Boolean Retrieval

The Model
- documents are represented by descriptors
  - descriptors were originally manually assigned concepts from a controlled vocabulary
  - modern implementations generally use the words in the text as descriptors
- the information need is represented by descriptors structured with Boolean operators
  - modern implementations include more operators than just AND, OR, NOT
- a match occurs if and only if the document satisfies the Boolean expression (see the sketch below)
  - “fuzzy match” systems use descriptor weights and relax the strict binary interpretation

Pros and cons
- good: transparency; it is clear exactly why a document was retrieved
- bad: little control over retrieved-set size; no ranking; searchers must learn the query language
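As a concrete illustration of strict Boolean matching, here is a minimal sketch in Python over a made-up three-document corpus. The documents, the tokenizer, and the hard-coded query are invented for the example and are not from the slides.

```python
# Minimal sketch of strict Boolean retrieval over a toy corpus.
# Documents, tokenizer, and query are illustrative assumptions.

docs = {
    "d1": "price fixing in the widget market",
    "d2": "quarterly widget sales report",
    "d3": "email about market share and pricing",
}

# Build an inverted index: descriptor (word) -> set of doc ids.
index = {}
for doc_id, text in docs.items():
    for word in set(text.lower().split()):
        index.setdefault(word, set()).add(doc_id)

all_ids = set(docs)

def AND(a, b): return a & b
def OR(a, b):  return a | b
def NOT(a):    return all_ids - a

def postings(word):
    """Docs containing the descriptor; empty set if it never occurs."""
    return index.get(word, set())

# Query: (widget AND market) OR (pricing AND NOT sales)
hits = OR(AND(postings("widget"), postings("market")),
          AND(postings("pricing"), NOT(postings("sales"))))
print(sorted(hits))   # a doc is returned iff it satisfies the expression
```

The set returned is exactly the documents satisfying the expression: no ranking, and the reason each document matched can be read directly off the query.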


Page 4: EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?

Vector Space Model

The Model
- documents are represented as vectors in an N-dimensional space, where N is the number of ‘terms’ in the document set
  - a term is usually a word (stem), but might be a phrase or a thesaurus class
  - terms are weighted based on the frequency and distribution of their occurrences
- the information need is natural-language text mapped into the same space
- matching is the similarity between the query and document vectors (see the sketch below)
  - example similarity: the cosine of the angle between the vectors
  - allows documents to be ranked by decreasing similarity

Pros and Cons
- good: less brittle than pure Boolean
- bad: less transparency; depending on the weights, a document with few query terms can be ranked higher than a document with many
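The cosine ranking described above can be sketched in a few lines of Python. The toy corpus, the query, and the simple tf-idf weighting are illustrative assumptions, not the weighting scheme of any particular system.

```python
# Minimal sketch of the vector space model: tf-idf weights, cosine similarity,
# documents ranked by decreasing similarity to the query.
import math
from collections import Counter

docs = ["widget market price fixing",
        "widget sales report",
        "market share and price discussion"]
query = "widget price"

vocab = sorted({w for d in docs for w in d.split()})
N = len(docs)

def idf(term):
    df = sum(1 for d in docs if term in d.split())
    return math.log(N / df) if df else 0.0

def vectorize(text):
    tf = Counter(text.split())
    return [tf[t] * idf(t) for t in vocab]   # one weight per vocabulary term

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm if norm else 0.0

qvec = vectorize(query)
ranked = sorted(((cosine(qvec, vectorize(d)), d) for d in docs), reverse=True)
for score, d in ranked:
    print(f"{score:.3f}  {d}")
```

Note how a short document sharing both query terms can outscore a longer one that contains only one of them, which is exactly the transparency trade-off noted above.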


Page 5: EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?

Vector Similarities

Document-Document similarity
- documents are similar to the extent that they contain the same terms
- document pairs with maximal similarity: detects duplicates
- document clustering
  - cluster hypothesis: “Closely associated documents tend to be relevant to the same requests.”
  - thus, do retrieval by returning whole clusters, since there is usually much more information in a doc-doc comparison than in a doc-query comparison

Term-Term similarity
- terms are similar to the extent that they occur in the same documents
- term clustering
  - query expansion
  - provides a bottom-up description of the document set


Example term-document matrix (rows D1-D6 are documents, columns T1-T4 are terms); the similarity sketch below operates on a matrix like this:

        T1   T2   T3   T4  …
D1       5    0   33    0  …
D2       0    0    8    0  …
D3       1    4    0    2  …
D4       0    3    0    4  …
D5       0    1    0    0  …
D6       5    3    2    0  …
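A minimal sketch of how both similarity views fall out of this one matrix: comparing rows gives document-document similarity, comparing columns gives term-term similarity. The use of raw counts and cosine similarity here is an assumption for illustration.

```python
# Document-document and term-term similarities from the term-document
# matrix shown above (rows D1-D6, columns T1-T4).
import numpy as np

A = np.array([[5, 0, 33, 0],
              [0, 0,  8, 0],
              [1, 4,  0, 2],
              [0, 3,  0, 4],
              [0, 1,  0, 0],
              [5, 3,  2, 0]], dtype=float)

def cosine_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # guard against all-zero rows
    U = M / norms
    return U @ U.T

doc_doc   = cosine_matrix(A)      # 6x6: rows compared with rows
term_term = cosine_matrix(A.T)    # 4x4: columns compared with columns

print(np.round(doc_doc, 2))       # near-1 entries suggest near-duplicates
print(np.round(term_term, 2))     # high entries suggest related terms
```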

Page 6: EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?

Further Matrix Manipulation: Latent Semantic Indexing
- Mathematically, the axes in a vector space are orthogonal to one another
  - so the vector space model technically assumes that words occur in documents independently of any other words (which is nonsense)
  - this vector space is very large, and very sparse
- Perform a singular value decomposition of the original matrix and select the first X eigenvectors as the new axes (see the sketch below)
  - X is chosen to be much smaller than the number of terms, producing a much smaller, denser vector space
  - project the document vectors into the new space
  - elements in the vectors no longer correspond to words
  - the new axes capture some (but which?) dependencies among the original word occurrences
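A minimal sketch of this truncated-SVD projection on the term-document matrix from the previous slide; the choice of X = 2 and the query vector are arbitrary illustrations, not values from the slides.

```python
# Latent semantic indexing sketch: truncate the SVD to X dimensions and
# project documents (and a query) into the reduced space.
import numpy as np

A = np.array([[5, 0, 33, 0],      # rows = documents, columns = terms
              [0, 0,  8, 0],
              [1, 4,  0, 2],
              [0, 3,  0, 4],
              [0, 1,  0, 0],
              [5, 3,  2, 0]], dtype=float)

X = 2                                   # number of latent dimensions kept
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Documents in the reduced space: one X-dimensional vector per document.
docs_reduced = U[:, :X] * s[:X]

# A query (a term-weight vector) is folded into the same space:
# q_reduced = q . V_k . inv(S_k)
query = np.array([1, 0, 1, 0], dtype=float)     # made-up query vector
query_reduced = query @ Vt[:X].T / s[:X]

print(docs_reduced.shape)     # (6, 2): elements no longer correspond to words
print(query_reduced)
```

Retrieval then proceeds as in the vector space model, but with similarities computed between these reduced vectors instead of the original word-based ones.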

Page 7: EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?

How Effective is the Search?


Evaluation for technology development: comparative evaluation using mean scores on test collections

Absolute evaluation of current e-discovery search:
- very little guidance in the IR literature: you don’t know what you don’t know!
- too much variability for test collections to predict tight bounds

[Figure: number relevant (y-axis) versus number retrieved (x-axis), with the line num_rel = num_ret and the level R marked]

Precision = (number relevant retrieved) / (number retrieved)

Recall = (number relevant retrieved) / (total relevant)

F = (2 × Precision × Recall) / (Precision + Recall)
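A minimal sketch computing these three measures for a single retrieved set; the relevant and retrieved document-id sets are made-up examples.

```python
# Precision, recall, and F for one retrieved set, per the formulas above.
relevant  = {"d01", "d03", "d04", "d07", "d09"}     # all relevant docs
retrieved = {"d01", "d02", "d03", "d05"}            # docs the search returned

rel_ret = len(relevant & retrieved)                 # number relevant retrieved

precision = rel_ret / len(retrieved) if retrieved else 0.0
recall    = rel_ret / len(relevant)  if relevant  else 0.0
f_measure = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)

print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
# precision=0.50 recall=0.40 F=0.44
```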