Georgetown University Law Center, Office of Continuing Legal Education
The Advanced E-Discovery Institute, November 12-13, 2009
What’s Under the Hood of your Favorite Search System?
Ellen Voorhees
So you want to build a search engine
  What is the collection to be searched?
  How will the content (text, other media) be represented? [indexing]
  How will the information need be represented? [query language]
  How will the respective representations be matched? [retrieval model]
  How effective is the search?
The Advanced E-Discovery Institute November 13, 2009
Boolean Retrieval
  The Model
    documents represented by descriptors
      descriptors originally manually assigned concepts from controlled vocabulary
      modern implementations generally use words in text as descriptors
    information need represented by descriptors structured with Boolean operators
      modern implementations include more operators than just AND, OR, NOT
    a match occurs if and only if doc satisfies Boolean expression
      “fuzzy match” systems use descriptor weights, relax strict binary interpretation
  Pros and cons
    good: transparency (clear exactly why a doc was retrieved)
    bad: little control over retrieved set size; no ranking; searchers must learn query language
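The strict Boolean model above can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's implementation: the nested-tuple query format, the function names `build_index` and `evaluate`, and the three toy documents are all assumptions made for the example. Descriptors are simply the lowercase words in the text.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word (descriptor) to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def evaluate(query, index, all_ids):
    """Evaluate a Boolean query and return the set of matching doc ids.

    A query is either a word (str) or a tuple (hypothetical format):
    ("AND", q1, q2), ("OR", q1, q2), ("NOT", q).
    """
    if isinstance(query, str):
        return set(index.get(query.lower(), set()))
    op = query[0]
    if op == "AND":
        return evaluate(query[1], index, all_ids) & evaluate(query[2], index, all_ids)
    if op == "OR":
        return evaluate(query[1], index, all_ids) | evaluate(query[2], index, all_ids)
    if op == "NOT":
        return all_ids - evaluate(query[1], index, all_ids)
    raise ValueError(f"unknown operator: {op}")

# Toy collection, invented for illustration.
docs = {
    "d1": "merger agreement draft",
    "d2": "merger press release",
    "d3": "lunch agreement",
}
index = build_index(docs)
all_ids = set(docs)

# merger AND NOT draft
hits = evaluate(("AND", "merger", ("NOT", "draft")), index, all_ids)
print(sorted(hits))  # ['d2']
```

The result is an unranked set: a doc either satisfies the expression or it does not, which is why it is clear exactly why each doc was retrieved and also why the searcher has little control over the size of the retrieved set.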
Vector Space Model
  The Model
    documents represented as vectors in N-dimensional space, where N is number of ‘terms’ in the document set
      term is usually a word (stem), but might be phrase or thesaurus class
      terms are weighted based on frequency and distribution of occurrences
    information need is natural language text mapped in same space
    matching is similarity between query and doc vectors
      example similarity: cosine of angle between vectors
      allows documents to be ranked by decreasing similarity
  Pros and Cons
    good: less brittle than pure Boolean
    bad: less transparency (depending on weights, a doc with few query terms can be ranked higher than a doc with many)
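As a rough sketch of the vector space model described above, the following Python ranks documents by cosine similarity to a natural language query. The weighting used here (term frequency times `log(N / df) + 1`) is one common choice picked for the example, not the weighting of any specific system; the function names and toy texts are likewise assumptions.

```python
import math
from collections import Counter

def tf_idf_vectors(texts):
    """Return one sparse {term: weight} vector per text, plus the idf table."""
    tokenized = [t.lower().split() for t in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, texts):
    """Return (score, doc_index) pairs sorted by decreasing similarity."""
    vectors, idf = tf_idf_vectors(texts)
    # Map the query into the same space; terms unseen in the collection are dropped.
    q_vec = {t: tf * idf[t] for t, tf in Counter(query.lower().split()).items() if t in idf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))

# Toy collection, invented for illustration.
texts = [
    "merger agreement signed",
    "merger merger rumor rumor rumor",
    "lunch menu",
]
ranking = rank("merger agreement", texts)
for score, i in ranking:
    print(f"{score:.3f}  {texts[i]}")
```

Unlike the Boolean case, every document gets a score, so partial matches are still returned and ordered. The price is that the ordering depends on the weights, which is harder to explain to a searcher.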
Vector Similarities
  Document-Document similarity
    docs are similar to the extent they contain the same terms
    finding doc pairs with maximal similarity detects duplicates
    document clustering
      cluster hypothesis: “Closely associated documents tend to be relevant to the same requests.”
      thus, do retrieval based on returning whole clusters, since usually much more information in doc-doc comparison than doc-query
  Term-Term similarity
    terms are similar to the extent they occur in the same documents
    term clustering
      query expansion
      provide bottom-up description of document set
      T1   T2   T3   T4   …
D1     5    0   33    0   …
D2     0    0    8    0   …
D3     1    4    0    2   …
D4     0    3    0    4   …
D5     0    1    0    0   …
D6     5    3    2    0   …
…
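To make the two kinds of similarity concrete, here is a small sketch that compares rows (documents) and columns (terms) of the term-document counts shown on the slide. Only the visible entries D1 to D6 and T1 to T4 are used; leaving out the elided rows and columns is an assumption, as is the choice of cosine as the similarity measure and the helper name `most_similar_pair`.

```python
import math
from itertools import combinations

# Visible corner of the term-document matrix from the slide.
# Rows are documents D1..D6, columns are terms T1..T4; the elided
# rows and columns are left out, which is an assumption.
DOC_LABELS = ["D1", "D2", "D3", "D4", "D5", "D6"]
TERM_LABELS = ["T1", "T2", "T3", "T4"]
COUNTS = [
    [5, 0, 33, 0],
    [0, 0, 8, 0],
    [1, 4, 0, 2],
    [0, 3, 0, 4],
    [0, 1, 0, 0],
    [5, 3, 2, 0],
]

def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_pair(vectors, labels):
    """Return (similarity, label_a, label_b) for the closest pair of vectors."""
    return max(
        (cosine(vectors[i], vectors[j]), labels[i], labels[j])
        for i, j in combinations(range(len(vectors)), 2)
    )

# Document-document similarity compares rows of the matrix.
doc_vectors = COUNTS
# Term-term similarity compares columns, i.e. rows of the transpose.
term_vectors = [list(col) for col in zip(*COUNTS)]

best_docs = most_similar_pair(doc_vectors, DOC_LABELS)
best_terms = most_similar_pair(term_vectors, TERM_LABELS)
print("closest documents:", best_docs)
print("closest terms:", best_terms)
```

On these counts the closest documents are D1 and D2, which share only T3 but use it heavily, and the closest terms are T2 and T4, which co-occur in D3 and D4. The same pairwise scores are the raw material for duplicate detection, document clustering, and term clustering.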
Further Matrix Manipulation: Latent Semantic Indexing
  Mathematically, the axes in a vector space are orthogonal to one another
    so, vector space model technically assumes words occur in documents independently of any other words (which is nonsense)
    this vector space is very large, and very sparse
  Perform singular value decomposition of original matrix and select first X singular vectors (those with the largest singular values) as new axes
    X chosen to be much smaller than number of terms, producing much smaller, denser vector space
    project document vectors into new space
    elements in vector no longer correspond to words
    new axes capture some (but which?) dependencies among original word occurrences
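The projection step can be illustrated with a dependency-free sketch. In practice one would call a numerical library's SVD routine; here the leading right singular vectors are approximated by power iteration with deflation on the term-term product matrix, a technique swapped in only to keep the example self-contained. The matrix is the visible part of the slide's term-document counts, X = 2 is an arbitrary choice, and all function names are hypothetical.

```python
import math

# Visible part of the slide's term-document counts (rows D1..D6, columns T1..T4).
COUNTS = [
    [5, 0, 33, 0],
    [0, 0, 8, 0],
    [1, 4, 0, 2],
    [0, 3, 0, 4],
    [0, 1, 0, 0],
    [5, 3, 2, 0],
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_singular_vectors(a, k, iterations=500):
    """Approximate the top-k right singular vectors and singular values of `a`.

    Uses power iteration on the term-term matrix (A transposed times A),
    removing the components along vectors already found (deflation).
    Assumes k does not exceed the rank of `a`.
    """
    columns = [list(col) for col in zip(*a)]
    gram = [[dot(ci, cj) for cj in columns] for ci in columns]
    axes, sigmas = [], []
    for found in range(k):
        # Arbitrary positive start vector; any non-degenerate start works.
        v = [1.0 / (i + 1 + found) for i in range(len(gram))]
        for _ in range(iterations):
            w = [dot(row, v) for row in gram]
            for axis in axes:
                overlap = dot(w, axis)
                w = [x - overlap * y for x, y in zip(w, axis)]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in gram])
        axes.append(v)
        sigmas.append(math.sqrt(max(eigenvalue, 0.0)))
    return axes, sigmas

def project(a, axes):
    """Express each document (row of `a`) in the coordinates of the new axes."""
    return [[dot(row, axis) for axis in axes] for row in a]

axes, sigmas = top_singular_vectors(COUNTS, k=2)
reduced = project(COUNTS, axes)
for label, coords in zip(["D1", "D2", "D3", "D4", "D5", "D6"], reduced):
    print(label, [round(c, 2) for c in coords])
```

Each document is now a 2-element vector instead of a 4-element one, and neither coordinate corresponds to a single term. On a real collection the reduction is from many thousands of terms down to a few hundred axes.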
How Effective is the Search?
  Evaluation for technology development: comparative evaluation using mean scores on test collections
  Absolute evaluation of current e-discovery search:
    very little guidance in IR literature: you don’t know what you don’t know!
    too much variability for test collections to predict tight bounds
[Figure: number relevant (vertical axis) plotted against number retrieved (horizontal axis); R is marked on both axes, with the line num_rel = num_ret]
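$$\text{Precision} = \frac{\text{number relevant retrieved}}{\text{number retrieved}}$$

$$\text{Recall} = \frac{\text{number relevant retrieved}}{\text{total relevant}}$$

$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$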
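The three measures are straightforward to compute once the relevant set is known, which is exactly the hard part in e-discovery. The sketch below assumes document ids are hashable and that the full relevant set is available; returning 0.0 when a denominator is zero is a convention chosen for this example, and the ids are invented.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F) for a retrieved set against a relevant set.

    A measure whose denominator is zero is reported as 0.0, which is a
    convention assumed here rather than part of the definitions.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = len(retrieved & relevant)
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = relevant_retrieved / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Invented example: 4 docs retrieved, 5 relevant in total, 2 in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]
p, r, f = precision_recall_f(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f} F={f:.3f}")
```

Here precision is 2/4 and recall is 2/5. Recall needs the total number of relevant documents in the collection, which in a live matter can only be estimated.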