DESCRIPTION
Judge Facciola said that for one to "opine that a certain search term or terms would be more likely to produce information than [another] is truly to go where angels fear to tread." This presentation provides the tools needed to feel secure in conducting and comparing searches in electronic discovery. It covers why you would want to search, what search does, and how concept search can help. It also dispels myths about concept search and provides tools for measuring the efficacy of search and review.
Search Angels: Have no fear (well, not much anyway)
Herbert L Roitblat, Ph.D.
OrcaTec LLC
© 2009 OrcaTec LLC
Search Angels
Whether search terms or “keywords” will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics. … Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread.
Judge John Facciola, United States v. O’Keefe, No. 06-249 (D.D.C. Feb. 18, 2008).
Why Search?
Responsive documents come to you
Searching and Filtering
?
Protect
Privilege
Consistency
Quality
Organize
Focus
Defend
Sedona Best Practice Commentary on Search
Courts have not only accepted, but in some cases have ordered, the use of keyword searching to define discovery parameters and resolve discovery disputes.
What is there to fear in search?
What to fear?
• Include the right documents, sources, etc.
• Get all of the data from the files
• Treat the data consistently
• Guess the right search terms
• Formulate queries correctly
• Correctly classify the documents that have been retrieved
Unicode
When is 2 ≠ 2?
２ (U+FF12) ≠ 2 (U+0032)
Characters are not always what they seem
When Unicode ≠ U n i c o d e
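The mismatch on this slide can be reproduced in a few lines. This is a minimal sketch, assuming Python's standard `unicodedata` module and assuming that NFKC compatibility folding is acceptable for the collection at hand; `fold` is a hypothetical helper name, not something from the presentation.

```python
import unicodedata


def fold(text: str) -> str:
    """Fold compatibility variants (such as fullwidth forms) and letter case."""
    return unicodedata.normalize("NFKC", text).casefold()


fullwidth_two = "\uff12"  # FULLWIDTH DIGIT TWO
ascii_two = "\u0032"      # DIGIT TWO

# The two characters look alike but are different code points.
print(fullwidth_two == ascii_two)              # False
print(fold(fullwidth_two) == fold(ascii_two))  # True

# The same applies to whole words typed in fullwidth letters.
fullwidth_word = "\uff35\uff4e\uff49\uff43\uff4f\uff44\uff45"
print(fold(fullwidth_word) == fold("Unicode"))  # True
```

Applying the same folding to both the indexed text and the query is one way to keep a keyword from silently missing documents that use visually identical characters.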
The Defendants are regrettably vague in their description of the seventy keywords used for the text-searchable ESI privilege review, how they were developed, how the search was conducted, and what quality controls were employed to assess their reliability and accuracy. --Judge Paul Grimm
Victor Stanley v. Creative Pipe, Inc.
Blair & Maron Study: 20% recall
Lawyers thought that they had virtually all relevant documents; they had only 20%
Failed to anticipate all wordings
Wording depends on point of view
Defense: “Unfortunate incident”
Plaintiff: “Disaster”
Blair and Maron, Communications of the ACM, 28, 1985, 289-299
Blair and Maron
“It is impossibly difficult for users to predict the exact words, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents.”
Strike
Bank
Words are ambiguous
How ambiguous? Look it up!
The companies have agreed to a brief delay in implementing their agreement.
37 14 39 17 54 62 20 8 84 8 7 9
7,788,584,618,680,320 possible interpretations
Each word disambiguates the others
# definitions
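The count of possible interpretations is the product of the per-word definition counts shown on the slide. A short sketch of that arithmetic follows; the variable names are illustrative, and the numbers are taken in the order they appear, without assuming which count belongs to which word.

```python
import math

# Number of dictionary definitions for each of the 12 words, as listed on the slide.
definition_counts = [37, 14, 39, 17, 54, 62, 20, 8, 84, 8, 7, 9]

# If every word's sense were chosen independently, the sentence would have
# this many candidate readings.
interpretations = math.prod(definition_counts)
print(f"{interpretations:,}")  # 7,788,584,618,680,320
```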
Overcoming that fear
So, first of all, let me assert my firm belief that the only thing we have to fear is fear itself -- nameless, unreasoning, unjustified terror which paralyzes needed efforts to convert retreat into advance.
Franklin D. Roosevelt
Judge Facciola
I bring to the parties’ attention recent scholarship that argues that concept searching, as opposed to keyword searching, is more efficient and more likely to produce the most comprehensive results.
Disability Rights Council of Greater Wash. v. Wash. Metro. Area Transit Auth., 2007 WL 1585452 (D.D.C. June 1, 2007)
See George L. Paul & Jason R. Baron, Information Inflation: Can the Legal System Adapt? 13 Rich. J.L. & Tech. 10 (2007).
Black Box?
Difficult to understand?
Difficult to use?
Difficult to defend?
Concept Search
How does it work?
inspect
forage
looking
look for
investigate
explore
SEARCH
lookup
seek
examine
hunt
hunting
Thesaurus
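A thesaurus-based expansion can be sketched as follows. This is only an illustration, assuming a hand-built dictionary populated with the terms from this slide; `THESAURUS`, `expand_query`, and `matches` are hypothetical names, and a production system would also need stemming and phrase handling.

```python
import re

# Hypothetical thesaurus entry built from the words on the slide.
THESAURUS = {
    "search": [
        "inspect", "forage", "looking", "look for", "investigate",
        "explore", "lookup", "seek", "examine", "hunt", "hunting",
    ],
}


def expand_query(term, thesaurus=THESAURUS):
    """Return the original term followed by its thesaurus entries, if any."""
    expanded = [term]
    for synonym in thesaurus.get(term.lower(), []):
        if synonym not in expanded:
            expanded.append(synonym)
    return expanded


def matches(document, terms):
    """True if any term occurs in the document as a whole word or phrase."""
    text = document.lower()
    return any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
        for term in terms
    )


doc = "We will investigate the claim."
print(matches(doc, ["search"]))              # False: the keyword alone misses it
print(matches(doc, expand_query("search")))  # True: the synonym "investigate" hits
```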
R5-D4 Astromech
IG-88 Assassin
R2-D2 Astromech
C-3PO Protocol
2-1B Surgical
T3-M4 Utility
Droids
Humanoid droids
Nonhumanoid droids
R-Series droids
Droid Taxonomy
http://starwars.wikia.com
Ontology
Attorney
Lawyer
Esq
Paralegal
Legal assistant
Law firm
Relations between terms: Synonym, Associated
Machine learning
Neural networks
Language modeling
Statistical pattern learning
Bayesian classifiers
Latent semantic indexing
Probabilistic classifiers
Cluster analysis
Concept Search Example
Search for: 化粧 (Makeup)
Context for 化粧:
肌 (Skin)
美 (Beauty)
美容 (Beauty of figure or form)
ケア (Care)
dhc (Brand of skin care products)
スキン (Skin)
キレイ (Pretty)
始め (Origin)
口コミ (Word of mouth)
通販 (Mail order)
Query as entered: 化粧
Query as searched: 化粧, 美容, 肌, 美, ケア, dhc, スキン, キレイ, 始め, 口コミ, 通販
Terms are weighted by language model
Expand query to give context
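The presentation does not describe the language model behind these weights. As a rough stand-in, the sketch below weights candidate context terms by how often they co-occur with the query term; the function name `expansion_weights` and the tiny English-gloss corpus are invented purely for illustration.

```python
from collections import Counter


def expansion_weights(query_term, documents, top_n=3):
    """Weight each co-occurring term by the share of query-term documents it appears in."""
    containing = [set(doc) for doc in documents if query_term in doc]
    if not containing:
        return {}
    counts = Counter(
        term for doc in containing for term in doc if term != query_term
    )
    return {term: n / len(containing) for term, n in counts.most_common(top_n)}


# Made-up, pre-tokenized documents (not data from the presentation).
docs = [
    ["makeup", "skin", "beauty"],
    ["makeup", "skin", "care"],
    ["makeup", "beauty", "mail-order"],
    ["skin", "care"],
]

weights = expansion_weights("makeup", docs, top_n=2)
print(weights)  # "skin" and "beauty" each appear in 2 of the 3 "makeup" documents
```

The weighted terms would then be added to the query, so that documents sharing the context rank higher even when they never use the original word.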
Design
Execute
Evaluate
Sound methodology
Evaluation: Human review (TREC08)
71.3% overall agreement
                                   Currently judged nonresponsive   Currently judged responsive
Previously judged nonresponsive    82%                              18%
Previously judged responsive       42%                              58%
http://trec.nist.gov/pubs/trec17/papers/LEGAL.OVERVIEW08.pdf
Measuring accuracy
Precision:
The proportion of gold vs. rock in the pan.
The proportion of the retrieved documents that are responsive.
Recall:
The proportion of the gold in the river that was found.
The proportion of the responsive documents that have been found.
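Both measures can be computed directly from two sets of document identifiers. A minimal sketch, assuming the truly responsive set is known (in practice it is estimated by review or sampling); the function name and the example IDs are made up.

```python
def precision_recall(retrieved, responsive):
    """Return (precision, recall) for sets of document IDs."""
    hits = retrieved & responsive
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(responsive) if responsive else 0.0
    return precision, recall


# Made-up example: 4 documents retrieved, 10 responsive in the collection,
# 2 documents in both sets.
retrieved = {1, 2, 3, 4}
responsive = {3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

precision, recall = precision_recall(retrieved, responsive)
print(precision)  # 0.5: half of the pan is gold
print(recall)     # 0.2: only a fifth of the gold in the river was found
```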
Chart: TREC08 Topic 103, precision and recall (scale 0 to 1) for Negotiated Boolean, Provider A, Ad Hoc Max, and Provider B
Very low recall, except for one service provider’s concept system
Common sense dictates that sampling and other quality assurance techniques must be employed to meet requirements of completeness. --Judge David A. Baker
The only prudent way to test the reliability of the keyword search is to perform some appropriate sampling of the documents determined to be privileged and those determined not to be. --Judge Paul Grimm
Elusion:
The proportion of gold among the rocks discarded from the pan.
The proportion of nonretrieved documents that were responsive.
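Elusion is measured on the discard pile rather than on the retrieved set. A minimal sketch with made-up document IDs; in practice the responsive documents among the nonretrieved ones are found by reviewing a random sample, not the whole pile.

```python
def elusion(not_retrieved, responsive):
    """Share of the nonretrieved documents that are actually responsive."""
    if not not_retrieved:
        return 0.0
    return len(not_retrieved & responsive) / len(not_retrieved)


# Made-up collection of 20 documents, 4 of them retrieved.
collection = set(range(1, 21))
retrieved = {1, 2, 3, 4}
responsive = {3, 4, 5, 6}

discarded = collection - retrieved
print(elusion(discarded, responsive))  # 0.125: 2 of the 16 discarded documents are responsive
```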
Simple evaluation: Sample & accept on zero
Accept on zero
Collect sample.
Accept batch only if there are 0 defects in the sample.
Sample size depends on the maximum acceptable probability of a defect and confidence level.
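The sample size for an accept-on-zero plan can be approximated as below. This sketch assumes a simple random sample from a batch large enough that draws are effectively independent: if the true defect rate were p, a sample of n would show zero defects with probability (1 - p)^n, so n is chosen to push that probability down to 1 minus the confidence level. The function name is hypothetical.

```python
import math


def accept_on_zero_sample_size(max_defect_rate, confidence):
    """Smallest n with (1 - max_defect_rate) ** n <= 1 - confidence."""
    if not (0 < max_defect_rate < 1 and 0 < confidence < 1):
        raise ValueError("rates must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_defect_rate))


# 95% confidence that no more than 5% of the batch is defective.
print(accept_on_zero_sample_size(0.05, 0.95))  # 59
# Tightening the acceptable defect rate to 1% raises the sample size.
print(accept_on_zero_sample_size(0.01, 0.95))  # 299
```

If even one defect turns up in the sample, the batch is not accepted and the search or review needs another look.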
Concept Search
Sound methodology
Empirical evidence
Sampling
Quality assurance
Search Angel: Fear not
You’ve got friends
Herbert L Roitblat, Ph.D.
OrcaTec LLC
www.orcatec.com
805-918-4612