Search Angels


DESCRIPTION

Judge Facciola said that for one to "opine that a certain search term or terms would be more likely to produce information than [another] is truly to go where angels fear to tread." This presentation provides the tools needed to feel secure in conducting and comparing searches in electronic discovery. It covers why you would want to search, what search does, and how concept search can help; it dispels myths about concept search and provides tools for measuring the efficacy of search and review.


Search Angels: Have no fear (well, not much anyway)

Herbert L Roitblat, Ph.D.

OrcaTec LLC

© 2009 OrcaTec LLC

Search Angels

Whether search terms or “keywords” will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics. … Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread.

Judge John Facciola, United States v. O’Keefe, No. 06-249 (D.D.C. Feb. 18, 2008).

Why Search?

Responsive documents come to you

Searching and Filtering


Protect

Privilege

Consistency

Quality

Organize

Focus

Defend

Sedona Best Practice Commentary on Search

Courts have not only accepted, but in some cases have ordered, the use of keyword searching to define discovery parameters and resolve discovery disputes.

What is there to fear in search?

What to fear?

• Include the right documents, sources, etc.

• Get all of the data from the files

• Treat the data consistently

• Guess the right search terms

• Formulate queries correctly

• Correctly classify the documents that have been retrieved

Unicode

When is 2 ≠ 2?

２ (U+FF12) ≠ 2 (U+0032)

Characters are not always what they seem

When Unicode ≠ Ｕｎｉｃｏｄｅ
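One common way to handle look-alike characters such as the fullwidth digit U+FF12 is to normalize text before indexing and before searching. The sketch below uses NFKC normalization from Python's standard library; the helper name `fold` is hypothetical, and whether compatibility folding is appropriate for a given collection is an assumption to check, not something the slides specify.

```python
import unicodedata

def fold(text: str) -> str:
    """Map compatibility variants (such as fullwidth forms) to their
    ordinary counterparts with NFKC, then case-fold for matching."""
    return unicodedata.normalize("NFKC", text).casefold()

fullwidth_two = "\uff12"   # FULLWIDTH DIGIT TWO
ascii_two = "\u0032"       # DIGIT TWO

# The raw code points differ, so a naive comparison fails.
print(fullwidth_two == ascii_two)              # False
# After normalization both sides are the same character.
print(fold(fullwidth_two) == fold(ascii_two))  # True

# The same applies to whole words typed in fullwidth letters.
fullwidth_word = "\uff35\uff4e\uff49\uff43\uff4f\uff44\uff45"
print(fold(fullwidth_word) == fold("Unicode"))  # True
```

Applying the same folding to both the documents and the query terms keeps the two sides consistent, which is the point of "treat the data consistently".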

The Defendants are regrettably vague in their description of the seventy keywords used for the text-searchable ESI privilege review, how they were developed, how the search was conducted, and what quality controls were employed to assess their reliability and accuracy. --Judge Paul Grimm

Victor Stanley v. Creative Pipe, Inc.

Blair & Maron Study: 20% recall

• Lawyers thought that they had virtually all relevant documents; they had only 20%

• Failed to anticipate all wordings

• Wording depends on point of view

• Defense: “Unfortunate incident”

• Plaintiff: “Disaster”

Blair and Maron, Communications of the ACM, 28, 1985, 289-299

Blair and Maron

“It is impossibly difficult for users to predict the exact words, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents.”

Strike

Bank

Words are ambiguous

How ambiguous? Look it up!

Number of dictionary definitions for each word:

The (37) companies (14) have (39) agreed (17) to (54) a (62) brief (20) delay (8) in (84) implementing (8) their (7) agreement (9)

7,788,584,618,680,320 possible interpretations

Each word disambiguates the others
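The count of possible interpretations is the product of the per-word definition counts, assuming each word's sense could be chosen independently of the others. A short check of that arithmetic, using the twelve counts from the slide:

```python
import math

# Definition counts for the twelve words of the example sentence,
# in the order they appear on the slide.
definition_counts = [37, 14, 39, 17, 54, 62, 20, 8, 84, 8, 7, 9]

# If every sense could combine with every other, the number of
# readings is the product of the counts.
interpretations = math.prod(definition_counts)
print(f"{interpretations:,}")  # 7,788,584,618,680,320
```

In practice the words constrain one another, which is why readers settle on one interpretation so easily.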

Overcoming that fear

So, first of all, let me assert my firm belief that the only thing we have to fear is fear itself -- nameless, unreasoning, unjustified terror which paralyzes needed efforts to convert retreat into advance.

Franklin D. Roosevelt

Judge Facciola

I bring to the parties’ attention recent scholarship that argues that concept searching, as opposed to keyword searching, is more efficient and more likely to produce the most comprehensive results.

Disability Rights Council of Greater Wash. v. Wash. Metro. Area Transit Auth., 2007 WL 1585452 (D.D.C. June 1, 2007)

See George L. Paul & Jason R. Baron, Information Inflation: Can the Legal System Adapt? 13 Rich. J.L. & Tech. 10 (2007).

Black Box?

Difficult to understand?

Difficult to use?

Difficult to defend?

Concept Search

How does it work?

Thesaurus

SEARCH: inspect, forage, looking, look for, investigate, explore, lookup, seek, examine, hunt, hunting

R5-D4 Astromech

IG-88 Assassin

R2-D2 Astromech

C-3PO Protocol

2-1B Surgical

T3-M4 Utility

Droids

Humanoid droids

Nonhumanoid droids

R-Series droids

Droid Taxonomy

http://starwars.wikia.com

Ontology

Ontology diagram: the terms Attorney, Lawyer, Esq, Paralegal, Legal assistant, and Law firm, connected by links labeled “Synonym” and “Associated”.

Machine learning

Neural networks

Language modeling

Statistical pattern learning

Bayesian classifiers

Latent semantic indexing

Probabilistic classifiers

Cluster analysis

Concept Search Example

Search for: 化粧 (makeup)

Context for 化粧:

• 肌 (skin)

• 美 (beauty)

• 美容 (beauty of figure or form)

• ケア (care)

• dhc (brand of skin care products)

• スキン (skin)

• キレイ (pretty)

• 始め (origin)

• 口コミ (word of mouth)

• 通販 (mail order)

Query as entered: 化粧

Query as searched: 化粧, 美容, 肌, 美, ケア, dhc, スキン, キレイ, 始め, 口コミ, 通販

Terms are weighted by language model

Expand query to give context
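The idea of expanding a query with weighted context terms can be illustrated with a toy example. The sketch below is not the language model the slides refer to; it substitutes simple document co-occurrence counts as a stand-in, and the function name `expand_query`, the `top_k` parameter, and the sample documents are all hypothetical.

```python
from collections import Counter

def expand_query(query: str, documents: list[list[str]], top_k: int = 3) -> dict[str, float]:
    """Return the query term plus the top_k terms that share documents
    with it, each weighted by its share of the co-occurrence counts."""
    cooccur = Counter()
    for tokens in documents:
        if query in tokens:
            # Count each co-occurring term once per document.
            cooccur.update(t for t in set(tokens) if t != query)
    total = sum(cooccur.values())
    expanded = {query: 1.0}  # the term as entered keeps full weight
    for term, count in cooccur.most_common(top_k):
        expanded[term] = count / total
    return expanded

# Tiny tokenized collection, English stand-ins for the slide's terms.
docs = [
    ["makeup", "skin", "care"],
    ["makeup", "skin", "beauty"],
    ["makeup", "mail", "order"],
    ["bank", "loan"],
]

weights = expand_query("makeup", docs, top_k=1)
print(weights)  # "makeup" at weight 1.0 plus "skin", its most frequent companion
```

A real concept-search system would learn these associations from the collection at scale; the point here is only that the searched query carries more context than the single term that was typed.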

Design

Execute

Evaluate

Sound methodology

Evaluation: Human review (TREC08)

71.3% overall agreement

                                   Currently judged    Currently judged
                                   nonresponsive       responsive
Previously judged nonresponsive    82%                 18%
Previously judged responsive       42%                 58%

http://trec.nist.gov/pubs/trec17/papers/LEGAL.OVERVIEW08.pdf

Measuring accuracy

Precision:

The proportion of gold vs. rock in the pan.

The proportion of the retrieved documents that are responsive.

Recall:

The proportion of the gold in the river that was found.

The proportion of the responsive documents that have been found.
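Both measures follow directly from two sets: the documents that were retrieved and the documents that are actually responsive. A minimal sketch, with made-up document identifiers and a hypothetical helper name:

```python
def precision_recall(retrieved: set[str], responsive: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved documents that are responsive.
    Recall: share of responsive documents that were retrieved."""
    hits = len(retrieved & responsive)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(responsive) if responsive else 0.0
    return precision, recall

# Illustrative data: 4 documents retrieved, 5 truly responsive, 2 in common.
retrieved = {"d1", "d2", "d3", "d4"}
responsive = {"d1", "d2", "d5", "d6", "d7"}

precision, recall = precision_recall(retrieved, responsive)
print(precision)  # 2 of the 4 retrieved are responsive
print(recall)     # 2 of the 5 responsive were retrieved
```

In a real matter the full responsive set is unknown, so recall has to be estimated from a reviewed sample rather than computed exactly.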

Figure: TREC08 Topic 103. Precision and recall (vertical axis, 0 to 1) for Negotiated Boolean, Provider A, Ad Hoc Max, and Provider B.

Very low recall, except for one service provider’s concept system

Common sense dictates that sampling and other quality assurance techniques must be employed to meet requirements of completeness. --Judge David A. Baker

The only prudent way to test the reliability of the keyword search is to perform some appropriate sampling of the documents determined to be privileged and those determined not to be. --Judge Paul Grimm

Elusion:

The proportion of gold among the rocks discarded from the pan.

The proportion of nonretrieved documents that were responsive.
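Elusion can be computed the same way as precision, but over the discarded pile instead of the retrieved one. The sketch below uses a small made-up collection; the helper name and identifiers are hypothetical.

```python
def elusion(not_retrieved: set[str], responsive: set[str]) -> float:
    """Share of the non-retrieved (discarded) documents that are responsive."""
    if not not_retrieved:
        return 0.0
    return len(not_retrieved & responsive) / len(not_retrieved)

# Illustrative collection of ten documents, d1 through d10.
collection = {f"d{i}" for i in range(1, 11)}
retrieved = {"d1", "d2", "d3", "d4"}
responsive = {"d1", "d2", "d5"}

discarded = collection - retrieved      # the six documents left behind
rate = elusion(discarded, responsive)   # one of the six (d5) is responsive
print(rate)
```

In practice the discarded set is far too large to review in full, so elusion is estimated by reviewing a random sample drawn from it.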

Simple evaluation: Sample & accept on zero

Accept on zero

Collect sample.

Accept batch only if there are 0 defects in the sample.

Sample size depends on the maximum acceptable probability of a defect and confidence level.
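The slides do not give a formula for the sample size. One standard approach, assumed here, treats each sampled document as an independent draw from a large batch: if the true defect rate were p, the chance of seeing zero defects in n draws is (1 - p)^n, so n is chosen to make that chance no larger than 1 minus the confidence level. The function name is hypothetical.

```python
import math

def accept_on_zero_sample_size(max_defect_rate: float, confidence: float) -> int:
    """Smallest n such that a batch with defect rate at or above
    max_defect_rate would show zero defects in n random draws with
    probability at most 1 - confidence (large-batch approximation)."""
    if not (0 < max_defect_rate < 1 and 0 < confidence < 1):
        raise ValueError("rate and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_defect_rate))

# Tolerating at most 1% missed responsive documents, at 95% confidence.
print(accept_on_zero_sample_size(0.01, 0.95))  # 299
# Tolerating at most 5%, at 95% confidence.
print(accept_on_zero_sample_size(0.05, 0.95))  # 59
```

If even one defect turns up in the sample, the batch is rejected and the search or review is revisited. For small batches a hypergeometric calculation would give a slightly smaller sample.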

Concept Search

Sound methodology

Empirical evidence

Sampling

Quality assurance

Search Angel: Fear not

You’ve got friends

Herbert L Roitblat, Ph.D.

OrcaTec LLC

www.orcatec.com

herb@orcatec.com

805-918-4612
