Thinking Outside the Boolean Box: Metastrategies for Conducting Legally Defensible Searches in an Expanding ESI Universe ICAIL 2007, Palo Alto, California Supporting Search and Sensemaking for Electronically Stored Information in Discovery Proceedings (“DESI Workshop”) June 4, 2007 Jason R. Baron Director of Litigation National Archives and Records Administration


Thinking Outside the Boolean Box: Metastrategies for Conducting Legally Defensible Searches in an Expanding ESI Universe

ICAIL 2007, Palo Alto, California
Supporting Search and Sensemaking for Electronically Stored Information in Discovery Proceedings (“DESI Workshop”)

June 4, 2007

Jason R. Baron
Director of Litigation
National Archives and Records Administration


Definition of “ESI”

-A new legal term of art: “electronically stored information” to supplement the older term “documents”:

-“The wide variety of computer systems currently in use, and the rapidity of technological change, counsel against a limiting or precise definition of ESI … A common example [is] email … The rule … [is intended] to encompass future developments in computer technology.” --Advisory Committee Notes to Rule 34(a), 2006 Amendments to the Federal Rules of Civil Procedure


Information Inflation: The Expanding ESI Universe . . . .


Snapshot of 2007 ESI Heterogeneity

E-mail, integrated with voice mail & VOIP, word processing (including documents not in English), spreadsheets, dynamic databases, instant messaging, Web pages including intraweb sites, blogs, wikis, and RSS feeds, backup tapes, hard drives, removable media, flash drives, new storage devices, remote PDAs, and audit logs and metadata of all types.


The Myth of Search & Retrieval (bedtime stories for lawyers)

When lawyers request production of “all” relevant documents (and now ESI), they believe (or pretend to believe) that all or substantially all documents and ESI will in fact be retrieved by existing manual or automated methods of search.

Corollary: in conducting automated searches, lawyers (and judges) operate under the assumption that the use of “keywords” alone will powerfully and reliably produce all or substantially all relevant documents from a large document collection.


The “Hype” on Search & Retrieval

Claims in the legal tech sector that a very high rate of “recall” (i.e., finding all relevant documents) is easily obtainable provided one uses a particular software product or service.


The Reality of Search & Retrieval

+ Past research (Blair & Maron, 1985) has shown a gap or disconnect between lawyers’ perceptions of their ability to ferret out relevant documents, and their actual ability to do so:
  --in a 40,000 document case (350,000 pages), lawyers estimated that a manual search would find 75% of relevant documents, when in fact the research showed only 20% or so had been found.


Why is IR hard for lawyers?

+ Lawyers not technically grounded
+ Traditional lawyering doesn’t emphasize front-end “process” issues that would help simplify or focus search problem in particular contexts
+ The reality is that huge sources of heterogeneous ESI exist, presenting an array of technical issues
+ Deadlines and resource constraints
+ Failure to employ best strategic practices


Sedona Guideline 11

A responding party may satisfy its good faith obligation to preserve and produce potentially responsive electronic data and documents by using electronic tools and processes, such as data sampling, searching, or the use of selection criteria, to identify data most likely to contain responsive information.

www.thesedonaconference.org


Case Study: U.S. v. Philip Morris – Overall Discovery

1,726 Requests to Produce propounded by tobacco companies on U.S. (30 federal agencies, including NARA) for tobacco related records

Along with paper records, email records were made subject to discovery

32 million Clinton era email records – government had burden of searching


ESI Universe of White House email pertaining to tobacco lawsuit


Case Study: U.S. v. Philip Morris (con’t) – Employing a limited feedback loop

+ Original set of 12 keywords searched unilaterally
+ After informal negotiations, additional terms explored
+ Sampling against database to find “noisy” terms generating too many false positives (Marlboro, PMI, TI, etc.)
+ Report back and consensus on what additional terms would be in search protocol.
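
The sampling step in this feedback loop can be sketched in a few lines of Python. This is an illustration only, not the procedure actually used in the case: the toy collection, the `is_responsive` stand-in for a human reviewer, and the sample size are all assumptions made for the example.

```python
import random

def estimate_term_precision(docs, term, is_responsive, sample_size, seed=0):
    """Review a random sample of a term's hits and estimate what fraction is responsive."""
    hits = [d for d in docs if term.lower() in d.lower()]
    if not hits:
        return 0, None
    rng = random.Random(seed)
    sample = rng.sample(hits, min(sample_size, len(hits)))
    responsive = sum(1 for d in sample if is_responsive(d))
    return len(hits), responsive / len(sample)

# Hypothetical toy collection: "marlboro" hits both the brand and a place name.
docs = (
    ["Meeting in Upper Marlboro, Maryland on Tuesday"] * 8
    + ["Marlboro advertising and youth smoking rates"] * 2
    + ["Budget memo with no tobacco terms"] * 10
)

# Stand-in for a reviewer's judgment (an assumption for this sketch).
def is_responsive(doc):
    return "smoking" in doc.lower()

hit_count, precision = estimate_term_precision(docs, "marlboro", is_responsive, sample_size=10)
print(hit_count, precision)  # 10 0.2
```

A term whose sampled precision is this low is a candidate for an exclusion (for example, an AND NOT clause) before it goes into the agreed protocol.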


Example of Boolean search string from U.S. v. Philip Morris

(((master settlement agreement OR msa) AND NOT (medical savings account OR metropolitan standard area)) OR s. 1415 OR (ets AND NOT educational testing service) OR (liggett AND NOT sharon a. liggett) OR atco OR lorillard OR (pmi AND NOT presidential management intern) OR pm usa OR rjr OR (b&w AND NOT photo*) OR phillip morris OR batco OR ftc test method OR star scientific OR vector group OR joe camel OR (marlboro AND NOT upper marlboro)) AND NOT (tobacco* OR cigarette* OR smoking OR tar OR nicotine OR smokeless OR synar amendment OR philip morris OR r.j. reynolds OR ("brown and williamson") OR ("brown & williamson") OR bat industries OR liggett group)
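
To show how the AND NOT exclusions in this string behave, here is a minimal Python sketch of three of its clauses. It assumes whole-word, case-insensitive matching and document-level semantics for AND NOT; the slides do not say how the actual search engine evaluated the string, and the function names and sample texts are hypothetical. The trailing AND NOT group of the full string is not modeled.

```python
import re

def contains(text, phrase):
    """True if the phrase occurs in the text as whole words, ignoring case."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def matches_fragment(text):
    """Three clauses from the string above, each a noisy term with its exclusion."""
    return (
        (contains(text, "marlboro") and not contains(text, "upper marlboro"))
        or (contains(text, "pmi") and not contains(text, "presidential management intern"))
        or (contains(text, "ets") and not contains(text, "educational testing service"))
    )

samples = [
    "Marlboro ad placement strategy",
    "Directions to the Upper Marlboro courthouse",
    "PMI applications for the Presidential Management Intern program",
    "Marlboro sales figures for stores near Upper Marlboro",
]
for s in samples:
    print(matches_fragment(s), "-", s)
```

Under these assumptions the last sample is rejected even though it discusses the brand: an exclusion that removes false positives can also create false negatives.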


U.S. v. Philip Morris E-mail Winnowing Process

20 million     email records
200,000        hits based on keyword terms used (1%)
100,000        relevant emails
80,000         produced to opposing party
20,000         placed on privilege logs

A PROBLEM: only a handful entered as exhibits at trial
A BIGGER PROBLEM: the 1% figure does not scale
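
A rough calculation shows why a 1% hit rate does not scale. Only the 20 million and 1% figures come from the slide; the larger collection sizes, the review speed of 50 documents per hour, and the 2,000-hour working year are hypothetical values chosen for illustration.

```python
def keyword_hits(collection_size, hit_percent=1):
    """Documents returned if keyword terms hit a fixed percentage of the collection."""
    return collection_size * hit_percent // 100

def review_person_years(num_docs, docs_per_hour, hours_per_year=2000):
    """Person-years needed to review num_docs at an assumed review speed."""
    return num_docs / docs_per_hour / hours_per_year

# 20 million is from the slide; the two larger sizes are hypothetical.
for size in (20_000_000, 100_000_000, 1_000_000_000):
    hits = keyword_hits(size)
    years = review_person_years(hits, docs_per_hour=50)
    print(f"{size:>13,} records -> {hits:>10,} hits -> {years:,.0f} person-years of review")
```

Under these assumptions the same 1% rate that yields 200,000 hits today yields 10 million hits on a billion-record collection, all of which would still need review.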


Litigation Targets

+ Defining “relevance”

+ Maximizing # responsive docs

+ Minimizing retrieval “noise” or false positives (non-responsive docs)


FINDING RESPONSIVE DOCUMENTS IN A LARGE DATA SET: FOUR LOGICAL CATEGORIES

[Figure: the document set divided into four categories]
+ Relevant and Retrieved
+ Not Relevant and Retrieved (FALSE POSITIVES)
+ Relevant and Not Retrieved (FALSE NEGATIVES)
+ Not Relevant and Not Retrieved


FINDING RESPONSIVE DOCUMENTS IN A LARGE DATA SET: THE REALITY OF LARGE SCALE DISCOVERY

[Figure: the document set, with regions labeled RELEVANT DOCUMENTS and “HITS” ON NONRELEVANT DOCUMENTS; the remainder is filled with question marks and labeled “The Great Unknown”.]


Measures of Information Retrieval

Recall = (# of responsive docs retrieved) / (# of responsive docs in collection)


Measures of Information Retrieval

Precision = (# of responsive docs retrieved) / (# of docs retrieved)
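
The two measures can be computed directly from sets of document identifiers. The sketch below uses a made-up ten-document collection; the identifiers and set contents are hypothetical. In real discovery the full set of responsive documents is not known, which is why recall usually has to be estimated.

```python
def recall(retrieved, responsive):
    """Share of all responsive documents that the search retrieved."""
    return len(retrieved & responsive) / len(responsive)

def precision(retrieved, responsive):
    """Share of retrieved documents that are responsive."""
    return len(retrieved & responsive) / len(retrieved)

collection = set(range(1, 11))   # hypothetical document IDs 1..10
responsive = {1, 2, 3, 4, 5}     # what a perfect review would mark responsive
retrieved = {1, 2, 6, 7}         # what the search returned

false_positives = retrieved - responsive   # not relevant and retrieved
false_negatives = responsive - retrieved   # relevant and not retrieved

print(recall(retrieved, responsive))     # 0.4
print(precision(retrieved, responsive))  # 0.5
print(sorted(false_positives), sorted(false_negatives))  # [6, 7] [3, 4, 5]
```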


THE RECALL-PRECISION TRADEOFF

[Figure: plot with axes RECALL and PRECISION, each running from 0 to 100%.]


Three Questions

(1) How can one go about improving rates of recall and precision (so as to find a greater number of relevant documents, while spending less overall time, cost, etc., sifting through noise)?
(2) What alternatives to keyword searching exist?
(3) Are there ways in which to benchmark alternative search methodologies so as to evaluate their efficacy?


Beyond Reliance on Keywords Alone: Alternative Search Methods

+ Greater Use Made of Boolean Strings
+ Fuzzy Search Models
+ Probabilistic models (Bayesian)
+ Statistical methods (clustering)
+ Machine learning approaches to semantic representation
+ Categorization tools: taxonomies and ontologies
+ Social network analysis
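
As one small illustration of the fuzzy search idea from the list above, the sketch below uses the similarity matching in Python's standard `difflib` module to catch spelling variants that an exact keyword would miss. It is not a description of any particular product: the vocabulary, the helper name `fuzzy_variants`, and the 0.8 cutoff are assumptions, and string similarity is only one of several notions of "fuzzy" matching.

```python
from difflib import get_close_matches

# Hypothetical index vocabulary containing a misspelling and correct spellings.
vocabulary = ["phillip", "morris", "marlboro", "lorillard", "liggett", "tobacco"]

def fuzzy_variants(term, vocabulary, cutoff=0.8):
    """Return vocabulary words whose spelling is close to the query term."""
    return get_close_matches(term, vocabulary, n=5, cutoff=cutoff)

print(fuzzy_variants("philip", vocabulary))
print(fuzzy_variants("lorilard", vocabulary))
```

An exact search for "philip" would miss documents indexed under "phillip"; a similarity cutoff trades that false negative for the risk of pulling in unrelated near-spellings.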


What is TREC?

Conference series co-sponsored by the National Institute of Standards and Technology (NIST) and the Advanced Research and Development Activity (ARDA) of the Department of Defense

Designed to promote research into the science of information retrieval

First TREC conference was in 1992

15th Conference held November 15-17, 2006 in Gaithersburg, Maryland (NIST headquarters)


TREC 2006 Legal Track

The TREC 2006 Legal Track was designed to evaluate the effectiveness of search technologies in a real-world legal context

First of a kind study using nonproprietary data since Blair/Maron research in 1985

5 hypothetical complaints and 43 “requests to produce” drafted by Sedona Conference members

“Boolean negotiations” conducted as a baseline for search efforts

Documents to be searched were drawn from a publicly available 7 million document tobacco litigation Master Settlement Agreement database

6 Participating teams submitted 33 runs. Teams consisted of: Hummingbird, National University of Singapore, Sabir Research, University of Maryland, University of Missouri-Kansas City, and York University


<?xml version="1.0" encoding="ISO-8859-1" ?>
<TrecLegalProductionRequest>
  <ProductionRequest>
    <RequestNumber>6</RequestNumber>
    <RequestText>All documents discussing, referencing, or relating to company guidelines or internal approval for placement of tobacco products, logos, or signage, in television programs (network or cable), where the documents expressly refer to the programs being watched by children.</RequestText>
    <BooleanQuery>
      <FinalQuery>(guide! OR strateg! OR approval) AND (place! OR promot! OR logos OR sign! OR merchandise) AND (TV OR "T.V." OR televis! OR cable OR network) AND ((watch! OR view!) W/5 (child! OR teen! OR juvenile OR kid! OR adolescent!))</FinalQuery>
      <NegotiationHistory>

TREC 2006 LEGAL TRACK XML ENCODED TOPICS WITH NEGOTIATION HISTORY – ONE EXAMPLE
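
The last clause of the FinalQuery combines two operators worth spelling out. The sketch below reads `!` as truncation (match any word beginning with the stem) and `W/5` as "within five words, in either order", as in common legal search syntax. Those readings, the simple tokenizer, and the sample sentences are assumptions for illustration; the slides do not specify how each participating system interpreted the operators.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_matches(pattern, word):
    """'child!' matches any word starting with 'child'; other patterns must match exactly."""
    if pattern.endswith("!"):
        return word.startswith(pattern[:-1])
    return word == pattern

def positions(patterns, words):
    """Indexes of the words that match any pattern in the group."""
    return [i for i, w in enumerate(words) if any(term_matches(p, w) for p in patterns)]

def within(left, right, words, k):
    """True if a word from the left group is at most k positions from one in the right group."""
    return any(abs(i - j) <= k for i in positions(left, words) for j in positions(right, words))

def watched_by_children(text):
    """(watch! OR view!) W/5 (child! OR teen! OR juvenile OR kid! OR adolescent!)"""
    return within(["watch!", "view!"],
                  ["child!", "teen!", "juvenile", "kid!", "adolescent!"],
                  tokenize(text), 5)

print(watched_by_children("Programs viewed mostly by young children on network television"))
print(watched_by_children("Children attended the fair. Later that week, after a long "
                          "board meeting, the executives watched a film"))
```

The first sentence satisfies the clause (four words separate "viewed" and "children"); the second does not, because the two terms are fourteen words apart.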


Beyond Boolean: getting at the “dark matter”

(i.e., relevant documents not found by keyword searches alone)


[Figure: bar chart. X-axis: Topic. Y-axis: Known Relevant Documents (0 to 500). Series: Boolean, Expert Searcher, Ranked Only.]

TREC Legal Track 2006: Percentage of Unique Documents By Topic Found By Boolean, Expert Searcher, and Other Combined Methods of Search


TREC Legal Track 2006: Sort by Increasing Percentage of Unique Documents Per Topic Found By Combined Methods Other Than A Baseline Boolean Search

[Figure: bar chart. X-axis: Topic. Y-axis: Percent (0% to 100%). Data labels in increasing order: 0%, 0%, 0%, 0%, 0%, 2%, 3%, 6%, 6%, 7%, 7%, 11%, 14%, 16%, 17%, 17%, 18%, 20%, 22%, 24%, 24%, 25%, 28%, 32%, 32%, 36%, 38%, 44%, 46%, 51%, 53%, 54%, 62%, 64%, 73%, 76%, 76%, 78%, 100%.]


Boolean vs. Hypothetical Alternative Search Method

[Figure: line plot. X-axis: INCREASING EFFORT (time, resources expended, etc.). Y-axis: SUCCESS (in retrieving relevant docs). Curves: Boolean Run and Alternative Search Run, with points labeled A, B, C, D, x, and y.]


Strategic Challenges

Convincing lawyers and judges that automated searches are not just desirable but necessary in response to large e-discovery demands.


Challenges (con’t)

Having all parties and adjudicators understand that the use of automated methods does not guarantee all responsive documents will be identified in a large data collection.


Challenges (con’t)

Designing an overall review process which maximizes the potential to find responsive documents in a large data collection (no matter which search tool is used), and using sampling and other analytic techniques to test hypotheses early on.


Challenges (con’t)

Parties making a good faith attempt to collaborate on the use of particular search methods, including utilizing multiple “meet and confers” as necessary based on initial sampling or surveying of retrieved ESI, based on whatever methods are used.


Challenges (con’t)

Being open to using new and evolving search and information retrieval methods and tools.


The Research Challenge

+ Scaling up TREC to real-world litigation
+ Finding better ways to account for (i.e., sample) the “dark matter”
+ Benchmarking competing search methods with objective standards
+ Measuring how paying greater attention to front-end “process” improves the results found by search tools and methods
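
One simple way to "sample the dark matter" is to review a random sample of the documents a search did not retrieve and extrapolate. The sketch below is a hypothetical illustration, not a method proposed in the slides: the toy collection, the `is_responsive` reviewer function, and the sample size are assumptions, and a real protocol would also report a margin of error.

```python
import random

def estimate_missed(unretrieved, is_responsive, sample_size, seed=0):
    """Estimate how many responsive documents remain in the unretrieved set."""
    rng = random.Random(seed)
    sample = rng.sample(unretrieved, sample_size)
    responsive_in_sample = sum(1 for d in sample if is_responsive(d))
    return responsive_in_sample * len(unretrieved) / sample_size

def estimated_recall(found, missed_estimate):
    """Recall implied by the responsive documents found and the estimate of those missed."""
    return found / (found + missed_estimate)

# Hypothetical: 10,000 unretrieved documents, of which IDs below 300 are responsive.
unretrieved = list(range(10_000))

def is_responsive(doc_id):
    return doc_id < 300

missed = estimate_missed(unretrieved, is_responsive, sample_size=500)
print(f"estimated responsive documents not retrieved: {missed:.0f}")
print(f"estimated recall if 900 were found: {estimated_recall(900, missed):.2f}")
```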


References

J. Baron, D. Oard, and D. Lewis, “TREC 2006 Legal Track Overview,” available at http://trec.nist.gov/pubs/trec15/t15_proceedings.html (document 3)

The Sedona Conference, Best Practices Commentary on The Use of Search & Retrieval Methods in E-Discovery (forthcoming 2007)

TREC 2007 Legal Track Home Page, see http://trec-legal.umiacs.umd.edu/


ICAIL 2007 (International Conference on Artificial Intelligence and the Law), Workshop on Supporting Search and Sensemaking for ESI in Discovery Proceedings, see http://www.umiacs.umd.edu/~oard/desi-ws/

See also J. Baron and P. Thompson, “The Search Problem Posed By Large Heterogeneous Data Sets in Litigation: Possible Future Approaches to Research,” ICAIL 2007 Conference Paper, June 4-8, 2007, available at http://www.umiacs.umd.edu/~oard/desi-ws/ (click link to conference paper).


Collaborative Expedition Workshop #45, Advancing Information Sharing, Access, Discovery and Assimilation of Diverse Digital Collections Governed by Heterogeneous Sensitivities, held Nov. 8, 2005, see http://colab.cim3.net/cgi-bin/wiki.pl?AdvancingInformationSharing_DiverseDigitalCollections_HeterogeneousSensitivities_11_08_05


Jason R. Baron
Director of Litigation
Office of General Counsel
National Archives and Records Administration
8601 Adelphi Road # 3110
College Park, MD 20740
(301) 837-1499
Email: [email protected]