25
2/23/12 1 Text Classification in Electronic Discovery Dave Lewis, Ph.D. David D. Lewis Consulting, LLC www.DavidDLewis.com Slides for lecture via Skype on February 23, 2012 in LBSC 708X/INFM 718X (Seminar on E-Discovery, Univ. of Maryland). 1 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved. Disclaimer I am not a lawyer Nothing here should be considered legal advice If you need legal advice, consult a lawyer Repeat as needed 2 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

1  

Text Classification in Electronic Discovery

Dave Lewis, Ph.D. David D. Lewis Consulting, LLC

www.DavidDLewis.com Slides for lecture via Skype on February 23, 2012 in LBSC 708X/INFM 718X (Seminar on E-Discovery, Univ. of Maryland).

1 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Disclaimer

•  I am not a lawyer •  Nothing here should be considered legal

advice •  If you need legal advice, consult a lawyer

•  Repeat as needed

2 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 2: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

2  

Disclaimer 2

•  I have a variety of interests in this area – Co-founder of TREC Legal Track – Publishing and public speaking in attempt to

influence evolving legal standards – Consulting expert and expert witness on legal

cases involving e-discovery – Algorithm designer for an e-discovery vendor

3 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Outline

•  What is e-discovery? •  IR technologies in e-discovery

– Text retrieval – Text classification – Effectiveness evaluation

•  Summary

4 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 3: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

3  

E-Discovery in Black and White

•  A document is – Preserved or not – Reviewed or not – Listed in privilege log or not – Turned over as responsive or not – Presented in court or not

•  As evidence of a particular fact or not

•  Law tries to define actions unambiguously – That means classification

5 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Copyright 2001-2010, David D. Lewis Consulting. All rights reserved. 6

Text Classification

•  Deciding which of several predefined classes (groups) a text belongs to

•  Crudest form of understanding text •  BUT…current technology often can do

well •  AND…many tasks can be viewed as

classification – Particularly tasks of organizing information

Page 4: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

4  

Copyright 2001-2010, David D. Lewis Consulting. All rights reserved. 7

Advantages of Viewing a Task As Classification

•  Classifiers are good plug-in modules –  “Text in/class out" a simple interface

•  Finite set of outputs a good basis for –  Conditioning action on output –  Counting, correlation, etc.

•  Automated applying and learning of classifiers –  Software reusable across classification tasks

•  Clarifies nature of task, successes, failure modes –  Straightforward numerical measures of quality

Copyright 2001-2010, David D. Lewis Consulting. All rights reserved. 8

Organizing Sets of Classes

•  Binary classification •  Three or more classes

– Multiclass – Multilabel – Ordinal – Hierarchical

•  Probabilistic and graded classification

Page 5: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

5  

CULLING REVIEW

PRESENTATION

Other parties

classification in e-discovery

MANAGE- MENT

9 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

which documents to keep?

MANAGE- MENT

10 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 6: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

6  

CULLING

which documents to review?

11 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

REVIEW

which documents must we turn over?

12 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 7: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

7  

REVIEW

what types of documents do we have?

13 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

PRESENTATION

which documents do we show in court?

14 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 8: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

8  

Arkfeld, Michael. Arkfeld on Electronic Discovery and Evidence, 2nd edition, Feb. 2010. 15 Copyright 2001-2010, David D. Lewis

Consulting. All rights reserved.

Approaches to Classification

•  Interactive manual classification – Search and save documents

•  Interactive manual classifier creation – Search and save query

•  Supervised learning of classifier – Review documents – Computer creates query

16 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 9: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

9  

Classification "System"

Search and

Save

17 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Classification "System"

Search and

Save

18 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 10: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

10  

Search and Save •  Strengths:

– See interesting docs first –  Leverage search algorithms and interfaces

•  Weaknesses – Personnel must have both case knowledge and

search skills – Documents arrive in bursty fashion

•  Constantly need these highly trained people – Human variability – Can't show later why doc was or wasn't saved

19 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Classification System

Query = Classifier

Search and

Save Query

(Manual Classifier

Construction) 20 Copyright 2001-2010, David D. Lewis

Consulting. All rights reserved.

Page 11: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

11  

Classification System

Classifier

21 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Manual Classifier Construction in Culling

•  1-2 orders of magnitude reduction typically quoted

•  Traditionally simple Booleans •  Developed iteratively •  Sometimes statistical evaluation of

effectiveness

Things are moving very rapidly in this area 22 Copyright 2001-2010, David D. Lewis

Consulting. All rights reserved.

Page 12: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

12  

Classification System

Classifier

23 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Classification System

Classifier

24 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 13: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

13  

Classification System

Classifier

25 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Research Questions in Manual Classifier Construction for E-Disco •  Culling:

– Optimizing large-grained selection – Detecting redundancy – Summary representations – How good are people at this? – How can we help them?

•  Much from distributed search applies!

26 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 14: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

14  

Manual Classifier Construction

•  Strengths – Classifier can be run as new docs arrive – Explicit justification of decisions

•  Weaknesses – Must justify why classifier built way it was – New docs may differ from old ones, requiring

ongoing modification of classifier •  Haven't escaped need for single search/case

expert

27 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

28 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 15: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

15  

X

X ✔ ✔

29 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

X

X ✔ ✔

Supervised Learning

30 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 16: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

16  

Supervised Learning for Text Classification

•  People manually classify documents •  Computer searches for a rule that

– Agrees with most manual classifications –  Is not too "complex"

•  Uses that rule to classify future documents – Rules typically associate numeric weights with

words and other document features

31 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

REVIEW

32 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 17: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

17  

REVIEW

33 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

REVIEW

$ $ $ $

$

$ $

$

34 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 18: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

18  

REVIEW

?

35 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

REVIEW

?

responsiveness?

privilege?

36 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 19: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

19  

REVIEW

?

37 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Other Favorable Factors for Supervised Learning

•  Review is a recall-oriented task – Classifier likely to need thousands of words – Very hard to create by hand

•  Many, weak, disparate predictors – Text, custodian, time, place, organizational

structure, file type, message headers,... – Again very hard to manually use

•  Objective reason the classifier is way it is –  "The learning algorithm said so"

38 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 20: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

20  

X

X ✔ ✔

Classification System

Classifier

X

✔ ✔ ✔ ✔

✔?

✔?

39 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

X

X ✔ ✔

Classification System

Classifier

X

✔ ✔ ✔ ✔

✔?

✔?

40 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 21: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

21  

X

X ✔ ✔

Classification System

Classifier

X

✔ ✔ ✔ ✔

✔0.98 X X X

✔ ✔ ✔ ✔ ✔ ✔

X

0.93

0.89 41 Copyright 2001-2010, David D. Lewis

Consulting. All rights reserved.

Unusual Aspects of E-Discovery for Supervised Learning

1  Ultra high recall, moderate precision 2  Extremely diverse document sets 3  Duplicate documents 4  Document families 5  Multiple assessors 6  Some categories conditional on others 7  Goal of assessment is review, not training

--May be unwilling to evaluate documents for categories if not responsive

All of these are pains / research opportunities

42 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 22: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

22  

X

X ✔ ✔

Classification System

Classifier

X

X X X

✔ ✔ ✔ ✔ ✔ ✔

X

0.50 ? 0.50

0.50

?

?

8. active learning

43 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

X

X ✔ ✔

Classification System

Classifier

X

0.98 X X X

✔ ✔ ✔ ✔ ✔ ✔

X

0.80

0.76 BA

C

9. diversity-biased classification

44 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 23: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

23  

X

X ✔ ✔

Classification System

Classifier

X

✔0.98 X X X

✔ ✔ ✔ ✔ ✔ ✔

X

0.93

0.89

10. transduction

45 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Outline

•  What is e-discovery? •  IR technologies in e-discovery

– Text retrieval – Text classification – Effectiveness evaluation

•  Why I love e-discovery

46 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 24: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

24  

Copyright 2001-2010, David D. Lewis Consulting. All rights reserved. 47

The Joy of E-Discovery •  All that SIGIR/TREC/etc. IR geek stuff...

– Supervised learning from huge amounts of training data

– High recall search – Careful statistical measurement of

effectiveness – Obsessed tuning to get effectiveness up

•  Matters, works, and saves people great drudgery and cost

Summary

•  E-discovery is an application where IR really matters – Current technology can help a lot – Advances would help even more – Rigor matters: write, program as if you might

have to testify about it

48 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.

Page 25: Text Classification in Electronic Discoveryoard/teaching/708x/spring12/slides/5/lewis2.pdfText Classification • Deciding which of several predefined classes (groups) a text belongs

2/23/12  

25  

Thanks!

•  I'm always happy to answer questions:

•  Sign up if you'd like to be on list to receive updated copies of this talk

49 Copyright 2001-2010, David D. Lewis Consulting. All rights reserved.