Analysis 360: Blurring the line between EDA and PC
Andrea Gibson, Product Director, Kroll Ontrack
March 27, 2014
Discussion Overview
Pushing the Boundaries of Early Data Analysis (EDA)
Examining Traditional EDA Tools
Leveraging Predictive Coding (PC) for Analysis
Using PC in an EDA Environment
Pushing the Boundaries of EDA
EDA | an acronym worth defining
Early Data Analysis (EDA) aids fact-finding and narrows the data scope by helping attorneys understand their datasets
» Triage data into critical and non-critical groupings
» Identify and reduce the number of key players
» Test search terms
» Identify critical case arguments
» Categorize documents as efficiently as possible for production
A true methodology – technology fuels human decisions
Traditional EDA | Overview
Workflow stages: Identify, Collect & Process; Analysis; Export to Review Platform; Import & Perform Early Analysis; Document Review
» Filter
» Search
» Cluster
» Processing
» Ensure portability of groups and tags
» Ensure production/search capabilities of review platform
» Search
» Tag
» Redact
» Log
» Route
» Report
» Test
» QC
Where does Predictive Coding fit in?
[Workflow diagram repeated from “Traditional EDA | Overview”, with a “Predictive Coding!” callout]
Traditional EDA | How efficient is it?
[Workflow diagram repeated from “Traditional EDA | Overview”, with a “Predictive Coding!” callout]
The Bermuda Triangle of ediscovery
» PC is massively underused
» The tools used during analysis and review overlap substantially
» Pointless inefficiencies are created by jockeying data between two standalone platforms
EDA + Review | Could it look like this?
Workflow stages: Identify, Collect & Process; Analyze and Review
» Process
» PC
» Filter
» Search
» Cluster
» Test
» QC
» Route
» Report
» Tag
Examining Traditional EDA Tools
Keyword Search & Concept Search
Keyword Search
» Uses search terms and Boolean operators (&, or, not) to retrieve documents that contain those exact terms
» Standard practice
» Generally accepted in the courts
» Example: “baseball & field”
Concept Search
» Technology alternative
» Allows reviewers to find documents with similar conceptual terms even if they do not contain the exact search terms
» Seldom used for filtering; increasingly used for review
» Example: “baseball” also finds diamond, MLB, hit, out
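The exact-term behavior of a Boolean keyword search can be sketched in a few lines. This is only an illustration, not how any particular review platform works: the function names and the three sample documents are hypothetical, and only the “baseball & field” query comes from the slide.

```python
import re

def tokenize(text):
    """Lowercase a document and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search_and(docs, terms):
    """Return the ids of documents that contain every term (Boolean AND)."""
    wanted = {t.lower() for t in terms}
    return [doc_id for doc_id, text in docs.items() if wanted <= tokenize(text)]

# Hypothetical sample documents, invented for illustration.
docs = {
    "doc1": "The baseball team rented the field for Saturday.",
    "doc2": "He hit the ball out of the diamond at the MLB game.",
    "doc3": "The field report is attached.",
}

hits = search_and(docs, ["baseball", "field"])
print(hits)  # ['doc1']
```

Note that doc2 is plainly about baseball but is not returned, because it contains neither exact term. That gap is what concept search is meant to close.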
Topic Grouping & Language Identification
Topic Grouping
» Documents automatically grouped by theme without human input
» Example: Finance
Language Identification
» Identify all languages in a document
» Used to group and sort documents for review by multilingual reviewers
» Example: 非披露協議, كتيب الموظف, Contract
Email Threading & Near Deduplication
Email Threading
» Identifies and groups e-mail conversations based on content
» Example: Start-Point Email → RE: → FWD: → End-Point
Near Deduplication
» Reviewers can quickly identify and compare documents that are very similar to one another but are not exact duplicates
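One common way to score "very similar but not exact duplicates" is Jaccard similarity over overlapping word sequences (shingles). The deck does not say which technique any product uses, so treat this as a minimal sketch under that assumption; the function names and sample texts are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two texts: shared shingles / all shingles."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical documents: two drafts that differ by one word, plus an unrelated note.
draft_v1 = "Please review the attached draft agreement before Friday"
draft_v2 = "Please review the attached draft agreement before Monday"
unrelated = "Lunch is at noon in the cafeteria"

print(round(jaccard(draft_v1, draft_v2), 3))  # 0.714
print(jaccard(draft_v1, unrelated))           # 0.0
```

A reviewer-facing tool would then group documents whose score exceeds some chosen threshold; the right threshold is a judgment call for the matter at hand.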
Finding a Common Thread
At their cores, these tools help attorneys learn more about their data
» Does PC fit the bill?
Analytical Tools: Topic Grouping, Keyword Search, Language ID, Dedupe, Email Threading, Concept Search, Predictive Coding
Leveraging PC for Analysis
Predictive Coding for Production
Predictive Coding For Analysis
PC has been praised for its ability to reduce the number of documents manually reviewed during first pass
But at least three critical components of PC empower attorneys with unrivaled knowledge about their case:
» Prioritization
» Categorization
» Active Learning
The Prioritization Component
Learns from reviewer decisions and escalates documents based on two binary categories
» Responsive or nonresponsive
» Works based on a modest amount of learning
Increases the ratio of responsive documents that get routed to reviewers
Figure: 74,000 responsive; 480,000 non-responsive
The Prioritization Component
How does this help attorneys analyze their case?
» When attorneys ‘check out’ documents to review, they are seeing those documents most likely to be responsive
» For the same reasons this speeds up production, attorneys who put eyes on these richly relevant documents will know more about their case earlier, driving arguments and filling knowledge gaps
» It runs in the background, so you don’t need to carve into billable hours to test keywords
Figure: Request batch; Entire Corpus
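The 'check out' step can be pictured as handing the reviewer the unreviewed documents with the highest predicted scores. The sketch below assumes a classifier has already assigned each document a score between 0 and 1. The function name, document ids and scores are hypothetical.

```python
def next_batch(scores, reviewed, batch_size):
    """Return the unreviewed document ids with the highest predicted scores."""
    pending = [doc_id for doc_id in scores if doc_id not in reviewed]
    pending.sort(key=lambda doc_id: scores[doc_id], reverse=True)
    return pending[:batch_size]

# Hypothetical predicted responsiveness scores for a five-document corpus.
scores = {"a": 0.91, "b": 0.12, "c": 0.78, "d": 0.40, "e": 0.85}

batch = next_batch(scores, reviewed={"a"}, batch_size=2)
print(batch)  # ['e', 'c']
```

As reviewers code each batch, the scores would be refreshed in the background, so later batches reflect what the system has learned.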
The Categorization Component
Learns from trainer decisions and suggests coding on multiple categories for an entire collection of documents
Assigns a predicted responsiveness score
Improves speed and quality of categorization decisions
Figure: category labels (Responsive, Non-responsive, Privileged) with example predictions (75% Predicted, 67% Predicted, 89% Predicted)
The Categorization Component
How does this help attorneys analyze their case?
» Allows attorneys to segregate data at user-defined predicted responsiveness ratings after modest training
» Empowers attorneys to route certain categories of documents (e.g. “hot” docs) to certain sub-groups within the team
Figure: Post Round One Categorization Results (65% cutoff); axis: % likelihood to be responsive, 0% to 100%; counts shown: 1,427 docs and 9,522 docs
Figure: To: Brief-writer Bryan, Re: Good Luck on the first draft!
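Segregating documents at a user-defined cutoff amounts to a threshold split on the predicted score. The sketch below uses the 65% cutoff from the slide. Everything else, including the function name and the sample scores, is a hypothetical illustration.

```python
def route_by_cutoff(scores, cutoff=0.65):
    """Split document ids into those at or above the cutoff and those below it."""
    above = sorted(d for d, s in scores.items() if s >= cutoff)
    below = sorted(d for d, s in scores.items() if s < cutoff)
    return above, below

# Hypothetical predicted likelihoods of responsiveness after one training round.
scores = {"memo-1": 0.92, "memo-2": 0.65, "memo-3": 0.40, "memo-4": 0.08}

hot, rest = route_by_cutoff(scores)
print(hot)   # ['memo-1', 'memo-2']
print(rest)  # ['memo-3', 'memo-4']
```

The documents at or above the cutoff could then be routed to a particular sub-group, such as the attorney drafting the brief.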
The Active Learning Component
Key component of any true PC solution
» Automatically escalates focus documents for training (as opposed to just handpicked or just randomly selected training documents)
Focus Documents:
» Come from grey areas in the classifier because the machine is currently uncertain whether they are responsive or not responsive
» Ideal candidates to improve machine learning
» Not random, but queried
Figure: predicted-responsiveness scale from 0% (non-responsive) to 100% (responsive), marked in 10% steps
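Selecting documents from the classifier's grey area is often done with uncertainty sampling: pick the documents whose predicted score sits closest to 50%. The deck does not name a specific selection rule, so this is a sketch under that assumption, with a hypothetical function name and hypothetical scores.

```python
def focus_documents(scores, n):
    """Return the n document ids whose predicted scores are closest to 0.5."""
    return sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))[:n]

# Hypothetical predicted responsiveness scores.
scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.70}

print(focus_documents(scores, 2))  # ['b', 'd']
```

Documents "a" and "c" are skipped because the machine is already fairly sure about them; a trainer's decision on "b" or "d" tells it more.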
The Active Learning Component
How does this help attorneys analyze their case?
» Introduces attorneys to the documents on the fringe of relevancy
– These could be case-changing documents that the machine just doesn’t know enough about yet
» Most effective way to boost metrics and improve results between early training rounds
– Reduces false positives; improves accuracy of the machine’s concept of relevancy
Figure: Precision and Recall compared for training rounds TR1 and TR2
Additional Efficiencies
Production
» Can easily transition into production whether leveraging PC or not
– Most practical form of PC for EDA
Reporting
» Even if just one or two training rounds are performed, metrics will show where you stand
– In this vein, no other EDA tool comes close to PC’s automatic reporting
– There’s a reason courts often ask for recall and precision: these indicate whether your understanding of the data set is accurate
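For reference, precision is the share of documents predicted responsive that really are responsive, and recall is the share of truly responsive documents that the prediction found. A minimal sketch follows. The document ids are hypothetical, and in practice the "actual" set would come from attorney review of a sample.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from sets of predicted and truly responsive ids."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical example: the machine flags four documents; three are truly responsive.
predicted = {"a", "b", "c", "d"}
actual = {"a", "b", "e"}

precision, recall = precision_recall(predicted, actual)
print(precision)         # 0.5
print(round(recall, 3))  # 0.667
```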
Additional Efficiencies
Other EDA tools complement predictive coding
» Predictive coding requires reviewing a few thousand documents in training
– Most PC solutions also come equipped with all other EDA tools available
– These help you navigate documents during training as well as during review
Intra-team quality control
» Can compare reviewer-machine agreement rates side-by-side
» Identify points of disagreement and inconsistency
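A reviewer-machine agreement rate can be computed by comparing the two sets of calls on the documents both have coded. The sketch below is one simple way to do that; the function name, labels and document ids are hypothetical.

```python
def agreement_rate(reviewer_calls, machine_calls):
    """Return the agreement rate and the ids where reviewer and machine disagree."""
    shared = reviewer_calls.keys() & machine_calls.keys()
    if not shared:
        return 0.0, []
    disagreements = sorted(d for d in shared if reviewer_calls[d] != machine_calls[d])
    return 1 - len(disagreements) / len(shared), disagreements

# Hypothetical coding decisions for one reviewer and the classifier.
reviewer = {"d1": "responsive", "d2": "non-responsive", "d3": "responsive", "d4": "responsive"}
machine = {"d1": "responsive", "d2": "non-responsive", "d3": "non-responsive", "d4": "responsive"}

rate, conflicts = agreement_rate(reviewer, machine)
print(rate)       # 0.75
print(conflicts)  # ['d3']
```

Running the same comparison per reviewer gives the side-by-side view, and the conflict list points to the documents worth a second look.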
Additional Efficiencies
The small case conundrum
» The analytical value from PC is greater where the subject-matter expert who trains the system is the same attorney who is forming case strategy
– This is most likely true in small-to-medium cases where one attorney may be in charge of a case through trial
» The production value from using PC to aid review is greater where high upfront costs can be recouped by applying the machine’s logic to a large number of documents
– Traditionally, this has been true only in large cases
Additional Efficiencies
This is all changing
The “portfolio approach” to ediscovery
» Pay yearly for PC (and everything that preceded it) in all your cases for a data hosting fee (process on the vendor’s side)
– Upload on day one, train on day one, see a list of documents ranked by relevancy on day one
Using PC in an EDA Environment
Overview
It’s not that crazy
» EDA tools let you learn more about your data, and so does PC
» Many of the tools discussed today (e.g. de-duplication, concept searching) already exist in standalone “PC solutions”
Aggressive culling via keywords can have an impact on training in PC
Any search strategy must be well designed according to the matter at hand
The producing party has substantial deference in conducting its search
Pre-PC Keyword Cull?
In re Biomet
» Defendant’s search strategy:
Figure: 19.5 million documents → Keyword → 3 million documents → PC → Production
» Plaintiffs argued: the defendant should have used PC on the whole 19.5 million document corpus; the keywords tainted the training. We want joint review of training docs.
» Court held: defendant’s search was reasonable
Parting Thoughts
There are many ways to learn about data
» Different tools on the same belt; multi-modal search
Solutions are emerging that offer all of these tools in one location
» No more data jockeying
» More information for better decisions
Quality control is essential whenever you use one of these tools to remove documents from production