Predictive Review

8/12/2019 Predictive Review


By Donald Worthington, Principal Consultant, InternationalOne Consulting

Copyright 2010 InternationalOne Consulting. All rights reserved.

IOC WHITEPAPER:

The Cost of eDiscovery & the Importance of Predictive Review:

Introducing the New Paradigm

Abstract: A new generation of tools is being introduced for identifying documents that are responsive to discovery requests in legal matters without the cost of document-by-document review. This whitepaper describes one of these tools as an example and provides metrics on the cost savings and the improved precision and recall that such tools can provide.

Corporate legal departments are operating with much tighter budgets, which pushes them toward settling lawsuits rather than incurring massive legal bills. Those bills are driven, in large part, by the mountains of documents that must be reviewed to find the relevant ones using conventional approaches.

Approximately 75% of all the data a corporation generates or retains, and which is thus subject to possible legal review, consists of unstructured documents. Yet historically, the vast majority of data analytics and business intelligence tools have been built to deal with the 25% of corporate data that can be neatly stored in structured repositories and databases. Only recently, driven primarily by efforts to deal with international terrorism and clandestine activities, have machine tools for unstructured data begun to work their way out of the secret project labs and into the light of what might be considered normal data management.

These tools are now becoming available to the Legal Services eDiscovery marketplace. They have the potential to reduce the cost of eDiscovery to the corporate client by at least one half, as shown in Table 1. They greatly increase the speed with which a document corpus or dataset can be reviewed, and the results are demonstrably better. Refer to the example cited in the Case Study sidebar for more detail.

While this whitepaper is not intended as a detailed treatise on the functioning of all these tools, a high-level overview is appropriate, using one such tool as an example. This example tool is not keyword driven, and searching is not performed against an index, so no index-building is required. This fact alone has the potential to speed up eDiscovery when the tool is employed, because creating and maintaining indices against the data accounts for a large share of the time and storage space consumed by the processing that is common in today's practices.

Keyword searching is in fact a poor methodology choice in any regard. Keyword searches are easily rendered ineffective simply by using alternate words in place of the keywords. The intelligence community learned this lesson many years ago, but keyword searching remains one of the bastions of legal searches, along with various add-ons like stemming and fuzzy spelling that attempt to improve the precision and recall of the search query. In short, it's a prime example of using the wrong tool for the job: in this case a tool made for structured data, where the applications that create the data exist in a strictly controlled environment and always use the same literal (keyword) or metadata to mean the same thing. Thus a tool that functions extremely well with structured data - say, for example, returning all the end-of-day till information for a string of retail stores - will lose much of its value when confronted with a stack of free-form documents or emails discussing sales in those same retail stores.




                    Traditional EDRM Process             Predictive Review Process
Task                Cost/GB    Size (GB)  Cost           Cost/GB    Size (GB)  Cost
Processing          $750       100        $75,000        $750       100        $75,000
Predictive Review   $0         0          $0             $150       75         $11,250
Review              $20,000    75         $1,500,000     $20,000    30         $600,000
Responsive Data                15                                   15
Total Cost                                $1,575,000                           $686,250

Table 1
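The arithmetic behind Table 1 can be checked with a short script. This is only a sketch of the per-gigabyte cost model the table implies: the rates and sizes are copied from the table, while the names `Stage` and `total_cost` are illustrative and do not come from the whitepaper or from any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One row of Table 1: a task billed per gigabyte."""
    name: str
    cost_per_gb: int  # dollars per gigabyte, taken from Table 1
    size_gb: int      # gigabytes handled at this stage, taken from Table 1

    @property
    def cost(self) -> int:
        return self.cost_per_gb * self.size_gb


def total_cost(stages):
    """Sum the cost of every stage in a workflow."""
    return sum(stage.cost for stage in stages)


# Traditional EDRM process: everything left after processing goes to review.
traditional = [
    Stage("Processing", 750, 100),
    Stage("Review", 20_000, 75),
]

# Predictive review process: the tool cuts the review set from 75 GB to 30 GB.
predictive = [
    Stage("Processing", 750, 100),
    Stage("Predictive Review", 150, 75),
    Stage("Review", 20_000, 30),
]

traditional_total = total_cost(traditional)
predictive_total = total_cost(predictive)
savings = traditional_total - predictive_total
savings_pct = 100 * savings / traditional_total

print(f"Traditional: ${traditional_total:,}")
print(f"Predictive:  ${predictive_total:,}")
print(f"Savings:     ${savings:,} ({savings_pct:.1f}%)")
```

Under these figures the saving comes almost entirely from the smaller review set, since review at $20,000 per gigabyte dominates both totals.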

The example tool doesn't rely on keyword matching. Instead, it exploits the structure of the prose making up the text and the context of the words being used. For a trivial example, imagine a document corpus with a document containing prose about rowboats and the oars necessary to row them. There will be a structure and a relationship between rowboats and oars and their context in the document. Now imagine another document about canoes and paddles. There will be a structure, relationship and context for these words as well, and there's a high probability that the relationships and context in the two documents will be very similar. The tool will keep these relationships in close proximity, and what begins to develop is an idea that could be termed a "concept". This developing concept will not be "rowboat", "canoe", "oar" or "paddle", but something more akin to "floating things with moving things", which will eventually be given a label by the tool that humans can better deal with - perhaps "boats" or another label in high usage. This concept will carry with it the words as associated values, and as more documents from the corpus are examined, the concept will be strengthened or diminished by frequency of occurrence, or by associating other words with similar relationships and contexts. As a result, the words themselves become almost irrelevant, and what becomes much more predominant is the word usage. In this sense, the operation is not confined to a single language - almost any language can be examined using the tool.
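The whitepaper does not disclose how the example tool works internally, so the following is not its algorithm. It is a minimal sketch of one generic way to capture the idea that words used in similar contexts end up close together: each word is described by the other words it appears with, and two words are compared by the cosine similarity of those context counts. The toy corpus, the stopword list and the function names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus (hypothetical): two boating sentences and one unrelated sentence.
corpus = [
    "she rowed the rowboat across the lake with an oar",
    "he steered the canoe across the lake with a paddle",
    "the quarterly invoice was sent to the accounting department",
]

# Very small stopword list, chosen by hand for this example only.
STOPWORDS = {"the", "a", "an", "with", "was", "to", "she", "he"}


def context_vectors(sentences):
    """Map each word to a count of the other words sharing a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = [w for w in sentence.split() if w not in STOPWORDS]
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = context_vectors(corpus)
oar_vs_paddle = cosine(vectors["oar"], vectors["paddle"])
oar_vs_invoice = cosine(vectors["oar"], vectors["invoice"])

print(f"oar vs paddle:  {oar_vs_paddle:.2f}")
print(f"oar vs invoice: {oar_vs_invoice:.2f}")
```

In this toy corpus "oar" and "paddle" never appear in the same sentence, yet they score as similar because both occur alongside "across" and "lake", while "oar" and "invoice" share no context at all.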

If this example document corpus were a dataset taken from some normal corporation or entity, the likelihood of it containing a significant number of documents about the concept of "boats" would be very low. And if keyword searching were used to find "rowboats", the searches would completely miss any documents that contained only "oars", "canoe" or "paddles". Additional searches would have to be run for each of these keywords to find all the documents. If the Subject Matter Expert (SME) constructing the keyword list were for some reason unaware of the existence of canoes or the use of the word "canoe", the appropriate search would never occur, and that entire subset of documents would never be identified. If someone wanted to hide the fact they were writing about canoes and paddles, all they would have to do is a simple keyword switch - for example, use "hammer" and "nails" instead - and no keyword search for the known words would ever identify those documents.
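The failure mode described above is easy to reproduce. The sketch below uses a hypothetical four-document corpus and a deliberately simple whole-word keyword search. The document texts and the `keyword_search` helper are invented for illustration and do not represent any particular eDiscovery product.

```python
# Hypothetical corpus: doc4 uses substituted "code words" for the same subject.
documents = {
    "doc1": "the rowboat was tied to the dock",
    "doc2": "bring both oars for the trip",
    "doc3": "the canoe and two paddles are in the shed",
    "doc4": "pick up a hammer and nails on the way",
}


def keyword_search(docs, keywords):
    """Return the sorted ids of documents containing any keyword as a whole word."""
    wanted = {keyword.lower() for keyword in keywords}
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if wanted & set(text.lower().split())
    )


# A single keyword finds only the document that uses that exact word.
hits = keyword_search(documents, ["rowboat"])
missed = sorted(set(documents) - set(hits))

# A longer list built by a subject matter expert recovers the related terms,
# but still cannot find the document that swapped in unrelated words.
expanded_hits = keyword_search(documents, ["rowboat", "oars", "canoe", "paddles"])

print("rowboat only:", hits, "missed:", missed)
print("expanded list:", expanded_hits)
```

Every term the keyword list omits is a document that is never seen, and a writer who substitutes words defeats the list no matter how long it grows.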

This sort of omission can occur frequently in the Legal Services eDiscovery industry, especially with complex subjects. As a result, many service providers and law firms take extensive measures to minimize it. They hire linguists and SMEs. They do outside research. In a previous litigation, for example, a product manufacturer was being sued for alleged product defects. Independent research uncovered that the product line was being sold internationally under an entirely different name and distribution structure. This fact, which was completely unrepresented in the original keyword list, opened whole new avenues of discovery and played a critical role in the outcome of the lawsuit. The point, however, is that this research was both expensive and slow - and very fortunate in this case. It's not so fortunate in every case, and yet the expense is the same whether the results are valuable or not. The new tools, on the other hand, make this kind of association simply, quickly, efficiently, cheaply and automatically.

To extend the example dataset just a little further, imagine that hammers and nails were in fact the things that a lawsuit was concerned with, and that rowboats and oars, and canoes and paddles, were words used in their stead. By seeding the tool with information about hammers and nails, not only could documents with those words and this relationship, if any existed, be returned, but the tool would also identify documents with similar structure and context that use "rowboat", "canoe", "oar" and "paddle", along with other documents that might contain similar word relationships - "stapler" and "staples", for example, or words like "spikes" or "brads" - all without any additional subject matter knowledge. These documents could never be found with keyword searching unless the keyword list was massive and the searching and subsequent review were extensive. And yet phrases like "up a creek without a paddle" or "nailed it", which could easily be found in informal documents, would not qualify using the tool, because the relationship of the words to their surroundings is different. Those phrases would definitely be returned by keyword searching, and would represent documents submitted for review that are not relevant to the matter.

The real value of all the tools, however, and what makes them absolutely essential to the future, is their ability to save corporations very large sums of money compared to the costs associated with current practices. Table 1 illustrates some nominal cost savings that might be expected for a 100 Gigabyte document dataset (from 7,500,000 to 10,000,000 pages), using the example tool and assuming the process is no better than human reviewers. In fact, however, the results obtained by the tools are demonstrably much better than those produced by teams of reviewers without the use of these tools, with precision reaching as high as 97%, as contrasted with roughly 25% for the human review. The case study in the sidebar is one such documented result.
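For readers less familiar with the two measures, precision is the share of retrieved documents that are actually responsive, and recall is the share of responsive documents that were retrieved. The sketch below computes both from sets of document ids. The ids are hypothetical and are not drawn from the case study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four responsive documents exist in the corpus,
# and a search returns five documents, three of which are responsive.
relevant_ids = {1, 2, 3, 4}
retrieved_ids = {1, 2, 3, 7, 8}

precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Low precision means reviewers are paid to read documents that do not matter, which is where most of the review cost in Table 1 comes from. Low recall means responsive documents are never found at all.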

A significant objection to these tools, as Table 1 also illustrates, might be the reduction in revenues to a law firm due to the reduction in the volume of documents going through the review process. Nevertheless, this change is unpreventable over the long term, and it has the benefit of allowing the law firm to focus on documents and information that are more likely to be responsive. With the mounting volumes of data every corporation must deal with, the current cost of eDiscovery is simply prohibitive, and is in real danger of becoming completely untenable, thus forcing settlement even where arguments may have significant merit or even decisive advantage. The costs can be enormous, threatening the very existence of a company.

In a recent SEC investigation resulting in a $75 million penalty, a major corporation spent nearly $200 million on discovery, finding and identifying responsive documents. While it's difficult to say what the penalty may have been had the discovery process been different, the relationship of cost to outcome with current eDiscovery practices is not at all unusual. Using the example tool, the overall cost to this corporation could easily have been reduced from $275 million to at most $175 million, and perhaps as low as $100 - $150 million. Since the new tools also offer better eDiscovery results, there's also a reasonable chance that the penalty itself might not have been as high - depending, of course, on what engendered the investigation and penalty in the first place. Certainly, however, saving $100 - $125 million on a single SEC investigation is no small amount, and could easily represent the difference between profitability and loss, or research and development or modernization programs that never got a chance to be undertaken, or even the difference between the existence or failure of the corporation itself.

CASE STUDY

A previously reviewed corpus of 100,000 documents was used for the study. From this corpus, the legal team had identified 377 documents responsive to the matter, and had categorized them into five groups, Categories 1 to 5, with the Category 5 documents being the most important to the matter. The team had identified 77 Category 4 and 5 documents. The review had required 4 months to complete, at a cost of $49,000 to the client.

Using the example tool described in the whitepaper, the same document set was analyzed, with a total elapsed time of two days, following a briefing by the attorneys on the matter and initial seeding of the analysis engine for the topics of concern.

The tool identified the original 77 Category 4 and 5 documents plus an additional 90, and failed to identify fewer than 10 Category 1 documents, which were deemed on further human review to be unimportant.

The total cost to the client for this analysis was $3,000, representing a cost savings of $46,000 or nearly 94%, and a time savings of 118 days or 98%.
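The percentages quoted in the case study can be reproduced from its raw figures. The sketch below assumes the four-month manual review is counted as 120 days, which is the only reading consistent with the stated 118-day saving. The last figure is derived here rather than stated in the whitepaper, and it is measured against the combined set of documents found, not against an independent ground truth.

```python
# Figures taken from the case study.
manual_cost = 49_000   # dollars, manual review
tool_cost = 3_000      # dollars, predictive review
tool_days = 2          # elapsed days, predictive review
manual_high_priority = 77       # Category 4 and 5 documents found manually
additional_high_priority = 90   # further Category 4 and 5 documents found by the tool

# Assumption: "4 months" is treated as 120 days to match the 118-day saving.
manual_days = 120

cost_savings = manual_cost - tool_cost
cost_savings_pct = 100 * cost_savings / manual_cost

time_savings_days = manual_days - tool_days
time_savings_pct = 100 * time_savings_days / manual_days

# Derived figure (not stated in the whitepaper): the share of all known
# Category 4 and 5 documents that the manual review had found.
total_high_priority = manual_high_priority + additional_high_priority
manual_share_pct = 100 * manual_high_priority / total_high_priority

print(f"Cost savings: ${cost_savings:,} ({cost_savings_pct:.1f}%)")
print(f"Time savings: {time_savings_days} days ({time_savings_pct:.1f}%)")
print(f"Manual review found {manual_share_pct:.1f}% of {total_high_priority} "
      f"known Category 4 and 5 documents")
```

On these numbers the manual review had located fewer than half of the high-priority documents that are now known to exist in the corpus.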

Another objection to the use of these tools might be that, until proven otherwise, they could be subject to legal challenge. This may in fact be true. However, the processes that use these tools are highly replicable and objective - much more so than any results obtainable by human reviewers. They're significantly and documentably more accurate. They've had millions of dollars of academic and development research invested in their creation, and they have been and are being used in national security and other intelligence environments. They may be subject at some point to legal challenge, but they will absolutely meet and overcome that challenge.

In conclusion, it is time and past time to begin serious investigation of these tools, and to begin the process of integrating them into normal workflow. They will represent clear and immediate cost savings, and will be highly defensible. The case study described in this whitepaper also illustrates that predictive review is faster and far less expensive than traditional review, while providing results that are far more replicable and objective than human review. If necessary, the tools can be easily added to or removed from the workflow, as most integrate seamlessly with existing applications and tools. Whether used in lieu of manual review or to augment it, their use should be considered at all stages in the eDiscovery workflow.

InternationalOne Consulting (IOC) is a consultancy specializing in eDiscovery and ESI tools, strategy, implementation, process and competitive analysis and positioning. Its clients include some of the largest and best-known service providers, software companies and law firms in the eDiscovery and ESI marketplaces.

Donald Worthington is a Principal Consultant for InternationalOne Consulting. His experience includes being a founder and Executive Vice President for SPi Legal, the director of eDiscovery for Sullivan & Cromwell, many years in IT and engineering/development, and government service with the National Security Agency and other agencies.

To contact Mr. Worthington, send email to [email protected], call at 512-949-9498 or visit InternationalOne Consulting on the web at InternationalOneConsulting.com.
