8/12/2019 Predictive Review
By Donald Worthington
Principal Consultant, InternationalOne Consulting
Copyright 2010 InternationalOne Consulting. All rights reserved.
IOC WHITEPAPER:
The Cost of eDiscovery & the Importance of Predictive Review:
Introducing the New Paradigm
Abstract: A new generation of tools is being introduced
that identifies documents responsive to discovery
requests in legal matters without the cost of
document-by-document review. This whitepaper
describes one of these tools as an example and
provides metrics on the cost savings and improved
precision and recall that such tools can provide.
Corporate legal departments are operating with much
tighter budgets, forcing them into positions that favor the
settlement of lawsuits in lieu of incurring massive legal bills
brought on, in large part, by the mountains of documents
that must be reviewed to find relevant documents using
conventional approaches.
Approximately 75% of all the data a corporation generates
or retains, and which is thus subject to possible legal review,
consists of unstructured documents. Yet historically, the
vast majority of data analytic or business intelligence tools
have been built to deal with the 25% of corporate data that
can be neatly stored in structured repositories and
databases. Only recently, and primarily with the necessary
efforts to deal with international terrorism and clandestine
activities, have machine tools to deal with unstructured
data begun to work their way from the secret project labs
and into the light of what might be considered normal
data management.
These tools are now becoming available to the Legal
Services eDiscovery marketplace. They have the potential to
dramatically reduce, by at least one half, the cost of
eDiscovery to the corporate client, as shown in Table 1.
They greatly increase the speed with which a document
corpus or dataset can be reviewed, and the results are
demonstrably better. Refer to the example cited in the
sidebar Case Study for more detail.
While this whitepaper is not intended as a detailed treatise
on the functioning of all these tools, a high-level overview is
appropriate using one such tool as an example. This
example tool is not keyword driven, and searching is not
performed against an index, so no index-building is
required. This fact alone has the potential to speed up
eDiscovery when this tool is employed, as the process of
creating and maintaining indices against the data is a large
time and storage space component of the processing that is
common in today's practices.
Keyword searching itself is in fact a poor methodology
choice in any regard. Keyword searches are easily rendered
ineffective simply by using alternate words in place of the
keywords. The intelligence community learned this lesson
many years ago, but keyword searching remains one of the
bastions of legal searches, along with various add-ons like
stemming and fuzzy spelling, in an attempt to improve the
precision and recall of the search query. In short, it's a
prime example of using the wrong tool for the job: in this
case a tool made for structured data, where the applications
that create it exist in a strictly controlled environment, and
they always use the same literal (keyword) or metadata to
always mean the same thing. Thus a tool that will function
extremely well with structured data (say, for example,
returning all the end-of-day till information for a string of
retail stores) will lose much of its value when confronted
with a stack of free-form documents or emails discussing
sales in those same retail stores.
                   Traditional EDRM Process            Predictive Review Process
Task               Cost per GB  Size (GB)        Cost  Cost per GB  Size (GB)        Cost
Processing                $750        100     $75,000         $750        100     $75,000
Predictive Review           $-          0          $-         $150         75     $11,250
Review                 $20,000         75  $1,500,000      $20,000         30    $600,000
Responsive Data                        15                                  15
Total Cost                                 $1,575,000                            $686,250
Table 1
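The totals in Table 1 can be reproduced with a few lines of arithmetic. The sketch below is only a check of the table's figures; the function and variable names are illustrative and do not come from any tool described in this whitepaper.

```python
# Figures are taken from Table 1; all names here are illustrative assumptions.

def stage_cost(cost_per_gb, gigabytes):
    """Cost of one stage: per-gigabyte rate times the volume it handles."""
    return cost_per_gb * gigabytes

# Traditional EDRM process: process 100 GB, then review 75 GB by hand.
traditional = {
    "Processing": stage_cost(750, 100),
    "Review": stage_cost(20_000, 75),
}

# Predictive review process: process 100 GB, run predictive review on 75 GB,
# then review the remaining 30 GB by hand.
predictive = {
    "Processing": stage_cost(750, 100),
    "Predictive Review": stage_cost(150, 75),
    "Review": stage_cost(20_000, 30),
}

traditional_total = sum(traditional.values())
predictive_total = sum(predictive.values())
savings = traditional_total - predictive_total
savings_fraction = savings / traditional_total

print(f"Traditional total: ${traditional_total:,}")
print(f"Predictive total:  ${predictive_total:,}")
print(f"Savings:           ${savings:,} ({savings_fraction:.1%})")
```

On these figures the saving is a little over half of the traditional total, and nearly all of it comes from the smaller volume sent to manual review.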
The tool doesn't rely on this approach. Instead, it exploits
the structure of the prose making up the text, and the
context of the words being used. Thus, for a trivial
example, imagine a document corpus with a document
containing prose about rowboats and the oars necessary to
row them. There will be a structure and a relationship
between rowboats and oars and their context in the
document. Now imagine another document containing
canoes and paddles. There will be a structure, relationship
and context for these words as well, and there's a high
probability that the relationships and context in the two
documents will be very similar. The tool will keep these
relationships in close proximity, and what begins to
develop is an idea which could be termed a "concept". This
developing concept will not be "rowboat", "canoe", "oar"
or "paddle", but something more akin to "floating things
with moving things", which will eventually be given a label
by the tool that humans can better deal with, perhaps
"boats" or another label in high usage. This concept will
carry with it the words as associated values, and as more
documents from the corpus are examined, the concept will
be strengthened or diminished by way of frequency of
occurrence, or by associating other words with similar
relationships and contexts. As a result, the words
themselves become almost irrelevant, and what becomes
much more predominant is the word usage. In this sense,
the operation is not confined to a single language; almost
any language can be examined using the tool.
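The idea that words used in similar contexts end up close together can be illustrated with a toy calculation. The sketch below is not the example tool's algorithm, which this whitepaper does not disclose; it assumes a simple stand-in technique (comparing bags of co-occurring words by cosine similarity), and the three sentences and the stopword list are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical stopword list and documents, invented for this illustration.
STOPWORDS = {"we", "the", "and", "on", "each", "for", "was", "to"}

DOCUMENTS = [
    "we took the rowboat across the lake and pulled hard on each oar",
    "we took the canoe across the lake and pulled hard on each paddle",
    "the invoice for the quarter was sent to the accounting office",
]

def context_vector(word, documents):
    """Count the non-stopword terms that share a document with `word`."""
    counts = Counter()
    for text in documents:
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

rowboat_vs_canoe = cosine(context_vector("rowboat", DOCUMENTS),
                          context_vector("canoe", DOCUMENTS))
rowboat_vs_invoice = cosine(context_vector("rowboat", DOCUMENTS),
                            context_vector("invoice", DOCUMENTS))

print(f"rowboat vs canoe:   {rowboat_vs_canoe:.2f}")
print(f"rowboat vs invoice: {rowboat_vs_invoice:.2f}")
```

"rowboat" and "canoe" never appear in the same sentence here, yet they score as similar because the words around them overlap, while "invoice" shares no context with either.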
If this example document corpus were a dataset taken from
some normal corporation or entity, the likelihood of it
containing a significant number of documents about the
concept of "boats" would be very low. And if keyword
searching were used to find "rowboats", the searches
would completely miss any documents that contained only
"oars", "canoe" or "paddles". Additional searches would
have to be run for each of these keywords to find all the
documents. If the Subject Matter Expert (SME) constructing
the keyword list were for some reason unaware of the
existence of canoes or use of the word "canoe", the
appropriate search would never occur, and that entire
subset of documents would never be identified. If someone
wanted to hide the fact they were writing about canoes and
paddles, all they would have to do is a simple keyword
switch (for example, use "hammer" and "nails" instead),
and no keyword search for the known words would ever
identify those documents.
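The failure mode described above is easy to reproduce. The following sketch assumes a deliberately naive keyword search (case-insensitive substring matching) over four invented documents; the document texts, IDs and function name are hypothetical.

```python
# Hypothetical documents; doc4 uses substitute words for canoes and paddles.
DOCUMENTS = {
    "doc1": "The rowboat was tied to the dock overnight.",
    "doc2": "Bring both oars when you come down to the water.",
    "doc3": "The canoe needs a new paddle before the trip.",
    "doc4": "Please send the hammer and nails to the usual place.",
}

def keyword_search(documents, keywords):
    """Return sorted IDs of documents containing any keyword.

    Matching is a case-insensitive substring test, so "oar" also hits "oars".
    """
    lowered = [k.lower() for k in keywords]
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(k in text.lower() for k in lowered)
    )

# One keyword finds only the document that uses that exact word.
single = keyword_search(DOCUMENTS, ["rowboat"])

# An SME who knows every related term recovers more documents...
expanded = keyword_search(DOCUMENTS, ["rowboat", "oar", "canoe", "paddle"])

# ...but the document written with substitute words is still missed.
missed = sorted(set(DOCUMENTS) - set(expanded))

print(single, expanded, missed)
```

Each extra keyword has to be known in advance, and no amount of expansion around the known words recovers the document that swaps in unrelated vocabulary.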
This sort of omission can occur frequently in the Legal
Services eDiscovery industry, especially with complex
subjects. As a result, many service providers and law firms
take extensive measures to minimize this. They hire
linguists and SMEs. They do outside research. In a previous
litigation, for example, a product manufacturer was being
sued for alleged product defects. Independent research
uncovered that the product line was being sold
internationally under an entirely different name and
distribution structure. This fact, which was completely
unrepresented in the original keyword list, opened whole
new avenues of discovery, and played a critical role in the
outcome of the lawsuit. The point, however, is that this
research was both expensive and slow, and very fortunate
in this case. It's not so fortunate in every case, and yet the
expense is the same, whether the results are valuable or
not. The new tools, on the other hand, make this kind of
association simply, quickly, efficiently, cheaply and
automatically.
To extend the example dataset just a little further, imagine
that hammers and nails were in fact the things that a
lawsuit was concerned with, and that rowboats and oars,
and canoes and paddles were words used in their stead. By
seeding the tool with information about hammers and
nails, not only could documents with those words and this
relationship, if any existed, be returned; the tool would
then also identify documents with similar structure and
context, but using "rowboat", "canoe", "oar" and "paddle",
along with other documents that might contain similar
word relationships ("stapler" and "staples", for example, or
words like "spikes" or "brads"), all without any additional
subject matter knowledge. These documents could never
be found with keyword searching, unless the keyword list
was massive and the searching and subsequent review was
extensive. And yet, phrases like "up a creek without a
paddle" or "nailed it", which could easily be found in
informal documents, would not qualify using the tool,
because the relationship of the words to their surroundings
is different. Keyword searching, by contrast, would
definitely return those words, sending documents to review
that are not relevant to the matter.
The real value of all these tools, however, and what makes
them absolutely essential to the future, is their ability to
save corporations very large sums of money as compared to
costs associated with current practices. Table 1 illustrates
some nominal cost savings that might be expected for a 100
Gigabyte document dataset (from 7,500,000 to 10,000,000
pages), using the example tool and assuming the process is
no better than human reviewers. In fact, however, the
results obtained by the tools are demonstrably much better
than those produced by teams of reviewers without the use
of these tools, with precision reaching as high as 97%, as
contrasted with roughly 25% for the human review. The
case study in the sidebar is one such documented result.
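For readers less familiar with the terms, precision is the share of returned documents that are actually responsive, and recall is the share of responsive documents that were returned. The sketch below shows the standard calculation on invented document IDs; the sets are hypothetical and are not data from the case study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: four responsive documents exist; a search returns five
# documents, three of which are responsive.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d3", "d9", "d10"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means reviewers spend time on documents that turn out not to be relevant; low recall means responsive documents are never seen at all.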
A significant objection to these tools, as Table 1 also
illustrates, might be the reduction in revenues to a law firm
due to the reduction in volume of documents going through
the review process. Nevertheless, this change is
unpreventable over the long term, and it has the benefit of
allowing the law firm to focus on documents and
information that are more likely to be responsive. With the
mounting volumes of data every corporation must deal
with, the current cost of eDiscovery is simply prohibitive,
and is in real danger of being completely untenable, thus
forcing settlement even where arguments may have
significant merit or even decisive advantage. The costs can
be enormous, threatening the very existence of a company.
In a recent SEC investigation resulting in a $75 million
penalty, a major corporation spent nearly $200 million on
discovery, finding and identifying responsive documents.
While it's difficult to say what the penalty may have been
had the discovery process been different, the relationship
of cost to outcome with current eDiscovery practices is not
at all unusual. Using the example tool, the overall cost to
this corporation could easily have been reduced from $275
million to at most $175 million, and perhaps as low as $100
- $150 million. Since the new tools also offer better
eDiscovery results, there's also a reasonable chance that
the penalty itself might not have been as high, depending,
of course, on what engendered the investigation and
penalty in the first place. Certainly, however, saving $100 -
$125 million on a single SEC investigation is no small
amount, and could easily represent the difference between
profitability and loss, or research and development or
modernization programs that never got a chance to be
undertaken, or even the difference between the existence
or failure of the corporation itself.

CASE STUDY
A previously reviewed corpus of 100,000 documents was
used for the study. From this corpus, the legal team had
identified 377 documents responsive to the matter, and
had categorized them into five groups, Categories 1 to 5,
with the Category 5 documents being the most important
to the matter. The team had identified 77 Category 4 and 5
documents. The review had required 4 months to
complete, at a cost of $49,000 to the client.
Using the example tool described in the whitepaper, the
same document set was analyzed, with a total elapsed time
of two days, following a briefing by the attorneys on the
matter and initial seeding of the analysis engine for the
topics of concern.
The tool identified the original plus an additional 90
Category 4 and 5 documents, and failed to identify fewer
than 10 Category 1 documents, which were deemed on
further human review to be unimportant.
The total cost to the client for this analysis was $3,000,
representing a cost savings of $46,000 or nearly 94%, and a
time savings of 118 days or 98%.

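The percentages quoted in the case study follow directly from its cost and time figures. The sketch below simply reproduces that arithmetic; it assumes that the 4-month manual review corresponds to roughly 120 days, which is what the stated 118-day saving implies, and the variable names are illustrative.

```python
# Figures from the case study; 120 days for "4 months" is an assumption
# implied by the stated saving of 118 days against a 2-day analysis.
manual_cost = 49_000
tool_cost = 3_000
manual_days = 120
tool_days = 2

cost_savings = manual_cost - tool_cost
cost_fraction = cost_savings / manual_cost

time_savings = manual_days - tool_days
time_fraction = time_savings / manual_days

print(f"Cost savings: ${cost_savings:,} ({cost_fraction:.1%})")
print(f"Time savings: {time_savings} days ({time_fraction:.1%})")
```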
Another objection to the use of these tools might be that
until proven otherwise, they could be subject to legal
challenge. This in fact may be true. However, the processes
that use these tools are highly replicable and objective,
much more so than any results obtainable by human
reviewers. They're significantly and documentably more
accurate. They've had millions of dollars of academic and
development research performed in their creation, and
they have been and are being used in national security and
other intelligence environments. They may be subject at
some point to legal challenge, but they will absolutely meet
and overcome that challenge.
In conclusion, it is time and past time to begin serious
investigation of these tools, and to begin the process of
integrating them into normal workflow. They will represent
clear and immediate cost savings, and will be highly
defensible. The case study described in this whitepaper
also illustrates that predictive review is faster and far less
expensive than traditional review, while providing results
that are far more replicable and objective than human
review. If necessary, these tools can be easily added to or
removed from the workflow, as most integrate seamlessly
with existing applications and tools. Whether used in lieu of
manual review or to augment manual review, their use
should be considered at all stages in the eDiscovery
workflow.
InternationalOne Consulting (IOC) is a consultancy specializing in eDiscovery and ESI tools, strategy, implementation, process and competitive
analysis and positioning. Its clients include some of the largest and best-known service providers, software companies and law firms in the
eDiscovery and ESI marketplaces.
Donald Worthington is a Principal Consultant for InternationalOne Consulting. His experience includes being a founder and Executive Vice
President for SPi Legal, the director of eDiscovery for Sullivan & Cromwell, many years in IT and engineering/development, and government
service with the National Security Agency and other agencies.
To contact Mr. Worthington, send email to [email protected], call at 512-949-9498 or visit InternationalOne
Consulting on the web at InternationalOneConsulting.com.