Predictive Review

    By

    Donald Worthington, Principal Consultant, InternationalOne Consulting

    Copyright 2010 InternationalOne Consulting. All rights reserved.

    IOC WHITEPAPER:

    The Cost of eDiscovery & the Importance of Predictive Review:

    Introducing the New Paradigm

    Abstract: A new generation of tools is being introduced for
    identifying documents that are responsive to discovery requests
    in legal matters without the cost of document-by-document
    review. This whitepaper describes one of these tools as an
    example and provides metrics on the cost savings and improved
    precision and recall that such tools can provide.

    Corporate legal departments are operating with much

    tighter budgets, forcing them into positions that favor the

    settlement of lawsuits in lieu of incurring massive legal bills

    brought on, in large part, by the mountains of documents

    that must be reviewed to find relevant documents using

    conventional approaches.

    Approximately 75% of all the data a corporation generates

    or retains, and which is thus subject to possible legal review,

    consists of unstructured documents. Yet historically, the

    vast majority of data analytic or business intelligence tools

    have been built to deal with the 25% of corporate data that

    can be neatly stored in structured repositories and

    databases. Only recently, and primarily with the necessary

    efforts to deal with international terrorism and clandestine

    activities, have machine tools to deal with unstructured

    data begun to work their way from the secret project labs

    into the light of what might be considered normal

    data management.

    These tools are now becoming available to the Legal

    Services eDiscovery marketplace. They have the potential to

    dramatically reduce, by at least one half, the cost of

    eDiscovery to the corporate client, as shown in Table 1.

    They greatly increase the speed with which a document

    corpus or dataset can be reviewed, and the results are

    demonstrably better. Refer to the example cited in the

    sidebar Case Study for more detail.

    While this whitepaper is not intended as a detailed treatise

    on the functioning of all these tools, a high-level overview is

    appropriate using one such tool as an example. This

    example tool is not keyword driven, and searching is not

    performed against an index, so no index-building is

    required. This fact alone has the potential to speed up

    eDiscovery when this tool is employed, as the process of

    creating and maintaining indices against the data is a large

    time and storage space component of the processing that is

    common in today's practices.

    Keyword searching itself is in fact a poor methodology

    choice in any regard. Keyword searches are easily rendered

    ineffective simply by using alternate words in place of the

    keywords. The intelligence community learned this lesson

    many years ago, but it remains one of the bastions of legal

    searches, along with various add-ons like stemming and

    fuzzy spelling, in an attempt to improve the precision and

    recall of the search query. In short, it's a prime example of

    using the wrong tool for the job: in this case a tool made for

    structured data, where the applications that create it exist

    in a strictly-controlled environment, and they always use

    the same literal (keyword) or metadata to always mean the

    same thing. Thus a tool that will function extremely well

    with structured data (for example, returning all the end-of-day

    till information for a string of retail stores) will lose

    much of its value when confronted with a stack of free form

    documents or emails discussing sales in those same retail

    stores.

                         Traditional EDRM Process              Predictive Review Process
    Task                 Cost per GB   Size (GB)   Cost         Cost per GB   Size (GB)   Cost
    Processing           $750          100         $75,000      $750          100         $75,000
    Predictive Review    -             0           -            $150          75          $11,250
    Review               $20,000       75          $1,500,000   $20,000       30          $600,000
    Responsive Data                    15                                     15
    Total Cost                                     $1,575,000                             $686,250

    Table 1
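
The totals in Table 1 can be reproduced with a short sketch. The per-gigabyte rates and volumes below are copied from the table; the function and variable names are illustrative assumptions, not part of any tool described here.

```python
# Illustrative sketch: recomputes the Table 1 totals from its per-gigabyte
# rates and volumes. All names are hypothetical; the figures come from Table 1.

def total_cost(stages):
    """Sum (cost per gigabyte * gigabytes) over a list of (rate, gigabytes) pairs."""
    return sum(rate * gigabytes for rate, gigabytes in stages)


# Each entry is (cost per gigabyte in dollars, size in gigabytes).
traditional = [
    (750, 100),     # processing
    (20_000, 75),   # document-by-document review
]
predictive = [
    (750, 100),     # processing
    (150, 75),      # predictive review
    (20_000, 30),   # human review of the reduced set
]

traditional_total = total_cost(traditional)
predictive_total = total_cost(predictive)
savings = traditional_total - predictive_total
savings_fraction = savings / traditional_total

print(f"Traditional: ${traditional_total:,}")
print(f"Predictive:  ${predictive_total:,}")
print(f"Savings:     ${savings:,} ({savings_fraction:.1%})")
```

Under the table's assumptions the saving is a little over half of the traditional total, which is consistent with the "at least one half" figure given in the text.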

    The example tool doesn't rely on keyword matching. Instead, it

    exploits the structure of the prose making up the text and the

    context of the words being used. Thus, for a trivial

    example, imagine a document corpus with a document

    containing prose about rowboats and the oars necessary to

    row them. There will be a structure and a relationship

    between rowboats and oars and their context in the

    document. Now imagine another document containing

    canoes and paddles. There will be a structure, relationship

    and context for these words as well, and there's a high

    probability that the relationships and context in the two

    documents will be very similar. The tool will keep these

    relationships in close proximity, and what begins to

    develop is an idea which could be termed a "concept". This

    developing concept will not be "rowboat", "canoe", "oar"

    or "paddle", but something more akin to "floating things

    with moving things", which will eventually be given a label

    by the tool that humans can better deal with (perhaps

    "boats" or another label in high usage). This concept will

    carry with it the words as associated values, and as more

    documents from the corpus are examined, the concept will

    be strengthened or diminished by way of frequency of

    occurrence, or by associating other words with similar

    relationships and contexts. As a result, the words

    themselves become almost irrelevant, and what becomes

    much more predominant is the word usage. In this sense,

    the operation is not confined to a single language; almost

    any language can be examined using the tool.
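
The idea that words are related by their surroundings rather than by their spelling can be illustrated with a toy sketch. The whitepaper does not disclose how the example tool works internally, so the technique below (co-occurrence counts compared by cosine similarity) is a stand-in chosen for illustration only, and the corpus, stopword list and function names are all assumptions.

```python
import math
from collections import Counter

# Hypothetical stopword list and corpus, chosen only for illustration.
STOPWORDS = {"the", "we", "to", "for", "was", "a"}

DOCUMENTS = [
    "we rowed the rowboat across the lake using oars",
    "we paddled the canoe across the lake using paddles",
    "the invoice for the quarter was sent to accounting",
]


def context_vector(word, documents):
    """Count the non-stopword terms that appear in the same document as `word`."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        if word in tokens:
            counts.update(
                t for t in tokens if t != word and t not in STOPWORDS
            )
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


oars = context_vector("oars", DOCUMENTS)
paddles = context_vector("paddles", DOCUMENTS)
invoice = context_vector("invoice", DOCUMENTS)

print(f"oars vs paddles: {cosine(oars, paddles):.2f}")
print(f"oars vs invoice: {cosine(oars, invoice):.2f}")
```

In this toy corpus "oars" and "paddles" never appear in the same document, yet they score as similar because they share surrounding words (across, lake, using), while "invoice" shares none of them.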

    If this example document corpus were a dataset taken from

    some normal corporation or entity, the likelihood of it

    containing a significant number of documents about the

    concept of "boats" would be very low. And if keyword

    searching were used to find "rowboats", the searches

    would completely miss any documents that contained only

    "oars", "canoe" or "paddles". Additional searches would

    have to be run for each of these keywords to find all the

    documents. If the Subject Matter Expert (SME) constructing

    the keyword list were for some reason unaware of the

    existence of canoes or use of the word "canoe", the

    appropriate search would never occur, and that entire

    subset of documents would never be identified. If someone

    wanted to hide the fact they were writing about canoes and

    paddles, all they would have to do is a simple keyword

    switch (for example, use "hammer" and "nails" instead),

    and no keyword search for the known words would ever

    identify those documents.
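
The weakness described above is easy to reproduce with a naive keyword matcher. The sketch below is a minimal illustration, assuming a hypothetical three-document corpus and a hypothetical `keyword_search` helper; it does not represent any particular eDiscovery product.

```python
# Hypothetical three-document corpus for illustration.
DOCUMENTS = {
    "doc1": "the rowboat needs new oars before the trip",
    "doc2": "bring the canoe and two paddles to the dock",
    "doc3": "the hammer and nails are ready at the dock",  # substituted words
}


def keyword_search(documents, keywords):
    """Return sorted IDs of documents containing any keyword as a whole token."""
    keywords = {k.lower() for k in keywords}
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if keywords & set(text.lower().split())
    )


# A single keyword finds only the document that uses that exact word.
single = keyword_search(DOCUMENTS, {"rowboat"})

# Expanding the list helps only for words the list author already knows.
expanded = keyword_search(DOCUMENTS, {"rowboat", "oars", "canoe", "paddles"})

print(single)
print(expanded)
```

The document that uses substituted words (`doc3`) is never returned by either search, however long the list of known keywords becomes.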

    This sort of omission can occur frequently in the Legal

    Services eDiscovery industry, especially with complex

    subjects. As a result, many service providers and law firms

    take extensive measures to minimize this. They hire

    linguists and SMEs. They do outside research. In a previous

    litigation, for example, a product manufacturer was being

    sued for alleged product defects. Independent research

    uncovered that the product line was being sold

    internationally under an entirely different name and

    distribution structure. This fact, which was completely

    unrepresented in the original keyword list, opened whole

    new avenues of discovery, and played a critical role in the

    outcome of the lawsuit. The point, however, is that this

    research was both expensive and slow, and very fortunate

    in this case. It's not so fortunate in every case, and yet the

    expense is the same, whether the results are valuable or

    not. The new tools, on the other hand, make this kind of

    association simply, quickly, efficiently, cheaply and

    automatically.

    To extend the example dataset just a little further, imagine

    that hammers and nails were in fact the things that a

    lawsuit was concerned with, and that rowboats and oars,

    and canoes and paddles were words used in their stead. By

    seeding the tool with information about hammers and

    nails, not only could documents with those words and this

    relationship, if any existed, be returned, the tool would

    then also identify documents with similar structure and

    context, but using "rowboat", "canoe", "oar" and "paddle",

    along with other documents that might contain similar

    word relationships ("stapler" and "staples", for example, or

    words like "spikes" or "brads"), all without any additional

    subject matter knowledge. These documents could never

    be found with keyword searching, unless the keyword list

    were massive and the searching and subsequent review were

    extensive. And yet, phrases like "up a creek without a

    paddle" or "nailed it", which could easily be found in

    informal documents, would not qualify using the tool, because

    the relationship of the words to their surroundings is

    different. Those phrases would definitely be returned by

    keyword searching, however, and would represent documents

    submitted for review that are not relevant to the matter.

    The real value of all the tools, however, and what makes them
    absolutely essential to the future, is their ability to save
    corporations very large sums of money as compared to the costs
    associated with current practices. Table 1 illustrates some
    nominal cost savings that might be expected for a 100 Gigabyte
    document dataset (from 7,500,000 to 10,000,000 pages), using the
    example tool and assuming the process is no better than human
    reviewers. In fact, however, the results obtained by the tools
    are demonstrably much better than those produced by teams of
    reviewers without the use of these tools, with precision
    reaching as high as 97%, as contrasted with roughly 25% for the
    human review. The case study in the sidebar is one such
    documented result.
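
For readers less familiar with the two metrics named here, precision is the share of returned documents that are actually responsive, and recall is the share of all responsive documents that were returned. The sketch below computes both under the standard definitions; the document identifiers and counts are hypothetical and are not taken from the case study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents are sent for review and three of them
# are responsive, out of six responsive documents in the whole corpus.
retrieved = {"d1", "d2", "d3", "d9"}
relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means reviewers spend time on documents that are not responsive; low recall means responsive documents are never seen at all.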

    A significant objection to these tools, as Table 1 also
    illustrates, might be the reduction in revenue to a law firm
    due to the reduced volume of documents going through the review
    process. Nevertheless, this change is unpreventable over the
    long term, and it has the benefit of allowing the law firm to
    focus on the documents and information most likely to be
    responsive. With the mounting volumes of data every corporation
    must deal with, the current cost of eDiscovery is simply
    prohibitive, and is in real danger of being completely
    untenable, thus forcing settlement even where arguments may have
    significant merit or even decisive advantage. The costs can be
    enormous, threatening the very existence of a company. In a
    recent SEC investigation resulting in a $75 million penalty, a
    major corporation spent nearly $200 million on discovery,
    finding and identifying responsive documents. While it's
    difficult to say what the penalty may have been had the
    discovery process been different, the relationship of cost to
    outcome with current eDiscovery practices is not at all unusual.
    Using the example tool, the overall cost to this corporation
    could easily have been reduced from $275 million to at most $175
    million, and perhaps as low as $100 - $150 million. Since the
    new tools also offer better eDiscovery results, there's also a
    reasonable chance that the penalty itself might not have been as
    high (depending, of course, on what engendered the investigation
    and penalty in the first place). Certainly, however, saving
    $100 - $125 million on a single SEC investigation is no small
    amount, and could easily represent the difference between
    profitability and loss, or research and development or
    modernization programs that never got a chance to be undertaken,
    or even the difference between the existence or failure of the
    corporation itself.

    CASE STUDY

    A previously reviewed corpus of 100,000 documents was used for
    the study. From this corpus, the legal team had identified 377
    documents responsive to the matter, and had categorized them
    into five groups, Categories 1 to 5, with the Category 5
    documents being the most important to the matter. The team had
    identified 77 Category 4 and 5 documents. The review had
    required 4 months to complete, at a cost of $49,000 to the
    client.

    Using the example tool described in the whitepaper, the same
    document set was analyzed, with a total elapsed time of two
    days, following a briefing by the attorneys on the matter and
    initial seeding of the analysis engine for the topics of
    concern.

    The tool identified the original plus an additional 90 Category
    4 and 5 documents, and failed to identify fewer than 10 Category
    1 documents which were deemed on further human review to be
    unimportant.

    The total cost to the client for this analysis was $3,000,
    representing a cost savings of $46,000 or nearly 94%, and a time
    savings of 118 days or 98%.
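
The savings quoted in the case study can be checked with a few lines of arithmetic. The cost figures and the two-day elapsed time are taken from the case study; treating the 4-month manual review as roughly 120 days is an assumption made here so that the 118-day figure can be reproduced, and the variable names are illustrative.

```python
# Figures from the case study. The 120-day value is an assumption that
# treats the 4-month manual review as roughly 120 calendar days.
manual_cost = 49_000
tool_cost = 3_000
manual_days = 120
tool_days = 2

cost_savings = manual_cost - tool_cost
cost_savings_pct = 100 * cost_savings / manual_cost
time_savings = manual_days - tool_days
time_savings_pct = 100 * time_savings / manual_days

print(f"Cost savings: ${cost_savings:,} ({cost_savings_pct:.1f}%)")
print(f"Time savings: {time_savings} days ({time_savings_pct:.1f}%)")
```

Both results agree with the case study's rounded figures of nearly 94% and 98%.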

    Another objection to the use of these tools might be that

    until proven otherwise, they could be subject to legal

    challenge. This in fact may be true. However, the processes

    that use these tools are highly replicable and objective -

    much more so than any results obtainable by human

    reviewers. They're significantly and documentably more

    accurate. They've had millions of dollars of academic and

    development research performed in their creation, and

    they have been and are being used in national security and

    other intelligence environments. They may be subject at

    some point to legal challenge, but they will absolutely meet

    and overcome that challenge.

    In conclusion, it is time and past time to begin serious

    investigation of these tools, and to begin the process of

    integrating them into normal workflow. They will represent

    clear and immediate cost savings, and will be highly

    defensible. The case study described in this whitepaper

    also illustrates that predictive review is faster and far less

    expensive than traditional review, while providing results

    that are far more replicable and objective than human

    review. If necessary, these tools can be easily added to or removed

    from the workflow, as most integrate seamlessly with

    existing applications and tools. Whether used in lieu of

    manual review or to augment manual review, their use

    should be considered at all stages in the eDiscovery

    workflow.

    InternationalOne Consulting (IOC) is a consultancy specializing in eDiscovery and ESI tools, strategy, implementation, process and competitive

    analysis and positioning. Its clients include some of the largest and best-known service providers, software companies and law firms in the

    eDiscovery and ESI marketplaces.

    Donald Worthington is a Principal Consultant for InternationalOne Consulting. His experience includes being a founder and Executive Vice

    President for SPi Legal, the director of eDiscovery for Sullivan & Cromwell, many years in IT and engineering/development, and government

    service with the National Security Agency and other agencies.

    To contact Mr. Worthington, send email to [email protected], call at 512-949-9498 or visit InternationalOne

    Consulting on the web at InternationalOneConsulting.com.