How To Take Control of Your Data


  • 7/27/2019 How To Take Control of Your Data


    Database Vol. 2 No. 8
    Issue 08/2013 (11)
    eDiscovery Compendium

    E-DISCOVERY AND E-DISCLOSURE: SAME DIFFERENCE?

    FORENSICS & E-DISCOVERY COMMUNICATION IS THE KEY

    SYMBIOSIS

    CORPORATE E-DISCOVERY SUCCESS STARTS WITH INFORMATION GOVERNANCE

    EDISCOVERY COLLECTIONS

    THERE'S MORE THAN ONE WAY TO COPY A FILE


    HOW TO TAKE CONTROL OF YOUR DATA INSTEAD OF WAITING FOR THE NEXT TRIGGERING EVENT

    by Benjamin Marks and Brent Stan e

    Come e-Discovery counsel throughout the land, and please don't ignore what you can't understand. During a time of political and social upheaval in 1963, American songwriter Bob Dylan penned "The Times They Are a-Changin'." In our community, change continues to occur as data volumes grow.

    The importance of data classification by relevant business purpose, prior to processing, cannot be overstated and must not be misunderstood. Proactive technology choices such as classification create numerous benefits downstream during a litigation event, as well as upstream to manage information governance across an enterprise. Poor quality data might not be searchable, but that must not diminish its relevance or the need to understand its content. Whereas predictive coding employs technology that relies upon the searchability of good quality text, what is your workflow for the boxes of paper and the unsearchable electronic files created from third generation scans?

    INTRODUCTION
    BIG DATA IS GROWING BEYOND YOUR COMMAND, THE OLD METHODS ARE RAPIDLY AGING

    In 2013, unstructured data continues to increase exponentially in volume. For the longest time, our industry has followed the Four Ps (People, Process, Platform, and Protocol) of the decidedly reactive Electronic Discovery Reference Model. Clients relate that their chief problems tend to revolve around productivity, accuracy, risk mitigation, and defensibility of process, and these all have an impact on the bottom line: their legal spend. However, the time has come to understand a Fifth P: PROACTIVE. We now know that not all workflows are equal. An abundance of interest in enterprise-wide Business Process Management (BPM) cost-saving measures is driving solutions towards creation and deployment of a proactive workflow where classification occurs prior to the managed review of documents. Poor quality data is rarely reviewed or effectively searched prior to, or in conjunction with, Rule 26 conferences. Case studies and interactive questions are used to illustrate the concepts in this article.

    What you will learn:
    • The benefits of classification
    • How to manage ugly data and atypical data populations
    • Stakeholder questions in consideration of classification (see Figure 2)

    What you should know:
    • The difference between reactive e-Discovery and proactive Information Governance
    • Not all managed reviews are created equal
    • Technology Assisted Review requires subject matter expertise for effective deployment

    BODY

    Information Lifecycle Management with a foundation in the EDRM is a multi-step process where data is Forensically Collected, Processed and Analyzed, Hosted, and then Reviewed and Produced, according to a very specific protocol and set of instructions. According to the 2012 Rand Report, Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery, Collections and Processing account for about 27% of the litigation spend, while the Review component is roughly 73% of the litigation spend.

    We know that strategic decisions enacted upstream will lead to a proactive and cost-effective workflow downstream. Classification is best applied prior to the processing and analytics step, where the on-average cost of 5 cents per document spent proactively upon classification can offset 50 cents per document spent reactively during a traditional linear managed review. In proactive classification workflows, a cost savings accrues, leading to greater predictability of budget for both time and money, so that the 73% of spend that occurs in review may carry higher value than a mere linear review.
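
    The 5-cent versus 50-cent comparison above can be turned into a rough budget estimate. The following is a minimal sketch, not a pricing model: it assumes that every document is classified, that only a fraction of documents then goes on to review at the same 50-cent rate, and that this fraction (the hypothetical fraction_reviewed parameter) is something you estimate yourself, since the figures above do not state it.

```python
# Per-document costs quoted in the article, in cents.
CLASSIFY_CENTS_PER_DOC = 5
LINEAR_REVIEW_CENTS_PER_DOC = 50


def linear_review_cost_cents(num_docs: int) -> int:
    """Cost of sending every document through a linear review."""
    return num_docs * LINEAR_REVIEW_CENTS_PER_DOC


def classified_review_cost_cents(num_docs: int, fraction_reviewed: float) -> int:
    """Cost of classifying every document, then reviewing only a fraction.

    fraction_reviewed is a hypothetical input between 0.0 and 1.0; the
    article does not say what share of documents still needs review.
    """
    if not 0.0 <= fraction_reviewed <= 1.0:
        raise ValueError("fraction_reviewed must be between 0 and 1")
    classify = num_docs * CLASSIFY_CENTS_PER_DOC
    review = round(num_docs * fraction_reviewed) * LINEAR_REVIEW_CENTS_PER_DOC
    return classify + review


def break_even_fraction() -> float:
    """Share of documents reviewed at which both workflows cost the same."""
    return 1 - CLASSIFY_CENTS_PER_DOC / LINEAR_REVIEW_CENTS_PER_DOC


if __name__ == "__main__":
    docs = 100_000
    print("linear review ($):", linear_review_cost_cents(docs) / 100)
    print("classified, 60% reviewed ($):", classified_review_cost_cents(docs, 0.6) / 100)
    print("break-even fraction:", break_even_fraction())
```

    Under these assumptions, classification pays for itself whenever it keeps more than about one document in ten out of the linear review queue.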

    AS THE PRESENT NOW WILL LATER BE PAST

    Clients shared prior horror stories of what happened on their last managed review, such as performing quality control and finding a high amount of material error in the 1st level review. They were chagrined about the time the vendor promised that the review would be finished in four weeks, but had to add twenty reviewers and work overtime in the fourth week; not unexpectedly, the project bill was over budget. Of equal frustration were the occurrences of last-minute productions delivered to associates with no time to spare for around-the-clock deposition preparation, over the weekend, again. Not surprisingly, these problems occur with greater frequency in a reactive workflow, or where the vendor did not lay the foundation for solutions and defensibility with its failure to ask the right questions. These questions and a client's answers carry a dual purpose: (1) to assist the scoping of the project, and (2) to align value to client needs for the purchase of e-Discovery services (see List 1).

    PROACTIVE APPROACH TO COST SAVINGS
    CONSIDERING NEW TECHNOLOGY

    Client 1 is a Fortune 500 vertically integrated company that relies on several managed review providers and outsourced early case assessment (ECA) tools; ostensibly, they made purchasing decisions based on relationships and price, rather than on an underlying awareness of their needs or the changing technologies in the marketplace. The client shared that they were concerned about the high cost of 1st level document review.

    In an effort to identify cost savings, we offered to re-review their data from a recent case to illustrate how machine learning via a classification tool could provide improved client knowledge about their data, prior to processing and especially prior to managed review, so that intelligent staffing choices could be made for a future managed review.

    LIST 1: SCOPING QUESTIONS

    • What is the subject matter?
      - Similar subject matters may engender similar protocols or review
      - Case matter profiles may be replicated for the client
      - Demonstrable expertise from measurable historic results
    • What is the approximate volume of documents?
      - Working assumptions are confirmed
      - Greater predictability for duration of review
      - Which pricing model to apply, hourly or fixed price per document
    • What are the file types?
      - Impact to timing and workflow requirements
      - Historic file type management on this type of case
      - Identify any special types of skilled reviewers needed
    • What are the average pages per document? What are the average pages per GB? What are the average documents per GB?
      - Collectively, these 3 questions assist identification of an atypical document population
      - Such an identification can alert us to special concerns for staffing before the review begins
      - Comparison against historic workflow profiles for anomalies that may impact timing and other services such as privilege log creation or redaction
    • How many custodians?
      - Comparison for historic hit rate
      - Prioritization for workflow and best practices
      - Staffing needs
    • How many issue tags?
      - Historic responsiveness rates compared to current case
      - Best practices favor 10 tags or fewer
      - Discussion of potential areas of data uncertainty prior to review, so that data may be strategically batched in mitigation of future costly re-review that results from client protocol change
    • What drives your purchasing decision to choose one provider over another?
    • Is there a feature or aspect of your current service that you consider important? Why?
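
    The three volume questions in List 1 (pages per document, pages per GB, documents per GB) lend themselves to a quick calculation. The sketch below is illustrative only: the PopulationProfile type, the function names, and the 50% tolerance are assumptions made for the example, not values from our workflow.

```python
from dataclasses import asdict, dataclass


@dataclass
class PopulationProfile:
    pages_per_doc: float
    pages_per_gb: float
    docs_per_gb: float


def build_profile(total_pages: int, total_docs: int, total_gb: float) -> PopulationProfile:
    """Compute the three scoping ratios from raw collection totals."""
    return PopulationProfile(
        pages_per_doc=total_pages / total_docs,
        pages_per_gb=total_pages / total_gb,
        docs_per_gb=total_docs / total_gb,
    )


def atypical_metrics(current: PopulationProfile,
                     historic: PopulationProfile,
                     tolerance: float = 0.5) -> list[str]:
    """Return the names of ratios that deviate from the historic profile.

    A ratio is flagged when its relative difference from the historic
    value exceeds `tolerance` (0.5 means 50%, an arbitrary example value).
    """
    flagged = []
    baseline = asdict(historic)
    for name, value in asdict(current).items():
        if abs(value - baseline[name]) / baseline[name] > tolerance:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    historic = build_profile(total_pages=500_000, total_docs=100_000, total_gb=100)
    current = build_profile(total_pages=2_000_000, total_docs=20_000, total_gb=100)
    print(atypical_metrics(current, historic))
```

    A population that is flagged on all three ratios, as in the example run, is a candidate for the staffing and workflow discussion before the review begins.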


    We classified data for its relevant business purpose as a precursor to creating a seed set for predictive analytics. We compared the effectiveness of the tool to an existing in-house product. We identified best practices for seed set creation protocols, and can share some lessons learned about the process that will benefit future clients.

    PLEASE CONSIDER A NEW CHANGE IN YOUR DATA WORKFLOW

    For testing, we utilized a sample batch of data consisting of a mix of Excel, Word, PowerPoint, Adobe PDF, and MS Outlook Email. The test data set was initially provided and we then proceeded to analyze the data, create categories for classification, identify a seed set, and proceed with an automated classification process on the remainder.

    The results were provided to us and we loaded them into a Relativity database for testing. The deliverable included a list of identified categories, a list of documents used in their seed set, and a load file listing all documents and their corresponding category. Once all categorization sets were completed, we built saved searches to identify discrepancies. We employed a human element to validate the classifications performed and to create a blind seed set for comparison.
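
    The discrepancy check described above was done with saved searches inside the review database. Outside of that platform, the same idea can be sketched in a few lines. This example assumes a simplified comma-separated load file with doc_id and category columns; that layout and the function names are hypothetical and do not reflect an actual Relativity load file format.

```python
import csv
import io


def read_load_file(text: str) -> dict[str, str]:
    """Parse a load file with 'doc_id' and 'category' columns (assumed layout)."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["doc_id"]: row["category"] for row in reader}


def find_discrepancies(set_a: dict[str, str],
                       set_b: dict[str, str]) -> dict[str, tuple]:
    """Return {doc_id: (category_a, category_b)} where the two sets disagree.

    A document missing from one categorization set is reported with None
    on that side, so gaps surface alongside conflicting categories.
    """
    result = {}
    for doc_id in sorted(set_a.keys() | set_b.keys()):
        a, b = set_a.get(doc_id), set_b.get(doc_id)
        if a != b:
            result[doc_id] = (a, b)
    return result


if __name__ == "__main__":
    tool = read_load_file("doc_id,category\nD1,Contracts\nD2,HR\nD3,Finance\n")
    in_house = read_load_file("doc_id,category\nD1,Contracts\nD2,Finance\nD4,HR\n")
    for doc_id, (a, b) in find_discrepancies(tool, in_house).items():
        print(doc_id, a, b)
```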

    LESSONS LEARNED

    The subject matter expertise of the engagement engineer plays a factor in the way that seed sets are created. The new classification technology was able to classify a higher percentage of documents and illustrated better optimization of multiple file types than any of the in-house categorization sets created by incumbent products. The ability to classify on a relevant business purpose with a robust file identification engine is perhaps one of the largest differentiators between competing technologies. The human intelligence married to the artificial intelligence of machine learning is an important step in the iterative process of seed set creation.

    Subject matter knowledge differs from person to person based on understanding of the type of case, the case in point, familiarity with the use of technology, and professional experience with, or exposure to, documents and concepts that clients provide for production. The Blind Classification set created by the subject matter expert was found to match favorably (72%) with the classifications produced by the machine learning technology in our tool.
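
    A match rate such as the 72% reported above is a simple agreement percentage between two sets of labels. The following sketch shows one way such a figure could be computed, overall and per category; it assumes both sets are available as document-to-category mappings and is not a description of how our tool calculates its reports. The document IDs and categories are invented for the example.

```python
from collections import Counter


def agreement_rate(human: dict[str, str], machine: dict[str, str]) -> float:
    """Percentage of shared documents given the same category by both."""
    shared = human.keys() & machine.keys()
    if not shared:
        raise ValueError("the two sets have no documents in common")
    matches = sum(1 for doc_id in shared if human[doc_id] == machine[doc_id])
    return 100.0 * matches / len(shared)


def agreement_by_category(human: dict[str, str], machine: dict[str, str]) -> dict[str, float]:
    """Agreement percentage for each category assigned by the human expert."""
    totals, matches = Counter(), Counter()
    for doc_id in human.keys() & machine.keys():
        category = human[doc_id]
        totals[category] += 1
        if machine[doc_id] == category:
            matches[category] += 1
    return {c: 100.0 * matches[c] / totals[c] for c in totals}


if __name__ == "__main__":
    blind_set = {"D1": "HR", "D2": "HR", "D3": "Finance", "D4": "Legal"}
    machine_set = {"D1": "HR", "D2": "Finance", "D3": "Finance", "D4": "Legal"}
    print(agreement_rate(blind_set, machine_set))
    print(agreement_by_category(blind_set, machine_set))
```

    The per-category view is useful because a respectable overall figure can hide one category where the expert and the machine rarely agree, which is where the next training iteration should focus.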

    PROCEDURAL BEST PRACTICES

    • Process source data to expose actual (as opposed to stated) file types, system files, duplicates and near-duplicates.
    • Classification on multiple relevant business purposes helps understand data and leads to a higher quality of prioritized data, different from a linear review.
    • Define categories and identify where overlap occurs. It is a prioritized classification and potentially responsive; certain categories may require a 2nd look as part of the iterative process, prior to managed review.
    • Clients should be encouraged to provide a list of responsive terms and privilege names during custodial collection, for the purpose of data mapping and classification, prior to the project kickoff.
    • A Potential Privilege filter can be applied based upon a list of counsel names and can mitigate the impact that occurs from inconsistent coding in a traditional linear review.
    • On a case-by-case basis, confirm with your vendor who from their pool of candidates and subject matter experts will be provided for supervised machine learning and seed set creation.
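
    The Potential Privilege filter in the list above can be illustrated with a small sketch. It assumes each email is reduced to a document ID, a list of participant addresses, and body text, and that the client has supplied counsel names and law firm domains. The Email type, the function names, and the plain substring matching are simplifications chosen for the example, not a description of a production filter.

```python
from dataclasses import dataclass


@dataclass
class Email:
    doc_id: str
    participants: list[str]  # addresses such as "jane.roe@lawfirm.example"
    body: str = ""


def is_potentially_privileged(email: Email,
                              counsel_names: list[str],
                              legal_domains: list[str]) -> bool:
    """Flag an email if a participant uses a legal domain or a counsel name appears."""
    domains = {d.lower() for d in legal_domains}
    for address in email.participants:
        if address.lower().rsplit("@", 1)[-1] in domains:
            return True
    # Search the body and the address list for any counsel name.
    haystack = (email.body + " " + " ".join(email.participants)).lower()
    return any(name.lower() in haystack for name in counsel_names)


def split_for_privilege_review(emails, counsel_names, legal_domains):
    """Return (potentially_privileged_ids, remaining_ids), preserving order."""
    privileged, remainder = [], []
    for email in emails:
        if is_potentially_privileged(email, counsel_names, legal_domains):
            privileged.append(email.doc_id)
        else:
            remainder.append(email.doc_id)
    return privileged, remainder


if __name__ == "__main__":
    emails = [
        Email("E1", ["smith@acme.example", "counsel@lawfirm.example"]),
        Email("E2", ["smith@acme.example"], "Per Jane Roe's advice, hold the draft."),
        Email("E3", ["smith@acme.example", "jones@acme.example"], "Agenda attached."),
    ]
    print(split_for_privilege_review(emails, ["Jane Roe"], ["lawfirm.example"]))
```

    Routing the flagged set to one privilege team up front is what avoids two reviewers making inconsistent calls on the same thread.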

    IF YOUR TIME TO YOU IS WORTH SAVIN'

    To summarize, proactive pre-processing classification takes a large corpus of unstructured data and organizes it around a central business purpose or theme. This categorization prioritizes and in turn reduces the number of documents that undergo a traditional linear first pass review for responsiveness. A reduced volume of documents leads to a reduced labor cost, where fewer reviewers are needed to accomplish the same task, perhaps in fewer hours, days, or weeks.

    The potentially responsive documents are classified and prioritized around the relevant purpose, and the potentially non-responsive documents are set aside for later review, if necessary. No coding decisions to tag have been made at this stage. Neither have non-responsive documents had to be processed in order to determine that they do not meet threshold requirements for responsive production.

    EVOLUTION OF ENGINEERED REVIEW

    Through the use of proactive classification, we have transformed managed review into an engineered review. It's a more efficiently staffed project. We train and qualify our review team on the classification of the data and the alignment of who, what, when and how. Everything that we learned from the classification process is a point of knowledge for the case, and this is conveyed through the delivery of a production binder documenting every step taken for defensibility.

    • Better trained reviewers make fewer material errors because the training on the quality control process is very robust.
    • Productive reviewers complete batches faster because they are not distracted by uncategorized linear data.


    CLASSIFICATION USE CASE

    Classification organizes the data and themes emerge. Trends and occurrences are readily visible as patterns of behavior: Who was talking to whom, about what, how and when did this occur?

    • Every month for 9 months, Smith and Jones had a meeting, and exchanged 3 emails with 6 attachments.
    • There were always 3 spreadsheets, 1 HR Word document related to goal measurement, a PowerPoint presentation for the board of directors, and an agenda.
    • There were multiple drafts of the PowerPoint.
    • There were requests for legal advice that made some of the documents potentially privileged.
    • Documents involving lawyers and legal domain names, identified in advance through the use of classification tools as potentially privileged, were set aside for the Privilege Review team instead of having to be reviewed twice, at the risk of making an inconsistent call.

    Classification identifies frequency of events, conversations, and third parties to a conversation.

    Then one day, in the 10th month, Smith and Jones introduced Davis, a competitor, to the mix of their regularly patterned behavior. All of a sudden, Smith and Jones were scheduling a meeting with Davis to discuss fixing a price.

    Consider the following questions. Could you have found that in a traditional linear review? When would you have found it? Would you have noticed the frequent pattern of behavior for 9 months and then spotted the anomaly, Davis, in the 10th month? What if you had different reviewers on the two batches, a distinct likelihood?

    In a classification system, you could find it with frequency reports, and then, using the iterative process of machine learning, train the machine to find other documents like that smoking gun whose existence was previously unknown. Data can be batched specific to this particular incident before reviewers are in their seats, and classification can provide valuable case knowledge in the instance where you aren't necessarily aware of what you did not know.

    One by-product for the corporation that engages in classification is the understanding of their data in terms of knowledge management. Classification can deliver reports on frequency of nouns and verbs, both for defensibility of the process undertaken (for use in Rule 26 meet and confers) and for the identification of the next triggering event. In this manner the wheel is not recreated each and every time there is a triggering event.
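
    The Smith, Jones and Davis pattern above can be expressed as a very small frequency check. The sketch below assumes each message has already been reduced to a month index and a set of participant names, and it treats the first nine months as the baseline. The function name and the input shape are hypothetical, and a real classification system would look at far richer signals than participant lists.

```python
from collections import defaultdict


def new_participants_by_month(messages, baseline_months: int) -> dict[int, list[str]]:
    """Report participants who first appear after the baseline period.

    messages: iterable of (month_index, participants) pairs, months from 1.
    Returns {month: sorted names not seen in any baseline month}.
    """
    baseline = set()
    later = defaultdict(set)
    for month, participants in messages:
        if month <= baseline_months:
            baseline.update(participants)
        else:
            later[month].update(participants)
    # Compare only after the loop, so the baseline is complete.
    return {
        month: sorted(people - baseline)
        for month, people in sorted(later.items())
        if people - baseline
    }


if __name__ == "__main__":
    traffic = [(m, {"Smith", "Jones"}) for m in range(1, 10)]
    traffic.append((10, {"Smith", "Jones", "Davis"}))
    print(new_participants_by_month(traffic, baseline_months=9))
```

    The output singles out Davis in month 10, which is the kind of departure from an established pattern that two different reviewers working separate linear batches could easily miss.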


    Rather, they are tuned in to the proactive prioritization of the classifications. Thus, they are more likely to spot outliers and departures in behavior patterns, analyze sentiment in a message, and spot differences not readily found in a traditional linear review (see Frame: Classification Use Case).

    UGLY DATA AND ATYPICAL DOCUMENT POPULATIONS

    Client 2 is a Fortune 100 commercial bank. Because we have a very deep understanding of this bank's litigation matters, we undertook three custom tasks that would be considered atypical by any vendor standard in the e-discovery industry. While many providers would shy away from undertaking such projects, these were the perfect test cases to employ technology, identify efficiencies, and share results both with our banking client and with other companies who face the same challenges (see Frame: Why is data classification a good idea for your organization?).

    WHAT IS UGLY DATA?

    Ugly data is poor quality data that originated as a paper document at some point in its life. One easy-to-digest example is the process of contract execution, where a contract was printed and signed, then scanned and sent to a counterparty or additional signatory for signing, where it was re-scanned and returned. That was at least three generations off the original. Depending upon the quality of the printout and scan, there may be some loss of fidelity during the OCR conversion from native file to TIFF. Recent work for clients in the oil and gas industry required the cleanup of a fax document for the production of a maintenance report related to a well (see Figures 1, 2).

    LARGE PDF SPLITTING AND CLASSIFICATION

    Transfer of assets and collection of work product across several vertical markets has resulted in the records for such being compiled into a single PDF, usually with no index. This condition is prevalent in the oil and gas and mortgage industries, where the records associated with the asset are created as these large PDFs. The holder of these PDFs is forced to reconstruct the original document collection in order to determine the presence of critical records and/or recreate a database of key attributes contained within the documents. In addition, the quality of the OCR text is usually poor, severely limiting the usefulness of search-based interrogation. Manually splitting these PDFs into their original documents is an expensive and time-consuming process.

    We were able to train on a seed set of documents and automatically split 21 Loan Files into 1900 PDFs, the original document set, accurately identifying the logical document breaks and auto-classifying each document to high levels of accuracy. New document naming conventions are auto-generated, usually based on appending the original file name with the page range of the new document (see Figure 3). The client provided a list of 13 categories by which to place documents. For comparison, we had our Haystac technology go head to head with human reviewers. The technology was able to categorize all of the documents and left fewer documents in the OTHER category than the off-shore human review team. The advantage of Haystac's machine-based process is quicker recognition of error patterns and their correction, thus eliminating the inherent variability of human judgment. The process can be applied to millions of pages of PDFs and produce results in a fraction of the time of its manual counterpart.
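
    The naming convention mentioned above, where the page range is appended to the original file name, can be sketched as follows. The example assumes the logical document breaks have already been found and are given as the first page of each document. The exact name pattern (an underscore, the letter p, and the range) is an illustrative choice, not the format our tool generates.

```python
from pathlib import PurePath


def split_document_names(original_name: str,
                         break_pages: list[int],
                         total_pages: int) -> list[str]:
    """Build one file name per logical document from its page range.

    break_pages holds the 1-based first page of each logical document,
    in ascending order, starting with page 1.
    """
    if not break_pages or break_pages[0] != 1:
        raise ValueError("break_pages must start at page 1")
    if sorted(set(break_pages)) != list(break_pages) or break_pages[-1] > total_pages:
        raise ValueError("break_pages must be ascending, unique, and within the file")
    path = PurePath(original_name)
    names = []
    for index, start in enumerate(break_pages):
        # A document ends one page before the next break, or at the last page.
        end = break_pages[index + 1] - 1 if index + 1 < len(break_pages) else total_pages
        names.append(f"{path.stem}_p{start}-{end}{path.suffix}")
    return names


if __name__ == "__main__":
    for name in split_document_names("LoanFile_01.pdf", [1, 5, 12], total_pages=20):
        print(name)
```

    Keeping the source file name and page range in each new name preserves the trail back to the original compiled PDF, which helps when the split itself has to be defended.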

    WHY IS DATA CLASSIFICATION A GOOD IDEA FOR YOUR ORGANIZATION?

    Classification equals preparedness for all stakeholders. Who is involved in your enterprise with making these decisions? Do you have any of these concerns?

    RECORDS MANAGEMENT
    • Indexing and remediation of legacy data for storage.
    • What are the new record-keeping requirements under Dodd-Frank?
    • Classification can reduce annual storage costs at the terabyte and petabyte level.

    RISK MANAGEMENT, AUDIT, AND COMPLIANCE
    • Defensible deletion reduces enterprise risk. Are we holding data for too long?
    • How will we meet the new statutory regimes for reporting under Dodd-Frank?
    • Classification can establish cost-effective predictability for compliance and mitigate costs found in a risk profile.

    C-SUITE MANAGEMENT
    • Periodic M&A events that require great due diligence.
    • Regulatory Compliance.
    • Future litigation.
    • Classification provides a material benefit to C-Suite stakeholders.

    LEGAL AND GENERAL COUNSEL
    • Reduction of labor cost occurs at the most variable portion of a managed review.
    • Improved productivity to reach higher priority data where strategic decisions are made.
    • Classification enables greater accuracy to allow the production of data sooner.

    CORPORATE KNOWLEDGE BASE
    • A repository allows clear insight into the language used to discuss common business events:
      - Who was talking to whom,
      - When these conversations were occurring, and
      - Identification of a pattern of expected behavior, thus enabling the visibility of outliers, anomalies, and departures from the pattern: in essence, needles in a haystack.
    • Classification enables the creation of a corporate repository and promotes reusability of data, so that you no longer have to recreate the wheel.


    AUTO EXTRACTION OF TEXT FOR LOGICAL DOCUMENT DETERMINATION

    Poor quality text documents can constitute a significant percentage of stored documents. Scanned documents are typically stored as TIFF or PDF files on file servers and email archives and are usually poorly indexed, making them hard to find using enterprise search engines. In addition, important records stored in boxes and files are also poorly indexed at the box or file level, making the box or file contents blind to the enterprise. Manually indexing these documents is resource intensive and costly, yet locating important records is very meaningful to satisfy audit, investigatory and document control objectives, as well as meeting information governance requirements.

    Document titles are often a key indicator of the purpose of the document, so accurately and cost-effectively determining the title means that the document's importance as a record can be determined. Determining the title allows classifying the document to a business purpose using database mapping. We use a soft dictionary-based approach to identifying document titles: a dictionary has been compiled from common business function-based documents and is supplemented with actual document headers gleaned by sampling client data. Image processing extracts the title fragment and algorithmic processing determines the most probable title match. The user interface contains an editor which allows the user to view machine results, enter new headers and correct errors.

    On this task, we extracted document titles that were meaningful to the useful categorization of the poor quality OCRed documents.
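
    The dictionary-based title matching described above can be approximated with fuzzy string comparison. The sketch below uses Python's standard difflib module as a stand-in for the "algorithmic processing" step. The matching algorithm actually used is not specified here, and the dictionary entries and the 0.6 cutoff are assumptions made for the example.

```python
import difflib
from typing import Optional

# Hypothetical title dictionary; a real one would be compiled from common
# business documents and supplemented with headers sampled from client data.
TITLE_DICTIONARY = [
    "PROMISSORY NOTE",
    "DEED OF TRUST",
    "TITLE INSURANCE POLICY",
    "APPRAISAL REPORT",
]


def most_probable_title(fragment: str,
                        dictionary: list[str] = TITLE_DICTIONARY,
                        cutoff: float = 0.6) -> Optional[str]:
    """Return the dictionary title closest to an OCR'd title fragment.

    The fragment is upper-cased and its whitespace collapsed before
    comparison. None is returned when nothing scores above the cutoff,
    so the fragment can be routed to a human editor instead.
    """
    normalized = " ".join(fragment.upper().split())
    matches = difflib.get_close_matches(normalized, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(most_probable_title("PROMlSSORY  N0TE"))  # typical OCR confusions
    print(most_probable_title("DEED 0F TRUST"))
    print(most_probable_title("XQZ"))
```

    Returning nothing for a weak match, instead of forcing the nearest title, mirrors the role of the editor in the user interface: uncertain headers are left for a person to confirm or add to the dictionary.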

    AUTO-EXTRACTION OF TEXT FOR REPORTING PURPOSES

    There were 68 fields of entry on a custom reporting document for which items such as loan #, amounts, codes, dates, borrower names, mortgage lenders, title insurance, and other information were required. Our auto-extraction technology was able to accurately populate the data on an Excel spreadsheet in response to the government request for production.

    Figure 1. Poor Quality Data

    Figure 2. Data Cleaned Up
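
    To make the field extraction step more concrete, here is a minimal sketch that pulls three of the field types named above (loan number, amount, date) out of OCR text and writes them as spreadsheet-ready CSV. The regular expressions, field names, and label wording are hypothetical; the real report had 68 fields, and our extraction technology is not pattern matching of this simple kind.

```python
import csv
import io
import re

# Hypothetical patterns for three of the report fields.
FIELD_PATTERNS = {
    "loan_number": re.compile(r"Loan\s*#\s*:?\s*(\d+)"),
    "amount": re.compile(r"Amount\s*:?\s*\$?([\d,]+\.\d{2})"),
    "date": re.compile(r"Date\s*:?\s*(\d{2}/\d{2}/\d{4})"),
}


def extract_fields(text: str) -> dict[str, str]:
    """Return one value per field; fields that are not found stay empty."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_PATTERNS), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    page = "Loan #: 12345  Amount: $250,000.00  Date: 03/15/2009"
    print(to_csv([extract_fields(page)]))
```

    Leaving a field empty when no pattern matches, instead of guessing, keeps the gaps visible for quality control before anything is produced.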

    CUSTOM SOLUTIONS YIELD WORKFLOW BENEFITS

    • Accurately Identify Important Records without Manual Brute Force Processing
    • Scale Classification to Large Document Collections
    • Eliminate Unnecessary Documents from Storage
    • Improve the Odds of Finding Critical Records
    • Increase the Speed of Getting Results

    CONCLUSION
    THE SLOW ONES NOW WILL LATER BE FAST

    Prospects and clients will one day realize that the lowest vendor price does not always equate to the best value for their litigation spend. On the review side, value can be added where Multi-Class Classification occurs prior to processing and Subject Matter Expertise is applied as the human complement to machine learning. Value is enhanced where the proactive use of technology removes inefficiencies, leading to improved knowledge management and ultimately a higher quality litigation spend. Proposed changes to the Federal Rules of Civil Procedure (proportionality amendments) will seek to reduce time delays and extraordinary costs associated with e-Discovery where such costs outweigh the utility of the task undertaken; in this regard, classification applied in a proactive workflow would meet the goal of proportionality because better organization of data upstream can save downstream costs.

    The benefits of proactive classification at the outset of an engineered review are multiple:

    • Increased productivity on a 1st level review adds value to the predictability of your litigation budget for time and money.
    • A fully defensible engineered review process mitigates a client's risk profile.
    • Reduction in legal spend through a more efficient engineered review, where fewer attorneys are needed for a 1st level review, is in essence doing more with less.
    • A Corporate Knowledge Base is created, which signifies an advance in the reuse of data.
    • Accuracy and robust quality control protocols enable the direction and allocation of litigation spend towards higher value legal functions, sooner.

    OH THE TIMES, THEY ARE A-CHANGIN'

    About the Author

    Benjamin S. Marks is a consultant on e-Discovery and Information Governance initiatives. Most recently, he assisted development of a document review center in Charlotte, North Carolina, and new product introduction for an e-Discovery service provider. An entrepreneurial, strategic-minded lawyer with a business operations background, Ben's prior work on staffing managed reviews affords him the insight to identify subject matter expertise for teams, develop proactive workflows, and assemble responses to RFPs. Prior to law school, Ben was the founder of Eco Specialties and Design, an environmentally themed promotions company. Today, when he's not building seed sets or reading about Dodd-Frank's impact on enterprise risk management, Ben follows Orioles baseball, attends live music events, enjoys cooking, and runs with his puggle in Baltimore, Maryland. He holds a J.D. and Environmental Certificate from Pace University School of Law.