TECHNOLOGY FOR EDD CONSULTANTS, NOT DUMMIES TOPIC: KEYWORD SEARCHING AND METADATA By Sara Emami


DESCRIPTION

Sara's E-Discovery Consulting lecture to lawyers and paralegals on concept and keyword searching.


Page 1: Sara's keyword searching metadata_lecture_revised

TECHNOLOGY FOR EDD CONSULTANTS, NOT DUMMIES

TOPIC: KEYWORD SEARCHING AND METADATA

By Sara Emami

Page 2: Sara's keyword searching metadata_lecture_revised

WHAT IS KEYWORD SEARCHING?

When we think of the term “keyword search,” we are talking about a basic search technique that involves searching for one or more words within a collection of documents.

Typically, a keyword search involves a user typing their search request, or query, into a search engine such as Google, which then returns only those documents that contain the search terms entered. The documents returned by the search engine are called the search results.
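To make the idea concrete, here is a minimal sketch of a keyword search over a tiny, made-up document collection. The function names and sample documents are assumptions chosen for illustration; a real search engine or review platform works at a far larger scale and with more features.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_search(documents, query):
    """Return the ids of documents that contain every word in the query."""
    terms = set(tokenize(query))
    results = []
    for doc_id, text in documents.items():
        # A document is a hit only if all query terms appear in it.
        if terms <= set(tokenize(text)):
            results.append(doc_id)
    return results


# Hypothetical collection, invented for this example.
documents = {
    "doc1": "Brake inspection completed on the Honda Civic.",
    "doc2": "Invoice for an oil change.",
    "doc3": "Customer reported a soft brake pedal.",
}

print(keyword_search(documents, "brake"))        # ['doc1', 'doc3']
print(keyword_search(documents, "honda brake"))  # ['doc1']
```

The list that comes back is the "search results" described above: only the documents that contain the terms entered.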

Page 3: Sara's keyword searching metadata_lecture_revised

KEYWORD SEARCH AND TECHNIQUES

Keyword searching within the EDRM (Electronic Discovery Reference Model) can apply an array of techniques to a variety of data. Often, the data searched are the documents collected for a specific case, but even then the documents can take several forms.

Understanding these forms benefits not only the EDD consultant but also the client, because it informs the best approach to pursuing the case.

Page 4: Sara's keyword searching metadata_lecture_revised

KEYWORD SEARCHES AND TECHNIQUES (cont.)

Computer files are known as Electronically Stored Information, or ESI. They include documents created with Microsoft Word or PowerPoint, email stored as individual message files or together in an Outlook or Notes data file, OCR (Optical Character Recognition) files created from scanned paper documents, and even more exotic files such as those created by a CAD/CAM program. Important cases therefore need computer systems that can store and manage all of this data.

Page 5: Sara's keyword searching metadata_lecture_revised

KEYWORD SEARCHING AND WHY IT IS SIGNIFICANT IN E-DISCOVERY

Search tools and methodologies are significant because they have numerous applications during the e-discovery phase of the litigation lifecycle. They surface the relevant information that clients need for their case.

Let us take a real-life example of the processes and challenges involved in using search, and of how these challenges can be mitigated. Our example involves an automobile accident and a maintenance shop or garage that should have documented a failing brake system but may have been incompetent.

Page 6: Sara's keyword searching metadata_lecture_revised

EXAMPLE 1

Let us say that Attorney John Doe is working on a new case involving a car accident. The plaintiff claims that his local garage failed to spot a failing brake system in his 2004 Honda Civic. As a result, the failing brakes not only caused a major car accident but also caused property damage and bodily injury.

Attorney Jacob Bacon, who is representing the defendant’s garage, has a database containing thousands of documents, including email to and from the plaintiff and the defendant, email from a mailing list for Honda enthusiasts that both plaintiff and defendant participated in, and OCR’d documents including maintenance records and receipts from the garage.

Page 7: Sara's keyword searching metadata_lecture_revised

EXAMPLE 1 (continued)

This time Attorney John Doe runs a concept search using the keywords Honda Civic, brakes, accident, and maintenance. As he scrolls through the results he doesn’t see anything new, until he sees the word “stoppies,” which is unfamiliar to him. A little digging in the result set lets him discover that “stoppies” is a behavior similar to wheelies that can result in damaged brakes.

The documents containing this word reveal that the plaintiff frequently engaged in this dangerous behavior. Attorney Doe now has the ammunition he needs to win his case, using a concept he did not know existed in advance. What exactly is concept searching? Read on to find out.

Page 8: Sara's keyword searching metadata_lecture_revised

CONCEPT SEARCHING

We have discussed the notion of keyword searching, but based on our recent example of the failed brake system involving the Honda, let us examine what concept or “conceptual” searching is.

Concept search is an automated method used to search electronically stored and unstructured text for information based on “ideas” or “concepts”. As we saw in our previous example of the automobile accident, the term “stoppies” represented a concept or idea connected to a failed brake system. The information retrieved in response to a concept query should be relevant to the ideas contained in the text of the query.

Page 9: Sara's keyword searching metadata_lecture_revised

CONCEPT SEARCHING - EXAMPLE

Let us say that you are hired by Oil/Gas Company X, which is in the midst of a lawsuit brought by a terminated employee who is suing the company for wrongful termination. If we want to perform a search on the word “termination,” what other words or ideas related to “termination” can you think of?

Here are some words that might be found in e-mails related to termination: canned, let go, hosed, fired, gatorated, sunset and beaches, retired, vacation, etc. With their advanced capabilities, concept search technologies can identify trends and patterns across documents and produce results that help lawyers and corporations with their litigation.

Page 10: Sara's keyword searching metadata_lecture_revised

CONTEMPORARY EXAMPLE (CAN YOU SPELL ENRON?)

We may all recall the Enron and WorldCom debacles, which highlighted corporate greed and were quite the scandals of the early 2000s. How would concept searching help incriminate the big bad wolves?

Let us take an example of a “code” word that Enron used to hide its activity from authorities and legal entities.

The term “Rawhide” was found in several of the Enron emails. “Rawhide” could mean a kind of leather or an old TV show, but in the context of the Enron emails, “Rawhide” actually refers to one of its off-books partnerships.

“Raptor” was another of those problematic partnerships. So a concept search query in the Enron emails for “Raptor” would not net you documents about hawks, but rather documents about “Rawhide” and other off-books partnerships, even if the words “Raptor” and “Rawhide” did not actually appear in a particular document.

Page 11: Sara's keyword searching metadata_lecture_revised

BENEFITS OF CONCEPT SEARCH

Increased likelihood of finding a larger number of relevant documents

Less time spent perusing irrelevant documents

Less time spent trying to come up with the right keywords

Reduced time, cost, and effort overall in retrieving the best documents for the concept behind your query, in the context of the entire document collection

Page 12: Sara's keyword searching metadata_lecture_revised

EXAMPLE 2

Let us say that a major oil and gas company (Oil Company X) needs a vendor and hires us to assist with its lawsuit against Oil Company Y. The lawsuit alleges intellectual property theft in 2009, and Oil Company X argues that certain words or phrases would incriminate Oil Company Y. How can we assist our client in the most cost-efficient and time-efficient fashion? Keyword searching allows vendors to zoom in on the collected data and find the documents that are relevant to the client’s lawsuit.

Understanding the reasoning behind keyword searching allows us to help our clients.

Page 13: Sara's keyword searching metadata_lecture_revised

WHAT QUALIFIES AS KEYWORD SEARCHING?

Keyword searches are most often used to identify documents that are either responsive or privileged. They are also widely used for large-scale culling and filtering of documents. Keywords often form a basic building block for constructing other, more complex compound searches. Such compound searches use other search elements such as Boolean logic.

Page 14: Sara's keyword searching metadata_lecture_revised

PARAMETERS IN KEYWORD SEARCHING

The syntax of the search string;

Use of the keywords with or without stemming;

Use of keywords with certain wildcard specifications and the syntax for said wildcards;

Case-sensitivity of keywords used in searches and whether the keyword should match both cases;

The target data sources to be searched;

Whether the query can be applied to specific fields such as email ‘To/From’ or ‘Subject’; and

Whether the query can be applied to a specific date range, such as an email ‘Sent Date’ between January 1, 2001 and December 31, 2001.
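Several of these parameters can be sketched as arguments to a single search function. The `Email` record, its field names, and the `search` signature below are hypothetical and only illustrate how case-sensitivity, a target field, and a date range might narrow a keyword query; actual review tools expose these options through their own syntax.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Email:
    """Hypothetical email record with a few searchable fields."""
    sender: str
    subject: str
    body: str
    sent: date


def search(emails, keyword, field="body", case_sensitive=False, start=None, end=None):
    """Return emails whose chosen field contains the keyword as a whole word,
    optionally restricted to a sent-date range (inclusive)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags)
    hits = []
    for msg in emails:
        if start is not None and msg.sent < start:
            continue
        if end is not None and msg.sent > end:
            continue
        if pattern.search(getattr(msg, field)):
            hits.append(msg)
    return hits


# Invented sample data.
emails = [
    Email("garage@example.com", "Brake service", "Pads replaced.", date(2001, 3, 14)),
    Email("owner@example.com", "Re: brake noise", "Still squeaking.", date(2002, 1, 5)),
    Email("list@example.com", "Meetup", "Brake tips inside.", date(2001, 7, 1)),
]

# Subject field only, any case, sent during 2001.
in_2001 = search(emails, "brake", field="subject",
                 start=date(2001, 1, 1), end=date(2001, 12, 31))
print([m.subject for m in in_2001])  # ['Brake service']

# Same keyword, but case-sensitive and with no date restriction.
exact = search(emails, "brake", field="subject", case_sensitive=True)
print([m.subject for m in exact])  # ['Re: brake noise']
```

Changing a single parameter changes the result set, which is why the parameters should be agreed on before a search is run.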

Page 15: Sara's keyword searching metadata_lecture_revised

BOOLEAN SEARCHES

Boolean searches are used to combine the results of multiple searches, and to express alternatives, as when you search for two or more terms but do not necessarily need all of them.
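One way to picture Boolean operators is as set operations on the results of individual keyword searches. The sketch below assumes a small invented collection and a simple word index; the operator names AND, OR, and NOT are the conventional ones, and specific tools may write them differently.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


# Invented sample documents.
documents = {
    1: "brake failure reported after accident",
    2: "routine brake maintenance",
    3: "accident report for insurance",
}

index = build_index(documents)
brake = index["brake"]        # result of the single-term search "brake"
accident = index["accident"]  # result of the single-term search "accident"

print(sorted(brake & accident))  # brake AND accident -> [1]
print(sorted(brake | accident))  # brake OR accident  -> [1, 2, 3]
print(sorted(brake - accident))  # brake NOT accident -> [2]
```

AND narrows the results, OR widens them, and NOT excludes documents that would otherwise match.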

Imagine you are at your local university library and want to perform a search in one of the library databases which houses many of the scholastic journals. You encounter a database form which asks you to enter the

Page 16: Sara's keyword searching metadata_lecture_revised

EXAMPLE OF LIBRARY DATABASE

Page 17: Sara's keyword searching metadata_lecture_revised

EXAMPLES

Page 18: Sara's keyword searching metadata_lecture_revised

WILDCARD

A wildcard is a character that may be used in a search term to represent one or more other characters. It allows you to find words that fit a pattern, such as variant forms or spellings of a word. The two most commonly used wildcards are:

1) The question mark (“?”) may be used to represent a single alphanumeric character in a search expression. For example, searching for the term “ho?se” would yield results which contain such words as “house” and “horse”.

2) The asterisk (“*”) may be used to represent multiple characters in a search expression. For example, searching for the term “sing*” would yield results which contain such words as “singing”.
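A wildcard pattern can be sketched by translating it into a regular expression. The helper names below are made up for this example. The treatment of “*” as "zero or more alphanumeric characters" is an assumption based on the sing* example used later in this lecture; individual search tools define their wildcards in their own way.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression.
    '?' stands for exactly one alphanumeric character.
    '*' stands for zero or more alphanumeric characters (an assumption)."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[A-Za-z0-9]")
        elif ch == "*":
            parts.append("[A-Za-z0-9]*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def wildcard_search(pattern, words):
    """Return the words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [w for w in words if regex.fullmatch(w)]


words = ["house", "horse", "hose", "hoarse", "mouse"]
print(wildcard_search("ho?se", words))  # ['house', 'horse']
```

Note that “hose” and “hoarse” are not returned, because “?” stands for exactly one character.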

Page 19: Sara's keyword searching metadata_lecture_revised

WILDCARD SINGLE CHARACTER

Page 20: Sara's keyword searching metadata_lecture_revised

WILDCARD EXAMPLE WITH MULTIPLE CHARACTERS

Page 21: Sara's keyword searching metadata_lecture_revised

FUZZY SEARCH

Fuzzy search allows searching for word variations, such as misspellings.

Typically, such searching computes some form of distance and score between the specified word and the words in the corpus.

Fuzzy search is specified using the operator: fuzzy-search.
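The "distance" mentioned above can be illustrated with edit distance (Levenshtein distance), which counts the single-character insertions, deletions, and substitutions needed to turn one word into another. This is one common choice, used here as an assumption; the lecture does not say which measure a given tool uses, and the function names and threshold are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def fuzzy_search(term, vocabulary, max_distance=1):
    """Return vocabulary words within max_distance edits of the term."""
    term = term.lower()
    return [w for w in vocabulary if edit_distance(term, w.lower()) <= max_distance]


vocabulary = ["brake", "break", "brakes", "bake", "maintenance"]

# A misspelled query still finds the intended word.
print(fuzzy_search("maintenence", vocabulary))  # ['maintenance']
print(edit_distance("brake", "break"))          # 2
```

Raising `max_distance` finds more variations but also lets in more unrelated words, so the threshold is a trade-off.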

Page 22: Sara's keyword searching metadata_lecture_revised

FUZZY SEARCH EXAMPLE

Page 23: Sara's keyword searching metadata_lecture_revised

SYNONYM SEARCH

A synonym search also matches words that are synonyms of the word being searched. Such searching relies on some form of dictionary or thesaurus-based lookup (e.g. synonyms of “party” include gathering, get-together, and festivity).
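A thesaurus-based lookup can be sketched as expanding the search term into a set of variants before matching. The tiny `THESAURUS` table below is a hypothetical stand-in built from the “party” example; a real tool would rely on a much larger dictionary.

```python
import re

# Hypothetical thesaurus, built from the "party" example in this lecture.
THESAURUS = {
    "party": {"gathering", "get-together", "festivity"},
}


def expand(term):
    """Return the term together with its listed synonyms."""
    term = term.lower()
    return {term} | THESAURUS.get(term, set())


def synonym_search(documents, term):
    """Return ids of documents containing the term or any of its synonyms
    as a whole word or phrase."""
    patterns = [re.compile(r"\b" + re.escape(v) + r"\b", re.IGNORECASE)
                for v in expand(term)]
    return [doc_id for doc_id, text in documents.items()
            if any(p.search(text) for p in patterns)]


# Invented sample documents.
documents = {
    "a": "The office party ran late.",
    "b": "A small get-together after work.",
    "c": "Quarterly budget review.",
}

print(synonym_search(documents, "party"))  # ['a', 'b']
```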

Page 24: Sara's keyword searching metadata_lecture_revised

PROXIMITY SEARCH

A proximity search looks for documents where two or more separately matching term occurrences are within a specified distance, where distance is the number of intermediate words or characters.

In addition to proximity, some implementations may also impose a constraint on the word order, in that the order in the searched text must be identical to the order of the search query. Proximity searching goes beyond the simple matching of words by adding the constraint of proximity and is generally regarded as a form of advanced search.

Page 25: Sara's keyword searching metadata_lecture_revised

PROXIMITY SEARCH EXAMPLE

For example, a search could be used to find "red brick house", and match phrases such as "red house of brick" or "house made of red brick". By limiting the proximity, these phrases can be matched while avoiding documents where the words are scattered or spread across a page or in unrelated articles in an anthology.
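The red brick house example can be sketched as follows. Here the distance is measured as the size of the smallest window of words that contains one occurrence of every term, which is an assumption made for this sketch; tools differ on whether they count intermediate words or characters and on whether word order matters. The function name and the limit of five words are illustrative.

```python
import re
from itertools import product


def within_proximity(text, terms, max_span):
    """Return True if every term occurs in the text and at least one
    combination of occurrences fits inside a window of max_span words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    positions = [[i for i, w in enumerate(words) if w == t.lower()] for t in terms]
    if any(not p for p in positions):
        return False  # at least one term is missing entirely
    # Try every combination of one position per term (fine for a small sketch).
    return any(max(combo) - min(combo) + 1 <= max_span
               for combo in product(*positions))


terms = ["red", "brick", "house"]

print(within_proximity("red house of brick", terms, 5))       # True
print(within_proximity("house made of red brick", terms, 5))  # True

scattered = "The red car passed a house far beyond the old brick wall"
print(within_proximity(scattered, terms, 5))                  # False
```

The scattered sentence contains all three words, so a plain keyword search would return it; the proximity limit is what filters it out.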

Page 26: Sara's keyword searching metadata_lecture_revised

TRUNCATION SPECIFICATION AND STEMMING

Truncation specification is one way to match word variations. Truncation allows for the final few characters to be left unspecified.

Stemming specification is another method for matching word variations. Stemming is the process of finding the root form of a word.

The stemming specification will match all morphological inflections of the word, so that if you enter the search term sing, the stemming matches would include singing, sang, and song. Note that even though a stemming search will return singing for a search term of sing, this is different from wildcard search. A wildcard search for sing* will not return sang or song, while it will return Singsing.
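The difference between the two can be sketched side by side. The `ROOTS` lookup table below is a hypothetical stand-in for a real stemming component and simply mirrors the sing example above; truncation is modeled as a plain prefix match, equivalent to the wildcard search sing*.

```python
# Hypothetical root-form table mirroring the "sing" example in this lecture.
# A real stemmer derives these relationships with rules or a dictionary.
ROOTS = {
    "sings": "sing",
    "singing": "sing",
    "sang": "sing",
    "sung": "sing",
    "song": "sing",
}


def root(word):
    """Return the root form of a word, or the word itself if it is not listed."""
    w = word.lower()
    return ROOTS.get(w, w)


def stemming_search(term, words):
    """Return the words that share a root form with the search term."""
    target = root(term)
    return [w for w in words if root(w) == target]


def truncation_search(prefix, words):
    """Return the words that begin with the prefix (like the wildcard prefix*)."""
    return [w for w in words if w.lower().startswith(prefix.lower())]


words = ["sing", "singing", "sang", "song", "Singsing"]

print(stemming_search("sing", words))    # ['sing', 'singing', 'sang', 'song']
print(truncation_search("sing", words))  # ['sing', 'singing', 'Singsing']
```

Stemming follows the meaning of the word, while truncation follows only its spelling, which is why Singsing appears in one result and sang in the other.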

Page 27: Sara's keyword searching metadata_lecture_revised

WHAT IS METADATA AND WHY IS IT IMPORTANT?

Software programs embed various categories of metadata in the documents users create.

Metadata is significant because it describes how, when, and by whom an electronic document was created, modified, and transmitted.

Unlike paper documents, electronic documents are unique because they carry their history with them.

Paper is boring and belongs to the dinosaurs, as it merely shows us what a document said or looked like. An electronic document tells us where the document went and what it did.

Page 28: Sara's keyword searching metadata_lecture_revised

METADATA AND E-MAILS

An e-mail carries information about its author, creation date, and attachments, as well as the identities of all recipients, including who was CC’ed or BCC’ed.

Metadata also connects attachments to e-mails. Information embedded in other file types may include document names, authors, number of times printed, etc. Track Changes reflects the modifications made by each recipient.
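Much of this e-mail metadata lives in the message headers, which can be read with Python's standard `email` package. The raw message below is invented for illustration, and the `email_metadata` helper is a hypothetical name; the sketch covers only a handful of headers, not attachments or every field a review platform would capture.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Invented sample message.
RAW = """\
From: Garage Owner <owner@example.com>
To: Customer <customer@example.com>
Cc: Mechanic <mechanic@example.com>
Subject: Brake inspection
Date: Wed, 14 Mar 2001 09:30:00 -0600

The inspection report is attached.
"""


def email_metadata(raw):
    """Extract a few header fields from a raw e-mail message."""
    msg = message_from_string(raw)
    return {
        "from": getaddresses(msg.get_all("From", [])),  # list of (name, address)
        "to": getaddresses(msg.get_all("To", [])),
        "cc": getaddresses(msg.get_all("Cc", [])),
        "subject": msg["Subject"],
        "sent": parsedate_to_datetime(msg["Date"]),     # timezone-aware datetime
    }


meta = email_metadata(RAW)
print(meta["from"])              # [('Garage Owner', 'owner@example.com')]
print(meta["cc"])                # [('Mechanic', 'mechanic@example.com')]
print(meta["sent"].isoformat())  # 2001-03-14T09:30:00-06:00
```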

Page 29: Sara's keyword searching metadata_lecture_revised

METADATA AND PRESERVATION

Some methods of document review fail to account for and preserve metadata. If a document is printed in the review or production process, its metadata is lost.

Many lawyers believe they are conducting EDD when in fact they are working with electronic images of documents. The process of scanning and coding documents into a database does not capture original document metadata.

Understand the difference between document metadata versus file system metadata.

Page 30: Sara's keyword searching metadata_lecture_revised

FILE SYSTEM METADATA

When we think of file system metadata, think ‘file timestamps’.

While ‘file metadata’ and ‘timestamps’ are often used interchangeably, they mean two completely different things.

There are two separate sets of timestamps for Office documents and several other file types. The first set is stored by the operating system (Windows, Linux, macOS) and is different from the set stored inside the file.

The metadata stored in a file (Date Created, Date Last Saved, etc.) may also be referred to as the file’s timestamps and confused with what is stored by the operating system.
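The two sets of timestamps can be compared directly. The sketch below builds a stand-in .docx (a ZIP archive holding only a `docProps/core.xml` part, which is where Office Open XML documents keep their core properties) and then reads both the embedded creation date and the timestamp kept by the operating system. The sample author, date, and helper names are made up, and the stand-in is not a complete Word document.

```python
import os
import tempfile
import zipfile
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespaces used by the core-properties part of an Office Open XML file.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Invented core properties for the stand-in document.
CORE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:creator>Hypothetical Author</dc:creator>
  <dcterms:created xsi:type="dcterms:W3CDTF">2009-03-14T09:30:00Z</dcterms:created>
</cp:coreProperties>
"""


def document_metadata(path):
    """Read the author and creation date stored inside the file itself."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    return {
        "creator": root.findtext("dc:creator", namespaces=NS),
        "created": root.findtext("dcterms:created", namespaces=NS),
    }


def file_system_metadata(path):
    """Read the size and last-modified time kept by the operating system."""
    st = os.stat(path)
    return {
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        "size": st.st_size,
    }


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.docx")
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("docProps/core.xml", CORE_XML)
    embedded = document_metadata(path)
    on_disk = file_system_metadata(path)

print("Stored in the file:   ", embedded["created"])
print("Stored by the system: ", on_disk["modified"].isoformat())
```

The embedded date travels with the document, while the operating system's timestamp reflects when this particular copy was written to disk, so the two can differ by years.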

Page 31: Sara's keyword searching metadata_lecture_revised

TYPES OF METADATA

Page 32: Sara's keyword searching metadata_lecture_revised

METADATA WORD/EXCEL

Page 33: Sara's keyword searching metadata_lecture_revised

ARE YOU FAMILIAR WITH THIS?

Page 34: Sara's keyword searching metadata_lecture_revised

METADATA AND RSS (RICH SITE SUMMARY)

Page 35: Sara's keyword searching metadata_lecture_revised

METADATA AND HYPER-TEXT MARKUP LANGUAGE

Page 36: Sara's keyword searching metadata_lecture_revised

CSS AND METADATA

Page 37: Sara's keyword searching metadata_lecture_revised

METADATA ADVANCED

Page 38: Sara's keyword searching metadata_lecture_revised

ADVANCED (continued)

Page 39: Sara's keyword searching metadata_lecture_revised

SUMMARY ON METADATA