Concepts, Semantics and Syntax in E-Discovery

David Eichmann

Institute for Clinical and Translational Science

The University of Iowa

Our Approach

Analyze the human-generated metadata available for document collections for organizational and individual interactions

Explore the syntactic and semantic nature of document content and the potential for automatic generation of metadata

Explore the concept space generated by the previous step and its correspondence to Boolean predicate specification in discovery

Our Target Corpus

The Illinois Institute of Technology Complex Document Information Processing Test Collection (IIT CDIP), v. 1.0

  Derived from the tobacco master settlement agreement

  Comprises 6,910,192 ‘documents’

    Or more properly the OCR output from those documents

  Two merged XML tag sets of metadata, with overlapping content

    <A>

    <LTDLWOCR>

Metadata Entity Frequencies

                            Occurrences
Entity           Total   Distinct  Avg/Entity  Avg/Doc.
Bates        9,476,794  8,054,075        1.18      1.40
Category    13,594,494         74  183,709.38      2.00
Doctype     18,359,644      2,501    7,340.92      2.70
Prodbox      6,830,993      6,306    1,083.25      1.01

                            Occurrences
Entity           Total   Distinct  Avg/Entity  Avg/Doc.
Attendee    65,691,473     49,375    1,330.46      9.68
Brand       26,498,001    155,350      170.57      3.90
Copied       8,775,307    322,294       27.22      1.29

                            Occurrences
Org. Entity      Total   Distinct  Avg/Entity  Avg/Doc.
Author       8,742,976    149,641       58.43      1.29
Mentioned   31,406,753    883,285       35.56      4.63
Receiving    8,262,496     63,625      129.86      1.22

                            Occurrences
Person Entity    Total   Distinct  Avg/Entity  Avg/Doc.
Author      11,128,029    875,292       12.71      1.64
Mentioned   34,683,289  1,938,310       17.89      5.11
Receiving   23,427,415    455,404       51.44      3.45
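
The Avg/Entity column appears to be the Total count divided by the Distinct count. The short Python check below is a sketch under that assumption only; the function name avg_per_entity is ours, and the figures are copied from the Bates, Category, Doctype and Prodbox rows above.

```python
# (total occurrences, distinct values) copied from the table rows above.
rows = {
    "Bates": (9_476_794, 8_054_075),
    "Category": (13_594_494, 74),
    "Doctype": (18_359_644, 2_501),
    "Prodbox": (6_830_993, 6_306),
}


def avg_per_entity(total: int, distinct: int) -> float:
    """Mean occurrences per distinct value, rounded to two places.

    Assumption: this is how the Avg/Entity column was derived.
    """
    return round(total / distinct, 2)


averages = {name: avg_per_entity(t, d) for name, (t, d) in rows.items()}

for name, value in averages.items():
    print(f"{name:<10}{value:>14,.2f}")
```

Under that reading the computed values agree with the Avg/Entity figures listed for these four rows.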

Database Schema

We map the XML structure to a set of relational database tables

Non-recurring fields are collected in a table named ‘document’

  docid

  title

  description

  OCR text

Recurring elements each get a table

  docid

  value
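
One way to picture this mapping is the following minimal Python sketch using the standard sqlite3 and xml.etree modules. It is an illustration, not the schema actually used: the XML tag names, the sample record, the choice of SQLite, and the helper names create_schema and load_record are all assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record; these tag names are illustrative, not the CDIP tag sets.
SAMPLE = """
<record>
  <docid>doc-0001</docid>
  <title>Sample title</title>
  <description>Sample description</description>
  <ocr>Sample OCR text</ocr>
  <author>REININGHAUS,W</author>
  <mentioned>WALK,RA</mentioned>
  <mentioned>ROEMER,E</mentioned>
</record>
""".strip()

SINGLE = ("title", "description", "ocr")  # non-recurring fields
RECURRING = ("author", "mentioned")       # one table per recurring element


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the 'document' table plus one (docid, value) table per element."""
    conn.execute(
        "CREATE TABLE document ("
        "docid TEXT PRIMARY KEY, title TEXT, description TEXT, ocr TEXT)"
    )
    for name in RECURRING:
        conn.execute(
            f"CREATE TABLE {name} ("
            "docid TEXT REFERENCES document(docid), value TEXT)"
        )


def load_record(conn: sqlite3.Connection, xml_text: str) -> None:
    """Insert one XML record: single fields into 'document', the rest per table."""
    root = ET.fromstring(xml_text)
    docid = root.findtext("docid")
    conn.execute(
        "INSERT INTO document VALUES (?, ?, ?, ?)",
        (docid, *(root.findtext(field) for field in SINGLE)),
    )
    for name in RECURRING:
        conn.executemany(
            f"INSERT INTO {name} VALUES (?, ?)",
            [(docid, el.text) for el in root.findall(name)],
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
load_record(conn, SAMPLE)
mentioned = [r[0] for r in conn.execute("SELECT value FROM mentioned ORDER BY value")]
print(mentioned)
```

Keeping each recurring element in its own (docid, value) table means that frequency counts per entity type reduce to simple GROUP BY queries over a single table.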

Identifying an Individual

Person          # of Occurrences as Attendee / Author / Receiver / Mention

REININGHAUS, W 189,380 23,880 32,764 16,152

REININGHAUS 7,337 200 1,974 2,837

REININGHAUS, B 196 2

REININGHAUS, R 17 144 12

How Many Reininghaus?

Reininghaus,R

Reininghaus,W

Co-mention Connections

Reininghaus                Walk
Person           Count     Person           Count
WALK,RA          3,871     REININGHAUS,W    3,871
ROEMER,E         3,716     ROEMER,E         2,883
HAUSSMANN,HJ     3,293     HAUSSMANN,HJ     2,799
TEWES,F          2,784     HACKENBERG,U     2,360


Reininghaus                Roemer
ROEMER,E         3,716     WALK,RA          2,883
HAUSSMANN,HJ     3,293     HACKENBERG,U     2,623
TEWES,F          2,784     HAUSSMANN,HJ     2,573


Reininghaus                Haussmann
ROEMER,E         3,716     WALK,RA          2,799
HAUSSMANN,HJ     3,293     ROEMER,E         2,573
TEWES,F          2,784     VONCKEN,P        2,323
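
Co-mention counts of this kind can be derived from (docid, person) rows. The Python sketch below is one plausible way to do so, assuming that a co-mention means two names appearing in the same document's metadata; the sample rows and the function names co_mention_counts and top_partners are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (docid, person) rows, shaped like a recurring-element table.
mentions = [
    ("d1", "REININGHAUS,W"), ("d1", "WALK,RA"), ("d1", "ROEMER,E"),
    ("d2", "REININGHAUS,W"), ("d2", "WALK,RA"),
    ("d3", "ROEMER,E"), ("d3", "WALK,RA"),
]


def co_mention_counts(rows):
    """Count, for each unordered pair of people, the documents naming both."""
    by_doc = {}
    for docid, person in rows:
        by_doc.setdefault(docid, set()).add(person)
    counts = Counter()
    for people in by_doc.values():
        # Sorting gives each pair a single canonical key.
        for a, b in combinations(sorted(people), 2):
            counts[(a, b)] += 1
    return counts


def top_partners(counts, person, n=3):
    """Return the n people most often co-mentioned with `person`."""
    partners = Counter()
    for (a, b), c in counts.items():
        if a == person:
            partners[b] += c
        elif b == person:
            partners[a] += c
    return partners.most_common(n)


counts = co_mention_counts(mentions)
print(top_partners(counts, "REININGHAUS,W"))
```

Because pairs are unordered, the count is symmetric, which is consistent with the tables above: 3,871 appears for WALK,RA under Reininghaus and for REININGHAUS,W under Walk.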

Co-mention Affiliations

Person                     Affiliation
Reininghaus, Wolf          Gen. Mgr, Contract Research, INBIFO
Walk, Rudiger-Alexander    Dir. Human Studies, Philip Morris
Roemer, Ewald              INBIFO
Haussmann, Hans-Jurgen     Assoc. Prin. Scientist, Philip Morris
Tewes, F.                  Biologist, INBIFO
Hackenberg, Ulrich         INBIFO
Voncken, P.                Chemist, INBIFO

Semantics and Structure

Our analysis of content involves the following phases:

  Lexical analysis

  Sentence boundary detection

  Named entity recognition

  Sentence parsing

  Relationship extraction

The nature of the OCR data seriously impacts each of the phases (sometimes in different ways)

CDIP Parse Tree Complexity

Clean Text Parse Tree Complexity

Next Steps

Experiment with custom lexical analysis of the OCR

  Start with simple white space detection

  Construct a lexicon and look for out-of-band vocabulary as OCR error candidates

  Rewrite the analyzer to support OCR error correction

Run sentence boundary detection and parsing over the full corpus

Generate entity relationships using our question answering framework
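
The first of these steps (white space tokenization, a lexicon, and a search for tokens outside it) can be sketched in Python as below. This is an illustration under stated assumptions, not the planned analyzer: the lexicon is built from a tiny made-up sample, the min_count threshold is arbitrary, and the function names tokenize, build_lexicon and error_candidates are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split on white space only, then trim punctuation from token edges."""
    tokens = (re.sub(r"^\W+|\W+$", "", t).lower() for t in text.split())
    return [t for t in tokens if t]


def build_lexicon(texts, min_count=2):
    """Keep tokens seen at least min_count times (threshold is an assumption)."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {tok for tok, c in counts.items() if c >= min_count}


def error_candidates(text, lexicon):
    """Tokens absent from the lexicon are flagged as possible OCR errors."""
    return [tok for tok in tokenize(text) if tok not in lexicon]


# Made-up sample text standing in for reasonably clean OCR output.
clean = [
    "the contract research report",
    "the research report was sent",
    "the contract was sent",
]
lexicon = build_lexicon(clean)

noisy = "the contrnct research rep0rt"
candidates = error_candidates(noisy, lexicon)
print(candidates)
```

A flagged token is only a candidate: rare names and abbreviations will also fall outside the lexicon, so a correction step would still need to decide which candidates are genuine OCR errors.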

And Beyond That…

Return to the document images and analyze document layout

  Regenerate OCR to include token coordinates

  Use our PDF structure extraction framework to generate logical document structure

  Generate a set of document models based upon similar layout

Use the document models to map OCR text to metadata elements

For Example



Documents