Concepts, Semantics and Syntax in E-Discovery. David Eichmann, Institute for Clinical and Translational Science, The University of Iowa


Page 1: Concepts, Semantics and Syntax in E-Discovery

Concepts, Semantics and Syntax in E-Discovery

David Eichmann

Institute for Clinical and Translational Science

The University of Iowa

Page 2: Concepts, Semantics and Syntax in E-Discovery

Our Approach

Analyze the human-generated metadata available for document collections for organizational and individual interactions

Explore the syntactic and semantic nature of document content and the potential for automatic generation of metadata

Explore the concept space generated by the previous step and its correspondence to boolean predicate specification in discovery

Page 3: Concepts, Semantics and Syntax in E-Discovery

Our Target Corpus

The Illinois Institute of Technology Complex Document Information Processing Test Collection (IIT CDIP), v. 1.0

  Derived from the tobacco Master Settlement Agreement

  Comprises 6,910,192 ‘documents’

  Or more properly the OCR output from those documents

  Two merged XML tag sets of metadata, with overlapping content: <A> and <LTDLWOCR>

Page 4: Concepts, Semantics and Syntax in E-Discovery

Metadata Entity Frequencies

                   Occurrences
Entity          Total    Distinct  Avg/Entity  Avg/Doc.
Bates       9,476,794   8,054,075        1.18      1.40
Category   13,594,494          74  183,709.38      2.00
Doctype    18,359,644       2,501    7,340.92      2.70
Prodbox     6,830,993       6,306    1,083.25      1.01
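The Avg/Entity column can be checked directly: it is the Total divided by the Distinct count. Below is a minimal Python sketch using the figures from this slide; the dictionary layout and the function name are illustrative and not part of the original analysis. Avg/Doc. is not recomputed here because the slide does not state which document count was used as the divisor.

```python
# Totals and distinct counts copied from the slide.
rows = {
    "Bates": (9_476_794, 8_054_075),
    "Category": (13_594_494, 74),
    "Doctype": (18_359_644, 2_501),
    "Prodbox": (6_830_993, 6_306),
}


def avg_per_entity(total, distinct):
    """Average number of occurrences per distinct value, to two decimals."""
    return round(total / distinct, 2)


averages = {name: avg_per_entity(t, d) for name, (t, d) in rows.items()}

for name, avg in averages.items():
    print(f"{name:<10}{avg:>14,.2f}")
```

The very large Category average (74 distinct values over more than 13 million occurrences) against the near-1 Bates average shows how differently these fields behave: Category is a small controlled vocabulary, while Bates values are close to unique identifiers.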

Page 5: Concepts, Semantics and Syntax in E-Discovery

Metadata Entity Frequencies

                   Occurrences
Entity          Total    Distinct  Avg/Entity  Avg/Doc.
Attendee   65,691,473      49,375    1,330.46      9.68
Brand      26,498,001     155,350      170.57      3.90
Copied      8,775,307     322,294       27.22      1.29

Page 6: Concepts, Semantics and Syntax in E-Discovery

Metadata Entity Frequencies

                      Occurrences
Org. Entity        Total    Distinct  Avg/Entity  Avg/Doc.
Author         8,742,976     149,641       58.43      1.29
Mentioned     31,406,753     883,285       35.56      4.63
Receiving      8,262,496      63,625      129.86      1.22

Page 7: Concepts, Semantics and Syntax in E-Discovery

Metadata Entity Frequencies

                        Occurrences
Person Entity        Total    Distinct  Avg/Entity  Avg/Doc.
Author          11,128,029     875,292       12.71      1.64
Mentioned       34,683,289   1,938,310       17.89      5.11
Receiving       23,427,415     455,404       51.44      3.45

Page 8: Concepts, Semantics and Syntax in E-Discovery

Database Schema

We map the XML structure to a set of relational database tables

Non-recurring fields are collected in a table named ‘document’

  docid

  title

  description

  OCR text

Recurring elements each get a table

  docid

  value
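The mapping described on this slide can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the slide does not name the database system, the column name `ocr_text`, the recurring-element table name `brand`, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per document for the non-recurring fields; one (docid, value)
# table per recurring element. Names other than docid, title, description
# and value are assumptions made for this sketch.
conn.executescript("""
CREATE TABLE document (
    docid       TEXT PRIMARY KEY,
    title       TEXT,
    description TEXT,
    ocr_text    TEXT
);
CREATE TABLE brand (
    docid TEXT NOT NULL REFERENCES document(docid),
    value TEXT NOT NULL
);
""")

# Invented sample data, not taken from the corpus.
conn.execute(
    "INSERT INTO document VALUES (?, ?, ?, ?)",
    ("doc1", "Sample title", "Sample description", "sample OCR text"),
)
conn.executemany(
    "INSERT INTO brand VALUES (?, ?)",
    [("doc1", "BRAND A"), ("doc1", "BRAND B")],
)

brands = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM brand WHERE docid = ? ORDER BY value", ("doc1",)
    )
]
print(brands)
```

Keeping each recurring element in its own narrow table means a document with many values for one field simply contributes many rows, and frequency counts like those on the earlier slides reduce to COUNT and COUNT(DISTINCT value) queries.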

Page 9: Concepts, Semantics and Syntax in E-Discovery

Identifying an Individual

Person, followed by # of occurrences as: Attendee, Author, Receiver, Mention

REININGHAUS, W 189,380 23,880 32,764 16,152

REININGHAUS 7,337 200 1,974 2,837

REININGHAUS, B 196 2

REININGHAUS, R 17 144 12
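One way to surface name variants like these is to group metadata values by surname and list the initials seen for each. The sketch below assumes the 'SURNAME, INITIALS' convention visible on this slide; the function name and grouping strategy are illustrative, not the method used in the original work.

```python
from collections import defaultdict


def parse_name(raw):
    """Split 'SURNAME, INITIALS' into an (uppercase surname, initials) pair."""
    surname, _, initials = raw.partition(",")
    return surname.strip().upper(), initials.replace(" ", "").upper()


# Variants as they appear on the slides (spacing and case differ).
variants = [
    "REININGHAUS, W",
    "REININGHAUS",
    "REININGHAUS, B",
    "REININGHAUS, R",
    "Reininghaus,W",
]

by_surname = defaultdict(set)
for raw in variants:
    surname, initials = parse_name(raw)
    by_surname[surname].add(initials or "(none)")

candidates = sorted(by_surname["REININGHAUS"])
print(candidates)
```

Grouping only produces a candidate set. It cannot say whether the bare REININGHAUS entries refer to W, B, or R; resolving that needs other evidence, such as who each variant is mentioned alongside.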

Page 10: Concepts, Semantics and Syntax in E-Discovery

How Many Reininghaus?

[Figure: images unavailable; surviving labels are Reininghaus,R and Reininghaus,W]

Page 11: Concepts, Semantics and Syntax in E-Discovery

Co-mention Connections

Reininghaus                Walk
Person          Count      Person           Count
WALK,RA         3,871      REININGHAUS,W    3,871
ROEMER,E        3,716      ROEMER,E         2,883
HAUSSMANN,HJ    3,293      HAUSSMANN,HJ     2,799
TEWES,F         2,784      HACKENBERG,U     2,360

Page 12: Concepts, Semantics and Syntax in E-Discovery

Co-mention Connections

Reininghaus                Roemer
Person          Count      Person           Count
WALK,RA         3,871      REININGHAUS,W    3,716
ROEMER,E        3,716      WALK,RA          2,883
HAUSSMANN,HJ    3,293      HACKENBERG,U     2,623
TEWES,F         2,784      HAUSSMANN,HJ     2,573

Page 13: Concepts, Semantics and Syntax in E-Discovery

Co-mention Connections

Reininghaus                Haussmann
Person          Count      Person           Count
WALK,RA         3,871      REININGHAUS,W    3,293
ROEMER,E        3,716      WALK,RA          2,799
HAUSSMANN,HJ    3,293      ROEMER,E         2,573
TEWES,F         2,784      VONCKEN,P        2,323
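Co-mention counts of this kind can be produced by a self-join over a (docid, value) table of person mentions. The following is a sketch under assumptions: the table name `person_mentioned` is hypothetical, the rows are invented toy data, and the slides do not say whether the reported counts draw on the Mentioned field alone or on all person fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_mentioned (docid TEXT, value TEXT)")

# Toy data for illustration only; not drawn from the corpus.
toy = [
    ("d1", "REININGHAUS,W"), ("d1", "WALK,RA"), ("d1", "ROEMER,E"),
    ("d2", "REININGHAUS,W"), ("d2", "WALK,RA"),
    ("d3", "REININGHAUS,W"), ("d3", "ROEMER,E"),
    ("d4", "WALK,RA"), ("d4", "ROEMER,E"),
    ("d5", "REININGHAUS,W"), ("d5", "WALK,RA"),
]
conn.executemany("INSERT INTO person_mentioned VALUES (?, ?)", toy)


def co_mentions(conn, person):
    """Count the documents in which each other person appears with `person`."""
    sql = """
        SELECT b.value, COUNT(DISTINCT a.docid) AS n
        FROM person_mentioned AS a
        JOIN person_mentioned AS b
          ON a.docid = b.docid AND a.value <> b.value
        WHERE a.value = ?
        GROUP BY b.value
        ORDER BY n DESC, b.value
    """
    return conn.execute(sql, (person,)).fetchall()


result = co_mentions(conn, "REININGHAUS,W")
reverse = co_mentions(conn, "WALK,RA")
print(result)
print(reverse)
```

Because co-mention is symmetric, the count for Walk in Reininghaus's list equals the count for Reininghaus in Walk's list, which matches the pattern on the slides (3,871 in both directions).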

Page 14: Concepts, Semantics and Syntax in E-Discovery

Co-mention Affiliations

Person                        Affiliation
Reininghaus, Wolf             Gen. Mgr, Contract Research, INBIFO
Walk, Rudiger-Alexander       Dir. Human Studies, Philip Morris
Roemer, Ewald                 INBIFO
Haussmann, Hans-Jurgen        Assoc. Prin. Scientist, Philip Morris
Tewes, F.                     Biologist, INBIFO
Hackenberg, Ulrich            INBIFO
Voncken, P.                   Chemist, INBIFO

Page 15: Concepts, Semantics and Syntax in E-Discovery

Semantics and Structure

Our analysis of content involves the following phases:

  Lexical analysis

  Sentence boundary detection

  Named entity recognition

  Sentence parsing

  Relationship extraction

The nature of the OCR data seriously impacts each of the phases (sometimes in different ways)

Page 16: Concepts, Semantics and Syntax in E-Discovery

CDIP Parse Tree Complexity

Page 17: Concepts, Semantics and Syntax in E-Discovery

Clean Text Parse Tree Complexity

Page 18: Concepts, Semantics and Syntax in E-Discovery

Next Steps

Experiment with custom lexical analysis of the OCR

  Start with simple white space detection

  Construct a lexicon and look for out-of-band vocabulary as OCR error candidates

  Rewrite the analyzer to support OCR error correction

Run sentence boundary detection on the full corpus and parse it

Generate entity relationships using our question answering framework
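The first two lexical steps (white-space tokenization, then a lexicon used to flag unusual vocabulary) could be prototyped as below. This is a sketch under assumptions: here the lexicon is simply every token seen at least `min_count` times, which is one possible construction and not necessarily the one intended; the function names and sample sentences are invented.

```python
from collections import Counter


def tokenize(text):
    """Whitespace tokenization with surrounding punctuation stripped."""
    tokens = (t.strip(".,;:!?()[]\"'").lower() for t in text.split())
    return [t for t in tokens if t]


def oov_candidates(docs, min_count=2):
    """Return tokens that fall outside a frequency-based lexicon."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    lexicon = {tok for tok, n in counts.items() if n >= min_count}
    return sorted(tok for tok in counts if tok not in lexicon)


# Invented sample text with two simulated OCR errors (0 for o).
docs = [
    "The smoke study report was sent.",
    "The sm0ke study rep0rt was sent again.",
    "The smoke report was filed.",
]

candidates = oov_candidates(docs)
print(candidates)
```

On this sample the flagged tokens include the simulated errors (sm0ke, rep0rt) but also legitimate rare words (again, filed), so the output is a candidate list for review or for a later correction step, not a list of confirmed OCR errors.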

Page 19: Concepts, Semantics and Syntax in E-Discovery

And Beyond That…

Return to the document images and analyze document layout

  Regenerate OCR to include token coordinates

  Use our PDF structure extraction framework to generate logical document structure

  Generate a set of document models based upon similar layout

Use the document models to map OCR text to metadata elements

Page 20: Concepts, Semantics and Syntax in E-Discovery

For Example


Page 21: Concepts, Semantics and Syntax in E-Discovery

For Example
