Concepts, Semantics and Syntax in E-Discovery
David Eichmann
Institute for Clinical and Translational Science
The University of Iowa
Our Approach
Analyze the human-generated metadata available for document collections for organizational and individual interactions
Explore the syntactic and semantic nature of document content and the potential for automatic generation of metadata
Explore the concept space generated by the previous step and its correspondence to boolean predicate specification in discovery
Our Target Corpus
The Illinois Institute of Technology Complex Document Information Processing Test Collection (IIT CDIP), v. 1.0
Derived from the tobacco master settlement agreement
Comprises 6,910,192 ‘documents’
  Or more properly the OCR output from those documents
Two merged XML tag sets of metadata, with overlapping content
  <A>
  <LTDLWOCR>
Metadata Entity Frequencies
Entity      Occurrences
            Total        Distinct    Avg/Entity   Avg/Doc.
Bates       9,476,794    8,054,075   1.18         1.40
Category    13,594,494   74          183,709.38   2.00
Doctype     18,359,644   2,501       7,340.92     2.70
Prodbox     6,830,993    6,306       1,083.25     1.01
Metadata Entity Frequencies
Entity      Occurrences
            Total        Distinct    Avg/Entity   Avg/Doc.
Attendee    65,691,473   49,375      1,330.46     9.68
Brand       26,498,001   155,350     170.57       3.90
Copied      8,775,307    322,294     27.22        1.29
Metadata Entity Frequencies
Org. Entity   Occurrences
              Total        Distinct    Avg/Entity   Avg/Doc.
Author        8,742,976    149,641     58.43        1.29
Mentioned     31,406,753   883,285     35.56        4.63
Receiving     8,262,496    63,625      129.86       1.22
Metadata Entity Frequencies
Person Entity   Occurrences
                Total        Distinct    Avg/Entity   Avg/Doc.
Author          11,128,029   875,292     12.71        1.64
Mentioned       34,683,289   1,938,310   17.89        5.11
Receiving       23,427,415   455,404     51.44        3.45
Database Schema
We map the XML structure to a set of relational database tables
Non-recurring fields are collected in a table named ‘document’
  docid
  title
  description
  OCR text
Recurring elements each get a table
  docid
  value
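The mapping described above can be sketched in Python with the standard-library sqlite3 and xml.etree.ElementTree modules. This is a minimal sketch under stated assumptions, not the schema actually used: the sample record, its tag names, and the helper names (create_schema, load_record, entity_stats) are hypothetical. Only the layout, one ‘document’ table plus one docid/value table per recurring element, comes from the slide.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record; these tag names are illustrative, not the CDIP tag set.
SAMPLE = """
<record>
  <docid>doc-001</docid>
  <title>Sample title</title>
  <description>Sample description</description>
  <ocr_text>Sample OCR text</ocr_text>
  <author>DOE, J</author>
  <mentioned>ROE, R</mentioned>
  <mentioned>DOE, J</mentioned>
</record>
""".strip()

SINGLE = ("title", "description", "ocr_text")  # non-recurring fields
RECURRING = ("author", "mentioned")            # one table per recurring element


def create_schema(conn):
    conn.execute(
        "CREATE TABLE document (docid TEXT PRIMARY KEY, title TEXT, "
        "description TEXT, ocr_text TEXT)"
    )
    for name in RECURRING:
        # Table names come from the fixed RECURRING tuple, never from input.
        conn.execute(f"CREATE TABLE {name} (docid TEXT, value TEXT)")


def load_record(conn, xml_text):
    root = ET.fromstring(xml_text)
    docid = root.findtext("docid")
    fields = [root.findtext(tag) for tag in SINGLE]
    conn.execute("INSERT INTO document VALUES (?, ?, ?, ?)", [docid, *fields])
    for name in RECURRING:
        for elem in root.findall(name):
            conn.execute(f"INSERT INTO {name} VALUES (?, ?)", (docid, elem.text))


def entity_stats(conn, table):
    # Total occurrences, distinct values, and average occurrences per value.
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT value) FROM {table}"
    ).fetchone()
    return total, distinct, (total / distinct if distinct else 0.0)


conn = sqlite3.connect(":memory:")
create_schema(conn)
load_record(conn, SAMPLE)
stats = entity_stats(conn, "mentioned")
```

With tables shaped this way, the Total, Distinct, and Avg/Entity columns of a frequency table reduce to a single aggregate query per recurring element.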
Identifying an Individual
Person, # of Occurrences as
Attendee Author Receiver Mention
REININGHAUS, W 189,380 23,880 32,764 16,152
REININGHAUS 7,337 200 1,974 2,837
REININGHAUS, B 196 2
REININGHAUS, R 17 144 12
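The table above shows one surname appearing with and without initials. One way to start sorting such variants is to split each string into a surname and a given-name part and group on both. The following is a minimal sketch, assuming names follow the ‘SURNAME, I’ pattern seen in the table; parse_name and group_variants are hypothetical helpers, not part of the original system.

```python
from collections import defaultdict


def parse_name(raw):
    """Split 'SURNAME, I' or 'SURNAME,I' into (surname, given-name letters)."""
    surname, _, rest = raw.partition(",")
    given = "".join(ch for ch in rest.upper() if ch.isalpha())
    return surname.strip().upper(), given


def group_variants(names):
    """Group raw name strings by surname, then by given-name letters."""
    groups = defaultdict(lambda: defaultdict(list))
    for raw in names:
        surname, given = parse_name(raw)
        groups[surname][given].append(raw)
    return groups


names = ["REININGHAUS, W", "REININGHAUS", "REININGHAUS, B", "Reininghaus,R"]
groups = group_variants(names)
# A bare surname (empty given-name part) is ambiguous: it could belong to any
# of the initialed variants, so it stays in its own bucket rather than merged.
candidates = sorted(k for k in groups["REININGHAUS"] if k)
```

Grouping alone does not decide whether the bare ‘REININGHAUS’ entries belong to W, B, or R; that needs further evidence, such as who else appears in the same documents.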
How Many Reininghaus?
Reininghaus,R
Reininghaus,W
Co-mention Connections
Reininghaus                Walk
Person          Count      Person          Count
WALK,RA         3,871      REININGHAUS,W   3,871
ROEMER,E        3,716      ROEMER,E        2,883
HAUSSMANN,HJ    3,293      HAUSSMANN,HJ    2,799
TEWES,F         2,784      HACKENBERG,U    2,360
Co-mention Connections
Reininghaus                Roemer
Person          Count      Person          Count
WALK,RA         3,871      REININGHAUS,W   3,716
ROEMER,E        3,716      WALK,RA         2,883
HAUSSMANN,HJ    3,293      HACKENBERG,U    2,623
TEWES,F         2,784      HAUSSMAN,HJ     2,573
Co-mention Connections
Reininghaus                Haussmann
Person          Count      Person          Count
WALK,RA         3,871      REININGHAUS,W   3,293
ROEMER,E        3,716      WALK,RA         2,799
HAUSSMANN,HJ    3,293      ROEMER,E        2,573
TEWES,F         2,784      VONCKEN,P       2,323
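Co-mention counts like those in the tables above can be derived from per-document lists of mentioned persons. The sketch below assumes a count is the number of documents in which two persons are both mentioned; the slides do not state the exact counting rule, and the co_mentions helper and the toy data are hypothetical.

```python
from collections import Counter


def co_mentions(mentions_by_doc, target):
    """Count documents in which each other person appears alongside `target`."""
    counts = Counter()
    for people in mentions_by_doc.values():
        unique = set(people)  # count each person once per document
        if target in unique:
            counts.update(unique - {target})
    return counts


# Hypothetical toy data: docid -> persons mentioned in that document.
mentions_by_doc = {
    "d1": ["REININGHAUS,W", "WALK,RA", "ROEMER,E"],
    "d2": ["REININGHAUS,W", "WALK,RA"],
    "d3": ["WALK,RA", "TEWES,F"],
}
top = co_mentions(mentions_by_doc, "REININGHAUS,W").most_common(2)
```

Under this rule the count is symmetric, which agrees with the tables: Reininghaus and Walk show 3,871 in both directions.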
Co-mention Affiliations
Person                     Affiliation
Reininghaus, Wolf          Gen. Mgr, Contract Research, INBIFO
Walk, Rudiger-Alexander    Dir. Human Studies, Philip Morris
Roemer, Ewald              INBIFO
Haussmann, Hans-Jurgen     Assoc. Prin. Scientist, Philip Morris
Tewes, F.                  Biologist, INBIFO
Hackenberg, Ulrich         INBIFO
Voncken, P.                Chemist, INBIFO
Semantics and Structure
Our analysis of content involves the following phases:
  Lexical analysis
  Sentence boundary detection
  Named entity recognition
  Sentence parsing
  Relationship extraction
The nature of the OCR data seriously impacts each of the phases (sometimes in different ways)
CDIP Parse Tree Complexity
Clean Text Parse Tree Complexity
Next Steps
Experiment with custom lexical analysis of the OCR
  Start with simple white space detection
  Construct a lexicon and look for out-of-band vocabulary as OCR error candidates
  Rewrite the analyzer to support OCR error correction
Run sentence boundary detection on, and parse, the full corpus
Generate entity relationships using our question answering framework
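The first step in the list above (white space tokenization, a lexicon, and out-of-vocabulary tokens as OCR error candidates) can be illustrated with a short Python sketch. It rests on assumptions that are not in the slides: the lexicon is built from the corpus itself using a frequency threshold (min_count), and the function names and toy sentences are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Simple white space tokenization, lowercased, edge punctuation stripped."""
    tokens = (tok.strip(".,;:!?()[]\"'").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def build_lexicon(texts, min_count=2):
    """Treat tokens seen at least `min_count` times as known vocabulary."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {tok for tok, n in counts.items() if n >= min_count}


def ocr_error_candidates(text, lexicon):
    """Return tokens that fall outside the lexicon, in order of appearance."""
    return [tok for tok in tokenize(text) if tok not in lexicon]


# Hypothetical toy corpus; 'rep0rt' stands in for an OCR misread of 'report'.
corpus = [
    "The report was sent to the committee.",
    "The committee received the report.",
    "The rep0rt was sent to the committee.",
]
lexicon = build_lexicon(corpus)
candidates = ocr_error_candidates(corpus[2], lexicon)
```

A frequency threshold also flags rare but legitimate words (here ‘received’ occurs only once), so the output is a list of candidates for review or correction, not confirmed errors.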
And Beyond That…
Return to the document images and analyze document layout
  Regenerate OCR to include token coordinates
  Use our PDF structure extraction framework to generate logical document structure
  Generate a set of document models based upon similar layout
Use the document models to map OCR text to metadata elements
For Example