September 25, 2006
NASA Feasibility Study Status Update
NASA Milestones
A. Feasibility study to identify the NASA document types - report - May 31, 2006
B. Form identification and template development - template set - Aug 31, 2006
C. Enhance classification algorithm for two specific classes - software packaged - Oct 31, 2006
D. Process study for inter-organizational collections - configuration software - Dec 1, 2006
E. Enhance engine to recognize two major classes - software packaged - Dec 15, 2006
F. Evaluation of extraction process - report - Feb 28, 2006
Form Identification and Template Development August 31 Deliverable
DEMO
Active Tasks for future NASA Milestones
Standard Intermediate Representation of the Scanned Document (IDM)
Design Classification Algorithm
Independent Document Model (IDM)
Platform-independent document model
Motivation
• Dramatic XML schema change between OmniPage 14 and 15
• Tie the template engine to a stable specification
• Protects from linking directly to a specific OCR product
• Allows us to include statistics for enhanced feature usage
  - Statistics (e.g. avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount, etc.)
• Supports Pointpage detection, classification
Use XSLT 2.0 stylesheets to transform
• Supporting a new OCR schema only requires generation of a new XSLT stylesheet; the engine does not change
• Chain a series of sheets to add functionality (CleanML)
IDM Usage
• Each incoming XML schema requires specific XSLT 2.0 Stylesheet
• Resulting IDM Doc used for “Form Based” templates
• IDM transformed into CleanML for “Non-form” templates
[Diagram: IDM processing flow]
OmniPage 14 XML Doc -> docTreeModelOmni14.xsl -> IDM XML Doc
OmniPage 15 XML Doc -> docTreeModelOmni15.xsl -> IDM XML Doc
Other OCR Output XML Doc -> docTreeModelOther.xsl -> IDM XML Doc
IDM XML Doc -> Form Based Extraction
IDM XML Doc -> docTreeModelCleanML.xsl -> CleanML XML Doc -> Non Form Extraction
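The stylesheet chaining described on this slide can be sketched as a small lookup. This is only an illustration, not the project's engine: the stylesheet file names come from the slide, while the format keys (`omnipage14`, `omnipage15`, `other`) and the function name `stylesheet_chain` are assumptions made for the example. It only selects which sheets to apply; running XSLT 2.0 itself would need an external processor.

```python
# Hypothetical registry: OCR output format -> stylesheet that produces IDM.
# The file names are taken from the slide; the dictionary keys are assumed labels.
TO_IDM = {
    "omnipage14": "docTreeModelOmni14.xsl",
    "omnipage15": "docTreeModelOmni15.xsl",
    "other": "docTreeModelOther.xsl",
}

# Stylesheet that turns an IDM document into CleanML (name from the slide).
IDM_TO_CLEANML = "docTreeModelCleanML.xsl"


def stylesheet_chain(ocr_format: str, form_based: bool) -> list:
    """Return the ordered stylesheets to apply to one OCR output document."""
    if ocr_format not in TO_IDM:
        raise ValueError(f"no IDM stylesheet registered for {ocr_format!r}")
    chain = [TO_IDM[ocr_format]]       # first hop: OCR schema -> IDM
    if not form_based:
        chain.append(IDM_TO_CLEANML)   # second hop: IDM -> CleanML
    return chain


print(stylesheet_chain("omnipage15", form_based=False))
```

Under this layout, supporting a new OCR schema means adding one entry to the registry; the rest of the chain stays untouched.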
Classification Algorithm
Two approaches:
Classification (switching) based on image classification
Post-hoc classification via validation
Post-hoc classification via validation
• Attempt metadata extraction with all plausible templates
• Validate each result set, assigning confidence scores
• Field-specific validation rules, may combine
  - statistical models derived for each field of
    - text length
    - % of words from English dictionary
    - % of phrases from knowledge base prepared for that field
  - pattern matching
• Select metadata set with highest confidence score
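The selection loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: `classify_post_hoc`, the toy templates, and `toy_validate` are all hypothetical names, and the toy validator simply scores the fraction of required fields that are non-empty in place of the statistical rules listed on the slide.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]


def classify_post_hoc(
    document: dict,
    templates: Dict[str, Callable[[dict], Metadata]],
    validate: Callable[[Metadata], float],
) -> Tuple[Optional[str], Optional[Metadata], float]:
    """Run every template, score each result, and keep the best one.

    Returns (template name, metadata, confidence); if no templates are
    given the result is (None, None, -1.0).
    """
    best: Tuple[Optional[str], Optional[Metadata], float] = (None, None, -1.0)
    for name, extract in templates.items():
        metadata = extract(document)
        score = validate(metadata)
        if score > best[2]:
            best = (name, metadata, score)
    return best


# Toy stand-ins (assumptions for illustration only).
def toy_validate(md: Metadata) -> float:
    required = ("author", "title")
    return sum(1 for f in required if md.get(f)) / len(required)


toy_templates = {
    "form_a": lambda doc: {"author": doc.get("line1", ""), "title": ""},
    "form_b": lambda doc: {"author": doc.get("line2", ""),
                           "title": doc.get("line1", "")},
}

doc = {"line1": "Validation of Extracted Metadata", "line2": "Steven J. Zeil"}
name, metadata, score = classify_post_hoc(doc, toy_templates, toy_validate)
print(name, score)
```

The template that wins the comparison doubles as the document's class, which is what makes the classification "post-hoc".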
Sample set of extracted metadata bindings
<metadata>
  <author>Steven J. Zeil</author>
  <organization>Old Dominion University Technical Report 2006-24</organization>
  <reportDate>September 12, 2006</reportDate>
  <title>Validation of Extracted Metadata</title>
  <abstract>A lengthy discussion of techniques for validating metadata is </abstract>
</metadata>
Validation template customized for the collection
<val:validate collection="dtic"
    xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
  <val:average>
    <val:field name="author">
      <val:min>
        <val:length/>
        <val:vocabulary/>
        <val:phrases length="2"/>
        <val:phrases length="3"/>
        <val:phrases length="4"/>
      </val:min>
    </val:field>
    <val:field name="organization">
      <val:min>
        <val:length/>
        <val:vocabulary/>
        <val:phrases length="2"/>
        <val:phrases length="3"/>
        <val:phrases length="4"/>
      </val:min>
    </val:field>
    <val:field name="reportNumber">
      <val:max>
        <val:regexp pattern="Technical Report +\d\d\d\d-\d\d"/>
      </val:max>
    </val:field>
    <val:field name="reportDate">
      <val:max>
        <val:dateFormat/>
      </val:max>
    </val:field>
    <val:field name="abstract">
      <val:min>
        <val:length/>
        <val:dictionary/>
      </val:min>
    </val:field>
  </val:average>
</val:validate>
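The way the template nests `min`, `max`, and `average` can be mimicked with plain functions. The sketch below is an assumption-heavy illustration, not the behavior of the Jelly tag library: `regexp_rule`, `length_rule`, `field_min`, `field_max`, and `validate` are hypothetical names, each rule is reduced to a 0.0 or 1.0 score, a missing field is scored as an empty string, and the regular expression is applied as a search rather than a full match. Only the pattern itself is taken from the template.

```python
import re
from statistics import mean


def regexp_rule(pattern):
    """Score 1.0 if the pattern occurs in the text, else 0.0 (assumed semantics)."""
    compiled = re.compile(pattern)
    return lambda text: 1.0 if compiled.search(text) else 0.0


def length_rule(low, high):
    """Hypothetical stand-in for a statistical model of field length."""
    return lambda text: 1.0 if low <= len(text) <= high else 0.0


def field_min(*rules):
    """A field is only as good as its weakest rule."""
    return lambda text: min(rule(text) for rule in rules)


def field_max(*rules):
    """A field passes if any one rule is satisfied."""
    return lambda text: max(rule(text) for rule in rules)


def validate(metadata, field_rules):
    """Average the per-field scores; missing fields are scored as ''."""
    scores = [rule(metadata.get(name, "")) for name, rule in field_rules.items()]
    return mean(scores)


rules = {
    "reportNumber": field_max(regexp_rule(r"Technical Report +\d\d\d\d-\d\d")),
    "author": field_min(length_rule(3, 60)),
}

print(validate({"reportNumber": "Technical Report 2006-24",
                "author": "Steven J. Zeil"}, rules))
print(validate({"author": "Steven J. Zeil"}, rules))
```

Using `min` inside a field makes every rule mandatory, while `max` lets alternative formats coexist; the outer average then keeps one weak field from zeroing out the whole record.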
Annotated version of the metadata bindings
<metadata confidence="0.59">
  <author confidence="0.85">Steven J. Zeil</author>
  <organization confidence="0.42" warning="inappropriate vocabulary">Old Dominion University Technical Report 2006-24</organization>
  <reportDate confidence="1.0">September 12, 2006</reportDate>
  <title confidence="1.0">Validation of Extracted Metadata</title>
  <abstract confidence="0.3" warning="Unusually short">A lengthy discussion of techniques for validating metadata is </abstract>
</metadata>
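A downstream consumer could use the annotations to decide which fields need a second look. The sketch below parses the annotated sample with Python's standard `xml.etree.ElementTree`; the function name `fields_needing_review` and the 0.5 threshold are assumptions chosen for illustration and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# The annotated sample from the slide, with line breaks normalized.
ANNOTATED = """<metadata confidence="0.59">
  <author confidence="0.85">Steven J. Zeil</author>
  <organization confidence="0.42" warning="inappropriate vocabulary">Old Dominion University Technical Report 2006-24</organization>
  <reportDate confidence="1.0">September 12, 2006</reportDate>
  <title confidence="1.0">Validation of Extracted Metadata</title>
  <abstract confidence="0.3" warning="Unusually short">A lengthy discussion of techniques for validating metadata is </abstract>
</metadata>"""


def fields_needing_review(xml_text, threshold=0.5):
    """List (field, confidence, warning) for fields scored below the threshold."""
    root = ET.fromstring(xml_text)
    flagged = []
    for field in root:
        # Treat a missing confidence attribute as 0 (an assumption).
        confidence = float(field.get("confidence", "0"))
        if confidence < threshold:
            flagged.append((field.tag, confidence, field.get("warning")))
    return flagged


for tag, confidence, warning in fields_needing_review(ANNOTATED):
    print(f"{tag}: {confidence} ({warning})")
```

With the assumed threshold, the `organization` and `abstract` fields are flagged, matching the two fields that carry warnings in the sample.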