September 25, 2006 NASA Feasibility Study Status Update

NASA Milestones

A. Feasibility study to identify the NASA document types - Report - May 31, 2006

B. Form identification and template development - Template set - Aug 31, 2006

C. Enhance classification algorithm for two specific classes - Software packaged - Oct 31, 2006

D. Process study for inter-organizational collections - Configuration software - Dec 1, 2006

E. Enhance engine to recognize two major classes - Software packaged - Dec 15, 2006

F. Evaluation of extraction process - Report - Feb 28, 2006

Form Identification and Template Development August 31 Deliverable

DEMO

Active Tasks for future NASA Milestones

Standard Intermediate Representation of the Scanned Document (IDM)

Design Classification Algorithm

Independent Document Model (IDM)

Platform-independent Document Model

Motivation

• Dramatic XML schema change between OmniPage 14 and 15
• Tie the template engine to a stable specification
• Protects from linking directly to a specific OCR product
• Allows us to include statistics for enhanced feature usage
  - Statistics (e.g. avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount)
  - Supports Pointpage detection, classification

Use XSLT 2.0 stylesheets to transform

• Supporting a new OCR schema only requires generating a new XSLT stylesheet; the engine does not change
• Chain a series of sheets to add functionality (CleanML)
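As an illustration of the statistics listed above, the following Python sketch computes wordCount, avgDocFontSize and a per-page avgPageFontSize from a small XML document. The element and attribute names (page, word, fontSize) are assumptions made for this example; the slides do not show the actual IDM schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical IDM-like input: the element and attribute names
# (page, word, fontSize) are assumptions, not the actual IDM schema.
SAMPLE_IDM = """
<document>
  <page><word fontSize="12">NASA</word><word fontSize="12">Report</word></page>
  <page><word fontSize="10">Abstract</word><word fontSize="8">text</word></page>
</document>
"""

def document_statistics(idm_xml: str) -> dict:
    """Compute word count and average font sizes per page and per document."""
    root = ET.fromstring(idm_xml)
    # One list of word font sizes per page.
    page_sizes = [
        [float(word.get("fontSize")) for word in page.iter("word")]
        for page in root.iter("page")
    ]
    all_sizes = [size for sizes in page_sizes for size in sizes]
    return {
        "wordCount": len(all_sizes),
        "avgDocFontSize": sum(all_sizes) / len(all_sizes) if all_sizes else 0.0,
        "avgPageFontSize": [sum(s) / len(s) if s else 0.0 for s in page_sizes],
    }

stats = document_statistics(SAMPLE_IDM)
```

Comparing a page's average font size against the document average is one plausible way such statistics could feed classification, though the slides do not spell out how they are used.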

IDM Usage

• Each incoming XML schema requires specific XSLT 2.0 Stylesheet

• Resulting IDM Doc used for “Form Based” templates

• IDM transformed into CleanML for “Non-form” templates

[Diagram: IDM transformation flow]

OmniPage 14 XML Doc -> docTreeModelOmni14.xsl -> IDM XML Doc
OmniPage 15 XML Doc -> docTreeModelOmni15.xsl -> IDM XML Doc
Other OCR Output XML Doc -> docTreeModelOther.xsl -> IDM XML Doc
IDM XML Doc -> Form Based Extraction
IDM XML Doc -> docTreeModelCleanML.xsl -> CleanML XML Doc -> Non Form Extraction
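The routing described on this slide can be sketched in Python as a lookup that returns which stylesheets to apply, in order. The stylesheet file names are taken from the slide; the source keys ("omnipage14" and so on) and the function name are hypothetical, and the sketch only selects stylesheets rather than running an XSLT 2.0 processor.

```python
from typing import List

# Stylesheet file names come from the slide; the dictionary keys and
# the function name are hypothetical, chosen for this illustration.
TO_IDM = {
    "omnipage14": "docTreeModelOmni14.xsl",
    "omnipage15": "docTreeModelOmni15.xsl",
    "other": "docTreeModelOther.xsl",
}
IDM_TO_CLEANML = "docTreeModelCleanML.xsl"

def stylesheet_chain(ocr_source: str, form_based: bool) -> List[str]:
    """Return the ordered stylesheets to apply to one OCR output document."""
    if ocr_source not in TO_IDM:
        raise ValueError(f"no stylesheet registered for {ocr_source!r}")
    chain = [TO_IDM[ocr_source]]       # OCR-specific XML -> IDM
    if not form_based:
        chain.append(IDM_TO_CLEANML)   # IDM -> CleanML for non-form templates
    return chain
```

Under this arrangement, supporting another OCR product means adding one entry to the table, which mirrors the slide's point that the engine itself does not change.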

Classification Algorithm

Two approaches:

• Classification (switching) based on image classification

• Post-hoc classification via validation

Post-hoc classification via validation

• Attempt metadata extraction with all plausible templates

• Validate each result set, assigning confidence scores

• Field-specific validation rules may combine
  - statistical models derived for each field of
    - text length
    - % of words from English dictionary
    - % of phrases from knowledge base prepared for that field
  - pattern matching

• Select metadata set with highest confidence score
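The selection loop described above can be sketched in Python as follows. This is a minimal illustration, not the project's engine: the template functions and the scoring function are toy stand-ins invented for the example.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]

def classify_post_hoc(
    document: str,
    templates: Dict[str, Callable[[str], Optional[Metadata]]],
    validate: Callable[[Metadata], float],
) -> Tuple[Optional[str], Optional[Metadata], float]:
    """Run every template, score each result set, keep the best-scoring one."""
    best: Tuple[Optional[str], Optional[Metadata], float] = (None, None, 0.0)
    for name, extract in templates.items():
        metadata = extract(document)
        if metadata is None:  # this template could not be applied
            continue
        score = validate(metadata)
        if best[0] is None or score > best[2]:
            best = (name, metadata, score)
    return best

# Toy stand-ins (hypothetical): two "templates" and a trivial validator.
def template_a(doc: str) -> Optional[Metadata]:
    return {"title": doc.split("\n")[0], "author": ""}

def template_b(doc: str) -> Optional[Metadata]:
    lines = doc.split("\n")
    if len(lines) < 2:
        return None
    return {"title": lines[0], "author": lines[1]}

def fraction_filled(metadata: Metadata) -> float:
    """Confidence = share of fields that are non-empty."""
    return sum(1 for value in metadata.values() if value) / len(metadata)

name, metadata, score = classify_post_hoc(
    "Validation of Extracted Metadata\nSteven J. Zeil",
    {"a": template_a, "b": template_b},
    fraction_filled,
)
```

The document's class is then simply whichever template produced the winning metadata set, which is why no separate classifier is needed in this approach.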

Sample set of extracted metadata bindings

<metadata>
  <author>Steven J. Zeil</author>
  <organization>Old Dominion University Technical Report 2006-24</organization>
  <reportDate>September 12, 2006</reportDate>
  <title>Validation of Extracted Metadata</title>
  <abstract>A lengthy discussion of techniques for validating metadata is </abstract>
</metadata>

Validation template customized for the collection

<val:validate collection="dtic"
    xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
  <val:average>
    <val:field name="author">
      <val:min>
        <val:length/>
        <val:vocabulary/>
        <val:phrases length="2"/>
        <val:phrases length="3"/>
        <val:phrases length="4"/>
      </val:min>
    </val:field>
    <val:field name="organization">
      <val:min>
        <val:length/>
        <val:vocabulary/>
        <val:phrases length="2"/>
        <val:phrases length="3"/>
        <val:phrases length="4"/>
      </val:min>
    </val:field>
    <val:field name="reportNumber">
      <val:max>
        <val:regexp pattern="Technical Report +\d\d\d\d-\d\d"/>
      </val:max>
    </val:field>
    <val:field name="reportDate">
      <val:max>
        <val:dateFormat/>
      </val:max>
    </val:field>
    <val:field name="abstract">
      <val:min>
        <val:length/>
        <val:dictionary/>
      </val:min>
    </val:field>
  </val:average>
</val:validate>
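One plausible reading of the template is that each rule yields a score between 0 and 1, min and max combine the rule scores within a field, and average combines the field scores. The Python sketch below works under that assumption only; the slides do not define the tag semantics. The regular expression is the one from the template, while the date format and the length bounds are hypothetical stand-ins.

```python
import re
from datetime import datetime
from statistics import mean

def regexp_rule(pattern: str):
    """Score 1.0 if the pattern occurs in the value, else 0.0."""
    return lambda value: 1.0 if re.search(pattern, value) else 0.0

def date_format_rule(fmt: str = "%B %d, %Y"):
    """Score 1.0 if the value parses as a date in the assumed format."""
    def rule(value: str) -> float:
        try:
            datetime.strptime(value, fmt)
            return 1.0
        except ValueError:
            return 0.0
    return rule

def length_rule(lo: int, hi: int):
    """Toy stand-in for a statistical length model: 1.0 inside [lo, hi]."""
    return lambda value: 1.0 if lo <= len(value) <= hi else 0.0

# field name -> (combiner, rules); assumed analogue of val:min / val:max.
TEMPLATE = {
    "reportNumber": (max, [regexp_rule(r"Technical Report +\d\d\d\d-\d\d")]),
    "reportDate": (max, [date_format_rule()]),
    "author": (min, [length_rule(3, 60)]),
}

def validate(metadata: dict):
    """Return per-field scores and their average (assumed val:average)."""
    scores = {
        name: combine(rule(metadata.get(name, "")) for rule in rules)
        for name, (combine, rules) in TEMPLATE.items()
    }
    return scores, mean(list(scores.values()))

scores, overall = validate({
    "reportNumber": "Technical Report 2006-24",
    "reportDate": "September 12, 2006",
    "author": "Steven J. Zeil",
})
```

Using min makes a field only as trustworthy as its weakest check, while max lets any one matching pattern vouch for the field; that asymmetry would explain why the template uses min for free-text fields and max for pattern-matched ones.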

Annotated version of the metadata bindings

<metadata confidence="0.59">
  <author confidence="0.85">Steven J. Zeil</author>
  <organization confidence="0.42" warning="inappropriate vocabulary">Old Dominion University Technical Report 2006-24</organization>
  <reportDate confidence="1.0">September 12, 2006</reportDate>
  <title confidence="1.0">Validation of Extracted Metadata</title>
  <abstract confidence="0.3" warning="Unusually short"> A lengthy discussion of techniques for validating metadata is </abstract>
</metadata>
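Annotating a metadata set in this style could be done with a short Python helper like the one below. It is a sketch under stated assumptions: the function name is hypothetical, and the confidence values are copied from the slide and passed in as inputs, not computed here.

```python
import xml.etree.ElementTree as ET

def annotate(metadata_xml: str, scores: dict, warnings: dict, overall: float) -> str:
    """Attach confidence (and optional warning) attributes to each field."""
    root = ET.fromstring(metadata_xml)
    root.set("confidence", str(overall))
    for field in root:
        if field.tag in scores:
            field.set("confidence", str(scores[field.tag]))
        if field.tag in warnings:
            field.set("warning", warnings[field.tag])
    return ET.tostring(root, encoding="unicode")

# Values below are taken from the slide and supplied by hand.
annotated = annotate(
    "<metadata><author>Steven J. Zeil</author>"
    "<abstract>A lengthy discussion of techniques for validating "
    "metadata is </abstract></metadata>",
    scores={"author": 0.85, "abstract": 0.3},
    warnings={"abstract": "Unusually short"},
    overall=0.59,
)
```

Keeping the scores as attributes on the original elements means downstream consumers that ignore them still see the same metadata document.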