Upload
liberty-goodman
View
23
Download
3
Embed Size (px)
DESCRIPTION
Automated Form processing for DTIC Documents. March 20, 2006 Presented By, K. Maly, M. Zubair, S. Zeil. Outline. Overall process for handling documents in batches Issues Results Conclusion. Overall process for handling documents in batches. start. 1. - PowerPoint PPT Presentation
Citation preview
Automated Form processing for DTIC Documents
March 20, 2006
Presented By,
K. Maly, M. Zubair, S. Zeil
Outline
Overall process for handling documents in batches
Issues
Results
Conclusion
Overall process for handling documents in batches
Figure: Flowchart of Processing One Document
Read Next Page
Match against all form templates
no pages left
Matched Templates # >0
Add the page with its templates
into candidate set
start
Get the best one
Have candidates?
1
Extract metadata
End
Store “resolved” results
Move to “unresolved”
folder2
1. Omnipage xml document having 10 pages (first 5 and last 5).
2. Possibly, more than one page will have a match with more than one templates. At this time, we do not check how well they matched.
3. Determined by the ratio of the number of fields matched over the total number of fields.
3
yes
no yes
no
no
yes
Issues in form based metadata extraction
Issue Solution
1) Illegal characters in omnipage xml document that causes fatal parser exceptions.
Before processing the xml pages are cleaned to remove these illegal characters.
2) Forms may miss some obvious fields. Not yet addressed
3) Incorrect form captions due to
OCR errors.
resolved using edit distance algorithm
4) Forms miss captions. These forms are detected using the metadata key field names.
5) Forms may span multiple pages. The code processes the pages subsequent to the form page to find if the form spans multiple pages.
6) Word length boundary detection. Solved to a certain limit, by using different string matching algorithms.
7) Metadata field names have different variations. Match field name part by part
8) Coverage type is missing Not yet addressed
9) Document is wrongly identified Not yet addressed
10) OCR Error Not yet addressed
Results Of 246 Documents Results Of 100 Documents
• For example in the following document, the POINT page (first page) has the author, but the form doesn’t. http://128.82.7.208:9090/dtic/newdocs/sf298/formdocs/pdfs/ADA425677.pdf
Forms are missing some obvious fields
In the following form, the caption “REPORT DOCUMENTATION PAGE” is OCRed incorrectly as “REPORT DOCUMENTA110N PAGE “. These type of OCR errors are resolved using edit distance.
The following has no form caption.
If the captions of a form page is missing, we recognize it as a form if more than 10 metadata field names have been found.
The following form spans on two pages.
After finding a form page, we check the following pages by using field name match to see if it’s a part of the form.
In the following form we have word boundary detection errors for metadata field names.For example, “4. TITLE AND SUBTITLE” appears as “4 . T ITL E A ND SUB TI TL E”.
(We use the following seqence for matching field names: exact match, match after removing white spaces, similar match (using edit distance))
Following are parts of two forms, where we can see the variations for the field “17. LIMITATION OF ABSTRACT”.
Here we recognize the field name by matching it part by part.
If the cell boundary information is available (i.e. "17. LIMITATION OF ABSTRACT" is in one cell), we will also rebuild the text field name by connecting the texts in the cell (e.g. "17.", "LIMITATION", "OF", "ABSTRACT" ===> "17. LIMITATION OF ABSTRACT") and match it against defined field name directly. Its worth noting that not all form pages have cell boundary information.
Coverage Type Missing In the Original Document
The Title is missing in the Third Field of the PDF document it should contain “REPORT TYPE AND DATES COVERED”
Identified as sf298_1
The current templates identified this form but failed to extract because this wasa new kind of form and we can handle this case by writing a new template.
OCR Error In the Date Covered Field
OCR has produced a garbage for the Third Field (From-TO) In the Dates Covered Field
OCR Error In the Date Covered Field
OCR has produced a garbage for the Third Field (From-TO) In the Dates Covered Field
We are currently handling six types of forms (through templates), five are variations of sf298 form (Report Documentation Page) and one is other type of form.
For any new forms templates can be written to handle them.
Following are the recall and precision results based on 264 documents.
class # documents Recall Precision
sf298_1 92 95% 97%
sf298_2 137 92% 98%
sf298_3 2 93% 100%
sf298_4 24 100% 100%
citation_1 9 100% 100%
Results of 264 Documents
Results of 100 Documents
class # documents Recall Precision
sf298_1 30 91% 95%
sf298_2 30 98% 99%
sf298_3 10 68% (Problem with sf298_3)
96%
sf298_4 10 100% 100%
Control 10 96% 100%
citation_1 10 100% 100%
Execution Time : The Code took 21 hrs, 58 minutes to process our testbed of 10K pdf documents.
We found that for 10k documents we are getting good results for most of the form classes and relatively poor performance for sf298_3 due to OCR errors.
Conclusion