Automated Form processing for DTIC Documents

Automated Form processing for DTIC Documents

March 20, 2006

Presented By,

K. Maly, M. Zubair, S. Zeil

Outline

Overall process for handling documents in batches

Issues

Results

Conclusion

Overall process for handling documents in batches

Figure: Flowchart of Processing One Document

Read Next Page

Match against all form templates

no pages left

Matched Templates # >0

Add the page with its templates

into candidate set

start

Get the best one

Have candidates?

1

Extract metadata

End

Store “resolved” results

Move to “unresolved”

folder2

1. Omnipage xml document having 10 pages (first 5 and last 5).

2. Possibly, more than one page will have a match with more than one templates. At this time, we do not check how well they matched.

3. Determined by the ratio of the number of fields matched over the total number of fields.

3

yes

no yes

no

no

yes

Issues in form based metadata extraction

Issue Solution

1) Illegal characters in omnipage xml document that causes fatal parser exceptions.

Before processing the xml pages are cleaned to remove these illegal characters.

2) Forms may miss some obvious fields. Not yet addressed

3) Incorrect form captions due to

OCR errors.

resolved using edit distance algorithm

4) Forms miss captions. These forms are detected using the metadata key field names.

5) Forms may span multiple pages. The code processes the pages subsequent to the form page to find if the form spans multiple pages.

6) Word length boundary detection. Solved to a certain limit, by using different string matching algorithms.

7) Metadata field names have different variations. Match field name part by part

8) Coverage type is missing Not yet addressed

9) Document is wrongly identified Not yet addressed

10) OCR Error Not yet addressed

Results Of 246 Documents Results Of 100 Documents

• For example in the following document, the POINT page (first page) has the author, but the form doesn’t. http://128.82.7.208:9090/dtic/newdocs/sf298/formdocs/pdfs/ADA425677.pdf

Forms are missing some obvious fields

http://128.82.7.208:9090/dtic/newdocs/sf298/formdocs/pdfs/ADA425677.pdf

In the following form, the caption “REPORT DOCUMENTATION PAGE” is OCRed incorrectly as “REPORT DOCUMENTA110N PAGE “. These type of OCR errors are resolved using edit distance.

The following has no form caption.

If the captions of a form page is missing, we recognize it as a form if more than 10 metadata field names have been found.

The following form spans on two pages.

After finding a form page, we check the following pages by using field name match to see if it’s a part of the form.

In the following form we have word boundary detection errors for metadata field names.For example, “4. TITLE AND SUBTITLE” appears as “4 . T ITL E A ND SUB TI TL E”.

(We use the following seqence for matching field names: exact match, match after removing white spaces, similar match (using edit distance))

Following are parts of two forms, where we can see the variations for the field “17. LIMITATION OF ABSTRACT”.

Here we recognize the field name by matching it part by part.

If the cell boundary information is available (i.e. "17. LIMITATION OF ABSTRACT" is in one cell), we will also rebuild the text field name by connecting the texts in the cell (e.g. "17.", "LIMITATION", "OF", "ABSTRACT" ===> "17. LIMITATION OF ABSTRACT") and match it against defined field name directly. Its worth noting that not all form pages have cell boundary information.

Coverage Type Missing In the Original Document

The Title is missing in the Third Field of the PDF document it should contain “REPORT TYPE AND DATES COVERED”

Identified as sf298_1

The current templates identified this form but failed to extract because this wasa new kind of form and we can handle this case by writing a new template.

OCR Error In the Date Covered Field

OCR has produced a garbage for the Third Field (From-TO) In the Dates Covered Field

OCR Error In the Date Covered Field

OCR has produced a garbage for the Third Field (From-TO) In the Dates Covered Field

We are currently handling six types of forms (through templates), five are variations of sf298 form (Report Documentation Page) and one is other type of form.

For any new forms templates can be written to handle them.

Following are the recall and precision results based on 264 documents.

class # documents Recall Precision

sf298_1 92 95% 97%

sf298_2 137 92% 98%

sf298_3 2 93% 100%

sf298_4 24 100% 100%

citation_1 9 100% 100%

Results of 264 Documents

Results of 100 Documents

class # documents Recall Precision

sf298_1 30 91% 95%

sf298_2 30 98% 99%

sf298_3 10 68% (Problem with sf298_3)

96%

sf298_4 10 100% 100%

Control 10 96% 100%

citation_1 10 100% 100%

Execution Time : The Code took 21 hrs, 58 minutes to process our testbed of 10K pdf documents.

We found that for 10k documents we are getting good results for most of the form classes and relatively poor performance for sf298_3 due to OCR errors.

Conclusion