
IJDAR (2007) 9:263–279
DOI 10.1007/s10032-006-0022-0

ORIGINAL PAPER

User-configurable OCR enhancement for online natural history archives

Andy Downton · Jingyu He · Simon Lucas

Received: 14 February 2005 / Revised: 3 November 2005 / Accepted: 28 May 2006 / Published online: 4 August 2006
© Springer-Verlag 2006

Abstract The creation of structured digital libraries from paper-based archives is an area of growing demand in many scientific and cultural fields, and is not satisfied either by off-the-shelf OCR or commercial form-processing systems. This paper describes and evaluates a configurable archive construction system, which integrates document image pre-processing and analysis with text post-processing tools and a standard OCR package to meet digital archiving requirements. The prototype system is currently being used in conjunction with the UK Natural History Museum to help convert more than 500,000 cards of Lepidoptera (Butterflies and Moths) and Coleoptera (Beetles) to searchable digital archives. Evaluation results covering different aspects of the system, from card scanning to overall word recognition rates for different database fields, are summarised for two datasets comprising over 5,000 cards selected from different parts of these archives. First-pass end-to-end word recognition rates of 70–90% are reported for key data fields, subject to availability of suitable electronic dictionaries. Further validation and correction is supported through web-editing of the online digital archive.

Keywords Document analysis · Digital archive · OCR

A. Downton (B) · J. He · S. Lucas
Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, UK
e-mail: [email protected]

J. He
e-mail: [email protected]

S. Lucas
e-mail: [email protected]

1 Introduction

Digital archive construction from historic paper archives is a major image analysis application of interest both for cultural and scientific purposes. For example, 17 papers out of 53 presented at the most recent Document Analysis Systems workshop were concerned with analysis of historical documents [1].

Archive documents are often stored in well-structured taxonomies (e.g. libraries, scientific specimen indexes and censuses), where the structure extends across the index as well as within the layout of each document. Documents are recorded using text which challenges off-the-shelf OCR [2], not only because of poor quality and/or decayed typescript or handwriting, but also because standard commercial OCR systems cannot infer the data structure inherent within the records without human guidance. Similarly, although some commercial form-processing OCR systems exist, these normally process fixed-format pre-defined forms designed for OCR, using background drop-out colours and/or tabular guidelines to maximise performance, rather than arbitrary pre-existing document archives.

1.1 System concept

To address archive applications, the user-configurable archive document processing system described in this paper integrates image analysis and text post-processing tools with a configurable commercial OCR package, to generate text content that can be fed directly into a target online database. User configurability is essential to allow the system to be re-targeted to process different archives, and also variable layouts within the same archive. The pattern recognition aspects of the system (ranging from colour segmentation, to document structure classification, to stamp identification and removal) are uniformly implemented using a fuzzy classification scheme which is parameterised within the user interface. OCR performance is optimised using configurable user dictionaries linked to semantically labelled text fields identified using document image analysis. The raw output of the OCR system corresponding to each labelled sub-field of the document image is then post-processed using a regular expression engine to convert it into the required database format.

Our system has been developed in conjunction with the UK Natural History Museum (NHM), and hence has largely been evaluated on their archive data, although it is also intended to be more widely applicable to other structured document archives. The work reported here has mainly used the index to world species of Butterflies and Moths (Lepidoptera), which contains 290,886 index cards, and more recently the index of Beetles (Coleoptera), of similar size.

1.2 NHM card archives

In addition to 68 million biological specimens, the NHM houses global index card archives of taxonomic data for many important groups of organisms, extant and extinct. These card archives represent a comprehensive inventory of their scientific names and associated bibliographical data, for which no published global catalogue currently exists. Archives at the NHM are recorded in card indexes, which contain bibliographical data and other information for one scientific name on each card, laid out in a standardised format (Fig. 1) for each archive. However, different archives may be subject to very different recording conventions. Information is usually typewritten, but a minority of cards are entirely handwritten, and handwritten annotations are common.

Fig. 1 An index card with multiple hand print and handwriting annotations, showing components to be extracted

Fig. 2 SEAC-Banche cheque scanner. Up to 50 cheques (or index cards) are loaded into the front hopper, and are then fed one-by-one past twin vertically mounted CCDs (so both sides of the document are scanned simultaneously). A user-specified text string can also be printed on the document during scanning

Cards are ordered within the index: first, according to higher classification (superfamily, family, subfamily, tribe); second, alphabetically by genus; third, alphabetically within each genus by species; and fourth, alphabetically within each species by subspecies (hence the card sequence implies several database fields which are not explicitly included on every card¹). Cards with names that are no longer in use (e.g. synonyms of current species names) are arranged alphabetically by scientific name following the card for the currently valid name.

The importance of the NHM Lepidoptera card index is demonstrated by the fact that taxonomic catalogues for several groups of Lepidoptera have been produced largely based on data from it (e.g. for Noctuidae [3], Geometridae [4], and the Butterflies & Moths of the World: Generic Names & their Type-species, providing a full list of genus names online [5]). More generally, access to such historic data is now increasingly required to support worldwide research in areas such as biodiversity and climate change, providing a strong motivation to digitise paper archives.

1.3 Card scanning

Card archives are scanned using a SEAC Banche RDS-6000 bank cheque scanner (Fig. 2), modified by the addition of a customised software interface which allows configuration of the scanning process, to build a large image archive from a series of batch scans which may take place over days or weeks. Using this system, over 0.5 million Lepidoptera (Moths and Butterflies) and Coleoptera (Beetles) cards have so far been scanned at the NHM. The scanner has the capability to scan both sides of a card simultaneously in colour and/or monochrome at 200 pixels/in. resolution at a rate of about 1 card/s, and stores the resulting images in JPEG format.² It is also able to print a reference file number on the back of each card for cross-checking against the electronic file archive.

¹ For example, the species hyemalis Butler, shown in Fig. 1, is of the genus URAPTEROIDES, superfamily Geometroidea, and family Uraniidae, all of which can be deduced from the position of the card within the card sequence. None of this information is directly recorded on the card.

Fig. 3 Overall system diagram

1.4 Paper structure

Subsequent sections of this paper describe each part of the overall archive construction system and then evaluate its end-to-end system performance. Section 2 provides a system overview. Section 3 describes alternative pre-processing algorithms used to binarize the scanned card images; Section 4 then describes the document image analysis method used to extract and semantically label each independent text block in an archive card image. Results from this stage are fed to the commercial OCR system, with performance for each separate semantic field optimised by the use of field-specific dictionaries. Section 5 explains how text is post-processed using regular expression matching, so that it can be fed directly into an NHM database, accessible on the Internet [6].

Because different card archives record varying semantic information with different layouts, user configuration of the system is required before commencing analysis of an archive, to define the card template layout(s), identify dictionaries to be used with the OCR system, and specify any text post-processing required to interface the system to a target database. The system is configured using fuzzy templates specified through a graphical user interface, described in detail in Section 6. Section 7 briefly describes the Access database and Web browser to which the output of the system interfaces.

² Note that although archive card images are processed in raw image format internally by the system, image acquisition and archiving were initially carried out in minimum-compression JPEG format, due to storage and processing capacity limitations of the PCs in use at the time (2001).

Evaluation is required not only of each individual component of the system, but also of its end-to-end performance from card scanning to database input. This is presented in Section 8, followed by a brief discussion of other issues in Section 9 and conclusions in Section 10.

2 System overview

The overall system (Fig. 3) consists of four main components: pre-processing, document analysis, OCR (using a commercial OCR engine) and post-processing, implemented in a mixture of Tcl/tk and C++ on top of an existing document analysis system framework [7]. Pre-processing reads the original JPEG archive document images, and converts them from either colour or grey level into binary for semantic labelling purposes. Document analysis then segments and semantically labels important text fields. The output normally consists of labelled text field colour or grey-level sub-images in the system's internal image format (PNM). However, pre-processing can also be recalled after document analysis, so that the labelled text field sub-images can be converted into binary under system control rather than relying on the OCR's internal binarization algorithm. The OCR system recognises labelled images text-field by text-field and converts them into raw text, which is further processed by regular expression post-processing to meet final database input requirements. The components are integrated into a complete archive batch processing system with the user interface shown in Fig. 4.
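To make the dataflow concrete, the following sketch mirrors the four-stage structure in Python. Every function here is a placeholder standing in for the real Tcl/tk and C++ modules and the commercial OCR engine; only the overall pipeline shape is taken from the system.

def preprocess(image):
    """Stand-in for pre-processing: colour/grey to binary (Sect. 3)."""
    return image

def document_analysis(binary):
    """Stand-in for DIA: segment and semantically label text fields (Sect. 4.1)."""
    return {"Species": binary, "Author": binary, "Reference": binary}

def ocr(sub_image, field):
    """Stand-in for the commercial OCR engine run with a per-field dictionary."""
    return " raw %s text " % field

def postprocess(raw_text, field):
    """Stand-in for the regular-expression post-processing (Sect. 5)."""
    return raw_text.strip()

def process_card(image):
    # One card image in, one labelled record out, ready for the database.
    fields = document_analysis(preprocess(image))
    return {name: postprocess(ocr(sub, name), name) for name, sub in fields.items()}

print(process_card("card0001.jpg"))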


Fig. 4 Main user interface to the archive document analysis system

3 Document image pre-processing

Pre-processing includes two independent operations, binarization and colour segmentation, either or both of which can be applied to input images. Five alternative algorithms have been implemented within the system for binarization from grey-level images: global thresholding, Niblack's algorithm [8], Sauvola's algorithm [9], adaptive Niblack, and adaptive Sauvola. The first three of these are standard algorithms, while the last two are novel developments reported in [10]. A comparative performance evaluation of all these algorithms, and also of the internal binarization algorithm used by the OCR system, established that the best OCR performance for the NHM archive dataset was achieved by our adaptive Niblack algorithm. All subsequent system evaluation results reported in this paper therefore use this segmentation method.
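As an illustration of the local thresholding family used here, the following is a minimal sketch of Niblack's algorithm, which binarizes each pixel against a threshold T = m + k·s computed from the local window mean m and standard deviation s. The window size and k value below are illustrative defaults, not the adaptive per-image values of [10].

import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarize(grey, window=25, k=-0.2):
    # Returns a boolean image: True where the pixel is foreground (ink).
    grey = grey.astype(np.float64)
    mean = uniform_filter(grey, size=window)             # local mean m(x, y)
    mean_sq = uniform_filter(grey * grey, size=window)
    std = np.sqrt(np.maximum(mean_sq - mean * mean, 0))  # local std s(x, y)
    return grey < mean + k * std                         # T = m + k*s; dark ink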

Colour segmentation [11] is used to separate an input colour image into monochrome colour layers according to a predefined colour map (a collection of colour clusters, which represent the distinct colours found in the document). Each colour layer may be associated with one or more distinct document fields (e.g. a document stamp, which may need to be removed, see [12]). After colour cluster identification, all the foreground colour layers can be projected together to generate a background-free image, which can then be binarized by mapping all foreground colours to black (and optionally eroding to reduce stroke thickness). In our evaluation of binarization methods [10], we compared this colour classification route to binarization with the grey-level binarization methods described above, but found no improvement in overall system performance. However, colour segmentation is still a useful component of our system for background artefact removal, and to allow identification of specific colour layers where they represent a particular semantic component of the document, e.g. a stamp or genus name.
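A sketch of the colour-layer classification step, assuming the colour map is given as a list of cluster centres from the user configuration: each pixel is assigned to its nearest cluster, and the chosen foreground layers are projected into a single binary image. The example colour values are purely illustrative.

import numpy as np

def segment_colours(rgb, colour_map, foreground):
    # rgb: H x W x 3 image; colour_map: K x 3 cluster centres (floats).
    pixels = rgb.reshape(-1, 3).astype(np.float64)
    # squared distance from every pixel to every cluster centre
    d2 = ((pixels[:, None, :] - colour_map[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1).reshape(rgb.shape[:2])    # nearest cluster
    return np.isin(labels, foreground)                   # merged foreground layers

# e.g. cluster 0 = card background, 1 = black type, 2 = red stamp ink
cmap = np.array([[230.0, 225.0, 210.0], [40.0, 40.0, 40.0], [180.0, 30.0, 30.0]])
# binary = segment_colours(image, cmap, foreground=[1, 2])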

4 Document image analysis and OCR

4.1 Document image analysis (DIA)

The format of archive index cards consists of several independent blocks of text, and each block contains one or more logically related text fields. Blocks retain a fairly consistent mutual layout over a complete archive, but the layout of text fields within each block is not strictly fixed. Nor are there any tabular guidelines defining fixed block boundaries. The X–Y cuts algorithm [13] is therefore an appropriate segmentation algorithm for this class of document image structure. Pixel smearing [14], with a threshold sufficient to join adjacent text characters but not adjacent horizontal words or vertical lines, is first applied as a pre-processing non-linear low-pass filter to each archive card image. The X–Y cuts algorithm then extracts and stores the contents of each index card into a hierarchical tree structure (the so-called X–Y tree), consisting of text blocks, lines and words.

Fig. 5 Example of card processing

In addition to segmentation, DIA labels each segmented region (as shown in Fig. 1), in this case based upon a template layout pre-registered during system configuration for the NHM archive format. Labelled image fields allow the OCR system to be configured with field-specific dictionaries, and raw text output from the OCR to be fed to the correct database field. They also make it possible to automatically remove any redundant field from the document image if desired (replacing it with average background pixels from the surrounding region). In fact, the only redundant foreground in Fig. 1 is the "Original Spelling…" stamp, which may appear anywhere on the image with any orientation, but with a fixed overall format. As part of our archive processing system, a special tool was developed to remove such stamps, based upon fuzzy matching of global features of the stamp [12]. The features used are relative corner angles, distances and font sizes of the outer boundary of the stamp.

4.2 OCR and word recognition

The OCR used in the proposed system is a commercial product, Abbyy FineReader 6.0, which is currently regarded as one of the best available solutions for print and typewriter text recognition. Since it is designed for stand-alone use, it includes its own internal image processing (e.g. binarization) integrated with OCR, but can also accept pre-processed images in a variety of different image formats, including binary. We used it as middleware working in combination with the other components of our system. For example, Fig. 5 shows texts on the card that have been extracted and labelled into five classes of images: Index, Species, Author, Reference and Location. To recognise the class Author, a specific name dictionary (provided by the NHM) can be added to the default OCR English dictionary. When the input class is changed, a different specific dictionary is used.

Fig. 6 User interface for post-processing

A few other OCR settings also need to vary according to the image field being processed. As most text is typewritten, the default print type is normally set to "typewriter" and layout detection to "autodetect layout". However, in this application, the year of publication within the Reference class is a specific searchable database field, and hence needs to be extracted from the remainder of the reference using regular expression post-processing. Initial evaluation showed that the OCR system performs poorly on numeric fields with the print type set to "typewriter", but somewhat better when it is set to "autodetect".
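The per-field settings can be pictured as a small configuration table, along the lines of the following sketch. The dictionary file names are hypothetical, but the print-type choices follow the behaviour reported above (typewriter for text fields, autodetect for the reference, whose year sub-field is numeric).

FIELD_CONFIG = {
    "Index":     {"dictionary": None,           "print_type": "typewriter"},
    "Species":   {"dictionary": "species.dic",  "print_type": "typewriter"},
    "Author":    {"dictionary": "authors.dic",  "print_type": "typewriter"},
    "Reference": {"dictionary": None,           "print_type": "autodetect"},
    "Location":  {"dictionary": None,           "print_type": "typewriter"},
}

def settings_for(field):
    # Settings applied before one class of sub-images is batch-recognised.
    return FIELD_CONFIG[field]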

The OCR output of the Abbyy middleware is raw text, which is saved into separate text files for each text sub-image of the overall card image, since all sub-images of a single field (e.g. species name) are processed as a single batch with the same user-specified dictionary and other settings. The subsequent post-processing stage then reassembles all the different raw texts derived from a single card image into a set of labelled text fields suitable for database insertion.

5 Post-processing

The purpose of text post-processing is to generate database-oriented text strings for input to the online database from the raw text output by the OCR. For example, the NHM online database only needs the year of publication in the recognized reference to be stored as a search key; other text in the reference can be ignored. Another example is the author name, where the database requires complete author names, but abbreviated author names (terminated with a full stop) are frequently found in the original images (e.g. the author "Warren" may be abbreviated to "Warr."). The corresponding complete author names need to be retrieved and substituted for each abbreviation in the online database.

Tcl regular expressions specified within another part of the system's user interface (Fig. 6) are the main tool for the manipulation of raw OCR text output. For example, the regular expression to parse the published year from a reference is expressed by

regexp { ([1][7-9][0-9][0-9])} $reference year


Fig. 7 An initial 4×3 equally divided fuzzy zone

where regexp is the Tcl regular expression command (assumed by the graphical user interface), { ([1][7-9][0-9][0-9])} is the parsing pattern (which searches for a year between 1700 and 1999, prefixed by a space), $reference represents the reference raw text generated by OCR (fed to the parser automatically and displayed in the source text window), and year contains the four matched digits output to the 'year' database field (displayed in the parsed text window). Similarly, for author abbreviations, a regular expression is applied to the author name raw OCR source text field to find any pattern terminated by a full stop:

regexp {[^\.]+} $author pattern

Based on the detected pattern, another regular expression is then used to search the specific Author dictionary to find and substitute the matching full author name. Similar dictionary techniques, based on edit distance, are used to detect and correct limited errors in text fields which otherwise match one of the database fields.
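Both dictionary steps can be sketched briefly. The abbreviation expansion follows the full-stop pattern above; the error correction uses approximate string matching as a stand-in for the edit-distance test, and the dictionary contents are illustrative.

import re
from difflib import get_close_matches

AUTHORS = ["Warren", "Butler", "Hampson"]          # illustrative dictionary

def expand_author(raw):
    # Replace a full-stop-terminated abbreviation with the unique
    # dictionary name beginning with the abbreviated stem, if any.
    m = re.match(r"([^.]+)\.", raw.strip())
    if m:
        hits = [a for a in AUTHORS if a.startswith(m.group(1))]
        if len(hits) == 1:
            return hits[0]
    return raw

def correct_field(raw, dictionary):
    # Snap a slightly corrupted OCR token to its closest dictionary entry.
    hits = get_close_matches(raw, dictionary, n=1, cutoff=0.8)
    return hits[0] if hits else raw

print(expand_author("Warr."))              # Warren
print(correct_field("Butlar", AUTHORS))    # Butler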

Finally, the post-processing combines the separate text files generated by the OCR process, augmented with the results of regular expression matching, into a single file suitable for populating the online database.

6 Fuzzy configuration

6.1 Fuzzy inference

When a card image is examined, humans tend to describe the text within the image in terms of approximate positions rather than exact coordinates. These positions can be defined by introducing some "fuzzy" terms, e.g. top, bottom, right, left, which are in fact our natural language. Fuzzy logic is a technique developed to mimic this kind of human thinking, and our system design is based on fuzzy logic so that people who are not experts in document analysis can still achieve a realistic understanding of the document image analysis processes which the system carries out. The overall archive document analysis system comprises three parts: fuzzy template creation, user configuration and batch processing.

6.2 Fuzzy template creation

Initially, a set of default fuzzy zones is defined on an example card image from the archive to be processed. For example, Fig. 7 shows a card image that has been equally divided into 4 × 3 = 12 fuzzy zones (any number of zones can be defined both horizontally and vertically, but 4 × 3 was found to be sufficient and optimal for the archives evaluated). Vertically, the card image is divided into four rows, which are denoted using the fuzzy terms Top, Upper, Lower and Bottom. Horizontally, the image is divided into three columns, which are denoted with the fuzzy terms Left, Middle and Right. Any point T(x,y) on the image can then be described as (Top, Left) or (Bottom, Middle), etc. The default initial zones and their denotation are configurable by the curator according to the application.

Secondly, the text fields to be recognised are labelled using the system interface (see Fig. 4), by selecting them using a rubber-banding technique and labelling each boxed area. The default fuzzy zone boundaries are then automatically adjusted to maximise vertical and horizontal gaps between labelled text fields, as shown in Fig. 8.
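A one-dimensional sketch of this adjustment, assuming each labelled field is reduced to its (top, bottom) extent in pixels: every default boundary moves to the midpoint of its nearest inter-field gap, which is a simplification of the gap-maximising behaviour described above. All coordinates are illustrative.

def adjust_boundaries(defaults, fields):
    # defaults: initial zone boundaries; fields: (top, bottom) extents.
    ordered = sorted(fields)
    gaps = [(bottom + top) / 2.0                    # midpoint of each gap
            for (_, bottom), (top, _) in zip(ordered, ordered[1:])
            if top > bottom]
    return [min(gaps, key=lambda g: abs(g - b)) if gaps else b
            for b in defaults]

# four rows -> three internal boundaries on a 600-pixel-high card
print(adjust_boundaries([150, 300, 450],
                        [(40, 90), (120, 200), (330, 380), (500, 560)]))
# -> [105.0, 265.0, 440.0]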


Fig. 8 Membership function based on fuzzy zones

Fuzzy membership functions fitted to the adjusted zone boundaries can then be set up, as shown in Fig. 8. Vertically, four membership functions overlap each other and correspond respectively to the four fuzzy terms previously defined. Horizontally, there are a total of 12 membership functions, corresponding respectively to three fuzzy terms on each of four rows. The overlap between adjacent fuzzy terms varies with each document image, and depends on the gap in horizontal or vertical position between adjacent text components.

Finally, using the template membership functions, typical fuzzy "if–then" rules can be deduced for each text field, for example:

• "If T(y) is from Upper to Upper and T(x) is from Left to Left, then T(x,y) is Species" (since the species name is normally found in the top-left area of the card);

• "If T(y) is from Lower-middle to Lower-middle and T(x) is from Left to Right, then T(x,y) is Reference" (since the reference is normally found below the species and author names, all the way across the card);

• "If T(y) is from Bottom to Bottom and T(x) is from Left to Left, then T(x,y) is Location" (since the location is normally found at the bottom-left of the card).

If a particular archive contains more than one typical card layout, then further templates for additional layouts can be defined by repeating the steps above, and the best fuzzy matching template for each card image will be chosen when the system is run in batch card processing mode.
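The membership functions and rules can be sketched with trapezoidal memberships on a unit-height, unit-width card; the boundary values and overlaps below are illustrative, whereas the system derives them per template and readjusts them per image.

def trapezoid(x, a, b, c, d):
    # Membership rising over a..b, flat over b..c, falling over c..d.
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

VERT = {"Top": (0.0, 0.0, 0.20, 0.30), "Upper": (0.20, 0.30, 0.45, 0.55),
        "Lower": (0.45, 0.55, 0.70, 0.80), "Bottom": (0.70, 0.80, 1.0, 1.0)}
HORIZ = {"Left": (0.0, 0.0, 0.30, 0.45), "Middle": (0.30, 0.45, 0.60, 0.75),
         "Right": (0.60, 0.75, 1.0, 1.0)}

def rule_species(x, y):
    # "If T(y) is Upper and T(x) is Left, then T(x,y) is Species":
    # the rule strength is the minimum of the two memberships.
    return min(trapezoid(y, *VERT["Upper"]), trapezoid(x, *HORIZ["Left"]))

print(rule_species(0.10, 0.35))   # 1.0: a top-left word matches Species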


Fig. 9 Template configuration interface

Each labelled text field has a corresponding fuzzy rule. Therefore, a total of five fuzzy rules can be obtained for the template of Fig. 8.

6.3 User configuration

The fuzzy rules can be changed using a template configuration interface, as shown in Fig. 9. For example, the rule for Location can be changed to "If T(y) is from Bottom to Bottom and T(x) is from Left to Middle, then T(x,y) is Location" to allow for images that may have a longer location name. The interface also provides four heuristic inputs: Type, Align, Sequence and Quantity. Type classifies text components into three levels, word, line and block, consistent with the X–Y tree document image analysis. Align gives four directions, west, east, north and south, for text components to align to (the default is none). Sequence indicates the sequence of a text component in a text field. Quantity provides the number of text components in a text field. These inputs are used to make the classification more accurate and precise, since the fuzzy rules can only define an approximate area for text fields. By combining fuzzy rules with heuristic inputs, the complete rule for the Species text field of the template can, for example, be written as "based on word level, if T(y) is from Upper to Upper and T(x) is from Left to Left, then the first word of T(x,y) is the Species".

6.4 Batch processing

Batches of card images are processed using the registered template(s), which include the template membership functions and fuzzy rules. The template membership functions are readjusted locally, image by image, by measuring the vertical and horizontal gaps between text components. Based on the local membership functions, the text components on each image are then classified in terms of the fuzzy rules.


Fig. 10 Access form entry for reviewing/editing archive card data after document analysis and OCR

7 Archive card database and Web interface

7.1 LepIndex archive card database

The card images and associated taxonomic data are managed by the NHM using an MS Access relational database consisting of seven linked tables, 12 lookup tables and 18 additional tables, plus 32 queries and 27 forms; it also includes more than 10,000 lines of Visual Basic code. The seven linked tables form the main part of the database and contain a total of 135 fields (fields are included for all data that might be present on the index cards, as well as for fields not on the cards but indirectly inferred from card sequence). These tables are linked by the unique card reference number assigned to each card image (and printed on its back) when the images were created.

The structure of the database and the layout of the front-end were constructed to meet specific taxonomic requirements. It therefore incorporates specialist knowledge of taxonomic protocols, and demanded a thorough assessment of the structure and function of existing taxonomic databases. The main purpose of the database is to enable quick visual comparison of the type- or hand-written data on the card images with data generated by OCR analysis of these images, and to allow these data to be edited. The database was designed to provide an electronic substitute for the card index it replaces, and is now being made available to other taxonomists in the NHM Entomology Department via the local intranet. A Web interface for the database has also been developed (Sect. 7.2).

The main database form (Fig. 10) allows users to find a card image, and the associated data, quickly, using a variety of search options (e.g. a drill-down search by higher classification and a 'simple search', with or without wildcards, for any taxon name). Authorised users are able to edit, delete and create new records. They can also 'move' records, singly or in batches, to new relative positions within the record sequence (e.g. in cases where the user wishes to transfer a species name from one genus to another). All changes made to data in the database are recorded in a set of archive tables. These tables store the old and the new field values, the name of the user, and the date and time of the change. Deleted records are also archived, and the user name, date and time are recorded. Users can validate information in all except memo fields by placing the cursor in the appropriate field and double-clicking the left mouse button. The value currently stored in the field, plus the user name, date and time, are recorded. If a field containing validated data is subsequently double-clicked, the validated data is displayed on a pop-up form and the user is given the option of deleting the stored validation information or overwriting it with a new validation record. Any of the card images (colour front or back, or grey-level front) can be viewed separately by selecting the appropriate button, and clicking on the magnifier button presents a full-size image (around 1,000 × 600 pixels) for more detailed inspection.

Fig. 11 Web interface to the NHM Lepidoptera card archive database (see [6])

7.2 LepIndex Web browser interface

A Web interface for the Access database has also been developed and is publicly accessible (see Fig. 11) [6]. Users are able to search for records using a variety of search systems (e.g. a simple search by scientific name, or an advanced search using a combination of a number of different search terms). The results page displaying a record is laid out in a similar way to the main form of the Access database (i.e. Fig. 10), except that related groups of fields (e.g. the fields which comprise a reference) are concatenated to aid readability. Although users cannot add entire new records, they are able to edit existing ones. If they do so, the user's details and suggested changes are sent to the NHM server and stored in Access tables until the administrator of the system decides whether or not to include the changes in the master Access database. The Web interface operates using copies of the tables from the master Access database, and these are updated periodically.

8 System evaluation

8.1 Card scanning

The 290,886 cards in the full Lepidoptera index were scanned using the modified cheque scanner in a total of 61 person-days, an average of about 10 cards/min (compared with the raw read rate of 60 cards/min), due to the overhead of collecting, transferring and returning cards from file drawers to the small 40-card hopper of the scanner (larger hoppers are available for commercial systems, but at increased cost). The full archive of JPEG images requires around 30 GB of storage, with three images (front and back colour images, and front grey-scale) stored for each card. Individual JPEG card images are typically around 30 kbytes in size, though they can be considerably further compressed without significant subjective distortion using newer image coding standards such as JPEG2000 or DjVu [15]. However, since both initial capture and web browser display use JPEG as the default image coding standard, we have not yet explored other options further.

Subjective assessment of images by NHM staff, displayed both through the Access database and LepIndex online [6], has indicated that the current image resolution of 200 d.p.i. is more than adequate for cross-checking or validating data on-screen, although it falls a little short of the recommended resolution of 300 d.p.i. for optimising OCR performance [16].

8.2 Evaluation datasets

The system was evaluated on two sets of sample cards. One set of 4,435 cards was randomly chosen from the Pyraloidea dataset of 27,578 archive cards, for which full truth data was independently available from the NHM. A second set of 10,000 cards was processed from the Curculionidae subset of the Coleoptera archive, as part of the overall system trials. No truth data was initially available for Curculionidae, so a random subset of 994 cards from this dataset was manually truthed by the authors (the remaining unvalidated data output of the system was returned for curatorial evaluation). The Curculionidae test set uses different card layouts and dictionaries from the Pyraloidea test set, and therefore provides an independent dataset for validating system performance, and also for estimating whether sufficient user (re)configurability has been allowed in the system design.

8.3 Overall evaluation method

For both datasets, the text fields extracted from each archive card for evaluation were: genus/species name, author name, and the date sub-field within the reference, since these fields are currently indexed in LepIndex [6]. Electronic dictionaries (not always complete) are also available for genus names, species names and author names. Evaluation was carried out using pre-processing and document analysis to generate three sets of binary text sub-images (genus/species name, author name and reference), which were binarized using our adaptive Niblack algorithm. The text sub-images were then fed into the Abbyy OCR for recognition class-by-class, using the respective class dictionaries, and the results were saved into three sets of text files.

The three sets of text files were parsed and merged into three single text files by post-processing with the respective regular expressions for database input. The results produced by the system were then compared with the word-level truth data for the corresponding database fields. In the word-level evaluation, if any unmatched character was found, the whole text field (which could contain one or more words) was considered incorrect, as shown in Fig. 12, where the German ü was unmatched, because it is not included in the OCR character set.
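In other words, scoring is exact-match at field level, along the lines of the following sketch (the second field fails on a single character, using the example word from Table 2):

def field_accuracy(results, truth):
    # A field counts as correct only if every character matches the truth.
    correct = sum(r == t for r, t in zip(results, truth))
    return correct / float(len(truth))

print(field_accuracy(["hyemalis", "ODCNTABTHRIA"],
                     ["hyemalis", "ODONTARTHRIA"]))   # 0.5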

The same evaluation methodology was used for both evaluation datasets, except that only a single archive card template was required for the first dataset (Pyraloidea cards), whereas three different templates were required for the second dataset (Curculionidae cards), because three different card layouts were detected within this dataset (Fig. 13).

Evaluation of the pre-processing performance was carried out using only the genus/species name image field, whereas evaluation of overall system end-to-end performance also used the author name and reference sub-images.

8.4 Pre-processing results

After optimising parameters using a small independent dataset of 335 images, recognition results for eight different thresholding methods were compared, as shown in Table 1. The methods were: default Abbyy OCR thresholding (as a baseline), global thresholding, Niblack's and Sauvola's algorithms, our own adaptive versions of Niblack's and Sauvola's algorithms [10], and Niblack's and Sauvola's algorithms applied to images with background removed using our colour segmentation method. Table 2 gives an example of a binarized word image using each binarization method, and the corresponding OCR results.

From analysis of the results, we concluded that the best system performance was achieved by Niblack's and our adaptive Niblack algorithms, both of which performed better than the default thresholding included in the Abbyy system. Although previous research [9] has claimed that Sauvola's algorithm has superior performance to Niblack's, our experimental work with archive documents suggests this is not always true. In particular, backgrounds on archive documents vary considerably between images (Fig. 14), and the performance of Niblack's algorithm was found to be relatively insensitive to its parameter settings, whereas Sauvola's algorithm was quite sensitive. Hence, even with the benefit of the adaptive parameterisation method which we introduced, overall performance across the full set of 4,435 images was still better using Niblack's algorithm and our adaptive variant.

Fig. 12 Examples of text field recognition

Fig. 13 Three different index card layouts in the Curculionidae dataset

Table 1 Evaluation results (4,435 word images)

Method  Algorithm                       Correct  Rate (%)
I       ABBYY                           3720     83.9
II      Global                          3409     76.9
III     Niblack                         3780     85.2
IV      Niblack (background removed)    3701     83.4
V       Sauvola                         3658     82.5
VI      Sauvola (background removed)    3578     80.7
VII     Adaptive Niblack                3781     85.3
VIII    Adaptive Sauvola                3701     83.4

Table 2 Example of the effect of thresholding on OCR output for each of the thresholding algorithms evaluated (the binarized word images themselves are not reproduced here)

Method  OCR text
I       ODCNTABTHRIA
II      ODCNtABTHBIA
III     ODONTARTHRIA
IV      GDCHTARTHBIA
V       ODQHTARTHRIA
VI      GDOmAKEBKEA
VII     ODONTARTHRIA
VIII    ODQHTARTHRIA

8.5 End-to-end word recognition results for the first evaluation dataset

Analysis of errors in the first dataset (Table 3) shows that 15% of overall errors occurred when document image analysis wrongly extracted or labelled text fields, and 13% resulted from incorrect truthing (e.g. see Fig. 15), including abbreviations. The remaining 72% of errors were generated by the OCR system, often caused by touching typewritten characters. 16.4% of errors were subsequently corrected by text post-processing, due to dictionary correction or expansion of abbreviations.

Fig. 14 (a) Light image background; (b) darker image background


Table 3 Evaluation results for the Pyraloidea dataset

Stage                 Species/genus       Author              Year
Text fields           4435                4435                4435
Doc. analysis errors  149/4435 (−3.4%)    166/4435 (−3.7%)    140/4435 (−3.1%)
OCR errors            460/4435 (−10.4%)   711/4435 (−16.0%)   1080/4435 (−24.4%)
Truthing data errors  50/4435 (−1.1%)     365/4435 (−8.2%)    0
Post-processing       165/4435 (+3.7%)    346/4435 (+7.8%)    0
Correct text fields   3941/4435 (88.9%)   3539/4435 (79.8%)   3215/4435 (72.5%)

Fig. 15 Difference between original image and truth data

Since author recognition was carried out with an incomplete Author dictionary, the word recognition rate for this field is lower than for Species/Genus, where a full dictionary was available. The poorer result for Year was mainly caused by the OCR, which was less accurate in recognising digits than characters (nearly 89% of the total errors for Year were caused by OCR errors, compared with the average of 72%). Another cause of poor performance is that quite a few years are handwritten (Fig. 16).

8.6 End-to-end word recognition results for the second evaluation dataset

The second test set of 994 cards was randomly chosen from the Curculionidae dataset of 10,000 archive cards. Three different layout formats were encountered, as shown in Fig. 13. In this evaluation, three templates corresponding to these formats were registered for document analysis.

Table 4 shows that 99% of card formats were identified correctly, and hence subsequently analysed with the correct template. Type (a) was by far the most common template, occurring in more than 88% of the sampled cards. Therefore, we carried out the same evaluation as in Sect. 8.5 using just the 881 cards of type (a).

Fig. 16 Reference with handwritten Year

Table 4 Evaluation results for template identification

Image type  Template for (a)   Template for (b)  Template for (c)
(a)         874/881 (99.2%)    0 (0%)            7/881 (0.8%)
(b)         0 (0%)             66/66 (100%)      0 (0%)
(c)         1/47 (2.1%)        0 (0%)            46/47 (97.9%)

Overall correct rate: 99%

The text fields extracted from each archive card for evaluation in the second evaluation dataset were genus name, species name and author name. On the image (Fig. 13a), the top block is Genus, and the second (reference) block contains both Species and Author data. Most of the time, Species is the first word in the block. Author, in most cases, is located in the middle of the block and terminated with a comma; its initial letter is always capitalized. Suitable regular expressions are used to search for these fields embedded within the 'raw' OCR output for the reference sub-image.

Table 5 summarises the evaluation results for the second test set. 8.1% of overall errors occurred when document image analysis wrongly extracted or labelled text fields, and the remaining 91.9% of errors were generated by the OCR system, often caused by touching typewritten characters and complex surroundings. 12.6% of errors were corrected by the text post-processing stage. As species recognition was carried out with an incomplete Species dictionary, the word recognition rate for this field is poorer than for the other two. Also, both the species and author fields have more complex adjacent text surrounding them than genus, which is a separate sub-image; this also causes a lower recognition rate for species and author in comparison with genus.

Table 5 Evaluation results for the Curculionidae dataset

Stage                 Genus              Species            Author
Text fields           881                753                783
Doc. analysis errors  28/881 (−3.2%)     17/753 (−2.3%)     17/783 (−2.2%)
OCR errors            170/881 (−19.3%)   286/753 (−38.0%)   222/783 (−28.4%)
Post-processing       36/881 (+4.1%)     10/753 (+1.3%)     47/783 (+6.0%)
Correct text fields   719/881 (81.6%)    460/753 (61.1%)    591/783 (75.5%)

8.7 Effect of removing the document analysis stage

A final stage of evaluation attempted to address the question of whether improved performance could be obtained by dispensing with the image analysis stage of the system, and effectively treating the whole card image as flat text, as a standard OCR system would. In this case, in addition to processing the raw text output of the OCR to extract the sub-texts required for feeding into the database, the regular expression processing also parses the text, labelling the database field to which each sub-text is applied, by means of dictionary matching. The result is a simpler system, but with some effect on word recognition performance, for several reasons:

• The use of dictionaries to identify specific database fields assumes that all field dictionaries are independent of each other.

• Regular expression matching using a dictionary will only extract (near) exact matches, whereas other valid text may be identified using image analysis (for example, in the case of an incomplete dictionary).

• Processing the complete image in one pass through the OCR system requires a single set of OCR parameters to be applied to all fields, which may be sub-optimal for some.

The system was set up to process the first test set of 4,435 images again, using a configuration where all images were processed directly by the OCR system, with modified regular expression post-processing to extract and label all sub-texts from the raw OCR output for each complete image.
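The modified post-processing amounts to labelling tokens of the flat OCR text by dictionary and pattern lookup alone, along the lines of this sketch (the dictionaries and sample reference string are illustrative; the year pattern is the 1700-1999 test used earlier):

import re

GENERA = {"URAPTEROIDES"}
SPECIES = {"hyemalis"}
AUTHORS = {"Butler", "Warren"}

def label_flat_text(raw):
    record = {}
    for token in re.findall(r"[A-Za-z]+|\d{4}", raw):
        if token in GENERA:
            record["genus"] = token
        elif token in SPECIES:
            record["species"] = token
        elif token in AUTHORS:
            record["author"] = token
        elif re.fullmatch(r"1[7-9]\d\d", token):
            record["year"] = token
    return record

print(label_flat_text("URAPTEROIDES hyemalis Butler, 1881"))
# -> {'genus': 'URAPTEROIDES', 'species': 'hyemalis',
#     'author': 'Butler', 'year': '1881'}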

Results obtained were: Species/genus name 88.7% word recognition rate (compared with 89.2% from Table 3); Author name 77.5% (79.8% from Table 3); and Year 62.5% (73.5% from Table 3). The very small difference in performance for Species/genus names is probably due to the slight difference in performance between the Abbyy thresholding algorithm and the adaptive Niblack algorithm used in the full system. The 2.3% difference in performance for Author names is indicative of the effect of the author dictionary being incomplete. Finally, the much larger difference in performance for Year (11%) results from the non-optimal parameterisation of the Abbyy OCR for this field. Since the full card image is processed by the OCR in a single pass, it was processed with the 'typewriter' text setting (which was optimal for species/genus and author names), instead of 'autodetect' (which gave improved performance for numerals).

9 Discussion of results

The aim of this work was to develop a system-level approach to online historic archive conversion, by integrating a configurable set of document image analysis and recognition components into a complete user-reconfigurable system. Work on this system is continuing, and a copy is currently under user evaluation at the NHM. While it cannot be claimed that any of the individual system components exhibit great novelty, the overall project approach of designing a reconfigurable system to allow users to configure aspects of document analysis on a batch-by-batch basis (a 'cottage industry' rather than 'production line' approach) is unusual, but, we suggest, necessary to handle the unique variability and physical distribution throughout the world that is characteristic of historic document archives. The authors are aware of very few other published evaluated implementations of end-to-end systems of this type. The value of this systems-level approach is in clarifying architectural issues and performance sensitivities which cannot be addressed by component-level studies: the breadth-first document image analysis strategy recently advocated by Baird et al. [17].

As might be expected, the completion of this first implementation and evaluation phase has highlighted many areas of possible improvement for the system. For example, it is evident from the current evaluation that a high proportion of the current word recognition errors result from touching characters being mis-segmented by the Abbyy OCR system (which nevertheless performs significantly better than the other commercial OCR systems we evaluated). Since most of the text is fixed-pitch Courier typewriter font, a significant performance improvement might be achieved by pre-segmentation of characters using projection from word boundaries or Fourier transforms of histograms, prior to OCR, as has already been suggested for Second World War typewritten documents [18]. Another promising approach is segmentation-free OCR, where a classifier is convolved over the word image, thereby considering all possible segmentations [21,22]. While prototype implementations of these methods have been shown to significantly improve accuracy on these archive cards, more work is needed to tune the system for practical operation [21], or to make it run at reasonable speed [22].

Improved performance could also be achieved by upgrading to the latest version of Abbyy's OCR system (currently version 7.1), which trial evaluations have shown improves on the results reported here. However, adopting this version would require repeating all the evaluation results reported in this paper, to maintain consistency. Since adopting FineReader 6.0 for the trial, however, Abbyy has also recently announced other relevant developments, including support for old European languages [19] and Abbyy FlexiCapture Studio [20]. FlexiCapture may be the first commercial attempt to provide support for variable-format business form processing, and adopts several of the same approaches described in this paper, including GUI-based form templating using form image samples, flexible hierarchical document image descriptions for initial DIA, and regular expression post-processing to extract key text fields from larger text strings.

Initial user evaluation at the NHM has also indicated that, although the current system is usable, and reproduces the word recognition performance described here with other datasets, it is far from efficient, due to a variety of sub-optimal user interface characteristics that are typical of a research and development prototype. For example, at present the system is restricted to processing moderate batches of a few hundred card images at a time, due to restrictions on the number of file handles simultaneously open. Such issues are not fundamental performance limitations, but nevertheless serve to highlight the need for considerable further user interface optimization before the present system could be considered fully industrialized. A more complete comparative evaluation of the system, in terms of both evaluation performance and time, could in principle be carried out using a CAVIAR-style methodology [23], except that the VIADOCS system is only intended for use by taxonomic experts with significant computational experience. Such subjects are rare and widely geographically distributed, making statistically valid conclusions difficult to achieve.

10 Conclusions

This paper has described and evaluated a complete end-to-end system for archive document acquisition, analysis and recognition for digital libraries. The system is user-configurable through a fuzzy configuration user interface to handle different archive layout formats. It also integrates sub-systems to perform document image pre-processing, document image analysis for semantic labelling, and text post-processing using regular expressions, with a standard off-the-shelf OCR system. The overall system performance is encouraging: basic word-level OCR exceeds the raw OCR of the embedded commercial OCR system we used, and subsequent text post-processing corrects a significant proportion of residual errors, as well as allowing direct insertion of the recognised text into appropriate online database fields. Database field recognition rates of up to 90% have been observed during evaluation, and such performance should be reliably met or exceeded with foreseeable improvements to the pre-processing and consistent availability of complete field dictionaries.

Acknowledgments The work reported here was initiated in the VIADOCS project, sponsored by EPSRC and BBSRC as part of the UK research councils' Bioinformatics research programme, under research contracts 84/BIO11933 and 40/BIO11938. The authors are grateful to Dr. Malcolm Scoble, Dr. Gaden Robinson and Dr. George Beccaloni for their contribution to this collaborative project, and particularly to George Beccaloni for the design of the MS Access LepIndex database and Mike Sadka for the LepIndex website. The scanner software for the system was developed by Arran Holmes as an undergraduate project in the Department of Electronic Systems Engineering at the University of Essex.

References

1. Marinai, S., Dengel, A. (eds.): Document Analysis Systems VI: Proceedings of the 6th International Workshop, DAS 2004, Florence, Italy, September 2004. LNCS 3163. Springer, Berlin Heidelberg New York (2004)

2. Spitz, A.L.: Tilting at windmills: adventures in attempting to reconstruct Don Quixote. In: Document Analysis Systems VI (DAS 2004), LNCS 3163, pp. 51–62 (2004)

3. Poole, R.W.: In: Heppner, J.B. (ed.) Lepidopterorum Catalogus (New Series), Fascicle 118, Noctuidae, Parts 1–3: 1314 pp. E.J. Brill/Flora & Fauna Publications, Leiden–New York–Kobenhavn–Koln (1989)

4. Scoble, M.J. (ed.): Geometrid Moths of the World: A Catalogue. Volumes 1 and 2: 1016 pp. + index 129 pp. CSIRO Publishing, Canberra (1999)

5. Pitkin, B.R., Jenkins, P.: Butterflies & Moths of the World:Generic Names & their Type-species. http://www.nhm.ac.uk/entomology/butmoth/ (2002)

6. Beccaloni, G., Scoble, M., Robinson, G., Pitkin, B.: The Global Lepidoptera Names Index. http://www.nhm.ac.uk/entomology/lepindex

7. Cracknell, C., Downton, A.C.: TABS: a script-based software framework for research in image processing, analysis and understanding. IEE Proc. VISP 145(3), 194–202 (1998)

8. Niblack, W.: An Introduction to Digital Image Processing. Prentice-Hall, Englewood Cliffs (1986)

9. Sauvola, J., Pietikainen, M.: Adaptive document image binarization. Pattern Recognition 33, 225–236 (2000)

10. He, J., Do, D.M.Q., Downton, A.C.: A comparison of binarization methods for historical archive documents. Submitted to ICDAR 2005, Seoul, Korea (2005)

11. He, J., Downton, A.C.: Colour map classification for archive documents. In: Document Analysis Systems VI (DAS 2004), LNCS 3163, pp. 241–251 (2004)

12. He, J., Downton, A.C.: Configurable text stamp identification tool with application of fuzzy logic. In: 6th International Workshop on Document Analysis Systems, DAS 2004, pp. 201–212 (2004)

13. Nagy, G., Seth, S., Viswanathan, M.: A prototype document image analysis system for technical journals. Computer 25(7), 10–22 (1992)


14. Baird, H.S., Jones, S.E., Fortune, S.J.: Image segmentation using shape-directed covers. In: Proc. 10th Int. Conf. Pattern Recognition (ICPR), pp. 820–825. IEEE CS Press, Los Alamitos, CA (1990)

15. Bottou, L., Haffner, P., Howard, P.G., Simard, P., Bengio, Y., LeCun, Y.: High quality document image compression with DjVu. J. Elect. Imag. 7(3), 410–425 (1998)

16. Rice, S.V., Jenkins, F.R., Nartker, T.A.: The 5th annual test of OCR accuracy. Technical Report ISRI TR-96-01, ISRI, University of Nevada at Las Vegas. http://www.isri.unlv.edu/pub/ISRI/OCRtk/AT-1996.pdf (1996)

17. Baird, H.S., Govindaraju, V., Lopresti, D.P.: Document analysis systems for digital libraries: challenges and opportunities. In: Marinai, S., Dengel, A. (eds.) Document Analysis Systems VI: Proceedings of the 6th International Workshop, DAS 2004, Florence, Italy, September 2004. LNCS 3163, pp. 1–16. Springer, Berlin Heidelberg New York (2004)

18. Antonacopoulos, A., Karatzas, D.: A complete approach to the conversion of typewritten historical documents for digital archives. In: Document Analysis Systems VI (DAS 2004), LNCS 3163, pp. 90–101 (2004)

19. ABBYY: ABBYY ships first omnifont OCR software package for Fraktur and old European language recognition. Press release, 18 January 2005. http://www.abbyy.com/press/press_releases.asp?param=38400

20. ABBYY: FlexiCapture white paper. http://www.abbyy.com/articles/WP%20FlexiCapture.pdf

21. Lucas, S.M., Patoulas, G., Downton, A.C.: Fast lexicon-based word recognition in noisy images. In: Proc. ICDAR 2003, 7th International Conference on Document Analysis and Recognition, Edinburgh, 3–6 August 2003, pp. 462–466 (2003)

22. Ishidera, E., Lucas, S.M., Downton, A.C.: Top-down likelihood word image generation model for holistic word recognition. In: Proc. DAS 2002 Document Analysis Systems Workshop, Princeton, NJ, 19–21 August 2002. LNCS 2423, pp. 82–94. Springer (2002)

23. Zou, J., Nagy, G.: Evaluation of model-based interactive flower recognition. In: Proc. ICPR 2004, 17th International Conference on Pattern Recognition, Cambridge, UK, vol. 2, pp. 311–314 (2004)

Author Biographies

Andy Downton holds a BSc (1974) and a PhD (1982) in Electronic Engineering from the University of Southampton, UK, where he was also a lecturer. He is now a Professor at the University of Essex, and was Head of the Department of Electronic Systems Engineering from 1999 to 2004. Professor Downton is a Fellow of the IEE and a Senior Member of the IEEE, and was chair of the organising committees for IWFHR5 (held at the University of Essex in 1996) and of ICDAR 2003 in Edinburgh, Scotland. His research interests are in document image analysis and recognition, human-computer systems design, and parallel and embedded computer systems.

Jingyu (Jack) He received his BEng degree in Electrical & Electronic Engineering from the University of Liverpool, UK, in 2000, and his MSc in Computer Science from the University of Essex, UK, in 2002. He is now studying for a PhD in Electronic Systems Engineering at the University of Essex. His main research interests are archive document image analysis with application of fuzzy logic, and related overall system solutions.

Simon Lucas received his BSc degree in Computer Systems Engineering from the University of Kent, UK, in 1986, and his PhD from the University of Southampton, UK, in 1991. He was appointed to a Lectureship at the University of Essex in 1992, and is currently a Reader in the Computer Science Department at Essex. Dr. Lucas is currently chair of IAPR Technical Committee 5 on Benchmarking and Software, and has organised pattern recognition competitions for several international conferences, including ICDAR 2003 and ICDAR 2005. His main research interests are pattern recognition, evolutionary computation, and using games as test-beds for and applications of computational intelligence. He is the inventor of the scanning n-tuple classifier, a fast and accurate OCR method.