Improving search in scanned documents: Looking for OCR mismatches
David Morse, David King, Anton Dil, Alistair Willis, David Roberts, Chris Lyal


Page 1: Improving search in scanned documents: Looking for OCR mismatches

Improving search in scanned documents: Looking for OCR mismatches

David Morse, David King, Anton Dil, Alistair Willis

David Roberts, Chris Lyal

Page 2: Improving search in scanned documents: Looking for OCR mismatches

Introduction

– Biological taxonomy
  – manage names of organisms
    – eg. Homo sapiens
  – relationships between organisms
– Heavily publication based
  – extensive legacy literature
    – from 15th century
  – observations appear in wide variety of publications
    – learned society documents, encyclopaedias, institution reports, etc.
  – occurrence data, historical trends, geographic clues etc.
  – biodiversity management

Page 3: Improving search in scanned documents: Looking for OCR mismatches

Digital libraries and curation

– Digitised historical collections
  – necessary for practising taxonomists
  – possible Natural Language Processing research?
    – compare Genia/Medline
– Must be searchable
  – on taxonomic names
  – also other non-standard English:
    – names (authorities)
    – locations etc.
    – technical language
    – non-dictionary terms

Page 4: Improving search in scanned documents: Looking for OCR mismatches

Mass digitisation

– Most biodiversity literature still on paper only
  – but collections being scanned at high rate
  – eg. the Biodiversity Heritage Library
– Several major repositories of biodiversity documents
  – BHL
  – INOTAXA (BHL + TEI-Lite)
  – eFloras
  – Plazi
  – Internet Archive
– Lack of consistent markup
  – SciXML, NLM DTD, ...

Page 5: Improving search in scanned documents: Looking for OCR mismatches

Markup in ABLE

– Previously been using TEI-Lite
  – INOTAXA
  – structural markup
  – semantic content in layout
    – eg. indentation for hierarchy
– Moving towards taXMLit
  – retains structural markup
  – includes semantic markup
    – taxonomic name,
    – authority,
    – operation (new name, synonym etc)
    – date
    – etc.

Page 6: Improving search in scanned documents: Looking for OCR mismatches

Digital libraries and curation

– Large scale scanning
  – BHL currently:
    – 22,000 volumes
    – 9.2 million pages
  – growth rate:
    – 1,500 volumes / month
    – 600,000 pages / month
– No chance of manual correction/markup
– Automatic markup necessary
  – INOTAXA
    – Cheaper to manually correct and rekey

Page 7: Improving search in scanned documents: Looking for OCR mismatches

Digital libraries and curation

– Taxonomic nomenclature recognition difficult to automate
  – OCR errors
    – variation in fonts
  – meaning in layout
    – not generally captured by OCR
  – non-dictionary words

I should refer to Peritaxia, the first ventral suture being nearly straight

Page 8: Improving search in scanned documents: Looking for OCR mismatches
Page 9: Improving search in scanned documents: Looking for OCR mismatches

Digital libraries and curation

– huge terminological variation (GBIF)

Actinobacillus actimomycetemcomitans
Actinobacillus actimycetemcomitans
Actinobacillus actinmycetemcomitans
Actinobacillus actinomicetemcomitans
Actinobacillus actinomy
Actinobacillus actinomyce
Actinobacillus actinomycemcomitans
Actinobacillus actinomyceremcomitans
Actinobacillus actinomycetam
Actinobacillus actinomycetamcomitans
Actinobacillus actinomycetecomitans
Actinobacillus actinomycetemcmitans
Actinobacillus actinomycetemcomintans
Actinobacillus actinomycetemcomitance
Actinobacillus actinomycetemcomitans
Actinobacillus actinomycetemcomitants
Actinobacillus actinomycetemcommitans
Actinobacillus actinomycetemocimitans
Actinobacillus actinomycetencomitans
Actinobacillus actinomycetum
Actinobacillus actinomyctemcomitans
Actinobacillus actinomyectomcomitans
Actinobacillus actinomyetemcomitans
Actinobacillus actinonmycetemcomitans
Actinobacillus actionomycetemcomitans
Actinobacillus actynomicetemcomitans
Actinobacillus antinomycetemcomitans

Page 10: Improving search in scanned documents: Looking for OCR mismatches

Variation with OCR

– Want to help manage databases of taxons
  – lots for taxonomy
    – GBIF, ITIS, Species 2000, Catalogue of life, uBIO, ...
  – very incomplete
  – too few specialists to maintain/integrate
– Literature used to manage these resources,
  – but only if can identify/search terms in documents.
– OCR struggles with these examples

Page 11: Improving search in scanned documents: Looking for OCR mismatches

OCR accuracy

– Typical OCR accuracy around 95-96% (by word)
  – Generally on born digital documents
    – known fonts
      – for legacy literature, font may be unique to a publication
    – standard dictionaries
– Recognition rates for taxons
  – 20-35% F-score (TaxonFinder, FAT)
– Is terminology in the 5% of errors?
  – specialist English
  – italics in non-standard fonts

Page 12: Improving search in scanned documents: Looking for OCR mismatches

Problems with the terminology variation

– Misreadings may not be identifiable as errors
  – Homa / Homo
  – Pica / Pioa
  – no canonical reference
– Even with multiple OCR readings, may not get the correct form
  – RHYNCHOPHOBA (correct)
  – BHYNCHOPHOKA (PDF maker)
  – KHYNCHOPHOBA (ABBYY)
– Attempting to correct the OCR errors is not an option

Page 13: Improving search in scanned documents: Looking for OCR mismatches

New taxons

– Not in the business of correcting OCR
  – no canonical reference
  – aim is to identify new taxons
  – community too small to correct most errors
– Is an unrecognised word a new taxon?
  – Pioa
    – Pica? (type of magpie)
    – Roa?
    – or a new term?

Page 14: Improving search in scanned documents: Looking for OCR mismatches

Proposed approach

– No possibility of getting the correct version from the two interpretations
  – But often get a mismatch
  – Differences between collections of OCRed versions may provide clues
– Compare outputs using sequence alignment algorithm
  – Needleman-Wunsch
    – word by word comparison
    – plus Levenshtein edit distance
  – hand-keyed INOTAXA text as reference

Page 15: Improving search in scanned documents: Looking for OCR mismatches

What we’re working with

– Can’t get at the internals of OCR systems
– Have a hand-corrected version of a document
  – Biologia Centrali-Americana, Coleoptera v.4 pt.3
    – 180,553 words
    – INOTAXA hand keyed
– + scanned version with 2 OCRs
  – BHL
    – NHM (pdf maker)
  – Internet Archive
    – ABBYY FineReader
  – taken from same jpeg

Page 16: Improving search in scanned documents: Looking for OCR mismatches

Needleman-Wunsch algorithm

– Global sequence alignment algorithm

– Match similar terms against each other
  – or insert gaps
  – score for match, mismatch, gap insertion
  – minimise score
– To align ABCE with ABDE:

  A - A          A - A          A - A
  B - B          B - B          B - B
  C - [ ]        [ ] - D        C - D
  [ ] - D        C - [ ]        E - E
  E - E          E - E

– Depending on score function
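
The ABCE / ABDE example can be reproduced with a short sketch. It assumes the cost-minimising formulation named on the slide; the function name `needleman_wunsch`, the `sub_cost` callback and the particular cost values are illustrative choices, not taken from the talk.

```python
def needleman_wunsch(a, b, sub_cost, gap_cost=1.0):
    """Globally align sequences a and b, minimising the total cost.

    Returns (total_cost, pairs); None in a pair marks a gap.
    sub_cost(x, y) is the cost of pairing x with y (0 for a match).
    """
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap_cost
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # pair
                cost[i - 1][j] + gap_cost,                          # gap in b
                cost[i][j - 1] + gap_cost,                          # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1])):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or cost[i][j] == cost[i - 1][j] + gap_cost):
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


# A cheap mismatch pairs C with D (third column on the slide).
cheap = needleman_wunsch("ABCE", "ABDE", lambda x, y: 0.0 if x == y else 1.0)
# An expensive mismatch makes two gaps cheaper than pairing C with D.
dear = needleman_wunsch("ABCE", "ABDE", lambda x, y: 0.0 if x == y else 3.0)
print(cheap)
print(dear)
```

With a mismatch cost below twice the gap cost the algorithm pairs C with D; above that it prefers two gaps, which is what "depending on score function" refers to.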

Page 17: Improving search in scanned documents: Looking for OCR mismatches

Needleman-Wunsch on OCR outputs

– Compare word sequences

– The study of the Otiorhynchinæ Alatæ has unfortunately been delayed

– The study of the Otiorhynchinse Alatee has unfortunately been delayed

– The study of the Otiorhynchinœ Alatse has unfortunately been delayed

– Finds mismatches for similar words
  – lower penalty for similar terms
    – Levenshtein comparison
– Gap sequences for lengthy mismatches
  – So alignment preferable to eg. DIFF
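
A minimal sketch of the word-level comparison follows, using two of the sentence readings above. The normalisation of the Levenshtein distance by word length, the gap cost of 0.4 and the names `levenshtein`, `word_cost` and `align_words` are assumptions made for illustration; the talk does not give the actual scoring function.

```python
def levenshtein(s, t):
    """Edit distance between strings s and t (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # delete from s
                            curr[j - 1] + 1,            # insert into s
                            prev[j - 1] + (cs != ct)))  # substitute or keep
        prev = curr
    return prev[-1]


def word_cost(u, v):
    """Pairing cost in [0, 1]: 0 for identical words, lower for similar ones."""
    if u == v:
        return 0.0
    return levenshtein(u, v) / max(len(u), len(v))


def align_words(xs, ys, gap=0.4):
    """Needleman-Wunsch over word lists; returns (x_word, y_word, label) rows."""
    n, m = len(xs), len(ys)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + word_cost(xs[i - 1], ys[j - 1]),
                cost[i - 1][j] + gap,
                cost[i][j - 1] + gap,
            )
    rows, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + word_cost(xs[i - 1], ys[j - 1])):
            label = "MATCH" if xs[i - 1] == ys[j - 1] else "MISMATCH"
            rows.append((xs[i - 1], ys[j - 1], label))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or cost[i][j] == cost[i - 1][j] + gap):
            rows.append((xs[i - 1], None, "GAP"))
            i -= 1
        else:
            rows.append((None, ys[j - 1], "GAP"))
            j -= 1
    rows.reverse()
    return rows


hand = "The study of the Otiorhynchinæ Alatæ has unfortunately been delayed".split()
ocr = "The study of the Otiorhynchinse Alatee has unfortunately been delayed".split()
for x, y, label in align_words(hand, ocr):
    if label != "MATCH":
        print(x, y, label)
```

Because similar words cost less to pair than two gaps, the misread names line up against each other as mismatches, while tokens with no counterpart in the other reading fall out as gaps.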

Page 18: Improving search in scanned documents: Looking for OCR mismatches

Sequence of gaps

schwarzi schwarzi MATCH

34 34 MATCH

1 1 MATCH

[] — GAP

[] , GAP

[] - GAP

obsoletus obsoletus MATCH

273, 273, MATCH
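
One way to turn such output into flagged regions is to collapse consecutive GAP rows into runs and report the neighbouring words. This is a sketch only: the row layout mirrors the table above, while the helper name `gap_runs` and the `min_len` threshold are hypothetical.

```python
from itertools import groupby

# Rows as in the table above: (first reading, second reading, label).
rows = [
    ("schwarzi", "schwarzi", "MATCH"),
    ("34", "34", "MATCH"),
    ("1", "1", "MATCH"),
    (None, "—", "GAP"),
    (None, ",", "GAP"),
    (None, "-", "GAP"),
    ("obsoletus", "obsoletus", "MATCH"),
    ("273,", "273,", "MATCH"),
]


def gap_runs(rows, min_len=2):
    """Yield (start_index, run) for each run of min_len or more GAP rows."""
    index = 0
    for label, group in groupby(rows, key=lambda r: r[2]):
        run = list(group)
        if label == "GAP" and len(run) >= min_len:
            yield index, run
        index += len(run)


for start, run in gap_runs(rows):
    end = start + len(run)
    before = rows[start - 1][1] if start > 0 else None
    after = rows[end][1] if end < len(rows) else None
    print(f"{len(run)} gaps between {before!r} and {after!r}")
```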

Page 19: Improving search in scanned documents: Looking for OCR mismatches

Results

– No formal evaluation yet

– Initial results promising
  – Precision good
    – even where misread punctuation
    – inconsistent punctuation analysis sometimes suggests that surrounding word(s) are difficult for OCR
– Recall currently difficult to measure
  – hand markup not always consistent
  – Part of motivation for moving to taXMLit

Page 20: Improving search in scanned documents: Looking for OCR mismatches

Future work

– Fuzzy search
  – use markup to highlight difficult terms
  – requires development of appropriate markup language
  – partial disambiguation with collocations?
– Information extraction?
  – Synonym recognition

tail is longer, ♂, shorter, ♀, with green tip

  – OCR doesn’t recognise the symbols either...
  – idiosyncratic language, often unique to author
– Interpretation from layout conventions
  – current OCR not fine grained enough

Page 21: Improving search in scanned documents: Looking for OCR mismatches

Thank you

Any questions?