Integrating OCR and NLP to Digitize 2.3 Million Lichen and Bryophyte Specimens

Edward Gilbert, Corinna Gries, Thomas H. Nash III, Robert Anglin



Goals and Scope

NSF ADBC (#1115116)

~2.3 million specimens
   90% of all specimens
   900,000 lichens
   1.4 million bryophytes

> 60 non-governmental US herbaria (95%)

Mexico, US, Canada

16 digitization centers


Digitization Workflow


National Portals

Lichen Consortium: http://lichenportal.org (34 Collections, 902,664 Records)

Bryophyte Consortium: http://bryophyteportal/ (26 Collections, 1,300,135 Records)

Symbiota software


Imaging Stage

Capture Image: barcode in file name

Create Skeleton File: species name, country, state, exsiccati, etc.

Upload to FTP server

Image processing: extract barcode, create web versions, map to portal DBs

Herbarium Database

Automated OCR: Tesseract, ABBYY

Existing Record: simply link image

Upload to FTP server

Image URLs

Manage Specimen Data in Portal

Manage / Review Records in Portal

Symbiota Editor: review, edit, keystroke

Create New Record: barcode, image, skeletal data

Automated NLP: Darwin Core Parsing
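
The barcode step of the imaging stage can be sketched as follows. This is a minimal illustration, not the portal's actual image processor: the barcode pattern (a short institution code followed by digits, e.g. "ASU0012345") and the function names `barcode_from_filename` and `route_image` are assumptions made for the example.

```python
import re
from pathlib import PurePath

# Hypothetical barcode convention: 2-6 letters, optional hyphen, 5-9 digits.
# Real collections use their own schemes; adjust the pattern accordingly.
BARCODE_RE = re.compile(r"([A-Z]{2,6}-?\d{5,9})")


def barcode_from_filename(path):
    """Return the barcode embedded in an image file name, or None."""
    stem = PurePath(path).stem.upper()
    match = BARCODE_RE.search(stem)
    return match.group(1) if match else None


def route_image(path, existing_barcodes):
    """Decide whether an image links to an existing record or needs a new one."""
    barcode = barcode_from_filename(path)
    if barcode is None:
        return ("unmatched", None)          # no barcode found in the file name
    if barcode in existing_barcodes:
        return ("link_image", barcode)      # existing record: simply link image
    return ("create_record", barcode)       # new record: barcode, image, skeletal data
```

An image whose barcode is already in the database is linked to that record; any other barcode starts a new skeletal record.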


Automated OCR

1. Iterate through “unprocessed” images
2. OCR via Tesseract (version 3)
   a) In focus, good lighting, minimal noise
   b) Resolution: >20px x-height
3. Database raw text block
4. Progress to next step
   1. Low OCR return => hand processing
   2. Natural Language Processing
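
The loop above might look roughly like this. It is a sketch under stated assumptions: the record layout (dicts with "path" and "status" keys), the `min_chars` cutoff for a "low OCR return", and the function names are hypothetical. The `tesseract_cli` helper assumes a Tesseract build that accepts `stdout` as its output base.

```python
import subprocess


def tesseract_cli(image_path):
    """Run the Tesseract command-line tool and return the recognized text."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def run_ocr_batch(images, ocr=tesseract_cli, min_chars=40):
    """OCR each unprocessed image and route it to the next step.

    images: list of dicts with "path" and "status" keys (hypothetical schema).
    min_chars: hypothetical cutoff below which the OCR return counts as low.
    """
    for image in images:
        if image["status"] != "unprocessed":
            continue                                  # step 1: only unprocessed images
        raw_text = ocr(image["path"])                 # step 2: OCR
        image["raw_ocr"] = raw_text                   # step 3: keep the raw text block
        usable = sum(ch.isalnum() for ch in raw_text)
        # step 4: low return goes to hand processing, the rest to NLP
        image["status"] = "nlp" if usable >= min_chars else "hand_processing"
    return images
```

Passing the OCR engine in as a parameter keeps the routing logic testable without Tesseract installed and leaves room for swapping in a second engine.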


OCR Challenges

Issues
   Old fonts
   Faded labels
   Form labels
   Handwritten labels
   Specialized terms

Solutions
   Image treatments
   OCR tuning
   Dictionaries
   Consensus OCR

¢_].L.|»‘¢ .'».f.'._..‘~,(.Jfin-x‘*\'a:"511z:1 wf .~\:'i/.onli State UniversityP.’~.r"~2= ,_. gg J:.2 " J*J*" †(=:\‘-“ax "»..'\-12�‘ “ "‘ ;T~;‘~7i?»-1_1_\f;>sf`;,' ESXZ»ie+‘-». “~'.»te;~:i_.t<» ff`t;~f3":.f.“» »4 xx, ,"""‘“â€T"’ <1;-.rs f3'a,1.z>.t;;a¢f~rus ’�V4 J 'if . r°'° M '1?nies ivain.) Sav.neutal Station - " '1 ~»r';;4-\P ` 1.T11 ./P.. ,J ..-.ELEV. ' `.fJL_\ LATL Q _‘ 1 _ Y’ DATE_ ,. W5. (> f- , -:‘; i f>i_T ~~ . A 1:». v\ .-v »~. 4. a xvala 8/27/73

PLANTS OF NEW r~1ExIcoHerbarium of Arizona State UniversityParmelia ulophyllodes (Vain.) Sav.COUNTY “°â€â€œâ€œ �Joranada Experimental Station -New Mexico State University"“““' on JuniperusELEV. ‘ 4400EEILLEETUR DATEDU T. H. Nash #7914 8/27/73T. H. N.
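
One simple way to compare two OCR returns for the same label, such as the pair above, is to score each against a dictionary of expected terms and keep the better one. The slides do not describe how their dictionaries or consensus OCR work, so this is a stand-in technique, and `dictionary_score` and `best_ocr_output` are hypothetical names.

```python
import re


def dictionary_score(text, vocabulary):
    """Fraction of alphabetic tokens (3+ letters) found in the vocabulary."""
    tokens = re.findall(r"[A-Za-z]{3,}", text)
    if not tokens:
        return 0.0
    hits = sum(token.lower() in vocabulary for token in tokens)
    return hits / len(tokens)


def best_ocr_output(candidates, vocabulary):
    """Pick the candidate text with the highest dictionary score."""
    return max(candidates, key=lambda text: dictionary_score(text, vocabulary))
```

The vocabulary would combine ordinary words with the specialized terms (taxon names, localities, collector names) that general-purpose dictionaries lack.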


Automated NLP

1. Iterate through raw OCR text blocks
2. Parse text block
   1. Darwin Core
   2. Populate database
3. Review
   1. Adjust content
   2. Approve
   3. Handwritten => keystroke
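
A minimal sketch of the parsing step, assuming labels shaped like the Nash example shown earlier ("T. H. Nash #7914 8/27/73", "ELEV. 4400"). The regular expressions and `parse_label` are illustrative only. The output keys are Darwin Core terms (`recordedBy`, `recordNumber`, `verbatimEventDate`, `verbatimElevation`).

```python
import re

# Initials followed by a surname, then "#" and the collector number.
COLLECTOR_RE = re.compile(
    r"(?P<recordedBy>(?:[A-Z]\.\s*)+[A-Z][a-z]+)\s*#\s*(?P<recordNumber>\d+)"
)
# Numeric dates such as 8/27/73; kept verbatim, not interpreted.
DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{2,4})\b")
# "ELEV" followed within a few characters by a number.
ELEV_RE = re.compile(r"ELEV\.?\D{0,10}(\d{2,5})", re.IGNORECASE)


def parse_label(raw_text):
    """Map fragments of a raw OCR text block to Darwin Core terms."""
    record = {}
    collector = COLLECTOR_RE.search(raw_text)
    if collector:
        record.update(collector.groupdict())
    date = DATE_RE.search(raw_text)
    if date:
        record["verbatimEventDate"] = date.group(1)
    elevation = ELEV_RE.search(raw_text)
    if elevation:
        record["verbatimElevation"] = elevation.group(1)
    return record
```

Fields the parser cannot find are simply left out, so the review step sees which values still need to be keyed in.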


NLP Challenges

Issues
   Variable layouts
   Loose standards
   OCR error

Solutions
   Authority tables
   Levenshtein distance
   Word stats
   Format recognition
   Parsing profiles
   Duplicate harvesting
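
Authority tables and Levenshtein distance work together: an OCR'd term is compared with every entry in the table, and the closest entry is accepted if it is near enough. The sketch below assumes a plain list as the authority table and a hypothetical `max_ratio` threshold; `match_authority` is an illustrative name.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # deletion
                current[j - 1] + 1,       # insertion
                previous[j - 1] + cost,   # substitution
            ))
        previous = current
    return previous[-1]


def match_authority(term, authority, max_ratio=0.25):
    """Return the closest authority entry, or None if nothing is close enough.

    max_ratio is a hypothetical tolerance: allowed edits per character.
    """
    if not authority:
        return None
    best = min(authority, key=lambda entry: levenshtein(term.lower(), entry.lower()))
    distance = levenshtein(term.lower(), best.lower())
    if distance <= max_ratio * max(len(term), len(best)):
        return best
    return None
```

For example, an OCR return of "Parmelia ulophyl1odes" (digit 1 for the letter l) is one edit away from "Parmelia ulophyllodes" and would be corrected to it, while unrelated noise matches nothing.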


NLP: Duplicate Harvesting

1. Extract collector data
   a) Last name, number, date
2. Harvest duplicates from consortium DB
   a) Exact duplicates
   b) Duplicate events
3. High similarity indexes
4. OCR block comparison
5. Consensus record
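
Steps 1, 2, and 5 can be sketched as a key lookup followed by a per-field vote. This assumes records are dicts keyed by Darwin Core terms and that a majority vote is an acceptable way to build the consensus; the slides do not specify the actual rule, and the function names are hypothetical. The similarity index and OCR block comparison (steps 3 and 4) are not shown.

```python
from collections import Counter


def duplicate_key(record):
    """Key for a collecting event: collector last name, number, date."""
    return (
        record["recordedBy"].split()[-1].lower(),
        record["recordNumber"].strip(),
        record["eventDate"],
    )


def harvest_duplicates(target, consortium_records):
    """Return consortium records that share the target's collecting event."""
    key = duplicate_key(target)
    return [r for r in consortium_records if duplicate_key(r) == key]


def consensus_record(duplicates, fields):
    """For each field, keep the most common non-empty value."""
    consensus = {}
    for field in fields:
        values = [r[field] for r in duplicates if r.get(field)]
        if values:
            consensus[field] = Counter(values).most_common(1)[0][0]
    return consensus
```

Because duplicates of one collection are often distributed to several herbaria, a record already keyed at one institution can fill in the fields for the same collection elsewhere.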


NLP: Targeted Parsing Profiles

1. Target similar label formats
2. Use raw OCR to locate “Nash” labels
3. Targeted parsing algorithms
4. Exclude:
   a) Determined by Nash
   b) Author of scientific name
   c) Associated collector
   d) County
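
The selection step could be approximated by removing the excluded contexts from the raw OCR text and checking whether any mention of "Nash" remains. The patterns below are rough heuristics invented for this illustration, not the profile used in the project, and `is_nash_collector_label` is a hypothetical name.

```python
import re

# Contexts where "Nash" appears but is not the primary collector.
EXCLUDE_PATTERNS = [
    # a) determiner: "Det. T. H. Nash", "Determined by Nash"
    re.compile(r"\bdet(?:ermined)?\.?\s*(?:by)?:?\s*(?:[A-Z]\.\s*)*Nash\b",
               re.IGNORECASE),
    # b) author of a scientific name: "Genus species (T. H. Nash)"
    re.compile(r"\b[A-Z][a-z]+\s+[a-z]+\s+\(?(?:[A-Z]\.\s*)*Nash\b"),
    # c) associated collector: "with T. H. Nash", "& Nash"
    re.compile(r"(?:\bwith|\band|&)\s+(?:[A-Z]\.\s*)*Nash\b"),
    # d) county: "Nash County", "Nash Co."
    re.compile(r"\bNash\s+(?:County|Co\.)", re.IGNORECASE),
]


def is_nash_collector_label(raw_text):
    """True if "Nash" still appears after the excluded contexts are removed."""
    remaining = raw_text
    for pattern in EXCLUDE_PATTERNS:
        remaining = pattern.sub(" ", remaining)
    return re.search(r"\bNash\b", remaining) is not None
```

Heuristics like these misfire on unusual layouts (for instance, a lowercase substrate word directly before the collector name would trip the author pattern), which is one reason the workflow ends with a review step.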


Label Review


Thank You

Michael Adamo Bruce Allen Meredith Blackwell Bill Buck Alina Freire-Fierro John Freudenstein Alan Fryday David Giblin Karen Hughes Steffi Ickert-Bond Timothy James Jennifer S. Kluse Matt Von Konrat Ben Legler Tatyana Livshultz

Robert Lücking Francois Lutzoni Bob Magill Andrew Miller Brent Mishler Donald Pfister Richard Rabeler Malcolm Sargent Edward Schilling Michaela Schmull Blanka Shaw Jon Shaw Carol Shearer Larry StClair Barbara Thiers

Funded by the NSF ADBC program