OCR and SALIX Parsing


Daryl Lafferty, Arizona State University

October, 2012

SALIX: Semi-Automatic Label Information eXtraction

SALIX was developed at Arizona State University from 2009 through 2012.

Over 55,000 ASU Herbarium specimen labels were digitized using SALIX.

Ideal SALIX Process Flow

The ideal process flow is:

1. Photograph the specimen label.

2. Perform OCR on the photograph.

3. Have SALIX parse the resulting text into database categories.

4. Upload the results to the database.

Practical SALIX Process Flow

The actual process flow has added steps:

1. Photograph the specimen label.

2. Perform OCR on the photograph.

3. Correct any OCR errors and tweak the text layout.

4. Have SALIX parse the resulting text into database categories.

5. Correct any mis-parsed results.

6. Upload the results to the database.

OCR Workflow

We use ABBYY Professional Version 10.

We capture an image of the full specimen, and another of just the label for OCR.

Processing is done in batch mode, usually run overnight on a folder containing hundreds of images.

The result is a single text file with one label per page.

OCR errors are corrected in the text file before processing with SALIX.
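As an illustration of the "one label per page" text file, here is a minimal Python sketch that splits such a file back into individual label texts. It assumes pages are separated by a form-feed character, which may not match the actual OCR export settings; `split_labels` and the sample text are hypothetical and not part of SALIX or ABBYY.

```python
def split_labels(ocr_text, page_break="\f"):
    """Split batch OCR output into one text block per label.

    Assumption: each page (one label) ends with a form-feed character.
    Pass a different page_break if the export uses another separator.
    """
    pages = ocr_text.split(page_break)
    # Drop empty pages and surrounding whitespace left over from the split.
    return [page.strip() for page in pages if page.strip()]


# Made-up sample standing in for a corrected OCR text file with two labels.
sample = "Plants of Arizona\nSalix gooddingii\n\fFlora of Mexico\nQuercus sp.\n"
labels = split_labels(sample)
for number, label in enumerate(labels, start=1):
    print(f"--- label {number} ---")
    print(label)
```

Keeping the labels as separate blocks makes it easier to correct OCR errors one label at a time before the text is passed to SALIX.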

The SALIX User Interface

Manual Data Entry

A label that results in many OCR errors

A label that results in few OCR errors

Label Length and Quality

We first categorized 4 different label types, with the following average characteristics:

We then had 3 students each process 10 labels of each category (40 labels total), both through SALIX and by typing them into the Symbiota form.

Sample Throughput Data

Conclusions

OCR quality has a strong effect on semi-automated parsing throughput using SALIX.

OCR using ABBYY in Batch Mode was most efficient for our workflow.

The relationship is roughly:

S = ratio of SALIX throughput to typing throughput

and E = OCR error rate, stated as OCR errors per 100 words.

(Obviously, the relationship isn't accurate as E approaches zero, i.e. less than about 2 errors per 100 words.)
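The two quantities defined here can be computed directly from raw counts. The sketch below does only that; it does not reproduce the fitted relationship between S and E, which is not given in this text. The function names, the labels-per-hour unit, and the example numbers are illustrative assumptions, not data from the study.

```python
def ocr_error_rate(ocr_errors, word_count):
    """E: OCR errors per 100 words of label text."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 100.0 * ocr_errors / word_count


def throughput_ratio(salix_throughput, typing_throughput):
    """S: SALIX throughput divided by typing throughput.

    Both throughputs must use the same unit (assumed here: labels per hour).
    S > 1 means SALIX was faster than typing for that set of labels.
    """
    if typing_throughput <= 0:
        raise ValueError("typing_throughput must be positive")
    return salix_throughput / typing_throughput


# Illustrative numbers only.
E = ocr_error_rate(ocr_errors=5, word_count=250)
S = throughput_ratio(salix_throughput=30.0, typing_throughput=20.0)
print(f"E = {E:.1f} errors per 100 words, S = {S:.2f}")
```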

Acknowledgements

All of the data presented here was from Anne Barber's Master's Thesis, completed at ASU in May, 2012.

Anne also developed the process flow that helped optimize SALIX throughput.

The overall project was under the direction of Les Landrum, curator of the ASU Herbarium.
