IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

IMPACT Final Conference - Ulrich Reffle


Postcorrection in IMPACT with Ulrich Reffle from the University of Munich


Analysis and Post-Correction of OCR-processed historical documents

Ulrich Reffle

CIS, University of Munich


24.10.2011 Ulrich Reffle [email protected]

Overview

Document specific analysis of OCR results of historical documents
A system for interactive OCR post-correction


Document specific analysis of OCR results of historical documents


Why do we need special methods?

Problems specific to the processing of historical language in the context of mass digitization:
– High OCR error rates
– No standardized language

Special resources and methods are needed for OCR, post-processing and Information Retrieval

[Figure: processing pipeline with the stages digital image, OCR, OCR result, post-correction and IR, annotated "Problem of historical language variation"]


Why do we need special methods?

Diversity of input material makes document specific parameter settings important:
– Distribution of spelling variants
– Special vocabulary
– OCR channel model



Document specific language and error profiles

Language and error profiles provide document specific characteristics of the language and OCR errors.

Language profile: shares of foreign languages (such as Latin, French), frequencies for language modeling, important patterns of spelling variation (in English: e.g. o→ou, v→u)

Error profile: estimated error rate, important error patterns (like e→c, i→l), frequent erroneous words

Language and error profiles are computed fully automatically; no manual interaction or ground truth is needed.


Global Profile of a document

Language profile

Variant pattern   Frequency
t→th              120
i→y               106
ä→a               38
…                 …

Lexicon           %
Modern            82%
Historic          9%
Place names       6%
Latin             3%

Error profile

Error pattern     Frequency
e→c               51
n→u               45
t→i               34
…                 …

Word status       %
Correct words     72%
Erroneous words   20%
Unknown words     8%
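The frequency tables in a global profile can be read as counts of character-level rewrite patterns. The following is only a minimal sketch of that counting step. It assumes that pairs of (intended word, observed token) have already been hypothesized by some earlier step, which is not how the slides say the profiles are obtained (they are computed from OCR output alone, without ground truth). The function names `edit_patterns` and `pattern_frequencies` are hypothetical and not part of any IMPACT tool.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_patterns(intended, observed):
    """Return the character-level rewrite patterns that turn `intended`
    into `observed`, written as "left→right" strings."""
    matcher = SequenceMatcher(None, intended, observed, autojunk=False)
    patterns = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Replacements, insertions and deletions all become one pattern;
            # an insertion has an empty left side, a deletion an empty right side.
            patterns.append(f"{intended[i1:i2]}→{observed[j1:j2]}")
    return patterns


def pattern_frequencies(pairs):
    """Count how often each rewrite pattern occurs over all word pairs."""
    counts = Counter()
    for intended, observed in pairs:
        counts.update(edit_patterns(intended, observed))
    return counts


if __name__ == "__main__":
    # Hypothetical, hand-made pairs (assumption: the pairing is already known).
    pairs = [("bei", "bey"), ("sei", "sey"), ("hand", "haud")]
    for pattern, frequency in pattern_frequencies(pairs).most_common():
        print(pattern, frequency)
```

Note that this sketch reports the smallest differing span, so an inserted letter shows up with an empty left side (for example `→h`) rather than with context, as in the slide's `t→th`.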


Local profile of all words of a document

Example tokens: „theil“, „hatn“

Weighted set of interpretations/correction suggestions for each word of the document.

Correction suggestion   Modern spelling   Probability
hath                    has               0.95
hat                     Hat               0.01
hate                    hate              0.04
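A local profile entry can be pictured as a small ranked list of suggestions whose probabilities sum to one. The sketch below is an illustration under assumptions, not the system's actual data model: the `Suggestion` class, its `weight` field and the function `local_profile` are hypothetical names, and the raw weights are invented so that they reproduce the probabilities shown on the slide.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    correction: str  # suggested reading of the OCR token
    modern: str      # modern spelling belonging to that reading
    weight: float    # unnormalized score (hypothetical)


def local_profile(suggestions):
    """Normalize the weights to probabilities and rank the suggestions,
    most probable first."""
    total = sum(s.weight for s in suggestions)
    if total <= 0:
        raise ValueError("at least one suggestion needs a positive weight")
    ranked = sorted(suggestions, key=lambda s: s.weight, reverse=True)
    return [(s.correction, s.modern, s.weight / total) for s in ranked]


if __name__ == "__main__":
    # Weights chosen to match the slide's example values.
    entry = [
        Suggestion("hath", "has", 95.0),
        Suggestion("hat", "Hat", 1.0),
        Suggestion("hate", "hate", 4.0),
    ]
    for correction, modern, probability in local_profile(entry):
        print(f"{correction}\t{modern}\t{probability:.2f}")
```

Keeping the modern spelling next to each suggestion means a downstream step such as retrieval can index the modern form while the editor still shows the historical reading.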


Summary

Document specific profiles …
– are computed in a fully automated way from OCR output
– provide characteristics of language and OCR error channel in order to adapt OCR and downstream processes.


System for interactive post-correction of OCR results


Post-correction system

A graphical user interface for fast and convenient post-correction specifically for OCRed historical documents

Novel possibilities for detection, presentation and correction of systematic OCR errors.


[Figure: screenshot of the post-correction system with areas labelled "Special functionality", "Image" and "OCR Editor"]


Proper treatment of spelling variants

Historical spelling variants are identified with the help of historical lexica and language profiles.

Local profiles include non-modern words as correction suggestions.
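One way to picture how a historical spelling can be related to a modern lexicon entry is to apply variant patterns to modern words and compare the results with the token. This is a simplified sketch under assumptions: it applies a single pattern at a single position, the two patterns are taken from the global-profile example, and the tiny lexicon and the names `variants` and `modern_candidates` are hypothetical. The slides do not describe the actual matching procedure.

```python
def variants(modern_word, patterns):
    """Return all spellings obtained by applying one rewrite pattern
    (left, right) at one position of `modern_word`."""
    results = set()
    for left, right in patterns:
        start = modern_word.find(left)
        while start != -1:
            rewritten = modern_word[:start] + right + modern_word[start + len(left):]
            results.add(rewritten)
            start = modern_word.find(left, start + 1)
    return results


def modern_candidates(token, lexicon, patterns):
    """Return the modern lexicon words that `token` equals or is a
    one-pattern variant of, in alphabetical order."""
    return sorted(
        word for word in lexicon
        if token == word or token in variants(word, patterns)
    )


if __name__ == "__main__":
    patterns = [("t", "th"), ("i", "y")]   # modern → historical
    lexicon = {"teil", "heil", "bei"}      # hypothetical modern lexicon
    print(modern_candidates("theil", lexicon, patterns))
    print(modern_candidates("bey", lexicon, patterns))
```

A token recognized this way is a legitimate historical form rather than an OCR error, so it can be kept as a correction suggestion instead of being replaced by its modern spelling.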


Conventional correction methods

Correcting words in the text view
– Manual input
– Selection of a correction suggestion


Batch-Correction of systematic OCR errors

Systematic OCR errors are identified by the error profile.

Batches of errors can be corrected with just a few keystrokes.
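The idea of batch correction can be sketched as grouping suspected errors by their error pattern and then accepting a whole group at once. The code below is an illustrative sketch only: the tuple layout, the sample tokens and the function names `group_by_pattern` and `apply_batch` are assumptions, and it says nothing about how the actual user interface is implemented.

```python
from collections import defaultdict


def group_by_pattern(candidates):
    """Group correction candidates by error pattern.

    Each candidate is a tuple (position, ocr_token, suggestion, pattern)."""
    batches = defaultdict(list)
    for position, ocr_token, suggestion, pattern in candidates:
        batches[pattern].append((position, ocr_token, suggestion))
    return dict(batches)


def apply_batch(tokens, batch):
    """Return a copy of `tokens` with every correction of one batch applied.

    A position is only changed if it still holds the expected OCR token,
    so an earlier manual edit is not overwritten."""
    corrected = list(tokens)
    for position, ocr_token, suggestion in batch:
        if corrected[position] == ocr_token:
            corrected[position] = suggestion
    return corrected


if __name__ == "__main__":
    # Hypothetical OCR output and candidates.
    tokens = ["thc", "hand", "of", "thc", "haud"]
    candidates = [
        (0, "thc", "the", "e→c"),
        (3, "thc", "the", "e→c"),
        (4, "haud", "hand", "n→u"),
    ]
    batches = group_by_pattern(candidates)
    # Accepting the whole e→c batch corrects both occurrences in one step.
    print(apply_batch(tokens, batches["e→c"]))
```

Because one confirmation covers every token in the batch, the effort scales with the number of distinct error patterns rather than with the number of erroneous tokens.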


Evaluation

User experiment with 14 participants. Novel technology makes correction up to 2.7 times faster.


Availability

The graphical interface is going to be distributed as open source.

Document pre-processing to obtain language and error profiles is protected by a US patent application.
– Pre-processing is offered as a web service, as of now free of charge.


Thank you!

http://[email protected]
