IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

TR5 Profiler and Post-Correction System
Ludwig-Maximilians-Universität München
Centrum für Informations- und Sprachverarbeitung

Bne demoday postcorrection_and_profiler


Presentation introducing the Profiler and Post-Correction System, given by Jesse de Does during the demo session held at the BNE on 5 October 2011.

TR5 Post-Correction System

User interface for easy postcorrection of historical OCR'd documents

– Stand-alone user interface
– Innovative language technology enables identification and presentation of recognition errors and efficient correction

Customizable user interface

OCR and image fragments

Correction candidates, Special functions

Complete image

Freely rearrangeable interface elements:

– OCR with image snippets
– Complete image
– Correction candidates / Special functions

Font size

View: OCR and image clippings

Word-by-word presentation of recognized text and image clippings.

Comparing text and image follows the reading order and is much easier than with a side-by-side presentation of image and text.

View: Original image

– For difficult cases
– When word segmentation by OCR fails
– Current word is highlighted

Word by word correction of text

– Correction by manual text entry
– Choosing correction candidates
– Faster correction thanks to candidates proposed by the postcorrection system

Batch correction: efficient postcorrection

Batch correction
– Several occurrences of identical

Batch correction: efficient postcorrection

Batch correction
– classes of systematic errors
– errors where the correction candidate has a high degree of certainty
– further possibilities

Frequent errors
For instance: location names
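Batch correction can be pictured as grouping identical OCR tokens so that a single decision fixes every occurrence. The following is a minimal sketch under that reading, not the Post-Correction System's own code: the function names `group_occurrences` and `apply_batch` and the sample tokens are hypothetical.

```python
from collections import defaultdict


def group_occurrences(tokens):
    """Map each distinct token to the list of positions where it occurs."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    return dict(positions)


def apply_batch(tokens, wrong, corrected):
    """Return a corrected copy of `tokens` and the number of replacements made."""
    positions = group_occurrences(tokens).get(wrong, [])
    result = list(tokens)  # leave the original OCR output untouched
    for index in positions:
        result[index] = corrected
    return result, len(positions)


# Toy OCR output (made up) in which "iheil" is a repeated misreading of "theil".
ocr_tokens = ["der", "iheil", "und", "der", "iheil", "des", "theil"]
fixed, count = apply_batch(ocr_tokens, "iheil", "theil")
```

Working on a copy keeps the original tokens available, so a batch decision could be reviewed or undone.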

Postcorrection system: Evaluation

Ulrich Reffle, 4 July 2011


Error correction thanks to text and error profiling is 2.7 times faster

User Experiment with 14 individual instances

Why another postcorrection system?

Targets a more specialist audience

Thanks to underlying language technology:
– Historical variants are recognized and not marked as errors, even when not in the historical lexicon
– Historical variants are proposed as correction candidates
– Typical error patterns are exploited
– Ranking of correction candidates

Underlying language technology

Lexica and language models help in dealing with orthographical variants and unknown words.

– Recognition of OCR errors and proposal of correction candidates depend on specially developed LMU language technology
– Approximate search in “hypothetical lexica”
– An analysis of the whole work (“text and error profile”) produces document-specific information about the language and the type of OCR errors

Text and error profiles

Text profile
Coverage of lexica
Typical variant patterns
● Targeted selection of lexica
● Better language models
● Distinguishing historical variants and OCR errors
● Ranking of correction candidates
● Recall and precision in IR

Error profile
Estimate of error rate
Typical OCR errors
● Better modeling of error channel
● Distinguishing historical variants and OCR errors
● Ranking of correction candidates
● Treatment of systematic errors

Underlying logic: Dual noisy channel model

Interpretation of OCR output tokens as the result of two “noisy channels”:

modern word u → (patterns) → historical variant v → (OCR errors) → OCR result w

Given an OCR token w, give possible interpretations of w in terms of
• the “underlying” modern word u (IR!)
• the correct historical word v and its derivation from u via “patterns”
• the OCR errors garbling v into w
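One way to make the two channels concrete is to score each interpretation (u, v, w) as the probability of the modern word times the probabilities of the variant patterns and of the OCR errors involved. The sketch below assumes this simple product form; the class `Interpretation`, the functions `score` and `rank`, and all probabilities except the 2.9% rate for t→th are made up for illustration and are not taken from the Profiler.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Interpretation:
    modern: str        # u, the underlying modern word
    historical: str    # v, the correct historical form
    ocr: str           # w, the token as produced by OCR
    patterns: tuple    # variant patterns turning u into v
    ocr_errors: tuple  # OCR errors turning v into w


def score(interp, word_prob, pattern_prob, error_prob):
    """P(u) times the probabilities of all patterns and OCR errors used."""
    p_word = word_prob.get(interp.modern, 0.0)
    p_hist = prod(pattern_prob.get(p, 0.0) for p in interp.patterns)
    p_ocr = prod(error_prob.get(e, 0.0) for e in interp.ocr_errors)
    return p_word * p_hist * p_ocr


def rank(interps, word_prob, pattern_prob, error_prob):
    """Sort interpretations from most to least probable."""
    return sorted(
        interps,
        key=lambda i: score(i, word_prob, pattern_prob, error_prob),
        reverse=True,
    )


# Hypothetical numbers; only the 0.029 for t→th echoes a figure from the slides.
word_prob = {"teil": 0.002, "heil": 0.001}
pattern_prob = {("t", "th"): 0.029}
error_prob = {("t", "i"): 0.01, ("", "i"): 0.0001}

candidates = [
    Interpretation("heil", "heil", "iheil", (), (("", "i"),)),
    Interpretation("teil", "theil", "iheil", (("t", "th"),), (("t", "i"),)),
]
ranked = rank(candidates, word_prob, pattern_prob, error_prob)
```

With these toy values the reading teil → theil → iheil outranks the alternative, which is the kind of ordering a ranked list of correction candidates needs.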

Historical variant and OCR error patterns

Historical variant pattern: teil → theil

OCR error pattern: theil → iheil

Relative frequency: 2.9% of all ‘t’ are rewritten to ‘th’

Absolute frequency: Pattern was found 120 times in the current document.
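The two figures can be related by a simple count: the absolute frequency is the number of times a pattern was applied, and the relative frequency divides that by the number of places where it could have applied. The sketch below is a simplified illustration that only recognizes word pairs differing by a single application of the pattern; the function name `pattern_frequencies` and the word pairs are hypothetical.

```python
def pattern_frequencies(pairs, left, right):
    """Count how often `left` is rewritten to `right` in (modern, historical) pairs.

    Returns (absolute frequency, relative frequency). Only pairs that differ
    by exactly one application of the pattern are recognized.
    """
    applied = 0   # absolute frequency of the pattern
    possible = 0  # occurrences of `left` in the modern words
    for modern, historical in pairs:
        start = 0
        while True:
            pos = modern.find(left, start)
            if pos == -1:
                break
            possible += 1
            rewritten = modern[:pos] + right + modern[pos + len(left):]
            if rewritten == historical:
                applied += 1
            start = pos + 1
    relative = applied / possible if possible else 0.0
    return applied, relative


# Toy (modern, historical) pairs, made up for illustration.
pairs = [("teil", "theil"), ("tat", "that"), ("mit", "mit"), ("rot", "roth")]
absolute, relative = pattern_frequencies(pairs, "t", "th")
```

In this toy sample the pattern applies at 3 of the 5 positions where a 't' occurs, so the relative frequency is 0.6.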

Local view: interpretations of tokens

● Local view: “Meaningful interpretations” for all tokens of the OCR text are the matches in all attached lexicons, using the given settings.

Occurrence of spelling variant “i→y”:

Occurrence of OCR error “i→y”:

Global view: pattern frequencies

● Global view: Increment counters to estimate (relative) frequencies.

Occurrences of spelling variant “i→y”: +0.999771

Occurrences of OCR error “i→y”: +0.000224948
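The fractional increments suggest that each token spreads one unit of evidence over its competing interpretations instead of casting a single hard vote. A minimal sketch of such fractional counting follows; whether the Profiler normalizes in exactly this way is an assumption, and the function name `add_token_votes`, the labels and the scores are hypothetical.

```python
from collections import defaultdict


def add_token_votes(counters, interpretations):
    """Distribute one token's unit of evidence over its interpretations.

    `interpretations` is a list of (label, score) pairs; each label's counter
    grows by its normalized share of the total score.
    """
    total = sum(score for _, score in interpretations)
    if total == 0:
        return
    for label, score in interpretations:
        counters[label] += score / total


counters = defaultdict(float)
# Ambiguous token: the variant reading is three times as likely as the OCR error.
add_token_votes(counters, [("variant i→y", 3.0), ("ocr error i→y", 1.0)])
# Unambiguous token: the whole unit goes to the variant pattern.
add_token_votes(counters, [("variant i→y", 1.0)])
```

After both tokens the variant counter holds 1.75 and the OCR error counter 0.25, so the totals still add up to the number of tokens seen.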

Computation of profile: initialization

OCR result: w0, w1, w2, w3, …

Initial global profile: non-specific model with probabilities for
• Words
• Variant patterns
• Error

Computation of profile: global to local

OCR result: w0, w1, w2, w3, …

Initial global profile: non-specific model with probabilities for
• Words
• Variant patterns
• Error

Local profile: for each token (w0, w1, w2, w3), a list of interpretations of the form … → … → …

Computation of profile: local to global

OCR result: w0, w1, w2, w3, …

Local profile: for each token (w0, w1, w2, w3), a list of interpretations of the form … → … → …

Global profile: improved model with probabilities for
• Words
• Variant patterns
• Error

Computation of profile: iteration

OCR result: w0, w1, w2, w3, …

Local profile: for each token (w0, w1, w2, w3), a list of interpretations of the form … → … → …

Global profile: improved model with probabilities for
• Words
• Variant patterns
• Error
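The initialization, global-to-local, local-to-global and iteration steps can be read as an expectation-maximization-style loop. The slides do not spell out the update rule, so the sketch below assumes one: candidate interpretations are weighted by the product of their pattern probabilities, and pattern probabilities are re-estimated as fractional counts per token. All names (`local_profile`, `global_from_local`, `compute_profile`) and the toy data are hypothetical.

```python
from collections import defaultdict
from math import prod


def local_profile(candidates, global_profile, floor=1e-6):
    """Weight each candidate interpretation of one token under the global profile."""
    scores = [prod(global_profile.get(p, floor) for p in cand) for cand in candidates]
    total = sum(scores)
    return [s / total for s in scores]


def global_from_local(tokens, global_profile):
    """Re-estimate pattern frequencies from fractional counts over all tokens."""
    counts = defaultdict(float)
    for candidates in tokens:
        weights = local_profile(candidates, global_profile)
        for cand, weight in zip(candidates, weights):
            for pattern in cand:
                counts[pattern] += weight
    return {p: c / len(tokens) for p, c in counts.items()}


def compute_profile(tokens, initial_profile, iterations=5):
    """Alternate between local and global profiles a fixed number of times."""
    profile = dict(initial_profile)
    for _ in range(iterations):
        profile = global_from_local(tokens, profile)
    return profile


# Each token lists its candidate interpretations as tuples of pattern labels.
tokens = [
    [("var:i→y",), ("ocr:i→y",)],  # ambiguous
    [("var:i→y",)],                # unambiguous variant
    [("var:i→y",), ("ocr:i→y",)],  # ambiguous
    [("ocr:u→n",)],                # unambiguous OCR error
]
initial = {"var:i→y": 0.5, "ocr:i→y": 0.5, "ocr:u→n": 0.5}
profile = compute_profile(tokens, initial)
```

In this toy run the single unambiguous variant token tips the ambiguous tokens towards the variant reading, and each further iteration strengthens that preference.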

Profiler Evaluation

Measure the quality
• of global profiles
• of OCR error detection

Challenges
● Measures not obvious
● Good evaluation data is difficult to gather
● Results need interpretation

Evaluation: Measures

(1) Global profiles: percentage of matches for the first 10 patterns in the ranked output lists. Two values: historical patterns, OCR patterns.

(2) OCR error detection: precision and recall for the OCR errors detected by the Profiler.

(3) Indirect evaluation (for instance, by means of the postcorrection system).
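Measures (1) and (2) can be sketched in a few lines. The slides do not define what counts as a "match", so the sketch assumes a ranked pattern matches when it also appears in a gold-standard list; the function names `matches_in_top_n` and `precision_recall` and all sample data are hypothetical.

```python
def matches_in_top_n(ranked_patterns, gold_patterns, n=10):
    """Percentage of the first n ranked patterns that also occur in the gold list."""
    top = ranked_patterns[:n]
    if not top:
        return 0.0
    hits = sum(1 for pattern in top if pattern in gold_patterns)
    return 100.0 * hits / len(top)


def precision_recall(detected, gold):
    """Precision and recall of detected error positions against gold positions."""
    detected, gold = set(detected), set(gold)
    true_positives = len(detected & gold)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: ranked profiler output and gold patterns (made up).
ranked = ["t→th", "i→y", "u→v", "ä→a"]
gold = {"t→th", "u→v", "ei→ey"}
top_match = matches_in_top_n(ranked, gold, n=4)

# Token positions flagged as OCR errors vs. positions that really are errors.
precision, recall = precision_recall([1, 4, 7, 9], [1, 2, 4, 5, 7, 8])
```

Here half of the top four patterns are confirmed, and the detector is right in three of its four flags (precision 0.75) while finding three of the six true errors (recall 0.5).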

Evaluation: Data preparation

(1) Deep evaluation: for each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated.
++ fully accurate
-- manual work

(2) Shallow evaluation: the OCR'ed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document, the historical and the OCR interpretation are automatically assigned from the ground truth.
++ no manual work
-- not completely accurate
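The automatic alignment behind the shallow evaluation can be illustrated with Python's standard `difflib`. This is a swapped-in, generic technique and not the aligner used in the project; the function name `align_tokens` and the sample tokens are hypothetical. Stretches where the two sides differ in length are left unpaired, which is one reason such an alignment is not completely accurate.

```python
from difflib import SequenceMatcher


def align_tokens(ocr_tokens, truth_tokens):
    """Pair OCR tokens with ground-truth tokens; unmatched sides are None."""
    matcher = SequenceMatcher(a=ocr_tokens, b=truth_tokens, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ocr_part = ocr_tokens[i1:i2]
        truth_part = truth_tokens[j1:j2]
        if len(ocr_part) == len(truth_part):
            # Equal stretches and one-to-one replacements pair up directly.
            pairs.extend(zip(ocr_part, truth_part))
        else:
            # Split, merged or missing words: keep both sides unpaired.
            pairs.extend((o, None) for o in ocr_part)
            pairs.extend((None, t) for t in truth_part)
    return pairs


# Toy example (made up): one misrecognized token.
ocr = ["der", "iheil", "des", "buchs"]
truth = ["der", "theil", "des", "buchs"]
pairs = align_tokens(ocr, truth)
errors = [(o, t) for o, t in pairs if o != t]
```

The resulting `errors` list holds the OCR token together with its ground-truth form, which is the information needed to assign an OCR interpretation to that token.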

Evaluation: Data

Deep:
– Eckartshausen, 100 pages
– Briefkunst, 40 pages

Shallow: 5 books each from the 16th, 17th and 18th century

Evaluation: Eckartshausen

● Historical patterns
  matches first 10: 70%
  precision all: 68%
  recall all: 73%
● OCR patterns
  matches first 6: 67%
  precision all: 59%
  recall all: 19%
● OCR error detection
  precision: 86%
  recall: 46%

Graphical Evaluation: Eckartshausen

Graphical Evaluation: diacritics

Hist. Var.


Shallow Evaluation Results

Century                                        16th      17th      18th
HIST patterns, first 10                        60%       74%       78%
OCR patterns, first 10                         48%       70%       50%
Error detection, precision                     95%       92%       81%
Error detection, recall                        49%       43%       45%
Content words errors                           64%       44%       16%
Easy interactive correction per 10,000 words   ≈ 3000    ≈ 1892    ≈ 720

Global Profile: Spelling variation patterns

Spelling variation profile

OCR Error Profile
