Presentation introducing Profiler and Postcorrection System presented by Jesse de Does during demo session held at the BNE 5th of October 2011.
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
TR5 Profiler and Post-Correction System
Ludwig-Maximilians-Universität München
Centrum für Informations- und Sprachverarbeitung
TR5 Post-Correction System
User interface for easy postcorrection of historical OCR'd documents
– Stand-alone user interface
– Innovative language technology enables identification and presentation of recognition errors and their efficient correction
Customizable user interface
OCR and image fragments
Correction candidates, Special functions
Complete image
Freely rearrangeable interface elements:
– OCR with image snippets
– Complete image
– Correction candidates / Special functions
Font size
Word by word presentation of recognized text and image clippings.
Comparison of text and image follows reading order and is much easier than side-by-side presentation of image and text.
View: OCR and Image clippings
– For difficult cases
– When word segmentation by OCR fails
– Current word is highlighted
View: Original image
– Correction by manual text entry
– Choosing correction candidates
– Faster correction thanks to candidates proposed by the postcorrection system
Word by word correction of text
Batch correction
– Several occurrences of an identical word
Batch correction: efficient postcorrection
Batch correction
– classes of systematic errors
– errors where the correction candidate has a high degree of certainty
– further possibilities: frequent errors, for instance location names
Batch correction: efficient postcorrection
Postcorrection system: Evaluation
Ulrich Reffle, 4 July 2011
Result:
Error correction thanks to text and error profiling is 2.7 times faster
User Experiment with 14 individual instances
Correction system
Targets more specialist audience
Thanks to underlying language technology:
– Historical variants are recognized and not marked as errors, even when not in the historical lexicon
– Historical variants are proposed as correction candidates
– Typical error patterns are exploited
– Ranking of correction candidates
Why another postcorrection system?
Lexica and language models help to deal with orthographic variants and unknown words.
– Recognition of OCR errors and proposal of correction candidates depend on specially developed LMU language technology
– Approximate search in “hypothetical lexica”
– An analysis of the whole work (“text and error profile”) produces document-specific information about the language and the type of OCR errors
Underlying language technology
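The slides do not describe how the LMU approximate search works internally. As a minimal sketch of the general idea only, the following looks up an OCR token in a small lexicon by edit distance. The function names, the toy lexicon and the distance threshold are illustrative assumptions, and a brute-force scan like this would be far too slow for real lexica.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def approximate_lookup(token, lexicon, max_distance=1):
    """Return (distance, word) pairs for lexicon words within max_distance of token."""
    hits = [(levenshtein(token, word), word) for word in lexicon]
    return sorted((d, w) for d, w in hits if d <= max_distance)


# Hypothetical toy lexicon of modern and historical spellings
LEXICON = ["theil", "teil", "heil", "seil"]
candidates = approximate_lookup("iheil", LEXICON)
```

With these assumptions, the garbled token "iheil" retrieves the words that are one edit away, which is the kind of candidate list a correction interface could then rank.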
Text and error profiles
Text profile
– Coverage of lexica
– Typical variant patterns
● Targeted selection of lexica
● Better language models
● Distinguishing historical variants and OCR errors
● Ranking of correction candidates
● Recall and Precision in IR
Error profile
– Estimate of error rate
– Typical OCR errors
● Better modeling of error channel
● Distinguishing historical variants and OCR errors
● Ranking of correction candidates
● Treatment of systematic errors
Underlying logic: Dual noisy channel model
Interpretation of OCR output tokens as result of two “noisy channels”
modern word u → (patterns) → historical variant v → (OCR errors) → OCR result w
Given an OCR token w, give possible interpretations of w in terms of
• “underlying” modern word u (IR!)
• correct historical word v and its derivation from u via “patterns”
• OCR errors garbling v into w
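The two-channel reading of a token can be sketched as a small enumeration: start from a modern word u, optionally apply one variant pattern to obtain v, optionally apply one OCR error pattern, and keep the combinations that produce the observed token w. This is an illustration under simplifying assumptions (at most one pattern per channel, a toy lexicon, hypothetical function names), not the Profiler's actual implementation.

```python
def apply_once(word, patterns):
    """Yield (new_word, pattern) for every single application of a rewrite pattern."""
    for left, right in patterns:
        start = word.find(left)
        while start != -1:
            yield word[:start] + right + word[start + len(left):], (left, right)
            start = word.find(left, start + 1)


def interpretations(w, modern_lexicon, variant_patterns, ocr_patterns):
    """Return tuples (u, variant pattern, v, OCR error pattern) that explain token w."""
    found = []
    for u in modern_lexicon:
        # Channel 1: modern word u -> historical variant v (zero or one pattern)
        variants = [(u, None)] + list(apply_once(u, variant_patterns))
        for v, hist in variants:
            # Channel 2: historical variant v -> OCR result (zero or one error)
            outputs = [(v, None)] + list(apply_once(v, ocr_patterns))
            for out, err in outputs:
                if out == w:
                    found.append((u, hist, v, err))
    return found


MODERN = ["teil", "heil"]     # hypothetical modern lexicon
VARIANTS = [("t", "th")]      # historical variant pattern t -> th
OCR_ERRORS = [("t", "i")]     # OCR error pattern t -> i
result = interpretations("iheil", MODERN, VARIANTS, OCR_ERRORS)
```

For the token "iheil" this yields a single interpretation: modern "teil", historical variant "theil" via t → th, and the OCR error t → i.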
Historical variant and OCR error patterns
Historical variants: teil → theil
OCR error patterns: theil → iheil
Relative frequency: 2.9% of all ‘t’ are rewritten to ‘th’
Absolute frequency: Pattern was found 120 times in the current document.
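As a rough illustration of the two figures, the sketch below counts how often a pattern such as t → th explains the difference between a modern word and its historical spelling (absolute frequency) and divides that by the number of occurrences of the left-hand side (relative frequency). The word pairs and the function name are hypothetical, and the Profiler may count differently.

```python
def pattern_frequencies(pairs, left, right):
    """pairs: (modern word, historical spelling); counts single rewrites left -> right."""
    absolute = 0
    left_total = 0
    for modern, historical in pairs:
        left_total += modern.count(left)
        # Check each occurrence of `left`: does rewriting it yield the historical form?
        start = modern.find(left)
        while start != -1:
            if modern[:start] + right + modern[start + len(left):] == historical:
                absolute += 1
            start = modern.find(left, start + 1)
    relative = absolute / left_total if left_total else 0.0
    return absolute, relative


# Hypothetical word pairs, not taken from any evaluated document
PAIRS = [("teil", "theil"), ("tat", "that"), ("zeit", "zeit"), ("mit", "mit")]
absolute, relative = pattern_frequencies(PAIRS, "t", "th")
```

In this toy data the pattern applies twice among five occurrences of "t".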
Local view: interpretations of tokens
● Local view: “Meaningful interpretations” for all tokens of the OCR text are the matches in all attached lexicons, using the given settings.
Occurrence of spelling variant “i→y”:
Occurrence of OCR error “i→y”:
Global view: pattern frequencies
● Global view: Increment counters to estimate (relative) frequencies.
Occurrences of spelling variant “i→y”: +0.999771
Occurrences of OCR error “i→y”: +0.000224948
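The fractional increments suggest that each token contributes to the counters in proportion to how likely each of its interpretations is. A minimal sketch of such an update follows; the normalisation step, the names and the example scores are assumptions, not details given in the slides.

```python
from collections import defaultdict


def update_counters(counters, weighted_interpretations):
    """Add each interpretation's normalised weight to the counter of its pattern."""
    total = sum(weight for _, weight in weighted_interpretations)
    for key, weight in weighted_interpretations:
        counters[key] += weight / total
    return counters


counters = defaultdict(float)
# One token with two competing readings of the same rewrite; scores are hypothetical
token_readings = [(("variant", "i->y"), 0.9), (("ocr_error", "i->y"), 0.1)]
update_counters(counters, token_readings)
update_counters(counters, token_readings)  # a second token with the same readings
```

Dividing each counter by a suitable total afterwards would turn these soft counts into relative frequencies.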
Computation of profile: initialization
OCR result: w0, w1, w2, w3, …
Initial global profile
Non-specific model with probabilities for
• Words
• Variant patterns
• Error
Computation of profile: global to local
OCR result: w0, w1, w2, w3, …
Initial global profile: non-specific model with probabilities for
• Words
• Variant patterns
• Error
Local profile: for each token w0, w1, w2, w3, a list of interpretations of the form … → … → …
Computation of profile: local to global
OCR result: w0, w1, w2, w3, …
Local profile: for each token w0, w1, w2, w3, a list of interpretations of the form … → … → …
Global profile: improved model with probabilities for
• Words
• Variant patterns
• Error
Computation of profile: iteration
OCR result: w0, w1, w2, w3, …
Local profile: for each token w0, w1, w2, w3, a list of interpretations of the form … → … → …
Global profile: improved model with probabilities for
• Words
• Variant patterns
• Error
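The alternation between the global and the local profile can be pictured as an expectation-maximisation style loop. The sketch below is only one plausible reading of the slides: the scoring rule (product of pattern probabilities), the pattern labels and the toy tokens are assumptions, and word probabilities are left out for brevity.

```python
from collections import defaultdict
from math import prod


def local_step(tokens, global_profile):
    """Global to local: weight each token's interpretations with the current probabilities."""
    local = []
    for readings in tokens:
        scores = [prod(global_profile[p] for p in patterns) for patterns in readings]
        total = sum(scores)
        local.append([s / total for s in scores])
    return local


def global_step(tokens, local):
    """Local to global: re-estimate pattern probabilities from the weighted interpretations."""
    counts = defaultdict(float)
    for readings, weights in zip(tokens, local):
        for patterns, weight in zip(readings, weights):
            for p in patterns:
                counts[p] += weight
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


# Each token is a list of readings; each reading is a tuple of pattern labels (hypothetical)
tokens = [
    [("hist:t->th",), ("ocr:t->th",)],  # ambiguous token
    [("hist:t->th",)],
    [("hist:t->th",)],
    [("ocr:t->i",)],
]
# Initial, non-specific global profile
profile = {"hist:t->th": 1 / 3, "ocr:t->th": 1 / 3, "ocr:t->i": 1 / 3}
for _ in range(5):  # iteration
    local = local_step(tokens, profile)
    profile = global_step(tokens, local)
```

In this toy run the ambiguous token is gradually attributed to the historical variant, because the unambiguous tokens make that pattern more probable in the global profile with every pass.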
Profiler Evaluation
Measure the quality
• of global profiles
• of OCR error detection
Challenges
● Measures not obvious
● Good evaluation data is difficult to gather
● Results need interpretation
Evaluation: Measures
(1) Global profiles: percentage of matches for the first 10 patterns in the ranked output lists. Two values: historical patterns, OCR patterns.
(2) OCR error detection: precision and recall for the OCR errors detected by the Profiler.
(3) Indirect evaluation (for instance, by means of the postcorrection system).
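Measures (1) and (2) can be sketched in a few lines. Precision and recall follow their standard definitions; the top-k match is one possible reading of "percentage of matches for the first 10 patterns", since the slides do not spell out the exact comparison. All names and example values are hypothetical.

```python
def precision_recall(detected, actual):
    """Precision and recall of detected error positions against the true error positions."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def top_k_match(ranked_patterns, reference_patterns, k=10):
    """Share of the first k ranked patterns that also occur in the reference patterns."""
    top = ranked_patterns[:k]
    return sum(p in reference_patterns for p in top) / len(top) if top else 0.0


# Hypothetical token positions flagged by a profiler versus annotated errors
precision, recall = precision_recall(detected=[1, 4, 7], actual=[1, 2, 4, 9])
match = top_k_match(["t->th", "i->y", "u->v"], {"t->th", "u->v"}, k=3)
```

High precision with modest recall means that most flagged tokens are real errors while many errors remain unflagged.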
Evaluation: Data preparation
(1) Deep evaluation: for each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated.
++ fully accurate
-- manual work
(2) Shallow evaluation: the OCR'ed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document, the historical and the OCR interpretation are automatically assigned from the ground truth.
++ no manual work
-- not completely accurate
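The slides do not say which alignment method the shallow evaluation uses. As a hedged sketch of the general idea, the following pairs OCR tokens with ground-truth tokens using Python's standard `difflib.SequenceMatcher`; the function name and the decision to skip uneven replacements are assumptions made for this example.

```python
from difflib import SequenceMatcher


def align_tokens(ocr_tokens, truth_tokens):
    """Pair OCR tokens with ground-truth tokens as (ocr, truth, is_correct) triples."""
    pairs = []
    matcher = SequenceMatcher(a=ocr_tokens, b=truth_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pairs += [(o, t, True) for o, t in zip(ocr_tokens[i1:i2], truth_tokens[j1:j2])]
        elif tag == "replace" and i2 - i1 == j2 - j1:
            # Equally long mismatching runs are paired token by token
            pairs += [(o, t, False) for o, t in zip(ocr_tokens[i1:i2], truth_tokens[j1:j2])]
        # Insertions, deletions and uneven replacements are left unaligned here
    return pairs


ocr = "ein iheil der zeit".split()
truth = "ein theil der zeit".split()
pairs = align_tokens(ocr, truth)
```

Tokens that cannot be paired cleanly, for instance after a segmentation error, are one reason why an automatic alignment of this kind is not completely accurate.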
Evaluation: Data
Deep:
– Eckartshausen, 100 pages
– Briefkunst, 40 pages
Shallow:
– 5 books each from the 16th, 17th and 18th century
Evaluation: Eckartshausen
● Historical patterns: matches (first 10) 70%, precision (all) 68%, recall (all) 73%
● OCR patterns: matches (first 6) 67%, precision (all) 59%, recall (all) 19%
● OCR error detection: precision 86%, recall 46%
Graphical Evaluation: Eckartshausen
Graphical Evaluation: diacritics
Hist. Var.
OCR
Shallow Evaluation Results
                                                16th      17th      18th
HIST patterns, first 10                         60%       74%       78%
OCR patterns, first 10                          48%       70%       50%
Error detection, precision                      95%       92%       81%
Error detection, recall                         49%       43%       45%
Content words errors                            64%       44%       16%
Easy interactive correction per 10,000 words    ≈ 3000    ≈ 1892    ≈ 720
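The last row appears to be consistent with multiplying 10,000 words by the content-word error rate and by the error-detection recall. This relation is an observation from the numbers above, not something the slides state. A small check of the arithmetic, with hypothetical names:

```python
WORDS = 10_000
# (content-word error rate, error-detection recall) per century, from the results above
RESULTS = {"16th": (0.64, 0.49), "17th": (0.44, 0.43), "18th": (0.16, 0.45)}


def detected_errors(error_rate, recall, words=WORDS):
    """Number of erroneous words per `words` that the error detection would flag."""
    return round(words * error_rate * recall)


estimates = {century: detected_errors(*values) for century, values in RESULTS.items()}
```

Under this reading the 17th and 18th century figures are reproduced exactly (1892 and 720), while the 16th century value comes out as 3136 against the reported ≈ 3000, so the slide may have rounded or used a slightly different basis.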
Global Profile: Spelling variation patterns
Spelling variation profile
OCR Error Profile