View
31
Download
0
Category
Preview:
Citation preview
Interactive Handwritten Text Recognition and Indexingof Historical Documents: the tranScriptorum Project
Alejandro H. Toselliahector@prhlt.upv.es
Pattern Recognition and Human Language Technology
Reseach Center
Universitat Politecnica de Valencia
Spain
June 2015
Assisted HTR
Assisted HTR handwritting KWS
Index
1 Handwritten Text Recognition (HTR) and Indexing . 1
2 The tranScriptorium Project . 3
3 Selected Handwritting Datasets . 5
4 Interactive HTR: Transcription Demonstration . 11
5 HTR and Interactive-Predictive HTR Results . 14
6 Handwritten Text Images Indexing: Search Demonstration . 17
7 Handwritten Text Images Indexing and Search Results . 20
8 Conclusion . 23
A.H. Toselli – PRHLT/UPV Page 1
Assisted HTR handwritting KWS
Index
◦ 1 Handwritten Text Recognition (HTR) and Indexing . 1
2 The tranScriptorium Project . 3
3 Selected Handwritting Datasets . 5
4 Interactive HTR: Transcription Demonstration . 11
5 HTR and Interactive-Predictive HTR Results . 14
6 Handwritten Text Images Indexing: Search Demonstration . 17
7 Handwritten Text Images Indexing and Search Results . 20
8 Conclusion . 23
A.H. Toselli – PRHLT/UPV Page 1
Assisted HTR handwritting KWS
Handwritten Text Recognition (HTR) and Indexing
Huge amounts of handwritten historical documents arebeing published by on-line digital libraries world wide
However, for these raw digital images to be reallyuseful, they need be annotated with informative content
This presentation introduces efficient solutionsfor the indexing, search and full transcription ofhistorical handwritten document images
A.H. Toselli – PRHLT/UPV Page 2
Assisted HTR handwritting KWS
Index
1 Handwritten Text Recognition (HTR) and Indexing . 1
◦ 2 The tranScriptorium Project . 3
3 Selected Handwritting Datasets . 5
4 Interactive HTR: Transcription Demonstration . 11
5 HTR and Interactive-Predictive HTR Results . 14
6 Handwritten Text Images Indexing: Search Demonstration . 17
7 Handwritten Text Images Indexing and Search Results . 20
8 Conclusion . 23
A.H. Toselli – PRHLT/UPV Page 3
Assisted HTR handwritting KWS
The tranScriptorium Projecthttp://www.transcriptorium.eu
• STREP of the FP7 in the ICT for Learning and Access to CulturalResources challenge (1 January 2013 to 31 December 2015)
• tranScriptorium aims to develop innovative, efficient and cost-effectivesolutions for the indexing, search and full transcription of historicalhandwritten document images, using modern, holistic Handwritten TextRecognition technology
1. Enhancing HTR technology for efficient transcription2. Bringing the HTR technology to users3. Integrating the HTR results in public web portals
Supported by: EU Cultural Heritage:A.H. Toselli – PRHLT/UPV Page 4
Assisted HTR handwritting KWS
Index
1 Handwritten Text Recognition (HTR) and Indexing . 1
2 The tranScriptorium Project . 3
◦ 3 Selected Handwritting Datasets . 5
4 Interactive HTR: Transcription Demonstration . 11
5 HTR and Interactive-Predictive HTR Results . 14
6 Handwritten Text Images Indexing: Search Demonstration . 17
7 Handwritten Text Images Indexing and Search Results . 20
8 Conclusion . 23
A.H. Toselli – PRHLT/UPV Page 5
Assisted HTR handwritting KWS
Selected Handwritting Datasets
• BENTHAM: XVIII/XIX centuries collection of over 4, 000 volumesof drafts and notes, written by several hands in English
• PLANTAS: XVII century botanical specimen manuscript collectionof seven volumes written by a single hand in Old Spanish – kindlyprovided by the BNE
• HATTEM: XV century Medieval Manuscript composed of 573sheets written by a single hand in Dutch
• ESPOSALLES: XVII century Marriage License records written byseveral hands in old Catalan and other languages
• AUSTEN: XVIII century Juvenilia manuscripts by Jane Austen(single hand in English) – kindly provided by the BL
• REICHSGERICHT: XVIII century court decisions from the HighCourt of Germany, written by several hands in German
A.H. Toselli – PRHLT/UPV Page 6
Assisted HTR handwritting KWS
“BENTHAM” DatasetXVIII century collection of over 4, 000 volumes of drafts and notes, written by
several writers in English
Experiments on a first batch of 433 pre-selected page images
Number of: TotalPages 433Lines 11 473Running words 106 905Lexicon size 9 717Running characters 550 674Character set size 86
A.H. Toselli – PRHLT/UPV Page 7
Assisted HTR handwritting KWS
“PLANTAS” DatasetXVII century Botanical Specimen Manuscript Collection of seven volumes
written by a single writer in Old Spanish
Experiments on the first volume
Number of: TotalPages 871Lines 19 544Running words 197 617Lexicon size 21 148Character set size 77
A.H. Toselli – PRHLT/UPV Page 8
Assisted HTR handwritting KWS
“AUSTEN” Dataset
Jane Austen’s Juvenilia: XVIII century single hand manuscript
Experiments on Volume The Third
Number of: TotalPages 128Lines 2 693Running words 25 291Dataset lexicon 3 567Running characters 118 881Character set size 81
A.H. Toselli – PRHLT/UPV Page 9
Assisted HTR handwritting KWS
“Reichsgericht” Dataset
XVIII century court decisions from the High Court of Germany, written byseveral hands
Number of: TotalPages 114Lines 4 106Running words 31 545Dataset lexicon 8 108Running characters 239 762Character set size 92
A.H. Toselli – PRHLT/UPV Page 10
Assisted HTR handwritting KWS
Index
1 Handwritten Text Recognition (HTR) and Indexing . 1
2 The tranScriptorium Project . 3
3 Selected Handwritting Datasets . 5
◦ 4 Interactive HTR: Transcription Demonstration . 11
5 HTR and Interactive-Predictive HTR Results . 14
6 Handwritten Text Images Indexing: Search Demonstration . 17
7 Handwritten Text Images Indexing and Search Results . 20
8 Conclusion . 23
A.H. Toselli – PRHLT/UPV Page 11
Assisted HTR handwritting KWS
Interactive HTR: CATTI operation example
x
STEP-0 p
s ≡ w antiguas cuidadelas que en el Castillo sus llamadas p′ antiguas cuidadelas que en el Castillo sus llamadas
STEP-1 κ antiguos cuidadelas que en el Castillo sus llamadas p antiguos ciudadanos que en el Castillo sus llamadas s antiguos ciudadanos que en el Castillo sus llamadas p′ antiguos ciudadanos que en el Castillo sus llamadas
STEP-2 κ antiguos ciudadanos que en Castilla sus llamadas p antiguos ciudadanos que en Castilla se llamaban s antiguos ciudadanos que en Castilla se llamaban p′ antiguos ciudadanos que en Castilla se llamaban
FINAL κ antiguos ciudadanos que en Castilla se llamaban #p ≡ T antiguos ciudadanos que en Castilla se llamaban
Post-editing WER: 6/7 (86%)Interactive WSR: 2/7 (29%, assuming a whole-word correction in step-1)Estimated effort reduction: 1− 29/86 (66%).
A.H. Toselli – PRHLT/UPV Page 12
Assisted HTR handwritting KWS
Interactive HTR: Transcription Demonstration
• It is just a “demo” ! not intended for real operation (other systems do that)
• Everything is real. No tricks to make demo look better than real
• Web client-server architecture: Web browser front-end, back-end serverproviding off-line HTR-CATTI
• Off-line HTR-CATTI decoder based on word graphs
• Three tasks:
– BENTHAM: 10K words open vocabulary
– AUSTEN: 78K words external, open vocabulary from Bentham texts
20K words external, open vocabulary from Austen texts
A.H. Toselli – PRHLT/UPV Page 13
Assisted HTR handwritting KWS
Index
1 Handwritten Text Recognition (HTR) and Indexing . 1
2 The tranScriptorium Project . 3
3 Selected Handwritting Datasets . 5
4 Interactive HTR: Transcription Demonstration . 11
◦ 5 HTR and Interactive-Predictive HTR Results . 14
6 Handwritten Text Images Indexing: Search Demonstration . 17
7 Handwritten Text Images Indexing and Search Results . 20
8 Conclusion . 23
A.H. Toselli – PRHLT/UPV Page 14
Assisted HTR handwritting KWS
HTR and Interactive-Predictive HTR
HTR current state-of-the-art:
• Segmentation-free approach: no explicit segmentation of text images intowords or characters is required
• The basic input unit is a handwritten text line image
• Statistical modeling at different perception levels:
– Optical (character shape), using Hidden Markov Models (HMMs)– Lexical, by means of finite-state character representation of words– Syntactical, based on statistical language models, such as N -grams
Interactive-predictive framework: rather than full transcription automation, thesystem assists the human transcriber
• Combines HTR efficiency with the accuracy of human experts, leading tocost-effective perfect transcripts
A.H. Toselli – PRHLT/UPV Page 15
Assisted HTR handwritting KWS
HTR and Interactive-Predictive HTR Results
• BENTHAM: Training: OMs with 400 pages, LM Lex. 78K words. Test: 33 pages.WER = 22.0 % WSR = 17.2 % EFR: 21.5 % wrt post-editCER = 9.9 %
• PLANTAS: Training: OMs with 224 pages, LM Lex. 21K words. Test: 647 pages.WER = 33.4 % WSR: 31.5 % EFR: 5.7 % wrt post-editingCER = 16.0 %
• AUSTEN: Training: OMs with 50 pages, LM Lexicon 20K words. Test: 78 pagesWER = 32.2 % WSR = 21.4 % EFR: 33.5 % wrt post-editingCER = 15.9 %
• REICHSGERICHT: Training: OMs with 88 pgs, LM Lexicon 5K words. Test: 26 pgsWER = 33.3 % WSR = 25.1 % EFR: 24.6 % wrt post-editingCER = 12.9 %
WER/CER: percentage of mis-recognized words/characters.Experiments with open-vocabulary lexica and bi-gram LMs.
WSR = Percentage of word-level corrections to achieve ground truth transcripts.EFR = “Estimated Effort Reduction”.
A.H. Toselli – PRHLT/UPV Page 16
Assisted HTR handwritting KWS
Index
1 Handwritten Text Recognition (HTR) and Indexing . 1
2 The tranScriptorium Project . 3
3 Selected Handwritting Datasets . 5
4 Interactive HTR: Transcription Demonstration . 11
5 HTR and Interactive-Predictive HTR Results . 14
◦ 6 Handwritten Text Images Indexing: Search Demonstration . 17
7 Handwritten Text Images Indexing and Search Results . 20
8 Conclusion . 23
A.H. Toselli – PRHLT/UPV Page 17
Assisted HTR handwritting KWS
Handwritten Text Images Indexing and Search
• There are massive text image collections out there, but their textualcontent remains practically inaccessible
• If perfect or sufficiently accurate text image transcripts were available,image textual context could be straightforwardly indexed for plaintexttextual access.
• But fully automatic transcription results lack the level of accuracy neededfor useful text indexing and search purposes
• And manual or even interactive-predictive assisted transcription isentirely prohibitive to deal with massive image collections
• Good news: indexing and search can be directly implemented on theimages themselves, without explicitly resorting to any image transcripts,as we will see now.
A.H. Toselli – PRHLT/UPV Page 18
Assisted HTR handwritting KWS
Handwritten Text Images Indexing and Search: Demonstration
• It is just a “demo” ! not (yet) intended for real operation. But everythingis real – no tricks to make demo look better than real
• Line-level indexing according to the precision-recall trade-off model :Rather than exact searching, search is carried out with a confidencethreshold, specified by the user as part of the query in order to meetthe required precision-recall trade-off
• Word confidence scores are based on pixel-level probabilities andcomputed for line-shaped regions. Spotted word positions are markedonly approximately
• Two tasks:
– AUSTEN: Trained on Austen (50p), 20K words open vocabulary.Demo on the whole “Juvenile volume The Third” (128 pages)
– PLANTAS: Trained on Plantas (224p), 21K words open vocabulary.Demo on Volume I (about 1 000 pages)
A.H. Toselli – PRHLT/UPV Page 19
Assisted HTR handwritting KWS
Index
1 Handwritten Text Recognition (HTR) and Indexing . 1
2 The tranScriptorium Project . 3
3 Selected Handwritting Datasets . 5
4 Interactive HTR: Transcription Demonstration . 11
5 HTR and Interactive-Predictive HTR Results . 14
6 Handwritten Text Images Indexing: Search Demonstration . 17
◦ 7 Handwritten Text Images Indexing and Search Results . 20
8 Conclusion . 23
A.H. Toselli – PRHLT/UPV Page 20
Assisted HTR handwritting KWS
Indexing and Search for Handwritten Text Images:Pixel-level Posteriorgram
P
X
Pixel-level posterior probabilities P for a text image X and word v =”matter”.
An accurate, contextual (n-gram based) word classifier was used to compute P . Thishelped to achieve very low posteriors in a region of X around (i=100, j=200), where avery similar word, “matters”, is written.
A.H. Toselli – PRHLT/UPV Page 21
Assisted HTR handwritting KWS
Results on tranScriptorium Data Sets
Average Precision (AP)Mean Average Precision (MAP)and Recall-Precision curves
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
Precision:π
Recall: ρ
Bentham: ap=0.88, map=0.94Plantas: ap=0.89, map=0.92Austen-B:ap=0.76, map=0.65Austen: ap=0.91, map=0.77
Bentham
Plantas
Austen-B
Austen
Datasets training and test details
• BENTHAM: Multi-hand. Training: 400 pg. fromBentham, 87 char. HMMs, 2-gram LM trainedon Bentham texts; Lexicon 9 341 tokens.Test : 33 pages; query set: 6 962 keywords
• PLANTAS: Single hand. Training: 224 pagesfrom Plantas, 77 char. HMMs, 2-gram LMtrained with the training set + book glossarytranscripts. Lexicon 11 561 tokens.Test : 32 pages; query set: 9 945 keywords
• AUSTEN-B: Single hand. No training; usingBentham char. HMMs, lexicon and LM.Test : 78 pages; query set: 9 000 keywords
• AUSTEN: Single hand. Training: 50 Austenpages, 81 char. HMMs, 2-gram LM trained onAusten texts; Lexicon 20K tokens.Test : 78 pages; query set: 9 000 keywords
A.H. Toselli – PRHLT/UPV Page 22
Assisted HTR handwritting KWS
Index
1 Handwritten Text Recognition (HTR) and Indexing . 1
2 The tranScriptorium Project . 3
3 Selected Handwritting Datasets . 5
4 Interactive HTR: Transcription Demonstration . 11
5 HTR and Interactive-Predictive HTR Results . 14
6 Handwritten Text Images Indexing: Search Demonstration . 17
7 Handwritten Text Images Indexing and Search Results . 20
◦ 8 Conclusion . 23
A.H. Toselli – PRHLT/UPV Page 23
Assisted HTR handwritting KWS
Conclussions
• Automatic or assisted handwritten text transcription and fullyautomatic indexing is now becoming perfectly feassible
• Models trained for a given collection can provide quiteuseful performance on images from other similar collections,without need of (re-training)
• Several demonstrators have been implemented and madepublicly available for first-hand experience in real use; see:http://transcriptorium.eu/demonstrations
A.H. Toselli – PRHLT/UPV Page 24
Assisted HTR handwritting KWS
Thanks for Your Attention !
A.H. Toselli – PRHLT/UPV Page 25
Recommended