Europeana Newspapers LFT Infoday Genereux

  • Published on 10-Jul-2015

Transcript

  • 1 Institute for Specialised Communication and Multilingualism

    Michel Généreux, 27.10.2014

    Correcting OCR results with computational linguistic methods

    Michel Généreux, EURAC, Bozen / Bolzano

    in collaboration with Egon W. Stemle, Lionel Nicolas, Verena Lyding and Katalin Szabó

  • 2

    Introduction

    The OPATCH project (Open Platform for Access to and Analysis of Textual Documents of Cultural Heritage) aims at creating an advanced online search infrastructure for research in a historical newspaper archive.

    Duration: 24 months (Jan 2014 to Dec 2015)

    Funding: Autonome Provinz Bozen-Südtirol, Landesgesetz Nr. 14, Forschung und Innovation

    Partners: Landesbibliothek Dr. Friedrich Teßmann, Bozen; Institut für Corpuslinguistik und Texttechnologie (ICLTT), Wien

    To implement this, OPATCH builds on computational linguistic (CL) methods for structural parsing, word-class tagging and named entity recognition.

    Dating from between 1910 and 1920, the newspapers are set in the blackletter Fraktur font, and the paper quality has degraded with age.

    Hence, in OPATCH we start from highly error-prone OCR-ed text, in quantities that cannot realistically be corrected manually.

  • 3

    A glance at the Teßmann collection

    626,287 pages for a total of 819,310,354 tokens;

    616,751,127 of these tokens are in the reference dictionary, giving a cleanness of 0.75.

    Our post-OCR correction system is based on 10 OCR-ed pages with a cleanness of 0.70.

    Given that the dictionary covers on average 91% of all words, a cleanness of 0.90 amounts to almost perfect OCR.

    Focusing on the best pages from the Teßmann collection ...
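    The cleanness figure used throughout these slides is simply the share of tokens found in the reference dictionary. A minimal sketch in Python (function and variable names are ours, not from the project code):

    ```python
    def cleanness(tokens, dictionary):
        """Share of tokens found in the reference dictionary (0.0 to 1.0)."""
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if t.lower() in dictionary) / len(tokens)

    # Collection-level figure from the slide:
    print(616_751_127 / 819_310_354)  # ~0.75 for the whole Teßmann collection
    ```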

  • 4

    OPATCH 100k pages

    Newsp_Decade                   Pages    Tokens      In dict.  Cumul. pages  Cumul. tokens  Cumul. in dict.  Cumul. %
    PM_1900                         1032     1137449    0.901      1032           1137449        1024804         90.1%
    AM_1890                            8        9457    0.899      1040           1146906        1033302         90.1%
    FA_1900                            8        8652    0.887      1048           1155558        1040978         90.1%
    PM_1910                          737      750131    0.886      1785           1905689        1705286         89.5%
    IS_1900                         2970     2817010    0.879      4755           4722699        4180623         88.5%
    WB_1900                           72       45481    0.872      4827           4768180        4220278         88.5%
    SVB_1890                        9749    14306473    0.809      70546          92484068       75861319        82.0%
    SVB_1900 (Volksblatt)          17545    26314923    0.807      88091          118798991      97103539        81.7%
    BZN_1890 (Bozner Nachrichten)  16015    12153784    0.806      104106         130952775      106894635       81.6%

  • 5

    OPATCH unannotated 5k pages

    We select the 5000 cleanest pages from the Teßmann collection above, in the years 1910-1920. The average cleanness for this corpus is 91%, so no cleaning is needed. This results in the following un-annotated example corpus:

  • 6

    OPATCH annotated 5k pages

    Automated annotations for part-of-speech (POS), lemmas and named entities (NE). We also have a list of roughly 31k locations and names for South Tyrol compiled by Teßmann.

    Bozen hat ein Museum für Ötzi. ("Bozen has a museum for Ötzi.")

    Token    POS    LEMMA   NE
    Bozen    NE     Bozen   I-LOC
    hat      VAFIN  haben
    ein      ART    eine
    Museum   NN     Museum
    für      APPR   für
    Ötzi     NE     Ötzi    I-PER

  • 7

    Corpora

    Ten OCR-ed pages with their manually corrected versions.

    10,468 tokens and 3,621 types. Eight pages (8,324/2,487) are used as training data and two pages (2,144/1,134) for testing.

    More than one out of two tokens is misrecognized; among these, almost half (48%) need at least three edit operations to be corrected.
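    The edit operations counted here are ordinary Levenshtein operations. A small illustrative implementation (ours, not the project's code) computing the minimum number of delete/insert/replace operations between two strings:

    ```python
    def edit_distance(a, b):
        """Minimum number of delete/insert/replace operations to turn a into b."""
        m, n = len(a), len(b)
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # delete from a
                              curr[j - 1] + 1,     # insert into a
                              prev[j - 1] + cost)  # replace (or match)
            prev = curr
        return prev[n]

    print(edit_distance("wundestc", "wundesten"))  # 2: replace c with e, insert n
    ```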

    Reference corpus: 5M words and 5M bigrams, collected from the Web and from http://www.gutenberg.org/ (Romane und Erzählungen, 1910-20).

    The dictionary covers 91% of all words in the ten OCR-ed pages.

  • 8

    Approach for OCR correction

    1. Probability models. Collate and tally all edit operations (delete, insert and replace) needed to transform each unrecognized token from the training OCR-ed texts into its corrected form in the gold standard, e.g. n|u 98. From these counts we obtain two probability models: constrained and unconstrained.

    2. Candidate generation: find the closest entries in the dictionary by applying the minimum number of edit operations to an unrecognized OCR-ed token. The number of candidates is a function of the maximum number of edit operations allowed and the model used. Example: wundestc → wundesten, by replacing c with e and inserting an n after the e.

    3. Selection of the most suitable candidate, given relative frequency and context:

    ... word word word word word WORD word word word word ...
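    Steps 2 and 3 above can be sketched as follows. This is a simplified illustration under our own naming, with a toy dictionary and plain unweighted counts standing in for the constrained/unconstrained probability models:

    ```python
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def dist(a, b):
        """Plain Levenshtein distance (delete, insert, replace)."""
        if not a:
            return len(b)
        if not b:
            return len(a)
        if a[0] == b[0]:
            return dist(a[1:], b[1:])
        return 1 + min(dist(a[1:], b),      # delete
                       dist(a, b[1:]),      # insert
                       dist(a[1:], b[1:]))  # replace

    def candidates(token, dictionary, max_ops=2):
        """Step 2: dictionary entries reachable within max_ops edit operations."""
        return [w for w in dictionary if dist(token, w) <= max_ops]

    def select(cands, left, right, unigrams, bigrams):
        """Step 3: pick the candidate best supported by frequency and context."""
        def score(c):
            return (unigrams.get(c, 0)
                    + bigrams.get((left, c), 0)
                    + bigrams.get((c, right), 0))
        return max(cands, key=score)

    toy_dictionary = {"wundesten", "wundervoll", "wunder"}
    print(candidates("wundestc", toy_dictionary))  # ['wundesten']
    ```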

  • 9

    Experiment 1: Artificially created errors

    To achieve this, we extracted random trigrams from the gold standard (left context, target, right context) and applied the edit error model in reverse. Errors were introduced up to a limit of two per target and context. At the end of this process, we have two context words and five candidates, including the target. En is the maximum number of edit operations performed to generate candidates.
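    The reverse application of the edit error model can be illustrated with a simplified corruptor that injects uniformly random character edits; the real model draws edits according to the learned probabilities, and all names here are ours:

    ```python
    import random

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def corrupt(word, max_errors=2, rng=None):
        """Inject up to max_errors random character edits (delete, insert,
        replace) into a clean word -- a uniform stand-in for applying the
        learned edit error model in reverse."""
        rng = rng or random.Random()
        chars = list(word)
        for _ in range(rng.randint(0, max_errors)):
            if not chars:
                break
            op = rng.choice(("delete", "insert", "replace"))
            i = rng.randrange(len(chars))
            if op == "delete":
                del chars[i]
            elif op == "insert":
                chars.insert(i, rng.choice(ALPHABET))
            else:  # replace
                chars[i] = rng.choice(ALPHABET)
        return "".join(chars)

    noisy = corrupt("wundesten", rng=random.Random(0))  # at most 2 edits away
    ```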

  • 10


    Experiment 2: Real errors

  • 11


    Discussion

    The approach we presented to correct OCR errors considered four features of two types: edit distance and n-gram frequencies. Results showed that a simple scoring system can correct OCR-ed texts with very high accuracy under idealized conditions: no more than two edit operations and a wide-coverage dictionary. Obviously, these conditions do not always hold in practice, and the observed accuracy then drops to 10%. Wrong substitutions by the OCR process have also been neglected.

    Nevertheless, we can expect to improve our dictionary coverage so that very noisy OCR-ed texts (i.e. 48% error with distance of at least three to target) can be corrected with accuracies up to 20%.

    OCR-ed texts with less challenging error patterns can be corrected with accuracies up to 61% (distance two) and 86% (distance one).

    Reference: Michel Généreux, Egon W. Stemle, Verena Lyding and Lionel Nicolas. Correcting OCR errors for German in Fraktur font. La prima Conferenza di Linguistica Computazionale Italiana, Pisa, 9-10 December 2014.
