
Europeana Newspapers LFT Infoday Muehlberger


Text and Structure Recognition for Historical Newspapers

Günter Mühlberger

University of Innsbruck, Digitisation and Electronic Archiving


This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community

http://ec.europa.eu/ict_psp

Who we are

• Digitisation and Digital Preservation group at the University of Innsbruck

• Involved in digitisation and Optical Character Recognition (OCR) since the mid-1990s

• Research projects: LAURIN, METADATA ENGINE, books2u!, reUSE, Digitisation on Demand, eBooks on Demand, IMPACT, PrestoPRIME, ARROW+, Europeana Newspapers, tranScriptorium,…

• Our mission: “Digitisation of humanities” = Digital Humanities

• Selection of Digitisation projects

• Austrian Literature Online (since 2002)

• Digitisation of the Innsbrucker Newspaper Archive (2004-2006)

• Digitisation of the Tiroler Tageszeitung, 1945-2003 (2012-2014)

• Text recognition of 8 million newspaper pages within Europeana Newspapers

• Commercial services via the Technology Transfer Platform of the University


Digitisation

Image capturing → Text & structure recognition → Natural language processing → Content representation


Example – Index card: Capturing


OCR Interface


Raw OCR Text


“Â.”- ikonogr.

religiös

V oragine , Jacob a ; LEGENDA AUREA Dresdae

ÄLipsiae 1846


Structure Recognition


“Â.”- ikonogr.

religiös

V oragine , Jacob a ; LEGENDA AUREA Dresdae

ÄLipsiae 1846


Natural Language Processing


Voragine, Jacob

LEGENDA AUREA

1846

Matching with reference database, e.g. WorldCat
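The matching step can be illustrated with a small fuzzy comparison. This is only a sketch under stated assumptions: the reference records below are made-up placeholders (apart from the fields shown on the slide), no WorldCat interface is used, and the helper names `normalise` and `best_match` as well as the cutoff of 0.6 are hypothetical choices.

```python
import difflib
import re

# Hypothetical reference records; a real system would query a catalogue
# such as WorldCat instead of an in-memory list.
REFERENCE_RECORDS = [
    "Voragine, Jacob: LEGENDA AUREA, 1846",
    "Placeholder, Anna: Some Other Title, 1850",
    "Sample, Bernd: Yet Another Book, 1799",
]


def normalise(text):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^\w]+", " ", text.lower())
    return " ".join(cleaned.split())


def best_match(ocr_text, records, cutoff=0.6):
    """Return the record most similar to the OCR text, or None below the cutoff."""
    by_key = {normalise(record): record for record in records}
    hits = difflib.get_close_matches(
        normalise(ocr_text), list(by_key), n=1, cutoff=cutoff
    )
    return by_key[hits[0]] if hits else None


# Raw OCR text of the index card, including its recognition errors.
raw_ocr = "V oragine , Jacob a ; LEGENDA AUREA Dresdae ÄLipsiae 1846"
match = best_match(raw_ocr, REFERENCE_RECORDS)
```

Because the comparison tolerates a limited number of character errors, the erroneous OCR string can still be linked to the clean record.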


Matching with Reference (knowledge) data


The actual book


Content Representation

Instead of a scanned index card, we can access, link to and work with a full-featured catalogue entry and the digitised work itself.

Instead of digitised newspapers, we want to access, link to and work with the content, information and knowledge contained in these newspapers!

OCR is one important step towards this overall objective!


OCR – Some Facts

• Optical Character Recognition

• “Old” technology: “pattern recognition”

• Greatest progress in the late 1990s

• Market situation

• Two large companies: ABBYY, Nuance

• Cheap technology

• Open source tools: Tesseract, OCRopus, Gamera,…

• Google: worked with ABBYY, has used Tesseract since 2012

• ABBYY

• Took part in two EU projects

• Gothic script and long “s” supported out of the box; “Old Italian” available as a language

• Direct export of Analysed Layout and Text Object (ALTO)


Output

• Processing

• University of Innsbruck: 32 ABBYY licences on 4 servers

• 10,000 large newspaper pages per day, 40,000 medium-size pages, 150,000 book-size pages

• PDF

• Text above the image vs. text behind the image

• PDF/A Standard

• Tagged PDF

• XML - ALTO

• Keeps all the information: blocks, types of blocks, languages, lines, words, characters, confidence of words, etc.

• ALTO: de-facto standard – Library of Congress
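As an illustration of what the ALTO output keeps, the sketch below reads words and their confidence values from a tiny ALTO fragment. The fragment and its values are invented for this example; the element and attribute names (`TextLine`, `String`, `CONTENT`, `WC`) come from the ALTO schema, while the helper names and the threshold of 0.5 are assumptions.

```python
import xml.etree.ElementTree as ET

# Invented ALTO fragment; real files carry many more attributes (positions, styles).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1">
            <String CONTENT="LEGENDA" WC="0.98"/>
            <SP/>
            <String CONTENT="AUREA" WC="0.91"/>
          </TextLine>
          <TextLine ID="L2">
            <String CONTENT="Lipsiae" WC="0.42"/>
            <SP/>
            <String CONTENT="1846" WC="0.99"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def read_words(alto_xml):
    """Return (line_id, word, confidence) tuples in reading order."""
    root = ET.fromstring(alto_xml)
    words = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        for child in line:
            if local_name(child.tag) == "String":
                wc = child.get("WC")
                confidence = float(wc) if wc is not None else None
                words.append((line.get("ID"), child.get("CONTENT"), confidence))
    return words


def low_confidence(words, threshold=0.5):
    """List words whose confidence falls below the threshold."""
    return [word for _, word, wc in words if wc is not None and wc < threshold]


words = read_words(SAMPLE_ALTO)
suspicious = low_confidence(words)
```

Word confidence values like these are one possible starting point for OCR post-processing, since they point to the words most likely to be wrong.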


Accuracy rates

• What do we expect?

• Researchers: critical edition of Shakespeare's works: no errors accepted

• eBooks: less than 1 error per 1000 characters (= half a page)

• Users who are offered full-text search as an additional feature?

• Academic staff working (copy & paste) with a text?

• Natural language processing?

• Knowledge extraction?

• Word Error Rate (WER) vs. Character Error Rate (CER)

• WER more meaningful to users

• WER easier to measure
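The two measures can be sketched as follows. This is a minimal illustration, assuming both rates are defined as edit distance (Levenshtein) divided by the length of the reference; real evaluations may normalise whitespace or punctuation differently. The reference string is assembled from the index card example and is an assumption.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (characters or words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_text):
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, ocr_text) / len(reference)


def word_error_rate(reference, ocr_text):
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(reference_words, ocr_text.split()) / len(reference_words)


reference = "Voragine, Jacob: LEGENDA AUREA 1846"
ocr_text = "V oragine, Jacob: LEGENDA AUREA 1846"  # one spurious space

cer = character_error_rate(reference, ocr_text)  # 1 error in 35 characters
wer = word_error_rate(reference, ocr_text)       # 2 errors in 5 words
```

A single wrong character costs about 3% CER here but 40% WER, because the split word can no longer be found by a search. In the same terms, the eBook target of less than 1 error per 1000 characters is a CER below 0.1%.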


IMPACT, EVA/MINERVA, 12th Nov. 2008


Outlook OCR

• ABBYY

• For small and medium volumes, up to some tens of millions of pages

• Tesseract

• Growing community

• Can be parallelized on High Performance Computing engines (e.g. several hundreds or thousands of nodes)

• More experiments can be done for very large volumes, e.g. hundreds of millions of pages

• Handwritten Text Recognition

• Next generation of engines for handwritten material

• Speech and face recognition as technological background

• Transcription and Recognition Platform

• Virtual Research Environment

• Will be released by the University of Innsbruck in 2015
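The point about running Tesseract in parallel can be sketched on a single machine. This is only an illustration, not the setup used in the project: it assumes the `tesseract` command line tool (version 4.1 or later, which ships an `alto` output config) is installed, and the language code `deu`, the worker count and the function names are assumptions. On an HPC cluster the same per-page command would be distributed by the scheduler instead.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def tesseract_command(image, out_dir, lang="deu"):
    """Build the command line for one page; output goes to <out_dir>/<stem>.xml."""
    out_base = Path(out_dir) / Path(image).stem
    return ["tesseract", str(image), str(out_base), "-l", lang, "alto"]


def ocr_page(image, out_dir, lang="deu", run=subprocess.run):
    """Run OCR for one page and return (image, exit code)."""
    result = run(tesseract_command(image, out_dir, lang),
                 capture_output=True, text=True)
    return str(image), result.returncode


def ocr_batch(images, out_dir, lang="deu", workers=4, run=subprocess.run):
    """Process pages concurrently; each worker thread waits on one tesseract process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda img: ocr_page(img, out_dir, lang, run), images))


# Example call (requires tesseract and existing image files):
# results = ocr_batch(sorted(Path("scans").glob("*.tif")), "alto_out", workers=8)
```

Threads are sufficient here because the actual work happens in separate Tesseract processes; pages are independent of each other, which is why the task scales out easily.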


Structural Metadata

• Layout analysis

• Noise reduction (redundant text)

• A newspaper contains much more than edited articles

Content units

• One separation could be: edited articles – advertisements - entertainment

• Document Understanding

• A newspaper consists of repeated sections (“templates”)

• Unique vs. common content

E.g. local news, local advertisements, etc. vs. “world news”

• Common content may be found elsewhere in more detail

E.g. book announcement


Austrian Newspapers Online – ANNO - 1916


…more than edited articles


Edited articles vs. advertisements vs. entertainment


Innsbrucker Nachrichten, 4 June 1870


Innsbrucker Nachrichten 1870


Content units

• Types

• Lists of recently deceased persons

• Announcements by local associations

• Apartments to rent

• Obituaries

• Serialised novels

• …


Technical approaches

• Layout analysis

• Specific tools

• XML Output of OCR engine (cheap, easy to handle)

• Approaches

• Rule based approaches (experts needed)

• Machine learning approaches (large amounts of training samples needed)

• Functional Extension Parser (IMPACT project)

• Rule based approach for historical books (pre 1900)

• More than 80% accuracy for non-trivial features is hard to reach

• E.g. separation of edited text, advertisements and entertainment; running titles; section headings
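To make the rule-based idea concrete, here is a deliberately small sketch that labels text blocks taken from OCR output. It is not the Functional Extension Parser: the `Block` fields, the German keyword lists and the font-size threshold are all hypothetical and would have to be written by an expert for each newspaper title.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    font_size: float
    has_border: bool = False


# Hypothetical cue words; real rule sets are far larger and title-specific.
AD_KEYWORDS = ("zu vermieten", "zu verkaufen", "preis")
SERIAL_KEYWORDS = ("fortsetzung", "feuilleton")


def classify(block, body_font_size=9.0):
    """Apply the rules in order; the first matching rule decides the label."""
    text = block.text.lower()
    if block.has_border or any(keyword in text for keyword in AD_KEYWORDS):
        return "advertisement"
    if any(keyword in text for keyword in SERIAL_KEYWORDS):
        return "entertainment"
    if block.font_size > 1.5 * body_font_size and len(text.split()) <= 12:
        return "section heading"
    return "edited article"


blocks = [
    Block("Wohnung zu vermieten, zwei Zimmer", 8.0),
    Block("Der Roman (Fortsetzung)", 9.0),
    Block("Lokales", 16.0),
    Block("Gestern wurde im Gemeinderat beschlossen", 9.0),
]
labels = [classify(block) for block in blocks]
```

Rules like these are cheap to run on the XML output of the OCR engine, but every exception needs another rule, which is one reason accuracy above 80% is hard to reach.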


Summary

• Digitisation of newspapers is still at the beginning in many countries/regions

• OCR, though erroneous, is a must and cheap (compared to scanning)

• Post-processing of OCR is promising

• Structural metadata are a must as well; new approaches are needed (beyond article separation)

• Natural Language Processing and more advanced operations will benefit

• Final goal of “document understanding” by machines


Thank you for your attention! Günter Mühlberger

<[email protected]>