Transkribus. A research infrastructure for transcribing, recognizing and searching archival documents Günter Mühlberger University of Innsbruck, Digitisation and Digital Preservation Group


Page 1: Transkribus | Günter Mühlberger

Transkribus. A research infrastructure for transcribing, recognizing and searching archival documents

Günter Mühlberger

University of Innsbruck, Digitisation and Digital Preservation Group

Page 2: Transkribus | Günter Mühlberger

Googling?

Page 3: Transkribus | Günter Mühlberger

Voorts Hooge Mogende heeren, Inde practyque vande decisie by uw Hoge
voorts hoomge Mogende heeren, Inde pra tiqune vande decasie by ende hoge

Page 4: Transkribus | Günter Mühlberger

Mo: gedaen in t'voorJaer over t'vinden vande middelen syn verscheyden differenten
Mo: gedaen in t' voorJaer over t' vinden vande middelen syn verscheyden differenten

Page 5: Transkribus | Günter Mühlberger

voorgevallen tusschen Stadt en Lande, als namentl ouer den voet ende
voorgevallen tusschen Stadt en Lande, als nanent ouer den voet ende

Page 6: Transkribus | Günter Mühlberger
Page 7: Transkribus | Günter Mühlberger

Neural Networks are taking over. (Ray Smith, Google)

Page 8: Transkribus | Günter Mühlberger

Archives are starting to digitise their holdings.

BTW: Documents in archives are unique, were never published before and contain extremely interesting content!

Page 9: Transkribus | Günter Mühlberger

Digital Humanities are (big) data driven.

Page 10: Transkribus | Günter Mühlberger

Volunteers and the crowd are willing to contribute to scientific and cultural heritage projects.

Page 11: Transkribus | Günter Mühlberger

Research Infrastructure!

Page 12: Transkribus | Günter Mühlberger

[Diagram: TRANSKRIBUS and its user groups]

Groups: humanities scholars; archives; computer science and technology providers; public (volunteers / crowd)

Components: storage; expert interfaces and tools; web UI

Labels: deliver documents; work with documents; advanced search; contribute (crowd-sourcing); data; technology; enriched documents; get results

Page 13: Transkribus | Günter Mühlberger

Wolpertinger, Bavarian mythical creature (is everything…)

Page 14: Transkribus | Günter Mühlberger
Page 15: Transkribus | Günter Mühlberger

READ

• H2020 e-Infrastructure project
• Duration: 1 January 2016 to 30 June 2019
• Budget: EUR 8.2 million grant
• Coordinated by the University of Innsbruck
• 14 partners and more than 20 institutions connected through a Memorandum of Understanding
• Main objectives
  • Foster research in Pattern Recognition, Machine Learning, Natural Language Processing, Digital Humanities
  • Set up a service platform (“Transkribus”) to make the technology available to archives, scholars and the public
  • Transform this research infrastructure into a permanent service

Page 16: Transkribus | Günter Mühlberger

Transkribus as platform and as expert client

Page 17: Transkribus | Günter Mühlberger
Page 18: Transkribus | Günter Mühlberger

Documents in Transkribus

• Private
  • All documents in Transkribus are first of all private: visible only to the “owner” of the document
• Local
  • For simple operations, but all services are available only for remote documents
• Remote
  • Standard mode
  • Stored on the servers of the University of Innsbruck
• Upload of documents
  • HTTP
  • PDF
  • FTP
  • METS link
  • Direct download from repository

Page 19: Transkribus | Günter Mühlberger
Page 20: Transkribus | Günter Mühlberger
Page 21: Transkribus | Günter Mühlberger

Documents directly loaded from the repository – one button!
Implemented by Intranda (Goobi Viewer)

Page 22: Transkribus | Günter Mühlberger
Page 23: Transkribus | Günter Mühlberger

Researchers can go “shopping” and collect documents from various repositories and digital libraries in their private Transkribus collection

Page 24: Transkribus | Günter Mühlberger

Transcribe text in a reliable, secure, and machine-readable way = create a scholarly transcription
And use the text to train the HTR engines

Page 25: Transkribus | Günter Mühlberger
Page 26: Transkribus | Günter Mühlberger
Page 27: Transkribus | Günter Mühlberger
Page 28: Transkribus | Günter Mühlberger

Finished? Write an email…

Training process will be made available to the user as well (but will need some time due to a lack of resources in Innsbruck)

Page 29: Transkribus | Günter Mühlberger

HTR engine(s) – current implementation

• Hidden Markov Models - HMM (already available)
  • Training takes some hours
  • Recognition takes 20 to 60 minutes for one page
  • Strong limitations on dictionary and resolution of images
• Recurrent Neural Networks - RNN (coming soon)
  • Training takes some days
  • Recognition takes less than 60 seconds (!)
  • No limitation on resolution of images
  • Free choice of dictionary – less dependency
• Main limitations for both HTR engines
  • Layout Analysis (“line finder”)
  • Need for dictionaries

Page 30: Transkribus | Günter Mühlberger
Page 31: Transkribus | Günter Mühlberger
Page 32: Transkribus | Günter Mühlberger
Page 33: Transkribus | Günter Mühlberger

What to do with the automatically recognized text?

• Measure results
• Correct the text
• Search in the full-text
• Invite people to support you in transcribing

Page 34: Transkribus | Günter Mühlberger

Measure results with Character Error Rate (CER) and Word Error Rate (WER)
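As a rough illustration of the two metrics, here is a minimal sketch that assumes the common definition: edit distance between the reference transcription and the recognized text, divided by the length of the reference (in characters for CER, in words for WER). The function names are illustrative and not part of Transkribus. The sample lines are taken from the earlier slides, and treating the first as the reference is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, recognized):
    """Character Error Rate; assumes a non-empty reference."""
    return edit_distance(reference, recognized) / len(reference)


def wer(reference, recognized):
    """Word Error Rate on whitespace-separated words; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, recognized.split()) / len(ref_words)


# Assumption: the first line is the reference, the second the HTR output.
reference = "voorgevallen tusschen Stadt en Lande, als namentl ouer den voet ende"
recognized = "voorgevallen tusschen Stadt en Lande, als nanent ouer den voet ende"
print(f"CER: {cer(reference, recognized):.3f}")
print(f"WER: {wer(reference, recognized):.3f}")
```

The two sample lines differ by two character edits inside a single word, so one wrong word out of eleven weighs far more heavily on WER than on CER.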

Page 35: Transkribus | Günter Mühlberger
Page 36: Transkribus | Günter Mühlberger

Correct text

• Character Error Rates
  • Above 20%: correction takes as long as keying, but readers who have difficulty deciphering the text may benefit
  • Above 10%: correction is faster, but experienced readers prefer to key
  • Below 10%: correction is much faster, and even experienced readers will accept correction instead of keying
• Typical figures are currently 10% CER
• Under lab conditions, significantly better results are already possible
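The rule of thumb above can be written as a small decision helper. This is only a sketch: the function name and return labels are made up for illustration, and the handling of values exactly at 10% and 20% is an assumption, since the slide does not define the boundaries.

```python
def correction_strategy(cer):
    """Map a Character Error Rate (0.0 to 1.0) to the slide's rule of thumb."""
    if cer > 0.20:
        # Correction takes about as long as keying the text from scratch.
        return "key"
    if cer > 0.10:
        # Correction is faster, but experienced readers may still prefer keying.
        return "either"
    # Assumption: exactly 10% is treated like the "below 10%" case.
    return "correct"


for value in (0.25, 0.15, 0.05):
    print(f"CER {value:.0%}: {correction_strategy(value)}")
```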

Page 37: Transkribus | Günter Mühlberger

Search full-text

• Private search: you only get results from collections of which you are a member
• Faceted search
• Configurable search

Page 38: Transkribus | Günter Mühlberger
Page 39: Transkribus | Günter Mühlberger
Page 40: Transkribus | Günter Mühlberger
Page 41: Transkribus | Günter Mühlberger

Share your documents among your working group, colleagues, students and volunteers…

Page 42: Transkribus | Günter Mühlberger
Page 43: Transkribus | Günter Mühlberger

Export documents

• Various formats
  • XML (PAGE)
  • METS (Metadata Encoding and Transmission Standard – LoC)
  • ALTO (Analyzed Layout and Text Object – LoC)
  • DOCX
  • TEI (Text Encoding Initiative)
  • PDF
  • Excel
  • …

Page 44: Transkribus | Günter Mühlberger
Page 45: Transkribus | Günter Mühlberger

How to access services via machines?

Services in Transkribus are accessible via REST interface
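To give an idea of what machine access over REST looks like in general, here is a minimal sketch that builds a GET request without sending it. Everything specific in it is an assumption for illustration: the base URL is a placeholder, the resource path `collections/list` and the `index` parameter are hypothetical, and the session cookie name is a guess. The real endpoints and authentication scheme have to be taken from the Transkribus documentation.

```python
import urllib.parse
import urllib.request

# Hypothetical placeholder, not the real service address.
BASE_URL = "https://example.org/rest"


def build_get(path, params=None, session_id=None):
    """Build (but do not send) a GET request for a REST resource."""
    url = f"{BASE_URL}/{path.lstrip('/')}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, method="GET")
    request.add_header("Accept", "application/json")
    if session_id:
        # Assumption: the session is passed as a cookie; the real service may differ.
        request.add_header("Cookie", f"JSESSIONID={session_id}")
    return request


# Hypothetical resource path and parameter, for illustration only.
request = build_get("collections/list", {"index": 0}, session_id="abc123")
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen(request)` once the placeholder values are replaced with real ones.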

Page 46: Transkribus | Günter Mühlberger
Page 47: Transkribus | Günter Mühlberger

What will come next?

• Table editor
• eLearning interface
• Web-interface for simplified transcription (crowd-sourcing)
• Text2Image matching tool
• ScanApp
• …

Page 48: Transkribus | Günter Mühlberger

Table editor

Define table as template, automatic matching
Export data as CSV or Excel file

Page 49: Transkribus | Günter Mühlberger
Page 50: Transkribus | Günter Mühlberger

eLearning interface

User learns with real objects
Self-evaluation is based on a simple metric: Word Error Rate (the same as for the machine)

Page 51: Transkribus | Günter Mühlberger
Page 52: Transkribus | Günter Mühlberger
Page 53: Transkribus | Günter Mühlberger

Statistics

Page 54: Transkribus | Günter Mühlberger

Web-interface

Every document in Transkribus will also be accessible via a web-interface suitable to involve volunteers and the crowd

Page 55: Transkribus | Günter Mühlberger
Page 56: Transkribus | Günter Mühlberger

txt2img tool

Many printed or digital editions are available.
Automated matching may simplify the production of training data (only good matches will be taken for training).
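The idea of keeping only good matches can be sketched as follows. This is a simplified stand-in, not the actual txt2img tool: it assumes a rough recognition result already exists for each line image and compares it with the lines of an existing edition using plain string similarity from Python's `difflib`. The function name and the threshold of 0.8 are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher


def match_lines(recognized_lines, edition_lines, threshold=0.8):
    """Pair each recognized line with its most similar edition line.

    Returns (index of recognized line, edition text, similarity) tuples;
    pairs below the threshold are dropped so that only good matches
    would be used as training data.
    """
    pairs = []
    for index, recognized in enumerate(recognized_lines):
        best_text, best_score = None, 0.0
        for candidate in edition_lines:
            score = SequenceMatcher(None, recognized, candidate).ratio()
            if score > best_score:
                best_text, best_score = candidate, score
        if best_text is not None and best_score >= threshold:
            pairs.append((index, best_text, best_score))
    return pairs


edition = [
    "Mo: gedaen in t'voorJaer over t'vinden vande middelen syn verscheyden differenten",
    "voorgevallen tusschen Stadt en Lande, als namentl ouer den voet ende",
]
recognized = [
    "voorgevallen tusschen Stadt en Lande, als nanent ouer den voet ende",
    "xxxx yyyy zzzz",  # a line with no counterpart in the edition
]
for index, text, score in match_lines(recognized, edition):
    print(index, f"{score:.2f}", text)
```

A high threshold trades coverage for cleaner training data: lines without a close counterpart in the edition are simply left out.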

Page 57: Transkribus | Günter Mühlberger

ScanApp

Researchers can use mobile phones as document scanners (images are sent directly to their private Transkribus collection, and archives may benefit from this)

Page 58: Transkribus | Günter Mühlberger

ScanApp

Page 59: Transkribus | Günter Mühlberger

Try out?

We are happy to support you in setting up test projects.
Concluding a Memorandum of Understanding is a simple way to take part in the project!

Page 60: Transkribus | Günter Mühlberger

Credits

Hubert Alisade, Hilde Boe, Laurant Bolli, Max Bryan, Elaine Charwat, Vincent Christlein, Sebastian Colutto, Hervé Déjean, Barbara Denicolo, Markus Diem, Felix Dietrich, Reko Etelävuori, Stefan Fiel, Basilis Gatos, Beat Gnädinger, Tobias Grüning, Vili Haukkovaara, Gerhard Heyer, Tobias Hodel, Frederic Kaplan, Maria Kallio, Istvan Kecskemeti, Florian Kleber, Roger Labahn, Eva Lang, Sören Laube, Gundram Leifert, Georgios Louloudis, Philip Kahle, Rory McNicholl, Jean-Luc Meunier, Johannes Michael, Hannes Obermair, Moises Pastor, Nathanael Philipp, Hannelore Putz, George Retsinas, Veronica Romero, Joan Andreu Sanchez, Robert Sablatnig, Christian Sieber, Giorgos Sfikas, Philip Schofield, Louise Seaward, Nikolaos Stamatopolous, Tobias Strauss, Melissa Terras, Alejandro Hector Toselli, Enrique Vidal, Mauricio Villegas, Max Weidemann, Welf Wustlich, Herbert Wurster and many, many more!

Page 61: Transkribus | Günter Mühlberger

Thank you for your attention!

More information on the project and the Transkribus platform

http://read.transkribus.eu/

http://transkribus.eu/

http://transkribus.eu/wiki/

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 674943.