Transkribus. A research infrastructure for transcribing, recognizing and searching archival documents
Günter Mühlberger
University of Innsbruck, Digitisation and Digital Preservation Group
Googling?
Voorts Hooge Mogende heeren, Inde practyque vande decisie by uw Hoge
Mo: gedaen in t'voorJaer over t'vinden vande middelen syn verscheyden differenten
voorgevallen tusschen Stadt en Lande, als namentl ouer den voet ende

voorts hoomge Mogende heeren, Inde pra tiqune vande decasie by ende hoge
Mo: gedaen in t' voorJaer over t' vinden vande middelen syn verscheyden differenten
voorgevallen tusschen Stadt en Lande, als nanent ouer den voet ende
Neural Networks are taking over. (Ray Smith, Google)
Archives are starting to digitise their holdings.
BTW: Documents in archives are unique, were never published before and contain extremely interesting content!
Digital Humanities are (big) data driven.
Volunteers and the crowd are willing to contribute to scientific and cultural heritage projects.
Research Infrastructure!
[Diagram: TRANSKRIBUS at the centre, connecting four groups: HUMANITIES SCHOLARS; ARCHIVES; COMPUTER SCIENCE & TECHNOLOGY PROVIDERS; PUBLIC (VOLUNTEERS / CROWD). Labels in the diagram: Deliver documents; Storage; Work with documents; Expert interfaces; Tools; Advanced search; Web UI; Contribute (crowd-sourcing); Data; Technology; Enriched documents; Get results.]
Wolpertinger, Bavarian mythical creature (is everything…)
READ
• H2020 e-Infrastructure Project
• Duration: 1.1.2016 – 30.6.2019
• Budget: 8.2 million EUR grant
• Coordinated by the University of Innsbruck
• 14 partners and more than 20 institutions connected by a Memorandum of Understanding
• Main objectives
  • Foster research in Pattern Recognition, Machine Learning, Natural Language Processing, Digital Humanities
  • Set up a service platform (“Transkribus”) to make the technology available to archives, scholars and the public
  • Transform this research infrastructure into a permanent service
Transkribus as platform and as expert client
Documents in Transkribus
• Private
  • All documents in Transkribus are first of all private, visible only to the “owner” of the document
• Local
  • For simple operations, but all services are available only for remote documents
• Remote
  • Standard mode
  • Stored on the servers of the University of Innsbruck
• Upload of documents
  • HTTP
  • PDF
  • FTP
  • METS link
  • Direct download from repository
Documents directly loaded from the repository: one button!
Implemented by Intranda (Goobi Viewer)
Researchers can go “shopping” and collect documents from various repositories and digital libraries in their private Transkribus collection
Transcribe text in a reliable, secure, and machine-readable way = create a scholarly transcription
And use the text to train the HTR engines
Finished? Write an email…
Training process will be made available to the user as well (but will need some time due to a lack of resources in Innsbruck)
HTR engine(s) – current implementation
• Hidden Markov Models - HMM (already available)
  • Training takes some hours
  • Recognition takes 20-60 minutes for one page
  • Strong limitations on dictionary and resolution of images
• Recurrent Neural Networks - RNN (coming soon)
  • Training takes some days
  • Recognition takes less than 60 seconds (!)
  • No limitation on resolution of images
  • Free choice of dictionary, less dependency
• Main limitations for both HTR engines
  • Layout Analysis (“line finder”)
  • Need for dictionaries
What to do with the automatically recognized text?
• Measure results
• Correct the text
• Search in the full-text
• Invite people to support you in transcribing
Measure results with Character Error Rate (CER) and Word Error Rate (WER)
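Both metrics are commonly computed as an edit distance divided by the length of the reference transcription: characters for CER, words for WER. The following is a minimal sketch under that assumption, not the implementation used inside Transkribus. The function names are illustrative. The example strings are one line pair from the Dutch sample at the start of these slides; treating the first variant as the reference is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Assumed reference (manual transcription) and recognized text.
reference = "voorgevallen tusschen Stadt en Lande, als namentl ouer den voet ende"
recognized = "voorgevallen tusschen Stadt en Lande, als nanent ouer den voet ende"

print(f"CER: {cer(reference, recognized):.3f}")
print(f"WER: {wer(reference, recognized):.3f}")
```

Note that a single wrong character makes the whole word count as an error, so WER is normally higher than CER on the same text.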
Correct text
• Character Error Rates
  • Above 20%: correction takes as long as keying, but readers who have difficulty deciphering the script may benefit
  • Between 10% and 20%: correction is faster, but experienced readers prefer to key
  • Below 10%: correction is much faster and even experienced readers will accept correction instead of keying
• Currently typical figures are 10% CER
• Under lab conditions significantly better results are already possible
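The rule of thumb above can be written down as a small helper, for example to decide per page whether to correct the recognized text or key it from scratch. This is only a sketch: the function name is hypothetical, and how to treat values of exactly 10% or 20% is an assumption, since the slide does not say.

```python
def correction_advice(cer):
    """Map a character error rate (fraction, e.g. 0.12) to the slide's rule of thumb."""
    if cer < 0:
        raise ValueError("CER cannot be negative")
    if cer > 0.20:
        return "correction takes about as long as keying"
    if cer > 0.10:
        return "correction is faster, but experienced readers may prefer keying"
    # Assumption: exactly 10% is grouped with the "below 10%" case.
    return "correction is much faster than keying"


for value in (0.25, 0.15, 0.05):
    print(f"CER {value:.0%}: {correction_advice(value)}")
```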
Search full-text
• Private search: you will only get results from collections of which you are a member
• Faceted search
• Configurable search
Share your documents among your working group, colleagues, students and volunteers…
Export documents
• Various formats
  • XML (PAGE)
  • METS (Metadata Encoding and Transmission Standard – LoC)
  • ALTO (Analyzed Layout and Text Object – LoC)
  • DOCX
  • TEI (Text Encoding Initiative)
  • PDF
  • Excel
  • …
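As an illustration of working with an exported file, the sketch below reads the text lines from a PAGE XML document with Python's standard library. The sample document is hand-written for this example, not real Transkribus output, and the namespace version (2013-07-15) is an assumption; an actual export may declare a different schema version.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE namespace; check the xmlns of the exported file.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

# Hand-written minimal sample, for illustration only.
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="page1.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <TextEquiv><Unicode>Mo: gedaen in t'voorJaer</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <TextEquiv><Unicode>voorgevallen tusschen Stadt en Lande</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_text_lines(page_xml):
    """Return (line id, text) pairs for every TextLine in a PAGE XML string."""
    ns = {"p": PAGE_NS}
    root = ET.fromstring(page_xml)
    lines = []
    for line in root.iter(f"{{{PAGE_NS}}}TextLine"):
        # Only the TextEquiv that is a direct child of the line holds the line text.
        unicode_el = line.find("p:TextEquiv/p:Unicode", ns)
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        lines.append((line.get("id"), text))
    return lines


for line_id, text in read_text_lines(SAMPLE):
    print(line_id, text)
```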
How to access services via machines?
Services in Transkribus are accessible via REST interface
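To give an idea of what machine access over REST looks like, the sketch below builds (but does not send) an HTTP login request with Python's standard library. The base URL, the `/auth/login` path and the form field names `user` and `pw` are hypothetical placeholders chosen for illustration; the actual endpoints and parameters have to be taken from the Transkribus documentation.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL, used only to make the example self-contained.
EXAMPLE_BASE_URL = "https://example.org/rest"


def build_login_request(base_url, user, password):
    """Build a form-encoded POST request for a hypothetical login endpoint."""
    body = urlencode({"user": user, "pw": password}).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/auth/login",  # path is an assumption
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_login_request(EXAMPLE_BASE_URL, "alice", "secret")
print(request.get_method(), request.full_url)
# Sending would be done with urllib.request.urlopen(request); omitted here
# because the endpoint above does not exist.
```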
What will come next?
• Table editor
• eLearning interface
• Web-interface for simplified transcription (crowd-sourcing)
• Text2Image matching tool
• ScanApp
• …
Table editor
Define table as template
Automatic matching
Export data as CSV or Excel file
eLearning interface
User learns with real objects
Self-evaluation is based on a simple metric: Word Error Rate (the same as for the machine)
Statistics
Web-interface
Every document in Transkribus will also be accessible via a web-interface suitable to involve volunteers and the crowd
txt2img tool
Many printed or digital editions are available.
Automated matching may simplify the production of training data (only good matches will be taken for training).
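The idea of keeping only good matches can be illustrated with a simple string-similarity filter. The sketch below pairs each recognized line with its most similar line from an existing edition and keeps the pair only if the similarity reaches a threshold. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; this is not the matching method of the actual tool, and the function name and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def select_training_pairs(recognized_lines, edition_lines, threshold=0.8):
    """Pair each recognized line with its best edition line; keep good matches only."""
    pairs = []
    for recognized in recognized_lines:
        best_text, best_score = None, 0.0
        for candidate in edition_lines:
            score = SequenceMatcher(None, recognized, candidate).ratio()
            if score > best_score:
                best_text, best_score = candidate, score
        if best_text is not None and best_score >= threshold:
            pairs.append((recognized, best_text, best_score))
    return pairs


edition = [
    "Voorts Hooge Mogende heeren, Inde practyque vande decisie by uw Hoge",
    "voorgevallen tusschen Stadt en Lande, als namentl ouer den voet ende",
]
recognized = [
    "voorgevallen tusschen Stadt en Lande, als nanent ouer den voet ende",
    "xxxx qqq",  # a badly recognized line that should be rejected
]

for rec, ref, score in select_training_pairs(recognized, edition):
    print(f"{score:.2f}  {rec!r} -> {ref!r}")
```

The edition text of each accepted pair would then serve as the training transcription for the corresponding line image, while rejected lines are left out.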
ScanApp
Researchers are enabled to use mobile phones as document scanners (images are sent directly to their private Transkribus collection, and archives may benefit from this).
Try out?
We are happy to support you in setting up test projects.
Concluding a Memorandum of Understanding is a simple way to take part in the project!
Credits
Hubert Alisade, Hilde Boe, Laurant Bolli, Max Bryan, Elaine Charwat, Vincent Christlein, Sebastian Colutto, Hervé Déjean, Barbara Denicolo, Markus Diem, Felix Dietrich, Reko Etelävuori, Stefan Fiel, Basilis Gatos, Beat Gnädinger, Tobias Grüning, Vili Haukkovaara, Gerhard Heyer, Tobias Hodel, Frederic Kaplan, Maria Kallio, Istvan Kecskemeti, Florian Kleber, Roger Labahn, Eva Lang, Sören Laube, Gundram Leifert, Georgios Louloudis, Philip Kahle, Rory McNicholl, Jean-Luc Meunier, Johannes Michael, Hannes Obermair, Moises Pastor, Nathanael Philipp, Hannelore Putz, George Retsinas, Veronica Romero, Joan Andreu Sanchez, Robert Sablatnig, Christian Sieber, Giorgos Sfikas, Philip Schofield, Louise Seaward, Nikolaos Stamatopolous, Tobias Strauss, Melissa Terras, Alejandro Hector Toselli, Enrique Vidal, Mauricio Villegas, Max Weidemann, Welf Wustlich, Herbert Wurster and many, many more!
Thank you for your attention!
More information on the project and the Transkribus platform
http://read.transkribus.eu/
http://transkribus.eu/
http://transkribus.eu/wiki/
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 674943.