Europeana Newspapers Workshop: Refinement WP2 – Introduction to Refinement Munich, 26 June 2013 Clemens Neudecker (@cneudecker)

This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp

Overview

• Objectives & Challenges

• Overview of Refinement Dataset

• Introduction to Refinement: Workflow & Technologies

• Questions & Answers

Objectives

- Analysis of available digital newspaper collections of project partners and identification of subsets suitable for refinement

- Definition of requirements and minimum quality of digitised newspapers for refinement to enable advanced services in Europeana

- Coordination of the scalable processing of 10 million digitised newspaper pages with several refinement technologies

- Providing recommendations on best practices for the refinement of digitised newspaper collections with full-text (and ingest to Europeana)

Challenges

• Processing quality vs. speed/throughput

• Volume of data requires focus on simple & standardised workflow with clear checkpoints

• Diverse partners supplying content with different digitisation & access policies

• Large variety of content in terms of file formats, fonts, languages, etc.

The data

Europeana Newspapers Dataset (1)

Europeana Newspapers Dataset (2)

Europeana Newspapers Dataset (3)

Europeana Newspapers Dataset (4)

Refinement Workflow steps

Tools (BCT)

• BCT = Binarisation and Colour Reduction Tool

• Purpose: Convert grey/colour scans to bitonal using highly optimised GPP method

• Background: Reduce total file size of master images to guarantee feasibility and timing of data transfers
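To illustrate what converting a grey scan to bitonal involves, here is a minimal sketch using Otsu's global threshold. This is a stand-in chosen for brevity, not the optimised method used by the BCT; the function names and the 0 = ink, 1 = paper convention are assumptions made for this example.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the grey level (0-255) that maximises between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the dark class
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(pixels: Sequence[int]) -> List[int]:
    """Map each grey value to 0 (ink) or 1 (paper) using the Otsu threshold."""
    threshold = otsu_threshold(pixels)
    return [0 if value <= threshold else 1 for value in pixels]
```

Storing one bit per pixel instead of eight (grey) or twenty-four (colour) is what makes the file size reduction possible before any compression is applied.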

Tools (FRT)

• FRT = File Rename Tool

• Purpose: Support content holders in preparing their data in the correct format

• Background: Ensure folder structure and file naming requirements for automated processing are met
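The kind of renaming such a tool performs can be sketched as follows. The target pattern `<issue id>_<4-digit page>.<ext>` and the function name `plan_renames` are hypothetical; the slides do not state the project's actual naming requirements.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable


def plan_renames(issue_id: str, filenames: Iterable[str]) -> Dict[str, str]:
    """Map each scan to '<issue_id>_<4-digit page><ext>' in sorted order.

    The pattern is an assumption for illustration only. Nothing is renamed
    on disk; the caller can review the plan before applying it.
    """
    plan = {}
    for page_number, name in enumerate(sorted(filenames), start=1):
        extension = PurePosixPath(name).suffix.lower()
        plan[name] = f"{issue_id}_{page_number:04d}{extension}"
    return plan
```

Returning a plan rather than renaming directly keeps the step reviewable, which matters when content holders prepare large batches themselves.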

Tools (FAT)

• FAT = File Analyzer Tool

• Purpose: Final quality check of data before refinement

• Background: Assure content and refinement partners that all preparation steps have been executed successfully
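A pre-refinement check of this kind might look like the sketch below. The naming rule encoded in `PAGE_NAME` and the function `check_issue` are hypothetical and only illustrate the idea of a final, automated gate.

```python
import re
from typing import Iterable, List

# Hypothetical naming rule: '<issue id>_<4-digit page>.tif'
PAGE_NAME = re.compile(r"^(?P<issue>[A-Za-z0-9-]+)_(?P<page>\d{4})\.tif$")


def check_issue(filenames: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means the issue passes."""
    problems = []
    pages = []
    for name in sorted(filenames):
        match = PAGE_NAME.match(name)
        if match is None:
            problems.append(f"bad file name: {name}")
        else:
            pages.append(int(match.group("page")))
    # Pages are expected to run 1..n without gaps or duplicates.
    pages.sort()
    if pages != list(range(1, len(pages) + 1)):
        problems.append(f"page sequence {pages} has gaps or duplicates")
    return problems
```

An empty result signals to both the content partner and the refinement partner that the batch is ready to ship.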

Refinement: OCR@UIBK

• OCR = Optical Character Recognition

• Number of pages to be refined: 8 million

• Technologies: ABBYY FineReader SDK

• State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts

• Result: METS/ALTO package containing images, metadata & full text
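As an illustration of how the full text can be read back out of an ALTO file, the sketch below joins the `CONTENT` attributes of the `String` elements in each `TextLine`. The `SAMPLE` fragment is made up for this example and is not project data; matching on local tag names is a choice made here so that the code does not depend on one ALTO schema version.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml: str) -> List[str]:
    """Return one string per TextLine, built from its String CONTENT values."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.attrib.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


# Made-up ALTO fragment, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Europeana"/><SP/><String CONTENT="Newspapers"/>
    </TextLine>
    <TextLine><String CONTENT="Munich"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""
```

Calling `alto_lines(SAMPLE)` yields one entry per text line, which is the form a full-text index would typically consume.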

OCR Full text search

http://www.europeana-newspapers.eu/building-a-content-browser-for-digital-newspapers/

Refinement: OLR@CCS

• OLR = Optical Layout Recognition

• Number of pages to be refined: 2 million

• Technologies: docWorks

• Separation of columns, articles, headlines, page classes

• Result: METS/ALTO package containing images, metadata & full text

OLR Article separation

Refinement: NER@KB

• NER = Named Entity Recognition

• Number of pages to be refined: 2 million

• Technologies: Stanford CRF-NER

• Languages supported: German, Dutch, English (+ French, Latvian)

• Open source: https://github.com/KBNLresearch/europeananp-ner

• Detection of Named entities: Person, Location, Organization

• Feedback cycle with manual training step yields better results
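Once a tagger has labelled each token, consecutive tokens with the same label still need to be merged into whole names. The sketch below shows that post-processing step only; it does not call Stanford CRF-NER, and the label strings (`PERSON`, `LOCATION`, `O` for "no entity") are assumptions for this example.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word, label); label "O" means "not an entity"


def group_entities(tagged: Iterable[Token]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens sharing a non-'O' label into one entity."""
    entities = []
    current_words: List[str] = []
    current_label = "O"
    for word, label in tagged:
        if label != current_label:
            # The label changed: close the entity that was being built.
            if current_label != "O":
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], label
        if label != "O":
            current_words.append(word)
    if current_label != "O":
        entities.append((" ".join(current_words), current_label))
    return entities
```

The resulting (name, type) pairs are the kind of data a browse-by-person or browse-by-place index can be built from.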

NER Browse by names or places

Thank you for your attention!