www.eudat.eu
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065
Building new knowledge from distributed scientific corpus
HERBADROP & EUROPEANA: two concrete case studies for exploring big archival data
2nd Computational Archival Science (CAS) workshop, Boston, USA, December 2017
Pascal Dugénie, Daan Broeder, Nuno Freire
Massively distributed collections
Digital Infrastructures for Research
Opportunities for preserving valuable scientific heritage
Collaborative Data Infrastructure (CDI)
Trusted Digital Repositories (TDR): ISO 16363, ISO 14721 (OAIS)
High-speed network infrastructures
LONG-TERM PRESERVATION
Monitoring
Data Storage
Persistent ID
Metadata
Data curation and policies
Natural heritage
Cultural heritage
HPC infrastructures
BIG DATA analysis tools
sharing distributed corpora
extraction of text in images
knowledge building
visibility of data
EUDAT: A truly pan-European Infrastructure
EUDAT offers common data services to both research communities and individuals through a large network of European organisations.
EUDAT wants to enable European researchers from any discipline to preserve, find, access, and process data in a trusted environment, as part of a Collaborative Data Infrastructure.
European infrastructures
Technology Providers
Research Communities
B2 Service Suite
https://www.eudat.eu/services
Covering both access and deposit, from informal data sharing to long-term archiving, and addressing identification, discoverability and computability of both long-tail and big data, EUDAT services seek to address the full lifecycle of research data.
Building solutions with the communities
Common Language Resources and Technology Infrastructure (CLARIN)
European Network for Earth System Modelling (ENES)
Distributed infrastructure for life-science information (ELIXIR)
European Plate Observing System (EPOS) - Solid Earth sciences Research Infrastructure
Integrated Carbon Observation System (ICOS) to quantify & understand greenhouse gas balance
Long-Term Ecosystem Research (LTER) in Europe
EUDAT services are designed, built and implemented together with user communities.
Challenges and problems to be solved
Digitized images
physical copies are fragile
digital copy must be preserved
Exploitation of digital copies
descriptive metadata and classification are complex
images contain a lot of information that should be extracted and made available
Herbadrop rationale
• Millions of specimens in herbaria all over the world
• Global trend to industrial digitizing
• Data difficult to handle even for medium size institutes
• Same challenges being faced by hundreds of herbaria in Europe
• Makes sense to work together to develop a solution
tiff: 180MB, zip: 80MB, jpg: 1MB. Total: 161MB
Herbadrop in Europe
MEISE, BE
Herbadrop objectives
Current scope
1. PRESERVATION: long-term preservation of herbarium specimen images
2. INFORMATION EXTRACTION: extracting information from images by using Optical Character Recognition (OCR) and basic image analysis techniques
Perspectives
3. KNOWLEDGE BUILDING: deep learning using OCR results, with access for the whole community for crowdsourcing
HERBADROP/EUDAT Workflows
Stages: STORAGE, TRANSFER, OCR, ACCESS, MONITORING, ARCHIVING
• Transferring images using the B2SAFE service
• Controlling data quality (file format and integrity)
• Performing OCR analysis using HPC
• Ingesting OCR results in a full-text indexing engine
• Ingesting images and metadata for long-term archiving
• Surveying bit-stream integrity and data quality
• Harvesting and indexing metadata
• Offering open access to the full-text engine, images and metadata
• Monitoring data and process status
• Producing regular statistical reports
CERTIFICATION
Implementing a DSA-based certification including appropriate SLA
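The workflow mentions controlling file integrity at ingest and surveying bit-stream integrity during archiving. The slides do not say how this is done, so the following is only a minimal sketch of one common approach: record a SHA-256 digest at ingest and recompute it later. The function names and the file name are hypothetical, and nothing here represents the B2SAFE implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(path: Path, expected_digest: str) -> bool:
    """Compare the current digest of a file with the one recorded at ingest."""
    return sha256_of(path) == expected_digest


def demo():
    """Ingest a stand-in file, then verify it before and after corruption."""
    with tempfile.TemporaryDirectory() as tmp:
        image = Path(tmp) / "specimen_0001.tif"  # hypothetical file name
        image.write_bytes(b"stand-in bytes for a herbarium scan")
        recorded = sha256_of(image)  # digest stored at ingest time
        intact = check_fixity(image, recorded)
        # Simulate bit rot by changing a single byte of the content.
        image.write_bytes(b"stand-in bytes for a herbarium scam")
        corrupted = check_fixity(image, recorded)
    return intact, corrupted


if __name__ == "__main__":
    print(demo())
```

Reading in chunks keeps memory use flat, which matters for TIFF masters in the size range quoted earlier in the deck.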
Europeana: European Cultural Heritage on the Web
The main goal of Europeana is to provide access to cultural heritage and encourage people to engage with culture.
• And the main access point is the Web!
• Promoting the research use of heritage data resources is in its early stages of development
Perspectives on using Schema.org for publishing and harvesting metadata at Europeana, CC BY-SA
The Challenges (1/2)
The Generic Challenge
How to facilitate the re-use of Cultural Heritage language resources for research purposes by exploiting the existing and emerging European research infrastructures?
How can the resources be discovered?
How can the resources be shared in practical ways for researchers?
How can advanced computation be applied to these Cultural Heritage datasets?
How can the resources and datasets be cited and referenced in research?
How can the Cultural Heritage institutions re-use the outcomes of research?
The Challenges (2/2)
The Specific Challenges of the Pilot
To identify requirements for technical interoperability between the two infrastructures
To create best practice guidelines for the publication and citation of cultural heritage data
To facilitate collaborative work between researchers, with a focus on:
Humanities
Social Sciences
Computer science
Europeana Newspapers Corpus
The pilot aims to expose the full text aggregated in the Europeana Newspapers project.
This corpus contains over 11 million pages of full text of historic newspapers
Mainly from the 19th century
Aggregated from national and research libraries across Europe.
The pilot aims to expose and improve the text for more data-driven usage
…based on EUDAT Data services…
EUDAT service uptake
The Europeana Newspaper Pilot relies on the following EUDAT services:
Research data storage and sharing (B2SHARE): to undertake the enrichment of the datasets and, more generally, to expose them for re-use by other academics, particularly those outside the digital humanities
Persistent Identification Service (B2HANDLE): persistent identification of the main objects of the full-text corpus: the newspaper titles and individual issues
Multi-disciplinary joint metadata catalogue (B2FIND): so that scientists will be able to
obtain the full corpus for machine processing
select just a portion of the corpus, benefitting from the enrichment of article-level annotations with named entities and topics
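The two access modes listed above (the full corpus, or a portion chosen through article-level annotations) can be illustrated with a small in-memory sketch. The `Article` record, its field names and the sample data are all hypothetical and made up for illustration; the slides do not describe the actual data model or the B2FIND query interface.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """One article with hypothetical article-level annotations."""
    issue_id: str
    text: str
    entities: set = field(default_factory=set)
    topics: set = field(default_factory=set)


def select_articles(corpus, entity=None, topic=None):
    """Return the articles matching the given entity and/or topic.

    With no criteria, the full corpus is returned.
    """
    selected = []
    for article in corpus:
        if entity is not None and entity not in article.entities:
            continue
        if topic is not None and topic not in article.topics:
            continue
        selected.append(article)
    return selected


# Made-up sample records, not taken from the Europeana Newspapers corpus.
CORPUS = [
    Article("issue-001", "Text about a harbour strike.", {"Rotterdam"}, {"labour"}),
    Article("issue-002", "Text about an exhibition.", {"Paris"}, {"culture"}),
    Article("issue-003", "Text about port trade.", {"Rotterdam"}, {"trade"}),
]

if __name__ == "__main__":
    for article in select_articles(CORPUS, entity="Rotterdam"):
        print(article.issue_id)
```

Calling `select_articles(CORPUS)` without criteria corresponds to obtaining the full corpus, while passing `entity` or `topic` corresponds to selecting a portion.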
Conclusions & Perspectives
Conclusions
• General conclusions:
• A successful application of the EUDAT services was achieved
• Heritage research data brought new requirements to EUDAT
• HERBADROP:
• Application of EUDAT’s computational capabilities is identifying new challenges:
• How to address poor quality OCR
• Amount of data is large and may become a limitation for accurate and exhaustive analysis
• EUROPEANA:
• Learned about the requirements of research usage
• Some may have impact on its data providers
HERBADROP and EUROPEANA: Some perspectives for data services
Improving discoverability of heritage research data resources
Full-text based
Metadata based
Additional heritage specific metadata support in EUDAT
Data format support and semantics
Semantic annotations
Computational processing for heritage use cases:
OCR
Image analysis tools
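Full-text based discoverability, as listed above, rests on indexing the text produced by OCR. As a rough illustration of the idea only, the sketch below builds a tiny inverted index in plain Python. The slides do not name the indexing engine used in the pilots, so this is not their implementation, and the sample OCR strings and identifiers are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up OCR output for three hypothetical specimen labels.
OCR_TEXTS = {
    "specimen-1": "Herbarium label. Quercus robur L.",
    "specimen-2": "Quercus petraea, collected 1887",
    "specimen-3": "Fagus sylvatica L.",
}
INDEX = build_index(OCR_TEXTS)

if __name__ == "__main__":
    print(sorted(search(INDEX, "quercus")))
```

Exact token matching like this misses misrecognised words, which is one reason poor-quality OCR is raised as a challenge in the conclusions.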
For additional information
http://www.eudat.eu/
Nuno Freire,
Europeana DSI/INESC-ID
http://www.europeana.eu/