The EUDAT Project
Towards a European Collaborative Data Infrastructure
Slides by Damien Lecarpentier – CSC, IT Center for Science, Finland
Presentation by Alberto Michelini, INGV Rome, VERCE Kick-Off, Paris, October 3, 2011
Source: European Commission
Scientific facilities, research communities
Linking at the speed of the light
Sharing computers, software and instruments
Sharing and federating scientific data
e-Infrastructures Vision: empower research communities through ubiquitous, trusted and easy access to services for data, computation, communication and collaborative work
Sharing and federating Scientific Data?
The current data infrastructure landscape
There is a long history of data management in Europe, with several existing data infrastructures dealing with established and growing user communities (e.g., ESO, ESA, EBI, CERN)
New Research Infrastructures are emerging and are also trying to build data infrastructure solutions to meet their needs (CLARIN, EPOS, ELIXIR, ESS, etc.)
There is a large number of projects providing excellent data services (EURO-VO, GENESI-DR, Geo-Seas, HELIO, IMPACT, METAFOR, PESI, SEALS, etc.)
However, most of these infrastructures and initiatives address the needs of a specific discipline and user community
The current data infrastructure landscape: Challenges and opportunities
Challenges
How to ensure compatibility and interoperability between the different infrastructures to promote cross-disciplinary research?
Infrastructures face increasing challenges due to data growth in volume and complexity (the so-called “data tsunami”)
Strong impact on costs, threatening the sustainability of the infrastructure
Opportunities
Potential synergies do exist: although disciplines have different ambitions, they have common basic needs and requirements that could be matched with generic pan-European services supporting multiple communities.
Strategy needed at European level
History of the EUDAT concept
The concept of a shared pan-European infrastructure was supported and further elaborated by a number of policy and expert bodies:
EUDAT has its origins in the work of the PARADE (Partnership for Accessing Data in Europe) initiative
PARADE White Paper (October 2009) defining a “Strategy for a European Data Infrastructure that should be persistent, multidisciplinary, and based on the need of user communities”
e-IRG and ESFRI: e-IRG Blue Paper (September 2010) recommends “to identify and promote common (long term) data related services across different RI”
High Level Expert Group (HLEG) report on Scientific Data (October 2010) calls for a “Collaborative Data Infrastructure” for scientific data that supports seamless access, use, reuse, and trust of data.
EUDAT will materialise this vision from October 2011
Towards a Collaborative Data Infrastructure
Source: HLEG report, p. 31
EUDAT will focus on building this generic data infrastructure layer and offer a trusted domain for long term data preservation accompanied with related services to store, identify, authenticate and mine these data.
This needs to be done in close collaboration with the communities
Core services must match the requirements of the communities.
Community services can also be incorporated into the common data service infrastructure when they are of use to other communities.
Core services are building blocks of EUDAT’s Common Data Infrastructure, mainly included on the bottom layer of data services
EUDAT Core Services
• Single Sign On (federated AAI)
• Data access and upload
• Long-term preservation
• Persistent identifier service
• Workspaces
• Web execution and workflow services
• Monitoring and accounting services
• Network services
Extended Core Services (community-supported)
• Joint metadata service
• Joint data mining service
EUDAT core services
No need to match the needs of all at the same time; addressing a group of communities can be very valuable, too
The EUDAT Consortium
The EUDAT Communities: CLARIN/Linguistics
The EUDAT Communities: EPOS/Earth Science
The EUDAT Communities: ENES/Climate
The EUDAT Communities: Lifewatch/Environment
The EUDAT Communities: VPH/BMS
The EUDAT Communities
The EUDAT Communities (by field)
EUDAT targets all scientific disciplines (discipline neutral):
To capture and identify cross-discipline requirements
To involve the scientists of all the communities in the shaping of the infrastructure and its services
Biological and Medical Science
VPH, ELIXIR, BBMRI, ECRIN
Environmental Science
ENES, EPOS, Lifewatch, EMSO, IAGOS-‐ERI, ICOS
Social Sciences and Humanities
CLARIN
Physical Sciences and Engineering
WLCG, ISIS
Material Science
ESS…
Energy
EUFORIA…
EUDAT Services Activities – Iterative Design
EUDAT’s Services activity is concerned with identification of the types of data services needed by the European research communities, delivering them through a federated data infrastructure and supporting their users
1. Capturing Communities Requirements (WP4)
Services to be deployed must be based on user communities’ needs
Strong engagement and collaboration with user communities (EUDAT communities and beyond) to capture requirements
2. Building the services (WP5)
User requirements must be matched with available technologies
Need to identify:
• available technologies and tools to develop the required services (technology appraisal)
• gaps and market failures that should be addressed by EUDAT research activities
Services must be designed, built and tested in a pre-production test bed environment and made available to WP4 for evaluation by their users
3. Deploying the services and operating the federated infrastructure (WP6)
Services must be deployed on the EUDAT infrastructure and made available to users, with interfaces for cross-site, cross-community operation
Reliability, 24h/7d availability and accessibility of the shared services, with operational security, data integrity and compliance with stakeholder requirements and policies.
EUDAT Timeline
Budget: 9.3 M€
[Figure: project timeline, 2012 to 2015. Tracks: user requirements, service design, service deployment. Milestones: EUDAT Kick-Off; 1st, 2nd, 3rd and 4th User Forum; first services available; cross-community services; full core services deployed; sustainability plan.]
Expected benefits of a Collaborative Data Infrastructure
Enabling innovative multi-disciplinary data intensive research
Development of common services supporting research communities
Support to existing scientific communities’ infrastructures
Support to smaller communities through access to sophisticated services
Inter-disciplinary collaboration and exploitation of synergies between communities
Communities from different disciplines working together to build services
Data sharing between disciplines
Ensuring wide access to and preservation of data in a sustainable way
A robust generic infrastructure capable of handling the scale and complexity of data that will be generated over the next 10-20 years
Greater access to existing data and better management of data for the future
Increased security by managing multiple copies in geographically distant locations
Put Europe in a competitive position for important data repositories of world-wide relevance
Economies of scale and cost-efficiency
Shared resources and work are less costly
Initial work of the Collaborative Data Infrastructure: WP4 - Capturing Communities Requirements
The design of CDI poses a significant challenge, requiring active collaboration between all actors, and it needs to be based on an organic and open discussion process involving data architects and practitioners.
“…Open and widely accepted registries of different sort will play a key role, since they make otherwise embedded knowledge explicit and therefore also allow people to re-use their content for own purposes and to carry out assessments….”
From “Data Access and Interoperability Task Force” Preparation note
Digital Object Architecture Kahn, Wilensky (2006)
Canonical Workflow
[Figure: canonical workflow diagram. Actors and components: originator, depositor, user, repository A, repository B, handle generator. Objects and records: work; ownership; data; metadata (Key-MD); PID; access rights; registered DO (data, metadata (Key-MD)); property record; rights; type (from central registry); ROR flag; mutable flag; transaction record. Arrow labels: hands-over, requests, deposits via RAP, stores, maintains, receives disseminations via RAP, replicates.]
Definitions/Entities
originator = creates digital works and is owner; they can already request Handles
depositor = forms work into DO (incl. metadata), deposits DO, specifies access rights and provides PID if available
digital object (DO) = instance of an abstract data type with 2 components (typed data + key-metadata (as part of more metadata, includes a Handle)); can be elementary and composed; registered DOs are such DOs with a Handle; DO content is not considered
repository (Rep) = network accessible storage to store DOs; has mechanisms for deposit and access; has a unique name (X.Y.Z) to be registered centrally; store additional data about DOs; one rep is the ROR
RAP (Rep access protocol) = simple access protocol with minimal functionality required for DOA; reps can specify more
Dissemination = is the data stream a user receives upon request via RAP
ROR (repository of record) = the repository where data was stored first; controls replication process
Meta-Objects (MO) = are objects that store mainly references
mutable DOs = some DOs can be modified, others not
property record = contains various info about DO (metadata, etc)
type = data of DOs have a type
transaction record = all disseminations of a DO are recorded
© EUDAT - MPI
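The entities defined for the Digital Object Architecture can be read as a small data model. The following is a minimal sketch under stated assumptions, not the EUDAT or Handle System implementation: the class and method names (`DigitalObject`, `HandleGenerator`, `Repository`, `deposit`, `disseminate`, `replicate_to`) and the `hdl:0000` prefix are all hypothetical, and the repository access protocol (RAP) is reduced to plain method calls.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
import uuid


@dataclass
class DigitalObject:
    """A DO: typed data plus key-metadata. It counts as registered once it has a PID."""
    data: bytes
    data_type: str
    key_metadata: Dict[str, str]
    pid: Optional[str] = None
    mutable: bool = False

    @property
    def registered(self) -> bool:
        return self.pid is not None


class HandleGenerator:
    """Hypothetical PID minting service; the prefix is a placeholder, not a real Handle prefix."""

    def __init__(self, prefix: str = "hdl:0000"):
        self.prefix = prefix

    def mint(self) -> str:
        return f"{self.prefix}/{uuid.uuid4().hex}"


@dataclass
class Repository:
    """Network-accessible store for DOs; method calls stand in for RAP requests."""
    name: str
    objects: Dict[str, DigitalObject] = field(default_factory=dict)
    ror_flags: Dict[str, bool] = field(default_factory=dict)
    transactions: List[Tuple[str, str]] = field(default_factory=list)

    def deposit(self, do: DigitalObject, handles: HandleGenerator) -> str:
        # The depositor provides a PID if available; otherwise one is requested.
        if do.pid is None:
            do.pid = handles.mint()
        self.objects[do.pid] = do
        # The repository where the data was stored first is the ROR.
        self.ror_flags[do.pid] = True
        return do.pid

    def disseminate(self, pid: str, user: str) -> bytes:
        # Every dissemination is appended to the transaction record.
        self.transactions.append((pid, user))
        return self.objects[pid].data

    def replicate_to(self, pid: str, other: "Repository") -> None:
        # Assumption: only the ROR may start a replication.
        if not self.ror_flags.get(pid, False):
            raise PermissionError("only the repository of record controls replication")
        other.objects[pid] = self.objects[pid]
        other.ror_flags[pid] = False
```

A depositor would call `deposit` on one repository, users would call `disseminate`, and the first repository would push copies with `replicate_to`; the replica keeps the same PID but does not carry the ROR flag.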
EPOS Architecture 1
[Figure: EPOS architecture diagram, split into a part related to mutable DO and a part related to registered DO. Actors and components: originator, Depositor 1, Depositor 2, Depositor 3, user, repository A1, repository A2, repository B, repository C, repository D, handle generator. Objects and records: work; ownership; data (mutable DOs, both continuous and discrete); registered DO (data, metadata (Key-MD)); PID; property record; rights; type (from central registry); ROR flag; mutable flag; transaction record. Arrow labels: hands-over, hands-over metadata, requests, requests triggered by user requests, deposits via RAP1, deposits via RAP2, stores, maintains, receives disseminations via RAP1, replicates.]
Definitions/Entities
originator = creates digital works and is owner; it can be a probe or a computational simulation
depositor1 = forms work into DO, which are only metadata, deposits DO, specifies access rights; it can be a person
depositor2 = forms work into DO, deposits DO, specifies access rights; it can overwrite DO
depositor3 = forms DO into registered DO, deposits DO, specifies access rights and provides PID if available
digital object (DO) = instance of an abstract data type with 2 components (typed data + key-metadata (as part of more metadata, includes a Handle)); can be elementary and composed; registered DOs are such DOs with a Handle
repository (Rep) = network accessible storage to store DOs; has mechanisms for deposit and access; has a unique name (X.Y.Z) to be registered centrally; store additional data about DOs; one rep is the ROR
RAP1 (Rep access protocol) = simple access protocol with minimal functionality required for DOA; reps can specify more
RAP2 = the protocol used for metadata is different from that one used for data.
Dissemination = is the data stream a user receives upon request via RAP
ROR (repository of record) = the repository where data was stored first; controls replication process, it is the repository B.
Meta-Objects (MO) = are objects that store mainly references
mutable DOs = some DOs can be modified, others not
property record = contains various info about DO (metadata, etc)
type = data of DOs have a type
transaction record = all disseminations of a DO are recorded
handle generator = it provides PIDs (Persistent Identifier), but its implementation has not been defined yet. PIDs should be related to URIs, which are increasingly used to identify DO.
© EUDAT - MPI
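The EPOS variant separates a part that handles mutable DOs, which depositor 2 may overwrite, from a part that handles registered DOs, which depositor 3 creates by attaching a PID. The sketch below illustrates that lifecycle only. All names (`WaveformDO`, `MutableStore`, `register`) are hypothetical, and the rule that a registered DO can no longer be overwritten is an assumption made for the example; the slides do not state it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WaveformDO:
    """A DO that starts out mutable and has no PID until it is registered."""
    data: bytes
    pid: Optional[str] = None
    mutable: bool = True


class MutableStore:
    """Stands in for the repository that receives deposits from depositor 2."""

    def __init__(self) -> None:
        self._objects: Dict[str, WaveformDO] = {}

    def deposit(self, key: str, data: bytes) -> None:
        existing = self._objects.get(key)
        # Assumption: overwriting is allowed only while the DO is still mutable.
        if existing is not None and not existing.mutable:
            raise ValueError(f"{key} is registered and can no longer be overwritten")
        self._objects[key] = WaveformDO(data=data)

    def get(self, key: str) -> WaveformDO:
        return self._objects[key]


def register(store: MutableStore, key: str, pid: str) -> WaveformDO:
    """Role of depositor 3: turn a DO into a registered DO by attaching a PID."""
    do = store.get(key)
    do.pid = pid
    do.mutable = False
    return do
```

In this reading, continuous data can be rewritten as often as needed under the same key, and registration is the single step after which the object is addressed by its PID.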
EPOS Architecture 2
Originators
EPOS Architecture 3
For example, the main seismic data format is called SEED (Standard for the Exchange of Earthquake Data), which is organized in this way:
• Control headers (formatted in ASCII) – called dataless (metadata)
  • information about the volume, the station-channels, and the data
• Time series (binary, unformatted) – called mini-seed (data)
  • seismic waveform data
The protocol used to exchange this format is called ArcLink, a TCP-based distributed data archive access protocol, which allows data requests based on time windows.
The Handle Generator has not been implemented yet, but it should be able to manage URIs (IETF RFC 3986), which are used as PIDs.
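Two ideas from this slide, requests based on time windows and URIs used as PIDs, can be illustrated together. This is a sketch under stated assumptions and does not implement ArcLink or read SEED files: the `Segment` index entry, the `select` function, the URI layout and the `https://example.org/pid` base are all hypothetical, and timestamps are assumed to be in UTC.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical index entry for one mini-seed time series segment."""
    network: str
    station: str
    channel: str
    start: datetime  # assumed UTC
    end: datetime    # assumed UTC

    def uri(self, base: str = "https://example.org/pid") -> str:
        # Illustrative URI layout only; no PID scheme is defined in the slides.
        stamp = self.start.strftime("%Y%m%dT%H%M%SZ")
        return f"{base}/{self.network}/{self.station}/{self.channel}/{stamp}"


def select(segments: List[Segment], station: str,
           t0: datetime, t1: datetime) -> List[Segment]:
    """Return the segments of `station` that overlap the half-open window [t0, t1)."""
    return [
        s for s in segments
        if s.station == station and s.start < t1 and s.end > t0
    ]
```

A request for a station and a time window then returns the matching segments, each of which can be cited through the URI built from its network, station and channel codes and its start time.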
[Figure: EPOS architecture diagram mapped to seismological components, split into a part related to mutable DO and a part related to registered DO. Actors and components: seismic station (originator), person who manually creates the dataless, Depositor 2, Depositor 3, user, Dataless Repository (DB), Mini-seed Repository (filesystem based), repository C, repository D, URI generator. Objects and records: work; ownership; data (mutable DOs, both continuous and discrete); Full-seed (dataless + mini-seed); registered DO (data, metadata (Key-MD)); PID; property record; rights; type (from central registry); ROR flag; mutable flag; transaction record. Arrow labels: hands-over, hands-over metadata, requests, requests triggered by user requests, deposits via ArcLink, deposits via RAP2, stores, maintains, receives disseminations via ArcLink, replicates.]
Initial work of the Collaborative Data Infrastructure: WP7 - Assessing Scalability
Challenges
Delivering high level multi-disciplinary data services
Achieving a high level of interoperability in the context of diversity of data, research disciplines and practices
Need to strongly involve the different communities in the design and evaluation of services
EUDAT as a platform to discuss interoperability issues (along with other initiatives, e.g. DAITF)
Building trust among stakeholders
Trust between service providers and users but also between the researchers and disciplines themselves
Trust in the EUDAT infrastructure, the data deposited and collected, data integrity
Ensuring the sustainability of the infrastructure
Providing a framework and a plan to ensure the continuity of services beyond the immediate funding window, through the setting up of a sustainable entity
Funding and business models
Partnership and governance models
How to get in touch with EUDAT?
Kimmo Koski, CSC - IT Center for Science
EUDAT Project Coordinator
Peter Wittenburg, Max Planck Institute for Psycholinguistics at Nijmegen (MPI-PL)
EUDAT Scientific Coordinator [email protected]
Damien Lecarpentier, CSC - IT Center for Science
EUDAT Project Manager [email protected]
Alberto Michelini, INGV
EUDAT EPOS community representative [email protected]
THANK YOU!
“Do the difficult things while they are easy and do the great things while they are small. A journey of a thousand miles
must begin with a single step.”
Lao Tzu
The beginning of a long journey…