by Andy Götz + Rudolf Dimper + Alex de Maria on behalf of the ESRF Data Policy Implementation Team


ESRF Data Policy

• Why an ESRF data policy?

• Changing landscape

• Current situation

• Other sites

• How did it happen

• Data policy explained

• Open questions

ADMP (CERN) 29 June 2016 – ESRF data policy

Who are we – the European synchrotron

• A synchrotron is an accelerator that produces very intense light in the form of X-rays, with special characteristics (e.g. intensity, wavelength, size) that make it very versatile

• ESRF, the European Synchrotron, is one of many in Europe and the world

What do we do - operate beamlines


• Synchrotrons provide x-rays to beamlines where experiments are conducted and lots of data are produced


Why an ESRF data policy?


• Information technology has transformed scientific investigation

• Data are the raw material of science and the core product of research facilities

• Data need to be properly managed to allow:

• linking to publications (increasingly requested by publishers)

• re-analysis

• verification and anti-fraud

• new research

• preservation of unique data sets

• users to comply with H2020 Open Data requirements

The challenge – many scientific domains


• The following are the research areas into which experiment proposals are categorised:

• Hard condensed matter science

• Applied material science

• Engineering

• Chemistry

• Soft condensed matter science

• Life sciences

• Structural biology

• Medicine

• Earth science

• Environment

• Cultural heritage

• Methods and instrumentation


The challenge – diverse samples + data

• Very many different kinds of:

• Techniques – 10s

• Experiments – 1500/yr

• Samples – 100s/day

• Data sets – 100s/day/beamline

• Scientists – 2000/yr

• Communities – 100s

• Scientific domains – 10s

• Data analysis programs – 100s

http://www.esrf.eu/news/spotlight


ESRF data policy – what made it possible

• The following factors contributed to the data policy:

1. Evolving scientific data landscape

2. The FP7 projects PANDATA + CRISP

3. Recommendations from the EC (H2020) + IUCR

4. Similar facilities showing the way (ISIS+ILL)

5. Realisation that metadata on its own is useful

6. Choice of tape as low cost storage medium

7. Availability of ICAT metadata catalogue

8. Our motivation to improve data management

9. Support from our management + advisers + council

10. Helping our users + increasing data re-use via open data

Preserving the records of science

US National Research Council study, “Bits of Power: Issues in Global Access to Scientific Data”, 1997:

“The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for … data derived from publicly funded research”

OECD Principles and Guidelines for Access to Research Data from Public Funding (2007):

“Sharing and open access to publicly funded research data not only helps to maximise the research potential but provides greater returns from the public investment in research”

Preserving the records of science

ESFRI Position Paper on Digital Repositories (e-IRG / ESFRI, 06/09/2007):

“Research Infrastructures should guarantee that raw research data are made available through portals and databases.”

“Data's shameful neglect”, Nature 461, 145 (10 September 2009), doi:10.1038/461145a:

“Research cannot flourish if data are not preserved and made accessible. All concerned must act accordingly”

Nature 494, 149 (14 February 2013), doi:10.1038/494149a:

“redefine misconduct as distorted reporting: ‘any omission or misrepresentation of the information necessary and sufficient to evaluate the validity and significance of research, at the level appropriate to the context in which the research is communicated’”

Changing landscape

Scientific data are increasingly treated as a publication in their own right and/or as part of a publication

The movement to Open Data is growing, e.g. OECD, G8, RDA, …

IUCr DDDWG initiative for open data

Pressure is increasing on publicly funded research institutes to follow

H2020 participation will be conditional on a data management plan

France: the digital law (loi numérique) was passed by the National Assembly on 26 January 2016

ESRF, as the European synchrotron, has to lead

Where we are coming from

Raw Data

Currently producing 2 PB of data / year

Data are stored in proprietary formats

Data sets vary from 1 to 100 000 files and from 10 MB to 100s of TB

Data are deleted from disk after 50 days and from tape after one year

No persistent identifiers

No data management plan

Strong difference between in-house research and visitor data

Metadata

Metadata not collected systematically

Experiment report public

Data Policy at other Research Infrastructures

Neutrons

ILL – PanData-like policy for 3 years

ISIS – PanData-like policy for 3 years

Photons

ELETTRA – PanData-like policy for 1 year

HZB – PanData-like policy adopted on 14 June 2016

ALBA – PanData-like policy proposed

SLS – Currently under preparation

Other

Alfred Wegener Institut (Helmholtz) – Open Data Policy

Astronomy, Biology, CERN, … – Open Data Policies

ESRF Data Policy

Based on the PaNdata Data Policy (deliverable D2.1* of the PaN-data Europe FP7 project, 2011)

The policy addresses the issues of:

Data ownership

Data curation

Data archiving

Open access to data

*http://wiki.pan-data.eu/imagesGHD/0/08/PaN-data-D2-1.pdf

ESRF Data Policy* – main elements

Raw data and associated metadata

ESRF is the custodian of raw data and metadata from all beamlines (including CRGs)

ESRF will automatically collect metadata for all experiments

ESRF will store metadata in a metadata catalogue (ICAT)

High-level metadata will be published as soon as possible, i.e. Title, Authors, Beamline, Abstract, Experiment Report

The experimental team has sole access to the data during the so-called embargo period of 3 years; a request to extend the embargo period can be made

After the embargo, ESRF will make the data “Open Access” under a CC-BY licence

Users need to create an identifier to get Open Access data

Proprietary, i.e. commercial, data belong by default to the PI and are not archived unless explicitly agreed

*http://www.esrf.fr/files/live/sites/www/files/about/organisation/ESRF%20data%20policy-web.pdf
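
The embargo rule above can be sketched in a few lines of Python. This is only an illustration, not ESRF's implementation: it assumes the 3-year embargo is counted from the end of the experiment (the slide does not say when the clock starts), and the function names and the `extension_years` and `released_early` parameters are hypothetical.

```python
from datetime import date

EMBARGO_YEARS = 3  # embargo period stated in the policy


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def embargo_end(experiment_end: date, extension_years: int = 0) -> date:
    """First day on which the data become Open Access.

    Assumption: the embargo starts at the end of the experiment.
    `extension_years` models a granted request to extend the embargo.
    """
    return add_years(experiment_end, EMBARGO_YEARS + extension_years)


def is_open_access(experiment_end: date, today: date,
                   released_early: bool = False,
                   extension_years: int = 0) -> bool:
    """True if the data are open: embargo over, or released early by the PI."""
    return released_early or today >= embargo_end(experiment_end, extension_years)


if __name__ == "__main__":
    end = date(2016, 6, 29)
    print(embargo_end(end))                       # end of the default embargo
    print(is_open_access(end, date(2018, 1, 1)))  # still inside the embargo
```

Keeping the embargo length in one constant makes it easy to compare with other facilities' policies that use a different period.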

ESRF Data Policy – main elements

Raw data and associated metadata …

Only keep data generated at the ESRF

Data must be in a format the ESRF can read

Metadata must be of sufficiently high quality to enable data re-use

Data must be traceable and verifiable as coming from the ESRF

ESRF data catalogue to be linked to other open data repositories

ESRF Data Policy – open data access

Access to raw data and metadata will be via a searchable on-line catalogue (https://wwws.esrf.fr/icat/)

Access to the on-line catalogue of the ESRF will be restricted to registered users. The ESRF will set up an on-line procedure to become a registered user of the catalogue, e.g. with an Umbrella ID

Access to proposals will only be provided to the experimental team and appropriate facility staff

The Principal Investigator (PI) can transfer part or all of her/his rights to another registered person during the embargo period

The PI has the right to create and distribute copies of the raw data

The PI can make data public before the end of the embargo period
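
The access rules on this slide can be read as a small decision procedure. The sketch below is a toy model under stated assumptions, not the ICAT authorization rules: the `Dataset` class, its field names and `can_read` are hypothetical, and facility staff access is left out.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Dataset:
    team: set                    # user names of the experimental team (incl. the PI)
    embargo_end: date            # first day of open access
    delegates: set = field(default_factory=set)  # people the PI transferred rights to
    public_early: bool = False   # PI made the data public before the embargo ended


def can_read(user: str, registered: bool, ds: Dataset, today: date) -> bool:
    """Toy version of the slide's rules for reading raw data."""
    if not registered:
        # The catalogue is restricted to registered users.
        return False
    if user in ds.team or user in ds.delegates:
        # Team members and delegates have access during the embargo.
        return True
    # Everyone else waits for the embargo to end or for an early release.
    return ds.public_early or today >= ds.embargo_end


if __name__ == "__main__":
    ds = Dataset(team={"pi"}, embargo_end=date(2019, 6, 29))
    print(can_read("pi", True, ds, date(2017, 1, 1)))
    print(can_read("guest", True, ds, date(2017, 1, 1)))
```

Checking registration first mirrors the slide: even open data are served only to registered users of the catalogue.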

Implications of data policy

What do we need to curate data for 10 years?

A metadata catalogue - ICAT (already installed)

Good metadata on all beamlines - modify the data acquisition

Hooks in the experiments - modify macros on each beamline

A catalogue of data to curate - identify what data to register + archive

Identity management - persistent IDs

Lots of tape storage - money for tapes and manpower to install

Automatic way to restore data - manpower to implement workflow

Current production is ~2 PB / year (2015)

Assuming linear growth to 15 PB / year in 2025 - 45 PB on tape

Future tape storage technology has 88x more capacity than today

icatproject.org

Data archiving on tapes – most cost effective

What tape technology + costs

Starting point is 2 x DLS 8400 tape libraries from Storage Tek with 2 x 64 tape drives

Situation in June 2016 - capacity of 8.5 PB

Planned to increase to 75 PB in 2017

Tapes

– LTO-5 tape = 100 MB/s, 1.7 TB

– T10Kd T2 tape = 300 MB/s, 8.5 TB

Data written to tape with Time Navigator s/w

Limit of maximum number of data objects in catalogue (~ 2 million)

Translates to 1000 data objects, i.e. hdf5 files, / shift / experiment (600 shifts x 30 beamlines)

Cost is ~ 100 kilo Euros / year for tapes + s/w
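
To give a feel for the tape figures, here is a back-of-the-envelope sketch using only the capacities and rates quoted on this slide. It assumes decimal units (1 PB = 1000 TB), treats the quoted values as native capacities and sustained rates, and ignores compression, mount times and redundancy; the function names are illustrative.

```python
# Figures from the slide, in GB and MB/s (decimal units assumed).
LIBRARY_2016_GB = 8_500_000    # 8.5 PB, situation in June 2016
LIBRARY_2017_GB = 75_000_000   # 75 PB, planned for 2017
TAPES = {
    "LTO-5":    {"capacity_gb": 1_700, "rate_mb_s": 100},
    "T10Kd T2": {"capacity_gb": 8_500, "rate_mb_s": 300},
}


def cartridges_needed(total_gb: int, tape_gb: int) -> int:
    """Smallest whole number of cartridges that holds total_gb (ceiling division)."""
    return -(-total_gb // tape_gb)


def hours_to_fill(tape_gb: int, rate_mb_s: int) -> float:
    """Hours needed to stream one full cartridge at the quoted rate."""
    return tape_gb * 1000 / rate_mb_s / 3600


if __name__ == "__main__":
    for name, tape in TAPES.items():
        n = cartridges_needed(LIBRARY_2016_GB, tape["capacity_gb"])
        h = hours_to_fill(tape["capacity_gb"], tape["rate_mb_s"])
        print(f"{name}: {n} cartridges for 8.5 PB, {h:.1f} h to fill one")
```

Under these assumptions 8.5 PB corresponds to 1000 of the 8.5 TB cartridges, which shows why the higher-capacity medium matters for the planned growth.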


ESRF Data Policy – user driven objectives

1. Define and collect metadata on all beamlines

2. Store metadata in hdf5 file and metadata catalogue (ICAT) and raw data in hdf5 files

3. Publish a Digital Object Identifier (DOI) per experiment (eventually per dataset) for referencing in publications

4. Archive metadata and raw data for 10 years

5. Implement access rights defined in data policy, i.e. 3 years embargo then open access

6. Provide searching, browsing, viewing and download service to metadata+data portal
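
As an illustration of objective 3, the sketch below builds a minimal per-experiment record around a DOI, reusing the high-level metadata fields named in the policy (title, authors, beamline, abstract). Everything specific here is an assumption: the prefix `10.99999` is a made-up placeholder, not ESRF's registered prefix, the suffix scheme and field names are hypothetical, and a real landing page would follow the metadata schema of the DOI registration agency.

```python
import json

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver


def experiment_record(prefix, suffix, title, authors, beamline, abstract):
    """Return a JSON-serialisable record for one experiment.

    A DOI has the form <prefix>/<suffix>; both are supplied by the caller.
    """
    doi = f"{prefix}/{suffix}"
    return {
        "doi": doi,
        "url": DOI_RESOLVER + doi,           # link to cite in publications
        "title": title,
        "authors": list(authors),
        "beamline": beamline,
        "abstract": abstract,
        "licence_after_embargo": "CC-BY",    # licence named in the data policy
    }


if __name__ == "__main__":
    record = experiment_record(
        prefix="10.99999",          # placeholder prefix (assumption)
        suffix="EXAMPLE-0001",      # hypothetical suffix scheme
        title="Example experiment",
        authors=["A. Scientist"],
        beamline="ID00",            # made-up beamline name
        abstract="Illustrative record only.",
    )
    print(json.dumps(record, indent=2))
```

One record per experiment matches the first stage of the objective; moving to one DOI per dataset would only change what the suffix identifies.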

Advantages of the ESRF data policy

Research Teams / BL Scientists / Review Panels / Community

Metadata are systematically collected

Better and continuously improved metadata

Data are managed and archived for the long term

Metadata can be searched and downloaded easily

Compliance with Data Management Plan required by H2020

Data can be referenced in publications via PIDs (DOI)

ESRF

Better data management and follow-up

Data from the ESRF can be traced and verified

Better statistics about publications using ESRF data

Conformance with the French / European / worldwide move to Open Access

Eventually will lead to ESRF data being used more

Official endorsement

Following a recommendation by the SAC, Council approved the ESRF Data Policy on 1 December 2015

To be implemented on all beamlines by 1 January 2020

Implementing the ESRF Data Policy

Twelve Herculean tasks

Implementing ESRF Data Policy - 12 tasks

1. Metadata definition + capture (ICAT)

2. Electronic log-book

3. Identity management

4. Individual accounts for users

5. ACLs and data confidentiality

6. Extend storage capacity of tape archives

7. Link metadata catalogue to tape storage

8. Authentication + authorization to access data

9. Web interface for data access, browsing + visualisation

10. Communication via presentations, web, users meeting, this meeting

11. Modify proposal submission to add “I accept the ESRF data policy”

12. DOI minting, generate landing page + linking to publications


ICAT metadata catalogue – icatproject.org

➔ Key element of the ESRF Data Policy for storing metadata and curating data

➔ Chosen after comparing with other solutions (FP7 CRISP deliverable)

➔ Developed and maintained by STFC; adopted by ISIS, DLS, ESRF, HZB, FERMI

➔ Provides:

✔ a scientific data model

✔ web services with SOAP + REST API

✔ web interface for searching and selecting data

✔ download service supporting multiple protocols

✔ dashboard of statistics and usage

✔ DOI landing page generation

✔ rules-based authorization

ICAT data model


ICAT data model → ESRF data model


Implementation – planned for 2016


Implementation – (meta)data flow

[Diagram: ICAT metadata catalogue, tape archive, data portal web interface, DOI, download of data (hdf5)]

Implementation of metadata ingestion


HZB Data Policy – adopted 14 June 2016


➔ Third photon source to adopt a data policy

➔ Based on the PanData data policy

➔ Archived for 10 years

➔ 5 year embargo period

➔ Data released under CC0 licence

http://www.helmholtz-berlin.de/pubbin/news_seite?nid=14472&sprache=en


ESRF Data Policy – Open Questions

1. Can the issues surrounding Data Ownership be clarified at an international level?

2. How to encourage EU member states to adopt EU recommendations on Open Data?

3. How to communicate and educate scientists about the move to Open Data?

4. What guidelines exist for embargo periods?

5. How to make our data repository easily findable?

6. What permanent authentication mechanism should be used, e.g. ORCID?

Conclusion

The scientific landscape is changing and the call for Open Access for publicly funded research has become stronger

ESRF has adapted to this changing landscape

A data policy offers many advantages for users

ESRF data will be traceable, verifiable and re-usable in the future

ESRF is in line with best practices for scientific data recommended by the RDA, EU, G8, research organisations, learned societies, …

Other photon sources are likely to adopt open data policies too, e.g. ELETTRA, HZB, ...

ESRF Data Policy Implementation Team

Alejandro de Maria (ISDD) – Data Manager

Bruno Lebayle (TID) – IT infrastructure

Joanne McCarthy (EXPD) – User Office

Armando Solé (ISDD) – Metadata+data

Jens Meyer (ISDD) – Beamline controls

Dominique Porte (TID) – User IDs

Rudolf Dimper (TID) – Data policy

Andy Götz (ISDD) – Implementation

DATA OWNERSHIP versus RIGHTS

Data ownership, i.e. who “owns” data?

A difficult question, which is better addressed via “rights”, where we consider the ESRF as being the “data custodian”:

1. ESRF has the right to curate data from public research experiments

2. Experimenters have the right to analyse and publish the data

3. ESRF has the right to give open access to the data after embargo

4. Users own the sample and any results they derive from the data