by Andy Götz + Rudolf Dimper + Alex de Maria
on behalf of the ESRF Data Policy Implementation Team
ESRF Data Policy
• Why an ESRF data policy?
• Changing landscape
• Current situation
• Other sites
• How did it happen
• Data policy explained
• Open questions
Page 2 ADMP (CERN) 29 June 2016 – ESRF data policy
Who are we – the European synchrotron
Page 3
• A synchrotron is an accelerator producing very intense light in the form of x-rays with special characteristics which make it very versatile e.g. intensity, wavelength, size
● ESRF, the European Synchrotron, is one of many in Europe and the world
What do we do - operate beamlines
Page 4
• Synchrotrons provide x-rays to beamlines where experiments are conducted and lots of data are produced
Why an ESRF data policy?
Page 5
• Information technology has transformed scientific investigation
• Data is the raw material of science and the main core product of research facilities
• Data needs to be properly managed to allow:
• linking to publications (increasingly requested by publishers)
• re-analysis
• verification and anti-fraud
• new research
• preservation of unique data sets
• users to comply with H2020 Open Data requirements
The challenge – many scientific domains
Page 6
• The following are the research areas into which experiment proposals are categorised:
• Hard condensed matter science
• Applied material science
• Engineering
• Chemistry
• Soft condensed matter science
• Life sciences
• Structural biology
• Medicine
• Earth science
• Environment
• Cultural heritage
• Methods and instrumentation
The challenge – diverse samples + data
Page 7
• Very many different kinds of:
● Techniques – 10s
● Experiments – 1500/yr
● Samples – 100s/day
● Data sets – 100s/day/beamline
● Scientists – 2000/yr
● Communities – 100s
● Scientific domains – 10s
● Data analysis programs – 100s
http://www.esrf.eu/news/spotlight
ESRF data policy – what made it possible
Page 8
• The following factors contributed to the data policy :
1. Evolving scientific data landscape
2. The FP7 projects PANDATA + CRISP
3. Recommendations from the EC (H2020) + IUCR
4. Similar facilities showing the way (ISIS+ILL)
5. Realisation that metadata on its own is useful
6. Choice of tape as low cost storage medium
7. Availability of ICAT metadata catalogue
8. Our motivation to improve data management
9. Support from our management + advisers + council
10. Helping our users + increase data re-use via open data
Preserving the records of science
Page 9
US National Research Council, Study: “Bits of Power, Issues in Global Access to Scientific Data”, 1997
“The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for … data derived from publicly funded research”
OECD Principles and Guidelines for Access to Research Data from Public Funding (2007):
“Sharing and open access to publicly funded research data not only helps to maximise the research potential but provides greater returns from the public investment in research”
Why an ESRF data policy?
Page 10
ESFRI Position Paper on Digital Repositories:
“Research Infrastructures should guarantee that raw research data are made available through portals and databases.”
06/09/2007 – e-IRG ESFRI
Data's shameful neglect
“Research cannot flourish if data are not preserved and made accessible. All concerned must act accordingly”
Nature 461, 145 (10 September 2009) | doi:10.1038/461145a
redefine misconduct as distorted reporting: ‘any omission or misrepresentation of the information necessary and sufficient to evaluate the validity and significance of research, at the level appropriate to the context in which the research is communicated’
Nature 494, 149 (14 February 2013) doi:10.1038/494149a
Preserving the records of science
Scientific data are more and more considered like a publication and/or part of the publication
Movement to Open Data is growing e.g. OECD, G8, RDA, …
IUCr dddwg initiative for open data
Pressure is increasing on publicly funded research institutes to follow
H2020 participation will be conditional on a data management plan
France: digital law (loi numérique) passed by the National Assembly on 26 January 2016
ESRF as the European synchrotron has to lead
Page 11
Changing landscape
Raw Data
Currently produce 2 PB of data / year
Data are stored in proprietary formats
Data sets vary from 1 to 100 000 files and from 10 MB to 100s of TB
Data are deleted from disk after 50 days and from tape after one year
No persistent identifiers
No data management plan
Strong difference between in-house research and visitor data
Metadata
Metadata not collected systematically
Experiment report public
Page 12
Where we are coming from
Neutrons
ILL – PanData-like policy for 3 years
ISIS – PanData-like policy for 3 years
Photons
ELETTRA – PanData-like policy for 1 year
HZB – PanData-like policy adopted on 14 June 2016
ALBA – PanData-like policy proposed
SLS – Currently under preparation
Other
Alfred Wegener Institut (Helmholtz) – Open Data Policy
Astronomy, Biology, CERN, … – Open Data Policies
Page 13
Data Policy at other Research Infrastructures
Based on the PaNdata Data Policy (deliverable D2.1* of PaN-data Europe FP7 project in 2011)
The policy addresses the issues of:
Data ownership
Data curation
Data archiving
Open access to data
Page 14
ESRF Data Policy
*http://wiki.pan-data.eu/imagesGHD/0/08/PaN-data-D2-1.pdf
Raw data and associated metadata
ESRF is the custodian of raw data and metadata from all beamlines (including CRGs)
ESRF will automatically collect metadata for all experiments
ESRF will store metadata in a metadata catalogue (icat)
High-level metadata (title, authors, beamline, abstract, experiment report) will be published as soon as possible
The experimental team has sole access to the data during the so-called embargo period of 3 years; a request to extend the embargo period can be made
After the embargo, ESRF will make the data “Open Access” under a CC-BY license
Users need to create an identifier to get Open Access data
Proprietary (i.e. commercial) data belong by default to the PI and are not archived unless explicitly agreed
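As an illustration of the high-level metadata named above (title, authors, beamline, abstract, experiment report), the following is a minimal Python sketch of a metadata record with a public view. The class name, field names, the `details` dictionary and the sample values are assumptions made for illustration; they are not the ICAT schema or the ESRF implementation.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

# High-level fields the policy says are published as soon as possible.
PUBLIC_FIELDS = ("title", "authors", "beamline", "abstract", "experiment_report")


@dataclass
class ExperimentMetadata:
    """Hypothetical metadata record; names are illustrative, not ICAT's schema."""
    title: str
    authors: List[str]
    beamline: str
    abstract: str
    experiment_report: str
    # Assumption: any further metadata is kept apart from the public view.
    details: Dict[str, str] = field(default_factory=dict)

    def public_view(self) -> dict:
        """Return only the high-level fields intended for early publication."""
        record = asdict(self)
        return {name: record[name] for name in PUBLIC_FIELDS}


# Placeholder values, not a real experiment.
meta = ExperimentMetadata(
    title="Example experiment",
    authors=["A. Author", "B. Author"],
    beamline="BL-EXAMPLE",
    abstract="Placeholder abstract.",
    experiment_report="Placeholder report reference.",
    details={"sample": "placeholder sample"},
)

print(json.dumps(meta.public_view(), indent=2))
```

Keeping the list of public fields in one tuple makes it easy to review which metadata leave the facility before the embargo ends.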
Page 15
ESRF Data Policy* – main elements
*http://www.esrf.fr/files/live/sites/www/files/about/organisation/ESRF%20data%20policy-web.pdf
Raw data and associated metadata …
Only keep data generated at the ESRF
Data must be in a format the ESRF can read
Metadata must be of sufficiently high quality to enable data re-use
Data must be traceable and verifiable as coming from the ESRF
ESRF data catalogue to be linked to other open data repositories
Page 16
ESRF Data Policy – main elements
Access to raw data and metadata will be via a searchable on-line catalogue (https://wwws.esrf.fr/icat/)
Access to the on-line catalogue of the ESRF will be restricted to registered users of the on-line catalogue. The ESRF will set up an on-line procedure to become a registered user of the catalogue, e.g. with an Umbrella ID
Access to proposals will only be provided to the experimental team and appropriate facility staff
Principal Investigator (PI) has the possibility to transfer parts or the totality of her/his rights during the embargo period to another registered person
PI has the right to create and distribute copies of the raw data
PI has the possibility to render data public before the end of the embargo period
Page 17
ESRF Data Policy – open data access
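The access rules on this slide, combined with the 3-year embargo, can be sketched as a small decision function. This is a hedged illustration in Python, not the ICAT rules engine: the `Dataset` class, its fields, and the choice of the collection date as the start of the embargo are assumptions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set

EMBARGO_YEARS = 3  # embargo length stated in the policy


@dataclass
class Dataset:
    """Hypothetical record for one archived experiment (not an ICAT type)."""
    collected_on: date                          # assumed start of the embargo
    team: Set[str] = field(default_factory=set) # experimental team, incl. the PI
    released_early_on: Optional[date] = None    # set if the PI makes data public
    embargo_extension_years: int = 0            # granted extension, if any


def embargo_end(ds: Dataset) -> date:
    """Date from which the data are open access."""
    target_year = ds.collected_on.year + EMBARGO_YEARS + ds.embargo_extension_years
    try:
        end = ds.collected_on.replace(year=target_year)
    except ValueError:
        # 29 February mapped onto a non-leap year
        end = ds.collected_on.replace(year=target_year, day=28)
    if ds.released_early_on is not None:
        end = min(end, ds.released_early_on)
    return end


def can_access(ds: Dataset, user: Optional[str], today: date) -> bool:
    """Registered users only; team-only access until the embargo ends."""
    if user is None:  # not a registered user of the catalogue
        return False
    if user in ds.team:
        return True
    return today >= embargo_end(ds)


ds = Dataset(collected_on=date(2016, 6, 29), team={"pi"})
ds.team.add("collaborator")  # PI transfers rights to another registered person
print(embargo_end(ds), can_access(ds, "visitor", date(2018, 1, 1)))
```

Modelling early release and extensions as fields on the record keeps the decision in a single function, which is easier to audit than rules scattered across services.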
What do we need to curate data for 10 years ?
A metadata catalogue - icat (already installed)
Good metadata on all beamlines - modify the data acquisition
Hooks in the experiments - modify macros on each beamline
A catalogue of data to curate - identify what data to register + archive
Identity management - persistent IDs
Lots of tape storage - money for tapes and manpower to install
Automatic way to restore data - manpower to implement workflow
Current production is ~2 PB / year in 2015
Assuming linear growth to 15 PB / year in 2025 - 45 PB on tape
Future tape storage technology has 88x more capacity than today
Page 18
Implications of data policy
icatproject.org
What tape technology + costs
Starting point is 2 x DLS 8400 tape libraries from Storage Tek with 2 x 64 tape drives
Situation in June 2016 - capacity of 8.5PB
Planned to increase to 75 PB in 2017
Tapes
– LTO-5 tape = 100MB/s, 1.7TB
– T10Kd T2 tape = 300MB/s, 8.5TB
Data written to tape with Time Navigator s/w
Limit of maximum number of data objects in catalogue (~ 2 Million)
Translates to 1000 data objects i.e. hdf5 files / shift / experiment (600 shifts x 30 beamlines)
Cost is ~ 100 kilo Euros / year for tapes + s/w
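The tape figures above allow some back-of-the-envelope arithmetic. The sketch below, in Python, estimates how long it takes to fill one tape at the quoted rate and how many cartridges a given archive size would need. It assumes decimal units (1 TB = 10^12 bytes), a sustained transfer at the quoted rate, no compression, and an archive held on a single tape type; none of these assumptions come from the slides.

```python
import math

MB = 10**6
TB = 10**12
PB = 10**15

# Figures quoted on the slide: transfer rate in MB/s, capacity in TB.
TAPES = {
    "LTO-5": {"rate_mb_s": 100, "capacity_tb": 1.7},
    "T10Kd T2": {"rate_mb_s": 300, "capacity_tb": 8.5},
}


def hours_to_fill(capacity_tb: float, rate_mb_s: float) -> float:
    """Hours needed to write one full tape at a sustained rate."""
    seconds = (capacity_tb * TB) / (rate_mb_s * MB)
    return seconds / 3600


def tapes_needed(archive_pb: float, capacity_tb: float) -> int:
    """Number of cartridges needed to hold an archive of the given size."""
    return math.ceil((archive_pb * PB) / (capacity_tb * TB))


for name, spec in TAPES.items():
    hours = hours_to_fill(spec["capacity_tb"], spec["rate_mb_s"])
    print(f"{name}: {hours:.1f} h to fill one tape")

# Archive sizes quoted on the slide: 8.5 PB (June 2016) and 75 PB (planned).
for size_pb in (8.5, 75):
    count = tapes_needed(size_pb, TAPES["T10Kd T2"]["capacity_tb"])
    print(f"{size_pb} PB on T10Kd T2 tapes: {count} cartridges")
```

Under these assumptions a full LTO-5 tape takes roughly 4.7 hours to write and a T10Kd T2 tape roughly 7.9 hours, and 8.5 PB corresponds to 1000 cartridges of 8.5 TB each.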
Page 19
Data archiving on tapes – most cost effective
ESRF Data Policy – user driven objectives
Page 20
1. Define and collect metadata on all beamlines
2. Store metadata in hdf5 file and metadata catalogue (icat) and raw data in hdf5 files
3. Publish a Digital Object Identifier (DOI) per experiment (eventually per dataset) for referencing in publications
4. Archive metadata and raw data for 10 years
5. Implement access rights defined in data policy i.e. 3 years embargo then open access
6. Provide searching, browsing, viewing and download service to metadata+data portal
Research Teams / BL Scientists / Review Panels / Community
Metadata are systematically collected
Better and continuously improved metadata
Data are managed and archived for long term
Metadata can be searched and downloaded easily
Compliance with Data Management Plan required by H2020
Data can be referenced in publications via PIDs (doi)
ESRF
Better data management and follow-up
Data from the ESRF can be traced and verified
Better statistics about publications using ESRF data
Conformance with France/European/World wide move to Open Access
Eventually will lead to ESRF data being used more
Page 21
Advantages of the ESRF data policy
Following a recommendation by the SAC, Council approved the ESRF Data Policy on 1 December 2015
To be implemented on all beamlines by 1 January 2020
Page 22
Official endorsement
Twelve Herculean tasks
Page 23
Implementing the ESRF Data Policy
1. Metadata definition + capture (ICAT)
2. Electronic log-book
3. Identity management
4. Individual accounts for users
5. ACLs and data confidentiality
6. Extend storage capacity of tape archives
7. Link metadata catalogue to tape storage
8. Authentication + authorization to access data
9. Web interface for data access, browsing + visualisation
10. Communication via presentations, web, users meeting, this meeting
11. Modify proposal submission to add “I accept the ESRF data policy”
12. DOI minting, generate landing page + linking to publications
Page 24
Implementing ESRF Data Policy - 12 tasks
ICAT metadata catalogue – icatproject.org
Page 25
➔Key element of the ESRF Data Policy for storing metadata and curating data
➔Chosen after comparing with other solutions (FP7 CRISP deliverable)
➔ is developed and maintained by STFC, adopted by ISIS, DLS, ESRF, HZB, FERMI
➔ provides
✔ a scientific data model,
✔ web services with SOAP + REST api
✔ web interface for searching and selecting data,
✔ download service supporting multiple protocols
✔ dashboard of statistics and usage
✔ DOI landing page generation
✔ rules based authorization
ICAT data model
Page 26
ICAT data model → ESRF data model
Page 27
Implementation – planned for 2016
Page 28
Implementation – (meta)data flow
Page 29
[Figure: (meta)data flow linking the icat metadata catalogue, the tape archive, the data portal web interface, DOI, and data download (hdf5)]
Implementation of metadata ingestion
Page 30
HZB Data Policy – adopted 14 June 2016
Page 31
➔ Third photon source to adopt a data policy
➔ Based on the PanData data policy
➔ Archived for 10 years
➔ 5 year embargo period
➔ Data released under CC0 licence
http://www.helmholtz-berlin.de/pubbin/news_seite?nid=14472&sprache=en
ESRF Data Policy – Open Questions
Page 32
1. Can the issues surrounding Data Ownership be clarified at an international level?
2. How to encourage EU member states to adopt EU recommendations on Open Data?
3. How to communicate and educate scientists about the move to Open Data?
4. What guidelines exist for embargo periods?
5. How to make our data repository easily findable?
6. What permanent authentication mechanism should be used e.g. Orcid?
The scientific landscape is changing and the call for Open Access for publicly funded research has become stronger
ESRF has adapted to this changing landscape
A data policy offers many advantages for Users
ESRF data will be traceable, verifiable and re-useable in the future
ESRF is in line with best practices for scientific data recommended by the RDA, EU, G8, research organisations, learned societies, …
Other photon sources are likely to adopt open data policies too e.g. ELETTRA, HZB, ...
Page 33
Conclusion
Alejandro de Maria (ISDD) – Data Manager
Bruno Lebayle (TID) – IT infrastructure
Joanne McCarthy (EXPD) – User Office
Armando Solé (ISDD) – Metadata+data
Jens Meyer (ISDD) – Beamline controls
Dominique Porte (TID) – User ID's
Rudolf Dimper (TID) – Data policy
Andy Götz (ISDD) – Implementation
ESRF Data Policy Implementation Team
Data ownership, i.e. who “owns” the data?
A difficult question, better addressed via “rights”, where we consider the ESRF to be the “data custodian”:
1. ESRF has the right to curate data from public research experiments
2. Experimenters have the right to analyse and publish the data
3. ESRF has the right to give open access to the data after embargo
4. Users own the sample and any results they derive from the data
Page 35
DATA OWNERSHIP versus RIGHTS