The Geoscience Data Journal: collaboration between data repositories and publishing houses in data publishing

Fiona Murphy (1), Sarah Callaghan (2), Paul Hardaker (3), Rob Allan (4)
(1) Wiley-Blackwell, Earth Science Journals, Chichester, UK ([email protected]); (2) British Atmospheric Data Centre, STFC, UK ([email protected]); (3) Royal Meteorological Society, Reading, UK ([email protected]); (4) Met Office, Hadley Centre, Exeter, UK ([email protected])

Is the current situation really so bad?

Research budgets are under strain as never before, whilst at the same time researchers are able to produce data at an unprecedented rate. This gives rise to difficult questions about what to produce and what to throw away, and about formatting, interoperability, long-term curation, security, and a host of other key issues. In short, how can we be aware of, locate, use and register what we know, and thereby ensure its most effective use and re-use? And how do we fit all of this into a sustainable research ecosystem?

In recognition of this situation, and in response to the general economic squeeze on science budgets, funding bodies (for example the National Science Foundation and the UK's Natural Environment Research Council) are increasingly trying to encourage good data management practices by requiring data management plans (DMPs). At present, DMPs have not yet been fully integrated into the research lifecycle, at least as far as many researchers are concerned. Researchers may view DMPs with suspicion, seeing them as a potential waste of effort, or as a way of compelling them to share research outputs before they have been able to fully exploit those outputs themselves. In short, the benefits of good data management have not been mapped out to everyone's satisfaction. To its credit, the NSF in particular is engaging in dialogue with other stakeholders on this issue, through outreach at relevant scientific events, an interactive website and so forth.

However, this is not the only problem faced by the current research ecosystem. Crucially, technical advances have generally not been matched by the funding structures and research infrastructures needed to exploit them fully. As yet, best practice in data output and management does not attract the kind of scientific reward needed to encourage compliance. Other key barriers to progress are silos of geography (where people are based, or where data are collected) and of discipline (hydrologists versus geologists versus climatologists). The status of data science itself, and of data scientists, also needs re-examining.

Figures adapted from ‘Opportunities for Data Exchange’ Project, Alliance for Permanent Access http://www.alliancepermanentaccess.org/

Benefits to the community

It is becoming increasingly important to the scientific and wider non-academic communities that the data underpinning key scientific results be made available, to allow those results to be tested and confirmed. Historically, publishing data has been so difficult as to be prohibitive, and in those cases where it has been possible, the raw data has had to be converted to other forms: for example, instead of raw numbers being published in a (lengthy) table, they have been converted to a graph.

As scientists’ ability to create and collect new data has been growing, so too has our ability to store it. A dataset can be stored on any digital medium that is convenient, but future-proofing the data so that it is readable and understandable in 20 years’ time remains a time-consuming and difficult job. Yet, if the results drawn from that data are to stand up to scrutiny in the future, the data must be curated and archived properly.

Openly sharing data is often proposed as a way of ensuring that the data underpinning the scientific record is kept. The problem is that sharing data in an unstructured way often results in the provenance of the dataset (and often the dataset itself) being changed as it passes from one “owner” to another, reducing any chance of using that data to test the reproducibility of the results originally drawn from it.
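A simple technical safeguard against this silent drift is for the archive to record a fixity checksum and a minimal provenance record at deposit time, so that any later copy of the dataset can be verified against the published original. The Python sketch below is purely illustrative: the function names, record fields and file name are our own inventions, not part of any repository's actual interface.

    import hashlib
    import json
    from datetime import datetime, timezone

    def fixity_checksum(path, algorithm="sha256"):
        """Compute a checksum of the dataset file, reading in chunks for large files."""
        digest = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def provenance_record(path, creator, repository):
        """Record who deposited what, where and when, alongside the fixity checksum."""
        return {
            "file": path,  # hypothetical file name
            "creator": creator,
            "repository": repository,
            "deposited": datetime.now(timezone.utc).isoformat(),
            "sha256": fixity_checksum(path),
        }

    # Create a small stand-in file so the example runs end to end.
    with open("observations.nc", "wb") as f:
        f.write(b"example dataset contents\n")

    # Anyone receiving a copy can recompute the checksum; a mismatch shows that
    # the data have changed somewhere along the chain of "owners".
    record = provenance_record("observations.nc", "S. Callaghan", "BADC")
    print(json.dumps(record, indent=2))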

Also, the present mechanism for academic recognition revolves around the production and publication of peer-reviewed papers. The production of high-quality datasets takes time and effort, and is often insufficiently recognised as an activity worthy of prestige, even though the papers that result from a dataset may be considered of significant scientific importance. Simply sharing data is unlikely to give data creators the academic recognition they deserve. A process of data publication, involving peer review of datasets, would benefit many sectors of the academic community.

Aims of the project

The aim of the Geoscience Data Journal is to provide a platform where scientific data can be formally published, in a way that includes scientific peer-review. This will provide the dataset creator with full credit for their efforts, while also improving the scientific record, and allowing major datasets to be fully described, cited and discovered.
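By way of illustration, a dataset published in this way could then be cited in the reference list of a conventional paper using the form recommended by DataCite (the names, year and DOI suffix below are invented placeholders):

    Smith, J.; Jones, A. (2012): Radiosonde humidity profiles, 1990-2010.
    British Atmospheric Data Centre. doi:10.5285/xxxx-xxxx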

The Workflow of Data Publication

How do we coordinate the peer review of data and data papers?
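One plausible answer, sketched below in Python purely as a thought experiment, is to run the two reviews in parallel and gate publication on both: the repository performs the technical review of the dataset, the journal performs the scientific peer review of the data paper, and the accepted paper then cites the dataset's newly minted DOI. The class and function names here are our own assumptions about the workflow, not a published specification.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Dataset:
        title: str
        metadata_complete: bool   # checked by the data centre
        formats_approved: bool    # e.g. long-lived, well-documented file formats

    @dataclass
    class DataPaper:
        title: str
        scientifically_sound: bool  # outcome of conventional peer review

    def repository_review(ds: Dataset) -> bool:
        """Technical review at the data centre: metadata, formats, curation readiness."""
        return ds.metadata_complete and ds.formats_approved

    def journal_review(paper: DataPaper) -> bool:
        """Scientific peer review of the data paper at the journal."""
        return paper.scientifically_sound

    def publish(ds: Dataset, paper: DataPaper) -> Optional[str]:
        """Publication is gated on both reviews; the paper then cites the dataset DOI."""
        if repository_review(ds) and journal_review(paper):
            # Placeholder DOI: a real repository would mint this through DataCite.
            return "doi:10.5285/xxxx-xxxx"
        return None  # one of the reviews failed: revise and resubmit

    print(publish(Dataset("Humidity profiles", True, True),
                  DataPaper("A 20-year humidity record", True)))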

Pressure points

• Workflow, landing pages, and communication and education among authors, editors, publishers and data centres (a sketch of minimal landing-page content follows below).
• Processes shown in orange boxes are the subject of continual, iterative development, driven by stakeholders and incorporating best practices as they emerge.
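To make the landing-page pressure point concrete: when a reader resolves a dataset DOI, it should land on a page carrying enough information to understand and obtain the data. The Python dictionary below is our suggested minimum, with every value a made-up placeholder; it is not a requirement of the journal or of any repository.

    # Suggested minimum content for a dataset landing page (all values are
    # placeholders invented for illustration):
    landing_page = {
        "title": "Radiosonde humidity profiles, 1990-2010",
        "creators": ["Smith, J.", "Jones, A."],
        "publisher": "British Atmospheric Data Centre",
        "publication_year": 2012,
        "identifier": "doi:10.5285/xxxx-xxxx",
        "abstract": "Quality-controlled humidity profiles from the UK network.",
        "access_conditions": "Open, after registration",
        "data_link": "https://example.org/data/xxxx-xxxx",
    }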

Incentives

• Publishing a dataset in a data journal will provide academic credit to data scientists, without diverting effort from their primary work of ensuring data quality.
• Funders want to get the best possible science for their money. Running measurement campaigns is expensive, so the more reuse that can be derived from a dataset, the better. Publication in a data journal ensures that the dataset is uploaded to a trusted repository, where it will be backed up, archived and curated, and so will not be vulnerable to bit rot or to being lost or stored on obsolete media. The peer-review process also reassures the funder that the published dataset is of good quality and that the experiment was carried out appropriately.
• Data journals will be a good starting point for researchers outside the immediate field to find out what data are available and how to access them. This will encourage interdisciplinary collaboration and open up the user base not only for the datasets, but also for the data journal and the underlying repositories.
• The availability of published datasets will make it easier to validate conclusions through the reanalysis of those datasets.
• Data publication will help make the scientific process more transparent, improving public accountability.

• Opportunities exist to form partnerships with other organisations sharing the goal of data publication, to exploit common activities and achieve wider community buy-in: for example, the CODATA-ICSTI Task Group on Data Citation Standards and Practices, DataCite and others.

“the amount of data generated worldwide...is growing by 58% per year; in 2010 the world generated 1250 billion gigabytes of data”

The Digital Universe Decade – Are You Ready? IDC White Paper, May 2010
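Taking the quoted figures at face value, 1250 billion gigabytes is $1.25 \times 10^{21}$ bytes, i.e. 1.25 zettabytes, and a 58% annual growth rate implies a doubling time of

    $t_{\text{double}} = \dfrac{\ln 2}{\ln 1.58} \approx 1.5 \text{ years},$

so the world's data volume roughly doubles every eighteen months (our arithmetic, applied to the quoted figures).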

The current situation can be summarised as this:

Which we want to change into this:

It is for these reasons that a partnership has been formed between the British Atmospheric Data Centre, the Royal Meteorological Society and the academic publisher Wiley-Blackwell, to develop a mechanism for the formal publication of data in the (soon to be launched) Geoscience Data Journal.

This journal builds on work funded by JISC through the OJIMS (Overlay Journal Infrastructure for Meteorological Sciences) project, and parallels work done by the NERC Science Information Strategy Data Citation and Publication project team, which brings together all the NERC environmental data centres.