Publishing Data Workflows
RDA Plenary 5 -- March 11, 2015
Session Chairs: Amy Nurnberger and Mary Vardigan
Please sign in: http://bit.ly/1Hju0LM
Agenda
• Introduction:
  • Objectives
  • Progress so far
  • Workflow Examples
  • Get involved
• Dataverse workflow presentation
• SoftwareX workflow presentation
• Use case development
Group notes document: http://bit.ly/1MlXysR
The working group members (currently)
• Theodora Bloom (BMJ) [CO-CHAIR]
• Sünje Dallmeier-Tiessen (Switzerland, CERN) [CO-CHAIR]
• Elizabeth Newbold (BL) [CO-CHAIR]
• Merce Crosas (US, Harvard University)
• Michael Diepenbroek (PANGAEA)
• Kim Finney (Australia, AADC)
• John Helly (US, UCSD)
• Brian Hole (Ubiquity Press, UK)
• Varsha Khodiyar (Nature Scientific Data)
• Hylke Koers (The Netherlands, Elsevier)
• Rebecca Lawrence (UK, F1000 Research Ltd.)
• Fiona Murphy (UK, Wiley-Blackwell)
• Amy Nurnberger (US, Columbia University Libraries)
• Lisa Raymond (US, Library Woods Hole Oceanographic Institution)
• Johanna Schwarz (Germany, Springer)
• Jonathan Tedds (UK, University of Leicester)
• Mary Vardigan (US, ICPSR)
• Ruth Wilson (UK, Nature)
• Eva Zanzerkia (US, NSF)
• Angus Whyte (UK, DCC)
• And growing…
Others are very welcome ☺
Background and Motivation
• Only a small fraction of research data is preserved and shared, often with a bare minimum of metadata
• Often due to the lack of “established” or “trusted” services and workflows
But there are established or emerging workflows!
• Usually in selected disciplines, e.g., Earth Sciences
• Some provide credit via citation mechanisms
Objectives
• Provide an analysis of a representative range of existing and emerging workflows and standards for data publishing
  • Including deposit and citation
• Provide reference models, a “classification”
• Test implementations of key components for application in new workflows
• Illustrate the benefits of the reference models for researchers and organisations
Relevance
• Information about workflows crucial for researchers and other stakeholders to understand the options available to practice open science
• Helps to illustrate different possibilities for data sharing, leading to more efficient and reliable reuse of research data
• Shows those involved in research data where they fit in the overall scheme of things
More detailed work programme
• Identification of a smaller set of reference models covering a range of such workflows to include:
  • For example, when and where QA/QC and data peer-review fit into the publishing process
  • Who does what and when…
  • Automated vs. “manual” processes
• Selection of key use cases and organizations in which components of a reference model can be implemented and tested for suitability
  • For example: dedicated data peer review
  • For example: metadata checks
First results of workflow analysis
http://tinyurl.com/mvtbrek
Workflows in the current list
- STFC Data centre
- NSIDC Data centre
- ENVRI reference model
- OJS/ Dataverse
- INSPIRE Digital library
- NPG (PubChem & Scientific Data) Publisher
- UK Data Archive/Service
- PREPARDE (NCAR CISL)
- Ocean Data Publication Cookbook (UNESCO IOC)
- PURR Institutional repository
- ICPSR
- Edinburgh Datashare
- F1000 Research
- Ubiquity Press: Open Health Data Journal+...
- PANGAEA - Data Publisher for Earth and Environmental Sciences
- WDC Climate - Data Publisher for Climate Sciences
- CMIP / IPCC DDC - International project series in Climate Sciences
- GigaScience
- Dryad digital repository with integrated journals workflow
- Stanford Digital Repository
- Academic Commons: Columbia University Institutional Research Repository
- Elsevier: Data in Brief
- Integrated data publishing solution at Elsevier [through “traditional” journals]
Categories we are looking at
• Discipline
• Function of workflow
• PID assignment to dataset
• PID type -- e.g., DOI, ARK, etc.
• Peer review of data (e.g., by researcher & editorial review)
• Curatorial review of metadata (e.g., by institutional or subject repository?)
• Technical review & checks (e.g., for data integrity at repository/data centre on ingest)
• Discoverability: Indexing of the data -- if yes, where?
• Formats covered
• Persons/Roles involved, e.g., editor, publisher, data repository manager, etc.
• Link to data paper or “standalone” data
• Links to grants, usage of author PIDs
• Data citation facilitated
• Data life cycle referred to
• Standards compliance
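One way to make these categories concrete is to treat each analysed workflow as a record with one field per category. The sketch below is only an illustration: the class name `WorkflowRecord`, its field names, and the sample values are hypothetical and are not taken from the working group's analysis.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkflowRecord:
    """One row of a workflow analysis; all field names are illustrative assumptions."""
    name: str
    discipline: str
    function: str
    pid_type: Optional[str] = None           # e.g. "DOI" or "ARK"; None if no PID is assigned
    peer_review_of_data: bool = False
    curatorial_metadata_review: bool = False
    technical_checks: bool = False
    indexed_in: List[str] = field(default_factory=list)
    roles: List[str] = field(default_factory=list)
    data_citation_facilitated: bool = False

    def assigns_pid(self) -> bool:
        """True when some persistent identifier type is recorded."""
        return self.pid_type is not None

    def review_layers(self) -> int:
        """Count how many of the three review categories apply."""
        return sum([
            self.peer_review_of_data,
            self.curatorial_metadata_review,
            self.technical_checks,
        ])


# A made-up entry, not one of the workflows in the list above.
example = WorkflowRecord(
    name="Example repository (hypothetical)",
    discipline="Earth Sciences",
    function="data repository",
    pid_type="DOI",
    curatorial_metadata_review=True,
    technical_checks=True,
    roles=["author", "data repository manager"],
)
print(example.assigns_pid(), example.review_layers())
```

Keeping the review categories as separate booleans makes it easy to compare, for instance, discipline-specific and institutional repositories by how many review layers each applies.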
Observations
• The researcher/author generally initiates the workflow
• Discipline-specific repositories have the most rigorous ingest and review processes -- more general institutional repositories have a lighter touch
• Journals vs. repositories: For the former, any peer review is conducted externally, for many of the latter it is internal
Repository view
[Figure: Simplified generic repository workflow. Labels shown: Producer; Data Deposit; Ingest; Quality Assurance; Data Management / LT Archiving; Dissemination / Access; Consumer/Reuse]
Researcher with a central role: submission/deposition
Review/QA mainly internal
[Figure, designed by M. Stockhause: two-stage repository workflow. Labels shown: Producer; Data Deposit; Ingest; Quality Assurance (Light); Data Management / LT Archiving; Dissemination / Access; Consumer (disciplinary); Ingest; Quality Assurance (Detailed); Dissemination / Access; Consumer (interdisciplinary)]
Project Repositories:
• Data are published in a federated data infrastructure
• Data are added and corrected
• Poor documentation
• Usually no data backup
• Light-weight quality assurance against intl. and project standards
• Tendency that the project data never become stable
• Currently no PIDs assigned or reserved but Handles planned
Long-term Archive:
• Data are archived for the long term at a single location
• Data are stable and curated
• Detailed documentation
• Data backup/redundancy
• Quality assurance process is more detailed and includes a review
• Data is a “snapshot” of the project data at a certain time
• DOIs assigned to data collections
Lessons Learnt and questions
• Very diverse landscape
• Discipline-specific and cross-discipline actions
• Quality assurance a big topic in discipline-specific repositories
• Widespread persistent identification
• Data citation awareness
• Challenge: Bidirectional data-publication linking
• Challenge: Versioning
Publisher’s perspective
[Figure: Simplified generic publisher workflow. Labels shown: Producer; Article preparation; Data Submission; Article submission; Peer Review Process; Editing; Publishing; Consumer/Reuse]
Researcher takes over several roles: submitter, reviewer, editor potentially?
Who takes on which role and responsibility?
- Article/data container
- Separate article and datasets
Example: Dryad repository integrated with journals
Lessons learnt and questions
• Recommended repositories for collaboration? Who decides/how?
• External review
  • Open, plus invitation
  • Closed, upon invitation
  • Blind
• Emerging data and software journal landscape: no information yet on uptake
Current and future work
How to get involved
• Contribute to the workflow analysis: http://bit.ly/1BBQQPW
• Contribute your own workflow “walk-throughs” and use cases
• Tell us what is needed for a “successful” workflow in your institute/discipline
… Moving to implementation
• Tell us if you are interested in learning from a specific example or are considering implementing data publishing workflows
• Tell us if you have code/documentation to share
Break for presentations
Dataverse: Eleni Castro
SoftwareX: Hylke Koers
DATA PUBLISHING WORKFLOWS WITH DATAVERSE
Eleni Castro ([email protected])
Institute for Quantitative Social Science (IQSS)
Harvard University
RDA 5th Plenary
WG RDA/WDS Publishing Data Workflows
March 11, 2015
An Integrated & Automated Journal / Data Publishing Workflow
[Figure labels: Journal; Repository]
Current Workflows in Dataverse: To Connect Data to Journals
A. Journals include Dataverse as a Recommended Repository
B. Authors Contribute Directly to a Journal’s Dataverse
C. Automated Integration of Journal + Dataverse (e.g., OJS)
Example of Option C: Phase 1 OJS / Dataverse Integration
Project Details: 2012-2014
Integrating Open Journal Systems (OJS) with Dataverse
Reference Implementation: Automated via SWORD API
Pilot with ~50 journals + expand to 1000s using OJS.
Dataverse plugin is automatically available w/ OJS.
Future: Embed Dataverse widgets into journal article.
http://projects.iq.harvard.edu/ojs-dvn
In the Backend: Technical Workflow
Client sends:
XML file: AtomPub “entry” with Dublin Core Terms (e.g., title, creator, isReferencedBy (article citation), …)
Zip file: All data files associated with that dataset.
Repository sends:
XML file: “Deposit Receipt” that sends the data citation from the repository to the client.
Plus updates from client to server during lifecycle (CRUD): In review, reject (delete), publish first version, update new versions.
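The client-side payload described above can be sketched in a few lines. This is a minimal illustration, not the OJS plugin's actual code: it only builds the Atom entry with Dublin Core Terms and an in-memory zip of the data files, and it makes no HTTP request. The function names `build_atom_entry` and `bundle_files`, the sample values, and the choice of the `atom` and `dcterms` prefixes are assumptions made for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict, List

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Use readable prefixes when the tree is serialised.
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("dcterms", DCTERMS_NS)


def build_atom_entry(title: str, creators: List[str], article_citation: str) -> str:
    """Return an Atom entry carrying the Dublin Core Terms named on the slide."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}title").text = title
    for creator in creators:
        ET.SubElement(entry, f"{{{DCTERMS_NS}}}creator").text = creator
    # isReferencedBy holds the citation of the article that uses the data.
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}isReferencedBy").text = article_citation
    return ET.tostring(entry, encoding="unicode")


def bundle_files(files: Dict[str, bytes]) -> bytes:
    """Pack all data files of one dataset into a single zip archive (in memory)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Hypothetical sample values, for illustration only.
entry_xml = build_atom_entry(
    "Example dataset",
    ["Doe, Jane"],
    "Doe, J. (2015). Example article. Example Journal.",
)
payload = bundle_files({"data.csv": b"a,b\n1,2\n"})
print(entry_xml)
```

In a real deposit the entry and the zip would be sent to the repository's SWORD endpoint, and the returned deposit receipt would be parsed for the data citation; those steps depend on the server and are left out here.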
On the Frontend: OJS Dataverse Plugin Walkthrough
Journal Manager Sets Up Plugin in OJS
Journal Manager Sets Up Data Policies
Read full Data Policies / Guidelines Template: http://bit.ly/1xkLjoZ
Including Guidelines for:
1) Authors (data citation)
2) Reviewers
3) Copyeditors
Author Submits Manuscript + Data (1)
Author Submits Manuscript + Data (2)
Option to: (a) deposit into Dataverse OR; (b) if data is already in a repository can include the data citation (w/ persistent URL/identifier).
To-Do: Support for adding multiple datasets to a journal article.
Editor Reviews Article + Data
Approved = Data Published in Dataverse
When issue is published:
1) URL to Article displays in Dataverse.
2) Data Citation shows up in OJS Article (see next slide).
Article in OJS: Published w/ Data Citation
Phase 2: Expansion of API + Workflows
2015-2016 (collaboration w/ Odum Institute)
Project Goals
1. Expand to more journals, publishing systems, & workflows
2. Develop Community-Based Repository API Standard: Work w/ RDA, WDS, Data FAIRport, FORCE11, CODATA, etc…
Project Questions
• Should we extend the Repository API beyond SWORD?
• Support for additional Metadata Schemas & fields (non-DC)?
• Support for more/which dataset review workflows?
How Do I Get Involved?
Sign up to Contribute: Repositories Workshop + Dataverse Community Meeting, June 9-11, 2015 @ Harvard: http://bit.ly/1A51atJ
Find Out More:
* Visit our Collaborations page: http://bit.ly/1Bg2nkw
* Dataverse Project Site: http://dataverse.org
Contact Project Coordinator: Eleni Castro ([email protected])
Hylke Koers, Head of Content Innovation, Elsevier
RDA Plenary 5, San Diego
SoftwareX – a home for research software
| 42Open Access
Software (like data) is high-value but hard to access
Researcher survey, 3824 respondents (Publishing Research Consortium, 2010)
[Figure: chart with axes “Importance of access” and “Ease of access”; labelled regions: “High value & easy access” and “High value & difficult to access”]
Why SoftwareX?
• Many scholars develop software, but the current paper-based system does not capture this “born digital” research output systematically
• Users (readers) can’t find this valuable content
• Developers (authors) can’t claim credit
• Software is a research method in its own right and deserves to receive full academic recognition
SoftwareX: a home for research software
SoftwareX aims to acknowledge the impact of software on today's research practice, and on new scientific discoveries in almost all research domains. SoftwareX also aims to stress the importance of the software developers who are, in part, responsible for this impact.
To this end, SoftwareX aims to support publication of research software in such a way that:
• The software is provided with a peer-reviewed recognition of scientific impact;
• The software developers are given the academic credit they deserve;
• The software is citable, allowing traditional metrics of scientific excellence to apply;
• The academic career paths of software developers are supported rather than hindered;
• The software is publicly available for inspection, validation, and re-use.
Above all, SoftwareX aims to inform researchers about software applications, tools and libraries with a (proven) potential to impact the process of scientific discovery in various domains
From “Aims & Scope”, see http://www.journals.elsevier.com/softwarex
SoftwareX: a home for research software
• Publishing “Original Software Publications”:
  - The software and code can include post publication updates
  - Metadata is systematically captured
• Article is Open Access under CC-BY license
• All software and code published is, and will remain, fully owned by their developers.
• Peer-reviewed; dedicated software Editors & Reviewers
• Multi-disciplinary
• Submission in 3 easy steps
• GitHub repository to store and expose all software and code
• Launched at FORCE15
See http://www.journals.elsevier.com/softwarex/news/you-can-now-submit-your-software-to-softwarex/
How does it work?
How to submit your software to SoftwareX in 3 easy steps:
1. Select a repository for your software or pack your software into a zip file or archive. Remember to make your software public so that the reviewers and readers can find it.
2. Download the template for the OSP manuscript, and write your article describing your software following this template.
3. Submit your OSP manuscript via the SoftwareX submission site.
After review and acceptance, software and/or code will be copied to the journal archive on GitHub and integrated with the online version of your Original Software Publication available on ScienceDirect.
See http://www.journals.elsevier.com/softwarex
Template contains structured metadata
Nr | Code metadata description | Please fill in this column
C1 | Current code version | For example v42
C2 | Permanent link to code/repository used of this code version | For example: https://github.com/mozart/mozart2
C3 | Legal Code License | List one of the approved licenses
C4 | Code versioning system used | For example svn, git, mercurial, etc. put none if none
C5 | Software code languages, tools, and services used | For example C++, python, r, MPI, OpenCL, etc.
C6 | Compilation requirements, operating environments & dependencies |
C7 | If available Link to developer documentation/manual | For example: http://mozart.github.io/documentation/
C8 | Support email for questions |
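Because the template metadata is structured, a submission could be checked mechanically before it is sent. The sketch below assumes that every row except C7 (marked "If available") is mandatory; that rule, the function name `missing_fields`, and the placeholder values for C6 and C8 are assumptions for illustration and are not stated in the template.

```python
from typing import Dict, List

# Row labels copied from the template; which rows are mandatory is an assumption.
CODE_METADATA_FIELDS: Dict[str, str] = {
    "C1": "Current code version",
    "C2": "Permanent link to code/repository used of this code version",
    "C3": "Legal Code License",
    "C4": "Code versioning system used",
    "C5": "Software code languages, tools, and services used",
    "C6": "Compilation requirements, operating environments & dependencies",
    "C7": "Link to developer documentation/manual",
    "C8": "Support email for questions",
}
OPTIONAL_FIELDS = {"C7"}  # the template marks this row "If available"


def missing_fields(metadata: Dict[str, str]) -> List[str]:
    """Return the row numbers that are treated as required but are empty or absent."""
    return [
        key
        for key in CODE_METADATA_FIELDS
        if key not in OPTIONAL_FIELDS and not metadata.get(key, "").strip()
    ]


# Values for C1 and C2 reuse the template's own examples; the rest are placeholders.
submission = {
    "C1": "v42",
    "C2": "https://github.com/mozart/mozart2",
    "C3": "MIT",
    "C4": "git",
    "C5": "C++, python",
    "C6": "none",
    "C8": "support@example.org",
}
print(missing_fields(submission))
```

The same pattern would apply to the executable-software rows (S1 to S7) with a second field table.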
Template contains structured metadata
Nr | (Executable) software metadata description | Please fill in this column
S1 | Current software version | For example 1.1, 2.4 etc.
S2 | Permanent link to executables of this version | For example: https://github.com/combogenomics/DuctApe/releases/tag/DuctApe-0.16.4
S3 | Legal Software License | List one of the approved licenses
S4 | Computing platforms/Operating Systems | For example Android, BSD, iOS, Linux, OS X, Microsoft Windows, Unix-like, IBM z/OS, distributed/web based etc.
S5 | Installation requirements & dependencies |
S6 | If available, link to user manual - if formally published include a reference to the publication in the reference list | For example: http://mozart.github.io/documentation/
S7 | Support email for questions |
Flexible range of open-source licenses for computer code
• Apache License, 2.0 (Apache-2.0)
• BSD 3-Clause "New" or "Revised" license (BSD-3-Clause)
• BSD 2-Clause "Simplified" or "FreeBSD" license (BSD-2-Clause)
• GNU General Public License (GPL)
• GNU Library or "Lesser" General Public License (LGPL)
• MIT license (MIT)
• Mozilla Public License 2.0 (MPL-2.0)
• Common Development and Distribution License (CDDL-1.0)
• Eclipse Public License (EPL-1.0)
• Creative Commons Zero (CC0)
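A license entry in the metadata template (rows C3 and S3) could be checked against this list. The sketch below is an assumption about how such a check might look, not a description of the journal's submission system: the short identifiers are the ones given in parentheses above, and the function name `normalise_license` is hypothetical.

```python
from typing import Dict, Optional

# Short identifiers as given in parentheses on the slide, mapped to the full names.
APPROVED_LICENSES: Dict[str, str] = {
    "Apache-2.0": "Apache License, 2.0",
    "BSD-3-Clause": 'BSD 3-Clause "New" or "Revised" license',
    "BSD-2-Clause": 'BSD 2-Clause "Simplified" or "FreeBSD" license',
    "GPL": "GNU General Public License",
    "LGPL": 'GNU Library or "Lesser" General Public License',
    "MIT": "MIT license",
    "MPL-2.0": "Mozilla Public License 2.0",
    "CDDL-1.0": "Common Development and Distribution License",
    "EPL-1.0": "Eclipse Public License",
    "CC0": "Creative Commons Zero",
}

# Case-insensitive lookup table from lowercased identifier to canonical identifier.
_LOOKUP = {identifier.lower(): identifier for identifier in APPROVED_LICENSES}


def normalise_license(identifier: str) -> Optional[str]:
    """Return the canonical identifier if the license is on the list, else None."""
    return _LOOKUP.get(identifier.strip().lower())


print(normalise_license(" mit "))
print(normalise_license("Unlicense"))
```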
And now.. The moment you have all been waiting for…
A workflow diagram
[Figure labels:
• Researcher has code and paper
• Submits to journal as OSP + code (supp. mat.)
• Editorial + peer-review process
• Code made available on journal GitHub instance
• Bi-directional links
• OSP published on ScienceDirect]
A workflow diagram
[Figure labels:
• Code deposited to (or built on) code repository
• OSP submitted to journal
• OSP linked with code
• Editorial + peer-review process
• Code made available on journal GitHub instance
• Bi-directional links
• OSP published on ScienceDirect]
Thank you!
Any questions?
Discussion
Use case development
Developing use cases for workflows
● The tools
  ○ Part A: http://goo.gl/forms/Wkc7KyxvX5
  ○ Part B: http://goo.gl/forms/ZFRrzG6krX
● The process
  ○ Walk through the tools
  ○ Form up in groups
  ○ Generate use cases
The tools: Part A http://goo.gl/forms/Wkc7KyxvX5
Thank you! You have completed Part A of this use case. For the next part, you will be completing multiples of a form, to address each individual actor listed in this use case. Click this to get to Part B: http://goo.gl/forms/ZFRrzG6krX
The tools: Part B http://goo.gl/forms/ZFRrzG6krX
Group up!
● The tools
  ○ Part A: http://goo.gl/forms/Wkc7KyxvX5
  ○ Part B: http://goo.gl/forms/ZFRrzG6krX