Consultant, Honorary Academic Editor
Associate Director, Principal Investigator
High quality data publications: drives and needs
Susanna-Assunta Sansone, PhD
@biosharing @isatools
@scientificdata
BBSRC DTP, Oxford, 15 December, 2014
http://www.slideshare.net/SusannaSansone
Credit to: https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/
• Over 50% of completed studies in biomedicine do not appear in the published literature
• Often because results do not conform to authors' hypotheses
“Only half the health-related studies funded by the European Union between 1998 and 2006 - an expenditure of €6 billion - led to identifiable reports”
Plagued by selective reporting of data and methods
• Big science efforts
  o Data are often better organized, reported and shared
• Small independent efforts, yielding a rich variety of specialty data sets
  o Most of these data (such as null findings) are unpublished
  o These dark data hold a potential wealth of knowledge
Incentivizing individual contributors to share data
A community mobilization for “openness”
image by Greg Emmerich
http://discovery.urlibraries.org/ https://okfn.org
Open data is a means to do better science more efficiently
http://pantonprinciples.org
http://opendefinition.org/licenses/
https://creativecommons.org
Open access is not enough on its own
http://www.theguardian.com/higher-education-network/blog/2014/jun/26
If your research has been funded by the taxpayer, there's a good chance you'll be encouraged to publish your results on an open access basis….. This final article makes publicly available the hypotheses, interpretations and conclusions of your research. But what about the data that led you to those results and conclusions?
Also open data is not always enough
http://www.theguardian.com/higher-education-network/blog/2014/jun/26
So data that is in theory open and free to access
• may still be hard to get hold of
• it may not have been stored or cited in the appropriate manner
• it may not be interoperable with related data because it is not formatted appropriately; or
• it may not be reusable because it may not contain enough information for others to understand it
Movement for FAIR data in life and medical sciences
http://bd2k.nih.gov/workshops.html#ADDS
Because, in all fairness, not much data is FAIR!
Responsibilities lie across several stakeholder groups
Understand the benefits of sharing FAIR datasets and enact them
Engage and assist researchers to enable them to share FAIR datasets
Release or endorse practices and policies, but also incentive and credit mechanisms for researchers, curators and developers
Because of the importance of formal publications in the academic incentive structure
Publishers occupy a leverage point
Serve as the implementation and/or enforcement arm at the point of publication
Role of publishers as “agents of change”
• Policies on access (to data, code, reagents etc.)
  o Supporting funder & community needs
• Format and amount of content
  o Methodological details, supplementary info, data integration and links to repositories
• Licensing for reuse
• Incentives to share
  o Data citations
  o Data journals and articles
• Quality assurance through peer review
Publishers and data/reproducibility
Credit to: Iain Hrynaszkiewicz
Human Genome, 2001: 62 pages, 150 authors, 49 figures, 27 tables
ENCODE Project, 2012: 30 papers, 3 journals
Nature Publishing Group: the changing landscape
Credit to: Iain Hrynaszkiewicz
2013
Wang et al, Nature, 2013 doi:10.1038/nature12730
Data/reproducibility at NPG
• Figure source data
  o putting data behind figures/graphs
• Data citation
  o tackling both styling and format; monitoring community developments, such as the Data Citation Synthesis Group
• Code reproducibility
  o peer review, availability and reuse
• NPG’s Linked Data release – CC0
• A new data journal
Data journals everywhere?
Credit to: Iain Hrynaszkiewicz
A new open-access, online-only publication for descriptions of scientifically valuable datasets
• Get Credit for Sharing Your Data
  o Publications will be listed in the major indexes and will be citeable
• Focused on Data Reuse
  o All the information others need to reuse the data; no interpretative analysis or hypothesis testing
• Open-access
  o Authors select from three Creative Commons licences for the main Data Descriptor. Each publication supported by curated CC0 metadata
• Peer-reviewed
  o Rigorous peer-review managed by our Editorial Board of academic researchers ensures data quality and standards
• Promoting Community Data Repositories
  o Data stored in community data repositories
Summary of Data Descriptor
Journal article (narrative):
• Synthesis
• Analysis
• Conclusions
• Interpretation
Data Descriptor (facts):
• What is the sample?
• What did I do to generate the data?
• Where is the data?
• How was the data processed?
• Who did what when?
Introducing a new content type: the Data Descriptor
• Designed to make data more discoverable, interpretable and reusable
• Concerned with the facts behind the methodology of data generation/collection and processing
• Complements a journal article
Data Descriptor: narrative and structure
• Experimental metadata or structured component (in-house curated, machine-readable formats)
• Article or narrative component (PDF and HTML)
In traditional publications this information is not provided in a sufficiently detailed manner
However this information is essential for understanding, reusing, and reproducing datasets
Data Descriptor: narrative
Sections:
• Title
• Abstract
• Background & Summary
• Methods
• Technical Validation
• Data Records
• Usage Notes
• Figures & Tables
• References
• Data Citations
Focus on data reuse
Detailed descriptions of the methods and technical analyses supporting the quality of the measurements.
Does not contain tests of new scientific hypotheses
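As a rough illustration of the Data Descriptor sections listed above, the sketch below checks a draft's headings against that list. The section names are taken from the slide; the function name, the sample draft, and the idea that every section is expected in every submission are assumptions made for illustration, not journal policy.

```python
# Section names copied from the slide; everything else is hypothetical.
EXPECTED_SECTIONS = [
    "Title", "Abstract", "Background & Summary", "Methods",
    "Technical Validation", "Data Records", "Usage Notes",
    "Figures & Tables", "References", "Data Citations",
]


def missing_sections(draft_headings):
    """Return the expected sections absent from a draft, in slide order."""
    # Compare case-insensitively and ignore surrounding whitespace.
    present = {heading.strip().lower() for heading in draft_headings}
    return [s for s in EXPECTED_SECTIONS if s.lower() not in present]


# Hypothetical draft that only has some of the sections so far.
draft = ["Title", "Abstract", "Methods", "Data Records", "References"]
print(missing_sections(draft))
```

A check like this only confirms that headings exist; it says nothing about whether the content under each heading is sufficient for reuse.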
Joint Declaration of Data Citation Principles by the Data Citation Synthesis Group
In-house editorial curator:
• assists users to submit the structured content via simple templates and an internal authoring tool
• performs value-added semantic annotation of the experimental metadata
For advanced users/service providers willing to export ISA-Tab for direct submission, we have released a technical specification:
Data Descriptor: structure - content
Figure labels: analysis; method; script; data file or record in a database
Workflow overview
Figure legend: Green: author; Purple: repository; Blue: SciData; Red: production
Collect Data
Follow-up experiments
Publish Findings
Publish Data
Scientific Data’s prior publication policy with other NPG journals protects your ability to publish the screen data and the hits later
Publish your data early
Credit to: Andrew Hufton
Hao et al.: Environmental
New Dataset
Data sets from the Global Integrated Drought Monitoring and Prediction System (GIDMaPS), which provides drought information based on multiple drought indicators
• Data in figshare
• Code in figshare
• Cited in Science
8 citations
Hanke: Neuroscience
New Dataset
• Data in OpenfMRI
• Code in GitHub
Collect Data
Follow-up experiments
Publish Findings
Submit Data
Hold publication
Scientific Data will hold a Data Descriptor publication that has been accepted for publication, while your other related research publications clear peer review
Credit to: Andrew Hufton
Or your data and findings simultaneously
Collect Data
Follow-up experiments
Publish Findings
Publish Data
• A fuller, more in-depth look at the data processing steps, supported by additional data files and code from each step
• And/or additional tutorial-like information for scientists interested in reusing or integrating the data with their own
Or after the findings, but…
Messina et al.: Epidemiology
The most comprehensive geographic collection of human dengue virus occurrence data (1960-2012), linked to point or polygon locations, derived from peer-reviewed literature and case reports as well as informal online sources
• Associated Nature Article
• Data in figshare
4 citations
Scientific hypotheses:
• Synthesis
• Analysis
• Conclusions
Methods and technical analyses supporting the quality of the measurements:
• What did I do to generate the data?
• How was the data processed?
• Where is the data?
• Who did what when
Figure labels: Research papers; Data records; Data Descriptors
Adding value to research articles and data records
[Chart of repository categories: DNA and protein sequence; Functional genomics; Genetic association and genome variation; Metagenomics; Molecular interactions; Organism- or disease-specific; Proteomics; Taxonomy and species diversity; Traces and sequencing reads]
“Omics” is emphasized among basic life-sciences repositories
• We currently recognize over 60 public data repositories, and provide advice on the best place for authors to archive their data
• We have integrated systems with both:
Helping authors find the right place for the data
Repositories criteria
1. Broad support and recognition within their scientific community
2. Ensure long-term persistence and preservation of datasets
3. Provide expert curation
4. Implement relevant, community-endorsed reporting requirements
   Progressively monitor this via
5. Provide for confidential review of submitted datasets
6. Provide stable identifiers for submitted datasets
7. Allow public access to data without unnecessary restrictions
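The seven repository criteria above can be thought of as a simple checklist. The sketch below is one hypothetical way to record an assessment and list the criteria a repository does not yet meet; the class, field names and example values are illustrative assumptions, not the journal's actual evaluation tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class RepositoryAssessment:
    """One boolean per criterion from the slide (field names are hypothetical)."""
    community_recognition: bool             # criterion 1
    long_term_preservation: bool            # criterion 2
    expert_curation: bool                   # criterion 3
    community_reporting_requirements: bool  # criterion 4
    confidential_review: bool               # criterion 5
    stable_identifiers: bool                # criterion 6
    unrestricted_public_access: bool        # criterion 7


def unmet_criteria(assessment):
    """Return the names of criteria marked False, in slide order."""
    return [f.name for f in fields(assessment) if not getattr(assessment, f.name)]


# Made-up example: a repository that meets everything except expert curation.
example = RepositoryAssessment(True, True, False, True, True, True, True)
print(unmet_criteria(example))  # ['expert_curation']
```

In practice each criterion calls for editorial judgment rather than a yes/no flag; the booleans here only show how the list could be tracked.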
Citations of and links to data files - databases
Evaluation is not based on the perceived impact or novelty of the findings or the size of the data
• Experimental rigour and technical data quality
  o Methodologically sound
  o Technical validation experiments and statistical analyses
  o Depth, coverage, size, and/or completeness of data sufficient for the types of applications
• Completeness of the description
  o Sufficient details to allow others to reproduce the results, reuse or integrate it with other data
  o Compliance with relevant minimum information or reporting standards
• Integrity of the data files and repository record
  o Data files match the descriptions in the Data Descriptor
  o Deposited in the most appropriate available databases
Peer review process focused on quality and reuse
• Neuroscience, ecology, epidemiology, environmental science, functional genomics, metabolomics, toxicology etc.
• New and previously published individual datasets, curated aggregations and citizen science:
• Datasets in figshare, Dryad and domain specific databases
• Code deposited in figshare and GitHub
• First collection:
Current content is diverse - bimonthly releases
Supported by:
Advisory Panel including senior researchers, funders, librarians and curators
• Michael Huerta (National Institutes of Health, USA)
• Mark Thorley (Natural Environment Research Council, UK)
• Patricia Cruse (University of California, USA)
• Susan Gregurick (Office of Biological and Environmental Research, Department of Energy, USA)
• Ioannis Xenarios (Swiss Institute of Bioinformatics, Switzerland)
• Chris Bowler (IBENS, France)
• Mark Forster (Syngenta, UK)
• Anthony Rowe (Johnson & Johnson, USA)
• Stephen Chanock (National Cancer Institute, USA)
• Weida Tong (National Center for Toxicological Research, FDA, USA)
• Albert J. R. Heck (Utrecht University, The Netherlands)
• Johanna McEntyre (EMBL-EBI, European Bioinformatics Institute, UK)
• Simon Hodson (CODATA, France)
• Joseph R. Ecker (Howard Hughes Medical Institute & Salk Institute, USA)
• Stephen Friend (Sage Bionetworks, USA)
• Jessica Tenenbaum (Duke Translational Medicine Institute, USA)
• Anne-Claude Gavin (EMBL, Germany)
• David Carr (Wellcome Trust, UK)
• Wolfram Horstmann (Göttingen State and University Library, Germany)
• Piero Carninci (RIKEN Omics Science Center, Japan)
• Pascale Gaudet (Swiss Institute of Bioinformatics, Switzerland)
• Judith A. Blake (The Jackson Laboratory, USA)
• Richard H. Scheuermann (J. Craig Venter Institute, USA)
• Caroline Shamu (Harvard Medical School, USA)
Susanna-Assunta Sansone, Honorary Academic Editor (University of Oxford, UK)
Andrew L Hufton, Managing Editor
Varsha Khodiyar, Editorial Curator
Iain Hrynaszkiewicz, Publisher
An open access, peer-reviewed publication for descriptions of scientifically valuable datasets
Launched May 2014