65
Data Consultant, Honorary Academic Editor Associate Director, Principal Investigator The rise of the data-centric research and publication enterprises A journey into the whirlwind Susanna-Assunta Sansone, PhD uk.linkedin.com/in/sasansone @biosharing @isatools @scientificdata http://www.slideshare.net/SusannaSansone Oxford Interdisciplinary Bioscience Doctoral Training Partnership (DTP) – 31 July, 2014

Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Embed Size (px)

DESCRIPTION

What to know when planning for your data management strategy and preparing a data management statement for a research proposal for BBSRC DTP first year students

Citation preview

Page 1: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Data Consultant, Honorary Academic Editor

Associate Director, Principal Investigator

The rise of the data-centric !research and publication enterprises!

!

A journey into the whirlwind!

Susanna-Assunta Sansone, PhD!!

uk.linkedin.com/in/sasansone!@biosharing!

@isatools!@scientificdata!

!

http://www.slideshare.net/SusannaSansone

Oxford Interdisciplinary Bioscience Doctoral Training Partnership (DTP) – 31 July, 2014

Page 2: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Scope is to outline the strategies taken by the

researchers to allow access to they research outputs

Page 3: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

But it does not give you all info you need to know! Also because data sharing policies are unclear; a very common and loose text is: “Applicants should make use of existing, recognised standards for data collection and management, where these exist, and make data available through existing community resources or databases where possible”. But what constitutes a recognised standard or acceptable community resource?

Page 4: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

•  About myself!o  activities and interests!

•  FAIR data!o  concept!o my related projects!

•  Scientific Data!o  rationale!o Data Descriptors!o  examples!

Outline!

Page 5: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Communities we work with/for:

Page 6: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Communities we work with/for: •  Describe my experiments •  Share info with my group •  Store for query & access •  Upload to analysis tool •  Compare with similar data •  etc……

•  Support the needs of our researchers

•  Enhancing our existing tools; develop new components, or connect with other tools

•  Comply to community standards

•  etc….

•  Drive data reusability agenda •  Recommend tools, data

resources and standards in data policies

•  Criteria for endorsement •  Mapping the ecosystem of

efforts •  etc……

Page 7: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Key areas of activity: •  Data capture and curation •  Database development •  Data (nano)publication •  Data provenance •  Open, community ontologies

and standards •  Semantic web •  Social engineering •  Software development •  Training •  Visualization (collaboration/

jointly with Prof Min Chen)

Communities we work with/for:

Page 8: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Key areas of activity: •  Data capture and curation •  Database development •  Data (nano)publication •  Data provenance •  Open, community ontologies

and standards •  Semantic web •  Social engineering •  Software development •  Training •  Visualization (collaboration/

jointly with Prof Min Chen)

Communities we work with/for: As part of: •  UK, European and international

consortia •  Pre-competitive informatics

public-private partnerships •  Standardization initiatives

Page 9: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

http://www.flickr.com/photos/notbrucelee/8016189356/

CC BY

Our activities are around and in support of data curation, management and publication

and their pivotal roles in enabling FAIR data and research, driving science and discoveries

Page 10: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

eTRIKS – european Translational Information and Knowledge management Services Consortium of academic (Imperial College, CNRS, Un of Luxemburg) and pharmas (Janssen, Merck, AZ, Lilly, Lundbeck, Pfizer, Roche, Sanofi, Bayer, GSK) building a sustainable, open translational research informatics platform

•  Nature Publishing Group‘s Scientific Data •  BioMedCentral and BGI‘s GigaScience •  F1000 Research •  Oxford University Press

Susanna-Assunta Sansone, Philippe Rocca-Serra, Alejandra Gonzalez-Beltran, Eamonn Maguire, Milo Thurston; Oxford alumni: Annapaola Santarsiero, Pavlos Georgiou + my previous team when at EBI (2001 – 2010)!

Page 11: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014
Page 12: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014
Page 13: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/

Credit to:

Page 14: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

A great start, but not enough!

image by Greg Emmerich

http://discovery.urlibraries.org/

http://www.theguardian.com/higher-education-network/blog/2014/jun/26

Page 15: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

§  Researchers and bioinformaticians in both academic and commercial science, along with funding agencies and publishers, embrace the concept that both •  DATA: entities of interest e.g., genes, metabolites, phenotypes and •  METADATA: experimental steps e.g., provenance of study materials,

technology and measurement types should be Findable, Accessible, Interoperable and Reusable

Worldwide movement for FAIR data

Source: http://ebbailey.wordpress.com

Page 16: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

16

In all fairness, no much data is FAIR!

Page 17: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

In all fairness, no much data is FAIR!

Page 18: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

But it is not just about technology…!

Page 19: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

19

…breath and depth of the content!

…is pivotal!

Page 20: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

20

sample characteristic(s)

experimental design

experimental variable(s)

technology(s)

measurement(s)

protocols(s)

data file(s)

......

Page 21: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

21

•  make annotation explicit and discoverable

•  structure the descriptions for consistency

•  ensure/regulate access

•  deposit and publish

•  etc….

§  To make this dataset ‘FAIR’, one must have best practices, standards and tools to: •  report sufficient details •  capture all salient features of

the experimental workflow

Page 22: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

A community mobilization to develop standards, e.g.:

§  Structural and operational differences •  organization types (open, close to members, society, WG etc.) •  standards development (how to formulate, conduct and maintain) •  adoption, uptake, outreach (link to journals, funders and commercial sector) •  funds (sponsors, memberships, grants, volunteering)

de jure de facto

grass-roots groups

standard organizations

Nanotechnology Working Group

Page 23: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

A community mobilization to develop standards, e.g.:

de jure de facto

grass-roots groups

standard organizations

Nanotechnology Working Group

Page 24: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Focus on reporting or content standards

Including minimum information reporting requirements, or checklists to report the same core, essential information

Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’

Including conceptual model, conceptual schema from which an exchange format is derived to allow data to flow from one system to another

Community-developed, standards are pivotal to structure, enrich the description and share data and associated contextual metadata, facilitating understanding and reuse!

Page 25: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Growing number of reporting standards

+ 130 + 150

+ 303

Source: BioPortal

Databases, !annotation,!

curation !tools !

implementing !standards!

miame!MIAPA!

MIRIAM!MIQAS!MIX!

MIGEN!

CIMR!MIAPE!

MIASE!

MIQE!

MISFISHIE….!

REMARK!

CONSORT!

MAGE-Tab!GCDML!

SRAxml!SOFT! FASTA!

DICOM!

MzML !SBRML!

SEDML…!

GELML!

ISA-Tab!

CML!

MITAB!

AAO!CHEBI!

OBI!

PATO! ENVO!MOD!

BTO!IDO…!

TEDDY!

PRO!XAO!

DO

VO!Source: B

ioSharing

Source: BioSharing

Page 26: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

26

Technologically-delineated views of the world !

Biologically-delineated views of the world!

Generic features (‘common core’)!- description of source biomaterial!- experimental design components!

Arrays!

Scanning! Arrays &Scanning!

Columns!

Gels!MS! MS!

FTIR!

NMR!

Columns!

transcriptomics proteomics metabolomics

plant biology epidemiology microbiology

Fragmentation, duplications and gaps

To compare and integrate data we need interoperable standards

Page 27: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Three EBI omics systems S

ubm

issi

on

Acc

ess

Sto

rage

Fragmentation of the databases and data, e.g.

Page 28: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Three EBI omics systems S

ubm

issi

on

Acc

ess

Sto

rage

Fragmentation of the databases and data, e.g.

Proteomics

Page 29: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Three EBI omics systems S

ubm

issi

on

Acc

ess

Sto

rage

Fragmentation of the databases and data, e.g.

Proteomics

Page 30: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Three EBI omics systems

DIFFERENT Formats, terminologies and tools

Sub

mis

sion

DIFFERENT Download formats

Acc

ess

DIFFERENT - Core requirements represented - Representation of the studies and related samples - Curation practices

Sto

rage

Fragmentation of the databases and data, e.g.

Proteomics

Page 31: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

How much do we know and which standards can we use

Page 32: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014
Page 33: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Registering and cataloging is just step one; the next one are: •  Develop assessment criteria for usability and popularity of standards •  Associate standards to data policies and databases •  Assemble journal and funder policies re data storage •  Make fully cross-searchable •  Intended goal: help stakeholders make informed decisions

Page 34: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

34

Page 35: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

35

Users can claim records and maintain them

Page 36: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

36

Users can claim records and maintain them

Page 37: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

37

Classify, links standards and visualize relations

Page 38: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Example

The relationship among popular standard formats for pathway information!BioPAX and PSI-MI are designed for data exchange to and from databases and pathway and network data integration. SBML and CellML are designed to support mathematical simulations of biological systems and SBGN represents pathway diagrams. !

CREDIT: Demir, et al., The BioPAX community standard for pathway data sharing, 2010.

Page 39: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

39

•  make annotation explicit and discoverable

•  structure the descriptions for consistency

•  ensure/regulate access

•  deposit and publish

•  etc….

§  To make this dataset ‘FAIR’, one must have best practices, standards and tools to: •  report sufficient details •  capture all salient features of

the experimental workflow

Page 40: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014
Page 41: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014
Page 42: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

ISA powers data collection, curation resources and repositories, e.g.:

Page 43: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

General-purpose, configurable format, designed to support: •  description of the experimental metadata,

making the annotation explicit and discoverable

•  provenance tracking •  use community standards, such as minimal

reporting guidelines and terminologies •  designed to be converted to - a growing

number of - other metadata formats, e.g. used by EBI repositories

analysis !method! script!

Data file or !record in a database!

Page 44: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014
Page 45: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Our industry needs a Disruptive Innovation. That Disruption...is Pistoia!

Page 46: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Oxford e-Research Centre (Sansone SA): Data Standards and Curation training

Page 47: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

FAIR data - roles and responsibilities

•  Data has to become an integral part of the scholarly communications!

•  Responsibilities lie across several stakeholder groups: researchers, data centers, librarians, funding agencies and publishers!

•  But publishers occupy a “leverage point” in this process!

Page 48: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Journal publishing: the changing landscape!

Page 49: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Human Genome 2001 62 Pages, 150 Authors,

49 Figure, 27 tables

Encode Project 2012 30 papers, 3 Journals

Nature Publishing Group: the changing landscape!

Page 50: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

!!!

Launched on May 27th, 2014

A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these!

Credit for sharing your data

Focused on reuse and reproducibility

Peer reviewed, curated

Promoting Community Data Repositories

Open Access

Supported by:!

Page 51: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

!!!Experimental metadata or !

structured component!(in-house curated, machine-

readable formats)!

Article or !narrative component!

(PDF and HTML) !

Data Descriptor: narrative and structure!

Page 52: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Data Descriptor: narrative!

Sections:!•  Title!•  Abstract!•  Background & Summary!•  Methods!•  Technical Validation!•  Data Records!•  Usage Notes !•  Figures & Tables !•  References!•  Data Citations!!

Focus on data reuse!Detailed descriptions of the methods and technical analyses supporting the quality of the measurements.!Does not contain tests of new scientific hypotheses!

In traditional publications this information is not provided in a sufficiently detailed manner

However this information is essential for understanding, reusing, and reproducing datasets

Page 53: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Data Descriptor: narrative!

Sections:!•  Title!•  Abstract!•  Background & Summary!•  Methods!•  Technical Validation!•  Data Records!•  Usage Notes !•  Figures & Tables !•  References!•  Data Citations!!

Focus on data reuse!Detailed descriptions of the methods and technical analyses supporting the quality of the measurements.!Does not contain tests of new scientific hypotheses!

Page 54: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Data Descriptor: narrative!

Sections:!•  Title!•  Abstract!•  Background & Summary!•  Methods!•  Technical Validation!•  Data Records!•  Usage Notes !•  Figures & Tables !•  References!•  Data Citations!!

Focus on data reuse!Detailed descriptions of the methods and technical analyses supporting the quality of the measurements.!Does not contain tests of new scientific hypotheses!

Joint Declaration of Data Citation Principles by the Data Citation Synthesis Group, incl.: -  CODATA -  Research Data Alliance (RDA), -  Force11

Page 55: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

In-house curation team:!•  assists users to submit the structured

content via simple templates and an internal authoring tool!

•  performs value-added semantic annotation of the experimental metadata!

For advanced users/service providers willing to export ISA-Tab for direct submission, we will release a technical specification:!

analysis !method! script!

Data file or !record in a database!

Data Descriptor: structure (CC0)!

Page 56: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Hanke: Neuroscience !

!!!!!!!!!

Code in GitHub

New Dataset Data in OpenfMRI Source code in GitHub

Big Data

Page 57: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Stefano: Stem Cells!Associated Nature Article Data - figshare - NCBI GEO Integrated figshare data viewer

Page 58: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

!!!!!!!!Scientific hypotheses:!Synthesis!Analysis!Conclusions!

Methods and technical analyses supporting the quality of the measurements:!What did I do to generate the data?!How was the data processed?!Where is the data?!Who did what when!

BEFORE: get your data to the community as soon as possible (see NPG pre-publication policy) AT THE SAME TIME: publish your Data Descriptor(s) alongside research article(s) AFTER: expand on your research articles, adding further information for reuse of the data

Relation with traditional articles - content and time!

Page 59: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Export to various formats (ISA_tab, RDF, etc)

Linking between research papers, Data Descriptors, and data records

Making data discoverable !

We currently recognize over 50 public data repositories!!

Page 60: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Evaluation is not be based on the perceived impact or novelty of the findings!•  Experimental Rigour and Technical Data Quality!

o  Were the data produced in a rigorous and methodologically sound manner?!o  Was the technical quality of the data supported convincingly with technical validation

experiments and statistical analyses of data quality or error, as needed?!o  Are the depth, coverage, size, and/or completeness of these data sufficient for the types of

applications or research questions outlined by the authors?!

•  Completeness of the Description!o  Are the methods and any data-processing steps described in sufficient detail to allow others to

reproduce these steps?!o  Did the authors provide all the information needed for others to reuse this dataset or integrate it

with other data?!o  Is this Data Descriptor, in combination with any repository metadata, consistent with relevant

minimum information or reporting standards?!

•  Integrity of the Data Files and Repository Record!o  Have you confirmed that the data files deposited by the authors are complete and match the

descriptions in the Data Descriptor?!o  Have these data files been deposited in the most appropriate available data repository?!

Peer review process focused on quality and reuse!

Page 61: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

•  Neuroscience, ecology, epidemiology, environmental science, functional genomics, metabolomics, toxicology!

•  New datasets and previously published data sets!o  a fuller, more in-depth look at the data processing steps,

supported by additional data files and code from each step!o  additional tutorial-like information for scientists interested in

reusing or integrating the data with their own!•  Datasets in figshare and domain specific databases!•  Code deposited in figshare and GitHub!•  Individual datasets, curated aggregation and citizen science!•  First dataset part of a collection !•  Academic and industry authors!

61

Current content is diverse – bimonthly releases !

Page 62: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Take home messages

http://www.flickr.com/photos/12308429@N03/4957994485/

u  Make sure your research outputs make an impact! u  Open your research outputs, via the right channels to get cited and credited

u  Contribute to the reproducible research movement and to FAIR data

Page 63: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Take home messages

u  Uniquely identify yourself via ORCID u  Share identified generic research outputs, e.g. FigShare

u  Share and deposit code, e.g. GitHub, Bitbucket

http://www.flickr.com/photos/idiolector/289490834/

Page 64: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Take home messages

u  Learn about open standards in your area, via e.g. BioSharing u  Select tools that implement relevant standards, e.g. ISA

u  Publish not just in traditional journals, but think Scientific Data

http://www.flickr.com/photos/webhamster/2582189977/

Page 65: Overview to: BBSRC Oxford Doctoral Training Partnership - Dr Sansone - July 2014

Data Consultant, Honorary Academic Editor

Associate Director, Principal Investigator

The rise of the data-centric !research and publication enterprises!

!

A journey into the whirlwind!

Susanna-Assunta Sansone, PhD!!

uk.linkedin.com/in/sasansone!@biosharing!

@isatools!@scientificdata!

!

http://www.slideshare.net/SusannaSansone

Oxford Interdisciplinary Bioscience Doctoral Training Partnership (DTP) – 31 July, 2014