Big Data publishing: Beyond dead trees, and a case study. Scott Edmunds, ISMB, 22nd July 2013, @gigascience

Scott Edmunds ISMB talk on Big Data Publishing


DESCRIPTION

Scott Edmunds talk on Big Data Publishing at the "What Bioinformaticians need to know about digital publishing beyond the PDF" workshop at ISMB 2013, July 22nd 2013

Citation preview

Page 1: Scott Edmunds ISMB talk on Big Data Publishing

Big Data publishing
Beyond dead trees, and a case study

Scott Edmunds
ISMB, 22nd July 2013
@gigascience

Page 2: Scott Edmunds ISMB talk on Big Data Publishing

The problems with publishing

• Scholarly articles are merely advertisements of scholarship. The actual scholarly artefacts, i.e. the data and computational methods that support the scholarship, remain largely inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)

• Core scientific statements or assertions are intertwined and hidden in the conventional scholarly narratives

• Lack of transparency, lack of credit for anything other than “regular” dead tree publication

Page 3: Scott Edmunds ISMB talk on Big Data Publishing

Time to move beyond:

1812, 1665, 1869

Page 4: Scott Edmunds ISMB talk on Big Data Publishing

Problem: growing replication gap

1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950

Out of 18 microarray papers, results from 10 could not be reproduced

More retractions: >15X increase in last decade
At current % increase, by 2045 as many papers retracted as published

Page 5: Scott Edmunds ISMB talk on Big Data Publishing

Motivation

• Scholarly artefacts must be
– Treated as first-class objects, in scientific investigations and in scholarly communications
– Made machine-readable for the convenience of reasoning
– Represented in an interoperable manner

• Truly “add value” to publishing
– Provide infrastructure to aid & reward replication

• Trial of ISA+Nanopublication+RO
– Three similar approaches that should complement each other for the representation of scholarly artefacts

Page 6: Scott Edmunds ISMB talk on Big Data Publishing

Motivation

• Scholarly artefacts must be

– Treated as first-class objects, in scientific investigations and in scholarly communications

?

“data* generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.

“increase acceptance of research data* as legitimate, citable contributions to the scholarly record”.

Data* Citation (*and more)

Page 7: Scott Edmunds ISMB talk on Big Data Publishing

• Data
• Software
• Re-use…

= Credit

Page 8: Scott Edmunds ISMB talk on Big Data Publishing

GigaSolution: deconstructing the paper

Need to credit and reward:

•Data/software availability

•Metadata/curation

•Interoperability

•Availability of workflows

•Transparent analyses

Data

Metadata

Methods

Analyses

Page 9: Scott Edmunds ISMB talk on Big Data Publishing

GigaSolution: deconstructing the paper

www.gigadb.org
www.gigasciencejournal.com

20PB storage, 20.5K cores, 212TFlops, >1000 bioinformaticians

Utilizes big-data infrastructure and expertise from:

Combines and integrates:

Open-access journal

Data Publishing Platform

Data Analysis Platform

Page 10: Scott Edmunds ISMB talk on Big Data Publishing

What should we reward?

Page 11: Scott Edmunds ISMB talk on Big Data Publishing

Different levels of granularity:

Experiment (e.g. Rice 10K project): e.g. doi:10.5524/100001

Datasets (e.g. species, variety): e.g. doi:10.5524/100001-2

Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz

Smaller still?

Papers

Data/Micropubs

Nanopubs

Facts/Assertions (~10^14 in literature)

Reward different shaped publishable objects

Page 12: Scott Edmunds ISMB talk on Big Data Publishing

Rewarding open data

Page 13: Scott Edmunds ISMB talk on Big Data Publishing

Submission Workflow

• Submitter logs in to GigaDB website and uploads Excel submission file
• Validation checks
– Fail – submitter is provided error report
– Pass – dataset is uploaded to GigaDB
• Curator review
• DOI assigned
• Curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues)
• Files: submitter provides files by ftp or Aspera
• DataCite XML file: XML is generated and registered with DataCite
• Curator makes dataset public (can be set as future date if required)

Public GigaDB dataset, e.g. DOI 10.5524/100003: Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
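
The "XML is generated and registered with DataCite" step in the submission workflow can be illustrated with a rough sketch. The slides do not show GigaDB's actual code or which DataCite schema version was used, so the namespace (kernel-4), the function name `build_datacite_xml`, the creator placeholder and the publisher string below are all assumptions; only the DOI and title come from the slide. Registering the record with DataCite is not shown.

```python
import xml.etree.ElementTree as ET

# Assumption: DataCite Metadata Schema kernel-4 namespace. The schema version
# used by GigaDB at the time of the talk is not stated in the slides.
DATACITE_NS = "http://datacite.org/schema/kernel-4"


def build_datacite_xml(doi, title, creators, publisher, year):
    """Return a minimal DataCite-style metadata record as an XML string.

    Hypothetical helper for illustration; not GigaDB's implementation.
    """
    ET.register_namespace("", DATACITE_NS)

    def tag(name):
        # Qualify an element name with the DataCite namespace.
        return f"{{{DATACITE_NS}}}{name}"

    resource = ET.Element(tag("resource"))

    identifier = ET.SubElement(resource, tag("identifier"), identifierType="DOI")
    identifier.text = doi

    creators_el = ET.SubElement(resource, tag("creators"))
    for name in creators:
        creator = ET.SubElement(creators_el, tag("creator"))
        ET.SubElement(creator, tag("creatorName")).text = name

    titles = ET.SubElement(resource, tag("titles"))
    ET.SubElement(titles, tag("title")).text = title

    ET.SubElement(resource, tag("publisher")).text = publisher
    ET.SubElement(resource, tag("publicationYear")).text = str(year)

    resource_type = ET.SubElement(
        resource, tag("resourceType"), resourceTypeGeneral="Dataset"
    )
    resource_type.text = "Genomic data"

    return ET.tostring(resource, encoding="unicode")


record = build_datacite_xml(
    doi="10.5524/100003",  # DOI shown on the slide
    title="Genomic data from the crab-eating macaque/cynomolgus monkey "
          "(Macaca fascicularis)",
    creators=["Example Submitter"],    # placeholder, not from the slides
    publisher="GigaScience Database",  # assumed publisher string
    year=2011,
)
print(record)
```

A record like this would be produced only after the validation checks pass and the curator has reviewed the submission, which is why DOI assignment sits late in the workflow.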

Page 14: Scott Edmunds ISMB talk on Big Data Publishing

Reward open & transparent review

End reviewer 3 Download parody videos, now!

Page 15: Scott Edmunds ISMB talk on Big Data Publishing

Real-time open-review = paper in arXiv + blogged reviews

Reward open & transparent review

http://tmblr.co/ZzXdssfOMJfy
www.gigasciencejournal.com/content/2/1/10

Page 16: Scott Edmunds ISMB talk on Big Data Publishing

Cloud solutions?

Reward better handling of metadata…
Novel tools/formats for data interoperability/handling.

BMC Research Awards 2013
Winner of open data award

Page 17: Scott Edmunds ISMB talk on Big Data Publishing

Rewarding transparent methods/results

Software availability

Open code

Accessible pipelines

Sharable workflows

Research Objects

Ease of replication

Page 18: Scott Edmunds ISMB talk on Big Data Publishing

Implement workflows in a community-accepted format

http://galaxyproject.org

Over 36,000 main Galaxy server users

Over 500 papers citing Galaxy use

Over 55 Galaxy servers deployed

Open source

Page 19: Scott Edmunds ISMB talk on Big Data Publishing

galaxy.cbiit.cuhk.edu.hk

Page 20: Scott Edmunds ISMB talk on Big Data Publishing

Research Objects

An aggregation of scholarly artefacts:
• Data used or results produced in an experiment study
• Methods employed to produce and analyse that data
• Provenance and setting information about the experiments
• People involved in the investigation
• Annotations about these resources that are essential to the understanding and interpretation of the scientific outcomes captured by a research object

Page 21: Scott Edmunds ISMB talk on Big Data Publishing

Example

Page 22: Scott Edmunds ISMB talk on Big Data Publishing
Page 23: Scott Edmunds ISMB talk on Big Data Publishing

How are we supporting data reproducibility?

Data sets

Analyses

Linked to

Linked to

DOI

DOI

Open-Paper

Open-Review

DOI:10.1186/2047-217X-1-18
>11000 accesses

Open-Code

8 reviewers tested data in ftp server & named reports published

DOI:10.5524/100044

Open-Pipelines
Open-Workflows

DOI:10.5524/100038
Open-Data

78GB CC0 data

Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/
>5000 downloads

Enabled code to be picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2

Page 24: Scott Edmunds ISMB talk on Big Data Publishing

8 referees downloaded & tested data, then signed reports

Reward open & transparent review

Page 25: Scott Edmunds ISMB talk on Big Data Publishing

Post publication: bloggers pull apart code/reviews in blogs + wiki:

SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2
Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/

Reward open & transparent review

Page 26: Scott Edmunds ISMB talk on Big Data Publishing

SOAPdenovo2 workflows implemented in

galaxy.cbiit.cuhk.edu.hk

Page 27: Scott Edmunds ISMB talk on Big Data Publishing

SOAPdenovo2 workflows implemented in

galaxy.cbiit.cuhk.edu.hk

Implemented entire workflow in our Galaxy server, inc.:

• 3 pre-processing steps

• 4 SOAPdenovo modules

• 1 post-processing step

• Evaluation and visualization tools

Also will be available to download by >36K Galaxy users in

Page 28: Scott Edmunds ISMB talk on Big Data Publishing

How much further can we take this?

Page 29: Scott Edmunds ISMB talk on Big Data Publishing

How much further can we take this?ISA + RO + Nanopub case study

Understand how each of the three models can support representation of the actual scholarly artefacts, which are essential first-class objects in scholarly communication

Demonstrate added value to the life science, publishing and scholarly communication communities by showing how these models should be used together to describe scholarly artefacts from life science domains

Page 30: Scott Edmunds ISMB talk on Big Data Publishing

Data models (instructions for authors for digital publishing)

• Research Object
– An encapsulation of essential information related to experiments and investigations

• The ISA (Investigation + Study + Assay) framework
– Includes a format and a set of software tools that enable its international user community to provide rich descriptions of experimental workflows in the life science, environmental and biomedical domains

• Nanopublication
– Dissemination of individual data (assertions) with or without an accompanying scholarly article
– Enables attribution to the scientists for sharing their data

Page 31: Scott Edmunds ISMB talk on Big Data Publishing

The Big Picture

Types of resources in an RO:
• Data
• Method/Experimental protocol
• Findings

Models to describe each resource type:
• Wfdesc/ISA-TAB/ISA2OWL

Page 32: Scott Edmunds ISMB talk on Big Data Publishing

The SOAPdenovo2 Case study

The Data

The method, as a Galaxy workflow

The findings, Table 2 in the paper (doi:10.1186/2047-217X-1-18)

Page 33: Scott Edmunds ISMB talk on Big Data Publishing

investigation

Investigation/Study/Assay infrastructure

The investigation file is a high-level aggregator for related studies. It contains all the information needed to understand the overall goals of an experiment, including the investigators involved, associated publications, the experimental design, experimental factors, protocols, funding agencies and so on…

Page 34: Scott Edmunds ISMB talk on Big Data Publishing

investigation

Investigation/Study/Assay infrastructure

An investigation can have one or more studies. A study is the central unit of the experimental description and it contains information on the subject(s) under study, their characteristics, and any treatments applied.

study study

Page 35: Scott Edmunds ISMB talk on Big Data Publishing

investigation

Investigation/Study/Assay infrastructure

Each study has one or more associated assays. The assay is the test performed either on the subject or on material taken from the subject, which produces qualitative and/or quantitative measurements.

study study

assay assay assay assay

Page 36: Scott Edmunds ISMB talk on Big Data Publishing

investigation

Investigation/Study/Assay infrastructure

study study

assay assay assay assay

data, analysis method, script, data, data

ISA
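
The investigation/study/assay nesting described on the preceding slides can be sketched as a small data model. This is an illustrative sketch only: the class and field names are hypothetical and are not the ISA tools' own API or the ISA-TAB file layout. The DOI and FTP location are the ones given elsewhere in the slides; the titles and measurement label are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """A test performed on the subject or on material taken from it."""
    measurement: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    """Central unit of the experimental description; has one or more assays."""
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """High-level aggregator for related studies."""
    title: str
    publications: List[str] = field(default_factory=list)
    studies: List[Study] = field(default_factory=list)

    def all_data_files(self) -> List[str]:
        # Walk investigation -> study -> assay and collect every data file.
        return [
            path
            for study in self.studies
            for assay in study.assays
            for path in assay.data_files
        ]


# Hypothetical description of the SOAPdenovo2 case study.
investigation = Investigation(
    title="SOAPdenovo2 case study",            # illustrative title
    publications=["doi:10.1186/2047-217X-1-18"],
    studies=[
        Study(
            title="Genome assembly",            # illustrative title
            assays=[
                Assay(
                    measurement="genome sequencing",  # illustrative label
                    data_files=["ftp://public.genomics.org.cn/BGI/SOAPdenovo2"],
                )
            ],
        )
    ],
)
print(investigation.all_data_files())
```

Keeping the assay as the only level that points at data files mirrors the slide's diagram, where data, analysis methods and scripts hang off assays rather than off the investigation directly.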

Page 37: Scott Edmunds ISMB talk on Big Data Publishing

investigation

Study Design

ISA framework SOAPdenovo2

Page 38: Scott Edmunds ISMB talk on Big Data Publishing

investigation

study

Study Samples: A table rendering the sample collection workflow

ISA framework SOAPdenovo2

Page 39: Scott Edmunds ISMB talk on Big Data Publishing

investigation

study

assay

ftp://public.genomics.org.cn/BGI/SOAPdenovo2

http://galaxy.cbiit.cuhk.edu.hk/

FTP

ISA framework SOAPdenovo2

Page 40: Scott Edmunds ISMB talk on Big Data Publishing

investigation

study

assay

Representation available in:

• Tabular format
– Spreadsheet-like format
– For biologists/experimentalists

• RDF/OWL format for Semantic Web/Linked Data users
– For bioinformaticians/software developers
– Facilitating data integration, querying, reasoning

• Support for submission to public repositories and data publication platforms

• Tools support for curation, creation, storage, analysis…

• Large and diverse life science user/collaborator communities

ISA framework

Page 41: Scott Edmunds ISMB talk on Big Data Publishing

RO + ISA

Scientific Workflow-specific ROs: scientific, computational experiments, non-wet lab protocols

ISA experiment and data description: focus on wet-lab or non-computational experimental protocols

Page 42: Scott Edmunds ISMB talk on Big Data Publishing

An RO for the Case Study

A Research Object

The Research Object contains the following artefacts:

• The input sequence data, represented in ISA-TAB format

• The Galaxy workflow that reflects the computational steps taken for generating the results used to produce Table 2

• Machine-readable descriptions of the workflow

• The nanopublication statements that represent claims based on the content of Table 2

Page 43: Scott Edmunds ISMB talk on Big Data Publishing

An RO for the Case Study

Page 44: Scott Edmunds ISMB talk on Big Data Publishing

Nanopublication URL: unique, persistent and resolvable identifier

Integrity Key: guarantees immutability after publication

Assertion: an individual association between concepts
• statement or declaration
• measurement
• hypothetical inference
• quantitative or qualitative

Provenance: how this assertion came to be (methods, evidence, context, etc.)

PublicationInfo:
• Detailed attribution for authors, institutions, lab technicians, curators
• License info
• Publication date

Terms shown in the example graph: assertion; opm:wasDerivedFrom; opm:wasGeneratedBy; thisnanopub; dcterms:created; pav:authoredBy; association a sio:statisticalAssociation; sio:has-measurementValue Association_1_p_value; a sio:probability-value; sio:has-value 6.56e-5^^xsd:float; sio:refers-to; dcterms:DOI
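
The three-part layout of a nanopublication (assertion, provenance, publication info) can be sketched by grouping triples into named graphs. This is a minimal sketch under stated assumptions: the term spellings are copied from the slide labels, the subject names, author and date are hypothetical, and the output is TriG-like text for reading only, with no prefix declarations and no validation against the nanopublication schema. It does not use any RDF library.

```python
def build_nanopub(p_value, source_doi, author, created):
    """Group triples into the three named graphs shown on the slide.

    Hypothetical helper; all subject names (":association_1", ":thisnanopub",
    ...) are illustrative and do not identify a real nanopublication.
    """
    return {
        "assertion": [
            (":association_1", "a", "sio:statisticalAssociation"),
            (":association_1", "sio:has-measurementValue", ":association_1_p_value"),
            (":association_1_p_value", "a", "sio:probability-value"),
            (":association_1_p_value", "sio:has-value", f'"{p_value}"^^xsd:float'),
        ],
        "provenance": [
            # How the assertion came to be: here, derived from a publication.
            (":assertion", "opm:wasDerivedFrom", f"<https://doi.org/{source_doi}>"),
        ],
        "publicationInfo": [
            (":thisnanopub", "dcterms:created", f'"{created}"^^xsd:date'),
            (":thisnanopub", "pav:authoredBy", f'"{author}"'),
        ],
    }


def to_trig_like(graphs):
    """Render the graphs as TriG-like text (one block per named graph)."""
    lines = []
    for name, triples in graphs.items():
        lines.append(f":{name} {{")
        for subject, predicate, obj in triples:
            lines.append(f"    {subject} {predicate} {obj} .")
        lines.append("}")
    return "\n".join(lines)


graphs = build_nanopub(
    p_value=6.56e-5,                      # value shown on the slide
    source_doi="10.1186/2047-217X-1-18",  # the SOAPdenovo2 paper
    author="Example Author",              # placeholder
    created="2013-07-22",                 # placeholder date
)
text = to_trig_like(graphs)
print(text)
```

Separating the graphs this way is what lets attribution and provenance travel with a single assertion, independently of the paper it was extracted from.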

Page 45: Scott Edmunds ISMB talk on Big Data Publishing

A Nanopublication-Centric View

• Improvements of SOAPdenovo2 have also been observed in assembling GAGE [8] dataset (see Additional file 1: Supplementary Method 6 and Tables 2 and 3). As shown in Tables 2 and 3, the correct assembly length of SOAPdenovo2 increased by approximately 3 to 80-fold comparing with that of SOAPdenovo1.

Page 46: Scott Edmunds ISMB talk on Big Data Publishing

SOAPdenovo2 S. aureus pipeline

Page 47: Scott Edmunds ISMB talk on Big Data Publishing

How do we generate a nanopub from this?

…stay tuned for Tech Track talk #34 by Marco Roos

ICC Lounge 81, Tuesday 23rd: 3.40pm-4.05pm

Page 48: Scott Edmunds ISMB talk on Big Data Publishing

Final step: visualization

NC_010079.pdf

gi_161510924_ref_NC_010063.1_.pdf

CONTIGuator 2 (thanks Marco Galardini)

https://github.com/combogenomics/CONTIGuator

Page 49: Scott Edmunds ISMB talk on Big Data Publishing

Lessons learned:

• It is possible to push button(s) & recreate a result from a paper

• Reproducibility is COSTLY. How much are you willing to spend?

• You learn a huge amount about the study, and the process provides lots of information not present in the paper

• Much easier to do this before rather than after publication

Page 50: Scott Edmunds ISMB talk on Big Data Publishing

Next steps

• Complete the case study on the release of ISA-OWL, nanopubs & ROs

• Extend the case study by including more than one dataset or RO, in order to show how related or conflicting information can be more easily interlinked

• Create community guidelines on how these three models should be used together, e.g. recommended patterns or vocabulary terms

Page 51: Scott Edmunds ISMB talk on Big Data Publishing

www.gigasciencejournal.com

Give us your data & pipelines!*

Want to go beyond dead trees & the PDF?


* APCs generously covered by BGI in 2013

Page 52: Scott Edmunds ISMB talk on Big Data Publishing

Thanks to:

Our collaborators: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)

team: Peter Li, Huayan Gao, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman

Case study: Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Oxford), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)

CBIIT

Funding from:

@gigascience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog/

www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com

Page 53: Scott Edmunds ISMB talk on Big Data Publishing

The Big Picture

Types of resources in an RO:
• Data
• Method/Experimental protocol
• Findings

Models to describe each resource type:
• Wfdesc/ISA-TAB/ISA2OWL