Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform


Updates to the GigaDB open access data publishing platform

Jesse Xiao

Jesse@gigasciencejournal.com

ORCID ID:0000-0003-3408-2852

About the Journal

GigaScience is an open access, open data, open peer-review journal focusing on ‘big data’ research from the life and biomedical sciences

What is the point of publishing?

• To disseminate information/knowledge/ideas.

• To present material so it can be reasonably assessed for its level of quality (and interest).

• To gain credit for career advancement.

Kahn, Goodman, & Mittleman. Dragging Scientific Publishing into the 21st Century 2014 http://genomebiology.com/2014/15/12/556

From Journal Delivery to PDF Delivery

Lack of Data and Software Availability Impacts Reproducibility

1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14

2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)

Out of 18 microarray papers, results from 10 could not be reproduced

Deconstructing a paper into accessible, useable, trackable, interlinked units

Need to provide credit to reward sharing and proper organization of:

• Narrative

• Data/Metadata availability/curation

• Software availability

• Interoperability

• Availability of workflows

• Transparent analyses

Data/MetaData

Software

Methods

Narrative

Deconstructing a paper into accessible, useable, trackable, interlinked units

Currently we provide credit for this:

• Narrative

• Data/Metadata availability/curation

• Software availability

• Interoperability

• Availability of workflows

• Transparent analyses

Data/MetaData

Software

Methods

Narrative

Sometimes we publish these as Methods Papers

Beyond the Narrative: Data and Tools

Getting past…

…look but don't touch

Data publishing

http://gigasciencejournal.com/

Launched July 2012. Publishes “Data Notes” for CC0 data. Uses ISA-Tab.

Data publishing

APC covers 1TB storage in GigaDB

FAIR DATA in GigaDB

Findable Accessible Interoperable Reusable

Findable

We have 373 published datasets in GigaDB, totalling around 30 TB of data. Every dataset has a DOI and its own dataset page.
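Because every dataset has a DOI, a dataset page can be reached through the public DOI resolver. The following is a minimal sketch, not GigaDB code: the function names are hypothetical, and the `10.5524` prefix is inferred only from the dataset DOIs quoted elsewhere in this talk (for example DOI:10.5524/100044).

```python
from urllib.parse import quote

# Assumption: GigaDB dataset DOIs share this prefix, as in the DOIs quoted in the talk.
GIGADB_DOI_PREFIX = "10.5524"


def normalise_doi(doi: str) -> str:
    """Strip whitespace and an optional leading 'DOI:' label."""
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    return cleaned


def doi_to_url(doi: str) -> str:
    """Build a https://doi.org/ resolver URL for a DOI string."""
    return "https://doi.org/" + quote(normalise_doi(doi), safe="/")


def is_gigadb_doi(doi: str) -> bool:
    """Hypothetical helper: true if the DOI uses the assumed GigaDB prefix."""
    return normalise_doi(doi).startswith(GIGADB_DOI_PREFIX + "/")


if __name__ == "__main__":
    print(doi_to_url("DOI:10.5524/100044"))
```

The resolver URL redirects to whatever landing page the DOI is registered with, so links built this way should stay valid even if the dataset page moves.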

GigaDB provides a powerful search engine and an API search function, e.g. http://gigadb.org/api/search
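As an illustration of the API search function, a query URL could be assembled as sketched below. Only the endpoint comes from the slide; the `keyword` parameter name and the helper name `build_search_url` are assumptions, so check the API help page before relying on them.

```python
from urllib.parse import urlencode

# Endpoint as given on the slide.
SEARCH_ENDPOINT = "http://gigadb.org/api/search"


def build_search_url(keyword: str) -> str:
    """Return a search URL for one free-text term.

    Assumption: the search term is passed in a query parameter called
    "keyword"; the slide does not name the parameter.
    """
    return SEARCH_ENDPOINT + "?" + urlencode({"keyword": keyword})


if __name__ == "__main__":
    print(build_search_url("RNA extraction"))
```

`urlencode` takes care of escaping spaces and other reserved characters, which avoids hand-building query strings.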

Accessible

All data in GigaDB can be accessed on a public FTP server. We provide three stable FTP sites in two geographic locations (HK & Shenzhen):

1. ftp://penguin.genomics.cn // The main FTP server

2. ftp://ftp.cngb.org/pub/gigadb/ // The mirror FTP server in the cloud

3. ftp://ftp2.cngb.org/pub/gigadb // The mirror FTP server in the cloud
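With three FTP sites available, a client can try them in order and fall back when one is unreachable. The sketch below assumes that a file sits at the same relative path under each listed root, which the slide does not state; the helper names and the example path are hypothetical.

```python
from typing import Callable, Iterable, List, Optional

# FTP roots as listed on the slide.
MIRRORS = [
    "ftp://penguin.genomics.cn",
    "ftp://ftp.cngb.org/pub/gigadb/",
    "ftp://ftp2.cngb.org/pub/gigadb",
]


def candidate_urls(relative_path: str, mirrors: Iterable[str] = MIRRORS) -> List[str]:
    """Join one relative path onto every mirror root.

    Assumption: all mirrors use the same directory layout below their root.
    """
    rel = relative_path.lstrip("/")
    return [root.rstrip("/") + "/" + rel for root in mirrors]


def first_working(urls: Iterable[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL for which probe(url) succeeds, else None.

    The probe is injected so the failover logic can be tested offline.
    """
    for url in urls:
        try:
            if probe(url):
                return url
        except OSError:
            # Connection or lookup failure: move on to the next mirror.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    for url in candidate_urls("example/readme.txt"):
        print(url)
```

In real use the probe could wrap `urllib.request.urlopen`, which accepts `ftp://` URLs and raises an `OSError` subclass when a server cannot be reached.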

Download Speed

We are working with China National Gene Bank and will use UDP-based software (Data Expedition) to provide faster data download speeds.

The source code for all software and tools published in GigaDB can be accessed on GitHub: https://github.com/gigascience

Accessible via API

We provide a REST API that allows users to retrieve and search all metadata held in GigaDB.

The current API returns results in XML (the XML format is based on the database schema). We plan to add the option to return results in JSON or ISA2.0-JSON in the next version.
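Until a JSON option is available, a client can convert the XML response itself. The sketch below shows one generic way to do that with the Python standard library. The sample document is invented for illustration: its element names (`gigadb_entry`, `dataset`, `title`, `doi`) and the title text are assumptions, not the real GigaDB schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical response; the real element names follow the GigaDB database schema.
SAMPLE_XML = """<gigadb_entry>
  <dataset id="1">
    <title>Example dataset</title>
    <doi>10.5524/100044</doi>
  </dataset>
</gigadb_entry>"""


def element_to_dict(elem):
    """Recursively convert an XML element into plain dicts, lists and strings."""
    node = dict(elem.attrib)  # attributes become ordinary keys
    text = (elem.text or "").strip()
    children = list(elem)
    if not children:
        if not node:
            return text  # plain leaf: just its text
        if text:
            node["#text"] = text
        return node
    for child in children:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated tags are collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node


def xml_to_json(xml_text: str) -> str:
    """Parse an XML string and serialise it as indented JSON."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(xml_to_json(SAMPLE_XML))
```

This is a lossy, generic mapping (mixed content and namespaces are not handled), so it is only a stopgap and not a substitute for a schema-aware format such as ISA-JSON.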

Accessible via API

The website http://www.gigadb.org/site/help#0.1_API provides detailed instructions on how to use the GigaDB API.

Interoperable and reusable

Integrating tools (inc. JBrowse genome browser …) to visualize data

First journal with deep integration with

Launched 2nd June 2016

Reward better handling of “wet” protocols…

• Create, share, modify forkable protocols in repo.

• Download & run on smartphone app.

• Get discoverability, credit, DOIs for sharing methods.

• Create your own, or let us set up & you claim.

http://protocols.io/

The GigaDB dataset page embeds the protocols.io protocol in an iframe.

e.g. RNA extraction protocol

Interoperable and reusable

GigaDB provides an online submission wizard and an Excel spreadsheet to help users curate their own metadata.

https://codeocean.com/

Cloud-based executable research platform

Browse, share & run code on AWS

Creates compute capsule: encapsulation of the data, code, and computation environment

Integration into the paper, share via DOIs

First examples published in GigaScience

Integrated plugin into GigaDB

Share your code this way!

Interoperable and reusable

gigagalaxy.net

Reward Sharing of Workflows

Interoperable and reusable

http://www.gigasciencejournal.com/content/3/1/23

http://www.gigasciencejournal.com/content/4/1/19

Virtual Machines/containers

• Downloadable as virtual hard disk/available as Amazon Machine Image

• Now publishing many container (Docker) submissions

Interoperable and reusable

How FAIR can we get?

Data sets

Analyses

Open-Paper

Open-Review

DOI:10.1186/2047-217X-1-18

>50,000 accesses & >1000 citations

Open-Code

7 reviewers tested data in ftp server & named reports published

DOI:10.5524/100044

Open-Pipelines

Open-Workflows

DOI:10.5524/100038

Open-Data

78GB CC0 data

Code on SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/

>40,000 downloads

Enabled the code to be picked apart by bloggers in a wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127612

Quantifying how FAIR we can get

[Figure: diagram with the labels Idea, Study, Data, Metadata, Methods, Software, Analysis (Pipelines), Workflows/Environments, Answer, Publication (three times), and "DOI, etc."; partial caption: "Rewarding the".]

www.gigasciencejournal.com

Give us your data, papers & pipelines

Help GigaPanda make it happen!

Contact us:

editorial@gigasciencejournal.com

database@gigasciencejournal.com

Thanks to:

Laurie Goodman, Editor in Chief

Nicole Nogoy, Editor

Hans Zauner, Assistant Editor

Peter Li, Lead Data Manager

Chris Hunter, Lead BioCurator

Xiao (Jesse) Si Zhe, Database Developer

Chen Qi, Shenzhen Office

All of BGI

Follow us:

@GigaScience

facebook.com/GigaScience

gigasciencejournal.com/blog/

Weibo & WeChat

www.gigasciencejournal.com

www.gigadb.org
