GigaDB explained Christopher I Hunter International Training Workshop on Big Data 11-Mar-2015

GigaDB explained

Christopher I Hunter
International Training Workshop on Big Data

11-Mar-2015

Presentation contents

• GigaDB introduction
• Data types hosted
• Anatomy of a dataset DOI
• Navigate GigaDB site
• Search tool
• Submission tool
• The extensible metadata schema

--------------------- Coffee Break ---------------------

• ISA tools introduction
• ISA-Tab as an exchange format
• ISA in action

GigaScience Database

Giga-overview

• GigaDB hosts biological data (any type of data related to, or used in, biological studies)

• Primarily associated with the BMC journal, GigaScience

• Funded by BGI-Research and China National GeneBank

• Currently ~160 datasets available
• Genomic datasets represent the majority of data (~70%)
• ~90% of all data is from BGI (or partner) studies
• But there are 13 different types represented
• All manually curated

Data types

• Various nucleotide data types:
– Genomic, Transcriptomic, Metagenomic, Epigenomic, Genome mapping
• Mass spectrometry:
– Proteomics, Metabolomics, MS-Imaging
• Software & Workflows
• Other:
– Imaging, Neuroscience, Network analysis

Navigating the GigaDB website

• Home page
• Dataset DOI page
• Data download options
• Search tool
• Submission:
– Who should submit to GigaDB
– How to submit data

Anatomy of a GigaDB entry

• All relevant information is held together in packets called Datasets

• Each dataset has a stable DOI page

• If required there can be a hierarchy of datasets
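
Because each dataset is identified by a DOI, its page can be reached through a DOI resolver. The following is a minimal sketch, assuming the public resolver at https://doi.org/ and using the example DOI 10.5524/100003 quoted in the submission workflow slide; the function name is illustrative and not part of GigaDB.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # assumed entry point: the public DOI resolver


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI such as '10.5524/100003' into a resolvable URL."""
    doi = doi.strip()
    # Accept the common 'doi:' prefix as well as a bare DOI.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep the '/' between prefix and suffix, escape anything unusual.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.5524/100003"))
```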

• Title
• Study type(s)
• Image
• Citation
• Description
• Funders
• Links to Google Scholar and EuroPMC to see who has cited this dataset
• Email submitter
• Link to manuscript
• Links to external resources

Cont.

• Samples used in the study

• Files listed as part of the study

• History of dataset changes

• Social media links

• Links to other datasets of similar nature

Downloading the data

FTP
• Conventional/easy to use
• Can pull files individually from web page
• 1 or multiple files using the Unix command line
• Speed = up to 1 Mb/sec
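
For scripted downloads of several files, the same FTP route can be driven from Python's standard `ftplib`. This is only a sketch: the host and path in the commented usage line are hypothetical placeholders, anonymous login is an assumption, and the helper names are invented for illustration.

```python
from ftplib import FTP
from pathlib import PurePosixPath
from urllib.parse import urlparse


def split_ftp_url(url: str) -> tuple[str, str, str]:
    """Split an ftp:// URL into (host, directory, file name)."""
    parts = urlparse(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError(f"not an FTP URL: {url!r}")
    path = PurePosixPath(parts.path)
    return parts.hostname, str(path.parent), path.name


def download(urls: list[str]) -> None:
    """Fetch each file into the current directory, one connection per file."""
    for url in urls:
        host, directory, name = split_ftp_url(url)
        with FTP(host) as ftp:
            ftp.login()  # anonymous login (assumed to be permitted)
            ftp.cwd(directory)
            with open(name, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)


# Hypothetical URL, for illustration only:
# download(["ftp://ftp.example.org/pub/100003/readme.txt"])
```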

Downloading the data

Aspera
• Requires plugin download
• Only available to use via web-app
• 1 or multiple files
• Speed = up to 100 Mb/sec
– (i.e. up to 100x faster than FTP)
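
To see what the quoted speeds mean in practice, here is a small worked calculation. It assumes "Mb" means megabits and uses decimal units (the slides do not say which); the 10 GB dataset size is a made-up example.

```python
def transfer_hours(size_gb: float, speed_mbit_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes at speed_mbit_per_s megabits/second."""
    size_mbit = size_gb * 1000 * 8  # GB -> MB -> megabits (decimal units)
    return size_mbit / speed_mbit_per_s / 3600


# Peak speeds quoted on the slides: FTP up to 1 Mb/sec, Aspera up to 100 Mb/sec.
for label, speed in [("FTP", 1), ("Aspera", 100)]:
    print(f"{label}: {transfer_hours(10, speed):.2f} h")
```

Under these assumptions the hypothetical 10 GB dataset takes roughly 22 hours at the FTP peak speed and about 13 minutes at the Aspera peak speed.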

Search tool

• Search for the term “genome” in the search bar at the top of any dataset page:

Key to result icons: GigaDB datasets, Samples, Files

It will only display files that contain matches to the search term

Submitting data to GigaDB

• All data submitted to GigaDB must be fully consented for public release

• Where appropriate, data should be submitted to established public archives first (e.g. INSDC)

• At present we only host data associated with GigaScience journal articles, or by prior approval by the Editors of GigaScience.

• Potential submitters should approach the editors and database curators by email to discuss possible inclusion.

Submission Workflow

• Submission: submitter submits Excel spreadsheet or uses online wizard
• Validation checks:
– Fail: submitter is provided with an error report
– Pass: dataset is uploaded to GigaDB
• Curator review: curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues)
• DOI assigned
• Files: submitter provides files by FTP or Aspera
• DataCite XML file: XML is generated and registered with DataCite
• Curator makes dataset public (can be set as a future date if required)
• Public GigaDB dataset, e.g. DOI 10.5524/100003, Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
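
The stages of the workflow can be read as a small state machine. The sketch below is one possible reading of the diagram, not GigaDB's actual code: the stage names are paraphrased from the diagram labels, and the allowed transitions (including resubmission after a failed validation) are assumptions.

```python
from enum import Enum, auto


class Stage(Enum):
    SUBMITTED = auto()        # spreadsheet or wizard metadata received
    REJECTED = auto()         # validation failed, error report sent
    UPLOADED = auto()         # validation passed, metadata in GigaDB
    CURATOR_REVIEW = auto()   # curator contacts submitter
    DOI_ASSIGNED = auto()     # DOI assigned, files transferred
    REGISTERED = auto()       # XML generated and registered with DataCite
    PUBLIC = auto()           # curator releases the dataset


# Allowed moves between stages (an assumed reading of the diagram).
NEXT = {
    Stage.SUBMITTED: {Stage.REJECTED, Stage.UPLOADED},
    Stage.REJECTED: {Stage.SUBMITTED},  # submitter fixes errors and resubmits
    Stage.UPLOADED: {Stage.CURATOR_REVIEW},
    Stage.CURATOR_REVIEW: {Stage.DOI_ASSIGNED},
    Stage.DOI_ASSIGNED: {Stage.REGISTERED},
    Stage.REGISTERED: {Stage.PUBLIC},
    Stage.PUBLIC: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move to target if the workflow allows it, otherwise raise."""
    if target not in NEXT[current]:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target


stage = advance(Stage.SUBMITTED, Stage.UPLOADED)
print(stage.name)
```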

Submission

• Once approved there are two options for submitting metadata:
– offline using an Excel spreadsheet
– online using the wizard

• Soon to be a third option (ISA-tab)

Online vs Offline

Online
• Guided
• Good for few large samples
• Allows greater addition of linking

Offline
• Limited documentation
• Best for large number of samples/files

Submission wizard

Register, log in, go to your profile page:

Add all the links to related data

Add all Sample metadata

Manual check/curate

• After metadata has been submitted it is checked by a curator

• A private upload area is assigned and user can upload data files by Aspera or FTP

Mint the DOI

• Once all the files and metadata are stored and linked appropriately, we will mint the DOI with our partners at DataCite.
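
As an illustration of the kind of XML record involved, the sketch below builds a simplified DataCite-style document with Python's standard library. The element names follow the mandatory DataCite properties as commonly documented (identifier, creators, titles, publisher, publicationYear), but the record omits the schema namespace and is not validated; the creator name is a placeholder and the publisher string is an assumption.

```python
import xml.etree.ElementTree as ET


def datacite_xml(doi: str, title: str, creator: str, publisher: str, year: int) -> str:
    """Build a simplified DataCite-style record (no namespace, not schema-validated)."""
    resource = ET.Element("resource")
    ident = ET.SubElement(resource, "identifier", identifierType="DOI")
    ident.text = doi
    creators = ET.SubElement(resource, "creators")
    ET.SubElement(ET.SubElement(creators, "creator"), "creatorName").text = creator
    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = title
    ET.SubElement(resource, "publisher").text = publisher
    ET.SubElement(resource, "publicationYear").text = str(year)
    return ET.tostring(resource, encoding="unicode")


record = datacite_xml(
    doi="10.5524/100003",
    title="Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis)",
    creator="Example Submitter",        # placeholder, not the real creator list
    publisher="GigaScience Database",   # assumed publisher string
    year=2011,
)
print(record)
```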

Publish the dataset

• Publication date = date on which DOI is released to public.

• Immediately added to GigaDB RSS feed.

• Any other promotion of datasets is done in conjunction with manuscript publication

Behind the scenes

The extensible metadata schema

• Spectrum of data being hosted is very broad
• Database needs to:
– Be structured, but allow wide variety
– Be able to incorporate multiple standards
– Utilise ontologies
– Link to multiple external sources

The GigaDB schema looks like this:

Just the Dataset tables

Columns of the dataset table: id, submitter_id, image_id, identifier, title, description, dataset_size, ftp_site, upload_status, excelfile, excelfile_md5, publication_date, modification_date, publisher_id, token

Store wide variety of attributes

Columns of the attribute table: id, attribute_name, definition, model, structured_comment_name, value_syntax, allowed_units, occurrence, ontology_link, note
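
One way to picture how a generic attribute table keeps the schema extensible is that new descriptors become rows rather than new columns. The sketch below uses Python's built-in `sqlite3` with only a handful of the columns listed on the slides; the linking table `dataset_attributes`, the column types, and all sample values are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    identifier TEXT,
    title TEXT
);
CREATE TABLE attribute (
    id INTEGER PRIMARY KEY,
    attribute_name TEXT,
    definition TEXT,
    allowed_units TEXT,
    ontology_link TEXT
);
-- Assumed linking table: one row per (dataset, attribute, value).
CREATE TABLE dataset_attributes (
    dataset_id INTEGER REFERENCES dataset(id),
    attribute_id INTEGER REFERENCES attribute(id),
    value TEXT
);
""")

# Hypothetical rows: adding a new kind of descriptor needs no schema change.
con.execute("INSERT INTO dataset VALUES (1, '100003', 'Example dataset')")
con.execute(
    "INSERT INTO attribute VALUES (1, 'read_length', 'Length of sequencing reads', 'bp', NULL)"
)
con.execute("INSERT INTO dataset_attributes VALUES (1, 1, '100')")

rows = con.execute("""
    SELECT d.identifier, a.attribute_name, da.value, a.allowed_units
    FROM dataset_attributes AS da
    JOIN dataset AS d ON d.id = da.dataset_id
    JOIN attribute AS a ON a.id = da.attribute_id
""").fetchall()
print(rows)
```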

Checklists

• Different things are important in different experiment types

• Various communities have standard checklists they try to adhere to

• GigaDB can leverage those different checklists and integrate them where possible.
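
Checking a sample against a checklist can be as simple as asking which required attributes are missing or empty. This is a minimal sketch of that idea; the checklist contents, the attribute names, and the sample values are illustrative only and do not reproduce any particular community standard.

```python
def missing_fields(sample: dict[str, str], checklist: set[str]) -> set[str]:
    """Return checklist attributes that are absent or empty in the sample."""
    return {name for name in checklist if not sample.get(name, "").strip()}


# Illustrative checklist and sample; names and values are examples only.
CHECKLIST = {"collection_date", "geo_loc_name", "lat_lon"}
sample = {
    "collection_date": "2014-06-01",
    "geo_loc_name": "China: Shenzhen",
    "lat_lon": "",  # present but empty, so it counts as missing
}

print(missing_fields(sample, CHECKLIST))
```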

MIxS

• Genomic Standards Consortium (GSC)
– Minimum Information about any (x) Sequence

http://gensc.org/projects/mixs-gsc-project/

• It includes:
– A set of core descriptors for sequence data
– A set of measurements and observations describing the environment of the sample
– Goes beyond the minimum, by defining ~370 attributes that could be used.
• It is hoped that the adoption of this standard will elevate the quality, accessibility and utility of information that can be collected.

SRA & PX

• The Sequence Read Archive (SRA) and the ProteomeXchange (PX) also both provide specific terms (attributes) that we can map to.

Other checklists

• We are able to include attributes from any model or standard and link that from the attributes table

• So if there is a recommended standard for a particular data-type we can incorporate it.

Ontologies

• Units
• Taxonomy
• Any that are defined in standards
• Common ones in use:
– DOID - Disease ontology
– EFO - Experimental Factor ontology
– SO - Sequence ontology
– UBERON - cross-species ontology of anatomical structures
– ENVO - ENVironment Ontology
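
Ontology terms are commonly written as a prefix and a local identifier separated by a colon. The sketch below checks that a term uses one of the prefixes listed above; the `PREFIX:ID` form is an assumption about how terms are stored, and the identifier in the usage line is a made-up value that only shows the format.

```python
# Prefixes listed on the slide, mapped to the ontology names given there.
ONTOLOGIES = {
    "DOID": "Disease ontology",
    "EFO": "Experimental Factor ontology",
    "SO": "Sequence ontology",
    "UBERON": "cross-species ontology of anatomical structures",
    "ENVO": "ENVironment Ontology",
}


def parse_term(term_id: str) -> tuple[str, str]:
    """Split 'PREFIX:LOCAL_ID' and check the prefix is a known ontology."""
    prefix, sep, local_id = term_id.partition(":")
    if not sep or not local_id:
        raise ValueError(f"expected PREFIX:ID, got {term_id!r}")
    if prefix.upper() not in ONTOLOGIES:
        raise ValueError(f"unknown ontology prefix: {prefix!r}")
    return prefix.upper(), local_id


# Made-up identifier, used only to show the format.
prefix, local_id = parse_term("ENVO:00000001")
print(prefix, "=", ONTOLOGIES[prefix])
```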

Future developments

• Develop an Application Programming Interface (API)
– Including support for ISA format import and export
• Improve dataset DOI display pages
– Include experiment information
• Improve submission wizard
– Include bulk upload tables
• Add ontology look-up automatically
• Integrate other tools

That’s it for GigaDB.

Thanks for listening!

Any Questions?

Next up, ISA tools.

ISA tools

Christopher I Hunter
International Training Workshop on Big Data

11-Mar-2015

What is ISA?

Investigation

Study (and/or Sample)

Assay
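
The three levels nest: an investigation groups studies, and each study has samples and assays. A minimal sketch of that hierarchy as Python dataclasses follows; the field names and example values are chosen for illustration and are not taken from the ISA specification.

```python
from dataclasses import dataclass, field


@dataclass
class Assay:
    measurement: str   # what was measured
    technology: str    # how it was measured


@dataclass
class Study:
    title: str
    samples: list[str] = field(default_factory=list)
    assays: list[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: list[Study] = field(default_factory=list)


# Hypothetical example values.
inv = Investigation(
    title="Example investigation",
    studies=[
        Study(
            title="Example study",
            samples=["sample_1", "sample_2"],
            assays=[Assay("transcription profiling", "nucleotide sequencing")],
        )
    ],
)
print(len(inv.studies), len(inv.studies[0].samples), len(inv.studies[0].assays))
```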

What is ISA-tab?

• ISA-tab is a general purpose, domain agnostic, flexible format to describe multi-omic experiments.

• It can be used as a submission format to some archives, and there is a suite of tools for conversion into other common formats.
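
ISA-tab files are tab-delimited text, so they can be written and read with ordinary tools. The sketch below writes a tiny sample table and reads it back using Python's `csv` module. The column headers follow the ISA-tab naming pattern as I understand it, the rows are invented, and the result is not a complete or validated ISA-tab file.

```python
import csv
import io

HEADER = ["Source Name", "Characteristics[organism]", "Sample Name"]
ROWS = [
    ["source_1", "Macaca fascicularis", "sample_1"],
    ["source_2", "Macaca fascicularis", "sample_2"],
]

# Write the table as tab-delimited text (in memory instead of a file).
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(HEADER)
writer.writerows(ROWS)

# Read it back as one dict per sample row.
buffer.seek(0)
samples = list(csv.DictReader(buffer, delimiter="\t"))
print(samples[0]["Sample Name"])
```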

What are ISA-tools?

• A suite of tools based on the ISA-tab format
• Developed and maintained by a team at Oxford University, UK.
• The main tool of interest here is the ISA-creator

Live demo

Don’t panic, I have screenshots if it all goes wrong!

The Ontology lookup function

ISA Validation tool

• Default only checks ISA-tabs are formed correctly

• Can be configured to check against checklists

ISA converter tool

• The development team actively work on new converter tools

• And are always happy to work with domain experts to make more

The ISA team

• Susanna Sansone
• Philippe Rocca-Serra
• Alejandra Gonzalez-Beltran

http://www.isa-tools.org/

That’s it for ISA.

Thanks for listening!

Any Questions?

Next up, GigaScience software and workflows.