GigaDB explained
Christopher I Hunter
International Training Workshop on Big Data
11-Mar-2015
Presentation contents
• GigaDB introduction
• Data types hosted
• Anatomy of a dataset DOI
• Navigate GigaDB site
• Search tool
• Submission tool
• The extensible metadata schema
--------------------- Coffee Break ---------------------
• ISA tools introduction
• ISA-Tab as an exchange format
• ISA in action
GigaScience Database
Giga-overview
• GigaDB hosts biological data (any type of data related to, or used in, biological studies)
• Primarily associated with the BMC journal, GigaScience
• Funded by BGI-Research and China National GeneBank
• Currently ~160 datasets available
• Genomic datasets represent the majority of data (~70%)
• ~90% of all data from BGI (or partner) studies
• But there are 13 different types represented
• All manually curated
Data types
• Various nucleotide data types:
– Genomic, Transcriptomic, Metagenomic, Epigenomic, Genome mapping.
• Mass spectrometry:
– Proteomics, Metabolomics, MS-Imaging.
• Software & Workflows
• Other
– Imaging, Neuroscience, Network analysis
Navigating the GigaDB website
• Home page
• Dataset DOI page
• Data download options
• Search tool
• Submission:
– Who should submit to GigaDB
– How to submit data
Anatomy of a GigaDB entry
• All relevant information is held together in packets called Datasets
• Each dataset has a stable DOI page
• If required there can be a hierarchy of datasets
• Title
• Study type(s)
• Image
• Citation
• Description
• Funders
• Links to Google Scholar and EuroPMC to see who has cited this dataset
• Email submitter
• Link to manuscript
• Links to external resources
Cont.
• Samples used in the study
• Files listed as part of the study
• History of dataset changes
• Social media links
• Links to other datasets of similar nature
Downloading the data
FTP
• Conventional/easy to use
• Can pull files individually from the web page
• 1 or multiple files using the Unix command line
• Speed = up to 1 Mb/sec
Downloading the data
Aspera
• Requires plugin download
• Only available to use via web-app
• 1 or multiple files
• Speed = up to 100 Mb/sec
– (i.e. up to 100x faster than FTP)
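To make the speed difference concrete, here is a rough best-case estimate of transfer time at the two quoted peak rates. This is only a sketch: the slides do not say whether "Mb" means megabits or megabytes, so the code uses the same unit for file size and rate, and the 50,000 Mb dataset size is a made-up figure for illustration.

```python
def transfer_seconds(size_mb: float, rate_mb_per_sec: float) -> float:
    """Best-case time to move size_mb at a sustained rate (same unit for both)."""
    if rate_mb_per_sec <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mb_per_sec


FTP_PEAK = 1.0       # "up to 1 Mb/sec" from the FTP slide
ASPERA_PEAK = 100.0  # "up to 100 Mb/sec" from the Aspera slide

size = 50_000.0  # hypothetical dataset size, in the same "Mb" unit
ftp_hours = transfer_seconds(size, FTP_PEAK) / 3600
aspera_minutes = transfer_seconds(size, ASPERA_PEAK) / 60
print(f"FTP: {ftp_hours:.1f} h, Aspera: {aspera_minutes:.1f} min")
```

Both rates are quoted as upper bounds, so real transfers will usually take longer than these figures.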
Search tool
• Search for the term “genome” in the search bar at the top of any dataset page:
Search tool
Key to result icons: GigaDB datasets, Samples, Files
It will only display files that contain matches to the search term
Submitting data to GigaDB
• All data submitted to GigaDB must be fully consented for public release
• Where appropriate, data should be submitted to established public archives first (e.g. INSDC)
• At present we only host data associated with GigaScience journal articles, or by prior approval by the Editors of GigaScience.
• Potential submitters should approach the editors and database curators by email to discuss possible inclusion.
Submission Workflow
• Submission: submitter submits an Excel spreadsheet or uses the online wizard
• Validation checks:
– Fail: submitter is provided with an error report
– Pass: dataset is uploaded to GigaDB
• Curator Review: curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues)
• Files: submitter provides files by FTP or Aspera
• DOI assigned: DataCite XML file is generated and registered with DataCite
• Curator makes dataset public (can be set as a future date if required)
• Public GigaDB dataset, e.g. DOI 10.5524/100003, Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
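The example DOI in the workflow, 10.5524/100003, has the usual two-part DOI shape: a registrant prefix (10.5524 here) and a dataset-specific suffix. The sketch below splits a DOI and builds a link through the public doi.org resolver. The function and class names are hypothetical, and the validation is deliberately minimal.

```python
from typing import NamedTuple


class Doi(NamedTuple):
    prefix: str  # registrant part, e.g. "10.5524"
    suffix: str  # dataset-specific part, e.g. "100003"


def parse_doi(doi: str) -> Doi:
    """Split a DOI string into prefix and suffix, accepting an optional 'doi:' label."""
    text = doi.strip()
    if text.lower().startswith("doi:"):
        text = text[4:].strip()
    prefix, sep, suffix = text.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return Doi(prefix, suffix)


def resolver_url(doi: str) -> str:
    """Return the doi.org resolver link for a DOI."""
    parsed = parse_doi(doi)
    return f"https://doi.org/{parsed.prefix}/{parsed.suffix}"


print(resolver_url("10.5524/100003"))
```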
Submission
• Once approved there are two options for submitting metadata:
– offline using an Excel spreadsheet
– online using the wizard
• Soon to be a third option (ISA-tab)
Online vs Offline
Online:
• Guided
• Good for a few large samples
• Allows greater addition of linking
Offline:
• Limited documentation
• Best for large number of samples/files
Submission wizard
Register, log in, go to your profile page:
Add all the links to related data
Add all Sample metadata
Manual check/curate
• After metadata has been submitted it is checked by a curator
• A private upload area is assigned and user can upload data files by Aspera or FTP
Mint the DOI
• Once all the files and metadata are stored and linked appropriately we will mint the DOI with our partners at DataCite.
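As a rough idea of what a DataCite metadata record contains, the sketch below builds a minimal XML document with the core descriptive fields (identifier, creators, titles, publisher, publication year, resource type). It is not GigaDB's actual generator: the schema namespace version, the function name, the creator name and the publisher string are assumptions for illustration. Only the DOI, title and year come from the example dataset in the slides.

```python
import xml.etree.ElementTree as ET

# Assumption: kernel-4 namespace; the version in use for a given record may differ.
DATACITE_NS = "http://datacite.org/schema/kernel-4"


def build_datacite_xml(doi, title, creators, publisher, year):
    """Return a minimal DataCite-style XML string for one dataset."""
    ET.register_namespace("", DATACITE_NS)

    def tag(name):
        return f"{{{DATACITE_NS}}}{name}"

    root = ET.Element(tag("resource"))
    ident = ET.SubElement(root, tag("identifier"), {"identifierType": "DOI"})
    ident.text = doi
    creators_el = ET.SubElement(root, tag("creators"))
    for name in creators:
        creator = ET.SubElement(creators_el, tag("creator"))
        ET.SubElement(creator, tag("creatorName")).text = name
    titles_el = ET.SubElement(root, tag("titles"))
    ET.SubElement(titles_el, tag("title")).text = title
    ET.SubElement(root, tag("publisher")).text = publisher
    ET.SubElement(root, tag("publicationYear")).text = str(year)
    rtype = ET.SubElement(root, tag("resourceType"), {"resourceTypeGeneral": "Dataset"})
    rtype.text = "Dataset"
    return ET.tostring(root, encoding="unicode")


record = build_datacite_xml(
    doi="10.5524/100003",
    title="Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis)",
    creators=["Example Submitter"],      # hypothetical creator name
    publisher="GigaScience Database",    # assumed publisher string
    year=2011,
)
print(record)
```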
Publish the dataset
• Publication date = date on which DOI is released to public.
• Immediately added to GigaDB RSS feed.
• Any other promotion of datasets is done in conjunction with manuscript publication
Behind the scenes
The extensible metadata schema
• Spectrum of data being hosted is very broad
• Database needs to:
– Be structured, but allow wide variety
– Be able to incorporate multiple standards
– Utilise ontologies
– Link to multiple external sources
The GigaDB schema looks like this:
Just the Dataset tables
dataset: id, submitter_id, image_id, identifier, title, description, dataset_size, ftp_site, upload_status, excelfile, excelfile_md5, publication_date, modification_date, publisher_id, token
Store wide variety of attributes
attribute: id, attribute_name, definition, model, structured_comment_name, value_syntax, allowed_units, occurrence, ontology_link, note
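One way to see why a separate attribute table makes the schema extensible: new kinds of metadata become new rows rather than new columns. The sketch below uses an in-memory SQLite database with a subset of the column names listed on the slides. The SQL types, the dataset_attribute linking table and all inserted values are assumptions for illustration, not the real GigaDB schema.

```python
import sqlite3

# Column names follow the slides; types, the column subset, and the
# dataset_attribute linking table are assumptions for illustration.
SCHEMA = """
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    identifier TEXT,
    title TEXT,
    description TEXT,
    publication_date TEXT
);
CREATE TABLE attribute (
    id INTEGER PRIMARY KEY,
    attribute_name TEXT,
    definition TEXT,
    allowed_units TEXT,
    ontology_link TEXT
);
CREATE TABLE dataset_attribute (
    dataset_id INTEGER REFERENCES dataset(id),
    attribute_id INTEGER REFERENCES attribute(id),
    value TEXT
);
"""


def open_demo_db():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def attributes_for(conn, dataset_id):
    """Return (attribute_name, value) pairs attached to one dataset."""
    rows = conn.execute(
        "SELECT a.attribute_name, da.value "
        "FROM dataset_attribute da JOIN attribute a ON a.id = da.attribute_id "
        "WHERE da.dataset_id = ? ORDER BY a.attribute_name",
        (dataset_id,),
    )
    return rows.fetchall()


conn = open_demo_db()
# Placeholder rows only; none of these values come from GigaDB.
conn.execute("INSERT INTO dataset (id, identifier, title) VALUES (1, 'example-identifier', 'Example dataset')")
conn.execute("INSERT INTO attribute (id, attribute_name) VALUES (1, 'example_attribute')")
conn.execute("INSERT INTO dataset_attribute VALUES (1, 1, 'example value')")
print(attributes_for(conn, 1))
```

Adding a new kind of metadata here means inserting a row into attribute, with no change to the dataset table.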
Checklists
• Different things are important in different experiment types
• Various communities have standard checklists they try to adhere to
• GigaDB can leverage those different checklists and integrate them where possible.
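A checklist can be treated as a set of attribute names that a sample record is expected to fill in. The sketch below reports which expected attributes are missing or blank. The checklist contents and sample values are hypothetical and are not quoted from any community standard.

```python
from typing import Iterable, List, Mapping


def missing_attributes(sample: Mapping[str, str], checklist: Iterable[str]) -> List[str]:
    """Return checklist attributes that are absent or blank in the sample."""
    return sorted(
        name for name in checklist
        if not str(sample.get(name, "")).strip()
    )


# Hypothetical checklist and sample; the names are illustrative only.
EXAMPLE_CHECKLIST = ["collection_date", "geographic_location", "sample_source"]
sample = {"collection_date": "2015-03-11", "geographic_location": " "}
print(missing_attributes(sample, EXAMPLE_CHECKLIST))
```

Different experiment types would simply supply different checklists to the same function.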
MIxS
• Genomic Standards Consortium (GSC)
– Minimum Information about any (x) Sequence
http://gensc.org/projects/mixs-gsc-project/
• It includes:
– Set of core descriptors for sequence data
– Set of measurements and observations describing the environment of the sample
– Goes beyond the minimum, by defining ~370 attributes that could be used.
• It is hoped that the adoption of this standard would elevate the quality, accessibility and utility of information that can be collected.
SRA & PX
• The Sequence Read Archive (SRA) and the ProteomeXchange (PX) also both provide specific terms (attributes) that we can map to.
Other checklists
• We are able to include attributes from any model or standard and link that from the attributes table
• So if there is a recommended standard for a particular data-type we can incorporate it.
Ontologies
• Units
• Taxonomy
• Any that are defined in standards
• Common ones in use:
– DOID - Disease ontology
– EFO - Experimental Factor ontology
– SO - Sequence ontology
– UBERON - cross-species ontology of anatomical structures
– ENVO - ENVironment Ontology
Future developments
• Develop an application programming interface (API)
– Including support for ISA format import and export
• Improve dataset DOI display pages
– Include experiment information
• Improve submission wizard
– Include bulk upload tables
• Add ontology look-up automatically
• Integrate other tools
That’s it for GigaDB.
Thanks for listening!
Any Questions?
Next up, ISA tools.
ISA tools
Christopher I Hunter
International Training Workshop on Big Data
11-Mar-2015
What is ISA?
Investigation
Study (and/or Sample)
Assay
What is ISA-tab?
• ISA-tab is a general purpose, domain agnostic, flexible format to describe multi-omic experiments.
• It can be used as a submission format to some archives and there is a suite of tools for conversion into other common formats.
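An ISA-tab archive is a set of tab-delimited text files whose names conventionally start with i_ (investigation), s_ (study) or a_ (assay). The sketch below groups a list of file names by that convention. The function name and the example file names are hypothetical.

```python
from collections import defaultdict

# Conventional ISA-tab file name prefixes.
ISA_PREFIXES = {"i_": "investigation", "s_": "study", "a_": "assay"}


def classify_isa_files(filenames):
    """Group file names into investigation/study/assay/other by name prefix."""
    groups = defaultdict(list)
    for name in filenames:
        kind = ISA_PREFIXES.get(name[:2]) if name.endswith(".txt") else None
        groups[kind or "other"].append(name)
    return dict(groups)


# Hypothetical file names for illustration.
files = ["i_investigation.txt", "s_study.txt", "a_assay.txt", "reads.fastq.gz"]
print(classify_isa_files(files))
```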
What are ISA-tools?
• A suite of tools based on the ISA-tab format
• Developed and maintained by a team at Oxford University, UK.
• The main tool of interest here is the ISA-creator
Live demo
Don’t panic, I have screen shots if it all goes wrong!
The Ontology lookup function
ISA Validation tool
• Default only checks ISA-tabs are formed correctly
• Can be configured to check against checklists
ISA converter tool
• The development team actively work on new converter tools
• And are always happy to work with domain experts to make more
The ISA team
• Susanna Sansone
• Philippe Rocca-Serra
• Alejandra Gonzalez-Beltran
http://www.isa-tools.org/
That’s it for ISA.
Thanks for listening!
Any Questions?
Next up, GigaScience software and workflows.