This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
What was the plan? A role for data standards, models and computational
workflows in scholarly data publishing
Alejandra González-Beltrán, PhD Philippe Rocca-Serra, PhD Oxford e-Research Centre, University of Oxford
{alejandra.gonzalezbeltran,philippe.rocca-serra}@oerc.ox.ac.uk
ISMB Workshop: What Bioinformaticians need to know about
digital publishing beyond the PDF2
July 15th, 2014, Boston, USA
The experimental workflow
[Figure: the data scientist's workflow, with stages labeled Planning, Data Collection, Data Management, Analysis, Visualization and Publication, and two starting points: use existing data or perform new experiment]
The experimental workflow
[Same workflow figure, now labeled: metadata]
The experimental workflow
[Same workflow figure, now labeled: Data Interoperability, Reproducibility, Data Review]
The experimental workflow
[Same workflow figure, now labeled: Data Reusability]
The experimental plan - life sciences case
• experimental design
• sample characteristic(s)
• experimental variable(s)
2-week systemic rat study using male Wistar rats (N=15 per dose group)
14 proprietary drug candidates from participating companies and 2 reference toxic compounds
InnoMed PredTox Project
The experimental plan - life sciences case
• experimental design
• sample characteristic(s)
• experimental variable(s)
• technology(s)
• measurement(s)
• protocol(s)
• data file(s)
• …
The experimental plan - computational case
• open peer-review
• availability of
  • data
  • analysis scripts
  • documentation
Evaluation of the SOAPdenovo2 tool for the de novo assembly of genomes from short DNA sequence reads produced by next-generation sequencing; SOAPdenovo2 implements improvements over the SOAPdenovo1 assembler.
Predictor Variables (Factor Name, Factor Type)
• genome assembly algorithm
• genome size
The experimental plan - computational case
Predictor Variables (Factor Name, Factor Type)
• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG
• genome size: bacterial genome, insect genome, human genome
The experimental plan - computational case
Predictor Variables (Factor Name, Factor Type): genome assembly algorithm, genome size
3x3 factorial design, 9 study groups:
• SOAPdenovo2: bacterial genome, insect genome, human genome
• SOAPdenovo1: bacterial genome, insect genome, human genome
• ALL-PATHS-LG: bacterial genome, insect genome, human genome
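The 3x3 design can be enumerated mechanically by crossing the levels of the two predictor variables. The following is a minimal sketch, assuming the factor names and levels listed on the slides; the names `FACTORS`, `full_factorial` and `study_groups` are illustrative and do not come from the authors' tooling.

```python
from itertools import product

# Factor names and levels as listed on the slides.
FACTORS = {
    "genome assembly algorithm": ["SOAPdenovo2", "SOAPdenovo1", "ALL-PATHS-LG"],
    "genome size": ["bacterial genome", "insect genome", "human genome"],
}


def full_factorial(factors):
    """Return one dict per study group: every combination of factor levels."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


study_groups = full_factorial(FACTORS)

# Print the study groups, numbered from 1.
for number, group in enumerate(study_groups, start=1):
    print(number, group)
```

Adding a third factor, or another level to an existing one, only requires a new entry in `FACTORS`; the number of study groups is the product of the level counts.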
The experimental plan - computational case
Genomes assembled, by genome size level:
• bacterial genome: S. aureus, R. sphaeroides
• insect genome: B. impatiens
• human genome: Chinese Han genome (or YH genome)
The experimental plan - computational case
Response Variables
• genome coverage
• computation run time
• memory consumption
http://www.ama-rochester.org/WP/wp-content/uploads/2013/01/three-pillars.png
A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or tools) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:
• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
• also by communities working to build a library of cellular signatures
General-purpose, configurable format designed to support:
• description of the experimental metadata, making the annotation explicit and discoverable
• provenance tracking
• use of community standards, such as minimal reporting guidelines and terminologies
• conversion to a growing number of other metadata formats, e.g. those used by the European Bioinformatics Institute (EBI) repositories
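As a rough illustration of the tabular side of such a format, the sketch below reads a small tab-delimited table with Python's standard `csv` module. The column headers imitate ISA-Tab naming (`Source Name`, `Characteristics[...]`, `Unit`, `Sample Name`) but are an assumption here, and the row values borrow the H1/H2 example that appears in the slides. This is not the ISA tools' own parser, and `read_table` and `characteristics` are hypothetical helper names.

```python
import csv
import io

# Hypothetical study table; the headers imitate ISA-Tab conventions (an assumption).
STUDY_TABLE = (
    "Source Name\tCharacteristics[organism]\tCharacteristics[age]\tUnit\tSample Name\n"
    "H1\tH. Sapiens\t35\tYears\tH1.sample1\n"
    "H1\tH. Sapiens\t35\tYears\tH1.sample2\n"
    "H2\tH. Sapiens\t33\tYears\tH2.sample1\n"
)


def read_table(text):
    """Parse tab-delimited text into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def characteristics(row):
    """Collect the Characteristics[...] columns of one row into a plain dict."""
    prefix, suffix = "Characteristics[", "]"
    return {
        key[len(prefix):-len(suffix)]: value
        for key, value in row.items()
        if key.startswith(prefix) and key.endswith(suffix)
    }


rows = read_table(STUDY_TABLE)
for row in rows:
    print(row["Source Name"], "->", row["Sample Name"], characteristics(row))
```

Keeping the annotation in named columns is what makes it explicit and discoverable: any tool that understands the header convention can recover which value describes which property.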
[Figure: a table of experimental metadata and the graph it encodes. Sources: H1 (H. Sapiens, 35 Years) and H2 (H. Sapiens, 33 Years). Samples: H1.sample1, H1.sample2, H2.sample1. Labeling yields H1.sample1.labeled and H2.sample1.labeled. Scanning yields the data files h1-s1.cel, h1-s2.cel and h2-s1.cel. ...]
[Same figure, annotated with ontology terms: obi:material entity, obi:material sample, obi:material processing, obi:processed material, obi:planned process, isa:raw data file, bfo:derives from]
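One way to picture the step from table rows to an annotated graph is to emit a subject-predicate-object triple for each sample and the source it comes from. The sketch below is an illustration only: the slides do not show how the ISA conversion actually assigns these terms, so using "bfo:derives from" to link a sample to its source, the pairing of sources and samples by their identifiers, and the helper names `to_triples` and `children_by_source` are all assumptions.

```python
from collections import defaultdict

# (source, sample) pairs read off the slide's example; pairing them by
# identifier (H1 -> H1.sample1, ...) is an assumption.
ROWS = [
    ("H1", "H1.sample1"),
    ("H1", "H1.sample2"),
    ("H2", "H2.sample1"),
]

# Predicate label taken from the slide; its use for this link is illustrative.
DERIVES_FROM = "bfo:derives from"


def to_triples(rows):
    """Return one (subject, predicate, object) triple per sample-to-source link."""
    return [(sample, DERIVES_FROM, source) for source, sample in rows]


def children_by_source(triples):
    """Group samples under the source they derive from.

    A source repeated over several table rows becomes a single graph node.
    """
    graph = defaultdict(list)
    for subject, _predicate, obj in triples:
        graph[obj].append(subject)
    return dict(graph)


triples = to_triples(ROWS)
graph = children_by_source(triples)

for triple in triples:
    print(triple)
print(graph)
```

The table repeats H1 on two rows, whereas the graph holds it once with two outgoing links, which is the difference between the two views in the figure.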
http://gigasciencejournal.com
http://gigadb.org/dataset/100035
Experimental metadata or structured component (in-house curated, machine-readable formats)
Article or narrative component (PDF and HTML)
A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these.
Credit for sharing your data
Focused on reuse and reproducibility
Peer reviewed, curated
Promoting Community Data Repositories
Open Access
SOAPdenovo2
http://isa-tools.github.io/soapdenovo2
Galaxy workflows to re-enact the data analysis
Nanopub: represents structured data along with its provenance in a single publishable and citable entity
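A nanopublication is commonly described as three linked parts: an assertion, its provenance, and publication information. The sketch below models that shape with a plain Python dataclass, as an illustration under stated assumptions: every `ex:` identifier, the predicate names and the class `Nanopub` are hypothetical placeholders, and the assertion paraphrases the genome coverage claim quoted in the slides. It is not the RDF serialization used by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopub:
    """A citable unit bundling a claim with where it came from."""

    uri: str
    assertion: list    # triples stating the claim itself
    provenance: list   # triples about how the assertion was obtained
    pubinfo: list      # triples about the nanopublication as a whole

    def is_complete(self):
        """True only when all three parts carry at least one triple."""
        return all([self.assertion, self.provenance, self.pubinfo])


# All identifiers below are hypothetical placeholders.
coverage_claim = Nanopub(
    uri="ex:nanopub/1",
    assertion=[
        ("ex:SOAPdenovo2_human_genome_coverage",
         "ex:greaterThan",
         "ex:SOAPdenovo1_human_genome_coverage"),
    ],
    provenance=[
        ("ex:nanopub/1#assertion", "ex:derivedFrom", "ex:published_assembly_results"),
    ],
    pubinfo=[
        ("ex:nanopub/1", "ex:createdBy", "ex:some_curator"),
    ],
)

print(coverage_claim.is_complete())
```

Treating the three parts as required fields reflects the point of the format: a claim without provenance or attribution would not be citable on its own.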
ResearchObject: enables the aggregation of the digital resources contributing to findings of computational research, including results, data and software, as citable compound digital objects
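The aggregation idea can be sketched as a manifest that lists each contributing resource under one identifier. This is a minimal illustration, not the Research Object model's actual vocabulary: the keys `id`, `aggregates`, `uri` and `type`, the function `build_manifest`, and every `ex:` identifier are hypothetical.

```python
import json


def build_manifest(identifier, resources):
    """Aggregate (uri, kind) pairs into one manifest dict, rejecting duplicates."""
    seen = set()
    aggregated = []
    for uri, kind in resources:
        if uri in seen:
            raise ValueError(f"duplicate resource: {uri}")
        seen.add(uri)
        aggregated.append({"uri": uri, "type": kind})
    return {"id": identifier, "aggregates": aggregated}


# Placeholder identifiers for the three kinds of resource named on the slide.
manifest = build_manifest(
    "ex:soapdenovo2-research-object",
    [
        ("ex:results/assembly-statistics", "results"),
        ("ex:data/sequencing-reads", "data"),
        ("ex:software/galaxy-workflow", "software"),
    ],
)

print(json.dumps(manifest, indent=2))
```

Because the manifest has its own identifier, the whole bundle can be cited as one compound object while each aggregated resource stays individually addressable.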
Reproducing SOAPdenovo2 results Galaxy workflows
S. aureus pipeline
Reproducing SOAPdenovo2 results Galaxy workflows
“genome coverage increased over the human data when comparing SOAPdenovo2 against SOAPdenovo1”
Response Variables
genome coverage
computation run time
memory consumption
OntoMaton: a Bioportal powered Ontology widget for Google Spreadsheets. Maguire et al, 2013, Bioinformatics
widget for ontology annotation and tagging on Google spreadsheets, relying on BioPortal and Linked Open Vocabularies services
NanoMaton https://github.com/ISA-tools/NanoMaton
Ontology for Biomedical Investigations
Semanticscience Integrated Ontology
FAIR data: Findable, Accessible, Interoperable, Reusable
Contributing to Metabolights and ISA
• BBSRC UK-China Award & BGI funded Hackathon
• venue: BGI Hong-Kong
• Participants:
  • Metabolights/BGI/ISA/Birmingham/Hong-Kong University
• Outcome:
  • ISAtab web viewer code
  • Functional Specifications & Code for DoE Wizard API
  • 4 datasets coded in ISA format
  • Conversion of Metabolights datasets to RDF
funders
acknowledgements
Scott Edmunds, GigaScience
Peter Li, GigaScience
Jun Zhao, Lancaster University
María Susana Avila García, Oxford University
Marco Roos, Leiden University
Mark Thompson, Leiden University
Ruibang Luo, University of Hong Kong
Tin-Lap Lee, Chinese University of Hong Kong
Tak-wah Lam, University of Hong Kong
Questions? You can email us...
View our blog http://isatools.wordpress.com
Follow us on Twitter @isatools
View our websites
View our Git repo & contribute http://github.com/ISA-tools
Thanks for your attention!