Data Sharing Infrastructures to Foster Data Reuse
David [email protected] @NuDataScientist
Integrating Large Data into Plant Science workshop, 21st April 2016
Philippe Rocca-Serra, PhD, Senior Research Lecturer
Alejandra Gonzalez-Beltran, PhD, Research Lecturer
Milo Thurston, PhD, Research Software Engineer
Massimiliano Izzo, PhD, Research Software Engineer
Peter McQuilton, PhD, Knowledge Engineer
Allyson Lister, PhD, Knowledge Engineer
Eamonn Maguire, DPhil, Software Engineer contractor
David Johnson, PhD, Research Software Engineer
Susanna-Assunta Sansone, PhD, Principal Investigator, Associate Director (consultant for Nature Publishing Group)
Our main areas of research and activity:
• Data collection, curation, representation etc.
• Data publication
• Data provenance
• Development of software, infrastructure
• Open, community ontologies and standards
• Semantic web
• Training
Communities we work with/for:
Notes in Lab Books (information for humans)
Spreadsheets and Tables (the compromise)
Facts as RDF statements (information for machines)
Notes and narrative; spreadsheets and tables; linked data and data publication
Enabling reproducible research and open science, driving science and discoveries
Increase the level of annotation at the source, tracking provenance and using community standards
Maximize data discoverability and reuse
Applied research approach
Two well-established products with a large user base, embedded in many funded projects
Several community-driven ontologies and other standards, embedded in many funded projects
Community-developed content standards
• To structure, enrich and report the description of the datasets and the experimental context under which they were produced
In the life sciences there are > 600 content standards
[Figure: landscape of content standards. Counts shown: 86, 349, 200. Labels: de jure, de facto; grass-roots groups; standard organizations; Nanotechnology Working Group]
Guidelines, e.g.: MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT, …
Formats, e.g.: MAGE-Tab, GCDML, SRA XML, SOFT, FASTA, DICOM, MzML, SBRML, SED-ML, GELML, ISAtab, CML, MITAB, nmrML, ISA-JSON, …
Terminologies, e.g.: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO, CO, …
Databases and tools implementing standards; also training material on and around standards
Mapping the landscape of ‘standards’ in the life sciences
A web-based, curated and searchable registry ensuring that standards and databases are registered, informative and discoverable; monitoring development and evolution of standards, their use in databases and adoption of both in data policies
1,400 records and growing
Also operating as a WG in […]; run at […]; also a contribution to […]
Is there a database, implementing standards, where I can deposit my metagenomics dataset?
My funder's data sharing policy recommends the use of established standards, but which ones are widely endorsed and applicable to my toxicological and clinical data?
Am I using the most up-to-date version of this terminology to annotate cell-based assays?
I understand this format has been deprecated; what has it been replaced by and who is leading the work?
Are there databases implementing this exchange format, whose development we have funded?
What are the mature standards and standards-compliant databases we should recommend to our authors?
But how do we help users to make informed decisions?
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project
Search and filter to find what is relevant to your type of data
From simple and advanced search interfaces to… the recommender system
Powered by curated descriptions of each standard and database record, and their relations
Tracking evolution, e.g. deprecations and substitutions
Cross-linking standards to standards and databases
Model/format formalizing reporting guideline -->
<-- Reporting guideline used by model/format
We link (descriptions of) standards to related standards and to the databases implementing them
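The tracking and cross-linking ideas above can be sketched as a small data structure. This is only an illustration: the `Record` schema, the field names and the record names (`FormatA v1`, `GuidelineX`, `DatabaseD`) are hypothetical and are not taken from the registry itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Record:
    """One registry entry: a standard or a database (hypothetical schema)."""
    name: str
    kind: str                           # "guideline", "format", "terminology" or "database"
    deprecated: bool = False
    replaced_by: Optional[str] = None   # name of the substituting record, if any
    related: List[str] = field(default_factory=list)  # cross-links to other records


def current_version(registry: Dict[str, Record], name: str) -> Record:
    """Follow deprecation/substitution links until a live record is reached."""
    seen = set()
    record = registry[name]
    while record.deprecated and record.replaced_by:
        if record.name in seen:         # guard against circular substitutions
            raise ValueError(f"substitution cycle at {record.name}")
        seen.add(record.name)
        record = registry[record.replaced_by]
    return record


def databases_implementing(registry: Dict[str, Record], standard: str) -> List[str]:
    """List databases whose cross-links mention the given standard."""
    return sorted(r.name for r in registry.values()
                  if r.kind == "database" and standard in r.related)


# Made-up records, for illustration only.
REGISTRY = {r.name: r for r in [
    Record("FormatA v1", "format", deprecated=True, replaced_by="FormatA v2"),
    Record("FormatA v2", "format", related=["GuidelineX"]),
    Record("GuidelineX", "guideline", related=["FormatA v2"]),
    Record("DatabaseD", "database", related=["FormatA v2", "GuidelineX"]),
]}
```

With these made-up records, `current_version(REGISTRY, "FormatA v1")` resolves to the `FormatA v2` record, and `databases_implementing(REGISTRY, "FormatA v2")` lists `DatabaseD`.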
ISA powers data collection, curation resources and repositories, e.g.:
Initiated 2003, continues to work with/for many domains
model and related formats
Why ISA format and Tools?
ISA metadata specifications:
• workflow and process orientated
• compatible with checklist enforcement
• compatible with external vocabulary resources
• compatible by design with existing schemas
1. Essentials about ISA tab syntax
Investigation File: cardinality: 1..1
– purpose: think “executive summary”
– layout: rows of key value pairs organized in blocks
– content:
• Why? general study description
• How? methods / protocol declaration
• How? variable declarations (predictor and response variables)
• Who? contact and affiliation information
Study File: cardinality: 1..n
– layout: true header/row of record table (think “sorting, filtering of samples”)
– content:
• What? Listing all biological materials collected over the study course and their treatments.
Assay File: cardinality: 1..n
– layout: true header/row of record table (think “sorting, filtering of datafiles”)
– content:
• What? Listing all data acquisition events and data files collected by a given assay and subsequent data transformations
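The "rows of key value pairs organized in blocks" layout of the Investigation File can be read with a few lines of Python. This is a minimal sketch under simplifying assumptions: block headers are taken to be single upper-case cells, the file content is made up, and only one study block is handled. It is not the ISA tools parser.

```python
from typing import Dict, List

# A tiny, made-up investigation file: each block starts with an upper-case
# header line, followed by tab-separated key/value rows.
INVESTIGATION_TEXT = (
    "INVESTIGATION\n"
    "Investigation Identifier\tINV1\n"
    "Investigation Title\tHerbicide response in maize\n"
    "STUDY\n"
    "Study Identifier\tS1\n"
    "Study File Name\ts_study.txt\n"
    "STUDY ASSAYS\n"
    "Study Assay File Name\ta_imaging.txt\n"
)


def parse_blocks(text: str) -> Dict[str, Dict[str, List[str]]]:
    """Group key/value rows under the block header that precedes them."""
    blocks: Dict[str, Dict[str, List[str]]] = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        cells = line.split("\t")
        if len(cells) == 1 and line.isupper():
            # A lone upper-case cell opens a new block.
            current = blocks.setdefault(line.strip(), {})
        elif current is not None:
            # First cell is the key; remaining cells are its values.
            current[cells[0]] = cells[1:]
    return blocks
```

Calling `parse_blocks(INVESTIGATION_TEXT)["STUDY"]["Study File Name"]` gives the name of the study file to open next, which is how the single Investigation File points to its 1..n Study and Assay files.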
1. Essentials about ISA syntax
Protocols act on Material or Data, defining Workflows:
– Inputs and Outputs of Protocols are Material Nodes (Source Name, Sample Name, Extract Name, Labeled Extract Name) or Data Nodes (Raw Data File or Derived Data File)
[Figure: workflow of Material Nodes and Data Nodes linked by Protocol Applications]
– Node annotations: Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]
– Protocol Application annotations: Date (day effect), Performer (operator effect), Parameter Value[…]
– Example chain: Sample, Material Transformation, Extract, Raw Data File, Derived Data File
2. Basic coding patterns with ISA syntax
– Branching events:
[Diagram: one Source Material (A. thaliana 1) gives three Sample Materials: flower, mature leaf, root]
Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
AT1 | A. thaliana | sample collection | liquid nitrogen | AT1 - sample1 | flower
AT1 | A. thaliana | sample collection | liquid nitrogen | AT1 - sample2 | mature leaf
AT1 | A. thaliana | sample collection | liquid nitrogen | AT1 - sample3 | root
– Pooling events:
[Diagram: three Source Materials (plant 1, plant 2, plant 3) are pooled into one Sample Material: fruit]
Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
plant 1 | Fragaria ×ananassa | sample collection | liquid nitrogen | pool1 | fruit
plant 2 | Fragaria ×ananassa | sample collection | liquid nitrogen | pool1 | fruit
plant 3 | Fragaria ×ananassa | sample collection | liquid nitrogen | pool1 | fruit
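Branching and pooling can both be recognised from the Source Name and Sample Name columns alone: a source that appears with several sample names branches, and a sample name shared by several sources is a pool. A minimal sketch, using the names from the two slide tables; the `classify` helper is hypothetical and not part of the ISA tools.

```python
from collections import defaultdict

# (Source Name, Sample Name) pairs taken from the two slide tables.
BRANCHING_ROWS = [
    ("AT1", "AT1 - sample1"),
    ("AT1", "AT1 - sample2"),
    ("AT1", "AT1 - sample3"),
]
POOLING_ROWS = [
    ("plant 1", "pool1"),
    ("plant 2", "pool1"),
    ("plant 3", "pool1"),
]


def classify(rows):
    """Report sources that branch and samples that pool several sources."""
    outputs = defaultdict(set)   # source -> samples derived from it
    inputs = defaultdict(set)    # sample -> sources it was derived from
    for source, sample in rows:
        outputs[source].add(sample)
        inputs[sample].add(source)
    return {
        "branching": sorted(s for s, outs in outputs.items() if len(outs) > 1),
        "pooling": sorted(s for s, ins in inputs.items() if len(ins) > 1),
    }
```

Here `classify(BRANCHING_ROWS)` reports `AT1` as a branching source, and `classify(POOLING_ROWS)` reports `pool1` as a pooled sample.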
– Representing interventions and treatments
• expressing treatments as sets of factor levels
• example: exposure to different doses of systemic herbicide
• Factors will be ‘compound’, ‘dose’ and ‘duration’
• (what? how much? how long for?)
• Implicit column order matters but this is independent from the ISA syntax specification:
Source Name | Characteristics[organism] | Protocol REF | Factor Value[compound] | Factor Value[dose] | Factor Value[duration]
Plant 1 | Zea mays | treatment | glyphosate | 250 mg/day | 12 weeks
Plant 2 | Zea mays | treatment | glyphosate | 250 mg/day | 12 weeks
Plant 3 | Zea mays | treatment | glyphosate | 20 mg/day | 12 weeks
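Because a treatment is expressed as a set of factor levels, rows that share the same Factor Value tuple form one treatment group. A minimal sketch of that grouping, using the three rows from the slide; the `study_groups` helper is an illustrative assumption, not an ISA tools function.

```python
from collections import defaultdict

# Rows from the slide table, reduced to the columns needed for grouping.
ROWS = [
    {"Source Name": "Plant 1", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "250 mg/day", "Factor Value[duration]": "12 weeks"},
    {"Source Name": "Plant 2", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "250 mg/day", "Factor Value[duration]": "12 weeks"},
    {"Source Name": "Plant 3", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "20 mg/day", "Factor Value[duration]": "12 weeks"},
]


def study_groups(rows):
    """Group source names by their tuple of factor levels (column order kept)."""
    factor_columns = [c for c in rows[0] if c.startswith("Factor Value[")]
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in factor_columns)
        groups[key].append(row["Source Name"])
    return dict(groups)
```

With the slide's rows, Plant 1 and Plant 2 fall into the 250 mg/day group and Plant 3 into the 20 mg/day group.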
– Tagging with Terminologies
• ISA tools (ISAcreator, ISAconfigurator) provide ontology term selection and term tagging facilities to help users.
Without term tagging:
Source Name | Characteristics[ORGANISM] | Characteristics[AGE] | Factor Value[COMPOUND]
individual1 | human | 12 weeks | aspirin
With term tagging:
Source Name | Characteristics[ORGANISM] | Term Source REF | Term Accession Number | Characteristics[AGE] | Unit | Term Source REF | Term Accession Number | Factor Value[COMPOUND (http://purl…)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | 12 | week | UO | UO:wwerwta | aspirin | CHEBI | 1231354
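Term tagging amounts to following an annotated cell with a Term Source REF cell and a Term Accession Number cell. The sketch below shows that expansion with a hand-written lookup table; the `TERMS` table and the `tag` helper are hypothetical, and the accession values are simply the illustrative ones shown on the slide, not verified identifiers. In practice the ISA tools perform the term lookup interactively.

```python
# Hypothetical lookup: value -> (Term Source REF, Term Accession Number).
# The accession strings repeat the slide's illustrative values.
TERMS = {
    "Homo sapiens": ("NCBITax", "9606"),
    "aspirin": ("CHEBI", "1231354"),
}


def tag(column, value, terms=TERMS):
    """Expand one cell into (header, value) pairs including its term tags.

    A list of pairs is used rather than a dict because the two tag headers
    repeat after every annotated column.
    """
    source, accession = terms.get(value, ("", ""))
    return [
        (column, value),
        ("Term Source REF", source),
        ("Term Accession Number", accession),
    ]


def tag_row(cells):
    """Tag every (column, value) cell of a row and flatten the result."""
    tagged = []
    for column, value in cells:
        tagged.extend(tag(column, value))
    return tagged
```

Values missing from the lookup keep empty tag cells, so untagged free text remains visible as something still to be curated.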
ISA syntax boundaries
Any model is a compromise between granularity and simplicity
Some cases are hard to represent:
– crossover design with dissimilar arms
– representing mixtures of chemicals
– representing loops (with donors and recipients)
Reaching the limits of how graphs can be efficiently represented in tables
Plant H-T Phenotyping worked example
– A case of simple non-destructive HTP:
– 60 genotypes x 5 replicates: 12 trays of 25 pots each
– 1 seed per pot gives us 300 individual plants
– experiment duration: 35 days
– single daily data acquisition:
• visible light: 3 angles + top view = 4 images
• near infrared: 3 angles + top view = 4 images
• fluorescence: 1 angle = 1 image
• TOTAL: 9 images per plant per day
– Grand Total: 300 plants x 35 days x 9 images = 94,500 files to store and track
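The file count can be checked with a few lines of arithmetic. The figures are the ones stated on the slide; only the variable names are introduced here.

```python
GENOTYPES = 60
REPLICATES = 5
POTS_PER_TRAY = 25
DAYS = 35
IMAGES_PER_PLANT_PER_DAY = {
    "visible light": 3 + 1,   # 3 angles + top view
    "near infrared": 3 + 1,   # 3 angles + top view
    "fluorescence": 1,        # 1 angle
}

plants = GENOTYPES * REPLICATES                 # one seed per pot
trays = plants // POTS_PER_TRAY
images_per_plant_day = sum(IMAGES_PER_PLANT_PER_DAY.values())
total_files = plants * DAYS * images_per_plant_day

print(plants, trays, images_per_plant_day, total_files)
```

The totals agree with the slide: 300 plants on 12 trays, 9 images per plant per day, and 94,500 files over the 35 days.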
– Decomposing the experiment in terms of ISA elements
– Identifying key experimental variables:
• independent variables => used to define ISA Factors and/or Characteristics
– Factor = genotype, Factor Values[G1..G60] = 60 distinct values
– Factor = day, Factor Values[day1..day35] = 35 distinct values
• response variables => used to define 3 distinct ISA Assays
– morphology using visible light imaging
» ISA parameters to track ‘camera position’: top, left, right, centre
– water content using near infrared imaging
» ISA parameters to track ‘camera position’: top, left, right, centre
– photosynthetic pigment concentration using fluorescence imaging
» ISA parameters to track ‘camera position’: top
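The three assays and their camera positions can be written down as a small table and expanded into one row per image. This is a sketch only: the `assay_rows` helper, the sample name and the file naming scheme are made up for illustration, while the assays and camera positions are the ones listed above.

```python
# (response variable, imaging technique, camera positions), as listed above.
ASSAYS = [
    ("morphology", "visible light", ["top", "left", "right", "centre"]),
    ("water content", "near infrared", ["top", "left", "right", "centre"]),
    ("photosynthetic pigment concentration", "fluorescence", ["top"]),
]


def assay_rows(sample_name):
    """One row per image acquired for a given plant on a given day."""
    rows = []
    for measurement, technique, positions in ASSAYS:
        for position in positions:
            rows.append({
                "Sample Name": sample_name,
                "Measurement": measurement,
                "Technique": technique,
                "Parameter Value[camera position]": position,
                # Hypothetical file naming scheme.
                "Raw Data File": f"{sample_name}.{technique.replace(' ', '_')}.{position}.png",
            })
    return rows
```

For a single plant-day sample such as the made-up `G1_rep1_day1`, `assay_rows` yields 9 rows, matching the 9 images per plant per day of the worked example.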
– Independent variables: Factor = genotype (60 distinct values), Factor = day (35 distinct values)
• Automatic creation and filling of ISA Study Sample files
– 60 x 35 = 2100 factor combinations
– 5 replicates per factor combination => 10500 plant-day samples (300 pots with 1 seed per pot, each sampled on 35 days)
– Translated into:
» 1 ISA study file with 10500 rows on the following pattern
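Filling the study file automatically comes down to enumerating genotype x replicate x day combinations. A minimal sketch: the naming pattern (`G1_rep1_day1`) and the `study_rows` helper are assumptions made for illustration, and only a few ISA columns are shown.

```python
from itertools import product


def study_rows(n_genotypes=60, n_replicates=5, n_days=35):
    """Enumerate one study row per plant per day."""
    rows = []
    for g, r, d in product(range(1, n_genotypes + 1),
                           range(1, n_replicates + 1),
                           range(1, n_days + 1)):
        source = f"G{g}_rep{r}"              # one plant (one pot, one seed)
        rows.append({
            "Source Name": source,
            "Sample Name": f"{source}_day{d}",
            "Factor Value[genotype]": f"G{g}",
            "Factor Value[day]": f"day{d}",
        })
    return rows


rows = study_rows()
n_sources = len({row["Source Name"] for row in rows})
n_combinations = len({(row["Factor Value[genotype]"], row["Factor Value[day]"])
                      for row in rows})
print(len(rows), n_sources, n_combinations)
```

The enumeration reproduces the slide's figures: 10500 rows, 300 distinct plants and 2100 factor combinations.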
Declaring and annotating an ISA Source Node
ISA Protocol Application with sets of Parameter Values resulting in an ISA Sample Node
Reporting of independent variables as ISA Factor Values
Describing a data acquisition event
ISA Protocol Application of type Data Transformation with sets of Parameter Values resulting in an ISA Derived Data File
Reporting of independent variables as ISA Factor Values
You can email [email protected]
View our blog: http://isatools.org/blog
Follow us on Twitter: @isatools, @biosharing
View our websites: http://www.isa-tools.org, http://www.biosharing.org
View our Git repo & contribute: http://github.com/ISA-tools