Data Sharing Infrastructures to Foster Data Reuse

David Johnson [email protected] @NuDataScientist

Integrating Large Data into Plant Science workshop, 21st April 2016

GARNet workshop on Integrating Large Data into Plant Science


Philippe Rocca-Serra, PhD, Senior Research Lecturer

Alejandra Gonzalez-Beltran, PhD, Research Lecturer

Milo Thurston, PhD, Research Software Engineer

Massimiliano Izzo, PhD, Research Software Engineer

Peter McQuilton, PhD, Knowledge Engineer

Allyson Lister, PhD, Knowledge Engineer

Eamonn Maguire, DPhil, Software Engineer (contractor)

David Johnson, PhD, Research Software Engineer

Susanna-Assunta Sansone, PhD, Principal Investigator, Associate Director (consultant for Nature Publishing Group)

Our main areas of research and activity:

• Data collection, curation, representation etc.
• Data publication
• Data provenance
• Development of software and infrastructure
• Open, community ontologies and standards
• Semantic web
• Training

Communities we work with/for: [logos]

Notes in lab books (information for humans)

Spreadsheets and tables (the compromise)

Facts as RDF statements (information for machines)

Notes and narrative; spreadsheets and tables; linked data and data publication
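To make the step from "spreadsheets and tables" to "facts as RDF statements" concrete, here is a minimal sketch that turns one table row into subject-predicate-object statements. The base IRI, the property naming and the sample row are assumptions made for illustration only; they are not taken from the ISA tools.

```python
# Hypothetical base IRI, used only for illustration.
BASE = "http://example.org/"


def row_to_triples(row, id_column):
    """Turn one spreadsheet row (a dict) into (subject, predicate, object) triples."""
    subject = BASE + "sample/" + row[id_column].replace(" ", "_")
    triples = []
    for column, value in row.items():
        if column == id_column:
            continue  # the identifier becomes the subject, not a statement
        predicate = BASE + "property/" + column.replace(" ", "_")
        triples.append((subject, predicate, value))
    return triples


# Illustrative row: one sample described by two columns.
row = {"Sample Name": "AT1 sample1", "organism": "A. thaliana", "organ": "flower"}

for s, p, o in row_to_triples(row, "Sample Name"):
    # Printed in an N-Triples-like layout, one statement per line.
    print(f'<{s}> <{p}> "{o}" .')
```

Each cell becomes one statement, which is what makes the content addressable by machines rather than readable only by people.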


Enabling reproducible research and open science, driving science and discoveries

Increase the level of annotation at the source, tracking provenance and using community standards

Maximize data discoverability and reuse

Applied research approach

Two well-established products with a large user base, embedded in many funded projects

Several community-driven ontologies and other standards, embedded in many funded projects

In the life sciences there are > 600 content standards

[Figure: cloud of standard names grouped as guidelines, formats and terminologies; the counts 86, 349 and 200 appear on the slide]

• Guidelines: MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT, …
• Formats: MAGE-Tab, GCDML, SRA XML, SOFT, FASTA, DICOM, MzML, SBRML, SED-ML, GELML, ISAtab, CML, MITAB, nmrML, ISA-JSON, …
• Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO, CO, …

Databases and tools implementing standards; also training material on and around standards

Sources of standards: de jure and de facto; grass-roots groups and standard organizations; Nanotechnology Working Group

Community-developed content standards (formats, terminologies, guidelines)

• To structure, enrich and report the description of the datasets and the experimental context under which they were produced

Mapping the landscape of ‘standards’ in the life sciences

A web-based, curated and searchable registry ensuring that standards and databases are registered, informative and discoverable; monitoring the development and evolution of standards, their use in databases, and the adoption of both in data policies

1,400 records and growing


Also operating as a WG in […]; run at […]; also a contribution to […]

• Is there a database, implementing standards, where I can deposit my metagenomics dataset?
• My funder’s data sharing policy recommends the use of established standards, but which ones are widely endorsed and applicable to my toxicological and clinical data?
• Am I using the most up-to-date version of this terminology to annotate cell-based assays?
• I understand this format has been deprecated; what has it been replaced by, and who is leading the work?
• Are there databases implementing this exchange format, whose development we have funded?
• What are the mature standards and standards-compliant databases we should recommend to our authors?

But how do we help users to make informed decisions?

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

Search and filter to find what is relevant to your type of data

From simple and advanced search interfaces to the recommender system

Powered by curated descriptions of each standard and database record, and their relations


Tracking evolution, e.g. deprecations and substitutions

Cross-linking standards to standards and databases

• Model/format formalizing reporting guideline -->
• <-- Reporting guideline used by model/format

We link (descriptions of) standards to related standards and to the databases implementing them
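The cross-linking and deprecation tracking described here can be pictured as a small graph of records. The sketch below is only an illustration under stated assumptions: the record names, the field names (`replaced_by`, `formalizes`, `implements`) and the helper functions are hypothetical and do not reflect the registry's actual data model or API.

```python
# Hypothetical registry records; names and relations are invented for illustration.
records = {
    "format-v1": {"type": "format", "replaced_by": "format-v2", "formalizes": ["guideline-x"]},
    "format-v2": {"type": "format", "replaced_by": None, "formalizes": ["guideline-x"]},
    "guideline-x": {"type": "guideline", "replaced_by": None},
    "database-a": {"type": "database", "replaced_by": None,
                   "implements": ["format-v1", "guideline-x"]},
}


def latest(name):
    """Follow 'replaced_by' links until reaching a record that is not deprecated."""
    seen = set()
    while records[name]["replaced_by"] is not None:
        if name in seen:
            raise ValueError("cycle in substitution chain")
        seen.add(name)
        name = records[name]["replaced_by"]
    return name


def implementers(standard):
    """Databases that declare they implement the given standard."""
    return sorted(n for n, r in records.items() if standard in r.get("implements", []))


def used_by(guideline):
    """Reverse link: models/formats that formalize the given reporting guideline."""
    return sorted(n for n, r in records.items() if guideline in r.get("formalizes", []))


print(latest("format-v1"))          # the substitute for a deprecated format
print(implementers("guideline-x"))  # databases implementing a guideline
print(used_by("guideline-x"))       # formats formalizing a guideline
```

Storing each relation once and deriving the reverse direction on demand keeps the two arrows on the slide consistent with each other.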

Standards and databases cross-linked

model and related formats

These tools and formats will help you to:


ISA powers data collection, curation resources and repositories, e.g.:

Initiated 2003, continues to work with/for many domains


ISA in a nutshell

Why ISA format and Tools?

ISA metadata specifications:
• workflow and process orientated
• compatible with checklist enforcement
• compatible with external vocabulary resources
• compatible by design with existing schemas

1. Essentials about ISA tab syntax

Investigation File: cardinality: 1..1
– purpose: think “executive summary”
– layout: rows of key value pairs organized in blocks
– content:
  • Why? general study description
  • How? methods / protocol declaration
  • How? variable declarations (predictor and response variables)
  • Who? contact and affiliation information

Study File: cardinality: 1..n
– layout: true header/row of record table (think “sorting, filtering of samples”)
– content:
  • What? Listing all biological materials collected over the study course and their treatments.

Assay File: cardinality: 1..n
– layout: true header/row of record table (think “sorting, filtering of datafiles”)
– content:
  • What? Listing all data acquisition events and data files collected by a given assay and subsequent data transformations
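The "rows of key value pairs organized in blocks" layout of the Investigation File can be read with very little code. The following is a simplified sketch, not the ISA tools' parser: it assumes tab-separated rows, treats any single-cell upper-case line as a block header, and ignores quoting. The sample content is invented for illustration.

```python
def parse_investigation(text):
    """Group 'key<TAB>value...' rows under the upper-case block header above them.

    Returns a list of (block_name, {key: [values]}) pairs; a list is used
    because a block such as STUDY may occur more than once (cardinality 1..n).
    """
    blocks = []
    for line in text.splitlines():
        if not line.strip():
            continue
        cells = line.split("\t")
        if len(cells) == 1 and cells[0].isupper():
            blocks.append((cells[0], {}))          # start of a new block
        elif blocks:
            blocks[-1][1][cells[0]] = cells[1:]    # key followed by its values
    return blocks


# Invented sample content, for illustration only.
SAMPLE = "\n".join([
    "INVESTIGATION",
    "Investigation Identifier\tINV1",
    "Investigation Title\tExample investigation",
    "STUDY",
    "Study Identifier\tS1",
    "Study File Name\ts_study.txt",
    "STUDY FACTORS",
    "Study Factor Name\tcompound\tdose",
])

blocks = parse_investigation(SAMPLE)
for name, fields in blocks:
    print(name, fields)
```

The Study and Assay files, by contrast, are ordinary header-plus-rows tables, so a standard tab-separated reader is enough for them.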

Protocols act on Material or Data, defining workflows:
– Inputs and outputs of Protocols are Material Nodes (Source Name, Sample Name, Extract Name, Labeled Extract Name) or Data Nodes (Raw Data File or Derived Data File)

Annotations on a Material Node or Data Node: Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]

Annotations on a Protocol Application: Date (day effect), Performer (operator effect), Parameter Value[…]

[Figure: Sample, Extract, Raw Data File and Derived Data File nodes linked by Protocol Applications, e.g. a Material Transformation]
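The node-and-protocol view can be sketched as a list of protocol applications, each consuming and producing named nodes. This is a minimal illustration under assumptions: the class, the field names and the example node names are hypothetical, and the lineage walk only follows the first input of each step.

```python
from dataclasses import dataclass, field


@dataclass
class ProtocolApplication:
    protocol: str
    inputs: list            # names of Material or Data nodes consumed
    outputs: list           # names of Material or Data nodes produced
    parameters: dict = field(default_factory=dict)  # Parameter Value[...]
    date: str = ""          # day effect
    performer: str = ""     # operator effect


# Illustrative chain: Sample -> Extract -> Raw Data File -> Derived Data File.
steps = [
    ProtocolApplication("extraction", ["sample1"], ["extract1"], performer="operator A"),
    ProtocolApplication("data acquisition", ["extract1"], ["raw1.dat"]),
    ProtocolApplication("data transformation", ["raw1.dat"], ["derived1.txt"]),
]


def lineage(node, steps):
    """Walk backwards from a node to the material it ultimately came from."""
    path = [node]
    while True:
        producers = [s for s in steps if path[-1] in s.outputs]
        if not producers:
            return list(reversed(path))
        path.append(producers[0].inputs[0])  # simplification: first input only


print(lineage("derived1.txt", steps))
```

Keeping date and performer on the protocol application, rather than on the nodes, is what lets day and operator effects be examined later.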

2. Basic coding patterns with ISA syntax

The task: rendering a graph in a table

– Branching events:

Source Material: A. thaliana 1; Sample Material: flower, mature leaf, root

Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
AT1 | A. thaliana | sample collection | liquid nitrogen | AT1 - sample1 | flower
AT1 | A. thaliana | sample collection | liquid nitrogen | AT1 - sample2 | mature leaf
AT1 | A. thaliana | sample collection | liquid nitrogen | AT1 - sample3 | root
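The task of rendering a graph in a table can be sketched for the branching case as follows: each edge from the source to a sample becomes one row, and the source's annotations are repeated on every row. The function names are hypothetical, and the tab-separated output is a simplification of a study file.

```python
import csv
import io

# Edges of the branching example: (source, sample, organ).
edges = [
    ("AT1", "AT1 - sample1", "flower"),
    ("AT1", "AT1 - sample2", "mature leaf"),
    ("AT1", "AT1 - sample3", "root"),
]
source_attrs = {"AT1": {"organism": "A. thaliana"}}

HEADER = [
    "Source Name", "Characteristics[organism]", "Protocol REF",
    "Parameter Value[storage condition]", "Sample Name", "Characteristics[organ]",
]


def graph_to_rows(edges, source_attrs, protocol, storage):
    """One row per source-to-sample edge; source annotations repeat on each row."""
    rows = [HEADER]
    for source, sample, organ in edges:
        rows.append([source, source_attrs[source]["organism"],
                     protocol, storage, sample, organ])
    return rows


def to_tsv(rows):
    """Serialize rows as tab-separated text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


rows = graph_to_rows(edges, source_attrs, "sample collection", "liquid nitrogen")
print(to_tsv(rows))
```

A branching event therefore shows up in the table as a repeated Source Name paired with distinct Sample Names.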

– Pooling events:

Source Material: plant 1, plant 2, plant 3; Sample Material: fruit

Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
plant 1 | Fragaria × ananassa | sample collection | liquid nitrogen | pool1 | fruit
plant 2 | Fragaria × ananassa | sample collection | liquid nitrogen | pool1 | fruit
plant 3 | Fragaria × ananassa | sample collection | liquid nitrogen | pool1 | fruit
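Pooling is the mirror image of branching: several Source Names share one Sample Name. A reader of the table can recover the pooled samples by grouping rows, as in this small sketch (the function name is hypothetical; only the two relevant columns are kept).

```python
from collections import defaultdict

# (Source Name, Sample Name) pairs from the pooling example.
rows = [
    ("plant 1", "pool1"),
    ("plant 2", "pool1"),
    ("plant 3", "pool1"),
]


def pooled_samples(rows):
    """Return samples that were derived from more than one source."""
    sources_by_sample = defaultdict(set)
    for source, sample in rows:
        sources_by_sample[sample].add(source)
    return {sample: sorted(sources)
            for sample, sources in sources_by_sample.items()
            if len(sources) > 1}


print(pooled_samples(rows))
```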

– Representing interventions and treatments

• expressing treatments as sets of factor levels
• example: exposure to different doses of a systemic herbicide
• Factors will be ‘compound’, ‘dose’ and ‘duration’ (what? how much? how long for?)
• Implicit column order matters but this is independent from the ISA syntax specification:

Source Name | Characteristics[organism] | Protocol REF | Factor Value[compound] | Factor Value[dose] | Factor Value[duration]
Plant 1 | Zea mays | treatment | glyphosate | 250 mg/day | 12 weeks
Plant 2 | Zea mays | treatment | glyphosate | 250 mg/day | 12 weeks
Plant 3 | Zea mays | treatment | glyphosate | 20 mg/day | 12 weeks
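Because a treatment is a set of factor levels, treatment groups can be derived by grouping rows on their Factor Value columns. The sketch below assumes rows are available as dictionaries keyed by column header; the function name is hypothetical.

```python
from collections import defaultdict

rows = [
    {"Source Name": "Plant 1", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "250 mg/day", "Factor Value[duration]": "12 weeks"},
    {"Source Name": "Plant 2", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "250 mg/day", "Factor Value[duration]": "12 weeks"},
    {"Source Name": "Plant 3", "Factor Value[compound]": "glyphosate",
     "Factor Value[dose]": "20 mg/day", "Factor Value[duration]": "12 weeks"},
]


def treatment_groups(rows):
    """Group sources by their combination of factor levels."""
    factor_columns = [c for c in rows[0] if c.startswith("Factor Value[")]
    groups = defaultdict(list)
    for row in rows:
        levels = tuple(row[c] for c in factor_columns)
        groups[levels].append(row["Source Name"])
    return dict(groups)


for levels, members in treatment_groups(rows).items():
    print(levels, members)
```

Here the three plants fall into two groups, distinguished only by dose.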

– Tagging with terminologies

• ISA tools (ISAcreator, ISAconfigurator) provide ontology term selection and term tagging facilities to help users.

Source Name | Characteristics[ORGANISM] | Term Source REF | Term Accession Number | Characteristics[AGE] | Unit | Term Source REF | Term Accession Number | Factor Value[COMPOUND] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | 12 | week | UO | UO:wwerwta | aspirin | CHEBI | 1231354

The same row without term tagging:

Source Name | Characteristics[ORGANISM] | Characteristics[AGE] | Factor Value[COMPOUND]
individual1 | human | 12 weeks | aspirin
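Term tagging amounts to expanding one free-text cell into three cells: the preferred label, the term source and the accession. The sketch below uses a hypothetical in-memory lookup with a single entry; in practice the ISA tools obtain terms through their ontology selection facilities, which this example does not attempt to reproduce.

```python
# Hypothetical lookup table; a terminology service or ontology file would supply it.
TERMS = {
    "human": ("Homo sapiens", "NCBITax", "9606"),
}


def tag(header, value, terms=TERMS):
    """Expand a free-text value into label, Term Source REF and Term Accession Number.

    Unknown values are kept as typed, with empty source and accession.
    """
    label, source, accession = terms.get(value, (value, "", ""))
    return {
        header: label,
        "Term Source REF": source,
        "Term Accession Number": accession,
    }


print(tag("Characteristics[ORGANISM]", "human"))
```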

ISA syntax boundaries

Any model is a compromise between granularity and simplicity

Some cases are hard to represent:
– crossover design with dissimilar arms
– representing mixtures of chemicals
– representing loops (with donors and recipients)

Reaching the limits of how graphs can be efficiently represented in tables

Plant H-T Phenotyping worked example

– A case of simple non-destructive HTP:
– 60 genotypes x 5 replicates: 12 trays of 25 pots each
– 1 seed per pot gives us 300 individual plants
– experiment duration: 35 days
– single daily data acquisition:
  • visible light: 3 angles + top view = 4 images
  • near infrared: 3 angles + top view = 4 images
  • fluorescence: 1 angle = 1 image
  • TOTAL: 9 images per plant per day
– Grand Total: 94,500 files to store and track
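The totals in this worked example can be checked with a few lines of arithmetic. The variable names are chosen for illustration; the numbers are those given on the slide.

```python
genotypes = 60
replicates = 5
days = 35
pots_per_tray = 25

# Images acquired per plant per day, by imaging modality.
images_per_day = {"visible light": 4, "near infrared": 4, "fluorescence": 1}

plants = genotypes * replicates                     # one seed per pot
trays = plants // pots_per_tray
per_plant_per_day = sum(images_per_day.values())
total_files = plants * per_plant_per_day * days

print(f"{plants} plants on {trays} trays")
print(f"{per_plant_per_day} images per plant per day")
print(f"{total_files:,} files over {days} days")
```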

– Decomposing the experiment in terms of ISA elements
– Identifying key experimental variables:
  • independent variables => used to define ISA Factors and/or Characteristics
    – Factor = genotype, Factor Values[G1..G60] = 60 distinct values
    – Factor = day, Factor Values[day1..day35] = 35 distinct values
  • response variables => used to define 3 distinct ISA Assays
    – morphology using visible light imaging
      » ISA parameters to track ‘camera position’: top, left, right, centre
    – water content using near infrared imaging
      » ISA parameters to track ‘camera position’: top, left, right, centre
    – photosynthetic pigment concentration using fluorescence imaging
      » ISA parameters to track ‘camera position’: top
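To illustrate how the three assays and their ‘camera position’ parameter translate into tracked files, the sketch below enumerates the image records expected for one plant on one day. The sample naming scheme and the file names are hypothetical; only the assays and camera positions come from the slide.

```python
# Assays and the camera positions tracked as a parameter for each.
ASSAYS = {
    "visible light imaging": ["top", "left", "right", "centre"],
    "near infrared imaging": ["top", "left", "right", "centre"],
    "fluorescence imaging": ["top"],
}


def expected_images(plant, day):
    """List one record per image expected for a plant on a given day."""
    records = []
    for assay, positions in ASSAYS.items():
        for position in positions:
            records.append({
                "Sample Name": f"{plant}.day{day}",            # hypothetical naming
                "Assay": assay,
                "Parameter Value[camera position]": position,
                # Hypothetical file name pattern, for illustration only.
                "Raw Data File": f"{plant}_day{day}_{assay.split()[0]}_{position}.png",
            })
    return records


records = expected_images("G1.rep1", 1)
print(len(records), "images expected")
for record in records[:2]:
    print(record)
```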

• Automatic creation and filling of ISA Study Sample files
  – 60 x 35 = 2,100 factor combinations
  – 5 replicates per factor combination => 10,500 samples (300 pots with 1 seed per pot, each observed on 35 days)
  – Translated into:
    » 1 ISA study file with 10,500 rows on the following pattern
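Automatic filling of the study file can be sketched as a Cartesian product over genotype, replicate and day. The identifier scheme (`G1.rep1.day1`) and the function name are assumptions for illustration; the ISA tools may generate names differently.

```python
from itertools import product

genotypes = [f"G{i}" for i in range(1, 61)]   # 60 genotypes
days = [f"day{d}" for d in range(1, 36)]      # 35 days
replicates = range(1, 6)                      # 5 replicates


def study_rows():
    """Yield one study-file row per genotype, replicate and day."""
    for genotype, replicate, day in product(genotypes, replicates, days):
        source = f"{genotype}.rep{replicate}"          # one plant (hypothetical naming)
        yield {
            "Source Name": source,
            "Sample Name": f"{source}.{day}",
            "Factor Value[genotype]": genotype,
            "Factor Value[day]": day,
        }


rows = list(study_rows())
print(len(rows), "rows")
print(rows[0])
```

Generating the rows from the design, rather than typing them, keeps the 10,500 rows consistent with the declared factors.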

• Declaring and annotating an ISA Source Node
• ISA Protocol Application with sets of Parameter Values resulting in an ISA Sample Node
• Reporting of independent variables as ISA Factor Values


Describing a data acquisition event

• ISA Protocol Application of type Data Transformation with sets of Parameter Values resulting in an ISA Derived Data File
• Reporting of independent variables as ISA Factor Values

Collaborative Open Plant Omics


ISA tools in the Cloud


You can email [email protected]

View our blog: http://isatools.org/blog

Follow us on Twitter: @isatools, @biosharing

View our websites: http://www.isa-tools.org, http://www.biosharing.org

View our Git repo & contribute: http://github.com/ISA-tools