BRC2011 Session #5 – Data Standards and Metadata
Session Chair: Richard Scheuermann (ViPR & IRD)


Session #5 - Outline

Motivation
Opportunities, Challenges and Talking Points
  minimum information checklists
  ontology-based value sets
  use cases for metadata
  SOPs for data & metadata acquisition
Ontology for Biomedical Investigations – Bjoern Peters
Infectious Disease Ontology and extensions – Lindsay Cowell
GSCID-BRC Metadata Working Group efforts
Open discussion

Why Data Standards

Interoperability - the ability to exchange information between people, organizations, machines

Comparability - the ability to ascertain the equivalence of data from different sources

Data Quality – the ability to assess the completeness, accuracy and precision of the data

Dependability – ensures that you get what you expect from a database query

Accurate Statistical Analysis

Inference

What Data Standards

Minimum Information Sets – what needs to be described
Structured Vocabulary/Ontology – how to describe them
  Term strings – unique identifiers
  Definitions - what terms mean
  Syntax - how terms are used
  Semantics - how the components relate to each other
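As a rough illustration of what a minimum information set means in practice, the sketch below checks a submitted record against a list of required fields. The field names in REQUIRED_FIELDS are hypothetical placeholders, not the working group's agreed checklist.

```python
# Hypothetical minimum information set; the real checklist would come
# from the relevant standard, not from this illustrative list.
REQUIRED_FIELDS = [
    "specimen ID",
    "collection date",
    "collection location",
    "host species",
]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required field names that are absent or blank in `record`."""
    return [
        name
        for name in required
        if not str(record.get(name, "")).strip()
    ]


example = {
    "specimen ID": "S1",
    "collection date": "",
    "host species": "Gallus gallus",
}
print(missing_fields(example))
```

A check like this only says that something was entered for each field; whether the value is a valid term is the job of the structured vocabulary.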

Session #5 – Challenges

Status of relevant data standards
  Few data standards have been widely adopted by the infectious diseases community
  Some standards are being developed without engagement of all relevant stakeholders
  If we drive standards development, how do we get broad adoption?
Adoption of data standards by data providers
  Even if vocabulary standards are available, how do we get the broader community to use them?
  How do we educate them to use the data standards accurately?
  How do we keep the barrier low for getting required metadata in a standard format?
Technical challenges
  Usability is constrained by the spreadsheet interface
  Ontology-based controlled vocabularies are sometimes too large for a spreadsheet-like interface or drop-down lists
  While web-based GUI smart forms are good for single submissions, it is difficult to design them to scale
Need for quality control and curation
  If data standards are not enforced, mapping to standards may be required
  Problems with homonyms (Turkey vs turkey) and synonyms (Puerto Rico and PR)
  Not all tasks in metadata collection lend themselves to automation
  Data entry quality control mechanisms are especially limited because of spreadsheet functionality
  Could be 1-2 FTEs; not budgeted
Compliance with HIPAA and other privacy regulations
  PATRIC does not anticipate working with identifying data, but GSCIDs and investigators could be delayed by compliance issues
Special cases
  Metadata for genomes for NCBI bulk submission and non-unique taxon IDs
  Metadata for growth conditions to be used with transcript datasets
  Metadata for metagenomes to correlate genomes and proteins with useful info about sites and conditions
How do we effectively exploit standardized data and metadata?
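The homonym and synonym problem listed under quality control (Turkey vs turkey, Puerto Rico and PR) can be sketched as a field-aware lookup: the same string maps to different preferred terms depending on which metadata field it came from. The vocabularies below are illustrative assumptions, not values from any BRC or GSCID spreadsheet.

```python
# Hypothetical controlled vocabularies, keyed by metadata field.
# Keys are lower-cased free-text values; values are preferred terms.
VOCAB = {
    "country": {
        "turkey": "Turkey",
        "puerto rico": "Puerto Rico",
        "pr": "Puerto Rico",
    },
    "host": {
        "turkey": "Meleagris gallopavo",
        "chicken": "Gallus gallus",
    },
}


def normalize(field_name, raw_value):
    """Map a free-text value to the preferred term for the given field.

    Returns None when no mapping exists, so the value can be routed
    to a curator instead of being guessed.
    """
    key = raw_value.strip().lower()
    return VOCAB.get(field_name, {}).get(key)


print(normalize("country", "PR"))
print(normalize("host", "Turkey"))
print(normalize("host", "duck"))
```

Returning None for unmapped values keeps the manual curation step explicit, which matches the point that not every task in metadata collection can be automated.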

Session #5 – Opportunities

Existing relevant ontologies are in decent shape – GO, IDO, OBI
Ontology for Biomedical Investigations (OBI) can provide a common framework for describing and exchanging datasets
GSCID-BRC Metadata Working Group
Leverage and harmonize with MIGS/MIMS
We have the opportunity to establish policies for metadata collection, exchange, and release that would be broadly applicable. We are in the position to drive standards adoption
The BRCs support many pathogens that infect the same host(s) … can we exploit this fact to create specialized views and tools for interacting with the host resources from both pathogen and host perspectives?
Ontology-driven integration (GMOD, Population biology)
Small sequencing centers
  Offer community a standard metadata template for isolates
  Bring your own data and metadata to PATRIC for annotation, analysis, long term metadata storage and dissemination
Develop additional metadata standards and collect, store, and share additional metadata
More efficient encoding of things like alignments

Presentations

Ontology for Biomedical Investigations (OBI) – Bjoern Peters

Infectious Disease Ontology (IDO) and extensions – Lindsay Cowell

GSCID-BRC Metadata Working Group

GSCID-BRC Metadata Working Group

Working group established to define a common metadata standard for pathogen isolate sequencing projects
Collaboration between BRCs, GSCIDs and NIAID
Process
  Collect spreadsheets, metadata examples, and previous submissions from sequencing projects
  Core metadata fields collected by virus, bacteria and eukaryote subgroups
  For each metadata field, propose:
    preferred term
    definition
    synonyms
    allowed values based on controlled vocabularies
    preferred syntax
    responsible provider
    data category
    examples
  Merge recommendations from subgroups into a common core metadata set using an OBI-based semantic framework
  Develop recommendations for project-specific and pathogen-specific metadata fields
  Harmonize with other relevant standards (MIGS/MIMS, IDO)
  Establish policies and procedures for metadata submission workflows and GenBank linkage
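The per-field proposal described in the process above (preferred term, definition, synonyms, allowed values, preferred syntax, responsible provider, data category, examples) could be captured in a small record type. This is only a sketch: the class name MetadataField, the use of a regular expression for "preferred syntax", and the sample field are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataField:
    preferred_term: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[List[str]] = None  # None means free text
    syntax: Optional[str] = None  # assumed here to be a regular expression
    provider: str = ""
    category: str = ""
    examples: List[str] = field(default_factory=list)

    def is_valid(self, value: str) -> bool:
        """Check a value against the allowed values and syntax, if given."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            return False
        return True


# Hypothetical field definition, not taken from the working group's sheets.
collection_date = MetadataField(
    preferred_term="specimen collection date",
    definition="Date on which the specimen was collected.",
    synonyms=["collection date", "date collected"],
    syntax=r"\d{4}-\d{2}-\d{2}",
    provider="specimen collector",
    category="Specimen Isolation",
    examples=["2011-03-15"],
)

print(collection_date.is_valid("2011-03-15"))
print(collection_date.is_valid("15/03/2011"))
```

Keeping the examples alongside the syntax makes it easy to test that every listed example actually passes the field's own validation.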

Core Metadata Examples

[Figure: OBI-based network diagram of the core metadata.
Entities: specimen source (organism or environmental), specimen collector, specimen, microorganism, microorganism genomic NA, enriched NA sample, input sample, reagents, technician, equipment, type, ID, qualities, temporal-spatial region, isolation protocol, primary data, sequence data, genotype/serotype/gene data, sequence data record, GenBank ID.
Processes: specimen isolation process, sample processing, sequencing assay, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), data archiving process.
Relations: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of.
Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italic text marks relations.]

Network Overview

[Figure: the core metadata network diagram (same entities, processes and relations as on the Core Metadata Examples slide), with region labels: Specimen Isolation, Material Processing, Sequencing Assay, Data Processing, Investigation.]

[Figure: detailed version of the core metadata network, annotated with spreadsheet rows (vX – row X in virus sheet).
Material entities: organism, environmental material, equipment, person, sample material, material, specimen, microorganism, microorganism genomic NA, enriched NA sample, cDNA sample, software.
Roles: template role, reagent role, sequencing tech. role, signal detection role, specimen source role, specimen capture role, specimen collector role.
Identifiers and qualities: species/strain, organism ID, age, gender, symptom, common name, name, affiliation, amount, ID, GenBank ID.
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Processes and specifications: specimen isolation process, isolation protocol; NA enrichment process, NA enrichment protocol; cDNA synthesis process, cDNA synthesis protocol; sequencing assay, sequencing protocol; data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), algorithm; data archiving process, data transfer protocol.
Data: primary data, sequence data, genotype/serotype/gene data, sequence data record.
Relations: has_input, has_output, has_specification, has_part, plays, is_about, denotes, located_in, has_quality, instance_of, has_affiliation.
Rows referenced: v2, v3-4, v5-6, v7, v8, v10, v11, v12, v13, v15, v16, v22, v23, v24, v25, v27, v29, v30, v31, v32, v40, v42, v43, v44, v45, v46.
Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italic text marks relations.]

Metadata Categories

Investigation
Specimen Isolation
Specimen Processing
Sample Shipment
Pathogen Detection & Isolation
Sequencing Sample Preparation
Sequencing Assay
Data Transformation

Specimen Isolation

[Figure: network diagram of the specimen isolation step (vX – row X in virus sheet).
Entities: organism, organism part, environmental material, equipment, person, specimen X, microorganism.
Roles: specimen source role, specimen capture role, specimen collector role.
Identifiers and qualities: species/strain, organism ID, age, gender, symptom, common name, name, affiliation, ID, specimen type.
Process and specification: specimen isolation procedure X, specimen isolation procedure type, isolation protocol.
Other nodes: hypothesis, IRB/IACUC approval, Comments (marked ???? on the slide).
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Relations: has_input, has_output, has_specification, has_part, plays, is_about, denotes, located_in, has_quality, instance_of, has_affiliation, has_authorization.
Rows referenced: v2, v3-4, v5-6, v7, v8, v9, v10, v11, v12, v13, v15, v16, v17, v18, v19, v27.
Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italic text marks relations.]

Specimen Processing

[Figure: network diagram of the specimen processing step.
Entities: specimen X, microorganism X, specimen X aliquot Y, sample set X (containing specimen A aliquot B, specimen M aliquot N, specimen T aliquot U).
Identifiers and qualities: ID, specimen type, amount, species/strain.
Processes and specifications: aliquoting process X, aliquoting protocol; sample set assembly process X, sample set assembly protocol.
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Relations: has_input, has_output, has_specification, has_part, denotes, located_in, has_quality, instance_of.
Rows referenced: v15, v16, v20, v22, v23, v24, v27.]

Sample Shipment

[Figure: network diagram of the sample shipment step.
Entities: sample set X, sample set X in transit, sample set X at GSC, sample X at GSC.
Identifiers and qualities: ID, sample set type, sample type, amount.
Processes and specifications: sample shipment process X, sample shipment protocol; sample receipt process X, sample receipt protocol.
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Relations: has_input, has_output, has_specification, has_part, denotes, located_in, has_quality, instance_of.
Rows referenced: v21, v23, v24, v25.]

Pathogen Detection & Isolation

[Figure: network diagram of the pathogen detection and isolation step.
Entities: specimen X, microorganism X, pathogen isolate X, data about pathogen presence.
Identifiers and qualities: ID, specimen type, pathogen type, amount, species/strain.
Processes and specifications: pathogen detection process X, pathogen detection method, pathogen detection protocol; pathogen isolation process X, pathogen isolation method.
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Relations: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of.
Rows referenced: v15, v16, v26, v27, v28, v34.]

Sequencing Sample Preparation

[Figure: network diagram of the sequencing sample preparation step.
Entities: specimen X, specimen aliquot X, microorganism X, microorganism genomic NA, enriched NA sample X, cDNA sample X.
Identifiers and qualities: ID, specimen type, amount, species/strain.
Processes and specifications: aliquoting process X, aliquoting protocol; NA enrichment process X, NA enrichment protocol; cDNA synthesis process X, cDNA synthesis protocol.
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Relations: has_input, has_output, has_specification, has_part, denotes, located_in, has_quality, instance_of.
Rows referenced: v15, v16, v27, v33, v35, v36, v37, v38, v39.]

Sequencing Assay

[Figure: network diagram of the sequencing assay step.
Entities: sequencing assay X, sample material X, material X, person X, equipment X, primary data.
Roles: template role, reagent role, sequencing tech. role, signal detection role.
Identifiers and types: run ID, sample ID, lot #, serial #, name, sequencing assay type, sample type, reagent type, equipment type, species.
Specifications: sequencing protocol; objectives – coverage, genome type targeted.
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Relations: has_input, has_output, has_specification, has_part, plays, denotes, located_in, instance_of.
Rows referenced: v14, v40, v41.]

Data Transformations

[Figure: network diagram of the data transformation step.
Processes: data transformations (image processing, assembly X, variant detection, serotype marker detection, gene detection), data archiving process.
Data: primary data, sequence data, genotype data, serotype data, gene data, sequence data record.
Entities: microorganism X, microorganism genomic NA, person X, software.
Roles: bioinformatics tech. role.
Identifiers and types: run ID, GenBank ID, name, species, species/strain.
Specifications: algorithm, data transfer protocol.
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Relations: has_input, has_output, has_specification, has_part, part_of, plays, is_about, denotes, located_in, instance_of.
Rows referenced: v29, v30, v31, v32, v42, v43, v44, v45, v46, v47.]

Investigation

[Figure: network diagram of the investigation level.
Nodes: investigation, study design, documenting, study design execution, objective specification, data transformation, information content entity, specimen creation, specimen preparation for assay, sequencing assay.
Relations: has_part, has_specified_input.
Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italic text marks relations.]

Generic Assay

[Figure: generic assay pattern.
Entities: assay X, input sample material X, sample material X, material X, analyte X, person X, equipment X, primary data.
Roles: target role, reagent role, technician role, signal detection role.
Identifiers, types and qualities: run ID, sample ID, lot #, serial #, name, assay type, sample type, reagent type, equipment type, species, quality x.
Specifications: assay protocol, objectives.
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Relations: has_input, has_output, has_specification, has_part, plays, is_about, denotes, located_in, has_quality, instance_of.]

Generic Material Transformation

[Figure: generic material transformation pattern.
Entities: material transformation X, sample material X, material X, output material X, person X, equipment X.
Roles: target role, reagent role, technician role, signal detection role.
Identifiers, types and qualities: run ID, sample ID, lot #, serial #, name, material transformation type, sample type, material type, reagent type, equipment type, species, quality x.
Specifications: material transformation protocol, objectives.
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Relations: has_input, has_output, has_specification, has_part, plays, denotes, located_in, has_quality, instance_of.]

Generic Data Transformation

[Figure: generic data transformation pattern.
Entities: data transformation X, input data, output data, material X, person X, software.
Roles: data analyst role.
Identifiers and types: run ID, name, data transformation type.
Specifications: algorithm.
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Relations: has_input, has_output, has_specification, has_part, plays, is_about, denotes, located_in, instance_of.]

Generic Material (IC)

[Figure: generic material (independent continuant) pattern.
Entities: material X, with parts material Y and material Z.
Identifiers, types and qualities: ID, material type, quality x, quality y.
Regions: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Relations: has_part, has_quality, denotes, instance_of, located_in.]

Discussion Points

MIBBI may not be sufficient
  Does not distinguish between the minimum information needed to reproduce an experiment and the minimum information to structure in a database
  Lacks a semantic framework
OBI-based framework is re-usable
  Sequencing => “omics”
Challenge of using ontologies for preferred value sets
  Can be large
  May not directly match common language
Value of defining the semantic framework
  Appropriate relations are retained
  How can we take advantage of the framework for semantic query and inferential analysis?
Practical issues about implementation strategies
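One way to picture the semantic query question is to store the framework's has_input and has_output relations as triples and walk them backwards to recover provenance. The sketch below is an assumption-laden toy: the triples are a simplified chain loosely based on the node and relation names in the diagrams, not the working group's actual model, and the function name upstream is hypothetical.

```python
# Simplified, illustrative triples (subject, relation, object).
# The exact edges in the slides' diagrams are not reproduced here.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "input sample"),
    ("sequencing assay", "has_input", "input sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("data transformation", "has_input", "primary data"),
    ("data transformation", "has_output", "sequence data"),
]


def upstream(entity, triples=TRIPLES):
    """Return every entity that `entity` was derived from, nearest first."""
    result = []
    frontier = [entity]
    seen = {entity}
    while frontier:
        current = frontier.pop(0)
        # Find the processes that produced the current entity.
        producers = [s for s, p, o in triples if p == "has_output" and o == current]
        for process in producers:
            # Each input of a producing process is one step further upstream.
            for s, p, o in triples:
                if s == process and p == "has_input" and o not in seen:
                    seen.add(o)
                    result.append(o)
                    frontier.append(o)
    return result


print(upstream("sequence data"))
```

Because the relations are retained rather than flattened into spreadsheet columns, a question such as "which specimen source does this sequence come from?" becomes a graph traversal instead of a hand-built join.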