
GSC-BRC Metadata Standards

Richard H. Scheuermann

U.T. Southwestern Medical Center

Metadata Inconsistencies

• Each project was providing different types of metadata

• No consistent nomenclature being used

• Impossible to perform reliable comparative genomics analysis

Dengue Clinical Metadata

Virus Isolate Information

Complex Query Interface

Additional Clinical Characteristics

GSC-BRC Metadata Standards Working Group

• NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR)

• Develop metadata standards for pathogen isolate sequencing projects

Metadata Standards Process

• Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors

• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)

• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific

• For each data field, provide definitions, synonyms, allowed value sets (preferably using controlled vocabularies), expected syntax, examples, data categories and data providers

• Merge subgroup core elements into a common set of core metadata fields and attributes

• Assemble metadata fields into a semantic network

• Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)

• Compare, harmonize and map to other relevant initiatives, including MIGS, MIMS, BioProjects and BioSamples

• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
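As a rough illustration of the per-field attributes listed above (definition, synonyms, allowed values, expected syntax, example, data category, data provider), a field definition could be captured as a small record with a validation check. This is only a sketch: the field name `collection_date`, its regular expression, and the category and provider strings are hypothetical and are not taken from the working group's spreadsheets.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class MetadataField:
    """One row of a (hypothetical) core metadata sheet."""
    name: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[Set[str]] = None   # controlled vocabulary, if any
    syntax: Optional[str] = None                # expected syntax as a regex
    example: str = ""
    category: str = ""                          # data category
    provider: str = ""                          # data provider

    def validate(self, value: str) -> List[str]:
        """Return a list of problems; an empty list means the value passes."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: {value!r} not in allowed value set")
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            problems.append(f"{self.name}: {value!r} does not match expected syntax")
        return problems


# Hypothetical field; the name, pattern and attribute values are assumptions.
collection_date = MetadataField(
    name="collection_date",
    definition="Date on which the specimen was isolated from its source",
    synonyms=["isolation date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
    example="2011-03-15",
    category="Specimen Isolation",
    provider="specimen collector",
)

print(collection_date.validate("2011-03-15"))  # passes: no problems reported
print(collection_date.validate("March 2011"))  # reports a syntax problem
```

Keeping the allowed value set and the syntax pattern on the field itself means a submission spreadsheet could be checked row by row against the same definitions that document it.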

GSC-BRC Metadata Working Groups

Example Metadata

Virus Core Metadata Sheet

Metadata Merge

[Figure: semantic network for a pathogen sequencing project. Nodes: specimen source (organism or environmental), specimen collector, specimen isolation process, isolation protocol, specimen, microorganism, microorganism genomic NA, sample processing, enriched NA sample, input sample, reagents, technician, equipment, sequencing assay, primary data, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID, type, ID, qualities, temporal-spatial region. Relations: has_input, has_output, has_specification, has_part, has_quality, instance_of, is_about, denotes, located_in. Legend: node styles distinguish independent continuants, dependent continuants, occurrents and temporal-spatial regions; italic labels are relations.]
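One way to make such a network concrete is to store it as subject-relation-object triples and query it. The sketch below is an illustration only: the node and relation names come from the slide, but which node connects to which is an assumed reading, and the helper names `build_index` and `derived_from` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Assumed edges: names are from the slide, the pairings are an interpretation.
TRIPLES = [
    Triple("specimen isolation process", "has_input", "specimen source"),
    Triple("specimen isolation process", "has_specification", "isolation protocol"),
    Triple("specimen isolation process", "has_output", "specimen"),
    Triple("sample processing", "has_input", "specimen"),
    Triple("sample processing", "has_output", "enriched NA sample"),
    Triple("sequencing assay", "has_input", "enriched NA sample"),
    Triple("sequencing assay", "has_output", "primary data"),
    Triple("assembly", "has_input", "primary data"),
    Triple("assembly", "has_output", "sequence data"),
]


def build_index(triples):
    """Map (subject, relation) to the list of objects it points at."""
    index = defaultdict(list)
    for t in triples:
        index[(t.subject, t.relation)].append(t.obj)
    return index


def derived_from(entity, triples):
    """Walk has_output/has_input edges backwards to list upstream entities."""
    producers = {t.obj: t.subject for t in triples if t.relation == "has_output"}
    inputs = build_index(triples)
    lineage = []
    current = entity
    while current in producers:
        process = producers[current]
        sources = inputs.get((process, "has_input"), [])
        if not sources:
            break
        current = sources[0]  # this sketch follows a single input per process
        lineage.append(current)
    return lineage


print(derived_from("sequence data", TRIPLES))
```

With the assumed edges, the walk from "sequence data" passes back through primary data, the enriched NA sample and the specimen to the specimen source, which is the kind of provenance question the network is meant to answer.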

Network Overview

[Figure: the semantic network from the previous slide, partitioned into regions labeled Investigation, Specimen Isolation, Material Processing, Sequencing Assay and Data Processing.]

Metadata Categories

• Investigation

• Host/Source Characterization

• Specimen Isolation

• Pathogen Detection

• Pathogen Isolation

• Pathogen Characterization

• Specimen Processing

• Sample Shipment

• Sequencing Sample Preparation

• Sequencing Assay

• Data Transformation

Host/Source Characterization

[Figure: semantic network fragment. Nodes: organism, environmental material, specimen source role, species/strain, organism ID, common name, qualities (age, gender, symptom), specimen isolation procedure X, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, plays, denotes, has_quality, instance_of, has_part, located_in. Field labels: v10-v13, b14-b17, b19, b20, where vX is row X in the virus sheet. Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italic labels are relations.]

Specimen Isolation

[Figure: semantic network fragment. Nodes: organism, organism part, environmental material, environment, equipment, person (name, affiliation), specimen source role, specimen capture role, specimen collector role, specimen X (ID, specimen type), specimen isolation procedure X, specimen isolation procedure type, isolation protocol, hypothesis, IRB/IACUC approval, Comments (marked "????"), temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, plays, has_specification, has_part, has_quality, has_affiliation, has_authorization, is_about, denotes, instance_of, located_in. Field labels: v2, v3-4, v5-6, v7, v8, v9, v15-v19, b18, b22-b30.]

Pathogen Detection

[Figure: semantic network fragment. Nodes: specimen X (ID, specimen type, amount), microorganism X, species/strain, pathogen detection process X, pathogen detection method, pathogen detection protocol, data about pathogen presence, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, instance_of, located_in. Field labels: v15, v16, v27, v28, b21.]

Pathogen Isolation

[Figure: semantic network fragment. Nodes: specimen X (ID, specimen type, amount), microorganism X, species/strain, pathogen isolation process X, pathogen isolation method, pathogen isolation protocol, pathogen isolate X (ID, pathogen type, amount), temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, has_quality, denotes, instance_of, located_in. Field labels: v15, v16, v26, v34.]

Pathogen Characterization

[Figure: semantic network fragment extending the Pathogen Isolation network. Additional nodes: biological, antigenic, pathologic and genetic characteristic assays X, chromosome/plasmid assay X, antibiotic sensitivity assay X and genus/species/strain determination assay X, together with biovar, serovar, pathovar, genotype, chromosome/plasmid, antibody sensitivity (as labeled on the slide) and genus/species/strain characteristics. Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, instance_of, located_in. Field labels: v15, v16, v27, v29-v32, v34, b2-b13.]

Specimen Processing

[Figure: semantic network fragment. Nodes: specimen X, microorganism X, species/strain, aliquoting process X, aliquoting protocol, specimen X aliquot Y, sample set assembly process X, sample set assembly protocol, sample set X (containing specimen A aliquot B, specimen M aliquot N, specimen T aliquot U), repository deposition process X, repository specimen X, specimen repository, information record, attributes ID, specimen type and amount, temporal-spatial regions (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, has_quality, denotes, instance_of, located_in. Field labels: v15, v16, v20, v22, v23, v27, b40-b43.]

Sample Shipment

[Figure: semantic network fragment. Nodes: sample set X, sample shipment process X, sample shipment protocol, sample set X in transit, sample receipt process X, sample receipt protocol, sample set X at GSC, sample X at GSC, attributes ID, sample set type, sample type and amount, temporal-spatial regions (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, has_quality, denotes, instance_of, located_in. Field labels: v21, v24, v25.]

Sequencing Sample Preparation

[Figure: semantic network fragment. Nodes: specimen X, microorganism X, microorganism genomic NA, species/strain, aliquoting process X, aliquoting protocol, specimen aliquot X, NA enrichment process X, NA enrichment protocol, enriched NA sample X, NA amplification process X, NA amplification protocol, NA amplified sample X, library construction protocol, attributes ID, specimen type and amount, temporal-spatial regions (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, has_quality, denotes, instance_of, located_in. Field labels: v15, v16, v27, v33, v35-v39, b31-b33.]

Sequencing Assay

[Figure: semantic network fragment. Nodes: sequencing assay X (run ID, sequencing assay type), sequencing protocol, objectives (coverage, genome type targeted, finishing), sample material X (sample ID, sample type, template role), material X (lot #, reagent type, reagent role), person X (name, species, sequencing tech. role), equipment X (serial #, equipment type, signal detection role), primary data, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, plays, denotes, instance_of, located_in. Field labels: v14, v40, v41, b34, b38.]

Data Transformations

[Figure: semantic network fragment. Nodes: primary data, data transformations (image processing, assembly X, variant detection, serotype marker detection, gene detection), sequence data, genotype data, serotype data, gene data, finishing status, algorithm, software, run ID, person X (name, species, bioinformatics tech. role), microorganism X, microorganism genomic NA, species/strain, data archiving process, data transfer protocol, sequence data record, GenBank ID, temporal-spatial regions (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, part_of, has_quality, is_about, plays, denotes, instance_of, located_in. Field labels: v29-v32, v42-v47, b35-b37, b39.]

Generic Assay

[Figure: generic semantic network pattern for an assay. Nodes: assay X (run ID, assay type), assay protocol, objectives, input sample material X, sample material X (sample ID, sample type, target role), analyte X, quality x, material X (lot #, reagent type, reagent role), person X (name, species, technician role), equipment X (serial #, equipment type, signal detection role), primary data, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, plays, denotes, instance_of, located_in.]
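The generic assay pattern can be read as a template: one assay process, several inputs that each play a role, a protocol as its specification, and data as its output. The sketch below encodes that reading and fills it in with the roles named on the Sequencing Assay slide. The class names, the `describe` helper and all example values (run ID, sample ID, lot and serial numbers, person name) are hypothetical, and using the run ID to stand for the assay itself is a simplification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Participant:
    """An input to the assay together with the role it plays."""
    identifier: str   # e.g. sample ID, lot #, name, serial #
    kind: str         # the type this input is an instance_of
    role: str         # the role it plays in the assay


@dataclass
class Assay:
    run_id: str                         # stands in for the assay it denotes
    assay_type: str                     # instance_of
    protocol: str                       # has_specification
    inputs: List[Participant] = field(default_factory=list)  # has_input
    output: str = "primary data"        # has_output

    def describe(self) -> List[str]:
        """Flatten the assay into 'subject relation object' statements."""
        lines = [
            f"{self.run_id} instance_of {self.assay_type}",
            f"{self.run_id} has_specification {self.protocol}",
        ]
        for p in self.inputs:
            lines.append(f"{self.run_id} has_input {p.identifier}")
            lines.append(f"{p.identifier} plays {p.role}")
        lines.append(f"{self.run_id} has_output {self.output}")
        return lines


# Hypothetical instance using the roles from the Sequencing Assay slide.
run = Assay(
    run_id="run-001",
    assay_type="sequencing assay",
    protocol="sequencing protocol",
    inputs=[
        Participant("sample-42", "sample", "template role"),
        Participant("lot-7", "reagent", "reagent role"),
        Participant("J. Doe", "person", "sequencing tech. role"),
        Participant("serial-99", "equipment", "signal detection role"),
    ],
)

for statement in run.describe():
    print(statement)
```

Swapping the assay type, protocol and role names would give any other assay in the network, which is the point of defining the pattern generically.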

Generic Material Transformation

[Figure: generic semantic network pattern for a material transformation. Nodes: material transformation X (run ID, material transformation type), material transformation protocol, objectives, sample material X (sample ID, sample type, target role, quality x), material X (lot #, reagent type, reagent role), person X (name, species, technician role), equipment X (serial #, equipment type, signal detection role), output material X (sample ID, material type, quality x), temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, instance_of, located_in.]

Generic Data Transformation

[Figure: generic semantic network pattern for a data transformation. Nodes: data transformation X (run ID, data transformation type), input data, output data, material X, algorithm, software, person X (name, data analyst role), temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, is_about, plays, denotes, instance_of, located_in.]

Generic Material (IC)

[Figure: generic semantic network pattern for a material. Nodes: material X (ID, material type, quality x), material Y and material Z as parts, quality y, two temporal-spatial regions (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_part, has_quality, denotes, instance_of, located_in.]

OBI specimen creation

[Figure: OBI representation of specimen creation. Nodes: specimen creation, protocol, specimen creation objective, organism (for 'collecting specimen from an organism'), human being, material entity (for 'environmental material collection'), anatomical entity ('portion of body substance' or 'portion of tissue'), specimen, infectious agent, individual organism identifier, written name, CRID symbol, organization, geographic location, time measurement datum, measurement datum, textual entity, document, quality, growth environment, treatment, genetic characteristics information, information content entity. Relations: has_specified_input, has_specified_output, has_participant, has_agent, has_supplier, has_quality, is_quality_measured_as, is_duration_of, is_member_of_organization, achieves_planned_objective, realizes, unfolds_in, is_about, is_a, denotes, located_in, synonym. Field labels: e14-e27, e29-e33, e35-e47, e50.]

Status

• Core metadata merge process nearly complete

• Comprehensive semantic networks developed

• Begun the OBI harmonization process

• Begun the MIGS/MIMS harmonization process

• Still need to:

– Compare, harmonize, map with BioProjects and BioSamples

– Decide what to do about metadata fields that appear to be project specific

– Develop metadata submission templates

– Report process and results