Ontology application and use at the ENCODE DCC
Venkat Malladi, Data Wrangler, ENCODE DCC, Department of Genetics, Stanford University School of Medicine
Venkat Malladi ENCODE DCC
Overview
Metadata model
Ontologies
Search
Future directions
Intro to ENCODE and the DCC
What is ENCODE?
Modified from PLoS Biol 9:e1001046, 2011 (M. Pazin)
Approximately 30 different assays
Role of the Data Coordination Center
[Diagram labels: Data files, Metadata, ENCODE portal (DCC), Genome Browser, Integrative websites]

Production labs / Analysis groups
  Role: Data generation
  Tasks: Perform assays; perform analyses; validate data; submit data files; submit metadata
DCC
  Role: Data organization
  Tasks: Data processing & validation; data file storage; metadata curation
Scientific community
  Role: Data access
  Tasks: Web-based searches; data downloads
Challenge: Find common biosamples from data generated by two consortia
356 terms: http://encodeproject.org/ENCODE/cellTypes.html
Projects are internally consistent …
314 terms: GEO characteristics (common_name, tissue_type, cell_type, lines)
Simple text match
360 terms: Cell type
… but only 3 biosample names match exactly between projects
314 terms: GEO
Exact matches: IMR90, PBMC, Th17
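A simple text match of this kind can be sketched as follows. This is only an illustration, not the DCC's actual comparison: the function names are hypothetical, and the two lists are toy data in which only IMR90, PBMC and Th17 come from the slide.

```python
def exact_matches(names_a, names_b):
    """Names that appear, character for character, in both lists."""
    return sorted(set(names_a) & set(names_b))


def normalize(name):
    """Lowercase a name and drop every non-alphanumeric character."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def normalized_matches(names_a, names_b):
    """Pairs (a, b) whose normalized spellings agree."""
    index = {normalize(b): b for b in names_b}
    return [(a, index[normalize(a)]) for a in names_a if normalize(a) in index]


# Toy lists: only IMR90, PBMC and Th17 are taken from the slide;
# the other names and spellings are illustrative assumptions.
encode_names = ["IMR90", "PBMC", "Th17", "H1-hESC", "HCM"]
geo_names = ["IMR90", "PBMC", "Th17", "h1 hesc", "Right Atrium"]

print(exact_matches(encode_names, geo_names))
print(normalized_matches(encode_names, geo_names))
```

Normalizing case and punctuation recovers spelling variants, but it cannot tell that two differently named samples describe related biology. That gap is what ontology annotation is meant to close.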
Metadata annotation using Ontologies
An ontology is a set of words and relationships. All relationships must be true.
[Figure: ontology graph of the terms cell, nucleus, mitochondrion, chromosome and mitochondrial chromosome, with child terms linked to parent terms by part_of and is_a relationships; one part_of relationship is marked X.]
An ontology is a set of words and relationships. True relationships are needed because inferences can be based upon them.
[Figure: the same ontology graph (cell, nucleus, mitochondrion, chromosome, mitochondrial chromosome; part_of and is_a relationships), with relationships labeled True or False; the part_of relationships marked X are false. Source: http://www.geneontology.org/GO.ontology.relations.shtml]
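The reason every relationship must be true can be shown with a small sketch. The edge list below is an assumption: it is one plausible reading of the slide's graph, based on the well-known Gene Ontology example, and the function name is hypothetical. The code follows is_a and part_of edges upward and reports everything a term is inferred to be part of.

```python
from collections import deque

# Edges are (child term, relationship, parent term).
# ASSUMPTION: one plausible reading of the slide's graph, not a copy of it.
TRUE_EDGES = [
    ("mitochondrial chromosome", "is_a", "chromosome"),
    ("mitochondrial chromosome", "part_of", "mitochondrion"),
    ("mitochondrion", "part_of", "cell"),
    ("nucleus", "part_of", "cell"),
]
# Not every chromosome is part of a nucleus, so this edge is false.
FALSE_EDGE = ("chromosome", "part_of", "nucleus")


def inferred_wholes(term, edges):
    """Every term that `term` is part of, directly or by inference.

    A path counts as part_of once it contains at least one part_of edge;
    is_a edges before or after it do not change that.
    """
    wholes = set()
    start = (term, False)          # (current term, crossed a part_of edge?)
    seen = {start}
    queue = deque([start])
    while queue:
        node, crossed = queue.popleft()
        for child, relation, parent in edges:
            if child != node:
                continue
            now_crossed = crossed or relation == "part_of"
            if now_crossed:
                wholes.add(parent)
            state = (parent, now_crossed)
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return wholes


print(inferred_wholes("mitochondrial chromosome", TRUE_EDGES))
print(inferred_wholes("mitochondrial chromosome", TRUE_EDGES + [FALSE_EDGE]))
```

With only the true edges, the mitochondrial chromosome is inferred to be part of the mitochondrion and the cell. Adding the single false edge also makes it part of the nucleus, which is wrong: one untrue relationship propagates into every inference that passes through it.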
Why use ontologies?
Reason 1: A consistent way of describing biological concepts.
Reason 2: Consistent language makes related data easy to identify.
Reason 3: Consistent data analysis. Relationships between terms allow flexible grouping while everyone uses the same set of metadata.
What metadata is annotated with ontologies?
1. The biological sample serving as input (Biosample)
2. The reagents and conditions applied to the biological input (Treatment)
3. The set of methods and conditions used to survey the biological input (Assay)
Biosample ontologies
1. Uber anatomy ontology (Uberon): anatomical structures, locations, and heterogeneous mixtures of cells
2. Cell Ontology (CL): primary cells or stem cells
3. Experimental Factor Ontology (EFO): biosamples with no directly corresponding anatomical structure or physiological cell type
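Because ontology term identifiers carry their ontology as a prefix, a biosample annotation can be routed to one of these three ontologies by inspecting its identifier. A minimal sketch, assuming identifiers of the form PREFIX:digits; the function name is hypothetical and the numeric parts used in the example are placeholders, not curated annotations.

```python
# The three biosample ontologies named on the slide, keyed by ID prefix.
BIOSAMPLE_ONTOLOGIES = {
    "UBERON": "Uber anatomy ontology",
    "CL": "Cell Ontology",
    "EFO": "Experimental Factor Ontology",
}


def ontology_for(term_id):
    """Return the biosample ontology a term ID belongs to.

    Raises ValueError if the ID is not PREFIX:digits or if the prefix
    is not one of the three biosample ontologies.
    """
    prefix, sep, local_id = term_id.partition(":")
    if not sep or not local_id.isdigit():
        raise ValueError(f"not a PREFIX:digits identifier: {term_id!r}")
    if prefix not in BIOSAMPLE_ONTOLOGIES:
        raise ValueError(f"not a biosample ontology prefix: {prefix!r}")
    return BIOSAMPLE_ONTOLOGIES[prefix]


# Placeholder numeric parts: only the prefixes matter for this sketch.
for term_id in ["UBERON:0000000", "CL:0000000", "EFO:0000000"]:
    print(term_id, "->", ontology_for(term_id))
```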
Challenge: Find all heart-related tissues
Heart_OC, HCF, HCFaa, HCM, others?
Fetal Heart, Heart, Right Atrium, Right Ventricle, others?
Searching ENCODE metadata
Ontology driven search
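The idea behind an ontology-driven search can be sketched as follows: each biosample name is annotated with an ontology term, and a query for "heart" returns every biosample whose term is heart or has heart among its ancestors. Everything in this sketch is an assumption made for illustration: the parent links, the name-to-term annotations, and the function names are hypothetical and do not reproduce the portal's implementation or the real ontologies.

```python
# Hypothetical child -> parent links (is_a and part_of collapsed together).
PARENTS = {
    "right atrium": ["heart"],
    "right ventricle": ["heart"],
    "cardiac fibroblast": ["heart"],
    "cardiac muscle cell": ["heart"],
    "heart": ["organ"],
    "lung": ["organ"],
}

# Hypothetical annotations of biosample names with ontology term labels.
ANNOTATIONS = {
    "Heart_OC": "heart",
    "HCF": "cardiac fibroblast",
    "HCFaa": "cardiac fibroblast",
    "HCM": "cardiac muscle cell",
    "Fetal Heart": "heart",
    "Heart": "heart",
    "Right Atrium": "right atrium",
    "Right Ventricle": "right ventricle",
    "IMR90": "lung",  # a non-heart sample, included for contrast
}


def ancestors(term):
    """All terms reachable from `term` by following parent links."""
    found = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def search(query):
    """Biosample names annotated with `query` or with one of its descendants."""
    return sorted(
        name
        for name, term in ANNOTATIONS.items()
        if term == query or query in ancestors(term)
    )


print(search("heart"))
```

A plain text search for "heart" would miss HCF, HCFaa, HCM, Right Atrium and Right Ventricle because none of those names contains the word; the relationships between terms are what bring them into the result.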
Future directions
• Additional ontologies
• Ontology-based data validations
Additional ontologies
• Protein Ontology (PRO, http://pir.georgetown.edu/pro/pro.shtml)
  o transforming growth factor beta-1 (human): PR:P01137
• EDAM Ontology (EDAM, http://edamontology.org)
  o FASTQ: format:1930, BAM: format:2572
  o sequence alignment: data:0863
Ontology-based validations
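One form such a validation could take is checking that a file's declared format carries the expected EDAM term. The sketch below is an assumption about how this might look, not the DCC's validation code: the record fields `file_format` and `format_term` and the function name are hypothetical, and only the two EDAM identifiers are taken from the slides.

```python
# EDAM format identifiers as listed in the slides.
EDAM_FORMAT_IDS = {
    "FASTQ": "format:1930",
    "BAM": "format:2572",
}


def validate_file_record(record):
    """Return a list of error messages; an empty list means the record passes.

    `record` is a dict with the hypothetical keys "file_format" and
    "format_term".
    """
    errors = []
    declared = record.get("file_format")
    term = record.get("format_term")
    if declared not in EDAM_FORMAT_IDS:
        errors.append(f"unknown file_format: {declared!r}")
    elif term != EDAM_FORMAT_IDS[declared]:
        expected = EDAM_FORMAT_IDS[declared]
        errors.append(f"{declared} should be annotated with {expected}, got {term!r}")
    return errors


print(validate_file_record({"file_format": "FASTQ", "format_term": "format:1930"}))
print(validate_file_record({"file_format": "BAM", "format_term": "format:1930"}))
```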
Acknowledgments
Software Engineers: Nikhil Podduturi, Laurence Rowe, Forrest Tanaka
Data Wranglers: Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, J. Seth Strattan
Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz
QA, administration, biocuration: Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang, Marcus Ho
National Institute of General Medical Sciences of the United States National Institutes of Health (GM10331601); U41 grant from the National Human Genome Research Institute at the U.S. National Institutes of Health (HG006992)