Biomedical Ontologies. Bio-Trac 40 (Protein Bioinformatics), October 9, 2008. Zhang-Zhi Hu, M.D., Research Associate Professor, Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.
1
Bio-Trac 40 (Protein Bioinformatics)
October 9, 2008
Zhang-Zhi Hu, M.D.
Research Associate Professor
Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
Biomedical Ontologies
2
Overview
• What is ontology?
– What is biomedical ontology?
• What is gene ontology?
– How is it generated?
– How is it used for annotation?
• What is protein ontology?
– Why is it necessary?
– How to use it?
• ……
3
Tree of Porphyry with Aristotle’s Categories
Aristotle, 384 BC – 322 BC
4
Ontology:
• In philosophy, it seeks to describe basic categories and relationships of being or existence to define entities and types of entities within its framework:
– What do you know? How do you know it?
– What is existence? What is a physical object?
– What constitutes the identity of an object? ……
• Central goal is to have a definitive and exhaustive classification of all entities.
onto-, of being or existence; -logy, study. Greek origin; Latin, ontologia,1606
“The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality”
– Barry Smith, U Buffalo
5
In computer and information science
• Ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.
Most ontologies describe individuals (instances), classes (concepts), attributes, and relations:
• Individuals (instances): e.g. your Ford, my Ford, his Ford…
• Classes (concepts)
• Attributes: e.g. color, engine, door…
• Relations: e.g. is_a
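The four building blocks named on this slide can be sketched in a few lines of Python. This is only an illustrative data model, not a real ontology format: the class names `Vehicle` and `Car`, the attribute value `"blue"`, and the helper `is_a` are assumptions made for the example; only "Ford" and the attribute names come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """A class (concept); `parents` holds the targets of its is_a relations."""
    name: str
    attributes: list = field(default_factory=list)
    parents: list = field(default_factory=list)


@dataclass
class Individual:
    """An individual (instance) of a class, with values for some attributes."""
    label: str
    instance_of: OntologyClass
    values: dict = field(default_factory=dict)


def is_a(cls, ancestor):
    """Return True if `cls` is `ancestor` or reaches it through is_a links."""
    if cls is ancestor:
        return True
    return any(is_a(parent, ancestor) for parent in cls.parents)


# "Vehicle" and "Car" are hypothetical classes; "Ford" comes from the slide.
vehicle = OntologyClass("Vehicle")
car = OntologyClass("Car", attributes=["color", "engine", "door"], parents=[vehicle])
ford = OntologyClass("Ford", parents=[car])

# An individual: one particular Ford with a (made-up) attribute value.
my_ford = Individual("my Ford", ford, values={"color": "blue"})

# Reasoning over the model: my Ford is an instance of a class that is_a Vehicle.
print(is_a(my_ford.instance_of, vehicle))
```

Even this small model supports the kind of reasoning the slide mentions: class membership follows the is_a links upward, so anything stated about `Vehicle` applies to every Ford.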
6
What are ontologies useful for?
• Terminology management
• Integration, interoperability, and sharing of data
– promote precise communication between scientists
– enable information retrieval across multiple resources
• Knowledge reuse and decision support
– extend the power of computational approaches to perform data exploration, inference, and mining
Ontology is a form of knowledge representation about the world or some part of it.
Biomedical Terminology vs. Biomedical Ontology
• UMLS (Unified Medical Language System)
• MeSH (Medical Subject Headings)
• NCI Thesaurus
• SNOMED / SNODENT
• Medical WordNet
7
Ontology Enables Large-Scale Biomedical Science
Ontology is at the center of two major activities in current biomedical research:
• Structured representation of biomedicine:
– For different types of entities and relations to describe biomedicine (ontology content curation).
• Annotation: using ontologies to summarize and describe biomedical experimental results to enable:
– Integration of their data with other researchers’ results
– Cross-species analyses
8
Gene Ontology (GO)
What makes it so wildly successful?
9
GO Consortium
• The Gene Ontology was originally constructed in 1998 by a consortium of researchers studying the genomes of three model organisms:
– Drosophila melanogaster (fruit fly) (FlyBase)
– Mus musculus (mouse) (MGD)
– Saccharomyces cerevisiae (yeast) (SGD)
• Many other model organism databases have joined the GO consortium, contributing:
– development of the ontologies
– annotations for the genes of one or more organisms
http://www.geneontology.org/
10
Three key concepts: [Currently total 25804 GO terms] (Oct. 2008)
• Biological process: a series of events accomplished by one or more ordered assemblies of molecular functions, e.g. signal transduction, pyrimidine metabolism, or alpha-glucoside transport. [total: 15161]
• Molecular function: activities, such as catalytic or binding activities, that occur at the molecular level and can be performed by individual gene products or by assembled complexes of gene products, e.g. catalytic activity, transporter activity. [total: 8425]
• Cellular component: a component of a cell that is part of some larger object, which may be an anatomical structure (e.g. ER or nucleus) or a gene product group (e.g. ribosome or a protein dimer). [total: 2218]
• What is Gene Ontology? GO provides a controlled vocabulary to describe gene and gene product attributes in any organism: how gene products behave in a cellular context.
Need for annotation of genome sequences
• GO annotation: characterization of gene products using GO terms. Members submit their data, which are available at the GO website.
11
GO Representation: Tree or Network?
[Figure: a tree compared with a network. Labels: root; node (a concept or a term); leaf node. In the network, C has two parents, A and B.]
• GO is a network structure
• Relations: is_a or part_of
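The difference between a tree and a network can be made concrete with a small sketch. The node names `root`, `A`, `B` and `C` follow the slide; which relation type sits on each edge is an assumption made for illustration, and `ancestors` and `is_tree` are hypothetical helper names.

```python
# Each term maps to its list of (relation, parent) links.
# The relation assigned to each edge is an assumption for this example.
edges = {
    "A": [("is_a", "root")],
    "B": [("is_a", "root")],
    "C": [("is_a", "A"), ("part_of", "B")],  # C has two parents, A and B
}


def ancestors(term, edges):
    """Collect every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_tree(edges):
    """A tree allows at most one parent per node; a network allows more."""
    return all(len(parents) <= 1 for parents in edges.values())


print(sorted(ancestors("C", edges)))
print(is_tree(edges))
```

Because `C` has two parents, `is_tree` returns `False`, and an annotation made to `C` is implicitly an annotation to `A`, `B` and `root` as well.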
12
http://www.geneontology.org/
13
GO term (GO:0006366): mRNA transcription from RNA polymerase II promoter
Leaf node
GO search and display tool
14
Human p53 – GO annotation (UniProtKB:P04637)
GO:0006289:nucleotide-excision repair [PMID:7663514; evidence:IMP]
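An annotation line like the one above carries four pieces of information: the GO identifier, the term name, the supporting publication, and the evidence code. The sketch below parses exactly the layout shown on this slide. That layout is a display format, not an official GO annotation file format, and the names `ANNOTATION_PATTERN` and `parse_annotation` are hypothetical.

```python
import re

# Matches the slide's display layout only:
#   GO:<7 digits>:<term name> [PMID:<digits>; evidence:<code>]
ANNOTATION_PATTERN = re.compile(
    r"^(GO:\d{7}):(.+?) \[PMID:(\d+); evidence:([A-Z]+)\]$"
)


def parse_annotation(line):
    """Split one annotation line into its fields, or return None if it does not match."""
    match = ANNOTATION_PATTERN.match(line.strip())
    if match is None:
        return None
    go_id, term, pmid, evidence = match.groups()
    return {"go_id": go_id, "term": term, "pmid": pmid, "evidence": evidence}


record = parse_annotation(
    "GO:0006289:nucleotide-excision repair [PMID:7663514; evidence:IMP]"
)
print(record)
```

Keeping the PMID and evidence code as separate fields makes it easy to filter annotations, for example to keep only those backed by a given type of evidence.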
15
• Science basis of the GO: trained experts use the experimental observations from literature to associate GO terms with gene products (to annotate the entities represented in the gene/protein databases)
• Enabling data integration across databases and making them available to semantic search
GO annotation of gene products
~46Human, mouse, plant, worm, yeast …
http://www.geneontology.org/GO.current.annotations.shtml
16
What GO is NOT……
• Ontology of gene products: e.g. cytochrome c is not in GO, but attributes of cytochrome c are, e.g. oxidoreductase activity.
• Processes, functions and components unique to mutants or diseases: e.g. oncogenesis is not a valid GO term.
• Protein domains or structural features.
• Protein-protein interactions.
• Environment, evolution and expression.
• Anatomical or histological features above the level of cellular components, including cell types.
Nor is GO an ontology of genes!! The name is a misnomer.
17
Missing GO nodes…
• not deep enough…
• not broad enough…
18
Lack of connections among GOs
Estrogen receptor
19
what cellular component?
what molecular function?
what biological process?
GO: A Common Standard for Omics Data Analysis
20
need more…
• need to improve the quality of GO to support more rigorous logic-based reasoning across the data annotated in its terms
• need to extend the GO by engaging ever broader community support for addition of new terms and for correction of errors
• need to extend the methodology to other domains, including clinical domains, such as:
– disease ontology
– immunology ontology
– symptom (phenotype) ontology
– clinical trial ontology
– ...
21
• Establish common rules governing best practices for creating ontologies and for using these in annotations
• Apply these rules to create a complete suite of orthogonal interoperable biomedical reference ontologies
National Center for Biomedical Ontology (NCBO)
http://www.obofoundry.org/
http://bioontology.org/
22
http://www.obofoundry.org/index.cgi?sort=domain&show=ontologies
23
The OBO Foundry
• A family of interoperable gold standard biomedical reference ontologies to serve annotation of:
– scientific literature
– model organism databases
– clinical trial data …
OBO Foundry = a subset of OBO ontologies whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development, designed to ensure:
• tight connection to the biomedical basic sciences
• compatibility, interoperability, common relations
• support for logic-based reasoning
OBO Foundry Principles: http://www.obofoundry.org/crit.shtml
24
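Rationale of OBO Foundry coverage
Columns (RELATION TO TIME): CONTINUANT (INDEPENDENT or DEPENDENT) and OCCURRENT. Rows (GRANULARITY):
ORGAN AND ORGANISM
– Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
– Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
– Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
– Independent continuant: Cell (CL); Cellular Component (FMA, GO)
– Dependent continuant: Cellular Function (GO)
MOLECULE
– Independent continuant: Molecule (ChEBI, SO, RNAO, PRO)
– Dependent continuant: Molecular Function (GO)
– Occurrent: Molecular Process (GO)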
25
OBO Relation Ontology
• Foundational: is_a, part_of
• Spatial: located_in, contained_in, adjacent_to
• Temporal: transformation_of, derives_from, preceded_by
• Participation: has_participant, has_agent
e.g.: A is_a B =def. every instance of A is an instance of B
“rose is_a plant” means that every instance of rose is an instance of plant.
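The instance-level definition of is_a can be read as a subset test: A is_a B holds when the set of instances of A is contained in the set of instances of B. A minimal sketch follows, with made-up instance identifiers (`rose_1`, `oak_1`, and so on) and a hypothetical helper name `holds_is_a`.

```python
# Hypothetical instance sets; the identifiers are invented for illustration.
instances = {
    "rose": {"rose_1", "rose_2"},
    "plant": {"rose_1", "rose_2", "oak_1"},
}


def holds_is_a(a, b, instances):
    """A is_a B holds if every instance of A is also an instance of B."""
    return instances[a] <= instances[b]


print(holds_is_a("rose", "plant", instances))  # every rose is a plant
print(holds_is_a("plant", "rose", instances))  # oak_1 is a plant but not a rose
```

The second call fails because of a single counterexample, which is what makes the definition useful for checking an ontology against its annotated data.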
26
What is Protein Ontology? Why?
PRO
http://pir.georgetown.edu/pro/
27
The Need for Representation of Various Protein Forms
Human PRLR
and PTMs…
Glucocorticoid receptor (GR)
28
Sphingomyelin phosphodiesterase (SMPD1) (ASM_HUMAN)
• Cleavage sites:
– lysosomal: the enzyme is transported from the Golgi apparatus to the lysosome after addition of mannose-6-phosphate (M6P) moieties and binding to the M6P receptor.
– secreted: the shorter cleaved form is not modified with M6P and is targeted for secretion to the extracellular space, with different functions such as LDL binding and oxidized LDL catabolism.
M6P
lysosome
Extracellular, e.g. LDL binding
29
Alternative splicing
• Only FGF8b can transform midbrain to cerebellum whereas FGF8a causes an overgrowth of midbrain.
FGF8_HUMAN alternative splicing
a single new contact between Phe32 (F32) of FGF8b and a hydrophobic groove within Ig domain 3 of FGFR2c
FGF8a FGF8b
Olsen et al., Genes Dev. 2006
FGF8a, 8b – differ in their ability to pattern embryonic brain
30
GOA for Transcription factor Ovo-like 2
Form 1 - long: GO:0045892 IDA - negative regulation of transcription, DNA-dependent
Form 2 - short: GO:0045893 IDA - positive regulation of transcription, DNA-dependent
274 aa
OVOL2_MOUSE (Q8CIV7)
- Gene. 2004 336:47-58. PMID:15225875
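The Ovo-like 2 example shows why annotation at the level of individual protein forms matters: the two forms carry opposite GO terms. The sketch below stores the two annotations from this slide and contrasts a per-form view with a view collapsed to the whole entry. The record layout and the function names `terms_by_form` and `entry_level_terms` are assumptions for illustration.

```python
# The two annotations listed on the slide for OVOL2_MOUSE (Q8CIV7).
annotations = [
    {"accession": "Q8CIV7", "form": "Form 1 (long)", "go_id": "GO:0045892",
     "evidence": "IDA", "term": "negative regulation of transcription, DNA-dependent"},
    {"accession": "Q8CIV7", "form": "Form 2 (short)", "go_id": "GO:0045893",
     "evidence": "IDA", "term": "positive regulation of transcription, DNA-dependent"},
]


def terms_by_form(annotations):
    """Group GO identifiers by the protein form they were observed for."""
    grouped = {}
    for record in annotations:
        grouped.setdefault(record["form"], []).append(record["go_id"])
    return grouped


def entry_level_terms(annotations):
    """Collapse all forms into one entry, losing which form does what."""
    return sorted({record["go_id"] for record in annotations})


print(terms_by_form(annotations))
print(entry_level_terms(annotations))
```

Collapsed to one entry, the protein appears to regulate transcription both negatively and positively; only the per-form view shows that each activity belongs to a different form.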
31
The Need for Protein Classes Representing Protein Evolutionary Relationships
• Genes/proteins identified in model organisms, such as mouse, yeast, fly, may have important functional implications in human.
– Gene function in a model organism may not apply to human
• Animal models for human diseases: such as mouse models for diabetes, arthritis, and tumor.
– Genes essential in one species may be redundant and nonessential in another species due to functional compensation, e.g.:
• mutation of Rb1 causes retinoblastoma in early childhood in humans
• the Rb1 knock-out mouse did not develop retinoblastoma because of compensation from a functional homolog, p107.
• Close examination of proteins in phylogenetic classes, and of their functional convergence and divergence, in an ontological structure is important for the application of disease models.
32
[Figure: species tree with leaves Human, Chimp, Mouse, Rat, Fly, Worm, Yeast, E. coli, B. subtilis]
Implications of Protein Evolution
Common ancestor
• Conclusions from experiments performed on proteins from one organism are often applicable to the homologous protein from another organism.
• Information learned about existing proteins allows us to infer the properties of ancestral proteins.
33
Protein Evolution
With enough similarity, one can trace back to a common origin
Sequence changes
What about these?
Domain shuffling
34
Functional convergence
• Protein classes of the same function derived from different evolutionary origins, e.g. carbonate dehydratase (or carbonic anhydrase EC 4.2.1.1), which has three independent gene families with functional convergence.
Animal and prokaryotic type
Plant and prokaryotic type
Archaea type
35
Functional divergence
[Figure: gene tree. A gene duplication (TGM3/EPB42 split) is followed by speciation (Human/mouse split), giving a TGM3 branch with TGM3 (Human) and TGM3 (Mouse), and an EPB42 branch with EPB42 (Human) and EPB42 (Mouse).]
TGM3 = Protein-glutamine gamma-glutamyltransferase (Transglutaminase; involved in protein modification)
EPB42 = Erythrocyte membrane protein band 4.2 (Constituent of cytoskeleton; involved in cell shape)
36
The Need for Protein Ontology
• Data integration and knowledge management for -omics work.
• A gap exists in OBO for gene products.
• Protein Ontology (PRO) will contain two connected components (or subontologies):
– ProEvo captures the protein classes represented by protein families at fold, domain and full-length levels that reflect evolutionary relationships.
– ProForm captures the specific protein objects of a specific gene resulting from alternative splicing, posttranslational modification, and genetic variation.
– ProEvo and ProForm are connected through the “reference” (canonical) protein sequence currently annotated in UniProtKB.
• PRO formalization of these detailed protein objects and classes will allow accurate and consistent proteomics experimental design and data analysis/integration.
37
PRO Framework
• PRO is designed to be a formal and well-principled OBO Foundry ontology for protein entities.
• Attributes of objects will take the form of links to other ontologies, such as gene (GO), sequence (SO), modification (PSI-MOD) and disease (DO) ontologies.
• A PRO prototype for TGF-beta signaling proteins was built based on this framework.
• In this way, PRO aims to provide an ontological framework for defining protein entities and evolutionarily related classes that the community can adopt for different purposes, e.g.
– annotation of entity attributes,
– mapping of objects in pathways, and
– modeling of biological system dynamics and disease.
38
Protein Ontology (PRO) http://pir.georgetown.edu/pro/
Subontologies shown: ProEvo, ProForm

PRO levels (each level is linked to the one above by is_a, except where noted):
• Root Level: protein
• Family-Level Distinction: translation product of an evolutionarily-related gene
– Derivation: common ancestor
– Source: PIRSF family
• Gene-Level Distinction: translation product of a specific gene
– Derivation: specific gene
– Sources: PIRSF subfamily, Panther subfamily
• Sequence-Level Distinction: translation product of a specific mRNA
– Derivation: specific allele or splice variant
– Source: UniProtKB
• Modification-Level Distinction: cleaved/modified translation product (linked to the sequence level by derives_from)
– Derived from post-translational modification
– Source: UniProtKB

Links from PRO to other resources:
• GO Gene Ontology: has_function (molecular function); part_of (for complexes) and located_in (for compartments) (cellular component); participates_in (biological process)
• OMIM Disease: agent_in (disease)
• PSI-MOD Modification: has_modification (protein modification)
• SO Sequence Ontology: has_agent (sequence change), agent_of (effect on function) (sequence change)
• Pfam Domain: has_part (protein domain)

Example:
TGF-beta receptor phosphorylated smad2 isoform1 (Modification Level)
is a phosphorylated smad2 isoform1
derives_from smad2 isoform 1 (Sequence Level)
is a smad2 (Gene Level)
is a TGF-β receptor-regulated smad (Family Level)
is a smad
is a protein (Root Level)
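The smad2 example on this slide is a chain of is_a and derives_from links that runs from a modified form up to the root term "protein". The sketch below encodes that chain and walks it. The dictionary layout and the function name `path_to_root` are assumptions, and the spelling "smad2 isoform 1" is normalized here because the slide writes it both with and without a space.

```python
# Each term maps to (relation, parent), following the example on the slide.
links = {
    "TGF-beta receptor phosphorylated smad2 isoform 1":
        ("is_a", "phosphorylated smad2 isoform 1"),
    "phosphorylated smad2 isoform 1": ("derives_from", "smad2 isoform 1"),
    "smad2 isoform 1": ("is_a", "smad2"),
    "smad2": ("is_a", "TGF-beta receptor-regulated smad"),
    "TGF-beta receptor-regulated smad": ("is_a", "smad"),
    "smad": ("is_a", "protein"),
}


def path_to_root(term, links):
    """Follow parent links from `term` until a term with no parent is reached."""
    path = [term]
    while path[-1] in links:
        _relation, parent = links[path[-1]]
        path.append(parent)
    return path


for step in path_to_root("TGF-beta receptor phosphorylated smad2 isoform 1", links):
    print(step)
```

Walking the chain this way shows how an annotation attached to a specific modified form can still be related to its isoform, its gene-level product, its family, and finally the root.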
39
GO annotation of SMAD2_HUMAN (Mothers against decapentaplegic homolog 2; Smad 2):
Cellular Component:
- nucleus
Molecular Function:
- protein binding
Biological Process:
- signal transduction
- regulation of transcription, DNA-dependent
40
[Figure: TGF-beta signaling through Smad 2. Labels: TGF-beta receptor (I, II); Smad 2; Smad 4; ERK1; CAMK2; Cytoplasm; Nucleus. Numbered steps: 1 phosphorylation, 2 complex formation, 3 nuclear translocation, 4 DNA binding. Outcome: Transcription Regulation.]
41
Smad2 gene products (all from SMAD2_HUMAN), listed as form: location and properties (ID)
• “normal”: Cytoplasmic (PRO:00000011)
• TGF-β receptor phosphorylated: Forms complex; Nuclear; Txn upregulation (PRO:00000013)
• ERK1 phosphorylated: Forms complex; Nuclear; Txn upregulation++ (PRO:00000014)
• CAMK2 phosphorylated: Forms complex; Cytoplasmic; No Txn upregulation (PRO:00000015)
• alternatively spliced short form: Cytoplasmic (PRO:00000016)
• phosphorylated short form: Nuclear; Txn upregulation (PRO:00000018)
• point mutation (causative agent: large intestine carcinoma): Doesn’t form complex; Cytoplasmic; No Txn upregulation (PRO:00000019)
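Once each Smad2 form has its own identifier, the properties listed for the forms become queryable data. The sketch below encodes the seven forms from this slide and asks which ones upregulate transcription. The field names and the use of `None` for properties the slide does not state are assumptions, and `upregulating_forms` is a hypothetical helper.

```python
from typing import NamedTuple, Optional


class ProteinForm(NamedTuple):
    pro_id: str
    form: str
    location: str
    forms_complex: Optional[bool]      # None: not stated on the slide
    txn_upregulation: Optional[bool]   # None: not stated on the slide


smad2_forms = [
    ProteinForm("PRO:00000011", "normal", "Cytoplasmic", None, None),
    ProteinForm("PRO:00000013", "TGF-beta receptor phosphorylated", "Nuclear", True, True),
    ProteinForm("PRO:00000014", "ERK1 phosphorylated", "Nuclear", True, True),
    ProteinForm("PRO:00000015", "CAMK2 phosphorylated", "Cytoplasmic", True, False),
    ProteinForm("PRO:00000016", "alternatively spliced short form", "Cytoplasmic", None, None),
    ProteinForm("PRO:00000018", "phosphorylated short form", "Nuclear", None, True),
    ProteinForm("PRO:00000019", "point mutation", "Cytoplasmic", False, False),
]


def upregulating_forms(forms):
    """Return the IDs of forms explicitly listed as upregulating transcription."""
    return [f.pro_id for f in forms if f.txn_upregulation is True]


print(upregulating_forms(smad2_forms))
```

In this data every form that upregulates transcription is also nuclear, a pattern that is invisible when all seven forms are collapsed into the single entry SMAD2_HUMAN.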
43
PRO hierarchy in OBO-Edit
OBO relations: is_a, derives_from
ProEvo: Representing evolutionary-related protein classes. In this example, children of TGF-beta-like cysteine-knot cytokine have a common architecture consisting of a signal peptide, a variable propeptide region and a transforming growth factor beta-like domain that is a cysteine-knot domain.
Pfam:PF00019 "has_part Transforming growth factor beta like domain".
ProForm: Representing multiple protein products of a gene. Only forms with experimental data are included. When common protein forms exist in human and mouse, a single node is created (See details below).
44
Summary
• The vision of the biomedical ontology community is that all biomedical knowledge and data are disseminated on the Internet using principled ontologies, such that they are semantically interoperable and useful for improving biomedical science and clinical care.
• The scope extends to all knowledge and data that is relevant to the understanding or improvement of human biology and health.
• Knowledge and data are semantically interoperable when they enable predictable, meaningful computation across knowledge sources developed independently to meet diverse needs.
• Principled ontologies are ones that follow NCBO-recommended formats and methodologies for ontology development, maintenance, and use.