43
1 Bio-Trac 40 (Protein Bioinformatics) Bio-Trac 40 (Protein Bioinformatics) October 9, 2008 October 9, 2008 Zhang-Zhi Hu, M.D. Zhang-Zhi Hu, M.D. Research Associate Professor Research Associate Professor Protein Information Resource, Department Protein Information Resource, Department of of Biochemistry and Molecular & Cellular Biochemistry and Molecular & Cellular Biology Biology Georgetown University Medical Center Georgetown University Medical Center Biomedical Ontologies

Biomedical Ontologies

  • Upload
    marged

  • View
    38

  • Download
    2

Embed Size (px)

DESCRIPTION

Biomedical Ontologies. Bio-Trac 40 (Protein Bioinformatics) October 9, 2008 Zhang-Zhi Hu, M.D. Research Associate Professor Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center. Overview. What is ontology ? - PowerPoint PPT Presentation

Citation preview

Page 1: Biomedical Ontologies

1

Bio-Trac 40 (Protein Bioinformatics)Bio-Trac 40 (Protein Bioinformatics)

October 9, 2008October 9, 2008

Zhang-Zhi Hu, M.D. Zhang-Zhi Hu, M.D. Research Associate ProfessorResearch Associate ProfessorProtein Information Resource, Department of Protein Information Resource, Department of Biochemistry and Molecular & Cellular BiologyBiochemistry and Molecular & Cellular BiologyGeorgetown University Medical CenterGeorgetown University Medical Center

Biomedical Ontologies

Page 2: Biomedical Ontologies

2

Overview• What is ontology?

– What is biomedical ontology?• What is gene ontology?

– How is it generated?– How is it used for annotation?

• What is protein ontology? – Why is it necessary?– How to use it?

• ……

Page 3: Biomedical Ontologies

3

Tree of Porphyry with Aristotle’s Categories

Aristotle, 384 BC – 322 BC

Page 4: Biomedical Ontologies

4

Ontology:

• In philosophy, it seeks to describe basic categories and relationships of being or existence to define entities and types of entities within its framework: – What do you know? How do you know it?– What is existence? What is a physical object? – What constitutes the identity of an object? ……

• Central goal is to have a definitive and exhaustive classification of all entities.

onto-, of being or existence; -logy, study. Greek origin; Latin, ontologia,1606

“The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality”

– Barry Smith, U Buffalo

Page 5: Biomedical Ontologies

5

In computer and information science• Ontology is a data model that represents a set of

concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.

is_a

Individuals (instances)

Classes (concepts)

Attributes

Relations

Most ontologies describe individuals (instances), classes (concepts), attributes, and relations

your Ford, my Ford, his Ford…

e.g. color, engine, door…

Classes

Page 6: Biomedical Ontologies

6

What are ontology useful for?

• Terminology management• Integration, interoperability, and sharing of data

– promote precise communication between scientists– enable information retrieval across multiple resources

• Knowledge reuse and decision support– extend the power of computational approaches to perform

data exploration, inference, and mining

Ontology is a form of knowledge representation about the world or some part of it.

Biomedical Terminology vs. Biomedical Ontology

• UMLS (unified medical language system)

• MeSH (medical subject heading)

• NCI Thesaurus

• SNOMED / SNODENT

• Medical WordNet

Page 7: Biomedical Ontologies

7

Ontology Enables Large-Scale Biomedical Science

• Structured representation of biomedicine: – For different types of entities and relations to

describe biomedicine (ontology content curation).

• Annotation: using ontologies to summarize and describe biomedical experimental results to enable: – Integration of their data with other researchers’

results– Cross-species analyses

The center of two major activities currently in biomedical research:

Page 8: Biomedical Ontologies

8

what makes it so wildly

successful ?

Gene Ontology (GO)

Page 9: Biomedical Ontologies

9

GO Consortium

• The Gene Ontology was originally constructed in 1998 by a consortium of researchers studying the genome of three model organisms:

– Drosophila melanogaster (fruit fly) (FlyBase)

– Mus musculus (mouse) (MGD)

– Saccharomyces cerevisiae (yeast) (SGD)

• Many other model organism databases have joined the GO consortium, contributing:

– development of the ontologies

– annotations for the genes of one or more organisms

http://www.geneontology.org/

Page 10: Biomedical Ontologies

10

Three key concepts: [Currently total 25804 GO terms] (Oct. 2008)

• Biological process: series of events accomplished by one or more ordered assemblies of molecular functions, e.g. signal transduction, or pyrimidine metabolism, and alpha-glucoside transport. [ [total: 15161]

• Molecular function: describes activities, such as catalytic or binding activities, that occur at the molecular level. Activities that can be performed by individual gene products, or by assembled complexes of gene products; e.g. catalytic activity, transporter activity. [total: 8425]

• Cellular component: a component of a cell that it is part of some larger object, maybe an anatomical structure (e.g. ER or nucleus) or a gene product group (e.g. ribosome, or a protein dimer). [total: 2218]

• What is Gene Ontology? GO provides controlled vocabulary to describe gene and gene product attributes in any organism – how gene products behave in a cellular context

Need for annotation of genome sequences

• GO annotation - Characterization of gene products using GO terms - Members submit their data which are available at GO website.

Page 11: Biomedical Ontologies

11

GO Representation: Tree or Network?

Node, a concept or a term

root

Leaf node

C has two parents, A and B

A

B

C

C

GO is a network structure

Relations: is_a, or part_of

Page 12: Biomedical Ontologies

12http://www.geneontology.org/

Page 13: Biomedical Ontologies

13

GO term (GO:0006366): mRNA transcription from RNA polymerase II promoter

Leaf node

GO search and display tool

Page 14: Biomedical Ontologies

14

Human p53 – GO annotation (UniProtKB:P04637)

GO:0006289:nucleotide-excision repair [PMID:7663514; evidence:IMP]

Page 15: Biomedical Ontologies

15

• Science basis of the GO: trained experts use the experimental observations from literature to associate GO terms with gene products (to annotate the entities represented in the gene/protein databases)

• Enabling data integration across databases and making them available to semantic search

GO annotation of gene products

~46Human, mouse, plant, worm, yeast …

http://www.geneontology.org/GO.current.annotations.shtml

Page 16: Biomedical Ontologies

16

What GO is NOT……• Ontology of gene products: e.g. cytochrome c is not in

GO, but attributes of cytochrome c are, e.g. oxidoreductase activity.

• Processes, functions and component unique to mutants or diseases: e.g. oncogenesis is not a valid GO.

• Protein domains or structural features. • Protein-protein interactions. • Environment, evolution and expression. • Anatomical or histological features above the level of

cellular components, including cell types.

Neither GO is Ontology of Genes!! – a misnomer

Page 17: Biomedical Ontologies

17

Missing GO nodes…

not deep enough…not broad enough…

Page 18: Biomedical Ontologies

18

Lack of connections among GOs

Estrogen receptor

Page 19: Biomedical Ontologies

19

what cellular component?

what molecular function?

what biological process?

GO: A Common Standard for Omics Data Analysis

Page 20: Biomedical Ontologies

20

need more…• need to improve the quality of GO to support more

rigorous logic-based reasoning across the data annotated in its terms

• need to extend the GO by engaging ever broader community support for addition of new terms and for correction of errors

• need to extend the methodology to other domains, including clinical domains, such as: – disease ontology– immunology ontology– symptom (phenotype) ontology– clinical trial ontology– ...

Page 21: Biomedical Ontologies

21

• Establish common rules governing best practices for creating ontologies and for using these in annotations

• Apply these rules to create a complete suite of orthogonal interoperable biomedical reference ontologies

National Center for Biomedical

Ontology (NCBO)

http://www.obofoundry.org/

http://bioontology.org/

Page 22: Biomedical Ontologies

22

http://www.obofoundry.org/index.cgi?sort=domain&show=ontologies

Page 23: Biomedical Ontologies

23

• tight connection to the biomedical basic sciences• compatibility, interoperability, common relations• support for logic-based reasoning

OBO Foundry = a subset of OBO ontologies, whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development designed to ensure:

The OBO Foundry

• A family of interoperable gold standard biomedical reference ontologies to serve annotation of:

– scientific literature

– model organism databases

– clinical trial data …

OBO Foundry Principles: http://www.obofoundry.org/crit.shtml

Page 24: Biomedical Ontologies

24

CONTINUANT OCCURRENT

INDEPENDENT DEPENDENT

ORGAN ANDORGANISM

Organism(NCBI

Taxonomy)

Anatomical Entity(FMA, CARO)

OrganFunction

(FMP, CPRO) Phenotypic

Quality(PaTO)

Biological Process

(GO)CELL AND CELLULAR

COMPONENT

Cell(CL)

Cellular Componen

t(FMA, GO)

Cellular Function

(GO)

MOLECULEMolecule

(ChEBI, SO,RNAO, PRO)

Molecular Function(GO)

Molecular Process

(GO)

Rationale of OBO Foundry coverage

GRANULARITY

RELATION TO TIME

Page 25: Biomedical Ontologies

25

Foundational is_apart_of

Spatial located_incontained_inadjacent_to

Temporal transformation_ofderives_frompreceded_by

Participation has_participanthas_agent

OBO Relation Ontology

e.g.: A is_a B =def. every instance of A is an instance of B

“rose is_a plant all instances of rose is_a plant”

Page 26: Biomedical Ontologies

26

What is Protein Ontology? Why?

PRO

http://pir.georgetown.edu/pro/

Page 27: Biomedical Ontologies

27

The Need for Representation of Various Proteins Forms

Human PRLR

and PTMs…

Glucocorticoid receptor (GR)

Page 28: Biomedical Ontologies

28

Sphingomyelin phosphodiesterase (SMPD1) (ASM_HUMAN)

• Cleavage sites:– lysosomal: the enzyme is transported from the Golgi apparatus

to the lysosome after additions of mannose-6-phosphate moieties (M6P) and binding to M6P receptor.

– secreted: the shorter cleaved form is not modified with M6P and is targeted for secretion to the extracellular space, with different functions such as LDL binding and oxidized LDL catabolism.

M6P

lysosome

Extracelluar, e.g. LDL binding

Page 29: Biomedical Ontologies

29

Alternative splicing

• Only FGF8b can transform midbrain to cerebellum whereas FGF8a causes an overgrowth of midbrain.

FGF8_HUMAN alternative splicing

a single new contact between Phe32 (F32) of FGF8b and a hydrophobic groove within Ig domain 3 of FGFR2c

FGF8a FGF8b

Olsen et al., Genes Dev. 2006

FGF8a, 8b – differ in their ability to pattern embryonic brain

Page 30: Biomedical Ontologies

30

GOA for Transcription factor Ovo-like 2Form 1 - long: GO:0045892 IDA - negative regulation of transcription, DNA-dependentForm 2 – short: GO:0045893 IDA - positive regulation of transcription, DNA-dependent

274 aa

OVOL2_MOUSE (Q8CIV7)

- Gene. 2004 336:47-58. PMID:15225875

Page 31: Biomedical Ontologies

31

The Need for Protein Classes Representing Protein Evolutionary Relationships

• Genes/proteins identified in model organisms, such as mouse, yeast, fly, may have important functional implications in human. – Gene function in model organism may not applied to human

• Animal models for human diseases: such as mouse models for diabetes, arthritis, and tumor. – Essential genes may be redundant and nonessential in another

species due to functional compensation, e.g.:

• mutation of Rb1 causes retinoblastoma in early childhood

• Rb1 knock-out mouse did not develop retinoblastoma because of compensation from a functional homolog p107.

• Close examination of proteins in phylogenetic classes and their functional convergence and divergence in a ontological structure is important for application of disease models.

Page 32: Biomedical Ontologies

32

Human

Chimp

Mou

se

E.coli

Fly Yeast

Wor

mRat B.su

btilis

Implications of Protein Evolution

Common ancestor

• Conclusions from experiments performed on proteins from one organism are often applicable to the homologous protein from another organism.

• Information learned about existing proteins allows us to infer the properties of ancestral proteins.

Page 33: Biomedical Ontologies

33

Protein Evolution

With enough similarity, one can trace back to a common origin

Sequence changes

What about these?

Domain shuffling

Page 34: Biomedical Ontologies

34

Functional convergence

• Protein classes of the same function derived from different evolutionary origins, e.g. carbonate dehydratase (or carbonic anhydrase EC 4.2.1.1), which has three independent gene families with functional convergence.

Animal and prokaryotic type

Plant and prokaryotic type

Archaea type

Page 35: Biomedical Ontologies

35

Functional divergence

TGM3 (Human)

TGM3 (Mouse)

EPB42 (Human)

EPB42 (Mouse)

Gene Duplication (TGM3/EPB42 split) Speciation (Human/mouse split)

TGM3 (Human)

TGM3 (Mouse)

EPB42 (Human)

EPB42 (Mouse)

TGM3 branch

EPB42 branch

Human

Mouse

Human

Mouse

TGM3 = Protein-glutamine gamma-glutamyltransferase(Transglutaminase; involved in protein modification)

EBP42 = Erythrocyte membrane protein band 4.2(Constituent of cytoskeleton; involved in cell shape)

Page 36: Biomedical Ontologies

36

The Need for Protein Ontology

• Data integration and knowledge management for -omics work.• A gap exists in OBO for gene products. • Protein Ontology (PRO) will contain two connected components

(or subontologies): – ProEvo captures the protein classes represented by protein

families at fold, domain and full length levels that reflect evolutional relationship

– ProForm captures the specific protein objects of a specific gene resulting from alternative splicing, posttranslational modification, genetic variations.

– ProEvo and ProMod is connected through the “reference” (canonical) protein sequence currently annotated in UniProtKB.

• PRO formalization of these detailed protein objects and classes will allow accurate and consistent proteomics experimental design and data analysis/integration.

Page 37: Biomedical Ontologies

37

PRO Framework

• PRO is designed to be a formal and well-principled OBO Foundry ontology for protein entities.

• Attributes of objects will take the form of links to other ontologies, such as gene (GO), sequence (SO), modification (PSI-MOD) and disease (DO) ontologies.

• A PRO prototype for TGF-beta signaling proteins was built based on this framework.

• In this way, PRO aims at providing an ontological framework to define protein entities and evolutionary-related classes that community can adopt for different purposes, e.g.

– annotation of entities attributes,

– mapping of objects in pathways, and

– modeling of biological system dynamics and disease.

Page 38: Biomedical Ontologies

38

ProEvo

ProForm

GO Gene Ontology

molecular function

cellular component

biological process

participates_in

part_of (for complexes) located_in (for compartments)

has_function

PROprotein

Root Levelis_a

translation product of an evolutionarily-related gene

translation product of a specific mRNA

Family-Level Distinction• Derivation: common ancestor• Source: PIRSF family

Modification-Level Distinction• Derived from post-translational modification• Source: UniProtKB

Sequence-Level Distinction• Derivation: specific allele or splice variant• Source: UniProtKB

cleaved/modified translation product disease

OMIM Disease

agent_in

is_a

protein modification

has_modification

PSI-MOD Modification

SO Sequence Ontology

sequence change

has_agent (sequence change) agent_of (effect on function)

derives_from

protein domain

has_part

Pfam Domain

Example:

TGF-beta receptor phosphorylated smad2 isoform1

is a phosphorylated smad2 isoform1

derives_from smad2 isoform 1

is a smad2

is a TGF- receptor-regulated smad

is a smad

is a protein

Modification Level

Sequence Level

Family Level

Root Level

translation product of a specific geneGene-Level Distinction• Derivation: specific gene• Sources: PIRSF subfamily, Panther subfamily

is_a

Gene Level

Protein Ontology (PRO) http://pir.georgetown.edu/pro/

Page 39: Biomedical Ontologies

39

Cellular Component:- nucleus

Molecular Function:- protein binding

Biological Process:- signal transduction- regulation of transcription, DNA-dependent

Mothers against decapentaplegic homolog 2

Smad 2

GO annotation of SMAD2_HUMAN:

Page 40: Biomedical Ontologies

40

II I

TGF-TGF-beta receptor

PP Smad 4 P

4 DNA binding

1 phosphorylation

2 complex formation

Nucleus

Cytoplasm

Smad 2

Smad 2

PPSmad 2

Smad 4 P

Transcription Regulation

PPSmad 2

Smad 4 P

3 nuclear translocation

PPSmad 2PP

P

++

ERK1

CAMK2

PP

Page 41: Biomedical Ontologies

41

“normal” •Cytoplasmic PRO:00000011

TGF- receptor phosphorylated

•Forms complex•Nuclear•Txn upregulation

PRO:00000013

ERK1 phosphorylated •Forms complex•Nuclear•Txn upregulation++

PRO:00000014

CAMK2 phosphorylated

•Forms complex•Cytoplasmic•No Txn upregulation

PRO:00000015

alternatively spliced short form

•Cytoplasmic

PRO:00000016

phosphorylated short form

•Nuclear•Txn upregulation PRO:00000018

point mutation (causative agent: large intestine carcinoma)

•Doesn’t form complex•Cytoplasmic•No Txn upregulation

PRO:00000019

Smad 2

PPSmad 2

PPSmad 2P

PPSmad 2 P

Smad 2 x

Smad 2

Smad 2 PP

SMAD2_HUMAN

SMAD2_HUMAN

SMAD2_HUMAN

SMAD2_HUMAN

SMAD2_HUMAN

SMAD2_HUMAN

SMAD2_HUMAN

Smad2 gene products Forms Location ID

Page 42: Biomedical Ontologies

43OBO relations: is_a, derives_from

ProEvo:

ProForm:

PRO hierarchy in Obo Edit

Representing evolutionary-related protein classes. In this example, children of TGF-beta-like cysteine-knot cytokine have a common architecture consisting of a signal peptide, a variable propeptide region and a transforming growth factor beta-like domain that is a cysteine-knot domain.

Pfam:PF00019 "has_part Transforming growth factor beta like domain".

Representing multiple protein products of a gene. Only forms with experimental data are included. When common protein forms exist in human and mouse, a single node is created (See details below).

Page 43: Biomedical Ontologies

44

Summary• The vision of the biomedical ontology community is that all

biomedical knowledge and data are disseminated on the Internet using principled ontologies, such that they are semantically interoperable and useful for improving biomedical science and clinical care.

• The scope extends to all knowledge and data that is relevant to the understanding or improvement of human biology and health.

• Knowledge and data are semantically interoperable when they enable predictable, meaningful, computation across knowledge sources developed independently to meet diverse needs.

• Principled ontologies are ones that follow NCBO-recommended formats and methodologies for ontology development, maintenance, and use.