Practical Ontologies
Lessons from the GO
February 2011
The time was 1998-99
None of the model organism databases used standard terminology to describe biological function
Drosophila sequence was imminent
Largest genome sequenced at that time
Two weeks, 3 dozen scientists, all new software
How could we organize the annotation?
Microarray technology was the latest research tool, and results needed to be described
AI folk and ontologists organized the first “bio-ontologies” workshop at ISMB
The Gene Ontology—the beginning
A handful of biologists (4) met in a bar in Montreal after the bio-ontologies workshop to share their frustrations and decided to just do it*…
Would demonstrate possibilities for data integration across the MODs (FlyBase, SGD, MGD)
Provided an organizing principle for the Drosophila genome annotation jamboree
* i.e. describe gene products in a biologically meaningful way.
Late summer 1999
reads sequence
Piles of data
Mountains of data
assemble analysis
filtering
First-pass predictions
converging
Tentative function
Love-at-first-sight
Functional knowns
‘GO’ directories
The Gene Ontology project
Annotate now
The importance of stress-testing
Don’t delay, use your ontology today
Do no harm (KISS)
i.e. target the low-hanging fruit, work on the obvious, high-confidence steps
Collaborate on concrete projects
Focusing the mind
Annotations
Have 3 primary components
The ontology term(s)
The entity instance (e.g. gene product)
The evidence for that assertion
An annotation is an evidence-based assertion which indicates that this entity is best classified/described by this term(s)
Read paper(s): PMID:17449867
Identify genes: SPCC622.16c
Identify GO terms associated with each gene: GO:0005720
What type of evidence? IDA
Resulting annotation: SPCC622.16c GO:0005720 IDA PMID:17449867
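The example annotation can be held in a small record type. The following is a minimal sketch, not a real GO file format: the class name, field names, and the whitespace-separated layout are assumptions that simply mirror the slide. IDA is the GO evidence code "Inferred from Direct Assay".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One evidence-based assertion (field names are illustrative)."""
    entity: str     # the entity instance, e.g. a gene product identifier
    term: str       # the ontology term identifier
    evidence: str   # evidence code, e.g. IDA (Inferred from Direct Assay)
    reference: str  # where the evidence comes from, e.g. a PubMed ID


def parse_annotation(line: str) -> Annotation:
    """Parse the four whitespace-separated fields shown on the slide."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    entity, term, evidence, reference = fields
    if not term.startswith("GO:"):
        raise ValueError(f"not a GO identifier: {term!r}")
    return Annotation(entity, term, evidence, reference)


example = parse_annotation("SPCC622.16c GO:0005720 IDA PMID:17449867")
print(example)
```

Keeping the reference next to the evidence code means every assertion can be traced back to the paper that supports it.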
= bud initiation
= bud initiation
= bud initiation
The same name can be used to describe different things.
Classification rule: Disambiguation
= tooth bud initiation
= cellular bud initiation
= flower bud initiation
Include plain “bud initiation” as a synonym for each of these terms
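A sketch of how the shared synonym behaves at lookup time, assuming a simple in-memory index (the `TERMS` table and function names are hypothetical): the plain phrase finds all three specific terms, and the curator or tool must then choose one.

```python
from collections import defaultdict

# Each specific term carries plain "bud initiation" as a synonym.
TERMS = {
    "tooth bud initiation": ["bud initiation"],
    "cellular bud initiation": ["bud initiation"],
    "flower bud initiation": ["bud initiation"],
}


def build_synonym_index(terms):
    """Map every name and synonym (lower-cased) to the terms it can denote."""
    index = defaultdict(set)
    for name, synonyms in terms.items():
        index[name.lower()].add(name)
        for synonym in synonyms:
            index[synonym.lower()].add(name)
    return index


INDEX = build_synonym_index(TERMS)


def lookup(text):
    """Return all candidate terms for a search string, sorted by name."""
    return sorted(INDEX.get(text.lower(), set()))


candidates = lookup("bud initiation")
print(candidates)
```

An ambiguous query returns several candidates rather than silently picking one; a fully qualified name returns exactly one.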
Classification rule: Disambiguation
Exactly the same thing can be described with different terms
Disambiguation
Glucose synthesis
Glucose biosynthesis
Glucose formation
Glucose anabolism
Gluconeogenesis
Comparison is difficult, especially across species or across databases that each use one of these different variants
Use a single term, and plenty of synonyms
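The converse case can be sketched the same way: many labels, one term. Which variant serves as the primary label is an assumption made here for illustration only, and the two databases are hypothetical.

```python
# Assumption for illustration: one variant is chosen as the primary label
# and the others are recorded as its synonyms.
PRIMARY = "gluconeogenesis"
SYNONYMS = {
    "glucose synthesis",
    "glucose biosynthesis",
    "glucose formation",
    "glucose anabolism",
}


def normalize(label: str) -> str:
    """Map any known variant to the single primary label."""
    key = " ".join(label.lower().split())
    if key == PRIMARY or key in SYNONYMS:
        return PRIMARY
    raise KeyError(f"unknown label: {label!r}")


# Two hypothetical databases that each use a different variant.
database_a = ["Glucose synthesis", "Gluconeogenesis"]
database_b = ["glucose anabolism"]

shared = {normalize(x) for x in database_a} & {normalize(x) for x in database_b}
print(shared)
```

Once both databases are normalized to the same term, comparison across them reduces to a set intersection.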
Annotation for a healthy ontology
Easier to find the most accurate term(s) to use
Avoids annotation errors
Easier for new curators to learn and understand
Develop annotation guidelines and training material
Enables automatic reasoning for searching & inference
Bottom line: Following basic construction rules makes more useful ontologies
Doh! I get it now, says the computer.
Typical ontology developer
Typical wet lab PI
annotating data
Improvement needed: Closing the loop
GO in 2000-2008
Filling in annotation gaps
GO:0016301 kinase activity (F)
GO:0016310 phosphorylation (P)
|P| = 3640
|F| = 6053
|F ∩ P| = 2230
|F ∩ not P| = 3823
Venn diagram regions: 1410 (P only), 2230 (F and P), 3823 (F only)
July 2008
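The Venn counts can be checked with plain set arithmetic. This sketch assumes F is the set of gene products annotated to kinase activity and P the set annotated to phosphorylation; the gene names in the toy example are hypothetical.

```python
def venn_regions(f: set, p: set) -> dict:
    """Split two annotation sets into the three regions of a Venn diagram."""
    return {"F_only": f - p, "both": f & p, "P_only": p - f}


# Counts from the slide (July 2008).
size_f, size_p, size_both = 6053, 3640, 2230
f_only = size_f - size_both  # annotated to kinase activity but not phosphorylation
p_only = size_p - size_both  # annotated to phosphorylation but not kinase activity

# Toy sets with hypothetical gene names, to show the same split on real sets.
kinase_activity = {"geneA", "geneB", "geneC"}
phosphorylation = {"geneB", "geneC", "geneD"}
regions = venn_regions(kinase_activity, phosphorylation)

print(f_only, p_only, regions)
```

The F-only region is the annotation gap: gene products with a function annotation but no matching process annotation.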
part_of
Annotations propagate over part_of
KIC1 IDA
NDK1 IDA
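Propagation over part_of can be sketched as a small closure computation. Two assumptions are made here: that the part_of link in the figure runs from kinase activity (GO:0016301) to phosphorylation (GO:0016310), and that KIC1 and NDK1 carry direct annotations to kinase activity. Evidence codes are left out to keep the sketch short.

```python
# Assumption: kinase activity (GO:0016301) part_of phosphorylation (GO:0016310).
PART_OF = {"GO:0016301": "GO:0016310"}


def propagate(annotations, part_of):
    """Return direct annotations plus those implied by following part_of links."""
    inferred = set(annotations)
    frontier = list(annotations)
    while frontier:
        gene, term = frontier.pop()
        parent = part_of.get(term)
        if parent is not None and (gene, parent) not in inferred:
            inferred.add((gene, parent))
            frontier.append((gene, parent))  # keep following longer chains
    return inferred


# Assumed direct annotations for the two gene products named on the slide.
direct = {("KIC1", "GO:0016301"), ("NDK1", "GO:0016301")}
closed = propagate(direct, PART_OF)
print(sorted(closed))
```

Each gene product annotated to the function gains the process annotation without a curator entering it by hand.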
Filling in annotation gaps
GO:0016301 kinase activity
GO:0016310 phosphorylation
2009
The H word—2011
Characters in common are due to inheritance
Allows inferences about common ancestor
time
divergence
Evolution of MSH2 subfamily biological process
DNA repair
Maintenance of DNA repeats
Homologous recombination
Apoptosis
Somatic hypermutation of immunoglobulin genes
Ancestral inference
• Integration at points of common ancestry
• Infer “hidden” character of living organisms
• Explicitly leverage evolutionary relationships
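The idea behind ancestral inference can be sketched on a toy tree. This is not how any particular tool implements it: the tree, the node names, and the function names are hypothetical. A character asserted at an ancestor is inherited by every living descendant that lacks its own experimental annotation.

```python
# Hypothetical gene tree, stored as child -> parent.
PARENT = {
    "leaf_A": "anc_1",
    "leaf_B": "anc_1",
    "anc_1": "root",
    "leaf_C": "root",
}


def children_of(parent_map):
    """Invert a child -> parent map into parent -> list of children."""
    children = {}
    for child, parent in parent_map.items():
        children.setdefault(parent, []).append(child)
    return children


def leaves_under(node, children):
    """Collect all leaves (living sequences) descended from a node."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(leaves_under(kid, children))
    return found


def infer_from_ancestor(ancestor, term, parent_map, experimental):
    """Leaves that inherit `term` from `ancestor` and lack direct evidence for it."""
    children = children_of(parent_map)
    return sorted(
        leaf
        for leaf in leaves_under(ancestor, children)
        if term not in experimental.get(leaf, set())
    )


# Only one leaf has experimental evidence; a curator asserts the ancestor had it too.
experimental = {"leaf_A": {"DNA repair"}}
inherited = infer_from_ancestor("anc_1", "DNA repair", PARENT, experimental)
print(inherited)
```

Asserting the character further up the tree widens the inference to more descendants, which is why the choice of ancestral node is a curation decision.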
[Figure: gene family tree, axis: divergence. Leaves: E.c., A.t. MTHFR1, A.t. MTHFR2, D.d., S.p., S.c. MET13, D.m., A.g., S.p., S.c. MET12, C.e., D.r., G.g., H.s. MTHFR, R.n., M.m.]
Biochemistry: purification and assay
Genetics: mutant phenotypes
Integrating different GO annotations
PAINT: Phylogenetic Annotation and Inference Tool
SGD
MGD
2009
FlyBase
GO
Scoping
The ontology has a clearly specified and clearly delineated content.
Decisions to make the work easier
Provide definitions for everything
Intelligible ontologies are more useful
To humans (for annotation) and
To machines (for searching, reasoning and error-checking)
Use content-free unique identifiers
Drive all semantics away from tracking
Don’t confuse the representational technology with the conceptual modeling
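A sketch of why content-free identifiers help, using a hypothetical term record (the `EX:` identifier, the placeholder definition, and the `rename` helper are all illustrative): the name can change while the identifier, and everything keyed on it, stays put.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Term:
    """An ontology term whose identifier carries no meaning of its own."""
    id: str            # opaque, stable identifier used for tracking
    name: str          # human-readable label, free to change
    definition: str    # every term gets a definition
    synonyms: tuple = ()


def rename(term: Term, new_name: str) -> Term:
    """Change the label, keep the identifier, and retain the old label as a synonym."""
    return replace(term, name=new_name, synonyms=term.synonyms + (term.name,))


# Hypothetical identifier and placeholder definition, for illustration only.
original = Term("EX:0000001", "glucose biosynthesis", "The formation of glucose.")
renamed = rename(original, "gluconeogenesis")

# Annotations refer to the identifier, so they survive the rename untouched.
annotations = {"geneA": original.id}
print(renamed, annotations["geneA"] == renamed.id)
```

Because the semantics live in the name and definition rather than in the identifier, relabeling a term never invalidates existing annotations.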
Implicit ontologies within the GO:
cysteine biosynthesis (ChEBI)
myoblast fusion (Cell Type Ontology)
hydrogen ion transporter activity (ChEBI)
snoRNA catabolism (Sequence Ontology)
wing disc pattern formation (Drosophila anatomy)
epidermal cell differentiation (Cell Type Ontology)
regulation of flower development (Plant anatomy)
B-cell differentiation (Cell Type Ontology)
brain development
hindbrain development
metencephalon development
pons development
trigeminal motor nucleus development
GO
Implicit anatomy ontology within the GO:
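The implicit anatomy can be made visible by stripping the process suffix from the term names. A minimal sketch, assuming the " development" suffix pattern holds for the terms listed on the slide; the function name is hypothetical.

```python
SUFFIX = " development"

# GO term names from the slide, from most general to most specific.
GO_TERMS = [
    "brain development",
    "hindbrain development",
    "metencephalon development",
    "pons development",
    "trigeminal motor nucleus development",
]


def implied_structure(term_name: str):
    """Return the anatomical structure named inside a development term, if any."""
    if not term_name.endswith(SUFFIX):
        return None
    return term_name[: -len(SUFFIX)]


anatomy = [implied_structure(name) for name in GO_TERMS]
print(anatomy)
```

The recovered list is an anatomy hierarchy that the GO encodes only indirectly, which is the argument for referencing a dedicated anatomy ontology instead.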
[Figure: phenotype descriptions for two mouse models. Alpha-Synuclein Mouse: Substantia nigra, has part, is bearer of, number of, Lewy body. Ischemic Mouse: is bearer of, number of. Labeled structures: Condensed Mitochondrion, Orthodox Mitochondrion, Nucleus, Lysosome, Golgi Apparatus, Dark Material.]
Common Interest
Sociology: to enlist the community, the ontology must meet each individual group’s immediate needs.
Too many people => Too many requirements
Outstanding problems
Closing the loop between ontology construction and ontology application
QC improvements
Prioritizing tasks
Visualization
…
A cast of thousands