embl-abr
Motivation: Your research is valuable
All advances in knowledge are incremental, with
each new idea ultimately building on earlier knowledge such as you are gathering.
Losing data at a rapid rate: up to 80% unavailable after 20 years
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
Data valuation
Information is infinitely shareable without any loss of value
Reuse increases the value derived from the original investment
By combining data, their value increases
The more these assets are used, the more additional knowledge can be gathered (data science)
As a corollary, unshared or insufficiently documented information is less valuable
The more accurate and complete the information is, the more useful, and therefore valuable, it is
Moody and Walsh 1999
WHAT ONTOLOGIES ARE
what kinds of things exist?
what are the relationships between these things?
[Figure: eye, ommatidium, sense organ and eye disc, connected by is_a, part_of and develops_from relations]
A biological ontology is: A machine interpretable
representation of some aspect of biological reality
May 2, 2023
Ontology defined
The science of what is: of the kinds and structures of the objects, and their properties and relations in every area of reality.
The classification of entities and the relations between them.
Defined by a scientific field's vocabulary and by the canonical formulations of its theories.
Seeks to solve problems which arise in these domains.
WHY ARE ONTOLOGIES NEEDED
Ontologies help with decision making
handy ontology tells us what’s there…
Where should I eat…?
Ontologies don’t just organize data; they also facilitate inference, and that creates new knowledge, often unconsciously in the user.
(Presumable) country of origin
Type of cuisine
What a 5 year old child (or a computer) will likely infer about the world from this helpful ontology…
Flag of fresh juice
‘Frozen Yogurt’ cuisine in search of a national identity?
Where delicatessen food hails from…
Fresh Juice is a national cuisine…
Information retrieval is not straightforward
18-day pregnant
females female (lactating) individual female worker caste (female) 2 yr old female female (pregnant) lgb*cc females sex: female 400 yr. old female female (outbred) mare female, other adult female female parent female (worker) female child asexual female female plant monosex female femal
femlale diploid female female(gynoecious) remale metafemale f femele semi-engorged
female sterile female famale female, pooled sexual oviparous
female normal female femail femalen sterile female worker sf female females strictly female
vitellogenic replete female
female - worker females only tetraploid female worker female (alate sexual) gynoecious thelytoky hexaploid female female (calf) healthy female female (gynoecious) female (f-o) hen probably female
(based on morphology)
castrate female female with eggs ovigerous female 3 female cf.female female worker oviparous sexual
females female (phenotype) cystocarpic female female, 6-8 weeks old worker bee female mice dikaryon female, virgin female enriched female, spayed dioecious female female, worker pseudohermaprhoditic female
Courtesy of N. Silvester and S. Orchard, European Nucleotide Archive, EMBL-EBI
Motivation is to represent biology accurately
Inferences and decisions we make are based upon what we know of the biological reality.
An ontology is a computable representation of this underlying biological reality.
Enables a computer to reason over the data in (some of) the ways that we do.
Annotation bottleneck
Even the best research will be for naught if data can never be found again.
An active lab can easily generate 10-100 GB of data per month, and it is very difficult to manage on this scale.
Data must be annotated at the rate at which it is generated
And the data must be integrated with other data
Furthermore, the effort put into generating this data will be utterly wasted if the curated data cannot be reliably computed upon.
HOW TO BUILD ONTOLOGIES
Ontologies must be shared
Communities form scientific theories that seek to explain all of the existing evidence and can be used for prediction
The computable representation must also be shared
Thus ontology development is inherently collaborative
Ontologies must be used
Usage feeds back on ontology development and improves the ontology
It improves even more when these data are used to answer research questions
There will be fewer problems in the ontology and more commitment to fixing remaining problems when important research data that scientists depend upon is involved
Why do we need rules for good ontologies?
Ontologies must be intelligible
To humans (for annotation) and to machines (for searching, reasoning and error-checking)
Makes it easier to find the most accurate term(s) to use
Avoids annotation errors
Makes it easier for new curators to learn and understand
Makes it easier to combine with other ontologies and terminologies
Makes automatic reasoning possible for searching & inference
Bottom line: Following basic rules makes more useful ontologies
First Rule: Univocity
Terms (including those describing relations) should have the same meanings on every occasion of use.
In other words, they should refer to the same kinds of entities in reality
The Challenge of Univocity: People call the same thing by different names
[Figure: the terms "Glucose synthesis" and "Gluconeogenesis" shown side by side with a question mark]
Comparison is difficult, especially across species or across databases that each use one of these different variants
Disambiguation
Use a single term, and plenty of synonyms
Gluconeogenesis
Synonyms: Glucose synthesis, Glucose biosynthesis, Glucose formation, Glucose anabolism
Bud initiation? How is a computer to know?
= tooth bud initiation
= cellular bud initiation
= flower bud initiation
Include plain “bud initiation” as a synonym for each of these terms
Classification rule: Disambiguation
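A minimal sketch of the single-term-plus-synonyms idea from the two slides above. The identifiers TERM:1 to TERM:4 and the helper names are placeholders invented for illustration, not real ontology IDs or an existing API. An unambiguous synonym resolves to one term; plain "bud initiation" resolves to all three candidates, so a curator (or more context) has to choose.

```python
# Illustrative term store: one primary label per term, plus its synonyms.
# TERM:1..TERM:4 are placeholder identifiers, not real ontology IDs.
TERMS = {
    "TERM:1": {"label": "gluconeogenesis",
               "synonyms": ["glucose synthesis", "glucose biosynthesis",
                            "glucose formation", "glucose anabolism"]},
    "TERM:2": {"label": "tooth bud initiation", "synonyms": ["bud initiation"]},
    "TERM:3": {"label": "cellular bud initiation", "synonyms": ["bud initiation"]},
    "TERM:4": {"label": "flower bud initiation", "synonyms": ["bud initiation"]},
}


def build_index(terms):
    """Map every label and synonym (lower-cased) to the set of term IDs using it."""
    index = {}
    for term_id, term in terms.items():
        for name in [term["label"], *term["synonyms"]]:
            index.setdefault(name.lower(), set()).add(term_id)
    return index


def lookup(index, text):
    """Return the sorted list of term IDs that a free-text name could refer to."""
    return sorted(index.get(text.strip().lower(), set()))


INDEX = build_index(TERMS)
print(lookup(INDEX, "Glucose synthesis"))  # one candidate term
print(lookup(INDEX, "bud initiation"))     # several candidates: needs disambiguation
```

Keeping synonyms on the term, rather than creating a new term per name, is what lets different databases search with their own vocabulary and still land on the same identifier.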
Second Rule: Positivity
Complements of classes are not themselves classes.
Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes.
The Challenge of Positivity
Some organelles are membrane-bound. A centrosome is not a membrane-bound organelle, but it still may be considered an organelle.
Positivity
Note the logical difference between “non-membrane-bound organelle” and “not a membrane-bound organelle”
The latter includes everything that is not a membrane-bound organelle!
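The difference can be made concrete with plain sets. This is only a toy sketch: the universe of entities and the membership of each set are assumptions chosen for illustration.

```python
# Toy universe of entities; the membership lists are illustrative only.
UNIVERSE = {"nucleus", "mitochondrion", "centrosome", "ribosome", "glucose", "tooth"}
ORGANELLES = {"nucleus", "mitochondrion", "centrosome", "ribosome"}
MEMBRANE_BOUND_ORGANELLES = {"nucleus", "mitochondrion"}

# "non-membrane-bound organelle": still an organelle, just without a membrane.
non_membrane_bound_organelles = ORGANELLES - MEMBRANE_BOUND_ORGANELLES

# "not a membrane-bound organelle": the complement, i.e. everything else.
not_membrane_bound_organelle = UNIVERSE - MEMBRANE_BOUND_ORGANELLES

print(sorted(non_membrane_bound_organelles))
print(sorted(not_membrane_bound_organelle))
```

The first set contains only organelles; the second also picks up glucose and tooth, which is why a bare complement is a poor candidate for a class.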
Third Rule: Objectivity
Which classes exist is not a function of our biological knowledge.
Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.
Objectivity
How can we annotate when we know that we don’t have any information?
Annotate to root nodes and use the ND (no data) evidence code
Similar strategies can be used for any situation in which more specific information is not yet known
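A sketch of what a "no data" annotation could look like as a record. The three root identifiers are the Gene Ontology root terms; the `Annotation` class, the `annotate_no_data` helper and the gene name `geneX` are hypothetical and only illustrate the idea of annotating to the root instead of inventing an "unknown" term.

```python
from dataclasses import dataclass

# Root terms of the three Gene Ontology aspects.
GO_ROOTS = {
    "molecular_function": "GO:0003674",
    "biological_process": "GO:0008150",
    "cellular_component": "GO:0005575",
}


@dataclass(frozen=True)
class Annotation:
    gene: str           # gene or gene product being annotated
    term_id: str        # ontology term the gene is annotated to
    evidence_code: str  # e.g. "ND" for "no data"


def annotate_no_data(gene, aspect):
    """Annotate a gene to the root of an aspect with the ND evidence code."""
    return Annotation(gene=gene, term_id=GO_ROOTS[aspect], evidence_code="ND")


# "geneX" is a made-up gene name used only for this example.
print(annotate_no_data("geneX", "molecular_function"))
```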
GPCRs with unknown ligands
Annotate to this
Fourth Rule: Use defined relationships
Ontologies are graphs, where the nodes (terms in the ontology) are connected by edges (relationships between the terms)
[Figure: graph of the terms Cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of edges]
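A minimal sketch of an ontology as a graph with typed edges, using the terms named on the slide. The specific edges are an assumption about what the figure shows, and `ancestors` is a hypothetical helper, not part of any ontology library.

```python
from collections import deque

# (child, relationship, parent) edges; the exact edges are assumed for illustration.
EDGES = [
    ("mitochondrial membrane", "is_a", "membrane"),
    ("chloroplast membrane", "is_a", "membrane"),
    ("chloroplast membrane", "part_of", "chloroplast"),
    ("chloroplast", "part_of", "cell"),
]


def ancestors(term, edges, relations=("is_a", "part_of")):
    """Collect every term reachable from `term` by following the given relations."""
    parents = {}
    for child, rel, parent in edges:
        if rel in relations:
            parents.setdefault(child, set()).add(parent)
    seen, queue = set(), deque([term])
    while queue:
        for parent in parents.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("chloroplast membrane", EDGES)))
print(sorted(ancestors("chloroplast membrane", EDGES, relations=("is_a",))))
```

Because each edge carries a defined relationship, a query can choose which ones to follow: is_a alone gives the classification, while adding part_of also gives the containing structures.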
Reasoning is critical
Prokaryotic cell and Eukaryotic cell are declared disjoint
Fungal cell is a Eukaryotic cell
Spore is a Fungal cell and a Prokaryotic cell
Satisfiable?
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
[Figure: class diagram of ProkaryoticCell, EukaryoticCell, FungalCell and Spore, with a disjoint marker]
Reasoning is critical
Solution: clarify spore
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
[Figure: class diagram of ProkaryoticCell, EukaryoticCell and FungalCell with a disjoint marker, and Spore split into Actinomycete Type Spore and Mycetozoa Type Spore]
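The spore example can be checked with a few lines of code. This is a simplified sketch, not what an OWL reasoner actually does: it only follows is_a links and flags any class that ends up under two classes declared disjoint. The parents given to the two spore types in the "after" hierarchy are an assumption for illustration; the slide only names the classes.

```python
# is_a links before the fix, as stated on the slide.
IS_A_BEFORE = {
    "FungalCell": {"EukaryoticCell"},
    "Spore": {"FungalCell", "ProkaryoticCell"},
}
# is_a links after splitting Spore; these parent assignments are assumed.
IS_A_AFTER = {
    "FungalCell": {"EukaryoticCell"},
    "ActinomyceteTypeSpore": {"ProkaryoticCell"},
    "MycetozoaTypeSpore": {"EukaryoticCell"},
}
DISJOINT = [("ProkaryoticCell", "EukaryoticCell")]


def superclasses(cls, is_a):
    """Return cls together with everything it inherits from."""
    result, stack = {cls}, [cls]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def unsatisfiable_classes(is_a, disjoint):
    """Classes that fall under both members of a disjoint pair."""
    classes = set(is_a) | {p for parents in is_a.values() for p in parents}
    bad = set()
    for cls in classes:
        supers = superclasses(cls, is_a)
        if any(a in supers and b in supers for a, b in disjoint):
            bad.add(cls)
    return bad


print(unsatisfiable_classes(IS_A_BEFORE, DISJOINT))  # Spore cannot have instances
print(unsatisfiable_classes(IS_A_AFTER, DISJOINT))   # no conflicts remain
```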
Fifth Rule: Intelligibility of Definitions
The terms used in a definition should be simpler (more intelligible) than the term to be defined; otherwise the definition provides no assistance to human understanding or to machine processing
Sixth Rule: Keep it Real
When building or maintaining an ontology, always think carefully about how classes (types, kinds, species) relate to instances in reality
The Rules
1. Univocity: Terms should have the same meanings on every occasion of use
2. Positivity: Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes.
3. Objectivity: Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.
4. Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level
5. Intelligibility of Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined
6. Basis in Reality: When building or maintaining an ontology, always think carefully about how classes relate to instances in reality
7. Distinguish Universals and Instances
How to best describe biology?
Natural Language
+ Large existing body of information
+ Highly expressive
- Ambiguous (making it difficult and unreliable to compute on)
Computable Ontology
+ Logical
+ Precise
- Less expressive
ONTOLOGIES AND BIOLOGY
Without rigor, we won’t know what we know, or where to find it, or what we can infer from it.
GENOME ANNOTATION
Apollo
Once a genome is sequenced…
What are the parts? (sequence features)
Protein coding genes (coding sequence)
Non coding RNAs (rRNA, snoRNA, tRNA, microRNA, antisense RNA)
Promoters and regulatory regions
Transposons
Recombination hotspots, origins of replication
Centromeres & telomeres
…
ComputeCrawler
RepeatMasker, Genscan, FgenesH, Grail, Blast, Sim4, Genewise, Lap
CGTGTGCGCAGGGGGATATGCGGCGCATATTGTGTTGAAGAGATGCGCTGCATTTCGCGATGCCGATTAGGNCACAGGGAA
DNA on a linear coordinate
Little boxes
de novo predictions
protein alignments
transcript alignments
full length cDNAs
APOLLO: annotation editing environment
BECOMING ACQUAINTED WITH APOLLO
Color by CDS frame, toggle strands, set color scheme and highlights.
Upload evidence files (GFF3, BAM, BigWig), add combination and sequence search tracks.
Query the genome using BLAT.
Navigation and zoom.
Search for a gene model or a scaffold.
Get coordinates and “rubber band” selection for zooming.
Login.
User-created annotations.
Annotator panel.
Evidence Tracks
Stage and cell-type specific transcription data.
http://genomearchitect.org/web_apollo_user_guide
Coordinate transforms: Curator ‘ligation’
Coordinate transforms: intron folding
Alterations: whether experimental artifacts or natural differences
Substitutions
Insertions
Deletions
Impact
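A small sketch of the three alteration types applied to a sequence on a linear coordinate. The function names and the 0-based coordinates are choices made for this example and are not Apollo's API; the reference sequence is one that appears later in the slides. The frame check is a deliberately simplified view of "impact": within a coding sequence, a length change that is not a multiple of 3 shifts the reading frame.

```python
def substitute(seq, pos, ref, alt):
    """Replace `ref` with `alt` at 0-based position `pos`."""
    if seq[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return seq[:pos] + alt + seq[pos + len(ref):]


def insert(seq, pos, bases):
    """Insert `bases` before 0-based position `pos`."""
    return seq[:pos] + bases + seq[pos:]


def delete(seq, pos, length):
    """Remove `length` bases starting at 0-based position `pos`."""
    return seq[:pos] + seq[pos + length:]


def changes_reading_frame(original, altered):
    """Simplified impact check: does the length change shift a coding frame?"""
    return (len(altered) - len(original)) % 3 != 0


REFERENCE = "AACGTAGCGACGCTCGCTACGGGCGTATC"

variant = substitute(REFERENCE, 6, "G", "C")  # single-base substitution
print(variant)
print(changes_reading_frame(REFERENCE, variant))                       # length unchanged
print(changes_reading_frame(REFERENCE, insert(REFERENCE, 6, "AC")))    # +2 bases
print(changes_reading_frame(REFERENCE, delete(REFERENCE, 6, 3)))       # -3 bases
```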
APOLLO ON THE WEB: instructions
Username: [email protected]
Password: usernumber
Email               Password           Server
[email protected]   userone            1
[email protected]   usertwo            2
[email protected]   userthree          3
[email protected]   userfour           4
[email protected]   userfive           5
[email protected]   usersix            1
[email protected]   userseven          2
[email protected]   usereight          3
[email protected]   usernine           4
[email protected]   userten            5
[email protected]   usereleven         1
[email protected]   usertwelve         2
[email protected]   userthirteen       3
[email protected]   userfourteen       4
[email protected]   userfifteen        5
[email protected]   usersixteen        1
[email protected]   userseventeen      2
[email protected]   usereighteen       3
[email protected]   usernineteen       4
[email protected]   usertwenty         5
[email protected]   usertwentyone      1
[email protected]   usertwentytwo      2
[email protected]   usertwentythree    3
[email protected]   usertwentyfour     4
[email protected]   usertwentyfive     5
[email protected]   usertwentysix      1
[email protected]   usertwentyseven    2
[email protected]   usertwentyeight    3
[email protected]   usertwentynine     4
Server  URL
1       http://ec2-52-63-181-136.ap-southeast-2.compute.amazonaws.com/apollo/
2       http://ec2-52-64-198-214.ap-southeast-2.compute.amazonaws.com/apollo/
3       http://ec2-52-62-166-89.ap-southeast-2.compute.amazonaws.com/apollo/
4       http://ec2-52-64-182-170.ap-southeast-2.compute.amazonaws.com/apollo/
5       http://ec2-52-63-255-136.ap-southeast-2.compute.amazonaws.com/apollo/
For example: ontologically described genotypes/variants
intrinsic genotype = genomic variation complement + genomic background
Example genotype: apchu745/+; fgf8ati282/ti282(AB)
[Figure: has_part hierarchy over genomic variation complement, variant single locus complement, variant allele and sequence alteration, labeled with apchu745/+, apchu745 and hu745]
[Figure: paired sequences marking the alteration, e.g. AACGTAGCGACGCTCGCTACGGGCGTATC versus AACGTACCGACGCTCGCTACGGGCGTATC]
FUNCTIONAL ANNOTATION
Phylogenetic Annotation Inferencing Tool — PAINT
Evolutionary history is the natural way to organize and
analyze biological data
Ancestral inference
• Integration at points of common ancestry
• Infer “hidden” character of living organisms
• Explicitly leverage evolutionary relationships
[Figure: MTHFR gene tree with leaves E.c., A.t. MTHFR1, A.t. MTHFR2, D.d., S.p., S.c. MET13, D.m., A.g., S.p., S.c. MET12, C.e., D.r., G.g., H.s. MTHFR, R.n. and M.m.; labeled "divergence"]
Biochemistry: purification and assay
Genetics: mutant phenotypes
What is transitive annotation?
Related genes have a common function because their common ancestor had that function.
Not just an inference about one gene. It is also an inference for:
The most recent common ancestor (MRCA)
Continuous inheritance since the MRCA
Potential inheritance by other descendants of the MRCA
[Figure: Function X assigned to the gene in the Opisthokont MRCA and to the genes in yeast, mouse, zebrafish and human]
• Green indicates experimental
• Black dot indicates direct experimental data.
• White dot indicates a more general functional class inferred from ontology
• Red indicates NOT function for the gene
All nodes have persistent identifiers which are retained across different builds of the protein family trees.
cholinesterase
carboxylic ester hydrolase
Evolutionary event type: duplication, speciation
• PAINTed nodes: 3 steps carried out by curator
• Gain & Loss of function
• Inferred By Descendants: experimental annotations provide evidence
• Inferred by Ancestry: propagation to unannotated leaves
carboxylic ester hydrolase
Node with loss of function
Gaudet, P., et al. (2011). Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium. Briefings in Bioinformatics, 12(5), 449–62. doi:10.1093/bib/bbr042
Node with gain of function: cholinesterase
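The curation steps can be sketched on a toy tree. Everything here is hypothetical: the tree, the node names and the helper functions are invented for illustration and do not reflect PAINT's actual data model. The sketch shows experimental annotations on leaves supporting an ancestral node, a gain recorded at that node, a loss recorded on a later branch, and propagation by ancestry to the remaining leaves.

```python
# Hypothetical gene tree: each internal node maps to its children.
TREE = {
    "root": ["anc1", "leafA"],
    "anc1": ["leafB", "anc2"],
    "anc2": ["leafC", "leafD"],
}
# Experimental annotations exist only for some leaves.
EXPERIMENTAL = {"leafA": {"function X"}, "leafB": {"function X"}}
# Curator decisions: where a function is gained and where it is lost.
GAIN = {"root": {"function X"}}
LOSS = {"anc2": {"function X"}}


def leaves_under(tree, node):
    """All leaves in the subtree rooted at `node`."""
    children = tree.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves_under(tree, child)]


def supporting_leaves(tree, node, experimental, function):
    """Inferred by descendants: leaves whose experimental data support `node`."""
    return sorted(leaf for leaf in leaves_under(tree, node)
                  if function in experimental.get(leaf, set()))


def propagate(tree, root, gain, loss):
    """Inferred by ancestry: pass functions down the tree, applying gains and losses."""
    inferred = {}

    def visit(node, inherited):
        functions = (inherited | gain.get(node, set())) - loss.get(node, set())
        inferred[node] = functions
        for child in tree.get(node, []):
            visit(child, functions)

    visit(root, set())
    return inferred


print(supporting_leaves(TREE, "root", EXPERIMENTAL, "function X"))
for node, functions in propagate(TREE, "root", GAIN, LOSS).items():
    print(node, sorted(functions))
```

Here leafC and leafD sit below the loss node, so they do not inherit the function even though their ancestor at the root had it.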
http://questfororthologs.org/
FUNCTIONAL ANNOTATION
Noctua for Building Models of Biology
Motivation: multi-scale knowledge models of mechanistic biology
Bai, J. P. F., & Abernethy, D. R. (n.d.). Systems Pharmacology to Predict Drug Toxicity: Integration Across Levels of Biological Organization, 451–473. doi:10.1146/annurev-pharmtox-011112-140248
A data model for causal ontology annotations: “LEGO”
Activity GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon
Relationship RO:nnnnnnn
Activity GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon
Evidence: ECO, SEPIO
Source: PMID, ORCID, ...
Process GO:nnnnnnn
A data model for causal ontology annotations: “LEGO”
GTPase activity GO:0003924
What: TEM1 S000004529
Where: spindle pole GO:0000922
GTPase inhibitor activity GO:0005095
What: BFA1 S000003814
Where: spindle pole GO:0000922
Exit from mitosis GO:0010458
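The TEM1/BFA1 example can be written down as a small data structure. This is a sketch under assumptions, not Noctua's internal model: the class names, the predicate strings "what", "where" and "process", and the direction of the causal link (BFA1 acting on TEM1) are all choices made for illustration. The relationship identifier is left as the placeholder RO:nnnnnnn, exactly as on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    go_id: str   # GO molecular function term
    label: str
    what: str    # the molecule carrying out the activity
    where: str   # location term (GO/CL/Uberon)


@dataclass(frozen=True)
class CausalLink:
    subject: Activity
    relation: str  # relation identifier; placeholder kept from the slide
    target: Activity


tem1 = Activity("GO:0003924", "GTPase activity", "TEM1 S000004529", "GO:0000922")
bfa1 = Activity("GO:0005095", "GTPase inhibitor activity", "BFA1 S000003814", "GO:0000922")

# Assumed direction: the inhibitor activity acts on the GTPase activity.
link = CausalLink(subject=bfa1, relation="RO:nnnnnnn", target=tem1)
PROCESS_ID = "GO:0010458"  # exit from mitosis


def to_triples(link, process_id):
    """Flatten one causal link into (subject, predicate, object) triples."""
    triples = []
    for activity in (link.subject, link.target):
        triples.append((activity.go_id, "what", activity.what))
        triples.append((activity.go_id, "where", activity.where))
        triples.append((activity.go_id, "process", process_id))
    triples.append((link.subject.go_id, link.relation, link.target.go_id))
    return triples


for triple in to_triples(link, PROCESS_ID):
    print(triple)
```

Flattening to triples is one simple way to see how such a model could map onto an RDF-style representation, where each statement can then carry its own evidence and source.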
http://noctua.berkeleybop.org/
Collaborative Editing!
RDF/OWL Semantic Representation
- Reasoning
- Linked data
Gene sets
Building causal models of biology using ontologies
Diabetes mockup example
https://vimeo.com/channels/Noctua