Annotation Systems & Implementation Issues - Suzanna Lewis


Motivation: Your research is valuable

All advances in knowledge are incremental, with each new idea ultimately building on earlier knowledge such as you are gathering.


Losing data at a rapid rate: up to 80% unavailable after 20 years

http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416

Data valuation

• Information is infinitely shareable without any loss of value
• Reuse increases the value derived from the original investment
• By combining data, their value increases
• The more these assets are used, the more additional knowledge can be gathered (data science)
• As a corollary, unshared or insufficiently documented information is less valuable
• The more accurate and complete the information is, the more useful, and therefore valuable, it is

Moody and Walsh 1999

WHAT ONTOLOGIES ARE

what kinds of things exist?

what are the relationships between these things?

[Figure: example terms (eye, ommatidium, sense organ, eye disc) linked by is_a, part_of and develops_from relationships]

A biological ontology is: a machine-interpretable representation of some aspect of biological reality

May 2, 2023

Ontology defined

The science of what is: of the kinds and structures of the objects, and their properties and relations, in every area of reality.

The classification of entities and the relations between them.

Defined by a scientific field's vocabulary and by the canonical formulations of its theories.

Seeks to solve problems which arise in these domains.

WHY ARE ONTOLOGIES NEEDED

Ontologies help with decision making

handy ontology tells us what’s there…

Where should I eat…?

Ontologies don’t just organize data; they also facilitate inference, and that creates new knowledge, often unconsciously in the user.

(Presumable) country of origin

Type of cuisine

What a 5 year old child (or a computer) will likely infer about the world from this helpful ontology…

• Flag of fresh juice
• ‘Frozen Yogurt’ cuisine in search of a national identity?
• Where delicatessen food hails from…
• Fresh Juice is a national cuisine…

Information retrieval is not straightforward

18-day pregnant

females female (lactating) individual female worker caste (female) 2 yr old female female (pregnant) lgb*cc females sex: female 400 yr. old female female (outbred) mare female, other adult female female parent female (worker) female child asexual female female plant monosex female femal

femlale diploid female female(gynoecious) remale metafemale f femele semi-engorged

female sterile female famale female, pooled sexual oviparous

female normal female femail femalen sterile female worker sf female females strictly female

vitellogenic replete female

female - worker females only tetraploid female worker female (alate sexual) gynoecious thelytoky hexaploid female female (calf) healthy female female (gynoecious) female (f-o) hen probably female

(based on morphology)

castrate female female with eggs ovigerous female 3 female cf.female female worker oviparous sexual

females female (phenotype) cystocarpic female female, 6-8 weeks old worker bee female mice dikaryon female, virgin female enriched female, spayed dioecious female female, worker pseudohermaprhoditi

c female

Courtesy of N. Silvester and S. Orchard, European Nucleotide Archive, EMBL-EBI
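The free-text values above can be mapped onto one controlled term with a curated lookup. The following is a minimal sketch, not the archive's actual pipeline: the table `SEX_SYNONYMS` and the functions `normalize_sex` and `summarize` are hypothetical, and the table holds only a small sample of the values listed above.

```python
from collections import Counter

# Hypothetical curated table: free-text value -> controlled term.
# Only a small sample of the submitted values is included here.
SEX_SYNONYMS = {
    "female": "female",
    "females": "female",
    "femal": "female",
    "femlale": "female",
    "femail": "female",
    "famale": "female",
    "f": "female",
    "mare": "female",
    "hen": "female",
}

def normalize_sex(raw, synonyms=SEX_SYNONYMS):
    """Return the controlled term for a free-text value, or None if unmapped."""
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    return synonyms.get(key)

def summarize(values):
    """Count values per controlled term; the None bucket still needs curation."""
    return Counter(normalize_sex(v) for v in values)

submitted = ["female", "Femlale", "mare", "hen",
             "female (pregnant)", "probably female"]
counts = summarize(submitted)
```

Values that carry extra information, such as "female (pregnant)", fall into the `None` bucket on purpose: they need a curator's decision rather than a silent guess.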


Motivation is to represent biology accurately

Inferences and decisions we make are based upon what we know of the biological reality.

An ontology is a computable representation of this underlying biological reality.

Enables a computer to reason over the data in (some of) the ways that we do.

Annotation bottleneck

• Even the best research will be for naught if data can never be found again.
• An active lab can easily generate 10-100 GB of data per month, and it is very difficult to manage on this scale.
• Must be annotated at the rate at which it is generated
• And the data must be integrated with other data
• Furthermore, the effort put into generating this data will be utterly wasted if the curated data cannot be reliably computed upon.

HOW TO BUILD ONTOLOGIES


Ontologies must be shared

Communities form scientific theories that seek to explain all of the existing evidence and can be used for prediction

The computable representation must also be shared

Thus ontology development is inherently collaborative


Ontologies must be used

• Usage feeds back on ontology development and improves the ontology
• It improves even more when these data are used to answer research questions
• There will be fewer problems in the ontology and more commitment to fixing remaining problems when important research data is involved that scientists depend upon

Why do we need rules for good ontology?

Ontologies must be intelligible
• To humans (for annotation) and
• To machines (for searching, reasoning and error-checking)

• Makes it easier to find the most accurate term(s) to use
• Avoids annotation errors
• Makes it easier for new curators to learn and understand
• Makes it easier to combine with other ontologies and terminologies
• Makes automatic reasoning possible for searching & inference

Bottom line: Following basic rules makes more useful ontologies


First Rule: Univocity

Terms (including those describing relations) should have the same meanings on every occasion of use.

In other words, they should refer to the same kinds of entities in reality


The Challenge of Univocity: People call the same thing by different names

[Figure: "Glucose synthesis" and "Gluconeogenesis": the same thing?]

Comparison is difficult, especially across species or across databases that each use one of these different variants

Disambiguation

Use a single term, and plenty of synonyms

Gluconeogenesis

Synonyms:
• Glucose synthesis
• Glucose biosynthesis
• Glucose formation
• Glucose anabolism


Bud initiation? How is a computer to know?

= tooth bud initiation

= cellular bud initiation

= flower bud initiation

Include plain “bud initiation” as a synonym for each of these terms

Classification rule: Disambiguation
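One way to picture "a single term, and plenty of synonyms" is an index from every label to the term or terms it may denote. This is only a sketch: the `EX:` identifiers are placeholders rather than real ontology IDs, and `build_index` and `lookup` are hypothetical helper names.

```python
from collections import defaultdict

# Illustrative terms; the EX: identifiers are placeholders, not real ontology IDs.
TERMS = {
    "EX:0001": {"name": "gluconeogenesis",
                "synonyms": ["glucose synthesis", "glucose biosynthesis",
                             "glucose formation", "glucose anabolism"]},
    "EX:0002": {"name": "tooth bud initiation", "synonyms": ["bud initiation"]},
    "EX:0003": {"name": "cellular bud initiation", "synonyms": ["bud initiation"]},
    "EX:0004": {"name": "flower bud initiation", "synonyms": ["bud initiation"]},
}

def build_index(terms):
    """Map every name and synonym (case-insensitive) to the set of term IDs."""
    index = defaultdict(set)
    for term_id, term in terms.items():
        for label in [term["name"], *term["synonyms"]]:
            index[label.lower()].add(term_id)
    return index

def lookup(label, index):
    """Return the sorted term IDs a label may refer to (empty list if unknown)."""
    return sorted(index.get(label.lower(), set()))

INDEX = build_index(TERMS)
```

A lookup of "Glucose synthesis" resolves to exactly one term, while "bud initiation" returns three candidates, which signals that a human or some additional context has to choose among them.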


Second Rule: Positivity

Complements of classes are not themselves classes.

Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes.


The Challenge of Positivity

Some organelles are membrane-bound. A centrosome is not a membrane-bound organelle, but it still may be considered an organelle.


Positivity

Note the logical difference between “non-membrane-bound organelle” and “not a membrane-bound organelle”

The latter includes everything that is not a membrane-bound organelle!
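The difference can be made concrete with plain sets. The sketch below uses a toy universe whose membership is an illustrative assumption; only the contrast between the two set expressions matters.

```python
# Toy universe; which entities appear here is an illustrative assumption.
organelles = {"mitochondrion", "chloroplast", "centrosome", "ribosome"}
membrane_bound_organelles = {"mitochondrion", "chloroplast"}
everything = organelles | {"glucose", "ATP", "mouse"}

# "non-membrane-bound organelle": an organelle that lacks a bounding membrane.
non_membrane_bound_organelles = organelles - membrane_bound_organelles

# "not a membrane-bound organelle": the logical complement, which sweeps in
# every entity that is not a membrane-bound organelle at all.
not_a_membrane_bound_organelle = everything - membrane_bound_organelles
```

The first set is a genuine class with positive membership criteria; the second is a grab bag that happens to contain a sugar and a whole organism.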


Third Rule: Objectivity

Which classes exist is not a function of our biological knowledge.

Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.


Objectivity

How can we annotate when we know that we don’t have any information?

Annotate to root nodes and use the ND (no data) evidence code

Similar strategies can be used for any situation in which more specific information is not yet known
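As a sketch of this strategy, the function below emits root-level annotations with the ND evidence code for every Gene Ontology aspect that has no annotation yet. The record layout, the function name `no_data_annotations` and the gene identifier are assumptions made for illustration; the three root term IDs are those of the Gene Ontology.

```python
# Root terms of the three Gene Ontology aspects.
GO_ROOTS = {
    "molecular_function": "GO:0003674",
    "biological_process": "GO:0008150",
    "cellular_component": "GO:0005575",
}

def no_data_annotations(gene_id, annotated_aspects=()):
    """Return root-level ND annotations for aspects that have no annotation yet.

    The dict layout is a hypothetical record format, not an official file format.
    """
    return [
        {"gene": gene_id, "term": term_id, "aspect": aspect, "evidence": "ND"}
        for aspect, term_id in GO_ROOTS.items()
        if aspect not in annotated_aspects
    ]

# "GENE:0001" is a made-up identifier.
records = no_data_annotations("GENE:0001",
                              annotated_aspects=("biological_process",))
```

Because the annotation points at a real class (the root) rather than at an invented "unknown" class, the ontology itself stays free of knowledge-dependent terms.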


GPCRs with unknown ligands

Annotate to this

Fourth Rule: Use defined relationships

Ontologies are graphs, where the nodes (terms in the ontology) are connected by edges (relationships between the terms), such as is_a and part_of.

[Figure: example graph with the terms cell, membrane, chloroplast, chloroplast membrane and mitochondrial membrane]
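A small sketch of the graph view: terms are nodes, typed relationships are edges, and following one relationship type upward yields everything a term is (is_a) or belongs to (part_of). The specific edges below are an assumed reading of the figure, and `ancestors` is a hypothetical helper name.

```python
# (subject, relation, object) edges; the particular edges are an assumed
# reading of the figure, chosen for illustration.
EDGES = [
    ("chloroplast membrane", "is_a", "membrane"),
    ("mitochondrial membrane", "is_a", "membrane"),
    ("chloroplast membrane", "part_of", "chloroplast"),
    ("chloroplast", "part_of", "cell"),
]

def ancestors(term, relation, edges=EDGES):
    """All terms reachable from `term` by repeatedly following `relation` edges."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        for subj, rel, obj in edges:
            if subj == current and rel == relation and obj not in found:
                found.add(obj)
                stack.append(obj)
    return found
```

Keeping the relationship type explicit on every edge is what lets a query such as "everything that is part of a cell" ignore is_a edges, and the other way round.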


Reasoning is critical

• Prokaryotic cell and Eukaryotic cell are declared disjoint
• Fungal cell is a Eukaryotic cell
• Spore is a Fungal cell and a Prokaryotic cell

Satisfiable?

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006

[Figure: ProkaryoticCell and EukaryoticCell (disjoint), FungalCell, Spore]
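The question can be checked mechanically. The sketch below is a hand-rolled stand-in for what an OWL reasoner does at much larger scale: it collects each class's superclasses and flags any class that falls under two classes declared disjoint. The function names are hypothetical.

```python
# is_a parents per class, as stated above.
IS_A = {
    "FungalCell": {"EukaryoticCell"},
    "Spore": {"FungalCell", "ProkaryoticCell"},
}
# Pairs of classes declared disjoint (no instance may belong to both).
DISJOINT = [("ProkaryoticCell", "EukaryoticCell")]

def superclasses(cls, is_a):
    """The class itself plus every class reachable through is_a."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def unsatisfiable(is_a, disjoint):
    """Classes that fall under two classes declared disjoint."""
    classes = set(is_a) | {p for parents in is_a.values() for p in parents}
    return sorted(
        c for c in classes
        if any(a in superclasses(c, is_a) and b in superclasses(c, is_a)
               for a, b in disjoint)
    )

problems = unsatisfiable(IS_A, DISJOINT)
```

Here `problems` contains only `Spore`: it inherits from EukaryoticCell through FungalCell and from ProkaryoticCell directly, so no instance could ever satisfy it.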


Reasoning is critical

Solution: clarify spore

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006

[Figure: ProkaryoticCell and EukaryoticCell (disjoint), FungalCell, Actinomycete-type spore, Mycetozoa-type spore]


Fifth Rule: Intelligibility of Definitions

The terms used in a definition should be simpler (more intelligible) than the term to be defined; otherwise the definition provides no assistance to human understanding or to machine processing


Sixth Rule: Keep it Real

When building or maintaining an ontology, always think carefully about how classes (types, kinds, species) relate to instances in reality


The Rules

1. Univocity: Terms should have the same meanings on every occasion of use

2. Positivity: Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes.

3. Objectivity: Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.

4. Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level

5. Intelligibility of Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined

6. Basis in Reality: When building or maintaining an ontology, always think carefully about how classes relate to instances in reality

7. Distinguish Universals and Instances

How to best describe biology?

Natural Language
+ Large existing body of information
+ Highly expressive
- Ambiguous (making it difficult and unreliable to compute on)

Computable Ontology
- Less expressive
+ Logical
+ Precise

ONTOLOGIES AND BIOLOGY

Without rigor, we won’t know what we know, or where to find it, or what we can infer from it.

GENOME ANNOTATION

Apollo

Once a genome is sequenced… What are the parts? (sequence features)

• Protein coding genes (coding sequence)
• Non coding RNAs (rRNA, snoRNA, tRNA, microRNA, antisense RNA)
• Promoters and regulatory regions
• Transposons
• Recombination hotspots, origins of replication
• Centromeres & telomeres
• …

ComputeCrawler

RepeatMasker, Genscan, FgenesH, Grail, Blast, Sim4, Genewise, Lap

CGTGTGCGCAGGGGGATATGCGGCGCATATTGTGTTGAAGAGATGCGCTGCATTTCGCGATGCCGATTAGGNCACAGGGAA

DNA on a linear coordinate

Little boxes

de novo predictions

protein alignments

transcript alignments

full length cDNAs


APOLLO

annotation editing environment

BECOMING ACQUAINTED WITH APOLLO

Color by CDS frame, toggle strands, set color scheme and highlights.

Upload evidence files (GFF3, BAM, BigWig), add combination and sequence search tracks.

Query the genome using BLAT.

Navigation and zoom.

Search for a gene model or a scaffold.

Get coordinates and “rubber band” selection for zooming.

Login

User-created annotations.

Annotator panel.

Evidence Tracks

Stage and cell-type specific transcription data.

http://genomearchitect.org/web_apollo_user_guide
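GFF3, one of the evidence file formats mentioned above, stores one sequence feature per line in nine tab-separated columns. The parser below is a minimal sketch for a single line, not Apollo's own loader; the function name `parse_gff3_line` and the example record are made up for illustration.

```python
from urllib.parse import unquote

# The nine GFF3 columns, in order.
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]

def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments/blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    feature = dict(zip(GFF3_COLUMNS, fields))
    feature["start"] = int(feature["start"])  # 1-based, inclusive
    feature["end"] = int(feature["end"])
    attrs = {}
    for pair in feature["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[unquote(key)] = unquote(value)  # GFF3 uses URL-style escapes
    feature["attributes"] = attrs
    return feature

# A made-up feature line for illustration.
example = "scaffold_1\texample\tgene\t1000\t2500\t.\t+\t.\tID=gene0001;Name=hypothetical"
gene = parse_gff3_line(example)
```

Each parsed feature is one of the "little boxes" on the linear DNA coordinate: a type, a start, an end and a strand, plus free-form attributes.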

Coordinate transforms: Curator ‘ligation’

Coordinate transforms: intron folding

Alterations: whether experimental artifacts or natural differences

• Substitutions
• Insertions
• Deletions
• Impact
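The three alteration types, and one simple notion of impact, can be sketched on plain strings. This is not how Apollo implements them: `apply_alteration` and `shifts_frame` are hypothetical helpers, positions are 0-based, and the short sequences are illustrative.

```python
def apply_alteration(seq, kind, pos, ref="", alt=""):
    """Apply one alteration at 0-based position `pos` and return the new sequence."""
    if seq[pos:pos + len(ref)] != ref:
        raise ValueError("reference bases do not match the sequence")
    if kind == "substitution":
        if len(ref) != len(alt):
            raise ValueError("a substitution must keep the length unchanged")
        return seq[:pos] + alt + seq[pos + len(ref):]
    if kind == "insertion":
        return seq[:pos] + alt + seq[pos:]          # new bases go in before pos
    if kind == "deletion":
        return seq[:pos] + seq[pos + len(ref):]     # ref bases are removed
    raise ValueError(f"unknown alteration kind: {kind}")

def shifts_frame(original, altered):
    """True if the length change is not a multiple of three (a frameshift)."""
    return (len(altered) - len(original)) % 3 != 0

reference = "CGTAGC"
substituted = apply_alteration(reference, "substitution", 4, ref="G", alt="C")
inserted = apply_alteration(reference, "insertion", 3, alt="TT")
deleted = apply_alteration(reference, "deletion", 2, ref="TA")
```

A substitution leaves the reading frame intact, whereas the two-base insertion and deletion shown here both shift it, which is one reason the impact of an alteration inside a coding sequence has to be recomputed.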


APOLLO ON THE WEB

instructions

Username: user.number@example.com
Password: usernumber

Email                          Password         Server  Begin at
user.one@example.com           userone          1       1
user.two@example.com           usertwo          2       1
user.three@example.com         userthree        3       1
user.four@example.com          userfour         4       1
user.five@example.com          userfive         5       1
user.six@example.com           usersix          1       7
user.seven@example.com         userseven        2       7
user.eight@example.com         usereight        3       7
user.nine@example.com          usernine         4       7
user.ten@example.com           userten          5       7
user.eleven@example.com        usereleven       1       1
user.twelve@example.com        usertwelve       2       1
user.thirteen@example.com      userthirteen     3       1
user.fourteen@example.com      userfourteen     4       1
user.fifteen@example.com       userfifteen      5       1
user.sixteen@example.com       usersixteen      1       7
user.seventeen@example.com     userseventeen    2       7
user.eightteen@example.com     usereighteen     3       7
user.nineteen@example.com      usernineteen     4       7
user.twenty@example.com        usertwenty       5       7
user.twentyone@example.com     usertwentyone    1       1
user.twentytwo@example.com     usertwentytwo    2       1
user.twentythree@example.com   usertwentythree  3       1
user.twentyfour@example.com    usertwentyfour   4       1
user.twentyfive@example.com    usertwentyfive   5       1
user.twentysix@example.com     usertwentysix    1       7
user.twentyseven@example.com   usertwentyseven  2       7
user.twentyeight@example.com   usertwentyeight  3       7
user.twentynine@example.com    usertwentynine   4       7

Server  URL
1       http://ec2-52-63-181-136.ap-southeast-2.compute.amazonaws.com/apollo/
2       http://ec2-52-64-198-214.ap-southeast-2.compute.amazonaws.com/apollo/
3       http://ec2-52-62-166-89.ap-southeast-2.compute.amazonaws.com/apollo/
4       http://ec2-52-64-182-170.ap-southeast-2.compute.amazonaws.com/apollo/
5       http://ec2-52-63-255-136.ap-southeast-2.compute.amazonaws.com/apollo/

For example – ontologically described genotypes/variants

intrinsic genotype = genomic variation complement + genomic background

Example: apchu745/+; fgf8ati282/ti282(AB)

[Figure: has_part hierarchy, each level illustrated on the DNA sequence]

genomic variation complement: apchu745/+; fgf8ati282/ti282
  has_part variant single locus complement: apchu745/+
    has_part variant allele: apchu745
      has_part sequence alteration: hu745

FUNCTIONAL ANNOTATION

Phylogenetic Annotation Inferencing Tool — PAINT

Evolutionary history is the natural way to organize and analyze biological data

Ancestral inference

• Integration at points of common ancestry
• Infer “hidden” character of living organisms
• Explicitly leverage evolutionary relationships

[Figure: gene tree with leaves E.c.; A.t. MTHFR1; A.t. MTHFR2; D.d.; S.p.; S.c. MET13; D.m.; A.g.; S.p.; S.c. MET12; C.e.; D.r.; G.g.; H.s. MTHFR; R.n.; M.m. Labels: divergence; biochemistry (purification and assay); genetics (mutant phenotypes)]

What is transitive annotation?

Related genes have a common function because their common ancestor had that function.

Not just an inference about one gene. It is also an inference for
• The most recent common ancestor (MRCA)
• Continuous inheritance since the MRCA
• Potential inheritance by other descendants of the MRCA

[Figure: Function X assigned to the gene in yeast, mouse, zebrafish and human, and to the gene in the opisthokont MRCA]
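The reasoning behind transitive annotation can be sketched on a toy tree: find the most recent common ancestor of the experimentally annotated genes, then treat every other leaf under that ancestor as a candidate to inherit the function. The tree topology, the choice of yeast and mouse as the experimentally characterized genes, and all function names below are assumptions for illustration.

```python
# child -> parent; a toy tree for one gene family (assumed topology).
PARENT = {
    "yeast": "opisthokont_MRCA",
    "vertebrate_MRCA": "opisthokont_MRCA",
    "zebrafish": "vertebrate_MRCA",
    "mammal_MRCA": "vertebrate_MRCA",
    "mouse": "mammal_MRCA",
    "human": "mammal_MRCA",
}

def lineage(node, parent=PARENT):
    """The node followed by all of its ancestors, nearest first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def mrca(nodes, parent=PARENT):
    """Most recent common ancestor of the given nodes."""
    first, *rest = [lineage(n, parent) for n in nodes]
    shared = set(first).intersection(*map(set, rest))
    return next(n for n in first if n in shared)

def leaves_under(ancestor, parent=PARENT):
    """All leaves whose lineage passes through `ancestor`."""
    leaves = set(parent) - set(parent.values())
    return {leaf for leaf in leaves if ancestor in lineage(leaf, parent)}

def transitive_annotation(experimental, parent=PARENT):
    """Return the inferred ancestor and the leaves that may inherit from it."""
    ancestor = mrca(sorted(experimental), parent)
    inferred = leaves_under(ancestor, parent) - set(experimental)
    return ancestor, inferred

ancestor, inferred = transitive_annotation({"yeast", "mouse"})
```

With experimental evidence in yeast and mouse, the function is placed on the opisthokont MRCA, and zebrafish and human become candidates for inheriting it.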


• Green indicates experimental
• Black dot indicates direct experimental data.
• White dot indicates a more general functional class inferred from ontology
• Red indicates NOT function for the gene

All nodes have persistent identifiers which are retained across different builds of the protein family trees.

cholinesterase

carboxylic ester hydrolase

Evolutionary event type: duplication, speciation

• PAINTed nodes: 3 steps carried out by curator
  • Gain & Loss of function
  • Inferred By Descendants: experimental annotations provide evidence
  • Inferred by Ancestry: propagation to unannotated leaves

[Figure labels: carboxylic ester hydrolase; node with loss of function; node with gain of function: cholinesterase]

Gaudet, P., et al. (2011). Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium. Briefings in Bioinformatics, 12(5), 449–62. doi:10.1093/bib/bbr042
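The "Inferred by Ancestry" step, combined with gain and loss of function, can be sketched as a top-down walk of the tree. This is not PAINT's implementation: the tree shape, the node names and the choice of which node loses which function are assumptions, while the two function names come from the example above.

```python
# parent -> children; a toy family tree (assumed shape and node names).
CHILDREN = {
    "root": ["n1", "n2"],
    "n1": ["leafA", "leafB"],
    "n2": ["leafC"],
}
# Functions a curator marks as gained or lost at specific nodes (assumed).
GAIN = {"root": {"carboxylic ester hydrolase"}, "n1": {"cholinesterase"}}
LOSS = {"n2": {"carboxylic ester hydrolase"}}

def propagate(node, inherited=frozenset(), children=CHILDREN, gain=GAIN, loss=LOSS):
    """Return {node: functions} for `node` and everything below it.

    Each node inherits its parent's functions, adds its own gains and
    drops its own losses before passing the result on to its children.
    """
    functions = (set(inherited) | gain.get(node, set())) - loss.get(node, set())
    result = {node: functions}
    for child in children.get(node, []):
        result.update(propagate(child, functions, children, gain, loss))
    return result

annotations = propagate("root")
```

Leaves under `n1` end up with both functions, while `leafC` inherits nothing because its ancestor `n2` is marked with a loss; in a real curation tool that loss would be recorded as a NOT annotation.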

http://questfororthologs.org/

FUNCTIONAL ANNOTATION

Noctua for Building Models of Biology

Motivation: multi-scale knowledge models of mechanistic biology

Bai, J. P. F., & Abernethy, D. R. (n.d.). Systems Pharmacology to Predict Drug Toxicity: Integration Across Levels of Biological Organization, 451–473. doi:10.1146/annurev-pharmtox-011112-140248

A data model for causal ontology annotations: “LEGO”

Activity: GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon

Relationship: RO:nnnnnnn

Activity: GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon

Process: GO:nnnnnnn

Evidence: ECO, SEPIO
Source: PMID, ORCID, ...

A data model for causal ontology annotations: “LEGO” (example)

GTPase activity GO:0003924
What: TEM1 S000004529
Where: spindle pole GO:0000922

GTPase inhibitor activity GO:0005095
What: BFA1 S000003814
Where: spindle pole GO:0000922

Exit from mitosis GO:0010458
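The building blocks above map naturally onto a pair of record types. The sketch below is one possible encoding, not Noctua's actual schema: the class and field names are assumptions, attaching both activities to the process is an assumed reading of the diagram, and the relation is left as the `RO:nnnnnnn` placeholder because the slides do not give it.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Activity:
    activity: str                      # GO molecular function ID
    enabled_by: str                    # "What": gene product identifier
    occurs_in: Optional[str] = None    # "Where": GO/CL/Uberon ID
    part_of: Optional[str] = None      # GO biological process ID

@dataclass
class CausalEdge:
    source: Activity
    relation: str                      # RO relation ID
    target: Activity
    evidence: List[str] = field(default_factory=list)  # ECO codes, PMIDs, ...

# The example from the slide; part_of on both activities is an assumption.
tem1 = Activity("GO:0003924", "S000004529",
                occurs_in="GO:0000922", part_of="GO:0010458")
bfa1 = Activity("GO:0005095", "S000003814",
                occurs_in="GO:0000922", part_of="GO:0010458")

# The relation ID is a placeholder, exactly as in the template.
edge = CausalEdge(source=bfa1, relation="RO:nnnnnnn", target=tem1)
```

Keeping each activity as its own record, with the relationship as a separate edge, is what allows several such units to be chained into a larger causal model.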

http://noctua.berkeleybop.org/

Collaborative Editing!

RDF/OWL Semantic Representation
- Reasoning
- Linked data

Gene sets

Building causal models of biology using ontologies

Diabetes mockup example

https://vimeo.com/channels/Noctua
