
Annotation Systems & Implementation Issues - Suzanna Lewis


Page 1: Annotation Systems & Implementation Issues - Suzanna Lewis

Motivation: Your research is valuable

All advances in knowledge are incremental, with each new idea ultimately building on earlier knowledge such as you are gathering.

Page 2: Annotation Systems & Implementation Issues - Suzanna Lewis

Losing data at a rapid rate: up to 80% unavailable after 20 years

http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416

Page 3: Annotation Systems & Implementation Issues - Suzanna Lewis

Data valuation
• Information is infinitely shareable without any loss of value
• Reuse increases the value derived from the original investment
• By combining data, their value increases
• The more these assets are used, the more additional knowledge can be gathered (data science)
• As a corollary, unshared or insufficiently documented information is less valuable
• The more accurate and complete the information is, the more useful, and therefore valuable, it is

Moody and Walsh 1999

Page 6: Annotation Systems & Implementation Issues - Suzanna Lewis

WHAT ONTOLOGIES ARE

Page 7: Annotation Systems & Implementation Issues - Suzanna Lewis

A biological ontology is: a machine-interpretable representation of some aspect of biological reality
• what kinds of things exist?
• what are the relationships between these things?

[Figure: the terms eye, ommatidium, sense organ, and eye disc linked by is_a, part_of, and develops_from relations]

Page 8: Annotation Systems & Implementation Issues - Suzanna Lewis

May 2, 2023

Ontology defined
• The science of what is: of the kinds and structures of the objects, and their properties and relations in every area of reality.
• The classification of entities and the relations between them.
• Defined by a scientific field's vocabulary and by the canonical formulations of its theories.
• Seeks to solve problems which arise in these domains.

Page 9: Annotation Systems & Implementation Issues - Suzanna Lewis

WHY ARE ONTOLOGIES NEEDED

Page 10: Annotation Systems & Implementation Issues - Suzanna Lewis

Ontologies help with decision making

handy ontology tells us what’s there…

Where should I eat…?

Page 11: Annotation Systems & Implementation Issues - Suzanna Lewis

Ontologies don’t just organize data; they also facilitate inference, and that creates new knowledge, often unconsciously in the user.

(Presumable) country of origin

Type of cuisine

Page 12: Annotation Systems & Implementation Issues - Suzanna Lewis

What a 5 year old child (or a computer) will likely infer about the world from this helpful ontology…
• Flag of fresh juice
• ‘Frozen Yogurt’ cuisine in search of a national identity?
• Where delicatessen food hails from…
• Fresh Juice is a national cuisine…

Page 13: Annotation Systems & Implementation Issues - Suzanna Lewis

Information retrieval is not straightforward

18-day pregnant females female (lactating) individual female worker caste (female) 2 yr old female female (pregnant) lgb*cc females sex: female 400 yr. old female female (outbred) mare female, other adult female female parent female (worker) female child asexual female female plant monosex female femal femlale diploid female female(gynoecious) remale metafemale f femele semi-engorged female sterile female famale female, pooled sexual oviparous female normal female femail femalen sterile female worker sf female females strictly female vitellogenic replete female female - worker females only tetraploid female worker female (alate sexual) gynoecious thelytoky hexaploid female female (calf) healthy female female (gynoecious) female (f-o) hen probably female (based on morphology) castrate female female with eggs ovigerous female 3 female cf.female female worker oviparous sexual females female (phenotype) cystocarpic female female, 6-8 weeks old worker bee female mice dikaryon female, virgin female enriched female, spayed dioecious female female, worker pseudohermaprhoditic female

Courtesy of N. Silvester and S. Orchard, European Nucleotide Archive, EMBL-EBI
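To make the retrieval problem concrete, here is a minimal sketch of mapping such free-text values onto one controlled term. The fuzzy-matching approach, the function name normalize_sex, and the 0.82 cutoff are assumptions made for illustration; they are not how the European Nucleotide Archive handles these values.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

CANONICAL = "female"
# Illustrative threshold (an assumption): "male" scores exactly 0.80
# against "female", so the cutoff must sit above that.
CUTOFF = 0.82


def normalize_sex(raw: str) -> Optional[str]:
    """Return "female" if any word in the free-text value is close to it."""
    for token in re.findall(r"[a-z]+", raw.lower()):
        if SequenceMatcher(None, token, CANONICAL).ratio() >= CUTOFF:
            return CANONICAL
    return None


samples = ["female (lactating)", "femail", "famale", "sex: female", "mare", "male"]
for value in samples:
    print(f"{value!r:22} -> {normalize_sex(value)}")
```

String similarity catches misspellings such as "femail" but not values like "mare" or "hen", which carry the same information only through their meaning. Those need an explicit synonym list, which is exactly what a curated ontology term provides.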

Page 14: Annotation Systems & Implementation Issues - Suzanna Lewis

Motivation is to represent biology accurately
• Inferences and decisions we make are based upon what we know of the biological reality.
• An ontology is a computable representation of this underlying biological reality.
• Enables a computer to reason over the data in (some of) the ways that we do.

Page 15: Annotation Systems & Implementation Issues - Suzanna Lewis

Annotation bottleneck
• Even the best research will be for naught if data can never be found again.
• An active lab can easily generate 10-100 GB of data per month, and it is very difficult to manage on this scale.
• Must be annotated at the rate at which it is generated
• And the data must be integrated with other data
• Furthermore, the effort put into generating this data will be utterly wasted if the curated data cannot be reliably computed upon.

Page 16: Annotation Systems & Implementation Issues - Suzanna Lewis

HOW TO BUILD ONTOLOGIES

Page 17: Annotation Systems & Implementation Issues - Suzanna Lewis

Ontologies must be shared

Communities form scientific theories that seek to explain all of the existing evidence and can be used for prediction

The computable representation must also be shared

Thus ontology development is inherently collaborative

Page 18: Annotation Systems & Implementation Issues - Suzanna Lewis

Ontologies must be used
• Usage feeds back on ontology development and improves the ontology
• It improves even more when these data are used to answer research questions
• There will be fewer problems in the ontology and more commitment to fixing remaining problems when important research data is involved that scientists depend upon

Page 19: Annotation Systems & Implementation Issues - Suzanna Lewis

Why do we need rules for good ontology?

Ontologies must be intelligible
• To humans (for annotation) and
• To machines (for searching, reasoning and error-checking)

• Makes it easier to find the most accurate term(s) to use
• Avoids annotation errors
• Makes it easier for new curators to learn and understand
• Makes it easier to combine with other ontologies and terminologies
• Makes automatic reasoning possible for searching & inference

Bottom line: Following basic rules makes more useful ontologies

Page 20: Annotation Systems & Implementation Issues - Suzanna Lewis

First Rule: Univocity
Terms (including those describing relations) should have the same meanings on every occasion of use.

In other words, they should refer to the same kinds of entities in reality

Page 21: Annotation Systems & Implementation Issues - Suzanna Lewis

The Challenge of Univocity: People call the same thing by different names

[Figure: the labels “Glucose synthesis” and “Gluconeogenesis” shown as competing names, with a question mark]

Page 22: Annotation Systems & Implementation Issues - Suzanna Lewis

Comparison is difficult, especially across species or across databases that each use one of these different variants

Disambiguation: use a single term, and plenty of synonyms

Gluconeogenesis
Synonyms:
• Glucose synthesis
• Glucose biosynthesis
• Glucose formation
• Glucose anabolism

Page 23: Annotation Systems & Implementation Issues - Suzanna Lewis

Bud initiation? How is a computer to know?

Page 24: Annotation Systems & Implementation Issues - Suzanna Lewis

Classification rule: Disambiguation
• tooth bud initiation
• cellular bud initiation
• flower bud initiation

Include plain “bud initiation” as a synonym for each of these terms
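The synonym strategy for gluconeogenesis and for bud initiation can be sketched as a lookup table from labels to terms. This is an illustration only: the EX: identifiers are placeholders rather than real ontology IDs, and build_index and lookup are hypothetical helper names.

```python
from collections import defaultdict

# Term IDs below are placeholders for illustration, not real ontology identifiers.
TERMS = {
    "EX:0000001": {
        "name": "gluconeogenesis",
        "synonyms": ["glucose synthesis", "glucose biosynthesis",
                     "glucose formation", "glucose anabolism"],
    },
    "EX:0000002": {"name": "tooth bud initiation", "synonyms": ["bud initiation"]},
    "EX:0000003": {"name": "cellular bud initiation", "synonyms": ["bud initiation"]},
    "EX:0000004": {"name": "flower bud initiation", "synonyms": ["bud initiation"]},
}


def build_index(terms):
    """Map every name and synonym (lower-cased) to the set of term IDs using it."""
    index = defaultdict(set)
    for term_id, term in terms.items():
        for label in [term["name"], *term["synonyms"]]:
            index[label.lower()].add(term_id)
    return index


INDEX = build_index(TERMS)


def lookup(label):
    """Return all term IDs a label could refer to; more than one means ambiguity."""
    return sorted(INDEX.get(label.strip().lower(), set()))


print(lookup("Glucose synthesis"))  # one term: the synonym is unambiguous
print(lookup("bud initiation"))     # three terms: a curator must choose
```

Each term keeps a single meaning (univocity), while the index lets a search on an everyday phrase surface every candidate term instead of silently picking one.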

Page 25: Annotation Systems & Implementation Issues - Suzanna Lewis

Second Rule: Positivity
Complements of classes are not themselves classes.

Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes.

Page 26: Annotation Systems & Implementation Issues - Suzanna Lewis

The Challenge of Positivity

Some organelles are membrane-bound. A centrosome is not a membrane-bound organelle, but it still may be considered an organelle.

Page 27: Annotation Systems & Implementation Issues - Suzanna Lewis

Positivity
Note the logical difference between “non-membrane-bound organelle” and “not a membrane-bound organelle”

The latter includes everything that is not a membrane-bound organelle!
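The difference between the two phrases can be shown with plain set operations. The small universe of entities below is invented for illustration.

```python
# A toy universe of entities; membership choices are illustrative.
universe = {"nucleus", "mitochondrion", "centrosome", "ribosome", "cytosol", "hexokinase"}
organelles = {"nucleus", "mitochondrion", "centrosome", "ribosome"}
membrane_bound_organelles = {"nucleus", "mitochondrion"}

# "non-membrane-bound organelle": still an organelle, just without a membrane.
non_membrane_bound_organelle = organelles - membrane_bound_organelles

# "not a membrane-bound organelle": the logical complement, i.e. everything else.
not_a_membrane_bound_organelle = universe - membrane_bound_organelles

print(sorted(non_membrane_bound_organelle))
print(sorted(not_a_membrane_bound_organelle))
```

The first set contains only organelles (centrosome, ribosome); the second also sweeps in cytosol and hexokinase, which are not organelles at all.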

Page 28: Annotation Systems & Implementation Issues - Suzanna Lewis

Third Rule: Objectivity
Which classes exist is not a function of our biological knowledge.

Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.

Page 29: Annotation Systems & Implementation Issues - Suzanna Lewis

Objectivity
• How can we annotate when we know that we don’t have any information?
• Annotate to root nodes and use the ND (no data) evidence code
• Similar strategies can be used for any situation where more specific information is not yet known
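A minimal sketch of the root-node strategy, assuming a simple annotation record: the Annotation class, the annotate helper, and the gene names are hypothetical, while the three identifiers are the Gene Ontology root terms.

```python
from dataclasses import dataclass
from typing import Optional

# Root terms of the three Gene Ontology aspects.
GO_ROOTS = {
    "molecular_function": "GO:0003674",
    "biological_process": "GO:0008150",
    "cellular_component": "GO:0005575",
}


@dataclass(frozen=True)
class Annotation:
    gene: str
    term: str
    evidence: str


def annotate(gene: str, aspect: str, term: Optional[str] = None,
             evidence: Optional[str] = None) -> Annotation:
    """Annotate to a specific term, or to the aspect root with ND if nothing is known."""
    if term is None:
        # Nothing known: record that fact at the root instead of inventing an 'unknown' class.
        return Annotation(gene, GO_ROOTS[aspect], "ND")
    if evidence is None:
        raise ValueError("a specific term needs an evidence code")
    return Annotation(gene, term, evidence)


print(annotate("geneX", "molecular_function"))
print(annotate("geneY", "molecular_function", "GO:0003924", "IDA"))
```

Because the root term is a genuine class, the ontology needs no 'unknown' term; the state of knowledge lives in the evidence code and can be updated later without touching the ontology.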

Page 30: Annotation Systems & Implementation Issues - Suzanna Lewis

GPCRs with unknown ligands

Annotate to this

Page 31: Annotation Systems & Implementation Issues - Suzanna Lewis

Fourth Rule: Use defined relationships

Ontologies are graphs, where the nodes (terms in the ontology) are connected by edges (relationships between the terms)

[Figure: graph of the terms cell, membrane, chloroplast, mitochondrial membrane, and chloroplast membrane, connected by is-a and part-of edges]
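A small sketch of an ontology as a graph with typed edges. The edge list is an assumption about what the figure shows, and reachable is a hypothetical helper that follows only the relation types it is given.

```python
# (subject, relation, object) triples; the exact edges are assumed for illustration.
EDGES = [
    ("mitochondrial membrane", "is_a", "membrane"),
    ("chloroplast membrane", "is_a", "membrane"),
    ("chloroplast membrane", "part_of", "chloroplast"),
    ("chloroplast", "part_of", "cell"),
]


def reachable(term, relations, edges=EDGES):
    """Collect every term reachable from `term` by following the given relation types."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        for subject, relation, obj in edges:
            if subject == current and relation in relations and obj not in found:
                found.add(obj)
                stack.append(obj)
    return found


print(reachable("chloroplast membrane", {"is_a"}))
print(reachable("chloroplast membrane", {"part_of"}))
```

Keeping the relation type on every edge is what lets a query distinguish "is a kind of membrane" from "is located within a cell"; an untyped link would blur the two.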

Page 32: Annotation Systems & Implementation Issues - Suzanna Lewis

Reasoning is critical
• Prokaryotic cell and Eukaryotic cell are declared disjoint
• Fungal cell is a Eukaryotic cell
• Spore is a Fungal cell and a Prokaryotic cell
• Satisfiable?

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006

[Figure: Spore placed under both FungalCell (itself under EukaryoticCell) and ProkaryoticCell, with ProkaryoticCell and EukaryoticCell marked disjoint]

Page 33: Annotation Systems & Implementation Issues - Suzanna Lewis

Reasoning is critical

Solution: clarify spore

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006

[Figure: Spore replaced by Actinomycete Type Spore and Mycetozoa Type Spore; ProkaryoticCell and EukaryoticCell remain disjoint, with FungalCell under EukaryoticCell]
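The spore problem and its fix can be checked with a toy consistency test. This is a sketch, not an OWL reasoner: it only follows is_a links and tests declared disjoint pairs, and the parent chosen for MycetozoaTypeSpore is an assumption about the figure.

```python
def ancestors(cls, is_a):
    """All classes reachable from `cls` through is_a links."""
    seen, stack = set(), [cls]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsatisfiable(is_a, disjoint_pairs):
    """Classes that fall under both members of some disjoint pair."""
    bad = []
    for cls in is_a:
        lineage = ancestors(cls, is_a) | {cls}
        if any(a in lineage and b in lineage for a, b in disjoint_pairs):
            bad.append(cls)
    return sorted(bad)


DISJOINT = [("ProkaryoticCell", "EukaryoticCell")]

before = {
    "FungalCell": ["EukaryoticCell"],
    "Spore": ["FungalCell", "ProkaryoticCell"],
}
after = {
    "FungalCell": ["EukaryoticCell"],
    "ActinomyceteTypeSpore": ["ProkaryoticCell"],
    "MycetozoaTypeSpore": ["EukaryoticCell"],  # assumed parent
}

print(unsatisfiable(before, DISJOINT))  # Spore can have no instances
print(unsatisfiable(after, DISJOINT))   # nothing flagged once spore is split
```

A class under two disjoint parents can never have an instance, so a reasoner reports it as unsatisfiable; splitting the ambiguous term into two precise ones removes the contradiction.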

Page 34: Annotation Systems & Implementation Issues - Suzanna Lewis

Fifth Rule: Intelligibility of Definitions

The terms used in a definition should be simpler (more intelligible) than the term to be defined; otherwise the definition provides no assistance to human understanding or machine processing

Page 35: Annotation Systems & Implementation Issues - Suzanna Lewis

Sixth Rule: Keep it Real
When building or maintaining an ontology, always think carefully about how classes (types, kinds, species) relate to instances in reality

Page 36: Annotation Systems & Implementation Issues - Suzanna Lewis

The Rules
1. Univocity: Terms should have the same meanings on every occasion of use
2. Positivity: Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes.
3. Objectivity: Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.
4. Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level
5. Intelligibility of Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined
6. Basis in Reality: When building or maintaining an ontology, always think carefully about how classes relate to instances in reality
7. Distinguish Universals and Instances

Page 37: Annotation Systems & Implementation Issues - Suzanna Lewis

How to best describe biology?

Natural Language
+ Large existing body of information
+ Highly expressive
- Ambiguous (making it difficult and unreliable to compute on)

Computable Ontology
+ Logical
+ Precise
- Less expressive

Page 38: Annotation Systems & Implementation Issues - Suzanna Lewis

ONTOLOGIES AND BIOLOGY

Without rigor, we won’t know what we know, or where to find it, or what we can infer from it.

Page 40: Annotation Systems & Implementation Issues - Suzanna Lewis

GENOME ANNOTATION

Apollo

Page 41: Annotation Systems & Implementation Issues - Suzanna Lewis

Once a genome is sequenced…

What are the parts? (sequence features)
• Protein coding genes (coding sequence)
• Non coding RNAs (rRNA, snoRNA, tRNA, microRNA, antisense RNA)
• Promoters and regulatory regions
• Transposons
• Recombination hotspots, origins of replication
• Centromeres & telomeres
• …

Page 42: Annotation Systems & Implementation Issues - Suzanna Lewis

ComputeCrawler

RepeatMasker, Genscan, FgenesH, Grail, Blast, Sim4, Genewise, Lap

CGTGTGCGCAGGGGGATATGCGGCGCATATTGTGTTGAAGAGATGCGCTGCATTTCGCGATGCCGATTAGGNCACAGGGAA

Page 43: Annotation Systems & Implementation Issues - Suzanna Lewis

DNA on a linear coordinate

Little boxes
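The "little boxes" view treats each sequence feature as an interval on the linear coordinate of the DNA. The sketch below assumes 1-based inclusive coordinates, as GFF3 uses; the Feature class, the helper name, and the sample features are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    seqid: str         # scaffold or chromosome name
    feature_type: str  # e.g. gene, exon
    start: int         # 1-based, inclusive
    end: int           # 1-based, inclusive
    strand: str        # "+", "-", or "."

    def overlaps(self, other: "Feature") -> bool:
        """True when both features are on the same sequence and share a base."""
        return (self.seqid == other.seqid
                and self.start <= other.end
                and other.start <= self.end)


def features_in_region(features, seqid, start, end):
    """Return the features that overlap a query window."""
    window = Feature(seqid, "region", start, end, ".")
    return [f for f in features if f.overlaps(window)]


# Hypothetical features, not taken from any real genome.
features = [
    Feature("chr1", "gene", 1000, 5000, "+"),
    Feature("chr1", "exon", 1000, 1500, "+"),
    Feature("chr1", "exon", 3000, 3400, "+"),
    Feature("chr1", "transposable_element", 7000, 8200, "-"),
]

for hit in features_in_region(features, "chr1", 1400, 3100):
    print(hit.feature_type, hit.start, hit.end)
```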

Page 44: Annotation Systems & Implementation Issues - Suzanna Lewis

de novo predictions

Page 45: Annotation Systems & Implementation Issues - Suzanna Lewis

protein alignments

Page 46: Annotation Systems & Implementation Issues - Suzanna Lewis

transcript alignments
full length cDNAs

Page 47: Annotation Systems & Implementation Issues - Suzanna Lewis

APOLLO: annotation editing environment

BECOMING ACQUAINTED WITH APOLLO
• Color by CDS frame, toggle strands, set color scheme and highlights.
• Upload evidence files (GFF3, BAM, BigWig), add combination and sequence search tracks.
• Query the genome using BLAT.
• Navigation and zoom.
• Search for a gene model or a scaffold.
• Get coordinates and “rubber band” selection for zooming.
• Login
• User-created annotations.
• Annotator panel.
• Evidence Tracks
• Stage and cell-type specific transcription data.

http://genomearchitect.org/web_apollo_user_guide

Page 48: Annotation Systems & Implementation Issues - Suzanna Lewis

Coordinate transforms: Curator ‘ligation’

Page 49: Annotation Systems & Implementation Issues - Suzanna Lewis

Coordinate transforms: intron folding

Page 50: Annotation Systems & Implementation Issues - Suzanna Lewis

Alterations: whether experimental artifacts or natural differences

Substitutions

Page 51: Annotation Systems & Implementation Issues - Suzanna Lewis

Alterations: whether experimental artifacts or natural differences

Insertions

Page 52: Annotation Systems & Implementation Issues - Suzanna Lewis

Alterations: whether experimental artifacts or natural differences

Deletions

Page 53: Annotation Systems & Implementation Issues - Suzanna Lewis

Alterations: whether experimental artifacts or natural differences

Impact
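The three alteration types can be sketched as string edits on a reference sequence. The function names, the 0-based position convention, and the short example sequence are assumptions for illustration; the frame check is only one simple notion of impact and applies only within coding sequence.

```python
def apply_alteration(reference, kind, position, bases="", length=0):
    """Return the altered sequence; `position` is 0-based."""
    if kind == "substitution":
        return reference[:position] + bases + reference[position + len(bases):]
    if kind == "insertion":
        return reference[:position] + bases + reference[position:]
    if kind == "deletion":
        return reference[:position] + reference[position + length:]
    raise ValueError(f"unknown alteration type: {kind}")


def shifts_frame(reference, altered):
    """A length change that is not a multiple of 3 shifts a reading frame."""
    return (len(altered) - len(reference)) % 3 != 0


reference = "AACGTAGCGACG"  # short illustrative sequence

substituted = apply_alteration(reference, "substitution", 6, bases="C")
inserted = apply_alteration(reference, "insertion", 6, bases="ACAC")
deleted = apply_alteration(reference, "deletion", 6, length=2)

for label, seq in [("substitution", substituted), ("insertion", inserted), ("deletion", deleted)]:
    print(f"{label:13} {seq:18} frameshift={shifts_frame(reference, seq)}")
```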

Page 54: Annotation Systems & Implementation Issues - Suzanna Lewis

APOLLO ON THE WEB: instructions

Username: [email protected]:usernumber

Email              Password          Server
[email protected]  userone           1
[email protected]  usertwo           2
[email protected]  userthree         3
[email protected]  userfour          4
[email protected]  userfive          5
[email protected]  usersix           1
[email protected]  userseven         2
[email protected]  usereight         3
[email protected]  usernine          4
[email protected]  userten           5
[email protected]  usereleven        1
[email protected]  usertwelve        2
[email protected]  userthirteen      3
[email protected]  userfourteen      4
[email protected]  userfifteen       5
[email protected]  usersixteen       1
[email protected]  userseventeen     2
[email protected]  usereighteen      3
[email protected]  usernineteen      4
[email protected]  usertwenty        5
[email protected]  usertwentyone     1
[email protected]  usertwentytwo     2
[email protected]  usertwentythree   3
[email protected]  usertwentyfour    4
[email protected]  usertwentyfive    5
[email protected]  usertwentysix     1
[email protected]  usertwentyseven   2
[email protected]  usertwentyeight   3
[email protected]  usertwentynine    4

Server  URL
1       http://ec2-52-63-181-136.ap-southeast-2.compute.amazonaws.com/apollo/
2       http://ec2-52-64-198-214.ap-southeast-2.compute.amazonaws.com/apollo/
3       http://ec2-52-62-166-89.ap-southeast-2.compute.amazonaws.com/apollo/
4       http://ec2-52-64-182-170.ap-southeast-2.compute.amazonaws.com/apollo/
5       http://ec2-52-63-255-136.ap-southeast-2.compute.amazonaws.com/apollo/

Page 55: Annotation Systems & Implementation Issues - Suzanna Lewis

For example: ontologically described genotypes/variants

intrinsic genotype = genomic variation complement + genomic background
Example: apchu745/+; fgf8ati282/ti282(AB)

[Figure: has_part hierarchy linking genomic variation complement, variant single locus complement, variant allele, and sequence alteration, illustrated with apchu745/+; fgf8ati282/ti282, apchu745/+, apchu745, and hu745, drawn over aligned reference and variant sequences such as AACGTAGCGACGCTCGCTACGGGCGTATC and AACGTACCGACGCTCGCTACGGGCGTATC]

Page 57: Annotation Systems & Implementation Issues - Suzanna Lewis

FUNCTIONAL ANNOTATION

Phylogenetic Annotation Inferencing Tool — PAINT

Page 58: Annotation Systems & Implementation Issues - Suzanna Lewis

Evolutionary history is the natural way to organize and analyze biological data

Page 59: Annotation Systems & Implementation Issues - Suzanna Lewis

Ancestral inference
• Integration at points of common ancestry
• Infer “hidden” character of living organisms
• Explicitly leverage evolutionary relationships

[Figure: gene tree with leaves E.c., A.t. MTHFR1, A.t. MTHFR2, D.d., S.p., S.c. MET13, D.m., A.g., S.p., S.c. MET12, C.e., D.r., G.g., H.s. MTHFR, R.n., M.m.; labels: divergence; Biochemistry: purification and assay; Genetics: mutant phenotypes]

Page 60: Annotation Systems & Implementation Issues - Suzanna Lewis

What is transitive annotation?

Related genes have a common function because their common ancestor had that function.

Not just an inference about one gene. It is also an inference for
• The most recent common ancestor (MRCA)
• Continuous inheritance since the MRCA
• Potential inheritance by other descendants of the MRCA

[Figure: the gene in yeast, mouse, zebrafish, and human descending from the gene in the opisthokont MRCA, each labeled with Function X]

Page 61: Annotation Systems & Implementation Issues - Suzanna Lewis

• Green indicates experimental
• Black dot indicates direct experimental data.
• White dot indicates a more general functional class inferred from ontology
• Red indicates NOT function for the gene

All nodes have persistent identifiers which are retained across different builds of the protein family trees.

[Figure: protein family tree with nodes labeled cholinesterase and carboxylic ester hydrolase; evolutionary event types: duplication, speciation]

Page 62: Annotation Systems & Implementation Issues - Suzanna Lewis

PAINTed nodes: 3 steps carried out by curator
• Gain & loss of function
• Inferred By Descendants: experimental annotations provide evidence
• Inferred by Ancestry: propagation to unannotated leaves

[Figure labels: carboxylic ester hydrolase; node with loss of function; node with gain of function (cholinesterase)]

Gaudet, P., et al. (2011). Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium. Briefings in Bioinformatics, 12(5), 449–62. doi:10.1093/bib/bbr042
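The propagation step can be sketched on a toy tree. This is an illustration under stated assumptions, not PAINT's implementation: the tree, the gene names, and the propagate helper are hypothetical, and the curator's decision to annotate the ancestral node is taken as given.

```python
# parent -> children; node and gene names are hypothetical.
TREE = {
    "root": ["anc1", "geneD"],
    "anc1": ["geneA", "anc2"],
    "anc2": ["geneB", "geneC"],
}

# Leaves that already carry experimental annotations.
EXPERIMENTAL = {
    "geneA": {"carboxylic ester hydrolase"},
    "geneB": {"carboxylic ester hydrolase"},
}


def propagate(tree, node, function, loss_nodes, experimental, inferred=None):
    """Inferred by Ancestry: pass a function from an annotated ancestor down to
    its leaves, stopping at any node where the curator marked a loss."""
    if inferred is None:
        inferred = {}
    if node in loss_nodes:
        return inferred  # the function is not inherited below a loss
    children = tree.get(node, [])
    if not children and function not in experimental.get(node, set()):
        inferred[node] = function  # an unannotated leaf gains an inferred annotation
    for child in children:
        propagate(tree, child, function, loss_nodes, experimental, inferred)
    return inferred


# Experimental data in geneA and geneB support annotating their ancestor anc1
# (Inferred By Descendants); the function then flows to its other descendants.
print(propagate(TREE, "anc1", "carboxylic ester hydrolase", set(), EXPERIMENTAL))
print(propagate(TREE, "anc1", "carboxylic ester hydrolase", {"anc2"}, EXPERIMENTAL))
```

In the first call only geneC receives an inferred annotation, since geneD lies outside the annotated clade and geneA and geneB already have experimental evidence. Marking anc2 as a loss node blocks the inference for geneC.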

Page 64: Annotation Systems & Implementation Issues - Suzanna Lewis

http://questfororthologs.org/

Page 65: Annotation Systems & Implementation Issues - Suzanna Lewis

FUNCTIONAL ANNOTATION

Noctua for Building Models of Biology

Page 66: Annotation Systems & Implementation Issues - Suzanna Lewis

Motivation: multi-scale knowledge models of mechanistic biology

Bai, J. P. F., & Abernethy, D. R. (n.d.). Systems Pharmacology to Predict Drug Toxicity: Integration Across Levels of Biological Organization, 451–473. doi:10.1146/annurev-pharmtox-011112-140248

Page 70: Annotation Systems & Implementation Issues - Suzanna Lewis

A data model for causal ontology annotations: “LEGO”

Activity: GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon

Relationship: RO:nnnnnnn

Activity: GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon

Evidence: ECO, SEPIO
Source: PMID, ORCID, ...

Page 71: Annotation Systems & Implementation Issues - Suzanna Lewis

A data model for causal ontology annotations: “LEGO”

Process: GO:nnnnnnn

Activity: GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon

Relationship: RO:nnnnnnn

Activity: GO:nnnnnnn
What: <molecule>
Where: GO/CL/Uberon

Page 73: Annotation Systems & Implementation Issues - Suzanna Lewis

A data model for causal ontology annotations: “LEGO”

Process: Exit from mitosis GO:0010458

Activity: GTPase activity GO:0003924
What: TEM1 S000004529
Where: spindle pole GO:0000922

Activity: GTPase inhibitor activity GO:0005095
What: BFA1 S000003814
Where: spindle pole GO:0000922
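The LEGO building blocks could be represented roughly as follows. This is a sketch, not Noctua's actual schema: the class and field names are assumptions, both activities are assumed to be part of the process shown, and the relation is left as the RO:nnnnnnn placeholder because the slide does not name it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Activity:
    function: str                  # GO molecular function ID
    label: str
    enabled_by: str                # "What": the molecule carrying out the activity
    occurs_in: str                 # "Where": GO/CL/Uberon term
    part_of: Optional[str] = None  # GO biological process, if stated


@dataclass
class CausalEdge:
    upstream: Activity
    relation: str                  # Relations Ontology (RO) ID
    downstream: Activity
    evidence: List[str] = field(default_factory=list)  # ECO/SEPIO terms
    source: List[str] = field(default_factory=list)    # PMID, ORCID, ...


EXIT_FROM_MITOSIS = "GO:0010458"
SPINDLE_POLE = "GO:0000922"

bfa1 = Activity("GO:0005095", "GTPase inhibitor activity", "S000003814",
                SPINDLE_POLE, part_of=EXIT_FROM_MITOSIS)
tem1 = Activity("GO:0003924", "GTPase activity", "S000004529",
                SPINDLE_POLE, part_of=EXIT_FROM_MITOSIS)

# Placeholder relation ID: the specific RO term is not given in the slides.
edge = CausalEdge(upstream=bfa1, relation="RO:nnnnnnn", downstream=tem1)

print(f"{edge.upstream.label} ({edge.upstream.enabled_by}) "
      f"--{edge.relation}--> {edge.downstream.label} ({edge.downstream.enabled_by})")
```

Keeping evidence and source on the edge, separate from the activities, lets each causal claim carry its own provenance.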

Page 74: Annotation Systems & Implementation Issues - Suzanna Lewis

http://noctua.berkeleybop.org/

Building causal models of biology using ontologies
• Collaborative editing!
• RDF/OWL semantic representation
  - Reasoning
  - Linked data
• Gene sets

Page 75: Annotation Systems & Implementation Issues - Suzanna Lewis

Diabetes mockup example

Page 76: Annotation Systems & Implementation Issues - Suzanna Lewis

https://vimeo.com/channels/Noctua