Chibucos annot go_final

GENE ANNOTATION AND ONTOLOGY

Marcus C. Chibucos, Ph.D.

Arabidopsis thaliana ATPaseHMA4 zinc binding domain

GO:0006829 : zinc ion transport (BP)
GO:0005886 : plasma membrane (CC)
GO:0005515 : protein binding (MF)

Annotation

Ontology

Evidence

2

Outline of this talk

Background: the language of biology

Gene Ontology: overview, terms & structure

Annotating with GO and Evidence

Using annotation to facilitate your research

3

About screenshots in this talk

AmiGO web-based ontology browser http://amigo.geneontology.org

OBO-Edit stand-alone editor http://oboedit.org

4

Background: the language of biology

What is annotation? Who is involved?
Term confusion (what’s in a name?)
Scale: the sea of data
Controlled vocabularies & ontologies
The Gene Ontology Consortium

5

Annotation

annotate – to make or furnish critical or explanatory notes or comment.

(Merriam-Webster dictionary)

genome annotation – the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes.

(Lincoln Stein, PMID 11433356)

Gene Ontology annotation – the process of assigning GO terms to gene products… according to two general principles: first, annotations should be attributed to a source; second, each annotation should indicate the evidence on which it is based.

(http://www.geneontology.org)

6

Diverse parties involved

End-users, including various researchers
  Small-scale laboratory projects
  Whole genome sequencing projects

Annotators
  From reading papers to computational analysis

Ontology developers
  Create terms that reflect scientific knowledge
  Make interoperable ontologies, database links

Developers of tools & resources
  Standards for storing & sharing data
  Web interfaces for data analysis & sharing

Many areas of expertise
  Laboratory sciences – biology, chemistry, medicine, and many other disciplines
  Computational science – bioinformatics, genomics, statistics
  Software development & web design
  Philosophy – ontology & logic

7

Term confusion: synonyms

Do biologists use precise & consistent language?
  Mutually understood concepts – DNA, RNA, or protein
  Synonym (one thing known by more than one name) – translation and protein synthesis

Enzyme Commission reactions
  Standardized id, official name & alternative names

http://www.expasy.ch/enzyme/2.7.1.40

8

Term confusion: homonyms

Homonyms common in biology – different things known by the same name
  Sporulation
  Vascular (plant vasculature, i.e. xylem & phloem, or vascular smooth muscle, i.e. blood vessels?)

“Sporulation”

Endospore formation, Bacillus anthracis
http://www.microbelibrary.org/ASMOnly/details.asp?id=1426&Lang= © L Stauffer 2003 (accessed 17-Sep-09)

Reproductive sporulation: asci & ascospores, Morchella elata (morel)
http://en.wikipedia.org/wiki/File:Morelasci.jpg © PG Warner 2008 (accessed 17-Sep-09)

9

Term confusion: homonyms and biological complexity

AmiGO query “vascular” returns 51 terms
In biology, many related phenomena are described with similar terminology

10

The problem of scale

Enormous data sets
◦ Microarray experiments
◦ Whole genome sequencing projects
◦ Comparative genomics of multiple diverse taxa

Computers don’t understand nuance
◦ Millions of proteins to annotate
◦ How to effectively search?
◦ How to draw meaningful comparisons?

http://en.wikipedia.org/wiki/File:Microarray2.gif (accessed 17-Sep-09)

Small data sets, small experiments & isolated scientific communities?

11

The Gene Ontology (GO)

A way to address the problems of synonyms, homonyms, biological complexity, and the increasing glut of data

GO provides a common biological language for protein functional annotation

www.geneontology.org

12

Controlled vocabulary (CV)

An official list of precisely defined terms that can be used to classify information and facilitate its retrieval
  Think of flat list like a thesaurus or catalog

Benefits of CVs
  Allow standardized descriptions of things
  Remedy synonym & homonym issues
  Can be cross-referenced externally
  Facilitate electronic searching

http://www.nlm.nih.gov/nichsr/hta101/ta101014.html

A CV can be “…used to index and retrieve a body of literature in a bibliographic, factual, or other database. An example is the MeSH controlled vocabulary used in MEDLINE and other MEDLARS databases of the NLM.”

13

Ontology is a type of CV with defined relationships

GO terms describe biological attributes of gene products…

Ontology – formalizes knowledge of a subject with precise textual definitions

Networked terms where child is more specific (“granular”) than parent

(Figure: network of terms, labeled from “Less specific” to “More granular”.)

14

How GO works

GO Consortium develops & maintains:
  Ontologies and cross-links between ontologies and different resources
  Tools to develop and use the ontologies
  SourceForge tracker for development

People studying organisms at databases annotate gene products with GO terms

Groups share files of annotation data about their respective organisms

Because a common language was used to describe gene products and this information was shared amongst databases…
  We can search uniformly across databases
  Do comparative genomics of diverse taxa

15

GO on SourceForge
sourceforge.net/projects/geneontology

16

The Gene Ontology Consortium

ZFIN

Reactome IGS

Collaboration began 1998 among model organism databases mouse (MGI), fruit fly (FlyBase) and baker’s yeast (SGD)
  Michael Ashburner of FlyBase contributed the base vocabulary
  Today > 20 members & associates

First publication 2000 (PMID 10802651)
  Today, PubMed query “gene ontology” yields 3,347 papers (27-Jun-2011)

Organisms represented by GO annotations from every kingdom of life

Many groups use GO in many different ways for their research

Among eight OBO-Foundry ontologies

17

OBO Foundry ontologies
www.obofoundry.org

Collaboration among developers of science-based ontologies

Establish principles for ontology development
  Goal of creating a suite of orthogonal interoperable reference ontologies in the biomedical domain.

many others…

18

Gene Ontology: overview, terms & structure

What the GO is not
GO comprises three ontologies
Anatomy & storage of GO terms
Ontology structure
Detail of a term in AmiGO
True path rule

19

Caveats – what GO is not

Not gene naming system or gene catalog
  GO describes attributes of biological objects – “oxidoreductase activity” not “cytochrome c”

The three ontologies have limitations
  No sequence attributes or structural features
  No characteristics unique to mutants or disease
  No environment, evolution or expression
  No anatomy features above cellular component

Not dictated standard or federated solution
  Databases share annotations as they see fit
  Curators evaluate differently

GO is evolving as our knowledge evolves
  New terms added on daily basis
  Incorrect/poorly defined terms made obsolete
  Secondary ids – terms with same meaning merged

20

GO comprises three ontologies

Cellular component ontology (CC)
  “cytoplasm”

Molecular function ontology (MF)
  “protein binding”
  “peptidase activity”
  “cysteine-type endopeptidase activity”

Biological process ontology (BP)
  “proteolysis”
  “apoptosis”

Terms describe attributes of gene products (GPs)
  Any protein or RNA encoded by a gene
  Species-independent context, e.g. “ribosome”
  Could describe GPs found in limited taxa, e.g. “photosynthesis” or “lactation”

One GP can be associated with ≥ 1 CC, BP, MF
  Example: Caspase-6 from Bos taurus

21

Cellular component ontology

Describes location at level of subcellular structure & macromolecular complex

GP subcomponent of or located in particular cellular component, with some exceptions:
  No individual proteins or nucleic acids
  No multicellular anatomical terms
  For annotation purposes, a GP can be associated with or located in ≥ one cellular component

Anatomical structure
  rough endoplasmic reticulum
  nucleus
  nuclear inner membrane

Multi-subunit enzyme or protein complex
  ribosome
  proteasome
  ubiquitin ligase complex

22

Molecular function ontology

Describes gene product activity at molecular level
  Describes attributes of entities

Adenylate cyclase (E.C. 4.6.1.1)
Catalyzes a specific reaction: ATP = 3',5'-cyclic AMP + diphosphate
Described by the Gene Ontology term: “adenylate cyclase activity” (GO:0004016)
http://www.ebi.ac.uk/pdbsum/1ab8 [accessed 4-Feb-2010]

Usually single GP, sometimes a complex
  “ferritin receptor activity”
  Definition: “combining with ferritin, an iron-storing protein complex, to initiate a change in cell activity”

Broad functions
  “catalytic activity”
  “transporter activity”
  “binding”

Specific functions
  “adenylate cyclase activity”
  “protein-DNA complex transmembrane transporter activity”
  “Fc-gamma receptor I complex binding”

23

Biological process ontology

Describes recognized series of events or molecular functions with a defined beginning and end

“GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway” (from GO documentation)

Mutant phenotypes often reflect disruptions in BP

Specific process
  “pyrimidine metabolism”
  “α-glucosidase transport”

Broad process
  “cellular physiological process”
  “signal transduction”

http://www.geneontology.org/GO.process.guidelines.shtml

General considerations
The Cell Cycle
The Development Node
Multi-Organism Process
Metabolism
Regulation
Detection of and Response to Stimuli
Sensory Perception
Signaling Pathways
Transport and Localization
Transporter activity (molecular function)
Other Misc. Standard Defs

24

Anatomy of a GO term

Term name

goid (unique numerical identifier)

Precise textual definition with reference stating source

Synonyms (broad or narrow) for searching, alternative names, misspellings…

GO slim

Ontology placement

25

Storage and cross referencing of GO terms

Storage in flat file (text)

Database cross reference for mappings to GO
  GO term identical to object in other database
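As an illustration of what text storage can look like, the sketch below parses a single term stanza written in the OBO flat-file style (a `[Term]` header followed by `tag: value` lines). The stanza is a hand-written, abbreviated example assembled from term details shown elsewhere in this talk, not a copy of the actual GO file, and the parser handles only simple tag lines; treat it as a minimal sketch rather than a full OBO reader.

```python
# Minimal sketch: parse one OBO-style [Term] stanza into a dict of lists.
# STANZA is an abbreviated illustration, not copied from the real GO file.
from collections import defaultdict

STANZA = """\
[Term]
id: GO:0015171
name: amino acid transmembrane transporter activity
namespace: molecular_function
is_a: GO:0005275 ! amine transmembrane transporter activity
is_a: GO:0046943 ! carboxylic acid transmembrane transporter activity
"""

def parse_term(stanza):
    """Return {tag: [values]} for a single [Term] stanza."""
    term = defaultdict(list)
    for line in stanza.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        value = value.split(" ! ")[0]  # drop the trailing "! comment" part
        term[tag].append(value)
    return dict(term)

term = parse_term(STANZA)
print(term["id"][0], term["name"][0])
print("parents:", term["is_a"])
```

Storing each tag as a list keeps repeated tags such as `is_a`, which is how a term with more than one parent is recorded.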

26

Ontology structure: parent-child relationship

Parent term (broader)

Child term (specialized)

hexose biosynthesis

hexose metabolism

monosaccharide biosynthesis

Up in the tree is more general; down in the tree is more specific:

Annotation of genes
  Start with terms denoting broad functional categories
  Use more specific term as knowledge warrants

27

Ontology structure: terms arranged in DAGs

GO terms structured as hierarchical-like directed acyclic graphs (DAGs)
  Tree-like, but each term can have more than one parent (pseudo-hierarchy)

Each term may have one or more child terms (“siblings” share same parent)

(Figure labels: parents, “siblings”, child terms, child term, parent)

28

GO has three term relationships

is_a - child is subtype of parent (“A is_a B”)
  Class-subclass relationship

part_of - child part of parent (“C part_of D”)
  When C present, part of D; but C not always present
  Nucleus always part_of cell; not all cells have nuclei

regulates
  Child term regulates parent term

(Zoomed in view of biological process ontology depicted here.)

29

AmiGO for viewing terms

Open source HTML-based application developed by the GO Consortium

Interface for browsing, querying and visualizing OBO data
  Users can search GO terms or annotations

Available via website or download for local install
  http://amigo.geneontology.org

Example query with keyword “hemolysis” or goid GO:0019836

30

AmiGO search results

Click

31

Term information in AmiGO

Webpage continues…

32

AmiGO view continued

Our term is much further down…

Number of gene products in GO annotation collection annotated to that term or one of its child terms

Relationship between term and its parent

Several informative views

Click

33

Graph view

Alternative view of network of terms

34

A term with two parents

(Figure: generic amino acid, with amine group and carboxylic acid group labeled.)

• Name: amino acid transmembrane transporter activity
• ID number: GO:0015171
• Definition: Catalysis of the transfer of amino acids from one side of a membrane to the other. Amino acids are organic molecules that contain an amino group and a carboxyl group. [source: GOC:ai, GOC:mtg_transport, ISBN:0815340729]
• parent term: amine transmembrane transporter activity (GO:0005275)
• relationship to parent: “is_a”
• parent term: carboxylic acid transmembrane transporter activity (GO:0046943)
• relationship to parent: “is_a”
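Because a term can have two parents, there can be more than one route from it to the root. The sketch below enumerates those routes for the term above. The two direct parents come from this slide; everything above them is collapsed into a single assumed step ("transporter activity" then the root), so the graph is a simplified illustration and not the real ontology.

```python
# Hypothetical, abbreviated is_a graph: child -> list of parents.
# Only the two direct parents are taken from the slide; the upper levels
# are collapsed for illustration.
PARENTS = {
    "amino acid transmembrane transporter activity": [
        "amine transmembrane transporter activity",
        "carboxylic acid transmembrane transporter activity",
    ],
    "amine transmembrane transporter activity": ["transporter activity"],
    "carboxylic acid transmembrane transporter activity": ["transporter activity"],
    "transporter activity": ["molecular_function"],
    "molecular_function": [],
}

def paths_to_root(term, parents):
    """Return every path from term up to a root (a term with no parents)."""
    if not parents[term]:
        return [[term]]
    return [[term] + rest
            for parent in parents[term]
            for rest in paths_to_root(parent, parents)]

paths = paths_to_root("amino acid transmembrane transporter activity", PARENTS)
for path in paths:
    print(" -> ".join(path))
```

The recursion terminates because the graph is acyclic, which is the "A" in DAG.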

35

Multiple paths to root: graphical view in OBO-Edit

36

“True path rule”

The pathway from a term all the way up to its top-level parent(s) must always be true for any gene product that could be annotated to that term (“if true for the child, then true for the parent”)

Incorrect for Bacteria:
  cell > organelle > mitochondrion > proton-transporting ATP synthase complex

Correct for Bacteria (and Eukaryotes):
  cell > intracellular > proton-transporting ATP synthase complex > plasma membrane proton-transporting ATP synthase complex
  cell > intracellular > proton-transporting ATP synthase complex > mitochondrial proton-transporting ATP synthase complex
  membrane > plasma membrane > plasma membrane proton-transporting ATP synthase complex
  organelle > mitochondrion > mitochondrial inner membrane > mitochondrial proton-transporting ATP synthase complex

(Abbreviated versions of the actual trees)
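The rule can be checked mechanically: collect every ancestor of a term and ask whether any of them cannot be true for the organism. The sketch below does this for the abbreviated trees on this slide. The parent links are one reading of the slide's layout, and the set of components "absent in bacteria" is an assumption made for illustration.

```python
# One reading of the slide's abbreviated trees: child -> list of parents.
OLD_TREE = {
    "proton-transporting ATP synthase complex": ["mitochondrion"],
    "mitochondrion": ["organelle"],
    "organelle": ["cell"],
    "cell": [],
}
NEW_TREE = {
    "plasma membrane proton-transporting ATP synthase complex": [
        "proton-transporting ATP synthase complex", "plasma membrane"],
    "mitochondrial proton-transporting ATP synthase complex": [
        "proton-transporting ATP synthase complex", "mitochondrial inner membrane"],
    "proton-transporting ATP synthase complex": ["intracellular"],
    "intracellular": ["cell"],
    "plasma membrane": ["membrane"],
    "mitochondrial inner membrane": ["mitochondrion"],
    "mitochondrion": ["organelle"],
    "organelle": [],
    "membrane": [],
    "cell": [],
}
# Assumption for illustration: components a bacterium does not have.
ABSENT_IN_BACTERIA = {"organelle", "mitochondrion", "mitochondrial inner membrane"}

def ancestors(term, parents):
    """Return the set of all terms reachable by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def violates_true_path(term, parents, absent):
    """True if annotating to term would imply an ancestor the organism lacks."""
    return bool(ancestors(term, parents) & absent)

old = violates_true_path("proton-transporting ATP synthase complex",
                         OLD_TREE, ABSENT_IN_BACTERIA)
new = violates_true_path("plasma membrane proton-transporting ATP synthase complex",
                         NEW_TREE, ABSENT_IN_BACTERIA)
print("old tree violates rule for bacteria:", old)
print("new tree violates rule for bacteria:", new)
```

In the restructured tree a bacterial protein can be annotated to the plasma membrane complex without implying a mitochondrion, while the mitochondrial term remains available for eukaryotes.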

Annotating with GO and Evidence

What is GO annotation?
Literature curation at model organism databases
The annotation file
Evidence – critical for annotation
Sequence similarity-based annotation
Annotation specificity

37

38

GO annotation overview

Associating a GO term with a gene product
  Goal is to select GO terms from all three ontologies to represent what, where, and how

Linking a GO term to a gene product asserts that it has that attribute

For example, 6-phosphofructokinase
  Molecular function: GO:0003872 6-phosphofructokinase activity
  Biological process: GO:0006096 glycolysis
  Cellular component: GO:0005737 cytoplasm

Annotation, whether based on literature or computational methods, always involves:
  Learning something about a gene product
  Selecting an appropriate GO term
  Providing an appropriate evidence code
  Citing a [preferably open access] reference
  Entering information into GO annotation file

39

Chaperone DnaK, one protein/multiple annotations

Molecular function
  ATP binding (GO:0005524)
  ATPase activity (GO:0016887)
  unfolded protein binding (GO:0051082)
  misfolded protein binding (GO:0051787)
  denatured protein binding (GO:0031249)

Biological process
  protein folding (GO:0006457)
  protein refolding (GO:0042026)
  protein stabilization (GO:0050821)
  response to stress (GO:0006950)

Cellular component
  cytoplasm (GO:0005737)

40

Literature curation performed at model organism databases

From the abstract:

41

Results section indicates a “direct assay” annotation

They document the findings of a direct assay performed on purified protein:

They further document the methods used, and evaluate the findings in the Discussion section…

42

Query AmiGO with “DNA ligase” & “DNA ligation”

All “ligation” in biological process ontology

43

Resulting annotations

GO id      | term name           | aspect             | ev. code | reference     | with
GO:0003909 | DNA ligase activity | molecular function | IDA      | PMID:17705817 | N/A
GO:0006266 | DNA ligation        | biological process | IDA      | PMID:17705817 | N/A
GO:0005737 | cytoplasm           | cellular component | IC       | PMID:17705817 | GO:0003909

Name: DNA ligase (stated in paper)
Gene symbol: ligA (stated in paper)
EC: 6.5.1.2 (queried enzyme for “DNA ligase”)

44

Gene annotation file captures annotations
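A rough sketch of reading one line of such a file follows. It assumes the tab-delimited Gene Association File (GAF) layout with 17 columns; the column names are my reading of the GAF 2 specification and should be checked against current GO documentation. The example line reuses the DNA ligase annotation from the previous slide, but the database name, object id, taxon and date are placeholders invented for illustration.

```python
# Column names assumed from the GAF 2 layout (verify against GO documentation).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_or_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]

def parse_gaf_line(line):
    """Split one tab-delimited annotation line into a dict keyed by column."""
    fields = line.rstrip("\n").split("\t")
    # Pad optional trailing columns that were left off the line.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    return dict(zip(GAF_COLUMNS, fields))

# Placeholder values: EXAMPLE_DB, EX0001, taxon:0 and the date are hypothetical.
example = "\t".join([
    "EXAMPLE_DB", "EX0001", "ligA", "", "GO:0003909",
    "PMID:17705817", "IDA", "", "F",
    "DNA ligase", "", "protein",
    "taxon:0", "20110627", "EXAMPLE_DB",
])

record = parse_gaf_line(example)
print(record["db_object_symbol"], record["go_id"],
      record["evidence_code"], record["db_reference"])
```

Each line carries the pieces listed earlier in the talk: the gene product, the GO term, the evidence code and the reference.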

Evidence

45

Evidence

Essential to base annotation on evidence
  Conclusions more robust and traceable
  With evidence, a GO annotation is standard operating procedure (SOP)-independent

Many types of evidence exist
  For example, experiment described in literature
    What method (e.g. direct assay, mutant phenotype, et cetera) was used?
    Did author cite references?
    Did author provide details of analyses?
  Perhaps you used a sequence-based method
    What were the methods of manual curation?
    Give accession numbers of similar sequences
    Provide any references describing methods

Controlled vocabularies help here, too!

46

GO standard references

GO_REF:0000011 A Hidden Markov Model (HMM) is a statistical representation of patterns found in a data set. When using HMMs with proteins, the HMM is a statistical model of the patterns of the amino acids found in a multiple alignment of a set of proteins called the "seed". Seed proteins are chosen based on sequence similarity to each other. Seed members can be chosen with different levels of relationship to each other. They can be members of a superfamily (ex. ABC transporter, ATP-binding proteins), they can all share the same exact specific function (ex. biotin synthase) or they could share another type of relationship of intermediate specificity (ex. subfamily, domain). New proteins can be scored against the model generated from the seed according to how closely the patterns of amino acids in the new proteins match those in the seed. There are two scores assigned to the HMM which allow annotators to judge how well any new protein scores to the model. Proteins scoring above the "trusted cutoff" score can be assumed to be part of the group defined by the seed. Proteins scoring below the "noise cutoff" score can be assumed to NOT be a part of the group. Proteins scoring between the trusted and noise cutoffs may be part of the group but may not. One of the important features of HMMs is that they are built from a multiple alignment of protein sequences, not a pairwise alignment. This is significant, since shared similarity between many proteins is much more likely to indicate shared functional relationship than sequence similarity between just two proteins. The usefulness of an HMM is directly related to the amount of care that is taken in choosing the seed members, building a good multiple alignment of the seed members, assessing the level of specificity of the model, and choosing the cutoff scores correctly.
In order to properly assess what functional relevance an above-trusted scoring HMM match has to a query, one must carefully determine what the functional scope of the HMM is. If the HMM models proteins that all share the same function then it is likely possible to assign a specific function to high-scoring match proteins based on the HMM. If the HMM models proteins that have a wide variety of functions, then it will not be possible to assign a specific function to the query based on the HMM match, however, depending on the nature of the HMM in question, it may be possible to assign a more general (family or subfamily level) function. In order to determine the functional scope of an HMM, one must carefully read the documentation associated with the HMM. The annotator must also consider whether the function attributed to the proteins in the HMM makes sense for the query based on what is known about the organism in which the query protein resides and in light of any other information that might be available about the query protein. After carefully considering all of these issues the annotator makes an annotation.
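The cutoff logic described in this reference can be sketched as a three-way decision. The function name, labels and scores below are hypothetical, and treating a score exactly equal to the trusted cutoff as a member is an assumption, since the text only says "above" and "below".

```python
def classify_hmm_hit(score, trusted_cutoff, noise_cutoff):
    """Place an HMM score relative to the model's trusted and noise cutoffs."""
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    if score >= trusted_cutoff:   # boundary handling is an assumption
        return "member"           # can be assumed part of the seed's group
    if score < noise_cutoff:
        return "non-member"       # can be assumed NOT part of the group
    return "uncertain"            # between cutoffs: may or may not belong

# Hypothetical scores against a model with trusted=200.0 and noise=100.0.
for score in (250.0, 150.0, 50.0):
    print(score, classify_hmm_hit(score, trusted_cutoff=200.0, noise_cutoff=100.0))
```

A "member" result only says the protein belongs to the group the model describes; how specific a function can be assigned still depends on the functional scope of the HMM, as the reference explains.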


47

GO evidence codes
www.geneontology.org/GO.evidence.shtml

EXP - inferred from experiment
IDA - inferred from direct assay
IEP - inferred from expression pattern
IGI - inferred from genetic interaction
IPI - inferred from physical interaction
IMP - inferred from mutant phenotype

ISS - inferred from sequence or structural similarity
ISA - inferred from sequence alignment
ISO - inferred from sequence orthology
ISM - inferred from sequence model

IGC - inferred from genomic context
ND - no biological data available
IC - inferred by curator
TAS - traceable author statement
NAS - non-traceable author statement
IEA - inferred from electronic annotation

GO codes are a subset of yet another ontology!
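Because evidence codes are a controlled vocabulary, annotations can be filtered by the kind of support behind them. The sketch below groups the codes listed on this slide into broad categories and keeps only experimentally supported annotations. The grouping follows my understanding of the GO evidence documentation and is an assumption here; the annotation pairs are taken from the example tables in this talk.

```python
# Broad grouping of the evidence codes on this slide (assumed categories).
EVIDENCE_GROUP = {
    "EXP": "experimental", "IDA": "experimental", "IEP": "experimental",
    "IGI": "experimental", "IPI": "experimental", "IMP": "experimental",
    "ISS": "computational", "ISA": "computational", "ISO": "computational",
    "ISM": "computational", "IGC": "computational",
    "TAS": "author statement", "NAS": "author statement",
    "IC": "curator statement", "ND": "curator statement",
    "IEA": "electronic",
}

# (GO id, evidence code) pairs from the example annotations in this talk.
annotations = [
    ("GO:0003909", "IDA"),
    ("GO:0006266", "IDA"),
    ("GO:0004807", "ISS"),
    ("GO:0006096", "IGC"),
    ("GO:0005737", "IC"),
]

def with_group(annotations, group):
    """Keep annotations whose evidence code falls in the given group."""
    return [(go_id, code) for go_id, code in annotations
            if EVIDENCE_GROUP[code] == group]

experimental = with_group(annotations, "experimental")
print(experimental)
```

The same filter could be used to exclude IEA annotations, which have not been reviewed by a curator.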

48

Types of sequence similarity-based annotations

Find similarity between gene product & one that is experimentally characterized
  BLAST-type alignments
  Shared synteny to establish orthology of genomic regions between species

Find similarity between gene product and defined protein family
  HMMs (Pfam, TIGRFAMS)
  Prosite
  InterPro

Find motifs in gene product with prediction tools
  TMHMM
  SignalP

Much (most?) of the information you find is based on transitive annotation and much of it has never been looked at by a human being!

49

Evaluation of sequence similarity-based information

Visually inspect alignments & criteria
  Length & identity
  Conservation of catalytic sites
  Check HMM scores with respect to cutoff

Look at available metabolic analysis
  Pathways, complexes?

Information from neighboring genes
  Gene in an operon (common in prokaryotes) can supplement weak similarity evidence

Sequence characteristics
  Transmembrane regions?
  Signal peptide?
  Known motifs that give a clue to function?
  Paralogous family member

50

An example: HI0678, a protein from H. influenzae…

...high quality alignment to experimentally characterized triosephosphate isomerase from Vibrio marinus

51

further down the page

Information from Swiss-Prot database on experimentally characterized match protein

52

High quality… full-length match, high percent identity (67.8%), conserved active and binding sites (boxed in red).

53


GO id      | term name                           | aspect             | ev code | reference      | with
GO:0004807 | triose-phosphate isomerase activity | molecular function | ISS     | GO_REF:0000012 | Swiss-Prot:P50921
GO:0006096 | glycolysis                          | biological process | IGC     | PMID:15347579  | TIGR_GenProp:GenProp0120
GO:0005737 | cytoplasm                           | cellular component | IC      | GO_REF:0000012 | GO:0004807

name: triosephosphate isomerase

gene symbol: tpiA

EC: 5.3.1.1

(This, and the following annotations, came from the match protein.)

54

KEGG pathway for glycolysis core

55

KEGG pathway for glycolysis core

56


GO id      | term name  | aspect             | ev code | reference      | with
           |            | molecular function | ISS     | GO_REF:0000012 | Swiss-Prot:P50921
GO:0006096 | glycolysis | biological process | IGC     | GO_REF:0000012 | KEGG_PATHWAY:hin00010
GO:0005737 | cytoplasm  | cellular component | IC      | GO_REF:0000012 | GO:0004807

name: triosephosphate isomerase

gene symbol: tpiA

EC: 5.3.1.1

57

And another annotation

GO id      | term name  | aspect             | ev code | reference      | with
           |            | molecular function | ISS     | GO_REF:0000012 | Swiss-Prot:P50921
GO:0006096 | glycolysis | biological process | IGC     | GO_REF:0000012 | KEGG_PATHWAY:hin00010
GO:0005737 | cytoplasm  | cellular component | IC      | GO_REF:0000012 | GO:0004807

The biologist knows that glycolysis takes place in the cytoplasm in bacteria, and so infers a cytoplasmic location for that protein (“inferred by curator” evidence code).

58

Annotation specificity should reflect knowledge

Available evidence for three genes

#1
- good match to an HMM for “kinase”

#2
- good match to an HMM for “kinase”
- a high-quality BER match to an experimentally characterized “glucokinase” AND a “fructokinase”

#3
- good match to an HMM specific for “ribokinase”
- a high-quality BER match to an experimentally characterized ribokinase

GO trees (very abbreviated)

Function
  catalytic activity
    kinase activity
      carbohydrate kinase activity
        ribokinase activity
        glucokinase activity
        fructokinase activity

Process
  metabolism
    carbohydrate metabolism
      monosaccharide metabolism
        hexose metabolism
          glucose metabolism
          fructose metabolism
        pentose metabolism
          ribose metabolism

(Figure labels #1, #2 and #3 mark each gene’s annotation in the two trees.)
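One way to turn "specificity should reflect knowledge" into a procedure is to pick the most specific term that every piece of evidence agrees on. The sketch below applies that idea to the abbreviated function tree for the three genes above. The parent links are a reading of the slide's tree, and the rule itself (discard evidence that is merely a broader ancestor of other evidence, then take the deepest shared ancestor) is an assumed interpretation, not a procedure stated in the talk.

```python
# Abbreviated function tree as read from the slide: term -> parent.
FUNCTION_PARENT = {
    "catalytic activity": None,
    "kinase activity": "catalytic activity",
    "carbohydrate kinase activity": "kinase activity",
    "ribokinase activity": "carbohydrate kinase activity",
    "glucokinase activity": "carbohydrate kinase activity",
    "fructokinase activity": "carbohydrate kinase activity",
}

def lineage(term):
    """Return [term, parent, grandparent, ...] up to the root."""
    path = []
    while term is not None:
        path.append(term)
        term = FUNCTION_PARENT[term]
    return path

def choose_term(supported_terms):
    """Most specific term consistent with every supported term."""
    terms = set(supported_terms)
    # Broader terms that are ancestors of other evidence add no constraint.
    specific = [t for t in terms
                if not any(t != u and t in lineage(u) for u in terms)]
    lineages = [lineage(t) for t in specific]
    # Lineages run child -> root, so the first shared term is the deepest.
    for candidate in lineages[0]:
        if all(candidate in other for other in lineages):
            return candidate

gene1 = choose_term(["kinase activity"])
gene2 = choose_term(["kinase activity", "glucokinase activity", "fructokinase activity"])
gene3 = choose_term(["ribokinase activity", "ribokinase activity"])
print(gene1, gene2, gene3, sep="\n")
```

Under this reading, gene #2 is annotated to the shared parent of glucokinase and fructokinase rather than to either one, because the evidence does not distinguish between them.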

Using annotation to facilitate your research

Using shared annotations
Search for GO terms at databases
Slims for broad classification
GO tools
Working with GO-limited data sets
Summary

59

60

Sharing annotations

Annotation file sent to GO, put in repository
  All these data free to anyone
  Hundreds of thousands of GP annotations

Annotation files all in same format
  Facilitates easy use of data by everyone

Most of your favorite organism databases use these annotation files

61

Searching for GO terms at EuPathDB

62

Ontology slim
www.geneontology.org/GO.slims.shtml

Slim is a distilled (reduced) ontology
  Made by manually pruning low-level terms with an ontology editor
  Selected high-level terms remain

Slims reduce ontology complexity
  Reduce clutter & see general trends
    Microarray experiments
    Comparative whole genome analyses
  Remove irrelevant terms
    Looking at specific taxa, such as yeast or plant

GO offers a script to bin more granular annotations up to higher levels
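The binning idea can be sketched in a few lines: walk each annotated term up the ontology until a slim term is reached, then count. This is not the GO script itself. The process tree is the abbreviated one from the annotation specificity slide, each term is given a single parent for simplicity (real GO terms can have several), and the slim set and gene names are hypothetical.

```python
from collections import Counter

# Abbreviated process tree from earlier in the talk: term -> parent.
PARENT = {
    "metabolism": None,
    "carbohydrate metabolism": "metabolism",
    "monosaccharide metabolism": "carbohydrate metabolism",
    "hexose metabolism": "monosaccharide metabolism",
    "glucose metabolism": "hexose metabolism",
    "fructose metabolism": "hexose metabolism",
    "pentose metabolism": "monosaccharide metabolism",
    "ribose metabolism": "pentose metabolism",
}

# Hypothetical slim: the high-level terms chosen to remain.
SLIM = {"metabolism", "monosaccharide metabolism"}

def nearest_slim_term(term):
    """Walk up from term and return the first ancestor-or-self in the slim."""
    while term is not None:
        if term in SLIM:
            return term
        term = PARENT[term]
    return None  # no slim term on the path to the root

# Hypothetical gene -> granular annotation pairs.
annotations = {
    "geneA": "glucose metabolism",
    "geneB": "fructose metabolism",
    "geneC": "ribose metabolism",
    "geneD": "carbohydrate metabolism",
}

counts = Counter(nearest_slim_term(term) for term in annotations.values())
print(dict(counts))
```

Counting genes per slim term is what makes the whole-genome comparisons on the following slides readable at a glance.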

63

Comparing genomes with a GO slim

MJ Gardner, et al. (2002) Nature 419:498-511

High-level biological process terms used to compare Plasmodium and Saccharomyces

64

GO slim: manual/orthology-based gene annotations

Nucleic Acids Res. 2010 January; 38(Database issue): D420–D427.

65

GO tools
www.geneontology.org/GO.tools.shtml

The real challenge is finding the right one for your needs

For example, statistical representation of GO terms:

http://go.princeton.edu/cgi-bin/GOTermFinder
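Tools of this kind commonly ask whether a GO term appears in a gene list more often than chance would predict, often with a hypergeometric test; whether a particular tool uses exactly this test is something to check in its documentation. The sketch below computes that probability with the standard library only (it needs Python 3.8 or later for `math.comb`). The function name and the numbers are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes without replacement from a
    population of `population` genes, `annotated` of which carry the term."""
    total = comb(population, sample)
    upper = min(sample, annotated)
    tail = sum(comb(annotated, i) * comb(population - annotated, sample - i)
               for i in range(hits, upper + 1))
    return tail / total

# Hypothetical numbers: 10 genes in the genome, 3 annotated to the term,
# a list of 2 genes of interest, 1 of which carries the term.
p = enrichment_p_value(population=10, annotated=3, sample=2, hits=1)
print(round(p, 4))
```

In practice many terms are tested at once, so the raw p-values would need a correction for multiple comparisons before interpretation.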

66

GO & analysis of RNA-seq data

We present GOseq, an application for performing Gene Ontology (GO) analysis on RNA-seq data. GO analysis is widely used to reduce complexity and highlight biological processes in genome-wide expression studies, but standard methods give biased results on RNA-seq data due to over-detection of differential expression for long and highly expressed transcripts. Application of GOseq to a prostate cancer data set shows that GOseq dramatically changes the results, highlighting categories more consistent with the known biology.

Young et al. Genome Biology 2010, 11:R14 http://genomebiology.com/2010/11/2/R14

67

When GO is limited

Food for thought: what happens when we have limited GO (or other) annotation data?

New and interesting genomes often see this problem

68

Comparative analysis of orthologs in syntenic blocks

The more genomes we have at our disposal, the better

Structural rearrangements, absence of intron, gene duplication, intron structure, gene deletion/creation

Nucleic Acids Res. 2010 January; 38(Database issue): D420–D427.

69

Summary GO analyses

GO remedies problems of synonyms & homonyms in biological nomenclature
  Queries based on IDs linked to precise definitions, not less reliable text-matching

GO can help you to:
  Find all genes that share a particular function regardless of sequence
  Do comparisons across any species annotated with GO
  Summarize major classes of genes in a newly sequenced genome
  Characterize expressed genes in a study
  Drive hypotheses to test in the laboratory

GO is not a panacea but it should be a valuable tool in your genomics toolbox

The title slide revisited…

Arabidopsis thaliana ATPaseHMA4 zinc binding domain

GO:0006829 : zinc ion transport (BP)
GO:0005886 : plasma membrane (CC)
GO:0005515 : protein binding (MF)

Annotation

Ontology

Evidence

THANK YOU.

Education
