GENE ANNOTATION AND ONTOLOGY
Marcus C. Chibucos, Ph.D.
Arabidopsis thaliana ATPaseHMA4 zinc binding domain
GO:0006829 : zinc ion transport (BP)
GO:0005886 : plasma membrane (CC)
GO:0005515 : protein binding (MF)
Annotation
Ontology
Evidence
2
Outline of this talk
Background: the language of biology
Gene Ontology: overview, terms & structure
Annotating with GO and Evidence
Using annotation to facilitate your research
3
About screenshots in this talk
AmiGO web-based ontology browser http://amigo.geneontology.org
OBO-Edit stand-alone editor http://oboedit.org
4
What is annotation? Who is involved?
Term confusion (what’s in a name?)
Scale: the sea of data
Controlled vocabularies & ontologies
The Gene Ontology Consortium
Background: the language of biology
5
Annotation
annotate – to make or furnish critical or explanatory notes or comment.
(Merriam-Webster dictionary)
genome annotation – the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes.
(Lincoln Stein, PMID 11433356)
Gene Ontology annotation – the process of assigning GO terms to gene products… according to two general principles: first, annotations should be attributed to a source; second, each annotation should indicate the evidence on which it is based.
(http://www.geneontology.org)
6
Diverse parties involved
End-users, including various researchers
◦ Small-scale laboratory projects
◦ Whole genome sequencing projects
Annotators
◦ From reading papers to computational analysis
Ontology developers
◦ Create terms that reflect scientific knowledge
◦ Make interoperable ontologies, database links
Developers of tools & resources
◦ Standards for storing & sharing data
◦ Web interfaces for data analysis & sharing
Many areas of expertise
◦ Laboratory sciences – biology, chemistry, medicine, and many other disciplines
◦ Computational science – bioinformatics, genomics, statistics
◦ Software development & web design
◦ Philosophy – ontology & logic
7
Term confusion: synonyms
Do biologists use precise & consistent language?
◦ Mutually understood concepts – DNA, RNA, or protein
◦ Synonym (one thing known by more than one name) – translation and protein synthesis
Enzyme Commission reactions
◦ Standardized id, official name & alternative names
http://www.expasy.ch/enzyme/2.7.1.40
8
Term confusion: homonyms
Homonyms common in biology – different things known by the same name
◦ Sporulation
◦ Vascular (plant vasculature, i.e. xylem & phloem, or vascular smooth muscle, i.e. blood vessels?)
“Sporulation”
◦ Endospore formation, Bacillus anthracis
http://www.microbelibrary.org/ASMOnly/details.asp?id=1426&Lang= ©L Stauffer 2003 (accessed 17-Sep-09)
◦ Reproductive sporulation: asci & ascospores, Morchella elata (morel)
http://en.wikipedia.org/wiki/File:Morelasci.jpg ©PG Warner 2008 (accessed 17-Sep-09)
9
Term confusion: homonyms and biological complexity
AmiGO query “vascular”: 51 terms
In biology, many related phenomena are described with similar terminology
10
The problem of scale
Enormous data sets
◦ Microarray experiments
◦ Whole genome sequencing projects
◦ Comparative genomics of multiple diverse taxa
Computers don’t understand nuance
◦ Millions of proteins to annotate
◦ How to effectively search?
◦ How to draw meaningful comparisons?
http://en.wikipedia.org/wiki/File:Microarray2.gif (accessed 17-Sep-09)
Small data sets, small experiments & isolated scientific communities?
11
The Gene Ontology (GO)
Way to address the problems of synonyms, homonyms, biological complexity, increasing glut of data
GO provides a common biological language for protein functional annotation
www.geneontology.org
12
Controlled vocabulary (CV)
An official list of precisely defined terms that can be used to classify information and facilitate its retrieval
◦ Think of flat list like a thesaurus or catalog
Benefits of CVs
◦ Allow standardized descriptions of things
◦ Remedy synonym & homonym issues
◦ Can be cross-referenced externally
◦ Facilitate electronic searching
A CV can be “…used to index and retrieve a body of literature in a bibliographic, factual, or other database. An example is the MeSH controlled vocabulary used in MEDLINE and other MEDLARS databases of the NLM.”
http://www.nlm.nih.gov/nichsr/hta101/ta101014.html
13
Ontology is a type of CV with defined relationships
GO terms describe biological attributes of gene products…
Ontology – formalizes knowledge of a subject with precise textual definitions
Networked terms where child more specific (“granular”) than parent
(Figure labels: less specific; more granular)
14
How GO works
GO Consortium develops & maintains:
◦ Ontologies and cross-links between ontologies and different resources
◦ Tools to develop and use the ontologies
◦ SourceForge tracker for development
People studying organisms at databases annotate gene products with GO terms
Groups share files of annotation data about their respective organisms
Because a common language was used to describe gene products and this information was shared amongst databases…
◦ We can search uniformly across databases
◦ Do comparative genomics of diverse taxa
15
GO on SourceForge
sourceforge.net/projects/geneontology
16
The Gene Ontology Consortium
ZFIN
Reactome IGS
Collaboration began 1998 among model organism databases mouse (MGI), fruit fly (FlyBase) and baker’s yeast (SGD)
◦ Michael Ashburner of FlyBase contributed the base vocabulary
◦ Today > 20 members & associates
First publication 2000 (PMID 10802651)
◦ Today, PubMed query “gene ontology” yields 3,347 papers (27-Jun-2011)
Organisms represented by GO annotations from every kingdom of life
Many groups use GO in many different ways for their research
Among eight OBO-Foundry ontologies
17
OBO Foundry ontologies
www.obofoundry.org
Collaboration among developers of science-based ontologies
Establish principles for ontology development
◦ Goal of creating a suite of orthogonal interoperable reference ontologies in the biomedical domain.
many others…
18
What the GO is not
GO comprises three ontologies
Anatomy & storage of GO terms
Ontology structure
Detail of a term in AmiGO
True path rule
Gene Ontology: overview, terms & structure
19
Caveats – what GO is not
Not gene naming system or gene catalog
◦ GO describes attributes of biological objects – “oxidoreductase activity” not “cytochrome c”
The three ontologies have limitations
◦ No sequence attributes or structural features
◦ No characteristics unique to mutants or disease
◦ No environment, evolution or expression
◦ No anatomy features above cellular component
Not dictated standard or federated solution
◦ Databases share annotations as they see fit
◦ Curators evaluate differently
GO is evolving as our knowledge evolves
◦ New terms added on daily basis
◦ Incorrect/poorly defined terms made obsolete
◦ Secondary ids – terms with same meaning merged
20
GO comprises three ontologies
Cellular component ontology (CC)
◦ “cytoplasm”
Molecular function ontology (MF)
◦ “protein binding”
◦ “peptidase activity”
◦ “cysteine-type endopeptidase activity”
Biological process ontology (BP)
◦ “proteolysis”
◦ “apoptosis”
Terms describe attributes of gene products (GPs)
◦ Any protein or RNA encoded by a gene
◦ Species-independent context, e.g. “ribosome”
◦ Could describe GPs found in limited taxa, e.g. “photosynthesis” or “lactation”
One GP can be associated with ≥ 1 CC, BP, MF
◦ Example: Caspase-6 from Bos taurus
21
Cellular component ontology
Describes location at level of subcellular structure & macromolecular complex
GP subcomponent of or located in particular cellular component, with some exceptions:
◦ No individual proteins or nucleic acids
◦ No multicellular anatomical terms
◦ For annotation purposes, a GP can be associated with or located in ≥ one cellular component
Anatomical structure
◦ rough endoplasmic reticulum
◦ nucleus
◦ nuclear inner membrane
Multi-subunit enzyme or protein complex
◦ ribosome
◦ proteasome
◦ ubiquitin ligase complex
22
Molecular function ontology
Describes gene product activity at molecular level
◦ Describes attributes of entities
Adenylate cyclase (E.C. 4.6.1.1)
◦ Catalyzes a specific reaction: ATP = 3',5'-cyclic AMP + diphosphate
◦ Described by the Gene Ontology term: “adenylate cyclase activity” (GO:0004016)
http://www.ebi.ac.uk/pdbsum/1ab8 [accessed 4-Feb-2010]
Usually single GP, sometimes a complex
◦ “ferritin receptor activity”
◦ Definition: “combining with ferritin, an iron-storing protein complex, to initiate a change in cell activity”
Broad functions
◦ “catalytic activity”
◦ “transporter activity”
◦ “binding”
Specific functions
◦ “adenylate cyclase activity”
◦ “protein-DNA complex transmembrane transporter activity”
◦ “Fc-gamma receptor I complex binding”
23
Biological process ontology
Describes recognized series of events or molecular functions with a defined beginning and end
“GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway” (from GO documentation)
Mutant phenotypes often reflect disruptions in BP
Specific process
◦ “pyrimidine metabolism”
◦ “α-glucosidase transport”
Broad process
◦ “cellular physiological process”
◦ “signal transduction”
http://www.geneontology.org/GO.process.guidelines.shtml
◦ General considerations
◦ The Cell Cycle
◦ The Development Node
◦ Multi-Organism Process
◦ Metabolism
◦ Regulation
◦ Detection of and Response to Stimuli
◦ Sensory Perception
◦ Signaling Pathways
◦ Transport and Localization
◦ Transporter activity (molecular function)
◦ Other Misc. Standard Defs
24
Anatomy of a GO term
Term name
goid (unique numerical identifier)
Precise textual definition with reference stating source
Synonyms (broad or narrow) for searching,
alternative names, misspellings…
GO slim
Ontology placement
25
Storage and cross referencing of GO terms
Storage in flat file (text)
Database cross reference for mappings to GO
◦ GO term identical to object in other database
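As a hedged illustration of flat-file storage: GO terms are distributed in the OBO text format, where each term is a [Term] stanza of "tag: value" lines. The stanza below is hand-written from the amino acid transporter term shown later in this talk, not copied from the GO file, and `parse_term_stanza` is a hypothetical helper name.

```python
# Minimal sketch: read one OBO-style [Term] stanza into a dict.
# SAMPLE_STANZA is hand-written for illustration, not copied from the GO file.
SAMPLE_STANZA = """\
[Term]
id: GO:0015171
name: amino acid transmembrane transporter activity
namespace: molecular_function
is_a: GO:0005275 ! amine transmembrane transporter activity
is_a: GO:0046943 ! carboxylic acid transmembrane transporter activity
"""


def parse_term_stanza(text):
    """Return a dict of tag -> value; 'is_a' collects a list of parent ids."""
    term = {"is_a": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        value = value.split(" ! ")[0].strip()  # drop the trailing "! name" comment
        if tag == "is_a":
            term["is_a"].append(value)
        else:
            term[tag] = value
    return term


term = parse_term_stanza(SAMPLE_STANZA)
print(term["id"], term["name"], term["is_a"])
```

Keeping each parent link as a plain id makes it easy to look the parent up in a dictionary of parsed terms.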
26
Ontology structure: parent-child relationship
Parent term (broader)
Child term (specialized)
hexose biosynthesis
hexose metabolism
monosaccharide biosynthesis
Up in the tree is more general; down in the tree is more specific:
Annotation of genes
◦ Start with terms denoting broad functional categories
◦ Use more specific term as knowledge warrants
27
Ontology structure: terms arranged in DAGs
GO terms structured as hierarchical-like directed acyclic graphs (DAGs)
◦ Tree-like, but each term can have more than one parent (pseudo-hierarchy)
Each term may have one or more child terms (“siblings” share same parent)
(Figure labels: parent, parents, “siblings”, child term, child terms)
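A small sketch of what "more than one parent" means in practice: walking upward from a term in a DAG can reach the same ancestor by several routes, so ancestors are collected into a set. The edges from "hexose biosynthesis" follow the earlier parent-child slide; the two edges up to "monosaccharide metabolism" are assumed for illustration.

```python
# Toy DAG: child term -> list of parent terms.
PARENTS = {
    "hexose biosynthesis": ["hexose metabolism", "monosaccharide biosynthesis"],
    "hexose metabolism": ["monosaccharide metabolism"],            # assumed edge
    "monosaccharide biosynthesis": ["monosaccharide metabolism"],  # assumed edge
    "monosaccharide metabolism": [],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # a shared ancestor is visited only once
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("hexose biosynthesis")))
```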
28
GO has three term relationships
is_a - child is a subtype of parent (“A is_a B”)
◦ Class-subclass relationship
part_of - child part of parent (“C part_of D”)
◦ When C present, part of D; but C not always present
◦ Nucleus always part_of cell; not all cells have nuclei
regulates
◦ Child term regulates parent term
(Zoomed in view of biological process ontology depicted here.)
29
AmiGO for viewing terms
Open source HTML-based application developed by the GO Consortium
Interface for browsing, querying and visualizing OBO data
◦ Users can search GO terms or annotations
Available via website or download for local install
http://amigo.geneontology.org
Example query with keyword “hemolysis” or goid GO:0019836
30
AmiGO search results
Click
31
Term information in AmiGO
Webpage continues…
32
AmiGO view continued
Our term is much further down…
Number of gene products in GO annotation collection annotated to that term or one of its child terms
Relationship between term and its parent
Several informative views
Click
33
Graph view
Alternative view of network of terms
34
A term with two parents
amine group carboxylic acid group
generic amino acid
• Name: amino acid transmembrane transporter activity
• ID number: GO:0015171
• Definition: Catalysis of the transfer of amino acids from one side of a membrane to the other. Amino acids are organic molecules that contain an amino group and a carboxyl group. [source: GOC:ai, GOC:mtg_transport, ISBN:0815340729]
• parent term: amine transmembrane transporter activity (GO:0005275)
• relationship to parent: “is_a”
• parent term: carboxylic acid transmembrane transporter activity (GO:0046943)
• relationship to parent: “is_a”
35
Multiple paths to root: graphical view in OBO-Edit
36
“True path rule”
The pathway from a term all the way up to its top-level parent(s) must always be true for any gene product that could be annotated to that term (“if true for the child, then true for the parent”)
Incorrect for Bacteria:
cell
  organelle
    mitochondrion
      proton-transporting ATP synthase complex
Correct for Bacteria (and Eukaryotes):
cell
  intracellular
    proton-transporting ATP synthase complex
      plasma membrane proton-transporting ATP synthase complex
      mitochondrial proton-transporting ATP synthase complex
  membrane
    plasma membrane
      plasma membrane proton-transporting ATP synthase complex
  organelle
    mitochondrion
      mitochondrial inner membrane
        mitochondrial proton-transporting ATP synthase complex
(Abbreviated versions of the actual trees)
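The true path rule can be checked mechanically. This is a minimal sketch, assuming the two abbreviated trees on this slide and a hand-picked set of terms that cannot apply to bacteria; the edge lists, the set and the function names are illustrative, not part of any GO tool.

```python
def ancestors(term, parents):
    """All terms reachable upward from `term` in a child -> parents mapping."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


# Tree marked "incorrect for Bacteria" on the slide.
OLD_TREE = {
    "proton-transporting ATP synthase complex": ["mitochondrion"],
    "mitochondrion": ["organelle"],
    "organelle": ["cell"],
}

# Tree marked "correct for Bacteria (and Eukaryotes)"; nesting as read from the slide.
NEW_TREE = {
    "plasma membrane proton-transporting ATP synthase complex": [
        "proton-transporting ATP synthase complex", "plasma membrane"],
    "mitochondrial proton-transporting ATP synthase complex": [
        "proton-transporting ATP synthase complex", "mitochondrial inner membrane"],
    "proton-transporting ATP synthase complex": ["intracellular"],
    "intracellular": ["cell"],
    "plasma membrane": ["membrane"],
    "membrane": ["cell"],
    "mitochondrial inner membrane": ["mitochondrion"],
    "mitochondrion": ["organelle"],
    "organelle": ["cell"],
}

# Assumed for illustration: terms that can never be true for a bacterial gene product.
NOT_IN_BACTERIA = {"mitochondrion", "mitochondrial inner membrane"}


def true_path_violations(term, parents, impossible=NOT_IN_BACTERIA):
    """Ancestors implied by `term` that cannot hold for the organism."""
    return ancestors(term, parents) & impossible


print(true_path_violations("proton-transporting ATP synthase complex", OLD_TREE))
print(true_path_violations(
    "plasma membrane proton-transporting ATP synthase complex", NEW_TREE))
```

An empty result means every ancestor implied by the annotation could be true for the organism, which is what the rule requires.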
What is GO annotation?
Literature curation at model organism databases
The annotation file
Evidence – critical for annotation
Sequence similarity-based annotation
Annotation specificity
Annotating with GO and Evidence
37
38
GO annotation overview
Associating a GO term with a gene product
◦ Goal is to select GO terms from all three ontologies to represent what, where, and how
Linking a GO term to a gene product asserts that it has that attribute
For example, 6-phosphofructokinase
◦ Molecular function: GO:0003872 6-phosphofructokinase activity
◦ Biological process: GO:0006096 glycolysis
◦ Cellular component: GO:0005737 cytoplasm
Annotation, whether based on literature or computational methods, always involves:
◦ Learning something about a gene product
◦ Selecting an appropriate GO term
◦ Providing an appropriate evidence code
◦ Citing a [preferably open access] reference
◦ Entering information into GO annotation file
39
Chaperone DnaK, one protein/multiple annotations
Molecular function
◦ ATP binding (GO:0005524)
◦ ATPase activity (GO:0016887)
◦ unfolded protein binding (GO:0051082)
◦ misfolded protein binding (GO:0051787)
◦ denatured protein binding (GO:0031249)
Biological process
◦ protein folding (GO:0006457)
◦ protein refolding (GO:0042026)
◦ protein stabilization (GO:0050821)
◦ response to stress (GO:0006950)
Cellular component
◦ cytoplasm (GO:0005737)
40
Literature curation performed at model organism databases
From the abstract:
41
Results section indicates a “direct assay” annotation
They document the findings of a direct assay performed on purified protein:
They further document the methods used, and evaluate the findings in the Discussion section…
42
Query AmiGO with “DNA ligase” & “DNA ligation”
All “ligation” in biological process ontology
43
Resulting annotations
GO id | term name | aspect | ev. code | reference | with
GO:0003909 | DNA ligase activity | molecular function | IDA | PMID:17705817 | N/A
GO:0006266 | DNA ligation | biological process | IDA | PMID:17705817 | N/A
GO:0005737 | cytoplasm | cellular component | IC | PMID:17705817 | GO:0003909
Name: DNA ligase (stated in paper)
Gene symbol: ligA (stated in paper)
EC: 6.5.1.2 (queried enzyme for “DNA ligase”)
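The three ligA annotations above can be held as simple records and sanity-checked before they go into an annotation file. This is a sketch only: the `Annotation` class and `problems` function are hypothetical names, and the rule that an IC annotation names a GO term in its "with" column simply mirrors the pattern in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    go_id: str
    term: str
    aspect: str
    evidence: str
    reference: str
    with_from: str = ""  # empty string stands for "N/A"


# Rows taken from the table above.
LIGA = [
    Annotation("GO:0003909", "DNA ligase activity", "molecular function",
               "IDA", "PMID:17705817"),
    Annotation("GO:0006266", "DNA ligation", "biological process",
               "IDA", "PMID:17705817"),
    Annotation("GO:0005737", "cytoplasm", "cellular component",
               "IC", "PMID:17705817", "GO:0003909"),
]


def problems(annotation):
    """Return a list of human-readable issues; an empty list means none found."""
    issues = []
    if not annotation.go_id.startswith("GO:"):
        issues.append("missing or malformed GO id")
    if not annotation.evidence:
        issues.append("missing evidence code")
    if not annotation.reference:
        issues.append("missing reference")
    # Assumed rule, mirroring the table: IC points at the GO term it was inferred from.
    if annotation.evidence == "IC" and not annotation.with_from.startswith("GO:"):
        issues.append("IC annotation does not name the GO term it was inferred from")
    return issues


for a in LIGA:
    print(a.go_id, a.evidence, problems(a))
```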
44
Gene annotation file captures annotations
Evidence
45
Evidence
Essential to base annotation on evidence
◦ Conclusions more robust and traceable
◦ With evidence, a GO annotation is standard operating procedure (SOP)-independent
Many types of evidence exist
For example, experiment described in literature
◦ What method (e.g. direct assay, mutant phenotype, et cetera) was used?
◦ Did author cite references?
◦ Did author provide details of analyses?
Perhaps you used a sequence-based method
◦ What were the methods of manual curation?
◦ Give accession numbers of similar sequences
◦ Provide any references describing methods
Controlled vocabularies help here, too!
46
GO standard references
GO_REF:0000011 A Hidden Markov Model (HMM) is a statistical representation of patterns found in a data set. When using HMMs with proteins, the HMM is a statistical model of the patterns of the amino acids found in a multiple alignment of a set of proteins called the "seed". Seed proteins are chosen based on sequence similarity to each other. Seed members can be chosen with different levels of relationship to each other. They can be members of a superfamily (ex. ABC transporter, ATP-binding proteins), they can all share the same exact specific function (ex. biotin synthase) or they could share another type of relationship of intermediate specificity (ex. subfamily, domain). New proteins can be scored against the model generated from the seed according to how closely the patterns of amino acids in the new proteins match those in the seed. There are two scores assigned to the HMM which allow annotators to judge how well any new protein scores to the model. Proteins scoring above the "trusted cutoff" score can be assumed to be part of the group defined by the seed. Proteins scoring below the "noise cutoff" score can be assumed to NOT be a part of the group. Proteins scoring between the trusted and noise cutoffs may be part of the group but may not. One of the important features of HMMs is that they are built from a multiple alignment of protein sequences, not a pairwise alignment. This is significant, since shared similarity between many proteins is much more likely to indicate shared functional relationship than sequence similarity between just two proteins. The usefulness of an HMM is directly related to the amount of care that is taken in choosing the seed members, building a good multiple alignment of the seed members, assessing the level of specificity of the model, and choosing the cutoff scores correctly.
In order to properly assess what functional relevance an above-trusted scoring HMM match has to a query, one must carefully determine what the functional scope of the HMM is. If the HMM models proteins that all share the same function then it is likely possible to assign a specific function to high-scoring match proteins based on the HMM. If the HMM models proteins that have a wide variety of functions, then it will not be possible to assign a specific function to the query based on the HMM match, however, depending on the nature of the HMM in question, it may be possible to assign a more general (family or subfamily level) function. In order to determine the functional scope of an HMM, one must carefully read the documentation associated with the HMM. The annotator must also consider whether the function attributed to the proteins in the HMM makes sense for the query based on what is known about the organism in which the query protein resides and in light of any other information that might be available about the query protein. After carefully considering all of these issues the annotator makes an annotation.
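The cutoff logic described in GO_REF:0000011 can be written as a three-way decision. This is a minimal sketch: the function name and the example scores are hypothetical, and treating a score exactly equal to the trusted cutoff as a member is an assumption, since the text only speaks of scores "above" and "below" the cutoffs.

```python
def classify_hmm_hit(score, trusted_cutoff, noise_cutoff):
    """Place an HMM score relative to the model's trusted and noise cutoffs."""
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    if score >= trusted_cutoff:   # assumption: equality counts as trusted
        return "member"
    if score < noise_cutoff:
        return "not a member"
    return "uncertain"            # between the two cutoffs: needs manual review


# Hypothetical scores against a model with trusted=250.0 and noise=100.0.
for score in (310.5, 180.0, 42.0):
    print(score, classify_hmm_hit(score, trusted_cutoff=250.0, noise_cutoff=100.0))
```

Even a "member" result only says the protein belongs to the group the model describes; how specific a function can be assigned still depends on the functional scope of the HMM, as the text above explains.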
47
GO evidence codes
www.geneontology.org/GO.evidence.shtml
EXP - inferred from experiment
IDA - inferred from direct assay
IEP - inferred from expression pattern
IGI - inferred from genetic interaction
IPI - inferred from physical interaction
IMP - inferred from mutant phenotype
ISS - inferred from sequence or structural similarity
ISA - inferred from sequence alignment
ISO - inferred from sequence orthology
ISM - inferred from sequence model
IGC - inferred from genomic context
ND - no biological data available
IC - inferred by curator
TAS - traceable author statement
NAS - non-traceable author statement
IEA - inferred from electronic annotation
GO codes are a subset of yet another ontology!
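Because evidence codes are themselves a controlled vocabulary, they are easy to handle as a lookup table. The sketch below uses only the codes listed on this slide; the helper name `describe` and the "experimental / non-experimental" labels are illustrative choices, with the experimental group taken to be the first six codes.

```python
EVIDENCE_CODES = {
    "EXP": "inferred from experiment",
    "IDA": "inferred from direct assay",
    "IEP": "inferred from expression pattern",
    "IGI": "inferred from genetic interaction",
    "IPI": "inferred from physical interaction",
    "IMP": "inferred from mutant phenotype",
    "ISS": "inferred from sequence or structural similarity",
    "ISA": "inferred from sequence alignment",
    "ISO": "inferred from sequence orthology",
    "ISM": "inferred from sequence model",
    "IGC": "inferred from genomic context",
    "ND": "no biological data available",
    "IC": "inferred by curator",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "IEA": "inferred from electronic annotation",
}

# The experiment-based codes (first group on the slide).
EXPERIMENTAL = {"EXP", "IDA", "IEP", "IGI", "IPI", "IMP"}


def describe(code):
    """Expand an evidence code and say whether it is experiment-based."""
    code = code.upper()
    if code not in EVIDENCE_CODES:
        raise KeyError(f"unknown evidence code: {code}")
    kind = "experimental" if code in EXPERIMENTAL else "non-experimental"
    return f"{code}: {EVIDENCE_CODES[code]} ({kind})"


print(describe("IDA"))
print(describe("IEA"))
```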
48
Types of sequence similarity-based annotations
Find similarity between gene product & one that is experimentally characterized
◦ BLAST-type alignments
◦ Shared synteny to establish orthology of genomic regions between species
Find similarity between gene product and defined protein family
◦ HMMs (Pfam, TIGRFAMS)
◦ Prosite
◦ InterPro
Find motifs in gene product with prediction tools
◦ TMHMM
◦ SignalP
Much (most?) of the information you find is based on transitive annotation, and much of it has never been looked at by a human being!
49
Evaluation of sequence similarity-based information
Visually inspect alignments & criteria
◦ Length & identity
◦ Conservation of catalytic sites
◦ Check HMM scores with respect to cutoff
Look at available metabolic analysis
◦ Pathways, complexes?
Information from neighboring genes
◦ Gene in an operon (common in prokaryotes) can supplement weak similarity evidence
Sequence characteristics
◦ Transmembrane regions?
◦ Signal peptide?
◦ Known motifs that give a clue to function?
◦ Paralogous family member
50
An example: HI0678, a protein from H. influenzae…
...high quality alignment to experimentally characterized triosephosphate isomerase from Vibrio marinus
51
further down the page
Information from Swiss-Prot database on experimentally characterized match protein
52
…. full-length match, high percent identity (67.8%), conserved active and binding sites (boxed in red).
High quality…..
53
Resulting annotations
GO id | term name | aspect | ev code | reference | with
GO:0004807 | triose-phosphate isomerase activity | molecular function | ISS | GO_REF:0000012 | Swiss-Prot:P50921
GO:0006096 | glycolysis | biological process | IGC | PMID:15347579 | TIGR_GenProp:GenProp0120
GO:0005737 | cytoplasm | cellular component | IC | GO_REF:0000012 | GO:0004807
name: triosephosphate isomerase
gene symbol: tpiA
EC: 5.3.1.1
(This, and the following annotations, came from the match protein.)
54
KEGG pathway for glycolysis core
55
KEGG pathway for glycolysis core
56
Resulting annotations
GO id | term name | aspect | ev code | reference | with
GO:0004807 | triose-phosphate isomerase activity | molecular function | ISS | GO_REF:0000012 | Swiss-Prot:P50921
GO:0006096 | glycolysis | biological process | IGC | GO_REF:0000012 | KEGG_PATHWAY:hin00010
GO:0005737 | cytoplasm | cellular component | IC | GO_REF:0000012 | GO:0004807
name: triosephosphate isomerase
gene symbol: tpiA
EC: 5.3.1.1
57
And another annotation
GO id | term name | aspect | ev code | reference | with
GO:0004807 | triose-phosphate isomerase activity | molecular function | ISS | GO_REF:0000012 | Swiss-Prot:P50921
GO:0006096 | glycolysis | biological process | IGC | GO_REF:0000012 | KEGG_PATHWAY:hin00010
GO:0005737 | cytoplasm | cellular component | IC | GO_REF:0000012 | GO:0004807
The biologist knows that glycolysis takes place in the cytoplasm in bacteria, and so infers a cytoplasmic location for that protein (“inferred by curator” evidence code).
58
Annotation specificity should reflect knowledge
Available evidence for three genes
#1
◦ good match to an HMM for “kinase”
#2
◦ good match to an HMM for “kinase”
◦ a high-quality BER match to an experimentally characterized “glucokinase” AND a “fructokinase”
#3
◦ good match to an HMM specific for “ribokinase”
◦ a high-quality BER match to an experimentally characterized ribokinase
GO trees (very abbreviated)
Function
  catalytic activity
    kinase activity
      carbohydrate kinase activity
        ribokinase activity
        glucokinase activity
        fructokinase activity
Process
  metabolism
    carbohydrate metabolism
      monosaccharide metabolism
        hexose metabolism
          glucose metabolism
          fructose metabolism
        pentose metabolism
          ribose metabolism
(Markers #1, #2 and #3 on the slide point to the terms chosen for each gene.)
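One way to read this slide is "annotate to the deepest term that all of the evidence supports". The sketch below applies that reading to the abbreviated function tree; the nesting of the tree, the helper names, and the choice of which terms each gene's evidence supports are assumptions made for illustration.

```python
# Abbreviated molecular function tree from the slide: child -> parent.
PARENT = {
    "kinase activity": "catalytic activity",
    "carbohydrate kinase activity": "kinase activity",
    "ribokinase activity": "carbohydrate kinase activity",
    "glucokinase activity": "carbohydrate kinase activity",
    "fructokinase activity": "carbohydrate kinase activity",
}


def path_to_root(term):
    """The term followed by each of its ancestors, most specific first."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def most_specific_shared_term(supported_terms):
    """Deepest term that equals or is an ancestor of every supported term."""
    paths = [path_to_root(t) for t in supported_terms]
    shared = set(paths[0]).intersection(*paths[1:])
    # Walking one path from the leaf upward, the first shared term is the deepest.
    return next(t for t in paths[0] if t in shared)


# Assumed reading of the evidence for the three genes.
EVIDENCE = {
    "#1": ["kinase activity"],                                 # generic kinase HMM only
    "#2": ["glucokinase activity", "fructokinase activity"],   # two conflicting specific matches
    "#3": ["ribokinase activity"],                             # specific HMM and match agree
}

for gene, terms in EVIDENCE.items():
    print(gene, "->", most_specific_shared_term(terms))
```

Under these assumptions gene #2 lands on "carbohydrate kinase activity": the evidence says it phosphorylates a sugar but cannot decide which one.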
Using shared annotations
Search for GO terms at databases
Slims for broad classification
GO tools
Working with GO-limited data sets
Summary
Using annotation to facilitate your research
59
60
Sharing annotations
Annotation file sent to GO, put in repository
◦ All these data free to anyone
◦ Hundreds of thousands of GP annotations
Annotation files all in same format
◦ Facilitates easy use of data by everyone
Most of your favorite organism databases use these annotation files
61
Searching for GO terms at EuPathDB
62
Ontology slim
www.geneontology.org/GO.slims.shtml
Slim is a distilled (reduced) ontology
◦ Made by manually pruning low-level terms with an ontology editor
◦ Selected high-level terms remain
Slims reduce ontology complexity
Reduce clutter & see general trends
◦ Microarray experiments
◦ Comparative whole genome analyses
Remove irrelevant terms
◦ Looking at specific taxa, such as yeast or plant
GO offers a script to bin more granular annotations up to higher levels
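Binning granular annotations up to slim terms amounts to walking each annotation's ancestors and keeping those that are in the slim. This is a toy sketch, not the script GO provides: the process edges follow the abbreviated tree shown earlier in this talk, while the slim set, the "transport" term placement, and the gene names are hypothetical.

```python
from collections import Counter

# child -> parents, following the abbreviated process tree shown earlier.
PARENTS = {
    "glucose metabolism": ["hexose metabolism"],
    "fructose metabolism": ["hexose metabolism"],
    "hexose metabolism": ["monosaccharide metabolism"],
    "ribose metabolism": ["pentose metabolism"],
    "pentose metabolism": ["monosaccharide metabolism"],
    "monosaccharide metabolism": ["carbohydrate metabolism"],
    "carbohydrate metabolism": ["metabolism"],
}

SLIM = {"carbohydrate metabolism", "transport"}  # hypothetical slim


def with_ancestors(term):
    """The term itself plus everything above it in the graph."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def slim_counts(gene_terms):
    """Count genes per slim term; a gene counts once per slim term it maps to."""
    counts = Counter()
    for terms in gene_terms.values():
        bins = set()
        for term in terms:
            bins |= with_ancestors(term) & SLIM
        counts.update(bins)
    return counts


GENES = {  # hypothetical gene products and annotations
    "geneA": ["glucose metabolism"],
    "geneB": ["ribose metabolism", "glucose metabolism"],
    "geneC": ["transport"],
}
print(slim_counts(GENES))
```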
63
Comparing genomes with a GO slim
MJ Gardner, et al. (2002) Nature 419:498-511
High-level biological process terms used to compare Plasmodium and Saccharomyces
64
GO slim: manual/orthology-based gene annotations
Nucleic Acids Res. 2010 January; 38(Database issue): D420–D427.
65
GO tools
www.geneontology.org/GO.tools.shtml
The real challenge is finding the right one for your needs
For example, statistical representation of GO terms:
http://go.princeton.edu/cgi-bin/GOTermFinder
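Tools for "statistical representation of GO terms" commonly ask whether a term appears in a gene list more often than chance would predict, typically with a hypergeometric test. The sketch below shows that calculation in isolation; the function name and the numbers are hypothetical, and real tools usually add a multiple-testing correction that is not shown here.

```python
from math import comb


def enrichment_p_value(population, population_hits, sample, sample_hits):
    """P(at least `sample_hits` annotated genes) when drawing `sample` genes
    without replacement from `population` genes, of which `population_hits`
    carry the GO term (upper tail of the hypergeometric distribution)."""
    upper = min(sample, population_hits)
    total = comb(population, sample)
    tail = sum(
        comb(population_hits, k) * comb(population - population_hits, sample - k)
        for k in range(sample_hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 6000 genes in the genome, 120 annotated to a term,
# 50 genes in the list of interest, 8 of which carry the term.
p = enrichment_p_value(population=6000, population_hits=120, sample=50, sample_hits=8)
print(f"p = {p:.3g}")
```

A small p-value says the term is over-represented in the list relative to the whole annotated genome, which is only as meaningful as the annotations behind it.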
66
GO & analysis of RNA-seq data
We present GOseq, an application for performing Gene Ontology (GO) analysis on RNA-seq data. GO analysis is widely used to reduce complexity and highlight biological processes in genome-wide expression studies, but standard methods give biased results on RNA-seq data due to over-detection of differential expression for long and highly expressed transcripts. Application of GOseq to a prostate cancer data set shows that GOseq dramatically changes the results, highlighting categories more consistent with the known biology.
Young et al. Genome Biology 2010, 11:R14 http://genomebiology.com/2010/11/2/R14
67
When GO is limited
Food for thought: what happens when we have limited GO (or other)annotation data?
New and interesting genomes often see this problem
68
Comparative analysis of orthologs in syntenic blocks
The more genomes we have at our disposal, the better
Structural rearrangements, absence of intron, gene duplication, intron structure, gene deletion/creation
Nucleic Acids Res. 2010 January; 38(Database issue): D420–D427.
69
Summary GO analyses
GO remedies problems of synonyms & homonyms in biological nomenclature
◦ Queries based on IDs linked to precise definitions, not less reliable text-matching
GO can help you to:
◦ Find all genes that share a particular function regardless of sequence
◦ Do comparisons across any species annotated with GO
◦ Summarize major classes of genes in a newly sequenced genome
◦ Characterize expressed genes in a study
◦ Drive hypotheses to test in the laboratory
GO is not a panacea but it should be a valuable tool in your genomics toolbox
The title slide revisited…
Arabidopsis thaliana ATPaseHMA4 zinc binding domain
GO:0006829 : zinc ion transport (BP)
GO:0005886 : plasma membrane (CC)
GO:0005515 : protein binding (MF)
Annotation
Ontology
Evidence
THANK YOU.