Ontologies for life sciences: examples from the Gene Ontology
Melanie Courtot GO/GOA project lead [email protected] @mcourtot
Cross-domain resources
Data resources at EMBL-EBI
Genes, genomes & variation
RNA Central
ArrayExpress
Expression Atlas
Metabolights
PRIDE
InterPro Pfam UniProt
ChEMBL SureChEMBL ChEBI
Molecular structures
Protein Data Bank in Europe
Electron Microscopy Data Bank
European Nucleotide Archive
European Variation Archive
European Genome-phenome Archive
Gene, protein & metabolite expression
Protein sequences, families & motifs
Chemical biology
Reactions, interactions & pathways
IntAct Reactome MetaboLights
Systems
BioModels Enzyme Portal BioSamples
Ensembl
Ensembl Genomes
GWAS Catalog
Metagenomics portal
Europe PubMed Central
BioStudies
Gene Ontology
Experimental Factor Ontology
Literature & ontologies
Data integration in times of ‘omics’
Individual experiments (genomics, transcriptomics, proteomics, metabolomics) are conducted at different times, by different researchers, using different equipment/approaches, and report the same type of results differently
Data growth is fast
[Figure: data volume in bytes (log scale, 1E+08 to 1E+16) against date (2002 to 2016) for EGA, ENA, PRIDE, MetaboLights and ArrayExpress, with reference lines for 3, 4, 12 and 18 month doubling]
Slide credit: Paul Flicek
Vast amount of data generated means
vast amount of data submitted to repositories
Curation - Dirty data and the long tail
200 100
sex:female
gender:female
disease:breast cancer
frequency=2285 frequency=1288
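The "sex:female" versus "gender:female" case above can be sketched as a small normalisation step. This is a minimal illustration, not a description of any EMBL-EBI pipeline: the synonym table `KEY_SYNONYMS` and the sample values are hypothetical.

```python
from collections import Counter

# Hypothetical synonym table: free-text attribute keys mapped to one canonical key.
KEY_SYNONYMS = {"sex": "sex", "gender": "sex", "disease": "disease"}


def normalise(attribute: str) -> str:
    """Return 'key:value' with a canonical key, lower case, tidy whitespace."""
    key, _, value = attribute.partition(":")
    key = key.strip().lower()
    value = " ".join(value.split()).lower()
    return f"{KEY_SYNONYMS.get(key, key)}:{value}"


def count_normalised(attributes):
    """Count attributes after normalisation, so variants collapse together."""
    return Counter(normalise(a) for a in attributes)


# Hypothetical submitted values; the variants collapse into one bucket.
raw = ["sex:female", "gender:female", "Gender: Female", "disease:breast cancer"]
counts = count_normalised(raw)
```

Counting after normalisation is what makes the long tail visible: rare spellings either merge into a known key or stand out as candidates for curation.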
data integration [ˈdeɪtə ˌɪntəˈgreɪʃən]: (computational) means to access, retrieve and analyse data sets from different sources in order to exploit them, i.e., gain new knowledge, and share that new knowledge
Standards
What do they offer?
• uniformity and consistency in reporting data
• effective reuse, integration and mining of data
• creation of SOPs, benchmarks, quality assessment
• community cohesion
What constitutes a standard?
1. Establish your community
2. Define community needs
3. Define minimal information which needs to be gathered and exchanged by that community
4. Design* an interchange format
5. Design* domain-specific controlled vocabularies
*Design = review, reuse and fill the gaps
Inconsistency in naming of biological concepts
• Same name for different concepts
• Different names for the same concept
An example …
Tactition, Tactile sense, Taction
perception of touch ; GO:0050975
Sample description with semantic markup
CL:CL_0000071 (blood vessel endothelial cell)
obo:CHEBI_39867 (valproic acid)
NCBITaxon:NCBITaxon_9606 (Homo sapiens)
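As a rough sketch of what the semantic markup buys you, the identifiers above can be expanded to resolvable IRIs. The example assumes the standard OBO PURL base `http://purl.obolibrary.org/obo/`; the field names in `sample` and the helper `to_obo_iri` are hypothetical.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def to_obo_iri(term_id: str) -> str:
    """Expand 'CL:CL_0000071', 'obo:CHEBI_39867' or 'GO:0050975' to an OBO PURL."""
    prefix, _, local = term_id.partition(":")
    if "_" not in local:
        # Plain CURIE such as 'GO:0050975': rebuild the 'GO_0050975' local name.
        local = f"{prefix}_{local}"
    return OBO_BASE + local


# Hypothetical sample record using the three terms from the slide.
sample = {
    "cell type": "CL:CL_0000071",
    "compound": "obo:CHEBI_39867",
    "organism": "NCBITaxon:NCBITaxon_9606",
}
iris = {field: to_obo_iri(term_id) for field, term_id in sample.items()}
```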
Curation
Ontologies
• Representation of important things in a specific domain
• Describes types of entities (e.g. cells) and relations between them
• An active, formal computational artifact
• A mathematical model based on a subset of first order logic
• Tools can automatically process ontologies
• A communication tool
• Provides a dictionary for collaborators, a shared understanding
• Allows data sharing
Reasoning is critical
• Prokaryotic cell and Eukaryotic cell are declared disjoint
• Fungal cell is a Eukaryotic cell
• Spore is a Fungal cell and a Prokaryotic cell
⇒ Unsatisfiability
⇒ Solution: clarify spore (sensu Mycetozoa) AND actinomycete-type spore
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
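The spore example can be reproduced with a toy check. This is only a sketch of the idea (collect superclasses, then look for a declared-disjoint pair); a real OWL reasoner such as ELK does far more. The parents given to the two clarified spore classes in `FIXED` are an assumption made for illustration.

```python
# Asserted subclass axioms from the slide (child -> direct parents).
SUBCLASS_OF = {
    "fungal cell": {"eukaryotic cell"},
    "spore": {"fungal cell", "prokaryotic cell"},
}
DISJOINT = [("prokaryotic cell", "eukaryotic cell")]


def ancestors(cls, subclass_of):
    """Return every direct and indirect superclass of cls."""
    seen, stack = set(), [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsatisfiable(subclass_of, disjoint):
    """Map each class that falls under two disjoint classes to the clashing pairs."""
    problems = {}
    for cls in subclass_of:
        supers = ancestors(cls, subclass_of) | {cls}
        clashes = [(a, b) for a, b in disjoint if a in supers and b in supers]
        if clashes:
            problems[cls] = clashes
    return problems


# After the fix: two separate spore classes (parents assumed for illustration).
FIXED = {
    "fungal cell": {"eukaryotic cell"},
    "spore (sensu Mycetozoa)": {"eukaryotic cell"},
    "actinomycete-type spore": {"prokaryotic cell"},
}
```

Running `unsatisfiable(SUBCLASS_OF, DISJOINT)` flags only "spore", while the same call on `FIXED` reports nothing.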
Different words same concept: example of Dyschromatopsia
We searched earlier for:
- Dyschromatopsia
- Colorblindness
- Abnormality of the eye
The ontology of color blindness
HP:0011518 (Dichromacy )
HP:0011518 (Eye)
HP:0000551 (Abnormality of color vision )
HP:0007641 (Dyschromatopsia)
Is-a
Is-a Disease-location
synonym: “Colorblindness”
definition: “A form of colorblindness in which only two of the three fundamental colors can be distinguished due to a lack of one of the retinal cone pigments.”
Building ontologies
• Put things into categories
• Helps organise the data
• Allows us to generalise over data
• Capture the relations between things
• Anatomical parts
Biopolymer
  Nucleic Acid
    DNA
    RNA
      tRNA
      mRNA
      smRNA
  Polypeptide
    Enzyme
CMPO term: graped micronucleus CMPO_0000156
Integrate file formats Integrate metadata
Apply phenotype ontology
Predict disease gene/biomarkers
Human Disease
Cell Gene knockdown
Genotype Phenotype
Sequence Proteins
Gene products Transcript
Pathways
Cell type
BRENDA tissue / enzyme source
Development
Anatomy
Phenotype
Plasmodium life cycle
- Sequence types and features - Genetic Context
- Molecule role - Molecular Function - Biological process - Cellular component
- Protein covalent bond - Protein domain - UniProt taxonomy
- Pathway ontology - Event (INOH pathway ontology) - Systems Biology - Protein-protein interaction
- Arabidopsis development - Cereal plant development - Plant growth and developmental stage - C. elegans development - Drosophila development - Human developmental anatomy, abstract version - Human developmental anatomy, timed version
- Mosquito gross anatomy - Mouse adult gross anatomy - Mouse gross anatomy and development - C. elegans gross anatomy - Arabidopsis gross anatomy - Cereal plant gross anatomy - Drosophila gross anatomy - Dictyostelium discoideum anatomy - Fungal gross anatomy - Plant structure - Maize gross anatomy - Medaka fish anatomy and development - Zebrafish anatomy and development
- NCI Thesaurus - Mouse pathology - Human disease - Cereal plant trait - PATO attribute and value - Mammalian phenotype - Human phenotype - Habronattus courtship - Loggerhead nesting - Animal natural history and life history
eVOC (Expressed Sequence Annotation for Humans)
Ontologies for life sciences
Open Biological and Biomedical Ontologies (OBO)
A subset of biological and biomedical ontologies whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development designed to ensure …
• tight connection to the biomedical basic sciences
• compatibility
• interoperability, common relations
• formal robustness
• support for logic-based reasoning
Building metadata (& ontology) rich resources
• We build tools for semantic enrichment and alignment
• Interoperability toolkit
• Microservices based architecture
• Technology-agnostic
• Pushing boundaries of ontology “embedding”
Raw Data to Explicit Knowledge
Data Exploration and Cleanup
Data structuring
Ontology Annotation
Data cleaning and mapping
Ontology building
Webulous
OxO mapping service
Searching for ontology terms: the EBI Ontology Lookup Service
• for searching and visualizing >140 ontologies from the biomedical domain
• includes (among others):
• Gene Ontology
• OBO Relations ontology
• Evidence ontology
• Pathogen Transmission Ontology
• Symptom Ontology
• Basic Formal Ontology
Ontology Lookup Service
• Ontology search engine
• Ontology visualisation
• Powerful RESTful API
• Open source project
• Generic infrastructure (can load any ontology represented in OWL)
https://github.com/EBISPOT/OLS
Repository of over 150 biomedical ontologies (4.5 million terms, 11 million relations)
http://www.ebi.ac.uk/ols
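A minimal sketch of how a client might address the RESTful API. The base URL is the one on the slide; the `/api/search` path and the `q` and `ontology` parameter names are assumptions to check against the current OLS API documentation. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

OLS_BASE = "http://www.ebi.ac.uk/ols"  # base URL as given on the slide


def ols_search_url(query, ontology=None):
    """Build a search URL; path and parameter names are assumed, not confirmed here."""
    params = {"q": query}
    if ontology:
        params["ontology"] = ontology  # restrict the search to one ontology
    return f"{OLS_BASE}/api/search?{urlencode(params)}"


# Example: look up the term from the naming slide within GO.
url = ols_search_url("perception of touch", ontology="go")
```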
• Sample attributes and variables are mapped to EFO ontology
Sample attribute
Mapping data to ontology terms
• Zooma automatically annotates sample attributes and variables with ontology classes
Mapping data to ontology terms
Information supplied as part of a search
The source of this mapping
ZOOMA contains a linked data repository of annotation knowledge and highly annotated data
Expression Atlas: source of mappings
• Atlas automated pipeline runs against Zooma, then curators:
• Check that the automatic mappings are all correct
• Create a list of new mappings that should be added to Zooma
• Webulous Google Add-On
• Connect to the Webulous server from Google Spreadsheets
• Load templates from the Webulous server
• Submit populated templates back to the server for processing
Expression Atlas: curation
What happens when we need a term that is not in EFO?
• A Webulous template specifies a series of fields (columns) for the input data
Some fields only allow values from a list of ontology terms
Adding diseases to EFO using
This data validation provides the user with convenient term autocomplete when entering data into a cell
BioSolr
“BioSolr aims to significantly advance the state of the art with regards to indexing and querying biomedical data with freely available open source software”
flaxsearch/BioSolr
Solr documents with ontology annotation
Enriched Solr with ontology content (synonyms, structure, relations)
Solr/Elastic plugin Query expansion and hierarchical faceting
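Ontology-driven query expansion can be sketched without Solr at all. The example below is not the BioSolr plugin: it is a plain-Python illustration in which the is-a chain, the synonym placement and the document annotations are assumptions built from the colour vision terms used earlier.

```python
# Assumed toy hierarchy (parent -> children) and synonym table.
CHILDREN = {
    "abnormality of color vision": ["dyschromatopsia"],
    "dyschromatopsia": ["dichromacy"],
}
SYNONYMS = {"dyschromatopsia": ["colorblindness"]}
PREFERRED = {syn: label for label, syns in SYNONYMS.items() for syn in syns}


def expand(term):
    """Return the term, its synonyms and all descendants with their synonyms."""
    start = PREFERRED.get(term.lower(), term.lower())  # resolve synonym to label
    labels, stack = set(), [start]
    while stack:
        current = stack.pop()
        if current in labels:
            continue
        labels.add(current)
        labels.update(SYNONYMS.get(current, []))
        stack.extend(CHILDREN.get(current, []))
    return labels


def search(term, documents):
    """Return ids of documents annotated with any label in the expanded query."""
    labels = expand(term)
    return [doc_id for doc_id, label in documents.items() if label.lower() in labels]


# Hypothetical documents, each annotated with one ontology label.
DOCS = {
    "doc1": "Dichromacy",
    "doc2": "Dyschromatopsia",
    "doc3": "Abnormality of color vision",
    "doc4": "breast cancer",
}
```

A search for the synonym "colorblindness" then also returns documents annotated with the more specific term, which a plain string match would miss.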
Which other diseases are associated with PDE4D?
View diseases grouped in therapeutic areas or organised in a tree
View more information about PDE4D
Filter by therapeutic area
Publishing biological data as Linked Open Data
• The EBI RDF platform
• Released Nov 2013
• Currently over 16 billion RDF triples
• Datasets updated ~ quarterly
LOD diagram August 2014
Jupp et al (2013). The EBI RDF Platform: Linked Open Data for the Life Sciences. Bioinformatics.
RDF Platform Integration points
[Diagram: links between the RDF datasets. Datasets shown: Gene Expression Atlas, Ensembl, UniProt, Gene Ontology (GO BP, GO MF, GO CC), Organisms, ChEMBL, BioModels, ChEBI, Reactome, SBO and PDB. Entities shown: Gene, RNA transcript and Protein (via identifiers.org/ensembl), uniprot:Protein, Organism/taxon, Assay (?), Target (?), SBMLModel, Reaction, Species, Compartment, Pathway, Structure, and discretized differential gene expression ratio (sio:SIO_001078). Linking predicates shown: rdfs:seeAlso, bq:is, bq:isVersionOf, bq:occursIn, bq:isHomologTo, bq:hasPart, uniprot:classifiedWith, uniprot:organism, uniprot:transcribedFrom, uniprot:translatedTo, chembl:hasTarget and sio:'is attribute of' (sio:SIO_000011).]
rdfs:seeAlso (not currently linking to identifiers.org but soon)
Relationships within BioModels can be found at https://github.com/sarala/ricordo-rdfconverter/wiki/SBML-RDF-Schema
RDF Platform – lessons learned
Successes
• Novel queries possible over EBI datasets
• Production quality RDF releases
• Community of users
• Highly available public SPARQL endpoints
• 500+ users (10-50 million hits per month)
• Lots of interest
• Catalyst for new RDF efforts
Lessons
• Public SPARQL endpoints problematic
• Query federation not performant
• Inference support limited
• Not scalable for all EBI data e.g. Variation, ENA
• Lack of expertise in service teams
• Too much overhead to get started quickly in this space
The Gene Ontology
• A way to capture biological knowledge for individual gene products in a written and computable form
• A set of concepts and their relationships to each other arranged as a hierarchy
[Figure: GO hierarchy, from less specific concepts to more specific concepts; www.ebi.ac.uk/QuickGO]
The Gene Ontology
http://geneontology.org/
• Collaborative effort to address the need for consistent descriptions of genes/gene products across databases
• Use of GO terms by collaborating databases facilitates uniform queries across all of them
Aims of the GO project
• compile the ontologies
• >40000 terms
• constantly increasing and improving
• annotate gene products using the terms
• provide public resource of data and tools
• regular releases of annotations
• tools for browsing/querying annotations and editing the GO
The GO editorial office at EMBL-EBI
• Part of the Sample, Phenotypes and Ontology team (SPOT)
• Contributes to development of the Gene Ontology
• Specific areas of interest: autophagy, synapse…
• Answers user requests
• New terms, modifications, updates
• Help support
• Curator requests
GO editorial office at the EBI:
Paola Roncaglia
David Osumi-Sutherland
Develop the ontology
• An OWL ontology of >41,000 classes
• biological process, cellular component, molecular function
• > 14,000 imported classes (CL, Uberon, ChEBI, NCBI_tax)
• >136,000 logical axioms, including:
• ~72,000 subClassOf axioms between named GO classes
• ~41,000 simple existential restrictions (subClassOf R some C)
• EL expressivity => fast, scalable reasoning (with ELK)
https://www.cs.ox.ac.uk/isg/tools/ELK/
Ontology structure
• Hierarchical
• Terms can have more than one parent
• Terms can have more than one child
• Terms are linked by relationships
• is_a, part_of, regulates (and +/- regulates), occurs_in, has_part
• These relationships allow for complex analysis of large datasets
www.ebi.ac.uk/QuickGO
Biological Process: what does a gene product do?
A commonly recognised series of events
• cell division
• transcription
Molecular Function: how does a gene product act?
• insulin binding
• insulin receptor activity
• glucose-6-phosphate isomerase activity
Cellular Component: where is a gene product located?
• plasma membrane
• mitochondrion
• mitochondrial membrane
• mitochondrial matrix
• mitochondrial lumen
• ribosome
• large ribosomal subunit
• small ribosomal subunit
Example GO annotation – cytochrome c
molecular functions: electron carrier activity (GO:0009055)
biological processes: oxidation-reduction process (GO:0055114)
cellular components: mitochondrion (GO:0005739)
https://www.ebi.ac.uk/QuickGO/GProtein?ac=P99999
What are the four direct parents of the term nucleosome?
Chromatin, Chromosomal part, DNA packaging complex, Protein-DNA complex
What types of relationships are there between the term nucleosome and its direct parents?
part_of chromatin; is_a for the others
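The answer above can be written down as typed edges, which is essentially how tools hold the GO graph in memory. A minimal sketch using only the four parents named on the slide; the helper name `parents_by_relation` is hypothetical.

```python
# Direct parents of 'nucleosome' as (relationship, parent) pairs, from the slide.
PARENTS = {
    "nucleosome": [
        ("part_of", "chromatin"),
        ("is_a", "chromosomal part"),
        ("is_a", "DNA packaging complex"),
        ("is_a", "protein-DNA complex"),
    ],
}


def parents_by_relation(term):
    """Group a term's direct parents by relationship type."""
    grouped = {}
    for relation, parent in PARENTS.get(term, []):
        grouped.setdefault(relation, []).append(parent)
    return grouped


grouped = parents_by_relation("nucleosome")
```

Keeping the relationship type on each edge matters: an analysis that follows only is_a would never reach chromatin from nucleosome.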
Building the GO
• The GO editorial team
• Submission via GitHub, https://github.com/geneontology/
• Submissions via TermGenie, http://go.termgenie.org
• ~80% terms are now created this way
Annotate gene products
GOA Database
external annotation groups (25)
manual annotation by curators (125)
electronic prediction methods (11)
Manual annotations
• Time-consuming process producing lower numbers of annotations (~2,800 taxa covered)
• More specific GO terms
• Manual annotation is essential for creating predictions
UniProt-Gene Ontology Annotation (UniProt-GOA) project at the EMBL-EBI http://www.ebi.ac.uk/GOA
• Part of the Protein Function content team
• Largest open-source contributor of annotations to GO
• Focuses on human, but provides annotations for more than 441,000 species
• Has human curators, and collates manual and electronic annotations from across the community
UniProt-GOA project at the EBI:
Aleksandra Shypitsyna
Elena Speretta
Penelope Garmiri
Tony Sawford
A GO annotation is …
…a statement that a gene product:
1. has a particular molecular function or is involved in a particular biological process or is located within a certain cellular component
2. as described in a particular reference
3. as determined by a particular method

Accession  Name  GO ID       GO term name                     Reference     Evidence code
P00505     GOT2  GO:0004069  aspartate transaminase activity  PMID:2731362  IDA
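The three parts of that definition map directly onto a record type. A minimal sketch: the class and field names are hypothetical and do not follow any particular annotation file format, while the values are the GOT2 row from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOAnnotation:
    accession: str      # which gene product
    name: str
    go_id: str          # 1. the function, process or component
    go_term: str
    reference: str      # 2. as described in a particular reference
    evidence_code: str  # 3. as determined by a particular method

    def statement(self) -> str:
        """Spell the annotation out as the sentence it stands for."""
        return (
            f"{self.name} ({self.accession}) is annotated to "
            f"'{self.go_term}' ({self.go_id}), as described in "
            f"{self.reference}, as determined by {self.evidence_code}"
        )


got2 = GOAnnotation(
    accession="P00505",
    name="GOT2",
    go_id="GO:0004069",
    go_term="aspartate transaminase activity",
    reference="PMID:2731362",
    evidence_code="IDA",
)
```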
Experimental data
Computational analysis
Author statements/ curator inference
(+ Inferred from electronic annotations)
http://www.evidenceontology.org/
Tracking provenance
FIG. 2. Human Nbp35 is a cytosolic protein. (A) EGFP fluorescence of a HeLa cell transiently transfected with a vector encoding a huNbp35-EGFP fusion protein (right) in comparison to the endogenous autofluorescence (AFL) of control cells (left).
(C) Sub-cellular localization of huNbp35 by cell fractionation. […]HuNbp35 exclusively colocalizes with tubulin in the cytosolic fraction, but not with mitochondrial aconitase (mtAconitase) present in the membrane fraction.
Human Nbp35 is a cytosolic protein. • Evidence:
• Fig 2A Immunofluorescence and/or
• Fig 2C subcellular fractionation
GO evidence codes [small excerpt]
TAS, Traceable author statement
NAS, Non-traceable author statement
IDA, Inferred from Direct Assay
IMP, Inferred from Mutant Phenotype
IPI, Inferred from Physical Interaction
Experimental evidence, Methods & Results
Abstract & Introduction
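Evidence codes are often used as a filter, for example to keep only experimentally supported annotations. The sketch below covers just the five codes in the excerpt; the category labels and the sample annotation tuples are hypothetical.

```python
# The five codes from the excerpt: code -> (full name, category label).
EVIDENCE_CODES = {
    "IDA": ("Inferred from Direct Assay", "experimental"),
    "IMP": ("Inferred from Mutant Phenotype", "experimental"),
    "IPI": ("Inferred from Physical Interaction", "experimental"),
    "TAS": ("Traceable author statement", "author statement"),
    "NAS": ("Non-traceable author statement", "author statement"),
}


def is_experimental(code):
    """True for codes backed by experimental evidence; unknown codes count as False."""
    return EVIDENCE_CODES.get(code, ("", ""))[1] == "experimental"


# Hypothetical (accession, GO ID, evidence code) tuples.
annotations = [
    ("P00505", "GO:0004069", "IDA"),
    ("P00505", "GO:0005739", "TAS"),
]
experimental_only = [a for a in annotations if is_experimental(a[2])]
```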
Electronic Annotations
• Quick way of producing large numbers of annotations
• Annotations use less-specific GO terms
• Only source of annotation for ~438,000 non-model organism species
orthology taxon constraints
Broad taxonomic coverage
We provide annotation files for well-studied species…
…as well as less well-studied species that have:
• Complete proteome
• >25% GO annotation coverage
We have annotations for species that may not have a dedicated curation effort, e.g. for 1,400 Solanaceae species we have ~360,000 annotations for ~64,000 proteins
Electronic annotation methods
1. Mapping of external concepts to GO terms, e.g. InterPro2GO, UniProt Keyword2GO, Enzyme Commission2GO
GO:0004715 ; non-membrane spanning protein tyrosine kinase activity
Annotations are high-quality and have an explanation of the method (GO_REF)
Electronic annotation methods
2. Automatic transfer of manual annotations to orthologs (Ensembl compara)
e.g. Human to Macaque, Mouse, Dog, Cow, Guinea Pig, Chimpanzee, Rat, Chicken ...and more
Arabidopsis to Rice, Brachypodium, Maize, Poplar, Grape …and more
http://www.geneontology.org/cgi-bin/references.cgi
An example
Annotations from the source…
ACCESSION  GO ID       GO ASPECT  GO TERM
P04637     GO:0047485  F          protein N-terminus binding
P04637     GO:0051087  F          chaperone binding
P04637     GO:0051721  F          protein phosphatase 2A binding
P04637     GO:0000733  P          DNA strand renaturation
P04637     GO:0006289  P          nucleotide-excision repair
P04637     GO:0006355  P          regulation of transcription, DNA-templated
P04637     GO:0006461  P          protein complex assembly

…are projected on to the target
ACCESSION  GO ID       GO ASPECT  GO TERM
Q549C9     GO:0047485  F          protein N-terminus binding
Q549C9     GO:0051087  F          chaperone binding
Q549C9     GO:0051721  F          protein phosphatase 2A binding
Q549C9     GO:0000733  P          DNA strand renaturation
Q549C9     GO:0006289  P          nucleotide-excision repair
Q549C9     GO:0006355  P          regulation of transcription, DNA-templated
Q549C9     GO:0006461  P          protein complex assembly
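The projection step itself is simple once the source and target are paired. A minimal sketch, assuming that pairing has already been decided elsewhere (in the slides it comes from Ensembl compara); the function name `project` is hypothetical, and only three of the seven rows are shown.

```python
# A subset of the source annotations from the slide:
# (accession, GO ID, aspect, term name).
source_annotations = [
    ("P04637", "GO:0047485", "F", "protein N-terminus binding"),
    ("P04637", "GO:0051087", "F", "chaperone binding"),
    ("P04637", "GO:0006289", "P", "nucleotide-excision repair"),
]


def project(annotations, target_accession):
    """Copy each annotation to the target accession, keeping the GO term unchanged."""
    return [
        (target_accession, go_id, aspect, term)
        for _source, go_id, aspect, term in annotations
    ]


projected = project(source_annotations, "Q549C9")
```

A production pipeline would also record where each projected annotation came from and which method produced it, so that it can be traced and, if needed, withdrawn.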
3. Propagation of GO annotations to protein groups
InterPro
Source of ~93 million GO mappings for ~30 million distinct UniProtKB sequences (Oct 30 2015 release)
Considerations for mapping GO terms
GO mapping to domains: function of domain may not be function of protein
Family members can be experimentally characterised as lacking function: P14210, a serine protease homologue with no proteolytic activity (proteins are reported to GOA to be blacklisted)
Broad families that are functionally diverse: the GHMP kinase superfamily includes
- Galactokinases (EC=2.7.1.6)
- Homoserine kinases (EC=2.7.1.39)
- Mevalonate kinases (EC=2.7.1.36)
- Diphosphomevalonate decarboxylases (EC=4.1.1.33)
Number of annotations in UniProt-GOA database (June 2016)
2,811,622 manual annotations*
280,313,749 electronic annotations
* Includes manual annotations integrated from external model organism and specialist groups
One example: the QuickGO browser
http://www.ebi.ac.uk/QuickGO
• Search GO terms or proteins
• Find sets of GO annotations
• Map-up annotations with GO slims
Questions on how to use QuickGO? Contact [email protected]
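"Map-up" means replacing each specific annotation with the slim terms it sits under. The sketch below is not QuickGO's implementation: it is a toy traversal over a deliberately simplified parent table built from the cellular component examples earlier in the deck, with a hypothetical two-term slim.

```python
# Simplified child -> parents table (relationship types omitted for brevity).
PARENTS = {
    "mitochondrial matrix": ["mitochondrion"],
    "mitochondrial membrane": ["mitochondrion"],
    "large ribosomal subunit": ["ribosome"],
    "small ribosomal subunit": ["ribosome"],
}
SLIM = {"mitochondrion", "ribosome"}  # hypothetical slim


def map_up(term, parents, slim):
    """Return the slim terms reachable from term (the term itself included)."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
        stack.extend(parents.get(current, []))
    return found


slim_terms = map_up("mitochondrial matrix", PARENTS, SLIM)
```

Terms with no ancestor in the slim map to an empty set, which is a useful signal that the slim does not cover part of the data.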
GO term enrichment analysis
• What is it?
• What can you use it for?
• How does it actually work?
• How can I actually do it?
• When is it NOT a good idea to do it?
GO term enrichment analysis
• What is it?
• Most popular type of GO analysis
• Determines which GO terms are more often associated with a specified list of genes/proteins compared with a control list or rest of genome
GO term enrichment analysis
“Our gene list contains targets for GATA1 (orange balls) and SP1 (blue balls) transcription factors (TFs). For each TF, we extract the proportion of targets in the gene list and in the genome to construct the contingency table. Fisher's exact test is used to determine if there is a nonrandom association between the gene list and the specific regulation of a TF.”
• http://bioinfo.cipf.es/docs/renato/simple_enrichment_analysis
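The contingency-table test described in the quote can be sketched with the standard library. This computes a one-sided Fisher's exact test as the upper tail of the hypergeometric distribution; it is an illustration, not the GO Consortium tool, and it applies no multiple-testing correction. The counts in the usage lines are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided p-value: P(X >= k) for X ~ Hypergeometric(N, K, n).

    k: genes in the list annotated with the term
    n: size of the gene list
    K: genes in the background annotated with the term
    N: size of the background (must contain the gene list)
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 8 of 20 listed genes carry the term,
# against 50 of 1000 genes in the background.
p = enrichment_p_value(k=8, n=20, K=50, N=1000)
```

In practice one p-value is computed per GO term, so the results need correcting for the number of terms tested before any are reported as enriched.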
GO term enrichment analysis
• How does it actually work?
• http://geneontology.org/page/go-enrichment-analysis
• Also useful for GO analysis in general:
• http://geneontology.org/faq/what-minimum-information-include-functional-analysis-paper
GO term enrichment analysis
• How can I actually do it?
• Many tools available to do this analysis
• User must decide which is best for their analysis
• We’ll focus on the tool provided by the GO Consortium
• Be aware that there are numerous third-party tools and that they do not all use up-to-date GO data
GO term enrichment analysis
• How do you get to the GO TE tool?
• From front page of GO website
• From AmiGO
Hands on - Dataset
• Download http://tinyurl.com/IDs-for-enrichment
• Go to http://geneontology.org
• Run the enrichment analysis
Caveats
• When can you NOT do an enrichment analysis?
• Too few target genes/proteins
• Genes/proteins of interest are not present in your background set (e.g. array)
• Genes/proteins of interest are not expressed/translated in your sample(s)
Slim generation for industry
• Collaboration funded by Roche
• Need a custom GO slim for analysis of genesets of interest
• Need to be descriptive enough
• Without redundancy
• Internal proprietary vocabulary – hard to maintain
• Desire to automatically map to GO
http://www.swat4ls.org/wp-content/uploads/2015/10/SWAT4LS_2015_paper_44.pdf
• Mapping query: participant_OR_reg_participant some cannabinoid
• Description: “A process in which a cannabinoid participates, or that regulates a process in which a cannabinoid participates.”
Results
• We have successfully mapped 84% of terms from RCV (308/365) to OWL queries that can be used to replicate some proportion of the original manual mapping.
• In addition, these queries find 1000s of terms that were missed in the original mapping.
David Osumi-Sutherland
https://www.ebi.ac.uk/metagenomics/projects/SRP033553/samples/SRS512695/runs/SRR1045093/results/versions/3.0
Acknowledgements
• GO editors and developers
• GO annotators
• The Gene Ontology (GO) Consortium
• Samples, Phenotype and Ontology team (Helen Parkinson)
• Protein Function Content team (Claire O’Donovan)
• Funding: EMBL-EBI, National Human Genome Research Institute (NHGRI)
Thank you for your attention!
Contact Gene Ontology Annotation:
Contact Gene Ontology: http://geneontology.org/form/contact-go