Data mining with the Gene Ontology

Josep Lluís Mosquera

April 2005

Grup de Recerca en Estadística i Bioinformàtica

GOing into Biological Meaning

Motivation

• High throughput methodologies pose different challenges:

1) The experiment in itself
2) Statistical analysis of results
3) Biological interpretation

• In gene-expression microarray studies, independently of the technology or analysis methods used, one generally obtains long lists of genes.

QUESTION: What does this mean?

Rationale

• Bioinformatics resources hold data, often in the form of sequences which are annotated in scientific natural language.

• Annotation in this form is human-readable and understandable, but it is not easy to interpret computationally.

PROBLEM: There is no common set of terms and descriptions shared across all organisms.

What can we do?

• An ontology provides a set of vocabulary terms covering a conceptual domain. These terms:

Must:
o have a definition
o be placed within a structure of relationships

May have one or more parents.

May be linked by two kinds of relationships:
o ‘is-a’ between parent and child
o ‘part-of’ between part and whole

• In this context, the Gene Ontology (GO) is a very useful resource for the initial interpretation of gene lists.

Gene Ontology Consortium

But... what’s the GO?

• It is an ontology with clear definitions of its terms and of the relationships between them. It starts at the top level (GO), whose children are three independent ontologies:
o Molecular Functions (MF)
o Biological Processes (BP)
o Cellular Components (CC)

Graphical Overview

• There are more than 16K nodes in GO

• GO consists of two essential parts:

The current ontologies:
o Vocabulary
o Structure

The current annotations:
o Create a link between the known genes and the associated GO terms that define their function.

GO database

THE CHALLENGE: Use annotations and structure of the GOs to understand the biological meaning in a large dataset of genes.

Genes and GO terms

• Each gene can have several associated GO terms

• Each GO term can be connected to several other GO terms higher in the hierarchy; these are associated with the gene too.

• We call:
o path: the list of GO terms between the root and the annotated GO term.
o split: each GO term in the path.
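The path and split definitions can be illustrated with a minimal Python sketch. The hierarchy below is a made-up toy example (the term names are hypothetical, not real GO identifiers), and `paths_to_root` and `splits` are illustrative helper names, not part of any GO tool.

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchy: child term -> list of parent terms.
# Term names are invented for illustration only.
PARENTS: Dict[str, List[str]] = {
    "GO": [],
    "BP": ["GO"],
    "metabolism": ["BP"],
    "signalling": ["BP"],
    "kinase cascade": ["metabolism", "signalling"],  # a term with two parents
}


def paths_to_root(term: str, parents: Dict[str, List[str]]) -> List[List[str]]:
    """Return every path from the root down to `term`."""
    if not parents[term]:  # the root has no parents
        return [[term]]
    paths = []
    for parent in parents[term]:
        for path in paths_to_root(parent, parents):
            paths.append(path + [term])
    return paths


def splits(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every GO term lying on at least one path to `term`."""
    return {t for path in paths_to_root(term, parents) for t in path}


for path in paths_to_root("kinase cascade", PARENTS):
    print(" -> ".join(path))
print(sorted(splits("kinase cascade", PARENTS)))
```

Because a term may have several parents, one annotated term can yield several paths, which is why the number of splits grows quickly with the number of genes.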

Our context

• A list of 100 genes will usually have many hundreds of associated GO terms and several thousand associated splits.

OBJECTIVE: To attach biological meaning to lists of differentially expressed genes by means of the Gene Ontology (GO).

Statistical Methods

• Let us consider:
o N genes on a microarray:
  M belong to a given GO term category (A)
  N-M do not belong to it (category Ac)
o K of the N genes are selected and assigned to a given class (e.g. regulated genes)
o x genes of these K will be in A (EXAMPLE)

STATISTICAL HYPOTHESIS:

H0: GO category A is equally represented on the microarray and in the class of differentially regulated genes

H1: GO category A is over- (or under-) represented in the class of differentially regulated genes relative to the microarray

Hypergeometric Distribution (1/2)

We ask: Assuming sampling without replacement, what is the probability of having exactly x genes of category A?

• The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modelled by a hypergeometric distribution with parameters (N, M, K).
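As a minimal sketch, the probability of exactly x genes in category A can be computed with the Python standard library alone. The function name `hypergeom_pmf` and the toy numbers (N=20, M=5, K=4) are assumptions made for illustration.

```python
from math import comb


def hypergeom_pmf(x: int, N: int, M: int, K: int) -> float:
    """P(X = x): exactly x of the K selected genes fall in category A.

    N = genes on the array, M = genes in category A, K = genes selected.
    """
    # Values of x outside the feasible range have probability zero.
    if x < max(0, K - (N - M)) or x > min(K, M):
        return 0.0
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)


# Toy example: the probabilities over all feasible x add up to 1.
probs = [hypergeom_pmf(x, N=20, M=5, K=4) for x in range(0, 5)]
print(probs, sum(probs))
```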

Hypergeometric Distribution (2/2)

• So, under the null hypothesis, the p_value of having x or more genes in A will be:

$$\text{p\_value} = P(X \ge x \mid H_0)$$

• This corresponds to a one-sided test in which small p_values relate to over-represented GO terms.

• For under-represented categories, it can be calculated as 1 - p_value
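The upper-tail p_value can be sketched by summing hypergeometric probabilities from x up to the largest feasible count. The function name `over_representation_p_value` and the toy numbers are assumptions for illustration; exact integer arithmetic is used until the final division.

```python
from math import comb


def over_representation_p_value(x: int, N: int, M: int, K: int) -> float:
    """P(X >= x | H0) for X ~ Hypergeometric(N, M, K)."""
    lower = max(x, 0, K - (N - M))   # smallest feasible count that is >= x
    upper = min(K, M)                # largest feasible count
    numerator = sum(
        comb(M, i) * comb(N - M, K - i) for i in range(lower, upper + 1)
    )
    return numerator / comb(N, K)


# Toy example: 20 genes on the array, 5 in category A, 4 selected.
print(over_representation_p_value(2, N=20, M=5, K=4))
```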

Disadvantages

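• The hypergeometric distribution is rather difficult and time-consuming to calculate when N is large.

• It can be proved that, as N grows,

$$\mathrm{Hyp}(N, M, K) \longrightarrow \mathrm{Bin}\left(K, \frac{M}{N}\right)$$

• Using this approximation, the p_value for over-represented GO terms can be calculated as

$$\text{p\_value} \approx \sum_{i=x}^{K} \binom{K}{i} \left(\frac{M}{N}\right)^{i} \left(1 - \frac{M}{N}\right)^{K-i}$$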
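A small sketch can compare the exact hypergeometric tail with its binomial approximation. The function names and the numbers (N=10000, M=500, K=50, x=6) are assumptions chosen only to show that the two tails are close when N is much larger than K.

```python
from math import comb


def hypergeom_tail(x: int, N: int, M: int, K: int) -> float:
    """Exact P(X >= x) for X ~ Hypergeometric(N, M, K)."""
    lower = max(x, 0, K - (N - M))
    numerator = sum(
        comb(M, i) * comb(N - M, K - i) for i in range(lower, min(K, M) + 1)
    )
    return numerator / comb(N, K)


def binomial_tail(x: int, K: int, p: float) -> float:
    """P(X >= x) for X ~ Binomial(K, p), used here with p = M / N."""
    return sum(
        comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(max(x, 0), K + 1)
    )


# Illustrative values, not taken from the slides.
exact = hypergeom_tail(6, N=10000, M=500, K=50)
approx = binomial_tail(6, K=50, p=500 / 10000)
print(exact, approx)
```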

Alternative approaches

• Let us assume

               Differentially regulated genes (D)   Dc     Genes on microarray
Category A     n11                                  n12    N1.
Ac             n21                                  n22    N2.
               N.1                                  N.2    N..

where N=N.., M=N1., K=N.1 and x=n11

• Using this notation, alternatives include:
o test for equality of two proportions
o Fisher’s Exact Test
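The mapping between (N, M, K, x) and the 2×2 table can be sketched as follows. The function name `contingency_table` and the toy numbers are hypothetical; the cell formulas follow from N=N.., M=N1., K=N.1 and x=n11.

```python
from typing import List


def contingency_table(N: int, M: int, K: int, x: int) -> List[List[int]]:
    """Build [[n11, n12], [n21, n22]] from the counts N, M, K and x."""
    n11 = x                 # in category A and differentially regulated
    n12 = M - x             # in category A, not differentially regulated
    n21 = K - x             # not in A, differentially regulated
    n22 = N - M - K + x     # not in A, not differentially regulated
    return [[n11, n12], [n21, n22]]


table = contingency_table(N=20, M=5, K=4, x=2)
print(table)
```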

Chi-square Test (2)

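• The χ² statistic can be calculated as

$$\chi^2 = \frac{N_{..}\left(\left|n_{11} n_{22} - n_{12} n_{21}\right| - \frac{N_{..}}{2}\right)^2}{N_{1.}\, N_{2.}\, N_{.1}\, N_{.2}} \sim \chi^2_1$$

PROBLEMS: It cannot:
1. Distinguish between under- and over-represented gene categories.
2. Be used for small samples, i.e. when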
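A minimal sketch of the 2×2 chi-square statistic, with an optional continuity correction, is given below. The function name `chi_square_2x2` and the example counts are assumptions. The p_value uses the identity P(χ²₁ > s) = erfc(√(s/2)), which holds for one degree of freedom.

```python
from math import erfc, sqrt
from typing import Tuple


def chi_square_2x2(n11: int, n12: int, n21: int, n22: int,
                   continuity: bool = True) -> Tuple[float, float]:
    """Return (statistic, p_value) for a 2x2 contingency table."""
    row1, row2 = n11 + n12, n21 + n22      # N1., N2.
    col1, col2 = n11 + n21, n12 + n22      # N.1, N.2
    total = row1 + row2                    # N..
    diff = abs(n11 * n22 - n12 * n21)
    if continuity:
        diff = max(diff - total / 2, 0.0)  # continuity correction
    stat = total * diff**2 / (row1 * row2 * col1 * col2)
    p_value = erfc(sqrt(stat / 2))         # upper tail, 1 degree of freedom
    return stat, p_value


# Swapping the two rows turns over-representation into under-representation,
# yet the statistic is unchanged: the test cannot tell the two apart.
print(chi_square_2x2(10, 20, 30, 40))
print(chi_square_2x2(30, 40, 10, 20))
```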

Fisher’s Exact Test

• This test considers the marginal totals fixed and uses the hypergeometric distribution to calculate the probability of observing each individual table as:

$$P = \frac{N_{1.}!\, N_{2.}!\, N_{.1}!\, N_{.2}!}{N_{..}!\, n_{11}!\, n_{12}!\, n_{21}!\, n_{22}!}$$

• One can calculate a table containing all possible combinations of n11, n12, n21, n22.

• The p_value for a particular occurrence is the sum of all probabilities lower than or equal to the probability corresponding to the observed combination.
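The procedure can be sketched in a few lines: with the margins fixed, every possible table is determined by its top-left cell, so it is enough to loop over that cell. The function name `fisher_exact_two_sided` is hypothetical, and the small relative tolerance for ties between floating-point probabilities is an implementation choice of this sketch.

```python
from math import comb


def fisher_exact_two_sided(n11: int, n12: int, n21: int, n22: int) -> float:
    """Two-sided Fisher p_value for the table [[n11, n12], [n21, n22]]."""
    row1 = n11 + n12                 # N1.
    col1 = n11 + n21                 # N.1
    total = n11 + n12 + n21 + n22    # N..
    row2 = total - row1              # N2.

    def table_probability(a: int) -> float:
        # Probability of the table whose top-left cell equals a,
        # given the fixed marginal totals.
        return comb(row1, a) * comb(row2, col1 - a) / comb(total, col1)

    observed = table_probability(n11)
    p_value = 0.0
    for a in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(a)
        if p <= observed * (1 + 1e-9):   # tolerance for floating-point ties
            p_value += p
    return min(p_value, 1.0)


print(fisher_exact_two_sided(3, 1, 1, 3))
```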

Correction for Multiple Tests

• Because significance is tested for a large number of GO terms, p_values have to be corrected for multiple testing. For instance:
o Methods controlling the False Discovery Rate (FDR):
  Benjamini and Hochberg (assuming independence)
  Benjamini and Yekutieli (dropping independence)
o Methods controlling the Family-Wise Error Rate (FWER):
  Holm correction
  Westfall and Young
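Two of these corrections are simple enough to sketch directly: Benjamini and Hochberg (FDR) and Holm (FWER), both returning adjusted p_values in the input order. The function names and the four example p_values are assumptions. Westfall and Young relies on resampling and is not shown.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p_values (step-up, controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # from largest p_value to smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def holm(p_values: List[float]) -> List[float]:
    """Holm adjusted p_values (step-down, controls the FWER)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):      # rank 0 is the smallest p_value
        running_max = max(running_max, min(1.0, p_values[i] * (m - rank)))
        adjusted[i] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]           # made-up p_values for four GO terms
print(benjamini_hochberg(raw))
print(holm(raw))
```

A GO term is then reported as significant when its adjusted p_value falls below the chosen threshold.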

Example

N = 9177 genes on microarray

M = 467 in GO category A

N-M = 8710 in Ac

K = 173 genes picked randomly

x = 51 genes of category A
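The example can be worked through numerically with a short sketch that uses only the Python standard library. It computes the count expected by chance and the exact hypergeometric upper-tail p_value; the variable names are chosen for illustration.

```python
from math import comb

# Values from the example.
N, M, K, x = 9177, 467, 173, 51

# Number of category-A genes expected among K randomly picked genes.
expected = K * M / N

# Exact P(X >= x) under the hypergeometric model; integer arithmetic is
# used for the sum and only the final ratio is converted to a float.
numerator = sum(comb(M, i) * comb(N - M, K - i) for i in range(x, min(K, M) + 1))
p_value = numerator / comb(N, K)

print(f"expected by chance: {expected:.2f}")
print(f"p_value for over-representation: {p_value:.3g}")
```

With about 8.8 genes expected by chance, observing 51 is far above what random picking would give, so the p_value is extremely small and category A appears over-represented.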

Miguel.... GO!
