Data mining with the Gene Ontology Josep Lluís Mosquera April 2005 Grup de Recerca en Estadística i Bioinformàtica GOing into Biological Meaning


Page 1: Data mining with the Gene Ontology

Data mining with the Gene Ontology

Josep Lluís Mosquera

April 2005

Grup de Recerca en Estadística i Bioinformàtica

GOing into Biological Meaning

Page 2: Data mining with the Gene Ontology


Motivation

• High-throughput methodologies pose different challenges:
  1) The experiment itself
  2) Statistical analysis of results
  3) Biological interpretation

• In gene-expression microarray studies, independently of the technology or analysis methods used, one generally obtains long lists of genes.

QUESTION: What does this mean?

Page 3: Data mining with the Gene Ontology


Rationale

• Bioinformatics resources hold data, often in the form of sequences which are annotated in scientific natural language.

• Annotation in this form is human-readable and understandable, but it is not easy to interpret computationally.

PROBLEM: There is no set of terms and descriptions that is common to all organisms.

Page 4: Data mining with the Gene Ontology


What can we do?

• An ontology provides a set of vocabulary terms covering a conceptual domain. These terms:
  Must:
    o have a definition
    o be placed within a structure of relationships
  May have one or more parents.
  May be linked by two kinds of relationships:
    o ‘is-a’ between parent and child
    o ‘part-of’ between part and whole

• In this context, the Gene Ontology (GO) is a very useful resource for the initial interpretation of gene lists.

Page 5: Data mining with the Gene Ontology


Gene Ontology Consortium

Page 6: Data mining with the Gene Ontology


But... what’s the GO?

• It is an ontology with clear definitions of its terms and of the relationships between them, starting at the top level (GO), whose children are three independent ontologies.

GO

Molecular Functions (MF) Biological Processes (BP) Cellular Components (CC)

Page 7: Data mining with the Gene Ontology


Graphical Overview

• There are more than 16K nodes in GO

Page 8: Data mining with the Gene Ontology


GO database

• Consists of two essential parts:
  The current ontologies:
    o Vocabulary
    o Structure
  The current annotations:
    o Create a link between the known genes and the associated GO terms that define their function.

THE CHALLENGE: Use the annotations and the structure of GO to understand the biological meaning in a large dataset of genes.

Page 9: Data mining with the Gene Ontology


Genes and GO terms

• Each gene can have several associated GO terms

• Each GO term can be connected to several other GO terms higher in the hierarchy; these are associated with the gene too.

• We call:
  o path: the list of GO terms between the root and the annotated GO term.
  o split: each GO term in the path.
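The path and split definitions can be illustrated on a toy hierarchy. The sketch below is only an illustration: the term names and the `PARENTS` mapping are made up for the example (they are not real GO identifiers), and the two helper functions are hypothetical names, not part of any GO library.

```python
# Toy hierarchy: each term maps to its list of parents.
# Term names are invented for illustration; they are not real GO IDs.
PARENTS = {
    "GO": [],
    "BP": ["GO"],
    "metabolism": ["BP"],
    "cell death": ["BP"],
    "protein metabolism": ["metabolism"],
    "apoptosis": ["cell death"],
    "caspase activation": ["apoptosis", "protein metabolism"],
}


def paths_to_root(term, parents=PARENTS):
    """Return every path from the root down to `term`, each as a list of terms."""
    if not parents[term]:
        return [[term]]  # the root is a path of length one
    return [
        path + [term]
        for parent in parents[term]
        for path in paths_to_root(parent, parents)
    ]


def splits(term, parents=PARENTS):
    """Return the set of all terms lying on some path to `term` (itself included)."""
    return {t for path in paths_to_root(term, parents) for t in path}


if __name__ == "__main__":
    for path in paths_to_root("caspase activation"):
        print(" -> ".join(path))
    print(sorted(splits("caspase activation")))
```

A term with two parents yields two paths, which is why a short gene list can produce far more splits than annotated terms.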

Page 10: Data mining with the Gene Ontology


Page 11: Data mining with the Gene Ontology


Our context

• A list of 100 genes will usually have many hundreds of associated GO terms and several thousand associated splits.

OBJECTIVE: Give biological meaning to lists of differentially expressed genes by means of the Gene Ontology (GO).

Page 12: Data mining with the Gene Ontology


Statistical Methods

• Let us consider:
  o N genes on a microarray:
      M belong to a given GO term category (A)
      N-M do not belong to it (category A^c)
  o K of the N genes are selected and assigned to a given class (e.g. regulated genes)
  o x genes of these K will be in A (EXAMPLE)

STATISTICAL HYPOTHESIS:

H0: GO category A is equally represented on the microarray and in the class of differentially regulated genes

H1: GO category A is more (or less) represented on the microarray than in the class of differentially regulated genes

Page 13: Data mining with the Gene Ontology


Hypergeometric Distribution (1/2)

We ask: Assuming sampling without replacement, what is the probability of having exactly x genes of category A?

• The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modelled by a hypergeometric distribution with parameters (N, M, K):

$$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}}$$

Page 14: Data mining with the Gene Ontology


Hypergeometric Distribution (2/2)

• So, under the null hypothesis, the p_value of having x or more genes in A will be:

$$p\_value = P(X \ge x \mid H_0) = \sum_{k=x}^{K} \frac{\binom{M}{k}\binom{N-M}{K-k}}{\binom{N}{K}}$$

• This corresponds to a one-sided test in which small p_values relate to over-represented GO terms.

• For under-represented categories, the p_value can be calculated as 1 - p_value.
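As a minimal sketch of the two hypergeometric quantities, the code below evaluates the point probability P(X = x) and the upper-tail p_value with exact integer binomial coefficients from Python's standard library. The function names `hypergeom_pmf` and `hypergeom_pvalue` are chosen for this example only; the symbols N, M, K and x follow the slides.

```python
from math import comb


def hypergeom_pmf(x, N, M, K):
    """P(X = x): exactly x of the K selected genes belong to category A."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)


def hypergeom_pvalue(x, N, M, K):
    """One-sided p_value P(X >= x | H0) for over-representation."""
    upper = min(K, M)  # X cannot exceed either the list size or the category size
    numerator = sum(comb(M, k) * comb(N - M, K - k) for k in range(x, upper + 1))
    return numerator / comb(N, K)


if __name__ == "__main__":
    # Small hand-checkable case: 10 genes, 4 in A, 3 selected, 2 or more in A.
    print(hypergeom_pmf(2, N=10, M=4, K=3))
    print(hypergeom_pvalue(2, N=10, M=4, K=3))
```

Summing the integer numerators before dividing keeps the calculation exact until the final division, which matters when the p_value is very small.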

Page 15: Data mining with the Gene Ontology


Disadvantages

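• The hypergeometric distribution is rather difficult and time-consuming to calculate when N is large.

• It can be shown that

$$\mathrm{Hip}(N, M, K) \xrightarrow{N \to \infty} \mathrm{Bin}\left(K, \frac{M}{N}\right)$$

• Using this approximation, the p_value for over-represented GO terms can be calculated as

$$p\_value = 1 - \sum_{i=0}^{x-1} \binom{K}{i} \left(\frac{M}{N}\right)^{i} \left(1 - \frac{M}{N}\right)^{K-i}$$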
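The binomial approximation can be sketched in a few lines. This is an illustration under the slide's notation (success probability M/N, K trials); the function name `binomial_pvalue` is an assumption made for the example.

```python
from math import comb


def binomial_pvalue(x, N, M, K):
    """Approximate P(X >= x) with X ~ Bin(K, M/N), as 1 minus the lower tail."""
    p = M / N
    lower_tail = sum(comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(x))
    return 1.0 - lower_tail


if __name__ == "__main__":
    # Same small case as a hand check: p = 0.4, K = 3, x = 2.
    print(binomial_pvalue(2, N=10, M=4, K=3))
```

One caveat of the "1 minus lower tail" form is floating-point cancellation: when the true p_value is far below about 1e-16, the subtraction returns 0 or noise, so summing the upper tail directly is the safer choice for very significant terms.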

Page 16: Data mining with the Gene Ontology


Alternative approaches

• Let us assume

                 Differentially regulated genes (D)    D^c    Genes on microarray
  Category A     n11                                   n12    N1.
  A^c            n21                                   n22    N2.
                 N.1                                   N.2    N..

where N=N.., M=N1., K=N.1 and x=n11

• Using this notation, alternatives include:
  o χ² test for equality of two proportions
  o Fisher’s Exact Test

Page 17: Data mining with the Gene Ontology


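Chi-square Test (χ²)

• The χ² statistic can be calculated as

$$\chi^2 = \frac{N_{\cdot\cdot}\left(\left|n_{11}n_{22} - n_{12}n_{21}\right| - \frac{N_{\cdot\cdot}}{2}\right)^2}{N_{1\cdot}\,N_{2\cdot}\,N_{\cdot 1}\,N_{\cdot 2}} \sim \chi^2_1$$

PROBLEMS: It cannot:
1. Distinguish between under- and over-represented gene categories.
2. Be used for small samples, i.e. when

$$\frac{N_{i\cdot}\,N_{\cdot j}}{N_{\cdot\cdot}} < 5$$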
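A short sketch of the test on a 2x2 table follows. It assumes the statistic on this slide is the continuity-corrected form N..(|n11 n22 - n12 n21| - N../2)² / (N1. N2. N.1 N.2), and it uses the identity P(χ²₁ > s) = erfc(√(s/2)) to get a p_value from the standard library alone. The function name and return layout are choices made for this example.

```python
from math import erfc, sqrt


def chi_square_2x2(n11, n12, n21, n22):
    """Return (statistic, p_value, smallest expected count) for a 2x2 table."""
    r1, r2 = n11 + n12, n21 + n22  # row totals N1., N2.
    c1, c2 = n11 + n21, n12 + n22  # column totals N.1, N.2
    n = r1 + r2                    # grand total N..
    # Continuity-corrected statistic (assumed reading of the slide's formula).
    stat = n * (abs(n11 * n22 - n12 * n21) - n / 2) ** 2 / (r1 * r2 * c1 * c2)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    # Expected counts N_i. * N_.j / N..; values below 5 signal a small sample.
    min_expected = min(r * c / n for r in (r1, r2) for c in (c1, c2))
    return stat, p_value, min_expected


if __name__ == "__main__":
    stat, p_value, min_expected = chi_square_2x2(10, 20, 30, 40)
    print(stat, p_value, min_expected)
```

Because the statistic depends only on the absolute difference of the cross products, swapping the two columns gives the same value, which is the first problem listed above.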

Page 18: Data mining with the Gene Ontology


Fisher’s Exact Test

• This test treats the marginal totals as fixed and uses the hypergeometric distribution to calculate the probability of observing each individual table as:

$$P = \frac{N_{1\cdot}!\,N_{2\cdot}!\,N_{\cdot 1}!\,N_{\cdot 2}!}{N_{\cdot\cdot}!\,n_{11}!\,n_{12}!\,n_{21}!\,n_{22}!}$$

• One can calculate a table containing all possible combinations of n11, n12, n21 and n22.

• The p_value for a particular occurrence is the sum of all probabilities lower than or equal to the probability corresponding to the observed combination.
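The procedure described on this slide can be sketched as follows: enumerate every table compatible with the fixed margins, compute each table's probability, and sum those that are no larger than the observed one. The function name `fisher_exact_2x2` and the small relative tolerance used when comparing floating-point probabilities are choices made for this example.

```python
from math import comb


def fisher_exact_2x2(n11, n12, n21, n22):
    """Two-sided Fisher p_value: sum of table probabilities <= the observed one."""
    r1, r2 = n11 + n12, n21 + n22  # row totals N1., N2.
    c1 = n11 + n21                 # first column total N.1
    n = r1 + r2                    # grand total N..
    denom = comb(n, c1)

    def table_probability(a):
        # Equivalent to N1.! N2.! N.1! N.2! / (N..! n11! n12! n21! n22!)
        # for the table whose top-left cell is a.
        return comb(r1, a) * comb(r2, c1 - a) / denom

    lowest = max(0, c1 - r2)   # smallest top-left cell allowed by the margins
    highest = min(r1, c1)      # largest top-left cell allowed by the margins
    observed = table_probability(n11)
    threshold = observed * (1 + 1e-9)  # tolerate rounding when comparing ties
    return sum(
        p
        for p in (table_probability(a) for a in range(lowest, highest + 1))
        if p <= threshold
    )


if __name__ == "__main__":
    # Margins of 4 and 4; the five possible tables have weights 1, 16, 36, 16, 1 over 70.
    print(fisher_exact_2x2(3, 1, 1, 3))
```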

Page 19: Data mining with the Gene Ontology


Correction for Multiple Tests

• As the number of GO terms tested for significance is large, p_values have to be corrected for multiple testing. For instance:
  o Methods controlling the False Discovery Rate (FDR):
      Benjamini and Hochberg (assuming independence)
      Benjamini and Yekutieli (dropping independence)
  o Methods controlling the Family-Wise Error Rate (FWER):
      Holm correction
      Westfall and Young
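Two of the listed corrections are simple enough to sketch directly: Holm (FWER) and Benjamini and Hochberg (FDR). The code below returns adjusted p_values in the original input order. Function names are assumptions for this example; Benjamini and Yekutieli and the resampling-based Westfall and Young method are not covered.

```python
def holm(pvalues):
    """Holm step-down adjusted p_values (controls the family-wise error rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p_values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the rank-th smallest p_value by (m - rank), keep monotone, cap at 1.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjusted p_values (controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank of this p_value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print(holm(raw))
    print(benjamini_hochberg(raw))
```

In a GO analysis the input list would hold one raw p_value per tested GO term, and a term is reported when its adjusted value falls below the chosen threshold.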

Page 20: Data mining with the Gene Ontology


Example

N = 9177 genes on microarray

M = 467 in GO category A

N-M = 8710 in A^c

K = 173 genes picked randomly

x = 51 genes of category A
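The numbers on this slide can be plugged into the hypergeometric model and its binomial approximation. The sketch below is an illustration, not output from the original talk: it computes the count expected by chance (K·M/N) and both upper-tail p_values, summing the upper tail directly so that very small values are not lost to rounding.

```python
from math import comb

# Values taken from the Example slide.
N, M, K, x = 9177, 467, 173, 51

# Number of category-A genes expected in a random list of K genes.
expected = K * M / N

# Exact hypergeometric tail P(X >= x), with integer arithmetic until the final division.
exact = sum(
    comb(M, k) * comb(N - M, K - k) for k in range(x, min(K, M) + 1)
) / comb(N, K)

# Binomial approximation of the same tail, with success probability M/N.
p = M / N
approx = sum(comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(x, K + 1))

if __name__ == "__main__":
    print(f"expected by chance: {expected:.2f}")
    print(f"hypergeometric p_value: {exact:.3e}")
    print(f"binomial p_value:       {approx:.3e}")
```

With about 8.8 genes expected by chance and 51 observed, both tails are far below any usual significance threshold, so category A would be reported as over-represented.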

Page 21: Data mining with the Gene Ontology


Miguel.... GO!