Alex Lewin (Imperial College Centre for Biostatistics)
Ian Grieve (IC Microarray Centre)
Elena Kulinskaya (IC Statistical Advisory Service)
Improving Interpretation in Gene Set Enrichment Analysis
Introduction
• Microarray experiment → list of differentially expressed (DE) genes
• Genes belong to categories of Gene Ontology (GO)
• Are some GO categories (groups of genes) over-represented amongst the DE genes?
Contents
• Grouping Gene Ontology categories can improve interpretation of gene set enrichment analysis
• Fuzzy decision rules for multiple testing with discrete data
Gene Ontology (GO)
Database of biological terms
Arranged in graph connecting related terms: links from more general to more specific terms
For each node, can define ancestor and descendant terms
Directed Acyclic Graph
~16,000 terms
from QuickGO website (EBI)
Gene Annotations
• Genes/proteins annotated to relevant GO terms
  – Gene may be annotated to several GO terms
  – GO term may have 1000s of genes annotated to it (or none)
• Gene annotated to term A → annotated to all ancestors of A
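The propagation rule above can be sketched in a few lines of Python. This is only an illustration: the parent links in `PARENTS` and the gene names are assumptions made up for the example, not taken from the GO database.

```python
# Toy DAG: each term maps to its parent terms (links are illustrative assumptions).
PARENTS = {
    "response to wounding": ["response to stress", "response to external stimulus"],
    "response to stress": ["response to stimulus"],
    "response to external stimulus": ["response to stimulus"],
    "response to stimulus": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen


def propagate(direct_annotations, parents=PARENTS):
    """Map each gene to its direct terms plus all ancestors of those terms."""
    full = {}
    for gene, terms in direct_annotations.items():
        closure = set(terms)
        for t in terms:
            closure |= ancestors(t, parents)
        full[gene] = closure
    return full


# Hypothetical direct annotations for two genes.
direct = {"geneA": {"response to wounding"}, "geneB": {"response to stress"}}
full = propagate(direct)
```

Here `full["geneA"]` contains all five terms, because every other term in the toy graph is an ancestor of "response to wounding".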
Find GO terms over-represented amongst differentially expressed genes
For each GO term, compare:
proportion of differentially expressed genes annotated to that term
v.
proportion of non-differentially expressed genes annotated to that term
Fisher’s test p-value for each GO term.
Multiple testing considerations → threshold below which p-values are declared significant.
Many websites do this type of analysis, e.g. the FatiGO website http://fatigo.bioinfo.cnio.es/
          DE     not DE   Total
GO        22              467
not GO
Total     173             7847
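As a rough sketch of the per-term test, the counts shown here (22 DE genes among 467 annotated genes, 173 DE genes among 7847 in total) can be fed into a one-sided Fisher test computed directly from the hypergeometric distribution. The choice of the upper tail (testing for over-representation) and all variable names are assumptions of this example.

```python
from math import comb

N_TOTAL = 7847  # all genes in the table
N_DE = 173      # differentially expressed genes
N_GO = 467      # genes annotated to the GO term
X_OBS = 22      # DE genes annotated to the GO term


def hypergeom_pmf(k, n_total=N_TOTAL, n_de=N_DE, n_go=N_GO):
    """P(X = k) with the table margins held fixed."""
    return comb(n_de, k) * comb(n_total - n_de, n_go - k) / comb(n_total, n_go)


# The two proportions being compared.
prop_de = X_OBS / N_DE                          # DE genes annotated to the term
prop_not_de = (N_GO - X_OBS) / (N_TOTAL - N_DE)  # non-DE genes annotated to the term

# One-sided Fisher p-value (assumed direction: over-representation).
k_max = min(N_DE, N_GO)
p_upper = sum(hypergeom_pmf(k) for k in range(X_OBS, k_max + 1))

print(f"{prop_de:.3f} vs {prop_not_de:.3f}, one-sided p = {p_upper:.2e}")
```

About 12.7% of DE genes carry the annotation against about 5.8% of non-DE genes, which is the contrast the test assesses.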
Difficulties in Testing GO terms
Interpretation: many terms close in the graph may be found significant – or not significant but many low p-values close together in the graph
Statistical Power: many terms have few genes annotated
Discrete statistics: p-values not Uniform under null
Grouping GO terms
Use the Poset Ontology Categorizer (POSOC)
Joslyn et al. 2004
Software which groups terms based on
- pseudo-distance between terms
- ‘coverage’ of genes
Example: for data used here, reduces ~16,000 terms to 76 groups
Example: genes associated with the insulin-resistance gene Cd36
Knock-out and wildtype mice
Bayesian hierarchical model gives posterior probabilities (pg) of being differentially expressed
Most differentially expressed:
pg > 0.5 (280 genes)
Least differentially expressed:
pg < 0.2 (11171 genes)
Example Results
Individual term tests
Used FatiGO website
Multiple testing corrections (Benjamini and Hochberg FDR) done separately for each ‘level’
Found no GO terms significant when FDR controlled at 5%
Group tests
POSOC on all genes on U74A chip, gives 76 groups
3 groups found significant when controlling FDR at 5%
Comparison of Individual and Group Tests
Rank in FatiGO (smallest p-values)      Membership of POSOC group                   Significant
1: response to external stimulus        IA                                          IA
2: resp. to pest, pathogen or parasite  response to p.p.p.                          yes
3: response to wounding                 response to wounding                        yes
4: organismal movement                  IA                                          IA
5: response to biotic stimulus          IA                                          IA
6: neurophysiological process           -                                           -
7: response to stress                   IA                                          IA
8: inflammatory response                immune resp, resp. to ppp, resp to wound    yes
9: transmission of nerve impulse        -                                           -
10: neuromuscular physiological proc.   -                                           -
11: defense response                    IA                                          IA
12: immune response                     immune resp, resp. to ppp, resp to wound    yes
13: chemotaxis                          immune resp, resp. to ppp, resp to wound    yes
                                        chemotaxis                                  no (at 5%)
                                        cell-migration                              no
14: nucleobase, nucleoside, nuc …       -                                           -
15: cell-cell signalling                -                                           -
IA = immediate ancestor of significant POSOC group
Comparison of Individual and Group Tests
[Figure: part of the GO graph below Biological process, showing the terms Physiological process, Organismal movement, Response to stimulus, Response to external stimulus, Response to biotic stimulus, Response to stress, Response to wounding, Defense response, Inflammatory response, Response to other organism, Response to pest, pathogen or parasite, and Immune response. Legend: terms that rank high individually (smallest p-values); terms significant in group tests (and ranking high individually).]
Discrete test statistics
Null hypothesis determined by margins of 2x2 table
Often very small no. possible values for cells → small no. possible p-values
          DE     not DE   Total
GO        X               467
not GO
Total     173             7847
Null Hypothesis:
X ~ HyperGeom(173, 7847-173, 467)
X = 0,…,173
Discrete test statistics
          DE     not DE   Total
GO        X               467
not GO
Total     173             7847
p-value p(x) = P( X ≤ x | null )
P( p ≤ α | null) ≠ α for most α
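To see why P( p ≤ α | null ) generally differs from α, one can list the p-values that are actually attainable under the hypergeometric null. The sketch below uses the margins from the table and the definition p(x) = P( X ≤ x | null ) given here; the function names are assumptions of this example.

```python
from itertools import accumulate
from math import comb


def pmf(k, n_total=7847, n_de=173, n_go=467):
    """P(X = k | null) for X ~ HyperGeom with the table margins fixed."""
    return comb(n_de, k) * comb(n_total - n_de, n_go - k) / comb(n_total, n_go)


# Attainable p-values p(x) = P(X <= x | null) for x = 0,...,173.
attainable = list(accumulate(pmf(k) for k in range(0, 174)))


def actual_size(alpha):
    """P(p <= alpha | null): the largest attainable p-value not exceeding alpha."""
    below = [p for p in attainable if p <= alpha]
    return max(below, default=0.0)


size_at_5_percent = actual_size(0.05)
```

Because only 174 values of p are possible, `actual_size(alpha)` moves in jumps and can never exceed the nominal α, so the unrandomised test is conservative for any α that is not itself an attainable p-value.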
Randomised Test
Observe X=x0
pobs = observed p-value = P( X ≤ x0 | null )
pprev = next smallest possible p-value = P( X ≤ x0-1 | null )
Randomised p-value
P(x0) = P( X < x0 | null ) + u*P( X = x0 | null ) where u ~ Unif(0,1)
= pprev + u*(pobs - pprev)
conditionally, P | x0 ~ Unif(pprev , pobs)
unconditionally, P ~ Unif(0,1)
[Figure: interval from 0 to 1 with pprev and pobs marked]
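A small simulation can illustrate the unconditional uniformity of the randomised p-value. The sketch below draws X from a toy hypergeometric null (the margins 50, 10 and 12 are arbitrary assumptions, chosen small so the discreteness is pronounced) and checks that the fraction of randomised p-values at or below 0.05 is close to 0.05.

```python
import random
from math import comb


def null_pmf(n_total, n_de, n_go):
    """Hypergeometric null probabilities P(X = k) for k = 0,...,min(n_de, n_go)."""
    denom = comb(n_total, n_go)
    return [comb(n_de, k) * comb(n_total - n_de, n_go - k) / denom
            for k in range(min(n_de, n_go) + 1)]


def randomised_p(x0, pmf, u):
    """pprev + u * (pobs - pprev), with pprev = P(X < x0) and pobs = P(X <= x0)."""
    p_prev = sum(pmf[:x0])
    p_obs = p_prev + pmf[x0]
    return p_prev + u * (p_obs - p_prev)


rng = random.Random(0)                          # fixed seed for reproducibility
pmf = null_pmf(n_total=50, n_de=10, n_go=12)    # toy margins (assumed)
n_sim = 20000

# Simulate X under the null, then randomise each p-value with u ~ Unif(0,1).
xs = rng.choices(range(len(pmf)), weights=pmf, k=n_sim)
ps = [randomised_p(x, pmf, rng.random()) for x in xs]
frac_below = sum(p <= 0.05 for p in ps) / n_sim
```

With the unrandomised pobs in place of the randomised value, the same fraction would fall below 0.05 unless 0.05 happened to be an attainable p-value.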
Fuzzy Decision Rule
Idea is to use all possible realisations of randomised test.
Summarise evidence by critical function of randomised test:
τα(pprev , pobs) =
  1                               if pobs < α
  (α – pprev)/(pobs - pprev)      if pprev < α < pobs
  0                               if pprev > α
[Figure: interval from 0 to 1 with pprev and pobs marked]
Use τα as a fuzzy measure of evidence against the null hypothesis.
(Fuzzy decision rule considered by Cox & Hinkley, 1974 and developed by Geyer and Meeden 2005)
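The critical function translates directly into code. In this minimal sketch the treatment of the boundary cases (pobs = α and pprev = α) is an assumption, since the definition above states strict inequalities only; the input values are hypothetical.

```python
def tau(alpha, p_prev, p_obs):
    """Probability that the randomised p-value, Unif(p_prev, p_obs), falls below alpha."""
    if p_obs <= alpha:   # whole interval below alpha: always reject
        return 1.0
    if p_prev >= alpha:  # whole interval above alpha: never reject
        return 0.0
    # alpha lies inside the interval: fraction of the interval below alpha
    return (alpha - p_prev) / (p_obs - p_prev)


# Hypothetical intervals at alpha = 0.05.
clear_reject = tau(0.05, 0.001, 0.01)
borderline = tau(0.05, 0.03, 0.08)
clear_accept = tau(0.05, 0.06, 0.20)
```

The borderline case gives a value of about 0.4: two fifths of the interval (0.03, 0.08) lies below 0.05, so 40% of realisations of the randomised test would reject.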
Fuzzy Decision Rules for Multiple Testing
We have developed fuzzy decision rules for multiple tests (i = 1,…,m)
Use Benjamini and Hochberg false discovery rate (BH FDR)
τBHα(piprev , piobs) = P( randomised p-value i is rejected | null )
using BH FDR procedure
For small no. tests we can calculate these exactly.
Fuzzy Decision Rules for Multiple Testing
τBHα(piprev , piobs) = P( randomised p-value i is rejected | null )
For large no. tests use simulations:
for j = 1,…,n {
  generate randomised p-values (i=1,…,m): Pij ~ Unif(piprev , piobs)
  perform BH FDR procedure: Iij = 1 if Pij rejected, 0 else
}
Estimate: τBHα(piprev , piobs) ≈ 1/n Σj Iij
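The simulation scheme can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the BH step-up routine is the standard one, and the four (pprev, pobs) intervals are made-up values chosen to show the range of outcomes.

```python
import random


def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k is
    the largest rank with p_(k) <= k * alpha / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


def fuzzy_bh(p_prev, p_obs, alpha, n_sim=2000, seed=0):
    """Monte Carlo estimate of tau^BH for each test i = 1,...,m."""
    rng = random.Random(seed)
    counts = [0] * len(p_obs)
    for _ in range(n_sim):
        # Randomised p-values: Pij ~ Unif(pprev_i, pobs_i)
        draws = [rng.uniform(lo, hi) for lo, hi in zip(p_prev, p_obs)]
        # Iij = 1 if Pij is rejected by BH, 0 otherwise
        for i, rej in enumerate(bh_reject(draws, alpha)):
            counts[i] += rej
    return [c / n_sim for c in counts]


# Hypothetical intervals for m = 4 tests.
p_prev = [0.0001, 0.002, 0.02, 0.30]
p_obs = [0.0005, 0.03, 0.06, 0.70]
taus = fuzzy_bh(p_prev, p_obs, alpha=0.05)
```

In this toy input the first test is rejected in every realisation and the last in none, while the two middle tests receive fuzzy values strictly between 0 and 1.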
Results for Cd36 Example
[1] "alpha = 0.05"
   pprev   pval    i.bonf  i.bh  tau    POSOC group
1  1e-04   3e-04   1       1     1      response to pest, pathogen or parasite
2  1e-04   4e-04   1       1     1      response to wounding
3  2e-04   6e-04   1       1     1      immune response
4  7e-04   0.0079  0       0     0.297  digestion
5  0.003   0.0122  0       0     0.021  chemotaxis
6  0.0039  0.0209  0       0     0.002  organic acid biosynthesis
7  0.0092  0.0306  0       0     0      synaptic transmission
8  5e-04   0.0436  0       0     0.059  response to fungi

[1] "alpha = 0.15"
   pprev   pval    i.bonf  i.bh  tau    POSOC group
1  1e-04   3e-04   1       1     1      response to pest, pathogen or parasite
2  1e-04   4e-04   1       1     1      response to wounding
3  2e-04   6e-04   1       1     1      immune response
4  7e-04   0.0079  0       1     1      digestion
5  0.003   0.0122  0       0     0.943  chemotaxis
6  0.0039  0.0209  0       0     0.661  organic acid biosynthesis
7  0.0092  0.0306  0       0     0.375  synaptic transmission
8  5e-04   0.0436  0       0     0.391  response to fungi
Results for Cd36 Example
Order of fuzzy decisions is not the same as order of observed p-values
Depends on amount of discreteness of null
[Figure; labels: pprev, pobs]
Conclusions
• Grouping Gene Ontology categories can help find significant regions of the GO graph
• Fuzzy decision rules for multiple testing with discrete data can provide more candidates for rejection
Acknowledgements
Cliff Joslyn (Los Alamos National Laboratory)
Tim Aitman (IC Microarray Centre)
Sylvia Richardson (IC Centre for Biostatistics)
BBSRC ‘Exploiting Genomics’ grant (AL)
Wellcome Trust grant (IG)
References
Joslyn CA, Mniszewski SM, Fulmer A and Heaton G (2004), The Gene Ontology Categorizer, Bioinformatics 20, 169-177.
Geyer and Meeden (2005), Fuzzy Confidence Intervals and P-values, Statistical Science, to appear.