Integrating Biology and Statistics: Gene Set Methods
BIOS 691-003
Winter/Spring 2010
Philosophical Overture
• Integrating biology and statistics
• Gene sets: genes whose protein products collaborate on a well-defined function
  – Vague!
• Hard to define ‘function’ or draw boundary on ‘gene sets’
• Statistical methods often ad-hoc
• Be skeptical... but optimistic
Historical Motivations
• Too many genes are significant
  – Researchers used to generate a list by p-value and comb it for genes that work together
  – First pathway tools automated this process
• Patterns may be more significant than any individual gene
  – e.g. if most genes in glycogen biosynthesis are up, but none is significant individually (after multiple-comparisons adjustment)
• We can infer that glycogen is being made
Goals of Current Practice
• Characterize biological meaning of joint changes in gene expression
• Organize expression (or other) changes into meaningful ‘chunks’ (themes)
• Identify crucial points in process where intervention could make a difference
Gene Sets
• Gene Ontology
  – Biological Process
  – Molecular Function
  – Cellular Component (location)
• Pathway Databases
  – KEGG
  – BioCarta
  – MSigDB (Broad Institute)
Approaches
• Univariate (most of current practice):
  – Discrete methods based on counting
  – Continuous methods: summarize gene test statistics by set
• Multivariate (promising but unclear):
  – Compare differences to normal covariation of genes in groups across individuals
  – Use known biological relationships to construct test statistics
Univariate Approaches
• Discrete tests: enrichment for groups in gene lists
  – Select genes differentially expressed at some cutoff
  – For each gene group, cross-tabulate
  – Test for significance (hypergeometric or Fisher test)
• Continuous tests: from gene scores to group scores
  – Compare distribution of scores within each group to random selections
  – GSEA (Gene Set Enrichment Analysis)
  – PAGE (Parametric Analysis of Gene Expression)
Discrete Approach – 2 x 2 Table
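• For each set in turn, construct a 2 x 2 table of significance vs. membership in the set:

             Signif. Genes   NS Genes        Total
Gene Set     k               n-k             n
Others       K-k             (N-n)-(K-k)     N-n
Total        K               N-K             N

• Probability of the observed table with margins fixed (hypergeometric):
  P = \binom{n}{k} \binom{N-n}{K-k} / \binom{N}{K}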
Significance Testing of Categories
• Fisher’s Exact Test
  – Condition on the margins being fixed
  – Of all tables with the same margins, how many show dependence as or more extreme?
  – Hard to compute when either n or k is large
• Approximations
  – Binomial (when k/n is small)
  – Chi-square (when expected values > 5)
  – G² (log-likelihood ratio; compare to χ² on 1 df)
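As a minimal sketch of the exact calculation, the one-sided enrichment p-value can be written as the upper tail of the hypergeometric distribution over the 2 x 2 table. The function name `hypergeom_upper_tail` is hypothetical; the arguments k, n, K, N follow the table margins (k significant genes in a set of size n, K significant genes among N in total).

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k), where X is the number of significant genes falling in a
    set of size n when K of the N genes are significant (margins fixed)."""
    total = comb(N, K)
    upper = min(n, K)  # X cannot exceed the set size or the number of significant genes
    tail = sum(comb(n, x) * comb(N - n, K - x) for x in range(k, upper + 1))
    return tail / total

# Toy numbers, chosen only for illustration: 10 genes, 3 significant,
# a set of 4 genes, 2 of which are significant.
p = hypergeom_upper_tail(2, 4, 3, 10)
print(p)
```

Exact integer arithmetic avoids the overflow that makes the test "hard to compute" with floating-point factorials, at the cost of speed for very large tables.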
Practical Issues – I
• What is the appropriate null distribution?
  – Gene sets are highly correlated because many overlap
  – Must do permutation analysis
  – How to permute?
    • Random sets of genes? Or
    • Random assignments of samples?
• P-value or FDR?
  – Heuristic method
  – More constrained by annotation than statistics
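One way to picture the "random assignments of samples" option is the sketch below: the sample labels are shuffled, which leaves the gene-gene correlation structure intact, and a set-level statistic is recomputed each time. Everything here is an illustrative assumption rather than a method from the slides: the data layout (a dict from gene name to a list of expression values), the statistic (mean difference in group means over the set), and the names `set_statistic` and `sample_permutation_p`.

```python
import random
from statistics import mean

def set_statistic(expr, labels, gene_set):
    """Mean over genes in the set of (mean in group 1 - mean in group 0)."""
    diffs = []
    for gene in gene_set:
        row = expr[gene]
        group1 = [x for x, lab in zip(row, labels) if lab == 1]
        group0 = [x for x, lab in zip(row, labels) if lab == 0]
        diffs.append(mean(group1) - mean(group0))
    return mean(diffs)

def sample_permutation_p(expr, labels, gene_set, n_perm=1000, seed=0):
    """Two-sided permutation p-value obtained by shuffling sample labels."""
    rng = random.Random(seed)
    observed = abs(set_statistic(expr, labels, gene_set))
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # new random assignment of samples to groups
        if abs(set_statistic(expr, shuffled, gene_set)) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero
    return (extreme + 1) / (n_perm + 1)

# Toy data, invented for illustration only
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expr = {"g1": [0, 0, 0, 0, 5, 5, 5, 5], "g2": [1, 0, 1, 0, 6, 5, 6, 5]}
print(sample_permutation_p(expr, labels, ["g1", "g2"]))
```

Permuting genes instead would be cheaper but treats genes as independent, which is the concern the slide raises.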
Practical Issues – II
• If a child category is declared significant, how to assess significance of the parent category?
  – Include child category
  – Consider only genes external to child
• In practice big categories are not useful
• Small categories may not be well represented on chip
• Select categories in the middle range: 5-20 genes represented on chip
Critiques of Discrete Approach
• No use of information about size of change
  – Large t scores count like small t’s
• Continuous procedures have more power than discrete procedures on discretized continuous data
GSEA (Gene Set Enrichment Analysis)
• Introduced in 2003 by Mootha to address a puzzle in a diabetes data set
  – No genes significant individually
  – But Oxidative Phosphorylation mostly up
• GSEA tests ranks of genes in a gene set against randomly distributed ranks
  – Kolmogorov-Smirnov test: maximum difference between ranks of genes in set and uniform distribution
Kolmogorov-Smirnov Test
• Based on statistics of the ‘Brownian bridge’: a random walk with fixed end points
• Maximum difference is the test statistic
  – Null distribution known
• Reformulated by GSEA as the difference (CDF – uniform), measured from the axis
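A minimal sketch of the K-S statistic for a gene set, assuming genes are ranked 1..N and the set is given by the ranks of its members. The function name `ks_statistic` is hypothetical, and the uniform CDF is treated as continuous (rank / N), which is an approximation for discrete ranks.

```python
def ks_statistic(set_ranks, n_genes):
    """Largest absolute gap between the empirical CDF of the set's ranks
    and the uniform CDF rank / n_genes."""
    ranks = sorted(set_ranks)
    m = len(ranks)
    d = 0.0
    for i, r in enumerate(ranks):
        u = r / n_genes
        # The ECDF jumps from i/m to (i+1)/m at this rank; check both sides
        d = max(d, abs((i + 1) / m - u), abs(i / m - u))
    return d

# Ranks piled up at the top of a 100-gene list versus spread evenly
print(ks_statistic([1, 2, 3], 100))
print(ks_statistic([25, 50, 75, 100], 100))
```

Subtracting the uniform CDF from the ECDF and reading the largest excursion from zero gives the same number; that is the "difference from axis" view.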
[Figure: upper panel, ecdf(nn), Fn(x) against x (0 to 1000); lower panel, ecdf(nn)(1:1000) - 1:1000/1000 against Index (0 to 1000)]
GSEA
K-S Test Finds Irrelevant Sets
• Sometimes ranks are concentrated in the middle
  – K-S statistic high, but not meaningful for pathway change
• Fix: ad-hoc weighting by actual t-scores emphasizes departures at extreme ends
• No theory
• Generate null distribution by permutation
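The weighted running-sum idea can be sketched as follows. This is an illustration under stated assumptions, not the GSEA reference implementation: genes are sorted by score, each set member raises the running sum in proportion to |score|^weight, each non-member lowers it by a constant, and the enrichment score is the largest excursion from zero. The names `enrichment_score` and `weight` are hypothetical; with `weight=0` all members count equally, as in the unweighted K-S form.

```python
def enrichment_score(scores, in_set, weight=1.0):
    """Signed maximum excursion of a weighted running sum over the ranked list.
    Assumes at least one member and one non-member, and that member scores
    are not all zero when weight > 0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hit_total = sum(abs(scores[i]) ** weight for i in order if in_set[i])
    n_miss = sum(1 for flag in in_set if not flag)
    running, best = 0.0, 0.0
    for i in order:
        if in_set[i]:
            running += abs(scores[i]) ** weight / hit_total  # step up on a member
        else:
            running -= 1.0 / n_miss                          # step down otherwise
        if abs(running) > abs(best):
            best = running
    return best

# Toy t-scores, invented for illustration
t = [3.0, 2.5, 2.0, 0.1, -0.2, -1.0]
print(enrichment_score(t, [True, True, False, False, False, False]))
print(enrichment_score(t, [False, False, False, False, True, True]))
```

Because the weighted statistic has no known null distribution, its significance would come from recomputing it under permutation.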
Group Z- or T- Scores
• PAGE: log fold-changes over all genes follow ‘close to’ a Normal distribution
  – Can estimate its parameters from the overall distribution
• T-Profiler: under the null hypothesis, each gene’s t-score follows a t distribution, which is ‘near’ N(0,1)
• Hence the sum over genes in a specific set G:
  – PAGE: \sum_{i \in G} \log(f_i) / (\sigma \sqrt{|G|}) \sim N(0,1), with \sigma the standard deviation of log fold-changes over all genes
  – T-Profiler: \sum_{i \in G} t_i / \sqrt{|G|} \sim N(0,1)
• If most genes in a pathway are up-regulated, then the gene set score will be significantly high
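A short sketch of the two group scores, assuming independent genes (the IID assumption these methods rely on). The function names are hypothetical; `sigma` stands for the standard deviation of log fold-changes estimated from all genes, and the log fold-changes are assumed to be centred at zero.

```python
from math import sqrt, erfc

def t_profiler_z(t_scores):
    """Sum of per-gene t-scores in the set, scaled by sqrt(set size)."""
    return sum(t_scores) / sqrt(len(t_scores))

def page_z(log_fold_changes, sigma):
    """Sum of log fold-changes in the set, scaled by sigma * sqrt(set size)."""
    return sum(log_fold_changes) / (sigma * sqrt(len(log_fold_changes)))

def two_sided_p(z):
    """Two-sided p-value for a standard Normal score."""
    return erfc(abs(z) / sqrt(2.0))

# Four genes, each with a modest t-score of 1: none is notable alone,
# but the set score is 2
z = t_profiler_z([1.0, 1.0, 1.0, 1.0])
print(z, two_sided_p(z))
```

The example shows the point of the slide: many small shifts in the same direction add up to a set-level score that no single gene reaches.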
Issues and Critiques
• Same issues as discrete approach
  – Null distribution by permuting samples
    • GSEA finally gets that right in 2005
• Null distribution for Z-test assumes IID
• Methods assume all meaningful changes in same direction
• Don’t use information about normal co-variation
Why Is Covariation Important?
• Most cellular processes are homeostatic:
• They find a good functional set-point
• Coping with variation in inputs …
• … AND in specific regulatory couplings
  – Most of us have regulatory SNPs that vary expression by a factor of two or more
– Other genes are expressed at somewhat different levels to accommodate key processes
Multivariate Approaches
• Classical multivariate methods
  – Multi-dimensional Scaling
  – Hotelling’s T²
• Machine learning approaches
  – Topological score relative to network
  – Prediction by machine learning tool, e.g. ‘random forest’
PCA
Three correlated variables. PC1 lies along the direction of maximal variation; PC2 lies at right angles to it, along the direction of next-highest variation.
Multi-Dimensional Scaling
• Aim: to represent graphically the most information about relationships among samples with multi-dimensional attributes in 2 (or 3) dimensions
• Algorithm:
  – Transform distances into cross-product matrix
  – Initial PCA onto 2 (or 3) axes
  – Deform until better representation
• Minimize ‘strain’ measure:
  \sum_{i,j=1}^{N} \frac{(d_{ij} - \hat{d}_{ij})^2}{d_{ij}}
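As an illustration of the quantity being minimized, the sketch below sums the squared mismatch between original distances d_ij and low-dimensional distances d̂_ij, each divided by d_ij. The function names `euclidean` and `strain` are hypothetical, and skipping pairs with i = j (where d_ij is zero) is an assumption made to avoid division by zero.

```python
from math import sqrt

def euclidean(points):
    """Matrix of pairwise Euclidean distances between points (lists of coordinates)."""
    return [[sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
            for p in points]

def strain(d, d_hat):
    """Sum over i != j of (d_ij - d_hat_ij)^2 / d_ij."""
    n = len(d)
    return sum((d[i][j] - d_hat[i][j]) ** 2 / d[i][j]
               for i in range(n) for j in range(n) if i != j)

# Three samples in 3 dimensions and a candidate 2-D layout that drops the third axis
high = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
low = [p[:2] for p in high]
print(strain(euclidean(high), euclidean(low)))
```

An MDS routine would move the low-dimensional points to reduce this value; only the objective is shown here.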
Separating Using MDS
[Figure: left, density plots of the individual variables (N = 20 each); right, MDS plot of cc[,2] against cc[,1] (in this case PCA)]
MDS for Pathways
• BAD pathway: controlled cell death
[Figure legend: Normal, IBC, Other BC]
• Clear separation between groups
• Cancer samples don’t have coherent variation
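Hotelling’s T²
• Compute distance between sample means using a (common) metric of covariation:
  T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)^{\top} S^{-1} (\bar{x}_1 - \bar{x}_2)
• Where S is the pooled covariance matrix of the two groups and n_1, n_2 are the group sizes
• Multidimensional analog of the t (actually F) statistic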
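A self-contained sketch of the two-sample Hotelling's T² in its standard textbook form, with a pooled covariance matrix and the usual conversion to an F statistic on (p, n1 + n2 - p - 1) degrees of freedom. All function names are hypothetical, samples are rows of gene-expression values for the genes in one set, and the pooled covariance is assumed to be non-singular (which needs more samples than genes).

```python
def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows, centre):
    """Sum of outer products of deviations from the group mean."""
    p = len(centre)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - c for x, c in zip(row, centre)]
        for i in range(p):
            for j in range(p):
                s[i][j] += dev[i] * dev[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def hotelling_t2(group1, group2):
    """Return (T2, F) for two groups of samples measured on the same p genes."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean_vector(group1), mean_vector(group2)
    s1, s2 = scatter(group1, m1), scatter(group2, m2)
    p = len(m1)
    pooled = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(p)]
              for i in range(p)]
    diff = [a - b for a, b in zip(m1, m2)]
    w = solve(pooled, diff)  # pooled^{-1} * diff without forming the inverse
    t2 = n1 * n2 / (n1 + n2) * sum(d * wi for d, wi in zip(diff, w))
    f = t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
    return t2, f

# With a single gene, T2 reduces to the squared two-sample t statistic
print(hotelling_t2([[1.0], [2.0], [3.0]], [[2.0], [4.0], [6.0]]))
```

Solving the linear system rather than inverting the covariance matrix is a numerical convenience; it does not address the small-sample instability of the covariance estimate itself.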
Principles of Kong et al Method
• Normal covariation generally acts to preserve homeostasis
• The transcription of genes that participate in many processes will be changed
• The joint changes in genes will be most distinctive for those genes active in pathways that are working differently
Issues
• Not robust to outliers
  – In practice this may not matter much (?)
• Assumes same covariance in each sample
• Small samples -> unreliable estimates
  – Loss of power
• Robust / regularized methods improve sensitivity by up to a factor of 10!
  – Yates & Reimers (in prep)
Overall Assessment
• Gene sets are somewhat arbitrary
  – Most ‘modules’ overlap extensively with others
  – Many ‘modules’ act by protein modification rather than gene expression
• Current methods represent a first attempt to bring biological information to bear on the significance problem