Integrating Biology and Statistics: Gene Set Methods BIOS 691-003 Winter/Spring 2010

Philosophical Overture

• Integrating biology and statistics

• Gene sets: genes whose protein products collaborate on a well-defined function
  – Vague!

• Hard to define ‘function’ or draw boundary on ‘gene sets’

• Statistical methods often ad-hoc

• Be skeptical... but optimistic

Historical Motivations

• Too many genes are significant
  – Researchers used to generate a list by p-value and comb for genes that work together
  – First pathway tools automated this process

• Patterns may be more significant than any individual gene
  – e.g. if most genes in glycogen biosynthesis are up, but none is significant individually (after multiple-comparisons adjustment)
  – We can infer that glycogen is being made

Goals of Current Practice

• Characterize biological meaning of joint changes in gene expression

• Organize expression (or other) changes into meaningful ‘chunks’ (themes)

• Identify crucial points in process where intervention could make a difference

Gene Sets

• Gene Ontology
  – Biological Process
  – Molecular Function
  – Cellular Component (cellular location)

• Pathway Databases
  – KEGG
  – BioCarta
  – MSigDB (Broad Institute)

Approaches

• Univariate (most of current practice):
  – Discrete methods based on counting
  – Continuous methods: summarize gene test statistics by set

• Multivariate (promising but unclear):
  – Compare differences to normal covariation of genes in groups across individuals
  – Use known biological relationships to construct test statistics

Univariate Approaches

• Discrete tests: enrichment for groups in gene lists
  – Select genes differentially expressed at some cutoff
  – For each gene group cross-tabulate
  – Test for significance (Hypergeometric or Fisher test)

• Continuous tests: from gene scores to group scores
  – Compare distribution of scores within each group to random selections
  – GSEA (Gene Set Enrichment Analysis)
  – PAGE (Parametric Analysis of Gene Expression)

Discrete Approach – 2 x 2 Table

• For each set in turn construct 2 x 2 table of significance vs membership in set:

              Signif. Genes    NS Genes         Total
  Gene Set    k                n-k              n
  Others      K-k              (N-n)-(K-k)      N-n
  Total       K                N-K              N

$P = \binom{n}{k} \binom{N-n}{K-k} \Big/ \binom{N}{K}$
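As a minimal sketch of the test on this table, the one-sided enrichment p-value can be computed by summing hypergeometric probabilities over all tables at least as extreme as the observed one, with margins held fixed. The function name and the toy counts below are illustrative assumptions, not taken from the lecture.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): chance of k or more significant genes in a set of n genes,
    when K of the N genes on the chip are significant (margins fixed)."""
    max_k = min(n, K)
    total = comb(N, K)
    # Each term is the probability of one 2 x 2 table with the same margins.
    tail = sum(comb(n, j) * comb(N - n, K - j) for j in range(k, max_k + 1))
    return tail / total

# Toy numbers (made up for illustration): 1000 genes, 100 significant,
# a gene set of 20 genes of which 8 are significant.
p = hypergeom_upper_tail(k=8, n=20, K=100, N=1000)
print(f"one-sided enrichment p-value: {p:.3g}")
```

Exact integer arithmetic keeps this accurate, but the binomial coefficients grow quickly, which is one reason approximations are used for large tables.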

Significance Testing of Categories

• Fisher’s Exact Test
  – Condition on margins fixed
  – Of all tables with same margins, how many have dependence as or more extreme?
  – Hard to compute when either n or k is large

• Approximations
  – Binomial (when k/n is small)
  – Chi-square (when expected values > 5)
  – G² (log-likelihood ratio; compare to χ² on 1 df)
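The log-likelihood ratio approximation can be sketched as follows, using the same k, n, K, N as the 2 x 2 table. This is an illustrative implementation under the usual independence null; the function name is an assumption, and the p-value is two-sided (it does not distinguish enrichment from depletion).

```python
from math import erfc, log, sqrt

def g2_test(k, n, K, N):
    """G^2 (log-likelihood ratio) test of independence for the 2 x 2 table."""
    observed = [k, n - k, K - k, (N - n) - (K - k)]
    # Expected counts under independence: row total * column total / N.
    expected = [n * K / N, n * (N - K) / N,
                (N - n) * K / N, (N - n) * (N - K) / N]
    g2 = 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
    g2 = max(g2, 0.0)  # guard against tiny negative values from rounding
    # Survival function of chi-square on 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(g2 / 2.0))
    return g2, p_value

# Toy numbers (made up for illustration).
g2, p_value = g2_test(k=8, n=20, K=100, N=1000)
print(f"G2 = {g2:.2f}, p = {p_value:.3g}")
```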

Practical Issues – I

• What is appropriate Null Distribution?
  – Highly correlated because many overlaps
  – Must do permutation analysis
  – How to permute?
    • Random sets of genes? Or
    • Random assignments of samples?

• P-value or FDR?
  – Heuristic method
  – More constrained by annotation than statistics
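One way to build a null distribution by random assignment of samples is sketched below. Shuffling sample labels keeps the gene-gene correlation structure intact, which random gene sets do not. The set score (mean difference in group means), the function names, and the toy expression values are all assumptions chosen for illustration.

```python
import random
from statistics import mean

def set_score(expr, labels, gene_set):
    """Mean over genes in the set of (group 1 mean - group 0 mean)."""
    diffs = []
    for gene in gene_set:
        group1 = [x for x, lab in zip(expr[gene], labels) if lab == 1]
        group0 = [x for x, lab in zip(expr[gene], labels) if lab == 0]
        diffs.append(mean(group1) - mean(group0))
    return mean(diffs)

def sample_permutation_p(expr, labels, gene_set, n_perm=2000, seed=0):
    """Two-sided permutation p-value obtained by shuffling sample labels."""
    rng = random.Random(seed)
    observed = set_score(expr, labels, gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(set_score(expr, shuffled, gene_set)) >= abs(observed):
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy data (made up): 3 genes, 4 control samples then 4 treated samples.
expr = {
    "g1": [1.0, 1.2, 0.9, 1.1, 3.0, 3.1, 2.9, 3.2],
    "g2": [2.0, 2.1, 1.9, 2.2, 4.1, 3.9, 4.0, 4.2],
    "g3": [0.5, 0.4, 0.6, 0.5, 2.4, 2.6, 2.5, 2.7],
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
observed, p_value = sample_permutation_p(expr, labels, ["g1", "g2", "g3"])
print(f"set score = {observed:.3f}, permutation p = {p_value:.3f}")
```

With only 8 samples there are 70 distinct labelings, so the smallest attainable two-sided p-value is about 2/70; small studies limit what sample permutation can resolve.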

Practical Issues – II

• If a child category is declared significant, how to assess significance of parent category?
  – Include child category
  – Consider only genes external to child

• In practice big categories are not useful

• Small categories may not be well represented on chip

• Select categories in middle range: 5-20 genes represented on chip

Critiques of Discrete Approach

• No use of information about size of change
  – Large t scores count like small t’s

• Continuous procedures have more power than discrete procedures on discretized continuous data

GSEA (Gene Set Enrichment Analysis)

• Introduced in 2003 by Mootha to address a puzzle in a diabetes data set
  – No genes significant individually
  – But Oxidative Phosphorylation mostly up

• GSEA tests rank of genes in a gene set against randomly distributed ranks
  – Kolmogorov-Smirnov test: maximum difference between ranks of genes in set and uniform distribution

Kolmogorov-Smirnov Test

• Based on statistics of ‘Brownian Bridge’ – random walk with fixed end points

• Maximum difference is test statistic
  – Null distribution known

• Reformulated by GSEA as the difference between the empirical CDF and the uniform CDF, plotted as deviation from the axis

[Figure: left panel, empirical CDF ecdf(nn), Fn(x) against x from 0 to 1000; right panel (GSEA), the difference ecdf(nn)(1:1000) - 1:1000/1000 against Index]
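The difference between the empirical CDF of a set's ranks and the uniform CDF can be sketched as below. The function names and the toy ranks are illustrative assumptions; the final deviation is always zero, which is the fixed end of the random walk.

```python
def ks_deviation(set_ranks, n_genes):
    """ECDF of the set's ranks minus the uniform CDF, at each rank 1..n_genes."""
    in_set = set(set_ranks)
    m = len(in_set)
    count = 0
    deviations = []
    for idx in range(1, n_genes + 1):
        if idx in in_set:
            count += 1
        deviations.append(count / m - idx / n_genes)
    return deviations

def ks_statistic(set_ranks, n_genes):
    """Maximum absolute deviation: the Kolmogorov-Smirnov test statistic."""
    return max(abs(d) for d in ks_deviation(set_ranks, n_genes))

# Toy example (made up): 10 genes, a set occupying the 5 top ranks
# versus a set spread evenly through the list.
print(ks_statistic([1, 2, 3, 4, 5], 10))
print(ks_statistic([2, 4, 6, 8, 10], 10))
```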

K-S Test Finds Irrelevant Sets

• Sometimes ranks concentrated in middle – K-S statistic high, but not meaningful for pathway change

• Fix: ad-hoc weighting by actual t-scores emphasizes departures at extreme ends

• No theory

• Generate null distribution by permutation
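A weighted running sum of this kind might look like the sketch below. The specific weighting (each gene in the set steps up by |t|^weight, normalized over the set, while each gene outside the set steps down by a constant) is an assumption modelled on the published GSEA description, and the gene names and scores are made up. Setting weight to 0 gives back an unweighted K-S style statistic.

```python
def enrichment_score(scores, gene_set, weight=1.0):
    """Signed maximum deviation from zero of a weighted running sum.

    scores: dict mapping gene -> t-score; gene_set: set of gene names.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = [g for g in ranked if g in gene_set]
    hit_norm = sum(abs(scores[g]) ** weight for g in hits)
    miss_step = 1.0 / (len(ranked) - len(hits))
    running, best = 0.0, 0.0
    for gene in ranked:
        if gene in gene_set:
            running += abs(scores[gene]) ** weight / hit_norm
        else:
            running -= miss_step
        if abs(running) > abs(best):
            best = running
    return best

# Toy t-scores (made up for illustration).
scores = {"a": 3.0, "b": 2.0, "c": 1.0, "d": -1.0, "e": -2.0, "f": -3.0}
print(enrichment_score(scores, {"a", "b"}))   # set at the top of the list
print(enrichment_score(scores, {"e", "f"}))   # set at the bottom of the list
```

Significance would still come from permutation, for example by recomputing the score after shuffling sample labels and recalculating the t-scores.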

Group Z- or T- Scores

• PAGE: log fold-changes over all genes follow ‘close to’ Normal distribution
  – Can estimate from overall distribution

• T-Profiler: under Null Hypothesis, each gene’s t-score follows a t distribution, ‘near’ the N(0,1) distribution

• Hence the sum over genes in a specific set G:
  – T-Profiler: $\sum_{i \in G} t_i / \sqrt{|G|} \sim N(0,1)$
  – PAGE: $\sum_{i \in G} \log(f_i) / (\sigma \sqrt{|G|}) \sim N(0,1)$, with $\sigma$ estimated from the overall distribution

• If most genes in a pathway are up-regulated then gene set scores will be significantly high
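The two group scores can be sketched as follows. Assumptions for illustration: sigma is taken as the standard deviation of log fold-changes over all genes, the overall mean log fold-change is treated as zero, and the function names and toy values are made up.

```python
from math import erfc, sqrt
from statistics import pstdev

def t_profiler_z(t_scores, gene_set):
    """Sum of t-scores in the set divided by sqrt(|G|)."""
    members = [t_scores[g] for g in gene_set]
    return sum(members) / sqrt(len(members))

def page_z(log_fc, gene_set):
    """Sum of log fold-changes in the set divided by sigma * sqrt(|G|)."""
    members = [log_fc[g] for g in gene_set]
    sigma = pstdev(list(log_fc.values()))  # estimated from all genes
    return sum(members) / (sigma * sqrt(len(members)))

def two_sided_p(z):
    """Two-sided p-value for a N(0,1) score."""
    return erfc(abs(z) / sqrt(2.0))

# Toy values (made up for illustration).
t_scores = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": -0.5, "f": 0.2}
z = t_profiler_z(t_scores, ["a", "b", "c", "d"])
print(f"z = {z:.2f}, p = {two_sided_p(z):.3f}")
```

Four genes with individually unremarkable t-scores of 1.0 combine into a set score of 2.0, which is the point of the approach; the N(0,1) reference assumes the genes are independent.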

Issues and Critiques

• Same issues as discrete approach
  – Null distribution by permuting samples
  – GSEA finally gets that right in 2005

• Null distribution for Z-test assumes IID

• Methods assume all meaningful changes in same direction

• Don’t use information about normal co-variation

Why Is Covariation Important?

• Most cellular processes are homeostatic:
  – They find a good functional set-point
  – Coping with variation in inputs …
  – … AND in specific regulatory couplings
    • Most of us have regulatory SNPs that vary expression by a factor of two or more
    • Other genes are expressed at somewhat different levels to accommodate key processes

Multivariate Approaches

• Classical multivariate methods
  – Multi-dimensional Scaling
  – Hotelling’s T²

• Machine learning approaches
  – Topological score relative to network
  – Prediction by machine learning tool, e.g. ‘random forest’

PCA

Three correlated variables. PCA 1 lies along the direction of maximal variation; PCA 2 lies at right angles to it, along the direction of next-highest variation.

Multi-Dimensional Scaling

• Aim: to represent graphically the most information about relationships among samples with multi-dimensional attributes in 2 (or 3) dimensions

• Algorithm:
  – Transform distances into cross-product matrix
  – Initial PCA onto 2 (or 3) axes
  – Deform until better representation

• Minimize ‘strain’ measure:

$\sum_{i,j=1}^{N} \frac{(d_{ij} - \hat{d}_{ij})^2}{d_{ij}}$
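A strain measure of this form, the squared mismatch between original and low-dimensional distances scaled by the original distance, can be evaluated as in the sketch below. Assumptions for illustration: Euclidean distances, each pair counted once (half of a double sum over i and j), distinct points so that no original distance is zero, and made-up coordinates.

```python
from math import dist

def strain(points_high, points_low):
    """Sum over pairs of (d_ij - dhat_ij)^2 / d_ij.

    points_high: original coordinates; points_low: their 2-D (or 3-D) images.
    """
    n = len(points_high)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(points_high[i], points_high[j])    # original distance
            d_hat = dist(points_low[i], points_low[j])  # represented distance
            total += (d - d_hat) ** 2 / d
    return total

# Toy example (made up): projecting 3-D points onto their first coordinate
# loses the second coordinate, so the strain is positive.
high = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
low = [(0.0, 0.0), (3.0, 0.0)]
print(strain(high, low))
```

An iterative MDS procedure would move the low-dimensional points to reduce this quantity; only the objective is shown here.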

Separating Using MDS

[Figure: density plots of the individual variables (N = 20 each) and a scatter plot of cc[,1] against cc[,2]]

Left: distributions of individual variables. Right: MDS plot (in this case PCA).

MDS for Pathways

• BAD pathway: controlled cell death

[Figure legend: Normal, IBC, Other BC]

• Clear separation between groups

• Cancer samples don’t have coherent variation

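Hotelling’s T²

• Compute distance between sample means using (common) metric of covariation:

$T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)^\top S^{-1} (\bar{x}_1 - \bar{x}_2)$

• Where $S = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2}$ is the pooled covariance matrix of the two groups

• Multidimensional analog of t (actually F) statistic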
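A minimal sketch of the standard two-sample Hotelling's T² follows: the squared distance between group mean vectors, measured in the metric of the pooled covariance, together with its conversion to an F statistic on (p, n1 + n2 - p - 1) degrees of freedom. All function names and the toy data are illustrative assumptions, and the plain Gaussian elimination assumes more samples than genes so that the pooled covariance is invertible.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter(rows, centre):
    """Sum of outer products of deviations from the group mean."""
    p = len(centre)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - c for x, c in zip(row, centre)]
        for a in range(p):
            for b in range(p):
                s[a][b] += dev[a] * dev[b]
    return s

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    p = len(rhs)
    aug = [list(matrix[r]) + [rhs[r]] for r in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * p
    for r in reversed(range(p)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, p))
        x[r] = (aug[r][p] - tail) / aug[r][r]
    return x

def hotelling_t2(group1, group2):
    """Return (T^2, F) for two groups of samples (rows) over p genes (columns)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean_vector(group1), mean_vector(group2)
    s1, s2 = scatter(group1, m1), scatter(group2, m2)
    p = len(m1)
    pooled = [[(s1[a][b] + s2[a][b]) / (n1 + n2 - 2) for b in range(p)]
              for a in range(p)]
    diff = [a - b for a, b in zip(m1, m2)]
    weighted = solve(pooled, diff)  # pooled^{-1} * diff without forming the inverse
    t2 = n1 * n2 / (n1 + n2) * sum(d * w for d, w in zip(diff, weighted))
    f_stat = t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
    return t2, f_stat

# Toy data (made up): two genes, four samples per group, shifted in gene 1.
normal = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
tumour = [(2.0, 0.0), (2.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
t2, f_stat = hotelling_t2(normal, tumour)
print(f"T2 = {t2:.2f}, F = {f_stat:.2f}")
```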

Principles of Kong et al Method

• Normal covariation generally acts to preserve homeostasis

• The transcription of genes that participate in many processes will be changed

• The joint changes in genes will be most distinctive for those genes active in pathways that are working differently

Issues

• Not robust to outliers
  – In practice this may not matter much (?)

• Assumes same covariance in each sample

• Small samples -> unreliable estimates
  – Loss of power

• Robust / Regularized Methods improve sensitivity by up to a factor of 10!
  – Yates & Reimers (in prep)

Overall Assessment

• Gene sets are somewhat arbitrary
  – Most ‘modules’ overlap extensively with others
  – Many ‘modules’ act by protein modification rather than gene expression

• Current methods represent a first attempt to bring biological information to bear on the significance problem
