
1

Sylvia Richardson

Centre for Biostatistics

Imperial College, London

Bayesian hierarchical modelling of gene expression data

In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s)

Helen Causton and Tim Aitman (Hammersmith)

Graeme Ambler and Peter Green (Bristol)

Philippe Broët (INSERM, Paris)

BBSRC Exploiting Genomics grant

2

Outline

• Hierarchical modelling framework

• A Bayesian gene expression index

• Modelling differential expression

• False discovery rate and mixture models

3

Introduction

• Gene expression is a hierarchical process
– Substantive question
– Experimental design
– Sample preparation
– Array design & manufacture
– Gene expression matrix
– Probe level data
– Image level data

• Interest in using a statistical framework capable of handling multiple sources of variability coherently

Interesting variability (signal)

Obscuring variability (noise)

+

Bayesian statistics

4

Bayesian hierarchical model framework

• Has the flexibility to model various sources of variability: between probes, gene specific, within array, between array, …

• Building of all these features into a common model

• Avoids the need to use a plug-in approach systematically: uncertainty is propagated

• ‘Borrow strength’ / share out information according to principle

• Allows some model checking

5

Gene expression analysis is a multi-step process

Low-level model (how is the measured expression related to the signal)

Multi-array processing (how to make appropriate combined inference)

Differential expression

Clustering

Partition model

We build all these steps in a common statistical framework

6

Integrated modelling of Affymetrix data

[Diagram. Labels: PM/MM probe pairs under Condition 1 and Condition 2; gene specific variability (probe); gene index BGX; hierarchical model of replicate (biological) variability and array effect; gene and condition BGX index (one per condition); differential expression parameter.]


7

A fully Bayesian Gene eXpression index for Affymetrix GeneChip arrays

Anne Mette Hein, SR, Helen Causton, Graeme Ambler, Peter Green

[Diagram. Labels: PM/MM probe pairs; gene specific variability (probe); gene index BGX.]

8

Single array model: Motivation

Key observations, each followed by its conclusion:

• PMs and MMs both increase with spike-in concentration (MMs more slowly than PMs)
Conclusion: MMs bind a fraction of the signal

• Spread of PMs increases with level
Conclusion: multiplicative (and additive) error; transformation needed

• Considerable variability in PM (and MM) response within a probe set
Conclusion: varying reliability in gene expression estimation for different genes

• Probe effects approximately additive on log-scale
Conclusion: estimate gene expression measure from PMs and MMs on log scale

9

Model assumptions and key biological parameters

• The intensity of the PM measurement for probe (reporter) j and gene g is due to binding
– of labelled fragments that perfectly match the oligos in the spot: the true signal $S_{gj}$
– of labelled fragments that do not perfectly match these oligos: the non-specific hybridisation $H_{gj}$

• The intensity of the corresponding MM measurement is caused
– by a binding fraction $\Phi$ of the true signal $S_{gj}$
– by non-specific hybridisation $H_{gj}$

10

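BGX single array model: g = 1,…,G (thousands), j = 1,…,J (11-20)

$PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$

$MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$

Here $\tau^2$ is the additive background noise, $S_{gj}$ the signal, $H_{gj}$ the non-specific hybridisation and $\Phi$ the fraction.

$\log(S_{gj} + 1) \sim TN(\mu_g, \xi_g^2)$, j = 1,…,J

Gene specific error terms, exchangeable: $\log(\xi_g^2) \sim N(a, b^2)$

Gene expression index (BGX): $\theta_g = \mathrm{median}(TN(\mu_g, \xi_g^2))$; “pools” information over probes j = 1,…,J

$\log(H_{gj} + 1) \sim TN(\lambda, \eta^2)$ (array-wide distribution)

Priors: “vague” $\tau^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\Phi \sim B(1,1)$, $\mu_g \sim U(0,15)$, $\eta^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\lambda \sim N(0, 10^3)$

“Empirical Bayes”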
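As a rough illustration of the single array model, the sketch below simulates PM and MM intensities for one probe set. It assumes that TN denotes a normal distribution truncated to non-negative values; the function names and all numerical values are hypothetical and are not taken from the talk.

```python
import math
import random


def draw_truncated_normal(mean, sd, rng):
    """Rejection-sample N(mean, sd^2) restricted to [0, inf) (truncation point assumed)."""
    while True:
        x = rng.gauss(mean, sd)
        if x >= 0.0:
            return x


def simulate_probe_set(mu_g, xi_g, lam, eta, phi, tau, n_probes, rng):
    """Simulate (PM, MM, S, H) for each probe of one gene on one array."""
    probes = []
    for _ in range(n_probes):
        # log(S + 1) and log(H + 1) follow truncated normals, so S and H are >= 0.
        s = math.exp(draw_truncated_normal(mu_g, xi_g, rng)) - 1.0
        h = math.exp(draw_truncated_normal(lam, eta, rng)) - 1.0
        pm = rng.gauss(s + h, tau)        # PM binds the full signal plus H
        mm = rng.gauss(phi * s + h, tau)  # MM binds only a fraction phi of the signal
        probes.append((pm, mm, s, h))
    return probes


rng = random.Random(1)
# Hypothetical parameter values chosen only to make the example run.
probe_set = simulate_probe_set(mu_g=6.0, xi_g=0.5, lam=4.0, eta=0.3,
                               phi=0.3, tau=5.0, n_probes=16, rng=rng)
```

Each tuple holds the simulated PM, MM, signal and non-specific hybridisation for one probe, so the spread of PM within the probe set can be inspected directly.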

11

Implementation

• In WinBUGS for ease of model development and C++ for efficiency

• Joint estimation of parameters in full Bayesian framework

• Base inference on posterior distribution of all unknown quantities, $S_{gj}$, $H_{gj}$, $\theta_g$ = median of $TN(\mu_g, \xi_g^2)$, …, and use appropriate summaries
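The BGX index is the median of a truncated normal, which has a closed form. The sketch below assumes truncation to [0, ∞), which the slides do not state explicitly; the function name is illustrative. Applying it to each MCMC draw of the mean and standard deviation would give a posterior sample of the index.

```python
from statistics import NormalDist


def truncated_normal_median(mu, xi, lower=0.0):
    """Median of N(mu, xi^2) truncated to [lower, inf).

    The truncated CDF reaches 1/2 where the untruncated CDF equals
    (1 + P(X < lower)) / 2, so the median is that untruncated quantile.
    """
    std = NormalDist()
    p_below = std.cdf((lower - mu) / xi)
    return mu + xi * std.inv_cdf(0.5 * (1.0 + p_below))


# Far from the truncation point the median is essentially mu.
high_expression = truncated_normal_median(8.0, 0.5)
# At mu = 0 the distribution is half-normal and the median is about 0.674 * xi.
at_boundary = truncated_normal_median(0.0, 1.0)
```

When the mean is close to zero the truncation pulls the median above the mean, which is why the index is defined as a median of the truncated distribution rather than read off the mean parameter.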

12

Single array model performance: data set with varying concentrations (geneLogic)

• 14 samples of cRNA from acute myeloid leukemia (AML) tumor cell line

• In sample k: each of 11 genes spiked in at concentration $c_k$:

sample k:        1    2    3     4    5    6    7    8    9     10   11   12   13   14
conc. c_k (pM):  0.0  0.5  0.75  1.0  1.5  2.0  3.0  5.0  12.5  25   50   75   100  150

• Each sample hybridised to an array

Consider subset consisting of 500 normal genes + 11 spike-ins

13

Single array model performance: one array, four genes spiked in at concentration 5.0

[Figure: for each of four genes, PM, MM and PM-MM probe responses, with posterior distributions and 2.5-97.5% credibility intervals for $\log(S_{gj}+1)$ and the BGX index $\theta_g$. o: log(PM-MM); curve: $TN(\mathrm{medPost}(\mu_g), \mathrm{medPost}(\xi_g^2))$.]

Probes, degree of response / variability over probe set, for the four genes: medium / high, low / low, medium / low, high / low

Probe behaviour: highly variable responses within probe sets and between genes

Posterior distributions reflect variability

14

Single array model performance: signal and expression index

10 arrays: gene 1 spiked in at increasing concentrations

‘True signal’ / expression index BGX increases with concentration

[Figure: posterior distributions and 2.5-97.5% credibility intervals for $\log(S_{gj}+1)$, $\theta_g$ and $\log(H_{gj}+1)$ on each array; o: log(PM-MM), as previously.]

15

Single array model performance: non-specific hybridization

10 arrays: gene 1 spiked in at increasing concentrations

[Figure: 2.5-97.5% credibility intervals for $\log(H_{gj}+1)$; panels: Signals, Signals/cross.]

Non-specific hybridization does not increase with concentration

16

Single array model performance:

11 genes spiked in at 13 (increasing) concentrations

BGX index $\theta_g$ increases with concentration … except for gene 7 (spiked-in??)

Indication of smooth and sustained increase over a wider range of concentrations

Comparison with other expression measures

17

2.5-97.5% credibility intervals for the Bayesian expression index

11 spike-in genes at 13 different concentrations (data set A)

Note how the variability is substantially larger for low expression levels

Each colour corresponds to a different spike-in gene

Gene 7: broken red line

18

What variability is captured?

• For some genes, there is considerable discrepancy between the information given by the different probes

• Posterior becomes “flat” or “bimodal”

• Hard to summarise by a single number

Less reproducibility of point estimates of expression level

• Model improvement:
– stratify $\Phi$ by CG content?
– less weight to the MM in some cases?
– more robust summary of index distribution or heavy tail distributions?

19

Single array model: examples of posterior distributions of BGX expression indices

[Figure: each curve represents a gene; mean ± 1 SD shown. Examples with data, o: $\log(PM_{gj} - MM_{gj})$, j = 1,…,$J_g$ (plotted at 0 if not defined).]

20

Differential expression and array effects

Alex Lewin, SR, Natalia Bochkina, Anne Glazier, Tim Aitman

21

Data Set and Biological question

Previous Work (Tim Aitman, Anne Marie Glazier)

The spontaneously hypertensive rat (SHR): A model of human insulin resistance syndromes.

Deficiency in gene Cd36 found to be associated with insulin resistance in SHR

Following this, several animal models were developed where other relevant genes are knocked out, allowing comparison between knockout and wildtype (normal) mice or rats.

22

Data Set and Biological question

Microarray Data

Data set A (MAS 5) (12000 genes on each array)

3 SHR compared with 3 transgenic rats

Data set B (RMA) (22700 genes on each array)

8 wildtype (normal) mice compared with 8 knocked out mice

Biological Question

Find genes which are expressed differently in wildtype and knockout / transgenic mice

23

[Diagram. Labels: PM/MM probe pairs; gene specific error term (one per condition); Condition 1, Condition 2; hierarchical model of replicate variability and array effect (one per condition); differential expression parameter; posterior distribution (flat prior); mixture modelling for classification.]

24

Model for Differential Expression

• Expression-level-dependent normalisation

• Only few replicates per gene, so share information between genes to estimate variability of gene expression between the replicates

• To select interesting genes:
– Use posterior distribution of quantities of interest (functions of the model parameters, ranks, …)
– Use mixture prior on the differential expression parameter

25

Bayesian hierarchical model for replicate expression data (under one condition)

Data: $y_{gr}$ = log gene expression for gene g, replicate r (for the present, $y_{gr}$ is treated as known data)

$\alpha_g$ = gene effect

$\beta_r(\cdot)$ = array effect (possibly expression-level dependent)

$\sigma_g^2$ = gene specific variance

• 1st level

$y_{gr} \sim N(\alpha_g + \beta_r(\alpha_g), \sigma_g^2)$, $\sum_r \beta_r(\alpha_g) = 0$

$\beta_r(\cdot)$ = smooth function of $\alpha_g$: piecewise polynomial with unknown break points

26

Exploratory analysis of array effect

[Figure: panels for Condition 1 (3 replicates) and Condition 2 (3 replicates); spline curves shown.]

Needs ‘normalisation’

27

Hierarchical structure for gene specific parameters

• 2nd level

Priors for $\alpha_g$ (flat), coefficients and break points

$\sum_r \beta_r(\alpha_g) = 0$ constraint imposed

$\sigma_g^2 \sim$ lognormal$(\mu, \tau)$

Hyper-parameters $\mu$ and $\tau$ can be influential. In a full Bayesian analysis, these are not fixed

• 3rd level

$\mu \sim N(c, d)$, $\tau \sim$ lognormal$(e, f)$

28

Smoothing of the gene specific variances

• Variances are estimated using information from all G × R measurements (~12000 × 3) rather than just 3

• Variances are stabilised and shrunk towards average variance

29

Bayesian Model Checking

• Check assumptions on gene variances, e.g. exchangeable variances, what distribution?

• Predict sample variance $S_g^{2\,\mathrm{new}}$ (a chosen checking function) from the model specification (not using the data for this)

• Compare predicted $S_g^{2\,\mathrm{new}}$ with observed $S_g^{2\,\mathrm{obs}}$

‘Bayesian p-value’: Prob($S_g^{2\,\mathrm{new}} > S_g^{2\,\mathrm{obs}}$)

• Distribution of p-values approx. Uniform if model is ‘true’ (Marshall and Spiegelhalter, 2003)

• Easily implemented in MCMC algorithm
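A minimal sketch of how such a p-value might be estimated by simulation. It makes three assumptions: the gene variance is lognormal with log-scale mean and standard deviation (mu, tau), the data are normal so that a sample variance from R replicates is the true variance times a chi-square with R − 1 degrees of freedom divided by R − 1, and array effects are ignored. The `hyper_draws` list is a hypothetical stand-in for MCMC output.

```python
import random


def mixed_predictive_p_value(s2_obs, hyper_draws, n_reps, rng):
    """Estimate Prob(S2_new > S2_obs) from draws of the variance hyper-parameters."""
    exceed = 0
    for mu, tau in hyper_draws:
        # New gene variance drawn from the second level, not from this gene's data.
        sigma2_new = rng.lognormvariate(mu, tau)
        # Chi-square with n_reps - 1 degrees of freedom as Gamma(shape=df/2, scale=2).
        chi2 = rng.gammavariate((n_reps - 1) / 2.0, 2.0)
        s2_new = sigma2_new * chi2 / (n_reps - 1)  # predicted sample variance
        exceed += s2_new > s2_obs
    return exceed / len(hyper_draws)


# Hypothetical hyper-parameter draws; a real analysis would use the MCMC sample.
hyper_draws = [(-2.0, 0.5)] * 4000
p_small = mixed_predictive_p_value(1e-9, hyper_draws, n_reps=3, rng=random.Random(0))
p_typical = mixed_predictive_p_value(0.1, hyper_draws, n_reps=3, rng=random.Random(0))
p_large = mixed_predictive_p_value(1e6, hyper_draws, n_reps=3, rng=random.Random(0))
```

Collecting one such p-value per gene and plotting their histogram against the uniform distribution gives the kind of check described on the slide.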

30

Data set A

31

Differential expression model

The quantity of interest is the difference between conditions for each gene: $d_g$, g = 1, …, N

Joint model for the 2 conditions:

$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \beta_{1r}(\alpha_g) + \varepsilon_{g1r}$, r = 1, …, $R_1$

$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \beta_{2r}(\alpha_g) + \varepsilon_{g2r}$, r = 1, …, $R_2$

• $\alpha_g$ is now the overall gene effect over the conditions

• The parameter of interest $d_g$ is given a flat prior (for now)

• Same assumptions for the distribution of $\sigma^2_{gs}$

• Modelling of $\beta_{sr}(\alpha_g)$ as before, s = 1, 2; sum-to-zero constraint imposed within each condition

32

Possible Statistics for Differential Expression

$d_g \approx$ log fold change

$d_g^* = d_g / (\sigma^2_{g1}/R_1 + \sigma^2_{g2}/R_2)^{1/2}$ (standardised difference)

• We obtain the posterior distribution of all $\{d_g\}$ and/or $\{d_g^*\}$

• Can compute directly the posterior probability of genes satisfying criterion X of interest:

$p_{g,X}$ = Prob(g of “interest” | Criterion X, data)

• Can compute the distributions of ranks
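Given posterior draws of the differences, both quantities reduce to simple counting. The sketch below assumes the draws are stored as one list per MCMC iteration with one value per gene; the toy numbers and function names are made up for illustration, and the criterion used is |d_g| > log 2.

```python
import math


def posterior_prob_of_interest(d_draws, threshold=math.log(2)):
    """Fraction of posterior draws with |d_g| above the threshold, per gene."""
    n_draws = len(d_draws)
    n_genes = len(d_draws[0])
    return [sum(abs(draw[g]) > threshold for draw in d_draws) / n_draws
            for g in range(n_genes)]


def rank_draws(d_draws):
    """Rank genes within each draw (rank 1 = smallest d_g)."""
    ranks = []
    for draw in d_draws:
        order = sorted(range(len(draw)), key=lambda g: draw[g])
        rank_of = [0] * len(draw)
        for position, g in enumerate(order, start=1):
            rank_of[g] = position
        ranks.append(rank_of)
    return ranks


# Toy posterior sample: 4 draws (rows) for 3 genes (columns).
d_draws = [
    [0.9, 0.1, -0.8],
    [1.1, -0.2, -0.6],
    [0.8, 0.0, -0.9],
    [0.6, 0.3, -1.0],
]
p_interest = posterior_prob_of_interest(d_draws)  # [0.75, 0.0, 0.75]
ranks = rank_draws(d_draws)                       # every draw ranks the genes [3, 2, 1]
```

Quantiles of each gene's column in `ranks` would then give credibility intervals for its rank.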

33

Criterion X: gene is of interest if |log fold change| > log(2) and log(overall expression) > 4

[Figure: plot of log fold change versus overall expression level; genes with $p_{g,X}$ > 0.5 in green, $p_{g,X}$ > 0.8 in red; one gene labelled $p_{g,X}$ = 0.49.]

Data set A: 3 wildtype mice compared to 3 knockout mice (U74A chip), MAS5

The majority of the genes have very small $p_{g,X}$: 90% of genes have $p_{g,X}$ < 0.2

Genes with $p_{g,X}$ > 0.5 (green): 280; $p_{g,X}$ > 0.8 (red): 46

Genes with low overall expression have a greater range of fold change than those with higher expression

34

Criterion X: gene is of interest if |log fold change| > log(1.5)

[Figure: plot of log fold change versus overall expression level; genes with $p_{g,X}$ > 0.5 in green, $p_{g,X}$ > 0.8 in red.]

Experiment: 8 wildtype mice compared to 8 knockout mice, RMA

The majority of the genes have very small $p_{g,X}$: 97% of genes have $p_{g,X}$ < 0.2

Genes with $p_{g,X}$ > 0.5 (green): 292; $p_{g,X}$ > 0.8 (red): 139

35

Posterior probabilities and log fold change

Data set A : 3 replicates MAS5 Data set B : 8 replicates RMA

36

Credibility intervals for ranks

100 genes with lowest rank (most under/over expressed)

Low rank, high uncertainty

Low rank, low uncertainty

Data set B

37

Using the posterior distribution of $d_g^*$ (standardised difference)

• Compute Probability($|d_g^*| > 2$ | data): Bayesian analogue of a t test!

• Order genes

• Select genes such that Probability($|d_g^*| > 2$ | data) > cut-off
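The selection rule can be sketched for a single gene as follows, taking the standardised difference to be the difference divided by the square root of the sum of the two condition variances, each scaled by its number of replicates. The draws are made-up triples of (difference, variance in condition 1, variance in condition 2) standing in for MCMC output, and the function names are illustrative.

```python
import math


def standardised_difference(d, var1, var2, r1, r2):
    """d / sqrt(var1 / r1 + var2 / r2) for one posterior draw."""
    return d / math.sqrt(var1 / r1 + var2 / r2)


def prob_large_standardised_difference(draws, r1, r2, cutoff=2.0):
    """Fraction of draws (d, var1, var2) whose |standardised difference| exceeds cutoff."""
    hits = sum(abs(standardised_difference(d, v1, v2, r1, r2)) > cutoff
               for d, v1, v2 in draws)
    return hits / len(draws)


# Four made-up posterior draws for one gene, with 3 replicates per condition.
draws = [(1.0, 0.12, 0.12), (0.5, 0.12, 0.12), (0.7, 0.03, 0.03), (0.2, 0.03, 0.03)]
p = prob_large_standardised_difference(draws, r1=3, r2=3)  # 2 of 4 draws exceed 2
selected = p > 0.95  # gene enters the list only if p clears the chosen cut-off
```

Because the variances are drawn jointly with the difference, their uncertainty feeds into the probability rather than being plugged in as a point estimate.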

38

Volcano plots

[Figure labels: Bayesian; T test (Bayesian estimate).]

For illustration, cut-off lines drawn at 0.95

39

Integrated modelling of Affymetrix data

[Diagram. Labels: PM/MM probe pairs under Condition 1 and Condition 2; distribution of expression index for gene g, condition 1; distribution of expression index for gene g, condition 2; distribution of differential expression parameter.]



40

BGX multiple array model: conditions c = 1,…,C, replicates r = 1,…,$R_c$

$PM_{gjcr} \sim N(S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$

$MM_{gjcr} \sim N(\Phi S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$

Background noise, additive, array specific

$\log(S_{gjcr} + 1) \sim TN(\mu_{gc}, \xi_{gc}^2)$

Gene and condition specific BGX: $\theta_{gc} = \mathrm{median}(TN(\mu_{gc}, \xi_{gc}^2))$; “pools” information over replicate probe sets j = 1,…,J, r = 1,…,$R_c$

$\log(H_{gjcr} + 1) \sim TN(\lambda_{cr}, \eta_{cr}^2)$

Array-specific distribution of non-specific hybridisation

41

Posterior distributions of BGX: single array vs multiple array analyses

Mean ± 1 SD

Three replicate arrays analysed separately

Three replicate arrays analysed together (multiple array model)

42

Subset of AffyU133A spike-in data set (AffyComp)

Consider:

• Six arrays, 1154 genes (every 20th and 42 spike-ins)

• Same cRNA hybridised to all arrays EXCEPT for spike-ins:

Group:                      1     2     3     …  12     13     14
Spike-in genes:             1-3   4-6   7-9   …  34-36  37-39  40-42
Spike-in conc (pM):
  Condition 1 (array 1-3):  0.0   0.25  0.50  …  128    256    512
  Condition 2 (array 4-6):  0.25  0.50  1.00  …  256    512    0.00
Fold change:                -     2     2     …  2      2      -

43

M v A plots:

True fold changes: Black: zero, Red: 2

A: $\tfrac{1}{2}(\mathrm{expr}_{g,1} + \mathrm{expr}_{g,2})$, M: $\mathrm{expr}_{g,1} - \mathrm{expr}_{g,2}$

NB! Point estimates used

MAS5 and RMA: $\mathrm{expr}_{gc}$ = mean over three replicates

BGX: Multiple array index

44

BGX: measure of uncertainty provided

Posterior mean ± 1 SD credibility intervals for $\mathrm{diff}_g = \mathrm{bgx}_{g,1} - \mathrm{bgx}_{g,2}$

Spike-ins (genes 1113-1154) are above the blue line

Blue stars show RMA measure

45

Mixture and Bayesian estimation of false discovery rates

Natalia Bochkina, Alex Lewin, SR, Philippe Broët

46

Multiple Testing Problem

• Gene lists can be built by computing a criterion separately for each gene and ranking

• Thousands of genes are considered simultaneously

• How to assess the performance of such lists?

Statistical challenge: select interesting genes without including too many false positives in a gene list

A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set up

Want an evaluation of the expected false discovery rate (FDR)

47

Bayesian Estimate of FDR

• Step 1: Choose a gene specific parameter (e.g. dg ) or a gene statistic (see later)

• Step 2: Model its prior (resp marginal) distribution using a mixture model

-- with one component that models the unaffected genes (null hypothesis) e.g. point mass at 0 for dg

-- other components that model (flexibly) the alternative

• Step 3: Calculate the posterior probability for any gene of belonging to the unmodified component : pg0 | data

• Step 4: Evaluate FDR (and FNR) for any list. Assuming that all the gene classifications are independent:

Bayes FDR (list) | data = $\frac{1}{\mathrm{card}(\mathrm{list})} \sum_{g \in \mathrm{list}} p_{g0}$
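The Bayes FDR of a list is the average, over the genes on the list, of their posterior probabilities of belonging to the null component. The sketch below computes it and, as one possible way of fixing the FDR, grows the list in order of increasing null probability until a target would be exceeded. The probabilities and function names are illustrative, not taken from the talk.

```python
def bayes_fdr(p_null, selected):
    """Average posterior null probability over the selected genes (0.0 for an empty list)."""
    if not selected:
        return 0.0
    return sum(p_null[g] for g in selected) / len(selected)


def largest_list_with_fdr(p_null, target_fdr):
    """Add genes in order of increasing p_g0 while the Bayes FDR stays within target."""
    order = sorted(range(len(p_null)), key=lambda g: p_null[g])
    selected, running_sum = [], 0.0
    for g in order:
        # The running average never decreases along this ordering, so stop at the first violation.
        if (running_sum + p_null[g]) / (len(selected) + 1) > target_fdr:
            break
        selected.append(g)
        running_sum += p_null[g]
    return selected


p_null = [0.01, 0.90, 0.05, 0.30, 0.02, 0.70]  # made-up posterior null probabilities
gene_list = largest_list_with_fdr(p_null, target_fdr=0.10)  # genes 0, 4, 2, 3
fdr = bayes_fdr(p_null, gene_list)                          # about 0.095
```

Note that gene 3, with a null probability of 0.30, still enters the list: the criterion bounds the average over the list, not each gene individually.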

48

Mixture prior

• To obtain a gene list, a commonly used method (cf. Lonnstedt & Speed 2002, Newton 2003, Smyth 2003, …) is to define a mixture prior for $d_g$:

• H0: $d_g = 0$, point mass at 0 with probability $p_0$

• H1: $d_g \sim$ flexible 2-sided distribution to model differential expression

Classify each gene following its posterior probabilities of not being in the null: 1- pg0

Use Bayes rule or fix the FDR

49

Classification with mixture prior

• Joint estimation of all the mixture parameters (including p0) avoids plugging-in of values (e.g. p0) that are influential on the classification

• Sensitivity to prior settings of the alternative distribution and performance has been tested on simulated data sets

Work in progress

Poster by Natalia Bochkina

50

Performance of the mixture prior

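$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \varepsilon_{g1r}$, r = 1, …, $R_1$

$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \varepsilon_{g2r}$, r = 1, …, $R_2$

(For simplification, we assume that the data has been pre-normalised)

$\sigma^2_g \sim IG(a, b)$

$d_g \sim p_0 \delta_0 + p_1 G(1.5, \lambda_1) + p_2 G(1.5, \lambda_2)$

H0: the point mass $\delta_0$; H1: the two G components

Dirichlet distribution for $(p_0, p_1, p_2)$

Exponential hyper prior for $\lambda_1$ and $\lambda_2$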

51

Simulated data

$y_{gr} \sim N(d_g, \sigma^2_g)$ (8 replicates)

$\sigma^2_g \sim IG(1.5, 0.05)$

$d_g \sim (-1)^{\mathrm{Bern}(0.5)}\, G(2,2)$, g = 1:200

$d_g = 0$, g = 201:1000

Choice of simulation parameters inspired by estimates found in analyses of biological data sets

Plot of the true differences
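A data set of this form could be generated along the following lines. The slide does not say how IG and G are parameterised; this sketch assumes shape and rate for the gamma, and takes the inverse gamma as the reciprocal of a gamma with the same shape and rate. The function name and seed are illustrative.

```python
import math
import random


def simulate_dataset(n_genes=1000, n_changed=200, n_reps=8, seed=0):
    """Return (true differences, replicate data) under the stated simulation design."""
    rng = random.Random(seed)
    true_d, data = [], []
    for g in range(n_genes):
        # Inverse gamma IG(1.5, 0.05) as 1 / Gamma(shape=1.5, rate=0.05); scale = 1 / rate.
        sigma2 = 1.0 / rng.gammavariate(1.5, 1.0 / 0.05)
        if g < n_changed:
            sign = -1.0 if rng.random() < 0.5 else 1.0     # (-1)^Bern(0.5)
            d = sign * rng.gammavariate(2.0, 1.0 / 2.0)    # Gamma(2, 2), rate assumed
        else:
            d = 0.0                                        # unmodified genes
        true_d.append(d)
        data.append([rng.gauss(d, math.sqrt(sigma2)) for _ in range(n_reps)])
    return true_d, data


true_d, data = simulate_dataset()
```

The first 200 entries of `true_d` are the non-zero differences of either sign and the remaining 800 are exactly zero, matching the two groups of genes on the slide.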

52

Posterior estimates of fold change using mixture model

53

Comparison of mixture classification and posterior probabilities for the standardised differences

In red, 200 genes with $d_g \neq 0$

[Figure axes: Probability($|d_g^*| > 2$ | data) and Post Prob ($g \in H_1$).]

False negatives: 31 (4%)

False positives: 10 (6%)

54

Post Prob ($g \in H_1$) = $1 - p_{g0}$

Bayes rule

[Figure: FDR (black) and FNR (blue) as a function of $1 - p_{g0}$.]

55

Using mixtures for modelling the marginal distribution of gene statistics

• Instead of modelling the prior for $d_g$ as a mixture, an alternative is
– To summarise differential expression by a gene statistic
– To model its marginal distribution as a mixture such that the distribution is approximately known under H0, and use a flexible distribution for the alternative

56

Mixture modelling of transformed F statistics

Gene statistic based on classical F statistic (this was developed to analyse multiclass ( > 2 conditions) experiments)

Gives a de-centred asymmetric marginal distribution rather than a two-tailed one

Transform F -> approx. standard Normal if no change across conditions (H0).

Use a mixture of normals (variable number) for modelling the alternative (following Richardson and Green 1997)

57

Results for simulated data (to detect modified profile over 3 conditions)

Broet, Lewin, SR 2004

Bayes mixture estimate of FDR is close to true value

Case A: well separated null and alternative hypotheses

Case B: less separated null and alternative hypotheses

For details, see the poster by Alex Lewin

58

Marginal mixture performance for the simulated data (2 conditions, same data as for the prior mixture)

[Figure: number on list as a function of cut-off probability; expected number of false positives.]

59

Simulated data, comparison of prior and marginal mixture classification

Good agreement between the 2 approaches

The marginal mixture has more false positives

Transformation to Normality for 2 conditions??

Further comparison in progress

60

Summary

Bayesian gene expression measure (BGX)

• Good range of resolution, provides credibility intervals

Differential Expression

• Expression-level-dependent normalisation

• Borrow information across genes for variances

• Joint distribution of ranks, gene lists based on posterior probabilities

False Discovery Rate

• Mixture gives good estimate of FDR and classifies well

Future work

• Mixture prior on BGX index, with uncertainty propagated to mixture parameters, comparison of marginal and prior mixture approaches, clustering for more general experimental set-ups

61

Papers and technical reports:

Hein AM., Richardson S., Causton H., Ambler G. and Green P. (2004) BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data (submitted)

Lewin A., Richardson S., Marshall C., Glazier A. and Aitman T. (2003) Bayesian Modelling of Differential Gene Expression (submitted)

Broët P., Lewin A., Richardson S., Dalmasso C. and Magdelenat H. (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. (Bioinformatics, advanced access April 29 2004)

Broët, P., Richardson, S. and Radvanyi, F. (2002) Bayesian Hierarchical Model for Identifying Changes in Gene Expression from Microarray Experiments , Journal of Computational Biology 9, 671-683.

Available at http://www.bgx.org.uk/

Thanks

http://www.bgx.org.uk/


