Structured statistical modelling of gene expression data Peter Green (Bristol) Sylvia Richardson, Alex Lewin, Anne-Mette Hein (Imperial) with Clare Marshall, Natalia Bochkina (Imperial) Graeme Ambler (Bristol) Tim Aitman and Helen Causton (Hammersmith) BGX Windsor, October 2004


Structured statistical modelling of gene expression data

Peter Green (Bristol)
Sylvia Richardson, Alex Lewin, Anne-Mette Hein (Imperial)

with Clare Marshall, Natalia Bochkina (Imperial)
Graeme Ambler (Bristol)

Tim Aitman and Helen Causton (Hammersmith)

BGX

Windsor, October 2004


Statistical modelling and biology

• Extracting the message from microarray data needs statistical as well as biological understanding

• Statistical modelling – in contrast to data analysis – gives a framework for formally organising assumptions about signal and noise

• Our models are structured, reflecting data generation process: ‘highly structured stochastic systems’


Background and 3 studies

• Hierarchical modelling
• A fully Bayesian gene expression index (BGX)
• Differential expression and array effects
• Two-way clustering


Part 1: Hierarchical modelling


Gene expression using Affymetrix chips

[Figure: Affymetrix GeneChip array. Labels: array 1.28 cm across; approx. ½ million different complementary oligonucleotides; each 20 µm oligonucleotide element holds millions of copies of a specific oligonucleotide sequence; single-stranded, labelled RNA sample; hybridised spot. Image of the hybridised array, with a zoomed view showing expressed and non-expressed genes.]

Slide courtesy of Affymetrix


Variation and uncertainty

Gene expression data (e.g. Affymetrix) is the result of multiple sources of variability:

• condition/treatment
• biological
• array manufacture
• imaging
• technical
• within/between array variation
• gene-specific variability

Structured statistical modelling allows all of this uncertainty to be considered at once


Costs and benefits of this approach

Advantages of avoiding plug-in approach

• Uncertainties propagated throughout model

• Realistic estimates of variability

• Avoid bias

The price you pay – computational costs

• Intricate implementation

• Longer run times (but far less than experimental protocol!)


Part 2: A fully Bayesian gene expression index (BGX)


A fully Bayesian Gene eXpression index for Affymetrix GeneChip arrays

Anne-Mette Hein
Sylvia Richardson, Helen Causton, Graeme Ambler, Peter Green

[Diagram: several PM/MM probe pairs, subject to gene-specific (probe-level) variability, are combined into the BGX gene index]


Single array model: motivation

Key observations → Conclusions:

• PMs and MMs both increase with spike-in concentration (MMs more slowly than PMs) → MMs bind a fraction of the signal

• Spread of PMs increases with level → Multiplicative (and additive) error; transformation needed

• Considerable variability in PM (and MM) response within a probe set → Varying reliability in gene expression estimation for different genes

• Probe effects approximately additive on log scale → Estimate gene expression measure from PMs and MMs on log scale


Model assumptions and key biological parameters

• The intensity of the PM measurement for probe (reporter) j and gene g is due to binding
  - of labelled fragments that perfectly match the oligos in the spot (the true signal $S_{gj}$)
  - of labelled fragments that do not perfectly match these oligos (the non-specific hybridisation $H_{gj}$)

• The intensity of the corresponding MM measurement is caused
  - by a binding fraction $\Phi$ of the true signal $S_{gj}$
  - by non-specific hybridisation $H_{gj}$
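
These two assumptions translate directly into expected intensities. The following Python sketch uses made-up values for the signal, the non-specific hybridisation and the binding fraction (none of them come from the slides); it shows that the PM minus MM difference removes the non-specific part but recovers only the fraction (1 - Φ) of the signal.

```python
def expected_intensities(signal, nonspecific, phi):
    """Expected PM and MM intensities for one probe pair.

    signal      -- true signal S_gj (illustrative value)
    nonspecific -- non-specific hybridisation H_gj (illustrative value)
    phi         -- fraction of the signal that binds to the MM probe
    """
    pm = signal + nonspecific
    mm = phi * signal + nonspecific
    return pm, mm


pm, mm = expected_intensities(signal=400.0, nonspecific=100.0, phi=0.25)
print(pm, mm, pm - mm)  # PM - MM equals (1 - phi) * signal
```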


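BGX single array model

Indices: $g = 1, \dots, G$ (thousands), $j = 1, \dots, J$ (11-20)

$PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$
$MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$

where $S_{gj}$ is the signal, $H_{gj}$ the non-specific hybridisation, $\Phi$ the fraction, and $\tau^2$ the additive background noise. TN denotes a truncated normal distribution.

Signal: $\log(S_{gj} + 1) \sim TN(\mu_g, \xi_g^2)$, $j = 1, \dots, J$

Gene-specific error terms, exchangeable: $\log(\xi_g^2) \sim N(a, b^2)$ (“Empirical Bayes”)

Non-specific hybridisation, array-wide distribution: $\log(H_{gj} + 1) \sim TN(\lambda, \eta^2)$

Gene expression index (BGX): $\theta_g = \mathrm{median}(TN(\mu_g, \xi_g^2))$, which “pools” information over probes $j = 1, \dots, J$

Priors (“vague”): $\tau^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\Phi \sim B(1,1)$, $\mu_g \sim U(0,15)$, $\eta^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\lambda \sim N(0, 10^3)$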
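
Since the BGX index is defined as the median of a truncated normal, it can be computed in closed form once μ and ξ are given. The sketch below is a minimal Python illustration that assumes the truncation is to non-negative values (the slide does not state the truncation point); the function name is hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def truncated_normal_median(mu, xi):
    """Median of N(mu, xi**2) truncated to [0, inf).

    The truncation point 0 is an assumption of this sketch.
    """
    below_zero = _STD_NORMAL.cdf(-mu / xi)  # mass the untruncated normal puts below 0
    return mu + xi * _STD_NORMAL.inv_cdf(0.5 * (1.0 + below_zero))


print(truncated_normal_median(8.0, 1.0))  # far from 0: essentially mu
print(truncated_normal_median(0.5, 1.0))  # near 0: pulled above mu
```

In a full analysis this function would be applied to each posterior draw of (μ, ξ) for a gene, which gives a posterior sample of the index rather than a single number.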


Markov chain Monte Carlo (MCMC) computation

• Fitting of Bayesian models hugely facilitated by advent of these simulation methods

• Produce a large sample of values of all unknowns, from posterior given data

• Easy to set up for hierarchical models

• BUT can be slow to run (for many variables!)

• and can fail to converge reliably
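
To make the idea concrete, here is a minimal random-walk Metropolis sampler in Python. It is a generic textbook illustration, not the sampler used for BGX; the toy data, the known unit variance and the flat prior are all assumptions of the sketch.

```python
import math
import random


def metropolis(log_post, start, step, n_draws, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    x, lp = start, log_post(start)
    draws = []
    for _ in range(n_draws):
        proposal = x + rng.gauss(0.0, step)
        lp_prop = log_post(proposal)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = proposal, lp_prop
        draws.append(x)
    return draws


data = [4.8, 5.1, 5.3, 4.9, 5.4]  # toy observations, unit variance assumed


def log_post(mean):
    """Log posterior of the mean under a flat prior, up to a constant."""
    return -0.5 * sum((y - mean) ** 2 for y in data)


draws = metropolis(log_post, start=0.0, step=1.0, n_draws=20000)
kept = draws[2000:]  # discard burn-in
print(sum(kept) / len(kept))  # should be close to the sample mean of the data
```

Even in this one-parameter case the choice of step size and burn-in length matters, which hints at why run time and convergence become real concerns with thousands of unknowns.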


Sample in place of a distribution - 1D


Sample in place of a distribution - 2D
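
Once a sample stands in for the distribution, any summary is just a computation on the draws. A minimal Python sketch for the one-dimensional case, using simulated normal draws as a stand-in for MCMC output (the numbers are illustrative only):

```python
import random
import statistics


def summarise(sample, level=0.95):
    """Median and equal-tailed interval computed from a sample of draws."""
    ordered = sorted(sample)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * (n - 1))]
    upper = ordered[int((1.0 - tail) * (n - 1))]
    return statistics.median(ordered), lower, upper


rng = random.Random(1)
draws = [rng.gauss(3.0, 0.5) for _ in range(10000)]  # stand-in for MCMC output
median, lower, upper = summarise(draws)
print(median, lower, upper)
```

The same approach carries over to two or more dimensions: joint draws are kept together, and any function of them (a difference, a ratio, an indicator) is summarised in the same way.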


Single array model performance

• Data set: varying concentrations (GeneLogic)

• 14 samples of cRNA from an acute myeloid leukemia (AML) tumor cell line

• In sample k, each of 11 genes is spiked in at concentration c_k:

sample k:    1    2    3     4    5    6    7    8    9     10  11  12  13   14
conc. (pM):  0.0  0.5  0.75  1.0  1.5  2.0  3.0  5.0  12.5  25  50  75  100  150

• Each sample hybridised to an array

• Consider subset consisting of 500 normal genes + 11 spike-ins


Signal & expression indices

‘True signal’ / expression index BGX increases with concentration

10 arrays: gene 1 spiked-in at increasing concentrations

Lines: 95% credibility intervals for $\log(S_{gj}+1)$
Curves: posterior for signal


Non-specific hybridisation

10 arrays: gene 1 spiked-in at increasing concentrations

Non-specific hybridisation does not increase with concentration

Lines: 95% credibility intervals for $\log(H_{gj}+1)$
Curves: posterior for signal


Comparison with other expression measures

11 genes spiked in at 13 (increasing) concentrations

BGX index for gene g increases with concentration ...

... except for gene 7 (incorrectly spiked-in??)

Indication of smooth and sustained increase over a wider range of concentrations


95% credibility intervals for Bayesian gene expression index

11 spike-in genes at 13 different concentrations (data set A)

Note how the variability is substantially larger for low expression levels

Each colour corresponds to a different spike-in gene
Gene 7: broken red line


Part 3: Differential expression and array effects


Bayesian modelling of differential gene expression, adjusting for array effects

Alex Lewin
Sylvia Richardson, Natalia Bochkina, Clare Marshall, Anne Glazier, Tim Aitman

• The spontaneously hypertensive rat (SHR): a model of human insulin resistance syndromes.

• Deficiency in gene Cd36 found to be associated with insulin resistance in SHR

• Following this, several animal models were developed in which other relevant genes are knocked out → comparison between knocked-out and wildtype (normal) mice or rats.

See poster!


Data set & biological question

Microarray Data

Data set A (MAS 5) (12000 genes on each array)
3 SHR compared with 3 transgenic rats

Data set B (RMA) (22700 genes on each array)
8 wildtype (normal) mice compared with 8 knocked out mice

Biological Question

Find genes which are expressed differently in wildtype and knockout / transgenic mice


Exploratory analysis showing array effect

Condition 1 (3 replicates)

Condition 2 (3 replicates)


Differential expression model

The quantity of interest is the difference between conditions for each gene: $d_g$, $g = 1, \dots, N$

Joint model for the 2 conditions:

$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \beta_{1r}(\alpha_g) + \varepsilon_{g1r}$, $r = 1, \dots, R_1$

$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \beta_{2r}(\alpha_g) + \varepsilon_{g2r}$, $r = 1, \dots, R_2$

where
• $y_{gcr}$ is log gene expression for gene g, condition c, replicate r
• $\alpha_g$ is the overall gene effect
• $\beta_{cr}(\alpha)$ is the array effect, a smooth function of $\alpha$
• $\varepsilon_{gcr}$ is normally distributed error, with gene- and condition-specific variance


Differential expression model

Joint modelling of array effects and differential expression:

• Performs normalisation simultaneously with estimation

• Gives fewer false positives

Can work with any desired composite criterion for identifying ‘interesting’ genes, e.g. fold change and overall expression level


Plot of log fold change versus overall expression level

Data set A: 3 wildtype mice compared to 3 knockout mice (U74A chip), MAS5

Criterion: gene is of interest if |log fold change| > log(2) and log(overall expression) > 4

• The majority of the genes have very small $p_{g,X}$: 90% of genes have $p_{g,X} < 0.2$
• Genes with $p_{g,X} > 0.5$ (green): 280
• Genes with $p_{g,X} > 0.8$ (red): 46
• Plot annotation: $p_{g,X} = 0.49$
• Genes with low overall expression have a greater range of fold change than those with higher expression
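
Assuming p_{g,X} denotes the posterior probability that gene g satisfies criterion X (the slide does not spell this out), it can be estimated from MCMC output as the fraction of draws that meet the criterion. A minimal Python sketch follows; the function name and the five draws are hypothetical.

```python
import math


def prob_of_interest(log_fold_draws, log_expr_draws,
                     fold_cut=math.log(2.0), expr_cut=4.0):
    """Fraction of posterior draws in which one gene meets the composite criterion."""
    hits = sum(
        1
        for d, a in zip(log_fold_draws, log_expr_draws)
        if abs(d) > fold_cut and a > expr_cut
    )
    return hits / len(log_fold_draws)


# Five hypothetical posterior draws for a single gene
d_draws = [0.9, 0.5, -0.8, 1.1, 0.75]   # log fold change
a_draws = [5.0, 5.2, 3.9, 4.6, 4.1]     # log overall expression
print(prob_of_interest(d_draws, a_draws))
```

Because the criterion is evaluated draw by draw, any composite rule can be substituted without changing the model fit.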


Part 4: Two-way (gene by sample) clustering


Hierarchical clustering of samples

A subset of 1161 gene expression profiles, obtained in 60 different samples

Ross et al, Nature Genetics, 2000

The gene expression profiles cluster according to tissue of origin of the samples

Red: more mRNA, Green: less mRNA in the sample compared to a reference


Non-model-based clustering

• Many clustering algorithms have been developed and used for exploratory purposes

• They rely on a measure of ‘distance’ (dissimilarity) between gene or sample profiles, e.g. Euclidean

• Hierarchical clustering proceeds in an agglomerative manner: single profiles are joined to form groups using the distance metric, recursively

• Good visual tool, but many arbitrary choices → care needed in interpretation!
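
The agglomerative procedure can be sketched in a few lines of Python. The version below uses Euclidean distance and single linkage (distance between the closest members of two groups); the linkage rule is one of the arbitrary choices mentioned above and is picked here only for illustration, as are the toy profiles.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(profiles, n_clusters):
    """Single-linkage agglomerative clustering; returns lists of profile indices."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                dist = min(
                    euclidean(profiles[p], profiles[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # join the two closest groups
        del clusters[j]
    return clusters


profiles = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.5, 10.0), (0.0, 2.0)]
print(agglomerate(profiles, 2))
```

Switching to complete or average linkage, or to a correlation-based distance, can change the resulting groups, which is why the output should be read as exploratory.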


Model-based clustering

• Build the cluster structure into the model, rather than estimating gene effects (say) first, and post-processing to seek clusters

• Bayesian setting allows use of real prior information where it exists (biological understanding of pathways, etc, previous experiments, …)


Additive ‘ANOVA’ models for (log-) gene expression

The simplest model: gene + sample

$y_{gs} = \alpha_g + \beta_s + \varepsilon_{gs}$, where g = gene and s = sample/condition

The model generates the method, and in this case performs a simple form of normalisation

Under standard conditions, the (least-squares) estimates of gene effects are

$\hat{\alpha}_g = \bar{y}_{g\cdot} - \bar{y}_{\cdot\cdot}$
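
For a complete genes by samples table with the usual sum-to-zero constraints, the least-squares gene effect is the gene's row mean minus the grand mean. A minimal Python sketch with a made-up 3 x 3 table (the function name and data are illustrative):

```python
def gene_effects(y):
    """Least-squares gene effects for a complete genes x samples table.

    y[g][s] is the log expression of gene g in sample s.
    """
    n_genes, n_samples = len(y), len(y[0])
    grand_mean = sum(sum(row) for row in y) / (n_genes * n_samples)
    return [sum(row) / n_samples - grand_mean for row in y]


y = [
    [7.0, 8.0, 9.0],  # gene 1
    [4.0, 5.0, 6.0],  # gene 2
    [1.0, 2.0, 3.0],  # gene 3
]
print(gene_effects(y))  # row means minus the grand mean
```

Adding a constant to every gene in one sample shifts each row mean and the grand mean by the same amount, so the gene effects are unchanged up to rounding; this is the simple normalisation the model provides.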


... bring in mixture modelling ... (single sample first!)

$y_g = \alpha_g + \varepsilon_g$, where g = gene

$y_g = \alpha_{T_g} + \varepsilon_g$

$T_g$ = unknown cluster to which gene g belongs
This is a mixture model
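
In a mixture model the unknown label T_g has a posterior distribution over clusters. The Python sketch below computes these membership probabilities for one gene in the simplified case where cluster weights, means and a common standard deviation are treated as known; in the Bayesian analysis these would be unknowns sampled alongside T_g, and all values here are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd**2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def membership_probs(y_g, weights, means, sd):
    """Posterior probability of each cluster for one gene, parameters treated as known."""
    joint = [w * normal_pdf(y_g, m, sd) for w, m in zip(weights, means)]
    total = sum(joint)
    return [j / total for j in joint]


# Two hypothetical clusters with equal weights and unit standard deviation
print(membership_probs(1.0, weights=[0.5, 0.5], means=[0.0, 2.0], sd=1.0))
print(membership_probs(0.0, weights=[0.5, 0.5], means=[0.0, 2.0], sd=1.0))
```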


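… finally allow clusters to overlap – ‘Plaid’ model

$y_{gs} = \sum_h \rho_{gh} \kappa_{sh} \theta_{gs}^{(h)} + \varepsilon_{gs}$

h denotes a ‘cluster’, ‘block’ or ‘layer’ – pathway? $\rho_{gh}$ = 0 or 1 and $\kappa_{sh}$ = 0 or 1

Possible forms for the layer effect:

$\theta_{gs}^{(h)} = \mu^{(h)}$
$\theta_{gs}^{(h)} = \mu^{(h)} + \alpha_g^{(h)}$
$\theta_{gs}^{(h)} = \mu^{(h)} + \beta_s^{(h)}$
$\theta_{gs}^{(h)} = \mu^{(h)} + \alpha_g^{(h)} + \beta_s^{(h)}$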


‘Plaid’ model

[Figure: plaid layers on a grid with genes g on one axis and samples s on the other]


An early experiment : artificial raw data

Artificial data from a very special case of the Plaid model: single sample s

True H = 3, b(h) = 2.2, 3.4 and 4.7, N(0,2) noise; 500 genes, some in each of the $2^3 = 8$ configurations of 0/1 layer memberships

8 overlapping normal clusters
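
The eight clusters can be enumerated directly. The Python sketch below assumes that a gene's mean is the sum of the b(h) values of the layers it belongs to (additive overlap, as in a plaid model with one sample); that reading is an assumption, and only the three b(h) values come from the slide.

```python
from itertools import product

layer_levels = [2.2, 3.4, 4.7]  # the three b(h) values quoted on the slide

# Each configuration is a tuple of 0/1 layer memberships for one gene
cluster_means = {
    config: sum(level for level, member in zip(layer_levels, config) if member)
    for config in product((0, 1), repeat=len(layer_levels))
}

for config, mean in sorted(cluster_means.items(), key=lambda item: item[1]):
    print(config, round(mean, 1))
```

Under this assumption several cluster means lie within a few noise standard deviations of each other, which is what makes the clusters overlap.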


true H was 3

true b(h) were 2.2, 3.4, 4.7


Human fibroblast data – Lemon et al (2002)

• 18 samples split into 3 categories: serum starved, serum stimulated and a 50:50 mix of starved/stimulated.

• We used the natural logarithm of Lemon et al.’s calculated LWF values as our measure of expression and subtracted gene and sample mean levels.

• We then selected the 100 most variable genes across all 18 samples and used this 18×100 array as the input to our analysis.


Bayesian clustering

• Hierarchical model allows us to learn about all unknowns simultaneously

• In particular, this includes complete 2-way classification, gene by sample, with numerical uncertainties

• We then construct visualisations of interesting aspects (marginal distributions) of this posterior


Bayesian clustering: samples


Bayesian clustering: genes


More details, papers and code

• www.stats.bris.ac.uk/BGX/

• www.bgx.org.uk