Structured statistical modelling of gene expression data
Peter Green (Bristol)
Sylvia Richardson, Alex Lewin, Anne-Mette Hein (Imperial)
with Clare Marshall, Natalia Bochkina (Imperial)
Graeme Ambler (Bristol)
Tim Aitman and Helen Causton (Hammersmith)
BGX
Windsor, October 2004
Statistical modelling and biology
• Extracting the message from microarray data needs statistical as well as biological understanding
• Statistical modelling – in contrast to data analysis – gives a framework for formally organising assumptions about signal and noise
• Our models are structured, reflecting data generation process: ‘highly structured stochastic systems’
Background and 3 studies
• Hierarchical modelling
• A fully Bayesian gene expression index (BGX)
• Differential expression and array effects
• Two-way clustering
Part 1: Hierarchical modelling
Gene expression using Affymetrix chips
[Figure: Affymetrix GeneChip schematic, slide courtesy of Affymetrix. Labels: approx. ½ million different complementary oligonucleotides; each oligonucleotide element holds millions of copies of a specific oligonucleotide sequence; single stranded, labeled RNA sample; hybridised spot; scale marks 1.28cm and 20µm. Panels: image of hybridised array; zoom image of hybridised array showing expressed and non-expressed genes.]
Variation and uncertainty
Gene expression data (e.g. Affymetrix) are the result of multiple sources of variability:
• condition/treatment
• biological
• array manufacture
• imaging
• technical
• within/between array variation
• gene-specific variability
Structured statistical modelling allows all of these uncertainties to be considered at once
Costs and benefits of this approach
Advantages of avoiding a plug-in approach
• Uncertainties propagated throughout the model
• Realistic estimates of variability
• Avoid bias
The price you pay – computational costs
• Intricate implementation
• Longer run times (but far shorter than the experimental protocol!)
Part 2: A fully Bayesian gene expression index (BGX)
A fully Bayesian Gene eXpression index for Affymetrix GeneChip arrays
Anne-Mette Hein
Sylvia Richardson, Helen Causton, Graeme Ambler, Peter Green
[Diagram: the PM/MM probe pairs of a probe set, with gene specific variability (probe), feed into the BGX gene index]
Single array model: motivation
Key observations and the conclusions drawn from them:
• Observation: PMs and MMs both increase with spike-in concentration (MMs more slowly than PMs)
  Conclusion: MMs bind a fraction of the signal
• Observation: Spread of PMs increases with level
  Conclusion: Multiplicative (and additive) error; transformation needed
• Observation: Considerable variability in PM (and MM) response within a probe set
  Conclusion: Varying reliability in gene expression estimation for different genes
• Observation: Probe effects approximately additive on log scale
  Conclusion: Estimate gene expression measure from PMs and MMs on log scale
Model assumptions and key biological parameters
• The intensity for the PM measurement for probe (reporter) j and gene g is due to binding
  • of labelled fragments that perfectly match the oligos in the spot (the true signal Sgj)
  • of labelled fragments that do not perfectly match these oligos (the non-specific hybridisation Hgj)
• The intensity of the corresponding MM measurement is caused
  • by a binding fraction Φ of the true signal Sgj
  • by non-specific hybridisation Hgj
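BGX single array model
g = 1, …, G (thousands), j = 1, …, J (11-20)

$$
\begin{aligned}
PM_{gj} &\sim N(S_{gj} + H_{gj},\ \tau^2) \\
MM_{gj} &\sim N(\Phi S_{gj} + H_{gj},\ \tau^2) \\
\log(S_{gj} + 1) &\sim TN(\mu_g,\ \xi_g^2) \\
\log(H_{gj} + 1) &\sim TN(\lambda,\ \eta^2) \\
\log(\xi_g^2) &\sim N(a,\ b^2) \\
\theta_g &= \mathrm{median}\big(TN(\mu_g,\ \xi_g^2)\big)
\end{aligned}
$$

• $S_{gj}$: signal; $H_{gj}$: non-specific hybridisation; $\Phi$: fraction; $\tau^2$: background noise, additive
• $\log(H_{gj} + 1)$: array-wide distribution
• $\log(\xi_g^2)$: gene specific error terms, exchangeable (“Empirical Bayes”)
• Gene expression index (BGX) $\theta_g$: “pools” information over probes j = 1, …, J
• Priors: “vague”: $\tau^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\Phi \sim B(1,1)$, $\mu_g \sim U(0,15)$, $\eta^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\lambda \sim N(0, 10^3)$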
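To make the generative structure of the single array model concrete, here is a minimal simulation sketch for one probe set. It is an illustration only, not the BGX software: the parameter values are invented, the truncation point of the TN distributions is assumed to be 0, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def sample_truncated_normal(rng, mean, sd, lower=0.0):
    """Draw from N(mean, sd^2) truncated to [lower, inf) by simple rejection.

    Assumption: truncation at 0; rejection is only efficient when most of the
    untruncated mass lies above `lower`.
    """
    while True:
        x = rng.gauss(mean, sd)
        if x >= lower:
            return x

def truncated_normal_median(mean, sd, lower=0.0):
    """Median of N(mean, sd^2) truncated to [lower, inf): the index for one gene."""
    std = NormalDist()
    p_lower = std.cdf((lower - mean) / sd)
    # The median splits the probability mass above the truncation point in half.
    return mean + sd * std.inv_cdf(p_lower + 0.5 * (1.0 - p_lower))

def simulate_probe_set(rng, mu_g, xi_g, lam, eta, phi, tau, n_probes):
    """Simulate (PM, MM) pairs for one gene.

    PM ~ N(S + H, tau^2) and MM ~ N(phi * S + H, tau^2), with log(S + 1) and
    log(H + 1) drawn from truncated normals.
    """
    pairs = []
    for _ in range(n_probes):
        signal = math.exp(sample_truncated_normal(rng, mu_g, xi_g)) - 1.0
        nonspecific = math.exp(sample_truncated_normal(rng, lam, eta)) - 1.0
        pm = rng.gauss(signal + nonspecific, tau)
        mm = rng.gauss(phi * signal + nonspecific, tau)
        pairs.append((pm, mm))
    return pairs

rng = random.Random(0)
# Illustrative (made-up) parameter values for a single gene.
pairs = simulate_probe_set(rng, mu_g=6.0, xi_g=0.5, lam=4.0, eta=0.3,
                           phi=0.2, tau=5.0, n_probes=16)
theta_g = truncated_normal_median(6.0, 0.5)
```

Fitting the model runs in the opposite direction: given observed PM and MM values, MCMC is used to learn about the signal, the non-specific hybridisation and the index.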
Markov chain Monte Carlo (MCMC) computation
• Fitting of Bayesian models hugely facilitated by advent of these simulation methods
• Produce a large sample of values of all unknowns, from posterior given data
• Easy to set up for hierarchical models
• BUT can be slow to run (for many variables!)
• and can fail to converge reliably
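As a toy illustration of producing a sample from a posterior, here is a minimal random-walk Metropolis sketch for a single unknown mean. It is not the sampler used for BGX: the data, the known error standard deviation, the prior and the step size are all assumptions chosen for the example.

```python
import math
import random

def log_posterior(mu, data, sigma=1.0, prior_sd=10.0):
    """Unnormalised log posterior: N(0, prior_sd^2) prior, N(mu, sigma^2) likelihood."""
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_lik = sum(-0.5 * ((y - mu) / sigma) ** 2 for y in data)
    return log_prior + log_lik

def metropolis(data, n_iter=20000, step=0.5, seed=1):
    """Random-walk Metropolis: returns the chain of sampled values of mu."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    draws = []
    for _ in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio); otherwise keep mu.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        draws.append(mu)
    return draws

data = [4.1, 5.3, 4.8, 5.9, 5.2, 4.6, 5.5, 5.0]
draws = metropolis(data)[2000:]          # discard an initial burn-in period
posterior_mean = sum(draws) / len(draws)
ordered = sorted(draws)
interval_95 = (ordered[int(0.025 * len(ordered))],
               ordered[int(0.975 * len(ordered))])
```

Summaries such as the posterior mean or a 95% credibility interval are then read directly off the sample, which is what "sample in place of a distribution" means in practice.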
Sample in place of a distribution - 1D
Sample in place of a distribution - 2D
Single array model performance
• Data set: varying concentrations (GeneLogic):
• 14 samples of cRNA from acute myeloid leukemia (AML) tumor cell line
• In sample k: each of 11 genes spiked in at concentration ck:

sample k:    1     2     3     4     5     6     7     8     9     10    11    12    13    14
conc. (pM):  0.0   0.5   0.75  1.0   1.5   2.0   3.0   5.0   12.5  25    50    75    100   150

• Each sample hybridised to an array
• Consider subset consisting of 500 normal genes + 11 spike-ins
Signal & expression indices
‘True signal’/expression index BGX increases with concentration
10 arrays: gene 1 spiked-in at increasing concentrations
Lines: 95% credibility intervals for log(Sgj+1)
Curves: posterior for signal
Non-specific hybridisation
10 arrays: gene 1 spiked-in at increasing concentrations
Non-specific hybridisation does not increase with concentration
Lines: 95% credibility intervals for log(Hgj+1)
Curves: posterior for signal
Comparison with other expression measures
11 genes spiked in at 13 (increasing) concentrations
BGX index θg increases with concentration …
… except for gene 7 (incorrectly spiked-in??)
Indication of smooth & sustained increase over a wider range of concentrations
95% credibility intervals for Bayesian gene expression index
11 spike-in genes at 13 different concentrations (data set A)
Note how the variability is substantially larger for low expression level
Each colour corresponds to a different spike-in gene
Gene 7: broken red line
Part 3: Differential expression and array effects
Bayesian modelling of differential gene expression, adjusting for array effects
Alex Lewin
Sylvia Richardson, Natalia Bochkina, Clare Marshall, Anne Glazier, Tim Aitman
• The spontaneously hypertensive rat (SHR): a model of human insulin resistance syndromes.
• Deficiency in gene Cd36 found to be associated with insulin resistance in SHR.
• Following this, several animal models were developed in which other relevant genes are knocked out → comparison between knocked-out and wildtype (normal) mice or rats.
See poster!
Data set & biological question
Microarray Data
Data set A (MAS 5) (12000 genes on each array): 3 SHR compared with 3 transgenic rats
Data set B (RMA) (22700 genes on each array): 8 wildtype (normal) mice compared with 8 knocked out mice
Biological Question
Find genes which are expressed differently in wildtype and knockout / transgenic mice
Exploratory analysis showing array effect
Condition 1 (3 replicates)
Condition 2 (3 replicates)
Differential expression model
The quantity of interest is the difference between conditions for each gene: $d_g$, g = 1, …, N
Joint model for the 2 conditions:
$$ y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \beta_{1r}(\alpha_g) + \varepsilon_{g1r}, \quad r = 1, \dots, R_1 $$
$$ y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \beta_{2r}(\alpha_g) + \varepsilon_{g2r}, \quad r = 1, \dots, R_2 $$
where
• $y_{gcr}$ is log gene expression for gene g, condition c, replicate r
• $\alpha_g$ is overall gene effect
• $\beta_{cr}(\alpha)$ is array effect, a smooth function of $\alpha$
• $\varepsilon_{gcr}$ is normally distributed error, with gene- and condition-specific variance
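The joint model can be read as a recipe for generating data, sketched below. This is a hedged illustration, not the authors' code: the quadratic form of the array effect, all numeric values and the function names are assumptions, and `naive_difference` is the plug-in style estimate that ignores array effects, shown only for contrast.

```python
import random

def array_effect(alpha, shift, curvature):
    """Hypothetical smooth array effect: a quadratic in the gene effect alpha."""
    return shift + curvature * (alpha - 7.0) ** 2

def simulate(n_genes=200, n_reps=3, n_de=20, seed=0):
    """Simulate log expression y[(c, r)] per gene under the two-condition model."""
    rng = random.Random(seed)
    # One made-up (shift, curvature) pair per array, i.e. per condition and replicate.
    arrays = {(c, r): (rng.gauss(0.0, 0.2), rng.gauss(0.0, 0.02))
              for c in (1, 2) for r in range(n_reps)}
    genes = []
    for g in range(n_genes):
        alpha = rng.uniform(4.0, 10.0)          # overall gene effect
        d = 1.0 if g < n_de else 0.0            # true difference between conditions
        y = {}
        for c, sign in ((1, -0.5), (2, 0.5)):
            sd = rng.uniform(0.1, 0.3)          # gene- and condition-specific error sd
            for r in range(n_reps):
                shift, curvature = arrays[(c, r)]
                y[(c, r)] = (alpha + sign * d
                             + array_effect(alpha, shift, curvature)
                             + rng.gauss(0.0, sd))
        genes.append({"alpha": alpha, "d": d, "y": y})
    return genes

def naive_difference(y, n_reps=3):
    """Difference of condition means, ignoring array effects."""
    mean1 = sum(y[(1, r)] for r in range(n_reps)) / n_reps
    mean2 = sum(y[(2, r)] for r in range(n_reps)) / n_reps
    return mean2 - mean1

genes = simulate()
estimates = [naive_difference(gene["y"]) for gene in genes]
```

Because the array effects differ between arrays and depend on expression level, the naive difference is shifted by an amount that varies with the gene effect, which is the motivation for estimating array effects and differential expression jointly.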
Differential expression model
Joint modelling of array effects and differential expression:
• Performs normalisation simultaneously with estimation
• Gives fewer false positives
Can work with any desired composite criterion for identifying ‘interesting’ genes, e.g. fold change and overall expression level
Criterion: gene is of interest if |log fold change| > log(2) and log(overall expression) > 4
The majority of the genes have very small pg,X: 90% of genes have pg,X < 0.2
Genes with pg,X > 0.5 (green): # 280
Genes with pg,X > 0.8 (red): # 46
[Figure: plot of log fold change versus overall expression level, with one gene labelled pg,X = 0.49. Data set A: 3 wildtype mice compared to 3 knockout mice (U74A chip), MAS5]
Genes with low overall expression have a greater range of fold change than those with higher expression
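A composite criterion is easy to evaluate once posterior samples are available. The sketch below assumes that pg,X denotes the posterior probability that gene g satisfies criterion X, and that each MCMC draw supplies a pair (log fold change, log overall expression); the draws and gene names are made up for illustration.

```python
import math

def criterion_probability(draws, fold_cut=math.log(2.0), level_cut=4.0):
    """Fraction of posterior draws (d_g, alpha_g) that meet the composite criterion."""
    hits = sum(1 for d, alpha in draws
               if abs(d) > fold_cut and alpha > level_cut)
    return hits / len(draws)

# Hypothetical posterior draws: (log fold change, log overall expression) per gene.
draws_by_gene = {
    "gene_a": [(1.2, 6.1), (0.9, 6.0), (0.5, 5.8), (1.1, 6.3)],
    "gene_b": [(0.1, 7.0), (-0.2, 7.2), (0.0, 6.9), (0.3, 7.1)],
}
p = {gene: criterion_probability(draws) for gene, draws in draws_by_gene.items()}
flagged = [gene for gene, prob in p.items() if prob > 0.5]
```

Genes can then be ranked or thresholded on this probability, as with the 0.5 and 0.8 cut-offs quoted above.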
Part 4: Two-way (gene by sample) clustering
Hierarchical clustering of samples
A subset of 1161 gene expression profiles, obtained in 60 different samples
Ross et al, Nature Genetics, 2000
The gene expression profiles cluster according to tissue of origin of the samples
Red: more mRNA, Green: less mRNA in the sample compared to a reference
Non-model-based clustering
• Many clustering algorithms have been developed and used for exploratory purposes
• They rely on a measure of ‘distance’ (dissimilarity) between gene or sample profiles, e.g. Euclidean
• Hierarchical clustering proceeds in an agglomerative manner: single profiles are joined to form groups using the distance metric, recursively
• Good visual tool, but many arbitrary choices → care needed in interpretation!
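The agglomerative procedure, and the way an arbitrary choice changes the answer, can be seen in a few lines. This is a minimal sketch with made-up one-dimensional profiles; the function names are hypothetical and the brute-force search is only suitable for tiny examples.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(profiles, n_clusters, linkage=min):
    """Merge the two closest clusters until n_clusters remain.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = linkage(euclidean(profiles[i], profiles[j])
                               for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

profiles = [(0.0,), (1.0,), (2.2,), (3.6,), (5.2,)]
single = agglomerate(profiles, 2, linkage=min)
complete = agglomerate(profiles, 2, linkage=max)
```

On these five profiles single linkage chains the first four together and isolates the last, while complete linkage splits them two against three, so the same data and the same distance yield different groupings.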
Model-based clustering
• Build the cluster structure into the model, rather than estimating gene effects (say) first, and post-processing to seek clusters
• Bayesian setting allows use of real prior information where it exists (biological understanding of pathways, etc, previous experiments, …)
Additive ‘ANOVA’ models for (log-) gene expression
$$ y_{gs} = \alpha_g + \beta_s + \varepsilon_{gs} $$
g = gene, s = sample/condition
The simplest model: gene + sample
The model generates the method, and in this case performs a simple form of normalisation
Under standard conditions, the (least-squares) estimates of gene effects are
$$ \hat{\alpha}_g = \bar{y}_{g\cdot} - \bar{y}_{\cdot\cdot} $$
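For the simplest gene + sample model, the least-squares estimates reduce to row and column means. The sketch below assumes the identifiability constraint that gene effects sum to zero, so the overall level is absorbed into the sample effects; the data matrix and function name are illustrative.

```python
def additive_fit(y):
    """Least-squares fit of y[g][s] = gene effect + sample effect + error.

    Gene effects are row means minus the grand mean (they sum to zero);
    sample effects are column means, which absorb the overall level.
    """
    n_genes, n_samples = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (n_genes * n_samples)
    gene_effects = [sum(row) / n_samples - grand for row in y]
    sample_effects = [sum(y[g][s] for g in range(n_genes)) / n_genes
                      for s in range(n_samples)]
    residuals = [[y[g][s] - gene_effects[g] - sample_effects[s]
                  for s in range(n_samples)]
                 for g in range(n_genes)]
    return gene_effects, sample_effects, residuals

# Toy log-expression matrix: 2 genes (rows) by 3 samples (columns).
y = [[5.0, 6.0, 7.0],
     [3.0, 4.0, 5.0]]
gene_effects, sample_effects, residuals = additive_fit(y)
```

Subtracting the sample effects is the "simple form of normalisation" mentioned above: it removes sample-wide shifts before genes are compared.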
... bring in mixture modelling …
$$ y_g = \alpha_g + \varepsilon_g $$
g = gene
$$ y_g = \alpha_{T_g} + \varepsilon_g $$
Tg = unknown cluster to which gene g belongs
This is a mixture model
(single sample first!)
… finally allow clusters to overlap – ‘Plaid’ model
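h denotes a ‘cluster’, ‘block’ or ‘layer’ – pathway?
$$ y_{gs} = \sum_h \rho_{gh}\,\kappa_{sh}\,\theta^{(h)}_{gs} + \varepsilon_{gs} $$
with $\rho_{gh}$ = 0 or 1 and $\kappa_{sh}$ = 0 or 1, and layer effects of one of the forms
$$ \theta^{(h)}_{gs} = b^{(h)} $$
$$ \theta^{(h)}_{gs} = b^{(h)} + \alpha^{(h)}_g $$
$$ \theta^{(h)}_{gs} = b^{(h)} + \beta^{(h)}_s $$
$$ \theta^{(h)}_{gs} = b^{(h)} + \alpha^{(h)}_g + \beta^{(h)}_s $$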
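The overlapping-layer idea can be sketched as a sum over the layers that contain both a given gene and a given sample. The dictionary layout, the key names (`b` for the layer level, `alpha` and `beta` for optional gene and sample adjustments) and all values below are assumptions made for illustration, not the notation or code of the original work.

```python
def plaid_mean(g, s, layers):
    """Expected expression for gene g in sample s under overlapping layers.

    Each layer contributes only if it contains both the gene and the sample
    (the two 0/1 membership indicators are both 1).
    """
    total = 0.0
    for layer in layers:
        if g in layer["genes"] and s in layer["samples"]:
            total += (layer["b"]
                      + layer["alpha"].get(g, 0.0)
                      + layer["beta"].get(s, 0.0))
    return total

# Two hypothetical layers that overlap on gene 1 and sample 1.
layers = [
    {"genes": {0, 1}, "samples": {0, 1}, "b": 2.0,
     "alpha": {0: 0.5, 1: -0.5}, "beta": {}},
    {"genes": {1, 2}, "samples": {1, 2}, "b": 3.0,
     "alpha": {}, "beta": {}},
]
overlap = plaid_mean(1, 1, layers)     # both layers contribute
outside = plaid_mean(2, 0, layers)     # no layer contains this gene/sample pair
```

A cell covered by several layers receives the sum of their contributions, which is what distinguishes this structure from a partition into non-overlapping clusters.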
‘Plaid’ model
[Figure: schematic of the ‘Plaid’ model; axes: genes (g), samples (s)]
An early experiment : artificial raw data
Artificial data from a very special case of the Plaid model: single sample s
True H=3, b(h)=2.2, 3.4 and 4.7, N(0,2); 500 genes, some in each of 2^3 = 8 configurations of the 0/1 gene-layer membership indicators
8 overlapping normal clusters
true H was 3
true b(h) were 2.2, 3.4, 4.7
Human fibroblast data – Lemon et al (2002)
• 18 samples split into 3 categories: serum starved, serum stimulated and a 50:50 mix of starved/stimulated.
• We used the natural logarithm of Lemon et al.’s calculated LWF values as our measure of expression and subtracted gene and sample mean levels.
• We then selected the 100 most variable genes across all 18 samples and used this 18×100 array as the input to our analysis.
Bayesian clustering
• Hierarchical model allows us to learn about all unknowns simultaneously
• In particular, this includes complete 2-way classification, gene by sample, with numerical uncertainties
• We then construct visualisations of interesting aspects (marginal distributions) of this posterior
Bayesian clustering: samples
Bayesian clustering: genes
More details, papers and code
• www.stats.bris.ac.uk/BGX/
• www.bgx.org.uk