Statistical Analysis of CGH Arrays
UMR AgroParisTech / INRA: Mathématique et Informatique Appliquées
SSB group: genome.jouy.inra.fr/ssb/
ACI-IMPBIO, Paris, 04-10-2007
Outline
1 - Detecting chromosomal aberrations
2 - Looking for breakpoints
3 - Segments classification
4 - Multiple arrays analysis
5 - Looking for common aberrations
6 - Future work
S. Robin: Statistical analysis of CGH data
1 - Detecting chromosomal aberrations
1.1 - Within chromosome aberration
• Change of scale: what are the effects of deletions/amplifications of small DNA sequences?
→ experimental tool: "conventional" CGH (resolution ∼ 10 Mb).
• CGH = Comparative Genomic Hybridization: a method for the comparative measurement of relative DNA copy numbers between two samples (normal/disease, test/reference).
→ Application of the microarray technology to CGH (resolution ∼ 100 kb).

Chapter 1. Definition of array-based Comparative Genomic Hybridization

Figure 1.3: Schematic representation of array CGH conception.

Although CGH has become a standard method for cytogenetic studies, technical limitations restrict its usefulness as a comprehensive screening tool: the condensed and supercoiled state of the target DNA in the chromosomes limits the resolution to 10 Mb for loss and 2 Mb for amplification (Beheshti et al. (2002)). The resolution of Comparative Genomic Hybridizations has been greatly improved using microarray technology (Solinas-Toldo et al. (1997)).

1.2 Application of microarray technology to comparative genomic hybridization

The difference between chromosome CGH and array-based CGH lies in the support which is used for hybridization. For chromosome CGH, this support is a chromosome, whereas in CGH array experiments, the support is a slide. Since more and more DNA clones have been mapped and sequenced, they are spotted on a slide (Figure 1.3). In parallel, genomic DNA is extracted from biological samples, amplified and labelled with fluorescent dyes, called Cy3 and Cy5 (Figure 1.4). This mixture of targets is hybridized on the chip, and DNA sequences can bind their complementary template. Since probes are uniquely localized on the slide, the quantification of the fluorescence signals on the chip will define a measurement of the abundance of thousands of genomic sequences in a cell in a given condition.

Microarray technology is well-known and widely used to study gene expression profiles. CGH microarrays use reference DNA that does not present any alteration, allowing an "absolute" quantification of genomic imbalances for the sample DNA. The application of microarray technology to CGH has improved the resolution from megabases to 100 kb. Pinkel et al. (1998) further refined this technique and have shown that CGH microarrays can detect chromosomal aberrations of 40 kb.
1.2 - Microarray technology in its principle
1.3 - Interpretation of a CGH profile
Data size.
A few thousand points per genome
A few hundred points per chromosome
A dot on the graph = $\log_2 \left\{ \dfrac{\#\,\text{copies of BAC}(t)\ \text{in the test genome}}{\#\,\text{copies of BAC}(t)\ \text{in the reference genome}} \right\}$
[Figure: CGH profile, log2 ratio against genomic position, showing an unaltered segment, a deleted segment and amplified segments.]
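To make the scale of the log-ratio concrete, here is a minimal sketch of the idealised (noise-free) values. It assumes a diploid reference with 2 copies, which is an illustrative assumption and not stated on the slide; the function name `log2_ratio` is hypothetical.

```python
import math


def log2_ratio(test_copies: float, reference_copies: float) -> float:
    """Log2 ratio of copy numbers for one BAC (both counts must be positive)."""
    return math.log2(test_copies / reference_copies)


# Idealised values against a reference carrying 2 copies (assumption), ignoring noise
for label, copies in [("loss of one copy", 1), ("normal", 2), ("gain of one copy", 3)]:
    print(label, round(log2_ratio(copies, 2), 3))
```

Under this assumption an unaltered segment sits at 0, a single-copy loss at -1 and a single-copy gain near 0.585; real profiles are noisy versions of such plateaus.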
2 - Looking for breakpoints
2.1 - Statistical model
• At position t, there exists a 'true' log-ratio, which depends on the relative copy number.
• The value of the true log-ratio is affected by abrupt changes:
[Figure: a chromosome split by the breakpoints t0, t1, t2, t3, t4 into segments I1, I2, I3, I4, each segment Ik having its own mean µk and standard deviation σ(k).]
Positions t1, t2, .. are called breakpoints. µk is the true log-ratio in segment Ik.
• The observed signal Yt is noisy:
if $t \in I_k$: $Y_t \sim \mathcal{N}(\mu_k, \sigma^2_{(k)})$, with $\{Y_t\}_t$ independent.
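The model above can be simulated in a few lines. The sketch below is only an illustration: it assumes a common standard deviation for all segments, 0-based positions, and a breakpoint list that includes both ends; the function name and the numerical values are hypothetical.

```python
import random


def simulate_profile(breakpoints, means, sigma, seed=0):
    """Simulate a piecewise-constant Gaussian profile.

    breakpoints = [t0, t1, ..., tK] with t0 = 0 and tK = n;
    segment k (0-based) covers positions breakpoints[k] .. breakpoints[k+1] - 1.
    """
    rng = random.Random(seed)
    profile = []
    for k, mu in enumerate(means):
        length = breakpoints[k + 1] - breakpoints[k]
        profile.extend(rng.gauss(mu, sigma) for _ in range(length))
    return profile


# Three segments: unaltered, deleted, amplified (illustrative values)
y = simulate_profile([0, 30, 50, 100], [0.0, -1.0, 0.6], sigma=0.2)
print(len(y))
```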
Segmentation as a linear model
The model defined above can be stated in the general linear model framework:
Y = Tµ + E
where
• Y is the n-vector containing the profile,
• T is the n × K segmentation matrix that associates each position to a segment,
• µ is the K-vector containing the means of each segment,
• E is the n-vector containing the random noise (i.i.d. ∼ $\mathcal{N}(0, \sigma^2)$).
Remark. In our case, the matrix T has to be estimated.
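As an illustration of the notation only, the sketch below builds T for breakpoints that are assumed known and computes the noise-free signal Tµ. The helper names are hypothetical, and plain Python lists stand in for matrices.

```python
def segmentation_matrix(breakpoints, n):
    """n x K matrix of 0/1: row t has a single 1, in the column of its segment.

    breakpoints = [0, t1, ..., n] (0-based, both ends included).
    """
    K = len(breakpoints) - 1
    T = [[0] * K for _ in range(n)]
    for k in range(K):
        for t in range(breakpoints[k], breakpoints[k + 1]):
            T[t][k] = 1
    return T


def mat_vec(T, mu):
    """Product T mu: each position receives the mean of its segment."""
    return [sum(a * b for a, b in zip(row, mu)) for row in T]


T = segmentation_matrix([0, 2, 5], 5)
print(mat_vec(T, [0.0, -1.0]))
```

In the actual problem T is unknown, so this construction is only the inner step of the estimation, not the estimation itself.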
Estimating the parameters
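Log-likelihood (with a constant variance σ², written up to a multiplicative factor of 2, which does not change the maximiser):
$\mathcal{L}_K(T, \Theta) = -n \log \sigma^2 - \dfrac{1}{\sigma^2} \sum_{k=1}^{K} \sum_{t \in I_k} (Y_t - \mu_k)^2 + \text{cst}.$
• Because the data are assumed to be independent, the log-likelihood is additive (i.e. it is a sum over all the segments).
• Given the breakpoint positions, the estimates of $\mu_k$ and $\sigma^2$ (or $\sigma^2_{(k)}$) are straightforward:
$\hat\mu_k = \dfrac{1}{n_k} \sum_{t \in I_k} Y_t$ and $\hat\sigma^2 = \dfrac{1}{n} \sum_{k=1}^{K} \sum_{t \in I_k} (Y_t - \hat\mu_k)^2$ or $\hat\sigma^2_k = \dfrac{1}{n_k} \sum_{t \in I_k} (Y_t - \hat\mu_k)^2$,
where $n_k$ is the number of positions in segment $I_k$.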
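The estimators above translate directly into code once the breakpoints are fixed. This is a minimal sketch with a hypothetical function name; breakpoints are given as a 0-based list that includes both ends.

```python
def estimate_parameters(y, breakpoints):
    """Segment means, pooled variance and per-segment variances.

    breakpoints = [0, t1, ..., n]; segment k is y[breakpoints[k]:breakpoints[k+1]].
    """
    n = len(y)
    means, seg_vars, rss_total = [], [], 0.0
    for start, end in zip(breakpoints[:-1], breakpoints[1:]):
        segment = y[start:end]
        mu = sum(segment) / len(segment)
        rss = sum((v - mu) ** 2 for v in segment)  # residual sum of squares of the segment
        means.append(mu)
        seg_vars.append(rss / len(segment))        # heterogeneous-variance estimate
        rss_total += rss
    return means, rss_total / n, seg_vars          # pooled estimate divides by n


print(estimate_parameters([0.0, 0.0, 2.0, 4.0], [0, 2, 4]))
```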
How to find the breakpoints?
When the number of segments K is known, we have to minimize
$J_K(1, n) = \sum_{k=1}^{K} \sum_{t \in I_k} (Y_t - \mu_k)^2.$
• The $\binom{n-1}{K-1}$ possible choices for the positions of the breakpoints $t_1, t_2, \ldots, t_{K-1}$ cannot be explored exhaustively for large n and K.
• $\sum_{t \in I_k} (Y_t - \mu_k)^2$ can be viewed as the 'cost' of segment $I_k$, i.e. the cost of gathering data $Y_{t_{k-1}+1}$ to $Y_{t_k}$ into a single segment.
• The optimization problem is actually a shortest path problem that can be solved by dynamic programming [Picard et al., 2005].
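The dynamic programming recursion can be sketched as follows. This is a generic least-squares segmentation written for illustration, not the authors' implementation: the function name, the 0-based indexing and the use of prefix sums to get each segment cost in constant time are choices made here. Its running time is of order K n².

```python
def segment_dp(y, K):
    """Best least-squares segmentation of y into K segments.

    Returns (J_K, breakpoints) with breakpoints = [0, t1, ..., n] (0-based).
    """
    n = len(y)
    # Prefix sums of y and y^2 give the cost of any segment in O(1)
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(y):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):
        """Cost of gathering y[i:j] into a single segment."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j] = minimal cost of splitting y[:j] into k segments
    best = [[inf] * (n + 1) for _ in range(K + 1)]
    arg = [[0] * (n + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j], arg[k][j] = c, i
    # Backtrack from the end of the profile to recover the breakpoints
    bps = [n]
    for k in range(K, 0, -1):
        bps.append(arg[k][bps[-1]])
    return best[K][n], bps[::-1]


j3, bps3 = segment_dp([0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 1.0, 1.0], 3)
print(j3, bps3)
```

The table `best` also contains the optimal costs for every smaller number of segments, which is convenient when K itself has to be selected.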
One last problem: the selection of K
• $J_K$ decreases when K (complexity) increases.
• Standard criteria such as AIC or BIC are not valid.
• Penalty function: pen(K) = K + 1 or 2K.
• We look for the minimum of
$J_K + \beta\,\text{pen}(K)$
where β is given [Lebarbier, 2005] or adaptively estimated [Lavielle, 2003].
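Once the contrasts $J_K$ are available for a range of K, the penalised choice is a one-line minimisation. The sketch below assumes β is given and uses pen(K) = 2K by default; the contrast values are made up for illustration and the function name is hypothetical. It does not implement the adaptive estimation of β.

```python
def select_K(J, beta, pen=lambda K: 2 * K):
    """Return the K minimising J[K] + beta * pen(K).

    J maps each candidate number of segments K to its contrast J_K.
    """
    return min(J, key=lambda K: J[K] + beta * pen(K))


# Illustrative contrasts: a sharp drop up to K = 3, then a plateau
J = {1: 100.0, 2: 40.0, 3: 10.0, 4: 9.0, 5: 8.5}
print(select_K(J, beta=2.0))   # a larger beta favours fewer segments
print(select_K(J, beta=0.1))   # a smaller beta favours more segments
```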
Chapter 4. Process segmentation

[Figure 4.1: plot of the contrast J_K against pen(K).]
Figure 4.1: Illustration of the model selection procedure proposed by Lavielle (2005). Circles represent the convex hull of the contrast. The vertical line indicates the number of segments for which the contrast ceases to decrease significantly.

4.4 Bayesian formulation of the multiple change-point problem

In order to complete this review on segmentation methods, we present another modelling strategy that has been considered for this problem, in the Bayesian framework. See Green (1995), Carlin (1992), Barry and Hartigan (1993), Avery and Henderson (1999), Lavielle and Lebarbier (2001) for instance. Previous sections were dedicated to strategies whose objective is to provide the best segmentation on the data, based on a specific criterion. The objective is different in the Bayesian setting, where the number of segments as well as their position is random. As a consequence, their posterior distribution will be used to choose the most appropriate number of segments, and will provide local information regarding the position of the breakpoints.

The model can be specified as follows. Let [...] be a real process such that [...] where [...] is a sequence of zero-mean random variables. The function [...] to be recovered is supposed piecewise constant. With the conventional notations:
2.2 - Example of segmentation on array CGH data
Are the variances $\sigma^2_k$ homogeneous? BT474 cell line, chromosome 9:
[Figure: two panels, "Homogeneous variances" and "Heterogeneous variances", K = 4 segments; log2 ratio against genomic position.]
Adaptive choice of the number of segments. BT474 cell line, chromosome 1:
[Figure: two panels, "Homogeneous variances" (K = 10 segments) and "Heterogeneous variances" (K = 2 segments); log2 ratio against genomic position.]
Homogeneous variances result in smaller segments.
Comparative study
Lai et al. (Bioinformatics, 2005). On both synthetic and real data (GBM brain tumor data), the dynamic programming approach outperformed the other methods available at that time.
3 - Segments classification
Considering the biologists' objective and the need for a new model.
[Figure: two example profiles plotted over positions 0 to 40.]
We would like segments of the same type ('normal', 'deleted', 'amplified') to be gathered into classes.
3.1 - A segmentation-clustering model
• We suppose that there exists a secondary unobserved structure that clusters the K segments into P classes with proportions $\pi_1, \ldots, \pi_P$:
$\pi_p$ = proportion of class p, $\sum_p \pi_p = 1$.
• Conditionally on the class to which the segment belongs, we know the distribution of Y:
$t \in I_k,\ k \in p \ \Rightarrow\ Y_t \sim \mathcal{N}(m_p, s_p^2)$.
[Picard et al., 2007]
The parameters of this model are
• the breakpoint positions: $T = \{t_1, \ldots, t_{K-1}\}$,
• the mixture characteristics: $\Psi = (\{\pi_p\}, \{m_p\}, \{s_p^2\})$.
Matrix form
This model can be written as
Y = TCm + E
where
• C is the K × P classification matrix that assigns each segment to a class,
• m is the P-vector containing the mean of each class.
Remarks.
• This model is not linear anymore.
• C represents the unobserved part of this incomplete data model.
• Two levels of statistical units: the positions (t) and the segments (k).
3.2 - A hybrid estimation algorithm: DP-EM
Alternate parameter estimation, with K and P known:
1. When T is fixed, the Expectation-Maximisation (EM) algorithm estimates Ψ:
$\Psi^{(h+1)} = \arg\max_{\Psi} \left\{ \log \mathcal{L}_{KP}\left(\Psi, T^{(h)}\right) \right\}.$
2. When Ψ is fixed, Dynamic Programming (DP) estimates T:
$T^{(h+1)} = \arg\max_{T} \left\{ \log \mathcal{L}_{KP}\left(\Psi^{(h+1)}, T\right) \right\}.$
An increasing sequence of likelihoods:
$\log \mathcal{L}_{KP}(\Psi^{(h+1)}; T^{(h+1)}) \ge \log \mathcal{L}_{KP}(\Psi^{(h)}; T^{(h)})$
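Step 1 can be sketched as one EM iteration over the segments, with the breakpoints held fixed. The code below is an illustrative reading of the mixture model (each segment belongs to one class, and its points are i.i.d. Gaussian given that class); it is not the authors' code, the function name is hypothetical, and it omits safeguards such as empty classes or vanishing variances. The DP step for T is not shown.

```python
import math


def em_step(segments, props, means, variances):
    """One EM iteration for the segment mixture, breakpoints held fixed.

    segments: list of lists of observations, one list per segment.
    Returns updated (proportions, class means, class variances).
    """
    P, K = len(props), len(segments)
    # E step: posterior probability tau[k][p] that segment k belongs to class p
    tau = []
    for seg in segments:
        logw = [
            math.log(props[p])
            + sum(
                -0.5 * math.log(2 * math.pi * variances[p])
                - (v - means[p]) ** 2 / (2 * variances[p])
                for v in seg
            )
            for p in range(P)
        ]
        top = max(logw)  # subtract the maximum before exponentiating, for stability
        w = [math.exp(l - top) for l in logw]
        total = sum(w)
        tau.append([x / total for x in w])
    # M step: proportions are averaged over segments; means and variances
    # are averaged over points, each point weighted by its segment's posterior
    new_props = [sum(tau[k][p] for k in range(K)) / K for p in range(P)]
    new_means, new_vars = [], []
    for p in range(P):
        weight = sum(tau[k][p] * len(segments[k]) for k in range(K))
        m = sum(tau[k][p] * sum(segments[k]) for k in range(K)) / weight
        s2 = sum(
            tau[k][p] * sum((v - m) ** 2 for v in segments[k]) for k in range(K)
        ) / weight
        new_means.append(m)
        new_vars.append(s2)
    return new_props, new_means, new_vars
```

Iterating this function until the parameters stabilise gives the EM half of the alternation; the two levels of statistical units show up in the weights, which are counted per segment for the proportions and per position for the means and variances.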
3.3 - BT474 cell line
Interest of clustering: an easy case. Chromosome 9:
[Figure: two panels, "Segmentation" and "Segmentation / Clustering", K = 4 segments; log2 ratio against genomic position.]
Clustering defines ’deleted’, ’normal’ and ’amplified’ classes.
Interest of clustering: a more interesting case. Chromosome 1:
[Figure: two panels, "Segmentation" (K = 2) and "Segmentation / Clustering" (P = 3, K = 8); log2 ratio against genomic position.]
Clustering detects an outlier and captures a 'normal' segment within a large-variance region.
4 - Multiple arrays analysis
Two examples.
1 - Comparing A. thaliana mutants or ecotypes. Chromosomal rearrangements can be detected by comparing CGH profiles observed on individuals from different genotypes.
2 - Comparing groups of patients. To detect chromosomal aberrations associated with a specific disease (e.g. breast cancer), we analyse the profiles of several ill patients jointly.
First approach: Common breakpoints.
Assuming that all the patients of the same group have their breakpoints at the same positions amounts to segmenting a multivariate signal, which can be achieved using DP.
This is doable but biologically irrelevant.
Breast cancer dataset (Inst. Curie): Chromosome 8.
[Figure: two columns of panels, "Good prognosis (53 patients)" and "Bad prognosis (81 patients)"; top row: number of breakpoints, bottom row: position of the breakpoints.]
4.1 - Correlated profiles
We want to account for correlations between profiles.
Different probe affinities may alter all the profiles at the same position.
Denoting $Y_{it}$ the profile of patient i at position t (belonging to segment $I_{ik}$), we set
$Y_{it} = \mu_{ik} + U_t + E_{it}$
where $\mu_{ik}$ is the mean of segment $I_{ik}$ and $U_t$ is the random effect of probe t:
$\{E_{it}\}$ i.i.d. $\mathcal{N}(0, \sigma^2)$, $\{U_t\}$ i.i.d. $\mathcal{N}(0, \gamma^2)$.
4.2 - Mixed segmentation model
Covariance between profiles. For two distinct patients i ≠ i′:
$\text{Cov}(Y_{it}, Y_{i't'}) = \text{Cov}(U_t + E_{it}, U_{t'} + E_{i't'}) = \begin{cases} \gamma^2 & \text{if } t = t', \\ 0 & \text{otherwise.} \end{cases}$
Matrix form. We end up with a mixed linear model with a segmentation term (T):
Y = Tµ + ZU + E
where
• m is the number of patients and n is the length of each profile,
• Z is the (nm × n) design matrix that gives the probe corresponding to each data point,
• U is the n-vector containing the random effects of the probes.
[Lebarbier et al., 2007]
4.3 - Estimation of the parameters
Direct maximisation. Due to correlations, the variance V(Y) is not diagonal, so the log-likelihood L(Y) is not additive anymore ⇒ no dynamic programming.
A second DP-EM algorithm. The conditional variance of Y given U, V(Y|U), is diagonal, so DP can be used to estimate T given U. U is treated as unobserved data. [Foulley, Lecture notes]
E step. Calculate the conditional moments of the random effect given the data:
E(U|Y), V(U|Y).
M step. Denoting $\hat U = E(U|Y)$, perform the segmentation as follows:
$\widehat{T\mu} = \arg\min_{T\mu} \| Y - T\mu - Z\hat U \|^2.$
A two-stage dynamic programming is required to achieve this step for numerous patients. [Picard et al., ?]
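For the E step, the conditional moments have a closed form when the segment means and the variances σ² and γ² are treated as known. Under the model $Y_{it} = \mu_{ik} + U_t + E_{it}$ with independent Gaussian terms, standard Gaussian conditioning gives $E(U_t|Y) = \frac{\gamma^2}{\sigma^2 + m\gamma^2} \sum_i (Y_{it} - \mu_{ik})$ and $V(U_t|Y) = \frac{\sigma^2 \gamma^2}{\sigma^2 + m\gamma^2}$. The sketch below implements these two formulas; the function name and the data layout are hypothetical, and it is not taken from the cited references.

```python
def random_effect_moments(Y, fitted, sigma2, gamma2):
    """Conditional mean and variance of the probe effects U_t given the data.

    Y[i][t]: profile of patient i at position t.
    fitted[i][t]: mean of the segment of patient i that contains position t.
    Returns (list of E(U_t | Y), common value of V(U_t | Y)).
    """
    m, n = len(Y), len(Y[0])
    shrink = gamma2 / (sigma2 + m * gamma2)
    u_hat = [
        shrink * sum(Y[i][t] - fitted[i][t] for i in range(m)) for t in range(n)
    ]
    var = sigma2 * gamma2 / (sigma2 + m * gamma2)
    return u_hat, var


# Two patients, three probes; the second probe is low in both profiles
Y = [[0.0, -1.0, 0.0], [1.0, 0.0, 1.0]]
fitted = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
print(random_effect_moments(Y, fitted, sigma2=1.0, gamma2=1.0))
```

The predicted effect is a shrunken sum of the residuals at each probe, so a probe that is low in every profile receives a clearly negative value.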
4.4 - Nakao data, chromosome 20
The random effect has an influence on the detected breakpoints.
We find a large negative random effect $U_t$ at position 37.
→ poor probe affinity or wrong annotation.
Comparison individual / multiple segmentation.
The breakpoints detected around position 37 vanish.
5 - Looking for common aberrations
To detect aberrations associated with a disease (e.g. bladder cancer), we look for aberrations that appear 'significantly' often among a set of patients.
$X_{it}$ denotes the status of position t for patient i:
$X_{it}$ = −1 (loss), 0 (normal), +1 (gain).
[Huppe et al., 2004]
[Figure: aberration statuses along the genome (positions 0 to 2000, chromosomes 1 to 24 marked).]
5.1 - Minimal region
A minimal region is a sequence of successive positions for which the same status is observed in a large number of patients at the same time. [Rouveirol, 2007]
Data. M* = 31 patients present ℓ = 5 successive deletions between positions 1189 and 1193, in chromosome 9.
Question. Is this significant, given the number of patients (m = 84) and the profile length (n = 2340)?
[Figure: zoom on positions 1000 to 1500, chromosomes 8 to 12.]
5.2 - Motifs are back
Binary case. Assume that only two statuses exist:
$X_{it} = \begin{cases} 1 & \text{for aberration,} \\ 0 & \text{for normal.} \end{cases}$
Region. A minimal region is then an ℓ-run of 1s. Denote
$Y_{it} = \prod_{u=t-\ell+1}^{t} X_{iu} = \begin{cases} 1 & \text{if an } \ell\text{-run occurs at position } t \text{ in profile } i, \\ 0 & \text{otherwise.} \end{cases}$
Simultaneous occurrences. $Y_{+t} = \sum_i Y_{it}$ is the number of patients for which an ℓ-run (ℓ successive aberrations) occurs at t.
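The run indicators and their column sums are easy to compute with a running counter. This is a small sketch with hypothetical function names, using 0-based positions and binary profiles stored as lists.

```python
def run_indicator(x, ell):
    """y[t] = 1 if x[t-ell+1], ..., x[t] are all equal to 1, else 0."""
    y, run = [], 0
    for v in x:
        run = run + 1 if v == 1 else 0  # length of the current run of 1s ending at t
        y.append(1 if run >= ell else 0)
    return y


def simultaneous_runs(X, ell):
    """Y_{+t}: number of profiles in which an ell-run ends at position t."""
    rows = [run_indicator(x, ell) for x in X]
    return [sum(col) for col in zip(*rows)]


X = [
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
]
counts = simultaneous_runs(X, ell=3)
print(counts, max(counts))
```

The observed statistic compared with M* is the maximum of this vector over the positions.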
Significance of an observed minimal region. We have to calculate
$\Pr\left\{ \max_{\ell \le t \le n} Y_{+t} \ge M^* \right\}.$
5.3 - Markov Model
Model. Each profile $\{X_{it}\}_t$ is a 2-state stationary Markov chain (MC):
$\{X_{it}\}_t \sim \text{MC}(\Pi, \mu)$
where Π is the transition matrix and µ the stationary distribution ($\forall i: X_{i1} \sim \mu$).
The patients (profiles) are supposed to be independent.
Estimated transition probabilities and stationary distributions (in %). States = {0, 1}:
Gain: $\Pi^{+} = \begin{pmatrix} 99.74 & 0.26 \\ 1.98 & 98.02 \end{pmatrix}$, $\mu^{+} = (88.36 \ \ 11.64)$,
Loss: $\Pi^{-} = \begin{pmatrix} 99.72 & 0.28 \\ 2.26 & 97.74 \end{pmatrix}$, $\mu^{-} = (88.98 \ \ 11.02)$.
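For a 2-state chain the stationary distribution follows directly from the two off-diagonal transition probabilities, which gives a quick consistency check of the figures above. The function name is hypothetical; the inputs are the slide's percentages converted to probabilities.

```python
def stationary_two_state(p01, p10):
    """Stationary distribution (mu0, mu1) of a 2-state Markov chain.

    p01 = P(0 -> 1), p10 = P(1 -> 0); solves mu = mu * Pi.
    """
    total = p01 + p10
    return p10 / total, p01 / total


print(stationary_two_state(0.0026, 0.0198))  # gain
print(stationary_two_state(0.0028, 0.0226))  # loss
```

The results agree with the reported stationary distributions to within a few hundredths of a percentage point, a gap that is presumably due to the rounding of the transition probabilities.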
5.4 - Significance
We want to evaluate
$\Pr\left\{ \max_{\ell \le t \le n} Y_{+t} \ge M^* \right\}.$
Exact calculation. We need to consider m independent Markov chains of order ℓ − 1:
m = 84, ℓ = 15 → 8.3 × 10^86 states...
Upper bound. Denoting $X_{+t} = \sum_i X_{it}$, we have
$\Pr\left\{ \max_{\ell \le t \le n} Y_{+t} \ge M^* \right\} \le \Pr\left\{ \exists t:\ X_{+t}, X_{+,t+1}, \ldots, X_{+,t+\ell-1} \ge M^* \right\}.$
The latter probability can be calculated via an embedded Markov chain.
Lower bound. A lower bound can also be derived, using another technique. [R. et al., ?]
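As a rough cross-check of such bounds, the probability can also be approximated by plain Monte Carlo simulation of the Markov model. This is a different technique from the embedded Markov chain mentioned above, shown here only as a sketch with hypothetical function names and parameters. It is usable for moderate probabilities only: values as small as 1e-8 would need far too many simulations.

```python
import random


def simulate_max_runs(m, n, ell, p01, p10, rng):
    """Simulate m independent stationary 2-state chains of length n.

    Returns max over t of Y_{+t}, the number of chains with an ell-run ending at t.
    """
    mu1 = p01 / (p01 + p10)  # stationary probability of state 1
    counts = [0] * n
    for _ in range(m):
        state = 1 if rng.random() < mu1 else 0
        run = 0
        for t in range(n):
            if t > 0:
                flip = p01 if state == 0 else p10
                if rng.random() < flip:
                    state = 1 - state
            run = run + 1 if state == 1 else 0
            if run >= ell:
                counts[t] += 1
    return max(counts)


def monte_carlo_pvalue(m, n, ell, m_star, p01, p10, n_sim=200, seed=0):
    """Fraction of simulated data sets in which max_t Y_{+t} >= m_star."""
    rng = random.Random(seed)
    hits = sum(
        simulate_max_runs(m, n, ell, p01, p10, rng) >= m_star for _ in range(n_sim)
    )
    return hits / n_sim


# Small illustrative setting (not the bladder cancer data)
print(monte_carlo_pvalue(m=10, n=200, ell=3, m_star=5, p01=0.05, p10=0.3))
```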
Bladder cancer
Deleted regions in m = 84 patients, n = 2340 positions along 24 chromosomes.
position t*   chrom.   length ℓ   # patients M*   p (upper)   p (lower)
1189          9        5          31              6.04e-8     4.05e-8
1387          11       3          30              6.82e-7     6.08e-7
1430          11       3          27              5.08e-5     4.60e-5
1340          10       22         23              1.18e-4     3.02e-6
1457          11       3          26              1.91e-4     1.74e-4
1006          8        42         21              4.95e-4     1.86e-8
996           8        7          22              7.85e-3     4.78e-3
584           4        6          19              1.82e-1     1.39e-1
1947          17       26         16              2.33e-1     1.42e-2
594           4        17         17              2.34e-1     5.16e-2
643           5        11         17              4.18e-1     2.21e-1
(Significance: p (upper) and p (lower) are the upper and lower bounds on the p-value.)
• The upper and lower bounds are close, except for long regions.
• Deletions in chromosome 9 are known to be associated with bladder cancer.
6 - What's next?
Enriching the multiple array model:
• Classify the segments into statuses
• Account for possible covariates (regarding the probes or the patients)
• Maximum likelihood for the complete model including segmentation, classification, random effects and covariates seems intractable.
→ Approximations (variational approaches) are promising.
An R package should be available soon.
Recent or new technologies:
• High density: tiling arrays (≃ 10^5 probes per chromosome)
• High-throughput sequencing (discrete signal).
Collaborations
Institut Curie: F. Radvanyi, O. Delattre
Cancer center, San Francisco: C. Vaisse
INRA - URGV, Evry: J.-P. Renou
UMR AgroParisTech / INRA 518: E. Lebarbier, J.-J. Daudin, B. Thiam
UMR CNRS 5558, Lyon: F. Picard
Masaryk Univ., Czech Republic: E. Budinska
Univ. Paris X: L. Pierre
Univ. Paris XI: M. Lavielle
LIPN, UMR CNRS 7030: C. Rouveirol
Western Australia Univ.: V. Stefanov