
Statistical Analysis of CGH Arrays

S. Robin

UMR AgroParisTech / INRA: Mathématique et Informatique Appliquées
SSB group: genome.jouy.inra.fr/ssb/

ACI-IMPBIO
Paris, 04-10-2007

Outline

1 - Detecting chromosomal aberrations

2 - Looking for breakpoints

3 - Segments classification

4 - Multiple arrays analysis

5 - Looking for common aberrations

6 - Future work

S. Robin: Statistical analysis of CGH data


1 - Detecting chromosomal aberrations

1.1 - Within chromosome aberration

• Change of scale: what are the effects of deletions/amplifications of small DNA sequences?

→ experimental tool: "conventional" CGH (resolution ∼ 10 Mb).

• CGH = Comparative Genomic Hybridization: method for the comparative measurement of relative DNA copy numbers between two samples (normal/disease, test/reference).

→ Application of the microarray technology to CGH (resolution ∼ 100 kb).

[Inset: excerpt from "Chapter 1. Definition of array-based Comparative Genomic Hybridization"]

Figure 1.3: Schematic representation of array CGH conception.

Although CGH has become a standard method for cytogenetic studies, technical limitations restrict its usefulness as a comprehensive screening tool: the condensed and supercoiled state of the target DNA in the chromosomes limits the resolution to 10 Mb for loss and 2 Mb for amplification (Beheshti et al. (2002)). The resolution of Comparative Genomic Hybridizations has been greatly improved using microarray technology (Solinas-Toldo et al. (1997)).

1.2 Application of microarray technology to comparative genomic hybridization

The difference between chromosome CGH and array-based CGH lies in the support which is used for hybridization. For chromosome CGH, this support is a chromosome, whereas in CGH array experiments, the support is a slide. Since more and more DNA clones have been mapped and sequenced, they are spotted on a slide (Figure 1.3). In parallel, genomic DNA is extracted from biological samples, amplified and labelled with fluorescent dyes, called Cy3 and Cy5 (Figure 1.4). This mixture of targets is hybridized on the chip, and DNA sequences can bind their complementary template. Since probes are uniquely localized on the slide, the quantification of the fluorescence signals on the chip will define a measurement of the abundance of thousands of genomic sequences in a cell in a given condition.

Microarray technology is well-known and widely used to study gene expression profiles. CGH microarrays use reference DNA that does not present any alteration, allowing an "absolute" quantification of genomic imbalances for the sample DNA. The application of microarray technology to CGH has improved the resolution from megabases to 100 kb. Pinkel et al. (1998) further refined this technique and have shown that CGH microarrays can detect chromosomal aberrations of 40 kb.

1.2 - Microarray technology in its principle



1.3 - Interpretation of a CGH profile

Data size.

A few thousand points per genome

A few hundred points per chromosome

[Figure: CGH profile, log2 ratio against genomic position, with an unaltered segment, a deleted segment and amplified segments marked.]

A dot on the graph =

$$\log_2 \left\{ \frac{\#\ \text{copies of BAC}(t) \text{ in the test genome}}{\#\ \text{copies of BAC}(t) \text{ in the reference genome}} \right\}$$
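As a small worked illustration of the log2 ratio, the sketch below assumes a reference genome carrying 2 copies of each BAC (a diploid reference, which the slide does not state). The helper name `log_ratio` is hypothetical.

```python
import math

def log_ratio(test_copies, reference_copies=2):
    """log2 of the copy-number ratio test/reference for one BAC.

    reference_copies=2 is an assumption (diploid reference genome).
    """
    return math.log2(test_copies / reference_copies)

# Ideal (noise-free) values for a few copy numbers in the test genome.
for copies, label in [(1, "single-copy loss"), (2, "unaltered"),
                      (3, "single-copy gain"), (4, "two-copy gain")]:
    print(f"{label:17s} log2 ratio = {log_ratio(copies):+.3f}")
```

Under this assumption an unaltered segment sits at 0, a single-copy loss at −1 and a single-copy gain near +0.585, which is why deleted and amplified segments appear below and above the baseline of the profile.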


2 - Looking for breakpoints

2.1 - Statistical model

• At position t, there exists a 'true' log-ratio, which depends on the relative copy number.

• The value of the true log-ratio is affected by abrupt changes:

[Figure: a chromosome cut at positions t0 < t1 < t2 < t3 < t4 into four segments I1, ..., I4; segment Ik carries the parameters (µk, σ(k)).]

Positions t1, t2, ... are called breakpoints. µk is the true log-ratio in segment Ik.

• The observed signal Yt is noisy:

$$\text{if } t \in I_k: \quad Y_t \sim \mathcal{N}(\mu_k, \sigma^2_{(k)}), \qquad \{Y_t\}_t \text{ independent.}$$
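To make the model concrete, here is a minimal simulation sketch. The function name `simulate_profile`, the 0-based indexing and all numerical values are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def simulate_profile(breakpoints, means, sds, n, seed=0):
    """Draw independent Y_t ~ N(mu_k, sigma_k^2) for t in segment I_k.

    breakpoints: 0-based start indices of segments 2..K (hypothetical convention);
    means, sds:  one value per segment.
    """
    rng = np.random.default_rng(seed)
    bounds = [0, *breakpoints, n]
    y = np.empty(n)
    for k in range(len(means)):
        lo, hi = bounds[k], bounds[k + 1]
        y[lo:hi] = rng.normal(means[k], sds[k], size=hi - lo)
    return y

# Four segments: normal, deleted, normal, amplified (illustrative values).
y = simulate_profile(breakpoints=[40, 60, 90],
                     means=[0.0, -1.0, 0.0, 0.6],
                     sds=[0.2, 0.2, 0.2, 0.2],
                     n=120)
```

Passing different values in `sds` gives the heterogeneous-variance version of the model; equal values give the homogeneous one.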



Segmentation as a linear model

The model defined above can be stated in the general linear model framework:

Y = Tµ + E

where

• Y is the n-vector containing the profile,

• T is the n × K segmentation matrix that associates each position to a segment,

• µ is the K-vector containing the means of each segment,

• E is the n-vector containing the random noise (i.i.d. ∼ N(0, σ²)).

Remark. In our case, the matrix T has to be estimated.
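The following sketch shows what T looks like when the breakpoints are known, and checks that ordinary least squares on T returns the segment means. The helper `segmentation_matrix` and the toy data are hypothetical.

```python
import numpy as np

def segmentation_matrix(breakpoints, n):
    """n x K 0/1 matrix T with T[t, k] = 1 iff position t lies in segment k.

    breakpoints: 0-based start indices of segments 2..K (hypothetical convention).
    """
    bounds = [0, *breakpoints, n]
    K = len(bounds) - 1
    T = np.zeros((n, K))
    for k in range(K):
        T[bounds[k]:bounds[k + 1], k] = 1.0
    return T

y = np.array([0.1, -0.1, 0.0, 1.1, 0.9, 1.0, -0.5, -0.7])
T = segmentation_matrix([3, 6], n=len(y))

# With T known, Y = T mu + E is an ordinary linear model:
# the least-squares estimate of mu is the vector of segment means.
mu_hat, *_ = np.linalg.lstsq(T, y, rcond=None)
```

Each row of T contains a single 1, so the columns are orthogonal and the least-squares solution reduces to one average per segment. The difficulty in the segmentation problem is that T itself is unknown.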



Estimating the parameters

Log-likelihood (with a constant variance σ²):

$$\mathcal{L}_K(T, \Theta) = -\frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{k=1}^{K} \sum_{t \in I_k} (Y_t - \mu_k)^2 + \text{cst}.$$

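• Because the data are supposed to be independent, the log-likelihood is additive (i.e. a sum over all the segments).

• Given the breakpoint positions, the estimates of µk and of σ² (or σ²(k)) are straightforward:

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{t \in I_k} Y_t \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{K} \sum_{t \in I_k} (Y_t - \hat{\mu}_k)^2 \quad \text{or} \quad \hat{\sigma}^2_{(k)} = \frac{1}{n_k} \sum_{t \in I_k} (Y_t - \hat{\mu}_k)^2,$$

where n_k is the number of positions in segment Ik.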
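These estimators translate directly into a few lines of code. This is a minimal sketch for given breakpoints; the function name `segment_estimates` and the 0-based breakpoint convention are assumptions made for the illustration.

```python
import numpy as np

def segment_estimates(y, breakpoints):
    """Segment means plus common and per-segment variance estimates.

    breakpoints: 0-based start indices of segments 2..K (hypothetical convention).
    """
    y = np.asarray(y, dtype=float)
    segments = np.split(y, breakpoints)
    n_k = np.array([len(s) for s in segments])
    mu_hat = np.array([s.mean() for s in segments])
    # Residual sum of squares within each segment.
    rss = np.array([((s - s.mean()) ** 2).sum() for s in segments])
    sigma2_common = rss.sum() / len(y)   # homogeneous-variance estimate
    sigma2_by_segment = rss / n_k        # heterogeneous-variance estimates
    return mu_hat, sigma2_common, sigma2_by_segment

mu_hat, s2, s2_k = segment_estimates([0.0, 2.0, 5.0, 5.0, 5.0, 5.0], [2])
```

On this toy input the first segment has mean 1 and variance 1, the second has mean 5 and variance 0, and the pooled variance is 2/6.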



How to find the breakpoints?

When the number of segments K is known, we have to minimize

$$J_K(1, n) = \sum_{k=1}^{K} \sum_{t \in I_k} (Y_t - \mu_k)^2.$$

• The $\binom{n-1}{K-1}$ possible choices for the positions of the breakpoints t1, t2, ..., tK−1 cannot be explored for large n and K.

• $\sum_{t \in I_k} (Y_t - \mu_k)^2$ can be viewed as the 'cost' of segment Ik, i.e. the cost of gathering data $Y_{t_{k-1}+1}$ to $Y_{t_k}$ into a single segment.

• The optimization problem is actually a shortest path problem that can be solved by dynamic programming [Picard et al., 2005].
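The dynamic programming idea can be sketched as follows for the homogeneous-variance (least-squares) cost. This is an illustrative O(K n²) implementation written for this note, not the code of [Picard et al., 2005]; the function name and the 0-based conventions are assumptions.

```python
import numpy as np

def dp_segmentation(y, K):
    """Best split of y into K contiguous segments for the least-squares cost.

    Returns (starts, cost): the 0-based start indices of segments 2..K
    and the minimal value of the contrast J_K.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))
    # cost[i, j] = sum of squared deviations of y[i..j] from its own mean.
    cost = np.full((n, n), np.inf)
    for i in range(n):
        for j in range(i, n):
            tot = s1[j + 1] - s1[i]
            cost[i, j] = (s2[j + 1] - s2[i]) - tot * tot / (j - i + 1)
    # J[k, j] = minimal cost of splitting y[0..j] into k + 1 segments.
    J = np.full((K, n), np.inf)
    back = np.zeros((K, n), dtype=int)
    J[0] = cost[0]
    for k in range(1, K):
        for j in range(k, n):
            # The last segment is y[i..j]; the first k segments cover y[0..i-1].
            cands = J[k - 1, k - 1:j] + cost[k:j + 1, j]
            best = int(np.argmin(cands))
            J[k, j] = cands[best]
            back[k, j] = best + k
    # Backtrack from the last position to recover the breakpoints.
    starts, j = [], n - 1
    for k in range(K - 1, 0, -1):
        i = int(back[k, j])
        starts.append(i)
        j = i - 1
    return sorted(starts), float(J[K - 1, n - 1])

starts, cost = dp_segmentation([0, 0, 0, 5, 5, 5, -2, -2], K=3)
```

The recursion relies on the additivity of the contrast: the best split of the first j points into k + 1 segments is the best split of some prefix into k segments plus one last segment. The heterogeneous-variance case would only change the definition of `cost`.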



One last problem: the selection of K

• J_K decreases when K (complexity) increases.

• Standard criteria such as AIC or BIC are not valid.

• Penalty function: pen(K) = K + 1 or 2K.

• We look for the minimum of

$$J_K + \beta\, \mathrm{pen}(K)$$

where β is given [Lebarbier, 2005] or adaptively estimated [Lavielle, 2003].
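Once the contrast J_K has been computed for K = 1, 2, ..., choosing K with a fixed β is a one-line minimisation. The sketch below uses made-up contrast values and a hypothetical helper `select_K`; it does not implement the calibration of β from either cited reference.

```python
import numpy as np

def select_K(J, beta, pen=lambda K: 2 * K):
    """Return the K minimising J_K + beta * pen(K).

    J: sequence with J[K - 1] = J_K for K = 1, ..., len(J);
    pen: penalty function, 2K by default (K + 1 is the other choice above).
    """
    Ks = np.arange(1, len(J) + 1)
    crit = np.asarray(J, dtype=float) + beta * np.array([pen(K) for K in Ks])
    return int(Ks[np.argmin(crit)])

# Illustrative contrast values: a sharp drop up to K = 3, then a plateau.
J = [100.0, 40.0, 12.0, 11.0, 10.5, 10.2]
K_large_beta = select_K(J, beta=1.0)   # penalty dominates the small gains
K_small_beta = select_K(J, beta=0.1)   # small gains still pay off
```

With these values a larger β stops at the elbow of the curve (K = 3), whereas a small β keeps adding segments, which illustrates why the choice of β matters.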

[Inset: excerpt from "Chapter 4. Process segmentation"]

[Plot: J_K against pen(K).]

Figure 4.1: Illustration of the model selection procedure proposed by Lavielle (2005). Circles represent the convex hull of the contrast. The vertical line indicates the number of segments for which the contrast ceases to decrease significantly.

4.4 Bayesian formulation of the multiple change-point problem

In order to complete this review on segmentation methods, we present another modelling strategy that has been considered for this problem, in the Bayesian framework. See Green (1995), Carlin (1992), Barry and Hartigan (1993), Avery and Henderson (1999), Lavielle and Lebarbier (2001) for instance. Previous sections were dedicated to strategies whose objective is to provide the best segmentation of the data, based on a specific criterion. The objective is different in the Bayesian setting, where the number of segments as well as their positions are random. As a consequence, their posterior distribution will be used to choose the most appropriate number of segments, and will provide local information regarding the position of the breakpoints.

The model can be specified as follows: a real process is observed as the sum of a function to be recovered, supposed piecewise constant, and a sequence of zero-mean random variables.

2.2 - Example of segmentation on array CGH data

Are the variances σ²k homogeneous? BT474 cell line, chromosome 9:

K = 4 segments

[Figure: two panels, "Homogeneous variances" and "Heterogeneous variances", each showing the segmented profile, log2 ratio against genomic position.]

Adaptive choice of the number of segments. BT474 cell line, chromosome 1:

[Figure: two panels showing the segmented profile, log2 ratio against genomic position: "Homogeneous variances" with K = 10 segments and "Heterogeneous variances" with K = 2 segments.]

Homogeneous variances result in smaller segments.


Comparative study

Lai et al. (Bioinformatics, 2005). On both synthetic and real data (GBM brain tumor data), the dynamic programming approach outperformed the other methods available at that time.

3 - Segments classification

Considering the biologists' objective and the need for a new model.

[Figure: two panels showing a profile over positions 0 to 40, with values between −2 and 5.]

We'd like segments of the same type ('normal', 'deleted', 'amplified') to be gathered into classes.

3.1 - A segmentation-clustering model

• We suppose that there exists a secondary unobserved structure that clusters the K segments into P classes with proportions π1, ..., πP:

$$\pi_p = \text{proportion of class } p, \qquad \sum_p \pi_p = 1.$$

• Conditionally on the class to which the segment belongs, we know the distribution of Y:

$$t \in I_k,\ k \in p \;\Rightarrow\; Y_t \sim \mathcal{N}(m_p, s_p^2).$$

[Picard et al., 2007]

The parameters of this model are

the breakpoint positions: T = {t1, ..., tK−1},

the mixture characteristics: Ψ = ({πp}, {mp}, {s²p}).
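Because a whole segment belongs to a single class, the posterior probability of each class is computed per segment, multiplying the Gaussian densities of all its positions. The sketch below illustrates this for fixed breakpoints and fixed Ψ; the function name, the 0-based breakpoints and the toy values are assumptions, and this is not the code of [Picard et al., 2007].

```python
import numpy as np

def segment_posteriors(y, breakpoints, pi, m, s2):
    """tau[k, p] = P(segment k belongs to class p | data), for fixed T and Psi.

    breakpoints: 0-based start indices of segments 2..K (hypothetical convention).
    """
    y = np.asarray(y, dtype=float)
    pi, m, s2 = (np.asarray(a, dtype=float) for a in (pi, m, s2))
    segments = np.split(y, breakpoints)
    log_w = np.empty((len(segments), len(pi)))
    for k, seg in enumerate(segments):
        # log pi_p + sum over t in I_k of log N(y_t; m_p, s2_p)
        sq = (seg[:, None] - m[None, :]) ** 2
        log_dens = -0.5 * (np.log(2 * np.pi * s2) + sq / s2)
        log_w[k] = np.log(pi) + log_dens.sum(axis=0)
    # Normalise in log scale for numerical stability.
    log_w -= log_w.max(axis=1, keepdims=True)
    tau = np.exp(log_w)
    return tau / tau.sum(axis=1, keepdims=True)

y = [0.1, -0.1, 0.0, 1.0, 1.1, 0.9, -1.0, -0.9]
tau = segment_posteriors(y, [3, 6], pi=[1/3, 1/3, 1/3],
                         m=[-1.0, 0.0, 1.0], s2=[0.04, 0.04, 0.04])
```

Here the classes play the roles of 'deleted', 'normal' and 'amplified'; each row of `tau` sums to 1 and its largest entry gives the most probable class of the segment.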



Matrix form

This model can be written as

Y = TCm + E

where

• C is the K × P classification matrix that assigns each segment to a class,

• m is the P-vector containing the mean of each class.

Remarks.

• This model is not linear anymore.

• C represents the unobserved part of this incomplete data model.

• Two levels of statistical units: the positions (t) and the segments (k).


3.2 - A hybrid estimation algorithm: DP-EM

Alternating parameter estimation, with K and P known

1. When T is fixed, the Expectation-Maximisation (EM) algorithm estimates Ψ:

$$\Psi^{(h+1)} = \arg\max_{\Psi} \left\{ \log \mathcal{L}_{KP}\left(\Psi, T^{(h)}\right) \right\}.$$

2. When Ψ is fixed, Dynamic Programming (DP) estimates T:

$$T^{(h+1)} = \arg\max_{T} \left\{ \log \mathcal{L}_{KP}\left(\Psi^{(h+1)}, T\right) \right\}.$$

An increasing sequence of likelihoods:

$$\log \mathcal{L}_{KP}\left(\Psi^{(h+1)}; T^{(h+1)}\right) \ge \log \mathcal{L}_{KP}\left(\Psi^{(h)}; T^{(h)}\right)$$
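One way to see why the sequence of likelihoods cannot decrease is to chain the two steps. This is a short sketch of the argument, assuming only that neither step decreases the likelihood it works on; it is not spelled out on the slide.

```latex
\log \mathcal{L}_{KP}\big(\Psi^{(h+1)}; T^{(h+1)}\big)
  \;\ge\; \log \mathcal{L}_{KP}\big(\Psi^{(h+1)}; T^{(h)}\big)   % DP step: T^{(h+1)} maximises over T for fixed \Psi^{(h+1)}
  \;\ge\; \log \mathcal{L}_{KP}\big(\Psi^{(h)}; T^{(h)}\big)     % EM step: \Psi^{(h+1)} does at least as well as \Psi^{(h)} for fixed T^{(h)}
```

The first inequality holds because T^{(h)} is one of the candidates compared by DP, and the second because EM never decreases the likelihood for a fixed segmentation.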



3.3 - BT474 cell line

Interest of clustering: an easy case. Chromosome 9:

K = 4 segments

[Figure: two panels, "Segmentation" (log2 ratio against genomic position) and "Segmentation / Clustering".]

Clustering defines ’deleted’, ’normal’ and ’amplified’ classes.


Interest of clustering: a more interesting case. Chromosome 1:

[Figure: two panels, "Segmentation" with K = 2 (log2 ratio against genomic position) and "Segmentation / Clustering" with P = 3, K = 8.]

Clustering detects an outlier and captures a 'normal' segment within a large variance region.


4 - Multiple arrays analysis

Two examples.

1 - Comparing A. thaliana mutants or ecotypes. Chromosomal rearrangements can be detected by comparing CGH profiles observed on individuals from different genotypes.

2 - Comparing groups of patients. To detect chromosomal aberrations associated with a specific disease (e.g. breast cancer), we analyse the profiles of several ill patients jointly.

First approach: Common breakpoints.

Assuming that all the patients of the same group have their breakpoints at the same positions leads to segmenting a multivariate signal, which can be achieved using DP.

This is doable but biologically irrelevant.



Breast cancer dataset (Inst. Curie): Chromosome 8.

[Figure: for each group, "Good prognosis" (53 patients) and "Bad prognosis" (81 patients), one panel for the number of breakpoints and one panel for the position of the breakpoints.]

4.1 - Correlated profiles

We want to account for correlations between profiles.

Different probe affinities may alter all the profiles at the same position.

Denoting Yit the profile of patient i at position t (belonging to segment Iik), we set

$$Y_{it} = \mu_{ik} + U_t + E_{it}$$

where µik is the mean of segment Iik and Ut is the random effect of probe t:

$$\{E_{it}\} \text{ i.i.d. } \mathcal{N}(0, \sigma^2), \qquad \{U_t\} \text{ i.i.d. } \mathcal{N}(0, \gamma^2).$$

4.2 - Mixed segmentation model

Covariance between profiles.

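For two distinct patients i ≠ i′:

$$\mathrm{Cov}(Y_{it}, Y_{i't'}) = \mathrm{Cov}(U_t + E_{it}, U_{t'} + E_{i't'}) = \begin{cases} \gamma^2 & \text{if } t = t', \\ 0 & \text{otherwise.} \end{cases}$$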
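This covariance structure can be checked empirically with a small simulation. The sketch assumes that the probe effects U and the noise E are independent (implicit in the model), sets all segment means to 0 for simplicity, and uses illustrative variances.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200_000, 2              # many probes, two patients (illustrative sizes)
sigma2, gamma2 = 1.0, 0.25     # illustrative variances

U = rng.normal(0.0, np.sqrt(gamma2), size=n)        # probe effects U_t, shared by all patients
E = rng.normal(0.0, np.sqrt(sigma2), size=(m, n))   # independent noise E_it
Y = U + E                                           # mu_ik = 0 for simplicity

# Empirical covariance between the two patients at the same probe ...
cov_same_probe = np.mean(Y[0] * Y[1])
# ... and at neighbouring probes (t, t + 1).
cov_other_probe = np.mean(Y[0, :-1] * Y[1, 1:])
```

The first quantity should be close to γ² and the second close to 0: the shared probe effect is the only source of dependence between patients.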

Matrix form. We end up with a mixed linear model with a segmentation term (T):

Y = Tµ + ZU + E

where

• m is the number of patients, n is the length of each profile,

• Z is the (nm × n) design matrix that gives the probe corresponding to each data point,

• U is the n-vector containing the random effects of the probes.

[Lebarbier et al., 2007]

4.3 - Estimation of the parameters

Direct maximisation. Due to correlations, the variance V(Y) is not diagonal, so the log-likelihood L(Y) is not additive anymore ⇒ no dynamic programming.

A second DP-EM algorithm. The conditional variance of Y given U, V(Y|U), is diagonal, so DP can be used to estimate T given U. U is treated as unobserved data. [Foulley, Lecture notes]

E step. Calculate the conditional moments of the random effect given the data:

$$\mathbb{E}(U \mid Y), \qquad \mathbb{V}(U \mid Y).$$

M step. Denoting $\hat{U} = \mathbb{E}(U \mid Y)$, perform the segmentation as follows:

$$\widehat{T\mu} = \arg\min_{T\mu} \| Y - T\mu - Z\hat{U} \|^2.$$

A two-stage dynamic programming is required to achieve this step for numerous patients. [Picard et al., ?]
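For this Gaussian model the E step has a closed form probe by probe. The sketch below assumes that U and E are independent and that σ², γ² and the current segment means are known; the function name `e_step` and the (m, n) array layout are hypothetical choices for the illustration.

```python
import numpy as np

def e_step(Y, M, sigma2, gamma2):
    """Conditional moments of the probe effects U given the data.

    Y, M: (m, n) arrays holding the profiles and their current segment means
          (M[i, t] = mu_ik for t in I_ik).
    Returns (E(U_t | Y), V(U_t | Y)) as two length-n arrays.
    """
    m = Y.shape[0]
    # Residuals of all patients at probe t are the only data informative about U_t.
    resid_sum = (Y - M).sum(axis=0)
    denom = sigma2 + m * gamma2
    cond_mean = gamma2 * resid_sum / denom
    cond_var = np.full(Y.shape[1], sigma2 * gamma2 / denom)
    return cond_mean, cond_var

rng = np.random.default_rng(0)
Y = rng.normal(size=(3, 5))
M = rng.normal(size=(3, 5))
u_hat, u_var = e_step(Y, M, sigma2=0.7, gamma2=0.3)
```

These expressions agree with the general mixed-model formulas E(U|Y) = γ² Zᵀ V⁻¹ (Y − Tµ) and V(U|Y) = γ² I − γ⁴ Zᵀ V⁻¹ Z, with V = γ² Z Zᵀ + σ² I, when Y is stacked patient by patient. The conditional variance shrinks as the number of patients m grows, since more profiles share each probe.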



4.4 - Nakao data, chromosome 20

The random effect has an influence on the detected breakpoints.

We find a large negative random effect Ut at position 37.

→ poor probe affinity or wrong annotation.

Comparison individual / multiple segmentation.

The breakpoints detected around position 37 vanish.

5 - Looking for common aberrations

To detect aberrations associated with a disease (e.g. bladder cancer), we look for aberrations that appear 'significantly' often among a set of patients.

Xit denotes the status of position t for patient i:

Xit = −1 (loss)
    =  0 (normal)
    = +1 (gain)

[Figure: profiles of the patients (vertical axis, 0 to 80) along the genome (horizontal axis, positions 0 to 2000, chromosomes 1 to 24).] [Huppe et al., 2004]

5.1 - Minimal region

A minimal region is a sequence of successive positions for which the same status is observed in a large number of patients at the same time. [Rouveirol, 2007]

Data. M∗ = 31 patients present ℓ = 5 successive deletions between positions 1189 and 1193, in chromosome 9.

Question. Is this significant, given the number of patients (m = 84) and the profile length (n = 2340)?

[Figure: zoom on positions 1000 to 1500, chromosomes 8 to 12.]

5.2 - Motifs are back

Binary case. Assume that only two statuses exist:

$$X_{it} = \begin{cases} 1 & \text{for aberration,} \\ 0 & \text{for normal.} \end{cases}$$

Region. A minimal region is then an ℓ-run of 1s. Denote

$$Y_{it} = \prod_{u=t-\ell+1}^{t} X_{iu} = \begin{cases} 1 & \text{if an } \ell\text{-run occurs at position } t \text{ in profile } i, \\ 0 & \text{otherwise.} \end{cases}$$

Simultaneous occurrences. $Y_{+t} = \sum_i Y_{it}$ is the number of patients for which ℓ successive aberrations (an ℓ-run) occur at t.

Significance of an observed minimal region. We have to calculate

$$\Pr\left\{ \max_{\ell \le t \le n} Y_{+t} \ge M^* \right\}.$$
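The run indicators and their sum over patients are simple to compute from a binary status matrix. The sketch below uses 0-based positions and a hypothetical function name `run_counts`; the toy matrix is illustrative.

```python
import numpy as np

def run_counts(X, ell):
    """Y_plus[t] = number of profiles with an ell-run of 1s ending at position t.

    X: (m, n) array of 0/1 statuses; positions are 0-based here, so the
    entries t < ell - 1 are 0 by construction. Requires 1 <= ell <= n.
    """
    X = np.asarray(X, dtype=int)
    m, n = X.shape
    # Window sums via cumulative sums: a run occurs iff the window sum equals ell.
    c = np.concatenate([np.zeros((m, 1), dtype=int), np.cumsum(X, axis=1)], axis=1)
    window = c[:, ell:] - c[:, :-ell]          # shape (m, n - ell + 1)
    Y = np.zeros((m, n), dtype=int)
    Y[:, ell - 1:] = (window == ell)
    return Y.sum(axis=0)

X = [[0, 1, 1, 1, 0],
     [1, 1, 1, 1, 0],
     [0, 0, 1, 1, 1]]
Y_plus = run_counts(X, ell=3)
M_star = int(Y_plus.max())
```

`M_star` is the observed value of the statistic max_t Y_{+t}; its significance is what the probability above is meant to quantify.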



5.3 - Markov Model

Model. Each profile {Xit}t is a 2-state stationary Markov chain (MC):

{Xit}t ∼ MC(Π,µ)

where Π is the transition matrix and µ the stationary distribution (∀i : Xi1 ∼ µ).

The patients (profiles) are supposed to be independent.

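Estimated transition probabilities and stationary distributions (in %). States = {0, 1}:

$$\text{Gain:} \quad \Pi^{+} = \begin{pmatrix} 99.74 & 0.26 \\ 1.98 & 98.02 \end{pmatrix}, \qquad \mu^{+} = (88.36 \quad 11.64),$$

$$\text{Loss:} \quad \Pi^{-} = \begin{pmatrix} 99.72 & 0.28 \\ 2.26 & 97.74 \end{pmatrix}, \qquad \mu^{-} = (88.98 \quad 11.02).$$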
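The stationary distribution follows from the transition matrix by solving µΠ = µ with the entries of µ summing to 1. The sketch below does this for the two matrices as displayed (rounded to two decimals); the helper name `stationary` is hypothetical.

```python
import numpy as np

def stationary(P):
    """Stationary distribution mu of a transition matrix P (mu P = mu, sum(mu) = 1)."""
    P = np.asarray(P, dtype=float)
    k = P.shape[0]
    # Stack the balance equations (P^T - I) mu = 0 with the constraint sum(mu) = 1.
    A = np.vstack([P.T - np.eye(k), np.ones(k)])
    b = np.concatenate([np.zeros(k), [1.0]])
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

P_gain = np.array([[99.74, 0.26], [1.98, 98.02]]) / 100
P_loss = np.array([[99.72, 0.28], [2.26, 97.74]]) / 100
mu_gain = stationary(P_gain)
mu_loss = stationary(P_loss)
```

The results agree with the stationary distributions quoted above up to the rounding of the displayed transition probabilities. For a 2-state chain the same values come from the closed form P(state 1) = Π01 / (Π01 + Π10).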


5.4 - Significance

We want to evaluate

$$\Pr\left\{ \max_{\ell \le t \le n} Y_{+t} \ge M^* \right\}.$$

Exact calculation. We need to consider m independent Markov chains of order ℓ − 1:

m = 84, ℓ = 15 → 8.3 × 10^86 states...

Upper bound. Denoting $X_{+t} = \sum_i X_{it}$, we have

$$\Pr\left\{ \max_{\ell \le t \le n} Y_{+t} \ge M^* \right\} \le \Pr\left\{ \exists t : X_{+t}, X_{+,t+1}, \dots, X_{+,t+\ell-1} \ge M^* \right\}.$$

The latter probability can be calculated via an embedded Markov chain.

Lower bound. A lower bound can also be derived, using another technique. [R. et al., ?]

Bladder cancer

Deleted regions in m = 84 patients, n = 2340 positions along 24 chromosomes.

position t∗   chrom.   length ℓ   # patients M∗   p (upper)   p (lower)
1189          9        5          31              6.04e-8     4.05e-8
1387          11       3          30              6.82e-7     6.08e-7
1430          11       3          27              5.08e-5     4.60e-5
1340          10       22         23              1.18e-4     3.02e-6
1457          11       3          26              1.91e-4     1.74e-4
1006          8        42         21              4.95e-4     1.86e-8
996           8        7          22              7.85e-3     4.78e-3
584           4        6          19              1.82e-1     1.39e-1
1947          17       26         16              2.33e-1     1.42e-2
594           4        17         17              2.34e-1     5.16e-2
643           5        11         17              4.18e-1     2.21e-1

(p (upper) and p (lower) are the upper and lower bounds on the significance.)

• The upper and lower bounds are close, except for long regions.

• Deletions in chromosome 9 are known to be associated with bladder cancer.


6 - What's next?

Enriching the multiple array model:

• Classify the segments into statuses

• Account for possible covariates (regarding the probes or the patients)

• Maximum likelihood for the complete model including segmentation, classification, random effects and covariates seems intractable.
→ Approximations (variational approaches) are promising.

An R package should be available soon.

Recent or new technologies:

• High density: tiling arrays (≃ 10^5 probes per chromosome)

• High throughput sequencing (discrete signal).

Collaborations

Institut Curie: F. Radvanyi, O. Delattre

Cancer center, San Francisco: C. Vaisse

INRA - URGV, Evry: J.-P. Renou

UMR AgroParisTech / INRA 518: E. Lebarbier, J-J. Daudin, B. Thiam

UMR CNRS 5558, Lyon: F. Picard

Masaryk Univ., Czech Republic: E. Budinska

Univ. Paris X: L. Pierre

Univ. Paris XI: M. Lavielle

LIPN, UMR CNRS 7030: C. Rouveirol

Western Australia Univ.: V. Stefanov