Interval Scores for Quality Annotated CGH Data
Doron Lipson¹, Anya Tsalenko², Zohar Yakhini¹,² and Amir Ben-Dor²
¹Technion, Haifa, Israel; ²Agilent Laboratories, Palo Alto, CA
References
1. Barrett MT, Scheffer A, Ben-Dor A, Sampas N, Lipson D, Kincaid R, Tsang P, Curry B, Baird K, Meltzer PS, Yakhini Z, Bruhn L, and Laderman S. Comparative Genomic Hybridization using Oligonucleotide Microarrays and Total Genomic DNA. PNAS 2004; 101(51): 17765-17770.
2. Lipson D, Aumann Y, Ben-Dor A, Linial N, and Yakhini Z. Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis. Ninth Annual International Conference on Research in Computational Molecular Biology, RECOMB 2005 (Cambridge, MA).
3. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale AL, and Brown PO. Microarray Analysis Reveals a Major Direct Role of DNA Copy Number Alteration in the Transcriptional Program of Human Breast Tumors. PNAS 2002; 99(20): 12963-12968.
4. Dehan E, Ben-Dor A, Liao W, Lipson D, Rienstein S, Simansky D, Krupsky M, Yaron P, Friedman E, Rechavi G, Perlman M, Aviram-Goldring A, Bittner M, Yakhini Z, and Kaminski N. Chromosomal Aberrations and Gene Expression Profiles in Non Small Cell Lung Cancer. In preparation.
Most human cancers arise as a result of an acquired genomic instability and the subsequent evolution of clonal populations of cells with accumulated genetic errors. Accordingly, most cancers and some premalignant tissues contain multiple genomic abnormalities not present in cells within the normal tissues from which the neoplasias arose. These abnormalities include gains and losses of chromosomal regions that vary extensively in size, up to and including whole chromosomes. Increases in genomic copy number can lead to overexpression of tumor promoter genes (oncogenes), while losses are associated with disruption of normal cell regulatory processes (e.g., through the loss of tumor suppressor genes).
The Cancer Genome
Normal Human Genome: Stable diploid copy number even in most diseases, e.g. cardiovascular, neurological.
Cancer Genome: Multiple genome-wide chromosome aberrations including copy number changes and rearrangements.
Array-based Comparative Genomic Hybridization (aCGH)
DNA copy number alterations have been measured using fluorescence in situ hybridization-based techniques. The development of a genome-wide technique, Comparative Genomic Hybridization (CGH), made it possible to measure multiple chromosomal alterations present in cancer cells jointly. Differentially labeled tumor and normal DNA are co-hybridized to normal metaphase chromosomes, and the ratios between the two labels allow the quantification of changes in DNA copy number. In a more advanced method termed array CGH (aCGH), the metaphase chromosomes are replaced by a microarray of thousands of genomic BAC, cDNA or oligonucleotide probes, greatly enhancing the resolution at which changes in DNA copy number may be detected.
HT-29 colon carcinoma cell line [1]
The Interval Score
Let C=(c1…cn) be a vector of all log(R/G) measurements along some chromosome. If the target contains an aberration, then we expect to see many consecutive positive or negative entries in C. On the other hand, if the target is normal, we expect no localized effects. Intuitively, we look for intervals (sets of consecutive probes) where signal sums are significantly higher or lower than expected at random. As a null model we assume that no aberration is present in the target, and therefore the variation in C represents only the noise of the measurement.
Assuming that the measurement noise along the chromosome is independent for distinct probes and normally distributed, let µ and σ denote the mean and standard deviation of the normal genomic data. Given an interval I spanning k probes (k = |I|), we define its score as:
$$S(I) = \sum_{i \in I} \frac{c_i - \mu}{\sigma \sqrt{|I|}}$$
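A minimal sketch of the interval score, S(I) = Σ (c_i − µ) / (σ√k) summed over the k probes of I, may help make the definition concrete. The function name `interval_score`, the zero-based `start` index, the default values µ=0 and σ=1, and the sample vector are assumptions made for illustration only.

```python
import math

def interval_score(c, start, k, mu=0.0, sigma=1.0):
    """Score of the k consecutive probes c[start], ..., c[start + k - 1]."""
    if k <= 0 or start < 0 or start + k > len(c):
        raise ValueError("interval must lie inside the vector")
    # Sum of deviations from the null-model mean over the interval.
    deviation_sum = sum(ci - mu for ci in c[start:start + k])
    # Under the null model this sum has standard deviation sigma * sqrt(k).
    return deviation_sum / (sigma * math.sqrt(k))

# Hypothetical log-ratio vector with an amplified stretch at indices 2..5.
log_ratios = [0.0, 0.1, 1.0, 1.0, 1.0, 1.0, -0.1, 0.0]
print(interval_score(log_ratios, 2, 4))
```

Dividing by σ√k puts intervals of different lengths on a common scale, so a long run of moderately elevated probes can score as high as a short run of strongly elevated ones.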
MaxInterval Algorithm I: LookAhead
Assume you are given:
• m – An upper bound for the value of a single element ci
• t – A lower bound on the maximum score
If we are currently considering an interval I=[i,…,i+k-1] with a sum of $s = \sum_{j \in I} c_j$, then (with the data normalized so that µ=0 and σ=1) the score of I is:
$$S(I) = \frac{s}{\sqrt{k}}$$
The score of an interval I' = [i,…,i+k+x-1] is then bounded by:
$$S(I') \le \frac{s + mx}{\sqrt{k + x}}$$
Solve for the first x for which S(I') may exceed t.
I: sum s, length k, score s/√k
I': sum at most s+mx, length k+x, score at most (s+mx)/√(k+x)
Complexity: Expected O(n^1.5) (unproved)
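The look-ahead step can be sketched as follows, assuming normalized data (µ=0, σ=1) so that an interval with sum s and length k scores s/√k. This is one possible reading of the procedure, not the authors' implementation: the names `lookahead_step` and `max_interval_lookahead`, the use of prefix sums, the choice of m as the largest element and of the initial t as the best single-probe score, and the closed-form solution of (s+mx)/√(k+x) > t are all assumptions of this sketch.

```python
import math
from itertools import accumulate

def lookahead_step(s, k, m, t):
    """Smallest safe extension x >= 1 for which (s + m*x)/sqrt(k + x) may exceed t.

    Assumes m > 0, t > 0 and s <= m*k. With y = s + m*x the condition becomes
    y**2 - (t**2/m)*y - t**2*(k - s/m) > 0 with y > 0, i.e. y above the larger root.
    """
    b = t * t / m
    disc = b * b + 4.0 * t * t * (k - s / m)
    y_plus = (b + math.sqrt(max(disc, 0.0))) / 2.0
    x_plus = (y_plus - s) / m
    # Rounding down is conservative: it never skips a candidate interval.
    return max(1, math.floor(x_plus))

def max_interval_lookahead(c):
    """Return (score, start, length) of a maximal-scoring interval."""
    n = len(c)
    if n == 0:
        raise ValueError("empty vector")
    prefix = [0.0] + list(accumulate(c))
    m = max(c)                      # upper bound on a single element
    t, best_i, best_k = m, c.index(m), 1   # best single probe as initial bound
    if m <= 0:
        # All elements non-positive: a single largest element is optimal.
        return (t, best_i, best_k)
    for i in range(n):
        k = 1
        while i + k <= n:
            s = prefix[i + k] - prefix[i]
            score = s / math.sqrt(k)
            if score > t:
                t, best_i, best_k = score, i, k
            # Skip lengths whose upper bound cannot exceed the current t.
            k += lookahead_step(s, k, m, t)
    return (t, best_i, best_k)

print(max_interval_lookahead([0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]))
```

To search for deletions with the same routine, one option is to negate the vector before calling it.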
Applications: Common Aberrations
Finding common aberrations in a set of samples can be performed directly by using variants of the interval score (see [2] for details).
[Figure panels: one row per sample, horizontal axis along the chromosome; legend: >0, <0.]
Chromosome 3 of 26 lung tumor samples on mid-density cDNA array. Data from Dehan et al [4]. Common deletion located in 3p21 and common amplification in 3q.
Chromosomes 8 and 11 of 37 breast tumor samples on mid-density cDNA array. Data from Pollack et al [3]. Common deletion located in 8p and common amplification in 11q.
Applications: Single Samples
Chromosome 16 of HCT116 colon carcinoma cell line on high-density oligo array (n=5,464). Data from Barrett et al [1]. [Figure: Log2(ratio) vs. genomic position (Mbp); labeled loci: FRA16B, A2BP1.]
Chromosome 17 of several breast carcinoma cell lines on mid-density cDNA array (n=364). Data from Pollack et al [3]. [Figure: Log2(ratio) vs. genomic position (Mbp), one panel per cell line; labeled locus: ERBB2.]
Quality Weighted Interval Scores
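Consider the vector V=((c1,q1),(c2,q2),…,(cn,qn)) where at each locus i the number ci is the measured log(R/G) and the number qi represents the standard deviation of this particular measurement. For every i set $w_i = q_i^{-2}$.
For an interval I, spanning k probes, compute a weighted mean:
$$\mu(I) = \frac{\sum_{i \in I} c_i / q_i^2}{\sum_{i \in I} 1 / q_i^2}$$
Variance of individual loci:
$$\sigma_{loci}^2 = \frac{1}{\sum_{i \in I} 1 / q_i^2}$$
Variance due to consistency within the interval:
$$\sigma_{con}^2 = \frac{k}{k-1} \cdot \frac{\sum_{i \in I} (c_i - \mu(I))^2 / q_i^2}{\sum_{i \in I} 1 / q_i^2}$$
And finally, the interval score, where α weights the two variance terms:
$$\sigma(I)^2 = \alpha\,\sigma_{loci}^2 + (1 - \alpha)\,\frac{1}{k}\,\sigma_{con}^2, \qquad S(I) = \frac{\mu(I)}{\sigma(I)}$$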
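As a rough sketch of the quality-weighted score, the function below uses weights w_i = q_i^-2, takes the weighted mean of the c_i as µ(I), and divides it by a standard deviation σ(I) that mixes a per-locus term 1/Σw_i with a within-interval consistency term. The way the two terms are combined here (mixing weight `alpha`, consistency term scaled by k/(k−1) and divided by k) is this sketch's reading of the poster's formulas, and the default `alpha=0.5` and the function name are purely hypothetical.

```python
import math

def weighted_interval_score(c, q, start, k, alpha=0.5):
    """Quality-weighted score of probes start..start+k-1 (requires k >= 2).

    c: log-ratio measurements; q: per-probe standard deviations (all > 0).
    alpha: assumed mixing weight between the two variance estimates.
    """
    if k < 2 or start < 0 or start + k > len(c) or len(c) != len(q):
        raise ValueError("need an in-range interval of at least two probes")
    cs, qs = c[start:start + k], q[start:start + k]
    w = [1.0 / (qi * qi) for qi in qs]          # w_i = q_i^-2
    w_sum = sum(w)
    mu = sum(wi * ci for wi, ci in zip(w, cs)) / w_sum   # weighted mean
    var_loci = 1.0 / w_sum                      # from the per-probe quality
    # Weighted spread of the probes around the interval mean.
    var_con = (k / (k - 1)) * sum(wi * (ci - mu) ** 2 for wi, ci in zip(w, cs)) / w_sum
    variance = alpha * var_loci + (1.0 - alpha) * var_con / k
    if variance <= 0.0:
        raise ValueError("zero variance estimate; score is undefined")
    return mu / math.sqrt(variance)

# Hypothetical data: the same signal scores lower when probes are noisier.
print(weighted_interval_score([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], 0, 4, alpha=1.0))
print(weighted_interval_score([1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0], 0, 4, alpha=1.0))
```

With `alpha=1.0` and all q_i equal to 1, the result coincides with the simple score s/√k, which is a convenient sanity check.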
Chr. 17 of MDA-MB-453 breast cancer cell-line sample. Data from Barrett et al [1].
Analysis using simple interval score:
Analysis that accounts for the signal consistency within the interval (σcon) and the single-locus variance (σloci):
Note the difference in the aberrations called for the genomic regions 58-75 Mbp and 8-15 Mbp.
Radii of the data points are proportional to wi.
The MaxInterval Problem
For convenience of algorithmic analysis we define the MaxInterval problem of finding the maximal scoring interval. Other intervals with high scores may be found by recursively calling this function.
Input: A vector C=(c1…cn)
Output: An interval I ⊆ [1…n] that maximizes S(I)
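For reference, the MaxInterval problem can be solved by exhaustive search over all O(n^2) intervals. The sketch below assumes normalized data (µ=0, σ=1), so that the score of an interval with sum s and length k is s/√k; the function name and the zero-based (score, start, length) return convention are assumptions for illustration.

```python
import math
from itertools import accumulate

def max_interval_exhaustive(c):
    """Return (score, start, length) of a maximal-scoring interval, trying all intervals."""
    n = len(c)
    if n == 0:
        raise ValueError("empty vector")
    # prefix[j] holds c[0] + ... + c[j-1], so any interval sum costs O(1).
    prefix = [0.0] + list(accumulate(c))
    best = (-math.inf, 0, 0)
    for i in range(n):
        for k in range(1, n - i + 1):
            score = (prefix[i + k] - prefix[i]) / math.sqrt(k)
            if score > best[0]:
                best = (score, i, k)
    return best

print(max_interval_exhaustive([0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]))
```

This finds the highest-scoring (amplified) interval; running it on the negated vector finds the strongest deletion.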
Identification and Mapping of Genomic Alteration Events
A common first step in analyzing DNA copy number data consists of identifying aberrant (amplified or deleted) regions in each individual sample.
Given a series of log(R/G) measurements along some genomic region, e.g. a chromosome, we would like to identify intervals within this vector that consistently contain significantly high values (amplifications) or significantly low values (deletions).
[Figure: two panels, Deletion and Amplification, plotting Log2(R/G) (axis from -0.5 to 0.5) against genomic position.]
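MaxInterval Algorithm II: Geometric Family Approximation (GFA)
For ε>0 define the following geometric family of intervals:
$$k_j = (1+\varepsilon)^j, \qquad \Delta_j = \varepsilon k_j$$
$$I(j) = \{\, [i\Delta_j,\; i\Delta_j + k_j - 1] \;:\; 0 \le i \le (n - k_j)/\Delta_j \,\}$$
That is, I(j) consists of intervals of length k_j spaced Δ_j apart.
[Figure: the interval families I(j1), I(j2), I(j3).]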
Theorem [2]: Let I* be the optimal scoring interval. Let J be the leftmost longest interval of the family fully contained in I*. Then S(J) ≥ S(I*)/α, where α is a constant that depends only on ε (see [2] for its exact form).
Complexity: O(n)
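A possible sketch of the geometric-family idea: instead of scoring every interval, score only intervals whose lengths grow geometrically with ratio (1+ε) and whose start positions are spaced proportionally to their length. The integer rounding of lengths and spacings, the default `eps=0.1`, the function name, and the assumption of normalized data (score s/√k) are choices of this sketch rather than details taken from [2].

```python
import math
from itertools import accumulate

def max_interval_gfa(c, eps=0.1):
    """Return (score, start, length) of the best interval in a geometric family."""
    n = len(c)
    if n == 0 or eps <= 0:
        raise ValueError("need a non-empty vector and eps > 0")
    prefix = [0.0] + list(accumulate(c))
    best = (-math.inf, 0, 0)
    j, last_k = 0, 0
    while True:
        k_real = (1.0 + eps) ** j          # nominal length k_j = (1 + eps)^j
        k = math.floor(k_real)             # assumed rounding to whole probes
        if k > n:
            break
        if k != last_k:                    # skip lengths that round to the same k
            delta = max(1, math.floor(eps * k_real))   # spacing, at least one probe
            for i in range(0, n - k + 1, delta):
                score = (prefix[i + k] - prefix[i]) / math.sqrt(k)
                if score > best[0]:
                    best = (score, i, k)
            last_k = k
        j += 1
    return best

print(max_interval_gfa([0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0], eps=0.1))
```

Because both the step between lengths and the spacing between start positions grow with the interval length, far fewer intervals are scored than in the exhaustive search, at the cost of returning an approximate rather than exact maximum on long vectors.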
Benchmarking
Benchmarking results of the Exhaustive, LookAhead and GFA algorithms on synthetic vectors of varying lengths. Linear regression suggests that the complexities of the Exhaustive, LookAhead and GFA algorithms are O(n^2), O(n^1.5) and O(n), respectively.