
Reproducible gene selection algorithm with random effect model in cDNA microarray-based CGH data


Expert Systems with Applications 36 (2009) 11589–11594

Contents lists available at ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier.com/locate/eswa


Mijung Kim *

Institute for Mathematical Sciences, Yonsei University, 134 Shinchon-Dong, Seodaemun-Gu, Seoul 120-752, Republic of Korea

Article info

Keywords: cDNA microarray; Comparative genomic hybridization (CGH); Copy-number changes; Gastric cancer; Reproducibility; Random effect model

0957-4174/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2009.03.034

* Tel.: +82 2 2123 4093; fax: +82 2 363 4845. E-mail address: [email protected]

Abstract

cDNA microarray-based CGH was performed on 30 pairs of normal and tumor gastric tissues, using cDNA microarrays containing 17,000 human genes, to delineate the individual genes that undergo copy-number changes. Frequency analysis is more efficient than mean analysis for detecting subtle differences in copy number when most of the data come from low-intensity spots, as is the case in cDNA microarray-based CGH. This article studies how to deal with variation in replicated measurements when applying frequency analysis. A reproducible gene selection algorithm was developed to minimize variation across array measurements. The algorithm incorporates a measure of reproducibility based on a random effect model and, as a filtering process, collects individual genes with reproducible copy-number changes. It controls both reproducibility and the number of remaining genes by dropping genes with large variations, which increases reproducibility. Applying the algorithm yields a well-filtered set of genes and thus addresses variation in frequency analysis of the replicated data.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

The Cancer Metastasis Research Center (CMRC) at Yonsei University conducted a cDNA microarray-based CGH study to investigate gastric cancer-related DNA copy-number changes; this type of study makes it possible to observe a diverse pattern of potential biomarkers at the DNA level. cDNA microarray-based CGH was performed on 30 pairs of normal and gastric tumor tissues, and direct comparisons were made to detect gastric cancer-related genes with copy-number changes. The primary purpose of this experiment was to identify copy-number changes in individual genes rather than in segments of genes.

For gene-by-gene identification of copy-number changes, Yang et al. performed simple frequency analysis and selected genes showing at least one alteration in gastric cancer (Yang, Seo, & Jeong, 2005). Seo et al. also investigated individual genes for copy-number changes in bilateral breast cancer (Seo, Rha, & Yang, 2004), and Cheng et al. analyzed array CGH data with a gene-by-gene search through array rank order to detect copy-number changes in human cancer (Cheng, Kimmel, Neiman, & Zhao, 2003).

cDNA microarray-based CGH data have special features: the occurrence of a copy-number change carries more meaning than the magnitude of the mean intensity difference, and many of the data have low signal-to-noise ratios. Because cDNA microarray-based CGH experiments produce low-intensity spots, several issues arise when analyzing the data to detect subtle differences in copy-number change. One issue is that only a few genes are identified as altered, because their mean values are small. Analysis based on mean values, such as the t-test, does not always identify an ‘altered gene’ when the change in mean copy number is required to exceed the minimal criterion for an alteration. To detect subtle differences in copy-number change, frequency analysis is more efficient than mean analysis.

A second issue is how to deal with the variation of a gene across arrays in frequency analysis, because many genes have large variations relative to the total variation of all genes; frequency analysis, such as a 1.5- or 2-fold change cut-off, does not consider this variation. This article suggests applying the fold change cut-off after filtering for genes with small variations using the developed algorithm, which makes it possible to obtain a well-filtered set of genes.

As one application of the proposed algorithm, candidates for altered genes can be selected by choosing genes with a high frequency of alteration from a set filtered with the algorithm. The selected genes include those identified by the t-test as well as altered genes with a high frequency of one-sided alteration and small means, which mean-based statistical tests rarely detect because the minimal criterion for an alteration in mean copy-number change is not satisfied.

As another application, the proposed algorithm might be used to detect genes that display subtle but ‘consistent’ differences in copy-number change, by lowering the threshold of the algorithm to increase the reproducibility of the data. A ‘consistent’ gene is one that shows either only gain or only loss of copy number (that is, one-sided alteration) in all available arrays. This allows genes to be ranked by their frequency of consistent alteration. Genes with large copy-number variations relative to the total variation of all genes often reveal mixtures of gain and loss; these are referred to as ‘hampering genes’. When consistent rather than hampering genes are of interest, the developed algorithm can be used to select a set of consistent genes by increasing reproducibility.

Several ways of minimizing variation in replicated measurements have been reported. Data with a large standard deviation (sd.) within experiments have been filtered out (Alizadeh, Eisen, & Davis, 2000; Marton, Derisi, & Bennett, 1998; Ross, Scherf, & Eisen, 2000; White, Rifkin, Hurban, & Hogness, 1999), and Kadota et al. introduced PRIM (Preprocessing Implementation for Microarray) to filter out data with small mean correlations between any two replicates (Kadota et al., 2001).

In this study, a reproducible gene selection algorithm (RGSA) was developed as a solution to the issues discussed above. In RGSA, the variability of replicated measurements is quantified via a random effect model, and a measure of reproducibility is incorporated using the intra-class correlation coefficient. RGSA controls both reproducibility and the number of remaining genes. The well-filtered set suggested in this article maximizes the product of the estimated reproducibility and the proportion of remaining genes. Beyond this recommended filtered set, more reproducible genes can be collected by lowering the threshold in RGSA. In this case, genes showing both gain and loss with low reproducibility are dropped, and genes can be ranked by their frequency of consistent alteration once the data are categorized according to the criterion for alteration.

To deal with variation of the replicated data in frequency analysis, this article suggests performing the frequency analysis on the set filtered by RGSA.

cDNA microarray-based CGH data from the experiment conducted at the CMRC of Yonsei University were used to test RGSA. Note that this application serves to test RGSA and illustrate its use; it is not the final analysis of the experiment. For these data, within-print tip, intensity-dependent normalization was performed before the steps of RGSA.

To compare the sets before and after RGSA is applied, a second filtered set, with reproducibility raised to nearly 30% in the CMRC data, is chosen in addition to the suggested filtered set. Characteristics are compared before and after application of RGSA.

A simulation study shows that the sensitivity of RGSA for detecting data with large variation reaches 32–79% at the suggested filtered set and increases to 73–96% when a set of more reproducible genes is selected, depending on the variation of the simulated data.

2. Materials and methods

2.1. cDNA microarray-based CGH

Thirty pairs of normal gastric mucosa and cancer tissues were obtained from gastric cancer patients who had undergone surgery at the Severance Hospital, Cancer Metastasis Research Center (CMRC), Yonsei University Health System, Seoul, Korea, from 1997 to 1999. The patients comprised 27 males and 3 females with a median age of 65 years (41–78). The numbers of patients in stages I, II, III and IV were 3, 9, 12 and 6, respectively.

Genomic DNA extraction was performed according to a conventional protocol using the phenol/chloroform/isoamyl alcohol method. cDNA microarrays containing 17,000 sequence-verified human gene probes (CMRC-Genomictree, Korea) were used for CGH in a direct comparison design (Churchill, 2002), in which genomic DNAs from the normal and tumor tissues were labeled with the fluorescent dyes Cy3 and Cy5, respectively, and cohybridized following the standard protocol of CMRC, Yonsei University (Yang et al., 2005). The range of genomic copy number in normal tissues was within ±0.3 of the log2 intensity ratio in autosomal genes (Park, Jeong, & Choi, 2006); thus a gain in copy number of a gene was identified if the log2 intensity ratio was over 0.3, and a loss was identified if the ratio was below −0.3. The cDNA microarray-based CGH data for this experiment have been deposited in ArrayExpress (http://www.ebi.ac.uk/arrayexpress/), Query: 1283947172 E-TABM-171.
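The gain/loss criterion above can be written down in a few lines. The following is a minimal sketch, assuming that missing values are represented as `None`; the function and constant names are illustrative and do not come from the original analysis.

```python
GAIN_CUTOFF = 0.3    # from the text: gain if the log2 ratio is over 0.3
LOSS_CUTOFF = -0.3   # from the text: loss if the log2 ratio is below -0.3


def classify_ratio(log2_ratio):
    """Return 'gain', 'loss' or 'none' for one log2(R/G) value."""
    if log2_ratio > GAIN_CUTOFF:
        return "gain"
    if log2_ratio < LOSS_CUTOFF:
        return "loss"
    return "none"


def count_alterations(ratios):
    """Count (gains, losses) for one gene across arrays.

    Missing values (None, an assumption of this sketch) are skipped.
    """
    calls = [classify_ratio(y) for y in ratios if y is not None]
    return calls.count("gain"), calls.count("loss")
```

Values exactly equal to ±0.3 are treated as unaltered here, following the strict "over" and "below" wording of the criterion.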

2.2. Data preparation

The 17K cDNA microarray contained 15,723 unique genes in 17,664 spots, and these unique genes were mapped to their chromosomal locations using SOURCE (http://genome-www5.stanford.edu/cgi-bin/source/sourceSearch) and DAVID (http://apps1.niaid.nih.gov/david/).

Let R and G denote the fluorescent intensities of the tumor and normal hybridizations, respectively. Relative intensity was evaluated as Y = log2(R/G), and the data were pre-processed as follows: first, within-print tip, intensity-dependent normalization of Y was performed as described (Yang, Dudoit, & Luu, 2002); second, genes with missing values in >20% of the total number of observations were deleted; third, the 10-nearest neighbor method was used to impute missing values; and fourth, values were averaged for genes with multiple spots. After this step, 10,514 genes remained across the 30 microarrays, and this data set was designated BF. Reproducibility of the data among arrays in the initial data set was 17.74%.
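Some of these pre-processing steps can be illustrated with a short sketch. It covers only the log2 ratio, the 20% missing-value filter and the averaging of multiple spots; the within-print tip normalization and the 10-nearest neighbor imputation are deliberately left out. The helper names and the use of `None` for missing values are assumptions made for illustration.

```python
import math

MAX_MISSING_FRACTION = 0.20  # genes with more than 20% missing values are dropped


def log2_ratio(red, green):
    """Y = log2(R/G); None when an intensity is missing or non-positive."""
    if red is None or green is None or red <= 0 or green <= 0:
        return None
    return math.log2(red / green)


def too_many_missing(values):
    """True if more than 20% of a gene's observations are missing."""
    missing = sum(1 for v in values if v is None)
    return missing / len(values) > MAX_MISSING_FRACTION


def average_spots(spot_rows):
    """Average several spots of one gene, array by array.

    spot_rows: one list of values per spot, all of the same length.
    Missing values are ignored; an array with no value stays None.
    """
    averaged = []
    for column in zip(*spot_rows):
        present = [v for v in column if v is not None]
        averaged.append(sum(present) / len(present) if present else None)
    return averaged
```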

2.3. Statistical method

2.3.1. Random effect model establishment

The random variable Y, the log2 ratio of intensities, is assumed to follow a normal distribution. For the ith gene and jth array, the gene-based statistical model for the log2 intensity ratio, $y_{ij}$, is as follows:

$$y_{ij} = \mu + \alpha_i + e_{ij} \qquad (1)$$

where $\mu$ is the mean effect and $\alpha_i$ is the random effect of the ith gene, which explains the gene’s intrinsic variability. $e_{ij}$ is a random variable reflecting variation from sources other than those identified by the gene’s effect. The underlying mean for the ith gene is given by $\mu + \alpha_i$, where $\alpha_i$ is drawn from a normal distribution with mean 0 and variance $\sigma_A^2$. $e_{ij}$ is assumed to be drawn from a normal distribution with mean 0 and variance $\sigma^2$. This model is referred to as a random effects one-way analysis of variance model.

2.3.2. Measurement of two types of variation and reproducibility

To measure variation between replicate measurements, variation is decomposed into two components. The first is the intrinsic variation of the genes, denoted by $\sigma_A^2$, which is the extent of the ‘between-gene’ variation. The second, denoted by $\sigma^2$, is the variation between replicates, including measurement error, which is the ‘within-gene’ variation. The ratio of $\sigma_A^2$ to $\sigma_A^2 + \sigma^2$, denoted by $\rho$, explains how closely the gene measurements of one array track the gene measurements of another. When $\sigma_A^2$ is relatively large compared to $\sigma^2$, $\rho$ becomes large and it is easier to measure a change in copy number for the gene. That is, $\rho$ measures reproducibility and is called the intra-class correlation (Rosner, 2000). Reproducibility quantifies the variability of replicated measurements, and it is important in CGH experiments because of the need to monitor and quantify small but biologically important changes.
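The reading of the reproducibility measure as a correlation can be checked directly from model (1). The short derivation below assumes, as is usual for the one-way random effects model but not stated explicitly in the text, that the gene effects $\alpha_i$ and the errors $e_{ij}$ are mutually independent; the numbers at the end are hypothetical and only illustrate the arithmetic.

$$
\begin{aligned}
\operatorname{Var}(y_{ij}) &= \sigma_A^2 + \sigma^2, \\
\operatorname{Cov}(y_{ij}, y_{ij'}) &= \operatorname{Var}(\alpha_i) = \sigma_A^2 \quad (j \neq j'), \\
\operatorname{Corr}(y_{ij}, y_{ij'}) &= \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2} = \rho .
\end{aligned}
$$

For example, with hypothetical values $\sigma_A^2 = 0.005$ and $\sigma^2 = 0.015$, the reproducibility would be $\rho = 0.005 / 0.020 = 0.25$.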

2.3.3. Reproducible gene selection algorithm (RGSA)

An efficient algorithm, RGSA, was developed to collect consistent data by eliminating genes with large variation relative to the total variation. RGSA controls both reproducibility and the number of remaining genes by examining, as the threshold varies, the product of the estimated reproducibility and the ratio of the number of remaining genes to the number of genes in the initial set.

To assess how much variation in an analysis is attributable to between-gene vs. within-gene variation, the gene-based random effect model (1) is considered. The RGSA algorithm comprises two filtrations. In the first, genes which are unstable due to low signal intensity (foreground intensity < background intensity + k sd. in both channels, with threshold k) are deleted. The second performs the following procedure with Eqs. (2)–(5):

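Step 1. Calculate $\hat{\sigma}_i^2$, $\hat{\sigma}^2$ and $\hat{\sigma}_A^2$ using Eqs. (2)–(4).

Step 2. Eliminate genes whose $\hat{\sigma}_i$ is greater than $k\hat{\sigma}$, where $\sigma^2$ is the total ‘within-gene’ variation for the remaining genes excluding the ith gene, and $\sigma_i^2$ is the ith gene’s ‘within-gene’ variation.

Step 3. Calculate $\hat{\rho}$ using Eq. (5).

Step 4. Calculate $f(\hat{\rho}, R) = \hat{\rho} \times R$, where R is the ratio of the number of remaining genes to the number of genes in the initial set.

Here the estimator of $\sigma_i^2$ is:

$$\hat{\sigma}_i^2 = \sum_{j=1}^{N_i} (y_{ij} - \bar{y}_i)^2 / (N_i - 1) \qquad (2)$$

where $\bar{y}_i$ is the mean intensity of the ith gene, and $N_i$ is the number of available arrays with the ith gene; the estimator of $\sigma^2$ is calculated as:

$$\hat{\sigma}^2 = \sum_{i=1}^{K} \sum_{j=1}^{N_i} (y_{ij} - \bar{y}_i)^2 / (N - K) \qquad (3)$$

where $N = \sum_{i=1}^{K} N_i$; the estimator of $\sigma_A^2$ is:

$$\hat{\sigma}_A^2 = \max\{(\mathrm{MSBT} - \hat{\sigma}^2)/n_0,\ 0\} \qquad (4)$$

where

$$n_0 = \left( \sum_{i=1}^{K} N_i - \sum_{i=1}^{K} N_i^2 \Big/ \sum_{i=1}^{K} N_i \right) \Big/ (K - 1), \qquad \mathrm{MSBT} = \sum_{i=1}^{K} N_i (\bar{y}_i - \bar{\bar{y}})^2 / (K - 1), \qquad \bar{\bar{y}} = \sum_{i=1}^{K} \bar{y}_i / K,$$

K is the number of genes, and $n_0$ is the number of arrays if all genes have no missing values in all arrays. The estimator of $\rho$ is:

$$\hat{\rho} = \hat{\sigma}_A^2 / (\hat{\sigma}_A^2 + \hat{\sigma}^2) \qquad (5)$$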
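Steps 1 to 4 and Eqs. (2)–(5) can be sketched in plain Python as below. This is an illustrative reading of the procedure, not the author's implementation: it compares standard deviations in Step 2 exactly as the step is worded, uses the pooled within-gene variation of the other genes for each gene, applies the filter in a single pass per threshold, and scans a user-supplied grid of thresholds. All function names are hypothetical.

```python
import math


def _gene_stats(genes):
    """Per-gene means and within-gene sums of squares."""
    means = [sum(g) / len(g) for g in genes]
    ss = [sum((y - m) ** 2 for y in g) for g, m in zip(genes, means)]
    return means, ss


def variance_components(genes):
    """Return (sigma2_i, sigma2, sigma2_A, rho) following Eqs. (2)-(5).

    genes: list of per-gene lists of log2 ratios with missing values
    removed; needs at least two genes and two values per gene.
    """
    K = len(genes)
    N = sum(len(g) for g in genes)
    means, ss = _gene_stats(genes)
    sigma2_i = [s / (len(g) - 1) for s, g in zip(ss, genes)]       # Eq. (2)
    sigma2 = sum(ss) / (N - K)                                      # Eq. (3)
    grand_mean = sum(means) / K
    msbt = sum(len(g) * (m - grand_mean) ** 2
               for g, m in zip(genes, means)) / (K - 1)
    n0 = (N - sum(len(g) ** 2 for g in genes) / N) / (K - 1)
    sigma2_a = max((msbt - sigma2) / n0, 0.0)                       # Eq. (4)
    total = sigma2_a + sigma2
    rho = sigma2_a / total if total > 0 else 0.0                    # Eq. (5)
    return sigma2_i, sigma2, sigma2_a, rho


def rgsa_filter(genes, k):
    """Step 2 for one threshold k: drop genes whose sd exceeds k times
    the pooled within-gene sd of the other genes."""
    K = len(genes)
    N = sum(len(g) for g in genes)
    _, ss = _gene_stats(genes)
    total_ss = sum(ss)
    kept = []
    for g, s in zip(genes, ss):
        sd_i = math.sqrt(s / (len(g) - 1))
        dof_others = (N - len(g)) - (K - 1)
        sd_others = math.sqrt(max(total_ss - s, 0.0) / dof_others)
        if sd_i <= k * sd_others:
            kept.append(g)
    return kept


def scan_thresholds(genes, thresholds):
    """Return (k, f) with the largest f = rho_hat * R over the grid."""
    best = None
    for k in thresholds:
        kept = rgsa_filter(genes, k)
        if len(kept) < 2:
            continue  # rho cannot be estimated from fewer than two genes
        rho = variance_components(kept)[3]
        score = rho * len(kept) / len(genes)                        # Step 4
        if best is None or score > best[1]:
            best = (k, score)
    return best
```

A call such as `scan_thresholds(genes, [0.5 + 0.05 * i for i in range(60)])` would scan thresholds over roughly the range shown on the axis of Fig. 1; the grid itself is an arbitrary choice of this sketch.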

Fig. 1. Graph for function of reproducibility and the number of remaining genes as threshold k varies in CMRC data (x-axis: Threshold (k); y-axis: S).

The loop runs from an initial value of the threshold k until the algorithm function $f(\hat{\rho}, R)$ is maximized or a reasonable $\hat{\rho}$ is obtained as k varies. For the well-filtered set, the threshold k is chosen at the level that maximizes $f(\hat{\rho}, R) = \hat{\rho} \times R$, so that the filtered set has both reproducibility and number of remaining genes maximized; this set is named Smax. Data set descriptions can be found in Table 1A.

Table 1A
Description of data set name.

| Data set name | Description |
| --- | --- |
| BF | The set before RGSA is applied |
| ALG1 | The set where the 1st filtration step of RGSA is completed; before RGSA loop is run, and before $\rho$ is considered |
| Smax | The set where RGSA is run until both $\rho$ and n are maximized |
| Rmax | The set where RGSA is run until a desirable level of $\rho$ is obtained |

In Table 1A, $\rho$ is reproducibility, and n is the number of remaining genes. In the CMRC data, BF and ALG1 are identical except for three genes which are unstable due to low signal intensity (foreground intensity < background intensity + 5 sd. in both channels); accordingly, the set name BF is used without distinction from ALG1.

Fig. 1 shows the graph of $f(\hat{\rho}, R)$ against the threshold k in the CMRC data. In Fig. 1, $S = \hat{\rho} \times R$, where $\hat{\rho}$ is the estimated reproducibility, R is the ratio of the number of remaining genes to the number of genes in the initial set, and k is a value such that all genes with variation greater than k times the total variation are removed. In the CMRC data, this function S was maximized at a threshold of 1.05.

2.4. Results

2.4.1. Simulation study

A simulation study was carried out to evaluate the sensitivity of RGSA (the probability of correctly detecting genes with large variations). The simulation was also used to demonstrate that Smax is well-filtered as a starting set for analysis. The simulated data set assumes a normal distribution of the log2 ratio, with mean and variance similar to those in the CMRC data; the estimated mean and variance in the CMRC data are −0.0012 and 0.01, respectively. The first 6% of genes were simulated as copy-number changed with a mean difference of 0.3; the other 94% represented unchanged copy number. To test the sensitivity of RGSA for detecting data with large variation, 18% of genes, including half of the copy-number changed genes above, were given a relatively large variation (three or five times the total). For similarity to the actual CMRC data, the variance of the artificial data was set to 0.01, 0.015 or 0.02. The data structure is as follows:

Part A: 3% copy-number changed genes with large mean difference
Part B: 3% copy-number changed genes with large mean difference together with large variations
Part C: 15% copy-number unchanged genes with large variations
Part D: 79% copy-number unchanged genes


Fig. 2. Sensitivity of RGSA with increasing reproducibility of the data.


Six artificial data sets were created according to the possible combinations of total variation and individual gene variation. A simulated data set is denoted by S(l, σ²), where σ² is the total variation and the relatively large variation of part B and part C is the product of l and σ², with l equal to 3 or 5.

Ten thousand genes were generated with a sample size of 30, as in the CMRC data. The sensitivity of RGSA was investigated at various levels of reproducibility obtained by RGSA, for different combinations of total and individual gene variation (Fig. 2).
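The simulated data structure described above can be sketched as follows. This is an assumed reconstruction for illustration: unchanged genes are centered at 0 rather than at the estimated CMRC mean, the random seed is arbitrary, and the function name and its parameters are hypothetical.

```python
import random


def simulate(n_genes=10000, n_arrays=30, sigma2=0.01, l=5, shift=0.3, seed=0):
    """Generate S(l, sigma2): one list of log2 ratios per gene.

    Part A: 3% changed (mean shift), ordinary variance sigma2
    Part B: 3% changed (mean shift), large variance l * sigma2
    Part C: 15% unchanged, large variance l * sigma2
    Part D: remaining genes, unchanged, ordinary variance sigma2
    Returns (data, labels) where labels[i] is 'A', 'B', 'C' or 'D'.
    """
    rng = random.Random(seed)
    n_a = round(0.03 * n_genes)
    n_b = round(0.03 * n_genes)
    n_c = round(0.15 * n_genes)
    data, labels = [], []
    for i in range(n_genes):
        if i < n_a:
            part = "A"
        elif i < n_a + n_b:
            part = "B"
        elif i < n_a + n_b + n_c:
            part = "C"
        else:
            part = "D"
        mean = shift if part in ("A", "B") else 0.0   # 0.0 is an assumption
        variance = l * sigma2 if part in ("B", "C") else sigma2
        sd = variance ** 0.5
        data.append([rng.gauss(mean, sd) for _ in range(n_arrays)])
        labels.append(part)
    return data, labels
```

Calling `simulate(sigma2=0.015, l=3)` would correspond to the set S(3, 0.015) in the notation of the text.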

In Fig. 2, the y-axis represents the sensitivity of RGSA; each filtered set name on the x-axis represents the reproducibility of the corresponding filtered set, so the x-axis runs in the direction of increasing reproducibility. RGSA sensitivity was observed in nine selected sets (Smax, m1, m2, m3, Rmax, ex1, ex2, ex3, and ex4) ordered by reproducibility. The nine sets were determined by the RGSA threshold, which controls the reproducibility of the selected set. Rmax is determined where the algorithm function of reproducibility and the number of remaining genes is locally maximized or where the filtered set reaches a desirable reproducibility. m1, m2 and m3 are determined at intermediate thresholds between the two sets Smax and Rmax so that their reproducibility is equally distributed. ex1, ex2, ex3 and ex4 are determined at intermediate thresholds beyond that of Rmax so that their reproducibility is equally distributed.

This figure shows that the sensitivity of RGSA depends on an individual gene’s variation together with the total variation of all genes; genes with relatively large variations are removed, and thus RGSA shows high sensitivity for removing genes with large variations. When a gene’s variation is not large, RGSA does not detect it as a gene with large variation, so the sensitivity becomes low, which fits the objective of RGSA. For instance, the sensitivity for genes with variation five times as large as the total variation is greater than that for genes with variation three times as large. Although RGSA sensitivity depends on the variation of the data, it reaches a sufficiently large value (up to 96.4% in Rmax). The sensitivity of RGSA increases as total variation increases and becomes stable at a certain degree of total variation for a set with a given reproducibility. For instance, for genes with five times the total variation, the sensitivity reaches 78.8% in Smax and 96.4% in Rmax for a total variation of 0.015 or 0.02, while it reaches 47.3% in Smax and 84.9% in Rmax for a total variation of 0.01. Sensitivity is lower for genes with three times the total variation: it reaches 31.5% and 72.7% in Smax and Rmax, respectively, for a total variation of 0.015 or 0.02, while it reaches 1.3% and 72.7% in Smax and Rmax, respectively, for a total variation of 0.01 (Fig. 2).

Since the number of remaining genes decreases as reproducibility increases, determining the RGSA threshold is a trade-off between reproducibility and how many genes remain. Smax is suggested as the filtered set for starting the analysis, since it is constructed so that both reproducibility and the number of remaining genes are maximized, and it contains fewer genes with large variations than the initial set BF. The figure reveals that the detection rate for data with large variations increases greatly at Smax and increases more moderately at the other sets selected with increased reproducibility. Furthermore, Smax includes the maximum number of genes among the selected sets whose reproducibility is larger than that of BF. These facts support Smax as a well-filtered starting set for analysis.

This simulation study shows that RGSA sensitivity reaches 78.8% at Smax, although it depends on the variation of the data, and that sensitivity increases as RGSA filters for more reproducible data; this demonstrates that RGSA is valid for collecting data with small variations.
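Sensitivity as used here, the proportion of genes with large variation that the filter removes, can be computed with a small helper. The sketch below assumes that each simulated gene carries a part label ('A' to 'D') and that the filter reports the indices of the genes it kept; both conventions, and the function name, are illustrative assumptions.

```python
def sensitivity(labels, kept_indices, large_parts=("B", "C")):
    """Fraction of large-variation genes that were removed by the filter.

    labels: part label per gene, in the original gene order.
    kept_indices: indices of the genes that survived filtering.
    large_parts: labels of the parts simulated with large variation.
    """
    kept = set(kept_indices)
    large = [i for i, part in enumerate(labels) if part in large_parts]
    removed = sum(1 for i in large if i not in kept)
    return removed / len(large)
```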

2.4.2. Comparison of selected genes before and after application of RGSA

Park et al. studied the criterion for alteration in these gastric cancer-related cDNA microarray-based CGH data and determined the criteria to be +0.3 for gain and −0.3 for loss (Park et al., 2006). Following their study, the data are categorized according to this pre-determined criterion.

To compare the sets filtered by RGSA with the initial set, RGSA was applied to the CMRC data and the characteristics of the three sets, BF, Smax and Rmax, were compared before and after filtration with RGSA (Table 1B).

Table 1B
Data description before and after RGSA.

| Characteristic | BF | Smax | Rmax |
| --- | --- | --- | --- |
| $\hat{\rho}$ | 17.25% | 24.54% | 29.75% |
| n | 10,511 | 7262 | 5084 |
| n(h) | 1512 | 331 | 77 |
| n(c) | 5284 (50.27%) | 3685 (50.74%) | 2255 (44.35%) |
| $\hat{\mu}$ | −0.00122 | −0.00548 | −0.006649 |
| $\hat{\sigma}^2$ | 0.021934 | 0.01487 | 0.0121 |
| Range of variations | No restriction | ≤1.05 × total variation | ≤0.89 × total variation |

$\hat{\rho}$: estimated reproducibility; n(h): number of hampering genes; n(c): number of consistently altered genes; $\hat{\mu}$: estimated mean; $\hat{\sigma}^2$: estimated pooled variance.

In Table 2, columns (rows) represent the observed frequency of loss (gain) among all arrays available for a given gene, with a cut-off of nine (30% of all the available arrays). L and G denote the frequency of loss and gain, respectively. The number in each cell is the count of genes whose gain and loss frequencies fall in that cell; the shaded cells give the count of hampering genes and ** marks the count of consistent genes. For instance, in Smax, 1873 genes show consistent loss without a mixture; 1836 of these show only loss with a frequency <9 and 37 show only loss with a frequency ≥9. Similarly, 1812 genes show consistent gain, while 331 genes show a mixture of gain and loss with frequencies <9. The shaded cells show that the number of hampering genes decreases as the screened set moves in the direction of increasing reproducibility, from BF through Smax to Rmax.

As Table 2 shows, 3249 genes were eliminated and 3685 of the remaining 7262 genes showed consistent alteration at Smax. Three hundred and thirty-one genes showed a mixture of gain and loss, whereas 1512 showed a mixture of gain and loss before filtration with RGSA. When genes are selected by simple frequency without filtration, hampering genes may be selected as candidates for copy-number change without being distinguished from consistent genes with the same frequency of alteration (Fig. 3A).
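The distinction between consistent and hampering genes, and the binning used in the frequency table, can be sketched as follows. The input is assumed to be one (gains, losses) pair per gene, counted with the ±0.3 criterion; the function names and the string labels for the bins are illustrative choices, not part of the original analysis.

```python
CUTOFF = 9  # 30% of the 30 available arrays, as used in Table 2


def gene_type(gains, losses):
    """Classify a gene from its gain and loss frequencies."""
    if gains > 0 and losses > 0:
        return "hampering"    # mixture of gain and loss
    if gains > 0 or losses > 0:
        return "consistent"   # only gain or only loss
    return "unaltered"


def bin_label(count, cutoff=CUTOFF):
    """Map a frequency to one of the three bins: =0, <cutoff, >=cutoff."""
    if count == 0:
        return "=0"
    return "<%d" % cutoff if count < cutoff else ">=%d" % cutoff


def frequency_table(counts):
    """Count genes per (gain bin, loss bin) cell.

    counts: iterable of (gains, losses) pairs, one per gene.
    """
    table = {}
    for gains, losses in counts:
        key = (bin_label(gains), bin_label(losses))
        table[key] = table.get(key, 0) + 1
    return table
```

Summing the cells whose two bins are both non-zero gives the number of hampering genes, n(h); summing the cells with exactly one non-zero bin gives the number of consistent genes, n(c).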

Table 2. Frequency table of gain and loss in CMRC data before and after application of RGSA (G: frequency of gain; L: frequency of loss).

| Set | Frequency | L = 0 | 0 < L < 9 | L ≥ 9 | Total |
|---|---|---|---|---|---|
| SFB | G = 0 | 3715 | 2458** | 90** | 6263 |
| SFB | 0 < G < 9 | 2600** | 1494 | 9 | 4103 |
| SFB | G ≥ 9 | 136** | 8 | 1 | 145 |
| SFB | Total | 6451 | 3960 | 100 | 10511 |
| Smax | G = 0 | 3246 | 1836** | 37** | 3466 |
| Smax | 0 < G < 9 | 1748** | 331 | 0 | 2079 |
| Smax | G ≥ 9 | 64** | 0 | 0 | 64 |
| Smax | Total | 5058 | 2167 | 37 | 7262 |
| Rmax | G = 0 | 2752 | 1152** | 17** | 3921 |
| Rmax | 0 < G < 9 | 1051** | 77 | 0 | 1128 |
| Rmax | G ≥ 9 | 35** | 0 | 0 | 35 |
| Rmax | Total | 3838 | 1229 | 17 | 5084 |

Fig. 3A. Examples of genes that are filtered out with RGSA. [Plot of log2(R/G), axis range -1 to 2. Gene labels shown: AA865707, AA425900, AI379981. Panel annotations: alt. = (3,4), ave. = -0.003, s.d. = 0.35; alt. = (3,4), ave. = 0.058, s.d. = 0.61; alt. = (4,4), ave. = 0.07, s.d. = 0.34; alt. = (4,4), ave. = 0.03, s.d. = 0.46.]

In Fig. 3A, the example genes appear to be altered with a high frequency of alteration. However, these are hampering genes, and they are not distinguished from consistent genes with the same frequency of alteration when simple frequency analysis is performed without dealing with variations in the data. They are dropped by RGSA because of their relatively large variations.

The frequency table (Table 2) shows that RGSA can be used to select consistently altered genes based on the frequency of alteration, since the hampering genes are reduced. Reproducibility was improved by 42.26% compared with the initial set. For more reproducible data, the RGSA loop was run until it reached nearly 30% reproducibility (set Rmax) for the CMRC data, eliminating genes with variation greater than 0.89 times the total variation and leaving 5084 genes. Genes with many missing values show large variation because of reduced degrees of freedom, so these were also excluded. The number of hampering genes in Rmax was reduced to 77 (Table 1B).

When a study focuses on consistent alteration (not mixed gain and loss), RGSA helps to reduce the number of genes that appear to be significant only because of a large frequency of mixed gain and loss.

2.4.3. Application of RGSA: Ranking candidate genes with frequency of one-sided alteration

Hampering genes are greatly reduced when reproducible genes are collected with RGSA. Thus, genes with a high frequency of alteration show consistent alteration (only gain or only loss); this makes it possible to rank genes by frequency of alteration, so that highly ranked genes have a high frequency of consistent alteration (Table 2).

For the CMRC data, when Smax is chosen as the filtered set, 101 genes are selected using a 30% frequency cut-off, and all 101 selected genes show consistent alteration. Thus, the frequency of alteration provides a ranking of consistent copy-number changes among the selected genes; the 30% cut-off is not absolute but was chosen to obtain strong evidence of consistent alteration.
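The ranking by frequency of one-sided alteration can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: a gain or loss is called when log2(R/G) lies outside the normal range of -0.3 to +0.3, the cut-off is 30% of the arrays (9 of 30), and the function names and input layout (a dictionary from gene ID to a list of log2 ratios) are hypothetical.

```python
from typing import Dict, List, Tuple

CRITERION = 0.3  # assumed criterion on |log2(R/G)| for calling an alteration
CUTOFF = 0.30    # frequency cut-off, as a fraction of the arrays


def count_alterations(log_ratios: List[float]) -> Tuple[int, int]:
    """Return (gains, losses) for one gene across its arrays."""
    gains = sum(1 for x in log_ratios if x > CRITERION)
    losses = sum(1 for x in log_ratios if x < -CRITERION)
    return gains, losses


def rank_consistent(data: Dict[str, List[float]]) -> List[Tuple[str, int, int]]:
    """Keep genes with one-sided alteration at or above the cut-off,
    ranked by frequency of alteration (highest first)."""
    selected = []
    for gene, values in data.items():
        gains, losses = count_alterations(values)
        needed = CUTOFF * len(values)
        one_sided = (gains == 0) != (losses == 0)  # only gain or only loss
        if one_sided and max(gains, losses) >= needed:
            selected.append((gene, gains, losses))
    return sorted(selected, key=lambda g: max(g[1], g[2]), reverse=True)
```

A gene with mixed gain and loss is never selected here, whatever its frequency, which mirrors the distinction the text draws between consistent and hampering genes.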


Fig. 3B. Examples of genes selected by the proposed approach. [Plot of log2(R/G), axis range -0.2 to 0.6. Gene labels shown: AI421834, AI014388, AA991514, AA973575, AA281426. Panel annotations: alt. = (15,0), ave. = 0.28467, s.d. = 0.14; alt. = (17,0), ave. = 0.29241, s.d. = 0.12; alt. = (15,0), ave. = 0.29300, s.d. = 0.13; alt. = (17,0), ave. = 0.29733, s.d. = 0.17; alt. = (16,0), ave. = 0.29567, s.d. = 0.08.]

11594 M. Kim / Expert Systems with Applications 36 (2009) 11589–11594

Genes in Fig. 3B are examples of the 101 genes selected after application of RGSA (set Smax) with a 30% cut-off for frequency of alteration; a t-test does not recognize these as altered genes, since their means are not larger than the criterion on alteration, 0.3. alt. = (a,b) denotes the observed frequency of alteration (gain a and loss b) for a given gene; for example, gene AA281426 reveals gain in 15 arrays, no loss in any array, and non-alteration in the other 15 arrays. ave. denotes the mean of the copy-number changes, and s.d. denotes the standard deviation of the copy-number changes for a given gene.

When spot intensities are low, mean-based methods such as the t-test or SAM often miss many consistent alterations, even though the observed alterations are consistent, with high frequency and small variations among arrays (Fig. 3B).
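A small numerical sketch shows why a mean-based rule can miss such a gene. It assumes a one-sample t-test of the log2 ratios against zero combined with the extra condition that the absolute mean exceed 0.3; the critical value 2.045 (two-sided 5% level, 29 degrees of freedom) and the gene values are synthetic choices for illustration, not data from the study.

```python
import math
import statistics

CRITERION = 0.3     # criterion on |log2(R/G)|, as in the text
T_CRITICAL = 2.045  # assumed two-sided 5% critical value for 29 d.f.


def mean_based_call(values):
    """One-sample t-test against 0 plus the extra condition |mean| > CRITERION."""
    n = len(values)
    mean = statistics.mean(values)
    t_stat = mean / (statistics.stdev(values) / math.sqrt(n))
    return abs(t_stat) > T_CRITICAL and abs(mean) > CRITERION


def frequency_of_alteration(values):
    """(gains, losses) counted array by array."""
    return (sum(v > CRITERION for v in values),
            sum(v < -CRITERION for v in values))


# Synthetic gene: gain in 15 of 30 arrays, no loss, mean of 0.29.
gene = [0.45] * 15 + [0.13] * 15
print(mean_based_call(gene))          # False: the mean does not exceed 0.3
print(frequency_of_alteration(gene))  # (15, 0)
```

The t statistic of this synthetic gene is large, yet the gene is rejected by the mean condition alone, while the frequency count records a one-sided gain in half of the arrays.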

3. Discussion

The purpose of this study was to deal with variations in replicated measurements and thus to obtain a filtered set with increased reproducibility by dropping data with large variations. A reproducible gene selection algorithm, RGSA, was developed for collecting genes with small variations. This efficient algorithm controls both the reproducibility and the number of remaining genes, and thus yields a filtered set in which both reproducibility and the number of remaining genes are maximized.
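One filtering pass of this kind, dropping genes whose variation is large relative to the total variation, might look like the sketch below. It is an illustration only: "total variation" is taken here as the sample variance of all log2 ratios pooled over genes, which is an assumption, and the reproducibility measure from the random effect model that drives the RGSA loop is not reproduced. The function name and the `ratio` parameter are hypothetical.

```python
import statistics
from typing import Dict, List


def filter_by_variation(data: Dict[str, List[float]],
                        ratio: float) -> Dict[str, List[float]]:
    """Keep genes whose variance across arrays is at most ratio * total variance."""
    all_values = [v for values in data.values() for v in values]
    total_var = statistics.variance(all_values)  # assumed definition of total variation
    return {gene: values for gene, values in data.items()
            if statistics.variance(values) <= ratio * total_var}
```

With `ratio` set to 0.89, the pass corresponds in spirit to the threshold reported for the set Rmax, where genes with variation greater than 0.89 times the total variation were eliminated.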

RGSA is useful for screening genes with subtle differences in copy-number change, and thus for detecting genes with high frequency but small means, which mean-based methods fail to detect.

Low spot intensity in the cDNA microarray-based CGH experiments conducted at CMRC of Yonsei University results in few genes being identified as significantly altered, since the mean intensities are not far from the normal range (-0.3 through +0.3). In the CMRC data, for instance, the t-test missed genes with a high frequency of one-sided alteration and small variations. This occurred because, for most such genes, the absolute value of the mean was not larger than the criterion on alteration; for decisions on statistical significance, the t-test was performed with the additional condition that the absolute mean value should be larger than the criterion on alteration. In searching for copy-number changes in cDNA microarray-based CGH, the occurrence of alteration is more important than the magnitude of the mean difference in intensities. Thus, altered genes may be detected by simple frequency analysis of alterations (Yang et al., 2005). However, without considering variations in the replicated data, simple frequency analysis has the drawback that genes with the same frequency but different variations are given the same rank of significance. This article illustrates an approach to dealing with variations of replicated data in frequency analysis, namely applying RGSA before frequency analysis is conducted.

This approach was applied to find subtly but consistently (that is, either only gain or only loss) altered genes in the cDNA microarray-based CGH experiments conducted at CMRC of Yonsei University, with the aim of delineating copy-number changes in individual genes. The results show that RGSA is well suited to selecting genes with a high frequency of consistent alteration. It is therefore possible to assign ranks to selected genes based on their frequency of consistent alteration in copy-number change.

A simulation study demonstrates that Smax is a good filtered set to use as the initial set for analysis. The sensitivity of RGSA for detecting genes with large variations ranges from 32% to 73% at Smax and from 79% to 96% at Rmax, depending on the variation in the data.
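For reference, sensitivity in such a simulation can be computed as in the sketch below. It assumes that sensitivity means the proportion of simulated large-variation genes that the filter removes; the function name and the gene labels in the usage line are hypothetical.

```python
def sensitivity(large_variation_genes, dropped_genes):
    """Fraction of the truly large-variation genes that were dropped by the filter."""
    truth = set(large_variation_genes)
    return len(truth & set(dropped_genes)) / len(truth)


# Three of four simulated large-variation genes are dropped.
print(sensitivity(["g1", "g2", "g3", "g4"], ["g1", "g2", "g3", "g9"]))  # 0.75
```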

Acknowledgements

This study was supported by a Korean Research Foundation Grant funded by the Korean Government (MOEHRD) (R03-2004-000-10048-0). The author thanks Dr. Sun Young Rha and Dr. Hyun Cheol Chung at CMRC of Yonsei University for providing data for this study, and also Young Sun Kim for his assistance.

References

Alizadeh, A., Eisen, M., Davis, R., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.

Cheng, C., Kimmel, R., Neiman, P., & Zhao, L. (2003). Array rank order regression analysis for the detection of gene copy-number changes in human cancer. Genomics, 82, 122–129.

Churchill, G. (2002). Fundamentals of experimental design for cDNA microarrays. Nature Genetics, 32, 490–495.

Kadota, K., Miki, R., Bono, H., Shimizu, K., Okazaki, Y., & Hayashizaki, Y. (2001). Preprocessing implementation for microarray (PRIM): An efficient method for processing cDNA microarray data. Physiological Genomics, 4, 183–188.

Marton, M., Derisi, J., Bennett, H., et al. (1998). Drug target validation and identification of secondary drug target effects using DNA microarrays. Nature Medicine, 4, 1293–1301.

Park, C., Jeong, H., Choi, Y., et al. (2006). Systematic analysis of cDNA microarray-based CGH. International Journal of Molecular Medicine, 17, 261–267.

Rosner, B. (2000). Fundamentals of biostatistics (pp. 555–567). California, USA: Duxbury Thomson Learning.

Ross, D., Scherf, U., Eisen, M., et al. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24, 227–235.

Seo, M., Rha, S., Yang, S., et al. (2004). The pattern of gene copy number changes in bilateral breast cancer surveyed by cDNA microarray-based comparative genomic hybridization. International Journal of Molecular Medicine, 13, 17–24.

White, K., Rifkin, S., Hurban, P., & Hogness, D. (1999). Microarray analysis of Drosophila development during metamorphosis. Science, 286, 2179–2184.

Yang, Y., Dudoit, S., Luu, P., et al. (2002). Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 30, e15.

Yang, S., Seo, M., Jeong, H., et al. (2005). Gene copy number change events at chromosome 20 and their association with recurrence in gastric cancer patients. Clinical Cancer Research, 11, 612–620.