Download pdf - Standardized genetic alteration score and predicted score for predicting recurrence status of gastric cancer

ORIGINAL PAPER

Standardized genetic alteration score and predicted scorefor predicting recurrence status of gastric cancer

Mijung Kim Æ Hyun Cheol Chung

Received: 11 December 2008 / Accepted: 22 April 2009 / Published online: 16 May 2009

� Springer-Verlag 2009

Abstract

Purpose To build a standardized genetic alteration score

(SGAS) based on genes that are related to a patient’s

recurrence status, and to obtain the predicted score (PS) for

predicting a patient’s recurrence status, which reflects the

genetic information of the gastric cancer patient.

Methods SGAS was constructed using linear combina-

tions that best account for the variability in the data. This

methodology was fit to and validated using cDNA micro-

array-based CGH data obtained from the Cancer Metastasis

Research Center at Yonsei University.

Results When classifying cancer patients, the accuracy

was 92.59% in the leave-one-out validation method.

Conclusions SGAS provided PS for the risk of recur-

rence, which was capable of discriminating a patient’s

recurrence status. A total of 59 genes were found to have a

high frequency of alteration in either the recurrence or non-

recurrence status. SGAS was found to be a significant risk

factor on recurrence and explained 31% variability of the

59 genes.

Keywords cDNA microarray-based CGH �Gene copy number change � Correct classification rate �Factor analysis � Logistic regression model

Introduction

Gastric cancer is one of the most common causes of can-

cer-related mortality (Pisani et al. 1999). cDNA micro-

array-based CGH is a useful technique for detecting

genome aberrations with high resolution. A genomic

alteration detected by cDNA microarray-based CGH can

easily be translated into both sequence and gene identifi-

cation, which can provide additional information on

regarding chromosomal rearrangements and imbalances

(Squire et al. 2003). A gastric cancer-related cDNA

microarray-based CGH experiment was performed to

investigate genomic aberrations with high resolution at the

Cancer Metastasis Research Center at Yonsei University

(Park et al. 2006; Yang et al. 2005). Thirty pairs of gastric

tumors and normal gastric tissues were used and the cDNA

microarrays containing 17,000 sequence-verified human

gene probes were directly compared. Since the cDNA

microarray-based CGH data include special features such

as low-intensity spots, analysis methods using mean values,

such as the t-test, do not always successfully identify

‘altered gene’ where the change in the mean copy number

should be defined higher than the minimal criterion for an

alteration. In the present study, to detect subtle differences

in copy number changes, a frequency method was per-

formed where the variations in genes over the arrays were

controlled. In our previous study, we developed a repro-

ducible gene selection algorithm (RGSA) for controlling

variations of the genes across arrays with the same cDNA

microarray-based CGH data performed at the Cancer

Metastasis Research Center at Yonsei University (Kim

2009). Kadota et al. (2001) developed the preprocessing

implementation for microarray (PRIM) for extracting

reproducible data from the result of duplicate experiments.

RGSA was extended to the experiments with replicates

M. Kim (&)

Institute for Mathematical Sciences, Yonsei University,

134 Shinchon-Dong, Seodaemun-Gu, Seoul 120-752, Korea

e-mail: [email protected]

H. C. Chung

Brain Korea 21 Project for Medical Science,

Department of Internal Medicine, Cancer Metastasis Research

Center, College of Medicine, Yonsei University, Seoul, Korea

123

J Cancer Res Clin Oncol (2009) 135:1501–1512

DOI 10.1007/s00432-009-0597-1

more than two. Rosner (2000) and Dowdy and Wearden

(1991) introduced intra-class correlation coefficient and its

application in the random effect model. In RGSA, the

variability of replicated measurements was quantified via a

random effect model and a measurement of reproducibility

was incorporated using an intra-class correlation coeffi-

cient. RGSA deals with both reproducibility and the

number of remaining genes. In this study, the set maxi-

mizing both reproducibility and number of remaining

genes was suggested as the well-filtered set.

The present study is concerned with identifying genes

that are related to the cancer recurrence status and are

altered in gastric cancer. Genes with small variations across

arrays were screened with RGSA after within-print tip

normalization and intensity-dependent normalization of the

data. Of the genes screened with RGSA, genes displaying a

high frequency of alteration, either on recurrence or on

non-recurrence, were selected after the genes were cate-

gorized into the alteration/non-alteration groups. Among

the selected sets of genes, the set of genes that had the

highest correct classification rate was used for distin-

guishing the recurrence status of cancer. Based on the

changes in the copy number of these representative genes,

a standardized genetic alteration score (SGAS) was con-

structed as a scoring system for predicting a patients’

recurrence status. SGAS was determined from genes that

displayed a high frequency of alterations either on recur-

rence or on non-recurrence of gastric cancer. Using this

scoring system, it was possible to search for specific genes

that had a strong relationship with the risk factor on

recurrence of gastric cancer, and to predict the probability

of cancer recurrence, which reflects the genetic information

of the gastric cancer patient.

For scoring genetic information, Yang et al. (2005)

related the summation of changes in the number of gene

copies of amplified genes (without considering deletions)

to the recurrence of cancer, and Inoue et al. (2002) assigned

a weight of ?1 or -1 to the gene depending on its char-

acteristic for the five conventional pathological factors in

relation to gastric cancer. Liu and Huang (2008) suggested

a linear transformation method for cancer classification

using rotation forest. Liebermeister (2002) applied inde-

pendent component analysis to gene expression data for

deriving a linear model based on hidden variables. Park

et al. (2002) developed a linear transformation method

linking gene expression data with patient survival time

using a partial least squares method.

In the present study not only was the gain considered,

but also the loss in the number of gene copies, and a weight

was assigned to each gene according to its contribution to

the genetic score, which was related to the risk of recur-

rence of gastric cancer. In this process, the variability of the

representative genes was decomposed with variabilities of

several latent factors that represent common characteristics

of the genes using a factor analysis technique, where the

latent factors were independent components and consisted

of linear combinations of the genes. Among these common

characteristics, the characteristic that was related to the risk

for recurrence of gastric cancer was obtained to establish a

score system with variable selection of the logistic

regression model. Agresti (2002) and Lee (1992) intro-

duced the logistic regression model for the identification of

risk factors related to dichotomous data.

The selected common factor characterizing the relation-

ship with the risk of recurrence provided the patient’s

genetic alteration level, which was based on changes in the

gene copy number. These genetic alteration levels were used

as the genetic alteration score, which was named the SGAS

since it was constructed as a variable following standard

normal distribution via factor analysis. SGAS was used to

select genes that have a strong relationship with the risk

factor on recurrence and to predict the probability [named

predicted score (PS)] of a patient’s recurrence The strategy

of this study was to: (a) select representative genes that

displayed a high frequency of alteration either on recurrence

or on non-recurrence and small variations across the arrays;

(b) construct the statistical model for the characteristics of

the representative genes using a linear combination of

common (latent) characteristics of the representative genes;

(c) find characteristics that were related to the risk of

recurrence via variable selection in a logistic regression

model; (d) apply SGAS for predicting the probability for a

gastric cancer patient’s recurrence and find genes that were

strongly related to the risk factor on recurrence by investi-

gating the correlation of SGAS for each gene.

The methods used to accomplish this are discussed in

‘‘Materials and methods’’; the first part explains the data

preparation used for this study and the representative set of

genes are presented. The second part establishes the SGAS.

The characteristics of SGAS and the PS for predicting a

patient’s recurrence status and its application are described

in ‘‘Results’’. In addition, the recurrence status is classified

using the PS, and the correct classification rate is shown.

‘‘Discussion’’ is presented in the last section.

Materials and methods

Data preparation and constructing representative set

of genes

cDNA microarray-based CGH data from the Cancer

Metastasis Research Center at Yonsei University was

analyzed with an application of the RGSA, which was

developed in our previous study with the same data. The

data were prepared as follows.

1502 J Cancer Res Clin Oncol (2009) 135:1501–1512

123

Patient and tissue samples

Thirty pairs of normal gastric mucosa and cancer tissues

were obtained from gastric cancer patients who had

undergone surgery at the Severance Hospital, Cancer

Metastasis Research Center (CMRC), Yonsei University

Health System, Seoul, Korea, from 1997 to 1999. The

study followed the local ethical guidelines of the Institu-

tional Review Board of the Yonsei University Medical

Center, Seoul, Korea. The tissue samples were immediately

frozen into liquid nitrogen at the time of resection and

stored at -150�C until further use. The patients consisted

of 27 males and 3 females with a median age of 65 years.

The number of patients in each stage was 3, 9, 12, and 6 for

stage I, stage II, stage III and stage IV, respectively.

Clinical data description is shown in Table 1.

DNA extraction and cDNA microarray-based CGH

Genomic DNA extraction from the tissue was performed

according to the conventional protocol using phenol/

chloroform/isoamylalcohol method. The cDNA micro-

arrays containing 17,000 human gene probes (CMRC-

Genomictree, Korea) were used for CGH following the

standard protocol of CMRC, Yonsei University (Park et al.

2006; Yang et al. 2005). Briefly, 4 lg of the normal or

cancer DNA from the same patient was fluorescently

labeled with Cy3 or Cy5-dUTP (Amersham, USA),

respectively, using a BioPrime DNA Labeling System

(Invitrogen, USA). The labeling products were purified

with a PCR Purification Kit (Qiagen, Germany) and com-

bined with human Cot-1 DNA (30 lg; Gibco BRL, USA),

yeast tRNA (100 lg; Gibco BRL, USA) and poly(dA-dT)

(20 lg; Sigma, USA). The hybridization mixture was then

concentrated using Microcon 30 (Millipore, USA) and

hybridized to the 17 K microarray at 65�C for 16–18 h.

After washing, the microarray was scanned using GenePix

4000B (Axon Ins., USA). The experiment was performed

with direct comparison. In this experiment, normal and

tumor genomic DNA samples were extracted from the

same patient and hybridized on the same spotted array. The

17 K cDNA microarray contained 15,723 unique genes

with 17,664 spots and these unique genes were mapped

for their chromosomal location using SOURCE (http://

genome-www5.stanford.edu/cgi-bin/source/sourceSearch)

and DAVID (http://apps1.niaid.nih.gov/david/).

Data preparation

The transformation of the intensity signal to a ratio was

carried out using the log2 red to green ratio, log2(R/G),

where R and G denote the fluorescent intensities of tumor

and normal hybridizations, respectively. Pre-processing of

the data was done with within-print tip, intensity-depen-

dent normalization of Y following (Yang et al. 2002).

Genes showing missing values for [20% of the total

number of observations were deleted, and ten-nearest

neighbor method was employed for imputation of missing

values. Averaged values were used in case of the multiple

spots. At this step, 10,514 genes were found from the 30

microarrays and the set of this data was the initial set for

analysis; filtering genes with RGSA was performed on

this set.

Park et al. (2006) evaluated the genome-wide mea-

surement of copy number of each gene in normal gastric

cancer and placenta tissues for determining the criteria on a

genomic alteration with the same data of cDNA micro-

array-based CGH; the range of genomic copy number of

normal tissues was found to be ±0.3 of the log2 fluores-

cence intensity ratio in the autosomal genes. This criterion

was used for categorizing the gene’s copy number change

into alteration and non-alteration, that is, genes that had

between a ?0.3 and -0.3 log2 copy changes were cate-

gorized into the non-alteration group and those outside this

regime were categorized into the alteration (gain or loss)

group. The cDNA microarray-based CGH data for this

study has been deposited into Array Express (http://www.

ebi.ac.uk/arrayexpress/) Query:1283947172 E-TABM-171.

Table 1 Clinical information of the patients

Categorical data Continuous data

Variable Class Cases Total cases Variable Average (SD) Min Max Total cases

Survival status Death 15 29 Age 63.833 (9.706) 41 78 30

Survival 14

Stage I, II 12 30 Lymph node metastasis 0.119 (0.158) 0 0.518 30

III, IV 18

Recurrence status Recurrence 13 27 Size 40.325 (29.682) 9 126 30

Non-Recurrence 14

Gender Female 3 30

Male 27

J Cancer Res Clin Oncol (2009) 135:1501–1512 1503

123

http://genome-www5.stanford.edu/cgi-bin/source/sourceSearch

http://genome-www5.stanford.edu/cgi-bin/source/sourceSearch

http://apps1.niaid.nih.gov/david/

http://www.ebi.ac.uk/arrayexpress/

http://www.ebi.ac.uk/arrayexpress/

Data analysis was performed with SAS V.9.1 (SAS

Institute Inc 2000).

Constructing representative set of genes

In this study, a SGAS was established with the purpose of

predicting a patient’s recurrence status and obtaining a

patient’s PS for the risk of recurrence using genetic

information. This scoring system allowed the search of

specific genes that have a strong relationship with the risk

factor in the recurrence of gastric cancer.

For this purpose, a representative set of genes showing

high frequency of alteration in either recurrence or non-

recurrence status was selected by controlling the repro-

ducibility of genes.

Candidate sets were collected according to the following

four parameters; reproducibility of the remaining genes,

cut-off for frequency of alteration in recurrence/non-

recurrence group, number of selected genes, and correct

classification rate determined from the logistic regression

model via factor analysis.

By removing genes with relatively large variations to

total variation with RGSA, the reproducibility of the set of

remaining genes increased; therefore, it was necessary to

take a reasonably large reproducibility since the number of

genes decreases as the reproducibility increases. At this

stage, three sets, the initial set, INI, the optimal set, Sopt,

and the set with 30% reproducibility, Rmax, were consid-

ered regarding control of reproducibility.

The initial set and the optimal set had a reproducibility of

17.24 and 24.54%, respectively and set Rmax was determined

to have 30% reproducibility. INI was the set that included all

genes without considering variations of the genes across the

arrays after within-print tip normalization and intensity-

dependent normalization. Sopt was the set obtained by

maximizing both reproducibility and the number of

remaining genes. Rmax was the set in which a reasonably

large reproducibility was obtained, and 30% was considered

reasonable reproducibility in this study. All these sets were

obtained by RGSA. In each set, genes were categorized into

the non-alteration group and the alteration group. Genes

from each set with a high frequency (at least four or five

frequency) of alteration in either the recurrence group or

non-recurrence group were collected. Each case was deno-

ted by (4,4) or (5,5), respectively, where (4,4) was the case

where genes had at least a frequency of alteration (gain or

loss) of 4 in the recurrence or non-recurrence group; (5,5)

was similar to (4,4). For the selection of the best represen-

tative set, the correct classification rate was investigated

using the leave-one-out validation (LOOV) method (Efron

1979). Factor analysis was performed on each set of the

genes for obtaining independent components, which are

linear combinations of the selected genes, and logistic

regression model was constructed for finding significant

factors (independent components) including clinical risk

factors. The correct classification rate was estimated with

LOOV on the constructed logistic regression model. Table 2

shows the classification rates according to reproducibility,

frequency cut-off on alteration, correct classification rate

and number of distinguishing genes remaining in the given

set. The set showing a high correct classification rate with a

small number of selected genes and large reproducibility

was selected as the representative set.

Since INI includes more genes than Sopt or Rmax, some

INI sets had a higher accuracy than Sopt or Rmax; however,

variations of genes were not controlled, which indicates INI

may include genes with large variations. In this step, the

representative set for showing high frequency of alteration

in either recurrence or non-recurrence status was obtained

by selecting the set that had the highest possible level of

accuracy with a small number of genes. The number of

genes decreased as the set moved to Sopt or Rmax where the

reproducibility of the remaining genes increased. As shown

in Table 2, the set with a cut-off (5, 5) at Sopt had the best

correct classification rate, where genes showing a frequency

of alteration of at least 5 for the recurrence or non-recur-

rence group, which consisted of 93 genes were collected.

Among the 93 genes that displayed a frequency of alteration

of at least 5 in each of the recurrence and the non-recurrence

groups, 59 genes displayed a high frequency (at least 5

frequency) of alteration in only either the recurrence or non-

recurrence group, and these 59 genes (off-diagonal in

Table 3) were determined to be the representative set of

genes that distinguished the recurrence status.

Table 2 Classification rates

validated with LOOV

The best classification rates

were boldfaced; (5, 5) at Sopt

was determined for the cut-off

on the set of representative

genes

Frequency Reproducibility Sensitivity

(%)

Specificity

(%)

Correct classification

rate (%)

Number of

genes screened

(4,4) INI 92.3 85.7 88.9 313

Sopt 84.6 85.7 85.2 125

Rmax 76.9 85.7 81.5 65

(5,5) INI 92.3 85.7 88.9 129

Sopt 92.3 92.9 92.6 59

Rmax 76.9 85.7 81.5 35


123

Constructing SGAS

To decompose the characteristics of the 59 representative

genes into several common characteristics, first, PCA

(principal component analysis) was performed. The result

from PCA is shown in Table 4; the level at which factors

explained the variability of at least 70% and each factor

showed eigenvalue greater than 1.0 was selected for a

reasonable number of factors. The top eight factors

explained the 72% variability of the data with eigenvalues

greater than 1.0.

These eight factors consisted of linear combinations of

the selected 59 genes. Based on the results shown in

Table 4, the orthogonal factor model was constructed with

these eight factors.

For the ith gene’s copy number changes Xi and its mean

li, a linearly dependent model on the eight independent,

latent factors F1; . . .;F8was expressed as follows:

X1 � l1 ¼ l1;1F1 þ l1;2F2 þ � � � þ l1;8F8 þ e1

X2 � l2 ¼ l2;1F1 þ l2;2F2 þ � � � þ l2;8F8 þ e2

..

. ... ..

.

X59 � l59 ¼ l59;1F1 þ l59;2F2 þ � � � þ l59;8F8 þ e59

whereF1; . . .;F8are ‘common factors’, and additional

sources of variation e1; . . .; e59 are ‘errors’. lij is the ‘load-

ing’ of the ith gene on the jth factor.

Each gene had correlations with the eight common

factors, which are shown in Table 5 as factor loadings.

Factor loadings, boldfaced, are strong correlations of the

genes with the corresponding factor; genes with loadings

emphasized in each factor represent the factor’s charac-

teristic that corresponds with the largest correlation when

compared to the other factors.

The factor patterns of the eight factors are shown in

Table 5, where each emphasized box (with loadings

boldfaced) is the correlations of the genes that best char-

acterize the corresponding factor.

Variable selection of the eight factors including possible

clinical risk factors was performed using a logistic

regression model with the aim of finding significant factors

on risk of recurrence for predicting the recurrence status of

a gastric cancer patient. The possible clinical variables

considered in the logistic regression model were age,

gender, lymph node metastasis (LN), size of tumor and

cancer stage (early stage of I and II, late stage of III and

IV).

Using stepwise variable selection with the eight factors

including these clinical factors in the logistic regression

model, factor 1 and clinical factor, LN were identified as

significant factors in relation to the risk of recurrence of

gastric cancer, with a significance level for entering and

removing of 0.1 and 0.15, respectively (Table 6). It is

worth noting that factor 1 most explains the variability of

the data among the eight factors, and its proportion is

31.01% (Table 4).

Table 3 Frequency table on high frequency of alteration on each

recurrence status

Recurrence Non-recurrence Total

0 1

0 0 36 36

1 23 34 57

Total 23 70 93

Entries are number of genes. The entry of (1,1) element is zero since

this set collected genes showing at least five frequency of alterations

in the recurrence or non-recurrence group

1 represents at least a frequency of alteration of five, 0 represents a

frequency of alteration less than five in recurrence or non-recurrence

group

Table 4 Eigenvalues for the proportions of variability explained by

the factors

Factor Eigenvalue Difference Proportion Cumulative proportion

1 18.295 12.167 0.310 0.310

2 6.128 1.647 0.104 0.414

3 4.480 0.956 0.076 0.490

4 3.524 0.392 0.060 0.550

5 3.132 0.336 0.053 0.603

6 2.796 0.405 0.047 0.650

7 2.391 0.185 0.041 0.691

8 2.206 0.156 0.037 0.728

9 2.050 0.236 0.035 0.763

10 1.814 0.181 0.031 0.794

11 1.633 0.152 0.028 0.821

12 1.481 0.106 0.025 0.846

13 1.375 0.219 0.023 0.870

14 1.156 0.068 0.020 0.889

15 1.088 0.130 0.018 0.908

16 0.958 0.128 0.016 0.924

17 0.830 0.189 0.014 0.938

18 0.641 0.067 0.011 0.949

19 0.574 0.005 0.010 0.959

20 0.568 0.106 0.010 0.968

21 0.462 0.055 0.008 0.976

22 0.407 0.046 0.007 0.983

23 0.361 0.091 0.006 0.989

24 0.270 0.060 0.005 0.994

25 0.211 0.042 0.004 0.997

26 0.168 0.168 0.003 1.000

Bold For a reasonable number of factors, top eight factors were

selected since those explained the variability of at least 70% and each

factor showed eigenvalue greater than 1.0


123

Table 5 Rotated factor patterns for the 59 representative genes

GeneBank Accession ID Gene name Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 Factor 8

AI669693 CD207 0.834 0.192 -0.060 0.355 -0.209 0.074 -0.021 -0.020

H62387 ISLR 0.825 0.128 0.011 -0.060 0.127 -0.184 0.106 0.004

AI681562 MTERFD2 0.803 0.191 -0.114 0.251 -0.154 0.182 0.213 0.028

AA987337 RG9MTD2 0.756 0.389 -0.095 -0.036 -0.028 0.182 -0.090 -0.176

AI312990 SP140 0.751 0.037 0.105 0.082 0.086 -0.161 -0.331 -0.017

H09936 NHLH1 0.724 0.013 -0.365 0.292 0.105 0.087 0.197 -0.090

AI207203 R3HDM1 0.668 0.084 -0.002 0.164 -0.027 0.349 0.418 -0.066

AI262957 GTF2H2 0.553 0.431 -0.130 -0.126 -0.220 0.138 0.196 -0.070

AA278852 COPS2 0.487 0.423 -0.252 0.243 0.274 0.032 0.217 0.230

AI359676 0.405 0.377 -0.392 0.402 -0.080 0.340 0.253 -0.198

AI611010 RPS10 20.483 -0.474 0.469 -0.398 -0.040 -0.067 -0.184 -0.053

AI302710 UBIAD1 20.489 -0.364 0.438 -0.339 -0.154 -0.296 0.104 0.163

AW009403 ARID3A 20.521 -0.201 0.385 0.293 0.081 -0.248 0.055 -0.240

AA464250 KRT19 20.532 0.055 0.478 -0.339 0.154 -0.118 -0.064 -0.066

AI025116 20.594 -0.063 0.219 0.085 -0.416 0.273 -0.209 0.067

AI521736 BCL6B 0.108 0.785 -0.208 0.213 -0.073 0.107 0.043 -0.182

AA987446 C3orf63 0.114 0.744 0.397 0.199 -0.128 0.171 0.289 -0.050

AA427927 PRPF3 0.144 0.702 -0.018 0.094 -0.103 0.295 -0.052 -0.192

AA063580 TAF7 0.449 0.687 -0.203 0.123 0.050 0.194 -0.288 0.020

AA281744 C18orf54 0.281 0.634 -0.456 -0.024 -0.050 0.051 0.162 -0.274

AA281064 FGFR1 0.100 0.491 0.371 -0.267 -0.197 -0.315 -0.183 0.281

H05445 GAP43 0.395 0.484 -0.413 0.423 -0.190 0.166 0.018 -0.162

AA995875 AQP2 0.445 0.449 -0.077 0.328 -0.069 0.253 0.057 -0.183

AI304790 PTGIS -0.109 20.726 0.204 0.313 0.116 -0.151 -0.043 -0.028

N94468 JUNB -0.058 -0.110 0.687 0.024 -0.173 0.099 -0.050 0.193

AA872372 -0.370 -0.116 0.627 -0.415 0.083 0.167 -0.021 -0.117

AI299228 -0.572 -0.144 0.593 -0.220 0.057 -0.102 -0.007 0.033

AA279023 LRCH3 0.018 -0.148 0.510 -0.499 -0.063 0.262 0.060 -0.140

AA281426 SLC11A1 -0.492 -0.025 0.492 -0.034 -0.195 -0.023 0.453 0.056

H54023 LILRB2 -0.117 0.067 0.448 -0.360 -0.016 0.129 -0.132 0.242

AA128162 MS4A4A 0.064 0.175 20.625 0.157 0.226 0.365 0.291 -0.106

AA411380 EVI2A -0.096 0.325 -0.101 0.736 0.234 0.139 0.045 0.003

AI336859 ASXL2 0.358 -0.095 -0.027 0.735 -0.175 -0.274 0.172 -0.055

AA456571 BSCL2 0.051 -0.064 -0.097 0.576 -0.254 0.230 0.068 -0.109

H98218 HMGA2 0.188 0.207 -0.049 0.422 0.029 0.372 0.112 -0.008

AI653069 DOCK10 -0.232 -0.332 -0.036 20.462 0.038 0.019 -0.279 0.309

AI273225 LACTB -0.436 -0.246 0.474 20.490 0.190 0.001 -0.008 0.209

AI369284 GTSE1 -0.447 -0.124 0.181 20.545 -0.221 -0.100 -0.443 0.012

AI244751 SETBP1 -0.115 0.074 0.235 20.703 -0.191 -0.051 0.235 -0.026

AA991162 ANKRD26 0.183 0.057 -0.088 -0.157 0.797 -0.306 0.085 0.229

T57875 PRKCI -0.305 -0.119 -0.254 0.202 0.738 0.067 0.086 -0.150

AI346446 EPSTI1 0.232 -0.346 -0.102 -0.081 0.648 0.131 -0.057 -0.270

AA236957 ARHGEF6 0.078 -0.321 0.102 0.031 0.612 0.010 -0.459 0.343

N71462 SCML2 -0.467 0.021 0.350 -0.061 0.607 -0.093 0.201 0.102

R18845 ZNF559 0.446 0.165 -0.300 0.349 0.516 0.299 0.105 0.034

AA278956 PHC3 -0.002 -0.153 0.366 -0.208 20.385 -0.187 -0.242 -0.205

AI380028 SCFD1 0.151 0.313 -0.068 0.029 20.671 0.336 0.355 0.100

AA490609 MRPL1 0.006 0.220 0.216 -0.013 -0.036 0.739 -0.051 -0.252


123

Factor 1 follows a standard normal distribution whose

mean and variance are 0 and 1 respectively, and thus it was

utilized as a SGAS for discriminating the recurrence status

of the patient.

The logistic regression model with p explanatory vari-

ables is given by

loge

PrðY ¼ 1jX1; . . .;XpÞPrðY ¼ 0jX1; . . .;XpÞ ¼ b0 þ b1x1 þ � � � þ bpxp

where Y is a dependent variable representing a patient

recurrence status, where a value 1 was used for recurrence

and 0 otherwise.

Based on this relationship, the suggested logistic model

containing the prognostic factor and SGAS is as follows:

loge

PrðrecurrencejLN�; SGASÞPrðnon - recurrencejLN

�; SGASÞ¼ �3:081þ 3:159LN� � 2:199SGAS ð1Þ

SGAS of this model is the genetic alteration score that

represents significant factor in relation to the risk of

recurrence for gastric cancer, and for each increase of one

unit in SGAS, the log odds increase by -2.199 when

adjusting LN*, that is, patients with a larger SGAS score

have a lower risk of recurrence.

For each gastric cancer patient, the predicted probabil-

ity, which is called the ‘PS’, is obtained as

PrðrecurrencejLN�; SGASÞ

¼ expð�3:081þ 3:159LN� � 2:199SGASÞ1þ expð�3:081þ 3:159LN� � 2:199SGASÞ :

This score is the predicted probability of a patient’s

recurrence. Table 7 shows the SGAS score and PS of 30

patients (excluding missing data).

Results

Characteristic of SGAS

The SGAS loading, which is the corresponding gene’s

correlation with SGAS, frequency on gains and losses for

the 30 arrays and mean copy number changes of the 59

genes are listed in Table 8. Table 8 lists the 28 genes

whose correlations with SGAS were large. The 15 genes

that distinguish the characteristic for SGAS from the other

seven common characteristics are boldfaced (Table 5).

SGAS was found to be a significant risk factor related to

the recurrence of gastric cancer, which was calculated

using a linear combination of the changes in the copy

number of the 59 genes (genes are listed in Table 5).

Of the 59 genes, 28 listed in Table 8 show positive or

negative correlations of at least 0.3 (most are over 0.4) with

Table 5 continued

GeneBank Accession ID Gene name Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 Factor 8

AW055023 HRB 0.192 0.360 0.149 0.111 -0.215 0.691 0.270 0.051

AI097124 WDR22 0.011 0.204 -0.371 -0.098 -0.368 0.601 0.032 0.125

AA877334 ERBB3 -0.046 -0.395 0.250 -0.202 -0.067 20.451 -0.226 0.182

AA088258 LOC400986 0.132 0.007 -0.195 0.094 0.083 0.160 0.815 -0.199

AI261857 DCUN1D1 0.084 0.026 -0.076 0.006 0.105 0.601 0.631 0.016

AA001403 ZMPSTE24 0.269 0.510 -0.033 0.272 -0.039 0.020 0.607 0.014

AA931565 MRE11A -0.406 -0.361 0.371 -0.300 -0.185 -0.249 0.483 0.031

AI630817 CDH22 -0.078 -0.234 -0.053 -0.257 -0.193 -0.012 0.065 0.752

N70013 RP2 -0.137 -0.070 0.466 -0.160 0.286 -0.187 0.100 0.600

AI361153 SRF -0.206 -0.359 0.295 -0.027 -0.129 0.051 -0.160 0.469

AA421335 LDB1 -0.110 0.053 -0.091 -0.166 -0.175 0.146 0.258 20.613

Loadings boldfaced are the correlations of the genes that best characterize the corresponding factor

Table 6 Selected factors with the logistic regression model

Parameter df Estimate Standard error Wald v2 Pr [ v2 Odds ratio estimates

Point estimate 95% Wald confidence limits

Intercept 1 -3.081 1.447 4.532 0.0333

LN* 1 3.159 1.900 2.766 0.0963 23.557 0.569 975.45

Factor1 1 -2.199 1.216 3.271 0.0705 0.111 0.01 1.202

LN* Lymph node metastasis multiplied by 10


123

SGAS, which indicates that they are strongly related to the

risk factor on recurrence of gastric cancer. By investigating

the characteristics of these 28 genes, those displaying a

positive (negative) correlation with SGAS were shown to

have only losses (only gains) with high frequency of the 30

arrays, except AW009403.

Results from the statistical test indicate that SGAS was

capable of distinguishing the recurrence status, that is,

there was a significant difference in SGAS between the

recurrence and non-recurrence group (P value: 0.045). It

was helpful to include significant clinical risk factors in the

model to classify a patients’ recurrence status at a high

correct classification rate. For this purpose, a logistic

regression model (1) was used (described in ‘‘Materials and

methods’’). SGAS obtained from factor analysis follows a

standard normal distribution, and was validated with the

Kolmogorov–Smirnov test (P value: 0.15).

Since SGAS follows standardized normal distribution, it

was possible to obtain the percentile for the patient’s

SGAS; thus, it was possible to classify patients based on

their SGAS percentiles and predict the probability of

recurrence for each class rank. Table 9 lists the PSs of a

gastric cancer patient’s recurrence in each class and these

are ranked with the SGAS score after adjusting the sig-

nificant clinical risk factor, LN, with its average value,

where the class was graded based on the SGAS percentile.

A straight line drawn tangent to the curve at a particular

SGAS value describes the rate of change at that point. The

slope approaches 0 as the PS approaches 1.0 or 0. The

steepest slope of the curve occurs at a particular SGAS

where the PS for recurrence equals 0.5; the SGAS value

was 0.303 when LN* was adjusted with its average 1.186.

This SGAS value is denoted by EL50, median effective

level, and represents the level at which recurrence and non-

recurrence has a 50% chance. As shown in the above table,

when SGAS was larger than EL50, the PS for recurrence

dramatically decreases with only small incremental chan-

ges in SGAS near EL50, and the PS curve becomes less

steep as SGAS moves farther from EL50, while it tends to

decrease moderately in proportion to changes in SGAS

when SGAS is less than EL50.

PS for predicting patient’s recurrence status

A logistic regression model that includes both the signifi-

cant clinical risk factor, LN, and SGAS results in a higher

correct rate for classification of recurrence. From this

model, it was possible to obtain the predicted probability

(score) of a patient’s recurrence (Table 7).

The recurrence status of patients was classified based on

the patients’ PS, where a threshold 0.5 was used to deter-

mine the classification of the patient since it represents the

level at which each outcome has a 50% chance of occur-

ring, and its characteristic was described with EL50. Fig-

ure 1 shows the classification of patients’ recurrence status

determined from the PS of the patient, where the patient

with a PS larger (smaller) than 0.5 was classified with a

recurrence status (non-recurrence) and is shown in the right

(in the left). No information on the recurrence status of 3

out of the 30 patients was available; thus, they were not

considered for classification. Of the 13 recurrent patients,

all the patients except ID 8 were correctly classified; for the

14 non-recurrent patients, all the patients were correctly

classified except patient ID 5. Only 2 of the 27 patients

were misclassified by the model developed in this study,

which incorporated SGAS in a logistic model that involved

the clinical risk factor, LN. It is worth noting that the

misclassified three patients ID15, ID24 and ID26 (under-

lined with star notation in the figure), when only the sig-

nificant clinical risk factor, LN, was used in the logistic

Table 7 SGAS score and PS of 27 patients

Patient ID LN* SGAS PS Recurrence status

1 0 0.672 0.010 0

2 0 0.710 0.010 0

5 0.238 -1.103 0.524 0

6 0 0.655 0.011 0

7 0 0.621 0.012 0

8 0 -0.361 0.092 1

9 0 1.264 0.003 0

10 0 -1.107 0.344 0

11 0 0.259 0.025 0

12 0 1.153 0.004 0

13 0 -0.424 0.105 0

14 0.351 -0.462 0.278 0

15 0.435 -1.392 0.795 1

16 0 0.470 0.016 0

18 0.333 0.401 0.052 0

19 3.421 1.137 0.995 1

20 4.091 0.002 1.000 1

21 1.600 -1.783 0.997 1

22 2.586 0.499 0.982 1

23 1.698 -2.152 0.999 1

24 1.132 1.288 0.088 0

25 5.179 0.989 1.000 1

26 0.465 -1.048 0.667 1

27 3.621 -1.181 1.000 1

28 3.175 -0.121 0.999 1

29 3.443 1.194 0.994 1

30 3.214 -0.177 0.999 1

Three patients, ID 3, ID 4, and ID 17, have their status of recurrence

missing

LN* lymph node metastasis multiplied by 10, Recurrence status = 1

for recurrence, and 0 for non-recurrence


123

regression model, were correctly classified when SGAS

was incorporated in the model.

Figure 1 shows positive predicted value (PV?), nega-

tive predictive value (PV-), sensitivity and specificity,

which are estimated as 92.3, 92.86, 92.3, and 92.86%,

respectively.

Validation of the correct classification rate was done

using the leave-one-out validation (LOOV) method, which

was shown to be 92.59%. The ROC curve indicated that the

sensitivity and specificity at any specific threshold and

overall correct classification rate were excellent for making

decisions with the suggested model.

For further assessment of the suggested model, a score

card was created according to the PS percentile as shown in

Fig. 2. It displays the cumulative frequency of recurrence

and non-recurrence in each class ranked at every tenth

percentile of PS obtained from the suggested model. This

figure shows that the probability of a patient’s recurrence

was well ordered according to the class rank with the PS

percentile.

The K–S statistic measures how well the class rank is

assigned where a high rank class receives a high frequency of

recurrence and a low frequency of non-recurrence, and vice

versa. A K–S statistic of over 50 indicates that the model

successfully classified subjects based on the PS obtained by

the model, that is, the classes were well ordered (Mayes 2003).

The suggested model had a K–S statistic of 85, implying

that this model provided an excellent grading system. There

were no patients with recurrence (non-recurrence) for the

four classes at the bottom rank (the top rank), indicating that

the chance of misclassification on recurrence (non-recur-

rence) was very low for patients classified as high (low) rank.

As shown in Fig. 2, the frequency of recurrence

increases and the frequency of non-recurrence decreases as

the class ranks go to a higher level. This trend demonstrates

that the PS obtained by the suggested model is excellent for

classifying patients according to the probability of a

patient’s recurrence and will provide a good method to

assess and predict the possibility of a recurrence in cancer

patients. Clinical data are shown in Table 10.

Table 8 SGAS loadings,

frequency on gains and losses,

and mean copy number changes

For instance, GeneBank

Accession ID AI669693 has

correlation of 0.834 with SGAS

and shows a total of nine losses

and no gains for the 30 patients

(arrays), and the mean change in

the copy number for the 30

arrays is -0.291

Fifteen genes of distinguishing

the characteristic for SGAS

from the other seven common

characteristics are boldfaceda SGAS loading, which is the

corresponding gene’s

correlation with SGASb Frequency of alteration (gain

and loss, respectively) of the

corresponding gene among all

patients (arrays) showing

recurrence or non-recurrencec Average change in the copy

number of the gene for the 30

patients (arrays)

Gene Bank

Accession ID

Gene name SGAS loadinga Frequency

(gain, loss)bMeanc

AI669693 CD207 0.834 (0,9) 20.291

H62387 ISLR 0.825 (0,9) 20.227

AI681562 MTERFD2 0.803 (0,8) 20.188

AA987337 RG9MTD2 0.756 (0,8) 20.188

AI312990 SP140 0.751 (0,5) 20.140

H09936 NHLH1 0.724 (0,6) 20.197

AI207203 R3HDM1 0.668 (0,10) 20.237

AI262957 GTF2H2 0.553 (0,7) 20.203

AA278852 COPS2 0.487 (0,9) 20.234

AA063580 TAF7 0.449 (0,8) -0.188

R18845 ZNF559 0.446 (0,9) -0.234

AA995875 AQP2 0.445 (0,8) -0.218

AI359676 0.405 (0,7) 20.234

H05445 GAP43 0.395 (0,6) -0.234

AI336859 ASXL2 0.358 (0,9) -0.282

T57875 PRKCI -0.305 (8,0) 0.240

AA872372 . -0.370 (8,0) 0.260

AA931565 MRE11A -0.406 (7,0) 0.221

AI273225 LACTB -0.436 (9,0) 0.231

AI369284 GTSE1 -0.447 (9,0) 0.238

N71462 SCML2 -0.467 (8,0) 0.271

AI611010 RPS10 20.483 (8,0) 0.248

AI302710 UBIAD1 20.489 (8,0) 0.229

AA281426 SLC11A1 -0.492 (13,0) 0.285

AW009403 ARID3A 20.521 (0,6) 20.200

AA464250 KRT19 20.532 (8,0) 0.170

AI299228 . -0.572 (10,0) 0.268

AI025116 20.594 (6,0) 0.265


123

Table 9 Class rank based on SGAS percentile, expected SGAS percentile and predicted probability for recurrence when adjusting LN* with its

average value

Class rank (percentile) Expected SGAS percentile SGAS percentile obtained from data Predicted score for recurrencea

10 -1.282 -1.392 0.970249

20 -0.842 -1.104 0.925334

30 -0.524 -0.424 0.860308

40 -0.253 -0.177 0.772401

50 0 0.259 0.660511

60 0.253 0.499 0.52728

70 0.524 0.655 0.380671

80 0.842 0.989 0.233978

90 1.282 1.194 0.104001

100 3? 1.288 0.002648

a This was obtained with expected SGAS percentile after adjusting significant clinical risk factor LN* with its average value, 1.186, where LN*

is initial LN multiplied by 10 and its average was taken for the 27 patients

Fig. 1 Classification of

recurrence status with PS, ID 5

and ID 8 were misclassified.

Patients ID15*, ID24* and

ID26* were correctly classified

with the suggested model in this

study; misclassified without

utilizing SGAS


123

Discussion

The aim of this study was to develop a SGAS based on

changes in the copy number of genes related to a patient’s

recurrence status. A representative set of genes was

determined, based on examining a high frequency of

alteration in only one of the two groups, recurrence group

and non-recurrence group, after controlling variations in

the data across arrays for selecting genes with small vari-

ations so that had a high reproducibility could be achieved.

It was found that SGAS was the significant risk factor on

the recurrence of gastric cancer and the factor explaining

Score Classa Cumulative NR (%)b

Cumulative R (%)c

k-s (%)d

10 21 0 21

20 36 6

30 57 7

40 71 1

50 86

0 3

0 5

0 7

8 78

60 100 15 85

70 100 31 69

80 100 54 46

90 100 85 15

100 100 100 0

K-S 85

Fig. 2 a Score class ranked

according to the percentile of

PS. b Cumulative percentage of

non-recurrence in each score

class. c Cumulative percentage

of recurrence in each score

class. d ‘Cumulative R (%)’

subtracted from the ‘cumulative

NR (%)’, K–S is the maximum

values of k–s values, In the

figure x- and y-axes represent

score class and cumulative

percentages of NR and R,

respectively

Table 10 Reference table on

clinical data from the Cancer

Metastasis Research Center at

Yonsei University

Patients were followed up for

65 months

– missing, 1 recurrence, 0 non-

recurrencea A total of 27 patients’ were

available on the their recurrence

status from the follow-up on the

30 patientsb Cancer-free time in months

Patient

numberaAge

(year)

Sex Lymph node

metastasis (LN)

Status on

recurrence

Disease-free

time (month)bSize

1 65 M 0/17 0 65 10 9 8

2 70 M 0/12 0 48 4 9 3

3 65 M 1/42 0 60 5 9 3

4 77 M 0/40 0 61 3 9 3

5 62 M 0/39 0 60 4 9 4

6 77 M 0/76 1 20 9 9 9

7 62 M 0/52 0 49 11 9 6.5

8 60 M 0/66 0 42 6 9 6

9 70 M 0/37 0 46 3 9 3

10 60 F 0/30 0 – 6 9 5

11 77 F 0/44 0 – 11 9 6

12 56 M 2/57 0 60 7 9 7

13 66 M 2/46 1 12 6 9 5

14 52 M 0/32 0 60 7 9 7

15 67 M 2/60 0 49 4 9 3

16 78 M 13/38 1 7 7 9 4

17 76 F 9/22 1 – 4 9 3.5

18 76 M 8/50 1 58 10 9 8

19 55 M 15/58 1 12 14 9 9

20 71 M 9/53 1 18 5.5 9 4.5

21 48 M 6/53 0 58 8 9 3

22 55 M 29/56 1 28 9 9 5

23 59 M 2/43 1 10 5 9 3.5

24 59 M 21/58 1 31 6 9 4

25 54 M 20/63 1 8 5 9 3

26 41 M 21/61 1 31 9 9 9

27 74 M 18/56 1 15 9 9 9


123

the most variability of 31% for the representative genes’.

From SGAS, it was possible to identify genes that were

strongly related to the risk factor on recurrence of gastric

cancer. SGAS was a weighted sum of changes in the copy

number of genes. SGAS reflects a patients’ genetic alter-

ation levels better than a simple summation of the genes’

genetic alteration levels in the sense that weights were

assigned to genes according to their accountabilities for the

score variable. This weight was the coefficient of the gene

in the linear combination used in determining the SGAS,

while loading of the SGAS for each gene explained the

correlation between SGAS and the corresponding gene.

SGAS was found to be associated with the recurrence

status. By incorporating SGAS in the logistic model in

addition to the significant clinical risk factor, it was pos-

sible to obtain a patient’s PS on recurrence. Three patients

who were initially misclassified by the model, when only

the significant clinical risk factor was considered, were

correctly classified when the suggested model incorporated

SGAS. The correct classification rate, validated by the

LOOV method, was 92.59%.

The fact that SGAS was constructed as a variable that

followed a standard normal distribution made it possible to

estimate the percentile of the patient’s SGAS score and

thus it was possible to estimate the probability of recur-

rence for the patient.

For instance, patient ID 15 and ID 26 had a SGAS score

of -1.392 and -1.048 (Table 7), which were approxi-

mately in the 8.23 and 14.7 percentile of SGAS, respec-

tively. Thus they had a high risk of recurrence (referred to

the figure in Table 9) and actually had a recurrence.

However, using only a significant clinical risk factor, these

patients were not correctly classified, since they had a good

clinical risk factor, a small LN of 0.0435 and 0.0465 (LN*

of 0.435 and 0.465), respectively.

Furthermore, patient ID 24 had a SGAS score of 1.288,

which was approximately in the 90.2 percentile of SGAS,

and thus had a low risk of recurrence (Table 6 describes

SGAS as being negatively related to the risk of recurrence,

and the figure in Table 9 shows PS versus SGAS) and

actually was non-recurrent. However, using only the sig-

nificant clinical risk factor, this patient was incorrectly

classified, since this patient had a relatively large LN of

0.1132 (LN* of 1.132; the average of the 30 patients’ LN*

was 1.186).

Acknowledgments This work was supported by the Korean

Research Foundation Grant funded by the Korean Government

(MOEHRD) (R03-2004-000-10048-0). The authors would like to

express gratitude to Cancer Metastasis Research Center at Yonsei

University for the permission to use their data, and also thank

Jin-Hyung Kim for his assistance in this study.

Conflict of interest statement We declare that we have no conflict

of interest.

References

Agresti A (2002) Categorical date analysis. Wiley, New Jersey

Dowdy S, Wearden S (1991) Statistics for research. Wiley, New

Jersey

Efron B (1979) Bootstrap methods: another look at the jackknife. Ann

Stat 7(1):1–26. doi:10.1214/aos/1176344552

Inoue H, Matsuyama A, Mimori K, Ueo H, Mori M (2002) Prognostic

score of gastric cancer determined by cDNA microarray. Clin

Cancer Res 8:3475–3479

Kadota K, Miki R, Bono H, Shimizu K, Okazaki Y, Hayashizaki Y

(2001) Preprocessing implementation for microarray (PRIM): an

efficient method for processing cDNA microarray data. Physiol

Genomics 4:183–188

Kim M (2009) Reproducible gene selction algorithm with random

effect model in cDNA microarray-based CGH data. Expert Syst

Appl. doi:10.1016/j.eswa.2009.03.034

Lee E (1992) Statistical methods for survival data analysis. Wiley,

New Jersey

Liebermeister W (2002) Linear modes of gene expression determined

by independent component analysis. Bioinformatics 18:51–60.

doi:10.1093/bioinformatics/18.1.51

Liu KH, Huang DS (2008) Cancer classification using Rotation

Forest. Comput Biol Med 38:601–610. doi:

10.1016/j.compbiomed.2008.02.007

Mayes E (2003) Handbook of credit scoring. Glenlake Publishing

Company, Chicago

Park PJ, Tian L, Kohane IS (2002) Linking gene expression data with

patient survival times using partial least squares. Bioinformatics

18:120–127

Park CH, Jeong HJ, Choi YH, Kim SC, Jeong HC, Park KH, Lee GE,

Kim TS, Yang SW, Ahn SW, Kim YS, Rha SY, Chung HC

(2006) Systematic analysis of cDNA microarray-based CGH. Int

J Mol Med 17:261–267

Pisani P, Parkin DM, Bray F, Ferlay J (1999) Estimates of the

worldwide mortality from 25 cancers in 1990. Int J Cancer

83:18–29. doi:10.1002/(SICI)1097-0215(19990924)83:1\18::

AID-IJC5[3.0.CO;2-M

Rosner B (2000) Fundamentals of biostatistics. Duxbury Publisher,

Pacific Grove

SAS Institute Inc (2000) SAS System for Windows V.9.1

Squire JA, Pei J, Marrano P, Beheshti B, Bayani J, Lim G, Moldovan

L, Zielenska M (2003) High-resolution mapping of amplifica-

tions and deletions in pediatric osteosarcoma by use of CGH

analysis of cDNA microarrays. Genes Chromosomes Cancer

38:215–225. doi:10.1002/gcc.10273

Yang YH, Dudoit S, Luu P, Lin D, Peng V, Ngai J, Speed T (2002)

Normalization for cDNA microarray data: a robust composite

method addressing single and multiple slide systematic variation.

Nucleic Acids Res 30:e15. doi:10.1093/nar/30.4.e15

Yang SH, Seo MY, Jeong HA, Jeung HC, Shin J, Kim SC, Noh SH,

Chung HC, Rha SY (2005) Gene copy number change events at

chromosome 20 and their association with recurrence in gastric

cancer patients. Clin Cancer Res 11:612–620


123

http://dx.doi.org/10.1214/aos/1176344552

http://dx.doi.org/10.1016/j.eswa.2009.03.034

http://dx.doi.org/10.1093/bioinformatics/18.1.51

http://dx.doi.org/10.1016/j.compbiomed.2008.02.007

http://dx.doi.org/10.1002/(SICI)1097-0215(19990924)83:1%3c18::AID-IJC5%3e3.0.CO;2-M

http://dx.doi.org/10.1002/(SICI)1097-0215(19990924)83:1%3c18::AID-IJC5%3e3.0.CO;2-M

http://dx.doi.org/10.1002/gcc.10273

http://dx.doi.org/10.1093/nar/30.4.e15