ORIGINAL PAPER
Standardized genetic alteration score and predicted score for predicting recurrence status of gastric cancer
Mijung Kim · Hyun Cheol Chung
Received: 11 December 2008 / Accepted: 22 April 2009 / Published online: 16 May 2009
© Springer-Verlag 2009
Abstract
Purpose To build a standardized genetic alteration score
(SGAS) based on genes that are related to a patient’s
recurrence status, and to obtain the predicted score (PS) for
predicting a patient’s recurrence status, which reflects the
genetic information of the gastric cancer patient.
Methods SGAS was constructed using linear combina-
tions that best account for the variability in the data. This
methodology was fit to and validated using cDNA micro-
array-based CGH data obtained from the Cancer Metastasis
Research Center at Yonsei University.
Results When classifying cancer patients, the accuracy
was 92.59% in the leave-one-out validation method.
Conclusions SGAS provided PS for the risk of recur-
rence, which was capable of discriminating a patient’s
recurrence status. A total of 59 genes were found to have a
high frequency of alteration in either the recurrence or non-
recurrence status. SGAS was found to be a significant risk
factor on recurrence and explained 31% variability of the
59 genes.
Keywords cDNA microarray-based CGH · Gene copy number change · Correct classification rate · Factor analysis · Logistic regression model
Introduction
Gastric cancer is one of the most common causes of can-
cer-related mortality (Pisani et al. 1999). cDNA micro-
array-based CGH is a useful technique for detecting
genome aberrations with high resolution. A genomic
alteration detected by cDNA microarray-based CGH can
easily be translated into both sequence and gene identification,
which can provide additional information
regarding chromosomal rearrangements and imbalances
(Squire et al. 2003). A gastric cancer-related cDNA
microarray-based CGH experiment was performed to
investigate genomic aberrations with high resolution at the
Cancer Metastasis Research Center at Yonsei University
(Park et al. 2006; Yang et al. 2005). Thirty pairs of gastric
tumors and normal gastric tissues were used and the cDNA
microarrays containing 17,000 sequence-verified human
gene probes were directly compared. Since the cDNA
microarray-based CGH data include special features such
as low-intensity spots, analysis methods using mean values,
such as the t-test, do not always successfully identify an
‘altered gene’, for which the change in the mean copy number
must exceed the minimal criterion for an
alteration. In the present study, to detect subtle differences
in copy number changes, a frequency method was per-
formed where the variations in genes over the arrays were
controlled. In our previous study, we developed a repro-
ducible gene selection algorithm (RGSA) for controlling
variations of the genes across arrays with the same cDNA
microarray-based CGH data performed at the Cancer
Metastasis Research Center at Yonsei University (Kim
2009). Kadota et al. (2001) developed the preprocessing
implementation for microarray (PRIM) for extracting
reproducible data from the result of duplicate experiments.
RGSA was extended to the experiments with replicates
M. Kim (✉)
Institute for Mathematical Sciences, Yonsei University,
134 Shinchon-Dong, Seodaemun-Gu, Seoul 120-752, Korea
e-mail: [email protected]
H. C. Chung
Brain Korea 21 Project for Medical Science,
Department of Internal Medicine, Cancer Metastasis Research
Center, College of Medicine, Yonsei University, Seoul, Korea
J Cancer Res Clin Oncol (2009) 135:1501–1512
DOI 10.1007/s00432-009-0597-1
more than two. Rosner (2000) and Dowdy and Wearden
(1991) introduced intra-class correlation coefficient and its
application in the random effect model. In RGSA, the
variability of replicated measurements was quantified via a
random effect model and a measurement of reproducibility
was incorporated using an intra-class correlation coeffi-
cient. RGSA deals with both reproducibility and the
number of remaining genes. In this study, the set maxi-
mizing both reproducibility and number of remaining
genes was suggested as the well-filtered set.
The present study is concerned with identifying genes
that are related to the cancer recurrence status and are
altered in gastric cancer. Genes with small variations across
arrays were screened with RGSA after within-print tip
normalization and intensity-dependent normalization of the
data. Of the genes screened with RGSA, genes displaying a
high frequency of alteration, either on recurrence or on
non-recurrence, were selected after the genes were cate-
gorized into the alteration/non-alteration groups. Among
the selected sets of genes, the set of genes that had the
highest correct classification rate was used for distin-
guishing the recurrence status of cancer. Based on the
changes in the copy number of these representative genes,
a standardized genetic alteration score (SGAS) was constructed
as a scoring system for predicting a patient’s
recurrence status. SGAS was determined from genes that
displayed a high frequency of alterations either on recur-
rence or on non-recurrence of gastric cancer. Using this
scoring system, it was possible to search for specific genes
that had a strong relationship with the risk factor on
recurrence of gastric cancer, and to predict the probability
of cancer recurrence, which reflects the genetic information
of the gastric cancer patient.
For scoring genetic information, Yang et al. (2005)
related the summation of changes in the number of gene
copies of amplified genes (without considering deletions)
to the recurrence of cancer, and Inoue et al. (2002) assigned
a weight of +1 or -1 to the gene depending on its characteristic
for the five conventional pathological factors in
relation to gastric cancer. Liu and Huang (2008) suggested
a linear transformation method for cancer classification
using rotation forest. Liebermeister (2002) applied inde-
pendent component analysis to gene expression data for
deriving a linear model based on hidden variables. Park
et al. (2002) developed a linear transformation method
linking gene expression data with patient survival time
using a partial least squares method.
In the present study not only was the gain considered,
but also the loss in the number of gene copies, and a weight
was assigned to each gene according to its contribution to
the genetic score, which was related to the risk of recur-
rence of gastric cancer. In this process, the variability of the
representative genes was decomposed with variabilities of
several latent factors that represent common characteristics
of the genes using a factor analysis technique, where the
latent factors were independent components and consisted
of linear combinations of the genes. Among these common
characteristics, the characteristic that was related to the risk
for recurrence of gastric cancer was obtained to establish a
score system with variable selection of the logistic
regression model. Agresti (2002) and Lee (1992) intro-
duced the logistic regression model for the identification of
risk factors related to dichotomous data.
The selected common factor characterizing the relation-
ship with the risk of recurrence provided the patient’s
genetic alteration level, which was based on changes in the
gene copy number. These genetic alteration levels were used
as the genetic alteration score, which was named the SGAS
since it was constructed as a variable following standard
normal distribution via factor analysis. SGAS was used to
select genes that have a strong relationship with the risk
factor on recurrence and to predict the probability [named
predicted score (PS)] of a patient’s recurrence. The strategy
of this study was to: (a) select representative genes that
displayed a high frequency of alteration either on recurrence
or on non-recurrence and small variations across the arrays;
(b) construct the statistical model for the characteristics of
the representative genes using a linear combination of
common (latent) characteristics of the representative genes;
(c) find characteristics that were related to the risk of
recurrence via variable selection in a logistic regression
model; (d) apply SGAS for predicting the probability for a
gastric cancer patient’s recurrence and find genes that were
strongly related to the risk factor on recurrence by investi-
gating the correlation of SGAS for each gene.
The methods used to accomplish this are discussed in
‘‘Materials and methods’’; the first part explains the data
preparation used for this study and the representative set of
genes are presented. The second part establishes the SGAS.
The characteristics of SGAS and the PS for predicting a
patient’s recurrence status and its application are described
in ‘‘Results’’. In addition, the recurrence status is classified
using the PS, and the correct classification rate is shown.
‘‘Discussion’’ is presented in the last section.
Materials and methods
Data preparation and constructing representative set
of genes
cDNA microarray-based CGH data from the Cancer
Metastasis Research Center at Yonsei University was
analyzed with an application of the RGSA, which was
developed in our previous study with the same data. The
data were prepared as follows.
Patient and tissue samples
Thirty pairs of normal gastric mucosa and cancer tissues
were obtained from gastric cancer patients who had
undergone surgery at the Severance Hospital, Cancer
Metastasis Research Center (CMRC), Yonsei University
Health System, Seoul, Korea, from 1997 to 1999. The
study followed the local ethical guidelines of the Institu-
tional Review Board of the Yonsei University Medical
Center, Seoul, Korea. The tissue samples were immediately
frozen into liquid nitrogen at the time of resection and
stored at -150°C until further use. The patients consisted
of 27 males and 3 females with a median age of 65 years.
The number of patients in each stage was 3, 9, 12, and 6 for
stage I, stage II, stage III and stage IV, respectively.
Clinical data description is shown in Table 1.
DNA extraction and cDNA microarray-based CGH
Genomic DNA extraction from the tissue was performed
according to the conventional protocol using the phenol/
chloroform/isoamyl alcohol method. The cDNA microarrays
containing 17,000 human gene probes
(CMRC-Genomictree, Korea) were used for CGH following the
standard protocol of CMRC, Yonsei University (Park et al.
2006; Yang et al. 2005). Briefly, 4 μg of the normal or
cancer DNA from the same patient was fluorescently
labeled with Cy3 or Cy5-dUTP (Amersham, USA),
respectively, using a BioPrime DNA Labeling System
(Invitrogen, USA). The labeling products were purified
with a PCR Purification Kit (Qiagen, Germany) and combined
with human Cot-1 DNA (30 μg; Gibco BRL, USA),
yeast tRNA (100 μg; Gibco BRL, USA) and poly(dA-dT)
(20 μg; Sigma, USA). The hybridization mixture was then
concentrated using Microcon 30 (Millipore, USA) and
hybridized to the 17 K microarray at 65°C for 16–18 h.
After washing, the microarray was scanned using GenePix
4000B (Axon Ins., USA). The experiment was performed
with direct comparison. In this experiment, normal and
tumor genomic DNA samples were extracted from the
same patient and hybridized on the same spotted array. The
17 K cDNA microarray contained 15,723 unique genes
with 17,664 spots and these unique genes were mapped
for their chromosomal location using SOURCE
(http://genome-www5.stanford.edu/cgi-bin/source/sourceSearch)
and DAVID (http://apps1.niaid.nih.gov/david/).
Data preparation
The transformation of the intensity signal to a ratio was
carried out using the log2 red to green ratio, log2(R/G),
where R and G denote the fluorescent intensities of tumor
and normal hybridizations, respectively. Pre-processing of
the data was done with within-print tip, intensity-dependent
normalization following Yang et al. (2002).
Genes showing missing values for >20% of the total
number of observations were deleted, and the ten-nearest
neighbor method was employed for imputation of missing
values. Averaged values were used in the case of multiple
spots. At this step, 10,514 genes were found from the 30
microarrays and the set of this data was the initial set for
analysis; filtering genes with RGSA was performed on
this set.
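The first two preparation steps above can be sketched as follows. This is a minimal illustration rather than the SAS procedure used in the study: the function names and the toy intensities are hypothetical, the normalization and ten-nearest neighbor imputation steps are omitted, and a gene with exactly 20% missing values is assumed to be retained because only genes with >20% missing values were deleted.

```python
import math

def log2_ratio(red, green):
    """Return log2(R/G), or None when an intensity is missing or non-positive."""
    if red is None or green is None or red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

def filter_genes(ratios_by_gene, max_missing_fraction=0.20):
    """Keep genes whose fraction of missing (None) values is at most the limit."""
    kept = {}
    for gene, values in ratios_by_gene.items():
        missing = sum(v is None for v in values)
        if missing / len(values) <= max_missing_fraction:
            kept[gene] = values
    return kept

# Hypothetical toy data: 5 arrays per gene (the study used 30 arrays).
toy = {
    "geneA": [log2_ratio(200, 100), log2_ratio(90, 100), None,
              log2_ratio(100, 100), log2_ratio(50, 100)],
    "geneB": [None, None, log2_ratio(120, 100),
              log2_ratio(80, 100), log2_ratio(100, 100)],
}
kept = filter_genes(toy)  # geneA has 20% missing, geneB has 40% missing
```

In this toy input only geneA survives the filter; the missing value it still contains would be imputed in the next step.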
Park et al. (2006) evaluated the genome-wide mea-
surement of copy number of each gene in normal gastric
cancer and placenta tissues for determining the criteria on a
genomic alteration with the same data of cDNA micro-
array-based CGH; the range of genomic copy number of
normal tissues was found to be ±0.3 of the log2 fluores-
cence intensity ratio in the autosomal genes. This criterion
was used for categorizing the gene’s copy number change
into alteration and non-alteration, that is, genes that had
log2 copy number changes between -0.3 and +0.3 were categorized
into the non-alteration group and those outside this
range were categorized into the alteration (gain or loss)
group. The cDNA microarray-based CGH data for this
study has been deposited into ArrayExpress
(http://www.ebi.ac.uk/arrayexpress/) Query:1283947172 E-TABM-171.
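The ±0.3 criterion can be written as a small categorization rule. The sketch below is illustrative only; the function name is hypothetical, and treating a value of exactly ±0.3 as non-alteration is an assumption, since the text does not state how boundary values were handled.

```python
THRESHOLD = 0.3  # range of normal copy number on the log2 ratio scale (Park et al. 2006)

def categorize(log2_change, threshold=THRESHOLD):
    """Classify one log2(R/G) value as gain, loss or non-alteration."""
    if log2_change > threshold:
        return "gain"
    if log2_change < -threshold:
        return "loss"
    return "non-alteration"  # assumption: boundary values count as non-alteration

# Hypothetical log2 ratios for one gene across four arrays
example_calls = [categorize(v) for v in (0.45, -0.31, 0.10, -0.05)]
```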
Table 1 Clinical information of the patients

Categorical data
Variable            Class            Cases   Total cases
Survival status     Death            15      29
                    Survival         14
Stage               I, II            12      30
                    III, IV          18
Recurrence status   Recurrence       13      27
                    Non-recurrence   14
Gender              Female           3       30
                    Male             27

Continuous data
Variable                Average (SD)      Min   Max     Total cases
Age                     63.833 (9.706)    41    78      30
Lymph node metastasis   0.119 (0.158)     0     0.518   30
Size                    40.325 (29.682)   9     126     30
Data analysis was performed with SAS V.9.1 (SAS
Institute Inc 2000).
Constructing representative set of genes
In this study, a SGAS was established with the purpose of
predicting a patient’s recurrence status and obtaining a
patient’s PS for the risk of recurrence using genetic
information. This scoring system allowed the search of
specific genes that have a strong relationship with the risk
factor in the recurrence of gastric cancer.
For this purpose, a representative set of genes showing
high frequency of alteration in either recurrence or non-
recurrence status was selected by controlling the repro-
ducibility of genes.
Candidate sets were collected according to the following
four parameters: reproducibility of the remaining genes,
cut-off for frequency of alteration in recurrence/non-
recurrence group, number of selected genes, and correct
classification rate determined from the logistic regression
model via factor analysis.
Removing genes whose variation was large relative to the
total variation with RGSA increased the reproducibility of
the set of remaining genes. Because the number of genes
decreases as the reproducibility increases, a reasonably
large reproducibility had to be chosen. At this stage,
three sets were considered regarding control of
reproducibility: the initial set, INI; the optimal set, Sopt;
and the set with 30% reproducibility, Rmax.
The initial set and the optimal set had a reproducibility of
17.24 and 24.54%, respectively and set Rmax was determined
to have 30% reproducibility. INI was the set that included all
genes without considering variations of the genes across the
arrays after within-print tip normalization and intensity-
dependent normalization. Sopt was the set obtained by
maximizing both reproducibility and the number of
remaining genes. Rmax was the set in which a reasonably
large reproducibility was obtained, and 30% was considered
reasonable reproducibility in this study. All these sets were
obtained by RGSA. In each set, genes were categorized into
the non-alteration group and the alteration group. Genes
from each set with a high frequency of alteration (at least
four or at least five arrays) in either the recurrence group or
the non-recurrence group were collected. The two cases were
denoted by (4,4) and (5,5), respectively, where (4,4) was the case
where genes had a frequency of alteration (gain or loss) of at
least 4 in the recurrence or non-recurrence group; (5,5)
was defined in the same way with a cut-off of 5. For the
selection of the best representative set, the correct
classification rate was investigated
using the leave-one-out validation (LOOV) method (Efron
1979). Factor analysis was performed on each set of the
genes for obtaining independent components, which are
linear combinations of the selected genes, and a logistic
regression model was constructed for finding significant
factors (independent components) including clinical risk
factors. The correct classification rate was estimated with
LOOV on the constructed logistic regression model. Table 2
shows the classification rates according to reproducibility,
frequency cut-off on alteration, correct classification rate
and number of distinguishing genes remaining in the given
set. The set showing a high correct classification rate with a
small number of selected genes and large reproducibility
was selected as the representative set.
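The leave-one-out estimate of the correct classification rate can be sketched generically as below. This is not the study's pipeline: the classifier here is a stand-in (nearest class mean on a one-dimensional score) rather than the logistic regression model on factor scores, and the scores and labels are hypothetical toy values.

```python
def leave_one_out_rate(samples, labels, fit, predict):
    """Hold out each sample once, refit on the rest, and return the fraction predicted correctly."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def fit_class_means(xs, ys):
    """Stand-in model: the mean score of each class (0 = non-recurrence, 1 = recurrence)."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return mean0, mean1

def predict_nearest_mean(model, x):
    """Assign the class whose mean score is closer to x."""
    mean0, mean1 = model
    return 0 if abs(x - mean0) <= abs(x - mean1) else 1

# Hypothetical scores and recurrence labels for eight patients
toy_scores = [0.9, 0.7, 0.6, 0.8, -0.5, -1.1, -0.9, 0.65]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
rate = leave_one_out_rate(toy_scores, toy_labels, fit_class_means, predict_nearest_mean)
```

Because the model is refit without the held-out patient each time, the rate is not inflated by the patient being classified; in the toy data only the last patient is misclassified.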
Since INI includes more genes than Sopt or Rmax, some
INI sets had a higher accuracy than Sopt or Rmax; however,
variations of genes were not controlled, which indicates INI
may include genes with large variations. In this step, the
representative set for showing high frequency of alteration
in either recurrence or non-recurrence status was obtained
by selecting the set that had the highest possible level of
accuracy with a small number of genes. The number of
genes decreased as the set moved to Sopt or Rmax where the
reproducibility of the remaining genes increased. As shown
in Table 2, the set with a cut-off of (5,5) at Sopt had the best
correct classification rate. With this cut-off, 93 genes showing
a frequency of alteration of at least 5 in the recurrence or
non-recurrence group were collected. Among these 93 genes,
34 displayed a frequency of alteration of at least 5 in both
groups, whereas 59 displayed a high frequency of alteration
(at least 5) in only one of the recurrence and non-recurrence
groups. These 59 genes (off-diagonal in Table 3) were
determined to be the representative set of
genes that distinguished the recurrence status.
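The selection rule behind the 93 candidate genes and the 59 representative genes can be sketched as follows. The function names and the toy genes are hypothetical; only the counts in TABLE3 are taken from the paper (Table 3).

```python
def alteration_frequency(calls):
    """Number of arrays in which the gene is called altered (gain or loss)."""
    return sum(c != "non-alteration" for c in calls)

def select_representative(calls_recurrence, calls_nonrecurrence, cutoff=5):
    """Return (candidates, representative): genes frequent in either group, and in exactly one group."""
    candidates, representative = [], []
    for gene in calls_recurrence:
        high_rec = alteration_frequency(calls_recurrence[gene]) >= cutoff
        high_non = alteration_frequency(calls_nonrecurrence[gene]) >= cutoff
        if high_rec or high_non:
            candidates.append(gene)
            if high_rec != high_non:  # off-diagonal cell of Table 3
                representative.append(gene)
    return candidates, representative

def calls(n_altered, n_total, kind="gain"):
    """Build a hypothetical list of per-array calls for one gene."""
    return [kind] * n_altered + ["non-alteration"] * (n_total - n_altered)

# Toy genes: 13 recurrence arrays and 14 non-recurrence arrays, as in the study
rec = {"geneX": calls(5, 13), "geneY": calls(6, 13), "geneZ": calls(2, 13)}
non = {"geneX": calls(1, 14, "loss"), "geneY": calls(7, 14), "geneZ": calls(3, 14)}
candidates, representative = select_representative(rec, non)

# Counts reported in Table 3, keyed by (recurrence flag, non-recurrence flag)
TABLE3 = {(0, 0): 0, (0, 1): 36, (1, 0): 23, (1, 1): 34}
off_diagonal = TABLE3[(0, 1)] + TABLE3[(1, 0)]
```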
Table 2 Classification rates validated with LOOV

Frequency   Reproducibility   Sensitivity (%)   Specificity (%)   Correct classification rate (%)   Number of genes screened
(4,4)       INI               92.3              85.7              88.9                              313
            Sopt              84.6              85.7              85.2                              125
            Rmax              76.9              85.7              81.5                              65
(5,5)       INI               92.3              85.7              88.9                              129
            Sopt              92.3              92.9              92.6                              59
            Rmax              76.9              85.7              81.5                              35

The best classification rates were boldfaced; (5, 5) at Sopt was determined for the cut-off on the set of representative genes
Constructing SGAS
To decompose the characteristics of the 59 representative
genes into several common characteristics, first, PCA
(principal component analysis) was performed. The result
from PCA is shown in Table 4. The number of factors was
chosen as the level at which the factors together explained
at least 70% of the variability and each factor had an
eigenvalue greater than 1.0. The top eight factors
explained 72% of the variability of the data, each with an
eigenvalue greater than 1.0.
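The choice of eight factors can be reproduced from the eigenvalues in Table 4. The sketch below assumes one reading of the criterion: take the smallest number of leading factors, each with an eigenvalue above 1.0, whose cumulative proportion reaches 70%. The function name is hypothetical; the eigenvalues are those reported in Table 4, and each proportion is the eigenvalue divided by the number of genes (59).

```python
# Eigenvalues of the 26 factors as listed in Table 4
EIGENVALUES = [
    18.295, 6.128, 4.480, 3.524, 3.132, 2.796, 2.391, 2.206, 2.050,
    1.814, 1.633, 1.481, 1.375, 1.156, 1.088, 0.958, 0.830, 0.641,
    0.574, 0.568, 0.462, 0.407, 0.361, 0.270, 0.211, 0.168,
]
N_GENES = 59  # total variance of 59 standardized variables

def choose_n_factors(eigenvalues, n_variables, min_cumulative=0.70, min_eigenvalue=1.0):
    """Return (k, cumulative proportion) for the first k factors reaching the target, or None."""
    cumulative = 0.0
    for k, eigenvalue in enumerate(eigenvalues, start=1):
        if eigenvalue <= min_eigenvalue:
            break  # remaining factors no longer satisfy the eigenvalue criterion
        cumulative += eigenvalue / n_variables
        if cumulative >= min_cumulative:
            return k, cumulative
    return None

n_factors, cumulative = choose_n_factors(EIGENVALUES, N_GENES)
first_proportion = EIGENVALUES[0] / N_GENES  # share explained by factor 1
```

Under this reading, seven factors reach a cumulative proportion of 0.691 and the eighth brings it to 0.728, which matches the cumulative column of Table 4.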
These eight factors consisted of linear combinations of
the selected 59 genes. Based on the results shown in
Table 4, the orthogonal factor model was constructed with
these eight factors.
For the ith gene’s copy number change $X_i$ and its mean
$\mu_i$, a model depending linearly on the eight independent,
latent factors $F_1, \ldots, F_8$ was expressed as follows:

$$
\begin{aligned}
X_1 - \mu_1 &= l_{1,1}F_1 + l_{1,2}F_2 + \cdots + l_{1,8}F_8 + \varepsilon_1 \\
X_2 - \mu_2 &= l_{2,1}F_1 + l_{2,2}F_2 + \cdots + l_{2,8}F_8 + \varepsilon_2 \\
&\;\;\vdots \\
X_{59} - \mu_{59} &= l_{59,1}F_1 + l_{59,2}F_2 + \cdots + l_{59,8}F_8 + \varepsilon_{59}
\end{aligned}
$$

where $F_1, \ldots, F_8$ are ‘common factors’, and the additional
sources of variation $\varepsilon_1, \ldots, \varepsilon_{59}$ are ‘errors’. $l_{i,j}$ is the
‘loading’ of the ith gene on the jth factor.
Each gene had correlations with the eight common
factors, which are shown in Table 5 as factor loadings.
The boldfaced factor loadings mark, for each gene, the factor
with which the gene has its largest correlation; the genes
whose loadings are emphasized under a factor are therefore
the genes that best characterize that factor. The factor
patterns of the eight factors are shown in Table 5.
Variable selection of the eight factors including possible
clinical risk factors was performed using a logistic
regression model with the aim of finding significant factors
on risk of recurrence for predicting the recurrence status of
a gastric cancer patient. The possible clinical variables
considered in the logistic regression model were age,
gender, lymph node metastasis (LN), size of tumor and
cancer stage (early stage of I and II, late stage of III and
IV).
Using stepwise variable selection with the eight factors
including these clinical factors in the logistic regression
model, factor 1 and clinical factor, LN were identified as
significant factors in relation to the risk of recurrence of
gastric cancer, with a significance level for entering and
removing of 0.1 and 0.15, respectively (Table 6). It is
worth noting that, among the eight factors, factor 1 explains
the largest proportion of the variability of the data,
31.01% (Table 4).
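The Wald statistics and odds ratios reported in Table 6 can be re-derived from the estimates and standard errors alone. The sketch below assumes the usual definitions (Wald chi-square as the squared ratio of estimate to standard error, odds ratio as the exponentiated estimate, and 95% Wald limits using z = 1.96); the function name is hypothetical, and small differences from Table 6 are expected because the published estimates are rounded.

```python
import math

# (estimate, standard error) pairs as reported in Table 6
TABLE6 = {
    "Intercept": (-3.081, 1.447),
    "LN*": (3.159, 1.900),
    "Factor1": (-2.199, 1.216),
}

def wald_summary(estimate, std_error, z=1.96):
    """Return (Wald chi-square, odds ratio, lower 95% limit, upper 95% limit)."""
    wald_chi2 = (estimate / std_error) ** 2
    odds_ratio = math.exp(estimate)
    lower = math.exp(estimate - z * std_error)
    upper = math.exp(estimate + z * std_error)
    return wald_chi2, odds_ratio, lower, upper

summary = {name: wald_summary(*values) for name, values in TABLE6.items()}
```

For Factor1 this gives an odds ratio of about 0.111 with limits of roughly 0.01 and 1.20, in line with Table 6; the upper limit slightly above 1 is consistent with a P value between 0.05 and 0.1.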
Table 3 Frequency table on high frequency of alteration on each recurrence status

              Non-recurrence
Recurrence    0      1      Total
0             0      36     36
1             23     34     57
Total         23     70     93

Entries are numbers of genes. The first entry (recurrence 0, non-recurrence 0) is zero since
this set collected genes showing a frequency of alteration of at least five
in the recurrence or non-recurrence group
1 represents a frequency of alteration of at least five, 0 represents a
frequency of alteration of less than five in the recurrence or non-recurrence
group
Table 4 Eigenvalues for the proportions of variability explained by
the factors
Factor Eigenvalue Difference Proportion Cumulative proportion
1 18.295 12.167 0.310 0.310
2 6.128 1.647 0.104 0.414
3 4.480 0.956 0.076 0.490
4 3.524 0.392 0.060 0.550
5 3.132 0.336 0.053 0.603
6 2.796 0.405 0.047 0.650
7 2.391 0.185 0.041 0.691
8 2.206 0.156 0.037 0.728
9 2.050 0.236 0.035 0.763
10 1.814 0.181 0.031 0.794
11 1.633 0.152 0.028 0.821
12 1.481 0.106 0.025 0.846
13 1.375 0.219 0.023 0.870
14 1.156 0.068 0.020 0.889
15 1.088 0.130 0.018 0.908
16 0.958 0.128 0.016 0.924
17 0.830 0.189 0.014 0.938
18 0.641 0.067 0.011 0.949
19 0.574 0.005 0.010 0.959
20 0.568 0.106 0.010 0.968
21 0.462 0.055 0.008 0.976
22 0.407 0.046 0.007 0.983
23 0.361 0.091 0.006 0.989
24 0.270 0.060 0.005 0.994
25 0.211 0.042 0.004 0.997
26 0.168 0.168 0.003 1.000
For a reasonable number of factors, the top eight factors (boldfaced) were
selected since together they explained at least 70% of the variability and each
factor showed an eigenvalue greater than 1.0
Table 5 Rotated factor patterns for the 59 representative genes
GeneBank Accession ID Gene name Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 Factor 8
AI669693 CD207 0.834 0.192 -0.060 0.355 -0.209 0.074 -0.021 -0.020
H62387 ISLR 0.825 0.128 0.011 -0.060 0.127 -0.184 0.106 0.004
AI681562 MTERFD2 0.803 0.191 -0.114 0.251 -0.154 0.182 0.213 0.028
AA987337 RG9MTD2 0.756 0.389 -0.095 -0.036 -0.028 0.182 -0.090 -0.176
AI312990 SP140 0.751 0.037 0.105 0.082 0.086 -0.161 -0.331 -0.017
H09936 NHLH1 0.724 0.013 -0.365 0.292 0.105 0.087 0.197 -0.090
AI207203 R3HDM1 0.668 0.084 -0.002 0.164 -0.027 0.349 0.418 -0.066
AI262957 GTF2H2 0.553 0.431 -0.130 -0.126 -0.220 0.138 0.196 -0.070
AA278852 COPS2 0.487 0.423 -0.252 0.243 0.274 0.032 0.217 0.230
AI359676 0.405 0.377 -0.392 0.402 -0.080 0.340 0.253 -0.198
AI611010 RPS10 -0.483 -0.474 0.469 -0.398 -0.040 -0.067 -0.184 -0.053
AI302710 UBIAD1 -0.489 -0.364 0.438 -0.339 -0.154 -0.296 0.104 0.163
AW009403 ARID3A -0.521 -0.201 0.385 0.293 0.081 -0.248 0.055 -0.240
AA464250 KRT19 -0.532 0.055 0.478 -0.339 0.154 -0.118 -0.064 -0.066
AI025116 -0.594 -0.063 0.219 0.085 -0.416 0.273 -0.209 0.067
AI521736 BCL6B 0.108 0.785 -0.208 0.213 -0.073 0.107 0.043 -0.182
AA987446 C3orf63 0.114 0.744 0.397 0.199 -0.128 0.171 0.289 -0.050
AA427927 PRPF3 0.144 0.702 -0.018 0.094 -0.103 0.295 -0.052 -0.192
AA063580 TAF7 0.449 0.687 -0.203 0.123 0.050 0.194 -0.288 0.020
AA281744 C18orf54 0.281 0.634 -0.456 -0.024 -0.050 0.051 0.162 -0.274
AA281064 FGFR1 0.100 0.491 0.371 -0.267 -0.197 -0.315 -0.183 0.281
H05445 GAP43 0.395 0.484 -0.413 0.423 -0.190 0.166 0.018 -0.162
AA995875 AQP2 0.445 0.449 -0.077 0.328 -0.069 0.253 0.057 -0.183
AI304790 PTGIS -0.109 -0.726 0.204 0.313 0.116 -0.151 -0.043 -0.028
N94468 JUNB -0.058 -0.110 0.687 0.024 -0.173 0.099 -0.050 0.193
AA872372 -0.370 -0.116 0.627 -0.415 0.083 0.167 -0.021 -0.117
AI299228 -0.572 -0.144 0.593 -0.220 0.057 -0.102 -0.007 0.033
AA279023 LRCH3 0.018 -0.148 0.510 -0.499 -0.063 0.262 0.060 -0.140
AA281426 SLC11A1 -0.492 -0.025 0.492 -0.034 -0.195 -0.023 0.453 0.056
H54023 LILRB2 -0.117 0.067 0.448 -0.360 -0.016 0.129 -0.132 0.242
AA128162 MS4A4A 0.064 0.175 -0.625 0.157 0.226 0.365 0.291 -0.106
AA411380 EVI2A -0.096 0.325 -0.101 0.736 0.234 0.139 0.045 0.003
AI336859 ASXL2 0.358 -0.095 -0.027 0.735 -0.175 -0.274 0.172 -0.055
AA456571 BSCL2 0.051 -0.064 -0.097 0.576 -0.254 0.230 0.068 -0.109
H98218 HMGA2 0.188 0.207 -0.049 0.422 0.029 0.372 0.112 -0.008
AI653069 DOCK10 -0.232 -0.332 -0.036 -0.462 0.038 0.019 -0.279 0.309
AI273225 LACTB -0.436 -0.246 0.474 -0.490 0.190 0.001 -0.008 0.209
AI369284 GTSE1 -0.447 -0.124 0.181 -0.545 -0.221 -0.100 -0.443 0.012
AI244751 SETBP1 -0.115 0.074 0.235 -0.703 -0.191 -0.051 0.235 -0.026
AA991162 ANKRD26 0.183 0.057 -0.088 -0.157 0.797 -0.306 0.085 0.229
T57875 PRKCI -0.305 -0.119 -0.254 0.202 0.738 0.067 0.086 -0.150
AI346446 EPSTI1 0.232 -0.346 -0.102 -0.081 0.648 0.131 -0.057 -0.270
AA236957 ARHGEF6 0.078 -0.321 0.102 0.031 0.612 0.010 -0.459 0.343
N71462 SCML2 -0.467 0.021 0.350 -0.061 0.607 -0.093 0.201 0.102
R18845 ZNF559 0.446 0.165 -0.300 0.349 0.516 0.299 0.105 0.034
AA278956 PHC3 -0.002 -0.153 0.366 -0.208 -0.385 -0.187 -0.242 -0.205
AI380028 SCFD1 0.151 0.313 -0.068 0.029 -0.671 0.336 0.355 0.100
AA490609 MRPL1 0.006 0.220 0.216 -0.013 -0.036 0.739 -0.051 -0.252
Factor 1 follows a standard normal distribution whose
mean and variance are 0 and 1 respectively, and thus it was
utilized as a SGAS for discriminating the recurrence status
of the patient.
The logistic regression model with p explanatory vari-
ables is given by
$$
\log_e \frac{\Pr(Y = 1 \mid X_1, \ldots, X_p)}{\Pr(Y = 0 \mid X_1, \ldots, X_p)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
$$
where Y is a dependent variable representing a patient’s
recurrence status, where a value 1 was used for recurrence
and 0 otherwise.
Based on this relationship, the suggested logistic model
containing the prognostic factor and SGAS is as follows:
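$$
\log_e \frac{\Pr(\text{recurrence} \mid \mathrm{LN}^{*}, \mathrm{SGAS})}{\Pr(\text{non-recurrence} \mid \mathrm{LN}^{*}, \mathrm{SGAS})} = -3.081 + 3.159\,\mathrm{LN}^{*} - 2.199\,\mathrm{SGAS} \qquad (1)
$$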
SGAS of this model is the genetic alteration score that
represents a significant factor in relation to the risk of
recurrence for gastric cancer. For each increase of one
unit in SGAS, the log odds of recurrence decrease by 2.199 when
adjusting for LN*; that is, patients with a larger SGAS
have a lower risk of recurrence.
For each gastric cancer patient, the predicted probabil-
ity, which is called the ‘PS’, is obtained as
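$$
\Pr(\text{recurrence} \mid \mathrm{LN}^{*}, \mathrm{SGAS}) = \frac{\exp(-3.081 + 3.159\,\mathrm{LN}^{*} - 2.199\,\mathrm{SGAS})}{1 + \exp(-3.081 + 3.159\,\mathrm{LN}^{*} - 2.199\,\mathrm{SGAS})}.
$$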
This score is the predicted probability of a patient’s
recurrence. Table 7 shows the SGAS and PS of the 27 of the 30
patients that remain after excluding missing data.
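As a check on the formula, the PS values in Table 7 can be recomputed from LN* and SGAS with the fitted coefficients of model (1). The sketch below is illustrative; the function names are hypothetical, and the 0.5 threshold in classify is the one the paper uses in ‘‘Results’’ when classifying patients.

```python
import math

# Coefficients of the fitted logistic model (1)
B0, B_LN, B_SGAS = -3.081, 3.159, -2.199

def predicted_score(ln_star, sgas):
    """Predicted probability of recurrence (PS) for given LN* and SGAS."""
    log_odds = B0 + B_LN * ln_star + B_SGAS * sgas
    return 1.0 / (1.0 + math.exp(-log_odds))  # same as exp(z) / (1 + exp(z))

def classify(ps, threshold=0.5):
    """1 = predicted recurrence, 0 = predicted non-recurrence."""
    return 1 if ps > threshold else 0

# (patient ID, LN*, SGAS) for three rows of Table 7
rows = [(1, 0.0, 0.672), (5, 0.238, -1.103), (8, 0.0, -0.361)]
scores = {pid: round(predicted_score(ln, s), 3) for pid, ln, s in rows}
```

The three recomputed scores agree with the PS column of Table 7 (0.010, 0.524 and 0.092). Patients 5 and 8 fall on the opposite side of 0.5 from their observed status, which is consistent with the two misclassified patients reported in ‘‘Results’’.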
Results
Characteristic of SGAS
The SGAS loading, which is the corresponding gene’s
correlation with SGAS, frequency on gains and losses for
the 30 arrays and mean copy number changes of the 59
genes are listed in Table 8. Table 8 lists the 28 genes
whose correlations with SGAS were large. The 15 genes
that distinguish the characteristic for SGAS from the other
seven common characteristics are boldfaced (Table 5).
SGAS was found to be a significant risk factor related to
the recurrence of gastric cancer, which was calculated
using a linear combination of the changes in the copy
number of the 59 genes (genes are listed in Table 5).
Of the 59 genes, 28 listed in Table 8 show positive or
negative correlations of at least 0.3 (most are over 0.4) with
Table 5 continued
GeneBank Accession ID Gene name Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 Factor 8
AW055023 HRB 0.192 0.360 0.149 0.111 -0.215 0.691 0.270 0.051
AI097124 WDR22 0.011 0.204 -0.371 -0.098 -0.368 0.601 0.032 0.125
AA877334 ERBB3 -0.046 -0.395 0.250 -0.202 -0.067 -0.451 -0.226 0.182
AA088258 LOC400986 0.132 0.007 -0.195 0.094 0.083 0.160 0.815 -0.199
AI261857 DCUN1D1 0.084 0.026 -0.076 0.006 0.105 0.601 0.631 0.016
AA001403 ZMPSTE24 0.269 0.510 -0.033 0.272 -0.039 0.020 0.607 0.014
AA931565 MRE11A -0.406 -0.361 0.371 -0.300 -0.185 -0.249 0.483 0.031
AI630817 CDH22 -0.078 -0.234 -0.053 -0.257 -0.193 -0.012 0.065 0.752
N70013 RP2 -0.137 -0.070 0.466 -0.160 0.286 -0.187 0.100 0.600
AI361153 SRF -0.206 -0.359 0.295 -0.027 -0.129 0.051 -0.160 0.469
AA421335 LDB1 -0.110 0.053 -0.091 -0.166 -0.175 0.146 0.258 -0.613
Loadings boldfaced are the correlations of the genes that best characterize the corresponding factor
Table 6 Selected factors with the logistic regression model
Parameter df Estimate Standard error Wald v2 Pr [ v2 Odds ratio estimates
Point estimate 95% Wald confidence limits
Intercept 1 -3.081 1.447 4.532 0.0333
LN* 1 3.159 1.900 2.766 0.0963 23.557 0.569 975.45
Factor1 1 -2.199 1.216 3.271 0.0705 0.111 0.01 1.202
LN* Lymph node metastasis multiplied by 10
SGAS, which indicates that they are strongly related to the
risk factor on recurrence of gastric cancer. By investigating
the characteristics of these 28 genes, those displaying a
positive (negative) correlation with SGAS were shown to
have only losses (only gains) at high frequency across the 30
arrays, with the exception of AW009403.
Results from the statistical test indicate that SGAS was
capable of distinguishing the recurrence status, that is,
there was a significant difference in SGAS between the
recurrence and non-recurrence group (P value: 0.045). It
was helpful to include significant clinical risk factors in the
model to classify a patient’s recurrence status at a high
correct classification rate. For this purpose, a logistic
regression model (1) was used (described in ‘‘Materials and
methods’’). SGAS obtained from factor analysis follows a
standard normal distribution, which was checked with the
Kolmogorov–Smirnov test (P value: 0.15).
Since SGAS follows a standard normal distribution, it
was possible to obtain the percentile for the patient’s
SGAS; thus, it was possible to classify patients based on
their SGAS percentiles and predict the probability of
recurrence for each class rank. Table 9 lists the PSs of a
gastric cancer patient’s recurrence in each class and these
are ranked with the SGAS score after adjusting the sig-
nificant clinical risk factor, LN, with its average value,
where the class was graded based on the SGAS percentile.
A straight line drawn tangent to the curve at a particular
SGAS value describes the rate of change at that point. The
slope approaches 0 as the PS approaches 1.0 or 0. The
steepest slope of the curve occurs at a particular SGAS
where the PS for recurrence equals 0.5; the SGAS value
was 0.303 when LN* was fixed at its average, 1.186.
This SGAS value is denoted by EL50, the median effective
level, and represents the level at which recurrence and
non-recurrence each have a 50% chance. As shown in Table 9,
when SGAS was larger than EL50, the PS for recurrence
dramatically decreases with only small incremental chan-
ges in SGAS near EL50, and the PS curve becomes less
steep as SGAS moves farther from EL50, while it tends to
decrease moderately in proportion to changes in SGAS
when SGAS is less than EL50.
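The relation between the PS curve and EL50 can be checked numerically. The following is only a sketch: the coefficients of model (1) are not restated in this section, so the slope and intercept on the log-odds scale are back-calculated here by least squares from the expected SGAS values and predicted scores listed in Table 9 (LN* held at its average). The function names `logit`, `ps` and `ps_slope` are hypothetical helpers introduced for illustration.

```python
import math

# Rows of Table 9 with a finite expected SGAS: (expected SGAS, predicted score).
table9 = [
    (-1.282, 0.970249), (-0.842, 0.925334), (-0.524, 0.860308),
    (-0.253, 0.772401), (0.0, 0.660511), (0.253, 0.52728),
    (0.524, 0.380671), (0.842, 0.233978), (1.282, 0.104001),
]

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

# Ordinary least-squares line: logit(PS) = intercept + slope * SGAS.
xs = [s for s, _ in table9]
ys = [logit(p) for _, p in table9]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# EL50 is the SGAS at which PS = 0.5, i.e. where the log-odds are zero.
el50 = -intercept / slope

def ps(sgas):
    """Predicted score at a given SGAS, with LN* held at its average."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * sgas)))

def ps_slope(sgas):
    """Derivative dPS/dSGAS = slope * PS * (1 - PS) of the logistic curve."""
    p = ps(sgas)
    return slope * p * (1.0 - p)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, EL50 = {el50:.3f}")
print(f"dPS/dSGAS at EL50 = {ps_slope(el50):.3f}")
```

Because the derivative is proportional to PS(1 - PS), its magnitude is largest where PS = 0.5 and shrinks toward 0 as PS approaches 0 or 1, in line with the description above.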
PS for predicting patient’s recurrence status
A logistic regression model that includes both the significant
clinical risk factor, LN, and SGAS gives a higher correct
classification rate for recurrence. From this model, it was
possible to obtain the predicted probability (score) of a
patient’s recurrence (Table 7).
The recurrence status of each patient was classified from
the patient’s PS, using a threshold of 0.5, since this is the
level at which each outcome has a 50% chance of occurring;
its characteristics were described above with EL50. Figure 1
shows the classification of the patients’ recurrence status
determined from the PS: a patient with a PS larger (smaller)
than 0.5 was classified as recurrent (non-recurrent) and is
shown on the right (left). No information on the recurrence
status of 3 of the 30 patients was available; thus, they were
not considered for classification. Of the 13 recurrent
patients, all except ID 8 were correctly classified; of the
14 non-recurrent patients, all except ID 5 were correctly
classified. Thus only 2 of the 27 patients were misclassified
by the model developed in this study, which incorporated
SGAS in a logistic model that involved the clinical risk
factor, LN. It is worth noting that three patients, ID 15,
ID 24 and ID 26 (marked with a star in the figure), who were
misclassified when only the significant clinical risk
factor, LN, was used in the logistic
Table 7 SGAS score and PS of 27 patients
Patient ID LN* SGAS PS Recurrence status
1 0 0.672 0.010 0
2 0 0.710 0.010 0
5 0.238 -1.103 0.524 0
6 0 0.655 0.011 0
7 0 0.621 0.012 0
8 0 -0.361 0.092 1
9 0 1.264 0.003 0
10 0 -1.107 0.344 0
11 0 0.259 0.025 0
12 0 1.153 0.004 0
13 0 -0.424 0.105 0
14 0.351 -0.462 0.278 0
15 0.435 -1.392 0.795 1
16 0 0.470 0.016 0
18 0.333 0.401 0.052 0
19 3.421 1.137 0.995 1
20 4.091 0.002 1.000 1
21 1.600 -1.783 0.997 1
22 2.586 0.499 0.982 1
23 1.698 -2.152 0.999 1
24 1.132 1.288 0.088 0
25 5.179 0.989 1.000 1
26 0.465 -1.048 0.667 1
27 3.621 -1.181 1.000 1
28 3.175 -0.121 0.999 1
29 3.443 1.194 0.994 1
30 3.214 -0.177 0.999 1
Recurrence status is missing for three patients: ID 3, ID 4, and ID 17
LN*, lymph node metastasis multiplied by 10; Recurrence status: 1 for
recurrence, 0 for non-recurrence
1508 J Cancer Res Clin Oncol (2009) 135:1501–1512
regression model, were correctly classified when SGAS
was incorporated in the model.
Figure 1 shows the positive predictive value (PV+), negative
predictive value (PV-), sensitivity and specificity,
which were estimated as 92.3, 92.86, 92.3, and 92.86%,
respectively.
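These rates can be recomputed from the values listed in Table 7. The sketch below applies the 0.5 threshold to the tabulated PS and counts the four outcomes; the variable names are illustrative and the PS values are simply transcribed from the table, not recomputed from the fitted model.

```python
# (patient ID, PS, observed recurrence status) transcribed from Table 7.
TABLE7 = [
    (1, 0.010, 0), (2, 0.010, 0), (5, 0.524, 0), (6, 0.011, 0),
    (7, 0.012, 0), (8, 0.092, 1), (9, 0.003, 0), (10, 0.344, 0),
    (11, 0.025, 0), (12, 0.004, 0), (13, 0.105, 0), (14, 0.278, 0),
    (15, 0.795, 1), (16, 0.016, 0), (18, 0.052, 0), (19, 0.995, 1),
    (20, 1.000, 1), (21, 0.997, 1), (22, 0.982, 1), (23, 0.999, 1),
    (24, 0.088, 0), (25, 1.000, 1), (26, 0.667, 1), (27, 1.000, 1),
    (28, 0.999, 1), (29, 0.994, 1), (30, 0.999, 1),
]
THRESHOLD = 0.5  # a PS above this value is classified as recurrence

tp = fp = tn = fn = 0
misclassified = []
for patient_id, score, observed in TABLE7:
    predicted = 1 if score > THRESHOLD else 0
    if predicted == 1 and observed == 1:
        tp += 1
    elif predicted == 1 and observed == 0:
        fp += 1
    elif predicted == 0 and observed == 0:
        tn += 1
    else:
        fn += 1
    if predicted != observed:
        misclassified.append(patient_id)

sensitivity = tp / (tp + fn)   # recurrent patients correctly identified
specificity = tn / (tn + fp)   # non-recurrent patients correctly identified
pv_pos = tp / (tp + fp)        # positive predictive value (PV+)
pv_neg = tn / (tn + fn)        # negative predictive value (PV-)
accuracy = (tp + tn) / len(TABLE7)

print(f"misclassified IDs: {misclassified}")
print(f"PV+ {pv_pos:.2%}, PV- {pv_neg:.2%}, "
      f"sensitivity {sensitivity:.2%}, specificity {specificity:.2%}, "
      f"overall {accuracy:.2%}")
```

The overall rate obtained this way, 25 of 27, is an apparent (resubstitution) rate; the sketch does not refit the model and is therefore not itself a leave-one-out validation.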
The correct classification rate was validated with the
leave-one-out validation (LOOV) method and was found to be
92.59%. The ROC curve indicated that the
sensitivity and specificity at any specific threshold and
overall correct classification rate were excellent for making
decisions with the suggested model.
For further assessment of the suggested model, a score
card was created according to the PS percentile as shown in
Fig. 2. It displays the cumulative frequency of recurrence
and non-recurrence in each class ranked at every tenth
percentile of PS obtained from the suggested model. This
figure shows that the probability of a patient’s recurrence
was well ordered according to the class rank with the PS
percentile.
The K–S statistic measures how well the class rank is
assigned where a high rank class receives a high frequency of
recurrence and a low frequency of non-recurrence, and vice
versa. A K–S statistic of over 50 indicates that the model
successfully classified subjects based on the PS obtained by
the model, that is, the classes were well ordered (Mayes 2003).
The suggested model had a K–S statistic of 85, implying
that this model provided an excellent grading system. There
were no patients with recurrence (non-recurrence) for the
four classes at the bottom rank (the top rank), indicating that
the chance of misclassification on recurrence (non-recur-
rence) was very low for patients classified as high (low) rank.
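As a small worked example of how the K–S statistic is obtained, the sketch below takes the cumulative percentages of non-recurrence (NR) and recurrence (R) per score class as read from Fig. 2 and returns the largest difference. The entries for score classes 20 to 50 are partly garbled in the available copy of the figure and are reconstructed here from the relation k-s = NR - R, so they should be treated as assumptions.

```python
# Cumulative percentages per score class, as read from Fig. 2.
score_class = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
cum_nr = [21, 36, 57, 71, 86, 100, 100, 100, 100, 100]  # non-recurrence (%)
cum_r = [0, 0, 0, 0, 8, 15, 31, 54, 85, 100]            # recurrence (%)

# k-s value of each class: cumulative NR minus cumulative R.
ks_by_class = [nr - r for nr, r in zip(cum_nr, cum_r)]

# The K-S statistic is the maximum k-s value over all classes.
ks = max(ks_by_class)
best_class = score_class[ks_by_class.index(ks)]

print(f"K-S = {ks} at score class {best_class}")
```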
As shown in Fig. 2, the frequency of recurrence
increases and the frequency of non-recurrence decreases as
the class ranks go to a higher level. This trend demonstrates
that the PS obtained by the suggested model is excellent for
classifying patients according to the probability of a
patient’s recurrence and will provide a good method to
assess and predict the possibility of a recurrence in cancer
patients. Clinical data are shown in Table 10.
Table 8 SGAS loadings, frequency of gains and losses, and mean copy number changes
GenBank Accession ID   Gene name   SGAS loading (a)   Frequency (gain, loss) (b)   Mean (c)
AI669693*              CD207        0.834             (0,9)                        -0.291
H62387*                ISLR         0.825             (0,9)                        -0.227
AI681562*              MTERFD2      0.803             (0,8)                        -0.188
AA987337*              RG9MTD2      0.756             (0,8)                        -0.188
AI312990*              SP140        0.751             (0,5)                        -0.140
H09936*                NHLH1        0.724             (0,6)                        -0.197
AI207203*              R3HDM1       0.668             (0,10)                       -0.237
AI262957*              GTF2H2       0.553             (0,7)                        -0.203
AA278852*              COPS2        0.487             (0,9)                        -0.234
AA063580               TAF7         0.449             (0,8)                        -0.188
R18845                 ZNF559       0.446             (0,9)                        -0.234
AA995875               AQP2         0.445             (0,8)                        -0.218
AI359676*                           0.405             (0,7)                        -0.234
H05445                 GAP43        0.395             (0,6)                        -0.234
AI336859               ASXL2        0.358             (0,9)                        -0.282
T57875                 PRKCI       -0.305             (8,0)                         0.240
AA872372               .           -0.370             (8,0)                         0.260
AA931565               MRE11A      -0.406             (7,0)                         0.221
AI273225               LACTB       -0.436             (9,0)                         0.231
AI369284               GTSE1       -0.447             (9,0)                         0.238
N71462                 SCML2       -0.467             (8,0)                         0.271
AI611010*              RPS10       -0.483             (8,0)                         0.248
AI302710*              UBIAD1      -0.489             (8,0)                         0.229
AA281426               SLC11A1     -0.492             (13,0)                        0.285
AW009403*              ARID3A      -0.521             (0,6)                        -0.200
AA464250*              KRT19       -0.532             (8,0)                         0.170
AI299228               .           -0.572             (10,0)                        0.268
AI025116*                          -0.594             (6,0)                         0.265
For instance, GenBank Accession ID AI669693 has a correlation of 0.834 with SGAS, shows a total of nine losses and no gains across the 30 patients (arrays), and has a mean copy number change of -0.291 over the 30 arrays
The fifteen genes that distinguish the SGAS characteristic from the other seven common characteristics are marked with an asterisk (*)
(a) SGAS loading, which is the corresponding gene’s correlation with SGAS
(b) Frequency of alteration (gain and loss, respectively) of the corresponding gene among all patients (arrays) showing recurrence or non-recurrence
(c) Average change in the copy number of the gene for the 30 patients (arrays)
Table 9 Class rank based on SGAS percentile, expected SGAS percentile and predicted probability for recurrence when adjusting LN* with its average value
Class rank (percentile)   Expected SGAS percentile   SGAS percentile obtained from data   Predicted score for recurrence (a)
10                        -1.282                     -1.392                               0.970249
20                        -0.842                     -1.104                               0.925334
30                        -0.524                     -0.424                               0.860308
40                        -0.253                     -0.177                               0.772401
50                         0                          0.259                               0.660511
60                         0.253                      0.499                               0.52728
70                         0.524                      0.655                               0.380671
80                         0.842                      0.989                               0.233978
90                         1.282                      1.194                               0.104001
100                        3+                         1.288                               0.002648
(a) Obtained with the expected SGAS percentile after adjusting the significant clinical risk factor LN* with its average value, 1.186, where LN*
is the initial LN multiplied by 10 and its average was taken for the 27 patients
Fig. 1 Classification of recurrence status with PS. ID 5
and ID 8 were misclassified. Patients ID 15*, ID 24* and
ID 26* were correctly classified with the model suggested
in this study but were misclassified when SGAS was not
used
Discussion
The aim of this study was to develop an SGAS based on
changes in the copy number of genes related to a patient’s
recurrence status. A representative set of genes was
determined by looking for a high frequency of alteration in
only one of the two groups (recurrence and non-recurrence),
after controlling for variation in the data across arrays so
that genes with small variation were selected and high
reproducibility could be achieved.
It was found that SGAS was the significant risk factor on
the recurrence of gastric cancer and the factor explaining
Score class (a)   Cumulative NR (%) (b)   Cumulative R (%) (c)   k-s (%) (d)
10                 21                       0                     21
20                 36                       0                     36
30                 57                       0                     57
40                 71                       0                     71
50                 86                       8                     78
60                100                      15                     85
70                100                      31                     69
80                100                      54                     46
90                100                      85                     15
100               100                     100                      0
K-S 85
Fig. 2 (a) Score class ranked according to the percentile
of PS. (b) Cumulative percentage of non-recurrence in each
score class. (c) Cumulative percentage of recurrence in each
score class. (d) ‘Cumulative NR (%)’ minus ‘Cumulative
R (%)’; K–S is the maximum of the k–s values. In the figure,
the x- and y-axes represent the score class and the
cumulative percentages of NR and R, respectively
Table 10 Reference table on clinical data from the Cancer Metastasis Research Center at Yonsei University
Patient number (a)   Age (year)   Sex   Lymph node metastasis (LN)   Status on recurrence   Disease-free time (month) (b)   Size
1                    65           M     0/17                         0                      65                              10 × 8
2                    70           M     0/12                         0                      48                              4 × 3
3                    65           M     1/42                         0                      60                              5 × 3
4                    77           M     0/40                         0                      61                              3 × 3
5                    62           M     0/39                         0                      60                              4 × 4
6                    77           M     0/76                         1                      20                              9 × 9
7                    62           M     0/52                         0                      49                              11 × 6.5
8                    60           M     0/66                         0                      42                              6 × 6
9                    70           M     0/37                         0                      46                              3 × 3
10                   60           F     0/30                         0                      –                               6 × 5
11                   77           F     0/44                         0                      –                               11 × 6
12                   56           M     2/57                         0                      60                              7 × 7
13                   66           M     2/46                         1                      12                              6 × 5
14                   52           M     0/32                         0                      60                              7 × 7
15                   67           M     2/60                         0                      49                              4 × 3
16                   78           M     13/38                        1                      7                               7 × 4
17                   76           F     9/22                         1                      –                               4 × 3.5
18                   76           M     8/50                         1                      58                              10 × 8
19                   55           M     15/58                        1                      12                              14 × 9
20                   71           M     9/53                         1                      18                              5.5 × 4.5
21                   48           M     6/53                         0                      58                              8 × 3
22                   55           M     29/56                        1                      28                              9 × 9
23                   59           M     2/43                         1                      10                              5 × 3.5
24                   59           M     21/58                        1                      31                              6 × 4
25                   54           M     20/63                        1                      8                               5 × 3
26                   41           M     21/61                        1                      31                              9 × 9
27                   74           M     18/56                        1                      15                              9 × 9
Patients were followed up for 65 months
–, missing; 1, recurrence; 0, non-recurrence
(a) Recurrence status was available for 27 of the 30 patients at follow-up
(b) Cancer-free time in months
the most variability (31%) among the representative genes.
From SGAS, it was possible to identify genes that were
strongly related to the risk factor on recurrence of gastric
cancer. SGAS is a weighted sum of the changes in the copy
number of genes. It reflects a patient’s genetic alteration
level better than a simple sum of the genes’ alteration
levels, in the sense that weights are assigned to genes
according to how much each accounts for the score variable.
The weight is the coefficient of the gene in the linear
combination that defines SGAS, whereas the SGAS loading of
each gene is the correlation between SGAS and that gene.
SGAS was found to be associated with the recurrence
status. By incorporating SGAS in the logistic model in
addition to the significant clinical risk factor, it was pos-
sible to obtain a patient’s PS on recurrence. Three patients
who were initially misclassified by the model, when only
the significant clinical risk factor was considered, were
correctly classified when the suggested model incorporated
SGAS. The correct classification rate, validated by the
LOOV method, was 92.59%.
The fact that SGAS was constructed as a variable that
followed a standard normal distribution made it possible to
estimate the percentile of the patient’s SGAS score and
thus it was possible to estimate the probability of recur-
rence for the patient.
For instance, patients ID 15 and ID 26 had SGAS scores
of -1.392 and -1.048 (Table 7), which were approximately
at the 8.23rd and 14.7th percentiles of SGAS, respectively.
They therefore had a high risk of recurrence (see the
figure in Table 9) and did in fact have a recurrence.
However, using only the significant clinical risk factor,
these patients were not correctly classified, since their
clinical risk factor was favorable: a small LN of 0.0435 and
0.0465 (LN* of 0.435 and 0.465), respectively.
Furthermore, patient ID 24 had an SGAS score of 1.288,
which was approximately at the 90.2nd percentile of SGAS,
and thus had a low risk of recurrence (Table 6 describes
SGAS as being negatively related to the risk of recurrence,
and the figure in Table 9 shows PS versus SGAS); this
patient was in fact non-recurrent. However, using only the
significant clinical risk factor, this patient was
incorrectly classified, since the patient had a relatively
large LN of 0.1132 (LN* of 1.132; the average of the 30
patients’ LN* was 1.186).
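The percentile of an individual SGAS score can be illustrated with a short sketch, assuming, as stated above, that SGAS follows a standard normal distribution so that the percentile is 100 times the standard normal cumulative distribution function. The use of Python’s `statistics.NormalDist` is an illustrative choice and not necessarily the software used in this study.

```python
from statistics import NormalDist

# SGAS scores of the three patients discussed above (from Table 7).
sgas_scores = {15: -1.392, 26: -1.048, 24: 1.288}

# SGAS is assumed to follow a standard normal distribution.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Percentile = 100 * P(Z <= SGAS) under the standard normal distribution.
percentiles = {pid: 100 * std_normal.cdf(s) for pid, s in sgas_scores.items()}

for pid, pct in percentiles.items():
    print(f"patient ID {pid}: SGAS = {sgas_scores[pid]:6.3f}, "
          f"percentile = {pct:5.2f}")
```

Computed this way, the percentiles agree with the approximate values quoted in the text to within roughly 0.1 of a percentile point.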
Acknowledgments This work was supported by the Korean
Research Foundation Grant funded by the Korean Government
(MOEHRD) (R03-2004-000-10048-0). The authors would like to
express gratitude to Cancer Metastasis Research Center at Yonsei
University for the permission to use their data, and also thank
Jin-Hyung Kim for his assistance in this study.
Conflict of interest statement We declare that we have no conflict
of interest.
References
Agresti A (2002) Categorical data analysis. Wiley, New Jersey
Dowdy S, Wearden S (1991) Statistics for research. Wiley, New
Jersey
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann
Stat 7(1):1–26. doi:10.1214/aos/1176344552
Inoue H, Matsuyama A, Mimori K, Ueo H, Mori M (2002) Prognostic
score of gastric cancer determined by cDNA microarray. Clin
Cancer Res 8:3475–3479
Kadota K, Miki R, Bono H, Shimizu K, Okazaki Y, Hayashizaki Y
(2001) Preprocessing implementation for microarray (PRIM): an
efficient method for processing cDNA microarray data. Physiol
Genomics 4:183–188
Kim M (2009) Reproducible gene selection algorithm with random
effect model in cDNA microarray-based CGH data. Expert Syst
Appl. doi:10.1016/j.eswa.2009.03.034
Lee E (1992) Statistical methods for survival data analysis. Wiley,
New Jersey
Liebermeister W (2002) Linear modes of gene expression determined
by independent component analysis. Bioinformatics 18:51–60.
doi:10.1093/bioinformatics/18.1.51
Liu KH, Huang DS (2008) Cancer classification using Rotation
Forest. Comput Biol Med 38:601–610.
doi:10.1016/j.compbiomed.2008.02.007
Mayes E (2003) Handbook of credit scoring. Glenlake Publishing
Company, Chicago
Park PJ, Tian L, Kohane IS (2002) Linking gene expression data with
patient survival times using partial least squares. Bioinformatics
18:120–127
Park CH, Jeong HJ, Choi YH, Kim SC, Jeong HC, Park KH, Lee GE,
Kim TS, Yang SW, Ahn SW, Kim YS, Rha SY, Chung HC
(2006) Systematic analysis of cDNA microarray-based CGH. Int
J Mol Med 17:261–267
Pisani P, Parkin DM, Bray F, Ferlay J (1999) Estimates of the
worldwide mortality from 25 cancers in 1990. Int J Cancer
83:18–29.
doi:10.1002/(SICI)1097-0215(19990924)83:1<18::AID-IJC5>3.0.CO;2-M
Rosner B (2000) Fundamentals of biostatistics. Duxbury Publisher,
Pacific Grove
SAS Institute Inc (2000) SAS System for Windows V.9.1
Squire JA, Pei J, Marrano P, Beheshti B, Bayani J, Lim G, Moldovan
L, Zielenska M (2003) High-resolution mapping of amplifica-
tions and deletions in pediatric osteosarcoma by use of CGH
analysis of cDNA microarrays. Genes Chromosomes Cancer
38:215–225. doi:10.1002/gcc.10273
Yang YH, Dudoit S, Luu P, Lin D, Peng V, Ngai J, Speed T (2002)
Normalization for cDNA microarray data: a robust composite
method addressing single and multiple slide systematic variation.
Nucleic Acids Res 30:e15. doi:10.1093/nar/30.4.e15
Yang SH, Seo MY, Jeong HA, Jeung HC, Shin J, Kim SC, Noh SH,
Chung HC, Rha SY (2005) Gene copy number change events at
chromosome 20 and their association with recurrence in gastric
cancer patients. Clin Cancer Res 11:612–620