An Epicurean learning approach to gene-expression data classification
Andreas Albrechta,*, Staal A. Vinterbob, Lucila Ohno-Machadob,c
a Computer Science Department, University of Hertfordshire, Hatfield, Herts AL10 9AB, UK
b Decision Systems Group, Harvard Medical School, 75 Francis Street, Boston, MA, USA
c Division of Health Sciences and Technology, MIT, Cambridge, MA, USA
Received 21 May 2002; received in revised form 6 November 2002; accepted 10 December 2002
Abstract
We investigate the use of perceptrons for classification of microarray data where we use two
datasets that were published in [Nat. Med. 7 (6) (2001) 673] and [Science 286 (1999) 531]. The
classification problem studied by Khan et al. is related to the diagnosis of small round blue cell
tumours (SRBCT) of childhood which are difficult to classify both clinically and via routine
histology. Golub et al. study acute myeloid leukemia (AML) and acute lymphoblastic leukemia
(ALL). We used a simulated annealing-based method in learning a system of perceptrons, each
obtained by resampling of the training set. Our results are comparable to those of Khan et al. and
Golub et al., indicating that there is a role for perceptrons in the classification of tumours based on
gene-expression data. We also show that it is critical to perform feature selection in this type of model, and we propose a method for identifying genes that might be significant for the particular tumour types. For SRBCTs, zero error on test data has been obtained using only 13 out of 2308 genes; for the ALL/AML problem, we obtain zero error using 9 out of 7129 genes in the classification procedure. Furthermore, we provide evidence that Epicurean-style learning and simulated annealing-based search are both essential for obtaining the best classification results.
© 2003 Elsevier Science B.V. All rights reserved.

Keywords: Perceptrons; Simulated annealing; Gene-expression analysis

* Corresponding author. Tel.: +44-1707-284247; fax: +44-1707-284303.
E-mail addresses: [email protected] (A. Albrecht), [email protected] (S.A. Vinterbo), [email protected] (L. Ohno-Machado).
0933-3657/03/$ – see front matter © 2003 Elsevier Science B.V. All rights reserved.
doi:10.1016/S0933-3657(03)00036-8

1. Introduction

Measuring gene-expression levels is important for understanding the genetic basis of diseases. The simultaneous measurement of gene-expression levels for thousands of genes is now possible due to microarray technology [17,18]. Data derived from microarrays are difficult to analyze without the help of computers, as keeping track of thousands of measurements and their relationships is overwhelmingly complicated.
Several authors have utilized unsupervised learning algorithms to cluster gene-expression
data [6]. In those applications, the goal is to find genes that have correlated patterns of
expression, in order to facilitate the discovery of regulatory networks. Recent publications
have begun to deal with supervised classification for gene expression derived from microarrays [7,11]. The goal in these applications is usually to classify cases into diagnostic or
prognostic categories. Additionally, researchers try to determine which genes are most
significantly related to the category of interest. Since the number of measurements is very
large compared to the number of arrays, there is tremendous potential for overfitting in
models that do not utilize a pre-processing step for feature selection.
The feature selection process itself is of interest, as it helps to determine the relative
importance of a gene in the classification. Approaches for feature selection in the context of
gene-expression analysis are currently being investigated [11]. Developing a strategy for
selecting genes that are important in a classification model, regardless of their absolute
expression levels, is important in this context.
In this paper, we propose an algorithm for learning perceptrons based on simulated
annealing and we show that it can be successfully applied to the analysis of gene-
expression data. Besides the combination of simulated annealing and perceptrons, another
key feature of our approach is training the perceptrons [16,19] on randomly selected
subsets of the entire sample set. In statistical inference, the method of drawing many
samples from some population or constructing many rearrangements of sample values is
called resampling [10]. For each sample or rearrangement, test statistics are computed, and
the resulting set of test statistics constitutes the sampling distribution. Since we are dealing
with a learning and classification task where a large number of hypotheses is calculated
from randomly selected (small) subsets of samples, we use the concept of Epicurean
learning. To our knowledge, Epicurean learning was first mentioned by Cleary et al. [5],
motivated by Epicurus’ paradigm that all hypotheses fitting the known data about an object
should be retained [8].
In this paper, we analyse two microarray datasets published by Khan et al. [14] and Golub et al. [9]. On the test data provided with both datasets, we obtain zero classification error using only a very small number of genes. Moreover, our computational experiments
show that Epicurean learning as well as the simulated annealing-based search are both
essential for obtaining the best classification results.
2. Methods
Let $D \subseteq \mathbb{Q}^n$ be our input data table, where each column corresponds to expression measurements for a particular gene over the tissues investigated. Further, let $c: \mathbb{Q}^n \to \{1, 2, \ldots, m\}$ be a partial function that for $D$ returns the tumour class associated with each row.
We would like to find a realization of a function $F: \mathbb{Q}^n \to 2^{\{1, 2, \ldots, m\}}$ that represents an extension of $c$ that we can use to classify new, unseen expression measurement vectors.
We do this as follows. For each class $i \in \{1, 2, \ldots, m\}$, we construct a classifier $F_i: \mathbb{Q}^n \to [0, 1]$. These $F_i$ are then combined to form $F$ as

$$ F(x) = \{\, j \mid F_j(x) = \max_i F_i(x) \,\}. \qquad (1) $$

The number $|F(x)|$ gives us an indication of how uniquely we were able to classify $x$. We choose to discard the classification if $|F(x)| > 1$.
We now turn to the construction of the functions $F_i$. A perceptron $p$ is a function $p: \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R} \to \{0, 1\}$ such that

$$ p(x; w, \vartheta) = p_{w,\vartheta}(x) = t_{\vartheta}(w \cdot x), \qquad (2) $$

where $t_{\vartheta}$ is the unit step function at threshold $\vartheta$ defined as

$$ t_{\vartheta}(y) = \begin{cases} 0, & \text{if } y < \vartheta, \\ 1, & \text{otherwise.} \end{cases} \qquad (3) $$

Given $k_i$ perceptrons, we define $F_i$ to be

$$ F_i(x) = \frac{1}{k_i} \sum_{j=1}^{k_i} p_{w^i_j, 0}(x), \qquad (4) $$

where the $w$'s are restricted to be rational (for simpler notation, we assume $\vartheta = 0$ and always $x_n = 1$, i.e. $w_n$ represents the threshold). There are two problems that need to be solved:

1. finding the $w^i_j$'s, i.e. training the perceptrons;
2. finding $k_i$ for each class $i$.

The parameter $k_i$ is chosen empirically. We will discuss how this is done in Section 4. In the remainder of this section we address the training of the perceptrons, and pragmatics regarding reduction of the input dimensionality.
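As a concrete illustration of Eqs. (1)-(4), the following Python sketch shows how the votes of the $k_i$ perceptrons are averaged into $F_i(x)$ and how the classes attaining the maximal rating are collected into $F(x)$. The helper names are hypothetical; this is only a sketch and not the authors' implementation (which is in C++, see Section 4).

```python
import numpy as np

def perceptron_output(w, x):
    """p_{w,0}(x) from Eqs. (2)-(3) with threshold 0.
    The last component of x is assumed to be 1, so w[-1] plays the role of the threshold."""
    return 1 if np.dot(w, x) >= 0 else 0

def class_rating(weights_i, x):
    """F_i(x) from Eq. (4): fraction of the k_i perceptrons of class i that vote for x."""
    return sum(perceptron_output(w, x) for w in weights_i) / len(weights_i)

def classify(all_weights, x):
    """F(x) from Eq. (1): all_weights[i] is the list of weight vectors for class i.
    Returns the set of classes whose rating attains the maximum."""
    ratings = [class_rating(weights_i, x) for weights_i in all_weights]
    top = max(ratings)
    winners = {i for i, r in enumerate(ratings) if r == top}
    return winners  # the classification is discarded if len(winners) > 1
```

A sample is accepted as classified only when the returned set contains a single class, mirroring the rule $|F(x)| > 1 \Rightarrow$ discard.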
2.1. Perceptron training
Let $T = T^+ \cup T^-$ be a training set composed of positive and negative examples, respectively. We want to find the parameters $w$ of the perceptron $p$ that maximize the separation of the positive and negative examples in $T$.
Hoffgen et al. [13] have shown that finding a linear threshold function that minimizes the
number of misclassified vectors is NP-hard in general. Hence, we need to apply heuristics
to the problem of training a perceptron.
Simulated annealing has been proven to be a versatile tool in combinatorial optimisation
[1,15], and is our choice of optimization strategy. In [3], we have demonstrated on
randomly generated disjunctive normal forms that the classification rate improves if the
perceptron algorithm is combined with logarithmic simulated annealing. An application to
real world data has been provided in [2].
Simulated annealing requires a search space $W$ in which we want to find an element that minimizes an objective function $o: W \to \mathbb{N}$, an initial ''current location'' $w_s \in W$, a function $s: W \to W$ that in a stochastic fashion proposes a next ''current location'' that we hope will lead to the wanted optimum, a function $a: \mathbb{N} \times W \times W \to \{0, 1\}$ that accepts or rejects the proposed next current location, and finally a stopping criterion. We can illustrate the strategy by the following simple pseudo-code skeleton:

    w := w_s; k := 0
    while not stop(k, w)
        k := k + 1; w_n := s(w)
        if a_k(w, w_n) = 1
            w := w_n
The idea of the algorithm is to initially allow locally suboptimal steps and to become more restrictive in accepting such steps over time, the hope being to avoid premature convergence to local minima. In our case, we define our objective function to be the number of misclassified elements in $T$. Let

$$ M_T(w) = \{ x \in T^+ \mid p_{w,0}(x) < 1 \} \cup \{ x \in T^- \mid p_{w,0}(x) > 0 \} \qquad (5) $$

denote the set of misclassified samples; then we can define our objective function as $o(w) = |M_T(w)|$. The set $M_T(w)$ can be viewed as a neighbourhood of $w$ containing all the possible next steps we can take from $w$. As a first key feature of our heuristic, we now construct a probability mass function $q$ over $M_T(w)$ as

$$ q(x) = \frac{|x \cdot w|}{\sum_{y \in M_T(w)} |y \cdot w|}. \qquad (6) $$

Elements that are ''further away'' from being classified correctly by $p_{w,0}$ are assigned higher values by $q$, which is in line with the perceptron algorithm [16,19]. We now define

$$ s(w) = w - \omega(x) \cdot \mathrm{sample}(M_T(w), q)\,/\,\sqrt{w \cdot w}, \qquad (7) $$

where $\mathrm{sample}$ stochastically selects one element from the set $M_T(w)$ with the probability given by $q$, and $\omega(x) = 1$ for $x \in T^-$, $\omega(x) = -1$ for $x \in T^+$.
The acceptance function at step $k$ (the $k$th time through the while-loop) is defined as

$$ a(w_{k-1}, w_k) = \begin{cases} 1, & \text{if } p(w_{k-1}, w_k) > r, \\ 0, & \text{otherwise,} \end{cases} \qquad (8) $$

where

$$ p(w_{k-1}, w_k) = \begin{cases} 1, & \text{if } o(w_k) - o(w_{k-1}) \le 0, \\ \mathrm{e}^{-(o(w_k) - o(w_{k-1}))/t(k)}, & \text{otherwise,} \end{cases} \qquad (9) $$

and $r \in [0, 1]$ is uniformly randomly sampled at each step $k$. The function $t$, motivated by Hajek's theorem [12] on the convergence of inhomogeneous Markov chains for large enough constants $\Gamma$, is defined as

$$ t(k) = \frac{\Gamma}{\ln(k + 2)}, \quad k \in \{0, 1, \ldots\}, \qquad (10) $$

and represents the ''annealing'' temperature (second key feature). As $t$ decreases, the probability of accepting a $w$ that does not decrease the objective function decreases. We empirically chose $\Gamma$ in the range of $(|T^+| + |T^-|)/4$ up to $(|T^+| + |T^-|)/3$, essentially using the same method as in [2].
Finally, our stopping criterion is given by a pre-determined number of iterations through the while-loop.
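Putting Eqs. (5)-(10) together, a minimal Python sketch of the simulated annealing-based perceptron training might look as follows. The function and variable names, the starting weight vector, and the iteration budget are illustrative assumptions; this is not the authors' C++ implementation.

```python
import math
import random
import numpy as np

def misclassified(w, T_pos, T_neg):
    """M_T(w) from Eq. (5): positives with w.x < 0 and negatives with w.x >= 0,
    paired with omega(x) = -1 for x in T+ and omega(x) = +1 for x in T-."""
    return ([(x, -1.0) for x in T_pos if np.dot(w, x) < 0] +
            [(x, 1.0) for x in T_neg if np.dot(w, x) >= 0])

def train_perceptron(T_pos, T_neg, gamma, max_iter=10000, rng=random):
    """Perceptron training with the logarithmic cooling schedule t(k) = gamma / ln(k + 2)
    of Eq. (10); gamma corresponds to the constant chosen in the range
    (|T+| + |T-|)/4 ... (|T+| + |T-|)/3."""
    T_pos = [np.asarray(x, dtype=float) for x in T_pos]
    T_neg = [np.asarray(x, dtype=float) for x in T_neg]
    w = np.zeros(len(T_pos[0]))
    w[-1] = 1.0  # arbitrary non-zero start; x_n = 1 is assumed, so w[-1] acts as the threshold
    for k in range(max_iter):
        M = misclassified(w, T_pos, T_neg)
        if not M:                       # zero error on the training subset
            break
        # Eq. (6): sample a misclassified example with probability proportional to |x.w|
        scores = [abs(np.dot(x, w)) for x, _ in M]
        total = sum(scores)
        if total > 0:
            idx = rng.choices(range(len(M)), weights=scores)[0]
        else:
            idx = rng.randrange(len(M))
        x, omega = M[idx]
        # Eq. (7): proposed next weight vector
        norm = math.sqrt(float(np.dot(w, w))) or 1.0
        w_next = w - omega * x / norm
        # Eqs. (8)-(10): always accept improvements; accept uphill moves with prob. exp(-delta/t(k))
        delta = len(misclassified(w_next, T_pos, T_neg)) - len(M)
        if delta <= 0 or rng.random() < math.exp(-delta / (gamma / math.log(k + 2))):
            w = w_next
    return w
```

Each training subset drawn in Section 2.2 would be passed to such a routine to obtain one perceptron of the voting system.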
2.2. Perceptron training set sampling
In order to generate (a large number of) different hypotheses as well as to achieve zero
learning error in a short time, the following sampling scheme for training sets is applied
(third key feature):
Let $C^+_i = \{ x \in D \mid c(x) = i \}$ be the positive examples for class $i$ in $D$, and let $C^-_i = D - C^+_i$ be the negative examples for class $i$ in $D$. Further, for two parameters $a, b \in (0, 1]$, let $s^+_i = \lfloor a\,|C^+_i| \rfloor$ and $s^-_i = \lceil b\,|C^-_i| \rceil$. For each $j$ in $\{1, 2, \ldots, k_i\}$ we randomly sample $T^+_{i,j} \subseteq C^+_i$ and $T^-_{i,j} \subseteq C^-_i$ such that $|T^+_{i,j}| = s^+_i$ and $|T^-_{i,j}| = s^-_i$. The set $T_{i,j} = T^+_{i,j} \cup T^-_{i,j}$ is then the training set used to train perceptron $p_j$ in $F_i$. Thus, by the parameters $a$ and $b$ we can control the size of the randomly chosen subsets of the training data, which is an integral part of the Epicurean-style learning approach.
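A short Python sketch of this sampling scheme follows; the generator interface and the rounding of the subset sizes to integers are illustrative assumptions, not details taken from the authors' implementation.

```python
import math
import random

def sample_training_sets(D, labels, class_i, k_i, a, b, rng=random):
    """Draw k_i random training sets T_{i,j} = T+_{i,j} u T-_{i,j} for class i.
    a and b control the fractions of positive and negative examples used."""
    C_pos = [x for x, c in zip(D, labels) if c == class_i]   # C+_i
    C_neg = [x for x, c in zip(D, labels) if c != class_i]   # C-_i = D \ C+_i
    s_pos = math.floor(a * len(C_pos))
    s_neg = math.ceil(b * len(C_neg))
    for _ in range(k_i):
        T_pos = rng.sample(C_pos, s_pos)
        T_neg = rng.sample(C_neg, s_neg)
        yield T_pos, T_neg
```

Each pair drawn this way would be handed to the training routine sketched in Section 2.1 to produce one of the $k_i$ perceptrons of $F_i$.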
The boosting method (cf. [20,21]) tries to reduce the training error by assigning
higher probabilities to ‘‘difficult’’ samples in a recursive learning procedure. In our
approach, we observed that almost all subsets are learned with zero error even for
relatively large fractions of the entire sample set (about 0.5 in [2] and 0.75 in the present
study). Thus, according to the procedure described in the previous section, the method
we are using belongs to the class of voting predictions. Recent research has shown that
classification performance can be significantly improved by voting on multiple hypotheses [21]; for a more detailed discussion cf. [5]. We note that particular examples are trained multiple times, e.g. for the ALL/AML dataset, the average occurrence of an ALL sample in the randomly chosen subsets is about 220 in the trials with the best classification rate.
2.3. Dimensionality reduction
In our case, where the data D presents a massively underdetermined system, i.e. there are
many more gene expression measurements than there are tissue samples, experiments have
shown that reducing the dimensionality of the data is beneficial [14]. The scheme we applied is based on selecting the genes that receive the coefficients with the largest absolute values in the perceptrons after training has been completed on all dimensions of the data (fourth key feature). Let $g(w, q)$ be the set of $q$ positions that produce the $q$ biggest values $|w_l|$ in $w = (w_1, w_2, \ldots, w_n)$ (ties are ordered arbitrarily). Let $G_i = \bigcap_{j=1}^{p} g(w^i_j, q)$ be the set of dimensions selected for class $i$, i.e. here we have $k_i = p$ for all $i = 1, \ldots, m$. Each $G_i$ is truncated to $k = \min_{i \in \{1, \ldots, m\}} |G_i|$ positions with the largest associated values (called priority genes). Training each $F_i$ is then repeated with the data $D$ projected onto the dimensions in $G_i$ for $k_i = K$ perceptrons, $i = 1, \ldots, m$. The importance of weight size in learning procedures has been emphasised by Bartlett [4].
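For illustration, the priority-gene selection can be sketched in Python as follows. The exact tie-breaking and the ranking used to truncate $G_i$ to $k$ positions are not spelled out above, so ranking by the summed absolute weights across the $p$ perceptrons is an assumption of this sketch.

```python
import numpy as np

def top_positions(w, q):
    """g(w, q): indices of the q largest absolute weights (ties broken arbitrarily)."""
    return set(np.argsort(-np.abs(np.asarray(w)))[:q])

def priority_genes(weight_vectors, q, k):
    """Intersect g(w_j, q) over the p trained perceptrons of one class and keep
    the k positions with the largest summed absolute weight (one possible reading)."""
    G = set.intersection(*(top_positions(w, q) for w in weight_vectors))
    ranked = sorted(G, key=lambda l: -sum(abs(w[l]) for w in weight_vectors))
    return ranked[:k]
```

Training is then repeated with the data projected onto the returned positions, now with $K$ perceptrons per class.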
3. Datasets
Improvements in cancer classification have been central to advances in cancer treatment
[9]. Usually, cancer classification has been based primarily on the morphological appearance of the tumour, which has serious limitations. Tumours with similar appearance can follow
significantly different clinical courses and show different responses to therapy. In a few cases,
such clinical heterogeneity has been explained by dividing morphologically similar tumours
into subtypes. Key examples include the subdivision of small round blue cell tumours and
acute leukemias (SRBCTs). Both tumour classes are considered in the present section.
3.1. SRBCT data
For the first series of computational experiments, the data used in this paper are provided
by Khan et al. [14]. The data are gene-expression measurements from cDNA microarrays containing 2308 genes for four types of SRBCTs of childhood, which include neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt lymphoma (BL) and the Ewing family of tumours (EWS), i.e. here we have m = 4. The number of training samples is as follows: 23 for EWS,
8 for BL, 12 for NB, and 20 for RMS. The test set consists of 25 samples: 6 for EWS, 3 for
BL, 6 for NB, 5 for RMS, and 5 ‘‘others’’. The split of the data into training and test sets was
the same as in the paper by Khan et al., where it has been shown that a system of artificial
neural networks can utilize gene-expression measurements from microarrays and classify
these tumours into four different categories. In [14], 3750 ANNs are calculated to obtain 96 genes for training the final ANN, which is able to correctly classify the 20 + 5 test samples.
3.2. AML/ALL data
The data are taken from Golub et al. [9]. The training set consists of 7129 gene-expression values for each of 11 samples of acute myeloid leukemia (AML) and 27 samples of acute lymphoblastic leukemia (ALL) (i.e. m = 2 in this case). For the test, 14 AML samples and 20 ALL samples are used (again, each of them is represented by 7129 gene-expression values). Current clinical practice involves an experienced specialist's interpretation of the tumour's morphology, histochemistry, immunophenotyping, and cytogenetic analysis, each performed in a separate, highly specialised laboratory. Golub et al. analysed various aspects of cancer classification in [9]. In particular, by using the model of self-organising maps in combination with a weighted voting scheme, Golub et al. obtained a
strong prediction for 18 out of the 20 ALL test samples and 10 out of the 14 AML samples
(i.e. a total of six misclassifications).
Furey et al. [7] apply Support Vector Machines (SVMs) [22] to the AML/ALL data.
Significant genes (feature selection) are derived from a score calculated from the mean and
standard deviation of each gene type. Tests are performed for 25, 250, 500, and 100 top-ranked genes, and each of the five test examples left unpredicted by Golub's method is misclassified in at least one SVM test. Two test examples are misclassified in all SVM tests.
Guyon et al. [11] also apply SVMs, but with a different feature selection method called
recursive feature elimination. At each iteration, the weights associated with genes are
recalculated by an SVM test and genes with the smallest weights are removed. On 8 and
16 genes, the classification error on the test set is zero, with minimum margins of 0.05 and 0.03 and medium margins of 0.49 and 0.38, respectively.
4. Results
The algorithm described in Section 2 has been implemented in C++, and we performed computational experiments for the datasets from Section 3 on a SUN Ultra 5/333 workstation with 128 MB RAM.
For both sets of microarray data, we performed the following normalisation (including the test data). For each gene type $i$ (i.e. input position), we calculated the mean $m_i$ and the standard deviation $s_i$ from the training data only. Then, for all data $d_{ji}$ ($j$ covers training as well as test data), the normalised values $(d_{ji} - m_i)/s_i$ were taken for the training and test procedures. The same normalisation is used in [7,9,11]. For the dataset from [14] we discuss the impact of normalisation.
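This normalisation is ordinary z-scoring with statistics estimated on the training rows only and applied to both training and test rows. A minimal NumPy sketch is shown below; the guard against a zero standard deviation is an added assumption, not a detail from the paper.

```python
import numpy as np

def normalise(train, test):
    """Per-gene normalisation (d_ji - m_i) / s_i with m_i, s_i from the training data only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    m = train.mean(axis=0)
    s = train.std(axis=0)
    s[s == 0] = 1.0   # avoid division by zero for constant genes (assumption)
    return (train - m) / s, (test - m) / s
```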
4.1. SRBCT data
In Table 1, computations include all 2308 input variables (gene-expression data). The parameter settings are a = 0.75 and b = 0.25 for balanced but still sufficiently large numbers of positive and negative examples. The entries in the table are the errors on the classes in the order [EWS, BL, NB, RMS]. We recall that the number $K$ of threshold functions (perceptrons) is the same for all four classes, i.e. $k_i = K$, $i = 1, \ldots, 4$, and the computational experiments have been performed on normalised data. The values are typical results from three to five runs for each parameter setting; the deviation is very small and therefore average values are not calculated. The errors are calculated for the 20 test samples with confirmed classification only.
Table 1
Results on SRBCT test data if training is executed on all 2308 input data

K                    11            33            99            297           891
Error distribution   [1, 1, 5, 0]  [1, 1, 5, 0]  [0, 1, 5, 0]  [0, 1, 4, 0]  [0, 1, 4, 0]
Total errors (%)     35            35            30            25            25

In Table 2, the training procedure has been executed on priority genes only, i.e. on $k$ genes that are determined by the intersection of $p$ sets of significant input positions (derived from $p$ perceptrons), each consisting of the $q$ most significant weights. The value of $p$ has been chosen in such a way that $k \le |\bigcap_{j=1}^{p} g(w^i_j, q)|$ holds, cf. Section 2.3.

Table 2
Results on SRBCT test data if training is executed on priority genes only

k (K = 595)          7             9             11            13            15
Error distribution   [1, 0, 1, 0]  [1, 0, 0, 0]  [1, 0, 0, 0]  [0, 0, 0, 0]  [2, 0, 0, 0]
Total errors (%)     10            5             5             0             10
The parameter settings are p = 5 and q = 1000 (q is large enough to have non-empty intersections), and the numbers of priority genes are k = 7, 9, 11, 13 and 15. We observed stable results for K ≈ 600 and therefore K = 595 has been chosen (K depends on p and the voting at depth 2, which is performed on seven inputs in this case). We obtained the best series of results (for the range of k = 7, ..., 15) for relatively small values of a and b. The number of randomly chosen training samples is 4 for negative as well as positive examples, i.e. we have a ∈ {0.17, ..., 0.5} and b ∈ {0.073, ..., 0.1}. Thus, we obtain zero classification error using only 13 genes, whereas 96 genes are used in [14] for final classification (which are derived from 3750 ANNs; the ANNs are trained and evaluated on different selections out of the sample set). The run-time ranges from 138 min (for k = 15) up to 215 min (for k = 7).
The average margin between the top and the second largest rating is about 0.16, while the minimum margin is about 0.01. The absolute rating of the five ''other'' examples is below 0.38. These values are stable in different runs for parameter settings with top classification results (e.g. k = 13, |T+| = |T-| = 4, and K = 595). The effect of small margins is due to the normalisation procedure of the gene-expression data, which was used for comparison purposes. Without normalisation, the top absolute rating of each example is, in general, at least twice as large as the second largest one (for most cases much higher). Without normalisation, the rating of the five ''other'' examples is below 0.25. The rating for test example no. 20 is only marginally better than the second best rating. On this example, Khan et al. reported a vote of 0.40 [14], and our ratings are in the same range between 0.41 and 0.45.
Since we are using $|w_l|$ for the selection of priority genes, it is interesting to know whether the largest absolute values of weights correspond to the most significant average values of gene-expression data. Table 3 provides the rank of the average value for the priority genes calculated for K = 595 and the EWS cancer type. In this run, we have k = 11 (cf. Table 2). We recall that q = 1000; for the largest average values of gene-expression data this means the range of ranks 1309, ..., 2308.
In Table 3, two of the genes do not belong to this range; thus, there seems to be no direct
correspondence between the rank of the average value of gene-expression data and the
absolute value of weights in classifying threshold functions from the first layer of the
classification circuit.
Table 3
Ranking of priority genes for SRBCT data

Gene number   488    530    654    738    891
Rank number   1592   2152   1994   2141   1271

Gene number   915    1021   1441   1912   1950
Rank number   2004   2158   1448   1060   1744

Gene number   1979   2188   2302
Rank number   2112   1019   1764
For k = 13 and K = 595, we investigated the impact of the random choice of subsets from the entire sample set. We performed experiments for L = |T+| = |T-| = 4, ..., L = |T+| = |T-| = 12, where for L > 8 and the class BL the setting |T+| = |T-| = 8 was chosen.
Table 4 shows typical results from our runs and clearly demonstrates the effect of Epicurean learning: the classification results become worse if the number of potential training subsets becomes smaller.
4.2. AML/ALL data
In this case, we have a binary classification problem. We recall that the computations are
executed on normalised data.
In Table 5, the training has been performed on all 7129 input variables (gene-expression
data). The entries in the table are the errors in the order [ALL, AML]. The values are typical
results from several runs for each parameter setting.
The classification improves on the results published in [9]. The run-time is between 1 min and 60 min. The question is whether the same or even better results are possible with significantly smaller numbers of gene inputs.
The priority genes are computed for b = 1.0 (due to the small number of samples) with the same number of training samples from the ALL data, which implies a = 0.41 (i.e. we have |T+| = |T-| = 11). The parameter q was set to 2500 and p = 3 was chosen, cf. Section 2.3. From the intersection of the p subsets, we took a fixed number of k genes corresponding to the k largest weights. In Table 6, some typical results are presented.
The run-time ranges from 92 min (for k = 17) up to 1578 min (for k = 7). The classification results are stable, in particular for k = 7 and 9. The average margin is 0.23 and the minimum margin is 0.035 for k = 9.
In Table 7, we provide the rank of the average value for the priority genes calculated for K = 315 and k = 7. As can be seen, none of these genes belongs to the range of the q = 2500 genes with the largest average absolute value, i.e. for the AML/ALL data the largest weights do not correlate with the largest average absolute values at all.

Table 4
The impact of Epicurean-style learning on classification of SRBCT data

L (k = 11)           4             6             8             10            12
Error distribution   [0, 0, 0, 0]  [1, 0, 0, 0]  [2, 0, 0, 0]  [2, 0, 0, 0]  [2, 0, 1, 0]
Total errors (%)     0             5             10            10            15

Table 5
Results on AML/ALL test data if training is executed on all 7129 input data

K                    33      99      297     891     2673
Error distribution   [0, 3]  [1, 2]  [2, 2]  [2, 2]  [2, 1]
Total errors (%)     8.8     8.8     11.8    11.8    8.8
As for SRBCT data, we analysed the impact of an increasing size of randomly chosen subsets of the entire sample set. For k = 9 and K = 315, we performed computational experiments for b = 1.0 and a = 0.41, ..., 1.0 on ALL data. Table 8 shows that the classification becomes worse as the number of potential training subsets decreases.
Despite the small number of samples we are using for the training procedure (|T+| = |T-| = 11), we observed slightly worse classification results when the simulated annealing-based search is removed from our classification algorithm. We performed several runs for the same parameter settings as in Table 6 without logarithmic simulated annealing.
We obtained a significant speed-up for the run-time: 75 min for k = 17 and 965 min for k = 7. However, in general, the classification rate becomes worse, as shown in Table 9.
Table 6
Results on AML/ALL test data if training is executed on priority genes only

k (K = 315)          7       9       11      13      15      17
Error distribution   [0, 0]  [0, 0]  [0, 1]  [0, 1]  [0, 1]  [2, 1]
Total errors (%)     0.0     0.0     2.9     2.9     2.9     8.8

Table 7
Ranking of priority genes for ALL/AML data

Gene number   918    1760   2251   3025   3391   4408   6276
Rank number   1759   2872   3734   2415   3443   1830   975

Table 8
The impact of Epicurean-style learning on classification of AML/ALL data

a (k = 9)            0.41    0.50    0.75    1.0
Error distribution   [0, 0]  [0, 0]  [0, 2]  [1, 2]
Total errors (%)     0.0     0.0     5.9     8.8

Table 9
The impact of simulated annealing-based search on classification results

k (K = 315)          7       9       11      13      15      17
Error distribution   [0, 1]  [0, 2]  [0, 2]  [0, 1]  [1, 1]  [2, 1]
Total errors (%)     2.9     5.9     5.9     2.9     5.9     8.8
5. Discussion
In this paper, we presented an algorithm that trains a system of perceptrons by utilising a simulated annealing-based local search procedure. The model was able to successfully classify previously unseen cases of SRBCT and AML/ALL data using only a small number of genes.
Among the key features of the algorithm are the specific cooling schedule of simulated
annealing, the calculation of perceptrons on randomly chosen subsets of training samples
(Epicurean learning), and the method to calculate priority genes (feature selection).
To obtain the best classification results, basically four parameters have to be adjusted: (1) the choice of $\Gamma$ in logarithmic simulated annealing; (2) the total number $K$ of perceptrons in the threshold circuit; (3) the size $(|T^+| + |T^-|)$ of the randomly chosen subsets of training samples; and (4) the number $k$ of priority genes.
For the first two parameters, the computational experiments on gene-expression data from the present paper confirm the range of parameter settings from previous applications of our approach to CT image classification [2]. If $\Gamma$ is in the range of $(|T^+| + |T^-|)/4, \ldots, (|T^+| + |T^-|)/3$ and $K > 600$ (much smaller for binary problems like AML/ALL and image classification as in [2]), the classification rate depends mainly on $(|T^+| + |T^-|)$ and $k$. Thus, compared to other learning-based classification methods, the application of our approach to different problems turns out to be relatively straightforward.
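Purely as an illustration of how few quantities need to be fixed, the four parameters can be collected in a small configuration object; the default values below are taken from the SRBCT runs reported in Section 4 and are not a general recommendation.

```python
from dataclasses import dataclass

@dataclass
class Settings:
    """The four parameters that have to be adjusted (see the list above)."""
    gamma_fraction: float = 1 / 3   # Gamma = gamma_fraction * (|T+| + |T-|), range 1/4 ... 1/3
    K: int = 595                    # perceptrons per class (595 for SRBCT, 315 for AML/ALL)
    subset_size: int = 8            # |T+| + |T-| of the randomly chosen training subsets
    k: int = 13                     # number of priority genes
```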
As indicated already in [3], the local search by logarithmic simulated annealing is essential for obtaining the best classification results. Our computational experiments on AML/ALL data provide evidence that the classification rate becomes worse if we use the perceptron algorithm alone for calculating the weights of the threshold functions. This observation is of particular interest since the number of samples is very small.
We have shown that the Epicurean learning approach, where a large number of
independent hypotheses is calculated on (small) randomly chosen subsets of the training
set, has a noticeable impact on the classification rate. Our classification results improve on the results obtained in [14] for SRBCT data (by using ANNs) and in [7,9] for AML/ALL data (by using SVMs, and a combination of SOMs and weighted voting, respectively). In [7,9,14], resampling methods are used for identifying significant genes, while clustering and the weight calculation of the final neural network-based model, respectively, are performed on the entire training set. In our Epicurean learning approach, the only parameter that has to be adjusted in this respect is the size $(|T^+| + |T^-|)$ of the randomly chosen subsets of training samples.
For AML/ALL data, we obtain zero classification error using nearly the same number of genes as reported in [11], where the weights of a linear classifier are calculated by SVMs. There, the set of significant genes is determined by a recursive procedure, and an SVM computation is executed at each step on all training samples with the input reduced to the genes that have not been deleted yet (when halving the number of genes at each iteration as in [11], $\lfloor \log n \rfloor$ iterations are required for a total number of $n$ genes). In our approach, the significant genes are determined by the input positions that have the largest absolute values of weights in a certain number $p$ of threshold gates calculated by the Epicurean-style learning method. The number $p$ is independent of $n$.
While $\Gamma$ and $K$ seem to be determined by the paradigms mentioned above, it remains to find appropriate rules for the choice of $(|T^+| + |T^-|)$ and $k$, in particular when the sample size is very small, as in the case of microarray data. Future research will be directed towards a deeper analysis of both parameters. For example, in our approach as well as in [7,9,11,14], the parameter $k$ is adjusted by test runs on sample sets. The question is whether there are general rules that allow one to identify a priori a set of significant input positions that guarantee the highest classification rates.
Acknowledgements
The authors would like to thank the referees for their careful reading of the manuscript
and helpful suggestions that resulted in an improved presentation. The research has been
partially supported by EPSRC Grant GR/R72938/01 and by the Taplin award from the
Harvard/MIT Health Sciences and Technology Division.
References
[1] Aarts EHL. Local search in combinatorial optimization. New York: Wiley; 1998.
[2] Albrecht A, Loomes MJ, Steinhofel K, Taupitz M. A modified perceptron algorithm for computer-assisted
diagnosis. In: Bramer M, Preece A, Coenen F, editors. Research and development in intelligent systems
XVII. BCS series. London: Springer; 2000. p. 199–211.
[3] Albrecht A, Wong CK. Combining the perceptron algorithm with logarithmic simulated annealing. Neural
Process Lett 2001;14(1):75–83.
[4] Bartlett P. The sample complexity of pattern classification with neural networks: the size of weights is
more important than the size of the network. IEEE Trans Inf Theory 1998;44(2):525–36.
[5] Cleary JG, Trigg LE, Holmes G, Hall MA. Experiences with a weighted decision tree learner. In: Bramer
M, Preece A, Coenen F, editors. Research and development in intelligent systems XVII. BCS series.
London: Springer; 2000. p. 35–47.
[6] Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression
patterns. Proc Natl Acad Sci USA 1998;95(25):14863–8.
[7] Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine
classification and validation of cancer tissue samples using microarray expression data. Bioinformatics
2000;16:906–14.
[8] Geyer C-F. Epikur. Hamburg: Junius; 2000.
[9] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of
cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286:531–7.
[10] Good P. Permutation tests. Heidelberg: Springer; 2000.
[11] Guyon I, Weston J, Barnhill St, Vapnik V. Gene selection for cancer classification using support vector
machines. Machine Learning 2002;46(1–3):389–422.
[12] Hajek B. Cooling schedules for optimal annealing. Math Oper Res 1988;13:311–29.
[13] Hoffgen K-U, Simon H-U, van Horn KS. Robust trainability of single neurons. J Comp Syst Sci
1995;50:114–25.
[14] Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, et al. Classification and diagnostic
prediction of cancers using gene expression profiling and artificial neural networks. Nat Med
2001;7(6):673–9.
[15] Kirkpatrick S, Gelatt Jr. CD, Vecchi MP. Optimization by simulated annealing. Science 1983;220:671–80.
[16] Minsky ML, Papert SA. Perceptrons. Cambridge (MA): MIT Press; 1969.
[17] Maughan NJ, Lewis FA, Smith V. An introduction to arrays. J Pathol 2001;195:3–6.
[18] Quackenbush J. Computational analysis of microarray data. Nat Rev Genet 2001;2(6):418–27.
[19] Rosenblatt F. Principles of neurodynamics. New York: Spartan Books; 1962.
[20] Schapire RE. The strength of weak learnability. Machine Learning 1990;5(2):197–227.
[21] Schapire RE, Freund Y, Bartlett P, Lee WS. Boosting the margin: a new explanation for the effectiveness
of voting methods. Ann Stat 1998;26(5):1651–86.
[22] Vapnik V. Statistical learning theory. New York: Wiley; 1998.