An Epicurean learning approach to gene-expression data classification
Andreas Albrechta,*, Staal A. Vinterbob, Lucila Ohno-Machadob,c
a Computer Science Department, University of Hertfordshire, Hatfield, Herts AL10 9AB, UK
b Decision Systems Group, Harvard Medical School, 75 Francis Street, Boston, MA, USA
c Division of Health Sciences and Technology, MIT, Cambridge, MA, USA
Received 21 May 2002; received in revised form 6 November 2002; accepted 10 December 2002
Abstract
We investigate the use of perceptrons for classification of microarray data where we use two
datasets that were published in [Nat. Med. 7 (6) (2001) 673] and [Science 286 (1999) 531]. The
classification problem studied by Khan et al. is related to the diagnosis of small round blue cell
tumours (SRBCT) of childhood which are difficult to classify both clinically and via routine
histology. Golub et al. study acute myeloid leukemia (AML) and acute lymphoblastic leukemia
(ALL). We used a simulated annealing-based method in learning a system of perceptrons, each
obtained by resampling of the training set. Our results are comparable to those of Khan et al. and
Golub et al., indicating that there is a role for perceptrons in the classification of tumours based on
gene-expression data. We also show that it is critical to perform feature selection in this type of model, and we propose a method for identifying genes that might be significant for the particular tumour types. For SRBCTs, zero error on test data has been obtained using only 13 out of 2308 genes; for the ALL/AML problem, we obtain zero error using 9 out of 7129 genes in the classification procedure. Furthermore, we provide evidence that Epicurean-style learning and simulated annealing-based search are both essential for obtaining the best classification results.
© 2003 Elsevier Science B.V. All rights reserved.

Keywords: Perceptrons; Simulated annealing; Gene-expression analysis

* Corresponding author. Tel.: +44-1707-284247; fax: +44-1707-284303.
E-mail addresses: [email protected] (A. Albrecht), [email protected] (S.A. Vinterbo), [email protected] (L. Ohno-Machado).
0933-3657/03/$ – see front matter © 2003 Elsevier Science B.V. All rights reserved.
doi:10.1016/S0933-3657(03)00036-8

1. Introduction

Measuring gene-expression levels is important for understanding the genetic basis of diseases. The simultaneous measurement of gene-expression levels for thousands of genes is now possible due to microarray technology [17,18]. Data derived from microarrays are difficult to analyze without the help of computers, as keeping track of thousands of measurements and their relationships is overwhelmingly complicated.
Several authors have utilized unsupervised learning algorithms to cluster gene-expression
data [6]. In those applications, the goal is to find genes that have correlated patterns of
expression, in order to facilitate the discovery of regulatory networks. Recent publications
have begun to deal with supervised classification for gene expression derived from microarrays [7,11]. The goal in these applications is usually to classify cases into diagnostic or
prognostic categories. Additionally, researchers try to determine which genes are most
significantly related to the category of interest. Since the number of measurements is very
large compared to the number of arrays, there is tremendous potential for overfitting in
models that do not utilize a pre-processing step for feature selection.
The feature selection process itself is of interest, as it helps to determine the relative
importance of a gene in the classification. Approaches for feature selection in the context of
gene-expression analysis are currently being investigated [11]. Developing a strategy for
selecting genes that are important in a classification model, regardless of their absolute
expression levels, is important in this context.
In this paper, we propose an algorithm for learning perceptrons based on simulated
annealing and we show that it can be successfully applied to the analysis of gene-
expression data. Besides the combination of simulated annealing and perceptrons, another
key feature of our approach is training the perceptrons [16,19] on randomly selected
subsets of the entire sample set. In statistical inference, the method of drawing many
samples from some population or constructing many rearrangements of sample values is
called resampling [10]. For each sample or rearrangement, test statistics are computed, and
the resulting set of test statistics constitutes the sampling distribution. Since we are dealing
with a learning and classification task where a large number of hypotheses is calculated
from randomly selected (small) subsets of samples, we use the concept of Epicurean
learning. To our knowledge, Epicurean learning was first mentioned by Cleary et al. [5],
motivated by Epicurus’ paradigm that all hypotheses fitting the known data about an object
should be retained [8].
In this paper, we analyse two microarray datasets published by Khan et al. [14] and Golub et al. [9]. On the test data provided with both datasets, we obtain zero classification error using only a very small number of genes. Moreover, our computational experiments
show that Epicurean learning as well as the simulated annealing-based search are both
essential for obtaining the best classification results.
2. Methods
Let $D \subseteq \mathbb{Q}^n$ be our input data table, where each column corresponds to expression measurements for a particular gene over the tissues investigated. Further, let $c: \mathbb{Q}^n \to \{1, 2, \ldots, m\}$ be a partial function that for $D$ returns the tumour class associated with each row.
We would like to find a realization of a function $F: \mathbb{Q}^n \to 2^{\{1, 2, \ldots, m\}}$ that represents an extension of $c$ that we can use to classify new, unseen expression measurement vectors.
We do this as follows. For each class $i \in \{1, 2, \ldots, m\}$, we construct a classifier $F_i: \mathbb{Q}^n \to [0, 1]$. These $F_i$ are then combined to form $F$ as

$$ F(x) = \{\, j \mid F_j(x) = \max_i F_i(x) \,\}. \qquad (1) $$

The number $|F(x)|$ gives us an indication of how uniquely we were able to classify $x$. We choose to discard the classification if $|F(x)| > 1$.
We now turn to the construction of the functions $F_i$. A perceptron $p$ is a function $p: \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R} \to \{0, 1\}$ such that

$$ p(x; w, \vartheta) = p_{w,\vartheta}(x) = t_{\vartheta}(w \cdot x), \qquad (2) $$

where $t_{\vartheta}$ is the unit step function at threshold $\vartheta$ defined as

$$ t_{\vartheta}(y) = \begin{cases} 0, & \text{if } y < \vartheta, \\ 1, & \text{otherwise.} \end{cases} \qquad (3) $$

Given $k_i$ perceptrons, we define $F_i$ to be

$$ F_i(x) = \frac{1}{k_i} \sum_{j=1}^{k_i} p_{w^i_j, 0}(x), \qquad (4) $$

where the $w$'s are restricted to be rational (for simpler notation, we assume $\vartheta = 0$ and always $x_n = 1$, i.e. $w_n$ represents the threshold). There are two problems that need to be solved:

1. finding the $w^i_j$'s, i.e. training the perceptrons;
2. finding $k_i$ for each class $i$.

The parameter $k_i$ is chosen empirically. We will discuss how this is done in Section 4. In the remainder of this section we address the training of the perceptrons, and pragmatics regarding reduction of the input dimensionality.
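As a concrete illustration of Eqs. (1)-(4), the following Python sketch shows how the votes of the $k_i$ perceptrons are averaged into $F_i(x)$ and how the classes attaining the maximal rating are collected into $F(x)$. The helper names are hypothetical; this is only a sketch and not the authors' implementation (which is in C++, see Section 4).

```python
import numpy as np

def perceptron_output(w, x):
    """p_{w,0}(x) from Eqs. (2)-(3) with threshold 0.
    The last component of x is assumed to be 1, so w[-1] plays the role of the threshold."""
    return 1 if np.dot(w, x) >= 0 else 0

def class_rating(weights_i, x):
    """F_i(x) from Eq. (4): fraction of the k_i perceptrons of class i that vote for x."""
    return sum(perceptron_output(w, x) for w in weights_i) / len(weights_i)

def classify(all_weights, x):
    """F(x) from Eq. (1): all_weights[i] is the list of weight vectors for class i.
    Returns the set of classes whose rating attains the maximum."""
    ratings = [class_rating(weights_i, x) for weights_i in all_weights]
    top = max(ratings)
    winners = {i for i, r in enumerate(ratings) if r == top}
    return winners  # the classification is discarded if len(winners) > 1
```

A sample is accepted as classified only when the returned set contains a single class, mirroring the rule $|F(x)| > 1 \Rightarrow$ discard.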
2.1. Perceptron training
Let $T = T^+ \cup T^-$ be a training set composed of positive and negative examples, respectively. We want to find the parameters $w$ of the perceptron $p$ that maximize the separation of the positive and negative examples in $T$.
Hoffgen et al. [13] have shown that finding a linear threshold function that minimizes the
number of misclassified vectors is NP-hard in general. Hence, we need to apply heuristics
to the problem of training a perceptron.
Simulated annealing has been proven to be a versatile tool in combinatorial optimisation
[1,15], and is our choice of optimization strategy. In [3], we have demonstrated on
randomly generated disjunctive normal forms that the classification rate improves if the
perceptron algorithm is combined with logarithmic simulated annealing. An application to
real world data has been provided in [2].
Simulated annealing requires a search space $W$ in which we want to find an element that minimizes an objective function $o: W \to \mathbb{N}$, an initial ''current location'' $w_s \in W$, a function $s: W \to W$ that in a stochastic fashion proposes a next ''current location'' that we hope will lead to the wanted optimum, a function $a: \mathbb{N} \times W \times W \to \{0, 1\}$ that accepts or rejects the proposed next current location, and finally a stopping criterion. We can illustrate the strategy by the following simple pseudo-code skeleton:

    w := w_s; k := 0
    while not stop(k, w)
        k := k + 1; w_n := s(w)
        if a_k(w, w_n) = 1
            w := w_n
The idea of the algorithm is to initially allow locally suboptimal steps and to become more restrictive in accepting such steps over time, the hope being to avoid premature convergence to local minima. In our case, we define our objective function to be the number of misclassified elements in $T$. Let

$$ M_T(w) = \{ x \in T^+ \mid p_{w,0}(x) < 1 \} \cup \{ x \in T^- \mid p_{w,0}(x) > 0 \} \qquad (5) $$

denote the set of misclassified samples; then we can define our objective function as $o(w) = |M_T(w)|$. The set $M_T(w)$ can be viewed as a neighbourhood of $w$ containing all the possible next steps we can take from $w$. As a first key feature of our heuristic, we now construct a probability mass function $q$ over $M_T(w)$ as

$$ q(x) = \frac{|x \cdot w|}{\sum_{y \in M_T(w)} |y \cdot w|}. \qquad (6) $$

Elements that are ''further away'' from being classified correctly by $p_{w,0}$ are assigned higher values by $q$, which is in line with the perceptron algorithm [16,19]. We now define

$$ s(w) = w - \omega(x) \cdot \mathrm{sample}(M_T(w), q)\,/\,\sqrt{w \cdot w}, \qquad (7) $$

where $\mathrm{sample}$ stochastically selects one element from the set $M_T(w)$ with the probability given by $q$, and $\omega(x) = 1$ for $x \in T^-$, $\omega(x) = -1$ for $x \in T^+$.
The acceptance function at step $k$ (the $k$th time through the while-loop) is defined as

$$ a(w_{k-1}, w_k) = \begin{cases} 1, & \text{if } p(w_{k-1}, w_k) > r, \\ 0, & \text{otherwise,} \end{cases} \qquad (8) $$

where

$$ p(w_{k-1}, w_k) = \begin{cases} 1, & \text{if } o(w_k) - o(w_{k-1}) \le 0, \\ \mathrm{e}^{-(o(w_k) - o(w_{k-1}))/t(k)}, & \text{otherwise,} \end{cases} \qquad (9) $$

and $r \in [0, 1]$ is uniformly randomly sampled at each step $k$. The function $t$, motivated by Hajek's theorem [12] on the convergence of inhomogeneous Markov chains for large enough constants $\Gamma$, is defined as

$$ t(k) = \frac{\Gamma}{\ln(k + 2)}, \quad k \in \{0, 1, \ldots\}, \qquad (10) $$

and represents the ''annealing'' temperature (second key feature). As $t$ decreases, the probability of accepting a $w$ that does not decrease the objective function decreases. We empirically chose $\Gamma$ in the range of $(|T^+| + |T^-|)/4$ up to $(|T^+| + |T^-|)/3$, essentially using the same method as in [2].
Finally, our stopping criterion is given by a pre-determined number of iterations through the while-loop.
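Putting Eqs. (5)-(10) together, a minimal Python sketch of the simulated annealing-based perceptron training might look as follows. The function and variable names, the starting weight vector, and the iteration budget are illustrative assumptions; this is not the authors' C++ implementation.

```python
import math
import random
import numpy as np

def misclassified(w, T_pos, T_neg):
    """M_T(w) from Eq. (5): positives with w.x < 0 and negatives with w.x >= 0,
    paired with omega(x) = -1 for x in T+ and omega(x) = +1 for x in T-."""
    return ([(x, -1.0) for x in T_pos if np.dot(w, x) < 0] +
            [(x, 1.0) for x in T_neg if np.dot(w, x) >= 0])

def train_perceptron(T_pos, T_neg, gamma, max_iter=10000, rng=random):
    """Perceptron training with the logarithmic cooling schedule t(k) = gamma / ln(k + 2)
    of Eq. (10); gamma corresponds to the constant chosen in the range
    (|T+| + |T-|)/4 ... (|T+| + |T-|)/3."""
    T_pos = [np.asarray(x, dtype=float) for x in T_pos]
    T_neg = [np.asarray(x, dtype=float) for x in T_neg]
    w = np.zeros(len(T_pos[0]))
    w[-1] = 1.0  # arbitrary non-zero start; x_n = 1 is assumed, so w[-1] acts as the threshold
    for k in range(max_iter):
        M = misclassified(w, T_pos, T_neg)
        if not M:                       # zero error on the training subset
            break
        # Eq. (6): sample a misclassified example with probability proportional to |x.w|
        scores = [abs(np.dot(x, w)) for x, _ in M]
        total = sum(scores)
        if total > 0:
            idx = rng.choices(range(len(M)), weights=scores)[0]
        else:
            idx = rng.randrange(len(M))
        x, omega = M[idx]
        # Eq. (7): proposed next weight vector
        norm = math.sqrt(float(np.dot(w, w))) or 1.0
        w_next = w - omega * x / norm
        # Eqs. (8)-(10): always accept improvements; accept uphill moves with prob. exp(-delta/t(k))
        delta = len(misclassified(w_next, T_pos, T_neg)) - len(M)
        if delta <= 0 or rng.random() < math.exp(-delta / (gamma / math.log(k + 2))):
            w = w_next
    return w
```

Each training subset drawn in Section 2.2 would be passed to such a routine to obtain one perceptron of the voting system.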
2.2. Perceptron training set sampling
In order to generate (a large number of) different hypotheses as well as to achieve zero
learning error in a short time, the following sampling scheme for training sets is applied
(third key feature):
Let $C^+_i = \{ x \in D \mid c(x) = i \}$ be the positive examples for class $i$ in $D$, and let $C^-_i = D - C^+_i$ be the negative examples for class $i$ in $D$. Further, for two parameters $a, b \in (0, 1]$, let $s^+_i = \lfloor a\,|C^+_i| \rfloor$ and $s^-_i = \lceil b\,|C^-_i| \rceil$. For each $j$ in $\{1, 2, \ldots, k_i\}$ we randomly sample $T^+_{i,j} \subseteq C^+_i$ and $T^-_{i,j} \subseteq C^-_i$ such that $|T^+_{i,j}| = s^+_i$ and $|T^-_{i,j}| = s^-_i$. The set $T_{i,j} = T^+_{i,j} \cup T^-_{i,j}$ is then the training set used to train perceptron $p_j$ in $F_i$. Thus, by the parameters $a$ and $b$ we can control the size of the randomly chosen subsets of the training data, which is an integral part of the Epicurean-style learning approach.
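A short Python sketch of this sampling scheme follows; the generator interface and the rounding of the subset sizes to integers are illustrative assumptions, not details taken from the authors' implementation.

```python
import math
import random

def sample_training_sets(D, labels, class_i, k_i, a, b, rng=random):
    """Draw k_i random training sets T_{i,j} = T+_{i,j} u T-_{i,j} for class i.
    a and b control the fractions of positive and negative examples used."""
    C_pos = [x for x, c in zip(D, labels) if c == class_i]   # C+_i
    C_neg = [x for x, c in zip(D, labels) if c != class_i]   # C-_i = D \ C+_i
    s_pos = math.floor(a * len(C_pos))
    s_neg = math.ceil(b * len(C_neg))
    for _ in range(k_i):
        T_pos = rng.sample(C_pos, s_pos)
        T_neg = rng.sample(C_neg, s_neg)
        yield T_pos, T_neg
```

Each pair drawn this way would be handed to the training routine sketched in Section 2.1 to produce one of the $k_i$ perceptrons of $F_i$.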
The boosting method (cf. [20,21]) tries to reduce the training error by assigning
higher probabilities to ‘‘difficult’’ samples in a recursive learning procedure. In our
approach, we observed that almost all subsets are learned with zero error even for
relatively large fractions of the entire sample set (about 0.5 in [2] and 0.75 in the present
study). Thus, according to the procedure described in the previous section, the method
we are using belongs to the class of voting predictions. Recent research has shown that
classification performance can be significantly improved by voting on multiple hypotheses [21]; for a more detailed discussion cf. [5]. We note that particular examples are trained multiple times, e.g. for the ALL/AML dataset, the average occurrence of an ALL sample in the randomly chosen subsets is about 220 in the trials with the best classification rate.
2.3. Dimensionality reduction
In our case, where the data D presents a massively underdetermined system, i.e. there are
many more gene expression measurements than there are tissue samples, experiments have
shown that reducing the dimensionality of the data is beneficial [14]. The scheme we applied is based on selecting the genes that receive the coefficients with the largest absolute values in the perceptrons after training has been completed on all dimensions of the data (fourth key feature). Let $g(w, q)$ be the set of $q$ positions that produce the $q$ biggest values $|w_l|$ in $w = (w_1, w_2, \ldots, w_n)$ (ties are ordered arbitrarily). Let $G_i = \bigcap_{j=1}^{p} g(w^i_j, q)$ be the set of dimensions selected for class $i$, i.e. here we have $k_i = p$ for all $i = 1, \ldots, m$. Each $G_i$ is truncated to $k = \min_{i \in \{1, \ldots, m\}} |G_i|$ positions with the largest associated values (called priority genes). Training each $F_i$ is then repeated with the data $D$ projected onto the dimensions in $G_i$ for $k_i = K$ perceptrons, $i = 1, \ldots, m$. The importance of weight size in learning procedures has been emphasised by Bartlett [4].
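For illustration, the priority-gene selection can be sketched in Python as follows. The exact tie-breaking and the ranking used to truncate $G_i$ to $k$ positions are not spelled out above, so ranking by the summed absolute weights across the $p$ perceptrons is an assumption of this sketch.

```python
import numpy as np

def top_positions(w, q):
    """g(w, q): indices of the q largest absolute weights (ties broken arbitrarily)."""
    return set(np.argsort(-np.abs(np.asarray(w)))[:q])

def priority_genes(weight_vectors, q, k):
    """Intersect g(w_j, q) over the p trained perceptrons of one class and keep
    the k positions with the largest summed absolute weight (one possible reading)."""
    G = set.intersection(*(top_positions(w, q) for w in weight_vectors))
    ranked = sorted(G, key=lambda l: -sum(abs(w[l]) for w in weight_vectors))
    return ranked[:k]
```

Training is then repeated with the data projected onto the returned positions, now with $K$ perceptrons per class.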
3. Datasets
Improvements in cancer classification have been central to advances in cancer treatment
[9]. Usually, cancer classification has been based primarily on the morphological appearance of the tumour, which has serious limitations. Tumours with similar appearance can follow
significantly different clinical courses and show different responses to therapy. In a few cases,
such clinical heterogeneity has been explained by dividing morphologically similar tumours
into subtypes. Key examples include the subdivision of small round blue cell tumours and
acute leukemias (SRBCTs). Both tumour classes are considered in the present section.
3.1. SRBCT data
For the first series of computational experiments, the data used in this paper are provided
by Khan et al. [14]. The data are gene-expression measurements from cDNA microarrays containing 2308 genes for four types of SRBCTs of childhood, which include neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt lymphoma (BL) and the Ewing family of tumours (EWS), i.e. here we have m = 4. The number of training samples is as follows: 23 for EWS,
8 for BL, 12 for NB, and 20 for RMS. The test set consists of 25 samples: 6 for EWS, 3 for
BL, 6 for NB, 5 for RMS, and 5 ‘‘others’’. The split of the data into training and test sets was
the same as in the paper by Khan et al., where it has been shown that a system of artificial
neural networks can utilize gene-expression measurements from microarrays and classify
these tumours into four different categories. In [14], 3750 ANNs are calculated to obtain 96 genes for training the final ANN, which is able to correctly classify the 20 + 5 test samples.
3.2. AML/ALL data
The data are taken from Golub et al. [9]. The training set consists of 7129 gene-expression values for each of 11 samples of acute myeloid leukemia (AML) and 27 samples of acute lymphoblastic leukemia (ALL) (i.e. m = 2 in this case). For the test, 14 AML samples and 20 ALL samples are used (again, each of them is represented by 7129 gene-expression values). Current clinical practice involves an experienced specialist's interpretation of the tumour's morphology, histochemistry, immunophenotyping, and cytogenetic analysis, each performed in a separate, highly specialised laboratory. Golub et al. analysed various aspects of cancer classification in [9]. In particular, by using the model of self-organising maps in combination with a weighted voting scheme, Golub et al. obtained a
strong prediction for 18 out of the 20 ALL test samples and 10 out of the 14 AML samples
(i.e. a total of six misclassifications).
Furey et al. [7] apply Support Vector Machines (SVMs) [22] to the AML/ALL data.
Significant genes (feature selection) are derived from a score calculated from the mean and
standard deviation of each gene type. Tests are performed for 25, 250, 500, and 100 top-ranked genes, and each of the five test examples left unpredicted by Golub's method is misclassified in at least one SVM test. Two test examples are misclassified in all SVM tests.
Guyon et al. [11] also apply SVMs, but with a different feature selection method called
recursive feature elimination. At each iteration, the weights associated with genes are
recalculated by an SVM test and genes with the smallest weights are removed. On 8 and
16 genes, the classification error on the test set is zero, with minimum margins of 0.05 and 0.03 and medium margins of 0.49 and 0.38, respectively.
4. Results
The algorithm described in Section 2 has been implemented in C++, and we performed computational experiments for the datasets from Section 3 on a SUN Ultra 5/333 workstation with 128 MB RAM.
For both sets of microarray data, we performed the following normalisation (including the test data). For each gene type $i$ (i.e. input position), we calculated the mean $m_i$ and the standard deviation $s_i$ from the training data only. Then, for all data $d_{ji}$ ($j$ covers training as well as test data), the normalised values $(d_{ji} - m_i)/s_i$ were taken for the training and test procedures. The same normalisation is used in [7,9,11]. For the dataset from [14] we discuss the impact of normalisation.
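This normalisation is ordinary z-scoring with statistics estimated on the training rows only and applied to both training and test rows. A minimal NumPy sketch is shown below; the guard against a zero standard deviation is an added assumption, not a detail from the paper.

```python
import numpy as np

def normalise(train, test):
    """Per-gene normalisation (d_ji - m_i) / s_i with m_i, s_i from the training data only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    m = train.mean(axis=0)
    s = train.std(axis=0)
    s[s == 0] = 1.0   # avoid division by zero for constant genes (assumption)
    return (train - m) / s, (test - m) / s
```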
4.1. SRBCT data
In Table 1, computations include all 2308 input variables (gene-expression data). The parameter settings are a = 0.75 and b = 0.25 for balanced but still sufficiently large numbers of positive and negative examples. The entries in the table are the errors on the classes in the order [EWS, BL, NB, RMS]. We recall that the number $K$ of threshold functions (perceptrons) is the same for all four classes, i.e. $k_i = K$, $i = 1, \ldots, 4$, and the computational experiments have been performed on normalised data. The values are typical results from three to five runs for each parameter setting; the deviation is very small and therefore average values are not calculated. The errors are calculated for the 20 test samples with confirmed classification only.
Table 1
Results on SRBCT test data if training is executed on all 2308 input data

K                    11            33            99            297           891
Error distribution   [1, 1, 5, 0]  [1, 1, 5, 0]  [0, 1, 5, 0]  [0, 1, 4, 0]  [0, 1, 4, 0]
Total errors (%)     35            35            30            25            25

In Table 2, the training procedure has been executed on priority genes only, i.e. on $k$ genes that are determined by the intersection of $p$ sets of significant input positions (derived from $p$ perceptrons), each consisting of the $q$ most significant weights. The value of $p$ has been chosen in such a way that $k \le |\bigcap_{j=1}^{p} g(w^i_j, q)|$ holds, cf. Section 2.3.

Table 2
Results on SRBCT test data if training is executed on priority genes only

k (K = 595)          7             9             11            13            15
Error distribution   [1, 0, 1, 0]  [1, 0, 0, 0]  [1, 0, 0, 0]  [0, 0, 0, 0]  [2, 0, 0, 0]
Total errors (%)     10            5             5             0             10
The parameter settings are p = 5 and q = 1000 (q is large enough to have non-empty intersections), and the numbers of priority genes are k = 7, 9, 11, 13 and 15. We observed stable results for K ≈ 600 and therefore K = 595 has been chosen (K depends on p and the voting at depth 2, which is performed on seven inputs in this case). We obtained the best series of results (for the range of k = 7, ..., 15) for relatively small values of a and b. The number of randomly chosen training samples is 4 for negative as well as positive examples, i.e. we have a ∈ {0.17, ..., 0.5} and b ∈ {0.073, ..., 0.1}. Thus, we obtain zero classification error using only 13 genes, whereas 96 genes are used in [14] for final classification (which are derived from 3750 ANNs; the ANNs are trained and evaluated on different selections out of the sample set). The run-time ranges from 138 min (for k = 15) up to 215 min (for k = 7).
The average margin between the top and the second largest rating is about 0.16, while the minimum margin is about 0.01. The absolute rating of the five ''other'' examples is below 0.38. These values are stable in different runs for parameter settings with top classification results (e.g. k = 13, |T+| = |T-| = 4, and K = 595). The effect of small margins is due to the normalisation procedure of the gene-expression data, which was used for comparison purposes. Without normalisation, the top absolute rating of each example is, in general, at least twice as large as the second largest one (for most cases much higher). Without normalisation, the rating of the five ''other'' examples is below 0.25. The rating for test example no. 20 is only marginally better than the second best rating. On this example, Khan et al. reported a vote of 0.40 [14], and our ratings are in the same range between 0.41 and 0.45.
Since we are using $|w_l|$ for the selection of priority genes, it is interesting to know whether the largest absolute values of weights correspond to the most significant average values of gene-expression data. Table 3 provides the rank of the average value for the priority genes calculated for K = 595 and the EWS cancer type. In this run, we have k = 11 (cf. Table 2). We recall that q = 1000; for the largest average values of gene-expression data this means the range of ranks 1309, ..., 2308.
In Table 3, two of the genes do not belong to this range; thus, there seems to be no direct
correspondence between the rank of the average value of gene-expression data and the
absolute value of weights in classifying threshold functions from the first layer of the
classification circuit.
Table 3
Ranking of priority genes for SRBCT data

Gene number   488    530    654    738    891
Rank number   1592   2152   1994   2141   1271

Gene number   915    1021   1441   1912   1950
Rank number   2004   2158   1448   1060   1744

Gene number   1979   2188   2302
Rank number   2112   1019   1764
For k = 13 and K = 595, we investigated the impact of the random choice of subsets from the entire sample set. We performed experiments for L = |T+| = |T-| = 4, ..., L = |T+| = |T-| = 12, where for L > 8 and the class BL the setting |T+| = |T-| = 8 was chosen.
Table 4 shows typical results from our runs and clearly demonstrates the effect of Epicurean learning: the classification results become worse if the number of potential training subsets becomes smaller.
4.2. AML/ALL data
In this case, we have a binary classification problem. We recall that the computations are
executed on normalised data.
In Table 5, the training has been performed on all 7129 input variables (gene-expression
data). The entries in the table are the errors in the order [ALL, AML]. The values are typical
results from several runs for each parameter setting.
The classification improves on the results published in [9]. The run-time is between 1 min and 60 min. The question is whether the same or even better results are possible with significantly smaller numbers of gene inputs.
The priority genes are computed for b = 1.0 (due to the small number of samples) with the same number of training samples from the ALL data, which implies a = 0.41 (i.e. we have |T+| = |T-| = 11). The parameter q was set to 2500 and p = 3 was chosen, cf. Section 2.3. From the intersection of the p subsets, we took a fixed number of k genes corresponding to the k largest weights. In Table 6, some typical results are presented.
The run-time ranges from 92 min (for k = 17) up to 1578 min (for k = 7). The classification results are stable, in particular for k = 7 and 9. The average margin is 0.23 and the minimum margin is 0.035 for k = 9.
In Table 7, we provide the rank of the average value for the priority genes calculated for K = 315 and k = 7. As can be seen, none of these genes belongs to the range of the q = 2500 genes with the largest average absolute value, i.e. for the AML/ALL data the largest weights do not correlate with the largest average absolute values at all.

Table 4
The impact of Epicurean-style learning on classification of SRBCT data

L (k = 11)           4             6             8             10            12
Error distribution   [0, 0, 0, 0]  [1, 0, 0, 0]  [2, 0, 0, 0]  [2, 0, 0, 0]  [2, 0, 1, 0]
Total errors (%)     0             5             10            10            15

Table 5
Results on AML/ALL test data if training is executed on all 7129 input data

K                    33      99      297     891     2673
Error distribution   [0, 3]  [1, 2]  [2, 2]  [2, 2]  [2, 1]
Total errors (%)     8.8     8.8     11.8    11.8    8.8
As for SRBCT data, we analysed the impact of an increasing size of randomly chosen subsets of the entire sample set. For k = 9 and K = 315, we performed computational experiments for b = 1.0 and a = 0.41, ..., 1.0 on ALL data. Table 8 shows that the classification becomes worse as the number of potential training subsets decreases.
Despite the small number of samples we are using for the training procedure (|T+| = |T-| = 11), we observed slightly worse classification results when the simulated annealing-based search is removed from our classification algorithm. We performed several runs for the same parameter settings as in Table 6 without logarithmic simulated annealing.
We obtained a significant speed-up for the run-time: 75 min for k = 17 and 965 min for k = 7. However, in general, the classification rate becomes worse, as shown in Table 9.
Table 6
Results on AML/ALL test data if training is executed on priority genes only

k (K = 315)          7       9       11      13      15      17
Error distribution   [0, 0]  [0, 0]  [0, 1]  [0, 1]  [0, 1]  [2, 1]
Total errors (%)     0.0     0.0     2.9     2.9     2.9     8.8

Table 7
Ranking of priority genes for ALL/AML data

Gene number   918    1760   2251   3025   3391   4408   6276
Rank number   1759   2872   3734   2415   3443   1830   975

Table 8
The impact of Epicurean-style learning on classification of AML/ALL data

a (k = 9)            0.41    0.50    0.75    1.0
Error distribution   [0, 0]  [0, 0]  [0, 2]  [1, 2]
Total errors (%)     0.0     0.0     5.9     8.8

Table 9
The impact of simulated annealing-based search on classification results

k (K = 315)          7       9       11      13      15      17
Error distribution   [0, 1]  [0, 2]  [0, 2]  [0, 1]  [1, 1]  [2, 1]
Total errors (%)     2.9     5.9     5.9     2.9     5.9     8.8
5. Discussion
In this paper, we presented an algorithm that trains a system of perceptrons by utilising a simulated annealing-based local search procedure. The model was able to successfully classify previously unseen cases of SRBCT and AML/ALL data using only a small number of genes.
Among the key features of the algorithm are the specific cooling schedule of simulated
annealing, the calculation of perceptrons on randomly chosen subsets of training samples
(Epicurean learning), and the method to calculate priority genes (feature selection).
To obtain the best classification results, basically four parameters have to be adjusted: (1) the choice of $\Gamma$ in logarithmic simulated annealing; (2) the total number $K$ of perceptrons in the threshold circuit; (3) the size $(|T^+| + |T^-|)$ of the randomly chosen subsets of training samples; and (4) the number $k$ of priority genes.
For the first two parameters, the computational experiments on gene-expression data from the present paper confirm the range of parameter settings from previous applications of our approach to CT image classification [2]. If $\Gamma$ is in the range of $(|T^+| + |T^-|)/4, \ldots, (|T^+| + |T^-|)/3$ and $K > 600$ (much smaller for binary problems like AML/ALL and image classification as in [2]), the classification rate depends mainly on $(|T^+| + |T^-|)$ and $k$. Thus, compared to other learning-based classification methods, the application of our approach to different problems turns out to be relatively straightforward.
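Purely as an illustration of how few quantities need to be fixed, the four parameters can be collected in a small configuration object; the default values below are taken from the SRBCT runs reported in Section 4 and are not a general recommendation.

```python
from dataclasses import dataclass

@dataclass
class Settings:
    """The four parameters that have to be adjusted (see the list above)."""
    gamma_fraction: float = 1 / 3   # Gamma = gamma_fraction * (|T+| + |T-|), range 1/4 ... 1/3
    K: int = 595                    # perceptrons per class (595 for SRBCT, 315 for AML/ALL)
    subset_size: int = 8            # |T+| + |T-| of the randomly chosen training subsets
    k: int = 13                     # number of priority genes
```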
As indicated already in [3], the local search by logarithmic simulated annealing is essential for obtaining the best classification results. Our computational experiments on AML/ALL data provide evidence that the classification rate becomes worse if we use the perceptron algorithm alone for calculating the weights of the threshold functions. This observation is of particular interest since the number of samples is very small.
We have shown that the Epicurean learning approach, where a large number of
independent hypotheses is calculated on (small) randomly chosen subsets of the training
set, has a noticeable impact on the classification rate. Our classification results improve on the results obtained in [14] for SRBCT data (by using ANNs) and in [7,9] for AML/ALL data (by using SVMs, and a combination of SOMs and weighted voting, respectively). In [7,9,14], resampling methods are used for identifying significant genes, while clustering and the weight calculation of the final neural network-based model, respectively, are performed on the entire training set. In our Epicurean learning approach, the only parameter that has to be adjusted in this respect is the size $(|T^+| + |T^-|)$ of the randomly chosen subsets of training samples.
For AML/ALL data, we obtain zero classification error using nearly the same number of genes as reported in [11], where the weights of a linear classifier are calculated by SVMs. There, the set of significant genes is determined by a recursive procedure, and an SVM computation is executed at each step on all training samples with the input reduced to the genes that have not been deleted yet (when halving the number of genes at each iteration as in [11], $\lfloor \log n \rfloor$ iterations are required for a total number of $n$ genes). In our approach, the significant genes are determined by the input positions that have the largest absolute values of weights in a certain number $p$ of threshold gates calculated by the Epicurean-style learning method. The number $p$ is independent of $n$.
While $\Gamma$ and $K$ seem to be determined by the paradigms mentioned above, it remains to find appropriate rules for the choice of $(|T^+| + |T^-|)$ and $k$, in particular when the sample size is very small, as in the case of microarray data. Future research will be directed towards a deeper analysis of both parameters. For example, in our approach as well as in [7,9,11,14], the parameter $k$ is adjusted by test runs on sample sets. The question is whether there are general rules that allow one to identify a priori a set of significant input positions that guarantee the highest classification rates.
Acknowledgements
The authors would like to thank the referees for their careful reading of the manuscript
and helpful suggestions that resulted in an improved presentation. The research has been
partially supported by EPSRC Grant GR/R72938/01 and by the Taplin award from the
Harvard/MIT Health Sciences and Technology Division.
References
[1] Aarts EHL. Local search in combinatorial optimization. New York: Wiley; 1998.
[2] Albrecht A, Loomes MJ, Steinhofel K, Taupitz M. A modified perceptron algorithm for computer-assisted
diagnosis. In: Bramer M, Preece A, Coenen F, editors. Research and development in intelligent systems
XVII. BCS series. London: Springer; 2000. p. 199–211.
[3] Albrecht A, Wong CK. Combining the perceptron algorithm with logarithmic simulated annealing. Neural
Process Lett 2001;14(1):75–83.
[4] Bartlett P. The sample complexity of pattern classification with neural networks: the size of weights is
more important than the size of the network. IEEE Trans Inf Theory 1998;44(2):525–36.
[5] Cleary JG, Trigg LE, Holmes G, Hall MA. Experiences with a weighted decision tree learner. In: Bramer
M, Preece A, Coenen F, editors. Research and development in intelligent systems XVII. BCS series.
London: Springer; 2000. p. 35–47.
[6] Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression
patterns. Proc Natl Acad Sci USA 1998;95(25):14863–8.
[7] Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine
classification and validation of cancer tissue samples using microarray expression data. Bioinformatics
2000;16:906–14.
[8] Geyer C-F. Epikur. Hamburg: Junius; 2000.
[9] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of
cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286:531–7.
[10] Good P. Permutation tests. Heidelberg: Springer; 2000.
[11] Guyon I, Weston J, Barnhill St, Vapnik V. Gene selection for cancer classification using support vector
machines. Machine Learning 2002;46(1–3):389–422.
[12] Hajek B. Cooling schedules for optimal annealing. Math Oper Res 1988;13:311–29.
[13] Hoffgen K-U, Simon H-U, van Horn KS. Robust trainability of single neurons. J Comp Syst Sci
1995;50:114–25.
[14] Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, et al. Classification and diagnostic
prediction of cancers using gene expression profiling and artificial neural networks. Nat Med
2001;7(6):673–9.
[15] Kirkpatrick S, Gelatt Jr. CD, Vecchi MP. Optimization by simulated annealing. Science 1983;220:671–80.
[16] Minsky ML, Papert SA. Perceptrons. Cambridge (MA): MIT Press; 1969.
[17] Maughan NJ, Lewis FA, Smith V. An introduction to arrays. J Pathol 2001;195:3–6.
[18] Quackenbush J. Computational analysis of microarray data. Nat Rev Genet 2001;2(6):418–27.
[19] Rosenblatt F. Principles of neurodynamics. New York: Spartan Books; 1962.
[20] Schapire RE. The strength of weak learnability. Machine Learning 1990;5(2):197–227.
[21] Schapire RE, Freund Y, Bartlett P, Lee WS. Boosting the margin: a new explanation for the effectiveness
of voting methods. Ann Stat 1998;26(5):1651–86.
[22] Vapnik V. Statistical learning theory. New York: Wiley; 1998.