

Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005

Neural Networks for Gene Expression Analysis and Gene Selection from DNA Microarray

Jagdish Chandra Patra, Qin Zhen, Ee Luang Ang and Amitabha Das
School of Computer Engineering, Nanyang Technological University, Singapore 639798

Email: {aspatra, 148101347, aselang, asadas}@ntu.edu.sg

Abstract- We propose two approaches for microarray gene expression analysis and gene selection using neural networks. Using these approaches, only those genes which help sample classification are selected from the original set of genes, and the redundant gene expression patterns in the huge microarray matrix are eliminated, so that the dimensionality of the matrix is reduced from a few thousand to a much smaller number. An unsupervised SOM-based technique and a supervised single layer perceptron-based technique have been utilized for this purpose. The performance of these two approaches is compared in terms of accuracy, implementation and execution time.

I. INTRODUCTION

Nowadays, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is widely applied. This technology allows screening of a large number of genes to see whether they are active under various conditions, and can assist biologists in understanding the behaviors of various tumors based on their gene expression. Normally, for each sample, several thousand genes are measured for their mRNA expression levels [1], [2]. The high dimensionality of the data matrix is a big challenge for data analysis and meaningful information extraction. To overcome this problem, we need to identify a "class distinguisher": a much smaller group of genes than the original set which can, in combination, help to classify cancers and to predict unknown samples. Here, two different neural network-based approaches are used to analyze the Alizadeh et al. dataset [3].

The Alizadeh et al. dataset uses microarrays to characterize gene expression patterns in three lymphoid malignancies: DLBCL (47 samples), FL (9 samples) and CLL (11 samples), along with 29 non-lymphoma samples. Thus, this dataset consists of 96 samples in total, each with expression values of 4026 genes. Our task is to (1) find a class distinguisher between lymphoma and non-lymphoma; (2) find a class distinguisher between DLBCLs and non-DLBCLs (FLs and CLLs); (3) find a class distinguisher between the two subtypes of DLBCL, which are GC B-like DLBCLs and Activated B-like DLBCLs.

II. APPROACH 1: BASED ON DISCRIMINATION FACTOR AND SELF-ORGANIZING MAPS

Golub et al. [4] have presented a novel method for selecting the genes whose expression pattern is strongly correlated with the class distinction. Each gene's degree of correlation with the class distinction is calculated as follows. Each gene is represented by an expression vector v(g) = (eg1, eg2, ..., egn), where egi denotes the expression level of gene g in the ith sample in the dataset. A measure of correlation was calculated for each gene. Firstly, the means and SDs (standard deviations) of the log of the expression levels of gene g for the samples in class 1 and class 2 need to be calculated. Let [m1(g), s1(g)] and [m2(g), s2(g)] denote these values respectively, computed as

m1(g) = (1/c1)(eg1 + eg2 + ... + egc1), (1)
s1(g) = sqrt{(1/c1)[(eg1 - m1(g))^2 + ... + (egc1 - m1(g))^2]}, (2)

where c1 is the number of samples in class 1. The mean and SD for class 2 are calculated using (1) and (2) as well. Let us define a discrimination factor DF(g) given by

DF(g) = [m1(g) - m2(g)]/[s1(g) + s2(g)]. (3)

Large values of |DF(g)| indicate a strong correlation between the expression pattern of gene g and the class distinction, while the sign of DF(g) being positive or negative corresponds to g being more highly expressed in class 1 or class 2, respectively. The informative gene selection phase involved all the

samples in the dataset. In this experiment, the 96 samples (each with expression values of 4026 genes) in the original dataset were first divided into two classes, Lymphoma (cancer) and Non-lymphoma (normal). The samples were labeled as Lymphoma (C1) (67 samples):

(i) FL, (ii) CLL, (iii) DLBCL-g (GC B-like DLBCL), and (iv) DLBCL-a (Activated B-like DLBCL), and

Non-lymphoma (C2) (29 samples):
(i) Blood-B-r (resting blood B), (ii) Blood-B-a (activated blood B), (iii) Lymph-Node, (iv) Tonsil, (v) Cell Line, and (vi) Blood-T (Resting/Activated T).

Thereafter, the three tasks, each looking for a class distinguisher, were carried out.

A. Task 1: A Class Distinguisher Between Lymphoma and Non-lymphoma

Let us assign class 1 (C1) as lymphoma and class 2 (C2) as non-lymphoma. After calculating DF(g) for all the 4026

0-7803-9048-2/05/$20.00 ©2005 IEEE 509

genes, the 15 genes with the largest DF(g) were selected to identify the lymphoma class, and another 15 genes with the smallest DF(g), i.e., negative and with the largest absolute values, were selected to identify the non-lymphoma class. The 30 genes were combined to form the lymphoma and non-lymphoma distinguisher shown in Fig. 1.
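As a concrete illustration, the DF computation of eqs. (1)-(3) and the selection of the most positive and most negative genes can be sketched as follows. This is a minimal NumPy sketch; the array layout and function names are ours, not from the paper:

```python
import numpy as np

def discrimination_factors(X, labels):
    """DF(g) = [m1(g) - m2(g)] / [s1(g) + s2(g)] for every gene.

    X      : (n_genes, n_samples) array of log expression levels
    labels : (n_samples,) boolean array, True for class 1 samples
    """
    c1, c2 = X[:, labels], X[:, ~labels]
    m1, m2 = c1.mean(axis=1), c2.mean(axis=1)   # per-gene class means, eq. (1)
    s1, s2 = c1.std(axis=1), c2.std(axis=1)     # per-gene class SDs, eq. (2)
    return (m1 - m2) / (s1 + s2)                # eq. (3)

def select_distinguisher(X, labels, k=15):
    """Indices of the k genes with largest DF (class 1 markers)
    and the k genes with smallest DF (class 2 markers)."""
    df = discrimination_factors(X, labels)
    order = np.argsort(df)
    return np.concatenate([order[-k:], order[:k]])  # 2k genes in total
```

With k=15 this yields the 30-gene distinguisher used in each task.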


Fig. 1. Selected 30 genes as lymphoma and non-lymphoma distinguisher.

A self-organizing map (SOM) [5], [6] was employed to visualize the distribution of all 96 samples based on their expression patterns of these 30 genes.

The SOM algorithm works as follows. Firstly, define a two-dimensional grid of nodes, which can have a hexagonal or rectangular geometry. Initially, input samples, each with expression values for 30 genes, are randomly allocated to one of the nodes. Then, during the iterative training process, for each input sample the winning node gets its weight updated and the sample is assigned to this node; the weights of the nodes neighboring the winning node are also updated. At the end of the training process, the nodes of the SOM grid have clusters of co-expressed samples assigned to them, and the map captures the distribution of the input samples.
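The training loop described above can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the paper: the Gaussian neighborhood and learning-rate decay follow the text, while the shrinking-radius schedule and random weight initialization are our assumptions.

```python
import numpy as np

def train_som(data, rows=4, cols=4, trainlen=15, a0=0.1, r0=3.0, seed=0):
    """Train a rows x cols SOM on data of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    weights = rng.normal(size=(rows * cols, d))
    # grid coordinates of each node, used by the Gaussian neighborhood
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    for t in range(trainlen):
        a = a0 * (0.005 / a0) ** (t / trainlen)  # learning-rate decay in the spirit of eq. (4)
        r = max(r0 * (1 - t / trainlen), 0.5)    # shrinking neighborhood radius (assumed schedule)
        for x in data[rng.permutation(n)]:
            win = np.argmin(((weights - x) ** 2).sum(axis=1))  # winning node
            h = np.exp(-((grid - grid[win]) ** 2).sum(axis=1) / (2 * r * r))
            weights += a * h[:, None] * (x - weights)          # update winner and neighbors
    return weights

def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to sample x."""
    return int(np.argmin(((weights - x) ** 2).sum(axis=1)))
```

After training, each sample's best matching unit gives its position on the map, which is how the cluster plots in Figs. 2 and 6 are read.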

The size of the map needs to be chosen according to the number of samples involved: a map that is too large may cause too many blank nodes, and one that is too small may make the boundaries of clusters unclear. After a series of experiments with different map sizes, 4x4 was found to be sufficient and appropriate to describe the samples. An SOM was initialized with a size of 4x4 nodes, with

an initial neighborhood radius of 3 and an initial learning rate a(0) of 0.1. A Gaussian function was used as the neighborhood function. The learning rate a(t) decreases as the iteration number t increases, according to

a(t) = a(0) * (0.005/a(0))^((t-1)/trainlen), (4)

where trainlen is the training length in terms of epochs. These parameters were used for all experiments using SOM in this manuscript.

Fig. 2. Distribution of all 96 samples based on the lymphoma and non-lymphoma distinguisher.

Each sample was an input of 30 dimensions corresponding to the 30 genes' expression values, and the 96 samples were used to train the SOM (4x4) with a training length of 15 epochs. The trained map is shown in Fig. 2. The dark line separates lymphoma samples from non-lymphoma samples. The number within the brackets beside a sample label indicates the number of samples assigned to that node.

B. Task 2: A Class Distinguisher Between DLBCL and Non-DLBCL

A similar scheme was used to find the DLBCL (C1) and non-DLBCL (C2) distinguisher. In this case, only the 67 lymphoma samples were involved in the experiment. Again, 30 genes were selected using the DF method to form a class distinguisher, shown in Fig. 3.


Fig. 3. Selected 30 genes as DLBCL and non-DLBCL distinguisher.

Each lymphoma sample of dimension 30 can be used as an input to an SOM and be visualized in a similar way as in task 1. Due to space limitations, the figure of the trained map is not shown here.



C. Task 3: A Class Distinguisher Between Two Subtypes of DLBCLs

This task was to find a class distinguisher between GC B-like DLBCL and Activated B-like DLBCL. Using the same scheme and the 47 DLBCL samples, 30 genes which form the DLBCL subtype distinguisher were found, as shown in Fig. 4.


Fig. 4. Selected 30 genes as DLBCL subtype distinguisher.

Besides the original 3 tasks, an extra class distinguisher (a set of 30 genes) which distinguishes between FL and CLL was found, as shown in Fig. 5.


Fig. 6. Distribution of 67 lymphoma samples based on the three lymphoma distinguishers (set of 90 genes).

D. Testing Phase

In the testing phase, the quality of each of the previously discovered class distinguishers, i.e., how well they can classify samples, was examined. First of all, the dataset with the 90 selected genes was split into 3 subsets A, B and C, preserving approximately the same ratio of class 1 and class 2 samples in each subset. Their details are summarized in Tables I-III.

TABLE I

THREE SUBSETS OF ALL 96 SAMPLES


Fig. 5. Selected 30 genes as FL and CLL distinguisher.

So far, four class distinguishers have been found using the DF/SOM approach. The two distinguishers found in tasks 2 and 3, as well as the distinguisher between FLs and CLLs, which in total consist of 90 genes, were applied to the lymphoma samples. Each of the 67 lymphoma samples, now 90-dimensional, was used as an input to an SOM with a training length of 15 epochs. They were clearly distributed into 4 groups, as shown in Fig. 6.

Subset   No. of lymphoma   No. of non-lymphoma
A        22                9
B        22                10
C        23                10

TABLE II

THREE SUBSETS OF THE 67 LYMPHOMA SAMPLES

Subset   No. of DLBCL   No. of non-DLBCL
A        15             6
B        16             7
C        16             7

TABLE III

THREE SUBSETS OF THE 47 DLBCL SAMPLES

Subset   No. of GC B-like DLBCL   No. of Activated B-like DLBCL
A        8                        7
B        8                        8
C        8                        8


To test the experimental result of each of the 3 tasks, we first used the combination of any two of the subsets A, B and C as the training dataset to obtain a trained SOM, and then used the left-out subset for testing. For each sample in the testing dataset, its best matching unit on the map (the winning node) indicates to which class it belongs. The number of samples correctly classified (classification rate) in the testing phase is summarized in Table IV.
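The leave-one-subset-out procedure used here is generic, and can be sketched as follows. The helper and its callbacks (`train_fn`, `classify_fn`) are hypothetical names standing in for the SOM training and the winning-node class lookup:

```python
import numpy as np

def three_fold_rates(subsets, train_fn, classify_fn):
    """Leave-one-subset-out evaluation over subsets A, B and C.

    subsets     : dict name -> (X, y) with X of shape (n_samples, n_features)
    train_fn    : callable (X, y) -> model
    classify_fn : callable (model, x) -> predicted label
    Returns a dict: held-out subset name -> fraction correctly classified.
    """
    rates = {}
    for held in subsets:
        # train on the union of the two remaining subsets
        Xtr = np.vstack([subsets[k][0] for k in subsets if k != held])
        ytr = np.concatenate([subsets[k][1] for k in subsets if k != held])
        model = train_fn(Xtr, ytr)
        # test on the held-out subset
        Xte, yte = subsets[held]
        pred = np.array([classify_fn(model, x) for x in Xte])
        rates[held] = float((pred == yte).mean())
    return rates
```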

TABLE IV

TESTING OF RESULTS FOR DF/SOM APPROACH

Task 1
Training Data   Testing Data   Lymphoma   Non-Lymphoma
B+C             A              21/22      8/9
A+C             B              21/22      9/10
A+B             C              22/23      10/10

Task 2
Training Data   Testing Data   DLBCL      Non-DLBCL
B+C             A              14/15      6/6
A+C             B              15/16      7/7
A+B             C              16/16      7/7

Task 3
Training Data   Testing Data   GC B-like DLBCL   Activated B-like DLBCL
B+C             A              7/8               6/7
A+C             B              7/8               8/8
A+B             C              7/8               8/8

Based on the testing results in Table IV, it can be concluded that this DF/SOM approach can generate class distinguishers with reasonably good classification rates.

A future unknown sample can first be classified into either lymphoma or non-lymphoma using the trained map shown in Fig. 2. If it happens to be a lymphoma sample, it can be further classified into one of the four types of lymphoma using the trained map shown in Fig. 6.

III. APPROACH 2: BASED ON SINGLE LAYER PERCEPTRON

The second approach for discovering class distinguishers is to adopt a single layer neural network [7], [8], as proposed by Narayanan et al. [9]. In this method, the weights linking input gene nodes to the output node are examined, and thresholds based on standard deviations are calculated to provide a selection criterion for a smaller group of genes. Thereafter, these genes are used in a second round of the gene selection procedure.

We adopt a method to change the DNA microarray data into a three-valued gene expression, in which each expression value is set to 1, 0.5 or 0. Firstly, for every gene g, whose expression vector is v(g) = (eg1, eg2, ..., eg96), the mean expression value over all 96 samples is calculated as:

mean(g) = (1/96)(eg1 + eg2 + ... + eg96). (4)

Two thresholds Thr1 and Thr2 were calculated as

Thr1 = mean(g) - (1/5)*(mean(g) - min(g)), (5)
Thr2 = mean(g) + (1/5)*(max(g) - mean(g)), (6)

where min(g) and max(g) represent the smallest and largest expression values across the 96 samples for gene g. All expression values greater than Thr2 were set to 1, the expression values smaller than Thr1 were set to 0, and the expression values between the two thresholds were set to 0.5. The purpose of this transformation is to remove noise existing in the dataset and to reduce the computational complexity.

A. Task 1: A Class Distinguisher Between Lymphoma and Non-lymphoma

The training and testing datasets in Tables I-III were also used in this approach. The backpropagation algorithm [9] with a learning rate of 0.01 was chosen to create and train the single layer neural networks. The weights were randomly initialized. There are 4026 input nodes, each corresponding to one gene; therefore, the weight vector has 4026 dimensions. To train the neural networks, each sample was used as an input. The activation value of the output node is the sum of the weighted inputs, as given by:

Y(s) = (wg1*sg1 + wg2*sg2 + ... + wg4026*sg4026), (7)

where Y(s) is the activation value of the output node for a sample s, and wgi*sgi is the weight of the link between input node i and the output node multiplied by the expression value of gene gi in s. The transfer function for the output node is a logistic/sigmoid with output values between 0 and 1. The learning rule is the standard perceptron learning rule: wi(t+1) = wi(t) + n(d - y)s, where the connection weight wi at iteration t+1 is the sum of its weight at iteration t and the difference between the desired output d and the actual output y for an input sample s, multiplied by the step size n that can vary between 0 and 1.
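The three-valued discretization described above (eqs. (5) and (6)) can be sketched as follows; the function name and (genes x samples) array layout are ours:

```python
import numpy as np

def three_value_discretize(X):
    """Map each gene's expression values to {0, 0.5, 1} using the
    per-gene thresholds Thr1 and Thr2 of eqs. (5)-(6).

    X: (n_genes, n_samples) array of raw expression values.
    """
    mean = X.mean(axis=1, keepdims=True)
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    thr1 = mean - (mean - lo) / 5.0   # eq. (5)
    thr2 = mean + (hi - mean) / 5.0   # eq. (6)
    out = np.full_like(X, 0.5, dtype=float)  # values between the thresholds
    out[X > thr2] = 1.0
    out[X < thr1] = 0.0
    return out
```

For a gene with values (0, 5, 10), mean is 5, Thr1 is 4 and Thr2 is 6, so the row becomes (0, 0.5, 1).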

Three neural networks were created. Firstly, the combination of subsets B and C was used to train network 1 (100 epochs); secondly, the combination of subsets A and C was used to train network 2 (100 epochs); and thirdly, the combination of subsets A and B was used to train network 3. In each training process, the target output for a lymphoma sample was 1 and for a non-lymphoma sample was 0.

After training of the three neural networks, three sets of weights were obtained, which are expressed as

W1 = [w1-1, w1-2, ..., w1-4026],
W2 = [w2-1, w2-2, ..., w2-4026],
W3 = [w3-1, w3-2, ..., w3-4026]. (8)
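The training of each network can be sketched as follows. This is a minimal sketch of eq. (7) with a sigmoid output and the perceptron-style update rule; the initialization scale and sample ordering are our assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_perceptron(S, d, eta=0.01, epochs=100, seed=0):
    """Single-output perceptron trained with w(t+1) = w(t) + eta*(d - y)*s.

    S : (n_samples, n_genes) discretized expression values
    d : (n_samples,) desired outputs (1 = class 1, 0 = class 2)
    Returns the learned weight vector, one weight per gene.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=S.shape[1])  # small random initialization
    for _ in range(epochs):
        for s, target in zip(S, d):
            y = sigmoid(w @ s)            # weighted sum, eq. (7), through the sigmoid
            w += eta * (target - y) * s   # perceptron-style weight update
    return w
```

Training this three times on the three pairs of subsets yields the weight vectors W1, W2 and W3 of eq. (8).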


These weight vectors were then added together to get Wsum = [ws-1, ws-2, ..., ws-4026]. The mean M and SD were calculated across all the dimensions of the vector Wsum, and two thresholds were set as Thr3 = M + 2*SD and Thr4 = M - 2*SD. If the value of the weight ws-g is greater than Thr3 or less than Thr4, gene g survives; otherwise it is eliminated. In this round, 180 genes out of the original 4026 genes satisfied the above criteria. These 180 gene values were then extracted from the full database, and were again split into three subsets A, B and C, with each input now having only 180 dimensions. The above scheme was repeated, and this time, using the

above threshold criteria, only 10 genes survived. Since the reduction from 180 genes to 10 appeared too severe, a more relaxed criterion for setting the thresholds, i.e., M ± SD, was used. In this way, the 180 genes were reduced to 48 genes. The expression levels of these 48 genes were used in the subsequent classification tasks. To test the results of task 1, we extracted the 48 genes'

expression values together with the 96 samples' class information, and then split them into the 3 subsets A, B and C. Firstly, we used B and C to train a neural network with 48 input nodes and a learning rate of 0.01, and then used A as the testing data. For each sample s in subset A, if the output from the neural network was above 0.9, s was classified as lymphoma, and if the output was below 0.1, s was classified as non-lymphoma. Thereafter, we used the other two combinations of subsets to train a neural network and the left-out subset to test, in a similar way. The classification rates are summarized in Table V.
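The weight-threshold gene selection described above can be sketched as follows (function name and array layout are ours):

```python
import numpy as np

def select_genes_by_weight(weight_sets, n_sd=2.0):
    """Keep genes whose summed weight lies more than n_sd standard
    deviations away from the mean (the Thr3/Thr4 criterion).

    weight_sets : sequence of weight vectors (e.g. W1, W2, W3),
                  shape (n_networks, n_genes)
    Returns the indices of the surviving genes.
    """
    w_sum = np.sum(weight_sets, axis=0)        # Wsum = W1 + W2 + W3
    m, sd = w_sum.mean(), w_sum.std()
    keep = (w_sum > m + n_sd * sd) | (w_sum < m - n_sd * sd)
    return np.flatnonzero(keep)
```

Passing n_sd=1.0 reproduces the relaxed M ± SD criterion used in the second round.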

TABLE V

TESTING RESULTS FOR APPROACH 2, TASK 1

Task 1
Training Data   Testing Data   Lymphoma   Non-Lymphoma
B+C             A              21/22      9/9
A+C             B              20/22      10/10
A+B             C              22/23      10/10

B. Task 2: A Class Distinguisher Between DLBCL and Non-DLBCL

We used a similar scheme and the subsets in Table II to find a DLBCL and non-DLBCL distinguisher. After the first round of reduction, the 4026 genes were reduced to 184, and they were further reduced to 56 genes in the second round. Testing was done in the same way as in task 1. Table VI summarizes the results.

TABLE VI

TESTING RESULTS FOR APPROACH 2, TASK 2

Task 2
Training Data   Testing Data   DLBCL   Non-DLBCL
B+C             A              15/15   6/6
A+C             B              16/16   7/7
A+B             C              16/16   7/7

C. Task 3: A Class Distinguisher Between Two Subtypes of DLBCLs

In this task, the three subsets shown in Table III were involved. The 4026 genes were reduced to 176 in the first round, and further reduced to 51 genes in the second round. Testing results are summarized in Table VII.

TABLE VII

TESTING RESULTS FOR APPROACH 2, TASK 3

Task 3
Training Data   Testing Data   GC B-like DLBCL   Activated B-like DLBCL
B+C             A              8/8               7/7
A+C             B              8/8               8/8
A+B             C              8/8               8/8

From Tables V-VII, it can be noticed that this approach provides higher classification rates than approach 1. For task 1, only 4 lymphoma samples were misclassified, and for tasks 2 and 3, all samples were properly classified.

A future unknown case need only be used as an input to a neural network trained with a dataset consisting of the class distinguisher genes found using this approach. By examining the output value, the unknown sample can be classified.

IV. MAJOR CONTRIBUTIONS

We improved the methods presented in the two papers [4], [9] and applied them to the Alizadeh et al. dataset [3]. The method in paper [4] uses a statistical method for unknown sample classification. The Discrimination Factor approach we proposed uses an SOM instead, which is relatively simpler since no complex computation needs to be involved. Besides unknown sample classification, it can capture and visualize the distribution of samples as well.

In the Single Layer Perceptron neural network-based approach, we proposed a new mathematical method to change the original dataset into a discrete-valued one, instead of using the commercial software used in [9]. The testing results show that this alternative works well.

Approach 1 is easy to implement and relatively faster at finding class distinguisher genes. It can produce accurate results, and the number of genes in the class distinguisher can be specified. An SOM in this case is chosen for classification of future unknown samples, so the distribution of samples can also be visualized. This approach is suitable for datasets with different class distinctions. Approach 2 produces even more accurate results, and its

testing phase and classification of unknown samples is simple and straightforward. It is generally applicable for most datasets. However, it has a disadvantage: the first round of gene reduction is always slow due to the high dimensionality of the inputs and weights of the neural networks. Therefore, it


takes a longer time to train the neural networks for a huge dataset. Moreover, the number of genes in the class distinguishers cannot be specified beforehand. A comparison of their timing is summarized in Table VIII.

TABLE VIII

COMPARISON BETWEEN THE TIMING ISSUES OF THE TWO APPROACHES

Task   Time for Finding Class Distinguisher
       Approach 1   Approach 2
1      4 s          770 s
2      3 s          687 s
3      3 s          635 s

In comparison to another neural network-based sample classification approach named SFAM [10], which has also been applied to the Alizadeh et al. dataset, the two approaches we proposed have several advantages, summarized as follows:

1) The two approaches we proposed have both gene selection and sample classification functionality. In contrast, the SFAM approach in [10] can only classify samples based on the genes selected in the Alizadeh et al. experiment [3].

2) The approaches we proposed have higher classification accuracy than the SFAM approach. Accuracy [10] is defined as:

Accuracy = (TP + TN)/N, (9)

where TP (true positive) is the number of class 1 samples correctly identified, TN (true negative) is the number of class 2 samples correctly identified, and N is the total number of samples involved. The SFAM approach was used to classify 45 DLBCL samples and 18 normal samples, and the highest accuracy obtained was 76%. In contrast, when classifying 67 lymphoma samples and 29 normal samples, the two approaches we proposed gave an accuracy of at least 93.5%.

V. CONCLUSION

Nowadays, DNA microarrays are widely applied in the field of cancer classification; the high dimensionality of the data matrix is a big challenge for data analysis and meaningful information extraction. Various approaches based on neural networks can be applied to gene profiling datasets for gene selection and dimensionality reduction. The two approaches we proposed here are based on the methods presented in other papers [4], [9], but have been modified so that they are improved in terms of functionality, difficulty of implementation, or cost of implementation/application. These two approaches are also compared in terms of accuracy, implementation and execution time, so that different approaches can be selected on a case-by-case basis. Moreover, they were compared with the SFAM approach [10] and offer advantages in terms of functionality and classification accuracy.

REFERENCES

[1] K. Wang et al., "Monitoring gene expression profile changes in ovarian carcinomas using cDNA microarray," Gene, vol. 299, pp. 101-108, Mar. 1999.

[2] J. DeRisi et al., "Use of cDNA microarray to analyze gene expression patterns in human cancers," Nature Genet., vol. 14, pp. 457-460, Dec. 1996.

[3] A. A. Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature, vol. 403, pp. 503-511, 2000.

[4] T. R. Golub et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, pp. 531-537, Oct. 1999.

[5] J. C. Bezdek et al., "A note on self-organizing semantic maps," IEEE Transactions on Neural Networks, vol. 6, pp. 1029-1035, Sep. 1995.

[6] J. Vesanto et al., "Clustering of the self-organizing map," IEEE Transactions on Neural Networks, vol. 11, pp. 586-600, May 2000.

[7] S. Dennis et al., "Introduction to Neural Networks," available online: www2.psy.uq.edu.au/~brainway/Manual/WhatIs.html, 1997.

[8] S. Haykin, Neural Networks: A Comprehensive Foundation, New Jersey: Prentice Hall, Inc., 1999.

[9] A. Narayanan et al., "Single-layer artificial neural networks for gene expression analysis," Neurocomputing, vol. 61, pp. 217-240, 2004.

[10] F. Azuaje, "A computational neural approach to support the discovery of gene function and classes of cancer," IEEE Transactions on Biomedical Engineering, vol. 48, pp. 332-338, Mar. 2001.
