



A Mixed Factors Model for Dimension Reduction and Extraction of a Group Structure in Gene Expression Data

Ryo Yoshida
The Graduate University for Advanced Studies,
4-6-7 Minami-Azabu, Minato-ku, Tokyo, 103-8569, Japan
[email protected]

Tomoyuki Higuchi
Institute of Statistical Mathematics,
4-6-7 Minami-Azabu, Minato-ku, Tokyo, 103-8569, Japan
[email protected]

Seiya Imoto
Human Genome Center, Institute of Medical Science, University of Tokyo,
4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
[email protected]

Abstract

When we cluster tissue samples on the basis of genes, the number of observations to be grouped is much smaller than the dimension of the feature vector. In such a case, the applicability of conventional model-based clustering is limited, since the high dimensionality of the feature vector leads to overfitting during the density estimation process. To overcome this difficulty, we attempt a methodological extension of factor analysis. Our approach enables us not only to prevent overfitting, but also to handle the issues of clustering, data compression and extraction of a set of genes relevant to explaining the group structure. The potential usefulness is demonstrated with an application to the leukemia dataset.

1. Introduction

Cluster analysis of microarray gene expression data plays an important role in the automated search for, and validation of, various classes in both tissue samples [2, 5, 7] and genes [4, 19]. A distinctive feature of microarray datasets is that the number of tissue samples is much smaller than the number of genes. Our work in this study focuses on the clustering of tissue samples.

For this purpose, one of the most widely used techniques has been hierarchical clustering. Although the method has contributed to finding distinct subtypes of disease [2, 5], it offers little systematic guidance for solving practical questions, e.g. which genes are relevant to explaining the group structure, and how many clusters there are. Furthermore, it is not obvious how this approach should classify a set of future samples, since the decision boundary is not explicit.

Model-based clustering using the finite mixture model is a well-established technique for finding groups in multivariate datasets [10, 11, 17]. Most commonly, attention has focused on the use of Gaussian mixtures because of their computational convenience. However, in some microarray experiments, the applicability of this approach is limited. That is, when we wish to cluster tissue samples on the basis of genes, the sample size is much smaller than the dimension of the feature vector. In such a case, finite mixture models often lead to overfitting during the density estimation process. Therefore, we need to reduce the dimensionality of the data before proceeding to cluster analysis.

Principal component analysis (PCA, [3]) is a commonly used method for reducing the dimensionality of microarrays [11, 12]. In spite of its usefulness, PCA is not justified in the clustering context, since the projections of the data corresponding to the dominant eigenvalues do not necessarily reflect the presence of groups in a dataset. Most such limitations are related to the fact that PCA only takes into consideration the second-order characteristics of the data. Some authors have given illustrations in which PCA fails to reveal underlying groups [6, 17].

Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB 2004) 0-7695-2194-0/04 $20.00 © 2004 IEEE


This article attempts a methodological extension of factor analysis. In our model, referred to as the mixed factors model, the factor variable plays a role in giving a parsimonious description of clusters in the feature space. The model presents a parsimonious parameterization of the Gaussian mixture. Consequently, our approach enables us to prevent overfitting in the density estimation even when the dimension of the data is in the thousands. The applications of the mixed factors analysis cover the issues of clustering and dimension reduction. The reduced-dimensionality representation of the original data is constructed as a plausible estimate of the signal revealing the existence of a group structure. The proposed method can also extract genes relevant to explaining the group structure. In this process, sets of genes that are functionally related are detected automatically.

The mixture of factor analyzers (MFA, [16, 17]), which is an extension of the mixture of probabilistic principal component analyzers (MPPCA, [21]), is closely related to our model. Our mixed factors model is distinguished from MFA and MPPCA in terms of the parameterization of the Gaussian mixture. While the number of free parameters of MFA and MPPCA grows quickly as the number of clusters becomes large, our approach mitigates this difficulty. Some researchers might be motivated to group the tissue samples into a large number of clusters across several thousand genes. In such situations, the superiority of our method will be apparent.

The rest of this article is organized as follows. In Section 2, we present the mixed factors model and outline some materials used in the following sections. Our model is also discussed in relation to MFA and MPPCA. Section 3 covers the EM algorithm for the maximum likelihood estimation of our model. Section 4 describes the procedures of clustering, data compression, visualization and selection of genes relevant to explaining the presence of groups. In Section 5, the potential usefulness of our approach is demonstrated with an application to a well-known dataset, the leukemia data [12]. Finally, concluding remarks are given in Section 6.

2. Mixed Factors Model

2.1. Probability Model

Let x be an observed variable distributed over R^d. When we cluster the tissue samples across the genes, the dimension of the feature vector, d, corresponds to the number of genes, which typically ranges from 10^2 to 10^4. The basic idea underlying factor analysis [3] is to relate x to the factor variable f ∈ R^q:

x = Λf + ε. (1)

Here q < d, and ε is an observational noise taken to be Gaussian, ε ~ N(0, Ψ), with Ψ = diag(ψ_1, …, ψ_d). The matrix Λ of order d × q contains the factor loadings and is referred to as the factor loading matrix. The factor variable and the observational noise are conventionally assumed to be mutually independent random variables. In the following discussions, we assume that the noise covariance is isotropic, i.e. Ψ = σ²I_d, where I_d denotes the d-dimensional identity matrix. Note that, for a given factor variable, the observed variable is distributed according to x | f ~ N(Λf, σ²I_d). A key motivation of the usual factor analysis is that, owing to the diagonality of the noise covariance matrix, all variables in x = (x_1, …, x_d)' are conditionally independent given the factor variable. The factor variable can therefore be regarded as giving a parsimonious explanation for the dependencies in x.

A key idea of the proposed method is to describe clusters on R^q by using the factor variable. Suppose that the total population of f consists of G subpopulations, C_1, …, C_G. Here we let c = (c_1, …, c_G)' be a vector of unknown class labels which indicates the subpopulations by

c_g = 1 if f ∈ C_g, and c_g = 0 otherwise.

Consider that c follows a multinomial distribution, c ~ M(1; π), with probabilities π = (π_1, …, π_G)', and that given c_g = 1, the factor variable is distributed as a Gaussian, f | c_g = 1 ~ N(μ_g, Σ_g). Then the unconditional distribution of f is given by the G-component Gaussian mixture with density

p(f) = Σ_{g=1}^G π_g φ(f; μ_g, Σ_g). (2)

Here φ(f; μ_g, Σ_g) denotes the normal density with mean μ_g and covariance matrix Σ_g. We refer to the observed system composed of (1) and (2) as the mixed factors model.

For this model, the complete data is taken to be y = (x', f', c')', where the missing variables correspond to f and c. The model of the complete data y is factorized as

p(y) = p(x | f) p(f | c) p(c), (3)

where

p(x | f) = φ(x; Λf, σ²I_d),

p(f | c) = Π_{g=1}^G φ(f; μ_g, Σ_g)^{c_g},

p(c) = Π_{g=1}^G π_g^{c_g}.
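As an illustration, the generative process defined by (1)-(3) can be sketched as below. All sizes and parameter values (d, q, G, π, σ² and so on) are toy assumptions chosen for the sketch, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, G = 50, 2, 3                      # genes, factors, clusters (toy sizes)

# Orthonormal loading matrix via QR, so that Lam' Lam = I_q holds.
Lam, _ = np.linalg.qr(rng.standard_normal((d, q)))
sigma2 = 0.1                            # isotropic noise variance
pi = np.array([0.5, 0.3, 0.2])          # mixing proportions
mu = rng.standard_normal((G, q))        # component means in factor space
Sig = np.abs(rng.standard_normal((G, q))) + 0.5   # diagonal component variances

def sample(n):
    """Draw n observations from the mixed factors model (1)-(2)."""
    c = rng.choice(G, size=n, p=pi)                                 # class labels
    f = mu[c] + rng.standard_normal((n, q)) * np.sqrt(Sig[c])       # f | c ~ N(mu_g, Sig_g)
    x = f @ Lam.T + np.sqrt(sigma2) * rng.standard_normal((n, d))   # x = Lam f + eps
    return x, f, c

x, f, c = sample(200)
print(x.shape)  # (200, 50)
```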



Notice that p(x | f, c_g = 1) = p(x | f) holds in the mixed factors model. This implies that given a factor variable, the class label has no effect on the conditional distribution of x. The presence of clusters on R^d is completely explained by the factor variable.

Under this generative model, the observed variable is unconditionally distributed as the G-component Gaussian mixture,

p(x) = Σ_{g=1}^G π_g φ(x; Λμ_g, ΛΣ_gΛ' + σ²I_d). (4)

Thus, the feature variable is characterized by clusters centered at the means Λμ_g, in which each group size is defined by the mixing proportion π_g, g = 1, …, G. The covariance matrix ΛΣ_gΛ' + σ²I_d imposes a geometric feature on the gth cluster. For a Gaussian mixture with unrestricted covariance, there are d(d+1)/2 distinct parameters in each component covariance matrix. In cluster analysis of microarray data, the unrestricted covariance leads to overfitting in the density estimation process, since the number of free parameters grows quickly as d becomes large. In such situations, the mixed factors model gives a natural approach to a parsimonious parameterization of the Gaussian mixture. The occurrence of overfitting can be avoided by choosing q to be appropriate for the given dimensionality of the data and the specified number of components.

2.2. Rotational Ambiguity and Parameter Restriction

The parameter vector θ in the mixed factors model consists of all elements in Λ, σ² and the component parameters π_g, μ_g, Σ_g for g ∈ {1, …, G}. While our approach provides a parsimonious parameterization of the Gaussian mixture, such modeling leads to a lack of identifiability of the parameter space.

Let A be any nonsingular matrix of order q. Note that each component mean and covariance in (4) are invariant under the manipulation by A, as

Λμ_g = (ΛA⁻¹)(Aμ_g),

ΛΣ_gΛ' = (ΛA⁻¹)(AΣ_gA')(ΛA⁻¹)'.

This occurs when we apply a nonsingular linear transformation so that f → Af, Λ → ΛA⁻¹. Therefore, there is an infinity of choices for μ_g, Σ_g and Λ.

To avoid the nonidentifiability, we need to reduce the degrees of freedom of the parameters by imposing q² constraints, which corresponds to the order of A. In this paper, we consider a natural approach as follows:

(a) Σ_g = diag(σ_{g1}, …, σ_{gq}), g ∈ {1, …, G},

(b) Λ'Λ = I_q.

The conditions (a) and (b) impose q(q−1)/2 and q(q+1)/2 restrictions on the number of free parameters, respectively. The total number of restrictions thus comes to the q² to be imposed. The condition (b) offers the orthonormality of the q columns of the factor loading matrix, Λ = (λ_1, …, λ_q), i.e. λ_j'λ_j = 1, j ∈ {1, …, q}, and λ_j'λ_k = 0 for all j ≠ k. Consequently, the mixed factors model has unknown parameters with degrees of freedom

N(θ) = dq + G(2q + 1) − q(q + 1)/2. (5)

Note that N(θ) grows only in proportion to q as d becomes large, and also that it grows only in proportion to 2q + 1 as G becomes large. As will be remarked in the next discussion, this property is desirable in situations where researchers are motivated to group the tissue samples into a large number of clusters across genes.
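Under the reconstruction of (5) used here, the count and its growth rates can be checked numerically; `n_free_params` is an illustrative helper name:

```python
def n_free_params(d, q, G):
    """Degrees of freedom of the mixed factors model, N(theta) in (5):
    dq + G(2q + 1) - q(q + 1)/2 (as reconstructed here)."""
    return d * q + G * (2 * q + 1) - q * (q + 1) // 2

# Adding one cluster adds only 2q + 1 parameters, regardless of d.
d, q = 3000, 3
print(n_free_params(d, q, 3) - n_free_params(d, q, 2))  # 7
```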

2.3. Related Models

The mixture of factor analyzers (MFA, [16, 17]) presents the G-component Gaussian mixture as follows:

p(x) = Σ_{g=1}^G π_g φ(x; μ_g, Λ_gΛ_g' + Ψ_g). (6)

Here Λ_g is a d × q matrix and Ψ_g = diag(ψ_{g1}, …, ψ_{gd}). This is a generalization of the mixture of probabilistic PCA (MPPCA, [21]), which takes Ψ_g to be isotropic, Ψ_g = σ_g²I_d. It was shown by [17, 21] that there implicitly exists a generative model behind this form of Gaussian mixture. Suppose that we have G submodels in which each gth model is given by

x = Λ_g f + μ_g + ε_g

for g ∈ {1, …, G}. Here the q-dimensional latent variable f is defined to be Gaussian, f ~ N(0, I_q), and the noise model is ε_g ~ N(0, Ψ_g). Mixing these submodels with probabilities π_g, g ∈ {1, …, G}, one obtains the Gaussian mixture in the form of (6).

As with our approach, MFA also presents a parsimonious parameterization of the Gaussian mixture model. This model characterizes more flexible geometric features of clusters than those of the mixed factors model. However, in the cluster analyses faced in microarray studies, it may still suffer from the problem of overfitting. It follows from [17] that the number of free parameters of MPPCA is given by

N_MPPCA(θ) = (G − 1) + G{d + dq − q(q − 1)/2 + 1}.

Table 1 shows a comparison of the number of free parameters between the mixed factors model and MPPCA for varying G and d.

[Table 1. Comparison of the number of free parameters between the mixed factors model (top) and the mixture of probabilistic principal component analyzers (bottom), against specified values of the number of groups G and the dimension of the feature vector d.]

For instance, consider a set of d-dimensional observations with d in the thousands, to be grouped into two clusters using a low-dimensional latent variable. This situation is typical in microarray studies. The number of free parameters of MPPCA is then several times larger than that of the mixed factors model, and the gap widens further as G grows. While N_MPPCA(θ) becomes large quickly as G becomes large, our approach keeps the increase in the number of parameters small. In essence, the challenge faced by biological scientists is to use large-scale datasets, as many organisms comprise a large number of genes; e.g. the genome of Saccharomyces cerevisiae contains more than 6000 genes. In such situations, we cannot expect estimates obtained from the MFA and MPPCA approaches to be reliable, and so the scope of their application is limited.
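A hedged sketch of the comparison in Table 1: `n_mppca` follows the standard MPPCA parameter count and `n_mixed_factors` the reconstruction of (5) used here; both function names are illustrative.

```python
def n_mixed_factors(d, q, G):
    # dq + G(2q + 1) - q(q + 1)/2, per the reconstructed (5)
    return d * q + G * (2 * q + 1) - q * (q + 1) // 2

def n_mppca(d, q, G):
    # (G - 1) mixing proportions, plus d mean parameters, dq - q(q-1)/2
    # effective loading parameters and 1 variance per component.
    return (G - 1) + G * (d + d * q - q * (q - 1) // 2 + 1)

# Per extra cluster: mixed factors adds 2q + 1 parameters, MPPCA adds ~d(q + 1).
for G in (2, 5, 10):
    print(G, n_mixed_factors(2000, 2, G), n_mppca(2000, 2, G))
```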

2.4. Posterior Distributions

An objective of the mixed factors analysis is to find plausible values of f and c based on an observation x. This is achieved by suitable estimators that are close to the estimands on average. Our analysis aims to reduce the dimensionality of the data by estimating the factor variable and, simultaneously, to divide the feature space R^d by attributing the labels c(x) to all x ∈ R^d. The most common approach to estimating the latent random variables is based on Bayes' rule. We will revisit these issues in the later sections. In this context, the posterior distributions of f and c play a key role in constructing the Bayes estimators. Hence, in the sequel we investigate their functional forms.

Let f(x) = Λ'x and δ = Λ'ε, that is, the orthogonal transformations of x and ε onto R^q. Then, the generative model (1) can be replaced by

f(x) = f + δ, (7)

where δ ~ N(0, σ²I_q). Given this formulation, the q-dimensional variable f(x) is distributed according to the G-component Gaussian mixture

p(f(x)) = Σ_{g=1}^G π_g φ(f(x); μ_g, Σ_g + σ²I_q). (8)

Here the gth component model is defined by p(f(x) | c_g = 1) = φ(f(x); μ_g, Σ_g + σ²I_q). Notice that the density of the observed variable can be rewritten as

p(x) = (2πσ²)^{−(d−q)/2} exp{−‖(I_d − ΛΛ')x‖²/(2σ²)} Σ_{g=1}^G π_g φ(f(x); μ_g, Σ_g + σ²I_q), (9)

if the parameters satisfy the restrictions (a) and (b).

Using Bayes' theorem, the posterior probability of c_g = 1 can be assessed by

P(c_g = 1 | x) = π_g φ(f(x); μ_g, Σ_g + σ²I_q) / Σ_{h=1}^G π_h φ(f(x); μ_h, Σ_h + σ²I_q), (10)

for g ∈ {1, …, G}. Each value assigns the membership probability that an observation x belongs to the gth subpopulation C_g. Therefore, the posterior distribution of c is a multinomial distribution having the probability mass function p(c | x) = Π_{g=1}^G P(c_g = 1 | x)^{c_g}. Note that the probability of belonging is defined on the q-dimensional variable f(x), and so we can write P(c_g = 1 | x) = P(c_g = 1 | f(x)) without loss of generality.
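The membership probabilities (10) can be computed entirely in the q-dimensional factor space. A minimal sketch, assuming diagonal Σ_g as in restriction (a); the helper names are illustrative:

```python
import numpy as np

def log_gauss_diag(fx, mean, var):
    """Log density of N(mean, diag(var)) evaluated at the rows of fx."""
    return -0.5 * np.sum((fx - mean) ** 2 / var + np.log(2 * np.pi * var), axis=1)

def responsibilities(fx, pi, mu, Sig, sigma2):
    """Posterior P(c_g = 1 | f(x)) as in (10), computed in the q-dim space.
    fx: (n, q) projected data Lambda'x; Sig: (G, q) diagonal variances."""
    logp = np.stack([np.log(pi[g]) + log_gauss_diag(fx, mu[g], Sig[g] + sigma2)
                     for g in range(len(pi))], axis=1)
    logp -= logp.max(axis=1, keepdims=True)   # log-sum-exp trick for stability
    w = np.exp(logp)
    return w / w.sum(axis=1, keepdims=True)
```

Working with the q-dimensional densities (rather than the d-dimensional ones) is what keeps the computation stable and cheap.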

Next let us consider the posterior distribution of the factor variable. Given c_g = 1, the pair (f(x)', f')' is jointly distributed as a 2q-dimensional Gaussian with mean (μ_g', μ_g')' and covariance matrix

( Σ_g + σ²I_q   Σ_g
  Σ_g           Σ_g ).

Then, it follows from the standard properties of the multivariate normal distribution that

f | c_g = 1, x ~ N(m_g(x), σ²Γ_g),

Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB 2004) 0-7695-2194-0/04 $20.00 © 2004 IEEE

Page 5: A Mixed Factors Model for Dimension Reduction and ...bonsai.hgc.jp/~imoto/yoshida_csb2004.pdfA Mixed Factors Model for Dimension Reduction and Extraction of a Group Structure in Gene

where the posterior mean and covariance are given by

m_g(x) = μ_g + Σ_g(Σ_g + σ²I_q)⁻¹(f(x) − μ_g)
       = (Σ_g + σ²I_q)⁻¹{Σ_g f(x) + σ²μ_g},

and

σ²Γ_g, with Γ_g = Σ_g(Σ_g + σ²I_q)⁻¹. (11)

Here the q × q matrix Γ_g is diagonal, so that each diagonal element consists of the signal-to-noise ratio σ_{gj}/(σ_{gj} + σ²), j ∈ {1, …, q}, corresponding to the generative model (7), where σ_{gj} denotes the jth diagonal element of Σ_g. Hence, the posterior distribution of f takes the form of a Gaussian mixture;

p(f | x) = Σ_{g=1}^G P(c_g = 1 | f(x)) φ(f; m_g(x), σ²Γ_g). (12)

The mixing proportions correspond to the posterior probabilities of belonging. These posterior distributions are used to estimate θ (Section 3) and to construct Bayes estimators for the latent variables, e.g. the posterior expectation and the maximum a posteriori (MAP) estimator.
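The posterior moments in (11) reduce to elementwise shrinkage when Σ_g is diagonal. A minimal sketch under that assumption; `factor_posterior` is an illustrative name:

```python
import numpy as np

def factor_posterior(fx, mu_g, sig_g, sigma2):
    """Posterior mean and diagonal covariance of f given c_g = 1, per (11):
    m_g(x) shrinks f(x) toward mu_g by the signal-to-noise ratios."""
    gamma = sig_g / (sig_g + sigma2)        # diagonal of Gamma_g
    mean = gamma * fx + (1 - gamma) * mu_g  # equals mu_g + gamma*(f(x) - mu_g)
    cov = sigma2 * gamma                    # diagonal of sigma^2 * Gamma_g
    return mean, cov
```

As the noise variance σ² shrinks to zero, γ → 1 and the posterior mean collapses onto the projection f(x), as one would expect.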

Finally, note that we have no need to calculate a d-dimensional density to assess either (10) or (9). In microarray studies, where the dimension of the feature vector ranges from 10^2 to 10^4, the direct calculation based on the form P(c_g = 1 | x) = π_g φ(x; Λμ_g, ΛΣ_gΛ' + σ²I_d)/p(x) and (4) might fail, since the high-dimensional densities take extremely small values, and are also computationally very demanding. In contrast to such intractability, our approach of parameter restriction provides a way of saving computational resources and avoiding numerical underflow.

3. Model Estimation

3.1. EM Algorithm

Suppose that we have a sample of size N, X = (x_1, …, x_N), where X denotes a data matrix of order d × N. Given the data, the mixed factors model can be fitted under the principle of maximum likelihood, although there exists no closed-form solution for the estimator. The EM algorithm is a general approach to maximum likelihood estimation for problems in which a probability model includes one or more latent variables.

We let f_n = (f_{n1}, …, f_{nq})' and c_n = (c_{n1}, …, c_{nG})' be the realizations of the factor variable and the class label vector corresponding to the nth observation x_n. The complete data y_n = (x_n', f_n', c_n')' are assumed to be an i.i.d. sample drawn from (3). Suppose that we have an estimate θ* at the current step of the EM algorithm. The method alternates between two steps, the E step and the M step.

Firstly, consider updating the jth column of the factor loading matrix, λ_j, while all parameters except λ_j are fixed at the obtained values. The evaluation of λ_j must take account of the constraints λ_k'λ_j = 0 for all k < j. This can be achieved by the use of Lagrange multipliers η_k for k < j. Then, the complete data log-likelihood with the Lagrange terms is given by

ℓ(λ_j) = −(1/2σ²) Σ_{n=1}^N ‖x_n − Λf_n‖² + Σ_{k<j} η_k λ_k'λ_j, (13)

where the λ_k, k ≠ j, denote the fixed values and we have omitted the terms independent of λ_j. Here we have no concern with the norm of λ_j, since this can be rescaled a posteriori as will be seen later. But it is assumed that all λ_k for k < j have been normalized such that ‖λ_k‖ = 1.

By taking the derivative of (13) with respect to λ_j and multiplying it by some constants, one can obtain the gradient as follows:

∂ℓ/∂λ_j ∝ Σ_{n=1}^N f_{nj} x_n − Σ_{k=1}^q λ_k Σ_{n=1}^N f_{nk} f_{nj} + σ² Σ_{k<j} η_k λ_k.

Here the quantities Σ_n f_{nj} x_n and Σ_n f_n f_{nj} are summarized in the jth columns of the matrices of sufficient statistics,

S_{xf} = Σ_{n=1}^N x_n f_n',   S_{ff} = Σ_{n=1}^N f_n f_n'. (14)

In the E step, we replace f_n and f_n f_n' by their conditional expectations with respect to p(f | x_n) in (12), where the posterior distribution is assessed at the current values of the parameters.

These are given by the jth columns of

⟨S_{xf}⟩ = Σ_{n=1}^N x_n E[f_n | x_n]',   E[f_n | x_n] = Σ_{g=1}^G P(c_{ng} = 1 | x_n) m_g(x_n), (15)

⟨S_{ff}⟩ = Σ_{n=1}^N E[f_n f_n' | x_n],   E[f_n f_n' | x_n] = Σ_{g=1}^G P(c_{ng} = 1 | x_n){σ²Γ_g + m_g(x_n)m_g(x_n)'}, (16)

where again, all posterior quantities are evaluated at the current values of the parameters. Then, these give the estimating equation for λ_j as follows;

Σ_{n=1}^N ⟨f_{nj}⟩ x_n − Σ_{k=1}^q λ_k Σ_{n=1}^N ⟨f_{nk} f_{nj}⟩ + σ² Σ_{k<j} η_k λ_k = 0, (17)

where ⟨·⟩ denotes the conditional expectation given x_n at the current parameter values.



To find a solution for η_k, consider multiplying (17) by λ_k'. Imposing λ_k'λ_j = 0, this leads to

σ²η_k = Σ_{n=1}^N ⟨f_{nk} f_{nj}⟩ − λ_k' Σ_{n=1}^N ⟨f_{nj}⟩ x_n

for all k < j. Further, substituting σ²η_k into (17) gives a solution as

λ_j = (1/s_jj) { (I_d − Σ_{k<j} λ_kλ_k') Σ_{n=1}^N ⟨f_{nj}⟩ x_n − Σ_{k>j} λ_k Σ_{n=1}^N ⟨f_{nk} f_{nj}⟩ }, (18)

where s_jj = Σ_{n=1}^N ⟨f_{nj}²⟩.

Converting these new values to the old ones yields Λ. Here, notice that since ‖λ_j‖ ≠ 1 in general, we need to normalize it. Let D be a diagonal matrix of order q such that the jth diagonal element is ‖λ_j‖. Since DΣ_gD remains diagonal, the probabilistic nature of (1) is still preserved even if f → Df and Λ → ΛD⁻¹. Thus, without loss of generality, we can rescale the estimates of the parameters by

λ_j ← λ_j/‖λ_j‖,   μ_g ← Dμ_g,   Σ_g ← DΣ_gD. (19)

Repeating these processes, (18) and (19), for j = 1, …, q, we obtain an estimate of the factor loading matrix.
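A sketch of the column update following the reconstruction of (18)-(19) given here: project the candidate column against the already-fixed columns k < j, subtract the remaining cross terms, and normalize. The function name and argument layout are assumptions for illustration.

```python
import numpy as np

def update_loading(j, Lam, u_j, s_j):
    """Update the jth loading column per the reconstructed (18)-(19).
    u_j: sum_n <f_nj> x_n (d-vector); s_j: jth column of <S_ff> (q-vector)."""
    q = Lam.shape[1]
    v = u_j.copy()
    for k in range(j):                       # project out the columns k < j
        v -= Lam[:, k] * (Lam[:, k] @ u_j)
    for k in range(j + 1, q):                # subtract the remaining cross terms
        v -= Lam[:, k] * s_j[k]
    lam_j = v / s_j[j]
    return lam_j / np.linalg.norm(lam_j)     # normalization as in (19)
```

By construction the returned column has unit norm and is exactly orthogonal to the columns with smaller index, so sweeping j = 1, …, q restores Λ'Λ = I_q.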

Next, differentiating the complete data log-likelihood with respect to the remaining parameters and setting all derivatives to zero, one can obtain the following equations after some manipulations,

σ² = (1/Nd) { Σ_{n=1}^N x_n'x_n − 2 tr(Λ'S_{xf}) + tr(S_{ff}) },

and for g ∈ {1, …, G},

π_g = (1/N) Σ_{n=1}^N c_{ng},   μ_g = Σ_{n=1}^N c_{ng} f_n / Σ_{n=1}^N c_{ng},   Σ_g = diag( Σ_{n=1}^N c_{ng} f_n f_n' / Σ_{n=1}^N c_{ng} − μ_gμ_g' ).

Here, the sufficient statistics are comprised of (14) together with the quantities Σ_n c_n, Σ_n c_{ng} f_n and Σ_n c_{ng} f_n f_n'.

The E step requires the expectation of the sufficient statistics conditional on X and the current values of the parameters. It follows from the results presented in Section 2 that

⟨c_{ng}⟩ = P(c_{ng} = 1 | x_n),

⟨c_{ng} f_n⟩ = P(c_{ng} = 1 | x_n) m_g(x_n),

⟨c_{ng} f_n f_n'⟩ = P(c_{ng} = 1 | x_n) {σ²Γ_g + m_g(x_n)m_g(x_n)'}, (20)

together with (15) and (16). In the M step, all sufficient statistics are replaced by these posterior quantities. Then the M step is given by

σ̂² = (1/Nd) { Σ_{n=1}^N x_n'x_n − 2 tr(Λ'⟨S_{xf}⟩) + tr(⟨S_{ff}⟩) }, (21)

and for g ∈ {1, …, G},

π̂_g = (1/N) Σ_{n=1}^N ⟨c_{ng}⟩, (22)

μ̂_g = Σ_{n=1}^N ⟨c_{ng} f_n⟩ / Σ_{n=1}^N ⟨c_{ng}⟩, (23)

Σ̂_g = diag( Σ_{n=1}^N ⟨c_{ng} f_n f_n'⟩ / Σ_{n=1}^N ⟨c_{ng}⟩ − μ̂_gμ̂_g' ). (24)
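Given responsibilities and the posterior factor moments, the M-step updates (22)-(24) can be sketched in a standard mixture form; the names and array layouts below are illustrative assumptions.

```python
import numpy as np

def m_step(w, m, pcov):
    """M-step updates (22)-(24) for the component parameters.
    w[n, g]: responsibilities <c_ng>; m[n, g, :]: posterior factor means
    m_g(x_n); pcov[g, :]: posterior variances, the diagonal of sigma^2*Gamma_g."""
    Ng = w.sum(axis=0)                                              # cluster sizes
    pi_hat = Ng / w.shape[0]                                        # (22)
    mu_hat = np.einsum('ng,ngq->gq', w, m) / Ng[:, None]            # (23)
    second = np.einsum('ng,ngq->gq', w, m ** 2) / Ng[:, None] + pcov  # E[f_j^2 | c_g]
    Sig_hat = second - mu_hat ** 2                                  # (24), diagonal
    return pi_hat, mu_hat, Sig_hat
```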

Thus, the EM algorithm is summarized as follows:

1. Specify initial values of the parameters, θ*.

2. (Factor loadings) Repeat the following steps for j = 1, …, q.

(a) Evaluate the jth columns of ⟨S_{xf}⟩ and ⟨S_{ff}⟩ based on θ*.

(b) Compute λ_j by equation (18).

(c) Normalize λ_j and rescale μ_g, Σ_g, g ∈ {1, …, G}.

(d) Set these parameters to θ*.

3. (Remaining parameters)

(a) Evaluate the expected sufficient statistics (15), (16) and (20) based on θ*.

(b) Update σ², π_g, μ_g and Σ_g by (21), (22), (23) and (24).

(c) Set these parameters to θ*.

4. If the sequence of parameters and log-likelihood values is judged to have converged, stop the iteration; otherwise go to 2.

The factor loading matrix is updated iteratively according to (18) and the rescaling (19), in which one stage is composed of q cycles. The updates (21) to (24) then follow. The EM algorithm alternates between these processes until the sequence of produced parameters and likelihood values is judged to have converged. Under mild conditions, the method is guaranteed to find a local maximum of the log-likelihood.



3.2. Implementations

As with other mixture models, the likelihood equations of the mixed factors model have multiple roots, and so the EM algorithm should be started from a wide choice of starting values to search for all local maxima. An obvious choice among the roots is the one corresponding to the largest of the local maxima. Below, we describe an ad hoc way of specifying the initial parameters.

Firstly, consider determining each element of the factor loading matrix, denoted by λ_{ij}, i ∈ {1, …, d} and j ∈ {1, …, q}, and the noise variance σ². The main difficulty is that the factor loadings must take account of the condition Λ'Λ = I_q. Let x̄_i = (1/N) Σ_n x_{ni} and s_i = (1/N) Σ_n (x_{ni} − x̄_i)², that is, the sample mean and variance of the ith variable in X. Then our approach to the initialization of these parameters is summarized as follows:

1. Set σ² based on the sample variances s_i, i ∈ {1, …, d}.

2. Generate the factor loadings by λ_{ij} ~ N(0, 1), for all i, j.

3. Implement the Cholesky decomposition Λ'Λ = LL', where L is the lower triangular matrix of order q, and recompute the factor loadings by Λ ← Λ(L')⁻¹.

Subsequently, these parameters give a set of orthogonal transformations of the original data, f(x_n) = Λ'x_n, n ∈ {1, …, N}, and also the mean and variance, f̄ = (1/N) Σ_n f(x_n) and V = (1/N) Σ_n diag{(f(x_n) − f̄)(f(x_n) − f̄)'},
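Steps 2 and 3 of the initialization can be sketched directly; the sizes below are illustrative. The Cholesky factor of Λ'Λ is used to orthonormalize the random loadings, since (Λ(L')⁻¹)'(Λ(L')⁻¹) = L⁻¹(LL')(L')⁻¹ = I_q.

```python
import numpy as np

rng = np.random.default_rng(1)
d, q = 100, 3                            # illustrative sizes
Lam = rng.standard_normal((d, q))        # step 2: lambda_ij ~ N(0, 1)
L = np.linalg.cholesky(Lam.T @ Lam)      # step 3: Lam'Lam = L L'
Lam = Lam @ np.linalg.inv(L).T           # recompute so that Lam'Lam = I_q
```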

For some gene expression datasets, each array containssome genes with fluorescence intensity measurements thatwere flagged by the experimenter and recorded as missingdata points. For instance, [9] reported that the mean per-centage of missing data points per array is ��� for thelymphoma datasets [2] and ��� for the NCI 60 datasets[19]. Our modeling offers a natural approach to the anal-ysis in the cases where some values in �� � ��� � � ����exhibit one or more missing values. The presence of miss-ing values requires a minor change in the EM algorithm inwhich the missing values are considered as to be a set of la-tent variables. However, we are omitted to discuss this issuemore deeply in this study.

4. Mixed Factors Analysis

Once the model has been fitted to a dataset, the mixed factors analysis offers applications to clustering, dimension reduction and data visualization. These can be addressed by finding plausible values of the latent variables f_n, c_n. The analysis also covers a method for extracting sets of genes relevant to explaining the presence of groups. In this process, sets of genes that are considered to be functionally related are detected automatically. In the sequel, we discuss these methods.

4.1. Model Selection

The basic issues arising in the mixed factors analysis are the determination of the number of clusters G and the dimension of the factor variable q. In statistics, these issues can be converted into the problem of model selection among all possible models under consideration, M_1, …, M_K.

A commonly used approach to this problem is based on the Bayesian Information Criterion (BIC, [20]);

BIC = −2 ℓ(θ̂) + N(θ) log N. (25)

Here ℓ(θ̂) is a selected local maximum of the log-likelihood of the model, and N(θ) denotes the number of free parameters given by (5). We should choose the model that is most likely in the sense of the minimum BIC. Unfortunately, finite mixture models do not satisfy the regularity conditions that underlie the published proofs of (25), but several results suggest its appropriateness and good performance in a range of applications of model-based clustering [17].

There also exist other approaches applicable in this context, e.g. classical hypothesis testing such as the likelihood ratio test [15], the Akaike Information Criterion (AIC, [1]) and v-fold cross-validation [1, 18]. However, the validity of these methods also depends on the same regularity conditions needed for the asymptotic expansions in the derivation of BIC. These conditions break down due to the lack of identifiability in the mixture model.
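As an illustration of minimum-BIC model selection, the sketch below uses scikit-learn's GaussianMixture on synthetic data as a stand-in for the mixed factors model; for the actual model, its own log-likelihood and the parameter count from (5) would replace the built-in .bic() call.

```python
# Sketch: choosing the number of clusters G by minimum BIC, with an
# ordinary Gaussian mixture standing in for the mixed factors model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: two well-separated groups in 5 dimensions
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(4, 1, (40, 5))])

bic_scores = {}
for G in range(1, 5):
    gm = GaussianMixture(n_components=G, n_init=10, random_state=0).fit(X)
    bic_scores[G] = gm.bic(X)   # -2 log-likelihood + (free params) * log N

best_G = min(bic_scores, key=bic_scores.get)   # minimum-BIC model
print(best_G)
```

With two clearly separated groups, the criterion selects G = 2: the likelihood gain from a third component is outweighed by its parameter-count penalty.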

4.2. Cluster Analysis

The goal of cluster analysis is to classify the data x_n, n = 1, ..., N, into G nonoverlapping groups. In terms of model-based clustering, this is converted into the problem of inferring the group memberships on the basis of the feature data, or of dividing the feature space R^d by attributing a label to every x in R^d. Our analysis facilitates clustering based on the estimated posterior probability that x belongs to the gth group, g = 1, ..., G. The most common classifier assigns x to the class with the highest posterior probability of belonging:

g(x) = arg max_{g} P(g | x).

If the estimate P(g | x) were true, this classification rule would be the Bayes rule which minimizes the overall


misclassification rate [18]. The clustering is made on the basis of the estimated factor scores f(x_n) and the estimated q-dimensional Gaussian mixture. If the specified dimension of the factor is less than or equal to two, it is possible to visualize the decision boundary.
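The maximum-posterior assignment rule can be sketched as follows, again with an ordinary Gaussian mixture on synthetic data standing in for the fitted factor-space mixture.

```python
# Sketch of the Bayes-rule classifier of Section 4.2: each sample is
# assigned to the component with the highest estimated posterior.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, (30, 2)), rng.normal(3, 1, (30, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
posterior = gm.predict_proba(X)      # estimated P(g | x_n), an N x G matrix
labels = posterior.argmax(axis=1)    # arg max_g of the posterior

# samples drawn from the same component should share a label
print(len(set(labels[:30].tolist())), len(set(labels[30:].tolist())))
```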

4.3. Dimension Reduction

The mixed factors analysis gives a scope to transform the data into a reduced-dimensionality representation. This can be carried out with the posterior mean of the factor variable,

f(x) = E[f | x] = sum_{g=1}^{G} P(g | x) E[f | x, g],

which is optimal under the Bayes rule [14]. An alternative way of constructing reduced-rank data may be the orthogonal mapping of the feature data onto R^q, although we have no clear justification for doing so. When the specified dimension of the factor is less than or equal to three, the estimated factor variables are useful for data visualization. Even when q > 3, data visualization is possible by using some techniques, e.g. the pairwise scatter plot.
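As a sketch of this dimension-reduction step: scikit-learn's FactorAnalysis.transform computes exactly the posterior mean E[f | x] for the single-component special case of the factor model; the mixed factors model would additionally average such means over components, weighted by P(g | x). The data here are synthetic.

```python
# Sketch: reduced-dimensionality representation via the posterior mean
# of the factor variable, E[f | x], in a one-component factor model.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
# 100 samples of 50-dimensional data driven by a 2-dimensional latent factor
F = rng.normal(size=(100, 2))
Lam = rng.normal(size=(2, 50))
X = F @ Lam + 0.1 * rng.normal(size=(100, 50))

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
scores = fa.transform(X)   # posterior means E[f | x_n], shape (100, 2)
print(scores.shape)
```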

An alternative method of data visualization is to select the axes to be plotted among the q factor coordinates, f_1, ..., f_q, such that the set of mappings reveals the presence of the group structure. Consider now selecting two axes, and let f_{jk}(x) = (f_j(x), f_k(x))^T. We measure the degree of separation given by the set f_{jk}(x_n), n = 1, ..., N, by the minus entropy of the component memberships,

E_{jk} = sum_{n=1}^{N} sum_{g=1}^{G} P(g | f_{jk}(x_n)) log P(g | f_{jk}(x_n)),

where P(g | f_{jk}(x_n)) is the posterior probability that the nth observation belongs to the gth group, computed from the jth and kth elements of the component means and the corresponding diagonal elements of the component covariance matrices. Here it is assumed that P(g | f_{jk}(x_n)) log P(g | f_{jk}(x_n)) = 0 if P(g | f_{jk}(x_n)) = 0.

Note that E_{jk} is minimized if and only if P(g | f_{jk}(x_n)) = 1/G for all n = 1, ..., N and g = 1, ..., G. This implies poor separation of f_{jk}(x_n), n = 1, ..., N, in the sense that none of the observations can be assigned to a particular group. In contrast, E_{jk} is maximized if and only if, for each n, P(g | f_{jk}(x_n)) = 1 for some g and P(g' | f_{jk}(x_n)) = 0 for all g' != g. This means that each observation is completely classified into a particular group with probability one. In this sense, the minus entropy of the component memberships can be interpreted as a quantity measuring the degree of separation exhibited by f_{jk}(x_n), n = 1, ..., N. Thus we select the pair of axes (j*, k*) = arg max_{(j,k)} E_{jk} to be visualized.

The axis selection can also be achieved by using other possible quantities measuring the degree of separation, e.g. the between-group variance under the estimated class labels. However, we have no firm basis for preferring one of these approaches over the others.
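One concrete reading of the minus-entropy axis selection is sketched below. Here the component posteriors are obtained by refitting a small Gaussian mixture on each two-dimensional projection (an assumption for the sketch; the paper computes them from the already-estimated component parameters).

```python
# Sketch of Section 4.3's axis-selection rule: score each pair (j, k)
# of factor coordinates by the minus entropy of the memberships and
# keep the pair that maximizes it.
import itertools
import numpy as np
from sklearn.mixture import GaussianMixture

def minus_entropy(posterior):
    """sum_n sum_g p log p, with the convention 0 log 0 = 0."""
    p = np.clip(posterior, 1e-300, 1.0)
    return float(np.sum(posterior * np.log(p)))

rng = np.random.default_rng(3)
# 3 factor axes: axes 0 and 1 separate two groups, axis 2 is pure noise
f = rng.normal(size=(60, 3))
f[:30, 0] += 6.0
f[:30, 1] += 6.0

best_pair, best_score = None, -np.inf
for j, k in itertools.combinations(range(f.shape[1]), 2):
    gm = GaussianMixture(n_components=2, n_init=5, random_state=0)
    gm.fit(f[:, [j, k]])
    score = minus_entropy(gm.predict_proba(f[:, [j, k]]))
    if score > best_score:
        best_pair, best_score = (j, k), score

print(best_pair)
```

The pair (0, 1) wins because its projections give the crispest (nearest to 0/1) posteriors, hence the largest minus entropy.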

4.4. Interpretation of Mixed Factors

Researchers might often desire to obtain a biological interpretation of the estimated mixed factors, and also to extract from x = (x_1, ..., x_d)^T the variables that contribute to the presence of groups in the feature space, or to exclude those that do not. In our context, this can be achieved by assessing the dependency between x_1, ..., x_d and f_1, ..., f_q. A natural measure summarizing the dependency of these variables is the covariance

Cov(x, f) = E[(x - mu) f^T].   (26)

It may be more convenient to use the correlation matrix rho(x, f), in which the (j, l)th element of (26) is divided by the square roots of Var(x_j) and Var(f_l), the jth and lth diagonal elements of the respective covariance matrices.

By investigating the values in rho(x, f) or Cov(x, f), each of the f-coordinates can be understood. If the jth gene, that is, x_j, is highly correlated with f_l, then it is considered to be relevant to explaining the grouping shown in the lth coordinate.

In practice, it is helpful to collect the genes giving a highly positive correlation with f_l into a set U_l^+, and those giving a highly negative correlation with f_l into a set U_l^-, for l = 1, ..., q. As will be demonstrated in the next section, in the context of gene expression analysis, these 2q sets can be useful for finding biologically meaningful groups of genes that work together functionally, and also for explaining the existence of the group structure.
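A minimal sketch of this gene-ranking step on synthetic data: genes are ranked by their sample correlation with an estimated factor score, and the extremes are candidates for the positively and negatively correlated sets (the variable names are illustrative, not from the paper).

```python
# Sketch of Section 4.4: rank genes by correlation with a factor score.
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_genes = 72, 200
factor = rng.normal(size=n_samples)            # one estimated factor score
X = rng.normal(size=(n_samples, n_genes))      # gene expression matrix
X[:, 0] += 2.0 * factor                        # gene 0: strong positive link
X[:, 1] -= 2.0 * factor                        # gene 1: strong negative link

# sample correlation of every gene with the factor
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
fc = (factor - factor.mean()) / factor.std()
corr = Xc.T @ fc / n_samples

top_pos = int(np.argmax(corr))   # candidate for the positive set
top_neg = int(np.argmin(corr))   # candidate for the negative set
print(top_pos, top_neg)
```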

5. Real Data Analysis

In this section we give an illustration of the mixed factors analysis with a well-known dataset, the leukemia data of Golub et al. [12]. This dataset is available at http://www.broad.mit.edu/cancer/. Our objective is to cluster the leukemia tissues on the basis of genes. Although the class labels are available, we applied the mixed factors analysis to the dataset without this


Figure 1. Results of clustering. The order of tissues was rearranged so that the cases of AML are labeled by 1-25 and the ALL by 26-72 (B-cell ALL, 26-63 and T-cell ALL, 64-72). For the clustering given by the model with G = 2 and q = 5, the black and white sites correspond to the tissues grouped into the first and second clusters, respectively. For the clustering given by the model with G = 3 and q = 5, the black, white and gray sites indicate the tissues grouped into the first, second and third clusters, respectively.


Figure 2. Scatter plots of the first three principal components (PC1, PC2, PC3). The true groups are labeled by AML = 1, B-cell ALL = 2 and T-cell ALL = 3.

knowledge in order to demonstrate its potential usefulness as a method of unsupervised learning.

5.1. Leukemia Data and Preprocessing

The leukemia data were studied by Golub et al. [12]. Originally it was reported that the leukemia data contain two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The expression levels of 7129 genes were measured using


Figure 3. Plot of the selected pair of estimated factor scores (first and second mixed factors) given by the mixed factors model with G = 2 and q = 5. The true groups are labeled by AML = 1, B-cell ALL = 2 and T-cell ALL = 3.

Affymetrix high-density oligonucleotide arrays on 72 patients, consisting of 47 cases of ALL and 25 cases of AML (38 B-cell ALL and 9 T-cell ALL). As shown in Figure 1, the order of tissues was rearranged so that the cases of AML are labeled by 1-25 and the ALL by 26-72 (B-cell ALL, 26-63 and T-cell ALL, 64-72). Following [9, 12, 16], three preprocessing steps were applied to the normalized matrix of intensity values available on the website: (a) thresholding, with a floor of 100 and a ceiling of 16,000; (b) filtering, exclusion of genes with max/min <= 5 or (max - min) <= 500, where max and min refer to the maximum and minimum intensities for a particular gene across the 72 samples; (c) the natural logarithm of the expression


Figure 4. Heat map of the expression levels of genes judged to be relevant to the group structure. These genes were selected by the mixed factors model with q = 5. Each pair of row blocks shows the expression levels of the genes giving the highest positive correlation (left) and the highest negative correlation (right) with the lth factor, for l = 1, ..., 5. The first 25 samples refer to the AML cases, and the next 47 samples to the ALL cases (B-cell ALL first, then T-cell ALL). The names of the genes follow from the gene accession numbers available at the website.

levels was taken. After preprocessing, 3571 genes remain, producing a data matrix X = (x_1, ..., x_72) of order 3571 x 72.

The retained dataset was first standardized so that each column of the matrix of logged microarray data has mean 0 and variance 1, and then we standardized the rows of X to have zero mean and unit variance.
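The preprocessing steps (a)-(c) and the two-way standardization described above can be sketched as follows on a toy intensity matrix; the thresholds are those of the Dudoit et al. pipeline cited in the text.

```python
# Sketch of the leukemia-data preprocessing on a toy genes x samples
# matrix (the real data is 7129 genes x 72 samples).
import numpy as np

rng = np.random.default_rng(5)
A = rng.uniform(1, 30000, size=(500, 72))

# (a) thresholding: floor of 100 and ceiling of 16,000
A = np.clip(A, 100, 16000)

# (b) filtering: drop genes with max/min <= 5 or max - min <= 500
gmax, gmin = A.max(axis=1), A.min(axis=1)
A = A[(gmax / gmin > 5) & (gmax - gmin > 500)]

# (c) take the natural logarithm of the expression levels
A = np.log(A)

# standardize the columns, then the rows, as done before clustering
A = (A - A.mean(axis=0)) / A.std(axis=0)
A = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)
print(A.shape)
```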

5.2. Clustering and Selecting Relevant Genes

We first applied PCA to the retained dataset based on the correlation matrix. Figure 2 displays the first three principal components. These projections slightly reveal clusters implying the existence of the two classes, ALL and AML. However, they provide no evidence for the presence of the subclasses AML, B-cell ALL and T-cell ALL. If we proceeded to cluster these projections using some standard technique, e.g. k-means or Gaussian mixture clustering, we would obtain a large number of misallocations.

Next we considered clustering the tissue samples on the basis of the retained 3571 genes using the mixed factors model with q = 5. After fitting models over a range of G, each with several starting values of the parameters, the model with G = 2 was chosen as best in the sense of the minimum BIC. This gave the following groups;

�� � �� ��� ��� ��� ��� ��� ��� ����

�� � ��� ��� �� ��� ��� ��� ��� �� ���

�� ��� �� ���� (27)

This grouping is also summarized in Figure 1. It can be seen from (27) and Figure 1 that the two clusters reflect the partition corresponding to the ALL and AML leukemias, and that most members of the first cluster correspond to the AML tissues. In particular, the AML and T-cell ALL cases were completely classified, and all of the misclassifications correspond to B-cell ALL. We confirmed that the mixed factors analysis could provide a meaningful grouping despite the high dimensionality of the dataset.
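The informal error accounting above, matching each cluster to its majority class and counting the remaining members as misallocations, can be sketched as follows; the labels and the three flipped indices are hypothetical.

```python
# Sketch: scoring an unsupervised grouping against known class labels
# by majority matching within each cluster.
import numpy as np

def misallocations(cluster, truth):
    """Count samples not belonging to their cluster's majority class."""
    errors = 0
    for c in np.unique(cluster):
        members = truth[cluster == c]
        majority = np.bincount(members).argmax()
        errors += int(np.sum(members != majority))
    return errors

truth = np.array([0] * 25 + [1] * 47)     # 25 AML then 47 ALL
cluster = truth.copy()
cluster[[30, 31, 32]] = 0                 # three hypothetical mistakes
print(misallocations(cluster, truth))     # -> 3
```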

Figure 3 displays the selected pair of estimated factor scores obtained from the selected model, where the two coordinates were chosen according to the minus entropy of the component memberships. Each data point is labeled by its true class. This plot helps us to understand the group structure visually. These projections present the existence


of two clusters. All of the projections corresponding to the misallocated tissues were located near the decision boundary.

The 47 ALL tissues consist of 9 T-cell and 38 B-cell types. Given the existence of three subclasses, 25 AML, 38 B-cell ALL and 9 T-cell ALL, Chow et al. [7] and McLachlan et al. [16] attributed these samples to three groups. We also considered clustering this dataset into three groups by using the mixed factors model with G = 3 and q = 5. Among the models fitted from several starting values of the parameters, the local maximum with the best BIC score was chosen. This gave the following clusters (see also Figure 1),

�� � �� ��� ��� �� ��� ��� ��� ��� ��� ��� ��� ��� ����

�� � ���� ��� �� ��� �� ��� �� ��� ��� �� ���

�� ��� ����

�� � ��� ��� ����

This split implies a weak association of the three groups with AML, B-cell ALL and T-cell ALL, respectively. The first cluster consisted of the AML cases plus some B-cell ALL and T-cell ALL cases. The second consisted of the remaining B-cell cases. All members of the third cluster were T-cell cases. All of the AML cases were completely classified, as in the clustering with G = 2. Most of the misallocations correspond to B-cell ALL cases, which were classified into the first cluster.

Figure 4 displays the heat map of the expression levels given by the positively and negatively correlated gene sets for each factor, l = 1, ..., 5. These 2q sets of genes show 2q distinct expression patterns. All genes within a set exhibit a similar expression pattern; these genes are considered to work together functionally. In addition, notice that each pair of positively and negatively correlated sets shows opposite expression patterns. We can interpret that the genes in the positive set are expressed in combination with those in the negative set; the two sets are negatively correlated with each other. Figure 5 displays the pair of estimated factor scores chosen according to the minus entropy of the component memberships. The expression patterns given by the genes in the top two of the five pairs of sets in Figure 4 are compressed into these factor scores. This plot provides evidence for the presence of three subclasses in the leukemia tissues.

6. Concluding Remarks

When we cluster tissue samples on the basis of genes, the number of observations to be grouped is typically much smaller than the dimension of the feature vector. In such a case, the applicability of the conventional mixture model-based approach is limited. In this paper, we have presented the method of mixed factors analysis. The mixed factors model provides a parsimonious parameterization of a Gaussian mixture. Consequently, our approach enables us to


Figure 5. Plot of the selected pair of estimated factor scores (first and second mixed factors) given by the mixed factors model with G = 3 and q = 5. The true groups are labeled as AML = 1, B-cell ALL = 2 and T-cell ALL = 3.

prevent overfitting during the density estimation process. In the process of clustering, the method automatically reduces the dimensionality of the feature data, extracts the genes relevant to explaining the presence of groups, and detects genes that are expressed in combination. The mixed factors analysis was applied to a well-known dataset, the leukemia data, to highlight its usefulness. The density estimation succeeded even though we used 3571-dimensional feature data, and the clustering produced biologically meaningful groups of leukemia tissues. The results showed the potential usefulness of our method in applications to microarray datasets. We expect that the mixed factors analysis will contribute to finding molecular subtypes of disease.

Acknowledgements

We thank the anonymous referees for helpful comments. Work at the Institute of Statistical Mathematics was carried out in part under a Grant-in-Aid for Science Research (A) (14208025).

References

[1] H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control, AC-19(6), pp. 716–723, 1974.

[2] A.A. Alizadeh, M.B. Eisen, R.E. Davis, C. Ma, I.S. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J. Hudson, L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, and


L.M. Staudt, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 403, pp. 503–511, 2000.

[3] T.W. Anderson, An Introduction to Multivariate Statistical Analysis, Wiley, New York, 1984.

[4] Y. Barash and N. Friedman, Context-specific Bayesian clustering for gene expression data, Proceedings of the Fifth Annual International Conference on Computational Biology, pp. 22–25, 2001.

[5] M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, and V. Sondak, Molecular classification of cutaneous malignant melanoma by gene expression profiling, Nature, 406, pp. 536–540, 2000.

[6] W.C. Chang, On using principal components before separating a mixture of two multivariate normal distributions, Applied Statistics, 32, pp. 267–275, 1983.

[7] M.L. Chow, E.J. Moler, and I.S. Mian, Identifying marker genes in transcription profiling data using a mixture of feature relevance experts, Physiol. Genomics, 5, pp. 99–111, 2001.

[8] A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society B, 39, pp. 1–38, 1977.

[9] S. Dudoit, J. Fridlyand, and T.P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, 97(457), pp. 77–87, 2002.

[10] C. Fraley and A.E. Raftery, Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association, 97(458), pp. 611–631, 2002.

[11] D. Ghosh and A.M. Chinnaiyan, Mixture modeling of gene expression data from microarray experiments, Bioinformatics, 18(2), pp. 275–286, 2002.

[12] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286, pp. 531–537, 1999.

[13] I. Holmes and W.J. Bruno, Finding regulatory elements using joint likelihoods for sequence and expression profile data, Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp. 19–23, 2000.

[14] E.L. Lehmann, Theory of Point Estimation, Wiley, New York, 1983.

[15] E.L. Lehmann, Testing Statistical Hypotheses, Wiley, New York, 1986.

[16] G.J. McLachlan, R.W. Bean, and D. Peel, A mixture model-based approach to the clustering of microarray expression data, Bioinformatics, 18(3), pp. 413–422, 2002.

[17] G.J. McLachlan and D. Peel, Finite Mixture Models, Wiley, New York, 2000.

[18] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, 1996.

[19] D.T. Ross, U. Scherf, M.B. Eisen, C.M. Perou, P. Spellman, V. Iyer, S.S. Jeffrey, M. van de Rijn, M. Waltham, A. Pergamenshikov, C.F. Lee, D. Lashkari, D. Shalon, T.G. Myers, J.N. Weinstein, D. Botstein, and P.O. Brown, Systematic variation in gene expression patterns in human cancer cell lines, Nature Genetics, 24, pp. 227–234, 2000.

[20] G. Schwarz, Estimating the dimension of a model, Annals of Statistics, 6, pp. 461–464, 1978.

[21] M.E. Tipping and C.M. Bishop, Mixtures of probabilistic principal component analyzers, Neural Computation, 11, pp. 443–482, 1999.
