Coclustering—a useful tool for chemometrics
Rasmus Bro,a* Evangelos E. Papalexakis,b Evrim Acar,a and Nicholas D. Sidiropoulosc
Nowadays, chemometric applications in biology can readily deal with tens of thousands of variables, for instance, in omics and environmental analysis. Other areas of chemometrics also deal with distilling relevant information in highly information-rich data sets. Traditional tools such as principal component analysis or hierarchical clustering are often not optimal for providing succinct and accurate information from high rank data sets. A relatively little known approach that has shown significant potential in other areas of research is coclustering, where a data matrix is simultaneously clustered in its rows and columns (usually objects and variables).

Coclustering is the tool of choice when only a subset of variables is related to a specific grouping among objects. Hence, coclustering allows a select number of objects to share a particular behavior on a select number of variables.

In this paper, we describe the basics of coclustering and use three different example data sets to show the advantages and shortcomings of coclustering. Copyright © 2012 John Wiley & Sons, Ltd.
Keywords: clustering; coclustering; L1 norm; sparsity
1. INTRODUCTION
The chemometric field is dealing with increasingly complex data, for instance, in omics, quantitative structure–activity relationships, and environmental analysis. It is not uncommon to use hyphenated methods for measuring thousands of chemical compounds. This is quite different from traditional chemometric applications, for instance, in spectroscopy, where the number of variables (wavelengths) may be high but the actual number of chemicals reflected in the data—the chemical rank—is typically low. Approaches such as principal component analysis (PCA) are very well suited for analyzing fairly low rank data, especially when the gathered data are known to be relevant to the problem being investigated.

Traditional clustering techniques are more useful for exploratory analyses of “classical” data. However, with the increasing number of variables being measured nowadays, there is an interesting opposite trend toward not being interested in modeling the full data. Instead, the focus is often on finding a few so-called biomarkers. A biomarker can be a specific chemical compound indicative of a pathological condition or indicative of intake of certain foodstuff. Thus, even though the actual amount of data and “information” increases, at the same time, the need for simplifying the visualization, interpretation, and understanding increases.

In coclustering, a data matrix is simultaneously clustered in its rows and columns (usually objects and variables). Coclustering is by no means new [11], but it has attracted considerable interest in recent years because of some algorithmic developments and its promising performance in various applications—particularly in bioinformatics [15].

One of the main advantages of coclustering is that it clusters both objects (samples) and variables simultaneously. Suppose we have a data set that shows the food intake of various items for a group of people from Belgium and Korea. In order to find the clusters in this data set, we may use a simple approach where the samples are clustered first and, subsequently, the variables are clustered. It is conceivable that the main clusters could be exactly Asian and European because, overall, the main difference in intake relates to cultural differences. Hence, clustering among samples would split the samples into these two groups. It is also conceivable that there could be another grouping because of, for example, some people preferring fish. However, because fish-related items are only a small part of the variables and fish lovers appear in both populations, such a cluster cannot be realized. On the other hand, coclustering could capture both a country and a fish cluster because it considers which samples are related with which variables at the same time, rather than one modality at a time.

Hence, coclustering is the tool of choice when subsets of subjects are related with respect to corresponding subsets of variables. For some coclustering methods, it also holds that an individual subject (or variable) can belong to several (or no) clusters. This is so-called overlapping coclustering, as opposed to non-overlapping coclustering, where each variable is assigned to at most one cluster.

In the following, we describe the theory behind coclustering and subsequently exemplify coclustering on a toy data set reflecting different kinds of animals, on a data set of chromatographic measurements of olive oils, as well as on cancer gene expression data.
* Correspondence to: R. Bro, Department of Food Science, Faculty of Life Sciences, University of Copenhagen, DK-1958 Frederiksberg, Denmark. E-mail: [email protected]

a R. Bro, E. Acar: Department of Food Science, Faculty of Life Sciences, University of Copenhagen, DK-1958 Frederiksberg, Denmark

b E. E. Papalexakis: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

c N. D. Sidiropoulos: Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA
Special Issue Article
Received: 21 September 2011, Revised: 3 January 2012, Accepted: 3 January 2012, Published online in Wiley Online Library: 2012
(wileyonlinelibrary.com) DOI: 10.1002/cem.1424
J. Chemometrics (2012) Copyright © 2012 John Wiley & Sons, Ltd.
2. THEORY
We assume that our data form a matrix X of dimensions I × J.
2.1. Coclustering with sparse matrix regression
Coclustering can be formulated as a constrained outer product decomposition of the data matrix, with sparsity on the latent factors of the bilinear model [17]. Each cocluster is represented by a rank-1 component of the decomposition. Instead of using a plain bilinear model, sparsity on the latent factors is imposed. Intuitively, latent sparsity selects the appropriate rows and columns that belong to each cocluster, rendering all other coefficients that do not belong to a certain cocluster exactly zero. Hence, each bilinear component represents a cocluster.
Mathematically, this coclustering scheme may be stated as the minimization of the following loss function:

‖X − ABᵀ‖²_F + λ ∑_{i,k} |A_{ik}| + λ ∑_{j,k} |B_{jk}|

where A and B are matrices of size I × K and J × K, respectively; K corresponds to the number of extracted coclusters. The sum of absolute values is used as a sparsity-inducing surrogate for the number of nonzero elements (see, for example, Ref. [19]), and λ is a sparsity-controlling parameter.
The loss function can be interpreted as a constrained version of a bilinear model such as PCA. Rotations such as varimax [12] also aim at simplicity and sparsity, but they do so in a lossless manner, where the actual bilinear approximation of the data is left unchanged. It is merely rotated toward a simpler view that will not usually lead to real sparsity.
Doubly sparse matrix factorization as shown has been proposed earlier [13,20]. Witten et al. [20] proposed adding sparsity-inducing hard one-norm constraints on both left and right latent vectors, as a variation of sparse singular value decomposition and sparse canonical correlation analysis. Although their model was not developed with coclustering in mind, it is similar to sparse matrix regression (SMR), which uses soft one-norm penalties instead of hard constraints (and possibly non-negativity when appropriate). Algorithmically, Witten et al. [20] use a deflation algorithm that extracts one rank-1 component at a time, instead of alternating optimization across rank-1 components as in SMR.
Lee et al. [13] proposed a similar approach specifically for coclustering. However, their algorithm is not guaranteed to converge because the penalties are not kept fixed during iterations. As a result, the algorithm in Lee et al. [13] does not monotonically reduce a tangible cost function, and instabilities are not uncommon.
In Papalexakis et al. [18], a coordinate descent algorithm is proposed in order to solve the given optimization problem. More specifically, one may solve this problem in an alternating fashion, where each subproblem is basically a least absolute shrinkage and selection operator (lasso) problem [16,19]. We have to note that a global minimum for the bilinear problem may not be attained; the existing algorithms guarantee only a local minimum or saddle point solution.
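For illustration, the alternating scheme can be sketched as follows. This is a minimal sketch of an SMR-style algorithm, not the reference implementation of Ref. [18]; it assumes the standard soft-thresholding solution of each scalar lasso subproblem, and the function name and interface are our own.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator: the proximal map of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def smr_cocluster(X, K, lam, n_iter=200, seed=0):
    """Sketch of SMR-style coclustering.

    Approximately minimizes ||X - A B^T||_F^2 + lam*(sum|A| + sum|B|)
    by alternating column-wise updates; each column update solves its
    lasso subproblem exactly via soft-thresholding, so the objective
    never increases. Rows/columns whose coefficients end up exactly
    zero in component k do not belong to cocluster k.
    """
    rng = np.random.default_rng(seed)
    I, J = X.shape
    A = rng.random((I, K))
    B = rng.random((J, K))
    for _ in range(n_iter):
        # update A given B, then B given A (via the transposed problem)
        for M, F, Y in ((A, B, X), (B, A, X.T)):
            for k in range(K):
                # residual with component k removed
                R = Y - M @ F.T + np.outer(M[:, k], F[:, k])
                denom = F[:, k] @ F[:, k]
                if denom > 0:
                    M[:, k] = soft_threshold(R @ F[:, k], lam / 2) / denom
    return A, B
```

Because the objective is non-convex in (A, B) jointly, this scheme shares the local-minimum caveat discussed above; different seeds may give different coclusters.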
The SMR coclustering algorithm [18] may be characterized as a soft or fuzzy coclustering algorithm, in the sense that cocluster membership is not merely zero or one but can be any value in between. Some rows and columns may not be assigned to any cocluster, and overlapping coclusters are allowed and can be extracted.
It follows that when sparsity is imposed to such an extent that rows and columns are completely left out, the concept of assessing residual sums of squares or fit values is not meaningful, or at least not meaningful in the same sense as for ordinary least squares fitting. Therefore, other means for evaluating the usefulness of a model are needed. These are described in the following section on metaparameters. Also, interpreting why certain samples or variables are left “orphan” may be useful for understanding the coclustering. This is usually an application-specific problem.

One may add non-negativity constraints to the given loss function formulation, which can be readily applied within the existing coordinate descent algorithm with minor modifications.

Although our focus here will be on SMR coclustering because of its appropriateness for chemometric applications, there are several types of coclustering models and algorithms that are popular in other areas and worth mentioning. Banerjee et al. [1,3,8] have introduced a class of coclustering algorithms that use Bregman divergences, unified in an abstract framework. Bregman coclustering is a hard coclustering technique, in the sense that it seeks to locate a non-overlapping “checkerboard” structure in the data. This type of coclustering is typically not of interest in chemometrics, where one often deals with data that contain large numbers of potentially irrelevant variables. Dhillon [7] has formulated coclustering as a bipartite graph partitioning problem, originally in the context of coclustering of documents and words from a document corpus. This algorithm can also be classified as hard coclustering. In addition, this algorithm works for non-negative data only. Initial testing of various algorithms has shown that the appearance of local minima is a common problem. In fact, most hard coclustering algorithms seem to have much more pronounced problems with local minima than soft coclustering ones. Furthermore, the possible local minima in soft coclustering are often distinct (e.g., rank deficient) and hence easier to spot. Other approaches that are more distantly related are the methods presented by Damian et al. [4] and Friedman and Meulman [9], which do not account for sparsity, and the hard coclustering method of Hageman et al. [10], which uses a genetic algorithm that is sensitive to local minima.
2.2. Metaparameters
For SMR, there are certain metaparameters, that is, the penalty λ and the number of coclusters, that need to be chosen. The number of coclusters must be selected in most coclustering methods, but for SMR, which is not based on hard clustering, it is found that in many cases the clusters are exactly or approximately nested as we increase the number of clusters. Hence, for example, for a solution with five coclusters, it is often found that the first three coclusters are approximately equal to the solution found using only three coclusters. The reason for this approximate nestedness is currently being investigated further. In any case, it greatly simplifies the use of the method. For hard coclustering methods, a similar behavior is naturally not observed.

In practice, the metaparameters are mostly determined in the following way: the penalty for a given number of components is chosen so that it is active. Choosing a λ that is too small would give an inactive penalty, and choosing a λ that is too big would lead to some components/coclusters with all zero values. A simple line search can be implemented to find a value of λ that is active without leading to all zeros. It is generally seen that the specific setting of λ is not critical, but of course, any automatically determined value of λ can be further refined. This has not been pursued here. In order to determine the number of coclusters, a fairly ad hoc approach has been used. Because coclustering is used for exploratory analysis and the solution is nested, we simply extract sufficiently many components to explain the main clusters. More rigorous approaches such as cross-validation could be implemented, but we do not see the predictive ability of coclustering as a very meaningful criterion to optimize. Rather, we find that interpretability of clusters is what is often sought and what we focus on here.

Table I. Animal data set used to illustrate coclustering. Rows: 34 animals (Giraffe, Cow, Lion, Gorilla, Fly, Spider, Shark, House, Horse, Elephant, Mammoth, Sabre Tiger, Pig, Cod, Eel, Jellyfish, Dolphin, Nemo, Shrimp, Dog, Cat, Fox, Wolf, Rabbit, Chicken, Eagle, Seagull, Blackbird, Bat, T. Rex, Neanderthal, Triceratops, Man, Penguin). Columns: 17 attributes (Has eyes; Number of legs/arms; Carnivore; Feather; Wings; Domesticized; Eaten by Caucasians; >100 kg; >2 m; Breathe under water; Extinct; Dangerous; Life expectancy; Random; Has a beak; Walk on two legs; Speed (MPH)).
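The simple line search for λ described in Section 2.2 could, for example, be sketched as follows. This is our own sketch, not part of the original work; `fit` stands for any hypothetical routine that returns the factor matrices (A, B) for a given penalty.

```python
import numpy as np

def find_active_lambda(fit, lam_lo=1e-4, n_bisect=30, max_double=40):
    """Approximate the largest penalty for which no component/cocluster
    turns all zero: grow lambda until some component dies, then bisect.

    `fit` is a hypothetical callable: lam -> (A, B) factor matrices.
    """
    def alive(lam):
        A, B = fit(lam)
        # every component must keep at least one nonzero row AND column
        return (np.abs(A).sum(axis=0) > 0).all() and (np.abs(B).sum(axis=0) > 0).all()

    # grow the penalty until some component dies (or give up after max_double)
    lam_hi = lam_lo
    for _ in range(max_double):
        if not alive(lam_hi):
            break
        lam_hi *= 2.0
    # bisect toward the boundary between "all components alive" and "some dead"
    for _ in range(n_bisect):
        mid = 0.5 * (lam_lo + lam_hi)
        if alive(mid):
            lam_lo = mid
        else:
            lam_hi = mid
    return lam_lo
```

Any λ returned this way is merely a starting point; as noted above, the specific setting is not critical and can be refined manually.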
3. MATERIALS AND METHODS
A toy data set is constructed for illustrating the behavior of coclustering in general. This data set shows attributes of different animals, and the data were not made particularly meticulously. Several variables are not well defined, but this is of moderate consequence in this context. Also, the data were made from the authors’ point of view, for example, in terms of which animals are domesticized. In Table I, the data set is tabulated. Note that the data also include an outlying sample (house) and an outlying variable (random).
As another example, data from Refs [5,6] are analyzed. One hundred twenty-six oil samples were analyzed by HPLC coupled to a charged aerosol detector. Of the oil samples, 68 were various types and grades of olive oils, and the remaining were either non-olive vegetable oils or non-olive vegetable oils mixed with olive oil. The HPLC method is aimed at providing a triacylglyceride profile of the oils. The triacylglycerides are known to have a distinct pattern for olive oils. The data were baseline corrected and aligned as described in the original work, and the resulting data after removal of a few outliers are shown in Figure 1.
As a final data set, we looked at a typical gene expression data set. A total of 56 samples were selected from a cohort of lung cancer patients assayed by using the Affymetrix 95av2 GeneChip brand oligonucleotide array. The 56 patients represent four distinct histological types: normal lung, pulmonary carcinoid tumors, colon metastases, and small cell carcinoma. The data have been described in several publications [2,14] and also using coclustering [13]. The original data set contains 12 625 genes. Unlike most publications, no pre-selection to reduce the number of genes is performed here. Rather, coclustering is applied directly to the data. The data set holds information on 56 patients, of which 20 are pulmonary carcinoid samples, 13 colon cancer metastasis samples, 17 normal lung samples, and 6 small cell carcinoma samples. The data set is fairly easy to cluster into these four groups.
The data and the algorithm can be found at www.models.life.ku.dk (January 2012).
4. RESULTS
4.1. Looking at the animal data set
It is interesting to investigate the outcome of a simple PCA model on the auto-scaled animal data. In Figure 2, a score plot of the first two components of a PCA model is shown. Component 1 seems to reflect birds, which is verified from the loading vector that has high values for the variables feather, wings, has a beak, and walk on two legs. Component 2, though, is difficult to interpret and seems to reflect a mix of different properties. This is also apparent from the loading plot.

Looking at components 3 and 4 (Figure 3), similar complications arise in interpreting the meaning of different components. All but the first component reflect several phenomena in a contrast fashion, and often, it is difficult to extract and distinguish the important variation.

Turning to SMR, a model is fitted using six coclusters. Similar results are obtained with different numbers of coclusters, but we chose six here to exemplify the results. The data are scaled, not centered, and non-negativity is imposed. It is possible to plot the resulting components/clusters as ordinary PCA components in scatter or line plots. However, the semi-discrete nature of the clusters sometimes makes such visualizations less efficient. Instead, we have developed a plot where each cluster is shown by labels of all samples and variables larger than a threshold. This threshold was set to 20% of the maximum but was inactive here because all elements smaller than 20% of the maximum were exactly zero. Furthermore, the size of the label indicates the size of the element. This provides an intuitive visualization, as shown in Figure 4 for the six-cocluster SMR model.

It is striking how easy it is to assess the meaning of this model compared with the PCA model. Looking at the coclusters one at a time, it is observed that cocluster 1 is a bird cocluster. Cocluster 2 is given by one variable (extinct) and is evident. Cocluster 3 comprises big animals. Note how several samples in coclusters 2 and 3 coincide. Animals in cocluster 4 are “grown” and eaten by people, and cocluster 5 captures animals living in water. Finally, cocluster 6 is too dense to allow an easy interpretation. It is apparently a cocluster relating to the overall variation and is in this sense taking care of the offsets induced by the lack of centering.
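The 20%-of-maximum labeling rule used for these plots can be sketched as follows; this is our own sketch, and the function name and interface are hypothetical.

```python
import numpy as np

def cocluster_members(F, names, frac=0.2):
    """Return, per component, the names whose absolute coefficient is at
    least `frac` of that component's maximum - the labeling rule used for
    the cocluster plots. Works for either factor matrix of the bilinear
    model (samples via A, variables via B)."""
    members = []
    for k in range(F.shape[1]):
        col = np.abs(F[:, k])
        top = col.max()
        if top == 0:  # all-zero (dead) component: no members
            members.append([])
            continue
        members.append([n for n, v in zip(names, col) if v >= frac * top])
    return members
```

With exact zeros in the factors, as obtained here, the threshold is typically inactive and the rule simply reads off the nonzero entries.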
Figure 1. Preprocessed chromatographic data (time in arbitrary units).
There is a dramatic difference in how easy it is to visualize the results of PCA and SMR, but the data set is simple in the sense that there are no significant amounts of irrelevant variation. In order to see how SMR can deal with irrelevant variation, 30 random variables (uniformly distributed) were added to the original 17 variables. The data were scaled such that each variable had unit variance, and SMR was performed.

In Figure 5, it is seen that the method very nicely distinguishes between the animal-related information and the random variables. All coclusters but cocluster 7 are easy to interpret. Cocluster 7 is not sparse at all—it comprises almost all variables and all samples. Also, note that the remaining coclusters are not identical to the coclusters found before, but they are indeed fairly similar.
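The scaling used for these SMR models, unit variance without centering so that non-negativity of the factors remains meaningful, can be sketched as follows (our own helper, not from the original work):

```python
import numpy as np

def scale_no_center(X):
    """Scale each variable (column) to unit variance WITHOUT mean-centering;
    constant columns are left unchanged to avoid division by zero."""
    s = X.std(axis=0, ddof=1)
    s = np.where(s == 0, 1.0, s)  # guard against constant columns
    return X / s
```

Because no centering is performed, non-negative data stay non-negative, but an overall-offset component may then appear in the model, as seen for cocluster 6 above.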
4.2. Olive oils
For the olive oil data set, a nice separation is achieved with three coclusters. Adding more does not seem to change the coclusters obtained in the three-cocluster model, and the added coclusters are not immediately meaningful. In Figure 6, it is seen that cocluster 1 reflects olive oils, whereas cocluster 2 reflects non-olive oils. The mixed samples containing some olive oil are placed in between. The third cocluster seems to reflect only a fraction of the olive oils. This is likely related to the olive oils being a very diverse class of samples, spanning from pomace to extra virgin oil. The corresponding elution profiles of each cluster are meaningful. The first (olive oil) cocluster has peaks around 300 and 400 (arbitrary units), and those peaks represent the main olive oil triacylglycerides (triolein, 1,2-olein-3-palmitin, and 1,2-olein-3-linolein). Likewise, the non-olive oil cocluster represents trilinolein, 1,2-linolein-3-olein, and 1,2-linolein-3-palmitin, which are frequent in non-olive oils. It is satisfying to see that the olive oil samples are clustered together, as desired, even though SMR is an unsupervised approach that does not use any prior or side information.
The results obtained with coclustering are not too different from what would be obtained with PCA. In fact, it is somewhat disturbing that there is a distinct lack of sparsity. Although the model makes sense from a chemical point of view, little sparsity is seen, for example, in loadings 1 and 2 (on the other hand, loading 3 is sparse, and so are scores 2 and 3 to a certain extent). As described in the theory section, the magnitude of the L1 penalty is automatically chosen, but it turns out that it is not possible to obtain more
Figure 2. Top: score plot of the PCA model (PC 1: 30.26%, PC 2: 18.30% explained variation). Bottom: corresponding loading plot.
Figure 3. Scores 3 and 4 from a PCA model (PC 3: 13.15%, PC 4: 8.87% explained variation).
sparsity than shown here. Manually increasing λ leads to a model where one component/cocluster turns all zero and the model hence becomes rank deficient. This points to a problem with the current coclustering approach. Because λ is the same for both the row and the column mode, problems or lack of sparsity may occur when the modes are quite different in dimension. The lack of sparsity is likely caused by the strong collinearity as well as by the lack of intrinsic sparsity in this type of data. It is questionable if coclustering as defined here is a suitable model for spectral-like data such as these. A more suitable approach could be an elastic net-type coclustering [21], which would allow the natural collinearities to be represented in the clusters. This seems like an interesting research direction.

Note that for this particular data set, it would be possible to integrate the chromatographic peaks and thereby obtain discrete data that would be more suitable for coclustering.
Figure 4. Sparse matrix regression coclusters of animal data. Font size indicates “belongingness” to the cluster.
Figure 5. Sparse matrix regression coclusters with 30 random variables added to the data.
The intention, though, with the given example, is to illustrate the behavior of coclustering on continuous data.
4.3. Cancer
When analyzing the gene expression data, the four different cancer types come out immediately when we fit a four-cocluster model, as shown in Figure 7, where the four cancer classes are color coded. It is apparent that the four cancer classes are perfectly clustered, but it is also apparent that the gene mode shows little sparsity in comparison with the patients. Hence, coclustering does not provide the sparsity desired in order to be able to talk meaningfully of specific biomarkers.

Performing a PCA on the same data (auto-scaled) provides a very clear grouping into the four cancer types (not shown). The separation is not as perfect as in Figure 7, but the tendency is very clear. Lee et al. [13] also performed coclustering with an algorithm similar to the SMR algorithm. The coclustering in the sample space that they obtained resembles the one obtained using PCA more than the distinct coclustering obtained in Figure 7. This, however, can be explained by the fact that penalties are chosen differently by Lee et al. using a Bayesian
Samp 1Samp 2Samp 3Samp 4Samp 5Samp 6Samp 7Samp 8Samp 9Samp 10Samp 11Samp 12Samp 13Samp 14Samp 15Samp 16Samp 17Samp 19Samp 20
Var 248Var 421Var 1168Var 1210Var 1265Var 1271Var 1278Var 1833Var 1857Var 1860Var 2094Var 2157Var 2272Var 2274Var 2466Var 2533Var 2535Var 2672Var 2674Var 2676Var 2720Var 2796Var 3079Var 3102Var 3186Var 3347Var 3383Var 3408Var 3458Var 3478Var 3737Var 3738Var 3871Var 3979Var 4305Var 4423Var 4818Var 4888Var 4893Var 5223Var 5295Var 5466Var 5509Var 5653Var 5687Var 5722Var 5797Var 5833Var 6207Var 6219Var 6251Var 6455Var 6670Var 6815Var 6917Var 6918Var 6933Var 6991Var 7056Var 7082Var 7250Var 7279Var 7355Var 7549Var 7560Var 7751Var 7854Var 8109Var 8224Var 8241Var 8364Var 8545Var 8564Var 8689Var 8794Var 8846Var 8887Var 8923Var 8924Var 8970Var 8971Var 9042Var 9112Var 9207Var 9265Var 9354Var 9441Var 9482Var 9663Var 9788Var 9957Var 10099Var 10188Var 10244Var 10261Var 10271Var 10289Var 10298Var 10369Var 10416Var 10553Var 10750Var 10848Var 10911Var 10928Var 10965Var 10966Var 11020Var 11099Var 11174Var 11316Var 11396Var 11459Var 11532Var 11538Var 11739Var 11785Var 11903Var 12032Var 12220Var 12256Var 12406Var 12479Var 12488
Samp 21Samp 22Samp 23Samp 24Samp 25Samp 27Samp 28Samp 29Samp 30Samp 31Samp 32Samp 33
Cluster 3
Var 258Var 415Var 487Var 520Var 521Var 630Var 633Var 784Var 795Var 990Var 1081Var 1156Var 1206Var 1211Var 1271Var 1366Var 1489Var 1493Var 1624Var 1813Var 1875Var 1904Var 1967Var 1968Var 2052Var 2131Var 2326Var 2327Var 2510Var 2610Var 2611Var 2812Var 2821Var 2847Var 3303Var 3304Var 3336Var 3353Var 3468Var 3860Var 3941Var 4360Var 4383Var 4747Var 4748Var 4840Var 4886Var 4944Var 5243Var 5288Var 5326Var 5360Var 5396Var 5418Var 5467Var 5496Var 5559Var 5698Var 5821Var 5888Var 5983Var 5989Var 6018Var 6128Var 6160Var 6164Var 6192Var 6350Var 6485Var 6517Var 7058Var 7117Var 7254Var 7255Var 7332Var 7444Var 7469Var 7486Var 7494Var 7798Var 7825Var 7884Var 7939Var 7950Var 7967Var 7972Var 8002Var 8216Var 8238Var 8272Var 8340Var 8404Var 8459Var 8502Var 8549Var 8663Var 8667Var 8697Var 8728Var 8889Var 8911Var 9196Var 9206Var 9427Var 9428Var 9442Var 9797Var 10045Var 10053Var 10138Var 10174Var 10431Var 10483Var 10490Var 10544Var 10795Var 10801Var 10838Var 10906Var 11002Var 11003Var 11178Var 11270Var 11373Var 11578Var 11593Var 11728Var 11788Var 11819Var 12139Var 12232Var 12239Var 12329Var 12548
Samp 21Samp 22Samp 23Samp 24Samp 25Samp 27Samp 28Samp 29Samp 30Samp 31Samp 32Samp 33
Var 258Var 415Var 487Var 520Var 521Var 630Var 633Var 784Var 795Var 990Var 1081Var 1156Var 1206Var 1211Var 1271Var 1366Var 1489Var 1493Var 1624Var 1813Var 1875Var 1904Var 1967Var 1968Var 2052Var 2131Var 2326Var 2327Var 2510Var 2610Var 2611Var 2812Var 2821Var 2847Var 3303Var 3304Var 3336Var 3353Var 3468Var 3860Var 3941Var 4360Var 4383Var 4747Var 4748Var 4840Var 4886Var 4944Var 5243Var 5288Var 5326Var 5360Var 5396Var 5418Var 5467Var 5496Var 5559Var 5698Var 5821Var 5888Var 5983Var 5989Var 6018Var 6128Var 6160Var 6164Var 6192Var 6350Var 6485Var 6517Var 7058Var 7117Var 7254Var 7255Var 7332Var 7444Var 7469Var 7486Var 7494Var 7798Var 7825Var 7884Var 7939Var 7950Var 7967Var 7972Var 8002Var 8216Var 8238Var 8272Var 8340Var 8404Var 8459Var 8502Var 8549Var 8663Var 8667Var 8697Var 8728Var 8889Var 8911Var 9196Var 9206Var 9427Var 9428Var 9442Var 9797Var 10045Var 10053Var 10138Var 10174Var 10431Var 10483Var 10490Var 10544Var 10795Var 10801Var 10838Var 10906Var 11002Var 11003Var 11178Var 11270Var 11373Var 11578Var 11593Var 11728Var 11788Var 11819Var 12139Var 12232Var 12239Var 12329Var 12548
Samp 51Samp 52Samp 53Samp 54Samp 55Samp 56
Cluster 4
Var 61Var 427Var 520Var 521Var 545Var 557Var 561Var 641Var 648Var 731Var 830Var 896Var 984Var 1050Var 1211Var 1278Var 1326Var 1358Var 1599Var 2283Var 2292Var 2631Var 3137Var 3138Var 3160Var 3186Var 3303Var 3304Var 3365Var 3379Var 4132Var 4133Var 4136Var 4144Var 4383Var 4424Var 4924Var 5299Var 5363Var 5583Var 5619Var 5753Var 6003Var 6052Var 6111Var 6553Var 7008Var 7056Var 7288Var 7372Var 7374Var 7375Var 7648Var 7823Var 7939Var 8061Var 8142Var 8194Var 8272Var 8716Var 8724Var 8732Var 8923Var 8924Var 9077Var 9196Var 9245Var 9331Var 9421Var 9426Var 9697Var 9769Var 9803Var 10063Var 10084Var 10136Var 10169Var 10213Var 10241Var 10291Var 10511Var 10521Var 10628Var 10643Var 10644Var 10667Var 10720Var 10828Var 11192Var 11270Var 11271Var 11322Var 11338Var 11385Var 11692Var 11939Var 12107Var 12213Var 12455Var 12456Var 12466
Samp 51Samp 52Samp 53Samp 54Samp 55Samp 56
Var 61Var 427Var 520Var 521Var 545Var 557Var 561Var 641Var 648Var 731Var 830Var 896Var 984Var 1050Var 1211Var 1278Var 1326Var 1358Var 1599Var 2283Var 2292Var 2631Var 3137Var 3138Var 3160Var 3186Var 3303Var 3304Var 3365Var 3379Var 4132Var 4133Var 4136Var 4144Var 4383Var 4424Var 4924Var 5299Var 5363Var 5583Var 5619Var 5753Var 6003Var 6052Var 6111Var 6553Var 7008Var 7056Var 7288Var 7372Var 7374Var 7375Var 7648Var 7823Var 7939Var 8061Var 8142Var 8194Var 8272Var 8716Var 8724Var 8732Var 8923Var 8924Var 9077Var 9196Var 9245Var 9331Var 9421Var 9426Var 9697Var 9769Var 9803Var 10063Var 10084Var 10136Var 10169Var 10213Var 10241Var 10291Var 10511Var 10521Var 10628Var 10643Var 10644Var 10667Var 10720Var 10828Var 11192Var 11270Var 11271Var 11322Var 11338Var 11385Var 11692Var 11939Var 12107Var 12213Var 12455Var 12456Var 12466
Figure 7. Sparse matrix regression coclusters of cancer data color-coded according to cancer class.
[Figure 6 panel content omitted: axis ticks and labels (Sample number, Retention, Cluster belongingness) and legend entries (Non olive, Olive, Mix) for Clusters 1–3.]
Figure 6. Three sparse matrix regression clusters are shown. Top plots show sample clusters and bottom plots show elution time clusters.
Coclustering
J. Chemometrics (2012) Copyright © 2012 John Wiley & Sons, Ltd. wileyonlinelibrary.com/journal/cem
information criterion. Regardless, as also observed with the SMR algorithm, the algorithm of Lee et al. produces solutions that are not as sparse as expected in the gene mode.
5. CONCLUSION
The basic principles behind coclustering have been explained, and a new model and algorithm have been favorably compared with common methods such as PCA. It is shown that coclustering can provide meaningful and easily interpretable results on both fairly simple and complex data compared with more traditional approaches. Limitations were encountered when the number of irrelevant samples grew too high and when spectral-like data were analyzed. More elaborate algorithms need to be developed for handling such situations.
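The sparse coclustering idea summarized above can be illustrated with a minimal numerical sketch: extract one rank-1 component X ≈ ab′ with L1 penalties on both loading vectors, so that the nonzero entries of a and b jointly select the samples and variables of a cocluster. This is an illustrative simplification, not the exact SMR algorithm of the paper; the function names and the alternating soft-thresholding updates are our own assumptions.

```python
import numpy as np

def soft_threshold(x, lam):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def smr_cocluster(X, lam=0.5, n_iter=100):
    """Extract one sparse rank-1 cocluster from X (samples x variables).

    Alternates closed-form lasso updates for the sample loadings a and
    the variable loadings b in the model X ~ a b'. The nonzero supports
    of a and b define the cocluster. (Hypothetical sketch only; not the
    exact SMR algorithm.)
    """
    # Initialize b from the row of X with the largest norm.
    b = X[np.argmax(np.linalg.norm(X, axis=1))].astype(float).copy()
    nb = np.linalg.norm(b)
    if nb == 0:
        raise ValueError("X is all zeros")
    b /= nb
    a = np.zeros(X.shape[0])
    for _ in range(n_iter):
        # Lasso update of a with b fixed: min ||X - a b'||^2 + 2*lam*||a||_1.
        a = soft_threshold(X @ b, lam) / max(b @ b, 1e-12)
        if not np.any(a):
            break  # penalty shrank the whole component to zero
        # Symmetric lasso update of b with a fixed.
        b = soft_threshold(X.T @ a, lam) / max(a @ a, 1e-12)
        if not np.any(b):
            break
    rows = np.nonzero(a)[0]  # samples belonging to the cocluster
    cols = np.nonzero(b)[0]  # variables belonging to the cocluster
    return rows, cols, a, b
```

On a matrix with one planted block of correlated entries, the nonzero supports of a and b recover the block's rows and columns; further coclusters could be sought after deflating X by the fitted component.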
Acknowledgements
N. Sidiropoulos was supported in part by ARO grant W911NF-11-1-0500.
REFERENCES
1. Banerjee A, Merugu S, Dhillon IS, Ghosh J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005; 6: 1705–1749.
2. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ER, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. U.S.A. 2001; 98: 13790–13795.
3. Cho H, Dhillon IS, Guan Y, Sra S. Minimum sum-squared residue co-clustering of gene expression data. Proceedings of the Fourth SIAM International Conference on Data Mining 2004; 114–125.
4. Damian D, Oresic M, Verheij E, Meulman J, Friedman J, Adourian A, Morel N, Smilde A, van der Greef J. Applications of a new subspace clustering algorithm (COSA) in medical systems biology. Metabolomics 2007; 3: 69–77.
5. de la Mata-Espinosa P, Bosque-Sendra JM, Bro R, Cuadros-Rodriguez L. Discriminating olive and non-olive oils using HPLC-CAD and chemometrics. Anal. Bioanal. Chem. 2011a; 399: 2083–2092.
6. de la Mata-Espinosa P, Bosque-Sendra JM, Bro R, Cuadros-Rodriguez L. Olive oil quantification of edible vegetable oil blends using triacylglycerols chromatographic fingerprints and chemometric tools. Talanta 2011b; 85: 177–182.
7. Dhillon IS. Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2001; 269–274.
8. Dhillon IS, Mallela S, Modha DS. Information-theoretic co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2003; 89–98.
9. Friedman JH, Meulman JJ. Clustering objects on subsets of attributes. J. Roy. Stat. Soc. B 2004; 66: 815–849.
10. Hageman JA, van den Berg RA, Westerhuis JA, van der Werf MJ, Smilde AK. Genetic algorithm based two-mode clustering of metabolomics data. Metabolomics 2008; 4: 141–149.
11. Hartigan JA. Direct clustering of a data matrix. J. Am. Stat. Assoc. 1972; 67: 123–129.
12. Kaiser HF. The varimax criterion for analytic rotation in factor analysis. Psychometrika 1958; 23: 187–200.
13. Lee M, Shen H, Huang JZ, Marron JS. Biclustering via sparse singular value decomposition. Biometrics 2010; 66: 1087–1095.
14. Liu Y, Hayes DN, Nobel A, Marron JS. Statistical significance of clustering for high-dimension, low-sample size data. J. Am. Stat. Assoc. 2008; 103: 1281–1293.
15. Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinform. 2004; 1: 24–45.
16. Osborne MR, Presnell B, Turlach BA. On the LASSO and its dual. J. Comput. Graph. Stat. 2000; 9: 319–337.
17. Papalexakis EE, Sidiropoulos ND. Co-clustering as multilinear decomposition with sparse latent factors. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic, 2011.
18. Papalexakis EE, Sidiropoulos ND, Garofalakis MN. Reviewer profiling using sparse matrix regression. 2010 IEEE International Conference on Data Mining Workshops, 2010; 1214–1219.
19. Tibshirani R. Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 1996; 58: 267–288.
20. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009; 10: 515–534.
21. Zou H, Hastie T. Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B 2005; 67: 301–320.