Biclustering with heterogeneous variance

Guanhua Chen(a), Patrick F. Sullivan(b,c), and Michael R. Kosorok(a,d,1)
Departments of aBiostatistics, bGenetics, cPsychiatry, and dStatistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
Edited by Xiaotong Shen, University of Minnesota, Minneapolis, MN, and accepted by the Editorial Board June 4, 2013 (received for review March 7, 2013)
In cancer research, as in all of medicine, it is important to classify patients into etiologically and therapeutically relevant subtypes to improve diagnosis and treatment. One way to do this is to use clustering methods to find subgroups of homogeneous individuals based on genetic profiles together with heuristic clinical analysis. A notable drawback of existing clustering methods is that they ignore the possibility that the variance of gene expression profile measurements can be heterogeneous across subgroups, and methods that do not consider heterogeneity of variance can lead to inaccurate subgroup prediction. Research has shown that hypervariability is a common feature among cancer subtypes. In this paper, we present a statistical approach that can capture both mean and variance structure in genetic data. We demonstrate the strength of our method in both synthetic data and in two cancer data sets. In particular, our method confirms the hypervariability of methylation level in cancer patients, and it detects clearer subgroup patterns in lung cancer data.
Clustering is an important type of unsupervised learning algorithm for data exploration. Successful examples include K-means clustering and hierarchical clustering, both of which are widely used in biological research to find cancer subtypes and to stratify patients. These and other traditional clustering algorithms depend on distances calculated using all of the features. For example, individuals can be clustered into homogeneous groups by minimizing the summation of within-cluster sums of squares (the Euclidean distances) of their gene expression profiles. Unfortunately, this strategy is ineffective when only a subset of features is informative. This phenomenon can be demonstrated by K-means clustering (1) results for a toy example using only the variables which determine the underlying true clusters compared with using all variables (which include many uninformative variables). As can be seen in Fig. 1, clustering performance is poor when all variables are used in the clustering algorithm (2).
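This toy experiment is easy to reproduce. The sketch below (ours, not from the paper) uses scikit-learn's KMeans and the adjusted Rand index to quantify the contrast between clustering on the two informative variables and clustering on all 1,002 variables; the within-cluster SD of 0.5 is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Two clusters of 100 points centered at (1, 1) and (-1, -1),
# padded with 1,000 pure-noise variables (1,002 variables total).
informative = np.vstack([
    rng.normal(1.0, 0.5, size=(100, 2)),
    rng.normal(-1.0, 0.5, size=(100, 2)),
])
noise = rng.normal(0.0, 1.0, size=(200, 1000))
X = np.hstack([informative, noise])
truth = np.repeat([0, 1], 100)

# Cluster using only X1, X2 versus using all 1,002 variables.
for name, data in [("informative only", X[:, :2]), ("all variables", X)]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    print(name, "ARI =", round(adjusted_rand_score(truth, labels), 2))
```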
To solve this problem, sparse clustering methods have been proposed to allow clustering decisions to depend on only a subset of feature variables (the property of sparsity). Prominent sparse clustering methods include sparse principal component analysis (PCA) (3–5) and Sparse K-means (2), among others (6). However, sparse clustering still fails if the true sparsity is a local rather than a global phenomenon (6). More specifically, different subsets of features can be informative for some samples but not all samples; in other words, sparsity exists in both features and samples jointly. Biclustering methods are a potential solution to this problem, and further generalize the sparsity principle by considering samples and features as exchangeable concepts to handle local sparsity (6, 7). For example, gene expression data can be represented as a matrix with genes as columns and subjects as rows (with various and possibly unknown diseases or tissue types). Traditional methods will either cluster the rows (as done, for example, in microarray research, where researchers want to find subpopulation structure among subjects to identify possible common disease status) or cluster the columns (as done, for example, in gene clustering research, where genes are of interest and the goal is to predict the biological function of novel genes from the function of other well-studied genes within the same clusters). In contrast, biclustering involves clustering rows and columns simultaneously to account for the interaction of row and column sparsity. This local sparsity perspective provides an intuition for using sparse singular value decomposition (SSVD) algorithms for biclustering (8–11). SSVD assumes that the signal in the data matrix can be represented by a low-rank matrix X ≈ UDV^T = ∑_{i=1}^{r} d_i u_i v_i^T with X ∈ ℜ^{n×p}. U = [u1, u2, ..., ur] ∈ ℜ^{n×r} and V = [v1, v2, ..., vr] ∈ ℜ^{p×r} contain left and right sparse singular vectors and are orthonormal with only a few nonzero elements (corresponding to local sparsity). D ∈ ℜ^{r×r} is diagonal (with diagonal elements d1, d2, ..., dr), with r ≪ rank(X). The outer product of each pair of sparse singular vectors (u_i v_i^T, i = 1, 2, ..., r) designates two biclusters, corresponding to positive and negative elements, respectively.
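As a concrete illustration of the idea (not the exact algorithm of refs. 9–11), the sketch below alternates power iterations with elementwise soft-thresholding so that the leading singular vector pair becomes sparse; the thresholds lam_u and lam_v are illustrative tuning parameters.

```python
import numpy as np

def soft(x, lam):
    # Elementwise soft-thresholding, the proximal operator of the l1 penalty.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def rank1_ssvd(X, lam_u=0.5, lam_v=0.5, n_iter=50):
    """Alternate power iterations with soft-thresholding so the leading
    singular vectors become sparse; zeros mark cells outside the bicluster."""
    u0, _, vt0 = np.linalg.svd(X, full_matrices=False)
    u, v = u0[:, 0], vt0[0]                  # dense SVD warm start
    for _ in range(n_iter):
        v = soft(X.T @ u, lam_v)
        v /= np.linalg.norm(v) + 1e-12
        u = soft(X @ v, lam_u)
        u /= np.linalg.norm(u) + 1e-12
    d = u @ X @ v                            # singular value estimate
    return u, d, v

# A 100 x 50 matrix with a 20 x 10 mean bicluster embedded in noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))
X[:20, :10] += 3.0
u, d, v = rank1_ssvd(X)
print("nonzero rows:", np.flatnonzero(u).size, "nonzero cols:", np.flatnonzero(v).size)
```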
A common assumption of existing SSVD biclustering methods is that the observed data can be decomposed into a signal matrix plus a fully exchangeable random noise matrix:
X = Ξ + Φ, [1]
where X is the observed data, Ξ = (ξij) is an n×p matrix representing the signal, and Φ = (ϕij) is an n×p random noise/residual matrix with independent identically distributed (i.i.d.) entries (10, 12, 13). A method based on model 1 is proposed in ref. 9, which minimizes the sum of the Frobenius norm of X − Ξ and a penalty function for variable selection, such as the ℓ1-norm (14) or smoothly clipped absolute deviation (15). A similar loss-plus-penalty minimization approach can be seen in ref. 11. A different method for SSVD employs iterative thresholding QR decomposition to estimate Ξ in ref. 10. We refer to ref. 9 as LSHM (for Lee, Shen, Huang, and Marron) and ref. 10 as fast iterative thresholding for SSVD (FIT-SSVD), and compare these approaches to our method. An alternative approach, which is more direct, is based on a mixture model (16, 17). For example, ref. 17 defines the bicluster as a submatrix with a large positive or negative mean. Although these approaches have proven successful in some settings, they are limited by their focus on only the mean signal approximation. In addition, the explicit homogeneous residual variance assumption is too restrictive in many applications.

To our knowledge, the only extension of the traditional model
given in [1] is the generalized PCA approach (18), which assumes that if the random noise matrix were stacked into a vector, vec(Φ), it would have mean 0 and variance R^{-1} ⊗ Q^{-1}, where R^{-1} is the common covariance structure of the random variables within the same column, and Q^{-1} is the common covariance structure of the random variables within the same row. This approach is especially suited to denoising NMR data, for which there is a natural covariance structure of the form given above (18).
Author contributions: G.C., P.F.S., and M.R.K. designed research; G.C., P.F.S., and M.R.K. performed research; G.C. and M.R.K. contributed new reagents/analytic tools; G.C. analyzed data; and G.C., P.F.S., and M.R.K. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission. X.S. is a guest editor invited by the Editorial Board.
Freely available online through the PNAS open access option.
1To whom correspondence should be addressed. E-mail: kosorok@unc.edu.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1304376110/-/DCSupplemental.
Drawbacks of the generalized PCA method, however, are that it remains focused on mean signal approximation and that the structure of R^{-1} and Q^{-1} must be explicitly known in advance.

In this paper, we present a biclustering framework based on SSVD called heterogeneous sparse singular value decomposition (HSSVD). This method can detect both mean biclusters and variance biclusters in the presence of unknown heterogeneous residual variance. We apply our method, as well as competing approaches, to two cancer data sets, one with methylation data and the other with gene expression data. Our method delivers more distinct genetic profile pattern detection and is able to confirm the biological findings originally made for each of the data sets. We also apply our method and the competing approaches to synthetic data to compare their performance quantitatively. We demonstrate that our proposed method is robust, location- and scale-invariant, and computationally feasible.
Application to Cancer Data

Hypervariability of Methylation in Cancer. We demonstrate the capability of variance bicluster detection with methylation data in cancer versus normal patients (19). The experiments were conducted with a custom nucleotide-specific Illumina bead array to increase the precision of DNA methylation measurements on previously identified cancer-specific differentially methylated regions (cDMRs) in colon cancer (20). The data set (GEO accession: GSE29505) consists of 290 samples, including cancer samples (colon, breast, lung, thyroid, and Wilms' tumor cancers) and matched normal samples. Each sample had 384 methylation probes which covered 151 cDMRs. The authors of the primary report concluded that cancer samples had hypervariability in these cDMRs across all cancer types (19).

First, we wish to verify that HSSVD can provide a good mean signal approximation of methylation. In this data set, all of the probes measuring methylation are placed in the cDMRs identified in colon cancer patients. As a result, we would expect that mean methylation levels differ between colon cancer samples and the matched normal samples. Under this assumption, we require the biclustering methods to capture this mean structure before investigating the information gained from variance structure estimation. Note that the numerical range of the methylation level is between 0 and 1. Hence, we applied the logit transformation to the original data before biclustering analysis (a code sketch of this transformation appears below). We compare three methods, HSSVD, FIT-SSVD, and LSHM, all based on SVD. Only colon cancer samples and their matched normal samples are used for this particular analysis. In Fig. 2, we can see from the hierarchical clustering analysis that the majority of colon cancer samples (labeled blue in the sidebar) are grouped together and most of the cDMRs are differentially expressed in colon tumor samples compared with normal samples. The conclusion is the same for all three methods compared, including our proposed HSSVD method.
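A minimal version of this preprocessing step, with an epsilon clip (our choice, not specified in the paper) to guard against values of exactly 0 or 1:

```python
import numpy as np

def logit(beta, eps=1e-6):
    # Map methylation levels from (0, 1) to the real line; clip to
    # avoid infinities at exactly 0 or 1.
    b = np.clip(beta, eps, 1.0 - eps)
    return np.log(b / (1.0 - b))

# Example: a 290-sample x 384-probe matrix of methylation levels.
rng = np.random.default_rng(2)
beta = rng.uniform(0.0, 1.0, size=(290, 384))
X = logit(beta)  # input to the biclustering methods
```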
Second, our proposed HSSVD method confirms the most important finding in ref. 19, that cancer samples tend to have hypervariability in methylation level regardless of tumor subtype. We compared the mean approximation and variance approximation results of HSSVD. All samples were used in this analysis. The variance approximation of HSSVD (Fig. 3A) shows that nearly all normal samples have low variance compared with cancer samples, and this pattern is consistent across all cDMRs. Notably, our method provides additional information beyond the conclusion of ref. 19. Specifically, our variance approximation suggests that some cancer samples are not characterized by hypervariability in methylation level for certain cDMRs. More precisely, some cDMRs for a few cancer samples (surrounded by normal samples) are predicted to have low variance (lower left part of Fig. 3A). Our method also highlights the cDMRs with the greatest contrast in variance between cancer and normal samples. The corresponding cDMRs with high contrast variance (especially some of the first and middle columns of Fig. 3A) warrant further study for biological and clinical relevance. We also want to emphasize that the analysis in ref. 19 relies on the disease status information, whereas for HSSVD the disease status is only used for result interpretation. Note that most cancer patients cluster together by hierarchical clustering of the variance approximation from HSSVD.
[Figure 1 appears here: three scatter panels ("True Label", "Predicted Label (two variables)", "Predicted Label (all variables)"), each plotting X2 against X1; axis tick residue omitted.]

Fig. 1. Data set contains two clusters determined by two variables X1 and X2 such that points around (1, 1) and (−1, −1) naturally form clusters. There are 200 observations (100 for each cluster) and 1,002 variables (X1, X2, and 1,000 random noise variables). We plot the data in the 2D space of X1 and X2. Graphs with true cluster labels and predicted cluster labels, obtained by clustering using only X1 and X2 and by clustering using all variables, are laid out from left to right. The predicted labels are the same as the true labels only when X1 and X2 are used for clustering; the performance is much worse when all variables are used.
In contrast, clustering the mean approximation from HSSVD in Fig. 3B fails to reveal such a pattern. This indicates that most cancer samples may have hypervariability of methylation as a common feature, whereas their mean-level methylation varies from sample to sample. Hence, identifying variance biclusters can provide potential new insight into cancer epigenesis.
Gene Expression in Lung Cancer. Some biological settings, in contrast with the methylation example above, do not express variance heterogeneity. Usually, the presence or absence of such heterogeneity is not known in advance for a given research data set. Thus, it is important to verify that the proposed approach remains effective in either case for discovering mean-only biclusters. We now demonstrate that even in settings without variance heterogeneity, HSSVD can better identify discriminative biclusters for different cancer subtypes than other methods, including FIT-SSVD (10), LSHM (9), and traditional SVD. We use a lung cancer data set which has been studied in the statistics literature (9, 10, 17). The samples are a subset of patients (21) having lung cancer, with gene expression measured by the Affymetrix 95av2 GeneChip (22). The data set contains the expression levels of 12,625 genes for 56 patients, each having one of four disease subtypes: normal lung (20 samples), pulmonary carcinoid tumors (13 samples), colon metastases (17 samples), and small-cell carcinoma (6 samples).

The performance of the different methods is evaluated based on the pattern difference of subtypes in the mean approximations. For all methods, we set the rank of the mean signal matrix equal to 3 to maintain consistency with the ranks used in FIT-SSVD (10) and LSHM (9). Further, we use the measurement "support" to evaluate the sparsity of the estimated gene signal (10). Support is the cardinality of the nonzero elements in the right and left singular vectors across the three layers (i.e., support is an integer that cannot exceed the data dimension). Smaller support values suggest a sparser model. Table 1 shows that HSSVD, FIT-SSVD, and LSHM yield similar levels of sparsity in the gene signal, whereas SVD is not sparse, as expected. Fig. 4 shows checkerboard plots of the rank-three approximations by the four methods. Patients are placed on the vertical axis, and the patient order is the same for all images. Patients within the same subtype are stacked together, and different subtypes are separated by white lines. Within each image, genes are laid out on the horizontal axis and are ordered by the value of v2 (10). We can see a clear block structure for both the FIT-SSVD and HSSVD methods, indicating biclustering. The block structure suggests we can discriminate the four cancer subtypes using either the FIT-SSVD or HSSVD method, whereas LSHM and SVD are unable to achieve such separation among subtypes.
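Reading support as the cardinality of the union of nonzero positions across the first three singular vectors (the reading consistent with "cannot exceed the data dimension"), it can be computed in a few lines; U and V below are assumed to hold the sparse vectors as columns.

```python
import numpy as np

def union_support(vecs, tol=1e-12):
    # vecs: array of shape (dimension, n_layers), e.g. U or V from a rank-3 fit.
    # Counts coordinates nonzero in at least one of the layers.
    return int((np.abs(vecs) > tol).any(axis=1).sum())

# With U (12,625 genes x 3) and V (56 patients x 3) from a sparse rank-3
# fit, union_support(U) and union_support(V) give the gene and patient
# support reported in Table 1.
```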
Simulation Study

To evaluate the performance of HSSVD quantitatively, we conducted a simulation study. We compared HSSVD with the most relevant existing biclustering methods, FIT-SSVD and LSHM (9, 10). HSSVD includes a rank estimation component, whereas the other methods do not automatically include this. For this reason, we use a fixed oracle rank (at the true value) for the non-HSSVD methods. For comparison, we also evaluate HSSVD with fixed oracle rank (HSSVD-O).

The performance of these methods on simulated data was evaluated on four criteria. The first criterion is "sparsity of estimation," defined as the ratio between the size of the correctly identified background cluster and the size of the true background cluster.
[Figure 2 appears here; the heatmap row and column label residue is omitted.]

Fig. 2. Mean approximation of colon cancer and the normal matched samples. From left to right, the methods are HSSVD, FIT-SSVD, and LSHM. Colon cancer samples are labeled in blue, and normal matched samples are labeled in pink in the sidebar. Genes and samples are ordered by hierarchical clustering. Colon cancer patients are clustered together, which indicates that the mean approximations for these three methods achieve the expected signal structure.
[Figure 3 appears here; panel titles: (A) "HSSVD: Variance Approximation" (color key −2 to 2) and (B) "HSSVD: Mean Approximation" (color key −1 to 1); heatmap label residue omitted.]

Fig. 3. HSSVD approximation result for all samples. (A) Variance approximation; (B) mean approximation. Blue represents cancer samples, and pink represents normal samples in the sidebar. Genes and samples are ordered by hierarchical clustering. Red represents large values, and green represents small values. Only the variance approximation can discriminate between cancer and normal samples. More importantly, within the same gene, the heatmap for the variance approximation indicates that cancer patients have larger variance than normal individuals. This result matches the conclusion in ref. 19. In addition, the cDMRs with the greatest contrast in variance across cancer and normal samples are highlighted by the variance approximation, whereas the original paper does not provide such information.
The second criterion is "biclustering detection rate," defined as the ratio of the intersection of the estimated bicluster and the true bicluster over their union (also known as the Jaccard index). For the first two criteria, larger values indicate better performance. The third and fourth criteria are "overall matrix approximation errors" for mean and variance biclusters, consisting of the scaled recovery error for the low-rank mean signal matrix Ξ̃ = Ξ + bJ, computed via

L_mean(Ξ̃, Ξ) = ‖Ξ − Ξ̃‖²_F / ‖Ξ‖²_F,

and the scaled recovery error for the low-rank variance signal matrix log(Σ̃) = log(Σ) + log(ρ²J), computed via

L_var(log Σ̃, log Σ) = ‖log(Σ^{1/2}) − log(Σ̃^{1/2})‖²_F / ‖log(Σ^{1/2})‖²_F,

with ‖·‖_F being the Frobenius norm.
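Minimal implementations of these criteria, assuming the biclusters are encoded as boolean cell masks and the (positive-entry) signal matrices are given as numpy arrays (a sketch, not the authors' code):

```python
import numpy as np

def jaccard(est_mask, true_mask):
    # Biclustering detection rate: |intersection| / |union| of the
    # estimated and true bicluster cell sets.
    inter = np.logical_and(est_mask, true_mask).sum()
    union = np.logical_or(est_mask, true_mask).sum()
    return inter / union

def l_mean(Xi_hat, Xi):
    # Scaled Frobenius recovery error for the mean signal.
    return np.linalg.norm(Xi - Xi_hat, "fro") ** 2 / np.linalg.norm(Xi, "fro") ** 2

def l_var(Sigma_hat, Sigma):
    # Scaled recovery error for the variance signal on the log-SD scale;
    # both matrices must have strictly positive entries.
    num = np.linalg.norm(np.log(np.sqrt(Sigma)) - np.log(np.sqrt(Sigma_hat)), "fro") ** 2
    den = np.linalg.norm(np.log(np.sqrt(Sigma)), "fro") ** 2
    return num / den
```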
The simulated data comprise a 1000 × 100 matrix with independent entries. The background entries follow a normal distribution with mean 1 and SD 2. We denote this distribution as N(1, 2²), where N(a, b²) represents a normal random variable with mean a and SD b. There are five nonoverlapping rectangular biclusters: biclusters 1, 2, and 5 are mean clusters, bicluster 3 is a mean and small-variance cluster, and bicluster 4 is a large-variance cluster. More precisely, bicluster 1 (size 100 × 20) is generated from N(7, 2²), bicluster 2 (size 100 × 10) is generated from N(−5, 2²), bicluster 3 (size 100 × 10) is generated from N(7, 0.4²), bicluster 4 (size 100 × 20) is generated from N(1, 8²), and bicluster 5 (size 100 × 20) is generated from N(6.8, 2²). The biclustering results are shown in Table 2: HSSVD and HSSVD-O can detect both mean and variance biclusters, whereas FIT-SSVD-O and LSHM-O can only detect mean biclusters (where "O" stands for oracle input bicluster number). For mean bicluster detection, all methods performed well, because the biclustering detection rates are all greater than 0.7. For variance bicluster detection, HSSVD and HSSVD-O deliver a similar biclustering detection rate. On average, the computation time of LSHM-O is about 30 times that of HSSVD and 60 times that of FIT-SSVD-O.
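The simulated data can be generated as follows; the paper fixes the block sizes and distributions but not the row/column placement of the five blocks, so the placements below are an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Background: 1000 x 100 entries from N(1, 2^2).
X = rng.normal(1.0, 2.0, size=(1000, 100))

# Five nonoverlapping biclusters (placements are ours for illustration).
X[0:100, 0:20] = rng.normal(7.0, 2.0, size=(100, 20))      # bicluster 1: mean
X[100:200, 20:30] = rng.normal(-5.0, 2.0, size=(100, 10))  # bicluster 2: mean
X[200:300, 30:40] = rng.normal(7.0, 0.4, size=(100, 10))   # bicluster 3: mean + small variance
X[300:400, 40:60] = rng.normal(1.0, 8.0, size=(100, 20))   # bicluster 4: large variance
X[400:500, 60:80] = rng.normal(6.8, 2.0, size=(100, 20))   # bicluster 5: mean
```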
Both FIT-SSVD and LSHM are provided with the oracle rank as input. We also evaluated an automated-rank version of these methods, but found that its performance was worse than that of the corresponding oracle-rank version (results not shown). Note that the input data are standardized to mean 0 and SD 1 element-wise for FIT-SSVD-O and LSHM-O. Although this step is not mentioned in the original papers (9, 10), this simple procedure is critical for accurate mean bicluster detection. From Table 2, we can see that HSSVD-O provides the best overall performance, while HSSVD is close to the best; however, in practice, the oracle rank is unknown. For this reason, HSSVD is the only fully automated approach among those considered which delivers robust mean and variance detection in the presence of unknown heterogeneous residual variance.
Conclusion and Discussion
[Figure 4 appears here; panels: HSSVD, FIT-SSVD, LSHM, SVD; axes: cases (rows) by genes (columns).]

Fig. 4. Checkerboard plots for the four methods. We plot the rank-three approximation for each method. Within each image, samples are laid out in rows, and genes in columns. We order the samples by subtype for all images (top to bottom: carcinoid, colon, normal, and small cell), and different subtypes are separated by white lines. Genes are sorted by the estimated second right singular vector (v2), and we only include genes that are in the support (defined in Table 1). Across all methods, HSSVD and FIT-SSVD provide the clearest block structure reflecting biclusters.
Table 1. Cardinality of union support of the first three singular vectors for different methods applied to lung cancer data

Support            HSSVD   FIT-SSVD   LSHM    SVD
∑_{i=1}^3 ‖u_i‖_0  4,689   4,686      4,655   12,625
∑_{i=1}^3 ‖v_i‖_0  56      56         56      56
In this paper, we introduced HSSVD, a statistical framework and its implementation, to detect biclusters with potentially heterogeneous variances. Compared with existing methods, HSSVD is both scale invariant and rotation invariant (as the quantity used for scaling is the same for all matrix entries and does not vary by row or column). HSSVD also has the advantage of working on the log scale (Materials and Methods) in estimating the variance components: the log scale makes detection of low-variance (less than 1) biclusters possible, and any traditional SSVD method can be naturally used in our variance detection steps. This method confirms the existence of methylation hypervariability in the methylation data example. Although we use the FIT-SSVD method in our implementation, other low-rank matrix approximation methods are applicable. Moreover, the software implementing our proposed approach was computationally comparable to the other approaches we evaluated.

A potential shortcoming of SVD-based methods is their inability to detect overlapping biclusters. We investigate this problem in the first paragraph of SI Materials and Methods. We show that our method can serve as a denoising process for overlapping bicluster detection. In particular, we can first apply the HSSVD method to the raw data to obtain the mean approximation. Then we can apply a suitable approach, such as the widely used plaid model (16, 23), to the mean approximation to detect overlapping biclusters. This combined procedure improves on the performance of the plaid model when the overlapping biclusters have heterogeneous variance. Hence, our method remains useful in the presence of overlapping biclusters.

Another potential issue for HSSVD is the question of whether a low-rank mean approximation plus a low-rank variance approximation could be alternatively represented by a higher-rank mean approximation. In other words, is it possible to detect variance biclusters through mean biclusters only, even though the mean clusters that form the variance clusters would be pseudomean clusters? A detailed discussion of this issue can be found in the second paragraph of SI Materials and Methods. Our conclusion is that the variance detection step in HSSVD is necessary for the following two reasons: First, pseudomean biclusters are completely unable to capture small variance biclusters. Second, although pseudomean biclusters are able to capture some structure from large variance biclusters, such structure is much less accurate than that provided by HSSVD, and it can be confounded with one or more true mean biclusters.

Although HSSVD works well in practice, there are a number of open questions that are important to address in future studies. For example, it would be worthwhile to modify the method to allow nonnegative matrix approximations to better handle count data such as next-generation sequencing data (RNA-seq). Additionally, the ability to incorporate data from multiple "omic" platforms is becoming increasingly important in current biomedical research, and it would be useful to extend this work to simultaneous analysis of methylation, gene expression, and microRNA data.
Materials and Methods

Model Assumptions for HSSVD. We define biclusters as subsets of the data matrix which have the same mean and variance. We assume that there exists a dominant null cluster in which all elements have a common mean and variance, and that all other biclusters are restricted to rectangular structures which have either a distinct mean or a distinct variance compared with the null cluster. We can also express our model in the framework of a random effects model wherein

X = Ξ + ρ²Σ × Φ + bJ, [2]

where X and Ξ have the same structures given in the traditional model 1, and where we require Φ, an n×p matrix, to have i.i.d. random components with mean 0 and variance 1. Moreover, the "×" in [2] is defined element-wise; see the next section for details. Added components in the model include Σ = (σij), an n×p matrix representing the heterogeneous variance signal; J, an n×p matrix with all values equal to 1; ρ, a finite positive number serving as a common scale factor; and b, a finite number serving as a common location factor. We also make the sparsity assumption that the majority of the (ξij) values are 0 and the majority of the (σij) values are 1. Further, just as we assumed for the mean structure Ξ, we also assume that the variance structure Σ is low rank.

From these definitions, the traditional model 1 is a special case of our model 2, with b = 0, Σ = J, and ρ = 1. The presence of b and ρ in the model allows the corresponding method to be scale invariant, while the presence of Σ enables us to incorporate heterogeneous variance signals.
HSSVD Method. We propose HSSVD based on the model 2 with a hierarchical structure for signal recovery. First, we properly scale the matrix elements to minimize false detection of pseudomean biclusters, which can arise as artifacts of high-variance clusters. This motivates the quadratic rescaling step in the procedure. We can then detect mean biclusters based on the scaled data and later detect variance biclusters based on the logarithm of the squared residual data after subtracting out the mean biclusters. The quadratic rescaling step works well in practice, as shown in the simulation studies and data analysis. The pseudocode for the algorithm is provided as follows (a code sketch of the full pipeline appears after the numbered steps):
1. Input step: Input the raw data matrix Xorigin. Standardize Xorigin (treating each cell as i.i.d.) to have mean 0 and variance 1. Denote the overall mean of Xorigin as μ and the overall SD as σ, and let the standardized matrix be defined as X = (Xorigin − μJ)/σ.
2. Quadratic rescaling: Apply SSVD to X² − J to obtain the approximation matrix U.
3. Mean search: Let Y = X / √(U + J − cJ), where c is a small nonpositive constant chosen to ensure that √(U + J − cJ) exists. Then apply SSVD to Y to obtain the approximation matrix Ỹ.
4. Variance search: Let Zorigin = log(X − Ỹ × √(U + J − cJ))², center Zorigin to have mean 0, and denote the centered version as Z. Perform SSVD on Z to obtain the approximation matrix Z̃.
5. Background estimation: Let P = {pij} denote the n×p matrix of indicators of whether the corresponding cells belong to the background cluster, with pij = 1 if both Ỹij = 0 and Z̃ij = 0, and pij = 0 otherwise.
Table 2. Comparison of four methods in the simulation study

Criteria      HSSVD         HSSVD-O       FIT-SSVD-O    LSHM-O
L_mean        0.013 (0.01)  0.013 (0.01)  0.081 (0.01)  0.019 (0.01)
L_var         0.157 (0.03)  0.156 (0.03)  NA            NA
Sparsity      0.950 (0.04)  0.950 (0.03)  0.988 (0.02)  0.997 (0.01)
BLK1 (mean)   0.861 (0.10)  0.862 (0.10)  0.818 (0.08)  0.872 (0.08)
BLK2 (mean)   0.934 (0.18)  0.936 (0.17)  0.939 (0.18)  0.976 (0.01)
BLK3 (mean)   0.972 (0.10)  0.974 (0.10)  0.971 (0.11)  0.987 (0.01)
BLK5 (mean)   0.977 (0.11)  0.948 (0.11)  0.977 (0.11)  0.996 (0.01)
BLK3 (var)    0.977 (0.02)  0.977 (0.02)  NA            NA
BLK4 (var)    0.628 (0.25)  0.633 (0.24)  NA            NA

L_mean and L_var measure the difference between the approximated signal and the true signal, so smaller is better. For the other measures of accuracy of bicluster detection, larger is better. Rows BLK1 to BLK5 report the biclustering detection rate for each bicluster. "-O" indicates that the oracle rank is provided.
Based on the assumption that most elements in the matrix should be in the null cluster, we can estimate b with 1′(Xorigin × P)1 / (1′P1) and ρ with 1′(Xorigin × P − bP)²1 / (1′P1 − 1), where 1 is a vector with all elements equal to 1.
6. Scale back: Define P1 = {pij}, with pij = 1 if Ỹij = 0 and pij = 0 otherwise. Similarly, define P2 = {pij}, with pij = 1 if Z̃ij = 0 and pij = 0 otherwise. The mean (Ξ + bJ) approximation is computed as σ(Ỹ × √(U + J − cJ)) + μ(J − P1) + bP1, and the variance (ρ²Σ) approximation is computed as [ρ²P2 + σ²(J − P2)] × exp(Z̃).
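Steps 1 through 6 can be condensed into the following sketch (ours, not the authors' released code). A simple deflation-based sparse SVD stands in for FIT-SSVD, and the ranks, the threshold lam, the constant c, and the small numerical guards are illustrative placeholders rather than the paper's tuned choices:

```python
import numpy as np

def soft(x, t):
    # Elementwise soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ssvd_approx(M, rank=2, lam=0.5, n_iter=50):
    # Stand-in for FIT-SSVD: successive sparse rank-1 fits with deflation.
    approx, R = np.zeros_like(M), M.copy()
    for _ in range(rank):
        u0, _, vt0 = np.linalg.svd(R, full_matrices=False)
        u, v = u0[:, 0], vt0[0]
        for _ in range(n_iter):
            v = soft(R.T @ u, lam); v /= np.linalg.norm(v) + 1e-12
            u = soft(R @ v, lam);   u /= np.linalg.norm(u) + 1e-12
        d = u @ R @ v
        approx += d * np.outer(u, v)
        R -= d * np.outer(u, v)
    return approx

def hssvd(X_origin, mean_rank=2, var_rank=2, c=-0.01):
    # Step 1: standardize element-wise.
    mu, sigma = X_origin.mean(), X_origin.std()
    X = (X_origin - mu) / sigma
    # Step 2: quadratic rescaling on X^2 - J.
    U = ssvd_approx(X ** 2 - 1.0, rank=var_rank)
    # Step 3: mean search on Y = X / sqrt(U + J - cJ).
    scale = np.sqrt(U + 1.0 - c)
    Y_hat = ssvd_approx(X / scale, rank=mean_rank)
    # Step 4: variance search on centered log squared residuals
    # (the 1e-12 guard against log(0) is ours, not from the paper).
    Z_origin = np.log((X - Y_hat * scale) ** 2 + 1e-12)
    Z_hat = ssvd_approx(Z_origin - Z_origin.mean(), rank=var_rank)
    # Step 5: background cells flagged by neither search; estimate b, rho^2.
    P = (Y_hat == 0) & (Z_hat == 0)
    b = X_origin[P].mean()
    rho2 = X_origin[P].var(ddof=1)
    # Step 6: scale back to the original location and scale.
    mean_approx = np.where(Y_hat == 0, b, sigma * Y_hat * scale + mu)
    var_approx = np.where(Z_hat == 0, rho2, sigma ** 2 * np.exp(Z_hat))
    return mean_approx, var_approx, P
```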
The operators ×, /, exp(), log(), min(), and √( ) used above are defined element-wise when applied to matrices, e.g., U_{n×p} × V_{n×p} = (uij vij). In all steps involving SSVD, we implement the FIT-SSVD method (10). We use FIT-SSVD because it is computationally fast and has similar or superior performance compared with other competing methods under the homogeneous variance assumption (10). The matrix √(U + J − cJ) provides a working variance-level estimate of the data and makes our method more robust. Note that the reason for working on the log scale for the variance detection is twofold. First, working on the log scale makes the detection of deflated-variance (less than 1) biclusters possible. Intuitively, as variance measures deviation from the mean, we can work on the squared residuals to find the variance structure. In the deflated-variance bicluster setting, if the mean structure is estimated correctly, the residuals within the bicluster are close to zero. The SSVD-based methods shrink small nonzero elements to zero to achieve sparsity. As a result, if we work on the squared residuals directly, the SSVD-based methods will fail to detect the low-variance structure. Second, to use the well-established SSVD machinery in the variance detection steps, we need to work on the log scale. To see this, we can rewrite the equation in [2] as log(X − Ξ − bJ)² = log(Σ²) + log(ρ²Φ²), which is similar to the model in [1]. Consequently, we can apply any method which is applicable to [1] in our variance detection step if we work on the log scale and Σ is low rank. We also want to point out that results obtained directly from FIT-SSVD are relative to the location and scale of the background cluster. In addition, we have scaled the data in the input step. To provide a correct mean and variance approximation of the original data, we need the scale-back step. Assuming that the detection of null clusters is close to the truth, the pooled mean and variance estimates based on elements exclusively from the identified null cluster (b and ρ) are more accurate than estimates based on all elements of the matrix (μ and σ). As a result, we use the comprehensive formula proposed in the scale-back step.
The FIT-SSVD method, as well as any other SVD-based method, requires an approximation of the rank of the matrix (which is essentially the number of true biclusters) as input. We adapt the bicross-validation (BCV) method of ref. 24 for rank estimation, and we notice that in some cases the rank is underestimated. For this reason, we introduce additional steps following a BCV estimate of rank k: First, we approximate the data with a sparse matrix X_{k+1} (of rank k + 1), where X_{k+1} = ∑_{j=1}^{k+1} d_j u_j v_j^T. Define the proportion of variance explained by the top rank-i sparse matrix as R_i = ∑_{j=1}^{i} d_j² / ∑_{j=1}^{k+1} d_j² (25). R_i lies between 0 and 1 and is increasing in i, and we believe that the redundant components of the sparse matrix should not contribute much to the total variance. The final rank estimate for HSSVD is the smallest integer r which satisfies R_r > 0.95, with 1 ≤ r ≤ k + 1. Note that FIT-SSVD (10) used a modified BCV method for rank estimation; however, the authors require that most rows (the whole row) and most columns (the whole column) are sparse, which appears to be too restrictive. In practice, this assumption is violated if the data are block diagonal or have certain other commonly assumed data structures. For this reason, we use the original BCV method as our starting point.
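The rank-correction rule can be written directly in terms of the singular values of the rank-(k+1) sparse fit; a sketch, where d holds d_1, ..., d_{k+1}:

```python
import numpy as np

def select_rank(d, threshold=0.95):
    """Smallest r with R_r = sum_{j<=r} d_j^2 / sum_{j<=k+1} d_j^2 > threshold,
    given the singular values d of the rank-(k+1) sparse approximation."""
    d2 = np.asarray(d, dtype=float) ** 2
    R = np.cumsum(d2) / d2.sum()
    return int(np.argmax(R > threshold)) + 1

# Example: select_rank([10, 6, 0.5, 0.1]) returns 2, since
# R_1 = 0.73 and R_2 = 0.998 > 0.95.
```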
ACKNOWLEDGMENTS. The authors thank the editor and two referees for helpful comments. The authors also thank Dr. Dan Yang for sharing part of her code. This work was supported in part by Grant P01 CA142538 from the National Institutes of Health.
1. Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning (Springer, New York), Vol 1, pp 460–462.
2. Witten DM, Tibshirani R (2010) A framework for feature selection in clustering. J Am Stat Assoc 105(490):713–726.
3. Ma Z (2013) Sparse principal component analysis and iterative thresholding. Ann Statist 41(2):772–801.
4. Shen H, Huang J (2008) Sparse principal component analysis via regularized low rank matrix approximation. J Multivariate Anal 99(6):1015–1034.
5. Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Statist 15(2):265–286.
6. Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1):1–58.
7. Cheng Y, Church GM (2000) Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 8:93–103.
8. Busygin S (2008) Biclustering in data mining. Comput Oper Res 35(9):2964–2987.
9. Lee M, Shen H, Huang JZ, Marron JS (2010) Biclustering via sparse singular value decomposition. Biometrics 66(4):1087–1095.
10. Yang D, Ma Z, Buja A (2011) A sparse SVD method for high-dimensional data. arXiv:1112.2433.
11. Witten DM, Tibshirani R, Hastie T (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3):515–534.
12. Hoff PD (2006) Model averaging and dimension selection for the singular value decomposition. J Am Stat Assoc 102(478):674–685.
13. Johnstone IM, Lu AY (2009) On consistency and sparsity for principal components analysis in high dimensions. J Am Stat Assoc 104(486):682–693.
14. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58(1):267–288.
15. Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360.
16. Lazzeroni L, Owen A (2002) Plaid models for gene expression data. Statist Sinica 12:61–86.
17. Shabalin AA, Weigman VJ, Perou CM, Nobel AB (2009) Finding large average submatrices in high dimensional data. Ann Appl Stat 3(3):985–1012.
18. Allen GI, Grosenick L, Taylor J (2011) A generalized least squares matrix decomposition. arXiv:1102.3074.
19. Hansen KD, et al. (2011) Increased methylation variation in epigenetic domains across cancer types. Nat Genet 43(8):768–775.
20. Irizarry RA, et al. (2009) The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat Genet 41(2):178–186.
21. Liu Y, Hayes DN, Nobel A, Marron JS (2008) Statistical significance of clustering for high-dimension, low-sample size data. J Am Stat Assoc 103(483):1281–1293.
22. Bhattacharjee A, et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 98(24):13790–13795.
23. Turner H, Bailey T, Krzanowski W (2005) Improved biclustering of microarray data demonstrated through systematic performance tests. Comput Stat Data Anal 48(2):235–254.
24. Owen AB, Perry PO (2009) Bi-cross-validation of the SVD and the nonnegative matrix factorization. Ann Appl Stat 3(2):564–594.
25. Allen GI, Maletić-Savatić M (2011) Sparse non-negative generalized PCA with applications to metabolomics. Bioinformatics 27(21):3029–3035.