Biclustering with heterogeneous variance

Guanhua Chen(a), Patrick F. Sullivan(b,c), and Michael R. Kosorok(a,d,1)
Departments of aBiostatistics, bGenetics, cPsychiatry, and dStatistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
Edited by Xiaotong Shen, University of Minnesota, Minneapolis, MN, and accepted by the Editorial Board June 4, 2013 (received for review March 7, 2013)
In cancer research, as in all of medicine, it is important to classify patients into etiologically and therapeutically relevant subtypes to improve diagnosis and treatment. One way to do this is to use clustering methods to find subgroups of homogeneous individuals based on genetic profiles together with heuristic clinical analysis. A notable drawback of existing clustering methods is that they ignore the possibility that the variance of gene expression profile measurements can be heterogeneous across subgroups, and methods that do not consider heterogeneity of variance can lead to inaccurate subgroup prediction. Research has shown that hypervariability is a common feature among cancer subtypes. In this paper, we present a statistical approach that can capture both mean and variance structure in genetic data. We demonstrate the strength of our method in both synthetic data and in two cancer data sets. In particular, our method confirms the hypervariability of methylation level in cancer patients, and it detects clearer subgroup patterns in lung cancer data.
Clustering is an important type of unsupervised learning algorithm for data exploration. Successful examples include K-means clustering and hierarchical clustering, both of which are widely used in biological research to find cancer subtypes and to stratify patients. These and other traditional clustering algorithms depend on distances calculated using all of the features. For example, individuals can be clustered into homogeneous groups by minimizing the summation of within-cluster sums of squares (the Euclidean distances) of their gene expression profiles. Unfortunately, this strategy is ineffective when only a subset of features is informative. This phenomenon can be demonstrated by K-means clustering (1) results for a toy example using only the variables which determine the underlying true clusters compared with using all variables (which include many uninformative variables). As can be seen in Fig. 1, clustering performance is poor when all variables are used in the clustering algorithm (2).
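This toy experiment is easy to reproduce. The sketch below (ours, not from the paper) uses scikit-learn's KMeans and the adjusted Rand index to quantify the contrast between clustering on the two informative variables and clustering on all 1,002 variables; the within-cluster SD of 0.5 is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Two clusters of 100 points centered at (1, 1) and (-1, -1),
# padded with 1,000 pure-noise variables (1,002 variables total).
informative = np.vstack([
    rng.normal(1.0, 0.5, size=(100, 2)),
    rng.normal(-1.0, 0.5, size=(100, 2)),
])
noise = rng.normal(0.0, 1.0, size=(200, 1000))
X = np.hstack([informative, noise])
truth = np.repeat([0, 1], 100)

# Cluster using only X1, X2 versus using all 1,002 variables.
for name, data in [("informative only", X[:, :2]), ("all variables", X)]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    print(name, "ARI =", round(adjusted_rand_score(truth, labels), 2))
```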
To solve this problem, sparse clustering methods have been proposed to allow clustering decisions to depend on only a subset of feature variables (the property of sparsity). Prominent sparse clustering methods include sparse principal component analysis (PCA) (3–5) and Sparse K-means (2), among others (6). However, sparse clustering still fails if the true sparsity is a local rather than a global phenomenon (6). More specifically, different subsets of features can be informative for some samples but not all samples; in other words, sparsity exists in both features and samples jointly. Biclustering methods are a potential solution to this problem, and further generalize the sparsity principle by considering samples and features as exchangeable concepts to handle local sparsity (6, 7). For example, gene expression data can be represented as a matrix with genes as columns and subjects as rows (with various and possibly unknown diseases or tissue types). Traditional methods will either cluster the rows (as done, for example, in microarray research, where researchers want to find subpopulation structure among subjects to identify possible common disease status) or cluster the columns (as done, for example, in gene clustering research, where genes are of interest and the goal is to predict the biological function of novel genes from the function of other well-studied genes within the same clusters). In contrast, biclustering involves clustering rows and columns simultaneously to account for the interaction of row and column sparsity. This local sparsity perspective provides an intuition for using sparse singular value decomposition (SSVD) algorithms for biclustering (8–11). SSVD assumes that the signal in the data matrix can be represented by a low-rank matrix X ≈ UDV^T = ∑_{i=1}^{r} d_i u_i v_i^T with X ∈ ℜ^{n×p}. U = [u1, u2, ..., ur] ∈ ℜ^{n×r} and V = [v1, v2, ..., vr] ∈ ℜ^{p×r} contain left and right sparse singular vectors and are orthonormal with only a few nonzero elements (corresponding to local sparsity). D ∈ ℜ^{r×r} is diagonal (with diagonal elements d1, d2, ..., dr), with r ≪ rank(X). The outer product of each pair of sparse singular vectors (u_i v_i^T, i = 1, 2, ..., r) designates two biclusters, corresponding to positive and negative elements, respectively.
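As a concrete illustration of the idea (not the exact algorithm of refs. 9–11), the sketch below alternates power iterations with elementwise soft-thresholding so that the leading singular vector pair becomes sparse; the thresholds lam_u and lam_v are illustrative tuning parameters.

```python
import numpy as np

def soft(x, lam):
    # Elementwise soft-thresholding, the proximal operator of the l1 penalty.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def rank1_ssvd(X, lam_u=0.5, lam_v=0.5, n_iter=50):
    """Alternate power iterations with soft-thresholding so the leading
    singular vectors become sparse; zeros mark cells outside the bicluster."""
    u0, _, vt0 = np.linalg.svd(X, full_matrices=False)
    u, v = u0[:, 0], vt0[0]                  # dense SVD warm start
    for _ in range(n_iter):
        v = soft(X.T @ u, lam_v)
        v /= np.linalg.norm(v) + 1e-12
        u = soft(X @ v, lam_u)
        u /= np.linalg.norm(u) + 1e-12
    d = u @ X @ v                            # singular value estimate
    return u, d, v

# A 100 x 50 matrix with a 20 x 10 mean bicluster embedded in noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))
X[:20, :10] += 3.0
u, d, v = rank1_ssvd(X)
print("nonzero rows:", np.flatnonzero(u).size, "nonzero cols:", np.flatnonzero(v).size)
```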
A common assumption of existing SSVD biclustering methods is that the observed data can be decomposed into a signal matrix plus a fully exchangeable random noise matrix:
X = Ξ + Φ, [1]
where X is the observed data, Ξ = (ξij) is an n×p matrix representing the signal, and Φ = (ϕij) is an n×p random noise/residual matrix with independent identically distributed (i.i.d.) entries (10, 12, 13). A method based on model 1 is proposed in ref. 9, which minimizes the sum of the Frobenius norm of X − Ξ and a penalty function for variable selection, such as the ℓ1-norm (14) or smoothly clipped absolute deviation (15). A similar loss-plus-penalty minimization approach can be seen in ref. 11. A different method for SSVD employs iterative thresholding QR decomposition to estimate Ξ in ref. 10. We refer to ref. 9 as LSHM (for Lee, Shen, Huang, and Marron) and ref. 10 as fast iterative thresholding for SSVD (FIT-SSVD), and compare these approaches to our method. An alternative approach, which is more direct, is based on a mixture model (16, 17). For example, ref. 17 defines the bicluster as a submatrix with a large positive or negative mean. Although these approaches have proven successful in some settings, they are limited by their focus on only the mean signal approximation. In addition, the explicit homogeneous residual variance assumption is too restrictive in many applications.

To our knowledge, the only extension of the traditional model
given in [1] is the generalized PCA approach (18), which assumes that if the random noise matrix were stacked into a vector, vec(Φ), it would have mean 0 and variance R^{-1} ⊗ Q^{-1}, where R^{-1} is the common covariance structure of the random variables within the same column, and Q^{-1} is the common covariance structure of the random variables within the same row. This approach is especially suited to denoising NMR data, for which there is a natural covariance structure of the form given above (18).
Author contributions: G.C., P.F.S., and M.R.K. designed research; G.C., P.F.S., and M.R.K. performed research; G.C. and M.R.K. contributed new reagents/analytic tools; G.C. analyzed data; and G.C., P.F.S., and M.R.K. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission. X.S. is a guest editor invited by the Editorial Board.
Freely available online through the PNAS open access option.
1To whom correspondence should be addressed. E-mail: kosorok@unc.edu.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1304376110/-/DCSupplemental.
Drawbacks of the generalized PCA method, however, are that it remains focused on mean signal approximation and that the structure of R^{-1} and Q^{-1} must be explicitly known in advance.

In this paper, we present a biclustering framework based on SSVD called heterogeneous sparse singular value decomposition (HSSVD). This method can detect both mean biclusters and variance biclusters in the presence of unknown heterogeneous residual variance. We apply our method, as well as competing approaches, to two cancer data sets, one with methylation data and the other with gene expression data. Our method delivers more distinct genetic profile pattern detection and is able to confirm the biological findings originally made for each of the data sets. We also apply our method and the competing approaches to synthetic data to compare their performance quantitatively. We demonstrate that our proposed method is robust, location- and scale-invariant, and computationally feasible.
Application to Cancer Data

Hypervariability of Methylation in Cancer. We demonstrate the capability of variance bicluster detection with methylation data in cancer versus normal patients (19). The experiments were conducted with a custom nucleotide-specific Illumina bead array to increase the precision of DNA methylation measurements on previously identified cancer-specific differentially methylated regions (cDMRs) in colon cancer (20). The data set (GEO accession: GSE29505) consists of 290 samples, including cancer samples (colon, breast, lung, thyroid, and Wilms' tumor cancers) and matched normal samples. Each sample had 384 methylation probes which covered 151 cDMRs. The authors of the primary report concluded that cancer samples had hypervariability in these cDMRs across all cancer types (19).

First, we wish to verify that HSSVD can provide a good mean signal approximation of methylation. In this data set, all of the probes measuring methylation are placed in the cDMRs identified in colon cancer patients. As a result, we would expect that mean methylation levels differ between colon cancer samples and the matched normal samples. Under this assumption, we require the biclustering methods to capture this mean structure before investigating the information gained from variance structure estimation. Note that the numerical range of the methylation level is between 0 and 1. Hence, we applied the logit transformation to the original data before biclustering analysis (a code sketch of this transformation appears below). We compare three methods, HSSVD, FIT-SSVD, and LSHM, all based on SVD. Only colon cancer samples and their matched normal samples are used for this particular analysis. In Fig. 2, we can see from the hierarchical clustering analysis that the majority of colon cancer samples (labeled blue in the sidebar) are grouped together and most of the cDMRs are differentially expressed in colon tumor samples compared with normal samples. The conclusion is the same for all three methods compared, including our proposed HSSVD method.
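A minimal version of this preprocessing step, with an epsilon clip (our choice, not specified in the paper) to guard against values of exactly 0 or 1:

```python
import numpy as np

def logit(beta, eps=1e-6):
    # Map methylation levels from (0, 1) to the real line; clip to
    # avoid infinities at exactly 0 or 1.
    b = np.clip(beta, eps, 1.0 - eps)
    return np.log(b / (1.0 - b))

# Example: a 290-sample x 384-probe matrix of methylation levels.
rng = np.random.default_rng(2)
beta = rng.uniform(0.0, 1.0, size=(290, 384))
X = logit(beta)  # input to the biclustering methods
```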
Second, our proposed HSSVD method confirms the most important finding in ref. 19, that cancer samples tend to have hypervariability in methylation level regardless of tumor subtype. We compared the mean approximation and variance approximation results of HSSVD. All samples were used in this analysis. The variance approximation of HSSVD (Fig. 3A) shows that nearly all normal samples have low variance compared with cancer samples, and this pattern is consistent across all cDMRs. Notably, our method provides additional information beyond the conclusion of ref. 19. Specifically, our variance approximation suggests that some cancer samples are not characterized by hypervariability in methylation level for certain cDMRs. More precisely, some cDMRs for a few cancer samples (surrounded by normal samples) are predicted to have low variance (lower left part of Fig. 3A). Our method also highlights the cDMRs with the greatest contrast in variance between cancer and normal samples. The corresponding cDMRs with high contrast variance (especially some of the first and middle columns of Fig. 3A) warrant further study for biological and clinical relevance. We also want to emphasize that the analysis in ref. 19 relies on the disease status information, whereas for HSSVD the disease status is only used for result interpretation. Note that most cancer patients cluster together by hierarchical clustering of the variance approximation from HSSVD.
[Figure 1 appears here: three scatter panels ("True Label", "Predicted Label (two variables)", "Predicted Label (all variables)"), each plotting X2 against X1; axis tick residue omitted.]

Fig. 1. Data set contains two clusters determined by two variables X1 and X2 such that points around (1, 1) and (−1, −1) naturally form clusters. There are 200 observations (100 for each cluster) and 1,002 variables (X1, X2, and 1,000 random noise variables). We plot the data in the 2D space of X1 and X2. Graphs with true cluster labels and predicted cluster labels, obtained by clustering using only X1 and X2 and by clustering using all variables, are laid out from left to right. The predicted labels are the same as the true labels only when X1 and X2 are used for clustering; the performance is much worse when all variables are used.
In contrast, clustering the mean approximation from HSSVD in Fig. 3B fails to reveal such a pattern. This indicates that most cancer samples may have hypervariability of methylation as a common feature, whereas their mean-level methylation varies from sample to sample. Hence, identifying variance biclusters can provide potential new insight into cancer epigenesis.
Gene Expression in Lung Cancer. Some biological settings, in contrast with the methylation example above, do not express variance heterogeneity. Usually, the presence or absence of such heterogeneity is not known in advance for a given research data set. Thus, it is important to verify that the proposed approach remains effective in either case for discovering mean-only biclusters. We now demonstrate that even in settings without variance heterogeneity, HSSVD can better identify discriminative biclusters for different cancer subtypes than other methods, including FIT-SSVD (10), LSHM (9), and traditional SVD. We use a lung cancer data set which has been studied in the statistics literature (9, 10, 17). The samples are a subset of patients (21) having lung cancer, with gene expression measured by the Affymetrix 95av2 GeneChip (22). The data set contains the expression levels of 12,625 genes for 56 patients, each having one of four disease subtypes: normal lung (20 samples), pulmonary carcinoid tumors (13 samples), colon metastases (17 samples), and small-cell carcinoma (6 samples).

The performance of the different methods is evaluated based on the pattern difference of subtypes in the mean approximations. For all methods, we set the rank of the mean signal matrix equal to 3 to maintain consistency with the ranks used in FIT-SSVD (10) and LSHM (9). Further, we use the measurement "support" to evaluate the sparsity of the estimated gene signal (10). Support is the cardinality of the nonzero elements in the right and left singular vectors across the three layers (i.e., support is an integer that cannot exceed the data dimension). Smaller support values suggest a sparser model. Table 1 shows that HSSVD, FIT-SSVD, and LSHM yield similar levels of sparsity in the gene signal, whereas SVD is not sparse, as expected. Fig. 4 shows checkerboard plots of the rank-three approximations by the four methods. Patients are placed on the vertical axis, and the patient order is the same for all images. Patients within the same subtype are stacked together, and different subtypes are separated by white lines. Within each image, genes are laid out on the horizontal axis and are ordered by the value of v2 (10). We can see a clear block structure for both the FIT-SSVD and HSSVD methods, indicating biclustering. The block structure suggests we can discriminate the four cancer subtypes using either the FIT-SSVD or HSSVD method, whereas LSHM and SVD are unable to achieve such separation among subtypes.
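Reading support as the cardinality of the union of nonzero positions across the first three singular vectors (the reading consistent with "cannot exceed the data dimension"), it can be computed in a few lines; U and V below are assumed to hold the sparse vectors as columns.

```python
import numpy as np

def union_support(vecs, tol=1e-12):
    # vecs: array of shape (dimension, n_layers), e.g. U or V from a rank-3 fit.
    # Counts coordinates nonzero in at least one of the layers.
    return int((np.abs(vecs) > tol).any(axis=1).sum())

# With U (12,625 genes x 3) and V (56 patients x 3) from a sparse rank-3
# fit, union_support(U) and union_support(V) give the gene and patient
# support reported in Table 1.
```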
Simulation Study

To evaluate the performance of HSSVD quantitatively, we conducted a simulation study. We compared HSSVD with the most relevant existing biclustering methods, FIT-SSVD and LSHM (9, 10). HSSVD includes a rank estimation component, whereas the other methods do not automatically include this. For this reason, we use a fixed oracle rank (at the true value) for the non-HSSVD methods. For comparison, we also evaluate HSSVD with fixed oracle rank (HSSVD-O).

The performance of these methods on simulated data was evaluated on four criteria. The first criterion is "sparsity of estimation," defined as the ratio between the size of the correctly identified background cluster and the size of the true background cluster.
[Figure 2 appears here; the heatmap row and column label residue is omitted.]

Fig. 2. Mean approximation of colon cancer and the normal matched samples. From left to right, the methods are HSSVD, FIT-SSVD, and LSHM. Colon cancer samples are labeled in blue, and normal matched samples are labeled in pink in the sidebar. Genes and samples are ordered by hierarchical clustering. Colon cancer patients are clustered together, which indicates that the mean approximations for these three methods achieve the expected signal structure.
[Figure 3 appears here; panel titles: (A) "HSSVD: Variance Approximation" (color key −2 to 2) and (B) "HSSVD: Mean Approximation" (color key −1 to 1); heatmap label residue omitted.]

Fig. 3. HSSVD approximation result for all samples. (A) Variance approximation; (B) mean approximation. Blue represents cancer samples, and pink represents normal samples in the sidebar. Genes and samples are ordered by hierarchical clustering. Red represents large values, and green represents small values. Only the variance approximation can discriminate between cancer and normal samples. More importantly, within the same gene, the heatmap for the variance approximation indicates that cancer patients have larger variance than normal individuals. This result matches the conclusion in ref. 19. In addition, the cDMRs with the greatest contrast in variance across cancer and normal samples are highlighted by the variance approximation, whereas the original paper does not provide such information.
The second criterion is "biclustering detection rate," defined as the ratio of the intersection of the estimated bicluster and the true bicluster over their union (also known as the Jaccard index). For the first two criteria, larger values indicate better performance. The third and fourth criteria are "overall matrix approximation errors" for mean and variance biclusters, consisting of the scaled recovery error for the low-rank mean signal matrix Ξ̃ = Ξ + bJ, computed via

L_mean(Ξ̃, Ξ) = ‖Ξ − Ξ̃‖²_F / ‖Ξ‖²_F,

and the scaled recovery error for the low-rank variance signal matrix log(Σ̃) = log(Σ) + log(ρ²J), computed via

L_var(log Σ̃, log Σ) = ‖log(Σ^{1/2}) − log(Σ̃^{1/2})‖²_F / ‖log(Σ^{1/2})‖²_F,

with ‖·‖_F being the Frobenius norm.
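Minimal implementations of these criteria, assuming the biclusters are encoded as boolean cell masks and the (positive-entry) signal matrices are given as numpy arrays (a sketch, not the authors' code):

```python
import numpy as np

def jaccard(est_mask, true_mask):
    # Biclustering detection rate: |intersection| / |union| of the
    # estimated and true bicluster cell sets.
    inter = np.logical_and(est_mask, true_mask).sum()
    union = np.logical_or(est_mask, true_mask).sum()
    return inter / union

def l_mean(Xi_hat, Xi):
    # Scaled Frobenius recovery error for the mean signal.
    return np.linalg.norm(Xi - Xi_hat, "fro") ** 2 / np.linalg.norm(Xi, "fro") ** 2

def l_var(Sigma_hat, Sigma):
    # Scaled recovery error for the variance signal on the log-SD scale;
    # both matrices must have strictly positive entries.
    num = np.linalg.norm(np.log(np.sqrt(Sigma)) - np.log(np.sqrt(Sigma_hat)), "fro") ** 2
    den = np.linalg.norm(np.log(np.sqrt(Sigma)), "fro") ** 2
    return num / den
```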
The simulated data comprise a 1000 × 100 matrix with independent entries. The background entries follow a normal distribution with mean 1 and SD 2. We denote this distribution as N(1, 2²), where N(a, b²) represents a normal random variable with mean a and SD b. There are five nonoverlapping rectangular biclusters: biclusters 1, 2, and 5 are mean clusters, bicluster 3 is a mean and small-variance cluster, and bicluster 4 is a large-variance cluster. More precisely, bicluster 1 (size 100 × 20) is generated from N(7, 2²), bicluster 2 (size 100 × 10) is generated from N(−5, 2²), bicluster 3 (size 100 × 10) is generated from N(7, 0.4²), bicluster 4 (size 100 × 20) is generated from N(1, 8²), and bicluster 5 (size 100 × 20) is generated from N(6.8, 2²). The biclustering results are shown in Table 2: HSSVD and HSSVD-O can detect both mean and variance biclusters, whereas FIT-SSVD-O and LSHM-O can only detect mean biclusters (where "O" stands for oracle input bicluster number). For mean bicluster detection, all methods performed well, because the biclustering detection rates are all greater than 0.7. For variance bicluster detection, HSSVD and HSSVD-O deliver a similar biclustering detection rate. On average, the computation time of LSHM-O is about 30 times that of HSSVD and 60 times that of FIT-SSVD-O.
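The simulated data can be generated as follows; the paper fixes the block sizes and distributions but not the row/column placement of the five blocks, so the placements below are an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Background: 1000 x 100 entries from N(1, 2^2).
X = rng.normal(1.0, 2.0, size=(1000, 100))

# Five nonoverlapping biclusters (placements are ours for illustration).
X[0:100, 0:20] = rng.normal(7.0, 2.0, size=(100, 20))      # bicluster 1: mean
X[100:200, 20:30] = rng.normal(-5.0, 2.0, size=(100, 10))  # bicluster 2: mean
X[200:300, 30:40] = rng.normal(7.0, 0.4, size=(100, 10))   # bicluster 3: mean + small variance
X[300:400, 40:60] = rng.normal(1.0, 8.0, size=(100, 20))   # bicluster 4: large variance
X[400:500, 60:80] = rng.normal(6.8, 2.0, size=(100, 20))   # bicluster 5: mean
```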
Both FIT-SSVD and LSHM are provided with the oracle rank as input. We also evaluated an automated-rank version of these methods, but found that its performance was worse than that of the corresponding oracle-rank version (results not shown). Note that the input data are standardized to mean 0 and SD 1 element-wise for FIT-SSVD-O and LSHM-O. Although this step is not mentioned in the original papers (9, 10), this simple procedure is critical for accurate mean bicluster detection. From Table 2, we can see that HSSVD-O provides the best overall performance, while HSSVD is close to the best; however, in practice, the oracle rank is unknown. For this reason, HSSVD is the only fully automated approach among those considered which delivers robust mean and variance detection in the presence of unknown heterogeneous residual variance.
Conclusion and Discussion
[Figure 4 appears here; panels: HSSVD, FIT-SSVD, LSHM, SVD; axes: cases (rows) by genes (columns).]

Fig. 4. Checkerboard plots for the four methods. We plot the rank-three approximation for each method. Within each image, samples are laid out in rows, and genes in columns. We order the samples by subtype for all images (top to bottom: carcinoid, colon, normal, and small cell), and different subtypes are separated by white lines. Genes are sorted by the estimated second right singular vector (v2), and we only include genes that are in the support (defined in Table 1). Across all methods, HSSVD and FIT-SSVD provide the clearest block structure reflecting biclusters.
Table 1. Cardinality of union support of the first three singular vectors for different methods applied to lung cancer data

Support            HSSVD   FIT-SSVD   LSHM    SVD
∑_{i=1}^3 ‖u_i‖_0  4,689   4,686      4,655   12,625
∑_{i=1}^3 ‖v_i‖_0  56      56         56      56
In this paper, we introduced HSSVD, a statistical framework and its implementation, to detect biclusters with potentially heterogeneous variances. Compared with existing methods, HSSVD is both scale invariant and rotation invariant (as the quantity used for scaling is the same for all matrix entries and does not vary by row or column). HSSVD also has the advantage of working on the log scale (Materials and Methods) in estimating the variance components: the log scale makes detection of low-variance (less than 1) biclusters possible, and any traditional SSVD method can be naturally used in our variance detection steps. This method confirms the existence of methylation hypervariability in the methylation data example. Although we use the FIT-SSVD method in our implementation, other low-rank matrix approximation methods are applicable. Moreover, the software implementing our proposed approach was computationally comparable to the other approaches we evaluated.

A potential shortcoming of SVD-based methods is their inability to detect overlapping biclusters. We investigate this problem in the first paragraph of SI Materials and Methods. We show that our method can serve as a denoising process for overlapping bicluster detection. In particular, we can first apply the HSSVD method to the raw data to obtain the mean approximation. Then we can apply a suitable approach, such as the widely used plaid model (16, 23), to the mean approximation to detect overlapping biclusters. This combined procedure improves on the performance of the plaid model when the overlapping biclusters have heterogeneous variance. Hence, our method remains useful in the presence of overlapping biclusters.

Another potential issue for HSSVD is the question of whether a low-rank mean approximation plus a low-rank variance approximation could be alternatively represented by a higher-rank mean approximation. In other words, is it possible to detect variance biclusters through mean biclusters only, even though the mean clusters that form the variance clusters would be pseudomean clusters? A detailed discussion of this issue can be found in the second paragraph of SI Materials and Methods. Our conclusion is that the variance detection step in HSSVD is necessary for the following two reasons: First, pseudomean biclusters are completely unable to capture small variance biclusters. Second, although pseudomean biclusters are able to capture some structure from large variance biclusters, such structure is much less accurate than that provided by HSSVD, and it can be confounded with one or more true mean biclusters.

Although HSSVD works well in practice, there are a number of open questions that are important to address in future studies. For example, it would be worthwhile to modify the method to allow nonnegative matrix approximations to better handle count data such as next-generation sequencing data (RNA-seq). Additionally, the ability to incorporate data from multiple "omic" platforms is becoming increasingly important in current biomedical research, and it would be useful to extend this work to simultaneous analysis of methylation, gene expression, and microRNA data.
Materials and Methods

Model Assumptions for HSSVD. We define biclusters as subsets of the data matrix which have the same mean and variance. We assume that there exists a dominant null cluster in which all elements have a common mean and variance, and that all other biclusters are restricted to rectangular structures which have either a distinct mean or a distinct variance compared with the null cluster. We can also express our model in the framework of a random effects model wherein

X = Ξ + ρ²Σ × Φ + bJ, [2]

where X and Ξ have the same structures given in the traditional model 1, and where we require Φ, an n×p matrix, to have i.i.d. random components with mean 0 and variance 1. Moreover, the "×" in [2] is defined element-wise; see the next section for details. Added components in the model include Σ = (σij), an n×p matrix representing the heterogeneous variance signal; J, an n×p matrix with all values equal to 1; ρ, a finite positive number serving as a common scale factor; and b, a finite number serving as a common location factor. We also make the sparsity assumption that the majority of the (ξij) values are 0 and the majority of the (σij) values are 1. Further, just as we assumed for the mean structure Ξ, we also assume that the variance structure Σ is low rank.

From these definitions, the traditional model 1 is a special case of our model 2, with b = 0, Σ = J, and ρ = 1. The presence of b and ρ in the model allows the corresponding method to be scale invariant, while the presence of Σ enables us to incorporate heterogeneous variance signals.
HSSVD Method. We propose HSSVD based on the model 2 with a hierarchical structure for signal recovery. First, we properly scale the matrix elements to minimize false detection of pseudomean biclusters, which can arise as artifacts of high-variance clusters. This motivates the quadratic rescaling step in the procedure. We can then detect mean biclusters based on the scaled data and later detect variance biclusters based on the logarithm of the squared residual data after subtracting out the mean biclusters. The quadratic rescaling step works well in practice, as shown in the simulation studies and data analysis. The pseudocode for the algorithm is provided as follows (a code sketch of the full pipeline appears after the numbered steps):
1. Input step: Input the raw data matrix Xorigin. Standardize Xorigin (treating each cell as i.i.d.) to have mean 0 and variance 1. Denote the overall mean of Xorigin as μ and the overall SD as σ, and let the standardized matrix be defined as X = (Xorigin − μJ)/σ.
2. Quadratic rescaling: Apply SSVD to X² − J to obtain the approximation matrix U.
3. Mean search: Let Y = X / √(U + J − cJ), where c is a small nonpositive constant chosen to ensure that √(U + J − cJ) exists. Then apply SSVD to Y to obtain the approximation matrix Ỹ.
4. Variance search: Let Zorigin = log(X − Ỹ × √(U + J − cJ))², center Zorigin to have mean 0, and denote the centered version as Z. Perform SSVD on Z to obtain the approximation matrix Z̃.
5. Background estimation: Let P = {pij} denote the n×p matrix of indicators of whether the corresponding cells belong to the background cluster, with pij = 1 if both Ỹij = 0 and Z̃ij = 0, and pij = 0 otherwise.
Table 2. Comparison of four methods in the simulation study

Criteria      HSSVD         HSSVD-O       FIT-SSVD-O    LSHM-O
L_mean        0.013 (0.01)  0.013 (0.01)  0.081 (0.01)  0.019 (0.01)
L_var         0.157 (0.03)  0.156 (0.03)  NA            NA
Sparsity      0.950 (0.04)  0.950 (0.03)  0.988 (0.02)  0.997 (0.01)
BLK1 (mean)   0.861 (0.10)  0.862 (0.10)  0.818 (0.08)  0.872 (0.08)
BLK2 (mean)   0.934 (0.18)  0.936 (0.17)  0.939 (0.18)  0.976 (0.01)
BLK3 (mean)   0.972 (0.10)  0.974 (0.10)  0.971 (0.11)  0.987 (0.01)
BLK5 (mean)   0.977 (0.11)  0.948 (0.11)  0.977 (0.11)  0.996 (0.01)
BLK3 (var)    0.977 (0.02)  0.977 (0.02)  NA            NA
BLK4 (var)    0.628 (0.25)  0.633 (0.24)  NA            NA

L_mean and L_var measure the difference between the approximated signal and the true signal, so smaller is better. For the other measures of accuracy of bicluster detection, larger is better. Rows BLK1 to BLK5 report the biclustering detection rate for each bicluster. "-O" indicates that the oracle rank is provided.
Based on the assumption that most elements in the matrix should be in the null cluster, we can estimate b with 1′(Xorigin × P)1 / (1′P1) and ρ with 1′(Xorigin × P − bP)²1 / (1′P1 − 1), where 1 is a vector with all elements equal to 1.
6. Scale back: Define P1 = {pij}, with pij = 1 if Ỹij = 0 and pij = 0 otherwise. Similarly, define P2 = {pij}, with pij = 1 if Z̃ij = 0 and pij = 0 otherwise. The mean (Ξ + bJ) approximation is computed as σ(Ỹ × √(U + J − cJ)) + μ(J − P1) + bP1, and the variance (ρ²Σ) approximation is computed as [ρ²P2 + σ²(J − P2)] × exp(Z̃).
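Steps 1 through 6 can be condensed into the following sketch (ours, not the authors' released code). A simple deflation-based sparse SVD stands in for FIT-SSVD, and the ranks, the threshold lam, the constant c, and the small numerical guards are illustrative placeholders rather than the paper's tuned choices:

```python
import numpy as np

def soft(x, t):
    # Elementwise soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ssvd_approx(M, rank=2, lam=0.5, n_iter=50):
    # Stand-in for FIT-SSVD: successive sparse rank-1 fits with deflation.
    approx, R = np.zeros_like(M), M.copy()
    for _ in range(rank):
        u0, _, vt0 = np.linalg.svd(R, full_matrices=False)
        u, v = u0[:, 0], vt0[0]
        for _ in range(n_iter):
            v = soft(R.T @ u, lam); v /= np.linalg.norm(v) + 1e-12
            u = soft(R @ v, lam);   u /= np.linalg.norm(u) + 1e-12
        d = u @ R @ v
        approx += d * np.outer(u, v)
        R -= d * np.outer(u, v)
    return approx

def hssvd(X_origin, mean_rank=2, var_rank=2, c=-0.01):
    # Step 1: standardize element-wise.
    mu, sigma = X_origin.mean(), X_origin.std()
    X = (X_origin - mu) / sigma
    # Step 2: quadratic rescaling on X^2 - J.
    U = ssvd_approx(X ** 2 - 1.0, rank=var_rank)
    # Step 3: mean search on Y = X / sqrt(U + J - cJ).
    scale = np.sqrt(U + 1.0 - c)
    Y_hat = ssvd_approx(X / scale, rank=mean_rank)
    # Step 4: variance search on centered log squared residuals
    # (the 1e-12 guard against log(0) is ours, not from the paper).
    Z_origin = np.log((X - Y_hat * scale) ** 2 + 1e-12)
    Z_hat = ssvd_approx(Z_origin - Z_origin.mean(), rank=var_rank)
    # Step 5: background cells flagged by neither search; estimate b, rho^2.
    P = (Y_hat == 0) & (Z_hat == 0)
    b = X_origin[P].mean()
    rho2 = X_origin[P].var(ddof=1)
    # Step 6: scale back to the original location and scale.
    mean_approx = np.where(Y_hat == 0, b, sigma * Y_hat * scale + mu)
    var_approx = np.where(Z_hat == 0, rho2, sigma ** 2 * np.exp(Z_hat))
    return mean_approx, var_approx, P
```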
The operators ×, /, exp(), log(), min(), and √( ) used above are defined element-wise when applied to matrices, e.g., U_{n×p} × V_{n×p} = (uij vij). In all steps involving SSVD, we implement the FIT-SSVD method (10). We use FIT-SSVD because it is computationally fast and has similar or superior performance compared with other competing methods under the homogeneous variance assumption (10). The matrix √(U + J − cJ) provides a working variance-level estimate of the data and makes our method more robust. Note that the reason for working on the log scale for the variance detection is twofold. First, working on the log scale makes the detection of deflated-variance (less than 1) biclusters possible. Intuitively, as variance measures deviation from the mean, we can work on the squared residuals to find the variance structure. In the deflated-variance bicluster setting, if the mean structure is estimated correctly, the residuals within the bicluster are close to zero. The SSVD-based methods shrink small nonzero elements to zero to achieve sparsity. As a result, if we work on the squared residuals directly, the SSVD-based methods will fail to detect the low-variance structure. Second, to use the well-established SSVD machinery in the variance detection steps, we need to work on the log scale. To see this, we can rewrite the equation in [2] as log(X − Ξ − bJ)² = log(Σ²) + log(ρ²Φ²), which is similar to the model in [1]. Consequently, we can apply any method which is applicable to [1] in our variance detection step if we work on the log scale and Σ is low rank. We also want to point out that results obtained directly from FIT-SSVD are relative to the location and scale of the background cluster. In addition, we have scaled the data in the input step. To provide a correct mean and variance approximation of the original data, we need the scale-back step. Assuming that the detection of null clusters is close to the truth, the pooled mean and variance estimates based on elements exclusively from the identified null cluster (b and ρ) are more accurate than estimates based on all elements of the matrix (μ and σ). As a result, we use the comprehensive formula proposed in the scale-back step.
The FIT-SSVD method, as well as any other SVD-based method, requires an approximation of the rank of the matrix (which is essentially the number of true biclusters) as input. We adapt the bicross-validation (BCV) method of ref. 24 for rank estimation, and we notice that in some cases the rank is underestimated. For this reason, we introduce additional steps following a BCV estimate of rank k: First, we approximate the data with a sparse matrix X_{k+1} (of rank k + 1), where X_{k+1} = ∑_{j=1}^{k+1} d_j u_j v_j^T. Define the proportion of variance explained by the top rank-i sparse matrix as R_i = ∑_{j=1}^{i} d_j² / ∑_{j=1}^{k+1} d_j² (25). R_i lies between 0 and 1 and is increasing in i, and we believe that the redundant components of the sparse matrix should not contribute much to the total variance. The final rank estimate for HSSVD is the smallest integer r which satisfies R_r > 0.95, with 1 ≤ r ≤ k + 1. Note that FIT-SSVD (10) used a modified BCV method for rank estimation; however, the authors require that most rows (the whole row) and most columns (the whole column) are sparse, which appears to be too restrictive. In practice, this assumption is violated if the data are block diagonal or have certain other commonly assumed data structures. For this reason, we use the original BCV method as our starting point.
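The rank-correction rule can be written directly in terms of the singular values of the rank-(k+1) sparse fit; a sketch, where d holds d_1, ..., d_{k+1}:

```python
import numpy as np

def select_rank(d, threshold=0.95):
    """Smallest r with R_r = sum_{j<=r} d_j^2 / sum_{j<=k+1} d_j^2 > threshold,
    given the singular values d of the rank-(k+1) sparse approximation."""
    d2 = np.asarray(d, dtype=float) ** 2
    R = np.cumsum(d2) / d2.sum()
    return int(np.argmax(R > threshold)) + 1

# Example: select_rank([10, 6, 0.5, 0.1]) returns 2, since
# R_1 = 0.73 and R_2 = 0.998 > 0.95.
```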
ACKNOWLEDGMENTS. The authors thank the editor and two referees for helpful comments. The authors also thank Dr. Dan Yang for sharing part of her code. This work was supported in part by Grant P01 CA142538 from the National Institutes of Health.
1. Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning (Springer, New York), Vol 1, pp 460–462.
2. Witten DM, Tibshirani R (2010) A framework for feature selection in clustering. J Am Stat Assoc 105(490):713–726.
3. Ma Z (2013) Sparse principal component analysis and iterative thresholding. Ann Statist 41(2):772–801.
4. Shen H, Huang J (2008) Sparse principal component analysis via regularized low rank matrix approximation. J Multivariate Anal 99(6):1015–1034.
5. Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Statist 15(2):265–286.
6. Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1):1–58.
7. Cheng Y, Church GM (2000) Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 8:93–103.
8. Busygin S (2008) Biclustering in data mining. Comput Oper Res 35(9):2964–2987.
9. Lee M, Shen H, Huang JZ, Marron JS (2010) Biclustering via sparse singular value decomposition. Biometrics 66(4):1087–1095.
10. Yang D, Ma Z, Buja A (2011) A sparse SVD method for high-dimensional data. arXiv:1112.2433.
11. Witten DM, Tibshirani R, Hastie T (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3):515–534.
12. Hoff PD (2006) Model averaging and dimension selection for the singular value decomposition. J Am Stat Assoc 102(478):674–685.
13. Johnstone IM, Lu AY (2009) On consistency and sparsity for principal components analysis in high dimensions. J Am Stat Assoc 104(486):682–693.
14. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58(1):267–288.
15. Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360.
16. Lazzeroni L, Owen A (2002) Plaid models for gene expression data. Statist Sinica 12:61–86.
17. Shabalin AA, Weigman VJ, Perou CM, Nobel AB (2009) Finding large average submatrices in high dimensional data. Ann Appl Stat 3(3):985–1012.
18. Allen GI, Grosenick L, Taylor J (2011) A generalized least squares matrix decomposition. arXiv:1102.3074.
19. Hansen KD, et al. (2011) Increased methylation variation in epigenetic domains across cancer types. Nat Genet 43(8):768–775.
20. Irizarry RA, et al. (2009) The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat Genet 41(2):178–186.
21. Liu Y, Hayes DN, Nobel A, Marron JS (2008) Statistical significance of clustering for high-dimension, low-sample size data. J Am Stat Assoc 103(483):1281–1293.
22. Bhattacharjee A, et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 98(24):13790–13795.
23. Turner H, Bailey T, Krzanowski W (2005) Improved biclustering of microarray data demonstrated through systematic performance tests. Comput Stat Data Anal 48(2):235–254.
24. Owen AB, Perry PO (2009) Bi-cross-validation of the SVD and the nonnegative matrix factorization. Ann Appl Stat 3(2):564–594.
25. Allen GI, Maletić-Savatić M (2011) Sparse non-negative generalized PCA with applications to metabolomics. Bioinformatics 27(21):3029–3035.