Garge, Nikhil et al. 2005. Reproducible Clusters from Microarray Research: Whither? BMC Bioinformatics

1. DYS875-006. Reproducible Clusters from Microarray Research: Whither? Garge, Nikhil et al. Gota Morota, Dec. 9, 2009

2. Clustering Gene Expression Profiles
Given: expression profiles for a set of genes or experiments/individuals/time points
Do: organize profiles into clusters such that
- genes in the same cluster are highly similar to each other
- genes from different clusters have low similarity to each other
Goal: understand general characteristics of the data and infer something about a gene based on how it relates to other genes

3. Validity of Clustering Analysis
Clustering presents challenges because
- there is no null hypothesis to test and no right answer
- the result of clustering may be method sensitive (distance metric, clustering algorithm)
- there is no way to evaluate the validity of a cluster solution
Measure replicability of clustering algorithms. Clusters that produce classifications with greater replicability would be considered more valid.
Objective: determine the replicability and degree of stability of commonly used non-hierarchical clustering algorithms

4. Data
Real datasets

Table 1: List of microarray datasets considered for the study. Table 1 contains two columns of datasets. Each dataset is described by its name, source, and sample size (n). Table 1 shows 37 datasets. The first three columns list 19 datasets and the last three columns describe 18 datasets.

Name of the dataset | Source | Sample size (n) | Name of the dataset | Source | Sample size (n)
GDS22 | GEO | 80 | Leukemia dataset | [30] | 70
GDS171 | GEO | 30 | Medulloblastoma Data Set | [31] | 34
GDS184 | GEO | 30 | Prostate Cancer dataset | [32] | 100
GDS232 | GEO | 46 | Gaffney Head and Neck data | [33] | 60
GDS274 | GEO | 80 | Affymetrix Hu133A Latin Square | [34] | 42
GDS285 | GEO | 20 | CNGI design experiment | Unpublished | 24
GDS365 | GEO | 66 | Paired pre and post euglycaemic insulin clamp skeletal muscle biopsies | Unpublished | 106
GDS465 | GEO | 90 | GDS156 | GEO | 12
GDS331 | GEO | 70 | GDS254 | GEO | 16
GDS534 | GEO | 74 | GDS268 | GEO | 24
GDS565 | GEO | 48 | GDS287 | GEO | 16
GDS427 | GEO | 24 | GDS288 | GEO | 16
GDS402 | GEO | 12 | GDS472 | GEO | 14
GDS356 | GEO | 14 | GDS473 | GEO | 12
GDS389 | GEO | 16 | GDS511 | GEO | 12
GDS388 | GEO | 18 | GDS520 | GEO | 20
GDS352 | GEO | 12 | GDS564 | GEO | 28
GDS531 | GEO | 172 | GDS540 | GEO | 18
GDS535 | GEO | 12 | | |

Simulated datasets

Table 2: List of simulated microarray datasets. Table 2 shows the details of the simulated datasets. Each of these datasets has clustering structure k = 6 (six clusters) with correlation ρ set to (0.33)^(1/2).

Dataset name | Sample size | Number of genes | Clusters
Dataset1 | 20 | 1200 | 6
Dataset2 | 100 | 1200 | 6
Dataset3 | 200 | 1200 | 6
Dataset4 | 500 | 1200 | 6
Dataset5 | 1000 | 1200 | 6
Dataset6 | 40 | 1200 | 6
Dataset7 | 60 | 1200 | 6
Dataset8 | 80 | 1200 | 6

We simulated 8 datasets with 1200 genes and sample sizes ranging from n = 20 to 1000, where n is the number of subjects. All simulated datasets were structured for 6 clusters (k = 6) with correlation ρ set to (0.33)^(1/2) for all pairwise combinations of genes within clusters and zero for all pairwise combinations of genes in different clusters. In order to validate our methodology, we would predict higher scores when we extract 6 clusters in our fitted solutions. Simulated datasets also help us understand the stability behaviour for values other than k = 6 (i.e., when we extract the wrong number of clusters).

Missing values
- Filtered out genes which contained at least one missing value
If we represent microarray data as a matrix with rows representing genes and columns representing chips or samples, we filtered out all rows which contained at least one null expression or missing value because we do not know the exact source(s) for the missing/null value observation. Missing data can be due to array damage, transcription errors, etc. Conventional algorithms for clustering require complete datasets to run and extending these clustering routines to accommodate missing data was beyond the scope of our inquiry.

Standardization
- Standardized the expression values to mean zero and unit variance
$Z_{ij} = \frac{Ig_{ij} - \mathrm{mean}\,Ig_i}{\mathrm{SD}\,g_i}$
where $Z_{ij}$ is the Z score computed for the expression level observed for gene i in sample/subject j, and $Ig_{ij}$ is the corresponding intensity.

5. Four Algorithms Considered
Four non-hierarchical (partitional) clustering algorithms. Non-hierarchical clusterings require the number of clusters (k) to be pre-specified.
- K-means (kmeans {stats})
- Self Organizing Maps (SOM) (som {cluster})
- Clustering LARge Applications (CLARA) (clara {cluster})
- Fuzzy C-means (fanny {cluster})

6. K-means
1. Choose the number of clusters k
2. Randomly assign items to the k clusters
3. Calculate a new centroid for each of the k clusters
4. Calculate the distance of all items to the k centroids
5. Assign items to the closest centroid
6. Repeat until cluster assignments are stable
Figure 1: K-Means. Text in the figure (K-Means Clustering): assume our instances are represented by vectors of real values; put k cluster centers in the same space as the instances; each cluster is represented by a vector f_j; consider an example in which our vectors have 2 dimensions. Legend: instances, cluster center.

7. K-means
Each iteration involves two steps
1. assignment of instances to clusters
2. re-computation of the means
Figure 2: K-Means. Panels: assignment, re-computation of means.

8. Other Clustering Methods
Self Organizing Map (SOM)
- Similar to K-means, but centroids are restricted to a two-dimensional grid
Clustering LARge Applications (CLARA)
- Extension of PAM (Partitioning Around Medoids)
- it can deal with much larger datasets than PAM
Fuzzy C-means
- each gene belongs to a cluster with a specified membership degree (0 to 1)
- basically you can assign genes to more than one cluster
- assign the gene to the cluster showing the maximum degree of membership

9. Cluster Stability
Cramér's V²:
$V^2 = \frac{\chi^2}{N(k-1)}$
where
- χ² is the ordinary χ² test statistic for independence in contingency tables
- N is the number of genes to be clustered
- k is the number of clusters extracted
Stability score
- 0: no relationship
- 1: perfect reproducibility

10. Approach to Compute Cluster Stability
1. Start from a microarray dataset with S subjects and N genes
2. Split the dataset into left and right datasets, each with S/2 subjects
3. Sub-sample the left dataset and the right dataset into sets of various sample sizes (3 to S/2)
4. Cluster the left set and the right set of sample size x (x ranges from 3 to S/2) with k (2 to 10) number of clusters
5. Compute chi-square (χ²) between the clustering results
6. Cluster stability S(x,k) = Cramér's V²
Repeat 3 times.
Caption from the paper (Figure 1): Algorithm: cluster stability computation. Cluster stability score S(x,k) is computed for every "k" (number of clusters) and every pair of sub-sampled sets of sample size "x".
Figure 3: Cluster Stability Computation

11. Result on Real Datasets (SOM)
Table 3: Table showing stability results produced on a real dataset of sample size 16. Table 3 shows stability scores produced on a given dataset of a sample size of n = 16. We split the dataset into two halves each containing 8 subjects. The left dataset is resampled 6 times producing 6 samples of sample sizes 3 to 8, respectively. Similarly the right dataset is resampled to produce 6 samples. We measured the strength of the association between the clusters produced on every pair of samples (one sample from the left and the other from the right dataset, both of the same sample size) using Cramér's V². Columns in the table represent the number of clusters (k) and rows represent sample sizes. The stability score quantified for k = 10 and sample size 8 is 0.3699. This table shows there is 37% agreement between the clusters produced (k = 10) on a pair of samples (a sample from the left dataset and the other from the right dataset, both of sample size 8).

Sample size | k = 2 | k = 3 | k = 4 | k = 5 | k = 6 | k = 7 | k = 8 | k = 9 | k = 10
3 | 0.5883 | 0.47091 | 0.4503 | 0.4028 | 0.3809 | 0.3600 | 0.3313 | 0.3107 | 0.2992
4 | 0.5799 | 0.48045 | 0.4244 | 0.3894 | 0.365 | 0.3469 | 0.3132 | 0.297 | 0.2858
5 | 0.5738 | 0.48296 | 0.4297 | 0.3982 | 0.3644 | 0.3430 | 0.3195 | 0.3013 | 0.2790
6 | 0.6433 | 0.54638 | 0.5142 | 0.4727 | 0.4405 | 0.4066 | 0.3817 | 0.3616 | 0.3396
7 | 0.6534 | 0.54821 | 0.5250 | 0.4826 | 0.4462 | 0.4211 | 0.3915 | 0.3679 | 0.348
8 | 0.6759 | 0.58447 | 0.5520 | 0.5045 | 0.4700 | 0.4592 | 0.4160 | 0.3975 | 0.3699

Figure 4: Stability result on a real dataset of sample size 16
CLARA and Fuzzy C-means, however, maintained low stability scores until a sample size of 30 was attained. Stability scores then gradually increased after this threshold. K-means and SOM showed superior stability scores as compared to CLARA until the sample size attained n = 30. It is interesting to note that the average stability achieved is not greater than 0.55 for all four clustering routines even when a sample size of n = 50 is attained.
As we deviate from k = 6, we observed a decline in stability scores. This phenomenon can be clearly observed in CLARA, K-means and Fuzzy C-means (Figure 5). Hence, scores observed on k = 7 were always higher than those on k = 2, since k = 7 is nearer to k = 6 (Figure 5). Figure 4 shows results on simulated datasets for k = 6.

12. Result on Real Datasets among different algorithms
[Chart: Real datasets. Stability coefficient versus sample size for SOM, K-means, Fuzzy C-means and CLARA.]
Caption from the paper (Figure 3): Cluster Stability results. Stability scores for various values of k (2 to 10) are computed on all 37 datasets. For each dataset, we selected a column (k) showing the maximum summation of scores across sample size. Finally all 37 columns selected on 37 datasets were merged into one resultant column representing stability scores with respect to sample size for that clustering routine.
K-means showed high stability at smaller sample sizes as compared to the other methods.
Figure 5: Stability result on a real dataset of sample size 16

13. Result on Simulated Datasets 1
[Chart: Simulated datasets for rho = sqrt(0.33) and k = 6. Stability coefficient versus sample size for SOM, K-means, Fuzzy C-means and CLARA.]
Caption from the paper (Figure 4): Cluster Stability results on simulated datasets for k = 6. Datasets are simulated with a clustering structure k = 6 (6 clusters). The above figure shows high stability scores observed for k = 6 on all four clustering routines.
K-means, Fuzzy C-means and SOM showed fluctuation in scores even at large sample sizes, whereas CLARA showed consistent behavior (constant level of scores) at larger sample sizes. CLARA maintained 100% stability for larger sample sizes (300 to 500).
Figure 6: Stability result on simulated dataset for k = 6

14. Result on Simulated Datasets 2
[Charts: four panels, each plotting stability coefficient versus sample size for k = 2 to 10 with rho = sqrt(0.33). Panel titles: Clara; K-means; Fuzzy C-means; SOM.]
Caption from the paper (Figure 5): Cluster Stability results on simulated datasets for k = 2 to k = 10. Stability scores for various values of k (2 to 10) are computed on all the 8 simulated datasets. For each dataset, we generate an output table of scores (explained in the Algorithms section). We merge all the 8 output tables produced into one table with each cell computed as the average of the corresponding cells in the 8 tables. Finally scores are plotted for all k values with respect to sample size. For cleaner visualization purposes, we do not show stability curves for all k values in figure 5c and figure 5d. (a) Scores plotted for CLARA for each k (2 to 10). (b) Scores plotted for K-means for each k (2 to 10). (c) Scores plotted for Fuzzy C-means for each k (2 to 10). (d) Scores plotted for SOM for each k (2 to 10).

15. Conclusion
- microarray datasets may lack natural clustering structure, thereby producing low stability scores on all four methods
- the algorithms studied may not be well suited to producing reliable results
- sample sizes typically used in microarray research may be too small to support derivation of reliable clustering results
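
The two preprocessing steps on the Data slide (drop genes with any missing value, then standardize each gene to mean zero and unit variance) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `preprocess`, the use of `None` to mark a missing value, the choice of the sample standard deviation (n - 1), and the decision to skip constant rows are all assumptions that the slides do not specify.

```python
import statistics


def preprocess(matrix):
    """Drop genes with any missing value, then z-score each remaining gene row.

    matrix: list of gene rows, one expression value per sample (None = missing).
    Each row needs at least two values for the standard deviation to exist.
    """
    standardized = []
    for row in matrix:
        if any(value is None for value in row):   # at least one missing value: drop the gene
            continue
        mean = statistics.mean(row)
        sd = statistics.stdev(row)                # sample SD (n - 1); an assumption
        if sd == 0:                               # constant row: Z is undefined (assumed skip)
            continue
        # Z_ij = (Ig_ij - mean Ig_i) / SD g_i, computed gene by gene
        standardized.append([(value - mean) / sd for value in row])
    return standardized


# Hypothetical 3-gene, 3-sample matrix; the second gene has a missing value.
example = [[1.0, 2.0, 3.0], [4.0, None, 6.0], [2.0, 4.0, 9.0]]
clean = preprocess(example)
```

Standardizing by row means each gene contributes on the same scale to whatever distance the clustering routine then uses.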

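The stability score S(x,k) from the Cluster Stability and Approach to Compute Cluster Stability slides can be illustrated with a small standard-library Python sketch. It is a simplified reading of the procedure, not the study's implementation: it computes one S(x,k) for a single pair of sub-samples (no loop over x and k, no three repetitions), it uses only a plain K-means written from the steps on the K-means slide, and the names `cramers_v2`, `kmeans`, `stability_score`, the `seed` parameter and the toy data are all hypothetical.

```python
import random
from collections import Counter


def cramers_v2(left, right, k):
    """Cramér's V^2 = chi^2 / (N * (k - 1)) for two labelings of the same N genes."""
    n = len(left)
    if n == 0 or n != len(right) or k < 2:
        raise ValueError("need two equal-length, non-empty labelings and k >= 2")
    observed = Counter(zip(left, right))          # contingency-table cell counts
    row_tot, col_tot = Counter(left), Counter(right)
    chi2 = 0.0
    for r, rn in row_tot.items():
        for c, cn in col_tot.items():
            expected = rn * cn / n                # count expected under independence
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return chi2 / (n * (k - 1))


def kmeans(rows, k, rng, max_iter=100):
    """Plain K-means following the slide's steps; returns one cluster label per row."""
    if len(rows) < k:
        raise ValueError("need at least k rows")
    labels = [i % k for i in range(len(rows))]    # every cluster starts non-empty
    rng.shuffle(labels)                           # random initial assignment
    centroids = [None] * k
    for _ in range(max_iter):
        for c in range(k):                        # re-compute the means
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:                           # an emptied cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = [
            min(range(k), key=lambda c: sum((v - m) ** 2 for v, m in zip(row, centroids[c])))
            for row in rows
        ]
        if new_labels == labels:                  # assignments are stable
            break
        labels = new_labels
    return labels


def stability_score(data, x, k, seed=0):
    """S(x, k): cluster genes on x left-half and x right-half subjects, compare with V^2.

    data is a list of gene rows; each row holds one expression value per subject.
    """
    rng = random.Random(seed)
    subjects = list(range(len(data[0])))
    rng.shuffle(subjects)
    half = len(subjects) // 2
    left = rng.sample(subjects[:half], x)             # sub-sample of the left dataset
    right = rng.sample(subjects[half:2 * half], x)    # sub-sample of the right dataset
    left_labels = kmeans([[gene[j] for j in left] for gene in data], k, rng)
    right_labels = kmeans([[gene[j] for j in right] for gene in data], k, rng)
    return cramers_v2(left_labels, right_labels, k)


# Hypothetical toy data: 7 "low" genes and 4 "high" genes measured on 8 subjects.
toy = [[0.01 * j for j in range(8)] for _ in range(7)]
toy += [[10 + 0.01 * j for j in range(8)] for _ in range(4)]
score = stability_score(toy, x=3, k=2)
```

Because V² depends only on the contingency table, it is unaffected by how each run happens to number its clusters, which is why the two halves can be clustered independently and still compared.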
