Unsupervised classification of single particles by cluster tracking in multi-dimensional space


www.elsevier.com/locate/yjsbi

Journal of Structural Biology 157 (2007) 226–239



Jie Fu a, Haixiao Gao b, Joachim Frank a,b,*

a Department of Biomedical Sciences, State University of New York at Albany, Empire State Plaza, Albany, NY 12201-0509, USA

b Howard Hughes Medical Institute, Health Research, Inc. at the Wadsworth Center, Empire State Plaza, Albany, NY 12201-0509, USA

Received 11 April 2006; received in revised form 7 June 2006; accepted 11 June 2006

Available online 21 July 2006

Abstract

In cryo-electron microscopy (cryo-EM) single-particle reconstruction, the heterogeneity of two-dimensional projection image data resulting from the co-existence of different conformational or ligand binding states of a macromolecular complex remains a major obstacle as it impairs the validity of reconstructed density maps and limits the progress toward higher resolution. Classification of cryo-EM data according to the different conformations is difficult because of the coexistence of multiple orientations in a single dataset. Here, we present an unsupervised classification method, termed cluster tracking, which utilizes the continuity in multi-dimensional space induced by angular adjacency of projections in large datasets. In a proof of concept, the testing of cluster tracking on simulated projection data, which were generated from multiple conformations and orientations of an existing volume, produced clusters that are consistent with the conformational identity of the data. The application of the method to experimental cryo-EM projection data is found to result in a partition similar to the one generated by supervised classification.

© 2006 Elsevier Inc. All rights reserved.

Keywords: Cluster tracking; Classification; Conformational heterogeneity

1. Introduction

Cryo-electron microscopy (cryo-EM), in combination with single-particle reconstruction (see Frank, 2006), is used increasingly for the visualization of biological macromolecular complexes, especially in situations where the complex is too large and too flexible to be amenable to crystallization for X-ray crystallography. Examples of structures whose study has greatly benefited from the development of this methodology include the ribosome (Frank, 2001) and GroEL (Ludtke et al., 2004; Saibil, 2000). The structure of the ribosome and its interaction with its ligands have been extensively studied by cryo-EM, which resulted in the discovery of the complex architecture of the ribosome, determination of binding positions of the factors and tRNA, and observations of conformational changes in response to the binding of protein factors (reviewed by Frank, 2001; Frank and Spahn, 2006).

* Corresponding author. Fax: +1 518 486 2191. E-mail address: [email protected] (J. Frank).

1047-8477/$ - see front matter © 2006 Elsevier Inc. All rights reserved.

doi:10.1016/j.jsb.2006.06.012

One of the important assumptions in single-particle reconstruction is that all the data represent randomly oriented two-dimensional (2D) projections of the same three-dimensional (3D) structure; that is, the sample must be highly homogeneous. However, a high level of sample homogeneity is often difficult to achieve, especially when the molecule has flexible domains or when it can occur in different ligand binding states. Therefore, the projection images extracted from electron micrographs frequently represent projections of 3D structures that differ in conformation. A reconstruction from such mixed datasets cannot accurately portray any of the co-existing conformational states; for example, regions with structural changes often appear to be fragmented in the reconstructed EM maps. The heterogeneity of the dataset also adversely affects the resolution of reconstructed 3D volumes.


Fig. 1. (A) Correlation of Fourier components belonging to two different, adjacent projections. In Fourier space, the projections of an object may be represented by two central sections, $P_i$ and $P_j$. Each Fourier component is surrounded by a "region of influence," which is the 3D shape transform of the object (adapted from Frank and Radermacher, 1986). (B) Principle of cluster tracking by classifications of partially overlapping datasets. Each mark (+) represents a point on the angular grid. A local dataset is defined as all projections falling into a certain point on the grid and its five immediate neighbors. Two such datasets are indicated by circles.


Heterogeneity is therefore a prevalent problem that limits full exploitation of cryo-EM data and progress toward higher resolution (Brink et al., 2004; Burgess et al., 2004; Rye et al., 1999; Zhou et al., 2001). Again referring to the example of the ribosome, in many studies of different ribosomal complexes, data heterogeneity presented a stumbling block. Examples of structures posing such a problem are the EF-G·70S complex (Agrawal et al., 1998; Gao et al., 2004; Valle et al., 2003), the RRF·70S complex (Gao et al., 2005), and the initiation complex (Allen et al., 2005). Thus, a method of classifying particles into different conformational states is needed. However, the classification of cryo-EM data according to the conformation of the originating structures is greatly complicated by the co-existence of multiple projection views and multiple conformational states in a single dataset.

Among the methods to address heterogeneity, cross-correlation-based supervised classification has been used with some success (Bohm et al., 2000; Gao et al., 2004; Gao et al., 2005; Heymann et al., 2004; Valle et al., 2002). In this method, the 2D projections are compared with several reference density maps, and the similarities of the 2D images to these references are measured by their cross-correlation coefficients. The images are then divided into classes according to the references they best resemble. There are two problems with this method: (i) all the different conformations need to be known beforehand, and (ii) an incorrect selection of a reference (for instance, a reference that does not correspond to any of the experimental data) leads to a meaningless classification.

In contrast to methods that require a comparison with references, unsupervised classification methods classify data into groups according to their intrinsic clustering, or multi-dimensional cohesiveness. Different types of unsupervised classification methods have been applied in cryo-EM single-particle studies (see Frank, 1990); for example, K-means clustering (Penczek et al., 1996), hierarchical ascendant classification (HAC) (van Heel, 1984), hybrid (K-means and HAC) classification (Frank et al., 1988), and self-organizing maps (Marabini and Carazo, 1994; Zuzan et al., 1998). Often, unsupervised classification is combined with multivariate data analysis such as correspondence analysis. In these studies, classification was used either to determine the projection orientations without regard to possible heterogeneity of the 3D structures (e.g., Frank et al., 1988; Zuzan et al., 1998), or to divide projections representing a single view according to conformations (de Haas et al., 1996; Marabini and Carazo, 1994; Pascual-Montano et al., 2001). A method capable of classifying an entire projection set into different conformations has not been available.

In the current study, we designed an unsupervised classification method, termed cluster tracking, which is able to classify a set of particles into different conformational groups after an initial orientation assignment. By exploiting the tools of multivariate data analysis and K-means clustering, we succeeded in classifying simulated data with realistic signal-to-noise ratios (SNR) into different conformational states with high accuracy. Furthermore, the application of our new classification method to experimental cryo-EM data yielded results similar to those obtained previously by the use of supervised classification.

2. Mathematical background

2.1. Similarity relationship among projections

Given a set of N projections ($p_i$, i = 1…N) of a molecule present in C conformations (indicated by an index $c_i$) in arbitrary orientations,

$$\{p_i^{(\mathrm{ideal})}\} = \{p_i(c_i \mid \psi_i, \theta_i, \phi_i);\ c_i \in \{1 \ldots C\}\}, \tag{1}$$

where $\psi_i$, $\theta_i$, $\phi_i$ are three Eulerian angles defining the orientation of $p_i$. In the cryo-EM data, these projections exist as noise-contaminated measurements. Assuming additive noise,

$$p_i^{(\mathrm{measured})} = p_i^{(\mathrm{ideal})} + n_i. \tag{2}$$

According to the Projection theorem (see Frank, 2006), each projection is represented by a 2D central section in 3D Fourier space (Fig. 1A). This will be indicated by capitalized counterparts of entities in Eq. (2),

$$P_i^{(\mathrm{measured})}(k_m) = P_i^{(\mathrm{ideal})}(k_m) + N_i(k_m), \tag{3}$$

where $k_m$ is the spatial frequency. To quantify the similarity between adjacent projections, we may use a measure of dissimilarity, expressed by the generalized Euclidean distance in a multi-dimensional space:


$$E_{ij} = \sum_{m=1}^{M} \left| P_i(k_m) - P_j(k_m) \right|^2, \tag{4}$$

where the summation goes over the M coefficients of the discrete Fourier expansion. Small dissimilarity indicates high similarity and vice versa. For an infinitely extended object, $E_{ij}$ is large (and hence the similarity small) unless i = j. Mathematically, a finite object could be viewed as the product of an infinite object with a shape function whose value inside the boundary of the finite object equals 1 and outside 0. Thus, the finite object's Fourier transform $O(k)$ will be given by the convolution of the Fourier transform of the infinite object, written here as $O_\infty(k)$, with the Fourier transform of the object's 3D shape function, also called shape transform, $S(k)$:

$$O(k) = O_\infty(k) \ast S(k). \tag{5}$$

As a result, each point on a 2D Fourier central section that represents a projection is surrounded by a "region of influence" whose extent is given by the 3D shape transform (Fig. 1A). By convolving the 2D central plane with the 3D shape transform, we obtain a thick section, whose thickness indicates the extent of the region whose values are correlated with those on the central section itself. As a consequence, $E_{ij}$ has small values, due to the effect of the cross-correlation terms $P_i(k_m) P_j^*(k_m)$ in Eq. (4), provided the angular separation between the projections is such that it allows appreciable interpenetration of their associated thick central sections in Fourier space.

Another consequence of Eq. (4) is that $E_{ij}$ is a monotonic function of the angular separation between projections i and j, since the radius of the region in Fourier space where the associated thick central sections interpenetrate (an indication of high correlation) shrinks as the angular separation is increased, so that fewer and fewer terms of the form $P_i(k_m) P_j^*(k_m)$ contribute to $E_{ij}$ in Eq. (4).
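The distance of Eq. (4) can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not part of the original text: the images are already aligned, the discrete Fourier expansion is taken as an orthonormal 2D FFT, and the function name `fourier_dissimilarity` and the toy random images are purely illustrative. Expanding the squared modulus gives $|P_i|^2 + |P_j|^2 - 2\,\mathrm{Re}\,(P_i P_j^*)$, which is where the cross-correlation terms discussed above come from.

```python
import numpy as np

def fourier_dissimilarity(p_i, p_j):
    """Generalized Euclidean distance of Eq. (4) between two aligned 2D images."""
    # Orthonormal FFT, so the sum over Fourier coefficients equals
    # the sum over pixels (Parseval's theorem).
    P_i = np.fft.fft2(p_i, norm="ortho")
    P_j = np.fft.fft2(p_j, norm="ortho")
    return float(np.sum(np.abs(P_i - P_j) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(size=(32, 32))
b = a + 0.1 * rng.normal(size=(32, 32))   # slightly perturbed copy of a
c = rng.normal(size=(32, 32))             # unrelated image

E_ab = fourier_dissimilarity(a, b)        # small: a and b are similar
E_ac = fourier_dissimilarity(a, c)        # large: a and c are unrelated
pixel_ab = float(np.sum((a - b) ** 2))    # same distance measured on pixels
```

With the orthonormal convention, `E_ab` and `pixel_ab` agree up to rounding, which is why the same similarity ordering can be studied either on Fourier coefficients or directly on pixel arrays.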

We have therefore the result that projections are ordered by similarity (as measured by the cross-correlation term $P_i(k_m) P_j^*(k_m)$ in the expression for the generalized Euclidean distance) according to their angular separation. For projections that lie closely together in angular space, their angular separations are parameterized by two angles. Thus projections in such a neighborhood belonging to the same conformation form a cluster that has 2D characteristics in $\mathbb{R}^M$ (the multi-dimensional space in which the Euclidean distances of Eq. (4) are measured); i.e., it is extended along two perpendicular directions but "thin" in a third direction. Since projections having the same angles but originating from molecules with different conformations are separated into different clusters in $\mathbb{R}^M$, and since clusters are continuous (by virtue of the similarity relationship) as we move in angular space, the clusters associated with different conformations form consolidated clusters that are distinct and, depending on noise level, separable.

This leads to the philosophy of the current approach: given a sufficient number of projections, the analysis of the properties of $E_{ij}$ might make it possible to classify data on account of their distinct similarity ordering in $\mathbb{R}^M$ or in an equivalent multi-dimensional space $\mathbb{R}^J$ in which the images are represented in their original form, as arrays of pixels. Briefly, classification is first applied to an arbitrary subset of the data that fall in a narrow angular region. Classification is then repeated successively in neighboring, partially overlapping regions, and the resulting clusters are compared and analyzed for consistency.

In more detail, the strategy we will follow to explore the feasibility of cluster tracking is outlined in Fig. 1B. First, consider a set of quasi-evenly distributed Eulerian angles which defines a discrete angular grid in the angular space. At any point of this discrete angular grid (marked by +), each representing the orientation of a set of projections, we can define a local region of immediate neighbors (indicated by the points enclosed by a circle). Classification is applied to all projections that fall into such a local region. Next, the circle is moved such that another local region, partially overlapping with the first, is created. Again classification is applied. In this manner, several partially overlapping local regions are analyzed for the existence of clusters. At the end, the results of all classifications are related to one another, for evidence of consistency and continuity.
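The construction of partially overlapping local datasets can be sketched as follows. This is only an illustration under stated assumptions: the neighbor table `NEIGHBORS`, the helper `local_dataset`, and the toy orientation assignment are hypothetical, and the view numbers are borrowed from Table 1 purely as labels.

```python
import numpy as np

# Hypothetical neighbor table: center view -> its five immediate grid neighbors.
# The view numbers reuse entries of Table 1 only as an illustration.
NEIGHBORS = {
    215: [214, 216, 171, 263, 262],
    216: [215, 217, 172, 264, 263],
}

def local_dataset(center, view_of_image, neighbors=NEIGHBORS):
    """Indices of all images assigned to `center` or to one of its neighbors."""
    views = [center] + list(neighbors[center])
    return np.flatnonzero(np.isin(view_of_image, views))

# Toy orientation assignment: one view number per image (made-up data).
view_of_image = np.array([215, 214, 216, 171, 263, 262, 217, 172, 264, 215, 216])

set_a = local_dataset(215, view_of_image)
set_b = local_dataset(216, view_of_image)
shared = np.intersect1d(set_a, set_b)   # images that enter both classifications
```

The images in `shared` are the ones whose class membership can later be compared between the two independent classifications.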

2.2. Tracking of similarity ordering in $\mathbb{R}^J$

For classification of the data, we made use of multivariate data analysis, specifically principal component analysis, and K-means clustering. Principal component analysis (PCA) is one of the eigen analysis-based methods for analyzing datasets with multiple variables (Lebart et al., 1984). In essence, PCA transforms the large number of variables in the datasets into a smaller number of independent variables, thus reducing the complexity of the analysis and facilitating classification. In the context of cryo-EM, each image can be represented by a vector in the multi-dimensional space $\mathbb{R}^J$, where J is the number of image pixels (see van Heel and Frank (1981) and Frank and van Heel (1982), where correspondence analysis, a related method, was introduced into electron microscopy). The vectors can be meaningfully compared if all the images are aligned to one another. To present the problem in the terms of eigen analysis, if an N × J matrix (N: the number of images) is constructed with the image vectors as rows,

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1J} \\ x_{21} & x_{22} & \cdots & x_{2J} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NJ} \end{pmatrix} \tag{6}$$

the task of PCA is to find the eigenvectors $u$ of the covariance matrix of $X$; i.e.,

$$D u = \lambda u, \tag{7}$$

where matrix $D$ is defined as

$$D = (X - \bar{X})'(X - \bar{X}) \tag{8}$$

and $\bar{X}$ is a matrix containing the average image in each row. After the transformation, each image is represented by a new set of coordinates, and the data matrix can be written as:

$$X' = \begin{pmatrix} x'_{11} & x'_{12} & \cdots & x'_{1q} \\ x'_{21} & x'_{22} & \cdots & x'_{2q} \\ \vdots & \vdots & & \vdots \\ x'_{N1} & x'_{N2} & \cdots & x'_{Nq} \end{pmatrix}, \tag{9}$$

where q is the number of eigenvectors and q = min(N, J).

The new vectors $u$ are mutually orthogonal, and they are ranked by decreasing importance as reflected in the magnitudes of the eigenvalues $\lambda$. PCA thus facilitates a very compact representation of the image set, with only a few eigenvectors required to reflect the essence of its variability. In the context of our goal of tracking clusters, the local similarity ordering of particles can be expected to be dominated by two independent traits: (1) continuous 2D ordering according to the different directions of local angular variability, and (2) disjoint clustering according to conformational or ligand binding differences.
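Eqs. (6)-(9) can be illustrated with a short NumPy sketch. It is a sketch under assumptions, not the software used in the study: the eigenvectors of $D$ are obtained through a singular value decomposition of the centered data matrix (a standard equivalent route), and the function name `pca` and the toy data are illustrative only.

```python
import numpy as np

def pca(X, n_factors):
    """PCA of an N x J data matrix X (one image per row).

    Returns (eigenvalues, eigenvectors, coordinates) for the leading factors.
    """
    X_centered = X - X.mean(axis=0)            # subtract the average image from every row
    # SVD of the centered data: the rows of Vt are eigenvectors of
    # D = X_centered' X_centered, and the squared singular values are its eigenvalues.
    _, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    eigenvalues = s[:n_factors] ** 2           # already sorted in descending order
    eigenvectors = Vt[:n_factors]              # each row is one eigenimage of J pixels
    coordinates = X_centered @ eigenvectors.T  # N x n_factors factor coordinates
    return eigenvalues, eigenvectors, coordinates

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 64))                 # 200 toy "images" of 64 pixels each
X[:100, :8] += 3.0                             # half of them carry an extra feature

eigenvalues, eigenvectors, coords = pca(X, n_factors=5)
```

In this toy example the first factor captures the feature present in only half of the images, so the first column of `coords` separates the two halves, in the same spirit as a factor that reflects the presence or absence of a ligand.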

In the reduced space spanned by the factors from PCA, K-means clustering may be used for the actual partition of the dataset. The K-means clustering technique divides the data into a predefined number of clusters. This is accomplished through Diday's "moving center" algorithm (for details, see Frank, 1990).
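The idea of iteratively moving cluster centers can be sketched with a plain K-means loop. This is a generic textbook iteration and not necessarily identical to Diday's algorithm as implemented in the study; the deterministic farthest-point initialization and the toy data are assumptions made only to keep the example reproducible.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Plain K-means: alternate between assigning points and moving the centers."""
    # Deterministic farthest-point initialization (a simple choice for this sketch).
    centers = [points[0]]
    for _ in range(1, k):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(2)
blob_a = rng.normal(loc=-5.0, scale=0.5, size=(50, 4))   # toy cluster 1 in a 4D factor space
blob_b = rng.normal(loc=+5.0, scale=0.5, size=(50, 4))   # toy cluster 2
labels, centers = kmeans(np.vstack([blob_a, blob_b]), k=2)
```

The cluster numbers returned by such a procedure are arbitrary, which matters later when classifications of different datasets are compared.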

In this study, the factors generated by PCA were first analyzed for their meaning, and those related to conformational heterogeneity were selected. Subsequently, K-means clustering was applied to PCA coordinates of the dataset to determine classes, or subsets with different conformations. Fig. 2 shows a flowchart of the proposed procedure of cluster tracking.

3. Results

The description of our results is organized as follows: we first (Section 3.1) present the datasets to be used, both simulated and experimental. Use of a simulated dataset has the advantage that it allows the success of the classification to be exactly quantified since all angles and class memberships are known a priori. We then (Section 3.2) describe the application of the cluster tracking method to the simulated dataset, which makes it possible to investigate under what SNR conditions it will work. The section after that (Section 3.3) presents the supervised classification of the experimental dataset as a control, against which the success of the cluster tracking (Section 3.4) can be measured.

3.1. Description of datasets: simulated and experimental data

The simulated data were generated from a previously obtained density map of an EF-G-bound ribosome (70S·MFTI-tRNA(Ile)·EF-G·GDPNP+puromycin; Valle et al., 2003). First, conformational heterogeneity was introduced by using two versions of this density map: one without modification, the other with EF-G density manually removed. The 2D projections were generated on a hemispherical angular grid with 6-degree spacing, which results in a total of 551 orientations (considering the image chirality, a hemispherical angular grid covers all the possible orientations). Second, realistic amounts of noise were added to the 2D projections, with an SNR of 0.07 (Fig. 3). All the data were rotationally and translationally aligned before entering classification. The purpose of this alignment is to exclude variations that are due to the conventions of Eulerian angles, which would otherwise dominate the analysis.
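Adding noise at a prescribed SNR can be sketched as below. The sketch assumes that SNR is defined as the ratio of signal variance to noise variance and that the noise is white and Gaussian; neither the definition nor the noise model is spelled out in this section, and the function `add_noise` and the Gaussian blob standing in for a projection are illustrative.

```python
import numpy as np

def add_noise(projection, snr, rng):
    """Add zero-mean Gaussian noise so that var(signal) / var(noise) equals `snr`."""
    noise_std = np.sqrt(projection.var() / snr)
    return projection + rng.normal(scale=noise_std, size=projection.shape)

rng = np.random.default_rng(3)
# Toy "projection": a smooth blob standing in for a real projection image.
y, x = np.mgrid[-32:32, -32:32]
clean = np.exp(-(x**2 + y**2) / (2 * 8.0**2))
noisy = add_noise(clean, snr=0.07, rng=rng)

# Empirical check: the realized ratio fluctuates slightly around the target.
measured_snr = clean.var() / (noisy - clean).var()
```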

As the experimental cryo-EM dataset, we used data collected for a translocational complex (70S·tRNA(fMet)·fMet-Ile-tRNA(Ile)·EF-G·GDPNP; Gao, H., Fu, J., Lei, J., Zavialov, A., Ehrenberg, M., and Frank, J., work in progress). Supervised classification with two references (the ribosome in two different ratcheting states, identified as a rotation of the 30S subunit relative to the 50S subunit) confirmed the existence of conformational heterogeneity in this dataset. The orientation of each 2D image was determined by projection matching; that is, cross-correlation alignment to the reference of a vacant ribosome without ratchet motion. To be consistent with the spacing of the simulated data, a 6-degree angular step size was chosen for the angular grid in the projection matching. For multivariate data analysis and classification, a dataset was defined as a collection of projection images assigned to a given view and its 5 neighboring views. In order to study the continuity among neighboring classifications, we created 6 partially overlapping datasets by stepping along the angular grid by one unit in all directions (Fig. 4 and Table 1).

Considering the uneven angular distribution of the experimental projection data (Fig. 4A), a region with high data density was chosen to minimize the statistical fluctuations. Thus, projection images assigned around projection view #215 (Eulerian angles ψ = 0, θ = 54, and φ = 107.25) were set aside for classification, with a total of 7668 images belonging to 15 projection views (Fig. 4B). Each dataset (i.e., 6 neighboring views) was classified independently, and final subclasses of the 7668 images were obtained by combining the results of the 6 classifications.

3.2. Unsupervised classification of simulated data by cluster tracking

In preparation for the unsupervised classification, we applied a low-pass filtration (1/10 Å⁻¹ filter radius) to the simulated data, which enhanced the SNR from 0.07 to 1.7 (Fig. 3D and E). Subsequently, PCA was applied to classify the dataset that was generated from 6 neighboring projection views around projection view #215 (Fig. 3C and Fig. 4B). Fig. 5A shows the first 30 eigenimages in descending order, ranked by the magnitude of their corresponding eigenvalues. Interestingly, the fifth eigenimage, which is associated with the fifth factor, shows a density that resembles the shape of EF-G as seen from the viewing direction chosen; thus we conclude that the fifth factor represents the component of interimage variation associated with the presence versus absence of EF-G on the ribosome. We therefore constructed a histogram of the distribution of the projection images with respect to factor 5, which shows a distinct separation of two conformations (Fig. 6B). In addition, the map of factors 1 versus 2 shows six distinct clusters arranged in a pattern that reflects the (locally planar) angular relationship among the six projections (Fig. 6A). The fact that the variability in a set of projections encodes their angular relationship, and that this relationship is uncovered by PCA or other multivariate data analysis methods, has been observed before (Frank and van Heel, 1982).

Fig. 2. Flowchart of the cluster tracking method.

Fig. 3. Description of simulated data. (A) Density map of 70S ribosome (yellow) bound with EF-G (red). (B) Density map of the same volume in (A), but the mass of EF-G was manually removed. (C) Six neighboring projection views around view #215. The numbers identify their positions on the angular grid, which includes a total of 551 projection views with 6-degree spacing. (D) Simulated data generated from the volume in (A), with SNR = 0.07 (upper row) and 1.7 (lower row). (E) Simulated data generated from the volume in (B), with SNR = 0.07 (upper row) and 1.7 (lower row). The data with SNR = 1.7 are generated by applying a low-pass filtration to the data with SNR = 0.07.

Fig. 4. (A) Image distribution on a 6-degree angular grid. The scale bar indicates the image number for each projection view and all the views with 500 or more images are colored in red. (B) Close-up of the square-shaped region in (A); each projection view is identified by a number, as explained in Fig. 3. Projection views chosen for classification by cluster tracking are colored in yellow. Refer to Table 1 for the composition of the datasets.
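Why low-pass filtration raises the SNR can be seen in a small sketch. The assumptions here are not taken from the text: a sharp circular cutoff in Fourier space (the actual filter shape is not stated), white Gaussian noise, SNR defined as signal variance over noise variance, and a cutoff expressed in cycles per pixel rather than in Å⁻¹. A smooth signal keeps almost all of its power below the cutoff, whereas white noise loses power in proportion to the discarded area of Fourier space.

```python
import numpy as np

def low_pass(image, cutoff):
    """Zero all Fourier components with spatial frequency above `cutoff` (cycles/pixel)."""
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    mask = np.sqrt(fx**2 + fy**2) <= cutoff
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mask))

rng = np.random.default_rng(4)
y, x = np.mgrid[-32:32, -32:32]
signal = np.exp(-(x**2 + y**2) / (2 * 8.0**2))           # smooth toy "projection"
noise = rng.normal(scale=np.sqrt(signal.var() / 0.07), size=signal.shape)

snr_before = signal.var() / noise.var()
# Filtering is linear, so signal and noise can be filtered separately to measure SNR.
snr_after = low_pass(signal, 0.1).var() / low_pass(noise, 0.1).var()
```

The size of the gain depends on the spectrum of the signal and on the cutoff, so the toy numbers are not expected to reproduce the values quoted above for the simulated ribosome data.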

In order to study the meaning of each factor and determine those related to the conformational changes, we generated a set of histograms that show the distribution of particles along each of the first nine factors. Since the orientations of the images entering the unsupervised classification process are considered to be known (in both simulated and experimental data), the images within the same orientation can be grouped together. Along factors that reflect the difference of particle orientations in the dataset, the grouped images should form separable clusters. In contrast, along those factors that reflect only conformational changes in the dataset, the image groups should not be separate. As shown in Fig. 7, factors 1–4 and 6 appear to relate to the differences in orientations, since different image groups, represented by different curves in the histogram, are clearly separate. Therefore, we excluded these factors from the classification analysis since they were unlikely to be related to conformational heterogeneity. The K-means method (with K = 2) was applied to the four-dimensional space spanned by the remaining factors (5, 7, 8, and 9). The resulting two clusters correlate well with the known membership in the two conformational groups, and the accuracy of classification is around 82%, where accuracy is defined as the proportion of particles that fall into the correct clusters. Thus, despite the low SNR, our method of cluster tracking is able to recognize the conformational heterogeneity in the simulated dataset in the presence of multiple orientations, as proposed.

Fig. 5. (A) First 30 eigenimages generated from the simulated dataset using PCA. (B) Eigenvalue histogram of (A).

Table 1
Experimental datasets used in cluster tracking

Dataset    Projection view (a)            Number of images
1          170 172 171 131 215 216        4303
2          214 213 215 170 262 261        1868
3          215 214 216 171 263 262        3963
4          216 215 217 172 264 263        4501
5          261 263 262 214 215 313        2045
6          262 264 263 215 216 314        3464

(a) Refer to Fig. 3 for their relative positions on the angular grid.
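The factor screening described above was done by inspecting histograms. As one possible numerical counterpart, offered only as an assumption and not as the procedure used in the study, the fraction of each factor's variance explained by projection-view membership can be computed; factors with a high fraction behave like orientation-related factors. The function `view_separation`, the threshold of 0.5, and the toy data are all hypothetical.

```python
import numpy as np

def view_separation(coords, view_labels):
    """Fraction of each factor's variance explained by projection-view membership."""
    total = coords.var(axis=0)
    grand_mean = coords.mean(axis=0)
    between = np.zeros(coords.shape[1])
    for v in np.unique(view_labels):
        members = coords[view_labels == v]
        weight = len(members) / len(coords)
        between += weight * (members.mean(axis=0) - grand_mean) ** 2
    return between / total

rng = np.random.default_rng(5)
views = np.repeat(np.arange(6), 100)             # six views, 100 images each
conformation = rng.integers(0, 2, size=600)      # two conformations, mixed in every view
coords = rng.normal(size=(600, 3))               # toy factor coordinates
coords[:, 0] += 4.0 * views                      # first factor follows the view
coords[:, 1] += 4.0 * conformation               # second factor follows the conformation

score = view_separation(coords, views)
keep = np.flatnonzero(score < 0.5)               # 0.5 is an arbitrary illustrative threshold
```

In this toy setting the first factor is flagged as orientation-related and dropped, while the conformation-related and pure-noise factors are kept for clustering.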

We then changed the original noise level in the dataset (still staying in the realistic range) and applied the cluster tracking method to the modified datasets. The accuracies of classification for all three datasets are shown in Table 2. As expected, the classification becomes more accurate as the level of noise decreases. For comparison, we also tested our method on the simulated data for both noise-free and pure noise cases. The accuracy of the classification on noise-free images is 100% as all the images are classified into the correct clusters. For the test on pure noise, we aligned the noise images with the volume that was used to generate the simulated data, and applied PCA to the aligned images. The eigenimages did not show any significant features as seen in other datasets. Moreover, the eigenvalue histogram had the characteristics (flat slope) of noise, which would be grounds for dismissal of the dataset from any further analysis.
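The accuracy measure defined earlier (the proportion of particles that fall into the correct clusters) can be sketched as follows. Because cluster numbers produced by K-means are arbitrary, the sketch takes the best agreement over all relabelings of the clusters; that matching step is an assumption about how the comparison would be made, and the label arrays are made-up toy data.

```python
import numpy as np
from itertools import permutations

def classification_accuracy(true_labels, cluster_labels, k):
    """Best fraction of correctly assigned particles over all relabelings of the clusters."""
    best = 0.0
    for perm in permutations(range(k)):
        relabeled = np.array(perm)[cluster_labels]
        best = max(best, float(np.mean(relabeled == true_labels)))
    return best

true_labels    = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
cluster_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])   # cluster ids are swapped
accuracy = classification_accuracy(true_labels, cluster_labels, k=2)   # 0.8
```

Trying all relabelings is only practical for a small number of clusters, which is the case for the K = 2 and K = 4 classifications considered here.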

We have also tested our method on simulated particles with a 3-degree separation, in which case only factors 1 and 2 showed a correlation with the differences in orientation, and the conformation-related clustering on the plot of particle distribution against factor 3 was distinct. Thus, the finer angular grid offers a clear advantage, in producing a stronger consolidation of factors related to either angular or conformational changes. (Such stronger consolidation facilitates automated differentiation between conformation- and orientation-related factors.) However, in that case the number of projections per classification dataset is too small in the experimental situation to give statistically defined patterns, so that we decided to keep the 6-degree angular grid for both the simulated and experimental study.

Fig. 6. (A) Factor map (factor 1 versus factor 2) of the simulated dataset: evidence of grouping according to orientations only. (B) Distribution of simulated data along factor 5: evidence of grouping according to conformations only. In (B), the images belonging to the two different conformations are colored differently.


3.3. Supervised classification of experimental data

Supervised classification was applied to the experimental dataset, which confirmed the existence of at least two distinct conformational states in the dataset. We followed the procedure described by Valle et al. (2002) and, more recently, by Gao et al. (2004). Two reference volumes with the ribosome in different ratcheting states were used to determine both the orientation and the class of the projection images based on the criterion of maximum cross-correlation coefficient (see Fig. S1). Thus, to each projection was assigned (i) the three Eulerian angles defining the orientation that yielded the highest cross-correlation coefficient, and (ii) the two values of the cross-correlation coefficient (CC1 and CC2) given by the two reference volumes.
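The assignment rule can be sketched in a few lines. The details below are assumptions rather than the published procedure: the images are taken to be already aligned, CC1 and CC2 are taken as the maximum normalized cross-correlation over the projections of each reference, and the function names and random toy "projections" are illustrative only.

```python
import numpy as np

def cross_correlation_coefficient(a, b):
    """Normalized cross-correlation coefficient of two aligned images."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def supervised_assignment(image, ref1_projections, ref2_projections):
    """Return (best view index, class, CC1, CC2) for one image.

    Each reference is given as a list of projections at the same grid orientations.
    """
    cc1 = [cross_correlation_coefficient(image, r) for r in ref1_projections]
    cc2 = [cross_correlation_coefficient(image, r) for r in ref2_projections]
    CC1, CC2 = max(cc1), max(cc2)
    if CC1 >= CC2:
        return int(np.argmax(cc1)), 1, CC1, CC2
    return int(np.argmax(cc2)), 2, CC1, CC2

rng = np.random.default_rng(6)
ref1 = [rng.normal(size=(16, 16)) for _ in range(3)]   # toy projections of reference 1
ref2 = [rng.normal(size=(16, 16)) for _ in range(3)]   # toy projections of reference 2
image = ref2[1] + 0.5 * rng.normal(size=(16, 16))      # noisy copy of one reference-2 projection

view, cls, CC1, CC2 = supervised_assignment(image, ref1, ref2)
delta_cc = CC2 - CC1    # positive values indicate higher similarity to reference 2
```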

The distribution of projection data with respect to the parameter ΔCC = CC2 − CC1 shows a profile of similarity ordering of the data (Fig. S1A). Projection images producing ΔCC values on the negative side of the axis (i.e., on the left hand side of the histogram in Fig. S1A) are more similar to reference 1, a vacant ribosome without ratchet motion, while those on the positive side are more similar to reference 2, a vacant ribosome with ratchet motion. The data were arbitrarily divided into 5 groups along the histogram such that each portion contained a sufficient number of projections for a single-particle reconstruction. Not surprisingly, volumes reconstructed from the two extreme groups (#1 and #5) showed maximal resemblance to the two references, in terms of the absence and presence of the ratchet motion in the 30S subunits (Fig. S1B). In addition, the ribosome in the non-ratcheting state (#1) bears three tRNAs in the A, P, and E sites, while the ribosome in the ratcheting state (#5) is bound with EF-G and a hybrid P/E-site tRNA, indicating that these two maps represent two distinctive functional states. Overall, the three volumes generated from the three intermediate groups appear similar to map #1, except that some subtle continuous changes are seen in the ribosomal conformations.

Fig. 7. Distribution of simulated data with respect to factors 1–9. The images are grouped by their associated projection views.

Table 2
Cluster tracking on simulated data with different noise levels

Original SNR    SNR after filtration    Classification accuracy (%)
Infinite        Infinite                100
0.07            1.7                     82
0.09            1.9                     84
0.14            2.7                     92

3.4. Unsupervised classification of experimental data by cluster tracking

The cluster tracking method was tested on the real data in a region of angular space with high data density, as described in Section 3.1 (Fig. 3). Within that region, we selected six partially overlapping datasets, each defined by a projection orientation and its five immediately surrounding neighbors on the 6-degree angular grid. PCA was applied individually to each of the datasets. The first 30 eigenimages for one of the datasets, typical for the results from the others, are shown in Fig. 8. Features related to either the changes in projection view or in conformation were not as well distinguishable as in the case of the simulated data. This may be due to several reasons, including residual misalignment of images, orientational misclassification, and the more complicated noise situation. However, the sixth eigenimage does have a density pattern that shares some structural features with EF-G.

Using the same strategy as that used for the simulated data, we constructed a series of histograms for the distribution of projection images grouped by projection views along each of the first nine factors (Fig. 9). Similarly, factors 1–4 were excluded from further analysis, on account of the fact that in their histograms the projections belonging to the six views are separated. Subsequently, K-means clustering was applied in the space spanned by the remaining factors. To test the hypothesis that more than two structural states might exist in this dataset, we chose to use four clusters (K = 4) instead of two as used in the case of the simulated data.

To compare the results of the cluster tracking method with those from supervised classification, we plotted histograms of the ΔCC values for each dataset (Fig. 10). Positive ΔCC values indicate higher similarity with reference #2, while negative values indicate higher similarity with reference #1. In such a plot, a large shift between the histograms of two clusters indicates that they have been successfully separated in a similar way as through the supervised classification.

The results obtained confirmed the achievement of separation through cluster tracking in a way that reproduces the results from supervised classification (Fig. 10). The histograms of all six datasets showed similar patterns, in which one of the clusters is strongly shifted to the right (= positive) side of the axis, and thus distinctively separated from the other three whose centers are located around the origin. Clusters in two of the histograms, corresponding to datasets 2 and 5, have less distinguishable separation compared to the others, which might be explained by their exceptionally small number of particles (around 2000 particles compared to an average of 4058 for the other four datasets). Looking at the six classifications, we can say that each shows evidence for the existence of two classes, presumably with the same structural meaning. As the two "classes", we count the smaller cluster that is consistently separate, and the "supercluster" formed by the other three virtually indistinguishable clusters.

In order to investigate whether the origin and meaning of the classes is the same as we move around in angular space, we can make use of the partial overlap of the six subsets: each pair of neighboring subsets has a certain set of projection images in common, and we can ask the question to what extent the memberships in the two classes are maintained. According to the premise of our approach, which postulates that classes are continuous, we would expect that for both classes, a high percentage of the images in the overlap set would keep their class affiliation.

Fig. 8. (A) First 30 eigenimages generated from dataset 3 (see Table 1) of the experimental cryo-EM data using PCA. (B) Eigenvalue histogram of (A).

Fig. 9. Distribution of experimental data (dataset 3) with respect to factors 1–9. The images are grouped by their associated projection views.

The cross-tabulation of memberships for each of the 10 pairs of neighboring classifications (Table 3) shows that class affiliation is indeed maintained for a high percentage of images. Overall, most pairs of classifications have a high proportion of stable projection images, as indicated by the average values listed in Table 3. For example, the overlapping images of dataset 1 and dataset 3 (the central one of the six datasets) were classified with high fidelity: around 79% of the images remained in the same class rather than jumping to another class from one dataset to the other.

Fig. 10. Data distribution for the six experimental datasets with respect to CC2 − CC1. CC1 and CC2 are the cross-correlation coefficients of projections to the two reference maps calculated in the supervised classification. Images are grouped according to their class affiliation produced by the cluster tracking method. (Note: the class numbers, 1–4, are assigned arbitrarily in each classification, so a class number in a given histogram is not related to the same number in a different histogram.)
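The comparison shown in Fig. 10 can be illustrated with a minimal sketch that computes DCC = CC2 − CC1 for every image and summarizes it per cluster. The function name `dcc_summary`, the bin count, and the synthetic correlation values are assumptions made for illustration only; they are not taken from the study.

```python
import numpy as np

def dcc_summary(cc1, cc2, labels, bins=20):
    """Per-cluster histogram and mean of DCC = CC2 - CC1 (hypothetical helper)."""
    cc1, cc2, labels = map(np.asarray, (cc1, cc2, labels))
    dcc = cc2 - cc1
    # Shared bin edges so that the cluster histograms are directly comparable.
    edges = np.linspace(dcc.min(), dcc.max(), bins + 1)
    summary = {}
    for k in np.unique(labels):
        values = dcc[labels == k]
        counts, _ = np.histogram(values, bins=edges)
        summary[int(k)] = {"mean": float(values.mean()), "counts": counts}
    return edges, summary

# Synthetic illustration (not experimental data): four clusters, one of which
# correlates better with reference #2 than with reference #1.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2, 3], 500)
cc1 = rng.normal(0.30, 0.02, labels.size)
cc2 = cc1 + rng.normal(0.0, 0.01, labels.size)
cc2[labels == 3] += 0.05

edges, summary = dcc_summary(cc1, cc2, labels)
shifted = max(summary, key=lambda k: summary[k]["mean"])
print("cluster with the largest positive DCC shift:", shifted)
```

A cluster whose mean DCC lies well away from the others corresponds to the shifted histogram described in the text; clusters with means near zero would form the supercluster.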

Importantly, projections belonging to the three clusters of the “supercluster,” which are indistinguishable in the sense of the supervised classification, showed strong stability in some pairs of independent classifications of overlapping datasets, as reflected by the small standard deviations in Table 3. This high consistency (even among all four clusters) appeared in the cross-tabulations of four datasets (1, 3, 5, and 6), which suggests the possible existence of additional states beyond the two identified by supervised classification.
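The statistics reported in Table 3 can be sketched as follows. Because class numbers are arbitrary in each classification, the sketch first picks the one-to-one correspondence between clusters that maximizes agreement, then computes the retained proportion for each of the K corresponding pairs. The matching rule, the use of the first classification's cluster size as the denominator, and the population form of the standard deviation are all assumptions; the text does not specify them.

```python
from itertools import permutations
import numpy as np

def membership_consistency(labels_a, labels_b, k=4):
    """Cross-tabulate two classifications of the same overlapping images.

    labels_a[i] and labels_b[i] are the class numbers (0..k-1) assigned to
    overlap image i in the two neighboring classifications.
    """
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    table = np.zeros((k, k), dtype=int)
    np.add.at(table, (a, b), 1)  # table[i, j] = images in class i of A and class j of B
    # Assumed matching rule: the permutation with the largest total agreement.
    best = max(permutations(range(k)),
               key=lambda p: sum(table[i, p[i]] for i in range(k)))
    totals = table.sum(axis=1)
    proportions = np.array([table[i, best[i]] / totals[i]
                            for i in range(k) if totals[i] > 0])
    return table, best, proportions.mean(), proportions.std()

# Toy overlap set of 40 images with permuted class numbers and a few jumps.
labels_a = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
labels_b = [2] * 8 + [0] * 2 + [0] * 10 + [3] * 9 + [1] * 1 + [1] * 7 + [2] * 3
table, best, average, spread = membership_consistency(labels_a, labels_b)
print(best, round(average, 2), round(spread, 2))
```

A large average with a small standard deviation corresponds to the uniformly stable pairs discussed above, while a large standard deviation indicates that only some of the clusters are stable.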

In the absence of the references needed for supervised classification, which is the case motivating this study, we have no DCC values and cannot characterize the K-means clusters by histograms such as those in Fig. 10. In this situation, tracking particle memberships in overlapping classifications is the only way to establish the identity and continuity of classes as a function of angle.

Table 3
Cross-tabulation of particle membership for the 10 pairs of neighboring classifications

                                  Proportion of projection images that
                                  remained affiliated in two classifications
Dataset   Neighboring dataset     Average (a)    Standard deviation (a)
1         2                       0.70           0.09
1         3                       0.79           0.07
1         4                       0.57           0.16
2         3                       0.61           0.08
2         5                       0.60           0.10
3         4                       0.59           0.12
3         5                       0.76           0.07
3         6                       0.78           0.05
4         6                       0.71           0.17
5         6                       0.73           0.12

(a) The average and standard deviation for each pair of neighboring classifications were calculated based on the proportion of overlapping data between the 4 pairs of corresponding clusters generated in the two classifications (see Fig. 10). A large value of the average indicates high consistency among memberships in two neighboring classifications, and a small value of the standard deviation indicates uniform consistency among all the four clusters in neighboring classifications.

4. Discussion

4.1. Feasibility of the method

We introduced the cluster tracking method as a means to classify cryo-EM data in a reference-free manner. This approach makes use of the local continuity in multi-dimensional space induced by angular adjacency to track the similarity ordering among projections. Data are first classified into different orientations by angular projection matching; then subsets of projections, each falling into a narrow angular range, are classified again by PCA and K-means. Factors obtained by PCA relate to variations in conformation, orientation, and possibly other properties. We proposed a way to identify factors related to conformational differences by excluding those factors that are clearly related to orientational differences. K-means clustering is then applied in the space spanned by the factors not excluded.
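The local classification step summarized above (PCA, exclusion of orientation-related factors, K-means in the remaining factor space) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study: the helper names are hypothetical, the excluded factors are passed in by hand (mirroring the visual selection described in the text), and the K-means routine is a plain Lloyd iteration.

```python
import numpy as np

def pca_factors(images, n_factors):
    """Coordinates of each image (one row per image) on the leading factors."""
    centered = images - images.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_factors] * s[:n_factors]

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd iteration; returns one cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def classify_subset(images, excluded, n_factors=9, k=2, seed=0):
    """Cluster in the space of the factors that were not excluded (0-based indices)."""
    coords = pca_factors(images, n_factors)
    kept = [f for f in range(n_factors) if f not in excluded]
    return kmeans(coords[:, kept], k, seed=seed)

# Synthetic illustration: pixel 0 carries a strong orientation effect,
# pixel 1 a weaker conformational effect, and all pixels carry noise.
rng = np.random.default_rng(0)
n, d = 600, 50
view = rng.integers(-1, 2, n)   # three neighboring views
conf = rng.integers(0, 2, n)    # two conformations
images = rng.normal(0.0, 0.3, (n, d))
images[:, 0] += 6.0 * view
images[:, 1] += 3.0 * (conf - 0.5)

labels = classify_subset(images, excluded=[0], n_factors=9, k=2)
accuracy = max(np.mean(labels == conf), np.mean(labels != conf))
print("agreement with the true conformation:", accuracy)
```

In this toy case the orientation effect dominates the first factor, so clustering without excluding it would tend to group images by view rather than by conformation, which is the motivation for the exclusion step.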

To study the feasibility of the method, we applied it to a simulated cryo-EM dataset generated from multiple projections that occupied a local region in angular space. The initial SNR, approximately 0.07, corresponds to the experimental situation. The cluster tracking method was able to separate the two conformations present in the simulated dataset with an accuracy higher than 80%.
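To give a sense of what an SNR of about 0.07 means for a single projection, the sketch below adds white Gaussian noise to a noise-free image. It assumes that SNR is defined as the ratio of signal variance to noise variance and that the noise is additive and white; the simulation in the study may have used a different noise model, and the Gaussian blob is only a stand-in for a real projection.

```python
import numpy as np

def add_noise_to_snr(projection, snr, rng):
    """Add white Gaussian noise so that var(signal) / var(noise) is close to snr."""
    noise_sd = np.sqrt(projection.var() / snr)
    return projection + rng.normal(0.0, noise_sd, projection.shape)

rng = np.random.default_rng(1)
y, x = np.mgrid[-32:32, -32:32]
projection = np.exp(-(x ** 2 + y ** 2) / (2 * 10.0 ** 2))  # stand-in for a projection

noisy = add_noise_to_snr(projection, 0.07, rng)
measured_snr = projection.var() / (noisy - projection).var()
print("measured SNR:", measured_snr)
```

At this level the noise variance is roughly 14 times the signal variance, which is why classification has to rely on statistics over many images rather than on individual projections.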

Cluster tracking of experimental cryo-EM data faces larger challenges because of (i) the more complicated “non-white” noise statistics, (ii) possible misalignment of the projections, (iii) the more complicated nature of conformational heterogeneities, and (iv) the complications produced by the unevenness of the angular distribution, which leads to poor statistical support in some regions. We applied cluster tracking to experimental data that belonged to a defined, well-populated region in angular space, and compared the results with those obtained by supervised classification. The main result of our analysis is that, for the angular region analyzed, cluster tracking reproduces the partition of the dataset obtained by supervised classification into data that correspond to ribosomes with EF-G and those without EF-G.

The consistency of classification from one dataset to a neighboring dataset, mediated through the projections they have in common, was investigated in our study because of its importance for tracking clusters through the entire angular space. We showed that, in pairs of independent classifications of overlapping datasets, the membership assignment of overlapping projections is quite stable and mutually consistent, making the global tracking of clusters possible in principle. Surprisingly, we find local angular consistency in all clusters for a K-means classification with K = 4, suggesting that more than two distinct, biologically meaningful states might coexist in this specimen.
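The idea of tracking clusters from one dataset to its neighbors through shared projections can be sketched as a label propagation over the graph of overlapping datasets. The sketch below is an assumption-laden illustration: `align_labels` and `track` are hypothetical helpers, cluster numbers are matched by maximizing agreement on the shared images, and the toy data are noise-free.

```python
from itertools import permutations

def align_labels(reference, other, k):
    """Renumber the classes of `other` to agree with `reference` on shared images."""
    shared = reference.keys() & other.keys()
    def agreement(perm):
        return sum(reference[i] == perm[other[i]] for i in shared)
    best = max(permutations(range(k)), key=agreement)
    return {image: best[label] for image, label in other.items()}

def track(classifications, neighbor_pairs, k, start):
    """Propagate the class numbering of `start` to all connected datasets."""
    tracked = {start: classifications[start]}
    queue = [start]
    while queue:
        current = queue.pop(0)
        for a, b in neighbor_pairs:
            nxt = b if a == current else a if b == current else None
            if nxt is not None and nxt not in tracked:
                tracked[nxt] = align_labels(tracked[current], classifications[nxt], k)
                queue.append(nxt)
    return tracked

# Toy example: image i truly belongs to class i % 2; dataset 2 was classified
# with the class numbers swapped, as can happen with arbitrary numbering.
classifications = {
    1: {i: i % 2 for i in range(0, 6)},
    2: {i: 1 - i % 2 for i in range(3, 9)},
    3: {i: i % 2 for i in range(6, 12)},
}
tracked = track(classifications, neighbor_pairs=[(1, 2), (2, 3)], k=2, start=1)
print(tracked[2])
```

With noisy experimental classifications the agreement on the overlap set is only partial (see Table 3), so the reliability of each propagation step depends on how many shared images keep their class affiliation.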

4.2. Identification of relevant factors and feasibility of automation

To identify the factors related to structural heterogeneity, it is necessary to set up a quantifiable standard for the exclusion of factors that express changes in projection orientation. In this study, we accomplished this exclusion by visually analyzing the histograms of particle distributions, grouped by the orientations assigned by projection matching, against each of the factors. To develop a reproducible method of exclusion, we need to characterize these histograms quantitatively. For each orientation, the distribution of particles could be characterized by the location of the histogram's center (e.g., in terms of its median and mean), its spread (standard deviation), and a weight associated with the number of images it contains. The spread of the parameter characterizing the locations of the centers should then be compared with the spreads of the individual curves, to arrive at a criterion for sensitivity to changes in angle. A fully automated performance of the unsupervised local classification might then be possible. The various parameters required to set up a standard could be established by the use of training datasets, such as those processed by supervised classification: a separation of clusters along the DCC axis would indicate that the correct factors were chosen for the K-means classification.
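One possible form of the quantitative criterion proposed above is the ratio between the weighted spread of the per-view histogram centers and the pooled spread within views. The sketch below is only one way to make the idea concrete: the function name, the use of means as centers, and any threshold applied to the ratio are assumptions that would need calibration on training datasets, as the text suggests.

```python
import numpy as np

def orientation_sensitivity(coords, views):
    """Spread of per-view centers relative to the within-view spread, for one factor."""
    coords = np.asarray(coords, dtype=float)
    views = np.asarray(views)
    groups = [coords[views == v] for v in np.unique(views)]
    weights = np.array([len(g) for g in groups], dtype=float)  # images per view
    centers = np.array([g.mean() for g in groups])
    spreads = np.array([g.std() for g in groups])
    grand_center = np.average(centers, weights=weights)
    between = np.sqrt(np.average((centers - grand_center) ** 2, weights=weights))
    within = np.sqrt(np.average(spreads ** 2, weights=weights))
    return between / within

# Synthetic illustration: six views, 300 images each.
rng = np.random.default_rng(0)
views = np.repeat(np.arange(6), 300)
factor_a = rng.normal(2.0 * views, 1.0)      # center moves with the view
factor_b = rng.normal(0.0, 1.0, views.size)  # center independent of the view

ratio_a = orientation_sensitivity(factor_a, views)
ratio_b = orientation_sensitivity(factor_b, views)
print(round(ratio_a, 2), round(ratio_b, 2))
```

A factor with a ratio well above 1 separates the views and would be a candidate for exclusion, while a ratio near 0 indicates that the factor is insensitive to orientation within the region analyzed.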

4.3. The move from the local angular neighborhood to the global angular space

So far, the cluster tracking method has been applied only to a small region of angular space (15 out of 551 projection views on a 6-degree angular grid) with a relatively large number of images (an average of 511 images per projection orientation, compared to the global average of 165 images per projection orientation). To classify the entire cryo-EM dataset, we need to move the processing region stepwise around angular space in a partially overlapping way, such that the entire angular space is covered. The uneven data distribution poses a challenge to cluster tracking, as regions with fewer images have higher statistical fluctuation and may introduce ambiguity into the classification. In the worst case, clusters will no longer retain their identity, and the tracking will be lost.

This problem can be addressed by a combination of two measures: first, by collecting more data to reduce such regions of low statistical definition; second, as the global distribution becomes increasingly populated, by bypassing regions of low coverage and making detours as necessary along regions that provide sufficient continuity.
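The detour strategy can be sketched as a breadth-first traversal that only steps into angular cells containing enough images. The sketch makes strong simplifying assumptions: angular space is represented as a small flat grid with four-connected neighbors rather than as a grid on the sphere, and the counts and the threshold `min_images` are invented for illustration.

```python
from collections import deque

def tracking_order(counts, start, min_images):
    """Order in which well-populated cells can be reached from `start`.

    counts[r][c] is the number of images in an angular cell; cells with fewer
    than `min_images` images are bypassed. `start` is assumed to be populated.
    """
    rows, cols = len(counts), len(counts[0])
    order, seen, queue = [], {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        order.append((r, c))
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in seen and counts[nr][nc] >= min_images:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return order

# Invented image counts; the sparsely populated middle cells block the direct route.
counts = [
    [500, 40, 300, 310],
    [480, 30, 20, 290],
    [450, 400, 350, 320],
]
order = tracking_order(counts, start=(0, 0), min_images=200)
print(order)
```

In this toy grid the cell at (0, 2) is reached last, by a detour along the bottom row, while the three sparsely populated cells are never visited and would have to wait for additional data.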

5. Conclusions

For many molecules, the problem posed by heterogeneity in single-particle reconstruction might be the ultimate hurdle that stands in the way of achieving atomic resolution. We show here that it might be possible in certain cases to separate highly noisy datasets, in which conformational and orientational variability is intermixed, into their components. The indications are that the presence vs. absence of EF-G, which represents just ~4% of the total mass of the ribosome, gives rise to a virtually unambiguous classification, at least in a region of angular space that is highly oversampled. However, statistical limitations in less oversampled regions pose no real hurdle, as they can eventually be overcome by extending the data collection.

Acknowledgments

We thank William Baxter for help with the preparation of the angular distribution map and Michael Watters for assistance with the preparation of the figures. This work was supported by the Howard Hughes Medical Institute and NIH Grants P41 RR01219 and R37 GM29169 (to J.F.).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.jsb.2006.06.012.

References

Agrawal, R.K., Penczek, P., Grassucci, R.A., Frank, J., 1998. Visualizationof elongation factor G on the Escherichia coli 70S ribosome: themechanism of translocation. Proc. Natl. Acad. Sci. USA 95, 6134–6138.

Allen, G.S., Zavialov, A., Gursky, R., Ehrenberg, M., Frank, J., 2005. Thecryo-EM structure of a translation initiation complex from Escherichia

coli. Cell 121, 703–712.Bohm, J., Frangakis, A., Hegerl, R., Nickell, S., Typke, D., Baumeister,

W., 2000. Toward detecting and identifying macromolecules in acellular context: template matching applied to electron tomograms.Proc. Natl. Acad. Sci. USA 97, 14245–14250.

Brink, J., Ludtke, S.J., Kong, Y.F., Wakil, S.J., Ma, J.P., Chiu, W., 2004.Experimental verification of conformational variation of human fattyacid synthase as predicted by normal mode analysis. Structure 12, 185–191.

Burgess, S.A., Walker, M.L., Thirumurugan, K., Trinick, J., Knight, P.J.,2004. Use of negative stain and single-particle image processing toexplore dynamic properties of flexible macromolecules. J. Struct. Biol.147, 247–258.

de Haas, F., Taveau, J.C., Boisset, N., Lambert, O., Vinogradov, S.N.,Lamy, J.N., 1996. Three-dimensional reconstruction of the chlorocru-orin of the polychaete annelid Eudistylia vancouverii. J. Mol. Biol. 255,140–153.

Frank, J., van Heel, M., 1982. Correspondence analysis of aligned imagesof biological particles. J. Mol. Biol. 161, 134–137.

Frank, J., Radermacher, M., 1986. Three-dimensional reconstruction ofnon-periodic macromolecular assemblies from electron micrographs.In: Koehler, J.K. (Ed.), Advanced Techniques in Biological ElectronMicroscopy, vol. 3. Springer, Berlin, pp. 1–72.

Frank, J., Bretaudiere, J.P., Carazo, J.M., Verschoor, A., Wagenknecht,T., 1988. Classification of images of biomolecular assemblies: a studyof ribosomes and ribosomal subunits of Escherichia coli. J. Microsc.150, 99–115.

Frank, J., 1990. Classification of macromolecular assemblies studied as‘single particles’. Q. Rev. Biophys. 23, 281–329.

Frank, J., 2001. Cryo-electron microscopy as an investigative tool: theribosome as an example. BioEssays 8, 725–732.

Frank, J., 2006. Three-Dimensional Electron Microscopy of Macromo-lecular Assemblies, second ed. Oxford University Press, New York.

Frank, J., Spahn, C.M.T., 2006. The ribosome and the mechanism ofprotein synthesis. Rep. Prog. Phys. 69, 1383–1417.

Gao, H., Valle, M., Ehrenberg, M., Frank, J., 2004. Dynamics of EF-Ginteraction with the ribosome explored by classification of a hetero-geneous cryo-EM dataset. J. Struct. Biol. 148, 283–289.

Gao, N., Zavialov, A.V., Li, W., Sengupta, J., Valle, M., Gursky, R.P.,Ehrenberg, M., Frank, J., 2005. Mechanism for the disassembly of theposttermination complex inferred from cryo-EM studies. Mol. Cell 18,663–674.

Heymann, J.B., Conway, J.F., Steven, A.C., 2004. Molecular dynamics ofprotein complexes from four-dimensional cryo-electron microscopy. J.Struct. Biol. 147, 291–301.

Lebart, L., Morineau, A., Warwick, K.M., 1984. Multivariate DescriptiveStatistical Analysis: Correspondence Analysis and Related Techniquesfor Large Matrices. John Wiley, New York.

Ludtke, S.J., Chen, D.H., Song, J.L., Chuang, D.T., Chiu, W., 2004.Seeing GroEL at 6 A resolution by single particle electron cryomi-croscopy. Structure 12, 1929–1936.

Marabini, R., Carazo, J.M., 1994. Pattern recognition and classification ofimages of biological macromolecules using artificial neural networks.Biophys. J. 66, 1804–1814.

Pascual-Montano, A., Donate, L.E., Valle, M., Barcena, M., Pascual-Marqui, R.D., Carazo, J.M., 2001. A novel neural network techniquefor analysis and classification of EM single-particle images. J. Struct.Biol. 133, 233–245.

Penczek, P.A., Zhu, J., Frank, J., 1996. A common-lines based method fordetermining orientations for N > 3 particle projections simultaneously.Ultramicroscopy 63, 205–218.

Rye, H.S., Roseman, A.M., Chen, S., Furtak, K., Fenton, W.A., Saibil,H.R., Horwich, A.L., 1999. GroEL–GroES cycling: ATP and nonnativepolypeptide direct alternation of folding-active rings. Cell 97, 325–338.

Saibil, H.R., 2000. Molecular chaperones: containers and surfaces forfolding, stabilizing or unfolding proteins. Curr. Opin. Struct. Biol. 10,251–258.

Valle, M., Sengupta, J., Swami, N.K., Grassucci, R.A., Burkhardt, N.,Nierhaus, K.H., Agrawal, R.K., Frank, J., 2002. Cryo-EM reveals anactive role for aminoacyl-tRNA in the accommodation process.EMBO J. 21, 3557–3567.

Valle, M., Zavialov, A., Sengupta, J., Rawat, U., Ehrenberg, M., Frank, J.,2003. Locking and unlocking of ribosomal motions. Cell 114, 123–134.

van Heel, M., Frank, J., 1981. Use of multivariate statistics in analysingthe images of biological macromolecules. Ultramicroscopy 6, 187–194.

van Heel, M., 1984. Multivariate statistical classification of noisy images(randomly oriented biological macromolecules). Ultramicroscopy 13,165–184.

Zhou, Z.H., Liao, W.C., Cheng, R.H., Lawson, J.E., McCarthy, D.B.,Reed, L.J., Stoops, J.K., 2001. Direct evidence for the size andconformational variability of the pyruvate dehydrogenase complexrevealed by three-dimensional electron microscopy—the ‘‘breathing’’core and its functional relationship to protein dynamics. J. Biol. Chem.276, 21704–21713.

Zuzan, H., Holbrook, J.A., Kim, P.T., Harauz, G., 1998. Self-organiza-tion of cryoelectron micrographs of the phosphoenolpyruvate synthasefrom Staphylothermus marinus. Optik 109, 181–189.
