Analysis of a Data Set of Paired Uncomplexed Protein Structures: New Metrics for Side-Chain Flexibility and Model Evaluation

Shanrong Zhao, David S. Goodsell, and Arthur J. Olson*
Department of Molecular Biology, Scripps Research Institute, La Jolla, California

ABSTRACT We compiled and analyzed a data set of paired protein structures containing proteins for which multiple high-quality uncomplexed atomic structures were available in the Protein Data Bank. Side-chain flexibility was quantified, yielding a set of residue- and environment-specific confidence levels describing the range of motion around χ1 and χ2 angles. As expected, buried residues were inflexible, adopting similar conformations in different crystal structure analyses. Ile, Thr, Asn, Asp, and the large aromatics also showed limited flexibility when exposed on the protein surface, whereas exposed Ser, Lys, Arg, Met, Gln, and Glu residues were very flexible. This information is different from and complementary to the information available from rotamer surveys. The confidence levels are useful for assessing the significance of observed side-chain motion and estimating the extent of side-chain motion in protein structure prediction. We compare the performance of a simple 40° threshold with these quantitative confidence levels in a critical evaluation of side-chain prediction with the program SCWRL. Proteins 2001;43:271–279. © 2001 Wiley-Liss, Inc.

Key words: side-chain flexibility; protein structure prediction

Manuscript 13192-MB from the Scripps Research Institute.
Grant sponsor: National Institutes of Health; Grant number: PO1 HL16411.
*Correspondence to: Arthur Olson, Department of Molecular Biology, Scripps Research Institute, 10550 N. Torrey Pines Road, La Jolla, CA 92037. E-mail: [email protected]
Received 5 September 2000; Accepted 22 January 2001
Published online 00 Month 2001
PROTEINS: Structure, Function, and Genetics 43:271–279 (2001)
© 2001 WILEY-LISS, INC.

INTRODUCTION

Proteins combine structural rigidity with local flexibility. Most natural proteins adopt a defined folded structure, with secondary-structure segments arranged in a defined geometry. Layered on top of this relatively rigid core are several levels of flexibility: occasionally, the motion of entire domains alters the overall shape of the protein; often, the motion of connecting loops and terminal extensions modifies the shape of a cleft or extension; and in all proteins, side-chain motion alters the local topography.1

Side-chain conformation is determined by the intrinsic torsional flexibility of each residue, which is then limited by a combination of external factors: steric contacts with the local peptide backbone, interactions with neighboring parts of the protein, and interactions with surrounding proteins and solvents.

Most analyses of side-chain conformation study the range of motion available to a given residue type, but they do not analyze the flexibility of a given residue within a given protein environment. In a typical study, a database of representative structures is chosen from the Protein Data Bank (PDB), and the range of conformations is tabulated for each type of residue. For χ1 angles, this yields the familiar three-peaked histograms, showing that amino acids generally prefer the three staggered conformations [Fig. 1(A)]. These histograms may be used to generate rotamer libraries for protein structure prediction by picking a representative set of conformations that will cover most of the commonly observed (and, therefore, energetically favored) ranges.

These analyses, however, do not yield information on the flexibility of a given residue within a protein. All residues, whether buried or exposed, are surrounded by other residues, limiting their range of motion. Some positions will allow motion between different rotameric states, but other positions with stronger restraints will not allow such flexibility. Because only a single structure of each protein is included in rotamer surveys, location-specific interactions tend to average out, and the results reflect primarily the steric contacts with the main chain of adjacent residues, which are consistent across the entire test set. Rotamer analyses reveal the most energetically favorable conformations when observed in all environments, but a different approach must be taken to determine the flexibility of individual residues within the environment of a given protein.

Instead of surveying a single representative of each protein, we compared several different structures of each protein, looking for differences in side-chain conformation among the different structure solutions. In this way, we could look at each position, such as Arg14 in lysozyme, individually, determining its range of motion and the effect of the local environment on this motion. In this article, we report quantitative values describing the ranges of amino acid flexibility observed in uncomplexed protein structures. This information has important implications for the design and evaluation of protein prediction methods.



Analyses of paired-protein databases have been reported in several contexts. Flores et al.2 analyzed a set of 9 unique proteins for which multiple structures were solved and compared the structural differences with those found in a series of 90 pairs of homologous structures. For side-chain torsional angles, they found that as the level of homology decreased, an increasing number of side chains changed conformation from one rotamer to another. Betts and Sternberg3 analyzed a set of 12 paired proteins, using the results to evaluate the significance of side-chain motion in a study of protein–protein complexation. Najmanovich et al.4 analyzed side-chain motion within ligand binding sites, using a large database of 980 complexed and 353 apo protein structures.

METHODS

For the analysis of side-chain motion in proteins, we developed a data set of proteins for which multiple atomic structures had been deposited at the PDB (May 1999 release, http://www.rcsb.org/pdb). Paired crystallographic structures provide a unique resource for the analysis of flexibility. The proteins being compared are identical and do not contain bound ligands, so differences are a result of intrinsic flexibility under the influence of the different environments in each crystal determination. The structures are solved to atomic resolution, so they do not contain the ambiguity of NMR analyses of flexible regions. However, the results represent only a few snapshots of the range of flexibility that may be present in solution, and one must be wary of artifacts from the process of crystallographic structure solution, such as the use of defined rotamers during the refinement of underdetermined regions.

We applied a stringent set of selection criteria to confine the data set to high-quality structures with relatively rigid backbone conformations. Selection criteria included a crystallographic resolution of 2.2 Å or better and a chain length larger than 50 amino acids. Proteins with unknown sequences, α-carbon atoms only, conflicts between SEQRES and ATOM records, chain discontinuities, nonstandard residues, significant numbers of missing atoms (>5% α-carbon atoms or 15% side-chain atoms), or alternate location atoms (>10%) were omitted. In addition, all complexes were omitted. For the subsequent analysis, the three structures with the highest resolution were chosen from each group of proteins, and they are presented in Table I. The following analysis was performed on the three possible pairings within each group.
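The selection criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' code: the `ChainSummary` record and all of its field names are hypothetical, and it assumes the per-chain statistics (missing-atom fractions, alternate-location fraction, and so on) have already been extracted from the PDB entries by some other means.

```python
from dataclasses import dataclass


@dataclass
class ChainSummary:
    """Hypothetical per-chain summary; field names are illustrative only."""
    pdb_id: str
    resolution: float          # crystallographic resolution, in Å
    length: int                # number of residues in the chain
    frac_missing_ca: float     # fraction of alpha-carbon atoms missing
    frac_missing_side: float   # fraction of side-chain atoms missing
    frac_altloc: float         # fraction of atoms with alternate locations
    is_complex: bool           # True if the entry is a complex
    has_nonstandard: bool = False
    has_chain_break: bool = False


def passes_selection(c: ChainSummary) -> bool:
    """Apply the thresholds quoted in the text (2.2 Å, >50 residues, ...)."""
    return (
        c.resolution <= 2.2
        and c.length > 50
        and c.frac_missing_ca <= 0.05
        and c.frac_missing_side <= 0.15
        and c.frac_altloc <= 0.10
        and not c.is_complex
        and not c.has_nonstandard
        and not c.has_chain_break
    )


def pick_best(group, n=3):
    """Keep the n highest-resolution (lowest Å value) chains that pass."""
    kept = [c for c in group if passes_selection(c)]
    return sorted(kept, key=lambda c: c.resolution)[:n]
```

Sequence checks such as SEQRES/ATOM conflicts are left out of the sketch because they need the full entry rather than summary numbers.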

Fig. 1. An analysis of χ1 conformation versus an analysis of χ1 flexibility. (A) The upper graph is a typical result of a survey of side-chain conformations, tabulating the number of observations found at each value of the torsion angle. The familiar three-peaked results, shown here for serine, are from steric contacts with the backbone of adjacent residues, favoring the three staggered conformations. (B) The lower graph is from our paired data set, which analyzes side-chain flexibility and conformation. Each point represents a single amino acid residue within a given protein observed in two different structure determinations. The angle from one structure is on the horizontal axis; the angle from the other is on the vertical axis. Residues that have the same conformation in the two structures fall along the diagonal, and those that change are found off the diagonal. The spread of points away from the diagonal is a measure of the flexibility of the residue. The points cluster in the familiar three staggered conformations, similarly to the histogram previously presented.

TABLE I. Uncomplexed Paired-Protein Data Sets(a)

C1    R1    C2    R2    Len1  Len2  RMSD1  RMSD2  |  C1    R1    C2    R2    Len1  Len2  RMSD
193l  1.33  194l  1.40  129   129   0.07   0.07   |  1osa  1.68  1clm  1.80  144   141   0.39
193l  1.33  1aki  1.50  129   125   0.43   0.33   |  1scs  1.60  1enr  1.80  237   237   0.14
194l  1.40  1aki  1.50  129   125   0.43   0.33   |  1scs  1.60  2ctv  1.95  237   237   0.11
1bxa  1.30  2rac  1.30  105   102   0.34   0.20   |  1enr  1.80  2ctv  1.95  237   237   0.16
1bxa  1.30  1aaj  1.80  105   104   0.17   0.13   |  1crm  2.00  2cab  2.00  256   254   0.36
2rac  1.30  1aaj  1.80  105   102   0.34   0.19   |  1cub  1.75  1cuc  1.75  197   192   0.50
1ak2  1.92  2ak2  2.10  220   216   0.42   0.32   |  1ede  1.90  2had  1.90  310   310   0.12
1ald  2.00  2ald  2.10  363   341   2.59   0.19   |  1ede  1.90  2dhd  2.00  310   309   0.19
1yme  1.53  1arl  1.86  307   307   0.11   0.11   |  2had  1.90  2dhd  2.00  310   310   0.16
1yme  1.53  1cpx  2.00  305   300   0.52   0.23   |  1ert  1.70  1eru  2.10  105   105   0.24
1arl  1.86  1cpx  2.00  305   300   0.52   0.24   |  1ert  1.70  1auc  2.10  105   105   0.16
1iab  1.79  1ast  1.80  200   198   0.18   0.11   |  1eru  2.10  1auc  2.10  105   105   0.25
1iab  1.79  1iae  1.83  200   198   0.17   0.11   |  1esa  1.65  3est  1.65  240   240   0.21
1ast  1.80  1iae  1.83  200   200   0.09   0.09   |  1esa  1.65  1lvy  1.87  240   240   0.23
1axn  1.78  1aii  1.95  322   290   0.55   0.38   |  3est  1.65  1lvy  1.87  240   240   0.14
3rn3  1.45  1rat  1.50  124   123   0.30   0.28   |  1fua  1.92  2fua  2.00  206   205   0.18
3rn3  1.45  1rhb  1.50  124   123   0.23   0.20   |  1amm  1.20  1gcs  2.00  174   174   0.15
1rat  1.50  1rhb  1.50  124   124   0.25   0.25   |  1gof  1.70  1gog  1.90  639   639   0.12
1tld  1.50  2ptn  1.55  223   222   0.28   0.25   |  1ilb  2.00  4ilb  2.00  151   148   0.37
1tld  1.50  1tgn  1.65  222   191   1.92   0.31   |  1ilb  2.00  1iob  2.00  151   146   0.66
2ptn  1.55  1tgn  1.65  222   193   1.91   0.28   |  4ilb  2.00  1iob  2.00  151   146   0.58
2cba  1.54  2cbb  1.67  258   258   0.05   0.05   |  2ilk  1.60  1ilk  1.80  151   146   1.01
2cba  1.54  2cbd  1.67  258   258   0.05   0.05   |  1jcv  1.55  1jcw  1.70  153   153   0.19
2cbb  1.67  2cbd  1.67  258   258   0.04   0.04   |  1jcv  1.55  1yso  1.73  153   152   0.22
1brf  0.95  1bq8  1.10  53    53    0.15   0.15   |  1jcw  1.70  1yso  1.73  153   152   0.16
1brf  0.95  1caa  1.80  53    53    0.13   0.13   |  1l04  1.70  1l05  1.70  164   164   0.08
1bq8  1.10  1caa  1.80  53    53    0.17   0.17   |  1l39  1.85  1l40  1.85  164   164   0.04
1cak  1.90  1caz  1.90  258   258   0.11   0.11   |  1l42  1.80  1l43  1.80  164   164   0.02
1lhm  1.80  2bqa  1.80  130   130   0.18   0.18   |  1top  1.78  1ncx  1.80  162   162   0.14
2lzm  1.70  3lzm  1.70  164   163   0.16   0.13   |  1top  1.78  1ncz  1.80  162   162   0.25
2lzm  1.70  4lzm  1.70  162   162   0.12   0.12   |  1ncx  1.80  1ncz  1.80  162   162   0.25
3lzm  1.70  4lzm  1.70  162   162   0.08   0.08   |  1ubi  1.80  1ubq  1.80  76    76    0.09
1lzl  1.50  1rex  1.50  130   130   0.13   0.13   |  3wbc  2.13  4wbc  2.14  180   175   1.50
1mpb  2.00  1mpc  2.10  370   370   0.20   0.20   |  1xaa  2.10  1xab  2.10  345   345   0.15
1npk  1.80  1nsp  2.10  150   149   0.22   0.17   |  1xac  2.10  1xad  2.10  345   345   0.20
1rro  1.30  1omd  1.85  107   106   0.30   0.20   |  1xzb  1.75  1xza  1.80  197   197   0.07
1paz  1.55  3paz  1.73  120   120   0.13   0.13   |  1une  1.50  1mkt  1.72  123   110   0.73
1paz  1.55  1pza  1.80  120   120   0.12   0.12   |  1une  1.50  2bpp  1.80  123   108   0.89
3paz  1.73  1pza  1.80  120   120   0.16   0.16   |  1mkt  1.72  2bpp  1.80  123   115   0.41
4pnp  1.80  1pbn  2.00  284   253   0.90   0.37   |  2che  1.80  2chf  1.80  128   128   0.10
1ixg  1.05  1pbp  1.90  321   321   0.30   0.30   |  1gvp  1.60  1vqb  1.80  86    83    0.30
1ixh  0.98  2abh  1.70  321   320   0.26   0.25   |  2lhm  1.80  3lhm  1.80  130   130   0.21
1pgb  1.92  1pga  2.07  56    56    0.26   0.26   |  5icb  1.50  6icb  1.85  75    74    0.18
1pnc  1.60  1pnd  1.60  99    99    0.08   0.08   |  3rnt  1.80  8rnt  1.80  104   104   0.14
1pnc  1.60  2pcy  1.80  99    98    0.34   0.24   |  5pti  1.00  4pti  1.50  58    56    1.22
1pnd  1.60  2pcy  1.80  99    98    0.34   0.24   |  5pti  1.00  6pti  1.70  57    57    0.39
2pkc  1.50  2prk  1.50  279   278   0.16   0.14   |  4pti  1.50  6pti  1.70  57    54    0.47
1qui  1.90  1quj  1.90  321   321   0.06   0.06   |  4rxn  1.20  5rxn  1.20  54    53    0.24
1sbh  1.80  1yja  1.80  275   275   0.17   0.17   |  1aew  1.95  1dat  2.05  170   170   0.20
1sbh  1.80  1yjb  1.80  275   275   0.20   0.20   |  1djc  2.00  1djb  2.10  257   256   0.35
1yja  1.80  1yjb  1.80  275   275   0.18   0.18   |  1fib  2.10  3fib  2.10  249   249   0.09
2st1  1.80  1st2  2.00  275   273   0.18   0.15   |  1fib  2.10  1fid  2.10  249   235   0.51
1svn  1.40  1jea  2.00  269   269   0.27   0.27   |  3fib  2.10  1fid  2.10  249   235   0.50
1tcy  1.70  1wqr  1.80  130   128   0.30   0.23   |  3lip  2.00  2lip  2.10  320   307   0.76
2tgi  1.80  1tfg  1.95  112   108   0.29   0.15   |  1loz  1.80  1oua  1.80  130   128   0.29
2tgi  1.80  1tgi  2.10  112   109   0.32   0.18   |  1bme  1.85  1bvt  1.85  221   221   0.04
1tfg  1.95  1tgi  2.10  112   109   0.33   0.20   |  1a3z  1.90  1rcy  1.90  150   149   0.19
1amf  1.75  1wod  1.75  231   231   0.07   0.07   |  1u9b  2.00  1u9a  2.00  159   155   0.56
1mku  1.80  1mks  1.90  123   109   0.91   0.42   |  1ame  1.65  1gzi  1.80  65    64    0.34
1a58  1.95  1a33  2.15  174   170   0.33   0.19   |  4paz  1.76  5paz  1.76  123   120   0.32
1bkr  1.10  1aa2  2.00  108   108   0.22   0.22   |  6paz  1.91  7paz  2.00  123   121   0.29
1zia  1.54  1zib  2.00  124   123   0.26   0.18   |

(a) For each pair, the PDB accession codes (C1, C2) and resolutions (R1, R2, in Å) are listed. The matched chain length (Len1), the optimal length after residues with RMSD greater than 1 Å are removed (Len2), and the corresponding two RMSD values (in Å) are also reported.

Protein pairs were superimposed with an iterative process. The goal was to match the rigid core of the protein, omitting any small flexible loops or terminal extensions from the calculation. For each pair of proteins, the backbone atoms were superimposed with the fast algorithm of Sippl and Stegbuchner.5 Then, any backbone atoms that showed a difference greater than 1 Å were omitted. A new backbone superimposition was performed with these atoms omitted. The process was repeated until the solution converged. Finally, a more stringent cutoff of 0.5 Å on the root-mean-square deviation (RMSD), which reflects the amount of positional uncertainty we might expect from these high-resolution structures, was applied.6 Most of the proteins superimposed well, with more than 80% of the backbone structure assigned as rigid, but 14 protein pairs showed large flexible regions, such as domain motion in isocitrate dehydrogenase,7 and were omitted from the following analysis.
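The fit-and-trim loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the published procedure: the function names are hypothetical, and the default `translate_fit` is only a translation-only stand-in so that the example stays self-contained. A real analysis would pass a full rotational least-squares fit (the paper uses the algorithm of Sippl and Stegbuchner) as the `fit` argument.

```python
import math


def translate_fit(a, b, idx):
    """Stand-in fit (an assumption, not the paper's method): shift b so that
    its centroid over the atoms in idx coincides with that of a."""
    n = len(idx)
    shift = [sum(a[i][k] - b[i][k] for i in idx) / n for k in range(3)]
    return [tuple(p[k] + shift[k] for k in range(3)) for p in b]


def rigid_core(a, b, fit=translate_fit, cutoff=1.0, max_iter=50):
    """Iteratively fit b onto a, dropping atoms that deviate by more than
    `cutoff` (Å), until the retained set stops changing.

    a, b: equal-length lists of (x, y, z) backbone coordinates.
    Returns (indices of retained atoms, RMSD over those atoms).
    """
    idx = list(range(len(a)))
    for _ in range(max_iter):
        fitted = fit(a, b, idx)
        kept = [i for i in idx if math.dist(a[i], fitted[i]) <= cutoff]
        if not kept:
            raise ValueError("no atoms remain within the cutoff")
        if kept == idx:          # converged: nothing was removed
            break
        idx = kept
    fitted = fit(a, b, idx)
    rmsd = math.sqrt(sum(math.dist(a[i], fitted[i]) ** 2 for i in idx) / len(idx))
    return idx, rmsd
```

A pair would then be accepted for torsional analysis only if the returned RMSD is at most 0.5 Å, mirroring the final cutoff in the text.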

In any study of this kind, the choice of selection criteria will affect the results that are ultimately obtained. The effect of the selection criteria on the confidence levels reported later was tested. As the criteria were made more stringent, the number of structures in the database was reduced. Testing the resolution cutoff, we looked for the least stringent criterion that yielded a consistent set of results. Cutoffs of 1.8, 2.0, 2.2, 2.4, and 2.8 Å were tested. All cutoffs of 2.2 Å or below gave similar results, but the two more permissive cutoffs gave results that showed more side-chain flexibility than expected from the more stringent cutoffs. By limiting the analysis to high-resolution structures, we have the added advantage of reducing the possible introduction of map interpretation and refinement artifacts.

RESULTS AND DISCUSSION

Our data set contains 123 paired structures that satisfy the stringent criteria. For each, Table I contains two RMSD values, one calculated for all residues and one calculated after residues with backbone RMSDs greater than 1 Å were removed. For the most part, the proteins superimpose well, and the two values are comparable. Some, however, show larger differences: the pair 3wbc and 4wbc has a backbone RMSD as high as 1.5 Å, but with only five residues removed, the RMSD is reduced to 0.05 Å. These superpositions were used to identify the rigid regions of the structures for further investigation. The subsequent torsional analyses are performed only on those parts of the protein that superimpose well, so the results reflect only side-chain flexibility separate from any backbone motion.

Fig. 2. χ1 flexibility by residue type. As in Figure 1, each point represents a single amino acid residue observed in two different crystallographic structures. Residues that are exposed to the solvent are shown above the diagonal, and buried residues are shown below. Buried and exposed points have been artificially separated by 4°, forming an empty stripe along the diagonal to highlight the difference between the exposed and buried residues. Lines on either side of the diagonal mark the 90% confidence levels.

Side-Chain Flexibility

Side-chain flexibility was quantified by the evaluation of changes in torsional angles. For each group of paired proteins, each amino acid position in the chain was compared to the identical position in the matched chains. The results are presented in diagonal plots such as the one shown in Figure 1(B). In the figure, points represent the χ1 torsional angle for a given residue in two different structures of the protein, such as Arg14 from lysozyme structure 193l and Arg14 in the matched lysozyme structure 194l. Compare these plots with the histogram in Figure 1(A). We can obtain this histogram by translating each point down to the diagonal line and then counting how many points are at each position along the line. The histogram in Figure 1(A) tells us the probability of finding a given residue in a given position, whereas the diagonal plot in Figure 1(B) gives us this information but also reveals whether each individual residue may adopt two different conformations (points off the diagonal) or adopt the same conformation in the two structure analyses (points along the diagonal).
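The torsional comparison can be made concrete with a short sketch. The two helpers below are illustrative only (their names are not from the paper): `dihedral` computes a torsion angle from four atom positions, which for χ1 would be the N, Cα, Cβ, and Cγ (or equivalent) atoms, and `angle_diff` returns the smallest difference between two angles so that, for example, 350° and 10° count as 20° apart rather than 340°.

```python
import math


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees, in the range (-180, 180], for four points."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    b0 = [a - b for a, b in zip(p0, p1)]
    b1 = [a - b for a, b in zip(p2, p1)]
    b2 = [a - b for a, b in zip(p3, p2)]
    norm = math.sqrt(dot(b1, b1))
    b1 = [c / norm for c in b1]
    # Components of b0 and b2 perpendicular to the central bond b1.
    v = [x - dot(b0, b1) * y for x, y in zip(b0, b1)]
    w = [x - dot(b2, b1) * y for x, y in zip(b2, b1)]
    cross = [
        b1[1] * v[2] - b1[2] * v[1],
        b1[2] * v[0] - b1[0] * v[2],
        b1[0] * v[1] - b1[1] * v[0],
    ]
    return math.degrees(math.atan2(dot(cross, w), dot(v, w)))


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return 360.0 - d if d > 180.0 else d
```

Taking `dihedral(...) % 360` maps the result onto the 0 to 360° scale used for the peaks at 60, 180, and 300°.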

Diagonal plots are given for each residue type in Figure 2. In each graph, points below the diagonal correspond to residues that are completely buried within the protein, and points above are residues that are partially or wholly exposed on the protein surface. Solvent exposure was estimated with a solvent-accessible (expanded radius) surface calculated in MSMS8 with a probe radius of 1.4 Å. Only protein heavy atoms were included in the calculation. The solvent exposure of each residue was compared to that of the extended polypeptide Ala-X-Ala, where X is the particular residue type. If the surface area exposed to the solvent was 20% or less of the surface area in the extended polypeptide, the residue was classified as buried.
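The burial criterion reduces to a ratio test, sketched below. The function name and arguments are hypothetical; the two areas are assumed to come from a surface calculation such as the one described above, with the reference value computed for the same residue type in an extended Ala-X-Ala peptide.

```python
def is_buried(exposed_area, reference_area, cutoff=0.20):
    """Classify a residue as buried when its solvent-exposed area is at most
    `cutoff` (20% in the text) of the Ala-X-Ala reference area."""
    if reference_area <= 0:
        raise ValueError("reference area must be positive")
    return exposed_area / reference_area <= cutoff
```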

Most points are arrayed close to the diagonal in these graphs, showing that there is typically no change in the χ1 torsion angle of a given residue position from structure to structure. The points are clustered into the three regions corresponding to staggered conformations, just as expected from previous rotamer studies. Comparing the distribution of points along the diagonal with the rotamer library of Dunbrack and Karplus,9 we see similar trends. The three staggered conformations are not equally populated: most residues prefer gauche conformations, with exceptions such as valine, which strongly prefers the trans conformation, and threonine, which strongly disfavors trans conformations.

The off-diagonal points are the ones of interest here. These correspond to residues that change conformation from structure to structure. They also cluster in the regions of staggered conformations. This is readily apparent for the serine graph: the value of 60° on the horizontal axis shows three peaks at 60, 180, and 300° along the vertical axis. As one might expect, the large aromatics cluster closely along the diagonal, indicating that they are packed into well-defined niches in proteins, and they show little variation from structure to structure. The charged and polar amino acids, however, show much variability.

Two different levels of motion are seen for each residue type. First, there are small deviations, with the side chain staying within a given staggered conformation but adopting a slightly different position. This may be seen in the graphs as the spread of points around the three peaks in the diagonal. Table II quantifies this spread of values, tabulating the angular thresholds that contain 90% of the observed structures (e.g., 90% of all paired leucine residues were found within 23.0° of one another). In general, the values reflect our common-sense ideas about side-chain flexibility; lysine and arginine show large mobility, requiring a large angular tolerance to cover 90% of the observations, and the large aromatic amino acids are less mobile, with most moving by less than 10° between the different structures of a given protein. Buried residues show far less mobility than surface residues. These confidence levels are described in more detail later.
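A 90% angular threshold of this kind can be computed from a list of paired angle differences, as in the sketch below. The function names are hypothetical, and the nearest-rank percentile used here is an assumption: the text does not say which percentile convention was applied.

```python
import math


def confidence_level(diffs, fraction=0.90):
    """Smallest observed angle difference t such that at least `fraction`
    of the observations are <= t (nearest-rank percentile; an assumption)."""
    if not diffs:
        raise ValueError("no observations")
    ordered = sorted(diffs)
    rank = max(math.ceil(fraction * len(ordered)), 1)
    return ordered[rank - 1]


def fraction_within(diffs, threshold):
    """Fraction of observations whose angle difference is <= threshold."""
    if not diffs:
        raise ValueError("no observations")
    return sum(d <= threshold for d in diffs) / len(diffs)
```

`confidence_level` corresponds to the kind of entry tabulated in Table II, and `fraction_within` to the inverse view, the fraction of residues falling inside a fixed threshold.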

TABLE II. Confidence Levels for Side-Chain Flexibility

         Number of observations   Angle difference for 90% cutoff (°)
                                  χ1                          χ2
Residue  Buried    Exposed        Exposed   Buried   All      All(a)
Arg      160       662            37.0      13.9     33.0     32.0
Asn      289       948            15.0      8.7      13.6     19.2
Asp      318       1036           13.8      8.0      12.1     21.4
Cys      368       74             7.3       9.5      9.2      —
Gln      163       647            53.6      10.7     37.0     19.0
Glu      200       989            53.6      13.0     38.0     22.6
His      239       220            11.2      7.1      9.0      13.0
Ile      888       277            12.4      8.7      9.6      18.0
Leu      1214      480            28.6      19.6     23.0     24.8
Lys      62        1364           42.1      14.7     41.0     55.6
Met      312       130            47.5      17.7     19.8     12.5
Phe      671       208            8.0       6.4      6.9      9.0
Ser      595       1091           102.7     16.1     74.3     —
Thr      498       825            13.5      9.3      11.3     —
Trp      283       76             9.0       6.7      7.0      8.2
Tyr      489       359            8.0       6.2      7.0      10.4
Val      1148      427            21.0      10.0     12.0     —
All                               30.4      10.5     18.0     20.1

(a) Tabulated for residues with a change in χ1 of less than 40°.

A second level of motion is also observed when residues jump to a different staggered conformation. These changes are seen in the graphs as points far from the diagonal. One might expect that buried residues would show far fewer of these large changes than exposed residues, being locked within the packed core of the protein. For the large aromatics, this is certainly the case. Only a single buried phenylalanine, Phe238 in purine nucleoside phosphorylase, shows a large change from structure to structure (seen in PDB files 4pnp and 1pbn). However, the smaller aliphatic amino acids show surprisingly large motions; many points cluster below the diagonal in isoleucine, leucine shows a large cluster bridging trans and gauche conformations, and the valine plot is nearly symmetric, showing similar gross flexibility inside and outside the protein. Serine is markedly inflexible inside proteins, presumably because of its ability to form structural hydrogen bonds with neighboring portions of the chain.

Confidence Levels

The amount of observed side-chain flexibility for each residue type is quantified in Tables II and III. Table II presents 90% confidence levels for each residue type when buried, when exposed, and averaged over both environments. When looking at the all-environment values, we see the expected ordering of flexibility: Ser > Lys, Glu, Gln, Arg, Met > Leu, Asn, Asp, Val, Thr > Ile, His, Cys, Trp, Tyr, Phe.

As one might expect, buried residues show limited flexibility. Leucine, surprisingly, shows the largest amount of motion when buried, 19.6°. We might expect the small aliphatic amino acids, Ile, Leu, and Val, to have the greatest mobility when buried because of their lack of specific hydrogen-bonding or salt-bridge interactions with neighboring parts of the protein. That the largest of these three would have the greatest flexibility may be a result of steric hindrance in the β-branched structure of isoleucine and valine.

The exposed residues are not uniformly flexible. The large polar or charged residues, Arg, Lys, Glu, and Gln, are very flexible, but many of the smaller polar and charged residues are markedly inflexible, such as asparagine and aspartate, which show only about 15° of flexibility, and histidine, which shows only 11°. The difference between serine and threonine underscores the effect of β branching: serine is the most flexible residue at 102.7°, and threonine shows a much smaller flexibility similar to its value when buried. The large aromatics Phe, Tyr, and Trp show limited flexibility when exposed and when buried.

We performed a similar analysis for χ2 angles. Confidence levels are included in Table II. One might expect that the χ2 value would be dependent on the χ1 value, so independent confidence levels for χ2 would not be appropriate. To test this, we created a series of plots with the difference in χ1 values in each paired residue compared with the difference in χ2 (data not shown). These plots showed a random distribution of points, with no correlation between χ1 and χ2. Thus, the confidence levels for χ2 are expected to be a valid measure of the flexibility of each residue at this torsional bond.
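The independence check described above was done by inspecting plots. One way to put a number on it, offered here only as an assumed alternative and not as the authors' procedure, is a Pearson correlation coefficient between the paired χ1 and χ2 differences; a value near zero would be consistent with the reported lack of correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```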

Application of Confidence Levels

The confidence levels presented in Table II provide a quantitative method for evaluating the significance of predicted side-chain conformations, dependent on the type of residue and its local environment. A threshold of 40° for the difference in χ1 angles is widely used to define correct and incorrect predictions of side-chain position (e.g., see Ref. 10 and references therein). Our results indicate that a 40° standard for side-chain prediction is inappropriate in most circumstances. For most buried residues, this tolerance is far too generous, and for some surface residues, it is far too restrictive. For instance, as shown in Table III, a 40° cutoff will contain 99.8% of phenylalanine residues but only 89.6% of lysine residues. The 40° standard will tend to exaggerate the effect of misprediction of flexible surface residues when predicted conformations are being scored, focusing attention on surface residues instead of the interior residues, where small errors can have large effects.

TABLE III. Percentage of χ1 Angles Below a Given Threshold for All Residues in Paired Data Set

| Residue | 10.0° | 15.0° | 20.0° | 25.0° | 30.0° | 35.0° | 40.0° | 50.0° | 60.0° |
|---------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Arg | 71.1 | 79.6 | 83.9 | 87.4 | 88.9 | 90.7 | 91.7 | 92.9 | 94.0 |
| Asn | 84.5 | 91.9 | 94.9 | 96.2 | 96.3 | 96.8 | 97.0 | 97.2 | 97.7 |
| Asp | 87.2 | 93.0 | 95.1 | 96.4 | 96.8 | 97.2 | 97.7 | 97.3 | 97.8 |
| Cys | 92.1 | 98.1 | 99.5 | 99.5 | 99.5 | 99.8 | 99.8 | 99.8 | 99.8 |
| Gln | 71.8 | 79.7 | 83.7 | 86.6 | 88.5 | 89.6 | 90.6 | 91.2 | 92.4 |
| Glu | 68.5 | 78.7 | 83.6 | 86.4 | 87.9 | 89.2 | 90.2 | 91.1 | 92.2 |
| His | 90.9 | 96.7 | 98.2 | 98.5 | 98.5 | 98.7 | 98.7 | 98.7 | 98.7 |
| Ile | 90.4 | 96.7 | 97.9 | 98.3 | 98.3 | 98.3 | 98.4 | 98.4 | 98.5 |
| Leu | 80.4 | 86.4 | 88.8 | 90.9 | 92.4 | 94.4 | 96.1 | 98.2 | 98.6 |
| Lys | 64.7 | 75.3 | 81.4 | 84.7 | 86.8 | 88.4 | 89.6 | 92.1 | 93.2 |
| Met | 81.2 | 86.2 | 89.9 | 91.3 | 92.3 | 93.0 | 93.4 | 94.4 | 96.0 |
| Phe | 97.0 | 99.4 | 99.7 | 99.7 | 99.7 | 99.7 | 99.7 | 99.8 | 99.8 |
| Ser | 66.1 | 76.8 | 81.1 | 83.9 | 85.5 | 86.8 | 87.5 | 88.2 | 89.0 |
| Thr | 86.9 | 93.1 | 94.6 | 95.2 | 95.5 | 95.8 | 95.9 | 96.3 | 96.3 |
| Trp | 96.0 | 99.1 | 99.4 | 99.4 | 99.4 | 99.4 | 99.4 | 99.4 | 99.4 |
| Tyr | 96.2 | 99.3 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 |
| Val | 88.0 | 92.7 | 94.4 | 94.8 | 95.1 | 95.2 | 95.2 | 95.4 | 95.4 |
| ALL | 81.4 | 88.3 | 91.0 | 92.5 | 93.4 | 94.1 | 94.7 | 95.4 | 95.9 |
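Percentages of the kind listed in Table III can be computed from a paired data set along the following lines. This is a minimal sketch: the tuple layout of the input and the function names are hypothetical, and a strict less-than comparison is assumed for "below a given threshold".

```python
from collections import defaultdict

# Thresholds (degrees) matching the columns of Table III.
THRESHOLDS = (10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 50.0, 60.0)


def abs_angle_diff(a, b):
    """Magnitude of the wrapped difference between two torsion angles, in [0, 180]."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def percent_below(observations, thresholds=THRESHOLDS):
    """Percentage of paired chi1 differences below each threshold, per residue type.

    observations: iterable of (residue_name, chi1_a, chi1_b), where chi1_a and
    chi1_b are the angles of the same residue in the two paired structures.
    Returns {residue_name: [percent, ...]} plus an "ALL" row over every residue.
    """
    diffs_by_residue = defaultdict(list)
    for name, chi1_a, chi1_b in observations:
        diff = abs_angle_diff(chi1_a, chi1_b)
        diffs_by_residue[name].append(diff)
        diffs_by_residue["ALL"].append(diff)
    return {
        name: [100.0 * sum(d < t for d in diffs) / len(diffs) for t in thresholds]
        for name, diffs in diffs_by_residue.items()
    }
```

Reading a column of the result at 40° for each residue type gives the residue-specific fraction that a fixed 40° cutoff would accept.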

Take, for example, the report of SCWRL, a homology-modeling program (Ref. 10). It was tested by the rebuilding of side-chains onto the experimentally determined backbone structures of a set of 299 crystallographic structures. Figure 3 evaluates the performance of the method with our new confidence levels. The 40° cutoff used in the report reveals that serine, arginine, and lysine are poorly predicted and the large aromatics are well predicted. The overall score for each of these predictions is dominated by the poor performance on the flexible residues.

In comparison with the confidence levels from our work, however, we see that SCWRL was working better than one might be led to believe from the simple 40° cutoff. The confidence levels indicate that we cannot expect to predict serine and arginine with great confidence because each individual arginine and serine will often adopt a wide range of conformations in a given protein. A plot of SCWRL prediction success versus the confidence levels shows an approximately linear relationship, showing that SCWRL successfully predicts these residue-specific differences in side-chain flexibility. Notice that the point for cysteine falls far from the line, indicating that SCWRL performs more poorly than one might expect. Because SCWRL does not take disulfide linkages into account, this result is not unexpected. The authors also note that SCWRL performs better on buried residues than on exposed residues. This result is also expected and would be accurately reflected with the environment-specific confidence levels given here instead of the 40° criterion.

For the evaluation of the success of the SCWRL method, it would be better to take this information into account, requiring the method to be more accurate when accuracy is indicated, as in the aromatics, and less accurate when the residues show more observed variability. The advantages of this approach were demonstrated in the SCWRL analysis, with similar results obtained from a paired-protein database containing three different proteins (see Fig. 5 in Ref. 10). The authors note that these confidence levels on flexibility “put an upper limit on the accuracy that any side-chain prediction program could hope to achieve” (Ref. 10). The confidence levels and angular tolerances in Table II provide metrics to score prediction methods within the proper context, requiring better prediction and discrimination where the principles of protein structure call for it and allowing more flexibility where it is typically observed.
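A scoring scheme along these lines might look like the sketch below. The only tolerances filled in are the two 90% values quoted in this article (33° for a surface arginine, 6.9° for a buried phenylalanine); the dictionary layout, the function names, and the fallback to the conventional 40° cutoff for entries not listed are assumptions, and a complete table would be populated from Table II.

```python
# 90% chi1 tolerances (degrees), keyed by (residue, environment).
# Only the two values quoted in the text are included; the rest of
# Table II would be added here in a real application.
TOLERANCE_90 = {
    ("ARG", "exposed"): 33.0,
    ("PHE", "buried"): 6.9,
}

# Assumption: fall back to the conventional fixed cutoff when no entry exists.
DEFAULT_TOLERANCE = 40.0


def abs_angle_diff(a, b):
    """Magnitude of the wrapped difference between two torsion angles, in [0, 180]."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def is_within_tolerance(residue, environment, chi1_pred, chi1_obs,
                        tolerances=TOLERANCE_90):
    """True if the predicted chi1 lies within the residue-specific tolerance."""
    key = (residue.upper(), environment)
    tolerance = tolerances.get(key, DEFAULT_TOLERANCE)
    return abs_angle_diff(chi1_pred, chi1_obs) <= tolerance


def fraction_correct(predictions, tolerances=TOLERANCE_90):
    """Fraction of predictions within tolerance.

    predictions: list of (residue, environment, chi1_pred, chi1_obs).
    """
    hits = sum(is_within_tolerance(r, e, p, o, tolerances)
               for r, e, p, o in predictions)
    return hits / len(predictions)
```

With these values, a buried phenylalanine predicted 15° away from the observed χ1 counts as an error, although a fixed 40° cutoff would have accepted it.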

Rotamer Analyses Versus Paired-Protein Analyses

Rotamer analyses, derived from surveys of databases that include a single structure for each protein, will not give an accurate estimate of the flexibility of residues within proteins. Rotamer analyses yield an estimate of the range of accessible conformations that are available to a given side-chain when observed in all environments. The effects of the protein environment tend to average out in these studies, and the distribution of rotamers is dominated by the steric contacts between the side-chain and the protein main chain. The clustering into staggered conformations and the preference for particular staggered conformations may be explained by these steric contacts. The spread of points in the rotamer analysis is not, however, useful for the prediction of flexibility.

Fig. 3. Evaluation of SCWRL performance. We measured SCWRL performance by determining the percentage of residues for which χ1 values were predicted with an accuracy of better than 40°. These percentages are presented on the vertical axis and compared with the percentages that are expected from the survey of paired proteins presented in this article (the values are taken from Table III). There is an approximately linear relationship between the values, indicating that SCWRL predicts accurately when accuracy is warranted, as with the relatively inflexible aromatic residues, and it is less accurate for the residues that show more flexibility (at the bottom left corner of the graph). However, the line does not fall along the diagonal (dashed line), indicating that SCWRL is still not able to predict side-chain conformation within the tolerances defined by the observed flexibility of each residue.

The normal range of motion available to a given amino acid residue, when placed within a protein, will be determined by a combination of short-range steric forces along with the interactions with the neighboring portions of the chain and the surrounding environment. Our analysis of motions in individual residues from the paired-protein database quantifies this flexibility: the flexibility of a residue within its normal protein environment. These confidence levels are more restrictive than an estimate derived from the spread of points in a rotamer analysis. This indicates that the motion of a given residue within a protein is more constrained than expected from a comparison of conformations of different residues in the chain. This is easily seen in Figure 2: the data points along the diagonal cluster into elliptical regions instead of circular regions, with a greater spread along the diagonal (which is representative of the range of conformations when compared across many proteins) than the spread perpendicular to the diagonal (which is representative of the range of conformations of a given residue position in a given protein).
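The elliptical shape can be quantified by rotating each paired observation into coordinates along and perpendicular to the diagonal. The sketch below assumes that the points belong to a single rotamer cluster whose nominal center is supplied by the caller; the function names are hypothetical, and this is not the analysis used to produce Figure 2.

```python
import math
import statistics


def deviation_from(angle, center):
    """Signed deviation of a torsion angle from a cluster center, wrapped to [-180, 180)."""
    return (angle - center + 180.0) % 360.0 - 180.0


def diagonal_spreads(pairs, center):
    """Spread of paired angles along and perpendicular to the diagonal.

    pairs: list of (angle_a, angle_b) for the same residue in two structures,
    all belonging to one rotamer cluster near `center` (degrees).
    Returns (along, perpendicular):
      along          population standard deviation of (da + db) / sqrt(2)
      perpendicular  root-mean-square of (da - db) / sqrt(2), that is, the
                     RMS distance of the points from the diagonal
    """
    root2 = math.sqrt(2.0)
    along = []
    perpendicular = []
    for angle_a, angle_b in pairs:
        da = deviation_from(angle_a, center)
        db = deviation_from(angle_b, center)
        along.append((da + db) / root2)
        perpendicular.append((da - db) / root2)
    spread_along = statistics.pstdev(along)
    spread_perp = math.sqrt(sum(p * p for p in perpendicular) / len(perpendicular))
    return spread_along, spread_perp
```

A ratio of the two spreads well above 1 corresponds to the elongated clusters described above, whereas a ratio near 1 would indicate circular clusters.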

This analysis of side-chain flexibility complements rotamer approaches. Rotamer approaches will provide sample conformations that are consistent with the observed conformation of side-chains in proteins. The confidence levels presented here will then indicate the amount of flexibility that might be expected for that side-chain conformation. For example, if it is a surface arginine, we must expect a wide range of motion: 33° to capture 90% of the expected χ1 conformations. However, if it is a buried phenylalanine, we need to model much less motion: only 6.9° for 90% of the cases. For instance, the flexible rotamer approach of Mendes et al. (Ref. 11) currently allows side-chains to explore a defined region around each rotamer conformation. The amount of flexibility, constant in that work, may be linked to the observed range of flexibility quantified in our study.
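One way such a link might be made is sketched below: the window explored around each rotamer center is set by a residue- and environment-specific tolerance instead of a constant. The function names, the even spacing of samples, and the sample count are assumptions for illustration and are not taken from the method of Mendes et al.

```python
def wrap_angle(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def sample_around_rotamer(center, tolerance, n_samples=5):
    """Evenly spaced chi1 samples spanning center - tolerance to center + tolerance.

    center:    rotamer chi1 value in degrees
    tolerance: half-width of the window in degrees, for example a 90%
               confidence level for the residue type and environment
    n_samples: number of samples (hypothetical default)
    """
    if n_samples < 2:
        return [wrap_angle(center)]
    step = 2.0 * tolerance / (n_samples - 1)
    return [wrap_angle(center - tolerance + i * step) for i in range(n_samples)]
```

For example, `sample_around_rotamer(-60.0, 6.9, 3)` spans the narrow window quoted for a buried phenylalanine, and `sample_around_rotamer(180.0, 33.0, 3)` spans the wider window quoted for a surface arginine, wrapping correctly across the ±180° boundary.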

REFERENCES

1. Karplus M, McCammon JA. Dynamics of proteins: elements and function. Ann Rev Biochem 1983;53:263–300.
2. Flores TP, Orengo CA, Moss DS, Thornton JM. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci 1993;2:1811–1826.
3. Betts MJ, Sternberg MJE. An analysis of conformational changes on protein–protein association: implications for predictive docking. Protein Eng 1999;12:271–283.
4. Najmanovich R, Kuttner J, Sobolev V, Edelman M. Side-chain flexibility in proteins upon ligand binding. Proteins 2000;39:261–268.
5. Sippl MJ, Stegbucher H. Superposition of three dimensional objects: a fast and numerically stable algorithm for the calculation of the matrix of optimal rotation. Comput Chem 1991;15:73–78.
6. Daopin S, Davies DR, Schlunegger MP, Grutter MG. Comparison of two crystal structures of TGF-β2: the accuracy of refined protein structures. Acta Crystallogr D Biol Crystallogr 1994;50:85–92.
7. Finer-Moore J, Tsutakawa SE, Cherbavaz DB, LaPorte DC, Koshland DE, Stroud RM. Access to phosphorylation in isocitrate dehydrogenase may occur by domain shifting. Biochem 1997;36:13890–13896.
8. Sanner MF, Olson AJ, Spehner J-C. Reduced surface: an efficient way to compute molecular surfaces. Biopoly 1996;38:305–320.
9. Dunbrack RL, Karplus M. Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J Mol Biol 1993;230:543–574.
10. Bower MJ, Cohen FE, Dunbrack RL. Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: a new homology modeling tool. J Mol Biol 1997;267:1268–1282.
11. Mendes J, Baptista AM, Carrondo MA, Soares CM. Improved modeling of side-chains in proteins with rotamer-based methods: a flexible rotamer model. Proteins 1999;37:530–543.
