A Multibody, Whole-Residue Potential for Protein Structures, With Testing by Monte Carlo Simulated Annealing

Stefan Mayewski

Max-Planck-Institut für Biochemie, Martinsried, Germany

ABSTRACT A new multibody, whole-residue potential for protein tertiary structure is described. The potential is based on the local environment surrounding each main-chain α carbon (CA), defined as the set of all residues whose CA coordinates lie within a spherical volume of set radius in 3-dimensional (3D) space surrounding that position. It is shown that the relative positions of the CAs in these local environments belong to a set of preferred templates. The templates are derived by cluster analysis of the presently available database of over 3000 protein chains (750,000 residues) having not more than 30% sequence similarity. For each template, a set of residue propensities is also derived for each topological position in the template. Using lookup tables of these derived templates, it is then possible to calculate an energy for any conformation of a given protein sequence. The application of the potential to ab initio protein tertiary structure prediction is evaluated by performing Monte Carlo simulated annealing on test protein sequences. Proteins 2004;00:000–000. © 2004 Wiley-Liss, Inc.

Key words: packing motifs; structural motifs; multibody potentials; statistical potentials; simulated annealing; structure prediction

INTRODUCTION

The advent of whole-genome sequencing has greatly increased the number of protein sequences whose structure is unknown. X-ray and NMR methods, which require weeks or months to determine a structure, cannot close this gap; therefore, the development of computer methods to predict structure becomes more necessary.

This effort, which has been ongoing for over 30 years,1,2 has diversified into several channels that can be categorized according to the resolution or size of the basic unit of knowledge employed. Thus, models may work at the level of atoms, whole residues, groups of residues, or whole domains or folds. Each of these approaches has its own field of application.

In the first category are the pairwise interatom potentials AMBER,3 CHARMM,4 ECEPP,5 GROMOS,6 and so forth. Simulations using these potentials give detailed information at atomic resolution about folding mechanisms and folding pathways. The potentials can be derived using only a small database of a few organic molecules.2 However, they are extremely demanding of computer processing power. A folding simulation of even a small peptide may require weeks or months of computer time.7–11 Scaling up to proteins of several hundred residues would extend this time by many orders of magnitude. Hence, this approach is at present not useful for general structure prediction.

At the other end of the spectrum, whole domains or folds are the basic unit. The best fold for a given sequence is found by threading (fold recognition), using a database of known folds.12,13 This approach only gives the general fold; no information on mechanisms or pathway is obtained. The computational effort required is relatively small. However, a very large database is required, larger than that presently available (12,000 proteins). Thus, between 40% and 66% of new protein sequences derived from genomes cannot be recognized in the present database, and this situation does not appear to be improving.14,15

The question that arises is whether a middle course between the above extremes, avoiding the dilemma of unavailable computer power on the one hand, and unavailable database on the other, could lead to a more useful method for general structure prediction.

Whole-residue methods have been tried many times.16–19,20–24 Mostly, pairwise potentials have been used. However, in the last few years, it has become clear that pairwise potentials alone (210 parameters) cannot fold a protein.21,25–30

Therefore, attention has turned to larger groups of residues. Sequential segments of various lengths31–40 have proved useful for prediction of secondary structure; however, as they do not include long-range interactions, they are not a good indicator of tertiary structure.

Groups of residues associated in 3-dimensional (3D) space, not necessarily close in sequence, include long-range tertiary interactions. They can be used to derive multibody potentials. Four-body potentials between adjacent main-chain groups i, i + 1, j, and j + 1 were used empirically to improve cooperativity of main-chain hydrogen bonds,41 and later put on an analytical basis.21

*Correspondence to: Stefan Mayewski, Max Planck Institute für Biochemie, Abteilung Strukturforschung, Am Klopferspitz 18A, 82152 Martinsried, Germany. E-mail: [email protected]

Received 10 May 2004; Accepted 14 October 2004

Published online 00 Month 2004 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.20397

PROTEINS: Structure, Function, and Bioinformatics 00:000–000 (2004)

© 2004 WILEY-LISS, INC.

Four-body potentials for side-chains i, i + 4, j, and j + 4 were used to improve packing of helices and β-sheets.20 Five-body terms (hydrogen donor, acceptor, plus 3 surrounding hydrophobics) were used to rescale and stabilize main-chain hydrogen bonds surrounded by hydrophobic residues.42

In these examples, the multibody potentials were applied to restricted categories of interactions and had to be used in conjunction with other energy components to make up the total potential. Recently, the possibility that multibody potentials could play a more central role in the total energy function, using all the residues in the protein, has been investigated. However, it was demonstrated that this is not possible for 3-body contact potentials (286 parameters), either alone or together with an auxiliary solvation potential.30 A 4-body potential, based on rigorous partitioning of all the CAs in the chain into nonoverlapping groups of 4 by means of a tessellation procedure (between 630 and 8855 parameters, depending on how the model is defined), and not using any other energy contributions, also did not lead to better tertiary structure prediction than obtainable by pairwise potentials alone.23,28,43,44

Many structural motifs involve more than 4 residues not close in sequence.13,45–49 Using the face-centered cubic (fcc) model of residue packing, a given residue may coordinate with up to 12 neighboring residues.17,18,50

In this connection, it may be noted that in going from an all-atom to a whole-residue model, there is an order of magnitude reduction in the number of interacting centers. To compensate for this, a corresponding increase in the number of terms in the multibody potential may seem reasonable.

All of this suggests that multibody potentials involving larger numbers of residues than the above 3–5 may be required in order to more fully capture cooperative effects.

This work describes a potential based on the residues in the local environment surrounding each residue. Up to 17 residues may be contained within this region. The potential is derived using the maximum database available at the present time, over 3000 protein chains, having not more than 30% sequence similarity.

Finally, the potential is tested by ab initio structure prediction for a set of test proteins, by Monte Carlo simulated annealing, starting from a random conformation, in which the chain can access the entire conformational space.

METHODS

Definition of Local Environment

The local environment (LE) at any residue position i within the chain was defined as the set of all residues, including i, whose CA coordinates lie within a spherical volume of set radius R_LE surrounding the CA of residue i. The residues within the LE were numbered according to the sequential order in which they occur in the chain. Thus, if there are n_CA residues within the LE, then they are numbered 1…n_CA [see Fig. 1(a)]. They may represent a segment of consecutive residues, or they may represent residues from separate parts of the chain. Thus, the concept of LE is in principle similar to that used in other packing motif studies,13,48 with differences such as choice of R_LE, whether CAs or side-chain centers are used, and so forth.

The value selected for R_LE should ideally be large enough to include CAs from neighboring secondary structural units, so that tertiary interactions may be detected. The separation between contacting helices ranges from 7 Å to 12 Å51; for a helix in contact with a β-sheet, 7–13 Å52; and for close-packed β-sheets, 8–12 Å.53–55 This indicates that CAs in associated secondary structural units can approach to within 5 Å for helix–helix and helix–β-sheet, and a little more for β-sheet–β-sheet.

A large value for R_LE increases the number of different kinds of templates found; however, since the available data are spread over a larger number of cases, the occurrence frequency of each template is reduced, eventually falling below a level for which useful statistics may be calculated. This effect, which is determined by the size of the database, places an upper limit on the useful value of R_LE.

The LEs were classified according to the numbers of CAs preceding and following the central CA. Thus, LEs with 4 CAs may have 1,2 or 2,1 CAs preceding and following the central CA, respectively, designated 1_2 or 2_1. LEs with 5 CAs were designated 1_3, 2_2, or 3_1, while LEs with 17 CAs were designated 1_15, 2_14, 3_13,…,15_1.

The LEs were further classified according to the sequential intervals between successive residues in the LE. These are the sequential gaps between the residue positions in the chain. If there are n_CA residues in the LE, then there are n_CA − 1 sequence intervals. The significance of sequence connectivity in interactions between residues has been noted many times.19,22–24 In the present work, this factor was taken into account by representing the sequence intervals as a string of n_CA − 1 digits, where chain gaps of 1, 2, 3, ≥4 were represented as 1, 2, 3, 4.

Thus, the LE at any position CA[i] contained the following information:

1. Number of CAs before and after the central CA[i]
2. Sequential intervals
3. Coordinates of the CAs in the LE
4. Residue types of the CAs.
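
The LE definition and its classification can be illustrated with a minimal Python sketch. The function name `local_environment`, its signature, and the use of 0-based residue indices are assumptions made for illustration; treating a CA lying exactly at R_LE as inside the sphere is also an assumption, since the text only says "within".

```python
import math

def local_environment(ca_coords, i, r_le=8.0):
    """Hypothetical helper: describe the LE of residue i (0-based index).

    ca_coords: list of (x, y, z) CA coordinates in chain order, in Angstrom.
    Returns (n_before, n_after, interval_string, member_indices).
    """
    centre = ca_coords[i]
    # All residues, including i, whose CA lies within R_LE of CA[i]
    # (boundary counted as inside: an assumption).
    members = [k for k, xyz in enumerate(ca_coords)
               if math.dist(xyz, centre) <= r_le]
    n_before = sum(1 for k in members if k < i)
    n_after = sum(1 for k in members if k > i)
    # Sequence gaps between successive LE members; gaps of 4 or more map to 4.
    intervals = "".join(str(min(b - a, 4))
                        for a, b in zip(members, members[1:]))
    return n_before, n_after, intervals, members

# Toy straight chain with 3.8 Angstrom CA-CA spacing (not a real protein).
chain = [(3.8 * k, 0.0, 0.0) for k in range(10)]
n_before, n_after, intervals, members = local_environment(chain, 5)
print(f"{n_before}_{n_after} {intervals}")
```

For this toy chain the LE of residue 5 holds two CAs on each side, so the type label follows the `2_2` pattern with all sequence intervals equal to 1.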

Derivation of Local Environment Templates (LETs)

The database CulledPDB56 was used, containing 3071 protein chains determined by X-ray at ≤3.0 Å resolution, with less than 30% sequence similarity.

A set of small proteins was selected for later folding testing. Those that were in the database, together with other database chains that had sequence similarity to the test proteins, were removed from the database. The order of the remaining chains in the database was randomized.

Scanning each chain in the database, the LE at each CA in the chain was determined. LEs containing sequentially consecutive CAs with separation < 3.2 Å, taken as cis conformation, or > 5.0 Å, taken as chain breaks, were passed over. For each LE so found, the CA coordinates of the residues in the LE, the residue types, and the sequence intervals between the residues were stored.

The total LEs so found were divided into sets of the same type (same numbers of CAs preceding and following the central CA, same sequence intervals).

For each set, a distance matrix was constructed, holding the root-mean-square deviations (RMSDs) between the CA coordinates of all pairs of LEs in the set; see Figure 1(b). The RMSD between 2 LEs was calculated after superimposing their CAs for best fit using the method of McLachlan.57 Due to the finite size of computer memory, the number of LEs in the distance matrix was restricted to 20,000 (200 million distances). In cases where a set contained more than 20,000 LEs, only the first 20,000 were used. Since the order in the database was randomized, this amounted to a random selection.

Having set up the distance matrix, the next step was to divide the set of LEs into clusters. Various methods exist for doing this.31,37,40 The method used here was selected because it assumes spherical clusters; thus, it is suited to the energy calculation (see below), which also works on the basis that the clusters are spherical.

Fig. 1. (a) Pancreatic polypeptide (36 residues), with terminals marked N and C. The LE at an arbitrary residue CA[i] (filled circle) is shown. The large circle represents the local spherical volume around CA[i], of fixed radius R_LE. The LE consists of the set of all residues whose CAs lie within this sphere (open circles). These are numbered in the order in which they occur in the chain. In the LE, there are 8 CAs preceding CA[i], and 4 following. The sequence intervals between these CAs are designated 111411111111. (b) A set of LEs of a given type (same numbers of CAs before and after central CA, same sequential intervals) taken from the database, represented in two dimensions. Each point represents one LE. Distances between points represent RMSDs between LEs, calculated after superimposing their CAs for best fit. Three clusters are shown. LPD at any point X is calculated as the number of LEs within a region bounded by rd_max surrounding that point, shown as dotted circles. (c) Center of cluster is taken as that LE which has higher LPD than all other LEs within rd_max. Using a fixed rd_max for all clusters leaves some LEs not included in a cluster. (d) Using an rd_max that is too large results in some clusters not being optimally resolved. (e) Spread of a cluster can be estimated by constructing concentric shells (light circles) around the cluster center, and counting the numbers of LEs in these shells. Cluster spread (indicated by first minimum) is shown as a bold circle. Using the appropriate value for rd_max for each cluster leads to improved coverage.

The cluster analysis was carried out as follows. For every LE in the set, the local population density (LPD) was calculated [i.e., the number of neighboring LEs within a certain fixed cutoff RMSD (rd_max)]. Those LEs having a higher local population density than their neighbors, provided this was equal to or greater than a certain low-frequency cutoff (set to 10), were taken as local maxima, or cluster centers; see Figure 1(c). Different values for rd_max in the range 1.0–1.6 Å were tested.

Each cluster center LE so identified was taken as the basis of an LET, representative of the cluster. The coordinates of the CAs in the LET were taken directly from the central LE. The template occurrence frequency was taken as the number of LEs in the cluster (i.e., the number of LEs within rd_max of the central LE). For simplicity, a fixed value of rd_max was assumed for all clusters.
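
The density-based selection of cluster centers described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program used in this work: the function name `find_cluster_centers` is hypothetical, the RMSD matrix is assumed to be precomputed, an LE is not counted as its own neighbor, and ties in local population density are broken in favor of the lower index (the text does not say how ties were handled).

```python
def find_cluster_centers(dist, rd_max=1.6, min_freq=10):
    """Hypothetical helper: pick cluster centers from an RMSD matrix.

    dist: symmetric matrix (list of lists) of RMSDs between LEs of one type.
    Returns a list of (center_index, occurrence_frequency) pairs.
    """
    n = len(dist)
    # Neighbors of each LE within the fixed cutoff rd_max (self excluded).
    neighbours = [[k for k in range(n) if k != j and dist[j][k] <= rd_max]
                  for j in range(n)]
    lpd = [len(nb) for nb in neighbours]  # local population density
    centers = []
    for j in range(n):
        if lpd[j] < min_freq:
            continue  # below the low-frequency cutoff
        # Local maximum: denser than every neighbor within rd_max.
        # Tie-break by lower index (an assumption, see lead-in).
        if all((lpd[j], -j) > (lpd[k], -k) for k in neighbours[j]):
            centers.append((j, lpd[j]))
    return centers

# Toy example: 1D points stand in for LEs, |x - y| stands in for RMSD.
points = [0.0, 0.1, -0.1, 0.2, -0.2, 10.0, 10.1, 9.9]
toy_dist = [[abs(a - b) for b in points] for a in points]
print(find_cluster_centers(toy_dist, rd_max=0.25, min_freq=2))
```

Because the cutoff is the same for every cluster, points lying farther than rd_max from any center are left unassigned, which mirrors the incomplete coverage noted in Figure 1(c).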

Thus, each LET contained the following information:

1. Number of CAs before and after central CA
2. Sequence intervals
3. Coordinates of CAs in template (taken from central LE in cluster)
4. Occurrence frequency
5. For those templates with a high enough occurrence frequency (at least 50), the number of residues of each type was counted at each CA position in the LET, taken over all LEs in the cluster, enabling calculation of a set of residue position propensities:

$$\mathrm{res\_pos\_prop}[r][i][t] = \frac{\mathrm{res\_pos\_freq}[r][i][t] \,/\, \mathrm{temp\_freq}[t]}{\mathrm{res\_prop}[r]}, \qquad (1)$$

where

res_pos_prop[r][i][t] = position propensity of residue type r at position i in template t
res_pos_freq[r][i][t] = frequency of residue type r at position i in template t
temp_freq[t] = occurrence frequency of template t
res_prop[r] = proportion of residue type r in whole database.

Residue position propensities calculated in this way have been used by many authors for various kinds of secondary,36,58–61 supersecondary,49 and tertiary motifs.62–65

6. For each template, the distribution of frequencies of neighboring LEs in concentric shells of thickness 0.3 Å surrounding the cluster center LE was kept. From this, the cluster spread could be estimated [see Fig. 1(e)].
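
As a worked illustration of Eq. (1), the following Python sketch computes residue position propensities for one template from the sequences of the LEs in its cluster. The function name, the data layout (one string of one-letter residue codes per LE, a dictionary of database proportions), and the toy three-letter alphabet are assumptions chosen for brevity.

```python
from collections import Counter

def residue_position_propensities(cluster_seqs, res_prop):
    """Hypothetical helper implementing Eq. (1) for a single template t.

    cluster_seqs: residue types of each LE in the cluster, e.g. ["AG", "AL"];
                  position i of every string is template position i.
    res_prop:     proportion of each residue type in the whole database.
    Returns table[i][r] = res_pos_prop[r][i][t].
    """
    temp_freq = len(cluster_seqs)      # occurrence frequency of the template
    n_positions = len(cluster_seqs[0])
    table = []
    for i in range(n_positions):
        counts = Counter(seq[i] for seq in cluster_seqs)  # res_pos_freq
        table.append({r: (counts[r] / temp_freq) / res_prop[r]
                      for r in res_prop})
    return table

# Toy cluster of 4 LEs with 2 positions and a 3-letter alphabet.
toy = residue_position_propensities(
    ["AG", "AG", "AL", "GG"], {"A": 0.5, "G": 0.25, "L": 0.25})
print(toy[0]["A"], toy[1]["G"])
```

A propensity above 1 means the residue type occurs at that template position more often than its overall share of the database would suggest; a type never observed at a position gets propensity 0, the sparse-data case treated in the energy calculation.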

Simulated Annealing Runs

To test the usefulness of the LETs, simulated annealing runs were carried out on a set of test proteins. This is the inverse of the preceding operation. There, known structures were used to derive templates. In the folding simulations, the templates were tested to see if they could return structures.

The set of test proteins, not sequentially similar to any in the database, was put through Monte Carlo simulated annealing. The test proteins were:

Protein                 Residues   Protein Data Bank code
Betanova66              20
GS peptide67            20
3β68                    24
Antifreeze protein69    37         1wfbA
α_turn_α70              38         1abz_

**Coordinates supplied by authors.

The proteins were modeled as a reduced representation, backbone atoms only, without side-chains. The conformations were described by the dihedral angles φ, ψ.

Simulations were started in a random conformation at high temperature, 1500 K, and cooled at a constant rate by reducing the temperature by a factor in the range 0.9997 to 0.9999 every 10 steps, to 150 K, where the temperature was held constant. The simulation was continued for a further 20% of run steps after the last lowest energy conformation.

During cooling, a set of moves was applied, enabling the chain to randomly explore conformational space:

1. Vary φ[i], ψ[i] at a random point i in the chain
2. Vary ψ[i], φ[i + 1] oppositely at a random point i

At each step, a trial move was selected randomly from the above, and the energy of the conformation was calculated and compared to the energy of the previous accepted move. The new conformation was accepted or rejected by applying the Metropolis criterion.71

The lowest energy conformation found during the run was kept as the final predicted conformation.
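
The annealing protocol can be outlined with a short Python sketch. It is a simplified illustration, not the program used here: the names `metropolis_accept` and `anneal` are hypothetical, the gas constant is assumed to be in kcal/(mol K) because the energy units are not stated, the run length is a fixed `n_steps` rather than the "further 20% of run steps" stopping rule, and a toy one-dimensional energy replaces the dihedral-angle chain model.

```python
import math
import random

GAS_CONSTANT = 0.0019872  # kcal/(mol K); the unit system is an assumption

def metropolis_accept(e_new, e_old, temperature, rng=random):
    """Metropolis criterion: always accept downhill, else with exp(-dE/RT)."""
    if e_new <= e_old:
        return True
    boltzmann = math.exp(-(e_new - e_old) / (GAS_CONSTANT * temperature))
    return rng.random() < boltzmann

def anneal(energy, propose, state, t_start=1500.0, t_end=150.0,
           factor=0.9997, steps_per_cool=10, n_steps=100_000, seed=0):
    """Cool geometrically from t_start to t_end, then hold at t_end.

    energy(state) -> float; propose(state, rng) -> trial state.
    Returns the lowest-energy state seen and its energy.
    """
    rng = random.Random(seed)
    temperature = t_start
    e = energy(state)
    best_state, best_e = state, e
    for step in range(1, n_steps + 1):
        trial = propose(state, rng)
        e_trial = energy(trial)
        if metropolis_accept(e_trial, e, temperature, rng):
            state, e = trial, e_trial
            if e < best_e:
                best_state, best_e = state, e
        # Reduce the temperature by a constant factor every few steps.
        if step % steps_per_cool == 0 and temperature > t_end:
            temperature = max(temperature * factor, t_end)
    return best_state, best_e

# Toy usage: a 1D "conformation" x with a single energy minimum at x = 3.
best_x, best_e = anneal(lambda x: (x - 3.0) ** 2,
                        lambda x, rng: x + rng.uniform(-0.5, 0.5),
                        state=0.0, n_steps=20_000)
```

With a factor of 0.9997 applied every 10 steps, cooling from 1500 K to 150 K takes roughly 77,000 steps (ln 0.1 / ln 0.9997 ≈ 7,700 reductions); with 0.9999 it takes roughly 230,000 steps.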

Energy Calculation

The energy of the conformation was calculated as the sum of collision energy and template energy:

1. Collision energy: If any 2 backbone atoms approached closer than the sum of their van der Waals radii, a collision penalty was applied. Values for this penalty in the range 1.0–10.0 were tried.

2. Template energy: At each CA[j] in the chain (except first and last), the local environment LE[j] was determined, as described above.

The lookup table of LETs, which was derived previously as described above, was searched to find LETs of the same type as LE[j] (i.e., matching numbers of CAs before and after the central CA, and matching sequential intervals).

If matches could be found, then the RMSD between the CAs in LE[j] and each matching LET in the lookup table was calculated, using the method of McLachlan.57 The best-fitting template t with the lowest RMSD (rd) was kept.

If the occurrence frequency of template t was at least 50, and rd < rd_max_2, then E[j] was calculated from the residue position propensities, using the Boltzmann equation:

$$E[j] = -f \cdot R \cdot T \cdot \sum_{i=1}^{\mathrm{n\_CA}} \ln\left(\mathrm{res\_pos\_prop}[r][i][t]\right), \qquad (2)$$

where

n_CA = number of CAs in LE[j]
R, T = gas constant, temperature
r = residue type at position i in LE[j]
f = factor, see below

If any of the above conditions was not satisfied, E[j] = 0.

Residue propensities have been used similarly by other authors to calculate energy scores for secondary motifs36,37 and tertiary motifs.63–65

For templates with low occurrence frequency (<100), the residue propensity in Eq. (2) was occasionally zero, a consequence of sparse data. To avoid taking the logarithm of zero, this situation was handled by replacing the zero propensity with the value 0.3, a value chosen empirically from the lower end of observed propensities, as done by other authors.65 Alternatively, a more sophisticated treatment of this situation can be applied, as originated by Sippl.19,22,24,58

By repeating this for all residues in the chain, and summing, the total energy of the conformation was obtained.
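
The template energy of Eq. (2), including the replacement of zero propensities by 0.3, can be sketched in Python as below. The function name `template_energy`, the propensity table layout (one dictionary per template position), and the gas constant in kcal/(mol K) are assumptions; template lookup, the frequency and RMSD conditions, and the factor f are taken as already evaluated.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol K); the unit system is an assumption

def template_energy(residues, propensities, f, temperature, zero_floor=0.3):
    """Hypothetical helper evaluating Eq. (2) for one local environment LE[j].

    residues:     residue type at each position of LE[j], e.g. "AG".
    propensities: propensities[i][r] for the best-fitting template t.
    f:            fit factor f(rd), between 0 and 1.
    """
    log_sum = 0.0
    for i, r in enumerate(residues):
        p = propensities[i].get(r, 0.0)
        if p <= 0.0:
            p = zero_floor  # avoid ln(0) for sparsely populated templates
        log_sum += math.log(p)
    return -f * GAS_CONSTANT * temperature * log_sum

# Toy template with 2 positions.
toy_props = [{"A": 2.0, "G": 0.5}, {"G": 0.5, "L": 0.0}]
favourable = template_energy("AG", [{"A": 2.0}, {"G": 2.0}], 1.0, 300.0)
unseen = template_energy("AL", toy_props, 1.0, 300.0)
print(favourable < 0.0, unseen > 0.0)
```

Propensities above 1 lower the energy and propensities below 1 raise it, so a sequence that matches the residue preferences of its best-fitting templates is rewarded; summing these terms over all CA[j] and adding the collision penalty gives the total energy.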

rd_max_2 was set at 2.5 or 3.0 Å. In conjunction with this, the factor f(rd) was used, depending on the value of rd. This consisted basically of a function having a value 1.0 for perfect fit (rd = 0), and zero for rd above rd_max_2. Initially, a step function was tried for f(rd); however, this resulted in a certain looseness or indeterminacy of the final structure, even after selection of the correct templates, due to the flat top of the step function. Therefore, the step function was modified to the approximate sigmoidal function (Fig. 2):

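$$
f(\mathrm{rd}) =
\begin{cases}
1.0 - 0.375\,\mathrm{rd}/\mathrm{rd\_max\_2}, & 0 \le \mathrm{rd} \le 0.4\,\mathrm{rd\_max\_2} \\
0.85 - 3.5\,(\mathrm{rd} - 0.4\,\mathrm{rd\_max\_2})/\mathrm{rd\_max\_2}, & 0.4\,\mathrm{rd\_max\_2} < \mathrm{rd} \le 0.6\,\mathrm{rd\_max\_2} \\
0.15 - 0.375\,(\mathrm{rd} - 0.6\,\mathrm{rd\_max\_2})/\mathrm{rd\_max\_2}, & 0.6\,\mathrm{rd\_max\_2} < \mathrm{rd} \le \mathrm{rd\_max\_2} \\
0, & \mathrm{rd} > \mathrm{rd\_max\_2}
\end{cases}
$$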

The slope in this function served to bias the conformation so as to optimally fit the selected templates.
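
The piecewise-linear function f(rd) can be written directly as a small Python function. The name `fit_factor` is an assumption, and the assignment of the breakpoints themselves to one segment or the other is arbitrary, since the segments join continuously (0.85 at 0.4 rd_max_2, 0.15 at 0.6 rd_max_2, and 0 at rd_max_2).

```python
def fit_factor(rd, rd_max_2=2.5):
    """Approximate sigmoidal factor f(rd) built from three linear segments."""
    if rd < 0:
        raise ValueError("RMSD cannot be negative")
    x = rd / rd_max_2  # RMSD as a fraction of the cutoff
    if x <= 0.4:
        return 1.0 - 0.375 * x            # gentle slope near perfect fit
    if x <= 0.6:
        return 0.85 - 3.5 * (x - 0.4)     # steep middle section
    if x <= 1.0:
        return 0.15 - 0.375 * (x - 0.6)   # gentle slope down to zero
    return 0.0                            # no contribution beyond the cutoff

# Tabulate f(rd) for rd_max_2 = 2.5 Angstrom.
for rd in (0.0, 0.5, 1.0, 1.25, 1.5, 2.0, 2.5, 3.0):
    print(rd, round(fit_factor(rd), 3))
```

Unlike a step function, every segment has a nonzero slope, so any reduction in rd lowers the energy further and pulls the conformation toward the selected template.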

RESULTS

Choice of R_LE and rd_max Cluster Analysis

Using the described method of cluster analysis, the choice of R_LE and rd_max represented a compromise of various factors.

Table I shows the numbers of templates found for various values of R_LE and low-frequency cutoff, with rd_max fixed at 1.6 Å.

From this, it is seen that as R_LE increases, the number of templates having an occurrence frequency above a certain cutoff frequency at first increases, as more different templates are detected. Further increase in R_LE causes the occurrence frequencies of the templates to decrease, resulting in decreasing numbers of templates above the cutoff frequency. For R_LE in the range 7.0–8.0 Å, the number of templates found was highest.

Table II shows details of the cluster analysis for 2 different values of rd_max, with R_LE fixed at 8.0 Å, for selected sets of LEs taken from the database. This shows, for each set, the numbers of LEs in the set, and the sizes of the main clusters found, larger than 50. For each cluster, the structure is also shown.

From this it is seen that with rd_max = 1.3 Å, the large clusters were resolved better, and more cluster centers were found.

Thus, for the set of LEs of type 4_4 11111111, rd_max = 1.3 Å found 5 templates: helix, 3 helix–turn–helix motifs, plus helix N-cap, while rd_max = 1.6 Å found only the helix template.

For LEs of type 6_6 111411114111, both rd_max = 1.3 Å and rd_max = 1.6 Å found 9 templates, representing various 3-stranded β-sheets. Some of these are shown in Figure 3(d, e, and g).

Fig. 2. Function f(rd) used in energy calculation [Eq. (2)]. The step function (dotted line) was replaced by the approximate sigmoidal function (full line).

TABLE I. Numbers of Templates Found for Various Values of R_LE and Low-Frequency Cutoff

R_LE (Å)   Freq 10   Freq 20   Freq 50   Freq 100
6.0        1622      985       499       315
7.0        2604      1495      659       368
7.5        2571      1433      602       320
8.0        2606      1413      559       264
8.5        2249      1112      408       189

Templates calculated with rd_max = 1.6 Å.

For LEs of type 7_4 31411111111 and 4_7 11111111431, both values of rd_max found 2 templates, representing 2 associated helices.

On the other hand, with rd_max = 1.3 Å, the clusters were reduced in size, with the result that the size of some of the smaller ones fell below 50 LEs. Thus, overall, the total number of templates with occurrence frequency above 50 was lower, at 509, than with rd_max = 1.6 Å at 559.

Regarding the clustering of the whole database, for rd_max = 1.3 Å, the sizes of all clusters with at least 10 LEs totaled 229,000 LEs, while for rd_max = 1.6 Å, they totaled 263,000. The reason why not all LEs in the database (750,000 LEs) lie inside a cluster derives from the use of a fixed value of rd_max for all clusters; see Figure 1(c).

Thus, the choice for R_LE and rd_max, using this clustering procedure, represents a compromise, an attempt to simultaneously maximize inclusion of CAs from neighboring secondary structural units, resolution of templates, and total number of templates. In the following, values for R_LE = 8.0 Å and rd_max = 1.6 Å were used, unless otherwise stated.

Description of Templates

Using R_LE = 8.0 Å and rd_max = 1.6 Å, the templates range in size from 4 CAs up to a maximum of 17 CAs (Fig. 4). Their occurrence frequencies vary widely, from the low-frequency cutoff of 10, up to 32,854. The top 50 templates with the highest occurrence frequencies are shown in Table III.

A small proportion, 52/559, or 9%, of the templates are composed entirely of sequential residues (sequence intervals all 1's); these templates correspond to the sequential segments investigated by others.31–40,72

They represent a single secondary or supersecondary structure unit, such as helix, helix-cap, β-turn, β-hairpin, and so forth.

Most of the templates represent interactions between 2 or more secondary structure units.

The templates can be broadly classed according to whether the central CA is in helix, β-sheet, or random coil.

A selection of various templates is shown with centralCA in helix (Fig. 5). For each template is shown numbers of

TABLE II. Cluster Analysis of Selected LE Sets

LE type No of LEs in set

No. of LEs in main clusters

rd_max � 1.6 Å rd_max � 1.3 Å

4_4 11111111 37,884 32,854 (h) 30,548 (h) 369 (hth) 129 (hth)102 (hth) 85 (hN)

4_2 111111 14,676 7344 (hC) 5602 (hC) 2195 (hC) 1254 (bh)543 (hN)

2_4 111111 14,291 8855 (hN) 3050 (hN) 6822 (hN) 2272 (hN) 217 (bh)5_4 411111111 11,158 5303 (h � h) 4135 (h � h)4_3 1111111 10,695 6673 (hC) 963 (hC) 546 (hC) 4238 (hC) 692 (hC) 442 (hC)

183 (hN) 139 (irr) 54 (bh)4_5 111111114 10,412 4789 (h � h) 3838 (h � h) 247 (b � h) 90 (h � h)2_3 11111 6815 2287 (hN) 2184 (bh) 1585 (hN) 1565 (hN) 699 (hN)3_4 1111111 6441 4050 (h) 679 (hN) 470 (hN) 3211 (h) 605 (hth) 407 (hN)

313 (hN) 225 (hN)6_2 11141111 6299 2917 (2b) 576 (2b) 2186 (2b) 416 (2b) 314 (2b)

88 (2b) 55 (2b)3_2 11111 6252 2462 (irr) 1826 (hC) 1592 (bh) 1189 (hC)6_6 111411114111 5975 1375 (3b) 1288 (3b) 879 (3b) 988 (3b) 969 (3b) 755 (3b)

317 (3b) 217 (3b) 134 (3b) 282 (3b) 185 (3b) 120 (3b)100 (3b) 100 (3b) 89 (3b) 81 (3b) 79 (3b) 74 (3b)

5_2 1111111 3852 1019 (bh) 877 (hC) 273 (hC) 269 (hC) 593 (bh) 591 (hC) 475 (bh)357 (hC) 194 (hC) 72 (hN)

5_2 4111111 3434 610 (h � h) 403 (h � h) 69 (b � bh) 64 (h � h) 397 (b � h) 357 (h � h) 310 (h � h)62 (b � bh) 141 (h � h) 55 (h � h)

4_5 111111111 3335 1445 (hC) 1212 (hC) 1045 (hC) 91 (hth)6_4 4411111111 3289 594 (b � h) 513 (b � h) 363 (h � h) 310 (b � h)4_6 1111111144 3109 607 (b � h) 401 (h � h) 320 (h � h)6_4 1411111111 3016 891 (b � h) 485 (b � h) 71 (b � h)4_6 1111111141 2965 1017 (h � h) 569 (h � h) 67 (h � h)7_4 31411111111 2207 939 (h � h) 572 (h � h) 673 (h � h) 430 (h � h)4_7 11111111431 2063 858 (h � h) 481 (h � h) 623 (h � h) 334 (h � h)4_8 111111114313 602 183 (h � h) 119 (h � h) 146 (h � h) 133 (h � h) 86 (h � h)9_5 11143141111411 147 93 (b � h) 58 (b � h)

Total database = 750,000 LEs, using R_LE = 8.0 Å. Only clusters with ≥ 50 LEs are shown. For each cluster is shown the number of LEs in the cluster, and the structure represented by the cluster: h = helix, hN = helix N-cap, hC = helix C-cap, hth = helix–turn–helix, h + h = 2 associated helices, bh = β-hairpin, 3b = 3-stranded β-sheet, b + h = β-sheet associated with helix, irr = coil.


Fig. 3. Templates with central CA in β-sheet. Details as in Figure 5.


CAs before and after the central CA, sequence intervals, rank, occurrence frequency in the database, and a representative example. The rank is the position of the template in the hierarchy of templates of the same type. Note that the occurrence frequencies shown are provisional only; they depend on the size of rd_max used (1.6 Å). Since the spread of many templates is larger than this, the true occurrence frequencies of these templates should be higher.

Figure 5(a) shows the template for a single α-helix. This is the template with the highest frequency, 32,854.

Figure 5(b–i) shows templates for helix–helix interactions. These represent tertiary structure interactions. For all templates involving 2 secondary structure units, there are two ways they can be connected. Thus, shown in Figure 5(c) is the antiparallel helix–helix interaction 4_7 11111111431, while Figure 5(e) shows the same interaction with alternative connection 7_4 31411111111. The frequencies of these 2 templates are similar, 858 and 939; also, the residue position propensities at corresponding topological positions in the templates are similar, as can be seen by comparing (a) and (b) in Table IV.

In Figure 5(c and d), the templates have the same numbers of CAs and the same sequence intervals. Thus, they are of the same type and belong to the same set. However, the RMSD between them, 2.9 Å, leads them to be resolved into separate clusters.

Figure 5(b and c) shows templates with the same numbers of CAs, 4_7; these have a lower deviation from each other, 1.8 Å RMSD. However, the sequence intervals differ, causing them to be distinguished. This illustrates the role of the sequence intervals.
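The type label used throughout (for example 4_7 11111111431) can be illustrated with a minimal Python sketch. The function name `le_type` and the rule that any sequence interval of 4 or more is written as 4 are assumptions inferred from the tabulated examples (Table VI), not a statement of the program actually used.

```python
def le_type(residue_numbers, center, cap=4):
    """Build an LE type label such as '4_7 11111111431'.

    residue_numbers: sequence positions of the CAs in the local environment.
    center: sequence position of the central CA.
    cap: intervals of this size or larger are written as `cap`
         (assumed rule, inferred from the tables).
    """
    nums = sorted(residue_numbers)
    if center not in nums:
        raise ValueError("central CA must belong to the local environment")
    n_before = sum(1 for n in nums if n < center)
    n_after = sum(1 for n in nums if n > center)
    # One digit per pair of neighbouring CAs, in sequence order.
    intervals = "".join(str(min(b - a, cap)) for a, b in zip(nums, nums[1:]))
    return f"{n_before}_{n_after} {intervals}"


# LE around Leu23 of 1ropA (Table VI): T19-N27, A35, C38, E39
leu23 = [19, 20, 21, 22, 23, 24, 25, 26, 27, 35, 38, 39]
print(le_type(leu23, 23))  # 4_7 11111111431

# LE around Ala35 of 1ropA (Table VI): L23, L26, N27, A31-E39
ala35 = [23, 26, 27, 31, 32, 33, 34, 35, 36, 37, 38, 39]
print(le_type(ala35, 35))  # 7_4 31411111111
```

The two examples are the alternative connections of the same helix–helix interaction, which is why their labels are mirror images of each other.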

Figure 5(g and i) shows templates for 3 interacting helices. These can be connected 6 ways. Hence, these templates occur in 6 versions (same propensities, different connectivities). Figure 5(g) is from a 3-helix bundle, while Figure 5(i) is from a 4-helix bundle.

Figure 5(h) shows the template for the interaction α-helix–polyproline helix.

Figure 5(j, k, and l) shows templates for helix–β-sheet interactions.

Shown in Figure 3 are templates with the central CA in the β-sheet. Figure 3(a and b) represents pairs of antiparallel β-strands. Figure 3(c) represents a pair of parallel β-strands.

Figure 3(d–g) represents 3 adjacent β-strands, isolated from β-sheets of 3 or more strands. All these templates represent a single secondary structure unit (β-sheet), without tertiary interactions.

Templates representing 2 secondary structure units, β-sheet plus helix, are shown in Figure 3(h–l).

Templates involving 2 separate β-sheets were not seen. Factors contributing to this may be the larger separation between interacting β-sheets, as compared to helix–helix and helix–β-sheet interactions,53 and the intrinsically larger coordinate variation (template spread) in β-sheet–β-sheet interactions.

Cluster Size of Templates

The cluster size of the templates could be estimated from the distribution of occurrence frequency against RMSD from the cluster center [Fig. 1(e)]. Typically this showed an initial peak, followed by a minimum, followed by one or more peaks (Fig. 6). The minimum following the first peak gave the radius or spread of the cluster.31,72 Peaks following this represented possible other clusters in the distance matrix and/or background nonsignificant data [Fig. 1(e)].
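The first-minimum rule can be sketched in a few lines of Python. The shell counts are those listed in Table IV for the two helix–helix templates (shells of 0.3 Å up to 4.5 Å). Reporting the outer edge of the minimum shell, and extending the minimum through tied counts, are assumptions made here for illustration; the function name `cluster_spread` is hypothetical.

```python
def cluster_spread(shell_counts, shell_width=0.3):
    """Outer radius (Å) of the shell holding the first minimum after the first peak."""
    n = len(shell_counts)
    i = 0
    # Climb to the first peak of the distribution.
    while i + 1 < n and shell_counts[i + 1] >= shell_counts[i]:
        i += 1
    # Descend to the end of the minimum that follows (ties extend it).
    while i + 1 < n and shell_counts[i + 1] <= shell_counts[i]:
        i += 1
    return round((i + 1) * shell_width, 2)


# Shell counts from Table IV
shells_4_7 = [1, 18, 83, 323, 364, 178, 105, 105, 140, 314, 245, 107, 47, 19, 5]
shells_7_4 = [1, 10, 121, 402, 344, 179, 117, 82, 127, 372, 297, 85, 35, 20, 10]

print(cluster_spread(shells_4_7))  # 2.4
print(cluster_spread(shells_7_4))  # 2.4
```

Under these assumptions both distributions give 2.4 Å, the spread quoted in the footnote to Table IV.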

From this, it was seen that clusters have different spreads, from 1.2 Å up to ~3.0 Å RMSD. Values in this range have been used in other studies.72

In general, templates representing a single secondary structural unit, helix or β-sheet, whose geometry is determined primarily by intra-main-chain hydrogen bonds, have low spreads [Fig. 6(a)]. Templates representing 2 or more secondary structural units, associated by inter-side-chain interactions of various kinds, such as van der Waals interactions, ionic bonds, disulfide bonds, and so forth, have larger spreads [Fig. 6(b)].

Coverage

It is of interest to know to what extent the templates as found above are representative of LEs in proteins generally (i.e., how many sites in a protein can be covered by a template).

Table V shows data for proteins of different types (all-α, mixed α/β, all-β) and templates within 2.5 and 3.0 Å RMSD, to allow for the fact that the spread is often greater than 1.6 Å RMSD.

From this, it appears that most α proteins can be covered in the range 75–100%. For β and α/β proteins, the coverage is lower.

Template coverage for the 56-residue α-hairpin repressor protein, 1ropA, is shown in Table VI. Here it is seen that of the 54 CAs for which a local environment was calculated, 30 LEs consist of strictly consecutive residues (sequence intervals all 1's). These LEs form the basis of the secondary structural units, in this case, helices. Twenty-one LEs, on the other hand, contain sequence intervals of 4 or more; these represent long-range tertiary interactions between the 2 helices (marked * in Table VI). In the first group, the average RMSD from the best-fitting template is 0.7 Å; for the second group, it is 1.6 Å RMSD, thus illustrating the larger deviation for tertiary interactions, as described above. The template coverage for the whole

Fig. 4. Frequency distribution of numbers of CAs in local environment, with R_LE = 8.0 Å. The peak at n_CA = 9 represents α-helix.


protein (RMSD ≤ 3.0 Å, occurrence frequency ≥ 10) is 100%.

Template coverage for the 61-residue α/β protein G, 2igd_, is shown in Table VII. Here, the best-fitting template at some sites has an occurrence frequency below 10 (i.e., the lookup table contained no matching template for the LEs at these sites). These positions are left as gaps in Table VII. At other sites, the RMSD of the best-fitting template is large (> 3.0 Å). Thus, for this protein, the lookup table is not complete, and the overall coverage,

TABLE III. Top 50 Templates With Highest Occurrence Frequencies

Template type    Rank    Occ. freq.    Cluster center    Structure

4_4 11111111    1    32,854    1n7sA G1 SNRRLQQ T9    h
2_4 111111    1    8855    1iw7C A423 GFDVR D429    hN
4_2 111111    1    7344    1lba_ T105 LLAKY E111    hC
2_2 1111    1    6837    1csn_ L216 KAA T220    irr
4_3 1111111    1    6673    1iw7D F501 RQNLLG K508    hC
5_4 411111111    1    5303    1g60A Y54 F81 ICQYLVS K89    h + h*
4_5 111111114    1    4789    1jndA T154 ALVRDVK D162 K193    h + h*
3_4 1111111    1    4050    1je0A E164 FVKKWS S171    h
2_4 111111    2    3050    1evsA S4 KEYRV L10    hN
6_2 11141111    1    2917    1gci_ N198 VQ S201 Y208 ASL N212    2b
3_2 11111    1    2462    1dosA V225 HGVY K230    irr
2_6 11114111    1    2361    1dbhA I291 NDK D295 H302 AF E305    2b
2_3 11111    1    2287    1h8eA V81 GEEL L86    hN
2_3 11111    2    2184    1dcoA E22 LEGR D27    bh
3_2 11111    2    1826    1kcmA E180 LVNQ K185    hC
5_6 11411114111    1    1566    1rypA M72 V V74 I131 LTF V135 I145 YK T148    3b
5_4 111111111    1    1544    1chkA R120 VYFDPAVS Q129    hN
2_5 1111411    1    1462    1bm8_ T59 HEK V63 T73 W V75    2b
6_5 11141111411    1    1448    1ewfA K86 IS G89 N103 FDL S107 V144 H I146    3b
4_5 111111111    1    1445    1wdcB Y97 IKDLLENM G106    hC
6_6 111411114111    1    1375    1ltxR I363 LT V366 V376 RVI E380 V395 HL T398    3b
5_2 1141111    1    1374    1o9wA C47 V G49 S101 GWC V105    2b
6_6 111411114111    2    1288    1kngA L35 VN V38 P108 ETF V112 K122 LV G125    3b
4_6 1111111111    1    1130    1mb3A E84 ERIREGGCE A94    hC
5_2 1111111    1    1019    1gp1A R9 PLAGGE P16    bh
4_6 1111111141    1    1017    1f5aA A207 IKLWAKR H215 L342 F343    h + h*
4_3 1111111    2    963    3sil_ I160 VTKNGT I167    hC
7_4 31411111111    1    939    1n5uA V425 N428 L429 Y451 LSVVLNQ L459    h + h*
6_4 1411111111    1    891    1d7uA I165 P166 E186 LDYAFDL I194    b + h*
6_6 111411114111    3    879    1qo2A V123 FS L126 E160 IVH T164 V191 LA A194    3b
5_2 1111111    2    877    1m3yA P315 YQSIGG V322    hC
4_7 11111111431    1    858    1whsA Y192 VGTFEFW W200 Y210 L213 K214    h + h*
2_6 11111111    1    765    113aA S27 PLDSGAF K35    bh
3_4 1111111    2    679    1iz0A L241 REGALV E248    hN
4_7 11111111111    1    671    1dxy_ F16 KQWAKDTGNT L27    hC
10_2 111411141111    1    653    1qayA Q112 VQ I115 V158 HT F161 M169 LVA H173    3b
3_3 111111    1    650    1huw_ S85 WLEPV Q91    h
4_1 11111    1    621    1bkrA H103 YFSK M108    hC
2_5 1111111    1    614    1jhfA K158 KQGNKV E165    bh
5_2 4111111    1    610    1eca_ V16 Y107 MKAHT D113    h + h*
2_5 1111114    1    609    1gtvA G12 KRTLV E18 D94    b + h*
4_6 1111111144    1    607    1l9vA A238 DRVYATF K246 M276 L282    b + h*
4_4 41111111    1    604    1gw5A V148 W180 TSRVVH L187    h + h*
6_4 4411111111    1    594    1nb8A F26 R32 L53 QRVFYEL Q61    b + h*
6_2 11141111    2    576    1b5eA K160 IN A163 K204 AGS I208    2b
7_4 31411111111    2    572    1m1nA D125 L128 A129 D159 IESVSKV K167    h + h*
2_10 111141114111    1    567    1o75A V243 RYD Y247 S260 YG T263 Y298 RL T301    3b
4_3 1111111    3    546    1nf3C A73 QSTGLL A80    hC
9_2 11141141111    1    545    1kr4A L10 VY S13 F37 N A39 A59 AIF K63    3b
4_4 11111114    1    540    1bpyA L113 EDLRKN E120 L131    h + h*

Templates derived with R_LE = 8.0 Å, rd_max = 1.6 Å. Occurrence frequency = number of LEs within rd_max of the LE at the cluster center. For the LE at the cluster center are shown the PDB code and the residues in the LE, numbered from the start of the chain. Unnumbered residues are consecutive in the sequence. The last column shows the type of structure: h = helix, hN = helix N-cap, hC = helix C-cap, h + h = 2 associated helices, bh = β-hairpin, 2b = 2 β-strands, 3b = 3-stranded β-sheet, b + h = β-sheet associated with helix, irr = coil. Templates marked * represent tertiary interactions.


Fig. 5. Templates (a–l) with central CA in helix. Large circle represents local environment (sphere of radius R_LE). CAs in template are shown as small circles, central CA as filled circle. Connecting regions of the chain outside the template are shown as dotted lines. First and last CAs in template are numbered. Shown for each template are numbers of CAs before and after central CA, sequence intervals, rank of cluster in distance matrix, occurrence frequency in database, and a representative example in the PDB. Templates (c) and (e) are numbered for use with Table IV.


TABLE IV. Comparison of Topologically Similar Templates, 4_7 11111111431 (1) and 7_4 31411111111 (1)

(a) Template 4_7 11111111431 (1) 858

Cluster center LE: 1whs_A Y192 V193 G194 T195 F196 E197 F198 W199 W200 Y210 L213 K214

Distribution surrounding cluster center in shells of thickness 0.3 Å up to 4.5 Å:
1 18 83 323 364 178 105 105 140 314 245 107 47 19 5

Residue position counts:
Pos  ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG SER THR VAL TRP TYR
1    84 15 47 56 31 24 8 60 59 141 30 27 13 31 55 36 35 57 10 36
2    80 4 41 59 35 22 20 71 44 137 23 16 16 45 40 25 43 90 17 28
3    88 5 53 96 27 24 16 32 90 68 11 42 13 67 63 48 37 39 18 19
4    91 4 34 105 36 20 16 47 64 106 20 17 4 59 58 35 47 54 14 27
5    61 14 6 17 34 18 15 171 21 214 23 22 1 21 22 20 48 93 9 26
6    96 18 42 73 35 26 15 73 55 102 21 30 7 57 49 35 36 61 10 16
7    104 11 34 115 28 28 24 25 86 80 16 31 7 46 56 50 41 49 6 17
8    80 12 20 79 29 20 15 69 58 181 27 20 3 45 47 29 33 51 9 28
9    94 17 19 41 44 23 19 54 48 156 27 30 3 46 54 49 40 52 11 28
10   105 13 6 25 49 13 24 90 32 220 26 9 8 30 28 27 33 71 7 41
11   82 12 25 66 26 10 14 85 45 212 37 24 3 30 37 26 21 65 10 24
12   70 16 17 67 25 16 11 88 49 186 27 18 1 37 52 29 42 70 9 27

Residue position propensities:
Pos  ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG SER THR VAL TRP TYR
1    1.19 1.26 0.94 0.99 0.90 0.38 0.41 1.23 1.18 1.80 1.73 0.72 0.32 0.96 1.27 0.70 0.74 0.94 0.81 1.18
2    1.14 0.34 0.82 1.04 1.01 0.35 1.02 1.46 0.88 1.75 1.33 0.43 0.40 1.39 0.92 0.49 0.91 1.48 1.38 0.92
3    1.25 0.42 1.06 1.69 0.78 0.38 0.81 0.66 1.79 0.87 0.63 1.12 0.32 2.07 1.45 0.94 0.78 0.64 1.46 0.62
4    1.29 0.34 0.68 1.85 1.04 0.32 0.81 0.96 1.28 1.35 1.15 0.45 0.10 1.82 1.34 0.69 0.99 0.89 1.13 0.88
5    0.87 1.18 0.12 0.30 0.98 0.29 0.76 3.51 0.42 2.73 1.33 0.59 0.02 0.65 0.51 0.39 1.01 1.53 0.73 0.85
6    1.37 1.52 0.84 1.28 1.01 0.42 0.76 1.50 1.10 1.30 1.21 0.80 0.17 1.76 1.13 0.69 0.76 1.01 0.81 0.52
7    1.48 0.93 0.68 2.02 0.81 0.45 1.22 0.51 1.71 1.02 0.92 0.83 0.17 1.42 1.29 0.98 0.86 0.81 0.49 0.56
8    1.14 1.01 0.40 1.39 0.84 0.32 0.76 1.42 1.16 2.31 1.56 0.53 0.07 1.39 1.08 0.57 0.70 0.84 0.73 0.92
9    1.34 1.43 0.38 0.72 1.27 0.37 0.97 1.11 0.96 1.99 1.56 0.80 0.07 1.42 1.24 0.96 0.84 0.86 0.89 0.92
10   1.49 1.10 0.12 0.44 1.42 0.21 1.22 1.85 0.64 2.81 1.50 0.24 0.20 0.93 0.65 0.53 0.70 1.17 0.57 1.34
11   1.17 1.01 0.50 1.16 0.75 0.16 0.71 1.74 0.90 2.71 2.13 0.64 0.07 0.93 0.85 0.51 0.44 1.07 0.81 0.79
12   1.00 1.35 0.34 1.18 0.72 0.26 0.56 1.81 0.98 2.38 1.56 0.48 0.02 1.14 1.20 0.57 0.89 1.15 0.73 0.88

(b) Template 7_4 31411111111 (1) 939

Cluster center LE: 1n5uA V425 N428 L429 Y451 L452 S453 V454 V455 L456 N457 Q458 L459

Distribution surrounding cluster center in shells of thickness 0.3 Å up to 4.5 Å:
1 10 121 402 344 179 117 82 127 372 297 85 35 20 10

Residue position counts:
Pos  ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG SER THR VAL TRP TYR
1    69 14 13 27 67 23 17 114 32 226 22 19 12 29 36 26 45 91 18 35
2    83 13 38 72 45 23 10 69 36 242 33 18 3 42 52 26 30 61 11 28
3    82 20 20 40 44 22 8 102 51 188 28 24 2 47 56 38 41 83 11 31
4    88 8 36 75 38 28 21 61 66 171 22 31 14 40 35 34 42 74 14 34
5    94 7 35 78 38 33 23 77 64 143 20 25 7 32 49 34 37 87 10 43
6    97 14 62 106 18 30 18 29 83 81 9 46 14 60 91 52 45 50 8 23
7    92 7 54 99 38 23 32 48 58 114 25 41 9 49 70 40 42 59 8 26
8    75 15 8 32 34 16 20 168 31 229 24 12 1 23 23 29 38 129 4 23
9    106 10 55 82 29 28 21 59 71 135 22 30 4 45 69 43 43 57 11 15
10   95 12 56 110 32 33 13 39 91 72 20 44 4 66 82 40 55 47 9 16
11   73 13 37 77 27 23 19 90 61 175 33 36 2 36 57 43 28 60 8 33
12   87 15 27 75 38 23 27 60 50 165 36 17 3 40 40 41 44 79 14 51

Residue position propensities:
Pos  ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG SER THR VAL TRP TYR
1    0.90 1.08 0.24 0.43 1.77 0.34 0.79 2.14 0.58 2.64 1.16 0.46 0.27 0.82 0.76 0.47 0.87 1.37 1.33 1.05
2    1.08 1.00 0.69 1.16 1.19 0.34 0.46 1.29 0.66 2.83 1.74 0.44 0.07 1.18 1.10 0.47 0.58 0.92 0.81 0.84
3    1.07 1.54 0.36 0.64 1.16 0.32 0.37 1.91 0.93 2.20 1.47 0.58 0.05 1.33 1.18 0.68 0.79 1.25 0.81 0.93
4    1.14 0.62 0.66 1.21 1.00 0.41 0.97 1.14 1.20 2.00 1.16 0.75 0.32 1.13 0.74 0.61 0.81 1.12 1.04 1.02
5    1.22 0.54 0.64 1.25 1.00 0.48 1.07 1.44 1.17 1.67 1.05 0.61 0.16 0.90 1.03 0.61 0.71 1.31 0.74 1.29
6    1.26 1.08 1.13 1.70 0.48 0.44 0.84 0.54 1.51 0.95 0.47 1.12 0.32 1.69 1.92 0.93 0.87 0.75 0.59 0.69
7    1.20 0.54 0.99 1.59 1.00 0.34 1.49 0.90 1.06 1.33 1.32 1.00 0.20 1.38 1.47 0.72 0.81 0.89 0.59 0.78
8    0.97 1.16 0.15 0.51 0.90 0.23 0.93 3.15 0.56 2.67 1.26 0.29 0.02 0.65 0.48 0.52 0.73 1.94 0.30 0.69
9    1.38 0.77 1.00 1.32 0.77 0.41 0.97 1.11 1.29 1.58 1.16 0.73 0.09 1.27 1.45 0.77 0.83 0.86 0.81 0.45
10   1.23 0.92 1.02 1.77 0.84 0.48 0.60 0.73 1.66 0.84 1.05 1.07 0.09 1.86 1.73 0.72 1.06 0.71 0.67 0.48
11   0.95 1.00 0.68 1.24 0.71 0.34 0.88 1.69 1.11 2.04 1.74 0.88 0.05 1.01 1.20 0.77 0.54 0.90 0.59 0.99
12   1.13 1.16 0.49 1.21 1.00 0.34 1.25 1.13 0.91 1.93 1.90 0.41 0.07 1.13 0.84 0.73 0.85 1.19 1.04 1.52

For each template are shown template type, rank of cluster, occurrence frequency, PDB code and residues of LE at cluster center, frequency distribution surrounding cluster center LE at 0.3 Å intervals, residue position counts, and residue position propensities. Central residue is underlined. Numbering of the CAs in the templates is as given in Figure 5(c and e). Comparison shows that residue counts and propensities in similar topological positions in the 2 templates are similar. Thus, positions 1–9 in (a) above correspond to positions 4–12 in (b); positions 10–12 in (a) correspond to positions 1–3 in (b). The frequency distribution surrounding the cluster center indicates a spread of 2.4 Å RMSD.
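The position correspondence stated in the footnote can be checked numerically. The sketch below uses only the LEU propensity columns of Table IV; the helper names `a_to_b` and `mean_abs_diff` are hypothetical, and the mean absolute difference is chosen here simply as a convenient similarity measure.

```python
def a_to_b(pos_a):
    """Map a position in template (a) 4_7 to the matching position in (b) 7_4.

    Positions 1-9 of (a) correspond to 4-12 of (b);
    positions 10-12 of (a) correspond to 1-3 of (b).
    """
    return pos_a + 3 if pos_a <= 9 else pos_a - 9


# LEU propensities at positions 1..12, copied from Table IV
leu_a = [1.80, 1.75, 0.87, 1.35, 2.73, 1.30, 1.02, 2.31, 1.99, 2.81, 2.71, 2.38]
leu_b = [2.64, 2.83, 2.20, 2.00, 1.67, 0.95, 1.33, 2.67, 1.58, 0.84, 2.04, 1.93]


def mean_abs_diff(pairs):
    """Mean absolute difference over (x, y) pairs."""
    pairs = list(pairs)
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


# Compare topologically equivalent positions, then same-numbered positions.
mapped = mean_abs_diff((leu_a[p - 1], leu_b[a_to_b(p) - 1]) for p in range(1, 13))
unmapped = mean_abs_diff(zip(leu_a, leu_b))

print(f"topologically matched positions: {mapped:.2f}")
print(f"same-numbered positions:         {unmapped:.2f}")
```

The matched comparison gives a much smaller difference than the same-numbered comparison, in line with the statement that the two connectivities share their propensities.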


defined as above, is 78%. The number of templates representing tertiary interactions between the helix and the β-sheet is 12 (marked * in Table VII).
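As an illustration of how such a coverage figure follows from the tabulated data, the sketch below recomputes it from the RMSD column of Table VII. It assumes that the denominator is the number of sites for which an LE was calculated (59 for 2igd_), that gaps count as uncovered, and that the cutoff is inclusive; the function name `coverage` is hypothetical.

```python
# RMSD (Å) of the best-fitting template at residues 2-60 of 2igd_ (Table VII).
# None marks a gap: no template with occurrence frequency >= 10 was found.
RMSD_2IGD = [
    0.621, 2.878, 1.183, 2.086, 1.803, 0.436, None, 0.675, 4.261, 1.304,   # 2-11
    3.424, None, None, 4.684, 1.816, None, 1.243, 1.923, 1.006, 0.795,     # 12-21
    0.660, 0.874, 1.048, 1.405, 3.327, 2.302, 2.878, 1.132, 2.544, None,   # 22-31
    0.694, 0.663, 0.981, 0.917, 0.556, 0.572, 0.629, 2.487, 2.398, 1.384,  # 32-41
    1.749, 2.990, None, 1.401, 2.927, 0.953, 1.333, 0.677, 0.840, 2.328,   # 42-51
    2.523, 1.998, 1.327, None, 4.426, 1.231, 0.620, 1.345, None,           # 52-60
]


def coverage(rmsds, rmsd_cutoff):
    """Percent of sites whose best-fitting template lies within rmsd_cutoff."""
    covered = sum(1 for r in rmsds if r is not None and r <= rmsd_cutoff)
    return 100.0 * covered / len(rmsds)


print(round(coverage(RMSD_2IGD, 3.0)))  # 78
```

With these assumptions, 46 of the 59 sites are covered at 3.0 Å, which rounds to the 78% quoted in the text.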

Total Number of Templates

Regarding the total number of different templates that may occur in proteins in nature, no sign of an upper limit could be seen with this size of database (see Table I). As the low-frequency cutoff was decreased, increasing numbers of templates were found, with no sign of a plateau.

Simulated Annealing of Test Proteins

The folding simulations on the test proteins yielded lowest energy structures differing by a low RMSD from the structures obtained by NMR or X-ray (Figs. 7–11). The acquisition of the lowest energy structure was signaled by a sharp drop in energy, signifying 2-state folding (Fig. 12).

For the β-sheet sequences, the RMSDs were 3.0 Å for Betanova (Fig. 7), and 2.3 Å for GS-peptide (Fig. 8).

An interesting feature for these peptides was that the correct structure was obtained, even though template coverage of the Protein Data Bank (PDB) structure was not complete (83–89%) (Table V).

For the 24-residue peptide 3β (Fig. 9), the lowest energy structure closely agreed with the NMR structure as described.68 The RMSD, however, could not be calculated, as coordinates were not available.

For the 37-residue antifreeze protein 1wfbA, the simulation correctly gave an α-helix, with an RMSD of 3.0 Å (Fig. 10).

For the 38-residue peptide α-hairpin 1abz_, using the templates derived with rd_max 1.6 Å, the simulation gave a structure with 2 helices correctly positioned and oriented to one another, with an RMSD of 3.0 Å. In a second run using templates derived with rd_max 1.3 Å, a slightly

Fig. 6. Spreads of templates shown as RMSD distribution in concentric shells of thickness 0.3 Å surrounding cluster center. The spread is taken as the position of the first minimum. (a) Three-strand β-sheet 6_6 111411114111 (1), spread 1.2 Å RMSD. (b) Helix–helix interaction 4_7 11111111413 (2), spread 2.7 Å RMSD.

TABLE V. Template Coverage of Proteins in the Classes All-α, Mixed α/β, and All-β

Protein    PDB code    Residues    Coverage % (occ. freq. ≥ 10): ≤ 2.5 Å RMSD    ≤ 3.0 Å RMSD

All-α
Antifreeze protein    1wfbA    37    100    100
α–turn–α    1abz_    38    97    100
Repressor    1ropA    56    95    100
Uteroglobulin    1utg_    70    79    87
Hemoglobin    1b0b_    141    66    75
Phycocyanin    1cpcA    162    63    76

Mixed α/β
Cystatin    1cewI    108    77    78
Protein G    2igd_    61    69    79
Ubiquitin    1ubi_    76    59    73
Carbox. inhibitor    1dtvA    67    51    63

All-β
Betanova    (no PDB code)    20    83    89
GS-peptide    (no PDB code)    20    83    83
Cold-shock protein    1nmg_    67    49    57
Plastocyanin    1plc_    99    43    51
Amicyanin    1id2A    106    55    62
Porin    2por_    301    62    66

Coverage = percent of total residues in protein for which a matching template can be found in the lookup table, within a deviation of 2.5 or 3.0 Å RMSD, and occurrence frequency ≥ 10.


TABLE VI. Template Coverage of Repressor Protein

1ropA, n_res = 56

Res    LE type    CAs in LE    Rank    RMSD (Å)    Freq.
(LE type and CAs in LE describe the local environment; rank, RMSD, and frequency describe the best-fitting template.)

T 2    1_4 11111    M1 T2 K3 Q4 E5 K6    1    1.308    398
K 3    2_4 111111    M1 T2 K3 Q4 E5 K6 T7    1    1.321    8855
Q 4    2_4 111111    T2 K3 Q4 E5 K6 T7 A8    1    1.332    8855
E 5    4_5 111111114*    M1 T2 K3 Q4 E5 K6 T7 A8 L9 F56    1    2.240    4789
K 6    5_4 111111111    M1 T2 K3 Q4 E5 K6 T7 A8 L9 N10    1    1.161    1544
T 7    4_4 11111111    K3 Q4 E5 K6 T7 A8 L9 N10 M11    1    0.676    32,854
A 8    4_4 11111111    Q4 E5 K6 T7 A8 L9 N10 M11 A12    1    0.664    32,854
L 9    4_8 111111114313*    E5 K6 T7 A8 L9 N10 M11 A12 R13 Y49 C52 L53 F56    1    1.211    183
N 10    4_4 11111111    K6 T7 A8 L9 N10 M11 A12 R13 F14    1    0.585    32,854
M 11    4_4 11111111    T7 A8 L9 N10 M11 A12 R13 F14 I15    1    0.597    32,854
A 12    4_8 111111114313*    A8 L9 N10 M11 A12 R13 F14 I15 R16 A45 L48 Y49 C52    1    1.421    183
R 13    4_6 1111111143*    L9 N10 M11 A12 R13 F14 I15 R16 S17 D46 Y49    1    1.561    533
F 14    4_4 11111111    N10 M11 A12 R13 F14 I15 R16 S17 Q18    1    0.598    32,854
I 15    4_5 111111114*    M11 A12 R13 F14 I15 R16 S17 Q18 T19 A45    2    2.844    11
R 16    4_8 111111114313*    A12 R13 F14 I15 R16 S17 Q18 T19 L20 H42 A45 D46 Y49    1    1.184    183
S 17    4_4 11111111    R13 F14 I15 R16 S17 Q18 T19 L20 T21    1    0.676    32,854
Q 18    4_4 11111111    F14 I15 R16 S17 Q18 T19 L20 T21 L22    1    0.594    32,854
T 19    4_9 1111111141213*    I15 R16 S17 Q18 T19 L20 T21 L22 L23 C38 E39 L41 H42 A45    2    2.748    31
L 20    4_6 1111111143*    R16 S17 Q18 T19 L20 T21 L22 L23 E24 E39 H42    1    1.452    533
T 21    4_4 11111111    S17 Q18 T19 L20 T21 L22 L23 E24 K25    1    0.751    32,854
L 22    4_5 111111114*    Q18 T19 L20 T21 L22 L23 E24 K25 L26 C38    2    2.814    11
L 23    4_7 11111111431*    T19 L20 T21 L22 L23 E24 K25 L26 N27 A35 C38 E39    1    1.141    858
E 24    4_4 11111111    L20 T21 L22 L23 E24 K25 L26 N27 E28    1    0.748    32,854
K 25    4_4 11111111    T21 L22 L23 E24 K25 L26 N27 E28 L29    1    0.754    32,854
L 26    4_8 111111111121    L22 L23 E24 K25 L26 N27 E28 L29 D30 A31 D32 Q34 A35    1    0.769    117
N 27    4_6 1111111113    L23 E24 K25 L26 N27 E28 L29 D30 A31 D32 A35    1    1.289    86
E 28    4_3 1111111    E24 K25 L26 N27 E28 L29 D30 A31    1    1.349    6673
L 29    4_2 111111    K25 L26 N27 E28 L29 D30 A31    1    1.451    7344
D 30    4_2 111111    L26 N27 E28 L29 D30 A31 D32    1    2.833    7344
A 31    5_4 111111111    L26 N27 E28 L29 D30 A31 D32 E33 Q34 A35    2    1.660    206
D 32    4_4 13111111    L26 N27 D30 A31 D32 E33 Q34 A35 D36    2    1.047    57
E 33    2_4 111111    A31 D32 E33 Q34 A35 D36 I37    1    1.216    8855
Q 34    4_4 41111111*    L26 A31 D32 E33 Q34 A35 D36 I37 C38    1    0.858    604
A 35    7_4 31411111111*    L23 L26 N27 A31 D32 E33 Q34 A35 D36 I37 C38 E39    1    1.283    939
D 36    4_4 11111111    D32 E33 Q34 A35 D36 I37 C38 E39 S40    1    0.714    32,854
I 37    4_4 11111111    E33 Q34 A35 D36 I37 C38 E39 S40 L41    1    0.715    32,854
C 38    7_4 31411111111*    T19 L22 L23 Q34 A35 D36 I37 C38 E39 S40 L41 H42    1    1.194    939
E 39    7_4 13411111111*    T19 L20 L23 A35 D36 I37 C38 E39 S40 L41 H42 D43    1    2.427    487
S 40    4_4 11111111    D36 I37 C38 E39 S40 L41 H42 D43 H44    1    0.747    32,854
L 41    5_4 411111111*    T19 I37 C38 E39 S40 L41 H42 D43 H44 A45    1    1.445    5303
H 42    7_4 31411111111*    R16 T19 L20 C38 E39 S40 L41 H42 D43 H44 A45 D46    1    1.176    939
D 43    4_4 11111111    E39 S40 L41 H42 D43 H44 A45 D46 E47    1    0.751    32,854
H 44    4_4 11111111    S40 L41 H42 D43 H44 A45 D46 E47 L48    1    0.707    32,854
A 45    8_4 313411111111*    A12 I15 R16 T19 L41 H42 D43 H44 A45 D46 E47 L48 Y49    1    1.474    213
D 46    6_4 3411111111*    R13 R16 H42 D43 H44 A45 D46 E47 L48 Y49 R50    1    1.685    487
E 47    4_4 11111111    D43 H44 A45 D46 E47 L48 Y49 R50 S51    1    0.678    32,854
L 48    5_4 411111111*    A12 H44 A45 D46 E47 L48 Y49 R50 S51 C52    1    1.406    5303
Y 49    8_4 313411111111*    L9 A12 R13 R16 A45 D46 E47 L48 Y49 R50 S51 C52 L53    1    1.181    213
R 50    4_4 11111111    D46 E47 L48 Y49 R50 S51 C52 L53 A54    1    0.573    32,854
S 51    4_4 11111111    E47 L48 Y49 R50 S51 C52 L53 A54 R55    1    0.559    32,854
C 52    6_4 3411111111*    L9 A12 L48 Y49 R50 S51 C52 L53 A54 R55 F56    1    1.734    487
L 53    5_3 41111111*    L9 Y49 R50 S51 C52 L53 A54 R55 F56    1    0.942    349
A 54    4_2 111111    R50 S51 C52 L53 A54 R55 F56    1    1.241    7344
R 55    4_1 11111    S51 C52 L53 A54 R55 F56    1    0.596    621

For each residue are shown local environment and best-fitting template in lookup table. LEs involved in tertiary interactions are marked *.


TABLE VII. Template Coverage of Protein G Domain III

2igd_, n_res = 61

Res    LE type    CAs in LE    Rank    RMSD (Å)    Freq.
(LE type and CAs in LE describe the local environment; rank, RMSD, and frequency describe the best-fitting template.)

T 2    1_2 111    M1 T2 P3 A4    1    0.621    530
P 3    2_3 11114    M1 T2 P3 A4 V5 V26    1    2.878    442
A 4    2_3 11114    T2 P3 A4 V5 T6 V26    1    1.183    442
V 5    2_6 11114111    P3 A4 V5 T6 T7 K24 A25 V26 D27    1    2.086    2361
T 6    2_6 11114111    A4 V5 T6 T7 Y8 K24 A25 V26 D27    1    1.803    2361
T 7    2_7 111141114    V5 T6 T7 Y8 K9 T22 T23 K24 A25 K55    1    0.436    116
Y 8    2_11 1111411133411*    T6 T7 Y8 K9 L10 T22 T23 K24 A25 A28 A31 K55 T56 F57
K 9    2_9 11114111411    T7 Y8 K9 L10 V11 E20 T21 T22 T23 K55 T56 F57    5    0.675    57
L 10    2_10 111141144111    Y8 K9 L10 V11 I12 E20 T21 T22 F35 T56 F57 T58 V59    1    4.261    48
V 11    2_10 111141114111    K9 L10 V11 I12 N13 K18 G19 E20 T21 T56 F57 T58 V59    5    1.304    111
I 12    2_9 11113111411    L10 V11 I12 N13 G14 L17 K18 G19 E20 T58 V59 T60    1    3.424    23
N 13    2_9 11112114111    V11 I12 N13 G14 K15 L17 K18 G19 T58 V59 T60 E61
G 14    2_9 11111142411    I12 N13 G14 K15 T16 L17 K18 V44 G46 V59 T60 E61
K 15    2_5 1111144    N13 G14 K15 T16 L17 K18 V44 E61    1    4.684    11
T 16    2_2 1111    G14 K15 T16 L17 K18    1    1.816    6837
L 17    5_4 111111142*    I12 N13 G14 K15 T16 L17 K18 G19 N42 V44
K 18    7_2 111111111    V11 I12 N13 G14 K15 T16 L17 K18 G19 E20    1    1.243    89
G 19    5_2 1141111    V11 I12 N13 L17 K18 G19 E20 T21    1    1.923    1374
E 20    6_2 11141111    K9 L10 V11 I12 K18 G19 E20 T21 T22    1    1.006    2917
T 21    5_2 1141111    K9 L10 V11 G19 E20 T21 T22 T23    1    0.795    1374
T 22    6_2 11141111    T7 Y8 K9 L10 E20 T21 T22 T23 K24    1    0.660    2917
T 23    5_5 1141111431*    T7 Y8 K9 T21 T22 T23 K24 A25 A31 A34 F35    1    0.874    13
K 24    6_3 111411114*    V5 T6 T7 Y8 T22 T23 K24 A25 V26 A31    1    1.048    200
A 25    6_5 11141111121*    V5 T6 T7 Y8 T23 K24 A25 V26 D27 A28 T30 A31    2    1.405    14
V 26    6_3 111411112    P3 A4 V5 T6 K24 A25 V26 D27 A28 T30    1    3.327    48
D 27    4_4 14111111*    V5 T6 A25 V26 D27 A28 E29 T30 A31    1    2.302    110
A 28    4_5 411111114*    Y8 A25 V26 D27 A28 E29 T30 A31 E32 K55    2    2.878    18
E 29    2_4 111111    D27 A28 E29 T30 A31 E32 K33    1    1.132    8855
T 30    5_4 111111111    A25 V26 D27 A28 E29 T30 A31 E32 K33 A34    1    2.544    1544
A 31    8_4 411211111111*    Y8 T23 K24 A25 D27 A28 E29 T30 A31 E32 K33 A34 F35
E 32    4_5 111111114*    A28 E29 T30 A31 E32 K33 A34 F35 K36 F57    1    0.694    4789
K 33    4_4 11111111    E29 T30 A31 E32 K33 A34 F35 K36 Q37    1    0.663    32,854
A 34    5_4 411111111*    T23 T30 A31 E32 K33 A34 F35 K36 Q37 Y38    1    0.981    5303
F 35    6_4 4411111111*    L10 T23 A31 E32 K33 A34 F35 K36 Q37 Y38 A39    2    0.917    513
K 36    4_4 11111111    E32 K33 A34 F35 K36 Q37 Y38 A39 N40    1    0.556    32,854
Q 37    4_4 11111111    K33 A34 F35 K36 Q37 Y38 A39 N40 D41    1    0.572    32,854
Y 38    4_4 11111111    A34 F35 K36 Q37 Y38 A39 N40 D41 N42    1    0.629    32,854
A 39    4_6 1111111111    F35 K36 Q37 Y38 A39 N40 D41 N42 G43 V44 D45    1    2.487    1130
N 40    4_5 111111111    K36 Q37 Y38 A39 N40 D41 N42 G43 V44 D45    1    2.398    1445
D 41    4_3 1111111    Q37 Y38 A39 N40 D41 N42 G43 V44    1    1.384    6673
N 42    5_2 4111111    L17 Y38 A39 N40 D41 N42 G43 V44    1    1.749    610
G 43    4_2 111111    A39 N40 D41 N42 G43 V44 D45    1    2.990    7344
V 44    8_3 12411111114    G14 K15 L17 A39 N40 D41 N42 G43 V44 D45 G46 E61
D 45    4_3 1311114    A39 N40 G43 V44 D45 G46 V47 E61    1    1.401    21
G 46    3_5 41111411    G14 V44 D45 G46 V47 W48 V59 T60 E61    7    2.927    13
V 47    2_5 1111411    D45 G46 V47 W48 T49 V59 T60 E61    1    0.953    1462
W 48    2_6 11114111    G46 V47 W48 T49 Y50 F57 T58 V59 T60    1    1.333    2361
T 49    2_5 1111411    V47 W48 T49 Y50 D51 F57 T58 V59    1    0.677    1462
Y 50    2_6 11113111    W48 T49 Y50 D51 D52 K55 T56 F57 T58    1    0.840    113
D 51    2_6 11111111    T49 Y50 D51 D52 A53 T54 K55 T56 F57    2    2.328    35
D 52    2_4 111111    Y50 D51 D52 A53 T54 K55 T56    2    2.523    3050
A 53    2_2 1111    D51 D52 A53 T54 K55    1    1.998    6837
T 54    3_2 11111    D51 D52 A53 T54 K55 T56    1    1.327    1826
K 55    9_2 11441111111*    T7 Y8 K9 A28 Y50 D51 D52 A53 T54 K55 T56 F57
T 56    9_2 11141121111    Y8 K9 L10 V11 Y50 D51 D52 T54 K55 T56 F57 T58    1    4.426    12
F 57    11_2 1114411141111*    Y8 K9 L10 V11 E32 W48 T49 Y50 D51 K55 T56 F57 T58 V59    4    1.231    23
T 58    9_2 11141141111    L10 V11 I12 N13 W48 T49 Y50 T56 F57 T58 V59 T60    2    0.620    105
V 59    11_2 1111411141111    L10 V11 I12 N13 G14 G46 V47 W48 T49 F57 T58 V59 T60 E61    3    1.345    18
T 60    8_1 114114111    I12 N13 G14 G46 V47 W48 T58 V59 T60 E61

For each residue are shown local environment and best-fitting template in lookup table. Gaps indicate positions where the lookup table contained no matching template. LEs involved in tertiary interactions are marked *.


improved structure was obtained with an RMSD of 2.4 Å (shown in Fig. 11).

Table VIII shows template coverage of the minimum conformation. In this conformation, 10 templates are involved in helix–helix interactions (marked *). The association of the 2 helices shows the ability of the LET potential to achieve tertiary structure.

As in the runs with the other test proteins, the coverage of the minimum structure is not complete. At 2 positions, Leu3 and Leu25, the best-fitting template in the lookup table had an occurrence frequency below 50; therefore, no energy was calculated for these positions. Nevertheless, the procedure identified the correct structure.
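The way the total in Table VIII arises can be illustrated by summing the per-residue template energies while skipping positions whose best-fitting template falls below the frequency cutoff. This is a minimal sketch using the values of Table VIII; the names `ROWS`, `MIN_FREQ`, and `total_template_energy` are hypothetical, and the signs are read from the table (all negative except Arg17).

```python
MIN_FREQ = 50  # cutoff below which no energy is assigned in this run

# (residue, occurrence frequency, energy) from Table VIII; None = no energy given
ROWS = [
    ("W2", 310, -1.160), ("L3", 23, None), ("K4", 195, -3.188),
    ("A5", 30548, -2.272), ("R6", 30548, -0.856), ("V7", 623, -2.749),
    ("E8", 30548, -1.758), ("E9", 30548, -2.973), ("E10", 30548, -1.450),
    ("L11", 388, -2.097), ("Q12", 30548, -3.801), ("A13", 30548, -3.377),
    ("L14", 590, -4.422), ("E15", 1045, -4.823), ("A16", 4238, -0.087),
    ("R17", 5602, 0.019), ("G18", 2195, -3.953), ("T19", 357, -2.680),
    ("D20", 136, -1.250), ("S21", 2272, -0.131), ("N22", 6822, -0.689),
    ("A23", 6822, -2.856), ("E24", 3211, -0.562), ("L25", 30, None),
    ("R26", 30548, -2.288), ("A27", 30548, -2.814), ("M28", 363, -2.487),
    ("E29", 30548, -1.989), ("A30", 30548, -2.815), ("K31", 4135, -1.532),
    ("L32", 301, -2.165), ("K33", 30548, -2.002), ("A34", 30548, -2.903),
    ("E35", 189, -2.417), ("I36", 141, -0.149), ("Q37", 597, -2.294),
]


def total_template_energy(rows, min_freq=MIN_FREQ):
    """Sum energies over sites whose template meets the frequency cutoff."""
    return sum(energy for _, freq, energy in rows if freq >= min_freq)


skipped = [name for name, freq, _ in ROWS if freq < MIN_FREQ]
print(skipped)                                  # ['L3', 'L25']
print(round(total_template_energy(ROWS), 1))    # -73.0
```

The two skipped sites are exactly Leu3 and Leu25, and the remaining terms reproduce the tabulated total of −73.0.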

DISCUSSION AND CONCLUSIONS

The LET potential described represents an attempt to extend the useful results obtained with 1D sequential segments for prediction of secondary structure, to 3D volume elements for prediction of tertiary structure.

It may be compared with the 4-body potential of Gan et al.23 As pointed out by these authors, defining a multibody potential over volume leads to a question of overcounting. In order to avoid this, these authors relied on a rigorous partitioning of the CAs into nonoverlapping groups of 4, using a tessellation procedure.43,44 However, as mentioned above, 4 residues may be insufficient to account adequately for cooperative effects, and this may be the reason why this potential did not show improved predictive performance (8–9 Å root-mean-square error), in comparison with a pairwise potential.23

In the LET potential as described, the question of overcounting has been left out of account. In view of the uncertainties arising from various sources in all such derivations of statistical potentials from databases,73–75 no attempt was made to apply a correction for this factor in the present version of the program.

The LET potential may also be compared with the recently described SPREK evaluation function,13 which

Fig. 7. Betanova. Superposition of NMR (blue) and predicted structure (red). RMSD = 3.0 Å.

Fig. 8. GS-peptide. Superposition of NMR (blue) and predicted structure (red). RMSD = 2.3 Å.

Fig. 9. 3β predicted structure. This structure agrees with the experimental structure as described by Griffiths-Jones and Searle68; however, no NMR coordinates were available.

Fig. 10. Antifreeze protein 1wfbA. Superposition of X-ray (blue) and predicted structure (red). RMSD = 3.0 Å.

Fig. 11. α–turn–α 1abz_. Superposition of NMR (blue) and predicted structure (red). RMSD = 2.4 Å.

Fig. 12. Monte Carlo simulated annealing cooling curve for α–turn–α.


also uses the concept of matching local environments in a database of known structures. The matching is done on the basis of matching residue types and secondary structures, using an information theory approach, rather than a direct structure-to-structure match, as in the LET potential.

The function was tested in fold recognition of proteins of 54–75 residues, against a set of 600–700 decoys. No tests for ab initio structure prediction were done.

The use of decoys involves the question of whether the conformational space (about 10^50 for a 60-residue protein, using 7 Ramachandran regions) has been adequately sampled. If this is not so, potentials may perform differently for different decoy sets.76,77

To avoid these questions, the LET potential was tested in this work by ab initio structure prediction by Monte Carlo simulated annealing, starting from random conformations, and allowing the chain to explore the total conformational space. For the 38-residue test protein α–turn–α, this minimization procedure required sampling of 7.9 million conformations (340,000 accepted steps; see Fig. 12) to find the minimum. Runs using faster cooling rates, sampling fewer conformations, did not find the minimum.
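The orders of magnitude quoted here can be checked with a short calculation. The sketch assumes 7 independent Ramachandran regions per residue, as stated for the 60-residue case; applying the same count to the 38-residue peptide is an assumption made only for comparison, and the function name is hypothetical.

```python
import math


def log10_conformations(n_residues, states_per_residue=7):
    """log10 of the number of discrete conformations, states ** n_residues."""
    return n_residues * math.log10(states_per_residue)


# 60-residue protein, 7 regions per residue: about 10^50 conformations
print(round(log10_conformations(60), 1))  # 50.7

# 38-residue alpha-turn-alpha peptide under the same assumption
print(round(log10_conformations(38), 1))  # 32.1

# Simulated annealing run described in the text
sampled = 7.9e6      # conformations sampled
accepted = 340_000   # accepted Monte Carlo steps
print(round(100 * accepted / sampled, 1))  # 4.3 (% of trial moves accepted)
```

On this estimate the run visited only a vanishing fraction (roughly 10^7 out of 10^32) of the discrete conformational space, which is consistent with the observation that faster cooling, sampling still fewer conformations, failed to find the minimum.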

The calculation of the LET energy is relatively fast in comparison with other procedures, such as tessellation,44 enabling it to be readily combined with the Monte Carlo minimization procedure.

The LET potential has the merit of simplicity. Most united-residue potentials described to date consist of several components representing different energy contributions, such as van der Waals, hydrogen bonds, electrostatic, solvent, disulfide bonds, helix dipole, and so forth.

TABLE VIII. Templates and Energy Components of Minimum Conformation for Test Protein α–Turn–α

Res     Local environment      Best-fitting template
        Type                   Rank   RMSD    Freq.    Energy
W 2     1_4 11111              1      0.669   310      −1.160
L 3     2_7 111111431*         2      2.186   23
K 4     3_5 11111114*          2      0.327   195      −3.188
A 5     4_4 11111111           1      0.427   30,548   −2.272
R 6     4_4 11111111           1      0.369   30,548   −0.856
V 7     4_7 11111111431*       1      0.677   623      −2.749
E 8     4_4 11111111           1      0.327   30,548   −1.758
E 9     4_4 11111111           1      0.269   30,548   −2.973
E 10    4_4 11111111           1      0.300   30,548   −1.450
L 11    4_6 1111111143*        1      1.020   388      −2.097
Q 12    4_4 11111111           1      0.217   30,548   −3.801
A 13    4_4 11111111           1      0.199   30,548   −3.377
L 14    4_6 1111111111         2      0.320   590      −4.422
E 15    4_5 111111111          2      0.382   1045     −4.823
A 16    4_3 1111111            1      2.010   4238     −0.087
R 17    4_2 111111             1      2.165   5602     0.019
G 18    4_2 111111             2      0.340   2195     −3.953
T 19    5_2 1111111            4      0.695   357      −2.680
D 20    4_2 131111             1      0.550   136      −1.250
S 21    2_4 111111             3      1.763   2272     −0.131
N 22    2_4 111111             1      1.173   6822     −0.689
A 23    2_4 111111             1      0.413   6822     −2.856
E 24    3_4 1111111            1      0.976   3211     −0.562
L 25    5_4 411111111*         4      2.376   30
R 26    4_4 11111111           1      0.255   30,548   −2.288
A 27    4_4 11111111           1      0.202   30,548   −2.814
M 28    6_4 4411111111*        1      0.925   363      −2.487
E 29    4_4 11111111           1      0.254   30,548   −1.989
A 30    4_4 11111111           1      0.258   30,548   −2.815
K 31    5_4 411111111*         1      0.970   4135     −1.532
L 32    7_4 13411111111*       1      0.834   301      −2.165
K 33    4_4 11111111           1      0.308   30,548   −2.002
A 34    4_4 11111111           1      0.343   30,548   −2.903
E 35    5_3 41111111*          2      0.574   189      −2.417
I 36    5_2 4111111*           4      1.083   141      −0.149
Q 37    4_1 11111              1      0.217   597      −2.294
Total template energy                                  −73.0

The template lookup table for this simulation was derived with R_LE = 8.0 Å, rd_max = 1.3 Å. For residues L3 and L25, the best-fitting template had an occurrence frequency of less than 50; therefore, no energy was calculated for these positions. Templates marked * are involved in tertiary interactions.
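As a consistency check on Table VIII, the per-residue energies can be summed while skipping positions whose best-fitting template falls below the frequency cutoff of 50. The sketch below transcribes the frequency and energy columns from the table; the data layout and function name are illustrative and are not taken from the original program.

```python
# (residue, template frequency, template energy); None = no energy listed in Table VIII
TABLE_VIII = [
    ("W2", 310, -1.160), ("L3", 23, None), ("K4", 195, -3.188),
    ("A5", 30548, -2.272), ("R6", 30548, -0.856), ("V7", 623, -2.749),
    ("E8", 30548, -1.758), ("E9", 30548, -2.973), ("E10", 30548, -1.450),
    ("L11", 388, -2.097), ("Q12", 30548, -3.801), ("A13", 30548, -3.377),
    ("L14", 590, -4.422), ("E15", 1045, -4.823), ("A16", 4238, -0.087),
    ("R17", 5602, 0.019), ("G18", 2195, -3.953), ("T19", 357, -2.680),
    ("D20", 136, -1.250), ("S21", 2272, -0.131), ("N22", 6822, -0.689),
    ("A23", 6822, -2.856), ("E24", 3211, -0.562), ("L25", 30, None),
    ("R26", 30548, -2.288), ("A27", 30548, -2.814), ("M28", 363, -2.487),
    ("E29", 30548, -1.989), ("A30", 30548, -2.815), ("K31", 4135, -1.532),
    ("L32", 301, -2.165), ("K33", 30548, -2.002), ("A34", 30548, -2.903),
    ("E35", 189, -2.417), ("I36", 141, -0.149), ("Q37", 597, -2.294),
]

MIN_FREQ = 50  # templates occurring fewer than 50 times contribute no energy

def total_template_energy(rows, min_freq=MIN_FREQ):
    """Sum the energies of residues whose template meets the frequency cutoff."""
    total = 0.0
    for _residue, freq, energy in rows:
        if freq < min_freq or energy is None:
            continue  # e.g. L3 and L25 in Table VIII
        total += energy
    return total

total = total_template_energy(TABLE_VIII)
print(f"{total:.1f}")  # prints -73.0
```

The sum over the 34 scored residues reproduces the total template energy given in the last row of the table.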


The parameters and weightings for these have to be chosen semiempirically, or by lengthy optimization procedures.20–22,78,79

The LET potential, on the other hand, consists essentially of a single potential (apart from the main-chain collision penalty), in which the above contributions are included implicitly. The value of R_LE selected, 8.0 Å, encompasses the CA separation of main-chain hydrogen bonds, 6.2–7.2 Å.20,41,42 It covers commonly used cutoffs for electrostatic interactions in the range 7.0–8.0 Å.8–10,29 The count of neighboring CAs within 8.0 Å was shown to correlate with hydrophobic burial.80 The value 8.0 Å also covers the CA separation of disulfide-bonded cysteines, which extends up to 6.5 Å.81 This suggests that these contributions are largely included implicitly in the LET potential, and that auxiliary components representing them are not required.
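The neighbor count within R_LE is simple to compute from CA coordinates. The following is a minimal sketch, assuming that the central residue itself is excluded from its local environment; the test chain is a hypothetical straight line of CAs spaced 3.8 Å apart, not a real structure.

```python
import math

R_LE = 8.0  # local-environment radius in Å, the value selected in the text

def local_environment(ca_coords, center, radius=R_LE):
    """Indices of residues whose CA lies within `radius` of residue `center`."""
    origin = ca_coords[center]
    return [j for j, xyz in enumerate(ca_coords)
            if j != center and math.dist(origin, xyz) <= radius]

# Hypothetical straight chain: consecutive CAs 3.8 Å apart along the x axis
chain = [(3.8 * i, 0.0, 0.0) for i in range(7)]
counts = [len(local_environment(chain, i)) for i in range(len(chain))]
print(counts)  # prints [2, 3, 4, 4, 4, 3, 2]
```

In this artificial geometry only sequence neighbors up to two positions away (7.6 Å) fall inside the sphere; in a folded chain, residues distant in sequence also enter, which is how the count comes to reflect burial.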

After a few basic operating parameters, such as R_LE and rd_max, have been selected, determination of the ~70,000 other template parameters follows in a relatively straightforward manner, by application of Eqs. (1) and (2). While this number of parameters is larger than that used in other whole-residue potentials, it may not be unreasonable, considering the well-known, very large conformational freedom of protein chains.

The database size of 750,000 residues implies an overall ratio of 11 residues/parameter, which may be compared to ratios of 20 residues/parameter in early derivations of statistical potentials,16,82 suggesting that the program could benefit from future increases in the database size.
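The ratio above follows directly from the two figures quoted in the text. The short calculation below also estimates, under the assumption that the parameter count stays fixed at about 70,000, how large the database would have to be to reach the earlier ratio of 20 residues/parameter.

```python
database_residues = 750_000    # residues in the database (from the text)
template_parameters = 70_000   # approximate number of template parameters

ratio = database_residues / template_parameters
print(f"{ratio:.1f} residues per parameter")  # prints 10.7 residues per parameter

# Database size needed for 20 residues/parameter, assuming the parameter
# count does not grow with the database (an assumption, not a source claim)
target_ratio = 20
needed_residues = target_ratio * template_parameters
print(f"{needed_residues:,} residues")  # prints 1,400,000 residues
```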

Future improvements to the potential would include taking into account the different spreads of the templates. Here, for simplicity, a single fixed value was used for all templates. As a result, templates with smaller spreads may not be optimally resolved, while templates with larger spreads are under-represented. Also, the template coverage for some proteins is not complete. Using the appropriate spread for each template should allow more complete use of the database, and better template coverage of proteins.

The results obtained with the simple proteins presented here, all-α proteins and simple β proteins consisting of a single β-sheet, suggest that, with improvements in the clustering procedure and further increases in the database size, the approach could be developed into a useful method for structure prediction of proteins with more complex structures.

ACKNOWLEDGMENTS

Thanks are due to Professors Huber and Holak for discussion and advice, to Professor Fernandez for a critical reading of the manuscript, to Dr. Kortemme for supplying coordinates of Betanova, and to Dr. Jimenez for coordinates of GS peptide.

REFERENCES

1. Gibson KD, Scheraga HA. Minimization of polypeptide energy: I. Preliminary structures of bovine pancreatic ribonuclease S-peptide. Proc Natl Acad Sci USA 1967;58:420–427.

2. Hagler AT, Huler E, Lifson S. Energy functions for peptides and proteins: I. Derivation of a consistent force field including the hydrogen bond from amide crystals. J Am Chem Soc 1974;96:5319–5327.

3. Cornell WD, et al. A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc 1995;117:5179–5197.

4. Smith JC, Karplus M. Empirical force field studies of geometries and conformational transitions of some organic molecules. J Am Chem Soc 1992;114:801–812.

5. Nemethy G, et al. Energy parameters in polypeptides: 10 improved geometrical parameters and nonbonded interactions for use in the ECEPP/3 algorithm, with applications to proline-containing peptides. J Phys Chem 1992;96:6472–6484.

6. Lindahl E, Hess B, van der Spoel D. GROMACS 3.0: a package for molecular simulation and trajectory analysis. J Mol Model 2001;7:306–317.

7. Duan Y, Kollman PA. Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 1998;282:740–744.

8. Ferrara P, Caflisch A. Folding simulations of a three-stranded antiparallel β-sheet peptide. Proc Natl Acad Sci USA 2000;97:10780–10785.

9. Cavalli A, Ferrara P, Caflisch A. Weak temperature dependence of the free energy surface and folding pathways of structured peptides. Proteins 2002;47:305–314.

10. Shen M, Freed KF. All-atom fast protein simulations: the villin headpiece. Proteins 2002;49:439–445.

11. Snow CD, et al. Absolute comparison of simulated and experimental protein-folding dynamics. Nature 2002;40:1–4.

12. Bujnicki JM, et al. LiveBench-2: Large-scale automated evaluation of protein structure prediction servers. Proteins 2001;Suppl 5:184–191.

13. Taylor WR, Jonassen I. A structural pattern-based method for protein fold recognition. Proteins 2004;56:222–234.

14. Bonneau R, et al. De novo prediction of three-dimensional structures for major protein families. J Mol Biol 2002;322:65–78.

15. Moult J, et al. The significance of performance ranking in CASP. Structure 2002;10:291–293.

16. Tanaka S, Scheraga HA. Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules 1976;9:945–950.

17. Miyazawa S, Jernigan RL. Estimation of effective contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 1985;18:534–552.

18. Miyazawa S, Jernigan RL. Residue–residue potentials with a favourable contact pair term and an unfavourable high packing density term, for simulation and threading. J Mol Biol 1996;256:623–644.

19. Sippl MJ. Calculation of conformational ensembles from potentials of mean force. J Mol Biol 1990;213:859–883.

20. Kolinski A, Godzik A, Skolnick J. A general method for the prediction of the three-dimensional structure and folding pathway of globular proteins: application to designed helical proteins. J Chem Phys 1993;98:7420–7433.

21. Liwo A, et al. United-residue force field for off-lattice protein structure simulations: III. Origin of backbone hydrogen-bonding cooperativity in united-residue potentials. J Comput Chem 1998;19:259–276.

22. Jones DT. Successful ab initio prediction of the tertiary structure of NK-Lysin using multiple sequences and recognized supersecondary structural motifs. Proteins 1997;Suppl 1:185–191.

23. Gan HH, Tropsha A, Schlick T. Lattice protein folding with two and four-body statistical potentials. Proteins 2001;43:161–174.

24. Betancourt MR. A reduced protein model with accurate native structure identification ability. Proteins 2003;53:889–907.

25. Betancourt MR, Thirumalai D. Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Protein Sci 1999;8:361–369.

26. Vendruscolo M, Domany E. Pairwise contact potentials are unsuitable for protein folding. J Chem Phys 1998;109:11101–11108.

27. Vendruscolo M, Domany E. Can pairwise contact potentials stabilize native protein against decoys obtained by threading? Proteins 2000;38:134–148.

28. Carter CW Jr, LeFebvre BC, Cammer SA, Tropsha A, Edgell MH. Four-body potentials reveal protein-specific correlations to stability changes caused by hydrophobic core mutations. J Mol Biol 2001;311:625–638.

29. Fernandez A. Conformation-dependent environments in folding proteins. J Chem Phys 2001;114:2489–2502.

30. Khatun J, Sagar DK, Dokholyan NV. Can contact potentials reliably predict stability of proteins? J Mol Biol 2004;336:1223–1238.

31. Unger R, Harel D, Wherland S, Sussman JL. A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins 1989;5:355–373.

32. Suchhardt J, et al. Local structure motifs of protein backbones are classified by self-organizing neural networks. Protein Eng 1996;9:833–842.

33. Bystroff C, Baker D. Blind predictions of local protein structure in CASP2 targets using the I-Sites library. Proteins 1997;Suppl 1:167–171.

34. Fetrow JS, et al. Patterns, structures, and amino acid frequencies in structural building blocks, a protein secondary structure classification scheme. Proteins 1997;27:249–271.

35. Camproux AC, et al. Hidden Markov model approach for identifying the modular framework of the protein backbone. Protein Eng 1999;12:1063–1073.

36. Deane CM, Blundell TL. A novel exhaustive search algorithm of polypeptide segments in proteins. Proteins 2000;40:135–144.

37. de Brevern AG, Etchebest C, Hazout S. Bayesian probabilistic approach for predicting structures in terms of protein blocks. Proteins 2000;41:271–287.

38. de Brevern AG, Valadie H, Hazout S, Etchebest C. Extension of a local backbone description using a structural alphabet: A new approach to the sequence–structure relationship. Protein Sci 2002;11:2871–2886.

39. Figureau. A pentapeptide-based method for protein secondary structure prediction. Protein Eng 2003;16:103–108.

40. Rooman MJ, Rodriguez, Wodak SJ. Automatic definition of recurrent local structure motif in proteins. J Mol Biol 1990;213:327–336.

41. Kolinski A, Skolnick J. Discretized model of proteins: I. Monte Carlo study of cooperativity in homopolypeptides. J Chem Phys 1992;97:9412–9426.

42. Fernandez A, Sosnick TR, Colubri A. Dynamics of hydrogen bond desolvation in protein folding. J Mol Biol 2002;321:659–675.

43. Munson PJ, Singh RJ. Statistical significance of hierarchical multibody potentials based on Delaunay tessellation and their application in sequence–structure alignment. Protein Sci 1997;6:1467–1481.

44. Singh RK, Tropsha A, Vaisman I. Delaunay tessellation of proteins: Four-body nearest neighbour propensities of amino acids. J Comput Biol 1996;3:213–221.

45. Karlin S, Zhu Z. Characterization of diverse residue clusters in protein three-dimensional structures. Proc Natl Acad Sci USA 1996;93:8344–8349.

46. Zhu Z, Karlin S. Clusters of charged residues in protein three-dimensional structures. Proc Natl Acad Sci USA 1996;93:8350–8355.

47. Jonassen I, Eidhammer I, Taylor WR. Discovery of local packing motifs in protein structures. Proteins 1999;34:206–219.

48. Jonassen I, Eidhammer I, Conklin D, Taylor WR. Structure motif discovery and mining the PDB. Bioinformatics 2002;18:362–367.

49. Herringa J, Argos P. Side-chain clusters in protein structures and their role in protein folding. J Mol Biol 1991;220:151–171.

50. Bagci Z, Kloczkowski A, Jernigan RL, Bahar I. The origin and extent of coarse-grained regularities in protein internal packing. Proteins 2003;53:56–67.

51. Chothia C. Principles that determine the structure of proteins. Ann Rev Biochem 1984;53:537–572.

52. Cohen FE, Sternberg MJE, Taylor WR. Analysis and prediction of the packing of alpha-helices against a beta-sheet in the tertiary structure of globular proteins. J Mol Biol 1982;156:821–862.

53. Cohen FE, Sternberg MJE, Taylor WR. Analysis of the tertiary structure of protein beta-sheet sandwiches. J Mol Biol 1981;148:253–277.

54. Murzin A. Structural principles for the propeller assembly of beta-sheets: the preference for seven-fold symmetry. Proteins 1992;14:191–201.

55. Chothia C, Janin J. Orthogonal packing of beta-pleated sheets in proteins. Biochem 1982;21:3955–3965.

56. Dunbrack R. A protein sequence culling server. Available online at http://www.fccc.edu/research/labs/dunbrack/culledpdb.html

57. McClachlan AD. Gene duplications in the structural evolution of chymotrypsin. J Mol Biol 1979;128:49–79.

58. Zhu Z, Blundell TL. The use of amino acid patterns of classified helices and strands in secondary structure prediction. J Mol Biol 1996;260:261–276.

59. Hutchinson E, Thornton JM. A revised set of potentials for beta-turn formation in protein. Protein Sci 1994;3:2207–2216.

60. Chou PY, Fasman GD. Beta-turns in proteins. J Mol Biol 1977;115:135–175.

61. Richardson JS, Richardson DC. Amino acid preferences for specific locations at the ends of alpha helices. Science 1988;240:1648–1652.

62. Cohen C, Parry DAD. Alpha-helical coiled coils and bundles: how to design an alpha-helical protein. Proteins 1990;7:1–15.

63. Lupas A, van Dyke M, Stock J. Predicting coiled coils from protein sequences. Science 1991;252:1162–1164.

64. Paliakasis CD, Kokkinidis M. Relationships between sequence and structure for the four-alpha-helix bundle tertiary motif in proteins. Protein Eng 1992;5:739–748.

65. Woolfson DN, Alber T. Predicting oligomerization states of coiled coils. Protein Sci 1995;4:1596–1607.

66. Kortemme T, Ramirez-Alvarado M, Serrano L. Design of a 20-amino acid, three-stranded beta-sheet protein. Science 1998;281:253–256.

67. de Alba E, Santoro J, Rico M, Jimenez MA. De novo design of a monomeric three-stranded antiparallel beta-sheet. Protein Sci 1999;8:854–865.

68. Griffiths-Jones, Searle MS. Structure, folding, and energetics of cooperative interactions between the beta-strands of a de novo designed three-stranded antiparallel beta-sheet peptide. J Am Chem Soc 2000;122:8350–8356.

69. Sicheri F, Yang DSC. Ice-binding structure and mechanism of an antifreeze protein from winter flounder. Nature 1995;375:427–431.

70. Fezoui Y, Connolly PJ, Osterhout JJ. Solution structure of alpha-T-alpha, a helical hairpin peptide of de novo design. Protein Sci 1997;6:1869–1877.

71. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller EJ. Equation of state calculation by fast computing machines. Chem Phys 1952;21:1087–1092.

72. Wintjens RT, Rooman MJ, Wodak SJ. Automatic classification and analysis of alpha-alpha-turn motifs in proteins. J Mol Biol 1996;255:235–253.

73. Rooman MJ, Wodak SJ. Are database-derived potentials valid for scoring forward and inverted protein folding? Protein Eng 1995;8:849–858.

74. Thomas PD, Dill KA. Statistical potentials extracted from protein structures: How accurate are they? J Mol Biol 1996;257:457–469.

75. Ben-Naim. Statistical potentials extracted from protein structures: Are these meaningful potentials? J Chem Phys 1997;107:3698–3706.

76. Park BH, Huang ES, Levitt M. Factors affecting the ability of energy functions to discriminate correct from incorrect folds. J Mol Biol 1997;266:831–846.

77. Zhou H, Zhou Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 2002;11:2714–2726.

78. Liwo A, et al. A method for optimizing potential-energy functions by a hierarchical design of the potential-energy landscape: application to the UNRES force field. Proc Natl Acad Sci USA 2002;99:1937–1942.

79. Avbelj F, Fele L. Prediction of the three-dimensional structure of proteins using the electrostatic screening model and hierarchic condensation. Proteins 1998;31:74–96.

80. Nishikawa K, Ooi T. Prediction of the surface–interior diagram of globular proteins by an empirical method. Int J Protein Peptide Res 1980;16:19–32.

81. Sowdhamini R, et al. Stereochemical modelling of disulfide bridges: criteria for introduction into proteins by site-directed mutagenesis. Protein Eng 1989;3:95–103.

82. Warme PK, Morgan RS. A survey of amino acid side-chain interactions in 21 proteins. J Mol Biol 1978;118:289–304.
