When is a Potential Accurate Enough for Structure Prediction? Theory and Application to a Random Heteropolymer Model of Protein Folding Joseph D. Bryngelson SFI WORKING PAPER: 1992-11-053 SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant. ©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder. www.santafe.edu SANTA FE INSTITUTE



When is a Potential Accurate Enough for Structure Prediction? Theory and Application to a Random Heteropolymer Model of Protein Folding

Joseph D. Bryngelson
Complex Systems Group (T-13), Theoretical Division
Los Alamos National Laboratory, Los Alamos, NM 87545, USA

October 27, 1992

Abstract

Attempts to predict molecular structure often try to minimize some potential function over a set of structures. Much effort has gone into creating potential functions and algorithms for minimizing these potential functions. This paper develops a formalism that addresses a complementary question: What are the accuracy requirements for a potential function that predicts molecular structure? The formalism is applied to a simple model of a protein structure potential. The results of this calculation show that for high accuracy predictions (~1 Å rms deviation) of a typical protein, the monomer-monomer interaction energies must be accurate to within five to fifteen percent. The paper closes with a discussion of the implications of these results for practical structure prediction.

1 Introduction

The theoretical prediction of the structure of a molecule or an assembly of molecules, such as a cluster, frequently involves the minimization of a potential function. Examples of this activity range from using sophisticated techniques of modern quantum chemistry to obtain high accuracy predictions of structures of small molecules in the gas phase, to using semi-empirical potentials of mean force to predict the structures of macromolecules in solution. Particularly for large molecules, much effort has gone into developing accurate potentials that require tractable amounts of computer time for their evaluation, and into developing efficient algorithms for finding the deepest minimum of these potentials. This paper addresses another aspect of structure prediction: the accuracy required of a potential that predicts molecular structure. If the potential function is not accurate enough, then even the best possible minimization algorithm is useless for predicting structure. However, if the accuracy requirements are known, then definite goals for potential creation exist, and researchers can concentrate on problems that are solvable with the present potentials. The formalism derived here is general and applicable to any structure prediction problem. One of the most important unsolved problems in molecular structure prediction is the prediction of protein structure from amino acid sequence. Therefore, this paper applies the formalism to a simple protein model to estimate the accuracy needed for a potential that predicts protein structure.

This paper will discuss the problem of predicting the full three-dimensional, or tertiary, structure of a protein. Most attempts to predict protein tertiary structure are, at least implicitly, based on the thermodynamic hypothesis, which states that a protein in solution folds to the configuration that minimizes the free energy of the protein plus solvent system. [1] Typically the solvent is water. This hypothesis suggests a general strategy for predicting protein tertiary structure from sequence. First, the researcher develops a semi-empirical potential function $V_{\mathrm{approx}}(q, s)$, which approximates the free energy of the protein-solvent system as a function of $q$, the three-dimensional configuration of the protein, and $s$, the sequence of the protein. Henceforth the dependence on $s$ will be suppressed in my notation. Next, the researcher attempts to solve the problem by finding the configuration $q^{\mathrm{approx}}_{\min}$ that minimizes $V_{\mathrm{approx}}(q)$. The configuration $q^{\mathrm{approx}}_{\min}$ is the predicted protein structure. Although the above general strategy has successfully predicted the structure of small polypeptides, [2] it has met with limited to non-existent success in predicting the structure of globular proteins. This failure has typically been attributed to the difficulty of finding the minimum of the potential functions, so that a great deal of effort has gone into algorithms for optimizing these potential functions. This paper analyzes a complementary question, the potential accuracy question: How accurate must the potential function, $V_{\mathrm{approx}}(q)$, be so that $q^{\mathrm{approx}}_{\min}$ is the correct structure?

The protein calculation presented here has a forerunner in the form of a paper by Shakhnovich and Gutin on the probability of a neutral mutation in a protein. [3] Their calculation is related to a special case of the calculation presented here. The present paper also makes explicit the mathematical approximations and the notion of structure implicit in the mutation paper. I have used some of the same notation as the mutation paper so the reader can easily compare the two.

The following Section defines how the term "structure" is used in this paper, and poses the above question in a mathematically precise manner. Sections Three and Four formally solve the potential accuracy problem for some simple cases. A simple model of a protein potential function is described in Section Five, and the results of Sections Three and Four are applied to this model in Section Six. The paper concludes with a critical discussion of these results, their implications for real protein structure prediction, and a look at some future directions. Readers not interested in technical details should skip Section Six.

2 Posing the Problem

Consider a representation of the arrangement of atoms in three-dimensional space. Coarse-grain this configuration space so there are a countable number of discrete states. If the representation of the arrangement of atoms was discrete to begin with, then there is no need to coarse-grain. When I refer to a "structure" or (for emphasis) "discrete structure" in this paper, I mean one of these discrete states. Each structure will be labeled with an integer. The value of the potential function for structure $i$ will be denoted $E_i$, and I will refer to it as the "energy" of structure $i$, even though it is really a free energy or the value of a potential of mean force. This loose terminology will prove convenient.

The notion of coarse-graining and discrete structure may become clearer with a simple example. Consider a small molecule with only one flexible bond, so the spatial arrangement of atoms can be specified by denoting a bond angle, $\theta$. To coarse-grain the bond angle, I could consider the molecule to be in structure one if $0^\circ \le \theta < 10^\circ$, structure two if $10^\circ \le \theta < 20^\circ$, and so on. The value of a potential function for each of these discrete structures must be determined by some fixed algorithm. To continue this example, if I am given a continuous potential $V_{\mathrm{approx}}(\theta)$, then I could define the values of this potential for the discrete structures to be $E_1 = V_{\mathrm{approx}}(5^\circ)$, $E_2 = V_{\mathrm{approx}}(15^\circ)$, and so on. I can specify the spatial arrangement of atoms more or less accurately by making the discretization finer or coarser. The formalism discussed in this paper is independent of the manner of representing the arrangement of atoms and the details of the coarse-graining procedure, which may be chosen to suit the application at hand.
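The bond-angle example can be sketched in a few lines of Python. This is only an illustration: the function names and the single-well `toy_potential` are assumptions introduced here and are not part of the paper. The sketch labels 10° bins with integers starting at one and assigns each bin the value of the continuous potential at its midpoint.

```python
import math

def structure_index(theta_deg, bin_width=10.0):
    """Map a bond angle in degrees to a 1-based discrete structure label."""
    theta = theta_deg % 360.0
    return int(theta // bin_width) + 1

def discrete_energies(potential, bin_width=10.0):
    """Evaluate a continuous potential at the midpoint of every angular bin."""
    n_bins = int(round(360.0 / bin_width))
    return [potential((k + 0.5) * bin_width) for k in range(n_bins)]

def toy_potential(theta_deg):
    # Hypothetical single-well torsional potential (illustration only),
    # with its minimum placed at 65 degrees.
    return 1.0 - math.cos(math.radians(theta_deg - 65.0))

energies = discrete_energies(toy_potential)   # energies[0] is E_1, energies[1] is E_2, ...
predicted = min(range(len(energies)), key=energies.__getitem__) + 1
print("predicted structure label:", predicted)
```

Making `bin_width` smaller refines the discretization, at the cost of more structures to compare.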

The approximate potential $V_{\mathrm{approx}}(q)$ assigns a real number to each structure $q$. The structure with the lowest $V_{\mathrm{approx}}(q)$ is labeled $q^{\mathrm{approx}}_{\min}$ and is the predicted structure. There is also some real, exact energy of the molecule or the molecule-solvent system, given by the potential $V_{\mathrm{real}}(q)$. The potential $V_{\mathrm{real}}(q)$ also assigns a real number to each of the structures. Denote by $q^{\mathrm{real}}_{\min}$ the discrete structure with the lowest value of $V_{\mathrm{real}}(q)$. The structure $q^{\mathrm{approx}}_{\min}$ is the structure predicted by the approximate potential $V_{\mathrm{approx}}(q)$ and $q^{\mathrm{real}}_{\min}$ is the structure in nature, so if $q^{\mathrm{real}}_{\min} = q^{\mathrm{approx}}_{\min}$, then $V_{\mathrm{approx}}(q)$ predicts the correct structure. The approximate potential function will in general have inaccuracies due to inaccuracies in parameters, neglect of physical effects, and so on. The errors in $V_{\mathrm{approx}}(q)$ are denoted by

$$\delta E(q) = V_{\mathrm{approx}}(q) - V_{\mathrm{real}}(q) \tag{1}$$

and may be thought of as noise added to the true potential. Notice that $\delta E(q)$ is a function of each discrete state. Since the real potential $V_{\mathrm{real}}(q)$ is unknown, $\delta E(q)$ is also unknown. However, the knowledge used in constructing the potential can also be used to find a probability distribution for $\delta E$. The expected size of the inaccuracies in the approximate potential can usually be estimated. For example, one can estimate the size of the physical effects that are ignored, and the inaccuracies in the parameters. These estimates of inaccuracy can be interpreted to mean that the best one can really do is to give a probability distribution for the energy, parameter, etc., with a width given by the estimate of inaccuracy. The value of $\delta E$ is given by the sum of all of these inaccuracies, and if there are a large number of terms in this sum, which will typically be true for a large molecule potential, the central limit theorem implies that $p(\delta E)$, the probability distribution of $\delta E$, is a Gaussian whose mean and width are found from the inaccuracy estimates. Now the potential accuracy question can be put into a precise, mathematical form: Given $p(\delta E)$, what is the probability that $q^{\mathrm{real}}_{\min}$ and $q^{\mathrm{approx}}_{\min}$ are the same structure?

3 Deterministic Case

To illustrate the formal solutions to the potential accuracy question, I will start with the simplest examples and proceed to examples of greater complexity. After developing the formalism sufficiently, I will apply it to a simple model of a protein potential function.

The simplest problem is that of two structures. A scientist uses an approximate potential $V_{\mathrm{approx}}(q)$ to calculate the energies, $E'_0$ and $E'_1$, for structures 0 and 1 respectively. Without loss of generality, I assume $E'_0 < E'_1$, so $q^{\mathrm{approx}}_{\min} = 0$. For concreteness, one could think of a spin in a solid that could point in one of only two directions, surely the simplest kind of "structure." These structures also have real energies $E_0$ and $E_1$, which are related to the approximate energies by

$$\begin{aligned} E'_0 &= E_0 + \delta E_0 \\ E'_1 &= E_1 + \delta E_1 \end{aligned} \tag{2}$$

where the $\delta E_i$ are errors in $V_{\mathrm{approx}}(q)$ and the distribution of the $\delta E_i$ is $p(\delta E_i)$. For this simple case the potential accuracy question is: What is the probability, $R$, that $E_0 < E_1$? Note that $E_0 < E_1$ implies

$$E'_1 - E'_0 > \Delta E \tag{3}$$

where I have used the natural variable

$$\Delta E = \delta E_1 - \delta E_0. \tag{4}$$

If I denote the distribution of $\Delta E$ by $P(\Delta E)$, then $R$, the probability of predicting the correct structure, is given by

$$R(E'_0, E'_1) = \int_{-\infty}^{E'_1 - E'_0} P(\Delta E)\, d(\Delta E). \tag{5}$$

To complete the formal solution to the two structure problem, $P(\Delta E)$ must be expressed in terms of $p(\delta E)$. In general there is no connection between these two distributions. For example, the errors in the potential function may be so closely correlated that $\delta E_0 \approx \delta E_1$, in which case $P(\Delta E)$ becomes, to a good approximation, a Dirac delta function at zero! Here I will only consider the simplest case, which must be solved before others. The simplest assumption is that the error in the energy of one structure is independent of the errors in the energies of all the other structures. With this assumption, the distribution of $\Delta E$ becomes

$$P(\Delta E) = \int_{-\infty}^{+\infty} p(\delta E_0)\, p(\Delta E + \delta E_0)\, d(\delta E_0). \tag{6}$$

Equations (5) and (6) are the desired solution of the two structure problem with the independent error assumption.

Before proceeding further, it will do well to discuss consequences of the independent error assumption and of the correlations that are expected in real systems. In practice, the independent error assumption is probably a worst case, because the kinds of correlations that occur in the inaccuracies of real potential functions will tend to narrow the distribution $P(\Delta E)$. For example, perhaps the most important error correlation is due to the similarities of two structures. If two structures are similar, they will have many of the same interactions, e.g., the same hydrogen bonds, which are represented by the same (possibly inaccurate) terms in the potential function. The inaccuracies due to these common interactions are the same, so the difference between the errors in these two structures is due only to the interactions that the structures do not have in common. This effect will tend to narrow $P(\Delta E)$.
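For a concrete feel for the two structure solution, the sketch below evaluates equation (5) under one extra assumption that is not required by the formalism: the single-structure errors are taken to be independent, zero-mean Gaussians with a common standard deviation `sigma`. Under that assumption the convolution in equation (6) is again a Gaussian, with standard deviation $\sqrt{2}\,\sigma$, and equation (5) reduces to a normal cumulative distribution. The function names are hypothetical.

```python
import math

def normal_cdf(x, sigma):
    """Cumulative distribution of a zero-mean Gaussian with std dev sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def two_structure_reliability(e0_calc, e1_calc, sigma):
    """Probability that structure 0 really lies below structure 1, equation (5).

    Assumption (illustration only): independent zero-mean Gaussian errors
    with std dev sigma on each calculated energy.
    """
    gap = e1_calc - e0_calc
    sigma_delta = math.sqrt(2.0) * sigma   # width of P(Delta E) from equation (6)
    return normal_cdf(gap, sigma_delta)

# A calculated gap equal to one error width gives a reliability well below one.
print(two_structure_reliability(0.0, 1.0, sigma=1.0))
```

Correlated errors would narrow $P(\Delta E)$ and raise this probability, which is the sense in which the independent case is a worst case.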

The solution of the many structure problem is a straightforward generalization of the solution of the two structure problem presented above. Consider $\Omega$ different structures, labeled $0, 1, 2, \ldots, \Omega - 1$. As before, the energies calculated from the approximate potential, $E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}$, are related to the real energies $E_0, E_1, E_2, \ldots, E_{\Omega-1}$ by

$$\begin{aligned} E'_0 &= E_0 + \delta E_0 \\ E'_1 &= E_1 + \delta E_1 \\ E'_2 &= E_2 + \delta E_2 \\ &\;\;\vdots \\ E'_{\Omega-1} &= E_{\Omega-1} + \delta E_{\Omega-1} \end{aligned} \tag{7}$$

where the $\delta E_i$ are errors in the potential $V_{\mathrm{approx}}(q)$ and the distribution of the $\delta E_i$ is $p(\delta E_i)$. Once again I may assume without loss of generality that $E'_0 < E'_1, E'_0 < E'_2, \ldots, E'_0 < E'_{\Omega-1}$. What is the probability, $R$, that $E_0 < E_1, E_0 < E_2, \ldots, E_0 < E_{\Omega-1}$, i.e., that the 0 structure has the lowest energy for the real potential? Since $E_0 < E_i$ implies

$$E'_i - E'_0 > \Delta E_i, \tag{8}$$

where in analogy with the two structure problem I have defined

$$\Delta E_i \equiv \delta E_i - \delta E_0, \tag{9}$$

the probability of $E_i > E_0$ is $g(E'_i - E'_0)$, where

$$g(x) \equiv \int_{-\infty}^{x} P(\Delta E)\, d(\Delta E) \tag{10}$$

and $P(\Delta E)$ is the distribution of the $\Delta E_i$ and is given by equation (6). Within the independent error assumption the probability that all of the inequalities in (8) are simultaneously true is the product of the probabilities of each of them being true, so the probability of predicting the right structure is

$$R(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}) = \prod_{i=1}^{\Omega-1} g(E'_i - E'_0). \tag{11}$$

Equations (6), (10) and (11) are the desired solution of the many structure problem with the independent error assumption.
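The product formula (11) is easy to evaluate once a form for $P(\Delta E)$ is chosen. The sketch below assumes, purely for illustration, independent zero-mean Gaussian errors of standard deviation `sigma` on every calculated energy, so that $g$ in equation (10) is a Gaussian cumulative distribution of width $\sqrt{2}\,\sigma$. The function names are hypothetical.

```python
import math

def g(x, sigma):
    """Equation (10) for Gaussian errors: Delta E has std dev sqrt(2)*sigma,
    so the cumulative distribution is 0.5 * (1 + erf(x / (2 * sigma)))."""
    return 0.5 * (1.0 + math.erf(x / (2.0 * sigma)))

def many_structure_reliability(calc_energies, sigma):
    """Equation (11): product of g over all gaps to the lowest calculated energy."""
    ranked = sorted(calc_energies)
    lowest = ranked[0]
    reliability = 1.0
    for energy in ranked[1:]:
        reliability *= g(energy - lowest, sigma)
    return reliability

# Only structures within a few error widths of the minimum matter:
# the two distant competitors contribute factors indistinguishable from one.
print(many_structure_reliability([0.0, 1.5, 2.0, 40.0, 55.0], sigma=1.0))
```

The example illustrates why only the few lowest calculated energies need to be known in practice.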

4 Stochastic Case

The work in the above Section solves the potential accuracy problem for the case when the calculated energies of all of the structures are known. This is certainly not a realistic assumption for protein structure predictions, or indeed for most problems where one wishes to predict the structure of a molecule. A reasonable alternative to the previous formulation is to assume that one knows, or can estimate, the density of states (structures) of the molecule, that is, the probability density $p(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ that a molecule has structures with calculated energies $E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}$. The function $p(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ may be calculated from an approximate model of the potential, and may also include information drawn from simulation or experiment. In the next Section $p(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ will represent the probability that a sequence, drawn at random from the ensemble of all possible sequences of $N$ amino acids, has structures with calculated energies $E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}$. When only a probability density is known, one can calculate $\bar{R}$, the average of $R$. This averaging yields

$$\bar{R} = \Omega \int_{-\infty}^{+\infty}\!\cdots\!\int_{-\infty}^{+\infty} p(E'_0, E'_1, \ldots, E'_{\Omega-1}) \times \left[\prod_{i=1}^{\Omega-1} g(E'_i - E'_0)\,\theta(E'_i - E'_0)\right] dE'_0\, dE'_1 \cdots dE'_{\Omega-1} \tag{12}$$

where

$$\theta(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0. \end{cases} \tag{13}$$

Notice that the energies in the argument of $p$ are not arranged in any special order, and in particular, there is no requirement that $E'_0 < E'_i$ for $i > 0$. Therefore, in equation (12) the product of $\theta$-functions ensures that the structure labeled 0 is, indeed, the lowest energy structure, and there is a factor of $\Omega$ in front of the integral because the selection of the 0-labeled structure is arbitrary, as any of the $\Omega$ structures could be the lowest energy structure.

Equation (12) can be simplified in some important special cases that occur when $p(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ has a specific form. In this paper I will consider one such special case, namely

$$p(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}) = \prod_{i=0}^{\Omega-1} p(E'_i). \tag{14}$$

Equation (14) will hold when the calculated energies of the structures are random, independent variables distributed with probability density $p(E')$. Substituting equation (14) into equation (12) yields

$$\bar{R} = \Omega \int_{-\infty}^{+\infty} p(E'_0) \left[\int_{E'_0}^{+\infty} p(E')\, g(E' - E'_0)\, dE'\right]^{\Omega-1} dE'_0. \tag{15}$$
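For a small number of structures, equation (15) can be checked by direct numerical quadrature before any further approximation is made. The sketch below is an illustration under stated assumptions rather than part of the paper: both the density of calculated energies $p(E')$ and the error-difference distribution $P(\Delta E)$ are taken to be zero-mean Gaussians, and the grid size, integration range, and function names are arbitrary choices.

```python
import math

def gaussian_pdf(x, sigma):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x, sigma):
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def average_reliability(n_structures, sigma_energy, sigma_delta, n_grid=401, span=8.0):
    """Trapezoid-rule estimate of equation (15).

    Assumptions (illustration only): p(E') is Gaussian with std dev sigma_energy,
    P(Delta E) is Gaussian with std dev sigma_delta, both centered at zero.
    """
    lo = -span * sigma_energy
    h = 2.0 * span * sigma_energy / (n_grid - 1)
    grid = [lo + k * h for k in range(n_grid)]
    dens = [gaussian_pdf(e, sigma_energy) for e in grid]
    total = 0.0
    for a, e0 in enumerate(grid):
        # Inner integral over E' >= E'_0 of p(E') g(E' - E'_0).
        vals = [dens[b] * gaussian_cdf(grid[b] - e0, sigma_delta)
                for b in range(a, n_grid)]
        inner = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
        weight = 0.5 if a in (0, n_grid - 1) else 1.0
        total += weight * dens[a] * inner ** (n_structures - 1)
    return n_structures * h * total

print(average_reliability(3, sigma_energy=1.0, sigma_delta=0.5))
```

Two limits give a quick sanity check on the formula: when the errors are negligible the result approaches one, and when the errors dominate every factor $g$ tends to 1/2, so the result approaches $2^{-(\Omega-1)}$.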

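Equation (15) can be approximated with a steepest descent technique. Define

$$\xi(E'_0) \equiv \int_{E'_0}^{+\infty} p(E')\, g(E' - E'_0)\, dE' \tag{16}$$

and

$$\mathcal{P}(E'_0) \equiv (\Omega/\bar{R})\, p(E'_0)\, \xi(E'_0)^{\Omega-1}. \tag{17}$$

Note that $\mathcal{P}(E'_0)$ is normalized so that

$$\int_{-\infty}^{+\infty} \mathcal{P}(E'_0)\, dE'_0 = 1. \tag{18}$$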

The identity

$$\Omega \int_{-\infty}^{+\infty} \xi'(E'_0)\, \xi(E'_0)^{\Omega-1}\, dE'_0 = \Big[\xi(E'_0)^{\Omega}\Big]_{-\infty}^{+\infty} = -1 \tag{19}$$

can be rewritten as

$$\bar{R} \int_{-\infty}^{+\infty} \frac{\xi'(E'_0)}{p(E'_0)}\, \mathcal{P}(E'_0)\, dE'_0 = -1. \tag{20}$$

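In most relevant cases $\mathcal{P}(E'_0)$ will have a maximum that grows sharper as $\Omega$ becomes larger. Thus, for large $\Omega$, $\mathcal{P}(E'_0)$ is well approximated by a Dirac delta function at $\bar{E}_0$, the value of $E'_0$ that maximizes $\mathcal{P}(E'_0)$. In this approximation, equation (20) for $\bar{R}$ becomes

$$\bar{R} \approx -\frac{p(\bar{E}_0)}{\xi'(\bar{E}_0)}. \tag{21}$$

A useful expression for $\xi'(E'_0)$ is found by differentiating equations (16) and (10) for $\xi$ and $g$ respectively and changing the variable of integration in equation (16) to $\lambda \equiv E' - E'_0$ to find

$$\xi'(E'_0) = -p(E'_0) \int_{-\infty}^{0} P(\lambda)\, d\lambda - \int_{0}^{+\infty} p(E'_0 + \lambda)\, P(\lambda)\, d\lambda. \tag{22}$$

After substituting this expression for $\xi'$, the expression (21) for $\bar{R}$ can be put in the suggestive form

$$\bar{R} = \frac{1}{1 + \epsilon(\bar{E}_0)} \tag{23}$$

where

$$\epsilon(\bar{E}_0) = \int_{0}^{+\infty} \left[\frac{p(\bar{E}_0 + \lambda)}{p(\bar{E}_0)} - 1\right] P(\lambda)\, d\lambda. \tag{24}$$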

5 A Random Heteropolymer Model of a Protein Potential Function

The random heteropolymer model is the simplest model of a protein potential function. This model, with some significant extensions, was first proposed as a model of protein folding by Bryngelson and Wolynes, [4, 5] who solved it within a random energy approximation. Later, and independently, Shakhnovich and Gutin also proposed this model and solved it within a mean field approximation. [6, 7] Shakhnovich and Gutin showed that their mean field approximation was equivalent to the random energy approximation of Bryngelson and Wolynes, and were able to obtain further information. I will use the notation of Shakhnovich and Gutin. In the random heteropolymer model the energy, $E$, of a configuration is determined by the contacts between the amino acids, so if

$$\Delta(i,j) = \begin{cases} 1 & \text{if amino acids } i \text{ and } j \text{ are in contact} \\ 0 & \text{otherwise} \end{cases} \tag{25}$$

then the energy of state $q = \{\Delta(i,j)\}$ is given by

$$E = \mathcal{H}(\{\Delta(i,j)\}) = \sum_{i<j}^{N} B_{i,j}\, \Delta(i,j) \tag{26}$$

where there are $N$ amino acids in the protein. In equation (26) for the energy, the $B_{i,j}$ are the energies of contact between amino acids $i$ and $j$. In the random heteropolymer model, the $B_{i,j}$ are random, with probability distribution

$$P_{\mathrm{contact}}(B_{i,j}) = \frac{1}{\sqrt{2\pi B^2}} \exp\left(-\frac{B_{i,j}^2}{2B^2}\right). \tag{27}$$

The quantity $B$ in equation (27) sets the energy scale of the amino acid contact energies.¹

The model (26) distinguishes between configurations based on their contacts. Therefore, in this model the discrete states $\{q\}$ are represented by their amino acid contacts. This representation is often referred to as the contact or distance map representation [8] of protein structure, and is useful in many applications.
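In the contact map representation, equation (26) amounts to summing the contact energies over the pairs that are in contact. The following sketch shows one way this could be coded; the data layout (a set of index pairs with i < j and an upper-triangular table of energies), the function names, and the example contact map are assumptions made for illustration only.

```python
import random

def draw_contact_energies(n_residues, b_scale, seed=0):
    """Draw B_ij for i < j from a zero-mean Gaussian of std dev b_scale, as in (27)."""
    rng = random.Random(seed)
    b = [[0.0] * n_residues for _ in range(n_residues)]
    for i in range(n_residues):
        for j in range(i + 1, n_residues):
            b[i][j] = rng.gauss(0.0, b_scale)
    return b

def contact_map_energy(contacts, b):
    """Equation (26): sum B_ij over the pairs (i, j), i < j, that are in contact."""
    return sum(b[min(i, j)][max(i, j)] for i, j in contacts)

# Hypothetical six-residue chain with three contacts.
b = draw_contact_energies(6, b_scale=1.0)
contacts = {(0, 3), (1, 4), (2, 5)}
print(contact_map_energy(contacts, b))
```

Two configurations with the same contact set get the same energy in this model, which is why the contact map can serve as the discrete structure label.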

For compact structures, the solution of the Shakhnovich-Gutin mean field theory for random heteropolymers has four properties that are important for the present calculation. These properties are a product of both the model and the mean field approximation, and therefore may not be entirely physical. In the conclusion I will discuss possible ways to check and improve the model and approximations. First, the total number of compact structures is

$$\Omega = \nu^{N} \tag{28}$$

where $\nu$ is the average number of conformations per amino acid residue in the compact phase. (Excluded volume effects are included in this counting.) Second, the energies are random, independent variables, so equation (14) holds. Third, the probability density that a structure has energy $E$ is

$$p(E) = \frac{1}{\sqrt{\pi N z B^2}} \exp\left(-\frac{E^2}{N z B^2}\right) \tag{29}$$

where $z$ is the average number of contacts each amino acid residue has with other amino acid residues. Fourth, the important low energy structures have few contacts, hence few interactions, in common. Therefore, as noted in Section 3, the independent error approximation is valid.

Inaccuracies in the pair interactions between amino acids are modeled by adding random noise to the contact energies, so the known approximate potential is

$$E' = \mathcal{H}'(\{\Delta(i,j)\}) = \sum_{i<j}^{N} B'_{i,j}\, \Delta(i,j) \tag{30}$$

where

$$B'_{i,j} = B_{i,j} + \eta_{ij} \tag{31}$$

and the $\eta_{ij}$ are random variables distributed with mean zero and standard deviation $\eta$. Since $\mathcal{H}$ and $\mathcal{H}'$ have the same form, they have the same properties. A scientist who uses an approximate potential function $\mathcal{H}'$ to calculate the energy of the state $\{\Delta(i,j)\}$ will err by an energy

$$\delta E(\{\Delta(i,j)\}) = \sum_{i<j}^{N} \eta_{ij}\, \Delta(i,j). \tag{32}$$

The quantity $\delta E$ is a sum of $\tfrac{1}{2} z N$ independent random variables, so by the central limit theorem, $\delta E$ is a random variable with probability density

$$p(\delta E) = \frac{1}{\sqrt{\pi N z \eta^2}} \exp\left(-\frac{\delta E^2}{N z \eta^2}\right). \tag{33}$$

Substituting equation (33) into equation (6) gives the probability density of $\Delta E$, the error in energy with respect to the predicted lowest energy structure,

$$P(\Delta E) = \frac{1}{\sqrt{2\pi N z \eta^2}} \exp\left(-\frac{\Delta E^2}{2 N z \eta^2}\right). \tag{34}$$

¹ The erudite reader may recall that in the Shakhnovich-Gutin paper the Gaussian for the $B_{i,j}$ was centered about a mean energy $B_0$, and the potential function included a three-body interaction term $C \sum \delta(\mathbf{r}_i - \mathbf{r}_j)\, \delta(\mathbf{r}_k - \mathbf{r}_j)$ which accounted for excluded volume. The values of $B_0$ and $C$ determine whether or not the protein molecule is collapsed. In this paper I am only interested in the relative energies of the different collapsed conformations. Changes in $B_0$ and $C$ would only add a constant energy to each collapsed conformation, and therefore can be ignored.
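The width of the error distribution can be checked with a small Monte Carlo experiment. The sketch below sums $\tfrac{1}{2}zN$ independent noise terms, as in equation (32), and compares the sample variance of $\delta E$ with the value $\tfrac{1}{2}zN\eta^2$ implied by equation (33). The Gaussian form of the individual $\eta_{ij}$, the parameter values, and the function name are assumptions made for illustration; the text only fixes their mean and standard deviation.

```python
import random
import statistics

def sample_energy_errors(n_contacts, eta, n_samples, seed=0):
    """Draw samples of delta E, each a sum of n_contacts independent noise terms."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, eta) for _ in range(n_contacts))
            for _ in range(n_samples)]

N, z, eta = 100, 4, 0.1            # illustrative values only
n_contacts = z * N // 2            # one half of z*N contacts in a compact structure
errors = sample_energy_errors(n_contacts, eta, n_samples=2000)

observed = statistics.pvariance(errors)
expected = n_contacts * eta ** 2   # variance of the Gaussian in equation (33)
print(observed, expected)
```

The variance of the difference $\Delta E$ between two independent errors is twice this value, which is the factor of two separating equations (33) and (34).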

6 Accuracy Requirements for the Random Heteropolymer Model

At the level of accuracy of the Shakhnovich-Gutin mean field theory, equations (23) and (24) for the probability of predicting the correct structure are valid for the random heteropolymer model. The density of structures $p(E')$ and the distribution of errors in structure energies $P(\Delta E)$ are given in the previous Section, so it only remains to calculate $\bar{E}_0$ and simplify the resulting expressions. For $E'_0 = \bar{E}_0$,

$$\frac{d\mathcal{P}}{dE'_0} = 0 \tag{35}$$

so, after substituting (29) for $p(E')$ in equation (17) for $\mathcal{P}$ and differentiating, I obtain

$$\frac{2\bar{E}_0}{N z B^2} = (\Omega - 1)\, \frac{\xi'(\bar{E}_0)}{\xi(\bar{E}_0)}. \tag{36}$$

For all values of $E'_0$, $\xi'(E'_0) < 0$ and $\xi(E'_0) > 0$, so $\bar{E}_0 < 0$. Substituting equations (29) and (34) for $p(E')$ and $P(\Delta E)$ into equations (16) and (22) for $\xi(E'_0)$ and $\xi'(E'_0)$ yields

$$\xi(E'_0) = 1 - \frac{1}{4}\, \mathrm{erfc}\left(\frac{|E'_0|}{(Nz)^{1/2} B}\right) \tag{37}$$

and

$$\xi'(E'_0) = -\frac{1}{2}\, p(E'_0) \left(1 + \left[1 + 2\left(\frac{\eta}{B}\right)^2\right]^{-1/2} \exp\left[\frac{2 (E'_0)^2}{N z (B^2 + 2\eta^2)} \left(\frac{\eta}{B}\right)^2\right] \times \left\{1 + \mathrm{erf}\left[\left(\frac{2 (E'_0)^2}{N z (B^2 + 2\eta^2)}\right)^{1/2} \left(\frac{\eta}{B}\right)\right]\right\}\right) \tag{38}$$

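for $E'_0 < 0$. Equations (36), (37) and (38) for $\bar{E}_0$ can be solved for the case of large $N$. First, define $\alpha$ so that

$$\alpha \equiv -\frac{\bar{E}_0}{N z^{1/2} B}. \tag{39}$$

I have shown that $\alpha > 0$. I assume that $\alpha$ is of order one and will show that this assumption is self-consistent. I will also assume that $\eta$ is at most the same order of magnitude as $B$, and quite possibly much smaller. This assumption covers all interesting cases, because if $\eta$ is much larger than $B$, then the "signal" (the $B_{i,j}$) is swamped by the "noise" (the $\eta_{i,j}$), so the probability of predicting the correct structure is essentially zero. With these assumptions, the asymptotic expansion for the error function complement for large argument gives

[equation (40) is illegible in the source]

where $\gamma$ is a constant of order one. Therefore, the equation for $\alpha$ becomes

$$4 \pi^{1/2} N^{1/2} \alpha = \nu^{N} \exp(-N\alpha^2) \left\{1 + \left[1 + 2\left(\frac{\eta}{B}\right)^2\right]^{-1/2} \exp\left(\frac{2 N \alpha^2 \eta^2}{B^2 + 2\eta^2}\right) \times \left[1 + \mathrm{erf}\left(\frac{\sqrt{2N}\, \alpha\, \eta}{\sqrt{B^2 + 2\eta^2}}\right)\right]\right\}. \tag{41}$$

Equation (41) simplifies in two limits, $\sqrt{N}(\eta/B)$ small and $\sqrt{N}(\eta/B)$ large. The expansion of (41) for small $\sqrt{N}(\eta/B)$ to first order in $\sqrt{N}(\eta/B)$ is

$$2 \pi^{1/2} N^{1/2} \alpha = \nu^{N} e^{-N\alpha^2} \left[1 + \left(\frac{2}{\pi}\right)^{1/2} \alpha \left(\frac{N^{1/2} \eta}{B}\right)\right] \tag{42}$$

and similarly, for large $\sqrt{N}(\eta/B)$, the leading term in the asymptotic expansion yields

[equation (43) is missing from the source]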

Equations (42) and (43) can be solved for large $N$ by writing $\alpha$ as

$$\alpha = \alpha_0 + \left(\frac{\log N}{N}\right) \alpha_1 + \left(\frac{1}{N}\right) \alpha_2 + \cdots \tag{44}$$

and substituting this expression into the above equations for $\alpha$ to find

[equation (45) is missing from the source]

for small $\sqrt{N}(\eta/B)$ and

[equation (46) is missing from the source]

for large $\sqrt{N}(\eta/B)$. In the large $N$ limit $(\log N)/\sqrt{N}$ goes to zero, so equations (45) and (46) can be solved to leading orders of $N$ by requiring the order $N$, order $\log N$ and order one terms each to vanish separately, to yield

$$\alpha = (\log \nu)^{1/2} - \frac{\log N}{4N (\log \nu)^{1/2}} + \frac{1}{4N (\log \nu)^{1/2}} \log(4\pi \log \nu) + \frac{1}{2N (\log \nu)^{1/2}} \log\left[1 + \left(\frac{2 \log \nu}{\pi}\right)^{1/2} \frac{N^{1/2} \eta}{B}\right] \tag{47}$$

for small $\sqrt{N}(\eta/B)$ and

$$\alpha = \left(\frac{B^2 + 2\eta^2}{B^2} \log \nu\right)^{1/2} - \frac{1}{4N} \log\left(\frac{4\pi N B^2 \log \nu}{B^2 + 2\eta^2}\right) \left(\frac{B^2 + 2\eta^2}{B^2 \log \nu}\right)^{-1/2} \tag{48}$$

for large $\sqrt{N}(\eta/B)$. The leading order terms in both equations (47) and (48) for $\alpha$ are of order $\sqrt{\log \nu}$. A typical estimate gives $\nu \approx 1.4$, [9] which makes $\alpha$ of order one, as promised. Notice that this conclusion still holds if the value of $\nu - 1$ is changed by a factor of ten.

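When $p(E')$ and $P(\Delta E)$ are given by (29) and (34) respectively and $\bar{E}_0 = -N \alpha \sqrt{z}\, B$, then equation (24) for $\epsilon(\bar{E}_0)$ becomes

$$\epsilon(\alpha) = \frac{1}{2} \left\{\left[1 + 2\left(\frac{\eta}{B}\right)^2\right]^{-1/2} \exp\left(\frac{2 N \alpha^2 \eta^2}{B^2 + 2\eta^2}\right) \times \left[1 + \mathrm{erf}\left(\frac{\sqrt{2N}\, \alpha\, \eta}{\sqrt{B^2 + 2\eta^2}}\right)\right] - 1\right\}. \tag{49}$$

By inspection of equation (49), $\epsilon$ is small, and hence the probability of predicting the correct structure, $\bar{R} = 1/(1 + \epsilon)$, is close to one, only if $\sqrt{N}(\eta/B)$ is small. For this case the value of $\alpha$ is given by equation (47), so substituting this into the above expression for $\epsilon(\alpha)$ and expanding to first order in $\sqrt{N}(\eta/B)$ gives

$$\epsilon = \left(\frac{2}{\pi} \log \nu\right)^{1/2} \left\{1 + \frac{\log[(4\pi \log \nu)/N]}{4N \log \nu}\right\} \left(\frac{N^{1/2} \eta}{B}\right) + O\left(\frac{N \eta^2}{B^2}\right). \tag{50}$$

Therefore, if $\eta$ is small compared to $B/\sqrt{N}$, then, to first order in $\sqrt{N}(\eta/B)$, the probability of predicting the correct structure is

$$\bar{R} = 1 - \left(\frac{2}{\pi} \log \nu\right)^{1/2} \left\{1 + \frac{\log[(4\pi \log \nu)/N]}{4N \log \nu}\right\} \left(\frac{N^{1/2} \eta}{B}\right) \tag{51}$$

which, for large $N$, is well approximated by

$$\bar{R} = 1 - \left(\frac{2}{\pi} \log \nu\right)^{1/2} \left(\frac{N^{1/2} \eta}{B}\right). \tag{52}$$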
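The size of the terms dropped in the linear result can be gauged numerically. The sketch below evaluates $\bar{R} = 1/(1+\epsilon)$ with $\epsilon$ computed in closed form for the Gaussian $p$ and $P$ of equations (29) and (34), and compares it with the linear expression (52). Two simplifications are assumed here and are not claims about the paper's own procedure: $\alpha$ is fixed at its leading-order value $(\log \nu)^{1/2}$, and the closed form of $\epsilon$ is the one obtained by requiring $1/(1+\epsilon)$ to agree with equations (21) and (22). The function names are hypothetical; $\nu = 1.4$ is the estimate quoted in the text.

```python
import math

def epsilon(alpha, n_monomers, eta_over_b):
    """Closed-form epsilon for Gaussian p(E') and P(Delta E), with r = eta/B."""
    r2 = eta_over_b ** 2
    prefactor = (1.0 + 2.0 * r2) ** -0.5
    exponent = 2.0 * n_monomers * alpha ** 2 * r2 / (1.0 + 2.0 * r2)
    erf_arg = math.sqrt(2.0 * n_monomers) * alpha * eta_over_b / math.sqrt(1.0 + 2.0 * r2)
    return 0.5 * (prefactor * math.exp(exponent) * (1.0 + math.erf(erf_arg)) - 1.0)

def reliability(n_monomers, eta_over_b, nu=1.4):
    """R = 1 / (1 + epsilon), using the leading-order alpha = sqrt(log nu)."""
    alpha = math.sqrt(math.log(nu))   # assumption: 1/N corrections to alpha ignored
    return 1.0 / (1.0 + epsilon(alpha, n_monomers, eta_over_b))

def linear_reliability(n_monomers, eta_over_b, nu=1.4):
    """Equation (52): first order in sqrt(N) * eta / B."""
    return 1.0 - math.sqrt(2.0 * math.log(nu) / math.pi) * math.sqrt(n_monomers) * eta_over_b

for ratio in (0.001, 0.01, 0.05, 0.1):
    print(ratio, reliability(100, ratio), linear_reliability(100, ratio))
```

For N = 100 the two expressions agree closely while $\sqrt{N}(\eta/B)$ is well below one and separate as it approaches one, where the linear form is no longer expected to hold.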

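For large $\sqrt{N}\eta/B$, $\epsilon \gg 1$ so $\bar{R} \approx 1/\epsilon$; therefore, to the leading term in the asymptotic expansion,

$$\bar{R} = \left[1 + 2\left(\frac{\eta}{B}\right)^2\right]^{1/2} \left(\frac{4\pi N B^2 \log \nu}{B^2 + 2\eta^2}\right)^{\eta^2/(B^2 + 2\eta^2)} \nu^{-2N\eta^2/B^2}. \tag{53}$$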

7 Conclusions

This paper has two principal purposes. First, the general formalism developed in Sections 2, 3, and 4 can potentially be applied to a wide variety of problems in chemical physics. The prerequisites for applying the formalism are fourfold. First, there must be a suitable coarse-graining procedure to obtain a finite number of discrete structures. As I have previously mentioned, this coarse-graining procedure can be tailored to the problem at hand, and one can specify the structure as accurately as desired by making the discretization finer. Second, the solution of the structure prediction problem must be put into the form of a solution to a minimization problem, typically not a stringent requirement. Third, something must be known about the sources of inaccuracy. More specifically, the different possible kinds of inaccuracies must be known and the typical size of each of these inaccuracies must also be known. With this knowledge, one can then usually use the central limit theorem to obtain the probability density for the total inaccuracy in energy. Fourth, there must be some information about the distribution of the calculated energies of the structures. In practice, only the calculated energies of the few structures with lowest calculated energies need be known, because all of the other structures have a negligible probability of being the real lowest energy structure. For large molecules where even these calculations are infeasible, the methods of Section 4 can be used if one can estimate the distribution of calculated energies of structures, using, for example, a simple model like the one described in Section 5. The main assumption in deriving the equations for the probability of predicting the correct structure, equations (5), (11), (12) and (23), is the independent error assumption, which is a worst case for most problems. The derivation of equation (23) also used the independent energy assumption. Extensions of this formalism that lessen these assumptions are under way.

The second purpose of this paper is to report a result for the accuracy that a potential needs in order to predict protein structure. The major result of this investigation is equation (52), which states that the probability of predicting the correct structure is given by

$$\text{probability} = 1 - k \left( \frac{N^{1/2}\,\eta}{B} \right) \qquad (54)$$

where $B$ is the scale of the monomer-monomer interaction energies, $\eta$ is the scale of the inaccuracy of these interaction energies, $N$ is the number of monomers, and $k$ is a constant of order one. Equation (54) was derived from equation (23) and therefore is based on the independent error and independent energy assumptions. I noted in Section 5 that Shaknovitch and Gutin have shown that, if one models a protein as a random heteropolymer, then these assumptions are correct at the level of accuracy of a mean field theory. Equation (54) implies that, if a potential function is to predict the correct structure, the monomer-monomer interaction energies must have a proportional error of less than $1/\sqrt{N}$. For a globular protein $N$ will typically be between 50 and 400, so the required accuracy in monomer-monomer interactions is about five to fifteen percent. It is important to note that this result is the accuracy required for getting all of the monomer-monomer contacts right, that is, for predicting the entire contact map with perfect accuracy, which is a stringent requirement for a potential function. Proteins with 60 or more percent of correct contacts are usually considered to be structurally homologous. Therefore, the protein calculation should be extended to calculate the probability of predicting a structure with a specified fraction of correct contacts. This extension will require that the formalism and the model be improved. The formalism must be extended so that it can be used to calculate the probability of predicting one of many, rather than just one, low energy state. The model must be extended so that the low energy states of the model have contacts in common. The model can be extended in two ways. First, the statistical mechanics of the model potential function could be solved in an approximation that is more accurate than the Shaknovitch-Gutin mean field theory. Some progress has already been made in this direction (Silvio Franz, private communication). Second, the model potential function could be extended by incorporating new effects that are alleged to be important in protein folding, such as the principle of minimal frustration [4, 5, 10, 11].
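The "five to fifteen percent" estimate quoted above follows directly from the $1/\sqrt{N}$ scaling and can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the constant `k` is set to 1 because the text only states that it is of order one, and clamping the result to the interval [0, 1] is an added convenience, since equation (54) is meaningful only when the relative error is small.

```python
import math


def required_relative_accuracy(n_monomers):
    """Upper bound on the proportional error in the monomer-monomer
    interaction energies, 1/sqrt(N), for a chain of N monomers."""
    return 1.0 / math.sqrt(n_monomers)


def success_probability(n_monomers, relative_error, k=1.0):
    """Equation (54) with the inaccuracy scale expressed as a fraction
    of the interaction-energy scale B.

    Assumptions (illustrative): k = 1 by default, and the result is
    clamped to [0, 1] so that it can be read as a probability.
    """
    p = 1.0 - k * math.sqrt(n_monomers) * relative_error
    return max(0.0, min(1.0, p))


if __name__ == "__main__":
    for n in (50, 100, 200, 400):
        bound = required_relative_accuracy(n)
        print(f"N = {n:3d}  required relative accuracy < {100 * bound:.1f}%")
```

For N = 50 the bound is about 14 percent and for N = 400 it is 5 percent, which brackets the range stated in the text.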

8 Acknowledgements

This work was done under the auspices of the U.S. Department of Energy through the Los Alamos National Laboratory. I wish to thank the Los Alamos National Laboratory and the Department of Energy for generous support of this work. I also wish to thank the Center for Non-Linear Studies and the Santa Fe Institute for their generous hospitality. The work described in this paper owes much to the help and encouragement of many people. In particular I would like to thank Drs. Henrik Bohr, Ken Dill, Walter Fontana, Silvio Franz, John Hopfield, Giulia Iori, Alan Lapedes, Jiri Novotny, Jose Nelson Onuchic, Peter Leopold, Lawrence Pratt, Jeff Skolnick, Paul Stolorz, James Theiler, Miguel Virasoro, David Wolpert and Peter Wolynes, and Miss Anne Keegan for listening to my ideas and for good advice concerning this work.

References

[1] C. B. Anfinsen, "Principles that Govern the Folding of Protein Chains," Science, 181, 223-230 (1973).

[2] Z. Li and H. A. Scheraga, "Monte Carlo-minimization approach to the multiple-minimum problem in protein folding," Proc. Natl. Acad. Sci. USA, 84, 6611-6615 (1987).

[3] E. I. Shaknovitch and A. M. Gutin, "Influence of Point Mutations on Protein Structure: Probability of a Neutral-Mutation," J. Theor. Biol., 149, 537-546 (1991).

[4] J. D. Bryngelson and P. G. Wolynes, "Spin glasses and the statistical mechanics of protein folding," Proc. Natl. Acad. Sci. USA, 84, 7524-7528 (1987).

[5] J. D. Bryngelson and P. G. Wolynes, "A Simple Statistical Field Theory of Heteropolymer Collapse with Application to Protein Folding," Biopolymers, 30, 177-188 (1990).

[6] E. I. Shaknovitch and A. M. Gutin, "Formation of unique structure in polypeptide chains: Theoretical investigation with the aid of a replica approach," Biophysical Chemistry, 34, 187-199 (1989).

[7] E. I. Shaknovitch and A. M. Gutin, "Frozen states of a disordered globular heteropolymer," J. Phys. A: Math. Gen., 22, 1647-1659 (1989).

[8] T. E. Creighton, Proteins, W. H. Freeman and Company, New York, 1984, p. 231.

[9] K. A. Dill, "Theory for the Folding and Stability of Globular Proteins," Biochemistry, 24, 1501-1509 (1984).

[10] J. D. Bryngelson and P. G. Wolynes, "Intermediates and Barrier Crossing in a Random Energy Model (with Applications to Protein Folding)," J. Phys. Chem., 93, 6902-6915 (1989).

[11] P. E. Leopold, M. Montal, and J. N. Onuchic, "Protein folding funnels: A kinetic approach to the sequence-structure relationship," Proc. Natl. Acad. Sci. USA, 89, 8721-8725 (1992).
