6
Protein folding and binding can emerge as evolutionary spandrels through structural coupling Michael Manhart a and Alexandre V. Morozov a,b,1 a Department of Physics and Astronomy and b BioMaPS Institute for Quantitative Biology, Rutgers University, Piscataway, NJ 08854 Edited* by Curtis G. Callan, Jr., Princeton University, Princeton, NJ, and approved December 31, 2014 (received for review August 16, 2014) Binding interactions between proteins and other molecules medi- ate numerous cellular processes, including metabolism, signaling, and gene regulation. These interactions often evolve in response to changes in the proteins chemical or physical environment (such as the addition of an antibiotic). Several recent studies have shown the importance of folding stability in constraining protein evolution. Here we investigate how structural coupling between folding and bindingthe fact that most proteins can only bind their targets when foldedgives rise to an evolutionary coupling between the traits of folding stability and binding strength. Using a biophysical and evolutionary model, we show how these protein traits can emerge as evolutionary spandrelseven if they do not confer an intrinsic fitness advantage. In particular, proteins can evolve strong binding interactions that have no functional role but merely serve to stabilize the protein if its misfolding is dele- terious. Furthermore, such proteins may have divergent fates, evolving to bind or not bind their targets depending on random mutational events. These observations may explain the abundance of apparently nonfunctional interactions among proteins ob- served in high-throughput assays. In contrast, for proteins with both functional binding and deleterious misfolding, evolution may be highly predictable at the level of biophysical traits: adap- tive paths are tightly constrained to first gain extra folding sta- bility and then partially lose it as the new binding function is developed. These findings have important consequences for our understanding of how natural and engineered proteins evolve un- der selective pressure. protein evolution | evolutionary spandrels | fitness landscapes | protein interactions | folding stability P roteins carry out a diverse array of chemical and mechanical functions in the cell, ranging from metabolism to gene ex- pression (1). Thus, proteins serve as central targets for natural selection in wild populations, as well as a key toolbox for engi- neering novel molecules with medical and industrial applications (2, 3). Most proteins must fold into their native state, a unique 3D conformation, to perform their function, which typically involves binding a target molecule such as a small ligand, DNA, or another protein (1). Misfolded proteins not only fail to per- form their function but also may form toxic aggregates and divert valuable protein synthesis and quality control resources (47). It is therefore imperative that the folded state be stable against typical thermal fluctuations. However, biophysical experiments and computational studies reveal that most random mutations in proteins destabilize the folded state (8, 9), including mutations that improve function (911). As a result, many natural proteins tend to be only marginally stable, mutationally teetering at the brink of unfolding (12, 13). With proteins in such a precarious position, how can they evolve new functions while maintaining sufficient folding stability? Directed evolution experiments have offered a window into the dynamics of this process (2, 3), indicating the importance of compensatory mutations, limited epistasis, and mutational ro- bustness. Theoretical efforts to describe protein evolution in biophysical terms have focused on evolvability (14), distributions of protein stabilities and evolutionary rates (13, 1517), and global properties of protein interaction networks (18, 19). However, a subtle but key property of proteins has not been explored in this context: structural coupling of folding and binding (the fact that folding is required for function) implies evolutionary coupling of folding stability and binding strength. Thus, selection acting directly on only one of these traits may produce apparent, indirect selection for the other. The impor- tance of this effect was popularized by Gould and Lewontin in their influential paper on evolutionary spandrels(20), defined as traits that evolve as byproducts in the absence of direct se- lection. Since then the importance of spandrels and coupled traits has been explored in many areas of evolutionary biology (21), including various molecular examples (12, 22, 23). How do coupled traits affect protein evolution? We consider a biophysical model that describes evolution of a new binding interaction in response to a change in the proteins chemical or physical environment, including availability and concentrations of various ligands (24, 25), or in response to artificial selection in a directed evolution experiment (3). We postulate a fitness landscape as a function of two biophysical traits: folding stability and binding affinity to a target molecule. We then use an exact numerical algorithm (26, 27) to characterize adaptive paths on this fitness landscape, focusing on how coupled protein traits affect evolutionary fates, epistasis, and predictability. Results Model of Protein Energetics. We consider a protein with two-state folding kinetics (1). In the folded state, the protein has an in- terface that binds a target molecule. Assuming thermodynamic equilibrium (valid when protein folding and binding are faster than typical cellular processes), the probabilities of the proteins structural states are given by their Boltzmann weights: Significance Most proteins must be folded to perform their function, which typically involves binding another molecule. This property enables protein traits of stable folding and strong binding to emerge through adaptation even when the trait itself does not increase the organisms fitness. Such traits, known as evolu- tionary spandrels,have been investigated across many organisms and scales of biology. Here we show that proteins can spontaneously evolve strong but nonfunctional binding interactions. Moreover, evolution of such interactions can be highly unpredictable, with chance mutation events de- termining the outcome. In contrast, evolution of functional interactions follows a more predictable pattern of first gaining folding stability and then losing it as the new binding func- tion improves. Author contributions: M.M. and A.V.M. designed research; performed research; analyzed data; and wrote the paper. The authors declare no conflict of interest. *This Direct Submission article had a prearranged editor. 1 To whom correspondence should be addressed. Email: [email protected]. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. 1073/pnas.1415895112/-/DCSupplemental. www.pnas.org/cgi/doi/10.1073/pnas.1415895112 PNAS | February 10, 2015 | vol. 112 | no. 6 | 17971802 EVOLUTION Downloaded by guest on August 27, 2021

Protein folding and binding can emerge as evolutionary ... · DEF Fig. 1. Fitness, selection, and epistasis in energy trait space. (A) Phase diagram of protein structural states

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Protein folding and binding can emerge as evolutionary ... · DEF Fig. 1. Fitness, selection, and epistasis in energy trait space. (A) Phase diagram of protein structural states

Protein folding and binding can emerge asevolutionary spandrels through structural couplingMichael Manharta and Alexandre V. Morozova,b,1

aDepartment of Physics and Astronomy and bBioMaPS Institute for Quantitative Biology, Rutgers University, Piscataway, NJ 08854

Edited* by Curtis G. Callan, Jr., Princeton University, Princeton, NJ, and approved December 31, 2014 (received for review August 16, 2014)

Binding interactions between proteins and other molecules medi-ate numerous cellular processes, including metabolism, signaling,and gene regulation. These interactions often evolve in responseto changes in the protein’s chemical or physical environment (suchas the addition of an antibiotic). Several recent studies haveshown the importance of folding stability in constraining proteinevolution. Here we investigate how structural coupling betweenfolding and binding—the fact that most proteins can only bindtheir targets when folded—gives rise to an evolutionary couplingbetween the traits of folding stability and binding strength. Usinga biophysical and evolutionary model, we show how these proteintraits can emerge as evolutionary “spandrels” even if they do notconfer an intrinsic fitness advantage. In particular, proteins canevolve strong binding interactions that have no functional rolebut merely serve to stabilize the protein if its misfolding is dele-terious. Furthermore, such proteins may have divergent fates,evolving to bind or not bind their targets depending on randommutational events. These observations may explain the abundanceof apparently nonfunctional interactions among proteins ob-served in high-throughput assays. In contrast, for proteins withboth functional binding and deleterious misfolding, evolutionmay be highly predictable at the level of biophysical traits: adap-tive paths are tightly constrained to first gain extra folding sta-bility and then partially lose it as the new binding function isdeveloped. These findings have important consequences for ourunderstanding of how natural and engineered proteins evolve un-der selective pressure.

protein evolution | evolutionary spandrels | fitness landscapes |protein interactions | folding stability

Proteins carry out a diverse array of chemical and mechanicalfunctions in the cell, ranging from metabolism to gene ex-

pression (1). Thus, proteins serve as central targets for naturalselection in wild populations, as well as a key toolbox for engi-neering novel molecules with medical and industrial applications(2, 3). Most proteins must fold into their native state, a unique3D conformation, to perform their function, which typicallyinvolves binding a target molecule such as a small ligand, DNA,or another protein (1). Misfolded proteins not only fail to per-form their function but also may form toxic aggregates and divertvaluable protein synthesis and quality control resources (4–7). Itis therefore imperative that the folded state be stable againsttypical thermal fluctuations. However, biophysical experimentsand computational studies reveal that most random mutations inproteins destabilize the folded state (8, 9), including mutationsthat improve function (9–11). As a result, many natural proteinstend to be only marginally stable, mutationally teetering at thebrink of unfolding (12, 13). With proteins in such a precariousposition, how can they evolve new functions while maintainingsufficient folding stability?Directed evolution experiments have offered a window into

the dynamics of this process (2, 3), indicating the importance ofcompensatory mutations, limited epistasis, and mutational ro-bustness. Theoretical efforts to describe protein evolution inbiophysical terms have focused on evolvability (14), distributionsof protein stabilities and evolutionary rates (13, 15–17), and

global properties of protein interaction networks (18, 19).However, a subtle but key property of proteins has not beenexplored in this context: structural coupling of folding andbinding (the fact that folding is required for function) impliesevolutionary coupling of folding stability and binding strength.Thus, selection acting directly on only one of these traits mayproduce apparent, indirect selection for the other. The impor-tance of this effect was popularized by Gould and Lewontin intheir influential paper on evolutionary “spandrels” (20), definedas traits that evolve as byproducts in the absence of direct se-lection. Since then the importance of spandrels and coupledtraits has been explored in many areas of evolutionary biology(21), including various molecular examples (12, 22, 23).How do coupled traits affect protein evolution? We consider

a biophysical model that describes evolution of a new bindinginteraction in response to a change in the protein’s chemical orphysical environment, including availability and concentrationsof various ligands (24, 25), or in response to artificial selection ina directed evolution experiment (3). We postulate a fitnesslandscape as a function of two biophysical traits: folding stabilityand binding affinity to a target molecule. We then use an exactnumerical algorithm (26, 27) to characterize adaptive paths onthis fitness landscape, focusing on how coupled protein traitsaffect evolutionary fates, epistasis, and predictability.

ResultsModel of Protein Energetics. We consider a protein with two-statefolding kinetics (1). In the folded state, the protein has an in-terface that binds a target molecule. Assuming thermodynamicequilibrium (valid when protein folding and binding are fasterthan typical cellular processes), the probabilities of the protein’sstructural states are given by their Boltzmann weights:

Significance

Most proteins must be folded to perform their function, whichtypically involves binding another molecule. This propertyenables protein traits of stable folding and strong binding toemerge through adaptation even when the trait itself does notincrease the organism’s fitness. Such traits, known as evolu-tionary “spandrels,” have been investigated across manyorganisms and scales of biology. Here we show that proteinscan spontaneously evolve strong but nonfunctional bindinginteractions. Moreover, evolution of such interactions canbe highly unpredictable, with chance mutation events de-termining the outcome. In contrast, evolution of functionalinteractions follows a more predictable pattern of first gainingfolding stability and then losing it as the new binding func-tion improves.

Author contributions: M.M. and A.V.M. designed research; performed research; analyzeddata; and wrote the paper.

The authors declare no conflict of interest.

*This Direct Submission article had a prearranged editor.1To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1415895112/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1415895112 PNAS | February 10, 2015 | vol. 112 | no. 6 | 1797–1802

EVOLU

TION

Dow

nloa

ded

by g

uest

on

Aug

ust 2

7, 2

021

Page 2: Protein folding and binding can emerge as evolutionary ... · DEF Fig. 1. Fitness, selection, and epistasis in energy trait space. (A) Phase diagram of protein structural states

folded; bound: pf;b = e−βðEf+EbÞ�Z;folded; unbound: pf;ub = e−βEf

�Z;

unfolded; unbound: puf;ub = 1=Z:

[1]

Here β is the inverse temperature, Ef is the free energy of folding(also known as ΔG), and Eb =Eb′− μ, where Eb′ is the bindingfree energy and μ is the chemical potential of the target mole-cule. For simplicity we will refer to Eb as the binding energy. Thepartition function is Z= e−βðEf+EbÞ + e−βEf + 1. Note that Ef < 0 forintrinsically stable proteins and Eb < 0 for favorable bindinginteractions; binding can thus be made more favorable eitherby improving the intrinsic affinity between the protein and thetarget (decreasing Eb′) or by increasing the concentration of thetarget molecule (increasing μ). Because the protein can bind onlywhen it is folded, the binding and folding processes are structur-ally coupled. This gives rise to binding-mediated stability: strongbinding stabilizes a folded protein, such that even intrinsicallyunstable proteins ðEf > 0Þ may attain their folded and boundstate if binding is strong enough ðEf +Eb < 0Þ. Different physicalmechanisms are possible for the kinetics of the structural cou-pling (28, 29), but in thermodynamic equilibrium only the freeenergy differences matter.The folding and binding energies depend on the protein’s

genotype (amino acid sequence) σ. We assume that adaptationonly affects “hotspot” residues at the binding interface (30, 31);the rest of the protein does not change on relevant time scalesbecause it is assumed to be already optimized for folding. Weconsider L hotspot residues which, to a first approximation,make additive contributions to the total folding and binding freeenergies (32) (SI Materials and Methods):

EfðσÞ=Ereff +

XL

i=1

ef�i; σi

�; EbðσÞ=Emin

b +XL

i=1

eb�i; σi

�; [2]

where efði; σiÞ and ebði; σiÞ capture the energetic contributions ofamino acid σi at position i (equivalent to ΔΔG). The referenceenergy Eref

f is the fixed contribution to the folding energy fromall other residues in the protein, and the parameter Emin

b is theminimum binding energy among all genotypes (Materials and

Methods). Amino acid energies efði; σiÞ and ebði; σiÞ are randomlysampled from distributions constructed using available ΔΔGdata and other biophysical considerations (Materials and Meth-ods); we will consider properties of the model averaged overthese distributions. The exact shape of the distributions is notimportant for large enough L due to the central limit theorem.

Fitness Landscape. We construct a fitness landscape based on themolecular traits Ef and Eb. Without loss of generality, we assumethat the protein contributes fitness 1 to the organism if it is al-ways folded and bound. Let fub; fuf ∈ ½0; 1� be the multiplicativefitness penalties for being unbound and unfolded, respectively:the total fitness is fub if the protein is unbound but folded, andfubfuf if the protein is both unbound and unfolded. Then thefitness of the protein averaged over all three possible structuralstates in Eq. 1 is given by

F�Ef ;Eb

�= pf;b + fub pf;ub + fub fuf puf;ub: [3]

This fitness landscape is divided into three nearly flat plateauscorresponding to the three protein states of Eq. 1, separated bysteep thresholds corresponding to the folding and binding tran-sitions (Fig. 1A). The heights of the plateaus are determined bythe values of fub and fuf , leading to three qualitative regimes ofthe global landscape structure (Fig. 1 B–D).In the first case (Fig. 1B), a protein that is perfectly folded but

unbound has no fitness advantage over a protein that is bothunbound and unfolded: fub = fuf fub. Thus, selection acts directlyon the binding trait only. This regime requires that either fub = 0[binding is essential, e.g., in the context of conferring antibioticresistance to the cell (24)] or fuf = 1 (misfolded proteins are nottoxic). The latter case also includes directed evolution experi-ments where only function is artificially selected for in vitro. Incontrast, when fub = 1 and 0≤ fuf < 1 (Fig. 1C), a perfectly foldedand bound protein has no fitness advantage over a folded butunbound protein, and thus this case entails direct selection forfolding only; binding is nonfunctional because it provides nointrinsic fitness advantage. Such proteins may have other, func-tional binding interfaces that are already adapted to their targets.Thus, the fuf fitness penalty reflects both the intrinsic misfoldingtoxicity [e.g., due to aggregation (4–7)] as well as the loss ofthe other functional interactions. Finally, in the most generalcase there are distinct selection pressures on both binding and

A B C

D E F

Fig. 1. Fitness, selection, and epistasis in energytrait space. (A) Phase diagram of protein structuralstates. Dashed lines separate structural states of theprotein corresponding to plateaus on the fitnesslandscape; arrows represent the folding transition(green), binding transition (red), and the coupledfolding–binding transition (blue). Fitness landscapesFðEf,EbÞ with direct selection (B) for binding onlyðfub = fuf = 0Þ, (C) for folding only (fub = 1, fuf = 0),and (D) for both binding and folding (fub = 0:9,fuf =0). Black contours indicate constant fitness val-ues. The contours are uniformly sampled in energyspace, with unequal fitness differences betweenadjacent contours. Streamlines indicate the directionof the selection “force” ~∇ logF , with color showingits magnitude (decreasing from red to blue). (E) Ex-ample genotype distribution and mutational net-work for L= 2 and k= 2. (F) Blue arrows indicate thesame mutation on different genetic backgrounds.When the fitness contours are straight, the mutationis beneficial regardless of the background (σ1 or σ2).With curved contours, the same mutation can be-come deleterious ðσ3 → σ3′Þ, indicative of sign epis-tasis. Sign epistasis can give rise to multiple localfitness maxima (e.g., AA and BB in E).

1798 | www.pnas.org/cgi/doi/10.1073/pnas.1415895112 Manhart and Morozov

Dow

nloa

ded

by g

uest

on

Aug

ust 2

7, 2

021

Page 3: Protein folding and binding can emerge as evolutionary ... · DEF Fig. 1. Fitness, selection, and epistasis in energy trait space. (A) Phase diagram of protein structural states

folding. This occurs when 0< fub < 1 and 0≤ fuf < 1, resulting ina landscape that is a hybrid of the previous two cases (Fig. 1D).It is straightforward to generalize our three-state model to

proteins with additional structural states (such as metastablepartially folded configurations) and allow for simultaneous ad-aptation at multiple binding interfaces. Furthermore, the fitnesslandscape in Eq. 3 can be made an arbitrary nonlinear functionof state probabilities. However, these more complex scenarioswould still share the essential features of our basic model: cou-pling between folding and binding traits and sharp fitnessthresholds between bound/unbound and folded/unfolded states.Thus, our qualitative conclusions do not depend on the specificmodel in Eq. 3.

Epistasis and Local Fitness Maxima. For protein sequences of lengthL with an alphabet of size k, each of the kL possible genotypes isprojected onto the 2D trait space of Ef and Eb (Eq. 2) andconnected to Lðk− 1Þ immediate mutational neighbors, forminga network of states that the population must traverse (a simpleexample is shown in Fig. 1E). Adaptive dynamics are determinedby the interplay between the fitness function (Eq. 3) and thedistribution of genotypes in trait space (Eq. 2).This interplay gives rise to the possibility of epistasis and

multiple local fitness maxima. For the energy traits our model isnonepistatic (Eq. 2). At the level of fitness, magnitude epistasis iswidespread owing to the nonlinear dependence of fitness onfolding and binding energies (Eq. 3). Sign epistasis in fitnesscannot occur when the fitness contours in energy space arestraight parallel lines (Fig. 1F). However, curved fitness con-tours, which occur near folding or binding thresholds in ourmodel (Fig. 1 B–D), can produce sign epistasis, giving rise tomultiple local fitness maxima in the genotype space (Fig. 1E).

Evolutionary Dynamics. We assume that a population encodingthe protein of interest evolves in the monomorphic limit:LNu logN � 1, where L is the number of residues, N is an ef-fective population size, and u is the per-residue probability ofmutation per generation (33). In this limit, the entire populationhas the same genotype at any given time, and the rate of sub-stitution from the current genotype to one of its mutationalneighbors is given by Eq. S1 in SI Materials and Methods. We usethe strong-selection limit of the substitution rate (Eq. S2), inwhich the effective population size enters only as an overall timescale. We verify the validity of this assumption in SI Materials andMethods (Fig. S1). In this regime, deleterious mutations never fixand adaptive paths have a finite number of steps, terminatingat a global or local fitness maximum. For compact genomic unitssuch as proteins, the monomorphic condition is generally metin multicellular species, although it may be violated in rapidlymutating unicellular organisms (34, 35). Sequential fixation ofsingle mutants is also a typical mode of adaptation in directedevolution experiments (3). For simplicity, we neglect more com-plex mutational dynamics such as indels and recombination.

Quantitative Description of Adaptation. Although our model candescribe many adaptive scenarios, for concreteness we focus ona specific but widely applicable case. A population begins asperfectly adapted to binding an original target molecule char-acterized by an energy matrix eb1 with minimum binding energyEminb1 (defining a fitness landscape F 1). The population is then

subjected to a selection pressure that favors binding a new target,with energy matrix eb2 and minimum binding energy Emin

b2 (fitnesslandscape F 2). For simplicity we assume that the original bindingtarget is replaced by a new one with uncorrelated binding en-ergies, although it is straightforward to extend our results to thecase where the original target simply changes concentration andhence chemical potential (eb1 = eb2 but Emin

b1 ≠Eminb2 ). The adap-

tive paths are first-passage paths leading from the global maxi-mum on F 1 to a local or global maximum on F 2, with fitnessincreasing monotonically along each path.

Each adaptive path φ with probability Π½φ� is a sequence ofgenotypes connecting initial and final states. Using an exactnumerical algorithm (26, 27) (SI Materials and Methods), wedetermine the path length distribution ρðℓÞ, which gives theprobability of taking an adaptive path with ℓ total amino acidsubstitutions, and the mean adaptation time t. We also considerSpath, the entropy of the adaptive paths:

Spath =−X

φ

Π½φ�logΠ½φ�: [4]

The path entropy is maximized when evolution is neutral, re-sulting in all paths of a given length being accessible and equallylikely. In that case Spath = ℓ logLðk− 1Þ (27), where ℓ is the meanpath length.We also consider the path density ψðσÞ, which gives the total

probability of reaching a state σ at any point along a path. Whenσ is a final state (a local fitness maximum on F 2), the pathdensity is the commitment probability. We calculate the entropyScom of the commitment probabilities as

Scom =−X

final states σ

ψðσÞlogψðσÞ: [5]

Direct Selection for Binding Only. We first focus on the casefub = fuf fub in Eq. 3. The geometry of the fitness contours is in-variant under overall shifts in the binding energy Eb (Fig. 1B);equivalently, the direction (but not the magnitude) of the selectionforce ð~∇ logF=

��~∇ logF ��Þ does not depend on Eb. Thus, withoutloss of generality, we set Emin

b1 =Eminb2 = constant in this section.

The contours of constant fitness are parallel to the Ef axiswhen Ef is low, indicating that, as expected, selection acts only onbinding when proteins are sufficiently stable. However, whenstability is marginal [which describes most natural proteins (12,13)], the fitness contours begin to curve downward, indicatingindirect selection for folding, even though selection acts directlyonly on the binding trait. Thus, adaptation will produce a trait(more stability) that is neutral at the level of the fitness functionsimply because it is coupled with another trait (binding) that isunder selection. Folding stability can therefore be considered anevolutionary spandrel (20) caused by structural coupling betweenfolding and binding. For intrinsically unstable proteins ðEf > 0Þ,binding-mediated stability causes the fitness contours to ap-proach diagonal lines: selection effectively acts to improve bothbinding and folding equally (Fig. 1B).An example realization of evolutionary dynamics in the mar-

ginally stable regime is shown in Fig. 2 A and B (see Fig. S2 forstable and intrinsically unstable examples and Fig. S3 for dis-tributions of initial, intermediate, and final states averaged overmultiple landscape realizations). There is typically just one ortwo fitness maxima; all maxima are usually accessible (Fig. 2C).For stable proteins, the global maximum almost always coincideswith the best-binding genotype and is about as far as a randomlychosen genotype from the best-folding genotype (Fig. 2D; tworandom genotypes are separated by 1− 1=k= 0:8 for an alphabetof size k= 5). However, as Ef becomes greater, the averagedistance between the maxima and the best-binding genotypeincreases, while the average distance between the maxima andthe best-folding genotype decreases, until they meet halfway forintrinsically unstable proteins (Fig. 2D). In general the maximalie on or near the Pareto front (36) (Fig. 2A and Fig. S2), definedhere as the set of genotypes such that either Ef or Eb cannot bedecreased further without increasing the other (the global max-imum is always on the front, whereas local maxima may not be).As protein stability decreases (Ef becomes higher), the aver-

age distance between initial and final states for adaptation decrea-ses. As a result the mean path length (number of substitutions)decreases as well, although the variance of path lengths is relativelyconstant over all energies (Fig. 2E). The path entropy per substitution

Manhart and Morozov PNAS | February 10, 2015 | vol. 112 | no. 6 | 1799

EVOLU

TION

Dow

nloa

ded

by g

uest

on

Aug

ust 2

7, 2

021

Page 4: Protein folding and binding can emerge as evolutionary ... · DEF Fig. 1. Fitness, selection, and epistasis in energy trait space. (A) Phase diagram of protein structural states

Spath=ℓ also decreases with Ef , reflecting greater constraints onadaptive paths (note that Spath=ℓ= logLðk− 1Þ≈ 3:2 for neutralevolution with L= 6 and k= 5). Finally, Scom ≈ 0:31 in the mar-ginally stable regime (Fig. 2F). Because the average number ofmaxima is ≈ 1:9 in this regime (Fig. 2C), the maximum value ofScom is roughly log 1:9≈ 0:64, indicating that multiple maxima areusually not equally accessible.

Direct Selection for Folding Only. In this regime, fub = 1 and 0≤ fuf < 1in Eq. 3, which means that the binding interaction under consid-eration is nonfunctional. Analogous with the previous case, thegeometry of the fitness contours and thus most landscape propertiesare now independent of the folding energy Ef (Fig. 1C).When the nonfunctional binding is weak, the fitness contours

are parallel to the Eb axis, indicating that selection acts only onfolding (Fig. 1C). This regime yields a single fitness maximumdue to the lack of sign epistasis; the maximum predominantlycoincides with the best-folding genotype (Fig. 3A). However,with increasing binding strength the fitness contours curve suchthat the effective selection force attempts to improve bothbinding and folding equally, due to binding-mediated stability.Thus, binding emerges as an evolutionary spandrel in this case.There is also an increased likelihood of multiple local maxima,located between the best-folding and best-binding genotypes(Fig. 3A).Depending on the abundance of the old and new ligands in the

cell and their binding properties, several adaptive scenarios maytake place. First, the best-binding strengths Emin

b1 and Eminb2 of the

old and new targets may be similar in magnitude. If both areweak, initial and final states are likely to be the best-foldinggenotype or close to it (Fig. 3A); in this case, there is a highprobability that no adaptation will occur (Fig. 3B). When Emin

b1and Emin

b2 are both low, adaptation usually occurs to accommo-date the binding specificity of the new ligand (Fig. 3B and

Fig. S4A). Surprisingly, we see that proteins frequently evolvestronger binding at the expense of folding (Fig. S4A, Bottom).This happens due to the constraints of the genotype–phenotypemap: not enough genotypes are available to optimize bothtraits simultaneously.It is also possible to gain or lose a nonfunctional binding in-

terface through adaptation. In the first case, the new target ismore abundant or has stronger binding than the old oneðEmin

b2 <Eminb1 Þ. Thus, the initial state is the best-folding genotype

or close to it, and the protein adapts toward a genotype withintermediate folding and binding (Fig. S4B). As before, adap-tation is tightly constrained by the genotype–phenotype map,sacrificing the trait (folding stability) under direct selection toaffect the spandrel (nonfunctional binding interaction). Effec-tively, the protein switches from being a self-reliant folder toneeding a binding partner to stabilize folding. In the second case(Emin

b1 <Eminb2 ), the dynamics is opposite: the protein loses its

nonfunctional binding interface and becomes self-reliant (Fig.S4C). Thus, proteins may acquire or lose binding interfacesdepending on the availability of ligands that can participate inbinding-mediated stability.

Divergent Evolutionary Fates.Near Eb = 0 the selection streamlinesdiverge in Fig. 1C, creating the possibility of multiple localmaxima with at least one having negative Eb (strong binding)and at least one having positive Eb (negligible binding); Fig. 3Cshows an example. Such a protein has two qualitatively differ-ent fates available to it: one in which it evolves to bind thetarget and another in which it does not. The eventual fate of theprotein is determined by random mutation events, makingevolution inherently unpredictable. Indeed, the distribution offinal states can be strongly bimodal (Fig. 3D), and there isa sizable probability of divergent fates across a range of bindingenergies (Fig. S5).

A

C

E

B

D

F

Fig. 2. Adaptation with direct selection for bind-ing only. (A) Distribution of folding and bindingenergies for all kL = 56 genotypes (small graypoints) in a single realization of the model witha marginally stable protein (Eref

f =−3 kcal/mol). Theblack star indicates the initial state for adaptation(global maximum on F 1); red triangles indicatelocal fitness maxima on F 2, shaded according totheir commitment probabilities ψðσÞ; and bluecrosses show best-folding and best-binding geno-types. The magenta line connects genotypes on thePareto front, and black contours indicate constantfitness F 2. (B) The region of energy space accessibleto adaptive paths, zoomed in from A. Examplepaths are shown in blue and green; black circlesindicate intermediate states along paths, sizedproportional to their path density ψðσÞ; small graycircles are genotypes inaccessible to adaptation. (C)Average number m of local fitness maxima (solid,green) and average number macc of local maximaaccessible to adaptation (dashed, blue) versus Eref

f .(D) Average per-residue Hamming distance be-tween the maxima and the best-folding genotype(δf; solid, green) and the best-binding genotype (δb;dashed, blue) versus Eref

f . (E) Average distributionsρðℓÞ of path lengths ℓ (number of substitutions) forstable proteins (Eref

f =−15 kcal/mol; solid, green),marginally stable proteins (Eref

f =−3 kcal/mol;dashed, blue), and intrinsically unstable proteins(Eref

f = 5 kcal/mol; dotted, red). (F) Average per-sub-stitution path entropy Spath=ℓ (solid, green) and av-erage entropy of commitment probabilities Scom(dashed, blue) versus Eref

f . Averages are taken over105 realizations of the energy matrices (ef, eb1 , eb2 )in E and 104 realizations otherwise. In all panels,fub = fuf = 0 and Emin

b1= Emin

b2=−5 kcal/mol.

1800 | www.pnas.org/cgi/doi/10.1073/pnas.1415895112 Manhart and Morozov

Dow

nloa

ded

by g

uest

on

Aug

ust 2

7, 2

021

Page 5: Protein folding and binding can emerge as evolutionary ... · DEF Fig. 1. Fitness, selection, and epistasis in energy trait space. (A) Phase diagram of protein structural states

Direct Selection for Both Binding and Folding.Here we consider themost general case in which 0< fub < 1 and 0≤ fuf < 1 in Eq. 3 (Fig.1D). The fitness landscape is divided into two regions by astraight diagonal contour with fitness fub and slope −1. Belowthis contour, the landscape is qualitatively similar to the case ofselection for binding only (Fig. 1B), whereas above the contourthe landscape resembles that of the folding-only selection sce-nario (Fig. 1C). Thus, evolutionary dynamics for proteins withfavorable binding and folding energies ðEf ;Eb < 0Þ will largelyresemble the case of selection for binding only. However,a qualitatively different behavior will occur if the distribution ofgenotypes straddles the diagonal contour (Fig. 4). This happenswhen folding stability is marginal and initial binding is unfavor-able. In this case, selection streamlines around the diagonalcontour (Fig. 1D) and the genotype–phenotype map tightlyconstrain the adaptive paths to gain extra folding stability first,and then lose it as the binding function is improved (Fig. 4).

Tempo and Rhythm of Adaptation. The strength of selection is theprimary determinant of the average adaptation time t. If theselection coefficient s is small (but Ns> 1), the substitution ratein Eq. S1 is proportional to s. Thus, as selection becomes ex-ponentially weaker for lower energies (Fig. S1), adaptationbecomes exponentially slower. The distribution of the total ad-aptation time over an adaptive path is highly nonuniform. Forexample, in the case of selection for binding only and a margin-ally stable protein, the adaptation time is concentrated at the endof the path, one mutation away from the final state (Fig. S6 Aand B). Substitutions at the beginning of the path occur quicklybecause there are many possible beneficial substitutions andbecause selection is strong; in contrast, at the end of the pathadaptation slows down dramatically as beneficial mutations aredepleted and selection weakens. This behavior is observed inmost of the other model regimes as well.The exception to this pattern occurs in the case of selection

for both binding and folding in marginally stable and marginally

bound proteins, owing to the unique contour geometry (Figs. 1Dand 4). As the adaptive paths wrap around the diagonal contourin the region of high Eb and low Ef , the landscape flattens, makingselection weaker and substitutions slower (Fig. S6C). Thus, most ofthe time is spent in the middle of the path rather than the end (Fig.S6D). Adaptation accelerates toward the end of the path as thestrength of selection increases again. If the intermediate slowdownis significant enough, a protein may not have time to complete thesecond half of its path before environmental conditions change, sothat it will never evolve the new binding function.

DiscussionProtein Folding and Binding As Evolutionary Spandrels. In the dec-ades since Gould and Lewontin’s paper (20), the existence ofevolutionary spandrels has become a critical evolutionary con-cept. There are many possible scenarios in which spandrels canemerge through evolution (20, 21), although two key mecha-nisms are indirect selection (arising from coupled traits) andneutral processes (such as genetic drift and biases in mutationand recombination) (37). Here we have focused on the former,which we expect to be more important on short time scales.Previous studies have argued that the marginal stability of

most proteins may be an evolutionary spandrel that evolveddue to mutation–selection balance (3, 12, 13). We suggest morebroadly that having folding stability at all may be a spandrel forproteins with no misfolding toxicity. Even more striking is thepossibility that some binding interactions may be spandrels thatevolved solely to stabilize protein structures. Previously the roleof binding-mediated stability has been mainly discussed in thecontext of intrinsically disordered proteins (28, 29) and therapiesfor protein misfolding diseases (38); here we show that it is ageneral phenomenon with significant implications for our inter-pretation of data on proteome-wide interactions (39). In par-ticular, our results suggest that less stable proteins should havea greater number of nonfunctional interactions. Protein stabilityð−EfÞ was previously argued to correlate positively with abun-dance to explain the observed negative correlation of abundancewith evolutionary rate (16, 17), whereas models of protein–protein interaction networks imply that protein abundance alsocorrelates negatively with the intrinsic number of interactions(e.g., as measured by surface hydrophobicity) (19, 40). Togetherthese observations argue that stability should indeed be nega-tively correlated with the number of interactions (41). Moreover,the possibility of evolving new binding interactions solely tostabilize protein structure suggests how chaperones and proteinquality control machinery may have first evolved (42).

Pareto Optimization of Proteins. The Pareto front is a useful con-cept in problems of multiobjective optimization (36). The Paretofront in our model consists of the protein sequences alongthe low-Ef , low-Eb edge of the genotype distribution (e.g., see

A

C

B

D

Fig. 3. Adaptation with direct selection for folding only. (A) The averagenumber of local maxima m (solid, green) and their average per-residue Ham-ming distances from the best-folding (δf; dashed, blue) and the best-binding (δb;dotted, red) genotypes versus Emin

b . (B) Probability that adaptation occurs (i.e.,the initial state is not coincident with any of the final states) as a function ofEminb1

and Eminb2

. (C) Example landscape with divergent binding fates: there aretwo accessible local maxima, one with Eb < 0 (favorable binding, ψðσÞ≈ 0:6) andthe other with Eb > 0 (negligible binding, ψðσÞ≈ 0:4). All symbols are the sameas in Fig. 2B. (D) Average distribution of local maxima weighted by theircommitment probabilities. In C and D, Emin

b1= Emin

b2=−6:5 kcal/mol. Averages are

taken over 105 realizations of the energy matrices in D and 104 realizationsotherwise. In all panels, fub = 1, fuf = 0, and Eref

f = 0 kcal/mol.

A B C

Fig. 4. Adaptation with direct selection for both binding and folding.(A and B) Example distribution of folding and binding energies for a marginallystable and marginally bound protein; all symbols are the same as in Fig. 2 Aand B. (C) Distribution of initial states in green, intermediate states in blue(weighted by their path densities), and final states in red (weighted by theircommitment probabilities) for 105 realizations of the energy matrices. In allpanels, fub = 0:9, fuf = 0, and Eref

f = Eminb1

= Eminb2

=−4 kcal/mol.

Manhart and Morozov PNAS | February 10, 2015 | vol. 112 | no. 6 | 1801

EVOLU

TION

Dow

nloa

ded

by g

uest

on

Aug

ust 2

7, 2

021

Page 6: Protein folding and binding can emerge as evolutionary ... · DEF Fig. 1. Fitness, selection, and epistasis in energy trait space. (A) Phase diagram of protein structural states

Fig. 2A). Pareto optimization assumes that all states on the frontare valid final states for adaptation; however, nonlinear fitnessfunctions with saturation effects (as expected from biophysicalconsiderations, e.g., Eq. 3) will confound this assumption. Ourmodel shows how this nonlinearity leads to a small subset of truefinal states on or even off the front. Thus, Pareto optimizationdoes not capture a key feature of the underlying biophysics,providing only a rough approximation to the true dynamics.

Epistasis and Evolutionary Predictability.Our results also shed lighton the role of epistasis—the correlated effects of mutationsat different sites—in protein evolution. Epistasis underlies theruggedness of fitness landscapes (43, 44) and determines thepredictability of evolution, an issue of paramount importance inbiology (24, 45, 46). In most cases considered here, limited signepistasis gives rise to less-predictable intermediate pathways(high Spath) but highly predictable final outcomes (low Scom).However, there are two major exceptions to this pattern. First,proteins with a nonfunctional binding interaction may havemultiple local maxima, some with strong binding and others withweak binding (Fig. 3 C and D). Here both the intermediatepathways and the final states are unpredictable—pure chance,in the form of random mutations, drives the population to onebinding fate or the other. The second exception occurs in pro-teins with direct selection for both binding and folding. Herethere is usually a single maximum, but the adaptive paths are

tightly constrained in energy space (Fig. 4), making protein evo-lution highly predictable at the level of biophysical traits.

Materials and MethodsEnergetics of Protein Folding and Binding. Folding energetics are probed ex-perimentally and computationally by measuring the changes in Ef resultingfrom point mutations of a reference sequence. We choose an arbitrary ge-notype σref as a reference, with efði,σirefÞ= 0 for all i∈ f1, . . . ,Lg such thatEfðσrefÞ= Eref

f . Then we sample the remaining entries of ef from a Gaussiandistribution with mean 1.25 kcal/mol and standard deviation 1.6 kcal/mol,consistent with available ΔΔG data showing these energies to be universallydistributed over many proteins (8). We also choose an arbitrary genotype σbbto have the minimum binding energy: ebði,σibbÞ= 0 for all i∈ f1, . . . ,Lg andEbðσbbÞ= Emin

b . Because binding hotspot residues typically have a 1–3 kcal/molpenalty for mutations away from the wild-type amino acid (30, 31), wesample the other entries of eb from an exponential distribution defined inthe range of ð1,∞Þ kcal/mol, with mean 2 kcal/mol. This distribution is con-sistent with alanine-scanning experiments that probe energetics of aminoacids at the binding interface (47). Note that binding energy matrices eb1

and eb2are sampled independently in all figures. We consider L= 6 hot-

spot residues and a reduced alphabet of k=5 amino acids (grouped intonegative, positive, polar, hydrophobic, and other), resulting in 56 = 15,625genotypes. Our population genetics model and the algorithm for ex-act calculation of adaptive path statistics are described in SI Materialsand Methods.

ACKNOWLEDGMENTS. A.V.M. acknowledges support from an Alfred P. SloanResearch Fellowship.

1. Creighton TE (1992) Proteins: Structures and Molecular Properties (Freeman, NewYork).

2. Campbell RE, et al. (2002) A monomeric red fluorescent protein. Proc Natl Acad SciUSA 99(12):7877–7882.

3. Bloom JD, Arnold FH (2009) In the light of directed evolution: Pathways of adaptiveprotein evolution. Proc Natl Acad Sci USA 106(Suppl 1):9995–10000.

4. Bucciantini M, et al. (2002) Inherent toxicity of aggregates implies a common mech-anism for protein misfolding diseases. Nature 416(6880):507–511.

5. Chiti F, Stefani M, Taddei N, Ramponi G, Dobson CM (2003) Rationalization of theeffects of mutations on peptide and protein aggregation rates. Nature 424(6950):805–808.

6. Drummond DA, Wilke CO (2008) Mistranslation-induced protein misfolding asa dominant constraint on coding-sequence evolution. Cell 134(2):341–352.

7. Geiler-Samerotte KA, et al. (2011) Misfolded proteins impose a dosage-dependentfitness cost and trigger a cytosolic unfolded protein response in yeast. Proc Natl AcadSci USA 108(2):680–685.

8. Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik DS (2007) The stability effectsof protein mutations appear to be universally distributed. J Mol Biol 369(5):1318–1332.

9. Tokuriki N, Stricher F, Serrano L, Tawfik DS (2008) How protein stability and newfunctions trade off. PLOS Comput Biol 4(2):e1000002.

10. Wang X, Minasov G, Shoichet BK (2002) Evolution of an antibiotic resistance enzymeconstrained by stability and activity trade-offs. J Mol Biol 320(1):85–95.

11. Sun SB, et al. (2013) Mutational analysis of 48G7 reveals that somatic hypermutationaffects both antibody stability and binding affinity. J Am Chem Soc 135(27):9980–9983.

12. Taverna DM, Goldstein RA (2002) Why are proteins marginally stable? Proteins 46(1):105–109.

13. Zeldovich KB, Chen P, Shakhnovich EI (2007) Protein stability imposes limits on or-ganism complexity and speed of molecular evolution. Proc Natl Acad Sci USA 104(41):16152–16157.

14. Bloom JD, Labthavikul ST, Otey CR, Arnold FH (2006) Protein stability promotesevolvability. Proc Natl Acad Sci USA 103(15):5869–5874.

15. DePristo MA, Weinreich DM, Hartl DL (2005) Missense meanderings in sequencespace: A biophysical view of protein evolution. Nat Rev Genet 6(9):678–687.

16. Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH (2005) Why highly ex-pressed proteins evolve slowly. Proc Natl Acad Sci USA 102(40):14338–14343.

17. Serohijos AWR, Rimas Z, Shakhnovich EI (2012) Protein biophysics explains why highlyabundant proteins evolve slowly. Cell Reports 2(2):249–256.

18. Johnson ME, Hummer G (2011) Nonspecific binding limits the number of proteins ina cell and shapes their interaction networks. Proc Natl Acad Sci USA 108(2):603–608.

19. Heo M, Maslov S, Shakhnovich E (2011) Topology of protein interaction networkshapes protein abundances and strengths of their functional and nonspecific inter-actions. Proc Natl Acad Sci USA 108(10):4258–4263.

20. Gould SJ, Lewontin RC (1979) The spandrels of San Marco and the Panglossian par-adigm: A critique of the adaptationist programme. Proc R Soc Lond B Biol Sci 205(1161):581–598.

21. Pigliucci I, Kaplan I (2000) The fall and rise of Dr Pangloss: Adaptationism and theSpandrels paper 20 years later. Trends Ecol Evol 15(2):66–70.

22. Weiss MA, et al. (2002) Protein structure and the spandrels of San Marco: Insulin’sreceptor-binding surface is buttressed by an invariant leucine essential for its stability.Biochemistry 41(3):809–819.

23. Barrett RDH, Hoekstra HE (2011) Molecular spandrels: Tests of adaptation at thegenetic level. Nat Rev Genet 12(11):767–780.

24. Weinreich DM, Delaney NF, Depristo MA, Hartl DL (2006) Darwinian evolution canfollow only very few mutational paths to fitter proteins. Science 312(5770):111–114.

25. Chou HH, Chiu HC, Delaney NF, Segrè D, Marx CJ (2011) Diminishing returns epistasisamong beneficial mutations decelerates adaptation. Science 332(6034):1190–1192.

26. Manhart M, Morozov AV (2013) Path-based approach to random walks on networkscharacterizes how proteins evolve new functions. Phys Rev Lett 111(8):088102.

27. Manhart M, Morozov AV (2014) First-Passage Phenomena and Their Applications, edsMetzler R, Oshanin G, Redner S (World Scientific, Singapore).

28. Wright PE, Dyson HJ (2009) Linking folding and binding. Curr Opin Struct Biol 19(1):31–38.

29. Habchi J, Tompa P, Longhi S, Uversky VN (2014) Introducing protein intrinsic disorder.Chem Rev 114(13):6561–6588.

30. Clackson T, Wells JA (1995) A hot spot of binding energy in a hormone-receptor in-terface. Science 267(5196):383–386.

31. Moreira IS, Fernandes PA, Ramos MJ (2007) Hot spots—a review of the protein-protein interface determinant amino-acid residues. Proteins 68(4):803–812.

32. Wells JA (1990) Additivity of mutational effects in proteins. Biochemistry 29(37):8509–8517.

33. Champagnat N (2006) A microscopic interpretation for adaptive dynamics trait sub-stitution sequence models. Stoch Proc Appl 116(8):1127–1160.

34. Lynch M (2007) The Origins of Genome Architecture (Sinauer, Sunderland, MA).35. Charlesworth B (2009) Fundamental concepts in genetics: Effective population size

and patterns of molecular evolution and variation. Nat Rev Genet 10(3):195–205.36. Shoval O, et al. (2012) Evolutionary trade-offs, Pareto optimality, and the geometry of

phenotype space. Science 336(6085):1157–1160.37. Lynch M (2007) The evolution of genetic networks by non-adaptive processes. Nat Rev

Genet 8(10):803–813.38. Rodrigues JV, Henriques BJ, Lucas TG, Gomes CM (2012) Cofactors and metabolites as

protein folding helpers in metabolic diseases. Curr Top Med Chem 12(22):2546–2559.39. Stark C, et al. (2006) BioGRID: A general repository for interaction datasets. Nucleic

Acids Res 34(Database issue):D535–D539.40. Yang JR, Liao BY, Zhuang SM, Zhang J (2012) Protein misinteraction avoidance causes

highly expressed proteins to evolve slowly. Proc Natl Acad Sci USA 109(14):E831–E840.41. Dixit PD, Maslov S (2013) Evolutionary capacitance and control of protein stability in

protein-protein interaction networks. PLOS Comput Biol 9(4):e1003023.42. Hartl FU, Bracher A, Hayer-Hartl M (2011) Molecular chaperones in protein folding

and proteostasis. Nature 475(7356):324–332.43. Poelwijk FJ, Kiviet DJ, Weinreich DM, Tans SJ (2007) Empirical fitness landscapes re-

veal accessible evolutionary paths. Nature 445(7126):383–386.44. Szendro IG, Schenk MF, Franke J, Krug J, de Visser JA (2013) Quantitative analyses of

empirical fitness landscapes. J Stat Mech 2013:P01005.45. Gould SJ (1990)Wonderful Life: The Burgess Shale and the Nature of History (Norton,

New York).46. Lobkovsky AE, Koonin EV (2012) Replaying the tape of life: Quantification of the

predictability of evolution. Front Genet 3:246.47. Thorn KS, Bogan AA (2001) ASEdb: A database of alanine mutations and their effects

on the free energy of binding in protein interactions. Bioinformatics 17(3):284–285.

1802 | www.pnas.org/cgi/doi/10.1073/pnas.1415895112 Manhart and Morozov

Dow

nloa

ded

by g

uest

on

Aug

ust 2

7, 2

021