7
DOI: 10.1126/science.278.5335.82 , 82 (1997); 278 Science et al. Bassil I. Dahiyat, Selection De Novo Protein Design: Fully Automated Sequence www.sciencemag.org (this information is current as of February 29, 2008 ): The following resources related to this article are available online at http://www.sciencemag.org/cgi/content/full/278/5335/82 version of this article at: including high-resolution figures, can be found in the online Updated information and services, found at: can be related to this article A list of selected additional articles on the Science Web sites http://www.sciencemag.org/cgi/content/full/278/5335/82#related-content http://www.sciencemag.org/cgi/content/full/278/5335/82#otherarticles , 16 of which can be accessed for free: cites 41 articles This article 433 article(s) on the ISI Web of Science. cited by This article has been http://www.sciencemag.org/cgi/content/full/278/5335/82#otherarticles 99 articles hosted by HighWire Press; see: cited by This article has been http://www.sciencemag.org/cgi/collection/biochem Biochemistry : subject collections This article appears in the following http://www.sciencemag.org/about/permissions.dtl in whole or in part can be found at: this article permission to reproduce of this article or about obtaining reprints Information about obtaining registered trademark of AAAS. is a Science 1997 by the American Association for the Advancement of Science; all rights reserved. The title Copyright American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the Science on February 29, 2008 www.sciencemag.org Downloaded from

De Novo Protein Design: Fully Automated Sequence ......De Novo Protein Design: Fully Automated Sequence Selection Bassil I. Dahiyat† and Stephen L. Mayo* The first fully automated

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: De Novo Protein Design: Fully Automated Sequence ......De Novo Protein Design: Fully Automated Sequence Selection Bassil I. Dahiyat† and Stephen L. Mayo* The first fully automated

DOI: 10.1126/science.278.5335.82 , 82 (1997); 278Science

et al.Bassil I. Dahiyat,SelectionDe Novo Protein Design: Fully Automated Sequence

www.sciencemag.org (this information is current as of February 29, 2008 ):The following resources related to this article are available online at

http://www.sciencemag.org/cgi/content/full/278/5335/82version of this article at:

including high-resolution figures, can be found in the onlineUpdated information and services,

found at: can berelated to this articleA list of selected additional articles on the Science Web sites

http://www.sciencemag.org/cgi/content/full/278/5335/82#related-content

http://www.sciencemag.org/cgi/content/full/278/5335/82#otherarticles, 16 of which can be accessed for free: cites 41 articlesThis article

433 article(s) on the ISI Web of Science. cited byThis article has been

http://www.sciencemag.org/cgi/content/full/278/5335/82#otherarticles 99 articles hosted by HighWire Press; see: cited byThis article has been

http://www.sciencemag.org/cgi/collection/biochemBiochemistry

: subject collectionsThis article appears in the following

http://www.sciencemag.org/about/permissions.dtl in whole or in part can be found at: this article

permission to reproduce of this article or about obtaining reprintsInformation about obtaining

registered trademark of AAAS. is aScience1997 by the American Association for the Advancement of Science; all rights reserved. The title

CopyrightAmerican Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by theScience

on

Feb

ruar

y 29

, 200

8 w

ww

.sci

ence

mag

.org

Dow

nloa

ded

from

Page 2: De Novo Protein Design: Fully Automated Sequence ......De Novo Protein Design: Fully Automated Sequence Selection Bassil I. Dahiyat† and Stephen L. Mayo* The first fully automated

De Novo Protein Design: FullyAutomated Sequence Selection

Bassil I. Dahiyat† and Stephen L. Mayo*

The first fully automated design and experimental validation of a novel sequence for anentire protein is described. A computational design algorithm based on physical chemicalpotential functions and stereochemical constraints was used to screen a combinatoriallibrary of 1.9 3 1027 possible amino acid sequences for compatibility with the designtarget, a bba protein motif based on the polypeptide backbone structure of a zinc fingerdomain. A BLAST search shows that the designed sequence, full sequence design 1(FSD-1), has very low identity to any known protein sequence. The solution structure ofFSD-1 was solved by nuclear magnetic resonance spectroscopy and indicates thatFSD-1 forms a compact well-ordered structure, which is in excellent agreement with thedesign target structure. This result demonstrates that computational methods can per-form the immense combinatorial search required for protein design, and it suggests thatan unbiased and quantitative algorithm can be used in various structural contexts.

Significant advances have been made to-ward the design of stable, well-folded pro-teins with novel sequences (1). These effortshave generated insight into the factors thatcontrol protein folding and have suggestednew approaches to biotechnology (2). Inorder to broaden the scope and power ofprotein design techniques, several groupshave developed and experimentally testedsystematic quantitative methods for proteindesign directed toward developing generaldesign algorithms (3, 4). These techniques,which have been used to screen possiblesequences for compatibility with the desiredprotein fold, have been focused mostly onthe redesign of protein cores.

We have sought to expand the range ofcomputational protein design to residues ofall parts of a protein: the buried core, thesolvent-exposed surface, and the boundarybetween core and surface (4–6). Our goal isan unbiased, quantitative design algorithmthat is based on the physical properties thatdetermine protein structure and stability andthat is not limited to specific folds or motifs.Such a method should escape the lack ofgenerality of design approaches based on sys-tem-specific heuristics or subjective consid-erations or both. We have developed ouralgorithm by combining theory, computa-tion, and experiment in a cycle that hasimproved our understanding of the physicalchemistry governing protein design (4). Wenow report the successful design by the algo-

rithm of an original sequence for an entireprotein and the experimental validation ofthe protein’s structure.

Sequence selection. Our design method-ology begins with a backbone fold and weattempt to select an amino acid sequencethat will stabilize this target structure. Themethod consists of an automated side-chainselection algorithm that explicitly and quan-titatively considers specific interactions be-tween (i) side chain and backbone and (ii)side chain and side chain (4). The sidechain selection algorithm screens all possi-ble amino acid sequences and finds the op-timal sequence and side-chain orientationsfor a given backbone. In order to correctlyaccount for the torsional flexibility of sidechains and the geometric specificity of side-chain placement, we consider a discrete setof all allowed conformers of each side chain,called rotamers (7). The sizable search prob-lem presented by rotamer sequence optimi-zation is overcome by application of thedead-end elimination (DEE) theorem (8).Our implementation of the DEE theoremextends its utility to sequence design andrapidly finds the globally optimal sequencein its optimal conformation (4).

Previously we determined the differentcontributions of core, surface, and boundaryresidues to the scoring of a sequence ar-rangement. The sequence predictions of ascoring function, or a combination of scor-ing functions, were experimentally tested inorder to assess the accuracy of the algorithmand to derive improvements to it. We suc-cessfully redesigned the core of a coiled coiland of the streptococcal protein G b1(Gb1) domain using a van der Waals poten-tial to account for steric constraints and anatomic solvation potential favoring the buri-al and penalizing the exposure of nonpolar

surface area (4, 6). Effective solvation pa-rameters and the appropriate balance be-tween packing and solvation terms werefound by systematic analysis of experimentaldata and feedback into the simulation. Sol-vent-exposed residues on the surface of aprotein were designed with the use of ahydrogen-bond potential and secondarystructure propensities in addition to a vander Waals potential. Coiled coils designedwith such a scoring function were 10° to12°C more thermally stable than the natu-rally occurring analog (5). Residues thatform the boundary between the core andsurface require a combination of the coreand the surface scoring functions. The algo-rithm considers both hydrophobic and hy-drophilic amino acids at boundary positions,whereas core positions are restricted to hy-drophobic amino acids and surface positionsare restricted to hydrophilic amino acids.

In order to assess the capability of ourdesign algorithm, we have computed theentire amino acid sequence for a small pro-tein motif. We sought a protein fold thatwould be small enough to be both computa-tionally and experimentally tractable, yetlarge enough to form an independently fold-ed structure in the absence of disulfide bondsor metal binding. We chose the bba motiftypified by the zinc finger DNA bindingmodule (9). Although this motif consists offewer than 30 residues, it does contain sheet,helix, and turn structures. The ability of thisfold to form in the absence of metal ions ordisulfide bonds has been demonstrated byImperiali and co-workers, who designed a23-residue peptide, containing an unusualamino acid (D-proline) and a nonnaturalamino acid [3-(1,10-phenanthrol-2-yl)-L-alanine], which achieved this fold (10); ourinitial characterization of a partially comput-ed sequence indicated that it also forms thisfold (11). In computing the full sequence forthis target fold, we use the scoring functionsfrom our previous work without modification(12). The bba motif was not used in any ofour prior work to develop the design meth-odology and therefore provides a test of thealgorithm’s generality.

The sequence selection algorithm requiresstructure coordinates that define the targetmotif ’s backbone (N, Ca, C, and O atomsand Ca-Cb vectors). The Brookhaven Pro-tein Data Bank (PDB) (13) was examined forhigh-resolution structures of the bba motif,and the second zinc finger module of theDNA binding protein Zif268 was selected asour design template (9, 14). In order to assignthe residue positions in the template structureinto core, surface, or boundary classes, theorientation of the Ca-Cb vectors was assessedrelative to a solvent-accessible surface com-puted with only the template Ca atoms (15).The small size of this motif limits to one

B I. Dahiyat, Division of Chemistry and Chemical Engi-neering, California Institute of Technology, mail code147-75, Pasadena, CA 91125, USA.S. L. Mayo, Howard Hughes Medical institute and Divi-sion of Biology, California Institute of Technology, mailcode 147-75, Pasadena, CA 91125, USA.

*To whom correspondence should be addressed. E-mail:[email protected]†Present address: Xencor, Pasadena, CA 91106, USA.

RESEARCH ARTICLE

SCIENCE z VOL. 278 z 3 OCTOBER 1997 z www.sciencemag.org82

on

Feb

ruar

y 29

, 200

8 w

ww

.sci

ence

mag

.org

Dow

nloa

ded

from

Page 3: De Novo Protein Design: Fully Automated Sequence ......De Novo Protein Design: Fully Automated Sequence Selection Bassil I. Dahiyat† and Stephen L. Mayo* The first fully automated

(position 5) the number of residues that canbe assigned unambiguously to the core, where-as seven residues (positions 3, 7, 12, 18, 21,22, and 25) were classified as boundary andthe remaining 20 residues were assigned to thesurface. Whereas three of the zinc bindingpositions of Zif268 are in the boundary orcore, one residue, position 8, has a Ca-Cbvector directed away from the geometric cen-ter of the protein and is classified as a surfaceposition. As in our previous studies, the aminoacids considered at the core positions duringsequence selection were Ala, Val, Leu, Ile,Phe, Tyr, and Trp; the amino acids consideredat the surface positions were Ala, Ser, Thr,His, Asp, Asn, Glu, Gln, Lys, and Arg; andthe combined core and surface amino acid sets(16 amino acids) were considered at theboundary positions. Two of the residue posi-tions (9 and 27) have f angles greater than 0°and are set to Gly by the sequence selectionalgorithm to minimize backbone strain.

The total number of amino acid sequenc-es that must be considered by the designalgorithm is the product of the number ofpossible amino acids at each residue posi-tion. The bba motif residue classificationdescribed above results in a virtual combi-natorial library of 1.9 3 1027 possible aminoacid sequences (16). This library size is 15orders of magnitude larger than that acces-sible by experimental random library ap-proaches. A corresponding peptide libraryconsisting of only a single molecule for each28-residue sequence would have a mass of11.6 metric tons (17). In order to accuratelymodel the geometric specificity of side-chain placement, we explicitly consider thetorsional flexibility of amino acid sidechains in our sequence scoring by represent-ing each amino acid with a discrete set ofallowed conformations, called rotamers(18). As a result, the design algorithm mustconsider all rotamers for each possible aminoacid at each residue position. The total sizeof the search space for the bba motif istherefore 1.1 3 1062 possible rotamer se-quences. We use a search algorithm basedon an extension of the DEE theorem tosolve the rotamer sequence optimizationproblem (4, 8). Efficient implementation ofthe DEE theorem has made complete pro-tein sequence design tractable for about 50residues on current parallel computers in asingle calculation. The rotamer optimizationproblem for the bba motif required 90 CPUhours to find the optimal sequence (19, 20).

The optimal sequence (Fig. 1) is calledfull sequence design (FSD-1). Even thoughall of the hydrophilic amino acids were con-sidered at each of the boundary positions,the algorithm selected only nonpolar aminoacids. The eight core and boundary positionsare predicted to form a well-packed buriedcluster. The Phe side chains selected by the

algorithm at positions 21 and 25, the zinc-binding His positions of Zif268, are morethan 80 percent buried, and the Ala atposition 5 is 100 percent buried but the Lysat position 8 is more than 60 percent ex-posed to solvent (Fig. 2). The other bound-ary positions demonstrate the steric con-straints on buried residues by packing similarside chains in an arrangement similar to thatof Zif268 (Fig. 2). The calculated optimalconfiguration for core and boundary residuesburies ;1150 Å2 of nonpolar surface area.On the helix surface, the algorithm placesAsn14 with a hydrogen bond between itsside-chain carbonyl oxygen and the back-bone amide proton of residue 16. The eightcharged residues on the helix form threepairs of hydrogen bonds, although in ourcoiled-coil designs, helical surface hydrogen

bonds appeared to be less important thanthe overall helix propensity of the sequence(5). Positions 4 and 11 on the exposed sheetsurface were selected by the program to beThr, one of the best b-sheet forming resi-dues (21).

Alignment of the sequences for FSD-1and Zif268 (Fig. 1) indicates that only 6 ofthe 28 residues (21 percent) are identicaland only 11 (39 percent) are similar. Four ofthe identities are in the buried cluster, whichis consistent with the expectation that bur-ied residues are more conserved than sol-vent-exposed residues for a given motif (22).A BLAST (23) search of the FSD-1 se-quence against the nonredundant proteinsequence database of the National Centerfor Biotechnology Information did not re-veal any zinc finger protein sequences. Fur-

Fig. 1. Sequence of FSD-1 aligned with the second zinc finger of Zif268. The bar at the top of the figureshows the residue position classifications: the solid bar indicates the single core position, the hatchedbars indicate the seven boundary positions and the open bars indicate the 20 surface positions. Thealignment matches positions of FSD-1 to the corresponding backbone template positions of Zif268. Ofthe six identical positions (21 percent) between FSD-1 and Zif268, four are buried (Ile7, Phe12, Leu18, andIle22). The zinc binding residues of Zif268 are boxed. Representative nonoptimal sequence solutionsdetermined by means of a Monte Carlo simulated annealing protocol are shown with their rank. Verticallines indicate identity with FSD-1. The symbols at the bottom of the figure show the degree of sequenceconservation for each residue position computed across the top 1000 sequences: filled circles indicatemore than 99 percent conservation, half-filled circles indicate conservation between 90 and 99 percent,open circles indicate conservation between 50 and 90 percent, and the absence of a symbol indicatesless than 50% conservation. The consensus sequence determined by choosing the amino acid with thehighest occurrence at each position is identical to the sequence of FSD-1. Single-letter abbreviations foramino acid residues as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu;M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr.

www.sciencemag.org z SCIENCE z VOL. 278 z 3 OCTOBER 1997 83

on

Feb

ruar

y 29

, 200

8 w

ww

.sci

ence

mag

.org

Dow

nloa

ded

from

Page 4: De Novo Protein Design: Fully Automated Sequence ......De Novo Protein Design: Fully Automated Sequence Selection Bassil I. Dahiyat† and Stephen L. Mayo* The first fully automated

ther, the BLAST search found only lowidentity matches of weak statistical signifi-cance to fragments of various unrelated pro-teins. The highest identity matches were 10residues (36 percent) with P values rangingfrom 0.63 to 1.0, where P is the probability of

a match being a chance occurrence. Random28-residue sequences that consist of aminoacids allowed in the bba position classifica-tion described above produced similarBLAST search results, with 10- or 11-residueidentities (36 to 39 percent) and P values

ranging from 0.35 to 1.0, further suggestingthat the matches for FSD-1 are statisticallyinsignificant. The low identity with anyknown protein sequence demonstrates thenovelty of the FSD-1 sequence and under-scores that no sequence information fromany protein motif was used in our sequencescoring function.

In order to examine the robustness of thecomputed sequence, we used the sequence ofFSD-1 as the starting point of a Monte Carlosimulated annealing run. The Monte Carlosearch revealed high scoring, suboptimal se-quences in the neighborhood of the optimalsolution (4). The energy spread from theground-state solution to the 1000th moststable sequence is about 5 kcal/mol, an indi-cation that the density of states is high. Theamino acids comprising the core of the mol-ecule, with the exception of position 7, areessentially invariant (Fig. 1). Almost all ofthe sequence variation occurs at surface po-sitions, and typically involves conservativechanges. Asn14, which is predicted to form astabilizing hydrogen bond to the helix back-

A

B

Fig. 2. Comparison of Zif268 (9) and computed FSD-1 structures. (A) Stereoview of the second zincfinger module of Zif268 showing its buried residues and zinc binding site. (B) Stereoview of thecomputed orientations of buried side chains in FSD-1. For clarity, only side chains from residues 3, 5, 8,12, 18, 21, 22, and 25 are shown. Color figures were created with MOLMOL (38).

Table 1. NMR structure determination: distance restraints, structural statistics, and atomic root-mean-square (rms) deviations. ^SA& are the 41 simulated annealing structures, SA is the average structurebefore energy minimization, (SA )r is the restrained energy minimized average structure, and SD is thestandard deviation.

Distance restraints

Intraresidue 97Sequential 83Short range (ui – ju 5 2 to 5 residues) 59Long range (ui – ju . 5 residues) 35Hydrogen bond 10Total 284

Structural statisticsrms deviations ^SA& 6 SD (SA)r

Distance restraints (Å) 0.043 6 0.003 0.038Idealized geometry

Bonds (Å) 0.0041 6 0.0002 0.0037Angles (degrees) 0.67 6 0.02 0.65Impropers (degrees) 0.53 6 0.05 0.51

Atomic rms deviations (Å)*^SA& versus SA 6 SD ^SA& versus (SA)r 6 SD

Backbone 0.54 6 0.15 0.69 6 0.16Backbone 1 nonpolar side chains† 0.99 6 0.17 1.16 6 0.18Heavy atoms 1.43 6 0.20 1.90 6 0.29

*Atomic rms deviations are for residues 3 to 26, inclusive. Residues 1, 2, 27, and 28 were disordered [f, c, angularorder parameters (34) , 0.78] and had only sequential and ui – ju 5 2 NOEs. †Nonpolar side chains are fromresidues Tyr3, Ala5, Ile7, Phe12, Leu18, Phe21, Ile22, and Phe25, which constitute the core of the protein.

Fig. 3. Circular dichroism (CD) measurements ofFSD-1. (A) Far-UV CD spectrum of FSD-1 at 1°C.The minima at 220 and 207 nm indicate a foldedstructure. (B) Thermal unfolding of FSD-1 moni-tored by CD. The melting curve has an inflectionpoint at 39°C. To illustrate the cooperativity of thethermal transition, the melting curve was fit to atwo-state model [(39) and the derivative of the fit isshown (inset)]. The melting temperature deter-mined from this fit is 42°C.

SCIENCE z VOL. 278 z 3 OCTOBER 1997 z www.sciencemag.org84

on

Feb

ruar

y 29

, 200

8 w

ww

.sci

ence

mag

.org

Dow

nloa

ded

from

Page 5: De Novo Protein Design: Fully Automated Sequence ......De Novo Protein Design: Fully Automated Sequence Selection Bassil I. Dahiyat† and Stephen L. Mayo* The first fully automated

bone, is among the most conserved surfacepositions. The strong sequence conserva-tion observed for critical areas of themolecule suggests that, if a representativesequence folds into the design target struc-ture, then many sequences whose varia-tions do not disrupt the critical interac-tions may be equally competent. Even ifbillions of sequences would successfullyachieve the target fold, they would repre-sent only a very small proportion of the1027 possible sequences.

Experimental validation. FSD-1 wassynthesized in order to allow us to charac-terize its structure and assess the per-formance of the design algorithm (24).The far-ultraviolet (UV) circular dichro-ism (CD) spectrum of FSD-1 shows mini-ma at 220 nm and 207 nm, which isindicative of a folded structure (Fig. 3A)(25). The thermal melt is weakly cooper-ative, with an inflection point at 39°C(Fig. 3B), and is completely reversible.The broad melt is consistent with a lowenthalpy of folding which is expected for amotif with a small hydrophobic core. Thisbehavior contrasts the uncooperative ther-mal unfolding transitions observed forother folded short peptides (26). FSD-1 ishighly soluble (greater than 3 mM), andequilibrium sedimentation studies at 100mM, 500 mM, and 1 mM show the proteinto be monomeric (27). The sedimentationdata fit well to a single species, monomermodel with a molecular mass of 3630 at 1mM, in good agreement with the cal-culated monomer mass of 3488. Also, farUV-CD spectra showed no concentra-tion dependence from 50 mM to 2 mM,and nuclear magnetic resonance (NMR)COSY spectra taken at 100 mM and 2 mMwere essentially identical.

The solution structure of FSD-1 wassolved by means of homonuclear 2D 1H

NMR spectroscopy (28). NMR spectra werewell dispersed, indicating an ordered proteinstructure and easing resonance assignments.Proton chemical shift assignments were de-termined with standard homonuclear meth-ods (29). Unambiguous sequential and short-range NOEs (Fig. 4) indicate helical second-ary structure from residues 15 to 26 in agree-ment with the design target. Representativelong-range NOEs from the helix to Ile7 andPhe12 indicate a hydrophobic core consistentwith the desired tertiary structure (Fig. 4B).

The structure of FSD-1 was determinedfrom 284 experimental restraints (10.1 re-straints per residue) that were nonredundantwith covalent structure including 274 NOEdistance restraints and 10 hydrogen bondrestraints involving slowly exchangingamide protons (30). Structure calculationswere performed with X-PLOR (31) with theuse of standard protocols for hybrid distancegeometry-simulated annealing (32). An en-semble of 41 structures converged with goodcovalent geometry and no distance restraint

violations greater than 0.3 Å (Fig. 5 andTable 1). The backbone of FSD-1 is welldefined with a root-mean-square (rms) devi-ation from the mean of 0.54 Å (residues 3 to26). Consideration of the buried side chains(Tyr3, Ala5, Ile7, Phe12, Leu18, Phe21, Ile22,and Phe25) along with the backbone gives anrms deviation of 0.99 Å, indicating that thecore of the molecule is well ordered. Thestereochemical quality of the ensemble ofstructures was examined with PROCHECK(33). Apart from the disordered termini andthe glycine residues, 87 percent of the resi-dues fall in the most favored region and theremainder in the allowed region of f,cspace. Modest heterogeneity is evident inthe first strand (residues 3 to 6), which hasan average backbone angular order parame-ter, ^S& (34), of 0.96 6 0.04 compared to thesecond strand (residues 9 to 12) with an ^S&5 0.98 6 0.02 and the helix (residues 15 to26) with an ^S& 5 0.99 6 0.01. Overall,FSD-1 is notably well ordered and, to ourknowledge, is the shortest sequence consist-

Fig. 4. NOE contacts for FSD-1. (A) Sequential and short-range NOE con-nectivities. The d denotes a contact between the indicated protons. Alladjacent residues are connected by Ha-HN, HN-HN, or Hb-HN NOE cross-peaks. The helix (residues 15 to 26) is well defined by short-range connec-

tions, as is the hairpin turn at residues 7 and 8. (B) Representative NOEcontacts from aromatic to methyl protons. Several long-range NOEs from Ile7

and Phe12 to the helix help define the fold of the protein. The starred peak hasan ambiguous F1 assignment, Ile22 Hd1 or Leu18 Hd2.

Fig. 5. Solution structure of FSD-1. Stereoview showing the best-fit superposition of the 41 convergedsimulated annealing structures from X-PLOR (31). The backbone Ca trace is shown in blue and theside-chain heavy atoms of the hydrophobic residues ( Tyr3, Ala5, Ile7, Phe12, Leu18, Phe21, Ile22, andPhe25) are shown in magenta. The amino terminus is at the lower left of the figure and the carboxylterminus is at the upper right of the figure. The structure consists of two antiparallel strands frompositions 3 to 6 (back strand) and 9 to 12 (front strand), with a hairpin turn at residues 7 and 8, followedby a helix from positions 15 to 26. The termini, residues 1, 2, 27, and 28 have very few NOE restraintsand are disordered.

RESEARCH ARTICLE

www.sciencemag.org z SCIENCE z VOL. 278 z 3 OCTOBER 1997 85

on

Feb

ruar

y 29

, 200

8 w

ww

.sci

ence

mag

.org

Dow

nloa

ded

from

Page 6: De Novo Protein Design: Fully Automated Sequence ......De Novo Protein Design: Fully Automated Sequence Selection Bassil I. Dahiyat† and Stephen L. Mayo* The first fully automated

ing entirely of naturally occurring aminoacids that folds to a well-ordered structurewithout metal binding, oligomerization, ordisulfide bond formation (35).

The packing pattern of the hydrophobiccore of the NMR structure ensemble ofFSD-1 (Tyr3, Ile7, Phe12, Leu18, Phe21, Ile22,and Phe25) is similar to the computed pack-ing arrangement. Five of the seven residueshave x1 angles in the same gauche1, gauche2

or trans category as the design target, andthree residues match both x1 and x2 angles.The two residues that do not match theircomputed x1 angles are Ile7 and Phe25,which is consistent with their location at theless constrained open end of the molecule.Ala5 is not involved in its expected exten-sive packing interactions and instead ex-poses about 45 percent of its surface areabecause of the displacement of the strand 1backbone relative to the design template.Conversely, Lys8 behaves as predicted by thealgorithm with its solvent exposure (60 per-cent) and x1 and x2 angles matching thecomputed structure. Because there are fewNOEs involving solvent-exposed sidechains, most of these side chains are disor-dered in the solution structure, a state thatprecludes examination of the predicted sur-face residue hydrogen bonds. However,Asn14 forms a hydrogen bond from its sidechain carbonyl oxygen as predicted, but tothe amide of Glu17, not Lys16 as expectedfrom the design. This hydrogen bond ispresent in 95 percent of the structure ensem-ble and has a donor-acceptor distance of2.6 6 0.06 Å. In general, the side chains ofFSD-1 correspond well with the design algo-rithm predictions, but further refinement ofthe scoring function and rotamer libraryshould improve sequence selection and sidechain placement and improve the correla-tion between the predicted and observedstructures.

We compared the average restrainedminimized structure of FSD-1 and the designtarget (Fig. 6). The overall backbone rmsdeviation of FSD-1 from the design target is1.98 Å for residues 3 to 26 and only 0.98 Å

for residues 8 to 26 (Table 2). The largestdifference between FSD-1 and the targetstructure occurs from residues 4 to 7, with adisplacement of 3.0 to 3.5 Å of the backboneatom positions of strand 1. The agreementfor strand 2, the strand-to-helix turn, and thehelix is remarkable, with the differencesnearly within the accuracy of the structuredetermination. For this region of the struc-ture, the rms difference of f, c angles be-tween FSD-1 and the design target is only14 6 9°. In order to quantitatively assess thesimilarity of FSD-1 to the global fold of thetarget, we calculated their supersecondarystructure parameter values (Table 2) (36,37), which describe the relative orientationsof secondary structure units in proteins. Thevalues of u, the inclination of the helixrelative to the sheet, and V, the dihedralangle between the helix axis and the strandaxes (see legend to Table 2), are nearlyidentical. The height of the helix above thesheet, h, is only 1 Å greater in FSD-1. Astudy of protein core design as a function ofhelix height for Gb1 variants demonstratedthat up to 1.5 Å variation in helix height haslittle effect on sequence selection (37). Thecomparison of supersecondary structure pa-rameter values and backbone coordinateshighlights the excellent agreement betweenthe experimentally determined structure ofFSD-1 and the design target, and demon-strates the success of our algorithm at com-puting a sequence for this bba motif.

The quality of the match between FSD-1and the design target demonstrates the abil-ity of our algorithm to design a sequence fora fold that contains the three major second-ary structure elements of proteins: sheet, he-lix, and turn. Since the bba fold is differentfrom those used to develop the sequence-selection methodology, the design of FSD-1represents a successful transfer of our algo-rithm to a new motif. Further tests of theperformance of the algorithm on several dif-ferent motifs are necessary, although its basisin physical chemistry and the absence ofheuristics and subjective considerationsshould allow the algorithm to be used in

many different structural contexts. Also, thegeneration of various kinds of backbone tem-plates for use as input to our fully automatedsequence selection algorithm could enablethe design of new protein folds. Recent re-sults indicate that the sequence selectionalgorithm is not sensitive to even fairly largeperturbations in backbone geometry andshould be robust enough to accommodatecomputer-generated backbones (37).

The key to using a quantitative method forthe FSD-1 design, and for the continued de-velopment of the methodology, is the tightcoupling of theory, computation, and experi-ment used to improve the accuracy of thephysical chemical potential functions in ouralgorithm. When combined with these poten-tial functions, computational optimizationmethods such as DEE can rapidly find se-quences for structures too large for experimen-tal library screening or too complex forsubjective approaches. Given that the FSD-1sequence was computed with only a 4-Giga-FLOPS computer (19), and that TeraFLOPScomputers are now available with PetaFLOPScomputers on the drawing board, the prospectfor pursuing even larger and more complexdesigns is excellent.

REFERENCES AND NOTES___________________________

1. M. H. J. Cordes, A. R. Davidson, R. T. Sauer, Curr.Opinion Struct. Biol. 6, 3 (1996).

2. D. Y. Jackson et al., Science 266, 243 (1994); B. Li etal., ibid. 270, 1657 (1995); J. S. Marvin et al., Proc.Natl. Acad. Sci. U.S.A. 94, 4366 (1997).

3. H. W. Hellinga, J. P. Caradonna, F. M. Richards, J.Mol. Biol. 222, 787 (1991); J. H. Hurley, W. A. Baase,B. W. Matthews, ibid. 224, 1143 (1992); J. R. Des-jarlais and T. M. Handel, Protein Sci. 4, 2006 (1995);P. B. Harbury, B. Tidor, P. S. Kim, Proc. Natl. Acad.Sci. U.S.A. 92, 8408 (1995); M. Klemba, K. H. Gard-

Fig. 6. Comparison of the FSD-1 structure (blue) and the design target (red). Stereoview of the best-fitsuperposition of the restrained energy minimized average NMR structure of FSD-1 and the backbone ofZif268. Residues 3 to 26 are shown.

Table 2. Comparison of the FSD-1 experimentallydetermined structure and the design target struc-ture. The FSD-1 structure is the restrained energyminimized average from the NMR structure deter-mination. The design target structure is the sec-ond DNA binding module of the zinc finger Zif268(9)

Atomic rms deviations (Å)

Backbone, residues 3 to 26 1.98Backbone, residues 8 to 26 0.98

Super-secondary structure parameters*

h (Å)u (degrees)V (degrees)

*h, u, and V are calculated as described (36, 37 ). h is thedistance between the centroid of the helix Ca coordi-nates (residues 15 to 26) and the least-squares plane fitto the Ca coordinates of the sheet (residues 3 to 12); u isthe angle of inclination of the principal moment of the helixCa atoms with the plane of the sheet; V is the anglebetween the projection of the principal moment of thehelix onto the sheet and the projection of the averageleast-squares fit line to the strand Ca coordinates (resi-dues 3 to 6 and 9 to 12) onto the sheet.

FSD-1 Design target

9.914.213.1

8.916.513.5

SCIENCE z VOL. 278 z 3 OCTOBER 1997 z www.sciencemag.org86

on

Feb

ruar

y 29

, 200

8 w

ww

.sci

ence

mag

.org

Dow

nloa

ded

from

Page 7: De Novo Protein Design: Fully Automated Sequence ......De Novo Protein Design: Fully Automated Sequence Selection Bassil I. Dahiyat† and Stephen L. Mayo* The first fully automated

ner, S. Marino, N. D. Clarke, L. Regan, Nature Struc.Biol. 2, 368 (1995); S. F. Betz and W. F. Degrado,Biochemistry 35, 6955 (1996).

4. B. I. Dahiyat and S. L. Mayo, Protein Sci. 5, 895(1996).

5. iiii, ibid. 6, 1333 (1997).6. iiii, Proc. Natl. Acad. Sci. U.S.A. 94, 10172

(1997).7. J. W. Ponder and F. M. Richards, J. Mol. Biol. 193,

775 (1987).8. [See J. Desmet, M. De Maeyer, B. Hazes, I. Lasters,

Nature 356, 539 (1992); R. F. Goldstein, Biophys. J.66, 1335 (1994); M. De Maeyer, J. Desmet, I. Laster,Folding Design 2, 53 (1997)] DEE finds and elimi-nates rotamers that are mathematically provable tobe inconsistent (or dead-ending) with the global min-imum energy solution of the system. A rotamer r atsome residue position i will be dead-ending if, whencompared with some other rotamer t, at the sameresidue position the following inequality is satisfied:

E~ir! 2 E~it! 1 Oj

mins

@E~ir js! 2 E~it js!# . 0

where E(ir) and E(it) are rotamer-template energies,E(ir js) and E(it js) are rotamer-rotamer energies forrotamers on residues i and j, and the function minsselects the rotamer s on residue j that minimizesthe argument of the function. Iterative application ofthe elimination criterion results in a rapid and sub-stantial reduction in the combinatorial size of theproblem and application of similar but higher-orderelimination criteria are required to find the ground-state solution.

9. N. P. Pavletich and C. O. Pabo, Science 252, 809(1991).

10. M. D. Struthers, R. P. Cheng, B. Imperiali, ibid. 271,342 (1996).

11. B. I. Dahiyat and S. L. Mayo, unpublished results.12. Potential functions and parameters for van der Waals

interactions, solvation, hydrogen bonding, and sec-ondary structure propensity are described in our pre-vious work (4–6). A secondary structure propensitypotential was used for surface b-sheet positionswhere the i 2 1 and i 1 1 residues were also inb-sheet conformations (5 ). Propensity values fromSerrano and co-workers were used [ V. Munoz andL. Serrano, Proteins Struct. Funct. Genet. 20, 301(1994)].

13. F. C. Bernstein et al., J. Mol. Biol. 112, 535 (1977).14. The coordinates of PDB record 1zaa (9, 13) from

residues 33 to 60 were used as the structure tem-plate. In our numbering, position 1 corresponds to1zaa position 33. The program BIOGRAF (MolecularSimulations, Inc., San Diego, CA) was used to gen-erate explicit hydrogens on the structure which wasthen conjugate-gradient minimized for 50 steps bymeans of the Dreiding force field (40).

15. A solvent-accessible surface was generated using theConnolly algorithm (41) with a probe radius of 8.0 Å, adot density of 10 Å22, and a Ca radius of 1.95 Å. Aresidue was classified as a core position if the dis-tance from its Ca, along its Ca-Cb vector, to thesolvent-accessible surface was greater than 5.0 Å,and if the distance from its Cb to the nearest surfacepoint was greater than 2.0 Å. The remaining residueswere classified as surface positions if the sum of thedistances from their Ca, along their Ca-Cb vector, tothe solvent-accessible surface plus the distance fromtheir Cb to the closest surface point was less than 2.7Å. All remaining residues were classified as boundarypositions. The classifications for Zif268 were used ascomputed except that positions 1, 17, and 23 wereconverted from the boundary to the surface class toaccount for end effects from the proximity of chaintermini to these residues in the tertiary structure andinaccuracies in the assignment.

16. One core position (7 possible amino acids), 7 bound-ary positions (16 possible amino acids), 18 surfacepositions (10 possible amino acids), and 2 positionswith f greater than 0° (1 possible amino acid) resultin 7 p 167 p 1018 p 12 5 1.88 3 1027 possible aminoacid sequences.

17. 1.88 3 1027 peptide molecules, with an average

mass of 3712 daltons for the possible compositionsallowed by the residue position classification, wouldweigh (1.88 3 1027 p 3712 daltons) 5 1.159 3 107

g 5 11.6 metric tons.18. As in our previous work (5), a backbone-dependent

rotamer library was used [R. L. Dunbrack and M.Karplus, J. Mol. Biol. 230, 543 (1993)]. All His rota-mers were protonated on both Nd and N«.

19. All calculations were performed on a Silicon Graph-ics Power Challenge server with 10 R10000 proces-sors running in parallel. Peak performance is 3.9GigaFLOPS (FLOPS 5 floating point operations persecond).

20. The sequence optimization consists of two phases:pairwise rotamer energy calculations and DEEsearching. The DEE optimization was initially run withcontrol parameters set for optimal speed followed bya DEE-based, residue-pairwise, round-robin optimi-zation. The energy calculations took 53 CPU (centralprocessing unit) hours and sequence optimizationstook 37 CPU hours.

21. C. W. A. Kim and J. M. Berg, Nature 362, 267 (1993);D. L. Minor and P. S. Kim, ibid. 367, 660 (1994); C. K.Smith, J. M. Withka, L. Regan, Biochemistry 33,5510 (1994).

22. J. U. Bowie, J. F. Reidhaar-Olson, W. A. Lim, R. T.Sauer, Science 247, 1306 (1990).

23. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J.Lipman, J. Mol. Biol. 215, 403 (1990).

24. FSD-1 was synthesized by means of standard solid-phase Fmoc chemistry. The peptide was cleavedfrom the resin with trifluoroacetic acid and purified byreversed-phase high-performance liquid chromatog-raphy. Peptide was lyophilized and stored at 220°C.Matrix-assisted laser desorption mass spectrometryyielded a molecular weight of 3489.7 daltons (3489.0calculated).

25. Protein concentration was 50 mM in 50 mM sodiumphosphate at pH 5.0. The spectrum was acquired at1°C in a 1-mm cuvette and was baseline-correctedwith a buffer blank. The spectrum is the average of 3scans with a 1-s integration time and 1-nm incre-ments. All CD data were acquired on an Aviv 62DSspectrometer equipped with a thermoelectric tem-perature control unit. Thermal unfolding was moni-tored at 218 nm in a 1-mm cuvette with 2° incre-ments and an averaging time of 40 s and an equili-bration time of 120 s per increment. Reversibility wasconfirmed by comparison of 1°C CD spectra beforeand after heating to 99°C. Peptide concentrationswere determined by UV spectrophotometry.

26. J. M. Scholtz et al., Proc. Natl. Acad. Sci. U.S.A. 88,2854 (1991); M. A. Weiss and H. T. Keutmann, Bio-chemistry 29, 9808 (1990); M. D. Struthers, R. P.Cheng, B. Imperiali, J. Am. Chem. Soc. 118, 3073(1996).

27. Sedimentation equilibrium studies were performedon a Beckman XL-A ultracentrifuge equipped with anAn-60 Ti analytical rotor at a speed of 40,000 rpm.Protein concentration was 100 mM, 500 mM, or 1mM in 50 mM sodium phosphate at pH 5.0 and 7°C.Absorption was monitored at 286 nm (500 mM and 1mM) or 234 nm (100 mM). Concentration profileswere fit to an ideal single species model which re-sulted in randomly distributed residuals.

28. NMR data were collected on a Varian Unityplus 600MHz spectrometer equipped with a Nalorac inverseprobe with a self-shielded z-gradient. NMR samples(;2 mM) were prepared in H2O-D2O (90:10) or in99.9 percent D2O with 50 mM sodium phosphate atpH 5.0 (uncorrected glass electrode). All spectrawere collected at 7°C. DQF-COSY [U. Piantini, O. W.Sorensen, R. R. Ernst, J. Am. Chem. Soc. 104, 6800(1982)], TOCSY [A. Bax and D. G. Davis, J. MagneticReson. 65, 355 (1985)], and NOESY [J. Jeener, B. H.Meier, P. Bachmann, R. R. Ernst, J. Chem. Phys. 71,4546 (1979)] spectra were acquired to accomplishresonance assignments and structure determina-tion. NOESY spectra were recorded with mixingtimes of 200 ms for use during resonance assign-ments and 100 ms to derive distance restraints. Wa-ter suppression was accomplished either with pre-

saturation during the relaxation delay or pulsed fieldgradients [M. Piotto, V. Saudek, V. Sklenar, J. Bi-omol. NMR 2, 661 (1992)]. Spectra were processedwith VNMR (Varian Associates, Palo Alto, CA), andspectra were assigned with ANSIG [P. J. Kraulis, J.Magnetic Reson. 24, 627 (1989)].

29. K. Wuthrich, NMR of Proteins and Nucleic Acids(Wiley, New York, 1986).

30. NOEs were classified into three distance-boundranges based on cross-peak intensity calibrated tothe Tyr3 Hd-H« crosspeak: strong (1.8 to 2.7 Å),medium (1.8 to 3.3 Å), and weak (1.8 to 5.0 Å). Upperbounds for restraints involving methyl protons wereincreased by 0.5 Å to account for the increasedintensity of methyl resonances. All partially over-lapped NOEs were set to weak restraints. Hydrogenbond restraints were derived from hydrogen deute-rium-exchange kinetics measurements followed byone dimensional 1H spectroscopy. Unambiguouslyassigned amide peaks for Tyr3, Phe12, Leu18, Phe21,and Phe25 were protected from exchange at 7°C, pH5.0. Hydrogen bond restraints (two per hydrogenbond) were only included at the late stages of struc-ture refinement when initial calculations indicated thedonor-acceptor pairings.

31. A. T. Brunger, X-PLOR, version 3.1, A System forX-ray Crystallography and NMR ( Yale Univ. Press,New Haven, CT, 1992).

32. Standard hybrid distance geometry-simulated an-nealing protocols were followed [M. Nilges, G. M.Clore, A. M. Gronenborn, FEBS Lett. 229, 317(1988); M. Nilges, J. Kuszewski, A. T. Brunger, inComputational Aspects of the Study of BiologicalMacromolecules by NMR J. C. Hoch, Ed. (Plenum,New York, 1991); J. Kuszewski, M. Nilges, A. T.Brunger, J. Biomol. NMR 2, 33 (1992)]. Distancegeometry structures (100) were generated, regular-ized, and refined, resulting in an ensemble, called^SA&, of 41 structures with no restraint violationsgreater than 0.3 Å, rms deviations from idealizedbond lengths less than 0.01 Å, and rms deviationsfrom idealized bond angles and impropers less than1°. An average structure was generated by superim-posing and then averaging the coordinates of theensemble, followed by refinement and restrainedminimization.

33. R. A. Laskowski, M. W. Macarthur, D. S. Moss, J. M.Thornton, J. Appl. Crystallogr. 26, 283 (1993).

34. S. G. Hyberts, M. S. Goldberg, T. F. Havel, G. Wag-ner, Protein Sci. 1, 736 (1992).

35. C. J. McKnight, P. T. Matsudaira, P. S. Kim, NatureStruct. Biol. 4, 180 (1997).

36. J. Janin and C. Chothia, J. Mol. Biol. 143, 95 (1980);F. E. Cohen, M. J. E. Sternberg, W. R. Taylor, ibid.56, 821 (1982).

37. A. Su and S. L. Mayo, Protein Sci. 6, 1701 (1997).38. R. Koradi, M. Billeter, K. Wuthrich, J. Mol. Graph. 14,

51 (1996).39. W. J. Becktel and J. A. Schellman, Biopolymers 26,

1859 (1987).40. S. L. Mayo, B. D. Olafson, W. A. Goddard III, J. Phys.

Chem. 94, 8897 (1990).41. M. L. Connolly, Science 221, 709 (1983).42. We thank P. Poon and T. Laue for sedimentation

equilibrium measurements and discussions, A. Sufor assistance calculating super-secondary structureparameters, S. Ross for assistance with NMR mea-surements, G. Hathaway for mass spectrometry, J.Abelson and P. Bjorkman for critical reading of themanuscript, and R. A. Olofson for helpful discus-sions. Supported by the Howard Hughes MedicalInstitute (S.L.M.), the Rita Allen Foundation, theChandler Family Trust, the Booth Ferris Foundation,the David and Lucile Packard Foundation, the SearleScholars Program and The Chicago CommunityTrust, and grant GM08346 from the National Insti-tutes of Health (B.I.D.). Coordinates and NMR re-straints have been deposited in the Brookhaven Pro-tein Data Bank with accession numbers 1FSD andR1FSDMR, respectively.

16 June 1997; accepted 8 September 1997

RESEARCH ARTICLE

www.sciencemag.org z SCIENCE z VOL. 278 z 3 OCTOBER 1997 87

on

Feb

ruar

y 29

, 200

8 w

ww

.sci

ence

mag

.org

Dow

nloa

ded

from