Structure prediction: Folding proteins by pattern recognition

Dispatch R151

David Shortle

Although we are still a long way from being able to predict the details of protein structure from the underlying chemistry, slow but steady progress is being made at modeling structural features by recognizing the patterns that connect sequence to structure.

Address: Department of Biological Chemistry, The Johns Hopkins University School of Medicine, 725 North Wolfe Street, Baltimore, Maryland 21205, USA.

Electronic identifier: 0960-9822-007-R0151

Current Biology 1997, 7:R151–R154

© Current Biology Ltd ISSN 0960-9822

At the recent second meeting on the ‘Critical Assessment of Structure Prediction’ (CASP2; Asilomar, California, 12–16 December, 1996), 170 protein scientists met with the aim of rigorously evaluating the accuracy of methods used for predicting and modeling proteins of unknown structure. Following the same general format used for assessment of blind predictions at the CASP1 meeting in 1994 [1,2], research groups were provided the names, amino-acid sequences, and literature references of ‘target’ proteins whose structures were in the process of being solved by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy. After retrieving this information from an Internet site for proteins that provided suitable challenges for their methods, researchers were required to submit all predictions electronically before an expiry date, only after which the experimental structure would become generally available to the scientific community.

The prediction methods used by most of the research groups at the meeting involve searching the enormous databases of protein sequences, and the much smaller database of protein structures, for patterns that may connect a target protein’s sequence to one or more known structures. These methods can be grouped under the label of ‘fold recognition’, as their first task is to recognize which of all the characterized folds, if any, most closely resembles the unknown fold of the target protein. And from the results presented at CASP2, it is clear that many of the patterns connecting sequence to structure have been identified. Not only can these patterns be used to predict the fold family of proteins of unknown structure, in favorable circumstances they can be used to model specific structural details before the structure is determined experimentally.

Fold recognition by sequence homology

By far the simplest and most informative pattern for fold recognition is sequence homology: a statistically significant similarity in amino-acid sequence that indicates the target protein must bear an evolutionary relationship to one or more proteins. As the chain segments within homologous proteins that are similar in sequence are invariably very similar in structure, the identification of a homolog of known structure definitively establishes the fold of the target protein.

In this situation, the real challenge becomes to build a high-resolution model of the target protein that correctly predicts the structural consequences of its divergence in sequence from the homolog ‘template’ [3]. The first step in the model-building process is to optimize the alignment between the target protein and its closest homolog. For this purpose, the sequences of other homologous proteins, whose structures may not be known, often prove useful. In a global alignment of a family of homologous proteins, the most conserved, and therefore most easily aligned, sequences often point to the central structural features, whereas the least conserved sequences usually define loops or segments on the surface of the protein.
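As a rough illustration of how conserved positions might be picked out of a family alignment, the sketch below scores each alignment column by its Shannon entropy and reports the columns that vary least. The toy alignment, the function names and the 0.5-bit cutoff are all assumptions made for this example; the modeling groups at CASP2 may have used quite different conservation measures.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column.

    A gap ('-') is counted as an ordinary symbol, so a column that mixes
    residues and gaps is scored as variable.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conserved_columns(alignment, max_entropy=0.5):
    """Return the indices of columns whose entropy is at most max_entropy."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        i
        for i in range(length)
        if column_entropy([row[i] for row in alignment]) <= max_entropy
    ]


# Hypothetical three-sequence alignment, invented for illustration only.
alignment = [
    "MKVLA-GT",
    "MKVIAEGS",
    "MRVLA-GT",
]
print(conserved_columns(alignment))
```

In this toy case the fully conserved columns are the ones reported, while the column containing gaps is treated as variable, which is in keeping with the observation that poorly conserved segments tend to be loops.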

In the second step, the most conserved segments of the target, usually alpha helices and beta strands, are assigned the same structure as the corresponding segments in the template. Loops (the less conserved segments) are then added to connect the core segments, side chains are substituted to convert the template sequence into that of the target, and finally the structure is optimized locally by an energy-based refinement method.

In the months leading up to the CASP2 meeting, nine target proteins were made available for such comparative modeling, and twenty-one research groups responded to the challenge by submitting one or more models. As the level of sequence identity to a homolog of known structure varied from 15–85%, this set of targets presented a range of challenges that tested different aspects of the model-building process. The accuracy of each model was quantitatively analyzed by structural comparison with the correct structure determined by X-ray or NMR methods. I shall only describe some of the conclusions and impressions offered by conference participants.
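The text does not say which measure was used for the structural comparison. One widely used choice is the root-mean-square deviation (RMSD) between equivalent atoms, sketched below on the assumption that the model and the experimental structure have already been superimposed and that the atoms are listed in the same order. The function name and coordinates are illustrative only.

```python
import math


def rmsd(model, reference):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Assumes the two structures are already superimposed and that the i-th
    point of each list refers to the same atom.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(model, reference)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(model))


# Two-atom toy example: the second atom is displaced by 2 units along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(model, reference))
```

A full comparison would first find the optimal rigid-body superposition; that step is deliberately left out here to keep the sketch short.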

Overall, there was a sense that comparative modeling methods have improved somewhat, primarily because greater attention is now paid to optimizing the alignment of the target sequence to the template structure. Mistakes at this first, and most critical, step cannot yet be corrected in subsequent steps. Comparison of the level of success achieved by different methods leads to one dominant conclusion: the more structural details the model inherits directly from one (or several) templates, the more accurate the model will be. This point was brought home by the realization that when the target sequence was realigned on the template structure using, as a guide, the experimental structure of the target (which was unknown at the time of the original prediction), the resulting model was in almost all cases better than the best comparative model, which typically included small alignment errors.

Three other general observations appear to be corollaries of this rule of maximal inheritance of homologous structural details. The correctness of loop structures was usually low when there was significant variation in length or in sequence with respect to the loops of the template protein(s). Refinement by molecular dynamics, energy minimization or some other global optimization scheme seldom improved the accuracy of a model, and in most cases made it worse. And finally, it was felt that comparative model building breaks down severely when there is less than 20% sequence identity. Much of the problem in this situation arises from large errors that occur in the very uncertain alignment of the target sequence to that of the template.
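To make the 20% figure concrete, the sketch below computes percent sequence identity from a pairwise alignment and flags a target–template pair that falls below that level. Conventions for the denominator differ between programs; here identity is taken over positions where neither sequence has a gap, which is an assumption of this example, as are the sequences themselves.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over aligned positions where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned fragments, invented for illustration.
identity = percent_identity("ACDEFG-HIK", "ACDQFGWHLK")
# 20% is the level below which participants felt comparative modeling breaks down.
below_reliable_range = identity < 20.0
print(round(identity, 1), below_reliable_range)
```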

Fold recognition by sequence–structure compatibility

In the absence of sufficient sequence identity to recognize the fold of the target protein, recourse can be made to a second strategy, commonly known as ‘threading’ [4,5]. Instead of trying to align the target protein’s sequence to that of a potential template protein, the sequence of the target protein is threaded through a representation of a fold, which is based on the sequence(s) and/or structure(s) of members of that fold family. At each step in the process, an estimate is made of the compatibility of the sequence to the representation. In effect, many questions are asked at each trial alignment. Does the target sequence correspond well to the structural features of the fold? Are the hydrophobic residues buried? Do residues in a helix have a high helix propensity? And so on. Unlike the methods that are based on sequence homology, in this approach it is recognizing the correct fold, not building a detailed model, that is the principal challenge.

Threading methods can be grouped into three classes based on the representation of the fold [4]. In the simplest class, the fold is represented as a one-dimensional array of probabilities for each of the twenty amino acids occurring at each position in the fold, derived from the sequences of the members of the fold family. In a second class, the fold is represented as a one-dimensional array of descriptors of the environment of each residue in the three-dimensional fold. And in the most sophisticated class, the three-dimensional fold is represented by itself, permitting quantitative analysis of the spatial relationships between residues. At the CASP2 meeting only one new threading method was implemented. This new approach, based on the most sophisticated neural network known, the human brain, is particularly important because, by some criteria, it outperformed all earlier methods.
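The simplest class of representation, a per-position array of amino-acid probabilities, can be sketched in a few lines. The example below builds log-odds scores from a small hypothetical family and slides the resulting profile along a target without gaps. The uniform background, the pseudocount and all names are assumptions made for illustration; real threading programs allow gaps and use far richer scoring.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_profile(family, pseudocount=1.0):
    """Per-position log-odds scores from equal-length family sequences.

    Each position gets a dict mapping amino acid -> log(p / background),
    with a uniform background of 1/20 (an assumption of this sketch).
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for i in range(len(family[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in family:
            counts[seq[i]] += 1.0
        total = sum(counts.values())
        profile.append(
            {aa: math.log((counts[aa] / total) / background) for aa in AMINO_ACIDS}
        )
    return profile


def best_ungapped_placement(target, profile):
    """Slide the profile along the target; return (best_score, offset).

    Returns None if the target is shorter than the profile.
    """
    width = len(profile)
    best = None
    for offset in range(len(target) - width + 1):
        score = sum(profile[i][target[offset + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


# Hypothetical four-residue fold family and target, invented for illustration.
family = ["LKVE", "LRVE", "IKVD"]
profile = log_odds_profile(family)
score, offset = best_ungapped_placement("GGLKVEGG", profile)
print(offset, round(score, 2))
```

Repeating the placement against a profile for every known fold family, and ranking the best scores, would give a crude version of the hit list that participants were asked to submit.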

A total of eighteen target sequences were provided in the threading category challenge, only six of which turned out to possess previously known folds. Thirty-four groups accepted the challenge, which consisted of providing a hit list of predicted folds for each target, along with the best alignment of the target sequence on each fold. If more than one fold was included in the hit list, a weight or confidence level was assigned to each. Assessment of the performance of different methods was not straightforward. For example, it was not always clear which, if any, known fold most closely corresponded to a target’s structure.

Nevertheless, quite noteworthy performances were posted by several participating groups. Alexey Murzin, an author of the SCOP classification of protein structure [6], successfully identified the fold classification for all five of the targets with known folds that he attempted. In addition, for all other targets attempted, he correctly concluded that each possessed a novel fold. Using his background knowledge of protein structure, sequence and function, Murzin was able to recognize distinctive patterns in the name, function, length and sequence of each target protein, allowing him to relate it to the correct SCOP fold subfamily. A second research group that combined the results of five different threading methods to arrive at a ‘best’ prediction (as judged subjectively) also did remarkably well, scoring five or six out of six, depending on the criteria used. More rigorous methods based on empirical potentials [7], and requiring no human intervention, also showed evidence of improvement (see Fig. 1). In general, participants felt that progress has been made in improved alignments; the target sequence was often fairly accurately positioned on the template structure in the lowest energy alignment to the ‘hit’ structure(s).
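Alignment quality of this kind can be expressed as the fraction of structurally equivalent residue pairs that a predicted alignment reproduces. The sketch below assumes each alignment is stored as a mapping from target residue index to template residue index; that representation and the function name are choices made for this example. The last lines apply the same ratio to the figures quoted in the caption of Figure 1 (152 of 166 pairs).

```python
def alignment_accuracy(predicted, reference):
    """Fraction of reference (target -> template) pairs found in the prediction."""
    if not reference:
        raise ValueError("reference alignment must contain at least one pair")
    correct = sum(
        1 for target_pos, template_pos in reference.items()
        if predicted.get(target_pos) == template_pos
    )
    return correct / len(reference)


# Toy alignments: target residue 2 is shifted by one position in the prediction.
reference = {0: 0, 1: 1, 2: 2, 3: 4}
predicted = {0: 0, 1: 1, 2: 3, 3: 4}
print(alignment_accuracy(predicted, reference))

# The toxin A example from Figure 1: 152 of 166 equivalent pairs correct.
toxin_a_percent = round(100 * 152 / 166, 1)
print(toxin_a_percent)
```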

Fold prediction, or ab initio, methods

In this category, the challenge is to predict the structure of the target protein when its sequence is not similar to any protein of known structure and it cannot be successfully threaded on a known structure [8]. This is always the situation when the target protein defines a new fold. A wide range of approaches were represented at CASP2. At one end of the spectrum, methods employed techniques quite similar to those described above, in which extensive use is made of recognizing sequence–structure patterns present in protein databases. At the other end were the traditional ab initio ‘heavy numbers’ methods, which attempt to mimic within a computer the physical process of protein folding: an algorithm for generating conformations is coupled to an energy function, and millions of conformations are searched in an effort to find the one lowest in energy.
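The coupling of a conformation generator to an energy function can be shown on a deliberately tiny scale. The sketch below uses a two-dimensional hydrophobic/polar (HP) lattice model, a standard teaching toy that is swapped in here for illustration and is not one of the methods evaluated at CASP2. Every self-avoiding chain on a square lattice is generated, each is scored by counting contacts between hydrophobic residues, and the lowest-energy conformation is kept.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def energy(sequence, coords):
    """-1 for every pair of 'H' residues on adjacent lattice sites
    that are not neighbors along the chain."""
    index = {pos: i for i, pos in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES:
            j = index.get((x + dx, y + dy))
            # j > i + 1 counts each non-bonded pair exactly once.
            if j is not None and j > i + 1 and sequence[j] == "H":
                total -= 1
    return total


def lowest_energy_fold(sequence):
    """Exhaustively enumerate self-avoiding walks; return (energy, coords)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    best_energy, best_coords = None, None

    def extend(coords, occupied):
        nonlocal best_energy, best_coords
        if len(coords) == len(sequence):
            e = energy(sequence, coords)
            if best_energy is None or e < best_energy:
                best_energy, best_coords = e, list(coords)
            return
        x, y = coords[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                coords.append(nxt)
                occupied.add(nxt)
                extend(coords, occupied)
                coords.pop()
                occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best_energy, best_coords


# Hypothetical ten-residue HP sequence, invented for illustration.
best_energy, best_coords = lowest_energy_fold("HPHPPHHPHH")
print(best_energy, best_coords)
```

Exhaustive enumeration is only feasible because the chain is short and the lattice is coarse; the number of conformations grows exponentially with length, which is one reason realistic searches sample millions of conformations rather than all of them.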

R152 Current Biology, Vol 7 No 3

Of the thirty-four proteins submitted to the CASP2 competition, twelve remained after those with recognizable folds were removed. Unfortunately, even though many laboratories around the world are pursuing heavy numbers computational approaches to protein folding, fewer than ten groups submitted predictions to the Asilomar competition. In several cases, segments of secondary structure were correctly predicted, and occasionally features of super-secondary structure and global topology appeared to be approximately correct. But overall, there was a clear impression that this class of methods has a long way to go in order to begin to achieve reliable results, especially with proteins longer than 100 amino acids.

Pattern recognition methods consistently identified some features of target proteins with new folds. For example, the neural-network-based secondary-structure prediction program PHD [9] achieved a 74% success rate in assigning amino acids to one of the three classes: helix, sheet or loop. Importantly, the large majority of prediction errors involved defining the precise ends of helices and sheets; the more serious errors of calling a helix a strand, or a strand a helix, were relatively infrequent. As a consequence, a number of ab initio strategies are being developed to pack segments of predicted secondary structure, either by pattern recognition or by using more traditional conformational search/energy evaluation methods. In the near future, it seems likely that such hybrid approaches will begin to meet with more consistent successes.
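A three-class success rate of this kind is simply the percentage of residues assigned to the correct class. The sketch below computes it for per-residue strings using the letters H (helix), E (sheet) and L (loop); the letter codes, function name and example strings are assumptions of this illustration rather than PHD output.

```python
def three_state_accuracy(predicted, observed):
    """Percentage of residues whose predicted class matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Invented example in which the only errors are at the two ends of a helix.
observed = "LLHHHHHHLLEEEELL"
predicted = "LHHHHHHLLLEEEELL"
print(three_state_accuracy(predicted, observed))
```

In this toy case the helix is predicted one residue too early at both ends, the kind of boundary error described above, and no helix is mistaken for a strand.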

Docking

The new category in the CASP2 competition was ‘docking’, in which the challenge is to successfully ‘dock’, or position, a small-molecule ligand or another protein into its binding site, as defined by crystallography or NMR methods [10,11]. As this type of prediction plays a central role in rational drug design, a great deal of interest and effort in docking programs has come from the pharmaceutical industry. Somewhat surprisingly, only nine research groups entered models of the seven protein–ligand targets, and only four groups entered models of the single target involving protein–protein docking.

The protein–ligand targets represented a range of relatively easy challenges, with five of the targets consisting of protease–inhibitor complexes (three with elastase, two with trypsin) and none requiring de novo identification of the inhibitor binding site on the protein. The single protein–protein target, a complex between influenza hemagglutinin and an antibody, was felt to be quite a difficult challenge, accounting in part for the small number of models submitted.

Figure 1

The result of fold recognition with the target exfoliative toxin A from Staphylococcus aureus. (a) The predicted structure of toxin A, based on the structure of the model protein human leukocyte elastase. (b) The crystal structure of the region of human leukocyte elastase predicted to be the best model for toxin A. It was reported at the meeting that human leukocyte elastase (218 amino acids) was the best hit on threading [7] the sequence of toxin A (242 amino acids) through a library of structure-containing representatives of all known fold families. The level of sequence identity is 16%; 152 out of 166 equivalent residue pairs were correctly aligned in the threaded model of toxin A. (Image courtesy of Manfred Sippl.)

For each target, at least one method yielded a model close to the X-ray crystal structure of the complex, yet no one method clearly stood out as the best. In each case, there was a distribution in the quantitative agreement between models and the real structure that had some of the features one would expect for a random distribution. Several groups expressed disappointment with their methods and surprise at the overall low accuracy of their models. But then the recent experience of docking algorithms in predicting the details of enzyme–inhibitor complexes has not been lacking in surprises [11].

Conclusions

It appears that fold recognition may soon be a solved problem. Progress in the past two years has come entirely from methods that build upon patterns present in the databases of protein sequences and structures. As these databases are growing at a rapid pace, the success rate for correctly assigning a protein with a previously defined fold to its proper class can only increase from its current high level. The slow but steady progress in building models of a protein once its fold has been defined is also likely to continue for the same reason: pattern recognition, which provides the most reliable route to a good model, still has room for improvement.

But in the areas of ab initio heavy numbers structure prediction, high-precision model building and docking, progress does not depend on improved pattern recognition. Instead, it depends on incorporating the correct quantitative chemistry into the computer programs that calculate the correspondence between structure and energy. In these three areas, significant progress must await new scientific insights and advances. Given the very modest level of success achieved in blind predictions in these categories at the CASP2 meeting, new insights and approaches are desperately needed.

CASP3, the next assessment meeting, is scheduled for December 1998. For readers interested in more details, a special issue of the journal Proteins: Structure, Function and Genetics devoted to the results presented at CASP2 will appear in the second half of 1997. The results of CASP1 were published in 1995 [1].

References

1. Moult J, Pedersen JT, Judson R, Fidelis K: A large scale experiment to assess protein structure prediction methods. Proteins: Struct Funct Genet 1995, 23:ii–iv.

2. Shortle D: Protein fold recognition. Nat Struct Biol 1995, 2:91–93.

3. Mosimann S, Meleshko R, James MNG: A critical assessment of comparative modeling of tertiary structures of proteins. Proteins: Struct Funct Genet 1995, 23:301–317.

4. Bowie JU, Eisenberg D: Inverted protein structure prediction. Curr Opin Struct Biol 1993, 3:437–444.

5. Lemer CMR, Rooman MJ, Wodak SJ: Protein structure prediction by threading methods: evaluation of current techniques. Proteins: Struct Funct Genet 1995, 23:337–355.

6. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1994, 235:13–26.

7. Flockner H, Braxenthaler M, Lackner P, Jaritz M, Ortner M, Sippl MJ: Progress in fold recognition. Proteins: Struct Funct Genet 1995, 23:376–386.

8. Defay T, Cohen FE: Evaluation of current techniques for ab initio protein structure prediction. Proteins: Struct Funct Genet 1995, 23:431–445.

9. Rost B, Sander C: Progress of 1D protein structure prediction at last. Proteins: Struct Funct Genet 1995, 23:295–300.

10. Rosenfeld R, Vajda S, DeLisi C: Flexible docking and design. Annu Rev Biophys Biomol Struct 1995, 24:677–700.

11. Bamborough P, Cohen FE: Modeling protein–ligand complexes. Curr Opin Struct Biol 1996, 6:236–241.
