Fold Recognition Using Predicted Secondary Structure Sequences and Hidden Markov Models of Protein Folds

Valentina Di Francesco,1* V. Geetha,1 Jean Garnier,2 and Peter J. Munson1

1 Analytical Biostatistics Section, Laboratory of Structural Biology, National Institutes of Health, Bethesda, Maryland
2 Fogarty International Center, National Institutes of Health, Bethesda, Maryland

ABSTRACT We present an analysis of the blind predictions submitted to the fold recognition category for the second meeting on the Critical Assessment of Techniques for Protein Structure Prediction. Our method achieves fold recognition from predicted secondary structure sequences using hidden Markov models (HMMs) of protein folds. HMMs are trained only with experimentally derived secondary structure sequences of proteins having a similar fold; protein structures are therefore described by the models at a remarkably simplified level. We submitted predictions for five target sequences, of which four were later found to be suitable for threading. Our approach correctly predicted the fold for three of them. For the fourth sequence the fold could have been correctly predicted if a better model for its structure had been available. We conclude that we have additional evidence that secondary structure information represents an important factor for achieving fold recognition. Proteins, Suppl. 1:123–128, 1997. © 1998 Wiley-Liss, Inc.†

Key words: protein structure prediction; hidden Markov models; fold recognition; secondary structure

INTRODUCTION

The large-scale protein structure prediction experiment, summarized at the second meeting for the "Critical Assessment of Techniques for Protein Structure Prediction" (CASP2) held in Asilomar, California, in December 1996, offered a common set of target sequences on which the prediction teams could test their own methods. In CASP2 different methods are tested on the same set of proteins, thus allowing direct comparison between them. Moreover, as the three-dimensional structures of the target sequences were unpublished at the time of the prediction, the robustness of different methodologies is tested in real experimental conditions. Therefore, it is possible to regard the results of all the submitted predictions as fully cross-validated.

Here, we present the predictions submitted to the "fold recognition" category using our recently developed method [1]. This approach is based on an extremely simplified representation of protein folds, consisting of one-dimensional strings of secondary structure states (i.e., helix, strand, and coil) at each residue position. Protein folds are described by hidden Markov models [2] (HMMs) trained on proteins having the same "topology", that is, similar orientation and connectivity of secondary structure elements [3]. Despite the highly simplified structure representation, results we obtained previously on mostly helical proteins led us to expect reasonable performance on proteins in the other structural classes.

METHODS

Our approach combines two well-studied methodologies: hidden Markov models and protein secondary structure prediction [1]. The hidden Markov models have the architecture proposed by Krogh et al. [4] to model families of proteins of a specific structural topology (CATH, version Jan. 1995 and version 1.0 [3]) and are calibrated using well-established algorithms [2]. The main difference of our approach from other HMM-based approaches to fold recognition is that instead of training HMMs by aligning amino acid sequences [4,5], the model parameters are calibrated by aligning sequences of experimentally determined secondary structure states of protein family members. The model parameters are the transition probability distributions between hidden states (match, insert, and delete states at each position in the model) and the observation symbol probability distributions for the secondary structures H, E, C, given that the residue is in a match or an insert state. Each model provides a probabilistic description of the consensus secondary structure of a topology group of proteins, such as the globins, the TIM barrels, or the serine proteases.
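To make the observation symbol probabilities concrete, the following is a minimal sketch and not the training procedure used for the models described here. It estimates only the match-state probabilities for H, E, and C by counting symbols in each column of a fixed, already aligned set of secondary structure strings. The actual models are calibrated with the standard HMM algorithms of reference 2 and also carry insert states, delete states, and transition probabilities. The function names, the pseudocount value, and the toy strings are assumptions introduced for illustration.

```python
from collections import Counter

STATES = "HEC"  # helix, strand, coil


def match_emissions(aligned, pseudocount=1.0):
    """Estimate per-column probabilities of H, E, C from aligned strings.

    `aligned` holds equal-length secondary structure strings; '-' marks a gap
    and is ignored. The pseudocount (an assumed value) keeps every
    probability strictly positive.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned if s[i] in STATES)
        total = sum(counts.values()) + pseudocount * len(STATES)
        columns.append({x: (counts[x] + pseudocount) / total for x in STATES})
    return columns


def consensus(columns):
    """Most probable secondary structure state at each model position."""
    return "".join(max(STATES, key=col.get) for col in columns)


# Hypothetical family of three aligned secondary structure strings.
family = ["HHHCEE", "HHCCEE", "HHHCE-"]
model = match_emissions(family)
print(consensus(model))
```

Counting over a fixed alignment is a deliberate simplification: it shows what a "probabilistic description of the consensus secondary structure" looks like column by column, while leaving out the iterative re-estimation that aligns the training sequences to the model.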

A prediction was submitted for five target sequences (Table I). As our method does not yet provide a complete library of models for all known topology groups, when possible, we took advantage of certain biochemical knowledge gleaned from the literature to train a new HMM for a suggested topology. This ad hoc model training was performed in the cases of the Polyribonucleotide Nucleotidyltransferase (T0004) and Beta-Cryptogein (T0032) target sequences.

*Correspondence to: Valentina Di Francesco, The Institute for Genomic Research, Rockville, MD 20850. E-mail: [email protected]

Received 5 May 1997; Accepted 26 August 1997

PROTEINS: Structure, Function, and Genetics, Suppl. 1:123–128 (1997)

© 1998 WILEY-LISS, INC. †This article is a US government work and, as such, is in the public domain in the United States of America.

Several methods for secondary structure prediction were used to obtain predicted sequences for the target proteins: QL [6], PHD [7], NNSSP [8], PREDATOR [9], and GOR version IV [10], all available from World Wide Web servers. The predicted sequences were added to a "control" database of 112 previously predicted secondary structure sequences of proteins (obtained with the QL algorithm, average cross-validated Q3 = 68%). This database consists of a subset of all the known protein structures, with sequences having less than 25% pairwise residue identity [1]. Each predicted sequence was assigned a log-odds score by each model in the HMM library. The score indicates the relative likelihood of the protein having that particular fold compared to the likelihood under a generic null model of the same length. The scores assigned by each model are then ranked. Finally, the fold model giving the best ranks to the predicted target sequences is used as the prediction. We compare ranks instead of log-odds scores, because the scores depend strongly on model length and parameter values. When several predicted sequences ranked high for a certain model, we considered this as additional evidence for the correctness of the prediction.
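The scoring and ranking steps can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the submissions: the log-likelihoods are taken as given inputs, the rule for picking the model (compare sorted rank lists, lowest first) is an assumed formalization of "the fold model giving the best ranks", and all identifiers and score values are made up.

```python
def log_odds(loglik_model, loglik_null):
    """Log-odds score: log P(seq | fold model) - log P(seq | null model)."""
    return loglik_model - loglik_null


def target_ranks(scores, target_ids):
    """Ranks (1 = best) of the target predictions among all scored sequences.

    `scores` maps a sequence id (target predictions and control database
    entries alike) to its log-odds score under ONE model.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    position = {seq_id: i + 1 for i, seq_id in enumerate(ordered)}
    return sorted(position[t] for t in target_ids)


def best_model(scores_by_model, target_ids):
    """Pick the model whose sorted list of target ranks is lowest.

    Ranks, not raw scores, are compared across models because the scores
    depend on model length and parameter values. The list comparison used
    here is an assumed tie-breaking rule.
    """
    return min(scores_by_model,
               key=lambda m: target_ranks(scores_by_model[m], target_ids))


# Hypothetical log-odds scores for two predictions of one target
# plus two control database sequences, under two fold models.
scores_by_model = {
    "TIM barrel": {"target_PHD": 5.0, "target_GOR": 4.0, "3timA": 6.0, "ctrl_1": 1.0},
    "globin": {"target_PHD": 0.5, "target_GOR": 0.1, "3timA": 0.2, "ctrl_1": 3.0},
}
targets = ["target_PHD", "target_GOR"]
print(best_model(scores_by_model, targets))
```

Ranking within each model before comparing across models sidesteps the fact that log-odds scores from models of different lengths are not on a common scale, at the cost of having no significance measure when two models both give high ranks.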

The predicted HMM does not indicate a single, unique fold for the protein sequence of interest. Rather, it identifies the protein topology and a set of possible folds, which are the members of the indicated topology family. The selection of the particular member that would best fit the structure of the target sequence was handled intuitively. We compared the sequence length and the pattern of secondary structure elements of the top-ranking predicted sequence with those of the experimentally derived secondary structure sequences of the proteins in the training set. During the scoring process, the target sequence is aligned to the HMM and hence to the sequence of secondary structure states of the selected model protein structure. The alignment was manually revised in some cases. Confidence values for the predicted model structures and the alignments are not provided quantitatively by our method. We chose subjective values based on the ranks assigned to the predicted secondary structure sequences by various models and the appearance of the alignments.

RESULTS AND DISCUSSION

Of the 22 target proteins in the fold recognition category, the structures for 5 were not solved in time, and 10 were not deemed suitable targets for threading because they had either a previously unobserved fold (8 targets) or a very unusual fold (2 targets), leaving only 7 suitable targets. Of the 5 predictions we submitted (Table I), 4 were for appropriate threading targets. Of the 4, the fold was correctly predicted for three (Table II), and the fourth could have been correctly predicted if we had a better HMM for that specific topology. All the measures of accuracy provided in Table II coincide with those calculated by the prediction evaluators at the time and are available at the URL: http://PredictionCenter.llnl.gov/.

Prediction of Polyribonucleotide Nucleotidyltransferase (T0004)

This target sequence belongs to a ribosomal protein with an RNA binding S1 motif. Ribosomal proteins involved in RNA interactions often share a folding motif, namely, the oligonucleotide binding (OB) fold. An HMM for OB fold proteins was built and trained using the experimentally derived secondary structure sequences of 7 representative proteins having the OB fold. Predicted secondary structure sequences of T0004 were aligned to this model. Three predicted sequences for T0004 were found within the 5 top-ranking scores; rank 2 was assigned to the control database protein 2sns, which also has the OB fold. Predicted secondary structure sequences of T0004 were also aligned to the other HMMs in our library and were assigned poor ranking scores (below rank 8), except for the Plait PTF model, which assigned ranks 1 and 4 to the predictions. Thus our method predicted that T0004 had either an OB fold topology (model PDB structures: 1lts, chain D, and 1snc), or a Plait PTF topology (model PDB structures: 1ptf, 2bop, chain A, and 2fxb). We assigned confidence 0.6 to the OB fold and 0.4 to the Plait PTF, with the higher confidence for OB fold due to the strong indication in the literature for RNA binding proteins. The selection of the specific protein structure among those in the OB fold HMM training set was not crucial for fold recognition, since all but one of those proteins were found in the list of structurally similar proteins provided by at least one of the algorithms used by the evaluators (DALI, VAST, or SSAP), with different degrees of significance of the structural match. Figure 1 presents the OB fold HMM alignment of 1ltsD with the target sequence. Of all the secondary structure predictions, the PREDATOR prediction shown in Figure 1 received the highest rank from the OB fold model. The figure shows the extensive agreement between the observed secondary structure sequence of 1ltsD and the predicted sequence of T0004. However, there were several errors in the prediction, including an additional strand, and the prediction of the final strand as a helix. The Plait PTF HMM incorrectly gave a good rank to the T0004 sequence because Plait PTF proteins have a length and a pattern of secondary structure elements resembling those of proteins with the OB fold.

TABLE I. Summary of Submitted Predictions

Target sequence                            Target number  Prediction code  Fold topology    Structure solved by (ref. no.)  Fold correctly predicted
Polyribonucleotide Nucleotidyltransferase  T0004          T0004FR218       OB fold          M. Bycroft et al. (18)          Yes
3-Dehydroquinase                           T0014          T0014FR178       TIM barrel       A. Shrive et al. (19)           Yes
Ferrochelatase                             T0020          T0020FR717       NTRC domain      S. Al-Karadaghi et al. (20)     No
Exfoliative toxin A                        T0031          T0031FR829       Serine protease  G.M. Vath et al. (21)           Yes
Beta-Cryptogein                            T0032          T0032FR444       Unique fold      G. Boissy et al.                N/A

Prediction of 3-Dehydroquinase (T0014)

Information in the literature did not suggest a particular fold for this target. Various secondary structure prediction algorithms provided predictions consistent with the α–β class. Predicted secondary structure sequences were aligned to all the HMMs in the library. Four predicted sequences were assigned ranks 2, 3, 4, and 9 by the TIM barrel model. Rank 1 was assigned to 3timA, another TIM barrel protein in the control database; other high-ranking proteins were 3-layer (αβα)-sandwich folds. The predicted secondary structure sequences of T0014 were also assigned ranks 5, 7, and 8 by the nitrogen regulatory protein receiver domain (NTRC) topology HMM, outranked by three proteins in the control database that indeed have the NTRC topology, and a globin sequence. The TIM barrel topology was chosen over the NTRC domain topology because the HMM alignment of one predicted sequence to the NTRC model revealed long inserts of one or more secondary structure elements, suggesting that the predicted sequence did not fit that model very well. The TIM barrel PDB structures 5tim, chain A, and 1nar were selected as model structures for 3-dehydroquinase, but this selection was not critical. It is interesting to note that according to VAST all the proteins in the training set were highly significant structural matches for T0014, while according to DALI only 5timA was. The HMM alignment between 5timA and 3-Dehydroquinase submitted with the highest confidence value (0.7) had a 30.5% sensitivity and a 2.5 average residue shift error when compared with the alignment provided by the VAST structural comparison algorithm. In general, the sheet forming the barrel was not well aligned, since the location of 3 strands (out of 8) was shifted by more than 2 residues.

TABLE II. Summary of Results for Target Sequences Whose Fold Was Correctly Recognized*

For each structure comparison method, the first column gives SCRms (Å)/SCLen (res.) and the second gives the HMM ASns (%)/Shft (res.).

                         VAST                      DALI                      SSAP
Target (length)   Model  SCRms/SCLen  ASns/Shft    SCRms/SCLen  ASns/Shft    SCRms/SCLen  ASns/Shft
T0004 (85 res.)   1ltsD  2.01/32      15.6/1.4†    —            —            10.35/66     0/6.2†
                  1snc   2.66/49      6.1/6.1†     —            —            —            —
T0014 (252 res.)  5timA  4.30/191     25.4/3.0†    3.77/182     26.5/2.9†    5.47/206     19.6/2.7†
                  1nar   3.31/176     20.7/3.8     4.09/192     0.0/35.6     6.63/217     8.0/4.6
T0031 (242 res.)  3est   2.42/177     19.8/5.4     2.31/185     19.5/5.6     2.55/82      25.9/2.5
                  4chaA  2.75/175     23.4/4.5     2.60/185     22.2/4.7     2.75/87      29.1/1.9

*This table contains only a subset of the measures of accuracy of the model structure-to-target sequence alignments, as they were reported by the CASP2 evaluators. Refer to the URL: http://PredictionCenter.llnl.gov/ for the other measures and their explanation. SCRms is the root mean square deviation between the Cα coordinates of the model structure and those of the target structure, measured on the residues that have been aligned by the structure comparison methods. SCLen is the length of the alignment between the model structure and the target structure produced by the structure comparison methods (VAST, DALI, SSAP). The alignment sensitivity (ASns) is the percentage of target sequence residues aligned by the structure comparison method that have been exactly aligned by the HMMs. The alignment mean shift error (Shft) is the weighted average difference between the alignment produced by the structural comparison methods and the alignment produced by the HMMs. — indicates that the structure comparison method does not regard the predicted structure as a suitable model structure for the target sequence.
†These values are weighted by the confidence values assigned to two different HMM alignments that were submitted with the predictions. The values reported in the text might differ from these because they are not weighted by the confidence value associated with the submitted alignment prediction.

Fig. 1. Alignment of polyribonucleotide nucleotidyltransferase (T0004) to chain D of 1lts produced by the HMM of the OB fold. Above the amino acid sequence of 1ltsD is its observed secondary structure, obtained with the program DSSP [17]; below the amino acid sequence of T0004 are its DSSP observed and predicted secondary structure sequences. The prediction was obtained with the program PREDATOR. The underscore characters identify coil regions. Two different HMM alignments between 1ltsD and T0004 were submitted to CASP2. The alignment shown was submitted with the highest confidence value (0.6). The black lines show the correct residue equivalences as determined by the structure comparison algorithm VAST. The alignment has a mean shift residue error of 1.4 and alignment sensitivity 15.6%.
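The two alignment measures defined in the footnote to Table II can be illustrated with a short sketch. It is not the evaluators' software. Alignments are represented here as dictionaries mapping a target residue index to a model residue index, and the shift error is computed as an unweighted mean absolute offset over residues aligned by both methods; the footnote describes a weighted average without giving the weights, so the unweighted form is an assumption, as are the function names and the toy alignments.

```python
def alignment_sensitivity(reference, predicted):
    """Percentage of reference-aligned target residues that the predicted
    alignment pairs with exactly the same model residue."""
    if not reference:
        raise ValueError("reference alignment is empty")
    exact = sum(1 for t, m in reference.items() if predicted.get(t) == m)
    return 100.0 * exact / len(reference)


def mean_shift(reference, predicted):
    """Mean absolute offset (in residues) between the model positions
    assigned by the two alignments, over target residues aligned by both.
    Unweighted: an assumed simplification of the evaluators' measure."""
    shared = [t for t in reference if t in predicted]
    if not shared:
        raise ValueError("alignments share no target residues")
    return sum(abs(predicted[t] - reference[t]) for t in shared) / len(shared)


# Hypothetical alignments: target residue index -> model residue index.
structural = {1: 10, 2: 11, 3: 12, 4: 13}   # e.g. from a structure comparison
hmm = {1: 10, 2: 12, 3: 14, 5: 20}          # e.g. from the HMM alignment
print(alignment_sensitivity(structural, hmm), mean_shift(structural, hmm))
```

Reporting both numbers is useful because they capture different failures: a low sensitivity with a small shift means the alignment is nearly right but slightly out of register, whereas a large shift indicates that whole elements were matched to the wrong part of the model.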

Prediction of Exfoliative Toxin A (T0031)

All the prediction teams were provided with the information that sequence alignment [11] and experimental data [12] suggested that T0031 had an active site similar to that of serine proteases. On this basis, the HMM for the serine protease fold, already available in our HMM library, was used to score all the predicted secondary structure sequences for exfoliative toxin. In addition to the usual set of prediction algorithms listed above, we also used our recently developed secondary structure prediction method [13]. This new method combines the HMM parameters (the transitions and observation symbol probability distributions) with the probability distributions assigned by QL to each secondary structure state of each residue. This approach thus incorporates global structural information into the secondary structure prediction procedure, since each HMM describes a protein fold. We used the serine protease fold HMM to modify the QL predicted sequence of secondary structure states, improving the QL prediction accuracy for T0031 by almost 4 percentage points (QL: Q3 = 56.0%; QL "adjusted" with the serine protease HMM: Q3 = 59.8%). More importantly, the percentage of correctly predicted secondary structure elements (either helices or strands), out of a total of 27, increased from 41% (QL) to 48% (QL-adjusted), while the percentage of those wrongly predicted (helices instead of strands and vice versa) decreased from 18% (QL) to 7% (QL-adjusted). The QL-adjusted prediction had the highest ranking score assigned by the HMM of serine proteases, which should not be surprising, since the prediction itself was obtained using the model parameters. Other secondary structure predictions were given ranks 2 and 7 by the serine protease HMM. The PHD server prediction was assigned rank 7, although its prediction accuracy was later found to be Q3 = 63.6%. The percentage of secondary structure elements wrongly predicted by the PHD server was the same as by QL-adjusted (7.4%), while the percentage of those predicted correctly was lower (44%). It seems that what counted for achieving fold recognition from secondary structure sequences was not the residue-by-residue prediction accuracy Q3, but the number of correctly and incorrectly predicted secondary structure elements.
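The distinction between residue-level and element-level accuracy can be sketched in a few lines. Q3 is the percentage of residues whose predicted state (H, E, or C) matches the observed one. The element-level criterion is not spelled out in the text, so the rule used below is an assumption: an observed helix or strand counts as correctly predicted when at least half of its residues carry the same predicted state, and as wrongly predicted when at least half carry the opposite state (helix for strand or vice versa). The function names and the toy strings are likewise illustrative.

```python
from itertools import groupby


def q3(observed, predicted):
    """Percentage of residues whose predicted state equals the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have equal length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


def elements(ss):
    """Yield (state, start, end) for each run of H or E; end is exclusive."""
    pos = 0
    for state, run in groupby(ss):
        n = len(list(run))
        if state in "HE":
            yield state, pos, pos + n
        pos += n


def element_summary(observed, predicted, min_overlap=0.5):
    """Count observed elements predicted correctly, predicted as the
    opposite type, and in total. `min_overlap` is an assumed threshold."""
    correct = swapped = total = 0
    for state, start, end in elements(observed):
        total += 1
        window = predicted[start:end]
        other = "E" if state == "H" else "H"
        if window.count(state) / len(window) >= min_overlap:
            correct += 1
        elif window.count(other) / len(window) >= min_overlap:
            swapped += 1
    return correct, swapped, total


# Hypothetical observed and predicted strings (one helix, one strand).
obs = "CCHHHHHHCCEEEECC"
pred = "CCHHHHHCCCHHHHCC"
print(q3(obs, pred), element_summary(obs, pred))
```

In the toy example most residues agree, yet one of the two elements is predicted as the wrong type. That is the situation described above, where a prediction with a higher Q3 can still present the fold model with a misleading pattern of elements.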

The PDB structures 3est and 4cha, chain A, were chosen as serine protease models for T0031, but the choice was again not critical. The alignment sensitivity of T0031 and 3est obtained by the HMM ranges between 20% and 26%, depending on the structural alignment; when aligning T0031 to 4chaA it ranges between 22% and 29%. The poor alignment quality is probably due to the low secondary structure prediction accuracy.

Prediction of Ferrochelatase (T0020)

The fold of this target consists of two structurally similar domains belonging to the nitrogen regulatory protein receiver domain (NTRC) topology in the 3-layer (αβα)-sandwich architecture. We wrongly predicted it to have the TIM barrel topology. A post hoc analysis of the data collected at the time of the submission of the prediction uncovered some evidence for the NTRC topology, which we unfortunately disregarded. The TIM barrel HMM assigned ranks 2, 3, and 5 to the T0020 predictions, of which the highest ranking one was the PHD prediction. Rank 1 was obtained by another TIM barrel protein (3timA) in the control database, while the remaining high-ranking sequences were either TIM barrels or proteins with the NTRC topology. This fact should have suggested to us that proteins with the NTRC topology and those with the TIM barrel topology might not be easily discriminated by the TIM barrel HMM.

The very good agreement between the aligned secondary structure sequence of the top-ranking PHD prediction (Q3 = 76%, 92% of the secondary structure elements predicted correctly) and the experimentally derived secondary structure sequence of the TIM barrel model structure 1ypiA explains why T0020 was predicted as a TIM barrel (data not shown). Moreover, the PHD prediction contains 9 strands and 10 helices. If one strand is disregarded as a wrongly predicted segment, since it aligned to a helix, a typical TIM barrel secondary structure pattern with 8 strands and multiple intervening helices is obtained. However, the PHD prediction of T0020 was also assigned the top-ranking score by the NTRC topology model, and another prediction was given rank 3. The TIM barrel topology was chosen over the NTRC topology mainly because the alignment of T0020 to the NTRC model had several insertions or deletions, the longest one containing 5 predicted secondary structure elements.

It seems that this failure to select the NTRC model was partly due to the fact that the training set of the NTRC HMM contained both one-domain and two-domain sequences. Since the majority of them are one-domain sequences, the HMM describes the consensus structure of one-domain proteins better, while T0020 in fact has two domains. After the contest, we trained another NTRC model using only the two-domain NTRC secondary structure sequences. The top 10 ranking scores assigned by this new model belong to 5 different predictions of T0020, and the alignment of the top-ranking PHD predicted sequence to the model contained only one insertion of a helical segment, aside from the expected insertions or deletions of residues at the ends of secondary structure segments, which are typical of proteins that have a common topology.

Prediction of Beta-Cryptogein (T0032)

This target protein has a novel fold, and therefore it is not an appropriate target for fold recognition. A BLAST search [14] returned an alignment of the last 87 residues (out of 98) of T0032 to the alpha-elicitin capsicein protein (Swissprot ID: P15571) with 86% sequence identity. The secondary structure of this alpha-elicitin capsicein protein had already been determined by NMR spectroscopy by Bouaziz et al. [15]. As the sequence identity was quite high over an extended region, the secondary structure of T0032 was simply predicted to be identical to the NMR assignment of alpha-elicitin, consisting of five alpha helices and two strands. This predicted sequence achieved the prediction accuracy Q3 = 79.6%, with 97.8% of secondary structure elements correctly predicted. Since it was pointed out by Bouaziz et al. that the C-terminal secondary structure motif of capsicein evokes the structural features of phospholipase A2 proteins, an HMM for that topology was built and used for submitting the prediction. This model assigned a high rank (2) to the score of the beta-cryptogein predicted secondary structure sequence; the same rank was also obtained by the predicted sequence using the models of helical cytokines and EF-Hand proteins. In conclusion, despite the high prediction accuracy of the secondary structure sequence of T0032, we could not determine that the beta-cryptogein sequence had a previously unobserved fold.

CONCLUSIONS

We have presented additional evidence that secondary structure information is a major determinant for achieving fold recognition. Information found in published literature also contributed to the success rate of this approach, because it suggested models of some folds not yet available in our HMM library at the time of CASP2. The HMM of the correct fold for a target protein was always found to assign high ranks to several of the predicted secondary structure sequences of the targets. Ferrochelatase, whose fold was wrongly predicted to be a TIM barrel, could have been correctly recognized by a two-domain model of the NTRC topology, which was not available at the time of CASP2. The quality of the alignments produced by these HMMs needs improvement. If more detailed information than just secondary structure is used to represent protein folds with HMMs, the alignment quality and the fold recognition capabilities are expected to improve. Using ranks rather than a true measure of significance of the scores causes problems when models of two different topologies both assign high ranks to the same predicted sequences, as in the case of Ferrochelatase or Beta-Cryptogein.

Our approach provides the most likely protein topology and an ensemble of possible folds for the target sequence, which are the proteins used to train the model of that topology. We found that we could have selected almost any PDB structure in the training sets as a model structure without substantially affecting the fold recognition success rate, since almost all those proteins were found to be similar to the structure of the target sequence. This agrees with an observation of Lemer et al. [16] about the multiplicity of acceptable solutions for protein fold recognition problems. Note, though, that the model structure-to-target structure alignments provided by VAST, DALI, or SSAP have different levels of significance for different structures, so that some structures are better suited than others to serve as structural models for the target sequence, and they would probably produce better alignments.

The use of several secondary structure prediction algorithms turned out to be very useful, since it increased our chance of having the correct sequence of secondary structure elements for the target sequence or their correct length. We have yet to analyze the differences in the various predicted sequences in relation to how they were aligned and scored by the models.

Although CASP2 offered a unique opportunity for performing bona fide protein structure predictions and for measuring the performance of various methodologies more objectively, the set of 7 target sequences suitable for fold recognition is too small to draw definitive conclusions concerning the relative success rate of different methods. We await future tests of this sort.

ACKNOWLEDGMENTS

We thank Philip McQueen for the QL "adjusted" secondary structure prediction of the exfoliative toxin A sequence; Michael Levitt for the assessment of the predictions in the fold recognition category; Aron Marchler-Bauer, Ceslovas Venclovas, and Adam Zemla for the development of software for the automatic evaluation of predictions; and all the members of the CASP2 organizing committee for providing us with the opportunity of taking part in the prediction experiment.


REFERENCES

1. Di Francesco, V., Garnier, J., Munson, P.J. Protein topology recognition from secondary structure sequences: Application of the hidden Markov models to the alpha class proteins. J. Mol. Biol. 267:446–463, 1997.

2. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 257–286, 1989.

3. Orengo, C.A., Flores, T.P., Taylor, W.R., Thornton, J.M. Identification and classification of protein fold families. Protein Eng. 6:485–500, 1993.

4. Krogh, A., Brown, M., Mian, I.S., Sjolander, K., Haussler, D. Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235:1501–1531, 1994.

5. Hubbard, T.J., Park, J. Fold recognition and ab initio structure predictions using hidden Markov models and β-strand pair potential. Proteins 23:398–402, 1995.

6. Munson, P.J., Di Francesco, V., Porrelli, R. Protein secondary structure prediction using periodic-quadratic-logistic models: Statistical and technical issues. In: "Proceedings of the 27th Hawaii International Conference on System Sciences." New York: IEEE Computer Society Press, 1994:375–384.

7. Rost, B., Sander, C., Schneider, R. PHD: An automatic mail server for protein secondary structure prediction. Comput. Appl. Biosci. 10:53–60, 1994.

8. Salamov, A.A., Solovyev, V.V. Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. J. Mol. Biol. 247:11–15, 1995.

9. Frishman, D., Argos, P. Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng. 9:133–142, 1996.

10. Garnier, J., Gibrat, J.-F., Robson, B. GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol. 266:540–553, 1996.

11. Dancer, S.J., Garratt, R., Saldanha, J., Jhoti, H., Evans, R. The epidermolytic toxins are serine proteases. FEBS Lett. 268:129–132, 1990.

12. Bailey, C.J., Redpath, M.B. The esterolytic activity of epidermolytic toxins. Biochem. J. 284:177–180, 1992.

13. Di Francesco, V., McQueen, P., Garnier, J., Munson, P.J. Incorporating global information into secondary structure prediction with hidden Markov models of protein folds. In: "Proceedings of the Fourth International Conference Intelligence Systems in Molecular Biology, Halkidiki, Greece." 1997:100–103.

14. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215:403–410, 1990.

15. Bouaziz, S., van Heijenoort, C., Huet, J.C., Pernollet, J.C., Guittet, E. 1H and 15N resonance assignment and secondary structure of capsicein, an alpha-elicitin, determined by three-dimensional heteronuclear NMR. Biochemistry 33:8188–8197, 1994.

16. Lemer, C.M.-R., Rooman, M.J., Wodak, S.J. Protein structure prediction by threading methods: Evaluation of current techniques. Proteins 23:337–355, 1995.

17. Kabsch, W., Sander, C. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637, 1983.

18. Bycroft, M., Hubbard, T.J., Proctor, M., Freund, S.M., Murzin, A.G. The solution structure of the S1 RNA binding domain: A member of an ancient nucleic acid-binding fold. Cell 88:235–242, 1997.

19. Shrive, A.K., Polikarpov, I., Krell, T., Coulson, A., Coggins, J.R., Hawkins, A.R., Sawyer, L. Three-dimensional structure of type I dehydroquinase: An enzyme recruited for a role in eukaryotic transcription regulation. Nature Struct. Biol. Submitted, 1997.

20. Al-Karadaghi, S., Nikonov, S., Jonsson, B., Hederstedt, L. Crystal structure of ferrochelatase. EMBO J. In press, 1997.

21. Vath, G.M., Earhart, C.A., Rago, J.V., Kim, M.H., Bonach, G.A., Schlievert, P.M., Ohlendorf, D.H. The structure of the superantigen exfoliative toxin A suggests a novel regulation as a serine protease. Biochemistry 36:1559–1566, 1997.

22. Boissy, G., de la Fortelle, Kahn, R., Huet, J.C., Bricogne, G., Pernollet, J.C., Brunie, S. Crystal structure of a fungal elicitor secreted by Phytophthora cryptogea, a member of a novel class of plant necrotic proteins. Structure 4:1429–1439, 1996.
