8
Distant Homology Recognition Using Structural Classification of Proteins Alexey G. Murzin 1 * and Alex Bateman 2 1 Centre for Protein Engineering, MRC Centre, Cambridge, United Kingdom 2 MRC Laboratory of Molecular Biology, Cambridge, United Kingdom ABSTRACT Protein structure prediction is arguably the biggest unsolved problem of structural biology. The notion of the number of naturally occurring different protein folds be- ing limited allows partial solution of this prob- lem by the use of fold recognition methods, which ‘‘thread’’ the sequence in question through a library of known protein folds. The fold recognition methods were thought to be superior to the distant homology recognition methods when there is no significant sequence similarity to known structures. We show here that the Structural Classification of Proteins (SCOP) database, organizing all known pro- tein folds according their structural and evolu- tionary relationships, can be effectively used to enhance the sensitivity of the distant homol- ogy recognition methods to rival the ‘‘thread- ing’’ methods. In the CASP2 experiment, our approach correctly assigned into existing SCOP superfamilies all of the six ‘‘fold recognition’’ targets we attempted. For each of the six tar- gets, we correctly predicted the homologous protein with a very similar structure; often, it was the most similar structure. We correctly predicted local alignments of the sequence features that we found to be characteristic for the protein superfamily containing a given target. Our global alignments, extended manu- ally from these local alignments, also appeared to be rather accurate. Proteins, Suppl. 1:105– 112, 1997. r 1998 Wiley-Liss, Inc. Key words: CASP; fold recognition; SCOP; su- perfamily; structure prediction INTRODUCTION There are two general approaches to the predic- tion of protein structure—one comes from the prin- ciples of protein physics and chemistry and the other from the evolutionary history of proteins. The so- called ab initio methods utilize the physics and chemistry of proteins in its almost pure form, whereas the distant homology recognition methods are domi- nated by the history of proteins. Most of the other methods, however, combine both approaches in order to improve their performance—notably, the fold rec- ognition methods that ‘‘thread’’ the sequence in question through a library of known protein folds (see other articles in this issue). The justification for fold recognition methods is the notion that the total number of different protein folds occurring in nature is limited and that we already know a significant fraction, probably the majority of all natural folds. Apparently, the number of natural folds is limited by historical rather than physico-chemical reasons. All natural proteins come from a relatively small number of different superfamilies 1 . All members of a superfamily are descended from a common ancestor and share the same ancestral fold, which very often is unique for a given superfamily. The distant homol- ogy recognition methods predict evolutionary rela- tionships of proteins lacking notable sequence simi- larity by detecting more subtle conserved features in their sequences. We suggest that the sensitivity of the distant homology recognition methods are en- hanced by the use of structural information to rival the fold recognition methods in the field of protein structure prediction. To prove this suggestion, we use the Structural Classification of Proteins (SCOP) database. 2 SCOP distinguishes structural similarities being due to a probable common origin from fortuitous similarities arising from the physics and chemistry of proteins. Evolutionary relationships are described on two SCOP levels, family and superfamily. The superfam- ily level clusters families with structural and general functional similarity but without notable sequence similarity. Structural comparisons on this level can reveal subtle superfamily-specific features inherited from a common ancestor. 3 These superfamily-specific features may be active site details or unusual struc- tural details which can be used to construct sequence templates. These features may include the catalytic mechanism or other functional features which may not be expressed simply in sequence terms but may nevertheless be used in prediction. These features allow identification of other sequence families of unknown structure that may belong to a given superfamily. Preliminary studies showed that ‘‘true’’ Contract grant sponsor: MRC Senior Fellowship. *Correspondence to: Alexey G. Murzin, Centre for Protein Engineering, MRC Centre, Hills Road, Cambridge CB2 2QH, United Kingdom. Received 30 June 1997; Accepted 28 August 1997 PROTEINS: Structure, Function, and Genetics, Suppl. 1:105–112 (1997) r 1998 WILEY-LISS, INC.

Distant homology recognition using structural classification of proteins

  • Upload
    alex

  • View
    215

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Distant homology recognition using structural classification of proteins

Distant Homology Recognition Using StructuralClassification of ProteinsAlexey G. Murzin1* and Alex Bateman2

1Centre for Protein Engineering, MRC Centre, Cambridge, United Kingdom2MRC Laboratory of Molecular Biology, Cambridge, United Kingdom

ABSTRACT Protein structure predictionis arguably the biggest unsolved problem ofstructural biology. The notion of the number ofnaturally occurring different protein folds be-ing limited allows partial solution of this prob-lem by the use of fold recognition methods,which ‘‘thread’’ the sequence in questionthrough a library of known protein folds. Thefold recognition methods were thought to besuperior to the distant homology recognitionmethods when there is no significant sequencesimilarity to known structures. We show herethat the Structural Classification of Proteins(SCOP) database, organizing all known pro-tein folds according their structural and evolu-tionary relationships, can be effectively usedto enhance the sensitivity of the distant homol-ogy recognition methods to rival the ‘‘thread-ing’’ methods. In the CASP2 experiment, ourapproach correctly assigned into existing SCOPsuperfamilies all of the six ‘‘fold recognition’’targets we attempted. For each of the six tar-gets, we correctly predicted the homologousprotein with a very similar structure; often, itwas the most similar structure. We correctlypredicted local alignments of the sequencefeatures that we found to be characteristic forthe protein superfamily containing a giventarget. Our global alignments, extended manu-ally from these local alignments, also appearedto be rather accurate. Proteins, Suppl. 1:105–112, 1997. r 1998 Wiley-Liss, Inc.

Key words: CASP; fold recognition; SCOP; su-perfamily; structure prediction

INTRODUCTION

There are two general approaches to the predic-tion of protein structure—one comes from the prin-ciples of protein physics and chemistry and the otherfrom the evolutionary history of proteins. The so-called ab initio methods utilize the physics andchemistry of proteins in its almost pure form, whereasthe distant homology recognition methods are domi-nated by the history of proteins. Most of the othermethods, however, combine both approaches in orderto improve their performance—notably, the fold rec-ognition methods that ‘‘thread’’ the sequence in

question through a library of known protein folds(see other articles in this issue). The justification forfold recognition methods is the notion that the totalnumber of different protein folds occurring in natureis limited and that we already know a significantfraction, probably the majority of all natural folds.

Apparently, the number of natural folds is limitedby historical rather than physico-chemical reasons.All natural proteins come from a relatively smallnumber of different superfamilies1. All members of asuperfamily are descended from a common ancestorand share the same ancestral fold, which very oftenis unique for a given superfamily. The distant homol-ogy recognition methods predict evolutionary rela-tionships of proteins lacking notable sequence simi-larity by detecting more subtle conserved features intheir sequences. We suggest that the sensitivity ofthe distant homology recognition methods are en-hanced by the use of structural information to rivalthe fold recognition methods in the field of proteinstructure prediction.

To prove this suggestion, we use the StructuralClassification of Proteins (SCOP) database.2 SCOPdistinguishes structural similarities being due to aprobable common origin from fortuitous similaritiesarising from the physics and chemistry of proteins.Evolutionary relationships are described on twoSCOP levels, family and superfamily. The superfam-ily level clusters families with structural and generalfunctional similarity but without notable sequencesimilarity. Structural comparisons on this level canreveal subtle superfamily-specific features inheritedfrom a common ancestor.3 These superfamily-specificfeatures may be active site details or unusual struc-tural details which can be used to construct sequencetemplates. These features may include the catalyticmechanism or other functional features which maynot be expressed simply in sequence terms but maynevertheless be used in prediction. These featuresallow identification of other sequence families ofunknown structure that may belong to a givensuperfamily. Preliminary studies showed that ‘‘true’’

Contract grant sponsor: MRC Senior Fellowship.*Correspondence to: Alexey G. Murzin, Centre for Protein

Engineering, MRC Centre, Hills Road, Cambridge CB2 2QH,United Kingdom.

Received 30 June 1997; Accepted 28 August 1997

PROTEINS: Structure, Function, and Genetics, Suppl. 1:105–112 (1997)

r 1998 WILEY-LISS, INC.

Page 2: Distant homology recognition using structural classification of proteins

superfamilies, containing two or more different fami-lies, can be extended to a number of sequencefamilies. This extension can be achieved by indi-vidual analysis of all available data on a givensuperfamily, including not only sequences and struc-tures but also any functional and other relevantinformation.3 We are planning a systematic analysisof SCOP superfamilies in order to predict proteinfolds for many other sequence families of unknownstructure. The CASP2 experiment provided an excel-lent opportunity to test our approach. From ananalysis of the SCOP database and its growth, it wasexpected that most of the fold recognition targets(those to have known folds) would be the distanthomology recognition targets (those belonging toexisting superfamilies). We expected few of the tar-gets to possess previously observed folds belongingto completely new superfamilies. Our analysis sug-gested that the number of distant homology recogni-tion targets would be about half the number of alltargets submitted for fold recognition and ab initiocategories.

GOALS

Our goal was to predict, for each chosen target, asingle superfamily that it may belong to. In order tocover most of the potential distant homology recogni-tion targets, we submitted predictions for two-thirdsof the fold recognition / ab initio targets, in spite ofour expectation that only about one-half of thetargets would belong to existing SCOP superfami-lies. We were likely to make a few incorrect predic-tions, assigning targets into existing superfamiliesthat would be later classified as new superfamilies,but we did not worry about mispredictions of thiskind. For the distant homology recognition targets,we proposed the following criteria of evaluating ourprediction success. First, we would see whether thesuperfamily has been predicted correctly or not. Ifyes, then we would compare the real structure withthe representative structure(s) of the superfamily. Itwould tell us how close to the real structure we couldbe, provided our alignment was absolutely correct.In the alignment itself, we would see whether thelocal superfamily-specific features are aligned cor-rectly or not. However, we did not care about theaccuracy of the global alignment, as no attempt wasmade to refine our manual alignments beyond theselocal features.

METHODS

Our approach of extending known superfamilieswill eventually predict folds for many protein se-quences. However, given the limited time of theCASP2 experiment, it seemed unlikely that a signifi-cant fraction of the target sequences would be as-signed into the existing SCOP superfamilies bysystematic individual analysis of known structures.So we chose to direct our efforts at analyzing as small

a number of SCOP superfamilies that seemed mostlikely to contain the target sequence. To narrow thelist of SCOP superfamilies relevant for further analy-sis, we used the following approaches: 1) to analyzethe sequence of the target and 2) to look at the knownfunctional and experimental information for thetarget. Once potential SCOP superfamilies weredecided we carried out a structural analysis of thesuperfamily to find the superfamily-specific featuresand then compared these to the target sequence tosee if it matched the superfamily.

Sequence Methods

Sequence analysis was carried out for each of thetargets predicted. Multiple alignments were con-structed and secondary structure predictions made.Sequence analysis is an essential step which canallow the elucidation of domain structure. Multipledomain proteins are well known to confound predic-tion methods. To create multiple alignments we usedBLASTP4 to gather close relatives. Then the HMMersoftware by Sean Eddy was used to build HiddenMarkov models5 from sequence alignments to detectfurther members of the sequence family. Searcheswere carried out in an iterative fashion, with athreshold of 25 bits generally considered significant.Alignments of sequence families were constructedusing either CLUSTAL W6 or the HMMer package.5

Searches were generally carried out on SWISS-PROT release 33 and the nonredundant OWL re-lease 28 databases.7,8

Dot plots were made for each target sequenceusing a dotter by Erik Sonnhammer; this step de-tects internal duplications within the target, whichwas found in targets T0002 and T0016.

Secondary structure predictions were performedon the multiple alignments using the PHD server.9

These data allowed us to narrow down the choiceof potential SCOP superfamily using the length ofthe structure and the predicted secondary structureas guides.

Functional Information

For each target and its homologues we performedliterature searches to determine what experimentalwork had been carried out. These data were thenused to see if each SCOP superfamily was compatiblewith the known data. Some SCOP superfamilieshave very specific functions and in these cases itseems very unlikely that the fold will be co-opted toperform other functions.

Superfamily Analysis

Once a list of potential SCOP superfamilies hasbeen drawn up (typically, fewer than four) a struc-tural analysis is performed to find superfamily-specific features. These features are then checkedagainst the target sequence for compatibility. Particu-

106 A.G. MURZIN AND A. BATEMAN

Page 3: Distant homology recognition using structural classification of proteins

lar details of our predictions are individual for eachsuperfamily and will be discussed below.

Submissions

We submitted predictions for 12 targets (Table I)in the fold recognition / ab initio categories. The twocategories have different submission formats—thefold recognition (FR) and ab initio (AB) formats. Asour approach is neither of these two categories, wesubmitted our predictions in both formats, wherepossible (Table I). Three-dimensional models weregenerated by Modeller10 from the predicted align-ments with template structures. The template coor-dinates were taken from PDB11 and some of themwere edited before modelling to accommodate inser-tions/deletions or include the symmetry-related struc-ture. For one target, T0032, where we were sure thatit would be a new fold, the prediction was submittedin the FR format only. For T0042, our prediction of anovel superfamily with a known fold was submittedin the AB format only. For two other targets, T0010and T0031, the FR format was not used for technicalreasons. T0010 was modelled on the structure of aproaerolysin domain, whose coordinates were onhold in PDB at the time (entry 1PRE). In T0031,predicted as the glutamate-specific serine proteaseand modelled on the PDB entry 1HPG, the align-

ment for the substrate-binding region did not complywith the FR format due to the nonsequential residuenumbering of the trypsin-like serine proteinase inPDB. In two of our other predictions, there areman-made errors not detected at the time of theirsubmission. In T0002, a six-residue insertion hasbeen inadvertently introduced during conversion ofthe ‘‘correct’’ fold recognition alignment into the abinitio model, causing a frameshift in the C-terminalsubdomain. Conversely, in T0004 the ab initio modelis based on the ‘‘correct’’ alignment, but the foldrecognition alignment was corrupted during manualfilling of the submission form.

PREDICTIONS AND COMPARISONWITH THE EXPERIMENTAL STRUCTURES

Experimental structures have been determinedfor 11 of our 12 predictions (Table I) including T0026solved most recently (see noted added in proof). Sixof the 11 structures can be classified into existingSCOP superfamilies and four have distinctively newfolds. One structure, T0042, can arguably be classi-fied into a new superfamily with a known fold;despite this similarity, it was not detected by auto-mated methods. We will comment on this classifica-tion at the end of this section.

TABLE I. Complete List of Our Predictions

TargetPrediction

format SCOP fold/SuperfamilyTemplate

usedRMSd/overlap

T0002 (CtD) AB212†

FR220Ferredoxin-like/regulatory domain in S&T biosynthesis 1 PSD* 2.2 Å/All

T0004 AB108FR213†

OB-fold/Ancient NA-binding domain 1 MJC 1.8 Å/All

T0005 AB5FR21

New 1 GBG

T0010 AB216 New 1 PRE*T0014 AB217

FR173TIM barrel/Aldolase 1 PII§ 2.2 Å/All

T0016 AB28FR24

New 1 HBG

T0026 AB377FR381

1 BIA* 2.1Å/All

T0031 AB552 Trypsin-like serine protease 1 HPG 1.9 Å/AllT0032 FR495 New noneT0038 AB587

FR585Galactose binding domain-like 1 DLC 2.6 Å/All

T0042 AB970 Calbindin-like (?)/New 4 ICB 3.5 Å/AllPredicted but not solved.T0023 AB586

FR5901 NAL* Unknown

Predictions for the attempted targets have been deposited to the Protein Structure Prediction Centre and are available on the internetat URL http://PredictionCenter.llnl.gov/. ABxxx codes for the ab initio prediction format and FRxxx for the fold recognition format. Theitalicized predictions are the best ones for the corresponding targets, according the independent assessment of fold recognition andalignment accuracy.33 Putative SCOP folds and superfamilies for the target experimental structures are given. Template structuresfor the ab initio models are denoted by their PDB codes.11 The accuracy of distant homology recognition is given in the last column andis measured by the root mean square deviation (RMSd) in the common core of the superposed target structure and template structureand the proportion of the target fold predicted. Asterisks (*) denote those templates which PDB coordinates were modified beforemodelling, § denotes the template taken from a different superfamily of the same fold, and † denote the two corrupted predictions.

107FOLD RECOGNITION USING SCOP

Page 4: Distant homology recognition using structural classification of proteins

For one of the new fold targets, T0032 (β-crypto-gein12), we correctly predicted it having a new fold.Literature searches revealed that the secondarystructure, disulfide bonds and some other long-rangecontacts, were determined for a close homologue ofthe target sequence.13 No fold in SCOP was compat-ible with these data, so we were very confident in our‘‘negative’’ prediction. In the other three new foldtargets, T0005, T0010, and T0016, we could not ruleout the possibility of distant relationship to knownstructures; therefore, with a different degree ofconfidence, we assigned a single SCOP superfamilyfor each of these targets. These ‘‘positive’’ predictionswere incorrect and will not be discussed here.

In contrast, each of our single predictions for thedistant homology targets has been a success. Herewe describe how five of these predictions have beenmade and evaluate these five predictions accordingto our criteria in the order of increasing difficulty ofprediction. At the end we also describe a relativelysuccessful prediction of a new superfamily fold.

Target T0031: The Exfoliative Toxin AFrom Staphylococcus aureus14

This target was the easiest prediction. The exfolia-tive toxin A is apparently homologous to the gluta-mate-specific serine proteases from the same speciesand some other bacteria. This family was predictedin literature to belong to the trypsin-like serineproteases on the basis of three conserved regionswith the invariant histidine, aspartate, and serineresidues forming the putative catalytic triad.15 Thefirst structure of a glutamate-specific serine proteasefrom Streptomyces griseus (Glu-GSP) provided addi-tional support to this prediction by revealing thatthe substrate-recognition residues are also con-served.15 Having discarded other possibilities, weagreed with this prediction and modelled the targetsequence into the Glu-GSP structure. The superfam-ily was predicted correctly and the template struc-ture was very close, although not the closest to theexperimental structure. Superposition of our ab ini-tio model with the experimental structure showsthat both catalytic and substrate-binding parts ofthe putative active site are aligned correctly, butsome distant regions are frameshifted.As this predic-tion was not converted into the FR format (wecouldn’t do it without omitting the substrate-bindingregion), it was not assessed against the other CASP2predictions for this target.

Target T0004: S1 Domain ofPolyribonucleotide Nucleotidyltransferase16

This was the second-easiest target. A BLASTPsearch against all the SCOP sequences detected aweak homology to the functionally related cold-shockprotein.17 This small protein shares the commonOB-fold with the growing number of different pro-teins, most of which bind either oligonucleotides or

oligosacharides. Our comparative analysis of theOB-folds revealed three conserved structural fea-tures: a small residue, a β-bulge, and a turn. Allthree features together occur in many OB-folds withcommon nucleic acid-binding function and thereforedetermine a superfamily.16 The sequence signaturesof all three features were seen in the BLASTPalignment of the target and cold-shock protein se-quences and well conserved in near and distanthomologues of the target sequence. Secondary struc-ture prediction was fully consistent with the OB-fold. We aligned the target sequence with four differ-ent members of the OB-fold superfamily to reflectpossible structural variations. The ab initio modelwas built into the structure of cold-shock protein,which turned out to be the closest to the experimen-tal structure. All three determinants were alignedcorrectly, but two of the five OB-fold β-strands wereslightly frameshifted.

Target T0014: 3-Dehydroquinase18

T0014 was a target of moderate difficulty. Thereare quite a few sequence homologues of this enzyme.The conserved residues, suggested to be catalyticallyessential, were tried by site-directed mutagenesis19

and the putative catalytic mechanism was proposed.The catalyzed reaction proceeds via an intermediate,in which Shiff base is formed between the substratemolecule and the invariant Lys 170. In the predicteda/β secondary structure, the catalytically essentialresidues are located at the ends of β-strands. Analy-sis of the enzyme superfamilies acting through Shiffbase-forming lysine residue suggested the aldolasesuperfamily as the most likely candidate. The aldol-ase superfamily shares the common TIM-barrel fold.In all known aldolases except transaldolase,20 thecatalytic lysine comes from the sixth β-strand, sug-gesting the same location for the invariant lysine inthe target sequence. To fit the predicted secondarystructure in the TIM-barrel fold, we replaced onepredicted helix with a strand and one predictedstrand with a helix. As the target sequence wassignificantly shorter than any member of the aldol-ase superfamily in SCOP, we used as a template thestructure from a different superfamily with thesmallest and roundest TIM-barrel domain in thedatabase, phosphoribosylanthranilate isomerase21

(1PII). The position of catalytic lysine in the sixthstrand was inferred from the structure of N-acetyl-neuraminate lyase22 (1NAL), a member of the aldol-ase superfamily, by its superposition with the 1PIItemplate. In manual alignment of the remainingβ-strands and a-helices, we followed some simplerules deduced from an analysis of all TIM-barrelstructures: 1) at the N-end of the barrel, connectionsbetween a-helices and β-strand should be short, 2)the putative catalytic residues should be at theC-end of the barrel, 3) hydrophilic residues areallowed in the barrel interior but not between the

108 A.G. MURZIN AND A. BATEMAN

Page 5: Distant homology recognition using structural classification of proteins

barrel and the helices. This prediction was correct. Itturned out that 1NAL was the closest to the experi-mental structure and our template, 1PII, was runner-up. Six of the eight strands, including that withcatalytic lysine, were aligned correctly. Other second-ary structures were frameshifted, with the frame-shifts being greater at the ends of protein chain.

Target T0038: The N-terminalCellulose-Binding Domain of CenC23

T0038 was not a very difficult target but therewere a few traps to be avoided. Each of the threeSCOP families that we selected for further analysiscontained carbohydrate-binding function and weresimilar to the predicted all-β secondary structure.Moreover, the folds of these superfamilies are some-what similar—all of them are β-sandwiches withsimilar jelly-roll topology but different numbers ofβ-strands (Fig. 1). If any of the three superfamilies isthe correct prediction, the other predictions wouldautomatically have some folding similarity. Lookingfor specific structural details in each of the superfami-lies, we discovered that two of them contain aconserved β-bulge in the middle of the C-terminalβ-strand. The probable role of this bulge is bending ofthe β-sheet that contains this bulge (the rear sheet atFig. 1) in order to keep the curvature of the frontsheet containing the carbohydrate-binding site. Thesecondary structure predictions from multiple align-ments of the target sequence with its close sequencehomologues as well as with probable distant rela-tives strongly suggested a β-bulge in the last strand.This disallowed the third superfamily, despite ithaving very same function, binding of cellulose. Thefinal choice between the two remaining superfami-lies was made according to the target chain lengthand the number of predicted strands. The targetsequence was aligned manually against the struc-tural alignment of two different members of thepredicted superfamily, and the member with similarsize loops was used to build the ab initio model. Theexperimental structure is indeed very close to ourtemplate structure of the C-terminal domain ofd-endotoxin24 and it does contain the predicted bulge.All of the strands were predicted correctly but half ofthem were aligned with frameshifts.

Target T0002: Threonine Deaminase25

This was arguably the most complex target in ourlist. Its prediction was done in several steps. Ini-tially, we predicted the border between the N-terminal part, homologous to the β-subunit of trypto-phan synthase26 and suggested for the comparativemodelling, and the C-terminal part submitted for thefold recognition/ab initio prediction. The first stepwas not straightforward, as the significant similar-ity to tryptophan synthase sequence does not extendto the end of this sequence. It was noted previouslythat the structure of the β-subunit of tryptophan

synthase contains probable duplication and consistsof two (sub)domains related by a pseudodyad.26 Itsuggested to us that the symmetry-related partswould be conserved in both structures. A timely helpwas the publication of an abstract27 reporting thestructure of O-acetylserine sulfhydrylase homolo-gous to the β-subunit of tryptophan synthase with anexplicit reference to a few large insertions in thetryptophan synthase chain. An alignment of thethree enzyme sequences, generated by CLUSTAL W,was manually edited to take into account the struc-tural duplication and the insertions in tryptophansynthase. This alignment confidently predicted theC-end of the tryptophan synthase-like domain in thetarget sequence. It also turned out to be very accu-rate through all its length when compared with theexperimental structure.

Sequence analysis of the C-terminal domain de-tected two tandem repeats of an 80-residue motif.Most known threonine deaminases contain the sameduplication but there are a few shorter sequenceswith a single motif (Fig. 2). We concluded that eachmotif will have a compact fold and that the twomotifs will probably be associated together in asymmetrical fashion. The predicted βaββaβ second-ary structure suggested it can be the ferredoxin-likefold which contains more different superfamiliesthan any other fold in SCOP. In several of thesesuperfamilies, the ferredoxin-like folds form dimers.Inspection of functions of these superfamilies sug-gested a tenuous link between the target domain andthe regulatory domain of phosphoglycerate dehydro-genase.28 Both domains are C-terminally located inthe enzymes involved in biosynthesis of the hydroxyl-containing amino acids, serine and threonine, andthey are known or presumed to play a regulatoryrole. We followed this link and found that thesequence and the predicted secondary structure ofeach of the two motifs in the target domains fitextremely well in the structure of the regulatorydomain of phosphoglycerate dehydrogenase (Fig. 2).In the alignment created by our manual threading,the residues in the dimer interface of phosphoglycer-ate dehydrogenase appear to coincide with the mostconserved residues in the threonine deaminase mo-tifs. This suggested that the two motifs will associatein the same way and allowed the prediction of thewhole two-motif structure of the C-terminal domain,modelled on the dimer structure of the phosphoglyc-erate dehydrogenase domain. Comparison with theexperimental structure confirmed that all details ofour prediction—domain organization, internal dupli-cation, the fold and association of individual motifs—are essentially correct. In the C-terminal motif,there is a structural disorder in the region corre-sponding to one of the helices in the N-terminal motifand predicted to be helical (Fig. 2). In the targetsequence in this region, there is a nonconservedproline residue that may be a cause of this disorder.

109FOLD RECOGNITION USING SCOP

Page 6: Distant homology recognition using structural classification of proteins

Fig. 1. The prediction of the cellulose binding domain ofCellulomonas fimi (T0038) as belonging to the galactosebinding domain-like SCOP superfamily. The structure of theC-terminal domain of d-endotoxin (1DLC) was used as thetemplate structure for modelling. The accuracy of predictedalignment is indicated by color code. The correctly alignedstrands are dark blue, minor frameshifts (two residues) arelight blue, larger frameshifts are green (four residues) andyellow (five1) Two other superfamilies which were consideredbut not chosen, with similar topology, are represented by1EXG and 1BYH. In these, β-strands that can be equivalencedto those in the target structure are colored green, nonequiva-lent strands are white. The alignment of the C-terminal β-strand containing the superfamily-specific β-bulge (orangeballs in the structures) is shown for the template sequencealigned to members of the target superfamily. e denotesβ-strand residues and ` denotes the β-bulge residue. Con-served residues are shown in bold red type.

Fig. 2. The prediction of the C-terminal domain of threonine deaminaseof E. coli (T0002) as belonging to thesame superfamily as the regulatory do-main of phosphoglycerate dehydroge-nase (1PSD) of the ferredoxin-like fold.The C-terminal domain was predicted tocontain an internal duplication, the twohalves were predicted to be ferredoxin-like folds. The target structure is com-pared with the dimer structure of 1PSDdomain, used as the template structurefor modelling and to illustrate the relativeorientation predicted for the two subdo-mains. The accuracy of predicted align-ment is indicated by color code as inFigure 1. The alignment of the templatestructure’s sequence is shown with thetarget family repeats. e denotes β-strand, H denotes a-helix and V de-notes the β-strand that is close to thedyad axis. THD1_BACSU contains onlya single ferredoxin-like domain. Con-served residues are shown in bold redtype.

110 A.G. MURZIN AND A. BATEMAN

Page 7: Distant homology recognition using structural classification of proteins

Our alignment of both motifs with the templatesequence is correct for all the secondary structuresapart from the last strand, where there is a minor,two-residue frameshift.

Target T0042: NK-lysin29—New SuperfamilyWith a Familiar Fold?

Our approach predicted no SCOP superfamily thatthis target is likely to belong to. It did not necessarilymean that it would be a new fold. This small targetwas predicted to consist of four a-helices and noβ-sheets. This target provided a good opportunity toresurrect an original prediction method speciallydesigned for small globular all-a proteins.30 In thismethod (AGM & O. Galzytskaya, unpublished), themost favorable fold for a given sequence is searchedin a library of ideal theoretical folds.31 For a givensequence, all theoretical folds are ranked by theenergy-like function dependant on the helix-to-helixcontacts, the positions of the helix caps, and thelengths of interhelical connections. Using very basicideas of this approach, we predicted for the targetsequence one of the ideal four-helical folds. This foldhappened to approximate the real fold of calbindin,32

which we used as a template for modelling the targetstructure. It should be noted here that the real foldsof nonhomologous proteins may deviate differentlyfrom the common ideal fold, and therefore they canexpectedly be more deviant than the folds of dis-tantly homologous proteins.

In the experimental structure, there are five ratherthan four helices, with one of the predicted helicesbeing broken in two parts. Nevertheless, the overalltopologies of NK-lysin and calbindin are similarindeed. Coincidentally, calbindin also contains ashort fifth helix in the middle of the chain, which weignored in our prediction, and which can be equiva-lenced with the ‘‘extra’’ helix of NK-lysin. Superposi-tion of the target and template structures gives anr.m.s. deviation of 3.5 Å for 47 Ca pairs in all fivehelices (more than 60% of each structure). Classifica-tion aside, if aligned correctly, the calbindin struc-ture would make a reasonable model of the targetstructure. With one helix misaligned completely andthe other helices frameshifted our ab initio modelgives an r.m.s deviation for all Ca atoms as high as6.3 Å. It is a poor result compared with our distanthomology predictions, but it is a good one comparedwith ab initio predictions of the same target by othergroups. Regretfully, this prediction was assessedneither in ab initio nor in fold recognition categories.

CONCLUSION

We have shown that the distant homology toknown structure, where it exists, can be recognizedsuccessfully. We have done it for five of the fivedistant homology targets we attempted. For each ofthe five targets, we correctly predicted the homolo-gous protein with a very similar structure and, in a

few cases, it was the most similar structure. Also, foreach target we correctly predicted local alignmentsof the sequence features that we found to be charac-teristic for the protein superfamily containing thegiven target. Our global alignments, extended manu-ally from these local alignments to comply with thesubmission formats, also appeared to be ratheraccurate.

Our approach relies on human knowledge andexpertise; that is, it is essentially an educated guess.It suits well our ultimate goal to link most of thesequence families to the structural superfamiliesand make them the subject of comparative modellingmethods rather than fold recognition methods. Webelieve that this goal will be achieved in our lifetimeby the development of better theories of proteinstructure and evolution rather than by the improve-ment of existing automated methods and algorithmsbased on a limited understanding of these complexphenomena.

NOTE ADDED IN PROOF

After submission of this paper, the determinationof solution structure for target T0026 (the N-terminaldomain of arginine repressor) by Sunnerhagen etal.34; PDB entry 1AOY, has finally confirmed ithaving the predicted ‘‘winged helix’’ DNA-bindingfold. Abstract and Table I reflect this fact. T0026 isproven to be an easy target with many other groupshaving made similar predictions. For this particulartarget we have made considerable effort to refine theglobal alignment that has turned out to be correct forall of the three helices and one of the two strands ofthe b-hairpin ‘‘wing’’ and to have a one-residueframeshift in the other strand. The experimentalstructure differs from our ‘ab initio’ model mainly inthe space position of the tip of the hairpin. If sevenresidues in the tip are omitted from comparison withthe model, the backbone atoms of the remaining 56well-ordered residues in the solution structure givean r.m.s. deviation of only 2.3 Å.

ACKNOWLEDGMENTS

This work has been supported by MRC SeniorFellowship to AGM.

REFERENCES

1. Chothia, C. One thousand families for the molecularbiologist. Nature 357:543–544, 1992.

2. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.SCOP: A structural classification of proteins database forthe investigation of sequences and structures. J. Mol. Biol.247:536–540, 1995.

3. Brannigan, J.A., Dodson, G., Duggleby, H.J., Moody, P.C.E.,Smith, J.L., Tomchick, D.R., Murzin, A.G. A protein cata-lytic framework with an N-terminal nucleophile is capableof self-activation. Nature 378:416–419, 1995.

4. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman,D.J. Basic local alignment search tool. J. Mol. Biol. 215:403–410, 1990.

5. Eddy, S.R. Hidden Markov models. Curr. Opin. Struct.Biol. 6:361–365, 1996.

111FOLD RECOGNITION USING SCOP

Page 8: Distant homology recognition using structural classification of proteins

6. Thompson, J.D., Higgins, D.G., Gibson, T.J. CLUSTAL W:Improving the sensitivity of progressive multiple sequencealignment through sequence weighting, position-specificgap penalties and weight matrix choice. Nucleic Acids Res.22:4673–4680, 1994.

7. Bairoch, A., Boeckmann, B. The SWISS-PROT proteinsequence data bank. Nucleic Acids Res. 19:2247–2249,1991.

8. Bleasby, A.J., Wootton, J.C. Construction of validated,non-redundant composite protein sequence databases. Pro-tein Eng. 3:153–159, 1990.

9. Rost, B., Sander, C. Combining evolutionary informationand neural networks to predict protein secondary struc-ture. Proteins 19:55–72, 1994.

10. Sali, A., Blundell, T.A. Comparative protein modelling bysatisfaction of spatial restraints. J. Mol. Biol. 234:779–815,1993.

11. Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer, E.F.,Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T.,Tasumi, M. The protein data bank: A computer-basedarchival file for macro-molecular structure. J. Mol. Biol.112:535–542, 1977.

12. Boissy, G., de la Fortelle, E., Kahn, R., Huet, J.C., Bricogne,G., Pernollet, J.C., Brunie, S. Crystal structure of a fungalelicitor secreted by Phytophtora cryptogea, a member of anovel class of plant necrotic proteins. Structure 4:1429–1439, 1996.

13. Bouaziz, S., van Heijenoort, C., Huet, J.C., Pernollet, J.C.,Guittet, E. 1H and 15N resonance assignment and second-ary structure of capsicein, an a-elicitin, determined bythree-dimensional heteronuclear NMR. Biochemistry 33:8188–8197, 1994.

14. Vath, G.M., Earhart, C.A., Rago, J.V., Kim, M.H., Bohach,G.A., Schlievert, P.M., Ohlendorf, D.H. The structure of thesuperantigen exfoliative toxin A suggests a novel regula-tion as a serine protease. Biochemistry 36:1559–1566,1997.

15. Nienaber, V.L., Breddam, K., Birtoft, J.J. A glutamic acidspecific serine protease utilizes a novel histidine triad insubstrate binding. Biochemistry 32:11469–11475, 1993.

16. Bycroft, M., Hubbard, T.J.P., Proctor, M., Freund, S.M.V.,Murzin, A.G. The solution structure of the S1 RNA bindingdomain: A member of an ancient nucleic acid-binding fold.Cell 88:235–242, 1997.

17. Schindelin, H., Jiang, W., Inouye, M., Heinemann, U.Crystal structure of CspA, the major cold shock protein ofEscherichia coli. Proc. Natl. Acad. Sci. USA 91:5119–5123,1994.

18. Shrive, A.K., Polikarpov, I., Sawyer, L. Manuscript inpreparation.

19. Leech, A.P., James, R., Coggins, J.R., Kleanthous, C.Mutagenesis of active site residues in type I dehydro-quinase from Escherishia coli. Stalled catalysis in a histi-

dine to alanine mutant. J. Biol. Chem. 270:25827–25836,1995.

20. Jia, J., Huang, W., Schvrken, U., Sahm, H., Sprenger, G.A.,Lindqvist, Y., Schneider, G. Crystal structure of transaldol-ase B from Escherichia coli suggests a circular permuta-tion of the a/β barrel within the class I aldolase family.Structure 4:715–724, 1996.

21. Wilmanns, M., Preistle, J.P., Niermann, T., Jansonius, J.N.Three-dimensional structure of the bifunctional enzymephophoribosylanthrinilate isomerase: Indoleglycerolphos-phate synthase from Escherichia coli refined at 2.0 Åresolution. J. Mol. Biol. 223:477–507, 1992.

22. Izard, T., Lawrence, M.C., Malby, R.L., Lilley, G.G., Col-man, P.M. The three-dimensional structure of N-acetylneu-raminate lyase from Escherichia coli. Structure 2:361–369,1994.

23. Johnson, P.E., Joshi, M.D., Tomme, P., Kilburn, D.G.,McIntosh, L.P. Structure of the N-terminal cellulose-binding domain of Cellulomonas fimi CenC determined bynuclear magnetic resonance spectroscopy. Biochemistry35:14381–14394, 1996.

24. Li, J.D., Carroll, J., Ellar, D.J. Crystal structure of insecti-cidal d-endotoxin from Bacillus thuringiensis at 2.5 Åresolution. Nature 353:815–821, 1991.

25. Gallagher, T. Manuscript in preparation.26. Hyde, C.C., Ahmed, S.A., Padlan, E.A., Miles, E.W., Davies,

D.R. Three-dimesional structure of the tryptophan syn-thase a2β2 multienzyme complex from Salmonella typhi-murium. J. Biol. Chem. 263:17857–17871, 1988.

27. Burkhard, P., Hohenester, E., Rao, G.S.J., Cook, P.F.,Jansonius, J.N. Three-dimensional structure of O-acetylser-ine sulfhydrylase from Salmonella typhimurium. ActaCryst. A52:C125–C126, 1996.

28. Schuller, D.J., Grant, G.A., Banaszak, L.J. The allostericligand site in the Vmax-type cooperative enzyme phospho-glycerate dehydrogenase. Nat. Struct. Biol. 2:69–76, 1995.

29. Liepinsh, E., Andersson, M., Ruysschaert, J.-M., Otting, G.Saposin fold revealed by the NMR structure of NK-lysin.Nat. Struct. Biol. 4:793–795, 1997.

30. Ptitsyn, O.B., Finkelstein, A.V., Murzin, A.G. Structuralmodel for interferons. FEBS Lett. 186:143–148, 1985.

31. Murzin, A.G., Finkelstein, A.V. General architecture of thea-helical globule. J. Mol. Biol. 204:749–769, 1988.

32. Szebenyi, D.M., Obendorf, S.K., Moffat, K. Structure ofvitamin D-dependant calcium protein from bovine intes-tine. Nature 294:327–332, 1981.

33. Levitt, M. Competitive assessment of protein fold recogni-tion and alignment accuracy. Proteins Suppl. 1:92–104,1997.

34. Sunnerhagen, M., Nigels, M., Otting, G., Carey, J. Solutionstructure of the DNA-binding domain and model for thecomplex of multifunctional hexameric arginine repressorwith DNA. Nat. Struct. Biol. 4:819–825, 1997.

112 A.G. MURZIN AND A. BATEMAN