5
Proc. Natl. Acad. Sci. USA Vol. 92, pp. 11921-11925, December 1995 Evolution Sequence similarity analysis of Escherichia coli proteins: Functional and evolutionary implications (protein sequence similarity/Escherichia coli genome/paralogous protein clusters/ancient conserved regions) EUGENE V. KOoNIN*, RoMAN L. TATUSOV, AND KENNETH E. RUDD National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894 Communicated by Dale Kaiser, Stanford University Medical Center, Stanford, CA, September 11, 1995 ABSTRACT A computer analysis of 2328 protein se- quences comprising about 60% of the Escherichia coli gene products was performed using methods for database screen- ing with individual sequences and alignment blocks. A high fraction of E. coli proteins-86%c-shows significant sequence similarity to other proteins in current databases; about 70% show conservation at least at the level of distantly related bacteria, and about 40% contain ancient conserved regions (ACRs) shared with eukaryotic or Archaeal proteins. For >90% of the E. coli proteins, either functional information or sequence similarity, or both, are available. Forty-six percent of the E. coli proteins belong to 299 clusters of paralogs (intraspecies homologs) defined on the basis of pairwise similarity. Another 10% could be included in 70 superclusters using motif detection methods. The majority of the clusters contain only two to four members. In contrast, nearly 25% of all E. coli proteins belong to the four largest superclusters- namely, permeases, ATPases and GTPases with the conserved "Walker-type" motif, helix-turn-helix regulatory proteins, and NAD(FAD)-binding proteins. We conclude that bacterial protein sequences generally are highly conserved in evolution, with about 50% of all ACR-containing protein families rep- resented among the E. coli gene products. With the current sequence databases and methods of their screening, computer analysis yields useful information on the functions and evo- lutionary relationships of the vast majority of genes in a bacterial genome. Sequence similarity with E. coli proteins allows the prediction of functions for a number of important eukaryotic genes, including several whose products are im- plicated in human diseases. Genome sequencing has come of age. Hundreds of viral genome sequences and about 20 organellar sequences are currently available. The genomes of several bacteria and yeast are expected to be completed within 2-3 years (1). The ultimate value of genome projects is to use the sequence to deduce how the genome dictates all cellular functions. Sharply focused biochemical and genetic experimentation will be required in order to achieve the complete understanding of a microbial physiology. A necessary and complementary ap- proach is the computer analysis of the protein sequences encoded in the genome in order to predict protein functions and derive possible evolutionary relationships. No other cellular organism has been studied in such detail as Escherichia coli (2). It is estimated that >80% of the metabolic pathways in E. coli are known, and functions have been ascribed to more than one-half of the projected total number of 4000 genes that are distributed along the 4.7-Mb, circular E. coli chromosome (2, 3). The E. coli genome sequencing project has been underway since 1991 (4), and, as of August 1994, 60% of the E. coli chromosome has been sequenced (5). We used the 2328 sequences of E. coli proteins and predicted gene The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact. products encoded in the sequenced portions of the chromo- some for a pilot project on genome scale protein sequence similarity analysis. We consider three interconnected principal aspects of this analysis. (i) Functionalprediction. A major challenge of genome scale protein sequence analysis is the efficient and accurate detec- tion of all similarities that allow functional prediction. Careful examination of weak similarities using various computer meth- ods has revealed new relationships matching genes wvith known biochemical functions or mutational phenotypes to uncharac- terized genes. However, preyious reports analyzed only a limited number of protein families (6, 7). (ii) Sequence conservation. Our ability to detect sequence conservation is crucial for making functional predictions and evolutionary inferences. When large data sets have been examined, sequence similarity was usually detected for -50% of the proteins (1). Recent studies using up-to-date computer methods have brought this fraction up to 70% for yeast proteins (P. Bork, personal communication) and 75% for Mycoplasma capricolum proteins (8). Since Mycoplasmas have the smallest genomes among bacteria (9), one could expect that they maintain the basic set of housekeeping genes, and, accordingly, the reported fraction of conserved proteins may be close to the upper limit for all organisms. Therefore this may be a useful reference point to compare the results obtained with a genome scale protein set. Perhaps an even more interesting question is how many proteins, and which ones, contain ancient conserved regions (ACRs)-sequences that are conserved in homologous pro- teins from bacteria and eukaryotes or Archaea (7, 10). This question is all the more pertinent as recent estimates have strongly suggested that the number of ACRs is likely to be quite modest-on the order of 1000 (7, 10, 11). (iii) Relationships among the genes and proteins within the E. coli genome. The existence of paralogs [intraspecies homolo- gous genes (12)] in E. coli has been recognized for some time; these genes should have evolved by duplication followed by diversification (13, 14). Now that the sequence of a major part of the E. coli genome is available, construction of a represen- tative set of the paralogous clusters has become feasible. With the selection for compactness that is thought to be a governing principle of the bacterial genome evolution (2), only those diverged gene duplicates that confer a selective advantage to the cell could persist. Therefore one might expect that the relationship between the number of paralogs and the function of the proteins belonging to clusters should reveal important aspects of the bacterial cell physiology and evolution. Each of these problems has been tackled previously from different perspectives and with different, relatively small data sets. Here we address them all systematically at a genome scale. It seems likely that conclusions drawn from the analysis of 60% Abbreviations: ACR, ancient conserved region; HTH, helix-turn- helix; SAM, S-adenosylmethionine. *To whom reprint requests should be addressed. 11921 Downloaded by guest on February 1, 2021

Sequencesimilarity analysis of Escherichia proteins ... · Proc. Natl. Acad. Sci. USA Vol. 92, pp. 11921-11925, December 1995 Evolution Sequencesimilarity analysis ofEscherichiacoli

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Sequencesimilarity analysis of Escherichia proteins ... · Proc. Natl. Acad. Sci. USA Vol. 92, pp. 11921-11925, December 1995 Evolution Sequencesimilarity analysis ofEscherichiacoli

Proc. Natl. Acad. Sci. USAVol. 92, pp. 11921-11925, December 1995Evolution

Sequence similarity analysis of Escherichia coli proteins:Functional and evolutionary implications

(protein sequence similarity/Escherichia coli genome/paralogous protein clusters/ancient conserved regions)

EUGENE V. KOoNIN*, RoMAN L. TATUSOV, AND KENNETH E. RUDDNational Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894

Communicated by Dale Kaiser, Stanford University Medical Center, Stanford, CA, September 11, 1995

ABSTRACT A computer analysis of 2328 protein se-quences comprising about 60% of the Escherichia coli geneproducts was performed using methods for database screen-ing with individual sequences and alignment blocks. A highfraction ofE. coli proteins-86%c-shows significant sequencesimilarity to other proteins in current databases; about 70%show conservation at least at the level of distantly relatedbacteria, and about 40% contain ancient conserved regions(ACRs) shared with eukaryotic or Archaeal proteins. For>90% of the E. coli proteins, either functional information orsequence similarity, or both, are available. Forty-six percentof the E. coli proteins belong to 299 clusters of paralogs(intraspecies homologs) defined on the basis of pairwisesimilarity. Another 10% could be included in 70 superclustersusing motif detection methods. The majority of the clusterscontain only two to four members. In contrast, nearly 25% ofall E. coli proteins belong to the four largest superclusters-namely, permeases, ATPases and GTPases with the conserved"Walker-type" motif, helix-turn-helix regulatory proteins,and NAD(FAD)-binding proteins. We conclude that bacterialprotein sequences generally are highly conserved in evolution,with about 50% of all ACR-containing protein families rep-resented among the E. coli gene products. With the currentsequence databases and methods of their screening, computeranalysis yields useful information on the functions and evo-lutionary relationships of the vast majority of genes in abacterial genome. Sequence similarity with E. coli proteinsallows the prediction of functions for a number of importanteukaryotic genes, including several whose products are im-plicated in human diseases.

Genome sequencing has come of age. Hundreds of viralgenome sequences and about 20 organellar sequences arecurrently available. The genomes of several bacteria and yeastare expected to be completed within 2-3 years (1). Theultimate value of genome projects is to use the sequence todeduce how the genome dictates all cellular functions. Sharplyfocused biochemical and genetic experimentation will berequired in order to achieve the complete understanding of amicrobial physiology. A necessary and complementary ap-proach is the computer analysis of the protein sequencesencoded in the genome in order to predict protein functionsand derive possible evolutionary relationships.No other cellular organism has been studied in such detail as

Escherichia coli (2). It is estimated that >80% of the metabolicpathways in E. coli are known, and functions have beenascribed to more than one-half of the projected total numberof 4000 genes that are distributed along the 4.7-Mb, circular E.coli chromosome (2, 3). The E. coli genome sequencing projecthas been underway since 1991 (4), and, as of August 1994, 60%of the E. coli chromosome has been sequenced (5). We usedthe 2328 sequences of E. coli proteins and predicted gene

The publication costs of this article were defrayed in part by page chargepayment. This article must therefore be hereby marked "advertisement" inaccordance with 18 U.S.C. §1734 solely to indicate this fact.

products encoded in the sequenced portions of the chromo-some for a pilot project on genome scale protein sequencesimilarity analysis.We consider three interconnected principal aspects of this

analysis.(i) Functionalprediction. A major challenge of genome scale

protein sequence analysis is the efficient and accurate detec-tion of all similarities that allow functional prediction. Carefulexamination ofweak similarities using various computer meth-ods has revealed new relationships matching genes wvith knownbiochemical functions or mutational phenotypes to uncharac-terized genes. However, preyious reports analyzed only alimited number of protein families (6, 7).

(ii) Sequence conservation. Our ability to detect sequenceconservation is crucial for making functional predictions andevolutionary inferences. When large data sets have beenexamined, sequence similarity was usually detected for -50%of the proteins (1). Recent studies using up-to-date computermethods have brought this fraction up to 70% for yeastproteins (P. Bork, personal communication) and 75% forMycoplasma capricolum proteins (8). Since Mycoplasmas havethe smallest genomes among bacteria (9), one could expectthat they maintain the basic set of housekeeping genes, and,accordingly, the reported fraction of conserved proteins maybe close to the upper limit for all organisms. Therefore this maybe a useful reference point to compare the results obtainedwith a genome scale protein set.

Perhaps an even more interesting question is how manyproteins, and which ones, contain ancient conserved regions(ACRs)-sequences that are conserved in homologous pro-teins from bacteria and eukaryotes or Archaea (7, 10). Thisquestion is all the more pertinent as recent estimates havestrongly suggested that the number of ACRs is likely to bequite modest-on the order of 1000 (7, 10, 11).

(iii) Relationships among the genes and proteins within the E.coli genome. The existence of paralogs [intraspecies homolo-gous genes (12)] in E. coli has been recognized for some time;these genes should have evolved by duplication followed bydiversification (13, 14). Now that the sequence of a major partof the E. coli genome is available, construction of a represen-tative set of the paralogous clusters has become feasible. Withthe selection for compactness that is thought to be a governingprinciple of the bacterial genome evolution (2), only thosediverged gene duplicates that confer a selective advantage tothe cell could persist. Therefore one might expect that therelationship between the number of paralogs and the functionof the proteins belonging to clusters should reveal importantaspects of the bacterial cell physiology and evolution.Each of these problems has been tackled previously from

different perspectives and with different, relatively small datasets. Herewe address them all systematically at a genome scale.It seems likely that conclusions drawn from the analysis of60%

Abbreviations: ACR, ancient conserved region; HTH, helix-turn-helix; SAM, S-adenosylmethionine.*To whom reprint requests should be addressed.

11921

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 1,

202

1

Page 2: Sequencesimilarity analysis of Escherichia proteins ... · Proc. Natl. Acad. Sci. USA Vol. 92, pp. 11921-11925, December 1995 Evolution Sequencesimilarity analysis ofEscherichiacoli

Proc. Natl. Acad. Sci. USA 92 (1995)

of the genes will generally hold when the genome sequence iscompleted.

MATERIALS AND METHODSA comprehensive set ofE. coli protein sequences was producedby translation of the EcoSeq7 database that consists of 375contigs representative of the entire genome. It includes allpreviously characterized proteins as well as putative proteinsthat could be detected using database searches and the Gene-Mark method for coding region prediction (15-17). Thesesequences were compared to the nonredundant (NR) se-quence database at the National Center for BiotechnologyInformation (National Institutes of Health, Bethesda, MD)using programs of the BLAST family (18, 19), the BLOSUM62matrix (20), and the SEG program to mask compositionallybiased regions that frequently produce spurious hits in data-base searches (21). For sequences that initially failed to showsimilarity to any other sequences in the database, the searchwas repeated without SEG in order to rule out the possibilitythat a conserved region has been masked.

Initial searches were performed using the BLASTP programto compare all the E. coli protein sequences to the proteinversion of the NR database. Those sequences that did not showsimilarity to any protein were compared, using the TBLASTNprogram (19), to the nucleotide version of the NR databasetranslated in all 6 reading frames, in order to detect similarityto putative proteins that have not been annotated in thedatabase. Three independent recent studies have shown theimportance of searching uncharacterized nucleotide sequencesfor the discovery of new bacterial genes (17, 22, 23).

Alignments with similarity scores above 90, correspondingto the probability of matching by chance of about 0.001 (for anaverage-size protein), were accepted as an indication of anauthentic relationship. The significance of the similarities inthe "twilight zone" that contains both biologically relevant andspurious alignments (scores between 60 and 90) was assessedby conserved motif analysis using the CAP and MOST programs(24), multiple alignment analysis using the MACAW program(25), as well as considerations of functional relevance.

In order to delineate groups of paralogs among the E. coliproteins, the set of 2328 proteins from EcoSeq7 was comparedto itself using BLASTP. For clustering the proteins, a single-linkage algorithm was used. A cluster was defined as a groupof protein sequences connected by similarity scores above achosen cut-off (BLASTP score of 70 or greater) but without therequirement that each pair of sequences within a cluster hadsuch a score. Given the nontransitivity of BLAST, this algorithmensures the identification of a maximal number of paralogsbelonging to a cluster. In addition, conserved motifs typical ofeach of the paralogous clusters were defined and used to searchthe database of E. coli protein sequences (24), in order to

delineate superclusters that combine previously identifiedclusters as well as some of the sequences not belonging to anycluster. These motifs also were used to resolve the problemencountered by the single-linkage clustering algorithm withproteins containing two or more distinct conserved domains.Such proteins may artifactually bring together otherwise un-related clusters. Therefore, conserved motifs typical of eachcluster were used to identify distinct domains of multidomainproteins, which were then included in different clusters.The methods used for protein sequence comparison at a

genome scale and their relative contributions to the detectionof biologically relevant similarities are detailed elsewhere (26).

RESULTS AND DISCUSSIONE. coli Proteins Are Highly Conserved. The great majority of

the 2328 E. coli proteins in EcoSeq7-86%-show significantsequence similarity to other proteins in the database (Fig. 1A).This is higher than the 70% recently reported for yeast proteins(P. Bork, personal communication) and the 75% observed forM. capricolum (8). In other words, the fraction of E. coliproteins with no homologs currently in the database is abouthalf that for yeast.About 60% of the E. coli proteins show similarity to proteins

from distantly related bacteria, which, for the purpose of thisanalysis, were defined as those outside of the Proteobacteria inthe latest 16S rRNA phylogenetic tree (27), and about 40%contain ACRs; 93% of the ACRs are regions conserved in E.coli and eukaryotic proteins. This is nearly a 2-fold greaterfraction of proteins with ACRs than previously estimated (10).There are two components to this increase-inclusion ofalignments from the twilight zone and the growth of databases.We estimate that the higher sensitivity accounts for about 60%of the additional ACRs. For the great majority of the ACRsdetected in E. coli proteins-at least 97%-the related eu-karyotic (archaeal) proteins have already been known to beconserved at least at the level of different animal phyla, whichis the definition of an ACR (10). This is compatible with thenotion that relatively few new ACR-containing families remainto be discovered (7, 10, 11).

In our estimates of the number of ACRs in E. coli proteins,we excluded cases of similarity to proteins encoded in endo-symbiotic organellar (mitochondrial and chloroplast) ge-nomes. Even so, some of the genes coding for eukaryoticproteins with the highest similarity to their E. coli homologsalmost certainly have evolved from organellar genes trans-ferred to the nucleus (28, 29). Nevertheless, on average, E. coliproteins are less similar to their eukaryotic homologs than tothe homologs from distantly related bacteria; the mean simi-larity score, computed using BLASTP, between E. coli proteinsand their eukaryotic or archaeal homologs was 174 (median ofthe score distribution, 109), and the mean score for homologs

A

Eukaryotes/Archaea

41% A

unrelated to other14% proteins

3%E. coli only

16%

closely relatedhkl>!4 :bocteria

B

Similarity detectedno function 9%

Similarity detected A.function predicted

;~.....16% ;):t;

No similarity9% no function

Function knownsimilarity detected

61%

distant bacteria 26%Function known 5%no similarity

FIG. 1. Sequence conservation in E. coli proteins. (A) Levels of sequence conservation. E. coli proteins are partitioned by the phylogeneticallymost distant homologs; the "distant bacteria" sector includes proteins for which homologs were detected in distantly related bacteria but not ineukaryotes or Archaea, and the "closely related bacteria" sector includes those proteins that did not have detectable homologs in distantly relatedbacteria. (B) Sequence conservation and functional information.

11922 Evolution: Koonin et al.

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 1,

202

1

Page 3: Sequencesimilarity analysis of Escherichia proteins ... · Proc. Natl. Acad. Sci. USA Vol. 92, pp. 11921-11925, December 1995 Evolution Sequencesimilarity analysis ofEscherichiacoli

Proc. Natl. Acad. Sci. USA 92 (1995) 11923

permease!7OX

no

!S

Walker type NTPases

6%

5% HTH DNA-binding

NAD/FAD-binding

PM t receiver domainse l sensor domains

SAM-dependentmethyltransferasesaminotransferases

1% each

30%other clusters and superclusters

FIG. 2. Clusters and superclusters of paralogous proteins in E. coli.HTH, helix-turn-helix; SAM, S-adenosylmethionine.

from distantly related bacteria was 230 (median, 146). Thus,most of the ACRs probably antedate the radiation of bacteriaand the common ancestor of eukaryotes and Archaea.

Altogether, approximately two-thirds of the known E. coliproteins contain regions that are conserved at least at the levelof distantly related bacteria that are separated from E. coli bymore than a billion years of evolution (Fig. IA). Thus, the highconservation of protein sequences in E. coli is not due to trivialsimilarity to proteins from closely related bacterial species.Rather, the evolution of the majority of the bacterial proteinsappears to be strongly constrained, probably because of therequirements to maintain a specific tertiary structure and topreserve functional sites.The m.ajority of the E. coli proteins have a known, at least

in general terms, biological function and show sequencesimilarity to other proteins (Fig. 1B). The second largestcategory-one-fourth-includes E. coli gene products that donot have a known function but do have homologs in sequencedatabases. For two-thirds of these (about 400 proteins), a

putative function could be predicted based on sequence con-servation. This group is of special interest as sequence-basedprediction of protein functions is one of the most importantaspects of any genome project. About 200 E. coli gene productswithout a known function are related only to other uncharac-terized proteins (Fig. 1B). These findings show that suchproteins have conserved functions and allow us to define new,functionally important motifs, but they are not immediatelyuseful for functional prediction.

Among the 340 E. coli gene products (14% of the total) thatare not similar to any proteins in the databases, 70% have notbeen functionally characterized. The remaining ones have avariety of functions, some of which have not yet been discov-ered in other organisms, whereas others may be specific for E.coli and closely related enterobacteria.

Only a small fraction of the E. coli gene products (<10%)are true unknowns, with neither an established function nor

homologs among the known proteins (Fig. 1B).About One-Half of the E. coli Proteins Belong to Clusters

and Superclusters of Paralogs. We observed that 46% of theE. coli proteins belong to 299 paralogous clusters, defined on

the basis of pairwise similarity. Based on motif conservation,we delineated 70 superclusters, drawing in an additional 10%of the sequences (Fig. 2).Most of the clusters of paralogs in E. coli are small-I 79

clusters (60%) contain only two members, and 252 (84%)contain two to four members. In contrast, several clusters have>10 members (Table 1). Moreover, the 4 largest superclus-ters-namely, permeases, ATPases and GTPases with theconserved "Walker-type" motif, HTH regulatory proteins,and NAD(FAD)-binding proteins-account for nearly 25% ofthe E. coli proteins (Fig. 2). Clustering of the E. coli proteinssuggests that more than one-half of the bacterial genes mayhave evolved from a relatively small number of ancestral genes.A reconstruction of such an ancestral gene set, even thoughnecessarily hypothetical, seems to be a possibility.

Large clusters of paralogs in E. coli include mostly transportproteins and proteins involved in various regulatory processes,whereas paralogous metabolic enzymes typically form smallclusters (Table 1). Some key enzymes of the genome replica-tion and expression-notably, catalytic subunits of DNA andRNA polymerases and ribosomal proteins-do not have de-tectable paralogs. Extensive diversification of transport andregulatory proteins provides a basis for the environmentaladaptability and, accordingly, is of selective advantage to thebacterium. In contrast, the basic mechanisms of replicationand expression apply uniformly to the whole genome-hence,no need for paralogs. This is not necessarily valid for very earlystages of evolution when essential genes involved in replicationand expression could have evolved by duplication; traces ofsuch ancient duplication events may be detectable throughtertiary structure comparisons.

In 56% of the paralogous clusters, at least one proteincontains an ACR, whereas a very small fraction of the clustersconsists of proteins with similarity only to other E. coli

Table 1. Ten largest clusters of transport and regulatory proteins and selected small clusters of metabolic enzymes in E. coli

Homologs

From Fromdistantly eukaryotes/

Cluster No. of proteins Function/activity related bacteria Archaea

ABC transporter ATPases 54 ATP-dependent membrane Yes Yestransport; DNA repair

Permeases (AraE-related) 34 Membrane transport Yes YesHTH proteins (Ada-related) 29 Transcription regulation Yes No

Receiver domains 28 Membrane signal transduction Yes YesSensor domains 24 Membrane signal transduction Yes YesPermeases (ArtM-related) 24 Membrane transport Yes YesHTH proteins (CynR-related) 21 Transcription regulation Yes NoSugar-binding domains 15 Metabolite transport YesDEAD/H helicases 15 RNA/DNA duplex unwinding Yes YesGTPases 15 GTP-dependent processes Yes Yes

Acetyltransferases 3: AccB, AceF, SucB Acyl-CoA synthesis Yes YesAcetate kinases 2: AckA, YhaA' Acetyl-CoA synthesis Yes YesAlanine racemases 2: Alr, DadX Peptidoglycan biosynthesis Yes No

and degradationAminotransferases 2: AspC, TyrB Amino acid biosynthesis Yes Yes

Evolution: Koonin et al.

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 1,

202

1

Page 4: Sequencesimilarity analysis of Escherichia proteins ... · Proc. Natl. Acad. Sci. USA Vol. 92, pp. 11921-11925, December 1995 Evolution Sequencesimilarity analysis ofEscherichiacoli

Proc. Natl. Acad. Sci. USA 92 (1995)

proteins. Thus, most of the clusters include proteins withessential functions that should have been encoded in thegenome of the common ancestor of bacteria, eukaryotes, andArchaea.About One-Half of All ACR Families Are Represented in E.

coli. Each paralogous cluster that includes proteins with anACR should be considered a part of a single ACR-containingprotein family that includes both paralogs and orthologs fromdifferent organisms. Similarly, each of the E. coli proteins thatdoes not belong to a paralogous cluster but contains an ACRrepresents such a family. Combining the numbers of clustersand individual sequences with ACRs, we found that 530ACR-containing protein families are represented in E. coli,which may comprise about one-half of the total number of suchfamilies (7, 10, 11).Conserved Motifs: Signatures of Paralogous Protein Clus-

ters and Superclusters. We attempted to identify conservedsequence motifs that would define clusters and superclusters ofparalogous E. coli proteins by combining consistent segmentsof pairwise alignments detected by BLAST into multiple align-ment blocks and using these blocks for database screening (24).We derived 165 unique conserved motifs, 99 of which definesingle clusters and 66 of which correspond to superclusters.The latter group of motifs is most valuable as they help obtaininformation on sequence conservation in distantly relatedparalogs that could not be extracted by pairwise similaritysearches. Typically, the motifs are highly conserved in evolu-tion, with 117 of the 165 including ACRs.

E. coli Protein Sequence Analysis Helps Predict Functionsof Human Proteins. One of the main objectives of studyinggenomes of well-characterized microorganisms is the ability toinfer functions of human genes. Our analysis of the E. coliprotein sequences resulted in a number of such functionalpredictions. Below we describe an example including an im-portant group of eukaryotic and archaeal proteins.

Findings with the SAM-Binding Motif: Fibrillarins ArePredicted To Be rRNA Methyltransferases. The results ofdatabase searches using a position-dependent weight matrixfor the SAM-binding motif that is conserved in a variety ofmethyltransferases (30, 31) were among the most interesting.After initiating the search with the sequences of known E. colimethyltransferases, we detected the SAM-binding motif in 23E. coli proteins and in about 200 proteins from other bacteria,eukaryotes, and Archaea (Fig. 3). All of the proteins in this setwith a known enzymatic activity are methyltransferases thattransfer the methyl group from SAM to a variety of substrates.By inference, the SAM-dependent methyltransferase activityis predicted also for the uncharacterized proteins selected fromthe database. Thus, this activity was tentatively assigned to 14functionally uncharacterized E. coli proteins (Fig. 3). TheSAM-binding motif also was identified, at a statistically sig-nificant level (P < 0.02), in the human nucleolar tumor markerp120 (32), and in fibrillarins, another group of nucleolarproteins (Fig. 3). Fibrillarins are involved in rRNA processingand methylation and react with autoantibodies from sclero-derma patients (33, 34). The presence of the SAM-bindingmotif strongly suggests that fibrillarins are bona fide rRNAmethylases.

CONCLUSIONS(i) Computer analysis yields testable unctionalpredictions forthe majority of uncharacterized gene products. In addition, ourknowledge of proteins with known function is increased-e.g.,via the prediction of active centers. Many of the predictions aremade possible only by using motif detection methods. A"by-product" of the E. coli protein sequence analysis that maybe no less important than the original goal of characterizationof the bacterial genome is the systematic identification ofeukaryotic homologs, including some that are implicated in

consensus

CfaUbiGPrmATrmATehBPcmSpeEDcmHsdMKsgACysGBioCFmvGidBFtsJHemKYebHYigOYcbDYgcAYfiCYgdEYggHYabCYjjT

FBRL_HUMANFBRL_XENLAFBRL_YEASTFBRL_SCHPOFBRL_LEIMAFLPA_METVAFLPA_METVO

P120 HUMANMTH1_HAEHACOMT_HUMANMLSB_SACERCOQ3_YEAST

UUDU$$$.$ .......

171:60:

162:214:34 :79:81:89:

171:41:

218:46:12:69:55:

113:89:67:48:

260:48:

215:57:27:

200:

152:154:159:149:129:67:65:

388:14:

112:65:

130:

VLDIGCGWGGLAHYMASVLDVGCGGGILAESMARVIDFGCGSGILAIAALKLLELYCGNGNFSLALARTLDLGCGNGRNSLYLAAVLEIGTGSGYQTAILAHVLIIGGGDGAMLREVTRFIDLFAGIGGIRRGFESVQDPAAGTAGFLIEADRMVEIGPGLAALTEPVGEVVLVGAGPGDAGLLTLKVLDAGCGPGWMSRHWREILDLCAAPGGKTTHILEFIDVGTGPGLPGIPLSIVVDLGAAPGGWSQYVVTILDLGTGTGAIALALASVLDIGCGEGYYTHAFADVLDLAGGTGDLTAKFSRVLDAGGGEGQTAIKMAEVLDLFCGMGNFTQPLATCLDIGAGSGLLALMLAQAVDLGACPGGWTYQLVKTLEIGFGMGASLVAMAKYIDGTFGRGGHSRLILSVLDVGCGAGVLSVAFAR

I

+

VLYLGAASGTTVSHVSDVLYLGAASGTTVSHVSDVLYLGAASGTSVSHVSDVLYLGAANGTSVSHVADVLYLGGATGTTVSHVSDVLYLGASAGTTPSHVSDVLYLGASAGTTPSHVAD

ILDMCCAPGGKTSYMAQFIDLFAGLGGFRLALESLLELGAYCGYSAVRMARVLEAGPGEGLLTRELADVLDVGCGGGILSESLAR

+*

+*

FIG. 3. SAM-binding motif in E. coli proteins and selected eu-karyotic proteins. Plus signs denote proteins with known methyltrans-ferase activity; asterisks show proteins with known three-dimensionalstructure; exclamation marks indicate proteins for which the methyl-transferase activity was predicted in this study; and dots show proteinsfor which the prediction has been published previously. The consensusincludes amino acid residues conserved in the majority of SAM-binding sites; U indicates a bulky hydrophobic residue; $ indicates asmall residue (G, A, or S); and a dot indicates any residue. The residuesconforming to the consensus are highlighted by boldface type. Theposition of the first amino acid of the aligned segment in the proteinsequence is indicated by a number. The E. coli protein sequences werefrom EcoSeq7; the remaining sequences were from Swiss-Prot, withthe protein names indicated as in Swiss-Prot (FBRL are fibrillarins;FLPA are their archaeal homologs); the human tumor marker pl20sequence was from the Protein Identification Resource database(accession no. A48168).

human disease, and the prediction of functions for theseproteins.

(ii) A large fraction of bacterial protein sequences is highlyconserved in evolution. ACRs were detected in >40% of the E.coli proteins. This is almost a 2-fold greater fraction thanpreviously estimated, due in part to a detailed analysis of"weak" similarities and in part to the database growth. Thefraction of proteins with ACRs may further grow with thecompletion of eukaryotic and archaeal genome sequences, butthe increase cannot be dramatic since most of the ancientprotein families are already in the database (7, 10, 11). Arealistic projection may be that about 50% of bacterial proteinscontain ACRs.

(iii) About 50% of the bacterial proteins belong to clustersof paralogs. Transport and regulatory proteins tend to form

11924 Evolution: Koonin et al.

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 1,

202

1

Page 5: Sequencesimilarity analysis of Escherichia proteins ... · Proc. Natl. Acad. Sci. USA Vol. 92, pp. 11921-11925, December 1995 Evolution Sequencesimilarity analysis ofEscherichiacoli

Proc. Natl. Acad. Sci. USA 92 (1995) 11925

large clusters; metabolic enzymes typically belong to smallclusters; no paralogs so far have been found for key proteinsinvolved in genome replication and expression. The majority ofthe clusters correspond to essential, ancient functions and con-tain ACRs.

(iv) As much as one-half of the distinct ACR-containingprotein families appears to be represented in E. coli. Given thewealth of biochemical and genetic information on E. coliproteins, this suggests that E. coli sequence analysis will beinstrumental in defining the catalog of universal cellularfunctions.Addendum. The results of a preliminary analysis of an

updated set of about 3000 E. coli protein sequences performedafter the submission of this manuscript were very similar tothose described here, suggesting that the principal conclusionsof this study are likely to hold for the complete genome (26).The entire 1.83-Mb genome sequence of Haemophilus influ-enzae has been published very recently; the majority of the H.influenzae gene products have homologs among known E. coliproteins, but E. coli encodes many proteins not found in H.influenzae (35).

Availability of the Results. The set of protein sequences usedin this study, the list of functional predictions for uncharac-terized, putative E. coli proteins, the table of the paralogousprotein clusters and superclusters of E. coli proteins, and alibrary of conserved motifs typical of these clusters are avail-able by anonymous ftp at ncbi.nlm.nih.gov in the directory/repository/Eco/EcoProt.

We are grateful to David Lipman for useful discussions andencouragement, to Amos Bairoch, Peer Bork, David Landsman, andDavid Lipman for critical reading of various versions of this manu-script, to Bobby Baum for drawing our attention to several subtlesequence similarities, and to Peer Bork for communicating his resultsprior to publication.

1. Bork, P., Ouzounis, C. & Sander, C. (1994) Curr. Opin. Struct.Biol. 4, 393-403.

2. Neidhardt,F.,Ingraham,J. L.,Low,K. B.,Magasanik,B.,Schaech-ter, U. & Umbarger, H. E., eds. (1987) Escherichia coli andSalmonella typhimurium: Cellular and Molecular Biology (Am.Soc. Microbiol., Washington, DC).

3. Riley, M. (1993) Microbiol. Rev. 57, 862-952.4. Daniels, D., Plunkett, G., Burland, V. & Blattner, F. R. (1992)

Science 257, 771-778.5. Wahl, R., Rice, P., Rice, C. M. & Kroger, M. (1994)NucleicAcids

Res. 22, 3450-3455.6. Koonin, E. V., Bork, P. & Sander, C. (1994) EMBO J. 13,

493-503.

7. Green, P. (1994) Curr. Opin. Struct. Biol. 4, 404-412.8. Bork, P., Ouzounis, C., Casari, G., Sander, C., Dolan, M. &

Gillevet, P. (1995) Mol. Microbiol. 16, 955-967.9. Razin, S. (1992) FEMS Microbiol. Lett. 100, 423-432.

10. Green, P., Lipman, D. J., Hillier, L., Waterston, R., States, D. &Claverie, J.-M. (1993) Science 259, 1711-1716.

11. Chothia, C. (1992) Nature (London) 357, 543-544.12. Fitch, W. M. (1970) Syst. Zool. 19, 99-1 10.13. Zipkas, D. & Riley, M. (1975) Proc. Natl. Acad. Sci. USA 72,

1354-1358.14. Labedan, B. & Riley, M. (1995) J. Bacteriol. 177, 1585-1588.15. Rudd, K. E. (1993) Am. Soc. Microbiol. News 59, 335-341.16. Berlyn, M., Low, K. B. & Rudd, K. E. (1995) in Escherichia coli

and Salmonella: Cellular and Molecular Biology, eds. Neidhardt,F. C., Curtiss, R., III, Ingraham, J. L., Lin, E. C. C., Low, K. B.,Magasanik, B., Reznikoff, W., Riley, M., Schaechter, M. &Umbarger, H. E. (Am. Soc. Microbiol., Washington, DC), 2ndEd., in press.

17. Borodovsky, M., Rudd, K. E. & Koonin, E. V. (1994) NucleicAcids Res. 22, 4756-4767.

18. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman,D. J. (1990) J. Mol. Biol. 215, 403-410.

19. Altschul, S. F., Boguski, M. S., Gish, W. & Wootton, J. C. (1994)Nat. Genet. 6, 119-129.

20. Henikoff, S. & Henikoff, J. (1992) Proc. Natl. Acad. Sci. USA 89,10915-10919.

21. Wootton, J. C. & Federhen, S. (1993) Comput. Chem. 17, 149-163.

22. Robison, K., Gilbert, W. & Church, G. M. (1994) Nat. Genet. 7,205-214.

23. Krogh, A., Mian, I. S. & Haussler, D. (1994) Nucleic Acids Res.22, 4768-4778.

24. Tatusov, R. L., Altschul, S. F. & Koonin, E. V. (1994) Proc. Natl.Acad. Sci. USA 91, 12091-12095.

25. Schuler, G. D., Altschul, S. F. & Lipman, D. J. (1991) ProteinsStruct. Funct. Genet. 9, 180-190.

26. Koonin, E. V., Tatusov, R. L. & Rudd, K. E. (1995) MethodsEnzymol., in press.

27. Olsen, G. J., Woese, C. R. & Overbeek, R. (1994) J. Bacteriol.176, 1-8.

28. Palmer, J. D. (1985) Annu. Rev. Genet. 19, 325-354.29. Gray, M. W. (1989) Trends Genet. 5, 294-299.30. Ingrosso, D., Fowler, A. V., Bleibaum, J. & Clarke, S. (1989) J.

Biol. Chem. 264, 20131-20139.31. Koonin, E. V. (1993) J. Gen. Virol. 74, 733-740.32. Koonin, E. V. (1994) Nucleic Acids Res. 22, 2476-2478.33. Schimmang, T., Tollervey, D., Kern, H., Frank, R. & Hurt, E. C.

(1989) EMBO J. 8, 4015-4024.34. Tollervey, D., Lehtonen, H., Jansen, R., Kern, H. & Hurt, E. C.

(1993) Cell 72, 443-457.35. Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A. &

Kirkness, E. F. (1995) Science 269, 496-512.

Evolution: Koonin et al.

Dow

nloa

ded

by g

uest

on

Feb

ruar

y 1,

202

1