Genome-based identification and analysis of collagen-related structural motifs in bacterial and viral proteins

By: Magnus Rasmussen1*, Micael Jacobsson2, and Lars Björck1.

1From the Department of Cell and Molecular Biology, Section for Molecular

Pathogenesis, Lund University, Lund, Sweden and 2Biovitrum AB, Stockholm, Sweden

and the Department of Medical Chemistry, Uppsala University, Uppsala, Sweden

*Corresponding author. Mailing address: Section for Molecular Pathogenesis,

Department of Cell and Molecular Biology, Lund University, BMC, B14, Tornavägen

10, S-221 84 Lund, Sweden. Tel.: +46-46-2224489. Fax: +46-46-157756.

E-mail: [email protected]

Copyright 2003 by The American Society for Biochemistry and Molecular Biology, Inc.

JBC Papers in Press. Published on June 3, 2003 as Manuscript M304709200

Downloaded from http://www.jbc.org/ by guest on March 26, 2018

Summary

Collagens are extended trimeric proteins composed of the repetitive sequence glycine-

X-Y. A collagen-related structural motif (CSM) containing glycine-X-Y repeats is also

found in numerous proteins often referred to as collagen-like proteins. Little is known

about CSMs in bacteria and viruses, but the occurrence of such motifs has recently been

demonstrated. Moreover, bacterial CSMs form collagen-like trimers even though these

organisms cannot synthesize hydroxyproline, a critical residue for the stability of the

collagen triple helix. Here we present 100 novel proteins of bacteria and viruses

(including bacteriophages) containing CSMs identified by in silico analyses of genomic

sequences. These CSMs differ significantly from human collagens in amino acid content

and distribution: bacterial and viral CSMs have a lower proline content and a

preference for proline in the X position of GXY triplets. Moreover, the CSMs identified

contained more threonine than collagens, and in 17 out of 53 bacterial CSMs threonine

was the dominating amino acid in the Y position. Molecular modeling suggests that

threonines in the Y position make direct hydrogen bonds to neighboring backbone

carbonyls and thus substitute for hydroxyproline in the stabilization of the collagen-like

triple-helix of bacterial CSMs. The majority of the remaining CSMs were either rich in

proline or rich in charged residues. The bacterial proteins containing a CSM that could be

functionally annotated were either surface structures or spore components whereas the

viral proteins generally could be annotated as structural components of the viral particle.

The limited occurrence of CSMs in eubacteria and lower eukaryotes and the absence of

CSMs in archaebacteria suggest that DNA encoding CSMs has been transferred

horizontally, possibly from multicellular organisms to bacteria.

Introduction

Collagens, present in most multicellular organisms, are helical proteins composed of

three extended polyproline type II-like chains (1). Best studied are the fibrillar collagens,

constituting important components of the extracellular matrix. However, non-fibrillar

collagens as well as proteins with shorter collagen-like regions also exist, and a

collagen-related structural motif (CSM) has been identified in proteins of bacteria,

bacteriophages, and viruses (2-5). Bacteria and viruses are generally believed to be

unable to synthesize hydroxyproline, a residue regarded as essential for the stabilization of a

triple-helical structure. However, the absolute requirement of hydroxyproline for triple-

helix formation has lately been challenged (6,7), and alternative means for the

stabilization of triple-helical collagen have been proposed (8,9). Importantly, it has been

shown that three different bacterial CSMs trimerize, despite the lack of hydroxyprolines

(2,10).

This study was undertaken to investigate the occurrence of proteins containing CSMs in

bacteria and viruses through an in silico approach. We found that CSMs are encoded by a

minority of bacteria and bacteriophages and that these CSMs differ from human

collagens in several important aspects. A novel mechanism for the stabilization of

bacterial CSMs is presented. We also propose that horizontal gene transfer has

contributed to the evolution of CSMs.

Experimental procedures

The CSMs were identified by downloading 56 bacterial proteomes from the TIGR.org

homepage using the batch download function and running in-house pattern searches with

MacVector (Oxford Molecular Ltd., version 6.5.3). The pattern used was (GPP)(7) with

12 allowed mismatches. Matches containing G in the P positions were excluded. The

CSMs were then used in unfiltered tBLASTn searches from the NCBI microbial genome

homepage against 136 eubacterial genomes, 15 archaebacterial genomes, 30 genomes of

lower eukaryotes, and against available viral genomes. The hits obtained were analyzed

in-house with the same pattern. Open reading frames flanking the bacterial CSM-encoding genes

were analyzed by BLASTp to determine if the gene was phage-encoded.
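
The pattern search described above can be illustrated with a minimal Python sketch. It is not the MacVector implementation, whose matching algorithm is not described here. The function name find_csm_windows is hypothetical, and two assumptions are made: mismatches are counted over all 21 positions of a (GPP)7 window, and a window is discarded as soon as any position expected to hold P holds G.

```python
def find_csm_windows(seq, repeats=7, max_mismatches=12):
    """Return 0-based start indices of windows resembling (GPP)n.

    Assumptions (not stated in the text): mismatches are counted over all
    3 * repeats positions of the window, and a window is rejected if any
    position expected to hold P holds G instead.
    """
    pattern = "GPP" * repeats
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        # Exclude windows with glycine in a P position.
        if any(w == "G" for w, p in zip(window, pattern) if p == "P"):
            continue
        # Count positions that differ from the idealized GPP repeat.
        mismatches = sum(w != p for w, p in zip(window, pattern))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits
```

Under these assumptions a window needs at least nine of its 21 residues to agree with the idealized repeat, so a stretch of seven GPT triplets (seven mismatches) is reported, whereas a glycine-free sequence is not.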

Homology models of proline-rich CSMs were constructed on a trimeric [(GPP)10]3

template from a 1.3 Å X-ray structure (11). The models were built by global energy

minimization using the template as constraint. After the model structures had been

minimized, conformations of all threonine side-chains were sampled through an

exhaustive systematic search of the χ-angles of said side-chains. All molecular

mechanics calculations were carried out using the ICM software (12).

All statistical analyses for differences were performed using Fisher’s exact test.
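
For readers who wish to reproduce this kind of comparison, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table, written in plain Python. The statistical software actually used is not specified here. The function name is hypothetical, and the two-sided p-value is assumed to be the sum of the probabilities of all tables that are no more likely than the observed one.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows could be, for example, two sets of CSMs, and columns the number of
    Y positions that are or are not occupied by a given amino acid.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = _log_comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return exp(_log_comb(row1, x) + _log_comb(row2, col1 - x) - log_denom)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Sum all tables that are at most as probable as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)
```

As a check on a small table unrelated to the data of this study, fisher_exact_two_sided(8, 2, 1, 5) gives approximately 0.035. Working with log-probabilities keeps the calculation stable for the large counts that arise when many triplets are pooled.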

Results and discussion

Proteins with collagen-related structural motifs in bacteria and viruses- The

characteristic primary structure of collagen (glycine-X-Y repeats) and the high content

of proline allowed us to construct a simple pattern ((GPP)(7)) for in silico detection of

CSMs. The pattern was applied to 58 bacterial proteomes to detect such sequences. In

addition, BLAST searches against 137 eubacterial genomes, 15 archaebacterial genomes,

30 genomes of lower eukarya, and against viral genomes, with the CSMs obtained in the

proteome screen, identified additional proteins containing CSMs. In total 103 CSMs were

identified. The results, which are summarized in figure 1A, show that proteins with a

CSM are present in a minority of genomes analyzed. No CSMs were detected in the 15

archaebacterial genomes, while 53 CSMs were detected in the 137 eubacterial genomes

analyzed. In eubacteria, CSMs are mostly found in the firmicute group (including

mycobacteria and Gram-positive bacteria) and on several occasions a single genome

encodes more than one such protein. The viral CSMs detected were mostly from

bacteriophages (37/43) and some of them represent allelic variants. A given viral or

bacteriophage genome only encodes one CSM. In the lower eukarya group, CSMs were

only detected in proteins from three different strains of Plasmodium falciparum. These

CSMs were omitted from further analysis as they were too few to allow any statistical

analysis.

Characteristics of bacterial and viral CSMs- The CSMs were generally dissimilar in

primary structure, except for the periodicity of the glycines and a relatively high proline

content. The length of the CSM varied from seven (the detection-limit of the pattern) to

745 continuous GXY repeats, with a mean number of 76 GXY repeats. In 18 of the 103

CSM-containing proteins identified, more than one CSM was identified. Both COOH-

and NH2-terminal to the CSM, other regions of varying length and sequence were found

in all proteins. A position-specific analysis of the amino acid content of the identified

CSMs revealed several differences compared to human collagens (Fig 1B). As no

statistically significant differences were noted between bacteriophage CSMs and CSMs

from other viruses these were pooled for the purpose of subsequent analyses. Perhaps

most striking is the difference in proline content and the preference for proline in position

X in the CSMs from bacteria and viruses. The proline content is significantly lower in

these CSMs than in human collagens (p = 3.7 × 10^-147 for the bacterial CSMs and

p = 2.8 × 10^-16 for the viral CSMs). In bacterial and viral CSMs the fraction of prolines

found in the Y position was significantly lower than the corresponding fraction in human

collagens (p = 3.2 × 10^-77 for bacterial and p = 1.2 × 10^-59 for the viral CSMs). Prolines in

position Y are generally hydroxylated in human collagen; thus, the relative absence of

prolines in this position among CSMs from bacteria and viruses probably reflects their

inability to synthesize hydroxyproline. In this context it should be mentioned that

bacterial homologues of eukaryotic prolyl hydroxylases could not be identified despite

extensive similarity searches.
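
The position-specific analysis referred to above amounts to tabulating residues separately for the X and Y positions of the GXY triplets. A minimal sketch follows. The function name position_composition is hypothetical, and the sketch assumes an uninterrupted CSM that starts at a glycine and whose length is a multiple of three.

```python
from collections import Counter


def position_composition(csm):
    """Percentage of each residue in the X and Y positions of a GXY repeat.

    Assumes `csm` starts at a glycine and consists of uninterrupted triplets.
    """
    if len(csm) % 3 != 0 or set(csm[0::3]) != {"G"}:
        raise ValueError("expected uninterrupted GXY triplets")
    n_triplets = len(csm) // 3
    result = {}
    for name, offset in (("X", 1), ("Y", 2)):
        # Every third residue, starting at the given offset within the triplet.
        counts = Counter(csm[offset::3])
        result[name] = {aa: 100.0 * k / n_triplets for aa, k in counts.items()}
    return result
```

Pooling the per-position counts over all CSMs of one group, rather than averaging percentages, would give the raw counts needed for a test such as Fisher's exact test.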

The CSMs identified contained a significantly higher proportion of threonine (p = 7.6 × 10^-206

for bacterial CSMs and p = 1.9 × 10^-36 for viral CSMs) and glutamine (p = 2.1 × 10^-12

for bacteria and p = 4.1 × 10^-14 for viruses) in the Y position as compared to human

collagen. A minority of the bacterial CSMs (17 out of 53) had more than 50 % of Y

positions occupied by threonine (mean 94 %). In these threonine-rich CSMs, the X

position is typically occupied by prolines, alanines, or serines, whereas charged residues

are less common than in the other bacterial CSMs and in human collagen. The proteins

containing threonine-rich CSMs are clustered in five bacterial species: the spore-

forming human pathogens Clostridium difficile, Bacillus anthracis, and Bacillus cereus,

the nitrogen-fixing Mesorhizobium loti, and the sulphur-metabolising Desulfitobacterium

hafniense. To determine if threonine can influence stability of a collagen-like triple-

helix, we built homology models of six representative proline-rich CSMs on a trimeric

[(GPP)10]3 template constructed from a 1.3 Å X-ray structure (11). The resulting models

show that threonine in the Y position is able to form direct inter-chain hydrogen bonds to

backbone carbonyls in its energetically most favored conformation (Fig. 2). When the

amino acids in the X and Y positions of one threonine-rich CSM were switched, the

threonines were not able to form inter-chain hydrogen bonds (data not shown).

Interestingly, there are indications that threonine in the Y position can stabilize a triple-

helical structure also through indirect hydrogen-bonding or through glycosylations

(8,13,14).

Most of the remaining 83 CSMs could instead be classified as rich in prolines (>30 % in

X and Y position) or charged (>45 % charged residues in X and Y). Eight of the

identified CSMs did not meet any of the criteria, while five CSMs fell into more than one

class (Fig. 3A). Not only the prolines, but also the charged residues, were found to be

unevenly distributed between the X and Y position (Fig 3B). Most often, negatively

charged residues are found in position X while positively charged residues (arginine and

lysine) are found in the Y position. This may be related to triple helix stability, as

especially arginine in the Y position stabilizes trimers (9). Negative charges in the Y

position are generally destabilizing, with the exception of GR/KD triplets (15). Among

the charged CSMs with a net negative charge in the Y position, aspartate is more

common than glutamate and the aspartate is in 50 % of the cases preceded by a positively

charged residue. In addition to stabilization, charges in collagen are important for the

interactions between collagen and other macromolecules (16).
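
The three-way classification used in this section can be expressed as a short Python sketch. The function name classify_csm is hypothetical. The sketch assumes that the proline and charge thresholds apply to the X and Y positions pooled together, and that only D, E, K, and R count as charged.

```python
CHARGED = set("DEKR")  # assumption: histidine is not counted as charged


def classify_csm(csm):
    """Return the set of classes an uninterrupted GXY-repeat CSM falls into.

    The set is empty when no criterion is met and may hold several classes.
    """
    x, y = csm[1::3], csm[2::3]
    xy = x + y
    classes = set()
    # Threonine-rich: more than 50 % of Y positions occupied by threonine.
    if y.count("T") / len(y) > 0.50:
        classes.add("threonine-rich")
    # Proline-rich: more than 30 % proline in the X and Y positions.
    if xy.count("P") / len(xy) > 0.30:
        classes.add("proline-rich")
    # Charged: more than 45 % charged residues in the X and Y positions.
    if sum(aa in CHARGED for aa in xy) / len(xy) > 0.45:
        classes.add("charged")
    return classes
```

Because the criteria are evaluated independently, a CSM such as a GPT repeat is reported as both threonine-rich and proline-rich, which mirrors the observation that some CSMs fall into more than one class.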

Function of the CSM- The function of CSMs in bacteria and viruses is not well

understood. The few CSMs studied have been shown to mediate trimerization or

elongation (2,10,17,18) of proteins found at bacterial or bacteriophage surfaces (2,18-

25). Furthermore, the CSM of one protein from Streptococcus sanguis has been shown to

mediate aggregation of platelets (26), while it is less clear whether the CSMs from two

Streptococcus pyogenes proteins contribute to the adhesive properties of these proteins

(21,24). The viral proteins containing CSMs identified in this study were mainly encoded

by bacteriophages and had to a large extent been annotated or could easily be annotated

by similarity searches. The majority of these proteins were tail fiber or host specificity

proteins, involved in the interactions between bacteriophages and bacteria (see e.g. (27)).

Only a minority of the bacterial proteins containing a CSM could be annotated. Three

proteins had previously been described as surface or spore associated (20-25) and from

pattern and similarity searches, an additional 13 proteins could be classified as cell wall-

attached (28) or transmembrane proteins. The non collagen-related parts of the proteins

showed surprisingly little sequence similarity to other proteins, making a functional

classification of the bacterial proteins difficult. The presence of CSMs in proteins with so

little sequence similarity indicates that the CSM is indeed a structural motif to elongate or

trimerize a variety of different, mainly extracellular proteins. The possibility remains,

however, that some CSMs may mediate binding of bacterial or bacteriophage proteins to

other macromolecules.

Evolutionary aspects- Since so few bacterial and unicellular eukaryotic genomes encode

CSMs and archaebacteria completely lack such sequences, it seems unlikely that CSMs

evolved before the diversification of archaea, bacteria, and eukaryotes. It appears

more plausible that genes encoding CSMs have moved horizontally or arisen on several

occasions during evolution than that they have been selectively lost among archaea, bacteria, and

lower eukaryotes. Horizontal gene transfer between bacteria and vertebrates has been

proposed to be relatively common (29,30), although this view has also been questioned

(31). If horizontal gene transfers of CSMs have occurred, we believe that the direction of

transfer has been from eukaryotes to bacteria. After horizontal transfer of sequences

encoding a CSM, bacteriophages may have promoted further horizontal transfer within

the bacterial kingdom leading to the "patchy" distribution seen for CSMs among

bacteria. We have no evidence for a role of viruses or bacteriophages in the putative

horizontal transfer between eukaryotes and bacteria, but this cannot be excluded. After

the putative horizontal transfer, extensive rearrangements seem to have occurred in the

bacterial CSMs, since there are limited sequence similarities except for the spacing of

glycines in the bacterial CSMs. The limited sequence similarities also made phylogenetic

analyses uninformative. Interestingly, the threonine-rich CSMs are very

dissimilar in amino acid composition from human collagens and from the other bacterial

CSMs. Instead, the threonine-rich CSMs have their closest homologues in hydrothermal

vent worm cuticle collagens (14). This suggests that several events of horizontal gene

transfer of the genetic material encoding CSMs could have occurred during evolution. An

alternative evolutionary explanation for the threonine-rich CSMs is that these have arisen

de novo in bacteria, followed by horizontal spread to a few other bacterial species.

Especially the threonine-rich CSMs are highly repetitive and could have evolved from

duplications of a short DNA segment.

The presence of CSMs in bacteria and viruses underlines the importance of collagen as a

structural motif in nature. This work also suggests that alternative means for triple-helix

stabilization probably operate in bacteria and viruses, and future elucidation of the

structure and function of bacterial and viral CSMs represents an interesting scientific

task.

Acknowledgments

TIGR and other generous providers of sequence data are acknowledged. We thank Dr.

Robert Janulczyk for constructive comments. This work was supported by grants from

the Swedish Research Council (projects 7480 and 14379); the Medical Faculty, Lund

University; the Foundations of Kock and Österlund; and Hansa Medical AB.

References

1. Beck, K., and Brodsky, B. (1998) J Struct Biol 122, 17-29

2. Charalambous, B. M., Keen, J. N., and McPherson, M. J. (1988) EMBO J. 7, 2903-

2909

3. Bamford, D. H., and Bamford, J. K. (1990) Nature 344, 497

4. Medveczky, M. M., Geck, P., Vassallo, R., and Medveczky, P. G. (1993) Virus Genes

7, 349-365

5. Smith, M. C., Burns, N., Sayers, J. R., Sorrell, J. A., Casjens, S. R., and Hendrix, R.

W. (1998) Science 279, 1834

6. Ruggiero, F., Exposito, J. Y., Bournat, P., Gruber, V., Perret, S., Comte, J., Olagnier,

B., Garrone, R., and Theisen, M. (2000) FEBS Lett 469, 132-136

7. Perret, S., Merle, C., Bernocco, S., Berland, P., Garrone, R., Hulmes, D. J. S., Theisen,

M., and Ruggiero, F. (2001) J. Biol. Chem 276, 43693-43698

8. Bann, J. G., Peyton, D. H., and Bachinger, H. P. (2000) FEBS Lett 473, 237-240

9. Yang, W., Chan, V. C., Kirkpatrick, A., Ramshaw, J. A., and Brodsky, B. (1997) J

Biol Chem 272, 28837-28840

10. Xu, Y., Keene, D. R., Bujnicki, J. M., Höök, M., and Lukomski, S. (2002) J Biol

Chem 277, 27312-27318

11. Berisio, R., Vitagliano, L., Mazzarella, L., and Zagari, A. (2002) Prot. Sci. 11, 262-

270

12. Abagyan, R. A., and Totrov, M. M. (1994) J. Mol. Biol. 235, 983-1002

13. Kramer, R. Z., Bella, J., Mayville, P., Brodsky, B., and Berman, H. M. (1999) Nat

Struct Biol 6, 454-457

14. Mann, K., Mechling, D. E., Bachinger, H. P., Eckerskorn, C., Gaill, F., and Timpl, R.

(1996) J Mol Biol 261, 255-266

15. Chan, V. C., Ramshaw, J. A. M., Kirkpatrick, A., Beck, K., and Brodsky, B. (1997) J.

Biol. Chem. 272, 31441-31446

16. Doi, T., Higashino, K., Kurihara, Y., Wada, Y., Miyazaki, T., Nakamura, H., Uesugi,

S., Imanishi, T., Kawabe, Y., and Itakura, H. (1993) J. Biol. Chem. 268, 2126-2133

17. Sylvestre, P., Couture-Tosi, E., and Mock, M. (2003) J. Bacteriol. 185, 1555-1563

18. Caldentey, J., Tuma, R., and Bamford, D. H. (2000) Biochemistry 39, 10566-10573

19. Erickson, P. R., and Herzberg, M. C. (1987) J. Immunol. 138, 3360-3366

20. Rasmussen, M., Edén, A., and Björck, L. (2000) Infect. Immun. 68, 6370-6377

21. Lukomski, S., Nakashima, K., Abdi, I., Cipriano, V. I., Ireland, R. M., Reid, S. D.,

Adams, G. G., and Musser, J. M. (2000) Infect. Immun. 68, 6542-6553

22. Whatmore, A. (2001) Microbiology 147, 419-429

23. Lukomski, S., Nakashima, K., Abdi, I., Cipriano, V. J., Shelvin, B. J., Graviss, E. A.,

and Musser, J. M. (2001) Infect. Immun. 69, 1729-1738

24. Rasmussen, M., and Björck, L. (2001) Mol. Microbiol. 40, 1427-1438

25. Sylvestre, P., Couture-Tosi, E., and Mock, M. (2002) Mol Microbiol 45, 169-178

26. Erickson, P. R., and Herzberg, M. C. (1993) J. Biol. Chem. 268, 1646-1649

27. Duplessis, M., and Moineau, S. (2001) Mol. Microbiol. 41, 325-336

28. Janulczyk, R., and Rasmussen, M. (2001) Infect. Immun. 69, 4019-4026

29. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J.,

Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K.,

Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan,

K., Meldrim, J., Mesirov, J. P., Miranda, C., Morris, W., Naylor, J., Raymond, C.,

Rosetti, M., Santos, R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N.,

Subramanian, A., Wyman, D., Rogers, J., Sulston, J., and Ainscough, R. (2001) Nature

409, 860-921

30. Syvanen, M. (2002) Trends in Genetics 18, 245-248

31. Stanhope, M. J., Lupas, A., Italia, M. J., Koretke, K. K., Volker, C., and Brown, J. R.

(2001) Nature 411, 940-944

32. Ramshaw, J. A. M., Shah, N. K., and Brodsky, B. (1998) J. Struct. Biol. 122, 86-91

33. Bella, J., Eaton, M., Brodsky, B., and Berman, H. M. (1994) Science 266, 75-81

Legends to figures

FIG 1. Occurrence and composition of CSMs

A: The number of proteins containing a CSM in the various organisms is given, and the

number of genomes encoding these CSMs is indicated within parentheses.

The asterisk indicates that searches were made against all entries containing viral

sequences. The number of genomes analyzed and the number of proteins containing

CSMs in a given genome are also shown.

B: The position-specific distribution of glycines, prolines, glutamines, and threonines of

the bacterial and viral CSMs and human collagens is given. The values represent the

percentage of a given amino acid in a specific position of the GXY triplet. The asterisk

indicates that prolines in the Y position of human collagens are most often hydroxylated.

Data on amino acid frequency for human collagens were from a previous report (32).

FIG 2. Detailed view of the homology model of a threonine-rich CSM from B. anthracis.

The conformation of threonine side-chains was determined by an exhaustive systematic

search procedure. Hydrogen bonds are shown in black. Part of the hydroxyproline-

containing triple-helical structure in PDB entry 1cag (33) is shown in cyan for

comparison.

FIG 3. Amino acid composition of bacterial and viral CSMs

A: The CSMs identified in bacterial and viral proteins (53 and 47 respectively) were

categorized into three groups. The percentage of CSMs falling into each group is given.

The threonine-rich CSMs have more than 50 % of Y positions occupied by threonine, the

proline-rich have more than 30 % of X and Y positions occupied by proline, whereas

charged CSMs have more than 45 % charged residues in the X and Y positions. Some

CSMs cannot be classified into one of these groups, whereas some meet more than one

criterion.

B: The skew in charge-distribution between the X and Y position of GXY triplets from

charged CSMs is shown. For every charged CSM, the net charge (K/R=+1, D/E=-1) of

residues in the Y position minus the net charge of residues in the X position was divided

by the total number of charged residues. Thus, a CSM with all positively charged

residues in the Y position and all negatively charged residues in the X position will obtain

a value of +1, whereas a CSM with all positively charged residues in the X position and

all negatively charged residues in the Y position will obtain a value of -1. Each dot in the

diagram represents one CSM.
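
The skew value defined in panel B can be computed as in the following Python sketch. The function name charge_skew is hypothetical, and returning None for a CSM without charged residues is an assumption, since the legend only defines the value for charged CSMs.

```python
def charge_skew(csm):
    """Charge skew between the Y and X positions of an uninterrupted GXY repeat.

    Computes (net charge in Y - net charge in X) / total charged residues,
    with K/R = +1 and D/E = -1. Returns None if there are no charged residues.
    """
    charge = {"K": 1, "R": 1, "D": -1, "E": -1}
    x, y = csm[1::3], csm[2::3]
    net_x = sum(charge.get(aa, 0) for aa in x)
    net_y = sum(charge.get(aa, 0) for aa in y)
    total = sum(aa in charge for aa in x + y)
    if total == 0:
        return None
    return (net_y - net_x) / total
```

A repeat such as GEKGDR, with all negative charges in X and all positive charges in Y, yields +1, and the mirror arrangement GKEGRD yields -1, in agreement with the limits stated in the legend.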

A.

Organisms             no of CSMs   no of genomes   no of CSMs/genome
viruses/phages        47 (47)      *               1
eubacteria            53 (25)      136             1-9
archaebacteria        0            15              0
unicellular eukarya   3 (3)        30              1

B.

                    G      X                           Y
Human collagens     100%   P 28.2%  T 1.2%  Q 3.0%     P* 38.4%  T 2.6%   Q 6.0%
Bacterial CSMs      100%   P 19.0%  T 0.4%  Q 1.7%     P 4.2%    T 31.6%  Q 11.7%
Viral/phage CSMs    100%   P 34.5%  T 1.0%  Q 0.7%     P 12.5%   T 18.1%  Q 15.7%

Magnus Rasmussen, Micael Jacobsson and Lars Björck. Genome-based identification and analysis of collagen-related structural motifs in bacterial and viral proteins. J. Biol. Chem., published online June 3, 2003. doi: 10.1074/jbc.M304709200
