Int. J. Peptide Protein Res. 6, 1974, 175-181
Published by Munksgaard, Copenhagen, Denmark
No part may be reproduced by any process without written permission from the author(s)
PRESENCE OF REPEATING SUB-SEQUENCES AND SYMMETRY PATTERNS IN PROTEINS
SEMIH ERHAN and LARRY D. GRELLER
Department of Animal Biology, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A.
Received 24 January 1974
Comparison of amino acid sequences of ancestrally unrelated proteins, as well as of a particular protein with itself, using a sliding match performed by a computer (Greller & Erhan, previous paper), reveals the presence of certain repeating sequences. Furthermore, when a protein is matched against itself, a symmetry pattern is observed if either the target or the key sequence is reversed, i.e., written from the carboxyl to the amino terminus. The significance of these findings is discussed, particularly with reference to the possibility that these repeating sequences represent primordial sub-sequences from which proteins might have been formed.
The question of whether proteins have evolved from a limited number of primordial peptides, through gene reduplication or through some other mechanism, is an interesting and controversial one. Investigators of this problem appear to be divided into two groups: those who believe that proteins could have evolved from a few peptides and point out the regularities observed in various proteins (1,2,3), and those who try to prove that what the others contend to be regularities could have arisen by chance (4,5,6).
During our study of homologous sequences found among ancestrally unrelated proteins (7), we noticed that some homologous sequences occurred more than once along both protein chains. We have investigated this problem by matching certain proteins among and against themselves. This communication will describe the results of this study and a useful property of our method for demonstrating symmetry patterns in a protein.
METHODS
The matchings were performed according to our modification of McLachlan’s “double matching” method (8) which was described in the preceding paper (7).
RESULTS
A matching performed between bovine trypsinogen and Bence-Jones λ SH (human) results in a homology pattern where a certain peptide segment on one protein matches, with varying degrees of perfection, more than one peptide segment of the other protein (Table 1). When the same protein, for instance trypsinogen, is matched against itself using a sliding match, a similar result is obtained (Table 2). It is important to note that some of the peptides that are homologous within trypsinogen are also homologous with Bence-Jones λ SH (human) (Table 4). Other proteins such as DNase and RNase matched against themselves also yield similar results (Erhan & Greller, manuscript in preparation).
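The sliding match can be illustrated with a minimal Python sketch. It is not the authors' program: it assumes that only identical residues count (the m-scores of the double matching method are not reproduced here), and the function name `sliding_matches` and the threshold `min_identities` are hypothetical. The key PGFS and the segment QRPGQSPLL are taken from Table 1.

```python
def sliding_matches(key, target, min_identities):
    """Slide `key` along `target` and report windows with enough identical residues.

    Assumption: identity counting only, a simplification of the m-score scheme.
    """
    hits = []
    for start in range(len(target) - len(key) + 1):
        window = target[start:start + len(key)]
        # '.' marks an identical residue, 'x' a non-identical one
        pattern = "".join("." if a == b else "x" for a, b in zip(key, window))
        if pattern.count(".") >= min_identities:
            hits.append((start + 1, window, pattern))  # 1-based offset in target
    return hits


# Key PGFS (from bradykinin) against the Bence-Jones segment QRPGQSPLL.
hits = sliding_matches("PGFS", "QRPGQSPLL", min_identities=3)
print(hits)
```

Matching a protein against itself amounts to taking every sub-sequence of the protein in turn as the key and the whole protein as the target.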
The method's particular usefulness derives from the nature of the computer print-out obtained when a protein is matched against itself with either the target or the key sequence reversed, i.e., written from carboxyl to amino terminus. Patterns of symmetry are observed that extend throughout the length of the segments being matched. A center of symmetry is either an amino acid or a peptide bond between two amino acids. The observed symmetries are not necessarily contiguous exact matches but exact matches interspersed with non-exact matches. Nevertheless, the m-scores of these non-exactly matched amino acids display symmetry. With a little practice symmetry patterns can be detected very easily, as can be seen in Table 5.
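Why the score line is symmetric can be seen in a short Python sketch. It assumes only that the score of a pair of residues does not depend on their order; the `identity` function below is a hypothetical stand-in for the m-score, and the segment ALVRQGLA is the S. aureus nuclease example of Table 5.

```python
def match_profile(seq, score):
    """Score each residue of `seq` against the residue aligned with it
    when the reversed copy of `seq` is written underneath."""
    reversed_seq = seq[::-1]
    return [score(a, b) for a, b in zip(seq, reversed_seq)]


def identity(a, b):
    # Hypothetical stand-in for the m-score: 1 for identical residues, else 0.
    return 1 if a == b else 0


profile = match_profile("ALVRQGLA", identity)
is_symmetric = profile == profile[::-1]
print(profile, is_symmetric)
```

Position i of the forward sequence faces position n + 1 - i of the same sequence, and position n + 1 - i faces position i, so the two positions receive the same score. For a segment of even length, as here, the center of symmetry falls on a peptide bond; for a segment of odd length it falls on an amino acid.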
TABLE 1
Homologies found among diverse proteins and peptides

Each entry gives the two sequences compared (name, position, residues) and, on the third line, the key, the m-scores, M and P(M' > M).

Bradykinin  1  RPPGFSPFR
Cytochrome c, human  42  PGYS
key = PGFS;  m-scores . . 6 .;  M = 30;  P(M' > M) = 4 × 10^…

Bradykinin  1  RPPGFSPFR
Cytochrome c, donkey  44  PGFS
key = PGFS;  m-scores . . . .;  M = 33;  P(M' > M) = 6 × 10^…

Bradykinin  1  RPPGFSPFR
Bence-Jones lambda, human Kern  36  QRPGQSPLL
key = RPPGFSPFR;  m-scores 5 3 . . 0 . . 5 3;  M = 40;  P(M' > M) = 2 × 10^…

Bradykinin  1  RPPGFSPFR
Bence-Jones lambda, human Kern  38  PGQSP
key = PGFSP;  m-scores . . 0 . .;  M = 32;  P(M' > M) = 2 × 10^…

Bradykinin, reversed (a)  1  RFPSFGPPR
Immunoglobulin G1 heavy, human Eu  53  PMFGPP
key = PSFGPP;  m-scores . 2 . . . .;  M = 43;  P(M' > M) = 2 × 10^…

Bradykinin potentiator B  1  bEGLPPRPKIPP
ICSH beta, ovine  45  LPVPPLIP
key = LPPRPKIP;  m-scores . . 3 3 . 2 . .;  M = 48;  P(M' > M) = 1 × 10^…

Bradykinin potentiator E  1  bEKWDPPPVSPP
ICSH beta, ovine  77  DPGMVSFP
key = DPPPVSPP;  m-scores . . 3 1 . . 1 .;  M = 45;  P(M' > M) = 7 × 10^…

Botulinus toxin type A  1  GDSCVEAETAGK
Bence-Jones lambda, human SH  193  SCQVTHE
key = SCSVEAE;  m-scores . . 3 . 4 3 .;  M = 43;  P(M' > M) = 6 × 10^…

Kinin, wasps  1  ZTNKKKLRGRPPGFSPFR
Growth hormone, human  81  ETQKSDL
key = ZTNKKKL;  m-scores . . 4 . 3 3 .;  M = 42;  P(M' > M) = 2 × 10^…

Trypsinogen, bovine  156  SAYPGQI
Growth hormone, human  129  SGRTGQI
key = SAYPGQIS;  m-scores . 3 2 3 . . .;  M = 40;  P(M' > M) = 2 × 10^…

Trypsinogen, bovine  158  YPGQIT
Bence-Jones lambda, human SH  141  YPGAVT
m-scores . . . 3 5 .;  M = 41;  P(M' > M) = 5 × 10^…

Trypsinogen, bovine  179  CQGDS
Bence-Jones lambda, human SH  20  CQGDS
m-scores . . . . .;  M = 41;  P(M' > M) = 1 × 10^…

ICSH beta, ovine  39  PSMKRVLPVPPLIPMPQ
Collagen α1 fragment, rat  PSGPRGLPGPPGAPGPQ
m-scores . . 1 3 . 2 . . 2 . . 1 2 . 1 . .;  M = 92;  P(M' > M) = 2 × 10^-7

Botulinus toxin type A  1  SCSVEAE
Bence-Jones lambda, human SH  193  SCQVTHE
m-scores . . 3 . 4 3 .;  M = 43;  P(M' > M) = 6 × 10^…

DNase, bovine, pancreatic  129  SHSAPS
RNase, bovine, pancreatic  16  STSAAS
m-scores . 4 . . 4 .;  M = 40;  P(M' > M) = 2 × 10^…

DNase, bovine, pancreatic  204  NCAY
RNase, bovine, pancreatic  94  NCAY
m-scores . . . .;  M = 34;  P(M' > M) = 2 × 10^…

(a) "reversed" indicates that the sequence is written carboxyl to amino end as opposed to the conventional amino to carboxyl direction.
Only with bradykinin potentiators B & E, amino acid E indicates the unusual pyroglutamic acid (2-pyrrolidone-5-carboxylic acid) rather than glutamic acid. M is the score for the length of homologous regions indicated here by the length of the third line. P(M' > M) is the probability that a score could occur by chance.
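TABLE 2
Homologies found between trypsinogen and Bence-Jones λ SH (human) and between DNase and RNase

Each entry gives the two segments compared (position, residues) and, on the third line, the m-scores, M and P(M' > M).

Trypsinogen  5 (a)  SLINSQW
Bence-Jones  179  SLTPEQW
m-scores . . 3 1 3 . .;  M = 40;  P(M' > M) = 2.1 × 10^-4

Trypsinogen  82  SYNSNTLNN
Bence-Jones  176  SYLSLTPEQ
m-scores . . 1 . 1 . 1 4 4;  M = 44;  P(M' > M) = 3.1 × 10^-4

Trypsinogen  152  SSCKSAYPGQITSN
Bence-Jones  175  SSYLSLTPEQWKSH
m-scores . . 1 2 . 2 1 . 3 . 3 3 . 4;  M = 67;  P(M' > M) = 3.5 × 10^-4

Trypsinogen  206  KNKPG
Bence-Jones  34  QQKPG
m-scores 4 4 . . .;  M = 32;  P(M' > M) = 1.7 × 10^-4

Trypsinogen  208  KPGVYT
Bence-Jones  156  KAGVET
m-scores . 4 . . 1 .;  M = 37;  P(M' > M) = 3.7 × 10^-4

Trypsinogen  210  GVYTKV
Bence-Jones  98  GGGTKL
m-scores . 2 0 . . 4;  M = 42;  P(M' > M) = 1.2 × 10^-3

Trypsinogen  14  GANTVPY
Bence-Jones  45  GRNNRPS
m-scores . 2 . 3 2 . 3;  M = 34;  P(M' > M) = 5.5 × 10^…

Trypsinogen  14  GANTV
Bence-Jones  142  GAVTV
m-scores . . 1 . .;  M = 33;  P(M' > M) = 2.2 × 10^-4

Trypsinogen  16  NTVPYQVS
Bence-Jones  200  STVEKTVA
m-scores 5 . . 4 1 3 . 4;  M = 41;  P(M' > M) = 3.0 × 10^-3

DNase  17  SNATLASY
RNase  18  SAASSSNY
m-scores . 3 . 5 2 4 5 .;  M = 43;  P(M' > M) = 1.0 × 10^-3

DNase  16  MSNATLA
RNase  46  FVHESLA
m-scores 5 2 4 4 5 . .;  M = 36;  P(M' > M) = 4.4 × 10^-3

DNase  129  SHSAPS
RNase  16  STSAAS
m-scores . 4 . . 4 .;  M = 40;  P(M' > M) = 1.9 × 10^-5

DNase  129  SHSAPS
RNase  75  SYSTMS
m-scores . 4 . 3 1 .;  M = 32;  P(M' > M) = 3.5 × 10^-5

DNase  204  NCAY
RNase  94  NCAY
m-scores . . . .;  M = 34;  P(M' > M) = 1.8 × 10^…

DNase  201  TSTNCA
RNase  15  SSTSAA
m-scores 5 . . 5 1 .;  M = 35;  P(M' > M) = 4.6 × 10^…

(a) Numbering is contiguous from the N-terminal amino acid; the first row is always trypsinogen, the second Bence-Jones peptides.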
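TABLE 3
Internal homologies found in trypsinogen

Each entry gives the two trypsinogen segments compared (position, residues) and, on the third line, the m-scores, M and P(M' > M).

23  SLNS
101  SLNS
m-scores . . . .;  M = 32;  P(M' > M) = 1.3 × 10^-5

24  LNSGY
123  LISGW
m-scores . 1 . . 6;  M = 31;  P(M' > M) = 5.8 × 10^-4

24  LNSGY
36  INSQW
m-scores 5 . . 2 6;  M = 29;  P(M' > M) = 6.5 × 10^-4

22  VSLNSG
197  VSWGSG
m-scores . . 3 3 . .;  M = 38;  P(M' > M) = 1.4 × 10^-4

117  SAGTQCLIS
110  SLPTSCASA
m-scores . 2 3 . 3 . 2 2 4;  M = 49;  P(M' > M) = 1.0 × 10^-3

117  SAGTQC
198  SWGSGC
m-scores . 1 . 5 2 .;  M = 33;  P(M' > M) = 9.2 × 10^…

118  AGTQCLIS
65  EGNQQFIS
m-scores 4 . 3 . 0 5 . .;  M = 44;  P(M' > M) = 2.5 × 10^-4

61  NQQFI
31  NSQWI
m-scores . 3 . 6 5;  M = 30;  P(M' > M) = 5.2 × 10^-4

65  EGNQQFIS
118  AGTQCLIS
m-scores 4 . 3 . 0 5 . .;  M = 44;  P(M' > M) = 2.5 × 10^-4

148  ILSNSSC
41  VVSAAHC
m-scores 5 4 . 3 4 3 .;  M = 36;  P(M' > M) = 2.9 × 10^-3

148  ILSNSSC
162  ITSNMFC
m-scores . 2 . . 2 2 .;  M = 39;  P(M' > M) = 2.4 × 10^-4

149  LSNS
35  LINS
m-scores . 2 . .;  M = 26;  P(M' > M) = 9.3 × 10^-4

222  IKQTIASN
144  LKAPICSN
m-scores 5 . 3 3 . 2 . .;  M = 45;  P(M' > M) = 1.8 × 10^-4

224  QTIAS
97  KSAAS
m-scores 4 5 2 . .;  M = 27;  P(M' > M) = …

226  IASN
163  ITSN
m-scores . 3 . .;  M = 27;  P(M' > M) = 4.3 × 10^-4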
In addition to symmetry patterns one also observes homologous segments when a protein is matched against itself reversed:

S. aureus nuclease  28  KGQPMTFRLL
S. aureus nuclease, reversed  133  KAQAESKRLL
m-scores . 3 . 4 1 5 0 . . .;  M = 53;  P(M' > M) = 1.9 × 10^-4

S. aureus nuclease  39  PETKHPK
S. aureus nuclease, reversed  122  EHTNNPK
m-scores 4 2 . 4 4 . .;  M = 38;  P(M' > M) = 9.8 × 10^-4

A detailed account of this phenomenon will be published elsewhere (Erhan & Greller, manuscript in preparation).

TABLE 4
Commonness of internal homologies with intermolecular homologies

Row labels (repeated four times): Trypsin; Bence-Jones λ
Positions, in printed order: 21, 79, 50; 188, 30; 167, 59; 110, 101, 85; 217, 182, 48, 53
Segments, in printed order: QVSLNS, HPSYNSNTLN, RPSGIPDRFS, VCSGKL, FCGGSL, FCAGYL, FSGSS, SLPTSCAS, SLNSRVAS, NSRDSS, YVSWIK, YKSGIQ, VRLSGIPDRF, DSGGPV
(Homologous amino acids are italicized.)
DISCUSSION
Obviously certain sequences tend to occur repeatedly in the same protein, suggesting the possibility that proteins might have been made up of a limited number of primordial subsequences originally. If this were true, then mutations occurring randomly could have led to the sequences we see today. There are compelling theoretical grounds that support this view.
Recently, Salisbury (9) has commented on a serious conflict between the concepts that random events were responsible for evolution by natural selection of adaptive genes, on the one hand, and that each gene in a DNA molecule is unique in the sequence of its bases, on the other. Referring to an earlier work by Quastler (10), he states “the probability of life (an organism with 1000 information bits...) originating by a ‘lucky accident’... in 2 × 10^9 years is about … - essentially no probability at all!” Again, using an example in which a very small DNA molecule containing only 1000 nucleotide pairs replicates and undergoes mutation simultaneously, under unrealistically favorable conditions, 10^6 times per second, he calculated that the probability of producing a DNA molecule like this in four billion years and in a volume equivalent to 10^20 planets similar to earth is 10^(-5…), i.e., practically nil.

TABLE 5
Symmetry patterns

Each entry gives a segment, the same segment reversed, and the m-scores of the aligned residues.

S. aureus nuclease  102  ALVRQGLA
reversed  109  ALGQRVLA
m-scores . . 2 5 5 2 . .

S. aureus nuclease  112  YVYKPNNTHEQLLRKSEAQAKKEKLNIWSENDADSGQ
reversed  148  QGSDADNESWINLKEKKAQAESKRLLQEHTNNPKYVY
m-scores 1 2 3 3 4 5 . 4 3 1 0 1 . 5 4 3 4 . . . 4 3 4 5 . 1 0 1 3 4 . 5 4 3 3 2 1

S. aureus nuclease  82  RGLAYIYADGK
reversed  96  KGDAYIYALGR
m-scores 5 . 1 . . . . . 1 . 5

RNase T1  8  SNCYSSS
reversed  14  SSSYCNS
m-scores . 5 2 . 2 5 .

RNase T1  1  ACDYTCGSNCYSSSDVSTAQA
reversed  21  AQATSVDSSSYCNSGCTYDCA
m-scores . 0 3 1 5 1 3 . 5 2 . 2 5 . 3 1 5 1 3 0 .

RNase T1  31  ETVGSNSYPHKYNNYEGFDFSVSSPYYEWPILSSGDVY
reversed  68  YVDGSSLIPWEYYPSSVSFDFGEYNNYKHPYSNSGVTE
m-scores 1 3 1 . . 5 2 3 . 3 4 . 2 1 3 3 2 2 1 1 2 2 3 3 1 2 . 4 3 . 3 2 5 . . 1 3 1
It is entirely possible, however, for a much larger molecule to be produced by chance in a much shorter time if the large molecules are put together stepwise from elemental building blocks, giving rise to sub-assemblies which in turn combine to give larger sub-assemblies which then yield the macromolecule. Simon (11) states, “if a system of k elementary components is built up in a many level hierarchy, and s components, on the average, combine at any level into a component at the next higher level, then the expected time of evolution for the whole system will be proportional to the logarithm to base s of k. In such a hierarchy, the time required for systems containing 10^25 atoms to evolve from systems containing 10^23 atoms would be the same as the time required for systems containing 10^3 atoms to evolve from systems containing 10 atoms.”
Thus it is conceivable that the formation of present-day proteins from a limited number of primordial peptides would circumvent the dilemma described by Salisbury, and that such a primordial process is quite feasible.
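Simon's logarithmic relation can be checked numerically with the following Python sketch. It assumes one arbitrary time unit per hierarchical level; the function names and the second pair of system sizes (10^6 and 10^8) are hypothetical and chosen only to have the same ratio as the step from 10 to 10^3 atoms.

```python
import math


def hierarchy_levels(k, s):
    """Number of assembly levels needed to build k elementary components
    when s components combine at each level: log to base s of k."""
    return math.log(k) / math.log(s)


def relative_time(k_start, k_end, s):
    # Assumption: one arbitrary time unit per level of the hierarchy.
    return hierarchy_levels(k_end, s) - hierarchy_levels(k_start, s)


small = relative_time(10, 10**3, s=10)      # from 10 to 10^3 atoms
large = relative_time(10**6, 10**8, s=10)   # hypothetical larger system, same ratio
print(small, large)
```

Because the time depends only on the ratio of the final to the initial size, the two steps take the same time, whereas a process that had to try whole sequences at random would scale with the number of possible sequences.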
The consequence of randomness of mutations in repeated segments, particularly from the standpoint of practical application, is rather important. Each mutation that effected a change of one amino acid when it occurred results in the appearance of two different amino acids (B,Y) in two originally identical sequences:

(i) . . . . . A B C D E . . . . . A B C D E . . . . . giving . . . . . A Y C D E . . . . . A B C D E . . . . .

This is because we do not know today which of the two amino acids we see represents the original amino acid before mutation altered one of them. Similarly, two mutations result in the appearance of four different amino acids (B,Y,D,F):

(ii) . . . . . A B C D E . . . . . A B C D E . . . . . giving . . . . . A Y C D E . . . . . A B C F E . . . . .
In other words, when one of the amino acids in the above example changes due to a mutation, the chance of finding two identical sub-sequences is decreased by 20%. With two amino acid changes this chance is decreased by 40%. Under these circumstances it would actually be a miracle to find these sub-sequences exactly alike in a long protein, say one having ca. 100 amino acids. This fact can be appreciated by looking at any alignment chart in Dayhoff (12). It has to be repeated here that this argument holds under the premise that the proteins are made up of repeating sub-sequences. Hence to use a statistical approach to find an answer to the problem of whether the regularities observed in the primary structure of proteins could have arisen by chance, without giving thorough consideration to the factors listed above, can only lead to erroneous conclusions, namely, that they could arise by chance! The major cause of this oversight is the concept that the amino acid sequence of proteins is random, which has somehow survived to the present day.
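The 20% and 40% figures follow from the five-residue example and can be reproduced with a minimal Python sketch; the function name `percent_identity` is hypothetical, and the segments are those of the example above.

```python
def percent_identity(a, b):
    """Percentage of aligned positions at which two equal-length segments agree."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * same / len(a)


original = "ABCDE"
one_change = percent_identity("AYCDE", original)   # one mutation in one copy
two_changes = percent_identity("AYCDE", "ABCFE")   # one mutation in each copy
print(one_change, two_changes)
```

With one change the two copies agree at four of five positions, and with two changes at three of five, i.e., decreases of 20% and 40% from full identity.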
The earliest statistical studies on amino acid sequences in proteins began with the purpose of trying to understand the nature of genetic coding (13), and nearest neighbor pairs in proteins were searched. An extension of this study by Ycas (4) led to the belief that globular proteins were a set of random sequences of amino acids. During this time Sorm had already published his papers suggesting that there were regularities in the amino acid sequences of proteins which could not be attributed to coincidence. In 1961 Williams, Clegg & Mutch addressed themselves to this problem because Sorm had not based his conclusions on statistical studies. They compared random sequences of amino acids, corresponding to RNase, trypsinogen, chymotrypsinogen, tobacco mosaic virus protein and the A and B chains of hemoglobin, for the presence of intramolecular repeats of peptides and concluded that none of these proteins differed significantly from random sequences with respect to intramolecular repeats. A point to remember here is the size of the repeats being considered: a preponderance of doublets, then triplets, and a few quadruplets. It is clear that the shorter the sequences that are being considered identical, the greater is the probability that they might have occurred by chance. The presence of tri-, tetra- and pentapeptides with significant biological activity (16,17), however, speaks against ignoring regularities just because they happen to be short.
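How quickly chance repeats become rare as their length grows can be illustrated with a Python sketch. It rests on the very assumption questioned in this paper, namely that the 20 amino acids occur independently and with equal probability; the function name and the sequence length of 100 residues are chosen only for illustration.

```python
def expected_chance_repeats(n, k, alphabet=20):
    """Expected number of pairs of identical k-residue windows in a random
    sequence of length n, assuming independent, equally likely residues."""
    windows = n - k + 1
    if windows < 2:
        return 0.0
    pairs = windows * (windows - 1) // 2
    return pairs / alphabet**k


# Expected chance repeats of doublets up to pentapeptides in 100 residues.
expected = {k: expected_chance_repeats(100, k) for k in (2, 3, 4, 5)}
print(expected)
```

Under this assumption each additional residue in the repeat lowers the expected number of chance repeats by roughly a factor of 20, which is why doublets are abundant in random sequences and longer repeats are not.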
Only after many proteins are compared as to their repeating sub-sequences, however, will one be able to state with sufficient certainty what the nature and number of these sub-sequences are.
Interestingly, when a small peptide such as bradykinin is matched against a larger protein, the same repeating of homologous sequences occurs and underlines the general nature of this phenomenon.
The idea that proteins could be formed from polypeptides was recently proposed on very practical grounds (18). The most impressive experimental evidence that proteins are not random sequences of amino acids comes from the laboratory of Fox & Nakashima (19). An equimolar mixture of tyrosine, glycine and glutamic acid which was heated to 170°C yielded a hexapeptide of a definite sequence comprising 10% of the product: pyroglutamylglycyltyrosyl-,r-glutaminyl tyrosylglycine.
The studies by Steinman (20) also indicated that side chains of amino acids do influence which amino acid will follow another one. The reason for discussing these investigations at some length, particularly stressing the randomness vs. the nonrandomness of proteins, is to point out the consequences of these two concepts. Assumption of randomness of amino acid sequences seems to entitle the investigator to obtain simple probabilities for the occurrence of an amino acid, any amino acid, at any position of a protein, completely independently of the amino acid preceding as well as the one succeeding it, as was done by Williams et al. (5). This leads to an overestimation of the number of vertical matches between the amino acids of two proteins by chance because each amino acid is considered not only independent of the others but equivalent to the others (7). Evidence obtained from in vivo systems also supports the views expressed above. It has been demonstrated that a corticotrophin-like intermediate lobe peptide is formed, together with α-melanocyte stimulating hormone, from corticotrophin (ACTH) in the pars intermedia of pig and rat pituitaries (21).
Furthermore, reacquisition of the ability to hydrolyze β-galactosides by a mutant of E. coli K12 with a deletion of the lac Z gene was reported. It was demonstrated that the enzyme that evolved differed from the lac Z β-galactosidase of wild-type E. coli in its immunological, kinetic and sedimentation characteristics. The authors suggested that the adaptation of a gene not involved in lactose utilization to a form capable of specifying a β-galactosidase is similar to evolution by natural selection (22).
We believe that if there were subsequences, on many proteins, homologous to the active site(s) of β-galactosidase, they would be the most likely starting point for the phenomenon described above, instead of just any peptide somehow being converted to a state which is capable of performing an enzymatic reaction. The gene that codes for the one protein, of the many which may have these repeating subsequences, whose three dimensional conformation is closest to that of β-galactosidase is modified so that it can finally hydrolyze lactose. Finally, we feel it is proper for us to acknowledge that the repeat and symmetry patterns that we are reporting here were first observed and described by Sorm & Keil (23).
Note added in proof: Our preliminary experiments on simulation of evolution suggest that during 1.3 × 10^9 years, 57 of the amino acids of a 100 amino acid long protein would have been altered. Hence, the level of significance accepted by us may turn out to be lower than what could be attained under the evolutionary pressures!
REFERENCES
1. SORM, F. (1958) in Symposium on Protein Structure, (NEUBERGER, A., ed.), p. 77, Methuen, London.
2. SORM, F. & KEIL, B. (1958) Coll. Czech. Chem. Commun. 23, 1575-1578.
3. FITCH, W. M. (1966) J. Mol. Biol. 16, 17-27.
4. YCAS, M. (1958) in Information Theory in Biology, (YOCKEY, H. P., ed.), p. 70, Pergamon Press, London.
5. WILLIAMS, J., CLEGG, J. B. & MUTCH, M. O. (1961) J. Mol. Biol. 3, 532-540.
6. MCLACHLAN, A. D. (1972) J. Mol. Biol. 64, 417-437.
7. GRELLER, L. D. & ERHAN, S. (1974) Int. J. Pep. Prot. Res. 6, 165-173.
8. MCLACHLAN, A. D. (1971) J. Mol. Biol. 61, 409-424.
9. SALISBURY, F. B. (1969) Nature 224, 342-343.
10. QUASTLER, H. (1964) in The Emergence of Biological Organization, Yale University Press, London.
11. SIMON, H. A. (1973) in Hierarchy Theory, (PATTEE, H. H., ed.), p. 3, George Braziller, New York.
12. DAYHOFF, M. O. (1972) in Atlas of Protein Sequence and Structure, vol. 5, Nat. Biomed. Res. Foundation, Wash., D.C.
13. GAMOW, G., RICH, A. & YCAS, M. (1956) in Advances in Biological and Medical Physics, p. 23, vol. 4, Academic Press, New York.
14. SORM, F. (1954) Coll. Czech. Chem. Commun. 19, 1003-1005.
15. SORM, F., KEIL, B., HOLEYSOVSKY, V., KNESSLOVA, V., KOSTKA, V., MASIAR, P., MELOUN, B., MIKES, O., TOMASEK, V. & VANECEK, J. (1957) Coll. Czech. Chem. Commun. 22, 1310-1329.
16. NAIR, R. M. G., KASTIN, A. J. & SCHALLY, A. V. (1971) Biochem. Biophys. Res. Commun. 43, 1376-1381.
17. BURGUS, R., DUNN, T. F., DESIDERIO, D. & GUILLEMIN, R. (1969) C.R. Acad. Sci., Ser. D, 269, 1870-1873.
18. PIGMAN, W., DOWNS, F., MOSCHERA, J. & WEISS, M. (1970) in Proceedings of the International Conference on Blood and Tissue Antigens, (AMINOFF, D., ed.), p. 205, Academic Press, New York.
19. FOX, S. W. & NAKASHIMA, T. (1967) Biochim. Biophys. Acta 140, 155-167.
20. STEINMAN, G. & COLE, M. N. (1967) Proc. Natl. Acad. Sci. U.S. 58, 735-742.
21. SCOTT, A. P., RATCLIFFE, J. G., REES, L. H., LANDON, J., BENNETT, H. P. J., LOWRY, P. J. & MCMARTIN, C. (1973) Nature New Biol. 244, 65-67.
22. CAMPBELL, J. H., LENGYEL, J. A. & LANGRIDGE, J. (1973) Proc. Natl. Acad. Sci. U.S. 70, 1841-1845.
23. SORM, F. & KEIL, B. (1962) in Advances in Protein Chemistry, p. 167-205, Academic Press, New York.
Address:
Semih Erhan
Dept. of Animal Biology
School of Veterinary Medicine
University of Pennsylvania
Philadelphia, Pennsylvania 19174
U.S.A.