PRESENCE OF REPEATING SUB-SEQUENCES AND SYMMETRY PATTERNS IN PROTEINS

Int. J. Peptide Protein Res. 6, 1974, 175-181. Published by Munksgaard, Copenhagen, Denmark. No part may be reproduced by any process without written permission from the author(s).


SEMIH ERHAN and LARRY D. GRELLER

Department of Animal Biology, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A.

Received 24 January 1974

Comparison of amino acid sequences of ancestrally unrelated proteins, as well as of a particular protein with itself, using a sliding match performed by a computer (Greller & Erhan, previous paper), reveals the presence of certain repeating sequences. Furthermore, when a protein is matched against itself, a symmetry pattern is observed if either the target or the key sequence is reversed, i.e., written from the carboxyl to the amino terminus. The significance of these findings is discussed, particularly with reference to the possibility that these repeating sequences represent primordial sub-sequences from which proteins might have been formed.

The question of whether proteins have evolved from a limited number of primordial peptides through gene reduplication or through some other mechanism is an interesting and controversial one. Investigators of this problem appear to be divided into two groups: those who believe that proteins could have evolved from a few peptides and point to the regularities observed in various proteins (1,2,3), and those who try to show that what the others regard as regularities could have arisen by chance (4,5,6).

During our study of homologous sequences found among ancestrally unrelated proteins (7), we noticed that some homologous sequences occurred more than once along both protein chains. We have investigated this problem by matching certain proteins among and against themselves. This communication will describe the results of this study and a useful property of our method for demonstrating symmetry patterns in a protein.

METHODS

The matchings were performed according to our modification of McLachlan’s “double matching” method (8) which was described in the preceding paper (7).
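The idea of a sliding match can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the function names `identity_score` and `sliding_match` are hypothetical, and a simple identity score (1 for identical residues, 0 otherwise) stands in for the m-scores of the double matching method, which are not reproduced here. The key PGFS and the Bence-Jones segment QRPGQSPLL are taken from Table 1.

```python
from typing import Callable, List, Tuple


def identity_score(a: str, b: str) -> int:
    """Stand-in residue score (assumption): 1 for identical residues, else 0."""
    return 1 if a == b else 0


def sliding_match(
    key: str,
    target: str,
    score: Callable[[str, str], int] = identity_score,
) -> List[Tuple[int, str, int]]:
    """Slide `key` along `target`.

    Returns one (1-based offset, target window, summed score) tuple
    for every window of the target that has the same length as the key.
    """
    n = len(key)
    hits = []
    for start in range(len(target) - n + 1):
        window = target[start:start + n]
        total = sum(score(k, t) for k, t in zip(key, window))
        hits.append((start + 1, window, total))
    return hits


# Key from bradykinin against the Bence-Jones segment listed in Table 1.
hits = sliding_match("PGFS", "QRPGQSPLL")
best = max(hits, key=lambda h: h[2])
print(best)
```

With the identity score the best window is PGQS, the third residue of the segment; since Table 1 numbers the segment from position 36, this corresponds to position 38, the position Table 1 gives for PGQSP. A graded residue score would change the totals but not the sliding procedure itself.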

RESULTS

A matching performed between bovine trypsinogen and Bence-Jones lambda SH (human) results in a homology pattern where a certain peptide segment on one protein matches, with varying degrees of perfection, more than one peptide segment of the other protein (Table 1). When the same protein, for instance trypsinogen, is matched against itself using a sliding match, a similar result is obtained (Table 2). It is important to note that some of the peptides that are homologous within trypsinogen are also homologous with Bence-Jones lambda SH (human) (Table 4). Other proteins such as DNase and RNase matched against themselves also yield similar results (Erhan & Greller, manuscript in preparation).

The method’s particular usefulness derives from the nature of the computer print-out obtained when a protein is matched against itself with either the target or the key sequence reversed, i.e., written from carboxyl to amino terminus. Patterns of symmetry are observed that extend throughout the length of the segments being matched. A center of symmetry is either an amino acid or a peptide bond between two amino acids. The observed symmetries are not necessarily contiguous exact matches but exact matches interspersed with non-exact matches. Nevertheless, the m-scores of these non-exactly matched amino acids display symmetry. With a little practice symmetry patterns can be detected very easily, as can be seen in Table 5.
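A reversed self-match can be illustrated with one of the segments of Table 5. The sketch below is an assumption-laden simplification: the helper `match_row` is hypothetical, and it only marks exact matches with a dot and everything else with an "x", instead of printing m-scores. With any pair score that treats (a, b) the same as (b, a), the row obtained by aligning a segment with its own reversal reads the same from either end.

```python
from typing import List


def match_row(seq_a: str, seq_b: str) -> List[str]:
    """Mark each aligned pair: '.' for an exact match, 'x' otherwise."""
    return ["." if a == b else "x" for a, b in zip(seq_a, seq_b)]


segment = "ALVRQGLA"               # S. aureus nuclease segment from Table 5
reversed_segment = segment[::-1]   # written carboxyl to amino terminus

row = match_row(segment, reversed_segment)
is_symmetric = row == row[::-1]

print(segment)
print(reversed_segment)
print(" ".join(row))
print("symmetric:", is_symmetric)
```

The exact matches fall on the first two and the last two residues, the same places where Table 5 shows dots for this segment. Because the segment has an even number of residues, the center of symmetry here is the peptide bond between the fourth and fifth residues.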


TABLE 1
Homologies found among diverse proteins and peptides

Each entry gives the two segments with the position of their first residue, then the key (where one was printed), the m-scores of the aligned residues, M and P(M' > M).

Bradykinin 1 RPPGFSPFR
Cytochrome c, human 42 PGYS
key = PGFS; m-scores . . 6 .; M = 30; P(M' > M) = 4 × 10^-?

Bradykinin 1 RPPGFSPFR
Cytochrome c, donkey 44 PGFS
key = PGFS; m-scores . . . .; M = 33; P(M' > M) = 6 × 10^-?

Bradykinin 1 RPPGFSPFR
Bence-Jones lambda, human Kern 36 QRPGQSPLL
key = RPPGFSPFR; m-scores 5 3 . . 0 . . 5 3; M = 40; P(M' > M) = 2 × 10^-?

Bradykinin 1 RPPGFSPFR
Bence-Jones lambda, human Kern 38 PGQSP
key = PGFSP; m-scores . . 0 . .; M = 32; P(M' > M) = 2 × 10^-?

Bradykinin reversed (a) 1 RFPSFGPPR
Immunoglobulin G1 heavy, human Eu 53 PMFGPP
key = PSFGPP; m-scores . 2 . . . .; M = 43; P(M' > M) = 2 × 10^-?

Bradykinin potentiator B (b) 1 EGLPPRPKIPP
ICSH beta, ovine 45 LPVPPLIP
key = LPPRPKIP; m-scores . . 3 3 . 2 . .; M = 48; P(M' > M) = 1 × 10^-?

Bradykinin potentiator E (b) 1 EKWDPPPVSPP
ICSH beta, ovine 77 DPGMVSFP
key = DPPPVSPP; m-scores . . 3 1 . . 1 .; M = 45; P(M' > M) = 7 × 10^-?

Botulinus toxin type A 1 GDSCVEAETAGK
Bence-Jones lambda, human SH 193 SCQVTHE
key = SCSVEAE; m-scores . . 3 . 4 3 .; M = 43; P(M' > M) = 6 × 10^-?

Kinin, wasps 1 ZTNKKKLRGRPPGFSPFR
Growth hormone, human 81 ETQKSDL
key = ZTNKKKL; m-scores . . 4 . 3 3 .; M = 42; P(M' > M) = 2 × 10^-?

Trypsinogen, bovine 156 SAYPGQI
Growth hormone, human 129 SGRTGQI
key = SAYPGQI; m-scores . 3 2 3 . . .; M = 40; P(M' > M) = 2 × 10^-?

Trypsinogen, bovine 158 YPGQIT
Bence-Jones lambda, human SH 141 YPGAVT
m-scores . . . 3 5 .; M = 41; P(M' > M) = 5 × 10^-?

Trypsinogen, bovine 179 CQGDS
Bence-Jones lambda, human SH 20 CQGDS
m-scores . . . . .; M = 41; P(M' > M) = 1 × 10^-?

ICSH beta, ovine 39 PSMKRVLPVPPLIPMPQ
Collagen α1 fragment, rat PSGPRGLPGPPGAPGPQ
m-scores . . 1 3 . 2 . . 2 . . 1 2 . 1 . .; M = 92; P(M' > M) = 2 × 10^-7

Botulinus toxin type A 1 SCSVEAE
Bence-Jones lambda, human SH 193 SCQVTHE
m-scores . . 3 . 4 3 .; M = 43; P(M' > M) = 6 × 10^-?

DNase, bovine, pancreatic 129 SHSAPS
RNase, bovine, pancreatic 16 STSAAS
m-scores . 4 . . 4 .; M = 40; P(M' > M) = 2 × 10^-?

DNase, bovine, pancreatic 204 NCAY
RNase, bovine, pancreatic 94 NCAY
m-scores . . . .; M = 34; P(M' > M) = 2 × 10^-?

(a) “reversed” indicates that the sequence is written carboxyl to amino end as opposed to the conventional amino to carboxyl direction.

(b) Only with bradykinin potentiators B & E, amino acid E indicates the unusual pyroglutamic acid (2-pyrrolidone-5-carboxylic acid) rather than glutamic acid.

In the m-score rows a dot marks an exact match. M is the score for the length of homologous regions, indicated here by the length of the third line. P(M' > M) is the probability that a score could occur by chance. A question mark replaces an exponent that is illegible in this copy.


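TABLE 2
Homologies found between trypsinogen and Bence-Jones lambda SH (human) and between DNase and RNase

Each entry gives the position (a) and sequence of the two segments, the m-scores of the aligned residues, M and P(M' > M).

Trypsinogen [position illegible] SLINSQW
Bence-Jones 179 SLTPEQW
m-scores . . 3 1 3 . .; M = 40; P(M' > M) = 2.1 × 10^-4

Trypsinogen 82 SYNSNTLNN
Bence-Jones 176 SYLSLTPEQ
m-scores . . 1 . 1 . 1 4 4; M = 44; P(M' > M) = 3.1 × 10^-4

Trypsinogen 152 SSCKSAYPGQITSN
Bence-Jones 175 SSYLSLTPEQWKSH
m-scores . . 1 2 . 2 1 . 3 . 3 3 . 4; M = 67; P(M' > M) = 3.5 × 10^-4

Trypsinogen 206 KNKPG
Bence-Jones 34 QQKPG
m-scores 4 4 . . .; M = 32; P(M' > M) = 1.7 × 10^-4

Trypsinogen 208 KPGVYT
Bence-Jones 156 KAGVET
m-scores . 4 . . 1 .; M = 37; P(M' > M) = 3.7 × 10^-4

Trypsinogen 210 GVYTKV
Bence-Jones 98 GGGTKL
m-scores . 2 0 . . 4; M = 42; P(M' > M) = 1.2 × 10^-3

Trypsinogen 14 GANTVPY
Bence-Jones 45 GRNNRPS
m-scores . 2 . 3 2 . 3; M = 34; P(M' > M) = 5.5 × 10^-?

Trypsinogen 14 GANTV
Bence-Jones 142 GAVTV
m-scores . . 1 . .; M = 33; P(M' > M) = 2.2 × 10^-4

Trypsinogen 16 NTVPYQVS
Bence-Jones 200 STVEKTVA
m-scores 5 . . 4 1 3 . 4; M = 41; P(M' > M) = 3.0 × 10^-3

DNase 17 SNATLASY
RNase 18 SAASSSNY
m-scores . 3 . 5 2 4 5 .; M = 43; P(M' > M) = 1.0 × 10^-3

DNase 16 MSNATLA
RNase 46 FVHESLA
m-scores 5 2 4 4 5 . .; M = 36; P(M' > M) = 4.4 × 10^-3

DNase 129 SHSAPS
RNase 16 STSAAS
m-scores . 4 . . 4 .; M = 40; P(M' > M) = 1.9 × 10^-5

DNase 129 SHSAPS
RNase 75 SYSTMS
m-scores . 4 . 3 1 .; M = 32; P(M' > M) = 3.5 × 10^-5

DNase 204 NCAY
RNase 94 NCAY
m-scores . . . .; M = 34; P(M' > M) = 1.8 × 10^-?

DNase 201 TSTNCA
RNase 15 SSTSAA
m-scores 5 . . 5 1 .; M = 35; P(M' > M) = 4.6 × 10^-?

(a) Numbering is contiguous from the N-terminal amino acid; the first row is always trypsinogen, the second Bence-Jones peptides. A question mark replaces an exponent that is illegible in this copy.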


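TABLE 3
Internal homologies found in trypsinogen

Each entry gives the position and sequence of the two trypsinogen segments, the m-scores of the aligned residues, M and P(M' > M).

23 SLNS
101 SLNS
m-scores . . . .; M = 32; P(M' > M) = 1.3 × 10^-5

24 LNSGY
123 LISGW
m-scores . 1 . . 6; M = 31; P(M' > M) = 5.8 × 10^-4

24 LNSGY
36 INSQW
m-scores 5 . . 2 6; M = 29; P(M' > M) = 6.5 × 10^-4

22 VSLNSG
197 VSWGSG
m-scores . . 3 3 . .; M = 38; P(M' > M) = 1.4 × 10^-4

117 SAGTQCLIS
110 SLPTSCASA
m-scores . 2 3 . 3 . 2 2 4; M = 49; P(M' > M) = 1.0 × 10^-3

117 SAGTQC
198 SWGSGC
m-scores . 1 . 5 2 .; M = 33; P(M' > M) = 9.2 × 10^-?

118 AGTQCLIS
65 EGNQQFIS
m-scores 4 . 3 . 0 5 . .; M = 44; P(M' > M) = 2.5 × 10^-4

67 NQQFI
37 NSQWI
m-scores . 3 . 6 5; M = 30; P(M' > M) = 5.2 × 10^-4

65 EGNQQFIS
118 AGTQCLIS
m-scores 4 . 3 . 0 5 . .; M = 44; P(M' > M) = 2.5 × 10^-4

148 ILSNSSC
41 VVSAAHC
m-scores 5 4 . 3 4 3 .; M = 36; P(M' > M) = 2.9 × 10^-3

148 ILSNSSC
162 ITSNMFC
m-scores . 2 . . 2 2 .; M = 39; P(M' > M) = 2.4 × 10^-4

149 LSNS
35 LINS
m-scores . 2 . .; M = 26; P(M' > M) = 9.3 × 10^-4

222 IKQTIASN
144 LKAPICSN
m-scores 5 . 3 3 . 2 . .; M = 45; P(M' > M) = 1.8 × 10^-4

224 QTIAS
97 KSAAS
m-scores 4 5 2 . .; M = 27; P(M' > M) = [illegible]

226 IASN
163 ITSN
m-scores . 3 . .; M = 27; P(M' > M) = 4.3 × 10^-4

A question mark replaces an exponent that is illegible in this copy.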

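TABLE 4
Commonness of internal homologies with intermolecular homologies

Each group lists trypsin segments that are homologous with one another, followed by the Bence-Jones lambda segment with which they are also homologous.

Trypsin 21 QVSLNS
Trypsin 79 HPSYNSNTLN
Bence-Jones lambda 50 RPSGIPDRFS

Trypsin 188 VCSGKL
Trypsin 30 FCGGSL
Trypsin 167 FCAGYL
Bence-Jones lambda 59 FSGSS

Trypsin 110 SLPTSCAS
Trypsin 101 SLNSRVAS
Bence-Jones lambda 85 NSRDSS

Trypsin 217 YVSWIK
Trypsin 182 DSGGPV
Trypsin 48 YKSGIQVRL
Bence-Jones lambda 53 SGIPDRF

(Homologous amino acids are italicized.)

In addition to symmetry patterns one also observes homologous segments when a protein is matched against itself reversed, as in these two examples from S. aureus nuclease:

reversed 133 KAQAESKRLL
28 KGQPMTFRLL
m-scores . 3 . 4 1 5 0 . . .; M = 53; P(M' > M) = 1.9 × 10^-4

39 PETKHPK
reversed 122 EHTNNPK
m-scores 4 2 . 4 4 . .; M = 38; P(M' > M) = 9.8 × 10^-4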

A detailed account of this phenomenon will be published elsewhere (Erhan & Greller, manuscript in preparation).

DISCUSSION

Obviously certain sequences tend to occur repeatedly in the same protein, suggesting the possibility that proteins might have been made up of a limited number of primordial subsequences originally. If this were true, then mutations occurring randomly could have led to the sequences we see today. There are compelling theoretical grounds that support this view.

Recently, Salisbury (9) has commented on a serious conflict between the concepts that random events were responsible for evolution by natural selection of adaptive genes, on the one hand, and that each gene in a DNA molecule is unique in the sequence of its bases, on the other. Referring to an earlier work by Quastler (10), he states “the probability of life (an organism with 1000 information bits. . .) originating by a ‘lucky accident’. . . in 2 × 10^9 years is about [value illegible] - essentially no probability at all!” Again using an example where a very small DNA molecule containing only 1000 nucleotide pairs replicates and undergoes mutation simultaneously under unrealistically favorable conditions, 10^6 times per second, he calculated that the probability of producing a DNA molecule like this in four billion years and in a volume equivalent to 10^20 planets similar to earth is 10^-[exponent illegible], i.e., practically nil.

TABLE 5
Symmetry patterns

Each entry gives a segment, the same segment reversed, and the m-scores of the aligned residues; a dot marks an exact match.

S. aureus nuclease

102 ALVRQGLA
reversed 109 ALGQRVLA
m-scores . . 2 5 5 2 . .

112 YVYKPNNTHEQLLRKSEAQAKKEKLNIWSENDADSGQ
reversed 148 QGSDADNESWINLKEKKAQAESKRLLQEHTNNPKYVY
m-scores 1 2 3 3 4 5 . 4 3 1 0 1 . 5 4 3 4 . . . 4 3 4 5 . 1 0 1 3 4 . 5 4 3 3 2 1

82 RGLAYIYADGK
reversed 96 KGDAYIYALGR
m-scores 5 . 1 . . . . . 1 . 5

RNase T1

8 SNCYSSS
reversed 14 SSSYCNS
m-scores . 5 2 . 2 5 .

1 ACDYTCGSNCYSSSDVSTAQA
reversed 21 AQATSVDSSSYCNSGCTYDCA
m-scores . 0 3 1 5 1 3 . 5 2 . 2 5 . 3 1 5 1 3 0 .

31 ETVGSNSYPHKYNNYEGFDFSVSSPYYEWPILSSGDVY
reversed 68 YVDGSSLIPWEYYPSSVSFDFGEYNNYKHPYSNSGVTE
m-scores 1 3 1 . . 5 2 3 . 3 4 . 2 1 3 3 2 2 1 1 2 2 3 3 1 2 . 4 3 . 3 2 5 . . 1 3 1

It is entirely possible, however, for a much larger molecule to be produced by chance in a much shorter time if the large molecules are put together stepwise from elemental building blocks, giving rise to sub-assemblies which in turn combine to give larger sub-assemblies which then yield the macromolecule. Simon (11) states, “if a system of k elementary components is built up in a many level hierarchy, and s components, on the average, combine at any level into a component at the next higher level, then the expected time of evolution for the whole system will be proportional to the logarithm to base s of k. In such a hierarchy, the time required for systems containing 10^25 atoms to evolve from systems containing 10^23 atoms would be the same as the time required for systems containing 10^3 atoms to evolve from systems containing 10 atoms.”

Thus it is conceivable that formation of present day proteins from a limited number of primordial peptides would not only circumvent the dilemma described by Salisbury but that such a primordial process is quite feasible.
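The arithmetic behind Simon's statement can be checked with a short Python sketch. The function name `hierarchy_levels` and the choice s = 10 are illustrative assumptions, as is the reading of the first comparison as 10^23 to 10^25 atoms; the only thing taken from the quotation is that the expected time is proportional to the logarithm to base s of the number of components.

```python
import math


def hierarchy_levels(k_start: float, k_end: float, s: float) -> float:
    """Number of assembly levels needed to grow from k_start to k_end
    components when s components combine at each level.

    This is log base s of the size ratio, the quantity to which the
    expected evolution time is taken to be proportional.
    """
    return math.log(k_end / k_start) / math.log(s)


# s = 10 is an arbitrary illustrative value (assumption).
small = hierarchy_levels(10, 1e3, s=10)      # 10 atoms -> 10^3 atoms
large = hierarchy_levels(1e23, 1e25, s=10)   # 10^23 atoms -> 10^25 atoms

print(small, large)
```

Both cases involve a hundredfold increase in size, so both require the same number of levels whatever value of s is chosen; only the ratio of the sizes matters, not the absolute size.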


The consequence of randomness of mutations in repeated segments, particularly from the standpoint of practical application, is rather important. Each mutation that effected a change of one amino acid when it occurred results in the appearance of two different amino acids (B, Y) in two originally identical sequences:

. . . . . A B C D E . . . . . A B C D E . . . . . giving . . . . . A Y C D E . . . . . A B C D E . . . . .

This is because we do not know today which of the two amino acids we see represents the original amino acid before mutation altered one of them. Similarly, two mutations result in the appearance of four different amino acids (B, Y, D, F):

. . . . . A B C D E . . . . . A B C D E . . . . . giving . . . . . A Y C D E . . . . . A B C F E . . . . .

In other words, when one of the amino acids in the above example changes due to a mutation, the chance of finding two identical sub-sequences is decreased by 20%. With two amino acid changes this chance is decreased by 40%. Under these circumstances it would actually be a miracle to find these sub-sequences exactly alike for a long protein, say one having ca. 100 amino acids. This fact can be appreciated by looking at any alignment chart in Dayhoff (12). It has to be repeated here that this argument holds under the premise that the proteins are made up of repeating sub-sequences. Hence to use a statistical approach to find an answer to the problem of whether the regularities observed in the primary structure of proteins could have arisen by chance, without giving thorough consideration to the factors listed above, can only lead to erroneous conclusions, namely, that they could arise by chance! The major cause of this oversight is the concept that the amino acid sequence of proteins is random, which has somehow survived to the present day.
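The 20% and 40% figures can be reproduced by counting the positions at which two five-residue repeats still agree. The sketch below is a minimal illustration; the helper name `fraction_identical` is hypothetical, and the letters A to F and Y are the placeholder residues of the example above, not real amino acid codes.

```python
def fraction_identical(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length segments agree."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


original = "ABCDE"

# One mutation (B -> Y) in the first copy only.
one_mutation = fraction_identical("AYCDE", original)

# A second mutation (D -> F) in the other copy.
two_mutations = fraction_identical("AYCDE", "ABCFE")

print(one_mutation, two_mutations)
```

One mutation leaves four of the five positions in agreement, and two mutations at different positions leave three of five, which corresponds to the decreases of 20% and 40% quoted in the text.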

The earliest statistical studies on amino acid sequences in proteins began with the purpose of trying to understand the nature of genetic coding (13), and nearest neighbor pairs in proteins were searched. An extension of this study by Ycas (4) led to the belief that globular proteins were a set of random sequences of amino acids. During this time Sorm had already published his papers suggesting that there were regularities in the amino acid sequences of proteins which could not be attributed to coincidence. In 1961 Williams, Clegg & Mutch addressed themselves to this problem because Sorm had not based his conclusions on statistical studies. They compared random sequences of amino acids, corresponding to RNase, trypsinogen, chymotrypsinogen, tobacco mosaic virus protein and A and B chains of hemoglobin, for the presence of intramolecular repeats of peptides and concluded that none of these proteins differed significantly from random sequences with respect to intramolecular repeats. A point to remember here is the size of repeats being considered: a preponderance of doublets, then triplets, and a few quadruplets. It is clear that the shorter the sequences that are being considered identical, the greater is the probability that they might have occurred by chance. The presence of tri-, tetra- and pentapeptides with significant biological activity (16,17), however, speaks against ignoring regularities just because they happen to be short.
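The dependence on repeat length can be made concrete with a small calculation. The sketch below deliberately uses the simplest random model, in which all 20 amino acids are equally likely and independent of their neighbors; this is the very assumption that is questioned in this discussion, and it is adopted here only to show how quickly chance repeats thin out as the repeat gets longer. The function name `expected_chance_repeats` and the chain length of 229 residues are illustrative assumptions.

```python
from math import comb


def expected_chance_repeats(length: int, n: int, alphabet: int = 20) -> float:
    """Expected number of pairs of positions that carry the same n-residue
    peptide in a random chain of `length` residues.

    Assumes independent, equally likely residues (a simplifying model):
    each of the C(windows, 2) pairs of windows matches with
    probability alphabet ** -n.
    """
    windows = length - n + 1
    return comb(windows, 2) / alphabet ** n


for n in (2, 3, 4, 5):
    print(n, expected_chance_repeats(229, n))
```

Under this model repeated doublets are expected in large numbers, repeated triplets only a few times, and repeated quadruplets or pentapeptides less than once per chain, which is the sense in which shorter repeats are more easily ascribed to chance.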

Only after many proteins are compared as to their repeating sub-sequences, however, will one be able to state with sufficient certainty what the nature and number of these sub-sequences are.

Interestingly, when a small peptide such as bradykinin is matched against a larger protein, the same repeating of homologous sequences occurs and underlines the general nature of this phenomenon.

The idea that proteins could be formed from polypeptides was recently proposed on very practical grounds (18). The most impressive experimental evidence that proteins are not random sequences of amino acids comes from the laboratory of Fox & Nakashima (19). An equimolar mixture of tyrosine, glycine and glutamic acid which was heated to 170°C yielded a hexapeptide of a definite sequence comprising 10% of the product: pyroglutamylglycyltyrosyl-γ-glutaminyltyrosylglycine.

The studies by Steinman (20) also indicated that side chains of amino acids do influence which amino acid will follow another one. The reason for discussing these investigations at some length, particularly stressing the randomness vs. the nonrandomness of proteins, is to point out the consequences of these two concepts. Assumption of randomness of amino acid sequences seems to entitle the investigator to obtain simple probabilities for the occurrence of an amino acid, any amino acid, at any position of a protein, completely independently of the amino acid preceding as well as the one succeeding it, as was done by Williams et al. (5). This leads to an overestimation of the number of vertical matches between the amino acids of two proteins by chance because each amino acid is considered not only independent of the others but equivalent to the others (7).

Evidence obtained from in vivo systems also supports the views expressed above. It has been demonstrated that a corticotrophin-like intermediate lobe peptide is formed, together with α-melanocyte stimulating hormone, from corticotrophin (ACTH) in the pars intermedia of pig and rat pituitaries (21).

Furthermore, reacquisition of the ability to hydrolyze β-galactosides by a mutant of E. coli K12 with deletion of the lac Z gene was reported. It was demonstrated that the enzyme that evolved differed from the lac Z β-galactosidase of the wild type E. coli in its immunological, kinetic and sedimentation characteristics. The authors suggested that the adaptation of a gene not involved in lactose utilization to a form capable of specifying a β-galactosidase is similar to evolution by natural selection (22).

We believe that if there were subsequences, on many proteins, homologous to the active site(s) of β-galactosidase, they would be the most likely starting point for the phenomenon described above, instead of just any peptide somehow being converted to a state which is capable of performing an enzymatic reaction. Of the many proteins which may have these repeating subsequences, the gene coding for the one whose three dimensional conformation is closest to that of β-galactosidase is modified so that it can finally hydrolyze lactose. Finally, we feel it is proper for us to acknowledge that the repeat and symmetry patterns that we are reporting here were first observed and described by Sorm & Keil (23).

Note added in proof: Our preliminary experiments on simulation of evolution suggest that during 1.3 × 10^9 years, 57 of the amino acids of a 100 amino acid long protein would have been altered. Hence, the level of significance accepted by us may turn out to be lower than what could be attained under the evolutionary pressures!

REFERENCES

1. SORM, F. (1958) in Symposium on Protein Structure, (NEUBERGER, A., ed.), p. 77, Methuen, London.

2. SORM, F. & KEIL, B. (1958) Coll. Czech. Chem. Commun. 23, 1575-1578.

3. FITCH, W. M. (1966) J. Mol. Biol. 16, 17-27.

4. YCAS, M. (1958) in Information Theory in Biology, (YOCKEY, H. P., ed.), p. 70, Pergamon Press, London.

5. WILLIAMS, J., CLEGG, J. B. & MUTCH, M. O. (1961) J. Mol. Biol. 3, 532-540.

6. MCLACHLAN, A. D. (1972) J. Mol. Biol. 64, 417-437.

7. GRELLER, L. D. & ERHAN, S. (1974) Int. J. Pep. Prot. Res. 6, 165-173.

8. MCLACHLAN, A. D. (1971) J. Mol. Biol. 61, 409-424.

9. SALISBURY, F. B. (1969) Nature 224, 342-343.

10. QUASTLER, H. (1964) in The Emergence of Biological Organization, Yale University Press, London.

11. SIMON, H. A. (1973) in Hierarchy Theory, (PATTEE, H. H., ed.), p. 3, George Braziller, New York.

12. DAYHOFF, M. O. (1972) in Atlas of Protein Sequence and Structure, vol. 5, Nat. Biomed. Res. Foundation, Wash., D.C.

13. GAMOW, G., RICH, A. & YCAS, M. (1956) in Advances in Biological and Medical Physics, p. 23, vol. 4, Academic Press, New York.

14. SORM, F. (1954) Coll. Czech. Chem. Commun. 19, 1003-1005.

15. SORM, F., KEIL, B., HOLEYSOVSKY, V., KNESSLOVA, V., KOSTKA, V., MASIAR, P., MELOUN, B., MIKES, O., TOMASEK, V. & VANECEK, J. (1957) Coll. Czech. Chem. Commun. 22, 1310-1329.

16. NAIR, R. M. G., KASTIN, A. J. & SCHALLY, A. V. (1971) Biochem. Biophys. Res. Commun. 43, 1376-1381.

17. BURGUS, R., DUNN, T. F., DESIDERIO, D. & GUILLEMIN, R. (1969) C.R. Acad. Sci., Ser. D., 269, 1870-1873.

18. PIGMAN, W., DOWNS, F., MOSCHERA, J. & WEISS, M. (1970) in Proceedings of the International Conference on Blood and Tissue Antigens, (AMINOFF, D., ed.), p. 205, Academic Press, New York.

19. FOX, S. W. & NAKASHIMA, T. (1967) Biochim. Biophys. Acta 140, 155-167.

20. STEINMAN, G. & COLE, M. N. (1967) Proc. Natl. Acad. Sci. U.S. 58, 735-742.

21. SCOTT, A. P., RATCLIFFE, J. G., REES, L. H., LANDON, J., BENNETT, H. P. J., LOWRY, P. J. & MCMARTIN, C. (1973) Nature New Biol. 244, 65-67.

22. CAMPBELL, J. H., LENGYEL, J. A. & LANGRIDGE, J. (1973) Proc. Natl. Acad. Sci. U.S. 70, 1841-1845.

23. SORM, F. & KEIL, B. (1962) in Advances in Protein Chemistry, p. 167-205, Academic Press, New York.

Address: Semih Erhan, Dept. of Animal Biology, School of Pennsylvania, Philadelphia, Pennsylvania 19174, U.S.A.
