

Int. J. Peptide Protein Res. 21, 1983, 190-195

Structural conservation in globular proteins

V.N. VISWANADHAN and K. SUNDARAM

Department of Crystallography and Biophysics*, University of Madras, Guindy Campus, Madras, India

Received 4 January, accepted for publication 16 June 1982

Some well sequenced classes of homologous proteins have been analysed in the light of their representative tertiary structures to reveal the nature of structural conservation during protein evolution. Within ‘the sphere of influence’ around conserved residue sites preferential association of other conservative sites has been observed to be a feature of natural selection valid for all residue types. An information theory approach evolved to examine residue variability reveals that even some non-conservative sites scaffold clusters of high biochemical specificity and intermediate folding units.

Key words: chain-fold conservation; conserved sites; residue variability; structural specificity

Biochemical specificity of a protein is known to be due to its three-dimensional geometry (1). The characteristic shape of a given protein is in large part due to the specific chain fold (e.g. the globin fold, the cytochrome fold) of the protein family to which the given protein belongs. It is well known that the chain fold is remarkably well conserved within a homologous family (2, 3). Furthermore, amino acid changes within a close range of taxonomic spectrum tend to be conservative in nature. It is also a widely recognized principle that, in general, the greater the dissimilarity between two homologous proteins, the farther apart they are on the phylogenetic tree.

*Contributions No. 594.

In the process of protein speciation, certain sites tend to remain conserved to maintain the structural stability (the chain fold) and the functional specificity of the molecule. Moreover, the rate of change of amino acids in the primary structure far exceeds that of tertiary structure (4). Thus it is the sites conserved in evolution that are the clearest indicators of factors relating to chain-fold conservation in proteins. In the present paper, an analysis of conservative sites and their association in three dimensions is presented to examine these factors. Some important considerations enter into such an analysis. Firstly, the number of conservative sites and their distribution in three dimensions depend on the range of species (taxonomic class) considered. For example, in the case of cytochrome c, considering only known sequences of vertebrates, 66 sites are conserved, whereas in the context of all known species only 27 sites remain invariant. This difficulty in the analysis of conserved sites is circumvented by the following working assumption: conservative sites in any given class are structurally and functionally important as far as that class is concerned and hence these sites are selectively preserved. Among these, those sites that are preserved in the longer reign of evolutionary history of the

family are of greater importance with regard to biochemical specificity. In some cases, however, even nonconservative sites figure in the folding process as scaffolds of folding clusters, and a method based on information theory is proposed to identify such sites. A close relation emerges between the overall conservative nature of a segment or a cluster in a protein and folding nuclei, as we shall endeavour to show in the following.

METHODOLOGY AND ANALYSIS

For the following analysis, we assume the same tertiary structure for a class of homologous proteins from a close range of taxa, such as all known vertebrate sequences. For the proteins considered here (cytochrome c, myoglobin and haemoglobin), this assumption seems valid, as these proteins are known to contain strongly conserved chain folds (5, 6). Each tertiary structure is represented by its alpha carbons (7), for a class of closely related homologous proteins.

Structural data
Alpha carbon coordinates of cytochrome c, myoglobin and haemoglobin (α and β chains) were retrieved from the Protein Data Bank (8). The particular files used are 1CYT (cytochrome c from albacore), 3MBN (myoglobin from sperm whale) and 1HHB (haemoglobin chains from human). Amino acid sequence data were taken from the published alignments of these proteins (Dayhoff 1976, 1978 (9)); cytochrome c sequences from 27 vertebrate species were considered (see ref. 22 for details of species). Twenty-five mammalian myoglobin sequences (Dayhoff 1978 (9)), 20 sequences from the higher vertebrate range of haemoglobin alpha chains, and 18 sequences from the higher vertebrate range of haemoglobin beta chains (Dayhoff 1976 (9)) were considered.

Spatial association of conservative sites in proteins
Literature on protein evolution suggests that, in general, polar residues belong to the protein exterior, where replacements are more frequent, whereas hydrophobic residues, being internal, are less susceptible to mutation since internal changes are likely to be more damaging to structure. This suggests that conservative sites are mainly internal and are likely to form clusters. We therefore examined the association of conservative sites in the tertiary structures of the proteins by considering the nature of the residue cluster at each site; a residue cluster is defined as the set of residue sites that are enclosed in an 8 Å sphere around the α-carbon at the site. Earlier work (10, 11) indicates that the ‘influence’ of a given residue extends predominantly only over a distance of 8 Å. Within each residue cluster, we computed the ratio of observed to expected number of conservative sites as follows.

Let X be the number of conservative sites in the homologous class considered, and Y the number of residues in a cluster. Then (X/N)Y gives the expected number of conservative sites within the cluster, where N is the number of residues in the sequence. The ratio of observed to expected number of conservative sites within the cluster is then R/[(X/N)Y] = NR/(XY), where R is the number of conservative sites observed within the cluster. We computed these ratios for all sites of myoglobin, cytochrome c and haemoglobins (α and β chains) using a computer program.

Among the sequences of cytochrome c from the 27 vertebrate species considered, 66 sites were found invariant; among the 25 mammalian myoglobins, 78 sites were invariant; among the 20 sequences of haemoglobin α chain, 33 sites were invariant; and among the 18 chains of β-haemoglobins, 45 sites were invariant. All chains considered were of constant length, except for cytochrome c, which has 103 residues in most sequences but 104 in a few cases.

Fig. 1 (a-d) depicts the chain contiguous profiles of frequency ratios for the proteins mentioned. The line at frequency ratio 1.0 separates the sites that centre conservative clusters (ratio above 1.0) from those that centre nonconservative clusters (ratio below 1.0).

Conservative sites are denoted by open circles on the profile and nonconservative sites by dots. The continuous and smooth variation of the values within local segments may be noticed from the profile. In the case of haemoglobins, which are more variable than myoglobins or cytochrome c, the profiles show


greater variability along the chains. Within the taxonomic classes considered, the profiles serve to indicate the three-dimensional association of conservative sites.
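The observed-to-expected ratio described in the methodology can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the function names `invariant_sites` and `cluster_ratios` are hypothetical, the alignment is assumed to be ungapped and of constant length, coordinates are assumed to be α-carbon positions in Å, and the central site is counted as a member of its own cluster (it lies at distance zero from itself).

```python
import math

def invariant_sites(alignment):
    """Return the set of column indices occupied by a single residue type.

    Assumes an ungapped alignment in which all sequences have equal length.
    """
    length = len(alignment[0])
    return {i for i in range(length)
            if len({seq[i] for seq in alignment}) == 1}

def cluster_ratios(ca_coords, conserved, radius=8.0):
    """Observed/expected number of conservative sites for each residue cluster.

    ca_coords : list of (x, y, z) alpha-carbon coordinates, one per site
    conserved : set of indices of conservative (invariant) sites
    radius    : cluster radius in angstroms (8 A in the text)
    """
    n = len(ca_coords)   # N: residues in the sequence
    x = len(conserved)   # X: conservative sites in the class
    ratios = []
    for centre in ca_coords:
        # Cluster = all sites whose alpha carbon lies within the sphere
        # (assumption: the central site itself is included).
        members = [j for j, c in enumerate(ca_coords)
                   if math.dist(centre, c) <= radius]
        y = len(members)                                # Y: cluster size
        r = sum(1 for j in members if j in conserved)   # R: observed
        expected = x / n * y                            # (X/N) * Y
        ratios.append(r / expected if expected > 0 else float("nan"))
    return ratios
```

With real data, `ca_coords` would come from the α-carbon records of a coordinate file and `conserved` from `invariant_sites` applied to the aligned sequences; sites with a ratio above 1.0 centre clusters enriched in conservative sites.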

Preferential association of conservative sites and identification of residue sites as centres of conservative clusters
In general, all profiles indicate that conservative sites tend to form clusters in three dimensions. High values of the obs/exp frequency ratio at some sites indicate that those sites scaffold conservative clusters; these sites are themselves conservative, with very few exceptions. It is noted that clustering of conservative sites is not confined to the interior locations of the globule but occurs throughout the chain, and may be viewed as a selective criterion in speciation.

In the following, the details of residue sites that are centres of conservative clusters are presented for the proteins analysed.

Cytochrome c
In the case of cytochrome c, all sites that are invariant in the overall context, with three exceptions, are surrounded by an abundance of conservative sites. We consider those among them that also have high ratios to be likely scaffolds of local clusters. These are Gly 41, Gly 77, Lys 72, Lys 73, Lys 79, Pro 71, Pro 76 and Tyr 78. The predicted nucleation sites for this molecule are residue sites 79-82. Several residues may be noted to scaffold conservative clusters around this region. It is noteworthy that the charged residue Lys emerges as the scaffold in the majority of the sites; Lys is known to occur mostly on the protein surface (12, 13).

Myoglobins
All residues that remain invariant, considering the entire set of globins from all known taxa, have a frequency ratio greater than 1, with the exception of Gly 25 and Tyr 146. Both histidines involved in heme cross linkage scaffold conservative pockets. Lys and Leu predominate in forming conservative clusters. Among the residue sites that have a high frequency ratio and are conservative in the whole globin super-family, we have Trp 14, Phe 33, Pro 37, Phe 43, Phe 46, Val 68, His 93 and Lys 133. Here, too, it is noted that two residue sites, Phe 33 and Pro 37, occur close to the predicted nucleation site for this protein (14).

FIGURE 1. Chain contiguous profiles of observed to expected ratio of conservative sites at each cluster around an α-carbon along the protein chain. (a) Vertebrate cytochromes c. (b) Mammalian myoglobins. (c) Haemoglobin α-chains from higher vertebrates. (d) Haemoglobin β-chains from higher vertebrates.

Haemoglobins
In haemoglobins (α and β chains) all residue

sites that remain invariant in the overall super-family of globins, except Phe 98, Gly 59 and His 58 of the α chain, have high frequency ratios. Earlier prediction of nucleation sites (15) included only the α chain (residue sites 31-42). Residues that are close to or part of these sites, have high frequency ratios and are invariant in the overall globin super-family are Leu 28, Pro 36 and Phe 45. These and sites His 63, Gly 64, Leu 88 and His 92 can be considered to scaffold conservative and structure-specific clusters for this protein.

Conservation level of a residue site and cluster: approach from information theory
Until now, residue sites and clusters were classified on the criterion of whether the sites were invariant or not. Considering that even nonconservative sites could serve as folding nuclei (for example site 110 of myoglobin, see 14, 15), it would be of interest to develop a continuous function that accounts for the ‘conservation level’ of a given site or cluster. Earlier, Kabat et al. (16) defined a quantitative index of variability as the number of different amino acids occurring at that site divided by the frequency of the most common residue at that position. However, an index based on the probabilities of the various amino acids at a position is likely to be more satisfactory. Moreover, since it was shown that preferential association exists among conservative sites, a definition of conservation level is better considered from the point of view of tertiary association of residues.

Theory
We consider evolution a communication system. As a result of selection pressure and genetic drift, the functional information contained in the sequence and shape of a protein is preserved and at the same time structural variability is developed, as reflected in the alignments of available sequences. An invariant site is regarded as function-specific and a highly variable site as non-specific. Between these two extremes, we have varying levels of conservation, as reflected by the residue composition at each site.

Eigen (17) developed a phenomenological theoretic treatment of selection pressure, genetic drift and compositional entropy and specificity of a site. In that work, he also introduced the concepts of structural capacity, selective value and kinetics of selection, which are of further interest for the treatment, given below, of conservation levels of residue sites and clusters based on the tertiary distribution of residue sites.

Let us denote by $H_K$ the compositional entropy of a given site ‘K’ in a protein:

$$H_K = -\sum_{i=1}^{20} p_i \log_2 p_i$$

where the $p_i$ are the probabilities of the 20 residue types at the site ‘K’ and the compositional entropy $H_K$ is measured in bits. The difference between the maximum possible value of $H_K$ (equal to $\log_2 20$) and the observed entropy of the site is labelled the ‘divergence from equiprobability’ (18) and is given by

$$D_K = \log_2 20 - H_K$$

The $D_K$ are a class of information densities, characteristic of a homologous class, and they measure the level of conservation of the various residue sites. Let us consider the level of conservation (or specificity) of a residue cluster as the algebraic sum of the $D_K$ values of the various sites within the cluster; this is given by

$$S_K = \sum_{i=1}^{N} D_{K_i}$$

where the summation is taken over all the N residues within the residue cluster around the site ‘K’, and $K_i$ denotes the i-th site of that cluster. The class of indices $S_K$ is characteristic of the residue clusters around each site, and these indices show a greater range than the class $D_K$ as measures of specificity.

Conservation levels of residue clusters in myoglobin
A chain contiguous profile of $S_K$ values is depicted in Fig. 2 for the case of mammalian myoglobins. As before, all clusters whose centres are conservative are denoted by open circles. The $S_K$ measure reflects both the number of residue sites in a given pocket as well as their

individual levels of conservation. We may regard both factors as important in obtaining the specificity measure for clusters. Residue clusters in the inner locations of the globule, where replacements are less frequent, tend to contain more residues than outer pockets (11).

FIGURE 2. Chain contiguous profile of the specificity index for mammalian myoglobins.
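The three quantities defined in the Theory section (compositional entropy, divergence from equiprobability and cluster specificity) can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the authors' procedure: the function names are hypothetical, the probabilities of the residue types at a site are assumed to be estimated by their observed frequencies in the corresponding alignment column, and gaps are not handled.

```python
import math
from collections import Counter

LOG2_20 = math.log2(20)  # maximum entropy for 20 residue types, in bits

def compositional_entropy(column):
    """Shannon entropy (bits) of the residue composition of one site.

    Assumption: probabilities are estimated by observed frequencies.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def divergence(column):
    """Divergence from equiprobability: log2(20) minus the site entropy."""
    return LOG2_20 - compositional_entropy(column)

def cluster_specificity(columns, clusters):
    """Sum of site divergences over the members of each residue cluster.

    columns  : list of alignment columns (one string per site)
    clusters : list of lists; clusters[k] holds the site indices that
               fall inside the cluster centred on site k
    """
    d = [divergence(col) for col in columns]
    return [sum(d[j] for j in members) for members in clusters]
```

An invariant column gives zero entropy and the maximum divergence of log2 20 bits; a column split evenly between two residue types gives an entropy of 1 bit. The cluster membership lists would come from the same 8 Å criterion used for the frequency-ratio profiles.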

One may note that a number of nonconservative sites are centres of residue clusters that are highly specific. For example, site 110, which tolerates three residues, has a high $S_K$ value of 52 bits. This site and site 32, which have been proposed as nucleation points (14, 15), occur at local maxima of the profile. In fact, the entire stretch of residues 103-115 proposed as the nucleation site has high $S_K$ values in general. Lesk & Rose (19) examined the inertial ellipsoids of chain contiguous sections of various sizes and proposed a set of hierarchic folding pathways for this protein. It was possible to compare their folding model with the result of the present method. All chain contiguous sections of 32 to 78 residues, when examined by the inertial ellipsoid method, yielded compact domains whose centres lie in the region of sites 40-45, towards the N-terminal. It is noteworthy that these stretches have overall high values of $S_K$. Chain contiguous sections of 88-118 residues yielded compact units centred around site 65, which has a strikingly large value of $S_K$. Thus, the class of indices $S_K$ appears to offer a simple way of gleaning essential information regarding smaller sub-units involved in folding and possibly regions of high biochemical specificity.
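As a rough guide to the scale of these values, the following arithmetic follows directly from the definitions of the site entropy and the divergence from equiprobability; the numbers are bounds computed from the formulas, not values reported in the paper. A site that tolerates three residue types has an entropy of at most $\log_2 3$ bits, so its divergence cannot fall far below that of an invariant site, and a cluster value of 52 bits corresponds to roughly twelve fully invariant sites.

```latex
\begin{align*}
D_K^{\text{invariant}} &= \log_2 20 \approx 4.32 \ \text{bits} \\
D_K^{\text{three residues}} &\ge \log_2 20 - \log_2 3 = \log_2 \tfrac{20}{3} \approx 2.74 \ \text{bits} \\
\frac{52 \ \text{bits}}{\log_2 20} &\approx 12.0 \ \text{invariant-site equivalents}
\end{align*}
```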


Normalized residue frequencies at conservative sites
To derive more information on the roles of individual residues in the specificity of overall structure and function, we analyse their occurrence frequencies at conservative sites. In an earlier communication (20), we considered actual occurrences of residues at conservative sites in pools of closely related homologous classes. In Fig. 3 the histogram-profiles of frequency ratios (the ratio of the percentage occurrence at conservative sites to that in the entire chain) are presented for the proteins involved in the analysis. In terms of contributions to the overall specificity of structure or function, Gly, His, Phe, Lys, Leu, Arg, Tyr and Trp stand out from the rest of the residue types. Sample size limitations, however, restrict the value of these observations, as the actual counts used in working out the ratios are small. Among the residues mentioned above, those which predominate in cluster formation are Gly, Leu, Lys and His. Except for Leu, it may be noted that these are non-hydrophobic in nature (21).

FIGURE 3. Histogram-profiles depicting frequency ratios (observed to expected) of amino acid residues at conservative sites for three homologous protein classes. The amino acids are indicated by their one-letter symbols arranged in alphabetical order (IUB single-letter code for amino acids (23): A - alanine; C - cysteine; D - aspartic acid; E - glutamic acid; F - phenylalanine; G - glycine; H - histidine; I - isoleucine; K - lysine; L - leucine; M - methionine; N - asparagine; P - proline; Q - glutamine; R - arginine; S - serine; T - threonine; V - valine; W - tryptophan; Y - tyrosine).
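The frequency ratio plotted in Fig. 3 can be sketched as follows. This is an illustrative reading of the definition (fractional occurrence at conservative sites divided by fractional occurrence in the entire chain), not the authors' code: the function name `residue_frequency_ratios` is hypothetical, and pooling the counts over all sequences of a class is an assumption based on the description of the earlier communication (20).

```python
from collections import Counter

def residue_frequency_ratios(sequences, conserved):
    """Ratio of a residue's share at conservative sites to its overall share.

    sequences : aligned sequences of one homologous class (equal length)
    conserved : set of indices of conservative sites
    Counts are pooled over all sequences (an assumption of this sketch).
    """
    total = Counter()
    at_conserved = Counter()
    for seq in sequences:
        for i, res in enumerate(seq):
            total[res] += 1
            if i in conserved:
                at_conserved[res] += 1
    n_total = sum(total.values())
    n_cons = sum(at_conserved.values())
    if n_cons == 0:
        return {}  # no conservative sites, so the ratio is undefined
    return {res: (at_conserved[res] / n_cons) / (total[res] / n_total)
            for res in total}
```

A ratio above 1 means the residue type is over-represented at conservative sites relative to its overall composition. As the text cautions, small counts make such ratios noisy.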

CONCLUSION

The above analysis examined different aspects of structural conservation in relation to three-dimensional geometry. Preferences of residues at conservative sites indicated their differential contribution to molecular specificity. In particular, Lys, Leu, Gly, His and, in some cases, Cys and Phe were noted to contribute more than other residue types.

Inter-residue association patterns revealed that conservative sites tend to occur as selectively fixed clusters or pockets. However, there are a number of exceptions to this trend, so a rigorous statistical analysis is warranted before any firm statement can be made. Some sites which are preserved by natural selection and surrounded by other conservative sites may be viewed as nuclei of high biochemical specificity. These were identified and their characteristics were analysed. Using an approach from information theory, it was seen that in some cases even nonconservative sites scaffold clusters of high specificity. Further points of study that emerge from the present analysis include the relation of conservative sites to their conformation, pathways of conservative clusters within the globule and their relation to hydrophobic domains.

ACKNOWLEDGEMENTS

V.N.V. thanks the Department of Atomic Energy (India) for a fellowship grant. Part of the work is supported by a grant from the Department of Science and Technology to Professor K. Sundaram.

REFERENCES

1. Anfinsen, C.B. & Scheraga, H.A. (1975) Adv. Protein Chem. 29, 205-300
2. Rossman, M.G., Liljas, A., Branden, C.A. & Banaszak, L.J. (1975) in The Enzymes (Boyer, P.D., ed.), vol. 11, 3rd Edn.
3. Kretsinger, R.H. (1972) Nature New Biol. 240, 85
4. Rossman, M.G. & Argos, P. (1976) J. Mol. Biol. 88, 177
5. Schulz, G.E. & Schirmer, R.H. (1978) Principles of Protein Structure, p. 188. Springer, New York
6. Dickerson, R.E. (1971) J. Mol. Evol. 1, 26-31
7. Crampin, J., Nicholson, B.H. & Robson, B. (1978) Nature 272, 558-560
8. Bernstein, F.C., Koetzle, T.G., Williams, G.J.B., Meyer, E.F., Jr., Brice, M.D., Rogers, J.R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977) J. Mol. Biol. 112, 535
9. Dayhoff, M.O. (ed.) (1972) Atlas of Protein Sequence and Structure, vol. 5; (1976) Suppl. 2; (1978). National Biomedical Research Foundation, Bethesda, MD
10. Manavalan, P. & Ponnuswamy, P.K. (1978) Arch. Biochem. Biophys. 184, 476-487
11. Ponnuswamy, P.K., Prabhakaran, M. & Manavalan, P. (1980) Biochim. Biophys. Acta 623, 301-316
12. Chothia, C. (1976) J. Mol. Biol. 105, 1-14
13. Janin, J., Wodak, S., Levitt, M. & Maigret, B. (1978) J. Mol. Biol. 125, 357-386
14. Ponnuswamy, P.K. & Prabhakaran, M. (1980) Biochem. Biophys. Res. Commun. 97, 1582-1590
15. Matheson, R.R. & Scheraga, H.A. (1978) Macromolecules 11, 819-829
16. Kabat, E.A., Wu, T.T. & Bilofsky, H. (1976) Variable Chains of Immunoglobin Chains. Bolt Beranek and Newman, Cambridge, MA
17. Eigen, M. (1971) Naturwissenschaften 58, 473-484
18. Gatlin, L.L. (1972) Information Theory and Living System. Columbia University Press, New York
19. Lesk, A.M. & Rose, G.D. (1981) Proc. Natl. Acad. Sci. US 73, 4304-4308
20. Viswanadhan, V.N. & Sundaram, K. (1982) Phys. Chem. Phys. In press
21. Nozaki, Y. & Tanford, C. (1971) J. Biol. Chem. 246, 2211-2217
22. Lyddiat, A., Peacock, D. & Boulter, D. (1978) J. Mol. Evol. 11, 35-45
23. IUPAC-IUB Commission on Biochemical Nomenclature (1968) J. Biol. Chem. 243, 3557-3559