

Biochemical Systematics and Ecology, Vol. 7, pp. 177 to 185. Pergamon Press Ltd. 1979. Printed in England.

Development of a Novel Classificatory Method: Use of Amino Acid Nearest Neighbors as an Objective Key

SEMIH ERHAN*, BARBARA RASCO, LAURA COHEN and THOMAS R. MARZOLF Franklin Research Center, Philadelphia, PA 19103, USA

Key Word Index--Taxonomic classification; homologous proteins; amino acids; nearest neighbor analysis.

Abstract--The distribution of overlapping dipeptides found within the amino acid sequence of a protein is biassed, i.e. some dipeptides occur exclusively within angiosperms, some within vertebrates etc., while others are absent from the same organisms. Thus, utilization of the amino acid sequences of the members of a family of homologous proteins, such as cytochromes c, may lead to an objective classificatory system.

Introduction

Because man is a visually gifted animal [1], taxonomy from its inception as a science has relied mainly on inference based on morphological and anatomical characteristics of organisms to be classified. An ideal classification should be unambiguous [1], but since the choice of biological characteristics to be used is very subjective, no two persons can be expected to agree in every respect [2]. Furthermore, in some cases self-reinforcing circular arguments have been used to establish categories and form a logical fallacy underlying certain taxonomic procedures [3]. A major problem that stems from this reliance on morphological characteristics is that at least six systems have been proposed in classifying the main kingdoms of organisms [4, 5].

Recent developments in numerical taxonomy and the use of serological and biochemical data for classification have begun to alter this picture somewhat, but these are not universally accepted [1]. Serious attempts to utilize serological methods for taxonomic purposes began at the turn of this century [6, 7]. Most of the biochemical work has been done using "secondary products" whose presence or absence may be affected by intrinsic as well as extrinsic factors [8]. Thus, it was not always possible to decide whether the appearance of a particular compound in different taxa indicated a close relationship or had arisen as a result of convergent evolution [9]. Boulter has recently summarized the criticisms that can be directed against studies dealing with small molecules and stressed the need to consider primary macromolecules of plants as well as of animals [10].

*To whom all correspondence should be sent.

(Revised received 12 October 1978)

One possibility of classifying organisms objectively has been discovered and the method relies on the nearest neighbor relationships of amino acids found in homologous proteins. The method described in this communication is based on the frequency distribution of overlapping di-, tri- and higher peptides found in cytochromes c. It is flexible: it can be used as a very simple manual method limited to the use of the dipeptide distribution frequencies, or can be extended to include the frequency distribution of tripeptides and higher groups using computerized techniques. Most important, it does not depend upon the correct alignment of sequences. The method is capable of suggesting a choice among the six classificatory systems available for living organisms [4, 5].

Results

If the amino acid sequences of proteins are known, it is rather simple to obtain the frequency distribution of overlapping dipeptides manually: since 20 amino acids occur in proteins, there are 400 possible dipeptides which can be formed. A table listing all 400 dipeptides alphabetically is constructed. The amino acid sequence of a protein is then read, starting with the N-terminal amino acid, which together with the second residue forms the first dipeptide; this dipeptide is marked against the appropriate doublet on the table. Then the second dipeptide, starting with the second amino acid, is marked on the table. This operation is carried out n - 1 times in all to obtain all of the dipeptides which occur in a protein possessing n amino acids. The frequency distribution obtained is a unique characteristic of each protein, as no two proteins, even those belonging to the same protein family such as cytochromes c, have the same distribution of their dipeptides. For these studies, it is highly desirable to use single letter symbols for amino acids [11]. Breakdown of the amino acid sequence of a protein into overlapping di-, tripeptides etc. can also be obtained by computer [12].
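The manual tally described above can be sketched in a few lines of Python. This is only an illustration of the counting rule (n - k + 1 overlapping peptides of length k from n residues), not the authors' program [12]; the function name `overlapping_peptides` and the 10-residue fragment are hypothetical.

```python
from collections import Counter


def overlapping_peptides(sequence: str, k: int = 2) -> Counter:
    """Count every overlapping peptide of length k in a one-letter sequence."""
    seq = sequence.upper()
    # A sequence of n residues yields n - k + 1 overlapping k-peptides.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical 10-residue fragment, chosen only to illustrate the counting.
fragment = "GDVEKGKKIF"
dipeptides = overlapping_peptides(fragment, k=2)
tripeptides = overlapping_peptides(fragment, k=3)

print(sum(dipeptides.values()))   # how many dipeptides were read
print(sorted(dipeptides))         # which doublets would be marked on the table
```

The same call with k=3 or k=4 gives the tripeptide and tetrapeptide breakdowns, which is where a computer becomes more practical than the manual table.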

Next, the frequency distributions of overlapping doublets are transferred onto 20 x 20 matrices, where the vertical symbols represent the N-terminal and the horizontal symbols the C-terminal amino acid of the dipeptides (Fig. 1). Each matrix is unique for the protein it represents and no two matrices so far examined have been found to be identical.
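A minimal sketch of the transfer onto a 20 x 20 matrix follows. It assumes the residue order printed along the axes of the figures (I L V A G F M C P W H R K Y T S Q N E D); the names `RESIDUES`, `INDEX` and `dipeptide_matrix` are illustrative and not taken from the paper.

```python
from typing import List

# Residue order as printed along the axes of the paper's matrices.
RESIDUES = "ILVAGFMCPWHRKYTSQNED"
INDEX = {aa: i for i, aa in enumerate(RESIDUES)}


def dipeptide_matrix(sequence: str) -> List[List[int]]:
    """Rows are the N-terminal residue, columns the C-terminal residue."""
    matrix = [[0] * len(RESIDUES) for _ in RESIDUES]
    for first, second in zip(sequence, sequence[1:]):
        # A symbol outside the 20 standard residues raises KeyError here.
        matrix[INDEX[first]][INDEX[second]] += 1
    return matrix


# Hypothetical fragment, used only to show the indexing.
m = dipeptide_matrix("GDVEKGKKIF")
print(m[INDEX["K"]][INDEX["G"]])  # count of the dipeptide KG
```

Two sequences can then be compared cell by cell, which is all the later steps require.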

FIG. 1. DIPEPTIDE FREQUENCY DISTRIBUTION OF RAPE (BRASSICA NAPUS) CYTOCHROME C. Vertical symbols give the N-terminal and the horizontal symbols give the C-terminal amino acid of each dipeptide. (Row and column order: I L V A G F M C P W H R K Y T S Q N E D.)

FIG. 2. DOUBLETS WHICH DO NOT OCCUR IN ANY INDIVIDUAL CYTOCHROME C SEQUENCE, SUCH AS VC SEEN IN TABLE 1(b).

FIG. 3. DIPEPTIDES WHICH ARE PRESENT IN ALL CYTOCHROME C SEQUENCES.

Because the sequenced cytochromes c number only about 60, nearly 10 of which have ambiguities about some of the amino acids, and do not even cover most of the classes of organisms, the available sequences were arbitrarily divided into seven categories: fungi, angiosperms, insects, fish, birds, reptiles and mammals. The data on dipeptide frequency distribution can then be reorganized into tabular form, where the occurrence and absence of each dipeptide within each of these seven categories can be seen. For simplicity, the presence and absence of dipeptides are indicated by (+) and (-), respectively.

The data given in Table 1 show a portion of the whole table. It is possible to see from the results which dipeptides:

(a) are absent from all cytochromes c (cf. Fig. 2);
(b) occur in all cytochromes c (cf. Fig. 3);
(c) occur in all seven categories of cytochromes c, even though not in all individual sequences (cf. Fig. 4);
(d) occur exclusively in only one category;
(e) are absent exclusively from only one category;
(f) are shared by two categories;
(g) are missing exclusively from only two categories; and
(h) are shared by more than two categories, by either presence or absence.
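The grouping (a) to (h) can be expressed as a small classification routine. The sketch below is an assumption-laden illustration, not the authors' procedure: `classify_dipeptides` and the toy sequences are hypothetical, and the labels are unambiguous only when there are five or more categories (the paper uses seven).

```python
from itertools import product

RESIDUES = "ILVAGFMCPWHRKYTSQNED"


def dipeptides_of(sequence):
    """Set of overlapping dipeptides present in one sequence."""
    return {sequence[i:i + 2] for i in range(len(sequence) - 1)}


def classify_dipeptides(categories):
    """Map each of the 400 dipeptides to a label 'a'..'h' as in the list above.

    `categories` maps a category name to a list of one-letter sequences.
    """
    per_category = {name: [dipeptides_of(s) for s in seqs]
                    for name, seqs in categories.items()}
    total = len(per_category)
    labels = {}
    for pair in map("".join, product(RESIDUES, repeat=2)):
        present = [name for name, sets in per_category.items()
                   if any(pair in s for s in sets)]
        in_every_sequence = all(pair in s
                                for sets in per_category.values() for s in sets)
        count = len(present)
        if count == 0:
            labels[pair] = "a"      # absent from every sequence
        elif in_every_sequence:
            labels[pair] = "b"      # present in every sequence
        elif count == total:
            labels[pair] = "c"      # in every category, not every sequence
        elif count == 1:
            labels[pair] = "d"      # present in one category only
        elif count == total - 1:
            labels[pair] = "e"      # absent from one category only
        elif count == 2:
            labels[pair] = "f"      # present in two categories only
        elif count == total - 2:
            labels[pair] = "g"      # absent from two categories only
        else:
            labels[pair] = "h"      # shared by more than two either way
    return labels


# Hypothetical toy data: seven categories with very short "sequences".
toy = {"c1": ["AGK"], "c2": ["AG"], "c3": ["AG"], "c4": ["AG"],
       "c5": ["AG"], "c6": ["AGL"], "c7": ["AGL", "LAG"]}
labels = classify_dipeptides(toy)
print(labels["AG"], labels["GK"], labels["GL"], labels["KK"])
```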

Of the possible 400 dipeptides, nine are present in all individual cytochromes c while 139 do not occur at all. Thus the data from (a) and (b) form a fingerprint or signature for the cytochrome c family of proteins. When such fingerprints become available for other protein families then the shared doublets can be eliminated leaving unique doublets for each family which can be used to represent and classify them [13].

It is worth noting that the absence of a particular dipeptide from a protein family, from one of the categories listed above or from a particular cytochrome c sequence, is given equal weight to the presence of other dipeptides.

TABLE 1. PORTIONS OF THE CUMULATIVE DIPEPTIDE DATA

(a) For the category of insects. Columns: D.m., C.h., S.c., P.s.; rows: TS, TT, TV, TW, TY, VA, VC, VD, VE, VF, VG. [Individual (+) and (-) entries not legible in this copy.]

(b) For all seven categories

Dipeptide  Ma    Re    Bi    Fi    In    Fu    Pl    Only in one or two categories
TS         -     +1    +     +     -     +1    +2    absent from two: MaIn
TT         -     -     -     -     -     -     +     present in one: Pl
TV         +all  +all  +all  +all  +     +     +all
TW         +     +1    +all  +1    +all  +     -     absent from one: Pl
TY         +2    -     -     -     -     +     -     present in two: MaFu
VA         -     -     -     +     -     -     -
VC         -     -     -     -     -     -     -
VD         -     -     +2    -     -     +1    +
VE         +all  +all  +all  +all  +     +     +
VF         -     +1    -     +2    +1    -     +all
VG         +     +1    -     +2    +     +     +1

A portion of the table which is essential for the preparation of cumulative matrices. The preparation of the table occurs in two steps. At first a table is prepared for each of the seven categories, which shows in which organisms each dipeptide occurs. The upper portion of Table 1, (a), shows a part of the table prepared for the insects: D.m. = Drosophila melanogaster; C.h. = Callitroga hominivorax; S.c. = Samia cynthia; P.s. = Protoparce sexta. (+) indicates the presence and (-) the absence of a particular dipeptide within the cytochrome c of each organism. In the second step, the occurrence of each dipeptide within all seven categories is given. The lower portion of Table 1, (b), shows a part of this final form with the same dipeptides as in (a). Here Ma = mammals, Re = reptiles, Bi = birds, Fi = fish, In = insects, Fu = fungi, Pl = plants. The numbers below the (+) signs show whether a particular dipeptide occurs in all of the organisms belonging to that category (all), or in only one (1), or in only two (2) of the organisms. (+) without a subscript indicates that the occurrence is between these extremes. The lower right side of the table also shows which dipeptides occur in only one category; which are absent from only one category; and which are shared by two categories, by either occurring in only two categories or by being absent from only two categories; here Pr = present, Ab = absent.

FIG. 4. DIPEPTIDES WHICH ARE FOUND IN ALL SEVEN CATEGORIES OF CYTOCHROMES C. However, not necessarily in each of the individual sequences, such as TV and VE seen in Table 1(a) and (b).

In Table 1 one can also see whether a particular dipeptide occurs in all individual sequences or in only one or two, from the subscript numbers placed against the + signs. Thus 45 dipeptides are shared by all seven categories given above, even though not by each member of each category. These dipeptides are shown in Fig. 4.

Of the remaining 252 dipeptides, 23 are found in angiosperms exclusively while 10 dipeptides are absent in angiosperm cytochromes c; 43 are found exclusively in fungi while six are absent; seven are present while three are absent exclusively in insects; one is present and five are absent exclusively in birds; six dipeptides are present and three are absent exclusively in fish; one dipeptide is present while one is absent exclusively in reptiles; and finally two dipeptides occur exclusively in mammals. Thus 83 dipeptides occur exclusively in one of these seven categories while 28 dipeptides are absent exclusively from one of these seven categories (Figs. 5-10).

Of the rest, 66 dipeptides are shared between two categories; of these, 46 occur in only two groups while 20 are absent exclusively only in two categories (Table 2). The remaining 75 dipeptides are shared by more than two groups, thus accounting for all of the 400 amino acid pairs.
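The bookkeeping in the preceding paragraphs can be checked with a short tally. All counts below are the ones quoted in the text; the zero for dipeptides absent exclusively from mammals is an inference (the text mentions none), and the variable names are illustrative.

```python
# Counts quoted in the text, per category.
present_only = {"angiosperms": 23, "fungi": 43, "insects": 7, "birds": 1,
                "fish": 6, "reptiles": 1, "mammals": 2}
# Mammals: no exclusively absent dipeptide is mentioned, so 0 is assumed.
absent_only = {"angiosperms": 10, "fungi": 6, "insects": 3, "birds": 5,
               "fish": 3, "reptiles": 1, "mammals": 0}

absent_from_all = 139        # group (a)
in_every_sequence = 9        # group (b)
shared_by_two = 46 + 20      # present in only two + absent from only two
shared_by_more = 75

one_category = sum(present_only.values()) + sum(absent_only.values())
remaining = one_category + shared_by_two + shared_by_more
total = absent_from_all + in_every_sequence + remaining

print(sum(present_only.values()), sum(absent_only.values()))
print(remaining, total)
```

The tally reproduces the 83 and 28 quoted for single categories, the 252 "remaining" dipeptides, and the full set of 400.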

FIG. 5. CUMULATIVE MATRIX FOR FUNGI. (+) indicates the presence and (-) the absence of a dipeptide in that category.

FIG. 6. CUMULATIVE MATRIX FOR PLANTS (ANGIOSPERMS).

FIG. 7. CUMULATIVE MATRIX FOR INSECTS.

FIG. 8. CUMULATIVE MATRIX FOR FISH.

FIG. 9. CUMULATIVE MATRIX FOR BIRDS.

FIG. 10. CUMULATIVE MATRIX FOR REPTILES AND MAMMALS. Because the number of dipeptides that occur in each of these two groups was only two, they were all represented in one matrix. [symbol] represents the dipeptide that is absent from reptiles, exclusively. Since reptiles are represented by only two samples, any dipeptide that occurs in only one sample has greater weight than would be expected under normal circumstances. Of these, the ones which we consider significant are given as [symbol]. Among mammals, the dipeptide MK represents the primates and even though it is shared with reptiles and fungi, it occurs in only one organism in each category. Hence we have included it, too, as [symbol].

Next the dipeptides present and absent in each of the seven categories were transferred onto a separate matrix, again indicating the presence and absence with (+) and (-), respectively. Any organism containing cytochrome c, whose origin or relatedness is in doubt, is subjected to the same procedure: its amino acid sequence is broken down into overlapping di-, tripeptides etc. and the dipeptides are transferred onto a 20 x 20 matrix. This matrix is then compared with the seven cumulative matrices, one by one.


TABLE 2. DIPEPTIDES SHARED BY TWO GROUPS

Mammals/reptiles
  Present: FT
  Absent: AK, SK
Mammals/insects
  Present: (none)
  Absent: EG, TS
Mammals/fungi
  Present: FI, RE, TY
  Absent: (none)
Reptiles/birds
  Present: (none)
  Absent: AP, NE, SA
Reptiles/insects
  Present: DD
  Absent: (none)
Reptiles/fungi
  Present: RT, TM
  Absent: (none)
Reptiles/plants
  Present: (none)
  Absent: KS
Fish/fungi
  Present: LQ, LR, LV, LW, ND, NN, RI, VW, WN
  Absent: AD
Fish/insects
  Present: WQ
  Absent: KN
Fish/plants
  Present: QE, QQ, RQ
  Absent: (none)
Birds/fungi
  Present: DI, IE, TP
  Absent: AA, DV
Insects/fungi
  Present: FE, GV, RC
  Absent: KC
Insects/plants
  Present: ES, KP, NA, QR
  Absent: IK, QI, SQ
Fungi/plants
  Present: DR, DS, DY, EL, EW, FK, GA, LG, MA, NM, NQ, PT, SF, TQ, WE, YE
  Absent: FV, GK, KH, MI

All of the dipeptides that are shared by only two categories, by either being present or being absent solely in those two groups. This table is taken from the lower right-hand side of Table 1(b).

Because the number of dipeptides present and absent decreases from fungi toward mammals, the method is one of elimination; the comparison is started with the fungi, followed by angiosperms etc. and ending with mammals.

One concludes that the cumulative class matrix with which the unknown species shares the greatest number of dipeptides--both in presence and in absence--is the category to which it is most likely to belong.
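The comparison rule just stated can be sketched as a scoring function: one point for every characteristic (+) dipeptide that the unknown contains and one for every characteristic (-) dipeptide that it lacks. This is an illustrative reading of the text, not the authors' program; the signature sets below are hypothetical toy data, and the raw counts are not normalized for the different sizes of the category signatures.

```python
def dipeptides_of(sequence):
    """Set of overlapping dipeptides present in one sequence."""
    return {sequence[i:i + 2] for i in range(len(sequence) - 1)}


def agreement(unknown_sequence, present, absent):
    """Dipeptides shared with a cumulative matrix, in presence and in absence."""
    found = dipeptides_of(unknown_sequence)
    return len(found & present) + len(absent - found)


def rank_categories(unknown_sequence, signatures):
    """Return (category, score) pairs, best agreement first."""
    scores = {name: agreement(unknown_sequence, sig["present"], sig["absent"])
              for name, sig in signatures.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical signatures standing in for two cumulative matrices.
toy_signatures = {
    "category X": {"present": {"AG", "GK"}, "absent": {"WW"}},
    "category Y": {"present": {"LL"}, "absent": {"AG"}},
}
ranking = rank_categories("AGK", toy_signatures)
print(ranking[0][0])  # the category with the greatest agreement
```

In the paper the comparison proceeds by elimination from fungi toward mammals; sorting all scores at once, as here, is a simplification.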

Next the dipeptides in the matrix of the unknown species are compared with the dipeptides listed in Table 2 to see whether any of the dipeptides of the unknown are identical with any belonging to the category expected from the previous step.

Comparison with Table 2 completes this pictorial method based solely on the dipeptide distribution. The method, however, is much more powerful when information contained within tripeptides, tetrapeptides etc. is utilized. Even though overlapping tripeptide frequency distribution of proteins can be obtained without a computer, the operation is very tedious and the chances of making errors increase greatly. Hence the use of computers is necessary for this phase, either to obtain the breakdown into tripeptides, tetrapeptides etc. or even to obtain the matrices directly (Erhan and Barnsteiner, unpublished observations).

Some of the tripeptides and tetrapeptides are seen to occur more than once within the amino acid sequence of some proteins; for instance, SLNS (seryl-leucyl-asparaginyl-serine) is found at positions 22 and 101 within the trypsinogen molecule [11]; the position numbers refer to contiguous numbering of amino acids from the N-terminus of a protein. Table 3 shows such repeating tripeptides and tetrapeptides found within cytochrome c sequences. It is seen from this table that the tripeptide KTG occurs in all vertebrates, except bonito, and most probably represents the Subphylum Vertebrata. The triplet VEK, by its absence, may characterize the Class Aves, as the tripeptide EKG, by its absence, may indicate the Classes Chondrichthyes and Osteichthyes. Among mammals, kangaroo stands alone without EKG and VEK, while possessing KTG. Whether this is diagnostic of the Subclass Metatheria remains to be confirmed with data from at least one other marsupial. Among animals, only members of the Classes Reptilia and Mammalia, except kangaroo, have repeating quadruplets.

The major problem for the method at the present time is the scarcity of cytochromes c which have been sequenced. There are only two reptile samples and even though mammals are represented by 14 sequences, they do not even cover all of the mammalian orders. Furthermore, it is impossible to say at this time whether some orders do not have any characteristic dipeptides or whether we have this impression because we do not have sufficient data. For example, CS, FI, IM, MK, SQ, YS are found only among primates while RE and WK are found among members of the Order Perissodactyla, such as horse and donkey. No such exclusive pairs could be found among the dipeptides we have for the members of the Order Artiodactyla, such as camel, sheep etc. So there is an urgent need for more data representing organisms that will complement the information at hand. A complete list of all of the organisms whose cytochromes c have been sequenced is given in Table 3.

TABLE 3. SOME REPEATING TRIPLETS AND QUADRUPLETS FOUND IN CYTOCHROMES C

Each entry gives the organism(s), then the repeating triplets, then the repeating quadruplets; --- marks none.

Mammals
  Macropus rufus: KTG; ---
  Miniopterus schreibersi: EKG, KTG, VEK; VEKG
  Homo sapiens, Macaca mulatta, Pan troglodytes, Erythrocebus patas: EKG, KTG, VEK; VEKG
  Oryctolagus cuniculus: EKG, KTG, VEK; VEKG
  Eschrichtius glaucus: EKG, KTG, VEK; VEKG
  Phoca vitulina: EKG, KTG, VEK, TLM; VEKG
  Equus caballus, Equus asinus, Equus zebra: EKG, KTG, VEK; VEKG
  Camelus dromedarius, Sus scrofa, Ovis aries, Bos taurus: EKG, KTG, VEK; VEKG
  Canis familiaris: EKG, KTG, VEK; VEKG

Reptiles
  Chelydra serpentina: EKG, KTG, VEK; VEKG
  Vipera russelli: EKG, KTG, VEK, TAA; VEKG

Birds
  Gallus gallus, Meleagris gallopavo, Anas platyrhynchos: EKG, KTG; ---
  Dromiceius novae-hollandiae: EKG, KTG; ---
  Aptenodytes patagonica: EKG, KTG; ---
  Columba livia: EKG, KTG; ---

Fish (Entosphenus tridentatus; Squalus sucklii; Thunnus thynnus; Katsuwonus vagrans)
  Triplet rows as printed, in order: [---, KTG, VEK]; [KEG, LKK, ---]; [KTG, ---, ---]. No quadruplets. [Row-to-species alignment not recoverable from this copy.]

Insects (Drosophila melanogaster; Samia cynthia; Callitroga hominivorax; Protoparce sexta)
  Triplet rows as printed, in order: [---, KKG, ---]; [---, LKK, ---]. No quadruplets. [Row-to-species alignment not recoverable from this copy.]

Fungi (Debaryomyces kloeckeri; Saccharomyces oviformis; Neurospora crassa; Ustilago sphaerogena; Humicola lanuginosa)
  Triplet rows as printed, in order: [not legible]; [AKG, GAN, ---]. No quadruplets. [Row-to-species alignment not recoverable from this copy.]

Plants (Triticum sativa; Lycopersicon esculentum; Sambucus nigra; Phaseolus aureus; Spinacea oleracea; Sesamum indicum; Gossypium barbadense; Abutilon theophrasti; Ricinus communis; Brassica napus; Brassica oleracea; Cucurbita maxima; Acer negundo; Helianthus annuus)
  Triticum sativa: AGA, DAG, LIA; DAGA
  Lycopersicon esculentum: NPK; ---
  Sambucus nigra: NPK; ---
  Further triplet rows as printed: [STA]; [SYS]. [Row-to-species alignment not recoverable from this copy.]

Protozoa
  Crithidia oncopelti: AKG, LEN, YLE; YLEN
  Euglena gracilis: VPG; ---

Repeating tripeptides and tetrapeptides found in individual cytochrome c sequences organized into the seven categories mentioned in the text. The last item, E. gracilis, being placed under Protozoa does not imply that we consider it to be a protozoan. The quadruplet VEKG, which presumably appears twice within the sequence of Anas platyrhynchos cytochrome c, has been omitted from this table because of the ambiguity of the sequence around position 20.
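Repeating tripeptides and tetrapeptides of the kind listed in Table 3 can be located by recording the positions of every overlapping peptide and keeping those seen more than once. The sketch below is illustrative: `repeating_peptides` is a hypothetical helper and the 11-residue test string is invented, not a cytochrome c sequence. Positions are numbered from 1 at the N-terminus, as in the text, and overlapping occurrences are counted.

```python
from collections import defaultdict


def repeating_peptides(sequence, k):
    """Return {peptide: [1-based start positions]} for peptides seen more than once."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i + 1)
    return {pep: pos for pep, pos in positions.items() if len(pos) > 1}


# Invented fragment containing the motif VEKG twice.
fragment = "AVEKGTTVEKG"
triplets = repeating_peptides(fragment, 3)
quadruplets = repeating_peptides(fragment, 4)
print(triplets)
print(quadruplets)
```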

Plants represent a more interesting picture. Here, even though the samples are heavily biassed toward dicots, with more data becoming available it might be possible to differentiate the families, in contrast to animals where only orders or classes can be noted.


A particularly interesting point is that Sambucus nigra and Lycopersicon esculentum possess the repeating triplet NPK, suggesting a close relation between them. Whether data from other monocots will support it or not, Triticum sativa is the only plant that possesses a repeating tetrapeptide. Again the need for other representative samples becomes acute.

A few cautious speculations will attempt to demonstrate the application of the method to the problem of kingdom classification. The first question raised is: should fungi be classified as members of the Kingdom Plantae or should they be assigned a kingdom of their own [4]?

There are 43 dipeptides which occur exclusively among fungi (Fig. 5) and 23 dipeptides which occur exclusively among angiosperms (Fig. 6), while only 16 dipeptides are shared (Table 2). Of the total of 82 dipeptides, then, 52.5% occur among fungi, 28% among plants while only 19.5% are shared. A similar situation was obtained for the dipeptides that are exclusively missing from the fungi (6), from angiosperms (10) and from both (4). Of the total of 20 dipeptides which should not occur within each group, 30% are missing from fungi, 50% from the angiosperms while only 20% are absent from both. Thus it seems that there is not that great a similarity between the fungi and angiosperms. This conclusion, of course, is based on the similarity one finds between the algae and fungi and the currently favored view that angiosperms have evolved from the green algae. Unfortunately we do not have any algal cytochrome c sequence, and it will be interesting to compare algae and fungi according to this method. Without additional data, this conclusion should, therefore, be considered tentative. Table 3 shows that there are no shared triplets between the angiosperms and fungi. In the light of previous results discussed above, dealing with the repeating triplet and quadruplets found in the Kingdom Animalia, we feel that this finding is significant. Hence, even though with more data this picture might change, at the present time Whittaker's suggestion [4] to assign fungi a separate kingdom is supported.
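The proportions quoted for fungi and angiosperms follow directly from the counts in the text, as the short calculation below shows. The helper name `shares` is illustrative. Rounded to one decimal place, the first share comes out marginally below the 52.5% quoted, which looks like a rounding adjustment so that the three figures sum to 100.

```python
def shares(counts):
    """Percentage of the total contributed by each entry, to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Counts quoted in the text.
present = {"fungi only": 43, "angiosperms only": 23, "shared": 16}
absent = {"fungi only": 6, "angiosperms only": 10, "shared": 4}

print(shares(present))
print(shares(absent))
```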

Should Euglena gracilis be classified as an animal or a plant? Euglena shares three dipeptides, out of a possible total of 20, with angiosperms and seven dipeptides, out of a possible total of 17, with animals.

Furthermore, of the dipeptides which should not occur in angiosperms, according to Fig. 6, eight out of a total of 10 occur in Euglena, while only five, out of a total of 17, of the dipeptides which should not be found in animals occur in Euglena. The only protozoan matrix we have, belonging to Crithidia oncopelti, shares with Euglena 42 out of a possible 86 dipeptides. Thus our limited data suggest that E. gracilis is more like animals than plants. C. oncopelti shares the repeating triplet AKG with H. lanuginosa, which is a Deuteromycete.

Discussion

Proteins, in order to function, have to fold into a unique three dimensional structure which depends solely on the amino acid sequence [14]. This, of course, is the reason why a study of the nearest neighbor relationships of the constituent amino acids leads to an understanding of protein folding as well as identity [13, 15]. It is clear now that the amino acid sequences of proteins, even those which belong to the same family such as cytochromes c, hemoglobins etc., are not identical except in those very few cases where the organisms are very closely related, e.g. so far no difference has been found between the cytochrome c sequences of man and chimpanzee. The differences, then, must be representative of the changes that have occurred when new classes, orders, or even families appeared. In other words, these changes in sequence represent the differences in the identity of classes, orders and families during the course of evolution. This view is supported by the recently developed idea that even though homologous proteins of one family generally have very similar three dimensional structures, there still exist individual differences in detail.

During our studies of the sequence characteristics of proteins we have observed a biassed distribution of certain dipeptides, tripeptides etc. within classes and orders of organisms: while some occur only within, say, insects, others were conspicuously absent from the same taxa.

After successfully identifying an organism whose name had accidentally been left off the computer print-out containing its amino acid sequence and dipeptide matrix (the cytochrome c from Brassica napus), a concentrated effort was made to find out whether this was a lucky accident. The result was the development of a method which has the promise of being the first "objective" classificatory system which:

(a) does not rely upon any subjective choice of characteristics; and
(b) represents the changes in the primary structure of an informational macromolecule which have occurred during evolution, hence representing phylogenetic relationships [16].

Thus, this method profoundly differs from the ancestral sequence methods of Boulter [17] and of Dayhoff [11] and the tree building method of Fitch [18], in that it was not developed with the express purpose of discerning phylogenetic relationships.

The major difficulty in the method at the present time is the very unsatisfactory choice of samples, which do not really lend themselves to taxonomic refinement. For example, except for one sequence, there are no cytochrome sequences from Annelida, Crustacea, Polychaeta, Echinodermata, Mollusca, Nematoda or Platyhelminthes; from Chloro-, Chryso-, Phaeo- and Rhodophyta, Myxomycophyta, Bryophyta or Gymnospermae; or from the Kingdom Monera.

With an increase in the availability of a sufficient number of selected sequences, the method may be refined to even accommodate families of organisms. Furthermore it can be applied to other protein families and may show that it is universally applicable.

In conclusion, if one subscribes to the notion that the change in the amino acid sequence of proteins is due to random mutations, then those proteins that have repeating tetrapeptides ought to have evolved more recently than those which have only repeating tripeptides, which in turn are younger than those that do not have any repeating tripeptides. This is because the more ancient a protein, that is to say, the earlier the sequence emerged during evolution, the greater is the chance for it to have been modified, since it has been subject to change for a longer period of time. A simulation experiment, whereby a 100 amino acid long polypeptide, which contained five identical sequences distributed along its length, was subjected to random mutations for varying lengths of time, supports this view: the longer the sequence was subjected to random mutations, the fewer of those identical sequences remained unchanged [19]. Again, if supported by future data, this idea may help shed some light on the problem of determining which organisms may have preceded others.
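A simulation in the spirit of the one cited [19] can be sketched as follows. The details are assumptions made for illustration and are not taken from that reference: the repeated motif is VEKG, the five copies are evenly spaced, each mutation replaces one randomly chosen residue with a residue drawn uniformly from the 20 (so it may leave the residue unchanged), and "time" is simply the number of mutations applied.

```python
import random

ALPHABET = "ILVAGFMCPWHRKYTSQNED"


def build_polypeptide(motif, copies, length, rng):
    """Random sequence of `length` residues with `copies` evenly spaced copies of `motif`."""
    residues = [rng.choice(ALPHABET) for _ in range(length)]
    starts = [i * (length // copies) for i in range(copies)]
    for start in starts:
        residues[start:start + len(motif)] = motif
    return residues, starts


def mutate(residues, n_mutations, rng):
    """Apply point substitutions at random positions; the input list is not modified."""
    mutated = list(residues)
    for _ in range(n_mutations):
        mutated[rng.randrange(len(mutated))] = rng.choice(ALPHABET)
    return mutated


def intact_copies(residues, motif, starts):
    """Number of original motif positions that still read as the motif."""
    return sum("".join(residues[s:s + len(motif)]) == motif for s in starts)


rng = random.Random(0)
original, starts = build_polypeptide("VEKG", copies=5, length=100, rng=rng)
for n_mutations in (0, 25, 100, 400):
    trials = [intact_copies(mutate(original, n_mutations, rng), "VEKG", starts)
              for _ in range(200)]
    print(n_mutations, sum(trials) / len(trials))  # mean surviving copies
```

Averaging over many trials is a design choice of this sketch; a single run can fluctuate, since a mutated position may by chance revert.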

Acknowledgements--T. R. Marzolf is a graduate student in the Moore School of Electrical Engineering of the University of Pennsylvania. The authors are indebted to Dr. Ralph O. Erickson of the Department of Botany, University of Pennsylvania for his advice and many useful discussions, and to the Moore School for financial assistance with the computer studies.

References

1. Heywood, V. (1973) in Chemistry in Botanical Classification (Bendz, G. and Santesson, J., eds.) p. 41. Academic Press, New York.
2. Sokal, R. R. (1974) Science 185, 1115.
3. Sokal, R. R. and Sneath, P. H. A. (1973) Numerical Taxonomy, p. 6. Freeman, San Francisco.
4. Whittaker, R. H. (1969) Science 163, 150.
5. Margulis, L. (1970) Origins of Eukaryotic Cells. Yale University Press, New Haven.
6. Mez, C. and Ziegenspeck, H. (1926) Bot. Arch. 13, 483.
7. Chester, K. S. (1937) Q. Rev. Biol. 12, 294.
8. Fluck, H. (1963) in Chemical Plant Taxonomy (Swain, T., ed.) p. 167. Academic Press, New York.
9. Davis, P. H. and Heywood, V. H. (1965) The Principles of Angiosperm Taxonomy. Oliver and Boyd, Edinburgh.
10. Boulter, D. (1973) in Chemistry in Botanical Classification (Bendz, G. and Santesson, J., eds.) p. 212. Academic Press, New York.
11. Dayhoff, M. O. (1972) Atlas of Protein Sequence and Structure. Natl. Biomed. Res. Fdn., Washington, D.C.
12. Erhan, S., Marzolf, T. R. and Cohen, L., Submitted for publication.
13. Erhan, S., Submitted for publication.
14. Anfinsen, C. B., Haber, E., Sela, M. and White, F. H., Jr. (1961) Proc. Natl. Acad. Sci. U.S.A. 47, 1309.
15. Erhan, S. and Greller, L. D. (1974) Int. J. Pept. Protein Res. 5, 175.
16. Erhan, S. (1978) Int. J. Bio-Med. Comput. 9, 115.
17. Boulter, D., Ramshaw, J. A. M., Thompson, E. W., Richardson, M. and Brown, R. H. (1972) Proc. Roy. Soc. London Ser. B 181, 441.
18. Fitch, W. M. and Margoliash, E. (1970) Evol. Biol. 4, 67.
19. Marzolf, T. R., Greller, L. D. and Erhan, S. (1978) Int. J. Bio-Med. Comput. 9, 171.