

Protein Secondary Structure Prediction with Inclusion of Hydrophobicity Information

Tzu-Cheng Chuang, Okan K. Ersoy and Saul B. Gelfand

School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN

1. Introduction

Proteins are usually described at four levels of structure: primary, secondary, tertiary and quaternary. In this research, we predict the secondary structure from the amino acid sequence.

Acknowledgement

This research was supported by NSF Grant MCB-9873139 and partly by NSF Grant #0325544.

2. Support Vector Machine

The support vector machine (SVM) was introduced by Vapnik and co-workers in the 1990s. The algorithm seeks the separating hyperplane with the largest margin and the smallest training error. In the non-separable case, the hyperplane is determined by solving the following optimization problem:

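$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i}\xi_{i}
\quad \text{subject to} \quad
y_{i}\,(x_{i}^{T} w + b) \ge 1 - \xi_{i}, \qquad \xi_{i} \ge 0
$$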
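As a minimal sketch of the soft-margin problem in Section 2, the snippet below evaluates the objective for a given hyperplane. For fixed w and b, the smallest slack that satisfies both constraints is max(0, 1 - y_i (x_i^T w + b)). The function name, the toy data and the value of C are illustrative assumptions, not taken from the poster, and the code only evaluates the objective rather than solving the optimization.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) of the soft-margin SVM primal at a given (w, b)."""
    slacks = []
    for x_i, y_i in zip(X, y):
        # Functional margin y_i * (x_i^T w + b) of this pattern.
        margin = y_i * (sum(xj * wj for xj, wj in zip(x_i, w)) + b)
        # Smallest feasible slack: zero if the margin constraint already holds.
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return objective, slacks


# Toy 2-D data (hypothetical, for illustration only).
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [1, 1, -1]
objective, slacks = soft_margin_objective(w=[1.0, 0.0], b=0.0, X=X, y=y, C=10.0)
print(objective, slacks)
```

Only the second pattern lies inside the margin, so it is the only one with a non-zero slack. A larger C penalizes such violations more heavily relative to the margin term.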

3. Dataset

The protein sequence and structure data for membrane photosystem proteins were downloaded from the Protein Data Bank (PDB) and from the DSSP database of secondary structure assignments. Membrane proteins were identified using the information provided by the Stephen White laboratory at UC Irvine. The protein IDs of these membrane photosystem proteins are 2PPS, 1JB0, 1FE1, 1S5L, 2AXT, 1IZL, 1VF5 and 1Q90.

4. Hydrophobicity

Hydrophobicity and hydrophobic moment were previously used to classify membrane and surface protein sequences. The Eisenberg hydrophobicity scale is normalized to zero mean and unit standard deviation. The hydrophobic moment is calculated as follows:

$$
\mu_H = \left\{ \left[ \sum_{n=1}^{N} H_n \sin(\delta n) \right]^{2} + \left[ \sum_{n=1}^{N} H_n \cos(\delta n) \right]^{2} \right\}^{1/2}
$$

where μH is the hydrophobic moment, Hn is the hydrophobicity of amino acid number n, N is the window length (7 here), and δ is the turn angle between two consecutive amino acids. For the classical helices discussed here, the angle was set to 100°.
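The hydrophobic moment formula can be sketched in a few lines of Python. The function below takes the hydrophobicity values H_1 ... H_N of one window and the turn angle δ in degrees. The function name and the example values are hypothetical; in practice the values would come from the Eisenberg scale.

```python
import math


def hydrophobic_moment(hydrophobicities, delta_deg=100.0):
    """Hydrophobic moment of one window; residues are numbered n = 1..N."""
    delta = math.radians(delta_deg)
    sin_sum = sum(h * math.sin(delta * n)
                  for n, h in enumerate(hydrophobicities, start=1))
    cos_sum = sum(h * math.cos(delta * n)
                  for n, h in enumerate(hydrophobicities, start=1))
    # Square root of the sum of the two squared sums.
    return math.hypot(sin_sum, cos_sum)


# Hypothetical hydrophobicity values for a window of length N = 7.
window = [0.5, -1.0, 1.2, 0.3, -0.7, 1.0, -0.2]
mu_h = hydrophobic_moment(window)
print(mu_h)
```

A window whose hydrophobic residues cluster on one face of the helix gives a large moment, while an evenly mixed window gives a value near zero.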

5. Method

Amino acids differ in hydrophobicity and hydrophobic moment. We first segment the input data into two groups based on hydrophobicity or hydrophobic moment; the data in each group are then classified by a separate SVM. The threshold is chosen to maximize the difference in composition between the two groups. Input patterns with hydrophobicity greater than the threshold are sent to SVM1, and those with hydrophobicity smaller than the threshold are sent to SVM2. By segmenting the original dataset into two groups in this way, each SVM is trained on patterns with a similar composition of structural types, and the overall classification accuracy is expected to be higher.
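The segmentation step might look like the sketch below. The poster does not define "difference of the composition" precisely, so the code assumes it means the absolute difference in helix fraction between the two groups. That reading, the function names, the candidate thresholds and the toy data are all assumptions. Patterns whose score equals the threshold are sent to the second group here, a case the text leaves open.

```python
def split_by_threshold(patterns, scores, threshold):
    """Route patterns with score > threshold to group 1 (SVM1), the rest to group 2 (SVM2)."""
    group1, group2 = [], []
    for pattern, score in zip(patterns, scores):
        (group1 if score > threshold else group2).append(pattern)
    return group1, group2


def helix_fraction(labels):
    """Fraction of labels equal to 'H' (helix)."""
    return sum(1 for label in labels if label == "H") / len(labels)


def pick_threshold(scores, labels, candidates):
    """Pick the candidate that maximizes the helix-fraction gap (assumed criterion)."""
    best_threshold, best_gap = None, -1.0
    for t in candidates:
        high, low = split_by_threshold(labels, scores, t)
        if not high or not low:
            continue  # skip thresholds that leave one group empty
        gap = abs(helix_fraction(high) - helix_fraction(low))
        if gap > best_gap:
            best_threshold, best_gap = t, gap
    return best_threshold


# Toy data (hypothetical): one hydrophobicity score and one structure label per pattern.
scores = [0.9, 0.8, 0.5, 0.2, 0.1, -0.3]
labels = ["H", "H", "H", "C", "E", "C"]
threshold = pick_threshold(scores, labels, candidates=[0.0, 0.3, 0.6])
svm1_labels, svm2_labels = split_by_threshold(labels, scores, threshold)
print(threshold, svm1_labels, svm2_labels)
```

Each group would then be used to train and test its own classifier, with test patterns routed by the same threshold.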

6. Experimental Results

In each experiment, we randomly selected some proteins for training and others for testing, so that the accuracy with and without the proposed method could be compared. The hydrophobicity threshold was set to 0.42 and the hydrophobic moment threshold to 1.7.

Table 1. Testing classification accuracy for H / not-H with one SVM only and with two SVMs segmented by hydrophobicity.

Experiment        1        2        3        4        5
Only 1 SVM        73.49%   81.19%   58.08%   71.69%   62.55%
SVM1              86.78%   88.63%   74.89%   82.61%   73.59%
SVM2              76.57%   83.52%   67.41%   81.22%   65.39%
2 SVMs overall    80.07%   85.29%   70.06%   81.72%   68.4%

Table 2. Testing classification accuracy for H / not-H with one SVM only and with two SVMs segmented by hydrophobic moment.

Experiment        1        2        3        4        5
Only 1 SVM        73.49%   81.19%   71.69%   62.55%   72.81%
SVM1              72.99%   82.19%   75.42%   60.57%   72.74%
SVM2              79.46%   83.79%   76.87%   60.64%   83.04%
2 SVMs overall    75.75%   82.87%   76.05%   60.6%    77.2%

Table 3. Qtotal with one SVM only and with two SVMs.

Experiment        1        2        3        4        5
Only 1 SVM        76.02%   54.09%   60.35%   54.09%   69.05%
2 SVMs overall    80.66%   63.04%   63.74%   63.04%   76.95%
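The poster does not define Qtotal. Assuming it is the usual per-residue accuracy over the three states (helix H, sheet E, coil C), it can be computed as sketched below. The function name and the example strings are hypothetical. For the two-SVM case the sketch assumes the predictions of both groups are pooled before counting.

```python
def q_total(predicted, observed):
    """Percentage of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * correct / len(observed)


# Hypothetical three-state strings (H = helix, E = sheet, C = coil).
observed = "HHHEECCC"
predicted = "HHEEECCH"
print(q_total(predicted, observed))

# Pooling two groups (assumed handling of the two-SVM case): concatenate, then count.
pooled = q_total("HHE" + "CC", "HHH" + "CC")
print(pooled)
```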

7. Conclusions

This method is most useful when the hydrophobicity distributions of the three classes are well separated. In the hydrophobic moment experiments (Table 2), one experiment gives a worse result with two SVMs than with one (experiment 4: 60.6% versus 62.55%). The separability of the three structural classes is therefore important for improving classification accuracy. In these experiments, almost half of the training vectors became support vectors, which indicates that the patterns are difficult to differentiate. Including hydrophobicity and/or hydrophobic moment improves classification accuracy for the membrane proteins. In the future, we plan to search for other natural discriminating features of alpha-helix, beta-sheet and coil to further increase classification accuracy.

References

B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in D. Haussler, editor, 5th Annual ACM Workshop on COLT, pages 144-152, Pittsburgh, PA, 1992. ACM Press.

D. Eisenberg, E. Schwarz, M. Komaromy, and R. Wall, "Analysis of membrane and surface protein sequences with the hydrophobic moment plot," Journal of Molecular Biology, Volume 179, Issue 1, 15 October 1984, pages 125-142.

J. Y. Yang, M. Q. Yang, and O. Ersoy, "Datamining and knowledge discovery from membrane proteins," Proceedings of IEEE Intelligence: Methods and Applications Conference, Istanbul, June 2005.

Input Coding Method

With a window length of 7, the window covers the amino acid of interest together with the 3 preceding and 3 following positions. Each amino acid is represented as a 1×20 vector in one-of-m representation, such as [0 1 0 ... 0]. The pattern within a window of length 7 is therefore represented as a 1×140 vector.
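A minimal sketch of this one-of-m window coding is given below. The ordering of the 20 amino acids and the handling of positions beyond the sequence ends (left as all-zero blocks) are assumptions, since the poster does not specify them. The function name and the example sequence are also hypothetical.

```python
# The 20 standard amino acids in one-letter code; this ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {residue: i for i, residue in enumerate(AMINO_ACIDS)}


def encode_window(sequence, center, window=7):
    """Encode the window around `center` as a flat one-of-m vector of length window * 20."""
    half = window // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        # Positions outside the sequence or unknown residues stay all-zero (assumption).
        if 0 <= pos < len(sequence) and sequence[pos] in INDEX:
            one_hot[INDEX[sequence[pos]]] = 1
        vector.extend(one_hot)
    return vector


# Hypothetical sequence; encode the window centred on position 4 (residue 'Y').
pattern = encode_window("MKTAYIAKQR", center=4)
print(len(pattern), sum(pattern))
```

Each of the 7 blocks of 20 entries contains a single 1 when the position lies inside the sequence, so an interior window has exactly 7 non-zero entries out of 140.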