2009/2010
Artificial Neural Networks in Prediction of Secondary
Protein Structure Using CB513 Database
GENETIC PARADIGM: DNA WATSON-CRICK MODEL, GENETIC CODE
The answer comes from three concepts: the DNA Watson-Crick model (DNA-WC), the genetic paradigm (GP), and the genetic code (GC). Living systems cannot be imagined without the biological molecules DNA and protein. DNA resides in the nucleus and consists of two strands linked by complementary bases. Proteins are located in the cytoplasm and are built from codons, each made of three DNA bases. According to the genetic code, one codon specifies one amino acid, and these amino acids form the peptide chain.
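The codon-to-amino-acid mapping described above can be illustrated with a small sketch. The table below is only a fragment of the full 64-codon standard genetic code, chosen for the example:

```python
# Fragment of the standard genetic code: codon -> 1-letter amino acid symbol.
CODON_TABLE = {
    "ATG": "M",  # methionine (start codon)
    "GAA": "E",  # glutamate
    "TGG": "W",  # tryptophan
    "CAA": "Q",  # glutamine
    "CTG": "L",  # leucine
}

def translate(dna):
    """Translate a DNA coding strand into a peptide chain, codon by codon."""
    peptide = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        peptide.append(CODON_TABLE[dna[i:i + 3]])
    return "".join(peptide)

print(translate("ATGGAATGGCAACTG"))  # MEWQL
```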
Before explaining our algorithm, we must answer the following question: why do we need secondary protein structure prediction?
Protein Structure Prediction Problem
1. 3D protein structures?
2. Relationship between primary and tertiary protein structure?
3. Experimental methods?
4. Automated methods?
5. Bridge between primary and 3D structure?
• It is known that the 3D structure of a protein determines its function and behaviour. The unique 3D structure is determined by the primary sequence of the protein, and this sequence is determined by the structure of DNA. In this way DNA controls the development of our functions and hereditary characteristics.
• Determination of the relationship between the primary protein structure and its 3D structure is the main problem of the contemporary molecular biology.
• Primary protein sequences are becoming available at a rapid rate. However, determination of 3D protein structure by experimental methods is very expensive, takes a long time, and requires experts from different fields. The number of known 3D protein structures is only a tiny fraction of the total number of proteins.
• It is clear that we need automated methods for predicting 3D structures from primary structures.
• Since the rules of protein folding are largely unknown and the general problem of predicting 3D structures is unsolved, AI offers a bridge across the gap between primary structure and 3D structure, satisfying at least part of this need.
What is the bridge which AI offers?
THE BRIDGE BETWEEN PRIMARY AND TERTIARY PROTEIN STRUCTURE
THE CONCEPT OF AN ALGORITHM FOR PREDICTION OF SECONDARY PROTEIN STRUCTURE
// When using this database, please cite:
// PDBFinderII - a database for protein structure analysis and prediction
// Krieger,E., Hooft,R.W.W., Nabuurs,S., Vriend,G. (2004) Submitted
// ID         : 101M
// Header     : OXYGEN TRANSPORT
// Date       : 1998-04-08
// Compound   : myoglobin
// Compound   : Mutant
// Source     : (physeter catodon)
// Source     : sperm whale
// Water-Mols : 138
// Sequence   : MVLSEGEWQLVLHVWAKVEADVAGHGQDILI
// DSSP       : CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHH
// Nalign     : 4 5 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
PDBFIND2-CB513 DATA SET AND PRINCIPLE OF AMINO ACIDS CODING
Amino Acids 1-letter symbol Code Binary representation
Alanine A 01 10000000000000000000
Cysteine C 02 01000000000000000000
Aspartate D 03 00100000000000000000
Glutamate E 04 00010000000000000000
Phenylalanine F 05 00001000000000000000
Glycine G 06 00000100000000000000
Histidine H 07 00000010000000000000
Isoleucine I 08 00000001000000000000
Lysine K 09 00000000100000000000
Leucine L 10 00000000010000000000
Methionine M 11 00000000001000000000
Asparagine N 12 00000000000100000000
Proline P 13 00000000000010000000
Glutamine Q 14 00000000000001000000
Arginine R 15 00000000000000100000
Serine S 16 00000000000000010000
Threonine T 17 00000000000000001000
Valine V 18 00000000000000000100
Tryptophan W 19 00000000000000000010
Tyrosine Y 20 00000000000000000001
http://www.paraschorpa.com/project/evoca_prot/index.php
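The orthogonal (one-hot) coding in the table above maps each of the 20 amino acids to a 20-bit vector containing a single 1. A minimal Python sketch (a reconstruction, not the authors' code; the symbol order follows the table's codes 01-20):

```python
# Amino acid 1-letter symbols in the order of the coding table (codes 01-20).
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY"

def encode(symbol):
    """Return the 20-bit one-hot vector for a 1-letter amino acid symbol."""
    vec = [0] * 20
    vec[SYMBOLS.index(symbol)] = 1
    return vec

print("".join(map(str, encode("A"))))  # 10000000000000000000
print("".join(map(str, encode("C"))))  # 01000000000000000000
```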
DIFFERENT APPROACHES
1. TRAINING SET AND TEST SET FROM PDBFIND2:
There were homology and non-homology data in both sets.
2. TRAINING SET FROM CB513 (non-homology data set);
TEST SET FROM PDBFIND2 (homology and non-homology data sets).
3. TRAINING SET FROM CB513 (non-homology, 413 protein sequences);
TEST SET FROM CB513 (non-homology, 100 protein sequences).
APPLICATION INTERFACE USED FOR EXTRACTING,
PREPARING AND ENCODING (API_EPE2)
// When using this database, please cite:
// PDBFinderII - a database for protein structure analysis and prediction
// Krieger,E., Hooft,R.W.W., Nabuurs,S., Vriend,G. (2004) Submitted
// ID         : 101M
// Header     : OXYGEN TRANSPORT
// Date       : 1998-04-08
// Compound   : myoglobin
// Compound   : Mutant
// Source     : (physeter catodon)
// Source     : sperm whale
// Water-Mols : 138
// Sequence   : MVLSEGEWQLVLHVWAKVEADVAGHGQDILI
// DSSP       : CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHH
// Nalign     : 4 5 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
http://www.paraschorpa.com/project/evoca_prot/index.php
PDBFIND2
Elimination of unimportant parts of amino acid sequences and related secondary structures
CB513.txt MVLSEGEWQLVLHVWAKVEADVAGHGQDILI
CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHH
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGH
CCHHHHHHHHHCCEEEEEECTTSCEEEETTE
Separation of input data
and output data
PrimCB513.txt
11 18 10 16 04 06 06 19 14 10 07 11 12 08 05 04 11 10 15 08 03 04
SekCB513.txt
3 03 03 03 01 01 01 01 01 01 01 03 03 01 01 01 01 01 01 01 01 01
Encoding of extracted data into numeric patterns
CB513.all
Encoded
CB513.txt
PrimCB513.txt
11 18 10 16 04 06 06 19 14 10 07 11 12 08 05 04 11 10 15 08 03 04
SekCB513.txt
3 03 03 03 01 01 01 01 01 01 01 03 03 01 01 01 01 01 01 01 01 01
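The DSSP line of a record uses eight states (H, G, I, E, B, T, S, C), while the prediction above uses three numeric classes: 1 = helix, 2 = extended strand, 3 = coil. The sketch below uses the common reduction (H/G/I to helix, E/B to strand, everything else to coil); the authors' exact grouping is an assumption here and may differ slightly:

```python
def dssp_to_3state(dssp):
    """Reduce an 8-state DSSP string to 3-state numeric codes:
    1 = helix, 2 = extended strand, 3 = coil.
    (Common reduction; assumed, not taken from the authors' code.)"""
    out = []
    for s in dssp:
        if s in "HGI":
            out.append(1)
        elif s in "EB":
            out.append(2)
        else:
            out.append(3)
    return out

print(dssp_to_3state("CCCCHHHHGGG"))  # [3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1]
```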
prepare_data
vector
prepare_data
string
11 12 08 05 04 11 10 15
01 01 09 16
06 18 18 17 09 03 04
the first function (prepare_data) produces the vector:
11 12 8 5 4 11 10 15 0 1 1 9 16 0 6 18 18 17 9 3 4

For the secondary-structure codes
11231111
1132
112122211111
the function prepare_data produces the following string:
1123111101132011212221111.
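Judging from the worked example above, prepare_data concatenates the encoded sequences into one vector, inserting 0 as a separator between consecutive proteins. A Python reconstruction of that behaviour (not the authors' MATLAB code):

```python
def prepare_data(sequences):
    """Concatenate several encoded sequences into one vector,
    inserting 0 as a separator between consecutive sequences.
    (Reconstructed from the worked example, not the authors' code.)"""
    vector = []
    for i, seq in enumerate(sequences):
        if i > 0:
            vector.append(0)  # 0 marks a boundary between proteins
        vector.extend(int(code) for code in seq.split())
    return vector

seqs = ["11 12 08 05 04 11 10 15", "01 01 09 16", "06 18 18 17 09 03 04"]
print(prepare_data(seqs))
# [11, 12, 8, 5, 4, 11, 10, 15, 0, 1, 1, 9, 16, 0, 6, 18, 18, 17, 9, 3, 4]
```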
prepare_pattern prepare_target
OUTPUT TARGET MATRIX
1 1 0 0 1 1 0 1
0 0 0 1 0 0 1 0
0 0 1 0 0 0 0 0
INPUT PATTERN MATRIX
NEURAL NETWORK BASED ALGORITHM
FOR SECONDARY STRUCTURE PREDICTION
Design of input pattern matrix taken from the string
of encoded protein sequences
Transforming 1-number code into 20-number code
A window is a short segment of a complete protein string; in the middle of it lies the amino acid whose secondary structure we want to predict. This window moves along the protein one amino acid at a time.
The prediction is made for the central amino acid; if there is a 0 (sequence separator) at the central position of the window, the function prepare_pattern does not allow that window to be placed into the pattern matrix.
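The window mechanism above can be sketched in Python (a reconstruction from the description, not the authors' MATLAB prepare_pattern; the window length and codes are illustrative):

```python
def prepare_pattern(codes, window=13):
    """Collect all windows whose central element is a real amino acid code.
    Windows centred on the 0 separator between proteins are skipped."""
    half = window // 2
    patterns = []
    for centre in range(half, len(codes) - half):
        if codes[centre] == 0:
            continue  # separator position: no prediction is made here
        patterns.append(codes[centre - half:centre + half + 1])
    return patterns

codes = [11, 12, 8, 5, 4, 11, 10, 15, 0, 1, 1, 9, 16]
print(len(prepare_pattern(codes, window=5)))  # 8
```

In the full algorithm each code in a window is then expanded into its 20-bit one-hot vector before being placed into the input pattern matrix.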
INPUT PATTERN MATRIX
DESIGN AND TRAINING OF NEURAL NETWORK
Example from homologous data set
DESIGN AND LEARNING OF NEURAL NETWORK
IN MATLAB – NN TOOLBOX
Instruction creatennw is used for designing neural networks with different sizes of windows (11, 13, 15, 17, and 19); instruction createCB513w is used for creation of training set, and training of neural network is started with instruction trainw.
PDBFIND2
We did not consider homology between protein sequences in the test set taken from the PDBFIND2 data base.
PERFORMANCE EVALUATION OF NEURAL NETWORKS
USING NON-HOMOLOGOUS TEST DATA
CB513, a data set in which there is no sequence similarity between protein sequences, was divided into two sets:
• 413 training protein sequences (stored in the PrimTrain413.txt and SekTrain413.txt data files), and
• 100 test protein sequences (stored in the PrimTest100.txt and SekTest100.txt data files).
In MATLAB a software package nonhomtest was created with the following functions:
• design of the input data (samples) matrix using the PrimTrain413.txt data file, and design of the output data matrix using the SekTrain413.txt data file,
• design of a neural network with 5 neurons in the hidden layer and window size 19,
• training of the designed neural network over 4000 epochs,
• design of the input data (samples) matrix using the PrimTest100.txt data file, and design of the output data matrix using the SekTest100.txt data file, and
• accuracy evaluation of our neural networks.
Accuracy of the neural network prediction evaluated with the non-homologous test set is 62.7253%.
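The Q3 figure reported throughout is the percentage of residues whose 3-state class (helix/strand/coil) is predicted correctly. A minimal sketch of that measure:

```python
def q3(predicted, observed):
    """Percentage of residues assigned the correct 3-state class
    (1 = helix, 2 = strand, 3 = coil)."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

pred = [1, 1, 3, 2, 2, 3, 1, 3]
obs  = [1, 1, 3, 2, 1, 3, 3, 3]
print(round(q3(pred, obs), 1))  # 75.0
```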
PERFORMANCE EVALUATION OF NEURAL NETWORKS
USING NON-HOMOLOGOUS TEST DATA
In the training process, an improved backpropagation algorithm (with momentum and adaptive learning rate) was used, for different window sizes, different numbers of neurons in the hidden layer, different training sets, and different numbers of epochs.
The network output is compared to the desired output, and the weights are changed in the direction that minimizes the difference between the actual and desired output.
The main goal of the network is to minimize the total error E (SSE) over all output nodes j and all training examples p:

E = Σ_p Σ_j (T_pj − O_pj)²

where T_pj is the target output and O_pj the actual output of node j for example p.
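The total error formula above can be computed directly; a small sketch with illustrative target and output matrices:

```python
def sse(targets, outputs):
    """Sum-of-squares error over all examples p and output nodes j:
    E = sum_p sum_j (T_pj - O_pj)^2."""
    return sum(
        (t - o) ** 2
        for target_row, output_row in zip(targets, outputs)
        for t, o in zip(target_row, output_row)
    )

# Illustrative values: two examples, three output nodes each.
T = [[1, 0, 0], [0, 1, 0]]
O = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.2]]
print(round(sse(T, O), 2))  # 0.35
```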
COMPARISON WITH OTHER METHODS

Method                                  Q3%
Chou & Fasman (1978)                    50
Lim (1974)                              50
Robson (1978)                           53
Levin (1986)                            59.7
Qian & Sejnowski (1988), net1           62.7
Qian & Sejnowski (1988), net2           64.3
Chandonia & Karplus (1995)              73.9
Chandonia & Karplus (1996)              80.2
Avdagic & Purisevic (2005)              62
This study (Avdagic & Purisevic)        62.7253
SUMMARY OF CONCLUSIONS
In our earlier research, the neural network was evaluated with a protein set without considering homology or non-homology between protein sequences. The prediction was correct in 63.6261% (Q3) of test cases.
In the current study, we used a training set (CB513) in which there are no sequence similarities between protein sequences. However, we did not consider homology between protein sequences in the test set taken from the PDBFIND2 data base. The prediction result was 62.8776%. Due to homology among protein sequences, these results are not fully objective.
The solution in this study is based on dividing the CB513 non-homologous data set into two subsets. The first subset consists of 413 non-homologous protein sequences and was used for training our neural network. The second subset consists of 100 non-homologous protein sequences and was used for testing it. The accuracy of the neural network prediction evaluated with the non-homologous test set was 62.7253%, and we conclude that this result is objective. The achieved accuracy is somewhat lower than in the study that used neural network training and test sets without verification of protein homology.
CONSENSUS METHOD FOR SECONDARY PROTEIN STRUCTURE PREDICTION
The secondary structure (SS) determines how groups of amino acids form sub-structures such as coil, helix or extended strand. The correct derivation of SS provides vital information about the tertiary structure, and therefore the function, of the protein.
There are various methods which can be used to assign or predict secondary structure, including: the DSSP approach (which uses hydrogen-bond patterns as predictors), the DEFINE algorithm (which uses the distances between C-alpha atoms), and the P-CURVE method (which finds regularities along a helicoidal axis).
As might be expected, these disparate approaches do not necessarily agree with each other when given the same problem, and this can create problems for researchers.
The work by Selbig, Mevissen, and Lengauer (1999) develops the DECISION TREE as a method for achieving CONSENSUS between these approaches, by creating a dataset of predicted structures from a number of prediction methods for the same data set. The correct structure for each of the protein elements in the training set is known, and this forms the classification for each of the records in the training set.
The IDENTIFICATION TREE therefore creates rules of the form:
IF Method1 = Helix AND Method2 = Helix THEN Consensus = Helix.
This methodology ensures that prediction performance is at worst the same as the best prediction methods and in the best case should perform better than that.
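A consensus of this kind can be sketched as a simple per-residue majority vote over the individual predictors. The decision tree of Selbig et al. learns richer rules than a plain vote, so the sketch below only illustrates the idea; the method names and tie-breaking rule are illustrative assumptions:

```python
from collections import Counter

def consensus(predictions):
    """Majority vote per residue over several method predictions.
    Ties fall back to the first method's prediction (an assumption)."""
    result = []
    for states in zip(*predictions):
        best, n = Counter(states).most_common(1)[0]
        result.append(best if n > 1 else states[0])
    return "".join(result)

# Three hypothetical predictors disagreeing on a short segment.
method1 = "HHHCCEE"
method2 = "HHCCCEE"
method3 = "HHHCEEE"
print(consensus([method1, method2, method3]))  # HHHCCEE
```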
Intelligent Bioinformatics 169
IDENTIFYING PROTEIN SUBCELLULAR LOCATION
Protein function is often closely related to its location within the cell. Cai, Lui, and Chou (2002) used a Kohonen neural network to predict where a protein was located, based on its amino acid make-up.
The authors identified a set of data from Chou and Elrod (1999) in which each of the 2139 proteins was assigned an unambiguous class from this set: 1: CHLOROPLAST 2: CYTOPLASM 3: CYTOSKELETON 4: ENDOPLASMIC RETICULUM 5: EXTRACELLULAR 6: GOLGI APPARATUS 7: LYSOSOME 8: MITOCHONDRIA 9: NUCLEUS 10: PEROXISOME 11: PLASMA MEMBRANE 12: VACUOLE
The proteins themselves were represented by 20 variables, each representing one amino acid in the make-up of the protein. A Kohonen network was trained on the data set and then tested by inputting the test data and observing which node in the feature map was most highly activated by the test example. Using this method the network achieved an accuracy of almost 80 per cent on leave-one-out cross-validation. While this is some 2 per cent less than the state of the art (the covariant discriminant algorithm in this case), the authors claimed that this approach was more accurate than other widely used systems such as ProtLock.
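A single Kohonen training step on a 20-dimensional amino acid composition vector can be sketched as: find the best-matching node, then pull its weights toward the input. This is a bare-bones sketch of the idea, not the authors' network; the node count, learning rate, and the omission of the neighbourhood update are simplifying assumptions:

```python
import math
import random

random.seed(0)
DIM = 20    # one variable per amino acid in the composition vector
NODES = 12  # e.g. one node per subcellular location class (assumption)

# Random initial weight vector for each map node.
weights = [[random.random() for _ in range(DIM)] for _ in range(NODES)]

def best_matching_node(x):
    """Index of the node whose weight vector lies closest to input x."""
    return min(range(NODES), key=lambda n: math.dist(weights[n], x))

def train_step(x, lr=0.5):
    """Move the winning node toward x (neighbourhood update omitted)."""
    bmu = best_matching_node(x)
    weights[bmu] = [w + lr * (xi - w) for w, xi in zip(weights[bmu], x)]
    return bmu

x = [1.0 / DIM] * DIM  # a uniform composition vector, for illustration
bmu = train_step(x)
print(0 <= bmu < NODES)  # True
```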
AN IDEALIZED KSOM
[Figure: an idealized Kohonen self-organizing map. The input layer consists of nine eight-bit vectors representing the grid row by row, e.g. (00000000)(00000000)(00110000)(...)(00000001), fully connected to the output layer; 18 training patterns were used.]
QUESTIONS
1. Explain the genetic paradigm.
2. Explain the diagram for extracting, preparing and encoding data.
3. Explain the neural network algorithm for prediction of secondary protein structure.
INITIAL PARAMETERS OF NEURAL NETWORK
Prediction of secondary structure was implemented using a non-linear, three-layer feed-forward neural network with supervised learning and the back-propagation error algorithm [8][10]. Default values for the window size, the number of neurons in the hidden layer and the number of training epochs are presented in the next Table.
Algorithm parameters                   Initial values
Window size                            13
Number of neurons in hidden layer      5
Number of training epochs              250
Tab3
DIFFERENT WINDOW SIZES
After that, we evaluated performances of neural networks using testing sets stored in files PrimStrTestB.txt and SekStrTestB.txt. Results are shown in the next Table.
Test  Training/testing sets     Window size  Q3% (training set)  Q3% (test set)
1     trainsetw11/testsetw11    11           62.2150             60.9910
2     trainsetw13/testsetw13    13           62.4055             61.1011
3     trainsetw15/testsetw15    15           57.0756             54.7338
4     trainsetw17/testsetw17    17           62.5809             61.7053
5     trainsetw19/testsetw19    19           63.5312             62.5160
Average value Q3                             61.5616             60.2094
Tab4
DIFFERENT NUMBER OF NEURONS IN HIDDEN LAYER
For this purpose we use the following instructions:
creatennhn (to design neural networks with different number of neurons in hidden layer),
trainhn (to train neural networks), and accuracyhn (for performance evaluation). Results
are shown in the next Table.
Test  Number of neurons in hidden layer  Q3% (training set)  Q3% (test set)
1     2                                  44.1607             41.9736
2     3                                  63.2946             61.5593
3     4                                  63.2101             62.2373
4     5                                  63.5312             62.5160
5     6                                  62.8499             61.4689
6     7                                  63.6632             62.1544
7     8                                  63.6323             62.3352
8     9                                  63.6168             62.1695
9     10                                 63.4052             61.8531
10    11                                 63.5086             62.1544
11    12                                 63.4111             62.0866
12    13                                 63.7809             62.3051
Average value Q3                         61.8387             60.4011
Tab5
DIFFERENT NUMBER OF EPOCHS
Finally, we used five neurons in the hidden layer, window size 19 and a different number of epochs, using the instruction traine. Results are shown in the next Table.
Test  Number of training epochs  Q3% (training set)  Q3% (test set)
1     100                        44.5602             43.0508
2     150                        57.8345             56.1582
3     200                        63.0461             61.9962
4     250                        63.5312             62.5160
5     300                        63.5229             62.3879
6     500                        63.5550             62.4633
7     1000                       63.7357             62.3879
8     2000                       64.3885             62.5838
9     3000                       64.7369             62.8173
10    4000                       64.8546             62.8776
11    5000                       64.8914             62.8399
12    10000                      64.5740             61.9736
Average value Q3                 61.9539             60.3377
Tab6