
[IEEE 2008 First International Conference on Emerging Trends in Engineering and Technology - Nagpur, Maharashtra, India (2008.07.16-2008.07.18)] 2008 First International Conference


Application of Genetic Algorithms in Structural Representation of Proteins

Mrs. Pallavi M. Chaudhari

PIET, Nagpur, India [email protected]

Mr. Prasad P. Thute, PhD student,

Dresden University of Technology, Dresden, Germany.

[email protected]

Abstract

One of the most promising and rapidly growing application areas of genetic algorithms (GAs) is data analysis and prediction in molecular biology. GAs have been used to predict protein structure, and there has recently been considerable effort towards developing methods such as GAs and neural networks for predicting protein structures automatically. This GA prediction project illustrates one way in which GAs can be applied to this task.

Keywords: genetic algorithm, protein structure.

1. Protein

Proteins are the fundamental functional building blocks of all biological cells. The main purpose of DNA in a cell is to encode instructions for building proteins, which in turn carry out most of the structural and metabolic functions of the cell. A protein is made up of a sequence of amino acids connected by peptide bonds. The length of the sequence varies from protein to protein but is typically on the order of 100 amino acids. Owing to electrostatic and other physical forces, the sequence folds up into a particular three-dimensional structure. It is this three-dimensional structure that primarily determines the protein’s function. The three-dimensional structure of Crambin (a plant-seed protein consisting of 46 amino acids) is taken as an example.

Figure 1 – 3D structure of Crambin

The three-dimensional structure of a protein is determined by the particular sequence of its amino acids, but it is not currently known precisely how a given sequence leads to a given structure. In fact, being able to predict a protein’s structure from its amino acid sequence is one of the most important unsolved problems of molecular biology and biophysics. Not only would a successful prediction algorithm be a tremendous advance in the understanding of the biochemical mechanisms of proteins, but, since such an algorithm could conceivably be used to design proteins to carry out specific functions, it would have profound, far-reaching effects on biotechnology and the treatment of disease.

The amino acid sequence of Crambin is taken, and a GA is used to search the space of possible structures for one that fits well with that sequence. The most straightforward way to describe the structure of a protein is to list the three-dimensional coordinates of each amino acid, or even of each atom. In principle, a GA could use such a representation, evolving vectors of coordinates to find one that results in a plausible structure.

First International Conference on Emerging Trends in Engineering and Technology

978-0-7695-3267-7/08 $25.00 © 2008 IEEE

DOI 10.1109/ICETET.2008.212


1.1. Structure of protein

The tertiary structure of the protein Crambin is illustrated in Figure 2. The alpha helices are easily identifiable. A beta sheet is a relatively straight and flat region in the sequence. Random coils are segments with no repetitive structure. This figure was created with the Quanta software package.

Figure 2 – Tertiary structure of Crambin

1.2. Protein backbone

Known protein structures that have been solved by X-ray crystallography and NMR are recorded in the Brookhaven Protein Data Bank (PDB) [Bernstein et al. 1977]. Each PDB file contains the Cartesian coordinates of all the atoms of the particular protein. These coordinates can be used to calculate the torsion angles, from which the secondary structures can then be identified. Each amino acid in the protein can then be labelled as belonging to an alpha helix, a beta sheet, or a random coil.

A portion of the data file created for the protein Crambin is shown in Table I. Crambin contains 46 amino acids, and each amino acid in the sequence is represented by one line in the file. The first column lists its secondary structure (S denotes beta sheet, C random coil, and H alpha helix). The second column gives its index, or position in the sequence, and the last three columns give the x, y, and z coordinates of the amino acid’s Cα atom.
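The file layout described above is simple enough to read with a few lines of code. The following Python sketch is an illustration only (the paper does not show its own parsing code): it parses the eight rows listed in Table I and computes the distance between neighbouring Cα atoms. The names `parse_rows` and `consecutive_ca_distances` are hypothetical.

```python
import math

# Rows copied from Table I: structure label, index, then x, y, z of the C-alpha atom.
SAMPLE_ROWS = """\
S 1 16.967 12.784 4.338
S 2 13.856 11.469 6.066
S 3 13.660 10.707 9.787
S 4 10.646 8.991 11.408
C 5 9.448 9.034 15.012
C 6 8.673 5.314 15.279
H 7 8.912 2.083 13.258
H 8 5.145 2.209 12.453
"""

# Meaning of the first column, as stated in the text.
LABELS = {"S": "beta sheet", "C": "random coil", "H": "alpha helix"}


def parse_rows(text):
    """Parse 'label index x y z' lines into a list of dicts (hypothetical helper)."""
    residues = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        label, index, x, y, z = line.split()
        residues.append({
            "structure": LABELS[label.upper()],
            "index": int(index),
            "ca": (float(x), float(y), float(z)),
        })
    return residues


def consecutive_ca_distances(residues):
    """Euclidean distance between each pair of neighbouring C-alpha atoms."""
    return [math.dist(a["ca"], b["ca"]) for a, b in zip(residues, residues[1:])]


residues = parse_rows(SAMPLE_ROWS)
distances = consecutive_ca_distances(residues)
for residue, d in zip(residues, distances):
    print(residue["index"], residue["structure"], round(d, 3))
```

Torsion angles, which the text uses to identify secondary structure, need four consecutive atoms per angle and are not computed in this sketch.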

Table I – x, y, and z coordinates of the amino acid’s Cα atom

Structure  Index       x        y        z
S              1  16.967   12.784    4.338
S              2  13.856   11.469    6.066
S              3  13.660   10.707    9.787
S              4  10.646    8.991   11.408
C              5   9.448    9.034   15.012
C              6   8.673    5.314   15.279
H              7   8.912    2.083   13.258
H              8   5.145    2.209   12.453

1.3. Projections of protein

Figure 3 – Stereo projection of Crambin with side chains

Figure 4 – Stereo projection of Crambin without side chains


2. Implementation

Start with a population of different possible protein structures. Design the fitness function so that it minimizes the potential energy of the protein (the conformation with minimum potential energy is taken to be the stable one). Some useful relations with respect to the protein:

Hence the potential energy is taken to be directly proportional to the length of the protein bonds.

3. Concept applied

The concept of the traveling salesman problem (TSP) is applied to obtain a folded protein structure with low potential energy. The TSP is an optimization problem in which there is a finite number of cities and the cost of travel between each pair of cities is known. The goal is to find an ordering of all the cities for the salesman to visit such that the total cost is minimized. To solve the TSP, we need a list of city locations and the distances, or costs, between them.

Similarly, in protein folding prediction there is a finite number of points at which the protein folds, and the potential energy between each pair of points is known. The goal is to find an ordered set of points for the protein to fold through such that the total potential energy is minimized.

We generate random point locations inside the border of a specified area. The MATLAB INPOLYGON function can be used to make sure that all the points are inside, or very close to, that area.

Blue circles represent the locations of the points where the protein folds. Given the list of points as locations, we can calculate the distance matrix for all the points.
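As a rough illustration of these two steps (sampling fold points inside a border and building the distance matrix), the Python sketch below uses a ray-casting test in place of MATLAB's INPOLYGON. The border polygon, the number of points, and all function names are assumptions made for the example; the paper does not specify them. Unlike the text, the sketch only accepts points strictly inside the border.

```python
import math
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_points_in_polygon(polygon, count, rng):
    """Rejection-sample `count` points from the polygon's bounding box."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < count:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points


def distance_matrix(points):
    """Pairwise Euclidean distances between all points."""
    return [[math.dist(p, q) for q in points] for p in points]


# Hypothetical border and point count, chosen only for illustration.
border = [(0, 0), (4, 0), (5, 3), (2, 5), (-1, 3)]
rng = random.Random(0)
locations = random_points_in_polygon(border, 20, rng)
distances = distance_matrix(locations)
```

Rejection sampling is used here because it is simple; it becomes slow only when the polygon fills a small fraction of its bounding box.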

By default, the genetic algorithm solver solves optimization problems based on double and binary string data types. The functions for creation, crossover, and mutation assume the population is a matrix of type double, or logical in the case of binary strings. The genetic algorithm solver can also work on optimization problems involving arbitrary data types, so any data structure can be used for the population. For example, a custom data type can be specified using a MATLAB cell array. In order to use GA with a population of type cell array, you must provide a creation function, a crossover function, and a mutation function that work on your data type, e.g., a cell array.

This section demonstrates how to create and register the three required functions. An individual in the population for the protein folding problem is an ordered set, so the population can easily be represented using a cell array. The custom creation function for the protein folding problem creates a cell array, say P, in which each element represents an ordered set of points as a permutation vector. That is, the protein chain visits the points in the order specified in P{i}, just as the salesman visits the cities in the TSP. The creation function returns a cell array of size PopulationSize.
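A creation function of this kind can be sketched in a few lines. The Python version below is an illustration under assumptions, not the paper's MATLAB code: a list of lists stands in for the cell array P, and the name `create_permutations` and the sizes used are hypothetical.

```python
import random


def create_permutations(n_points, population_size, rng):
    """Return `population_size` random orderings of the point indices 0..n_points-1."""
    return [rng.sample(range(n_points), n_points) for _ in range(population_size)]


rng = random.Random(1)
# Each element plays the role of P{i}: one ordered set of fold points.
population = create_permutations(n_points=8, population_size=5, rng=rng)
for individual in population:
    print(individual)
```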

3.1. Crossover

The custom crossover function takes a cell array, the population, and returns a cell array, the children that result from the crossover.
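The paper does not say which crossover operator it uses. As one possibility, the Python sketch below swaps in a simplified order crossover, a common choice for permutation-encoded individuals because every child remains a valid permutation. The function names and the pairing of consecutive parents are assumptions.

```python
import random


def order_crossover(parent_a, parent_b, rng):
    """Keep a random slice of parent_a in place; fill the rest in parent_b's order."""
    n = len(parent_a)
    i, j = sorted(rng.sample(range(n + 1), 2))  # slice bounds with i < j
    kept = parent_a[i:j]
    kept_set = set(kept)
    rest = [p for p in parent_b if p not in kept_set]
    return rest[:i] + kept + rest[i:]


def crossover_population(parents, rng):
    """Pair up consecutive parents and return two children per pair.

    With an odd number of parents the last one is left unpaired.
    """
    children = []
    for k in range(0, len(parents) - 1, 2):
        a, b = parents[k], parents[k + 1]
        children.append(order_crossover(a, b, rng))
        children.append(order_crossover(b, a, rng))
    return children


rng = random.Random(2)
parents = [rng.sample(range(6), 6) for _ in range(4)]
children = crossover_population(parents, rng)
```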

3.2. Mutation

The custom mutation function takes an individual, which is an ordered set of points, and returns a mutated ordered set. We also need a fitness function for the protein folding problem. The fitness of an individual is the total potential energy of an ordered set of points. The fitness function also needs the distance matrix to calculate the total length.
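The mutation operator itself is not specified in the paper. A minimal sketch, assuming a swap mutation that exchanges two randomly chosen positions, could look like the following Python code (the name `swap_mutation` is hypothetical).

```python
import random


def swap_mutation(individual, rng):
    """Return a copy of `individual` with two randomly chosen positions swapped."""
    mutated = list(individual)
    i, j = rng.sample(range(len(mutated)), 2)  # two distinct positions
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


rng = random.Random(3)
original = [0, 1, 2, 3, 4, 5]
mutant = swap_mutation(original, rng)
print(original, "->", mutant)
```

Because the operator only exchanges entries, the result is always a valid ordering of the same points.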

3.3. Fitness function

GA will call our fitness function with just one argument, 'x', but our fitness function has two arguments: x and distances. We can use an anonymous function to capture the value of the additional argument, the distance matrix. We create a function handle 'FitnessFcn' to an anonymous function that takes one input, 'x', but calls 'protein_folding_fitness' with x and distances. The variable distances has a value when the function handle 'FitnessFcn' is created, so this value is captured by the anonymous function.
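The same pattern can be sketched in Python, where `functools.partial` plays the role of the anonymous function handle. This is an illustration, not the paper's MATLAB code: the body of `protein_folding_fitness` is an assumption (total length of an open chain through the points, following the statement that potential energy is proportional to bond length), and the four sample locations are made up.

```python
import functools
import math


def protein_folding_fitness(x, distances):
    """Total length of the open chain that visits the points in the order `x`."""
    return sum(distances[a][b] for a, b in zip(x, x[1:]))


# Hypothetical fold points: the corners of a 3-by-4 rectangle.
locations = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
distances = [[math.dist(p, q) for q in locations] for p in locations]

# partial binds the current `distances` object, so the solver can call
# fitness_fcn(x) with a single argument.
fitness_fcn = functools.partial(protein_folding_fitness, distances=distances)

print(fitness_fcn([0, 1, 2, 3]))  # chain along three sides of the rectangle
print(fitness_fcn([0, 2, 1, 3]))  # chain that uses both diagonals
```

For these points the first ordering has length 3 + 4 + 3 = 10 and the second 5 + 4 + 5 = 14, so the first ordering is the fitter individual under a minimization objective.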


3.4. Plot

We can add a custom plot function to plot the locations of the points and the current best folding. A red circle represents a point, and a blue line represents a valid bond between two points.

Once again, an anonymous function is used to create a function handle, this time one that calls 'protein_folding_plot' with the additional argument 'locations'.

3.5. GA results

GA options setup: First, an options structure is created to indicate a custom data type and the population range.

Next, the custom creation, crossover, mutation, and plot functions are registered, and some stopping conditions are set.

Finally, the genetic algorithm is called with the problem information.
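To show how the pieces fit together, the following Python sketch runs a small permutation GA end to end. It is not the MATLAB solver used in the paper: the tournament selection, elitism, simplified order crossover, swap mutation, all parameter values, and the random points in the unit square are assumptions chosen to keep the example short.

```python
import math
import random


def chain_length(order, distances):
    """Fitness: total length of the open chain visiting points in `order`."""
    return sum(distances[a][b] for a, b in zip(order, order[1:]))


def order_crossover(a, b, rng):
    """Keep a random slice of `a` in place; fill the rest in `b`'s order."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    kept = a[i:j]
    kept_set = set(kept)
    rest = [p for p in b if p not in kept_set]
    return rest[:i] + kept + rest[i:]


def swap_mutation(order, rng):
    """Swap two randomly chosen positions in a copy of `order`."""
    child = list(order)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def tournament(population, scores, rng, k=3):
    """Return the best of k randomly chosen individuals (lower score wins)."""
    picks = rng.sample(range(len(population)), k)
    return population[min(picks, key=lambda idx: scores[idx])]


def run_ga(distances, population_size=60, generations=200,
           mutation_rate=0.2, elite_count=2, seed=0):
    rng = random.Random(seed)
    n = len(distances)
    population = [rng.sample(range(n), n) for _ in range(population_size)]
    history = []  # best score at the start of each generation
    for _ in range(generations):
        scores = [chain_length(ind, distances) for ind in population]
        ranked = sorted(range(population_size), key=lambda idx: scores[idx])
        history.append(scores[ranked[0]])
        # Elitism: the best individuals survive unchanged.
        next_population = [population[idx] for idx in ranked[:elite_count]]
        while len(next_population) < population_size:
            child = order_crossover(tournament(population, scores, rng),
                                    tournament(population, scores, rng), rng)
            if rng.random() < mutation_rate:
                child = swap_mutation(child, rng)
            next_population.append(child)
        population = next_population
    best = min(population, key=lambda ind: chain_length(ind, distances))
    return best, chain_length(best, distances), history


# Hypothetical problem instance: 15 random fold points in the unit square.
point_rng = random.Random(42)
locations = [(point_rng.random(), point_rng.random()) for _ in range(15)]
distances = [[math.dist(p, q) for q in locations] for p in locations]
best_order, best_energy, history = run_ga(distances)
print(best_order, round(best_energy, 3))
```

Because the elite individuals are copied unchanged into the next generation, the best score recorded in `history` can never get worse from one generation to the next.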

Figure 5 – Points where the protein folds (first run)

Figure 6 – One of the stable protein structures after applying GA (first run)

4. Conclusion

Genetic algorithms proved to be an efficient search tool for structural representations of proteins. For a protein model with a simple force field (potential energy) as the fitness function, and using a rather small population, the genetic algorithm produced several individuals (i.e., protein conformations) of dissimilar topology, each with a highly optimized fitness value.
