
[IEEE 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing - Weimar, Germany (2009.02.18-2009.02.20)] 2009 17th Euromicro International


Parallel Protein Structure Prediction by Multiobjective Optimization

J.C. Calvo and J. Ortega
Department of Computer Architecture and Technology

University of Granada

Abstract—The protein structure prediction (PSP) problem is considered an open problem, as there is no recognized "best" procedure to find solutions. Moreover, this problem presents a vast search space, and the analysis of each conformation requires a significant amount of computing time. Thus, parallel processing could provide new opportunities to improve the quality of the solutions found so far. In this paper we use a parallel multiobjective evolutionary approach to the problem, including a method that decreases its complexity by reducing the set of variables involved in the optimization process. The proposed parallel procedure provides linear speedups in the structure prediction of the proteins used as benchmarks.

Index Terms—Multiobjective Optimization, Parallelism, Protein Structure Prediction, PSP

I. INTRODUCTION

Proteins have important biological functions such as the enzymatic activity of the cell, attacking diseases, transport and biological signal transduction, among others. They are chains of amino acids selected from a set of twenty elements. Whenever an amino acid chain is synthesized, it folds, and the sequence uniquely determines its 3D structure. Moreover, although the amino acid sequence of a protein provides interesting information, the functionality of a protein is exclusively determined by its 3D structure [1]–[3]. Thus, there is a high interest in knowing the 3D structure of any given protein. For example, since some proteins attack diseases and protein functionality depends on structure, knowing the 3D structure makes it possible to learn its correlation with the protein functionality. Thus, knowledge of the protein 3D structure can aid in the design of efficient drugs. Many useful applications therefore involve determining the protein structure, and many disciplines, such as Medicine, Biochemistry, Biology and Engineering, are interested in this problem.

It is possible to obtain the 3D structure of a protein experimentally by using methods such as X-ray crystallography and nuclear magnetic resonance (NMR). Nevertheless, these processes are quite complex and costly, as they require months of expert work and laboratory resources. This situation becomes clear when considering that less than 25% of the protein structures included in the PDB (Protein Data Bank) have been solved [2].

An alternative approach is to use high performance computing. Nevertheless, the computational analysis of each conformation requires significant time, and this is a Grand Challenge Problem that still remains unsolved [1]–[3]. This computational approach is called protein structure prediction (PSP) and implies predicting the tertiary structure of a protein given its primary structure. Recently, efforts in protein structure prediction such as Rosetta@Home [4] and Predictor@Home [5] have been made using grid or global computing. These proposals try to augment previous methods and algorithms with orders of magnitude more computing power in order to improve the prediction quality [4]. The procedure proposed in this paper is based on a multiobjective evolutionary algorithm. As evolutionary algorithms are population-based metaheuristics, they allow an efficient and easy-to-implement workload distribution among the processors of the parallel/distributed platform at hand. Although some multiobjective optimization approaches to PSP have been proposed [1], [6], to our knowledge their parallelization has not been studied in depth. In [7], a parallel hybrid evolutionary algorithm is proposed that includes a conjugate gradient-based hill climbing local search method. In our procedure, we include a method to manage torsion angles that reduces the complexity of the search space by using the backbone-dependent rotamer library.

Fig. 1. Protein structures. From primary structure to quaternary structure.

The protein structure can be divided into four levels: primary structure, secondary and super-secondary structure, tertiary structure and quaternary structure (Fig. 1). The primary structure is the ordered sequence of amino acids. The secondary structure is a set of contiguous amino acids joined by hydrogen bonds. The super-secondary structure is the combination of two secondary structures by a short connecting peptide. The tertiary structure is the three-dimensional structure of a single sequence of a protein. All force-field atoms take part in this conformation. Finally, the quaternary structure refers to a protein formed by two or more sequences. This structure defines the relations between the different sequences of the protein. Thus, PSP tries to determine how the primary structure translates into the tertiary structure.


Parallel, Distributed and Network-based Processing

1066-6192/09 $25.00 © 2009 IEEE

DOI 10.1109/PDP.2009.13


The paper is structured as follows. In Section II we introduce the optimization concepts that we use throughout the work. The next section defines the different methods that a multiobjective optimization process needs in this context. In Section IV we describe our approach to tackle the PSP problem with a multiobjective algorithm in order to produce operative software. Section V shows a parallel approach to this problem. Finally, Section VI shows the results and compares our algorithms with other approaches, and Section VII provides the conclusions and suggests future work.

II. OPTIMIZATION PROCESS

Computational approaches to PSP can be divided into three main alternatives: comparative modelling, fold recognition and ab initio procedures [2], [3], [6], [8], etc. The first two alternatives assume that similar primary structures will fold in similar ways and try to take advantage of known structures for similar amino acid sequences (homologue sequences). The ab initio alternative does not require any homology and can be applied when an amino acid sequence does not correspond to any other known one. In this case, to predict the protein 3D structure, we need to know the relation between primary and tertiary structures. Although this relation is not trivial, and there are many factors affecting the folding process that produces the final 3D conformation, we are looking for the native tertiary conformation with minimum free energy. Therefore, we do not consider external factors that may influence the protein folding process, such as temperature, neighbor proteins, and other conditions inside the cell. We are considering Protein Structure Prediction (PSP) rather than Protein Folding (PF).

The PSP problem is usually tackled as an optimization problem [1], [2], etc. Thus, it is necessary to have, explicitly or implicitly, a cost or energy function that is optimized by the native tertiary conformation of the protein. Among the optimization approaches that have been proposed to solve the PSP problem are molecular dynamics, statistical mechanics, probabilistic road maps, and lattice models [9], [10].

Any cost function depends on a set of variables that define the search space. These variables provide the information required to build the 3D conformation of the protein. Many algorithms have been proposed to solve the PSP problem by optimizing an objective or energy function, usually by using evolutionary algorithms [10]–[13]. Nevertheless, over the last few years, some new approaches have been suggested that model the PSP problem as a multiobjective problem [1], [14]. There are several reasons for trying a multiobjective approach. For example, as indicated in [14], some work demonstrates that some evolutionary algorithms improve their effectiveness when they are applied to multiobjective formulations [15]. Moreover, one of the most frequently used energy functions is the CHARMM energy function, which consists of several (ten) added terms that correspond to the bonded energies (stretching, bending, and torsion terms) and the non-bonded energies (van der Waals and electrostatic terms). Whenever a multiobjective approach is considered, it is possible to define separately several functions that are simultaneously optimized and thus to provide a set of solutions in the Pareto front, among which a decision maker can select the solutions in the preferred zone of the front [14]. Indeed, in [1] it is argued that the PSP problem can be naturally modelled as a multiobjective problem because the protein conformations could involve tradeoffs among different objectives, as is experimentally shown by analyzing the conflict between bonded and non-bonded energies. In this paper, we also propose a multiobjective approach to the PSP problem.

In many optimization problems, once a suitable cost function is found, it is not difficult to obtain an efficient evolutionary algorithm that generates sufficiently good solutions for the problem. This is not the case for the PSP problem. As the search space is so vast, it is necessary to select an adequate set of variables and a representation space, and to include specific information about the PSP problem in the evolutionary algorithm. Thus, some memetic algorithms (those that hybridize evolutionary algorithms with local search operators) have been previously proposed for the PSP problem [10], [11], [16].

A. Set of variables

Only a few representations of the tertiary structure of a protein are commonly used [1]. The most frequent ones are the all-atom three-dimensional coordinates, the backbone-atom three-dimensional coordinates with side-chain centroids, and the backbone and side-chain torsion angles.

In this paper we use torsion angles to represent the conformation of the protein, because this representation needs fewer variables than the other alternatives. As shown in Table I, three torsion angles are required in the backbone for each amino acid, plus some additional torsion angles depending on the side-chain [1], [2].

TABLE I
χ ANGLES PER EACH AMINO ACID.

Residue                                   χ angles
GLY, ALA, PRO                             only backbone
SER, CYS, THR, VAL                        χ1
ILE, LEU, ASP, ASN, HIS, PHE, TYR, TRP    χ1, χ2
MET, GLU, GLN                             χ1, χ2, χ3
LYS, ARG                                  χ1, χ2, χ3, χ4

We have also used a reduced set of variables by applying the backbone-dependent rotamer library [17]. This concept is explained below in Section III.

B. Cost function

In this paper, we propose a multiobjective evolutionary algorithm [18] to solve the PSP problem. This kind of algorithm needs a cost function to compare any new solution with the best solution found so far. Although a realistic measure of protein conformation quality should probably consider quantum mechanics principles, it would be too computationally complex to be useful. Thus, as is usual, we have used the Chemistry at HARvard Macromolecular Mechanics (CHARMM) energy function [1]–[3], [19], etc. It is one of the most popular all-atom force fields used for studying macromolecules. We have used its implementation in the TINKER library package [20].


Fig. 2. Backbone angles φ, ψ and ω.

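The CHARMM energy function has the form:

$$
\begin{aligned}
E_{charmm} ={} & \underbrace{\sum_{\text{bonds}} K_b (b - b_0)^2}_{E_1}
 + \underbrace{\sum_{\text{UB}} k_{UB} (S - S_0)^2}_{E_2}
 + \underbrace{\sum_{\text{angles}} k_0 (\theta - \theta_0)^2}_{E_3} \\
 & + \underbrace{\sum_{\text{torsions}} k_\chi \left[ 1 + \cos(n\chi - \delta) \right]}_{E_4}
 + \underbrace{\sum_{\text{impropers}} K_{imp} (\phi - \phi_0)^2}_{E_5} \\
 & + \underbrace{\sum_{\text{non-bond}} \varepsilon_{ij} \left[ \left( \frac{R^{min}_{ij}}{\tau_{ij}} \right)^{12} - \left( \frac{R^{min}_{ij}}{\tau_{ij}} \right)^{6} \right]}_{E_6}
 + \underbrace{\sum_{\text{non-bond}} \frac{q_i q_j}{e\,\tau_{ij}}}_{E_7}
\end{aligned}
\tag{1}
$$

where [1]: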

1) b is the bond length, b0 is the bond equilibrium distance and Kb is the bond force constant.
2) S is the distance between two atoms separated by two covalent bonds, S0 is the equilibrium distance and kUB is the Urey-Bradley force constant.
3) θ is the valence angle, θ0 is the equilibrium angle and k0 is the valence angle force constant.
4) χ is the dihedral or torsion angle, kχ is the dihedral force constant, n is the multiplicity and δ is the phase angle.
5) φ is the improper angle, φ0 is the equilibrium improper angle and Kimp is the improper force constant.
6) εij is the Lennard-Jones well depth, τij is the distance between atoms i and j, Rminij is the minimum interaction radius, qi is the partial atomic charge of atom i and e is the dielectric constant.
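As a rough illustration of how individual terms of (1) are evaluated, the following Python sketch computes the torsion term E4 and the two non-bonded terms E6 and E7 exactly as they are written in (1). It is not the TINKER implementation: the function names, the tuple layouts and the numeric values in the usage lines are assumptions made only for this example, and no real CHARMM parameters are used.

```python
import math

def torsion_energy(terms):
    """E4: sum of k_chi * (1 + cos(n * chi - delta)); angles in radians.

    terms is a list of hypothetical (k_chi, n, chi, delta) tuples.
    """
    return sum(k * (1.0 + math.cos(n * chi - delta)) for k, n, chi, delta in terms)

def lennard_jones_energy(pairs):
    """E6 as written in (1): eps * ((r_min / d)**12 - (r_min / d)**6) per atom pair.

    pairs is a list of hypothetical (eps, r_min, distance) tuples.
    """
    return sum(eps * ((r_min / d) ** 12 - (r_min / d) ** 6) for eps, r_min, d in pairs)

def coulomb_energy(pairs, dielectric=1.0):
    """E7: q_i * q_j / (e * distance) per atom pair.

    pairs is a list of hypothetical (q_i, q_j, distance) tuples.
    """
    return sum(qi * qj / (dielectric * d) for qi, qj, d in pairs)

# Toy values, chosen only so that the results are easy to check by hand.
e4 = torsion_energy([(1.0, 1, 0.0, 0.0)])       # 1 * (1 + cos 0) = 2.0
e6 = lennard_jones_energy([(0.5, 2.0, 2.0)])    # distance == r_min gives 0.0
e7 = coulomb_energy([(1.0, -1.0, 2.0)])         # opposite charges attract: -0.5
```

Keeping each term in its own function mirrors the way the terms are later grouped into separate objectives.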

III. MULTIOBJECTIVE FORMULATION

Recently, multiobjective optimization has been proposed as an efficient approach to improve the results obtained by single-objective formulations of problems in computational biology. In [7] a survey of the application of multiobjective optimization in this field is provided. That paper also includes references to previous works that apply multiobjective optimization to protein structure prediction.

Fig. 3. Side-chain angles χi in the tyrosine amino acid.

A multiobjective optimization problem [18] can be defined as the problem of finding a vector x* = [x1*, x2*, ..., xn*] that satisfies a given restriction set g(x) ≤ 0, h(x) = 0 and optimizes the function vector f(x) = {f1(x), f2(x), ..., fm(x)}. The objectives are usually in conflict with one another, so optimizing one of them is carried out at the expense of the values of the others. This leads to the need for a compromise, which implies the concept of Pareto optimality. In a multiobjective optimization problem, a decision vector x* is said to be a Pareto optimal solution if there is no other feasible decision vector x that improves one objective without worsening at least one of the other objectives. Given the set P of Pareto optimal solutions, ∀a, b ∈ P with a ≠ b (∃i, j ∈ {1, 2, ..., m} | (fi(a) < fi(b)) ∧ (fj(a) > fj(b))).

Usually, there are many vectors which are Pareto optimal. These solutions are called non-dominated. The set of all non-dominated solutions in the decision space determines the Pareto front in the objective space.
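The dominance relation behind these definitions can be sketched in a few lines of Python. This is a minimal illustration that assumes all objectives are minimized (as is the case for energy terms); the function names and the sample objective vectors are hypothetical.

```python
def dominates(fa, fb):
    """True if objective vector fa dominates fb when every objective is minimized."""
    no_worse = all(x <= y for x, y in zip(fa, fb))
    strictly_better = any(x < y for x, y in zip(fa, fb))
    return no_worse and strictly_better

def non_dominated(points):
    """Keep the objective vectors that no other vector in points dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (3, 3) is dominated by (2, 2); the other three vectors trade one objective for the other.
front = non_dominated([(1, 5), (2, 2), (3, 3), (5, 1)])
```

Any two members of the resulting front differ in the way stated above: each one is better than the other in at least one objective and worse in another.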

Some alternatives can be taken into account to reduce the search space. Although the PSP problem implies predicting the tertiary structure of a given protein from its primary structure, it can be useful to use predictions of the secondary and super-secondary structures, as they give us information about the amino acids involved in each of these structures and thus determine constraints on the torsion angles of each amino acid (as shown in Table II). In order to get the super-secondary structure given the secondary structure, we have to analyze the conformation of the residues in the short connecting peptide between two secondary structures. They are classified into five types, namely a, b, e, l or t [21]. Sun et al. [21] developed a method to predict the eleven most frequently occurring super-secondary structures: H-b-H, H-t-H, H-bb-H, H-ll-E, E-aa-E, H-lbb-H, H-lba-E, E-aal-E, E-aaal-E and H-l-E, where H and E are α helix and β strand, respectively. In this way, a reduction in the search space of the PSP problem is obtained.

TABLE II
SEARCH SPACE OF THE ANGLES φ AND ψ DEPENDING ON THE SUPER-SECONDARY STRUCTURE IN WHICH THEY LIE.

Super-secondary structure    φ               ψ
H (α helix)                  [-75, -55]      [-50, -30]
E (β strand)                 [-130, -110]    [110, 130]
a                            [-150, -30]     [-100, 50]
b                            [-230, -30]     [100, 200]
e                            [30, 130]       [130, 260]
l                            [30, 150]       [-60, 90]
t                            [-160, -50]     [50, 100]
undefined                    [-180, 0]       [-180, 180]
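One way to use Table II in an optimizer is as a per-residue bounding box for (φ, ψ). The Python sketch below copies the intervals from the table and draws an initial backbone uniformly inside them. The uniform sampling, the function name and the per-residue labels in the usage line are assumptions made for illustration; the paper does not state how the initial population is generated.

```python
import random

# ((phi_min, phi_max), (psi_min, psi_max)) in degrees, copied from Table II.
ANGLE_RANGES = {
    "H": ((-75, -55), (-50, -30)),
    "E": ((-130, -110), (110, 130)),
    "a": ((-150, -30), (-100, 50)),
    "b": ((-230, -30), (100, 200)),
    "e": ((30, 130), (130, 260)),
    "l": ((30, 150), (-60, 90)),
    "t": ((-160, -50), (50, 100)),
    "undefined": ((-180, 0), (-180, 180)),
}

def sample_backbone(labels, rng=random):
    """Draw one (phi, psi) pair per residue, uniformly inside its Table II box.

    Labels that are not in the table fall back to the "undefined" row.
    """
    angles = []
    for label in labels:
        (phi_lo, phi_hi), (psi_lo, psi_hi) = ANGLE_RANGES.get(label, ANGLE_RANGES["undefined"])
        angles.append((rng.uniform(phi_lo, phi_hi), rng.uniform(psi_lo, psi_hi)))
    return angles

# Hypothetical labelling of a four-residue fragment: helix, helix, turn, strand.
initial = sample_backbone(["H", "H", "t", "E"], random.Random(0))
```

For a helix residue the box is 20° wide in each angle, against 180° and 360° for an unlabelled residue, which shows how much a secondary-structure prediction can shrink the search space.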

Moreover, side-chain torsion angles have interesting dependencies. Dunbrack et al. [17] have produced several rotamer libraries that help us to identify constraints on these torsion angles. An example of these libraries is the backbone-independent rotamer library. Given an amino acid, this library includes constraints for its side-chain torsion angles. There are dependencies between side-chain torsion angles in the same amino acid, but in this library these dependencies are not related to the backbone torsion angles. It is difficult to find a good representation for these constraints because of their mutual dependency. The backbone-dependent rotamer library is more complex than the backbone-independent rotamer library, because the former includes the dependency between side-chain torsion angles and backbone torsion angles. Therefore, the size of the library increases significantly and, although the information provided by this library is interesting, the time required to include this information in the optimization procedure also increases.

Therefore, there are dependencies between backbone torsion angles and side-chain torsion angles, and there are also dependencies among the torsion angles of the side-chain. An optimization process moves each variable independently. Thus, by including both the backbone torsion angles and the side-chain torsion angles in the set of variables, we would be ignoring the cross-information between torsion angles. A suitable way to approach the problem is therefore to manage the backbone torsion angles in an optimization process and, inside it, to include another mechanism to manage the side-chain torsion angles.

In this work we propose a new method to manage torsion angles using the backbone-dependent rotamer library. In this way, we reduce the set of variables involved in the optimization process by eliminating the side-chain torsion angles. Nevertheless, we cannot eliminate side-chain torsion angles without adding another mechanism to take them into account. This mechanism selects the most probable conformation of the side-chain in each amino acid depending on the backbone torsion angles; it can then change the side-chain conformation, considering the possibilities included in the library, to produce a better global conformation, as shown in Figure 4. The steps shown in Figure 4 are:

Fig. 4. Method to calculate the CHARMM energy in the optimization processusing a reduced set of variables.

Fig. 5. Reduction of variables for the Met-enkephalin protein

1) 1. Main loop of the multiobjective optimization process.
2) 1.1. In each iteration of 1, the algorithm has to evaluate the fitness function of each individual.
3) 1.2. The side-chain optimization module returns the fitness of the best conformation found.
4) 1.1.1. To evaluate an individual, this method tries to optimize its side-chains. Thus the algorithm loops over each candidate conformation, evaluating its fitness function.
5) 1.1.1.1. The CHARMM module evaluates a single conformation.
6) 1.1.1.2. The CHARMM module returns the fitness of the given conformation.
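The nesting of these steps can be sketched as follows in Python. This is only an outline under stated assumptions: `rotamer_candidates` stands in for a lookup in the backbone-dependent rotamer library and `energy` stands in for the CHARMM module, both passed in as plain callables, and a single scalar energy is used to rank side-chain conformations, which the paper does not specify.

```python
def evaluate_individual(backbone, rotamer_candidates, energy):
    """Steps 1.1.1 to 1.2: optimize the side-chains for one backbone.

    backbone: tuple of backbone torsion angles (the optimizer's variables).
    rotamer_candidates: hypothetical callable returning the candidate
        side-chain conformations for that backbone.
    energy: hypothetical callable (backbone, side_chain) -> float.
    Returns (best_energy, best_side_chain).
    """
    best = None
    for side_chain in rotamer_candidates(backbone):   # step 1.1.1
        value = energy(backbone, side_chain)          # steps 1.1.1.1 and 1.1.1.2
        if best is None or value < best[0]:
            best = (value, side_chain)
    if best is None:
        raise ValueError("the rotamer lookup returned no candidates")
    return best                                       # step 1.2

def evaluate_population(population, rotamer_candidates, energy):
    """Step 1.1: fitness of every individual in the current generation."""
    return [evaluate_individual(ind, rotamer_candidates, energy)[0] for ind in population]

# Toy stand-ins: three candidate chi1 values and a quadratic "energy".
toy_rotamers = lambda backbone: [(-60.0,), (60.0,), (180.0,)]
toy_energy = lambda backbone, side_chain: (side_chain[0] - 60.0) ** 2 + sum(backbone)
best_energy, best_side_chain = evaluate_individual((1.0, 2.0), toy_rotamers, toy_energy)
```

Because the side-chain angles are chosen inside the evaluation, the outer optimizer only ever sees the backbone angles, which is what makes the reduced variable set possible.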

With this method we can reduce the number of variables, and thus the complexity of the search space. For example, as shown in Figure 5, the Met-enkephalin molecule has 22 variables in its traditional representation, and we reduce this set to a representation with "only" 10 variables by applying the backbone-dependent rotamer library and setting the ωi angles to their ideal value of 180° [1], [22].
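The two counts can be reproduced from Table I with a short Python check. This is one accounting that matches the figures quoted above, not necessarily the authors' own derivation: it assumes three backbone angles (φ, ψ, ω) per residue plus the χ angles of Table I for the traditional representation, and only φ and ψ per residue for the reduced one. The Met-enkephalin sequence Tyr-Gly-Gly-Phe-Met is taken from general knowledge rather than from the text.

```python
# Number of chi angles per residue, copied from Table I.
CHI_COUNT = {
    "GLY": 0, "ALA": 0, "PRO": 0,
    "SER": 1, "CYS": 1, "THR": 1, "VAL": 1,
    "ILE": 2, "LEU": 2, "ASP": 2, "ASN": 2, "HIS": 2, "PHE": 2, "TYR": 2, "TRP": 2,
    "MET": 3, "GLU": 3, "GLN": 3,
    "LYS": 4, "ARG": 4,
}

def full_variable_count(sequence):
    """phi, psi and omega for every residue, plus its side-chain chi angles."""
    return sum(3 + CHI_COUNT[residue] for residue in sequence)

def reduced_variable_count(sequence):
    """Only phi and psi remain: omega is fixed and chi comes from the rotamer library."""
    return 2 * len(sequence)

MET_ENKEPHALIN = ["TYR", "GLY", "GLY", "PHE", "MET"]
full = full_variable_count(MET_ENKEPHALIN)        # 15 backbone + 7 chi = 22
reduced = reduced_variable_count(MET_ENKEPHALIN)  # 5 residues * 2 = 10
```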

Given the set of angles φ, ψ, ω, and χ of a configuration, the 3D structure of the protein can be determined, and the CHARMM function is evaluated for this structure. In this function, the terms E1, E2, and E3 remain constant across different configurations because, as can be seen from (1), these terms are not affected by changes in the angles that define the 3D protein structure.

Finally, as we use a multiobjective evolutionary optimization formulation of the PSP problem, the different terms of the CHARMM energy function have to be transformed into several objectives. In [1] a distinction is made between bond and non-bond energies. We use this idea, although we have introduced some modifications that follow from some characteristics of the solution domain.

The first cost functions have the form:

$$f_1 = E_{bond} = \sum_{k=1}^{5} E_k \tag{2}$$

$$f_2 = E_{non-bond} = \sum_{k=6}^{7} E_k \tag{3}$$

Analyzing these energies, we can observe that the van der Waals energy term (E6 in f2) varies over a wider range than the electrostatic term E7 in f2. Accordingly, E7 can be masked by E6. Thus, to optimize this energy appropriately, we propose a cost function with three objectives as follows:

$$f_1 = E_{bond} = \sum_{k=1}^{5} E_k \tag{4}$$

$$f_2 = E_6 \tag{5}$$

$$f_3 = E_7 \tag{6}$$
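Expressed in code, the two formulations differ only in how the seven terms of (1) are grouped. The sketch below assumes the terms are already available in a dictionary keyed by their index; the function name and the sample values are hypothetical and serve only to show the masking effect described above.

```python
def objectives(E, three=True):
    """Group the CHARMM terms E[1]..E[7] into objective values.

    three=False gives (f1, f2) as in (2) and (3);
    three=True gives (f1, f2, f3) as in (4), (5) and (6).
    """
    bond = sum(E[k] for k in range(1, 6))
    if three:
        return (bond, E[6], E[7])
    return (bond, E[6] + E[7])

# Hypothetical values: a large van der Waals term next to a small electrostatic one.
terms = {1: 2.0, 2: 1.0, 3: 3.0, 4: 4.0, 5: 0.5, 6: 250.0, 7: -3.0}
two = objectives(terms, three=False)   # (10.5, 247.0): E7 barely changes f2
three = objectives(terms, three=True)  # (10.5, 250.0, -3.0): E7 is its own objective
```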

IV. THE PROPOSED PARALLEL MULTIOBJECTIVE APPROACH

In this work we have used NSGA2 (Non-dominated Sorting Genetic Algorithm II) [18], as it is one of the best-established multiobjective optimization algorithms presented up to now. It uses crowding methods to avoid the coexistence of similar solutions, and it usually obtains better distributed and better fitted Pareto fronts than other approaches.

The NSGA2 algorithm introduces elitism into the population in an efficient way. One of the main differences between NSGA2 and SPEA2 [23]–[25] is that NSGA2 keeps the elite inside the population, while SPEA2 has a separate population to store the elite.

The main drawbacks of NSGA2 are the following:

1) It behaves worse than other approaches when a binary representation is used. Nevertheless, this problem does not affect our PSP formulation because it uses a real-valued representation.
2) It presents problems when there are many objectives. However, we have only two or three objectives, and in any case this difficulty affects all multiobjective algorithms.

The pseudocode description of NSGA2 is the following:

1. R_t = P_t ∪ Q_t
2. F = fastNonDominatedSort(R_t)
3. P_{t+1} = Ø and i = 1
4. while |P_{t+1}| + |F_i| ≤ N
5. → crowdingDistAssignment(F_i)
6. → P_{t+1} = P_{t+1} ∪ F_i
7. → i = i + 1
8. Sort(F_i) (by decreasing crowding distance)
9. P_{t+1} = P_{t+1} ∪ F_i[1 : (N − |P_{t+1}|)]
10. Q_{t+1} = makeNewPop(P_{t+1})
11. t = t + 1

where:

1) P_t is the population in generation t.
2) Q_t is the offspring of the population in generation t.
3) R_t is the union of P_t and Q_t; it is used to select the new population.
4) fastNonDominatedSort sorts the individuals in R_t into non-dominated fronts. Each F_i contains the individuals that belong to the same non-dominated front.
5) crowdingDistAssignment calculates the crowding distance of each individual.
6) makeNewPop creates the new population by applying the crossover and mutation operators.
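The selection part of this pseudocode (steps 2 to 9) can be sketched in Python as follows. This is an illustrative implementation written for this text, assuming minimization of every objective; it is not the authors' code, the function names are chosen to echo the pseudocode, and the variation operators of makeNewPop are left out.

```python
def dominates(fa, fb):
    """True if fa is no worse than fb everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(fa, fb)) and any(x < y for x, y in zip(fa, fb))

def fast_non_dominated_sort(objs):
    """Return the fronts F_1, F_2, ... as lists of indices into objs."""
    n = len(objs)
    dominated = [[] for _ in range(n)]   # dominated[p]: solutions that p dominates
    counts = [0] * n                     # counts[p]: how many solutions dominate p
    for p in range(n):
        for q in range(n):
            if dominates(objs[p], objs[q]):
                dominated[p].append(q)
            elif dominates(objs[q], objs[p]):
                counts[p] += 1
    fronts = [[p for p in range(n) if counts[p] == 0]]
    while fronts[-1]:
        nxt = []
        for p in fronts[-1]:
            for q in dominated[p]:
                counts[q] -= 1
                if counts[q] == 0:
                    nxt.append(q)
        fronts.append(nxt)
    return fronts[:-1]                   # drop the trailing empty front

def crowding_distance(front, objs):
    """Crowding distance of every index in front; boundary points get infinity."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[front[0]])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        lo, hi = objs[ordered[0]][m], objs[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][m] - objs[prev][m]) / (hi - lo)
    return dist

def select_next_population(objs, N):
    """Steps 2-9: indices of the N members of R_t that form P_{t+1}."""
    survivors = []
    for front in fast_non_dominated_sort(objs):
        if len(survivors) + len(front) <= N:
            survivors.extend(front)      # the whole front fits
        else:
            dist = crowding_distance(front, objs)
            ranked = sorted(front, key=lambda i: dist[i], reverse=True)
            survivors.extend(ranked[: N - len(survivors)])
            break
    return survivors
```

With the objective vectors `[(1, 5), (2, 2), (5, 1), (3, 3), (4, 4), (6, 6)]` and N = 4, the first front (indices 0, 1, 2) and the second front (index 3) are kept whole. With N = 2 only the first front is used, and the crowding distance keeps its two boundary points.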

A. Decision-making phase

We can derive some conclusions after analyzing the Pareto fronts obtained by our optimization procedure based on NSGA2. Among the non-dominated solutions obtained, the solution with the lowest energy does not correspond to the best solution according to the root mean square deviation (RMSD). Therefore, we use the knee concept [1], which means selecting a solution in the Pareto front for which a small improvement in one objective has a significantly bad effect on the other objectives. After applying this mechanism it is possible to get several knee solutions, so a second decision-making step is needed to select one of them: we select the knee solution with the lowest energy. With these mechanisms we can obtain the solutions in the Pareto fronts that correspond to better predictions, and in many cases we obtain the best prediction.
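The two-stage decision can be illustrated with a small Python sketch for the two-objective case. The knee criterion used here, the point farthest from the straight line joining the two extremes of the normalized front, is a common heuristic and an assumption of this example; the text does not specify how [1] or the authors compute knees. The tolerance `tol` and the use of the sum of the objectives as "energy" are likewise illustrative choices.

```python
def knee_candidates(front, tol=0.05):
    """Points of a two-objective front that lie farthest from the line joining its extremes.

    front: list of (f1, f2) tuples, both minimized.
    tol: hypothetical tolerance; points whose score is within tol of the best
         score are all reported as knee candidates.
    """
    pts = sorted(front)
    (x1, y1), (x2, y2) = pts[0], pts[-1]
    span_x = (x2 - x1) or 1.0            # guard against a single-point front
    span_y = (y1 - y2) or 1.0

    def score(p):
        # After normalizing to [0, 1] the extreme line is u + v = 1,
        # so 1 - u - v grows with the distance below that line.
        u = (p[0] - x1) / span_x
        v = (p[1] - y2) / span_y
        return 1.0 - u - v

    best = max(score(p) for p in pts)
    return [p for p in pts if best - score(p) <= tol]

def decide(front, tol=0.05):
    """Second stage: among the knee candidates, return the one with the lowest total energy."""
    return min(knee_candidates(front, tol), key=lambda p: p[0] + p[1])

# Hypothetical front: two points near the bend, two extreme points.
chosen = decide([(1.0, 10.0), (2.0, 3.2), (2.5, 2.5), (10.0, 1.0)])
```

In this toy front both middle points qualify as knee candidates, and the second stage keeps (2.5, 2.5) because its total is lower.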

B. Software architecture

We have built a PSP software package that incorporates many existing tools along with others that have been developed by us. To improve its usefulness, this software has been organized according to the three phases of pre-processing, optimization, and post-processing, as shown in Figure 6. In the pre-processing phase the protein is analyzed in order to generate the search space. The information extracted in this phase can be used by the different optimization procedures that could be executed in the next phase. The optimization phase includes, in our case, the multiobjective optimizer. It produces a set of non-dominated solutions that define the obtained Pareto front. Finally, the post-processing phase includes the decision phase that selects the solution from the Pareto front, along with other tools, such as the RMSD computation procedures, required to evaluate the optimization process.

The moo psp program creates two files, the most important of which is params.moo. This file contains the search space, configured as an input of the NSGA2 algorithm. The other file is a description of the protein.

In the next stage come the optimization processes. They use the files created by moo psp and then begin the optimization. At the end, they create four files with the same information, but in different formats. The most important are the .prot and the .xml files. The .prot file has a description of all the solutions in the Pareto front. The knee method uses this file to select a single solution. The .xml file gives us some information about the optimization process. It can be loaded in BiOmage [26], [27] to study the evolution of the algorithm.

Fig. 6. Needed software integration

Thus, the software package includes not only existing tools widely used in Bioinformatics, but also new modules developed by us, such as BiOmage, moo psp, the decision-making phase and the file formats. We also had to integrate several processes which are written in other programming languages, such as Fortran, C and C++. Moreover, our own runtime software manages all the integrated processes, creating a transparent access layer.

C. Parallel implementation

The computing time required to predict the structure of a single protein is very high: it can take several hours. This is a clear example of an application that can benefit from parallel processing. As a first parallel approach, we have considered that the problem has a computationally hard phase, which corresponds to the fitness computation for all the individuals in the population of the evolutionary algorithm. Moreover, the computation of the fitness for a given individual is independent of the computation of the others. Therefore, we have parallelized this phase of our procedure through a master-worker approach. As shown in Fig. 7, the master executes the whole algorithm and distributes the fitness evaluation task among the workers in one of three ways: by using a Round-Robin method (master-worker-1), by dividing the population into blocks of equal size (master-worker-2), or by dividing it into blocks of equal size with the master also acting as a worker (master-worker-3). As can be seen in the following section, this parallel approach provides good speedup figures.

Fig. 7. Workload distribution by a master-worker scheme.
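The two ways of splitting the population described above can be sketched as follows. This is only an illustration and not the authors' implementation: the function names `round_robin`, `equal_blocks` and `evaluate` are hypothetical, the workers are simulated sequentially instead of running on separate processors, and master-worker-3 is modelled simply as `equal_blocks` called with one extra evaluator (the master), which is an assumption about how that variant counts processors.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def round_robin(n_items: int, n_workers: int) -> List[List[int]]:
    """master-worker-1 (sketch): individual i is sent to worker i mod n_workers."""
    return [list(range(w, n_items, n_workers)) for w in range(n_workers)]


def equal_blocks(n_items: int, n_workers: int) -> List[List[int]]:
    """master-worker-2 (sketch): contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_workers)
    blocks: List[List[int]] = []
    start = 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def evaluate(population: Sequence[T],
             fitness: Callable[[T], float],
             assignment: List[List[int]]) -> List[float]:
    """Evaluate every individual and gather the results in population order.

    Each inner list of `assignment` is the share of one worker; the shares are
    processed one after another here, standing in for parallel workers.
    """
    results: List[float] = [0.0] * len(population)
    for share in assignment:
        for i in share:
            results[i] = fitness(population[i])
    return results


if __name__ == "__main__":
    # 50 individuals on 14 processors: 13 workers when the master only
    # coordinates, 14 evaluators when the master also evaluates (assumption).
    print([len(s) for s in round_robin(50, 13)])
    print([len(s) for s in equal_blocks(50, 14)])
```

Because the fitness evaluations are independent, any assignment that covers each individual exactly once yields the same results; the variants differ only in how evenly the load is spread.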

V. RESULTS

In this section we provide and analyze the results obtained with our parallel procedure. They have been obtained by using the well-known Met-enkephalin protein [1], [28] as a benchmark. Although this protein has only five amino acids, it produces a highly complex multiobjective optimization problem. It is estimated that this problem has more than $10^{11}$ locally optimal conformations. We ran the NSGA2 algorithm for 1000 generations with 50 individuals in the population, using the two-objective (expressions 2 and 3) and the three-objective (expressions 4, 5 and 6) cost functions, respectively. As stated earlier, our procedure uses 10 variables to represent the protein instead of the 22 variables of other previous works [1].

As the cost functions used to model the PSP problem only approximate the conformation energy of the protein, it is not enough to evaluate how near the solution obtained is to the global minimum of the cost function; it is also necessary to evaluate the quality of the approximation. To do so, we should compare a known protein structure with the solution obtained by optimizing the cost functions that model the corresponding PSP problem.

The most widely used measure of similarity between predicted and known native structures is the RMSD [1] (Formula 7). RMSD computes the difference between two structures by aggregating the deviation of each atom in the predicted structure with respect to the same atom in the known structure. Thus, to use this measure we first need to superimpose both structures as closely as possible, so that the difference is minimized.

$$\mathrm{RMSD}(a, b) = \sqrt{\frac{\sum_{i=1}^{n} |\tau_{ai} - \tau_{bi}|^{2}}{n}} \qquad (7)$$

where

1) a and b are the proteins to be compared.
2) n is the number of atoms in the proteins.
3) $\tau_{ai}$ and $\tau_{bi}$ are the 3D positions of atom i in a and b, respectively.

After several executions of our multiobjective formulation of the PSP problem, the best conformations found have RMSD = 2.650 Å and RMSD = 1.891 Å with two and three objectives, respectively. The knee procedure has been applied in these cases to select one conformation among the set of solutions in the Pareto front obtained after executing the corresponding optimization procedures. Table III compares our results (rows NSGA2 in Table III) with those provided in [1] for several approaches (rows I-PAES, REGAL, and Lamarkian) for the Met-enkephalin protein. As can be observed, our results outperform those provided in [1] for this protein.
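As an illustration of Formula 7, the RMSD between two structures can be computed as in the following minimal sketch. It assumes that both structures list the same atoms in the same order and that they have already been superimposed; the fitting step mentioned in the text is not shown. The function name `rmsd` and the `Point` type are hypothetical and are not part of the software described in this paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root mean square deviation between two pre-aligned structures.

    a[i] and b[i] are the 3D positions of atom i in each structure.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("structures must be non-empty and have the same atoms")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(a, b):
        # Squared Euclidean distance |tau_ai - tau_bi|^2 for atom i.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(a))


if __name__ == "__main__":
    # Toy coordinates chosen only for illustration, not real protein data.
    known = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(rmsd(known, predicted))
```

With the toy coordinates above, only the second atom is displaced (by 2 units), so the result is the square root of 4/2.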

TABLE III
PROVIDED NSGA2 VERSION VERSUS OTHER APPROACHES FOR MET-ENKEPHALIN PEPTIDE.

Algorithm                  RMSD (Å)
NSGA2-Reduced-3obj         1.891
NSGA2-Reduced-2obj         2.650
I-PAES                     2.835
REGAL (real cod.)          3.230
Lamarkian (binary cod.)    3.330

Having analyzed the quality of the solutions found by our procedure, we now provide performance results for its parallel implementation. We have executed the algorithms in a cluster with 14 nodes. Figure 8 shows the average computing time required to complete the structure prediction against the number of processors, and Figures 9 and 10 provide the corresponding parallel efficiencies and speedups. The individuals of the population have been equally distributed among the processors used in each experiment. As shown in Figure 10, the highest speedup is 13.62 for 14 processors. As shown in Figure 9, in the master-worker-3 approach the efficiency remains close to 1.00 across all the experiments (for 14 nodes the efficiency is 0.97). In the master-worker-1 and master-worker-2 approaches, the efficiency also remains high from six processors onwards.
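The relation between the speedup and efficiency figures quoted above can be checked with a short sketch. It uses the usual definitions, speedup as sequential time divided by parallel time and efficiency as speedup divided by the number of processors; the function names and the example timings are hypothetical, and only the values 13.62 and 14 come from the text.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup S = T_1 / T_p."""
    return t_sequential / t_parallel


def efficiency(s: float, processors: int) -> float:
    """Parallel efficiency E = S / p."""
    return s / processors


if __name__ == "__main__":
    # Reported speedup of 13.62 on 14 processors.
    print(round(efficiency(13.62, 14), 2))
    # Hypothetical timings, for illustration only: 100 s sequential, 25 s parallel.
    print(speedup(100.0, 25.0))
```

A speedup of 13.62 on 14 processors corresponds to an efficiency of about 0.97, which agrees with the value reported for master-worker-3.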

From these results, we can see that, despite its simplicity, the master-worker approach implemented achieves a sufficiently good performance level. With fourteen processors it runs 13.62 times faster than the sequential algorithm. This shows that the cost function evaluation dominates the processor time that the PSP problem needs. It is also clear that, for the number of processors available in our system, the speedup achievable by our parallel procedure has not yet saturated.
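One way to make the previous observation more concrete is to estimate the fraction of the work that behaves as if it were serial. The sketch below applies the Karp-Flatt metric, which inverts Amdahl's law; this analysis is not part of the original study, and it assumes that the reported speedup of 13.62 was measured with all 14 processors evaluating individuals. The function names are hypothetical.

```python
def karp_flatt(s: float, p: int) -> float:
    """Experimentally determined serial fraction e = (1/S - 1/p) / (1 - 1/p)."""
    return (1.0 / s - 1.0 / p) / (1.0 - 1.0 / p)


def amdahl_speedup(serial_fraction: float, p: int) -> float:
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


if __name__ == "__main__":
    e = karp_flatt(13.62, 14)
    print(e)                      # estimated serial fraction
    print(amdahl_speedup(e, 14))  # recovers the measured speedup
```

Under these assumptions the estimated serial fraction is roughly 0.2%, which is consistent with the statement that the fitness evaluation accounts for almost all of the computing time.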

VI. CONCLUSION AND FUTURE WORK

The PSP problem combines biological and computational concepts. It requires not only efficient algorithms that make it possible to take advantage of powerful parallel computers, but also accurate and tractable models of the conformation energy. Thus, there is a long way to go to find useful solutions to the problem for proteins of realistic sizes. Our contribution in this paper is a new procedure for PSP based on a multi-objective evolutionary algorithm and its parallel implementation. Our procedure reduces the number of variables that an implementation of the multi-objective algorithm NSGA2 has to manage. In this way, we reduce the search space, although at the cost of losing some flexibility in the search process. By comparing

Fig. 8. Executing time vs. processors

Fig. 9. Parallel efficiency vs. processors

Fig. 10. Speedup vs. processors


the results of our procedure with those provided by other multiobjective approaches [1], it can be shown that our method provides conformations of comparably good quality for simple proteins (Met-enkephalin).

Our future work will follow two main lines to improve the quality of the solutions found. The first line seeks to complete the procedure presented here by defining new operators that make it possible to include local search transformations in the optimization process, thus improving the way the side-chains are processed in order to reach better conformations. The second line is a more complete study of the different parallelization alternatives and their performance, including their scalability for more complex proteins and for systems with a higher number of processors.

ACKNOWLEDGMENT

This paper has been supported by the Spanish Ministerio de Educación y Ciencia under project TIN2007-60587.

REFERENCES

[1] V. Cutello, G. Narcisi, and G. Nicosia, “A multi-objective evolutionary approach to the protein structure prediction problem,” J. R. Soc. Interface, vol. 3, pp. 139–151, 2006.

[2] A. M. Lesk, Introduction to Bioinformatics. Oxford University Press, ISBN 0-19-927787-7.

[3] ——, Introduction to Protein Architecture. Oxford University Press, ISBN 0-19-850474-8.

[4] P. Bradley, K. Misura, and D. Baker, “Toward high-resolution de novo structure prediction for small proteins,” Science, vol. 309, pp. 1868–1871, 2005.

[5] M. Taufer, C. An, A. Kerstens, and C. Brooks, “Predictor@home: A ‘protein structure prediction supercomputer’ based on global computing,” IEEE Transactions on Parallel and Distributed Systems, vol. 17, no. 8, 2006.

[6] J. Handl, D. Kell, and J. Knowles, “Multiobjective optimization in bioinformatics and computational biology,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 4, no. 2, pp. 279–292, April 2007.

[7] A.-A. Tantar, N. Melab, E.-G. Talbi, B. Parent, and D. Horvath, “A parallel hybrid genetic algorithm for protein structure prediction on the computational grid,” Future Generation Computer Systems, vol. 23, pp. 398–409, 2007.

[8] C. Branden and J. Tooze, “Introduction to protein structure.” ISBN 0-81-532305-0.

[9] V. Cutello, G. Nicosia, M. Pavone, and J. Timmis, “An immune algorithm for protein structure prediction on lattice models,” IEEE Trans. on Evolutionary Computation, vol. 11, no. 1, pp. 101–117, February 2007.

[10] C. Cotta, “Hybrid evolutionary algorithms for protein structure prediction under the hpnx model,” Advances in Soft Computing, vol. 2, pp. 525–534, 2005.

[11] N. Krasnogor, W. Hart, J. Smith, and D. Pelta, “Protein structure prediction with evolutionary algorithm,” in Proceedings of the Genetic and Evolutionary Computation Conference, 1999.

[12] R. Day, G. Lamont, and R. Pachter, “Protein structure prediction by applying an evolutionary algorithm,” in International Parallel and Distributed Processing Symposium (IPDPS’03), 2003, p. 155a.

[13] S. Michaud, J. Zydallis, G. Lamont, and R. Pachter, “Scaling a genetic algorithm to medium-sized peptides by detecting secondary structures with an analysis of building blocks,” in Proc. 1st Int. Conference on Computational Nanoscience, March 2001, pp. 29–32.

[14] R. Day, J. Zydallis, and G. Lamont, “Solving the protein structure prediction problem through a multiobjective genetic algorithm,” Nanotech, vol. 2, pp. 32–35, 2002.

[15] J. Zydallis, A. V. Veldhuizen, and G. Lamont, “A statistical comparison of moeas including the momga-ii,” in Proc. 1st Int. Conference on Evolutionary Multicriterion Optimization, 2001, pp. 226–240.

[16] J. Smith, “The co-evolution of memetic algorithms for protein structure prediction.”

[17] R. Dunbrack and F. Cohen, “Bayesian statistical analysis of protein sidechain rotamer preferences,” Protein Sci, vol. 6, pp. 1661–1681, 1997.

[18] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, April 2002.

[19] B. Wathen, “Hydrophobic residue patterning in β-strands and implication for β-sheet nucleation.” [Online]. Available: http://qcse.queensu.ca/conferences/documents/BrentWathen.ppt

[20] TINKER, “Software tools for molecular design.” [Online]. Available: http://dasher.wustl.edu/tinker/

[21] Z. Sun and B. Jiang, “Patterns and conformations of commonly occurring supersecondary structures (basic motifs) in protein data bank,” Journal of Protein Chemistry, vol. 15, no. 7, 1996.

[22] V. Cutello, G. Narcisi, and G. Nicosia, “A class of pareto archived evolution strategy algorithms using immune inspired operators for ab-initio protein structure prediction,” in EvoWorkshops 2005, LNCS 3449, 2005, pp. 54–63.

[23] E. Zitzler and L. Thiele, “An evolutionary algorithm for multiobjective optimization: The strength pareto approach,” in Technical Report 43, Zürich, Switzerland: Computer Engineering and Networks Laboratory (TIK), Swiss Federal Institute of Technology (ETH), 1998.

[24] ——, “Multiobjective evolutionary algorithms: A comparative case study and the strength pareto approach,” IEEE Transactions on Evolutionary Computation, vol. 3, no. 4, pp. 257–271, 1999.

[25] E. Zitzler, K. Deb, and L. Thiele, “Comparison of multiobjective evolutionary algorithms: Empirical results,” Evolutionary Computation Journal, vol. 8, no. 2, pp. 125–148, 2000.

[26] J. Calvo, “Biomage: Herramienta para el analisis de algoritmos evolutivos.” [Online]. Available: http://atc.ugr.es/ jccalvo/proyectos.php

[27] ——, “Biomage: Manual de usuario.” [Online]. Available: http://atc.ugr.es/ jccalvo/download/manualusuario.pdf

[28] RCSB, “Protein data bank (pdb).” [Online]. Available: http://www.pdb.org
