

Abstract— Conformational sampling, the computational prediction of the experimental geometries of small proteins (folding) or of protein-ligand complexes (docking), is often cited as one of the most challenging multimodal optimization problems. Due to the extreme ruggedness of the energy landscape as a function of geometry, sampling heuristics must rely on an appropriate trade-off between global and local searching efforts. A previously reported “planetary strategy”, a generalization of the classical island model used to deploy a hybrid genetic algorithm on computer grids, has shown a good ability to quickly discover low-energy geometries of small proteins and sugars, and sometimes even pinpoint their native structures – although not reproducibly. The procedure focused on broad exploration and used a tabu strategy to avoid revisiting the neighborhood of known solutions, at the risk of “burying” important minima in overhastily set tabu areas.

The strategy reported here, termed “divide-and-conquer planetary model”, couples this global search procedure to a local search tool. Grid nodes are now shared between global and local exploration tasks. The phase space is cut into “cells” corresponding to a specified sampling width for each of the N degrees of freedom. Global search locates cells containing low-energy geometries. Local searches pinpoint even deeper minima within a cell. Sampling width controls the important trade-off between the number of cells and the local search effort needed to reproducibly sample each cell. The probability of submitting a cell to local search depends on the energy of the most stable geometry found within. Local searches are allotted limited resources and are not expected to converge. However, as long as they manage to discover some deeper local minima, the explored cell remains eligible for further local search, now relying on the improved energy level to enhance its chances of being picked again. This competition prevents the system from wasting too much effort in fruitless local searches. Eventually, after a limited number of local searches, a cell will be “closed” and used – first as “seed”, later as tabu zone – to bias future global searches. Technical details and some folding and docking results will be discussed.

Manuscript received October 9, 2001. This work was supported by the French “Agence Nationale de la Recherche”, grant “ANR Dock - Molecular Docking on Grids” (http://www2.lifl.fr/~talbi/docking/) and realized on the GRID 5000 network (http://www.grid5000.fr).

D. Horvath is a CNRS (Centre National de la Recherche Scientifique) scientist at the Laboratoire d’Infochimie, UMR 7177, Université de Strasbourg, France (corresponding author: [email protected] or [email protected]).

S. Conilleau is a post-doctoral fellow at the Laboratoire d’Infochimie, UMR 7177, Université de Strasbourg, France.

L. Brillet is an engineer at the Center for Bio-Active Molecules (CMBA) of the Commissariat à l’Energie Atomique (CEA), Grenoble, France.

S. Roy is a scientist at the Center for Bio-Active Molecules (CMBA) of the Commissariat à l’Energie Atomique (CEA), Grenoble, France ([email protected]).

A.-A. Tantar and J. C. Boisson are Ph.D. students at Université des Sciences et Technologies de Lille, Laboratoire d’Informatique Fondamentale de Lille (LIFL), France.

N. Melab is associate professor at the Université des Sciences et Technologies de Lille, Laboratoire d’Informatique Fondamentale de Lille (LIFL), France.

E.G. Talbi is professor at the Université des Sciences et Technologies de Lille, Laboratoire d’Informatique Fondamentale de Lille (LIFL), France ([email protected]).

I. INTRODUCTION

CONFORMATIONAL sampling, the computational prediction of the experimental geometries of small proteins (folding) or of protein-ligand complexes (docking), is often cited as one of the most challenging multimodal optimization problems. Authors emphasize the huge phase space volume, explaining that if each of the N = 50...1000 torsional axes of a protein were iteratively rotated by a step of D < 60 degrees, then an impossible-to-manage number of $(360/D)^N$ geometries would have to be considered, far more than the protein itself may visit during a time span of billions of years at a rate of one geometry per picosecond (the Levinthal [1] paradox). However, such statements, meant to emphasize the complexity of the problem, actually understate its real difficulty. The main challenge in conformational sampling is not phase space volume, but the extreme ruggedness of the energy hypersurface (Fig. 1). Degrees of freedom are strongly coupled: a rotation around a central axis, pushing two fragments towards each other, will be spontaneously accompanied by rearrangements of intra-fragment torsions, to yield clash-minimizing fragment geometries. Out of the $(360/D)^N$ geometries sampled at a step size D of, say, 10°, virtually none will be actual energy minima. These still need to be searched for in the neighborhood of some of the $(360/D)^N$ phase space points, which requires a substantial local fine-tuning effort of torsions. Gradient-based relaxation will typically not suffice to explore the rugged neighborhood. Knowing the energy at one point offers little or no information on how deep a neighboring minimum may be. The global minimum, surrounded by an activation barrier¹, will certainly not be found in the vicinity of the most stable enumerated geometry. Exhaustive conformational sampling would thus require not $(360/D)^N$ energy evaluations, but $(360/D)^N$ local searches. It is therefore obvious that some heuristic is needed to focus on “promising” phase space zones that may harbor the relevant minima.

¹ The dense packing that maximizes favorable interactions cannot be reached other than by crossing some strained transition state.
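To give a feel for the numbers quoted in the introduction, the following sketch evaluates the grid size (360/D)^N and the time needed to enumerate it at one geometry per picosecond. The function names are illustrative, and N=50 with D=10° is merely one choice within the ranges mentioned in the text.

```python
import math


def log10_grid_size(n_torsions: int, step_deg: float) -> float:
    """log10 of (360/D)**N, the number of geometries on the torsional grid."""
    return n_torsions * math.log10(360.0 / step_deg)


def log10_years_to_enumerate(n_torsions: int, step_deg: float) -> float:
    """log10 of the years needed to visit the grid at one geometry per picosecond."""
    seconds_per_year = 365.25 * 24 * 3600
    geometries_per_year = 1e12 * seconds_per_year  # 1e12 picoseconds per second
    return log10_grid_size(n_torsions, step_deg) - math.log10(geometries_per_year)


if __name__ == "__main__":
    print(f"grid size  ~ 10^{log10_grid_size(50, 10.0):.1f} geometries")
    print(f"enumeration ~ 10^{log10_years_to_enumerate(50, 10.0):.1f} years")
```

With N=50 and D=10°, the grid holds roughly 10^78 points, i.e. about 10^58 years of enumeration, and each of these points would still require a local search.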

Local vs. Global Search Strategies in Evolutionary GRID-based Conformational Sampling & Docking

Dragos Horvath, Lorraine Brillet, Sylvaine Roy, Sébastien Conilleau, Alexandru-Adrian Tantar, Jean-Charles Boisson, Nouredine Melab and El-Ghazali Talbi


[Figure 1: schematic energy profile. Labels: “well”-docked (folded) zone; “misdocked” (misfolded) conformers; energy difference ΔE; activation barrier ΔE#.]

Fig. 1 Energy landscape ruggedness is the key difficulty in conformational sampling: one local clash is enough to give a near-optimal geometry a much higher energy than misfolded structures. The experimental structure (“PDB”), which is but a model satisfying experimental constraints, does not strictly match the minimum. The native zone is surrounded by an activation barrier (ΔE#). The difference between stable misfolded and well-folded energies, ΔE, is much lower than ΔE#. In a typical engineering problem, an objective score decrease by ΔE# leads to useful near-optimal solutions, an additional gain of ΔE being most likely irrelevant. In this “all-or-nothing” challenge, however, that additional gain is essential.

Our previous efforts in the field [2-5] led to the development of a GRID-deployed hybrid genetic algorithm capable of sampling the native folds of some small proteins (the “tryptophan cage” 1L2Y [6], the “tryptophan zipper” 1LE1 [7]) or structured sugars, such as cyclodextrin. The “planetary” GRID deployment strategy [3] was introduced as a generalization of the classical island model used with genetic algorithms [8] (each node runs an independent island model, and may thus be likened to a “planet”). It actively exploits the knowledge derived from the geometries sampled so far in order to bias future searching efforts. This bias was dubbed “panspermia” by analogy to the theory stipulating that life on Earth could have been seeded by germs from outer space. It includes both:

• “seeds”: best-to-date solutions serving as attractors in phase space, in the sense that degrees of freedom are preferentially, but not exclusively, assigned values seen in these seeds, and

• “tabus”: solutions having already served as seeds, now defining phase space zones to which an empirical energy excess is associated, in order to discourage their re-sampling.

Panspermia is the critical point of the planetary strategy: both seeding and tabu zones are instrumental in improving the sampling quality. Excessive use of a seed wastes computing effort on resampling a well-explored phase space zone, while a prematurely set tabu zone may block access to minima it contains that have not yet been discovered. As a consequence, the discovery of native folds by the planetary strategy had significant reproducibility problems, especially with non-helical peptides such as 1LE1.

The trade-off between the use of seeds and tabus in panspermia implicitly affects the balance between local and global sampling effort, even though it does not permit a strict control thereof. Seeds merely favor rather than force revisiting of specified phase space zones. Therefore, it is impossible to define a best moment to toggle the status of a known solution from “seed” to “tabu”.

In an attempt to gain control over this issue, the current paper reports the development of specific global and local search modules. To this purpose, phase space is subdivided into cells corresponding to equal-width ranges of key torsional degrees of freedom. Global searches operate throughout the entire phase space, seeking stable geometries to represent each cell. Each visited cell is represented by its most stable conformer known to date. Cells in which some more or less stable geometry was proven to exist are a more rational choice of starting point for a local search than phase space zones in which the global searcher never managed to find any clash-free structure. Therefore, submission of cells to local searching will be prioritized by the energy of their representative structure. Should this local search (in which degrees of freedom are confined within cell walls) lead to a more stable structure, the latter becomes the new cell representative, and the cell re-enters the pool of competitors waiting for local sampling. After a user-defined number of failures to lower the energy upon local sampling, a cell will eventually be “closed” and its free energy estimated on the basis of all the energies of its sampled states. This free energy score, reflecting both the depth and the width of energy minima within the cell, is expected to relate to the actual probability of having the cell “populated” by molecules in solution at a given temperature.

The local vs. global sampling balance (generically embodied by the parameter D) as well as the chosen panspermia strategy are key factors of this conformational sampling procedure. Unfortunately, a folding simulation of a mini-protein may run for weeks on tens of nodes, so an exhaustive fine-tuning of these setup parameters cannot be envisaged. The first results to be discussed here show that optimal setups may depend on the actual problem. Folding of α-helices and docking problems featuring flexible loops in the active site were successfully solved by coarse splitting (into a minimal number of cells of quite large volume each), whereas folding of β-sheets failed (alternative strategies are now under scrutiny).

II. METHODS

Only the latest developments of the hybrid genetic algorithm-based “planetary” model for conformational sampling and docking will be mentioned here, with an emphasis on local search strategies. Please note that a major development consisted in the integration of intermolecular degrees of freedom (translations and rotations) into the chromosome coding torsional degrees of freedom, in order to generalize the use of the algorithm to docking problems.

This aspect will however not be detailed here – for convenience, the three translation vectors and the three Euler angles were also formally mapped by values between 0 and 360, as if they were regular “pseudo-torsional” degrees of freedom. Newly introduced operational parameters (see below) are implicitly subject to operational parameter tuning.

A. Phase Space Slicing

The impact of torsional degrees of freedom on the molecular geometry depends on the size S of the fragment they pilot: let this size factor be the maximal number of bonds between the anchoring atom and any terminal atom within the fragment (see Fig. 2).

[Figure 2: molecular sketch. Labels: torsional axis; anchor atom; fixed fragment; terminal atom.]

Fig. 2 The fragment associated to the current degree of freedom has a size of S=2 (remotest terminal atom of fragment is two bonds away from anchor)

More central torsions require finer tuning, as they pilot more atoms. Therefore, a size-dependent number of divisions should favor the creation of more cells along central degrees of freedom, while phase space may not be subdivided at all with respect to minor, terminal torsions like the one in Fig. 2. Suppose that a torsional angle $\Theta_i$ may adopt values within $[\theta_i^{\min}, \theta_i^{\max}]$. In global searches, $\theta_i^{\min} = 0$ and $\theta_i^{\max} = 2\pi/\sigma_i$, $\sigma_i$ being the symmetry index of the torsion ($\sigma = 2$ means that a turn by 180° is equivalent to a complete rotation by 360°). This range will be subdivided into a number of $K_i$ domains:

$$K_i = \max\left[1,\ \mathrm{int}\left(K_{\max} \times \min\left(1,\ \frac{S_i - S_{\min}}{S_{\max} - S_{\min}}\right)\right)\right] \qquad (1)$$

Here int denotes the truncation operator, while Smin and Smax are threshold fragment size values. Phase space will be split into Kmax subdomains along each axis i piloting a fragment of size Smax and larger, but will not be subdivided along the axes associated to fragments of size Smin or lower. The choice of Kmax defines the graininess of the phase space splitting scheme and thus represents the actual control parameter standing for the generic D value used beforehand for introductory purposes. Related to Ki, the minimal significant torsional shift (the angle by which an axis must be rotated in order to obtain a “significantly different” geometry) may then be defined as:

$$s_i = \frac{\theta_i^{\max} - \theta_i^{\min}}{K_i} \qquad (2)$$

Note that, according to this definition, rotations around low-importance axes of Ki=1 never lead to “significantly different” geometries, i.e. do not allow to exit a “tabu” zone (see below). Rototranslational degrees of freedom are assigned maximal importance, i.e. subjected to Kmax splits.
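Equations (1) and (2) can be sketched as follows. The function names and the numerical thresholds in the usage lines (Smin=2, Smax=6, Kmax=4) are hypothetical values chosen for illustration, not the settings used by the authors.

```python
import math


def n_divisions(size: int, s_min: int, s_max: int, k_max: int) -> int:
    """Equation (1): number of domains K_i along an axis piloting a fragment of given size.

    Requires s_max > s_min. int() truncates toward zero, like the 'int' operator in the text.
    """
    ratio = min(1.0, (size - s_min) / (s_max - s_min))
    return max(1, int(k_max * ratio))


def significant_shift(theta_min: float, theta_max: float, k_i: int) -> float:
    """Equation (2): minimal significant torsional shift s_i for an axis with K_i domains."""
    return (theta_max - theta_min) / k_i


if __name__ == "__main__":
    for size in (1, 2, 4, 6, 10):
        k_i = n_divisions(size, s_min=2, s_max=6, k_max=4)
        s_i = significant_shift(0.0, 2.0 * math.pi, k_i)
        print(f"S={size:2d}  K_i={k_i}  s_i={math.degrees(s_i):.0f} deg")
```

Axes piloting fragments at or below Smin receive a single domain, so their significant shift equals the full angular range and no rotation around them can ever exceed it.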

B. Buffered inter-island migrations

Our previous experience with the island model of genetic algorithms showed that resetting island populations trapped in evolutionary dead ends [2] did not fulfill its purpose of exploring new phase space zones, due to “contamination” by fitter chromosomes migrating to the reset island. Entering a recent population of poorly evolved individuals, these create related, fit offspring, to the detriment of the radically new opportunities opened by the members of the new population. In the current version, migrants are stored in a buffer zone and not allowed to enter a population unless it shows signs of a slowing evolutionary pace (20 generations without progress). Reset populations are thus given the time to potentially evolve towards new, original low-energy conformations, at lower risk of being driven to extinction by much more evolved migrant chromosomes.
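The buffering rule can be illustrated with a minimal sketch. The `Island` class below is hypothetical and tracks energies only; resetting the stall counter after the buffer is released is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Island:
    """Hypothetical island that stores energies only (lower is fitter)."""
    population: List[float]
    stall_limit: int = 20  # generations without progress, as quoted in the text
    buffer: List[float] = field(default_factory=list)
    stalled: int = 0

    def receive_migrant(self, energy: float) -> None:
        """Migrants are parked in the buffer, never inserted directly."""
        self.buffer.append(energy)

    def end_generation(self, progressed: bool) -> None:
        """Update the stall counter; release buffered migrants once evolution slows."""
        self.stalled = 0 if progressed else self.stalled + 1
        if self.stalled >= self.stall_limit and self.buffer:
            self.population.extend(self.buffer)
            self.buffer.clear()
            self.stalled = 0  # assumption: the counter restarts after a release


if __name__ == "__main__":
    island = Island(population=[5.0, 7.5])
    island.receive_migrant(-3.0)
    for _ in range(20):
        island.end_generation(progressed=False)
    print(island.population, island.buffer)
```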

C. Dynamic Tabu Penalties

Previously, a geometry was declared tabu and immediately withdrawn from the population if not even one of its defining torsional values differed significantly, in the sense of equation (2), from those encountered in already visited geometries. This approach may prematurely discard potentially interesting phase space zones. If, for example, the “native-like” structure from Fig. 1 (high energy due to a mispositioned minor fragment) were given “tabu” status, the program would be denied access to the actual absolute minimum. On the other hand, it certainly makes sense to forbid the software from enumerating all the possible rotameric states of minor side chains around a given main chain geometry: the stable rotamers are wanted, and the others should be ignored.

A dynamic tabu strategy has now been introduced in order to solve the above-mentioned dilemma. A solution situated too close to a tabu geometry is now assigned a continuous and differentiable user-defined penalty function, representing a handicap for Darwinian selection. Let ti be the torsional values in the tabu solution, by contrast to θi in the current geometry, and let Δi denote their effective absolute difference (accounting for periodicity and symmetry, if applicable). This difference is significant if Δi>si, i.e. the geometry is outside the tabu zone if max(Δi/si)>1. However, using the max operator leads to a non-differentiable expression; therefore DMAX(Δi/si), a differentiable empirical approximation of max(Δi/si)-1, was used instead. As Δ values change and DMAX turns positive, the tabu penalty rapidly decreases from a user-defined maximal contribution T to zero (an exponential dependence on DMAX was postulated). Furthermore, if the energy value e of the current geometry is significantly lower than the one of the tabu point, et, then its discovery would be valuable even if it is actually located within the tabu area: the applied penalty should decrease as its stability advantage over the tabu point increases (according to a postulated sigmoid dependence). The fitness of the geometry in presence of a tabu point becomes (sum up the right-hand terms if many tabu geometries are specified; α and β are empirical parameters):

$$\mathrm{fitness} = e + T\,\frac{\exp\left[-\alpha\, DMAX(\Delta_i/s_i)\right]}{1 + \exp\left[\beta\,(e_t - e)\right]} \qquad (3)$$
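A numerical sketch of the penalized fitness of equation (3) is given below for a single tabu point. The exact form of DMAX is not given in the text: the log-sum-exp expression used here is only an assumed differentiable stand-in, and the sigmoid is written with positive α and β so that the penalty fades as the stability advantage et−e grows, as described above. All parameter values in the usage lines are illustrative.

```python
import math
from typing import Sequence


def smooth_dmax(ratios: Sequence[float], sharpness: float = 20.0) -> float:
    """Assumed stand-in for DMAX: log-sum-exp approximation of max(ratios) - 1."""
    m = max(ratios)  # used only to shift the exponents for numerical stability
    total = sum(math.exp(sharpness * (r - m)) for r in ratios)
    return m + math.log(total) / sharpness - 1.0


def penalized_fitness(e: float, e_tabu: float, ratios: Sequence[float],
                      t_max: float, alpha: float, beta: float) -> float:
    """Energy plus tabu penalty for one tabu point; ratios are the values of Delta_i/s_i."""
    # Exponential decay with DMAX; note that the factor exceeds 1 when DMAX < 0.
    proximity = math.exp(-alpha * smooth_dmax(ratios))
    # Sigmoid damping: tends to 0 when e is far below e_tabu (beta > 0 assumed).
    advantage = 1.0 / (1.0 + math.exp(beta * (e_tabu - e)))
    return e + t_max * proximity * advantage


if __name__ == "__main__":
    inside = penalized_fitness(0.0, 0.0, [0.1, 0.2], t_max=10.0, alpha=5.0, beta=1.0)
    outside = penalized_fitness(0.0, 0.0, [0.1, 5.0], t_max=10.0, alpha=5.0, beta=1.0)
    deeper = penalized_fitness(-10.0, 0.0, [0.1, 0.2], t_max=10.0, alpha=5.0, beta=1.0)
    print(inside, outside, deeper)
```

With several tabu points, the penalty terms would simply be summed, as stated in the text.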

D. Diversity- and Age-based Consensus Selection

While tabu zones are set to avoid revisiting previously discovered conformers, a related metric is used to avoid redundancy of individuals within the current population. If two chromosomes fail to display, with respect to at least one degree of freedom, a torsional angle difference exceeding a given fraction f of the significant difference si, i.e. max(Δi/si)<f, then the less fit of the two is considered redundant. A redundancy score within the [0,1] range is calculated (0 means that the current chromosome has no fitter neighbors, while 1 is scored by the chromosomes with a maximum of fitter neighbors). Also, an aging score is defined between 0 (“newborn” individuals) and 1 (chromosomes having reached or exceeded the user-defined maximal age Amax). Finally, the fitness value is converted, for selection purposes, into a rank-proportional score, from 0 (fittest) to 1 (least fit). Selection is performed on the basis of the sum of the three (fitness, redundancy, age) scores, allowing for a balanced selection of fit, non-redundant and “young” individuals.
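The three-term consensus score can be sketched as below. The normalizations are assumptions: the redundancy score is taken as the number of fitter neighbors divided by the largest such count in the population, and the rank score as rank/(n-1). Angles are in degrees and all names are hypothetical.

```python
from typing import List, Sequence


def angular_diff(a: float, b: float, period: float) -> float:
    """Absolute angular difference, accounting for periodicity."""
    d = abs(a - b) % period
    return min(d, period - d)


def selection_scores(angles: Sequence[Sequence[float]], energies: Sequence[float],
                     ages: Sequence[int], shifts: Sequence[float],
                     periods: Sequence[float], f: float, a_max: int) -> List[float]:
    """Sum of rank, redundancy and age scores (each in [0, 1]); lower is better."""
    n = len(energies)
    order = sorted(range(n), key=lambda k: energies[k])
    rank = {k: pos for pos, k in enumerate(order)}
    fitter_neighbors = []
    for a in range(n):
        count = 0
        for b in range(n):
            if rank[b] < rank[a]:
                worst = max(angular_diff(x, y, p) / s
                            for x, y, s, p in zip(angles[a], angles[b], shifts, periods))
                if worst < f:  # no axis differs by more than f * s_i
                    count += 1
        fitter_neighbors.append(count)
    top = max(fitter_neighbors) or 1  # avoid division by zero when nobody is redundant
    return [rank[k] / max(n - 1, 1)
            + fitter_neighbors[k] / top
            + min(ages[k] / a_max, 1.0)
            for k in range(n)]


if __name__ == "__main__":
    scores = selection_scores(angles=[[10.0], [12.0], [200.0]],
                              energies=[-5.0, -4.0, -3.0], ages=[0, 0, 0],
                              shifts=[90.0], periods=[360.0], f=0.1, a_max=10)
    print(scores)
```

In this toy population the second chromosome is a near-copy of the fittest one, so it ends up with a worse consensus score than the third, less stable but original, chromosome.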

E. Exploratory and Focusing Modes

A new generation is said to represent a “progress” with respect to the previous one if the fitness of at least one of the ntop=4 top-ranked individuals was improved by a rank-dependent minimum energy drop. A decrease by ε (user-defined, within 0.5…2 kcal/mol) of the top-ranked fitness score, or by 2ε of the second-best fitness value, etc., counts as “progress”.
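The progress test can be sketched as follows. The text only specifies the thresholds ε and 2ε for the two best ranks; continuing the series linearly (3ε, 4ε) for the remaining ranks is an assumption, as is the function name.

```python
from typing import Sequence


def is_progress(prev_top: Sequence[float], new_top: Sequence[float],
                eps: float, n_top: int = 4) -> bool:
    """True if any of the n_top best energies dropped by at least (rank + 1) * eps.

    Both sequences hold energies sorted from best (lowest) to worst.
    """
    for rank in range(min(n_top, len(prev_top), len(new_top))):
        required_drop = (rank + 1) * eps  # assumed linear continuation of eps, 2*eps, ...
        if prev_top[rank] - new_top[rank] >= required_drop:
            return True
    return False


if __name__ == "__main__":
    previous = [-10.0, -9.0, -8.0, -7.0]
    print(is_progress(previous, [-10.6, -9.0, -8.0, -7.0], eps=0.5))
    print(is_progress(previous, [-10.0, -9.6, -8.0, -7.0], eps=0.5))
```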

The redundancy control fraction f is first set to a user-defined value Fmax of the order of 1.0, meant to favor diversity in the early stages of evolution, the “exploratory mode” (Fmax replaces the former diversity control parameter Smax from [2]). After Nnonew generations with no progress, the genetic algorithm is toggled into “focusing mode”: f is set to 0.1 and all population members except for the ntop=4 individuals are replaced by new chromosomes in which low-importance degrees of freedom (with Ki=1) are randomized, while the others are copied from one of the ntop leaders. This is meant to detect whether terminal fragment rearrangement, not properly explored during the exploratory stage, may bring any significant fitness improvement of the best solutions found to date. Following Nnonew unsuccessful generations in focusing mode, a complete reset of all but the Nelit “immortal” solutions specified by the elitism control [2] is performed and the exploratory mode is resumed. A simulation terminates after a tunable number Nreset of resetting events (unless forcefully terminated due to elapsed node reservation time). Parameters Nnonew and Nreset are the new termination controls, replacing the ones used in [2].

Note that, upon population reset, the former chromosomes are added to the list of tabu solutions in order to push further sampling away from the already explored areas.
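The alternation between exploratory and focusing modes amounts to a small state machine, sketched below. The class and the action labels are hypothetical; the sketch also assumes that progress made in focusing mode keeps the algorithm in that mode, a point the text leaves open.

```python
from dataclasses import dataclass


@dataclass
class ModeController:
    """Hypothetical controller for the exploratory/focusing cycle."""
    n_nonew: int          # generations without progress before switching
    n_reset: int          # number of resets after which the run terminates
    f_max: float = 1.0    # redundancy fraction in exploratory mode
    f_focus: float = 0.1  # redundancy fraction in focusing mode
    mode: str = "exploratory"
    stalled: int = 0
    resets: int = 0

    @property
    def f(self) -> float:
        """Redundancy control fraction currently in force."""
        return self.f_max if self.mode == "exploratory" else self.f_focus

    def step(self, progressed: bool) -> str:
        """Register one generation; return 'continue', 'focus', 'reset' or 'terminate'."""
        if progressed:
            self.stalled = 0
            return "continue"
        self.stalled += 1
        if self.stalled < self.n_nonew:
            return "continue"
        self.stalled = 0
        if self.mode == "exploratory":
            self.mode = "focusing"  # caller randomizes the K_i = 1 axes around the leaders
            return "focus"
        self.mode = "exploratory"   # caller resets all but the elite, old members become tabu
        self.resets += 1
        return "terminate" if self.resets >= self.n_reset else "reset"


if __name__ == "__main__":
    ctl = ModeController(n_nonew=2, n_reset=2)
    print([ctl.step(False) for _ in range(8)])
```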

F. Local Sampling Procedure

Local sampling is initiated around a chromosome (Θi) representing the cell C to be explored. The first step therefore represents the specification of the cell containing the chromosome, by converting the torsional angle values into integer cell coordinates ci:

$$c_i = \mathrm{int}\left(\frac{K_i\,\sigma_i\,\theta_i}{2\pi}\right) \qquad (4)$$

The cell coordinate with respect to axis i may take values from 0 to Ki-1. Accordingly, the accessible range of Θi will be restricted to:

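$$\left[\theta_i^{\min}, \theta_i^{\max}\right]_{local} = \left[\frac{2\pi}{\sigma_i}\,\frac{c_i}{K_i},\ \frac{2\pi}{\sigma_i}\,\frac{c_i + 1}{K_i}\right] \qquad (5)$$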
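The mapping of a torsional angle to its cell coordinate, equation (4), and the corresponding restricted range, equation (5), can be sketched as below. Angles are in radians; wrapping the angle into [0, 2π/σ) and clamping the coordinate to K-1 are defensive additions not mentioned in the text.

```python
import math
from typing import Tuple


def cell_coordinate(theta: float, sigma: int, k: int) -> int:
    """Equation (4): integer cell coordinate of angle theta along an axis with k domains."""
    period = 2.0 * math.pi / sigma
    theta = theta % period  # defensive wrap into [0, 2*pi/sigma)
    return min(int(k * sigma * theta / (2.0 * math.pi)), k - 1)


def local_range(c: int, sigma: int, k: int) -> Tuple[float, float]:
    """Equation (5): angular range to which the axis is confined inside cell coordinate c."""
    width = 2.0 * math.pi / (sigma * k)
    return c * width, (c + 1) * width


if __name__ == "__main__":
    theta = math.radians(100.0)
    for sigma in (1, 2):
        c = cell_coordinate(theta, sigma, k=4)
        lo, hi = local_range(c, sigma, k=4)
        print(f"sigma={sigma}: c={c}, range={math.degrees(lo):.0f}..{math.degrees(hi):.0f} deg")
```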

There are no other methodological differences between local and global searches. In equation (5) the Ki values correspond to the graininess Kmax of the global search procedure that had led to the discovery of the seed chromosome, which need not be identical to the one used internally during the local search. Unlike the global searches, where operational parameters are controlled by a meta-optimization scheme (chosen on the basis of previous observations relating sampling success to parameter setup [2, 3]), a standard setup meant to favor quick convergence is used for local searches:

• population size=100,

• six “islands” per “planet”; in four of them the GA is hybridized with a “Lamarckian” optimization scheme (with a probability of 50%, conjugate gradient minimization is applied to newly discovered top chromosomes), while the other two also support directed mutations [2] and a Monte Carlo geometry relaxation protocol (each applied with a probability of 10% to novel fit individuals),

• aging penalty is disabled,

• elitist strategy [2] is enabled (Nelit=1),

• allocated runtime is a fraction λ=1/3 of the WALLTIME allowed for global searches.

Furthermore, the initial chromosome is allowed to enter two of the six island populations, using the buffered migration mechanism.

If local sampling leads to the discovery of a new, more stable conformer within the cell C (by at least 0.1 kcal/mol), or if this cell has been subjected to fewer than five local search campaigns, the cell will remain “open”, i.e. eligible for further local searching. Otherwise, it will be considered “closed” (fully explored) and characterized by its free energy index G(C):

$$G(C) = -kT \ln \sum_{i \in C} \exp\left(-\frac{e_i}{kT}\right) \qquad (6)$$

Summation in (6) is performed over all geometries i sampled within C and kT represents the temperature factor (0.6 kcal/mol at 300 K). G(C) synthetically characterizes both the depth and the width of the key minima within cell C.
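The free energy index of equation (6) can be evaluated as sketched below. Shifting all energies by the lowest one before exponentiation is a standard numerical precaution added here, not something stated in the text; the function name is illustrative.

```python
import math
from typing import Sequence


def cell_free_energy(energies: Sequence[float], kt: float = 0.6) -> float:
    """G(C) = -kT ln sum_i exp(-e_i / kT), in the units of the energies (kcal/mol)."""
    e_min = min(energies)
    # Factor out exp(-e_min / kT) so that every exponent is <= 0 (no overflow).
    total = sum(math.exp(-(e - e_min) / kt) for e in energies)
    return e_min - kt * math.log(total)


if __name__ == "__main__":
    narrow = cell_free_energy([-50.0])
    wide = cell_free_energy([-50.0, -49.9, -49.8, -49.7])
    print(narrow, wide)
```

A cell with several geometries close to its deepest minimum scores a lower G(C) than a cell holding that minimum alone, which is how the index reflects the width as well as the depth of the minima.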

G. Panspermia Strategies – Seeds and Tabus

In order to exploit information from previous runs, island models subsequently deployed on new nodes (planets) may be started in presence of two control files – SEED and TABU – containing selected chromosomes of to-date available conformers. SEED contains chromosomes thought to contain privileged torsional angle values, which should be preferentially adopted during the new simulation. This is achieved by means of two biasing procedures:

• “tradition-based” biasing of torsional angle random drawing probabilities, as defined in [2], and

• the “ancestor crossover” technique, replacing (with a user-defined frequency fancest) classical random chromosome draws with cross-over results of chromosomes from SEED.

Note that chromosomes in SEED simultaneously act as tabu geometries, the goal of seeding being to force exploration of novel combinations of these privileged torsional angle values, likely encoding well-folded molecular fragments. Also, the initial SEED file will be randomly split into specific equal size subsets for each "island". The content of the TABU file will be shared by all the islands and serves to define tabu areas in phase space.

For local searches, the applicable panspermia strategy is straightforward: SEED contains all the so-far harvested chromosomes within the local cell C, further on denoted as the In-Cell Repository (ICR), and there are no additional TABU geometries.

For global searches, panspermia is only applied once the number of "closed" cells reaches a user-defined threshold (typically 30). Each closed cell has a limited biasing impact allowance BIA, i.e. may be dispatched to the SEED file of an emerging "planet" only a limited number of times. However, the expected biasing impact of a chromosome is empirically considered to vary as the inverse square of its rank in the SEED file, which is sorted by energy. The fittest-to-date chromosome in SEED is counted as exerting a biasing impact of 1.0, the second best an impact of 0.25, etc.

With each selection in a SEED file, chromosome impact counters are updated by corresponding increments. Chromosomes having reached the predefined BIA will further on be declared tabu. Presently, a BIA value of 30 was tentatively used.
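The impact bookkeeping can be sketched as follows. The dictionary-based counters and the function name are hypothetical; only the 1/rank² increments and the BIA value of 30 come from the text.

```python
from typing import Dict, List, Sequence


def update_impact(counters: Dict[str, float], seed_ids: Sequence[str],
                  bia: float = 30.0) -> List[str]:
    """Add 1/rank**2 to the counter of each chromosome of a SEED file.

    seed_ids must be sorted by increasing energy (rank 1 = fittest).
    Returns the identifiers whose counter has reached the allowance BIA.
    """
    reached = []
    for rank, cid in enumerate(seed_ids, start=1):
        counters[cid] = counters.get(cid, 0.0) + 1.0 / rank ** 2
        if counters[cid] >= bia:
            reached.append(cid)
    return reached


if __name__ == "__main__":
    counters: Dict[str, float] = {}
    for launch in range(1, 31):
        tabu = update_impact(counters, ["cellA", "cellB", "cellC"])
        if tabu:
            print(f"launch {launch}: now tabu -> {tabu}")
    print(counters)
```

A chromosome that always ranks first is thus retired after 30 SEED files, whereas one that always ranks second would survive 120 of them.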

H. Divide-and-Conquer Planetary Strategy

At start-up, the user is asked to define the number of nodes to be used for global (NG) and local (NL) searches respectively. Initially, however, the dispatcher script launches unbiased (no SEED, no TABU) global searches on all the NG+NL nodes, then enters (Fig. 3) the waiting loop in expectation of result files returning from their “planets” of origin.

Global searches return a set of diverse low-energy conformer chromosomes. When detected by the dispatcher, these files are first parsed in order to determine the cell containing each chromosome. These cells may be either newly discovered, open for local searches or closed. In the first two situations, the chromosomes are added to the Open Cell Repository (OCR), replacing any potential higher-energy entries associated to those cells – or ignored if the OCR already contains a more stable representative. If the global search has revisited a closed cell, most likely the rediscovered geometry is of higher energy than the best representative from the Closed Cell Repository (CCR) and will be ignored. Otherwise, the cell will be reopened – moved from CCR to OCR, and represented by the now discovered best conformer.
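The integration of one global search result into the repositories can be sketched as below. For brevity, the OCR and CCR are modeled as dictionaries mapping a cell (tuple of integer coordinates) to the energy of its representative; the actual files also hold the chromosomes, and the returned labels are illustrative.

```python
from typing import Dict, Tuple

Cell = Tuple[int, ...]


def integrate_global_result(cell: Cell, energy: float,
                            ocr: Dict[Cell, float], ccr: Dict[Cell, float]) -> str:
    """File one conformer returned by a global search; return the action taken."""
    if cell in ccr:
        if energy >= ccr[cell]:
            return "ignored"      # closed cell revisited without improvement
        del ccr[cell]             # better conformer found: reopen the cell
        ocr[cell] = energy
        return "reopened"
    if cell in ocr and ocr[cell] <= energy:
        return "ignored"          # OCR already holds a more stable representative
    ocr[cell] = energy            # new cell, or improved representative of an open cell
    return "stored"


if __name__ == "__main__":
    ocr: Dict[Cell, float] = {(0, 1): -20.0}
    ccr: Dict[Cell, float] = {(2, 2): -35.0}
    for cell, e in [((0, 1), -22.0), ((0, 1), -21.0), ((3, 0), -5.0),
                    ((2, 2), -30.0), ((2, 2), -36.5)]:
        print(cell, e, integrate_global_result(cell, e, ocr, ccr))
```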

After finishing the import of arrived results, the dispatcher toggles back into launching mode, and verifies whether the ratio between still running global and local searches exceeds NG/NL. If so, and if OCR is not empty, a local search will be started next.
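A minimal sketch of this launching decision is given below. The function name and argument names are hypothetical, and the fallback to a global search whenever the condition is not met is an assumption, since the text only states when a local search is started.

```python
def next_job_type(running_global, running_local, n_global, n_local, ocr_size):
    """Return 'local' or 'global' for the next job to submit.

    Assumes n_local > 0. The test running_global / running_local > n_global / n_local
    is done by cross-multiplication, so running_local == 0 needs no special case.
    """
    if ocr_size == 0:
        return "global"  # nothing to refine yet
    if running_global * n_local > n_global * running_local:
        return "local"
    return "global"
```

Right after start-up, when all NG+NL nodes run global searches, the test is satisfied as soon as the OCR holds at least one cell, so the first free node is given a local search.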

In order to do so, an open cell must be selected from the OCR for in-depth sampling. With a probability of 20%, the lowest-energy entry of the OCR is picked. In the remaining 80% of cases, the selection criterion is a trade-off between the representative energy level of the cell and the block distance Σ|ci-cj| between this cell and the ones that are closed or currently under search. This measure was adopted in order to avoid simultaneous local searching in neighboring cells. The selected cell is extracted from the OCR and sent to a new “planet” for local exploration. If this is not the first time the cell is submitted to local searching, the available In-Cell geometry Repository (ICR) is passed as the SEED file.
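The selection step can be illustrated as follows. The text does not give the exact form of the energy/distance trade-off, so the score used here (energy minus a weight times the block distance to the nearest busy cell) and the `distance_weight` parameter are purely hypothetical; only the 20% greedy pick and the block distance itself are taken from the description.

```python
import random


def block_distance(cell_a, cell_b):
    """Block (Manhattan) distance between two cells given as index tuples."""
    return sum(abs(a - b) for a, b in zip(cell_a, cell_b))


def pick_cell(ocr, busy_cells, rng=random, greedy_prob=0.2, distance_weight=1.0):
    """Pick the open cell to submit to local search.

    ocr: dict mapping cell index tuple -> representative energy.
    busy_cells: cells that are closed or currently under local search.
    rng: any object with a random() method returning a float in [0, 1).
    """
    if not ocr:
        raise ValueError("OCR is empty")
    if rng.random() < greedy_prob or not busy_cells:
        return min(ocr, key=ocr.get)  # lowest-energy open cell

    def score(cell):
        # Hypothetical trade-off: low energy and large distance are both rewarded.
        nearest = min(block_distance(cell, other) for other in busy_cells)
        return ocr[cell] - distance_weight * nearest

    return min(ocr, key=score)
```

With this kind of score, a slightly less stable cell far from all busy cells can be preferred over the most stable one when the latter sits next to a cell that is already being searched.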

After the fifth local search, a cell is closed if the current run failed to discover any more stable geometry. In that case, the local search returns its results in the “closed cell” format, which the dispatcher adds to the CCR file. Otherwise, results are returned in the “open cell” format, which is similar to the one used by global searches, except that all the returned chromosomes are merged with the associated ICR file.
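The closure rule can be written as a small predicate. The reading adopted here, that the rule applies to the fifth and every later local search of a cell, is an assumption, as are the function and parameter names.

```python
MIN_LOCAL_SEARCHES = 5  # "after the fifth local search"


def local_result_format(n_searches_done, energy_before, energy_after,
                        min_searches=MIN_LOCAL_SEARCHES):
    """Return 'closed' or 'open' for a cell after a local search.

    n_searches_done: local searches run on this cell, including the current one.
    energy_before / energy_after: best in-cell energy before and after the run.
    """
    improved = energy_after < energy_before
    if n_searches_done >= min_searches and not improved:
        return "closed"
    return "open"
```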

[Fig. 3 flowchart. Launching mode: the dispatcher detects running jobs and compares the ratio of running global to running local searches with NG/NL, then submits either a global search (set WALLTIME; select SEED and TABU from CCR, if enough entries; select operational parameters) or a local search (set WALLTIME/λ; pick a cell from OCR; use ICR as SEED). Result integration mode: the dispatcher detects the result type. Closed cell: add to CCR. Open cell: add to OCR, add geometries to ICR. Global search: assign found geometries to cells; merge entries into OCR (keep the most stable geometry per cell); update the table of sampling success vs. operational parameters.]

Fig. 3. Principle of the Global/Local Planetary Strategy, as outlined in this paragraph. Shuttles pointing towards the upper right corners symbolize submission of a run (including molecular data files, in addition to specified setups) on a remote “planet” (node). The incoming shuttle (pointing to lower left) represents result transmission from the planet to the dispatcher.

III. SIMULATIONS SETUP

There are several key parameters controlling the relative interplay of global and local sampling strategies:

• The maximal number of divisions per degree of freedom, Kmax, and the extremal fragment sizes Smin and Smax control the coarseness of phase space splitting, and hence the phase space volumes of the considered cells. Increasing Kmax while decreasing both Smin and Smax decreases the chance of early discovery of the cell containing the native fold. In compensation, the odds of local search failure within this cell (premature convergence) also decrease. The empirical choices Smin=3, Smax=10 were adopted throughout the present work; only the influence of the choice of Kmax was explicitly addressed.

• The choice of the number of nodes to be used for local and global searching, NL and NG respectively, is also of paramount importance. Higher NG means improved chances of discovering a representative geometry of the “native” cell, but also potentially prohibitive waiting times before this cell is submitted to local searching. Unfortunately, in practice it is quite difficult to impose strict control over NL and NG as such, because the total number of available nodes on the GRID fluctuates strongly, potentially down to zero. The dispatcher can maintain a roughly constant NG/NL ratio as long as NG+NL>>10. The current environment does not allow monitoring the impact of these parameters on simulation success: NG=15 and NL=25 were used throughout the study.

• Last but not least, the choice of λ, the runtime scale-down factor for local searches, may have an important but difficult-to-monitor impact on simulation behavior. Shorter local search times mean more local searches, hence stronger competition between cells applying for local sampling resources, but also an increased risk of premature “closing” of phase space cells. The allotted local search runtime must furthermore be calibrated with respect to the typical time required for a local search to reach its final stages; otherwise, the chances of local searches progressing below the initial energy become vanishingly small. Unlike global searches, which are subject to variable parameterization regimes, fixed-parameter local searches behave less stochastically. For each studied molecule, the order of magnitude of the needed local search runtime (λ·WALLTIME) was estimated from a few test runs at minimal Kmax (maximal cell volumes). With λ=0.3, the corresponding WALLTIME allotted to global searches turned out to be sufficient to let more than 50% of global runs terminate gracefully rather than being stopped due to time constraints. Therefore, λ=0.3 has been adopted as a temporary, reasonable solution in conjunction with problem-dependent, empirically observed WALLTIME values, pending a more rigorous study of its impact.

Other empirical parameters controlling the behavior of the divide-and-conquer planetary model exist and were mentioned above. Unfortunately, each simulation takes days to weeks on tens of nodes of a cluster: it is thus practically impossible to generate enough observations in order to obtain parameter-specific robust average success scores, canceling out the noise induced by GRID deployment accidents and the intrinsic stochastic nature of evolutionary processes.

IV. RESULTS

A. Folding of the Tryptophan Cage

The successful and reproducible folding of this helical folding benchmark peptide had already been achieved by previous versions of the “planetary” strategy in a matter of several (3 to 4) days on 30 nodes [3]. Those results are not, strictly speaking, directly comparable to the current ones: they were obtained on different nodes (in terms of hardware) and were generated with a different set of empirical force field parameters (the tuning of which represents an additional computer-intensive task [9]). The Divide-and-Conquer strategy, at very coarse phase space splitting (Kmax=3), required only 17 hours on 40 nodes to find a native fold as the most stable geometry sampled so far. Fig. 4 shows the top 10 most stable geometries from the CCR. Since each geometry is contained in a different cell, the large cell size explains the considerable diversity of these conformers, amongst which only the most stable matches the native structure (α-carbon RMSD upon overlay of 1.7 Å).

Fig. 4: Left, top 10 most stable geometries of 1L2Y obtained with coarse phase space splitting. The most stable (right, in red) matches the native geometry (green)

Using a finer split of Kmax=10, obtaining native-like folds took slightly longer (~24 hours), but this may not be significant (at this time, no resources are available for investigating the reproducibility of simulation durations). The geometries (Fig. 5) representing the top 10 best-ranked CCR cells are now, unsurprisingly, much more similar to each other, as several neighboring cells are needed to cover the near-native phase space zone. From an experimental point of view, these close geometries are likely to be redundant, in the sense that inter-conformer differences are likely smaller than the resolution of experimental structure-solving methods.

Fig. 5: Top 10 1L2Y conformers generated under fine splitting conditions.

Unsurprisingly, top conformers found at finer splitting have significantly lower energies than their counterparts from Fig. 4, since conditions allowed for finer tuning of the side chain placements.

B. Folding of the Tryptophan Zipper

Although smaller than the Cage, this minimalistic example of a β-sheet is much more difficult to fold [7]. The coarse sampling strategy (Kmax=3) is very quick (within the first 50 to 100 global search runs, i.e. in a matter of hours) to visit the phase space cell containing the native fold. Unfortunately, these initial representatives of the native cell are quite poor matches of the experimental fold (the larger the cell, the more diverse the geometries it may accommodate). Unlike the native fold, they are also quite unfit and therefore hardly qualify for local searching. Local searching within the native cell was observed to take place only after several hundred other cells had already been closed, i.e. after several days. It was first seen to significantly lower the best-to-date energy level within the cell, but eventually converged prematurely without discovering the native state.

Fine splitting dramatically reduces the volume of the phase space cell harboring the native structures. Unsurprisingly, representatives of the native cell of the coarse split strategy fall outside this much more restricted phase space zone. Local search within it should have no difficulty discovering the native fold. However, global searches have so far failed to enumerate any representatives of the native cell.

The folding of a β-sheet is a cooperative event, requiring all the involved degrees of freedom to be set to values very close to the native ones before any stabilization effect is seen. Unlike in helical topologies, where partial folding of n turns out of N in the native fold produces a partly stabilized intermediate (by n/N of the total energy gain of the native structure), proper folding of half a β-sheet does not bring any stability. Furthermore, to our knowledge, the current force field used in simulations does not consider the native fold to correspond to the absolute energy minimum. A misfolded structure, about 5 kcal/mol more stable than the best native-like fold (found by local sampling within the neighborhood of the native fold), seems to be the major attractor in the energy landscape: it is systematically rediscovered by all simulations, irrespective of their coarseness.

C. Docking of emodin into the active site of Casein Kinase 2 (CK2), assuming hinge region flexibility

Kinases are an excellent benchmarking system for the docking tool, because it is known that the hinge loop (an amino acid sequence connecting the two main, rigid subunits of the protein) is flexible and may change its geometry in order to accommodate different ligands binding at the ATP site. The program was challenged to reproduce the crystal structure of protein CK2 with the ligand emodin [10], considering, in addition to the 6 rototranslational degrees of freedom of the ligand, 94 torsional variables, including:

• Rotatable bonds in the ligand
• Rotatable bonds within side chains of active site residues
• Both main chain and side chain torsions of hinge region amino acids Val 116 – Gln 123.

A sphere of residues within 15 Å of the crystal structure position of the ligand was defined as the active site, and all protein atoms not concerned by the above-mentioned degrees of freedom were considered fixed.


Fig. 6: Top 10 geometries calculated for the CK2-emodin complex, allowing for flexibility of the hinge region. In all cases, the ligand was properly docked. The most stable found conformer is the one with correct loop geometry.

Only coarse splitting at Kmax=3 led, within reasonable simulation times (30 h), to a successful discovery (Fig. 6) of the experimental structure as the lowest energy level found (so far) in this huge 100-dimensional phase space. At Kmax=10 the simulation had to be stopped after 6 days, while still ~50 kcal/mol away from the minimum found at Kmax=3.

A possible explanation for this behavior is that the degrees of freedom associated with ligand rototranslation may locally display a quite flat energy profile: if the ligand is far from the protein, it has a lot of accessible space to move in without affecting the total energy of the system. It is extremely unlikely to draw, by pure chance, the perfect placement of the ligand in the active site, knowing that any geometrically “almost” correct pose will suffer from a few bad contacts and hence have a very high energy. By contrast, it is quite easy to randomly pick a relatively low-energy state in which the ligand is far from the protein. Therefore, actual exploration of low-energy poses within the site will only commence after these easily accessible phase space cells describing unbound ligand states have been visited and declared tabu. The finer the splitting, the more such cells need to be enumerated: their number roughly scales as the sixth power of Kmax.
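To give a feel for this scaling, the short calculation below compares the two splitting levels used here. It assumes, as a rough reading of the argument, that the sixth power stems from the 6 rototranslational degrees of freedom each being cut into up to Kmax divisions; the function name is hypothetical and the numbers are scaling factors, not actual cell counts.

```python
def unbound_cell_scale(kmax, rototranslational_dof=6):
    """Scaling factor for the number of cells spanning ligand rototranslation."""
    return kmax ** rototranslational_dof


coarse = unbound_cell_scale(3)   # Kmax = 3
fine = unbound_cell_scale(10)    # Kmax = 10
print(coarse, fine, round(fine / coarse))  # 729 1000000 1372
```

Going from Kmax=3 to Kmax=10 thus multiplies the number of unbound-state cells to be visited and closed by a factor of more than a thousand, which is in line with the much slower progress observed at fine splitting.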

V. CONCLUSIONS

These preliminary results show, in the first place, how difficult the problem of conformational sampling is. The computational effort involved is such that publicly shared computational resources such as the GRID 5000 initiative [11], used as a development platform for the present tool, cannot support any in-depth studies of the impact of various strategy elements on the rapidity and reproducibility of results. Further results on other test molecules, such as the villin headpiece (1VII), are expected in the near future.

As a general and foreseeable trend, increasingly fine-grained phase space splitting allows for better optimization of structural details, at the risk of returning a large number of potentially redundant conformers. Coarse sampling, by contrast, allows quick rejection of uninteresting, energy-neutral phase space zones with few favorable contacts but no bad contacts (unfolded states, unbound ligand poses). The risk of overhasty closure of cells before all the relevant minima they contain have been enumerated does not appear to be an issue, except for the notoriously difficult folding of β-sheets.

REFERENCES

[1] Levinthal, C.: ‘How to fold graciously.’, in ‘Mossbauer Spectroscopy in Biological Systems.’ (University of Illinois Press, 1969), pp. 22-24
[2] Parent, B., Kökösy, A., Horvath, D.: ‘Optimized Evolutionary Strategies in Conformational Sampling.’, Soft Computing, 2007, 11, pp. 63-79
[3] Parent, B., Tantar, A., Melab, N., Talbi, E.-G., Horvath, D.: ‘Grid-based Evolutionary Strategies Applied to the Conformational Sampling Problem.’, 2007, pp. 291-296
[4] Tantar, A.-A., Melab, N., Talbi, E.-G., Parent, B., Horvath, D.: ‘A parallel hybrid genetic algorithm for protein structure prediction on the computational grid.’, Future Generation Computer Systems, 2007, 23, pp. 398-409
[5] Tantar, A.-A., Conilleau, S., Parent, B., Melab, N., Brillet, L., Roy, S., Talbi, E.-G., Horvath, D.: ‘Docking and Biomolecular Simulations on Computer Grids: Status and Trends’, Current Computer-Aided Drug Design, 2008, 4, in press
[6] Snow, C.D., Zagrovic, B., Pande, V.S.: ‘The Trp Cage: Folding Kinetics and Unfolded State Topology via Molecular Dynamics Simulations’, J. Am. Chem. Soc., 2002, 124, pp. 14548-14549
[7] Cochran, A.G., Skelton, N.J., Starovasnik, M.A.: ‘Tryptophan zippers: Stable, monomeric beta-hairpins.’, Proc. Natl. Acad. Sci. USA, 2001, 98, (10), pp. 5578-5583
[8] Belding, T.C.: ‘The Distributed Genetic Algorithm Revisited.’ (Morgan Kaufman, 1995), pp. 114-121
[9] Horvath, D., Tantar, A.A., Boisson, J.C., Melab, N., Brillet, L., Roy, S., Talbi, E.G.: ‘Force-field-based conformational sampling of proteins within the Docking@GRID project: status, results, issues’, Proc. META 08, Hammamet, Tunisia, 27-31 Oct 2008
[10] Raaf, J., Klopffleisch, K., Issinger, O.G., Niefind, K.: ‘The catalytic subunit of human protein kinase CK2 structurally deviates from its maize homologue in complex with the nucleotide competitive inhibitor emodin.’, J. Mol. Biol., 2008, 14, pp. 1-8
[11] https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home2007