
Docking with PIPER and refinement with SDU in rounds 6–11 of CAPRI


proteins: STRUCTURE · FUNCTION · BIOINFORMATICS

Yang Shen (1,2), Ryan Brenke (1,3), Dima Kozakov (1,4), Stephen R. Comeau (5), Dmitri Beglov (1,4), and Sandor Vajda (1,4)*

1 BioMolecular Engineering Research Center, Boston University, Boston, Massachusetts

2 Program in Systems Engineering, Boston University, Boston, Massachusetts

3 Program in Bioinformatics, Boston University, Boston, Massachusetts

4 Department of Biomedical Engineering, Boston University, Boston, Massachusetts

5 Dyax Corp., Boston, Massachusetts

ABSTRACT

Our approach to protein–protein docking includes three main steps. First we run PIPER, a new rigid body docking program. PIPER is based on the Fast Fourier Transform (FFT) correlation approach that has been extended to use pairwise interaction potentials, thereby substantially increasing the number of near-native structures generated. The interaction potential is also new, based on the DARS (Decoys As the Reference State) principle. In the second step, the 1000 best energy conformations are clustered, and the 30 largest clusters are retained for refinement. Third, the conformations are refined by a new medium-range optimization method, SDU (Semi-Definite programming based Underestimation). SDU has been developed to locate global minima within regions of the conformational space in which the energy function is funnel-like. The method constructs a convex quadratic underestimator function based on a set of local energy minima, and uses this function to guide future sampling. The combined method performed reliably without the direct use of biological information in most CAPRI problems that did not require homology modeling, providing an acceptable prediction for target 21, and medium quality predictions for targets 25 and 26.

Proteins 2007; 69:734–742. © 2007 Wiley-Liss, Inc.

Key words: fast Fourier transform; clustering; energy funnel; protein docking by PIPER; refinement by SDU; global optimization; semidefinite programming; DARS pairwise potential.

The authors state no conflict of interest.

Grant sponsor: National Institutes of Health; Grant numbers: GM061867, GM079396; Grant sponsor: National Science Foundation; Grant number: MRI DBI-0116574.

Yang Shen, Ryan Brenke, Dima Kozakov, and Stephen R. Comeau contributed equally to this work.

*Correspondence to: Sandor Vajda, Department of Biomedical Engineering, Boston University, 44 Cummington Street, Boston, MA 02215. E-mail: [email protected]

Received 5 June 2007; Revised 13 July 2007; Accepted 18 July 2007

Published online 12 September 2007 in Wiley InterScience (www.interscience.wiley.com).

DOI: 10.1002/prot.21754

INTRODUCTION

For a number of years we have been using a multistep approach to protein–protein docking [1–6]. The first step explores the conformational space with a rigid body global search method based on the Fast Fourier Transform (FFT) correlation approach, which evaluates the energies of billions of docked conformations on a grid. Depending on the particular method, 2000–20,000 docked structures are retained for analysis. Second, the number of structures is reduced by rigid body filters based on empirical potentials and electrostatics calculations. Third, the retained structures are clustered using the pairwise root mean square deviation (RMSD) as the distance measure and a fixed or variable clustering radius [5]. We have demonstrated that the 30 largest clusters contain at least one near-native structure (defined as having less than 10 Å ligand RMSD from the native) for 93% of the complexes in the well-known Weng benchmark set [7]. Finally, the conformations in the selected clusters are refined, in the past primarily by energy minimization with the CHARMM potential [8]. The use of this algorithm in Rounds 1 through 5 of CAPRI was fairly successful, and our team, led by Dr. Carlos Camacho at that time, consistently ranked among the top five performers in each round. We note that Dr. Camacho is now at the University of Pittsburgh, and participates in CAPRI with a separate team.

During the last year we have substantially redesigned the two main steps of the above multistep approach, namely the initial rigid body search and the refinement of structures in the selected clusters. For the rigid body search we have developed a new docking program called PIPER [6]. Similar to a number of other docking programs, PIPER is based on the FFT correlation approach, but the method is extended to be used with pairwise interaction potentials. Such structure-based (or knowledge-based) potentials have been previously used with success for ranking docked conformations [1,2,4,9–12], but their use directly for docking by the FFT correlation approach is novel. A knowledge-based potential, IFACE, was recently introduced into the FFT-based docking program ZDOCK [13]. In principle, FFT-based methods can use pairwise potentials as their scoring function, because a potential defined for K atom types and given by a K × K interaction matrix can be written as the sum of K correlation functions. The function can then be evaluated by performing K forward and one inverse Fourier transformations. However, since K is generally between 10 and 20, the approach is computationally expensive, even with the increasing computational power currently available. In PIPER we avoided this problem by an eigenvalue–eigenvector decomposition of the coefficient matrix that substantially reduced the complexity of the calculations. In fact, adequate accuracy was achieved by restricting consideration to the eigenvectors corresponding to the four largest eigenvalues, and thus performing only four forward and one inverse FFT calculations [6].
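The statement that a sum of correlation terms can be scored for all translations with several forward transforms and a single inverse transform can be illustrated with a small NumPy sketch. This is not PIPER's implementation: the grids are random toy arrays, translations are treated as cyclic shifts, and the function names `fft_correlation_scores` and `direct_scores` are hypothetical. The direct double loop is included only to check the FFT result.

```python
import numpy as np


def fft_correlation_scores(receptor_terms, ligand_terms):
    """Score every cyclic translation t of the ligand grids at once.

    score[t] = sum_p sum_x R_p[x] * L_p[x - t]
    Each term needs one forward FFT of the ligand grid (the receptor
    transforms do not depend on the ligand pose and could be cached);
    the terms are summed in Fourier space, so only one inverse FFT is run.
    """
    total = np.zeros(receptor_terms[0].shape, dtype=complex)
    for rec, lig in zip(receptor_terms, ligand_terms):
        total += np.fft.fftn(rec) * np.conj(np.fft.fftn(lig))
    return np.fft.ifftn(total).real


def direct_scores(receptor_terms, ligand_terms):
    """Same scores by explicit summation (slow; for verification only)."""
    shape = receptor_terms[0].shape
    scores = np.zeros(shape)
    for t in np.ndindex(*shape):
        for rec, lig in zip(receptor_terms, ligand_terms):
            shifted = np.roll(lig, shift=t, axis=(0, 1, 2))  # L[x - t]
            scores[t] += np.sum(rec * shifted)
    return scores


# Toy usage: four correlation terms on a 6 x 6 x 6 grid (assumed sizes).
rng = np.random.default_rng(0)
receptor = [rng.normal(size=(6, 6, 6)) for _ in range(4)]
ligand = [rng.normal(size=(6, 6, 6)) for _ in range(4)]
fast = fft_correlation_scores(receptor, ligand)
best_translation = np.unravel_index(np.argmin(fast), fast.shape)
```

Because the cost grows with the number of terms, reducing the number of terms from K (10 to 20) to four, as described above, directly reduces the number of forward transforms per rotation.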

The PIPER program was used with a new class of structure-based potentials called DARS (Decoys As the Reference State), based on the inverse Boltzmann approach. In this approach, a statistical potential between two atoms i and j of types I and J, respectively, is defined by an expression of the form $\varepsilon_{IJ} = -RT \ln(p_{IJ})$, where R is the gas constant, T is the temperature, and $p_{IJ}$ denotes the probability of two atoms of types I and J interacting. This probability is approximated by the ratio $p_{IJ} = \nu^{obs}_{IJ} / \nu^{ref}_{IJ}$, where $\nu^{obs}_{IJ}$ is the observed frequency of interacting atom pairs of types I and J, and $\nu^{ref}_{IJ}$ is the expected frequency of interacting atom pairs of types I and J assuming an appropriate reference state. The general assumption in the reference state is that the specific interactions determining the distribution of interaction sites are removed as much as possible. The most frequently used reference state uses the mole fractions to define $\nu^{ref}_{IJ} = \nu^{obs} \times X_I \times X_J$, where $\nu^{obs}$ is the frequency of all interacting pairs and $X_I$ is the mole fraction of atom type I.

The novelty of the DARS approach is that we generate a large decoy set of docked conformations using only shape complementarity as the scoring function (i.e., without any account for the atom types), and use the frequency of interacting atom pairs in these decoy structures as the reference state. Thus, in developing the potential, we compare the frequency of contacts between two specific atom types in X-ray structures of protein complexes to the frequency of contacts in the decoys that are devoid of specific interactions. Since the goal is finding complex conformations close to the native among the many structures that all have good shape complementarity, this scoring scheme is very natural, as it rewards the occurrence in the interface of the atom pairs that are frequently seen to interact in the native complexes. Because of the use of the more accurate DARS potential in PIPER, we do not need further filtering, and the top 1000 structures retained from the rigid body search are all clustered using a hierarchical clustering scheme and clustering radius as described previously [5].
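The inverse Boltzmann expression and the two reference states discussed above (mole fractions versus decoy contact frequencies) can be sketched in a few lines of Python. The contact counts, the two atom types, the temperature, and the function names below are made up for illustration and are not taken from the DARS parameterization; contact counts are assumed to be stored for ordered type pairs and to be strictly positive.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)


def normalize(counts):
    """Turn raw contact counts into frequencies that sum to one."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def mole_fraction_reference(mole_fractions, pairs):
    """Reference frequency X_I * X_J for each ordered type pair."""
    return {(a, b): mole_fractions[a] * mole_fractions[b] for a, b in pairs}


def inverse_boltzmann(observed, reference, temperature=300.0):
    """eps_IJ = -RT ln(p_IJ) with p_IJ = observed_IJ / reference_IJ."""
    return {
        pair: -GAS_CONSTANT * temperature * math.log(observed[pair] / reference[pair])
        for pair in observed
    }


# Hypothetical contact counts in native complexes and in shape-only decoys.
native_counts = {("C", "C"): 60, ("C", "N"): 10, ("N", "C"): 10, ("N", "N"): 20}
decoy_counts = {("C", "C"): 250, ("C", "N"): 250, ("N", "C"): 250, ("N", "N"): 250}

observed = normalize(native_counts)
decoy_potential = inverse_boltzmann(observed, normalize(decoy_counts))
mole_potential = inverse_boltzmann(
    observed, mole_fraction_reference({"C": 0.5, "N": 0.5}, native_counts)
)
```

In this toy case the two reference states coincide; with real decoys they generally differ, which is the point of using decoys as the reference.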

The structures in the 30 largest clusters are refined by the novel medium-range optimization method SDU (Semi-Definite programming based Underestimation), which has been developed to locate global minima within well defined regions of the funnel-like binding free energy landscape. Although the search is in the 6D space of rigid body rotations and translations, it allows for side chain flexibility in the interface, and can use free energy functions that are more accurate than the ones used for rigid body docking. Since energy functions generally have many local minima, even in relatively small regions, the most frequently used optimization methods involve the Monte-Carlo approach, which can cross energy barriers but does not assure global convergence and is numerically not very efficient. To overcome these shortcomings we assume that the free energy, $\Delta G$, is a funnel-like function within each region defined by a cluster. Such an energy function can be underestimated by a convex function U such that $U(x^*) \le \Delta G(x^*)$ for all local minima $x^*$ within the region. The SDU method [14] is tailored to this type of function. Taking advantage of the funnel-like behavior of the free energy, SDU constructs a convex quadratic underestimator based on a set of local energy minima. The tightest underestimator is obtained by solving a semidefinite programming problem, and it is then used to bias further sampling. The process is iterated with the set of local minima being updated, and the search region being reduced until certain convergence conditions are satisfied. Instead of working on the extremely rugged free energy landscape with a huge number of local minima separated by high energy barriers, SDU thus works on a smoothed landscape determined by the continuously updated underestimator. This approach yields a more than 10-fold increase in efficiency compared to Monte-Carlo-type algorithms [15].

In this article we first provide short summaries of the computational steps that constitute our method (i.e., docking by PIPER, discrimination by clustering, and refinement by SDU), and then describe the application of the combined procedure to the targets in rounds 6–11 of CAPRI. We focus on targets 21, 25, 26, and 27, which did not require homology modeling and hence are more appropriate for demonstrating the application of the multistage approach to well defined docking problems. In fact, the method turned out to be successful only for targets 21, 25, and 26 out of the six in rounds 6–11 of CAPRI, producing at least one acceptable prediction for each, and also medium accuracy predictions for targets 25 and 26. We note that neither PIPER nor SDU was completed when target 21 was released, and we used our ClusPro server (described separately in this issue) for rigid body docking of this target. Although SDU was used for refinement, at that time we were able to perform the search only in the translational space. As will be further discussed, for each target, models 1 and 2 incorporated some biological information we were able to find in the literature. However, it turned out that none of these models achieved even acceptable accuracy, and our successful predictions were generally selected on the basis of either large cluster size or low energy. In addition to describing our results, we provide a brief analysis to show that most targets in CAPRI rounds 6–11 were more difficult than the targets in the previous rounds, attesting to the progress a number of groups, including ours, have made in developing docking methodology.

METHODS

Rigid body docking using PIPER

FFT docking algorithms perform exhaustive evaluation of simplified energy functions in a discretized 6D space of mutual orientations of the protein partners. The larger docking partner is considered the receptor and its center of mass is fixed at the origin of the coordinate system. The other partner is considered the ligand and all its possible orientational and translational positions are evaluated at the given level of discretization. The rotational space is sampled using a deterministic layered Sukharev grid sequence for the rotational group SO(3), which quasi-uniformly covers the space with a given number of samples [16]. The translational space is represented as a grid of displacements of the ligand center of mass with respect to the receptor's center of mass. The energy function is given as the sum of terms representing shape complementarity, electrostatic, and desolvation contributions, the latter described by a pairwise potential as follows.

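$$E = E_{shape} + w_2 E_{elec} + w_3 E_{pair}$$

$$E_{shape} = E_{attr} + w_1 E_{rep}$$

$$E_{elec} = \sum_{i=1}^{N_r} \sum_{j=1}^{N_l} \frac{q_i q_j}{\left[ r_{ij}^2 + D^2 \exp\left( -\frac{r_{ij}^2}{4D^2} \right) \right]^{1/2}}$$

$$E_{pair} = \sum_{i=1}^{N_r} \sum_{j=1}^{N_l} \varepsilon_{ij}$$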

where $N_r$ and $N_l$ denote the numbers of atoms in the receptor and the ligand, respectively. The shape complementarity term $E_{shape}$ is a stepwise implementation of the Van der Waals energy, with $E_{attr}$ and $E_{rep}$ representing its attractive and repulsive components, respectively. $E_{elec}$ is the Coulombic electrostatic energy. $E_{pair}$ is a pairwise knowledge-based potential, with $\varepsilon_{ij}$ representing the interaction energy between atoms i and j. To evaluate the energy function E by FFT, it must be written as a sum of correlation functions. The first two terms, $E_{shape}$ and $E_{elec}$, satisfy this condition. To write $E_{pair}$ as a sum of a few correlation functions, we consider the eigenvalue–eigenvector decomposition.

$$E_{pair} = \sum_{i=1}^{N_r} \sum_{j=1}^{N_l} \sum_{p=1}^{K} \delta_{ij} \lambda_p u_{pI} u_{pJ}$$

Here $\delta_{ij}$ denotes an indicator function such that $\delta_{ij} = 1$ if atom i of the receptor is within a cutoff distance from atom j of the ligand and $\delta_{ij} = 0$ otherwise, $\lambda_p$ is the pth eigenvalue of the interaction matrix, and $u_{pI}$ is the Ith component of the pth eigenvector, with I and J denoting the types of atoms i and j.

The details of the PIPER docking program implementing the above energy functions have been previously reported [6]. The same article also described the DARS potential used in our calculations, and provided the matrix of interaction coefficients [6]. As mentioned in the introduction, $E_{pair}$ can be well approximated by restricting consideration to the first few terms of the eigenvalue–eigenvector decomposition. Each term in the decomposition represents an energy contribution proportional to the absolute value of the eigenvalue $\lambda_p$, and such contributions are independent due to the orthogonality of the eigenvectors. Calculating the energy on a grid may yield up to 10% error, so it is well justified to truncate the summation when the energy contributions of the neglected terms are comparable to this error. The analysis shows that only K = 4 eigenvalues are needed to achieve this accuracy. Although the meaning of the corresponding eigenvectors is somewhat difficult to interpret, the first eigenvector primarily represents favorable hydrophobic interactions. The second eigenvector shows that repulsive same-sign electrostatic interactions and, more generally, Lys side chains are not favorable in the interface.
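The truncation of the eigenvalue–eigenvector decomposition can be tried out on a toy matrix. The sketch below is only an illustration under stated assumptions: the 18 × 18 symmetric matrix is random rather than the published DARS coefficient matrix, the eigenvalues are ranked by absolute value (following the remark that each term contributes in proportion to the absolute value of its eigenvalue), and the function names are hypothetical.

```python
import numpy as np


def leading_eigen_terms(matrix, n_terms):
    """Return the n_terms eigenpairs with the largest |eigenvalue|."""
    eigvals, eigvecs = np.linalg.eigh(matrix)  # symmetric input assumed
    order = np.argsort(-np.abs(eigvals))[:n_terms]
    return eigvals[order], eigvecs[:, order]


def reconstruct(eigvals, eigvecs):
    """Rebuild eps_IJ ~ sum_p lambda_p * u_pI * u_pJ from the kept terms."""
    return (eigvecs * eigvals) @ eigvecs.T


# Hypothetical symmetric interaction matrix for 18 atom types.
rng = np.random.default_rng(0)
raw = rng.normal(size=(18, 18))
interaction = (raw + raw.T) / 2.0

vals4, vecs4 = leading_eigen_terms(interaction, 4)
approx = reconstruct(vals4, vecs4)
relative_error = np.linalg.norm(interaction - approx) / np.linalg.norm(interaction)
```

For a random matrix the four-term approximation is poor; the claim in the text is that the actual interaction matrix is dominated by a few eigenvalues, so that the neglected terms fall below the grid discretization error.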

The atomic parameters in the energy functions have been adopted from the CHARMM potential [8]. The parameters of the FFT algorithm used in our calculations were independent of the type of the proteins to be docked. We sampled 70,000 rotations, which approximately corresponds to sampling at every 5° in the space of Euler angles. Increasing the number of rotations generally would improve the results, and the selected number of points was chosen as a compromise between performance and computational efficiency. We used grids with 1.2 Å cell size, which was found to be adequate for representing protein structures in sufficient detail while providing acceptable computational efficiency.

Discrimination by clustering

The clustering of the retained conformations is based on the pairwise RMSD of ligand structures, calculated for the atoms that are within 10 Å of any atom of the fixed receptor (to be referred to as ligand RMSD). We use a simple greedy algorithm to find the structures with the largest number of neighbors within a clustering radius $R_C$. As we described earlier [5], the choice of $R_C$ depends on a clustering parameter $0 \le D \le 1$, which is based on the histogram of pairwise RMSD values, and measures the depth of the separation between clusters. D = 1 indicates perfect separation of intercluster and intracluster length scales. Such separation means that clustering is very easy, and that the use of the optimal radius $R_C$, calculated from the histogram of pairwise RMSD values [5], substantially increases the number of near-native structures in the top clusters. The analysis of the docked structures for the proteins in the benchmark set [7] showed that for $D \ge 0.4$ the use of the optimal radius generally increases the number of the near-native predictions in the largest clusters. In contrast, for D < 0.4 the choice of the calculated optimal radius does not necessarily improve the results, and hence it is better to use the default clustering radius of 9 Å. Once a clustering radius $R_C$ is selected, the structure with the highest number of neighbors within $R_C$ is considered as the center of the first cluster. The members of this cluster are removed, and we select the next structure with the highest number of neighbors from the remaining ligands until the set is exhausted, thereby generating 10–30 rank ordered clusters [5].
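The greedy selection of cluster centers described above can be sketched as follows. This is a minimal illustration that assumes a precomputed matrix of pairwise ligand RMSD values; the function name `greedy_cluster`, the tie-breaking rule (lowest index wins), the use of "less than or equal" for the radius test, and the toy one-dimensional "RMSD" values are all assumptions made for the example.

```python
def greedy_cluster(distances, radius):
    """Greedy clustering on a symmetric pairwise distance matrix.

    Repeatedly picks the remaining structure with the most remaining
    neighbors within `radius` as a cluster center, removes that cluster,
    and continues until no structures are left. Clusters are returned in
    the order found, i.e. ranked by size at the time of selection.
    """
    remaining = set(range(len(distances)))
    clusters = []
    while remaining:
        best_center, best_members = None, []
        for i in sorted(remaining):
            members = [j for j in sorted(remaining) if distances[i][j] <= radius]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        remaining -= set(best_members)
    return clusters


# Toy usage: six "structures" placed on a line, distance = |difference|.
positions = [0.0, 1.0, 2.0, 10.0, 11.0, 30.0]
toy_distances = [[abs(a - b) for b in positions] for a in positions]
toy_clusters = greedy_cluster(toy_distances, radius=1.5)
```

In the toy example the first center is the middle point of the tight group of three, followed by a cluster of two and a singleton.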

Refining top-ranked clusters with SDU

The structures in the top-ranked clusters are further refined using SDU, a stochastic global optimization method which exploits the funnel-like behavior of the free energy function $\Delta G$ in the regions of the conformational space defined by separate clusters [14]. SDU consists of two key components:

(i). Underestimating the surface spanned by the local minima. Given a set of K local minima $x^1, \ldots, x^K$ of $\Delta G$, we are interested in constructing a convex quadratic function U that underestimates $\Delta G$ at all local minima, i.e., $U(x^i) \le \Delta G(x^i)$, $i = 1, \ldots, K$. Let $U(x) \triangleq x'Qx + b'x + c$, where $Q \in \mathbb{R}^{n \times n}$ is positive semidefinite (hence, U is convex), $b \in \mathbb{R}^n$, and c is a scalar. Using an L1 norm as a distance metric, the problem of finding the tightest possible underestimator U can be formulated as follows:

$$\min \sum_{i=1}^{K} \left( \Delta G(x^i) - c - {x^i}' Q x^i - b' x^i \right)$$

$$\text{s.t.} \quad \Delta G(x^i) \ge c + {x^i}' Q x^i + b' x^i, \quad i = 1, \ldots, K; \qquad Q \succeq 0,$$

where the decision variables are Q, b, and c, and "$\succeq 0$" denotes positive semidefiniteness. We have shown that this problem can be reformulated into a semidefinite programming problem which can be solved efficiently using interior-point methods [14].

(ii). Using the underestimator to guide further exploration. Assuming that the constructed underestimator reflects the funnel-like behavior of the free energy function, we call the global minimum $x^P$ of U the "predictive conformation," where $x^P = -\frac{1}{2} Q^{-1} b$. High energy samples or those far from $x^P$ are discarded. Considering the huge number of local minima spread over the region B, we sample more structures and give more chances to those closer to $x^P$. This can be achieved by using the following probability density function (pdf) in B:

$$f(x) = \frac{U(x) - U_{max}}{\int_B \left( U(x) - U_{max} \right) dx}, \quad x \in B,$$

where $U_{max} = \max_i U(x^i)$. Random samples with the above pdf are generated by an acceptance/rejection method and added to the sample set. Steps (i) and (ii) are iterated for a fixed number of iterations unless the convergence conditions are satisfied. SDU reports the lowest-energy structure in the sample set at the end as the final solution.
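The two components (i) and (ii) can be illustrated in one dimension, where the matrix Q reduces to a scalar q and the semidefinite constraint reduces to q ≥ 0, so that the underestimation problem becomes a small linear program. The sketch below is not the SDU implementation: it solves the 1-D problem by brute-force enumeration of candidate vertices instead of an interior-point SDP solver, requires at least three distinct sample points, and all function names and toy data are hypothetical.

```python
import itertools

import numpy as np


def fit_underestimator_1d(xs, gs, tol=1e-7):
    """Tightest U(x) = q*x**2 + b*x + c with U(x_i) <= g_i and q >= 0."""
    xs, gs = np.asarray(xs, dtype=float), np.asarray(gs, dtype=float)
    data_rows = np.column_stack([xs ** 2, xs, np.ones_like(xs)])
    A = np.vstack([data_rows, [-1.0, 0.0, 0.0]])  # last row encodes -q <= 0
    rhs = np.append(gs, 0.0)
    gain = data_rows.sum(axis=0)  # maximizing sum_i U(x_i) minimizes the L1 gap
    best, best_gain = None, -np.inf
    for rows in itertools.combinations(range(len(rhs)), 3):
        sub = A[list(rows)]
        if abs(np.linalg.det(sub)) < 1e-12:
            continue  # these three constraints do not define a vertex
        cand = np.linalg.solve(sub, rhs[list(rows)])
        if np.all(A @ cand <= rhs + tol) and gain @ cand > best_gain:
            best, best_gain = cand, gain @ cand
    if best is None:
        raise ValueError("need at least three distinct sample points")
    return tuple(float(v) for v in best)


def predictive_conformation(q, b):
    """1-D analogue of x_P = -(1/2) Q^{-1} b; None if U is not strictly convex."""
    return -b / (2.0 * q) if q > 0 else None


def sample_biased(q, b, c, lo, hi, u_max, n, rng):
    """Acceptance/rejection sampling with density proportional to u_max - U(x)."""
    u = lambda x: q * x * x + b * x + c
    candidates = [lo, hi]
    if q > 0 and lo < -b / (2.0 * q) < hi:
        candidates.append(-b / (2.0 * q))
    w_max = u_max - min(u(x) for x in candidates)
    if w_max <= 0:
        raise ValueError("u_max must exceed the minimum of U on [lo, hi]")
    samples = []
    while len(samples) < n:
        x = rng.uniform(lo, hi)
        if rng.uniform(0.0, w_max) < u_max - u(x):  # never accepts where U > u_max
            samples.append(x)
    return np.array(samples)


# Toy local minima lying exactly on the funnel 2*(x - 1)**2 + 3.
xs = [-1.0, 0.0, 0.5, 2.0, 3.0]
gs = [2.0 * (x - 1.0) ** 2 + 3.0 for x in xs]
q, b, c = fit_underestimator_1d(xs, gs)
x_p = predictive_conformation(q, b)
```

In the full method the same roles are played by the matrix Q, the vector b, and a region B in the 6D rigid body space; the 1-D case is used here only because it can be solved without an SDP library.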

In the refinement step by SDU we use the free energy model $\Delta G = \Delta G_{des} + \Delta G_{elec} + \lambda \Delta E_{vdw}$, where the desolvation free energy $\Delta G_{des}$ is estimated by the Atomic Contact Potential (ACP) [17], the electrostatic energy $\Delta G_{elec}$ is based on the Coulombic formula with the distance-dependent dielectric $\varepsilon = 4r$, and the Van der Waals term is adopted from the CHARMM potential [8]. The scaling factor $\lambda$ of the Van der Waals contributions to the free energy is dynamically selected between 0 and 1 by SDU, which strengthens the ability of the method to cross high energy barriers [14]. The search space is the 6D space of rigid motions including translations and rotations, that is, the special Euclidean group ($SE(3) = \mathbb{R}^3 \times SO(3)$), plus side-chain flexibility in the interface. Each sample structure is locally minimized with 300 steps of the Adopted Basis Newton–Raphson (ABNR) method under CHARMM [8], by fixing the backbones and allowing side-chain flexibility in the interface of both receptor and ligand. The resulting local minima are used in SDU for determining the underestimator and thus performing the 6D search. The lowest free energy structure (with full contribution of the Van der Waals term) of each cluster is chosen as the new cluster representative.
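How the three terms of the refinement free energy combine, and what the distance-dependent dielectric does to the Coulomb term, can be sketched as follows. The conversion constant 332.0716 kcal·Å/(mol·e²), the function names, and the toy charges and energies are assumptions for illustration; the desolvation and Van der Waals terms are simply passed in as numbers rather than computed.

```python
COULOMB_CONSTANT = 332.0716  # kcal*Angstrom/(mol*e^2); assumed conversion factor


def coulomb_distance_dependent(charges_a, charges_b, distances):
    """Coulomb sum with dielectric eps = 4r, so each pair scales as 1/(4 r^2).

    charges_a, charges_b: partial charges (in e) of the two partners.
    distances[i][j]: distance in Angstrom between atom i of a and atom j of b.
    """
    energy = 0.0
    for i, qi in enumerate(charges_a):
        for j, qj in enumerate(charges_b):
            r = distances[i][j]
            energy += COULOMB_CONSTANT * qi * qj / (4.0 * r * r)
    return energy


def free_energy(dg_des, dg_elec, de_vdw, vdw_scale):
    """dG = dG_des + dG_elec + lambda * dE_vdw with 0 <= lambda <= 1."""
    if not 0.0 <= vdw_scale <= 1.0:
        raise ValueError("vdw_scale must lie between 0 and 1")
    return dg_des + dg_elec + vdw_scale * de_vdw


# Toy usage: one +1/-1 charge pair at 2 Angstrom and made-up other terms.
dg_elec = coulomb_distance_dependent([1.0], [-1.0], [[2.0]])
soft = free_energy(dg_des=-5.0, dg_elec=dg_elec, de_vdw=10.0, vdw_scale=0.0)
full = free_energy(dg_des=-5.0, dg_elec=dg_elec, de_vdw=10.0, vdw_scale=1.0)
```

With the scaling factor at 0 a clashing pose (large positive Van der Waals term) is not penalized, which is one way to read the remark that scaling helps the search cross high energy barriers; the final ranking uses the full contribution.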

Prediction ranking

After all clusters have been refined by SDU, we select the models to be submitted. At this point the selection is a manual and somewhat intuitive process, rather than the application of well defined selection criteria. Although we rank order the clusters based on their size (the number of entries), the selection also incorporated other criteria. In particular, the first two submitted models were selected from large clusters that also had relatively good energies, and agreed best with the biological information available on the target. If not already selected, the lowest energy structure in the largest cluster was usually submitted as Model 3, without accounting for any a priori information. As we describe, we generally had better success with this model than with the first two. The remaining seven models were representatives from other large clusters.

Classification of complexes based on docking difficulty

We have introduced a simple classification of protein–protein complexes in order to predict the expected difficulty of docking [18,19]. Following the analyses by Janin et al. [20,21], it was found that the best predictors of success in docking are the conformational change upon binding, the change in the solvent accessible surface area, and the hydrophobicity of the interface. On the basis of this correlation, five classes of protein complexes, Types I through V, have been defined (Table I). The change in the solvent accessible surface area, $\Delta\mathrm{SASA}$, was calculated by $\Delta\mathrm{SASA} = \mathrm{SASA}_{receptor} + \mathrm{SASA}_{ligand} - \mathrm{SASA}_{complex}$, where SASA denotes the solvent accessible surface area of the protein indicated in the subscript. The hydrophobicity of the interface is given as the free energy, $\Delta G_{des}$, of desolvation upon association, calculated using the Atomic Contact Potential (ACP) [22], an atom-level extension of the Miyazawa and Jernigan potential [23]. A negative value of $\Delta G_{des}$ means that the removal of water from the interface is favorable, thereby indicating a largely hydrophobic interface.
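The classification rules can be written down as a small lookup, which also makes explicit that some combinations of values are not covered by any class. The sketch below transcribes the thresholds of Table I as they are printed (ΔSASA in Å², ΔG_des in kcal/mol, Cα RMSD in Å); the function names, the order in which the rules are tested, and the example inputs are assumptions.

```python
def delta_sasa(sasa_receptor, sasa_ligand, sasa_complex):
    """Change in solvent accessible surface area upon association."""
    return sasa_receptor + sasa_ligand - sasa_complex


def docking_difficulty_type(d_sasa, dg_des, ca_rmsd):
    """Return the Table I type ("I".."V"), or None if no rule applies."""
    if 1400 < d_sasa < 2000 and dg_des < -4:
        return "I"    # easy, unless key side chains are in wrong conformations
    if 2000 < d_sasa < 3000 and ca_rmsd < 2:
        return "II"   # moderate difficulty
    if 1400 < d_sasa < 2000 and dg_des > -4:
        return "III"  # unpredictable; can be very difficult
    if d_sasa < 1400 and dg_des < -4:
        return "IV"   # very difficult; hits are often lost in scoring
    if d_sasa > 2500 and ca_rmsd > 2:
        return "V"    # rigid body methods seem to fail
    return None


# Hypothetical example: a small, fairly polar interface.
example_type = docking_difficulty_type(
    delta_sasa(6000.0, 4000.0, 8500.0), dg_des=-1.0, ca_rmsd=1.0
)
```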

RESULTS AND DISCUSSION

Eight targets were provided during CAPRI rounds 6–11. However, both targets in round 8 (Targets 22 and 23) were canceled for manual predictors because of the availability of the complex structural details. As will be described, our methods provided an acceptable prediction for Target 21, and medium accuracy predictions for Targets 25 and 26. No correct models were produced for Targets 20 and 24, for which docking required modeling one of the component protein structures based on a template. An acceptable model for Target 27 was obtained and submitted via our server ClusPro (see companion article in this issue), but not by our human predictor team. No group submitted any acceptable prediction for Target 28. This target will not be discussed any further. Some details for the other targets are given as follows.

Target 20: RF1 and HemK

The docking task was to predict the complex structure of E. coli HemK and Release Factor 1 (RF1). HemK methylates a glutamine in the Q-loop of release factors. We selected RF1 as the receptor. The structure of this protein had to be built by homology modeling based on the structure of release factor 2 (RF2) from E. coli (PDB code: 1gqe) [24], which has about 40% sequence identity to RF1. This was done with the package Modeller [25]. Since the equivalent Q-loop region in the RF2 structure was completely disordered, it had to be built by the program. The subsequently released structure of the HemK-RF1 complex [26] shows generally high structural similarity between template and target structures, but reveals that the Q-loop region was predicted with large deviation [Fig. 1(A)]. Unfortunately this loop turned out to be an important part of the interface. The predicted position of the ligand, HemK, moved with the incorrect loop [Fig. 1(B)], resulting in docked structures that were rotationally far (>120°) from the native bound structure, with the ligand RMSDs exceeding 40 Å. Although we did not have any acceptable prediction, several of our submissions (Models 6, 4, and 1) correctly predicted more than 20% of the native contacts and had interface RMSDs around 10 Å. These results show that the deviations are largely due to flexibility of the receptor, but also emphasize that the prediction accuracy of our rigid body docking method is primarily determined by the accuracy of the component proteins in the interface region. In fact, all three acceptable models for this target were submitted by the Baker group, and thus were obtained by RosettaDock, which allows for considerable flexibility [27].

Table I. Classification of Protein Complexes on the Basis of Docking Difficulty

Type  Definition                                      Expected difficulty of docking
I     1400 Å² < ΔSASA < 2000 Å²; ΔGdes < −4 kcal/mol  Easy, unless key side chains are in wrong conformations
II    2000 Å² < ΔSASA < 3000 Å²; Cα RMSD < 2 Å        Moderate difficulty
III   1400 Å² < ΔSASA < 2000 Å²; ΔGdes > −4 kcal/mol  Unpredictable; can be very difficult
IV    ΔSASA < 1400 Å²; ΔGdes < −4 kcal/mol            Very difficult; hits are found, but are frequently lost in scoring and ranking
V     ΔSASA > 2500 Å²; Cα RMSD > 2 Å                  Rigid body methods always seem to fail for these complexes

ΔSASA, change in the solvent accessible surface area; ΔGdes, free energy of desolvation upon association; Cα RMSD, α-carbon root mean square deviation between free and bound protein.

Target 21: Orc1 and Sir1

The unbound structures for Orc1 (pdb code: 1m4z) and Sir1 (pdb code: 1z1a) were provided as the starting structures for docking. On the basis of the released structure (pdb code: 1zhi),28 the Orc1-Sir1 complex has a relatively small interface (ΔSASA = 1453 Å²), which is fairly polar (see Table II). As expected, the target proved to be relatively difficult, with only six groups submitting at least acceptable predictions. As already mentioned, PIPER
was not ready when Target 21 was released, and hence we used our server ClusPro,4 which did not yield any acceptable model. However, starting from the largest cluster of ClusPro-generated structures, refinement by SDU derived a structure, submitted as our Model 3, which turned out to be of acceptable accuracy [Fig. 1(C)]. Since SDU was not fully developed either, we performed the refinement only in the translational space. Nevertheless, it dramatically improved the percentage of native contacts from 4.9 to 24.4%, and the ligand RMSD decreased from 11.9 Å to 9.6 Å. Our posterior analysis shows that the high energy barriers due to clashes between overlapping atoms would have prevented an off-grid search from approaching the native conformation if side chains at the interface had been fixed. Thus, it was very important that SDU enabled the adjustment of structures by medium-range minimization. We also tested PIPER on this target as part of our post-examination. The center of the 8th largest cluster had 9.7 Å RMSD from the native complex, and thus would have provided a model with acceptable accuracy. However, the refinement of this cluster actually slightly increased the RMSD, to 9.9 Å. Thus, even our new programs would produce only an acceptable model for Target 21.

Figure 1. (A) Target 20: Docking the unbound structure of HemK (ligand) to the homology model of RF1 (receptor). Modeling of RF1. Color code: red, the bound structure of RF1; yellow, homology model of RF1 used for the docking. (B) Best docked conformation of the HemK-RF1 complex. Color code: red, the bound structure of RF1; yellow, homology model of RF1 used for the docking; cyan, the bound structure of HemK; green, predicted HemK structure. (C) Target 21: Docking of the unbound structure of Sir1 (ligand) to the unbound structure of Orc1 (receptor). Color code: red, the bound structure of Orc1; cyan, the bound structure of Sir1; green, the best docked conformation of Sir1. (D) Target 25: Docking of the bound structure of ARHGap10 (ligand) to the unbound structure of Arf1-GTP (receptor). Color code: red, the bound structure of Arf1-GTP; cyan, the bound structure of ARHGap10; green, the best docked conformation of ARHGap10. The figure also shows the structure of GTP (in stick representation) and the Mg²⁺ ion (grey sphere) bound to the Arf1 (receptor) structure. (E) Target 26: Docking of the bound structure of Pal (ligand) to the unbound structure of TolB (receptor). Color code: red, the bound structure of TolB; cyan, the bound structure of Pal; green, the best docked conformation of Pal. (F) Target 27: Docking of the unbound structure of Ubc9 to the unbound structure of Hip2 (receptor). Color code: red, the bound structure of Hip2; cyan, the bound structure of Ubc9; green, the best docked conformation of Ubc9 (predicted by our server ClusPro).

Targets 24 and 25: Arf1 and ARHGap10

Two targets of the same complex were provided during Round 9. The difference lies in the structures of ARHGap10. In Target 24, the ARHGap10 structure had to be built from the binding site of a PH domain (pdb code: 1btn). Since the 27 C-terminal residues of ARHGap10 do not align to any region of 1btn, these were not modeled. Post-examination of the native complex structure29 (pdb code: 2j59) shows that this C-terminal region folds into a long α-helix, which is an important part of the interface. Without the helix, the docking failed to deliver any acceptable prediction. Our best submission (Model 5) had an interface RMSD of 5.6 Å but no native contacts.

In Target 25 the bound structure of ARHGap10 was provided. The structure included a bound GTP molecule and a magnesium ion that were retained for the docking. Note that the use of these ligands required special additions to the parameter files of PIPER and SDU. With the explicit introduction of GTP and Mg²⁺, the predicted conformations from PIPER were concentrated more in the native-like region. Our submissions included one medium accuracy prediction (Model 3) and three acceptable accuracy ones. Model 3 was derived from the third largest cluster provided by PIPER, whose ligand RMSD of 4.4 Å was reduced to 2.8 Å by the SDU refinement. This model, shown in Figure 1(D), predicted 63% of the native contacts and had an interface RMSD of 1.3 Å. We note that the largest cluster from PIPER had 8.5 Å ligand RMSD, but this was increased to 10.4 Å in the SDU refinement, resulting in an incorrect model. Our three acceptable predictions (Models 4, 10, and 8, with ligand RMSD values of 6.8, 8.7, and 9.6 Å, respectively) were obtained from further large clusters derived by PIPER.
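The accuracy labels used throughout (acceptable, medium, high) combine the fraction of native contacts with the ligand and interface RMSDs. The sketch below uses the thresholds commonly cited for the CAPRI assessment; they are not stated in this paper and should be treated as an assumption, as should the function and argument names.

```python
def capri_quality(f_nat: float, l_rmsd: float, i_rmsd: float) -> str:
    """Classify a docked model from its fraction of native contacts (0 to 1),
    ligand RMSD and interface RMSD (both in angstroms).

    Thresholds follow the commonly cited CAPRI criteria (an assumption here).
    Pass float("inf") for an RMSD value that is not available.
    """
    if f_nat >= 0.5 and (l_rmsd <= 1.0 or i_rmsd <= 1.0):
        return "high"
    if f_nat >= 0.3 and (l_rmsd <= 5.0 or i_rmsd <= 2.0):
        return "medium"
    if f_nat >= 0.1 and (l_rmsd <= 10.0 or i_rmsd <= 4.0):
        return "acceptable"
    return "incorrect"


if __name__ == "__main__":
    # Target 25, Model 3: 63% native contacts, 2.8 A ligand RMSD, 1.3 A interface RMSD.
    print(capri_quality(0.63, 2.8, 1.3))
    # Target 24, Model 5: no native contacts, 5.6 A interface RMSD (ligand RMSD not reported).
    print(capri_quality(0.0, float("inf"), 5.6))
```

Under these assumed thresholds the Target 25 Model 3 values fall in the medium class and the Target 24 Model 5 values are incorrect, in line with the outcomes reported above.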

Table II. Targets and Performance in Rounds 6–11 of CAPRI

Column groups: Target (ID, Receptor, Ligand); Level of difficulty (Interface area, ΔGdes, Type, Comment); Performance (Group, Community).

ID   Receptor        Ligand          Interface area (Å²)  ΔGdes (kcal/mol)  Type      Comment                                                     Group (a)                 Community (a)
T20  Homology model  Unbound         3756.5               13.24             V         Very difficult; conformational change                       Incorrect loop            3*
T21  Unbound         Unbound         1453.3               3.14              III       Uncertain; close to region II                               1*                        4** and 7*
T24  Unbound         Homology model  751.2                0.51              IV        Difficult, requires correct prediction of helix             Removed C-terminal helix  4*
T25  Unbound         Bound           751.2                0.51              IV        Difficult but doable; small area, but somewhat hydrophobic  1** and 3*                1*** and 13** and 20*
T26  Unbound model   Bound           2562.1               6.82              II        Easy/uncertain (close to region III)                        2** and 1*                22** and 20*
T27  Unbound         Unbound         520.0                4.74              IV (A:D)  Very difficult                                              None                      None
                                     1190.3               1.77              IV (A:C)  Difficult/uncertain                                         1* (ClusPro)              2** and 55*

(a) Acceptable (*), medium (**), and high (***) accuracy predictions.

Target 26: TolB and Pal

The problem was to dock the unbound structures of TolB (pdb code: 1c5k) and Pal (pdb code: 1oap). The native complex of TolB and Pal30 (pdb code: 2hqs) reveals that this target is relatively easy (see Table II), and the docking was indeed straightforward. The largest cluster from PIPER had a ligand RMSD of 9.8 Å, which was reduced to 8.0 Å by SDU. This was submitted as Model 6, resulting in acceptable accuracy. Our best prediction came from the 23rd largest cluster of PIPER. The center of this cluster had 6.0 Å ligand RMSD, which was reduced to 3.0 Å by SDU. Although derived from a relatively small cluster, the refined structure (submitted as
Model 3) had the 3rd lowest energy among all models refined by SDU, and hence was selected for submission. This model, shown in Figure 1(E), predicted 54% of the native contacts and had an interface RMSD of 1.2 Å. We had another medium accuracy submission (Model 5) with 5.5 Å ligand RMSD. We note that from mutagenesis experiments31 it was known that the propeller domain of TolB is important for binding of Pal. However, since the rigid body docking by PIPER placed several large clusters of models in such orientations, including the largest cluster, we would have had at least acceptable accuracy models even without this information.
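The percentage of native contacts quoted for each model can be computed by comparing residue-residue contacts across the interface in the native and predicted complexes. The following sketch assumes that a contact is any receptor-ligand residue pair with at least one atom pair within 5 Å (a conventional cutoff, not a value given in this paper) and that structures are supplied as plain dictionaries mapping a residue identifier to a list of atom coordinates. All names are hypothetical.

```python
from itertools import product
from math import dist


def interface_contacts(receptor, ligand, cutoff=5.0):
    """Return the set of (receptor residue, ligand residue) pairs in contact.

    receptor, ligand: dict mapping a residue id to a list of (x, y, z) tuples.
    cutoff: contact distance in angstroms (an assumed convention).
    """
    pairs = set()
    for (r_id, r_atoms), (l_id, l_atoms) in product(receptor.items(), ligand.items()):
        if any(dist(a, b) <= cutoff for a in r_atoms for b in l_atoms):
            pairs.add((r_id, l_id))
    return pairs


def fraction_native_contacts(native_rec, native_lig, model_rec, model_lig, cutoff=5.0):
    """Fraction of native interface contacts that are reproduced in the model."""
    native = interface_contacts(native_rec, native_lig, cutoff)
    if not native:
        raise ValueError("native complex has no contacts at this cutoff")
    model = interface_contacts(model_rec, model_lig, cutoff)
    return len(native & model) / len(native)


if __name__ == "__main__":
    # Toy coordinates with one atom per residue; not taken from any real structure.
    rec = {"R1": [(0.0, 0.0, 0.0)], "R2": [(10.0, 0.0, 0.0)]}
    native_lig = {"L1": [(3.0, 0.0, 0.0)], "L2": [(12.0, 0.0, 0.0)]}
    model_lig = {"L1": [(3.0, 0.0, 0.0)], "L2": [(30.0, 0.0, 0.0)]}
    print(fraction_native_contacts(rec, native_lig, rec, model_lig))
```

The all-pairs loop is quadratic in the number of atoms, which is adequate for a handful of models; a spatial grid or k-d tree would be the usual choice for screening thousands of poses.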

Target 27: Hip2 and Ubc9

The challenge was to dock the unbound structures of

Hip2 (pdb code: 1yla) and Ubc9 (pdb code: 1a3s). The

released structure of the complex32 reveals two interfaces,

with the biological state unknown. As shown in Table II,

the A:D interface has very small surface area, suggesting

a very weak complex that is beyond the reach of current docking

methods.18,19 In fact, no submission represented

this orientation. The A:C interaction has a larger inter-

face and hence is easier to predict, but the complex is

still classified as Type IV with high level of difficulty. For

this target PIPER generated a large number of clusters

that did not considerably differ in size and collectively

covered large regions on the receptor surface. Thus, we

were unsure whether our usual strategy of refining the 30

largest clusters would result in any meaningful model. In

addition, studies of a ubiquitin-conjugating enzyme33

suggested that Lys14 of Hip2 should be close to the cata-

lytic Cys93 of Ubc9 to allow sumoylation of Hip2 by

Ubc9. In view of the uncertainty of docking we have gen-

erated a number of models that satisfied this constraint.

However, the experimentally solved complex structure

reveals two unexpected binding interfaces which do not

satisfy the sumoylation constraint, and hence we did not

produce any valid model for Target 27. In contrast, our

server ClusPro generated a model (Model 7) that turned

out to be of acceptable accuracy without employing any

biological information (see article in this issue). During a

follow-up analysis we have further studied the models

provided by PIPER and ClusPro, and made two observa-

tions. First, Model 7 from ClusPro was selected on the

basis of favorable electrostatics, which is not a very im-

portant selection criterion in PIPER. This issue will be

discussed further below. Second, refining this ClusPro-

generated model by SDU yields the lowest energy among

all models for this target.

CONCLUSIONS

We describe a three-step docking and refinement pro-

cedure that includes rigid body docking by PIPER, clus-

tering the generated structures and ranking the clusters

by size, and finally refinement of the 30 largest clusters

by the semiglobal stochastic optimization method called

SDU. For submission to CAPRI, we selected models from

the large clusters if they also had relatively low energies

after refinement. Results in Rounds 6–11 of CAPRI sug-

gest that this algorithm can generally produce at least

acceptable predictions in both unbound/unbound and

unbound/bound docking problems if the interface

regions in the structures of the component proteins do

not heavily differ from the corresponding structures in

the complex. Moreover, in most cases, this level of accu-

racy can be achieved without the use of biological infor-

mation.
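The selection strategy summarized above (rank clusters by size, refine the largest ones, then prefer low refined energies) can be sketched as follows. This is an illustration only: the `Cluster` record, the `refine` callable standing in for SDU, and the limit of ten submitted models are assumptions of the sketch, and the paper does not give an exact rule for combining cluster size and energy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Cluster:
    label: str            # hypothetical identifier of the cluster
    size: int             # number of docked poses in the cluster
    center_energy: float  # energy of the cluster center before refinement


def select_for_submission(clusters: List[Cluster],
                          refine: Callable[[Cluster], float],
                          n_refine: int = 30,
                          n_submit: int = 10) -> List[Tuple[str, float]]:
    """Refine the n_refine largest clusters and return the n_submit
    (label, refined energy) pairs with the lowest refined energy."""
    # Step 1: rank clusters by population and keep the largest ones.
    largest = sorted(clusters, key=lambda c: c.size, reverse=True)[:n_refine]
    # Step 2: refine each retained cluster; `refine` returns the refined energy.
    refined = [(refine(c), c) for c in largest]
    # Step 3: among the large clusters, prefer the lowest refined energies.
    refined.sort(key=lambda pair: pair[0])
    return [(c.label, energy) for energy, c in refined[:n_submit]]


if __name__ == "__main__":
    toy = [Cluster("a", 50, -10.0), Cluster("b", 40, -30.0),
           Cluster("c", 5, -99.0), Cluster("d", 30, -20.0)]
    # Placeholder refinement that lowers every energy by one unit.
    print(select_for_submission(toy, lambda c: c.center_energy - 1.0,
                                n_refine=3, n_submit=2))
```

In the toy run the small cluster "c" is never refined despite its low energy, which mirrors the design choice of trusting cluster size first and energy second.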

The above algorithm was also tested on the docking

benchmarks.7,34 Our strategy of selecting the largest

clusters for refinement and then selecting low energy

structures from these clusters appears to be reasonably

successful, but needs further testing and refinement.

Indeed, for Targets 21, 25, and 26 the medium and acceptable

accuracy predictions came from large clusters.

The failure to produce acceptable predictions for Targets

20 and 24 was due to large errors in our homology

modeling. The only current target for which the PIPER/

SDU docking algorithm was clearly inferior to a number

of others was Target 27. In fact, for this target the rigid

body docking using PIPER was less productive than

using the DOT program35 in the ClusPro server that

generated an acceptable accuracy model.

Our experience in these rounds of CAPRI led to two

changes in our strategy. First, due to the progress in

methodology, it may be advisable to put more trust into

the results of docking calculations. In fact, our attempts

to improve the prediction using information available in

the literature did not provide much help for any target,

and led to losing an acceptable model for Target 27.

However, such information may be crucial for docking

problems that are not within the realm of current metho-

dology; for example, if the binding involves large confor-

mational changes or leads to a weak complex. Second,

post-examination of Target 27 revealed why DOT was

superior to PIPER for this target. The DARS potential,

used in PIPER, was extracted from a large protein–pro-

tein interaction database,6 dominated by multisubunit

proteins and enzyme-inhibitor complexes. On the basis

of such structures, the current version of DARS works

best for complexes in which the interface is at least par-

tially hydrophobic. However, the Hip2-Ubc9 complex is

primarily stabilized by electrostatics. Although the poten-

tial in PIPER separately accounts for electrostatics, the

weight of the electrostatic term in the expression is rela-

tively small. In contrast, ClusPro uses a filtering stra-

tegy36 that retains 500 structures with favorable desolva-

tion, and another 1500 structures with the most favorable

electrostatics. This approach is motivated by the fact

that, for a particular complex, it is not possible to a pri-

ori assume the relative contributions of different types of

interactions. The electrostatics filter works well for Target

27, and yields a structure with acceptable accuracy. On

the basis of this experience, the use of PIPER will be

modified to reflect the ClusPro approach as follows: we

will run PIPER twice, with different weights of the elec-

trostatic term in the energy expression, and select some

clusters from both calculations.
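The ClusPro filtering step described above, which keeps 500 structures with favorable desolvation and another 1500 with the most favorable electrostatics, can be sketched as two independent rankings whose results are pooled. The dictionary keys, the convention that lower energy is more favorable, and the removal of poses selected by both rankings are assumptions of this sketch, not documented ClusPro behavior.

```python
def cluspro_style_filter(poses, n_desolv=500, n_elec=1500):
    """Keep the n_desolv poses with the lowest desolvation energy and the
    n_elec poses with the lowest electrostatic energy.

    poses: list of dicts with keys "id", "desolvation", "electrostatics"
           (hypothetical field names; lower values are assumed more favorable).
    A pose selected by both rankings is kept once.
    """
    by_desolv = sorted(poses, key=lambda p: p["desolvation"])[:n_desolv]
    by_elec = sorted(poses, key=lambda p: p["electrostatics"])[:n_elec]
    seen = set()
    kept = []
    for pose in by_desolv + by_elec:
        if pose["id"] not in seen:
            seen.add(pose["id"])
            kept.append(pose)
    return kept


if __name__ == "__main__":
    # Ten toy poses whose desolvation and electrostatic rankings are opposite.
    toy = [{"id": i, "desolvation": float(i), "electrostatics": float(-i)}
           for i in range(10)]
    print([p["id"] for p in cluspro_style_filter(toy, n_desolv=2, n_elec=3)])
```

Because neither ranking depends on a weight between the two energy terms, a pose that is stabilized mainly by electrostatics survives even when its desolvation score is poor, which is the behavior the planned two-run use of PIPER is meant to reproduce.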

ACKNOWLEDGMENTS

We are grateful to the Boston University Scientific Computing and Visualization Center for the CPU time used for this paper and for the opportunity to run the PIPER program on the Blue Gene/L supercomputer.

REFERENCES

1. Camacho CJ, Gatchell DW, Kimura SR, Vajda S. Scoring docked

conformations generated by rigid-body protein–protein docking.

Proteins 2000;40(3):525–537.

2. Murphy J, Gatchell D, Prasad J, Vajda S. Combination of scoring

functions improves discrimination in protein–protein docking. Pro-

teins 2003;53:840–854.

3. Camacho C, Vajda S. Protein docking along smooth association

pathways. Proc Natl Acad Sci USA 2001;98:10636–10641.

4. Comeau S, Gatchell D, Vajda S, Camacho C. ClusPro: an automated

docking and discrimination method for the prediction of protein

complexes. Bioinformatics 2004;20:45–50.

5. Kozakov D, Clodfelter K, Vajda S, Camacho C. Optimal clustering

for detecting near-native conformations in protein docking. Biophys

J 2005;89:867–875.

6. Kozakov D, Brenke R, Comeau SR, Vajda S. PIPER: an FFT-based protein docking program with pairwise potentials. Proteins 2006;65:392–406.

7. Chen R, Mintseris J, Janin J, Weng Z. A protein–protein docking

benchmark. Proteins 2003;52:88–91.

8. Brooks B, Bruccoleri R, Olafson B, States D, Swaminathan S, Kar-

plus M. CHARMM: a program for macromolecular energy, minimi-

zation, and dynamics calculations. J Comput Chem 1983;4:187–217.

9. Li L, Cheng R, Weng Z. RDOCK: refinement of rigid-body protein

docking predictions. Proteins 2003;53:693–707.

10. Moont G, Gabb H, Sternberg M. Use of pair potentials across pro-

tein interfaces in screening predicted docked complexes. Proteins

1999;35:364–373.

11. Lu H, Lu L, Skolnick J. Development of unified statistical potentials

describing protein–protein interactions. Biophys J 2003;84:1895–

1901.

12. Liu S, Zhang C, Zhou H, Zhou Y. A physical reference state unifies

the structure-derived potential of mean force for protein folding

and binding. Proteins 2004;56:93–101.

13. Mintseris J, Pierce B, Wiehe K, Anderson R, Chen R, Weng Z. Inte-

grating statistical pair potentials into protein complex prediction.

Proteins, to be published.

14. Paschalidis IC, Shen Y, Vakili P, Vajda S. A semi-definite program-

ming-based underestimation method for stochastic global optimiza-

tion in protein docking. IEEE Trans Automatic Control 2007;52:

664–676.

15. Paschalidis IC, Shen Y, Vakili P, Vajda S. Protein–protein docking

with reduced potentials by exploiting multi-dimensional energy

funnels. In Proceedings of the 28th IEEE Engineering in Medicine

and Biology Society (EMBS) Conference, New York City, New York,

2006, pp 5330–5333.

16. Lindemann S, Yershova A, LaValle S. Incremental grid sampling

strategies in robotics. In Proceedings of the Sixth International

Workshop on the Algorithmic Foundations of Robotics, July 11–13,

2004; Utrecht/Zeist, Netherlands.

17. Zhang C, Vasmatzis G, Cornette J. Determination of atomic desol-

vation energies from the structures of crystallized proteins. J Mol

Biol 1997;267:707–726.

18. Vajda S, Camacho C. Protein–protein docking: is the glass half full

or half empty? Trends Biotechnol 2004;22:110–116.

19. Vajda S. Classification of protein complexes based on docking diffi-

culty. Proteins 2005;60:176–180.

20. LoConte L, Chothia C, Janin J. The atomic structure of protein–

protein recognition sites. J Mol Biol 1999;285:2177–2198.

21. Chakrabarti P, Janin J. Dissecting protein–protein recognition sites.

Proteins 2002;47:334–343.

22. Zhang C, Vasmatzis G, Cornette J, DeLisi C. Determination of

atomic desolvation energies from the structures of crystallized pro-

teins. J Mol Biol 1997;267:707–726.

23. Miyazawa S, Jernigan R. Estimation of effective interresidue contact

energies from protein crystal structures: quasi-chemical approxima-

tion. Macromolecules 1985;18:534–552.

24. Vestergaard B, Van L, Andersen G, Nyborg J, Buckingham R,

Kjeldgaard M. Bacterial polypeptide release factor RF2 is structur-

ally distinct from eukaryotic eRF1. Mol Cell 2001;8:1375–1382.

25. Sanchez R, Sali A. Evaluation of comparative protein structure

modeling by MODELLER-3. Proteins 1997;1:50–58.

26. Graille M, Heurgue-Hamard V, Champ S, Mora L, Scrima N, Ulryck

N, van Tilbeurgh H, Buckingham R. Molecular basis for bacterial class

I release factor methylation by PrmC. Mol Cell 2005;20:917–927.

27. Gray J, Moughon S, Wang C, Schueler-Furman O, Kuhlman B,

Rohl C, Baker D. Protein–protein docking with simultaneous opti-

mization of rigid-body displacement and side-chain conformations.

J Mol Biol 2003;331:281–299.

28. Hou Z, Bernstein D, Fox C, Keck J. Structural basis of the Sir1-

origin recognition complex interaction in transcriptional silencing.

Proc Natl Acad Sci USA 2005;102:8489–8494.

29. Menetrey J, Perderiset M, Cicolari J, Dubois T, Elkhatib N, El

Khadali F, Franco M, Chavrier P, Houdusse A. Structural basis for

ARF1-mediated recruitment of ARHGAP21 to golgi membranes.

EMBO J 2007;26:1953–1962.

30. Bonsor D, Grishkovskaya I, Dodson E, Kleanthous C. Molecular

mimicry enables competitive recruitment by a natively disordered

protein. J Am Chem Soc 2007;129:4800–4807.

31. Ray M, Germon P, Vianney A, Portalier R, Lazzaroni J. Identifica-

tion by genetic suppression of Escherichia coli TolB residues impor-

tant for TolB-Pal interaction. J Bacteriol 2000;182:821–824.

32. Walker J, Avvakumov G, Xue S, Newman E, Mackenzie F, Weigelt J,

Sundstrom M, Arrowsmith C. Novel and unexpected complex

between the sumo-1-conjugating enzyme UBC9 and the ubiquitin-

conjugating enzyme E2-25 kDa, to be published.

33. Pichler A, Knipscheer P, Oberhofer E, van Dijk W, Korner R, Olsen J, Jentsch S, Melchior F, Sixma T. SUMO modifications of the ubiquitin-conjugating enzyme E2-25K. Nat Struct Mol Biol 2005;12:264–269.

34. Mintseris J, Wiehe K, Pierce B, Anderson R, Chen R, Janin J, Weng

Z. Protein–Protein docking benchmark 2.0: an update. Proteins

2005;60:214–216.

35. Mandell J, Roberts V, Pique M, Kotlovyi V, Mitchell J, Nelson E,

Tsigelny I, Ten Eyck L. Protein docking using continuum electro-

statics and geometric fit. Protein Eng 2001;14:105–113.

36. Camacho C, Gatchell D, Kimura S, Vajda S. Scoring docked confor-

mations generated by rigid-body protein–protein docking. Proteins

2000;40:525–537.
