
Protein structure modeling for CASP10 by multiple layers of global optimization


Proteins: Structure, Function, and Bioinformatics

Keehyoung Joo,1,2 Juyong Lee,1 Sangjin Sim,1 Sun Young Lee,1 Kiho Lee,1 Seungryong Heo,1 In-Ho Lee,1,3 Sung Jong Lee,1,4 and Jooyoung Lee1,5*

1 Center for In Silico Protein Science, Korea Institute for Advanced Study, Dongdaemun-gu, Seoul 130-722, Korea

2 Center for Advanced Computation, Korea Institute for Advanced Study, Dongdaemun-gu, Seoul 130-722, Korea

3 Korea Research Institute of Standards and Science (KRISS), Yuseong, Daejeon 305-600, Korea

4 Department of Physics, University of Suwon, Hwaseong-Si, Gyeonggi-do 445-743, Korea

5 School of Computational Sciences, Korea Institute for Advanced Study, Dongdaemun-gu, Seoul 130-722, Korea

ABSTRACT

In the template-based modeling (TBM) category of the CASP10 experiment, we introduced a new protocol called protein modeling system (PMS) to generate protein structures that are accurate in side-chains as well as in backbone trace. In the new protocol, a global optimization algorithm called conformational space annealing (CSA) is applied to the three layers of the TBM procedure: multiple sequence-structure alignment, 3D chain building, and side-chain re-modeling. For 3D chain building, we developed a new energy function that includes new distance restraint terms of Lorentzian type (derived from multiple templates) and a combination of physical energy terms such as the dynamic fragment assembly (DFA) energy, the DFIRE statistical potential energy, and a hydrogen bonding term, among others. These physical energy terms are expected to guide the structure modeling, especially for loop regions where no template structures are available. In addition, we developed a new quality assessment method based on the random forest machine learning algorithm to screen templates, multiple alignments, and final models. For TBM targets of CASP10, we find that, owing to the combination of three stages of CSA global optimization and quality assessment, the modeling accuracy of PMS improves at each additional stage of the protocol. It is especially noteworthy that the side-chains of the final PMS models are far more accurate than those of the models in the intermediate steps.

Proteins 2014; 82(Suppl 2):188–195. © 2013 Wiley Periodicals, Inc.

Key words: structure prediction; CASP; homology modeling; template-based modeling; side-chain modeling; high-accuracy modeling; global optimization; energy function.

INTRODUCTION

For modeling of TBM targets in CASP10, there is a wide consensus on the importance of predicting protein models that are accurate in all-atom detail, beyond the modeling of backbone structures alone. That is, accurate modeling of protein side-chains is as important as accurate backbone modeling. More accurate side-chain modeling can provide valuable information and important clues for elucidating sequence-structure-function relationships in modern structural biology. For CASP10 predictions, we have developed a new prediction protocol that employs global optimization at three levels: multiple sequence-structure alignment (MSA), protein 3D chain building, and side-chain re-modeling [1,2].

Each of the three stages of modeling has its own energy (or score) function to optimize. For generating the MSA, the score function to be optimized represents the net consistency of all possible residue-residue matches generated by pairwise alignments [3]. For 3D chain building, two major changes are introduced in the energy function. One is a new Lorentzian-type energy term for spatial restraints (derived from templates), which replaces the Gaussian-type or spline functions used in MODELLER. The other is the energy function for loop regions for which no restraints are available from templates, where we combined physical energy terms including the dynamic fragment assembly (DFA) energy [4] together with the DFIRE [5] statistical potential energy, a hydrogen bonding term [6], and GOAP terms [7]. For side-chain re-modeling, we used an in-house energy function in the discrete rotamer space. To optimize these energy functions, we utilized conformational space annealing (CSA), a powerful and efficient method for global optimization and conformational search [8]. In addition, we developed new quality assessment methods based on the random forest machine learning algorithm to rank protein models at the respective stages of fold recognition, selection of MSAs, and selection of the final models.

Grant sponsor: Korea government (MSIP); Grant numbers: 2008-0061987, 2009-0090085; Grant sponsor: National Institute of Supercomputing and Networking/Korea Institute of Science and Technology Information; Grant numbers: KSC-2012-C3-01, KSC-2012-C3-02.

*Correspondence to: Jooyoung Lee, Center for In Silico Protein Science, Korea Institute for Advanced Study, Dongdaemun-gu, Seoul 130-722, Korea. E-mail: [email protected]

Received 3 April 2013; Revised 9 July 2013; Accepted 9 August 2013

Published online 22 August 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/prot.24397

MATERIALS AND METHODS

For server prediction of CASP10 targets, we developed an automated procedure called the protein modeling system (PMS) protocol, which combines the protocols employed in previous CASP experiments [1] with the new features explained in this article. Figure 1 shows the overall flow of the PMS protocol, describing how the three global optimization procedures are combined with other utilities such as template selection and quality assessment at each stage.

Fold recognition

The first step of template-based modeling (TBM) is to search the PDB for structures homologous to a target sequence. Since CASP7, we have used for fold recognition a sequence-structure alignment method called FOLDFINDER, an in-house profile–profile alignment method that utilizes predicted secondary structures [9]. We built a template database of 27,333 chains obtained from the PISCES culling server at the 95% sequence identity level, with chain lengths in the range of 50–1000 residues, including both X-ray and NMR structures.

For a given target sequence, a total of 50 top-scoring templates are selected. For each template and its alignment to the target sequence, 60 three-dimensional models are generated by MODELLER [10] and then perturbed. We then minimize an energy function for these models and select the 10 lowest-energy models. The quality of each selected model is estimated by the quality assessment method QA1 (see “Quality assessment of protein structures,” below), and the quality of the template is estimated by the average QA1 score of these models.
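The template-scoring rule described above (keep the 10 lowest-energy models out of 60, then average their QA1 scores) can be sketched as follows. This is only an illustration under stated assumptions: the function name `template_quality`, the `(energy, qa_score)` tuple layout, and the toy numbers are hypothetical and are not taken from the PMS implementation.

```python
from statistics import mean


def template_quality(models, n_keep=10):
    """Average QA score of the n_keep lowest-energy models.

    models: list of (energy, qa_score) pairs, one per 3D model built
            from a single template (hypothetical data layout).
    """
    # Sort by energy (ascending) and keep the n_keep lowest-energy models.
    lowest = sorted(models, key=lambda m: m[0])[:n_keep]
    # The template quality is the mean QA score of the kept models.
    return mean(qa for _, qa in lowest)


# Toy data: 60 made-up models; energies and QA scores are illustrative only.
toy_models = [(float(e), 1.0 - e / 100.0) for e in range(60)]
toy_quality = template_quality(toy_models)
```

Averaging over several low-energy models, rather than scoring a single model, makes the template estimate less sensitive to one unusually good or bad model.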

We perform structural clustering of the 50 template models using a recently developed community detection method [11] based on all-to-all pairwise TM-scores. This clustering step ensures that only structurally similar templates are combined in the MSA and chain building, thereby reducing conflicts arising from structurally inconsistent templates. Subsets of the templates belonging to the same cluster (community) are then selected to generate template lists via template combinations. Typically, we generate one to eight lists, each of which can include up to 15 templates (Joo et al., in preparation). For difficult targets, we considered up to 200 top-scoring templates ranked by FOLDFINDER, and the number of template lists can be as large as 20.

Multiple sequence-structure alignment

We performed multiple sequence-structure alignment by using the MSACSA method [3]. In MSACSA, a consistency-based score function similar to the COFFEE score [12] is used. For CASP10, the original score function [1] of MSA is slightly modified as follows. With N sequences to align (one target sequence and N − 1 template sequences), all pairwise alignments are carried out to construct a restraint library of matched residue pairs. For the pairwise alignment between the target sequence and a template sequence, the sequence-based pairwise alignment from FOLDFINDER is used, while for alignments among the templates, the structure-structure alignment by TM-align [13] is used.

Figure 1. The flow chart of the PMS protocol for CASP10. The CSA method is used for MSA, 3D chain building, and side-chain re-modeling. Three QA methods (QA1, QA2, and QA3) are used for selection of templates, MSAs, and final models.

We assign a weight $w_{ij}^{k} = f_{ij}^{k} \times \mathrm{SeqId}$ to each aligned residue pair between the ith and jth sequences at the kth column, where SeqId is the sequence identity between i and j, and $f_{ij}^{k}$ is either the profile–profile match score from FOLDFINDER for target–template alignments or $(1 - d_{ij}^{k}/8)$ for template–template alignments, $d_{ij}^{k}$ being the Cα–Cα distance between matched residues at the kth column (generated by TM-align [13]). Then, $f_{ij}^{k}$ is linearly rescaled to the range [0.01, 1]. The library is now a collection of aligned residue pairs with specific weights $w_{ij}^{k}$. Denoting the sum of all weights by $\sum w$, the score function of a multiple alignment A can be written as [3]

$$\mathrm{Score}(A) = \sum_{i,j=1;\, i<j}^{N} \sum_{k=1}^{M} w_{ij}^{k}\, \delta_{ij}^{k}(A) \Big/ \sum w, \qquad (1)$$

where the index k runs over the M aligned columns of A, and $\delta_{ij}^{k}(A) = 1$ or 0 depending on whether the aligned residue pair at the kth column is in the library or not. A good score is thus associated with an alignment that is consistent with the restraint library: the more consistent an alignment is with the pairwise restraint library, the higher its score.
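A minimal sketch of the consistency score of Eq. (1) is given below, assuming the sum of matched weights is divided by the total library weight. The data layout (a dictionary keyed by sequence and residue indices, and an alignment stored as a list of columns), the function names, and the toy values are all hypothetical choices made for illustration; they are not the MSACSA implementation. The rescaling helper assumes the linear map is taken over the minimum and maximum of the given raw scores, which the text does not specify.

```python
from itertools import combinations


def rescale(values, lo=0.01, hi=1.0):
    """Linearly rescale raw match scores f to the range [lo, hi].

    Assumption: the map is anchored at the min and max of `values`.
    """
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [hi for _ in values]
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]


def pair_weight(f_rescaled, seq_id):
    """Library weight w = f * SeqId for one aligned residue pair."""
    return f_rescaled * seq_id


def consistency_score(alignment, library):
    """Eq. (1): library weight reproduced by the alignment / total weight.

    alignment: list of columns; each column holds one residue index per
               sequence, or None for a gap.
    library:   dict {(i, j, res_i, res_j): weight} with i < j.
    """
    total = sum(library.values())
    found = 0.0
    for column in alignment:
        for i, j in combinations(range(len(column)), 2):
            if column[i] is None or column[j] is None:
                continue  # a gap cannot match a library pair
            # delta = 1 only if this residue pair is in the library
            found += library.get((i, j, column[i], column[j]), 0.0)
    return found / total


# Toy example with three sequences (weights are made up).
toy_library = {
    (0, 1, 0, 0): 1.0,
    (0, 1, 1, 1): 0.5,
    (0, 2, 0, 0): 0.5,
    (1, 2, 1, 0): 1.0,
}
toy_alignment = [(0, 0, 0), (1, 1, None)]
toy_score = consistency_score(toy_alignment, toy_library)
```

With this normalization, an alignment that reproduces every pair in the library scores 1, so scores of alignments built from the same library are directly comparable.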

With this score function, a straightforward and rigorous global optimization is performed, starting from multiple alignments initially generated in a random fashion. This should be contrasted with the heuristic progressive alignment methods popular in the literature. In a progressive alignment procedure, successive pairwise alignments are carried out to obtain a multiple alignment. Such a procedure is therefore rather fast, but alignment errors created in its early stages cannot be fixed later. It has been demonstrated that alignments generated by MSACSA are more consistent with the gold standard than those generated by progressive methods [3]. Another advantage of MSACSA is that it provides not only the globally optimal alignment but also many distinct low-lying suboptimal alternative alignments. Details of the algorithm can be found elsewhere [3].

Typically, 100 alignments are generated for each template list, and the 10 top-scoring MSAs are chosen according to Eq. (1). Therefore, with $n_{\mathrm{list}}$ template lists, 3D models are generated for a total of $10 \times n_{\mathrm{list}}$ alignments. After the MSACSA runs are finished, we construct models for the top 10 MSAs (in terms of the MSACSA score function) and then perform quality assessment again to select the best MSAs. The quality of an MSA is assessed by QA2 (see “Quality assessment of protein structures,” below). Typically, up to 10 MSAs are selected and passed on to the next step.

Energy function and protein 3D modeling

For protein 3D modeling using the MSAs selected above, we developed a new energy function, which is defined as

$$E = E_{\mathrm{stereo\text{-}chemistry}} + E_{\mathrm{restraint}} + E_{\mathrm{physical}}, \qquad (2)$$

where $E_{\mathrm{stereo\text{-}chemistry}}$ denotes stereochemical terms that are borrowed from the MODELLER energy function [10]. The second term, $E_{\mathrm{restraint}}$, includes the distance restraint terms (Cα–Cα, N–O, and others involving side-chain atoms) used in MODELLER. The difference is that MODELLER uses either harmonic or spline functions, whereas we used Lorentzian-type energy functions. For each pair of atoms with $N_R$ distance restraints, the Lorentzian energy term is defined as

$$E_{\mathrm{Lorentzian}}(r) = \sum_{i}^{N_R} \frac{1}{\sigma_i}\, \frac{(r - r_{i0})^2}{(r - r_{i0})^2 + \sigma_i^2}, \qquad (3)$$

where $r_{i0}$ denotes a restraint distance copied from a specific template, and $\sigma_i$ is an estimated uncertainty of the distance $r_{i0}$, which controls the strength of an individual distance restraint. Finally, $r$ is the distance between the atoms in the current model structure. To obtain $\sigma_i$, we employed a random forest algorithm whose input features are based on the sequence-template alignment and environmental features, including the profile similarity, gap features, secondary structure consensus, and solvent accessibility consensus (Lee et al., in preparation).
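The Lorentzian restraint term of Eq. (3) can be evaluated directly, as in the sketch below. The function name and the `(r0, sigma)` tuple layout are hypothetical, and the restraint values are made up; in the protocol the uncertainties come from a random forest predictor, which is not reproduced here.

```python
def lorentzian_energy(r, restraints):
    """Eq. (3): Lorentzian restraint energy for one atom pair.

    r:          current distance between the two atoms in the model.
    restraints: list of (r0, sigma) pairs, one per template-derived
                restraint (hypothetical data layout).
    """
    total = 0.0
    for r0, sigma in restraints:
        d2 = (r - r0) ** 2
        # Each term is 0 at r == r0 and approaches 1/sigma far from r0.
        total += (1.0 / sigma) * d2 / (d2 + sigma ** 2)
    return total


# Two made-up restraints on the same atom pair from different templates.
toy_restraints = [(5.0, 1.0), (9.0, 2.0)]
energy_at_first_minimum = lorentzian_energy(5.0, toy_restraints)
```

Each term in the sum is bounded above by $1/\sigma_i$, so a restraint that disagrees strongly with the current model contributes at most a fixed penalty, and a restraint with a small uncertainty carries more weight than one with a large uncertainty.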

Finally, in Eq. (2), $E_{\mathrm{physical}}$ includes the dynamic fragment assembly (DFA) energy (originally developed for ab initio protein structure modeling [4]) together with the DFIRE statistical potential energy, a hydrogen bonding term, and GOAP terms [7]. Weights for the energy terms were optimized using a subset of CASP9 targets. The energy function of Eq. (2) is optimized by employing CSA, considering all possible structural variations. Typically, 100 three-dimensional models are generated for each MSA, and the average QA3 score (see below) of the 100 models is used to select the top MSA. The lowest-energy model of the top MSA is submitted as model one.

Quality assessment of protein structures

The PMS protocol uses three quality assessment (QA) steps: (1) assessment of templates, (2) assessment of MSAs, and (3) assessment of the final lowest-energy structures. To assign rankings in the first two steps, 3D model structures are assessed. Various energy terms of these models and other structural features were used as input features for training and testing a machine for each step. To capture hidden nonlinear relationships between the features of model structures and their TM-scores to the native structure, we used the random forest method [14]. The input features are listed in Table I. To minimize the computing time spent on estimating $\sigma_i$ in Eq. (3), $E_{\mathrm{restraint}}$ for QA1 and QA2 is computed using the MODELLER term, while the Lorentzian term of Eq. (3) is used only for QA3. All input features and TM-scores are normalized to the [0, 1] scale, separately for each target.
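The per-target normalization mentioned above could look like the following sketch. The text states only that values are scaled to [0, 1] separately for each target; the use of min-max scaling, the function name, and the target labels are assumptions made for illustration.

```python
def normalize_per_target(values_by_target):
    """Scale each target's values to [0, 1] independently.

    Assumption: min-max scaling; the source does not name the method.
    values_by_target: dict {target_id: [raw values of one feature]}.
    """
    out = {}
    for target, values in values_by_target.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        # A constant feature carries no ranking information; map it to 0.
        out[target] = [0.0 if span == 0 else (v - lo) / span for v in values]
    return out


# Hypothetical raw energies for the models of two targets.
toy_normalized = normalize_per_target(
    {"target_a": [2.0, 4.0, 6.0], "target_b": [10.0, 10.0]}
)
```

Scaling within each target removes target-dependent offsets (for example, energies that grow with chain length), so that models of different targets can be pooled when training one predictor.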

Assessment of templates (QA1)

For a selected template with its alignment to the target sequence, 60 model structures are generated by minimizing, with Eq. (2), 3D models that were generated by MODELLER and then perturbed. The average QA1 score of the 10 lowest-energy models is used. To train a random forest machine, we used 1178 non-redundant sequences selected and then screened from PISCES [18], excluding highly similar sequences. The top 50 templates are generated by FOLDFINDER.

Assessment of MSAs (QA2)

For a selected MSA, 3D model structures are generated following the same procedure as in QA1. To train a random forest machine for this step, we used 27 CASP9 single-domain targets shorter than 200 amino acids. For each target, the top templates are filtered using QA1. For each MSA, we generated 10 models, and the average QA2 score of the 10 models is used.

Assessment of models (QA3)

For each MSA selected by QA2, PMS generates 100 different model structures. To train the third random forest machine, model structures of the 27 benchmark proteins mentioned above were generated for the MSAs selected by QA2. The energy values and TM-scores of the models were extracted and used as the training data. The average QA3 score of the 100 models is used to identify the top MSA, and the lowest-energy model [by Eq. (2)] of the top MSA is selected as model one. The other four models are selected from the other, suboptimal MSAs.
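The two-level selection described above (rank MSAs by their average QA3 score, then take the lowest-energy model of the top MSA) is sketched below. The record layout, identifiers, and numbers are hypothetical, and the sketch assumes that a higher QA3 score means a better predicted model, which is consistent with QA being trained against TM-scores but is not stated explicitly for the ranking.

```python
from statistics import mean


def pick_model_one(models_by_msa):
    """Return (top_msa_id, model_id) following the QA3 selection rule.

    models_by_msa: dict {msa_id: [(energy, qa3_score, model_id), ...]}
                   (hypothetical data layout).
    """
    # Top MSA: highest average QA3 score over all of its models.
    top_msa = max(
        models_by_msa,
        key=lambda m: mean(q for _, q, _ in models_by_msa[m]),
    )
    # Model one: lowest-energy model generated from the top MSA.
    best = min(models_by_msa[top_msa], key=lambda rec: rec[0])
    return top_msa, best[2]


# Made-up example with two MSAs and two models each.
toy_models_by_msa = {
    "msa_a": [(-10.0, 0.6, "a1"), (-12.0, 0.7, "a2")],
    "msa_b": [(-20.0, 0.5, "b1"), (-5.0, 0.6, "b2")],
}
toy_choice = pick_model_one(toy_models_by_msa)
```

In this toy case the globally lowest-energy model ("b1") is not chosen, because its MSA has the lower average QA3 score; the alignment is selected first, and the energy is compared only among models built from that alignment.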

Side-chain modeling

Side-chain re-modeling is carried out for the final five models to be submitted to CASP10. We used the backbone-dependent rotamer library of SCWRL 3.0 (Ref. 19) along with a target-specific rotamer library constructed during the final CSA simulation for 3D modeling. The side-chain modeling used here follows the identical procedure employed in earlier CASPs [1].

Human predictions

Here, we discuss the human prediction methods, LEE and LEEcon. LEE is the same as PMS except that additional templates from FOLDFINDER are considered. LEEcon is a consensus method that uses, as templates, SERVER models selected from the largest cluster of all SERVER models released by CASP10. For structural clustering, we used a recently proposed community detection method [11]. The performance difference between LEE and LEEcon partly reflects the difference in fold recognition performance between our procedure and the consensus of CASP10 server predictions (see the “Performance of human predictions” section below).

RESULTS AND DISCUSSION

In this section, we analyze the performance of PMS in terms of the backbone accuracy and the side-chain accuracy of the models with respect to the native structures. In CASP10, there are 112 domains that are officially classified as TBM targets. We considered 104 domains, excluding eight. The native structure of T0739 is not available for two domains (D3 and D4), and for the remaining six cases, our template selections were not consistent with the domain definitions provided by the assessors (T0705-D1/D2, T0717-D1/D2, T0744-D1, and T0755-D1).

Table I. List of the Input Features for Quality Assessment

Index  Feature                        Description
F1     E_total                        Total energy
F2     E_vdw                          van der Waals energy
F3     E_stereo                       Stereochemistry energy
F4     E_restraint                    Distance restraint energy from templates
F5     E_geom                         Sum of stereochemistry
F6     E_DFA                          Total DFA energy
F7     E_DFA,dist                     DFA distance energy
F8     E_DFA,angle                    DFA torsion angle energy
F9     E_DFA,neigh                    DFA neighbor energy
F10    E_DFA,hydrophobic              DFA neighbor energy with hydrophobic center residues
F11    E_DFA,hydrophilic              DFA neighbor energy with hydrophilic center residues
F12    E_DFA,aromatic                 DFA neighbor energy with aromatic center residues
F13    E_DFA,beta                     DFA beta-sheet energy
F14    E_DFIRE                        DFIRE energy
F15    E_DFIRE,np                     DFIRE energies between non-polar atom pairs
F16    E_DFIRE,pol                    DFIRE energies between atom pairs in which either atom is polar
F17    E_hbond                        Total hydrogen bond energy
F18    E_hb,bb                        Contribution of backbone–backbone hydrogen bond energy
F19    E_hb,bs                        Contribution of backbone–sidechain hydrogen bond energy
F20    E_hb,ss                        Contribution of sidechain–sidechain hydrogen bond energy
F21    E_GOAP                         GOAP statistical energy
F22    Σ_i p_i(s_i) δ(s_i, t_i)       Sum of PSIPRED probability
F23    corr(SS_pred, SS_model)        Correlation between solvent accessibilities of model and prediction
F24    N_helix/N_total                Fraction of predicted helical residues in a model
F25    N_sheet/N_total                Fraction of predicted beta-sheet residues in a model
F26    N_coil/N_total                 Fraction of predicted coil residues in a model

For E_restraint, the MODELLER energy term is used for QA1 and QA2, and the Lorentzian energy for QA3. The secondary structure of the target sequence is predicted by using PSIPRED [15], while the solvent accessibility of the target sequence is obtained by SANN [16]. In F22, p_i(s_i) is the probability of the secondary structure state s_i ∈ {C, H, E} (from PSIPRED) at the ith residue position, and the secondary structure state t_i for the model is calculated by using DSSP [17]. Here, δ(s_i, t_i) = 1 if s_i = t_i and zero otherwise.
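Feature F22 of Table I, the sum of PSIPRED probabilities over residues whose predicted state agrees with the state observed in the model, can be written as a short function. The sketch assumes that both state strings are already aligned residue by residue and use the same three-letter alphabet (C, H, E); how the DSSP states are reduced to three classes is not described in the text and is left outside this example. The function name and toy values are hypothetical.

```python
def psipred_agreement(probabilities, predicted, observed):
    """F22: sum of p_i(s_i) over residues where s_i equals t_i.

    probabilities: PSIPRED probability p_i(s_i) of the predicted state.
    predicted:     predicted states s_i, one of 'C', 'H', 'E' per residue.
    observed:      model states t_i in the same three-state alphabet.
    """
    return sum(
        p for p, s, t in zip(probabilities, predicted, observed) if s == t
    )


# Made-up four-residue example: the third residue disagrees.
toy_f22 = psipred_agreement([0.9, 0.8, 0.6, 0.7], "HHEC", "HHCC")
```

Because each matching residue contributes its prediction probability rather than a count of one, confident predictions that the model reproduces add more to the feature than uncertain ones.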

Figure 2 compares, in terms of TM-score, the quality of the top templates (identified by the CASP10 assessors) with that of the best templates (BTs) used for PMS modeling. The BT used for PMS modeling is the template with the highest TM-score (by structural alignment using TM-align [13]) among the typically 50 templates identified by FOLDFINDER of PMS. FOLDFINDER successfully included the top template among the 50 templates when its TM-score was better than around 60%, while in the other cases (TM-score < 60%) the top template was missed. In short, FOLDFINDER performed quite well for easy targets but not for hard targets.

With the multiple templates identified by FOLDFINDER, the PMS protocol carried out the next steps of MSA and 3D modeling. Table II shows that PMS models improve gradually on average in terms of TM-score as each additional step is processed. For comparison, we used MODELLER to build the BT models and MSA models. Here, the BT model is the model with the best TM-score among the MODELLER-generated models, each built from a single template out of the multiple templates used to generate the model-one PMS model. The average TM-score and RMSD of the BT models are 67.28% and 7.83 Å, with a coverage of 68.23%. For MSA models, the average TM-score and RMSD are 67.37% and 6.52 Å, with a coverage of 73.30%. Thus, on average, MSA models are better than BT models. In general, it is not easy to identify the BT out of a given set of multiple templates without knowing the 3D structure of the target. It should be noted that MSA is an important step for extracting the information of the BT from among multiple templates. For the final PMS models, the average TM-score and RMSD are 69.32% and 5.84 Å, with a coverage of 75.51%, which is significantly better than the MSA models.

Figure 2. Comparison of TM-scores between BTs by PMS and top templates by the assessors. BTs by PMS are identified among the 50 FOLDFINDER-generated templates by using the native structure and the TM-align program. Template selection of PMS worked well for targets in the region of TM-score > 60%, while in the other cases the best templates were missed. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Table II. Comparison of Backbone Accuracies Between BT, MSA, and PMS Models

Models      TM-score (%)  RMSD (Å)  Coverage (%)
BT model    67.28         7.83      68.23
MSA model   67.37         6.52      73.30
PMS model   69.32         5.84      75.51

The BT model is the best TM-score model among multiple templates, which were used for PMS modeling. MSA models are generated by simple application of MODELLER. PMS models are the final models submitted as model ones. All values are averages over 104 TBM domains. Coverage is calculated by TM-score, and it represents the percentage of aligned residues between a 3D model and its native structure.

Figure 3 shows the TM-score comparison between BT models and PMS models for the 104 TBM targets. In Figure 3, we find that PMS consistently generates better-quality models in the hard TBM region (TM-score less than about 60%). For hard TBM targets, alignments contain many loop regions, where the physical energy terms play an important role in 3D modeling. Figure 4 shows the examples of T0664-D1 and T0699-D1, where the model (thick line) is superimposed on the native structure (thin line). For T0664-D1, the TM-score improves gradually as each additional modeling step proceeds. In the case of T0699-D1, the TM-score of the MSA model is worse than that of the BT model, but the final PMS model using the same MSA is much improved. On average, the PMS protocol can generate better backbone structures than MODELLER using the same templates and alignments.

Figure 3. TM-score comparison between BT models and PMS models for the 104 TBM targets. On average, PMS models are better than BT models. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Figure 4. Structure superimpositions of models (thick lines) on native structures (thin lines) for (a) T0664-D1 and (b) T0699-D1. The final PMS models are significantly improved compared to the BT models. Coloring from blue to red indicates the backbone trace from the N-terminus to the C-terminus. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Table III shows the side-chain accuracies of the models generated by the PMS protocol at each step. Side-chains of PMS models are far more accurate than those of BT models and MSA models in terms of χ1 and χ1+2 angle accuracies. The average accuracies of the χ1 and χ1+2 angles of PMS models are 57.86% and 40.30%, respectively, averaged over the 104 TBM domains. For the χ1 angle, the average accuracy increased by about 27% relative to the BT models (from 45.54% to 57.86%). Figure 5 shows the scatter plot of side-chain accuracies of the BT models against those of the final PMS models. Almost all models are improved in χ1 angle accuracy. Overall, the optimization procedure in the 3D modeling and side-chain re-modeling of the PMS protocol improved the atomic details of the protein models in addition to the backbone. Figure 6 shows an example of successful side-chain modeling of the core region for T0749-D1. The TM-score of the PMS model is 95.53%, and its side-chain χ1 angle accuracy is 79.65% (compared with 72.84% for the BT model). In Figure 6, one can observe that aromatic side-chains in the core region are particularly well predicted.

Table III. Comparison of Side-Chain Accuracies Between BT, MSA, and PMS Models in Terms of the Percentage of Correctly Predicted χ1 and χ1+2 Angles

Models      χ1 (%)  χ1+2 (%)
BT model    45.54   28.75
MSA model   46.67   30.02
PMS model   57.86   40.30

Figure 7 shows an example (T0651-D2) of good loop

modeling where a gapped region is successfully modeled

by a bent helical segment. Actually, when no template

structures are available from the alignment, loop model-

ing was achieved not explicitly but by using the physical

energy term in Eq. (2). We have not used any secondary

structure-related information. As the consequence, the

TM-score of the PMS model is 73.70%, which is better

than the MODELLER model’s TM-score of 68.94%.

Performance of human predictions

So far, we have discussed procedures and results of the

PMS server. Table IV shows results of human predictions

in terms of TM-score and v1 angle accuracy. All scores

are averaged over 50 TBM targets that are common

among PMS, LEE, and LEEcon. In Table IV, we find that

LEE models show significant improvements over PMS

models in backbone as well as side-chain accuracies due

to the consideration of additional templates and template

combinations compared to PMS. LEEcon shows the best performance, and the performance difference between LEE and LEEcon is also quite substantial. Utilization of consensus SERVER models helped the identification of BTs in an implicit way.

Figure 5. Comparison of side-chain χ1 angle accuracy between BT models and PMS models for 104 TBM targets. Almost all PMS models are improved by global optimization. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Figure 6. An example of good side-chain modeling, showing the PMS model (magenta) and the native structure (green) for T0749-D1. The backbone is highly accurate in the cartoon figure (only the native structure is shown) and the aromatic rings of the core region are well aligned.

Figure 7. An example of good loop modeling by the PMS model (magenta) for T0651-D2, shown with the MODELLER model (yellow) and the native structure (green). In the uppermost part of the figure, the PMS model (TM-score of 73.70%), which is better than the MODELLER model (TM-score of 68.94%), follows the bent helical conformation of the native structure.

Table IV. Comparison between PMS and the human predictions of the LEE and LEEcon methods for 50 common TBM targets, in terms of backbone and χ1 angle accuracies.

                 PMS      LEE      LEEcon
TM-score (%)     55.74    58.24    60.83
χ1 (%)           45.54    55.39    56.47
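The χ1 accuracy reported in Table IV is a percentage of side-chains whose χ1 dihedral agrees with the native structure. The following Python sketch shows one common way to compute such a figure. The 40° tolerance, the function names, and the input format (parallel lists of χ1 angles in degrees) are assumptions for illustration and are not taken from this section.

```python
def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def chi1_accuracy(model_chi1, native_chi1, tolerance_deg: float = 40.0) -> float:
    """Percentage of residues whose model chi1 is within the tolerance of native.

    model_chi1, native_chi1: chi1 angles in degrees for the same residues,
    in the same order. The 40-degree default is an assumed convention.
    """
    if len(model_chi1) != len(native_chi1):
        raise ValueError("angle lists must have the same length")
    if not model_chi1:
        raise ValueError("at least one residue is required")
    correct = sum(
        1
        for m, n in zip(model_chi1, native_chi1)
        if angular_difference(m, n) <= tolerance_deg
    )
    return 100.0 * correct / len(model_chi1)


if __name__ == "__main__":
    # Hypothetical angles for four residues; the second one is a wrong rotamer.
    model = [-60.0, 180.0, 60.0, -170.0]
    native = [-65.0, 60.0, 70.0, 175.0]
    print(f"chi1 accuracy: {chi1_accuracy(model, native):.1f}%")
```

The periodic wrap in `angular_difference` matters for angles near ±180°, where a naive subtraction would count a 15° deviation as 345°.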

What went wrong?

For hard TBM targets, the PMS protocol failed to identify the top templates. From Figure 2, we observe this failure for about 15 targets with TM-score lower than 60%. For target T0691-D1, we considered a total of 300 templates obtained from FOLDFINDER, among which five were structurally similar to the native structure (TM-score ≥ 50% according to TM-align, a sequence-independent structure alignment program). However, the alignments of all five templates to the target sequence were significantly off, consequently producing poor 3D structures. These alignment failures are partly due to the low sequence similarities (lower than 10%) between these template structures and the native structure.

Target T0721-D1 is an easy TBM target with a doubly connected two-domain structure. We considered the possibility of domain reorientation using the physical energy term in Eq. (2). The resulting PMS structure of T0721-D1 is more compact but has the wrong relative domain orientation, although the integrity of each domain structure is preserved. The CASP10 assessors assessed this target as a one-domain target. For this reason, the first MODELLER model’s TM-score is high (90.37%), while the PMS model’s TM-score is lower (83.66%). This indicates that our physical energy term can drive the model into a compact structure, in this case too strongly.

CONCLUSIONS

The PMS protocol is shown to be quite successful in accurate modeling of atomic details of side-chains as well as backbone structures. PMS combines serial applications of global optimization in three homology modeling steps: multiple sequence alignment, 3D chain building, and side-chain re-modeling. We introduced a Lorentzian-type energy term for distance restraints from templates to avoid excessive penalization of errors in distance restraints. When a part of the alignment between the target sequence and the templates is erroneous, the Lorentzian-type energy term leaves room for improvement, either by using the restraints generated from different templates or by using the E_physical term in Eq. (2). The energy function for 3D modeling is not exact and there are many conflicts between restraints, so the complex energy landscape can have many local minima. Therefore, proper global optimization is important in exploring the rugged energy landscape. This is demonstrated by the PMS protocol in this CASP10 experiment, where more accurate models at the level of side-chains together with backbone structures of proteins are achieved through the powerful global optimization method CSA.
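The effect of a bounded restraint term can be illustrated with a small Python sketch. The functional forms below (a Lorentzian-shaped penalty that saturates at 1, and a harmonic penalty for comparison), the parameter names, and the two hypothetical template distances are assumptions chosen for illustration; they are not the exact energy terms of the PMS protocol.

```python
def lorentzian_penalty(d: float, d0: float, width: float) -> float:
    """Bounded, Lorentzian-shaped penalty: 0 at d == d0, approaches 1 far away."""
    delta = d - d0
    return delta * delta / (delta * delta + width * width)


def harmonic_penalty(d: float, d0: float, width: float) -> float:
    """Unbounded quadratic penalty, shown only for comparison."""
    return ((d - d0) / width) ** 2


def best_distance(penalty, template_distances, width, grid):
    """Grid point that minimizes the summed penalty over all template restraints."""
    return min(
        grid,
        key=lambda d: sum(penalty(d, t, width) for t in template_distances),
    )


if __name__ == "__main__":
    grid = [i / 10 for i in range(201)]   # candidate distances, 0.0 to 20.0 A
    templates = [5.0, 12.0]               # two conflicting template restraints
    print("harmonic minimum:  ", best_distance(harmonic_penalty, templates, 1.0, grid))
    print("Lorentzian minimum:", best_distance(lorentzian_penalty, templates, 1.0, grid))
```

With two conflicting restraints, the harmonic sum is minimized at the average distance, which satisfies neither template, whereas the bounded sum keeps separate minima near each template distance. A wrong restraint then costs at most a fixed amount, and the resulting multiple minima are one reason a global optimizer is needed to search the landscape.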

ACKNOWLEDGMENTS

The authors thank KIAS Center for Advanced Computation for providing computing resources. They also thank the National Institute of Supercomputing and Networking/Korea Institute of Science and Technology Information for providing supercomputing resources.

REFERENCES

1. Joo K, Lee J, Lee S, Seo J-H, Lee SJ, Lee J. High accuracy template based modeling by global optimization. Proteins 2007;69(Suppl 8):83–89.
2. Joo K, Lee J, Seo J-H, Lee K, Kim B-G, Lee J. All-atom chain-building by optimizing MODELLER energy function using conformational space annealing. Proteins 2009;75:1010–1023.
3. Joo K, Lee J, Kim I, Lee SJ, Lee J. Multiple sequence alignment by conformational space annealing. Biophys J 2008;95:4813–4819.
4. Lee J, Lee J, Sasaki TN, Sasai M, Seok C, Lee J. De novo protein structure prediction by dynamic fragment assembly and conformational space annealing. Proteins 2011;79:2403–2417.
5. Zhou H, Zhou Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 2002;11:2714–2726.
6. Kortemme T, Morozov AV, Baker D. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J Mol Biol 2003;326:1239–1259.
7. Zhou H, Skolnick J. GOAP: a generalized orientation-dependent, all-atom statistical potential for protein structure prediction. Biophys J 2011;101:2043–2052.
8. Lee J, Scheraga HA, Rackovsky S. New optimization method for conformational energy calculations on polypeptides: conformational space annealing. J Comput Chem 1997;18:1222–1232.
9. Orry AJW, Abagyan R, editors. Homology modeling: methods and protocols. Humana Press, New York, 2012.
10. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993;234:779–815.
11. Lee J, Gross SP, Lee J. Modularity optimization by conformational space annealing. Phys Rev E 2012;85:056702.
12. Notredame C, Holm L, Higgins DG. Coffee: an objective function for multiple sequence alignments. Bioinformatics 1998;14:407–422.
13. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucl Acids Res 2005;33:2302–2309.
14. Breiman L. Random forests. Mach Learn 2001;45:5–32.
15. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999;292:195–202.
16. Joo K, Lee SJ, Lee J. SANN: solvent accessibility prediction of proteins by nearest neighbor method. Proteins 2012;80:1791–1797.
17. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983;22:2577–2637.
18. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics 2003;19:1589–1591.
19. Canutescu AA, Shelenkov AA, Dunbrack RL. A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci 2003;12:2001–2014.
