PROTEINS: Structure, Function, and Bioinformatics
Protein structure modeling for CASP10 by multiple layers of global optimization
Keehyoung Joo,1,2 Juyong Lee,1 Sangjin Sim,1 Sun Young Lee,1 Kiho Lee,1 Seungryong Heo,1 In-Ho Lee,1,3 Sung Jong Lee,1,4 and Jooyoung Lee1,5*
1 Center for In Silico Protein Science, Korea Institute for Advanced Study, Dongdaemun-gu, Seoul 130-722, Korea
2 Center for Advanced Computation, Korea Institute for Advanced Study, Dongdaemun-gu, Seoul 130-722, Korea
3 Korea Research Institute of Standards and Science (KRISS), Yuseong, Daejeon 305-600, Korea
4 Department of Physics, University of Suwon, Hwaseong-Si, Gyeonggi-do 445-743, Korea
5 School of Computational Sciences, Korea Institute for Advanced Study, Dongdaemun-gu, Seoul 130-722, Korea
ABSTRACT
In the template-based modeling (TBM) category of the CASP10 experiment, we introduced a new protocol called protein modeling system (PMS) to generate accurate protein structures in terms of side-chains as well as backbone trace. In the new protocol, a global optimization algorithm, called conformational space annealing (CSA), is applied to the three layers of the TBM procedure: multiple sequence-structure alignment, 3D chain building, and side-chain re-modeling. For 3D chain building, we developed a new energy function which includes new distance restraint terms of Lorentzian type (derived from multiple templates), and new energy terms that combine (physical) energy terms such as dynamic fragment assembly (DFA) energy, DFIRE statistical potential energy, hydrogen bonding term, etc. These physical energy terms are expected to guide the structure modeling especially for loop regions where no template structures are available. In addition, we developed a new quality assessment method based on the random forest machine learning algorithm to screen templates, multiple alignments, and final models. For TBM targets of CASP10, we find that, due to the combination of three stages of CSA global optimization and quality assessment, the modeling accuracy of PMS improves at each additional stage of the protocol. It is especially noteworthy that the side-chains of the final PMS models are far more accurate than those of the models in the intermediate steps.
Proteins 2014; 82(Suppl 2):188–195. © 2013 Wiley Periodicals, Inc.
Key words: structure prediction; CASP; homology modeling; template-based modeling; side-chain modeling; high-accuracy modeling; global optimization; energy function.
INTRODUCTION
For the modeling of TBM targets in CASP10, there is a wide consensus on the importance of predicting more accurate protein models in terms of all-atom details, beyond the modeling of just backbone structures. That is, accurate modeling of the side-chains of proteins is as important as accurate backbone modeling. More accurate side-chain modeling can provide valuable information and important clues for elucidating sequence-structure-function relationships in modern structural biology. For CASP10 predictions, we have developed a new prediction protocol which employs global optimization at three levels: multiple sequence-structure alignment (MSA), protein 3D chain building, and side-chain re-modeling.1,2
In the three stages of modeling, there are specific energy (or score) functions to optimize. For generating the MSA, the score function to be optimized represents the net consistency of all possible residue-residue matches generated by pairwise alignments.3 For 3D chain building, two major changes are introduced in the energy function. One is the development of a new Lorentzian-type energy term for spatial restraints (derived from templates), replacing the Gaussian-type or spline functions used in MODELLER. The other is the energy function for loop regions for which no restraints are available from templates, where we combined physical energy terms including the dynamic fragment assembly (DFA) energy4 together with the DFIRE5 statistical potential energy, a hydrogen bonding term,6 and GOAP terms.7 For side-chain re-modeling, we used an in-house energy function in the discrete rotamer space. To optimize these energy functions, we utilized conformational space annealing (CSA), a powerful global optimization and efficient conformational search method.8 In addition, we developed new quality assessment methods based on the random forest machine learning algorithm to properly rank protein models at the respective stages of fold recognition, selection of MSAs, and selection of the final models.

Grant sponsor: Korea government (MSIP); Grant numbers: 2008-0061987, 2009-0090085; Grant sponsor: National Institute of Supercomputing and Networking/Korea Institute of Science and Technology Information; Grant numbers: KSC-2012-C3-01; KSC-2012-C3-02.
*Correspondence to: Jooyoung Lee, Center for In Silico Protein Science, Korea Institute for Advanced Study, Dongdaemun-gu, Seoul 130–722, Korea.
E-mail: [email protected]
Received 3 April 2013; Revised 9 July 2013; Accepted 9 August 2013
Published online 22 August 2013 in Wiley Online Library (wileyonlinelibrary.com).
DOI: 10.1002/prot.24397
MATERIALS AND METHODS
For server prediction of CASP10 targets, we have developed an automated procedure called the protein modeling system (PMS) protocol, which combines the protocols employed in previous CASP experiments1 with the new features explained in this article. Figure 1 shows the overall flow of the PMS protocol, describing how the three global optimization procedures are combined with other utilities such as template selection and quality assessment at each stage.
Fold recognition
The first step of template-based modeling (TBM) is to search the PDB for homologous structures of a target sequence. Since CASP7, for fold recognition, we have used a sequence-structure alignment method called FOLDFINDER, an in-house method of profile–profile alignment utilizing predicted secondary structures.9 We built a template database of 27,333 chains obtained from the PISCES culling server at the level of 95% sequence identity with the chain length in the range of 50–1000 residues, including both X-ray and NMR structures.
For a given target sequence, a total of 50 top-scoring
templates are selected. For each template and its alignment
to the target sequence, 60 three-dimensional models are
generated by MODELLER10 and then perturbed. Then,
we optimize an energy function for these models and 10
lowest energy models are selected. The quality of each
selected model is estimated by quality assessment method
QA1 (see “quality assessment of protein structures,”
below), and the quality of the template is estimated by the
average QA1 score.
We perform structural clustering of the 50 template models using a recently developed community detection method11 based on all-to-all pairwise TM-scores. This clustering step ensures that only structurally similar templates are combined in the MSA and chain building, thereby reducing conflicts arising from structurally inconsistent multiple templates. Then subsets of the templates belonging to the same cluster/community are selected to generate template lists via template combinations. Typically, we generate one to eight lists, each of which can include up to 15 templates (Joo et al., in preparation). For difficult targets, we have considered up to 200 top-scoring templates ranked by FOLDFINDER, and the number of template lists can be as large as 20.
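To make the grouping step concrete, the following is a minimal sketch in Python. It does not implement the community detection method11 used in PMS; as a plainly simpler stand-in, it groups templates into connected components of a graph whose edges join template pairs with a TM-score above a cutoff. The function name, the cutoff of 0.5, and the toy scores are assumptions for illustration only; the cap of 15 templates per list follows the text.

```python
def cluster_by_tm(names, tm, cutoff=0.5):
    """Group templates into connected components of the graph whose edges
    join pairs with TM-score >= cutoff (a stand-in for community detection)."""
    unassigned = set(names)
    clusters = []
    for start in names:
        if start not in unassigned:
            continue
        unassigned.discard(start)
        stack, members = [start], [start]
        while stack:
            a = stack.pop()
            linked = [b for b in names
                      if b in unassigned and tm[frozenset((a, b))] >= cutoff]
            for b in linked:
                unassigned.discard(b)
                members.append(b)
                stack.append(b)
        clusters.append(members)
    return clusters

# Hypothetical all-to-all pairwise TM-scores for four templates.
names = ["t1", "t2", "t3", "t4"]
scores = {("t1", "t2"): 0.80, ("t1", "t3"): 0.30, ("t1", "t4"): 0.20,
          ("t2", "t3"): 0.35, ("t2", "t4"): 0.25, ("t3", "t4"): 0.70}
tm = {frozenset(pair): s for pair, s in scores.items()}

clusters = cluster_by_tm(names, tm)
# Each template list draws from a single cluster and holds at most 15 templates.
template_lists = [c[:15] for c in clusters]
print(template_lists)
```

A threshold graph is sensitive to the chosen cutoff, which is one reason a modularity-based method is attractive for this step.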
Multiple sequence-structure alignment
We performed multiple sequence-structure alignment using the MSACSA method.3 In MSACSA, a consistency-based score function similar to the COFFEE score12 is used. For CASP10, the original score function1 of MSA is slightly modified as follows. With N sequences to align (one target sequence and N − 1 template sequences), all pairwise alignments are carried out to construct a restraint library of matched residue pairs. For the pairwise alignment between the target sequence and a template sequence, the sequence-based pairwise alignment from FOLDFINDER is used, while for alignments among the templates, the structure-structure alignment by TM-align13 is used. We assign a weight $w^k_{ij} = f^k_{ij} \times \mathrm{SeqId}$ to each aligned residue pair between the ith and jth sequences at the kth column, where SeqId is the sequence identity between i and j, and $f^k_{ij}$ is either the profile–profile match score from FOLDFINDER for target–template alignments or $(1 - d^k_{ij}/8)$ for template–template alignments, $d^k_{ij}$ being the Cα–Cα distance between matched residues at the kth column (generated by TM-align13). Then, $f^k_{ij}$ is linearly rescaled to the range [0.01, 1]. The library is now a collection of aligned residue pairs with specific weights $w^k_{ij}$. Denoting the sum of all weights by $\sum w$, the score function of a multiple alignment A can be written as3

$$\mathrm{Score}(A) = \sum_{i,j=1;\; i < j}^{N} \; \sum_{k=1}^{M} w^k_{ij}\, \delta^k_{ij}(A) \Big/ \sum w, \qquad (1)$$

where the sum over k runs over the M aligned columns, and $\delta^k_{ij}(A) = 1$ or 0 depending on whether the aligned residue pair at the kth column is in the library or not. A good score is therefore associated with an alignment that is consistent with the restraint library: the more consistent an alignment is with the pairwise restraint library, the higher its score will be.

Figure 1. The flow chart of the PMS protocol for CASP10. The CSA method is used for MSA, 3D chain building, and side-chain re-modeling. Three QA methods (QA1, QA2, and QA3) are used for selection of templates, MSAs, and final models.
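As an illustration of how a consistency score of the form of Eq. (1) can be evaluated, here is a minimal Python sketch. The data layout (rows of residue indices with `None` for gaps, and a dictionary keyed by sequence and residue indices) and all names are assumptions made for this example; they are not taken from MSACSA.

```python
from itertools import combinations

def consistency_score(alignment, library):
    """Fraction of the total library weight recovered by a multiple alignment.

    alignment: list of rows (one per sequence); each row gives, per column,
               the residue index placed there or None for a gap.
    library:   dict {(i, j, res_i, res_j): weight} with i < j, built from
               pairwise alignments (hypothetical layout).
    """
    total = sum(library.values())
    recovered = 0.0
    for k in range(len(alignment[0])):                 # loop over columns
        for i, j in combinations(range(len(alignment)), 2):
            ri, rj = alignment[i][k], alignment[j][k]
            if ri is not None and rj is not None:
                # delta = 1 only when this matched pair is in the library
                recovered += library.get((i, j, ri, rj), 0.0)
    return recovered / total

# Toy library for three sequences (0 = target, 1 and 2 = templates).
library = {(0, 1, 0, 0): 1.0, (0, 1, 1, 1): 0.5, (0, 2, 0, 0): 0.5}

consistent = [[0, 1], [0, 1], [0, None]]
shifted = [[0, 1], [None, 0], [0, None]]
print(consistency_score(consistent, library))
print(consistency_score(shifted, library))
```

The first alignment reproduces every library pair and scores 1.0, while shifting one template by a column keeps only a single library pair and scores 0.25.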
With the score function, a straightforward and rigorous global optimization is performed starting from multiple alignments initially generated in a random fashion. This should be contrasted with existing heuristic progressive alignment methods popular in the literature. In a progressive alignment procedure, successive pairwise alignments are carried out to obtain a multiple alignment. Therefore, it is rather fast, but alignment errors created in early stages of the method cannot be fixed later. It is demonstrated that alignments generated by MSACSA are more consistent with the gold standard than those generated by progressive methods.3 Another advantage of MSACSA is that it provides not only the globally optimal alignment solution but also many distinct low-lying suboptimal alternative alignment solutions. Details of the algorithm can be found elsewhere.3
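For readers unfamiliar with CSA,8 the sketch below shows the general idea on a toy one-dimensional function: a bank of locally minimized solutions is maintained, trial solutions are generated from bank members and locally minimized, and a distance cutoff that shrinks over time decides whether a trial competes with its nearest neighbor or with the worst bank member. This is a heavily simplified illustration, not the MSACSA or PMS implementation; the toy energy, the crossover rule, and all parameter values are assumptions.

```python
import math
import random

def descend(f, x, step=0.1, tol=1e-6):
    """Derivative-free 1D local minimizer: walk downhill, halving the step when stuck."""
    while step > tol:
        if f(x + step) < f(x):
            x += step
        elif f(x - step) < f(x):
            x -= step
        else:
            step /= 2
    return x

def csa_minimize(energy, local_min, random_solution, crossover, distance,
                 bank_size=20, n_iter=200, shrink=0.98, seed=0):
    """Toy conformational space annealing loop (simplified sketch)."""
    rng = random.Random(seed)
    bank = [local_min(random_solution(rng)) for _ in range(bank_size)]
    # Initial cutoff: half the average pairwise distance between bank members.
    pairs = [(a, b) for i, a in enumerate(bank) for b in bank[i + 1:]]
    d_avg = sum(distance(a, b) for a, b in pairs) / len(pairs)
    d_cut, d_min = d_avg / 2, d_avg / 5
    for _ in range(n_iter):
        parent_a, parent_b = rng.sample(bank, 2)
        trial = local_min(crossover(parent_a, parent_b, rng))
        nearest = min(range(bank_size), key=lambda m: distance(trial, bank[m]))
        if distance(trial, bank[nearest]) < d_cut:
            target = nearest      # same region: compete with the closest member
        else:
            target = max(range(bank_size), key=lambda m: energy(bank[m]))  # new region
        if energy(trial) < energy(bank[target]):
            bank[target] = trial
        d_cut = max(d_min, d_cut * shrink)   # annealing in conformational space
    return min(bank, key=energy)

def f(x):
    """Rugged toy energy with many local minima."""
    return x * x / 10 + math.sin(3 * x)

best = csa_minimize(
    energy=f,
    local_min=lambda x: descend(f, x),
    random_solution=lambda rng: rng.uniform(-10, 10),
    crossover=lambda a, b, rng: a + rng.random() * (b - a) + rng.gauss(0, 0.5),
    distance=lambda a, b: abs(a - b),
)
print(f"best x = {best:.3f}, energy = {f(best):.3f}")
```

Because the bank keeps mutually distant low-energy solutions while the cutoff is large, the final bank also contains suboptimal alternatives, which mirrors the property of MSACSA noted above.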
Typically, 100 alignments are generated for each template list and the 10 top-scoring MSAs are chosen according to Eq. (1). Therefore, with n_list template lists, 3D models are generated for a total of 10 × n_list alignments. After the MSACSA runs are finished, we construct models for the top 10 MSAs (in terms of the MSACSA score function) and then perform quality assessment again to select the best MSAs. The quality of an MSA is assessed by QA2 (see “quality assessment of protein structures,” below). Typically up to 10 MSAs are selected and passed on to the next step.
Energy function and protein 3D modeling
For protein 3D modeling using the MSAs selected above, we developed a new energy function, which is defined as

$$E = E_{\text{stereo-chemistry}} + E_{\text{restraint}} + E_{\text{physical}}, \qquad (2)$$

where $E_{\text{stereo-chemistry}}$ denotes stereochemical terms that are borrowed from the MODELLER energy function.10 The second term, $E_{\text{restraint}}$, includes the distance restraint terms (Cα–Cα, N–O, and the others involving side-chain atoms) used in MODELLER. The difference is that in MODELLER either harmonic or spline functions are used, while we used Lorentzian-type energy functions. For each pair of atoms with $N_R$ distance restraints, the Lorentzian energy term is defined as

$$E_{\text{Lorentzian}}(r) = \sum_{i}^{N_R} \frac{1}{\sigma_i}\, \frac{(r - r_{i0})^2}{(r - r_{i0})^2 + \sigma_i^2}, \qquad (3)$$

where $r_{i0}$ denotes a restraint distance copied from a specific template, and $\sigma_i$ is an estimated uncertainty of the distance $r_{i0}$, which controls the strength of an individual distance restraint. Finally, r is the distance between the atoms in the current model structure. To obtain $\sigma_i$, we employed a random forest algorithm, where the input features are based on the sequence-template alignment and environmental features including the profile similarity, gap features, secondary structure consensus, and solvent accessibility consensus (Lee et al., in preparation).
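The practical difference between a Lorentzian-type and a harmonic restraint can be seen in a few lines of Python. The sketch below evaluates the Lorentzian term of Eq. (3) for one atom pair, reading the uncertainty symbol as sigma, and compares it with a generic harmonic penalty; the harmonic form and the numerical values are assumptions for illustration and are not MODELLER's actual restraint code.

```python
def lorentzian_energy(r, restraints):
    """Lorentzian restraint energy for one atom pair.

    restraints: list of (r0, sigma) pairs, one per template-derived restraint.
    Each term saturates at 1/sigma for large deviations.
    """
    return sum((1.0 / s) * (r - r0) ** 2 / ((r - r0) ** 2 + s ** 2)
               for r0, s in restraints)

def harmonic_energy(r, restraints):
    """Generic harmonic (Gaussian-type) penalty, assumed here for comparison."""
    return sum((r - r0) ** 2 / (2.0 * s ** 2) for r0, s in restraints)

one = [(5.0, 1.0)]  # hypothetical restraint: r0 = 5.0, sigma = 1.0
for r in (5.0, 6.0, 15.0):
    print(r, lorentzian_energy(r, one), harmonic_energy(r, one))
```

At a deviation of ten times sigma the harmonic penalty has grown to 50, whereas the Lorentzian term stays below its ceiling of 1/sigma. A wrong restraint from a misaligned template therefore cannot dominate the total energy, and other restraints or the physical terms can take over.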
Finally, in Eq. (2), $E_{\text{physical}}$ includes the dynamic fragment assembly (DFA) energy (which was originally developed for ab initio protein structure modeling4) together with the DFIRE statistical potential energy, a hydrogen bonding term, and GOAP terms.7 Weights for the energy terms were optimized using a subset of CASP9 targets. The energy function of Eq. (2) is optimized considering all possible structural variations by employing CSA. Typically, 100 three-dimensional models are generated for each MSA, and the average QA3 (see below) score of the 100 models is used to select the top MSA. The lowest-energy model of the top MSA is submitted as model one.
Quality assessment of protein structures
The PMS protocol uses three quality assessment (QA) steps: (1) assessment of templates, (2) assessment of MSAs, and (3) assessment of final lowest-energy structures. To assign rankings for the first two steps, 3D model structures are assessed. Various energy terms of these models and other structural features were used as input features for training and testing a machine for each step. To capture hidden nonlinear relationships between features of model structures and their TM-scores to the native structure, we used the random forest method.14 Input features are listed in Table I. To minimize the computing time of estimating σ in Eq. (3), $E_{\text{restraint}}$ for QA1 and QA2 is generated using the MODELLER term, while the Lorentzian term of Eq. (3) is used only for QA3. All input features and TM-scores are normalized to the [0,1] scale, separately for each target.
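Per-target normalization can be sketched as follows. The text does not say which rescaling was used, so min-max scaling is an assumption here, as are the function names and the toy feature values.

```python
def minmax_normalize(values):
    """Rescale a list of numbers linearly to [0, 1]; constant columns map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def normalize_per_target(features_by_target):
    """Normalize every feature column separately within each target.

    features_by_target: {target: {feature_name: [value per model]}}
    """
    return {target: {name: minmax_normalize(column)
                     for name, column in features.items()}
            for target, features in features_by_target.items()}

# Hypothetical raw features for the models of two targets.
raw = {
    "targetA": {"E_total": [-120.0, -100.0, -80.0], "E_DFIRE": [-30.0, -30.0, -30.0]},
    "targetB": {"E_total": [-900.0, -600.0, -300.0], "E_DFIRE": [-50.0, -40.0, -45.0]},
}
normalized = normalize_per_target(raw)
print(normalized["targetA"]["E_total"], normalized["targetB"]["E_total"])
```

Scaling within each target removes the dependence of raw energies on chain length, so models of a small and a large protein become comparable as training examples.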
Assessment of templates (QA1)
For a selected template with its alignment to the target sequence, 60 model structures are generated by minimizing MODELLER-generated and then perturbed 3D models using Eq. (2). The average QA1 score of the 10 lowest-energy models is used. To train a random forest machine, we used 1178 non-redundant sequences selected and then screened from PISCES18 excluding highly similar sequences. The top 50 templates are generated by FOLDFINDER.
Assessment of MSAs (QA2)
For a selected MSA, 3D model structures are generated following the same procedure as in QA1. To train a random forest machine for this step, we used 27 CASP9 single-domain targets under 200 aa. For each target, top templates are filtered using QA1. For each MSA, we generated 10 models. The average QA2 score of the 10 models is used.
Assessment of models (QA3)
For each MSA selected by QA2, PMS generates 100 different model structures. To train the third random forest machine, model structures of the 27 benchmark proteins mentioned above were generated for MSAs selected by QA2. The energy values and TM-scores of the models were extracted and used as the training data. The average QA3 score of the 100 models is used to identify the top MSA, and the lowest-energy [by Eq. (2)] model of the top MSA is selected as model one. The other four models are selected from the other suboptimal MSAs.
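The selection logic of this step can be sketched in Python as below. Ranking MSAs by their average QA3 score and taking the lowest-energy model of the top MSA follows the text; taking the lowest-energy model from each of the next-ranked MSAs for the remaining submissions is an assumption, since the exact rule is not given. All names and numbers are hypothetical.

```python
from statistics import mean

def select_models(models_by_msa, n_submit=5):
    """Pick one model per MSA, MSAs ordered by average QA score.

    models_by_msa: {msa_id: [(energy, qa3_score), ...]}
    Returns a list of (msa_id, energy_of_selected_model); the first entry
    corresponds to model one.
    """
    ranked = sorted(models_by_msa,
                    key=lambda m: mean(q for _, q in models_by_msa[m]),
                    reverse=True)
    picks = []
    for msa in ranked[:n_submit]:
        energy, _ = min(models_by_msa[msa])   # tuples compare by energy first
        picks.append((msa, energy))
    return picks

# Toy data: two models per MSA instead of 100.
toy = {
    "msa1": [(-50.0, 0.60), (-55.0, 0.62)],
    "msa2": [(-60.0, 0.50), (-40.0, 0.48)],
    "msa3": [(-45.0, 0.70), (-47.0, 0.66)],
}
picks = select_models(toy, n_submit=2)
print(picks)
```

In this toy case the globally lowest energy (-60.0) belongs to the worst-ranked MSA and is not chosen as model one, which shows how the QA average over an ensemble acts as a filter before energies are compared.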
Side-chain modeling
Side-chain re-modeling is carried out for the final five models to be submitted to CASP10. We have used the backbone-dependent rotamer library of SCWRL 3.0 (Ref. 19) along with a target-specific rotamer library constructed during the final CSA simulation for 3D modeling. The side-chain modeling used here follows the identical procedure employed in earlier CASPs.1
Human predictions
Here, we discuss the human prediction methods, LEE and LEEcon. LEE is the same as PMS except that we have considered additional templates from FOLDFINDER. LEEcon is a consensus method using SERVER models as templates that are selected from the largest cluster of all SERVER models released by CASP10. For structural clustering we used a recently proposed community detection method.11 The performance difference between LEE and LEEcon partly indicates the fold recognition performance difference between our procedure and the consensus of CASP10 server predictions (see below, “Performance of human predictions” section).

Table I. List of the Input Features for Quality Assessment

Index  Feature                     Description
F1     E_total                     Total energy
F2     E_vdw                       van der Waals energy
F3     E_stereo                    Stereochemistry energy
F4     E_restraint                 Distance restraint energy from templates
F5     E_geom                      Sum of stereochemistry
F6     E_DFA                       Total DFA energy
F7     E_DFA,dist                  DFA distance energy
F8     E_DFA,angle                 DFA torsion angle energy
F9     E_DFA,neigh                 DFA neighbor energy
F10    E_DFA,hydrophobic           DFA neighbor energy with hydrophobic center residues
F11    E_DFA,hydrophilic           DFA neighbor energy with hydrophilic center residues
F12    E_DFA,aromatic              DFA neighbor energy with aromatic center residues
F13    E_DFA,beta                  DFA beta-sheet energy
F14    E_DFIRE                     DFIRE energy
F15    E_DFIRE,np                  DFIRE energies between non-polar atom pairs
F16    E_DFIRE,pol                 DFIRE energies from atom pairs in which either atom is polar
F17    E_hbond                     Total hydrogen bond energy
F18    E_hb,bb                     Contribution of backbone–backbone hydrogen bond energy
F19    E_hb,bs                     Contribution of backbone–sidechain hydrogen bond energy
F20    E_hb,ss                     Contribution of sidechain–sidechain hydrogen bond energy
F21    E_GOAP                      GOAP statistical energy
F22    Σ_i p_i(s_i) δ(s_i, t_i)    Sum of PSIPRED probability
F23    corr(SS_pred, SS_model)     Correlation between solvent accessibilities of model and prediction
F24    N_helix / N_total           Fraction of predicted helical residues in a model
F25    N_sheet / N_total           Fraction of predicted beta-sheet residues in a model
F26    N_coil / N_total            Fraction of predicted coil residues in a model

For E_restraint, the MODELLER energy term is used for QA1 and QA2, and the Lorentzian energy for QA3. The secondary structure of the target sequence is that predicted by using PSIPRED,15 while the solvent accessibility of the target sequence is obtained by SANN.16 In F22, p_i(s_i) is the probability of the secondary structure state s_i ∈ {C, H, E} (from PSIPRED) at the ith residue position, and the secondary structure state t_i for the model is calculated by using DSSP.17 Here, δ(s_i, t_i) = 1 if s_i = t_i and zero otherwise.

RESULTS AND DISCUSSION
In this section, we analyze the performance of PMS in terms of the backbone accuracy and the side-chain accuracy of the models with respect to the native structures. In CASP10, there are 112 domains that are officially classified as TBM targets. We considered 104 domains, excluding eight domains. The native structure of T0739 is not available for two domains (D3 and D4), and for the remaining six cases, our template selections were not consistent with the domain definitions provided by the assessors (T0705-D1/D2, T0717-D1/D2, T0744-D1, and T0755-D1).
Figure 2 shows the template qualities in terms of TM-score between the top templates (by the CASP10 assessors) and the best templates (BTs; used for PMS modeling). The BT used for PMS modeling is the top (TM-score) template among typically 50 templates identified by FOLDFINDER of PMS (by structural alignment using TM-align13). FOLDFINDER could successfully include the top template among the 50 templates when its TM-score is better than around 60%, while, for the other cases (TM-score < 60%), the top template was missed among the 50 templates. In short, for easy targets, FOLDFINDER performed quite well, but it did not for hard targets.
With multiple templates identified by FOLDFINDER, the PMS protocol carried out the next steps of MSA and 3D modeling. Table II shows that PMS models gradually improve on average in terms of TM-score as each additional step is processed. For comparison, we have used MODELLER for BT models and MSA models. Here, the BT model is the best TM-score model among MODELLER-generated models using a single template out of the multiple templates used to generate the model one PMS model. The average TM-score and RMSD of the BT models are 67.28% and 7.83 Å with the coverage of 68.23%. For MSA models, the average TM-score and RMSD are 67.37% and 6.52 Å with coverage of 73.30%. As a result, on average, MSA models are better than BT models. Generally, it is not easy to identify the BT out of a given set of multiple templates without knowing the 3D structure of a target sequence. It should be noted that MSA is an important step for extracting the information of the BT among multiple templates. For final PMS models, the average TM-score and RMSD are 69.32% and 5.84 Å with the coverage of 75.51%, which is significantly better than MSA models.
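Since TM-score and coverage are reported side by side, it may help to recall how the TM-score depends on the aligned residues. The Python sketch below evaluates the standard TM-score sum for a fixed superposition, given the Cα distances of the aligned residue pairs; the actual TM-score program additionally searches for the superposition that maximizes this value, which is not done here. The function name and the example distances are illustrative assumptions.

```python
def tm_score(distances, target_length):
    """TM-score for a given superposition.

    distances:     Cα–Cα distances (in Å) of aligned residue pairs between
                   model and native structure.
    target_length: number of residues in the native (target) structure;
                   assumed large enough that d0 is positive.
    """
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# A model covering 70 of 100 residues perfectly cannot exceed a TM-score of 0.70.
perfect_partial = tm_score([0.0] * 70, 100)
# Full coverage with a uniform 3 Å deviation on every residue.
full_rough = tm_score([3.0] * 100, 100)
print(perfect_partial, full_rough)
```

Residues that are not aligned contribute nothing to the sum, so coverage sets an upper bound on the TM-score. This is consistent with the trend in which the average TM-score and the coverage increase together from BT to MSA to PMS models.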
Figure 2. Comparison of TM-scores between BTs by PMS and top templates by the assessors. BTs by PMS are identified among 50 FOLDFINDER-generated templates by using the native structure and the TM-align program. Template selection of PMS worked well for targets in the region of TM-score > 60%, while, in other cases, the best templates were missed. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Table II. Comparison of Backbone Accuracies Between BT, MSA, and PMS Models

Models       TM-score (%)   RMSD (Å)   Coverage (%)
BT model     67.28          7.83       68.23
MSA model    67.37          6.52       73.30
PMS model    69.32          5.84       75.51

The BT model is the best TM-score model among multiple templates, which were used for PMS modeling. MSA models are generated by simple application of MODELLER. PMS models are the final models submitted as model ones. All values are averages over 104 TBM domains. Coverage is calculated by TM-score, and it represents the percentage of aligned residues between a 3D model and its native structure.

Figure 3. TM-score comparison between BT models and PMS models for 104 TBM targets. On average PMS models are better than BT models. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Figure 3 shows the TM-score comparison between BT models and PMS models for 104 TBM targets. In Figure 3, we find that PMS consistently generates better-quality models in the hard TBM region (TM-score less than about 60%). For hard TBM targets, there are many loop regions in the alignments, where the physical energy terms play an important role in 3D modeling. Figure 4 shows examples of T0664-D1 and T0699-D1, where the model (thick line) is superimposed on the native structure (thin line). For T0664-D1, the TM-score improves gradually as each additional modeling step proceeds. In the case of T0699-D1, the TM-score of the MSA model is worse than that of the BT model, but the final PMS model using the same MSA is much improved. On average, the PMS protocol can generate better backbone structures than MODELLER models using the same templates and alignments.
Table III. Comparison of Side-Chain Accuracies Between BT, MSA, and PMS Models in Terms of the Percentage of Correctly Predicted χ1 and χ1+2 Angles

Models       χ1 (%)   χ1+2 (%)
BT model     45.54    28.75
MSA model    46.67    30.02
PMS model    57.86    40.30

Figure 4. Structure superimpositions of models (thick lines) on native structures (thin lines) for (a) T0664-D1 and (b) T0699-D1. Final PMS models are significantly improved compared to the BT models. Coloring from blue to red indicates the backbone tracing from N-terminal to C-terminal. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Table III shows the side-chain accuracies of models generated by the PMS protocol at each step. Side-chains of PMS models are far more accurate than those of BT models and MSA models in terms of χ1/χ1+2 angle accuracies. The average accuracies of the χ1 and χ1+2 angles of PMS models are 57.86% and 40.30%, respectively, averaged over 104 TBM domains. In the case of the χ1 angle, the average accuracy increased by about 27% relative to the BT models (from 45.54% to 57.86%). Figure 5 shows the scatter plot of side-chain accuracies between the BT models and final PMS models. Almost all models are improved in χ1 angle accuracy. Overall, the current optimization procedure in the 3D modeling and side-chain re-modeling of the PMS protocol improved the atomic details of the protein model in addition to the backbone improvement. Figure 6 shows an example of successful side-chain modeling of the core region for T0749-D1. The TM-score of the PMS model is 95.53%, and its χ1 angle accuracy is 79.65% (to be compared with 72.84% for the BT model). In Figure 6, one can observe that aromatic side-chains from the core region are particularly well predicted.
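For orientation, side-chain dihedral accuracy of this kind is usually computed as the percentage of residues whose predicted angle falls within a tolerance of the native angle. The Python sketch below shows one such calculation. The tolerance of 40°, the rule that χ1+2 requires both angles to be correct, and the choice of residues with a χ2 angle as the χ1+2 denominator are assumptions, since the text does not state the criteria used.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def chi_accuracy(model, native, tol=40.0):
    """Return (chi1 accuracy %, chi1+2 accuracy %).

    model, native: per-residue tuples (chi1, chi2) in degrees, in the same
    residue order; chi2 is None for residues without a second dihedral.
    """
    n1 = ok1 = n12 = ok12 = 0
    for (m1, m2), (x1, x2) in zip(model, native):
        n1 += 1
        chi1_ok = angle_diff(m1, x1) <= tol
        ok1 += chi1_ok
        if m2 is not None and x2 is not None:
            n12 += 1
            ok12 += chi1_ok and angle_diff(m2, x2) <= tol
    chi12 = 100.0 * ok12 / n12 if n12 else 0.0
    return 100.0 * ok1 / n1, chi12

# Hypothetical angles for three residues.
model = [(60.0, 180.0), (-65.0, None), (175.0, 60.0)]
native = [(65.0, 170.0), (-170.0, None), (-178.0, -60.0)]
chi1_pct, chi12_pct = chi_accuracy(model, native)
print(round(chi1_pct, 2), round(chi12_pct, 2))
```

The third residue illustrates why the difference is taken on the circle: 175° and -178° are only 7° apart.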
Figure 7 shows an example (T0651-D2) of good loop modeling where a gapped region is successfully modeled by a bent helical segment. When no template structures are available from the alignment, loop modeling was achieved not explicitly but by using the physical energy term in Eq. (2). We have not used any secondary structure-related information. As a consequence, the TM-score of the PMS model is 73.70%, which is better than the MODELLER model’s TM-score of 68.94%.

Figure 5. Comparison of side-chain χ1 angle accuracy between BT models and PMS models for 104 TBM targets. Almost all PMS models are improved by global optimization. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Figure 6. An example of good side-chain modeling by the PMS model (magenta) compared with the native structure (green) for T0749-D1. The backbone is highly accurate in the cartoon figure (only the native structure is shown) and the aromatic rings of the core region are well aligned.

Figure 7. An example of good loop modeling by the PMS model (magenta) for T0651-D2, shown with the MODELLER model (yellow) and the native structure (green). In the uppermost part of the figure, the PMS model (TM-score of 73.70%), which is better than the MODELLER model (TM-score of 68.94%), follows the bent helical conformation of the native structure.

Performance of human predictions
So far, we have discussed the procedures and results of the PMS server. Table IV shows the results of human predictions in terms of TM-score and χ1 angle accuracy. All scores are averaged over 50 TBM targets that are common among PMS, LEE, and LEEcon. In Table IV, we find that LEE models show significant improvements over PMS models in backbone as well as side-chain accuracies due to the consideration of additional templates and template combinations compared to PMS. LEEcon shows the best performance, and the performance difference between LEE and LEEcon is also quite substantial. Utilization of consensus SERVER models helped the identification of BTs in an implicit way.

Table IV. Comparison Between PMS and Human Predictions of LEE and LEEcon Methods for 50 Common TBM Targets in Terms of Backbone and χ1 Angle Accuracies

               PMS     LEE     LEEcon
TM-score (%)   55.74   58.24   60.83
χ1 (%)         45.54   55.39   56.47
What went wrong?
For hard TBM targets, the PMS protocol failed to identify the top templates. From Figure 2, we observe this failure for about 15 targets with TM-score lower than 60%. For target T0691-D1, we have considered a total of 300 templates obtained from FOLDFINDER, among which there were five templates that are structurally similar (≥ 50% of TM-score by using TM-align, which is a sequence-independent structure alignment program) to the native structure. However, the alignments of all five templates to the target sequence were significantly off, consequently producing poor 3D structures. These alignment failures are partly due to the low sequence similarities (lower than 10%) between these template structures and the native structure.
Target T0721-D1 is an easy TBM target with a doubly connected two-domain structure. We considered the possibility of domain reorientation using the physical energy term in Eq. (2). The resulting PMS structure of T0721-D1 is more compact but wrong in relative domain orientation, while the integrity of each domain structure is preserved. The CASP10 assessors assessed this target as a one-domain target. For this reason, the first MODELLER model’s TM-score is high (90.37%), while the PMS model’s TM-score is lower (83.66%). This indicates that our physical energy term can drive the model structure into a compact structure, unfortunately too much so in this case.
CONCLUSIONS
The PMS protocol is shown to be quite successful in accurate modeling of atomic details of side-chains as well as backbone structures. PMS combines serial applications of global optimization in three homology modeling steps: multiple sequence alignment, 3D chain building, and side-chain re-modeling. We introduced a Lorentzian-type energy term for distance restraints from templates to avoid excessive penalization of errors in distance restraints. When a part of the alignment between the target sequence and templates is erroneous, with the Lorentzian-type energy term, there is room for improvement either by using the restraints generated from different templates or by using the $E_{\text{physical}}$ term in Eq. (2). The energy function for 3D modeling is not exact and there are many conflicts between restraints, so that it can have many local minima in the complex energy landscape. Therefore, proper global optimization is important in exploring the rugged energy landscape. This is demonstrated by the PMS protocol in this CASP10 experiment, where more accurate models at the level of side-chains together with backbone structures of proteins are achieved through the powerful global optimization of CSA.
ACKNOWLEDGMENTS
The authors thank KIAS Center for Advanced Computation for providing computing resources. They also thank the National Institute of Supercomputing and Networking/Korea Institute of Science and Technology Information for providing supercomputing resources.
REFERENCES
1. Joo K, Lee J, Lee S, Seo J-H, Lee SJ, Lee J. High accuracy template based modeling by global optimization. Proteins 2007;69(Suppl 8):83–89.
2. Joo K, Lee J, Seo J-H, Lee K, Kim B-G, Lee J. All-atom chain-building by optimizing MODELLER energy function using conformational space annealing. Proteins 2009;75:1010–1023.
3. Joo K, Lee J, Kim I, Lee SJ, Lee J. Multiple sequence alignment by conformational space annealing. Biophys J 2008;95:4813–4819.
4. Lee J, Lee J, Sasaki TN, Sasai M, Seok C, Lee J. De novo protein structure prediction by dynamic fragment assembly and conformational space annealing. Proteins 2011;79:2403–2417.
5. Zhou H, Zhou Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 2002;11:2714–2726.
6. Kortemme T, Morozov AV, Baker D. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J Mol Biol 2003;326:1239–1259.
7. Zhou H, Skolnick J. GOAP: a generalized orientation-dependent, all-atom statistical potential for protein structure prediction. Biophys J 2011;101:2043–2052.
8. Lee J, Scheraga HA, Rackovsky S. New optimization method for conformational energy calculations on polypeptides: conformational space annealing. J Comput Chem 1997;18:1222–1232.
9. Orry AJW, Abagyan R, editors. Homology modeling: methods and protocols. Humana Press, New York, 2012.
10. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993;234:779–815.
11. Lee J, Gross SP, Lee J. Modularity optimization by conformational space annealing. Phys Rev E 2012;85:056702.
12. Notredame C, Holm L, Higgins DG. COFFEE: an objective function for multiple sequence alignments. Bioinformatics 1998;14:407–422.
13. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucl Acids Res 2005;33:2302–2309.
14. Breiman L. Random forests. Mach Learn 2001;45:5–32.
15. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999;292:195–202.
16. Joo K, Lee SJ, Lee J. SANN: solvent accessibility prediction of proteins by nearest neighbor method. Proteins 2012;80:1791–1797.
17. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983;22:2577–2637.
18. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics 2003;19:1589–1591.
19. Canutescu AA, Shelenkov AA, Dunbrack RL. A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci 2003;12:2001–2014.