This article was downloaded by: [University of Newcastle (Australia)] On: 03 October 2014, At: 06:11 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Journal of Biomolecular Structure and Dynamics Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tbsd20 Evaluation performance of substitution matrices, based on contacts between residue terminal groups Boris Vishnepolsky a , Grigol Managadze a , Maya Grigolava a & Malak Pirtskhalava a a Life Science Research Centre, Laboratory of Bioinformatics , 14 Gotua St, Tbilisi , 0160 , Georgia Published online: 12 Jun 2012. To cite this article: Boris Vishnepolsky , Grigol Managadze , Maya Grigolava & Malak Pirtskhalava (2012) Evaluation performance of substitution matrices, based on contacts between residue terminal groups, Journal of Biomolecular Structure and Dynamics, 30:2, 180-190, DOI: 10.1080/07391102.2012.677769 To link to this article: http://dx.doi.org/10.1080/07391102.2012.677769 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. 
Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http:// www.tandfonline.com/page/terms-and-conditions

Evaluation performance of substitution matrices, based on contacts between residue terminal groups


Boris Vishnepolsky*, Grigol Managadze, Maya Grigolava and Malak Pirtskhalava*

Life Science Research Centre, Laboratory of Bioinformatics, 14 Gotua St Tbilisi 0160, Georgia

Communicated by Vsevolod Makeev

(Received 27 May 2011; final version received 13 January 2012)

Sequence alignment is a standard method for estimating the evolutionary, structural, and functional relationships among amino acid sequences. The quality of an alignment depends on the similarity matrix used. Statistical contact potentials (CPs) contain information on contact propensities among residues in native protein structures. Substitution matrices (SMs) based on CPs are applicable to the comparison of distantly related sequences. Here, contact between amino acids was estimated from the distances between side-chain terminal groups (SCTGs), which are defined as the groups of side-chain heavy atoms with fixed distances between them. In this paper, two new types of CPs and similarity matrices have been constructed: one based on a fixed cutoff distance obtained from the geometric characteristics of the SCTGs (TGC1), and the other based on a distance-dependent potential (TGC2). These matrices are compared with other popular SMs. The performance of the matrices was evaluated by comparing sequence alignments with structural alignments. The results show that TGC2 has the best performance among contact-based matrices, but on the whole, contact-based matrices perform slightly worse than other SMs, except at the level of fold similarity.

Keywords: contact potentials; alignment; protein structure prediction; twilight zone; fold recognition

Introduction

In analyzing protein structures, researchers often use information that is written as a matrix A = (aij), particularly for substitution matrices (SMs) and contact potentials (CPs). SMs carry information on the probability that a given amino acid is replaced by another type as a result of evolutionarily driven mutations, and they are used in assessing sequence alignments or similarity searches. Statistical CPs contain information about contact propensities among residues in native protein structures.

SMs play a key role in protein sequence alignment, and therefore in protein structure prediction methods. Although a promising way to improve the quality of the alignment between a query protein sequence and a template is to employ multiple sequence alignments to enrich evolutionary information (usually in the form of profile–profile alignments) (Gribskov, McLachlan, & Eisenberg, 1987; Jaroszewski, Rychlewski, & Godzik, 2000; Marti-Renom, Madhusudhan, & Sali, 2004; Wallace, Blackshields, & Higgins, 2005; Wang & Dunbrack, 2004), improving the SMs used for creating sequence alignments and profiles is also very important. A template-based protein structure prediction algorithm aims to align two proteins with a barely recognizable evolutionary relationship (i.e. superfamily-level similarity) and even protein pairs that are not evolutionarily related but share the same fold (fold-level similarity). To achieve this, a prediction algorithm should use a scoring scheme that reflects structural similarity and does not rely solely on the homologous relationship of proteins. Commonly used amino acid SMs derived from sets of homologous proteins, such as the BLOSUM (Henikoff & Henikoff, 1992) or PAM (Dayhoff, Barker, & Hunt, 1983) series, do not work well for distantly related proteins. If a better SM for distantly related proteins is developed, it will benefit not only the construction of pairwise alignments but also that of profile–profile alignments, because the new SM could produce a better profile for the query and template protein sequences.

Several SMs have been proposed for aligning distantly related protein sequences. These matrices employ structural information in different forms, such as structurally aligned protein sequences (Overington, Donnelly, Johnson, Sali, & Blundell, 1992; Prlic, Domingues, & Sippl, 2000), numerical optimization of the SM (Hourai, Akutsu, & Akiyama, 2004; Qian & Goldstein, 2002), and amino acid contact propensities (Managadze, Vishnepolsky, Grigolava, & Pirtskhalava, 2007; Miyazawa & Jernigan, 1993; Tan, Huang, & Kihara, 2006).

*Corresponding authors. Email: [email protected] (B. Vishnepolsky), [email protected] (M. Pirtskhalava)

Journal of Biomolecular Structure and Dynamics, Vol. 30, No. 2, 2012, 180–190

ISSN 0739-1102 print/ISSN 1538-0254 online. Copyright © 2012 Taylor & Francis. http://dx.doi.org/10.1080/07391102.2012.677769 http://www.tandfonline.com

In this study, we have derived new SMs, which are calculated from the similarity of the contact propensities of pairs of amino acid residues. The contact propensity of amino acids can be described in the form of a pairwise residue CP. CPs are used because the contact propensity of one amino acid for another is one of the most conserved features at each position of a protein structure.

The CP describes the pairwise interactions between the 20 side chains by a 20 × 20 matrix, the elements of which give the interaction strength between a pair of amino acids in contact. Two amino acid residues are in contact if the distance between them is less than a cutoff distance, Rc. The definition of Rc is very important for the performance of CPs. Typically, CPs are derived from known protein structures. Distance-dependent CPs are also used in fold recognition, for example, in the ab initio protein structure prediction program ROSETTA (Bradley et al., 2003).

The natural amino acids, except Gly and Pro, are identical in being represented as main-chain plus side-chain (Cβ) atoms, and the differences between amino acids are mainly revealed at the level of the side-chain terminal groups (SCTGs). We have defined SCTGs so that the distances between their heavy atoms are fixed and their geometry does not depend on side-chain conformation. Because SCTGs have no internal conformational freedom, contact distances can be defined more accurately. It is therefore reasonable to determine the existence of contacts between amino acids from the distances between SCTGs, but as far as we know this has not been done by other authors. In our previous work, we presented a statistical potential and a substitution matrix based on contacts between amino acid terminal groups (Managadze et al., 2007). In this paper, we have modified the methods for obtaining the CPs and SMs. Two ways of defining the cutoff distance Rc were used to determine the CPs: the first defines Rc from the geometric characteristics of the SCTGs (this is possible because the distances between heavy atoms in the terminal groups are fixed); the second uses distance-dependent CPs. The evaluation of SM performance in this paper is based on comparing the alignment accuracy of our matrices with that of other widespread matrices used for aligning distantly related protein sequences.

Methods

The terminal groups are selected so that the distances between their heavy atoms are fixed, under the approximation of fixed bond lengths and bond angles. We can therefore consider these groups as invariant geometrical formations and, consequently, it is reasonable to assess the distances between them using the distances between the geometrical centers of the SCTGs. The detailed assessment of the cutoff contact distances between SCTGs and of the SCTG geometrical parameters for all 20 amino acids is described in Managadze et al. (2007).

Definition of CP

Over the years, a variety of amino acid pair-specific potentials have been formulated (see the AAindex3 database; Kawashima, Pokarowski, Pokarowska, Kolinski, Katayama, & Kanehisa, 2008). In the classical approach, formulated originally by Tanaka and Scheraga (1976) and subsequently followed by many other investigators, one uses a library of native-like structures to extract the relative observed frequency of side-chain contacts between a given pair of amino acids. This frequency is then compared with that expected in some reference state in which specific side-chain interactions do not occur. Assuming that the contacts between amino acid pairs in the database of $N_{\mathrm{tot}}$ protein structures follow a Boltzmann distribution, the pair interaction energy between amino acids of type i and j is given by:

$$E(i, j) = -kT \ln\left(\frac{\sum_{l=1}^{N_{\mathrm{tot}}} N_{\mathrm{obs}}(i, j, l)}{\sum_{l=1}^{N_{\mathrm{tot}}} N_{\mathrm{exp}}(i, j, l)}\right) \tag{1}$$

The expected number of contacts in the quasi-chemical approximation is:

$$N_{\mathrm{exp}}(i, j, l) = C(l)\, f_i(l)\, f_j(l) \tag{2}$$

where the total number of observed contacts in the lth protein, $C(l)$, can be calculated from:

$$C(l) = \sum_{i} \sum_{j} N_{\mathrm{obs}}(i, j, l) \tag{3}$$

Here, $N_{\mathrm{obs}}(i, j, l)$ is the actual number of observed side-chain contacts between terminal groups of amino acids of type i and j in the lth protein. The frequency of residue type i in the lth protein, $f_i(l)$, can be calculated from:

$$f_i(l) = \frac{n_i(l)}{n_{\mathrm{tot}}(l)} \tag{4}$$

where $n_i(l)$ and $n_{\mathrm{tot}}(l)$ are the number of residues of type i and the total number of residues in the lth protein, respectively.
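The quasi-chemical calculation of Equations (1)–(4) can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `contact_potential` and the input layout (one dictionary of contact counts and one of residue counts per structure) are assumptions, and contacts are assumed to be stored for ordered pairs, so that (i, j) and (j, i) both appear, matching the double sum in Equation (3).

```python
import math
from collections import defaultdict


def contact_potential(proteins):
    """Return E(i, j)/kT for every residue-type pair with observed contacts.

    proteins: list of (n_obs, composition) tuples, one per structure, where
      n_obs maps an ordered pair (i, j) to its observed contact count and
      composition maps a residue type to its number of residues.
    (Hypothetical input layout, chosen for illustration only.)
    """
    obs_total = defaultdict(float)
    exp_total = defaultdict(float)
    for n_obs, composition in proteins:
        n_tot = sum(composition.values())
        contacts = sum(n_obs.values())                          # C(l), Eq. (3)
        freq = {a: n / n_tot for a, n in composition.items()}   # f_i(l), Eq. (4)
        for pair, count in n_obs.items():
            obs_total[pair] += count
        for i, f_i in freq.items():
            for j, f_j in freq.items():
                exp_total[(i, j)] += contacts * f_i * f_j       # Eq. (2)
    # Eq. (1), expressed in units of kT; pairs never observed are skipped
    return {pair: -math.log(obs_total[pair] / exp_total[pair])
            for pair in obs_total if obs_total[pair] > 0}
```

Pairs that occur exactly as often as expected from composition receive a value of zero; pairs that are over-represented receive negative (favorable) values.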


Definition of fixed cutoff-based potentials

Formulas (1)–(4) are used to define the SM based on a fixed contact cutoff distance Rc.

Definition of distance-dependent potentials

Distance-dependent CPs are defined by formulas analogous to Equations (1)–(3):

$$E(i, j, r) = -kT \ln\left(\frac{\sum_{l=1}^{N_{\mathrm{tot}}} N_{\mathrm{obs}}(i, j, l, r)}{\sum_{l=1}^{N_{\mathrm{tot}}} N_{\mathrm{exp}}(i, j, l, r)}\right) \tag{5}$$

$$N_{\mathrm{exp}}(i, j, l, r) = C(l, r)\, f_i(l)\, f_j(l) \tag{6}$$

$$C(l, r) = \sum_{i} \sum_{j} N_{\mathrm{obs}}(i, j, l, r) \tag{7}$$

Here, $N_{\mathrm{obs}}(i, j, l, r)$ is the actual number of observed pairs of amino acids of types i and j in the lth protein for which the distance between SCTGs lies within the interval [r − Δr, r + Δr]. In this paper, r is varied from 2 to 20 Å in steps of 2 Å, and Δr = 1 Å.
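As an illustration of the distance binning, the sketch below assigns an SCTG–SCTG distance to one of the 10 shells centered at r = 2, 4, …, 20 Å with Δr = 1 Å and counts pairs per shell. The names `distance_bin` and `count_pairs_by_shell` are hypothetical, and treating each interval as half-open, [r − Δr, r + Δr), is an assumption made here so that a distance on a shared boundary is counted only once; the text does not state how boundaries are handled.

```python
from collections import Counter

BIN_CENTERS = list(range(2, 21, 2))  # r = 2, 4, ..., 20 (angstroms)
DELTA_R = 1.0                        # half-width of each shell (angstroms)


def distance_bin(distance):
    """Return the index k of the shell containing `distance`, or None if outside all shells."""
    for k, r in enumerate(BIN_CENTERS):
        if r - DELTA_R <= distance < r + DELTA_R:
            return k
    return None


def count_pairs_by_shell(pairs):
    """pairs: iterable of (type_i, type_j, distance); returns counts keyed by (type_i, type_j, k)."""
    counts = Counter()
    for type_i, type_j, distance in pairs:
        k = distance_bin(distance)
        if k is not None:
            counts[(type_i, type_j, k)] += 1
    return counts
```

The per-shell counts play the role of the observed numbers in Equation (5); the expected numbers follow from Equations (6) and (7) applied shell by shell.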

Derivation of the SMs matrices from CPs

To evaluate the performance of the CPs, we derive SMs from them. The derivation of SMs from CPs is based on the idea that the replacement of a given amino acid by another type can be evaluated by the difference between the contact preferences of the two amino acids. Hence, for a quantitative evaluation of similarity, we compute the correlation coefficient between the CP values in the columns of the two amino acids. The elements of the SM are obtained by the formula:

$$sm_{ij} = \frac{\sum_{k=1}^{K} \sum_{m=1}^{20} (cp_{imk} - \overline{cp}_i)(cp_{jmk} - \overline{cp}_j)}{\sqrt{\sum_{k=1}^{K} \sum_{m=1}^{20} (cp_{imk} - \overline{cp}_i)^2 \; \sum_{k=1}^{K} \sum_{m=1}^{20} (cp_{jmk} - \overline{cp}_j)^2}} \tag{8}$$

Here, K = 1 for the fixed cutoff-based CP and K = 10 for the distance-dependent CP, where the contact distance varies over the 10 intervals described in the previous section; cp denotes the elements of the CP obtained from Equations (1) and (5) divided by −kT.
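A minimal sketch of how an off-diagonal SM element could be computed from a CP is given below. It assumes that the correlation in Equation (8) is the standard Pearson coefficient and that the mean for each amino acid is taken over all partner types m and all shells k; the text does not spell out the averaging, so both points are assumptions. The nested-list layout `cp[k][a][m]` and the function names are also hypothetical.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                    * sum((b - mean_y) ** 2 for b in y))
    return num / den


def similarity(cp, i, j):
    """Similarity of residue types i and j.

    cp[k][a][m] is the CP value between residue types a and m in distance
    shell k (K = 1 shell for a fixed cutoff, K = 10 for distance dependence).
    """
    profile_i = [value for shell in cp for value in shell[i]]
    profile_j = [value for shell in cp for value in shell[j]]
    return pearson(profile_i, profile_j)
```

Under these assumptions, comparing an amino acid with itself always yields 1, which is why the diagonal of the SM has to be defined separately.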

Assessment of substitution matrix diagonal elements

The diagonal elements of the matrices obtained from the CPs by Equation (8) are all equal to 1. However, the diagonal elements play an important role in SM performance, because they determine the nonsubstitutability of the amino acids, i.e. their originality. We cannot obtain diagonal elements directly from the CP, so we use another approach to define them.

A detailed description of the diagonal element definition was given by Managadze et al. (2007). We estimated the originality of amino acids by evaluating the possibility of being in contact simultaneously with n (n = 0, 1, 2, …, 6) other amino acids. The frequency $F_i^n$ is defined as

$$F_i^n = \frac{N_i^n}{N_i} \tag{9}$$

where $N_i^n$ is the number of formations in which n SCTGs (irrespective of type) are in contact with an SCTG of the ith type; $N_i$ is the number of SCTGs of the ith type; and n = 0, 1, …, 6 is the number of amino acids simultaneously in contact with the ith amino acid (i = 1, …, 20).

The diagonal elements are calculated on the assumption that amino acids with a prominent contact ability or contact disability play a key role in folding. The contact ability $p_i$ of the ith amino acid is given by

$$p_i = \sum_{n=1}^{6} F_i^n \tag{10}$$

The substitution ability of an amino acid is characterized by the ratio of $p_i$ to $p_0$, where $p_0$ corresponds to the amino acid with the least-expressed originality (the results show that it is Ala (Managadze et al., 2007)). Hence, the diagonal elements of the matrix were determined in the following way:

$$sm_{jj} = \begin{cases} p_j / p_0, & \text{when } p_j > p_0 \\ p_0 / p_j, & \text{when } p_0 > p_j \end{cases} \tag{11}$$

and j = 1, …, 20.
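The diagonal rule of Equations (10) and (11) can be illustrated as follows. The function names and the list layout of the frequencies are hypothetical, and the numbers in any call are made up for illustration; they are not values from the paper.

```python
def contact_ability(frequencies):
    """p_i of Eq. (10): sum of F_i^n over n = 1..6.

    frequencies[n] holds F_i^n for n = 0, 1, ..., 6 (hypothetical layout).
    """
    return sum(frequencies[1:7])


def diagonal_element(p_j, p_0):
    """Diagonal SM element of Eq. (11): ratio of the larger to the smaller of p_j and p_0.

    The case p_j == p_0 is not listed in Eq. (11); both branches give 1 there.
    """
    return p_j / p_0 if p_j > p_0 else p_0 / p_j
```

By construction the diagonal element is never below 1, and it equals 1 for the reference amino acid itself, consistent with the Ala entry of 1.00 on the diagonal of Table 2.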

Benchmarks

To obtain the CP and SM elements, we use the FSSP2 library (Vishnepolsky, Managadze, & Pirtskhalava, 2008), which contains 637 structures selected from FSSP (Holm, Ouzounis, Sander, Tuparev, & Vriend, 1992) using the filter criterion that the Dali z-score for structural similarity between set members is < 2.

The alignment accuracy is tested on the Twilight Zone set of version 1.65 of the SABmark reference alignment database (VanWalle, Lasters, & Wyns, 2005), which contains single-domain sequences with low sequence similarity. The sequences of the Twilight Zone set are taken from a SCOP (Murzin, Brenner, Hubbard, & Chothia, 1995) subset provided by the ASTRAL compendium, in which domains have a pairwise BLAST E-value of at least 1 for a theoretical database size of 10^8 residues (Altschul et al., 1997; Chandonia et al., 2004). The Twilight Zone set contains 10,667 sequence pairs, based on 1740 sequences and grouped into 209 folds.


The alignment accuracy is tested both on the full Twilight Zone set and separately at the family, superfamily, and fold levels according to the SCOP classification.

Evaluation of the performance of SMs

The alignment accuracy has been evaluated by comparing sequence alignments with structural alignments. To evaluate the performance of the obtained SMs, we compare the matrices on the basis of alignment accuracy.

The following matrices have been considered:

BLOSUM45 (Henikoff & Henikoff, 1992): BLOSUM is the most frequently used matrix group for sequence alignments and database searches (Altschul, Gish, Miller, Myers, & Lipman, 1990; Altschul et al., 1997). BLOSUM45 is chosen because it is the matrix used for distantly related sequences in BLAST (Altschul et al., 1990) and PSI-BLAST (Altschul et al., 1997).

GONNET (Gonnet, Cohen, & Benner, 1992): one of the popular SMs, which performs well in fold recognition experiments (Fischer, Rice, & Eisenberg, 1996) and is used for distantly related sequences.

PRLA2 (Prlic et al., 2000) and JOHM (Johnson & Overington, 1993): structure-based matrices constructed from structural alignments.

QUIB (Qian & Goldstein, 2002): the best-performing matrix for aligning sequences at the family and superfamily levels according to Tan et al. (2006).

All the abovementioned matrices are constructed from alignments between sequences and are referred to here as alignment-based (AB) matrices.

MJ (27) and CC80 (12): SMs based on contacts between amino acids.

TGC0: the terminal group-based matrix taken from Managadze et al. (2007). For ease of computation, the elements of the matrix from that paper were divided by 10.

RANDOM: a random amino acid similarity matrix, in which each element is assigned a value drawn by a random number generator between −3 and 3.

The global dynamic programming method (Needleman & Wunsch, 1970) is used for the alignment of protein sequence pairs. The alignments were made with ALIGN_MTX (Vishnepolsky & Pirtskhalava, 2009).
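For readers unfamiliar with global dynamic programming, the following sketch computes the optimal global alignment score with separate gap opening and gap extension penalties, using the three-state (Gotoh) form of the Needleman–Wunsch recursion. It is not ALIGN_MTX, whose internals are not described here. The function name, the `score` callback, and the convention that a gap of length L costs `gap_open + (L - 1) * gap_extend` are all assumptions for illustration.

```python
NEG_INF = float("-inf")


def global_alignment_score(seq_a, seq_b, score, gap_open, gap_extend):
    """Best global alignment score of seq_a and seq_b with affine gap penalties.

    score(a, b) returns the substitution-matrix entry for residues a and b.
    Assumed convention: a gap of length L costs gap_open + (L - 1) * gap_extend.
    """
    n, m = len(seq_a), len(seq_b)
    # M: last column is a residue pair; X: gap in seq_b; Y: gap in seq_a
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(seq_a[i - 1], seq_b[j - 1])
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Recovering the alignment itself would require a traceback over the three tables, which is omitted to keep the sketch short.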

We have used the Cline shift score (Cline, Hughey, & Karplus, 2002) as a measure of alignment quality. This score gives a more accurate measure of alignment performance than the often-used modeler and developer scores, i.e. the fraction of residues aligned correctly as a proportion of the length of the predicted and structural alignments, respectively. The modeler and developer scores alone do not reflect under-alignment and over-alignment, respectively (Edgar & Sjolander, 2004; Sauder et al., 2000). The Cline score takes into account the information in both of these scores and also considers the shift between the predicted and structural alignments. It takes values between −.2 and 1 (in this paper, the scores are multiplied by 100) and can be negative when there is large disagreement between the predicted and structural alignments.
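The modeler and developer scores mentioned above are simple to compute once each alignment is written as a set of aligned residue-index pairs. The sketch below shows only these two scores; the Cline shift score itself is not implemented here. The function name and the set-of-pairs representation are assumptions.

```python
def modeler_developer_scores(predicted, reference):
    """Return (modeler, developer) scores for two alignments.

    Each alignment is an iterable of (index_in_seq_a, index_in_seq_b) pairs.
    modeler   = correctly aligned pairs / length of the predicted alignment
    developer = correctly aligned pairs / length of the structural alignment
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    modeler = correct / len(predicted) if predicted else 0.0
    developer = correct / len(reference) if reference else 0.0
    return modeler, developer
```

A prediction that aligns only a few residues, all correctly, gets a perfect modeler score, while one that aligns many extra residues can still get a perfect developer score. Neither number alone describes alignment quality, which is the motivation given in the text for a combined measure.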

The gap penalties were optimized on the training set, which contains 209 sequence pairs representing each fold in the Twilight Zone set of the SABmark database.

The gap opening and gap extension penalties were optimized by maximizing the average Cline shift score over all pairs of the set. In the first step, the gap opening and gap extension penalties were varied from 0 to 30 in steps of 2; then, starting from the best values, a minimization program was used to maximize the Cline score.
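The first, coarse step of this optimization amounts to a grid search. A minimal sketch is shown below; `mean_score` stands for any user-supplied function that aligns the training pairs with the given penalties and returns their average Cline score, and both it and the function name are hypothetical. The subsequent local refinement step is not shown.

```python
def grid_search_gap_penalties(mean_score, values=range(0, 31, 2)):
    """Return (best_score, gap_open, gap_extend) over a grid of penalty values.

    mean_score(gap_open, gap_extend) is a hypothetical callback returning the
    average alignment score on the training set for the given penalties.
    """
    best = None
    for gap_open in values:
        for gap_extend in values:
            s = mean_score(gap_open, gap_extend)
            if best is None or s > best[0]:
                best = (s, gap_open, gap_extend)
    return best
```

With the default grid this evaluates 16 × 16 = 256 penalty combinations per matrix.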

Results

The CP for the fixed cutoff distance and the two obtained SMs, TGC1 (based on the fixed cutoff distance) and TGC2 (based on the distance-dependent CP), are presented in Tables 1 and 2.

Correlations between SMs

The correlation coefficients between different SMs are presented in Table 3. The data show that AB matrices have high correlations with each other (the correlation coefficients lie within the interval .84–.92), while contact-based SMs show lower correlations both with each other and with AB matrices (the correlation coefficients lie within the interval .51–.82). The TGC2 matrix has the largest correlations with other matrices among contact-based SMs (.64–.78), but these are nevertheless lower than the correlations between AB matrices. Hence, it can be said that contact-based SMs mainly reflect factors determining amino acid similarity other than those reflected by AB matrices (see also the Discussion section). As pointed out above, in spite of their common physical basis, contact-based matrices are less similar to each other than AB matrices are. Furthermore, even the terminal group-based matrices (TGC0, TGC1, TGC2) have a lower maximum correlation (.82) than the minimum correlation between AB matrices (.84). It is therefore essential to define correctly parameters such as the contact cutoff distances for these matrices. The correctness of those parameters is evaluated in alignment experiments.

Evaluation of alignment performance of SMs

The terminal group-based SM (TGC0) was first presented in our previous paper (Managadze et al., 2007), but its performance was not evaluated. Here, we use a different method to generate the new SMs (TGC1 and TGC2). The performance of TGC0, TGC1,


Table 1. CP for fixed cutoff contact distance.

        ALA   ARG   ASN   ASP   CYS   GLN   GLU   GLY   HIS   ILE   LEU   LYS   MET   PHE   PRO   SER   THR   TRP   TYR   VAL
ALA    1.93
ARG     .36  1.08
ASN    −.07  −.09   .67
ASP     .16 −1.13  −.43  1.09
CYS     .20   .33   .10  −.05 −1.97
GLN     .25  −.14  −.41  −.15  −.09   .85
GLU     .45 −1.09  −.05   .58   .50   .17  1.50
GLY    1.74   .48   .25   .05   .58   .25   .38  2.81
HIS     .21   .11  −.08  −.58 −1.39  −.07  −.46   .23   .02
ILE    −.20   .37   .33   .27  −.55   .12   .48   .48   .21   .30
LEU    −.56   .00   .01   .24  −.94  −.19   .17   .26  −.35  −.36  −.17
LYS     .89   .75  −.01  −.84   .38  −.04  −.96  1.19   .54  1.45   .85  1.88
MET    −.47   .08   .08   .58  −.79  −.21   .28   .15   .30  −.32  −.65   .43   .38
PHE    −.45  −.03   .25   .74  −.52   .03   .47   .09  −.17  −.47  −.96   .48  −.83   .48
PRO     .69   .49   .65   .70   .01   .34   .53  1.37   .12   .99   .78  1.89   .50   .00  2.71
SER    1.07   .23  −.16  −.61  1.05   .03  −.28  1.19  −.28   .40   .35   .44   .44   .79   .74  1.43
THR     .02   .12  −.46  −.62  −.19  −.38  −.35   .28  −.28   .50   .31   .64   .15   .31   .49  −.17   .48
TRP    −.87  −.86  −.70  −.32  −.72  −.79  −.43  −.68 −1.14 −1.14 −1.31  −.55 −1.15 −1.07 −1.06  −.45  −.61  −.18
TYR    −.26  −.14   .13   .95  −.29  −.13   .52  −.17  −.29  −.31  −.66  −.03  −.56  −.09  −.46   .77  −.02  −.77   .92
VAL    −.40   .21   .15   .44  −.81   .00   .35   .37  −.11  −.49  −.52  1.19  −.34  −.61   .81   .40   .08 −1.19  −.44   .17

Note: The elements of the matrix are obtained from Gribskov et al. (1987): cpij = E(i, j)/kT.


Table 2. SMs: TGC1 (top half) and TGC2 (bottom half) obtained from fixed and variable cutoff distance, respectively.

        ALA   ARG   ASN   ASP   CYS   GLN   GLU   GLY   HIS   ILE   LEU   LYS   MET   PHE   PRO   SER   THR   TRP   TYR   VAL
ALA    1.00   .37   .11  −.21   .59   .51   .01   .89   .39   .54   .51   .31   .42   .40   .51   .50   .15   .20   .21   .51
ARG     .16  1.22   .41  −.48   .06   .23  −.50   .53   .54   .35   .24   .85   .10  −.09   .46   .59   .69  −.48  −.34   .24
ASN     .35   .54  1.12   .24   .01   .21   .27   .36   .40   .26   .18   .49   .13   .03   .52   .42   .40  −.53  −.10   .21
ASP     .07   .11   .53  1.18  −.16   .20   .84  −.12  −.12  −.28  −.22  −.22  −.12  −.06   .02   .06  −.05  −.19   .20  −.17
CYS     .58  −.03   .06   .01  1.54   .27  −.07   .48   .45   .59   .64   .02   .51   .64   .40   .16   .08   .44   .50   .64
GLN     .42   .50   .73   .53   .16  1.10   .36   .49   .41   .35   .30   .26   .22   .25   .45   .36   .17  −.27   .08   .35
GLU     .00   .11   .45   .92  −.08   .56  1.09   .05  −.21  −.20  −.15  −.31  −.10   .01   .03   .10  −.15  −.08   .22  −.15
GLY     .72   .33   .59   .20   .39   .50   .09  1.21   .47   .51   .51   .57   .35   .23   .67   .64   .42  −.04  −.03   .49
HIS     .48   .31   .47   .27   .70   .52   .20   .49  1.23   .59   .48   .61   .50   .18   .60   .20   .57  −.44  −.08   .54
ILE     .37  −.11   .05  −.13   .23   .09  −.11   .39   .22  1.09   .94   .42   .80   .65   .78   .02   .40   .11   .30   .90
LEU     .47  −.01   .10  −.10   .36   .25  −.05   .35   .36   .88  1.26   .35   .82   .63   .81   .00   .34   .16   .32   .93
LYS     .04   .93   .52   .10  −.15   .50   .14   .28   .24  −.05   .01  1.08   .13  −.14   .70   .52   .85  −.55  −.46   .33
MET     .61   .03   .19   .03   .47   .39   .09   .36   .50   .69   .88   .01  1.22   .59   .62  −.12   .09   .12   .38   .84
PHE     .47  −.08   .04  −.04   .38   .21   .00   .25   .35   .81   .89  −.08   .86  1.28   .29   .00  −.04   .44   .76   .66
PRO     .35   .12   .41   .05   .14   .39   .09   .63   .37   .66   .57   .21   .48   .44  1.37   .28   .58  −.24  −.07   .78
SER     .49   .56   .63   .44   .32   .60   .31   .57   .42  −.04   .01   .47   .12   .07   .13  1.11   .45  −.25  −.01   .01
THR     .35   .40   .60   .24   .16   .49   .15   .63   .41   .59   .48   .45   .42   .41   .60   .46  1.13  −.51  −.31   .26
TRP     .33  −.11   .00   .09   .27   .14   .11   .18   .20   .64   .64  −.17   .59   .79   .27   .13   .32  1.60   .50   .15
TYR     .37  −.08   .15   .11   .30   .29   .14   .22   .37   .68   .75  −.10   .76   .87   .39   .15   .41   .79  1.26   .36
VAL     .47  −.08   .11  −.11   .33   .20  −.09   .45   .31   .95   .93  −.03   .79   .85   .67   .01   .60   .62   .71  1.19

Note: Diagonal elements of both matrices are identical.


Table 3. Correlation coefficient between different SMs.

Columns are in the same order as the rows: the first five are contact-based matrices and the last five are AB matrices.

                 TGC2     TGC1     TGC0       MJ     CC80   GONNET    PRLA2     JOHM     QUIB BLOSUM45
Contact-based matrices
TGC2             1.00      .78      .76      .64      .77      .74      .75      .72      .70      .75
TGC1              .78     1.00      .82      .51      .52      .62      .61      .58      .57      .59
TGC0              .76      .82     1.00      .56      .56      .71      .71      .71      .67      .73
MJ                .64      .51      .56     1.00      .62      .76      .76      .70      .72      .66
CC80              .77      .52      .56      .62     1.00      .65      .67      .58      .61      .62
AB matrices
GONNET            .74      .62      .71      .76      .65     1.00      .91      .89      .94      .91
PRLA2             .75      .61      .71      .76      .67      .91     1.00      .86      .84      .88
JOHM              .72      .58      .71      .70      .58      .89      .86     1.00      .84      .92
QUIB              .70      .57      .67      .72      .61      .94      .84      .84     1.00      .85
BLOSUM45          .75      .59      .73      .66      .62      .91      .88      .92      .85     1.00


Table 4. Gap penalties optimization on the training set of the SABmark benchmark.

Matrix      Gap opening penalty   Gap extension penalty   Cline score
BLOSUM45                  18.60                    1.46         15.75
GONNET                    11.88                     .42         17.19
PRLA2                     26.85                    1.45         17.16
QUIB                      11.41                     .00         17.34
JOHM                      18.61                    1.19         15.95
MJ                         2.10                     .14         13.93
CC80                        .79                     .00         13.00
TGC2                       1.80                     .00         16.10
TGC1                       1.87                     .02         13.49
TGC0                      18.53                    1.64         13.47
RANDOM                    17.11                   12.63          1.97

[Figure 1 image not reproduced. Four panels, (a)–(d), each plot alignment accuracy (vertical axis) for the contact-based matrices TGC2, TGC1, TGC0, MJ, and CC80, the AB matrices BLOSUM45, GONNET, PRLA2, QUIB, and JOHM, and the RANDOM matrix (horizontal axis).]

Figure 1. Alignment performance of different SMs at the (a) all, (b) family, (c) superfamily, and (d) fold levels in the Twilight Zone set of the SABmark benchmark. The distribution of Cline shift scores (10th, 25th, 50th, 75th, and 90th percentiles estimated by bootstrap simulation with 10,000 resamplings of the data set) is reported for each method.
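One plausible reading of the bootstrap procedure in the caption is sketched below: the score list is resampled with replacement, the mean of each resample is recorded, and percentiles of the resulting distribution are reported. Whether the authors resampled means or another statistic is not stated, so this is an assumption, as are the function name, the fixed seed, and the simple rank-based percentile rule.

```python
import random


def bootstrap_percentiles(scores, n_resamples=10000,
                          percentiles=(10, 25, 50, 75, 90), seed=0):
    """Percentiles of the bootstrap distribution of the mean of `scores`."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(scores, k=len(scores))   # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    result = {}
    for p in percentiles:
        index = min(len(means) - 1, int(p / 100 * len(means)))
        result[p] = means[index]
    return result
```

The spread between the 10th and 90th percentiles gives a rough indication of how much the average score of a matrix would vary under resampling of the test pairs.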


and TGC2 was evaluated by comparing their alignment accuracy with that of the other SMs.

Training set

The training set (see Methods section) was used for the optimization of gap penalties.

The optimized gap opening and gap extension penalties and the corresponding shift scores are presented in Table 4.
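As an illustration of what such an optimization involves, the following minimal Python sketch may be helpful. It assumes one common affine convention (a gap of k residues costs the opening penalty plus (k - 1) times the extension penalty) and an exhaustive grid search. Neither the convention nor the search procedure is stated in this section, so both are assumptions. The names `affine_gap_cost`, `grid_search`, and `toy_objective` are hypothetical; in practice the objective would align the training pairs with the given penalties and return the mean Cline score.

```python
from itertools import product


def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of a gap of `length` residues under an assumed affine convention:
    the first gapped residue costs gap_open, each further one gap_extend."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def grid_search(score_fn, open_values, extend_values):
    """Return (gap_open, gap_extend, score) maximizing score_fn on the grid.

    score_fn is a hypothetical callable: (gap_open, gap_extend) -> score.
    """
    best = None
    for gap_open, gap_extend in product(open_values, extend_values):
        score = score_fn(gap_open, gap_extend)
        if best is None or score > best[2]:
            best = (gap_open, gap_extend, score)
    return best


def toy_objective(gap_open, gap_extend):
    """Stand-in for 'mean Cline score on the training set' (illustration only)."""
    return -(gap_open - 2.0) ** 2 - (gap_extend - 0.5) ** 2


best_open, best_extend, best_score = grid_search(
    toy_objective, [0.0, 1.0, 2.0, 3.0], [0.0, 0.5, 1.0]
)
```

Under this convention, the GONNET penalties in Table 4 would charge 11.88 + 2 × 0.42 = 12.72 for a gap of three residues.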

Testing set

Alignment accuracy was tested on the full Twilight Zone set of the SABmark database, excluding the 209 pairs used in the training set. To evaluate alignment performance, the average Cline score over the pairwise alignments in each of the 209 folds was calculated, and these fold-level scores were then averaged to give the final Cline score. These scores were used to compare the SMs with each other. The corresponding data are presented in Figure 1(a).
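The two-stage averaging described above can be sketched as follows. This is only an illustration: the function name `fold_averaged_score` and the input layout, a list of (fold identifier, Cline score) tuples, are assumptions rather than part of the original protocol.

```python
from collections import defaultdict
from statistics import mean


def fold_averaged_score(pair_scores):
    """Average per-pair scores within each fold, then average the fold means.

    pair_scores: iterable of (fold_id, cline_score) tuples (assumed layout).
    Returns (final_score, {fold_id: fold_mean}).
    """
    by_fold = defaultdict(list)
    for fold_id, score in pair_scores:
        by_fold[fold_id].append(score)
    if not by_fold:
        raise ValueError("no alignment scores supplied")
    fold_means = {fold_id: mean(scores) for fold_id, scores in by_fold.items()}
    return mean(list(fold_means.values())), fold_means


# Toy data: fold "f1" has two pairs, fold "f2" has one.
example = [("f1", 10.0), ("f1", 20.0), ("f2", 30.0)]
final_score, per_fold = fold_averaged_score(example)
```

With the toy data, the fold means are 15.0 and 30.0, so the final score is 22.5, whereas a plain mean over the three pairs would give 20.0. Averaging within folds first keeps folds with many pairs from dominating the result.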

In Figure 1(b), 1(c), and 1(d), the corresponding scores are presented separately for the family, superfamily, and fold levels, respectively. The PRLA2 and GONNET matrices have the best performance at the family and superfamily levels. Figure 1 shows that, except at the fold level, contact-based matrices have lower performance than AB matrices. At the fold level, SMs based on CPs (except TGC0) perform better than the other matrices, but the performance of all matrices at this level is very low. TGC1 and TGC0 perform similarly at all levels except fold, where TGC1 appears better than TGC0. TGC2 has the best performance among contact-based SMs at all levels except fold, where CC80 performs slightly better.
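The percentile distributions reported in Figure 1 are described as bootstrap estimates. A minimal sketch of such an estimate is given below. It assumes that the resampled units are the fold-level average scores and that the statistic of interest is their mean; the caption does not specify either choice, and the function names `percentile` and `bootstrap_percentiles` are hypothetical.

```python
import random
from statistics import mean


def percentile(sorted_values, q):
    """Percentile q (0 to 100) of an ascending list, by linear interpolation."""
    if not sorted_values:
        raise ValueError("no values supplied")
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def bootstrap_percentiles(scores, n_resamples=10_000,
                          qs=(10, 25, 50, 75, 90), seed=0):
    """Resample `scores` with replacement and return percentiles of the mean.

    scores: list of fold-level scores (assumed resampling unit).
    """
    if not scores:
        raise ValueError("no scores supplied")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    return {q: percentile(resampled_means, q) for q in qs}
```

Calling `bootstrap_percentiles(fold_scores)` for each matrix would give the five values needed to draw one distribution in a panel of Figure 1 under these assumptions.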

Discussion

The folding process is driven by opposing thermodynamic forces: the system entropy and the energy (enthalpy) of the residue–residue and residue–solvent interactions. The structural characteristics used in fold recognition can be divided into three groups according to the forces they reflect: (1) the first group includes characteristics connected with the given conformations or the type of secondary structure of the residue (Ayers, Gooley, Widmer-Cooper, & Torda, 1999; Kim, Xu, Guo, Ellrott, & Xu, 2003; Zhou & Zhou, 2004), mainly reflecting the results of short- and medium-range interactions (i.e. intra-residue interactions or interactions between residues located close to each other in the sequence); (2) the second group includes characteristics connected with the entropy, for example, the solvent-accessible surface area and the distance between the residue and the center of the globule (Bowie, Luethy, & Eisenberg, 1991; Kim et al., 2003; Koehl & Delarue, 1994; Zhou & Zhou, 2004), reflecting the results of the hydrophobic effect; and (3) the third group includes characteristics connected with tightly packed side chains, reflecting specific interactions.

Obviously, the elements of matrices based on alignment between sequences (AB) should reveal the influence of every factor causing evolutionary changes. Thus, these elements reflect all three groups of structural characteristics. Contact-based SMs, in contrast, mainly reflect specific interactions between amino acids and thus correspond to the third group of structural characteristics. Therefore, the lower correlation coefficients between contact-based and AB matrices can be explained by the fact that contact-based SMs mainly reflect specific interactions, while AB matrices reflect all types of interactions. This may be one of the reasons that contact-based SMs have slightly lower performance than AB matrices. However, including these SMs or CPs in combined threading potentials may improve the performance of those potentials (Fang & Shortle, 2003; Jones & McGuffin, 2003; Panchenko, Marchler-Bauer, & Bryant, 2000; Pillardy et al., 2001; Shao & Bystroff, 2003; Skolnick et al., 2003; Wedemeyer et al., 2003; Xia, Huang, Levitt, & Samudrala, 2000). These SMs can also improve profile–profile alignments, because some factors that are not reflected in the construction of contact-based SMs can be supplemented with data obtained from multiple sequence alignments. It can also be noted that the MJ matrix includes information on the hydrophobic effect, but in spite of this, the TGC2 matrix has better performance at almost all levels. Thus, we can say that using the terminal groups improves the performance of the SM relative to other contact-based matrices. The TGC2 matrix shows better performance than TGC1 and TGC0. This suggests that the distance-dependent potential reduces the influence of errors in the definition of the contact cutoff distances used in the TGC1 and TGC0 matrices. We have included the TGC2 matrix in the SORDIS threading potential (Vishnepolsky et al., 2008), and the results show that the resulting potential gives better performance (Vishnepolsky & Pirtskhalava, 2012).

Acknowledgment

This work was carried out with the financial support of the Georgia National Science Foundation (Grant GNSF/ST07/6-239).

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., & Lipman, D.J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215, 403–410.

Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 3389–3402.


Ayers, D.J., Gooley, P.R., Widmer-Cooper, A., & Torda, A.E. (1999). Enhanced protein fold recognition using secondary structure information from NMR. Protein Science, 8, 1127–1133.

Bowie, J.U., Luethy, R., & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164–169.

Bradley, P., Chivian, D., Meiler, J., Misura, K.M.S., Rohl, C.A., Schief, W.R., … Baker, D. (2003). Rosetta predictions in CASP5: Successes, failures, and prospects for complete automation. Proteins, 53, 457–468.

Chandonia, J.-M., Hon, G., Walker, N.S., Conte, L.L., Koehl, P., Levitt, M., & Brenner, S.E. (2004). The ASTRAL compendium in 2004. Nucleic Acids Research, 32, D189–D192 (Database issue).

Cline, M., Hughey, R., & Karplus, K. (2002). Predicting reliable regions in protein sequence alignments. Bioinformatics, 18, 306–314.

Dayhoff, M.O., Barker, W.C., & Hunt, L.T. (1983). Establishing homologies in protein sequences. Methods in Enzymology, 91, 524–545.

Edgar, R.C., & Sjolander, K. (2004). A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, 20, 1301–1308.

Fang, Q., & Shortle, D. (2003). Prediction of protein structure by emphasizing local side-chain/backbone interactions in ensembles of turn fragments. Proteins, 53, 486–490.

Fischer, D., Elofsson, A., Rice, D.W., LeGrand, S., & Eisenberg, D. (1996). Assessing the performance of fold recognition methods by means of a comprehensive benchmark. In L. Hunter & T. Klein (Eds.), Biocomputing: Proceedings of the 1996 Pacific Symposium (pp. 300–318). Singapore: World Scientific Publishing Co. Pac Symp Biocomput.

Gonnet, G.H., Cohen, M.A., & Benner, S.A. (1992). Exhaustive matching of the entire protein sequence database. Science, 256, 1433–1445.

Gribskov, M., McLachlan, A.D., & Eisenberg, D. (1987). Profile analysis: Detection of distantly related proteins. Proceedings of the National Academy of Sciences of the United States of America, 84, 4355–4358.

Henikoff, S., & Henikoff, J.G. (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89, 10915–10919.

Holm, L., Ouzounis, C., Sander, C., Tuparev, G., & Vriend, G. (1992). A database of protein structure families with common folding motifs. Protein Science, 1, 1691–1698.

Hourai, Y., Akutsu, T., & Akiyama, Y. (2004). Optimizing substitution matrices by separating score distributions. Bioinformatics, 20, 863–873.

Jaroszewski, L., Rychlewski, L., & Godzik, A. (2000). Improving the quality of twilight-zone alignments. Protein Science, 9, 1487–1496.

Johnson, M.S., & Overington, J.P. (1993). A structural basis for sequence comparisons. An evaluation of scoring methodologies. Journal of Molecular Biology, 233, 716–738.

Jones, D.T., & McGuffin, L.J. (2003). Assembling novel protein folds from supersecondary structural fragments. Proteins, 53, 480–485.

Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T., & Kanehisa, M. (2008). AAindex: Amino acid index database, progress report 2008. Nucleic Acids Research, 36, D202–D205 (Database issue).

Kim, B.D., Xu, D., Guo, J.-t., Ellrott, K., & Xu, Y. (2003). PROSPECT II: Protein structure prediction program for genome-scale applications. Protein Engineering, 16, 641–650.

Koehl, P., & Delarue, M. (1994). Polar and non-polar atomic environments in the protein core: Implication for folding and binding. Proteins, 20, 264–278.

Managadze, G., Vishnepolsky, B., Grigolava, M., & Pirtskhalava, M. (2007). Amino acid similarities based on side chain terminal group contacts. Journal of Biological Physics and Chemistry, 7, 7–14.

Marti-Renom, M.A., Madhusudhan, M.S., & Sali, A. (2004). Alignment of protein sequences by their profiles. Protein Science, 13, 1071–1087.

Miyazawa, S., & Jernigan, R.L. (1993). A new substitution matrix for protein sequence searches based on contact frequencies in protein structures. Protein Engineering, 6, 267–278.

Murzin, A.G., Brenner, S.E., Hubbard, T., & Chothia, C.J. (1995). SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247, 536–540.

Needleman, S.B., & Wunsch, C.D. (1970). A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48, 443–453.

Overington, J., Donnelly, D., Johnson, M.S., Sali, A., & Blundell, T.L. (1992). Environment-specific amino acid substitution tables: Tertiary templates and prediction of protein folds. Protein Science, 1, 216–226.

Panchenko, A.R., Marchler-Bauer, A., & Bryant, S.H. (2000). Combination of threading potentials and sequence profiles improves fold recognition. Journal of Molecular Biology, 296, 1319–1331.

Pillardy, J., Czaplewski, C., Liwo, A., Lee, J., Ripoll, D.R., Kaźmierkiewicz, R., … Scheraga, H.A. (2001). Recent improvements in prediction of protein structure by global optimization of a potential energy function. Proceedings of the National Academy of Sciences of the United States of America, 98, 2329–2333.

Prlic, A., Domingues, F.S., & Sippl, M.J. (2000). Structure-derived substitution matrices for alignment of distantly related sequences. Protein Engineering, 13, 545–550.

Qian, B., & Goldstein, R.A. (2002). Optimization of a new score function for the generation of accurate alignments. Proteins, 48, 605–610.

Sauder, J.M., Arthur, J.W., & Dunbrack, R.L. Jr. (2000). Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins, 40, 6–22.

Shao, Y., & Bystroff, C. (2003). Predicting interresidue contacts using templates and pathways. Proteins, 53, 497–502.

Skolnick, J., Zhang, Y., Arakaki, A.K., Kolinski, A., Boniecki, A., Szilagyi, A., & Kihara, D. (2003). TOUCHSTONE: A unified approach to protein structure prediction. Proteins, 53, 469–479.

Tan, Y.H., Huang, H., & Kihara, D. (2006). Statistical potential-based amino acid similarity matrices for aligning distantly related protein sequences. Proteins, 64, 587–600.

Tanaka, S., & Scheraga, H.A. (1976). Medium and long range interaction parameters between amino acids for predicting three dimensional structures of proteins. Macromolecules, 9, 945–950.

Van Walle, I., Lasters, I., & Wyns, L. (2005). SABmark: A benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 21, 1267–1268.


Vishnepolsky, B., Managadze, G., & Pirtskhalava, M. (2008). Comparison of the efficiency of evolutionary change-based and side chain orientation-based fold recognition potentials. Proteins, 71, 1863–1878.

Vishnepolsky, B., & Pirtskhalava, M. (2009). ALIGN MTX: An optimal pairwise textual sequence alignment program, adapted for using in sequence-structure alignment. Computational Biology and Chemistry, 33, 235–238.

Vishnepolsky, B., & Pirtskhalava, M. (2012). CONTSOR: A new knowledge-based fold recognition potential, based on side chain orientation and contacts between residue terminal groups. Protein Science, 21, 134–141.

Wallace, I.M., Blackshields, G., & Higgins, D.G. (2005). Multiple sequence alignments. Current Opinion in Structural Biology, 15, 261–266.

Wang, G., & Dunbrack, R.L. Jr. (2004). Scoring profile-to-profile sequence alignments. Protein Science, 13, 1612–1626.

Wedemeyer, W.J., Schueler-Furman, O., Murphy, P., Schonbrun, J., Strauss, C., & Baker, D. (2003). Rosetta predictions in CASP5: Successes, failures, and prospects for complete automation. Proteins, 53, 457–468.

Xia, Y., Huang, E.S., Levitt, M., & Samudrala, R. (2000). Ab initio construction of protein tertiary structures using a hierarchical approach. Journal of Molecular Biology, 300, 171–185.

Zhou, H., & Zhou, Y. (2004). Single-body residue-level knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins, 55, 1005–1013.
