Network Analysis of Protein Structures Identifies -

doi:10.1016/j.jmb.2004.10.055 J. Mol. Biol. (2004) 344, 1135–1146

Network Analysis of Protein Structures IdentifiesFunctional Residues

Gil Amitai, Arye Shemesh, Einat Sitbon, Maxim Shklar, Dvir NetanelyIlya Venger and Shmuel Pietrokovski*

Department of MolecularGenetics, Weizmann Institute ofScience, Rehovot 76100, Israel

0022-2836/$ - see front matter q 2004 E

Abbreviations used: RIGs, residugraph(s); RSA, relative solvent acceMatthews correlation coefficient.E-mail address of the correspond

[email protected]

Identifying active site residues strictly from protein three-dimensionalstructure is a difficult task, especially for proteins that have few or nohomologues. We transformed protein structures into residue interactiongraphs (RIGs), where amino acid residues are graph nodes and theirinteractions with each other are the graph edges. We found that active site,ligand-binding and evolutionary conserved residues, typically have highcloseness values. Residues with high closeness values interact directly orby a few intermediates with all other residues of the protein. Combiningcloseness and surface accessibility identified active site residues in 70% of178 representative structures. Detailed structural analysis of specificenzymes also located other types of functional residues. These includethe substrate binding sites of acetylcholinesterases and subtilisin, and theregions whose structural changes activate MAP kinase and glycogenphosphorylase. Our approach uses single protein structures, and does notrely on sequence conservation, comparison to other similar structures orany prior knowledge. Residue closeness is distinct from various sequenceand structure measures and can thus complement them in identifying keyprotein residues. Closeness integrates the effect of the entire protein onsingle residues. Such natural structural design may be evolutionarymaintained to preserve interaction redundancy and contribute to optimalsetting of functional sites.

q 2004 Elsevier Ltd. All rights reserved.

Keywords: protein active sites identification; protein structure analysis;network analysis; closeness degree; ORFans
*Corresponding author
Introduction

Complex systems can be analyzed as networksof interactions between the system components.Analyzing the network can then characterize thewhole system and its individual components.1,2

Protein structures are typically perceived as localstructure elements that form diverse topologies andfolds.3,4 However, protein structures can also berepresented as networks (graphs) where amino acidresidues are the nodes and their interactions are theedges.5 This approach was used to study variousprotein aspects, including protein structure flexi-bility,6 folding of protein domains,7 recurring

lsevier Ltd. All rights reserve

e interactionsssibility; MCC,

ing author:.il

structural patterns,8 key residues in folding,9

residue fluctuation,10 and side-chain clusters.11

Identifying functional residues (e.g., active siteresidues) in proteins is a complex issue, even whenatomic detailed structures are available.12 This isfurther complicated when no recognizable homo-logues with characterized functional residues areknown. Only a very small fraction of all knownproteins has been biochemically well studied.Therefore, for most proteins with solved 3Dstructures we need de novo methods to predictfunctional residues. Evolutionary conservationtogether with structure information is successfulin predicting some ligand binding and active sitesin various proteins.13–16 However, some proteinswith known structures do not have determined,and maybe even existing, homologues. Using onlystructural data, functional residues were identifiedby computing structure energetics and ionizationproperties.17,18 In a different approach, such resi-dues could be characterized by their structural

d.

1136 Protein Structures Residue-interaction Networks

properties or recurring structural motifs.8,19 Com-bining different structural and evolutionary residueproperties improves the identification of active siteresidues.20 Each method has its own advantagesand limitations, but even integrating differentapproaches could not identify all sought-aftersites. New ways to characterize and predict keyprotein residues are still needed.

The interactions of protein residues within andbetween functional sites are crucial for proteinactivity. Analysis of protein structures found themto be small-world networks.9,10,21 Such networksare characterized by both clusters of local inter-actions and “long range” interactions betweendifferent clusters.22 Frequently these networksinclude a small number of central nodes that arehubs through which many nodes can indirectlyconnect.23 We sought to examine whether centralnodes of protein residue interaction networkscorrespond to functional residues. The closenessof a residue to other residues on the network wasfound to characterize many functional protein sites.We preformed a large-scale prediction of active siteresidues, and examined the relation of residuecloseness to other residue functions. Active siteresidue prediction relied on single structure analy-sis without using homologous sequences or struc-tures. Residue closeness is distinct from otherstructurally derived residue properties. However,high closeness is associated with sequence con-servation and is characteristic of key positions ingeneral, where mutations can abolish proteinactivity. It can thus be used, with other measures,to analyze protein structures. This includes struc-tures with unique sequences or folds. We discusshere possible explanations and consequences forthe relation between residue functionality and highnetwork closeness.

Figure 1. Enzyme residues closeness and degree.Values are shown for all 59,935 residues from the 178enzyme chains.On top, highly exposed residues (RSAO50)are in orange, core residues (RSAZ0) are in green, activesite residues are in red, and all other residues are in blue.In the distributions below, exposed residues (RSAO0) arein orange dotted lines, core residues are in green brokenlines, and active site residues are in red. Standardizedcloseness and degree values are calculated as noted inMethods section.

Results

Transforming protein structures intomathematical graphs

Protein structures were transformed into math-ematical graphs (networks) by identifying allinteractions between the amino acids in eachstructure. We first found all interatomic contactsusing the CSU program24 and then integrated thesecontacts by amino acids. In the resulting residueinteractions graphs (RIGs) amino acid residues arethe nodes and their interactions are the edges. Weset out to characterize properties of individualamino acid residues (nodes) based on their relationwith other residues of the protein (network). Forthis end we examined the centrality of networknodes. Network centrality measures of individualnodes are based on their distance from other nodes,or on their position along paths between othernodes.25 The distance between two nodes is thenumber of edges along the shortest path betweenthe nodes. Centrality measures can also be classified

by their dependence on the whole or on the localstructure of the network. Degree-centrality is thenumber of edges each node has (i.e., its number ofimmediate neighbors), and is thus a local measure.Closeness-centrality is derived from the meandistance of a node to all other nodes, and is thus aglobal measure. We first compared these twodistinct centrality measures of protein RIGs.

Degree and closeness centrality measures wereexamined for all amino acids in a representativegroup of 178 protein chains of enzymes with knownstructure and uniformly defined active sites.19

There is a positive correlation between the twomeasures for the entire set of amino acid residues(rZ0.59). Amino acid residues with low degree andcloseness values are mainly highly exposed (rela-tive solvent accessibility [RSA]R50%), while unex-posed amino acids (core, RSAZ0) have mainly high

Figure 2. Closeness and accessi-bility of enzyme active sites. In redare the 567 active site residues andin grey all other residues from the178 enzyme chains.

Protein Structures Residue-interaction Networks 1137

values for both centrality measures. Most active siteresidues have high degree and closeness values(Figure 1). Closeness values of active site residuesare significantly higher than those of the otherresidues, including those in the protein core. Degreevalues of active sites are less distinct than closenessvalues from the corresponding values of core andsurface residues (Figure 1). Degree measure is alsosimilar to other measures that are calculated fromthe local environment of structure residues (e.g.,solvent accessibility, depth). In contrast, closeness isan attribute of single residues that depends on all ofthe protein residues.

The independence of the closeness measure wasexamined by comparing it to other residue struc-tural properties. Both closeness and RSA valuescharacterize active sites, but they are only partiallyrelated (correlation of rZK0.32, Figure 2). Inaddition, there is no detectible relation betweenresidue closeness values and either properties ofprotein structure clefts they are found in (size andrank), or their B-factors (mean thermal motion of allthe residue atoms) (Supplementary information).Finally, ROC curve analyses26 of enzyme active sitepredictions using RSA and using RSA and closenesstogether, show a very significant improvementwhen adding closeness to RSA (Supplementaryinformation). These findings show that closeness isa distinct and informative property of proteinresidues. Our results indicated that active sites canbe characterized and predicted by their closenesscentrality values, with or without other structuralproperties.

† http://www.weizmann.ac.il/SARIG

Prediction of active site residues by thecloseness parameter

RIG closeness and RSA values were used topredict the active sites in the group of 178previously characterized enzyme chain structures.In this initial analysis we simply used a closenesslower-limit value and an RSA lower and upper-limit values. Many combinations (O2000) ofthreshold values were evaluated by a jackknifeprocedure. This objectively identified a set ofoptimal parameters to use for active site prediction(standardized closeness values R1.1 and RSA

values of 4.5–40%). These parameters gave overthe entire data set an average of 46.5% sensitivity(fraction of identified active site residues) and 9.4%specificity (fraction of correct prediction). Themedian overall performance rate (MCC, Matthewscorrelation coefficient, a measure that integratessensitivity and specificity) of the prediction was0.20. These values are comparable to the bestpublished active-site predictions from structuraldata alone.20 Full correct prediction (100% sensi-tivity) was found for about quarter of the proteins(42/178) and partial correct prediction was foundfor 70% of the proteins (Figure 3(A)). These arehighly significant results giving a Z-score of 185.2,highly unlikely to occur by chance, PwZ0. Betterprediction was found for large proteins, longer than120 amino acid residues, (49.0% sensitivity and9.7% specificity means) than for small proteins,%120 amino acid residues, (13.2% sensitivity, 4.7%specificity). No predictions at all (zero sensitivityand specificity) were found for nine of the 12 smallproteins in the examined data. A WWW serverimplementing this method is available†.

Relation between residue closeness values andfunctionality

Our results showed that some residues have highcloseness values although they are not classified asactive site residues according to the definition usedby Bartlett et al.19 Nevertheless, some of theseresidues might be considered as active sites usingother criteria, or might have other functionalimportance (e.g., binding of substrates, of cofactors,or of metals). We chose to examine the relationbetween closeness values and sequence conserva-tion for representative protein families. Sequenceconservation is a good indicator for important siteson proteins. Closeness was found to positivelycorrelate with sequence conservation, with valuesranging from 0.24 to 0.62 (Supplementaryinformation).Predicting enzyme active site residues by inte-

grating closeness and sequence conservation sig-nificantly improved prediction by only using

http://www.weizmann.ac.il/SARIG

Figure 3. Active site residuesprediction using closeness, RSAand sequence conservation.(A) Active site prediction successfor all analyzed 178 chains usingintegrated closeness and RSA.Large filled circle indicates 54chains that had 0% sensitivityand specificity. (B) ROC curvesfor all residues from the analyzedchains, using sequence conserva-tion active site prediction (brokenline) and using integratedsequence conservation and close-ness (continuous line). Sequenceconservation values are from theCONSURF server60 output foreach chain. Closeness standard-ized values and CONSURFsequence conservation valueswere integrated for each residueby adding the closeness value tothe conservation value multipliedby K2.


sequence conservation. For highly specific predic-tions (5% false positives) using only sequenceconservation only gave 33% sensitivity. Addingcloseness to sequence conservation increased thesensitivity to 51% for the same specificity value(Figure 3(B)).

Large-scale experimental data on the relationbetween protein activity and amino acid substi-tutions are available for a few proteins. Typicallymost of the protein residues were mutated one at atime to various other amino acids, and the effect onprotein activity measured.27–32 Such exhaustivemutagenesis studies examined both conservedand variable residues. The positions wheremutations effected protein activity were identifieddirectly in T4-lysozyme and barnase,29,32 andindirectly in TEM1-b-lactamase (TEM1).27

Altogether, the influence of each position on proteinactivity was roughly estimated due to severalreasons. First, the activity threshold for enzymeinactivation is different in the three examples (0.1, 3,

27% for barnase, T4-Lysozyme, and TEM1-b lacta-mase, respectively). Second, not all possible aminoacid substitutions were examined in these studies.We thus considered positions where mutationscould abolish activity as key positions.

Key positions in all three structures had higherRIG closeness values than positions wheremutations had no apparent effect on protein activity(Figure 4). To quantitatively estimate the use ofcloseness and compare it with the use of RSA toidentify key positions we performed a logisticregression analysis. We combined the data for thethree structures, but only examined the surfaceexposed residues, avoiding the simple task ofidentifying key residues in protein cores. The log-odds estimate was 3.7 for using closeness, and 0.9for using RSA. Thus, for exposed residues closenesshas a strong predictive power while RSA, atypically used measure,27–30 is a weak predictor(log-odds estimate close to one).

We correctly predicted both evolutionary

Figure 4. Closeness distributionin key and mutation-tolerant resi-dues. Closeness values of T4-lyso-zyme, barnase, and TEM1-beta-lactamase (PDB accessions 1lzm,1a2p, 1axb, respectively). Values ofmutation-tolerant residues areshown in empty bars and hatchedbars show the values of key resi-dues, residues where mutationscould abolish protein activity.


conserved- and variant key positions in TEM1.Seventy percent of the key positions in this enzymeare variant (30/43).27 More than a third of these keyresidues have high closeness values (11/30). Thisfraction of variant key residues that have highcloseness is significantly higher than expected (pZ5!10K3) from the total number of residues withhigh closeness in the protein (17%, 46/263). Similarsignificance (pZ6!10K3) is found even if we onlyconsider the TEM1 solvent exposed residues. Wealso identified T4-lysozyme variable key residuesby their closeness values (not shown). This couldnot be tested in barnase since all its key residues areconserved (not shown). Closeness values can thuscomplement sequence conservation in identifyingkey positions.

RIG closeness of functionally important non-catalytic residues

To complement the large scale and generalanalyses on the nature of residues with highcloseness values we conducted more thorough

Figure 5.Closeness analysis of subtilisin DYprotease. Closen(PDB accession 1BH6). Closeness increases from blue to red. Tinhibitor, shown in sticks. The right view is related to the top bNa atom in cation binding site B. Note the infrequency of reswith the subtilisin active and cation binding sites.

examination of specific protein structures. Wechose well-studied examples to study substratebinding sites, allosteric sites and the effect ofregulatory post-translational modifications.

Substrate, cofactor and ion binding sites

In subtilisin protease,33 high closeness valuesidentified both substrate binding and one of the twocation binding-sites, in addition to all catalytic siteresidues. While the substrate and catalytic sites areclose to one another, the identified cation binding-site is on the opposite side of the protein structure(Figure 5). The cation binding site stabilizessubtilisin structure.34 We also identified the sub-strate and ion binding sites, together with activesite, in the ERK2 MAP kinase.35 ERK2 ATP andMg2C binding sites could be found by their highcloseness values (Figure 6(A)). Acetylcholinesterase(AChE) catalytic site is situated within a deepnarrow gorge.36–39 All residues within the gorgeincluding the catalytic triad, oxyanion hole and theanionic site had high closeness values. AChE

ess residue values are shown on the surface of the proteinhe left view shows the protease active site with a syntheticy about 908 counterclockwise turn on the Yaxis. It shows aidues with high closeness values and their exact overlap

Figure 6. Closeness values ofERK2 MAP kinase. (A) Surfacerepresentation of unphosphoryl-ated ERK2 MAP kinase (3ERK)bound to an inhibitor in the ATPbinding site (shown in sticks).Residues are colored by closenessvalues, with red and blue colorscorresponding to the highest andlowest closeness values, respec-tively. The active site and ATP-Mg2C binding region have highcloseness values. (B) Closenessvalue changes between phos-phorylated (P-ERK2) and unphos-phorylated ERK2 (ERK2) formsshown on the structure of P-ERK2(2ERK). Residues whose closenesssignificantly changed (more thanone standard deviation) in theP-ERK2 compared with ERK2 areshown as sticks. Significantlyincreased closeness is shown inred and significant decrease isshown in cyan. Helix-C is repre-sented by the large grey cylinder,and the helix that forms uponphosphorylation in the L16 regionis represented by the small greycylinder. Phosphate groups areshown in magenta. The structureis oriented as in (A).


peripheral anionic site residues, located on the rimof the catalytic site gorge, and acyl-binding pocketalso had high closeness values (Supplementaryinformation).

Allosteric sites

Glycogen phosphorylase (GP) was the firstallosteric enzyme to be isolated and analyzed indetail.40 AMP binding or phosphorylation mediatethe transition between GP active and inactive struc-tural states. Long-range allosteric changes occurbetween the GP catalytic site and either AMP-binding or phosphorylation site (45 A and 35 A

apart, respectively). The transition into the activatedstate is initiated by sliding of two helices, found inthe interface of the GP homodimer. This modifies ab-sheet in each of the subunits, that in turn transmitsthe structural change to the catalytic site.41

A RIG analysis of activated dimers shows that theresidues of the AMP-binding allosteric site havehigh closeness values (Mean closeness 1.2G0.16).Residues of a different, recently discovered, allo-steric site42 have even higher closeness values(mean standardized closeness 1.7G0.23). Moreover,20 out of 26 residues mediating the conformationalchanges43 have high closeness values as well (meanstandardized closeness of all 26 residues 1.39G

Figure 7. Closeness and allosterictransition glycogen phosphorylase(GP). Chain A of activated (R-state)GP functional dimer43 (PDB acces-sion 7GPB) is shown with residuescolored by closeness values. Resi-dues that transmit the allostericsignal occurring upon AMP bind-ing are shown as spheres (residues133, 162–165, 262–279, 281). Pyri-doxal phosphate cofactor (PLP),situated within the catalytic site, isshown in magenta. AMP bound toits regulatory site is shown in green.Closeness values were calculatedby transforming GP dimers into asingle network.


0.45). This is highly significant relative to thenumber of high closeness residues in the wholeprotein (159/824, pZ2!10K9). Moreover, all resi-dues of the AMP binding site form a continuousstretch of high closeness amino acid residues thatreach the residues that structurally change uponAMP binding (Figure 7).

MAP kinase activation and RIG closenesschanges

MAP kinase isoforms ERK2 and ERK1 are wellstudied ubiquitous, growth-factor-activated, tyro-sine-phosphorylated, enzymes.44–46 ERK2 reachesmaximal activity upon phosphorylation of tworesidues (T183 and Y185).46–48 We compared resi-due closeness value changes between the inactiveERK2 unphosphorylated35 and active phosphoryl-ated (ERK2-P2)49 forms. Phosphorylation causeslocal and global structural changes in many ERK2regions, including rotation of a whole domain.49

However, large increase in closeness (O1 closenessstandard deviations) is found in a few residuesforming the activation-loop (F181, L182, T183), andresidue D335. The activation-loop contains the twophosphorylation sites. Upon phosphorylation theloop refolds and activates the protein. The loopcontacts the L16 region, which includes D335. Theconformational changes in L16 are coordinated withposition shifts helix C into the active site49 (Figure6). The largest increase in closeness (1.76 standarddeviations) is in the phosphorylated residue T183.This residue is the hub of two new hydrogen andionic interaction networks formed within ERK2-P2.It is responsible for bringing the active site residuesinto alignment, the ATP binding Helix C closer tothe active site, and improving the interactionbetween the conserved catalytic residues K52

and E69.49 The largest decrease in closeness values(K1.1 standard deviations) is identified in F329. It isin a part of L16 that forms a helix after phosphoryl-ation, and interacts with the activation loop.

Discussion

Transforming protein structures into networksof residue interactions allowed us to examine therelationship between each amino acid residue andthe protein structure as a whole. Relationships wereevaluated by the centrality of each residue withinthe interaction network. In this study we used thecloseness parameter as a measure of networkcentrality, where highly central residues have highcloseness values. Such residues interact with mostother residues directly or by a few intermediates.Residues with high centrality values are thenassumed to efficiently integrate and transmitinformation from and to the rest of the protein.Such information may appear in a chemical orphysical form, affecting the attraction or repulsionof amino acids. Our results demonstrate that manyactive site residues have high centrality values(Figure 2). The high centrality of active site residuessuggests they can effectively (directly) disseminateand receive signals from the rest of the protein.Active site residues are highly central relative to

solvent exposed residues, where active sites aretypically found, and also even relative to the core ofthe protein, where one would expect residues to behighly central (Figure 1). Active site residues can beidentified by various other approaches. Someapproaches rely on sequence conservation deducedfrom multiple sequence alignments,50 some onstructural features such as electrostatic17,18 orstructural clefts51 and some on a combination of


such features.13,19,20,52 Here we report that thecombination of centrality and solvent accessibilityalready enabled us to identify active site residues in70% of the proteins within a data set of 178representative enzymes generated by Bartlett etal.19 (Figure 3(A)). This is a large scale analysis,comprising approximately 60,000 residues and 500active site residues. Our prediction results are thushighly significant.

Our method defines individual active site resi-dues, not just patches including them. In addition,other residues, not directly linked to catalysis(e.g., binding cofactors or substrates), can also beidentified using our approach (Figures 5 and 6 andSupplementary information). We suggest that ournew approach for active site identification not onlycomplements other known methods but also leadsto a deeper understanding of the relationshipbetween the whole structure of proteins andtheir specific functional sites. This relationship isprobably a direct consequence of the necessaryfunctional robustness of proteins.

Robustness to environmental and mutationalperturbations is fundamental to protein function.Proteins hence evolve toward a common designthat supports such features. We found this designreflected in the centrality of amino acid residueswithin residue interaction networks. One result ofsuch network design is that many residues contri-bute to the optimal arrangement of the active site(e.g., orientation, charge). This can yield proteinactivity resilient to environmental perturbations(e.g., changes in pH, temperature, ionic strength,and mutations) by preserving the essential state ofthe active site. In each protein, many network edges(interactions) or nodes (residues) can be indi-vidually removed while the remaining nodes willcontinue to interact by alternate paths. Such amodel is supported by data from exhaustivemutagenesis studies, where numerous single-sitesubstitutions are tested within each position ofthe protein. These studies show that mostmutations do not impair protein activity.27,29,30–32

We found that mutations in highly central residuesoften impair activity (Figure 4). These results areremarkable since we could only use crude estimatesfor the effect of the mutation on the activity of eachprotein.

Typically, only a few amino acid residues withinthe protein are directly involved in catalysis.19

Nonetheless, protein activity can be modulated bynon-catalytic residues, including residues verydistant from the active site. This is observed inallosteric enzymes where effectors often bind toresidues distant from active sites. Our analysis ofglycogen phosphorylase, a well-studied allostericenzyme, shows that both allosteric regulatorysites and the residues involved in the allosteric-derived structural changes are central within theresidue network of this enzyme. This supportsour suggestion that the interactions between allo-steric and active sites lead to the high centrality ofboth.

Amino acid sequence conservation is oftenattributed to important sites such as those directlyrelated to protein function.13,50,52 We followed ourfindings on the relation between network centralityand functional residues by evaluating the relationbetween network centrality and sequence conserva-tion. Our results show that centrality and sequenceconservation are positively correlated in severaldiverse protein families (Supplementary informa-tion). Sequence conservation identifies residues thatare under the same selection pressure in the wholefamily. However, some residues that are particu-larly adjusted for the function or structure inspecific proteins within the family will not be con-served. This is shown in TEM1-b-lactamase wherean exhaustive mutagenesis study showed that somepositions are both completely intolerant to muta-tionsandnonconservedwithin theirprotein family.27

Usingnetwork centralitywe identified such residues,together with conserved residues. Thus, for pro-teins with known structure, our approach maycomplement sequence information in the predictionof non conserved functional residues.

Active sites were correctly predicted for themajority of the enzymes in our large-scale analysis(70% of the structures, that had a median of 67%sensitivity and 13% selectivity). No correct predic-tions at all, however, were found by our method forthe remaining enzymes. Examining this groupshowed that it included nine of the 12 smallestenzymes (%120 residues). Residue interaction net-works of such proteins may not be amenable to ouranalysis approach due to their small size. Alter-natively, their active site design may be differentfrom that of larger enzymes. These sites might beidentified by analyzing their networks in a differentfashion. We are yet uncertain what might becommon for the remaining (large) enzymes withno correct predictions. One distinguishing feature isthe distribution of catalytic function types.19 Thisdistribution is significantly different between thepredicted and unpredicted active site residues inlarge enzymes (pZ2!10K2). Two types of catalyticfunctions (primer and acid–base, as defined byBartlett et al.19) contributed most of the difference,indicating that our approach and current thresholdsare more successful in predicting certain types ofactive site residues.

Improvement in active site prediction wasachieved by combining all chains of some multi-meric enzymes into a single interaction network.However, this was not noted for all such enzymesthat were examined, and the overall predictionsuccess did not noticeably improve, relative tosingle chain network analysis (not shown). Webelieve that prediction success depends on thelocation of the active site residues. Using all chainsgave highly successful predictions for active siteswith residues in more than one subunit, but notwhen all the active site residues were within asingle subunit. It is therefore advantageous toinclude all available biochemical knowledge


upon deciding on the analysis parameters andinterpreting its results.

In summary, our results show that key proteinpositions have high centrality values in residueinteractions networks. Catalytic, substrate andco-factor binding, and mutation intolerant residueswere found highly central. Other types of function-ally and structurally important residues may yetbe found to have specific interaction networksfeatures. Our approach is unique in not focusingon residue physicochemical properties or on theirevolutionary conservation. Rather, it relies on theinteractions of each residue and the rest of theprotein. This type of analysis can complement otherexisting methods for studying proteins (Figure 3(B)and Supplementary information). Our active siteresidue predictions, using only closeness and RSAvalues (47% sensitivity, 9% specificity), are compar-able to the structural based findings of a recentwork that analyzed the same enzyme set, integrat-ing RSA, depth, secondary structure, cleft features,and amino acid type (41% sensitivity, 10% speci-ficity).20 Using interaction networks may be par-ticularly useful for protein structures having noknown homologous sequences (ORFans).53,54 Wealso showed that non-conserved key positions canbe identified by network analysis. This is importantsince while sequence conservation often indicatescrucial protein positions, such positions can also bevariable.55 Protein network analyses can thus formthe basis for identifying important species-specificand allele specific residues. Not depending on priortraining or comparison to known protein sequencesand structures, our approach is a de novo predictionmethod. As such, it may also uncover functionalsites dissimilar to known sites.

Protein structures provide detailed three-dimen-sional information of the proteins. However, it is notobvious how and which of these data identifies thefunction of each protein and its specific functionalsites. A reductionist approach, transforming proteinstructures into abstract networks, provides a dis-tinct depiction of proteins. We found that a basiccentrality measure of networks nodes can predictfunctionally important protein residues. Ourapproach enhances other methods of protein func-tion analysis and illustrates key aspects in naturaldesign and evolution of proteins.

Methods

† http://www.rcsb.org/pdb/‡ http://consurf.tau.ac.il§ http://cast.engr.uic.edu/cast/

Transforming protein structures into residueinteraction graphs

For each protein chain all interatomic contacts werefound using the CSU program.24 These atomic contactswere integrated for each amino acid residue. Residueinteractions form the edges, and their connected residuesform the nodes of the protein RIG. Interactions includedthe backbone peptide bonds as well as non-covalentbonds, such as hydrogen and hydrophobic interactions.

Protein structures analysis

Protein structures were obtained from the PDBdatabase†.56 The structure set used for the large scaleexamination of our approach is that described by Thorntonand co-workers.19 It does not contain similar (homologous)pairs and covers all six top-level enzyme classification (EC)numbers.57 Residue relative accessibility was calculated bythe NACCESS program.58 Protein structures are shownwith the PyMol program (http://www.pymol.org).

Multiple sequence alignment

Structure and sequence alignments of proteinsequences of similar structures were obtained from theHOMSTRAD database59 or calculated by us. Conserva-tion of multiple sequence alignment positions wascalculated by the CONSURF server.‡60 The servercalculated the conservation for 91% of the 178 analyzedchains, not identifying enough homologs for the follow-ing 16 structures: 1c3j, 2thi, 1gog, 2plc, 3csm, 4kbp, 1chk,1d8h, 1pya, 1coy, 1fui, 1pgs, 1ps1, 1vnc, 2cpo, and 1uox.

Network analysis

Analysis of protein structure residue interactions net-works was carried out using programs specificallywritten in Perl v5.6.1 and Java v1.4. Graph represen-tations and shortest paths of the residue network werecalculated using the JDSL java module (Copyright q

1999, 2000 Brown University, Providence, RI andAlgomagic Technologies, Inc., Belmont, MA). Networkcentrality measures were developed by Freeman, Beau-champ, and Sabidussi.61–63 The formula for closenesscentrality calculation62,63 is shown in equation (1).Basically “closeness centrality” of node x(C(x)) is

calculated as follows:

CðxÞZ ðnK1Þ=X

y2U;ysx

dðx; yÞ (1)

Where d(x,y) is the geodesic distance (shortest-path)between node x and any node y. U is the set of all nodesand n is the number of nodes in the network. The closenessvalue is therefore the inverse of the average distancebetween x and other nodes ð �dÞ (equation (2)).

CðxÞZ 1= �d (2)

Closeness values are then standardized by calculating theirstandard deviation from the mean value in each proteinstructure.

Measuring cleft volume and rank

Protein structure cleft volumes were calculated by theCASTp server§64 using 42 representative proteins fromdifferent enzyme class, subclass, and sub-subclasses (firstthree EC number fields).

Statistic procedures

Logistic regression analysis was done using the SASsoftware version 8.02 (SAS institute Inc., Cary NC, USA),using default parameters.

http://www.pymol.org

http://www.rcsb.org/pdb/

http://consurf.tau.ac.il

http://cast.engr.uic.edu/cast/


Probabilities for finding residues with high closenessvalues (O1) were estimated using hypergeometricdistribution. Significance of active site prediction wasassessed by the method used by Aloy et al.13 consideringall 59,935 residues from the 178 analyzed chains. TheZ-score was thus computed with nZ59935, POZ0.082,and PRZ0.0094 as

zZ ðPO KPRÞ=ðPRð1KPRÞ=nÞ0:5

Identifying optimal parameters for active siteprediction

A Jackknife procedure was used to identify optimalprediction parameters. For each 178 analyzed chainsthe parameters were chosen from the success rate(MCC) of the other 177 chains. The median MCC ofeach thresholds combination was calculated for all 177chains. The combination with the highest medianwas used to predict the active sites for the currentanalyzed chain. Two thousand one hundred andninty seven threshold combinations were examined asfollows: standardized closeness lower thresholdsbetween 0.9 and 2.1 in intervals of 0.1, RSA lowerthresholds between 2.5 and 8.5 in intervals of 0.5,RSA upper thresholds between 33 and 45 in intervals ofone.

Standardizing the degree of connectivity

To standardize the degree of connectivity for eachamino acid type we collected data of amino acidconnectivity from over 2000 structures (w500,000amino acid residues). These representativestructures were retrieved from the PDB-REPRDBdatabase,65 and had the following properties: lessthan 30% sequence identity between the proteins,resolution !3 A, B-factor !0.3, no chain breaks, nomutants and no membrane proteins. The degree ofresidue connectivity was extracted from eachprotein using a self-written program combined withthe CSU program.24 Connectivity degree was thenstandardized as number of standard deviations fromthe mean.

Acknowledgements

We thank Joel Sussman, Israel Silman, MiriEisenstein, Debbie Fass, Gideon Schreiber, SvenBergmann for encouragement and insightful dis-cussions, Edna Schechtman for expert statisticalconsulting, Vladimir Sobolev for developing andproviding the CSU software, and Gail Bartlett forproviding active site data.

Supplementary Data

Supplementary data associated with this articlecan be found, in the online version, at doi:10.1016/j.jmb.2004.10.055

References

1. Strogatz, S. H. (2001). Exploring complex networks.Nature, 410, 268–276.

2. Albert, R. & Barabasi, A. L. (2002). Statisticalmechanics of complex networks. Rev. Mod. Phys. 74,50–97.

3. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia,C. (1995). SCOP: a structural classification of proteinsdatabase for the investigation of sequences andstructures. J. Mol. Biol. 247, 536–540.

4. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T.,Swindells, M. B. & Thornton, J. M. (1997). CATH—ahierarchic classification of protein domain structures.Structure, 5, 1093–1108.

5. Lifson, S. & Sander, C. (1979). Antiparallel andparallel beta-strands differ in amino acid residuepreferences. Nature, 282, 109–111.

6. Jacobs, D. J., Rader, A. J., Kuhn, L. A. & Thorpe, M. F.(2001). Protein flexibility predictions using graphtheory. Proteins: Struct. Funct. Genet. 44, 150–165.

7. Dokholyan, N., Li, L., Ding, F. & Shakhnovich, E.(2002). Topological determinants of protein folding.Proc. Natl Acad. Sci. USA, 99, 8637–8641.

8. Wangikar, P. P., Tendulkar, A. V., Ramya, S., Mali,D. N. & Sarawagi, S. (2003). Functional sites in proteinfamilies uncovered via an objective and automatedgraph theoretic approach. J. Mol. Biol. 326, 955–978.

9. Vendruscolo, M., Dokholyan, N. V., Paci, E. &Karplus, M. (2002). Small-world view of the aminoacids that play a key role in protein folding. Phys. Rev.E Stat. Nonlinear Soft Matter Phys. 65, 061910.

10. Atilgan, A. R., Akan, P., Baysal, C., Vendruscolo, M.,Dokholyan, N. V., Paci, E. & Karplus, M. (2004).Small-world communication of residues and signifi-cance for protein dynamics. Small-world view of theamino acids that play a key role in protein folding.Biophys. J. 86, 85–91.

11. Kannan, N. & Vishveshwara, S. (1999). Identificationof side-chain clusters in protein structures by a graphspectral method. J. Mol. Biol. 292, 441–464.

12. Jones, S. & Thornton, J. M. (2004). Searching forfunctional sites in protein structures. Curr. Opin.Chem. Biol. 8, 3–7.

13. Armon, A., Graur, D. & Ben-Tal, N. (2001). ConSurf:an algorithmic tool for the identification of functionalregions in proteins by surface mapping of phylo-genetic information. J. Mol. Biol. 307, 447–463.

14. Aloy, P., Querol, E., Aviles, F. X. & Sternberg, M. J.(2001). Automated structure-based prediction offunctional sites in proteins: applications to assessingthe validity of inheriting protein function fromhomology in genome annotation and to proteindocking. J. Mol. Biol. 311, 395–408.

15. Lichtarge, O., Bourne, H. R. & Cohen, F. E. (1996). Anevolutionary trace method defines binding surfacescommon to protein families. J. Mol. Biol. 257, 342–358.

16. Landgraf, R., Xenarios, I. & Eisenberg, D. (2001).Three-dimensional cluster analysis identifies inter-faces and functional residue clusters in proteins.J. Mol. Biol. 307, 1487–1502.

17. Ondrechen, M. J., Clifton, J. G. & Ringe, D. (2001).THEMATICS: a simple computational predictor ofenzyme function from structure. Proc. Natl Acad. Sci.USA, 98, 12473–12478.

18. Elcock, A. H. (2001). Prediction of functionallyimportant residues based solely on the computedenergetics of protein structure. J. Mol. Biol. 312,885–896.

http://dx.doi.org/doi:10.1016/j.jmb.2004.10.055

http://dx.doi.org/doi:10.1016/j.jmb.2004.10.055


19. Bartlett, G. J., Porter, C. T., Borkakoti, N. & Thornton,J. M. (2002). Analysis of catalytic residues in enzymeactive sites. J. Mol. Biol. 324, 105–121.

20. Gutteridge, A., Bartlett, G. J. & Thornton, J. M. (2003).Using a neural network and spatial clustering topredict the location of active sites in enzymes. J. Mol.Biol. 330, 719–734.

21. Greene, L. & Higman, V. (2003). Uncovering networksystems within protein structures. J. Mol. Biol. 334,781–791.

22. Watts, D. J. & Strogatz, S. H. (1998). Collectivedynamics of “small-world” networks. Nature, 393,440–442.

23. Barabasi, A. L. & Albert, R. (1999). Emergence ofscaling in random networks. Science, 286, 509–512.

24. Sobolev, V., Sorokine, A., Prilusky, J., Abola, E. E. &Edelman,M. (1999). Automated analysis of interatomiccontacts in proteins. Bioinformatics, 15, 327–332.

25. Freeman, L., Borgatti, S. &White, D. (1991). Centralityin valued graphs: a measure of betweenness based onnetwork flow. Social Netw. 13, 141–154.

26. Metz, C. E. (1978). Basic principles of ROC analysis.Semin. Nucl. Med. 8, 283–298.

27. Huang, W., Petrosino, J., Hirsch, M., Shenkin, P. S. &Palzkill, T. (1996). Amino acid sequence determinantsof beta-lactamase structure and activity. J. Mol. Biol.258, 688–703.

28. Markiewicz, P., Kleina, L. G., Cruz, C., Ehret, S. &Miller, J. H. (1994). Genetic studies of the lac repressor.XIV. Analysis of 4000 altered Escherichia coli lacrepressors reveals essential and non-essential resi-dues, as well as “spacers” which do not require aspecific sequence. J. Mol. Biol. 240, 421–433.

29. Rennell, D., Bouvier, S. E., Hardy, L. W. & Poteete,A. R. (1991). Systematic mutation of bacteriophage T4lysozyme. J. Mol. Biol. 222, 67–88.

30. Terwilliger, T. C., Zabin, H. B., Horvath, M. P.,Sandberg, W. S. & Schlunk, P. M. (1994). In vivocharacterization of mutants of the bacteriophage f1gene V protein isolated by saturation mutagenesis.J. Mol. Biol. 236, 556–571.

31. Loeb, D. D., Swanstrom, R., Everitt, L., Manchester,M., Stamper, S. E. & Hutchison, C. A., Jr (1989).Complete mutagenesis of the HIV-1 protease. Nature,340, 397–400.

32. Axe, D. D., Foster, N. W. & Fersht, A. R. (1998). Asearch for single substitutions that eliminate enzy-matic function in a bacterial ribonuclease. Biochem-istry, 37, 7157–7166.

33. Eschenburg, S., Genov, N., Peters, K., Fittkau, S.,Stoeva, S., Wilson, K. S. & Betzel, C. (1998). Crystalstructure of subtilisin DY, a random mutant ofsubtilisin Carlsberg. Eur. J. Biochem. 257, 309–318.

34. Alexander, P. A., Ruan, B. & Bryan, P. N. (2001).Cation-dependent stability of subtilisin. Biochemistry,40, 10634–10639.

35. Zhang, F., Strand, A., Robbins, D., Cobb, M. H. &Goldsmith, E. J. (1994). Atomic structure of the MAPkinase ERK2 at 2.3 A resolution. Nature, 367, 704–711.

36. Harel, M., Kryger, G., Rosenberry, T. L., Mallender,W. D., Lewis, T., Fletcher, R. J. et al. (2000). Three-dimensional structures of Drosophila melanogasteracetylcholinesterase and of its complexes with twopotent inhibitors. Protein Sci. 9, 1063–1072.

37. Sussman, J. L., Harel, M., Frolow, F., Oefner, C.,Goldman, A., Toker, L. & Silman, I. (1991). Atomicstructure of acetylcholinesterase from Torpedo califor-nica: a prototypic acetylcholine-binding protein.Science, 253, 872–879.

38. Bourne, Y., Taylor, P. & Marchot, P. (1995). Acetyl-cholinesterase inhibition by fasciculin: crystal struc-ture of the complex. Cell, 83, 503–512.

39. Kryger, G., Harel, M., Giles, K., Toker, L., Velan, B.,Lazar, A. et al. (2000). Structures of recombinant nativeand E202Q mutant human acetylcholinesterase com-plexed with the snake-venom toxin fasciculin-II. ActaCrystallog. D: Biol. Crystallog. 56, 1385–1394.

40. Helmreich, E. & Cori, C. F. (1964). The role of adenylicacid in the activation of phosphorylase. Proc. NatlAcad. Sci. USA, 51, 131–138.

41. Barford,D.,Hu, S.H.& Johnson, L.N. (1991). Structuralmechanism for glycogen phosphorylase control byphosphorylation and AMP. J. Mol. Biol. 218, 233–260.

42. Oikonomakos, N. G., Skamnaki, V. T., Tsitsanou, K. E.,Gavalas, N. G. & Johnson, L. N. (2000). A newallosteric site in glycogen phosphorylase b as a targetfor drug interactions. Struct. Fold. Des. 8, 575–584.

43. Barford, D. & Johnson, L. N. (1989). The allosterictransition of glycogen phosphorylase. Nature, 340,609–616.

44. Cobb, M. H., Boulton, T. G. & Robbins, D. J. (1991).Extracellular signal-regulated kinases: ERKs in pro-gress. Cell Reg. 2, 965–978.

45. Boulton, T. G., Yancopoulos, G. D., Gregory, J. S.,Slaughter, C., Moomaw, C., Hsu, J. & Cobb, M. H.(1990). An insulin-stimulated protein kinase similar toyeast kinases involved in cell cycle control. Science,249, 64–67.

46. Anderson, N. G., Maller, J. L., Tonks, N. K. & Sturgill,T. W. (1990). Requirement for integration of signalsfrom two distinct phosphorylation pathways foractivation of MAP kinase. Nature, 343, 651–653.

47. Crews, C. M., Alessandrini, A. & Erikson, R. L. (1992).The primary structure of MEK, a protein kinase thatphosphorylates the ERK gene product. Science, 258,478–480.

48. Ahn, N. G., Seger, R., Bratlien, R. L., Diltz, C. D.,Tonks, N. K. & Krebs, E. G. (1991). Multiplecomponents in an epidermal growth factor-stimu-lated protein kinase cascade. In vitro activation of amyelin basic protein/microtubule-associated protein2 kinase. J. Biol. Chem. 266, 4220–4227.

49. Canagarajah, B. J., Khokhlatchev, A., Cobb, M. H. &Goldsmith, E. J. (1997). Activation mechanism of theMAP kinase ERK2 by dual phosphorylation. Cell, 90,859–869.

50. Casari, G., Sander, C. & Valencia, A. (1995). A methodto predict functional residues in proteins. NatureStruct. Biol. 2, 171–178.

51. Laskowski, R. A., Luscombe, N. M., Swindells, M. B.& Thornton, J. M. (1996). Protein clefts in molecularrecognition and function. Protein Sci. 5, 2438–2452.

52. Zvelebil, M. J. & Sternberg, M. J. (1988). Analysis andprediction of the location of catalytic residues inenzymes. Protein Eng. 2, 127–138.

53. Siew, N. & Fischer, D. (2003). Analysis of singletonORFans in fully sequenced microbial genomes.Proteins: Struct. Funct. Genet. 53, 241–251.

54. Siew, N., Azaria, Y. & Fischer, D. (2004). The ORFanage:an ORFan database.Nucl. Acids Res. 32, D281–D283.

55. Axe, D. D. (2000). Extreme functional sensitivity toconservative amino acid changes on enzymeexteriors. J. Mol. Biol. 301, 585–595.

56. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G.,Bhat, T. N., Weissig, H. et al. (2000). The Protein DataBank. Nucl. Acids Res. 28, 235–242.

57. Bairoch, A. (1993). The ENZYME data bank. Nucl.Acids Res. 21, 3155–3156.


58. Hubbard, S. J. (1996). Naccess (2.1.1 edit.), Biomole-cular Structure and Modelling Unit, UniversityCollege, London, UK.

59. Mizuguchi, K., Deane, C. M., Blundell, T. L. &Overington, J. P. (1998). HOMSTRAD: a database ofprotein structure alignments for homologous families.Protein Sci. 7, 2469–2471.

60. Glaser, F., Pupko, T., Paz, I., Bell, R. E., Bechor-Shental,D., Martz, E. & Ben-Tal, N. (2003). ConSurf: identifi-cation of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics,19, 163–164.

61. Freeman, L. C. (1978). Centrality in social networksconceptual clarification. Social Netw. 1, 215–239.

62. Beauchamp, M. A. (1965). An improved index ofcentrality. Behav. Sci. 10, 161–163.

63. Sabidussi, G. (1966). The centrality of a graph.Psychometrika, 31, 581–603.

64. Liang, J., Edelsbrunner, H. & Woodward, C. (1998).Anatomy of protein pockets and cavities: measure-ment of binding site geometry and implications forligand design. Protein Sci. 7, 1884–1897.

65. Noguchi, T. & Akiyama, Y. (2003). PDB-REPRDB: adatabase of representative protein chains from theProtein Data Bank (PDB) in 2003. Nucl. Acids Res. 31,492–493.

66. Henikoff, S., Henikoff, J. G., Alford, W. J. &Pietrokovski, S. (1995). Automated construction andgraphical presentation of protein blocks from una-ligned sequences. Gene, 163, GC17–GC26.

Edited by B. Honig

(Received 15 June 2004; received in revised form 8 October 2004; accepted 19 October 2004)

Documents

Network Analysis of Protein Structures Identifies -