Comprehensive structural classification of ligand binding motifs in

arX

iv:0

810.

1327

v1 [

q-bi

o.B

M]

8 O

ct 2

008

Comprehensive structural classification of ligand binding motifs in proteins

Akira R. Kinjo a,∗ and Haruki Nakamura a

aInstitute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka 565-0871, Japan

Abstract

Comprehensive knowledge of protein-ligand interactions should provide a useful basis for annotating protein functions, studying

protein evolution, engineering enzymatic activity, and designing drugs. To investigate the diversity and universality of ligand

binding sites in protein structures, we conducted the all-against-all atomic-level structural comparison of over 180,000 ligand binding

sites found in all the known structures in the Protein Data Bank by using a recently developed database search and alignment

algorithm. By applying a hybrid top-down-bottom-up clustering analysis to the comparison results, we determined approximately

3000 well-defined structural motifs of ligand binding sites. Apart from a handful of exceptions, most structural motifs were found

to be confined within single families or superfamilies, and to be associated with particular ligands. Furthermore, we analyzed the

components of the similarity network and enumerated more than 4000 pairs of ligand binding sites that were shared across different

protein folds.

Introduction Most proteins function by interacting withother molecules. Therefore, the knowledge of interactionsbetween proteins and their ligands is central to our under-standing of protein functions. However, simply enumerat-ing the interactions of individual proteins with individualligands, which is now indeed possible owing to the massiveproduction of experimentally determined protein struc-tures, would only serve to increase the amount of data,not necessarily our knowledge or understanding, of pro-tein functions. What is needed is a classification of generalpatterns of interactions. Otherwise, it would be difficultto apply the wealth of information to elucidate the evolu-tionary history of protein functions (Andreeva & Murzin,2006; Goldstein, 2008), to engineer enzymatic activity(Gutteridge & Thornton, 2005), or to develop new drugs(Rognan, 2007).

In order to classify protein-ligand interactions and to ex-tract general patterns from the classification, it is a prereq-uisite to compare the ligand binding sites of different pro-teins. There are already a number of methods to comparethe atomic structures or other structural features of func-tional sites of proteins (see reviews, Jones & Thornton,2004; Lee et al., 2007).

Applications of these methods lead to the discoveriesof ligand binding site structures shared by many proteins

∗ Corresponding authorEmail address: [email protected] (Akira R. Kinjo).

of different folds (Kobayashi & Go, 1997; Kinoshita et al.,1999; Stark & Russell, 2003; Brakoulias & Jackson,2004; Shulman-Peleg et al., 2004; Gold & Jackson, 2006).Gold & Jackson (2006) conducted an all-against-all com-parison of 33,168 binding sites, the results of which havebeen compiled into the SitesBase database. They havedescribed several unexpected similarities across differentprotein folds and applied their method to the annotationof unclassified proteins. More recently, Minai et al. (2008)compared all pairs of 48,347 potential ligand binding sitesin 9708 representative protein chains, and demonstratedthe applicability of ligand binding site comparison to drugdiscovery.

To date, however, no method has been applied tothe exhaustive all-against-all comparison of all ligandbinding sites found in the Protein Data Bank (PDB)(Berman et al., 2007), presumably because these methodswere not efficient enough to handle the huge amount ofdata in the current PDB, or because it was assumed thatthe redundancy (in terms of sequence homology) or some“trivial” ligands (such as sulfate ions) in the PDB didnot present any interesting findings. As of June, 2008, thePDB contains over 51,000 entries with more than 180,000ligand binding sites excluding water molecules, and hencenaively comparing all the pairs of this many binding sites(> 3 × 1010 pairs) is indeed a formidable task. Neverthe-less, multiple structures of many proteins that have beensolved with a variety of ligands (e.g., inhibitors for en-zymes) could provide a great opportunity for analyzing

Preprint submitted to arXiv 8 October 2008

http://arXiv.org/abs/0810.1327v1

the diversity of binding modes, and some apparently triv-ial ligands are often used by crystallographers to infer thefunctional sites from the “apo” structure. In other words,the diversity of these apparently redundant data is tooprecious a source of information to be ignored.

To handle this huge amount data, we have recentlydeveloped the GIRAF (Geometric Indexing with Re-fined Alignment Finder) method (Kinjo & Nakamura,2007). By combining ideas from geometric hashing(Wolfson & Rigoutsos, 1997) and relational databasesearching (Garcia-Molina et al., 2002), this method canefficiently find structurally and chemically similar localprotein structures in a database and produce alignmentsat atomic resolution independent of sequence homology,sequence order, or protein fold. In this method, we firstcompile a database of ligand binding sites into an ordi-nary relational database management system, and createan index based on the geometric features with surround-ing atomic environments. Owing to the index, potentiallysimilar ligand binding sites can be efficiently retrieved andunlikely hits are safely ignored. For each of the potentialhits found, the refined atom-atom alignment is obtainedby iterative applications of bipartite graph matching andoptimal superposition. In this study, we have further im-proved the original GIRAF method so that one-against-allcomparison takes effectively one second, and applied it tothe first all-against-all comparison of all ligand bindingsites in the PDB.

In order to extract recurring patterns in ligand bind-ing sites, we then classified the ligand binding sites basedon the results of the all-against-all comparison, and de-fined structural motifs. So far, such structural motifs havebeen determined either manually (Porter et al., 2004) orautomatically (Wangikar et al., 2003; Polacco & Babbitt,2006). Given the huge amount of data, manual curationof all potential motifs is not feasible, and previously de-veloped automatic methods are computationally too inten-sive (Wangikar et al., 2003) or limited in scope (e.g., beingbased on sequence alignment (Polacco & Babbitt, 2006)).Therefore, we first applied divisive (top-down) hierarchicalclustering to obtain single-linkage clusters from the simi-larity network of ligand binding sites which can be read-ily obtained from the result of the all-against-all compari-son. Based on the hierarchy of the single-linkage clusters,agglomerative (bottom-up) complete-linkage clustering isthen applied. Thus obtained complete-linkage clusters areshown to be well-defined structural motifs, and are thensubject to statistical characterization regarding their lig-and specificity and protein folds.

Furthermore, based on the result of the all-against-allcomparison, we study the structure of the similarity net-work of ligand binding sites, and enumerate interesting sim-ilarities shared across different folds. The list of clustersand the list of pairs of ligand binding sites not sharing thesame fold are available on-line 1 .

1 http://pdbjs6.pdbj.org/∼akinjo/lbs/

Clustering, annotation, ...

A B

(P < 0.001)

38,869,791 matches

186,485 ligand binding sites

all−against−all comparison

1

10

100

1000

10000

100000

1e+06

0 500 1000 1500 2000 2500 3000 3500 4000

Cou

nt

Number of matches / site

P < 10-3

P < 10-10

P < 10-15

P < 10-20

51,289 PDB entries

Fig. 1. Summary of experiment. A: Flow of the analysis. B: His-togram of the number of matches per ligand binding site.

Results

All-against-all comparison of ligand binding sitesOut of 51,289 entries in the Protein Data Bank

(Berman et al., 2007) as of June 13, 2008, all 186,485ligand binding sites were extracted and compiled into adatabase. A ligand binding site is defined to be the set ofprotein atoms that are within 5A from any of the corre-sponding ligand atoms. To define a ligand, we used theannotations in PDB’s canonical XML (extensible markuplanguage) files (PDBML) (Westbrook et al., 2005) becausethese annotations are more accurate than the HETATMrecord of the flat PDB files. Our definition of ligands in-cludes not only small molecules, but also polymers suchas polydeoxyribonucleotide (DNA), polyribonucleotide(RNA), polysaccharides, and polypeptides with less than25 amino acid residues; water molecules and ligands con-sisting of more than 1000 atoms were excluded. We did notexclude “trivial” ligands such as sulfate (SO2−

4 ), phosphate(PO3−

4 ), and metal ions. We did not use a representativeset of proteins based on sequence homology to reduce thedata size.

In total, the all-against-all comparison yielded 38,869,791matches with P-value < 0.001 with 208 matches per siteon average (Fig. 1A). While 5014 sites found no hits otherthan themselves, 8369 sites found more than 1000 matches.When we limit the matches to more stringent P-valuethresholds (10−10, 10−15, 10−20), the long tail of the largenumber of matches rapidly disappears (Fig. 1B), indicat-ing that many matches reflect partial and weak similaritiesbetween sites.

Relationship between similarities of protein sequencesand ligand binding sites As noted above, the present dataset is highly redundant in terms of sequence homology. Ifthe similarity of ligand binding sites is sharply correlatedwith that of amino acid sequences, it would have been bet-ter to use sequence representatives. To justify the use ofthe redundant data set, we carried out an all-against-all

2

A B C

Fig. 2. Relationship between sequence similarity and ligand binding site similarity. A: Sequence identity of BLAST hits versus GIRAFP-values. B: Sequence identity of BLAST hits versus root-mean-square deviation of aligned ligand binding sites found by GIRAF. C: Sequenceidentity of BLAST hits versus the number of ligand bind site atoms aligned by GIRAF.

BLAST (Altschul et al., 1997) search of all protein chainsof the present data set, and checked the correlation betweensequence identity and GIRAF P-value (Fig. 2A). It shouldbe noted that a ligand binding site may reside at an in-terface of more than two protein subunits (chains), whichcomplicates the notion of representative chains. Therefore,we defined sequence similarity between two PDB entries asthe maximum sequence identity of all the possible pairs ofchains from the two PDB entries.

While there was a significant but very weak negativecorrelation between the GIRAF P-value and percent se-quence identity (Pearson’s correlation -0.14), there weremany strikingly similar (GIRAF P-value < 10−50) pairs ofligand binding sites with low (< 30%) sequence identity,and there were also many weakly similar ligand bindingsites (GIRAF P-value > 10−20) at high (> 90%) sequenceidentity region. This tendency was also confirmed by usingmore conventional measures of similarities. Although theroot-mean-square deviation (RMSD) of aligned atoms ex-hibited a stronger negative correlation with the sequenceidentity (Fig. 2B; Pearson’s correlation -0.46), the rangeof scatter of RMSD was so large that it was not possibleto distinguish the range of sequence identity from RMSDvalues and vice versa. In addition, the number of alignedatoms did not correlate with the sequence identity (Fig.2C), indicating that the local structures of ligand bindingsites can be strictly conserved among distantly related pro-teins. Visual inspection suggested a few possible reasons forthe large deviation in the region of high sequence similar-ity. First, the binding sites do not necessarily overlap com-pletely when different ligands are complexed with (almost)identical proteins. Second, many binding sites are flexible,yet they are able to bind the same ligand. Third, some lig-ands are flexible and can be bound as different conformers,which in turn causes structural changes of the binding site.

One of the rationales for an exhaustive all-against-all comparison is that some similarities between non-representative proteins would be ignored when only se-quence representatives were used. For example, in theresults of a comparison of potential ligand binding sitesof 9708 sequence representative proteins conducted byMinai et al. (2008), the similarity between the ADP bind-ing sites of human inositol (1,4,5)-triphosphate 3-kinase(PDB: 1W2D (Gonzalez et al., 2004); SAICAR synthase-

like fold) and of Archaeoglobus fulgidus Rio2 kinase (PDB:1ZAR (Laronde-Leblanc et al., 2005); Protein kinase-likefold) was not detected although this match was found tohave P-value of 8.1 × 10−17 (40 aligned atoms; RMSD0.75A) in the present result. Furthermore, equivalentmatches were found in not all homologs of these two pro-teins. We note, however, that Minai et al. (2008) did findan equivalent similarity between the binding sites of theseprotein folds, but it was based on apo structures whichwere not treated here. Thus, the similarity not detected byMinai et al. is likely to be due to the use of representatives,but not due to the difference in sensitivity of their methodand the present one.

We conclude that the similarity of sequences and thatof ligand binding site structures are weakly correlated, butthe correlation is not strong enough to infer the one fromthe other.

Defining structural motifs of ligand binding sites Wehave seen that sequence representatives are not suitable forstudying the diversity of ligand binding sites. The use of theraw data of ligand binding sites for statistical analysis, how-ever, would be problematic due to some over-representedand under-represented binding sites. Therefore, it is prefer-able to remove the redundancy based on the ligand bindingsimilarity itself. Furthermore, a list of pairwise similaritiesis not sufficient for characterizing typical patterns of bind-ing modes. Accordingly, we applied the hybrid top-down-bottom-up clustering method to obtain complete-linkageclusters based on P-value. In a complete-linkage cluster(hereafter referred to as ‘cluster’), any pair of its mem-bers are similar within the specified P-value threshold. Assuch, clusters may be regarded as precisely defined struc-tural motifs of ligand binding sites, and hence we use theterm ‘cluster’ and ‘structural motif’ (or simply ‘motif’) in-terchangeably when appropriate. Based on the analysis ofsimilarity networks with varying thresholds (see below), weset the threshold to 10−15 in the following analysis.

It is immediately evident that there are a large numberof small clusters and a small number of large clusters (Fig.3A). Excluding 58,001 singletons (clusters with only onemember), there were 20,224 clusters which accounted for

3

1

10

100

1 10 100 1000 10000

Num

ber

of li

gand

bin

ding

site

s

Cluster (rank)

0

20

40

60

80

100

120

0 50 100 150 200 250

Num

ber

of li

gand

type

s

Cluster size

1

10

100

1000

0 20 40 60 80 100

Cou

nt

Number of ligand types / motif

1

10

100

1000

10000

0 50 100 150 200 250

Cou

nt

Number of motifs / ligand type

CASO4ZNGOL

0 50 100 150 200 250

CASO4

ZNGOL

sugar*peptide*

ADPHEMPO4MNNAMG

DNA*CL

ATPEDONAG

FENADANP

KNAPACTFAD

CUHEC

COPLPNDPAMP

Number of motifs

Liga

nd ty

pe

A B

C D

E

Fig. 3. Statistical properties of structural motifs. A: Size of complete-linkage clusters defined with P-value thresholds of 10−15. B: Scatter plotof cluster size versus ligand types found in cluster. C: Histogram of the number of ligand types per structural motif (cluster). D: Histogramof the number of structural motifs (clusters) associated with a given ligand type. E: 30 most abundant ligand types (polymer molecules aremarked with an asterisk).

128,484 (69%) of all the 186,485 sites. Out of these clusters,2959 clusters consisted of at least 10 sites, accounting for69,748 (37%) sites. The list of these clusters of structuralmotifs is available on-line 2 . Since the ligand binding sitesin small clusters are not reliable due to statistical errors,we use only the 2959 clusters consisting of at least 10 sitesin the following analysis unless otherwise stated.

Diversity of structural motifs with respect to ligand types

Although some structural motifs included binding sitesfor a wide variety of ligand types, this is not always thecase (Fig. 3B). Here, each PDB chemical component iden-tifier (consisting of 1 to 3 letters) corresponds to a ligandtype except for peptides, nucleic acids or sugars, whichwere treated simply as such (i.e., polymer sequence iden-tity is ignored). Large clusters associated with many kindsof ligands were almost always enzymes such as proteases(eukaryotic or retroviral), carbonic anhydrases, proteinkinases and protein phosphatases, whose structures havebeen solved with a variety of inhibitors. For example, twostructural motifs consisting of 246 and 147 ligand bindingsites of eukaryotic (trypsin-like) proteases were associatedwith 106 and 80 ligand types, respectively; two motifs con-sisting of 197 and 115 sites of retroviral proteases with 82and 62 ligand types, respectively; a motif of 63 sites of pro-tein kinases with 58 ligand types. On the contrary, largeclusters with a limited variety of ligands were binding sitesfor heme (globins and nitric oxide synthase oxygenases) or

2 http://pdbjs6.pdbj.org/∼akinjo/lbs/cluster.xml

metal ions. Each structural motif is associated with 3.2 lig-and types on average (standard deviation 5.3): 1322 motifs(47%) with only one ligand type and 2807 motifs (95%)with less than 10 ligand types whereas only 34 motifs con-tained more than 20 ligand types (Fig. 3C). In general, thediversity of ligand types per structural motif is low.

The converse is also true. That is, the number of struc-tural motifs associated with each ligand type is generallyvery limited with the average of 2.1 motifs (standard devi-ation 8.4) per ligand type (Fig. 3D), and 3791 ligand typescorrespond to single motifs. Nevertheless, there were someligands which were associated with many motifs (Fig. 3E).As expected, ligands often included in the solvent (e.g.,SO4 [sulfate], MG [magnesium ion], GOL [glycerol], EDO[ethanediol]) were found in many motifs. Reflecting a largenumber of possible sequences, polymer molecules includ-ing peptide, sugar, and DNA were also found to be boundwith many motifs, respectively. Other than these, mononu-cleotides and dinucleotides and metal ions exhibited a widerange of binding modes.

Diversity with respect to protein families and foldsNot many, but some structural motifs were found to con-tain ligand binding sites of distantly related proteins. Toquantitatively analyze the diversity of structural motifs interms of homologous families and global structural simi-larities, we assigned protein family, superfamily, fold andclasses to each structural motif according to the SCOP(Murzin et al., 1995) database. More concretely, the mostspecific SCOP code (SCOP concise classification string,SCCS) was assigned to each motif that was shared by all

4

members of the corresponding cluster when it was possi-ble, otherwise (i.e., there is at least one member that isdifferent from other members in the cluster at the classlevel), motif was categorized as “others” (Fig. 4A).

Out of 2705 motifs to which SCCS can be assigned, 2637and 62 motifs shared the same domains at the family andsuperfamily level, respectively. Thus, more than 99% of themotifs (of at least 10 binding sites) only contained bindingsites of evolutionarily related proteins. One motif containedproteins from different superfamilies but of the same fold.This motif corresponded to the heme binding site of heme-binding four-helical bundle proteins (SCOP: f.21). Five mo-tifs accommodated similarities across different folds, out ofwhich three were zinc binding motifs (Krishna et al., 2003).One motif contained a P-loop motif which is shared be-tween the P-loop containing nucleotide triphosphate hydro-lases (NTH) (SCOP: c.37) and the PEP carboxykinase-likefold (SCOP: c.91) (Fig. 5A) (Tari et al., 1996). One mo-tif was of the nucleotide-binding sites from FAD/NAD(P)-binding domain (SCOP: c.3) and Nucleotide-binding do-main (SCOP: c.4) (Fig. 5B). Note that some PDB entrieshave not yet been annotated in SCOP. Currently, if suchmembers exist in a cluster, they are simply ignored, andthe assigned SCCS is based only on the members whoseSCCS is known. Therefore, the number of motifs not shar-ing the same folds is somewhat underestimated. Neverthe-less, it seems a general tendency that most motifs are con-fined within homologous proteins, namely families or su-perfamilies.

It was shown above that sequence similarity was onlyweakly related to the structural similarity of ligand bind-ing sites (Fig. 2). This point can be further clarified byexamining motifs of similar binding sites of related pro-teins. For example, the peptide binding sites of a pigtrypsin (PDB: 1UHB (Pattabhi et al., 2004)) and of a hu-man hepsin (PDB: 1Z8G (Herter et al., 2005)) were bothin the same cluster but they share little sequence similar-ity (5% sequence identity based on a structural alignment(Kawabata & Nishikawa, 2000; Kawabata, 2003)), whilethe peptide binding site of bovine trypsin (PDB: 1QB1(Whitlow et al., 1999)) in another cluster shares 81% se-quence identity with the pig thrombin in the previouscluster. This observation can be explained by the fact thatdifferent motifs cover different regions of proteins eventhough they are spatially close or even partially overlap-ping. The same argument applies to other motifs of relatedproteins. Thus, the structural motifs distinguish subtledifferences in ligand binding site structures independent ofsequence similarity.

It has been known that some protein folds can accom-modate a wide range of functions. It is expected that thediversity of function is reflected in that of structures of lig-and binding sites. To analyze such tendency, we countedthe number of motifs that belong to each protein fold (Fig.4B). Only a handful of folds showed a large diversity interms of structural motifs. On average, 8.9 motifs were as-signed to a fold. Out of 332 folds used in the analysis,

1 10 100 1000

Family

Superfamily

Fold

Class

Others

Number of motifs

(2637)

(62)

(1)

(4)

(1)

0 50 100 150 200

TIM beta/alpha-barrel [c.1]P-loop containing NTP hydrolases [c.37]

NAD(P)-binding Rossmann-fold [c.2]Immunoglobulin-like beta-sandwich [b.1]

Ferredoxin-like [d.58]Phosphorylase/hydrolase-like [c.56]PLP-dependent transferases [c.67]Protein kinase-like (PK-like) [d.144]

Ferritin-like [a.25]EF Hand-like [a.39]

Trypsin-like serine proteases [b.47]Zincin-like [d.92]

Concanavalin A-like lectins [b.29]Globin-like [a.1]

Cupredoxin-like [b.6]alpha/beta-Hydrolases [c.69]

FAD/NAD(P)-binding domain [c.3]Heme-dependent peroxidases [a.93]

Flavodoxin-like [c.23]Bacterial photosystem II [f.26]

Number of motifs

1

10

100

0 50 100 150 200

Cou

nt

Number of motifs / fold

c.1c.37c.2b.1

C

BA

Fig. 4. Diversity of structural motifs in terms of protein folds. A:Number of motifs to which the given SCOP (Murzin et al., 1995)hierarchical level (family, superfamily, fold, class) can be assigned. B:Histogram of the number of structural motifs associated with eachSCOP fold. C: 20 most diverse SCOP folds in terms of the numberof associated structural motifs.

only 18 contained more than 30 motifs (Fig. 4C). Amongthem, the TIM barrel fold was an extreme case with 183motifs assigned, reflecting the great diversity of its func-tions (Nagano et al., 2002). Some superfolds (Orengo et al.,1994) such as Rossmann-fold, immunoglobulin-like, globin-like, etc. also showed great diversities of ligand bindingsites.

Similarity network of ligand binding sites While eachmotif defines a precise pattern of ligand binding mode,the members of different structural motifs share signifi-cant structural similarities with each other. To explore theglobal structure of the ‘ligand binding site universe,’ weconstructed a similarity network based on the results ofthe all-against-all comparison. Each structural motif wasrepresented as a node and two nodes were connected if amember of one node was significantly similar to a memberof the other node (i.e., the P-value of their alignment wasbelow a predefined threshold). Thus constructed networkcan be decomposed into a number of connected compo-nents. When the threshold was greater than 10−14, the sizeof the largest connected component of the network was oneor two orders of magnitude greater than that of the secondlargest one (Fig. 6A). For example, setting the thresholdto 10−10 yielded the largest connected component consist-ing of 78,190 sites (i.e., 42% of 186,485 sites). Accordingly,

5

A

B

Fig. 5. Examples of structural motifs shared by different protein folds.The left panel shows the whole protein structures (colored in blueor pink) superimposed based on the alignment of the ligand bindingsites shown in the right panel (colored in the CPK scheme or ma-genta (protein) and green (ligand), respectively). A: ADP bindingsite of bacterial shikimate kinase (PDB: 2DFT (Dias et al., 2007);

SCOP: c.37; blue / CPK-colored) and ATP binding site of bacterialphosphoenolpyruvate carboxykinase (PDB: 1AQ2 (Tari et al., 1997);SCOP: c.91; pink / protein in magenta, ADP in green). B: FAD bind-ing site of human glutathione reductase (PDB: 5GRT (Stoll et al.,1997); SCOP: c.3; blue / CPK-colored) and ADP binding site of bac-terial trimethylamine dehydrogenase (PDB: 2TMD (Barber et al.,1992); SCOP: c.4; pink / protein in magenta, ADP in green).

many functionally unrelated binding sites were somehowconnected in the largest component, which complicated theinterpretation of the component. With the P-value thresh-old of 10−15 or less, the first several connected componentswere of the same order (Fig. 6A), and many members ofeach component appeared to be more functionally related.Thus, we set P = 10−15 for constructing the network inthe following (as well as for defining the complete-linkageclusters described above).

Excluding 54,092 singleton components (those consistingof only one site), 11,532 connected components were found.The largest component consisted of 7935 sites, and 1881components contained at least 10 sites (Fig. 6A).

The main constituents of the largest connected com-ponent of the similarity network (Fig. 6B) were mononu-cleotide (ADP, GDP, etc.) or phosphate binding (PO4)sites. Most notable were P-loop containing NTH (SCOP:c.37) and PEP carboxykinases (SCOP: c.91) which formeda closely connected group as they share similar phosphatebinding sites, i.e., the P-loop motif (the term ‘group’ usedhere indicates closely connected clusters in a network com-ponent colored in green in Fig. 6B-F). Directly connectedwith this group was the coenzyme A (CoA) binding site ofacetyl-CoA acetyltransferases. The magnesium ion (MG)binding site of Ras-related proteins were also connectedwith the group of the P-loop containing proteins since themagnesium ion is often located near the phosphate binding

site. Mononucleotide or phosphate (AMP,U5P,PRP, PO4)binding sites of various phosphoribosyltransferases and theflavin mononucleotide (FMN) binding site of flavodoxinswere also closely connected. The phosphate binding site oftyrosine-protein phosphatases formed another group whichwas weakly connected to the FMN binding site of flavodox-ins.

It is surprising that the heme binding site of globins(hemoglobins, myoglobins, cytoglobins, etc.) was also in-cluded in this component. Nevertheless, it was not directlyconnected to the main group of P-loops, but indirectly viathe sparse group consisting of chloride ion binding site ofT4 lysozymes and sulphate and phosphate binding sitesof miscellaneous proteins. The binding sites of this lattergroup were made of regular structures at the termini of α-helices. When we used a more stringent P-value threshold(say, 10−20), the groups of globins and lysozymes were de-tached from the main group, but the main group contain-ing the P-loops was almost unaffected (data not shown).Thus, the matches connecting globins, lysozymes, and P-loop containing proteins may be considered as ‘false’ hits.Based solely on structural similarity, however, they are dif-ficult to discriminate from ‘true’ hits (structural matchesbetween functionally related sites) since many functionalsites often include regular structures at termini of sec-ondary structures. Nevertheless, the fact that only a sub-set of regular structures were detected suggests that thesematches may correspond to recurring structural patternsoften used as building blocks of functional sites. In addition,we point out that weak but meaningful enzymatic functionsare sometimes detected experimentally in such ‘false’ hits(Ikura et al., 2008).

The second largest connected component mainly con-sisted of mononucleotides or dinucleotides binding sitesof the so-called Rossmann-like fold domains (Fig. 6C)which include, among others, NAD(P)-binding Rossmann-fold domains (SCOP: c.2), FAD/NAD(P)-binding domain(SCOP: c.3), nucleotide-binding domain (SCOP: c.4),SAM-dependent methyltransferases (SCOP: c.66), activat-ing enzymes of the ubiquitin-like proteins (SCOP: c.111)and urocanase (SCOP: e.51).

Peptide (and inhibitor) binding sites of trypsin-like andsubtilisin-like proteases were found in the third largest com-ponent (Fig. 6D). These two proteases do not share a com-mon fold, but were connected due to the similarity of theactive site structures around the well-known catalytic triad.

The EF hand motif, a major calcium binding motif, wasfound in the fourth largest component (Fig. 6E) in which avariety of other calcium ion binding sites were also found.Although the main group in this component mainly con-sisted of the calcium ion binding sites of various calmodulin-like proteins, it also contained similar sites of periplasmicbinding proteins (PBP). The ligands of these PBP’s also in-clude sodium in addition to calcium ions. The main groupwas weakly connected to calcium ion binding sites of pro-teins of completely different folds such as galactose-bindingdomains (e.g., galactose oxidase, fucolectins), laminin G-

6

1

10

100

1000

10000

100000

1 10 100 1000 10000

Num

ber

of s

ites

Connected components (rank)

P < 10-10

P < 10-14

P < 10-15

P < 10-20

P < 10-30

D

BA

E F

C

Fig. 6. Networks of structural motifs of ligand binding sites. A: Distribution of the size of connected component of the similarity networkwith varying P-value thresholds. A transition is observed at P=10−15. B-F: The five largest connected components of the similarity network(P-value threshold = 10−15). Some groups of structural motifs are marked by black circles annotated with ligand types and protein folds. To

facilitate visualization, each node (shown as a sphere) is represented as a complete-linkage cluster (structural motif) of ligand binding sitesdefined with P = 10−15 (the sphere size is proportional to the cluster size). Nodes and edges are colored according to the values of theirclustering coefficient (green: high; magenta: low) (Watts & Strogatz, 1998).

like modules (e.g., laminin, agrin, etc.), alpha-amylases, an-nexins, and phospholipase A2. Due to its spatial proxim-ity, the calcium binding site of phospholipase A2 was alsoconnected to its inhibitor binding sites.

Our last example, the fifth largest component, exhibitedan exploding structure (Fig. 6F). Nevertheless, most bind-ing sites are associated with nucleotides. The main closelyconnected group consisted of the ATP (and inhibitors)binding sites of protein kinase family proteins, next to

which the ADP binding sites of glutathione synthetase fam-ily proteins (including D-ala-D-ala ligases) were connected.Other closely connected groups included FAD binding sitesof ferredoxin reductase-like proteins, ATP or magnesiumbinding sites of adenine nucleotide alpha hydrolases-likeproteins, inhibitor binding sites of nitric-oxide synthases,and NAD (analog) binding sites of ADP-ribosylation pro-teins (e.g., T-cell ecto-ADP-ribosyltransferase 2, iota toxin,etc.). There was a large sparse group connected with the

7

10 100

(c.109,c.91)-(c.37)

(c.2)-(c.66)

(c.37)-(c.91,d.94)

(a.39)-(b.69)

(g.39)-(g.50)

(a.1)-(d.58)

(a.39)-(c.94)

(a.39)-(g.75)

(a.39)-(b.30)

(a.39)-(c.93)

(a.39)-(d.92)

(a.139)-(a.39)

(a.39)-(d.3)

(a.39)-(d.2)

(c.31)-(g.41)

(a.102,c.94)-(a.39)

(c.108)-(c.23)

(b.3)-(d.144)

(c.3,d.16)-(c.66)

(c.37)-(d.108)

Number of motif pairsP

air o

f S

CO

P fold

s 10 100 1000

CA-CA

ZN-ZN

FAD-NAD

SF4-SF4

CA-NA

SO4-SO4

FE-ZN

GDP-SO4

ATP-SO4

PO4-SO4

NAD-NAD

F3S-SF4

ADP-ATP

ADP-PO4

NAD-SAM

CA-MG

NAD-SAH

NAD-NAI

FAD-MTA

FAD-SAH

Number of motif pairsA B

Fig. 7. Ligand binding sites shared across different protein folds. A:20 most common pairs of different folds sharing significant ligandbinding site similarities. B: 20 most common pairs of ligand typesshared across different folds.

main group of protein kinases. In that sparse group, lig-and binding sites of transthyretins (prealbumins) were of-ten found to be directly connected with that of protein ki-nases although their folds are different. These binding sitesboth involve a face of a β-sheet, and their similarity wasfound due to the backbone conformation of the β-sheet.Since many proteins bind their ligands on a face of a β-sheet, this observation in turn explains the origin of thelarge sparse group.

Significant similarities across different folds The similar-ity network of ligand binding sites revealed many structuralsimilarities across different folds. To explore the extent ofsignificant ‘cross-fold’ similarities (with P < 10−15), we as-signed SCOP codes to as many structural motifs as possi-ble, and enumerated motif pairs whose members were sig-nificantly similar but did not share a common fold (Fig.7A). We also examined the ligand pairs in those matches,and found that most of them were reasonable matches (Fig.7B): metal ions were matched with metal ions, nucleotideswith nucleotides or phosphate, and so on. Thus, many ofthese cross-fold similarities are expected to be function-ally relevant. The observation that sulfate (SO4) bindingsites were often found to be matched with mononucleotide(GDP and ATP) or phosphate (PO4) binding sites (Fig.7B) confirms the usefulness of the former ligand in infer-ring the binding of the latter ligands, as often practiced bycrystallographers. We note that multiple SCCS may be as-signed to a single motif if it contains multiple fold typesor its member sites are located at an interface of multipledomains. In order to cover all possible fold pairs, we didnot exclude motifs consisting of less than 10 binding sitesin this analysis. There were in total 4035 pairs of struc-tural motifs (52,709 pairs of binding sites) that exhibitedsignificant similarities but did not share the same fold. Thecomplete list of these pairs is available on-line 3 .

The most common cross-fold similarity was found be-tween the P-loop containing NTH (SCOP: c.37) and the

3 http://pdbjs6.pdbj.org/∼akinjo/lbs/diffold.xml

PEP-carboxykinase-like (SCOP: c.91) (c.f. Fig. 5A). As de-scribed in the analysis of complete-linkage clusters, thiscorresponds to mononucleotide or phosphate binding sites.

Mononucleotide or dinucleotide binding sites of variousRossmann-like folds (SCOP: c.2, c.3, c.4, c.66) also exhib-ited significant mutual similarities (e.g., Figs. 5B and 8A).

The calcium binding sites of EF hand-like fold (a.39) werefound to be similar to the metal binding sites of many foldsincluding beta-propeller proteins (Fig. 8B) and periplas-mic binding proteins (SCOP: c.93 [class I], c.94 [class II]),lysozyme-like (SCOP: d.2), Zincin-like (SCOP: d.92), andmany others.

Similar zinc binding sites were found in many, mostlysmall, folds in addition to DHS-like NAD/FAD-binding do-main (SCOP: c.31) and Rubredoxin-like (g.41) (Fig. 8C),the former of which may be regarded as an inserted zincfinger motif.

The similarity between globin-like (SCOP: a.1) andferredoxin-like (SCOP: d.58) was due to the coordinatedstructures of the iron-sulfur clusters found in alpha-helicalferredoxins and ferredoxins, respectively.

HAD-like fold proteins (SCOP: c.108) and CheY-like(flavodoxin fold) proteins (SCOP: c.23) often share similarbinding sites (e.g., Fig. 8D). Interestingly, although theseproteins have very similar topologies, the orders of alignedsecondary structure elements were different when the align-ment was based on the ligand binding site similarity.

As noted in the description of a network component (Fig.6F), protein kinases and transthyretins share similar bind-ing sites which are located on a face of a β-sheet (Fig. 8E).Nevertheless, their ligand moieties seem also similar.

Also as seen in the network component (Fig. 6B), phos-phate binding site of the P-loop motif exhibits a significantsimilarity with CoA binding site of acetyltransferases (Fig.8F). A close examination showed the phosphate bound tothe P-loop motif coincided with the phosphate group ofCoA bound to the acetyltransferase.

The list of the cross-fold similarities contained manyother examples including, but not limited to, those dis-cussed in the context of the similarity network. Here wegive two other examples. Bacterial peptide deformylase2 (SCOP: d.167) and human macrophage metalloelastase(SCOP: d.92) both act with peptides, and their ligand bind-ing sites exhibit high structural similarity (Fig. 8G). DNAis one of the most abundant ligands found in cross-fold sim-ilarities (Fig. 7B). Not surprisingly, there can be also foundsimilarity between binding sites for DNA and RNA. One ex-ample is the KH1 domain of human poly(rC)-binding pro-tein 2 which binds DNA and bacterial transcription elonga-tion protein NusA which binds RNA (Fig. 8H). These pro-teins have different variants of the KH domains (Grishin,2001b).

Discussion From the result of the exhaustive all-against-

8

HD

A

C

E

F

G

B

Fig. 8. Examples of ligand binding sites shared across different folds. The color schemes are the same as in Fig. 5. A: AMP bindingsite of Thermotoga maritima hypothetical protein tm1088a (PDB: 2G1U (Joint Center for Structural Genomics, 2006); SCOP: c.2; blue/ CPK-colored) and SAM binding site of human putative ribosomal RNA methyltransferase 2 (PDB: 2NYU (Wu et al., 2006); SCOP:c.66; pink / protein in magenta, SAM in green). B: Calcium binding sites of Clostridium thermocellum cellulosomal scaffolding pro-tein A (PDB: 2CCL (Carvalho et al., 2007); SCOP: a.39; blue / CPK-colored) and human integrin alpha-IIb (PDB: 1TXV (Xiao et al.,2004); SCOP: b.69; pink / protein in magenta, calcium in green). C: Zinc binding sites of human NAD-dependent deacetylase (PDB:2H4H (Hoff et al., 2006); SCOP: c.31 [inferred by SSM (Krissinel & Henrick, 2004)]; blue / CPK-colored) and Bacillus stearothermophilus

adenylate kinase (PDB: 1ZIN (Berry & Phillips Jr., 1998); SCOP: g.41; pink / protein in magenta, zinc in green). D: Formic acid bind-ing site of Xanthobacter autotrophicus L-2-haloacid dehalogenase (PDB: 1AQ6 (Ridder et al., 1997); SCOP: c.108; blue / CPK-colored)and BeF3− binding site of Escherichia coli PhoB (PDB: 1ZES (Bachhawat et al., 2005); SCOP: c.23; pink / protein in magenta, BeF3−

in green). E: 3,5-diiodosalicylic acid binding site of human transthyretin (PDB: 3B56; SCOP: b.3; blue / CPK-colored) and inhibitor(N-[3-(4-fluorophenoxy)phenyl]-4-[(2-hydroxybenzyl)amino]piperidine-1-sulfonamide) binding site of human mitogen-activated protein kinase14 (PDB: 1ZZ2; SCOP: d.144; pink / protein in magenta, inhibitor in green). F: phosphate binding site of Pyrococcus furiosus Rad50 ABC-AT-Pase (PDB: 1II8; SCOP: c.37; blue / CPK-colored) and coenzyme-A (CoA) binding site of Salmonella typhimurium LT2 acetyl transferase(PDB: 1S7N; SCOP: d.108; pink / protein in magenta, CoA in green). G: Actinonin binding site of B. stearothermophilus peptide deformylase2 (PDB: 1LQY (Guilloteau et al., 2002); SCOP: d.167; blue / CPK-colored) and NNGH binding site of human macrophage metalloelastase(PDB: 1Z3J; SCOP: d.92; pink / protein in magenta, NNGH in green). H: DNA binding site of KH1 domain of human poly(rC)-binding pro-tein (PDB: 2AXY (Du et al., 2005); SCOP: d.51; blue / CPK-colored, DNA in orange) and RNA binding site of Mycobacterium tuberculosis

transcription elongation protein NusA (PDB: 2ATW (Beuth et al., 2005); SCOP: d.52; pink / protein in magenta, RNA in green).

all comparison, we were able to obtain an extensive listof ligand binding site similarities irrespective of sequencehomology or global protein fold. The similarity networkuncovered very many cross-fold similarities as well as well-known ones. Although it is still not clear how many ofthese similarities are functionally relevant, it was oftenobserved that different folds were superimposable to asignificant extent when the alignment was based on theligand binding sites (e.g., Figs. 5, 8). Aligning protein

structures based on ligand binding sites (or functional sitesin general) irrespective of sequence similarity, sequenceorder, and protein fold (as currently defined) may be auseful approach to elucidating the evolutionary historyof fold changes (Grishin, 2001a; Krishna & Grishin, 2004;Andreeva & Murzin, 2006; Taylor, 2007; Goldstein, 2008;Xie & Bourne, 2008).

As was seen in the similarity network (Fig. 6), some linksare based on the similarity of highly regular (secondary)

9

structures which are found in many protein structures (e.g.,Fig. 6B,F). While such similarities may not be directly re-lated to any biochemical functions, they suggest that manyligand binding sites are based on combinations of some reg-ular local structures. It is known that a relatively smalllibrary of backbone fragments can accurately model ter-tiary structures of proteins (Kolodny et al., 2002). Con-sequently, the variety of contiguous fragments recurringin ligand binding sites is also limited as far as backbonestructure is concerned. Friedberg & Godzik (2005) foundsimilarities across different protein folds including thoseinvolved in various zinc-finger motifs and Rossmann-likefolds, as shown in this study. They also showed significantcorrelations between similarity of fragments and that ofprotein functions. This observation is consistent with thepresent results in that it suggests that specific combina-tions of fragments encode specific functions. To apply theGIRAF method to functional annotations, however, it ispreferable to discriminate functionally relevant similaritiesfrom purely structural similarities.

Some of the short-comings of simple pairwise comparisonmay be overcome by the complete-linkage clustering analy-sis of similar binding sites, which allowed us to define precisestructural motifs. It should be stressed that defining reliablemotifs requires redundancy in the PDB (Wangikar et al.,2003), otherwise it would be more difficult to distinguishrecurring structures from incidental matches. These motifsmay be useful for defining structural templates for efficientmotif matching (Wallace et al., 1997). Despite the diver-sity of binding sites and their similarities, most motifs werefound to be confined within single families or superfamilies,and they were also found to be highly specific to particularligands. Thus, these motifs may be helpful for annotatingputative functions of proteins, especially, of structural ge-nomics targets.

In conclusion, the development of an extremely efficientsearch method (GIRAF) to detect local structural simi-larities made it possible to conduct the first exhaustiveall-against-all comparison of all ligand binding sites inall the known protein structures. We identified a numberof well-defined structural motifs, enumerated many non-trivial similarities. While exhaustive pairwise comparisonsare useful for detecting weak and possibly partial similar-ities between ligand binding sites, the significance of suchmatches may not be immediately obvious because some ofthem may be based on ubiquitous regular structures.

Meanwhile, complete-linkage clusters of ligand bindingsites are useful for identifying functionally relevant bind-ing site structures, but they may neglect partial but sig-nificant matches. Therefore, these two approaches, exhaus-tive pairwise comparison and motif matching, are comple-mentary to each other, and hence the combination thereofmay be helpful for more reliable annotations of proteinswith unknown functions. These approaches may be furthersupplemented by other existing fold and/or sequence-basedmethods (Standley et al., 2008; Xie & Bourne, 2008). Thepresent method can be also applied to a whole protein

structure (not limited to its predefined ligand binding sites)to find potential ligand binding sites (Kinjo & Nakamura,2007). In this way, we are currently annotating all struc-tural genomics targets (Chen et al., 2004). We also plan tomake this method available as a web service so that struc-tural biologists can routinely search for ligand binding sitesof their interest.

Experimental Procedures

The GIRAF method The details of the original GIRAFmethod has been published elsewhere (Kinjo & Nakamura,2007). In this study, an improved version of GIRAF wasused for conducting the all-against-all comparison. The im-provement includes more sensitive geometric indexing withatomic composition around each reference set, simplifiedSQL expressions, and parallelization (A.R.K. and H.N., un-published).

All-against-all comparison Ligand binding sites were ex-tracted from PDBML files as described in Results. A ligandis defined as those entities that are annotated neither as“polypeptide(L)” with more than 24 amino acid residuesnor “water” in the entity category. That is, a ligand can bepolypeptide shorter than 25 residues, DNA, RNA, polysac-charides (sugars), lipids, metal ions, iron-sulfur clusters, orany other small molecules. However, ligands with more than1000 atoms were discarded. The all-against-all comparisonwas carried out on a cluster machine consisting of 20 nodesof 8-core processors (Intel Xeon 3.2 GHz). The whole com-putation was finished within approximately 60 hours.

Clusters of similar ligand binding sitesTo obtain complete-linkage clusters, we first constructed

a single-linkage network based on a pre-defined P-valuethreshold. Then this network was decomposed into con-nected components. Each component was then broken intofiner components by imposing a more stringent P-valuethreshold. This decomposition was iterated until P-valuethreshold reached 10−100. Then bottom-up complete-linkage was iteratively applied to each connected compo-nent, the result of which was then combined into an uppercomponent (previously determined with a higher P-valuethreshold). This bottom-up process was terminated whenP-value threshold, 10−15, was reached. Each cluster wasdefined as a structural motif for the ligand binding sites.

10

Analysis of networks and structural motifsTo annotate thus obtained structural motifs with the

SCOP (Murzin et al., 1995) codes, we used the parsable fileof SCOP (version 1.73). When an analysis involved SCOPcodes, those PDB entries whose SCOP classification has notyet been determined were ignored. Each SCOP SCCS codewas assigned to a ligand binding site as described by others(Gold & Jackson, 2006). When a site resides at an interfaceof multiple domains, multiple SCCS codes were assignedto the site. Two or more binding sites are said to share thesame fold (or family, superfamily, etc.) if the intersection oftheir SCCS code sets is not empty. The SCCS code assignedto a structural motif was defined as the union of all theSCCS codes found in the corresponding cluster members.We used only the seven main SCOP classes (all-α [a], all-β [b], α/β [c], α + β [d], multi-domain [e], membrane andcell surface proteins and peptides [f], and small proteins[g]). The figures of alignments (Figs. 5 and 8) were createdwith jV version 3 (Kinoshita & Nakamura, 2004) using thePDBML-extatom files produced by GIRAF. The networkfigures (Fig. 6B-F) were created with the Tulip software(http://www.tulip-software.org/).

ReferencesAltschul, S. F., Madden, T. L., Schaffer, A. A., Zhang,

J., Zhang, Z., Miller, W., & Lipman, D. L. (1997).Gapped blast and PSI-blast: A new generation of pro-tein database search programs. Nucleic Acids Res. 25,3389–3402.

Andreeva, A. & Murzin, A. G. (2006). Evolution of proteinfold in the presence of functional constraints. Curr. Opin.Struct. Biol. 16, 399–408.

Bachhawat, P., Swapna, G. V., Montelione, G. T., & Stock,A. M. (2005). Mechanism of activation for transcriptionfactor PhoB suggested by different modes of dimerizationin the inactive and active states. Structure 13, 1353–1363.

Barber, M. J., Neame, P. J., Lim, L. W., White, S., &Matthews, F. S. (1992). Correlation of X-ray deducedand experimental amino acid sequences of trimethy-lamine dehydrogenase. J. Biol. Chem. 267, 6611–6619.

Berman, H., Henrick, K., Nakamura, H., & Markley, J. L.(2007). The worldwide Protein Data Bank (wwPDB):ensuring a single, uniform archive of PDB data. NucleicAcids Res. 35, D301–D303.

Berry, M. B. & Phillips Jr., G. N. (1998). Crystal struc-tures of Bacillus stearothermophilus adenylate kinasewith bound Ap5A, Mg2+ Ap5A, and Mn2+ Ap5A revealan intermediate lid position and six coordinate octahe-dral geometry for bound Mg2+ and Mn2+. Proteins 32,276–288.

Beuth, B., Pennell, S., Arnvig, K. B., Martin, S. R., & Tay-

lor, I. A. (2005). Structure of a Mycobacterium tubercu-

losis NusA-RNA complex. EMBO J. 24, 3576–3587.Brakoulias, A. & Jackson, R. M. (2004). Towards a struc-

tural classification of phosphate binding sites in protein-nucleotide complexes: an automated all-against-all struc-tural comparison using geometric matching. Proteins 56,250–260.

Carvalho, A. L., Dias, F. M. V., Nagy, T., Prates, J. A. M.,Proctor, M. R., Smith, N., Bayer, E. A., Davies, G. J.,Ferreira, L. M. A., Romao, M. J., Fontes, C. M. G. A., &Gilbert, H. J. (2007). Evidence for a dual binding modeof dockerin modules to cohesins. Proc. Natl. Acad. Sci.USA 104, 3089–3094.

Chen, L., Oughtred, R., Berman, H. M., & Westbrook,J. (2004). TargetDB: a target registration database forstructural genomics projects. Bioinformatics 20, 2860–2862.

Dias, M. V., Faim, L. M., Vasconcelos, I. B., de Oliveira,J. S., Basso, L. A., Santos, D. S., & de Azevedo, W. F.(2007). Effects of the magnesium and chloride ions andshikimate on the structure of shikimate kinase from My-

cobacterium tuberculosis. Acta Crystallogr. F 63, 1–6.Du, Z., Lee, J. K., Tjhen, R., Li, S., Pan, H., Stroud, R. M.,

& James, T. L. (2005). Crystal structure of the first khdomain of human poly(C)-binding protein-2 in complexwith a C-rich strand of human telomeric DNA at 1.7 A.J. Biol. Chem. 280, 38823–38830.

Friedberg, I. & Godzik, A. (2005). Connecting the proteinstructure universe by using sparse recurring fragments.Structure 13, 1213–1224.

Garcia-Molina, H., Ullman, J. D., & Widom, J. (2002).Database Systems: The Complete Book (Prentice Hall:Upper Saddle River, NJ, U. S. A.).

Gold, N. D. & Jackson, R. M. (2006). Fold independentstructural comparisons of protein-ligand binding sitesfor exploring functional relationships. J. Mol. Biol. 355,1112–1124.

Goldstein, R. A. (2008). The structure of protein evolu-tion and the evolution of protein structure. Curr. Opin.Struct. Biol. 18, 170–177.

Gonzalez, B., Schell, M. J., Letcher, A. J., Veprintsev,D. B., Irvine, R. F., & Williams, R. L. (2004). Structureof a human inositol 1,4,5-trisphosphate 3-kinase: sub-strate binding reveals why it is not a phosphoinositide3-kinase. Mol. Cell 15, 689–701.

Grishin, N. V. (2001a). Fold change in evolution of proteinstructures. J. Struct. Biol. 134, 167–185.

Grishin, N. V. (2001b). KH domain: one motif, two folds.Nucleic Acids Res. 29, 638–643.

Guilloteau, J. P., Mathieu, M., Giglione, C., Blanc, V.,Dupuy, A., Chevrier, M., Gil, P., Famechon, A., Meinnel,T., & Mikol, V. (2002). The crystal structures of fourpeptide deformylases bound to the antibiotic actinoninreveal two distinct types: a platform for the structure-based design of antibacterial agents. J. Mol. Biol. 320,951–962.

Gutteridge, A. & Thornton, J. M. (2005). Understanding

11

http://www.tulip-software.org/

nature’s catalytic toolkit. Trends Biochem. Sci. 30, 622–629.

Herter, S., Piper, D. E., Aaron, W., Gabriele, T., Cutler,G., Cao, P., Bhatt, A. S., Choe, Y., Craik, C. S., Walker,N., Meininger, D., Hoey, T., & Austin, R. J. (2005). Hep-atocyte growth factor is a preferred in vitro substratefor human hepsin, a membrane-anchored serine proteaseimplicated in prostate and ovarian cancers. Biochem. J.390, 125–136.

Hoff, K. G., Avalos, J. L., Sens, K., & Wolberger, C. (2006).Insights into the sirtuin mechanism from ternary com-plexes containing NAD+ and acetylated peptide. Struc-ture 14, 1231–1240.

Ikura, T., Kinoshita, K., & Ito, N. (2008). A cavity withan appropriate size is the basis of the ppiase activity.Protein Eng. Des. Sel. 21, 83–89.

Joint Center for Structural Genomics (2006). Crystal struc-ture of (tm1088a) from Thermotoga maritima at 1.50 Aresolution. PDB entry 2G1U.

Jones, S. & Thornton, J. M. (2004). Searching for functionalsites in protein structures. Curr. Opin. Struct. Biol. 8,3–7.

Kawabata, T. (2003). Matras: a program for protein 3Dstructure comparison. Nucleic Acids Res. 31, 3367–3369.

Kawabata, T. & Nishikawa, K. (2000). Protein tertiarystructure comparison using the markov transition modelof evolution. Proteins 41, 108–122.

Kinjo, A. R. & Nakamura, H. (2007). Similarity search forlocal protein structures at atomic resolution by exploit-ing a database management system. BIOPHYSICS 3,75–84. doi:10.2142/biophysics.3.75.

Kinoshita, K. & Nakamura, H. (2004). eF-site and PDB-jViewer: database and viewer for protein functional sites.Bioinformatics 20, 1329–1330.

Kinoshita, K., Sadanami, K., Kidera, A., & Go, N. (1999).Structural motif of phosphate-binding site common tovarious protein superfamilies: all-against-all structuralcomparison of protein-mononucleotide complexes. Pro-tein Eng. 12, 11–14.

Kobayashi, N. & Go, N. (1997). ATP binding proteins withdifferent folds share a common ATP-binding structuralmotif. Nat. Struct. Biol. 4, 6–7.

Kolodny, R., Koehl, P., Guibas, L., & Levitt, M. (2002).Small libraries of protein fragments model native proteinstructures accurately. J. Mol. Biol. 323, 297–307.

Krishna, S. S. & Grishin, N. V. (2004). Structurally anal-ogous proteins do exist!. Structure 12, 1125–1127.

Krishna, S. S., Majumdar, I., & Grishin, N. V. (2003).Structural classification of zinc fingers: survey and sum-mary. Nucleic Acids Res. 31, 532–550.

Krissinel, E. & Henrick, K. (2004). Secondary-structurematching (SSM), a new tool for fast protein structurealignment in three dimensions. Acta Crystallogr. D 60,2256–2268.

Laronde-Leblanc, N., Guszczynski, T., Copeland, T., &Wlodawer, A. (2005). Autophosphorylation of Ar-

chaeoglobus fulgidus Rio2 and crystal structures of its

nucleotide-metal ion complexes. FEBS J. 272, 2800–2810.

Lee, D., Redfern, O., & Orengo, C. (2007). Predicting pro-tein function from sequence and structure. Nat. Rev.Mol. Cell Biol. 8, 995–1005.

Minai, R., Matsuo, Y., Onuki, H., & Hirota, H. (2008).Method for comparing the structures of protein ligand-binding sites and application for predicting protein-druginteractions. Proteins 72, 367–381.

Murzin, A. G., Brenner, S. E., Hubbard, T., & Chothia,C. (1995). SCOP: A structural classification of proteinsdatabase for the investigation of sequences and struc-tures. J. Mol. Biol. 247, 536–540.

Nagano, N., Orengo, C. A., & Thornton, J. M. (2002). Onefold with many functions: the evolutionary relationshipsbetween TIM barrel families based on their sequences,structures and functions. J. Mol. Biol. 321, 741–765.

Orengo, C. A., Jones, D. T., & Thornton, J. M. (1994).Protein superfamilies and domain superfolds. Nature372, 631–634.

Pattabhi, V., Syed Ibrahim, B., & Shamaladevi, N. (2004).Trypsin activity reduced by an autocatalytically pro-duced nonapeptide. J. Biomol. Struct. Dyn. 21, 737–744.

Polacco, B. J. & Babbitt, P. C. (2006). Automated discov-ery of 3D motifs for protein function annotation. Bioin-formatics 22, 723–730.

Porter, C. T., Bartlett, G. J., & Thornton, J. M. (2004).The Catalytic Site Atlas: a resource of catalytic sitesand residues identified in enzymes using structural data.Nucleic Acids Res. 32, D129–D133.

Ridder, I. S., Rozeboom, H. J., Kalk, K. H., Janssen, D. B.,& Dijkstra, B. W. (1997). Three-dimensional structure ofL-2-haloacid dehalogenase from Xanthobacter autotroph-

icus GJ10 complexed with the substrate-analogue for-mate. J. Biol. Chem. 272, 33015–33022.

Rognan, D. (2007). Chemogenomic approaches to rationaldrug design. Br. J. Pharmacol. 152, 38–52.

Shulman-Peleg, A., Nussinov, R., & Wolfson, H. J. (2004).Recognition of functional sites in protein structures. J.Mol. Biol. 339, 607–633.

Standley, D. M., Toh, H., & Nakamura, H. (2008). Func-tional annotation by sequence-weighted structure align-ments: Statistical analysis and case studies from the Pro-tein 3000 structural genomics project in Japan. Proteins72, 1333–1351.

Stark, A. Sunyaev, S. & Russell, R. B. (2003). A model forstatistical significance of local similarities in structure.J. Mol. Biol. 326, 1307–1316.

Stoll, V. S., Simpson, S. J., Krauth-Siegel, R. L., Walsh,C. T., & Pai, E. F. (1997). Glutathione reductase turnedinto trypanothione reductase: structural analysis of anengineered change in substrate specificity. Biochemistry36, 6437–6447.

Tari, L. W., Matte, A., Goldie, H., & Delbaere, L. T. (1997).Mg2+-Mn2+ clusters in enzyme-catalyzed phosphoryl-transfer reactions. Nat. Struct. Biol. 4, 990–994.

Tari, L. W., Mattte, A., Pugazhenthi, U., Goldie, H., &

12

Delbaere, L. T. (1996). Snashot of an enzyme reactionintermediate in the structure of the ATP-Mg2+-oxalateternary complex of Escherichia coli PEP carboxykinase.Nat. Struct. Biol. 3, 355–363.

Taylor, W. R. (2007). Evolutionary transitions in proteinfold space. Curr. Opin. Struct. Biol. 17, 354–361.

Wallace, A. C., Borkakoti, N., & Thornton, J. M. (1997).TESS: A geometric hashing algorithm for deriving 3Dcoordinate templates for searching structural databases.application to enzyme active sites. Protein Sci. 6, 2308–2323.

Wangikar, P. P., Tendulkar, A., Ramya, S., Mali, D. N., &Sarawagi, S. (2003). Functional sites in protein familiesuncovered via an objective and automated graph theo-retic approach. J. Mol. Biol. 326, 955–978.

Watts, D. J. & Strogatz, S. (1998). Collective dynamics of‘small-world’ networks. Nature 393, 440–442.

Westbrook, J., Ito, N., Nakamura, H., Henrick, K., &Berman, H. M. (2005). PDBML: the representation ofarchival macromolecular structure data in XML. Bioin-formatics 21, 988–992.

Whitlow, M., Arnaiz, D. O., Buckman, B. O., Davey, D. D.,Griedel, B., Guilford, W. J., Koovakkat, S. K., Liang, A.,Mohan, R., Phillips, G. B., Seto, M., Shaw, K. J., Xu,W., Zhao, Z., Light, D. R., & Morrissey, M. M. (1999).Crystallographic analysis of potent and selective factorXa inhibitors complexed to bovine trypsin. Acta Crys-tallogr., D 55, 1395–1404.

Wolfson, H. J. & Rigoutsos, I. (1997). Geometric hashing:An overview. IEEE Comput. Sci. Eng. 4, 10–21.

Wu, H., Dong, A., Zeng, H., Loppnau, P., Weigelt, J.,Sundstrom, M., Arrowsmith, C. H., Edwards, A. M.,Bochkarev, A., & Plotnikov, A. N. (2006). The crystalstructure of human FtsJ homolog 2 (E. coli) protein incomplex with AdoMet. PDB entry 2NYU.

Xiao, T., Takagi, J., Coller, B. S., Wang, J. H., & Springer,T. A. (2004). Structural basis for allostery in integrinsand binding to fibrinogen-mimetic therapeutics. Nature432, 59–67.

Xie, L. & Bourne, P. E. (2008). Detecting evolutionaryrelationships across existing fold space, using sequenceorder-independent profile-profile alignments. Proc. Natl.Acad. Sci. USA 105, 5441–5446.

AcknowledgmentsThe authors thank Kengo Kinoshita, Motonori Ota and

Hiroyuki Toh for helpful discussion, and Daron M. Standleyfor critically reading the manuscript. This work was sup-ported by a grant-in-aid from Institute for BioinformaticsResearch and Development, Japan Science and TechnologyAgency (JST). H. N. was supported by Grant-in-Aid forScientific Research (B) No. 20370061 from Japan Societyfor the Promotion of Science (JSPS).

13

Documents

Comprehensive structural classification of ligand binding motifs in