Insight II Binding Site Analysis March 2000 Accelrys 9685 Scranton Road San Diego, CA 92121-3752 858/458-9990 Fax: 858/458-0136


Insight II Binding Site Analysis

March 2000

Accelrys
9685 Scranton Road
San Diego, CA 92121-3752

858/458-9990 Fax: 858/458-0136


Copyright*

This document is copyright © 2000, Accelrys Inc., a subsidiary of Pharmacopeia, Inc. All rights reserved. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means or stored in a database retrieval system without the prior written permission of Accelrys Inc.

The software described in this document is furnished under a license and may be used or copied only in accordance with the terms of such license.

Restricted Rights Legend

Use, duplication, or disclosure by the Government is subject to restrictions as in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFAR 252.227–7013 or subparagraphs (c)(1) and (2) of the Commercial Computer Software—Restricted Rights clause at FAR 52.227-19, as applicable, and any successor rules and regulations.

Trademark Acknowledgments

Catalyst, Cerius2, Discover, Insight II, and QUANTA are registered trademarks of Accelrys Inc. Biograf, Biosym, Cerius, CHARMm, Open Force Field, NMRgraf, Polygraf, QMW, Quantum Mechanics Workbench, WebLab, and the Biosym, MSI, and Accelrys marks are trademarks of Accelrys Inc.

IRIS, IRIX, and Silicon Graphics are trademarks of Silicon Graphics, Inc. AIX, Risc System/6000, and IBM are registered trademarks of International Business Machines, Inc. UNIX is a registered trademark, licensed exclusively by X/Open Company, Ltd. PostScript is a trademark of Adobe Systems, Inc. The X-Window system is a trademark of the Massachusetts Institute of Technology. NFS is a trademark of Sun Microsystems, Inc. FLEXlm is a trademark of Highland Software, Inc.

Permission to Reprint, Acknowledgments, and References

Accelrys usually grants permission to republish or reprint material copyrighted by Accelrys, provided that requests are first received in writing and that the required copyright credit line is used. For information published in documentation, the format is “Reprinted with permission from Document-name, Month Year, Accelrys Inc., San Diego.” For example:

Reprinted with permission from Cerius2 User Guide, Month 2000, Accelrys Inc., San Diego.

Requests should be submitted to Accelrys Scientific Support, either through electronic mail to [email protected] or in writing to:

Accelrys Scientific Support and Customer Service
9685 Scranton Road
San Diego, CA 92121-3752

To print photographs or files of computational results (figures and/or data) obtained using Accelrys software, acknowledge the source in the format:

Computational results obtained using software programs from Accelrys Inc.—dynamics calculations were done with the Discover® program, using the CFF91 forcefield, ab initio calculations were done with the DMol program, and graphical displays were printed out from the Cerius2 molecular modeling system.

To reference an Accelrys publication in another publication, no author should be specified and Accelrys Inc. should be considered the publisher. For example:

Cerius2 Modeling Environment, Month 1999. San Diego: Accelrys Inc., 1999.

*U.S. version of Copyright Page


Contents

1. Introduction
    Hardware and installation
    Invoking Binding_Site
    Pilot tutorials

2. Theory
    EvolutionaryTrace
        Multiple sequence alignment and sequence dendrogram
        Identifying trace residues
        Clustering residues
    ActiveSite_Search

3. Methodology
    EvolutionaryTrace
        Exposed trace residues
        Trace residues of specified B factors
    ActiveSite_Search

4. Command Summary
    Sequences pulldown
    ActiveSite_Search pulldown
    EvolutionaryTrace pulldown
    Sequence Dendrogram
    Residue Dendrogram
    StrDB pulldown
    Prostat pulldown

A. References


1 Introduction

The Binding_Site® module is a suite of programs in Insight II® for identifying and characterizing protein active sites, binding sites, and functional residues from protein structures and multiple sequence alignments. It consists mainly of three methods: EvolutionaryTrace, ActiveSite_Search, and StrDB.

♦ EvolutionaryTrace identifies protein binding surfaces or functional residues common to protein families by studying residue conservation in protein sequence families during evolution and the 3D arrangement of the conserved residues in the protein structures. It provides input for site-directed mutagenesis studies of structure-function relationships in macromolecules and targets for structure-based drug design.

♦ ActiveSite_Search characterizes protein active sites and binding sites by locating cavities in 3D protein structures. The sites identified can be used to guide protein-ligand docking experiments.

♦ StrDB searches protein 3D structural databases using residue templates. Protein structures determined by experiment or modeling can be searched against a set of templates extracted from known active sites. Because protein structures are more conserved than their sequences, especially at active sites, template searching may identify additional protein functionality. This method is also available from the Biopolymer module.

Prostat, Sequences, and the Sequence Window are also included in the Binding_Site module to facilitate the characterization of protein binding sites.

Documentation for Sequences, the Sequence Window, and Prostat can be found in the Homology manual. Documentation for StrDB can be found in the Biopolymer manual.


Hardware and installation

For information regarding hardware, operating system level, and installation, please refer to the System Guide For Insight II Products.

Invoking Binding_Site

To invoke the Binding_Site module in Insight II, click the Accelrys icon in the upper left corner and select Binding_Site from the list of modules. The pulldowns in Binding_Site will appear on the lower menu bar. Regular Insight II rules and conventions apply when you work with the Binding_Site interface (e.g., all Binding_Site commands may be run from the command line). Because Binding_Site uses standard Insight II objects, other relevant Insight II modules may be used to prepare objects for Binding_Site input or to help analyze and use Binding_Site output.

Pilot tutorials

Online Binding_Site tutorials are available for use with the Insight II Pilot interface. Click the mortarboard icon on the Insight II icon palette, go to the list of available online tutorials, and select tutorials for the Binding_Site module.

The EvolutionaryTrace tutorial uses the HIV reverse transcriptase sequence family as an example of how functional residues are identified by the EvolutionaryTrace method.

The StrDB tutorial shows you how to create an active site template for serine proteases and to search the protein structure database using the template.


2 Theory

EvolutionaryTrace

To understand how proteins carry out their biological functions, how they recognize ligands, and how they form protein-protein or protein-ligand complexes, it is essential to identify protein functional residues and interaction interfaces such as active sites or binding sites. The functional interfaces can serve as targets for structure-based drug design or guide site-directed mutagenesis in studying the protein structure-function relationship. X-ray crystallography is a powerful tool for studying protein function and structure. However, the number of protein sequences determined grows much faster than the number of protein structures. Even when structural information is available, it is still difficult to infer the roles of specific residues in protein function directly from the structure.

Exhaustive mutational analysis plays a major role in binding site characterization. With the increasing size of sequence databases, a wealth of mutational data is already available in the database, where sequences homologous to the protein of interest record mutation “experiments” that have passed the test of natural selection. The evolutionary trace method (O. Lichtarge, 1996a, 1996b, 1997) is designed to extract the mutational data embedded in those sequences and infer which residues are likely to be important to protein function.

In general, functional residues undergo fewer mutations during evolution. On the other hand, certain functional residues are forced to mutate to achieve the selectivity of subgroups within the protein family. The evolutionary trace method makes a direct connection between residue conservation in aligned sequences and functional importance. It also identifies functional specificity by partitioning the protein sequence alignment into subgroups according to sequence similarity. Furthermore, with the 3D structure of the protein of interest, exposed functional residues, which are more likely to be responsible for binding or enzymatic activity, can be distinguished from buried residues, which are more important for maintaining structural integrity. This method can be applied to proteins with known experimental structures as well as to theoretical models.

Multiple sequence alignment and sequence dendrogram

A family of homologous sequences is aligned using the multiple sequence alignment program align123 (see the align123 section in the Homology Theory chapter). Based on the alignment, the sequences are clustered using a hierarchical clustering method (Hartigan, J. A. 1975) according to the average percentage sequence identity. The distance between two nodes a and b in the sequence cluster is calculated as follows:

$$D = \frac{1}{m \cdot n} \sum_{i=1}^{m} \sum_{j=1}^{n} ID_{ij} \qquad \text{(Eq. 1)}$$

where $ID_{ij}$ is the pairwise percentage sequence identity between sequence i in node a and sequence j in node b. There are m and n sequences in nodes a and b, respectively.

The sequence cluster, or dendrogram, is a representation of the evolutionary or functional relationship of the sequences in the sequence family. Based on the dendrogram, sequences in a family can be divided into subfamilies at a given sequence identity cutoff. At different cutoffs, the subfamilies represent different levels of functional resolution. At high sequence identity cutoffs, the subfamilies consist of smaller groups of sequences and show more functional specificity, while at low sequence identity cutoffs, the subfamilies are larger with less specificity.
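The node-to-node measure of Eq. 1 and the division into subfamilies at a cutoff can be illustrated with a minimal Python sketch. It is not the align123 or evTrace implementation: the function names, the gap handling in `percent_identity`, and the rule of merging the pair of nodes with the highest average identity until no pair reaches the cutoff are all assumptions made for illustration.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap
    (the gap handling is an assumption of this sketch)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def node_identity(node_a, node_b, aligned):
    """Eq. 1: mean of ID_ij over every sequence i in node a and j in node b."""
    total = sum(percent_identity(aligned[i], aligned[j])
                for i in node_a for j in node_b)
    return total / (len(node_a) * len(node_b))

def subfamilies(aligned, cutoff):
    """Average-linkage agglomeration, stopping once no pair of nodes
    reaches the percent identity cutoff."""
    nodes = [[name] for name in aligned]
    while len(nodes) > 1:
        (a, b), best = max(
            (((a, b), node_identity(nodes[a], nodes[b], aligned))
             for a, b in combinations(range(len(nodes)), 2)),
            key=lambda item: item[1],
        )
        if best < cutoff:
            break
        nodes[a] = nodes[a] + nodes[b]  # a < b, so deleting b keeps index a valid
        del nodes[b]
    return nodes
```

Raising the cutoff in `subfamilies` yields more, smaller groups, which mirrors the behavior described for high sequence identity cutoffs.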


Identifying trace residues

From the multiple sequence alignment, residues conserved across the family of protein sequences are identified as conserved residues and are assumed to be essential for maintaining protein function. Based on the dendrogram, sequences can be partitioned into subgroups at selected percent identity cutoffs (PIC). The proteins in different subgroups may have similar function but different specificity. The residues responsible for the functional specificity are forced to mutate during evolution to make the distinction. Residues that are conserved within subgroups and different between subgroups are identified as class-specific residues. The conserved residues and class-specific residues are called trace residues (Figure 1). When the evolutionary trace method is applied at different sequence identity cutoffs for the same sequence family, trace residues at different functional resolutions will be identified.
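The classification into conserved and class-specific positions can be sketched in Python as follows. This is a simplified reading of the description above, not the evTrace code: the function names are hypothetical, gap-containing columns are treated as non-conserved by assumption, and a column counts as class-specific as soon as at least two subgroups differ.

```python
def consensus(column):
    """Return the shared residue if every entry in the column is identical
    and not a gap, else None."""
    residues = set(column)
    if len(residues) == 1 and "-" not in residues:
        return residues.pop()
    return None

def trace_residues(aligned, subgroups):
    """Classify alignment columns as conserved or class-specific.

    aligned   -- dict of name -> aligned sequence (equal lengths, '-' for gaps)
    subgroups -- list of lists of names; single-member groups are ignored,
                 as the manual describes for isolated sequences
    Returns two lists of 0-based column indices.
    """
    groups = [g for g in subgroups if len(g) > 1]
    length = len(next(iter(aligned.values())))
    conserved, class_specific = [], []
    for col in range(length):
        per_group = [consensus([aligned[name][col] for name in g]) for g in groups]
        if not groups or None in per_group:
            continue  # some subgroup varies here: not a trace position
        if len(set(per_group)) == 1:
            conserved.append(col)       # same residue in every subgroup
        else:
            class_specific.append(col)  # invariant within, different between
    return conserved, class_specific
```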

If a sequence is not similar to any other sequence and forms a subgroup by itself at the selected PIC, it is ignored in the analysis. It will not be included in the analysis until the PIC is low enough to make it a member of a subgroup. If the query sequence with the 3D structure is not in any subgroup at the selected cutoff, no trace residues will be reported.

Clustering residues

To define the functional interfaces, trace residues are mapped to the structure of one of the proteins in the sequence family and clustered in 3D space. Since protein structures are more conserved than sequences, especially at active sites, trace residue clusters identified from one protein structure can be applied to all the members of the sequence family.

Residues are clustered using hierarchical clustering algorithms (Hartigan, J. A. 1975), including the single linkage and average linkage methods, and the residue clusters are represented as a dendrogram for visualization. Distances between side chain atoms or all heavy atoms are used for clustering. The distance between a pair of residues is defined as the distance between the closest pair of selected atoms (i.e., side chain atoms or all heavy atoms) in the two residues.
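As an illustration of the closest-atom residue distance and of single linkage, the following Python sketch groups residues at a fixed distance cutoff. It is an assumption-laden sketch rather than the module's algorithm: it returns the flat clusters obtained by cutting a single-linkage dendrogram at one cutoff (equivalent to connected components), not the full dendrogram, and the residue labels and coordinates in any example are made up.

```python
import math

def residue_distance(atoms_a, atoms_b):
    """Distance between the closest pair of selected atoms in two residues."""
    return min(math.dist(p, q) for p in atoms_a for q in atoms_b)

def single_linkage_clusters(residues, cutoff):
    """Flat single-linkage clusters at a distance cutoff.

    residues -- dict of residue label -> list of (x, y, z) atom coordinates
    Cutting a single-linkage dendrogram at `cutoff` gives the connected
    components of the graph whose edges join residues closer than the cutoff.
    """
    labels = list(residues)
    parent = {r: r for r in labels}

    def find(r):
        # Union-find root lookup with path halving.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if residue_distance(residues[a], residues[b]) <= cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for r in labels:
        clusters.setdefault(find(r), []).append(r)
    return sorted(clusters.values(), key=len, reverse=True)
```

Because single linkage only requires each residue to be near one other member, a chain of residues ends up in one cluster even when its two ends are far apart.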


ActiveSite_Search

ActiveSite_Search identifies protein active sites or binding sites by locating cavities in the protein structure. The Site Search algorithm is shown as a 2D graph in Figure 2. First, the protein is mapped onto a grid which covers the complete protein space. The grid points are then classified as free points or protein points. Protein points are grid points within 2 Å of a hydrogen atom or 2.5 Å of a heavy atom.

Figure 1. Identification of trace residues: conserved residues and class-specific residues, from a sequence alignment. A segment of DNA polymerase sequence is shown for four subgroups in the sequence family. Conserved residues are defined as residues conserved across the sequence family and class-specific residues are defined as residues conserved within the subgroup that are different between subgroups.


Then, a cubic eraser moves from the outside of the protein toward the center to remove the free points (light green) until the opening is too small for it to move forward. Those free points not reached by the eraser will be defined as site points. If a smaller eraser is used, sites with smaller openings will be identified. To find a shallow cleft on the protein surface, such as the one shown by the blue dots in Figure 2, a larger eraser should be used. A larger eraser sometimes joins into one site several sites defined by a smaller eraser.

After a site is located, it can be modified by expanding or contracting the site. One layer of grid points at the cavity opening will be added or removed by each expand or contract operation, respectively.
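The eraser procedure described above can be illustrated in two dimensions, in the spirit of Figure 2. The Python sketch below is only a toy model under stated assumptions: the grid is given as text rows, the eraser is a square that moves one grid cell at a time in four directions, and a margin of free space is added around the grid so the eraser can circle the protein. None of these details are taken from the Insight II implementation.

```python
from collections import deque

def site_points(grid, eraser):
    """Return free cells the square eraser cannot reach from outside the grid.

    grid   -- list of equal-length strings, '#' = protein point, '.' = free point
    eraser -- eraser edge length in grid cells
    """
    rows, cols = len(grid), len(grid[0])
    pad = eraser  # margin of free space around the protein (assumption)
    height, width = rows + 2 * pad, cols + 2 * pad

    def is_free(r, c):
        # Padded coordinates: anything outside the original grid is free.
        r, c = r - pad, c - pad
        inside = 0 <= r < rows and 0 <= c < cols
        return not inside or grid[r][c] == "."

    def fits(r, c):
        """True if an eraser with top-left corner (r, c) covers only free cells."""
        if r < 0 or c < 0 or r + eraser > height or c + eraser > width:
            return False
        return all(is_free(r + dr, c + dc)
                   for dr in range(eraser) for dc in range(eraser))

    seen = {(0, 0)}          # start in the margin, outside the protein
    queue = deque(seen)
    erased = set()
    while queue:
        r, c = queue.popleft()
        erased.update((r + dr, c + dc)
                      for dr in range(eraser) for dc in range(eraser))
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (nr, nc) not in seen and fits(nr, nc):
                seen.add((nr, nc))
                queue.append((nr, nc))

    return sorted(
        (r, c)
        for r in range(rows) for c in range(cols)
        if grid[r][c] == "." and (r + pad, c + pad) not in erased
    )
```

For a pocket with a one-cell opening, an eraser two cells wide leaves the whole pocket as site points, while an eraser one cell wide enters through the opening and removes it.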

Figure 2. A 2D representation of the Site Search algorithm. The eraser is moving from the left towards the protein, and half of the protein has been processed by the eraser. The blue dots are the free points in the current setting and are colored differently from the other free points to illustrate a potential binding site if the eraser size is increased.


3 Methodology

EvolutionaryTrace

EvolutionaryTrace can be applied to any protein sequence family which has at least one protein with known structure. Starting from the query protein with known 3D structure, a family of homologous protein sequences can be obtained by searching a protein sequence database such as SwissProt. Sequences selected from the database search should represent the function of the query protein or its close relatives. In general, sequences which are more than 25% to 30% similar to the query sequence are good candidates. It is also important to cut the sequences down to the functional domains of interest to obtain the best results, since including unrelated domains in the sequence alignment may change how the sequences are clustered, which in turn will affect the final trace residue results.

The sequences selected from the sequence database are first aligned using the multiple sequence alignment program align123 (see the align123 section in the Homology Theory chapter). Then the standalone program evTrace takes the multiple sequence alignment and the protein structure as input. It identifies trace residues for the query protein and writes them as subsets into a subset file. The trace residue subsets can then be read back into Insight II by the Load_Trace command. Those trace residues can be clustered according to the residue-residue distances defined by all heavy atoms or side chain heavy atoms.

1. Searching for homologous sequences

EvolutionaryTrace cannot search protein sequence databases. However, the Create_Alignment command (under EvolutionaryTrace) can read most sequence file formats as input (for a description of file formats, please see the documentation for align123).

2. Multiple sequence alignment

Homologous sequences from sequence database searching are aligned using the align123 program. Substitution matrices and gap penalties can be adjusted or default parameters used. Because the alignment is time consuming, it runs as a background job. It takes a file with multiple sequences as input and generates an alignment file named <job_name>.aln. Be sure to give a job name which is different from the prefix of the input sequence file so that the input will not be overwritten. All the gaps in the input sequences will be retained if the parameter RemoveGap is turned off.

3. Generating trace residues

After the alignment is created, load the query protein with a structure into Insight II and extract its sequence using the commands under the Sequences pulldown. The query protein’s sequence will be aligned with the multiple sequence alignment from the previous step by again using align123. Since align123 cannot take two sequences with the same name, either the query protein sequence should not be included in the multiple sequence alignment file in the previous step or the query protein should be given a different name. The output alignment file is given the query protein name with “.aln” appended.

Based on the alignment, sequences are clustered and trace residues are identified at specific percentage sequence identity cutoffs (PICs). By default, they are identified at PIC = 30%, 40%, 50%, 60%, 70%, 80%, and 90%. Three subsets are created for each PIC to hold the conserved residues and class-specific residues, i.e.:

protein_ID$id%C: conserved residues for the selected PIC.

protein_ID$id%CS: class-specific residues for the selected PIC.

protein_ID$id%: both conserved and class-specific residues for the selected PIC.

For example, for protein RJTIA at 30% PIC, the three subsets are named RJTIA$30C, RJTIA$30CS, and RJTIA$30, respectively.
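When scripting around these subsets, the names can be generated rather than typed. The short Python helper below is hypothetical and simply follows the RJTIA example above (protein name, `$`, the PIC as an integer, then the `C` or `CS` suffix); the constant `DEFAULT_PICS` restates the default cutoffs listed in the text.

```python
# Default percent identity cutoffs listed in the manual.
DEFAULT_PICS = (30, 40, 50, 60, 70, 80, 90)

def trace_subset_names(protein_id, pic):
    """Return the (conserved, class-specific, combined) subset names for one PIC,
    following the RJTIA$30C / RJTIA$30CS / RJTIA$30 pattern of the example."""
    base = f"{protein_id}${pic}"
    return base + "C", base + "CS", base
```

Iterating `trace_subset_names("RJTIA", pic)` over `DEFAULT_PICS` gives the 21 subset names expected from a default run for that protein.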


4. Loading trace residues

Trace residues of the query protein are defined as subsets in a file by the evTrace program. The subsets can be read into Insight II by the Load_Trace command and can be used to color or render the query protein or as input for the Cluster_Residue command.

A separate graphical window containing the sequence dendrogram can also be displayed (Figure 3). With a single mouse click you can move the vertical bar on the dendrogram to any place on the graph. The vertical bar divides sequences into subfamilies, and the X coordinate of the vertical bar represents the percentage sequence identity cutoff (PIC). The PIC and the number of subfamilies or sequence clusters are displayed at the bottom of the graph. The sequence subfamilies are shown in alternating red and magenta colors. If a sequence is not in any subfamily at the selected PIC, it is shown in the default foreground color, which can be changed by choosing a color and clicking on the Foreground Color button. The dendrogram is useful for analyzing the sequence family and subfamilies, and for selecting different PICs for generating trace residues.

The multiple sequence alignment used for the calculation can also be displayed in a Netscape browser, where the consensus sequences of each subfamily and of all sequences are also shown for each PIC chosen. The conserved residues and the class-specific residues are colored in red and magenta, respectively.

5. Clustering trace residues

Trace residues can be clustered using the single linkage method or the average linkage method. The single linkage method is more sensitive if the trace residues form a chain and are not tightly clustered. The average linkage method should work as well as the single linkage method for a tightly clustered group of residues. Clustering can be applied to side chain heavy atoms or all heavy atoms. Side chain atom clustering is recommended.

Residue clusters are displayed in a separate window as a dendrogram (Figure 4), where the residue clusters can be selected by drawing a circle at the node representing the cluster. To draw a circle, press down the left mouse button, drag, and then release. The residues in the selected cluster can be colored in the 3D window by pressing the Color Residue Cluster button on the dendrogram window.

Exposed trace residues

Exposed and buried trace residues represent different classes of behavior. Exposed trace residues are more likely to be directly involved in binding or enzymatic activity, while buried trace residues affect structural integrity. In addition, core residues are more often conserved by chance. If you are interested in the functional residues at the binding site or active site, it is useful to identify the exposed trace residues. They are the residues common to the subset of exposed residues and the subset of trace residues. The common elements of two subsets can be obtained by executing the Subset/Combine command located on the upper menu bar.

Figure 3


Figure 4

To define the subset of exposed residues, the Access_Surf and SAS_Subset commands under the Prostat pulldown are used. First, the Access_Surf command is used to calculate the solvent accessible surface (SAS) of the protein. It is important to turn on the Create/Update Table option to produce a table report. Then the SAS_Subset command is called to define the subset according to one of the accessible surface types. For the selected SAS type, if a residue has an SAS larger than the SAS_Cutoff, it will be included in the subset. For example, when a fractional SAS is chosen and the cutoff is set to 0.1, residues with a fractional SAS larger than 0.1 will be included in the subset. The subset name is automatically set to protein_name$SAS.
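The logic of combining the two subsets amounts to a filter followed by a set intersection. The Python sketch below restates it outside Insight II; the function names and the residue labels are hypothetical, and the fractional SAS values are assumed to have been exported from the table report by some other means.

```python
def sas_subset(fractional_sas, cutoff=0.1):
    """Residues whose fractional solvent accessible surface exceeds the cutoff."""
    return {res for res, sas in fractional_sas.items() if sas > cutoff}

def exposed_trace(trace, fractional_sas, cutoff=0.1):
    """Exposed trace residues: the intersection of the trace subset
    with the subset of exposed residues."""
    return set(trace) & sas_subset(fractional_sas, cutoff)
```

Note that the comparison is strict, matching the wording "larger than the SAS_Cutoff": a residue whose value equals the cutoff is not included.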

Trace residues of specified B factors

In general, trace residues with certain properties can always be defined by the intersection of a trace residue subset with another subset using the Subset/Combine command. The Binding_Site module provides a command under the Prostat pulldown to generate residue subsets of certain B factors. This subset can then be used to define the trace residues of specified B factors.

ActiveSite_Search

As discussed in the Theory chapter, this command identifies protein binding or active sites by locating the cavities in protein structures.

An eraser is used to remove all the outside grid points which are not occupied by protein atoms and to leave the internal free grid points as site points. The site identification is sensitive to the size of the eraser, which is defined in Angstroms by the Site_OpenSize parameter. A large eraser can identify shallower clefts which might be missed by a smaller eraser. A smaller eraser may also yield a smaller site than a larger eraser for the same site, since it may remove more grid points at the entrance of the cavity. Remember to add hydrogens before executing the Asite_Search command. Missing hydrogen atoms create a larger opening, allowing the eraser to fit more easily and to remove more or even all of the grid points in the potential site.

Use the Exclude_HETATM option to specify whether ligand(s) are included or excluded in the search. This option is useful if you wish to include cofactors in the site identification.

To yield a more accurate shape and volume for the cavity, the Grid_Size parameter can adjust (in Angstroms) the size of the grid used to map the protein space.

Certain cavities are too small to bind any ligand or to be the active site. Use the parameter Site_CutoffSize to define the size of the smallest cavity to be selected (in grid points). The selected sites are sorted according to their size and reported in a table if Load_Site_Table is turned on. The table reports the size of the cavities in number of grid points and the dimensions of the cavities.

When the search completes, the largest site is automatically displayed on the structure. Use Asite_Display to display other sites one by one.

A specific site can be modified by using the Asite_Modify command. After the modification, the Asite_Display command will be disabled until the next time the Asite_Search command is called.

The active site object is pickable with the mouse, so it can be used as the center of a Ludi search. It can also be used to define the subset of residues which line the active site. This is done using the Subset/Interface command and selecting the active site object as object 1, the protein as object 2, and an appropriate radius around the active site (e.g., 5 Å).
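The idea of selecting the residues that line a site within a radius can be written out as a short Python sketch. It is not the Subset/Interface command: the function name, the data layout, and the rule that one atom within the radius of any site point is enough are assumptions made for illustration.

```python
import math

def lining_residues(site_points, residue_atoms, radius=5.0):
    """Residues with at least one atom within `radius` Å of any site point.

    site_points   -- list of (x, y, z) coordinates of the site grid points
    residue_atoms -- dict of residue label -> list of (x, y, z) atom coordinates
    """
    lining = []
    for residue, atoms in residue_atoms.items():
        if any(math.dist(a, p) <= radius for a in atoms for p in site_points):
            lining.append(residue)
    return lining
```

A larger radius selects a thicker shell of residues around the cavity, so the value is best chosen with the intended docking or analysis step in mind.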


4 Command Summary

Sequences pulldown

The Sequences pulldown is documented in the Homology manual.

ActiveSite_Search pulldown

The ActiveSite_Search pulldown contains commands that search for and display the active site of a protein.

Asite_Search

This command locates the active site of a protein by searching through the protein and finding all the cavities inside the protein.

Asite_Search will return a list of cavities sorted by size and will display the first graphically on the protein. You can then step through the list of cavities by using the Asite_Display command with Next (under Display_Status) turned on.

Note that to get accurate information about the active site (e.g., volume), hydrogen atoms should be added to the protein before this command is executed.

The Protein_Name parameter allows you to select a protein to be searched by the Asite_Search command.

The Grid_Size parameter allows you to change the grid size for searching the protein. The default grid size is 1x1x1 Å. The smaller the grid size, the more accurate the site volume and the longer the computation time.

The Site_OpenSize parameter defines the maximum distance (in Angstroms) between any two grid points which are exposed. An exposed grid point is defined as a grid point which does not have another grid point or an atom adjacent to it in at least one direction.

The Site_CutoffSize parameter sets the minimum number of gridpoints to accept a cavity as a possible binding site.

If a protein is loaded into Insight II with hetero atoms, the cavity can be located with or without the hetero atoms. When Exclude_HETATM is set to on, the cavities are located without hetero atoms; when it is set to off, the hetero atoms are included.

The Display_Color parameter sets the display color of the site.

The Load_Site_Table parameter specifies whether the site volume and extent will be loaded into a table.

The Site_Table_Name parameter specifies the name of the table which holds the site volume and extent.

Asite_Display This command allows you to display the specified active site on the protein with a user-specified color.

The Protein_Name parameter allows you to select the protein whose active site is displayed. Only the last protein used in the Asite_Search command can be used for the Asite_Display command.

The Display_Status parameter allows you to choose which site to display. Only one site can be displayed at a time. The sites are sorted from largest to smallest. To display the sites sequentially, choose the Next or Previous options. To display a specific site, enter the site number.

Next - Display the next (smaller) site.

Previous - Display the previous (larger) site.

Current - Display the current site.

Specify - Display a specific site.

Off - Turn off the site display.

The Site_Number parameter allows you to choose which site will be displayed. The sites are numbered from 1 to <number of sites>.
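The Display_Status options can be pictured as a small state machine over the size-sorted site list. The sketch below is hypothetical: the class name is invented, and the behavior at the ends of the list (clamping instead of wrapping around) is an assumption, since it is not stated here.

```python
class SiteBrowser:
    """Tracks which site is displayed; sites are numbered 1..n_sites,
    with site 1 being the largest."""

    def __init__(self, n_sites):
        self.n_sites = n_sites
        self.current = 1      # Asite_Search displays the first site
        self.visible = True

    def display(self, status, site_number=None):
        if status == "Next":
            # Assumption: stay on the last site instead of wrapping around.
            self.current = min(self.current + 1, self.n_sites)
        elif status == "Previous":
            self.current = max(self.current - 1, 1)
        elif status == "Specify":
            if site_number is None or not 1 <= site_number <= self.n_sites:
                raise ValueError("site number must be in 1..n_sites")
            self.current = site_number
        elif status == "Off":
            self.visible = False
            return None
        elif status != "Current":
            raise ValueError(f"unknown Display_Status: {status}")
        self.visible = True
        return self.current


browser = SiteBrowser(n_sites=3)
print(browser.display("Next"))        # 2
print(browser.display("Specify", 1))  # 1
```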

Use the Display_Color parameter to set the display color of the site.

Asite_Modify This command expands or contracts the size of the current site.


The Modify_Action parameter sets whether the site will be expanded or contracted.

The Display_Color parameter sets the display color of the modified site.

EvolutionaryTrace pulldown

The EvolutionaryTrace pulldown consists of commands which characterize protein functional residues. It has commands which allow you to

♦ Align a family of sequences.

♦ Create a sequence dendrogram according to the average percentage sequence identity (PIC).

♦ Identify conserved residues of a protein family.

♦ Map those residues onto a protein structure within the protein family.

EvolutionaryTrace also contains commands to cluster conserved residues according to their 3D distances.

Create_Alignment This command invokes the ClustalW program to create a multiple sequence alignment from a sequence file.

The Sequence_File parameter allows you to pick a sequence file with multiple sequences to be used for alignment. Seven sequence file formats are automatically recognized:

♦ NBRF/PIR.

♦ EMBL/SWISSPROT.

♦ Pearson (Fasta).

♦ Clustal (*.aln).

♦ GCG/MSF (Pileup).

♦ GCG9/RSF.

♦ GDE flat file.

The Scoring_Matrix parameter allows you to specify the amino acid scoring matrix used to align sequences.


The Gap_Open parameter specifies the value of the gap opening penalty.

The Gap_Extension parameter specifies the value of the gap extension penalty.

The Remove_Gaps parameter controls whether the initial gaps are removed from the input. When Remove_Gaps is off, the alignment created retains all the gaps from the input file.

The Guide_Tree parameter sets the way the guide tree is created. If Fast is chosen, the initial pairwise alignment will be performed using the FastA algorithm. If Slow is chosen, the initial pairwise alignment will be done using the full global alignment algorithm.

The Job_Name parameter specifies the job name for the ClustalW background job. The output alignment file will have the name <job_name>.aln.
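For orientation, the parameters above correspond roughly to options of the standalone ClustalW program. The sketch below builds (but does not run) such a command line. The option names are ClustalW command-line options; how Insight II actually maps its parameters onto them is an assumption, the function name and example values are hypothetical, and Remove_Gaps is not covered.

```python
def clustalw_command(sequence_file, job_name, scoring_matrix="BLOSUM",
                     gap_open=10.0, gap_extension=0.1, guide_tree="Slow"):
    """Return an argument list for a ClustalW multiple alignment run."""
    args = [
        "clustalw",
        f"-INFILE={sequence_file}",
        "-ALIGN",                       # perform a full multiple alignment
        "-TYPE=PROTEIN",
        f"-MATRIX={scoring_matrix}",    # Scoring_Matrix
        f"-GAPOPEN={gap_open}",         # Gap_Open
        f"-GAPEXT={gap_extension}",     # Gap_Extension
        f"-OUTFILE={job_name}.aln",     # Job_Name -> <job_name>.aln
    ]
    if guide_tree == "Fast":
        # Fast approximate pairwise alignments for the guide tree.
        args.append("-QUICKTREE")
    return args


print(" ".join(clustalw_command("family.pir", "myjob", guide_tree="Fast")))
```

The returned list could be passed to subprocess.run on a system where the clustalw executable is installed.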

Create_Trace The Create_Trace command identifies the trace residues based on the alignment of a family of protein sequences.

Trace residues are defined as residues conserved across the whole protein family or as residues conserved within all the subgroups of sequences.
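The definition of trace residues can be sketched as a column-by-column test on the alignment. In this illustration the subgroups are given explicitly, a column containing a gap is treated as not conserved (an assumption), and the labels "conserved" and "class_specific" are chosen to match the Conserved_Res and ClassSpecific_Res parameters of Color_Trace. The function name is hypothetical.

```python
def classify_columns(alignment, subgroups):
    """Label every alignment column.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    subgroups: list of lists of sequence names
    Returns one label per column: 'conserved', 'class_specific' or 'neutral'.
    """
    length = len(next(iter(alignment.values())))
    labels = []
    for col in range(length):
        # Set of residue types seen in this column, per subgroup.
        per_group = [{alignment[name][col] for name in group}
                     for group in subgroups]
        invariant = all(len(r) == 1 and "-" not in r for r in per_group)
        if not invariant:
            labels.append("neutral")
        elif len(set.union(*per_group)) == 1:
            labels.append("conserved")       # same residue in every subgroup
        else:
            labels.append("class_specific")  # invariant only within subgroups
    return labels


alignment = {"a1": "MKVLA", "a2": "MKVIA", "b1": "MRVLG", "b2": "MRVLG"}
subgroups = [["a1", "a2"], ["b1", "b2"]]
print(classify_columns(alignment, subgroups))
# ['conserved', 'class_specific', 'conserved', 'neutral', 'class_specific']
```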

The protein sequences are divided into subgroups based on the percentage sequence identity (PIC). At different PIC cutoffs, sequences will be grouped differently and the trace residues identified are also different. By default, the trace residues will be identified at PIC cutoffs of 30% to 90% at increments of 10%.

The Alignment_File parameter allows you to specify the sequence alignment file created by the Create_Alignment command and used to create the trace residues.

The Protein parameter specifies a protein member (with known 3D structure) of the protein family in which you are interested.

The sequence of the selected protein has to be extracted using the Extract command under the Sequences pulldown before the Create_Trace command can be executed.

The trace residues identified by this command are considered subsets of the selected protein and can be loaded for further analysis.

The Default_PIC parameter is used to specify whether the trace residues will be created using default PIC cutoffs. If Default_PIC is on, the trace residues will be created for PIC cutoffs of 30%, 40%, 50%, 60%, 70%, 80%, and 90%.

The PIC parameter allows you to specify the PIC cutoff value which will be used to create trace residues. It is only active when Default_PIC is off.

The Align_Seq parameter specifies if the query protein selected will be aligned with the multiple sequence alignment using align123.

This parameter should be turned off if the protein selected is already aligned with other sequences and exists in the multiple sequence alignment file.

If you want to align the selected protein with the multiple sequence alignment stored in Alignment_File, the Align_Seq option should be set to on. In this case, the selected protein should not already be in the multiple sequence alignment file. If there is a sequence with the same name existing in the multiple sequence alignment file, align123 will fail to align the selected protein with the existing alignment.

Load_Trace This command loads the trace residues as subsets of the specified protein. It can also load sequence dendrograms and alignments.

The Protein parameter specifies the protein used in the Create_Trace command to identify trace residues. The trace residues are loaded as a subset of the selected protein.

The Load_Alignment parameter controls whether a sequence alignment will be loaded by the Load_Trace command. If it is set to on, a Netscape browser will be used to display the multiple sequence alignment shown at different PIC cutoffs. The sequences shown are grouped into subgroups, and trace residues are colored for each PIC cutoff.

The alignment display is not essential for identifying functional residues. If you do not have Netscape installed, turn off this option when you execute the Load_Trace command.

The Load_Dendrogram parameter controls whether the sequence dendrogram will be loaded by the Load_Trace command. If it is set to on, a separate window will be launched to display the sequence dendrogram.


Color_Trace This command colors the trace residues identified at a particular PIC cutoff.

The Protein parameter specifies the protein to be used in the Create_Trace command to identify trace residues.

The PIC_Cutoff parameter is used to set the PIC cutoff. The trace residues found within the specified PIC cutoff will be colored. Default PICs are listed in the value aid list for this parameter. If you have created trace residues at a specific PIC cutoff, you can color the trace residues by entering your own PIC cutoff value.

The Conserved_Res parameter specifies the color for trace residues which are conserved across the whole protein family.

The ClassSpecific_Res parameter specifies the color for the trace residues which are conserved within subgroups within the selected PIC cutoff.

Cluster_Residue This command creates trace residue clusters according to the residues’ 3D distances within a selected PIC_Cutoff using a hierarchical clustering method.

A separate window is launched to display the residue clusters. The dendrogram window’s commands and parameters are described under Residue Dendrogram below.

The Protein parameter specifies the protein to be used in the Create_Trace command to identify trace residues.

The PIC_Cutoff parameter is used to set the PIC cutoff. The trace residues found within the specified PIC cutoff will be clustered. Default PICs are listed in the value aid list for this parameter. If you have created trace residues at a specific PIC cutoff, you can cluster the trace residues by entering your own PIC cutoff value.

The Cluster_By parameter is used to specify which atoms of the trace residues will be used to create the residue clusters. There are two options: Heavy_Atom and Sidechain_Atom. The residues are then clustered according to the distance between their closest pair of atoms.

Heavy_Atom: All heavy atoms except the carbonyl carbon of the residue will be used to calculate the residue distances.

SideChain_Atom: Only side chain heavy atoms are used to calculate the residue distances. The C alpha atom in alanine is considered to be a side chain atom.
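The two Cluster_By options can be sketched as an atom filter followed by a closest-pair distance. The example assumes PDB-style atom names (N, CA, C and O for the backbone, with C being the carbonyl carbon) and residues that list heavy atoms only. The data layout and function names are hypothetical, and returning infinity for a residue with no selected atoms (for example glycine under SideChain_Atom) is an assumption.

```python
import math

BACKBONE = {"N", "CA", "C", "O"}


def select_atoms(residue, cluster_by):
    """Pick the atoms of one residue that take part in the distance test.

    residue: {'name': three-letter code, 'atoms': {atom name: (x, y, z)}}
    """
    selected = {}
    for atom_name, xyz in residue["atoms"].items():
        if cluster_by == "Heavy_Atom":
            keep = atom_name != "C"  # drop the carbonyl carbon
        elif cluster_by == "SideChain_Atom":
            # Alanine's C alpha counts as a side chain atom.
            keep = (atom_name not in BACKBONE
                    or (atom_name == "CA" and residue["name"] == "ALA"))
        else:
            raise ValueError(f"unknown Cluster_By option: {cluster_by}")
        if keep:
            selected[atom_name] = xyz
    return selected


def residue_distance(res_a, res_b, cluster_by="Heavy_Atom"):
    """Distance between the closest pair of selected atoms."""
    atoms_a = select_atoms(res_a, cluster_by).values()
    atoms_b = select_atoms(res_b, cluster_by).values()
    return min((math.dist(p, q) for p in atoms_a for q in atoms_b),
               default=math.inf)


ala = {"name": "ALA", "atoms": {"N": (0, 0, 0), "CA": (1, 0, 0),
                                "C": (2, 0, 0), "O": (3, 0, 0),
                                "CB": (1, 1, 0)}}
ser = {"name": "SER", "atoms": {"N": (10, 0, 0), "CA": (11, 0, 0),
                                "C": (5, 0, 0), "O": (12, 0, 0),
                                "CB": (11, 1, 0), "OG": (11, 2, 0)}}
print(residue_distance(ala, ser, "Heavy_Atom"))      # 7.0
print(residue_distance(ala, ser, "SideChain_Atom"))  # 10.0
```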

Sequence Dendrogram

The sequence dendrogram is created based on the average pairwise percentage sequence identity (PIC) calculated from the sequence alignment.
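The quantity behind the dendrogram can be sketched as follows. The exact identity formula is not given here; this example assumes that identity is counted over the alignment columns in which neither sequence has a gap. The function names are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical residues between two aligned sequences.

    Assumption: columns with a gap ('-') in either sequence are ignored.
    """
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def average_identity(cluster_a, cluster_b, alignment):
    """Average pairwise identity between the members of two clusters."""
    values = [percent_identity(alignment[x], alignment[y])
              for x in cluster_a for y in cluster_b]
    return sum(values) / len(values)


alignment = {"a1": "MKVLA", "a2": "MKVIA", "b1": "MRVLG"}
print(percent_identity(alignment["a1"], alignment["a2"]))  # 80.0
print(average_identity(["a1", "a2"], ["b1"], alignment))   # 50.0
```

In this toy family, a1 and a2 would fall into the same subgroup at a PIC cutoff of 80% or lower, while b1 would join them only at a cutoff of 50% or lower.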

Cutoff To view the clusters at different PIC cutoffs, move the blue line on the graphics by clicking the mouse on the dendrogram. The sequence clusters which have PICs larger than the cutoff value are drawn in alternating red and magenta colors. The cutoff value and the number of clusters within the cutoff are shown at the bottom of the window.

Color Palette The color of the rightmost button on the color palette is the selected color. Clicking one of the color buttons will update the selected color. The selected color can be used to change the Foreground Color and Background Color of the dendrogram.

Font The font size of the axis labels can be adjusted by clicking the two font buttons, which will effectively change the size of the dendrogram.

Foreground Color There are three colors used to draw the sequence dendrogram. Red and magenta are used alternately to show the sequence clusters at PICs larger than the cutoff. The Foreground Color button is used to control the line color for the clusters at PICs lower than the cutoff. Clicking the Foreground Color button will apply the selected color to the sequence dendrogram.

Background Color The background color of the dendrogram is set to black by default. It can be changed to a selected color by clicking the Background Color button.


Residue Dendrogram

Residue 3D clusters appearing in this window are created using the average linkage method. The distance between two residues is defined as the closest distance between any pair of atoms between the two residues. The atoms in the cluster can be either all heavy atoms or side chain atoms. The C alpha atom in alanine is considered to be a side chain atom. The carbonyl carbon is always excluded.
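A minimal sketch of average linkage clustering with a distance cutoff is given below. It is a generic textbook-style agglomeration written for illustration, not the Insight II code; the function name and the input layout (a dictionary of pairwise residue distances) are hypothetical.

```python
def average_linkage_clusters(names, distance, cutoff):
    """Merge clusters until the closest pair is farther apart than cutoff.

    names:    list of residue labels
    distance: dict mapping frozenset({a, b}) -> distance between a and b
    """
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average of all inter-cluster residue distances.
        total = sum(distance[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


pairs = {("A", "B"): 3.0, ("A", "C"): 4.0, ("B", "C"): 5.0,
         ("A", "D"): 20.0, ("B", "D"): 21.0, ("C", "D"): 22.0}
distance = {frozenset(k): v for k, v in pairs.items()}
result = average_linkage_clusters(["A", "B", "C", "D"], distance, cutoff=6.0)
print(sorted(sorted(c) for c in result))  # [['A', 'B', 'C'], ['D']]
```

Residues A and B merge first (3.0), C then joins them at an average distance of 4.5, and D stays separate because its average distance to the merged cluster (21.0) exceeds the cutoff.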

Clustering Method The default method used to create a residue cluster is the average linkage method. It can be changed by clicking the button showing the current clustering method (should be Average Linkage when the graph is first started) and dragging the mouse to the desired method.

Color Residue Cluster The residue clusters can be selected and colored on the 3D protein structure. To select a cluster of residues, press down the mouse near the root of the cluster and drag the mouse across the root. This draws a blue ellipsoid around the root of the cluster. Then select a desired color and click the Color Residue Cluster button.

Cutoff To view the clusters at different cutoff distances, move the blue line on the graph with the mouse. The residue clusters which have distances smaller than the cutoff are drawn in alternating red and magenta colors. The cutoff value and the number of clusters within the cutoff are shown at the bottom of the window.

Color Palette The color of the rightmost button on the color palette is the selected color. Clicking one of the color buttons will update the selected color. The selected color can be used to change the Foreground Color and Background Color of the residue cluster.

Font The font size of the axis labels can be adjusted by clicking the two font buttons.

Foreground Color There are three colors used to draw the residue dendrogram. Red and magenta are used alternately to show the residue clusters at distances smaller than the cutoff. The Foreground Color button is used to control the line color for the clusters at distances larger than the cutoff. Clicking the Foreground Color button will apply the selected color to the residue dendrogram.


Background Color The background color of the dendrogram is black by default. It can be changed to a selected color by clicking the Background Color button.

StrDB pulldown

The StrDB pulldown is documented in the Biopolymer manual.

Prostat pulldown

The Prostat pulldown contains commands for the analysis of protein structure. The commands include Struct_Check, Residue_Dihedral, Secondary_Classify and Access_Surf.

Struct_Check The Struct_Check command allows protein-specific bond lengths, angles and torsions in a protein 3D structure to be checked against the corresponding values in a knowledge base derived from accurate small molecule crystallographic studies. These data may be used to highlight erroneous structural features on the 3D structure, or they may be listed to the textport or tabulated using a per-residue Spreadsheet table. The Struct_Check command can automatically create per-residue Spreadsheet tables. The numerical property values in the table can be used with the graphing capabilities built into the Spreadsheet window to create 2D and 3D graphs for data visualization. Tables of monomer properties can be used to create colored, variable width molecular ribbon diagrams using the Molecule/Ribbon command in Insight II.

Residue_Dihedral The Residue_Dihedral command allows the tabulation of peptide/protein specific dihedral angles. These include the backbone phi, psi, omega, and sidechain chi1, chi2, chi3, and chi4 angles. The calculation can be performed on individual molecules or on an assembly of conformers of the same molecule. In the latter case, the individual dihedrals can be tabulated along with the minimum, maximum, and circular variance in the selected dihedral across the assembly. The circular variance thus provides a measure of the conformational variability across the assembly of conformers. The Residue_Dihedral command can automatically create per residue Spreadsheet tables. The numerical property values in the table can be used with the graphing capabilities built into the Spreadsheet window to create 2D and 3D graphs for data visualization. Tables of monomer properties can be used to create colored, variable width molecular ribbon diagrams using the Molecule/Ribbon command in Insight II.
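Circular variance is used for dihedral angles because an ordinary variance is misleading for periodic quantities (angles of -179 and 179 degrees are only 2 degrees apart). The sketch below uses a common definition, one minus the mean resultant length of the unit vectors of the angles; that Insight II uses exactly this formula is an assumption, and the function name is hypothetical.

```python
import math


def circular_variance(angles_deg):
    """Circular variance of a list of angles in degrees.

    Returns 0.0 when all angles coincide and approaches 1.0 when the
    angles are spread evenly around the circle.
    """
    n = len(angles_deg)
    sum_cos = sum(math.cos(math.radians(a)) for a in angles_deg)
    sum_sin = sum(math.sin(math.radians(a)) for a in angles_deg)
    resultant = math.hypot(sum_cos, sum_sin)  # length of the summed vectors
    return 1.0 - resultant / n


# The same dihedral in three conformers, tightly clustered near 180 degrees.
print(circular_variance([-179.0, 179.0, 180.0]))
# The same dihedral spread evenly over the circle in four conformers.
print(circular_variance([0.0, 90.0, 180.0, 270.0]))
```

The first call gives a value close to 0 and the second a value close to 1.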

SecondaryClassify The SecondaryClassify command will either create a new residue table with a classification column, or add a classification column to an existing residue table. It can also be used to create subsets for use by other Insight II commands.

Note that it is possible, and often useful, to add more than one classification column to a given table (e.g., to compare classification schemes). Once the classification is stored in the table, it may be edited using the spreadsheet commands. A residue table with a classification column may be used as input to the Molecule/Secondaryrender command.

Secondary structure classification in Insight II is disjoint. This means that a residue that is part of a helix, for example, may not be part of a turn. It also means that turns do not overlap. The classification found in many PDB files is not disjoint; the regions often overlap. When using the classification information from the PDB file, these commands apply a priority scheme. The highest priority is helix, the second is sheet and the third is turn. So if a residue is classified as both turn and helix in the PDB file, it will show up as helix in the monomer classification table.
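The priority scheme can be sketched in a few lines. The labels and the function name are hypothetical, and returning None for a residue with no assignment is an assumption.

```python
# Highest priority first, as described above.
PRIORITY = ["helix", "sheet", "turn"]


def resolve_classification(assignments):
    """Reduce the (possibly overlapping) classes of one residue to one."""
    for label in PRIORITY:
        if label in assignments:
            return label
    return None  # residue has no secondary structure assignment


# Overlapping assignments as they might be read from a PDB file.
pdb_assignments = {
    12: {"turn", "helix"},
    13: {"helix"},
    40: {"sheet", "turn"},
    41: set(),
}
table = {res: resolve_classification(a) for res, a in pdb_assignments.items()}
print(table)  # {12: 'helix', 13: 'helix', 40: 'sheet', 41: None}
```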

Access_Surf The Access_Surf command calculates the solvent accessible surface (SAS) area for a molecule. The terminology and definitions for this procedure are taken from Lee and Richards (1971). The SAS is the area traced out by the center of a solvent molecule rolled across the surface. The algorithm is a version of the Lee and Richards (1971) method as modified by Shrake and Rupley (1973) for speed.
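The idea behind a Shrake and Rupley style calculation can be sketched as follows: test points are placed on a sphere around each atom whose radius is the atomic radius plus the probe radius, and the fraction of points not inside any neighboring expanded sphere gives the accessible area. This is a minimal, unoptimized illustration. The probe radius, the number of test points, the point distribution and the atomic radii are assumptions, not values taken from Insight II.

```python
import math


def sphere_points(n):
    """Roughly uniform points on a unit sphere (golden spiral)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n
        r = math.sqrt(1.0 - z * z)
        theta = golden_angle * i
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points


def accessible_area(atoms, probe_radius=1.4, n_points=100):
    """Per-atom solvent accessible surface area.

    atoms: list of (x, y, z, radius) tuples, in angstroms
    """
    unit = sphere_points(n_points)
    areas = []
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        expanded = ri + probe_radius
        accessible = 0
        for ux, uy, uz in unit:
            point = (xi + expanded * ux, yi + expanded * uy,
                     zi + expanded * uz)
            # A test point is buried if it lies inside the expanded
            # sphere of any other atom.
            buried = any(
                math.dist(point, (xj, yj, zj)) < rj + probe_radius
                for j, (xj, yj, zj, rj) in enumerate(atoms) if j != i)
            if not buried:
                accessible += 1
        areas.append(4.0 * math.pi * expanded ** 2 * accessible / n_points)
    return areas


isolated = accessible_area([(0.0, 0.0, 0.0, 1.8)])
pair = accessible_area([(0.0, 0.0, 0.0, 1.8), (2.0, 0.0, 0.0, 1.8)])
print(isolated, pair)
```

An isolated atom keeps the full area of its expanded sphere, while each atom of the overlapping pair loses the part covered by its neighbor.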

Temp_Factor The Temp_Factor command is used to create a subset of residues having an average atomic temperature factor larger than a cutoff value. It can also load a table and graph showing the temperature factor of atoms in a protein.
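The subset selection can be sketched as a per-residue average followed by a threshold test. The input layout (pairs of residue identifier and atomic temperature factor), the function name and the example values are hypothetical.

```python
from collections import defaultdict


def high_temp_factor_residues(atoms, cutoff):
    """Residues whose average atomic temperature factor exceeds cutoff.

    atoms: iterable of (residue_id, temperature_factor) pairs
    Returns residue ids in the order in which they were first seen.
    """
    per_residue = defaultdict(list)
    for residue_id, b_factor in atoms:
        per_residue[residue_id].append(b_factor)
    return [residue_id for residue_id, values in per_residue.items()
            if sum(values) / len(values) > cutoff]


atoms = [("ALA1", 10.0), ("ALA1", 20.0),
         ("GLY2", 40.0), ("GLY2", 50.0),
         ("SER3", 30.0)]
print(high_temp_factor_residues(atoms, cutoff=30.0))  # ['GLY2']
```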

Sas_Subset The Sas_Subset command is used to create residue subsets from the residue solvent accessible surface calculated by the Access_Surf command. The subset created consists of residues with solvent accessible surfaces larger than the Sas_Cutoff value.


A References

Hartigan, J. A. (1975). Clustering Algorithms. New York: Wiley.

Lichtarge, O., Bourne, H. R. and Cohen, F. E. (1996). J. Mol. Biol., 257, 342-358.

Lichtarge, O., Yamamoto, K. R. and Cohen, F. E. (1997). J. Mol. Biol., 274, 325-337.

