7
MMDB and VAST + : tracking structural similarities between macromolecular complexes Thomas Madej*, Christopher J. Lanczycki, Dachuan Zhang, Paul A. Thiessen, Renata C. Geer, Aron Marchler-Bauer and Stephen H. Bryant National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38 A, Room 8N805, 8600 Rockville Pike, Bethesda, MD 20894, USA Received November 3, 2013; Accepted November 4, 2013 ABSTRACT The computational detection of similarities between protein 3D structures has become an indispensable tool for the detection of homologous relationships, the classification of protein families and functional inference. Consequently, numerous algorithms have been developed that facilitate structure comparison, including rapid searches against a steadily growing collection of protein structures. To this end, NCBI’s Molecular Modeling Database (MMDB), which is based on the Protein Data Bank (PDB), maintains a comprehensive and up-to-date archive of protein structure similarities computed with the Vector Alignment Search Tool (VAST). These similarities have been recorded on the level of single proteins and protein domains, comprising in excess of 1.5 billion pairwise alignments. Here we present VAST + , an extension to the existing VAST service, which summarizes and presents structural similarity on the level of biological assemblies or macromol- ecular complexes. VAST +simplifies structure neigh- boring results and shows, for macromolecular complexes tracked in MMDB, lists of similar complexes ranked by the extent of similarity. VAST + replaces the previous VAST service as the default presentation of structure neighboring data in NCBI’s Entrez query and retrieval system. MMDB and VAST + can be accessed via http:// www.ncbi.nlm.nih.gov/Structure. INTRODUCTION NCBI has maintained the Molecular Modeling Database (MMDB) (1) since 1996, as a collection of publicly accessible experimentally determined macromolecular structures that have been deposited with the Protein Data Bank (PDB) (2). MMDB serves a variety of functions. It facilitates searching for macromolecular structure data in NCBI’s Entrez query and retrieval system (3); links and associates macromolecular structure data with a variety of other resources such as gene, sequence, sequence variation, chemistry and literature databases; provides sequence data for NCBI’s BLAST (4) services; and supports other NCBI resources such as the Conserved Domain Database (CDD) (5) and IBIS (6). MMDB mirrors the content of PDB and currently contains 94 688 macromolecular structures. More than 97% of these structures have at least one protein or poly- peptide component. Recently, MMDB has updated the default presentation of macromolecular structure data so that the biologically relevant macromolecular complexes (termed ‘biological units’ or ‘biological assemblies’), as defined by the structures’ authors or computed by the PDB (7), are shown by default, and so that interactions between macromolecules or macromolecules and smaller chemical ligands are emphasized (1). This presentation makes it easier to identify pairs of molecules that come into direct contact with each other in a macromolecular complex—something that can be difficult to determine, for example, when visualizing large biological assemblies using 3D graphics software. Another feature of MMDB is the association of 3D structures, through sequence similarity searches, with protein sequences that do not yet have solved structures, facilitating inference of protein function. A BLAST search of all sequences in the Entrez Protein database, against the subset of protein sequences from experimentally determined structures, maps a large fraction of the publicly available proteins to three-dimensional structure information in MMDB. For example, of the 35 138 human protein sequences tracked in NCBI’s BioProject 178 030, a genomic sequence data set from a human hydatidiform mole cell line, 76% appear similar to known 3D structures using standard protein-BLAST. An even larger fraction can be mapped to 3D struc- ture using approaches with higher sensitivity, such as the identification of conserved domain signatures. Thus, *To whom correspondence should be addressed. Tel: +1 301 435 5998; Fax:+1 301 435 7793; Email: [email protected] Published online 6 December 2013 Nucleic Acids Research, 2014, Vol. 42, Database issue D297–D303 doi:10.1093/nar/gkt1208 Published by Oxford University Press 2013. This work is written by US Government employees and is in the public domain in the US.

MMDB and VAST : tracking structural similarities between ......complexes or biological assemblies (1). Such assemblies range from simple homo-oligomers to intricate arrange-ments of

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: MMDB and VAST : tracking structural similarities between ......complexes or biological assemblies (1). Such assemblies range from simple homo-oligomers to intricate arrange-ments of

MMDB and VAST+ tracking structural similaritiesbetween macromolecular complexesThomas Madej Christopher J Lanczycki Dachuan Zhang Paul A Thiessen

Renata C Geer Aron Marchler-Bauer and Stephen H Bryant

National Center for Biotechnology Information National Library of Medicine National Institutes of HealthBuilding 38 A Room 8N805 8600 Rockville Pike Bethesda MD 20894 USA

Received November 3 2013 Accepted November 4 2013

ABSTRACT

The computational detection of similarities betweenprotein 3D structures has become an indispensabletool for the detection of homologous relationshipsthe classification of protein families and functionalinference Consequently numerous algorithms havebeen developed that facilitate structure comparisonincluding rapid searches against a steadily growingcollection of protein structures To this end NCBIrsquosMolecular Modeling Database (MMDB) which isbased on the Protein Data Bank (PDB) maintains acomprehensive and up-to-date archive of proteinstructure similarities computed with the VectorAlignment Search Tool (VAST) These similaritieshave been recorded on the level of single proteinsand protein domains comprising in excess of 15billion pairwise alignments Here we presentVAST+ an extension to the existing VAST servicewhich summarizes and presents structural similarityon the level of biological assemblies or macromol-ecular complexes VAST+simplifies structure neigh-boring results and shows for macromolecularcomplexes tracked in MMDB lists of similarcomplexes ranked by the extent of similarityVAST+ replaces the previous VAST service as thedefault presentation of structure neighboring datain NCBIrsquos Entrez query and retrieval systemMMDB and VAST+ can be accessed via httpwwwncbinlmnihgovStructure

INTRODUCTION

NCBI has maintained the Molecular Modeling Database(MMDB) (1) since 1996 as a collection of publiclyaccessible experimentally determined macromolecularstructures that have been deposited with the ProteinData Bank (PDB) (2) MMDB serves a variety of

functions It facilitates searching for macromolecularstructure data in NCBIrsquos Entrez query and retrievalsystem (3) links and associates macromolecular structuredata with a variety of other resources such as genesequence sequence variation chemistry and literaturedatabases provides sequence data for NCBIrsquos BLAST(4) services and supports other NCBI resources such asthe Conserved Domain Database (CDD) (5) and IBIS (6)MMDB mirrors the content of PDB and currentlycontains 94 688 macromolecular structures More than97 of these structures have at least one protein or poly-peptide component Recently MMDB has updated thedefault presentation of macromolecular structure data sothat the biologically relevant macromolecular complexes(termed lsquobiological unitsrsquo or lsquobiological assembliesrsquo) asdefined by the structuresrsquo authors or computed by thePDB (7) are shown by default and so that interactionsbetween macromolecules or macromolecules and smallerchemical ligands are emphasized (1) This presentationmakes it easier to identify pairs of molecules that comeinto direct contact with each other in a macromolecularcomplexmdashsomething that can be difficult to determine forexample when visualizing large biological assembliesusing 3D graphics softwareAnother feature of MMDB is the association of 3D

structures through sequence similarity searches withprotein sequences that do not yet have solved structuresfacilitating inference of protein function A BLAST searchof all sequences in the Entrez Protein database againstthe subset of protein sequences from experimentallydetermined structures maps a large fraction of thepublicly available proteins to three-dimensional structureinformation in MMDB For example of the 35 138human protein sequences tracked in NCBIrsquos BioProject178 030 a genomic sequence data set from a humanhydatidiform mole cell line 76 appear similar toknown 3D structures using standard protein-BLASTAn even larger fraction can be mapped to 3D struc-ture using approaches with higher sensitivity such as theidentification of conserved domain signatures Thus

To whom correspondence should be addressed Tel +1 301 435 5998 Fax +1 301 435 7793 Email madejncbinlmnihgov

Published online 6 December 2013 Nucleic Acids Research 2014 Vol 42 Database issue D297ndashD303doi101093nargkt1208

Published by Oxford University Press 2013 This work is written by US Government employees and is in the public domain in the US

macromolecular structure data as provided by MMDBcan be used to postulate homology-inferred function fora large number of functionally uncharacterized proteinsequences and genesWhat may be less well-known and used are structure

neighboring data and a structure neighboring serviceavailable as part of MMDB which identify similarly-shaped structures based on geometric criteria regardlessof the extent of sequence similarity The resulting 3Dstructure alignments are helpful in understanding thefunctional consequences of sequence variation as well asin discovering distant homologous relationships andsubtle functional similarities The Vector AlignmentSearch Tool (VAST) (8) algorithm which computesthese similarities was developed around 1995 and hasbeen applied ever since to compute and maintain compre-hensive and up-to-date lists of statistically significantsimilarities between known protein 3D structures Pre-computed similarities and alignments derived from struc-ture superposition are available for all protein structuresthat have been included in MMDB and are suitable to beprocessed by VAST An interactive search tool VAST-Search facilitates structure similarity searches forprotein structure queries that are not (yet) part ofMMDBrsquos collection enabling a user to enter 3Dcoordinate data for comparison against all publiclyavailable structuresIn the past 25 years a variety of methods have been

developed to computationally characterize ormeasure struc-tural similarities between macromolecules resulting in aneven larger variety of published methods too numerous tobe listed here (9) The Protein Data Bank for examplereports structure neighbors from a subset of representativescomputed with the jFATCAT algorithm (10) and points to ahandful of external resources that provide structural classi-fications and structure comparisons SCOP (11) CATH(12) VAST (8) FATCAT (13) DALI (14) andSuperfamily (15) SCOP is a hierarchical classification ofdomain structures that has been maintained by manualintervention and does not rely on computationallydetermined 3D structure similarity CATH classifiesdomain structures hierarchically as well but makes system-atic use of the SSAP (16) algorithm to compute similaritieson a 3D level FATCAT a more recent development usesdynamic programming to string together locally alignedpairs of structural fragments while allowing for a numberof twists around pivot points decomposing the matchbetween two structures into a series of segment pairs thatcan be superimposed as rigid bodies The jFATCAT imple-mentation used to pre-compute data for the PDB site fallsback to reporting a single-segment rigid body superimpos-ition though DALI was one of the first structure compari-son methods that relied solely on geometric criteria and hasbeen available since themid 1990sAnothermethod that hadbeen incorporated in the Protein Data Bank CE (17)computes rigid body superimpositions for alignmentsfound via combinatorial extension of aligned fragmentpairs as opposed to dynamic programming or MonteCarlo optimizationsMost if not all of these computational resources and

the associated data have been maintained and available

accessible since their inception although updates of datasets such as pre-computed structure alignments may nothave happened frequently as most structure comparisonmethods are computationally intensive The DALIdatabase of pre-computed structure alignments forexample currently reports a most recent update inMarch 2011 Also most of the pre-computed structureneighboring data sets and search databases available forlive neighboring have been reduced in size to contain rep-resentative structures only Pre-computed structure neigh-bors as found on the PDB Web site for example havebeen obtained for representative structures from clustersformed at a threshold of 40 sequence identity meaningthat a structural alignment between an arbitrary pair ofsimilar or related protein structures may not be readilyavailable which somewhat limits the practical applicabil-ity of the data and search implementations

The VAST search database and database of pre-computed structure alignments have been maintained ascomplete and redundant collections since their launchwith automated updates occurring on a weekly basisThis was made possible by implementing a fast heuristicthat uses a model for the statistical significance of initialalignments of secondary structure vectors (which can becomputed quickly) so that the database searches canavoid costly alignment refinements for the large majorityof insignificant and uninteresting similarities The draw-backs are that a heuristic will miss some potentially inter-esting similarities The VAST algorithm will not forexample report similarities between structures deemedto have lt3 secondary structure elements Searches forstructural similarity can and should be complementedwith searches for sequence similarity as flexibility ofmolecular structure and limitations of the structure com-parison method may preclude the detection of matchesbetween structures of homologous polypeptides Ingeneral though structure comparison methods will pickup many subtle similarities that evade detection bysequence comparison strategies and there is no naturalcutoff point for a ranked list of similar structures unlikein the sequence comparison scenario where matches tonon-homologous gene products are considered accidentaland uninformative for the most part

Results computed by the VAST algorithm have beencompared against other approaches a number of times(17ndash19) Although there are subtle differences in retrievalsensitivity and alignment accuracy (20) it appears fair tostate that the large majority of extensive structuralsimilarities which are indicative of common evolutionarydescent and could be used to infer functional similaritiesare reported by VAST (and by most if not all of the alter-native approaches to detect common substructures)

As structure similarity search strategies have been de-veloped to also detect distant relationships that might notbe evident from sequence analysis most if not all of thecurrent approaches have been implemented so that they usea single protein molecule or rather a single domain as theunit of comparison This has been true for VAST in par-ticular However the Protein Data Bank is continuing toaccumulate structures of larger macromolecular complexesand has started to provide data on what constitutes

D298 Nucleic Acids Research 2014 Vol 42 Database issue

functionally or biologically relevant macromolecularcomplexes or biological assemblies (1) Such assembliesrange from simple homo-oligomers to intricate arrange-ments of many different components revealing details onspecific molecular interactions and on how these might con-strain sequence variation A small number of approacheshave been published in the past few years that examinestructural similarity of macromolecular complexes (2122)Here we present a simple strategy that builds on the existingdatabase of pairwise structure alignments computed byVAST and supports the first (to our knowledge) compre-hensive and regularly updated collection of macromolecu-lar complex similarities

VAST+AS AN EXTENSION TO EXISTING PROTEINSTRUCTURE COMPARISON

As information characterizing biological assemblies inmacromolecular structure data has become available itseemed that the biological assembly would be a convenientand informative unit of comparison between individualentries in the structure database If the goal is to list struc-tures most similar to any particular query one would haveto consider that the query itself may contain a macromol-ecular complex with a given stoichiometry and thatmatching complexes with matching stoichiometry mightbe more informative lsquostructure neighborsrsquo than forexample the structures that happen to contain moleculeswith the strongest local similarity to the query irrespectiveof the context

VAST+ builds on the existing VAST database togenerate such a report of structure neighbors Its goal isto find the largest set of pairs of matching macromoleculesbetween two biological assemblies and to characterize thatmatch and compute instructions for a global superimpos-ition that can be used to visualize the structural similarityFor each pair of structures in MMDB VAST+examinespre-computed structure alignments stored in the VASTdatabase that were computed for the full-length proteinmolecule components of the default biological assembliesIf such pairwise alignments are found the alignmentsbetween individual protein components of the biologicalassemblies are compared with each other for compatibil-ity and compatiblematching alignments are clusteredinto sets of alignments that together constitute a biologicalassembly match Pairwise alignments are compatible(i) if they do not share the same macromolecules ie aprotein molecule from one assembly cannot be aligned totwo molecules from the other assembly at the same timeand (ii) if they generate similar instructions (spatial trans-formation matrices) for the superpositions of coordinatesets A simple distance metric can be used to comparetransformation matrices and it lends itself to cluster align-ment sets efficiently

Each set of compatible pairwise alignments can becharacterized by (i) the number of pairwise matches iethe total number of pairs of protein molecules from thequery and subject biological assemblies that aresimultaneously aligned with each other (ii) the RMSDof the superposition obtained from considering all

alignments in the set (iii) the total length of all pairwisealignments ie the total number of amino acids that arealigned in 3D space and (iv) percentage of identicalresidues in the alignments For each pairwise comparisonof two biological assemblies only the match with thehighest number of aligned molecules and the highestnumber of aligned residues is recorded and reportedCurrently 53 of polypeptide-containing struc-

tures in MMDB have gt1 polypeptide chain The histo-gram plotted in Figure 1 breaks down the numbers byoligomer size and indicates that large fractions of theoligomeric assemblies have in general structure neighborsthat match the entire assemblies It should be noted thatthe fractions might be somewhat exaggerated as exactduplicates of a structure would be counted as biologicalassembly matches and no attempt was made to removeredundant structures or classify biological assemblymatches as informative versus uninformative

THE VAST+WEB SERVICE

Structure neighbors as computed by the VAST+algorithmwill be used in the future to provide links to lsquosimilar struc-turesrsquo on Entrezstructure document summaries Lists ofsimilar structures are then summarized via a new inter-active web service which can also be used independentlyof the Entrez query and retrieval system and provides toolsfor sub-setting results at httpwwwncbinlmnihgovStructurevastplusvastpluscgi For a query structurespecified by the user the service lists similar structuresshould they exist ranked by the extent of the matchMatches that associate each polymer chain of the querywith a corresponding polymer chain of some other struc-ture are considered complete and are indicated with fullcircles in the search results table partial matches are

1

10

100

1000

10000

100000

1 2 3 4 5 6 7 8 9 10 11 12 13

StructuresStructures with Bioassembly Match

Oligomeric state of Biological Assemblies

Figure 1 This histogram displays the number of structures in MMDB(blue) categorized by the size of the biological assembly Monomersdimers and higher oligomers up to dodecamers are plotted as separatecategories the 13th category summarizes tridecamers and all higheroligomers The y-axis is scaled logarithmically Red columns indicatethe number of structures in that category that have at least onecomplete biological assembly match according to VAST+

Nucleic Acids Research 2014 Vol 42 Database issue D299

Figure 2 The VAST+web service generates lists of structures that have 3D similarity to the query Matches are evaluated with biological assembliesas the unit of comparison (referred to as Biological Units) and may summarize simultaneous alignment of several protein molecule pairs The querystructure lsquo3O6Frsquo (24) currently yields 2712 structure neighbors Only the 115 neighbors with a complete biological assembly match have been selectedin this example (via the lsquodisplay filtersrsquo menu shown as collapsed in this figure) The 115 complete matches have been sorted by RMSD and thethird ranking match has been selected to provide more detail The tabulated matches are shown with their PDB accession descriptive textthe number of proteins aligned in the match the total number of aligned residues the sequence identity and the RMSD resulting from thesimultaneous superimposition of all aligned molecules In this example the query lsquo3O6Frsquo matches the structure lsquo1J8Hrsquo with a total of fouraligned protein molecules totaling 768 residues and resulting in a superposition with 255 A RMSD 80 of the residues in 3O6F and 1J8H thatwere spatially aligned by VAST are identical The extended panel characterizing this selected match contains a table that lists pairs of matchingaligned proteins and it provides schematic depictions of each biological assemblyrsquos composition and interactions The user can mouse-over thoseschematics to identify individual molecules and their corresponding match in the other structure (as shown in this example) The individual proteinmatch table contains action buttons that provide access to the pairwise sequence alignments as derived from the VAST superimposition and launchpoints for visualization of the structure superimposition with the protein structure viewer Cn3D (23) Each lsquolsquo3D Viewrsquorsquo button will open asuperposition of the complete biological assembly alignment with the 3D view centered on the selected protein molecule and its sequence datafeatured in the Cn3D sequence viewer window Next to the Aligned Molecules table an information box lists some stats that characterize thematched biological assembly

D300 Nucleic Acids Research 2014 Vol 42 Database issue

indicated with partially filled circles The default rankingputs matches with the most matched components at the topof the list Not all queries that have similar structures ac-cording to VAST+are guaranteed to also have completematches (although monomers usually do) The searchresults tables provided by the VAST+web service give aconcise summary of the matches and the extentquality ofthe similarity A clickable lsquo+rsquo symbol opens a panel for aselected match that provides more details andfunctionality

USING CN3D TO VISUALIZE BIOLOGICALASSEMBLY ALIGNMENTS

The 3D structures of superimposed biological assembliesmay be visualized using the 3D viewer Cn3D (23)which has been re-released as a new version 431 to

support the visualization style Currently Cn3D is ableto display the structure superposition of the matchedbiological assemblies and all the protein chainsinvolved but it can only display one sequence alignmentat a time Therefore the individual protein match tableas shown in Figure 2 provides separate Cn3D launchpoints for each matchedaligned protein pair All ofthese launch points will result in the same 3D imageand rendering but they will differ in the pair ofaligned sequences that are chosen as the contentof Cn3Drsquos sequencealignment viewer window Figure3 provides examples of Cn3D visualization sessionsPairs of matching molecules are rendered in the samecolor with unaligned segments rendered in gray Thedefault rendering settings as generated and providedby the VAST+ service can be examined and modifiedvia Cn3Drsquos StylejAnnotate menu

Figure 3 Visualization of structurally matching biological assemblies as rendered by the visualization tool Cn3D Cn3D is a helper application forthe web browser available for Windows and OS-X platforms The query structure PDB accession 3O6F represents the complex of an autoreactiveT-cell receptor (MS2-3C8 molecules rendered in green and brown) complexed with a self-peptide derived from myelin basic protein and the multiplesclerosis-associated MHC molecule HLA-DR4 (molecules rendered in magenta and blue) (24) The self-peptide has been fused with the MHCmolecule for the experiment which explains why the query is represented as a biological assembly with only four components (Figure 2) and isrendered in gray as is the default for all unaligned segments in Cn3D visualization sessions launched from VAST+ results pages The left panelshows 3O6F superimposed with the structure neighbor PDB accession 1J8H (25) which contains a complex between HLA-DR3 an Influenzahemagglutinin peptide and a human alphabeta T-cell receptor Molecules are rendered so that their colors match those of the corresponding querymolecules The structures of the two complexes match well resulting in a superimposition of 768 amino acid residues at 26 A RMSD Thisdemonstrates how well the autoreactive T-cell receptor complex mimics complexes that include foreign peptides and it is thought that this bindingmode is responsible for the autoimmune TCR escaping negative selection The right panel shows the VAST+ alignment between 3O6F and thestructure of a T-cell receptor from a patient with multiple sclerosis complexed with a myelin basic protein-derived peptide and an HLA-DR2 MHCPDB accession 1YMM (26) The conformations of the two complexes are different although their components are similar and VAST+ does notconsider the complete biological assemblies to match Instead it reports the most extensive sub-structure match which in this case involves bothsubunits of the MHC (molecules rendered in magenta and blue) The molecules corresponding to the TCR are rendered in gray color and would notbe displayed by default The unusual conformation of the complex reported in 1YMM is thought to represent an alternative binding mode that helpsautoimmune TCRs to escape negative selection

Nucleic Acids Research 2014 Vol 42 Database issue D301

SIMILAR SUBSTRUCTURES ORIGINAL VAST ANDVAST-SEARCH

MMDB is updated weekly following PDBrsquos scheduleWith each update computation of new structure neigh-bors is completed within a few days and they are availableas structure neighbors computed for biological assembliesvia the VAST+ service as well as structure neighborscomputed for individual protein chains and domains viathe original VAST service The latter is accessible on theVAST+ pages via a button labeled lsquooriginal VASTrsquo thatcan be found near the top of the VAST+results page Atthis point the VAST-search service which accepts3D structure data uploaded in PDB-format remainsunchanged and presents similar structures in theOriginal VAST format (Table 1)

LIMITATIONS AND FUTURE WORK

The current implementation of VAST+ the associateddatabase of pre-computed structure comparison resultsand the web service represent a first attempt at providinga comprehensive set of structure neighboring informationfor biological assemblies Several issues may limit thepotential applications of the idea and we intend toaddress them in future releases of the service and theassociated data Currently VAST+ neighboring is re-stricted to the default biological assembly for each struc-ture as determined by the content of the structure dataand MMDB parsing Although multiple biologicalassemblies if present in a structure entry tend to beclosely similar copies of each other there are exceptionsthat should be considered explicitly Also VAST+ cur-rently draws on results of VAST neighboring ascomputed for complete protein molecules and ignores alarger set of results obtained for individual domains Thiswas done intentionally so as to speed up the computa-tion and as the first implementation was intended tofocus on structural similarities that are both strong andglobal We anticipated that most cases where two entirebiological assemblies can be superimposed globallywould break down into individual protein pairs thatcan also be aligned and superimposed globally and notjust at the level of individual domains More import-antly VAST+ makes no attempt at this point atrefining the alignment and superpositions after detectinga match between two biological assemblies It is conceiv-able that such refinement would in many cases results in

somewhat shorter alignments and lower RMSD valuesand might be useful in emphasizing the conservedcontact interface between the components of a molecularcomplex Furthermore a strategy for detecting biologicalassembly matches that considers multiple molecules sim-ultaneously might exhibit higher sensitivity and pick upsimilarities that cannot be found via the 2-tieredapproach we have presented here Currently VAST+ignores non-polypeptide components of macromolecularcomplexes but certainly both nucleic acids and chemicalligands if present could be matched as well as theprotein components It should be mentioned that noVAST+ neighboring data are available for structuredatabase entries that lack assignment of biologicalassemblies and currently VAST+ skips biologicalassemblies whose size exceeds a threshold number ofprotein componentsmdashthe systematic evaluation of allpossible multimolecule matches becomes too time-consuming and will need to be supplemented by asuitable heuristic

FUNDING

Intramural Research Program of the National Libraryof Medicine at National Institutes of HealthDHHSComments suggestions and questions are welcome andshould be directed to infoncbinlmnihgov Fundingfor open access charge Intramural Research Program ofthe National Library of Medicine at the NationalInstitutes of HealthDHHS

Conflict of interest statement None declared

REFERENCES

1 MadejT AddessKJ FongJH GeerLY GeerRCLanczyckiCJ LiuC LuS Marchler-BauerA PanchenkoARet al (2012) MMDB 3D structures and macromolecularinteractions Nucleic Acids Res 40 D461ndashD464

2 RosePW BiC BluhmWF ChristieCH DimitropoulosDDuttaS GreenRK GoodsellDS PrlicA QuesadaM et al(2013) The RCSB protein data bank new resources for researchand education Nucleic Acids Res 41 D475ndashD482

3 GibneyG and BaxevanisAD (2011) Searching NCBI databasesusing Entrez Curr Protoc Hum Genet Chapter 6 Unit 610

4 AltschulSF MaddenTL SchafferAA ZhangJ ZhangZMillerW and LipmanDJ (1997) Gapped BLAST andPSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 25 3389ndash3402

5 Marchler-BauerA ZhengC ChitsazF DerbyshireMKGeerLY GeerRC GonzalesNR GwadzM HurwitzDILanczyckiCJ et al (2013) CDD conserved domains and

Table 1 URLs for MMDB and VAST resources

MMDB Database home page httpwwwncbinlmnihgovstructureMMDB FTP Data distribution ftpftpncbinihgovmmdbVAST Identify structurally similar individual protein molecules httpwwwncbinlmnihgovStructureVASTvastshtmlVAST+ Identify structurally similar macromolecular complexes httpwwwncbinlmnihgovStructurevastplusvastpluscgiVAST search Input the 3D coordinates of a query structure to

search for similar structureshttpwwwncbinlmnihgovStructureVASTvastsearchhtml

Cn3D Molecular graphics viewer httpwwwncbinlmnihgovStructureCN3Dcn3dshtmlCBLAST Find 3D structures that are related to a query protein via

sequence comparisonhttpwwwncbinlmnihgovStructurecblastcblastcgi

D302 Nucleic Acids Research 2014 Vol 42 Database issue

protein three-dimensional structure Nucleic Acids Res 41D348ndashD352

6 ShoemakerBA ZhangD TyagiM ThanguduRR FongJHMarchler-BauerA BryantSH MadejT and PanchenkoAR(2012) IBIS (Inferred Biomolecular Interaction Server) reportspredicts and integrates multiple types of conserved interactionsfor proteins Nucleic Acids Res 40 D834ndashD840

7 KrissinelE and HenrickK (2007) Inference of macromolecularassemblies from crystalline state J Mol Biol 372 774ndash797

8 GibratJF MadejT and BryantSH (1996) Surprisingsimilarities in structure comparison Curr Opin Struct Biol 6377ndash385

9 HaseqawaH and HolmL (2009) Advances and pitfalls ofprotein structural alignment Curr Opin Struct Biol 19341ndash348

10 PrlicA BlivenS RosePW BluhmWF BizonC GodzikAand BournePE (2010) Pre-calculated protein structurealignments at the RCSB PDB website Bioinformatics 262983ndash2985

11 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classficiation of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

12 SillitoeI CuffAL DessaillyBH DawsonNL FurnhamNLeeD LeesJG LewisTE StuderRA RentzschR et al(2013) New functional families (FunFams) in CATH to improvethe mapping of conserved functional sites to 3D structuresNucleic Acids Res 41 D490ndashD498

13 YeY and GodzikA (2003) Flexible structure alignment bychaining aligned fragment pairs allowing twists Bioinformatics19(Suppl 2) ii246ndashii255

14 HolmL and RosenstromP (2010) Dali server conservationmapping in 3D Nucleic Acids Res 38(Suppl 2) W545ndashW549

15 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

16 OrengoCA and TaylorWR (1996) SSAP sequential structurealignment program for protein structure comparisonMethods Enzymol 266 617ndash635

17 ShindyalovIN and BournePE (1998) Protein structurealignment by incremental combinatorial extension (CE) of theoptimal path Protein Eng 11 739ndash747

18 Marchler-BauerA and BryantSH (1997) Measures of threadingspecificity and accuracy Proteins (Suppl 1) 74ndash82

19 SierkML and PearsonWR (2004) Sensitivity andselectivity in protein structure comparison Protein Sci 13773ndash785

20 KimC and LeeB (2007) Accuracy of structure-basedsequence alignments of automatic methods BMC Bioinformatics8 355

21 SipplMJ and WiedersteinM (2012) Detection of spatialcorrelations in protein structures and molecular complexesStructure 20 718ndash728

22 MukherjeeS and ZhangY (2009) MM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programming Nucleic Acids Res 37 e83

23 WangY GeerLY ChappeyC KansJA and BryantSH(2000) Cn3D sequence and structure views for Entrez TrendsBiochem Sci 25 300ndash302

24 YinY LiY KerzicM MartinR and MariuzzaRA (2011)Structure of a TCR with high affinity for self-antigen revealsbasis for escape from negative selection EMBO J 301137ndash1148

25 HenneckeJ and WileyDC (2002) Structure of a complex of thehuman alphabeta T cell receptor (TCR) HA17 Influenzahemagglutinin peptide and major histocompatibility complexclass II molecule HLA-DR4 (DRA0101 and DRB10401)insight into TCR cross-restriction and alloreactivity J ExpMed 195 571ndash581

26 HahnM NicholsonMJ PyrdolJ and WucherpfennigKW(2005) Unconventional topology of self-peptide majorhistocompatibility complex binding by a human autoimmuneT cell receptor Nat Immunol 6 490ndash496

Nucleic Acids Research 2014 Vol 42 Database issue D303

Page 2: MMDB and VAST : tracking structural similarities between ......complexes or biological assemblies (1). Such assemblies range from simple homo-oligomers to intricate arrange-ments of

macromolecular structure data as provided by MMDBcan be used to postulate homology-inferred function fora large number of functionally uncharacterized proteinsequences and genesWhat may be less well-known and used are structure

neighboring data and a structure neighboring serviceavailable as part of MMDB which identify similarly-shaped structures based on geometric criteria regardlessof the extent of sequence similarity The resulting 3Dstructure alignments are helpful in understanding thefunctional consequences of sequence variation as well asin discovering distant homologous relationships andsubtle functional similarities The Vector AlignmentSearch Tool (VAST) (8) algorithm which computesthese similarities was developed around 1995 and hasbeen applied ever since to compute and maintain compre-hensive and up-to-date lists of statistically significantsimilarities between known protein 3D structures Pre-computed similarities and alignments derived from struc-ture superposition are available for all protein structuresthat have been included in MMDB and are suitable to beprocessed by VAST An interactive search tool VAST-Search facilitates structure similarity searches forprotein structure queries that are not (yet) part ofMMDBrsquos collection enabling a user to enter 3Dcoordinate data for comparison against all publiclyavailable structuresIn the past 25 years a variety of methods have been

developed to computationally characterize ormeasure struc-tural similarities between macromolecules resulting in aneven larger variety of published methods too numerous tobe listed here (9) The Protein Data Bank for examplereports structure neighbors from a subset of representativescomputed with the jFATCAT algorithm (10) and points to ahandful of external resources that provide structural classi-fications and structure comparisons SCOP (11) CATH(12) VAST (8) FATCAT (13) DALI (14) andSuperfamily (15) SCOP is a hierarchical classification ofdomain structures that has been maintained by manualintervention and does not rely on computationallydetermined 3D structure similarity CATH classifiesdomain structures hierarchically as well but makes system-atic use of the SSAP (16) algorithm to compute similaritieson a 3D level FATCAT a more recent development usesdynamic programming to string together locally alignedpairs of structural fragments while allowing for a numberof twists around pivot points decomposing the matchbetween two structures into a series of segment pairs thatcan be superimposed as rigid bodies The jFATCAT imple-mentation used to pre-compute data for the PDB site fallsback to reporting a single-segment rigid body superimpos-ition though DALI was one of the first structure compari-son methods that relied solely on geometric criteria and hasbeen available since themid 1990sAnothermethod that hadbeen incorporated in the Protein Data Bank CE (17)computes rigid body superimpositions for alignmentsfound via combinatorial extension of aligned fragmentpairs as opposed to dynamic programming or MonteCarlo optimizationsMost if not all of these computational resources and

the associated data have been maintained and available

accessible since their inception although updates of datasets such as pre-computed structure alignments may nothave happened frequently as most structure comparisonmethods are computationally intensive The DALIdatabase of pre-computed structure alignments forexample currently reports a most recent update inMarch 2011 Also most of the pre-computed structureneighboring data sets and search databases available forlive neighboring have been reduced in size to contain rep-resentative structures only Pre-computed structure neigh-bors as found on the PDB Web site for example havebeen obtained for representative structures from clustersformed at a threshold of 40 sequence identity meaningthat a structural alignment between an arbitrary pair ofsimilar or related protein structures may not be readilyavailable which somewhat limits the practical applicabil-ity of the data and search implementations

The VAST search database and database of pre-computed structure alignments have been maintained ascomplete and redundant collections since their launchwith automated updates occurring on a weekly basisThis was made possible by implementing a fast heuristicthat uses a model for the statistical significance of initialalignments of secondary structure vectors (which can becomputed quickly) so that the database searches canavoid costly alignment refinements for the large majorityof insignificant and uninteresting similarities The draw-backs are that a heuristic will miss some potentially inter-esting similarities The VAST algorithm will not forexample report similarities between structures deemedto have lt3 secondary structure elements Searches forstructural similarity can and should be complementedwith searches for sequence similarity as flexibility ofmolecular structure and limitations of the structure com-parison method may preclude the detection of matchesbetween structures of homologous polypeptides Ingeneral though structure comparison methods will pickup many subtle similarities that evade detection bysequence comparison strategies and there is no naturalcutoff point for a ranked list of similar structures unlikein the sequence comparison scenario where matches tonon-homologous gene products are considered accidentaland uninformative for the most part

Results computed by the VAST algorithm have beencompared against other approaches a number of times(17ndash19) Although there are subtle differences in retrievalsensitivity and alignment accuracy (20) it appears fair tostate that the large majority of extensive structuralsimilarities which are indicative of common evolutionarydescent and could be used to infer functional similaritiesare reported by VAST (and by most if not all of the alter-native approaches to detect common substructures)

As structure similarity search strategies have been de-veloped to also detect distant relationships that might notbe evident from sequence analysis most if not all of thecurrent approaches have been implemented so that they usea single protein molecule or rather a single domain as theunit of comparison This has been true for VAST in par-ticular However the Protein Data Bank is continuing toaccumulate structures of larger macromolecular complexesand has started to provide data on what constitutes

D298 Nucleic Acids Research 2014 Vol 42 Database issue

functionally or biologically relevant macromolecularcomplexes or biological assemblies (1) Such assembliesrange from simple homo-oligomers to intricate arrange-ments of many different components revealing details onspecific molecular interactions and on how these might con-strain sequence variation A small number of approacheshave been published in the past few years that examinestructural similarity of macromolecular complexes (2122)Here we present a simple strategy that builds on the existingdatabase of pairwise structure alignments computed byVAST and supports the first (to our knowledge) compre-hensive and regularly updated collection of macromolecu-lar complex similarities

VAST+AS AN EXTENSION TO EXISTING PROTEINSTRUCTURE COMPARISON

As information characterizing biological assemblies inmacromolecular structure data has become available itseemed that the biological assembly would be a convenientand informative unit of comparison between individualentries in the structure database If the goal is to list struc-tures most similar to any particular query one would haveto consider that the query itself may contain a macromol-ecular complex with a given stoichiometry and thatmatching complexes with matching stoichiometry mightbe more informative lsquostructure neighborsrsquo than forexample the structures that happen to contain moleculeswith the strongest local similarity to the query irrespectiveof the context

VAST+ builds on the existing VAST database togenerate such a report of structure neighbors Its goal isto find the largest set of pairs of matching macromoleculesbetween two biological assemblies and to characterize thatmatch and compute instructions for a global superimpos-ition that can be used to visualize the structural similarityFor each pair of structures in MMDB VAST+examinespre-computed structure alignments stored in the VASTdatabase that were computed for the full-length proteinmolecule components of the default biological assembliesIf such pairwise alignments are found the alignmentsbetween individual protein components of the biologicalassemblies are compared with each other for compatibil-ity and compatiblematching alignments are clusteredinto sets of alignments that together constitute a biologicalassembly match Pairwise alignments are compatible(i) if they do not share the same macromolecules ie aprotein molecule from one assembly cannot be aligned totwo molecules from the other assembly at the same timeand (ii) if they generate similar instructions (spatial trans-formation matrices) for the superpositions of coordinatesets A simple distance metric can be used to comparetransformation matrices and it lends itself to cluster align-ment sets efficiently

Each set of compatible pairwise alignments can becharacterized by (i) the number of pairwise matches iethe total number of pairs of protein molecules from thequery and subject biological assemblies that aresimultaneously aligned with each other (ii) the RMSDof the superposition obtained from considering all

alignments in the set (iii) the total length of all pairwisealignments ie the total number of amino acids that arealigned in 3D space and (iv) percentage of identicalresidues in the alignments For each pairwise comparisonof two biological assemblies only the match with thehighest number of aligned molecules and the highestnumber of aligned residues is recorded and reportedCurrently 53 of polypeptide-containing struc-

tures in MMDB have gt1 polypeptide chain The histo-gram plotted in Figure 1 breaks down the numbers byoligomer size and indicates that large fractions of theoligomeric assemblies have in general structure neighborsthat match the entire assemblies It should be noted thatthe fractions might be somewhat exaggerated as exactduplicates of a structure would be counted as biologicalassembly matches and no attempt was made to removeredundant structures or classify biological assemblymatches as informative versus uninformative

THE VAST+WEB SERVICE

Structure neighbors as computed by the VAST+algorithmwill be used in the future to provide links to lsquosimilar struc-turesrsquo on Entrezstructure document summaries Lists ofsimilar structures are then summarized via a new inter-active web service which can also be used independentlyof the Entrez query and retrieval system and provides toolsfor sub-setting results at httpwwwncbinlmnihgovStructurevastplusvastpluscgi For a query structurespecified by the user the service lists similar structuresshould they exist ranked by the extent of the matchMatches that associate each polymer chain of the querywith a corresponding polymer chain of some other struc-ture are considered complete and are indicated with fullcircles in the search results table partial matches are

1

10

100

1000

10000

100000

1 2 3 4 5 6 7 8 9 10 11 12 13

StructuresStructures with Bioassembly Match

Oligomeric state of Biological Assemblies

Figure 1 This histogram displays the number of structures in MMDB(blue) categorized by the size of the biological assembly Monomersdimers and higher oligomers up to dodecamers are plotted as separatecategories the 13th category summarizes tridecamers and all higheroligomers The y-axis is scaled logarithmically Red columns indicatethe number of structures in that category that have at least onecomplete biological assembly match according to VAST+

Nucleic Acids Research 2014 Vol 42 Database issue D299

Figure 2 The VAST+web service generates lists of structures that have 3D similarity to the query Matches are evaluated with biological assembliesas the unit of comparison (referred to as Biological Units) and may summarize simultaneous alignment of several protein molecule pairs The querystructure lsquo3O6Frsquo (24) currently yields 2712 structure neighbors Only the 115 neighbors with a complete biological assembly match have been selectedin this example (via the lsquodisplay filtersrsquo menu shown as collapsed in this figure) The 115 complete matches have been sorted by RMSD and thethird ranking match has been selected to provide more detail The tabulated matches are shown with their PDB accession descriptive textthe number of proteins aligned in the match the total number of aligned residues the sequence identity and the RMSD resulting from thesimultaneous superimposition of all aligned molecules In this example the query lsquo3O6Frsquo matches the structure lsquo1J8Hrsquo with a total of fouraligned protein molecules totaling 768 residues and resulting in a superposition with 255 A RMSD 80 of the residues in 3O6F and 1J8H thatwere spatially aligned by VAST are identical The extended panel characterizing this selected match contains a table that lists pairs of matchingaligned proteins and it provides schematic depictions of each biological assemblyrsquos composition and interactions The user can mouse-over thoseschematics to identify individual molecules and their corresponding match in the other structure (as shown in this example) The individual proteinmatch table contains action buttons that provide access to the pairwise sequence alignments as derived from the VAST superimposition and launchpoints for visualization of the structure superimposition with the protein structure viewer Cn3D (23) Each lsquolsquo3D Viewrsquorsquo button will open asuperposition of the complete biological assembly alignment with the 3D view centered on the selected protein molecule and its sequence datafeatured in the Cn3D sequence viewer window Next to the Aligned Molecules table an information box lists some stats that characterize thematched biological assembly

D300 Nucleic Acids Research 2014 Vol 42 Database issue

indicated with partially filled circles The default rankingputs matches with the most matched components at the topof the list Not all queries that have similar structures ac-cording to VAST+are guaranteed to also have completematches (although monomers usually do) The searchresults tables provided by the VAST+web service give aconcise summary of the matches and the extentquality ofthe similarity A clickable lsquo+rsquo symbol opens a panel for aselected match that provides more details andfunctionality

USING CN3D TO VISUALIZE BIOLOGICALASSEMBLY ALIGNMENTS

The 3D structures of superimposed biological assembliesmay be visualized using the 3D viewer Cn3D (23)which has been re-released as a new version 431 to

support the visualization style Currently Cn3D is ableto display the structure superposition of the matchedbiological assemblies and all the protein chainsinvolved but it can only display one sequence alignmentat a time Therefore the individual protein match tableas shown in Figure 2 provides separate Cn3D launchpoints for each matchedaligned protein pair All ofthese launch points will result in the same 3D imageand rendering but they will differ in the pair ofaligned sequences that are chosen as the contentof Cn3Drsquos sequencealignment viewer window Figure3 provides examples of Cn3D visualization sessionsPairs of matching molecules are rendered in the samecolor with unaligned segments rendered in gray Thedefault rendering settings as generated and providedby the VAST+ service can be examined and modifiedvia Cn3Drsquos StylejAnnotate menu

Figure 3 Visualization of structurally matching biological assemblies as rendered by the visualization tool Cn3D Cn3D is a helper application forthe web browser available for Windows and OS-X platforms The query structure PDB accession 3O6F represents the complex of an autoreactiveT-cell receptor (MS2-3C8 molecules rendered in green and brown) complexed with a self-peptide derived from myelin basic protein and the multiplesclerosis-associated MHC molecule HLA-DR4 (molecules rendered in magenta and blue) (24) The self-peptide has been fused with the MHCmolecule for the experiment which explains why the query is represented as a biological assembly with only four components (Figure 2) and isrendered in gray as is the default for all unaligned segments in Cn3D visualization sessions launched from VAST+ results pages The left panelshows 3O6F superimposed with the structure neighbor PDB accession 1J8H (25) which contains a complex between HLA-DR3 an Influenzahemagglutinin peptide and a human alphabeta T-cell receptor Molecules are rendered so that their colors match those of the corresponding querymolecules The structures of the two complexes match well resulting in a superimposition of 768 amino acid residues at 26 A RMSD Thisdemonstrates how well the autoreactive T-cell receptor complex mimics complexes that include foreign peptides and it is thought that this bindingmode is responsible for the autoimmune TCR escaping negative selection The right panel shows the VAST+ alignment between 3O6F and thestructure of a T-cell receptor from a patient with multiple sclerosis complexed with a myelin basic protein-derived peptide and an HLA-DR2 MHCPDB accession 1YMM (26) The conformations of the two complexes are different although their components are similar and VAST+ does notconsider the complete biological assemblies to match Instead it reports the most extensive sub-structure match which in this case involves bothsubunits of the MHC (molecules rendered in magenta and blue) The molecules corresponding to the TCR are rendered in gray color and would notbe displayed by default The unusual conformation of the complex reported in 1YMM is thought to represent an alternative binding mode that helpsautoimmune TCRs to escape negative selection

Nucleic Acids Research 2014 Vol 42 Database issue D301

SIMILAR SUBSTRUCTURES ORIGINAL VAST ANDVAST-SEARCH

MMDB is updated weekly following PDBrsquos scheduleWith each update computation of new structure neigh-bors is completed within a few days and they are availableas structure neighbors computed for biological assembliesvia the VAST+ service as well as structure neighborscomputed for individual protein chains and domains viathe original VAST service The latter is accessible on theVAST+ pages via a button labeled lsquooriginal VASTrsquo thatcan be found near the top of the VAST+results page Atthis point the VAST-search service which accepts3D structure data uploaded in PDB-format remainsunchanged and presents similar structures in theOriginal VAST format (Table 1)

LIMITATIONS AND FUTURE WORK

The current implementation of VAST+ the associateddatabase of pre-computed structure comparison resultsand the web service represent a first attempt at providinga comprehensive set of structure neighboring informationfor biological assemblies Several issues may limit thepotential applications of the idea and we intend toaddress them in future releases of the service and theassociated data Currently VAST+ neighboring is re-stricted to the default biological assembly for each struc-ture as determined by the content of the structure dataand MMDB parsing Although multiple biologicalassemblies if present in a structure entry tend to beclosely similar copies of each other there are exceptionsthat should be considered explicitly Also VAST+ cur-rently draws on results of VAST neighboring ascomputed for complete protein molecules and ignores alarger set of results obtained for individual domains Thiswas done intentionally so as to speed up the computa-tion and as the first implementation was intended tofocus on structural similarities that are both strong andglobal We anticipated that most cases where two entirebiological assemblies can be superimposed globallywould break down into individual protein pairs thatcan also be aligned and superimposed globally and notjust at the level of individual domains More import-antly VAST+ makes no attempt at this point atrefining the alignment and superpositions after detectinga match between two biological assemblies It is conceiv-able that such refinement would in many cases results in

somewhat shorter alignments and lower RMSD valuesand might be useful in emphasizing the conservedcontact interface between the components of a molecularcomplex Furthermore a strategy for detecting biologicalassembly matches that considers multiple molecules sim-ultaneously might exhibit higher sensitivity and pick upsimilarities that cannot be found via the 2-tieredapproach we have presented here Currently VAST+ignores non-polypeptide components of macromolecularcomplexes but certainly both nucleic acids and chemicalligands if present could be matched as well as theprotein components It should be mentioned that noVAST+ neighboring data are available for structuredatabase entries that lack assignment of biologicalassemblies and currently VAST+ skips biologicalassemblies whose size exceeds a threshold number ofprotein componentsmdashthe systematic evaluation of allpossible multimolecule matches becomes too time-consuming and will need to be supplemented by asuitable heuristic

FUNDING

Intramural Research Program of the National Libraryof Medicine at National Institutes of HealthDHHSComments suggestions and questions are welcome andshould be directed to infoncbinlmnihgov Fundingfor open access charge Intramural Research Program ofthe National Library of Medicine at the NationalInstitutes of HealthDHHS

Conflict of interest statement None declared

REFERENCES

1 MadejT AddessKJ FongJH GeerLY GeerRCLanczyckiCJ LiuC LuS Marchler-BauerA PanchenkoARet al (2012) MMDB 3D structures and macromolecularinteractions Nucleic Acids Res 40 D461ndashD464

2 RosePW BiC BluhmWF ChristieCH DimitropoulosDDuttaS GreenRK GoodsellDS PrlicA QuesadaM et al(2013) The RCSB protein data bank new resources for researchand education Nucleic Acids Res 41 D475ndashD482

3 GibneyG and BaxevanisAD (2011) Searching NCBI databasesusing Entrez Curr Protoc Hum Genet Chapter 6 Unit 610

4 AltschulSF MaddenTL SchafferAA ZhangJ ZhangZMillerW and LipmanDJ (1997) Gapped BLAST andPSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 25 3389ndash3402

5 Marchler-BauerA ZhengC ChitsazF DerbyshireMKGeerLY GeerRC GonzalesNR GwadzM HurwitzDILanczyckiCJ et al (2013) CDD conserved domains and

Table 1 URLs for MMDB and VAST resources

MMDB Database home page httpwwwncbinlmnihgovstructureMMDB FTP Data distribution ftpftpncbinihgovmmdbVAST Identify structurally similar individual protein molecules httpwwwncbinlmnihgovStructureVASTvastshtmlVAST+ Identify structurally similar macromolecular complexes httpwwwncbinlmnihgovStructurevastplusvastpluscgiVAST search Input the 3D coordinates of a query structure to

search for similar structureshttpwwwncbinlmnihgovStructureVASTvastsearchhtml

Cn3D Molecular graphics viewer httpwwwncbinlmnihgovStructureCN3Dcn3dshtmlCBLAST Find 3D structures that are related to a query protein via

sequence comparisonhttpwwwncbinlmnihgovStructurecblastcblastcgi

D302 Nucleic Acids Research 2014 Vol 42 Database issue

protein three-dimensional structure Nucleic Acids Res 41D348ndashD352

6 ShoemakerBA ZhangD TyagiM ThanguduRR FongJHMarchler-BauerA BryantSH MadejT and PanchenkoAR(2012) IBIS (Inferred Biomolecular Interaction Server) reportspredicts and integrates multiple types of conserved interactionsfor proteins Nucleic Acids Res 40 D834ndashD840

7 KrissinelE and HenrickK (2007) Inference of macromolecularassemblies from crystalline state J Mol Biol 372 774ndash797

8 GibratJF MadejT and BryantSH (1996) Surprisingsimilarities in structure comparison Curr Opin Struct Biol 6377ndash385

9 HaseqawaH and HolmL (2009) Advances and pitfalls ofprotein structural alignment Curr Opin Struct Biol 19341ndash348

10 PrlicA BlivenS RosePW BluhmWF BizonC GodzikAand BournePE (2010) Pre-calculated protein structurealignments at the RCSB PDB website Bioinformatics 262983ndash2985

11 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classficiation of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

12 SillitoeI CuffAL DessaillyBH DawsonNL FurnhamNLeeD LeesJG LewisTE StuderRA RentzschR et al(2013) New functional families (FunFams) in CATH to improvethe mapping of conserved functional sites to 3D structuresNucleic Acids Res 41 D490ndashD498

13 YeY and GodzikA (2003) Flexible structure alignment bychaining aligned fragment pairs allowing twists Bioinformatics19(Suppl 2) ii246ndashii255

14 HolmL and RosenstromP (2010) Dali server conservationmapping in 3D Nucleic Acids Res 38(Suppl 2) W545ndashW549

15 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

16 OrengoCA and TaylorWR (1996) SSAP sequential structurealignment program for protein structure comparisonMethods Enzymol 266 617ndash635

17 ShindyalovIN and BournePE (1998) Protein structurealignment by incremental combinatorial extension (CE) of theoptimal path Protein Eng 11 739ndash747

18 Marchler-BauerA and BryantSH (1997) Measures of threadingspecificity and accuracy Proteins (Suppl 1) 74ndash82

19 SierkML and PearsonWR (2004) Sensitivity andselectivity in protein structure comparison Protein Sci 13773ndash785

20 KimC and LeeB (2007) Accuracy of structure-basedsequence alignments of automatic methods BMC Bioinformatics8 355

21 SipplMJ and WiedersteinM (2012) Detection of spatialcorrelations in protein structures and molecular complexesStructure 20 718ndash728

22 MukherjeeS and ZhangY (2009) MM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programming Nucleic Acids Res 37 e83

23 WangY GeerLY ChappeyC KansJA and BryantSH(2000) Cn3D sequence and structure views for Entrez TrendsBiochem Sci 25 300ndash302

24 YinY LiY KerzicM MartinR and MariuzzaRA (2011)Structure of a TCR with high affinity for self-antigen revealsbasis for escape from negative selection EMBO J 301137ndash1148

25 HenneckeJ and WileyDC (2002) Structure of a complex of thehuman alphabeta T cell receptor (TCR) HA17 Influenzahemagglutinin peptide and major histocompatibility complexclass II molecule HLA-DR4 (DRA0101 and DRB10401)insight into TCR cross-restriction and alloreactivity J ExpMed 195 571ndash581

26 HahnM NicholsonMJ PyrdolJ and WucherpfennigKW(2005) Unconventional topology of self-peptide majorhistocompatibility complex binding by a human autoimmuneT cell receptor Nat Immunol 6 490ndash496

Nucleic Acids Research 2014 Vol 42 Database issue D303

Page 3: MMDB and VAST : tracking structural similarities between ......complexes or biological assemblies (1). Such assemblies range from simple homo-oligomers to intricate arrange-ments of

functionally or biologically relevant macromolecularcomplexes or biological assemblies (1) Such assembliesrange from simple homo-oligomers to intricate arrange-ments of many different components revealing details onspecific molecular interactions and on how these might con-strain sequence variation A small number of approacheshave been published in the past few years that examinestructural similarity of macromolecular complexes (2122)Here we present a simple strategy that builds on the existingdatabase of pairwise structure alignments computed byVAST and supports the first (to our knowledge) compre-hensive and regularly updated collection of macromolecu-lar complex similarities

VAST+AS AN EXTENSION TO EXISTING PROTEINSTRUCTURE COMPARISON

As information characterizing biological assemblies inmacromolecular structure data has become available itseemed that the biological assembly would be a convenientand informative unit of comparison between individualentries in the structure database If the goal is to list struc-tures most similar to any particular query one would haveto consider that the query itself may contain a macromol-ecular complex with a given stoichiometry and thatmatching complexes with matching stoichiometry mightbe more informative lsquostructure neighborsrsquo than forexample the structures that happen to contain moleculeswith the strongest local similarity to the query irrespectiveof the context

VAST+ builds on the existing VAST database togenerate such a report of structure neighbors Its goal isto find the largest set of pairs of matching macromoleculesbetween two biological assemblies and to characterize thatmatch and compute instructions for a global superimpos-ition that can be used to visualize the structural similarityFor each pair of structures in MMDB VAST+examinespre-computed structure alignments stored in the VASTdatabase that were computed for the full-length proteinmolecule components of the default biological assembliesIf such pairwise alignments are found the alignmentsbetween individual protein components of the biologicalassemblies are compared with each other for compatibil-ity and compatiblematching alignments are clusteredinto sets of alignments that together constitute a biologicalassembly match Pairwise alignments are compatible(i) if they do not share the same macromolecules ie aprotein molecule from one assembly cannot be aligned totwo molecules from the other assembly at the same timeand (ii) if they generate similar instructions (spatial trans-formation matrices) for the superpositions of coordinatesets A simple distance metric can be used to comparetransformation matrices and it lends itself to cluster align-ment sets efficiently

Each set of compatible pairwise alignments can becharacterized by (i) the number of pairwise matches iethe total number of pairs of protein molecules from thequery and subject biological assemblies that aresimultaneously aligned with each other (ii) the RMSDof the superposition obtained from considering all

alignments in the set (iii) the total length of all pairwisealignments ie the total number of amino acids that arealigned in 3D space and (iv) percentage of identicalresidues in the alignments For each pairwise comparisonof two biological assemblies only the match with thehighest number of aligned molecules and the highestnumber of aligned residues is recorded and reportedCurrently 53 of polypeptide-containing struc-

tures in MMDB have gt1 polypeptide chain The histo-gram plotted in Figure 1 breaks down the numbers byoligomer size and indicates that large fractions of theoligomeric assemblies have in general structure neighborsthat match the entire assemblies It should be noted thatthe fractions might be somewhat exaggerated as exactduplicates of a structure would be counted as biologicalassembly matches and no attempt was made to removeredundant structures or classify biological assemblymatches as informative versus uninformative

THE VAST+WEB SERVICE

Structure neighbors as computed by the VAST+algorithmwill be used in the future to provide links to lsquosimilar struc-turesrsquo on Entrezstructure document summaries Lists ofsimilar structures are then summarized via a new inter-active web service which can also be used independentlyof the Entrez query and retrieval system and provides toolsfor sub-setting results at httpwwwncbinlmnihgovStructurevastplusvastpluscgi For a query structurespecified by the user the service lists similar structuresshould they exist ranked by the extent of the matchMatches that associate each polymer chain of the querywith a corresponding polymer chain of some other struc-ture are considered complete and are indicated with fullcircles in the search results table partial matches are

1

10

100

1000

10000

100000

1 2 3 4 5 6 7 8 9 10 11 12 13

StructuresStructures with Bioassembly Match

Oligomeric state of Biological Assemblies

Figure 1 This histogram displays the number of structures in MMDB(blue) categorized by the size of the biological assembly Monomersdimers and higher oligomers up to dodecamers are plotted as separatecategories the 13th category summarizes tridecamers and all higheroligomers The y-axis is scaled logarithmically Red columns indicatethe number of structures in that category that have at least onecomplete biological assembly match according to VAST+

Nucleic Acids Research 2014 Vol 42 Database issue D299

Figure 2 The VAST+web service generates lists of structures that have 3D similarity to the query Matches are evaluated with biological assembliesas the unit of comparison (referred to as Biological Units) and may summarize simultaneous alignment of several protein molecule pairs The querystructure lsquo3O6Frsquo (24) currently yields 2712 structure neighbors Only the 115 neighbors with a complete biological assembly match have been selectedin this example (via the lsquodisplay filtersrsquo menu shown as collapsed in this figure) The 115 complete matches have been sorted by RMSD and thethird ranking match has been selected to provide more detail The tabulated matches are shown with their PDB accession descriptive textthe number of proteins aligned in the match the total number of aligned residues the sequence identity and the RMSD resulting from thesimultaneous superimposition of all aligned molecules In this example the query lsquo3O6Frsquo matches the structure lsquo1J8Hrsquo with a total of fouraligned protein molecules totaling 768 residues and resulting in a superposition with 255 A RMSD 80 of the residues in 3O6F and 1J8H thatwere spatially aligned by VAST are identical The extended panel characterizing this selected match contains a table that lists pairs of matchingaligned proteins and it provides schematic depictions of each biological assemblyrsquos composition and interactions The user can mouse-over thoseschematics to identify individual molecules and their corresponding match in the other structure (as shown in this example) The individual proteinmatch table contains action buttons that provide access to the pairwise sequence alignments as derived from the VAST superimposition and launchpoints for visualization of the structure superimposition with the protein structure viewer Cn3D (23) Each lsquolsquo3D Viewrsquorsquo button will open asuperposition of the complete biological assembly alignment with the 3D view centered on the selected protein molecule and its sequence datafeatured in the Cn3D sequence viewer window Next to the Aligned Molecules table an information box lists some stats that characterize thematched biological assembly

D300 Nucleic Acids Research 2014 Vol 42 Database issue

indicated with partially filled circles The default rankingputs matches with the most matched components at the topof the list Not all queries that have similar structures ac-cording to VAST+are guaranteed to also have completematches (although monomers usually do) The searchresults tables provided by the VAST+web service give aconcise summary of the matches and the extentquality ofthe similarity A clickable lsquo+rsquo symbol opens a panel for aselected match that provides more details andfunctionality

USING CN3D TO VISUALIZE BIOLOGICALASSEMBLY ALIGNMENTS

The 3D structures of superimposed biological assembliesmay be visualized using the 3D viewer Cn3D (23)which has been re-released as a new version 431 to

support the visualization style Currently Cn3D is ableto display the structure superposition of the matchedbiological assemblies and all the protein chainsinvolved but it can only display one sequence alignmentat a time Therefore the individual protein match tableas shown in Figure 2 provides separate Cn3D launchpoints for each matchedaligned protein pair All ofthese launch points will result in the same 3D imageand rendering but they will differ in the pair ofaligned sequences that are chosen as the contentof Cn3Drsquos sequencealignment viewer window Figure3 provides examples of Cn3D visualization sessionsPairs of matching molecules are rendered in the samecolor with unaligned segments rendered in gray Thedefault rendering settings as generated and providedby the VAST+ service can be examined and modifiedvia Cn3Drsquos StylejAnnotate menu

Figure 3 Visualization of structurally matching biological assemblies as rendered by the visualization tool Cn3D Cn3D is a helper application forthe web browser available for Windows and OS-X platforms The query structure PDB accession 3O6F represents the complex of an autoreactiveT-cell receptor (MS2-3C8 molecules rendered in green and brown) complexed with a self-peptide derived from myelin basic protein and the multiplesclerosis-associated MHC molecule HLA-DR4 (molecules rendered in magenta and blue) (24) The self-peptide has been fused with the MHCmolecule for the experiment which explains why the query is represented as a biological assembly with only four components (Figure 2) and isrendered in gray as is the default for all unaligned segments in Cn3D visualization sessions launched from VAST+ results pages The left panelshows 3O6F superimposed with the structure neighbor PDB accession 1J8H (25) which contains a complex between HLA-DR3 an Influenzahemagglutinin peptide and a human alphabeta T-cell receptor Molecules are rendered so that their colors match those of the corresponding querymolecules The structures of the two complexes match well resulting in a superimposition of 768 amino acid residues at 26 A RMSD Thisdemonstrates how well the autoreactive T-cell receptor complex mimics complexes that include foreign peptides and it is thought that this bindingmode is responsible for the autoimmune TCR escaping negative selection The right panel shows the VAST+ alignment between 3O6F and thestructure of a T-cell receptor from a patient with multiple sclerosis complexed with a myelin basic protein-derived peptide and an HLA-DR2 MHCPDB accession 1YMM (26) The conformations of the two complexes are different although their components are similar and VAST+ does notconsider the complete biological assemblies to match Instead it reports the most extensive sub-structure match which in this case involves bothsubunits of the MHC (molecules rendered in magenta and blue) The molecules corresponding to the TCR are rendered in gray color and would notbe displayed by default The unusual conformation of the complex reported in 1YMM is thought to represent an alternative binding mode that helpsautoimmune TCRs to escape negative selection

Nucleic Acids Research 2014 Vol 42 Database issue D301

SIMILAR SUBSTRUCTURES ORIGINAL VAST ANDVAST-SEARCH

MMDB is updated weekly following PDBrsquos scheduleWith each update computation of new structure neigh-bors is completed within a few days and they are availableas structure neighbors computed for biological assembliesvia the VAST+ service as well as structure neighborscomputed for individual protein chains and domains viathe original VAST service The latter is accessible on theVAST+ pages via a button labeled lsquooriginal VASTrsquo thatcan be found near the top of the VAST+results page Atthis point the VAST-search service which accepts3D structure data uploaded in PDB-format remainsunchanged and presents similar structures in theOriginal VAST format (Table 1)

LIMITATIONS AND FUTURE WORK

The current implementation of VAST+ the associateddatabase of pre-computed structure comparison resultsand the web service represent a first attempt at providinga comprehensive set of structure neighboring informationfor biological assemblies Several issues may limit thepotential applications of the idea and we intend toaddress them in future releases of the service and theassociated data Currently VAST+ neighboring is re-stricted to the default biological assembly for each struc-ture as determined by the content of the structure dataand MMDB parsing Although multiple biologicalassemblies if present in a structure entry tend to beclosely similar copies of each other there are exceptionsthat should be considered explicitly Also VAST+ cur-rently draws on results of VAST neighboring ascomputed for complete protein molecules and ignores alarger set of results obtained for individual domains Thiswas done intentionally so as to speed up the computa-tion and as the first implementation was intended tofocus on structural similarities that are both strong andglobal We anticipated that most cases where two entirebiological assemblies can be superimposed globallywould break down into individual protein pairs thatcan also be aligned and superimposed globally and notjust at the level of individual domains More import-antly VAST+ makes no attempt at this point atrefining the alignment and superpositions after detectinga match between two biological assemblies It is conceiv-able that such refinement would in many cases results in

somewhat shorter alignments and lower RMSD valuesand might be useful in emphasizing the conservedcontact interface between the components of a molecularcomplex Furthermore a strategy for detecting biologicalassembly matches that considers multiple molecules sim-ultaneously might exhibit higher sensitivity and pick upsimilarities that cannot be found via the 2-tieredapproach we have presented here Currently VAST+ignores non-polypeptide components of macromolecularcomplexes but certainly both nucleic acids and chemicalligands if present could be matched as well as theprotein components It should be mentioned that noVAST+ neighboring data are available for structuredatabase entries that lack assignment of biologicalassemblies and currently VAST+ skips biologicalassemblies whose size exceeds a threshold number ofprotein componentsmdashthe systematic evaluation of allpossible multimolecule matches becomes too time-consuming and will need to be supplemented by asuitable heuristic

FUNDING

Intramural Research Program of the National Libraryof Medicine at National Institutes of HealthDHHSComments suggestions and questions are welcome andshould be directed to infoncbinlmnihgov Fundingfor open access charge Intramural Research Program ofthe National Library of Medicine at the NationalInstitutes of HealthDHHS

Conflict of interest statement None declared

REFERENCES

1 MadejT AddessKJ FongJH GeerLY GeerRCLanczyckiCJ LiuC LuS Marchler-BauerA PanchenkoARet al (2012) MMDB 3D structures and macromolecularinteractions Nucleic Acids Res 40 D461ndashD464

2 RosePW BiC BluhmWF ChristieCH DimitropoulosDDuttaS GreenRK GoodsellDS PrlicA QuesadaM et al(2013) The RCSB protein data bank new resources for researchand education Nucleic Acids Res 41 D475ndashD482

3 GibneyG and BaxevanisAD (2011) Searching NCBI databasesusing Entrez Curr Protoc Hum Genet Chapter 6 Unit 610

4 AltschulSF MaddenTL SchafferAA ZhangJ ZhangZMillerW and LipmanDJ (1997) Gapped BLAST andPSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 25 3389ndash3402

5 Marchler-BauerA ZhengC ChitsazF DerbyshireMKGeerLY GeerRC GonzalesNR GwadzM HurwitzDILanczyckiCJ et al (2013) CDD conserved domains and

Table 1 URLs for MMDB and VAST resources

MMDB Database home page httpwwwncbinlmnihgovstructureMMDB FTP Data distribution ftpftpncbinihgovmmdbVAST Identify structurally similar individual protein molecules httpwwwncbinlmnihgovStructureVASTvastshtmlVAST+ Identify structurally similar macromolecular complexes httpwwwncbinlmnihgovStructurevastplusvastpluscgiVAST search Input the 3D coordinates of a query structure to

search for similar structureshttpwwwncbinlmnihgovStructureVASTvastsearchhtml

Cn3D Molecular graphics viewer httpwwwncbinlmnihgovStructureCN3Dcn3dshtmlCBLAST Find 3D structures that are related to a query protein via

sequence comparisonhttpwwwncbinlmnihgovStructurecblastcblastcgi

D302 Nucleic Acids Research 2014 Vol 42 Database issue

protein three-dimensional structure Nucleic Acids Res 41D348ndashD352

6 ShoemakerBA ZhangD TyagiM ThanguduRR FongJHMarchler-BauerA BryantSH MadejT and PanchenkoAR(2012) IBIS (Inferred Biomolecular Interaction Server) reportspredicts and integrates multiple types of conserved interactionsfor proteins Nucleic Acids Res 40 D834ndashD840

7 KrissinelE and HenrickK (2007) Inference of macromolecularassemblies from crystalline state J Mol Biol 372 774ndash797

8 GibratJF MadejT and BryantSH (1996) Surprisingsimilarities in structure comparison Curr Opin Struct Biol 6377ndash385

9 HaseqawaH and HolmL (2009) Advances and pitfalls ofprotein structural alignment Curr Opin Struct Biol 19341ndash348

10 PrlicA BlivenS RosePW BluhmWF BizonC GodzikAand BournePE (2010) Pre-calculated protein structurealignments at the RCSB PDB website Bioinformatics 262983ndash2985

11 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classficiation of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

12 SillitoeI CuffAL DessaillyBH DawsonNL FurnhamNLeeD LeesJG LewisTE StuderRA RentzschR et al(2013) New functional families (FunFams) in CATH to improvethe mapping of conserved functional sites to 3D structuresNucleic Acids Res 41 D490ndashD498

13 YeY and GodzikA (2003) Flexible structure alignment bychaining aligned fragment pairs allowing twists Bioinformatics19(Suppl 2) ii246ndashii255

14 HolmL and RosenstromP (2010) Dali server conservationmapping in 3D Nucleic Acids Res 38(Suppl 2) W545ndashW549

15 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

16 OrengoCA and TaylorWR (1996) SSAP sequential structurealignment program for protein structure comparisonMethods Enzymol 266 617ndash635

17 ShindyalovIN and BournePE (1998) Protein structurealignment by incremental combinatorial extension (CE) of theoptimal path Protein Eng 11 739ndash747

18 Marchler-BauerA and BryantSH (1997) Measures of threadingspecificity and accuracy Proteins (Suppl 1) 74ndash82

19 SierkML and PearsonWR (2004) Sensitivity andselectivity in protein structure comparison Protein Sci 13773ndash785

20 KimC and LeeB (2007) Accuracy of structure-basedsequence alignments of automatic methods BMC Bioinformatics8 355

21 SipplMJ and WiedersteinM (2012) Detection of spatialcorrelations in protein structures and molecular complexesStructure 20 718ndash728

22 MukherjeeS and ZhangY (2009) MM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programming Nucleic Acids Res 37 e83

23 WangY GeerLY ChappeyC KansJA and BryantSH(2000) Cn3D sequence and structure views for Entrez TrendsBiochem Sci 25 300ndash302

24 YinY LiY KerzicM MartinR and MariuzzaRA (2011)Structure of a TCR with high affinity for self-antigen revealsbasis for escape from negative selection EMBO J 301137ndash1148

25 HenneckeJ and WileyDC (2002) Structure of a complex of thehuman alphabeta T cell receptor (TCR) HA17 Influenzahemagglutinin peptide and major histocompatibility complexclass II molecule HLA-DR4 (DRA0101 and DRB10401)insight into TCR cross-restriction and alloreactivity J ExpMed 195 571ndash581

26 HahnM NicholsonMJ PyrdolJ and WucherpfennigKW(2005) Unconventional topology of self-peptide majorhistocompatibility complex binding by a human autoimmuneT cell receptor Nat Immunol 6 490ndash496

Nucleic Acids Research 2014 Vol 42 Database issue D303

Page 4: MMDB and VAST : tracking structural similarities between ......complexes or biological assemblies (1). Such assemblies range from simple homo-oligomers to intricate arrange-ments of

Figure 2 The VAST+web service generates lists of structures that have 3D similarity to the query Matches are evaluated with biological assembliesas the unit of comparison (referred to as Biological Units) and may summarize simultaneous alignment of several protein molecule pairs The querystructure lsquo3O6Frsquo (24) currently yields 2712 structure neighbors Only the 115 neighbors with a complete biological assembly match have been selectedin this example (via the lsquodisplay filtersrsquo menu shown as collapsed in this figure) The 115 complete matches have been sorted by RMSD and thethird ranking match has been selected to provide more detail The tabulated matches are shown with their PDB accession descriptive textthe number of proteins aligned in the match the total number of aligned residues the sequence identity and the RMSD resulting from thesimultaneous superimposition of all aligned molecules In this example the query lsquo3O6Frsquo matches the structure lsquo1J8Hrsquo with a total of fouraligned protein molecules totaling 768 residues and resulting in a superposition with 255 A RMSD 80 of the residues in 3O6F and 1J8H thatwere spatially aligned by VAST are identical The extended panel characterizing this selected match contains a table that lists pairs of matchingaligned proteins and it provides schematic depictions of each biological assemblyrsquos composition and interactions The user can mouse-over thoseschematics to identify individual molecules and their corresponding match in the other structure (as shown in this example) The individual proteinmatch table contains action buttons that provide access to the pairwise sequence alignments as derived from the VAST superimposition and launchpoints for visualization of the structure superimposition with the protein structure viewer Cn3D (23) Each lsquolsquo3D Viewrsquorsquo button will open asuperposition of the complete biological assembly alignment with the 3D view centered on the selected protein molecule and its sequence datafeatured in the Cn3D sequence viewer window Next to the Aligned Molecules table an information box lists some stats that characterize thematched biological assembly

D300 Nucleic Acids Research 2014 Vol 42 Database issue

indicated with partially filled circles The default rankingputs matches with the most matched components at the topof the list Not all queries that have similar structures ac-cording to VAST+are guaranteed to also have completematches (although monomers usually do) The searchresults tables provided by the VAST+web service give aconcise summary of the matches and the extentquality ofthe similarity A clickable lsquo+rsquo symbol opens a panel for aselected match that provides more details andfunctionality

USING CN3D TO VISUALIZE BIOLOGICALASSEMBLY ALIGNMENTS

The 3D structures of superimposed biological assembliesmay be visualized using the 3D viewer Cn3D (23)which has been re-released as a new version 431 to

support the visualization style Currently Cn3D is ableto display the structure superposition of the matchedbiological assemblies and all the protein chainsinvolved but it can only display one sequence alignmentat a time Therefore the individual protein match tableas shown in Figure 2 provides separate Cn3D launchpoints for each matchedaligned protein pair All ofthese launch points will result in the same 3D imageand rendering but they will differ in the pair ofaligned sequences that are chosen as the contentof Cn3Drsquos sequencealignment viewer window Figure3 provides examples of Cn3D visualization sessionsPairs of matching molecules are rendered in the samecolor with unaligned segments rendered in gray Thedefault rendering settings as generated and providedby the VAST+ service can be examined and modifiedvia Cn3Drsquos StylejAnnotate menu

Figure 3 Visualization of structurally matching biological assemblies as rendered by the visualization tool Cn3D Cn3D is a helper application forthe web browser available for Windows and OS-X platforms The query structure PDB accession 3O6F represents the complex of an autoreactiveT-cell receptor (MS2-3C8 molecules rendered in green and brown) complexed with a self-peptide derived from myelin basic protein and the multiplesclerosis-associated MHC molecule HLA-DR4 (molecules rendered in magenta and blue) (24) The self-peptide has been fused with the MHCmolecule for the experiment which explains why the query is represented as a biological assembly with only four components (Figure 2) and isrendered in gray as is the default for all unaligned segments in Cn3D visualization sessions launched from VAST+ results pages The left panelshows 3O6F superimposed with the structure neighbor PDB accession 1J8H (25) which contains a complex between HLA-DR3 an Influenzahemagglutinin peptide and a human alphabeta T-cell receptor Molecules are rendered so that their colors match those of the corresponding querymolecules The structures of the two complexes match well resulting in a superimposition of 768 amino acid residues at 26 A RMSD Thisdemonstrates how well the autoreactive T-cell receptor complex mimics complexes that include foreign peptides and it is thought that this bindingmode is responsible for the autoimmune TCR escaping negative selection The right panel shows the VAST+ alignment between 3O6F and thestructure of a T-cell receptor from a patient with multiple sclerosis complexed with a myelin basic protein-derived peptide and an HLA-DR2 MHCPDB accession 1YMM (26) The conformations of the two complexes are different although their components are similar and VAST+ does notconsider the complete biological assemblies to match Instead it reports the most extensive sub-structure match which in this case involves bothsubunits of the MHC (molecules rendered in magenta and blue) The molecules corresponding to the TCR are rendered in gray color and would notbe displayed by default The unusual conformation of the complex reported in 1YMM is thought to represent an alternative binding mode that helpsautoimmune TCRs to escape negative selection

Nucleic Acids Research 2014 Vol 42 Database issue D301

SIMILAR SUBSTRUCTURES ORIGINAL VAST ANDVAST-SEARCH

MMDB is updated weekly following PDBrsquos scheduleWith each update computation of new structure neigh-bors is completed within a few days and they are availableas structure neighbors computed for biological assembliesvia the VAST+ service as well as structure neighborscomputed for individual protein chains and domains viathe original VAST service The latter is accessible on theVAST+ pages via a button labeled lsquooriginal VASTrsquo thatcan be found near the top of the VAST+results page Atthis point the VAST-search service which accepts3D structure data uploaded in PDB-format remainsunchanged and presents similar structures in theOriginal VAST format (Table 1)

LIMITATIONS AND FUTURE WORK

The current implementation of VAST+ the associateddatabase of pre-computed structure comparison resultsand the web service represent a first attempt at providinga comprehensive set of structure neighboring informationfor biological assemblies Several issues may limit thepotential applications of the idea and we intend toaddress them in future releases of the service and theassociated data Currently VAST+ neighboring is re-stricted to the default biological assembly for each struc-ture as determined by the content of the structure dataand MMDB parsing Although multiple biologicalassemblies if present in a structure entry tend to beclosely similar copies of each other there are exceptionsthat should be considered explicitly Also VAST+ cur-rently draws on results of VAST neighboring ascomputed for complete protein molecules and ignores alarger set of results obtained for individual domains Thiswas done intentionally so as to speed up the computa-tion and as the first implementation was intended tofocus on structural similarities that are both strong andglobal We anticipated that most cases where two entirebiological assemblies can be superimposed globallywould break down into individual protein pairs thatcan also be aligned and superimposed globally and notjust at the level of individual domains More import-antly VAST+ makes no attempt at this point atrefining the alignment and superpositions after detectinga match between two biological assemblies It is conceiv-able that such refinement would in many cases results in

somewhat shorter alignments and lower RMSD valuesand might be useful in emphasizing the conservedcontact interface between the components of a molecularcomplex Furthermore a strategy for detecting biologicalassembly matches that considers multiple molecules sim-ultaneously might exhibit higher sensitivity and pick upsimilarities that cannot be found via the 2-tieredapproach we have presented here Currently VAST+ignores non-polypeptide components of macromolecularcomplexes but certainly both nucleic acids and chemicalligands if present could be matched as well as theprotein components It should be mentioned that noVAST+ neighboring data are available for structuredatabase entries that lack assignment of biologicalassemblies and currently VAST+ skips biologicalassemblies whose size exceeds a threshold number ofprotein componentsmdashthe systematic evaluation of allpossible multimolecule matches becomes too time-consuming and will need to be supplemented by asuitable heuristic

FUNDING

Intramural Research Program of the National Libraryof Medicine at National Institutes of HealthDHHSComments suggestions and questions are welcome andshould be directed to infoncbinlmnihgov Fundingfor open access charge Intramural Research Program ofthe National Library of Medicine at the NationalInstitutes of HealthDHHS

Conflict of interest statement None declared

REFERENCES

1 MadejT AddessKJ FongJH GeerLY GeerRCLanczyckiCJ LiuC LuS Marchler-BauerA PanchenkoARet al (2012) MMDB 3D structures and macromolecularinteractions Nucleic Acids Res 40 D461ndashD464

2 RosePW BiC BluhmWF ChristieCH DimitropoulosDDuttaS GreenRK GoodsellDS PrlicA QuesadaM et al(2013) The RCSB protein data bank new resources for researchand education Nucleic Acids Res 41 D475ndashD482

3 GibneyG and BaxevanisAD (2011) Searching NCBI databasesusing Entrez Curr Protoc Hum Genet Chapter 6 Unit 610

4 AltschulSF MaddenTL SchafferAA ZhangJ ZhangZMillerW and LipmanDJ (1997) Gapped BLAST andPSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 25 3389ndash3402

5 Marchler-BauerA ZhengC ChitsazF DerbyshireMKGeerLY GeerRC GonzalesNR GwadzM HurwitzDILanczyckiCJ et al (2013) CDD conserved domains and

Table 1 URLs for MMDB and VAST resources

MMDB Database home page httpwwwncbinlmnihgovstructureMMDB FTP Data distribution ftpftpncbinihgovmmdbVAST Identify structurally similar individual protein molecules httpwwwncbinlmnihgovStructureVASTvastshtmlVAST+ Identify structurally similar macromolecular complexes httpwwwncbinlmnihgovStructurevastplusvastpluscgiVAST search Input the 3D coordinates of a query structure to

search for similar structureshttpwwwncbinlmnihgovStructureVASTvastsearchhtml

Cn3D Molecular graphics viewer httpwwwncbinlmnihgovStructureCN3Dcn3dshtmlCBLAST Find 3D structures that are related to a query protein via

sequence comparisonhttpwwwncbinlmnihgovStructurecblastcblastcgi

D302 Nucleic Acids Research 2014 Vol 42 Database issue

protein three-dimensional structure Nucleic Acids Res 41D348ndashD352

6 ShoemakerBA ZhangD TyagiM ThanguduRR FongJHMarchler-BauerA BryantSH MadejT and PanchenkoAR(2012) IBIS (Inferred Biomolecular Interaction Server) reportspredicts and integrates multiple types of conserved interactionsfor proteins Nucleic Acids Res 40 D834ndashD840

7 KrissinelE and HenrickK (2007) Inference of macromolecularassemblies from crystalline state J Mol Biol 372 774ndash797

8 GibratJF MadejT and BryantSH (1996) Surprisingsimilarities in structure comparison Curr Opin Struct Biol 6377ndash385

9 HaseqawaH and HolmL (2009) Advances and pitfalls ofprotein structural alignment Curr Opin Struct Biol 19341ndash348

10 PrlicA BlivenS RosePW BluhmWF BizonC GodzikAand BournePE (2010) Pre-calculated protein structurealignments at the RCSB PDB website Bioinformatics 262983ndash2985

11 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classficiation of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

12 SillitoeI CuffAL DessaillyBH DawsonNL FurnhamNLeeD LeesJG LewisTE StuderRA RentzschR et al(2013) New functional families (FunFams) in CATH to improvethe mapping of conserved functional sites to 3D structuresNucleic Acids Res 41 D490ndashD498

13 YeY and GodzikA (2003) Flexible structure alignment bychaining aligned fragment pairs allowing twists Bioinformatics19(Suppl 2) ii246ndashii255

14 HolmL and RosenstromP (2010) Dali server conservationmapping in 3D Nucleic Acids Res 38(Suppl 2) W545ndashW549

15 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

16 OrengoCA and TaylorWR (1996) SSAP sequential structurealignment program for protein structure comparisonMethods Enzymol 266 617ndash635

17 ShindyalovIN and BournePE (1998) Protein structurealignment by incremental combinatorial extension (CE) of theoptimal path Protein Eng 11 739ndash747

18 Marchler-BauerA and BryantSH (1997) Measures of threadingspecificity and accuracy Proteins (Suppl 1) 74ndash82

19 SierkML and PearsonWR (2004) Sensitivity andselectivity in protein structure comparison Protein Sci 13773ndash785

20 KimC and LeeB (2007) Accuracy of structure-basedsequence alignments of automatic methods BMC Bioinformatics8 355

21 SipplMJ and WiedersteinM (2012) Detection of spatialcorrelations in protein structures and molecular complexesStructure 20 718ndash728

22 MukherjeeS and ZhangY (2009) MM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programming Nucleic Acids Res 37 e83

23 WangY GeerLY ChappeyC KansJA and BryantSH(2000) Cn3D sequence and structure views for Entrez TrendsBiochem Sci 25 300ndash302

24 YinY LiY KerzicM MartinR and MariuzzaRA (2011)Structure of a TCR with high affinity for self-antigen revealsbasis for escape from negative selection EMBO J 301137ndash1148

25 HenneckeJ and WileyDC (2002) Structure of a complex of thehuman alphabeta T cell receptor (TCR) HA17 Influenzahemagglutinin peptide and major histocompatibility complexclass II molecule HLA-DR4 (DRA0101 and DRB10401)insight into TCR cross-restriction and alloreactivity J ExpMed 195 571ndash581

26 HahnM NicholsonMJ PyrdolJ and WucherpfennigKW(2005) Unconventional topology of self-peptide majorhistocompatibility complex binding by a human autoimmuneT cell receptor Nat Immunol 6 490ndash496

Nucleic Acids Research 2014 Vol 42 Database issue D303

Page 5: MMDB and VAST : tracking structural similarities between ......complexes or biological assemblies (1). Such assemblies range from simple homo-oligomers to intricate arrange-ments of

indicated with partially filled circles The default rankingputs matches with the most matched components at the topof the list Not all queries that have similar structures ac-cording to VAST+are guaranteed to also have completematches (although monomers usually do) The searchresults tables provided by the VAST+web service give aconcise summary of the matches and the extentquality ofthe similarity A clickable lsquo+rsquo symbol opens a panel for aselected match that provides more details andfunctionality

USING CN3D TO VISUALIZE BIOLOGICALASSEMBLY ALIGNMENTS

The 3D structures of superimposed biological assembliesmay be visualized using the 3D viewer Cn3D (23)which has been re-released as a new version 431 to

support the visualization style Currently Cn3D is ableto display the structure superposition of the matchedbiological assemblies and all the protein chainsinvolved but it can only display one sequence alignmentat a time Therefore the individual protein match tableas shown in Figure 2 provides separate Cn3D launchpoints for each matchedaligned protein pair All ofthese launch points will result in the same 3D imageand rendering but they will differ in the pair ofaligned sequences that are chosen as the contentof Cn3Drsquos sequencealignment viewer window Figure3 provides examples of Cn3D visualization sessionsPairs of matching molecules are rendered in the samecolor with unaligned segments rendered in gray Thedefault rendering settings as generated and providedby the VAST+ service can be examined and modifiedvia Cn3Drsquos StylejAnnotate menu

Figure 3 Visualization of structurally matching biological assemblies as rendered by the visualization tool Cn3D Cn3D is a helper application forthe web browser available for Windows and OS-X platforms The query structure PDB accession 3O6F represents the complex of an autoreactiveT-cell receptor (MS2-3C8 molecules rendered in green and brown) complexed with a self-peptide derived from myelin basic protein and the multiplesclerosis-associated MHC molecule HLA-DR4 (molecules rendered in magenta and blue) (24) The self-peptide has been fused with the MHCmolecule for the experiment which explains why the query is represented as a biological assembly with only four components (Figure 2) and isrendered in gray as is the default for all unaligned segments in Cn3D visualization sessions launched from VAST+ results pages The left panelshows 3O6F superimposed with the structure neighbor PDB accession 1J8H (25) which contains a complex between HLA-DR3 an Influenzahemagglutinin peptide and a human alphabeta T-cell receptor Molecules are rendered so that their colors match those of the corresponding querymolecules The structures of the two complexes match well resulting in a superimposition of 768 amino acid residues at 26 A RMSD Thisdemonstrates how well the autoreactive T-cell receptor complex mimics complexes that include foreign peptides and it is thought that this bindingmode is responsible for the autoimmune TCR escaping negative selection The right panel shows the VAST+ alignment between 3O6F and thestructure of a T-cell receptor from a patient with multiple sclerosis complexed with a myelin basic protein-derived peptide and an HLA-DR2 MHCPDB accession 1YMM (26) The conformations of the two complexes are different although their components are similar and VAST+ does notconsider the complete biological assemblies to match Instead it reports the most extensive sub-structure match which in this case involves bothsubunits of the MHC (molecules rendered in magenta and blue) The molecules corresponding to the TCR are rendered in gray color and would notbe displayed by default The unusual conformation of the complex reported in 1YMM is thought to represent an alternative binding mode that helpsautoimmune TCRs to escape negative selection

Nucleic Acids Research 2014 Vol 42 Database issue D301

SIMILAR SUBSTRUCTURES ORIGINAL VAST ANDVAST-SEARCH

MMDB is updated weekly following PDBrsquos scheduleWith each update computation of new structure neigh-bors is completed within a few days and they are availableas structure neighbors computed for biological assembliesvia the VAST+ service as well as structure neighborscomputed for individual protein chains and domains viathe original VAST service The latter is accessible on theVAST+ pages via a button labeled lsquooriginal VASTrsquo thatcan be found near the top of the VAST+results page Atthis point the VAST-search service which accepts3D structure data uploaded in PDB-format remainsunchanged and presents similar structures in theOriginal VAST format (Table 1)

LIMITATIONS AND FUTURE WORK

The current implementation of VAST+ the associateddatabase of pre-computed structure comparison resultsand the web service represent a first attempt at providinga comprehensive set of structure neighboring informationfor biological assemblies Several issues may limit thepotential applications of the idea and we intend toaddress them in future releases of the service and theassociated data Currently VAST+ neighboring is re-stricted to the default biological assembly for each struc-ture as determined by the content of the structure dataand MMDB parsing Although multiple biologicalassemblies if present in a structure entry tend to beclosely similar copies of each other there are exceptionsthat should be considered explicitly Also VAST+ cur-rently draws on results of VAST neighboring ascomputed for complete protein molecules and ignores alarger set of results obtained for individual domains Thiswas done intentionally so as to speed up the computa-tion and as the first implementation was intended tofocus on structural similarities that are both strong andglobal We anticipated that most cases where two entirebiological assemblies can be superimposed globallywould break down into individual protein pairs thatcan also be aligned and superimposed globally and notjust at the level of individual domains More import-antly VAST+ makes no attempt at this point atrefining the alignment and superpositions after detectinga match between two biological assemblies It is conceiv-able that such refinement would in many cases results in

somewhat shorter alignments and lower RMSD valuesand might be useful in emphasizing the conservedcontact interface between the components of a molecularcomplex Furthermore a strategy for detecting biologicalassembly matches that considers multiple molecules sim-ultaneously might exhibit higher sensitivity and pick upsimilarities that cannot be found via the 2-tieredapproach we have presented here Currently VAST+ignores non-polypeptide components of macromolecularcomplexes but certainly both nucleic acids and chemicalligands if present could be matched as well as theprotein components It should be mentioned that noVAST+ neighboring data are available for structuredatabase entries that lack assignment of biologicalassemblies and currently VAST+ skips biologicalassemblies whose size exceeds a threshold number ofprotein componentsmdashthe systematic evaluation of allpossible multimolecule matches becomes too time-consuming and will need to be supplemented by asuitable heuristic

FUNDING

Intramural Research Program of the National Libraryof Medicine at National Institutes of HealthDHHSComments suggestions and questions are welcome andshould be directed to infoncbinlmnihgov Fundingfor open access charge Intramural Research Program ofthe National Library of Medicine at the NationalInstitutes of HealthDHHS

Conflict of interest statement None declared

REFERENCES

1 MadejT AddessKJ FongJH GeerLY GeerRCLanczyckiCJ LiuC LuS Marchler-BauerA PanchenkoARet al (2012) MMDB 3D structures and macromolecularinteractions Nucleic Acids Res 40 D461ndashD464

2 RosePW BiC BluhmWF ChristieCH DimitropoulosDDuttaS GreenRK GoodsellDS PrlicA QuesadaM et al(2013) The RCSB protein data bank new resources for researchand education Nucleic Acids Res 41 D475ndashD482

3 GibneyG and BaxevanisAD (2011) Searching NCBI databasesusing Entrez Curr Protoc Hum Genet Chapter 6 Unit 610

4 AltschulSF MaddenTL SchafferAA ZhangJ ZhangZMillerW and LipmanDJ (1997) Gapped BLAST andPSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 25 3389ndash3402

5 Marchler-BauerA ZhengC ChitsazF DerbyshireMKGeerLY GeerRC GonzalesNR GwadzM HurwitzDILanczyckiCJ et al (2013) CDD conserved domains and

Table 1 URLs for MMDB and VAST resources

MMDB Database home page httpwwwncbinlmnihgovstructureMMDB FTP Data distribution ftpftpncbinihgovmmdbVAST Identify structurally similar individual protein molecules httpwwwncbinlmnihgovStructureVASTvastshtmlVAST+ Identify structurally similar macromolecular complexes httpwwwncbinlmnihgovStructurevastplusvastpluscgiVAST search Input the 3D coordinates of a query structure to

search for similar structureshttpwwwncbinlmnihgovStructureVASTvastsearchhtml

Cn3D Molecular graphics viewer httpwwwncbinlmnihgovStructureCN3Dcn3dshtmlCBLAST Find 3D structures that are related to a query protein via

sequence comparisonhttpwwwncbinlmnihgovStructurecblastcblastcgi

D302 Nucleic Acids Research 2014 Vol 42 Database issue

protein three-dimensional structure Nucleic Acids Res 41D348ndashD352

6 ShoemakerBA ZhangD TyagiM ThanguduRR FongJHMarchler-BauerA BryantSH MadejT and PanchenkoAR(2012) IBIS (Inferred Biomolecular Interaction Server) reportspredicts and integrates multiple types of conserved interactionsfor proteins Nucleic Acids Res 40 D834ndashD840

7 KrissinelE and HenrickK (2007) Inference of macromolecularassemblies from crystalline state J Mol Biol 372 774ndash797

8 GibratJF MadejT and BryantSH (1996) Surprisingsimilarities in structure comparison Curr Opin Struct Biol 6377ndash385

9 HaseqawaH and HolmL (2009) Advances and pitfalls ofprotein structural alignment Curr Opin Struct Biol 19341ndash348

10 PrlicA BlivenS RosePW BluhmWF BizonC GodzikAand BournePE (2010) Pre-calculated protein structurealignments at the RCSB PDB website Bioinformatics 262983ndash2985

11 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classficiation of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

12 SillitoeI CuffAL DessaillyBH DawsonNL FurnhamNLeeD LeesJG LewisTE StuderRA RentzschR et al(2013) New functional families (FunFams) in CATH to improvethe mapping of conserved functional sites to 3D structuresNucleic Acids Res 41 D490ndashD498

13 YeY and GodzikA (2003) Flexible structure alignment bychaining aligned fragment pairs allowing twists Bioinformatics19(Suppl 2) ii246ndashii255

14 HolmL and RosenstromP (2010) Dali server conservationmapping in 3D Nucleic Acids Res 38(Suppl 2) W545ndashW549

15 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

16 OrengoCA and TaylorWR (1996) SSAP sequential structurealignment program for protein structure comparisonMethods Enzymol 266 617ndash635

17 ShindyalovIN and BournePE (1998) Protein structurealignment by incremental combinatorial extension (CE) of theoptimal path Protein Eng 11 739ndash747

18 Marchler-BauerA and BryantSH (1997) Measures of threadingspecificity and accuracy Proteins (Suppl 1) 74ndash82

19 SierkML and PearsonWR (2004) Sensitivity andselectivity in protein structure comparison Protein Sci 13773ndash785

20 KimC and LeeB (2007) Accuracy of structure-basedsequence alignments of automatic methods BMC Bioinformatics8 355

21 SipplMJ and WiedersteinM (2012) Detection of spatialcorrelations in protein structures and molecular complexesStructure 20 718ndash728

22 MukherjeeS and ZhangY (2009) MM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programming Nucleic Acids Res 37 e83

23 WangY GeerLY ChappeyC KansJA and BryantSH(2000) Cn3D sequence and structure views for Entrez TrendsBiochem Sci 25 300ndash302

24 YinY LiY KerzicM MartinR and MariuzzaRA (2011)Structure of a TCR with high affinity for self-antigen revealsbasis for escape from negative selection EMBO J 301137ndash1148

25 HenneckeJ and WileyDC (2002) Structure of a complex of thehuman alphabeta T cell receptor (TCR) HA17 Influenzahemagglutinin peptide and major histocompatibility complexclass II molecule HLA-DR4 (DRA0101 and DRB10401)insight into TCR cross-restriction and alloreactivity J ExpMed 195 571ndash581

26 HahnM NicholsonMJ PyrdolJ and WucherpfennigKW(2005) Unconventional topology of self-peptide majorhistocompatibility complex binding by a human autoimmuneT cell receptor Nat Immunol 6 490ndash496

Nucleic Acids Research 2014 Vol 42 Database issue D303

Page 6: MMDB and VAST : tracking structural similarities between ......complexes or biological assemblies (1). Such assemblies range from simple homo-oligomers to intricate arrange-ments of

SIMILAR SUBSTRUCTURES ORIGINAL VAST ANDVAST-SEARCH

MMDB is updated weekly following PDBrsquos scheduleWith each update computation of new structure neigh-bors is completed within a few days and they are availableas structure neighbors computed for biological assembliesvia the VAST+ service as well as structure neighborscomputed for individual protein chains and domains viathe original VAST service The latter is accessible on theVAST+ pages via a button labeled lsquooriginal VASTrsquo thatcan be found near the top of the VAST+results page Atthis point the VAST-search service which accepts3D structure data uploaded in PDB-format remainsunchanged and presents similar structures in theOriginal VAST format (Table 1)

LIMITATIONS AND FUTURE WORK

The current implementation of VAST+ the associateddatabase of pre-computed structure comparison resultsand the web service represent a first attempt at providinga comprehensive set of structure neighboring informationfor biological assemblies Several issues may limit thepotential applications of the idea and we intend toaddress them in future releases of the service and theassociated data Currently VAST+ neighboring is re-stricted to the default biological assembly for each struc-ture as determined by the content of the structure dataand MMDB parsing Although multiple biologicalassemblies if present in a structure entry tend to beclosely similar copies of each other there are exceptionsthat should be considered explicitly Also VAST+ cur-rently draws on results of VAST neighboring ascomputed for complete protein molecules and ignores alarger set of results obtained for individual domains Thiswas done intentionally so as to speed up the computa-tion and as the first implementation was intended tofocus on structural similarities that are both strong andglobal We anticipated that most cases where two entirebiological assemblies can be superimposed globallywould break down into individual protein pairs thatcan also be aligned and superimposed globally and notjust at the level of individual domains More import-antly VAST+ makes no attempt at this point atrefining the alignment and superpositions after detectinga match between two biological assemblies It is conceiv-able that such refinement would in many cases results in

somewhat shorter alignments and lower RMSD valuesand might be useful in emphasizing the conservedcontact interface between the components of a molecularcomplex Furthermore a strategy for detecting biologicalassembly matches that considers multiple molecules sim-ultaneously might exhibit higher sensitivity and pick upsimilarities that cannot be found via the 2-tieredapproach we have presented here Currently VAST+ignores non-polypeptide components of macromolecularcomplexes but certainly both nucleic acids and chemicalligands if present could be matched as well as theprotein components It should be mentioned that noVAST+ neighboring data are available for structuredatabase entries that lack assignment of biologicalassemblies and currently VAST+ skips biologicalassemblies whose size exceeds a threshold number ofprotein componentsmdashthe systematic evaluation of allpossible multimolecule matches becomes too time-consuming and will need to be supplemented by asuitable heuristic

FUNDING

Intramural Research Program of the National Libraryof Medicine at National Institutes of HealthDHHSComments suggestions and questions are welcome andshould be directed to infoncbinlmnihgov Fundingfor open access charge Intramural Research Program ofthe National Library of Medicine at the NationalInstitutes of HealthDHHS

Conflict of interest statement None declared

REFERENCES

1 MadejT AddessKJ FongJH GeerLY GeerRCLanczyckiCJ LiuC LuS Marchler-BauerA PanchenkoARet al (2012) MMDB 3D structures and macromolecularinteractions Nucleic Acids Res 40 D461ndashD464

2 RosePW BiC BluhmWF ChristieCH DimitropoulosDDuttaS GreenRK GoodsellDS PrlicA QuesadaM et al(2013) The RCSB protein data bank new resources for researchand education Nucleic Acids Res 41 D475ndashD482

3 GibneyG and BaxevanisAD (2011) Searching NCBI databasesusing Entrez Curr Protoc Hum Genet Chapter 6 Unit 610

4 AltschulSF MaddenTL SchafferAA ZhangJ ZhangZMillerW and LipmanDJ (1997) Gapped BLAST andPSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 25 3389ndash3402

5 Marchler-BauerA ZhengC ChitsazF DerbyshireMKGeerLY GeerRC GonzalesNR GwadzM HurwitzDILanczyckiCJ et al (2013) CDD conserved domains and

Table 1 URLs for MMDB and VAST resources

MMDB Database home page httpwwwncbinlmnihgovstructureMMDB FTP Data distribution ftpftpncbinihgovmmdbVAST Identify structurally similar individual protein molecules httpwwwncbinlmnihgovStructureVASTvastshtmlVAST+ Identify structurally similar macromolecular complexes httpwwwncbinlmnihgovStructurevastplusvastpluscgiVAST search Input the 3D coordinates of a query structure to

search for similar structureshttpwwwncbinlmnihgovStructureVASTvastsearchhtml

Cn3D Molecular graphics viewer httpwwwncbinlmnihgovStructureCN3Dcn3dshtmlCBLAST Find 3D structures that are related to a query protein via

sequence comparisonhttpwwwncbinlmnihgovStructurecblastcblastcgi

D302 Nucleic Acids Research 2014 Vol 42 Database issue

protein three-dimensional structure Nucleic Acids Res 41D348ndashD352

6 ShoemakerBA ZhangD TyagiM ThanguduRR FongJHMarchler-BauerA BryantSH MadejT and PanchenkoAR(2012) IBIS (Inferred Biomolecular Interaction Server) reportspredicts and integrates multiple types of conserved interactionsfor proteins Nucleic Acids Res 40 D834ndashD840

7 KrissinelE and HenrickK (2007) Inference of macromolecularassemblies from crystalline state J Mol Biol 372 774ndash797

8 GibratJF MadejT and BryantSH (1996) Surprisingsimilarities in structure comparison Curr Opin Struct Biol 6377ndash385

9 HaseqawaH and HolmL (2009) Advances and pitfalls ofprotein structural alignment Curr Opin Struct Biol 19341ndash348

10 PrlicA BlivenS RosePW BluhmWF BizonC GodzikAand BournePE (2010) Pre-calculated protein structurealignments at the RCSB PDB website Bioinformatics 262983ndash2985

11 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classficiation of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

12 SillitoeI CuffAL DessaillyBH DawsonNL FurnhamNLeeD LeesJG LewisTE StuderRA RentzschR et al(2013) New functional families (FunFams) in CATH to improvethe mapping of conserved functional sites to 3D structuresNucleic Acids Res 41 D490ndashD498

13 YeY and GodzikA (2003) Flexible structure alignment bychaining aligned fragment pairs allowing twists Bioinformatics19(Suppl 2) ii246ndashii255

14 HolmL and RosenstromP (2010) Dali server conservationmapping in 3D Nucleic Acids Res 38(Suppl 2) W545ndashW549

15 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

16 OrengoCA and TaylorWR (1996) SSAP sequential structurealignment program for protein structure comparisonMethods Enzymol 266 617ndash635

17 ShindyalovIN and BournePE (1998) Protein structurealignment by incremental combinatorial extension (CE) of theoptimal path Protein Eng 11 739ndash747

18 Marchler-BauerA and BryantSH (1997) Measures of threadingspecificity and accuracy Proteins (Suppl 1) 74ndash82

19 SierkML and PearsonWR (2004) Sensitivity andselectivity in protein structure comparison Protein Sci 13773ndash785

20 KimC and LeeB (2007) Accuracy of structure-basedsequence alignments of automatic methods BMC Bioinformatics8 355

21 SipplMJ and WiedersteinM (2012) Detection of spatialcorrelations in protein structures and molecular complexesStructure 20 718ndash728

22 MukherjeeS and ZhangY (2009) MM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programming Nucleic Acids Res 37 e83

23 WangY GeerLY ChappeyC KansJA and BryantSH(2000) Cn3D sequence and structure views for Entrez TrendsBiochem Sci 25 300ndash302

24 YinY LiY KerzicM MartinR and MariuzzaRA (2011)Structure of a TCR with high affinity for self-antigen revealsbasis for escape from negative selection EMBO J 301137ndash1148

25 HenneckeJ and WileyDC (2002) Structure of a complex of thehuman alphabeta T cell receptor (TCR) HA17 Influenzahemagglutinin peptide and major histocompatibility complexclass II molecule HLA-DR4 (DRA0101 and DRB10401)insight into TCR cross-restriction and alloreactivity J ExpMed 195 571ndash581

26 HahnM NicholsonMJ PyrdolJ and WucherpfennigKW(2005) Unconventional topology of self-peptide majorhistocompatibility complex binding by a human autoimmuneT cell receptor Nat Immunol 6 490ndash496

Nucleic Acids Research 2014 Vol 42 Database issue D303

Page 7: MMDB and VAST : tracking structural similarities between ......complexes or biological assemblies (1). Such assemblies range from simple homo-oligomers to intricate arrange-ments of

protein three-dimensional structure Nucleic Acids Res 41D348ndashD352

6 ShoemakerBA ZhangD TyagiM ThanguduRR FongJHMarchler-BauerA BryantSH MadejT and PanchenkoAR(2012) IBIS (Inferred Biomolecular Interaction Server) reportspredicts and integrates multiple types of conserved interactionsfor proteins Nucleic Acids Res 40 D834ndashD840

7 KrissinelE and HenrickK (2007) Inference of macromolecularassemblies from crystalline state J Mol Biol 372 774ndash797

8 GibratJF MadejT and BryantSH (1996) Surprisingsimilarities in structure comparison Curr Opin Struct Biol 6377ndash385

9 HaseqawaH and HolmL (2009) Advances and pitfalls ofprotein structural alignment Curr Opin Struct Biol 19341ndash348

10 PrlicA BlivenS RosePW BluhmWF BizonC GodzikAand BournePE (2010) Pre-calculated protein structurealignments at the RCSB PDB website Bioinformatics 262983ndash2985

11 MurzinAG BrennerSE HubbardT and ChothiaC (1995)SCOP a structural classficiation of proteins database for theinvestigation of sequences and structures J Mol Biol 247536ndash540

12 SillitoeI CuffAL DessaillyBH DawsonNL FurnhamNLeeD LeesJG LewisTE StuderRA RentzschR et al(2013) New functional families (FunFams) in CATH to improvethe mapping of conserved functional sites to 3D structuresNucleic Acids Res 41 D490ndashD498

13 YeY and GodzikA (2003) Flexible structure alignment bychaining aligned fragment pairs allowing twists Bioinformatics19(Suppl 2) ii246ndashii255

14 HolmL and RosenstromP (2010) Dali server conservationmapping in 3D Nucleic Acids Res 38(Suppl 2) W545ndashW549

15 GoughJ KarplusK HugheyR and ChothiaC (2001)Assignment of homology to genome sequences using a library ofhidden Markov models that represent all proteins of knownstructure J Mol Biol 313 903ndash919

16 OrengoCA and TaylorWR (1996) SSAP sequential structurealignment program for protein structure comparisonMethods Enzymol 266 617ndash635

17 ShindyalovIN and BournePE (1998) Protein structurealignment by incremental combinatorial extension (CE) of theoptimal path Protein Eng 11 739ndash747

18 Marchler-BauerA and BryantSH (1997) Measures of threadingspecificity and accuracy Proteins (Suppl 1) 74ndash82

19 SierkML and PearsonWR (2004) Sensitivity andselectivity in protein structure comparison Protein Sci 13773ndash785

20 KimC and LeeB (2007) Accuracy of structure-basedsequence alignments of automatic methods BMC Bioinformatics8 355

21 SipplMJ and WiedersteinM (2012) Detection of spatialcorrelations in protein structures and molecular complexesStructure 20 718ndash728

22 MukherjeeS and ZhangY (2009) MM-align a quick algorithmfor aligning multiple-chain protein complex structures usingiterative dynamic programming Nucleic Acids Res 37 e83

23 WangY GeerLY ChappeyC KansJA and BryantSH(2000) Cn3D sequence and structure views for Entrez TrendsBiochem Sci 25 300ndash302

24 YinY LiY KerzicM MartinR and MariuzzaRA (2011)Structure of a TCR with high affinity for self-antigen revealsbasis for escape from negative selection EMBO J 301137ndash1148

25 HenneckeJ and WileyDC (2002) Structure of a complex of thehuman alphabeta T cell receptor (TCR) HA17 Influenzahemagglutinin peptide and major histocompatibility complexclass II molecule HLA-DR4 (DRA0101 and DRB10401)insight into TCR cross-restriction and alloreactivity J ExpMed 195 571ndash581

26 HahnM NicholsonMJ PyrdolJ and WucherpfennigKW(2005) Unconventional topology of self-peptide majorhistocompatibility complex binding by a human autoimmuneT cell receptor Nat Immunol 6 490ndash496

Nucleic Acids Research 2014 Vol 42 Database issue D303