Journal of Structural Biology 125, 112–122 (1999)

Embed Size (px)

Citation preview

  • 8/14/2019 Journal of Structural Biology 125, 112122 (1999)

    1/11

    A Framework for Querying a Database for Structural Information on 3DImages of Macromolecules: A Web-Based Query-by-Content Prototype

    on the BioImage Macromolecular ServerP. A. de Alar con ,* A. Gupta, and J . M. Car azo*

    *Centro Nacional de Biotecnolog a-CS IC, Campu s Universidad A utonoma, Cant oblanco, 28049 Madrid , Spain ; and Center for Applied Compu tational S cience and E ngineering, S an Diego S uper Comput er Center, University of California at San Diego, La J olla, California 92093-0505

    Received December 4, 1998, and in revised form February 8, 1999

    Nowadays we are experiencing a remarkable growthin the number of da tabases tha t have become acces-s ib le over the Web. Howeve r, in a cer ta in nu mber of

    c a s e s , fo r e x a m p l e , i n t h e c a s e o f B i o Im a g e , t h i sin fo rma t ion i s no t o f a t ex tua l na tu re , t hus pos ingnew chal lenges in the des ign of tools to handle theseda ta . In t h i s work , we concen t r a t e on the deve lop -men t of new mech anism s a ime d a t queryingthe sed a t a b a s e s o f c o m p l e x d a t a s e t s b y t h e i r i n t r i n s i ccontent, rather than by their textual annotations only.We concentrate our effor ts on a subset of BioImagecontaining 3D images (volumes) of biological macromol-ecules , implemen ting a rst prototype of a query-by-conten tsystem. In the con text of databases of complexdata types the term query-by-content makes referenceto those da ta model ing t echn iques i n wh ich use r-den ed functio ns aim at unde rstanding(to s ome ex-tent) the informational content of the data sets. In thesesystems the matching cri teria introduced by the userare re la ted to in t rins ic fea tures concerning the 3Dimages them selves, hence, complementing tradit ionalqueries by textual key words only. Efficient com puta-t ional algorithms are required in order to extractstructural information of the 3D images prior to stor-ing them in the database. Also, easy-to-use interfacesshould be implemented in order to obta in feedbackfrom the expert . Our query-by-content prototype isused to con struct a concrete query, making u se of basic

    structural features, which are then evaluated over aset of three-dimen sional images ofbiological macromol-ecules. This experimental implementation can be ac-cesse d via the Web at the BioImage se rver in Madrid, athttp://www.bioimage.org/qbc/index.html. 1999 AcademicPress

    Key Words: que ry -by -con ten t ; im age da t abases ;s t ruc tu ra l b io logy ;e l ec t ron mic roscopy ;B io Image .

    1. INTRODUCTION

    Access to computational systems and biologicaldata bases curr ently available on th e Intern et, and in

    par ticular over th e WWW, is becoming an impotool for biologists a s well as biochem ists . Traditexamples ar e the well-known da ta bases of sequeof nucleic acids (GenBank and the EMBL DLibr a r y) a n d t h os e for p rot ein s equ en(SWISSPROT and PIR), although there are mmore available (for a review, see theNucleic Acids Research Special Issue on Data bases, Ja nu ar y 19

    The t erm query-by-cont entha s seldom been in the context of biological databases, makicont ra st with its gener alized use in other elds,as generic databases of complex types. Howesome of the functionality implied by this termbeen in common usage in biology. As a simexample let us consider t he quite norm al situa tiobtaining a new gene sequence and then accesGenBank to query for all the other sequences ent in the dat abase th at a re somehow similar tquery sequence. Certainly, the query will noma de over textual at tr ibutes of th e sequence destion, but most likely algorithm s such as FASTBLASTA will be run (for a review see Sm ith, an d, as a result , a ra nk ing of th ose similar sequwill be obtained. This usage represents, therean example of a form of query-by-content broused in biology.

    In t he context of structura l data bases, one omost used is PDB (Protein Data Bank; see the by Sussma net al. (1998)), which stores thousandsatomic-resolution structures of proteins and nuacids and whose growth rate is constantly riData sets in PDB contain a textual descripfollowed by a list of coordina tes of a toms (mcomplete descriptions ar e a lso possible). In conto sequence dat aba ses, lling out a form with teelds, such a s key words, organ ism, resolution, normal form of querying in PDB. Thus, it isstra nge tha t different au thors ma ke such an effdevelop more sophisticated accessing systems bon the structural information itself stored in P

    J ourna l of Str uctur al Biology125, 112122 (1999)Article ID jsbi.1999.4102, available online at http://www.idealibrary.com on

    1121047-8477/99 $30.00Copyright 1999 by Academ ic PressAll rights of reproduction in any form reserved.

  • 8/14/2019 Journal of Structural Biology 125, 112122 (1999)

    2/11

    Such applications can be considered query-by-cont ent systems since they at tempt to search for t hestructural information of a PDB entry (i .e., thecoordinat es of th e at oms tha t form a given protein ornucleic acid). Considering that this work will intro-duce a query-by-content approach on another struc-tur al dat abase (BioImage a s restr icted t o 3D imagesof biological ma cromolecules), a review of th e ma in

    methods aimed at querying PDB by accessing theinform at iona l content of its th ree-dimensional str uc-tural information will be presented in the followingsection.

    Still within the context of stru ctura l da tabases,the next step in biological complexity is the 3Dimages of biological macromolecules contained inBioIma ge. However, a 3D ima ge is quite different inna tu re from an ordered str ing of at omic coordinat es.As a consequence, developments in the eld of query-by-cont ent a pplied to generic ima ge data basesbecome especially relevant for its use in BioImage.

    Based on this considerat ion, a brief overview of query-by-content systems working on image data-bases will be provided in another section of thiswork.

    The outline of th is pa per is a s follows. After theIntroduction two sections focus on presenting anoverview of the approaches car ried out so far on thetopic of query-by-content rst in PDB and then ingenera l image dat aba ses. After t his, an an alysis willbe per form ed on some of the specicities a ssociatedwith th e an alysis of biologica l systems. Section 5 willcover a number of important issues associated with

    the system layout as well as th e implement ation of this rst prototype of query-by-content on 3D imagesof BioIma ge. More specic considerat ions relat ed t oth e inter faces design will be discussed u nder Section6. A concrete query-by-content prototype will th en beconstructed and evaluated over a version of BioIm-age, and eventu ally, the results will be present ed inSection 7. Finally, general conclusions as well asper spectives will be th e subject of Section 8.

    2. OVERVIEW OF QUERY-BY-CONTEN TAPPROACHES ON PDB

    Similarity sear ching in data bases of 3D str uctu resis consider ed to be a eld of grea t importa nce since itcould help in the discovery of novel biologicallyactive molecules and investigations of the relation-ship between proteinsstructure and their function.In essen ce, some of th e most relevan t meth ods to bereviewed in this section develop approaches to ana-lyze the str uctu re of molecules by using inter at omicdistance matrices. One of the most challenging is-sues is t o choose an effective mea sur e to quant ify thedegree of stru ctur al r esemblan ce between two mol-ecules (i.e., to dene precisely the concept of biologi-

    cal similarity). Most times this denition wilta ilored t o th e specic aim of a pa rt icular systeth is way, Holm an d Sander (1996) developed themeth od, which is a general a pproach for alignpair of proteins r epresented by 2D ma trices. Ttwo-dimensiona l distan ce ma trices ar e a r eprestion of a 3D structure. In these matrices theij thelement denotes the distan ce between theith andjth

    at oms. They ar e u seful for the compar ison prosince similar 3D structures have similar interdue distances. So, the problem of matchingprotein structures turns into a matrix matcproblem. The Dali server can be accessed thrthe WWW or via e-ma il. The user must providcoordinat es of a str uctu re as query input . As a rea list of similar structures is returned.

    In t he sa me cont ext, Shindyalov and Bourn e (1proposed an a lgorith m (CE, combina torial extenof th e optima l path ) to nd an optima l 3D alignof two polypeptide chains. In order to do so, cha ra

    istics of their local geometry (dened by vecbetween C alpha positions) are used. One-onstructure alignment using such an algorithmava ilable via th e Web (http://cl.sdsc.edu/ce.hUsers must submit complete or par tial polypechains in PDB format. Then, statistics for the ament are returned along with the sequence ament result ing from the s t ructure a l ignmSear ches aga inst th e complete PDB dat aba se uscoordinat e set n ot foun d in th e PDB ar e also pos(afterward, results are mailed to the submitterthis sense, we can consider the CE algorithm a

    that helps the user in accessing the real contethe PDB data base.Another int eresting appr oach to query in PD

    structural similarity to be mentioned here is proposed byArtymiuket al. (1992). Their work on 3searching in databases of small molecules aimsupporting drug and pesticide discovery. Thethors describe the use of two algorithms (ama pping an d clique detection) for similarity seing in data bases of sma ll 3D molecules. Additissues r elated to t he efficiency of the implememethod as well as more effective ways of repre

    ing the data are discussed. Another work by tauthors is the algorithm PROTEP (from protopographic explora tion program ), which is aat providing an oth er approach for structu ra l simity searches in PDB.

    3. OVERVIEW OF QUERY-BY-CONTENT SYSTEMSIN MULTIDIMENSIONAL IMAGE DATABASES

    Resear ch in visual inform at ion systems our iin the early par t of th e decade, with th e gener althat in many applications visual data like imand videos convey as much information as alph

    113ADVANCED QUERY TOOLS FOR 3D IMAGES OF MACROMOLECULES

  • 8/14/2019 Journal of Structural Biology 125, 112122 (1999)

    3/11

    meric data an d hence should be treated a s rst classsear chable objects by dat aba se systems. It wa s recog-nized tha t th ere ar e two kinds of inform at ion associ-at ed with a visual object (ima ge or video): inform a-t ion about the object , called its metadata, andinforma tion conta ined within the object, called vi-sual features. Metadata (such as the name of aprotein) is a lphanu meric an d genera lly expressibleas a schema of a relationa l or object-oriented data -base. Visual featu res, in contr ast , are m at hema ticalproperties of t he image derived by computat ionalprocesses, typically image processing, computer vi-sion, or compu ta tional geomet ric routines, executedon t he visua l object. For exam ple, the set of Four iercoefficient s of th e boundar y of an object can serve asa compu ta ble descriptor of its shape (Ming-Fa ng an dHsin-Teng, 1998). On the topic of retr ieval, a data -base system that allows a user to search for objectsin t erms of th e a bove-ment ioned comput ed pr oper-ties is said to support cont ent -based ret rieval forvisual inform at ion.

    Recent image information systems, both research(see IEE E Com puter, 1995, for a sampling) andcommercial (from companies like Virage, IBM, andExcalibur), are leaning more toward a query-by-example pa ra digm. There ar e two differen t styles forproviding examples. In th e rst st yle, th e example ispictorial: the user species a query by providingsome example image and asking th e system to ndother images in the data base t ha t look similar toth e example image. Systems of th is kind include the

    Virage Image Engine (Bachet al., 1996) and theNETRA system (Ma, 1997). In the second case theuser provides value exam ples for one or more visua lfeatures, something like an image with about 30%green an d 40% blue with a gra ss-like textur e in thegreen par t. The values ar e not provided in E nglish,but are provided by using visual tools to choosecolors a nd t extur e. An example of th is kind of systemis the QBIC engine from IBM (Niblacket al., 1993).

    When a conten t-based retr ieval system is appliedto an y specic domain it needs t o solve t wo impor-tant problems: which computable features are suffi-cient to describe all images in t he domain a nd wha tmathematical function should be used to nd ameasure of similarity between two objects. In thecase of genera l images, both th ese issues ha ve beenstudied extensively. The issue of query-by-contentcan be then regarded as a clustering application inth e cont ext of querying in complex dat aba ses. Somepopular featur es used in images ar e histograms of color distributions of an image, texture featurescompu ted from the gray-level concur ren ce mat rix of an image, and m omen ts of str ong edges in th e imagefor describing sh apes. In most cases, the similar ity

    between ima ges is computed by nding a mea sudifference (or distance) between images. A poway of computing such a distance is to computEuclidean distance between corr esponding feafrom two images and then computing a weigaverage of th ese individua l dista nces.

    Indeed, much of th e development s presen ted sha ve been a imed primar ily at 2D images, with consideration of multidimensional images. A ging eld of interest is that of video databasesmu ch effort is devoted to th is area . In cont ra st,few exam ples exist today on query-by-content acat ions designed for 3D images. St ill, one of th einteresting experimental developments is inbiomedical ar ea. Anew cont ent -based 3D neurorlogic image retrieval system recently appearedet al., 1998). This system is int erest ing in tha t itgood example of a t ru e content -based ima ge retrsystem within a specic 3D-image domain. It

    with a mu ltimedia dat abase conta ining a num bmultimodal (MR/CT) images as well as collainform at ion r elated t o each ima ge (pat ients agesymptom, etc.). Such a system could help medoctors to conrm diagn oses, as well as for explpossible treatments by comparing the image th ose stored in th e medical knowledge data banfar as we know, no other developments have report ed in biology or biomedicine so far.

    4. UNDERSTANDING DATA SET CONTENT: A view frombio logy on some re l evan t s t ruc tu ra l cha rac te r i s ti c s

    o f 3D images o f b io log ica l macromolecu les

    The previous section outlined the basic stepsquery-by-content on a 3D image database invClearly, the precise denition of these steps anway to combine them a re, in genera l, ra th er diffTher efore , an d as a way to redu ce th is complexiar e going to make extensive usa ge in th is worka priori knowledge related t o the concrete app licadomain t o which this st udy applies. We ar e consing querying by cont ent only on 3D ima ges coning stru ctur al inform ation of biological ma cro

    ecules (a subset of BioImage; Carazo and Ste1999). Indeed, th e explicit considerat ion of thecic application domain in wh ich queries a re gobe performed will be used as a strong restricprinciple in th e denition of visual feat ur es andcombination.

    The key simplifying considerat ion is tha t an inspecicat ion can be made on some of th ose str uccharacteristics that will be the most used whquer y-by-content is performed in th is specic bical context. Certainly, the choice of a given squery specicat ions will intr oduce limitations

    114 DE ALARCO N, GUPTA, AND CARAZO

  • 8/14/2019 Journal of Structural Biology 125, 112122 (1999)

    4/11

    the querying schema, although this is always theprice to be paid when ta iloring a genera l problem to agiven application domain. Still, the set of specica-tions tha t ar e considered in t his rst applicat ion of query-by-content is a mixture of rather generic fea-tu res (which will not intr oduce int rinsic query limita-tion) with r at her specic featu res. In fact, th is choicereects a strategy aimed at initiating a feedbackprocedure between developers and users by whichthis initial set of domain-dependent specicationswill be expanded an d rened a ccording to th e fut ur edeman d of the commu nity using this new stru ctu ra ldatabase.

    There are ve biological structural features thatare considered in this work: size, shape, channels,internal cavities,and symmetries.Obviously, therst two are very general while the latter t hree a requite specic. In this work we ha ve considered t hefollowing den itions:

    Size: We refer to the dimensions in angstromsalong th ex, y, andz axes of th e minimum bounding-box containing the specimen. The size of the speci-men acts as a ltering feature, selecting differentbroad a reas of inter est.

    Shape: This feature is related to the geometricalappearance of the specimen, and it can either beregarded as a global attribute or refer to specicregions of th e specimen.

    Chan nels: We refer to low-density area s th at tra -verse sections of the 3D images. In th e specicationof this feature we will focus our attention on four

    param eters: the number of chan nels traversing th especimen, their length, their pathways, and theirdiameters.

    Inter na l cavities: The denition of intern al cavi-ties adopted in this work is that of low-densityregions th at ar e totally enclosed within t he volumeof th e specimen. In addition we will consider bothth eir volume a nd locat ion.

    Symmet ries: We refer to cha ra cteristic point sym-met ries of th e biological specimen t ha t a re speciedby th eir appropriate symmetry elements.

    The rationale of the initial choice of these vefeat ur es is simple. The two gener al feat ur es, size andshape, provide the context information. Then, thethree specic features, channels, internal cavities,and symmetries, a ddress char acteristic propertiesthat have the potential to provide key biologicalinformation. Certainly, the choice of these latterthr ee featu res for our rst implement at ion h as beenvery inuenced by th e t ypes of specimens we workwith, and it is acknowledged that they should beexpan ded t o oth er featu res of special import an ce inoth er system s in fut ur e implementat ions. Hopefully,this expansion will be a direct consequence of a

    fruitful int era ction with user s of th e BioIma ge sof macromolecular 3D images.

    5. SYSTEM LAYOUT

    5.1. User-Provided An notations versus Automa ticCom putat ional Features E xtraction

    User-provided key word annotation is the t

    tional way to introduce knowledge concernipiece of inform at ion t o be stored in a da ta base. Icont ext of genera l ima ge dat aba ses, data sets cannotated m anu ally adding key words, such apresence of a r elevan t aspect (i.e., the occurrencfamous char acter), that are later used at the qstage to retr ieve related ima ges.

    However, the exclusive application of a maannotation approach presents some problems least two areas: (1) i t implies a large amounmanual effort in producing an annotated immaking this approach impractical for large co

    tions of data sets; (2) differences in t he interption of the image content could lead to inconsicies in the key word assignments among diffesubmitters.

    As an approach designed to circumvent tproblems, most recent query-by-content sysimplement an initial stage of automatic feaextraction a t t he time of submitting a n ew dataDuring th is stage, a number of comput at iona l detors from the ima ges are calculated a nd t hen stas pa rt of th e image descript ion.

    Nevertheless, the two approaches commen te

    before, the manu al a nd the aut omatic approaar e not necessa rily mutu ally exclusive, an d thebe used concur ren tly in the design of hybrid sysas is in fact the case for the implementationscribed in th is work. In precise ter ms, for the on query-by-content on 3D images in BioImatwo-stage hybrid approach has been followed: at the level of combining the information contain the textu al key words with th e stru ctur al infotion itself (these key words refer to genera l texinformat ion, termed metadata in the workLindek et al. , 1999), and second, at the stage

    extracting the ve biological features introducthe previous section. During the stage mentiowe ha ve made u se of both au tomatic processingan d information provided int era ctively by the suter.

    5.2. On th e Compu tation of Features

    So far, we ha ve been referring to visual featu ra r at her a bstra ct way. For insta nce, we have deshapeas being related to the geometrical apance of the specimen.However, concrete comtional approaches are needed when we reach

    115ADVANCED QUERY TOOLS FOR 3D IMAGES OF MACROMOLECULES

  • 8/14/2019 Journal of Structural Biology 125, 112122 (1999)

    5/11

    level of ass igning numerical values to these featu res.Indeed, a clear differentiation can be made betweenthe high-level features to which a structural mean-ing can be atta ched, on t he one hand, an d the set of lower level primitive featur es tha t are used at thecore of th e system, on th e oth er ha nd.

    In practical terms, any high-level feature is goingto be calculated as a combination of lower levelfeatur es, when we will re fer to it below as comput a-tional features. In addition, for abstract featuressuch as sh ape, th e way the combina tion of compu ta -tional features is doneor even the set of thesefeat ur es to be used for this ta skis not un ique. Thisproblem, quite common in image understan dingsystems , is usu ally referr ed to as th e sema nt ic gap .This term expresses in a ra th er direct way tha t a gapexists at the semantic level between the abstractmeans employed by the user t o address the u nder-stan ding of an image and t he concrete mat hema ticalfunctions that the system really computes. As apossible way to partially bridge this gap, we haveintroduced the notion of a bio-vocabulary,which isa set of predened terms t ha t ma p some of the u serabstr actions into dened r elationships between setsof comput ational featu res. We will expand this no-tion lat er in th is work .

    In more precise terms, a computational featuref iis dened by a tuple7 N i, S i, D i, I i 8, whereN i is thechosen name for the feature,S i is the space of representation,D i is a dist an ce fun ction from whicha similarity measure is obtained, andI i is an index

    stru ctur e n eeded to store a nd access efficiently thedata setsvalues for t ha t featur e. It should be notedtha t for a par ticular featu re several spaces, distan cefunctions, and index structures could be chosen. Astr uctu ra l featu re is th en dened as a combinat ion of comput at iona l feat ur es sat isfying expert criteria re-lated t o th e inform at ion being considered.

    In t he following par agraph s we will review the vehigh-level str uctu ra l feat ur es intr oduced previously,making explicit the way they are calculated inpra ctice th rough a set of compu ta tional feat ur es. Animportan t considerat ion t hat we must ma ke at the

    onset is that, within the particular implementationof the pr ototype presented in th is work, m ost of thecomputa tional features are extracted from isosur-faces derived from t he 3D images, ra ther tha n fromthe 3D images themselves. Indeed, we believe thatthis is a topic in which further work is needed inorder to extract more information from the 3Ddensity distributions directly. These are the steps tofollow:

    S ize (dim ensions). In order to compute the size of the specimen in angstroms, a procedure has beenimplemented that computes the smallest bounding

    box containing it . To do that, we rst obtareference coordinat e fram e positioned at the cof mass of the object. This frame is derivealigning the object with the principal axes oftensor of inertia. Then, th e height, width, a nd dof such a box ar e calcula ted, as well as t he occuprate inside it . Computation of several invarmoments has been done in order to obtain norized represent at ions for every dat a set a gainst an d rotat ion (Galvez and Cant on, 1993).

    All the operations that lead to the specicatioth e coordinat e system of size are compu ted on thimages aut omatically, i.e., without requiringintervention from the submitter. This operatiodone in a short time.

    Global shape. As we have commented befoshape is an example of a rat her abstra ct str ucfeature in which all the difficulties associated th e semant ic gap problem ar e especially relevan

    far as hu man s are concerned, the tra ditiona l wdescribe shapes is th rough t he use of words ; this clear wh at a r ing-sha ped or a spher ical-like ois. However, computer programs need quantitmea sur es in order to recognize an d compar e shapes. Thus, severa l ways to represent the sh aan object have been investigated. One of the popular ways is the u se of histograms as st at isdistr ibutions of some fun ction va lue. In our cascomput e au tomatically the sur face shape spectwhich is a hist ogram of a sh ape index (Na sta r, 1The sha pe index is dened as

    S ( p) 0.5 ( ) 1 arctan [(k 1( p)k 2( p))/(k 1( p) k 2( p))],

    wherek 1( p) andk 2( p) are the principal curvaturat a sur face pointp .

    Along with the shape index, the feature veform ed by the moments of inert ia can a lso be usth e process of identifying t he differen t sh apes, though a combination of the shape index andmoment s of inert ia has not been implemented in

    prototype yet.A bio-vocabu lary is int roduced at th is sta ge; wit is clear from the previous paragraphs howcomputational features associated with shapecomputed, it is not clear how they can be combinto an a bstr act description of sha pe. This problindeed extremely complex and fundamental inage understanding, and instead of attemptinprovide generic solutions to funda menta l, lstan ding, ma them atical problems, our aim ha s to explore partial solutions with practical apption within a dened context. In th is way, we

    116 DE ALARCO N, GUPTA, AND CARAZO

  • 8/14/2019 Journal of Structural Biology 125, 112122 (1999)

    6/11

    selected for t his prototype implementat ion a r educedset of abstract shapes, such as ring-like, spherical,an d so on, th us set ting u p our init ial bio-vocabular y.For each abstr act sha pe considered in this prototype(toroidal, spherical, triangular), a set of models hasbeen generated. Then, t he h istogram of shape indexha s been extra cted from th em, so as t o dene a set of multidimensiona l pointsone per modelar oundwhich the multidimensional points associated withthe experimenta l 3D images have been cluster ed. Inth is way, in addition to having stored in the da ta basethe histogram of shape index that is automaticallyextr acted from the 3D images, a shape labelis alsointroduced du ring t he clustering pr ocess by assign-ing the label of the closest clust er t o each experimen-ta l 3D ima ge. This ass ignment of labels to 3D imagescreates, in p ra ctice, a bio-vocabu lary dictionar y (seeFig. 2). As will be fur th er n oted in the Section 6, th islabel will be especially important both when query-ing by specifying complex feat ur es in abst ra ct termsand at the resulting visualization interface.

    This global shape feature can be complementedwith anoth er, a local sha pefeatur e. The calcula tionof the lat ter would follow the same principles as th eformer, although it is necessary to develop an addi-tional tool dedicated to cutsome interesting struc-tu ra l regions an d which ha s not been incorpora ted inth is prototype yet.

    Symmetry: The symmetry-related features arenot calculated au tomatically. Instea d, th e submitt ermust provide information about the type of symme-

    try and the additional information associated withit. For th e time being, only rotat iona l symmetr ies areconsidered, and the submitter is required to providethe direction of th e symmetry axis in the originalreference system in which the 3D ima ge was init iallysubmitted.

    Channels (num ber, path, diam eter): Descriptorsof chan nels tr avers ing the specimen can be extra ctedwith a minor user int era ction. For exam ple, in orderto derive the pa th of a determined chan nel, the usermust pr ovide a set of seed points a long that channel,including its sta rting a nd ending points. Then, analgorithm , which calculates t he shortest path alongthese points, is applied. At present, the submittermust provide the nu mber of chan nels. However, weare considering doing this task automatically bycomput ing the Eu ler num ber. The Euler nu mber is atopological invar iant measu re from which th e nu m-ber of cha nnels a nd intern al cavities can be derived(Leeet al., 1991).

    Internal cavities: The user inter actively providesone point inside t he cavity, aroun d which th e volumeoccupied by such a cavity and its surface histogramar e computed with the help of a lling algorithm .

    As stated in the description above, some offeatures need th e user s interaction in order tproperly derived. For insta nce, local sha pe and nel characterization are typical examples thamand visual annotation tools to introduce sinitial informa tion. At present , th is process is caout by relatively simple tools running on the seside, mainly involving the display of 2D secteven though vir tual reali ty modeling langu(VRML)-based interactive tools (Hartman and necke, 1996) ar e un der developmen t. An exam pa possible user inpu t in a VRML world is providFig. 1.

    5.3 System Structure

    The development of a query-by-conten t syinvolves intera ction among thr ee main a reas, nathe feature extraction programs, the database nology, and the visual tools interfacing with the

    A diagram representing the ow of events amth ese component s is shown in Fig. 2.Ea ch dat a set ha s its conten t extra cted in a hy

    way combining int era ctively user-provided infotion with au tomatic procedures. The a im of maan notations is to obtain inform ation th at canneasily extracted by pure computational methSometimes, the rea l reason to obta in the informin a nonautomatic manner is the intrinsically mabstract nature of these data. However, in ordma ke th is task easier, user -friend ly an nota tion mu st be provided, th us just ifying th e furt her devment of 3D environments in combination with plex data bases su ch as BioImage (for genera l coerat ions see Pittetet al., 1999).

    During t he design of automatic featu re extr aprogram s, the tr ade-off between th e required cota tional t ime and the precision of the representfor a given visual feature should be consideregeneral, r epresenta tions that lead to efficientfast algorithms are preferred to more complexcomput at iona l time-consum ing a pproaches, evth ey might be more accur at e. Ther e is a possiblof obtaining false detections, but the faster syresponse ma y compen sat e for this drawba ck.Fr om t he da ta base sta ndpoint , both conten tscriptions (automatic and textual) need to bdexed and stru ctur ed in a regular data base manment system. Actu ally, th e data base issue is a crstandpoint since efficient retrieval performancekey requiremen t in every quer y-by-conten t sysThe database management system that we hchosen for the prototypes implement at ion isInformix Universal server (which implementobject-relational policy) running on an O2 SilGraph ics ma chine.

    117ADVANCED QUERY TOOLS FOR 3D IMAGES OF MACROMOLECULES

  • 8/14/2019 Journal of Structural Biology 125, 112122 (1999)

    7/11

    6. QUERY AND VISUALIZATION IN TERFACES

    The database is accessible via WWW (http:// www.bioimage.org/qbc/index.html) as an advancedquery tool implement ed in t he BioImage Web serverin Madr id. Similarity quer ies can be ma de using twotypes of query int er faces (see Fig. 3).

    The rst interface is designed to allow the user tospecify properties of t he structure to be retrieved

    (content -relat ed or not). This goal is accomplishproviding a form that is processed by a CGI proSuch a progra m allows the user to query by meathr ee kinds of elds: not conten t r elated, ann otand derived from bio-vocabulary. The rst typeld includes those normally stored in BioImapart of i ts metadata (Lindeket al., 1999). Theseelds ar e not directly related to the str uctura l

    FIG. 1. Exam ple of a VRML world. An annota ted isosurface represent at ion corr esponding to th e data set of th e SV40 largepresen ted. The centra l symmet ry axis is indicated, as well as th e star ting point for a cha nnel. As ha ded box ma rks a designat esur face to be furth er an alyzed.

    FIG. 2. Components of the query by 3D content engine. BioImage is our source of informa tion. E very macromolecularprocessed and /or an notat ed in order t o derive informat ion on its content . This informat ion is indexed and stored in a separ at eBioIma ge data base. A bio-vocabula ry dictiona ry is a lso provided by abstr acting some of th ose feat ur es.

    118 DE ALARCO N, GUPTA, AND CARAZO

  • 8/14/2019 Journal of Structural Biology 125, 112122 (1999)

    8/11

    ert ies of th e dat a set bu t have to do with some oth erfeat ur es (e.g., th e na me of th e specimen ). The secondkind of eld is related to those content-based fea-tur es tha t a re inter actively provided by the submit-ter by a manu al ann otat ion pr ocedure, such a s th esymmetry elements. The third type is formed bythose featu res th at ar e totally or pa rtially comput er-extracted. For example, the ring-shape property isstored in t he bio-vocabular y dictionar y togeth er withthe shape spectrum of an ideal ring, which isconsidered as a pattern model. When the user asksfor ringsthis pattern model is compared with thespectra of every data set stored in the database.Thus, a distance measure with the label ring-shaped is derived. The advantages of this queryapproach consist of the fact tha t it is ra ther easy touse, an d in addition, it does not pose any signican tnetwork t ra ffic-related pr oblems upon quer ying.

    Along with the form-based interface we have justdescribed, the possibility ofa quer y-by-example int er-face opens some very interesting possibilities. Inessence, a query-by-example int erface allows theuser t o provide a quer y data set an d, as in the case of biological sequen ce compa risons, ask t he da ta base tond similar data sets considering one or severalfeatu res (shape, size. . .). As t he qu ery da ta set h asnot been preprocessed before, the database does notcontain any description of its features. This latterfact indicat es tha t t he cont ent of the quer y data setmust be extra cted before similar objects a re sea rchedfor in th e da ta base. This cont ent s extr action will berea lized in fut ur e implementat ions on the client sideby downloading a Java program and following basi-

    cally the same procedures as those used durinsubm ission of a new str uctur e.

    This approach will a lso be used in the developof a submission interface that would run onclient side and that would replace the simple population procedures that have been used inimplement at ion of th is prototype.

    As a result of the query, a ranked list of simobjects is displa yed on a sepa ra te Web page. The

    can inspect each of th e ret rieved dat a sets. A Visosurface is also available so that the userinter actively explore the selected da ta set.

    7. APP LICATION EXAMPLE

    The a im of this section is to constr uct a concase of query-by-content on 3D images of biolomacromolecules that will illustrate, in a pracway, the u sefulness of th ese new developmen tsapplication presented in this section has beespired by a recent work of Hingorani and ODo(1998), in which t hey consider th e sha pe of a nuof biological macromolecules in relation to function as enzymes interacting with DNA. auth ors collected the data from several souincluding the PDB database. In the context ofwork on query-by-content on structural databwe highlight their claim that almost all typeenzymes involved in DNA metabolism ha ve a shaped structure or employ one as part of a ftiona l complex. Also, it is inter est ing to considerthe central channel of the ring is large enougaccommodate either single-stra nded or doustra nded DNA.

    FIG. 3. Int era ction between th e query interfaces and th e search engine. Two query inter faces are indicated at th e left-hanimage: a form-based int erface as well as a qu ery-by-exam ple inter face. The form-based quer y inter face allows t he u ser t o contained in the bio-vocabulary dictionar y a s well as other ann otat ed descriptors. The query-by-example interface allowpresent an example of a structure asking the database to search for similar cases. The latter interface will be complementeprogram for feat ure extr action in the client ma chine. Results of th e query procedur e are r etrieved in th e sam e way for both in

    119ADVANCED QUERY TOOLS FOR 3D IMAGES OF MACROMOLECULES

  • 8/14/2019 Journal of Structural Biology 125, 112122 (1999)

    9/11

    From the standpoint of the database infrastruc-ture development, it is particularly important toconsider how their work would have been facilita tedshould query-by-content approaches over structuraldat aba ses have already been in comm on use. Indeed,by searching over the met ada ta associated with th edescription of the function, a nd then retr ieving thecontent information associated with shape, it wouldhave become a pparent that all the r etrieved str uc-tur es shar ed the same ring-like sha pe.

    Taking this work as a n inspiration, and with theaim to present a case of a content-based query only,we are presenting as a practical case a search forthose stru ctures with a ring-like shape t hat have acentral channel with a diameter of between 15 and60 . The sea rch will be perform ed over a set of 3Dimages cont aining th e cur ren t BioIma ge dat a sets onmacromolecules, complemented with some 3D im-ages calculated from PDB les. In this case, thequery is form-orient ed, and it is per form ed by llingout the features for shape an d channels within t hequery form (see Fig. 4a). As soon as the user laun chesthe query, the query-by-content engine searches thedatabase for data sets presenting those features.Then , th e set of ma tching data sets is present ed on asepara te pa ge (see Fig. 4b). The output page a llowsthe user to visua lize the cont ents m etada ta of each

    matching data set. Furth er inspection of the mdata content immediately indicates that thetr ieved str uctures ar e all involved in DNA melism. (It should be clearly sta ted that the oridat aba se question posed in Hingora ni and ODo(1998) was th at described in the previous par agnot the shape-only oriented search described inwork and which is used here only for illustrareasons. Also, the data set over which the conquery is performed is limited.) From this paguser can also inspect an isosurface of the datth at is represent ed in VRML form at .

    Another appr oach to search a stru ctur al datafocusing on its stru ctur al cont ent is t ha t ba sedquery-by-example interface. Under this approwe assume that the user has a data set, whicpresented as the query structure. In additionuser m ust st at e a set of reference feat ur es to befor searching (e.g., similar shape and an enumber of channels). Afterward, the most simdata sets , in the context of those features,ret rieved. This kind of inter face is not yet ava ibut in the near future i t wil l be one of thecomponents of our QB3DC (query by 3D conapplication.

    Users ar e encour aged to inter actively exploreresu lt of using combina tions of met ada ta sea rch

    FIG. 4. (a) Form-based int erface lled out in order t o nd 3D images with a r ing-like shap e as well as an inn er chann el obetween 15 a nd 60 . (b) Result of th e query per formed over a set of 3D images of biological m acromolecules fulllling theexpressed in the form sh own in a. On th e left-han d side a list of str uctur es as well as a repres enta tive view from each of th emClicking into any of these structures retrieves general textual information on the specimen and provides a VRML worlstru ctur e can be inspected. In this pa rticular example, furt her informat ion on t he st ructure of the Dn aB helicase, correspBioIma ge data set iden tier 26, has been selected.

    120 DE ALARCO N, GUPTA, AND CARAZO

  • 8/14/2019 Journal of Structural Biology 125, 112122 (1999)

    10/11

    well as content-based searches over the ve basicstru ctur al chara cteristics implement ed in th is proto-type by accessing the BioImage server in Madrid,where the form -based int erface is opera tive for test-ing. We believe tha t t his prototype shows clear ly thepotent ial benets of a query-by-conten t system a p-plied to th e biological ma cromolecule eld.

    8. CONCLUSIONS AND PERSP ECTIVES

    In this work we have developed a Web-basedprototype system th at is able to perform some formsofquer y-by-cont ent on 3D images of biological ma cro-molecules within th e cont ext of th e BioIma ge project.The pu rpose of th is prototype is th reefold:

    (1) to demonstrate the wide range benets thatth e intr oduction of new search capa bilities based ona qu ery-by-conten t a pproach can bring to structur aldatabases. In particular, and in t he context of thiswork, we ha ve focused our a tt ent ion on th e ma cromo-

    lecular section of BioImage.(2) to provide an a ppropria te modeling fra meworkwhere new a ttributes of the dat a sets content andtheir corr esponding extr action progra ms can be test edand incorporated.

    (3) to obta in feedback from scientists on the mac-romolecular eld in order to enhance the currentprototype.

    For a query-by-content approach to be feasible,queries must be reasonably fast. In this r espect, allthe procedures implemented in this prototype couldbe scaled up t o large ensembles of dat a set s in simplema nn ers. For insta nce, th e task of extra cting some of the visual features could be assigned to the clientcomput er tha t is submitting the data, rat her th an inth e server (and, in any case, th ey mu st be calculat edonly once per dat a set). Also, th ere a re m an y choicesof distance functions am ong volumes, a nd th ey canbe a djusted for the needed t ra de-off between speedan d a ccur acy.

    The pr ototype, with its acknowledged limita tions,can be therefore considered a pioneer test bed onwhich more elaborated developments in the eld of cont ent da ta modeling, au tomat ic featu re extra ction,an d inter faces will be tested. So far t he focus ha s notbeen centered on getting many volumes into thisprototype but, rather, in exploring new ways tohan dle the volume informa tion. Since fur ther workproceeds in to quer y-by-cont ent schema , more effort swill have to be devoted to a powerful and user-friendly su bmission interface. The availability of th is prototype in th e ma cromolecular BioIma ge Webserver is expected t o help increase the intera ctionsbetween developers and users, in such a way thatfutur e versions will incorporate new capabilitiesbased on this feedback int era ction.

    This work h as been pa rt ly funded by grant s from th e ComInt erm inister ial de Ciencia y Tecnologa (BIO97-1485-CE, BIO90761) within the context of the BioImage project (funded bEuropean Union through Grant PL960472). We expresthanks to the part icipants in the 1997 Workshop on LDatabases of Complex Objects as well as the 1998 WorkshQuery-by-Content and Data Mining in Image Databases, nized with the su pport of th e Spanish Resea rch Council withcont ext of our joint unit with t he U niversity of Malaga. P.Athe recipient of a predoctoral fellowship from the Comunid

    Madrid.

    REFERENCES

    Artymiuk, P. J.,et al. (1992) Similarity searching in databasethr ee-dimensional molecules and macromolecules,J . C h e m . Inf. Comput. Sci. 32 (6), 617630.

    Bach, J . R., Fuller, C., Gupta , A., Ham papu r, A.,et al. (1996) TheVirage image search engine: An open framework for imanagement,in Pr oceedings on Storage an d Retrieval for Ima ge and Video Dat aba ses IV, San J ose, CA, 12 Febr ua r2670, pp. 7687.

    Cara zo, J . M., and St elzer, E. (1999)J. S truct. Biol. , 125, in press.Galvez, J. M., and Can ton, M. (1993) Normalization an d

    recognition of three-dimensional objects by 3D momentsPat-tern Recognition 26 (5), 667681.

    Gupta, A., an d Ja in, R. (1997) Visual information retrComm un. ACM 40 (5), 7079.

    Hartman, J., and Wernecke, J. (1996) The VRML 2.0 HandBuilding Moving Worlds on the Web, Addison-WeReading, MA.

    Hingorani, M. M., and ODonnel, M. (1998) Toroidal proRunning rings around DNA,Cur r. Biol. 8, R83R86.

    Holm, L., and Sander, C. (1996) Alignment of three-dimenprotein structures: Network server for database searc Methods Enzymol. 266, 653662.

    IEEE Comput er, Septem ber 2331, 1995.

    Lee, C., Poston, T., and Rosenfeld, A. (1991) Winding and numbers for 2D and 3D digital images,CVGIP Graphical Models Im age Proc. 53 (6), 522537.

    Lindek, S ., Fr itsch, R., Ma chtynger, J ., de Alarcon, P. AChagoyen, M. (1999) Design and realization of a n ondat abas e for m ultidimen sional m icroscopic images of biolspecimens,J . S truct. Biol. 125, 103111.

    Liu, Y., Rothfus, W. E., an d Ka na de, T. (1998) Cont ent -basneuroradiologicimage retrieval:Preliminary results,in Proceed-ings of CAIVD 98: IEEE International Workshop on CoBased Access of Ima ge and Video Databa ses.

    Lohman , T. M., and Bjorn son, K. P. (1996) Mecha nisms ofh ecatalyzed DNA unwinding,Annu. Rev. Biochem. 65 , 169214.

    Ma, W. Y. (1997) NETRA: A toolbox for na vigat ing large

    databases, Ph.D. Dissertation, Department of ElectricaComputer Engineering, University of California a t San ta BaMa, W. Y., and Manjunath, B. S. (1996) Texture features

    learning similarity,in Proceedings of IEEE Conference Compu ter Vision a nd P at ter n Recognition, pp. 425430.

    Ming-Fang, W., and Hsin-Teng, S. (1998) Representation Surfaces by t wo-variable F ourier descriptors,IEEE Trans.Pattern Anal. M achine Intell. 20 (8).

    Nas ta r, C. (1997) The Ima ge Shape Spectru m for Ima ge RetResearch Report 3206. INRIA Rocquencort .

    Niblack, W., Barber, R.,et al. (1993) The QBIC Pr oject: Queryimage by content using color, texture a nd sh ape,in Proceedingsof SPIE on Storage an d Retrieval for Image an d Video bases, pp . 173187.

    121ADVANCED QUERY TOOLS FOR 3D IMAGES OF MACROMOLECULES

  • 8/14/2019 Journal of Structural Biology 125, 112122 (1999)

    11/11

    Nucleic Acids Res. J anua ry 1999, special issue on biologicaldatabases,27 (1).

    Pittet, J. J. , Henn, C., Engel, A., and Heymann, J.B. (1999)Visualizing 3D data obtained from microscopy on the In ter net, J. Struct. Biol. 125, 123132.

    Rui, Y., Huang, T. S., and Mehrotra, S. (1997) Content-basedimage retrieval with relevance feedback in MARS,Proc. IE EE Int. Conf. Im age Proc.

    Sedma n, J ., and Sten lund, A. (1996) The inita tor protein E1 binds

    to the bovine papillomavirus origin of replication as a trimericring-like structure,EMBO J . 15 , 50855092.

    Shindialov, I. N., a nd Bourn e, P. E . (1998)Protein Eng. 11 (9),739747.

    Smith, D. W. (Ed.) (1994) Biocomputing, Academic P ressDiego.

    Stukenberg, P. T., Studwell-Vaughan, P. S., and ODonne(1991) Mecha nism of th e sliding clamp of DNA polymeraholeoenzyme,J . Biol. Chem. 272, 1132811334.

    Suss ma n, Lin, Jia ng, Mann ing, Prilusky, Ritter, and Abola Protein Data Bank (PDB): database of 3D structur al inf

    tion of biological macromolecules,Acta Crystallogr. Sect. D 54 ,10781084. [URL: h tt p://www.pdb.bnl.gov/]

    122 DE ALARCO N, GUPTA, AND CARAZO