

Advanced Query Mechanisms for Biological Databases

I-Min A. Chen, Anthony S. Kosky, Victor M. Markowitz, Ernest Szeto, and Thodoros Topaloglou

Bioinformatics Systems Division, Gene Logic Inc.
2001 Center Street, Suite 600, Berkeley, CA 94704

{ichen,anthony,vmmarkowitz,szeto,thodoros}@genelogic.com

Abstract

Existing query interfaces for biological databases are either based on fixed forms or textual query languages. Users of a fixed form-based query interface are limited to performing some pre-defined queries providing a fixed view of the underlying database, while users of a free text query language-based interface have to understand the underlying data models, specific query languages and application schemas in order to formulate queries. Further, operations on application-specific complex data (e.g., DNA sequences, proteins), which are usually provided by a variety of software packages with their own format requirements and peculiarities, are not available as part of, nor integrated with, biological query interfaces.

In this paper, we describe generic tools that provide powerful and flexible support for interactively exploring biological databases in a uniform and consistent way, that is, via common data models, formats, and notations, in the framework of the Object-Protocol Model (OPM). These tools include (i) a Java graphical query construction tool with support for automatic generation of Web query forms that can be either used for further specifying conditions, or can be saved and customized; (ii) query processors for interpreting and executing queries that may involve complex application-specific objects, and that could span multiple heterogeneous databases and file systems; and (iii) utilities for automatic generation of HTML pages containing query results, that can be browsed using a Web browser. These tools avoid the restrictions imposed by traditional fixed-form query interfaces, while providing users with simple and intuitive facilities for formulating ad-hoc queries across heterogeneous databases, without the need to understand the underlying data models and query languages.

Introduction

An increasing number of biological databases are providing publicly accessible query interfaces. For example, archival databases such as the Mouse Genome Database (MGD) at the Jackson Laboratory, the Genome Database (GDB) (Fasman et al. 1996) at Johns Hopkins School of Medicine, the Genome Sequence Database (GSDB) at the National Center for Genome Resources (NCG 1997), and Genbank at the National Center for Biotechnology Information (Shuler et al. 1997), can all be queried via Web based interfaces.

Exploring data in biological databases involves examining the structure (metadata) of the databases, browsing and querying the databases, interpreting the results of queries, and processing and viewing application-specific data types, such as protein and DNA sequences, using special data type-specific operations, such as sequence comparison, structure comparison, and protein structure visualization. These operations are often available as individual software packages with their own input and output formats (e.g., BLAST and FASTA for DNA sequence comparison, DALI for structure comparison, RasMol for viewing macromolecules).

In order to support querying and data exploration, biological databases must offer facilities for easy formulation of queries and interpretation of query results, as well as support for seamless manipulation of application-specific data. The study of (Markowitz et al. 1997) showed that most archival databases provide limited query support. Their Web based query interfaces, for example, have a fixed structure involving a predetermined set of database components (e.g., tables, classes), and a predetermined set of attributes for each database component. These interfaces are based on canned queries that can be parameterized on certain values and may allow users to have some control over conditions or set the values used in the conditions. The main limitation of fixed-form interfaces is that they conform to and provide only some predetermined view of the underlying databases. For example, such interfaces do not allow specifying ad-hoc queries spanning multiple tables or classes. Furthermore, fixed-form interfaces may need to change whenever the underlying database changes or new features need to be supported. Some Web interfaces provide the ability to specify queries over a database using free-form textual query languages, but do not provide proper metadata support, so most users are not able to make use of these facilities. Even worse, without a mechanism that helps verify the semantic correctness of queries, users may specify semantically incorrect queries. Finally, biological query interfaces do not provide support for manipulating application-specific data.


From: ISMB-98 Proceedings. Copyright © 1998, AAAI (www.aaai.org). All rights reserved.


In this paper, we describe tools that provide advanced query mechanisms for biological databases in the context of an object model, the Object-Protocol Model (OPM). These tools are used in conjunction with the OPM retrofitting tools for constructing OPM views of existing relational and structured file databases (Chen et al. 1997).

The OPM query tools are generic (schema-driven), that is, they are driven by the metadata associated with the underlying database and allow ad-hoc queries to be constructed using graphical, Web based interfaces. Query construction is a simple two-stage process. In the first stage, the user constructs a query tree by graphically browsing the object schema of the underlying database and iteratively selecting classes and attributes of interest. In the second stage, the user can specify conditions, customize, save, or submit the Web form that is generated automatically by the OPM query tools.

The query tools generate queries in an object-oriented query language, OPM-QL, which are then processed using OPM query translators. The query translators generate equivalent, possibly semantically optimized, queries for the underlying relational database (Chen et al. 1996) or flat file system.

Querying support for complex (application-specific) objects is provided via OPM Application-Specific Data Types (ASDTs) and methods. ASDTs are supported on top of existing relational DBMSs, such as Oracle 7 and Sybase 11, but may also take advantage of the advanced features of the emerging object-relational DBMSs, such as the Oracle 8 and Informix Universal Servers, which provide mechanisms for incorporating ASDTs and their associated methods into the DBMS.

The remainder of this paper is organized as follows. First, we briefly overview our framework for exploring databases on the Web. The query construction tools together with the techniques used for implementing them are described next. Then, we discuss query processing and query result browsing. The paper concludes with a brief description of related work.

Background

Our approach for exploring biological databases is based on an object model, the Object-Protocol Model (OPM). An object data model such as OPM can be used to represent heterogeneous databases in a uniform, abstract (system-independent), and consistent way. In addition, OPM provides extensive schema documentation facilities in the form of descriptions, examples and user-specified properties. A variety of existing OPM tools provide facilities for developing and accessing databases, for constructing OPM views on top of existing databases and files, for generating alternative schema representations, and for querying heterogeneous databases through uniform OPM views. We briefly review the main constructs of OPM below; OPM is described in detail in (Chen & Markowitz 1995).

Basic OPM Constructs

OPM is a data model whose object part closely resembles the ODMG standard for object-oriented data models (Cattell 1996). Objects in OPM are uniquely identified by object identifiers (oids), are qualified by attributes, and are classified into classes. Classes can be organized in subclass-superclass hierarchies and can be grouped into clusters. In addition to object classes, OPM supports a protocol class construct for modeling scientific experiments. Protocol classes are not discussed in this paper.

Attributes can be simple or consist of a tuple (aggregation) of simple attributes. Attributes can be single-valued, set-valued or list-valued. If the value class (or domain) of an attribute is a system-provided data type, or a controlled-value class of enumerated values or ranges, then the attribute is said to be primitive. If an attribute takes values from an object class or a union of object classes, then it is said to be abstract.

Figure 1 contains an example of a part of the OPM schema for the Genome Database (GDB)¹ represented in a diagrammatic notation and browsed using the Java based OPM Schema Browser. This example contains several OPM classes, such as Map and Chromosome; attributes includesMap of Map and maps of Chromosome are set-valued, while attributes mapOf and units of Map are single-valued; attribute includesMap is a tuple attribute consisting of two simple attributes, map and orientation; attribute chromosome is an abstract attribute taking values from class Chromosome, while attribute minCoord is a primitive attribute.

OPM supports the specification of derived attributes using derivation rules involving arithmetic expressions, aggregate functions (min, max, sum, avg, count), compositions of attributes and inverse attributes. A composition derivation consists of a path or a union of paths of the following form: $B_1[O_{i_1}]\, B_2[O_{i_2}] \ldots B_n[O_{i_n}]$, where each $O_{i_k}$ ($1 \le k \le n$) denotes a class, and each $B_k$ ($1 \le k \le n$) denotes an attribute or inverse attribute associated with $O_{i_{k-1}}$ ($O_{i_0} = O_i$). An inverse attribute of a class $O$ is the reverse of an attribute $A$ associated with another class, $O'$, where $O$ is a value class of $A$; such an attribute is denoted !A. For example, derived attribute maps in Figure 1 is associated with class Chromosome and is defined by derivation: !chromosome[Map]. OPM also supports derived subclasses and derived superclasses. A derived subclass is defined as a subclass of one or more object classes with an optional derivation condition. A derived superclass is defined as a union of two or more object classes.
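The composition and inverse-attribute derivations described above can be illustrated with a small sketch. It is not part of the OPM tools: the in-memory object store, the oids, and the function names are assumptions made purely for illustration. Each path element is a pair (attribute, value class), and a leading "!" marks an inverse attribute.

```python
# Hypothetical in-memory object store: oid -> (class name, attribute values).
OBJECTS = {
    "chr21": ("Chromosome", {"displayName": "21"}),
    "chr22": ("Chromosome", {"displayName": "22"}),
    "map1": ("Map", {"displayName": "cyto-22", "chromosome": "chr22"}),
    "map2": ("Map", {"displayName": "link-22", "chromosome": "chr22"}),
    "map3": ("Map", {"displayName": "cyto-21", "chromosome": "chr21"}),
}


def as_set(value):
    """Normalize a missing, single-valued, or multi-valued attribute to a set."""
    if value is None:
        return set()
    if isinstance(value, (list, set, tuple)):
        return set(value)
    return {value}


def step(oids, attribute, value_class):
    """Follow one path element B[O] starting from a set of oids."""
    result = set()
    if attribute.startswith("!"):
        # Inverse attribute !A: objects of value_class whose A refers to an input oid.
        name = attribute[1:]
        for oid, (cls, attrs) in OBJECTS.items():
            if cls == value_class and as_set(attrs.get(name)) & oids:
                result.add(oid)
    else:
        # Ordinary attribute: keep referenced objects that belong to value_class.
        for oid in oids:
            for target in as_set(OBJECTS[oid][1].get(attribute)):
                if target in OBJECTS and OBJECTS[target][0] == value_class:
                    result.add(target)
    return result


def evaluate_path(start_oid, path):
    """Evaluate a composition path B1[O1] B2[O2] ... Bn[On] from one object."""
    oids = {start_oid}
    for attribute, value_class in path:
        oids = step(oids, attribute, value_class)
    return oids


# Derived attribute maps of Chromosome, defined as !chromosome[Map].
maps_of_chr22 = evaluate_path("chr22", [("!chromosome", "Map")])
# A two-step composition: all maps on the same chromosome as map3.
siblings_of_map3 = evaluate_path(
    "map3", [("chromosome", "Chromosome"), ("!chromosome", "Map")]
)
```

In this toy store, the derived attribute maps of chromosome 22 is computed by scanning Map objects whose chromosome attribute refers back to it, which is the reading of the "!" notation given in the text.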

Advanced OPM Constructs

OPM has been extended with a new construct, the Application-Specific Data Type (ASDT). ASDTs are used to model complex, multimedia data types, such as DNA sequences, maps, and gel images. Suppose, for example, that a Gel object class has an image attribute that takes values from a GelImage ASDT. A GelImage instance is a complex data element that can be manipulated and displayed using application-specific operations called methods.

    APPLICATION SPECIFIC DATA TYPE GelImage
        DATATYPE: OPM_BLOB/IMAGE
        STORED: INTERNAL
        DESCRIPTION: "A binary image type"
        PROPERTIES: "imagetype" "tiff"
        METHOD display
            SIGNATURE: "void display()"
            LANGUAGE: "java"
            CODE: "/me/code/java/asdts/gelimage.java"
            DESCRIPTION: "displays gel image"

In this example, the second line of the GelImage ASDT definition specifies that its data type is OPM_BLOB/IMAGE, which is an OPM primitive value class. The third line specifies that objects of this type will be stored inside the database. Alternatively, such objects could be stored in files outside the database and the database would contain references to these files. ASDT properties can be used either by the application program or by the methods that operate on the ASDT. For instance, "imagetype" "tiff" means that a "tiff"-capable viewer needs to be used for displaying this image. Methods are associated with an implementation specified in some programming language, employed by the OPM tools when needed. The signature part of the method specification defines the return type and the formal parameters of the method. Service methods, such as display, return void. The code field (optional) specifies the location of the file containing the code implementing the method. The language and description parts are self-explanatory.

Figure 1: Browsing Classes using the OPM Schema Browser

¹ http://gdbwww.gdb.org/gdb/schema.html

Database Development and Retrofitting

OPM schemas can be specified using a graphical OPM schema editor or a regular text editor. OPM schema translators automatically generate complete definitions for databases implemented with commercial relational database management systems (DBMSs), such as Sybase and Oracle. A mapping dictionary records the correspondences between the classes and attributes of an OPM schema and the underlying relational tables.

Existing relational or structured file databases that have not been developed using OPM tools can be retrofitted with an OPM schema (view) using the OPM retrofitting tools (Chen et al. 1997). Retrofitting involves first generating a canonical OPM schema from the native database schema, and then refining, and thus semantically enhancing, this view via a series of schema restructuring manipulations. The retrofitting tools can be used for constructing multiple OPM views for a single (OPM or non-OPM) database; these tools generate mapping dictionaries as described above.

Query Interfaces

Existing public query interfaces for biological databases are usually available on the Web and are either form-based or textual query language-based. Form-based query interfaces usually have a fixed structure involving a predetermined set of components (e.g., tables, classes, attributes), and provide a limited number of options, such as specifying the values for certain attributes. Such interfaces are based on predetermined or "canned" queries, possibly parameterized on certain values, and provide a single fixed view of the database. They may not reflect the structure of the underlying database, since the list of attributes or fields retrieved by a canned query may involve only a subset of the attributes and fields in the underlying database, and the classes or tables accessed may be only a subset of those in the underlying database. In spite of their restrictions, form-based query interfaces are easy both to implement and to use.

Attempts to provide ad hoc query facilities usually rely on allowing the user to specify queries using some ad hoc textual query language. Users of such interfaces must have expert knowledge of the underlying query language, data model and database schema. The facilities offered by such interfaces depend on the query language supported by the underlying DBMS: different DBMSs support different flavors of query languages based on different or even identical data models. For example, numerous databases are developed using relational DBMSs which support different versions of SQL. Specifying ad-hoc SQL queries therefore requires non-trivial knowledge of the structure and manipulation of relational databases, and of the particular relational DBMS and dialect of SQL being used. This is often beyond the ability of general users.

In this section we describe the OPM query tools. Our strategy is to provide Web based ad-hoc query specification capabilities via a schema-driven Java graphical interface, coupled with dynamically generated HTML query forms. Queries specified using this tool are subsequently passed to OPM query translators for commercial relational DBMSs or structured flat-file systems. Used in conjunction with native or retrofitted OPM schemas, the OPM query tools provide a uniform and intuitive query interface on top of heterogeneous database systems, and allow users to formulate queries while browsing database schemas. Query results are organized in HTML pages with hyperlinks to related objects and metadata definitions in order to facilitate Web browsing.

Formulating Basic Queries. The OPM query interface is designed to provide an extension to the ubiquitous Web (HTML) query forms that users are already familiar with, so that using this tool has a very short learning curve. This is necessary since, in our experience, many users are not computer scientists and are not used to complex graphical query interfaces. Instead of providing predefined query forms, the OPM Web query tool provides support for constructing a query tree by selecting classes and attributes of interest using a graphical user interface, and for dynamically generating HTML query forms based on this query tree. Further query condition specification can be carried out by filling in these query forms.

Query construction with our interface is illustrated by the example shown in Figure 2. The graphical user interface shown in the top window in Figure 2 is used to construct a query tree. First a root class is selected from the Generalization Hierarchy list-box on the left of the interface window. Then the query tree is expanded by recursively selecting classes and attributes that are involved in the conditions and/or output of the query. Attributes associated with a selected class, C, are selected from an Attributes list-box; inverse attributes associated with C, that is, abstract attributes that take values from C, can be selected from an Inverse attribute list-box.

In the example shown in Figure 2, class Map is selected as the root of the query tree. Attributes, such as displayName and chromosome, are then added to the tree by selecting (clicking on) a class in the diagrammatic representation of the tree and then selecting attributes associated with that class from the Attributes or Inverses list-boxes on the right-hand side. A selected attribute is added to the query tree displayed in the main window. Primitive attributes, such as displayName and comment, form the leaves of the tree. For abstract attributes, such as chromosome, a value class, such as Chromosome, must also be selected and displayed in the main window; attributes of newly selected classes may then be selected in turn, such as attributes displayName and comment of class Chromosome in Figure 2.

The selection process is repeated until all classes and attributes of interest have been added to the query tree. Attributes can be renamed (e.g., attributes displayName of Map and displayName of Chromosome in the upper side window of Figure 2 are renamed map and chromosome_number, respectively, in the generated form) and specified as part of the query output, condition or both.

Figure 2: Constructing Web Query Forms with the OPM Web Query Tool

Once the query tree is completed, an HTML query form can be generated (see the form in the lower half of Figure 2). This form is used for specifying conditions on attributes in the familiar Query-by-Example mode (e.g., chromosome_number = "22"). Menu buttons help select the operator appropriate for the type of a given attribute. For controlled value classes, a list-box displays the set of valid values that can be used in expressing conditions. The query can then be submitted to the database via the OPM Query Translator, or the form can be saved as an HTML file that can then be customized for subsequent use or inclusion in Web pages.

The two-stage query construction described above leads to the specification of a query in the OPM Query Language (OPM-QL), an object-oriented query language similar to OQL, the ODMG standard for object-oriented query languages (Chen et al. 1996). An OPM query involves local, inherited, and derived attributes and path expressions starting with these attributes, and consists of a SELECT statement specifying the attribute values that should be retrieved for the instances satisfying the query condition; a FROM statement specifying the classes containing the instances that are considered by the query; and an optional WHERE statement specifying conditions on instances, where conditions consist of and-or compositions of atomic comparisons. Variable declarations in a FROM statement define the set of values that the variables may range over.

For example, the query constructed in Figure 2 is expressed by the following OPM query:

    SELECT map = M.displayName,
           chromosome_number = C.displayName,
           cytogenetic_marker = CM.displayName,
           comment = C.comment
    FROM   M IN Map,
           C IN M.chromosome[Chromosome],
           CM IN M.!map[CytogeneticMarker]
    WHERE  C.displayName = "22";
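To make the step from a query tree to OPM-QL text more concrete, the following is a minimal sketch of how such a tree could be flattened into SELECT, FROM and WHERE clauses. It is not the OPM tool's actual implementation: the `Node` structure and the function names are assumptions, and conditions are simply joined with AND. The generated SELECT items are grouped per tree node, so their order differs slightly from the query shown above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One class in a query tree (hypothetical structure for illustration)."""
    var: str                                        # range variable, e.g. "M"
    source: str                                     # class or path, e.g. "Map"
    outputs: list = field(default_factory=list)     # (alias, attribute) pairs
    conditions: list = field(default_factory=list)  # (attribute, operator, value)
    children: list = field(default_factory=list)


def walk(node):
    """Yield the nodes of the tree in depth-first, parent-first order."""
    yield node
    for child in node.children:
        yield from walk(child)


def to_opm_ql(root):
    """Render a query tree as OPM-QL text in the style of the example query."""
    nodes = list(walk(root))
    select = [f"{alias} = {n.var}.{attr}" for n in nodes for alias, attr in n.outputs]
    from_ = [f"{n.var} IN {n.source}" for n in nodes]
    where = [
        f'{n.var}.{attr} {op} "{value}"'
        for n in nodes
        for attr, op, value in n.conditions
    ]
    query = "SELECT " + ",\n       ".join(select)
    query += "\nFROM " + ",\n     ".join(from_)
    if where:
        query += "\nWHERE " + " AND ".join(where)
    return query + ";"


# The query tree of Figure 2, rebuilt by hand.
root = Node(
    "M", "Map",
    outputs=[("map", "displayName")],
    children=[
        Node(
            "C", "M.chromosome[Chromosome]",
            outputs=[("chromosome_number", "displayName"), ("comment", "comment")],
            conditions=[("displayName", "=", "22")],
        ),
        Node(
            "CM", "M.!map[CytogeneticMarker]",
            outputs=[("cytogenetic_marker", "displayName")],
        ),
    ],
)
query = to_opm_ql(root)
```

Keeping outputs and conditions on the node they belong to mirrors the way the form lets each attribute be marked as output, condition or both.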

The OPM query interface does not support the specification of the full range of queries that can be expressed in the OPM textual query language (OPM-QL). For example, OPM queries involving several root classes, or complex conditions involving parentheses and attribute comparisons, are not supported in the current version of the tool, mainly because of the difficulty in supporting the graphical specification of such queries in a simple and intuitive way. However, the OPM query interface supports the specification of queries that are substantially more complex than those underlying fixed forms, while allowing access to all the elements of the underlying database.

Examples of databases that can be accessed using this query tool are available at the OPM Web site². Of particular interest is the Primary Database of the German Genome Resource Center³, where the OPM Web query tool is available for constructing queries directly, but has also been used for setting up Web query forms⁴.

Formulating Queries with ASDTs. OPM queries involving ASDTs may invoke methods for those ASDTs. For example, the following query can be used to display the image of the gel with identity gel_000111:

    SELECT X.gelId, X.image.display()
    FROM X in Gel
    WHERE X.gelId = "gel_000111"

The following query displays only part of gel_000111's image:

    SELECT X.gelId, X.image.crop(0,0,200,400).display()
    FROM X in Gel
    WHERE X.gelId = "gel_000111"

where crop is a method that returns the piece of the image at the specified coordinates.

Implementation. The OPM query interface is implemented using a combination of the Java programming language and HTML forms. This query interface is schema driven in the sense that the OPM schema and mapping information is loaded into the Java interface once it is started. Query trees are drawn using the Java Abstract Windowing Toolkit (AWT) based on the schema information and user selections.

After a query is constructed, a click on the Generate Query Form button in the query interface results in sending a URL to a Common Gateway Interface (CGI) script. The URL includes database and user information, and the query tree encoded in a text string. Based on this information, an HTML form is generated and is available for further condition specification.

² http://gizmo.lbl.gov/jopmDemo/demoDbs.html
³ http://www.rzpd.de/cgi-bin/public_login
⁴ http://www.rzpd.de/object_form.html

There were several reasons for choosing HTML forms for specifying conditions, rather than implementing the forms in Java or specifying conditions directly on the graphical query tree: most users are familiar and comfortable with Web query forms, and are often reluctant to learn new query paradigms; formatting forms in current versions of Java would have been less visually appealing and would have required more development work than using HTML; and, once generated, the HTML forms can be edited, incorporated into other Web pages, and used independently without needing to download the Java applets.

In this implementation, upon generation of the HTML form, control is turned over to a Web browser (e.g., Netscape), which opens a second window containing the form. Subsequent actions in this window follow normal Web browser behavior, outside the control of the Java query constructor applet.
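The hand-off from the query constructor to the CGI script, in which the database, the user and the query tree travel as text inside a URL, can be sketched as follows. The paper does not specify the encoding, so everything concrete here is an assumption: the script URL, the parameter names `db`, `user` and `tree`, and the use of JSON for the tree are illustrative only.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical CGI endpoint; the real script location is not given in the text.
CGI_URL = "http://example.org/cgi-bin/generate_form"


def build_request_url(database, user, query_tree):
    """Encode database, user and query tree as URL query parameters."""
    params = {
        "db": database,
        "user": user,
        # Assumption: the tree is serialized as compact JSON text.
        "tree": json.dumps(query_tree, separators=(",", ":")),
    }
    return CGI_URL + "?" + urlencode(params)


def parse_request_url(url):
    """Recover the three pieces on the CGI side."""
    fields = parse_qs(urlsplit(url).query)
    return fields["db"][0], fields["user"][0], json.loads(fields["tree"][0])


tree = {
    "class": "Map",
    "attributes": ["displayName"],
    "children": [
        {
            "via": "chromosome",
            "class": "Chromosome",
            "attributes": ["displayName", "comment"],
        }
    ],
}
url = build_request_url("GDB", "guest", tree)
decoded = parse_request_url(url)
```

Because the whole request is a plain URL, a generated form can be bookmarked or linked from other pages, which is consistent with the reasons given above for handing control to the browser.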

Query Processing

The OPM Web query tools described above provide a front end to a system for processing queries and analyzing query results. In this section we describe the remaining tools that comprise this system, and our implementation strategies for these tools.

Query Translators. Queries in an OPM framework are evaluated using OPM Query Translators. An OPM query translator takes queries specified in the OPM Query Language (OPM-QL), and generates corresponding queries in the query language supported by the underlying DBMS or file system. The results of these queries are then structured and returned using OPM-specific data structures.

The query translators are driven by an OPM mapping dictionary that records the mapping between the elements of an OPM schema or OPM view, such as classes and attributes, and the corresponding elements of the underlying database or file. The content of the mapping dictionary and OPM schema is compiled into a metadata file that is dynamically linked into the query translators for efficiency.

OPM query translators have been developed for commercial relational DBMSs, such as Sybase and Oracle (Chen et al. 1996), and for structured flat-file systems. The relational OPM query translators generate queries in the dialect of SQL supported by the underlying DBMS, and employ DBMS-specific C/C++ APIs for database access. In general, the SQL queries generated by these translators are considerably more complex than the OPM-QL queries, since a single OPM class is usually represented by several distinct relational tables.
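The reason the generated SQL grows more complex than the OPM-QL can be seen in a toy sketch of a mapping dictionary. The table names, column names and dictionary layout below are invented for illustration and do not reflect the actual OPM mapping dictionary format; the sketch also assumes that every table holding attributes of a class shares the key column of the class's primary table.

```python
# Hypothetical mapping dictionary: OPM class -> key table/column and, for each
# attribute, the relational table and column that store it.
MAPPING = {
    "Map": {
        "key": ("map_obj", "oid"),
        "attributes": {
            "displayName": ("map_obj", "display_name"),
            "units": ("map_units", "units"),
            "minCoord": ("map_coords", "min_coord"),
        },
    },
}


def to_sql(opm_class, attributes):
    """Build a SELECT over all tables needed for the requested attributes."""
    key_table, key_column = MAPPING[opm_class]["key"]
    tables = [key_table]
    columns = []
    joins = []
    for attribute in attributes:
        table, column = MAPPING[opm_class]["attributes"][attribute]
        columns.append(f"{table}.{column}")
        if table not in tables:
            # Each extra table is joined back to the key table of the class.
            tables.append(table)
            joins.append(f"{table}.{key_column} = {key_table}.{key_column}")
    sql = "SELECT " + ", ".join(columns) + " FROM " + ", ".join(tables)
    if joins:
        sql += " WHERE " + " AND ".join(joins)
    return sql


single_table_sql = to_sql("Map", ["displayName"])
multi_table_sql = to_sql("Map", ["displayName", "units"])
```

Selecting two attributes of one OPM class already requires a join in this sketch, which gives a sense of how quickly the relational query expands.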

The flat-file OPM query translator employs SRS (Sequence Retrieval System), a system for parsing and indexing structured flat files developed at the European Molecular Biology Laboratory (EMBL), initially for accessing molecular biology data repositories (Etzold & Argos 1993). The SRS query language has limited power, in the sense that conditions are restricted to simple comparisons of indexed attributes with constants, so that OPM queries can be only partly translated into SRS queries. Consequently, post-processing of SRS query results is carried out locally using an OPM query engine originally developed for evaluating multidatabase queries (Kosky et al. 1998).
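The split between conditions that can be pushed down to the flat-file index and conditions that must be post-processed locally can be sketched as follows. This is only an illustration in Python: the `Cond` representation, the record layout, and the in-memory stand-in for the SRS call are assumptions, not part of the OPM or SRS software.

```python
import operator
from collections import namedtuple

# A condition compares an attribute with a constant or with another attribute.
Cond = namedtuple("Cond", "attr op value value_is_attr")

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def split_conditions(conditions, indexed):
    """Return (pushed, residual).

    Only comparisons of an indexed attribute with a constant are pushed
    down; everything else is left for local post-processing.
    """
    pushed, residual = [], []
    for cond in conditions:
        if cond.attr in indexed and not cond.value_is_attr:
            pushed.append(cond)
        else:
            residual.append(cond)
    return pushed, residual


def evaluate(cond, record):
    """Evaluate one condition against one record (a dict)."""
    right = record[cond.value] if cond.value_is_attr else cond.value
    return OPS[cond.op](record[cond.attr], right)


def run_query(records, conditions, indexed):
    pushed, residual = split_conditions(conditions, indexed)
    # Stand-in for the flat-file system: the pushed conditions are applied
    # in memory here, where a real translator would issue an index query.
    candidates = [r for r in records if all(evaluate(c, r) for c in pushed)]
    # Local post-processing of the partial result.
    return [r for r in candidates if all(evaluate(c, r) for c in residual)]
```

In this sketch a comparison between two attributes of the same record is never pushed down, since it is not a comparison with a constant.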

Application programs can interact with the OPM query translators either via C++ or CORBA APIs (all query translators share a common API), or by calling the query translators as Unix command-line programs. The latter can be done from Perl or Unix shell scripts, using temporary files for passing OPM-QL queries and query results.
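The command-line style of interaction can be sketched in a few lines. The paper mentions Perl and shell scripts; Python is used here only for illustration. The calling convention (query file path and result file path appended as the last two arguments) and the file names are assumptions, since the actual command-line syntax of the translators is not given here.

```python
import os
import subprocess
import tempfile


def run_translator(command, query_text):
    """Run a command-line translator, exchanging data via temporary files.

    command: the program and its leading arguments, as a list of strings.
    Assumed convention: the query file path and the result file path are
    appended as the last two arguments.
    """
    with tempfile.TemporaryDirectory() as workdir:
        query_path = os.path.join(workdir, "query.txt")
        result_path = os.path.join(workdir, "result.txt")
        with open(query_path, "w") as handle:
            handle.write(query_text)
        # check=True raises CalledProcessError if the translator fails.
        subprocess.run(list(command) + [query_path, result_path], check=True)
        with open(result_path) as handle:
            return handle.read()
```

The temporary directory is removed when the function returns, so neither the query nor the result file outlives the call.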

Processing Queries with ASDTs. If an ASDT method appears only in the SELECT clause of an OPM query, then the query is sent to the underlying DBMS through the OPM query translator, and the result is then directed to an ASDT method server. For example, the earlier OPM query will retrieve a gel's id and a handle to its image data as a first step; then the method display will be called to present the image.

If a method appears in the WHERE clause, then the OPM Query Processor needs to be involved in the query evaluation: the OPM Query Processor rewrites the query into method-free parts that can be evaluated by the underlying DBMS, method calls, and post-processing necessary to complete the evaluation of the query. The Query Processor evaluates the query by submitting the method-free sub-queries to the Query Translator and the underlying DBMS, calling the methods based on the intermediate results of each sub-query, and finally, evaluating locally the results of method calls. The OPM Query Processor is based on a client-server architecture, with Query Translator servers used for evaluating the method-free sub-queries, and ASDT servers used for evaluating method calls.
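The three evaluation steps (method-free sub-query, method calls on intermediate results, local evaluation of the method results) can be sketched as below. This is a simplified, single-process illustration: the function names, the row layout, and the `gc_content` method are hypothetical stand-ins for the Query Translator server and an ASDT server.

```python
def evaluate_with_methods(run_subquery, methods, method_conditions):
    """Evaluate a query whose WHERE clause contains method calls.

    run_subquery:      callable returning rows (dicts) for the method-free
                       part of the query (stand-in for a translator server).
    methods:           method name -> callable (stand-in for an ASDT server).
    method_conditions: list of (method name, argument attribute, predicate).
    """
    # Step 1: evaluate the method-free sub-query.
    rows = run_subquery()
    results = []
    for row in rows:
        keep = True
        for name, arg_attr, predicate in method_conditions:
            # Step 2: call the method on the intermediate result.
            value = methods[name](row[arg_attr])
            # Step 3: evaluate the condition on the method result locally.
            if not predicate(value):
                keep = False
                break
        if keep:
            results.append(row)
    return results


def gc_content(sequence):
    """Hypothetical ASDT method: fraction of G and C in a DNA sequence."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)
```

A call such as `evaluate_with_methods(fetch_rows, {"gc_content": gc_content}, [("gc_content", "sequence", lambda v: v > 0.5)])`, with `fetch_rows` supplied by the caller, keeps only rows whose sequence passes the method-based condition.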

Implementation. We initially used the ubiquitous Common Gateway Interface (CGI) for implementing interfaces between the OPM Web query tools and the Query Translators and Processor. CGI was found to be easy to use and maintain given our multiple language (Java and C++) programming environment. CGI is a natural choice for dynamically creating HTML pages: Java is used for implementing the front (query construction) ends, and C++ (with a Perl wrapper) is used for the CGI back-end. However, since a standard CGI call requires the creation of a new Unix process, this alternative could be inefficient. In our experience, the response time for CGI calls on the Web is acceptable, and slow responses are usually caused by slow queries or queries with large results, rather than by delays caused by the invocation of Unix processes.

We are currently developing CORBA based interfaces for application communication. CORBA is limited in the data structures that can be passed between applications, and consequently requires extraneous conversions between data structures used in applications and those that can be communicated. Further, despite the C++ mapping defined in the CORBA 2.0 standard, CORBA implementations remain vendor specific, so that a CORBA implementation of the query tools would be tied to a particular CORBA product. Nevertheless, CORBA simplifies the task of building client-server systems, and makes such issues as implementation languages and platforms, or locations of servers, transparent to a client process. We have used CORBA for implementing the communication between the OPM Query Processor and the Query Translator and ASDT servers.

Browsing Query Results. Results of OPM queries submitted using the Web query forms described above are returned in HTML format. The query results are presented as a table, in order to provide a concise summary of the data retrieved (see the top window in Figure 3). The columns of the table correspond to the fields of the query form, or equivalently, the leaves of the original query tree. Clicking on a column label reveals the full definition of the corresponding field, including the description of the relevant attributes in the OPM schema.

The query result table contains a column of object identifiers for the root objects in the query tree (Map objects in Figure 3). Selecting (clicking on) an object identifier results in displaying the values of all the attributes of the corresponding object, also represented in HTML format.

The lower window of Figure 3 shows a Map instance in object form. Single-valued primitive attributes, such as accessionID and status, are represented simply by their values. Multi-valued or tuple attributes and abstract attributes are represented using HTML tables. The columns of a table representing a tuple attribute consist of the component attributes of the tuple attribute. The columns of a table representing an abstract attribute consist of the representative attributes of the value class associated with the abstract attribute, as specified in the OPM schema. For example, attribute citations of class Map takes values from class Citation, whose representative attributes are accessionID, displayName and url. Any object identifier included in a query result can be selected (clicked on) in order to display the values of all its attributes. In this way the instances of a database can be browsed interactively, starting from the objects retrieved by the initial query. To help interpret the query results, each label is linked to a description of the corresponding attribute or class.
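The tabular presentation with clickable object identifiers can be sketched as follows. This is a minimal Python illustration, not the generator used by the OPM tools: the function name, the row layout, and the default link URL are assumptions.

```python
from html import escape
from urllib.parse import quote


def render_result_table(columns, rows,
                        object_url="/cgi-bin/query?message=get+objects&id="):
    """Render query results as an HTML table.

    columns: column labels (the fields of the query form).
    rows:    list of (object identifier, list of values).
    object_url is a hypothetical link prefix; the identifier is appended.
    """
    head = "".join("<th>{}</th>".format(escape(label))
                   for label in ["id"] + list(columns))
    body = []
    for object_id, values in rows:
        # The identifier is a hyperlink leading to the full object display.
        link = '<a href="{}{}">{}</a>'.format(
            escape(object_url), quote(str(object_id)), escape(str(object_id)))
        cells = "<td>{}</td>".format(link)
        cells += "".join("<td>{}</td>".format(escape(str(v))) for v in values)
        body.append("<tr>{}</tr>".format(cells))
    return "<table><tr>{}</tr>{}</table>".format(head, "".join(body))
```

Values are escaped before being placed in cells, so data containing characters such as `<` cannot break the generated page.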

Figure 3: HTML pages displaying query results

HTML pages containing query results are dynamically generated using the same CGI script that generated the query form, this time responding to "get tuples" or "get objects" messages. Similarly, HTML pages representing schema information are generated by sending the CGI script the message "show metadata".
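A single script that answers several kinds of messages can be organized as a small dispatcher. The sketch below is an assumption about how such dispatch might look in Python; the parameter name `message`, the handler signatures, and the status strings are hypothetical, and only the three message names come from the text.

```python
from html import escape
from urllib.parse import parse_qs


def handle_request(query_string, handlers):
    """Dispatch a request to the handler registered for its message.

    query_string: the URL-encoded request parameters.
    handlers:     message name -> callable taking the parsed parameters
                  and returning an HTML page as a string.
    Returns (status, page).
    """
    params = parse_qs(query_string)
    message = params.get("message", [""])[0]
    handler = handlers.get(message)
    if handler is None:
        # Escape the message so it cannot inject markup into the page.
        page = "<p>Unknown message: {}</p>".format(escape(message))
        return "400 Bad Request", page
    return "200 OK", handler(params)
```

A caller would register one handler each for "get tuples", "get objects" and "show metadata", so that adding a new page type only requires adding a dictionary entry.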

Query results are returned and displayed all at once (up to a cut-off value), instead of showing n objects at a time. This approach cannot be avoided: OPM views are built on top of relational databases or structured flat-file systems, where query evaluation usually involves joining several tables and/or may be followed by post-processing of the query results, so that selecting "n objects at a time" cannot be pursued. A cut-off value is set in order to prevent loading too many objects, which could potentially overwhelm a Web browser.
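Applying a cut-off to a result stream is simple to sketch. The function below is an illustration only; the name `apply_cutoff` and the choice to report truncation with a flag are assumptions, not a description of the OPM tools.

```python
from itertools import islice


def apply_cutoff(results, cutoff):
    """Return (kept, truncated).

    kept holds at most `cutoff` results; truncated is True when at least
    one further result was available and has been dropped.
    """
    iterator = iter(results)
    kept = list(islice(iterator, cutoff))
    # Probe for one more element; a private sentinel avoids confusing
    # a legitimate None result with exhaustion.
    sentinel = object()
    truncated = next(iterator, sentinel) is not sentinel
    return kept, truncated
```

Reporting the truncation flag lets the generated page tell the user that the displayed table is incomplete.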

Concluding Remarks

In this paper we have described a suite of tools that provide advanced querying mechanisms for biological databases in the framework of the Object-Protocol Model (OPM).

A large amount of work exists in the area of graphical query interfaces to databases. Many vendors offer graphical query interfaces, such as Access and Paradox, to relational DBMSs. However, these interfaces do not support object-oriented views of the underlying databases, or access to non-relational databases. Various graphical query interfaces for object-oriented or semantic databases have been developed as part of research projects; a survey of such interfaces is provided in (Batini 1991). A variety of paradigms for formulating queries and browsing results underlies these interfaces, and each interface is based on a trade-off between expressivity and ease of use.

The paradigm underlying our query tools, of constructing query trees and generating Web based query forms, as well as the ability of these tools to be used for accessing databases on the Web is, as far as we know, unique. Our tools provide a natural and relatively simple, yet powerful, extension to an already well known interface, the Web form. This strategy has proved successful in gaining acceptance for our tools in a large community of non-expert users of scientific (molecular biology and physics) databases. Further, our query tools can be used both for directly accessing databases and for constructing and customizing (fixed-form Web) database interfaces.

Work on generic Web-based query tools, that is, tools which are not tied to a specific database schema or underlying DBMS, is much more limited. The Genera system developed by Letovsky⁵ allows HTML query forms to be generated automatically from an object-oriented (e.g., OPM) schema. Genera provided both the inspiration and the incentive for our query tools to go beyond single-class and predefined-structure forms, by supporting the dynamic construction and generation of forms spanning multiple classes.

The MOBIE system, developed at Stanford as part of the TSIMMIS project (Hammer, Aranha, & Ireland 1996), provides support for browsing object-oriented query results using hyperlinks for navigating complex, deeply nested data structures. MOBIE does not address the problem of formulating queries on the Web and is designed to support the TSIMMIS data model of semi-structured data, that is, data without a schema. Consequently, this system does not provide support for interpreting and understanding query results, for example, via links to schemas and database documentation.

We plan to extend our query tools in two areas. First, we are experimenting with alternative visual paradigms suggested by our users for formulating queries and browsing data. For example, we have built a prototype Java tool that allows both query construction and data browsing through the familiar metaphor of files and folders. Second, we continue to extend our existing query tools in order to enhance their query construction and interpretation capabilities.

Acknowledgements. The work presented in this paper was carried out while the authors were affiliated with the Lawrence Berkeley National Laboratory, with support provided by the Office of Health and Environmental Research Program of the Office of Energy Research, U.S. Department of Energy under Contract DE-AC03-76SF00098. This paper has been issued as technical report LBNL-40340.

References

Batini, C. e. a. 1991. Visual query systems. Technical Report 04.91, University of Roma.

Cattell, R. G. G., ed. 1996. The Object Database Standard: ODMG-93. Morgan Kaufmann Publishers.

⁵ Genera: A Specification Driven Web/Database Gateway, http://gdbdoc.gdb.org/letovsky/wgen.html

Chen, I. A., and Markowitz, V. M. 1995. An overview of the Object-Protocol Model (OPM) and the OPM data management tools. Information Systems 20(5):393-418.

Chen, I. A.; Kosky, A. S.; Markowitz, V. M.; and Szeto, E. 1996. The OPM query language and translator. Technical Report LBL-33706, Lawrence Berkeley National Laboratory. http://gizmo.lbl.gov/opm.html.

Chen, I. A.; Kosky, A. S.; Markowitz, V. M.; and Szeto, E. 1997. Constructing and maintaining scientific database views. In Proc. of the 9th Int. Conference on Scientific and Statistical Database Management.

Etzold, T., and Argos, P. 1993. SRS, an indexing and retrieval tool for flat file data libraries. Computer Applications in the Biosciences 9(1):49-57.

Fasman, K. H.; Letovsky, S. I.; Cottingham, R. W.; and Kingsbury, D. T. 1996. Improvements to the GDB human genome data base. Nucleic Acids Research 24(1):57-63.

Hammer, J.; Aranha, R.; and Ireland, K. 1996. Browsing object databases through the web. Technical Report 237, Stanford University.

Kosky, A. S.; Markowitz, V. M.; Chen, I. A.; and Szeto, E. 1998. Exploring heterogeneous biological databases: Tools and applications. In Proc. of the 6th International Conference on Extending Database Technology.

Markowitz, V. M.; Chen, I. A.; Kosky, A. S.; and Szeto, E. 1997. Facilities for exploring molecular biology databases on the web: A comparative study. In Altman, R. e. a., ed., Pacific Symposium on Biocomputing, 256-267. World Scientific.

NCGR, The National Center for Genome Resources. 1997. Genome Sequence DataBase (GSDB) 1.0. http://www.ncgr.org/gsdb/gsdb.html.

Shuler, G. D.; Epstein, J. A.; Ohkawa, H.; and Kans, J. A. 1997. Entrez. In Doolittle, R., ed., Methods in Enzymology. Academic Press, Inc. See also http://www3.ncbi.nlm.nih.gov/Entrez/.
