VNS Retriever: Querying MEDLINE over the InternetVNS Retriever is a central resource server that acts as an agent between the user and the NLM via asyn-chronous electronic message

VNS Retriever: QueryingMEDLINE over the Internet

Kevin Brook Long, Jerry Fowler, stan Barber - Baylor college of Medicine

ABSTRACT

Academic medical centers around the country are developing networking infrastructuresand connecting to the Internet. Baylor College of Medicine is developing VÌrlS Retriever, anarchitecture for comprehensive handling of our institution's database requests. The firstimplemented instance of this architecture is the MEDLINE Retriever, a tool to query theconsiderable citation database of medical literature at the National Library of Medicine.Response times that we have experienced with the MEDLINE Retriever have pleased us andimpressed our user community. The system will work for small sites, but is extensible foruse on large campuses as well. The MEDLINE Retriever uses the Corporation for NationalResearch Initiatives' ABIDE gateway. The principal user interface employs Motif and the XWindow System.

Introduction

The information consumer in the academicmedical center environment is faced with a moderndilemma; as electronic, networked informaticsresources become more common, users are presentedwith a disparate collection of interfaces, mostlydesigned independently as islands of informationwhose only integration with each othe¡ is the coin-cidence of existing on the same network. Althoughthe presence of high-speed networks might appear tobring this electronic information nearer the user, it isinstead often un¡eachable behind a wall of technicaldetail. We are attempting to develop and apply infor-matics architectures that more seamlessly integatethe information from certain classes of theseresources into the information user's environment.Specifically, we have applied the concept of the lay-ered architecture and the intelligent retrieval engineto facilitate interaction with networked bibliographicresources.

At Baylor College of Medicine, our participa-tion in the National Library of Medicine's IntegraiedAcademic Information Management System (IAIMS)program has involved exploring ways to applyadvancements in computing and informatics to theproblems of biomedicine. rrly'e have developed anintegrated networked environment for coord-inatinghuman, computer, and informatic resources. Thecenterpiece of our work is the Virtual Notebook Sys-tem (VNS), an electronic analogy to the traditionalpaper notebook [1,2,3]. The VNS is the realizationof an architecture that supports the collaborativeresearch environment. The work described here is anatural extension of this effort, a step closer to thelonger-range goal of establishing intelligent,networked information clearinghouses.

Recognizing the growing demand in ourbiomedical research community for rapid access toremote electronic databases, both bibliographic andotherwise, we have devised VNS Retriever, an archi-tecture to facilitate user access to a variety of localand remote electronic databases. The cornerstone ofVNS Retriever is a central resource server that actsas an agent between the user and the NLM via asyn-chronous electronic message passing. At the front-end, we have implemented a user interface writtenusing the X-Window System and the OSF Motiftoolkit so that many users of a wide variety of plahforms can acc€ss this new seryice [4,5]. As a bãck-end connection to the MEDLINE database, ourresource server makes use of the ABIDE eatewavand Knowbot Operating Environment develõped bythe Corporation for National Research Initiatives(CNRD [6].

This paper discusses VNS Retriever, a layeredarchitecture for wide-area database query andretrieval, and our fust implementation of VNS Retri-ever, the MEDLINE Retriever. In the next section,we amplify on the motivations for this project.Thereafter, we describe the architecture of VNSRetriever, including the communications interfacesbetween components. Then we discuss the imple-mentation of the MEDLINE Retriever in order toclarify the structure of the central components ofVNS Retriever. We then discuss administration ofthe MEDLINE Retriever, and present some simpleperformance analysis, before discussing furure wõrkand drawing conclusions.

Motivation for VNS Retriever

The MEDLINE Retriever began as an effort tofacilitate Internet acc¿ss from networked worksta-tions to'the National Library of Medicine's MED-LINE facility. It has become the embodiment of anarchitecture to treat bibliographic information in a

Summer '92 USENIX - June 8-June 12,lgg? - San Antonio, TX 81

VNS Retriever: Querying MEDLINE over the Internet

consistent fashion, and to assist the user in overcom-ing many of the technical details that often impedebibliographic research. At a more general level,VNS Retriever is an a¡chitecture that attempts to add(among other things) systemized resource characte¡i-zation, automated query generation, and intelligentfiltering mechanisms to the overall confines of theVNS architecture, which by contrast deal more withthe integration, annotation, organization and dissemi-nation of information.

We anticipate that VNS Retriever will beapplied to a variety of resource types, includingbibliographic (such as is addressed by the MED-LINE Retriever), genetic (such as the GENBANKRetriever), programmatic, textual, video, etc. TheMEDLINE Retriever is our fi¡st instantiation of thisarchitecture. The advantages that the MEDLINERetriever brings to MEDLINE access are multiple:

o Database access software can be maintainedc€ntrally and singly for all networked users ata given site, allowing updates to search stra-tegies, the MeSH vocabularies, accessmethods, etc. to be implemented immedi-ately. This greatly simplifres the distributionof software and documentation updates.

. Users need only an X-capable display and aconnection to the network to begin using VNSRetriever. It is no longer necessary for a userto procure a modem, a copy of the accesssoftware itself (such as Grateful Med [7]), alist of network access numbers, and perhapseven a personal NLM account.

o Users of hundreds of other presently unsup-ported computing confrgurations would beable to gain access to MEDLINE.

o This centrally-maintained, globally-availableaccess to the NLM's resources could help toeliminate an institution's need to acquire alocal version of MEDLINE.

o Users who travel frequently can maintainaccess to their queries and citations from any-where on the network. This is consistent witha model that allows the entire network tobecome a support mechanism for the user, notjust one machine.

o Institutional database usage patterns could bemonitored centrally with the goal of under-standing and providing better for the needs ofthe institution's research community and pos-sibly optimÞing access methods for the mostused resources.

o Frequently in the academic environment, onlyby enriching the nehvorked environment willusers be attracted to a more distributed com-puting model; we feel that the MEDLINERetriever adds value to institutionalized neþworking.

Long, Fowler, ...

Architectur¡ Of VNS Retriever

The architecture created to accomplish this taskis not simply a MEDLTNE-specifrc interface. Logi-cally speaking, the VNS Retriever architecture has 4layers, as depicted in Figure 1. User interfaces com-municate through an application program interfacewith a user agent called the query manager. Thequery manager maintains the user's environment,storing queries and presenting thefu results in themanner desired by the user. The user agent, in turn,submits queries to a server agent called the resourc¿server. The resource server selects the appropriateresources for each query, translates the queryappropriately for the resource, and initiates aresource query engine to communicate with theresource. At the bottom layer are the query enginesthemselves, which are specialized to communicatewith the individual resources.

Communication within the architecture necessi-tates the use of a standard format for queries. Weuse what we refer to as our canonical qüery form todefine the logical structure of a query and itsresponse. The purpose of the canonical form is toprovide a common internal structure for use in map-ping terms as displayed by a user interface intoterms appropriate to diverse databases. For example,where MEDLINE's Elhill query language uses theindication "(AU)" to identify queries on authornames, another local database in common use uses".au" and the ANSI standard 239.50 specificationuses "ua". On the other hand, since there is no rea-son to be cryptic in displaying this information tothe user, we display the term as "Author" (the substi-tution of some language other than English for theseterms would be straightforward).

Figure 1: Logical Layout of VNS Retriever

82 Summer '92 USENIX - June 8-June 12, L992 - San Antonio, TX

Long, Fowler, ...

The canonical form also considers the type ofdata to be retrieved. Among the types that weintend to provide in addition to ASCII text areimages (of numerous types) and possibly structureddata such as those from gene sequence databases.The data recoverable by the MEDLINE Retriever areall textual data, so the only type currently supportedis ASCII strings.

The query and presentation terms we use arebased on the 23950 standard.

A physical view of the VNS Retriever architec-ture is found in Figure 2. Numerous invocations ofthe various user interfac¿s communicate one-to-onewith one of several query managers running at theinstitution. Each query manager serves a specificwork group of users related to some administrativeentity within the institution; such an entity would

. typically be a grant-funded research group.


to address a pet peeve in our user community; thatis, the need to enter a user id and password for everyinvocation of a controlled-access program. Accessto the resource server is restricted to a set of author-ized machines. Access to individual machines iscontrolled by the machine's protection mechanisms.Accesses from specified machines can be further res-tricted to a subset of the available resources bymeans of a resource management table. Our designanticipates that installation on our campus wouldhave a single resource server providing service to theentfue organization.

The resource server's role in database retrievalis to provide a common interface to the diverse setof resources that it can support, allowing a singleuser interface to access all the supported resourcés.To this end, it supports requests to enumerate thedatabases it serves and to list the query and presen-tation terms valid for each. When the resourceserver receives a search request, it stores the requestin its internal database, which is a simple tree withbranches for each work group. The server thenanalyzes the request to determine which database toquery. At this point it performs whatever accountingand access control are appropriate to the database,and then translates authorized requests into theappropriate terms for that database. Finally, thequery engine for that database is invoked through aset of library routines that launch the query engine ina synchronous exchange, and then retrieve its resultswhen they return asynchronously some time later.

After a query engine returns a response, theresource server notifies the appropriate client querymanager. When the client has acquired the resultdata and acknowledged its receipt, the query and theresponse are deleted from the local database. Anycost accounting to be done for a given database isperformed at this point.The Query Manager

The user agent is called the query manager. Itsadministrative role is to provide middle- to long-term storage of queries and results for its workgroup, thus placing the responsibility for disk utiliza-tion on each work group, whích runs its own querymanager.

The query manager provides the user interfacewith a set of mappings to translate queries from userdisplay form into canonical form, and to translateresults from canonical form to the user-prefereddisplay. It also provides the user interface with thelist of available database resources and their queryterms. The query manager connects to the resóurceserver to submit queries. The query manager holdsresponses it receives from the resource server untilthey are deleted by request of the end user via what-ever front end the user may use.

Figure 2: Physical View of the VNS Retriever

Each query manager in turn communicates withthe single institution-wide resource\ server; theresource server is then able to invoke onÞ or more ofseveral resource query engines, each of which mayserve more than one database.

The resource server and query manager are thetwo central pieces of the architecture. W{ shalldescribe them first, and then discuss interotocesscommunication within the system. Since túeluserinterface and resource query engines are implefun-tatio-n dependent, we defer discussion of them to \rdetailed description of the implementation of thèMEDLINE Retriever.The Resource Server

The server agent is called the resource server.Its administrative role is to centralize institution-wide access control and accounting. This permits us

Summer '92 USENIX - June 8-June 12,1rgg} - San Antonio, TX 83

VNS Retriever: Querylng MEDLINE over the Internet

In addition to its intermediary role in databaseretrieval, the query manager maintains the userprofiles for its clientele. These profiles describewhat information the user wishes to retrieve as adefault, and how it should appear. They alsodescribe translations, as well as the filters and desti-nations to be used for exporting results to otherplaces.

The query manager maintains little state con-cerning on-demand search requests. The querymanager knows when it has issued a search requestto the resource server, and provides a means todelete outstanding requests It depends on theresource server to maintain other information aboutthe state of an on-demand search. The querymanager does, however, keep track of scheduledqueries in more detail. Selective Dissemination ofInformation (SDI) is the name that the NLM gives tosearches that are stored c¿ntrallv at the NLM data-base for incremental execution on a regular basis, forexample, monthly when the database is updated.The fact that the query manager permits queryscheduling makes the storage of SDI queries at thequery manager possible, reducing the burden on thecentral database and opening the door for regularscheduling of intelligent wide-area database searcheswhen such searches become viable. It is the querymanager's job to queue scheduled queries and sub-mit them to the resource server at the appropriatetime.

Interprocess Communication

There are several levels of interprocess com-munication within VNS Retriever. Communicationbetween the upper levels of the system relies on theuse of a canonical form for the description of queryterms and results. We shall discuss in turn the com-munications between the query manager and its userinterfaces, between the query manager and theresource server, and between the resource server andits query engines.The Query Manager Interface

The query manager communicates with its userinterfaces through an application program interfacewhose standard queries include user profile handlers,search manipulation, and administration, as well as asearch-result callback.

The functions Query-Profile and Modify-Profileallow the user to list and alter default settings for the23950 terms Large-SeþLower-Bound and Small-Set-Upper-Bound (the maximum number of matchingrecords for which the user deems it is worth return-ing any data, and the largest number of records to bereturned from a successful search, respectively) aswell as desired presentation terms to be returned bya search, such as author, title, journal name, etc.This is also the place in which the cost center forcharge accounting is specified. To facilitate

Long, Fowler, ...

exporting result data from VNS Retriever to otherprograms, in particular the VNS, the query managersupports the functions Query-Export and Modify-Export, These functions allow the user to manipu-late defaults for data export from VNS Retriever.Groups also carry default settings for profile andexport defaults, so that a new user automaticallyadopts the standards of the group to which it isadded.

The function List-Searches returns a list ofsearches currently maintained for the user by thequery manager, and List-Results returns the resultsfor a given search, if they are available. Searchescan also be renamed or deleted. A "force" flag inthe delete function allows a search to be deletedirrespective of whether it is currently outstanding atthe resource server.

The function SubmiþSearch has th¡ee options,submit on demand, submit scheduled, and submit tosave. Saving a search simply records the contents ofthe query at the query manager so that it may beretrieved for later modification and submittal. Asearch may be scheduled for a one-time search at alater time, or for regular submittal at a restricted setof intervals such as hourly, daily or monthly(although there is no point at the current time insearching daily or more frequently, because updatesto MEDLINE are not that frequent, it is our inten-tion to incorporate local databases such as collo-quium notices, for which more frequent regularsearch would be appropriate).

Status queries provide information on thecurrent state and availability of results for the namedsearch. Administration requests include the abilityto add or delete users or groups, to request that thequery manager quit gracefully, and to requestchanges in the type of information recorded in thelog file.

Since the return.of results from the resourceserver to the query manager is an asynchronousevent, a user interface can register a result callbackso that it can be informed in a timely fashion of thereturn of a search, without the need to poll the querymanager.The Resource Serser Interface

The query manager behaves as a client of theresource server, sending synchronous requests. Theresource server recognizes three basic types ofrequest: search, status, and administration.

The resource server's search protocol is basedloosely on the 2,39.50 standard. The Init and Init-Response Application Protocol Data Units (APDU's)are used to establish a connection between a querymanager and the resource server. The Search APDUis used to initiate a search, and a Search-ResponseAPDU is used to acknowledge the succ¿ssful receiptand launch of the search, using a return code notspecifred by the 23950 standard to indicare this

84 Summer '92 USENIX - June 8-June 12,1992 - San Antonio, TX

Lnng, Fowler, ...

initial success. Because the initial response to thesearch APDU is effectively an acknowledgment ofthe request, the query manager is free to carry onwith other tasks instead of blocking synchronouslyon the search. This deviation from 23950 protocolenables the query manager to multiplex on severalclients potentially issuing multiple requests perclient.

Upon receipt of search results from the queryengine, the resource server issues an announcementto the client query manager by means of an asyn-ch'¡onous Search-Response APDU. A fixed-size por-tion of the result data is piggy-backed wirh thisannouncement. This piggy-back packet is sufficientto return the enti¡e response in many cases. Shouldthe response contain more data than will fit in thisannouncement, an indicator is set to show that moredata remain to be retrieved. It is then up to the querymanager to request and retrieve the remainder of thedata by means of Present and Present-ResponseAPDU's.

There are certain resource server functions thatlie outside the 23950 protocol. These includeadministrative functions, including addition and dele-tion of users and groups (although the identificationof a user is not strictly necessary to the resourc¿server's function, it provides a parallel structure tothat found in the query manager's internals).

Status queries return information on the state ofthe server, as well as specific information aboutgroups users, or queries. An additional status queryreturns the list of resources supported by theresource server. In our initial implementation, this isa trivial response, since the MEDLINE Retrieversupports only one database.The Query Engine Interface

Communication between resource server andquery engine is by means of library calls. It is thequery engine's responsibility to provide all commun-ication with the database(s) it serves. Each queryengine interface requires a set of translations fromthe canonical form to the query engine's nativequery language (although the query engine'slanguage may simply be the query language of thedatabase it supports, this need not be the case for amore general query engine, such as the Knowbotquery engine), as well as the routines to initializeand launch a query engine instance; to initialize aresponse retrieval and to drive data retrieval; and todestroy the query engine instance.

The MEDLINE Retr.iever

The fust instantiation of VNS Retriever hasbeen the MEDLINE Retriever. This svstem hasbeen tested and used by members of the Baylorresearch community since February, 1992. TheMEDLINE Retriever takes advantage of the ABIDEgateway and Knowbot Operating Environment

VNS Retrlever: Querying MEDLINE over the Internet

developed by David Ely of CNRI [8]. The Knowborquery engine is used to provide access toMEDLARS, the host for the National Library ofMedicine's on-line bibliographic database. We shallfirst briefly describe the user interfaces for the sys-tem: Then we shall discuss the internals of theresource server and query manager and brieflydescribe the behavior of the Knowbot query engine.User Interface

The user interface for the MEDLINE retrieverdoes not reflect all of the generality described for theVNS Retriever architecture, but it provides a basisfrom which to understand the system, as well as afully functional MEDLINE query tool that is easilyintegrable into other environments such as the VNS.

At the top layer we have demonstrated severaluser interfaces. There is a command-line interfacethat reads queries from files created by the user, andthen formats the results to standard output. Anotherinterface uses a simple form-driven X-Windowsfront end to the command-line interface, and formatsthe results into another window that permits simpleuser browsing through the returned citations.

The interface to which we have paid the mostattention is a menu-driven X-Windows interfacewritten using the Motif toolkit. This front end facili-tates more sophisticated handling of queries andresults, including creating queries using MeSH,querying for the status of outstanding searches andexporting selected citations to the Virtual NotebookSystem, as well as to order photocopies of desiredcitations direct from the Houston Academy ofMedicine-Texas Medical Center Library. This inter-face is described in more detail elsewhere [9].Server Structure

Both the resource server and the query managermaintain a mirror of their internal trees on stablestorage. This allows both servers to maintain theirstate across invocations by reading their respectivestorage trees found on disk to build their internaltrees at start-up. When the resource server initial-izes, it examines this tree for searches that have notyet received a reply from their query engines; itresubmits such searches, since the return address forthe query engine instance associated with the oldrequest is now invalid. When the query managerinitializes, it builds a queue of scheduled and regularsearches that it finds on disk, and sets a time-out forthe shortest interval found among those queued.

We have chosen to store the data on disk inASCII frles, although we could substantially reducestorage by use of binary data. This permits visualinspection of the storage tree to help detect systemproblems, and simplifies debugging. Duplicating theinformation on disk requires frequent file writes thatcertainly have some impact on performance. Wecould improve this performance in at least onerespect: Our current implementation employs a

Summer '92 USENIX - June 8-June L2,IggZ - San Antonlo, TX 85

VNS Retriever: Querying MEDLINE over the fnternet

UNIX file system, creating a directory for eachgroup, user, and query. Replacing this mechanismwith a common database manager such as ndbmwould probably signifrcantly improve our perfor-mance, at the expense of losing the ability to inspectthe storage tree visually. This is a problem that wewill address again when scaling becomes a greaterissue than it is now.Communlcation Protocols

Figure 3 shows the interprocess communicationpaths within the MEDLINE Retriever. Each level ofthe system employs a conventional client-servermodel for most transactions. At each level, anupcall or callback is also necessary to provide asyn-ch¡onous notification. The query manager usesSunRPC [10]. Below that, TCP/IP network socketsare used.

L]ser Interface

E /ã,

ë /€ð

Query Manager

Ë Ë t

\ È /.Èt

ð

Resor¡rce Server

- Ê r I

Ëg=\ + { ¡¿É€

c5

Query Engine

Figure 3: MEDLINE Retriever Interprocess Com-munication

The Resource ServerThe connection between query manager and

resource server employs BSD network sockets, usingTCP/IP. The resource server opens listening socketson one front end port and one back end port. Thedriving loop maintains one global socket, as well asone socket per active client, one initiator socket perquery engine, plus one socket per active query.

When a query manager initializes a connectionto the resource server, the resource server sets ashort time-out after replying. When the time-outexpires, the resource server examines its internal treefor outstanding responses belonging to the newly

Long, Fowler, ...

connected query manager. Any such responses havenever before been successfully reported to the querymanager, so at this point they are dispatched to thequery manager with the asynchronous Send-ResponseAPDU. This greatly simplifies connectioninitialization.

The Query ManagerThe query manager's internal structure has

much in common with the resource server. In fact,thei¡ executable code links a common library formanipulating the internal storage tree.

Connections between the query manager andthe user interfaces utilize SunRPC for interprocesscommunication. To some extent the use of two dif-ferent implementations for our communicatíons pro-tocols represents an experiment. We chose to useSunRPC in order to reduce the overhead caused bvthe use of TCP/P, but let SunRPC simplify the han'-dling of UDP connections. As a serendipitous effect,this permits us transport independent communication.

Although other RPC mechanisms are available,we chose SunRPC without examining other alterna-tives simply because it was freely available on ourinitial target platforms. We have found SunRPC tobe congenial, although development \¡/as moredifficult than it might have been because there is nochecking for null string pointers in the external datarepresentation library.

Because the query manager is effectively in themiddle of a pipe, acting as a server on one side andas a client on the other, it cannot use the SunRPCsvc_runQ call to drive its loop. The query manageropens its front end by means of SunRPC initializa-tion, and also opens a back end socket with which toconnect to the resource server. The driving loopsimply deals with the SunRPC communications frrst,thereafter responding to any asynchronous responsesfrom the resource server.

Because of the differences in treatment ofSUnRPC between the procedural model of a 'rnor-mal'r UNIX program and the event-driven model ofan X Windows application, there are two slightlydifferent application program interfaces availablethat handle the two cases.

To handle regularly scheduled SDI searches,the query manager builds a queue, and sets an alarmfor the shortest interval found among the scheduledsearches. The action taken at alarm time is simplyto set a flag to indicate the elapse of the interval.When the query manager falls to the bottom of itsdriving loop, it tests this flag and invokes the queuehandler if set. This avoids the possibility of incon-sistency in the internal tree, because there is noaccess to internal data structures during the execu-tion of the asynchronous alarm handler.

86 Summer '92 USENIX - June 8-June 12,1992 - San Antonio, TX

Long, Fowler, ...

Help ServerAdditionally, the system provides a help ser-

vice. On-line help is available for most objects inthe graphic user interface. In order to provide max-imum flexibility and host independence, this help isprovided by remote procedure calls to a help service.Help requests are keyed by program and functionname. The contents of the file corresponding to thekey are retumed to the requester as an ASCII textobject. The requesting interface can display theinformation as it sees fit (button-specific helpfilesprobably make sense for only one interface, ofcourse). This service is isolated from the querymanager in order to permit the support of multiplerelated programs out of the same help service.

Query Engines

Although we anticipate several distinct queryengines, we currently support only the KnowbotOperating Environment.The Knowbot Operating Environment

In the specific case of the MEDLINE Retriever,the query engine is provided by the pioneering workdone by David Ely at CNRI in crearing the ABIDEgateway to MEDLARS and the Knowbot OperatingEnvironment. This gateway receives appropriatelyconstructed Knowbot queries, validates and executesthem and return the results to the Kno'ffbot construc-tor that sent them.

Our collaboration "'rith

CNRI on the VNSRetriever project has provided CNRI with an oppor-tunity to shake down the concepts involved in theKnowbot Operating Environment. At the same time,it has allowed us to work directly on creating anenvironment for campus use that could be replicatedin other campus environments where access to cen-tralized databases is important.

The ABIDE gateway was originally created byCNRI as a demonstration project for the NLM. Itsuser interface closely duplicated the functionality ofthe MS/DOS version of Grateful Med, but providedno query management features or asynchronousaccess. Through this cooperative effort, we wereable to make use of the back end functionality whileenhancing the power of the user interface.

The ABIDE gateway n¡ns on a Sun-4 machinelocated at the NLM. .It serves to connect the IBMenvironment on which the MEDLINE database isoperated to the Internet.

The transactions between the resource server(via the Knowbot library) and the ABIDE Garewayuse TCP/P. As described above, the Knowbot queryengine is launched in one synchronous action. Thelaunch of the Knowbot query includes a returnaddress to which the ABIDE gateway will subse-quently respond asynchronously. Each query from agiven MEDLINE account is queued at rhe ABIDEgateway and performed sequentially.

YNS Retriever: Querying MEDLINE over the Internet

Our use of the Knowbot Operating Envi¡on-ment does not exploit its ability tõ return data to areceiver on a different host from the launcher. Thisdesign decision is based on our desire to controlaccess and perform accounting centrally.Other Query Engines

In the case of our first implementation, whichinvolves internet communication, initialization andlaunch comprise the first step of the search process,and at this point the resource server reports a suc-cessful launch to its client. Thereafter, the serverresponds to activity on the retrieval socket by invok-ing the query engine's data-retrieval function. Anti-cipated future query engines that involve local areaaccess may have essentially null query retrievalfunctions, since it is believed that data response willbe quick enough to incorporate in the launch phase.Such queries will also bypass the stage involvingrecording on stable storage for the sake of speed.

Among the query engines that we are consider-ing for addition are a Wide A¡ea Information Service(WAIS) engine, and an RPC-based Unifred MedicalLanguage System (UMLS) engine [11].

Administration

Administration of this system occurs at twolevels. The Query Manager and the Resource Servereach have separate administrative requirements andcan be administrated sepatately as needed. Thearchitecture provides for one Resource Server persite and one or more Query Managers.Resource Server Administration

Administration of the Resource Server involveschecking accounting information for each requestsubmitted. In order to give system managers at dif-ferent sites local control over how the resourceserver does accounting, we provide a hook for aremote procedure call to an external server. Thisserver simply returns zero for access accepted andnon-zero for access rejected. The server does its per-mission checking by reading a script written by thesystem manager to suit the needs of the individualinstitution.

Additionally, the Resource Server verifies infor-mation supplied by the Query Manager about thegroup or the user submitting the query to insure thatthe group or user is authorized to make use of theresource to which the query is directed. If the user orgroup is authorized, the query is processed. If not,the Resource Server returns an ACCESS DENIEDreply in response to the query. Password manage-ment and all other resource access information isstored in the Resource Server access database.Query Manager Administration

Management of the Query Manager centers onthe disposition and storage of queries and queryresponses. Most of the management of the euery

Summer '92 USENIX - June 8-June t2, L9g2 - San Antonio, TX 87


Manager occurs with each user authorized to makeuse of it. Individual users enter queries and deter-mine how to store responses.

However, there are certaín matters that requirean administrator. Authorizing users to make use of aparticular query manager, removing old users(including purging any queries or query responsescreated by those users) and providing a default userprofile must be done by the Query ManagerAdministrator.

Performance and Evaluation

The performance of the MEDLINE Retrieverhcs impressed our user group. Response times aregood from half a continent a\¡/ay. Although the sys-tem has many good qualities, there is ample roomfor improvement.

System PerformanceIn a battery of timing tests we ran using the

command-line user interface, we measured both thetime required by the resource server, and the roundtrip time from command-line user interface toMEDLARS and back. At the resource server, theaverage response time (from initialization of queryengine instance to completion of data retrieval) was11.7 seconds for a seæch that returned 10 citations(author, title, and source only) from a query involv-ing text search of two words. The average responsetime wæ 18.5 seconds for the same search to return200 citations, which required transmission of about70 Kilobytes of data (we impose a limit of 200 onSmall-Set-Upper-Bound, the maximum number ofrecords to return, which is a compromise betweenthe belief that we share with some librarians that onegenerally does not wish to examine any results at alluntil one has pared the result set down to a largehandful, and the use-patterns and desires of many ofour scientific users, who do not mind browsing evenhundreds of titles in search of relevant citations).

The round-trip times at the command line inter-face were less satisfying. The average response timein the case of returning 10 citations was 15.9seconds, an average four second overhead for thesystem architecture. Returning 200 citations took anaverage of. n.4 seconds, an average 9 second over-head. Naturally, virtually all of this overhead isclock time, not CPU; unfortunately that is what theuser perceives. Command line response time isaffected by the times incured by both querymanager and resource server in polling their socketsduring selectQ calls, as well as time required toupdate status files. We intend to eliminate thecauses of this excessive overhead in the near future.

During normal use, the average response timefor a successful search was 16 seconds. Mostsearches Q92 of.324) required less rhan 30 seconds

Long, Fowler, ...

to return. Of those, the average was 9.6 seconds.The longest successful search required 5 minutes 23seconds.

The average time required to initialize a queryengine instance was 1.2 seconds, which involvesopening a socket on the local host and establishing aconnection to the ABIDE gateway. The maximuminitialization time was 9.9 seconds.

The average cost of these searches as reportedby the ABIDE gateway was $2.30. A simple searchthat returns little or no data costs around a dime, butany attempt to retrieve larger amounts of data, eitherto retrieve abstracts, at around 3 Kilobytes apiece, orto retrieve numerous citations, causes the charge toincrease rapidly. This has raised the issue of how toaccount for MEDLINE searches. The cost schedulethat is currently used by the NLM is perceived asprohibitive by our users.Evaluation of the System

Overall, we are pleased with the results of ourdesign. One question that remains to be resolved iswhether a single resource server will scale well forfull-scale use at a large institution. Baylor has aboutforty specialized research centers comprising aroundeight hundred laboratories with several thousandscientists participating in research, and anotherseveral hund¡ed involved primarily in education. Amore detailed analysis of response time under aheavy load is needed to determine whether a singleresource server can in fact support the whole com-munity at Baylor.

Another potential problem with respect toMEDLINE specifically is that queries issued on asingle account are queued. Since our c-urrent prac-tice is to issue all queries on one account, this canadversely affect users' response times. Although ouractual deployment policy is as yet undetermined, wemay choose to employ multiple MEDLINE accountsto circumvent this situation if it actually causesdifficulty.

With regard to our decision to logically isolateour resource server from the query engines, there areadvantages and disadvantages. On the one hand,much of the knowledge necessary to access diversedatabases exists already in the Knowbot OperatingEnvi¡onment, and our logical separation from thatenvironment represents some duplication of effort.On the other hand, use of our own canonical queryform permits us to translate to and employ otherquery engines with minimal effort; it also permits usto consider the use of lighter weight mechanisms forinherently small local databases, such as"bcm.seminars", the college's electronic colloquíumbulletin board. Overall, this logical isolation pro-vides ,a flexibility that is a greater boon than a hin-drance.

88 Summer '92 USENIX - June t.June 12,1992 - San Antonio, TX

Long, Fowler, ...

We chose to use our own canonical form inpreference to SQL partly from convenience, andbecause many resources which we wish to access arenot in a relational form for which SQL is well-suited. Similarly, we do not simply employ a free-text-searching algorithm in order to take advantageof the structured nature of many common resources.

As for ow implementation, there are severalthings we will likely change in a future release ofthe system. First, we would like to replace the exist-ing socket-based transport used in the resourceserver with an RPC model. This would take advan-tage of the RPC-based communication protocol thatalready exists in the query manager. Most of the fun-damental requests in both servers are identical instructure, if not in effect, so the parallelism ofdesign would be reflected in reuse of code (the origi-nal specification of the query manager used socket-based communications for that reason).

We had chosen the socket model for theresource server's communications in part to takeadvantage of the 23950 library available fromJhinking Machines Corporation's WAIS project[12]. Since the 239.50 prorocol does nor handle theproblem of asynchrony we have mentioned earlier,there is some disadvantage to adhering to thatmodel. There is, however, substantial benefit to thenatural way in which the SunRPC package wrapsaround our canonical forms.

Finally, it might behoove us to take responsibil-ities away from the user interface. Many of thefunctions that it performs would be better performedby small utility routines that it might invoke in rhecourse of its execution.

We are pleased with the use of SunRpC,although more gracious handling of null stringpointers in the external data representation codéwould be welcome.

Although we considered the use of the Sunlightweight process library, the benefit that it wouldhave provided us in terms of imposing a logicallymulti-threaded design was rendered null by the prob-lems to be overcome in compensating for the factthat a single lighrweight thread blocking on I/Oblocks the entire process. We also had difficultieswith incompatibility in rhe memory allocation.

Related lVork

The NLM has recently promulgated anInternet-capable version of Grateful Med. This is anexcelle¡t system for its purpose, which is to provideuser-frìendly support for MEDLINE accessl how-ever, it exists oi itself, and does not integrate weltwith the goals of IAIMS at Baylor.

We also wish to reiterate our respect_ for thework of CNRI, whose Knowbot Operating Envi¡on-ment is the foundation of our link to MEDLARS atNLM. Their coop- eration has contributed markedly


to our success. The Knowbot query language pro-vides a single 239.50-like query inrerface to eacñ ofthe databases supported by ABIDE. It is an idealback end for our work, because there is great poten-tial for the expansion of its powers in order toexpand ours.

MCC's Carnot project is a vastly more ambi-tious system that incorporates both database retrievaland update, employing a transaction shell calledRosette that uses the Actor model for organizingasynchronous distributed database accesses [13].The development of a Camot query engine couldsignificantly extend the power of our system.

The V/AIS project takes a different view ofwide area data retriðval, employing full-text searchinstead of keyword indexing. This is a technologyideally suited to the parallelism of the ConnectionMachine for which it was originally designed. Theuser interface to a WAIS service could be subsumedby our existing user interface.

The University of Minnesota's internetgopher project implements a document delivery ser-vice. It allows users access to various types of infor-mation residing on multiple computers in a seamlessfashion. The interfaces for gopher use a hierarchy ofmenus. Each menu item may provide access to othermenus, files, or gateways to other types of informa-tion servers includes the WAIS environmentdescribed above.

Future Work

The design of MEDLINE Retriever has laid thefoundation for several further development efforts.These include the augmentation of the powers of theback end, closer integration with the VNS, andexpanding support for alternative data types.Back End Support

We intend to expand both the number of queryengines available to the resource server, as well asto include support for several other resources.Among the resources we hope to add are CurrentContents, as well as a service to index colloquiumannouncements and other institutional events. Stand-ing queries for these latter resources can be esta-blished to report upcoming events of interest to eachuser. We hope to deploy a local ABIDE gateway inorder to experiment with the use of Knowbot scriptsfor access to these resources.

Other resources that we wish to add suggest theconstruction of new query engines. One of these isa UMLS server to enhance the power of the userinterface to the MEDLINE Retriever. Anotherwould be a WAIS query engineCloser Integration with VNS

With the ongoing development of version 3.0of the VNS, we have the opportunity to incorporatewide area hypermedia into the VNS. We havedubbed this new model VNS-STAR. The VNS-

Summer '92 USENIX - June 8-June lZ,lgg2 - San Antonio, TX 89


STAR model distinguishes between the page, VNS'straditional vehicle for data organization, and theSTAR-PAGE. Objects on a STAR-PAGE containnot data, but the reference information necessary torecover a specific dataset from anywhere on theinternet, thus trading bandwidth for storage, andincreasing the timeliness of data displayed throughthe VNS.

Another a¡ea of interest is adapting VNS Retri-ever for the retrieval and display of alternative datatypes. This is closely tied to the object orientationin VNS 3.0. VNS Retriever will initially recognizethe types defined by the VNS, particularly images.Any further expansion of VNS Retriever's type setcan be done thròugh the user-definable objecté bf theVNS. Because of VNS Retriever's use of high-speednetworking, it is well suited to take advantage ofimagery and other higher-volume data that will inev-itably become part of the typical bibliographicresource.

Conclusion

The VNS Retriever has served a useful role inhelping us develop an architecture to systemizeaccess to bibliographic and other resources. It is anecessary step towards developing more intelligentnetworked data retrieval tools. As a tool in its ownright, it is performing satisfactorily for the BCMuser community, while at the same time, it is servingto illuminate the challenges that lie ahead in ourdevelopment of more useful tools for the typicalresearcher.

VNS Retriever is representative of many of ourefforts. Even as 'ü/e attempt to create an overallapproach to addressing classes of problems, our usercommunity benefits from the application of a usefultool which addresses a specific need. The VNSRetriever is paving the way for more facile access tothe ever-expanding collection of intemetworkedresources.

Acknowledgments

Support for this work was provided by TheNational Library of Medicine (N01-LM-1-3516 andusPH-02-G08-LM04905).

Thanks æe due to Ed Sequeira of the NLM,and to David Ely and Vint Cerf of CNRI for makingour use of ABIDE possible. In addition, we wish tothank Tony Gorry and the rest of the Baylor IAIMSgroup for their ideas and encouragement. Thanks toCindy Petermann and Barry Meyer of the group fortheir helpful comments on the manuscript.

For more information on the Knowbot Operat-ing Environment, contact CNRI electronically at:knowbot [email protected]

Long, Fowler, ...

References

[1] A. M. Burger, et. al. "The Virtual NotebookSystem," Proceedings of the Hypertext '91Conference, ACM, L991.

[2] G. A. Gorry, et al., "Computer Support forBiomedical Work Groups," Proceedings of theConference on Computer Supported Coopera-tìve Worlç September L988.

[3] C. A. Gorry, et al., "The Vi¡tual Notebook Sys-tem: An A¡chitecture for Collaboration," Jour-nal of Organizational Computing Volume 1Number 3, t992.

[4] R. Scheifler and J. Gettys, The X Wíndow Sys-tem, Digital Press, 1.988.

[5] Open Softwa¡e Foundation, OSFlMotif StyteGui.de, Prentice -Hall, L 990.

[6] R. E. Kahn and V. Cerf, An Open Architecturefor a Dìgital Library System ønd a Plan for ítsDevelopment, The Digital Library Project,Volume 1: The World of Knowbofs, Corporationfor National Reserach Initiatives, Reston, Vir-gina, 1988.

[7] B. Shearer, L. McCann, L. and W. J. Crump,"Grateful Med: Getting Started," Journal of theAmerican Board of Family Practice, Volume 3,Number 1, pp. 35-38, January-March, L990.

[8] David Ely, An Overview of the ABIDE Gate-way System, Technical Report TR-9L-L, Cor-poration for National Research Initiatives, Res-ton, Virginia, L991.

[9] J. Fowler, K. B. Long, and S. Barber "TheMEDLINE Retriever,r' Submitted, 16th AnnualSymposium on Computer Applications in Medïcal Cøre, Baltimore, Maryland, Novembert992.

[10] Sun MicroSystems, lnc., RPC: Remote Pro-cedure Call Protocol Specificaion Version 2;RFC 1057,Network Center (NIC) at SRI Inter-national, Menlo Park, Californi4 June L988.

[11] S. Barber, J. Fowler, K. B. Long, R. Dargahi,B. Meyer, "lntegrating UMLS into VNS Retri-ever," Submitted, 16th Annual Symposium onComputer Applications in Medical Cøre, Bal-timore, Maryland, November 1992.

[12] Brewster Kahle, Wide Area Information Con-cepts, Thinking Machines Corporation, Techni-cal Memo DR-89-1.

[L3]Microelectronics and Computer TechnologyCorporation, MCC Carnot Project Description,Austin, Texas, 1991..

Author Information

Kevin Long is Director of IAIMS Developmentat Baylor College of Medicine. A graduate of RiceUniversity, Kevin has been involved in theIntegrated Academic Information Management Sys-tem initiative at Baylor since 1985, and is VicePresident of The ForeFront Group, Inc., a

90 Summer '92 USENIX - June 8.June 12,1992 - San Antonio, TX

Iang, Fowler, ...

technology-transfer company which distributes theVNS outside of BCM. Reach him via U. S. Mail atBaylor College of Medicine; One Baylor Plaza;VPIT; Houston, TX 77030-3498. Reach him elec-tronically at [email protected] .

Jerry Fowler has been a member of the techni-cal staff of lntegrated Academic InformationManagement System Development Group at Baylorsince 1991. He holds bachelor's degrees in math andmusic from the University of Oklahoma and a mas-ter of science in computer science from Rice Univer-sity. His U. S. Mail address is LAIMS Development,Baylor College of Medicine; One Baylor Plaza;Houston, TX 77030-3498. His electronic home [email protected].

Stan Barber is Director of Networking and Sys-tems Support at Baylor College of Medicine. A gra-duate of Rice University, Stan has been involved inthe Integfated Academic Information ManagementSystem initiatíve at Baylor since 1986. Stan alsosupports ongoing development of rl, the popularnews reader program originally developed by LuryWall. Reach him via U. S. Mail at Baylor College ofMedicine; One Baylor Plaza; Houston, TX 77030-3498. Reach him electronically [email protected] .


Summer '92 USENIX - June 8-June 12,lgg2 - San Antonio, TX 9l

Documents

VNS Retriever: Querying MEDLINE over the InternetVNS Retriever is a central resource server that acts as an agent between the user and the NLM via asyn-chronous electronic message