
[IEEE 2008 19th International Conference on Systems Engineering (ICSENG) - Las Vegas, NV, USA (2008.08.19-2008.08.21)] 2008 19th International Conference on Systems Engineering - Implementation


Implementation of Recursive Queries for Information Systems

Kazem Taghva and Jayalakshmi Jeyaraman
Information Science Research Institute
University of Nevada, Las Vegas
[email protected]

Abstract

In this paper, we give a detailed description of a bibliographic database, a set of recursive queries, an overview of some standard query processing algorithms, and an implementation of these queries in DATALOG.

1. Introduction

Sophisticated information systems require a powerful query language and an efficient implementation strategy. In practice, these information systems are either built on top of an existing database management system or built as expert systems with deductive capabilities. Both kinds of implementation must provide a mechanism to express recursive queries. The system therefore needs an efficient algorithm to evaluate these queries [3].

Standard query languages such as SQL have limited expressive power. This is because SQL does not have recursive or, equivalently, looping capabilities. For this reason, many queries must be implemented in an embedded language such as PL/SQL. Historically, many applications that require deductive capabilities have also been implemented in logic programming languages such as Prolog. Unfortunately, Prolog is not suited for large-scale deductive databases because of its “one item at a time” processing method and its lack of secondary storage file processing capabilities.

The main challenge in dealing with recursive queries is efficiency. The standard implementations based on resolution techniques are memory intensive and slow. A better idea is to implement these queries using a bottom-up fixed point technique. Typically, the fixed point computation is augmented with heuristics to speed up the process.

Beyond this introductory section, there are three additional sections. Section 2 gives background and an overview of some standard techniques. Section 3 is a detailed explanation and implementation of our project. Section 4 presents our conclusion and future work.

2. Background

Most commercial applications of database technology are implemented in a relational system such as Oracle. The expressive power of a relational database is equivalent to the expressive power of first order logic. For example, consider the relation parent(x, y) with the intended meaning that y is the parent of x. This relation is typically represented as a table, as shown in Table 1.

x       y
cain    adam
abel    adam
cain    eve
abel    eve

Table 1. Parent relation

A query such as finding the parent of cain can be implemented simply in a standard query language such as SQL. On the other hand, suppose you are interested in finding all ancestors of an individual x. In DATALOG (or Prolog) notation, this query can be expressed as [3]:

ancestor(x, y) :- parent(x, y).                    (1)

ancestor(x, y) :- ancestor(x, z), ancestor(z, y).  (2)

In English, these two rules state that y is an ancestor of x if either y is a parent of x, or y is an ancestor of an individual z and z is an ancestor of x.

In the Prolog implementation of this rule, if one is looking for the ancestors of an individual, say john, then the query proceeds by matching ancestor(john, y) against the above rules. Consequently, the first rule will find the parents of john, and then the process continues with the second rule, which will find the grandparents of john in the first run. Then, recursively, the second rule will be applied to find the great-grandparents of john, and so on. The standard implementation of Prolog will pause after each ancestor for the user's input. It is the user who decides whether Prolog should continue to find the next ancestor.

19th International Conference on Systems Engineering
978-0-7695-3331-5/08 $25.00 © 2008 IEEE
DOI 10.1109/ICSEng.2008.8

In the database environment, the user expects to see the entire set of ancestors after issuing the query. The fixed point approach starts with the entire relation parent as an initial ancestor table (i.e., ancestors of the entire world, not just john). It then uses the second rule to find all grandparents. In other words, for each pair of tuples of the form (x, z) and (z, y) from ancestor, it forms the new pair (x, y), signifying that y is now an ancestor of x. Next, it applies the second rule again to find the next generation of ancestors. This process continues until no new pair is formed. Termination is guaranteed because the data set is finite.
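The fixed point computation described above can be sketched in a few lines of Python. This is a minimal illustration of naive bottom-up evaluation of rules (1) and (2), not the implementation used in any particular system. The function name `ancestor_fixed_point` and the extra fact `("enoch", "cain")` are assumptions added for illustration; the extra fact is included only so that the second rule derives something beyond Table 1.

```python
def ancestor_fixed_point(parent):
    """Naive bottom-up evaluation of rules (1) and (2) over a set of (x, y) pairs."""
    # Rule (1): every parent pair is an ancestor pair.
    ancestor = set(parent)
    while True:
        # Rule (2): join ancestor with itself on the shared individual z.
        derived = {
            (x, y)
            for (x, z) in ancestor
            for (z2, y) in ancestor
            if z == z2
        }
        new_pairs = derived - ancestor
        if not new_pairs:
            # Fixed point reached: no new pair was formed in this pass.
            return ancestor
        ancestor |= new_pairs


# Table 1, plus one hypothetical fact that is not in the paper.
parent = {
    ("cain", "adam"),
    ("abel", "adam"),
    ("cain", "eve"),
    ("abel", "eve"),
    ("enoch", "cain"),  # assumption, for illustration only
}

ancestor = ancestor_fixed_point(parent)
print(sorted(ancestor))
```

Each pass recomputes the full join, which is what makes this evaluation "naive"; the loop stops on the first pass that adds nothing, mirroring the termination argument in the text.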

In practice, many heuristics are used to speed up this process. For example, the magic set algorithm [3] is one approach that considers only the relevant data in subsequent steps, as opposed to the entire ancestor relation. This reduces the size of the ancestor relation and speeds up the process.
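The benefit of restricting the computation to relevant data can be illustrated with a small Python sketch. This is not the magic set rewriting itself; it is a simplified, goal-directed iteration that only touches tuples reachable from the individual named in the query, which is the effect that magic sets aim for. The function name `ancestors_of` and the fact `("enoch", "cain")` are assumptions introduced for the example.

```python
def ancestors_of(parent, person):
    """Collect the ancestors of one person, visiting only relevant tuples."""
    # Index the parent relation by its first argument for direct lookup.
    parents_by_child = {}
    for child, par in parent:
        parents_by_child.setdefault(child, set()).add(par)

    found = set()
    frontier = {person}
    while frontier:
        next_frontier = set()
        for individual in frontier:
            for par in parents_by_child.get(individual, ()):
                if par not in found:
                    found.add(par)
                    next_frontier.add(par)
        # Only individuals discovered in this round are expanded next.
        frontier = next_frontier
    return found


parent = {
    ("cain", "adam"),
    ("abel", "adam"),
    ("cain", "eve"),
    ("abel", "eve"),
    ("enoch", "cain"),  # assumption, for illustration only
}

print(sorted(ancestors_of(parent, "enoch")))
```

In contrast to computing the ancestor relation for every individual, this sketch never looks at the tuples for abel when the query is about enoch.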

3. Project Implementation

Our database is designed as a group of relations (or tables) that fully describe a typical reference repository. In particular, three tables named Master Entry, Parent Id, and Relationship are built in such a way as to allow searches for explicit and implicit references of an object. These tables conceptually capture the parent-child relationship that is inherent in scholarly publications. The table Master Entry, with eight attributes, is shown below:

Master Entry (cite key, entry type, title, author, publisher id, reference, relation, number of pages)

The reference attribute is initially empty. The relation attribute relates to an entry in the Relationship table. The table Parent Id is itself a Master Entry table and has all the attributes of Master Entry. Only those objects that refer to other objects (those that serve as parents) have an entry here. If an object has a parent, the reference attribute takes that parent's value; otherwise, the reference attribute is empty.

The Relationship table has only one attribute, parent id, which has the same type as the attribute reference in the Master Entry table. This table specifies the direct parent of all the objects in Master Entry. If an object has no parent, then the value of parent id is empty.

For example, let the instances of Master Entry have cite keys 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20. Consider a scenario where 20 refers to 11, 12, and 15. The cite key 15 refers to 13, and the cite key 13 refers to 10. Further, assume that cite key 10 is the following paper:

Kazem Taghva, Julie Borsack, Allen Condit, and Srinivas Erva. The Effects of Noisy Data on Text Retrieval. Journal of the American Society for Information Science, 45(1):50-58, January 1994.

Following the notation of ConceptBase [4], cite key 10 will be entered as:

10 in Master Entry with
  cite key: system generated
  entry type: ARTICLE
  relation: 10
  title: The Effects of Noisy Data on Text Retrieval
  author:
    first author: Kazem Taghva
    second author: Julie Borsack
    third author: Allen Condit
    fourth author: Srinivas Erva
  number of pages: 8
end

The Parent Id entry for cite key 13, which is the immediate parent of 10, will be:

13 in Parent Id with
  Reference: 15
end

And the Relationship entries for 10 and 13 will be:

10 in Relationship with
  parent id: 13
end

13 in Relationship with
  parent id: 15
end
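As a rough illustration of how the Relationship entries chain together, the two entries for cite keys 10 and 13 can be modelled as a Python dictionary from an object to its direct parent. The dictionary layout and the function name `parent_chain` are assumptions made for this sketch and are not part of the ConceptBase model.

```python
def parent_chain(parent_id, cite_key):
    """Follow direct parent links from cite_key until no parent is recorded."""
    chain = []
    seen = {cite_key}  # guards against accidental cycles in the data
    current = parent_id.get(cite_key)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_id.get(current)
    return chain


# The two Relationship entries given in the text: 10 -> 13 and 13 -> 15.
parent_id = {10: 13, 13: 15}

print(parent_chain(parent_id, 10))  # prints [13, 15]
```

An object without a recorded parent, such as 15 in this fragment, simply ends the chain.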

Queries such as the following are typical requests that can be issued by a user of this information system. For example, one may want to get the list of all objects that refer, implicitly or explicitly, to a particular object. This query can be formally expressed as:

An object m refers to an object p if p appears explicitly in the paper named m, or p appears in another paper named r and r appears in the paper named m.

In the DATALOG notation this query is written as:


Reference(m, p) :- Reference(m, p).

Reference(m, p) :- Reference(m, r), Reference(r, p).

It should be noted here that this query is identical in form to the ancestor query described above.

In ConceptBase, we define an attribute called ref in Parent Id. The rule is:

Parent Id with
  Attribute
    ref: Master Entry
  rule
    isref: ∀p Parent Id and m Master Entry (m reference p) ⇒ (p ref m)
end

This rule states that ∀p in Parent Id and ∀m in Master Entry, if the value of the reference attribute of m is p, then p refers to m. These values are stored in the attribute ref of Parent Id.

ConceptBase, which is built on the foundation of DATALOG, uses a bottom-up approach for evaluating recursive queries. For our running example, ConceptBase will start with cite key 10 and discover cite keys 13, 15, and 20 in three computational iterations of the fixed point.
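The three iterations for the running example can be traced with a short Python sketch. It is only an illustration of a round-by-round bottom-up computation under the reference pairs stated in the scenario above; it does not reproduce ConceptBase's internal evaluation. The function name `discover_referrers` and the pair encoding (m, p), meaning that m refers to p, are assumptions made for this sketch.

```python
def discover_referrers(refers_to, target):
    """Return, round by round, the objects that refer to target directly or indirectly."""
    rounds = []
    known = set()
    frontier = {target}
    while True:
        # Objects that refer to something discovered in the previous round.
        newly_found = {
            m for (m, p) in refers_to
            if p in frontier and m not in known
        }
        if not newly_found:
            # Fixed point: this round produced nothing new.
            return rounds
        rounds.append(sorted(newly_found))
        known |= newly_found
        frontier = newly_found


# Scenario from the text: 20 refers to 11, 12, and 15; 15 refers to 13; 13 refers to 10.
refers_to = {(20, 11), (20, 12), (20, 15), (15, 13), (13, 10)}

for number, found in enumerate(discover_referrers(refers_to, 10), start=1):
    print("iteration", number, "discovers", found)
```

Starting from cite key 10, the rounds yield 13, then 15, then 20, which matches the three iterations described in the text.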

4. Conclusion and Future Work

In this paper, we have given an overview of logic queries and their implementation. We have shown how a bottom-up approach computes a recursive query, using a concrete example with applications to bibliographic databases. Future work will focus on experimental analysis to compare the time complexity of the ConceptBase approach with other approaches, such as XML query processing using XQUERY and XSLT.

References

[1] P. A. Bernstein, K. Harry, P. Sanders, D. Shutt, and J. Zander. The Microsoft repository. In Proc. of the 23rd Intl. Conf. on Very Large Data Bases (VLDB), pages 3-12, Athens, Greece, August 1997.

[2] Clocksin & Mellish. Programming in Prolog, 4th ed. Springer-Verlag, 1994.

[3] Jeffrey D. Ullman. Principles of Database and Knowledge-Base Systems, Vol. I. Computer Science Press, Inc., New York, NY, 1988.

[4] http://dbis.rwth-aachen.de/CBdoc/
