
Querying the World Wide Web*

Alberto O. Mendelzon1, George A. Mihaila1, Tova Milo2

1 Department of Computer Science and CSRI, University of Toronto, Toronto, Canada
2 Computer Science Department, Tel Aviv University, Tel Aviv, Israel

Received: 1 September 1996 / Accepted: 15 November 1996

Abstract. The World Wide Web is a large, heterogeneous, distributed collection of documents connected by hypertext links. The most common technology currently used for searching the Web depends on sending information retrieval requests to "index servers" that index as many documents as they can find by navigating the network. One problem with this is that users must be aware of the various index servers (over a dozen of them are currently deployed on the Web), of their strengths and weaknesses, and of the peculiarities of their query interfaces. A more serious problem is that these queries cannot exploit the structure and topology of the document network. In this paper we propose a query language, WebSQL, that takes advantage of multiple index servers without requiring users to know about them, and that integrates textual retrieval with structure and topology-based queries. We give a formal semantics for WebSQL using a calculus based on a novel "virtual graph" model of a document network. We propose a new theory of query cost based on the idea of "query locality," that is, how much of the network must be visited to answer a particular query. We give an algorithm for characterizing WebSQL queries with respect to query locality. Finally, we describe a prototype implementation of WebSQL written in Java.

Keywords: Index server – Web SQL – World Wide Web

1 Introduction

The World Wide Web (Berners-Lee et al. 1994) is a large, heterogeneous, distributed collection of documents connected by hypertext links. Current practice for finding documents of interest depends on browsing the network by following links and searching by sending information retrieval requests to "index servers" that index as many documents as they can find by navigating the network. The limitations of browsing as a search technique are well known, as is the disorientation that results in the infamous "lost-in-hyperspace" syndrome. As for keyword-based searching, one problem is that users must be aware of the various index servers (over a dozen of them are currently deployed on the Web), of their strengths and weaknesses, and of the peculiarities of their query interfaces. To some degree this can be remedied by front-ends that provide a uniform interface to multiple search engines, such as Multisurf (Hasan et al. 1995), Savvysearch (Dreilinger 1996), and MetaCrawler (Selberg and Etzioni 1995).

A more serious problem is that these queries cannot exploit the structure and topology of the document network. For example, suppose we are looking for an IBM catalog with prices for personal computers. A keyword search for the terms "IBM," "personal computer," and "price," using MetaCrawler, returns 92 references, including such things as an advertisement for an "I Bought Mac" T-shirt and the VLDB '96 home page. If we know the web address (called URL, or "uniform resource locator") of the IBM home page is www.ibm.com, we would like to be able to restrict the search to only pages directly or indirectly reachable from this page, and stored on the same server. With currently available tools, this is not possible because navigation and query are distinct phases: navigation is used to construct the indexes, and query is used to search them once constructed. We propose instead a tool that can combine query with navigation. The emphasis, however, must be on controlled navigation. There are currently tens of millions of documents on the Web, and the number is growing; network bandwidth is still a very limited resource. It therefore becomes important to be able to distinguish in queries between documents that are stored on the local server, whose access is relatively cheap, and those that are stored on remote servers, whose access is more expensive. It is also important to be able to analyze a query to determine its cost in terms of how many remote document accesses will be needed to answer it.

* A preliminary version of this paper was presented at the 1996 Symposium on Parallel and Distributed Information Systems

Correspondence to: A. O. Mendelzon

Int J Digit Libr (1997) 1: 54–67

In this paper we propose a query language, WebSQL, that takes advantage of multiple index servers without requiring users to know about them, and that integrates textual retrieval with structure and topology-based queries. After introducing the language in Sect. 2, in Sect. 3 we give a formal semantics for WebSQL using a calculus based on a novel "virtual graph" model of a document network. In Sect. 4, we propose a new theory of query cost based on the idea of "query locality," that is, how much of the network must be visited to answer a particular query. Query cost analysis is a prerequisite for query optimization; we give an algorithm for characterizing WebSQL queries with respect to query locality that we believe will be useful in the query optimization process, although we do not develop query optimization techniques in this paper. Finally, in Sect. 5 we describe a prototype implementation of WebSQL written in Java. We conclude in Sect. 6.

1.1 Related work

There has been work on query languages for hypertext documents (Beeri and Kornatzky 1990; Consens and Mendelzon 1989; Minohara and Watanabe 1993) as well as query languages for structured or semi-structured documents (Abiteboul et al. 1993; Christophides et al. 1994; Güting et al. 1989; Navarro and Baeza-Yates 1995; Quass et al. 1995). Our work differs significantly from both these streams. None of these papers makes a distinction between documents stored locally and remotely, or attempts to capitalize on existing index servers. As for document structure, we support only the minimal attributes common to most HTML documents (URL, title, type, length, modification date). We do not assume the internal document structure is known or partially known, as does the work on structured and semi-structured documents. As a consequence, our language does not exploit internal document structure when it is known; but we are planning to build this on top of the current framework.

Closer to our approach is the W3QL work by Konopnicki and Shmueli (1995). Our motivation is very similar to theirs, but the approach is substantially different. They emphasize extensibility and interfacing to external user-written programs and Unix utilities. While extensibility is a highly desirable goal when the tool runs in a known environment, we aim for a tool that can be downloaded to an arbitrary client and run with minimal interaction with the local environment. For this reason, our query engine prototype is implemented in the Java programming language (Sun Microsystems) and can be downloaded as an applet by a Java-aware browser. Another difference is that we provide formal query semantics, and emphasize the distinction between local and remote documents. This makes a theory and analysis of query locality possible. On the other hand, they support filling out forms encountered during navigation, and discuss a view facility based on W3QL, while we do not currently support either. In this regard, it is not our intention to provide a fully functional tool, but a clean and minimal design with well-defined semantics that can be extended with bells and whistles later.

Another recent effort in this direction is the WebLog language of Lakshmanan et al. (1996). Unlike WebSQL, WebLog emphasizes manipulating the internal structure of Web documents. Instead of regular expressions for specifying paths, they rely on Datalog-like recursive rules. The paper does not describe an implementation or formal semantics.

The work we describe here is not specifically addressed to digital libraries. However, a declarative language that integrates web navigation with content-based search has obvious applications to digital library construction and maintenance: for example, it can be used for building and maintaining specialized indices and on-line bibliographies, for defining virtual documents stored in a digital library, and for building applications that integrate access to specific digital libraries with access to related web pages.

2 The WebSQL language

In this section we introduce our SQL-like language for the World Wide Web. We begin by proposing a relational model of the WWW. Then, we give a few examples of queries, and present the syntax of the language. The formal semantics of queries is defined in the following section.

One of the difficulties in building an SQL-like query language for the Web is the absence of a database schema. Instead of trying to model document structure with some kind of object-oriented schema, as in Christophides et al. (1994) and Quass et al. (1995), we take a minimalist relational approach. At the highest level of abstraction, every Web object is identified by its Uniform Resource Locator (URL) and has a binary content whose interpretation depends on its type (HTML, Postscript, image, audio, etc.). Also, Web servers provide some additional information such as the type, length, and the last modification date of an object. Moreover, an HTML document has a title and a text. So, for query purposes, we can associate a Web object with a tuple in a virtual relation:

    Document(url, title, text, type, length, modif)

where all the attributes are character strings. The url is the key, and all other attributes can be null.

Once we define this virtual relation, we can express any content query, that is, a query that refers only to the content of documents, using an SQL-like notation.

Example 1. Find all HTML documents about "hypertext".

    SELECT d.url, d.title, d.length, d.modif
    FROM Document d
         SUCH THAT d MENTIONS "hypertext"
    WHERE d.type = "text/html";

Since we are not interested just in document content, but also in the hypertext structure of the Web, we make hypertext links first-class citizens in our model. In particular, we concentrate on HTML documents and on hypertext links originating from them. A hypertext link is specified inside an HTML document by a sequence, known as an anchor, of the form <A HREF=href>label</A> where href (standing for hypertext reference) is the URL of the referenced document, and label is a textual description of the link. Therefore, we can capture all the information present in a link into a tuple:

    Anchor(base, href, label)

where base is the URL of the HTML document containing the link, href is the referred document and label is the link description (see footnote 2). All these attributes are character strings.

Now we can pose queries that refer to the links present in documents.
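As a rough illustration of how Anchor tuples could be produced from a fetched page, here is a minimal Python sketch. It is not the paper's Java prototype: the names `Anchor`, `AnchorExtractor` and `extract_anchors` are hypothetical, and resolving relative references against `base` with `urljoin` is an assumption made for the example.

```python
from html.parser import HTMLParser
from typing import List, NamedTuple, Optional
from urllib.parse import urljoin


class Anchor(NamedTuple):
    base: str   # URL of the HTML document containing the link
    href: str   # URL of the referenced document
    label: str  # textual description of the link


class AnchorExtractor(HTMLParser):
    """Collects one Anchor tuple per <A HREF=href>label</A> element."""

    def __init__(self, base: str) -> None:
        super().__init__()
        self.base = base
        self.anchors: List[Anchor] = []
        self._href: Optional[str] = None  # href of the anchor being read
        self._label: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = urljoin(self.base, href)  # assumption: resolve relative URLs
                self._label = []

    def handle_data(self, data):
        if self._href is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._label).strip()
            self.anchors.append(Anchor(self.base, self._href, label))
            self._href = None


def extract_anchors(base: str, html: str) -> List[Anchor]:
    parser = AnchorExtractor(base)
    parser.feed(html)
    parser.close()
    return parser.anchors
```

Because the parser appends a tuple for every anchor it meets, a page with several identical links yields repeated tuples, which matches the multiset reading of Anchor.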

Example 2. Find all links to applets from documents about "Java".

    SELECT y.label, y.href
    FROM Document x
         SUCH THAT x MENTIONS "Java",
         Anchor y SUCH THAT base = x
    WHERE y.label CONTAINS "applet";

In order to study the topology of the Web we will sometimes want to distinguish between links that point within the same document where they appear, to another document stored at the same site, or to a document on a remote server.

Definition 1. A hypertext link is said to be:

– interior if the destination is within the source document (see footnote 3);

– local if the destination and source documents are different but located on the same server;

– global if the destination and the source documents are located on different servers.

This distinction is important both from an expressive power point of view and from the point of view of the query cost analysis and the locality theory presented in Sect. 4.
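The three cases of Definition 1 can be sketched as a small Python function. This is only an illustration: the function name `classify_link` is hypothetical, and "same server" is approximated here by comparing the network location (host and port) of the two URLs, which is an assumption rather than something the definition fixes.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def classify_link(base: str, href: str) -> str:
    """Return 'interior', 'local' or 'global' for a link from base to href."""
    source, _ = urldefrag(base)                 # drop any #fragment
    target, _ = urldefrag(urljoin(base, href))  # resolve relative references first
    if target == source:
        return "interior"  # destination lies within the source document
    # Assumption: two documents are on the same server when their
    # network locations (host and port) coincide.
    if urlsplit(target).netloc.lower() == urlsplit(source).netloc.lower():
        return "local"
    return "global"
```

For instance, a reference such as `#DP` inside a page is classified as interior, a relative reference to a sibling page as local, and a reference to another host as global.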

We assign an arrow-like symbol to each of the three link types: let $\mapsto$ denote an interior link, $\rightarrow$ a local link and $\Rightarrow$ a global link. Also, let $=$ denote the empty path. Path regular expressions are built from these symbols using concatenation, alternation ($\mid$) and repetition ($*$).

For example, $= \;\mid\; \Rightarrow . \rightarrow^{*}$ is a regular expression that represents the set of paths containing the zero length path and all paths that start with a global link and continue with zero or more local links.
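One way to experiment with such expressions is to encode each link type as a single letter and reuse an ordinary regular expression engine. The sketch below does this in Python; the letters `i`, `l`, `g` and the helper `path_matches` are hypothetical choices made for the example, not part of WebSQL.

```python
import re
from typing import List, Pattern

# Hypothetical one-letter encoding of the three link types.
INTERIOR, LOCAL, GLOBAL = "i", "l", "g"

# The expression from the text: the empty path, or one global link
# followed by zero or more local links.
EMPTY_OR_GLOBAL_THEN_LOCALS = re.compile(r"(?:|gl*)")

# Local-only paths of length two or less (the pattern used in Examples 3 and 4).
AT_MOST_TWO_LOCALS = re.compile(r"(?:|l|ll)")


def path_matches(pattern: Pattern[str], link_types: List[str]) -> bool:
    """True if the sequence of link types along a path matches the pattern."""
    word = "".join(link_types)
    return pattern.fullmatch(word) is not None
```

A path is then just the list of its link types, for example `[GLOBAL, LOCAL, LOCAL]`, and the empty list stands for the zero length path.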

Now we can express queries referring explicitly to the hypertext structure of the Web.

Example 3. Starting from the Department of Computer Science home page, find all documents that are linked through paths of length two or less containing only local links. Keep only the documents containing the string "database" in their title.

    SELECT d.url, d.title
    FROM Document d SUCH THAT
         "http://www.cs.toronto.edu" = | → | →→ d
    WHERE d.title CONTAINS "database";

Of course, we can combine content and structure specifications in a query.

Example 4. Find all documents mentioning "Computer Science" and all documents that are linked to them through paths of length two or less containing only local links.

    SELECT x.url, x.title, y.url, y.title
    FROM Document x SUCH THAT
         x MENTIONS "Computer Science",
         Document y SUCH THAT x = | → | →→ y;

Note we are using two different keywords, MENTIONS and CONTAINS, to do string matching in the FROM and WHERE clauses respectively. The reason is that they mean different things. Conditions in the FROM clause will be evaluated by sending them to index servers. The result of the FROM clause, obtained by navigation and index server query, is a set of candidate URLs, which are then further restricted by evaluating the conditions in the WHERE clause. This distinction is reflected both in the formal semantics and in the implementation. We could remove the distinction and let a query optimizer decide which conditions are evaluated by index servers and which are tested locally. However, we prefer to keep the explicit distinction because index servers are not perfect or complete, and the programmer may want to control which conditions are evaluated by them and which are not.
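The two-phase evaluation described above can be sketched as follows. This is a toy Python model under stated assumptions: `run_query`, `PAGES` and `INDEX` are hypothetical names, the index is a plain dictionary standing in for an index server, and the example.* URLs are invented test data.

```python
from typing import Callable, Dict, Iterable, List, Optional

Document = Dict[str, str]


def run_query(
    index_lookup: Callable[[str], Iterable[str]],
    fetch: Callable[[str], Optional[Document]],
    keyword: str,
    where: Callable[[Document], bool],
) -> List[Document]:
    """FROM clause: candidates from the index; WHERE clause: local filtering."""
    answers = []
    for url in index_lookup(keyword):        # d MENTIONS keyword, answered by the index
        doc = fetch(url)                     # the index may return stale URLs
        if doc is not None and where(doc):   # WHERE condition tested on the fetched document
            answers.append(doc)
    return answers


# Toy data standing in for the Web and for an imperfect index server.
PAGES = {
    "http://a.example/1": {"url": "http://a.example/1", "title": "Hypertext databases"},
    "http://a.example/2": {"url": "http://a.example/2", "title": "Hypertext systems"},
}
INDEX = {"hypertext": ["http://a.example/1", "http://a.example/2", "http://a.example/gone"]}

answers = run_query(
    index_lookup=lambda kw: INDEX.get(kw, []),
    fetch=PAGES.get,
    keyword="hypertext",
    where=lambda d: "database" in d["title"].lower(),
)
```

Keeping the two conditions as separate arguments mirrors the explicit MENTIONS/CONTAINS split: the caller decides what is delegated to the index and what is checked after fetching.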

The BNF specification of the language syntax is given in Fig. 1. The syntax follows the standard SQL SELECT statement. All queries refer to the WWW database schema introduced above. That is, Table can only be Document or Anchor and Field can only be a valid attribute of the table it applies to.

3 Formal semantics

In this section we introduce a formal foundation for WebSQL. Starting from the inherent graph structure of the WWW, we define the notion of virtual graph and construct a calculus-based query language in this abstract setting. Then we define the semantics of WebSQL queries in terms of this calculus.

2 Note that Anchor is not, strictly speaking, a relation, but a multiset of tuples: a document may contain several links to the same destination, all having the same label

3 In HTML, links can point to specific named fragments within the destination document; the fragment name is incorporated into the URL. For example, http://www.royalbank.com/fund.html#DP refers to the fragment named DP within the document with URL http://www.royalbank.com/fund.html. We will ignore this detail in the rest of the paper

3.1 Data model

We assume an infinite set $D$ of data values, and a finite set $T$ of simple types whose domains are subsets of $D$. Tuple types $(a_1 : t_1, \ldots, a_n : t_n)$ with attributes $a_i$ of simple type $t_i$, $i = 1 \ldots n$, are defined in the standard way. The domain of a type $t$ is denoted by $dom(t)$. For a tuple $x$, we denote by $x.a_i$ the value $v_i$ associated with the attribute $a_i$.

We distinguish a simple type $Oid \in T$ of object identifiers, and two tuple types Node and Link with the following structure:

$$Node = (id : Oid, \ldots, a_i : t_i, \ldots)$$
$$Link = (from : Oid, to : Oid, \ldots, b_j : t_j, \ldots)$$

The attribute names in the two definitions are all distinct. We shall refer to tuples of the first type as Node objects and to tuples of the second type as Link objects. In our model of the World Wide Web, documents will be mapped to Node objects and the hypertext links between them to Link objects. In this context, the object identifiers (Oid) will be the URLs and the Node and Link tuples will model the WebSQL Document and Anchor virtual tables introduced informally in Sect. 2.

Virtual graphs

The set of all the documents in the Web, although finite, is undetermined: no one can produce a complete list of all the documents available at a certain moment. There are only two ways one can find documents in the Web: navigation starting from known documents and querying of index servers.

Given any URL, an agent can either fetch the associated document or give an error message if the document does not exist. This behavior can be modeled by a computable partial function mapping Oids to Node objects. Once a document is fetched, one can determine a finite set of outgoing hypertext links from that document. This can also be modeled by a computable partial function mapping Oids to sets of Link objects. In practice, navigation is done selectively, by following only certain links, based on their properties. In order to capture this, we introduce a finite set of unary link predicates $P_{Link} = \{\alpha, \beta, \gamma, \ldots\}$.

The second way to discover documents is by querying index servers. To model the lists of URLs returned by index servers we introduce a (possibly infinite) set of unary node predicates $P_{Node} = \{P, Q, R, \ldots\}$ where for each predicate $P$ we are interested in the set $\{x \mid x \in dom(Oid),\ P(x) = true\}$. For example, a particular node predicate may be associated with a keyword, and it will be true of all documents that contain that keyword in their text.

Definition 2. A virtual graph is a 4-tuple $\Gamma = (\rho_{Node}, \rho_{Link}, P_{Node}, P_{Link})$ where $\rho_{Node} : dom(Oid) \rightarrow dom(Node)$ and $\rho_{Link} : dom(Oid) \rightarrow 2^{dom(Link)}$ are computable partial functions, $P_{Node}$ is a set of unary predicates on $dom(Oid)$, and $P_{Link}$ is a finite set of unary predicates on $dom(Link)$.

– The set $\{\rho_{Node}(oid) \mid oid \in dom(Oid)$ and $\rho_{Node}$ is defined on $oid\}$ is finite;

– if $\rho_{Node}(oid) = v$ then $v.id = oid$;

– for all $oid \in dom(Oid)$: $\rho_{Node}(oid)$ is defined $\Leftrightarrow$ $\rho_{Link}(oid)$ is defined;

– if $\rho_{Link}(oid) = E$ then $E$ is finite and for all $e \in E$, $e.from = oid$ and $\rho_{Node}(e.to)$ is defined (we say that $e$ is an edge from $v_1 = \rho_{Node}(e.from)$ to $v_2 = \rho_{Node}(e.to)$);

– every predicate $\alpha \in P_{Link}$ is a partial computable Boolean function on the set $dom(Link)$, and $\alpha$ is defined on all the links in $\rho_{Link}(oid)$ whenever $\rho_{Link}(oid)$ is defined;

– the function $val : P_{Node} \rightarrow 2^{dom(Oid)}$, defined by $val(P) = \{x \mid x \in dom(Oid),\ P(x) = true\}$, is computable.

Example 5. A virtual graph:

$$\Gamma = (\rho_{Node}, \rho_{Link}, P_{Node}, P_{Link})$$

where:

$Node = (url : Oid, title : string, text : string)$,
$Link = (from : Oid, to : Oid, label : string)$,
$dom(Oid) = \{url_1, url_2, url_3\}$,

and $P_{Node} = \{Contains_{tourist}, Contains_{capital}\}$, $P_{Link} = \{Contains_{go}, Contains_{back}\}$.

Note that a virtual graph $\Gamma = (\rho_{Node}, \rho_{Link}, P_{Node}, P_{Link})$ induces an underlying directed graph $G(\Gamma) = (V, E)$ where $V = \rho_{Node}(dom(Oid))$ and $E = \bigcup_{oid \in dom(Oid)} \rho_{Link}(oid)$. However, a calculus cannot manipulate this graph directly because of the computability issues presented above. This captures our intuition about the World Wide Web hypertext graph, whose nodes and links can only be discovered through navigation.
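To make the definition concrete, the virtual graph of Example 5 can be written down as a small Python sketch, using the node and link values tabulated for that example (the elided text "..." is kept as in the source). Modelling partial functions with dictionary lookups that return `None`, and reading the Contains predicates as case-insensitive substring tests on the node text or link label, are assumptions made for this illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Node = Tuple[str, str, str]  # (url, title, text)
Link = Tuple[str, str, str]  # (from, to, label)

_NODES: Dict[str, Node] = {
    "url1": ("url1", "Toronto", "Visiting Toronto?..."),
    "url2": ("url2", "Ottawa", "The national capital..."),
    "url3": ("url3", "New York", "Tourist information..."),
}
_LINKS: Dict[str, List[Link]] = {
    "url1": [("url1", "url2", "Go to Ottawa"), ("url1", "url3", "Go to New York")],
    "url2": [("url2", "url1", "Back to Toronto")],
    "url3": [("url3", "url1", "Back to Toronto")],
}


def rho_node(oid: str) -> Optional[Node]:
    """Partial function: None means 'undefined' (no such document)."""
    return _NODES.get(oid)


def rho_link(oid: str) -> Optional[List[Link]]:
    """Defined exactly where rho_node is defined."""
    return _LINKS.get(oid)


def _text_contains(word: str):
    def predicate(oid: str) -> bool:
        node = rho_node(oid)
        return node is not None and word in node[2].lower()
    return predicate


def _label_contains(word: str):
    return lambda link: word in link[2].lower()


P_NODE = {"Contains_tourist": _text_contains("tourist"),
          "Contains_capital": _text_contains("capital")}
P_LINK = {"Contains_go": _label_contains("go"),
          "Contains_back": _label_contains("back")}


def val(name: str) -> Set[str]:
    """The set of Oids satisfying a node predicate (what an index server returns)."""
    return {oid for oid in _NODES if P_NODE[name](oid)}


def underlying_graph() -> Tuple[Set[Node], Set[Link]]:
    """G = (V, E); enumerable here only because the toy graph is tiny."""
    vertices = set(_NODES.values())
    edges = {link for links in _LINKS.values() for link in links}
    return vertices, edges
```

On the real Web only `rho_node`, `rho_link` and `val` would be available to a query engine; `underlying_graph` exists here purely to inspect the toy data.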

3.2 The calculus

Now we proceed to define our calculus for querying virtual graphs. We introduce path regular expressions to specify connectivity-based queries. We then present the notions of range expressions and ground variables to restrict queries so that their evaluation does not require enumerating every node of the virtual graph, and finally we define calculus queries.

Path regular expressions

Consider a virtual graph $\Gamma$ and denote its underlying graph $G(\Gamma) = (V, E)$. A path in $\Gamma$ is defined in the same way as in a directed graph: if $e_1, e_2, \ldots, e_k$ are edges in $E$, we call $p = (e_1, e_2, \ldots, e_k)$ a path if and only if for every index $1 \le i < k$, $e_{i+1}.from = e_i.to$. A path $p$ is called simple if there are no different edges $e_i \ne e_j$ in $p$ with the same starting or ending points.

In order to express queries based on connectivity, we need a way to define graph patterns. Recall $P_{Link}$ is the set of link properties in a virtual graph. Let $\alpha \in P_{Link}$ be some link property, and let $e$ be a link object. If $\alpha(e) = true$ then we say that $e$ has the property $\alpha$. We define the set of properties of a link $e$ by $\Lambda(e) = \{\alpha \in P_{Link} \mid \alpha(e) = true\}$. We sometimes choose to view $\Lambda(e)$ as a formal language on the alphabet $P_{Link}$. For each property $\alpha$ that is true of $e$, $\Lambda(e)$ contains the single-character string $\alpha$. To study the properties of a path, we extend the definition of the set of properties of a link to paths as follows: if $p = (e_1, \ldots, e_k)$ is a path then we define

$$\Lambda(p) = \Lambda(e_1) \ldots \Lambda(e_k)$$

where $L L' = \{x x' \mid x \in L,\ x' \in L'\}$ is the concatenation of the languages $L$ and $L'$.

Example 6. Suppose we want to require that a property $\alpha$ hold on all the links of a path $p = (e_1, \ldots, e_k)$. This can be expressed easily in terms of $\Lambda$ by requiring that $\alpha^k \in \Lambda(p)$.

In order to specify constraints like in the example above we introduce path regular expressions, which are nothing more than regular expressions over the alphabet $P_{Link}$. With each regular expression $R$ over the alphabet $P_{Link}$ we associate a language $L(R) \subseteq P_{Link}^{*}$ in the usual way.

Definition 3. We say that the path $p$ matches the path regular expression $R$ if and only if: $\Lambda(p) \cap L(R) \ne \emptyset$

In other words, the path $p$ matches the path regular expression $R$ if and only if there is a word $w$ in the set of properties $\Lambda(p)$ that matches the regular expression $R$.
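Definition 3 can be checked directly on small inputs by materializing the language of a path. The Python sketch below does so; the single-letter property names `a`, `b`, `c`, their meanings, and the sample path are hypothetical, and enumerating every word is exponential in the path length, so this is a specification-level illustration rather than an efficient matcher.

```python
import re
from itertools import product
from typing import Callable, Dict, List, Set, Tuple

Link = Tuple[str, str, str]  # (from, to, label)

# Hypothetical link properties, each named by a single character.
LINK_PREDICATES: Dict[str, Callable[[Link], bool]] = {
    "a": lambda link: "go" in link[2].lower(),
    "b": lambda link: "back" in link[2].lower(),
    "c": lambda link: "to" in link[2].lower().split(),
}


def path_language(path: List[Link]) -> Set[str]:
    """The language of a path: pick one true property per link, in order."""
    per_link = [
        [name for name, holds in LINK_PREDICATES.items() if holds(link)]
        for link in path
    ]
    return {"".join(word) for word in product(*per_link)}


def matches(path: List[Link], regex: str) -> bool:
    """The path matches R iff some word of its language belongs to L(R)."""
    return any(re.fullmatch(regex, word) for word in path_language(path))


PATH = [("url1", "url2", "Go to Ottawa"), ("url2", "url1", "Back to Toronto")]
```

With these definitions, `matches(PATH, "c*")` expresses the requirement of Example 6 for property `c`: it holds exactly when `c` is true of every link on the path.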

Range expressions

The algebra for a traditional relational database is based on operators like select ($\sigma$), project ($\pi$) and Cartesian product ($\times$). Because all the contents of the database are assumed to be available to the query engine, all these operations can be executed, in the worst case by enumerating all the tuples. In the case of the World Wide Web, the result of a select operation cannot be computed in this way, simply because one cannot enumerate all the documents. Instead, navigation and querying of index servers must be used. We want our calculus to express only queries that can be evaluated without having to enumerate the whole Web. To enforce this restriction, we introduce range conditions, which will serve as restrictions for variables in the queries.

Definition 4. Let $\Gamma = (\rho_{Node}, \rho_{Link}, P_{Node}, P_{Link})$ be a virtual graph. Let $G(\Gamma) = (V, E)$ be its underlying graph. A range atom is an expression of one of the following forms:

– $Path(u, R, x)$ where $u, x$ are Oids or variable names, and $R$ is a path regular expression;

– $P(x)$ where $P \in P_{Node}$ and $x$ is an Oid or a variable name;

– $From(u, x)$ where $u, x$ are Oids or variable names.

A range expression is an expression of the form $E = \{x_1 : T_1, x_2 : T_2, \ldots, x_n : T_n \mid A_1, A_2, \ldots, A_m\}$ where $A_1, A_2, \ldots, A_m$ are range atoms, $x_1, x_2, \ldots, x_n$ are all the variables occurring in them and $T_i \in \{Node, Link\}$ specifies the type of the variable $x_i$, for $1 \le i \le n$.

Tables for Example 5: the functions $\rho_{Node}$ and $\rho_{Link}$.

    url     ρ_Node(url)
    url1    (url1, "Toronto", "Visiting Toronto?...")
    url2    (url2, "Ottawa", "The national capital...")
    url3    (url3, "New York", "Tourist information...")

    url     ρ_Link(url)
    url1    {(url1, url2, "Go to Ottawa"), (url1, url3, "Go to New York")}
    url2    {(url2, url1, "Back to Toronto")}
    url3    {(url3, url1, "Back to Toronto")}

Consider a valuation $\nu : \{x, y, z, \ldots\} \rightarrow V \cup E$ that maps each variable into a node or an edge of the underlying graph. We extend $\nu$ to $dom(Oid)$ by $\nu(oid) = \rho_{Node}(oid)$, that is, $\nu$ maps each Oid appearing in an atom to the corresponding node. The following definition assigns semantics to range atoms.

Definition 5. Let $\Gamma = (\rho_{Node}, \rho_{Link}, P_{Node}, P_{Link})$ be a virtual graph. Let $A$ be a range atom. We say that $A$ is validated by the valuation $\nu$ if:

– for $A = Path(u, R, x)$, there exists a simple path from $\nu(u)$ to $\nu(x)$ matching the path regular expression $R$;

– for $A = P(x)$, $P(\nu(x).id) = true$;

– for $A = From(u, x)$, $\nu(x).from = \nu(u).id$.

Now we can give semantics to range expressions.

Definition 6. Consider a range expression $E = \{x_1, \ldots, x_n \mid A_1, \ldots, A_m\}$. Then the set of tuples $\Psi(E) = \{(\nu(x_1), \ldots, \nu(x_n)) \mid \nu$ is a valuation s.t. $A_1, \ldots, A_m$ are all validated by $\nu\}$ is called the range of $E$.

Example 7. The set of all nodes satisfying a certain node predicate $P$ together with all their outgoing links may be specified by the following range expression: $\{x : Node, y : Link \mid P(x), From(x, y)\}$

Ground variables and ground expressions

Although Definition 6 gives a well-defined semantics for all range expressions, problems may arise when examining the evaluation of $\Psi(E)$ for certain expressions $E$. For example, expressions like $\{x : Node, y : Node \mid Path(x, \alpha, y)\}$ (find all pairs of nodes connected by a link of type $\alpha$) or $\{x : Node, y : Link \mid From(x, y)\}$ (find all nodes and all links outgoing from them) cannot be algorithmically evaluated on an arbitrary virtual graph, since their evaluation would involve the enumeration of all nodes. We impose syntactic restrictions to disallow such range expressions, in a manner similar to the definition of safe expressions in Datalog.

Consider a virtual graph $\Gamma = (\rho_{Node}, \rho_{Link}, P_{Node}, P_{Link})$. Let us examine the evaluation of atoms for all the three cases in Definition 4:

– if $A = Path(u, R, x)$, determining all pairs of nodes $(u, x)$ separated by simple paths matching the path regular expression $R$ is possible only if $u$ is a constant. Indeed, if $u$ is known, we can traverse the graph starting from $u$ to generate all simple paths matching $R$, thus determining the values of $x$. If $u$ is a variable and $x$ is a constant, since Web links can only be traversed in one direction, there is no way to determine the values of $u$ for arbitrary $R$ without enumerating all the nodes in the graph. All the more so when both $u$ and $x$ are variables. A similar argument shows that $u$ must be a constant in $A = From(u, x)$.

– the atoms of the form $P(x)$ where $P \in P_{Node}$ pose no problem since the set $\{x \in V \mid P(x) = true\}$ is computable (by Definition 2).
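The traversal argument for a Path atom with a constant starting point can be sketched in Python as a depth-first enumeration of simple paths. Several things here are assumptions made for the illustration: the toy `LINKS` table, the simplification that every link carries exactly one type letter (`g` or `b`), the helper names, and the reading of "simple" as "no two edges of the path share a starting point or share an ending point". The sketch explores every simple path and only then tests the expression; pruning with an automaton would be a natural refinement.

```python
import re
from typing import Dict, FrozenSet, List, Set, Tuple

Link = Tuple[str, str, str]  # (from, to, label)

# Toy link function; a missing key means "undefined".
LINKS: Dict[str, List[Link]] = {
    "url1": [("url1", "url2", "Go to Ottawa"), ("url1", "url3", "Go to New York")],
    "url2": [("url2", "url1", "Back to Toronto")],
    "url3": [("url3", "url1", "Back to Toronto")],
}


def link_type(link: Link) -> str:
    """Assumption: each link has exactly one type, 'g' (go) or 'b' (back)."""
    return "g" if link[2].startswith("Go") else "b"


def eval_path_atom(u: str, regex: str) -> Set[str]:
    """All x such that some simple path from the constant u to x matches regex."""
    pattern = re.compile(regex)
    found: Set[str] = set()

    def visit(node: str, word: str, starts: FrozenSet[str], ends: FrozenSet[str]) -> None:
        if pattern.fullmatch(word):
            found.add(node)
        if node in starts:
            return  # a second edge leaving this node would break simplicity
        for link in LINKS.get(node, []):
            target = link[1]
            if target in ends:
                continue  # a second edge entering target would break simplicity
            visit(target, word + link_type(link), starts | {node}, ends | {target})

    visit(u, "", frozenset(), frozenset())
    return found
```

The recursion terminates on any finite graph because every step adds a new starting node, so a path can never be longer than the number of nodes.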

The above considerations lead to the following de®nition.

Definition 7. A variable $x$ occurring in a range atom $A$ is said to be independent in $A$ if it is the only variable in the atom. If two variables $u$ and $x$ appear in an atom $A$ in this order (i.e. $A = Path(u, R, x)$ or $A = From(u, x)$), then we say that $x$ depends on $u$ in $A$.

The idea is that independent variables can be determined directly, whereas dependent variables can be determined only after the variables they depend upon have been assigned values. The following definition gives a syntactic restriction over range expressions that will ensure computability.

Definition 8. Let $E = \{x_1 : T_1, \ldots, x_n : T_n \mid A_1, \ldots, A_m\}$ be a range expression. A variable $x \in \{x_1, \ldots, x_n\}$ is said to be ground in $E$ if there exists an atom $A_i$ such that either $x$ is independent in $A_i$, or $x$ depends in $A_i$ on a variable $u$ that is ground in $E$. The expression $E$ is said to be ground if all the variables in $E$ are ground.
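Groundness as given in Definitions 7 and 8 can be computed by a simple fixpoint iteration. The following Python sketch applies the two definitions as literally stated; the tuple encoding of atoms (`("Path", u, R, x)`, `("From", u, x)`, `("Pred", P, x)`) and the function names are hypothetical, and any term not listed among the variables is treated as a constant Oid.

```python
from typing import Iterable, List, Set, Tuple

Atom = Tuple[str, ...]


def atom_terms(atom: Atom) -> List[str]:
    """The Oid/variable positions of an atom, in order of appearance."""
    kind = atom[0]
    if kind == "Path":   # ("Path", u, R, x)
        return [atom[1], atom[3]]
    if kind == "From":   # ("From", u, x)
        return [atom[1], atom[2]]
    if kind == "Pred":   # ("Pred", P, x)
        return [atom[2]]
    raise ValueError(f"unknown atom kind: {kind}")


def ground_variables(variables: Iterable[str], atoms: Iterable[Atom]) -> Set[str]:
    """Least set of variables that are ground in the sense of Definition 8."""
    variables = set(variables)
    atoms = list(atoms)
    ground: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for atom in atoms:
            occurring = [t for t in atom_terms(atom) if t in variables]
            if len(occurring) == 1:
                newly = {occurring[0]}          # independent in this atom
            elif len(occurring) == 2 and occurring[0] in ground:
                newly = {occurring[1]}          # depends on a ground variable
            else:
                newly = set()
            if not newly <= ground:
                ground |= newly
                changed = True
    return ground


def is_ground(variables: Iterable[str], atoms: Iterable[Atom]) -> bool:
    variables = set(variables)
    return ground_variables(variables, atoms) == variables
```

The two problematic expressions discussed earlier are rejected by this check, while the expression of Example 7 is accepted.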

Theorem 1. Consider a virtual graph $\Gamma = (\rho_{Node}, \rho_{Link}, P_{Node}, P_{Link})$ and a range expression $E = \{x_1 : T_1, \ldots, x_n : T_n \mid A_1, \ldots, A_m\}$. If $E$ is ground then $\Psi(E)$ is computable.

Proof. Consider the following dependency graph $G_D = (V_D, E_D)$ where $V_D = \{x_1, \ldots, x_n\}$ and $E_D \subseteq V_D^2$ is the dependency relation (i.e. there is an edge from $x_i$ to $x_j$ if and only if $x_j$ depends on $x_i$ in some atom $A_k$). We distinguish two cases depending on the presence of cycles in $G_D$. We will consider first the acyclic case and then we will reduce the cyclic case to the acyclic one by transforming the expression $E$ into an equivalent one.

Case I ($G_D$ acyclic). By doing a topological sort we can construct a total order among the variables which is compatible with the dependency relation, that is, a permutation $\sigma \in S(n)$ s.t. $(x_{\sigma(i)}, x_{\sigma(j)}) \in E_D$ implies $i < j$.

To simplify the notation, we can rename the variablesaccording to the permutation r so that each variable xidepends only on variables preceding it in the list x1; :::; xn.

For each variable xi we de®ne the set Ixi � f1; . . . ;mgas the set of the indices of the atoms where xi occurseither as an independent or dependent variable. Sinceevery variable is ground, all sets Ixi are non-empty.

Since $x_1$ does not depend on any other variable and is ground, all its occurrences in atoms are independent occurrences. This means that we can compute the set of values of $x_1$ in $\Psi(E)$:

$$\pi_{x_1} \Psi(E) = \bigcap_{i \in I_{x_1}} \{x \mid A_i(x) = \text{true}\}$$

Furthermore, let us consider one element $c_1$ in this set (if this set is empty, then $\Psi(E)$ is also empty and its computation is complete). We replace all occurrences of $x_1$ in atoms with the constant $c_1$ (denote the transformed atoms by $A_i(x_1/c_1)$, $1 \le i \le m$). The occurrences of $x_2$ in the atoms $A_i$ with $i \in I_{x_2}$ are either independent or dependent on $x_1$. Therefore, after the substitution of $x_1$ by $c_1$ all the occurrences of $x_2$ in $A_i(x_1/c_1)$ with $i \in I_{x_2}$ become independent. This means that we can compute the set of all values of $x_2$ in the tuples where $x_1$ is $c_1$:

$$\pi_{x_2} \sigma_{x_1 = c_1} \Psi(E) = \bigcap_{i \in I_{x_2}} \{x \mid A_i(x_1/c_1)(x) = \text{true}\}$$

Now we consider an arbitrary element $c_2$ of the above set (if it is empty, then go back and choose another value for $x_1$). We replace all occurrences of $x_2$ in atoms with the constant $c_2$ and iterate the process for all the other variables. That is, we compute the sets

$$\pi_{x_k} \sigma_{x_1 = c_1 \wedge \ldots \wedge x_{k-1} = c_{k-1}} \Psi(E) = \bigcap_{i \in I_{x_k}} \{x \mid A_i(x_1/c_1, \ldots, x_{k-1}/c_{k-1})(x) = \text{true}\}$$

sequentially, for $2 \le k \le n$. Once we have computed the last set, for every element $c_n$ in that set we add the tuple $(c_1, c_2, \ldots, c_n)$ to $\Psi(E)$. Then, recursively, we take another value for $c_{n-1}$ in its set and recompute the set for $x_n$, and so on, until we compute all the tuples. This recursive procedure is described in Fig. 2.
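The backtracking procedure of Case I can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of Fig. 2 itself: each atom is modelled as a hypothetical function that takes the partial assignment built so far and returns the set of admissible values for its variable, and the names `evaluate`, `atoms_for`, `links` and `mentions_vldb` are invented for the example.

```python
def evaluate(order, atoms_for, assignment=None):
    """Enumerate the tuples of a ground range expression with an acyclic
    dependency graph.

    order     -- variables listed so that each depends only on earlier ones
    atoms_for -- maps a variable to the (non-empty) list of atoms mentioning
                 it; an atom is a function from the partial assignment (dict)
                 to the set of values it admits for that variable
    """
    assignment = assignment or {}
    k = len(assignment)
    if k == len(order):                      # all variables are bound
        yield tuple(assignment[v] for v in order)
        return
    var = order[k]
    # Intersection over all atoms in which var occurs, as in the formulas above.
    candidate_sets = [set(atom(assignment)) for atom in atoms_for[var]]
    candidates = set.intersection(*candidate_sets)
    for value in sorted(candidates):         # try each value, then backtrack
        yield from evaluate(order, atoms_for, {**assignment, var: value})


# Toy data: outgoing links of three documents and one index-server answer.
links = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
mentions_vldb = {"b", "c"}

atoms_for = {
    "x": [lambda env: links["a"],            # x is linked from the constant a
          lambda env: mentions_vldb],        # x satisfies a content predicate
    "y": [lambda env: links[env["x"]]],      # y is linked from x
}
tuples = list(evaluate(["x", "y"], atoms_for))
```

Here `tuples` contains the single pair `("b", "c")`: the value `c` for `x` is tried as well, but it has no outgoing links, so the procedure backtracks without producing a tuple.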

Case II ($G_D$ cyclic). Consider a cycle in $G_D$: $x_{i_1} \to x_{i_2} \to \ldots \to x_{i_k} \to x_{i_1}$. From the definition of atoms we infer that all the variables in the cycle are of the type Node (a Link variable always has outdegree zero in the dependency graph). From the fact that all variables are ground we infer that at least one of the vertices in the cycle has an incoming edge from outside the cycle or has an independent occurrence in some atom. Without loss of generality we can consider that $x_{i_1}$ has this property.

We introduce a new variable $x_{n+1}$ and replace all occurrences of $x_{i_1}$ in the atoms where it depends on $x_{i_k}$ by $x_{n+1}$. In this way, the edge $x_{i_k} \to x_{i_1}$ is replaced by an edge $x_{i_k} \to x_{n+1}$, thus breaking the cycle (see Fig. 3). Also, we add a new atom $A_{m+1} = Path(x_{i_1}, \epsilon, x_{n+1})$ to $E$. Note that all variables are still ground in the modified expression. We denote the new expression $E'$. The atom $A_{m+1}$ ensures that in all tuples in $\Psi(E')$, $x_{n+1} = x_{i_1}$ (two different nodes cannot be separated by an empty path). This means that $\Psi(E) = \pi_{x_1, \ldots, x_n} \Psi(E')$.

The new dependency graph $G'_D$ has at least one cycle less than $G_D$. By iterating this procedure until there are no cycles left we obtain an expression $F$ that can be evaluated using the method from Case I. Then, as the last step, we compute $\Psi(E) = \pi_{x_1, \ldots, x_n} \Psi(F)$. This concludes the proof of the theorem.

Remark 1. The converse of the above theorem is not true, since there are computable range expressions which are not ground. For example, consider $E = \{x : T, y : T, z : T \mid P(x), Path(y, False, z)\}$, where $False \in \mathcal{P}_{Link}$ is an unsatisfiable link predicate. Here $y$ and $z$ are not ground but $\Psi(E) = \emptyset$, and therefore trivially computable. However, one can prove (see footnote 4) that every computable range expression is equivalent to a ground range expression.

Queries

After restricting the domain from a large, non-computable set of nodes and links to a computable set, we may use the traditional relational selection and projection to impose further conditions on the result set of a query. This allows us to introduce the general format of queries in our calculus. We assume a given set $\mathcal{P}_s$ of binary predicates over simple types. Examples of predicates include equality (for any type), various inequalities (for numeric types), and substring containment (for alphanumeric types).

Fig. 3. Breaking a dependency cycle

4 This comes as a consequence of a more general theorem in Mendelzon and Milo (1996).


Definition 9. A virtual graph query is an expression of the form $\pi_L \sigma_\varphi E$ where:

- $E = \{x_1 : T_1, \ldots, x_n : T_n \mid A_1, \ldots, A_m\}$ is a range expression;
- $L$ is a comma separated list of expressions of the form $x_i.a_j$ where $a_j$ is some attribute of the type $T_i$;
- $\varphi$ is a Boolean expression constructed from binary predicates from $\mathcal{P}_s$ applied to expressions $x_i.a_j$ and constants using the standard operators $\wedge$, $\vee$, and $\neg$.

The semantics of the select ($\sigma$) and project ($\pi$) operators is the standard one.

3.3 WebSQL semantics

We are now ready to define the semantics of our WebSQL language in terms of the formal calculus introduced above. To do this, we need to model the Web as a virtual graph $W = (\rho_{Node}, \rho_{Link}, \mathcal{P}_{Node}, \mathcal{P}_{Link})$: $dom(Oid)$ is the infinite set of all syntactically correct URLs, and for every element $url \in dom(Oid)$, $\rho_{Node}(url)$ is either the document referred to by $url$, or is undefined, if the URL does not refer to an existing document. Note that $\rho_{Node}(url)$ is computable (its value can be computed by sending a request to the Web server specified in the URL). Moreover, $\rho_{Link}(url)$ is the set of all anchors in the document referred to by $url$, or is undefined, if the URL does not refer to an existing document. One can extract all the links appearing in an HTML document by scanning the contents in search of <A> and </A> tags. This means that the partial function $\rho_{Link}(url)$ is computable. In order to model content queries we consider the following set of Node predicates: $\mathcal{P}_{Node} = \{C_w \mid w \in \Sigma^*\}$ where, for each $w \in \Sigma^*$, $C_w(n) = \text{true}$ if the document $n$ contains the string $w$. Finally, we consider the following set of Link predicates: $\mathcal{P}_{Link} = \{\mapsto, \to, \Rightarrow\}$, in accordance with the definition of path regular expressions in WebSQL.
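To make the computability of the link function concrete, the sketch below extracts anchors from an HTML string using Python's standard `html.parser` module. It is an illustration, not the prototype's code: the names `AnchorCollector`, `link_type` and `rho_link` are invented here, and the classification assumes the usual reading of the three link predicates (interior: same document, local: same server, global: different server).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class AnchorCollector(HTMLParser):
    """Collects (absolute_url, label) for every <A HREF=...>...</A> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # finished (absolute_url, label) pairs
        self._href = None      # target of the anchor currently open, if any
        self._text = []        # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def link_type(base_url, href):
    """Classify a link as interior, local or global (assumed definitions)."""
    b, h = urlsplit(base_url), urlsplit(href)
    if (b.scheme, b.netloc) != (h.scheme, h.netloc):
        return "global"
    return "interior" if b.path == h.path else "local"   # fragment is ignored


def rho_link(base_url, html_text):
    """Return (target, label, type) for each anchor found in html_text."""
    parser = AnchorCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return [(href, label, link_type(base_url, href))
            for href, label in parser.links]


page = ('<A HREF="#top">Top</A> <a href="people.html">People</a> '
        '<A HREF="http://www.ibm.com/">IBM</A>')
anchors = rho_link("http://www.cs.toronto.edu/index.html", page)
```

For the sample page, `anchors` holds one interior, one local and one global link, in document order.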

The semantics of a WebSQL query is defined as usual in terms of selections and projections. Thus a query of the form:

SELECT L
FROM E
WHERE φ

translates to the following calculus query: $\pi_L \sigma_\varphi E'$, where $E' = \{x_1, \ldots, x_n \mid A_1, \ldots, A_n\}$ is obtained from $E = C_1, \ldots, C_n$ by using the following transformation rules:

- if $C_i$ = Document x SUCH THAT u R x then $A_i = Path(u, R, x)$
- if $C_i$ = Document x SUCH THAT x MENTIONS w then $A_i = C_w(x)$
- if $C_i$ = Anchor x SUCH THAT x.base = u.url then $A_i = From(u, x)$

We only allow as legal WebSQL queries those that translate into calculus queries that satisfy the syntactic restrictions of Sect. 3.2.
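The three transformation rules amount to a case analysis on the kind of FROM-clause term. A minimal Python sketch follows; the dictionary encoding of parsed terms and the function name `to_atom` are assumptions made for illustration, since the paper does not describe the parser's data structures at this level.

```python
def to_atom(term):
    """Map one parsed FROM-clause term (hypothetical dict encoding) to an atom."""
    kind = term["kind"]
    if kind == "path":        # Document x SUCH THAT u R x
        return ("Path", term["source"], term["regex"], term["var"])
    if kind == "mentions":    # Document x SUCH THAT x MENTIONS w
        return ("C", term["word"], term["var"])
    if kind == "anchor":      # Anchor x SUCH THAT x.base = u.url
        return ("From", term["source"], term["var"])
    raise ValueError(f"unknown term kind: {kind!r}")


# A small FROM clause: an index term, an anchor term and a path term.
from_clause = [
    {"kind": "mentions", "var": "x", "word": "VLDB96"},
    {"kind": "anchor", "var": "y", "source": "x"},
    {"kind": "path", "var": "w", "source": "x", "regex": "→"},
]
atoms = [to_atom(term) for term in from_clause]
```

The resulting list of atoms is what the groundness restriction of Sect. 3.2 would then be checked against.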

4 Query locality

Cost is an important aspect of query evaluation. The conventional approach in database theory is to estimate query evaluation time as a function of the size of the database. In the web context, it is not realistic to try to evaluate queries whose complexity would be considered feasible in the usual theory, such as polynomial or even linear time.

For a query to be practical, it should not attempt to access too much of the network. Query analysis thus involves, in this context, two tasks: first, estimate what part of the network may be accessed by the query, and then the cost of the query can be analyzed in traditional ways as a function of the size of this sub-network. In this section, we concentrate on the first task. Note that this is analogous, in a conventional database context, to analyzing queries at the physical level to estimate the number of disk blocks that they may need to access.

For this first task, we need some way to measure the "locality" of a query, that is, how far from the originating site do we have to search in order to answer it. Having a bound on the size of the sub-network needed to evaluate a query means that the rest of the network can be ignored. In fact, a query that is sensitive only to a bounded sub-network should give the same result if evaluated in one network or in a different network containing this sub-network. This motivates our formal definition of query locality.

An important issue is the cost of accessing such a sub-network. In the current web architecture, access to remote documents is often done by fetching each document and analyzing it locally. The cost of an access is thus affected by document properties (e.g. size) and by the cost of communication between the site where the query is being evaluated and the site where the document is stored. Recall that we model the web as a virtual graph. To model access costs, we extend the definition of virtual graphs, adding a function $\rho_c : dom(Oid) \times dom(Oid) \to \mathbb{N}$, where $\rho_c(i, j)$ is the cost of accessing node $j$ from node $i$.

4.1 Locality cost

We now define the formal notion of locality. For that we first explain what it means for two networks to contain the same sub-network. We assume below that all the virtual graphs being discussed have the same sets of node and link predicate names.

Definition 10. Let $\Gamma = (\rho_{Node}, \rho_{Link}, \rho_c, \mathcal{P}_{Node}, \mathcal{P}_{Link})$ and $\Gamma' = (\rho'_{Node}, \rho'_{Link}, \rho'_c, \mathcal{P}'_{Node}, \mathcal{P}'_{Link})$ be two (extended) virtual graphs. Let $W \subseteq dom(Oid)$. We say that $\Gamma$ agrees with $\Gamma'$ about $W$ if for all $w, w' \in W$,

1. $\rho_{Node}(w) = \rho'_{Node}(w)$, $\rho_{Link}(w) = \rho'_{Link}(w)$, $\rho_c(w, w') = \rho'_c(w, w')$,
2. for every node predicate $P \in \mathcal{P}_{Node}$, if $n = \rho_{Node}(w)$ is defined, then $P(n)$ holds in $\Gamma$ iff it holds in $\Gamma'$,
3. for every link predicate $P \in \mathcal{P}_{Link}$ and for all links $l \in \rho_{Link}(w)$, $P(l)$ holds in $\Gamma$ iff it holds in $\Gamma'$.


Informally this means that the two graphs contain the sub-network induced by $W$, the nodes of $W$ have the same properties in both graphs, and in both graphs this sub-network is linked to the rest of the world in the same way.
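For finite graphs held in memory, the three conditions of the agreement definition can be checked directly. The sketch below is only an illustration: real virtual graphs are infinite partial functions, and the dictionary layout (`node`, `link`, `cost`, `node_preds`, `link_preds`) and the function name `agrees` are assumptions introduced for the example.

```python
def agrees(g1, g2, W):
    """Check whether finite graph encodings g1 and g2 agree about the set W."""
    for w in W:
        doc1, doc2 = g1["node"].get(w), g2["node"].get(w)      # None = undefined
        links1, links2 = g1["link"].get(w), g2["link"].get(w)
        if doc1 != doc2 or links1 != links2:                    # condition 1
            return False
        if any(g1["cost"].get((w, v)) != g2["cost"].get((w, v)) for v in W):
            return False                                        # condition 1, costs
        if doc1 is not None:                                    # condition 2
            for name, pred in g1["node_preds"].items():
                if pred(doc1) != g2["node_preds"][name](doc1):
                    return False
        for link in links1 or ():                               # condition 3
            for name, pred in g1["link_preds"].items():
                if pred(link) != g2["link_preds"][name](link):
                    return False
    return True


# Two toy graphs that differ only in the content of document "b".
node_preds = {"mentions_web": lambda doc: "web" in doc}
link_preds = {"is_global": lambda link: link[1]}   # link = (target, global flag)
g1 = {
    "node": {"a": "web page", "b": "old text"},
    "link": {"a": [("b", False)], "b": []},
    "cost": {("a", "a"): 0, ("a", "b"): 0, ("b", "a"): 0, ("b", "b"): 0},
    "node_preds": node_preds,
    "link_preds": link_preds,
}
g2 = dict(g1, node={"a": "web page", "b": "new text"})

same_about_a = agrees(g1, g2, {"a"})
same_about_ab = agrees(g1, g2, {"a", "b"})
```

The two graphs agree about `{"a"}` but not about `{"a", "b"}`, which matches the informal reading: a query sensitive only to document `a` cannot tell them apart.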

We next consider locality of queries. In our context, a query is a mapping from the domain of virtual graphs to the domain of sets of tuples over simple types. As in the standard definition of queries, one can further require the mapping to be generic and computable. Since this is irrelevant to the following discussion, we ignore this issue here. Formal definitions of genericity and computability in the context of Web queries can be found in Abiteboul and Vianu (1997); Mendelzon and Milo (1996).

Definition 11. Let $Q$ be a query, let $\mathcal{G}$ be a class of virtual graphs, let $\Gamma \in \mathcal{G}$ be a graph, and let $W \subseteq dom(Oid)$. We say that query $Q$ when evaluated at node $i$ depends on $W$ (for $\Gamma$ and $\mathcal{G}$), if $i \in W$ and for every graph $\Gamma' \in \mathcal{G}$ that agrees with $\Gamma$ about $W$, $Q(\Gamma') = Q(\Gamma)$, and there is no proper subset of $W$ satisfying this.

$W$ is a minimal set of documents needed for computing $Q$. Note that $W$ may not be unique. This is reasonable since the same information may be stored in several places on the network. If $Q$ is evaluated at some node $i$, then the cost of accessing all the documents in such $W$ is the sum of $\rho_c(i, w)$ over all documents $w$ in $W$ such that $\rho_{Node}(w)$ is defined. We are interested in bounding cost with some function of the cost of accessing all the nodes of the network, that is, the sum of $\rho_c(i, j)$ over all $j$ such that $\rho_{Node}(j)$ is defined.

Definition 12. The locality cost of a query $Q$, when evaluated at node $i$, is the maximum, over all virtual graphs $\Gamma \in \mathcal{G}$, and over all sets $W$ on which $Q$ depends in $\Gamma$, of the cost of accessing every document in $W$ from node $i$.

We are interested in bounding the locality cost of a query with some function of the cost of accessing the whole network. If this total cost is $n$, note that the locality cost of a query is at most linear in $n$. Obviously, queries with $O(n)$ locality are impractical: the whole network needs to be accessed in order to answer them. We will be interested in constant bounds, where the constants may depend on network parameters such as number of documents in a site, maximal number of URLs in a single document, certain communication costs, etc.

In general, access to documents on the local server is considered cheap, while documents in remote servers need to be fetched and are thus relatively expensive. To simplify the discussion and highlight the points of interest we assume below a rather simple cost function. We assume that local accesses are free, while the access cost to remote documents is bounded by some given constant. (Similar results can be obtained for a more complex cost function.) For a few examples, consider the query

Q0: SELECT x
    FROM Document x SUCH THAT "http://www.cs.toronto.edu" → x

where "http://www.cs.toronto.edu" is at the local server. The query accesses local documents pointed to by the home page of the Toronto CS department, and no remote ones. Thus the locality cost is $O(1)$. On the other hand, the query

Q1: SELECT x
    FROM Document x SUCH THAT "http://www.cs.toronto.edu" (→|⇒) x

accesses both local and remote documents. The number of remote documents being accessed depends on the number of anchors in the home page that contain remote URLs. In the worst case, all the URLs in the page are remote. If $k$ is a bound on the number of URLs in a single document, then the locality cost of this query is $O(k)$.

As another example, consider the queries

Q2: SELECT x
    FROM Document x SUCH THAT "http://www.cs.toronto.edu" →* x

Q3: SELECT x
    FROM Document x SUCH THAT "http://www.cs.toronto.edu" ⇒.→* x

Q4: SELECT x
    FROM Document x SUCH THAT "http://www.cs.toronto.edu" (→|⇒)* x

Query Q2 accesses local documents reachable from the CS department home page, and is thus of locality $O(1)$. Query Q3 accesses all documents reachable by one global link followed by an unbounded number of local links. If $k$ is a bound on the number of URLs in a single document, and $s$ is a bound on the number of documents in a single server, then the locality cost of the query is $O(ks)$. This is because in the worst case all the URLs in the CS department home page reference documents in distinct servers, and all the documents on those servers are reachable from the referenced documents. The last query, Q4, accesses all reachable documents. In the worst case it may attempt to access the whole network, thus its cost is $O(n)$.

The locality analysis of various features of a query language can identify potentially expensive components of a query. The user can then be advised to rephrase those specific parts, or to give some cost bounds for them in terms of time, number of sites visited, CPU cycles consumed, etc., or, if enough information is available, dollars. The query evaluation would monitor resource usage and interrupt the query when the bound is reached.

A query $Q$ can be computed in two phases. First, the documents $W$ on which $Q$ depends are fetched, and then the query is evaluated locally. Of course, for this method to be effective, computing which documents need to be fetched should not be more complex than computing $Q$ itself. The following result shows that computing $W$ is not harder, at least in terms of how much of the network needs to be scanned, than computing $Q$.

Proposition 1. For every class of graphs $\mathcal{G}$ and every query $Q$, the query $Q'$ that given a graph $\Gamma \in \mathcal{G}$ returns a $W$ s.t. $Q$ depends on $W$, also depends on $W$.


Proof. We use the following auxiliary definition.

Definition 13. Let $Q$ be a query, let $\mathcal{G}$ be a class of graphs, let $G \in \mathcal{G}$ be a graph, and let $W$ be a set of nodes in $G$. We say that $Q$ is $W$-local for $G$ and $\mathcal{G}$, if for every graph $G' \in \mathcal{G}$ that agrees with $G$ about $W$, $Q(G') = Q(G)$.

Note that a query $Q$ depends on $W$ for $G$ and $\mathcal{G}$, if $Q$ is $W$-local and there is no $W' \subset W$ s.t. $Q$ is $W'$-local for $G$. We shall call such $W$ a window of $Q$ in $G$. (Note that there may be many windows for $Q$ in $G$.)

The proof is based on the following claim:

Claim. For every two graphs $G_1, G_2$, every set of nodes $W$ belonging to both graphs, and every query $Q$, the following hold:

(i) If $G_1$ agrees with $G_2$ about $W$ and $Q$ is $W$-local for $G_1$, then $Q$ is also $W$-local for $G_2$.

(ii) If $G_1$ agrees with $G_2$ about $W$ and $Q$ depends on $W$ for $G_1$, then $Q$ also depends on $W$ for $G_2$.

Proof. (Sketch) We first prove claim (i). If $Q$ is $W$-local for $G_1$, then for every graph $G' \in \mathcal{G}$ that agrees with $G_1$ about $W$, $Q(G') = Q(G_1)$. This in particular holds for $G_2$. Also, since $G_2$ agrees with $G_1$ about $W$, the set of graphs that agree with $G_1$ about $W$ is exactly the set of graphs that agree with $G_2$ about $W$. Thus for every graph $G' \in \mathcal{G}$ that agrees with $G_2$ about $W$, $Q(G') = Q(G_2)$, i.e. $Q$ is $W$-local for $G_2$. Claim (ii) follows immediately from claim (i).

We are now ready to prove the proposition. The proof works by contradiction. Clearly the window $W'$ on which $Q'$ depends must contain $W$ (since $W$ is part of the answer of $Q'$). Assume that $W \subset W'$. We shall show that $Q'$ is $W$-local, a contradiction to the minimality of $W'$.

Assume $Q'$ is not $W$-local. Then there must be some graph $G'$ that agrees with $G$ about $W$ but where $Q'$ has a different answer, i.e. $Q$ does not depend on $W$ for $G'$. But claim (ii) above says that if $G$ agrees with $G'$ about $W$ and $Q$ depends on $W$ for $G$, then $Q$ also depends on $W$ for $G'$. A contradiction.

Although encouraging, the above result is in general not of practical use. It says that computing $W$ is not harder in terms of the data required, but it does not say how to compute $W$. In fact it turns out that if the query language is computationally too powerful, the problem of computing $W$ can be undecidable. For example, if your query language is relational calculus augmented with WebSQL features, the $W$ of a query $\{x \mid \varphi \wedge Q4\}$ is the whole network if $\varphi$ is satisfiable, and is empty otherwise. ($\varphi$ here is a simple relational calculus formula and has no path expressions.)

If the query language is too complex, locality analysis may be very complex or even impossible. Nevertheless, there are many cases where locality cost can be analyzed effectively and efficiently. This in particular is the case for the WebSQL query language. The fact that the language makes the usage of links and link traversal explicit facilitates the analysis task. In the next subsection, we show that locality of WebSQL queries can be determined in time polynomial in the size of the query.

4.2 Locality of WebSQL queries

We start by considering simple queries where the FROM clause consists of a single path atom starting from the local server, as in the examples above. We then analyze general queries.

Analyzing single path expressions

The analysis is based on examining the types of links (internal, local, or global) that can be traversed by paths described by the path expression. Particular attention is paid to "starred" sub-expressions since they can describe paths of arbitrary length. Assume that the query is evaluated at some node $i$, and let $n$ denote the cost of accessing the whole network graph from $i$. Let $k$ be some bound on the number of URLs appearing in a single document (see footnote 5), and $s$ some bound on the number of documents in a single server.

1. Expressions with no global links can access only local documents and thus have locality $O(1)$.

2. Expressions containing global links that appear in "starred" sub-expressions can potentially access all the documents in the network. Thus the locality is $O(n)$.

3. Expressions with global links, but where none of the "starred" sub-expressions contain a global link symbol, can access remote documents, but the number of those is limited. All the paths defined by such expressions are of the form $\to^{l_1}.\Rightarrow.\to^{l_2}.\Rightarrow \ldots \to^{l_m}$. Let $l = 1 + \max(l_1, \ldots, l_m)$. The number of documents accessed by such a path is bounded by

$$\min\left(n,\ (m(l+1))\,(k \min(s, k^l))^m\right)$$

where $k$ and $s$ are the bounds above. This is because the number of different documents reachable by a path $\to^{l_i}$ is at most $\min(s, k^{l_i})$. Each of these documents may contain $k$ global links, and in the worst case all the $k \min(s, k^{l_i})$ links in those documents point to files on distinct servers. The number of documents reached at the end of the path is thus at most $(k \min(s, k^l))^m$. In order to reach those documents, in the current Web architecture, all files along the path need to be fetched. To get a bound on this, we multiply the number by the length of the path.

The number $m$ is bounded by the number of global links appearing in the given path expression. $l$ can be computed by analyzing the regular expression (see footnote 6). All this can be done in time polynomial in the size of the expression. Thus the whole bound can be effectively computed.

The above expression is a simple upper bound on the locality cost. A tight bound can also be computed in polynomial time. We present the above bound because the exact expression is very complex and does not add much insight to the analysis.

Observe that if any of the starred sub-expressions contains a local link, then $l = s$. This is because, in the worst case, such sub-expressions will attempt to access all the documents in the server. In this case the bound becomes $\min(n, (m(s+1))(ks)^m)$. Since servers may contain many documents, the locality cost may be very high. This indicates that such queries are potentially expensive and that the user should be advised to provide the query evaluator certain bounds on resources.

5 If no such bound exists, every path expression containing a global link is of locality cost $O(n)$ (because in the worst case a single document may point to all the nodes in the network).

6 It suffices to build an NFA for the expression and compute the lengths of the maximal path between two successive global links, not counting epsilon moves and internal links. $l_i$ is infinite if the path between two successive global links contains a cycle with at least one local link, in which case $\min(s, k^{l_i}) = s$.
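The bound for case 3 is simple arithmetic once $m$ and $l$ are known. The Python sketch below evaluates it for given parameters; the function name `locality_bound` and the sample numbers are assumptions chosen for illustration, and deriving $m$ and $l$ from a path expression (the NFA analysis mentioned in the footnote) is not shown.

```python
def locality_bound(n, k, s, m, l):
    """Evaluate min(n, m*(l+1) * (k*min(s, k**l))**m).

    n -- cost of accessing the whole network
    k -- bound on the number of URLs in a single document
    s -- bound on the number of documents in a single server
    m -- bound on the number of global links in the path expression
    l -- bound on the length of a run of local links (use l = s when a
         starred sub-expression contains a local link)
    """
    reachable_per_run = min(s, k ** l)       # documents reachable by local links
    end_documents = (k * reachable_per_run) ** m
    path_length = m * (l + 1)                # files fetched along each path
    return min(n, path_length * end_documents)


# A query of the shape of Q3 (one global link, then starred local links),
# with illustrative parameters: l = s = 100, k = 20.
q3_bound = locality_bound(n=10**9, k=20, s=100, m=1, l=100)

# The bound never exceeds the cost of the whole network.
capped = locality_bound(n=5000, k=20, s=100, m=1, l=2)
```

With these sample values `q3_bound` evaluates to 202000, that is $(s+1) \cdot ks$, and `capped` is limited to $n = 5000$ because the raw product is 6000.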

Analyzing queries

To analyze a WebSQL query it is not sufficient to look at individual path atoms. The whole FROM clause needs to be analyzed. For example, consider a query

Q5: SELECT z, w
    FROM Document x SUCH THAT x MENTIONS "VLDB96"
         Anchor y SUCH THAT base = x
         Document z SUCH THAT y →* z
         Document w SUCH THAT x → w

The path expressions in the query involve only local links. But since the links returned by the index server may point to remote documents, the paths traversed by the sub-condition

FROM Document x SUCH THAT x MENTIONS "VLDB96"
     Document w SUCH THAT x → w

are actually of the form ((→|⇒).→). Similarly, since the links traversed in "Anchor y SUCH THAT base = x" can be internal, local, or global, the paths traversed by the sub-condition

FROM Document x SUCH THAT x MENTIONS "VLDB96"
     Anchor y SUCH THAT base = x
     Document z SUCH THAT y →* z

are of the form ((→|⇒).(↦|→|⇒).→*). Thus the regular expression describing the path accessed by the query is (((→|⇒).→) | ((→|⇒).(↦|→|⇒).→*)). Assume that querying an index server can be done with locality $c$, and that the number of URLs returned by an index server on a single query is bounded by some number $m$. The locality cost of evaluating this query is therefore bounded by $c$ plus $m$ times the locality bound of (((→|⇒).→) | ((→|⇒).(↦|→|⇒).→*)).

Interestingly, every FROM clause of a WebSQL query can be transformed into a regular expression describing the paths accessed in its evaluation. This, together with the locality cost of querying the index servers used in the query, and the bounds on the size of their answers, lets us determine the locality of the FROM clause in time polynomial in the size of the query. Bounding the locality of the FROM clause provides an upper bound on the locality of the whole query; a slightly better bound can be obtained by analyzing the SELECT and WHERE clauses.

We first sketch below an algorithm for building a regular expression corresponding to the paths traversed by the FROM clause. Next we explain how the SELECT and WHERE clauses can be used in the locality analysis.

To build the regular expression we use the auxiliary notion of determination.

Definition 14. Recall that every domain term $A_i$ in the FROM clause corresponds to an atom $A'_i$ in the calculus. We say that $A_i$ determines a variable $x$, if $x$ is independent in $A'_i$ or it depends on another variable of $A'_i$. For two domain terms $A_i$, $A_j$, we say that $A_i$ determines $A_j$ if $A_i$ determines any of the variables in $A_j$.

For simplicity, we assume below that every variable in the FROM clause is determined by a single domain term. Observe that this is not a serious restriction because every query can be transformed to an equivalent query that satisfies the restriction: whenever a variable depends on several terms, the variables in all these terms but one can be renamed, and equality conditions equating the new variables and the old one can be added to the WHERE clause.

Given this assumption, the determination relationship between domain terms can be described by a forest. This forest is then used to derive the regular expression. The forest is built as follows: the nodes are the atoms in the FROM clause, and the edges describe the dependency relationship between variables.

Note that the roots in such a forest are either index terms (that is, MENTIONS terms), or anchor/path terms with a constant URL as the starting point. The non-root nodes are anchor or path terms. For example, the forest of query Q5 above contains a single tree of the form

FROM Document x SUCH THAT x MENTIONS "VLDB96"
  |-- Document w SUCH THAT x → w
  |-- Anchor y SUCH THAT base = x
        |-- Document z SUCH THAT y →* z

The next step is to replace each term node in the forest by a node containing a corresponding path expression: Index terms are replaced by →|⇒. Non-root anchor terms are replaced by ↦|→|⇒. Root ones are replaced by the same expression, if the constant URL appearing in them is at the local server, and by ⇒.(↦|→|⇒) otherwise. Non-root path terms are replaced by the regular expression appearing in the atom. Root ones with a URL at the local server are also replaced by this regular expression, and otherwise by ⇒ concatenated to the expression. So for example, the above tree becomes


→ | ⇒
  |-- →
  |-- ↦ | → | ⇒
        |-- →*

Finally, we take the obtained forest and build from it a regular expression. This is done by starting at the leaves and going up the forest, concatenating the regular expression of each node to the union of the expressions built for the children (and finally taking the union of all the roots of the forest). The expression thus obtained for the above tree is (→|⇒).(→ | ((↦|→|⇒).→*)) (see footnote 7).
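The bottom-up construction can be sketched as a short recursive function. This is an illustration only: the tree encoding as `(expression, children)` pairs and the names `forest_to_regex` and `q5_tree` are assumptions, and the parenthesization policy is one of several that yield an equivalent expression.

```python
def forest_to_regex(forest):
    """Build one regular expression from a forest of (expr, children) trees."""

    def wrap(expr):
        # Parenthesize anything that contains a union or a concatenation.
        return f"({expr})" if ("|" in expr or "." in expr) else expr

    def union(parts):
        if len(parts) == 1:
            return wrap(parts[0])
        return "(" + "|".join(wrap(p) for p in parts) + ")"

    def build(node):
        expr, children = node
        if not children:                       # leaf: its own expression
            return expr
        # Concatenate the node's expression with the union of its subtrees.
        return wrap(expr) + "." + union([build(child) for child in children])

    roots = [build(tree) for tree in forest]
    return roots[0] if len(roots) == 1 else "|".join(wrap(r) for r in roots)


# The tree obtained above for query Q5, after replacing terms by expressions.
q5_tree = ("→|⇒", [
    ("→", []),
    ("↦|→|⇒", [("→*", [])]),
])
q5_regex = forest_to_regex([q5_tree])
```

For the Q5 tree this produces `(→|⇒).(→|((↦|→|⇒).→*))`, which is the expression given in the text up to whitespace.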

The locality of the FROM clause is a bound on the locality of the query. A slightly better bound can be obtained by analyzing the SELECT and WHERE clauses. For the SELECT clause, if the documents corresponding to variables on which no other variable depends are not used (i.e. only data from their URL is retrieved), it means that the last link of the path is not traversed. This can be easily incorporated into the locality computation of the regular expression. For the WHERE clause, if the condition there is unsatisfiable, the locality can be immediately reduced to $O(1)$.

5 Implementation

This section presents our prototype implementation of a WebSQL compiler, query engine, and user interface.

Both the WebSQL compiler and query engine are implemented entirely in Java, the language introduced by Sun Microsystems with the specific purpose of adding executable content to Web documents. Java applications incorporated in HTML documents, called applets, reside on a Web server but are transferred on demand to the client's site and are interpreted by the client.

A prototype user interface for the WebSQL system is accessible from the WebSQL home page (http://www.cs.toronto.edu/~georgem/WebSQL.html) through a CGI script. Also, we have developed a standalone, GUI-based, Java application that supports interactive evaluation of queries.

The WebSQL system architecture is depicted in Fig. 4.

The compiler and virtual machine

Starting from the BNF specification of WebSQL we built a recursive descent compiler that, while checking for syntactic correctness, recognizes the constructs in a query and stores all the relevant information in internal structures.

After this parsing stage is complete, the compiler generates a set of nested loops that will evaluate the range atoms in the FROM clause. Consider the following query template:

SELECT Fields(x1, ..., xn)
FROM Obj x1 SUCH THAT A1
     Obj x2 SUCH THAT A2
     ...
     Obj xn SUCH THAT An
WHERE Condition(x1, ..., xn)

The WebSQL compiler translates the above query to a program in a custom-designed object language implementing the pseudo-code algorithm depicted in Fig. 5.

Note that this nested loops algorithm is equivalent to the recursive algorithm used in the proof of Theorem 1 (for this restricted form of domain specifying expressions).

The object program is executed by an interpreter that implements a stack machine. Its stack is heterogeneous, that is, it is able to store any type of object, from integers and strings to whole vectors of Node and Link objects.

Fig. 4. The architecture of the WebSQL system

7 Which is equivalent to the expression (((→|⇒).→) | ((→|⇒).(↦|→|⇒).→*)) we obtained in the intuitive discussion above.


The evaluation of range atoms is done via specially designed operation codes whose results are vectors of Node or Link objects.

The query engine

Whenever the interpreter encounters an operation code corresponding to a range atom, the query engine is invoked to perform the actual evaluation. There are three types of atoms, according to Definition 4. Let us examine each of them in sequence:

- if $A = Path(u, R, x)$ the engine generates all simple paths starting at $u$ that match $R$, thus determining the list of all qualifying values of $x$. Mendelzon and Wood (1995) give an algorithm for finding all the simple paths matching a regular expression $R$ in a labeled graph. We adapted this algorithm for the virtual graph context. For full details see Mihaila (1996).

- if $A = C_w$ the engine queries a customizable set of known index servers (currently Yahoo and Lycos) with the string $w$ and builds a sorted list of URLs by merging the individual answer sets;

- if $A = From(u, x)$ the engine determines first if $u$ is an HTML document and if it is, it parses it and builds a list of Link objects out of the set of all the anchor tags; if $u$ is not an HTML document, the engine returns the empty list.
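To illustrate what evaluating a Path atom involves, the following Python sketch enumerates simple paths in a small in-memory graph by brute force and keeps the end nodes whose label sequence matches the expression. It is explicitly not the Mendelzon and Wood algorithm used by the engine; the single-character labels (`l` for local, `g` for global), the `max_len` cut-off and the function name `path_atom` are assumptions made for the example, and interior links are left out for brevity.

```python
import re


def path_atom(start, pattern, out_links, max_len=4):
    """Naive evaluation of Path(start, R, x) on an in-memory graph.

    out_links -- maps a node to a list of (label, target) pairs, where label
                 is 'l' (local link) or 'g' (global link)
    pattern   -- R written as a Python regular expression over those labels
    Only simple paths with at most max_len links are explored.
    """
    compiled = re.compile(pattern)
    results = set()

    def dfs(node, labels, visited):
        if compiled.fullmatch(labels):         # the whole label string must match
            results.add(node)
        if len(labels) == max_len:
            return
        for label, target in out_links.get(node, []):
            if target not in visited:          # never revisit: paths stay simple
                dfs(target, labels + label, visited | {target})

    dfs(start, "", {start})
    return sorted(results)


# Toy graph: home has a local link to a and a global link to r.
graph = {
    "home": [("l", "a"), ("g", "r")],
    "a": [("l", "b")],
    "r": [("l", "r2")],
}
remote_then_local = path_atom("home", "gl*", graph)   # shape of ⇒.→*
local_only = path_atom("home", "l+", graph)           # one or more local links
```

On this graph `remote_then_local` is `["r", "r2"]` and `local_only` is `["a", "b"]`. A practical engine would prune partial paths that can no longer match, which this sketch does not attempt.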

The user interface

In order to make WebSQL available to all WWW users, we have designed a CGI C program invoked from an HTML form. The appearance of the HTML form is shown in Fig. 6, a screen shot of the HotJava browser.

The input form can be used as a template for the most common WebSQL queries, making it easier for the user to submit a query. If the query is more complicated it can always be typed into an alternative text field. After the query is entered it may be submitted by pressing the appropriate button. At that point, the Java applet collects all the data from the input fields and assembles the WebSQL query. Then the query is sent to the Parser, which checks the syntax and produces the object code. The object code is then executed by the Interpreter and finally a query result set is computed. This set is formatted as an HTML document and displayed by the browser. All URL fields that appear in the result are formatted as anchors so that the user may jump easily to the associated documents. Figure 7 contains a screen shot of a typical result document.

Performance

The execution time of a WebSQL query is in¯uenced byvarious factors related to the network accesses performedin the process of building the result set. Among thesefactors we can mention the number and size of the

transferred documents, the available network bandwidth,and the performance and load of the accessed Webservers. Because our query processing system does notmaintain any persistent local information between que-ries it has to access the Web for every new query.Therefore, care must be taken when formulating queries

Fig. 6. The WebSQL user interface

Fig. 7. WebSQL query results

66

Page 14: Querying the World Wide Web

by estimating the number of documents that have to beretrieved. We executed a number of queries by runningthe Java applet described in the previous section fromwithin an instance of the HotJava browser running underSolaris 2.3 on a SUN Sparcserver 20/612 with 2 CPUsand 256 Mbytes of RAM.

The execution times for the queries we tested vary from under ten seconds for simple content queries to several minutes for structural queries involving the exploration of Web subgraphs with about 500 nodes.
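As a rough aid for estimating how many documents a structural query may have to retrieve, one can bound the size of the explored subgraph under the simplifying assumption that every document has about the same number of outgoing links. The function below is a hypothetical back-of-the-envelope helper in Python, not part of the prototype.

```python
def max_documents_visited(out_degree: int, max_path_length: int) -> int:
    """Upper bound on documents fetched when every link is followed
    along paths of at most `max_path_length` links.

    Assumes a uniform out-degree and ignores documents reached twice,
    so the true number of fetches can only be smaller.
    """
    # 1 (start document) + d + d^2 + ... + d^k
    return sum(out_degree ** depth for depth in range(max_path_length + 1))
```

With an assumed out-degree of 22 and paths of length at most 2, the bound is 507 documents, the same order of magnitude as the subgraphs of about 500 nodes mentioned above. These figures are illustrative only and are not measurements from the prototype.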

6 Discussion

We have presented the WebSQL language for querying the World Wide Web, given its formal semantics in terms of a new virtual graph model, proposed a new notion of query cost appropriate for Web queries, and applied it to the analysis of WebSQL queries. Finally, we described the current prototype implementation of the language.

Looking at Fig. 6, one may be skeptical that this complex interface will replace simple keyword-based search engines. However, this is not its purpose. Just as SQL is by and large used not by end users but by programmers who build applications, we see WebSQL as a tool for helping build Web-based applications more quickly and reliably. Some examples:

Selective indexing. As the Web grows larger, we will often want to build indexes on a selected portion of the network. WebSQL can be used to specify this portion declaratively.

View definition. This is a generalization of the previous point, as an index is a special kind of view. Views and virtual documents are likely to be an important facility, as discussed by Konopnicki and Shmueli (1995), and a declarative language is needed to specify them.

Link maintenance. Keeping links current and checking whether documents that they point to have changed is a common task that can be automated with the help of a declarative query language.
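To make the link maintenance example more concrete, the sketch below classifies a set of link targets, such as those returned by a query, as broken, changed, or unchanged. It is a hypothetical Python illustration: the `probe` callback (which would typically issue an HTTP HEAD request), the use of the Last-Modified value as the change indicator, and all names are assumptions rather than part of WebSQL.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A probe returns (status_code, last_modified) for a URL;
# status_code is None when the server cannot be reached.
Probe = Callable[[str], Tuple[Optional[int], Optional[str]]]


def check_links(hrefs: Iterable[str],
                probe: Probe,
                known_last_modified: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify link targets as broken, changed since the last check, or unchanged."""
    report: Dict[str, List[str]] = {"broken": [], "changed": [], "unchanged": []}
    for href in hrefs:
        status, last_modified = probe(href)
        if status is None or status >= 400:
            report["broken"].append(href)
        elif known_last_modified.get(href) != last_modified:
            # Also covers targets that were never checked before.
            report["changed"].append(href)
        else:
            report["unchanged"].append(href)
    return report
```

Passing the probe as a parameter keeps the classification logic independent of the network, so it can be exercised with recorded responses.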

Several directions for extending this work present themselves. First, instead of being limited to a fixed repertoire of link types (internal, local, and global), we would like to extend the language with the possibility of defining arbitrary link types in terms of their properties, and use the new types in regular expressions. For example, we might be interested in links pointing to nodes in Canada such that their labels do not contain the strings "Back" or "Home."
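One way to picture such user-defined link types is as predicates over link properties that can be combined. The Python sketch below is only an illustration of the idea and not a proposed syntax for the language; the `Link` fields, the combinator names, and in particular the approximation of "in Canada" by a `.ca` host name are assumptions.

```python
from typing import Callable, NamedTuple
from urllib.parse import urlparse


class Link(NamedTuple):
    base: str   # URL of the document that contains the anchor
    href: str   # URL the anchor points to
    label: str  # anchor text


# A link type is modelled as a predicate over links.
LinkType = Callable[[Link], bool]


def points_to_canada(link: Link) -> bool:
    # Assumption: "in Canada" is approximated by a .ca top-level domain.
    host = urlparse(link.href).hostname or ""
    return host.endswith(".ca")


def label_excludes(*words: str) -> LinkType:
    """Link type satisfied when the label contains none of the given strings."""
    return lambda link: not any(w in link.label for w in words)


def all_of(*types: LinkType) -> LinkType:
    """Conjunction of link types."""
    return lambda link: all(t(link) for t in types)


# The example from the text: links to Canadian nodes whose labels
# do not contain "Back" or "Home".
canadian_content_link = all_of(points_to_canada, label_excludes("Back", "Home"))
```

A type defined this way could then play the same role in a path regular expression as the built-in internal, local, and global link types.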

Second, we would like to make use of internal document structure when it is known, along the lines of Christophides et al. (1994) and Quass et al. (1995).

There is also a great deal of scope for query optimization. We do not currently attempt to be selective in the index servers that are used for each query, or to propagate conditions from the WHERE to the FROM clause to avoid fetching irrelevant documents. It would also be interesting to investigate a distributed architecture in which subqueries are sent to remote servers to be executed there, avoiding unnecessary data movement.

Acknowledgements. This work was supported by the Information Technology Research Centre of Ontario and the Natural Sciences and Engineering Research Council of Canada. We thank the anonymous reviewers for their suggestions.

References

Abiteboul, S., Vianu, V.: Queries and computation on the Web. In: Proc. ICDT '97, 1997

Abiteboul, S., Cluet, S., Milo, T.: Querying and updating the file. In: Proceedings of the 19th VLDB Conference, 1993

Beeri, C., Kornatzky, Y.: A logical query language for hypertext systems. In: Proc. of the European Conference on Hypertext, Cambridge University Press, 1990, pp 67–80

Berners-Lee, T., Cailliau, R., Luotonen, A., Frystyk Nielsen, H., Secret, A.: The World-Wide Web. Commun. ACM 37:76–82 (1994)

Christophides, V., Abiteboul, S., Cluet, S., Scholl, M.: From structured documents to novel query facilities. In: Proc. ACM SIGMOD '94, 1994, pp 313–324

Consens, M. P., Mendelzon, A. O.: Expressing structural hypertext queries in Graphlog. In: Hypertext '89, 1989, pp 269–292

Dreilinger, D.: Savvysearch home page. 1996. http://guaraldi.cs.colostate.edu:2000/.

Güting, R. H., Zicari, R., Choy, D. M.: An algebra for structured office documents. ACM TOIS 7:123–157 (1989)

Hasan, M. Z., Golovchinsky, G., Noik, E. G., Charoenkitkarn, N., Chignell, M., Mendelzon, A. O., Modjeska, D.: Visual Web surfing with Hy+. In: Proceedings CASCON '95, Toronto, November 1995. IBM Canada. ftp://db.toronto.edu/pub/papers/cascon95-multisurf.ps.Z.

Konopnicki, D., Shmueli, O.: A query system for the World Wide Web. In: Proc. VLDB '95, 1995, pp 54–65

Lakshmanan, L. V. S., Sadri, F., Subramanian, I. N.: A declarative language for querying and restructuring the Web. In: Proc. 6th International Workshop on Research Issues in Data Engineering, RIDE '96, New Orleans, February 1996 (in press)

Mendelzon, A. O., Milo, T.: Formal models of web queries. In: Proc. ACM PODS 1997 (in press)

Mendelzon, A. O., Wood, P. T.: Finding regular simple paths in graph databases. SIAM J. Comput. 24 (1995)

Mihaila, G. A.: WebSQL - an SQL-like query language for the World Wide Web. Master's thesis, University of Toronto, 1996

Minohara, T., Watanabe, R.: Queries on structure in hypertext. In: Foundations of Data Organization and Algorithms, FODO '93, Berlin Heidelberg New York: Springer 1993, pp 394–411

Navarro, G., Baeza-Yates, R.: A model to query documents by contents and structure. In: Proc. ACM SIGIR '95, 1995, pp 93–101

Quass, D., Rajaraman, A., Sagiv, Y., Ullman, J., Widom, J.: Querying semistructured heterogeneous information. In: Deductive and Object-Oriented Databases, Proceedings of the DOOD '95 Conference, Singapore, December 1995. Berlin Heidelberg New York: Springer 1995, pp 319–344

Selberg, E., Etzioni, O.: Multi-service search and comparison using the MetaCrawler. In: Proceedings of the Fourth Int'l WWW Conference, Boston, December 1995. http://www.w3.org/pub/Conferences/WWW4/Papers/169.

Sun Microsystems. Java (tm): Programming for the Internet. http://java.sun.com.
