
Li GL, Feng JH. An effective semantic cache for exploiting XPath query/view answerability. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 25(2): 347-361 Mar. 2010

An Effective Semantic Cache for Exploiting XPath Query/View Answerability

Guo-Liang Li, Member, CCF, ACM, and Jian-Hua Feng, Senior Member, CCF, Member, ACM, IEEE

    Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology

    Tsinghua University, Beijing 100084, China

    E-mail: {liguoliang, fengjh}@tsinghua.edu.cn

    Received April 2, 2008; revised October 9, 2009.

Abstract    Maintaining a semantic cache of materialized XPath views inside or outside the database is a novel, feasible and efficient approach to facilitating XML query processing. However, most of the existing approaches have the following disadvantages: 1) they cannot discover enough potential cached views to effectively answer subsequent queries; or 2) they are inefficient for view selection due to the complexity of XPath expressions. In this paper, we propose SCEND, an effective Semantic Cache based on dEcompositioN and Divisibility, to exploit the XPath query/view answerability. The contributions of this paper include: 1) a novel technique of decomposing complex XPath queries into much simpler ones, which facilitates discovering more potential views to answer a new query than the existing methods and thus adequately exploits the query/view answerability; 2) an efficient view-selection method that checks the divisibility between two positive numbers assigned to queries and views; 3) a cache-replacement approach to further enhancing the query/view answerability; 4) an extensive experimental study which demonstrates that our approach achieves high performance and outperforms the existing state-of-the-art methods significantly.

    Keywords XML query processing, semantic cache, view selection, cache lookup

    1 Introduction

XML is increasingly being used in data-intensive applications and has become the de facto standard over the Internet. Major database vendors are incorporating native XML support in the latest versions of their relational database products. The number and size of XML databases are rapidly increasing, and XML data have become the focus of query evaluators and optimizers.

In a relational database system, the in-memory buffer cache is crucial for good performance, and a similar buffer cache can be employed in XML systems. Maintaining a semantic cache of query results has been proposed[1-3]; such caches address the computational cost and complement the buffer cache. The cached queries are basically materialized views, which can be used in query processing. Thus, at any moment, the semantic cache contains some views {V1, V2, . . . , Vn}. When the system has to evaluate a new query Q, it inspects each view Vi in the cache and determines whether it is possible to answer Q from the cached result of Vi. We say that a view Vi answers query Q if there exists a query CQ which, when executed on the result of Vi, gives the result of Q. We denote this as CQ ∘ Vi ≡ Q, and we call CQ the Compensating Query.

When some cached view can answer an issued query, we have a hit; otherwise we have a miss. There are several applications for such a semantic cache. Firstly, consider its use inside the XML database system. Suppose query Q can be answered by view V with compensating query CQ. Then, we can answer Q by executing CQ, which is simpler than Q, on the result of V, which is a much smaller XML fragment than the original data instance. This can result in a significant speedup, as we show in our experiments. Secondly, the semantic cache can also be maintained at the application tier. Here, there will be additional savings for a hit, from not having to connect to the backend database. For a heavily loaded backend server, these savings can be large. This kind of middle-tier caching has become popular for Web applications using relational databases[4].

Regular Paper

This work is partly supported by the National Natural Science Foundation of China under Grant No. 60873065, the National High Technology Research and Development 863 Program of China under Grant Nos. 2007AA01Z152 and 2009AA011906, and the National Basic Research 973 Program of China under Grant No. 2006CB303103.

©2010 Springer Science + Business Media, LLC & Science Press, China


Further, the semantic cache can also be maintained in a different database system, on a remote host. Thus, unlike the page-based buffer cache, it can be employed in a distributed setting too. Finally, the semantic cache can also be employed in a setting like distributed XQuery[5], where sub-queries of a query might refer to remote XML data sources connected over a WAN. Here, a sub-query that hits in the local cache will not have to be sent over the network, and the savings can be huge.

Checking query/view answerability requires matching operations between the tree patterns of the query and the view. Looking up the semantic cache by iterating over all the views will be rather inefficient when the number of views is large. Mandhani and Suciu[2] proposed a semantic cache which maintains a table for XML views in the relational database and needs string matching and other complicated operations, and thus this method is not cost-efficient. Further, it cannot discover sufficient views to answer a new query. We present some examples to show how queries are answered from the cache and to illustrate the disadvantages of existing studies. They will make clearer the challenges in doing efficient lookup in a large cache and also illustrate query rewriting for cache hits.

Example 1. Suppose there is a cached view V and seven queries Q1, Q2, . . . , Q7, as shown in Fig.1.

V = a[//e]//b[c[//a][b]][a]/d[//b][c > 50],

Q4 = a[//e]//b[c[//a][b]][a]/d[//a[b][d]][c > 50],

Q5 = a[//e]//b[c[//a][b]][a]/d[//a[b]][c > 100].

It is obvious that the results of V contain the results of Q4 and Q5. Consider Q4 with CQ4 = d[//a[b][d]]: we only need to check whether the element d in the result of V satisfies CQ4. Note that processing CQ4 on V is much easier than processing Q4 on the original instance. Consider Q5 with CQ5 = d[//a[b]][c > 100]: we only need to check whether the element d in the result of V satisfies CQ5. We need not process Q4 and Q5 with the traditional XML query processing methods; alternatively, we construct compensating queries and answer such simpler queries, which saves I/O and improves query performance.

It is easy to find out that the results of V also contain the results of Q1, Q2 and Q3. However, in a naive cache system, none of Q1, Q2, . . . , Q5 can be answered by V, because they are not equivalent to V. Even in [2], Q1, Q2 and Q3 cannot be answered by V, as they do not satisfy the string match. Moreover, although Q6 is also a string match with V, it is obvious that V cannot answer Q6. Further, the method proposed in [2] cannot support // and the wildcard ∗ in XPath queries.

To address the above-mentioned problems, in this paper we demonstrate how to discover V to answer Q effectively. We propose SCEND, an efficient Semantic Cache based on dEcompositioN and Divisibility. In SCEND, V can answer Q1, Q2, . . . , Q5. Most importantly, we can effectively filter out Q6 and Q7 in view selection. To summarize, we make the following contributions:

• We propose SCEND, an efficient Semantic Cache based on dEcompositioN and Divisibility, for effective query caching and view selection, which can significantly improve the query/view answerability.

Fig.1. A cached view and seven queries. V is a cached query and Q1, Q2, . . . , Q7 are seven user-issued queries. We describe how to use V to answer queries Q1, Q2, . . . , Q7.


• We introduce a novel technique for effective view selection by checking the divisibility of two positive integers assigned to queries and views, which can significantly improve the efficiency of view selection.

• We demonstrate an effective technique to exploit the cache answerability by decomposing complex XPath queries into much simpler ones, which can discover sufficient cached views to answer queries so as to improve the cache hit rate.

• We have implemented our proposed approach and conducted an extensive performance study using both real and synthetic datasets with various characteristics. The results show that our algorithm achieves high performance and outperforms the existing state-of-the-art approaches[2] significantly.

The rest of this paper is organized as follows. We start with the background and introduce some preliminaries in Section 2. Section 3 proposes a novel strategy for efficient cache lookup and Section 4 presents a new technique for effective view selection. We devise effective algorithms for view selection and query rewriting in Section 5. Section 6 proposes a novel method for cache replacement. In Section 7, we provide our experimental results, and we review related work in Section 8. Finally, we conclude in Section 9.

    2 Problem Statement

    2.1 Preliminaries

This subsection formally introduces the technique of query/view answerability. The question that we consider is: given a view V and a query Q, does V answer Q, and if yes, what should CQ be so that CQ ∘ V ≡ Q?

However, for a certain view V, selecting which nodes and their answers to cache is an important problem. Note that the more nodes are selected to cache, the higher the hit rate is, but the more storage is needed to cache them. With limited memory, there is a trade-off between caching more nodes of a certain query and caching more queries. In addition, which nodes are selected to cache will also influence the performance of the XML DBMS.

Note that the result of V containing that of Q does not imply that V can answer Q. For example, suppose V = a[c]//b/d[e] and Q = a[c]/b/d[e]; the result of Q is contained in that of V. If only the result of the returned node d of V is cached, it is impossible to answer Q based on the result of V, as there are no results of nodes a, b in the cache, and we do not know which d element in the cached view satisfies a/b/d. However, if the results of nodes a, b in V are also cached, we can use the results of V to answer Q as follows. We first get the element sets Sa, Sb by selecting elements that satisfy a/b on the cached results of a, b respectively, and then compute the result set Sd by selecting the elements in the result of d which have a parent in Sb.

In this paper, to improve the query/view answerability, besides caching the result of the returned node, we also cache the results of some other nodes. We introduce the techniques of view selection and query processing with cached views in the following subsections.

    2.2 Notations

Definition 1 (Tree Pattern). A tree pattern is a labeled tree TP = (V, E), where V is the vertex set and E is the edge set. Each vertex v has a label, denoted by v.label, in tagSet ∪ {∗}, where tagSet is the set of all element names in the context. An edge can be a child edge (P-C edge) representing the parent-child relationship (/) or a descendant edge (A-D edge) representing the ancestor-descendant relationship (//).

In this paper, XPath queries are naturally represented as tree patterns. We use these tree patterns to derive a sound procedure for answering the question whether V can answer Q and, if so, how to construct a compensating query CQ.

We present an example showing how to represent an XPath query as a tree pattern. Fig.1 shows the tree pattern for

V = a[//e]//b[c[//a][b]][a]/d[//b][c > 50].

Child and descendant axes are denoted by a single slash and a double slash respectively. The ellipse-shaped nodes are predicates qualifying their parent nodes. Note that the returned node d of the query is marked by a yellow circle.
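For readers who prefer code, the following Python sketch (illustrative only, not the authors' implementation) shows one possible in-memory representation of such a tree pattern, hand-built for the view V of Fig.1. The class and field names are assumptions, and the value predicate [c > 50] is kept as a plain label for brevity.

```python
from dataclasses import dataclass, field
from typing import List

# A minimal tree-pattern node: a label ('*' would be the wildcard), the axis of
# the edge to its parent ('/' parent-child, '//' ancestor-descendant), its
# children, and whether it is the returned node.
@dataclass
class PatternNode:
    label: str
    axis: str = "/"                                   # axis of the incoming edge
    children: List["PatternNode"] = field(default_factory=list)
    returned: bool = False

    def add(self, child: "PatternNode") -> "PatternNode":
        self.children.append(child)
        return child

# Hand-built pattern for V = a[//e]//b[c[//a][b]][a]/d[//b][c > 50] from Fig.1.
root = PatternNode("a")
root.add(PatternNode("e", axis="//"))      # predicate //e on a
b = root.add(PatternNode("b", axis="//"))  # a//b on the main path
c = b.add(PatternNode("c"))                # predicate c[//a][b] on b
c.add(PatternNode("a", axis="//"))
c.add(PatternNode("b"))
b.add(PatternNode("a"))                    # predicate [a] on b
d = b.add(PatternNode("d"))                # b/d on the main path
d.add(PatternNode("b", axis="//"))         # predicate //b on d
d.add(PatternNode("c > 50"))               # value predicate, kept as a plain label
d.returned = True                          # d is the returned node
```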

For any view V and query Q, V can answer Q if the result of Q is contained in that of V. A query p is contained in a query q if and only if there is a homomorphism from q to p[6]; this is a classical characterization result for conjunctive queries against relational databases, and similar characterizations can also be given for some tree patterns.

We use the concepts of homomorphism and tree inclusion[7] to define the inclusion between two trees. If we employ the results of V to answer Q, then Q must be included in V; however, this is a necessary but not sufficient condition. Moreover, Tree Pattern Inclusion is very complicated and difficult to validate, thus we introduce the concept of Restrictive Tree Pattern Inclusion. The difference between them is that the latter must ensure that the homomorphism h is an injection.

(A view V implies both its cached query and the corresponding result; when there is no ambiguity, we may also refer to V as the cached query.)


More importantly, Restrictive Inclusion is easier to validate than Inclusion, which will be further demonstrated in Section 4.

Example 2. In Fig.1, as Q1, Q2, . . . , Q5 are restrictively included in V, their results are contained in that of V. For each i, 1 ≤ i ≤ 5, we need to construct a compensating query CQi which satisfies CQi ∘ V ≡ Qi to answer Qi. Note that processing CQi on V is much easier than directly answering Qi.

    3 SCEND: An Effective Semantic Cache

In this section, we present a novel framework of semantic cache for effective query caching and view selection.

    3.1 Criteria for Answerability

For simplicity, we first give some notations.

Definition 2 (Main Path). A tree pattern's Main Path is the path from the root node to the returned node in the query tree pattern. Nodes on this path are called the axis nodes, while the others are called predicate nodes. The query depth of a tree pattern Q is the number of axis nodes, denoted as Dep(Q).

Definition 3 (Prefix(Q, k) and Predicates). Prefix(Q, k) is the query obtained by truncating query Q at its k-th axis node. The k-th axis node is included, but its predicates are not. Preds(Q, k) is the set of predicates of the k-th axis node of Q. Infix(Q, k) is composed of the k-th axis node and its predicates, without the (k+1)-th axis node. Q^k denotes the subtree of Q rooted at its k-th axis node.

To help users understand our notations better, we give some examples. Consider the query

Q = a[v]/b[@w = val1][x[//y]]//c[z > val2]

in Fig.2. The depth of Q is three, and a, b, c are its first, second and third axis nodes respectively:

Prefix(Q, 2) = a[v]/b;

Preds(Q, 2) = {@w = val1, x[//y]};

Q^2 = b[@w = val1][x[//y]]//c[z > val2];

Infix(Q, 2) = b[@w = val1][x[//y]].
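The decomposition of Definition 3 is easy to compute once a query is stored as its list of axis steps. The Python sketch below (illustrative names, not the paper's code) keeps each axis node's predicates as opaque strings and reproduces the values above for the example query Q.

```python
from dataclasses import dataclass
from typing import List

# One axis node of a query: the axis of its incoming edge, its label, and its
# predicates kept as opaque strings (each predicate is itself a tree pattern).
@dataclass
class AxisStep:
    axis: str                  # '/' or '//'
    label: str
    predicates: List[str]

# Q = a[v]/b[@w=val1][x[//y]]//c[z>val2] from the example above.
Q = [
    AxisStep("/",  "a", ["v"]),
    AxisStep("/",  "b", ["@w=val1", "x[//y]"]),
    AxisStep("//", "c", ["z>val2"]),
]

def step_str(s: AxisStep, with_preds: bool) -> str:
    preds = "".join(f"[{p}]" for p in s.predicates) if with_preds else ""
    return s.label + preds

def dep(q: List[AxisStep]) -> int:
    return len(q)                                   # Dep(Q): number of axis nodes

def main_path(q: List[AxisStep], k: int) -> str:
    # MainPath(Q, k): the path over axis nodes 1..k, without any predicates.
    return step_str(q[0], False) + "".join(s.axis + step_str(s, False) for s in q[1:k])

def infix(q: List[AxisStep], k: int) -> str:
    # Infix(Q, k): the k-th axis node together with its predicates.
    return step_str(q[k - 1], True)

def prefix(q: List[AxisStep], k: int) -> str:
    # Prefix(Q, k): Q truncated at its k-th axis node; that node keeps no predicates.
    if k == 1:
        return step_str(q[0], False)
    parts = [step_str(q[0], True)]
    parts += [s.axis + step_str(s, True) for s in q[1:k - 1]]
    parts.append(q[k - 1].axis + step_str(q[k - 1], False))
    return "".join(parts)

print(dep(Q))            # 3
print(prefix(Q, 2))      # a[v]/b
print(infix(Q, 2))       # b[@w=val1][x[//y]]
print(main_path(Q, 3))   # a/b//c
```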

The XPath fragment we cover includes the // axis, the wildcard ∗, and node labels. Predicates can be any of the following: equalities with string or numeric constants, comparisons with numeric constants, or an arbitrary XPath expression from this fragment. We also consider join predicates. In this paper, to improve the cache answerability, we cache the results of all the axis nodes.

We give a weak sufficient condition for Q ⊑ V, formalized in Theorem 1. For ease of presentation, we introduce some notations. MainPath(Q, k) denotes the path of Q from the root to the k-th axis node of Q. MainPath(V) denotes the main path of V. The depth of V is denoted as Dep(V). axisNode(Q, k) denotes the k-th axis node of Q.

Fig.2. Tree patterns for V and Q.

Theorem 1. Q ⊑ V if

(i) MainPath(Q, Dep(V)) ⊑ MainPath(V); and

(ii) ∀k, 1 ≤ k ≤ Dep(V), 1) Infix(Q, k) ⊑ Infix(V, k); and 2) axisNode(Q, k) = axisNode(V, k).

Proof. As MainPath(Q, Dep(V)) ⊑ MainPath(V), there exists a homomorphism h from MainPath(V) to MainPath(Q, Dep(V)). As Infix(Q, k) ⊑ Infix(V, k), there exists a homomorphism h_k from Infix(V, k) to Infix(Q, k). As the k-th axis nodes of Q and V are the same, any node a in MainPath(V) must be an axis node. Without loss of generality, suppose a_k is the k-th axis node; thus a_k is the root node of Infix(V, k), and h(a_k) = h_k(a_k) = a_k. Therefore, we can construct h′ as follows: ∀v ∈ V, there exists exactly one k such that v ∈ Infix(V, k), and we set h′(v) = h_k(v). It is obvious that h′ is a homomorphism from V to Q. Hence, Q ⊑ V. □

Based on Theorem 1, we give a condition under which V can answer Q.

Definition 4 (V ⊵ Q). V ⊵ Q (read: V answers Q) if V and Q satisfy:

(i) MainPath(Q, Dep(V)) ⊑ MainPath(V); and

(ii) ∀k, 1 ≤ k ≤ Dep(V), Infix(Q, k) ⊑ Infix(V, k) and axisNode(Q, k) = axisNode(V, k).

Corollary 1. If V ⊵ Q, the result of V contains that of Q; that is, the view V can answer the query Q.

Corollary 1 is obvious based on Theorem 1, which ensures that if V ⊵ Q, Q can be answered from the result of V. In this paper, if V ⊵ Q, we say that Q can be answered by the cached view; otherwise Q cannot be answered.

For example, in Fig.1, ∀i, 1 ≤ i ≤ 5, V ⊵ Qi, thus Q1, Q2, . . . , Q5 can be answered by V, while V cannot answer Q6 and Q7.


Note that for any tree patterns V and Q, it is much easier to check whether V ⊵ Q holds through Definition 4 than whether Q ⊑ V holds. Because the main path only contains /, // and ∗, without [ ], and the complexity of containment for XPath{/, //, ∗} is proven to be polynomial[8], it is easy to check whether MainPath(Q, Dep(V)) ⊑ MainPath(V) is true. In addition, Infix(Q, k) is simpler than Q, thus it is easy to check whether Infix(Q, k) ⊑ Infix(V, k) holds.

To answer Q, we should construct a compensating query CQ which satisfies CQ ∘ V ≡ Q. We will present how to construct CQ in Subsection 3.2. If more than one cached view satisfies V ⊵ Q, it is better to select the best V, i.e., the one that needs the least additional operations on V to answer Q. We will address this issue in the following subsection.
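To make the main-path check concrete, the following is a minimal Python sketch (not the authors' implementation) of a sufficient test for MainPath(Q, Dep(V)) ⊑ MainPath(V): it searches for a containment mapping from the view's main path onto the query's main path. A containment mapping implies containment, so the test is sound; for an exact polynomial-time test the paper relies on [8], which is not reproduced here. The `Step` encoding and function names are illustrative assumptions.

```python
from functools import lru_cache
from typing import List, Tuple

# A main-path step: (incoming axis '/' or '//', label); '*' is the wildcard.
Step = Tuple[str, str]

def path_contained(p: List[Step], q: List[Step]) -> bool:
    """Sufficient test that path pattern p is contained in path pattern q
    (every data path matched by p is also matched by q): look for a
    containment mapping from q, the more general path, onto p."""

    def label_ok(q_label: str, p_label: str) -> bool:
        return q_label == "*" or q_label == p_label

    @lru_cache(maxsize=None)
    def match(i: int, j: int) -> bool:
        # Try to map node q[j] onto node p[i].
        if not label_ok(q[j][1], p[i][1]):
            return False
        if j + 1 == len(q):                 # q's returned node maps to p's returned node
            return i + 1 == len(p)
        if q[j + 1][0] == "/":
            # a child edge of q must map onto a child edge of p
            return i + 1 < len(p) and p[i + 1][0] == "/" and match(i + 1, j + 1)
        # a descendant edge of q may map onto any downward path of length >= 1 in p
        return any(match(k, j + 1) for k in range(i + 1, len(p)))

    return bool(p) and bool(q) and match(0, 0)      # roots must map to roots

# MainPath(V) and MainPath(Q, Dep(V)) are both a//b/d for V and Q1..Q5 of Fig.1:
v_mp = [("/", "a"), ("//", "b"), ("/", "d")]
q_mp = [("/", "a"), ("//", "b"), ("/", "d")]
print(path_contained(q_mp, v_mp))                                    # True
print(path_contained([("/", "a"), ("/", "b"), ("/", "d")], v_mp))    # True: a/b/d is in a//b/d
print(path_contained(v_mp, [("/", "a"), ("/", "b"), ("/", "d")]))    # False
```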

    3.2 Compensating Queries

To improve the cache hit rate, we cache the results of all the axis nodes, which improves the performance of the semantic cache. For the example in Fig.1, if we only cache the result of the returned node d of V, V can answer Q4 and Q5. However, if we cache the results of a, b, d instead of the result of only d, whose cost is linear in that of caching only d, V can answer Q1, Q2, . . . , Q5, and thus the cache hit rate is improved.

Suppose that, for all k, 1 ≤ k ≤ Dep(V), CQ_k = Infix(Q, k), where Infix(Q, k) is considered as a query taking the k-th axis node as its returned node, and VR_k is the cached result of the k-th axis node of V. Let CQ_MP = MainPath(Q, Dep(V)), D = Dep(V), and let Q_D denote Q^D taking the returned node of Q as its own returned node.

Theorem 2. (CQ_MP ∘ (CQ_1 ∘ VR_1, CQ_2 ∘ VR_2, . . . , CQ_D ∘ VR_D)) ∘ Q_D ≡ Q.

Proof. As Infix(Q, k) ⊑ Infix(V, k), CQ_k ∘ VR_k ≡ Infix(Q, k). As MainPath(Q, Dep(V)) ⊑ MainPath(V), (CQ_MP ∘ (CQ_1 ∘ VR_1, CQ_2 ∘ VR_2, . . . , CQ_D ∘ VR_D)) ≡ Prefix(Q, D). Accordingly, (CQ_MP ∘ (CQ_1 ∘ VR_1, CQ_2 ∘ VR_2, . . . , CQ_D ∘ VR_D)) ∘ Q_D ≡ Q. □

Theorem 2 describes how to construct CQ to answer Q on the cached results of V, as shown in Fig.2. CQ_k ∘ VR_k means that we can get the result of Infix(Q, k) by querying CQ_k on VR_k. CQ_MP ∘ (CQ_1 ∘ VR_1, CQ_2 ∘ VR_2, . . . , CQ_D ∘ VR_D) means that we can get the result of the D-th axis node of Q by integrating each CQ_k ∘ VR_k along CQ_MP. Finally, we get the result of Q by processing Q_D.

    4 View Selection

In this section, we propose how to select the best V to answer Q and introduce a novel technique to accelerate view selection.

4.1 Tree Pattern's Prime ProducT (PPT)

We have presented how to check whether V can answer Q in Section 3. However, if there are hundreds or thousands of views in the semantic cache, checking Definition 4 against every view is inefficient. To accelerate view selection, we introduce a more effective technique.

We begin by introducing the novel concept of a Tree Pattern's Prime ProducT (PPT) and then give a technique to improve the efficiency of view selection, formalized in Theorem 3.

Definition 5 (Tree Pattern's Prime ProducT (PPT)). We assign the different nodes in a tree pattern distinct prime numbers[9]. A tree pattern TP's Prime ProducT (PPT) is defined as TP_PPT = ∏_{(u,v)∈TP} (p(u) · p(v)), where (u, v) ranges over the edges of TP and p(u) is the prime number assigned to u.

Fig.3. Assigned prime numbers of V and Q. (a) Q. (b)-(d) V1, V2, V3.

Example 3. In Fig.3, we assign different nodes distinct prime numbers as follows: a(2), b(3), c(5), d(7), e(11), ∗(1); the wildcard ∗ is always assigned 1, since ∗ can be matched by any label. (Nodes with the same label are taken as the same node.) We have

Q_PPT = (2 · 3) · (3 · 3) · (3 · 2) · (2 · 5) · (5 · 7) = 113400,

V1_PPT = (2 · 3) · (3 · 1) · (1 · 2) · (2 · 7) = 504,

V2_PPT = (2 · 7) · (7 · 1) · (1 · 2) · (2 · 3) = 1176,


V3_PPT = (2 · 3) · (3 · 2) · (2 · 5) · (5 · 2) = 3600.

Theorem 3. Given two tree patterns P and Q, if P is restrictively included in Q, then Q_PPT | P_PPT, where X | Y denotes that the integer Y can be exactly divided by the integer X (i.e., there exists another integer Z with Y = X · Z).

Proof. Since P is restrictively included in Q, there is a homomorphism h from Q to P which satisfies that if u ≠ v then h(u) ≠ h(v). Let E_Q = {(h(u), h(v)) | (u, v) ∈ Q}. If u = ∗, then p(u) = 1; otherwise label(h(u)) = label(u) and thus p(u) = p(h(u)). Therefore, ∀(u, v) ∈ Q, p(u) · p(v) divides p(h(u)) · p(h(v)), and hence

∏_{(u,v)∈Q} (p(u) · p(v))  divides  ∏_{(h(u),h(v))∈E_Q} (p(h(u)) · p(h(v))).

Moreover,

P_PPT = ∏_{(h(u),h(v))∈E_Q} (p(h(u)) · p(h(v))) × ∏_{(u′,v′)∈P, (u′,v′)∉E_Q} (p(u′) · p(v′)),

and the first factor is divisible by ∏_{(u,v)∈Q} (p(u) · p(v)) = Q_PPT.

Therefore, Q_PPT | P_PPT. □

Corollary 2. If V_PPT ∤ Q_PPT, V cannot answer Q.

Proof. Suppose V_PPT ∤ Q_PPT. Then Q is not (restrictively) included in V, by the contrapositive of Theorem 3 (taking P = Q and Q = V). If Q ⊑ V does not hold, then V ⊵ Q does not hold by Definition 4; that is, V cannot answer Q. Thus, the corollary is true. □

Theorem 3 gives a necessary but not sufficient condition for P to be restrictively included in Q, and Corollary 2 describes which views cannot answer Q. Accordingly, we can filter out many views that cannot answer Q through the PPTs of V and Q according to Corollary 2, which is very easy to implement.

Example 4. In Fig.3, as V2_PPT | Q_PPT and V3_PPT | Q_PPT are false, Q is not included in V2 and V3. Thus, V2 and V3 cannot answer Q. As Q is included in V1, V1_PPT | Q_PPT is true. Similarly, in Fig.1, as V_PPT | Q6_PPT and V_PPT | Q7_PPT are false, it is very easy to filter out Q6 and Q7 directly, since V cannot answer them. However, Mandhani and Suciu[2] have to employ some complex operations to infer query/view answerability. Therefore, when looking for a V to answer Q, we first determine whether V_PPT | Q_PPT is true; if true, we then check whether Q is included in V; otherwise, V cannot answer Q and can be filtered directly. Accordingly, this accelerates cache lookup.
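The PPT computation and the divisibility filter are straightforward to implement. The following Python sketch reproduces the numbers of Example 3 and the filtering decisions of Example 4. The edge lists are written only to be consistent with the products given in the text; the exact tree shapes are in Fig.3, which is not reproduced here, so they are an assumption.

```python
from math import prod

# Prime assigned to each label; the wildcard '*' always gets 1 (Definition 5).
PRIME = {"a": 2, "b": 3, "c": 5, "d": 7, "e": 11, "*": 1}

def ppt(edges):
    """Tree Pattern Prime ProducT: product of p(u)*p(v) over all edges (u, v)."""
    return prod(PRIME[u] * PRIME[v] for u, v in edges)

# Edge label lists consistent with the products given in Example 3.
Q_edges  = [("a", "b"), ("b", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
V1_edges = [("a", "b"), ("b", "*"), ("*", "a"), ("a", "d")]
V2_edges = [("a", "d"), ("d", "*"), ("*", "a"), ("a", "b")]
V3_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")]

Q_ppt, V1_ppt, V2_ppt, V3_ppt = map(ppt, (Q_edges, V1_edges, V2_edges, V3_edges))
print(Q_ppt, V1_ppt, V2_ppt, V3_ppt)        # 113400 504 1176 3600

# Corollary 2 as a cache-lookup filter: if V_PPT does not divide Q_PPT,
# V cannot answer Q and is skipped without any tree matching.
for name, v_ppt in [("V1", V1_ppt), ("V2", V2_ppt), ("V3", V3_ppt)]:
    verdict = "candidate" if Q_ppt % v_ppt == 0 else "filtered out"
    print(name, verdict)                    # V1 candidate, V2 filtered out, V3 filtered out
```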

    4.2 SCEND Framework

This subsection gives the framework of our semantic cache and presents an optimization technique to facilitate checking whether V can answer Q, formalized in Theorem 4.

Theorem 4. V can answer Q if V and Q satisfy the following conditions:

(i) MainPath(Q, Dep(V)) ⊑ MainPath(V);

(ii) ∀k, 1 ≤ k ≤ Dep(V),

1) axisNode(V, k) = axisNode(Q, k);

2) Infix(Q, k) ⊑ Infix(V, k);

3) Infix(V, k)_PPT | Infix(Q, k)_PPT;

(iii) MainPath(V)_PPT | MainPath(Q, Dep(V))_PPT;

(iv) V_PPT | Q_PPT.

Proof. As MainPath(Q, Dep(V)) ⊑ MainPath(V) in (i), there exists a homomorphism h from MainPath(V) to MainPath(Q, Dep(V)). As Infix(Q, k) ⊑ Infix(V, k) in (ii), there exists a homomorphism h_k from Infix(V, k) to Infix(Q, k). As the k-th axis nodes of Q and V are the same in (ii), any node a in MainPath(V) must be an axis node. Without loss of generality, suppose a_k is the k-th axis node; thus a_k is the root node of Infix(V, k), and h(a_k) = h_k(a_k) = a_k. Therefore, we can construct h′ as follows: ∀v ∈ V, there exists exactly one k such that v ∈ Infix(V, k), and we set h′(v) = h_k(v). It is obvious that h′ is a homomorphism from V to Q. Hence Q ⊑ V, and thus V can answer Q.

In addition, for (ii.3), as Infix(Q, k) ⊑ Infix(V, k), Infix(V, k)_PPT | Infix(Q, k)_PPT must hold by Theorem 3. Similarly, for (iii), as MainPath(Q, Dep(V)) ⊑ MainPath(V), MainPath(V)_PPT | MainPath(Q, Dep(V))_PPT must hold. For (iv), by Corollary 2, only if V_PPT | Q_PPT can V answer Q. These three conditions can be used for early termination so as to improve efficiency: if one of them does not hold, V cannot answer Q. □

We note that checking whether V can answer Q through Theorem 4 is more efficient than checking Definition 4, as views that cannot answer Q can be efficiently filtered out by checking V_PPT | Q_PPT, MainPath(V)_PPT | MainPath(Q, Dep(V))_PPT, and Infix(V, k)_PPT | Infix(Q, k)_PPT; if one of these conditions is not true, V cannot answer Q. We note that the worst case of our method is still coNP-complete[10]. However, we can perform early termination in many cases, which improves the efficiency of finding a view to answer a new query among a large number of views.


    Fig.4. Architecture of semantic cache and the SQL for selecting views that can answer Q.

To facilitate view selection based on Theorem 4, we devise the architecture of the semantic cache shown in Fig.4. The views in the cache can be indexed in an RDBMS, and we can use the DBMS capabilities to select views that can answer Q by issuing the SQL statement shown in Fig.4. If more than one view can answer Q, we always select the V with the greatest depth, as it needs the least additional operations to construct CQ.

In Fig.4, table TreePattern records the basic information of each view, where TPID denotes view V's tree pattern ID (a system-generated primary key); PPT denotes the Prime ProducT of V; MP denotes the main path of V; MPPPT denotes the Prime ProducT of V's main path; OS denotes the Occupied Size of V; VF denotes the Visited Frequency of V; RVT denotes the Recently Visited Time of V; and FDT denotes the Fetch Delay Time to process V, i.e., the time of processing V with a general XML query processing method without using the cached views. Table AxisNode records each axis node in V, where TPID is AxisNode's reference key (referring to table TreePattern's primary key TPID); ANID denotes which axis node it is in V, the ANID of the k-th axis node of V being k; AN_Name is the axis node's label; Infix(V, i)PPT and Prefix(V, i)PPT are the PPTs of Infix(V, i) and Prefix(V, i) respectively; and Infix(V, i)RST is the result of the i-th axis node.
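As an illustration only (the exact SQL statement of Fig.4 is not reproduced in this extraction), the sketch below shows how the TreePattern table described above could be materialized in a relational store and probed with the divisibility filters. The rows and the query-side constants are made-up placeholders, and the ORDER BY clause uses the main-path length as a rough stand-in for sorting by Dep(V).

```python
import sqlite3

# In-memory sketch of the TreePattern table described above (AxisNode is analogous).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TreePattern (
        TPID   INTEGER PRIMARY KEY,   -- view (tree pattern) id
        PPT    INTEGER,               -- Prime ProducT of V
        MP     TEXT,                  -- main path of V
        MPPPT  INTEGER,               -- Prime ProducT of V's main path
        OS     INTEGER,               -- Occupied Size
        VF     INTEGER,               -- Visited Frequency
        RVT    INTEGER,               -- Recently Visited Time
        FDT    INTEGER                -- Fetch Delay Time
    )""")
conn.executemany(
    "INSERT INTO TreePattern VALUES (?,?,?,?,?,?,?,?)",
    [(1,  504, "a//b/d", 28, 10, 3, 100, 40),     # made-up illustrative rows
     (2, 1176, "a//b",   12, 12, 1,  90, 35)])

# Divisibility filters of Theorem 4 pushed into SQL: keep only views whose PPT
# and main-path PPT divide the corresponding products of the query Q.
q_ppt, q_mp_ppt = 113_400, 84
candidates = conn.execute(
    """SELECT TPID FROM TreePattern
       WHERE ? % PPT = 0 AND ? % MPPPT = 0
       ORDER BY LENGTH(MP) DESC""",
    (q_ppt, q_mp_ppt)).fetchall()
print(candidates)   # [(1,)] with the illustrative rows above
```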

In this paper, once V and Q satisfy the conditions of Theorem 4, we can answer Q through V with some simple operations. It is obvious that, in our approach, a certain V can answer more queries than in [2]. Moreover, our method is more effective for finding a V to answer Q: if any of the conditions is not satisfied, V cannot answer Q and can be skipped. The divisibility of two integers is easy to validate, and MainPath(Q, Dep(V)) ⊑ MainPath(V) and Infix(Q, k) ⊑ Infix(V, k) are easier to check than Q ⊑ V.

    5 Algorithms

This section proposes two algorithms, for view selection and for compensating-query construction.

    5.1 View Selection Algorithm

To further improve the efficiency of view selection, we devise an effective algorithm that looks for the best V to answer Q. Algorithm View-Selection in Fig.5 gives the algorithm.

If we can find a view V to answer query Q in Algorithm View-Selection, we have a cache hit; otherwise, we have a cache miss. View-Selection is implemented on top of the SQL statement in Fig.4, and it can skip views that cannot answer Q. Note that, to select the best V to answer Q, we maintain the views in the semantic cache sorted according to Dep(V) in descending order, and we always select the V with the greatest depth to answer Q. Moreover, View-Selection skips the views which do not satisfy V_PPT | Q_PPT and MainPath(V)_PPT | MainPath(Q, D)_PPT in line 3. Note that it is very efficient to check the divisibility of two positive integers; if a view cannot answer this query, we can skip it directly. Subsequently, only the views which satisfy MainPath(Q, D) ⊑ MainPath(V) can answer Q (line 5). As the main path only contains /, // and ∗, MainPath(Q, D) ⊑ MainPath(V) can be validated in polynomial time, as proved in [8]. Moreover, View-Selection skips the views which do not satisfy condition (ii) of Theorem 4. Accordingly, we can select V to answer Q effectively.

Example 5. In Fig.1, consider that queries Q6 and Q7 arrive. As V_PPT ∤ Q6_PPT and V_PPT ∤ Q7_PPT, we can determine that V cannot answer Q6 and Q7 based on


Corollary 2 (line 3 of Algorithm View-Selection). Thus, we can effectively select the best view to answer queries.

Now we give the time complexity analysis of the View-Selection algorithm. For each view V, we need to verify whether Q's main path is included in V's main path, and the time complexity is O(Dep(V)^2). For each node on the main path, we need to verify whether Infix(Q, k) ⊑ Infix(V, k), and the time complexity is O(|Infix(Q, k)| · |Infix(V, k)|), where |Infix(V, k)| is the number of nodes in Infix(V, k). Thus the total time complexity is O(n · (Dep(V)^2 + Σ_{1≤k≤Dep(V)} |Infix(Q, k)| · |Infix(V, k)|)), where n is the number of views in the cache.

    Fig.5. View-Selection algorithm.
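Fig.5 itself is not reproduced in this extraction. The following Python sketch outlines the selection loop as described above, under assumed data structures: views are scanned in descending depth, the cheap divisibility and label filters of Theorem 4 are applied first, and the exact containment checks (conditions (i) and (ii.2)), supplied by the caller as a callback, run only for surviving candidates. All names are illustrative, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class CachedView:
    """Per-view metadata kept by the cache (cf. the tables of Fig.4); illustrative."""
    tpid: int
    dep: int                 # Dep(V)
    ppt: int                 # V_PPT
    mp_ppt: int              # PPT of MainPath(V)
    axis_labels: List[str]   # label of the k-th axis node, k = 1..dep
    infix_ppts: List[int]    # PPT of Infix(V, k), k = 1..dep

def select_view(views: List[CachedView],
                q_ppt: int,
                q_mp_ppt: Callable[[int], int],    # PPT of MainPath(Q, d)
                q_axis_labels: List[str],
                q_infix_ppts: List[int],
                full_check: Callable[[CachedView], bool]) -> Optional[CachedView]:
    """Return the deepest view surviving the cheap filters of Theorem 4 and the
    exact containment test `full_check` (conditions (i) and (ii.2), supplied by
    the caller), or None on a cache miss."""
    for v in sorted(views, key=lambda v: v.dep, reverse=True):
        if len(q_axis_labels) < v.dep:
            continue                                        # Q is shallower than V
        if q_ppt % v.ppt != 0 or q_mp_ppt(v.dep) % v.mp_ppt != 0:
            continue                                        # (iv) or (iii) fails
        if any(q_axis_labels[k] != v.axis_labels[k]
               or q_infix_ppts[k] % v.infix_ppts[k] != 0
               for k in range(v.dep)):
            continue                                        # (ii.1) or (ii.3) fails
        if full_check(v):                                   # exact checks (i), (ii.2)
            return v                                        # cache hit
    return None                                             # cache miss
```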

    5.2 Query Rewriting Algorithm

    This subsection presents an algorithm Query-Rewriting in Fig.6 to construct a compensating queryso as to reconstruct the bestV to answerQ.

We begin by introducing the standard form of a tree pattern. Some tree patterns are equivalent but have different expressions[2]. For example, suppose P = a[c[d]/e]/b[e[f]/g] and Q = a[c[d][e]]/b[e[f][g]]. Although P and Q are not the same, they are equivalent; we take Q as the standard form of P.

To address this issue, we transform tree patterns into their standard form as follows. Given a query Q, for any node that has more than one child, we sort its children by their labels in lexicographical order. Accordingly, all the equivalent queries are transformed into a unique standard form. We can get the standard form of a tree pattern by calling procedure TransformationTree in Fig.6.
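A minimal Python sketch of the child-sorting step of procedure TransformationTree is shown below; the tree representation and names are illustrative, not the paper's code, and ties among children with the same label could additionally be broken by their serialized subtrees.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PNode:
    label: str
    children: List["PNode"] = field(default_factory=list)

def standard_form(n: PNode) -> PNode:
    """Sort every node's children by label, recursively, so that all equivalent
    orderings of predicates collapse to one canonical tree."""
    kids = sorted((standard_form(c) for c in n.children), key=lambda c: c.label)
    return PNode(n.label, kids)

def serialize(n: PNode) -> str:
    return n.label + "".join(f"[{serialize(c)}]" for c in n.children)

# Two orderings of the same predicates: a[e][b[c][a]] and a[b[a][c]][e].
t1 = PNode("a", [PNode("e"), PNode("b", [PNode("c"), PNode("a")])])
t2 = PNode("a", [PNode("b", [PNode("a"), PNode("c")]), PNode("e")])
print(serialize(standard_form(t1)))                                    # a[b[a][c]][e]
print(serialize(standard_form(t1)) == serialize(standard_form(t2)))    # True
```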

Fig.6. Query-Rewriting algorithm.

Suppose the results of Infix(V, k) and Infix(Q, k) are VR_k and QR_k respectively. If Infix(V, k) and Infix(Q, k) have the same standard form, we have QR_k = VR_k, as shown in lines 3-4 of Algorithm Query-Rewriting in Fig.6; otherwise, Query-Rewriting will


process each CQ_k on the corresponding sub-view VR_k to get QR_k, i.e., QR_k = CQ_k ∘ VR_k = Infix(Q, k) ∘ VR_k, as shown in lines 5-6 of Algorithm Query-Rewriting in Fig.6. CQ_k ∘ VR_k denotes the result of querying CQ_k on the sub-view VR_k, which is similar to general XML query processing but much easier than directly processing Q; we employ a holistic twig join algorithm[11] to implement it. Then, we retrieve the result of the path in Q from the 1st axis node to the k-th axis node based on algorithm PathStack[12], as shown in line 8, whose complexity is O(|QR_k| + |FVR_MP|). Finally, if D = Dep(V) = Dep(Q), that is, the D-th axis node is the returned node of Q, Query-Rewriting returns FVR_MP directly in line 10; otherwise it gets the result of Q by querying Q_D on FVR_MP in line 12.
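The skeleton below summarizes this control flow in Python under assumed helper callables (eval_on_cached standing in for the holistic twig join, path_stack for PathStack, eval_suffix for processing Q_D); it is a sketch of the decision structure only, not the authors' implementation.

```python
from typing import Callable, List

def rewrite_and_answer(q_infixes: List[str],        # Infix(Q, k) in standard form, k = 1..D
                       v_infixes: List[str],        # Infix(V, k) in standard form, k = 1..D
                       v_results: List[object],     # VR_k: cached result of V's k-th axis node
                       eval_on_cached: Callable[[str, object], object],  # twig join of CQ_k on VR_k
                       path_stack: Callable[[List[object]], object],     # PathStack over MainPath(Q, D)
                       eval_suffix: Callable[[object], object],          # processes Q_D
                       same_depth: bool) -> object:
    """Decision skeleton of Query-Rewriting (Fig.6): reuse VR_k when the infixes
    match in standard form, otherwise compensate, then combine along the main path."""
    qr = []
    for q_inf, v_inf, vr in zip(q_infixes, v_infixes, v_results):
        if q_inf == v_inf:                        # SubQueryMatch: identical standard forms
            qr.append(vr)                         # QR_k = VR_k, nothing to do
        else:
            qr.append(eval_on_cached(q_inf, vr))  # QR_k = CQ_k applied to VR_k
    fvr_mp = path_stack(qr)                       # result of the D-th axis node of Q
    return fvr_mp if same_depth else eval_suffix(fvr_mp)   # apply Q_D when Dep(Q) > Dep(V)
```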

Example 6. Considering queries Q1, Q2, . . . , Q7 in Fig.1, we find that V can answer Q1, Q2, . . . , Q5, but cannot answer Q6 and Q7.

Because V_PPT ∤ Q6_PPT and V_PPT ∤ Q7_PPT, Q6 and Q7 are filtered out directly in line 3 of Algorithm View-Selection in Fig.5; we then call Algorithm Query-Rewriting in Fig.6 to construct CQ to answer Q1, Q2, . . . , Q5.

Consider Q1. As SubQueryMatch(Infix(V, 1), Infix(Q1, 1)) is not true, we have QR_1 = CQ_1 ∘ VR_1 = Infix(Q1, 1) ∘ VR_1. As SubQueryMatch(Infix(V, 2), Infix(Q1, 2)) and SubQueryMatch(Infix(V, 3), Infix(Q1, 3)) are true, QR_2 = VR_2 and QR_3 = VR_3. We get FVR_MP in line 8 by processing the main path of view V as illustrated in Fig.6. As Dep(Q1) ≠ Dep(V), we get the result of the returned node c of Q1 by processing Q_D.

For Q2, as SubQueryMatch(Infix(V, 2), Infix(Q2, 2)) is not true, we get QR_2 = CQ_2 ∘ VR_2 = Infix(Q2, 2) ∘ VR_2. As SubQueryMatch(Infix(V, 1), Infix(Q2, 1)) and SubQueryMatch(Infix(V, 3), Infix(Q2, 3)) hold, QR_1 = VR_1 and QR_3 = VR_3. Finally, we get FVR_MP through PathStack in line 8. As Dep(Q2) = Dep(V), we return FVR_MP in line 10.

Consider Q3: CQ_1 = a[e], CQ_2 = b[c[a][b]], CQ_3 = d[b], and CQ_MP = a/b/d; thus we get FVR_MP through PathStack in line 8. As Dep(Q3) = Dep(V), we return FVR_MP directly.

Similarly, we can get the results of Q4 and Q5. Table 1 gives each CQ_k, CQ_MP and Q_D for Q1, Q2, . . . , Q5, where a dash in the CQ_MP column denotes that it is exactly MainPath(Q, D), a dash in a CQ_k column denotes that it is Infix(Q, k), and a dash in the Q_D column denotes that it is Q^D.

Now we give the time complexity analysis of the Query-Rewriting algorithm. For each node on V's main path, we need to verify whether the subqueries Infix(V, k) and Infix(Q, k) match. Given the root nodes of Infix(V, k) and Infix(Q, k), we need to sort their children, thus the time complexity is O(|Infix(V, k)| · log(|Infix(V, k)|) + |Infix(Q, k)| · log(|Infix(Q, k)|)), where |Infix(V, k)| is the number of nodes in Infix(V, k). Then the algorithm combines the QR_i (1 ≤ i ≤ D), and the complexity is O(Dep(V)). Accordingly, the complexity of the algorithm is O(Dep(V) · (|Infix(V, k)| · log(|Infix(V, k)|) + |Infix(Q, k)| · log(|Infix(Q, k)|))).

Table 1. Compensating Queries of Q1-Q5

Query   CQ_MP   CQ_1   CQ_2              CQ_3                 Q_D
Q1      -       a[e]   -                 -                    d[b]/c
Q2      -       -      b[c[//d[a]][b]]   -                    -
Q3      a/b/d   a[e]   b[c[a][b]]        d[b]                 -
Q4      -       -      -                 d[//a[b][d]]         -
Q5      -       -      -                 d[//a[b]][c > 100]   -

    6 Cache Replacement

If the space for admitting a new query and its result is not sufficient, some cached queries and their corresponding results need to be replaced. In this paper we integrate LFU and LRU into LFRU; that is, we always replace the cached query which is the least frequently and recently used.

As the frequent query patterns are more likely to be issued subsequently, we cache the recent frequent query patterns. When cache replacement is needed, we first replace the infrequent query patterns and their corresponding answers. If the space for admitting the new query result is still not sufficient, the cached results corresponding to some frequent query patterns are replaced according to the replacement policy below.

Inspired by LFU and LRU, in this paper we integrate them into LFRU and propose a novel cache replacement policy based on it: we always replace the least frequently and recently used query. In our approach, the cached queries are classified into two categories according to their visited time: one category contains the 20% most recently visited queries and the other contains the remaining 80% of the queries. We assign the two parts two importance ratios, α and 1.

Suppose the queries in the cache are {q1, q2, . . . , qn}; we record the visited frequency fi, the recently visited time ti, the execution cost ci and the occupied size si for each query qi. We always first replace the query qi whose score (αi · fi · ci)/si is minimal among all such queries, where αi = α if qi is in the category of the 20% recent queries, and αi = 1 otherwise. We note that recent queries are generally more important; therefore α should be larger than 1.


We use our incremental algorithms[13] to mine the frequent queries to cache. We will experimentally demonstrate the effectiveness of our proposed techniques in Section 7.
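A minimal Python sketch of the LFRU victim choice described above is given below, assuming per-entry statistics are available; the value of α is illustrative (the paper only requires α > 1), and the field names are placeholders.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CacheEntry:
    query: str
    freq: int        # f_i: visited frequency
    last_used: int   # t_i: recently visited time (larger = more recent)
    cost: float      # c_i: execution cost without the cache
    size: int        # s_i: occupied size

def pick_victim(entries: List[CacheEntry], alpha: float = 2.0) -> CacheEntry:
    """LFRU victim selection: evict the entry with the smallest
    (alpha_i * f_i * c_i) / s_i, where alpha_i = alpha for the 20% most
    recently visited entries and 1 for the rest (alpha > 1 favours keeping
    recent queries).  alpha = 2.0 is an illustrative value, not from the paper."""
    by_recency = sorted(entries, key=lambda e: e.last_used, reverse=True)
    recent = set(id(e) for e in by_recency[:max(1, len(entries) // 5)])
    def score(e: CacheEntry) -> float:
        a = alpha if id(e) in recent else 1.0
        return a * e.freq * e.cost / e.size
    return min(entries, key=score)
```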

    7 Experimental Study

In this section, we present the experiments conducted to evaluate the efficiency of the various algorithms and the obtained results.

Mandhani and Suciu[2] proposed a technique for view selection based on string matching; we call it SCSM (Semantic Cache based on String Match). However, SCSM cannot fully exploit the query/view answerability. We compared our method SCEND with the existing state-of-the-art method SCSM[2], the containment-checking algorithm CheckContainment[10], and the naive cache, which requires an exact string match between the query and a view. CheckContainment needs to check the containment between each view and the query.

All the algorithms were coded in C++, and the experiments were conducted on an AMD 2600+ PC with 1 GB RAM, running Windows 2000 Server. We used the Beta 2 release of Microsoft SQL Server 2005 for both the cache and the XML databases. We cached views similar to those in [2] in the semantic cache. Moreover, we randomly added some // and ∗ to the XPath queries.

We employed the datasets DBLP[14], TreeBank[15], and XMark[16] for our experiments: 1) XMark is synthetic and generated by an XML data generator; 2) DBLP is a collection of papers and articles; 3) TreeBank has a highly recursive structure, and the deep recursive structure of this data makes it an interesting case for the experiment. Based on the DTDs of the selected datasets, some // and ∗ nodes are added to construct the queries and views used as the input. Different characteristics of the queries are summarized in Table 2. As TreeBank has a complicated schema, it leads to lower search performance than the other datasets.

Table 2. Characteristics of Datasets

Datasets    Average No. Nodes    Max Depth    Max Fan-Out
XMark       8.4                  11           11
DBLP        7.6                  8            12
TreeBank    12.2                 20           10

The average number of nodes, maximum depth and fan-out of the queries reflect the complexity of the datasets. All the query workloads follow a Zipf distribution with exponent z, where z is a parameter and the probability of choosing the i-th query is proportional to 1/i^z.
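For concreteness, a small Python sketch of drawing such a Zipf-distributed workload (function and parameter names are illustrative, not from the paper):

```python
import random

def zipf_query_workload(queries, z: float, n: int, seed: int = 0):
    """Draw n queries where the i-th query (1-based) is chosen with probability
    proportional to 1 / i**z, i.e., a Zipf distribution with exponent z."""
    rng = random.Random(seed)
    weights = [1.0 / (i ** z) for i in range(1, len(queries) + 1)]
    return rng.choices(queries, weights=weights, k=n)

# Larger z concentrates the workload on the first few (most popular) queries,
# which is why the cache hit rate rises with z in Fig.7.
sample = zipf_query_workload([f"Q{i}" for i in range(1, 1001)], z=1.2, n=5)
```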

    7.1 Cache Hit Rate

This subsection evaluates the query/view answerability of the various methods. We employ the cache hit rate as the metric.

Fig.7 shows the experimental results with different Zipf exponents Z used for generating queries. The numbers of cached views and test queries were 200 000 and 100 000 respectively for each Z value. As Z increases, the locality of the queries increases, and thus the cache hit rates increase. We note that SCEND always achieves a high cache hit rate because it employs the decomposition-based method for view selection, which can exploit sufficient views to answer queries. Although CheckContainment is better than SCEND, CheckContainment is rather expensive in checking the containment between views and queries, especially for large numbers of views.

Fig.8 shows how the cache hit rate varies with the number of queries. We cached 200 000 queries and set Z to 1.2. The cache hit rate for SCEND remains stable as the number of queries increases, because our cache replacement policy is very effective for query caching and replacement. However, the cache hit rates for SCSM and the naive cache vary with the number of queries. We observe that SCEND achieves a higher cache hit rate than the alternative methods: more than 20% higher than SCSM[2] and 50% higher

    Fig.7. Cache hit rate vs. different Zipf exponents (200 000 views).

    http://www.cs.washington.edu/research/xmldatasets/data/treebank/.


Fig.8. Cache hit rate vs. different numbers of queries (Z = 1.2).

than the naive cache on each dataset. Thus, the query/view answerability that we capture is much richer than that of SCSM and the naive cache. Moreover, different datasets do not influence the performance of caching. This contrast reflects the better scalability of our method.

    7.2 Cache Lookup Time

    We in this subsection evaluate the efficiency of cachelookup. Note that the lookup time does not include thetime in obtaining the result ofQ by executing CQ(fora cache hit) or Q (for a cache miss).

Fig.9 shows the experimental results with different Zipf exponents Z used for generating queries. The numbers of cached views and test queries were 200 000 and 100 000 respectively for each Z value. Fig.10 shows how the average cache lookup time varies with the number of queries; here we see how well the lookup scales to a large number of cached views. In all cases, we cached 200 000 queries. We can see that the lookup time for SCEND remains constant at around 4 ms, even as the number of queries increases to 6 million. This time is very small compared with the time taken to execute a typical XPath query, and is exactly the behavior we would like. However, CheckContainment takes more than 2000 milliseconds per cache lookup, which is even much more than the query processing time. Moreover, SCEND is better than SCSM, which takes more than 12 ms per lookup. The naive cache takes a mere 2 ms per lookup; however, in terms of query processing performance, this difference is offset by the higher hit rate of the semantic cache, as we compare later.

    Fig.9. Average cache lookup time vs. different Zipf exponents (200 000 views).

Fig.10. Average cache lookup time vs. different numbers of queries (Z = 1.2).


    7.3 Query Performance

We now evaluate the query performance of our proposed semantic cache. On a cache hit, we compose CQ on V to answer Q; on a cache miss, we process Q directly.

We first evaluate the query performance with different Zipf exponents. The numbers of cached queries and test queries are 200 000 and 100 000 respectively. Fig.11 shows the average elapsed time of processing a query. The queries took 1600 milliseconds with no caching. CheckContainment increases this to more than 2000 milliseconds, as it takes more time for cache lookup. The naive cache brings this down to 1000 ms, while employing SCSM brings it down to 600 ms. SCEND brings it down to 200 ms, which is a speedup by factors of 10, 8, 5 and 3 over CheckContainment, no cache, the naive cache and SCSM respectively.

Further, Fig.12 shows how the average time per query varies with the number of queries. The average time for SCEND and for no cache does not shoot up as the number of queries increases. However, the average time for CheckContainment, SCSM and the naive cache increases with the number of queries. As the number of queries increases, the locality of the queries also changes; this does not influence the no-cache method, but it influences the other three methods. However, because our cache replacement policy is efficient for caching, the performance of SCEND does not drop with the increase of the number of queries. This reflects the better scalability of our method.

Finally, Table 3 shows some additional experimental results. For a cache hit, the semantic cache needs to query a cached fragment; on the other hand, the naive cache simply retrieves the whole fragment.

    Fig.11. Average elapsed time vs. different Zipf exponents (200 000 views).

Fig.12. Average elapsed time vs. different numbers of queries (Z = 1.2).

Table 3. Evaluation of Different Methods

                             SCEND   SCSM    Naive Cache   No Cache   CheckContainment
Avg. Lookup Time/Hit (ms)    4.11    15.66   1.64          0          2253
Avg. Lookup Time/Miss (ms)   10.40   36.82   1.67          0          3426
Avg. Lookup Time (ms)        4.81    20.01   1.66          0          2408
Avg. Time/Hit (ms)           78      455     1.81          0          2442
Avg. Time/Miss (ms)          1124    1138    1602          1603       4312
Avg. Time (ms)               89      681     1276          1603       2549
Hit Rate                     0.952   0.781   0.214         0          0.963


Considering this, the average lookup time per cache hit is 4.11 ms for SCEND, which is impressive because the average time per cache hit is 78 ms. It is interesting to observe that the average time per miss for SCEND is 1124 ms, which is much longer than the overall average of 89 ms, but that cache lookup only takes an extra 4.81 ms. Thus, the queries that are cache misses take longer to execute on the XML database than those that are cache hits. Compared with the average time per miss, the average lookup time for SCEND is negligible. Although CheckContainment can improve the cache hit rate, it spends much time on cache lookup; thus, CheckContainment leads to low performance. Moreover, the cache hit rate of SCEND reaches 0.952 and the average time is only 89 ms, which are much better than those of SCSM, the naive cache, no cache, and CheckContainment.

    8 Related Work

XML has become a standard for information representation and exchange over the Internet. Many researchers have studied the problems of XML indexing[8], XML query processing[11-12,17-19], frequent XML query pattern discovery[13,20], and XML query caching and answering[2,20-23]. Chen et al.[24] attempted to apply the ideas of semantic caching to XML query processing systems, in particular the XQuery engine. Semantic caching implies view-based query answering and cache management. Hristidis and Petropoulos[25] presented a novel framework for semantic caching of XML databases. The cached XML data are organized using a modification of the incomplete tree, which has many desirable properties, such as incremental maintenance, containment decidability and remainder query generation in PTIME.

Xu[26] introduced a novel framework for a new semantic caching system, which offers a representation system for cached XML data, algorithms to decide whether a new query can be totally answered by the cached XML data or not, and algorithms to incrementally maintain the cached XML data.

The work most closely related to our method is that on containment between XPath queries. Miklau and Suciu[10] proved that this problem is coNP-complete; a polynomial-time algorithm was also presented for checking containment, which is sound but not complete. Balmin et al.[21] employed materialized XPath views to answer queries. However, their method is inefficient for view selection if there are a large number of views in the cache. Further, their criterion for query/view answerability is exactly containment between queries and views. Their version of what we call compensating queries requires navigating up from the returned nodes of the view being used. For each view, they store one or more of XML fragments, object ids, and typed data values, and they define query/view answerability accordingly. This choice allows them to maintain some cached views outside the database too, and to target applications like middle-tier caching and distributed XQuery.

Application-tier caching for relational databases has received a lot of attention lately, in the context of database-driven websites[4,27]. Our caching framework enables the same for XML databases. Further, when the cache is maintained inside the XML database system, object ids of the result nodes can be stored instead of the entire result fragment; the techniques that we describe in this paper remain equally applicable. Chen and Rundensteiner[28] proposed a semantic cache of XQuery views. It focuses on various aspects of the query/view matching problem, which is harder for XQuery. Having XQuery views results in smaller cached results and concise rewritten queries, which speeds up cache hits. However, cache lookup optimization is much harder due to the more complex matching involved, and lookup is likely to become the bottleneck when there are a large number of cached views to consider. Mandhani and Suciu[2] proposed a method for finding a view V among the cached views to answer Q by string matching, but when there are large numbers of views in the cache, it is rather inefficient. Moreover, it may incur cache misses for some queries which could in fact be answered by views in the cache. In this paper we demonstrate how to improve the cache hit rate and adequately exploit the query/view answerability.

Discovering frequent XML query patterns turns out to be a significant and effective premise of query optimization, owing to its capability of capturing the focus of user queries. The rapid growth of XML repositories has provided the impetus to design and develop systems that can store and query XML data efficiently, and thus discovering frequent XML query patterns has recently attracted a large amount of attention, as the answers of these queries can be stored and cached so as to improve the query performance. The advantage of caching is that when a user refines a query by adding or removing one or more query terms, many of the answers that have already been cached can be delivered to the user right away. This avoids the expensive evaluation of repeated or similar queries.

As to frequent XML query pattern mining, to the best of our knowledge, XQPMiner[29] is the first algorithm to mine frequent XQPs, with a global-XQP-schema-guided enumeration algorithm. It follows the traditional generate-and-test paradigm for tree-structured data mining: a global query pattern tree needs to be generated for XQP enumeration, along with expensive candidate generation and containment testing. FastXMiner[20] is the most efficient mining algorithm for XML frequent query pattern discovery, as only


valid candidate XQPs are enumerated for costly containment testing, as opposed to all the candidates of XQPMiner[29]. increQPMiner[30] studies the problem of incremental mining by using the mined results of the original databases. However, increQPMiner is not as efficient as our incremental algorithms[13], as it does not take full advantage of the mined results of the original database. More importantly, we proposed a novel method[13] for effective incremental mining by employing the F-index and Q/F-index to facilitate the mining of frequent query patterns.

More recently, we proposed a novel method of exploiting sequencing views in a semantic cache to accelerate XPath query evaluation so as to improve the answerability of caching[23]. We also devised an efficient approach to improve the answerability of the semantic cache by decomposing XML queries into simple components and employing the divisibility of prime-number products[22] for effective view selection, which leads to a dramatic improvement over prior work in terms of cache lookup. As a major value-added version of our preliminary work[22], we summarize the major extensions as follows. Firstly, we have added a novel cache replacement technique to further improve the performance of our proposal (Section 6). Secondly, we have added some examples to make the paper more understandable (Examples 5 and 6). Thirdly, we propose to use DBMS capabilities for effective view selection (Subsection 4.2). Finally, we have conducted an extensive new performance study on different datasets to further evaluate our algorithm and compared our approach with the existing state-of-the-art methods (Figs. 7-12 and Table 3).

    9 Conclusion

We have proposed a semantic cache, namely SCEND, for effective query caching and view selection to adequately exploit the XPath query/view answerability. To enhance the query/view answerability, we decompose complex queries into simpler ones and use the decomposed queries to evaluate the query/view answerability of the complex queries and views, which can exploit sufficient views to answer queries. To improve the efficiency of view selection, we presented a novel technique based on the divisibility of two numbers: we assign each query node a prime number and prove that the divisibility of the two positive numbers assigned to the query and the cached view is a necessary condition for query/view answerability, which can significantly improve the efficiency of cache lookup. We have implemented our method, and the thorough experimental results give us confidence that our approach achieves high performance and outperforms the existing state-of-the-art methods significantly.

    References

[1] Dar S, Franklin M J, Jonsson B T, Srivastava D, Tan M. Semantic data caching and replacement. In Proc. VLDB 1996, Mumbai (Bombay), India, September 3-6, 1996, pp.330-341.

[2] Mandhani B, Suciu D. Query caching and view selection for XML databases. In Proc. VLDB 2005, Trondheim, Norway, August 30-September 2, 2005, pp.469-480.

[3] Feng J H, Li G L, Ta N. A semantic cache framework for secure XML queries. J. Comput. Sci. & Technol., 2008, 23(6): 988-997.

[4] Luo Q, Krishnamurthy S, Mohan C, Pirahesh H, Woo H, Lindsay B G, Naughton J F. Middle-tier database caching for e-business. In Proc. ACM SIGMOD Int. Conf. Management of Data, Madison, USA, June 3-6, 2002, pp.600-611.

[5] Re C, Brinkley J, Hinshaw K, Suciu D. Distributed XQuery. In Proc. Information Integration on the Web (IIWeb), VLDB Workshop, Toronto, Canada, Aug. 30, 2004, pp.116-121.

[6] Chandra A K, Merlin P M. Optimal implementation of conjunctive queries in relational data bases. In Proc. STOC, May 2-4, 1977, Boulder, Colorado, USA, pp.77-90.

[7] Miklau G, Suciu D. Containment and equivalence for a fragment of XPath. J. ACM, 2004, 51(1): 2-45.

[8] Milo T, Suciu D. Index structures for path expressions. In Proc. ICDT, Jerusalem, Israel, January 10-12, 1999, pp.277-295.

[9] Wu X, Lee M L, Hsu W. A prime number labeling scheme for dynamic ordered XML trees. In Proc. ICDE, Boston, USA, March 30-April 2, 2004, pp.66-78.

[10] Miklau G, Suciu D. Containment and equivalence for an XPath fragment. In Proc. PODS, Madison, USA, June 3-5, 2002, pp.65-76.

[11] Li G, Feng J, Zhang Y, Zhou L. Efficient holistic twig joins in leaf-to-root combining with root-to-leaf way. In Proc. DASFAA, Bangkok, Thailand, April 9-12, 2007, pp.834-849.

[12] Bruno N, Koudas N, Srivastava D. Holistic twig joins: Optimal XML pattern matching. In Proc. ACM SIGMOD Int. Conf. Management of Data, Madison, Wisconsin, June 3-6, 2002, pp.310-321.

[13] Li G, Feng J, Wang J, Zhang Y, Zhou L. Incremental mining of frequent query patterns from XML queries for caching. In Proc. ICDM, December 18-22, 2006, Hong Kong, China, pp.350-361.

[14] http://dblp.uni-trier.de/xml/.

[15] http://www.cs.washington.edu/research/.

[16] http://www.xml-benchmark.org/.

[17] Al-Khalifa S, Jagadish H V, Patel J M, Wu Y, Koudas N, Srivastava D. Structural joins: A primitive for efficient XML query pattern matching. In Proc. ICDE 2002, February 26-March 1, 2002, San Jose, USA, pp.141-152.

[18] Chen T, Lu J, Ling T W. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. ACM SIGMOD Int. Conf. Management of Data, Baltimore, USA, June 14-16, 2005, pp.455-466.

[19] Lu J, Ling T W, Chan C Y, Chen T. From region encoding to extended dewey: On efficient processing of XML twig pattern matching. In Proc. VLDB, Trondheim, Norway, August 30-September 2, 2005, pp.193-204.

[20] Yang L H, Lee M L, Hsu W. Efficient mining of XML query patterns for caching. In Proc. VLDB, Berlin, Germany, September 9-12, 2003, pp.69-80.

[21] Balmin A, Ozcan F, Beyer K S, Cochrane R, Pirahesh H. A framework for using materialized XPath views in XML query processing. In Proc. VLDB 2004, Toronto, Canada, August 31-September 3, 2004, pp.60-71.


[22] Li G, Feng J, Ta N, Zhang Y, Zhou L. SCEND: An efficient semantic cache to adequately explore answerability of views. In Proc. WISE 2006, Wuhan, China, October 23-26, 2006, pp.460-473.

[23] Feng J, Ta N, Zhang Y, Li G. Exploit sequencing views in semantic cache to accelerate XPath query evaluation. In Proc. WWW 2007, Banff, Canada, May 8-12, 2007, pp.1337-1338.

[24] Chen L, Rundensteiner E A, Wang S. XCache: A semantic caching system for XML queries. In Proc. ACM SIGMOD Int. Conf. Management of Data, Madison, USA, June 3-6, 2002, p.618.

[25] Hristidis V, Petropoulos M. Semantic caching of XML databases. In Proc. ACM SIGMOD Int. Conf. Management of Data, Madison, USA, June 3-6, 2002, pp.25-30.

[26] Xu W. The framework of an XML semantic caching system. In Proc. ACM SIGMOD Int. Conf. Management of Data, Baltimore, USA, June 13-16, 2005, pp.127-132.

[27] Yagoub K, Florescu D, Issarny V, Valduriez P. Caching strategies for data-intensive Web sites. In Proc. VLDB 2000, September 10-14, 2000, Cairo, Egypt, pp.188-199.

[28] Chen L, Rundensteiner E A. XCache: XQuery-based caching system. In Proc. Int. Workshop on the Web and Databases, Madison, Wisconsin, June 3-6, 2002, pp.31-36.

[29] Yang L H, Lee M L, Hsu W, Acharya S. Mining frequent query patterns from XML queries. In Proc. DASFAA 2003, March 26-28, 2003, Kyoto, Japan, pp.355-362.

[30] Chen Y, Yang L H, Wang Y G. Incremental mining of frequent XML query pattern. In Proc. ICDM 2004, November 1-4, 2004, Brighton, UK, pp.343-346.

Guo-Liang Li received his B.S. degree from the Department of Computer Science and Technology, Harbin Institute of Technology (HIT), and his M.S. and Ph.D. degrees from the Department of Computer Science and Technology, Tsinghua University, where he is currently working as a faculty member. He is a member of China Computer Federation (CCF). His research interests are in the fields of database indexing, data integration, and data cleaning. He has published papers in top international conferences, such as ACM SIGMOD, ACM SIGIR, VLDB, IEEE ICDE, WWW, ACM CIKM and IEEE ICDM, and in top international journals, such as DMKD and Information Systems.

Jian-Hua Feng received his B.S., M.S. and Ph.D. degrees in computer science and technology from Tsinghua University. He is currently working as a faculty member of the Department of Computer Science and Technology, Tsinghua University. He is a senior member of CCF. His main research interests are native XML databases, keyword search and data mining. He has published papers in top international journals and conferences, such as DMKD and Information Systems, and ACM SIGMOD, ACM SIGKDD, VLDB, IEEE ICDE, ACM SIGIR, WWW, ACM CIKM, IEEE ICDM, SDM and ER.