

1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2014.2334297, IEEE Transactions on Knowledge and Data Engineering


Context-based Diversification for Keyword Queries over XML Data

Jianxin Li, Chengfei Liu, Member, IEEE, and Jeffrey Xu Yu, Member, IEEE

• Jianxin Li and Chengfei Liu are with the Faculty of Science, Engineering and Technology, Swinburne University of Technology, Australia. {jianxinli, cliu}@swin.edu.au
• Jeffrey Xu Yu is with the Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, China. [email protected]

Abstract—While keyword queries empower ordinary users to search vast amounts of data, their ambiguity makes it difficult to answer them effectively, especially for short and vague keyword queries. To address this challenging problem, in this paper we propose an approach that automatically diversifies XML keyword search based on its different contexts in the XML data. Given a short and vague keyword query and the XML data to be searched, we first derive keyword search candidates of the query by a simple feature selection model. We then design an effective XML keyword search diversification model to measure the quality of each candidate. After that, two efficient algorithms are proposed to incrementally compute the top-k qualified query candidates as the diversified search intentions. Two selection criteria are targeted: the k selected query candidates must be the most relevant to the given query while covering a maximal number of distinct results. Finally, a comprehensive evaluation on real and synthetic datasets demonstrates the effectiveness of our proposed diversification model and the efficiency of our algorithms.

Index Terms—XML Keyword Search, Context-based Diversification.

1 INTRODUCTION

Keyword search on structured and semi-structured data has attracted much research interest recently, as it enables users to retrieve information without the need to learn sophisticated query languages and database structure [1]. Compared with keyword search methods in Information Retrieval (IR), which prefer to find a list of relevant documents, keyword search approaches in structured and semi-structured data (denoted as DB&IR) concentrate more on specific information contents, e.g., fragments rooted at the smallest lowest common ancestor (SLCA) nodes of a given keyword query in XML. Given a keyword query, a node v is regarded as an SLCA if (1) the subtree rooted at the node v contains all the keywords, and (2) there does not exist a descendant node v′ of v such that the subtree rooted at v′ contains all the keywords. In other words, if a node is an SLCA, then its ancestors are definitely excluded from being SLCAs, by which the minimal information content with SLCA semantics can be used to represent the specific results in XML keyword search. In this paper, we adopt the well-accepted SLCA semantics [2], [3], [4], [5] as the result metric of keyword queries over XML data.
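To make the SLCA semantics concrete, the following is a minimal, naive Python sketch of our own (not the efficient algorithms of [2], [3], [4]); it assumes nodes are identified by Dewey labels encoded as tuples of integers, and it enumerates all keyword-node combinations, which is only practical for illustration.

from itertools import product

def lca(labels):
    """Longest common prefix of a group of Dewey labels (tuples of ints)."""
    shortest = min(labels, key=len)
    prefix = []
    for depth, step in enumerate(shortest):
        if all(label[depth] == step for label in labels):
            prefix.append(step)
        else:
            break
    return tuple(prefix)

def slca(keyword_node_lists):
    """Naive SLCA: take the LCA of every combination of one node per
    keyword, then keep only the deepest LCAs (exclusivity property: an
    ancestor of another LCA cannot be an SLCA)."""
    lcas = {lca(combo) for combo in product(*keyword_node_lists)}
    return sorted(v for v in lcas
                  if not any(w != v and w[:len(v)] == v for w in lcas))

# "database" occurs at nodes 1.1 and 1.2.1, "query" at node 1.2.2:
print(slca([[(1, 1), (1, 2, 1)], [(1, 2, 2)]]))  # -> [(1, 2)]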

In general, the more keywords a user's query contains, the easier it is to identify the user's search intention with regard to the query. However, when the given keyword query contains only a small number of vague keywords, deriving the user's search intention becomes very challenging due to the high ambiguity of this type of keyword query. Although user involvement can sometimes help identify the search intentions of keyword queries, an interactive process may be time-consuming when the relevant result set is large. To address this, we develop a method that provides diverse keyword query suggestions to users based on the context of the given keywords in the data to be searched. By doing this, users may choose their preferred queries or modify their original queries based on the returned diverse query suggestions.

Example 1: Consider a query q = {database, query} over the DBLP dataset. There are 21,260 publications or venues containing the keyword "database" and 9,896 publications or venues containing the keyword "query", which together contribute 2,040 results containing both given keywords. Directly reading the keyword search results would be time-consuming and not user-friendly due to the huge number of results. It takes 54.22 seconds just to compute all the SLCA results of q using XRank [2]. Even if the system processing time is made acceptable by accelerating the keyword query evaluation with efficient algorithms [3], [4], the unclear and repeated search intentions in the large set of retrieved results will frustrate users. To address the problem, we derive different search semantics of the original query from the different contexts of the XML data, which can be used to explore different search intentions of the original query. In this work, the contexts can be modeled by extracting relevant feature terms of the query keywords from the XML data, as shown in Table 1. We can then compute the keyword search results for each search intention. Table 2 shows part of the statistics of the answers related to the keyword query q, which classify the ambiguous keyword query into different search intentions.


TABLE 1
Top 10 selected feature terms of q

keyword  | features
database | systems; relational; protein; distributed; oriented; image; sequence; search; model; large
query    | language; expansion; optimization; evaluation; complexity; log; efficient; distributed; semantic; translation

TABLE 2
Part of the statistics for q (#results per expanded query)

"database systems" + query feature:
  language 71 | expansion 5 | optimization 68 | evaluation 13 | complexity 1
  log 12 | efficient 17 | distributed 50 | semantic 14 | translation 8

"relational database" + query feature:
  language 40 | expansion 0 | optimization 20 | evaluation 8 | complexity 0
  log 2 | efficient 11 | distributed 5 | semantic 7 | translation 5

... ...


The problem of diversifying keyword search was first studied in the IR community [6], [7], [8], [9], [10]. Most of these works perform diversification as a post-processing or re-ranking step of document retrieval, based on an analysis of the result set and/or the query logs. In IR, keyword search diversification is designed at the topic or document level; e.g., [7] models user intents at the topical level of a taxonomy and [11] obtains possible query intents by mining query logs. However, such taxonomies and query logs are not always easy to obtain. In addition, the diversified results in IR are often modeled at the document level. To improve the precision of query diversification in structured databases or semi-structured data, it is desirable to consider both the structure and the content of the data in the diversification model, so the problem of keyword search diversification needs to be reconsidered in this setting. [12] is the first work to measure the difference of XML keyword search results by comparing their feature sets. However, the selection of the feature set in [12] is limited to metadata in XML, and it is also a method of post-processing search result analysis.

Different from the above post-processing methods, another line of work addresses the problem of intent-based keyword query diversification by constructing structured query candidates [13], [14]. The basic idea is to first map each keyword to a set of attributes (metadata), and then construct a large number of structured query candidates by merging the attribute-keyword pairs. These works assume that each structured query candidate represents a type of search intention, i.e., a query interpretation. However, they are hard to apply in real applications due to the following three limitations:

• A large number of structured XML queries may be generated and evaluated;

• There is no guarantee that the structured queries to be evaluated can find matched results, due to the structural constraints;

• Similar to [12], the process of constructing structured queries has to rely on the metadata information in the XML data.

To address the above limitations and challenges, we initiate a formal study of the diversification problem in XML keyword search, which can directly compute the diversified results without retrieving all the relevant candidates. Towards this goal, given a keyword query, we first derive the correlated feature terms for each query keyword from the XML data based on mutual information in probability theory, which has been used as a criterion for feature selection [15], [16]. The selection of our feature terms is not limited to the labels of XML elements. Each combination of the feature terms and the original query keywords may represent one of the diversified contexts (also denoted as specific search intentions). We then evaluate each derived search intention by measuring its relevance to the original keyword query and the novelty of its produced results. To compute diversified keyword search results efficiently, we propose one baseline algorithm and two improved algorithms based on the observed properties of diversified keyword search results.

The remainder of this paper is organized as follows. In Section 2, we introduce a feature selection model and define the problem of diversifying XML keyword search. In Section 3, we describe the procedure of extracting the relevant feature terms for a keyword query based on the explored feature selection model. In Section 4, we propose three efficient algorithms to identify a set of qualified and diversified keyword query candidates and evaluate them based on our proposed pruning properties. In Section 5, we provide extensive experimental results to show the effectiveness of our diversification model and the performance of our proposed algorithms. We describe the related work in Section 6 and conclude in Section 7.

2 PROBLEM DEFINITION

Given a keyword query q and an XML data T, our target is to derive the top-k expanded query candidates in terms of high relevance and maximal diversification for q in T. Here, each query candidate represents a context or a search intention of q in T.

2.1 Feature Selection Model

Consider an XML data T and its relevance-based term-pair dictionary W. The composition method of W depends on the application context and will not affect our subsequent discussion. As an example, it can simply be the full set or a subset of the terms comprising the text in T, or a well-specified set of term-pairs relevant to some applications.

In this work, the distinct term-pairs are selected based on their mutual information, as in [15], [16]. Mutual information has been used as a criterion for feature selection and feature transformation in machine learning. It can characterize both the relevance and the redundancy of variables, as in minimum-redundancy feature selection. Assume we have an XML tree T and its sample result set R(T). Let Prob(x, T) be the probability of term x appearing in R(T), i.e., Prob(x, T) = |R(x, T)|/|R(T)|, where |R(x, T)| is the number of results containing x. Let Prob(x, y, T) be the probability of terms x and y co-occurring in R(T), i.e., Prob(x, y, T) = |R(x, y, T)|/|R(T)|. If terms x and y are independent, then knowing x does not give any information about y and vice versa, so their mutual information is zero. At the other extreme, if terms x and y are identical, then knowing x determines the value of y and vice versa. Therefore, this simple measure can be used to quantify how much the observed word co-occurrences maximize the dependency of feature terms while reducing their redundancy. In this work, we use the widely accepted mutual information model:

MI(x, y, T) = Prob(x, y, T) \cdot \log \frac{Prob(x, y, T)}{Prob(x, T) \cdot Prob(y, T)}    (1)

For each term in the XML data, we need to find a set of feature terms, where the feature terms can be selected in any way, e.g., the top-m feature terms, or the feature terms whose mutual information values exceed a given threshold, depending on the domain applications or data administrators. The feature terms can be precomputed and stored before query evaluation. Thus, given a keyword query, we can obtain a matrix of features for the query keywords using the term-pairs in W. The matrix represents a space of search intentions (i.e., query candidates) of the original query w.r.t. the XML data. Therefore, our first problem is to select a subset of query candidates that has the highest probability of interpreting the contexts of the original query. In this work, the selection of query candidates is based on an approximate sample at the entity level of the XML data.
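As an illustration, Equation (1) over an entity sample can be sketched in Python as follows; the sample size of 100,000 entities in the usage line is hypothetical.

import math

def mutual_information(n_x, n_y, n_xy, n_total):
    """Equation (1) over an entity sample space: n_x (n_y) is the number
    of entities containing term x (y), n_xy the number containing both,
    and n_total the size of the sample space."""
    if n_xy == 0:
        return 0.0  # p*log(p/...) -> 0 as the joint probability vanishes
    p_x, p_y, p_xy = n_x / n_total, n_y / n_total, n_xy / n_total
    return p_xy * math.log(p_xy / (p_x * p_y))

# e.g. the counts of "database" and "query" from Example 1, with a
# hypothetical sample of 100,000 entities:
print(mutual_information(n_x=21260, n_y=9896, n_xy=2040, n_total=100000))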

Consider the query q = {database, query} over the DBLP XML dataset again. Its corresponding matrix can be constructed from Table 1, and Table 3 shows the mutual information scores for the query keywords in q. Each combination of the feature terms in the matrix may represent a search intention with specific semantics. For example, the combination "query expansion database systems" targets publications discussing the problem of query expansion in the area of database systems; e.g., one such work, "query expansion for information retrieval", published in the Encyclopedia of Database Systems in 2009, will be returned. If we replace the feature term "systems" with "relational", then the generated query changes to searching for specific publications on query expansion over relational databases, for which the returned results are empty, because no work on this problem over relational databases is reported in the DBLP dataset.

TABLE 3
Mutual information scores (×10^-4) w.r.t. terms in q

database | systems 7.06 | relational 3.84 | protein 2.79 | distributed 2.25 | oriented 2.06
         | image 1.73 | sequence 1.31 | search 1.1 | model 1.04 | large 1.02
query    | language 3.63 | expansion 2.97 | optimization 2.3 | evaluation 1.71 | complexity 1.41
         | log 1.17 | efficient 1.03 | distributed 0.99 | semantic 0.86 | translation 0.70


2.2 Keyword Search Diversification Model

In our model, we consider not only the probability of the newly generated queries, i.e., their relevance, but also their new and distinct results, i.e., their novelty. To embody the relevance and novelty of keyword search together, two criteria should be satisfied: (1) the generated query q_new has the maximal probability of interpreting the contexts of the original query q with regard to the data to be searched; and (2) the generated query q_new has a maximal difference from the previously generated query set Q. Therefore, we have the following aggregated scoring function:

score(q_{new}) = Prob(q_{new} \mid q, T) \cdot DIF(q_{new}, Q, T)    (2)

where Prob(q_new|q, T) represents the probability that q_new is the search intention when the original query q is issued over the data T, and DIF(q_new, Q, T) represents the percentage of results that are produced by q_new but not by any previously generated query in Q.

2.2.1 Evaluating the Probabilistic Relevance of an Intended Query Suggestion w.r.t. the Original Query

Based on the Bayes Theorem, we have

Prob(q_{new} \mid q, T) = \frac{Prob(q \mid q_{new}, T) \cdot Prob(q_{new} \mid T)}{Prob(q \mid T)}    (3)

where Prob(q|q_new, T) models the likelihood of generating the observed query q while the intended query is actually q_new, and Prob(q_new|T) is the query generation probability given the XML data T.

The likelihood value Prob(q|q_new, T) can be measured by computing the probability of the original query q being observed in the context of the features in q_new. Consider an original query q = {k_i}_{1≤i≤n} and a generated new query candidate q_new = {s_i}_{1≤i≤n}, where k_i is a query keyword in q, and s_i is a segment consisting of the query keyword k_i and one of its features f_{ij_i} in W, with 1 ≤ j_i ≤ m. Here, we assume that for each query keyword, only the top-m most relevant features are retrieved from W to generate new query candidates. To deal with multi-keyword queries, we make the independence assumption on the probability that f_{ij_i} is the intended feature of the query keyword k_i. That is,

Prob(q \mid q_{new}, T) = \prod_{k_i \in q,\, f_{ij_i} \in q_{new}} Prob(k_i \mid f_{ij_i}, T)    (4)

According to the statistical sample information, the intent of a keyword can be inferred from the occurrences of the keyword and its correlated terms in the data. Thus, we can compute the probability Prob(k_i|f_{ij_i}, T) of interpreting a keyword k_i into a search intent f_{ij_i} as follows:

Prob(k_i \mid f_{ij_i}, T) = \frac{Prob(f_{ij_i} \mid k_i, T) \cdot Prob(k_i, T)}{Prob(f_{ij_i}, T)} = \frac{|R(\{k_i, f_{ij_i}\}, T)| / |R(T)|}{|R(f_{ij_i}, T)| / |R(T)|} = \frac{|R(s_i, T)|}{|R(f_{ij_i}, T)|}    (5)

where s_i = {k_i, f_{ij_i}}.

Consider a query q = {database, query} and a query candidate q_new = {database system; query expansion}. Prob(q|q_new, T) is the probability of a publication addressing the problem of "database query" in the context of "system and expansion", which can be computed by

\frac{|R(\{database\ system\}, T)|}{|R(system, T)|} \cdot \frac{|R(\{query\ expansion\}, T)|}{|R(expansion, T)|}.

Here, |R({database system}, T)| represents the number of keyword search results of the query {database system} over the data T. |R(system, T)| represents the number of keyword search results of running the query {system} over the data T; this number can be obtained without running the query, because it equals the size of the keyword node list of "system" over T. Similarly, we can calculate the values of |R({query expansion}, T)| and |R(expansion, T)|, respectively. In this work, we adopt the widely accepted Smallest Lowest Common Ancestor (SLCA) semantics to model XML keyword search results.
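A small sketch of Equations (4) and (5) follows; count_results and count_feature_nodes are hypothetical accessors over the precomputed statistics, not functions defined in this paper.

def relevance_likelihood(segments, count_results, count_feature_nodes):
    """Equations (4) and (5): segments is q_new as (keyword, feature)
    pairs; count_results(k, f) returns |R({k, f}, T)| and
    count_feature_nodes(f) returns |R(f, T)|, i.e. the size of f's
    keyword node list (so no query evaluation is needed for it)."""
    prob = 1.0
    for keyword, feature in segments:
        prob *= count_results(keyword, feature) / count_feature_nodes(feature)
    return prob

# q_new = {database system; query expansion} from the example above:
q_new = [("database", "system"), ("query", "expansion")]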

Given the XML data T, the query generation probability of q_new can be calculated by the following equation:

Prob(q_{new} \mid T) = \frac{|R(q_{new}, T)|}{|R(T)|} = \frac{|\bigcap_{s_i \in q_{new}} R(s_i, T)|}{|R(T)|}    (6)

where \bigcap_{s_i \in q_{new}} R(s_i, T) represents the set of SLCA results obtained by merging the node lists R(s_i, T) for s_i ∈ q_new using the XRank algorithm in [4], a popular method that computes the SLCA results by traversing the XML data only once.

Given an original query and the data, the value 1/Prob(q|T) remains essentially unchanged across the different generated query candidates. Therefore, Equation 3 can be rewritten as follows:

Prob(q_{new} \mid q, T) = \gamma \cdot \left(\prod \frac{|R(s_i, T)|}{|R(f_{ij_i}, T)|}\right) \cdot \frac{|\bigcap R(s_i, T)|}{|R(T)|}    (7)

where k_i ∈ q, s_i ∈ q_new, f_{ij_i} ∈ s_i, and \gamma = 1/Prob(q|T) can be any value in (0,1] because it does not affect the relative ranking of the expanded query candidates w.r.t. an original keyword query q and data T.

Although the above equation can model the probabilities of the generated query candidates (i.e., the relevance between these query candidates and the original query w.r.t. the data), different query candidates may have overlapping result sets. Therefore, we should also take into account the novelty of the results of these query candidates.

2.2.2 Evaluating the Probabilistic Novelty of an Intended Query w.r.t. the Original Query

As we know, an important property of SLCA semantics is exclusivity, i.e., if a node is taken as an SLCA result, then its ancestor nodes cannot become SLCA results. Due to this exclusive property, the process of evaluating the novelty of a newly generated query candidate q_new depends on the evaluation of the other previously generated query candidates Q. As such, the novelty DIF(q_new, Q, T) of q_new against Q can be calculated as follows:

DIF(q_{new}, Q, T) = \frac{|\{v_x \mid v_x \in R(q_{new}, T) \wedge \nexists\, v_y \in \bigcup_{q' \in Q} R(q', T) : v_x \le v_y\}|}{|R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\}|}    (8)

where R(q_new, T) represents the set of SLCA results generated by q_new; \bigcup_{q' \in Q} R(q', T) represents the set of SLCA results generated by the queries in Q, which excludes duplicate and ancestor nodes; v_x ≤ v_y means that v_x is a duplicate of v_y (for "="), or that v_x is an ancestor of v_y (for "<"); and R(q_{new}, T) ∪ {\bigcup_{q' \in Q} R(q', T)} is an SLCA result set that satisfies the exclusive property.

By doing this, we can avoid presenting repeated or overlapping SLCA results to users. In other words, the consideration of novelty allows us to incrementally refine the diversified results into more specific ones as we incrementally process more query candidates.
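With Dewey labels, Equation (8) can be sketched as below; treating the denominator as the exclusivity-cleaned union of the new and old result sets is our reading of the definition.

def dif(new_results, old_results):
    """Equation (8): new_results are the SLCAs of q_new, old_results the
    SLCAs kept for Q; both are sets of Dewey labels (tuples of ints)."""
    def leq(a, b):  # v_x <= v_y: a duplicates b, or a is an ancestor of b
        return b[:len(a)] == a
    novel = {v for v in new_results
             if not any(leq(v, w) for w in old_results)}
    # exclusivity-cleaned union: drop old results refined by deeper new ones
    kept = {w for w in old_results
            if not any(w != v and leq(w, v) for v in novel)}
    merged = novel | kept
    return (len(novel) / len(merged) if merged else 0.0), merged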

As stated, our problem is to find the top-k qualified query candidates and their relevant SLCA results. To do this, we can compute the absolute score of the search intention for each generated query candidate. However, to reduce the computational cost, an alternative is to calculate the relative scores of the queries. Therefore, we apply the following equation transformation. After substituting Equation 7 and Equation 8 into Equation 2, we have the final equation:

score(q_{new}) = \gamma \cdot \prod \frac{|R(s_i, T)|}{|R(f_{ij_i}, T)|} \cdot \frac{|\bigcap R(s_i, T)|}{|R(T)|} \cdot \frac{|\{v_x \mid v_x \in R(q_{new}, T) \wedge \nexists\, v_y \in \bigcup_{q' \in Q} R(q', T) : v_x \le v_y\}|}{|R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\}|}

= \frac{\gamma}{|R(T)|} \cdot \prod \frac{|R(s_i, T)|}{|R(f_{ij_i}, T)|} \cdot |\bigcap R(s_i, T)| \cdot \frac{|\{v_x \mid v_x \in R(q_{new}, T) \wedge \nexists\, v_y \in \bigcup_{q' \in Q} R(q', T) : v_x \le v_y\}|}{|R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\}|}

\mapsto \prod \frac{|R(s_i, T)|}{|R(f_{ij_i}, T)|} \cdot |\bigcap R(s_i, T)| \cdot \frac{|\{v_x \mid v_x \in R(q_{new}, T) \wedge \nexists\, v_y \in \bigcup_{q' \in Q} R(q', T) : v_x \le v_y\}|}{|R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\}|}    (9)

where s_i ∈ q_new, s_i = k_i ∪ f_{ij_i}, k_i ∈ q, q' ∈ Q, and the symbol \mapsto indicates that the left side depends only on the right side, because the value \gamma/|R(T)| remains unchanged when calculating the diversification scores of the different search intentions.
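Putting the pieces together, one possible sketch of the relative score in Equation (9) reuses the hypothetical relevance_likelihood and dif helpers sketched above; here |slca_results| plays the role of |⋂ R(s_i, T)|.

def relative_score(segments, slca_results, old_results,
                   count_results, count_feature_nodes):
    """Equation (9) up to the constant factor gamma/|R(T)|; returns the
    relative score and the exclusivity-cleaned merged result set."""
    novelty, merged = dif(set(slca_results), set(old_results))
    rel = relevance_likelihood(segments, count_results, count_feature_nodes)
    return rel * len(slca_results) * novelty, merged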

Property 1: (Upper Bound of Top-k Query Suggestions) Suppose we have found k query suggestions Q = {q_1, ..., q_k} for an original keyword query q. If for every q_x ∈ Q we have

score(q_x) \ge \prod \frac{|R(s_i, T)|}{|R(f_{ij_i}, T)|} \cdot \min\{|L_{k_i}|\}

where R(s_i, T) (or R(f_{ij_i}, T)) is the result set of evaluating s_i (or f_{ij_i}) in the entity sample space of T, and \min\{|L_{k_i}|\} is the size of the shortest keyword node list among the keywords k_i in q, then the evaluation algorithm can be safely terminated, i.e., we can guarantee that the current k query suggestions are the top-k qualified ones w.r.t. q in T.

Proof: From Equation 9, Property 1 holds if the following inequality holds:

\min\{|L_{k_i}|\} \ge |\bigcap R(s_i, T)| \cdot \frac{|\{v_x \mid v_x \in R(q_{new}, T) \wedge \nexists\, v_y \in \bigcup_{q' \in Q} R(q', T) : v_x \le v_y\}|}{|R(q_{new}, T) \cup \{\bigcup_{q' \in Q} R(q', T)\}|}.

Since s_i ⊇ k_i ∪ f_{ij_i}, we must have \min\{|L_{s_i}|\} \le \min\{|L_{k_i}|\}. In addition, |\bigcap R(s_i, T)| is bounded by the minimal size \min\{|L_{s_i}|\} of the keyword node lists containing s_i in T, i.e., |\bigcap R(s_i, T)| \le \min\{|L_{s_i}|\}. As such, we have |\bigcap R(s_i, T)| \le \min\{|L_{k_i}|\}. Because the novelty fraction above is at most 1, it follows that |\bigcap R(s_i, T)| multiplied by that fraction is also at most \min\{|L_{k_i}|\}. Therefore, the property is proved.

Since we incrementally generate and assess the intended query suggestions in descending order of their probabilistic relevance values, Property 1 guarantees that we can skip the unqualified query suggestions and terminate our algorithms as early as possible once the top-k qualified ones have been identified.
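A sketch of the resulting stopping test follows; it assumes candidates are generated in descending order of their likelihood part, so that next_likelihood bounds every candidate not yet evaluated. All names are illustrative.

def can_terminate(top_k_scores, k, next_likelihood, min_keyword_list_size):
    """Property 1: stop once k suggestions are found and each scores at
    least next_likelihood * min |L_ki|, an upper bound of Equation (9)
    for any query candidate not yet evaluated."""
    if len(top_k_scores) < k:
        return False
    return min(top_k_scores) >= next_likelihood * min_keyword_list_size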

3 EXTRACTING FEATURE TERMS

To address the problem of extracting meaningful feature terms w.r.t. an original keyword query, there are two relevant works [17], [18]. In [17], the authors proposed a solution for producing the top-k interesting and meaningful expansions to a keyword query by extracting k additional words with high "interestingness" values. The expanded queries can be used to search for more specific documents. The interestingness is formalized with the notion of surprise [19], [20], [21]. In [18], the authors proposed efficient algorithms to identify keyword clusters in large collections of blog posts for specific temporal intervals. Our work integrates both of their ideas: we first measure the correlation of each pair of terms using our mutual information model in Equation 1, which is a simple surprise metric, and then we build a term correlated graph that maintains all the terms and their correlation values. Different from [17], [18], our work utilizes entity-based sample information to build a correlated graph with high precision for XML data.

In order to efficiently measure the correlation of a pair of terms, we use a statistical method that measures how much the co-occurrences of a pair of terms deviate from the independence assumption, where the entity nodes (e.g., the nodes with the "*" node types in the XML DTD) are taken as a sample space. For instance, given a pair of terms x and y, their mutual information score can be calculated based on Equation 1, where Prob(x, T) (or Prob(y, T)) is the number of entities containing x (or y) divided by the total entity size of the sample space, and Prob({x, y}, T) is the number of entities containing both x and y divided by the total entity size of the sample space.

In this work, we build the term correlated graph offline, that is, we precompute it before processing queries. The correlation values among terms are also recorded in the graph, which is used to generate the term-feature dictionary W. During the XML data tree traversal, we first extract the meaningful text information from the entity nodes in the XML data; here, we filter out stop words. We then produce a set of term-pairs by scanning the extracted text. After that, all the generated term-pairs are recorded in the term correlated graph. In the procedure of building the correlation graph, we also record the count of each term-pair generated from different entity nodes. As such, after the XML data tree has been traversed completely, we can compute the mutual information score of each term-pair based on Equation 1. To reduce the size of the correlation graph, the term-pairs with correlation lower than a threshold can be filtered out. Based on the offline-built graph, we can select on-the-fly the top-m distinct terms as the features of each given query keyword.
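The offline construction just described can be sketched as follows, assuming entity_term_sets yields the stopword-free term set of each entity node and reusing the mutual_information function sketched in Section 2.1; all names are illustrative.

from collections import Counter
from itertools import combinations

def build_correlated_graph(entity_term_sets, mi, threshold):
    """One pass over the entity nodes: count terms and term-pairs, then
    keep edges whose Equation (1) score reaches the threshold."""
    term_count, pair_count, n = Counter(), Counter(), 0
    for terms in entity_term_sets:
        n += 1
        terms = set(terms)
        term_count.update(terms)           # each entity counted once per term
        pair_count.update(combinations(sorted(terms), 2))
    graph = {}
    for (x, y), n_xy in pair_count.items():
        score = mi(term_count[x], term_count[y], n_xy, n)
        if score >= threshold:             # prune weakly correlated pairs
            graph.setdefault(x, []).append((y, score))
            graph.setdefault(y, []).append((x, score))
    return graph  # top-m features of a term = its m best-scored neighbours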

4 KEYWORD SEARCH DIVERSIFICATION ALGORITHMS

In this section, we first propose a baseline algorithm to retrieve the diversified keyword search results. Then, two anchor-based pruning algorithms are designed to improve the efficiency of keyword search diversification by utilizing the intermediate results.


Algorithm 1 Baseline Algorithm

input: a query q with n keywords, XML data T and its term correlated graph G
output: top-k search intentions Q and the whole result set Φ

1:  M_{m×n} = getFeatureTerms(q, G);
2:  while (q_new = GenerateNewQuery(M_{m×n})) ≠ null do
3:    φ = null and prob_s_k = 1;
4:    l_{ixjy} = getNodeList(s_{ixjy}, T) for s_{ixjy} ∈ q_new ∧ 1 ≤ ix ≤ m ∧ 1 ≤ jy ≤ n;
5:    prob_s_k = ∏_{f_{ixjy} ∈ s_{ixjy} ∈ q_new} (|l_{ixjy}| / getNodeSize(f_{ixjy}, T));
6:    φ = ComputeSLCA({l_{ixjy}});
7:    prob_q_new = prob_s_k * |φ|;
8:    if Φ is empty then
9:      score(q_new) = prob_q_new;
10:   else
11:     for all result candidates r_x ∈ φ do
12:       for all result candidates r_y ∈ Φ do
13:         if r_x == r_y or r_x is an ancestor of r_y then
14:           φ.remove(r_x);
15:         else if r_x is a descendant of r_y then
16:           Φ.remove(r_y);
17:     score(q_new) = prob_q_new * |φ| * |φ| / (|φ| + |Φ|);
18:   if |Q| < k then
19:     put q_new : score(q_new) into Q;
20:     put q_new : φ into Φ;
21:   else if score(q_new) > score(q'_new) for some q'_new ∈ Q then
22:     replace q'_new : score(q'_new) with q_new : score(q_new);
23:     Φ.remove(q'_new);
24: return Q and result set Φ;

4.1 Baseline Solution

Given a keyword query, the intuitive idea of the baseline algorithm is to first retrieve the relevant feature terms with high mutual information scores from the term correlated graph of the XML data T; then generate a list of query candidates sorted in descending order of their total mutual information scores; and finally compute the SLCAs as keyword search results for each query candidate and measure its diversification score. As such, the top-k diversified query candidates and their corresponding results can be chosen and returned.

Different from traditional XML keyword search, our work needs to evaluate multiple intended query candidates and generate a whole result set, in which the results should be diversified and distinct from each other. Therefore, we have to detect and remove the duplicate or ancestor SLCA results that have already been seen when we obtain newly generated results.

The detailed procedure is shown in Algorithm 1. Given a keyword query q with n keywords, we first load its precomputed relevant feature terms from the term correlated graph G of XML data T, which is used to construct a matrix M_{m×n}, as shown in Line 1. Then, we generate a new query candidate q_new from the matrix M_{m×n} by calling the function GenerateNewQuery(), as shown in Line 2; new query candidates are generated in descending order of their mutual information scores. Line 3 - Line 7 show the procedure of computing Prob(q|q_new, T). To compute the SLCA results of q_new, we need to retrieve the precomputed node lists of the keyword-feature term pairs in q_new from T by getNodeList(s_{ixjy}, T). Based on the retrieved node lists, we can compute the likelihood of generating the observed query q while the intended query is actually q_new, i.e., Prob(q|q_new, T) = ∏_{f_{ixjy} ∈ s_{ixjy} ∈ q_new} (|l_{ixjy}| / getNodeSize(f_{ixjy}, T)) using Equation 5, where getNodeSize(f_{ixjy}, T) can be quickly obtained from the precomputed statistics of T. After that, we call the function ComputeSLCA(), which can be implemented using any existing XML keyword search method. In Line 8 - Line 16, we compare the SLCA results of the current query and the previous queries in order to obtain the distinct and diversified SLCA results. At Line 17, we compute the final score of q_new as a diversified query candidate w.r.t. the previously generated query candidates in Q. At last, we compare the new query with the previously generated query candidates and replace the unqualified ones in Q, as shown in Line 18 - Line 23. After processing all the possible query candidates, we return the top-k generated query candidates with their SLCA results.

In the worst case, every possible query in the matrix could be chosen as one of the top-k qualified query candidates. In this case, the complexity of the algorithm is O(m^{|q|} \cdot |L_1| \sum_{i=2}^{|q|} \log |L_i|), where L_1 is the shortest node list of any generated query, |q| is the number of original query keywords, and m is the number of selected features per query keyword. In practice, the complexity of the algorithm can be reduced by decreasing the number m of feature terms, which bounds the number m^{|q|} of generated query candidates.

4.2 Anchor-based Pruning Solution

By analyzing the baseline solution, we find that its main cost is spent on computing SLCA results and on removing unqualified SLCA results (i.e., duplicates and ancestors) from the newly and previously generated result sets. To reduce this computational cost, we are motivated to design an anchor-based pruning solution, which avoids the unnecessary computation of unqualified SLCA results. In this subsection, we first analyze the interrelationships between the intermediate SLCA candidates that have already been computed for the generated query candidates Q and the nodes that will be merged for answering the newly generated query candidate q_new. We then give the detailed description and algorithm of the anchor-based pruning solution.

4.2.1 Properties of Computing Diversified SLCAs

Definition 1: (Anchor nodes) Given a set of query candidates Q that have already been processed and a new query candidate q_new, the generated SLCA results Φ of Q can be taken as the anchors for efficiently computing the SLCA results of q_new by partitioning the keyword nodes of q_new.

[Figure: two anchor SLCA nodes X1 and X2 partition the keyword node lists l_{s1}, l_{s2}, l_{s3} of a new query into the areas L_{X1_anc}, L_{X1_pre}, L_{X1_des}, L_{X1_next}, and likewise for X2; sub-lists that miss a keyword, e.g., l^3_{s1} and l^3_{s2} when l^3_{s3} is empty, are pruned.]

Fig. 1. The usability of anchor nodes

Example 2: Figure 1 shows the usability of anchor nodes for computing SLCA results. Consider two SLCA results X1 and X2 (assume X1 precedes X2) for the current query set Q. For the next query q_new = {s1, s2, s3} and its keyword instance lists L = {l_{s1}, l_{s2}, l_{s3}}, the keyword instances in L are divided into four areas by the anchor X1: (1) L_{X1_anc}, in which all the keyword nodes are ancestors of X1, so L_{X1_anc} cannot generate new and distinct SLCA results; (2) L_{X1_pre}, in which all the keyword nodes precede X1, so we may generate new SLCA results if the results are still bounded within the area; (3) L_{X1_des}, in which all the keyword nodes are descendants of X1, so it may produce new SLCA results that will replace X1; and (4) L_{X1_next}, in which all the keyword nodes follow X1, so it may produce new results, but it may be further divided by the anchor X2. If the keyword node lists have an empty intersection with an area, then the nodes in this area can be pruned directly; e.g., l^3_{s1} and l^3_{s2} can be pruned without computation if l^3_{s3} is empty in L_{X1_des}. Similarly, we can process L_{X2_pre}, L_{X2_des} and L_{X2_next}. After that, a set of new and distinct SLCA results can be obtained with regard to the new query set Q ∪ {q_new}.

Theorem 1: (Single Anchor) Given an anchor node v_a and a new query candidate q_new = {s_1, s_2, ..., s_n}, its keyword node lists L = {l_{s1}, l_{s2}, ..., l_{sn}} can be divided into four areas anchored by v_a: the keyword nodes that are ancestors of v_a, denoted L_{va_anc}; the keyword nodes that precede v_a, denoted L_{va_pre}; the keyword nodes that are descendants of v_a, denoted L_{va_des}; and the keyword nodes that follow v_a, denoted L_{va_next}. We have that L_{va_anc} does not generate any new result; each of the other three areas may generate new and distinct SLCA results individually; and no new and distinct SLCA results can be generated across the areas.

Proof: For the keyword nodes in the ancestor area, all of them are ancestors of the anchor node v_a, which is an SLCA node w.r.t. Q. All these keyword nodes can be directly discarded due to the exclusive property of SLCA semantics. For the keyword nodes in the area L_{va_pre} (or L_{va_next}), we need to compute their SLCA candidates, but the computation can be bounded by the preorder (or postorder) traversal number of v_a, where we assume the range encoding scheme is used in this work. For the keyword nodes in L_{va_des}, if they produce SLCA candidates that are bounded within the range of v_a, then these candidates are taken as new and distinct results while v_a is removed from the result set.

The last step is to certify that keyword nodes across any two areas cannot produce new and distinct SLCA results. This is because the only possible SLCA candidates they may produce are the node v_a and its ancestor nodes, and these nodes cannot become new and distinct SLCA results due to the anchor node v_a. For instance, if we select some keyword nodes from L_{va_pre} and some from L_{va_des}, then their corresponding SLCA candidate must be an ancestor of v_a, which cannot become a new and distinct SLCA result due to the exclusive property of SLCA semantics. Similarly, no new result can be generated if we select keyword nodes from the other pairs of areas, L_{va_pre} and L_{va_next} (or L_{va_des} and L_{va_next}).
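Assuming the range encoding used in the proof, where every node carries a [pre, post] interval and an ancestor's interval encloses those of its descendants, the single-anchor partition can be sketched as:

def partition(nodes, a_pre, a_post):
    """Split a keyword node list into the four areas of Theorem 1 around
    an anchor with range [a_pre, a_post]. nodes: list of (pre, post)."""
    areas = {"anc": [], "pre": [], "des": [], "next": []}
    for pre, post in nodes:
        if pre < a_pre and post > a_post:
            areas["anc"].append((pre, post))   # ancestor: discarded later
        elif post < a_pre:
            areas["pre"].append((pre, post))   # precedes the anchor
        elif pre > a_post:
            areas["next"].append((pre, post))  # follows the anchor
        else:
            areas["des"].append((pre, post))   # inside the anchor's range
    return areas

# Anchor v10 = [16-23] from Example 3: [1-33] -> anc, [18-19] -> des,
# [28-31] -> next, [5-10] -> pre.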

[Figure: a sample XML tree with nodes v1-v17, each labeled with terms from {k1, k2, f1, f2} and a range encoding, e.g., v1 [1-33] and v10 [16-23].]

Fig. 2. Demonstrating the usability of anchor nodes in a tree

Example 3: Figure 2 gives a small tree that contains four different terms k1, k2, f1 and f2, with the encoding ranges attached to all the nodes. Assume f1 and f2 are two feature terms of a given keyword query {k1, k2}, and f1 has a higher correlation with the query than f2. After the first intended query {k1, k2, f1} is evaluated, the node v10 is the SLCA result. When we process the next intended query {k1, k2, f2}, the node v10 becomes an anchor node that partitions the keyword nodes into four parts by using its encoding range [16-23]: L_pre = {v2, v3, v4, v5, v6, v7, v9}, in which each node's postorder number is less than 16; L_next = {v14, v15, v16, v17}, in which each node's preorder number is larger than 23; L_anc = {v1, v8}, in which each node's preorder number is less than 16 while its postorder number is larger than 23; and the rest of the nodes belong to L_des = {v11, v12, v13}. Although v8 contains f2, it cannot contribute more specific results than the anchor node v10, due to the exclusive property of SLCA semantics. So we only need to evaluate {k1, k2, f2} over the three areas L_pre, L_next and L_des individually. Since no node contains f2 in L_des, all the keyword nodes in L_des can be skipped. At last, the nodes v4 and v16 become two additional SLCA results. If there are further intended queries to be processed, their relevant keyword nodes can be partitioned by the current three anchor nodes v4, v10 and v16. Generally, the number of keyword nodes that can be skipped increases with the number of selected anchor nodes, which can significantly reduce the time cost of the algorithms, as shown in the experiments.

Theorem 2: (Multiple Anchors) Given multiple anchor nodes V_a and a new query candidate q_new = {s_1, s_2, ..., s_n}, its keyword node lists L = {l_{s1}, l_{s2}, ..., l_{sn}} can be divided into at most (3|V_a| + 1) areas anchored by the nodes in V_a. Only the nodes in (2|V_a| + 1) of these areas can generate new SLCA candidates individually, and these (2|V_a| + 1) areas are independent for computing SLCA candidates.

Proof: Let V_a = (x_1, ..., x_i, ..., x_j, ..., x_{|V_a|}), where x_i precedes x_j for i < j. We first partition the space into four areas using x_1, i.e., L_{x1_anc}, L_{x1_pre}, L_{x1_des} and L_{x1_next}. For L_{x1_next}, we partition it further using x_2 into four areas. We repeat this process until we partition the area L_{x(|V_a|-1)_next} into the four areas L_{x|V_a|_anc}, L_{x|V_a|_pre}, L_{x|V_a|_des} and L_{x|V_a|_next} by using x_{|V_a|}. Obviously, the total number of partitioned areas is 3|V_a| + 1. As the ancestor areas do not contribute to SLCA results, we end up with 2|V_a| + 1 areas that need to be processed.

Theorem 1 guarantees that keyword nodes across different areas cannot contribute new and distinct SLCA results, because any possible result candidate generated by such keyword nodes must be an ancestor of at least one current SLCA result. Due to the exclusive property of SLCA semantics, no new and distinct SLCA results can be produced across different areas.

Property 2: Consider the (2|V_a| + 1) effective areas anchored by the nodes in V_a. If there exists an s_i ∈ q_new such that none of its instances appears in an area, then the area can be pruned because it cannot generate SLCA candidates of q_new.

The reason is that any area that can possibly generate new results must contain at least one matched instance of each keyword in the query, based on the SLCA semantics. Therefore, if an area lacks instances of some keyword, the area can be removed without any computation.

Following Example 3, if a third intended query needs to be evaluated and v4, v10 and v16 have been selected as anchor nodes, then we have ten areas of keyword nodes: L_{v4_pre} = {v3}; L_{v4_anc} = {v1, v2}; L_{v4_des} = {v5, v6}; L_{v4_next/v10_pre} = {v7, v9}; L_{v10_anc} = {v1, v8}; L_{v10_des} = {v11, v12, v13}; L_{v10_next/v16_pre} = {v15}; L_{v16_anc} = {v1, v14}; L_{v16_des} = {v17}; and L_{v16_next} = NIL. Three of them can be directly discarded due to the exclusive property of SLCA semantics, i.e., L_{v4_anc}, L_{v10_anc} and L_{v16_anc}. L_{v16_next} can also be removed because it is empty. The remaining areas can first be filtered out if they do not contain all the keywords of the third intended query, before computing the possible new and distinct SLCA results.

Algorithm 2 Anchor-based Pruning Algorithm

input: a query q with n keywords, XML data T and its term correlated graph G
output: top-k query intentions Q and the whole result set Φ

1:  M_{m×n} = getFeatureTerms(q, G);
2:  while (q_new = GenerateNewQuery(M_{m×n})) ≠ null do
3:    Line 3 - Line 5 in Algorithm 1;
4:    if Φ is not empty then
5:      for all v_anchor ∈ Φ do
6:        get l_{ixjy_pre}, l_{ixjy_des} and l_{ixjy_next} by calling Partition(l_{ixjy}, v_anchor);
7:        if ∀ l_{ixjy_pre} ≠ null then
8:          φ' = ComputeSLCA({l_{ixjy_pre}}, v_anchor);
9:        if ∀ l_{ixjy_des} ≠ null then
10:         φ'' = ComputeSLCA({l_{ixjy_des}}, v_anchor);
11:       φ += φ' + φ'';
12:       if φ'' ≠ null then
13:         Φ.remove(v_anchor);
14:       if ∃ l_{ixjy_next} = null then
15:         break the for-loop;
16:       l_{ixjy} = l_{ixjy_next} for 1 ≤ ix ≤ m ∧ 1 ≤ jy ≤ n;
17:   else
18:     φ = ComputeSLCA({l_{ixjy}});
19:   score(q_new) = prob_q_new * |φ| * |φ| / (|Φ| + |φ|);
20:   Line 18 - Line 23 in Algorithm 1;
21: return Q and result set Φ;

4.2.2 The Anchor-based Pruning Algorithm

Motivated by the properties of computing diversified SLCAs, we design the anchor-based pruning algorithm. The basic idea is as follows. We generate the first new query and compute its corresponding SLCA candidates as a starting point. When the next new query is generated, we use the intermediate results of the previously generated queries to prune the unnecessary nodes according to the above theorems and property. By doing this, we generate only the distinct SLCA candidates each time. That is, unlike the baseline algorithm, the diversified results can be computed directly without further comparison.

The detailed procedure is shown in Algorithm 2. Similar to the baseline algorithm, we need to construct the matrix of feature terms and retrieve their corresponding node lists, where the node lists can be maintained using an R-tree index. Then, we calculate the likelihood of generating the observed query q when the issued query is q_new. Different from the baseline algorithm, we utilize the intermediate SLCA results of previously generated queries as the anchors to efficiently compute the new SLCA results for the following queries. For the first generated query, we compute the SLCA results using any existing XML keyword search method, as the baseline algorithm does (Line 18). Here, we use a stack-based method to implement the function ComputeSLCA().

The results of the first query are taken as anchors to prune the node lists of the next query, reducing its evaluation cost (Line 5 - Line 16). Given an anchor node v_anchor, for each node list l_{ixjy} of a query keyword in the current new query, we may get three effective node lists l_{ixjy_pre}, l_{ixjy_des} and l_{ixjy_next} using the R-tree index by calling the function Partition(). If a node list is empty, e.g., l_{ixjy_pre} = null, then we do not need to get the node lists of the other query keywords in the same area, e.g., in the left-bottom area of v_anchor, because that area cannot produce any SLCA candidates at all. Consequently, the nodes in this area cannot generate new and distinct SLCA results. If all the node lists have at least one node in the same area, then we compute the SLCA results by the function ComputeSLCA(), e.g., ComputeSLCA({l_{ixjy_des}}, v_anchor), which merges the nodes in {l_{ixjy_des}}. If the SLCA results are descendants of v_anchor, then they are recorded as new distinct results and v_anchor is removed from the temporary result set. Through Line 19, we can obtain the final score of the current query without comparing the SLCA results of φ with those of Φ. At last, we record the score and the results of the new query into Q and Φ, respectively. After all the necessary queries are computed, the top-k diversified queries and their results are returned.

4.3 Anchor-based Parallel Sharing Solution

Although the anchor-based pruning algorithm can avoid the unnecessary computation cost of the baseline algorithm, it can be further improved by exploiting the parallelism of keyword search diversification and reducing repeated scanning of the same node lists.

4.3.1 Observations

According to the semantics of keyword search diversification, only the distinct SLCA results need to be returned to users. We make the following two observations.

Observation 1: Anchored by the SLCA result set V_a of the previously processed query candidates Q, the keyword nodes of the next query candidate q_new can be classified into 2|V_a| + 1 areas. According to Theorem 2, no new and distinct SLCA results can be generated across areas. Therefore, the 2|V_a| + 1 areas of nodes can be processed independently, i.e., we can compute the SLCA results area by area. This makes parallel keyword search diversification efficient, without communication cost among processors.

Observation 2: Because there are term overlaps between the generated query candidates, the intermediate partial results of the previously processed query candidates may be reused for evaluating the following queries, by which repeated computations can be avoided.

4.3.2 Anchor-based Parallel Sharing Algorithm

To make the parallel computing efficient, we utilize the SLCA results of previous queries as the anchors to partition the node lists that need to be computed. By assigning areas to processors, no communication cost among the processors is required. Our proposed algorithm guarantees that the results generated by each processor are SLCA results of the current query. In addition, we also take into account the shared partial matches among the generated query candidates, by which we can further improve the efficiency of the algorithm.

Different from the above two proposed algorithms, we first generate and analyse all the possible query candidates Q_new. Here, we use a vector V to maintain the shared query segments among the generated queries in Q_new. Then, we begin to process the first query as in the above two algorithms. When the next query arrives, we check its shared query segments and exploit parallel computing to evaluate the query candidate.

To do this, we first check if the new query candidateqnew contains shared parts ψ in V . For each sharedpart in ψ, we need to check its processing status. Ifthe status has already been set as “processed”, thenit says that the partial matches of the shared parthave been computed before. In this condition, we onlyneed to retrieve the partial matches from previouslycached results. Otherwise, it says that we haven’tcomputed the partial matches of the shared part be-fore. We have to compute the partial matches fromthe original node lists of the shared segments. Afterthat, the processing status will be set as “processed”.And then, the node lists of the rest query segmentswill be processed. In this algorithm, we also exploreparallel computing to improve the performance ofquery evaluation. At the beginning, we specify themaximal number of processors and the current pro-cessor’s id (denoted as PID). And then, we distributethe nodes that need to be computed to the processorwith PID in a round. When all the required nodes arereached at a processor, the processor will be activatedto compute the SLCA results of qnew or the partialmatches for the shared segments in qnew. After allactive processors complete their SLCA computations,we get the final SLCA results of qnew. At last, wecan calculate the score of qnew and compare its scorewith the previous ones. If its score is larger than thatof one of the query candidates in Q, then the querycandidate qnew, its score score(qnew) and its SLCAresults will be recorded. After all the necessary queries


After all the necessary queries are processed, we can return the top-k qualified and diversified query candidates and their corresponding SLCA results. The detailed algorithm is not provided due to the limited space. A sketch of the segment-sharing step is given below.
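The following is a minimal sketch, under assumed names and types, of the segment-sharing step: partial matches of query segments shared across candidates are computed once, cached with a "processed" status, and reused by later queries; the actual partial-match computation over the node lists is left abstract, as in the paper.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the sharing step of ASPE. A segment is modeled as
// a set of keywords; the cache plays the role of the vector V together with
// its "processed" flags (a cached entry means status == "processed").
class SharedSegmentEvaluator {
    private final Map<Set<String>, List<Integer>> cache = new ConcurrentHashMap<>();

    // Returns the partial matches of one shared segment, computing them from
    // the original node lists only on first use and caching them afterwards.
    List<Integer> partialMatches(Set<String> segment) {
        return cache.computeIfAbsent(segment, this::computeFromNodeLists);
    }

    // Placeholder for the real partial-match computation over the segment's
    // keyword node lists (details omitted here, as in the paper).
    private List<Integer> computeFromNodeLists(Set<String> segment) {
        return new ArrayList<>();
    }
}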

5 EXPERIMENTS

In this section, we show extensive experimental results evaluating the performance of our baseline algorithm (denoted as baseline evaluation, BE) and our anchor-based algorithm (denoted as anchor-based evaluation, AE), which were implemented in Java and run on a 3.0GHz Intel Pentium 4 machine with 2GB RAM running Windows XP. Our anchor-based parallel sharing algorithm (denoted as ASPE) was implemented on six computers, which serve as six processors for parallel computation.

5.1 Dataset and Queries

We use a real dataset, DBLP [22], and a synthetic XML benchmark dataset, XMark [23], for testing the proposed XML keyword search diversification model and our designed algorithms. The size of the DBLP dataset is 971MB and the size of the generated XMark dataset is 697MB. Compared with the DBLP dataset, the synthetic XMark dataset has varied depths and complex data structures, but, being synthetic, it does not contain clear semantic information. Therefore, we only use the DBLP dataset to measure the effectiveness of our diversification model in this work.

For each XML dataset, we selected terms based on the following criteria: (1) a selected term should often appear in user-typed keyword queries; (2) a selected term should highlight different semantics when it co-occurs with feature terms in different contexts. Based on these criteria, we chose terms for each dataset as follows. For the DBLP dataset, we selected "database, query, programming, semantic, structure, network, domain, dynamic, parallel". For the XMark dataset, we chose "brother, gentleman, look, free, king, gender, iron, purpose, honest, sense, metals, petty, shakespeare, weak, opposite". By randomly combining the selected terms, we generated 14 sets of keyword queries for each dataset.

In addition, we also designed 10 specific DBLP keyword queries with clear search intentions as the ground truth query set. We then modified these clear keyword queries into vague queries by masking some keywords, e.g., "query optimization parallel database", "query language database" and "dynamic programming networks", where the underlined keywords are the ones masked.

5.2 Effectiveness of Diversification Model

Three metrics are applied to evaluate the effectiveness of our diversification model by assessing its relevance, diversity and usefulness.

5.2.1 Test Adapted-nDCG

The normalized Discounted Cumulative Gain (nDCG) has established itself as the standard evaluation measure when graded relevance values are available [9], [24], [13]. The first step in the nDCG computation is the creation of a gain vector G. The gain G[k] at rank k is computed as the relevance of the result at this rank to the user's keyword query. The gain is discounted logarithmically with increasing rank to penalize documents lower in the ranking, reflecting the additional user effort required to reach them. The discounted gain is accumulated over k to obtain the DCG value and normalized by the ideal gain at rank k to finally obtain the nDCG value. For relevance-based retrieval, the ideal gain may be considered identical for all users. For diversification-based retrieval, however, the ideal gain cannot be considered identical, because users may prefer to see different results for a query. Therefore, we need to adapt the nDCG measure to test the effectiveness of the query diversification model.
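For reference, one commonly used instantiation of this computation, following [24], applies a logarithmic discount; the paper does not spell out its exact discount function, so this particular form is an assumption:

$$DCG_k = \sum_{i=1}^{k} \frac{G[i]}{\log_2(i+1)}, \qquad nDCG_k = \frac{DCG_k}{idealDCG_k}.$$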

Since it is difficult to find the ground truth for query diversification, we carefully selected the ground truth for our tested keyword queries. Here, we employed human assessors to manually mark the ground truth. We invited three groups of users to do a user study, where each group has ten undergraduate volunteers and all users share similar scoring criteria, i.e., interestingness, meaningfulness, and novelty. In this paper, the Google and Bing search engines are used to explore diversified search intentions as a standard diverse suggestion set for a given keyword query. To do this, the volunteers in the 1st

and 2nd groups evaluated the original query using the Google and Bing search engines restricted to the DBLP domain "informatik.uni-trier.de/~ley/db/". By checking at most 100 relevant results, each user ui identifies 10 diversified queries based on her/his satisfaction and gives each diversified query qdiv a satisfaction credit Credit(ui, qdiv) in the range [0-10]. By doing this, each group produces a set of diversified queries, where the credit of each query is marked by its average score

$$\frac{\sum_{i=1}^{10} Credit(u_i, q_{div})}{10},$$

where $Credit(u_i, q_{div}) = 0$ if qdiv does not appear in the list of 10 diversified queries of ui. After that, the top-10 diversified queries with the highest scores are selected for each group. As such, the 1st and 2nd groups produce two independent ideal gains for a keyword query, using the Google and Bing search engines respectively. Similarly, the 3rd group of users marked the diversified query suggestions produced by our work, which is used to calculate the gain G[k]. Based on the formula $nDCG_k = \frac{DCG_k}{idealDCG_k}$, we can calculate the nDCG of each keyword query from the statistical information supplied by the three groups of users.

Figures 3(a)-3(d) show the nDCG values of keyword queries over the DBLP dataset, where only the top-10 diversified query suggestions are taken into account.


[Fig. 3. nDCG values vs. top-k diversified query suggestions (k = 1..10) over DBLP: (a) q1 and q2, (b) q5 and q6, (c) q8 and q9, (d) q11 and q13; each panel compares the Google- and Bing-based ground truths.]

From Figure 3(a) and Figure 3(b), we can see that the nDCG values are no less than 0.8 when k ≥ 2, i.e., our proposed diversification model has an 80% chance of approaching the ground truth of ideal diversified query suggestions. Figure 3(c) shows that for query set q9, our diversification model only has a 60% chance of approaching the ground truth. Figure 3(d) also shows that the average nDCG value of the 11th keyword query set q11 based on the Google engine is near 0.6. For such queries, we need to recommend more diversified query suggestions to improve the effectiveness of our diversification model.

5.2.2 Test Differentiation of Diversified Query Results

[Fig. 4. Differentiation of diversified query results: average diversification ratio (0-1) of DivQ vs. DivContext over the 14 selected sets of queries.]

To show the effectiveness of the diversification model, we also tested the differentiation between the result sets produced by our suggested query candidates. This assessment measures how diversified the results are across the suggested diversified query candidates. To do this, we first selected the top-5 diversified query suggestions for each original keyword query. We then evaluated each diversified query suggestion over the DBLP dataset. From each list of relevant results, we chose the top-10 most relevant results. As such, we obtain 50 results as a sample for each original keyword query to test its degree of diversification. The assessment metric is given by the following equation:

$$\frac{\sum_{q_i, q_j \in \mathrm{top}\text{-}k\,DS(q)} \left(1 - \frac{\sum_{r_i \in R(q_i), r_j \in R(q_j)} Sim(r_i, r_j)}{|R(q_i)| \cdot |R(q_j)|}\right)}{C_k^2} \qquad (10)$$

where top-kDS(q) represents the top-k diversified query suggestions; R(qi) represents the set of selected most relevant results for the diversified query suggestion qi; and Sim(ri, rj) represents the similarity of two results ri and rj, measured by an n-gram model.
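A minimal sketch of this metric is given below, assuming result snippets are compared with a character trigram Jaccard similarity as one possible instantiation of the n-gram model; all names are hypothetical, and non-empty result lists are assumed.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of Eq. (10): average pairwise differentiation over the
// top-k diversified suggestions, with Sim() instantiated as trigram Jaccard.
class DiversificationRatio {
    static double ratio(List<List<String>> resultsPerSuggestion) {
        int k = resultsPerSuggestion.size();
        double sum = 0;
        int pairs = 0;
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) { // the C(k,2) suggestion pairs
                List<String> ri = resultsPerSuggestion.get(i);
                List<String> rj = resultsPerSuggestion.get(j);
                double simSum = 0;
                for (String a : ri)
                    for (String b : rj) simSum += trigramJaccard(a, b);
                sum += 1.0 - simSum / (ri.size() * rj.size());
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }

    static double trigramJaccard(String a, String b) {
        Set<String> ga = grams(a), gb = grams(b);
        if (ga.isEmpty() && gb.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(ga);
        inter.retainAll(gb);
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        return (double) inter.size() / union.size();
    }

    static Set<String> grams(String s) {
        Set<String> g = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) g.add(s.substring(i, i + 3));
        return g;
    }
}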

Similarly, we also tested the differentiation of diversified query results using DivQ [13] and compared its experimental results with ours (denoted as DivContext). Figure 4 shows the differentiation of results with regards to the 14 selected sets of queries over the DBLP dataset. The experimental results show that our diversification model provides significantly better differentiated results than DivQ. This is because the structural differences among the structured queries generated by DivQ do not exactly reflect the differentiation of the results those queries generate. For the 6th, 10th and 11th sets of queries, DivQ can reach up to 80% because these three sets of queries contain a few specific keywords, and the structured queries they produce (i.e., the query suggestions in DivQ) have substantially different structures. From the experiments, we can see that our context-sensitive diversification model outperforms the DivQ probabilistic model because both the word context (i.e., the term correlation graph) and the structure (i.e., the exclusive property of SLCA) are considered in our model.


5.2.3 Test Adapted Keyword Queries

This experiment builds on the general assumption that users often pick the queries they are interested in from the top of the suggestion list. Therefore, in this section, we first selected 10 specific keyword queries with clear semantics and manually generated 10 corresponding vague keyword queries by removing some keywords from each specific query. We then computed a list of diversified query suggestions for each vague keyword query. By checking the positions at which the specific queries appear in their corresponding suggestion lists, we can test the usefulness of our diversification model. Here, we use a simple metric to quantify usefulness. If the specific keyword query occurs at positions 1-5 in its suggestion list, the usefulness of the diversification model is marked as 1. If it is located at positions 6-10, the usefulness is marked as 0.5. If it is located at positions 11-20, the usefulness is marked as 0.25. Otherwise, the usefulness is labeled as zero, because users seldom check more than 20 suggestions in sequence. This rule is encoded directly in the sketch below.
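The helper below is a direct encoding of the position-to-usefulness rule just stated; the class and method names are ours, not from the paper.

// Maps the 1-based rank of the specific query in the suggestion list to the
// usefulness credit defined above; a non-positive rank means "not found".
class UsefulnessMetric {
    static double usefulness(int rank) {
        if (rank >= 1 && rank <= 5)   return 1.0;
        if (rank >= 6 && rank <= 10)  return 0.5;
        if (rank >= 11 && rank <= 20) return 0.25;
        return 0.0; // users seldom check beyond the top-20 suggestions
    }
}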

We also compared our work with the diversification model in [13]. Since [13] uses structured queries to represent query suggestions while ours uses sets of terms, we manually transformed their structured queries into the corresponding term sets based on the context of the structured queries. To make the comparison fair, a structured query is considered a specific keyword query if its corresponding term set contains all the keywords in the specific keyword query.

[Fig. 5. Usefulness of the diversification model: usefulness ratio (0-1) of DivQ vs. DivContext for the 10 adapted vague queries.]

From Figure 5, we can see that with our method (DivContext) the specific keyword queries appear within the top-5 query suggestions for all ten adapted vague queries. With the DivQ method, only one of them appears within the top-5 query suggestions, three within the top 10, another three within the top 20, and the rest do not make the top 20. From the experimental results, we can see that a qualified suggestion model should consider both the relevance of the query suggestions and the novelty of the results those suggestions generate.

5.3 Efficiency of Diversification Approaches

[Fig. 6. Average time comparison of the original keyword queries and their suggested query candidates: response time (sec) vs. top-k search intentions (k = 5, 10, 15, 20) for XRank-AVG, BE-AVG, AE-AVG and ASPE-AVG; (a) DBLP, (b) XMark.]

Figure 6(a) and Figure 6(b) show the average response time of computing all the SLCA results for the selected original keyword queries using XRank [2], and of directly computing their diversifications and corresponding SLCA results using our proposed BE, AE and ASPE. According to the experimental results, our diversified algorithms consumed only about 20% of the time of XRank to obtain the top-k qualified diversified query suggestions and their diverse results, for k = 5, 10 and 15. When k = 20, our algorithms consumed at most 30% of the time of XRank. Here, the time of XRank only counts the cost of computing the SLCA results of the original keyword queries; if we further took into account the time of selecting diverse results, our algorithms would perform even better relative to XRank.

In particular, when users are only interested in a few search contexts, our context-based keyword query diversification algorithms can greatly outperform post-processing diversification approaches. The main reason is that the computational cost is bounded by the shortest keyword node list: the specific feature terms with short node lists can be used to skip large numbers of keyword nodes in the other node lists, so efficiency improves greatly.

From Figure 7(a), we can see that BE consumed more time to answer a vague query as the number of diversified query suggestions increases over the DBLP dataset, taking up to 5.7 seconds to answer a vague query with diversified results when the number of qualified suggestions is set to 20.


[Fig. 7. Average time cost of queries: response time (sec) vs. top-k search intentions (k = 5, 10, 15, 20) for BE, AE and ASPE; (a) DBLP, (b) XMark.]

In contrast, AE and ASPE can do so in about 3.5 seconds and 2 seconds, respectively. This is because many nodes can be skipped, without computation, using the anchor nodes. Another reason is that when the number of suggestions is small, e.g., 5, we can quickly identify the qualified suggestions and safely terminate the evaluation with the guarantee of the upper bound, after which the qualified suggestions and their diverse results can be output.

Figure 7(b) shows the time cost of evaluating the keyword queries over the XMark dataset. Although BE is slower than AE and ASPE, it can still finish each query evaluation in about 1.5 seconds. This is because (1) the number of keyword nodes is not as large as in DBLP, and (2) the keyword nodes are distributed evenly in the synthetic XMark dataset, which limits the advantage of AE and ASPE.

6 RELATED WORK

Diversifying the results of document retrieval has been studied in [6], [7], [8], [9]. Most of these techniques perform diversification as a post-processing or re-ranking step of document retrieval. Related work on result diversification in IR also includes [25], [26], [27]. The authors in [25] used a probabilistic framework to diversify document ranking, by which web search result diversification is addressed; they applied a similar model to discuss search result diversification through sub-queries in [26]. The authors in [27] proposed a set of natural axioms that a diversification system is expected to satisfy, by which user satisfaction with the diversified results can be improved. Different from the above works, our diversification model is designed to process keyword queries over structured data.

We have to consider the structure of the data in our model and algorithms, rather than pure text data as in the above methods. In addition, our algorithms can incrementally generate query suggestions and evaluate them, so the diversified search results can be returned along with the qualified query suggestions without depending on the whole result set of the original keyword query.

Recently, some work has discussed the problem of result diversification in structured data. For instance, [28] conducted a thorough experimental evaluation of various diversification techniques implemented in a common framework and proposed a threshold-based method to control the tradeoff between the relevance and diversity features in their diversification metric; however, setting the threshold value is a big challenge for users. The authors in [29] developed efficient algorithms to find the top-k most diverse set of results for structured queries over semi-structured data. Since a structured query expresses a much clearer search intention, diversifying structured query results is less significant than diversifying keyword search results. In [30], the authors focused on the selection of a diverse item set, without considering the structural relationships of the items to be selected. The work most relevant to ours is the DivQ approach in [13], where the authors first identify the attribute-keyword pairs of an original keyword query and then construct a large number of structured queries by connecting the attribute-keyword pairs using the data schema (the attributes can be mapped to corresponding labels in the schema). The problem in [13] is that two generated structured queries with only slightly different structures may still be considered different types of search intentions, which may hurt the effectiveness of diversification, as shown in our experiments. In contrast, our diversification model utilizes mutually co-occurring feature term sets as contexts to represent different query suggestions, and the feature terms are selected based on both their mutual correlation and their distinct result sets. The structure of the data is considered by satisfying the exclusive property of the SLCA semantics.

7 CONCLUSIONS

In this paper, we presented an approach to search for diversified results of a keyword query over XML data based on the contexts of the query keywords in the data. The diversification of the contexts was measured by exploring their relevance to the original query and the novelty of their results. Furthermore, we designed three efficient algorithms based on the observed properties of XML keyword search results. Finally, we verified the effectiveness of our diversification model by analyzing the returned search intentions for the given keyword queries over the DBLP dataset, based on the nDCG measure and the differentiation of the diversified query suggestions.


Meanwhile, we also demonstrated the efficiency of our proposed algorithms by running a substantial number of queries over both the DBLP and XMark datasets. The experimental results show that our proposed diversification algorithms can return qualified search intentions and results to users in a short time.

8 ACKNOWLEDGMENTS

This work was supported by the ARC Discovery Projects under Grants No. DP110102407 and DP120102627, and partially supported by a grant of the RGC of the Hong Kong SAR, No. 418512.

REFERENCES

[1] Y. Chen, W. Wang, Z. Liu, and X. Lin, "Keyword search on structured and semi-structured data," in SIGMOD Conference, 2009, pp. 1005–1010.
[2] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram, "XRank: ranked keyword search over XML documents," in SIGMOD Conference, 2003, pp. 16–27.
[3] C. Sun, C. Y. Chan, and A. K. Goenka, "Multiway SLCA-based keyword search in XML data," in WWW, 2007, pp. 1043–1052.
[4] Y. Xu and Y. Papakonstantinou, "Efficient keyword search for smallest LCAs in XML databases," in SIGMOD Conference, 2005, pp. 537–538.
[5] J. Li, C. Liu, R. Zhou, and W. Wang, "Top-k keyword search over probabilistic XML data," in ICDE, 2011, pp. 673–684.
[6] J. G. Carbonell and J. Goldstein, "The use of MMR, diversity-based reranking for reordering documents and producing summaries," in SIGIR, 1998, pp. 335–336.
[7] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, "Diversifying search results," in WSDM, 2009, pp. 5–14.
[8] H. Chen and D. R. Karger, "Less is more: probabilistic models for retrieving fewer relevant documents," in SIGIR, 2006, pp. 429–436.
[9] C. L. A. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon, "Novelty and diversity in information retrieval evaluation," in SIGIR, 2008, pp. 659–666.
[10] A. Angel and N. Koudas, "Efficient diversity-aware search," in SIGMOD Conference, 2011, pp. 781–792.
[11] F. Radlinski and S. T. Dumais, "Improving personalized web search using result diversification," in SIGIR, 2006, pp. 691–692.
[12] Z. Liu, P. Sun, and Y. Chen, "Structured search result differentiation," PVLDB, vol. 2, no. 1, pp. 313–324, 2009.
[13] E. Demidova, P. Fankhauser, X. Zhou, and W. Nejdl, "DivQ: diversification for keyword search over structured databases," in SIGIR, 2010, pp. 331–338.
[14] J. Li, C. Liu, R. Zhou, and B. Ning, "Processing XML keyword search by constructing effective structured queries," in APWeb/WAIM, 2009, pp. 88–99.
[15] H. Peng, F. Long, and C. H. Q. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, 2005.
[16] C. O. Sakar and O. Kursun, "A hybrid method for feature selection based on mutual information and canonical correlation analysis," in ICPR, 2010, pp. 4360–4363.
[17] N. Sarkas, N. Bansal, G. Das, and N. Koudas, "Measure-driven keyword-query expansion," PVLDB, vol. 2, no. 1, pp. 121–132, 2009.
[18] N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa, "Seeking stable clusters in the blogosphere," in VLDB, 2007, pp. 806–817.
[19] S. Brin, R. Motwani, and C. Silverstein, "Beyond market baskets: generalizing association rules to correlations," in SIGMOD Conference, 1997, pp. 265–276.
[20] W. DuMouchel and D. Pregibon, "Empirical bayes screening for multi-item associations," in KDD, 2001, pp. 67–76.
[21] A. Silberschatz and A. Tuzhilin, "On subjective measures of interestingness in knowledge discovery," in KDD, 1995, pp. 275–281.
[22] "http://dblp.uni-trier.de/xml/."
[23] "http://monetdb.cwi.nl/xml/."
[24] K. Järvelin and J. Kekäläinen, "Cumulated gain-based evaluation of IR techniques," ACM Trans. Inf. Syst., vol. 20, no. 4, pp. 422–446, 2002.
[25] R. L. T. Santos, C. Macdonald, and I. Ounis, "Exploiting query reformulations for web search result diversification," in WWW, 2010, pp. 881–890.
[26] R. L. T. Santos, J. Peng, C. Macdonald, and I. Ounis, "Explicit search result diversification through sub-queries," in ECIR, 2010, pp. 87–99.
[27] S. Gollapudi and A. Sharma, "An axiomatic approach for result diversification," in WWW, 2009, pp. 381–390.
[28] M. R. Vieira, H. L. Razente, M. C. N. Barioni, M. Hadjieleftheriou, D. Srivastava, C. Traina Jr., and V. J. Tsotras, "On query result diversification," in ICDE, 2011, pp. 1163–1174.
[29] M. Hasan, A. Mueen, V. J. Tsotras, and E. J. Keogh, "Diversifying query results on semi-structured data," in CIKM, 2012, pp. 2099–2103.
[30] D. Panigrahi, A. D. Sarma, G. Aggarwal, and A. Tomkins, "Online selection of diverse results," in WSDM, 2012, pp. 263–272.

Jianxin Li received his BE and ME degrees in computer science from Northeastern University, China, in 2002 and 2005, respectively. He received his PhD degree in computer science from the Swinburne University of Technology, Australia, in 2009. He is currently a postdoctoral research fellow at Swinburne University of Technology. His research interests include database query processing and optimization, and social network analytics.

Chengfei Liu received the BS, MS and PhD degrees in computer science from Nanjing University, China, in 1983, 1985 and 1988, respectively. He is currently a Professor in the Faculty of Science, Engineering and Technology, Swinburne University of Technology, Australia. His current research interests include keyword search on structured data, query processing and refinement for advanced database applications, query processing on uncertain data and big data, and data-centric workflows. He is a member of IEEE and a member of ACM.

Jeffrey Xu Yu received the BE, ME, and PhD degrees in computer science from the University of Tsukuba, Japan, in 1985, 1987, and 1990, respectively. He is currently a Professor in the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong. His major research interests include graph mining, graph databases, social networks, keyword search, and query processing and optimization. He is a senior member of IEEE, a member of the IEEE Computer Society, and a member of ACM.