
Finding Frequent Co-occurring Terms in Relational Keyword Search

Yufei Tao    Jeffrey Xu Yu
Chinese University of Hong Kong

ABSTRACT

Given a set Q of keywords, conventional keyword search (KS) returns a set of tuples, each of which (i) is obtained from a single relation, or by joining multiple relations, and (ii) contains all the keywords in Q. This paper proposes a relevant problem called frequent co-occurring term (FCT) retrieval. Specifically, given a keyword set Q and an integer k, a FCT query reports the k terms that are not in Q, but appear most frequently in the result of a KS query with the same Q. FCT search is able to discover the concepts that are closely related to Q. Furthermore, it is also an effective tool for refining the keyword set Q of traditional keyword search. While a FCT query can be trivially supported by solving the corresponding KS query, we provide a faster algorithm that extracts the correct results without evaluating any KS query at all. The effectiveness and efficiency of our techniques are verified with extensive experiments on real data.

1. INTRODUCTION

Given a set Q of keywords, a keyword search (KS) query returns a set of tuples, each of which (i) is obtained from a single relation or by joining several tables, and (ii) contains all the keywords¹. To illustrate, we use four tables whose schemas are shown in Figure 1, where the underlined attributes are primary keys. Each arrow represents a primary-to-foreign key relationship. For example, AUTHOR → WRITES means that the primary key A_id of AUTHOR is referenced by the A_id in WRITES. Figure 2 demonstrates the partial content of each table. Given a set Q of keywords: {Tony, paper}, the KS query returns the result in Figure 3.

To understand the result, first notice that Figure 3 is actually the output of the natural join:

    AUTHOR_{A_name = Tony} ⋈ WRITES ⋈ PAPER.

¹This is the AND semantic, as is the focus of this paper. The OR semantic has also been addressed by [11, 16], where a qualifying tuple only needs to cover at least one query keyword.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the ACM. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permissions from the publisher, ACM.
EDBT 2009, March 24-26, 2009, Saint Petersburg, Russia.
Copyright 2009 ACM 978-1-60558-422-5/09/0003 ...$5.00

AUTHOR (A_id, A_name)
WRITES (A_id, P_id)
PAPER (P_id, P_title, C_id)
CONF (C_id, C_name, C_year)

Figure 1: An example database schema

This expression is generated by the system automatically, as will be explained in Section 3. Second, each tuple in the result contains all the keywords in Q, treating the table name as an implicit keyword in every tuple of the table [16]. Third, the result is obtained with the smallest number of joins necessary. Specifically, the keywords Tony and paper are available only in AUTHOR and PAPER, respectively. Hence, every result tuple must combine a tuple in AUTHOR with one in PAPER, which, in turn, necessitates joins with WRITES.

Frequent Co-occurring Term Search. This paper proposes a new operator, frequent co-occurring term (FCT) retrieval, which adds a mining flavor to keyword search. Given a set Q of keywords and an integer k, a FCT query returns the k most frequent terms in the result of a KS query with the same Q. For example, consider again the result in Figure 3. Given the same Q and k = 8, a FCT query returns the 8 terms in Figure 4, which appear most frequently in Figure 3. Note that stop-words, such as "the", "of", etc., are excluded. Also excluded are obvious noisy terms such as the table name WRITES. Furthermore, the keywords in Q are not considered either, since they trivially appear in all result tuples. Finally, the standard word-stemming technique should be applied, so that words like "preservation" and "preserving" can be regarded as the same word.

Intuitively, a FCT query extracts the concepts that are most closely associated with the keyword set Q. For example, the terms in Figure 4 are indeed strongly related to Tony, since he has published primarily in two areas: (i) spatio-temporal indexing and query processing, and (ii) privacy preserving data publication. Although the above discussion is based on the artificial example of Figure 2, similar observations indeed exist in the real world. For example, query and spatio-temporal really are the two most frequent terms in the titles of the papers by Tony; they appear 20 and 13 times, respectively.

Relevance to Traditional Keyword Search. The proposed FCT operator has a fundamental difference from conventional KS queries: FCT search extracts terms, while KS fetches tuples.


AUTHOR
      A_id  A_name
      a1    Tony
      ...   ...

WRITES (the symbols w1, ..., w6 are introduced for illustration purposes)
      A_id  P_id
w1    a1    p1
w2    a1    p2
w3    a1    p3
w4    a1    p4
w5    a1    p5
w6    a1    p6

PAPER
      P_id  P_title                                                                             C_id
      p1    The MV3R-Tree: A Spatio-Temporal Access Method for Timestamp and Interval Queries   c1
      p2    Time-Parameterized Queries in Spatio-Temporal Databases                             c2
      p3    The TPR*-Tree: An Optimized Spatio-Temporal Access Method for Predictive Queries    c3
      p4    Personalized Privacy Preservation                                                   c4
      p5    m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets         c5
      p6    Preservation of Proximity Privacy in Publishing Numerical Sensitive Data            c6
      ...   ...                                                                                 ...

CONF
      C_id  C_name  C_year
      c1    VLDB    2001
      c2    SIGMOD  2002
      c3    VLDB    2003
      c4    SIGMOD  2006
      c5    SIGMOD  2007
      c6    SIGMOD  2008
      ...   ...     ...

Figure 2: The table contents

A_name  A_id  P_id  P_title                                                                            C_id
Tony    a1    p1    The MV3R-Tree: A Spatio-Temporal Access Method for Timestamp and Interval Queries  c1
Tony    a1    p2    Time-Parameterized Queries in Spatio-Temporal Databases                            c2
Tony    a1    p3    The TPR*-Tree: An Optimized Spatio-Temporal Access Method for Predictive Queries   c3
Tony    a1    p4    Personalized Privacy Preservation                                                  c4
Tony    a1    p5    m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets        c5
Tony    a1    p6    Preservation of Proximity Privacy in Publishing Numerical Sensitive Data           c6

Figure 3: Result of a KS query {Tony, paper}

In particular, FCT search is different from top-k KS [11, 16, 17]. Given a keyword set Q and an integer k, a top-k KS query finds the k tuples (in the result of a normal KS query) most relevant to Q. The relevance is calculated by treating each tuple as a small document, and then applying an IR-style or page-rank relevance function. Hence, in top-k KS, tuples are mutually independent, as the relevance of a tuple does not depend on the others. In contrast, a FCT query must view all tuples in a holistic manner in order to aggregate the frequency of a term. Note that the k terms produced by FCT retrieval do not necessarily appear in the k tuples fetched by top-k KS. The reasons are two-fold. First, in top-k KS, the relevance of a tuple is decided by the keywords in Q, and hardly reflects the frequencies of the terms outside Q. Second, a tuple itself being relevant to Q does not imply that the terms it contains are globally frequent in the query result.

The explorative nature of FCT search also makes it an effective tool for refining KS queries. For example, consider someone who is interested in the research of Tony, but is not familiar with the areas he has worked in. Running a simple KS query with Q = {Tony, paper} would return too many tuples, one for each publication. With a FCT query, s/he would be able to identify important terms that can be added to Q to formulate a more selective KS query. As shown in Figure 4, such terms could be spatio-temporal and query. Thus, next s/he would execute a KS query Q = {Tony, spatio-temporal, query, paper} to retrieve only the papers of Tony on spatio-temporal queries.

Contributions. This paper presents a systematic study on FCT retrieval. We first provide a formal formulation of the problem, and then propose a fast algorithm to solve it. We show that a FCT query can be answered without performing all the joins needed by the corresponding KS query. For instance, the terms in Figure 4 can be derived without computing the tuples in Figure 3.

term             frequency
spatio-temporal  3
query            3
privacy          3
preserve         3
tree             2
access           2
method           2
publish          2

Figure 4: The 8 most frequent terms in Figure 3

We have experimentally evaluated our technique on a real dataset IMDB, which incorporates the information of over 800k movies and TV programs. Our results show that FCT retrieval is effective, revealing many interesting observations. Furthermore, our FCT algorithm significantly outperforms the straightforward approach of evaluating the corresponding KS query completely, achieving a maximum speedup of 4.

The rest of the paper is organized as follows. Section 2 formally formulates the problem of retrieving frequent co-occurring terms. Section 3 reviews the previous work on keyword search. Section 4 proposes an efficient FCT algorithm, and Section 5 extends the algorithm to the scenario where term appearances in various tables have different importance. Section 6 contains our experimental evaluation. Finally, Section 7 concludes the paper with directions for future work.

2. PROBLEM DEFINITION

We consider that the database has n tables R1, R2, ..., Rn, referred to as the raw tables. Their referencing relationships are summarized in a schema graph:

Definition 1 (Schema Graph). The schema graph is a directed graph G such that (i) G has n vertices, corresponding to tables R1, ..., Rn respectively, and (ii) G has an edge from vertex Ri to vertex Rj (1 ≤ i ≠ j ≤ n), if and only if Rj has a foreign key referencing a primary key in Ri.

For example, Figure 1 shows the schema graph of a database with n = 4 tables. Let Q be a set of m keywords kw1, ..., kwm. Each answer of the traditional keyword search is an MTJNT defined as follows:


(a) An MTJNT: a1 — w1 — p1        (b) Not an MTJNT: a1 — w1 — p1 — c1

Figure 5: MTJNT illustration with Q = {Tony, paper}

Definition 2 (MTJNT). A minimum total join network (MTJNT) is an undirected tree satisfying three requirements:

• (join) Each vertex is a tuple of a raw table. Let t and t′ be any two adjacent vertices, and assume that they are in raw tables R and R′ respectively. Then, R and R′ must be connected in the schema graph, and t ⋈ t′ must belong to R ⋈ R′.

• (total) Every keyword in Q is contained in at least one vertex.

• (minimal) No vertex of the tree can be removed such that the remaining part is still a tree fulfilling the above requirements.

We assume that the name of a raw table R is an implicit term in each tuple in R. To illustrate MTJNTs, let us introduce some conventions for tuple referencing. For a tuple in tables AUTHOR, PAPER, CONF in Figure 2, we refer to it by its primary key, e.g., c1 represents the first tuple in CONF. Given a tuple in WRITES, we denote it using the symbols w1, w2, ... shown in Figure 2, e.g., w1 is the first tuple in WRITES. Figure 5a demonstrates an MTJNT for a query Q = {Tony, paper}. The tree in Figure 5b, however, is not an MTJNT, as it violates the minimal requirement in Definition 2. Namely, removal of vertex c1 does not compromise the join- and total-requirements.

Definition 3 (Keyword Search). Given a set Q of keywords and a number Rmax, a keyword search (KS) query returns the set KS(Q) of all possible MTJNTs that have at most Rmax vertices.

The parameter Rmax is introduced to prevent excessively large MTJNTs. For instance, with Q = {Tony, paper} and Rmax = 3, KS(Q) contains 6 MTJNTs, each of which is translated to a different tuple in Figure 3. In particular, the MTJNT in Figure 5a belongs to KS(Q), and depicts the first tuple in Figure 3.

Let T be any MTJNT in KS(Q). Given a term w, we use count(T, w) to denote the number of occurrences of w in T, i.e., totally how many w's are in the texts of the vertices of T. For example, let T be the MTJNT in Figure 5a. Then, count(T, spatio-temporal) = 1 and count(T, privacy) = 0, since the first tuple in Figure 3 contains one occurrence of spatio-temporal but no privacy.

Equipped with the function count(·, ·), the total frequency of w in all MTJNTs can be obtained as:

    freq(Q, w) = Σ_{∀T ∈ KS(Q)} count(T, w).    (1)

Now we are ready to define frequent co-occurring term retrieval:

Definition 4 (FCT Search). Given a set Q of keywords, a number Rmax, and an integer k, a frequent co-occurring term (FCT) query returns the k terms with the highest frequencies among all terms that (i) are not in Q, and (ii) appear in the result of a KS query with the same Q and Rmax.

Optionally, a user who has some knowledge of the schema graph may require that all the reported terms appear in a particular set of relations.

As an example, given the database in Figure 2, a query Q = {Tony, paper}, and Rmax = 3, a FCT query with k = 4 reports the terms spatio-temporal, query, privacy, and preserve, because they have the highest frequency 3 (see Figure 4) among all terms in Figure 3 except Tony and paper.

As explained in Section 1, the motivation of FCT retrieval is to discover the concepts that best describe the characteristics of the query keyword set Q. Since it is based on term matching, standard pre-processing is needed to increase the accuracy. First, all the stop-words (i.e., common words such as "of", "is", etc.), noisy terms (i.e., words without significant meanings), and numerical data are excluded from consideration. Second, words with the same root (e.g., "preserving" and "preservation") should be counted as an identical word, as can be achieved through word-stemming.

3. RELATED WORK

The previous works on relational keyword search can be divided into two categories, depending on whether they retrieve MTJNTs based on candidate networks (CN) or data-graph traversal. In the sequel, we outline the central ideas of both categories. Our discussion focuses on relational databases (as is the topic of this paper). Nevertheless, at the end of the section, we briefly survey relevant works on other types of data.

Methods Based on Candidate Networks. Keyword search (KS) aims at offering greater convenience to users in querying the database. Compared to conventional SQL queries, however, the convenience of KS comes at the cost of higher complexity in query processing. As explained in the sequel, a major complication arises from the fact that MTJNTs may be produced from numerous different joins, depending on the distribution of the keywords in the raw tables.

Consider a KS query with a keyword-set Q. Given a raw table R and a subset S of Q, let R^S be the set of the tuples in R that (i) contain all the keywords in S, but (ii) do not include any keyword in Q − S. For example, consider the tables in Figure 2 and a KS query with a set Q of two keywords: kw1 = Tony, kw2 = paper. Then, AUTHOR^{kw1} includes all tuples that have Tony but not paper. As a special case, when S = ∅, R^S is the set of tuples in R that do not contain any keyword in Q at all. In general, for any non-empty S, R^S is called a non-free tuple-set. Otherwise (i.e., S = ∅), R^S is a free tuple-set.
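To make the tuple-set notation concrete, the following is a minimal Python sketch (not from the paper) that partitions a raw table into its tuple-sets R^S; the in-memory layout and all names are our own assumptions for illustration only.

    # Partition a raw table into tuple-sets R^S, indexed by the subset S of
    # query keywords that each tuple contains (S = frozenset() is the free
    # tuple-set).
    from collections import defaultdict

    def build_tuple_sets(table, query_keywords):
        tuple_sets = defaultdict(list)           # S -> list of tuples
        for tid, terms in table:                 # row: (tuple_id, set of terms)
            s = frozenset(kw for kw in query_keywords if kw in terms)
            tuple_sets[s].append((tid, terms))
        return tuple_sets

    # Toy run with the paper's AUTHOR table (terms lower-cased):
    author = [("a1", {"author", "tony"}), ("a2", {"author", "mary"})]
    sets = build_tuple_sets(author, {"tony", "paper"})
    print(sets[frozenset({"tony"})])   # [('a1', ...)] -- i.e., AUTHOR^{Tony}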

Before accessing the underlying tables, the database must enumerate all the possible algebra expressions that may produce MTJNTs. The simplest expression is AUTHOR^{kw1, kw2}; namely, if a tuple in AUTHOR contains both Tony and paper, the tuple itself constitutes an MTJNT.


(a) For expression E1: AUTHOR^{kw1} → WRITES^{kw2}
(b) For expression E4: AUTHOR^{kw1} → WRITES^∅ ← PAPER^{kw2}

Figure 6: Candidate network examples

By the same reasoning, the other one-table expressions yielding MTJNTs are WRITES^{kw1, kw2}, PAPER^{kw1, kw2}, and CONF^{kw1, kw2}. There are more MTJNT-expressions involving two tables, for example:

    AUTHOR^{kw1} ⋈ WRITES^{kw2},    (E1)
    AUTHOR^{kw2} ⋈ WRITES^{kw1},    (E2)
    WRITES^{kw1} ⋈ PAPER^{kw2}, ...,    (E3)

to list just a few. There exist even more MTJNT-expressions with three tables:

    AUTHOR^{kw1} ⋈ WRITES^∅ ⋈ PAPER^{kw2},    (E4)
    AUTHOR^{kw2} ⋈ WRITES^∅ ⋈ PAPER^{kw1},    (E5)
    WRITES^{kw1} ⋈ PAPER^∅ ⋈ CONF^{kw2}, ....    (E6)

Similarly, we can create a large number of MTJNT-expressions involving four tables. To avoid excessive expressions, a common approach [11, 12, 16, 17, 18] is to place an upper bound Rmax on the number of tuple-sets in an expression. For example, with Rmax = 3, it is not necessary to examine expressions with more than 3 tuple-sets.

An MTJNT-expression can be converted to a candidate network (CN). Specifically, given such an expression E, a CN is a directed tree where (i) a vertex corresponds to a (non-free or free) tuple-set in E, and (ii) an edge between two vertices indicates that the two tuple-sets should be joined in evaluating E, and the edge's direction follows the direction of the corresponding edge in the schema graph. For example, Figures 6a and 6b present the CNs of the MTJNT-expressions E1 and E4 shown earlier.

Hristidis and Papakonstantinou [12] develop an algorithm for generating all the candidate networks efficiently, subject to the upper bound Rmax. This algorithm is deployed as the first step by all KS solutions [11, 12, 16, 17, 18] based on CNs. As a second step, a KS algorithm executes all the CNs (a.k.a. MTJNT-expressions) to produce the MTJNTs. Note that a CN may not necessarily return any result. For example, AUTHOR^{Tony, paper} is empty because no tuple in AUTHOR contains both Tony and paper simultaneously. The simplest approach to CN evaluation is to perform an SQL query for each CN. Various optimizations are possible for reducing the computation time. For example, many CNs may share common subexpressions, which therefore need to be evaluated only once [12, 18]. Furthermore, if the goal is to report only the top-k MTJNTs (according to a certain scoring function) [11, 16, 17], the processing can be accelerated using the well-known thresholding technique [7] or the skyline-sweep approach proposed in [17].

Methods Based on Data Graphs. Based on their foreign-to-primary key relationships, the tuples in the raw tables can be connected into a data graph. Specifically, this is a directed graph, where (i) each vertex represents a tuple in a raw table, and (ii) there is an edge from tuple t to t′ if and only if t′ has a foreign key referencing the primary key of t. To illustrate, Figure 7 shows the data graph resulting from the database in Figure 2.

[Graph omitted: vertices a1; w1, ..., w6; p1, ..., p6; c1, ..., c6, with edges a1 → wi, pi → wi, and ci → pi following the foreign keys.]

Figure 7: The data graph of the database in Figure 2

Following the naming convention in Figure 5, for each tuple in AUTHOR, PAPER, CONF, we label its vertex with its primary key, whereas, for tuples in WRITES, their vertices are labeled with the symbols w1, w2, ... defined in Figure 2. There is an edge from, for example, a1 to w1 because tuple w1 (first row of WRITES) references the primary key of tuple a1 (first row of AUTHOR).

Given a set Q of keywords, the MTJNTs can be found by traversing the data graph. Next we explain the backward [4] strategy, which is the foundation of other more complex approaches [10, 14]. Let Q contain two keywords kw1 = Tony and kw2 = paper. Backward first fetches the set S1 (S2) of tuples whose texts contain kw1 (kw2). In Figure 7, S1 = {a1} and S2 = {p1, p2, ..., p6}, which can be easily obtained given an inverted index². To obtain an MTJNT, backward picks two vertices from S1 and S2 respectively, and gradually expands their neighborhoods, until a common vertex is encountered in both neighborhoods. For instance, assume that we pick a1 from S1, and p1 from S2. For a1, backward identifies its 1-edge neighborhood, i.e., the set {w1, w2, ..., w6} of vertices that can be reached from a1 by crossing one edge. Similarly, the 1-edge neighborhood of p1 is {w1} (c1 is not included, because the edge between p1 and c1 points at p1). As w1 appears in both the 1-edge neighborhoods of a1 and p1, backward outputs the MTJNT in Figure 5a, with w1 being the root, and p1, a1 the leaves. In our example, all MTJNTs can be derived from 1-edge neighborhoods. In general, however, further neighborhood expansion may be necessary to guarantee no false miss.

All the algorithms [3, 4, 6, 10, 14] leveraging data graphs deal with top-k KS, where the score of an MTJNT is calculated based on its tree structure, instead of purely from its texts. For example, the score of an MTJNT can be defined as the sum of the weights of all its edges. In this case, fast discovery of the top-k MTJNTs requires neighborhood expansions to be performed in a prioritized manner. This creates tremendous opportunities for optimization, aiming at expanding the most promising neighborhood earlier. Retrieval of the top-1 MTJNT is actually a classical Steiner-tree problem, for which Ding et al. [6] give a dynamic-programming algorithm with good asymptotic performance. Kimelfeld and Sagiv [15] propose a theoretical algorithm for the general top-k version.

Whether edge weights are important is a crucial factor in choosing between KS solutions based on CNs (candidate networks) and data graphs. In case edge weights must be considered, a data-graph method should be applied, because CN-algorithms are aware of only the foreign-to-primary connections at the schema level, but not at the tuple level.

²For every term w in the database, the inverted index contains a list of the tuples where w appears.


On the other hand, when edge weights are irrelevant, CN-algorithms have better performance, since they can exploit the powerful execution engine of the database to extract multiple MTJNTs via a single join. For this reason, we also design our FCT algorithm following the CN-methodology.

Other Works. Keyword search has also been studied on non-relational data. In particular, considerable efforts [2, 9, 13, 21] have been made on XML documents. Recently, keyword-driven query processing has also been introduced in spatial databases [8] and OLAP [20]. The above discussion focuses on centralized DBMS, whereas keyword search in distributed systems has also been investigated [19, 22].

4. THE STAR ALGORITHM

This section discusses the algorithmic issues in FCT search. Section 4.1 first provides a high-level description of the proposed algorithm. Then, Sections 4.2 and 4.3 explain the details of its two major components.

4.1 High-level Description

A straightforward solution to FCT retrieval, referred to as baseline, is to first solve the corresponding KS query, and then extract the term frequencies. Specifically, given a set Q of keywords and an integer k, baseline applies a conventional KS algorithm (surveyed in Section 3) to retrieve the set KS(Q) of all MTJNTs. Then, the algorithm computes the frequency freq(Q, w) of each term w by Equation 1, and reports the k most frequent terms.

Baseline incurs expensive cost because it completely evaluates all the joins necessitated by a KS query in order to obtain KS(Q). While complete join-evaluation is mandatory for reporting MTJNTs, can it be avoided if our objective is to derive only the term frequencies? The answer is yes. Next, we design an algorithm star that acquires all term frequencies without producing the MTJNTs.

Star applies the methodology of candidate networks (CN) reviewed in Section 3. Specifically, given the query keyword-set Q, it employs the CN-generation algorithm in [12] to obtain all the CNs. Let us use h to represent the number of CNs, and denote them as CN1, CN2, ..., CNh, respectively. Recall that, as explained in Section 3, a CN can be regarded as an algebra expression, which retrieves a set of MTJNTs. We deploy MTJNT(CNi) to denote the set of MTJNTs resulting from executing CNi (1 ≤ i ≤ h). Clearly, MTJNT(CNi) ∩ MTJNT(CNj) = ∅ for any 1 ≤ i ≠ j ≤ h; that is, no MTJNT can be output by two CNs at the same time. It follows that

    KS(Q) = ∪_{i=1}^{h} MTJNT(CNi).    (2)

Let freq-CN(CNi, w) be the total number of occurrences of term w in all the MTJNTs of MTJNT(CNi), or formally:

    freq-CN(CNi, w) = Σ_{∀T ∈ MTJNT(CNi)} count(T, w),    (3)

where count(T, w), as defined in Section 2, is the number of occurrences of w in a single MTJNT T. Thus, the total frequency freq(Q, w) can be calculated as:

    freq(Q, w) = Σ_{i=1}^{h} freq-CN(CNi, w).    (4)

[Diagrams omitted: (a) a star-CN consisting of a single tuple-set; (b) a root tuple-set connected to several leaves; (c) a root R^S with some edges pointing toward it and others away from it.]

Figure 8: Star-CN examples

Therefore, the key to computing freq(Q, w) is to calculate freq-CN(CNi, w) for a single candidate network CNi.

The crucial observation behind the design of our star algorithm is that freq-CN(CNi, w) can be calculated efficiently when CNi is a star-CN:

Definition 5 (Star Candidate Network). A star candidate network (star-CN) is a CN where one vertex, called the root, connects to all the other vertices, called the leaves.

Figure 8 demonstrates several examples of star-CNs. Note that the simplest star-CN can consist of only a single tuple-set R^S (as explained in Section 3, R^S is the set of tuples in the raw table R that contain all the keywords in S but no keyword in Q − S, where Q is the query keyword-set). Also note that a star-CN can have any number of leaf tuple-sets. Furthermore, the edge directions can be arbitrary. For instance, in Figure 8c, some edges point at the root R^S, while others point away from it. In other words, R^S may reference the primary keys of some leaf tuple-sets, and meanwhile may be referenced by other leaves.

As elaborated in the next section, given a star-CN CN and a term w, freq-CN(CN, w) can be obtained at a cost considerably lower than deriving the MTJNTs in MTJNT(CN). This is why the proposed star algorithm is significantly faster than baseline. Apparently, when the schema graph itself is a star, all the CNs are definitely star-CNs. This makes our star algorithm especially suitable for data warehouse applications (where star schemas are common).

In case the schema graph is not a star, some CNs may not be star-CNs. In this case, we perform an interesting operation, called starization, to transform such a CN into a star-CN CN′ that returns the same MTJNTs. Then, star proceeds normally with CN′. Intuitively, starization evaluates only the minimum set of joins to complete the conversion of CN into a star-counterpart. Note that those joins must be performed by the baseline approach anyway. In other words, starization carries out some of the work done by the baseline approach, but just enough to enable the application of star.

In the next subsection, we elaborate how to obtain the term frequencies from a star-CN. Then, Section 4.3 presents the details of starization.

4.2 Term Frequency Retrieval in a Star-CN

This section settles the following problem: given a star-CN CN, find the frequencies freq-CN(CN, w) of all terms w that appear in at least one MTJNT of MTJNT(CN).

Let R^S be the tuple-set at the root of CN. Use l to denote the number of leaf tuple-sets in CN, and Ri^{Si} be the i-th leaf tuple-set (1 ≤ i ≤ l). Deploy Ai to represent the set of joining attributes between R^S and Ri^{Si}. Hence, conceptually R^S has columns {A1, A2, ..., Al, text}, where text incorporates all the attributes other than A1, ..., Al. In the same fashion, Ri^{Si} can be regarded to have a schema {Ai, text}. Following the convention of previous work [11, 12, 16, 17], we assume that the data of each tuple-set have been collected into a file. (This can be easily achieved with a single scan of all the raw tables. Furthermore, all the tuple-sets together consume exactly the same amount of space as the raw tables, because every tuple in a raw table belongs to precisely one tuple-set.)

We will use the running example in Figure 9 to illustrate our solution. Figure 9a gives the CN under consideration, assuming that the query keyword-set Q has three terms kw1, kw2, and kw3. The root of CN is a free tuple-set R^∅, i.e., no tuple in R^∅ should contain any keyword in Q. All leaf tuple-sets are non-free. For example, in R1^{kw1}, each tuple must include only kw1, but not kw2 and kw3. Notice that R^∅ has two primary keys A1 and A2, referenced by R1^{kw1} and R2^{kw2} respectively, whereas R^∅ itself references the primary key A3 of R3^{kw3}.

Figures 9b-9e give the contents of the four tuple-sets. We use symbols α1, α2, ..., β1, ..., γ1, ..., δ1, ... to facilitate tuple pinpointing. For instance, α1 denotes the first tuple in R1^{kw1}. Note that some foreign-key values (e.g., x3) of R1^{kw1} are absent from the primary key A1 of R^∅. This is reasonable because R^∅ may not contain all the tuples in the raw table R (recall that R^∅ is only a subset of R). Similar phenomena can be observed for the foreign-key values in other tables.

Figure 10 shows the result of completely evaluating CN. We refer to each tuple in the result as a join tuple, which is essentially a flat representation of an MTJNT in MTJNT(CN). Symbols λ1, λ2, ... are added for tuple pinpointing. For example, tuple λ1 is the join result of tuples α3, β1, γ1, and δ2. It is easy to verify that term w1 occurs 16 times, i.e., freq-CN(CN, w1) = 16. Similarly, freq-CN(CN, w2) = 13, freq-CN(CN, w3) = 4, and freq-CN(CN, w4) = 5. In the sequel, we present the star algorithm that obtains these frequencies without producing the results in Figure 10.

Volume. Let t be a tuple in any tuple-set of CN. We define its volume, denoted as vol(t), as the number of join tuples determined by t. For instance, as mentioned earlier, join tuple λ1 in Figure 10 is produced by tuple α3 in Figure 9b (together with β1, γ1, δ2). In fact, the volume vol(α3) is 2, because α3 also produces another join tuple λ2. To see more examples, vol(β3) = 0 (since β3 does not produce any join tuple), and vol(γ1) = vol(δ2) = 6 (since γ1 determines λ1, λ2, ..., λ6, and so does δ2).

The central idea underlying star is that, once we have obtained the volumes of the tuples in each tuple-set, we can precisely calculate the number of occurrences of any term. Let us consider, for example, term w1. This term appears twice in α3. As α3 yields vol(α3) = 2 join tuples, w1 also appears twice in each of those two join tuples, thus contributing 4 occurrences in total. Similarly, w1 also emerges once in γ1 (and once in δ2), and hence, has one occurrence in each of the vol(γ1) = 6 (or vol(δ2) = 6) join tuples determined by γ1 (or δ2). This leads to another 12 occurrences of w1, resulting in freq-CN(CN, w1) = 4 + 12 = 16.

Motivated by this, star executes in two steps. The first phase, called the volume step, computes the volumes of all tuples in each tuple-set of CN. Then, the second phase, the frequency step, calculates the frequency of each term.

      A1  A2  A3  R1^{kw1}.text     R2^{kw2}.text  R3^{kw3}.text  R∅.text
λ1    x2  y1  z1  kw1, w1, w1, w2   kw2, w4        kw3, w1        w1, w2
λ2    x2  y1  z1  kw1, w1, w1, w2   kw2, w2        kw3, w1        w1, w2
λ3    x2  y1  z1  kw1, w2           kw2, w4        kw3, w1        w1, w2
λ4    x2  y1  z1  kw1, w2           kw2, w2        kw3, w1        w1, w2
λ5    x2  y1  z1  kw1, w3           kw2, w4        kw3, w1        w1, w2
λ6    x2  y1  z1  kw1, w3           kw2, w2        kw3, w1        w1, w2
λ7    x4  y3  z2  kw1, w4           kw2, w3        kw3, w4        w3

Figure 10: Result of complete evaluation of CN

Volume Step. This phase is further divided into the leaf-stage and the root-stage. The leaf-stage scans each leaf tuple-set Ri^{Si} (1 ≤ i ≤ l) of CN once. The purpose is to prepare a num-array for the column Ai of Ri^{Si}. For every value v in this column³, num(v) equals the number of tuples in Ri^{Si} carrying v. For instance, for our running example in Figure 9, the leaf-stage outputs the num-arrays in Figure 11a. For example, num(x1) equals 2, because in R1^{kw1} two tuples have x1 as their A1-values. Notice that the num-array of each Ri^{Si} can be regarded as a compressed version of its column Ai. In particular, if a value v occurs many times in Ai, it is stored only once, but associated with its num-value.

The root-stage, on the other hand, reads the root tuple-set R^S once, and creates an abridged root tuple-set R^S_*, which has at most the same cardinality as R^S. Furthermore, this stage also obtains the volumes of all the tuples in every (root/leaf) tuple-set. At the beginning, the root-stage initiates a vol-array for each leaf tuple-set Ri^{Si} (1 ≤ i ≤ l). For every value v in the column Ai of Ri^{Si}, the array has an entry vol(v), which is initialized to 0. At the end of the root-stage, vol(v) will be identical to the volume of each tuple in Ri^{Si} whose Ai-value equals v. It suffices to keep only one volume for all tuples having the same Ai-value, as their volumes are equivalent.

Next, we process each tuple of the root tuple-set R^S in turn. Let t = (v1, v2, ..., vl, text) be the tuple being processed, where vi (1 ≤ i ≤ l) is its value on column Ai. Then, we check if t can be discarded. Specifically, as long as any vi (1 ≤ i ≤ l) does not exist in the num-array of leaf tuple-set Ri^{Si}, t does not produce any join result, and hence, can be safely eliminated. Otherwise (i.e., t cannot be discarded), we increase entry vol(vi) in the vol-array of Ri^{Si} (1 ≤ i ≤ l), using the data in the num-arrays of the other leaf tuple-sets. Formally, the update of vol(vi) is by:

    vol(vi) = vol(vi) + Π_{j ≠ i, 1 ≤ j ≤ l} num(vj).    (5)

We also calculate the volume of t as

    vol(t) = Π_{j=1}^{l} num(vj).    (6)

Finally, we write to the abridged root tuple-set R^S_* the tuple t, augmented with an additional field vol(t), and continue with the next tuple in R^S.

Let us demonstrate the root-stage with our running example.

³In case Ai is a set of attributes, v is a vector.


(a) The star-CN CN: root R^∅ with schema {A1, A2, A3, text}; leaves R1^{kw1} with schema {A1, text}, R2^{kw2} with schema {A2, text}, and R3^{kw3} with schema {A3, text}.

(b) R1^{kw1}:
      A1  text
α1    x1  kw1, ...
α2    x1  kw1, ...
α3    x2  kw1, w1, w1, w2
α4    x2  kw1, w2
α5    x2  kw1, w3
α6    x3  kw1, ...
α7    x4  kw1, w4

(c) R2^{kw2}:
      A2  text
β1    y1  kw2, w4
β2    y1  kw2, w2
β3    y2  kw2, ...
β4    y3  kw2, w3

(d) R3^{kw3}:
      A3  text
γ1    z1  kw3, w1
γ2    z2  kw3, w4
γ3    z3  kw3, ...

(e) R^∅:
      A1  A2  A3  text
δ1    x1  y4  z1  ...
δ2    x2  y1  z1  w1, w2
δ3    x4  y3  z2  w3

Figure 9: A running example (kw1, kw2, and kw3 are query keywords)

(a) The num-arrays of R1^{kw1}, R2^{kw2}, and R3^{kw3}:
      x1  x2  x3  x4     y1  y2  y3     z1  z2  z3
num   2   3   1   1      2   1   1      1   1   1

(b) The vol-arrays at the beginning of the root-stage:
      x1  x2  x3  x4     y1  y2  y3     z1  z2  z3
vol   0   0   0   0      0   0   0      0   0   0

(c) The vol-arrays after processing δ2:
      x1  x2  x3  x4     y1  y2  y3     z1  z2  z3
vol   0   2   0   0      3   0   0      6   0   0

(d) The vol-arrays after processing δ3:
      x1  x2  x3  x4     y1  y2  y3     z1  z2  z3
vol   0   2   0   1      3   0   1      6   1   0

(e) The abridged root tuple-set R^∅_*:
      A1  A2  A3  text    vol
δ2*   x2  y1  z1  w1, w2  6
δ3*   x4  y3  z2  w3      1

Figure 11: Illustration of the volume step

Recall that, in the leaf-stage, the num-arrays in Figure 11a have been calculated. Before scanning the root tuple-set R^∅, we initialize the vol-arrays of R1^{kw1}, R2^{kw2}, and R3^{kw3} as in Figure 11b. We proceed to process the first tuple δ1 of R^∅, and discard it immediately because its A2-value y4 does not exist in the num-array of R2^{kw2}, implying that δ1 does not produce any join tuple.

The next tuple processed is δ2, which cannot be discarded because its A1-, A2-, and A3-values x2, y1, z1 all appear in the num-arrays. Thus, we update three entries in the vol-arrays, i.e., vol(x2), vol(y1), and vol(z1), resulting in the new vol-arrays in Figure 11c. Note that the updates are according to Equation 5. For example, vol(x2) is obtained as num(y1) · num(z1) = 2 · 1 = 2. Next, the volume of δ2 is calculated with Equation 6, leading to vol(δ2) = num(x2) · num(y1) · num(z1) = 3 · 2 · 1 = 6. Finally, we append δ2 to the abridged root tuple-set R^∅_* along with its volume 6.

The last tuple δ3 of R^∅ is processed in the same manner, yielding vol(δ3) = 1, and the final vol-arrays in Figure 11d. Some entries in the vol-arrays are 0, indicating that their corresponding tuples do not produce any join tuples. For example, in R1^{kw1}, tuples with A1-value x1 do not generate any join tuple. The root-stage terminates. Figure 11e gives the content of the current R^∅_*. Notice that symbols δ2* and δ3* are introduced to enable tuple pinpointing. We formally summarize the entire volume-step in Figure 12.


Algorithm volume-step
/* Input: star-CN CN with root tuple-set R^S and l leaf tuple-sets R1^{S1}, ..., Rl^{Sl} */
1.  scan each leaf tuple-set to prepare its num-array
2.  initialize an all-zero vol-array for each leaf tuple-set
3.  initialize an empty abridged root tuple-set R^S_*
4.  while there are still un-processed tuples in R^S
5.    get the next un-processed tuple t = (v1, ..., vl, text)
      /* vi (1 ≤ i ≤ l) is the Ai-value of t */
6.    if all v1, ..., vl appear in the num-arrays
7.      for i = 1 to l
8.        vol(vi) = vol(vi) + Π_{j ≠ i} num(vj)
9.      vol(t) = Π_{j=1}^{l} num(vj)
10.     t* = everything in t together with vol(t)
11.     add t* to R^S_*

Figure 12: The volume step of algorithm star

Frequency Step. Once the tuple volumes have been computed, it is relatively easy to obtain the term frequencies. Towards this purpose, the frequency step performs one more scan on each leaf tuple-set and the abridged root tuple-set. Specifically, initially, freq-CN(CN, w) equals 0 for every w. Let t be a tuple in any of the tuple-sets mentioned earlier. When t is encountered, for each occurrence of a term w in t, we simply increase freq-CN(CN, w) by vol(t). It remains to clarify how to retrieve vol(t). If t is in the abridged root tuple-set R^S_*, vol(t) is directly fetched along with t. Otherwise, assume that t comes from a leaf tuple-set Ri^{Si} (for some i ∈ [1, l]). We only need to obtain the Ai-value v of t, and then set the volume of t to the entry vol(v) in the vol-array.

To illustrate, let us calculate freq-CN(CN, w1) in the example of Figure 9, from the vol-arrays (Figure 11d) and abridged root tuple-set R^∅_* (Figure 11e) returned by the volume step. At the beginning, freq-CN(CN, w1) = 0. As (i) w1 appears twice in α3 and once in γ1 and δ2* respectively, and (ii) vol(α3) = 2, vol(γ1) = 6, vol(δ2*) = 6, we have freq-CN(CN, w1) = 2 · 2 + 1 · 6 + 1 · 6 = 16. In particular, vol(α3) is retrieved from the entry vol(x2) in the vol-arrays, where x2 is the A1-value of α3. Likewise, vol(γ1) equals vol(z1), with z1 being the A3-value of γ1. Finally, vol(δ2*) is acquired directly from the abridged tuple-set R^∅_*.

Discussion. The star algorithm described above is highly efficient. Specifically, regardless of the number l of leaf tuple-sets, star requires reading each tuple-set of CN only twice, and writing a tuple-set R^S_* once that is no larger than the root tuple-set R^S. This is much faster than the full evaluation of CN (i.e., returning all the join tuples as in Figure 10), which demands more passes over the participating tuple-sets. As mentioned before, the efficiency of star arises from the fact that it focuses on calculating only tuple volumes. Indeed, tuple volumes capture less information than join tuples (note that the latter can produce the former but not vice versa), and hence, are cheaper to calculate.


Algorithm frequency-step
/* Input: the leaf tuple-sets R1^{S1}, ..., Rl^{Sl}, the abridged root tuple-set R^S_*, and the vol-arrays output by the volume-step */
1.  freq-CN(CN, w) = 0 for all terms w
2.  for each leaf tuple-set Ri^{Si} (1 ≤ i ≤ l)
3.    while there are still un-processed tuples in Ri^{Si}
4.      get the next un-processed tuple t = (v, text)
5.      for each occurrence of any term w in t
6.        freq-CN(CN, w) = freq-CN(CN, w) + vol(v)
7.  while there are still un-processed tuples in R^S_*
8.    get the next un-processed tuple t* = (v1, ..., vl, text, vol(t*))
9.    for each occurrence of any term w in t*
10.     freq-CN(CN, w) = freq-CN(CN, w) + vol(t*)

Figure 13: The frequency step of algorithm star

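Continuing the Python sketch above (reusing volume_step's outputs and the Figure 9 data), the frequency step of Figure 13 can be written as follows; on the running example it reproduces freq-CN(CN, w1) = 16.

    from collections import Counter

    def frequency_step(leaves, vols, abridged):
        """Aggregate term frequencies from tuple volumes (Figure 13)."""
        freq = Counter()
        for i, leaf in enumerate(leaves):            # lines 2-6
            for v, terms in leaf:
                for w in terms:                       # one increment per occurrence
                    freq[w] += vols[i][v]             # 0 for discarded values
        for *vs, terms, vol_t in abridged:            # lines 7-10
            for w in terms:
                freq[w] += vol_t
        return freq

    freq = frequency_step([leaf1, leaf2, leaf3], vols, abridged)
    print(freq["w1"])   # 16, as derived in the text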

4.3 Conversion to Star-CNs

This section deals with the following starization problem: given a non-star CN, transform it to a star-CN CN′ that returns the same set of MTJNTs, i.e., MTJNT(CN) = MTJNT(CN′). The goal is to minimize the total cost incurred in the transformation and in executing the star algorithm (presented in Section 4.2) on CN′.

A basic observation is that, if CN has s vertices, then it has s equivalent star-CNs, each of which has a different vertex as the root. Let us explain this with a concrete example. Consider the non-star CN in Figure 14a, corresponding to the schema graph in Figure 1 and a query keyword-set Q = {Tony, conf}. Figure 14b gives an equivalent star-CN CN′1, where WRITES^∅ is the root. Notice that conversion from CN to CN′1 requires a join between PAPER^∅ and CONF^{conf}, and the result of the join becomes a leaf tuple-set in CN′1. Similarly, Figure 14c shows another equivalent star-CN CN′2, which necessitates a join AUTHOR^{Tony} ⋈ WRITES^∅. Figures 14d and 14e present the other two equivalent star-CNs CN′3 and CN′4.

The quality of a star-CN CN′ depends on two types of cost: the overhead of (i) converting CN to CN′, and (ii) executing star on CN′. Hence, finding the optimal CN′ would be trivial if we were able to predict both costs accurately. While the overhead of (ii) may be easy to estimate (as mentioned earlier, star scans each participating tuple-set twice, and writes the abridged root tuple-set once), predicting the overhead of (i) is hard for several reasons. First, join selectivity estimation is known to be a tricky problem [5]. Although there exist solutions [1] specifically designed for foreign-joins, they cannot be applied in our case, because the joins here (although they look like foreign-joins) are not exactly so. Recall that in a traditional foreign-join, every foreign key will definitely be joined with a primary key. This property no longer holds in our scenario due to the keyword-screening process. For example, consider the join AUTHOR^{Tony} ⋈ WRITES^∅. Apparently, most foreign-key values in WRITES^∅ will not find their matching primary keys in AUTHOR^{Tony}, because AUTHOR^{Tony} consists of only tuples containing the keyword Tony. Second, selectivity estimation demands specialized structures such as sample sets [5], synopses [1], etc. Such structures cannot be pre-computed because the tuple-sets of CN are dynamically generated according to the query keyword-set Q. Constructing the structures on the fly entails large cost itself [1, 5].

(a) A non-star CN: AUTHOR^{Tony} → WRITES^∅ ← PAPER^∅ ← CONF^{conf}

(b) An equivalent star-CN CN′1: root WRITES^∅; leaves AUTHOR^{Tony} and (PAPER^∅ ⋈ CONF^{conf})

(c) Equivalent CN′2: root PAPER^∅; leaves (AUTHOR^{Tony} ⋈ WRITES^∅) and CONF^{conf}

(d) Equivalent CN′3: root AUTHOR^{Tony}; leaf (WRITES^∅ ⋈ PAPER^∅ ⋈ CONF^{conf})

(e) Equivalent CN′4: root CONF^{conf}; leaf (AUTHOR^{Tony} ⋈ WRITES^∅ ⋈ PAPER^∅)

Figure 14: Multiple starization choices

A good strategy in starization is to avoid joins that produce gigantic results. For example, the CN′1 in Figure 14b is a poor choice, because PAPER^∅ ⋈ CONF^{conf} essentially joins two sizable tuple-sets (in particular, CONF^{conf} is the table CONF itself; recall that every tuple in CONF implicitly includes the table name as a term). Not only does the join itself incur expensive cost, but it also creates a huge leaf tuple-set for CN′1, leading to large cost in the subsequent application of the star algorithm. The CN′2 in Figure 14c is a much better choice. In particular, AUTHOR^{Tony} is a very small tuple-set. As a result, the join AUTHOR^{Tony} ⋈ WRITES^∅ is fairly efficient, and produces only a small number of tuples.

Typically, a join is expensive if its participating tuple-sets have large cardinalities. There is a close connection between the size of a tuple-set and its type. We already know that a tuple-set R^S can be free or non-free. Here, we further divide non-free R^S into two types: R^S is strongly non-free, if S contains at least a keyword that is not the name of the raw table R; otherwise, R^S is weakly non-free. For example, AUTHOR^{Tony} is a strongly non-free tuple-set, whereas CONF^{conf} is weakly non-free. In general, weakly non-free and free tuple-sets are large, whereas a strongly non-free tuple-set is small⁴, because usually only a fraction of the raw table R includes all the keywords in S. Hence, we should avoid joins that involve no strongly non-free tuple-set at all, e.g., PAPER^∅ ⋈ CONF^{conf}. These joins are said to be bad.

Motivated by this, we perform starization by choosing the star-CN that requires the least number of bad joins. In case there are multiple such star-CNs, we select the one with the greatest number of leaf tuple-sets (in general, the larger this number, the fewer pairwise joins are needed). To illustrate, consider the CN in Figure 14a. Among the equivalent star-CNs in Figures 14b-14e, CN′1 and CN′3 require one bad join, whereas CN′2 and CN′4 demand no bad join at all. Now we need to make a choice between CN′2 and CN′4. As CN′2 has two leaves and CN′4 has only one, CN′2 requires fewer pairwise joins, and is the final output of starization.

⁴This observation was first made in [12].


Algorithm starization
/* Input: a non-star CN with s tuple-sets R1^{S1}, ..., Rs^{Ss} */
1.  initialize arrays bad-num and degree, each with s elements
2.  for i = 1 to s
3.    degree[i] = number of neighbors of Ri^{Si} in CN
4.    remove Ri^{Si} from the original CN, resulting in a set of connected components
5.    bad-num[i] = number of connected components that have at least two tuple-sets but no strongly non-free tuple-set
6.  rt = ∅; min-bad-num = ∞; rt-degree = 0
7.  for i = 1 to s
8.    if bad-num[i] < min-bad-num OR (bad-num[i] = min-bad-num AND degree[i] > rt-degree)
9.      rt = Ri^{Si}
10.     min-bad-num = bad-num[i]; rt-degree = degree[i]
11. return the star-CN with root rt

Figure 15: The algorithm of starization

It remains to clarify how to obtain the number of bad joins needed by a star-CN CN′. Assume that the root of CN′ is R^S. Let us examine the vertex R^S in the original CN. Removal of R^S breaks CN into several connected components. The number of bad joins equals the number of components that have (i) at least two tuple-sets but (ii) no strongly non-free tuple-set. This number can be found with a single traversal of all the components.

Consider the CN in Figure 14a and R^S = WRITES^∅. After deleting WRITES^∅, the CN is partitioned into two components AUTHOR^{Tony} and PAPER^∅ ← CONF^{conf}. The second component has two tuple-sets, neither of which is strongly non-free. Thus, we know that the star-CN rooted at WRITES^∅ necessitates one bad join. Figure 15 formally summarizes the starization algorithm.

5. EXTENSIONS

Our analysis so far assumes that every occurrence of a term w is counted equally in its frequency, regardless of the raw table where w appears. Sometimes we may want to treat the occurrences in various tables differently. For example, a user who wants to know more about the research of Tony may consider terms in PAPER more important than those in AUTHOR (in the schema graph of Figure 1). For this purpose, s/he may give a higher weight to PAPER and a lower one to AUTHOR, so that every appearance of a term counts more in PAPER than in AUTHOR.

Carrying the idea further, a more general method is to specify weights at the CN level. This is reasonable because a term from the same table may not necessarily have the same importance in different CNs. To explain, let us slightly modify the schema of Figure 1, by adding one more column comments to table WRITES (i.e., WRITES now has attributes A_id, P_id, comments). This new attribute records the comments of the author A_id on her/his paper P_id. Now consider a FCT query with keyword-set Q = {Tony, spatial, index}, and the following CNs:

    CN1: AUTHOR^{Tony} → WRITES^∅ ← PAPER^{spatial, index}
    CN2: AUTHOR^{Tony} → WRITES^{index} ← PAPER^{spatial}

Let w be a term in PAPER. Intuitively, an occurrence of w in (an MTJNT output by) CN1 is more important than one in CN2. This is because each MTJNT from CN1 corresponds to a paper specifically on spatial indexing, whereas an MTJNT from CN2 may be a paper on other spatial topics, but with a comment from Tony related to indexes.

Our FCT operator can be easily extended to incorporate weighting in the above scenarios. Actually, this is true both conceptually and algorithmically. In particular, conceptually, the only change necessary is the definition of the function count(T, w), which here returns the weighted number of occurrences of w in an MTJNT T. To elaborate the details, assume that CN is the candidate network that generates T, and suppose that CN has s tuple-sets R1^{S1}, ..., Rs^{Ss}, which, by the weighting rules in the underlying application, bear weights wght1, ..., wghts, respectively. Then, count(T, w) should be implemented as follows. First, we initialize a counter to 0. Then, for every occurrence of w in T, we obtain the tuple-set, say Ri^{Si}, contributing the occurrence, and increase our counter by wghti.

Accordingly, to support weighted FCT search, small changes are needed in the algorithms starization and star proposed in Section 4. Recall that, given a non-star candidate network CN, starization performs some preliminary joins to transform CN into a star-counterpart CN′. Each join produces a leaf tuple-set in CN′. To tackle a weighted FCT query, terms in the join result should be accompanied by the weights of their origin tuple-sets. For example, let CN be as shown in Figure 14a. Converting it to the CN′2 in Figure 14c needs a join AUTHOR^{Tony} ⋈ WRITES^∅. Then, for every occurrence of a term w in the join result, we associate it with the weight of AUTHOR^{Tony} (or WRITES^∅), if it comes from AUTHOR^{Tony} (or WRITES^∅).

Given a star-CN CN, on the other hand, star computes the total weighted occurrences freq-CN(CN, w) of each term w in the MTJNTs determined by CN. The only modification of star is in its frequency step, which computes freq-CN(CN, w) from the tuple volumes, by scanning each leaf tuple-set and the abridged root tuple-set once. Specifically, after fetching a tuple t, for every term w in t, we increase freq-CN(CN, w) by vol(t) · weight(t), where vol(t) is the volume of t, and weight(t) is the weight of the origin tuple-set of t.

Finally, it is worth mentioning that, since a FCT query concentrates on mining concepts, its effectiveness can be boosted when there is a concept hierarchy. This hierarchy captures the belonging-to relationships among terms; for instance, nearest-neighbor belongs to spatial. As a result, whenever nearest-neighbor is encountered in an MTJNT, we should increase the frequencies of both nearest-neighbor and spatial. This strategy makes it easier for FCT queries to discover related concepts at the high levels.
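A minimal sketch of this hierarchy-aware counting, assuming the hierarchy is given as a child-to-parent map (our own representation; roots are simply absent from the map):

    def credit_with_hierarchy(freq, term, amount, parent):
        """Increase the frequency of term and of all its ancestors in the
        concept hierarchy (child -> parent map)."""
        while term is not None:
            freq[term] = freq.get(term, 0) + amount
            term = parent.get(term)

    freq = {}
    credit_with_hierarchy(freq, "nearest-neighbor", 1,
                          {"nearest-neighbor": "spatial"})
    print(freq)   # {'nearest-neighbor': 1, 'spatial': 1}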

6. EXPERIMENTS

This section aims at achieving two objectives. First, in Section 6.1, we will demonstrate the usefulness of FCT search, i.e., it enables us to extract interesting information from real databases conveniently. Then, in Section 6.2, we will verify the efficiency of our FCT algorithm.

6.1 Effectiveness of FCT Search

We use a real database IMDB [17] that collects the cast, director, and genre information of over 800k movies and TV programs. Figure 16 presents the schema graph of IMDB, where the table names and columns are self-explanatory. The primary key of each table is underlined. Note that a movie may be classified in multiple genres, and thus may have several records in GENRE. Furthermore, it is also possible that a movie does not belong to any genre, and hence has no tuple in GENRE. The entire title of a movie, and the full name of an actor, actress, or director, are treated as a single term. This is reasonable because a word appearing in, for example, the titles of two movies does not really bear any obvious meaning. Table 1 shows the cardinalities of the tables. In total, IMDB occupies 285 megabytes.

ACTOR(Actor_id, Actor_name)
ACTORPLAY(Actor_id, M_id)
MOVIE(M_id, M_title)
ACTRESSPLAY(Actrs_id, M_id)
ACTRESS(Actrs_id, Actrs_name)
DIRECT(D_id, M_id)
DIRECTOR(D_id, D_name)
GENRE(M_id, category)

Figure 16: The schema graph of IMDB

To demonstrate the effectiveness of FCT retrieval, we will give the results of several representative queries, and explain why they are reasonable. Recall that a FCT query has two explicit parameters: a set Q of keywords, and the number k of terms requested. Furthermore, a FCT query also implicitly inherits another parameter from traditional keyword search: Rmax, which specifies the largest size of a CN, in terms of the number of participating tuple-sets. In the sequel, we fix k to 10 and Rmax to 6.

First, we retrieve the most prolific comedy directors with

Q1 = {comedy, director}

yielding

Al Christie (365), Mack Sennett (303), Jules White (297), Friz Freleng (285), Allen Curtis (278), Chuck Jones (259), Dave Fleischer (255), Bud Fisher (252), William Beaudine (249), William Watson (219)

The number after each name corresponds to its frequency. Note that this result is obtained after removing the stop-words and the obvious noisy terms such as the table names MOVIE and DIRECT. All the directors in the result are highly successful directors in history. For example, Al Christie (1881-1951), a star on the Hollywood Walk of Fame, directed over 200 motion pictures, and is particularly well-known for his short comedies (en.wikipedia.org/wiki/Al_Christie).

A fan of Tom Hanks may be curious about which director Tom is most often mentioned with. This can be extracted by:

Q2 = {Tom Hanks, director}

with result

Louis Horvitz (9), Jeff Margolis (8), Dave Wilson (6), Beth Mccarthy Miller (5), Ron Howard (4), Laurent Bouzereau (4), David Frankel (4), Robert Zemeckis (3), David Leland (3), Penny Marshall (3)

Louis Horvitz, for instance, is indeed a director that has a close relationship with Tom. In 2002, Louis actually directed a TV program called AFI Life Achievement Award: A Tribute to Tom Hanks. To acquire the genres of the motion productions involving Tom Hanks, we perform

Table        Cardinality
ACTOR        741449
ACTORPLAY    4244600
ACTRESS      445020
ACTRESSPLAY  2262149
MOVIE        833512
DIRECTOR     121928
DIRECT       561173
GENRE        629195

Table 1: Table cardinalities of IMDB

Q3 = {Tom Hanks, genres}

returning

comedy (44), drama (34), short (20), family (17), romance (10), thriller (9), crime (8), music (8), fantasy (7), war (7)

Interestingly, while Tom Hanks is perhaps best known for his dramas, he actually took part in many comedies as well (a recent one: The Terminal). Let us repeat the above two queries but with respect to Jim Carrey. Specifically,

Q4 = {Jim Carrey, director}

gives

Louis Horvitz (7), Bruce Gowers (4), Michel Gondry (3), Jeffrey Schwarz (3), Tom Shadyac (3), Bobby Farrelly (2), Peter Farrelly (2), Beth Mccarthy Miller (2), Troy Miller (2), Joel Schumacher (2)

The results of Q2 and Q4 indicate that Louis Horvitz works closely with both Tom Hanks and Jim Carrey.

Q5 = {Jim Carrey, genres}

returns

comedy (40), short (14), drama (13), family (11), action (7), fantasy (7), music (7), adventure (6), crime (6), romance (6)

Jim Carrey is mentioned particularly in only one genre: comedy, whose frequency 40 is much higher than the others. As shown in Q3, Tom Hanks seems to be more versatile, being heavily mentioned in both comedy and drama.

Finally, we show how to leverage FCT queries to discover the actors and actresses closely related to a director. For this purpose, we choose director Jules White (1900-1985), in the result of Q1, as a representative. The next query

Q6 = {Jules White, actor, comedy}

discovers

Moe Howard (108), Larry Fine (108), Vernon Dent (70), Shemp Howard (67), Al Thompson (49), Emil Sitka (46), Joe Palma (45), John Tyrrell (39), Johnny Kascier (37), Curly Howard (36)

The result is fairly reasonable. For example, Jules White is best known (en.wikipedia.org/wiki/Jules_White) for his short-subject comedies starring the “Three Stooges” – Moe Howard, Larry Fine, and Curly Howard – all of whom are included in the result. As for actresses, we run

Q7 = {Jules White, actress, comedy}

and obtain



                 Q1  Q2  Q3  Q4  Q5  Q6  Q7
non-empty count   8   9   2   1   4  33  12

Table 2: CN statistics

Christine Mcintyre (38), Symona Boniface (23), Dorothy Appleby (21), Judy Malcolm (19), Nanette Bordeaux (15), Jean Willes (14), Barbara Jo Allen (14), Barbara Bartay (13), Margie Liszt (11), Harriette Tarler (11)

Again, these actresses are indeed highly relevant to Jules. For example, Christine Mcintyre stars, along with the Three Stooges mentioned earlier, in numerous 1950s comedies by Jules, including Punchy Cow Punchers, Hugs and Mugs, Love at First Bite, etc. (www.threestooges.com).

Summary. The effectiveness of FCT search is reflected in two aspects. First, it is able to discover concepts that are highly related to the set of query keywords. Furthermore, the frequencies of those concepts generally indicate their importance. Second, a FCT query is easy to formulate. Specifically, as shown earlier, the keywords of all the queries Q1-Q7 are simple and intuitive. They can be provided even by non-database experts.

6.2 Efficiency of FCT Search

This section evaluates the efficiency of the proposed algorithm, referred to as star-FCT. As discussed in Section 4, star-FCT involves two components: (i) star (Figures 12 and 13), for aggregating term frequencies from a star-CN, and (ii) starization (Figure 15), for converting a non-star CN to a star counterpart. We compare our solution against the baseline approach. As mentioned in Section 4.1, given a FCT query with a keyword set Q and an integer k, baseline first solves a KS query with the same Q, computes the frequencies of all the terms in the result, and then outputs the k most frequent terms.
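Conceptually, baseline is nothing more than a top-k frequency count over the full KS result. A minimal sketch, under the assumption that the KS result is available as a list of MTJNTs, each reduced to a bag of terms with stop-words, noisy terms, and the query keywords already removed:

    from collections import Counter

    def baseline_fct(ks_result, k):
        """Top-k most frequent terms over the entire KS result (baseline)."""
        counts = Counter()
        for mtjnt in ks_result:  # one update per joined result tuple
            counts.update(mtjnt)
        return counts.most_common(k)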

All the results in the sequel are obtained on a computer running a Pentium IV dual-core CPU at 2.13GHz. To be fair to baseline, we minimize its cost by implementing a highly efficient join engine. In particular, our implementation incorporates the expression-sharing heuristic proposed in [12]. That is, after being computed, the result of a join is preserved, and re-used directly if the same join needs to be executed later. We allocate an equal amount of memory, 6 megabytes, to both star-FCT and baseline. This memory buffer is significantly smaller than the size (over 280 megabytes) of IMDB. It is worth noting that an efficient algorithm must be able to work with a small amount of memory, because in practice the system may have to deal with numerous queries concurrently.
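The expression-sharing heuristic amounts to memoizing join results. The following sketch conveys the idea only; the cache key and the naive nested-loop join are illustrative assumptions, not the actual engine of [12]:

    join_cache = {}

    def shared_join(left, right, key):
        """Re-use a previously computed join instead of recomputing it."""
        cache_key = (id(left), id(right), key)  # illustrative cache key
        if cache_key not in join_cache:
            join_cache[cache_key] = [
                (l, r) for l in left for r in right if l[key] == r[key]
            ]  # naive nested-loop equi-join over dict-shaped tuples
        return join_cache[cache_key]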

We will demonstrate the performance of the two algorithms on the queries Q1-Q7 analyzed in Section 6.1. Recall that, for each query, both star-FCT and baseline need to first enumerate all the CNs that have a chance to produce MTJNTs, in the way explained in Section 3. Many CNs are empty, i.e., they generate no MTJNTs at all. The number of non-empty CNs is an important indicator of the query overhead. In general, more non-empty CNs lead to higher query cost. Therefore, in Table 2, we list the number of non-empty CNs for each query. Note that these numbers are identical for star-FCT and baseline.

Figure 17 presents the elapsed time of star-FCT and baseline. We break the performance of star-FCT into the overhead of star and starization, respectively. Above each column of star-FCT, we place a value denoting the percentage that starization accounts for in the overall execution time. For example, for Q1, 36% of the star-FCT cost is due to starization. Evidently, star-FCT consistently outperforms baseline, achieving a maximum speedup ratio of 4 at Q1. As explained in Section 4.2, the superiority of star-FCT stems from the fact that it acquires the term frequencies without computing the join results of a star-CN at all.

[Figure 17: Efficiency comparison – elapsed time (sec, 0-1200) of star-FCT and baseline on Q1-Q7, with star-FCT broken into its star and starization components and the starization percentage (ranging from 31% to 77%) shown above each star-FCT column]

[Figure 18: Memory consumption comparison – peak memory consumption (megabytes, 0-7) of star and baseline on Q1-Q7]

For most queries, starization entails only a fraction of the total cost of star-FCT, which confirms the effectiveness of our heuristics in Section 4.3. As expected, there is a strong correlation between the query cost (of both algorithms) and the number of non-empty CNs. For example, Q6 and Q7 are the two most expensive queries, because they have the most non-empty CNs.

Our implementation of the star algorithm keeps the num- and vol-arrays memory resident (see Section 4.2). Next, we show that this is a reasonable choice, because these arrays are so small that they easily fit in memory. For this purpose, we measure the largest memory consumption of star during its execution, and compare it with baseline. As mentioned earlier, the limit of memory usage is 6 megabytes for each algorithm.

Figure 18 presents the results. Baseline always uses up all the available memory, because its join engine automatically makes full use of memory to reduce the join overhead. The consumption of star, on the other hand, varies across queries. This is not surprising, because different queries demand num- and vol-arrays of different sizes, depending on the characteristics of the CNs generated. In all queries, star requires no more than 5 megabytes of memory; even in the worst case (Q5), it takes up less than 6 megabytes. Finally, note that the above results apply to star. As with baseline, the other component of star-FCT, starization, also utilizes all the vacant memory to minimize the cost of joins.

Summary. The proposed star-FCT algorithm is able to solve FCT queries efficiently. In most cases, the cost of star-FCT is significantly dominated by its star component, thus justifying the sophisticated heuristics in star. Furthermore, star-FCT requires only a small amount of memory, even when the underlying database is larger than 280 megabytes.

7. CONCLUSIONS

This paper proposes a novel operator called frequent co-occurring term (FCT) search. Given a set Q of keywords and an integer k, a FCT query returns the k terms that appear most frequently in the result of a traditional KS (keyword search) query. Unlike KS, which produces joined tuples containing all the keywords in Q, FCT search aims at extracting the terms that most accurately characterize Q. We devise a new algorithm that efficiently solves a FCT query without resorting to conventional KS methods. As experimentally evaluated with real data, (i) FCT search can indeed discover highly intuitive observations that cannot be found via ordinary KS queries; and (ii) our FCT algorithm is fairly efficient, and requires little memory.

Our study also points to several promising topics for future research. As shown in Figure 18, the star algorithm typically does not consume all the memory available. Thus, an interesting direction is to investigate the possibility of utilizing the remaining memory to further lower the execution cost. Furthermore, we have considered only static data. Maintenance of FCT results over a continuous data stream demands alternative strategies, which remain to be explored. Finally, our discussion has focused exclusively on relational databases. Extending FCT queries to other types of data, such as XML documents and spatial entities, deserves careful consideration.

Acknowledgements

This work was partially supported by CERG grants CUHK 1202/06, 4161/07, 4173/08, and 4182/06 from HKRGC.

REFERENCES

[1] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In Proc. of ACM Management of Data (SIGMOD), pages 275-286, 1999.

[2] S. Amer-Yahia, E. Curtmola, and A. Deutsch. Flexible and efficient XML search with complex full-text predicates. In Proc. of ACM Management of Data (SIGMOD), pages 575-586, 2006.

[3] A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: Authority-based keyword search in databases. In Proc. of Very Large Data Bases (VLDB), pages 564-575, 2004.

[4] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In Proc. of International Conference on Data Engineering (ICDE), pages 431-440, 2002.

[5] S. Chaudhuri, R. Motwani, and V. R. Narasayya. On random sampling over joins. In Proc. of ACM Management of Data (SIGMOD), pages 263-274, 1999.

[6] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding top-k min-cost connected trees in databases. In Proc. of International Conference on Data Engineering (ICDE), pages 836-845, 2007.

[7] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proc. of ACM Symposium on Principles of Database Systems (PODS), 2001.

[8] I. D. Felipe, V. Hristidis, and N. Rishe. Keyword search on spatial databases. In Proc. of International Conference on Data Engineering (ICDE), 2008.

[9] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked keyword search over XML documents. In Proc. of ACM Management of Data (SIGMOD), pages 16-27, 2003.

[10] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: Ranked keyword searches on graphs. In Proc. of ACM Management of Data (SIGMOD), pages 305-316, 2007.

[11] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-style keyword search over relational databases. In Proc. of Very Large Data Bases (VLDB), pages 850-861, 2003.

[12] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In Proc. of Very Large Data Bases (VLDB), pages 670-681, 2002.

[13] V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML graphs. In Proc. of International Conference on Data Engineering (ICDE), pages 367-378, 2003.

[14] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In Proc. of Very Large Data Bases (VLDB), pages 505-516, 2005.

[15] B. Kimelfeld and Y. Sagiv. Finding and approximating top-k answers in keyword proximity search. In Proc. of ACM Symposium on Principles of Database Systems (PODS), pages 173-182, 2006.

[16] F. Liu, C. Yu, W. Meng, and A. Chowdhury. Effective keyword search in relational databases. In Proc. of ACM Management of Data (SIGMOD), pages 563-574, 2006.

[17] Y. Luo, X. Lin, W. Wang, and X. Zhou. SPARK: Top-k keyword query in relational databases. In Proc. of ACM Management of Data (SIGMOD), pages 115-126, 2007.

[18] A. Markowetz, Y. Yang, and D. Papadias. Keyword search on relational data streams. In Proc. of ACM Management of Data (SIGMOD), pages 605-616, 2007.

[19] M. Sayyadian, H. LeKhac, A. Doan, and L. Gravano. Efficient keyword search across heterogeneous relational databases. In Proc. of International Conference on Data Engineering (ICDE), pages 346-355, 2007.

[20] P. Wu, Y. Sismanis, and B. Reinwald. Towards keyword-driven analytical processing. In Proc. of ACM Management of Data (SIGMOD), pages 617-628, 2007.

[21] Y. Xu and Y. Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In Proc. of ACM Management of Data (SIGMOD), pages 527-538, 2005.

[22] B. Yu, G. Li, K. Sollins, and A. Tung. Effective keyword-based selection of relational databases. In Proc. of ACM Management of Data (SIGMOD), pages 139-150, 2007.
