
Phrase Query Optimization on Inverted Indexes

Avishek Anand
Ida Mele
Srikanta Bedathur
Klaus Berberich

MPI–I–2014–5–002


Authors’ Addresses

Avishek Anand
L3S Research Center
Hannover, Germany

Ida Mele
Max-Planck-Institut für Informatik
Saarbrücken, Germany

Srikanta Bedathur
IIIT Delhi
New Delhi, India

Klaus Berberich
Max-Planck-Institut für Informatik
Saarbrücken, Germany


Abstract

Phrase queries are a key functionality of modern search engines. Beyond that, they increasingly serve as an important building block for applications such as entity-oriented search, text analytics, and plagiarism detection. Processing phrase queries is costly, though, since positional information has to be kept in the index and all words, including stopwords, need to be considered.

We consider an augmented inverted index that indexes selected variable-length multi-word sequences in addition to single words. We study how arbitrary phrase queries can be processed efficiently on such an augmented inverted index. We show that the underlying optimization problem is NP-hard in the general case and describe an exact exponential algorithm and an approximation algorithm for its solution. Experiments on ClueWeb09 and The New York Times with different real-world query workloads examine the practical performance of our methods.

Keywords

Phrase Queries, Query Optimization


Contents

1 Introduction

2 Model

3 Indexing Framework

4 Query Optimization
  4.1 Optimal Solution
  4.2 Approximation Algorithm

5 Experimental Evaluation
  5.1 Datasets
  5.2 Experimental Results

6 Related Work

7 Conclusion


1 Introduction

As the scale of document collections such as the Web grows, more effective tools to search and explore them are needed. Phrase queries are one such tool. Typically expressed by enclosing a multi-word sequence in quotation marks (e.g., “president of the united states”), they enforce that the search engine returns only documents that literally contain the quoted phrase. Our focus in this work is on supporting phrase queries more efficiently.

Phrase queries are supported by all modern search engines and are one of the advanced features most popular with human users, accounting for up to 5% of query volume [18]. Even when unknown to the user, phrase queries can still be implicitly invoked, for instance, by means of query-segmentation methods [10, 13]. Beyond their use in search engines, phrase queries increasingly serve as a building block for other applications such as (a) entity-oriented search and analytics [5] (e.g., to identify documents that refer to a specific entity using one of its known labels), (b) plagiarism detection [19] (e.g., to identify documents that contain a highly discriminative fragment from the suspicious document), and (c) culturomics [16] (e.g., to identify documents that contain a specific n-gram and compute a frequency time-series from their timestamps).

Positional information about where words occur in documents has to be maintained to support phrase queries, which leads to indexes that are larger (e.g., [7] report a factor of about 4× for the inverted index) than those required for keyword queries. Also, all words need to be considered in the case of phrase queries, as opposed to keyword queries for which stopwords can be ignored. Consequently, phrase queries are substantially more expensive to process, since more data has to be obtained from the index.

The problem of substring matching, which is at the core of phrase queries, has been studied extensively by the String Processing community. However, the solutions developed (e.g., suffix arrays [15] and permuterm indexes [9]) are designed for main memory and cannot cope with large-scale document collections. Solutions developed by the Information Retrieval community [20, 22] build on the inverted index, extending it to index selected multi-word sequences, so-called phrases, in addition to single words.

In this work, we follow the general approach of augmenting the inverted index with selected multi-word phrases. Given such an index, we focus on the problem of phrase-query optimization, that is, determining for a given phrase query an optimal set of terms to process it. Consider, as a concrete example, the phrase query “we are the champions” and assume that all bigrams have been indexed alongside single words. Here, the space of possible solutions includes, among others, {we, are, the, champions} and {we are, are the, champions}. Identifying a cost-minimal set of indexed terms is the problem addressed in this work. Existing work [22] has addressed this problem only heuristically. In contrast, we study its hardness and devise an exact exponential algorithm and an approximation algorithm for its solution.

Contributions that we make in this work thus include:

• we study the problem of phrase-query optimization, establish its NP-hardness, and describe an exact exponential algorithm as well as an O(log n)-approximation algorithm for its solution;

• we perform an experimental evaluation on ClueWeb09 and a corpus from The New York Times, as two real-world document collections, with different workloads representing the kinds of tasks where phrase queries are applicable, comparing our approach against a state-of-the-art competitor and establishing its efficiency and effectiveness.

Organization. The rest of this paper is organized as follows. Chapter 2 introduces our formal model. The augmented inverted index is described in Chapter 3. Chapter 4 deals with optimizing phrase queries. Chapter 5 describes our experimental evaluation. We relate our work to existing prior work in Chapter 6 and conclude in Chapter 7.


2 Model

We let V denote the vocabulary of all words. The set of all non-empty sequences of words from this vocabulary is denoted V+. Given a word sequence

s = 〈 s1, . . . , sn 〉 ∈ V+

we let | s | = n denote its length. We use s[i] to refer to the word si at the i-th position of s, and s[i..j] (i ≤ j) to refer to the word subsequence 〈 si, . . . , sj 〉.

Given two word sequences r and s, we let pos(r, s) denote the set of positions at which r occurs in s, formally

pos(r, s) = { 1 ≤ i ≤ |s| | ∀ 1 ≤ j ≤ |r| : s[i + j − 1] = r[j] } .

For r = 〈 ab 〉 and s = 〈 cabcab 〉, as a concrete example, we have pos(r, s) = { 2, 5 }. We say that s contains r if pos(r, s) ≠ ∅. To ease notation, we treat single words from V also as word sequences when convenient. This allows us, for instance, to write pos(w, s) to refer to the positions where w occurs in s.
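
For illustration, the definition transcribes directly into Python (a sketch of ours, not part of the report; word sequences are lists of words, and positions are 1-based as in the notation above):

def pos(r, s):
    # Set of (1-based) positions at which word sequence r occurs in s.
    return {i for i in range(1, len(s) - len(r) + 2)
            if all(s[i + j - 2] == r[j - 1] for j in range(1, len(r) + 1))}

# Example from the text: pos(<ab>, <cabcab>) = {2, 5}
assert pos(["a", "b"], ["c", "a", "b", "c", "a", "b"]) == {2, 5}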

We let the bag of word sequences C denote our document collection. Each document d ∈ C is a word sequence from V+. Likewise, the bag of word sequences Q denotes our workload. Each query q ∈ Q is a word sequence from V+.

Using our notation, we now define the notion of document frequency as common in Information Retrieval. Let S be a bag of word sequences (e.g., the document collection or the workload); we define the document frequency of a word sequence r as the number of sequences from S containing it:

df(r, S) = | { s ∈ S | pos(r, s) ≠ ∅ } | .


3 Indexing Framework

We build on the inverted index, the most widely used index structure in Information Retrieval, which forms the backbone of many real-world systems. The inverted index consists of two components, namely, a dictionary D of terms and the corresponding posting lists that record, for each term, information about its occurrences in the document collection. For a detailed discussion of the inverted index, we refer to [23].

To support arbitrary phrase queries, an inverted index has to contain all words from the vocabulary in its dictionary (i.e., V ⊆ D) and record positional information in its posting lists. Thus, the posting ( d13, 〈 3, 7 〉 ) found in the posting list for word w conveys that the word occurs at positions 3 and 7 in document d13. More formally, using our notation, a posting ( id(d), pos(w, d) ) for word w and document d contains the document’s unique identifier id(d) and the positions pos(w, d) at which the word occurs.

Query-processing performance for phrase queries on such a positional inverted index tends to be limited, in particular for phrase queries that contain frequent words (e.g., stopwords). Posting lists for frequent terms are long, containing many postings, each with many positions, rendering them expensive to read, decompress, and process.

Several authors [8, 20, 22] have proposed, as a remedy, to augment the inverted index by adding multi-word sequences, so-called phrases, to the set of terms. The dictionary D of such an augmented inverted index thus consists of individual words alongside phrases (i.e., D ⊆ V+) as terms.

To process a given phrase query q, a set of terms is selected from the dictionary D, and the corresponding posting lists are intersected to identify documents that contain the phrase. Intersecting posting lists can be done using term-at-a-time (TaaT) or document-at-a-time (DaaT) query processing. For the former, posting lists are read one after another, and bookkeeping is done to keep track of positions at which the phrase can still occur in candidate documents. For the latter, posting lists are read in parallel, and a document, when seen in all posting lists at once, is examined as to whether it contains the query phrase. In both cases, the cost of processing a phrase query depends on the sizes of the posting lists read and thus on the set of terms selected to process the query. Existing work [22] has addressed this query-optimization problem, determining the set of terms that should be used to process a given phrase query, using greedy heuristics. Our approach to the problem, including its formalization and a study of its hardness, is described in Chapter 4.
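
To illustrate the verification step, here is a minimal sketch (ours, not the report’s implementation): once a candidate document appears in all selected posting lists, the phrase occurs in it iff some anchor position is consistent with every term’s offset in the query.

def contains_phrase(plan_positions):
    # plan_positions: one (offset, positions) pair per selected term, where
    # offset is the term's 0-based start within the query and positions are
    # the positions at which the term occurs in the candidate document.
    (off0, pos0), rest = plan_positions[0], plan_positions[1:]
    position_sets = [(off, set(ps)) for off, ps in rest]
    for p in pos0:
        start = p - off0  # would-be start position of the whole phrase
        if all(start + off in ps for off, ps in position_sets):
            return True
    return False

# "we are the champions" via "we are" (offset 0) and "the champions"
# (offset 2), with made-up positions: 7 + 2 == 9, so the phrase occurs.
print(contains_phrase([(0, [7, 42]), (2, [9])]))  # True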


4 Query Optimization

We now describe how a phrase query q can be processed using a given augmented inverted index with a concrete dictionary D. Our objective is thus to determine, at query-processing time, a subset P ⊆ D of terms, further referred to as a query plan, that can be used to process q.

To formulate the problem, we first need to capture when a query plan P can be used to process a phrase query q. Intuitively, each word must be covered by at least one term from P. We capture whether P covers q using the predicate

covers(P, q) = ∀ 1 ≤ i ≤ |q| : ∃ t ∈ P : ∃ j ∈ pos(q[i], t) : ∀ 1 ≤ k ≤ |t| : q[i − j + k] = t[k] .

The phrase query q = 〈 abc 〉, as a concrete example, can thus be processed using { 〈 ab 〉, 〈 bc 〉 } but not { 〈 ab 〉, 〈 cd 〉 }.
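
The predicate transcribes into Python as follows (a sketch of ours, using 0-based lists of words): a position is covered iff some full occurrence of some plan term spans it.

def covers(plan, q):
    # True iff every position of q lies within a full occurrence of a plan term.
    covered = [False] * len(q)
    for t in plan:
        for i in range(len(q) - len(t) + 1):
            if q[i:i + len(t)] == t:           # full occurrence of t at i
                for k in range(i, i + len(t)):
                    covered[k] = True
    return all(covered)

q = ["a", "b", "c"]
print(covers([["a", "b"], ["b", "c"]], q))  # True
print(covers([["a", "b"], ["c", "d"]], q))  # False: "c" stays uncovered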

Second, we need to quantify the cost of processing a phrase query using a specific query plan. As detailed above in Chapter 3, different ways of processing a phrase query (i.e., TaaT vs. DaaT) exist. In the worst case, regardless of which query-processing method is employed, all posting lists have to be read in their entirety. We model the cost of a query plan P as the total number of postings that have to be read:

c(P) = ∑_{t ∈ P} df(t, C) .

While the sizes of positional postings are not uniform (e.g., due to varying numbers of contained positions), suggesting collection frequency as a possibly more accurate cost measure, we found little difference in practice. The sum of document frequencies closely correlates with the response times of our system. This is in line with [14], who found that aggregate posting-list length is the single feature most correlated with response time for full query evaluation, as required for phrase queries, which do not permit dynamic pruning.


Assembling the above definitions of coverage and cost, we now formally define the problem of finding a cost-minimal query plan P for a phrase query q and dictionary D as the following NP-hard optimization problem.

Definition 1 Phrase-Query Optimization

arg min_{P ⊆ D} c(P) s.t. covers(P, q) .

Theorem 1 Phrase-Query Optimization is NP-hard.

Proof 1 (of Theorem 1) Our proof closely follows Néraud [17], who establishes that deciding whether a given set of strings is elementary is NP-complete. We show through a reduction from Vertex Cover that the decision variant of Phrase-Query Optimization (i.e., whether a P with c(P) ≤ τ exists) is NP-complete.

Let G(V, E) be an undirected graph with vertices V and edges E. We assume that there are no isolated vertices, that is, each vertex has at least one incident edge. Vertex Cover asks whether there exists a subset of vertices VC ⊆ V having cardinality |VC| ≤ k, so that ∀ (u, v) ∈ E : u ∈ VC ∨ v ∈ VC, that is, for each edge one of its connected vertices is in the vertex cover. We obtain an instance of Phrase-Query Optimization from G(V, E) as follows:

• V = V ∪ E – we introduce a word to the vocabulary for each vertex (v) and each edge (uv) in the graph;

• q = ⊎_{(u,v) ∈ E} u uv v – we obtain the query q as a concatenation of all edge words uv, each bracketed by the words of its source (u) and target (v);

• C = { q } – the document collection contains only a single document that equals our query;

• D = V ∪ ⋃_{(u,v) ∈ E} { 〈 u uv 〉 } ∪ ⋃_{(u,v) ∈ E} { 〈 uv v 〉 } – each vertex word (v) and edge word (uv) is indexed, as well as each combination of source and edge (u uv) and of edge and target (uv v).

This can be done in polynomial time and space. Note also that, by construction, ∀ t ∈ D : df(t, C) = 1 holds. We now show that G(V, E) has a vertex cover with cardinality |VC| ≤ k iff there is a query plan P with c(P) ≤ k + |E|.
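
To make the construction concrete, it can be written down in a few lines of Python (a sketch of ours; the edge word for (u, v) is spelled "uv" by concatenation):

def reduction(vertices, edges):
    # Build the Phrase-Query Optimization instance for a Vertex Cover
    # instance G(V, E); the collection is {q}, so df(t, C) = 1 throughout.
    q = []
    for u, v in edges:
        q += [u, u + v, v]               # ... u uv v ... per edge
    D = [[w] for w in vertices]          # vertex words
    D += [[u + v] for u, v in edges]     # edge words
    D += [[u, u + v] for u, v in edges]  # source-edge combinations
    D += [[u + v, v] for u, v in edges]  # edge-target combinations
    return q, D

# Triangle graph; {a, b} is a vertex cover of size k = 2.
q, D = reduction(["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")])
print(q)  # ['a', 'ab', 'b', 'b', 'bc', 'c', 'a', 'ac', 'c']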


(⇒) Observe that VC contains at least one of u or v for each u uv v from the query, which incurs a cost of |VC| ≤ k. We can complement this to obtain a query plan P by adding exactly one term (〈 uv 〉, 〈 u uv 〉, or 〈 uv v 〉) for each u uv v from the query, incurring a cost of |E|. Thus, c(P) ≤ k + |E| holds.

(⇐) Observe that P must cover each u uv v from the query in one of four ways: (i) 〈 u 〉〈 uv 〉〈 v 〉, (ii) 〈 u 〉〈 uv v 〉, (iii) 〈 u uv 〉〈 v 〉, (iv) 〈 u uv 〉〈 uv v 〉. Whenever a u uv v from the query is covered as 〈 u uv 〉〈 uv v 〉, we replace 〈 u uv 〉 by 〈 u 〉, thus reducing case (iv) to case (ii). We refer to the query plan thus obtained as P′. Note that c(P′) ≤ c(P), since all terms have the same cost. P′ contains exactly one term (〈 uv 〉, 〈 u uv 〉, or 〈 uv v 〉) for each u uv v from the query, incurring a cost of |E|. Let VC be the set of vertices whose words are contained in P′. We can thus write c(P′) = |VC| + |E|. VC is a vertex cover, since after eliminating case (iv), each u uv v from the query is covered using either 〈 u 〉 or 〈 v 〉. Thus, c(P′) = |VC| + |E| ≤ c(P) ≤ |E| + k ⇒ |VC| ≤ k.

4.1 Optimal Solution

If an optimal query plan P∗ exists such that every term therein occurs exactly once in the query, we can determine an optimal query plan using dynamic programming based on the recurrence

Opt(i) = df(q[1..i], C)                                                    if q[1..i] ∈ D

Opt(i) = min_{j < i} [ Opt(j) + min_{k ≤ j+1 ∧ q[k..i] ∈ D} df(q[k..i], C) ]    otherwise

in time O(n²) and space O(n), where |q| = n. Opt(i) denotes the cost of an optimal solution to the prefix subproblem q[1..i] – once the dynamic-programming table has been populated, an optimal query plan can be constructed by means of backtracking. In the first case, the prefix subproblem can be covered using a single term. In the second case, the optimal solution combines an optimal solution to a smaller prefix subproblem, which is the optimal substructure inherent to dynamic programming, with a single term that covers the remaining suffix.
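
Assuming no term occurs more than once in the query, the recurrence can be implemented as follows (a sketch of ours; D is a set of terms given as tuples of words, and df maps a term to its document frequency):

import math

def opt_cost(q, D, df):
    # opt[i] mirrors Opt(i): cost of an optimal plan for the prefix q[1..i].
    n = len(q)
    opt = [math.inf] * (n + 1)
    for i in range(1, n + 1):
        if tuple(q[:i]) in D:                      # first case: q[1..i] in D
            opt[i] = df(tuple(q[:i]))
        else:                                      # second case
            for j in range(1, i):
                tail = min((df(tuple(q[k - 1:i]))  # suffix term q[k..i]
                            for k in range(1, j + 2)
                            if tuple(q[k - 1:i]) in D),
                           default=math.inf)
                opt[i] = min(opt[i], opt[j] + tail)
    return opt[n]

# Running example with hypothetical document frequencies:
q = ["we", "are", "the", "champions"]
freq = {("we",): 90, ("are",): 80, ("the",): 95, ("champions",): 5,
        ("we", "are"): 20, ("are", "the"): 25, ("the", "champions"): 3}
print(opt_cost(q, set(freq), lambda t: freq[t]))  # 23: {we are, the champions}

Backtracking over the minimizing choices of j and k then yields the plan itself.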

Theorem 2 If an optimal query plan P∗ for a phrase query q exists such that

∀ t ∈ P∗ : | pos(t, q) | = 1 ,

then c(P∗) = Opt(|q|), that is, an optimal solution can be determined using the recurrence Opt.


Proof 2 (of Theorem 2) Observe that Theorem 2 is a special case of Theorem 3 for F = ∅. We therefore only prove the more general Theorem 3.

Theorem 2 entails that we can efficiently determine optimal query plans for phrase queries with no repeated words.

Corollary 1 We can compute an optimal query plan for a phrase query q in polynomial time and space, if

∀ 1 ≤ i ≤ |q| : | pos(q[i], q) | = 1 .

In practice this special case is important, since a large fraction of phrase queries in typical workloads do not contain repeated words.

Otherwise, when there is no optimal query plan P∗ according to Theorem 2, dynamic programming cannot be applied directly, since there is no optimal substructure. Consider, as a concrete problem instance, the phrase query q = 〈 abxayb 〉 with dictionary D = { 〈 a 〉, 〈 b 〉, 〈 x 〉, 〈 y 〉, 〈 ab 〉 } and assume df(t, C) > 1 for t ∈ { 〈 a 〉, 〈 b 〉, 〈 x 〉, 〈 y 〉 } and df(〈 ab 〉, C) = 1. Here, the optimal solution P∗ = { 〈 a 〉, 〈 b 〉, 〈 x 〉, 〈 y 〉 } does not contain an optimal solution to any prefix subproblem q[1..i] (1 < i < |q|), which all contain the term 〈 ab 〉.

However, as we describe next, an optimal query plan can be computed, in the general case, using a combination of exhaustive search over sets of repeated terms and a variant of our above recurrence.

For a phrase query q, let the set of repeated terms be formally defined as

R = { t ∈ D | | pos(t, q) | > 1 } .

Let further F ⊆ R denote a subset of repeated terms. We now define a modified document frequency that is zero for terms from F, formally

df′(t, C) = 0            if t ∈ F
df′(t, C) = df(t, C)     otherwise

and denote by Opt′ the variant of our above recurrence that uses this modified document frequency.

Algorithm 1 considers all subsets of repeated terms and extends each of them into a query plan for q by means of the recurrence Opt′. The algorithm keeps track of the best solution seen and eventually returns it. Its correctness directly follows from the following theorem.

Theorem 3 Let P∗ denote an optimal query plan for the phrase query q and F = { t ∈ P∗ | | pos(t, q) | > 1 } be the set of repeated terms therein; then

c(F) + Opt′(|q|) ≤ c(P∗) .


Algorithm 1: Phrase-Query Optimization

Input: Phrase query q, dictionary D
Output: Cost optCost of an optimal query plan

R = { t ∈ D : | pos(t, q) | > 1 }
optCost = ∞
for F ∈ 2^R do
    cost = c(F) + Opt′(|q|)
    if cost < optCost then
        optCost = cost
return optCost
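
In Python, Algorithm 1 reads as follows (a sketch of ours that reuses pos and opt_cost from the sketches above; df maps a term to its document frequency):

import math
from itertools import combinations

def algorithm1(q, D, df):
    # Enumerate every subset F of the repeated terms R; terms in F become
    # free for the recurrence (df'), and their real cost c(F) is added back.
    R = [t for t in D if len(pos(t, q)) > 1]
    best = math.inf
    for size in range(len(R) + 1):
        for F in combinations(R, size):
            free = set(F)
            df_mod = lambda t: 0 if t in free else df(t)   # df'
            best = min(best, sum(df(t) for t in F) + opt_cost(q, D, df_mod))
    return best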

Proof 3 (of Theorem 3) Let P∗ denote an optimal query plan for the phrase query q and

F = { t ∈ P∗ | | pos(t, q) | > 1 }

be the set of repeated terms and F̄ = P∗ \ F be the set of non-repeated terms therein. Without loss of generality, we assume that q ends in a non-repeated terminal term # having df(#, C) = 0 – this can always be achieved by “patching” the query. We order the non-repeated terms t ∈ F̄ by their single item in pos(t, q) to obtain the sequence 〈 t1, . . . , tm 〉 with m = |F̄|. We refer to the first position covered by ti, corresponding to the single item in pos(ti, q), as bi, and to the last position as ei = bi + |ti| − 1.

We now show by induction that

Opt′(ei) ≤ ∑_{j=1}^{i} df(tj, C) ,

which for i = m yields Opt′(|q|) ≤ c(F̄) = c(P∗) − c(F) and thus the theorem.

(i = 1) We have to distinguish two cases: (i) b1 = 1, that is, q[1..e1] is covered using a single non-repeated term – Opt′ selects this term according to its first case. (ii) b1 > 1, that is, there is a set of repeated terms from F that covers q[1..k] for some b1 − 1 ≤ k < e1 – Opt′ can select the same repeated terms at zero cost and combine them with t1, which covers q[b1..e1]. Thus, in both cases, Opt′(e1) ≤ df(t1, C).

(i → i + 1) We assume Opt′(ei) ≤ ∑_{j=1}^{i} df(tj, C). Again, we have to distinguish two cases: (i) ei ≥ bi+1 − 1, that is, the term before ti+1 is also a non-repeated term. Thus, our recurrence considers Opt′(ei) + df(ti+1, C) as one possible solution. (ii) ei < bi+1 − 1, that is, there is a gap covered by repeated terms between ti and ti+1 – Opt′ can select the same repeated terms at zero cost and thus considers Opt′(ei) + 0 + df(ti+1, C) as one solution. Thus, in both cases, Opt′(ei+1) ≤ ∑_{j=1}^{i+1} df(tj, C).


The cost of Algorithm 1 depends on the number of repeated terms |R|, which is small in practice and can be bounded in terms of the number of positions in q occupied by a repeated word

r = | { 1 ≤ i ≤ |q| | | pos(q[i], q) | > 1 } | .

For our above example phrase query q = 〈 abxayb 〉 we obtain r = 4. Note that |R| ≤ r (r + 1) / 2 holds. Algorithm 1 thus has time complexity O(2^{r (r + 1) / 2} n²) and space complexity O(n²), where |q| = n.

4.2 Approximation Algorithm

Computing an optimal query plan can be computationally expensive in the worst case, as just shown. It turns out, however, that we can efficiently compute an O(log n)-approximation, reusing results for Set Cover [21].

To this end, we convert an instance of our problem, consisting of a phrase query q and a dictionary D with associated costs, into a Set Cover instance. Let the universe of items U = { 1, . . . , |q| } correspond to positions in the phrase query. For each term t ∈ D, we define a subset S_t ⊆ U of covered positions as

S_t = { 1 ≤ i ≤ |q| | ∃ j ∈ pos(t, q) : j ≤ i < j + |t| } .

The collection of subsets of U is then defined as S = { S_t | t ∈ D }, and we define cost(S_t) = df(t, C) as a cost function.

For our concrete problem instance from above, we obtain U = { 1, . . . , 6 } and S = { { 1, 4 }, { 2, 6 }, { 3 }, { 5 }, { 1, 2 } } as a Set Cover instance.

We can now use the greedy algorithm that selects sets from S based on their benefit-cost ratio, which is known to be an O(log n)-approximation algorithm [21]. This can be implemented in O(n²) time and O(n²) space, where |q| = n.
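
The resulting procedure looks as follows in Python (a sketch of ours; it materializes the sets S_t and repeatedly picks the term with the best cost per newly covered position, using hypothetical frequencies consistent with the example in Section 4.1):

def greedy_plan(q, D, df):
    # Build S_t for each term t: all query positions spanned by occurrences of t.
    n = len(q)
    subsets = {}
    for t in D:
        S_t = {i for j in range(n - len(t) + 1) if list(t) == q[j:j + len(t)]
               for i in range(j, j + len(t))}
        if S_t:
            subsets[t] = S_t
    plan, covered = [], set()
    while len(covered) < n:  # terminates since all single words are indexed
        # Best benefit-cost ratio: cost divided by newly covered positions.
        t = min((t for t in subsets if subsets[t] - covered),
                key=lambda t: df(t) / len(subsets[t] - covered))
        plan.append(t)
        covered |= subsets[t]
    return plan

# On the instance above, greedy first picks <ab> (ratio 1/2) and ends with
# cost 15, whereas the optimal plan {a, b, x, y} costs 14.
q = ["a", "b", "x", "a", "y", "b"]
freq = {("a",): 5, ("b",): 5, ("x",): 2, ("y",): 2, ("a", "b"): 1}
print(greedy_plan(q, set(freq), lambda t: freq[t]))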

Note that, as a key difference from the greedy algorithm described in [22], which to the best of our knowledge does not give an approximation guarantee, our greedy algorithm selects subsets (corresponding to terms from the dictionary) taking into account both the number of additional items covered and the coverage already achieved by the selected subsets.


5 Experimental Evaluation

We now describe our experimental evaluation. We begin with a description of the datasets, followed by a comparison of the query-optimization methods from Chapter 4.

5.1 Datasets

Document Collections. We use two real-world document collections for our experiments:

• ClueWeb09-B [2] (CW) – more than 50 million English-language web documents crawled in 2009;

• The New York Times Annotated Corpus [4] (NYT) – more than 1.8 million newspaper articles published by The New York Times between 1987 and 2007.

Both document collections were processed using Stanford CoreNLP [3] for tokenization. To make CW more manageable, we use boilerplate detection as described in [12] and available in the DefaultExtractor of boilerpipe [1].

       # Queries     # Distinct    ø Length
YAGO   5,720,063     4,599,745     2.45
MSN    10,428,651    5,627,838     3.58
MSNP   131,857       105,825       3.37
NYTS   1,000,000     970,051       21.66
CWS    1,000,000     929,607       20.55

Table 5.1: Workload characteristics


Workloads. To reflect different use cases, including web search, entity-oriented search, and plagiarism detection, we consider the following workloads for our experimental evaluation:

• MSN is a query workload made available for research by a commercial web search engine in 2009. It contains queries routed to the US Microsoft search site and sampled over one month (May 2006).

• MSNP is the subset of explicit phrase queries from the aforementioned query workload, i.e., queries enclosed in quotes (e.g., “national pandemic influenza response plan”).

• YAGO is a workload of entity labels from the YAGO2 knowledge base [11]. In its rdfs:label relation, it collects strings that may refer to a specific entity, which are mined from anchor texts in Wikipedia. For the entity Bob Dylan, as a concrete example, it includes among others the entity labels “bob dylan”, “bob allen zimmerman”, and “robert allen zimmerman”.

• NYTS/CWS are workloads consisting of 1,000,000 sentences randomly sampled from the NYT and CW document collections, respectively.

Table 5.1 reports characteristics of the different workloads. Note that we excluded single-word queries from all workloads. Interestingly, queries are on average shorter in the YAGO workload (2.45 words) than in the web search query workloads. By design, the NYTS and CWS workloads consist of substantially longer phrase queries. We use document frequency as a cost measure for all our experiments. As mentioned earlier, one could use collection frequency instead. In practice, though, the two measures are highly correlated, and we did not observe big differences. Also, as a one-time pre-processing step performed using Hadoop and made available to all methods, we compute n-gram statistics for the workload and document collection using the method described in [6].

5.2 Experimental Results

The experiment examines the effect that the choice of query-optimization method can have on query-processing performance. We consider three query-optimization methods for this experiment: the greedy algorithm (GRD) from [22], our greedy algorithm (APX) that comes with an approximation guarantee, and our exponential exact algorithm (OPT). GRD considers terms in increasing order of their document frequency, thus based on their selectivity, and chooses a term if it covers any yet-uncovered portion of the phrase query. Since GRD was originally designed to deal with bigrams only, we extend it to break ties based on term length, favoring the longer term if two terms have the same document frequency.

To compare the three query-optimization methods, we built augmented inverted indexes whose dictionaries include all phrases up to a specific maximum length l ∈ { 2, 3, 4, 5 }. Thus, for l = 5, all phrases of length five or less are indexed. This allows us to study the behavior of the methods as more terms to choose from become available.

Table 5.2 reports average query-processing costs for all sensible combinations of our document collections and workloads for different choices of l. When studying the behavior of the different query-optimization methods on an augmented inverted index obtained for a specific choice of l, we exclude queries from all workloads that consist of fewer than l words. This is reasonable, since those queries can simply be processed by looking up the corresponding n-gram, and there is nothing to optimize.

As we can see from the table, the approximate solutions (GRD and APX) work well, and their costs are close to the ones obtained with the optimal algorithm (OPT). As expected, the performance of APX, our approximation algorithm, is closer to optimal than the performance of the heuristic GRD. This difference is more pronounced on the NYTS and CWS workloads, consisting, by construction, of verbose queries.

Also, as expected, we observe that with increasing l, the number of postings read per query decreases for all query optimizers. The improvements are drastic on the sentence workloads for all optimizers. For shorter queries there is less room for query optimization, as observed in the YAGO and MSNP workloads. For longer queries, in contrast, there is more room for query optimization and thus an opportunity for OPT and APX to make better choices, as observed on the NYTS/CWS workloads consisting of verbose queries. This is even more noticeable for larger choices of l, as can be seen from the results obtained for l = 4 on CWS, where OPT reads only about half as many postings as GRD.

We also observe that a majority of queries does not contain repeated words, and even otherwise the number of repetitions is typically small. This has a favorable impact on the execution time of OPT. We observe that all query-optimization methods perform similarly in terms of execution time. Our optimization methods are thus robust and achieve performance superior to GRD.


NYT
        GRD                                 APX                                 OPT
        l=2     l=3     l=4     l=5         l=2     l=3     l=4     l=5         l=2     l=3     l=4     l=5
YAGO    23,070  18,793  16,978  15,693      22,997  18,775  16,970  15,682      22,854  18,707  16,961  15,674
MSNP    32,466  21,415  19,866  19,734      32,177  21,392  19,857  19,726      31,796  21,333  19,844  19,714
MSN     29,755  22,635  21,596  21,442      29,596  22,621  21,590  21,436      29,300  22,590  21,582  21,427
NYTS    238,270 10,466  1,035   203         228,117 9,833   955     182         217,068 9,135   895     170

CW
        GRD                                   APX                                   OPT
        l=2       l=3     l=4     l=5         l=2       l=3     l=4     l=5         l=2       l=3     l=4     l=5
YAGO    142,451   67,046  58,463  55,651      140,659   66,398  58,374  55,572      138,889   65,938  58,327  55,537
MSNP    281,427   87,705  77,695  95,180      275,841   86,986  77,634  95,145      268,994   86,413  77,552  95,103
MSN     285,823   150,526 163,286 209,294     282,217   150,130 163,257 209,278     277,676   149,829 163,219 209,250
CWS     2,846,711 221,089 83,170  53,795      2,692,296 195,132 43,256  33,167      2,576,273 186,095 41,874  32,252

Table 5.2: Query optimization results (# postings read)


6 Related Work

Williams et al. [22] put forward the combined index to support phrase queries efficiently. It assembles three levels of indexing: (i) a first-word index as a positional inverted index, (ii) a next-word index that indexes all bigrams containing a stopword, and (iii) a phrase index with popular phrases from a query log. Its in-memory dictionary is kept compact by exploiting common first words between bigrams. Query processing escalates through these indexes – first it consults the phrase index and, if the phrase query is not found therein, processes it using bigrams and unigrams from the other indexes. Transier and Sanders [20] select bigrams to index based only on characteristics of the document collection. Selecting bigrams makes sense in settings where phrase queries are issued by human users and tend to be short – as observed for web search by Spink et al. [18]. Variable-length multi-word sequences have previously been considered by Chang and Poon [8] in their common phrase index, which builds on [22] but indexes variable-length phrases common in the workload.


7 Conclusion

We have presented a comprehensive approach to processing phrase queries using an inverted index whose dictionary is augmented by variable-length word sequences. We studied the problem of query optimization, that is, determining cost-efficient plans for phrase queries on a given index, and proposed solutions to select an optimal (or close-to-optimal) query plan for processing a phrase query over an augmented inverted index. Lastly, we performed an extensive evaluation of our approaches using real-world datasets.


Bibliography

[1] Boilerpipe. http://code.google.com/p/boilerpipe/.

[2] ClueWeb09. http://lemurproject.org/clueweb09/.

[3] Stanford CoreNLP. http://nlp.stanford.edu/software/corenlp.shtml.

[4] The New York Times Annotated Corpus. http://corpus.nytimes.com.

[5] S. Agrawal, K. Chakrabarti, S. Chaudhuri, and V. Ganti. Scalable ad-hoc entity extraction from text collections. PVLDB, 1(1):945–957, 2008.

[6] K. Berberich and S. J. Bedathur. Computing n-gram statistics in MapReduce. In EDBT, pages 101–112, 2013.

[7] S. Büttcher, C. L. A. Clarke, and G. V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. The MIT Press, 2010.

[8] M. Chang and C. K. Poon. Efficient phrase querying with common phrase index. Inf. Process. Manage., 44(2):756–769, 2008.

[9] P. Ferragina and R. Venturini. The compressed permuterm index. ACM Transactions on Algorithms, 7(1):10, 2010.

[10] M. Hagen, M. Potthast, A. Beyer, and B. Stein. Towards optimum query segmentation: in doubt without. In CIKM, pages 1015–1024, 2012.

[11] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.


[12] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In WSDM, pages 441–450, 2010.

[13] Y. Li, B.-J. P. Hsu, C. Zhai, and K. Wang. Unsupervised query segmentation using clickthrough for information retrieval. In SIGIR, pages 285–294, 2011.

[14] C. Macdonald, N. Tonellotto, and I. Ounis. Learning to predict response times for online query scheduling. In SIGIR, pages 621–630, 2012.

[15] U. Manber and E. W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993.

[16] J.-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, T. G. B. Team, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak, and E. L. Aiden. Quantitative analysis of culture using millions of digitized books. Science, 2010.

[17] J. Néraud. Elementariness of a finite set of words is co-NP-complete. ITA, 24:459–470, 1990.

[18] A. Spink, D. Wolfram, M. B. J. Jansen, and T. Saracevic. Searching the web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3):226–234, 2001.

[19] E. Stamatatos. Plagiarism detection based on structural information. In CIKM, pages 1221–1230, 2011.

[20] F. Transier and P. Sanders. Out of the box phrase indexing. In SPIRE, pages 200–211, 2008.

[21] V. V. Vazirani. Approximation Algorithms. Springer, 2001.

[22] H. E. Williams, J. Zobel, and D. Bahle. Fast phrase querying with combined indexes. ACM Trans. Inf. Syst., 22(4):573–594, 2004.

[23] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2):6, 2006.
