K-Regret Queries: From Additive to Multiplicative Utilities · arXiv:1609.07964v3 [cs.DB] 1 Oct 2016 K-Regret Queries: From Additive to Multiplicative Utilities Jianzhong Qi1, Fei

arX

iv:1

609.

0796

4v3

[cs

.DB

] 1

Oct

201

6

K-Regret Queries: From Additive to Multiplicative Utilitie s

Jianzhong Qi1, Fei Zuo1∗

, Hanan Samet2, Jia Cheng Yao3

1Department of Computing and Information Systems, University of Melbourne2Department of Computer Science, University of Maryland

3Faculty of Business and Economics, University of Melbourne

[email protected], [email protected], [email protected], [email protected]

ABSTRACTThe k-regret query aims to return a size-k subset of the en-tire database such that, for any query user that selects adata object in this size-k subset rather than in the entiredatabase, her regret ratio is minimized. Here, the regretratio is modeled by the level of difference in the optimalitybetween the optimal object in the size-k subset returned andthe optimal object in the entire database. The optimalityof a data object in turn is usually modeled by a utility func-tion of the query user. Compared with traditional top-kqueries, k-regret queries have the advantage of not requir-ing users to specify their utility functions. They can discovera size-k subset that minimizes the regret ratio for a wholefamily of utility functions without knowing any particularof them. Previous studies have answered k-regret querieswith additive utility functions such as the linear summationfunction. However, no existing result has been reported toanswer k-regret queries with multiplicative utility functions,which are an important family of utility functions.

In this study, we break the barrier of multiplicative utilityfunctions. We present an algorithm that can produce an-swers with a bounded regret ratio to k-regret queries withmultiplicative utility functions. As a case study we applythis algorithm to process a special type of multiplicative util-ity functions, the Cobb-Douglas function, and a closely re-lated function, the Constant Elasticity of Substitution func-tion. We perform extensive experiments on the proposedalgorithm. The results confirm that the proposed algorithmcan answer k-regret queries with multiplicative utility func-tions efficiently with a constantly small regret ratio.

1. INTRODUCTIONTop-k queries [7, 8, 19] and skyline queries [1, 15, 20] have

been used traditionally to return a representative subset Sof a database D to a query user when the entire database istoo large to be explored fully by the query user. These two

∗Co-first author ordered by surname.

Table 1: A Computer Database

Computer CPU (pi.c1) Brand recognition (pi.c2)

p1 2.3 80p2 1.7 90p3 2.8 50p4 2.1 55p5 2.1 50p6 3.0 55

types of queries suffer in either requiring a predefined utilityfunction to model the query user’s preference over the dataobjects, or returning an unbounded number of data objects.Recent studies [10, 14, 26] aim to overcome these limitationsby a new type of queries, the k-regret query, which returnsa size-k subset S ⊆ D that minimizes the maximum regretratio of any query user. The concept of regret comes frommicroeconomics [11]. Intuitively, if a query user had selectedthe local optimal object in S , and were later shown theoverall optimal object in the entire database D, the queryuser may regret. The k-regret query uses the regret ratioto model how regretful the query user may be, which isthe level of difference in the optimality between the localoptimal object in S and the overall optimal object in D.Here, the optimality is computed by a utility function. Thek-regret query does not require any specific utility functionto be given. Instead, it can return a set to minimize themaximum regret ratio for a family of utility functions suchas the linear summation functions.

To illustrate the k-regret query, consider an online com-puter shop with a database D of computers as shown inTable 1. There are six computers, i.e., D = p1, p2, ..., p6.Every computer pi has two attributes: CPU clock speed andbrand recognition, denoted as pi.c1 and pi.c2, respectively.Here, brand recognition represents how well a brand is rec-ognized by the customers. A larger value means that thebrand is better recognized. Since the entire database maybe too large to be all shown, the shop considers showingonly a size-k subset S ⊆ D in the front page as a recom-mendation. Such a subset may be S = p1, p3, p5 (i.e.,k = 3). When a customer visits the shop, assume that herpreference can be expressed as a utility function f1(pi) =0.5 · pi.c1 + 0.5 · pi.c2. Then the customer may purchase p1from the recommended subset S since p1 has the largest util-ity value f1(p1) = 0.5·p1 .c1+0.5·p1 .c2 = 0.5×2.3+0.5×80 =41.15 > f1(p3) = 26.40 > f1(p5) = 26.05. Note that an-other computer p2 ∈ D exists with an even larger utilityvalue f1(p2) = 0.5× 1.7 + 0.5× 90 = 45.85. If the customerlater sees p2, she may regret. Her regret ratio is computed

1

http://arxiv.org/abs/1609.07964v3

asf1(p2)− f1(p1)

f1(p2)≈ 10.25%. For another customer with a

different utility function f2(pi) = 0.99 · pi.c1 + 0.01 · pi.c2,the computer in S that best suits her preference is p3, i.e.,f2(p3) = 0.99 × 2.8 + 0.01 × 50 ≈ 3.27 > f2(p1) ≈ 3.08 >f2(p5) ≈ 2.58. Meanwhile, the overall best computer in D isp6, i.e., f2(p6) = 0.99×3.0+0.01×55 = 3.52. If the customer

purchases p3, her regret ratio will bef2(p6)− f2(p3)

f2(p6)≈

7.05%. Since every customer may have a different prefer-ence and hence a different utility function, it is unpracticalto satisfy multiple customers with a single set S . Instead oftrying to satisfy some customer with a certain utility func-tion, the k-regret query aims to generate a subset S thatminimizes the maximum regret ratio for a family of (infi-nite) utility functions.

Existing studies on the k-regret query have focused onadditive utility functions where the overall utility of an ob-ject is computed as the sum of the utility in each attributeof the object. The linear summation functions f1 and f2are examples. They can be written in a more general form:f(pi) =

∑d

j=1 αj · pi.cj , where d denotes the number of at-

tributes, and αj is the weight of attribute j. Studies [10, 14]have shown that the maximum regret ratio of the k-regretquery with additive utility functions can be bounded. How-ever, to the best of our knowledge, so far, no existing boundhas been obtained for the k-regret query with multiplicativeutility functions (MUFs).

In this paper, we break the barrier of bounding the max-imum regret ratio for the k-regret query with MUFs. AnMUF computes the overall utility of an object as the prod-uct of the utility in each attribute, i.e., f(pi) =

∏d

j=1 pi.cαj

j .It is more expressive especially in modeling the diminishingmarginal rate of substitution (DMRS) [22]. The DMRS isa common phenomenon. It refers to the fact that, as theutility in an attribute j gets larger, the extent to which thisutility can make up (substitute) for the utility in any otherattribute j′ decreases (diminishes). For example, a higherCPU clock speed may make up for a less known brand. How-ever, as the CPU clock speed gets higher, the extent to whichit can make up for the brand recognition decreases, sincemost customers only need a moderate CPU rather than anextremely fast one. This phenomenon cannot be modeledby an additive utility function such as f1 or f2, where agiven amount of increase in attribute 1 always makes up fora fixed amount of utility in attribute 2, and vice versa. As aresult, f1 and f2 favor objects with maximum values in cer-tain attributes (e.g., p2 in c2 and p6 in c1). An MUF suchas f3(pi) = pi.c

0.51 · pi.c

0.52 , on the other hand, models the

DMRS much better. It favors objects with a good (but notnecessarily the best) utility in more attributes, e.g., p1 suitsf3 the best, which is reasonably good in both attributes.

The higher expressive power of MUFs also brings signif-icant challenges in bounding the maximum regret ratio forthem. It is much more difficult to tightly bound the prod-uct of a series of exponential expressions. In this study, weovercome these challenges with an algorithm named Min-Var. We show that this algorithm can obtain a maximum

regret ratio bounded between Ω(1

k2) and O(ln(1 +

1

k1

d−1

))

for the k-regret query with MUFs. To showcase the appli-cability of the MinVar algorithm in real world scenarios, weapply it on k-regret queries with a special family of MUFs,

the Cobb-Douglas functions, which are used extensively ineconomics studies [4, 6, 24]. As a by-product, we obtain a

new upper bound O(1

k1

d−1

) of the maximum regret ratio for

the k-regret query with the Constant Elasticity of Substitu-tion (CES) functions [24]. This type of function is closelyrelated to the Cobb-Douglas functions. Our upper bound istighter than a previously obtained upper bound [10].

To summarize, this paper makes the follows contributions:

• We are the first to study the k-regret query with mul-tiplicative utility functions.

• We propose an algorithm named MinVar to process thequery. Based on this algorithm, we obtain bounds ofthe maximum regret ratio for the k-regret query withmultiplicative utility functions. We introduce an extraheuristic based on redundancy elimination to furtherlower the maximum regret ratio, which results in animproved algorithm named RF-MinVar.

• We showcase the applicability of the proposed algo-rithms on a special type of multiplicative utility func-tions, the Cobb-Douglas functions, and a closely re-lated type of functions, the CES functions.

• We perform extensive experiments using both real andsynthetic data to verify the effectiveness and efficiencyof the proposed algorithms. The results show that theregret ratios obtained by the proposed algorithms areconstantly small. Meanwhile, the proposed algorithmsare more efficient than the baseline algorithms.

The rest of the paper is organized as follows. Section 2 re-views related studies. Section 3 presents the basic concepts.Section 4 describes the proposed algorithms to process k-regret queries with MUFs, and derives the bounds of themaximum regret ratio. Section 5 applies the proposed algo-rithms on two special types of utility functions, the Cobb-Douglas function and the CES function, and shows theirbounds of the maximum regret ratio. Section 6 presentsexperimental results and Section 7 concludes the paper.

2. RELATED WORKWe review two groups of related studies: the skyline queries

and the k-regret queries.Skyline queries. The skyline query [1] is an earlier at-

tempt to generate a representative subset S of the entiredatabase D without the need of specifying any utility func-tions. This query assumes a database D of d-dimensionalpoints (d ∈ N+). Let pi and pj be two points in D wherei 6= j. Point pi is said to dominate point pj if and only if∀l ∈ [1..d], pi.cl ≥ pj .cl, where pi.cl (pj .cl) denotes the co-ordinate of pi (pj) in dimension l. Here, the “≥” operatorrepresents the user preference relationship. A point with alarger coordinate in dimension l is deemed more preferablein that dimension. The skyline query returns the subsetS ⊆ D where each point is not dominated by any otherpoint in D.

The skyline query can be answered by a two-layer nestedloop over the points in D and filtering those dominated byother points. The points remaining after the filtering are theskyline points which are the query answer. More efficientalgorithms have been proposed in literature [15, 20].

2

While the skyline query does not require a utility func-tion, it suffers in having no control over the size of the setS returned. In the worst case, the entire database D maybe returned. Studies have tried to overcome this limitationby combining the skyline query with the top-k query. Forexample, Xia et al. [25] introduce the ε-skyline query whichadds a weight to each dimension of the data points to reflectusers’ preference towards the dimension. The weights createa built-in rank for the points which can be used to answerthe top-k skyline query. Chan et al. [3] rank the points bythe skyline frequency, i.e., how frequently a point appearsas a skyline point when different numbers of dimensions areconsidered. A few other studies extract a representative sub-set of the skyline points. Lin et al. [12] propose to return thek points that together dominate the largest number of non-skyline points as the k most representative skyline subset.Tao et al. [21] select k representative skyline points based onthe distance between the skyline points instead. While thestudies above return a size-k subset of the entire database,the maximum regret ratio of such a subset is unbounded.

K-regret queries. Nanongkai et al. [14] introduce theconcept of regret minimization to top-k query processingand propose the k-regret query. The advantage of this querytype is that it does not require query users to specify theirutility functions. Instead, it can find the subset S that min-imizes the maximum regret ratio for a whole family of utilityfunctions. Nanongkai et al. propose the CUBE algorithmto process the k-regret query for a family of linear summa-tion utility functions, i.e., each utility function f is in theform of f(pi) =

∑d

j=1 αj · pi.cj where αj denotes the weightof dimension j. The CUBE algorithm is efficient, but themaximum regret ratio it obtains is quite large in practice.To obtain a smaller maximum regret ratio, in a differentpaper [13], Nanongkai et al. propose an interactive algo-rithm where query users are involved to guide the search foranswers with smaller regret ratios. Peng and Wong [17] ad-vance the k-regret query studies by utilizing geometric prop-erties to improve the query efficiency. Chester et al. [5] alsoconsider linear summation utility functions, and propose tocompute the k-regret minimizing sets, which is NP-hard.

Kessler Faulkner et al. [10] build on top of the CUBEalgorithm and propose three algorithms, MinWidth, Area-Greedy, and Angle. These three algorithms can process k-regret queries with the “concave”, “convex”, and CES util-ity functions. Nevertheless, the “concave” and “convex”utility functions considered are restricted to additive forms(See Braziunas and Boutilier [2] and Keeney and Raiffa [9]for more details on additive utilities and additive indepen-dence). They are in fact summations over a set of con-cave and convex functions. The CES utility functions alsosum over a set of terms. In this paper, we move the studyon k-regret queries from additive utility functions to mul-tiplicative utility functions. We present an algorithm thatcan produce answers with bounded maximum regret ratiosfor k-regret queries with multiplicative utility functions. Asa by-product, we also obtain a new upper bound on themaximum regret ratio for the CES utility functions which istighter than a previously obtained upper bound [10].

Zeighami and Wong [26] propose to compute the averageregret ratio. They do not assume any particular type ofutility functions, but use sampling to obtain a few utilityfunctions for the computation. This study is less relevant toour work and is not discussed further.

Table 2: Frequently Used Symbols

Symbol Description

D A databasen The cardinality of Dd The dimensionality of Dk The k-regret query parameterS A size-k subset selected from Dpi A point in D

pi.cj The coordinate of pi in dimension jt The number of intervals that the data domain

is partitioned into in each dimension

3. PRELIMINARIESWe present basic concepts and a problem definition in this

section. The symbols frequently used in the discussion aresummarized in Table 2.

We consider a database D of n data objects. Every dataobject pi ∈ D is a d-dimensional point in R

d+, where d is

a positive integer and the coordinates of the points are allpositive numbers. We use pi.cj to denote the coordinateof pi in dimension j. This coordinate represents the utilityof the data object in dimension j. A larger value of pi.cjdenotes a larger utility and hence pi is more preferable indimension j. A query parameter k is given. It specifies thesize of the answer set S (S ⊆ D) to be returned by a k-regretquery. We assume d ≤ k ≤ n.

Let f : D → R+ be a function that models the utility ofa data object, i.e., how preferable the data object is by aquery user. The gain of a query user over a set S modelsthe utility of the set to the query user.

Definition 1 (Gain). Given a utility function f of aquery user and a subset of data objects S ⊆ D, the gain ofthe query user over S, gain(S , f), is defined as the maximumutility of any object in S, i.e.,

gain(S , f) = maxpi∈S

f(pi)

Continue with the example shown in Table 1. If S =p1, p3, p5 and f3(pi) = pi.c

0.51 · pi.c

0.52 , then gain(S , f3) =

maxpi∈S f3(pi) = f3(p1) = 2.30.5 × 800.5 ≈ 13.56.By definition, when S = D, the gain of the query user is

maximized. However, the entire database D is usually toolarge to be returned to or fully explored by the query user.When S ⊂ D, the query user may potentially suffer from“losing” some gain comparing with the case where S = D.This loss of gain models how much the query user may regretif she selects the local optimal object from S and is latershown the overall optimal object in D.

Definition 2 (Regret). Given the utility function fof a query user and a subset of data objects S ⊆ D, the regretof the query user if she selects the optimal object from S,regretD(S , f), is defined as the difference between the gainof the user over D and her gain over S, i.e.,

regretD(S , f) = gain(D, f)− gain(S , f)

The regret ratio is a relative measure of the regret.

Definition 3 (Regret Ratio). Given the utility func-tion f of a query user and a subset of data objects S ⊆ D,the regret ratio of the query user if she selects the optimal

3

object from S, r ratioD(S , f), is defined as regretD(S , f)over gain(D, f), i.e.,

r ratioD(S , f) =regretD(S , f)

gain(D, f)=

regretD(S , f)

regretD(S , f) + gain(S , f)

=regretD(S, f)

regretD(S , f) + maxpj∈S f(pj)

=maxpi∈Df(pi)−maxpj∈Sf(pj)

maxpi∈Df(pi)

When S = p1, p3, p5 and f3(pi) = pi.c0.51 · pi.c

0.52 , we

have gain(S , f3) = gain(D, f3) = f3(p1) ≈ 13.56. Thus,regretD(S , f3) = 0, and r ratioD(S , f3) = 0%. For a differ-ent utility function f4(pi) = pi.c

0.991 · pi.c

0.012 , gain(S , f4) =

f4(p3) ≈ 2.88 and gain(D, f4) = f4(p6) ≈ 3.09. Thus,regretD(S , f4) ≈ 0.21, and r ratioD(S , f4) ≈

0.213.09

≈ 6.80%.Given a family of utility functions F , the maximum regret

ratio formulates how regretful a query user can possibly beif she uses any utility function in F and is given the set S .

Definition 4 (Maximum Regret Ratio). Given a fam-ily of utility functions F and a subset of data objects S ⊆ D,the maximum regret ratio if a query user with any utilityfunction in F selects the optimal object in S, mr ratioD(S ,F),is defined as the supremum of the regret ratio of any utilityfunction, i.e.,

mr ratioD(S ,F) = supf∈F

r ratioD(S , f)

= supf∈F

maxpi∈Df(pi)−maxpj∈Sf(pj)

maxpi∈Df(pi)

Here, the supremum is used instead of the maximum becausethe family of utility functions F may be infinite.

Continue with the example above. If F = f3, f4 thenmr ratioD(S ,F) = max0%, 6.80% = 6.80%.

The k-regret query aims to return the size-k subset S ⊆ Dthat minimizes the maximum regret ratio.

Definition 5 (K-Regret Query). Given a family ofutility functions F, the k-regret query returns a size-k subsetS ⊆ D, such that the maximum regret ratio over S is smallerthan or equal to that over any other size-k subset S ′ ⊆ D.Formally,

∀S ′ ⊆ D ∩ |S ′| = k : mr ratioD(S ,F) ≤ mr ratioD(S ′,F)

Specific utility functions are not always available becausethe query users are not usually pre-known and their utilityfunctions may not be specified precisely. The k-regret querydoes not require any specific utility functions to be given.Instead, the query considers a family of functions of a certainform such as the linear functions [14], i.e., f(pi) =

∑d

j=1 αj ·pi.cj where αj is the weight of dimension j. The k-regretquery minimizes the maximum regret ratio of any utilityfunction of such form (without knowing the value of αj).

Our contribution to the study of k-regret queries is theconsideration of a family of multiplicative utility functions(MUFs).

Definition 6 (Multiplicative Utility Function).We define a multiplicative utility function (MUF) f to be autility function of the following form:

f(pi) =n∏

d=1

pi.cαj

j ,

where αj ≥ 0 is a function parameter and∑d

j=1 αj ≤ 1.

Definition 7 (K-Regret Query with MUFs). Thek-regret query with MUFs takes a database D of d-dimensionalpoints and a set F of MUFs as the input, where the param-eters values α1, α2, ..., αd of each MUF may be different.The query returns a size-k subset S ⊆ D, such that the max-imum regret ratio mr ratioD(S ,F) is minimized.

We assume that pi.cj has been normalized into the rangeof (1, 2] to simplify the derivation of the maximum regretratio bounds. This can be done by a normalization function

N (pi.cj) = 1 +pi.cj

maxpi∈Dpi.cj. In fact, it is common to

normalize the data domain in different dimensions into thesame range, so that utility values of different dimensionsbecome more comparable. Note that our derivation of thebounds still holds without this assumption, although thebounds may become less concise.

Scale invariance. It has been shown [10, 14] that k-regret queries with additive utility functions are scale in-variant, i.e., scaling the data domain in any dimension doesnot change the maximum regret ratio of a set S . This prop-erty is preserved in k-regret queries with MUFs. For anMUF f(pi) =

∏n

d=1 pi.cαj

j , we can scale each dimension by

a factor λj > 0, resulting in a new MUF f ′(pi) =∏n

d=1(λj ·

pi.cj)αj =

∏n

d=1 λαj

j ·∏n

d=1 pi.cαj

j = (∏n

d=1 λαj

j )f(pi). Suchscaling does not affect the regret ratio (and hence the max-imum regret ratio), i.e., r ratioD(S , f ′) = r ratioD(S , f):

r ratioD(S , f ′)

=maxpi∈Df ′(pi)−maxpj∈Sf

′(pj)

maxpi∈Df ′(pi)

=(∏n

d=1 λαj

j )(maxpi∈Df(pi)−maxpj∈Sf(pj))

(∏n

d=1 λαj

j )maxpi∈Df(pi)

=maxpi∈Df(pi)−maxpj∈Sf(pj)

maxpi∈Df(pi)= r ratioD(S , f)

In what follows, for conciseness, we refer to the regret,the regret ratio, and the maximum regret ratio of a queryuser simply as the regret, the regret ratio, and the maximumregret ratio, respectively.

4. THE K-REGRET QUERY WITH MULTI-PLICATIVE UTILITY FUNCTIONS

We propose an algorithm named MinVar to process k-regret queries with MUFs in Section 4.1. We derive thebounds of the maximum regret ratios of the query answersreturned by MinVar in Sections 4.2 and 4.3. We furtherimprove the maximum regret ratios of the query answersthrough a heuristic based algorithm in Section 4.4.

4.1 The MinVar AlgorithmMinVar shares a similar overall algorithmic approach with

that of CUBE [14] and MinWidth [10] which were proposedto process k-regret queries with additive utility functions.

As summarized in Algorithm 1, MinVar first finds the op-timal point p∗i in dimension i for the first d− 1 dimensions,i.e, p∗i .ci is the largest utility in dimension i (i = 1, 2, ..., d−1). These d− 1 points are added to S (Lines 1 to 5). Then,the algorithm partitions each dimension i of the data do-

main into t = ⌊(k − d + 1)1

d−1 ⌋ intervals for the first d − 1

4

Dimension 1

Dimension 3

Dimension 2

p2*

p1*

s3*

s1*

s2*

s4*

Figure 1: The MinVar algorithm

dimensions (Lines 6 to 8). Here, the value of t is chosen sothat we can obtain sufficient points to be added to S to cre-ate a size-k subset. The t intervals in each of the first d− 1dimensions together partition the d dimensional data spaceinto td−1 buckets. The algorithm selects one point s∗ in eachbucket that has the largest utility s∗.cd in dimension d, andadds s∗ to S (Lines 9 to 12). There are td−1 = k − d + 1points added in this step. Together with the d−1 points pre-viously added, S now has k points, which are then returnedas the query answer (Line 13).

Figure 1 gives an example. Suppose d = 3 and k = 6,

then t = ⌊(6− 3 + 1)1

3−1 ⌋ = 2. We first add the two pointsp∗1 and p∗2 to S which have the largest utility in dimensions1 and 2, respectively. Then the data domain in dimensions1 and 2 are each partitioned into t = 2 intervals, formingtd−1 = 23−1 = 4 buckets in the data space. Four more pointss∗1, s

∗2, s

∗3, and s∗4 are added to S , each from a different bucket

and is the point with the largest utility in dimension 3 inthat bucket.

Algorithm 1: MinVar

Input: D = p1, p2, ..., pn: a d-dimensional database; k:the size of the answer set.

Output: S: a size-k subset of D.1 S ← ∅;2 for i = 1, 2, ..., d− 1 do

3 Find p∗i which has the largest utility p∗i .ci in dimension i;4 cτi ← p∗i .ci;5 S ← S ∪ p∗i ;

6 t← ⌊(k − d+ 1)1

d−1 ⌋;7 for i = 1, 2, ..., d− 1 do8 bps[i]← F indBreakpoints(D, t, n, i, cτi );

9 for each (d− 1)-integer combination1 ≤ j1 ≤ t, 1 ≤ j2 ≤ t, ...,1 ≤ jd−1 ≤ t do

10 B ← p ∈ D|∀i ∈ [1..d− 1] : bps[i][ji].lo ≤ p.ci ≤bps[i][ji].hi;

11 s∗ ← argmaxp∈B p.cd;12 S ← S ∪ s∗;

13 return S;

The FindBreakpoints algorithm. Our contributionof MinVar lies in the sub-algorithm FindBreakpoints to findthe breakpoints to partition each dimension i of the data do-main into t intervals (Line 8). The intuition of the algorithmis as follows. The optimal point p∗ for any utility function fmust lie in one of the buckets created by MinVar. Let thisbucket be B. The algorithm selects a point s∗ from B torepresent this bucket and adds it to S . If p∗ is selected tobe s∗, then the regret ratio is 0. To maximize the worst-caseprobability of p∗ being selected, we should partition dimen-

MinVar

MinWidth

CUBE CUBE

Dimension 1

Dimension 2p*

s*

Figure 2: Creating the intervals in a dimension

sion i such that each interval contains the same number ofpoints. Otherwise, if the intervals are skewed and p∗ lies ina large interval, its probability of being selected is small.

In CUBE [14], the data domain is simply broken evenly,

i.e., each interval has the same size ofcτit, where cτi denotes

the largest utility in dimension i (assuming that the datadomain starts at 0). When the data points are not uniformlydistributed, the probability of p∗ being selected to representits bucket is small. Figure 2 gives an example where d =2. The two intervals created by CUBE in dimension 1 arehighly unbalanced in the number of points in each interval.The point p∗ has a large utility in both dimensions andmay be optimal for many MUFs. However, it will not beselected by CUBE to represent its bucket, since it falls ina dense bucket and there is another point s∗ with a largerutility in dimension 2. In MinWidth [10], the data domainis broken with a greedy heuristic. This heuristic leaves outsome empty intervals with no data points, and uses a binarysearch to determine the minimal interval width such that tintervals of such width can cover the rest of the data domain.This algorithm handles sparse data better, but the equi-width intervals still do not handle skewed data well. Figure 2shows two intervals created by MinWidth which are stillunbalanced (one has 5 points and the other has 3). The twopoints p∗ and s∗ are still in the same bucket.

Algorithm 2: FindBreakpoints

Input: D = p1, p2, ..., pn: a d-dimensional database; t:number of intervals; n : size of D; i: dimensionnumber for which the breakpoints are to becomputed; cτi : the largest utility in dimension i.

Output: bps[i]: an array of t pairs of breakpoints.1 Sort D ascendingly on dimension i; let the sorted point

sequence be p′1, p′2, ..., p

′n;

2 hi← 0, δ ← 0;3 while hi 6= n do4 lo← 1;5 for j = 1, 2, ..., t do6 hi′ ← minlo+ ⌈n/t⌉ − 1 + δ, n;7 Find the largest hi ∈ [lo..hi′] such that

p′hi.ci − p′

lo.ci≤

cτit;

8 bps[i][j].lo← p′lo.ci;

9 bps[i][j].hi← p′hi.ci;

10 lo← hi+ 1;

11 δ ← δ + inc;

12 return bps[i];

To overcome these limitations, our FindBreakpoints algo-rithm adaptively uses t variable-width intervals such that

the number of points in each interval is as close to ⌈n

t⌉ as

possible (cf. Figure 2, where the two gray intervals created

5

by MinVar contain ⌈8

2⌉ = 4 points each; p∗ will be selected

to represent its bucket). To help derive the maximum regretratio bounds in the following subsections, we also require

that the width of each interval does not exceedcτit.1 Under

this constraint it is not always possible to create t intervals

with exactly ⌈n

t⌉ points in each interval. We allow ⌈

n

t⌉+ δ

data points in each interval, where δ is a parameter that willbe adaptively chosen by the algorithm. At start δ = 0.

Algorithm 2 summarizes the FindBreakpoints algorithm.This algorithm first sorts the data points in ascending orderof their coordinates in dimension i. The sorted points aredenoted as p′1, p

′2, ..., p

′n (Line 1). The algorithm then creates

t intervals, where lo and hi represent the subscript lowerand upper bounds of the data points to be put into oneinterval, respectively. At start, lo = 1 (Line 4). Betweenlo and lo + ⌈n

t⌉ − 1 + δ, the algorithm finds the largest

subscript hi such that p′hi.ci−p′lo.ci is bounded bycτit. Then

we have obtained the two breakpoints of the first intervalbps[i][1].lo = p′lo.ci and bps[i][1].hi = p′hi.ci, where bps[i] isan array to store the intervals in dimension i. We updatelo to be hi + 1, and repeat the process above to create thenext interval (Lines 5 to 10). When t intervals are created,if they cover all the n points, we have successfully createdthe intervals for dimension i. Otherwise, we need to allowa larger number of points in one interval. We increase δby inc which is a system parameter (Line 11), and repeatthe above procedure to create t intervals until n points arecovered. Then we return the interval array bps[i] (Line 12).Note that the algorithm always terminates, because whenδ increases to n, the algorithm will simply create intervals

with widthcτit. The t intervals must cover the entire data

domain and hence cover all n points.Complexity. FindBreakpoints uses a database D of n

points and an array bps[i] to store t intervals. The space

complexity is O(n+ t) where t = ⌊(k− d+1)1

d−1 ⌋ is usuallysmall. The inner loop of FindBreakpoints (Lines 5 to 10)has t iterations. In each iteration, computing hi requiresa binary search between p′lo and p′hi′ , which takes O(log n)time. Thus, The inner loop takes O(t log n) time. The outer

loop hasn

inciterations in the worst case. Together, Find-

Breakpoints takes O(tn log n

inc) time.

MinVar uses a database D of n points, an answer set S ofsize k, a (d−1)×t two dimensional array bps. An array of sizetd−1 = k−d+1 is also needed to help select the points s∗ inthe k−d+1 buckets. The space complexity is O(n+k+dt).The first loop of MinVar (Lines 2 to 5) takes O(nd) time.The second loop (Lines 7 and 8) calls FindBreakpoints for

d− 1 times, which takes O(tdn log n

inc) time. The third loop

(Lines 9 to 12) finds a point s∗ in each of the k − d + 1buckets. A linear scan on the database D is needed forthis task. For each point p visited, we need a binary searchon each of the d − 1 arrays bps[i] to identify the bucketof p, and to update the point selected s∗ in that bucket if

1It should becτi −1

tto be exact where 1 is the lower bound

of the data domain. We writecτit

here to keep it consistentwith CUBE and to ease the discussion.

needed. This takes O(nd log t) time. Overall, MinVar takes

O(nd+tdn log n

inc+nd log t) time. Here,

n

incis a controllable

parameter of the system. In the experiments, we set inc =0.01%n. The time complexity then simplifies to O(nd log t).

4.2 Upper BoundWe derive an upper bound for the maximum regret ratio

of a set S returned by MinVar.

Theorem 1. Let F = f |f(pi) =∏d

j=1 pi.cαj

j be a set

of MUFs, where αj ≥ 0,∑d

j=1 αj ≤ 1, and 1 ≤ pi.cj ≤ 2.

The maximum regret ratio mr ratioD(S ,F) of an answerset S returned by MinVar satisfies

mr ratioD(S ,F) ≤ ln(1 +1

t)

Here, t = ⌊(k − d+ 1)1

d−1 ⌋.

Proof. We prove the theorem by showing that for eachfunction f ∈ F , the regret ratio r ratioD(S , f) ≤ ln(1 + 1

t),

which means that the maximum regret ratio of F must beless than or equal to ln(1 + 1

t).

Let p∗ be the point in D with the largest utility computedby f , i.e.,

p∗ = argmax

pi∈D

f(pi)

Let s∗ be the point in S that is selected by MinVar in thesame bucket where p∗ lies in. We have:

regretD(S , f) = maxpi∈D

f(pi)−maxpi∈S

f(pi)

≤ f(p∗)− f(s∗)

=d∏

j=1

p∗.c

αj

j −d∏

j=1

s∗.c

aj

j

= exp (lnd∏

j=1

p∗.c

αj

j )− exp (lnd∏

j=1

s∗.c

aj

j )

Since g(x) = ex is convex, based on Lagrange’s Mean ValueTheorem, we have ex − ey ≤ (x− y) · ex where x ≥ y. Thus,

regretD(S , f) ≤ (ln

d∏

j=1

p∗.c

αj

j − ln

d∏

j=1

s∗.c

αj

j ) ·d∏

j=1

p∗.c

αj

j

= (

d∑

j=1

αj ln p∗.cj −

d∑

j=1

αj ln s∗.cj) ·

d∏

j=1

p∗.c

αj

j

=

[

d∑

j=1

αj(ln p∗.cj − ln s∗.cj)

]

·d∏

j=1

p∗.c

αj

j

Since MinVar selects the point in a bucket with the largestvalue in dimension d, we know that p∗.cd ≤ s∗.cd and henceln p∗.cd ≤ ln s∗.cd, i.e., ln p

∗.cd − ln s∗.cd ≤ 0. Thus, we canremove the utility in dimension d from the computation andrelax the regret to be:

regretD(S , f) ≤

[

d−1∑

j=1

αj(ln p∗.cj − ln s∗.cj)

]

·d∏

j=1

p∗.c

αj

j

=

[

d−1∑

j=1

αj ln(p∗.cj

s∗.cj)

]

·d∏

j=1

p∗.c

αj

j

6

Since s∗ is selected from the same bucket where p∗ lies in,p∗.cj − s∗.cj must be constrained by the bucket size in di-

mension j, which iscτj − 1

twhere cτj and 1 are the largest

and smallest utility values in dimension j, i.e.,

∀j ∈ [1..d − 1], p∗.cj − s∗.cj ≤

cτj − 1

t

Thus,

p∗.cj

s∗.cj≤ 1 +

cτj − 1

t · s∗.cj

Since 1 < s∗.cj ≤ cτj ≤ 2, we have

p∗.cj

s∗.cj< 1 +

1

t

Therefore,

regretD(S , f) <

[

d−1∑

j=1

αj ln(1 +1

t)

]

·d∏

j=1

p∗.c

αj

j

For the regret ratio r ratioD(S , f) we have

r ratioD(S , f) =regretD(S , f)

gain(D, f)

<

[

∑d−1j=1 αj ln(1 +

1t)]

·∏d

j=1 p∗.c

αj

j

∏d

j=1 p∗.c

αj

j

=

d−1∑

j=1

αj ln(1 +1

t) = ln(1 +

1

t)∑d−1

j=1αj

≤ ln(1 +1

t)

In the theorem, t = ⌊(k − d+ 1)1

d−1 ⌋. This means thatas k increases, the maximum regret ratio is expected to de-crease; when d increases, the maximum regret ratio is ex-pected to increase. Intuitively, if we return more points (alarger k), then the probability of returning the points thatbetter satisfy the utility functions is higher, and hence theregret ratios would be smaller. If there are more dimen-sions (a larger d), then more regret may be accumulatedover the dimensions, and hence the regret ratios would belarger. This shows that the upper bound obtained has agood explanatory power. For simplicity, we say that theupper bound grows in a scale of O(ln(1 + 1

k1

d−1

)).

To give an example, consider F = f3, f4 where f3(pi) =pi.c

0.51 · pi.c

0.52 and f4(pi) = pi.c

0.991 · pi.c

0.012 . Then d = 2.

Let k = 3, which means t = 2. The upper bound of themaximum regret ratio ln(1 + 1

t) = ln 3

2≈ 40.54%. As k

increases (e.g., to 20), this upper bound will decrease (e.g.,to ln 20

19≈ 5.13%).

4.3 Lower BoundWe derive a lower bound of the maximum regret ratio by

showing that, given a family of MUFs F , it is impossible to

bound the regret ratio to below Ω(1

k2) for a database D of

2-dimensional points (i.e., d = 2).

Theorem 2. Given k > 0, there must be a database Dof 2-dimensional points such that the maximum regret ratio

θ0

θ1

θj

θj+1

θ*

θk+1 θ

kθi

Δ

Figure 3: Lower bound illustration

of any size-k subset S ⊆ D over a family of MUFs F is at

least Ω(1

k2).

Proof. We assume a data space of (1, e] × (1, e] in thisproof. Consider an infinite set D of 2-dimensional points,where each point p satisfies

p.c1 = ecos θ

p.c2 = esin θ 0 < θ ≤π

2

Given a size-k subset S ⊆ D, each point si ∈ S corresponds

to a θi ∈ (0,π

2] where si.c1 = ecos θi and si.c2 = esin θi .

Assume that the points s1, s2, ..., sk in S are sorted in as-cending order of their corresponding θi values, i.e., 0 < θ1 ≤

θ2 ≤ ... ≤ θk ≤π

2. Further, let θ0 = 0 and θk+1 =

π

2. Con-

sider a polar coordinate system as illustrated in Figure 3.Then θi (i ∈ [0..k + 1]) can be represented as a point on aunit circle. Based on the pigeonhole principle, there mustbe some j (j ∈ [0..k]) such that

θj+1 − θj ≥π

2(k + 1)

Let θ∗ be in the middle of θj and θj+1, i.e.,

θ∗ =

θj + θj+1

2

We construct an MUF f where the optimal point p∗ corre-sponds to θ∗, i.e., p∗.c1 = ecos θ

∗

and p∗.c2 = esin θ∗ , andprove the theorem based on the regret ratio of f .

Consider an MUF f(p) = p.ccos θ∗

1 · p.csin θ∗

2 .

ln f(p) = ln (p.ccos θ∗

1 · p.csin θ∗

2 )

= cos θ∗ · ln p.c1 + sin θ∗ · ln p.c2

= cos θ∗ · cos θ + sin θ∗ · sin θ

Let g(θ) = cos θ∗ ·cos θ+sin θ∗ ·sin θ. By letting g′(θ) = 0 weobtain θ = θ∗, which means that ln f(p) is maximum whenθ = θ∗, and f(p) is maximum when p = p∗.

ln f(p∗) = cos2 θ∗ + sin2θ∗ = 1; f(p∗) = e.

Meanwhile, let si be the optimal point for f in S . Sincethere is no other points in S that is between θj and θj+1,

|θi − θ∗| = ∆ ≥ θj+1 − θ

∗ = θ∗ − θj ≥

π

4(k + 1)

7

We consider the case where θi − θ∗ = ∆. The other casewhere θ∗− θi = ∆ is symmetric. We omit it for conciseness.

ln f(si) = ln (si.ccos θ∗

1 · si.csin θ∗

2 )

= cos(θ∗ +∆) · cos θ∗ + sin(θ∗ +∆) · sin θ∗

= (cos θ∗ cos∆− sin θ∗ sin∆) · cos θ∗+

(sin θ∗ cos∆ + cos θ∗ sin∆) · sin θ∗

= cos2 θ∗ · cos∆ + sin2θ∗ · cos∆

= cos∆

Thus, f(si) = ecos∆, and r ratioD(S , f) satisfies

r ratioD(S , f) =f(p∗)− f(si)

f(p∗)=

e− ecos ∆

e= 1− e

cos∆−1

Based on the Maclaurin series, we have

ecos ∆−1 = 1−

∆2

2+

∆4

6− · · ·

Thus,

r ratioD(S , f) =∆2

2−

∆4

6+ · · ·

We already know that

∆ ≥π

4(k + 1)

Therefore,

r ratioD(S , f) ≥π2

32(k + 1)2− o(

1

k4)

This means that r ratioD(S , f) is at least Ω(1

k2).

4.4 Heuristic to Lower the Regret RatioThe upper bound derived in Section 4.2 suggests that the

maximum regret ratio decreases with the increase of t. Thisparameter decides the number of buckets from which thepoints in the answer set S are selected. The parameter itselfis determined by d and k together, which are fixed once adatabase D and a k-regret query is given. In this subsection,we propose a heuristic to increase the value of t withoutchanging d or k, aiming to obtain lower regret ratios.

This heuristic is based on the dominate relationship [1].Let pi and pj be two points in a set S where i 6= j. Point piis said to dominate point pj if and only if ∀l ∈ [1..d], pi.cl ≥pj .cl. If pi dominates pj , then f(pi) ≥ f(pj) holds for anyMUF f . Thus, once pi has been added to S , pj can bediscarded. We call pj a redundant point. Based on thisobservation, we modify the MinVar algorithm to remove theredundant points in S , aiming to obtain a redundant freeset S . We call this modified algorithm RF-MinVar.

As summarized in Algorithm 3, RF-MinVar uses the sameprocedure as that of MinVar to compute an answer set Sbased on a given t (Lines 1 to 5 and 8 to 14). After that,RF-MinVar removes the redundant points from S (Line 15).This is done by a simple scan to check for any points beingdominated. When the redundant points are removed, theremay be a few vacancies open in S . To fill up the vacancies,we wrap the procedure of computing S with a loop (Line 7).In each iteration we increase the value of t by 1 (Line 8),which creates more buckets and leads to more points beingadded to S . If |S| ≥ k after removing the redundant points,

or a predefined number of iterations itrmax is reached, theloop terminates. Then, if |S| < k, we randomly select pointsin D to fill up S (Line 17); if |S| > k, only the first k pointsare kept. We then return the set S (Line 18).

Algorithm 3: RF-MinVar

Input: D = p1, p2, ..., pn: a d-dimensional database; k:the size of the answer set.

Output: S: a size-k subset of D.1 S ← ∅;2 for i = 1, 2, ..., d− 1 do

3 Find p∗i which has the largest utility p∗i .ci in dimension i;4 cτi ← p∗i .ci;5 S ← S ∪ p∗i ;

6 itr ← 0;7 while |S| < k and itr < itrmax do

8 t← ⌊(k − d+ 1)1

d−1 ⌋ + itr;9 for i = 1, 2, ..., d− 1 do

10 bps[i]← F indBreakpoints(D, t, n, i, cτi );

11 for each (d − 1)-integer combination1 ≤ j1 ≤ t, 1 ≤ j2 ≤ t, ...,1 ≤ jd−1 ≤ t do

12 B ← p ∈ D|∀i ∈ [1..d− 1] : bps[i][ji].lo ≤ p.ci ≤bps[i][ji].hi;

13 s∗ ← argmaxp∈B p.cd;14 S ← S ∪ s∗;

15 Eliminate redundant points from S;16 itr ++;

17 S ← S ∪ k − |S| random points in D that are not in S;18 return First k points in S;

Discussion. RF-MinVar keeps the non-dominated points

in S computed when t = ⌊(k − d+ 1)1

d−1 ⌋. Thus, it retainsthe upper bound of the maximum regret ratio derived inSection 4.2. RF-MinVar can reduce the actual regret ratiosobtained, as it may also return the points computed withlarger t values. The extra cost to scan and remove the re-dundant points is minimal, as k is usually a very small num-ber (e.g., less than 100). The main cost of RF-MinVar is inthe loop to iterate through multiple values of t and computeS . In the experiments, we use itrmax to control the maxi-mum number of iterations. We observe that itrmax = 11 issufficient in the data sets tested.

5. CASE STUDIESIn this section showcase the applicability of the MinVar al-

gorithm by deriving the maximum regret ratio bounds whenapplying MinVar on k-regret queries with two special typesof utility functions, the Cobb-Douglas function and the Con-stant Elasticity of Substitution (CES) function.

5.1 The K-Regret Query with Cobb-DouglasFunctions

The Cobb-Douglas function was first proposed as a pro-duction function to model the relationship between multipleinputs and the amount of output generated [4]. It was latergeneralized [24] and used as a utility function [6].

Definition 8 (Cobb-Douglas function). A general-ized Cobb-Douglas function with d inputs x1, x2, ..., xd is amapping X : Rd

+ → R+,

X (x1, x2, ..., xd) = A

d∏

j=1

xαj

j

Here, A > 0 and αj ≥ 0 are the function parameters.

8

The generalized Cobb-Douglas function is very similar to theMUF introduced in Section 3. The d inputs here can be seenas a data point of d dimensions and input xj is the utilityin dimension j. MinVar can process the k-regret query withCobb-Douglas functions straightforwardly.

To derive an upper bound of the maximum regret ratiofor a set of Cobb-Douglas functions F = X1,X2, ...,Xn,we transform each function Xi to an MUF by scaling theparameter A to 1. It can be shown straightforwardly thatthis scaling does not affect the regret ratio or the maximumregret ratio. Assuming that xj has been normalized into therange of (1, 2]. Then the regret ratio upper bound derivedin Section 4.2 applies to the function Xi, i.e.,

r ratioD(S ,Xi) ≤ ln(1 +1

t)∑d−1

j=1αj

Here, each function Xi has a different set of parametersα1, α2, ..., αd. If

∑d

j=1 αj ≤ 1 holds for every Xi ∈ F ,then the maximum regret ratio is bounded by


t)

Otherwise, the maximum regret ratio is bounded by


t)α

τ

,

ατ = max∑d−1

j=1 Xi.αj |Xi ∈ F ,Xi.αj is a parameter of Xi.

Similarly, the lower bound Ω( 1k2 ) of the maximum regret

ratio derived in Section 4.3 also applies.

5.2 The K-Regret Query with CES FunctionsThe CES function is closely related to the Cobb-Douglas

function. It is also used as a production function as well asa utility function [22, 23].

Definition 9 (CES function). A generalized CES func-tion with d inputs x1, x2, ..., xd is a mapping X : Rd

+ → R+,

X (x1, x2, ..., xd) = A(d

∑

j=1

αjxρj )

γρ

Here, A > 0, αj ≥ 0, ρ < 1 (ρ 6= 0), and γ > 0 are thefunction parameters.

When ρ approaches 0 in the limit, the CES function willbecome a Cobb-Douglas function.

The MinVar algorithm can also process k-regret querieswith CES utility functions. To derive bounds for the max-imum regret ratio, we simplify and rewrite the CES func-tion X as a function f in the following form (assuming thatA = γ = 1):

f(pi) = (

d∑

j=1

αj · pi.cbj)

1

b

Here, 0 < b < 1 and αj ≥ 0.It has been shown in an earlier paper [10] that the maxi-

mum regret ratio for k-regret queries with CES utility func-

tions is bounded between Ω(1

bk2) and O(

1

bkb

d−1

). The lower

bound also applies to our MinVar algorithm. In what fol-lows, we derive a new upper bond which is tighter.

We first derive a new upper bound for the regret ratio fora single CES utility function.

Theorem 3. Let f(pi) = (∑d

j=1 αj · pi.cbj)

1

b be a CESutility function, where 0 < b < 1 and αj ≥ 0. The regretratio r regretD(S , f) of a set S returned by MinVar satisfies

r ratioD(S , f) ≤(d− 1)

1

b

t+ (d− 1)1

b

Proof. Let p∗ be the point in D with the largest utilitycomputed by f , and s∗ be the point in S that is selected inthe same bucket where p∗ lies in. We have:

regretD(S , f) = maxpi∈D

f(pi)−maxpi∈S

f(pi)

≤ f(p∗)− f(s∗)

= (d

∑

j=1

αj · p∗.c

bj)

1

b − (d

∑

j=1

αj · s∗.c

bj)

1

b

Since g(x) = x1

b is convex when 0 < b < 1, we have g(x)−g(y) ≤ (x− y)g′(x). Thus,

regretD(S , f)

≤ (d

∑

j=1

αj · p∗.c

bj −

d∑

j=1

αj · s∗.c

bj) ·

1

b· (

d∑

j=1

αj · p∗.c

bj)

1

b−1

=1

b

[

d∑

j=1

αj(p∗.c

bj − s

∗.c

bj)

]

(d

∑

j=1

αj · p∗.c

bj)

1

b−1

≤1

b

[

d−1∑

j=1

αj(p∗.c

bj − s

∗.c

bj)

]

(d−1∑

j=1

αj · p∗.c

bj)

1

b−1

Consider another function g(x) = xb, which is concave when0 < b < 1, and g′(x) is monotonically decreasing. Accordingto Lagrange’s Mean Value Theorem, there must exist someξ between two values x and y, such that xb − yb = (x− y) ·b · ξb−1. Thus, we have

p∗.c

bj − s

∗.c

bj ≤ |p∗.cj − s

∗.cj | · b · (minp∗.cj , s

∗.cj)

b−1

≤cτj

t· b · cτj

b−1 =b

tcτjb

Thus,

regretD(S , f)

≤1

b(

d−1∑

j=1

αjb

tcτjb)(

d−1∑

j=1

αj · p∗.c

bj)

1

b−1

≤(d− 1)

t· maxj∈[1..d−1]

αjcτjb ·

[

(d− 1) · maxj∈[1..d−1]

αjcτjb

] 1

b−1

=(d− 1)

1

b

t( maxj∈[1..d−1]

αjcτjb)

1

b

≤(d− 1)

1

b

t( maxj∈[1..d−1]

d

∑

l=1

αl · p∗j .cl

b)1

b

=(d− 1)

1

b

tmax

j∈[1..d−1](

d∑

l=1

αl · p∗j .cl

b)1

b

Let σ =(d− 1)

1

b

t. Then,

regretD(S , f) ≤ σ maxj∈[1..d−1]

(d

∑

l=1

αl · p∗j .cl

b)1

b

9

10 12 14 16 18 20 22

k

0

10-5

10-4

10-3

10-2

10-1

RR

(a) Cobb-Douglas

10 12 14 16 18 20 22

k

10-4

10-3

10-2

10-1

100

RR

RF-MinVarMinVarMaxDomArea-Greedy

(b) CES

Figure 4: Varying k (NBA)

10 13 16 19 22 25 28 31 34

k

0

10-2

10-1

100

RR

(a) Cobb-Douglas

10 13 16 19 22 25 28 31 34

k

0

10-4

10-2

100

RR


(b) CES

Figure 5: Varying k (Stocks)

Since p∗j (j ∈ [1..d − 1]) is in S , we have

maxpi∈S

f(pi) ≥ maxj∈[1..d−1]

f(p∗j ) = maxj∈[1..d−1]

(d

∑

l=1

αl · p∗j .cl

b)1

b

=1

σregretD(S , f).

The regret ratio r regretD(S , f) is hence bounded by

r regretD(S , f) =regretD(S , f)

regretD(S , f) + maxpi∈S f(pi)

=1

1 +maxpi∈S f(pi)

regretD(S,f)

≤1

1 + 1σ

=σ

1 + σ=

(d− 1)1

b

t+ (d− 1)1

b

Therefore, given a set F of CES functions, the maximumregret ratio mr regretD(S ,F) satisfies:

mr regretD(S ,F) ≤(d− 1)

1

b

t+ (d− 1)1

b

We can see from this bound that, when k decreases or d

increases, the maximum regret ratio is expected to increase.For simplicity, we say that this bound grows in a scale ofO( 1

k1

d−1

). This bound is tighter than the bound O( 1

bkb

d−1

)

obtained in [10] since 0 < b < 1.To give an example, consider F = f5, f6 where f5(pi) =

(0.5pi.c0.51 +0.5pi.c

0.52 )2 and f6(pi) = (0.99pi.c

0.51 +0.01pi.c

0.52 )2.

Then d = 2. Let k = 3, which means t = 2. The up-

per bound of the maximum regret ratio (d−1)1

b

t+(d−1)1

b

= 12+1

≈

33.33%. As k increases (e.g., to 20), this upper bound willdecrease (e.g., to 1

19+1= 5%).

6. EXPERIMENTSIn this section we evaluate the empirical performance of

the two proposed algorithms MinVar and RF-MinVar.

6.1 SettingsAll the algorithms are implemented in C++, and the ex-

periments are conducted on a computer running the OSX 10.11 operating system with a 64-bit 2.7 GHz IntelR©

Core(TM) i5 CPU and 8 GB RAM.

Table 3: Experimental Settings

Parameter Values Default

Utility function Cobb-Douglas, CES -Data set NBA, Stocks, Anti-correlated -

n 100, 1000, 10000, 100000, 1000000 10000d 2..10 3k 10..34 20

Both real and synthetic data sets are used in the experi-ments. The real data sets used are theNBA2 and the Stocks3

data sets, which have been used in previous studies on k-regret queries [17, 18]. After filtering out data points withnull fields, we obtain 20,640 data points of 7 dimensions inthe NBA data set. The Stocks data set contains 122,574data points of 5 dimensions.

The synthetic data sets are generated by the anti-correlateddata set generator [1], which is a popular data generator usedin skyline query studies [15, 16, 21]. We generate syntheticdata sets with cardinality ranging from 100 to 1,000,000 anddimensionality ranging from 2 to 10. We vary the query pa-rameter k from 10 to 34 in the experiments. Table 3 sum-marizes the parameters and the values used.

In each set of experiments, we randomly generate 10,000different sets of parameters αi|i ∈ [1..d], αi ∈ [0, 1] foreach of the generalized Cobb-Douglas function and the CESfunction, where

∑d

i=1 αi = 1. The CES function has anextra parameter b. We generate random values of b in therange of [0.1, 0.9]. We run the algorithms on the data sets,and report the maximum regret ratio on the generated util-ity functions, denoted as RR in the result figures.

Five algorithms are tested in the experiments:

• MinVar is the algorithm proposed in Section 4.1. Weuse inc = 0.01%n to control the number of iterationsthat the sub-algorithm FindBreakpoints runs for.

• RF-MinVar is the improved MinVar algorithm pow-ered by the heuristic proposed in Section 4.4. We findthat itrmax = 11 is sufficient to handle the data setstested. The results obtained are based on this setting.

• MinWidth is an algorithm proposed by Kessler Faulkneret al. [10] with bounded maximum regret ratios for k-regret queries with CES functions.

2http://www.databasebasketball.com3http://pages.swcp.com/stocks

10

10 12 14 16 18 20 22

k

0.02

0.04

0.06

0.08

0.10

0.12

0.14

0.16

0.18

0.20

RR


(a) Cobb-Douglas

10 12 14 16 18 20 22

k

0.02

0.04

0.06

0.08

0.10

0.12

0.14

0.16

0.18

0.20

RR

(b) CES

Figure 6: Varying k (Anti-correlated)

2 3 4 5 6 7 8 9 10

d

0 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20

RR


(a) Cobb-Douglas

2 3 4 5 6 7 8 9 10

d

0 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20

RR

(b) CES

Figure 7: Varying d (Anti-correlated)

102 103 104 105 106

n

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

RR


(a) Cobb-Douglas

102 103 104 105 106

n

0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10

RR

(b) CES

Figure 8: Varying n (Anti-correlated)

2 4 6 8 10

d

10-3

10-2

10-1

100

101

102

Tim

e (s

)

(a) Varying d

102 103 104 105 106

n

10-4

10-3

10-2

10-1

100

101

102

103

Tim

e (s

)

RF-MinVarMinVarMaxDomArea-GreedyMinWidth

(b) Varying n

Figure 9: Running time tests (Anti-correlated)

• Area-Greedy is a greedy algorithm proposed by KesslerFaulkner et al. [10] with good practical maximum re-gret ratios (but no bounds) for CES functions. Notethat we do not compare with Angle [10] which is an-other greedy algorithm because it has been shown toproduce larger regret ratios than those of Area-Greedy.

• MaxDom is a greedy algorithm proposed by Lin etal. [12] that returns the k representative skyline pointswhich dominate the largest number of other points.

6.2 ResultsEffect of k. We first test the effect of k on the maxi-

mum regret ratio. Figures 4 and 5 show the result where kis varied from 10 to 34 on the two real data sets. Note thatFigure 4 shows up to k = 22 because the maximum regret ra-tios of the algorithms have become stable beyond this point.We also omit MinWidth as it is known to produce largermaximum regret ratios than those of Area-Greedy [10].

We can see from these two figures that, as k increases,the maximum regret ratios of the algorithms either decreaseor stay stable. This confirms the upper bound derived. Itis expected as a larger k means more points are returnedand a higher probability of satisfying more utility functions.On the NBA data set, the proposed algorithm MinVar out-performs the baseline algorithms by more than an order ofmagnitude on the Cobb-Douglas functions, and more than5 times on the CES functions (note the logarithmic scale).This is because the algorithm selects the k points from amuch more evenly divided data space, and the points se-lected are much closer to the optimal points for the variousutility functions. Powered by the redundant point remov-ing heuristic, RF-MinVar achieves even smaller maximum

regret ratios, as it avoids selecting the dominated pointswhich do not contribute to the regret ratios. On the Stocksdata set, similar patterns are observed. MinVar outperformsthe baseline algorithms in most cases, and RF-MinVar out-performs the baseline algorithms in all cases tested. Theseconfirm the superiority of the proposed algorithms.

In Figure 6 we show the result of varying k on a syntheticdata set. On this data set, MaxDom obtains the smallestmaximum regret ratios. This is because the data points inthis synthetic data set follow an anti-correlated distribution.The data points tend not to be dominated by each other andtend to be optimal for different utility functions. It is intrin-sically difficult to find a small number of data points that areoptimal for a large number of utility functions under suchdata distribution. MaxDom is designed to handle this typeof data while our algorithms are not. Note that, even undersuch an extreme case, the maximum regret ratios obtainedby the proposed algorithms are still very close to those ob-tained by MaxDom, i.e., the difference is less than 0.04 (cf.Figure 6 (b)). Also, MaxDom does not have a bound on themaximum regret ratio obtained while our algorithms do. Inthe meantime, the proposed algorithms both again outper-form the other baseline algorithm Area-Greedy.

Effect of d. Next we test the algorithm performance byvarying the number of dimensions d from 2 to 10 with syn-thetic data. Figure 7 shows the result. Similar to Figure 6,on the synthetic data, MaxDom obtains the smallest max-imum regret ratios, while those of the proposed algorithmsare very close. Another observation is that, as d increases,the maximum regret ratios increase overall for all the algo-rithms. This again confirms the bounds obtained, and isexpected as the difference between the optimal points in Dand S accumulates as there are more dimensions. Fluctu-

11

ations are observed in the maximum regret ratios. This isbecause the maximum regret ratios are quite small already(i.e., less than 0.20). A change in the data set may have arandom impact on the maximum regret ratios, even thoughthe overall trend is increasing.

Effect of n. We further test the scalability of the pro-posed algorithms by varying the data set cardinality from100 to 1,000,000. The comparative performance of the algo-rithms is shown in Figure 8, which again is similar to thatshown in Figure 7. Note that, while the bounds of the maxi-mum regret ratio do not change when n changes, the actualregret ratios obtained by the algorithms may change as nchanges. This is natural as the optimal points have a smallerprobability to be returned when there are more points in thedata set. However, this change in the actual regret ratios isstill bounded. We observe that, for the synthetic data setswith up to 1,000,000 data points, the maximum regret ratiosof the proposed algorithms are below 0.10. This confirms thescalability of the proposed algorithms.

Running time results. We also show the running timeof the algorithms on the synthetic data sets in Figure 9.Note that the utility functions are irrelevant in this set ofexperiments as the algorithms do not rely on any particularutility function. We see that the proposed algorithms are thefastest in almost all the cases tested. MinWidth is the onlybaseline algorithm with a relatively close performance, butit has much larger maximum regret ratios. The proposedalgorithms outperform the other two baseline algorithms bymore than an order of magnitude. MaxDom is the slowest.It takes almost 500 seconds to process 1,000,000 data points(cf. Figure 9 (b)), while the proposed algorithms can finishin just over a second. This again confirms the scalability ofthe proposed algorithms.

7. CONCLUSIONSWe overcame the barrier of multiplicative utility functions

and presented an algorithm named MinVar to process k-regret queries with such functions. We showed that MinVarcan produce query answers with a bounded maximum re-gret ratio. In particular, when applied on k-regret querieswith Cobb-Douglas functions, MinVar achieved a maximum

regret ratio bounded between Ω(1

k2) and O(ln(1 +

1

k1

d−1

));

when applied on k-regret queries with Constant Elasticityof Substitution functions, MinVar achieved a maximum re-

gret ratio bounded between Ω(1

bk2) and O(

1

k1

d−1

). We fur-

ther proposed a heuristic to lower the maximum regret ra-tio obtained by MinVar, resulting in an improved algorithmnamed RF-MinVar. We performed extensive experimentsusing both real and synthetic data to evaluate the perfor-mance of both algorithms. The results showed that the re-gret ratios of the answers produced by MinVar and RF-MinVar are constantly small. Meanwhile, both algorithmsare more efficient than the baseline algorithms.

This study opens up opportunities to explore k-regretqueries with various other types of multiplicative utility func-tions. It would also be interesting to see how k-regret querieswith a mix of both additive and multiplicative utility func-tions can be answered with a bounded regret ratio.

AcknowledgmentsWe thank Dr. Ashwin Lall for sharing the code of MinWidthand Area-Greedy, and providing feedback on our work.

8. REFERENCES[1] S. Borzsony, D. Kossmann, and K. Stocker. The skyline

operator. In ICDE, pages 421–430, 2001.[2] D. Braziunas and C. Boutilier. Minimax regret based

elicitation of generalized additive utilities. In The 23rdConference on Uncertainty in Artificial Intelligence, pages25–32, 2007.

[3] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. K. Tung, andZ. Zhang. On high dimensional skylines. In EDBT, pages478–495, 2006.

[4] P. H. D. Charles W. Cobb. A theory of production. TheAmerican Economic Review, 18(1):139–165, 1928.

[5] S. Chester, A. Thomo, S. Venkatesh, and S. Whitesides.Computing k-regret minimizing sets. PVLDB,7(5):389–400, 2014.

[6] P. Diamond, L. Helms, and J. Mirrlees. Optimal taxationin a stochastic economy. Journal of Public Economics,14(1):1 – 29, 1980.

[7] J. Huang, Z. Wen, J. Qi, R. Zhang, J. Chen, and Z. He.Top-k most influential locations selection. In CIKM, pages2377–2380, 2011.

[8] I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey oftop-k query processing techniques in relational databasesystems. ACM Computing Surveys, 40(4):11:1–11:58, 2008.

[9] R. L. Keeney and H. Raiffa. Decisions with multipleobjectives: preferences and value trade-offs. Cambridgeuniversity press, 1993.

[10] T. Kessler Faulkner, W. Brackenbury, and A. Lall. k-regretqueries with nonlinear utilities. PVLDB, 8(13):2098–2109,2015.

[11] K. Ligett and G. Piliouras. Beating the best Nash withoutregret. SIGecom Exchanges, 10(1):23–26, 2011.

[12] X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang. Selecting stars:The k most representative skyline operator. In ICDE, pages86–95, 2007.

[13] D. Nanongkai, A. Lall, A. D. Sarma, and K. Makino.Interactive regret minimization. In SIGMOD, pages109–120, 2012.

[14] D. Nanongkai, A. D. Sarma, A. Lall, R. J. Lipton, andJ. Xu. Regret-minimizing representative databases.PVLDB, 3(1-2):1114–1124, 2010.

[15] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal andprogressive algorithm for skyline queries. In SIGMOD,pages 467–478, 2003.

[16] J. Pei, B. Jiang, X. Lin, and Y. Yuan. Probabilistic skylineson uncertain data. In VLDB, pages 15–26, 2007.

[17] P. Peng and R. C.-W. Wong. Geometry approach fork-regret query. In ICDE, pages 772–783, 2014.

[18] P. Peng and R. C.-W. Wong. k-hit query: Top-k query withprobabilistic utility function. In SIGMOD, pages 577–592,2015.

[19] M. A. Soliman, I. F. Ilyas, and K. C. C. Chang. Top-kquery processing in uncertain databases. In ICDE, pages896–905, 2007.

[20] K.-L. Tan, P.-K. Eng, and B. C. Ooi. Efficient progressiveskyline computation. In VLDB, pages 301–310, 2001.

[21] Y. Tao, L. Ding, X. Lin, and J. Pei. Distance-basedrepresentative skyline. In ICDE, pages 892–903, 2009.

[22] H. R. Varian. Microeconomic analysis. Norton & Company,1992.

[23] A. D. Vılcu and G. E. Vılcu. On some geometric propertiesof the generalized CES production functions. AppliedMathematics and Computation, 218(1):124–129, 2011.

[24] G. E. Vılcu. A geometric perspective on the generalizedCobb-Douglas production functions. Applied MathematicsLetters, 24(5):777–783, 2011.

[25] T. Xia, D. Zhang, and Y. Tao. On skylining with flexibledominance relation. In ICDE, pages 1397–1399, 2008.

[26] S. Zeighami and R. C.-W. Wong. Minimizing average regretratio in database. In SIGMOD, pages 2265–2266, 2016.

12

Documents

K-Regret Queries: From Additive to Multiplicative Utilities · arXiv:1609.07964v3 [cs.DB] 1 Oct 2016 K-Regret Queries: From Additive to Multiplicative Utilities Jianzhong Qi1, Fei