21
Finding an efficient rewriting of OLAP queries using materialized views in data warehouses Chang-Sup Park * , Myoung Ho Kim, Yoon-Joon Lee Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology, 373-1 Kusung-dong, Yusung-gu, Taejon, 305-701, South Korea Received 1 January 2001; received in revised form 1 March 2001; accepted 1 June 2001 Abstract OLAP queries involve a lot of aggregations on a large amount of data in data warehouses. To process expensive OLAP queries efficiently, we propose a new method to rewrite a given OLAP query using various kinds of materialized views which already exist in data warehouses. We first define the normal forms of OLAP queries and materialized views based on the selection and aggregation granularities, which are derived from the lattice of dimension hierarchies. Conditions for usability of materialized views in rewriting a given query are specified by relationships between the components of their normal forms. We present a rewriting algorithm for OLAP queries that can effectively utilize materialized views having different selection granularities, selection regions, and aggregation granularities together. We also propose an algorithm to find a set of materialized views that results in a rewritten query which can be executed efficiently. We show the effectiveness and performance of the algorithm experimentally. D 2002 Elsevier Science B.V. All rights reserved. Keywords: Query rewriting; Materialized view; OLAP; Data warehouse 1. Introduction Data Warehouses (DWs) integrate information from multiple operational databases and are often used to support On-Line Analytical Processing (OLAP) [4]. A DW tends to have a huge amount of raw data. Queries used in OLAP are much more complex than those used in traditional OLTP applications and have many differ- ent features. They generally consist of multi-dimen- sional grouping and aggregation operations. The large size of DWs and the complexity of OLAP queries considerably increase the execution cost of the queries and have a critical effect on performance and produc- tivity of decision support systems. For efficient processing of OLAP queries, a com- monly used approach is to store the results of frequently issued queries in summary tables, i.e., Materialized Views (MVs), and make use of them to evaluate other queries. This approach involves selecting views to materialize and rewriting queries using MVs. Most of the proposed materialized view selection techniques, such as Refs. [2,10,12,13,15,24], select a set of views that optimize query evaluation cost for a given set of queries under given storage space limitation or MV maintenance cost constraint. To answer queries that are not relevant to those considered in view selection process efficiently, a query rewriting strategies for con- 0167-9236/02/$ - see front matter D 2002 Elsevier Science B.V. All rights reserved. PII:S0167-9236(01)00123-3 * Corresponding author. E-mail addresses: [email protected] (C.-S. Park), [email protected] (M.H. Kim), [email protected] (Y.-J. Lee). www.elsevier.com/locate/dsw Decision Support Systems 32 (2002) 379 – 399

Finding an efficient rewriting of OLAP queries using materialized views in data warehouses

Embed Size (px)

Citation preview

Finding an efficient rewriting of OLAP queries using

materialized views in data warehouses

Chang-Sup Park *, Myoung Ho Kim, Yoon-Joon Lee

Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology,

373-1 Kusung-dong, Yusung-gu, Taejon, 305-701, South Korea

Received 1 January 2001; received in revised form 1 March 2001; accepted 1 June 2001

Abstract

OLAP queries involve a lot of aggregations on a large amount of data in data warehouses. To process expensive OLAP

queries efficiently, we propose a new method to rewrite a given OLAP query using various kinds of materialized views which

already exist in data warehouses. We first define the normal forms of OLAP queries and materialized views based on the

selection and aggregation granularities, which are derived from the lattice of dimension hierarchies. Conditions for usability of

materialized views in rewriting a given query are specified by relationships between the components of their normal forms. We

present a rewriting algorithm for OLAP queries that can effectively utilize materialized views having different selection

granularities, selection regions, and aggregation granularities together. We also propose an algorithm to find a set of

materialized views that results in a rewritten query which can be executed efficiently. We show the effectiveness and

performance of the algorithm experimentally. D 2002 Elsevier Science B.V. All rights reserved.

Keywords: Query rewriting; Materialized view; OLAP; Data warehouse

1. Introduction

DataWarehouses (DWs) integrate information from

multiple operational databases and are often used to

support On-Line Analytical Processing (OLAP) [4]. A

DW tends to have a huge amount of raw data. Queries

used in OLAP are much more complex than those used

in traditional OLTP applications and have many differ-

ent features. They generally consist of multi-dimen-

sional grouping and aggregation operations. The large

size of DWs and the complexity of OLAP queries

considerably increase the execution cost of the queries

and have a critical effect on performance and produc-

tivity of decision support systems.

For efficient processing of OLAP queries, a com-

monly used approach is to store the results of frequently

issued queries in summary tables, i.e., Materialized

Views (MVs), and make use of them to evaluate other

queries. This approach involves selecting views to

materialize and rewriting queries using MVs. Most of

the proposed materialized view selection techniques,

such as Refs. [2,10,12,13,15,24], select a set of views

that optimize query evaluation cost for a given set of

queries under given storage space limitation or MV

maintenance cost constraint. To answer queries that are

not relevant to those considered in view selection

process efficiently, a query rewriting strategies for con-

0167-9236/02/$ - see front matter D 2002 Elsevier Science B.V. All rights reserved.

PII: S0167-9236 (01 )00123 -3

* Corresponding author.

E-mail addresses: [email protected] (C.-S. Park),

[email protected] (M.H. Kim),

[email protected] (Y.-J. Lee).

www.elsevier.com/locate/dsw

Decision Support Systems 32 (2002) 379–399

junctive queries and aggregation queries have been

proposed in the literature of relational databases, data

integration, and data warehouses [3,5,11,16,22,23,26].

Few of them, however, exploited the characteristics of

DWs and OLAP queries effectively and utilized exist-

ing MVs sufficiently.

In this paper, we propose a new algorithm for

rewriting OLAP queries which improves the usability

of MVs significantly in contrast to the previous stud-

ies. We consider typical OLAP queries that aggregate

raw data by the attributes in dimension hierarchies and

define a normal form of the queries based on the lattice

of dimension hierarchies. Then we present conditions

on which an MV can be used in rewriting a given

query, which are specified by relationships between

the components of their normal forms, and suggest the

rewriting algorithm for normal form OLAP queries.

The main contribution of this paper are as follows:. Aggregate MVs defined by an arbitrary region

of selection can be used in our method. That is, we

can use the results of grouping and aggregation of the

raw data selected by an arbitrary range of values of

dimensional attributes.. We exploit meta-information of DW schema

effectively.Dimension hierarchies inDWs are implicity

used in constructingOLAPqueries. By using such sem-

antic information, we can achieve query rewriting that

utilizes various kinds of MVs together. Ad hoc queries

that were not considered at view selection process can

be also rewritten using MVs by the proposed method.. For a given OLAP query, there can be many

equivalent rewritings using different MVs in various

ways. Their execution costs, in general, are different

from one another. We propose a heuristic algorithm to

find a set of MVs and their query regions that result in

an efficient query plan.

The rest of the paper is organized as follows. We

first present some examples motivating our studies in

Section 2. In Section 3, we define the normal forms of

OLAP queries and MVs considered in this paper. In

Section 4, we describe our method for rewriting

OLAP queries using MVs. We propose an algorithm

to select an efficient set of MVs for query rewriting in

Section 5 and present experimental results in Section

6. Related work is discussed in Section 7, and we

draw conclusions in Section 8.

2. Motivating examples

We present examples to show how OLAP queries

can be rewritten using various MVs in DWs.

Fig. 1. An example data warehouse schema.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399380

Example 1. Consider a DW with sales data of a large

chain of department stores, whose schema is shown in

Fig. 1. The DW consists of a fact table and four-

dimension tables and has a dimension hierarchy in

each dimension table. We assume that three MVs are

available in the DW as shown in Fig. 2.

Fig. 2. Example materialized views.

Fig. 3. Example query Q1 and its rewritten query Q1V.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399 381

Consider the OLAP query Q1 in Fig. 3, which asks

for the total sales of the stores in the USA or Canada

from 1996 to 1999 by state and year. Q1 can be

rewritten to the query Q1V in Fig. 3, which uses the

three MVs instead of the fact table Sales. Q1V consists

of three query blocks, whose results are combined by

union. Each query block contains a different MV and

computes a part of aggregate groups of Q1 as shown

in Fig. 4. Specifically, the first query block computes

from MV1 the total sales of the stores in the USA or

Canada from 1997 to 1999 by state and year. The

second and the third one use MV2 and MV3,

respectively, to compute the total sales of the stores

in the USA and Canada in 1996 by state. Since the

three sets of groups are disjoint and the union of them

is equal to the set of groups computed by Q1, we can

Fig. 4. The aggregate groups of three query blocks in Q1V.

Fig. 5. Example query Q2 and its rewritten query Q2V.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399382

obtain the same result of Q1 by taking the union of

them as in Q1V.

The next example shows another type of query

rewriting.

Example 2. Consider the Q2 in Fig. 5 over the sales

DW in Fig. 1. Q2 can be rewritten to the equivalent

query Q2V in Fig. 5, which uses MV1 and MV3 instead

of the fact table Sales.

Q2 asks for the total sales of the stores in Canada in

all years by state. However, MV1 has the total sales of

the stores only from 1997, and MV3 has those only to

1996. That is since MV1 and MV3 are the results of

aggregations over a part of the raw data contained in

each aggregate group of Q2, neither of them can

compute the aggregation of Q2. However, as shown in

Fig. 6, two sets of the raw data selected by MV1 and

MV3 are disjoint, and the union of them is equal to the

set of the raw data for the aggregate groups of Q2.

Thus, the result of Q2 can be obtained by computing

aggregations over MV1 and MV3 for all aggregate

groups of Q2 and then aggregating the partial results

of the same groups once more, as shown in Q2V.

The rewritings in these two examples cannot be ob-

tained by other methods proposed earlier. Our method

proposed in this paper can achieve these types of re-

writings. We will revisit these examples in Section 4 to

illustrate how our algorithm can perform the rewritings.

3. OLAP queries and materialized views

3.1. The lattice of dimension hierarchies

We consider DWs that have a star schema [4]

consisting of one fact table and d dimension tables

like the one in Fig. 1. The fact table has a foreign key

to each dimension table and measure attributes on

which aggregations are performed. The tuples in the

fact table are called the raw data. We suppose that we

can obtain hierarchical classification information from

dimension tables and consider one dimension hier-

archy for each dimension table. A dimension hier-

archy DHi of a dimension table DTi, defined as

DHI= (L0i , L1

i ,. . ., Lhi , none), is an ordered set of

dimension levels. A dimension level Lji is a set of

attributes from DTi. We call j of Lji its height. L0

i is the

lowest level which equals the primary key of DTi.

None denotes the highest level and is an empty set.

There exists a functional dependency L ij � 1! Lj

i

between Lij� 1 and Lji (1V jV h). Each dimension

level is used as a criterion by which raw data are

grouped for aggregation. We call the attributes in

dimension levels the dimensional attributes.

The Cartesian product of all dimension hierar-

chies, defined as DH =DH1�DH2� . . .�DHd, is a

class of the ordered sets of dimension levels from all

different dimension hierarchies. There exists a partial

Fig. 6. The aggregate groups of the query blocks in Q2V.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399 383

ordering relation V among the elements in DH,

defined as

ðL1l1 ; L2l2 ; . . . ; Ldld ÞVðL1m1; L2m2

; . . . ; LdmdÞ

if and only if Lili ! Limior Lili ¼ Limi

for all 1ViVd:

Two additional relations < and < > can be derived

from V as follows.

(Ll11 , Ll2

2 ,. . ., Lldd ) < (Lm1

1 , Lm2

2 , . . ., Lmd

d ) if and only if

(Ll11 , Ll2

2 , . . ., Lldd )V (Lm1

1 , Lm2

2 , . . ., Lmd

d ) but there exists

j (1V jV d) such that Lljj a Lmj

j .

(Ll11 , Ll2

2 ,. . ., Lldd ) < > (Lm1

1 , Lm2

2 , . . ., Lmd

d ) if and only

if neither (Ll11, Ll2

2, . . ., Lldd )V (Lm1

1 , Lm2

2 ,. . ., Lmd

d ) nor

(Ll11, Ll2

2,. . ., Lldd) � (Lm1

1 , Lm2

2 , . . ., Lmd

d ).

DH can be represented as a lattice, called the

Dimension Hierarchies (DH) lattice in this paper.1

Each node in the lattice means a criterion for selecting

or grouping raw data. Fig. 7 shows a DH lattice

derived from dimension hierarchies in Store and Time

in Fig. 1.2

3.2. The normal form of OLAP queries

The OLAP queries we consider in this paper are

unnested single-block aggregate queries over the base

tables. Queries over MVs can be transformed to those

that contain only base tables by unfolding the MVs.

The target queries can be characterized by a set of

selection attributes, a selection predicate, a set of

grouping attributes, a set of projection attributes, a set

of aggregate functions, and a condition on the values of

aggregate functions and the grouping attributes.

The selection attributes are those contained in the

selection predicate of the query. We assume that the

set of selection attributes is a union of dimension

levels from some different dimension hierarchies.3

Then, it can be mapped to a node in the DH lattice,

i.e., an ordered set of dimension levels, which we call

the Selection Granularity (SG) of the query. The

selection predicate is a Boolean combination of com-

parison predicates on the selection attributes. A com-

parison predicate has the form a op c, where a is a

dimensional attribute, c is a constant, and opa{ < , V ,

=, � , >}. In the selection predicate expressed in a

disjunctive normal form, each conjunct can be geo-

metrically represented as a hyper-rectangle in the d-

dimensional domain space of dimensional attributes

derived from the dimension hierarchies. Thus, the

selection predicate can be considered a set of hyper-

rectangles, called the selection region of the query.

The set of grouping attributes is used for grouping raw

data. Like the set of selection attributes, it can be

mapped to a node in the DH lattice, which we call the

Aggregation Granularity (AG) of the query. The set

of projection attributes is supposed to be identical to

the set of grouping attributes.

We propose a normal form of the considered

OLAP queries using these components of the queries.

Definition 1. The normal form of an OLAP query Q is

defined as

QðSG;R;AG;AGG;HAV Þ;

where

. SG=(S1, S2, . . ., Sd) is the selection granularity

of Q, where Si (1V iV d) is a dimension level in the

dimension hierarchy DHi.

Fig. 7. An example DH lattice.

1 A similar lattice framework was proposed in Ref. [13], which

represents a dependence relation among a set of views.2 For simplicity in notation, we denote a dimension level {a1,

a2, . . ., an} by a1�a2�. . .�an.

3 The method proposed in this paper can be extended to the

case where the set of selection attributes includes the attributes in

different levels in the same dimension hierarchy.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399384

. R={Ri} is the selection region of Q, which is a

set of hyper-rectangles. A hyper-rectangle Ri= (Ii1,

Ii2, . . ., Iid) is an ordered set of d intervals of the

values of the dimension levels obtained from the

selection predicate of Q. Iij denotes an interval of a

dimension level in DHj, which can be open, closed,

or half-closed. The endpoints of the interval are

specified by the concatenation of all values of the

higher levels in the dimension hierarchy. We denote

unspecified left (right) endpoints of intervals by �1( +1).

. AG= (A1, A2, . . ., Ad) is the aggregation gran-

ularity ofQ, where Ai (1V iV d) is a dimension level in

the dimension hierarchy DHi.. AGG ={agg(m) | agg a{MIN, MAX, SUM,

COUNT} and m is a measure attribute}.4

. HAV is a logical formula of comparison predicates

on the attributes in AG and the aggregate functions in

AGG. It implies the HAVING condition of Q in SQL.

We denote an empty condition by null.

SG(Q), R(Q), AG(Q), AGG(Q), and HAV(Q)

denote SG, R, AG, AGG, and HAV in the normal form

of Q, respectively. For simplicity in notation, we also

use SG(Q) and AG(Q) to denote the union of the

dimension levels in them. AG(Q, i) and SG(Q, i)

(1V iV d) denote the dimension levels from DHi

contained in AG(Q) and SG(Q), respectively.

In this paper, we consider MVs that store the

results of the normal form OLAP queries which

satisfy SG(Q)�AG(Q) and HAV(Q) = null. MVs that

do not meet the conditions are not very useful for

rewriting other queries. The normal form of a mate-

rialized view MV, denoted by MV(SG, R, AG, AGG),

is defined in the similar way.

Example 3. The normal forms of the MVs and

the queries in Example 1 and Example 2 are as fol-

lows.

. MV1(SG, R, AG, AGG) =MV1 ((none, year, none,

none), {((�1, +1), [1997, +1), (�1, +1),

(�1, +1))}, (state, year, none, none), {SUM(sa-

les�dollar)}).. MV2(SG, R, AG, AGG) =MV2 ((nation, none,

none, none), {([‘USA’, ‘USA’], (�1, +1), (�1,

+1), (�1, +1))}, (state, year�month, none, none),

{SUM(sales�dollar)}).. MV3(SG, R, AG, AGG) =MV3 ((nation, year,

none, none), {([‘Canada’, ‘Canada’], (�1, 1996],

(�1, +1), (�1, +1))}, (city, year�month, none,

none), {SUM(sales�dollar)}).. Q1(SG, R, AGG, HAV) =Q1 ((nation, year, none,

none), {([‘USA’, ‘USA’], [1996, 1999], (�1, +1),

(�1, +1)) ([‘Canada’, ‘Canada’], [1996, 1999],

(�1, +1), (�1, +1))}, (state, year, none,

none), {SUM(sales�dollar)}, null).. Q2(SG, R, AGG, HAV) =Q2 ((nation, none, none,

none), {([‘Canada’, ‘Canada’], (�1, +1), (�1,

+1), (�1, +1))}, (state, none, none, none),

{SUM(sales�dollar)}, null).

4 We do not consider AVG in this paper since it can be

computed using SUM and COUNT.

Fig. 8. The SGs and AGs of the queries and MVs in Example 1 and Example 2.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399 385

Fig. 8 shows the SGs and AGs of the MVs and

queries in Example 1 and Example 2 on the DH lattice.

We define two operations between selection regions

of queries and MVs which are used in Section 4.

Definition 2. Let Q and MV be a query and a

materialized view. The region intesection \* between

the selection region of Q and that of MV is defined as

RðQÞ \*RðMV Þ ¼ fRi \ Rj j RiaRðQÞ, RjaRðMV Þg,

where Ri\Rj is a set of hyper-rectangle which is the

intersection of Ri and Rj. The region differences � *

of the selection region of Q and that of MV is defined

as

RðQÞ �*RðMV Þ¼ fRk j RkaðRi � RjÞ, RjaRðQÞ, RjaRðMV Þg,

where Ri�Rj is a set of hyper-rectangles whose union

is identical to the difference of Ri and Rj. The

intersection and difference of the selections of two

MVs are defined in the way.

The intersection and difference of two hyper-rec-

tangles can be obtained by the algorithms proposed in

computational geometry (e.g., Ref. [17]). Since the

endpoints of hyper-rectangles are represented as a

concatenation of all values of the higher levels in

dimension hierarchies as defined in Definition 1, the

operations can be applied on the queries and MVs

having different selection granularities. The granular-

ity of the result is equal to the Greatest Lower Bound

(GLB) of the selection granularities of two operands.

We say that a tuple in an MV or the fact table is

contained in a selection region R if it is selected by the

selection predicate for R which is specified on dimen-

sional attributes involved in R.

4. Query rewriting using materialized views

Given a query Q, a query Q V is called a rewriting, or a rewritten query, of Q that uses a materialized view MV

if: (a) Q V and Q compute the same result for any given database, and (b) Q V contains MV in the FROM clause of

one of its query blocks [23]. If there exists such a rewritten query, we say that MV is usable in rewriting Q. In this

section, we propose an algorithm for rewriting normal form OLAP queries using different classes of normal form

MVs. Rewritten queries are expressed in SQL.

4.1.Candidate materialized views

To rewrite a given OLAP query Q, we have to find MVs that can be used to compute a part of the aggregate

groups of Q. We present sufficient conditions on which a materialized view can be used in rewriting Q in the

following definition.

Definition 3. An MV is a candidate MV for a query Q if it satisfies the following four conditions:

R(MV)\ *R(Q) a /, AG(MV)V SG(Q), AG(MV)VAG(Q), and AGG(MV)tAGG(Q).

The usability of the candidate MVs can be proved by the following results.

Lemma 1. If AG(MV)V SG(Q), then the tuples in MV that are contained in R(MV)\ *R(Q) are the result of

aggregation over the tuples in the fact table that are contained in R(MV)\ *R(Q).

Proof. For a tuple tV in MV, let G(tV) be the set of tuples in the table FT that constitute the aggregate group for tV.The granularity of R(MV)\ *R(Q) is GLB(SG(MV), SG(Q)). If AG(MV)V SG(Q), then with the constraint of

AG(MV)V SG(MV) in the definition of MV, we have

AGðFTÞVAGðMV ÞVGLBðSGðMV Þ, SGðQÞÞ:

The relation V implies the functional dependencies AG(FT)!AG(MV) and AG(MV)!GLB(SG(MV), SG(Q)),

and AG(FT)!GLB(SG(MV), SG(Q)) is derived from them by transitivity of functional dependencies. Thus,

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399386

for all tV in MV contained in R(MV)\ *R(Q) and for all t in FT such that taG(tV), t is also contained in

R(MV)\ *R(Q). Inversely, for all t in FT that is contained in R(MV)\ *R(Q), there exists a tuple tV in MV such

that taG(tV) and tV is also contained in R(MV)\ *R(Q). Therefore, the lemma follows. 5

Lemma 2. Let SV be a set of tuples from MV and S be the set of tuples from the fact table which constitute the

aggregate groups for the tuples in SV. If AG(MV) VAG(Q), the result of aggregation over the tuples in S by AG(Q)

and an aggregate function agg(m) in AGG(Q)\AGG(MV) is equal to the result of aggregation over the tuples in SVby AG(Q) and an aggregate function aggV(mV), where aggV is equal to agg if agg is MIN, MAX, or SUM and

aggV is SUM if agg is COUNT, and mV is the result attribute for agg(m).

Proof. AG(FT)VAG(Q) implies functional dependencies AG(FT)!AG(MV) and AG(MV)!AG(Q). Consider

an equivalent relation R on S such that (t1, t2)aR if and only if t1 and t2 have the same value of AG(MV), after being

joined with related dimension tables, if necessary. By AG(FT)!AG(MV), the aggregate groups for the tuples in SVderive a partition p1 of S induced by R. Similarly, we consider an equivalent relation RV on SV such that (t1V,t2V)aRV if and only if t1V and t2V have the same value of AG(Q), after being joined with related dimension tables,

if necessary. Then, by AG(MV)!AG(Q), the aggregate groups for the result S1W of aggregation over SV by AG(Q)derive a partition p2 of SV induced by RV. We have AG(FT)!AG(Q) by the transitivity of functional

dependencies. Thus, the aggregate groups for the result S2W of aggregation over S by AG(Q) derive a partition p3

of S, and p1 is a refinement of p3. Since all the considered aggregate functions, i.e., MIN, MAX, SUM, and

COUNT, are distributive functions [9], for all t2W in S2W, there exists t1W in S1W such that

aggðfGðp3; t2WÞ j a block of p3 which is an aggregate group for t2W in S2WgÞ¼ aggVðfaggðfGðp1; tVj a block of p1 which is an aggregate group for tVin SVand Gðp1; tVÞpGðp3; t2WÞgÞgÞ¼ aggVðfGðp2; t1WÞ j a block of p2 which is an aggregate group for t1W in S1WgÞ,

where aggV is equal to agg if agg is MIN, MAX, or SUM, and aggV is SUM if agg is COUNT. The converse also

holds. Therefore, S1W and S2W are equal to each other. 5

Theorem 1. Given an OLAP query Q, a candidate materialized view MV for Q is usable in rewriting Q.

Proof. R(MV)\ *R(Q) a / means that there exist raw data that satisfy both of the selection predicates of MV and

Q. R(Q) can be divided into two disjoint selection regions, R(MV)\ *R(Q) and R(Q)� *R(MV), which are

evaluated by a query block containing MV and the fact table FT, respectively. We first consider the query block for

MV. By Lemma 1, the set of tuples in MV that are selected by R(MV)\ *R(Q) reflects the aggregation over the

tuples in the fact table that are contained in R(MV)\ *R(Q). By Lemma 2 and AGG(MV)tAGG(Q), aggregating

the selected tuples in MV by AG(Q) gives the same result of the aggregation of Q. Therefore, we can compute the

aggregation of Q for the selection region R(MV)\ *R(Q) using MV. The aggregation for R(Q)� *R(MV) can be

obtained from FT. 5

The detailed algorithm for generating query blocks for MVs and the fact table and integrating them into a

rewritten query is given in Section 4.2. For notational convenience, we regard the fact table as a kind of candidate

MV for all queries in the rest of the paper and denote the set of the candidate MVs for a query Q by V(Q). Fig. 9

shows the possible aggregation granularities of the candidate MVs for a query Q, which satisfy AG(MV)V SG(Q)

and AG(MV) VAG(Q).

4.2. The query rewriting method

A given OLAP query Q can be rewritten using a set of the candidate MVs in V(Q). Our rewriting method consists

of three main steps. [21]

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399 387

4.2.1. Step 1: selecting materialized views

In the first step, we select MVs from V(Q) that will be actually used in a rewriting of Q and also determine the

query regions for the selected MVs. In general, a rewritten query over a materialized view MVi selects tuples in

MVi satisfying some selection predicate. The selection predicate can be represented as a selection region, which is

called the query region for MVi and denoted by QR(MVi). It must be subsumed in R(MVi)\ *R(Q). All query

regions for the selected MVs must not overlap one another and cover the selection region of Q together. That is, if

we let S(Q) be a set of MVs selected for rewriting Q, the query regions for the MVs in S(Q) must satisfy the

following conditions:

QRðMViÞ � *ðRðMViÞ \ *RðQÞÞ ¼ / for all MV1aSðQÞ, QRðMViÞ \ *QRðMVjÞ

¼ / for all MVi,MVjaSðQÞ such that MViaMVj, and RðQÞ � *[

MViaSðQÞQRðMViÞ ¼ /:

In general, there can be many equivalent rewritings containing a different set of candidate MVs, which differ

greatly in execution performance. In Section 5, we will formalize the problem of selecting the candidate MVs and

the query regions for them that constitute an efficient rewritten query. We will also provide an effective heuristic

algorithm to solve the problem.

4.2.2. Step 2: generating query blocks

For each materialized view MV selected in Step 1, we generate a query block using its query region QR(MV).

We first consider the following normal form query over the fact table,

QMV ðSGV,QRðMV Þ, AGðQÞ, AGGðQÞ,HAV ðQÞÞ,

where SGV in the granularity of QR(MV). The query block for MV must be equivalent to QMV and defined over

MV instead of the fact table. Since all the attributes in MV except for the results of aggregations are included in

AG(MV), if an attribute in SGV or AG(Q) is not included in AG(MV), a join operation between MV and the

dimension table DTi containing the attribute is required to perform selection by QR(MV) or grouping by AG(Q).

However, if the join attributes are not foreign key referring to DTi, the join may result in duplication of the same

Fig. 9. The area of the AGs of possible candidate MVs for a query Q.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399388

tuples. In order to prevent it, we first evaluate a duplicate-eliminating (distinct) selection query over DTi and then

perform a join of MV with the result of the query, which will be nested in the query block of MV. A selection

predicate on the attributes in DTi that is subsumed by the selection predicate for QR(MV) can be included in the

nested subquery. As a result, the query block for MV is formalized in the following definition.

Definition 4. Given a query Q, a materialized view in MV in V(Q), and a query region QR(MV), the query block

for MV, QB(MV), has the form,

where the components are defined as follows:

P ¼[

1ViVd

AGðQ; iÞ

AGG ¼

faggiVðmVÞAS aliasi j mV is an attribute in MV that is the result of aggiðmÞin AGGðQÞ,

aggiV ¼ aggi if aggia fMIN ,MAX , SUMg and aggiV ¼ SUM if aggi ¼ COUNTg,

if AGðQÞ > AGðMV Þ

fmVAS aliasi j mV is an attribute in MV that is the result of aggiðmÞ in AGGðQÞg,

if AGðQÞ ¼ AGðMV Þ

8>>>>>>>>>>>><>>>>>>>>>>>>:

R ¼ fMV ,DTi,NBjÞ j ðAGðMV , iÞMSGðQMV , iÞ or AGðMV , iÞMAGðQ; iÞÞ, AGðMV , iÞ¼ Li0, ðAGðMV , jÞMSGðQMV , jÞ or AGðMV , jÞMAGðQ, jÞÞ, AGðMV , jÞaLj0g

where the nested subquery block NBj has the form

where p(QR(MV), DTj) denotes the selection predicate on the attributes in DTj which is subsumed by the

selection predicate for QR(MV) and excludes sub-predicates subsumed by the predicate for R(MV).

JC ¼ ^DTiaR

ðMV :AGðMV , iÞ ¼ DTi:AGðMV , iÞÞ� �

^ ^NBjaR

ðMV :AGðMV , jÞ ¼ NBj:AGðMV , jÞÞ� �

SC ¼ ^AGðMV , iÞtSGðQMV , iÞ

pðQRðMV Þ;DTiÞ� �

^ ^DTjaR

pðQRðMV Þ,DTjÞ� �

SELECT P, AGG

FROM R

WHERE JC, SC

GROUP BY G

HAVING HC

(SELECT DISTINCT AG(MV, j)[AG(Q, j)FROM DTj

WHERE p(QR(MV), DTj)) NBj

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399 389

G ¼AGðQÞ, if AGðQÞ > AGðMV Þ

/, if AGðQÞ ¼ AGðMV Þ

8<:

HC ¼

HAVVðQÞ, if HAV ðQÞanull andSGðMV Þ � AGðQÞ

/, otherwise

8>><>>:

where HAVV(Q) is a logical formula on P and AGG derived from HAV(Q) by replacing aggi(m) in AGG(Q)

with the related aliasi in AGG. The GROUP BY and HAVING clauses are included only when G and HC

are not /, respectively.

4.2.3. Step 3: integrating the query blocks

If only one MV is selected in Step 1, the query block generated in Step 2 becomes the final result of

rewriting. Otherwise, we obtain a rewritten query by integrating the generated query blocks. The MVs selected

in Step 1 can be classified into two categories by a relationship between the SG of an MV and the AG of the

given query Q. Fig. 10 shows the possible selection granularities of these two kinds of MVs.

. S1={MVi | SG(MVi)�AG(Q)}: the MVs in this class can be used to evaluate aggregation for all or a part of

the aggregate groups of Q.. S2={MVj jSG(MVj) <AG( Q) or SG(MVj) < >AG( Q)}: each MV in this class may not compute aggregation

for an aggregate group of Q, but all MVs in this class can compute it together.

Query blocks for the MVs in S1 are integrated into a UNION multi-block query using UNION operators. On

the other hand, query blocks for the MVs in S2 are connected by UNION ALL operators and then integrated into a

UNION ALL-GROUP BY query. It has a set AGGV(Q) of the aggregate functions from AGG(Q) except for

COUNTs which are replaced by SUMs, a GROUP BY clause containing AG(Q), and a HAVING clause with the

condition on AG(Q) and AGGV(Q) derived from HAV(Q) which is not null. Combining these two query blocks

Fig. 10. The area of the SGs of the MVs that can be integrated by UNIONs.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399390

using a UNION operator generates the final rewritten query. Given S1={MV11, MV12,. . ., MV1m} and S2={MV21,

MV22,. . ., MV2n}, the rewritten query has the form

Example 4. We describe rewriting steps for the query Q1 in Example 1.

Step 1: By Definition 3, V(Q1)={MV1, MV2, MV3, Sales}. We select MV1, MV2, and MV3 in the order using the

algorithm proposed in Section 5. Their query regions are determined as follows (see Fig. 4).

QRðMV1Þ ¼ fð½‘USA’,‘USA’�, ½1997, 1999�, ð�1, þ1Þ, ð�1, þ1ÞÞ,ð½‘Canada’, ‘Canada’�, ½1997, 1999�, ð�1, þ1Þ, ð�1, þ1ÞÞg

QRðMV2Þ ¼ fð½‘USA’, ‘USA’�, ½1996, 1996�, ð�1, þ1Þ, ð�1, þ1ÞÞg

QRðMV3Þ ¼ fð½‘Canada’, ‘Canada’�, ½1996, 1996�, ð�1, þ1Þ, ð�1, þ1ÞÞg

Step 2: We will show how the query block for MV1, QB(MV1), is generated. The normal from query for QR(MV1)

over the fact table is

QMV1ððnation, year, none, noneÞ, fð½‘USA’, ‘USA’�, ½1997, 1999�, ð�1, þ1Þ, ð�1, þ1ÞÞ,ð½‘Canada’, ‘Canada’�, ½1997, 1999�, ð�1, þ1Þ, ð�1, þ1ÞÞg, ðstate, year, none, noneÞ,fSUMðsalesdollarÞgÞ:

The components in QB(MV1) are determined by Definition 4 as follows. P={state, year} from AG(Q1) in

Example 3. AGG={sum�dollar1} because AG(MV1) =AG(Q1) and SUM (sales�dollar) in AGG(Q1) is named

sum�dollar1 in MV1. Since AG(MV1, 1)={state}M{nation} = SG(QMV1, 1) and AG(MV1, 1)={state}M{stor-

{store�id} = L01, a join of MV1 with a nested subquery block NB1 over Store is needed. With AG(MV1,

1) =AG(Q1, 1)={state} and p(QR(MV1), Store)=(nation = ‘USA’ OR nation = ‘Canada’), NB1 has the form

(SELECT DISTINCT state FROM Store WHERE nation = ‘USA’ OR nation = ‘Canada’) NB1 However, since

AG(MV1, 2) = SG(QMV1, 2) =AG(Q1, 2)={year}, join of MV1 with Time is not necessary. Thus we have

R={MV1, NB1} and JC=(MV1.state =NB1.state). We have SC = p(QR(MV1), Time)=(yearV 1999) since the

predicate for QR(MV1) is (year� 1997 AND yearV 1999) but the sub-predicate (year� 1997) is already included

in the predicate for R(MV1). Since AG(MV1) =AG(Q1) and HAV(Q1) = null, GROUP BY and HAVING clauses

are not required. Therefore, the query block QB(MV1) is written as follows.

(QB(MV11) UNION QB(MV12) UNION. . .UNION QB(MV1m))

UNION

(SELECT AG(Q), AGGV (Q)

FROM (QB(MV21) UNION ALL QB(MV22) UNION ALL. . .UNION ALL QB(MV2n))

GROUP BY AG(Q)

HAVING HAVV (Q))

SELECT state, year, sum_dollar1FROM MV1, (SELECT DISTINCT state

FROM Store

WHERE nation = ‘USA’ OR nation = ‘Canada’) NB1

WHERE MV1.state =NB1 state AND yearV 1999

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399 391

The query blocks for MV2 and MV3 can be generated in a similar manner (see Q1V in Fig. 3).

Step 3:MV1, MV2, and MV3 satisfy SG(MV1)>AG(Q1), SG(MV2)>AG(Q1), and SG(MV3)>AG(Q1), respectively,

as shown in Fig. 8a. Thus, the query blocks for MV1, MV2, and MV3 can be integrated into a UNION multi-block

query, i.e., Q1V in Fig. 3.

Example 5. In rewriting the query Q2 in Example 2 by the proposed method, MV1 and MV3 are selected in the

first step. Since SG(MV1) < >AG(Q2) and SG(MV3) < >AG(Q2) hold as shown in Fig. 8b, the query blocks for MV1

and MV3 can be integrated into a UNION ALL-GROUP BY query, i.e., Q2V in Fig. 5.

5. Finding an efficient rewriting

In general, there exist many equivalent rewritings

containing a different set of candidate MVs, which

have different execution cost. Hence, it is desirable for

system performance to choose such MVs and query

regions as result in a rewritten query having an efficient

execution plan. In this section, we explore the problem

of selecting MVs and query regions for query rewriting

and propose a heuristic algorithm that delivers an

efficient solution within an acceptable time.

5.1. Problem definition

The problem of selecting a set of MVs and query

regions to be used in rewriting a query is formalized

as an optimization problem in Definition 5.

Definition 5. (the optimal MV set problem) Given

a query Q= (AG, SG, R, AGG, HAV), a set V(Q) of the

candidate MVs for Q, and a cost function cost (MV,

QR), which provides cost for a materialized view MV

with a query region QR, the optimal MV set problem

is to find an optimal set S of pairs (MVi, QRi), where

MVi a V(Q) and QRi is a non-empty query region for

MVi, which minimize the total cost of S,

XðMVi, QRiÞaS

costðMVi,QRiÞ

subject to

QRi � *ðRðMViÞ \*RðQÞÞ¼ / for all ðMVi,QRiÞaS,

QRi \*QRj

¼ / for all pairs of QRi and QRj in S such that

iaj, and

RðQÞ � *[

ðMVi;QRiÞaS

QRi ¼ /:

The following theorem shows the intractability of

the optimal MV set problem.

Theorem 2. The optimal MV set problem is NP-hard.

Proof. For simplicity of description, we do not consi-

der the query regions for the selected MVs, which does

not increase the complexity of the problem. Then, the

decision version of the optimal MV set problem is to

determine if there is a set S of MViaV(Q) such thatXMViaS

costðMViÞVc and

[MViaS

ðRðMViÞ \ *RðQÞÞ ¼ RðQÞ;

where c is a positive real number. We will show that the

minimum set cover decision problem [8], which was

known to be NP-complete, can be polynomially trans-

formed to this problem. The minimum set cover de-

cision problem is described as follows: Given a

collection F={S1, S2,. . .,Sn} of subsets of a finite set

S and a positive integer k, is there a subset FV of F such

that AFVAV k and vSVaFV

SV ¼ S? Given an instance of

the minimum set cover decision problem, we can cons-

truct an instance of our problem, such thatV(Q)={MV1,

MV2,. . ., MVn} where MVi = Si(1V iV n), RðQÞ ¼v

MViaV ðQÞ RðMViÞ; costðMViÞ ¼ 1ð1ViVnÞ; and c ¼ k .

Then, straightforwardly, the minimum set cover

decision problem has a solution FV if and only if the

optimal MV set decision problem has a solution S. 5

5.2. Cost models

We assume that the execution cost of a rewritten

query obtained by our method is the sum of the

execution costs of the query blocks in it. For cost-

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399392

based selection of MVs and query regions, we have to

estimate the execution cost of a query block defined

over a candidate MV and a query region for the MV.

We propose different cost models for an MV and the

fact table considering different access methods for

them. We extend the linear cost model proposed in

Ref. [13] to use the number of disk pages to be

retrieved from an MV or the fact table as a measure

of the execution cost of a query block over it.

We assume that there is no special index or

clustering on the attributes in MVs. That implies we

have to retrieve all pages in an MV to evaluate a

query block for it, regardless of the query region

applied to the MV. Therefore, the execution cost of a

query block for a materialized view MV is defined as

costðMV ,QRÞ ¼ NMV

where NMV denotes the number of pages in MV.

On the other hand, several types of access methods

have been proposed in the literature to support fast

access to the fact table in data warehouses [7,20,25].

In particular, bitmap join indices [19] on the dimen-

sional attributes are frequently used for efficient joins

between a dimension table and the fact table.

Considering such index structures, we assume the

execution cost of a query block over the fact table to

be the number of tuples contained in a given query

region over the fact table as long as it does not exceed

the total number of pages in the fact table. That is, the

execution cost is defined as

costðFT ,QRÞ ¼ minfk, NFTgwhere k is the number of tuples in the fact table

selected by the query region QR and NFT is the

number of pages in the fact table. Supporting that the

values of dimensional attributes are uniformly

distributed over the d-dimensional domain space, k

can be estimated as

k ¼ nFT � areaðQRÞareaðRðFTÞÞ

where nFT is the number of tuples in the fact table and

area(R) means the total number of different values of

Fig. 11. Greedy algorithm.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399 393

the set of dimensional attributes (i.e., foreign keys) in

the fact table contained in the selection region R. In

this cost model, we ignore the cost of searching di-

mension tables or bitmap indices since it is relatively

low compared with that of accessing MVs and the fact

table.

5.3. A heuristic algorithm

In this section, we introduce a greedy heuristic algo-

rithm to find an effective solution for selecting a set

of MVs and query regions quickly. The algorithm,

shown in Fig. 11, works in stages. At each stage, it

selects an MV from the set of candidate MVs for a

given query Q that gives the maximum profit in

execution cost for the remaining query region QRV,

which is defined as

QRV ¼ RðQÞ � *[

MViaS

RðMViÞ ¼ /

where S is a set of the MVs already selected. Using

the cost model in Section 5.2, we employ a profit

measure for MVs in V(Q) defined as

profitðMV ,QRVÞ¼ costðFT ,QRVÞ � ðcostðFT ,QRV� *QRðMV ÞÞþ costðMV ,QRðMV ÞÞÞ

ffi costðFT ,QRðMV ÞÞ � costðMV ,QRðMV ÞÞ

¼minfnFT � areaðQRðMV ÞÞ

areaðRðFTÞÞ ,NFTg � NMV , if MVaFT

0; otherwise

8><>:

The query region for an MV is computed by

QRðMV Þ ¼ RðMV Þ \ *QRV:

At each selection of an MV, query regions and

profits are recomputed for all the remaining MVs in

V(Q) and then the MV with the largest profit value is

opted. The algorithm terminates when the maximum

profit is 0, i.e., the fact table has the maximum profit

among the remaining MVs, or the remaining query

region becomes empty. Time complexity of the

algorithm is O(n2) where n is the cardinality of V(Q).

6. Experimental evaluation

In this section, we evaluate our rewriting method

adopting the MV selection algorithm proposed in

Section 5 by some experiments. We performed two

kinds of evaluations to show the effectiveness and

performance of the algorithm. We measured the

quality of the solution produced by the algorithm

and compared it with the optimal solution, i.e., the set

of MVs and their query regions constituting the most

efficient query plan. To find the optimal solution, we

developed an A* algorithm which avoids an exhaus-

tive search of the state–space graph by pruning a large

part of the search space that do not contain the optimal

solution [18]. We also measured and compared the

execution time taken by the algorithms.

We implemented the proposed algorithm and con-

ducted experiments on a SUN UltraSPARC-II 450

MHz processor with 256 MB main memory running

Solaris 2.7. We made an assumption that the fact table

has 1 billion tuples and the values of their foreign key

attributes referring to dimension tables are uniformly

distributed. We supposed that there are four dimension

tables in the DW and five dimension levels in each

dimension hierarchy. The fan-outs along the dimen-

sion hierarchies were set to 10 in all dimensions. We

assumed the size of a tuple in the fact table and MVs to

be 128 bytes and used a page size of 8 KB.

We generated 1000 materialized views without

any knowledge of query workloads. The SGs and

AGs of the MVs were uniformly distributed over all

granularities in the DH lattice and SG�AG held for

all MVs. The selection region of an MV was

arbitrarily determined on the SG of the MV under

the constraints that 50% of the intervals of dimension

levels should be points and its area be larger than

10% of the area of the fact table.

Then, 500 test OLAP queries over the fact table

were generated and rewritten by the proposed

method. The AGs and SGs of the queries were also

uniformly selected like those of MVs. Moreover, to

simulate a workload of typical OLAP, we made query

distribution such that 30% of the queries are drill-

down queries, another 30% are roll-up queries, and

remaining 40% are random queries. The drill-down

or roll-up queries were restricted to go down or up at

most one dimension level along each dimension

hierarchy. The selective regions of the queries were

generated like those of MVs except that they have no

constraint on their area. We assumed that all MVs

and queries have only SUM as their aggregate

function and all queries have no HAVING condition.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399394

6.1. Quality of the solution

We ran the greedy algorithm as well as the A*

algorithm on each test query to find a greedy solution

and an optimal solution for selecting MVs and query

regions. Using the cost model proposed in Section

5.2, we estimated the execution costs of the rewritten

queries from two solutions. We also computed the

cost of evaluating the given query over the fact table.

Then we compared the three costs with one another.

Fig. 12 shows the estimated costs of the rewritten

queries generated with the solutions of the greedy and

A* algorithm. Only the results for 208 queries for

whom there exist a rewritten query having the

execution cost lower than that of the original one

are presented. The queries are arranged on the x-axis

in non-decreasing order of their costs over the fact

table.

For each test query, we also compared the ratio of

the cost of a rewritten query over the cost of the test

query, as depicted in Fig. 13. The queries are

arranged in non-decreasing order of the cost ratios

for their optimal solutions. The result indicates that

the average cost of the rewritten queries using the

optimal solutions is 16.57%, and the average cost of

the rewritten queries using the greedy solutions is

17.26%. Note that if we create MVs using static view

selection techniques such as [2,10,12,13,15,24] or

adopt dynamic view management schemes [1,14]

reflecting the changing query workloads in data

warehouses, the expected cost of the rewritten queries

will be more reduced.

Fig. 13. The ratios of the cost of the optimal and greedy solution to the cost of the given query.

Fig. 12. The estimated execution costs of the given queries and their rewritten queries.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399 395

Fig. 14 presents the quality of the solution

returned by the proposed greedy algorithm, which

is measured by a ratio of the cost of the optimal

solution to the cost of the greedy solution. We

observe that most of the solutions obtained by the

proposed algorithm are close to optimal. It yielded

optimal solutions for 137 queries, which are about

66% of the test queries for which an efficient

rewriting exists. The sub-optimal solution had the

quality of around 80% of that of the optimal solution

on the average.

6.2. Execution time

Fig. 15 shows the execution time taken by the

proposed greedy algorithm to find a solution for each

test query, which is compared with that of the A*

algorithm. The results are arranged in non-decreasing

order of the execution time of the A* algorithm. In

this experiment, the greedy algorithm took about

1/160 of the execution time of the A* algorithm on

the average. This ratio is expected to be much more

reduced as the number of MVs to be searched by the

A* algorithm is increased. We also observe that the

greedy algorithm scales up to large number of

candidate MVs well in contrast to the A* algorithm

whose execution time increases exponentially.

7. Related work

Several works on answering conjunctive queries

using materialized views has been proposed in the

literature [3,5,6,11,16,22,23,26]. Levy et al. [16]

formalized the problem of finding rewritings of a

conjunctive query under set semantics in terms of

containment mappings from views to the query.

Chaudhuri et al. [5] addressed the problem of

optimizing queries using MVs. They proposed a

rewriting method for conjunctive SPJ queries and

Fig. 15. Execution time of the greedy algorithm and the A* algorithm.

Fig. 14. The ratios of the cost of the optimal solution to the cost of the greedy solution.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399396

integrated it into a cost-based query optimization

algorithm. However, they did not consider either

aggregate queries or aggregate views and their

algorithm can generate only single-block queries.

Recently, the problem of rewriting aggregate

queries has received much attention. Gupta et al.

[11] proposed an algorithm for answering aggregate

queries using materialized views in DW environ-

ments. Their algorithm performs syntactic trans-

formations on the query tree of a given query using

a set of proposed transformation rules. In their

algorithm, however, an MV is usable only when a

part of the definition of the query can be exactly

transformed to the definition of the MV. Hence, the

class of usable MVs and the types of rewritings

generated by the algorithm are restrictive.

Srivastava et al. [23] proposed several algorithms

to rewrite aggregate queries using conjunctive or

aggregate MVs. They presented sufficient conditions

for an MV to be used in rewriting a query using their

methods, which include the following constraints.

First, all tables included in the definition of the MV

must also appear in the definition of the query.

Second, if an attribute in the SELECT or GROUP BY

clause of the query is from a table included in the

definition of the MV, the SELECT clause of the MV

must contain the attribute. In Example 2, However,

Time that is contained in all MVs does not appear in

the definition of Q2, and in Example 1, state in the

SELECT clause of Q1 does not contained in the

SELECT clause of MV3. Thus, their algorithms

cannot perform the rewritings such as Q1V and Q2V.

As to the syntactic forms of rewritten queries, Ref.

[23] includes the UNION ALL of single-block

queries, but it is a kind of the UNION rewriting in

our method since each single-block query computes a

separate part of the aggregate groups of the given

query. Without a detail description of the rewriting

algorithm, Albrecht et al. [1] also presented the same

UNION rewriting example in the form of a UNION

ALL multi-block query in their scheme for dynamic

management of multidimensional query results. The

UNION ALL-GROUP BY rewriting proposed in this

paper cannot be performed by either of them.

In Refs. [3,26], new methods that exploit various

MVs to evaluate complex queries in DWs have been

suggested. While most previous algorithms demand

that there should be a one-to-one mapping or a

containment mapping from an MV to the given

query, the algorithm proposed in Chang and Lee [3]

can utilize the MVs which include relations not

referred to in the query. The usability of MVs is

determined based on the functional dependencies

between grouping attributes in MVs and the query.

However, they did not consider selection predicates

of the query and MVs, and the algorithm can generate

only a single-block aggregate query. Zaharioudakis et

al. [26] addressed the rewriting of queries including

complex expressions, supergroup aggregation, and

nested subqueries. They represent queries and MVs

as graphs and suggested matching conditions and

compensation rules for equivalence between two

subgraphs in a query and an MV for several simple

matching patterns. Their algorithm scans the query

and MV graphs in a bottom-up fashion, identifying

potential pairs of matching subgraphs, performing

compensation for the matching patterns, and generat-

ing rewritten subqueries including the MV. However,

they did not consider the types of rewritings proposed

in this paper which use UNION and UNION ALL

operators.

Pottinger and Levy [22] proposed an algorithm for

rewriting a conjunctive query using conjunctive

views in the context of data integration. Our method

is different from the work in that it finds the

equivalent rewriting of a given query which is

optimal or efficient in execution cost while the

algorithm in Ref. [22] generates the maximally

contained rewriting which obtains as many answers

as possible from existing MVs.

In summary, the previous query rewriting ap-

proaches have some restrictions. They did not

effectively exploit semantic information such as

dimension hierarchies in DWs and the characteristics

of OLAP queries. Hence, they can perform relatively

simple types of rewritings using a limited class of

MVs compared with our proposed method. In ad-

dition, the problem of finding the cost-optimal re-

written query among many candidate ones has not

been investigated enough in the previous work.

8. Conclusions and future work

In this paper, we proposed an approach to rewrite a

given OLAP query using MVs existing in data

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399 397

warehouses. We defined the normal form of typical

OLAP queries and MVs based on the lattice structure

derived from dimension hierarchies in DWs. We

presented conditions for usability of MVs in rewriting

OLAP queries and then proposed a rewriting method

that can rewrite the normal form OLAP queries using

various kinds of MVs together. The algorithm consists

of three main steps. In the first step, it selects MVs that

will be used in rewriting and determines query regions

for them. In the second step, it generates query blocks

for the selected MVs using their query regions. The

last step integrates the query blocks into a final

rewritten query. We use two ways of integration, viz.,

the UNION integration and the UNION ALL-GROUP

BY integration, depending on a relationship between

the SGs of MVs and the AG of the query. By

exploiting the semantic information available in DWs

and the characteristics of OLAP queries, the proposed

algorithm utilizes a much broader class of MVs and

yields more general types of rewritings than other

previous approaches can do. Furthermore, we inves-

tigated the problem of finding an optimal set of MVs

and their query regions that results in a rewritten query

which can be executed efficiently. We presented a

heuristic algorithm and showed by experiments that it

provides an effective solution in an acceptable time

even for a large number of materialized views.

Acknowledgements

This work was supported in part by BK21

Educational–Industrial collaboration fund and the

Ministry of Information and Communication of Korea

(‘‘Support Project of University Foundation Re-

search < 2000>’’ supervised by IITA).

References

[1] J. Albrecht, A. Bauer, O. Deyerling, H. Gunze, W. Hummer,

W. Lehner, L. Schlesinger, Management of Multidimensional

Aggregates for Efficient Online Analytical Processing, Pro-

ceedings of International Database Engineering and Applica-

tions Symposium, 1999, pp. 156–164.

[2] E. Baralis, S. Paraboschi, E. Teniente, Materialized View Se-

lection in a Multidimensional Database, Proceedings of the

23rd International Conference on Very Large Data Bases,

1997, pp. 318–329.

[3] J. Chang, S. Lee, Query Reformulation Using Materialized

Views in Data Warehouse Environment, Proceedings of the

First ACM International Workshop on Data Warehousing

and OLAP, 1998, pp. 54–59.

[4] S. Chaudhuri, U. Dayal, An overview of data warehousing and

OLAP technology, SIGMOD Record 26 (1) (1997) 65–74.

[5] S. Chaudhuri, R. Krishnamurthy, S. Potamianos, K. Shim,

Optimizing Queries with Materialized Views, Proceedings of

11th IEEE International Conference on the Data Engineering,

1995, pp. 190–200.

[6] C.M. Chen, N. Roussopoulos, The Implementation and Per-

formance Evaluation of the ADMS Query Optimizer: Integrat-

ing Query Result Caching and Matching, Proceedings of the

4th International Conference on Extending Database Technol-

ogy, 1994, pp. 323–336.

[7] M. Ester, J. Kohlhammer, H.-P. Kriegel, The DC-Tree: A Fully

Dynamic Index Structure for Data Warehouses, Proceedings of

the 16th IEEE International Conference on Data Engineering,

2000, pp. 379–388.

[8] M.R. Garey, D.S. Johnson, Computers and Intractability: A

Guide to the Theory of NP-completeness, W.H. Freeman,

San Francisco, CA, 1979.

[9] J. Gray, A. Bosworth, A. Laymen, H. Pirahesh, Data Cube: A

Relational Aggregation Operator Generalizing Group-By,

Cross-Tab, and Sub-Totals, Proceedings of the 12th IEEE Inter-

national Conference on Data Engineering, 1996, pp. 152–159.

[10] H. Gupta, Selection of Views to Materialize in a Data Ware-

house, Proceedings of the International Conference on Data-

base Theory, 1997, pp. 98–112.

[11] A. Gupta, V. Harinarayan, D. Quass, Aggregate-Query Pro-

cessing in Data Warehousing Environments, Proceedings of

the 21st International Conference on Very Large Data Bases,

1995, pp. 358–369.

[12] H. Gupta, I.S. Mumick, Selection of Views to Materialize

Under a Maintenance Cost Constraint, Proceedings of the Inter-

national Conference on Database Theory, 1999, pp. 453–470.

[13] V. Harinarayan, A. Rajaraman, J.D. Ullman, Implementing

Data Cube Efficiently, Proceedings of the ACM SIGMOD

International Conference on Management of Data, 1996, pp.

205–216.

[14] Y. Kortidis, N. Roussopoulos, DynaMat: A Dynamic View

Management System for Data Warehouses, Proceedings of

the ACM SIGMOD International Conference on Management

of Data, 1999, pp. 371–382.

[15] W.J. Labio, D. Quass, B. Adelberg, Physical Database Design

for Data Warehouses, Proceedings of the 13th IEEE Interna-

tional Conference on Data Engineering, 1997, pp. 277–288.

[16] A.Y. Levy, A.O. Mendelzon, Y. Sagiv, D. Srivastava, Answer-

ing Queries Using Views, Proceedings of the ACM Symposi-

um on Principles of Database Systems, 1995, pp. 95–104.

[17] A. Margalit, G.D. Knott, An algorithm for computing the

union, intersection or difference of two polygons, Computers

and Graphics 13 (2) (1989) 167–183.

[18] N.J. Nilsson, Artificial Intelligence: A New Synthesis, Morgan

Kaufmann Publishers, San Francisco, CA, 1998.

[19] P. O’Neil, G. Graefe, Multi-table joins through bitmapped join

indices, SIGMOD Record 24 (3) (1995) 8–11.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399398

[20] P. O’Neil, D. Quass, Improved Query Performance with Var-

iant indexes, Proceedings of the ACM SIGMOD International

Conference on Management of Data, 1997, pp. 38–49.

[21] C.-S. Park, M.H. Kim, Y.-J. Lee, Rewriting OLAP Queries

Using Materialized Views and Dimension Hierarchies in Data

Warehouses, Proceedings of the 17th IEEE International Con-

ference on Data Engineering, 2001, pp. 515–523.

[22] R. Pottinger, A. Levy, A Scalable Algorithm for Answering

Queries Using Views, Proceedings of the 26th International

Conference on Very Large Data Bases, 2000, pp. 484–495.

[23] D. Srivastava, S. Dar, H.V. Jagadish, A.Y. Levy, Answering

Queries with Aggregation Using Views, Proceedings of the

22nd International Conference on Very Large Data Bases,

1996, pp. 318–329.

[24] D. Theodoratos, T. Sellis, Data Warehouse Configuration, Pro-

ceedings of the 23rd International Conference on Very Large

Data Bases, 1997, pp. 318–329.

[25] M.-C. Wu, A.P. Buchmann, Encoded Bitmap Indexing for

Data Warehouses, Proceedings of the 14th IEEE International

Conference on Data Engineering, 1998, pp. 220–230.

[26] M. Zaharioudakis, R. Cochrane, G. Lapis, H. Pirahesh, M.

Urata, Answering Complex SQL Queries Using Automatic

Summary Tables, Proceedings of 2000 ACM SIGMOD Inter-

national Conference on Management of Data, 2000, pp. 105–

116.

Chang-Sup Park received his BS and

MS degrees in Computer Science from

Korea Advanced Institute of Science and

Technology (KAIST), Teajon, Korea, in

1995 and 1997, respectively. He is cur-

rently a PhD student in the Department of

Computer Science at KAIST. His research

interests include OLAP, data warehouses,

multi-dimensional indexing, and semi-

structured data management. He is a stu-

dent member of the ACM.

Myoung Ho Kim received his BS and

MS degrees in Computer Engineering

from Seoul National University, Seoul,

Korea, in 1982 and 1984, respectively,

and his PhD degree in Computer Science

from Michigan State University, East

Lansing, MI, in 1989. In 1989, he joined

the faculty of the Department of Com-

puter Science at KAIST, Taejon, Korea,

where currently he is a professor. His

research interests include OLAP, data

mining, information retrieval, mobile computing and distributed

processing. He is a member of the ACM and the IEEE Computer

Society.

Yoon-Joon Lee received his BS degree

in Computer Science and Statistics from

Seoul National University, Seoul, Korea,

in 1977, his MS degree in Computer

Science from Korea Advanced Institute

of Science and Technology (KAIST),

Taejon, Korea, in 1979, and his PhD

degree in Computer Science from INPG-

ENSIMAG, France, in 1983. In 1984, he

joined the faculty of the Department of

Computer Science at KAIST, Taejon,

Korea, where currently he is a professor. His research interests

include data warehouses, OLAP, WWW, multimedia information

systems, and database systems. He is a member of the ACM and the

IEEE Computer Society.

C.-S. Park et al. / Decision Support Systems 32 (2002) 379–399 399