The VLDB Journal (2008) 17:621–655
DOI 10.1007/s00778-006-0022-1
REGULAR PAPER
Hierarchical clustering for OLAP: the CUBE File approach
Nikos Karayannidis
Timos Sellis
Received: 6 September 2005 / Accepted: 13 April 2006 / Published online: 7 September 2006
© Springer-Verlag 2006
Abstract This paper deals with the problem of physical clustering of multidimensional data that are organized in hierarchies, on disk, in a hierarchy-preserving manner. This is called hierarchical clustering. A typical case where hierarchical clustering is necessary for reducing I/Os during query evaluation is the most detailed data of an OLAP cube. The presence of hierarchies in the multidimensional space results in an enormous search space for this problem. We propose a representation of the data space that results in a chunk-tree representation of the cube. The model is adaptive to the cube's extensive sparseness and provides efficient access to subsets of data based on hierarchy value combinations. Based on this representation of the search space, we formulate the problem as a chunk-to-bucket allocation problem, which is a packing problem, as opposed to the linear ordering approach followed in the literature. We propose a metric to evaluate the quality of the hierarchical clustering achieved (i.e., to evaluate the solutions to the problem) and formulate the problem as an optimization problem. We prove its NP-Hardness and provide an effective solution based on a linear-time greedy algorithm. The solution of this problem leads to the construction of the CUBE File data structure. We analyze in depth all steps of the construction and provide solutions
Communicated by P.-L. Lions.

N. Karayannidis (B) · T. Sellis
Institute of Communication and Computer Systems and School of Electrical and Computer Engineering, National Technical University of Athens, Zographou 15773, Athens, Greece
e-mail: [email protected]

T. Sellis
e-mail: [email protected]
for interesting sub-problems arising, such as the formation of bucket-regions, the storage of large data chunks and the caching of the upper nodes (root directory) in main memory.
Finally, we provide an extensive experimental evaluation of the CUBE File's adaptability to the data space sparseness as well as to an increasing number of data points. The main result is that the CUBE File is highly adaptive to even the most sparse data spaces and, for realistic cases of data point cardinalities, provides hierarchical clustering of high quality and significant space savings.
Keywords Hierarchical clustering · OLAP · CUBE File · Data cube · Physical data clustering
1 Introduction
Efficient processing of ad hoc OLAP queries is a very difficult task considering, on the one hand, the native complexity of typical OLAP queries, which potentially combine huge amounts of data, and, on the other, the fact that no a priori knowledge of queries exists and thus no pre-computation of results or other query-specific tuning can be exploited. The only way to evaluate these queries is to access the most detailed data directly in an efficient way. It is exactly this need to access detailed data based on hierarchy criteria that calls for the hierarchical clustering of data. This paper discusses the physical clustering of OLAP cube data points on disk in a hierarchy-preserving manner, where hierarchies are defined along dimensions (hierarchical clustering).
622 N. Karayannidis, T. Sellis
The problem addressed is set out as follows: we are given a large fact table (FT) containing only grain-level (most detailed) data. We assume that this is part of the star schema in a dimensional data warehouse. Therefore, data points (i.e., tuples in the FT) are organized by a set of N dimensions. We further assume that each dimension is organized in a hierarchy. Typically, the data distribution is extremely skewed. In particular, the OLAP cube is extremely sparse and data tend to appear in arbitrary clusters along some dimensions. These clusters correspond to specific combinations of the hierarchy values for which there exist actual data (e.g., sales for a specific product category in a specific geographic region for a specific period of time). The problem is, on the one hand, to store the fact table data in a hierarchy-preserving manner so as to reduce I/Os during the evaluation of ad hoc queries containing restrictions and/or groupings on the dimension hierarchies and, on the other, to enable navigation in the multilevel-multidimensional data space by providing direct access (i.e., indexing) to subsets of data via hierarchical restrictions. The latter implies that index nodes must also be hierarchically clustered if we are aiming at a reduced I/O cost.
Some of the most interesting proposals [20, 21, 36] in the literature for cube data structures deal with the computation and storage of the data cube operator [9]. These methods omit a significant aspect of OLAP, which is that usually dimensions are not flat but are organized in hierarchies of different aggregation levels (e.g., store, city, area, country is such a hierarchy for a Location dimension). The most popular approach for organizing the most detailed data of a cube is the so-called star schema. In this case, the cube data are stored in a relational table, called the fact table. Furthermore, various indexing schemes have been developed [3, 15, 25, 26] in order to speed up the evaluation of the join of the central (and usually very large) fact table with the surrounding dimension tables (also known as a star-join). However, even when elaborate indexes are used, due to the arbitrary ordering of the fact table tuples, there might be as many I/Os as there are tuples resulting from the fact table.
We propose the CUBE File data structure as an effective solution to the hierarchical clustering problem set out above. The CUBE File multidimensional data structure [18] clusters data into buckets (i.e., disk pages) with respect to the dimension hierarchies, aiming at the hierarchical clustering of the data. Buckets may include both intermediate (index) nodes (directory chunks) and leaf (data) nodes (data chunks). The primary goal of a CUBE File is to cluster in the same bucket a family of data (i.e., data corresponding to all hierarchy value combinations for all dimensions) so as to reduce the bucket accesses during query evaluation.
Experimental results in [18] have shown that the CUBE File outperforms the UB-tree/MHC [22], which is another effective method for hierarchically clustering the cube, resulting in 7–9 times fewer I/Os on average over all workloads tested. This simply means that the CUBE File achieves a higher degree of hierarchical clustering of the data. More interestingly, in [15] it was shown that the UB-tree/MHC technique outperformed the traditional bitmap-index-based star-join by a factor of 20–40, which simply proves that hierarchical clustering is the most determinant factor in a file organization for OLAP cube data, in order to reduce I/O cost.
To tackle this problem, we first model the cube data space as a hierarchy of chunks. This model, called the chunk-tree representation of a cube, copes effectively with the vast data sparseness by truncating empty areas. Moreover, it provides a multiple-resolution view of the data space, where one can zoom in or zoom out to specific areas by navigating along the dimension hierarchies. The CUBE File is built by allocating the nodes of the chunk-tree into buckets in a hierarchy-preserving manner. In this way we depart from the common approach for solving the hierarchical clustering problem, which is to find a total ordering of the data points (linear clustering), and cope with it as a packing problem, namely a chunk-to-bucket packing problem.
In order to solve the chunk-to-bucket packing problem, we need to be able to evaluate the hierarchical clustering achieved (i.e., to evaluate the solutions to this problem). Thus, inspired by the chunk-tree representation of the cube, we define a hierarchical clustering quality metric, called the hierarchical clustering factor. We use this metric to evaluate the quality of the chunk-to-bucket allocation. Moreover, we exploit it in order to formulate the CUBE File construction problem as an optimization problem, which we call the chunk-to-bucket allocation problem. We formally define this problem and prove that it is NP-Hard. Then, we propose a heuristic algorithm as a solution that requires a single pass over the input fact table and time linear in the number of chunks.
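The greedy idea behind such a single-pass packing can be sketched as follows. This is a hypothetical, simplified illustration, not the paper's actual algorithm or data structures (the `Chunk` class, the sizes and `BUCKET_SIZE` are all invented for the example): if an entire chunk subtree fits in one bucket, the whole family of data is kept together; otherwise the node is stored on its own and its children are packed recursively.

```python
BUCKET_SIZE = 8  # bucket capacity in abstract "chunk units" (illustrative)

class Chunk:
    """A node of a chunk-tree; `size` is its own storage cost."""
    def __init__(self, size, children=()):
        self.size = size
        self.children = list(children)

    def subtree_size(self):
        return self.size + sum(c.subtree_size() for c in self.children)

def allocate(chunk, buckets):
    """Single top-down greedy pass: keep whole subtrees together when they fit."""
    if chunk.subtree_size() <= BUCKET_SIZE:
        bucket = []
        _collect(chunk, bucket)        # the whole "family of data" in one bucket
        buckets.append(bucket)
    else:
        buckets.append([chunk])        # oversized node goes to its own bucket
        for child in chunk.children:
            allocate(child, buckets)

def _collect(chunk, bucket):
    bucket.append(chunk)
    for c in chunk.children:
        _collect(c, bucket)

# A root with one subtree that fits a bucket and one that must be split.
small = Chunk(2, [Chunk(1), Chunk(1)])   # subtree size 4: fits one bucket
large = Chunk(2, [Chunk(5), Chunk(5)])   # subtree size 12: must be split
root = Chunk(1, [small, large])
buckets = []
allocate(root, buckets)
```

A single top-down pass suffices, and the work is linear in the number of chunks, which matches the spirit of the complexity claim above.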
In the course of solving this problem, several interesting sub-problems arise. We define the sub-problem of chunk-region formation, which deals with the clustering of chunk-trees hanging from the same parent node in order to further increase the overall hierarchical clustering. We propose two algorithms as a solution, one of which is driven by workload patterns. Next, we deal with the sub-problem of storing large data chunks (i.e., chunks that do not fit in a single bucket), as well as with the sub-problem of storing the so-called root
directory of the CUBE File (i.e., the upper nodes of the
data structure).
Finally, we study the CUBE File's effective adaptation to several cube data spaces by presenting a set of experimental measurements that we have conducted.
All in all, the contributions of this paper are outlined as follows:

- We provide an analytic solution to the problem of hierarchical clustering of an OLAP cube. The solution leads to the construction of the CUBE File data structure.
- We model the multilevel-multidimensional data space of the cube as a chunk-tree. This representation of the data space adapts perfectly to the extensive data sparseness and provides a multi-resolution view of the data with respect to the hierarchies. Moreover, if viewed as an index, it provides direct access to cube data via hierarchical restrictions, which results in significant speedups of typical ad hoc OLAP queries.
- We transform the hierarchical clustering problem from a linear clustering problem into a chunk-to-bucket allocation (i.e., packing) problem, which we formally define and prove to be NP-Hard.
- We introduce a hierarchical clustering quality metric for evaluating the hierarchical clustering achieved (i.e., evaluating the solution to the problem in question). We provide an efficient solution to this problem as well as to all sub-problems that stem from it, such as the storage of large data chunks or the formation of bucket-regions.
- We provide an experimental evaluation which leads to the following basic results:
  o The CUBE File adapts perfectly to even the most extremely sparse data spaces, yielding significant space savings. Furthermore, the hierarchical clustering achieved by the CUBE File is almost unaffected by the extensive cube sparseness.
  o The CUBE File is scalable for any realistic number of input data points. In addition, the hierarchical clustering achieved remains of high quality when the number of input data points increases.
  o The root directory can be cached in main memory, providing a single-I/O cost for the evaluation of point queries.
The rest of this paper is organized as follows. Section 2 discusses related work and positions the CUBE File in the space of cube storage structures. Section 3 proposes the chunk-tree representation of the cube as an effective representation of the search space. Section 4 introduces a quality metric for the evaluation of hierarchical clustering. Section 5 formally defines the problem of hierarchical clustering, proves its NP-Hardness and then delves into the nuts and bolts of building the CUBE File. Section 6 presents our extensive experimental evaluation and Sect. 7 recapitulates and emphasizes the main conclusions drawn.
2 Related work
2.1 The linear clustering problem for multidimensional data
The linear clustering problem for multidimensional data is defined as the problem of finding a linear ordering of records indexed on multiple attributes, to be stored in consecutive disk blocks, such that the I/O cost for the evaluation of queries is minimized. The clustering of multidimensional data has been studied in terms of finding a mapping of the multidimensional space to a one-dimensional space. This approach has been explored mainly in two directions: (a) in order to exploit traditional one-dimensional indexing techniques in a multidimensional index space (a typical example is the UB-tree [2], which exploits a z-ordering of multidimensional data [27], so that these can be stored in a one-dimensional B-tree index [1]) and (b) for ordering buckets containing records that have been indexed on multiple attributes, to minimize the disk access effort. For example, a grid file [23] exploits a multidimensional grid in order to provide a mapping between grid cells and disk blocks. One could find a linear ordering of these cells, and therefore an ordering of the underlying buckets, such that the evaluation of a query entails more sequential bucket reads than random bucket accesses. To this end, space-filling curves (see [33] for a survey) have been used extensively. For example, Jagadish [13] provides a linear clustering method based on the Hilbert curve that outperforms previously proposed mappings. Note, however, that all linear clustering methods are inferior to a simple scan in high-dimensional spaces. This is due to the notorious dimensionality curse [41], which states that clustering in such spaces becomes meaningless due to the lack of useful distance metrics.
In the presence of dimension hierarchies the multidi-
mensional clustering problem becomes combinatorially
explosive. Jagadish et al. [14] try to solve the problem of
finding an optimal linear clustering of records of a fact
table on disk, given a specific workload in the form of a
probability distribution over query classes. The authors
propose a subclass of clustering methods called lattice
paths, which are paths on the lattice defined by the
hierarchy level combinations of the dimensions. The
HPP chunk-to-bucket allocation problem (in Sect. 3.2
we provide a formal definition of HPP restrictions and
queries) is a different problem for the following reasons:
1. It tries to find an optimal way (in terms of reduced I/O cost during query evaluation) to pack the data into buckets, rather than order the data linearly. The problem of finding an optimal linear ordering of the buckets, for a specific workload, so as to reduce random bucket reads, is an orthogonal problem and, therefore, the methods proposed in [14] could be used additionally.
2. Apart from the data, it also deals with the intermediate node entries (i.e., directory chunk entries), which provide clustering at the whole-index level and not only at the index-leaf level. In other words, index data are also clustered along with the real data.
As we know that there is no linear clustering of records that will permit all queries over a multidimensional space to be answered efficiently [14], we strongly advocate that linear clustering of buckets (inter-bucket clustering) must be exploited in conjunction with an efficient allocation of records into buckets (intra-bucket clustering).
Furthermore, in [22], a path-based encoding of dimension data, similar to our encoding scheme, is exploited in order to achieve linear clustering of multidimensional data with hierarchies, through a z-ordering [27]. The authors use the UB-tree [2] as an index on top of the linearly clustered records. This technique has the advantage of transforming typical star-join [25] queries into multidimensional range queries, which are computed more efficiently due to the underlying multidimensional index.
However, this technique suffers from the inherent deficiencies of the z space-filling curve, which is not the best space-filling curve according to [7, 13]. On the other hand, it is very easy to compute and thus straightforward to implement, even for high dimensionalities. A typical example of such a deficiency is that in the z-curve there is a dispersion of certain data points, which are close in the multidimensional space but not close in the linear order, and the opposite, i.e., distant data points are clustered in the linear space. The latter also results in an inefficient evaluation of multiple disjoint query regions, due to the repetitive retrieval of the same pages for many queries. Finally, the benefits of z-based linear clustering start to disappear quite soon as dimensionality increases, practically even when dimensionality exceeds 4–5 dimensions.
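The z-ordering mentioned above is obtained by interleaving the bits of the coordinates (or, in [22], of the compound surrogates). A minimal sketch, with invented coordinates, also illustrates the dispersion deficiency: spatially adjacent points can end up far apart in the linear order.

```python
def z_value(coords, bits=8):
    """Morton (z-order) value: interleave the bits of all coordinates."""
    z = 0
    ndims = len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * ndims + d)
    return z

# Dispersion: (3, 0) and (4, 0) are one step apart in space, but their
# z-values jump from 5 to 16, because the high-order bit changes.
a = z_value((3, 0))   # 5
b = z_value((4, 0))   # 16
```

The example uses plain integer coordinates; in the hierarchical setting of [22], the interleaved values would be the bits of each dimension's compound surrogate instead.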
2.2 Grid file based multidimensional access methods
The CUBE File organization was initially inspired by the grid file organization [23], which can be viewed as the multidimensional counterpart of extendible hashing [6]. The grid file superimposes a d-dimensional orthogonal grid on the multidimensional space. Given that the grid is not necessarily regular, the resulting cells may be of different shapes and sizes. A grid directory associates one or more of these cells with data buckets, which are stored in one disk page each. Each cell is associated with one bucket, but a bucket may contain several adjacent cells; therefore, bucket regions may be formed.
To ensure that data items are always found with no more than two disk accesses for exact-match queries, the grid itself is kept in main memory, represented by d one-dimensional arrays called scales. The grid file is intended for dynamic insert/delete operations; therefore, it supports operations for splitting and merging directory cells. A well-known problem of the grid file is that it suffers from a superlinear growth of the directory, even for data that are uniformly distributed [31]. One basic reason for this is that splitting is not a local operation and thus can lead to superlinear directory growth. Moreover, depending on the implementation of the grid directory, merging may require a complete directory scan [12].
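The two-disk-access principle can be sketched as follows, with invented scales and a toy directory (not the grid file's actual implementation): the in-memory scales locate a point's cell via binary search on each dimension, and the grid directory then maps that cell to its bucket, so an exact-match query costs at most one directory access plus one bucket access.

```python
import bisect

# Scales: one sorted array of partition points per dimension (in memory).
scales = [
    [10, 20, 30],   # dimension 0 split at 10, 20, 30 -> 4 intervals
    [100, 200],     # dimension 1 split at 100, 200   -> 3 intervals
]

# Grid directory: cell coordinates -> bucket id. Several cells may share
# a bucket, forming a bucket region (the mapping here is arbitrary).
directory = {(i, j): (i + j) % 3 for i in range(4) for j in range(3)}

def cell_of(point):
    """Locate the grid cell of a point via binary search on each scale."""
    return tuple(bisect.bisect_right(scale, c) for scale, c in zip(scales, point))

def bucket_of(point):
    """Directory lookup (1st disk access); reading the bucket is the 2nd."""
    return directory[cell_of(point)]
```

For example, the point (15, 250) falls into cell (1, 2), whose directory entry names the bucket to fetch.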
Hinrichs [12] attempts to overcome the shortcomings of the grid file by introducing a 2-level grid directory. In this scheme, the grid directory is stored on disk and a scaled-down version of it (called the root directory) is kept in main memory to ensure that the two-disk-access principle still holds. Furthermore, he discusses efficient implementations of the split, merge and neighborhood operations. In a similar manner, Whang and Krishnamurthy [43] extend the idea of a 2-level directory to a multilevel directory, introducing the multilevel grid file and achieving directory growth linear in the number of records. There exist more grid file based organizations. A comprehensive survey of these, and of multidimensional access methods in general, can be found in [8].
An obvious distinction of the CUBE File organization from the above multidimensional access methods is that it has been designed to fulfill completely different requirements, namely those of an OLAP environment and not of a transaction-oriented one. A CUBE File is designed for an initial bulk loading and then a read-only operation mode, in contrast to the dynamic insert/delete/update workload of a grid file. Moreover, a CUBE File aims at speeding up queries on multidimensional data with hierarchies and exploits hierarchical clustering to this end. Furthermore, as the dimension domain in OLAP is known a priori, the directory does not have to grow dynamically. In addition, changes to
the directory are rare, as dimension data do not change very often (compared to the rate of change of the cube data), and deletions are seldom; therefore, split and merge operations are not needed as much. Nevertheless, it is more important to adapt well to the native sparseness of a cube data space and to efficiently support incremental updating, so as to minimize the update window and cube query-down time, which are critical factors in business intelligence applications nowadays.
2.3 Taxonomy of cube primary organizations
The set of methods reported in the literature for primary organizations for the storage of cubes is quite confined. We believe that this is basically due to two reasons: first of all, the generally held view is that a cube is a set of pre-computed aggregated results, and thus the main focus has been to devise efficient ways to compute these results [11], as well as to choose which ones to compute for a specific workload (the view selection/maintenance problem [10, 32, 37]). Kotidis and Roussopoulos [19] proposed a storage organization based on packed R-trees for storing these aggregated results. We believe that this is a one-sided view of the problem, as it disregards the fact that very often, especially for ad hoc queries, there will be a need for drilling down to the most detailed data in order to compute a result from scratch. Ad hoc queries represent the essence of OLAP and, in contrast to report queries, are not known a priori and thus cannot really benefit from pre-computation. The only way to process them efficiently is to enable fast retrieval of the base data. This calls for an effective primary storage organization for the most detailed data (grain level) of the cube. This argument is of course based on the fact that a full pre-computation of all possible aggregates is prohibitive due to the consequent size explosion, especially for sparse cubes [24].
The second reason that makes people reluctant to work on new primary organizations for cubes is their adherence to relational systems. Although this seems justified, one could point out that a relational table (e.g., a fact table of a star schema [4]) is a logical entity and thus should be separated from the physical method chosen for implementing it. Therefore, one can use, apart from a paged record file, a B-tree or even a multidimensional data structure as the primary organization for a fact table. In fact, there are not many commercial RDBMSs ([39] is one that we know of) that exploit a multidimensional data structure as the primary organization for fact tables. All in all, the integration of a new data structure into a full-blown commercial system is a strenuous task with high cost and high risk, and thus the proposed solutions are usually reluctant to depart from the existing technology (see also [30] for a detailed description of the issues in this integration).
Figure 1 positions the CUBE File organization in the space of primary organizations proposed for storing a cube (i.e., only the base data and not aggregates). The columns of this table describe the alternative data structures that have been proposed as a primary organization, while the rows classify the proposed methods according to the data clustering achieved. In the top-left cell lies the conventional star schema [4], where a paged record file is used for storing the fact table. This organization guarantees no particular ordering among the stored data, and thus additional secondary indexes are built around it in order to support efficient access to the data.
Padmanabhan et al. [28] assume a typical relation (i.e., a paged record file) as the primary organization of a cube (i.e., fact table). However, unique combinations of dimension values are used in order to form blocks of records, which correspond to consecutive disk pages. These blocks can be considered as chunks. The database administrator must choose only one hierarchy level from each dimension to participate in the clustering scheme. In this sense, the method provides multidimensional clustering and not hierarchical (multidimensional) clustering.
In [35] a chunk-based method for storing large multidimensional arrays is proposed. No hierarchies are assumed on the dimensions, and data are clustered according to the most frequent range queries of a particular workload. In [5] the benefit of hierarchical clustering in speeding up queries was observed as a side effect of using a chunk-based file organization over a relation (i.e., a paged file of records) for query caching, with the chunk as the caching unit. Hierarchical clustering was achieved through an appropriate hierarchical encoding of the dimension data.
Markl et al. [22] also impose a hierarchical encoding on the dimension data and assign a path-based surrogate key, called the compound surrogate key, to each dimension tuple. They exploit the UB-tree multidimensional index [2] as the primary organization of the cube. Hierarchical clustering is achieved by taking the z-order [27] of the cube data points, produced by interleaving the bits of the corresponding compound surrogates. Deshpande et al. [5], Markl et al. [22] and the CUBE File [18] all exploit hierarchical clustering of the cube data, and the last two use multidimensional structures as the primary organization. This has, among others, the significant benefit of transforming a star-join [25] into a multidimensional range query that is evaluated very efficiently over these data structures.
Fig. 1 The space of proposed primary organizations for cube storage

                                          Primary Organization
  Clustering Achieved                     Relation      MD-Array    UB-tree    GRID FILE-based
  No clustering                           Star Schema
  Chunk-based clustering                  [28]          [35]
  Hierarchical clustering
    - other chunk-based                   [5]                                  [18]
    - z-order based                                                 [22]
3 Modeling the data space as a chunk-tree
Clearly, our goal is to define a multidimensional file organization that natively supports hierarchies. There is indeed a plethora of data structures for multidimensional data [8], but to the best of our knowledge, none of these explicitly supports hierarchies. Hierarchies complicate things, basically because, in their presence, the data space explodes.1 Moreover, as we are primarily aiming at speeding up queries that include restrictions on the hierarchies, we need a data structure that can efficiently lead us to the corresponding data subset based on these restrictions. A key observation at this point is that all restrictions on the hierarchies intuitively define a subcube or a cube-slice.
To this end, we exploit the intuitive representation of a cube as a multidimensional array and apply a chunking method in order to create subcubes, i.e., the so-called chunks. Our method of chunking is based on the structure of the dimension hierarchies and thus we call it hierarchical chunking. In the following sections we present a dimension-data encoding scheme that assigns hierarchy-enabled unique identifiers to each data point in a dimension. Then, we present our hierarchical chunking method. Finally, we propose a tree structure for representing the hierarchy of the resulting chunks and thus modeling the cube data space.
3.1 Dimension encoding and hierarchical chunking
In order to apply hierarchical chunking, we first assign
a surrogate key to each dimension hierarchy value. This
key uniquely identifies each value within the hierarchy.
1 Assuming N dimension hierarchies modeled as K-level m-way trees, the number of possible value combinations is K-times exponential in the number of dimensions, i.e., O(m^(KN)).
[Fig. 2 Example of hierarchical surrogate keys assigned to an example hierarchy: the LOCATION dimension with levels Continent, Country, Region and City (the grain level). Europe (0) has children Greece (0.0) and U.K. (0.1); Greece contains the regions North (0.0.0) and South (0.0.1); the cities are Salonica, Athens, Rhodes, Glasgow, London and Cardiff, with Rhodes identified by the h-surrogate 0.0.1.2.]
More specifically, we order the values in each hierarchy level so that sibling values occupy consecutive positions, and perform a mapping to the domain of positive integers. The resulting values are depicted in Fig. 2 for an example dimension hierarchy. The simple integers appearing under each value at each level are called order-codes. In order to identify a value in the hierarchy, we form the path of order-codes from the root value to the value in question. This path is called a hierarchical surrogate key, or simply h-surrogate. For example, the h-surrogate for the value Rhodes is 0.0.1.2. H-surrogates convey hierarchical (i.e., semantic) information about each cube data point, which can be greatly exploited for the efficient processing of star queries [15, 29, 40].
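The h-surrogate construction can be sketched as follows. The parent/order-code table below is a toy encoding of the Fig. 2 hierarchy, not the paper's actual implementation; names such as `North_GR` are invented to disambiguate the two North/South regions.

```python
# Each value maps to (parent value, order-code within its level).
hierarchy = {
    "Europe":   (None, 0),
    "Greece":   ("Europe", 0),   "U.K.":     ("Europe", 1),
    "North_GR": ("Greece", 0),   "South_GR": ("Greece", 1),
    "North_UK": ("U.K.", 2),     "South_UK": ("U.K.", 3),
    "Salonica": ("North_GR", 0), "Athens":   ("North_GR", 1),
    "Rhodes":   ("South_GR", 2),
    "Glasgow":  ("North_UK", 3), "London":   ("South_UK", 4),
    "Cardiff":  ("South_UK", 5),
}

def h_surrogate(value):
    """Path of order-codes from the root to `value`, e.g. Rhodes -> 0.0.1.2."""
    codes = []
    while value is not None:
        parent, code = hierarchy[value]
        codes.append(code)
        value = parent
    return ".".join(str(c) for c in reversed(codes))
```

For instance, `h_surrogate("Rhodes")` walks Rhodes → South_GR → Greece → Europe and emits the order-codes in root-to-leaf order, reproducing the 0.0.1.2 of the running example.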
The basic incentive behind hierarchical chunking is to partition the data space by forming a hierarchy of chunks that is based on the dimensions' hierarchies. This has the beneficial effect of pruning all empty areas. Remember that in a cube data space, empty areas are typically defined by specific combinations of hierarchy values (e.g., if we did not sell product category X in region Y for T periods of time, an empty region is formed). Moreover, it provides us with a multi-resolution view of the data space, where one can zoom in and zoom out by navigating along the dimension hierarchies.
We model the cube as a large multidimensional array, which consists only of the most detailed data. Initially, we
partition the cube into a very few chunks corresponding to the most aggregated levels of the dimensions' hierarchies. Then we recursively partition each chunk as we drill down the hierarchies of all dimensions in parallel. We define a measure in order to distinguish each recursion step: the chunking depth D. We will illustrate hierarchical chunking with an example. The dimensions of our example cube are depicted in Fig. 3 and correspond to a two-dimensional cube hosting sales data for a fictitious company. The two dimensions are LOCATION and PRODUCT. In the figure we can see the members of each level of these dimensions (each appearing with its member-code).
In order to apply our method, we need to have hierarchies of equal length. For this reason, we insert pseudo-levels P into the shorter hierarchies until they reach the length of the longest one. This padding is done after the level that is just above the grain level. In our example, the PRODUCT dimension has only three levels and needs one pseudo-level in order to reach the length of the LOCATION dimension. This is depicted next, where we have also noted the order-code range at each level:

LOCATION: [0-2].[0-4].[0-10].[0-18]
PRODUCT: [0-1].[0-2].P.[0-5]
The result of hierarchical chunking on our example
cube is depicted in Fig. 4a. Chunking begins at chunking
depth D
=0 and proceeds in a top-down fashion. To
define a chunk, we define discrete ranges of grain-level(i.e., most-detailed) values on each dimension, denoted
in the figure as [a..b], where a and b are grain-level
order-codes. Each such range is defined as the set of
values with the same parent (value) in the correspond-
ing parent level. These parent levels form the set of
pivot levels PVT, which guides the chunking process
at each step. Therefore initially, PVT = {LOCATION:
Continent, PRODUCT: Category}. For example, if we
take value 0 of pivot level Continent of the LOCA-
TIONdimension, then the corresponding range at the
grain level is Cities [0..5].
The definition of such a range for each dimension
defines a chunk. For example, the chunk defined from
the 0, 0 values of the pivot levels Continent and Cat-
egory, respectively, consists of the following grain data
(LOCATION:0.[0-1].[0-3].[0-5], PRODUCT:0.[0-1]. P.[0-3]).
The [] notation denotes a range of members. This chunk appears shaded in Fig. 4a at D = 0. Ultimately, at D = 0 we have a chunk for each possible combination of the members of the pivot levels, that is, a total of |[0-1]| × |[0-2]| = 2 × 3 = 6 chunks in this example. Thus, the total number of chunks created at each depth D equals the product of the cardinalities of the pivot levels.
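The chunk count at a given depth follows directly from the pivot-level cardinalities, as a small sketch shows (the function name is our own):

```python
from math import prod

def chunks_at_depth(pivot_cardinalities):
    """Total number of chunks created at a depth: the product of the
    cardinalities of that depth's pivot levels."""
    return prod(pivot_cardinalities)

# D = 0 pivots: Category ([0-1], 2 members) and Continent ([0-2], 3 members)
print(chunks_at_depth([2, 3]))  # 6
```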
Next we proceed at D = 1, with PVT = {LOCATION: Country, PRODUCT: Type}, and recursively chunk each chunk of depth D = 0. This time we define ranges within the previously defined ranges. For example, on
the range corresponding to Continent value 0 that we
created before, we define discrete ranges corresponding to each country of this continent (i.e., to each value
of the Country level, which has parent 0). In Fig. 4a,
at D = 1, the shaded boxes correspond to all the chunks resulting from the chunking of the chunk mentioned in
the previous paragraph.
Similarly, we continue the chunking by descending all dimension hierarchies in parallel, and at each depth D
we create new chunks within the existing ones. The pro-
cedure ends when the next levels to include as pivot lev-
els are the grain levels. Then we do not need to perform
any further chunking, because the chunks that would be
produced from such a chunking would be the cells of the cube themselves. In this case, we have reached the maximum chunking depth DMAX. In our example, chunking stops at D = 2 and the maximum depth is D = 3. Notice the shaded chunks in Fig. 4a, which depict chunks belonging to the same chunk hierarchy.
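The per-dimension range splitting that drives each chunking step can be sketched as follows. This is an illustrative sketch; the path-tuple encoding of grain members is our own simplification, not the paper's data structure:

```python
from itertools import groupby

def split_range(grain_paths, depth):
    """Group consecutive grain members that share the same ancestor
    order-code at `depth` into ranges (a, b) of grain positions."""
    ranges = []
    for key, grp in groupby(enumerate(grain_paths), key=lambda p: p[1][depth]):
        idx = [i for i, _ in grp]
        ranges.append((key, (idx[0], idx[-1])))  # (parent code, [a..b])
    return ranges

# Continent-level split of the 19 LOCATION cities of the running example:
# Europe (0) -> cities 0..5, North America (1) -> 6..10, Asia (2) -> 11..18
cities = [(0,)] * 6 + [(1,)] * 5 + [(2,)] * 8
print(split_range(cities, 0))  # [(0, (0, 5)), (1, (6, 10)), (2, (11, 18))]
```

Applying the same split recursively inside each previously produced range, with the next pivot level as the grouping key, yields the nested chunks described above.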
The rationale for inserting the pseudo-levels above
the grain level lies in the fact that we wish to apply chunking as soon as possible and for all possible dimensions. As
the chunking proceeds in a top-to-bottom fashion, this
eager chunking has the advantage of reducing the chunk size very early and also provides faster access to the underlying data, because it increases the fan-out of the intermediate nodes. If at a particular depth one
(or more) pivot level is a pseudo-level, then this level
does not take part in the chunking. (In our example this occurs at D = 2 for the PRODUCT dimension.) This means that we do not define any new ranges within the
previously defined range for the specific dimension(s)
but instead we keep the old one with no further chunk-
ing. Therefore, as pseudo-levels restrict chunking in the dimensions to which they are applied, we must insert them at the lowest possible level. Consequently, as there is no
chunking below the grain level (a data cell cannot be
further partitioned), the pseudo-level insertion occurs
just above the grain level.
3.2 The chunk-tree representation
We use the intermediate-depth chunks as directory chunks that guide us to the chunks at depth DMAX, which contain the data and are thus called data chunks. This
leads to a chunk-tree representation of the hierarchi-
cally chunked cube and hence the cube data space. It
628 N. Karayannidis, T. Sellis
[Fig. 3 content: the members of each dimension hierarchy with their member-codes. PRODUCT: Category (Books 0, Music 1) / Type (e.g., Literature 0.0, Philosophy 0.1, Classical 1.2) / Item (e.g., Murderess, A. Papadiamantis 0.0.0). LOCATION: Continent (Europe 0, North America 1, Asia 2) / Country (e.g., Greece 0.0, U.K. 0.1, USA 1.2, Japan 2.3, India 2.4) / Region (e.g., Greece-North 0.0.0) / City (e.g., Salonica 0.0.0.0).]
Fig. 3 Dimensions of our example cube along with two hierarchy instantiations
is depicted in Fig. 4b for our example cube. In Fig. 4b,
we have expanded the chunk-sub-tree corresponding to
the family of chunks that has been shaded in Fig. 4a.
Pseudo-levels are marked with P and the corresponding directory chunks have reduced dimensionality (i.e., they are one-dimensional in this case). We interleave the h-surrogates of the pivot level values that define a chunk and
form a chunk-id. This is a unique identifier for a chunk
within a CUBE File. Moreover, this identifier includes
the whole path in the chunk hierarchy of a chunk. In
Fig. 4b, we note the corresponding chunk-id above each
chunk. The root chunk does not have a chunk-id because
it represents the whole cube and chunk-ids essentially
denote sub-cubes. The part of a chunk-id that is con-
tained between consecutive dots and corresponds to a
specific depth D is called D-domain.
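Assuming the notation of Fig. 4b ('|' separating dimensions inside a D-domain, '.' separating depths), chunk-id construction and D-domain extraction can be sketched as follows (helper names are our own):

```python
def make_chunk_id(path):
    """path: one tuple of pivot-level order-codes per depth ('P' marks a
    pseudo-level). The codes are interleaved into one D-domain per depth."""
    return ".".join("|".join(str(c) for c in domain) for domain in path)

def d_domain(chunk_id, depth):
    """Return the part of the chunk-id between consecutive dots at `depth`."""
    return chunk_id.split(".")[depth]

# PRODUCT|LOCATION codes at depths 0, 1 and 2 (cf. chunk 0|0.0|1.0|P in Fig. 4b)
cid = make_chunk_id([(0, 0), (0, 1), (0, "P")])
print(cid)               # 0|0.0|1.0|P
print(d_domain(cid, 1))  # 0|1
```

Because the chunk-id concatenates one D-domain per depth, it encodes the whole path of a chunk in the chunk hierarchy, as stated above.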
The chunk-tree representation can be regarded as a
method to model the multilevel-multidimensional data
space of an OLAP cube. We discuss next the major benefits of this modeling:
Direct access to cube data through hierarchical restrictions. One of the main advantages of the chunk-tree
representation of a cube is that it explicitly supports hier-
archies. This means that any cube data subset defined
through restrictions on the dimension hierarchies can
be accessed directly. This is achieved by simply accessing
the qualifying cells at each depth and following the inter-
mediate chunk pointers to the appropriate data. Note
that the vast majority of OLAP queries contain an equality restriction on a number of hierarchical attributes, and more commonly on hierarchical attributes that form a complete path in the hierarchy. This is reasonable, as the core of analysis is conducted along the hierarchies. We call this kind of restriction a hierarchical prefix path (HPP) restriction and provide the corresponding definition next:
Definition 1 (Hierarchical Prefix Path Restriction) We
define a hierarchical prefix path restriction (HPP restric-
tion) on a hierarchy H of a dimension D, to be a set of
equality restrictions, linked by conjunctions, on H's levels that form a path in H which always includes the topmost (most aggregated) level of H.
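As a sketch, checking whether a set of restricted levels forms an HPP restriction amounts to a prefix test against the hierarchy. This is an illustrative helper of our own, not part of the paper's formalism:

```python
def is_hpp_restriction(hierarchy_levels, restricted_levels):
    """True if the restricted levels form a contiguous prefix of the
    hierarchy starting at its topmost (most aggregated) level."""
    prefix_len = len(restricted_levels)
    return list(hierarchy_levels[:prefix_len]) == list(restricted_levels)

location = ["Continent", "Country", "Region", "City"]
print(is_hpp_restriction(location, ["Continent", "Country", "Region"]))  # True
print(is_hpp_restriction(location, ["Country", "Region"]))  # False: misses the top level
```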
For example, if we consider the dimension LOCATION of our example cube and a DATE dimension with a 3-level hierarchy (Year/Month/Day), then the query "show me sales for country A (in continent C) in region B for each month of 1999" contains two whole-path restrictions, one for the dimension LOCATION and one for DATE: (a) LOCATION.continent = C AND
[Fig. 4 content: (a) the grain-level ranges defined at each chunking depth, with pivot levels (Category, Continent) at D = 0, (Type, Country) at D = 1, (-, Region) at D = 2 and (Item, City) at the grain level; (b) the chunk-tree from the root chunk through the directory chunks down to the data chunks at D = 3 (max depth), with the chunk-id of each chunk (e.g., 0|0, 0|0.0|1, 0|0.0|1.0|P) noted above it.]
Fig. 4 a The cube from our running example hierarchically chunked. b The whole sub-tree up to the data chunks under chunk 0|0
LOCATION.country = A AND LOCATION.region =
B, and (b) DATE.year = 1999.
Consequently, we can now define the class of HPP queries:
Definition 2 (Hierarchical Prefix Path Query) We call
a query Q on a cube C a hierarchical prefix path query
(HPP query), if and only if all the restrictions imposed
by Q on the dimensions of C are HPP restrictions, which
are linked together by conjunctions.
Adaptation to the cube's native sparseness. The cube data
space is extremely sparse [34]. In other words, the ratio
of the number of real data points to the product of the
dimension grain-level cardinalities is a very small num-
ber. Values for this ratio in the range of 10^-12 to 10^-5 are more than typical (especially for cubes with more
than three dimensions). It is therefore imperative that
a primary organization for the cube adapts well to this
sparseness, allocating space conservatively. Ideally, the
allocated space must be comparable to the size of the
existing data points. The chunk-tree representation
adapts perfectly to the cube data space. The reason
is that the empty regions of a cube are not arbitrarily
formed. On the contrary, specific combinations of
dimension hierarchy values form them. For instance,
in our running example, if no music products are sold
in Greece, then a large empty region is formed. Consequently, the empty regions in the cube data space translate naturally to one or more empty chunk sub-trees in the chunk-tree representation. Therefore, empty sub-trees can be discarded altogether and the space allocation corresponds to real data points only.
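The discarding of empty sub-trees can be sketched as a simple recursive pruning. This is a minimal sketch with an assumed in-memory chunk representation of our own, not the CUBE File's physical layout:

```python
class Chunk:
    """Assumed toy representation: directory chunks hold a dict of children,
    data chunks hold a list of measure values (cells)."""
    def __init__(self, children=None, cells=None):
        self.children = children
        self.cells = cells

def prune_empty(chunk):
    """Return the chunk with empty sub-trees discarded, or None if all empty."""
    if chunk.children is None:                 # data chunk
        return chunk if chunk.cells else None
    kept = {}
    for key, child in chunk.children.items():
        pruned = prune_empty(child)
        if pruned is not None:
            kept[key] = pruned
    chunk.children = kept
    return chunk if kept else None

# In the running example: no music sold in Greece, so that sub-tree vanishes.
root = Chunk(children={
    "books_greece": Chunk(cells=[10.0, 20.0]),
    "music_greece": Chunk(cells=[]),
})
print(sorted(prune_empty(root).children))  # ['books_greece']
```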
Multi-resolution view of the data space. The chunk-tree
represents the whole cube data space (however, with
most of the empty areas pruned). Similarly, each sub-
tree represents a sub, space. Moreover, at a specific
chunking depth we view all the data points organized
in hierarchical families (i.e., chunk-trees) according to the combinations of hierarchy values for the corre-
sponding hierarchy levels. By descending to a higher
depth node we view the data of the corresponding
subspace organized in hierarchical families of a more
detailed level and so on. This multi-resolution feature
will be exploited later in order to achieve a better hierar-
chical clustering of the data by promoting the storage of lower-depth chunk-trees in a bucket over that of higher-depth ones.
Storage efficiency. A chunk is physically represented by
a multidimensional array. This enables an offset-based
access, rather than a search-based one, which speeds up the cell access mechanism considerably. Moreover, it
gives us the opportunity to exploit chunk-ids in a very
effective way. A chunk-id essentially consists of interleaved coordinate values. Therefore, we can use a chunk-id to calculate the appropriate offset of a cell in a chunk, but we do not have to store the chunk-id
along with each cell. Indeed, a search-based mechanism
(like the one used by conventional B-tree indexes or
the UB-tree [2]) requires that the dimension values (or
the corresponding h-surrogates), which form the search-
key, must also be stored within each cell (i.e., tuple) of
the cube. In the CUBE File only the measure values of
the cube are stored in each cell. Hence notable space
savings are achieved. In addition, further compression
of chunks can be easily achieved, without affecting the
offset-based accessing (see [17] for the details).
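The offset-based access can be sketched as the usual row-major address computation over the cell coordinates carried by the chunk-id. This is a generic sketch of our own, not the paper's exact chunk layout:

```python
def cell_offset(coords, dims):
    """Row-major offset of a cell with local coordinates `coords` inside a
    chunk whose per-dimension extents are `dims` (Horner's scheme)."""
    offset = 0
    for c, d in zip(coords, dims):
        offset = offset * d + c
    return offset

# A 2 x 3 chunk (PRODUCT x LOCATION): cell (1, 2) sits at offset 1*3 + 2 = 5.
print(cell_offset((1, 2), (2, 3)))  # 5
```

Since the offset is derived from the coordinates alone, only the measure values need to be stored in the cells, which is the source of the space savings described above.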
Parallel processing enabling. Chunk-trees (at various depths) can be exploited naturally for the logical fragmentation of the cube data, in order to enable the par-
allel processing of queries, as well as the construction
and maintenance (i.e., bulk loading and batch updating)
of the CUBE File. Chunk-trees are essentially disjoint
fragments of the data that carry all the hierarchy seman-
tics of the data. This makes the CUBE File data struc-
ture an excellent candidate for advanced fragmentation
methods ([38]) used in parallel data warehouse DBMSs.
Efficient maintenance operations. Any data structure aimed at accommodating data warehouse data must be efficient in typical data warehousing maintenance operations. The logical data partitioning provided by the
chunk-tree representation enables fast bulk loading (roll-in of data), data purging (roll-out of data, i.e., bulk deletions from the cube), as well as the incremental
updating of the cube (i.e., when the input data with
the latest changes arrive from the data sources, only
local reorganizations are required and not a complete
CUBE File rebuild). The key idea is that new data to
be inserted in the CUBE file correspond to a set of
chunk-trees that need to be hung at various depths
of the structure. The insertion of each such chunk-tree
requires only a local reorganization without affecting
the rest of the structure. In addition, as noted previously,
these chunk-tree insertions can be performed in parallel
as long as they correspond to disjoint subspaces of the
cube. Finally, it is very easy to roll out the oldest month's data and roll in the current month's (we call this data
purging), as these data correspond to separate chunk-
trees and only a minimum reorganization is required.
The interested reader can find more information regard-
ing other aspects of the CUBE File not covered in this
paper (e.g., the updating and maintenance operations),
as well as information for a prototype implementation
of a CUBE File based DBMS in [16].
4 Evaluating the quality of hierarchical clustering
Any physical organization of data must determine how
the latter are distributed in disk pages. A CUBE File
physically organizes its data by allocating the chunks of
the chunk-tree into a set of buckets, which is the I/O
transfer unit counterpart in our case. First, let us try to
understand what are the objectives of such an alloca-
tion. As already stated the primary goal is to achieve
a high degree of hierarchical clustering. This statement,
although clear, could still be interpreted in several different ways. What are the elements that can guarantee that a specific hierarchical clustering scheme is "good"? We attempt to list some next:
1. Efficient evaluation of queries containing restrictions on the dimension hierarchies
2. Minimization of the size of the data
3. High space utilization
The most important goal of hierarchical clustering is
to improve response time of queries containing hier-
archical restrictions. Therefore, the first element calls
for a minimal I/O cost (i.e., bucket reads) for the eval-
uation of such restrictions. The second element deals
with the ability to minimize the size of the data to be stored (e.g., by adapting to the extensive sparseness of the cube data space, i.e., not storing null data, as well as by storing only the minimum necessary data: in an
offset-based access structure we do not need to store the
dimension values along with the facts). Of course, the
storage overhead must also be minimized in terms of the
number of allocated buckets. Naturally, the best way to
keep this number low is to utilize the available space as
much as possible. Therefore the third element implies
that the allocation must adapt well to the data distri-
bution, e.g., more buckets must be allocated to more
densely populated areas and fewer buckets for more
sparse ones. Also, buckets must be filled almost to capacity (i.e., imposing a high bucket occupancy threshold).
The last two elements together guarantee an overall minimum storage cost.
In the following, we propose a metric for evaluat-
ing the hierarchical clustering quality of an allocation of
chunks into buckets. Then in the next section we use this
metric to formally define the chunk-to-bucket allocation
problem as an optimization problem.
4.1 The hierarchical clustering factor
We advocate that hierarchical clustering is the most
important goal for a file organization for OLAP cubes.
However, the space of possible combinations of dimen-
sion hierarchy values is huge (doubly exponential; see Footnote 1). To this end, we exploit the chunk-tree representation, resulting from the hierarchical chunking of a cube, and deal with the problem of hierarchical cluster-
ing, as a problem of allocating chunks of the chunk-tree
into disk buckets. Thus, we are not searching for a linear
clustering (i.e., for a total ordering of the chunked-cube
cells), but rather we are interested in the packing of
chunks into buckets according to the criteria of good
hierarchical clustering posed above.
The intuitive explanation for the utilization of the
chunk-tree for achieving hierarchical clustering lies in
the fact that the chunk-tree is built based solely on the
hierarchies' structure and content and not on some storage criteria (e.g., each node corresponding to a disk page, etc.); as a result, it embodies all possible combinations of
hierarchical values. For example, the sub-tree hanging
from the root-chunk in Fig. 4b at the leaf level con-
tains all the sales figures corresponding to the continent
Europe (order-code 0) and to the product category Books (order-code 0), and any possible combinations
of the children members of the two. Therefore, each
sub-tree in the chunk-tree corresponds to a hierarchi-
cal family of values and thus reduces the search space
significantly. In the following we will regard the bucket as the storage unit. In this section, we define a metric for evaluating the degree of hierarchical clustering of different storage schemes in a quantitative way.
Clearly, a hierarchical clustering strategy that respects
the quality element of efficient evaluation of queries
with HPP restrictions that we have posed above must
ensure that the sub-trees hanging under a specific chunk can be accessed with a minimal number of bucket reads. Intuitively, one can say that if we
could store whole sub-trees in each bucket (instead of
single chunks), then this would result in a better hier-
archical clustering, as all the restrictions on the specific
sub-tree, as well as on any of its descendant sub-trees,
would be evaluated with a single bucket I/O. For exam-
ple, if we store the sub-tree hanging from the root-chunk
in Fig. 4b into a single bucket, we can answer all queries
containing hierarchical restrictions on the combination
Books and Europe and on any children-values of
these two with just a single disk I/O.
Therefore, each sub-tree in this chunk-tree corre-
sponds to a hierarchical family of values. Moreover,
the smaller the chunking depth of this sub-tree the more
the value combinations it embodies. Intuitively, we can
say that the hierarchical clustering achieved could be
assessed by the degree of storing low-depth whole chunk
sub-trees into each storage unit. Next, we exploit this
intuitive criterion to define the hierarchical clustering
degree of a bucket (HCDB). We begin with a number of
auxiliary definitions:
Definition 3 (Bucket-Region) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. A group of chunk-trees of the same depth having a common parent node, which are stored in the same bucket, comprises a bucket-region.
Definition 4 (Region contribution of a tree stored in a bucket, cr) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. We define the region contribution cr of a tree t of depth d that is stored in a bucket B to be the total number of trees in the bucket-region that t belongs to, divided by the total number of trees of the same depth in the whole chunk-tree CT. This is then multiplied by a bucket-region proximity factor rP, which expresses the proximity of the trees of a bucket-region in the multidimensional space:

    cr = (treeNum(d, B) / treeNum(d, CT)) × rP,

where treeNum(d, B) is the total number of sub-trees of depth d in B, treeNum(d, CT) is the total number of sub-trees of depth d in CT, and rP is the bucket-region proximity (0 < rP ≤ 1).
The region contribution of a tree stored in a bucket
essentially denotes the percentage of trees at a specific
depth that a bucket region covers. Therefore, the greater
this percentage, the greater the hierarchical clustering
achieved by the corresponding bucket, as more com-
binations of the hierarchy members will be clustered in
the same bucket. To keep this contribution high we need
large bucket-regions of low depth trees, because in low
depths the total number of CT sub-trees is small. Notice
also that the region contribution includes a bucket region
proximity factor rP, which expresses the spatial proximity of the trees of a bucket-region in the multidimensional space. The larger this factor becomes, the closer the trees
of a bucket-region are and thus the larger their individ-
ual region contributions are. We will see in more detail
the effects of this factor and its definition (Definition
10) in a following subsection, where we will discuss the
formation of the bucket regions.
Definition 5 (Depth contribution of a tree stored in a bucket, cd) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. We define the depth contribution cd of a tree t of depth d that is stored in a bucket B to be the ratio of d to DMAX:

    cd = d / DMAX.
The depth contribution of a tree stored in a bucket
expresses the proportion between the depth of the tree and the maximum chunking depth. The smaller this ratio becomes (i.e., the lower the depth of the tree), the greater the hierarchical clustering achieved by the corresponding bucket becomes. Intuitively, the depth contribution expresses the percentage of the number of nodes in the path from the root-chunk to the bucket in question; thus, the smaller it is, the smaller the I/O cost to access
this bucket. Alternatively, we could substitute the depth
value in the numerator of the depth contribution with
the number of buckets in the path from the root-chunk
to the bucket in question (with the latter included).
Next, we provide the definition of the hierarchical clustering degree of a bucket:
Definition 6 (Hierarchical clustering degree of a bucket, HCDB) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. For a bucket B containing T whole sub-trees {t1, t2, ..., tT} of chunking depths {d1, d2, ..., dT}, respectively, where none of these sub-trees is a sub-tree of another, we define the hierarchical clustering degree HCDB of bucket B to be the ratio of the sum of the region contributions of the trees ti (1 ≤ i ≤ T) included in B to the sum of their depth contributions, multiplied by the bucket occupancy OB, where 0 < OB ≤ 1:

    HCDB = (Σ_{i=1..T} cr_i / Σ_{i=1..T} cd_i) × OB = (T × cr) / (T × cd) × OB = (cr / cd) × OB,    (1)

where cr_i is the region contribution of tree ti and cd_i is the depth contribution of tree ti (1 ≤ i ≤ T). (Note that, as bucket-regions have been defined as consisting of equi-depth trees, all trees of a bucket have the same region contribution as well as the same depth contribution.)
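Definitions 4, 5 and 6 can be combined in a small numeric sketch (assuming, as in the definition's note, that all T trees of the bucket form one bucket-region of the same depth; function names are our own):

```python
def region_contribution(trees_in_bucket, trees_in_ct, r_p):
    """Definition 4: share of the equi-depth CT trees covered by the
    bucket-region, scaled by the region proximity factor r_p."""
    return (trees_in_bucket / trees_in_ct) * r_p

def depth_contribution(d, d_max):
    """Definition 5: ratio of the tree's chunking depth to the maximum."""
    return d / d_max

def hcd(trees_in_bucket, trees_in_ct, d, d_max, r_p, occupancy):
    """Equation (1) for a bucket whose trees form a single bucket-region."""
    c_r = region_contribution(trees_in_bucket, trees_in_ct, r_p)
    c_d = depth_contribution(d, d_max)
    return (c_r / c_d) * occupancy

# Four of the eight depth-2 trees packed into one full bucket, r_P = 1, DMAX = 3:
print(hcd(4, 8, d=2, d_max=3, r_p=1.0, occupancy=1.0))  # 0.75
```

As expected, larger regions of lower-depth trees and a higher occupancy all push the degree up, and it never exceeds DMAX.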
In this definition, we have assumed that the chunking
depth di of a chunk-tree ti is equal to the chunking depth
of the root-chunk of this tree. Of course we assume that
a normalization of the depth values has taken place, so
that the depth of the chunk-tree CT is 1 instead of
0, in order to avoid having zero depths in the denomina-
tor of (1). Furthermore, data chunks are considered as
chunk-trees with a depth equal to the maximum chunk-
ing depth of the cube. Note that directory chunks stored
in a bucket, not as part of a sub-tree but isolated, have
a zero region contribution; therefore, buckets that con-
tain only such directory chunks have a zero degree of
hierarchical clustering.
From (1), we can see that the more sub-trees, instead
of single chunks, are included in a bucket the greater the
hierarchical clustering degree of the bucket becomes,
because more HPP restrictions can be evaluated solely
with this bucket. Also, the higher these trees are (i.e., the smaller their chunking depth is), the greater the hierarchical clustering degree of the bucket becomes, as more combinations of hierarchical attributes are covered by this
bucket. Moreover, the more trees of the same depth and
hanging under the same parent node, we have stored in
a bucket, the greater the hierarchical clustering degree
of the bucket, as we include more combinations of the
same path in the hierarchy.
All in all, the HCDB metric favors the following stor-
age choices for a bucket:
- Whole trees instead of single chunks or other data partitions
- Smaller depth trees instead of greater depth ones
- Tree regions instead of single trees
- Regions with a few low-depth trees instead of ones with more trees of greater depth
- Regions with trees of the same depth that are close in the multidimensional space instead of dispersed trees
- Buckets with a high occupancy
We prove the following theorem regarding the maxi-
mum value of the hierarchical clustering degree of a bucket:
Theorem 1 (Theorem of maximum hierarchical cluster-
ing degree of a bucket) Assume a hierarchically chun-
ked cube represented by a chunk-tree CT of a maximum
chunking depth DMAX, which has been allocated to a set
of buckets. Then, for any such bucket B, it holds that

    HCDB ≤ DMAX.
Proof From the definition of the region contribution of a tree appearing in Definition 4, we can easily deduce that

    cr_i ≤ 1.    (I)

This means that the following holds:

    Σ_{i=1..T} cr_i ≤ T.    (II)

In (II), T stands for the number of trees stored in B. Similarly, from the definition of the depth contribution of a tree appearing in Definition 5, we can easily deduce that

    cd_i ≥ 1 / DMAX,    (III)

as the smallest possible depth value is 1. This means that the following holds:

    Σ_{i=1..T} cd_i ≥ T / DMAX.    (IV)

From (II), (IV) and (1), and assuming that B is filled to its capacity (i.e., OB equals 1), the theorem is proved.
It is easy to see that the maximum degree of hierar-
chical clustering of a bucket B is achieved only in the
ideal case, where we store the chunk-tree CT that rep-
resents the whole cube in B and CT fits exactly in B (see Footnote 2).
In this case, all our primary goals for a good hierarchical
clustering, posed at the beginning of this section, such as
the efficient evaluation of HPP queries, the low storage
cost and the high space utilization are achieved. This is
because all possible HPP restrictions can be evaluated
with a single bucket read (one I/O operation) and the
achieved space utilization is maximal (full bucket) with
a minimal storage cost (just one bucket). Moreover, it
is now clear that the hierarchical clustering degree of a
bucket signifies to what extent the chunk-tree represent-
ing the cube has been packed into the specific bucket
and this is measured in terms of the chunking depth of
the tree.
By trying to create buckets with a high HCDB we can guarantee that our allocation respects these ele-
ments of good hierarchical clustering. Furthermore, it
is now straightforward to define a metric for evaluating
the overall hierarchical clustering achieved by a chunk-
to-bucket allocation strategy:
Definition 7 (Hierarchical clustering factor of a physical organization for a cube, fHC) For a physical organization that stores the data of a cube into a set of NB buckets, we define the hierarchical clustering factor fHC, i.e., the percentage of hierarchical clustering achieved by this storage organization, as the sum of the hierarchical clustering degrees of the individual buckets divided by the total number of buckets times the maximum chunking depth, and we write:

    fHC = (Σ_{B=1..NB} HCDB) / (NB × DMAX).    (2)
2 Indeed, a bucket with HCDB = DMAX would mean that the depth contribution of each tree in this bucket should be equal to 1/DMAX (according to inequality (III)); however, this is only possible for the whole chunk-tree CT, as this alone has a depth equal to 1.
Note that NB is the total number of buckets used in order to store the cube; however, only the buckets that contain at least one whole chunk-tree have a non-zero HCDB value. Therefore, allocations that spend more buckets
for storing sub-trees have a higher hierarchical cluster-
ing factor than others, which favor, e.g., single directory
chunk allocations. From (2), it is clear that even if we
have two different allocations of a cube that result in the same total HCDB of individual buckets, the one that
occupies the smaller number of buckets will have the
greater fHC, rewarding this way the allocations that use
the available space more conservatively.
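Equation (2) is straightforward to compute given the per-bucket degrees (a sketch; note that zero-HCDB buckets still count in NB and thus penalize the factor):

```python
def f_hc(hcd_values, d_max):
    """Hierarchical clustering factor of an allocation, Equation (2):
    the sum of the per-bucket HCD values over NB x DMAX."""
    n_b = len(hcd_values)
    return sum(hcd_values) / (n_b * d_max)

# Three buckets with HCDB of 2.0, 1.5 and 0 (a directory-only bucket), DMAX = 5:
print(f_hc([2.0, 1.5, 0.0], d_max=5))  # 0.2333...
```

Dropping the zero-degree bucket from the allocation (if its chunks could be packed elsewhere) would raise the factor to 3.5 / (2 × 5) = 0.35, illustrating why conservative bucket usage is rewarded.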
Another way of viewing the fHC is as the average
HCDB for all the buckets divided by the maximum
chunking depth. It is now clear that it expresses the
percentage of the extent by which the chunk-tree rep-
resenting the whole cube has been packed into the
set of the NB buckets, and thus 0 ≤ fHC ≤ 1. It follows directly from Theorem 1 that this factor is maximized (i.e., equals 1) if and only if we store the whole cube (i.e., the chunk-tree CT) into a single bucket, which cor-
responds to a perfect hierarchical clustering for a cube.
In the next section we exploit the hierarchical clus-
tering factor fHC, in order to define the chunk-to-bucket
allocation problem as an optimization problem. Further-
more, we exploit the hierarchical clustering degree of a
bucket HCDB in a greedy strategy that we propose for
solving this problem, as an evaluation criterion, in order
to decide how close we are to an optimal solution.
5 Building the CUBE File
In this section we formally define the chunk-to-bucket
allocation problem as an optimization problem. We
prove that it is NP-Hard and provide a heuristic algo-
rithm as a solution. In the course of solving this problem
several interesting sub-problems arise. We tackle each
one in a separate subsection.
5.1 The HPP chunk-to-bucket allocation problem
The chunk-to-bucket allocation problem is defined as
follows:
Definition 8 (The HPP chunk-to-bucket allocation
problem) For a cube C, represented by a chunk-tree CT
with a maximum chunking depth of DMAX, find an allo-
cation of the chunks of CT into a set of fixed-size buckets
that corresponds to a maximum hierarchical clustering
factor fHC.
We assume the following: the storage cost of any chunk-tree t equals cost(t), the number of sub-trees per depth d in CT equals treeNum(d), and the size of a bucket equals SB. Finally, we are given a bucket of special size SROOT, consisting of λ consecutive simple buckets, called the root-bucket BR, where SROOT = λ × SB, with λ ≥ 1. Essentially, BR represents the set of buckets that contain no whole sub-trees and thus have a zero HCDB.
The solution S for this problem consists of a set of K buckets, S = {B1, B2, ..., BK}, such that each bucket contains at least one sub-tree of CT, and a root-bucket BR that contains all the rest of CT (the part with no whole sub-trees). S must result in a maximum value for the fHC factor for the given bucket size SB. As the HCDB values of the buckets of the root-bucket BR equal zero (recall that they contain no whole sub-trees), it follows from (2) that fHC can be expressed as

    fHC = (Σ_{B=1..K} HCDB) / ((K + λ) × DMAX).    (3)
From (3), it is clear that the more buckets we allocate for the root-bucket (i.e., the greater λ becomes), the lower the degree of hierarchical clustering achieved by
our allocation. Alternatively, if we consider caching the
whole root-bucket in main memory (see the following
discussion), then we could assume that does not affect
hierarchical clustering (as it does not introduce more
bucket I/Os from the root-chunk to a simple bucket)
and could be zeroed.
In Fig. 5, we depict four different chunk-to-bucket
allocations for the same chunk-tree. The maximum
chunking depth is DMAX = 5, although in the figure we
can see the nodes up to depth D = 3 (i.e., the triangles correspond to sub-trees of three levels). The numbers
inside each node represent the storage cost for the cor-
responding sub-tree, e.g., the whole chunk-tree has a
cost of 65 units. Assume a bucket size of SB = 30 units.
Below each figure we depict the calculated fHC and
beside we note the percentage with respect to the best
fHC that can be achieved for this bucket size (i.e., fHC / fHCmax × 100%). The chunk-to-bucket allocation that yields the maximum fHC can be identified easily by
exhaustive search in this simple case. Observe, how the
fHC deteriorates gradually, as we move from Fig. 5a to d.
In Fig. 5a we have failed to create any bucket-regions
at depth D = 2. Thus each bucket stores a single sub-tree of depth 3. Note also that the occupancy of most
buckets is quite low. In Fig. 5b the hierarchical clustering
improves as some bucket-regions have been formed
buckets B1, B3 and B4 store two sub-trees of depth 3. In
Fig. 5c the total number of buckets decreases by one as a
large bucket-region of four sub-trees has been formed in
bucket B3. Finally, in Fig. 5d we have managed to store
in bucket B3 a higher level (i.e., lower depth) sub-tree
(i.e., a sub-tree of depth 2). This increases even more the
hierarchical clustering achieved, compared to the previ-
ous case (Fig. 5c), because the root node is included in
the same bucket as the four sub-trees. In addition, the
bucket occupancy of B3 is increased.
It is clear now from this simple example that the hierarchical clustering factor fHC rewards allocations that manage to store lower-depth sub-trees in buckets, that store regions of sub-trees instead of single sub-trees, and that create highly occupied buckets. The individual calculations of this example can be seen in Fig. 6.
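To make the computation concrete, the following sketch (the function name is ours; the per-bucket HCD_B inputs are the values listed in Fig. 6) reproduces the fHC values of the four allocations of Fig. 5 via (3), assuming λ = 1 and DMAX = 5:

```python
# Sketch of Eq. (3): fHC = sum(HCD_B) / ((K + lambda) * D_MAX).
def f_hc(hcd_values, num_root_buckets, d_max):
    k = len(hcd_values)  # K: number of buckets holding whole sub-trees
    return sum(hcd_values) / ((k + num_root_buckets) * d_max)

D_MAX = 5
# Per-bucket HCD_B values as listed in Fig. 6 for each allocation of Fig. 5.
allocations = {
    "(a)": [0.08, 0.16, 0.04, 0.04, 0.04, 0.02, 0.02],  # 7 buckets
    "(b)": [0.48, 0.04, 0.16, 0.08],                    # 4 buckets
    "(c)": [0.48, 0.04, 0.48],                          # 3 buckets
    "(d)": [0.48, 0.04, 0.92],                          # 3 buckets
}
for panel, hcds in allocations.items():
    print(panel, round(f_hc(hcds, num_root_buckets=1, d_max=D_MAX), 2))
# (a) 0.01  (b) 0.03  (c) 0.05  (d) 0.07
```

Note how both a smaller K and larger per-bucket HCD_B values push fHC up, exactly as the example describes.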
All in all, it is obvious that we now have the optimization problem of finding a chunk-to-bucket allocation such that fHC is maximized. This problem is NP-Hard, which results from the following theorem.
Theorem 2 (Complexity of the HPP chunk-to-bucket
allocation problem) The HPP chunk-to-bucket alloca-
tion problem is NP-Hard.
Proof Assume a typical bin packing problem [42] where we are given N items with weights wi, i = 1, ..., N, respectively, and a bin size B such that wi ≤ B for all i = 1, ..., N. The problem is to find a packing of the items in the fewest possible bins. Assume that we create N chunks of depth d and dimensionality D, so that chunk c1 has a storage cost of w1, chunk c2 has a storage cost of w2, and so on. Also assume that N − 1 of these chunks are under the same parent chunk (e.g., the Nth chunk). This way we have created a two-level chunk-tree where the root lies at depth d = 0 and the leaves at depth d = 1. Also assume that a bin and a bucket are equivalent terms. Now we have reduced in polynomial time the bin packing problem to an HPP chunk-to-bucket allocation problem, which is to find an allocation of the chunks into buckets of size B such that the achieved hierarchical clustering factor fHC is maximized.
As all the chunk-trees (i.e., single chunks in our case) are of the same depth, the depth contribution c_i^d (1 ≤ i ≤ N), defined in (1), is the same for all chunk-trees. Therefore, in order to maximize the degree of hierarchical clustering HCDB for each individual bucket (and thus also increase the hierarchical clustering factor fHC), we have to maximize the region contribution c_i^r (1 ≤ i ≤ N) of each chunk-tree (1). This occurs when we pack into each bucket as many trees as possible on the one hand and, due to the region proximity factor rP, when the trees of each region are as close as possible in the multidimensional space, on the other. Finally, according to
the fHC definition, the number of buckets used must
be the smallest possible. If we assume that the chunk
dimensions have no inherent ordering then there is no
notion of spatial proximity within the trees of the same
region and the region proximity factor equals 1 for all
[Figure 5: four chunk-to-bucket allocations of the same chunk-tree (DMAX = 5, SB = 30; total tree cost 65 units): (a) fHC = 0.01 (14%), (b) fHC = 0.03 (42%), (c) fHC = 0.05 (69%), (d) fHC = 0.07 (100%)]
Fig. 5 The hierarchical clustering factor fHC of the same chunk-tree for four different chunk-to-bucket allocations
possible regions (see also related discussion in the following subsection).
In this case the only factor that can maximize the HCDB of each bucket, and consequently the overall fHC, is to minimize empty space within each bucket [i.e., maximize bucket occupancy in (1)] and to use as few buckets as possible by packing the largest number of trees in each bucket. These are exactly the goals of the original bin
packing problem and thus a solution to the bin packing
problem is also a solution to the HPP chunk-to-bucket
allocation problem and vice versa.
As the bin packing can be reduced in polynomial time
to the HPP chunk-to-bucket, then any problem in NP
can be reduced in polynomial time to the HPP chunk-
to-bucket. Furthermore, in the general case (where we
have chunk-trees of varying depths and the dimensions have
inherent orderings) it is not easy to find a polynomial
time verifier for a solution to the HPP chunk-to-bucket
problem, as the maximum fHC that can be achieved is
not known (as it is in the bin packing problem where
the minimum number of bins can be computed with a
simple division of the total weight of items by the size of
a bin). Thus the problem is NP-Hard.
We proceed next by providing a greedy algorithm
based on heuristics for solving the HPP chunk-to-bucket
allocation problem in linear time. The algorithm utilizes
the hierarchical clustering degree of a bucket as a cri-
terion in order to evaluate at each step how close we
are to an optimal solution. In particular, it traverses the
chunk-tree in a top-down depth-first manner, adopting
the greedy approach that if at each step we create a
(Bucket size SB = 30, maximum chunking depth DMAX = 5; cr = region contribution, cd = depth contribution, OB = bucket occupancy; K = total no. of buckets; λ = no. of root-bucket buckets.)

Allocation   Bucket   cr     cd    OB     HCDB    K   λ   fHC    fHC/fHCmax (%)
Fig. 5(d)    B1       0.29   0.6   1.00   0.48    3   1   0.07   100%
             B2       0.14   0.6   0.17   0.04
             B3       0.50   0.4   0.73   0.92
Fig. 5(c)    B1       0.29   0.6   1.00   0.48    3   1   0.05   69%
             B2       0.14   0.6   0.17   0.04
             B3       0.57   0.6   0.50   0.48
Fig. 5(b)    B1       0.29   0.6   1.00   0.48    4   1   0.03   42%
             B2       0.14   0.6   0.17   0.04
             B3       0.29   0.6   0.33   0.16
             B4       0.29   0.6   0.17   0.08
Fig. 5(a)    B1       0.14   0.6   0.33   0.08    7   1   0.01   14%
             B2       0.14   0.6   0.67   0.16
             B3       0.14   0.6   0.17   0.04
             B4       0.14   0.6   0.17   0.04
             B5       0.14   0.6   0.17   0.04
             B6       0.14   0.6   0.10   0.02
             B7       0.14   0.6   0.07   0.02
Fig. 6 The individual calculations of the example in Fig. 5
bucket with a maximum value of HCDB, then overall the acquired hierarchical clustering factor will be maximal. Intuitively, by trying to pack the available buckets with low-depth trees (i.e., the tallest trees) first (thus the top-to-bottom traversal), we can ensure that we have not missed the chance to create the best HCDB buckets possible.
In Fig. 7, we present the GreedyPutChunksIntoBuckets algorithm, which receives as input the root R of a chunk-tree CT and the fixed size SB of a bucket. The output of this algorithm is a set of buckets, each containing at least one whole chunk-tree, a directory chunk entry pointing at the root chunk R, and the root-bucket BR.
In each step the algorithm tries greedily to make an allocation decision that will maximize the HCDB of the current bucket. For example, in lines 2–7 of Fig. 7, the algorithm tries to store the whole input tree in a single bucket, thus aiming at a maximum degree of hierarchical clustering for the corresponding bucket. If this fails, then it allocates the root R to the root-bucket and tries to achieve a maximum HCDB by allocating the sub-trees at the next depth, i.e., the children of R (lines 9–26).
This essentially is achieved by including all direct children sub-trees with size less than (or equal to) the size of a bucket (SB) in a list of candidate trees for inclusion into bucket regions (buckRegion) (lines 14–16). Then the routine formBucketRegions is called upon this list and tries to include the corresponding trees in a minimum set of buckets, by forming bucket regions to be stored in each bucket, so that each one achieves the maximum possible HCDB (lines 19–22).
We will come back to this routine and discuss how it
solves this problem in the next sub-section. Finally, for
the children sub-trees of root R with size cost greater
than the size of a bucket, we recursively try to solve
the corresponding HPP chunk-to-bucket allocation sub-
problem for each one of them (lines 23–26). This of
course corresponds to a depth-first traversal of the input
chunk-tree.
Very important is also the fact that no space is allocated for empty sub-trees (lines 11–13); only a special entry is inserted in the parent node to denote a
NULL sub-tree. Therefore, the allocation performed
by the greedy algorithm adapts perfectly to the data
0:  GreedyPutChunksIntoBuckets(R, SB)
    // Input:  root R of a chunk-tree CT, bucket size SB
    // Output: updated R, list of allocated buckets BuckList,
    //         root-bucket BR, directory entry dirEnt pointing at R
1:  {  List buckRegion                   // bucket-region candidates list
2:     IF (cost(CT) <= SB) {
3:        Allocate new bucket Bn
4:        Store CT in Bn
5:        dirEnt = addressOf(R)
6:        RETURN
7:     }
8:     // R will be stored in the root-bucket BR
9:     IF (R is a directory chunk) {
10:       FOR EACH child sub-tree CTc of R {
11:          IF (CTc is empty) {
12:             Mark corresponding entry of R with empty tag
13:          }
14:          IF (cost(CTc) <= SB) {
15:             // insert CTc into list of bucket-region candidates
16:             buckRegion.push(CTc)
17:          }
18:       }
19:       IF (buckRegion != empty) {
20:          // formulate the bucket-regions
21:          formBucketRegions(buckRegion, BuckList, R)
22:       }
23:       WHILE (there is a child CTc : cost(CTc) > SB) {
24:          GreedyPutChunksIntoBuckets(root(CTc), SB)
25:          Update corresponding R entry for CTc
26:       }
27:       Store R in the root-bucket BR
28:       dirEnt = addressOf(R)
29:    }
30:    ELSE {  // R is a data chunk and cost(R) > SB
31:       Artificially chunk R, create 2-level chunk-tree CTA
32:       GreedyPutChunksIntoBuckets(root(CTA), SB)
33:       // storage of R will be taken care of by the previous call
34:       dirEnt = addressOf(root(CTA))
35:    }
36:    RETURN
37: }
Fig. 7 A greedy algorithm for the HPP chunk-to-bucket allocation problem
distribution, coping effectively with the native sparseness of the cube.
The recursive calls might lead us eventually all the way down to a data chunk (at depth DMAX). Indeed, if GreedyPutChunksIntoBuckets is called upon a root R which is a data chunk, then this means that we have come upon a data chunk with size greater than the bucket size. This is called a large data chunk, and a more detailed discussion on how to handle such chunks will follow in a later sub-section. For now it is enough to say that in order to resolve the problem of storing such a chunk we extend the chunking further (with a technique called artificial chunking) in order to transform the large data chunk into a 2-level chunk-tree. Then, we solve the HPP chunk-to-bucket sub-problem for this sub-tree (lines 30–35). The termination of the algorithm is guaranteed by
the fact that each recursive call deals with a sub-problem
of a smaller in size chunk-tree than the parent problem.
Thus, the size of the input chunk-tree is continuously
reduced.
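A simplified, self-contained sketch of this top-down traversal is given below. Trees are plain (node_cost, children) tuples, and the bucket-region step is approximated by a first-fit-decreasing pass over the small children, whereas the actual formBucketRegions routine also weighs space proximity; all names and the example tree are illustrative, not the paper's code.

```python
# Simplified sketch of the top-down greedy chunk-to-bucket allocation.
# A tree is (node_cost, [child_trees]); sb is the bucket size.
def subtree_cost(tree):
    cost, children = tree
    return cost + sum(subtree_cost(c) for c in children)

def greedy_allocate(tree, sb):
    buckets, root_chunks = [], []        # root_chunks models the root-bucket
    def alloc(node):
        if subtree_cost(node) <= sb:     # whole tree fits: one bucket
            buckets.append([node])
            return
        cost, children = node
        root_chunks.append(cost)         # root chunk goes to the root-bucket
        small = [c for c in children if subtree_cost(c) <= sb]
        regions = []                     # first-fit-decreasing approximation
        for t in sorted(small, key=subtree_cost, reverse=True):
            for r in regions:
                if sum(map(subtree_cost, r)) + subtree_cost(t) <= sb:
                    r.append(t)
                    break
            else:
                regions.append([t])
        buckets.extend(regions)
        for c in children:               # recurse on oversized sub-trees
            if subtree_cost(c) > sb:
                alloc(c)
    alloc(tree)
    return buckets, root_chunks

# Illustrative tree of total cost 64, bucket size 30.
tree = (2, [(1, [(14, []), (13, [])]),
            (1, [(20, []), (8, []), (5, [])])])
buckets, root_chunks = greedy_allocate(tree, 30)
print(len(buckets), root_chunks)  # 3 buckets; root-bucket holds chunks of cost [2, 1]
```

Each node is visited once, so the sketch, like the paper's algorithm, is linear in the number of chunks.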
[Figure 8: a chunk-tree with DMAX = 5; the label of each node gives the storage cost of the corresponding sub-tree (whole tree: 65 units)]

Fig. 8 A chunk-tree to be allocated to buckets by the greedy algorithm
Assuming an input file consisting of the cube's data points along with their corresponding chunk-ids (or equivalently the corresponding h-surrogate key per dimension), we need a single pass over this file to create
[Figure 9: the allocation produced by the greedy algorithm for the chunk-tree of Fig. 8 with SB = 30 and DMAX = 5; three buckets B1, B2, B3]

Fig. 9 The chunk-to-bucket allocation for SB = 30
the chunk-tree representation of the cube. Then the above greedy algorithm requires only linear time in the number of input chunks (i.e., the chunks of the chunk-tree) to perform the allocation of chunks to buckets, as each node is visited exactly once and in the worst case all nodes are visited.
Assume the chunk-tree of DMAX = 5 of Fig. 8. The
numbers inside each node represent the storage cost for
the corresponding sub-tree, e.g., the whole chunk-tree
has a cost of 65 units. For a bucket size SB = 30 units the
greedy algorithm yields a hierarchical clustering factor
fHC = 0.072. The corresponding allocation is depicted in
Fig. 9.
The solution comprises three buckets B1, B2 and B3, depicted as rectangles in the figure. The bucket with the highest clustering degree (HCDB) is B3, because it includes the lowest-depth tree. The chunks not included in a rectangle will be stored in the root-bucket. In this case, the root-bucket consists of only a single bucket (i.e., λ = 1 and K = 3, see (3)), as this suffices for storing the corresponding two chunks.
5.2 Bucket-region formation
We have seen that in each step of the greedy algorithm for solving the HPP chunk-to-bucket allocation problem (corresponding to an input chunk-tree with a root node at a specific chunking depth), we try to store all the sibling trees hanging from this root in a set of buckets, forming this way groups of trees, stored one group per bucket, that we call bucket regions. The formation of bucket regions is essentially a special case of the HPP chunk-to-bucket allocation problem and can be described as follows:
Definition 9 (The bucket region formation problem) We are given a set of N chunk-trees T1, T2, ..., TN, of the same chunking depth d. Each tree Ti (1 ≤ i ≤ N) has a size cost(Ti) ≤ SB, where SB is the bucket size. The problem is to store these trees into a set of buckets, so that the hierarchical clustering factor fHC of this allocation is maximized.
As all the trees are of the same depth, the depth contribution c_i^d (1 ≤ i ≤ N), defined in (1), is the same for all trees. Therefore, in order to maximize the degree of hierarchical clustering HCDB for each individual bucket (and thus also increase the hierarchical clustering factor fHC), we have to maximize the region contribution c_i^r (1 ≤ i ≤ N) of each tree (1). This occurs when we create bucket regions with as many trees as possible on the one hand and, due to the region proximity factor rP, when the trees of each region are as close as possible in the multidimensional space, on the other. Finally, according to the fHC definition, the number of buckets used must be the smallest possible.
Summarizing, in the bucket region formation problem we seek a set of buckets to store the input trees, in
order to fulfill the following three criteria:
1. The bucket regions (i.e., each bucket) contain as
many trees as possible.
2. The total number of buckets is minimum.
3. The trees of a region are as close in the multidimen-
sional space as possible.
One could observe that if we focused only on the first two criteria, then the bucket region formation problem would be transformed into a typical bin-packing problem, which is a well-known NP-complete problem [42]. So intuitively the bucket region formation problem can be viewed as a bin-packing problem where items packed in the same bin must be neighbors in the multidimensional space.
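A minimal way to honor this neighborhood constraint, assuming the sibling trees come in a one-dimensional surrogate order (a simplification of the multidimensional case, with an illustrative function name), is to pack them in that order with a next-fit pass, so each region contains only consecutive trees:

```python
# Sketch: next-fit packing over trees given in dimension order, so every
# bucket-region holds only consecutive (i.e., neighboring) trees.
def form_regions(tree_costs, sb):
    regions, current, current_cost = [], [], 0
    for cost in tree_costs:               # costs in multidimensional order
        if current and current_cost + cost > sb:
            regions.append(current)       # close the region, start a new one
            current, current_cost = [], 0
        current.append(cost)
        current_cost += cost
    if current:
        regions.append(current)
    return regions

print(form_regions([10, 20, 5, 5, 5, 2], sb=30))
# [[10, 20], [5, 5, 5, 2]]
```

Next-fit preserves adjacency at the cost of occupancy; a plain first-fit would pack tighter but could place non-neighboring trees in the same region.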
The space proximity of the trees of a region is meaningful only when we have dimension domains with inherent orderings. A typical example is the TIME dimension.
For example, we might have trees corresponding to the
months of the same year (which guarantees hierarchical proximity) but we would also like the consecutive
months to be in the same region (space proximity). This
is because these dimensions are the best candidates for
expressing range predicates (e.g., months from FEB99 to
AUG99). Otherwise, when there is not such an inherent
ordering,