The VLDB Journal (2008) 17:621–655
DOI 10.1007/s00778-006-0022-1
REGULAR PAPER
Hierarchical clustering for OLAP: the CUBE File approach
Nikos Karayannidis
Timos Sellis
Received: 6 September 2005 / Accepted: 13 April 2006 / Published online: 7 September 2006
© Springer-Verlag 2006
Abstract This paper deals with the problem of physical clustering of multidimensional data that are organized in hierarchies, on disk, in a hierarchy-preserving manner. This is called hierarchical clustering. A typical case where hierarchical clustering is necessary for reducing I/Os during query evaluation is the most detailed data of an OLAP cube. The presence of hierarchies in the multidimensional space results in an enormous search space for this problem. We propose a representation of the data space that results in a chunk-tree representation of the cube. The model is adaptive to the cube's extensive sparseness and provides efficient access to subsets of data based on hierarchy value combinations. Based on this representation of the search space, we formulate the problem as a chunk-to-bucket allocation problem, which is a packing problem, as opposed to the linear ordering approach followed in the literature. We propose a metric to evaluate the quality of the hierarchical clustering achieved (i.e., to evaluate the solutions to the problem) and formulate the problem as an optimization problem. We prove its NP-Hardness and provide an effective solution based on a linear-time greedy algorithm. The solution of this problem leads to the construction of the CUBE File data structure. We analyze in depth all steps of the construction and provide solutions
Communicated by P.-L. Lions.

N. Karayannidis (B) · T. Sellis
Institute of Communication and Computer Systems and School of Electrical and Computer Engineering, National Technical University of Athens, Zographou 15773, Athens, Greece
e-mail: [email protected]

T. Sellis
e-mail: [email protected]
for interesting sub-problems arising, such as the formation of bucket-regions, the storage of large data chunks and the caching of the upper nodes (root directory) in main memory.
Finally, we provide an extensive experimental evaluation of the CUBE File's adaptability to the data space sparseness as well as to an increasing number of data points. The main result is that the CUBE File is highly adaptive to even the most sparse data spaces and, for realistic cases of data point cardinalities, provides hierarchical clustering of high quality and significant space savings.
Keywords Hierarchical clustering · OLAP · CUBE File · Data cube · Physical data clustering
1 Introduction
Efficient processing of ad hoc OLAP queries is a very difficult task considering, on the one hand, the native complexity of typical OLAP queries, which potentially combine huge amounts of data, and, on the other, the fact that no a priori knowledge of queries exists and thus no pre-computation of results or other query-specific tuning can be exploited. The only way to evaluate these queries is to access the most detailed data directly in an efficient way. It is exactly this need to access detailed data based on hierarchy criteria that calls for the hierarchical clustering of data. This paper discusses the physical clustering of OLAP cube data points on disk in a hierarchy-preserving manner, where hierarchies are defined along dimensions (hierarchical clustering).
622 N. Karayannidis, T. Sellis
The problem addressed is set out as follows: we are given a large fact table (FT) containing only grain-level (most detailed) data. We assume that this is part of the star schema in a dimensional data warehouse. Therefore, data points (i.e., tuples in the FT) are organized by a set of N dimensions. We further assume that each dimension is organized in a hierarchy. Typically, the data distribution is extremely skewed. In particular, the OLAP cube is extremely sparse and data tend to appear in arbitrary clusters along some dimensions. These clusters correspond to specific combinations of the hierarchy values for which there exist actual data (e.g., sales for a specific product category in a specific geographic region for a specific period of time). The problem is, on the one hand, to store the fact table data in a hierarchy-preserving manner so as to reduce I/Os during the evaluation of ad hoc queries containing restrictions and/or groupings on the dimension hierarchies and, on the other, to enable navigation in the multilevel-multidimensional data space by providing direct access (i.e., indexing) to subsets of data via hierarchical restrictions. The latter implies that index nodes must also be hierarchically clustered if we are aiming at a reduced I/O cost.
Some of the most interesting proposals [20, 21, 36] in the literature for cube data structures deal with the computation and storage of the data cube operator [9]. These methods omit a significant aspect of OLAP, which is that usually dimensions are not flat but are organized in hierarchies of different aggregation levels (e.g., store, city, area, country is such a hierarchy for a Location dimension). The most popular approach for organizing the most detailed data of a cube is the so-called star schema. In this case, the cube data are stored in a relational table, called the fact table. Furthermore, various indexing schemes have been developed [3, 15, 25, 26] in order to speed up the evaluation of the join of the central (and usually very large) fact table with the surrounding dimension tables (also known as a star-join). However, even when elaborate indexes are used, due to the arbitrary ordering of the fact table tuples, there might be as many I/Os as there are tuples resulting from the fact table.
We propose the CUBE File data structure as an effective solution to the hierarchical clustering problem set out above. The CUBE File multidimensional data structure [18] clusters data into buckets (i.e., disk pages) with respect to the dimension hierarchies, aiming at the hierarchical clustering of the data. Buckets may include both intermediate (index) nodes (directory chunks) and leaf (data) nodes (data chunks). The primary goal of a CUBE File is to cluster in the same bucket a family of data (i.e., data corresponding to all hierarchy value combinations for all dimensions) so as to reduce the bucket accesses during query evaluation.
Experimental results in [18] have shown that the CUBE File outperforms the UB-tree/MHC [22], which is another effective method for hierarchically clustering the cube, resulting in 7–9 times fewer I/Os on average over all workloads tested. This simply means that the CUBE File achieves a higher degree of hierarchical clustering of the data. More interestingly, in [15] it was shown that the UB-tree/MHC technique outperformed the traditional bitmap-index-based star-join by a factor of 20–40, which simply proves that hierarchical clustering is the most determinant factor in a file organization for OLAP cube data, in order to reduce I/O cost.
To tackle this problem, we first model the cube data space as a hierarchy of chunks. This model, called the chunk-tree representation of a cube, copes effectively with the vast data sparseness by truncating empty areas. Moreover, it provides a multiple-resolution view of the data space, where one can zoom in or zoom out to specific areas by navigating along the dimension hierarchies. The CUBE File is built by allocating the nodes of the chunk-tree into buckets in a hierarchy-preserving manner. In this way we depart from the common approach for solving the hierarchical clustering problem, which is to find a total ordering of the data points (linear clustering), and cope with it as a packing problem, namely a chunk-to-bucket packing problem.
In order to solve the chunk-to-bucket packing problem, we need to be able to evaluate the hierarchical clustering achieved (i.e., to evaluate the solutions to this problem). Thus, inspired by the chunk-tree representation of the cube, we define a hierarchical clustering quality metric, called the hierarchical clustering factor. We use this metric to evaluate the quality of the chunk-to-bucket allocation. Moreover, we exploit it in order to formulate the CUBE File construction problem as an optimization problem, which we call the chunk-to-bucket allocation problem. We formally define this problem and prove that it is NP-Hard. Then, we propose a heuristic algorithm as a solution that requires a single pass over the input fact table and time linear in the number of chunks.
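The greedy idea behind such a single-pass packing can be sketched as follows. This is a hypothetical, simplified illustration, not the paper's actual algorithm or data structures (the `Chunk` class, the sizes and `BUCKET_SIZE` are all invented for the example): if an entire chunk subtree fits in one bucket, the whole family of data is kept together; otherwise the node is stored on its own and its children are packed recursively.

```python
BUCKET_SIZE = 8  # bucket capacity in abstract "chunk units" (illustrative)

class Chunk:
    """A node of a chunk-tree; `size` is its own storage cost."""
    def __init__(self, size, children=()):
        self.size = size
        self.children = list(children)

    def subtree_size(self):
        return self.size + sum(c.subtree_size() for c in self.children)

def allocate(chunk, buckets):
    """Single top-down greedy pass: keep whole subtrees together when they fit."""
    if chunk.subtree_size() <= BUCKET_SIZE:
        bucket = []
        _collect(chunk, bucket)        # the whole "family of data" in one bucket
        buckets.append(bucket)
    else:
        buckets.append([chunk])        # oversized node goes to its own bucket
        for child in chunk.children:
            allocate(child, buckets)

def _collect(chunk, bucket):
    bucket.append(chunk)
    for c in chunk.children:
        _collect(c, bucket)

# A root with one subtree that fits a bucket and one that must be split.
small = Chunk(2, [Chunk(1), Chunk(1)])   # subtree size 4: fits one bucket
large = Chunk(2, [Chunk(5), Chunk(5)])   # subtree size 12: must be split
root = Chunk(1, [small, large])
buckets = []
allocate(root, buckets)
```

A single top-down pass suffices, and the work is linear in the number of chunks, which matches the spirit of the complexity claim above.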
In the course of solving this problem, several interesting sub-problems arise. We define the sub-problem of chunk-region formation, which deals with the clustering of chunk-trees hanging from the same parent node in order to further increase the overall hierarchical clustering. We propose two algorithms as a solution, one of which is driven by workload patterns. Next, we deal with the sub-problem of storing large data chunks (i.e., chunks that do not fit in a single bucket), as well as with the sub-problem of storing the so-called root
directory of the CUBE File (i.e., the upper nodes of the
data structure).
Finally, we study the CUBE File's effective adaptation to several cube data spaces by presenting a set of experimental measurements that we have conducted.
All in all, the contributions of this paper are outlined as follows:

- We provide an analytic solution to the problem of hierarchical clustering of an OLAP cube. The solution leads to the construction of the CUBE File data structure.
- We model the multilevel-multidimensional data space of the cube as a chunk-tree. This representation of the data space adapts perfectly to the extensive data sparseness and provides a multi-resolution view of the data with respect to the hierarchies. Moreover, if viewed as an index, it provides direct access to cube data via hierarchical restrictions, which results in significant speedups of typical ad hoc OLAP queries.
- We transform the hierarchical clustering problem from a linear clustering problem into a chunk-to-bucket allocation (i.e., packing) problem, which we formally define and prove to be NP-Hard.
- We introduce a hierarchical clustering quality metric for evaluating the hierarchical clustering achieved (i.e., evaluating the solution to the problem in question). We provide an efficient solution to this problem as well as to all sub-problems that stem from it, such as the storage of large data chunks or the formation of bucket-regions.
- We provide an experimental evaluation which leads to the following basic results:
  o The CUBE File adapts perfectly to even the most extremely sparse data spaces, yielding significant space savings. Furthermore, the hierarchical clustering achieved by the CUBE File is almost unaffected by the extensive cube sparseness.
  o The CUBE File is scalable for any realistic number of input data points. In addition, the hierarchical clustering achieved remains of high quality when the number of input data points increases.
  o The root directory can be cached in main memory, providing a single-I/O cost for the evaluation of point queries.
The rest of this paper is organized as follows. Section 2 discusses related work and positions the CUBE File in the space of cube storage structures. Section 3 proposes the chunk-tree representation of the cube as an effective representation of the search space. Section 4 introduces a quality metric for the evaluation of hierarchical clustering. Section 5 formally defines the problem of hierarchical clustering, proves its NP-Hardness and then delves into the nuts and bolts of building the CUBE File. Section 6 presents our extensive experimental evaluation and Sect. 7 recapitulates and emphasizes the main conclusions drawn.
2 Related work
2.1 The linear clustering problem for multidimensional data
The linear clustering problem for multidimensional data is defined as the problem of finding a linear ordering of records indexed on multiple attributes, to be stored in consecutive disk blocks, such that the I/O cost for the evaluation of queries is minimized. The clustering of multidimensional data has been studied in terms of finding a mapping of the multidimensional space to a one-dimensional space. This approach has been explored mainly in two directions: (a) in order to exploit traditional one-dimensional indexing techniques in a multidimensional index space (a typical example is the UB-tree [2], which exploits a z-ordering of multidimensional data [27], so that these can be stored in a one-dimensional B-tree index [1]) and (b) for ordering buckets containing records that have been indexed on multiple attributes, to minimize the disk access effort. For example, a grid file [23] exploits a multidimensional grid in order to provide a mapping between grid cells and disk blocks. One could find a linear ordering of these cells, and therefore an ordering of the underlying buckets, such that the evaluation of a query entails more sequential bucket reads than random bucket accesses. To this end, space-filling curves (see [33] for a survey) have been used extensively. For example, Jagadish [13] provides a linear clustering method based on the Hilbert curve that outperforms previously proposed mappings. Note, however, that all linear clustering methods are inferior to a simple scan in high-dimensional spaces. This is due to the notorious dimensionality curse [41], which states that clustering in such spaces becomes meaningless due to the lack of useful distance metrics.
In the presence of dimension hierarchies the multidi-
mensional clustering problem becomes combinatorially
explosive. Jagadish et al. [14] try to solve the problem of
finding an optimal linear clustering of records of a fact
table on disk, given a specific workload in the form of a
probability distribution over query classes. The authors
propose a subclass of clustering methods called lattice
paths, which are paths on the lattice defined by the
hierarchy level combinations of the dimensions. The
HPP chunk-to-bucket allocation problem (in Sect. 3.2
we provide a formal definition of HPP restrictions and
queries) is a different problem for the following reasons:
1. It tries to find an optimal way (in terms of reduced I/O cost during query evaluation) to pack the data into buckets, rather than order the data linearly. The problem of finding an optimal linear ordering of the buckets, for a specific workload, so as to reduce random bucket reads, is an orthogonal problem and, therefore, the methods proposed in [14] could be used additionally.
2. Apart from the data, it also deals with the intermediate node entries (i.e., directory chunk entries), which provide clustering at the whole-index level and not only at the index-leaf level. In other words, index data are also clustered along with the real data.
As we know that there is no linear clustering of records that will permit all queries over a multidimensional space to be answered efficiently [14], we strongly advocate that linear clustering of buckets (inter-bucket clustering) must be exploited in conjunction with an efficient allocation of records into buckets (intra-bucket clustering).
Furthermore, in [22], a path-based encoding of dimension data, similar to our encoding scheme, is exploited in order to achieve linear clustering of multidimensional data with hierarchies, through a z-ordering [27]. The authors use the UB-tree [2] as an index on top of the linearly clustered records. This technique has the advantage of transforming typical star-join [25] queries into multidimensional range queries, which are computed more efficiently due to the underlying multidimensional index.
However, this technique suffers from the inherent deficiencies of the z space-filling curve, which is not the best space-filling curve according to [7, 13]. On the other hand, it is very easy to compute and thus straightforward to implement, even for high dimensionalities. A typical example of such a deficiency is that in the z-curve there is a dispersion of certain data points, which are close in the multidimensional space but not close in the linear order, and the opposite, i.e., distant data points are clustered in the linear space. The latter also results in an inefficient evaluation of multiple disjoint query regions, due to the repetitive retrieval of the same pages for many queries. Finally, the benefits of z-based linear clustering start to disappear quite soon as dimensionality increases, practically even when dimensionality exceeds 4–5 dimensions.
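The z-ordering mentioned above is obtained by interleaving the bits of the coordinates (or, in [22], of the compound surrogates). A minimal sketch, with invented coordinates, also illustrates the dispersion deficiency: spatially adjacent points can end up far apart in the linear order.

```python
def z_value(coords, bits=8):
    """Morton (z-order) value: interleave the bits of all coordinates."""
    z = 0
    ndims = len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * ndims + d)
    return z

# Dispersion: (3, 0) and (4, 0) are one step apart in space, but their
# z-values jump from 5 to 16, because the high-order bit changes.
a = z_value((3, 0))   # 5
b = z_value((4, 0))   # 16
```

The example uses plain integer coordinates; in the hierarchical setting of [22], the interleaved values would be the bits of each dimension's compound surrogate instead.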
2.2 Grid file based multidimensional access methods
The CUBE File organization was initially inspired by the grid file organization [23], which can be viewed as the multidimensional counterpart of extendible hashing [6]. The grid file superimposes a d-dimensional orthogonal grid on the multidimensional space. Given that the grid is not necessarily regular, the resulting cells may be of different shapes and sizes. A grid directory associates one or more of these cells with data buckets, which are stored in one disk page each. Each cell is associated with one bucket, but a bucket may contain several adjacent cells; therefore, bucket regions may be formed.
To ensure that data items are always found with no more than two disk accesses for exact-match queries, the grid itself is kept in main memory, represented by d one-dimensional arrays called scales. The grid file is intended for dynamic insert/delete operations; therefore, it supports operations for splitting and merging directory cells. A well-known problem of the grid file is that it suffers from a superlinear growth of the directory, even for data that are uniformly distributed [31]. One basic reason for this is that splitting is not a local operation and thus can lead to superlinear directory growth. Moreover, depending on the implementation of the grid directory, merging may require a complete directory scan [12].
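The two-disk-access principle can be sketched as follows, with invented scales and a toy directory (not the grid file's actual implementation): the in-memory scales locate a point's cell via binary search on each dimension, and the grid directory then maps that cell to its bucket, so an exact-match query costs at most one directory access plus one bucket access.

```python
import bisect

# Scales: one sorted array of partition points per dimension (in memory).
scales = [
    [10, 20, 30],   # dimension 0 split at 10, 20, 30 -> 4 intervals
    [100, 200],     # dimension 1 split at 100, 200   -> 3 intervals
]

# Grid directory: cell coordinates -> bucket id. Several cells may share
# a bucket, forming a bucket region (the mapping here is arbitrary).
directory = {(i, j): (i + j) % 3 for i in range(4) for j in range(3)}

def cell_of(point):
    """Locate the grid cell of a point via binary search on each scale."""
    return tuple(bisect.bisect_right(scale, c) for scale, c in zip(scales, point))

def bucket_of(point):
    """Directory lookup (1st disk access); reading the bucket is the 2nd."""
    return directory[cell_of(point)]
```

For example, the point (15, 250) falls into cell (1, 2), whose directory entry names the bucket to fetch.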
Hinrichs [12] attempts to overcome the shortcomings of the grid file by introducing a 2-level grid directory. In this scheme, the grid directory is stored on disk and a scaled-down version of it (called the root directory) is kept in main memory to ensure that the two-disk-access principle still holds. Furthermore, he discusses efficient implementations of the split, merge and neighborhood operations. In a similar manner, Whang and Krishnamurthy [43] extend the idea of a 2-level directory to a multilevel directory, introducing the multilevel grid file and achieving directory growth linear in the number of records. There exist more grid file based organizations. A comprehensive survey of these, and of multidimensional access methods in general, can be found in [8].
An obvious distinction of the CUBE File organization from the above multidimensional access methods is that it has been designed to fulfill completely different requirements, namely those of an OLAP environment and not of a transaction-oriented one. A CUBE File is designed for an initial bulk loading and then a read-only operation mode, in contrast to the dynamic insert/delete/update workload of a grid file. Moreover, a CUBE File aims at speeding up queries on multidimensional data with hierarchies and exploits hierarchical clustering to this end. Furthermore, as the dimension domain in OLAP is known a priori, the directory does not have to grow dynamically. In addition, changes to
the directory are rare, as dimension data do not change very often (compared to the rate of change of the cube data), and deletions are seldom; therefore, split and merge operations are not needed as much. Nevertheless, it is more important to adapt well to the native sparseness of a cube data space and to efficiently support incremental updating, so as to minimize the update window and cube query-down time, which are critical factors in business intelligence applications nowadays.
2.3 Taxonomy of cube primary organizations
The set of methods reported in the literature for primary organizations for the storage of cubes is quite confined. We believe that this is basically due to two reasons: first of all, the generally held view is that a cube is a set of pre-computed aggregated results, and thus the main focus has been to devise efficient ways to compute these results [11], as well as to choose which ones to compute for a specific workload (the view selection/maintenance problem [10, 32, 37]). Kotidis and Roussopoulos [19] proposed a storage organization based on packed R-trees for storing these aggregated results. We believe that this is a one-sided view of the problem, as it disregards the fact that very often, especially for ad hoc queries, there will be a need for drilling down to the most detailed data in order to compute a result from scratch. Ad hoc queries represent the essence of OLAP and, in contrast to report queries, are not known a priori and thus cannot really benefit from pre-computation. The only way to process them efficiently is to enable fast retrieval of the base data. This calls for an effective primary storage organization for the most detailed data (grain level) of the cube. This argument is of course based on the fact that a full pre-computation of all possible aggregates is prohibitive due to the consequent size explosion, especially for sparse cubes [24].
The second reason that makes people reluctant to work on new primary organizations for cubes is their adherence to relational systems. Although this seems justified, one could point out that a relational table (e.g., a fact table of a star schema [4]) is a logical entity and thus should be separated from the physical method chosen for implementing it. Therefore, one can use, apart from a paged record file, a B-tree or even a multidimensional data structure as the primary organization for a fact table. In fact, there are not many commercial RDBMSs ([39] is one that we know of) that exploit a multidimensional data structure as the primary organization for fact tables. All in all, the integration of a new data structure into a full-blown commercial system is a strenuous task with high cost and high risk, and thus the proposed solutions are usually reluctant to depart from the existing technology (see also [30] for a detailed description of the issues in this integration).
Figure 1 positions the CUBE File organization in the space of primary organizations proposed for storing a cube (i.e., only the base data and not aggregates). The columns of this table describe the alternative data structures that have been proposed as a primary organization, while the rows classify the proposed methods according to the data clustering achieved. In the top-left cell lies the conventional star schema [4], where a paged record file is used for storing the fact table. This organization guarantees no particular ordering among the stored data, and thus additional secondary indexes are built around it in order to support efficient access to the data.
Padmanabhan et al. [28] assume a typical relation (i.e., a paged record file) as the primary organization of a cube (i.e., fact table). However, unique combinations of dimension values are used in order to form blocks of records, which correspond to consecutive disk pages. These blocks can be considered as chunks. The database administrator must choose only one hierarchy level from each dimension to participate in the clustering scheme. In this sense, the method provides multidimensional clustering and not hierarchical (multidimensional) clustering.
In [35] a chunk-based method for storing large multidimensional arrays is proposed. No hierarchies are assumed on the dimensions, and data are clustered according to the most frequent range queries of a particular workload. In [5] the benefit of hierarchical clustering in speeding up queries was observed as a side effect of using a chunk-based file organization over a relation (i.e., a paged file of records) for query caching, with the chunk as the caching unit. Hierarchical clustering was achieved through an appropriate hierarchical encoding of the dimension data.
Markl et al. [22] also impose a hierarchical encoding on the dimension data and assign a path-based surrogate key, called the compound surrogate key, to each dimension tuple. They exploit the UB-tree multidimensional index [2] as the primary organization of the cube. Hierarchical clustering is achieved by taking the z-order [27] of the cube data points, produced by interleaving the bits of the corresponding compound surrogates. Deshpande et al. [5], Markl et al. [22] and the CUBE File [18] all exploit hierarchical clustering of the cube data, and the last two use multidimensional structures as the primary organization. This has, among others, the significant benefit of transforming a star-join [25] into a multidimensional range query that is evaluated very efficiently over these data structures.
Fig. 1 The space of proposed primary organizations for cube storage

                                          Primary Organization
  Clustering Achieved                     Relation      MD-Array    UB-tree    GRID FILE-based
  No clustering                           Star Schema
  Chunk-based clustering                  [28]          [35]
  Hierarchical clustering
    - other chunk-based                   [5]                                  [18]
    - z-order based                                                 [22]
3 Modeling the data space as a chunk-tree
Clearly, our goal is to define a multidimensional file organization that natively supports hierarchies. There is indeed a plethora of data structures for multidimensional data [8], but to the best of our knowledge, none of these explicitly supports hierarchies. Hierarchies complicate things, basically because, in their presence, the data space explodes.1 Moreover, as we are primarily aiming at speeding up queries that include restrictions on the hierarchies, we need a data structure that can efficiently lead us to the corresponding data subset based on these restrictions. A key observation at this point is that all restrictions on the hierarchies intuitively define a subcube or a cube-slice.
To this end, we exploit the intuitive representation of a cube as a multidimensional array and apply a chunking method in order to create subcubes, i.e., the so-called chunks. Our method of chunking is based on the structure of the dimension hierarchies and thus we call it hierarchical chunking. In the following sections we present a dimension-data encoding scheme that assigns hierarchy-enabled unique identifiers to each data point in a dimension. Then, we present our hierarchical chunking method. Finally, we propose a tree structure for representing the hierarchy of the resulting chunks and thus modeling the cube data space.
3.1 Dimension encoding and hierarchical chunking
In order to apply hierarchical chunking, we first assign
a surrogate key to each dimension hierarchy value. This
key uniquely identifies each value within the hierarchy.
1 Assuming N dimension hierarchies modeled as K-level m-way trees, the number of possible value combinations is K-times exponential in the number of dimensions, i.e., O(m^(KN)).
[Fig. 2 Example of hierarchical surrogate keys assigned to an example hierarchy: the LOCATION dimension with levels Continent, Country, Region and City (the grain level). Europe (0) has children Greece (0.0) and U.K. (0.1); Greece contains the regions North (0.0.0) and South (0.0.1); the cities are Salonica, Athens, Rhodes, Glasgow, London and Cardiff, with Rhodes identified by the h-surrogate 0.0.1.2.]
More specifically, we order the values in each hierarchy level so that sibling values occupy consecutive positions, and perform a mapping to the domain of positive integers. The resulting values are depicted in Fig. 2 for an example dimension hierarchy. The simple integers appearing under each value at each level are called order-codes. In order to identify a value in the hierarchy, we form the path of order-codes from the root value to the value in question. This path is called a hierarchical surrogate key, or simply h-surrogate. For example, the h-surrogate for the value Rhodes is 0.0.1.2. H-surrogates convey hierarchical (i.e., semantic) information about each cube data point, which can be greatly exploited for the efficient processing of star queries [15, 29, 40].
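The h-surrogate construction can be sketched as follows. The parent/order-code table below is a toy encoding of the Fig. 2 hierarchy, not the paper's actual implementation; names such as `North_GR` are invented to disambiguate the two North/South regions.

```python
# Each value maps to (parent value, order-code within its level).
hierarchy = {
    "Europe":   (None, 0),
    "Greece":   ("Europe", 0),   "U.K.":     ("Europe", 1),
    "North_GR": ("Greece", 0),   "South_GR": ("Greece", 1),
    "North_UK": ("U.K.", 2),     "South_UK": ("U.K.", 3),
    "Salonica": ("North_GR", 0), "Athens":   ("North_GR", 1),
    "Rhodes":   ("South_GR", 2),
    "Glasgow":  ("North_UK", 3), "London":   ("South_UK", 4),
    "Cardiff":  ("South_UK", 5),
}

def h_surrogate(value):
    """Path of order-codes from the root to `value`, e.g. Rhodes -> 0.0.1.2."""
    codes = []
    while value is not None:
        parent, code = hierarchy[value]
        codes.append(code)
        value = parent
    return ".".join(str(c) for c in reversed(codes))
```

For instance, `h_surrogate("Rhodes")` walks Rhodes → South_GR → Greece → Europe and emits the order-codes in root-to-leaf order, reproducing the 0.0.1.2 of the running example.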
The basic incentive behind hierarchical chunking is to partition the data space by forming a hierarchy of chunks that is based on the dimensions' hierarchies. This has the beneficial effect of pruning all empty areas. Remember that in a cube data space, empty areas are typically defined by specific combinations of hierarchy values (e.g., if we did not sell product category X in region Y for T periods of time, an empty region is formed). Moreover, it provides us with a multi-resolution view of the data space, where one can zoom in and zoom out by navigating along the dimension hierarchies.
We model the cube as a large multidimensional array, which consists only of the most detailed data. Initially, we
partition the cube into a very few chunks corresponding to the most aggregated levels of the dimensions' hierarchies. Then we recursively partition each chunk as we drill down the hierarchies of all dimensions in parallel. We define a measure in order to distinguish each recursion step: the chunking depth D. We will illustrate hierarchical chunking with an example. The dimensions of our example cube are depicted in Fig. 3 and correspond to a two-dimensional cube hosting sales data for a fictitious company. The two dimensions are LOCATION and PRODUCT. In the figure we can see the members of each level of these dimensions (each appearing with its member-code).
In order to apply our method, we need to have hierarchies of equal length. For this reason, we insert pseudo-levels P into the shorter hierarchies until they reach the length of the longest one. This padding is done after the level that is just above the grain level. In our example, the PRODUCT dimension has only three levels and needs one pseudo-level in order to reach the length of the LOCATION dimension. This is depicted next, where we have also noted the order-code range at each level:

LOCATION: [0-2].[0-4].[0-10].[0-18]
PRODUCT: [0-1].[0-2].P.[0-5]
The result of hierarchical chunking on our example
cube is depicted in Fig. 4a. Chunking begins at chunking
depth D
=0 and proceeds in a top-down fashion. To
define a chunk, we define discrete ranges of grain-level(i.e., most-detailed) values on each dimension, denoted
in the figure as [a..b], where a and b are grain-level
order-codes. Each such range is defined as the set of
values with the same parent (value) in the correspond-
ing parent level. These parent levels form the set of
pivot levels PVT, which guides the chunking process
at each step. Therefore initially, PVT = {LOCATION:
Continent, PRODUCT: Category}. For example, if we
take value 0 of pivot level Continent of the LOCA-
TIONdimension, then the corresponding range at the
grain level is Cities [0..5].
The definition of such a range for each dimension
defines a chunk. For example, the chunk defined from
the 0, 0 values of the pivot levels Continent and Cat-
egory, respectively, consists of the following grain data
(LOCATION:0.[0-1].[0-3].[0-5], PRODUCT:0.[0-1]. P.[0-3]).
The [] notation denotes a range of members. This chunk appears shaded in Fig. 4a at D = 0. Ultimately, at D = 0 we have a chunk for each possible combination of the members of the pivot levels, that is, a total of |[0-1]| × |[0-2]| = 2 × 3 = 6 chunks in this example. Thus, the total number of chunks created at each depth D equals the product of the cardinalities of the pivot levels.
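The chunk count at a given depth follows directly from the pivot-level cardinalities, as a small sketch shows (the function name is our own):

```python
from math import prod

def chunks_at_depth(pivot_cardinalities):
    """Total number of chunks created at a depth: the product of the
    cardinalities of that depth's pivot levels."""
    return prod(pivot_cardinalities)

# D = 0 pivots: Category ([0-1], 2 members) and Continent ([0-2], 3 members)
print(chunks_at_depth([2, 3]))  # 6
```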
Next we proceed at D = 1, with PVT = {LOCATION: Country, PRODUCT: Type}, and recursively chunk each chunk of depth D = 0. This time we define ranges within the previously defined ranges. For example, on
the range corresponding to Continent value 0 that we
created before, we define discrete ranges corresponding to each country of this continent (i.e., to each value
of the Country level, which has parent 0). In Fig. 4a,
at D = 1, the shaded boxes correspond to all the chunks resulting from the chunking of the chunk mentioned in
the previous paragraph.
Similarly, we continue the chunking by descending all dimension hierarchies in parallel, and at each depth D
we create new chunks within the existing ones. The pro-
cedure ends when the next levels to include as pivot lev-
els are the grain levels. Then we do not need to perform
any further chunking, because the chunks that would be
produced from such a chunking would be the cells of the cube themselves. In this case, we have reached the maximum chunking depth DMAX. In our example, chunking stops at D = 2 and the maximum depth is D = 3. Notice the shaded chunks in Fig. 4a, which depict chunks belonging to the same chunk hierarchy.
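The per-dimension range splitting that drives each chunking step can be sketched as follows. This is an illustrative sketch; the path-tuple encoding of grain members is our own simplification, not the paper's data structure:

```python
from itertools import groupby

def split_range(grain_paths, depth):
    """Group consecutive grain members that share the same ancestor
    order-code at `depth` into ranges (a, b) of grain positions."""
    ranges = []
    for key, grp in groupby(enumerate(grain_paths), key=lambda p: p[1][depth]):
        idx = [i for i, _ in grp]
        ranges.append((key, (idx[0], idx[-1])))  # (parent code, [a..b])
    return ranges

# Continent-level split of the 19 LOCATION cities of the running example:
# Europe (0) -> cities 0..5, North America (1) -> 6..10, Asia (2) -> 11..18
cities = [(0,)] * 6 + [(1,)] * 5 + [(2,)] * 8
print(split_range(cities, 0))  # [(0, (0, 5)), (1, (6, 10)), (2, (11, 18))]
```

Applying the same split recursively inside each previously produced range, with the next pivot level as the grouping key, yields the nested chunks described above.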
The rationale for inserting the pseudo-levels above
the grain level lies in the fact that we wish to apply chunking as soon as possible and for all possible dimensions. As
the chunking proceeds in a top-to-bottom fashion, this
eager chunking has the advantage of reducing the chunk size very early and also provides faster access to the underlying data, because it increases the fan-out of the intermediate nodes. If at a particular depth one
(or more) pivot level is a pseudo-level, then this level
does not take part in the chunking. (In our example this occurs at D = 2 for the PRODUCT dimension.) This means that we do not define any new ranges within the
previously defined range for the specific dimension(s)
but instead we keep the old one with no further chunk-
ing. Therefore, as pseudo-levels restrict chunking in the dimensions to which they are applied, we must insert them at the lowest possible level. Consequently, as there is no
chunking below the grain level (a data cell cannot be
further partitioned), the pseudo-level insertion occurs
just above the grain level.
3.2 The chunk-tree representation
We use the intermediate-depth chunks as directory chunks that guide us to the chunks at depth DMAX, which contain the data and are thus called data chunks. This
leads to a chunk-tree representation of the hierarchi-
cally chunked cube and hence the cube data space. It
628 N. Karayannidis, T. Sellis
[Fig. 3 content: the members of each dimension hierarchy with their member-codes. PRODUCT: Category (Books 0, Music 1) / Type (e.g., Literature 0.0, Philosophy 0.1, Classical 1.2) / Item (e.g., Murderess, A. Papadiamantis 0.0.0). LOCATION: Continent (Europe 0, North America 1, Asia 2) / Country (e.g., Greece 0.0, U.K. 0.1, USA 1.2, Japan 2.3, India 2.4) / Region (e.g., Greece-North 0.0.0) / City (e.g., Salonica 0.0.0.0).]
Fig. 3 Dimensions of our example cube along with two hierarchy instantiations
is depicted in Fig. 4b for our example cube. In Fig. 4b,
we have expanded the chunk-sub-tree corresponding to
the family of chunks that has been shaded in Fig. 4a.
Pseudo-levels are marked with P and the corresponding directory chunks have reduced dimensionality (i.e., they are one-dimensional in this case). We interleave the h-surrogates of the pivot level values that define a chunk and
form a chunk-id. This is a unique identifier for a chunk
within a CUBE File. Moreover, this identifier includes
the whole path in the chunk hierarchy of a chunk. In
Fig. 4b, we note the corresponding chunk-id above each
chunk. The root chunk does not have a chunk-id because
it represents the whole cube and chunk-ids essentially
denote sub-cubes. The part of a chunk-id that is con-
tained between consecutive dots and corresponds to a
specific depth D is called D-domain.
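Assuming the notation of Fig. 4b ('|' separating dimensions inside a D-domain, '.' separating depths), chunk-id construction and D-domain extraction can be sketched as follows (helper names are our own):

```python
def make_chunk_id(path):
    """path: one tuple of pivot-level order-codes per depth ('P' marks a
    pseudo-level). The codes are interleaved into one D-domain per depth."""
    return ".".join("|".join(str(c) for c in domain) for domain in path)

def d_domain(chunk_id, depth):
    """Return the part of the chunk-id between consecutive dots at `depth`."""
    return chunk_id.split(".")[depth]

# PRODUCT|LOCATION codes at depths 0, 1 and 2 (cf. chunk 0|0.0|1.0|P in Fig. 4b)
cid = make_chunk_id([(0, 0), (0, 1), (0, "P")])
print(cid)               # 0|0.0|1.0|P
print(d_domain(cid, 1))  # 0|1
```

Because the chunk-id concatenates one D-domain per depth, it encodes the whole path of a chunk in the chunk hierarchy, as stated above.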
The chunk-tree representation can be regarded as a
method to model the multilevel-multidimensional data
space of an OLAP cube. We discuss next the major benefits of this modeling:
Direct access to cube data through hierarchical restrictions. One of the main advantages of the chunk-tree
representation of a cube is that it explicitly supports hier-
archies. This means that any cube data subset defined
through restrictions on the dimension hierarchies can
be accessed directly. This is achieved by simply accessing
the qualifying cells at each depth and following the inter-
mediate chunk pointers to the appropriate data. Note
that the vast majority of OLAP queries contain an equality restriction on a number of hierarchical attributes, and more commonly on hierarchical attributes that form a complete path in the hierarchy. This is reasonable, as the core of analysis is conducted along the hierarchies. We call this kind of restriction a hierarchical prefix path (HPP) restriction and provide the corresponding definition next:
Definition 1 (Hierarchical Prefix Path Restriction) We
define a hierarchical prefix path restriction (HPP restric-
tion) on a hierarchy H of a dimension D, to be a set of
equality restrictions, linked by conjunctions, on H's levels that form a path in H which always includes the topmost (most aggregated) level of H.
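As a sketch, checking whether a set of restricted levels forms an HPP restriction amounts to a prefix test against the hierarchy. This is an illustrative helper of our own, not part of the paper's formalism:

```python
def is_hpp_restriction(hierarchy_levels, restricted_levels):
    """True if the restricted levels form a contiguous prefix of the
    hierarchy starting at its topmost (most aggregated) level."""
    prefix_len = len(restricted_levels)
    return list(hierarchy_levels[:prefix_len]) == list(restricted_levels)

location = ["Continent", "Country", "Region", "City"]
print(is_hpp_restriction(location, ["Continent", "Country", "Region"]))  # True
print(is_hpp_restriction(location, ["Country", "Region"]))  # False: misses the top level
```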
For example, if we consider the dimension LOCATION of our example cube and a DATE dimension with a 3-level hierarchy (Year/Month/Day), then the query "show me sales for country A (in continent C) in region B for each month of 1999" contains two whole-path restrictions, one for the dimension LOCATION and one for DATE: (a) LOCATION.continent = C AND
[Fig. 4 content: (a) the grain-level ranges defined at each chunking depth, with pivot levels (Category, Continent) at D = 0, (Type, Country) at D = 1, (-, Region) at D = 2 and (Item, City) at the grain level; (b) the chunk-tree from the root chunk through the directory chunks down to the data chunks at D = 3 (max depth), with the chunk-id of each chunk (e.g., 0|0, 0|0.0|1, 0|0.0|1.0|P) noted above it.]
Fig. 4 a The cube from our running example hierarchically chunked. b The whole sub-tree up to the data chunks under chunk 0|0
LOCATION.country = A AND LOCATION.region =
B, and (b) DATE.year = 1999.
Consequently, we can now define the class of HPP queries:
Definition 2 (Hierarchical Prefix Path Query) We call
a query Q on a cube C a hierarchical prefix path query
(HPP query), if and only if all the restrictions imposed
by Q on the dimensions of C are HPP restrictions, which
are linked together by conjunctions.
Adaptation to the cube's native sparseness. The cube data
space is extremely sparse [34]. In other words, the ratio
of the number of real data points to the product of the
dimension grain-level cardinalities is a very small num-
ber. Values for this ratio in the range of 10^-12 to 10^-5 are more than typical (especially for cubes with more
than three dimensions). It is therefore imperative that
a primary organization for the cube adapts well to this
sparseness, allocating space conservatively. Ideally, the
allocated space must be comparable to the size of the
existing data points. The chunk-tree representation
adapts perfectly to the cube data space. The reason
is that the empty regions of a cube are not arbitrarily
formed. On the contrary, specific combinations of
dimension hierarchy values form them. For instance,
in our running example, if no music products are sold
in Greece, then a large empty region is formed. Consequently, the empty regions in the cube data space translate naturally to one or more empty chunk sub-trees in the chunk-tree representation. Therefore, empty sub-trees can be discarded altogether and the space allocation corresponds to real data points only.
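The discarding of empty sub-trees can be sketched as a simple recursive pruning. This is a minimal sketch with an assumed in-memory chunk representation of our own, not the CUBE File's physical layout:

```python
class Chunk:
    """Assumed toy representation: directory chunks hold a dict of children,
    data chunks hold a list of measure values (cells)."""
    def __init__(self, children=None, cells=None):
        self.children = children
        self.cells = cells

def prune_empty(chunk):
    """Return the chunk with empty sub-trees discarded, or None if all empty."""
    if chunk.children is None:                 # data chunk
        return chunk if chunk.cells else None
    kept = {}
    for key, child in chunk.children.items():
        pruned = prune_empty(child)
        if pruned is not None:
            kept[key] = pruned
    chunk.children = kept
    return chunk if kept else None

# In the running example: no music sold in Greece, so that sub-tree vanishes.
root = Chunk(children={
    "books_greece": Chunk(cells=[10.0, 20.0]),
    "music_greece": Chunk(cells=[]),
})
print(sorted(prune_empty(root).children))  # ['books_greece']
```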
Multi-resolution view of the data space. The chunk-tree
represents the whole cube data space (however, with
most of the empty areas pruned). Similarly, each sub-
tree represents a sub, space. Moreover, at a specific
chunking depth we view all the data points organized
in hierarchical families (i.e., chunk-trees) according to the combinations of hierarchy values for the corre-
sponding hierarchy levels. By descending to a higher
depth node we view the data of the corresponding
subspace organized in hierarchical families of a more
detailed level and so on. This multi-resolution feature
will be exploited later in order to achieve a better hierar-
chical clustering of the data by promoting the storage of lower-depth chunk-trees in a bucket over that of higher-depth ones.
Storage efficiency. A chunk is physically represented by
a multidimensional array. This enables an offset-based
access, rather than a search-based one, which speeds up the cell access mechanism considerably. Moreover, it
gives us the opportunity to exploit chunk-ids in a very
effective way. A chunk-id essentially consists of interleaved coordinate values. Therefore, we can use a chunk-id to calculate the appropriate offset of a cell in a chunk, but we do not have to store the chunk-id
along with each cell. Indeed, a search-based mechanism
(like the one used by conventional B-tree indexes or
the UB-tree [2]) requires that the dimension values (or
the corresponding h-surrogates), which form the search-
key, must also be stored within each cell (i.e., tuple) of
the cube. In the CUBE File only the measure values of
the cube are stored in each cell. Hence notable space
savings are achieved. In addition, further compression
of chunks can be easily achieved, without affecting the
offset-based accessing (see [17] for the details).
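The offset-based access can be sketched as the usual row-major address computation over the cell coordinates carried by the chunk-id. This is a generic sketch of our own, not the paper's exact chunk layout:

```python
def cell_offset(coords, dims):
    """Row-major offset of a cell with local coordinates `coords` inside a
    chunk whose per-dimension extents are `dims` (Horner's scheme)."""
    offset = 0
    for c, d in zip(coords, dims):
        offset = offset * d + c
    return offset

# A 2 x 3 chunk (PRODUCT x LOCATION): cell (1, 2) sits at offset 1*3 + 2 = 5.
print(cell_offset((1, 2), (2, 3)))  # 5
```

Since the offset is derived from the coordinates alone, only the measure values need to be stored in the cells, which is the source of the space savings described above.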
Parallel processing enabling. Chunk-trees (at various depths) can be exploited naturally for the logical fragmentation of the cube data, in order to enable the par-
allel processing of queries, as well as the construction
and maintenance (i.e., bulk loading and batch updating)
of the CUBE File. Chunk-trees are essentially disjoint
fragments of the data that carry all the hierarchy seman-
tics of the data. This makes the CUBE File data struc-
ture an excellent candidate for advanced fragmentation
methods ([38]) used in parallel data warehouse DBMSs.
Efficient maintenance operations. Any data structure aimed at accommodating data warehouse data must be efficient in typical data warehousing maintenance operations. The logical data partitioning provided by the
chunk-tree representation enables fast bulk loading (roll-in of data), data purging (roll-out of data, i.e., bulk deletions from the cube), as well as the incremental
updating of the cube (i.e., when the input data with
the latest changes arrive from the data sources, only
local reorganizations are required and not a complete
CUBE File rebuild). The key idea is that new data to
be inserted in the CUBE file correspond to a set of
chunk-trees that need to be hung at various depths
of the structure. The insertion of each such chunk-tree
requires only a local reorganization without affecting
the rest of the structure. In addition, as noted previously,
these chunk-tree insertions can be performed in parallel
as long as they correspond to disjoint subspaces of the
cube. Finally, it is very easy to roll out the oldest month's data and roll in the current month's (we call this data
purging), as these data correspond to separate chunk-
trees and only a minimum reorganization is required.
The interested reader can find more information regard-
ing other aspects of the CUBE File not covered in this
paper (e.g., the updating and maintenance operations),
as well as information for a prototype implementation
of a CUBE File based DBMS in [16].
4 Evaluating the quality of hierarchical clustering
Any physical organization of data must determine how
the latter are distributed in disk pages. A CUBE File
physically organizes its data by allocating the chunks of
the chunk-tree into a set of buckets, which is the I/O
transfer unit counterpart in our case. First, let us try to
understand what are the objectives of such an alloca-
tion. As already stated the primary goal is to achieve
a high degree of hierarchical clustering. This statement,
although clear, could still be interpreted in several different ways. What are the elements that can guarantee that a specific hierarchical clustering scheme is "good"? We attempt to list some next:
1. Efficient evaluation of queries containing restrictions on the dimension hierarchies
2. Minimization of the size of the data
3. High space utilization
The most important goal of hierarchical clustering is
to improve response time of queries containing hier-
archical restrictions. Therefore, the first element calls
for a minimal I/O cost (i.e., bucket reads) for the eval-
uation of such restrictions. The second element deals
with the ability to minimize the size of the data to be stored (e.g., by adapting to the extensive sparseness of the cube data space, i.e., not storing null data, as well as by storing only the minimum necessary data: in an
offset-based access structure we do not need to store the
dimension values along with the facts). Of course, the
storage overhead must also be minimized in terms of the
number of allocated buckets. Naturally, the best way to
keep this number low is to utilize the available space as
much as possible. Therefore the third element implies
that the allocation must adapt well to the data distri-
bution, e.g., more buckets must be allocated to more
densely populated areas and fewer buckets for more
sparse ones. Also, buckets must be filled almost to capacity (i.e., imposing a high bucket occupancy threshold).
The last two elements together guarantee an overall minimum storage cost.
In the following, we propose a metric for evaluat-
ing the hierarchical clustering quality of an allocation of
chunks into buckets. Then in the next section we use this
metric to formally define the chunk-to-bucket allocation
problem as an optimization problem.
4.1 The hierarchical clustering factor
We advocate that hierarchical clustering is the most
important goal for a file organization for OLAP cubes.
However, the space of possible combinations of dimen-
sion hierarchy values is huge (doubly exponential; see Footnote 1). To this end, we exploit the chunk-tree representation, resulting from the hierarchical chunking of a cube, and deal with the problem of hierarchical cluster-
ing, as a problem of allocating chunks of the chunk-tree
into disk buckets. Thus, we are not searching for a linear
clustering (i.e., for a total ordering of the chunked-cube
cells), but rather we are interested in the packing of
chunks into buckets according to the criteria of good
hierarchical clustering posed above.
The intuitive explanation for the utilization of the
chunk-tree for achieving hierarchical clustering lies in
the fact that the chunk-tree is built based solely on the
hierarchies' structure and content and not on some storage criteria (e.g., each node corresponding to a disk page, etc.); as a result, it embodies all possible combinations of
hierarchical values. For example, the sub-tree hanging
from the root-chunk in Fig. 4b at the leaf level con-
tains all the sales figures corresponding to the continent
Europe (order-code 0) and to the product category Books (order-code 0), and any possible combinations
of the children members of the two. Therefore, each
sub-tree in the chunk-tree corresponds to a hierarchi-
cal family of values and thus reduces the search space
significantly. In the following we will regard the bucket as the storage unit. In this section, we define a metric for evaluating the degree of hierarchical clustering of different storage schemes in a quantitative way.
Clearly, a hierarchical clustering strategy that respects
the quality element of efficient evaluation of queries
with HPP restrictions that we have posed above must
ensure that the sub-trees hanging under a specific chunk can be accessed with a minimal number of bucket reads. Intuitively, one can say that if we
could store whole sub-trees in each bucket (instead of
single chunks), then this would result in a better hier-
archical clustering, as all the restrictions on the specific
sub-tree, as well as on any of its descendant sub-trees,
would be evaluated with a single bucket I/O. For exam-
ple, if we store the sub-tree hanging from the root-chunk
in Fig. 4b into a single bucket, we can answer all queries
containing hierarchical restrictions on the combination
Books and Europe and on any children-values of
these two with just a single disk I/O.
Therefore, each sub-tree in this chunk-tree corre-
sponds to a hierarchical family of values. Moreover,
the smaller the chunking depth of this sub-tree the more
the value combinations it embodies. Intuitively, we can
say that the hierarchical clustering achieved could be
assessed by the degree of storing low-depth whole chunk
sub-trees into each storage unit. Next, we exploit this
intuitive criterion to define the hierarchical clustering
degree of a bucket (HCDB). We begin with a number of
auxiliary definitions:
Definition 3 (Bucket-Region) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. A group of chunk-trees of the same depth having a common parent node, which are stored in the same bucket, comprises a bucket-region.
Definition 4 (Region contribution of a tree stored in a bucket, cr) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. We define the region contribution cr of a tree t of depth d that is stored in a bucket B to be the total number of trees in the bucket-region that t belongs to, divided by the total number of trees of the same depth in the whole chunk-tree CT. This is then multiplied by a bucket-region proximity factor rP, which expresses the proximity of the trees of a bucket-region in the multidimensional space:

    cr = (treeNum(d, B) / treeNum(d, CT)) × rP,

where treeNum(d, B) is the total number of sub-trees of depth d in B, treeNum(d, CT) is the total number of sub-trees of depth d in CT, and rP is the bucket-region proximity (0 < rP ≤ 1).
The region contribution of a tree stored in a bucket
essentially denotes the percentage of trees at a specific
depth that a bucket region covers. Therefore, the greater
this percentage, the greater the hierarchical clustering
achieved by the corresponding bucket, as more com-
binations of the hierarchy members will be clustered in
the same bucket. To keep this contribution high we need
large bucket-regions of low depth trees, because in low
depths the total number of CT sub-trees is small. Notice
also that the region contribution includes a bucket region
proximity factor rP, which expresses the spatial proximity of the trees of a bucket-region in the multidimensional space. The larger this factor becomes, the closer the trees
of a bucket-region are and thus the larger their individ-
ual region contributions are. We will see in more detail
the effects of this factor and its definition (Definition
10) in a following subsection, where we will discuss the
formation of the bucket regions.
Definition 5 (Depth contribution of a tree stored in a bucket, cd) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. We define the depth contribution cd of a tree t of depth d that is stored in a bucket B to be the ratio of d to DMAX:

    cd = d / DMAX.
The depth contribution of a tree stored in a bucket
expresses the proportion between the depth of the tree and the maximum chunking depth. The smaller this ratio becomes (i.e., the lower the depth of the tree), the greater the hierarchical clustering achieved by the corresponding bucket becomes. Intuitively, the depth contribution expresses the percentage of the number of nodes in the path from the root-chunk to the bucket in question; thus, the smaller it is, the smaller the I/O cost to access
this bucket. Alternatively, we could substitute the depth
value in the numerator of the depth contribution with
the number of buckets in the path from the root-chunk
to the bucket in question (with the latter included).
Next, we provide the definition of the hierarchical clustering degree of a bucket:
Definition 6 (Hierarchical clustering degree of a bucket, HCDB) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. For a bucket B containing T whole sub-trees {t1, t2, ..., tT} of chunking depths {d1, d2, ..., dT}, respectively, where none of these sub-trees is a sub-tree of another, we define the hierarchical clustering degree HCDB of bucket B to be the ratio of the sum of the region contributions of the trees ti (1 ≤ i ≤ T) included in B to the sum of their depth contributions, multiplied by the bucket occupancy OB, where 0 < OB ≤ 1:

    HCDB = (Σ_{i=1..T} cr_i / Σ_{i=1..T} cd_i) × OB = (T × cr) / (T × cd) × OB = (cr / cd) × OB,    (1)

where cr_i is the region contribution of tree ti and cd_i is the depth contribution of tree ti (1 ≤ i ≤ T). (Note that, as bucket-regions have been defined as consisting of equi-depth trees, all trees of a bucket have the same region contribution as well as the same depth contribution.)
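Definitions 4, 5 and 6 can be combined in a small numeric sketch (assuming, as in the definition's note, that all T trees of the bucket form one bucket-region of the same depth; function names are our own):

```python
def region_contribution(trees_in_bucket, trees_in_ct, r_p):
    """Definition 4: share of the equi-depth CT trees covered by the
    bucket-region, scaled by the region proximity factor r_p."""
    return (trees_in_bucket / trees_in_ct) * r_p

def depth_contribution(d, d_max):
    """Definition 5: ratio of the tree's chunking depth to the maximum."""
    return d / d_max

def hcd(trees_in_bucket, trees_in_ct, d, d_max, r_p, occupancy):
    """Equation (1) for a bucket whose trees form a single bucket-region."""
    c_r = region_contribution(trees_in_bucket, trees_in_ct, r_p)
    c_d = depth_contribution(d, d_max)
    return (c_r / c_d) * occupancy

# Four of the eight depth-2 trees packed into one full bucket, r_P = 1, DMAX = 3:
print(hcd(4, 8, d=2, d_max=3, r_p=1.0, occupancy=1.0))  # 0.75
```

As expected, larger regions of lower-depth trees and a higher occupancy all push the degree up, and it never exceeds DMAX.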
In this definition, we have assumed that the chunking
depth di of a chunk-tree ti is equal to the chunking depth
of the root-chunk of this tree. Of course we assume that
a normalization of the depth values has taken place, so
that the depth of the chunk-tree CT is 1 instead of
0, in order to avoid having zero depths in the denomina-
tor of (1). Furthermore, data chunks are considered as
chunk-trees with a depth equal to the maximum chunk-
ing depth of the cube. Note that directory chunks stored
in a bucket, not as part of a sub-tree but isolated, have
a zero region contribution; therefore, buckets that con-
tain only such directory chunks have a zero degree of
hierarchical clustering.
From (1), we can see that the more sub-trees, instead
of single chunks, are included in a bucket the greater the
hierarchical clustering degree of the bucket becomes,
because more HPP restrictions can be evaluated solely
with this bucket. Also, the higher these trees are (i.e., the smaller their chunking depth is), the greater the hierarchical clustering degree of the bucket becomes, as more combinations of hierarchical attributes are covered by this
bucket. Moreover, the more trees of the same depth and
hanging under the same parent node, we have stored in
a bucket, the greater the hierarchical clustering degree
of the bucket, as we include more combinations of the
same path in the hierarchy.
All in all, the HCDB metric favors the following stor-
age choices for a bucket:
- Whole trees instead of single chunks or other data partitions
- Smaller depth trees instead of greater depth ones
- Tree regions instead of single trees
- Regions with a few low-depth trees instead of ones with more trees of greater depth
- Regions with trees of the same depth that are close in the multidimensional space instead of dispersed trees
- Buckets with a high occupancy
We prove the following theorem regarding the maxi-
mum value of the hierarchical clustering degree of a bucket:
Theorem 1 (Theorem of maximum hierarchical cluster-
ing degree of a bucket) Assume a hierarchically chun-
ked cube represented by a chunk-tree CT of a maximum
chunking depth DMAX, which has been allocated to a set
of buckets. Then, for any such bucket B, it holds that

    HCDB ≤ DMAX.
Proof From the definition of the region contribution of a tree appearing in Definition 4, we can easily deduce that

    cr_i ≤ 1.    (I)

This means that the following holds:

    Σ_{i=1..T} cr_i ≤ T.    (II)

In (II), T stands for the number of trees stored in B. Similarly, from the definition of the depth contribution of a tree appearing in Definition 5, we can easily deduce that

    cd_i ≥ 1 / DMAX,    (III)

as the smallest possible depth value is 1. This means that the following holds:

    Σ_{i=1..T} cd_i ≥ T / DMAX.    (IV)

From (II), (IV) and (1), and assuming that B is filled to its capacity (i.e., OB equals 1), the theorem is proved.
It is easy to see that the maximum degree of hierar-
chical clustering of a bucket B is achieved only in the
ideal case, where we store the chunk-tree CT that rep-
resents the whole cube in B and CT fits exactly in B (see Footnote 2).
In this case, all our primary goals for a good hierarchical
clustering, posed at the beginning of this section, such as
the efficient evaluation of HPP queries, the low storage
cost and the high space utilization are achieved. This is
because all possible HPP restrictions can be evaluated
with a single bucket read (one I/O operation) and the
achieved space utilization is maximal (full bucket) with
a minimal storage cost (just one bucket). Moreover, it
is now clear that the hierarchical clustering degree of a
bucket signifies to what extent the chunk-tree represent-
ing the cube has been packed into the specific bucket
and this is measured in terms of the chunking depth of
the tree.
By trying to create buckets with a high HCDB we can guarantee that our allocation respects these ele-
ments of good hierarchical clustering. Furthermore, it
is now straightforward to define a metric for evaluating
the overall hierarchical clustering achieved by a chunk-
to-bucket allocation strategy:
Definition 7 (Hierarchical clustering factor of a physical organization for a cube, fHC) For a physical organization that stores the data of a cube into a set of NB buckets, we define the hierarchical clustering factor fHC, i.e., the percentage of hierarchical clustering achieved by this storage organization, as the sum of the hierarchical clustering degrees of the individual buckets divided by the total number of buckets times the maximum chunking depth, and we write:

    fHC = (Σ_{B=1..NB} HCDB) / (NB × DMAX).    (2)
2 Indeed, a bucket with HCDB = DMAX would mean that the depth contribution of each tree in this bucket should be equal to 1/DMAX (according to inequality (III)); however, this is only possible for the whole chunk-tree CT, as this alone has a depth equal to 1.
Note that NB is the total number of buckets used in order to store the cube; however, only the buckets that contain at least one whole chunk-tree have a non-zero HCDB value. Therefore, allocations that spend more buckets
for storing sub-trees have a higher hierarchical cluster-
ing factor than others, which favor, e.g., single directory
chunk allocations. From (2), it is clear that even if we
have two different allocations of a cube that result in the same total HCDB of individual buckets, the one that
occupies the smaller number of buckets will have the
greater fHC, rewarding this way the allocations that use
the available space more conservatively.
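Equation (2) is straightforward to compute given the per-bucket degrees (a sketch; note that zero-HCDB buckets still count in NB and thus penalize the factor):

```python
def f_hc(hcd_values, d_max):
    """Hierarchical clustering factor of an allocation, Equation (2):
    the sum of the per-bucket HCD values over NB x DMAX."""
    n_b = len(hcd_values)
    return sum(hcd_values) / (n_b * d_max)

# Three buckets with HCDB of 2.0, 1.5 and 0 (a directory-only bucket), DMAX = 5:
print(f_hc([2.0, 1.5, 0.0], d_max=5))  # 0.2333...
```

Dropping the zero-degree bucket from the allocation (if its chunks could be packed elsewhere) would raise the factor to 3.5 / (2 × 5) = 0.35, illustrating why conservative bucket usage is rewarded.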
Another way of viewing the fHC is as the average
HCDB for all the buckets divided by the maximum
chunking depth. It is now clear that it expresses the
percentage of the extent by which the chunk-tree rep-
resenting the whole cube has been packed into the
set of the NB buckets, and thus 0 ≤ fHC ≤ 1. It follows directly from Theorem 1 that this factor is maximized (i.e., equals 1) if and only if we store the whole cube (i.e., the chunk-tree CT) into a single bucket, which cor-
responds to a perfect hierarchical clustering for a cube.
In the next section we exploit the hierarchical clus-
tering factor fHC, in order to define the chunk-to-bucket
allocation problem as an optimization problem. Further-
more, we exploit the hierarchical clustering degree of a
bucket HCDB in a greedy strategy that we propose for
solving this problem, as an evaluation criterion, in order
to decide how close we are to an optimal solution.
5 Building the CUBE File
In this section we formally define the chunk-to-bucket
allocation problem as an optimization problem. We
prove that it is NP-Hard and provide a heuristic algo-
rithm as a solution. In the course of solving this problem
several interesting sub-problems arise. We tackle each
one in a separate subsection.
5.1 The HPP chunk-to-bucket allocation problem
The chunk-to-bucket allocation problem is defined as
follows:
Definition 8 (The HPP chunk-to-bucket allocation
problem) For a cube C, represented by a chunk-tree CT
with a maximum chunking depth of DMAX, find an allo-
cation of the chunks of CT into a set of fixed-size buckets
that corresponds to a maximum hierarchical clustering
factor fHC.
We assume the following: the storage cost of any chunk-tree t equals cost(t), the number of sub-trees per depth d in CT equals treeNum(d), and the size of a bucket equals SB. Finally, we are given a bucket of special size SROOT, consisting of λ consecutive simple buckets, called the root-bucket BR, where SROOT = λ × SB, with λ ≥ 1. Essentially, BR represents the set of buckets that contain no whole sub-trees and thus have a zero HCDB.
The solution S for this problem consists of a set of K buckets, S = {B1, B2, ..., BK}, such that each bucket contains at least one sub-tree of CT, and a root-bucket BR that contains all the rest of CT (the part with no whole sub-trees). S must result in a maximum value for the fHC factor for the given bucket size SB. As the HCDB values of the buckets of the root-bucket BR equal zero (recall that they contain no whole sub-trees), it follows from (2) that fHC can be expressed as

    fHC = (Σ_{B=1..K} HCDB) / ((K + λ) × DMAX).    (3)
From (3), it is clear that the more buckets we allocate for the root-bucket (i.e., the greater λ becomes), the lower the degree of hierarchical clustering achieved by
our allocation. Alternatively, if we consider caching the
whole root-bucket in main memory (see the following
discussion), then we could assume that does not affect
hierarchical clustering (as it does not introduce more
bucket I/Os from the root-chunk to a simple bucket)
and could be zeroed.
In Fig. 5, we depict four different chunk-to-bucket
allocations for the same chunk-tree. The maximum
chunking depth is DMAX = 5, although in the figure we
can see the nodes up to depth D = 3 (i.e., the triangles correspond to sub-trees of three levels). The numbers
inside each node represent the storage cost for the cor-
responding sub-tree, e.g., the whole chunk-tree has a
cost of 65 units. Assume a bucket size of SB = 30 units.
Below each figure we depict the calculated fHC and
beside we note the percentage with respect to the best
fHC that can be achieved for this bucket size (i.e., fHC / fHCmax × 100%). The chunk-to-bucket allocation that yields the maximum fHC can be identified easily by
exhaustive search in this simple case. Observe, how the
fHC deteriorates gradually, as we move from Fig. 5a to d.
In Fig. 5a we have failed to create any bucket-regions
at depth D = 2. Thus each bucket stores a single sub-tree of depth 3. Note also that the occupancy of most
buckets is quite low. In Fig. 5b the hierarchical clustering
improves as some bucket-regions have been formed
buckets B1, B3 and B4 store two sub-trees of depth 3. In
Fig. 5c the total number of buckets decreases by one as a
large bucket-region of four sub-trees has been formed in
bucket B3. Finally, in Fig. 5d we have managed to store
in bucket B3 a higher level (i.e., lower depth) sub-tree
(i.e., a sub-tree of depth 2). This increases even more the
hierarchical clustering achieved, compared to the previ-
ous case (Fig. 5c), because the root node is included in
the same bucket as the four sub-trees. In addition, the
bucket occupancy of B3 is increased.
It is clear now from this simple example that the hierarchical clustering factor fHC rewards allocations that manage to store lower-depth sub-trees in buckets, that store regions of sub-trees instead of single sub-trees, and that create highly occupied buckets. The individual calculations of this example can be seen in Fig. 6.
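To make the computation concrete, the following sketch (the function name is ours; the per-bucket HCD_B inputs are the values listed in Fig. 6) reproduces the fHC values of the four allocations of Fig. 5 via (3), assuming λ = 1 and DMAX = 5:

```python
# Sketch of Eq. (3): fHC = sum(HCD_B) / ((K + lambda) * D_MAX).
def f_hc(hcd_values, num_root_buckets, d_max):
    k = len(hcd_values)  # K: number of buckets holding whole sub-trees
    return sum(hcd_values) / ((k + num_root_buckets) * d_max)

D_MAX = 5
# Per-bucket HCD_B values as listed in Fig. 6 for each allocation of Fig. 5.
allocations = {
    "(a)": [0.08, 0.16, 0.04, 0.04, 0.04, 0.02, 0.02],  # 7 buckets
    "(b)": [0.48, 0.04, 0.16, 0.08],                    # 4 buckets
    "(c)": [0.48, 0.04, 0.48],                          # 3 buckets
    "(d)": [0.48, 0.04, 0.92],                          # 3 buckets
}
for panel, hcds in allocations.items():
    print(panel, round(f_hc(hcds, num_root_buckets=1, d_max=D_MAX), 2))
# (a) 0.01  (b) 0.03  (c) 0.05  (d) 0.07
```

Note how both a smaller K and larger per-bucket HCD_B values push fHC up, exactly as the example describes.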
All in all, it is obvious that we now have the optimization problem of finding a chunk-to-bucket allocation such that fHC is maximized. This problem is NP-Hard, which results from the following theorem.
Theorem 2 (Complexity of the HPP chunk-to-bucket
allocation problem) The HPP chunk-to-bucket alloca-
tion problem is NP-Hard.
Proof Assume a typical bin packing problem [42] where we are given N items with weights wi, i = 1, ..., N, respectively, and a bin size B such that wi ≤ B for all i = 1, ..., N. The problem is to find a packing of the items in the fewest possible bins. Assume that we create N chunks of depth d and dimensionality D, so that chunk c1 has a storage cost of w1, chunk c2 has a storage cost of w2, and so on. Also assume that N − 1 of these chunks are under the same parent chunk (e.g., the Nth chunk). This way we have created a two-level chunk-tree where the root lies at depth d = 0 and the leaves at depth d = 1. Also assume that a bin and a bucket are equivalent terms. Now we have reduced in polynomial time the bin packing problem to an HPP chunk-to-bucket allocation problem, which is to find an allocation of the chunks into buckets of size B such that the achieved hierarchical clustering factor fHC is maximized.
As all the chunk-trees (i.e., single chunks in our case) are of the same depth, the depth contribution c_i^d (1 ≤ i ≤ N), defined in (1), is the same for all chunk-trees. Therefore, in order to maximize the degree of hierarchical clustering HCDB for each individual bucket (and thus also increase the hierarchical clustering factor fHC), we have to maximize the region contribution c_i^r (1 ≤ i ≤ N) of each chunk-tree (1). This occurs when we pack into each bucket as many trees as possible on the one hand and, due to the region proximity factor rP, when the trees of each region are as close as possible in the multidimensional space, on the other. Finally, according to
the fHC definition, the number of buckets used must
be the smallest possible. If we assume that the chunk
dimensions have no inherent ordering then there is no
notion of spatial proximity within the trees of the same
region and the region proximity factor equals 1 for all
[Figure 5: four chunk-to-bucket allocations of the same chunk-tree (DMAX = 5, SB = 30; total tree cost 65 units): (a) fHC = 0.01 (14%), (b) fHC = 0.03 (42%), (c) fHC = 0.05 (69%), (d) fHC = 0.07 (100%)]
Fig. 5 The hierarchical clustering factor fHC of the same chunk-tree for four different chunk-to-bucket allocations
possible regions (see also related discussion in the following subsection).
In this case the only factor that can maximize the HCDB of each bucket, and consequently the overall fHC, is to minimize empty space within each bucket [i.e., maximize bucket occupancy in (1)] and to use as few buckets as possible by packing the largest number of trees in each bucket. These are exactly the goals of the original bin
packing problem and thus a solution to the bin packing
problem is also a solution to the HPP chunk-to-bucket
allocation problem and vice versa.
As the bin packing can be reduced in polynomial time
to the HPP chunk-to-bucket, then any problem in NP
can be reduced in polynomial time to the HPP chunk-
to-bucket. Furthermore, in the general case (where we
have chunk-trees of varying depths and the dimensions have
inherent orderings) it is not easy to find a polynomial
time verifier for a solution to the HPP chunk-to-bucket
problem, as the maximum fHC that can be achieved is
not known (as it is in the bin packing problem where
the minimum number of bins can be computed with a
simple division of the total weight of items by the size of
a bin). Thus the problem is NP-Hard.
We proceed next by providing a greedy algorithm
based on heuristics for solving the HPP chunk-to-bucket
allocation problem in linear time. The algorithm utilizes
the hierarchical clustering degree of a bucket as a cri-
terion in order to evaluate at each step how close we
are to an optimal solution. In particular, it traverses the
chunk-tree in a top-down depth-first manner, adopting
the greedy approach that if at each step we create a
(Bucket size SB = 30, maximum chunking depth DMAX = 5; cr = region contribution, cd = depth contribution, OB = bucket occupancy; K = total no. of buckets; λ = no. of root-bucket buckets.)

Allocation   Bucket   cr     cd    OB     HCDB    K   λ   fHC    fHC/fHCmax (%)
Fig. 5(d)    B1       0.29   0.6   1.00   0.48    3   1   0.07   100%
             B2       0.14   0.6   0.17   0.04
             B3       0.50   0.4   0.73   0.92
Fig. 5(c)    B1       0.29   0.6   1.00   0.48    3   1   0.05   69%
             B2       0.14   0.6   0.17   0.04
             B3       0.57   0.6   0.50   0.48
Fig. 5(b)    B1       0.29   0.6   1.00   0.48    4   1   0.03   42%
             B2       0.14   0.6   0.17   0.04
             B3       0.29   0.6   0.33   0.16
             B4       0.29   0.6   0.17   0.08
Fig. 5(a)    B1       0.14   0.6   0.33   0.08    7   1   0.01   14%
             B2       0.14   0.6   0.67   0.16
             B3       0.14   0.6   0.17   0.04
             B4       0.14   0.6   0.17   0.04
             B5       0.14   0.6   0.17   0.04
             B6       0.14   0.6   0.10   0.02
             B7       0.14   0.6   0.07   0.02
Fig. 6 The individual calculations of the example in Fig. 5
bucket with a maximum value of HCDB, then overall the acquired hierarchical clustering factor will be maximal. Intuitively, by trying to pack the available buckets with low-depth trees (i.e., the tallest trees) first (thus the top-to-bottom traversal), we can ensure that we have not missed the chance to create the best HCDB buckets possible.
In Fig. 7, we present the GreedyPutChunksIntoBuckets algorithm, which receives as input the root R of a chunk-tree CT and the fixed size SB of a bucket. The output of this algorithm is a set of buckets, each containing at least one whole chunk-tree, a directory chunk entry pointing at the root chunk R, and the root-bucket BR.
In each step the algorithm tries greedily to make an allocation decision that will maximize the HCDB of the current bucket. For example, in lines 2–7 of Fig. 7, the algorithm tries to store the whole input tree in a single bucket, thus aiming at a maximum degree of hierarchical clustering for the corresponding bucket. If this fails, then it allocates the root R to the root-bucket and tries to achieve a maximum HCDB by allocating the sub-trees at the next depth, i.e., the children of R (lines 9–26).
This essentially is achieved by including all direct children sub-trees with size less than (or equal to) the size of a bucket (SB) in a list of candidate trees for inclusion into bucket regions (buckRegion) (lines 14–16). Then the routine formBucketRegions is called upon this list and tries to include the corresponding trees in a minimum set of buckets, by forming bucket regions to be stored in each bucket, so that each one achieves the maximum possible HCDB (lines 19–22).
We will come back to this routine and discuss how it
solves this problem in the next sub-section. Finally, for
the children sub-trees of root R with size cost greater
than the size of a bucket, we recursively try to solve
the corresponding HPP chunk-to-bucket allocation sub-
problem for each one of them (lines 23–26). This of
course corresponds to a depth-first traversal of the input
chunk-tree.
Very important is also the fact that no space is allocated for empty sub-trees (lines 11–13); only a special entry is inserted in the parent node to denote a
NULL sub-tree. Therefore, the allocation performed
by the greedy algorithm adapts perfectly to the data
0:  GreedyPutChunksIntoBuckets(R, SB)
    // Input:  root R of a chunk-tree CT, bucket size SB
    // Output: updated R, list of allocated buckets BuckList,
    //         root-bucket BR, directory entry dirEnt pointing at R
1:  {  List buckRegion                   // bucket-region candidates list
2:     IF (cost(CT) <= SB) {
3:        Allocate new bucket Bn
4:        Store CT in Bn
5:        dirEnt = addressOf(R)
6:        RETURN
7:     }
8:     // R will be stored in the root-bucket BR
9:     IF (R is a directory chunk) {
10:       FOR EACH child sub-tree CTc of R {
11:          IF (CTc is empty) {
12:             Mark corresponding entry of R with empty tag
13:          }
14:          IF (cost(CTc) <= SB) {
15:             // insert CTc into list of bucket-region candidates
16:             buckRegion.push(CTc)
17:          }
18:       }
19:       IF (buckRegion != empty) {
20:          // formulate the bucket-regions
21:          formBucketRegions(buckRegion, BuckList, R)
22:       }
23:       WHILE (there is a child CTc : cost(CTc) > SB) {
24:          GreedyPutChunksIntoBuckets(root(CTc), SB)
25:          Update corresponding R entry for CTc
26:       }
27:       Store R in the root-bucket BR
28:       dirEnt = addressOf(R)
29:    }
30:    ELSE {  // R is a data chunk and cost(R) > SB
31:       Artificially chunk R, create 2-level chunk-tree CTA
32:       GreedyPutChunksIntoBuckets(root(CTA), SB)
33:       // storage of R will be taken care of by the previous call
34:       dirEnt = addressOf(root(CTA))
35:    }
36:    RETURN
37: }
Fig. 7 A greedy algorithm for the HPP chunk-to-bucket allocation problem
distribution, coping effectively with the native sparseness of the cube.
The recursive calls might lead us eventually all the way down to a data chunk (at depth DMAX). Indeed, if GreedyPutChunksIntoBuckets is called upon a root R which is a data chunk, then this means that we have come upon a data chunk with size greater than the bucket size. This is called a large data chunk, and a more detailed discussion on how to handle such chunks will follow in a later sub-section. For now it is enough to say that in order to resolve the problem of storing such a chunk we extend the chunking further (with a technique called artificial chunking) in order to transform the large data chunk into a 2-level chunk-tree. Then, we solve the HPP chunk-to-bucket sub-problem for this sub-tree (lines 30–35). The termination of the algorithm is guaranteed by
the fact that each recursive call deals with a sub-problem
of a smaller in size chunk-tree than the parent problem.
Thus, the size of the input chunk-tree is continuously
reduced.
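A simplified, self-contained sketch of this top-down traversal is given below. Trees are plain (node_cost, children) tuples, and the bucket-region step is approximated by a first-fit-decreasing pass over the small children, whereas the actual formBucketRegions routine also weighs space proximity; all names and the example tree are illustrative, not the paper's code.

```python
# Simplified sketch of the top-down greedy chunk-to-bucket allocation.
# A tree is (node_cost, [child_trees]); sb is the bucket size.
def subtree_cost(tree):
    cost, children = tree
    return cost + sum(subtree_cost(c) for c in children)

def greedy_allocate(tree, sb):
    buckets, root_chunks = [], []        # root_chunks models the root-bucket
    def alloc(node):
        if subtree_cost(node) <= sb:     # whole tree fits: one bucket
            buckets.append([node])
            return
        cost, children = node
        root_chunks.append(cost)         # root chunk goes to the root-bucket
        small = [c for c in children if subtree_cost(c) <= sb]
        regions = []                     # first-fit-decreasing approximation
        for t in sorted(small, key=subtree_cost, reverse=True):
            for r in regions:
                if sum(map(subtree_cost, r)) + subtree_cost(t) <= sb:
                    r.append(t)
                    break
            else:
                regions.append([t])
        buckets.extend(regions)
        for c in children:               # recurse on oversized sub-trees
            if subtree_cost(c) > sb:
                alloc(c)
    alloc(tree)
    return buckets, root_chunks

# Illustrative tree of total cost 64, bucket size 30.
tree = (2, [(1, [(14, []), (13, [])]),
            (1, [(20, []), (8, []), (5, [])])])
buckets, root_chunks = greedy_allocate(tree, 30)
print(len(buckets), root_chunks)  # 3 buckets; root-bucket holds chunks of cost [2, 1]
```

Each node is visited once, so the sketch, like the paper's algorithm, is linear in the number of chunks.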
[Figure 8: a chunk-tree with DMAX = 5; the label of each node gives the storage cost of the corresponding sub-tree (whole tree: 65 units)]

Fig. 8 A chunk-tree to be allocated to buckets by the greedy algorithm
Assuming an input file consisting of the cube's data points along with their corresponding chunk-ids (or equivalently the corresponding h-surrogate key per dimension), we need a single pass over this file to create
[Figure 9: the allocation produced by the greedy algorithm for the chunk-tree of Fig. 8 with SB = 30 and DMAX = 5; three buckets B1, B2, B3]

Fig. 9 The chunk-to-bucket allocation for SB = 30
the chunk-tree representation of the cube. Then the above greedy algorithm requires only linear time in the number of input chunks (i.e., the chunks of the chunk-tree) to perform the allocation of chunks to buckets, as each node is visited exactly once and in the worst case all nodes are visited.
Assume the chunk-tree of DMAX = 5 of Fig. 8. The
numbers inside each node represent the storage cost for
the corresponding sub-tree, e.g., the whole chunk-tree
has a cost of 65 units. For a bucket size SB = 30 units the
greedy algorithm yields a hierarchical clustering factor
fHC = 0.072. The corresponding allocation is depicted in
Fig. 9.
The solution comprises three buckets B1, B2 and B3, depicted as rectangles in the figure. The bucket with the highest clustering degree (HCDB) is B3, because it includes the lowest-depth tree. The chunks not included in a rectangle will be stored in the root-bucket. In this case, the root-bucket consists of only a single bucket (i.e., λ = 1 and K = 3, see (3)), as this suffices for storing the corresponding two chunks.
5.2 Bucket-region formation
We have seen that in each step of the greedy algorithm for solving the HPP chunk-to-bucket allocation problem (corresponding to an input chunk-tree with a root node at a specific chunking depth), we try to store all the sibling trees hanging from this root in a set of buckets, forming this way groups of trees, stored one group per bucket, that we call bucket regions. The formation of bucket regions is essentially a special case of the HPP chunk-to-bucket allocation problem and can be described as follows:
Definition 9 (The bucket region formation problem) We are given a set of N chunk-trees T1, T2, ..., TN, of the same chunking depth d. Each tree Ti (1 ≤ i ≤ N) has a size cost(Ti) ≤ SB, where SB is the bucket size. The problem is to store these trees into a set of buckets, so that the hierarchical clustering factor fHC of this allocation is maximized.
As all the trees are of the same depth, the depth contribution c_i^d (1 ≤ i ≤ N), defined in (1), is the same for all trees. Therefore, in order to maximize the degree of hierarchical clustering HCDB for each individual bucket (and thus also increase the hierarchical clustering factor fHC), we have to maximize the region contribution c_i^r (1 ≤ i ≤ N) of each tree (1). This occurs when we create bucket regions with as many trees as possible on the one hand and, due to the region proximity factor rP, when the trees of each region are as close as possible in the multidimensional space, on the other. Finally, according to the fHC definition, the number of buckets used must be the smallest possible.
Summarizing, in the bucket region formation problem we seek a set of buckets to store the input trees, in
order to fulfill the following three criteria:
1. The bucket regions (i.e., each bucket) contain as
many trees as possible.
2. The total number of buckets is minimum.
3. The trees of a region are as close in the multidimen-
sional space as possible.
One could observe that if we focused only on the first two criteria, then the bucket region formation problem would be transformed into a typical bin-packing problem, which is a well-known NP-complete problem [42]. So intuitively the bucket region formation problem can be viewed as a bin-packing problem where items packed in the same bin must be neighbors in the multidimensional space.
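A minimal way to honor this neighborhood constraint, assuming the sibling trees come in a one-dimensional surrogate order (a simplification of the multidimensional case, with an illustrative function name), is to pack them in that order with a next-fit pass, so each region contains only consecutive trees:

```python
# Sketch: next-fit packing over trees given in dimension order, so every
# bucket-region holds only consecutive (i.e., neighboring) trees.
def form_regions(tree_costs, sb):
    regions, current, current_cost = [], [], 0
    for cost in tree_costs:               # costs in multidimensional order
        if current and current_cost + cost > sb:
            regions.append(current)       # close the region, start a new one
            current, current_cost = [], 0
        current.append(cost)
        current_cost += cost
    if current:
        regions.append(current)
    return regions

print(form_regions([10, 20, 5, 5, 5, 2], sb=30))
# [[10, 20], [5, 5, 5, 2]]
```

Next-fit preserves adjacency at the cost of occupancy; a plain first-fit would pack tighter but could place non-neighboring trees in the same region.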
The space proximity of the trees of a region is meaningful only when we have dimension domains with inherent orderings. A typical example is the TIME dimension.
For example, we might have trees corresponding to the
months of the same year (which guarantees hierarchical proximity) but we would also like the consecutive
months to be in the same region (space proximity). This
is because these dimensions are the best candidates for
expressing range predicates (e.g., months from FEB99 to
AUG99). Otherwise, when there is not such an inherent
ordering,