
The VLDB Journal (2008) 17:621–655
DOI 10.1007/s00778-006-0022-1

REGULAR PAPER

Hierarchical clustering for OLAP: the CUBE File approach

Nikos Karayannidis · Timos Sellis

Received: 6 September 2005 / Accepted: 13 April 2006 / Published online: 7 September 2006
© Springer-Verlag 2006

Abstract This paper deals with the problem of physical clustering of multidimensional data that are organized in hierarchies on disk in a hierarchy-preserving manner. This is called hierarchical clustering. A typical case, where hierarchical clustering is necessary for reducing I/Os during query evaluation, is the most detailed data of an OLAP cube. The presence of hierarchies in the multidimensional space results in an enormous search space for this problem. We propose a representation of the data space that results in a chunk-tree representation of the cube. The model is adaptive to the cube's extensive sparseness and provides efficient access to subsets of data based on hierarchy value combinations. Based on this representation of the search space, we formulate the problem as a chunk-to-bucket allocation problem, which is a packing problem, as opposed to the linear ordering approach followed in the literature. We propose a metric to evaluate the quality of the hierarchical clustering achieved (i.e., to evaluate the solutions to the problem) and formulate the problem as an optimization problem. We prove its NP-Hardness and provide an effective solution based on a linear-time greedy algorithm. The solution of this problem leads to the construction of the CUBE File data structure. We analyze in depth all steps of the construction and provide solutions

    Communicated by P-L. Lions.

N. Karayannidis (B) · T. Sellis
Institute of Communication and Computer Systems and School of Electrical and Computer Engineering, National Technical University of Athens, Zographou 15773, Athens, Greece
e-mail: [email protected]

T. Sellis
e-mail: [email protected]

for interesting sub-problems arising, such as the formation of bucket-regions, the storage of large data chunks and the caching of the upper nodes (root directory) in main memory.

Finally, we provide an extensive experimental evaluation of the CUBE File's adaptability to the data space sparseness as well as to an increasing number of data points. The main result is that the CUBE File is highly adaptive to even the most sparse data spaces and, for realistic cases of data point cardinalities, provides hierarchical clustering of high quality and significant space savings.

Keywords Hierarchical clustering · OLAP · CUBE File · Data cube · Physical data clustering

    1 Introduction

Efficient processing of ad hoc OLAP queries is a very difficult task considering, on the one hand, the native complexity of typical OLAP queries, which potentially combine huge amounts of data, and on the other, the fact that no a priori knowledge for queries exists and thus no pre-computation of results or other query-specific tuning can be exploited. The only way to evaluate these queries is to access directly the most detailed data in an efficient way. It is exactly this need to access detailed data based on hierarchy criteria that calls for the hierarchical clustering of data. This paper discusses the physical clustering of OLAP cube data points on disk in a hierarchy-preserving manner, where hierarchies are defined along dimensions (hierarchical clustering).


The problem addressed is set out as follows: we are given a large fact table (FT) containing only grain-level (most detailed) data. We assume that this is part of the star schema in a dimensional data warehouse. Therefore, data points (i.e., tuples in the FT) are organized by a set of N dimensions. We further assume that each dimension is organized in a hierarchy. Typically the data distribution is extremely skewed. In particular, the OLAP cube is extremely sparse and data tend to appear in arbitrary clusters along some dimensions. These clusters correspond to specific combinations of the hierarchy values for which there exist actual data (e.g., sales for a specific product category in a specific geographic region for a specific period of time). The problem is, on the one hand, to store the fact table data in a hierarchy-preserving manner so as to reduce I/Os during the evaluation of ad hoc queries containing restrictions and/or groupings on the dimension hierarchies and, on the other, to enable navigation in the multilevel-multidimensional data space by providing direct access (i.e., indexing) to subsets of data via hierarchical restrictions. The latter implies that index nodes must also be hierarchically clustered if we are aiming at a reduced I/O cost.

Some of the most interesting proposals [20, 21, 36] in the literature for cube data structures deal with the computation and storage of the data cube operator [9]. These methods omit a significant aspect of OLAP, which is that usually dimensions are not flat but are organized in hierarchies of different aggregation levels (e.g., store, city, area, country is such a hierarchy for a Location dimension). The most popular approach for organizing the most detailed data of a cube is the so-called star schema. In this case the cube data are stored in a relational table, called the fact table. Furthermore, various indexing schemes have been developed [3, 15, 25, 26] in order to speed up the evaluation of the join of the central (and usually very large) fact table with the surrounding dimension tables (also known as a star-join). However, even when elaborate indexes are used, due to the arbitrary ordering of the fact table tuples, there might be as many I/Os as there are tuples resulting from the fact table.

We propose the CUBE File data structure as an effective solution to the hierarchical clustering problem set above. The CUBE File multidimensional data structure [18] clusters data into buckets (i.e., disk pages) with respect to the dimension hierarchies, aiming at the hierarchical clustering of the data. Buckets may include both intermediate (index) nodes (directory chunks), as well as leaf (data) nodes (data chunks). The primary goal of a CUBE File is to cluster in the same bucket a family of data (i.e., data corresponding to all hierarchy value combinations for all dimensions) so as to reduce the bucket accesses during query evaluation.

Experimental results in [18] have shown that the CUBE File outperforms the UB-tree/MHC [22], which is another effective method for hierarchically clustering the cube, resulting in 7–9 times fewer I/Os on average for all workloads tested. This simply means that the CUBE File achieves a higher degree of hierarchical clustering of the data. More interestingly, in [15] it was shown that the UB-tree/MHC technique outperformed the traditional bitmap index based star-join by a factor of 20–40, which simply proves that hierarchical clustering is the most determinant factor for a file organization for OLAP cube data, in order to reduce I/O cost.

To tackle this problem we first model the cube data space as a hierarchy of chunks. This model, called the chunk-tree representation of a cube, copes effectively with the vast data sparseness by truncating empty areas. Moreover, it provides a multiple resolution view of the data space where one can zoom-in or zoom-out to specific areas navigating along the dimension hierarchies. The CUBE File is built by allocating the nodes of the chunk-tree into buckets in a hierarchy-preserving manner. This way we depart from the common approach for solving the hierarchical clustering problem, which is to find a total ordering of the data points (linear clustering), and cope with it as a packing problem, namely a chunk-to-bucket packing problem.

In order to solve the chunk-to-bucket packing problem, we need to be able to evaluate the hierarchical clustering achieved (i.e., evaluate the solutions to this problem). Thus, inspired by the chunk-tree representation of the cube, we define a hierarchical clustering quality metric, called the hierarchical clustering factor. We use this metric to evaluate the quality of the chunk-to-bucket allocation. Moreover, we exploit it in order to formulate the CUBE File construction problem as an optimization problem, which we call the chunk-to-bucket allocation problem. We formally define this problem and prove that it is NP-Hard. Then, we propose a heuristic algorithm as a solution that requires a single pass over the input fact table and linear time in the number of chunks.

In the course of solving this problem several interesting sub-problems arise. We define the sub-problem of chunk-region formation, which deals with the clustering of chunk-trees hanging from the same parent node in order to increase further the overall hierarchical clustering. We propose two algorithms as a solution, one of which is driven by workload patterns. Next, we deal with the sub-problem of storing large data chunks (i.e., chunks that do not fit in a single bucket), as well as with the sub-problem of storing the so-called root directory of the CUBE File (i.e., the upper nodes of the data structure).

Finally, we study the CUBE File's effective adaptation to several cube data spaces by presenting a set of experimental measurements that we have conducted. All in all, the contributions of this paper are outlined as follows:

– We provide an analytic solution to the problem of hierarchical clustering of an OLAP cube. The solution leads to the construction of the CUBE File data structure.

– We model the multilevel-multidimensional data space of the cube as a chunk-tree. This representation of the data space adapts perfectly to the extensive data sparseness and provides a multi-resolution view of the data with respect to the hierarchies. Moreover, if viewed as an index, it provides direct access to cube data via hierarchical restrictions, which results in significant speedups of typical ad hoc OLAP queries.

– We transform the hierarchical clustering problem from a linear clustering problem into a chunk-to-bucket allocation (i.e., packing) problem, which we formally define and prove that it is NP-Hard.

– We introduce a hierarchical clustering quality metric for evaluating the hierarchical clustering achieved (i.e., evaluating the solution to the problem in question). We provide an efficient solution to this problem as well as to all sub-problems that stem from it, such as the storage of large data chunks or the formation of bucket-regions.

– We provide an experimental evaluation which leads to the following basic results:

  o The CUBE File adapts perfectly to even the most extremely sparse data spaces, yielding significant space savings. Furthermore, the hierarchical clustering achieved by the CUBE File is almost unaffected by the extensive cube sparseness.

  o The CUBE File is scalable for any realistic number of input data points. In addition, the hierarchical clustering achieved remains of high quality when the number of input data points increases.

  o The root directory can be cached in main memory, providing a single I/O cost for the evaluation of point queries.

The rest of this paper is organized as follows. Section 2 discusses related work and positions the CUBE File in the space of cube storage structures. Section 3 proposes the chunk-tree representation of the cube as an effective representation of the search space. Section 4 introduces a quality metric for the evaluation of hierarchical clustering. Section 5 formally defines the problem of hierarchical clustering, proves its NP-Hardness and then delves into the nuts and bolts of building the CUBE File. Section 6 presents our extensive experimental evaluation and Sect. 7 recapitulates and emphasizes the main conclusions drawn.

    2 Related work

2.1 The linear clustering problem for multidimensional data

The linear clustering problem for multidimensional data is defined as the problem of finding a linear ordering of records indexed on multiple attributes, to be stored in consecutive disk blocks, such that the I/O cost for the evaluation of queries is minimized. The clustering of multidimensional data has been studied in terms of finding a mapping of the multidimensional space to a one-dimensional space. This approach has been explored mainly in two directions: (a) in order to exploit traditional one-dimensional indexing techniques in a multidimensional index space (a typical example is the UB-tree [2], which exploits a z-ordering of multidimensional data [27], so that these can be stored in a one-dimensional B-tree index [1]), and (b) for ordering buckets containing records that have been indexed on multiple attributes, to minimize the disk access effort. For example, a grid file [23] exploits a multidimensional grid in order to provide a mapping between grid cells and disk blocks. One could find a linear ordering of these cells, and therefore an ordering of the underlying buckets, such that the evaluation of a query entails more sequential bucket reads than random bucket accesses. To this end, space-filling curves (see [33] for a survey) have been used extensively. For example, Jagadish [13] provides a linear clustering method based on the Hilbert curve that outperforms previously proposed mappings. Note, however, that all linear clustering methods are inferior to a simple scan in high-dimensional spaces. This is due to the notorious dimensionality curse [41], which states that clustering in such spaces becomes meaningless due to the lack of useful distance metrics.
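For illustration, the following is a minimal sketch (ours, not code from [2] or [27]) of how a z-ordering is obtained by interleaving coordinate bits; the bit width and the sample points are assumptions of the example:

    def z_value(coords, bits):
        # Interleave the bits of all coordinates, most significant bit first.
        z = 0
        for bit in range(bits - 1, -1, -1):
            for c in coords:
                z = (z << 1) | ((c >> bit) & 1)
        return z

    # (3, 3) -> 15 but (4, 4) -> 48: spatially adjacent points can land far
    # apart in z-order, the dispersion deficiency discussed later in this section.
    for p in [(0, 0), (3, 3), (4, 4), (7, 7)]:
        print(p, z_value(p, bits=3))

Sorting records by their z-values yields the linear order that a one-dimensional structure such as a B-tree can then store.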

In the presence of dimension hierarchies the multidimensional clustering problem becomes combinatorially explosive. Jagadish et al. [14] try to solve the problem of finding an optimal linear clustering of records of a fact table on disk, given a specific workload in the form of a probability distribution over query classes. The authors propose a subclass of clustering methods called lattice paths, which are paths on the lattice defined by the hierarchy level combinations of the dimensions. The HPP chunk-to-bucket allocation problem (in Sect. 3.2 we provide a formal definition of HPP restrictions and queries) is a different problem for the following reasons:

1. It tries to find an optimal way (in terms of reduced I/O cost during query evaluation) to pack the data into buckets, rather than order the data linearly. The problem of finding an optimal linear ordering of the buckets, for a specific workload, so as to reduce random bucket reads, is an orthogonal problem and, therefore, the methods proposed in [14] could be used additionally.

2. Apart from the data, it also deals with the intermediate node entries (i.e., directory chunk entries), which provide clustering at a whole-index level and not only at the index-leaf level. In other words, index data are also clustered along with the real data.

As we know that there is no linear clustering of records that will permit all queries over a multidimensional space to be answered efficiently [14], we strongly advocate that linear clustering of buckets (inter-bucket clustering) must be exploited in conjunction with an efficient allocation of records into buckets (intra-bucket clustering).

Furthermore, in [22], a path-based encoding of dimension data, similar to our encoding scheme, is exploited in order to achieve linear clustering of multidimensional data with hierarchies, through a z-ordering [27]. The authors use the UB-tree [2] as an index on top of the linearly clustered records. This technique has the advantage of transforming typical star-join [25] queries into multidimensional range queries, which are computed more efficiently due to the underlying multidimensional index.

However, this technique suffers from the inherent deficiencies of the z space-filling curve, which is not the best space-filling curve according to [7, 13]. On the other hand, it is very easy to compute, and thus the technique is straightforward to implement even for high dimensionalities. A typical example of such a deficiency is that in the z-curve there is a dispersion of certain data points, which are close in the multidimensional space but not close in the linear order, and the opposite, i.e., distant data points are clustered in the linear space. The latter also results in an inefficient evaluation of multiple disjoint query regions, due to the repetitive retrieval of the same pages for many queries. Finally, the benefits of z-based linear clustering start to disappear quite soon as dimensionality increases, practically even when dimensionality gets over 4–5 dimensions.

    2.2 Grid file based multidimensional access methods

The CUBE File organization was initially inspired by the grid file organization [23], which can be viewed as the multidimensional counterpart of extendible hashing [6]. The grid file superimposes a d-dimensional orthogonal grid on the multidimensional space. Given that the grid is not necessarily regular, the resulting cells may be of different shapes and sizes. A grid directory associates one or more of these cells with data buckets, which are stored in one disk page each. Each cell is associated with one bucket, but a bucket may contain several adjacent cells, therefore bucket regions may be formed.

To ensure that data items are always found with no more than two disk accesses for exact match queries, the grid itself is kept in main memory, represented by d one-dimensional arrays called scales. The grid file is intended for dynamic insert/delete operations, therefore it supports operations for splitting and merging directory cells. A well-known problem of the grid file is that it suffers from a superlinear growth of the directory even for data that are uniformly distributed [31]. One basic reason for this is that splitting is not a local operation and thus can lead to superlinear directory growth. Moreover, depending on the implementation of the grid directory, merging may require a complete directory scan [12].

Hinrichs [12] attempts to overcome the shortcomings of the grid file by introducing a 2-level grid directory. In this scheme, the grid directory is now stored on disk and a scaled-down version of it (called the root directory) is kept in main memory to ensure the two-disk-access principle still holds. Furthermore, he discusses efficient implementations of the split, merge and neighborhood operations. In a similar manner, Whang and Krishnamurthy [43] extend the idea of a 2-level directory to a multilevel directory, introducing the multilevel grid file, achieving a linear directory growth in the number of records. There exist more grid file based organizations. A comprehensive survey of these, and of multidimensional access methods in general, can be found in [8].

An obvious distinction of the CUBE File organization from the above multidimensional access methods is that it has been designed to fulfill completely different requirements, namely those of an OLAP environment and not of a transaction-oriented one. A CUBE File is designed for an initial bulk loading and then a read-only operation mode, in contrast to the dynamic insert/delete/update workload of a grid file. Moreover, a CUBE File aims at speeding up queries on multidimensional data with hierarchies and exploits hierarchical clustering to this end. Furthermore, as the dimension domain in OLAP is known a priori, the directory does not have to grow dynamically. In addition, changes to the directory are rare, as dimension data do not change very often (compared to the rate of change of the cube data), and deletions are seldom, therefore split and merge operations are not needed as much. Nevertheless, it is more important to adapt well to the native sparseness of a cube data space and to efficiently support incremental updating, so as to minimize the updating window and cube query down-time, which are critical factors in business intelligence applications nowadays.

    2.3 Taxonomy of cube primary organizations

The set of reported methods in the literature for primary organizations for the storage of cubes is quite confined. We believe that this is basically due to two reasons: first of all, the generally held view is that a cube is a set of pre-computed aggregated results and thus the main focus has been to devise efficient ways to compute these results [11], as well as to choose which ones to compute for a specific workload (the view selection/maintenance problem [10, 32, 37]). Kotidis and Roussopoulos [19] proposed a storage organization based on packed R-trees for storing these aggregated results. We believe that this is a one-sided view of the problem, as it disregards the fact that very often, especially for ad hoc queries, there will be a need for drilling down to the most detailed data in order to compute a result from scratch. Ad hoc queries represent the essence of OLAP and, in contrast to report queries, are not known a priori and thus cannot really benefit from pre-computation. The only way to process them efficiently is to enable fast retrieval of the base data. This calls for an effective primary storage organization for the most detailed data (grain level) of the cube. This argument is of course based on the fact that a full pre-computation of all possible aggregates is prohibitive due to the consequent size explosion, especially for sparse cubes [24].

The second reason that makes people reluctant to work on new primary organizations for cubes is their adherence to relational systems. Although this seems justified, one could pinpoint that a relational table (e.g., a fact table of a star schema [4]) is a logical entity and thus should be separated from the physical method chosen for implementing it. Therefore, one can use, apart from a paged record file, also a B-tree or even a multidimensional data structure as a primary organization for a fact table. In fact, there are not many commercial RDBMSs ([39] is one that we know of) that exploit a multidimensional data structure as a primary organization for fact tables. All in all, the integration of a new data structure in a full-blown commercial system is a strenuous task with high cost and high risk, and thus usually the proposed solutions are reluctant to depart from the existing technology (see also [30] for a detailed description of the issues in this integration).

Figure 1 positions the CUBE File organization in the space of primary organizations proposed for storing a cube (i.e., only the base data and not aggregates). The columns of this table describe the alternative data structures that have been proposed as a primary organization, while the rows classify the proposed methods according to the achieved data clustering. At the top-left cell lies the conventional star schema [4], where a paged record file is used for storing the fact table. This organization guarantees no particular ordering among the stored data and thus additional secondary indexes are built around it in order to support efficient access to the data.

Padmanabhan et al. [28] assume a typical relation (i.e., a paged record file) as the primary organization of a cube (i.e., fact table). However, unique combinations of dimension values are used in order to form blocks of records, which correspond to consecutive disk pages. These blocks can be considered as chunks. The database administrator must choose only one hierarchy level from each dimension to participate in the clustering scheme. In this sense, the method provides multidimensional clustering and not hierarchical (multidimensional) clustering.

In [35] a chunk-based method for storing large multidimensional arrays is proposed. No hierarchies are assumed on the dimensions and data are clustered according to the most frequent range queries of a particular workload. In [5] the benefits of hierarchical clustering in speeding up queries were observed as a side effect of using a chunk-based file organization over a relation (i.e., a paged file of records) for query caching, with the chunk as the caching unit. Hierarchical clustering was achieved through an appropriate hierarchical encoding of the dimension data.

Markl et al. [22] also impose a hierarchical encoding on the dimension data and assign a path-based surrogate key to each dimension tuple, called the compound surrogate key. They exploit the UB-tree multidimensional index [2] as the primary organization of the cube. Hierarchical clustering is achieved by taking the z-order [27] of the cube data points, interleaving the bits of the corresponding compound surrogates. Deshpande et al. [5], Markl et al. [22] and the CUBE File [18] all exploit hierarchical clustering of the cube data, and the last two use multidimensional structures as the primary organization. This has, among others, the significant benefit of transforming a star-join [25] into a multidimensional range query that is evaluated very efficiently over these data structures.


Fig. 1 The space of proposed primary organizations for cube storage

Clustering achieved \ Primary organization | Relation | MD-Array | UB-tree | Grid-File based
No clustering | Star schema | | |
Multidimensional chunk-based clustering | [28] | [35] | |
Hierarchical clustering, chunk-based | [5] | | | CUBE File [18]
Hierarchical clustering, z-order based | | | [22] |

    3 Modeling the data space as a chunk-tree

Clearly our goal is to define a multidimensional file organization that natively supports hierarchies. There is indeed a plethora of data structures for multidimensional data [8], but to the best of our knowledge, none of these explicitly supports hierarchies. Hierarchies complicate things, basically because, in their presence, the data space explodes.¹ Moreover, as we are primarily aiming at speeding up queries including restrictions on the hierarchies, we need a data structure that can efficiently lead us to the corresponding data subset based on these restrictions. A key observation at this point is that all restrictions on the hierarchies intuitively define a subcube or a cube-slice.

To this end, we exploit the intuitive representation of a cube as a multidimensional array and apply a chunking method in order to create subcubes, i.e., the so-called chunks. Our method of chunking is based on the structure of the dimension hierarchies and thus we call it hierarchical chunking. In the following sections we present a dimension-data encoding scheme that assigns hierarchy-enabled unique identifiers to each data point in a dimension. Then, we present our hierarchical chunking method. Finally, we propose a tree structure for representing the hierarchy of the resultant chunks and thus modeling the cube data space.

    3.1 Dimension encoding and hierarchical chunking

In order to apply hierarchical chunking, we first assign a surrogate key to each dimension hierarchy value. This key uniquely identifies each value within the hierarchy.

¹ Assuming N dimension hierarchies modeled as K-level m-way trees, the number of possible value combinations is K-times exponential in the number of dimensions, i.e., O(m^(K·N)).

Fig. 2 Example of hierarchical surrogate keys assigned to an example hierarchy [figure: the LOCATION hierarchy (Continent/Country/Region/City) with an order-code under each value at each level; e.g., the h-surrogate of the city Rhodes is 0.0.1.2]

More specifically, we order the values in each hierarchy level so that sibling values occupy consecutive positions and perform a mapping to the domain of positive integers. The resulting values are depicted in Fig. 2 for an example of a dimension hierarchy. The simple integers appearing under each value in each level are called order-codes. In order to identify a value in the hierarchy, we form the path of order-codes from the root value to the value in question. This path is called a hierarchical surrogate key, or simply h-surrogate. For example, the h-surrogate for the value Rhodes is 0.0.1.2. H-surrogates convey hierarchical (i.e., semantic) information for each cube data point, which can be greatly exploited for the efficient processing of star-queries [15, 29, 40].
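The following is a small sketch of this encoding under our own assumptions (order-codes assigned left-to-right across each whole level, so that siblings occupy consecutive positions; Python dictionaries preserve insertion order); it reproduces the h-surrogates of Fig. 2 for a fragment of LOCATION:

    def assign_hsurrogates(tree):
        # tree: {value: {child: ...}}; a value's h-surrogate is the dot-separated
        # path of order-codes from the root to the value.
        out, level = {}, [("", tree)]
        while level:
            next_level, code = [], 0
            for prefix, subtree in level:
                for value, children in subtree.items():
                    hsurr = f"{prefix}.{code}" if prefix else str(code)
                    out[value] = hsurr
                    next_level.append((hsurr, children))
                    code += 1
            level = next_level
        return out

    location = {"Europe": {
        "Greece": {"Greece-North": {"Salonica": {}},
                   "Greece-South": {"Athens": {}, "Rhodes": {}}},
        "U.K.": {"U.K.-North": {"Glasgow": {}},
                 "U.K.-South": {"London": {}, "Cardiff": {}}},
    }}
    codes = assign_hsurrogates(location)
    print(codes["Rhodes"])   # 0.0.1.2, as in Fig. 2
    print(codes["Glasgow"])  # 0.1.2.3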

The basic incentive behind hierarchical chunking is to partition the data space by forming a hierarchy of chunks that is based on the dimensions' hierarchies. This has the beneficial effect of pruning all empty areas. Remember that in a cube data space empty areas are typically defined on specific combinations of hierarchy values (e.g., as we did not sell the X product Category in Region Y for T periods of time, an empty region is formed). Moreover, it provides us with a multi-resolution view of the data space where one can zoom-in and zoom-out navigating along the dimension hierarchies.

We model the cube as a large multidimensional array, which consists only of the most detailed data. Initially, we partition the cube into a very few chunks corresponding to the most aggregated levels of the dimensions' hierarchies. Then we recursively partition each chunk as we drill down the hierarchies of all dimensions in parallel. We define a measure in order to distinguish each recursion step, the chunking depth D. We will illustrate hierarchical chunking with an example. The dimensions of our example cube are depicted in Fig. 3 and correspond to a two-dimensional cube hosting sales data for a fictitious company. The two dimensions are namely LOCATION and PRODUCT. In the figure we can see the members for each level of these dimensions (each appearing with its member-code).

In order to apply our method, we need to have hierarchies of equal length. For this reason, we insert pseudo-levels P into the shorter hierarchies until they reach the length of the longest one. This padding is done after the level that is just above the grain level. In our example, the PRODUCT dimension has only three levels and needs one pseudo-level in order to reach the length of the LOCATION dimension. This is depicted next, where we have also noted the order-code range at each level:

LOCATION: [0-2].[0-4].[0-10].[0-18]
PRODUCT: [0-1].[0-2].P.[0-5]

The result of hierarchical chunking on our example cube is depicted in Fig. 4a. Chunking begins at chunking depth D = 0 and proceeds in a top-down fashion. To define a chunk, we define discrete ranges of grain-level (i.e., most-detailed) values on each dimension, denoted in the figure as [a..b], where a and b are grain-level order-codes. Each such range is defined as the set of values with the same parent (value) in the corresponding parent level. These parent levels form the set of pivot levels PVT, which guides the chunking process at each step. Therefore initially, PVT = {LOCATION: Continent, PRODUCT: Category}. For example, if we take value 0 of pivot level Continent of the LOCATION dimension, then the corresponding range at the grain level is Cities [0..5].

The definition of such a range for each dimension defines a chunk. For example, the chunk defined from the 0, 0 values of the pivot levels Continent and Category, respectively, consists of the following grain data: (LOCATION: 0.[0-1].[0-3].[0-5], PRODUCT: 0.[0-1].P.[0-3]). The [] notation denotes a range of members. This chunk appears shaded in Fig. 4a at D = 0. Ultimately, at D = 0 we have a chunk for each possible combination between the members of the pivot levels, that is, a total of 2 × 3 = 6 chunks in this example (the cardinalities of the Category range [0-1] and the Continent range [0-2]). Thus the total number of chunks created at each depth D equals the product of the cardinalities of the pivot levels.

Next we proceed at D = 1, with PVT = {LOCATION: Country, PRODUCT: Type}, and recursively chunk each chunk of depth D = 0. This time we define ranges within the previously defined ranges. For example, within the range corresponding to Continent value 0 that we created before, we define discrete ranges corresponding to each country of this continent (i.e., to each value of the Country level which has parent 0). In Fig. 4a, at D = 1, shaded boxes correspond to all the chunks resulting from the chunking of the chunk mentioned in the previous paragraph.

Similarly, we proceed with the chunking by descending all dimension hierarchies in parallel, and at each depth D we create new chunks within the existing ones. The procedure ends when the next levels to include as pivot levels are the grain levels. Then we do not need to perform any further chunking, because the chunks that would be produced from such a chunking would be the cells of the cube themselves. In this case, we have reached the maximum chunking depth D_MAX. In our example, chunking stops at D = 2 and the maximum depth is D = 3. Notice the shaded chunks in Fig. 4a depicting chunks belonging to the same chunk hierarchy.
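As a concrete illustration of this bookkeeping (our sketch; the level cardinalities are taken from the order-code ranges of the running example), the following pads the shorter hierarchy with pseudo-levels and counts the chunks created at each depth from the pivot-level cardinalities:

    # Level cardinalities from the running example (grain level last):
    LOCATION = [("Continent", 3), ("Country", 5), ("Region", 11), ("City", 19)]
    PRODUCT = [("Category", 2), ("Type", 3), ("Item", 6)]

    def pad(levels, target_len):
        # Insert pseudo-levels ('P', cardinality 1) just above the grain level.
        levels = list(levels)
        while len(levels) < target_len:
            levels.insert(len(levels) - 1, ("P", 1))
        return levels

    depth = max(len(d) for d in (LOCATION, PRODUCT))  # number of levels; D_MAX = depth - 1
    dims = [pad(d, depth) for d in (LOCATION, PRODUCT)]

    for D in range(depth - 1):  # grain levels never become pivot levels
        pivots = [dim[D] for dim in dims]
        count = 1
        for _name, card in pivots:
            count *= card
        print(f"D={D}: pivot levels {[n for n, _ in pivots]} -> {count} chunks")
    # D=0: ['Continent', 'Category'] -> 6 chunks
    # D=1: ['Country', 'Type'] -> 15 chunks
    # D=2: ['Region', 'P'] -> 11 chunks (PRODUCT takes no part at D=2)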

The rationale for inserting the pseudo-levels above the grain level lies in that we wish to apply chunking as soon as possible and for all possible dimensions. As the chunking proceeds in a top-to-bottom fashion, this eager chunking has the advantage of reducing the chunk size very early and also provides faster access to the underlying data, because it increases the fan-out of the intermediate nodes. If at a particular depth one (or more) pivot level is a pseudo-level, then this level does not take part in the chunking. (In our example this occurs at D = 2 for the PRODUCT dimension.) This means that we do not define any new ranges within the previously defined range for the specific dimension(s), but instead we keep the old one with no further chunking. Therefore, as pseudo-levels restrict chunking in the dimensions in which they are applied, we must insert them at the lowest possible level. Consequently, as there is no chunking below the grain level (a data cell cannot be further partitioned), the pseudo-level insertion occurs just above the grain level.

    3.2 The chunk-tree representation

We use the intermediate depth chunks as directory chunks that will guide us to the D_MAX-depth chunks containing the data, which are thus called data chunks. This leads to a chunk-tree representation of the hierarchically chunked cube and hence of the cube data space.

Fig. 3 Dimensions of our example cube along with two hierarchy instantiations [figure: the PRODUCT hierarchy (Category/Type/Item) and the LOCATION hierarchy (Continent/Country/Region/City), listing every member with its h-surrogate, e.g., Rhodes = 0.0.1.2, Kioto = 2.3.8.14]

It is depicted in Fig. 4b for our example cube. In Fig. 4b, we have expanded the chunk-sub-tree corresponding to the family of chunks that has been shaded in Fig. 4a. Pseudo-levels are marked with P and the corresponding directory chunks have reduced dimensionality (i.e., they are one-dimensional in this case). We interleave the h-surrogates of the pivot-level values that define a chunk and form a chunk-id. This is a unique identifier for a chunk within a CUBE File. Moreover, this identifier includes the whole path of a chunk in the chunk hierarchy. In Fig. 4b, we note the corresponding chunk-id above each chunk. The root chunk does not have a chunk-id because it represents the whole cube, and chunk-ids essentially denote sub-cubes. The part of a chunk-id that is contained between consecutive dots and corresponds to a specific depth D is called a D-domain.
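A minimal sketch of chunk-id formation (our illustration; the dimension order inside a D-domain and the '|' and '.' separators follow the chunk-ids shown in Fig. 4b):

    def chunk_id(hsurrogates):
        # hsurrogates: one order-code path per dimension, in the order of
        # Fig. 4b, e.g. ["0.0.0", "0.1.P"]; 'P' stands for a pseudo-level.
        # Each D-domain holds one order-code per dimension.
        paths = [h.split(".") for h in hsurrogates]
        assert len({len(p) for p in paths}) == 1, "pad with pseudo-levels first"
        return ".".join("|".join(codes) for codes in zip(*paths))

    print(chunk_id(["0", "0"]))          # 0|0       (a depth-0 chunk of Fig. 4b)
    print(chunk_id(["0.0", "0.0"]))      # 0|0.0|0
    print(chunk_id(["0.0.0", "0.1.P"]))  # 0|0.0|1.0|P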

The chunk-tree representation can be regarded as a method to model the multilevel-multidimensional data space of an OLAP cube. We discuss next the major benefits of this modeling:

Direct access to cube data through hierarchical restrictions One of the main advantages of the chunk-tree representation of a cube is that it explicitly supports hierarchies. This means that any cube data subset defined through restrictions on the dimension hierarchies can be accessed directly. This is achieved by simply accessing the qualifying cells at each depth and following the intermediate chunk pointers to the appropriate data. Note that the vast majority of OLAP queries contain an equality restriction on a number of hierarchical attributes, and more commonly on hierarchical attributes that form a complete path in the hierarchy. This is reasonable, as the core of analysis is conducted along the hierarchies. We call this kind of restriction a hierarchical prefix path (HPP) restriction and provide the corresponding definition next:

Definition 1 (Hierarchical Prefix Path Restriction) We define a hierarchical prefix path restriction (HPP restriction) on a hierarchy H of a dimension D to be a set of equality restrictions, linked by conjunctions, on H's levels that form a path in H which always includes the topmost (most aggregated) level of H.

For example, if we consider the dimension LOCATION of our example cube and a DATE dimension with a 3-level hierarchy (Year/Month/Day), then the query "show me sales for country A (in continent C) in region B for each month of 1999" contains two whole-path restrictions, one for the dimension LOCATION and one for DATE: (a) LOCATION.continent = C AND LOCATION.country = A AND LOCATION.region = B, and (b) DATE.year = 1999.

Fig. 4 (a) The cube from our running example hierarchically chunked. (b) The whole sub-tree up to the data chunks under chunk 0|0 [figure]

Consequently, we can now define the class of HPP queries:

Definition 2 (Hierarchical Prefix Path Query) We call a query Q on a cube C a hierarchical prefix path query (HPP query) if and only if all the restrictions imposed by Q on the dimensions of C are HPP restrictions, which are linked together by conjunctions.
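As a small illustration (our sketch, assuming hierarchy levels are listed from the most aggregated level downwards), an HPP restriction can be recognized by checking that the restricted levels form a prefix of the hierarchy path:

    def is_hpp_restriction(hierarchy_levels, restricted_levels):
        # The restricted levels must cover exactly the first k levels of the
        # hierarchy, so the topmost level is always included.
        k = len(restricted_levels)
        return k > 0 and set(hierarchy_levels[:k]) == set(restricted_levels)

    LOCATION = ["continent", "country", "region", "city"]
    print(is_hpp_restriction(LOCATION, {"continent", "country", "region"}))  # True
    print(is_hpp_restriction(LOCATION, {"country", "region"}))  # False: top level missing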

Adaptation to the cube's native sparseness The cube data space is extremely sparse [34]. In other words, the ratio of the number of real data points to the product of the dimension grain-level cardinalities is a very small number. Values for this ratio in the range of 10^-12 to 10^-5 are more than typical (especially for cubes with more than three dimensions). It is therefore imperative that a primary organization for the cube adapts well to this sparseness, allocating space conservatively. Ideally, the allocated space must be comparable to the size of the existing data points. The chunk-tree representation adapts perfectly to the cube data space. The reason is that the empty regions of a cube are not arbitrarily formed. On the contrary, specific combinations of dimension hierarchy values form them. For instance, in our running example, if no music products are sold in Greece, then a large empty region is formed. Consequently, the empty regions in the cube data space translate naturally to one or more empty chunk sub-trees in the chunk-tree representation. Therefore, empty sub-trees can be discarded altogether and the space allocation corresponds to real data points only.

Multi-resolution view of the data space The chunk-tree represents the whole cube data space (however, with most of the empty areas pruned). Similarly, each sub-tree represents a subspace. Moreover, at a specific chunking depth we view all the data points organized in hierarchical families (i.e., chunk-trees) according to the combinations of hierarchy values for the corresponding hierarchy levels. By descending to a higher-depth node we view the data of the corresponding subspace organized in hierarchical families of a more detailed level, and so on. This multi-resolution feature will be exploited later in order to achieve a better hierarchical clustering of the data, by promoting the storage of lower-depth chunk-trees in a bucket over that of higher-depth ones.


Storage efficiency A chunk is physically represented by a multidimensional array. This enables offset-based access, rather than search-based access, which speeds up the cell access mechanism considerably. Moreover, it gives us the opportunity to exploit chunk-ids in a very effective way. A chunk-id essentially consists of interleaved coordinate values. Therefore, we can use a chunk-id in order to calculate the appropriate offset of a cell in a chunk, but we do not have to store the chunk-id along with each cell. Indeed, a search-based mechanism (like the one used by conventional B-tree indexes or the UB-tree [2]) requires that the dimension values (or the corresponding h-surrogates), which form the search key, must also be stored within each cell (i.e., tuple) of the cube. In the CUBE File only the measure values of the cube are stored in each cell. Hence notable space savings are achieved. In addition, further compression of chunks can be easily achieved without affecting the offset-based accessing (see [17] for the details).
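As an illustration (our sketch, not the CUBE File code), the following shows the offset arithmetic that replaces searching: the local cell coordinates carried by a chunk-id's last D-domain are turned into a row-major array offset, so a cell needs to store only its measure value:

    def cell_offset(coords, shape):
        # Row-major offset of a cell with local coordinates `coords` inside a
        # chunk whose array has one extent per dimension, given by `shape`.
        offset = 0
        for c, extent in zip(coords, shape):
            offset = offset * extent + c
        return offset

    # A hypothetical 2 x 3 data chunk holding only measure values:
    measures = [10.0, None, 7.5, None, None, 3.2]  # None marks empty cells
    print(measures[cell_offset((0, 2), (2, 3))])   # 7.5
    print(measures[cell_offset((1, 2), (2, 3))])   # 3.2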

Parallel processing enabling Chunk-trees (at various depths) can be exploited naturally for the logical fragmentation of the cube data, in order to enable the parallel processing of queries, as well as the construction and maintenance (i.e., bulk loading and batch updating) of the CUBE File. Chunk-trees are essentially disjoint fragments of the data that carry all the hierarchy semantics of the data. This makes the CUBE File data structure an excellent candidate for the advanced fragmentation methods [38] used in parallel data warehouse DBMSs.

Efficient maintenance operations Any data structure aimed to accommodate data warehouse data must be efficient in typical data warehousing maintenance operations. The logical data partitioning provided by the chunk-tree representation enables fast bulk loading (roll-in of data), data purging (roll-out of data, i.e., bulk deletions from the cube), as well as the incremental updating of the cube (i.e., when the input data with the latest changes arrive from the data sources, only local reorganizations are required and not a complete CUBE File rebuild). The key idea is that new data to be inserted in the CUBE File correspond to a set of chunk-trees that need to be hung at various depths of the structure. The insertion of each such chunk-tree requires only a local reorganization, without affecting the rest of the structure. In addition, as noted previously, these chunk-tree insertions can be performed in parallel as long as they correspond to disjoint subspaces of the cube. Finally, it is very easy to roll out the oldest month's data and roll in the current month's (we call this data purging), as these data correspond to separate chunk-trees and only a minimum reorganization is required. The interested reader can find more information regarding other aspects of the CUBE File not covered in this paper (e.g., the updating and maintenance operations), as well as information on a prototype implementation of a CUBE File based DBMS, in [16].

    4 Evaluating the quality of hierarchical clustering

Any physical organization of data must determine how the latter are distributed in disk pages. A CUBE File physically organizes its data by allocating the chunks of the chunk-tree into a set of buckets, the bucket being the I/O transfer unit counterpart in our case. First, let us try to understand what the objectives of such an allocation are. As already stated, the primary goal is to achieve a high degree of hierarchical clustering. This statement, although clear, could still be interpreted in several different ways. What are the elements that can guarantee that a specific hierarchical clustering scheme is good? We attempt to list some next:

1. Efficient evaluation of queries containing restrictions on the dimension hierarchies
2. Minimization of the size of the data
3. High space utilization

The most important goal of hierarchical clustering is to improve the response time of queries containing hierarchical restrictions. Therefore, the first element calls for a minimal I/O cost (i.e., bucket reads) for the evaluation of such restrictions. The second element deals with the ability to minimize the size of the data to be stored, e.g., by adapting to the extensive sparseness of the cube data space (i.e., not storing null data), as well as by storing only the minimum necessary data (e.g., in an offset-based access structure we do not need to store the dimension values along with the facts). Of course, the storage overhead must also be minimized in terms of the number of allocated buckets. Naturally, the best way to keep this number low is to utilize the available space as much as possible. Therefore the third element implies that the allocation must adapt well to the data distribution, e.g., more buckets must be allocated to more densely populated areas and fewer buckets to more sparse ones. Also, buckets must be filled almost to capacity (i.e., imposing a high bucket occupancy threshold). The last two elements together guarantee an overall minimum storage cost.

In the following, we propose a metric for evaluating the hierarchical clustering quality of an allocation of chunks into buckets. Then, in the next section, we use this metric to formally define the chunk-to-bucket allocation problem as an optimization problem.


    4.1 The hierarchical clustering factor

We advocate that hierarchical clustering is the most important goal for a file organization for OLAP cubes. However, the space of possible combinations of dimension hierarchy values is huge (doubly exponential; see Footnote 1). To this end, we exploit the chunk-tree representation resulting from the hierarchical chunking of a cube, and deal with the problem of hierarchical clustering as a problem of allocating chunks of the chunk-tree into disk buckets. Thus, we are not searching for a linear clustering (i.e., for a total ordering of the chunked-cube cells), but rather we are interested in the packing of chunks into buckets according to the criteria of good hierarchical clustering posed above.

The intuitive explanation for the utilization of the chunk-tree for achieving hierarchical clustering lies in the fact that the chunk-tree is built based solely on the hierarchies' structure and content and not on some storage criteria (e.g., each node corresponding to a disk page, etc.); as a result, it embodies all possible combinations of hierarchical values. For example, the sub-tree hanging from the root-chunk in Fig. 4b contains at the leaf level all the sales figures corresponding to the continent Europe (order-code 0) and to the product category Books (order-code 0), and any possible combinations of the children members of the two. Therefore, each sub-tree in the chunk-tree corresponds to a hierarchical family of values and thus reduces the search space significantly. In the following we will regard the bucket as the storage unit. In this section, we define a metric for evaluating the degree of hierarchical clustering of different storage schemes in a quantitative way.

Clearly, a hierarchical clustering strategy that respects the quality element of efficient evaluation of queries with HPP restrictions that we have posed above must ensure that the access of the sub-trees hanging under a specific chunk is done with a minimal number of bucket reads. Intuitively, one can say that if we could store whole sub-trees in each bucket (instead of single chunks), then this would result in a better hierarchical clustering, as all the restrictions on the specific sub-tree, as well as on any of its descendant sub-trees, would be evaluated with a single bucket I/O. For example, if we store the sub-tree hanging from the root-chunk in Fig. 4b into a single bucket, we can answer all queries containing hierarchical restrictions on the combination "Books" and "Europe", and on any children-values of these two, with just a single disk I/O.

Therefore, each sub-tree in this chunk-tree corresponds to a hierarchical family of values. Moreover, the smaller the chunking depth of this sub-tree, the more value combinations it embodies. Intuitively, we can say that the hierarchical clustering achieved could be assessed by the degree to which low-depth whole chunk sub-trees are stored in each storage unit. Next, we exploit this intuitive criterion to define the hierarchical clustering degree of a bucket (HCD_B). We begin with a number of auxiliary definitions:

Definition 3 (Bucket-Region) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth D_MAX. A group of chunk-trees of the same depth having a common parent node, which are stored in the same bucket, comprises a bucket-region.

Definition 4 (Region contribution of a tree stored in a bucket, c_r) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth D_MAX. We define the region contribution c_r of a tree t of depth d that is stored in a bucket B to be the total number of trees in the bucket-region that this tree belongs to, divided by the total number of trees of the same depth in the whole chunk-tree CT. This is then multiplied by a bucket-region proximity factor r_P, which expresses the proximity of the trees of a bucket-region in the multidimensional space:

c_r = (treeNum(d, B) / treeNum(d, CT)) · r_P,

where treeNum(d, B) is the total number of sub-trees in B of depth d, treeNum(d, CT) is the total number of sub-trees in CT of depth d, and r_P is the bucket-region proximity (0 < r_P ≤ 1).

The region contribution of a tree stored in a bucket essentially denotes the percentage of trees at a specific depth that a bucket-region covers. Therefore, the greater this percentage, the greater the hierarchical clustering achieved by the corresponding bucket, as more combinations of the hierarchy members will be clustered in the same bucket. To keep this contribution high we need large bucket-regions of low-depth trees, because at low depths the total number of CT sub-trees is small. Notice also that the region contribution includes a bucket-region proximity factor r_P, which expresses the spatial proximity of the trees of a bucket-region in the multidimensional space. The larger this factor becomes, the closer the trees of a bucket-region are, and thus the larger their individual region contributions are. We will see the effects of this factor and its definition (Definition 10) in more detail in a following subsection, where we will discuss the formation of the bucket-regions.

Definition 5 (Depth contribution of a tree stored in a bucket, c_d) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. We define the depth contribution c_d of a tree t of depth d that is stored in a bucket B to be the ratio of d to DMAX:

c_d = \frac{d}{D_{MAX}}.

The depth contribution of a tree stored in a bucket expresses the ratio between the depth of the tree and the maximum chunking depth. The smaller this ratio (i.e., the lower the depth of the tree), the greater the hierarchical clustering achieved by the corresponding bucket. Intuitively, the depth contribution expresses the relative length of the path of nodes from the root-chunk to the bucket in question; the smaller it is, the smaller the I/O cost of accessing this bucket. Alternatively, we could substitute for the depth value in the numerator of the depth contribution the number of buckets in the path from the root-chunk to the bucket in question (the latter included).
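To make Definitions 4 and 5 concrete, the following is a minimal sketch in Python; the counts treeNum(d, B) and treeNum(d, CT) are assumed to be computed elsewhere, and all names are illustrative rather than taken from the CUBE File implementation.

def region_contribution(trees_in_bucket_at_d, trees_in_ct_at_d, r_p):
    """Region contribution c_r (Definition 4): the fraction of the CT
    sub-trees of depth d covered by the tree's bucket-region, scaled by
    the region proximity factor r_p (0 < r_p <= 1)."""
    assert 0 < r_p <= 1
    return (trees_in_bucket_at_d / trees_in_ct_at_d) * r_p

def depth_contribution(d, d_max):
    """Depth contribution c_d (Definition 5): the ratio of the tree's
    chunking depth d to the maximum chunking depth D_MAX. Depths are
    assumed normalized so that the root of CT has depth 1."""
    assert 1 <= d <= d_max
    return d / d_max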

Next, we provide the definition of the hierarchical clustering degree of a bucket:

Definition 6 (Hierarchical clustering degree of a bucket, HCDB) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. For a bucket B containing T whole sub-trees {t1, t2, ..., tT} of chunking depths {d1, d2, ..., dT}, respectively, where none of these sub-trees is a sub-tree of another, we define the hierarchical clustering degree HCDB of bucket B to be the ratio of the sum of the region contributions of the trees ti (1 ≤ i ≤ T) included in B to the sum of their depth contributions, multiplied by the bucket occupancy OB, where 0 ≤ OB ≤ 1:

HCD_B = \frac{\sum_{i=1}^{T} c_r^i}{\sum_{i=1}^{T} c_d^i} \cdot O_B = \frac{T \cdot c_r}{T \cdot c_d} \cdot O_B = \frac{c_r}{c_d} \cdot O_B,   (1)

where c_r^i is the region contribution of tree ti and c_d^i is the depth contribution of tree ti (1 ≤ i ≤ T). (Note that, as bucket-regions have been defined to consist of equi-depth trees, all trees of a bucket have the same region contribution as well as the same depth contribution.)

In this definition, we have assumed that the chunking depth di of a chunk-tree ti is equal to the chunking depth of the root-chunk of this tree. We also assume that a normalization of the depth values has taken place, so that the depth of the chunk-tree CT is 1 instead of 0, in order to avoid a zero depth in the denominator of (1). Furthermore, data chunks are considered chunk-trees with a depth equal to the maximum chunking depth of the cube. Note that directory chunks stored in a bucket in isolation, i.e., not as part of a sub-tree, have a zero region contribution; therefore, buckets that contain only such directory chunks have a zero degree of hierarchical clustering.
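Under these conventions, (1) can be sketched as follows (hypothetical Python, with each whole sub-tree stored in the bucket represented by the pair of its two contributions):

def hcd(trees, occupancy):
    """Hierarchical clustering degree HCD_B of a bucket, following (1).
    `trees` is a list of (c_r, c_d) pairs, one per whole sub-tree stored
    in the bucket; `occupancy` is O_B with 0 <= O_B <= 1. A bucket that
    stores no whole sub-tree (e.g., only isolated directory chunks) gets
    a zero degree of hierarchical clustering."""
    if not trees:
        return 0.0
    sum_cr = sum(cr for cr, _ in trees)
    sum_cd = sum(cd for _, cd in trees)
    return (sum_cr / sum_cd) * occupancy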

From (1), we can see that the more sub-trees (instead of single chunks) are included in a bucket, the greater the hierarchical clustering degree of the bucket becomes, because more HPP restrictions can be evaluated solely with this bucket. Also, the higher these trees are (i.e., the smaller their chunking depth is), the greater the hierarchical clustering degree of the bucket becomes, as more combinations of hierarchical attributes are covered by this bucket. Moreover, the more trees of the same depth hanging under the same parent node we have stored in a bucket, the greater the hierarchical clustering degree of the bucket, as we include more combinations of the same path in the hierarchy.

All in all, the HCDB metric favors the following storage choices for a bucket:

- Whole trees instead of single chunks or other data partitions
- Smaller-depth trees instead of greater-depth ones (see the toy calculation below)
- Tree regions instead of single trees
- Regions with a few low-depth trees instead of ones with more trees of greater depth
- Regions with trees of the same depth that are close in the multidimensional space instead of dispersed trees
- Buckets with a high occupancy
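As a toy numerical illustration of the second point (the numbers are made up): two full single-tree buckets whose trees cover the same fraction of their own depth level differ in HCDB only through the tree depth, and the lower-depth tree wins.

# c_r = 0.5 and O_B = 1 in both cases; D_MAX = 5.
hcd = lambda c_r, d, d_max, o_b: (c_r / (d / d_max)) * o_b
print(hcd(0.5, 2, 5, 1.0))  # depth-2 tree -> 1.25
print(hcd(0.5, 4, 5, 1.0))  # depth-4 tree -> 0.625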

We prove the following theorem regarding the maximum value of the hierarchical clustering degree of a bucket:

Theorem 1 (Theorem of maximum hierarchical clustering degree of a bucket) Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX, which has been allocated to a set of buckets. Then, for any such bucket B, it holds that

HCD_B \le D_{MAX}.

Proof From the definition of the region contribution of a tree (Definition 4), we can easily deduce that

c_r^i \le 1.   (I)

This means that the following holds:

\sum_{i=1}^{T} c_r^i \le T.   (II)

In (II), T stands for the number of trees stored in B. Similarly, from the definition of the depth contribution of a tree (Definition 5), we can easily deduce that

c_d^i \ge \frac{1}{D_{MAX}},   (III)

since the smallest possible depth value is 1. This means that the following holds:

\sum_{i=1}^{T} c_d^i \ge \frac{T}{D_{MAX}}.   (IV)

From (II), (IV) and (1), and assuming that B is filled to its capacity (i.e., OB equals 1), the theorem is proved.

It is easy to see that the maximum degree of hierarchical clustering of a bucket B is achieved only in the ideal case where we store the chunk-tree CT that represents the whole cube in B, and CT fits exactly in B.² In this case, all our primary goals for good hierarchical clustering, posed in the beginning of this chapter, i.e., the efficient evaluation of HPP queries, the low storage cost and the high space utilization, are achieved. This is because all possible HPP restrictions can be evaluated with a single bucket read (one I/O operation), and the achieved space utilization is maximal (a full bucket) at a minimal storage cost (just one bucket). Moreover, it is now clear that the hierarchical clustering degree of a bucket signifies the extent to which the chunk-tree representing the cube has been packed into the specific bucket, measured in terms of the chunking depth of the tree.

² Indeed, a bucket with HCD_B = D_MAX would mean that the depth contribution of each tree in this bucket equals 1/D_MAX (according to inequality (III)); however, this is only possible for the whole chunk-tree CT, as only this tree has a depth equal to 1.

By trying to create buckets with a high HCDB, we can guarantee that our allocation respects these elements of good hierarchical clustering. Furthermore, it is now straightforward to define a metric for evaluating the overall hierarchical clustering achieved by a chunk-to-bucket allocation strategy:

Definition 7 (Hierarchical clustering factor of a physical organization for a cube, fHC) For a physical organization that stores the data of a cube in a set of NB buckets, we define the hierarchical clustering factor fHC, i.e., the percentage of hierarchical clustering achieved by this storage organization, as the sum of the hierarchical clustering degrees of the individual buckets divided by the total number of buckets times the maximum chunking depth, and we write:

f_{HC} = \frac{\sum_{B=1}^{N_B} HCD_B}{N_B \cdot D_{MAX}}.   (2)

Note that NB is the total number of buckets used in order to store the cube; however, only the buckets that contain at least one whole chunk-tree have a non-zero HCDB value. Therefore, allocations that spend more buckets on storing sub-trees have a higher hierarchical clustering factor than ones that favor, e.g., single directory-chunk allocations. From (2), it is also clear that even if two different allocations of a cube result in the same total sum of the HCDB values of the individual buckets, the one that occupies the smaller number of buckets will have the greater fHC, rewarding in this way the allocations that use the available space more conservatively.

Another way of viewing fHC is as the average HCDB over all buckets divided by the maximum chunking depth. It is now clear that it expresses the extent to which the chunk-tree representing the whole cube has been packed into the set of NB buckets, and thus 0 ≤ fHC ≤ 1. It follows directly from Theorem 1 that this factor is maximized (i.e., equals 1) if and only if we store the whole cube (i.e., the chunk-tree CT) in a single bucket, which corresponds to perfect hierarchical clustering for a cube.
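A corresponding sketch of (2) (hypothetical Python; hcds holds one HCDB value per bucket of the organization, including the zero values of buckets that store no whole sub-tree):

def f_hc(hcds, d_max):
    """Hierarchical clustering factor of a cube organization, as in (2):
    the sum of the per-bucket HCD_B values divided by N_B * D_MAX,
    where N_B is the total number of buckets used."""
    n_b = len(hcds)
    return sum(hcds) / (n_b * d_max)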

In the next section we exploit the hierarchical clustering factor fHC in order to define the chunk-to-bucket allocation problem as an optimization problem. Furthermore, we exploit the hierarchical clustering degree of a bucket, HCDB, as an evaluation criterion in the greedy strategy that we propose for solving this problem, in order to decide how close we are to an optimal solution.

    5 Building the CUBE File

    In this section we formally define the chunk-to-bucket

    allocation problem as an optimization problem. We

    prove that it is NP-Hard and provide a heuristic algo-

    rithm as a solution. In the course of solving this problem

    several interesting sub-problems arise. We tackle each

    one in a separate subsection.

    5.1 The HPP chunk-to-bucket allocation problem

    The chunk-to-bucket allocation problem is defined as

    follows:

Definition 8 (The HPP chunk-to-bucket allocation problem) For a cube C, represented by a chunk-tree CT with a maximum chunking depth of DMAX, find an allocation of the chunks of CT into a set of fixed-size buckets that corresponds to a maximum hierarchical clustering factor fHC.

We assume the following: the storage cost of any chunk-tree t equals cost(t), the number of sub-trees per depth d in CT equals treeNum(d), and the size of a bucket equals SB. Finally, we are given a bucket of special size SROOT, consisting of b consecutive simple buckets and called the root-bucket BR, where SROOT = b · SB, with b ≥ 1. Essentially, BR represents the set of buckets that contain no whole sub-trees and thus have a zero HCDB.

The solution S to this problem consists of a set of K buckets, S = {B1, B2, ..., BK}, such that each bucket contains at least one sub-tree of CT, and of a root-bucket BR that contains all the rest of CT (the part with no whole sub-trees). S must result in a maximum value of the fHC factor for the given bucket size SB. As the HCDB values of the buckets of the root-bucket BR equal zero (recall that they contain no whole sub-trees), it follows from (2) that fHC can be expressed as

f_{HC} = \frac{\sum_{B=1}^{K} HCD_B}{(K + b) \cdot D_{MAX}}.   (3)

From (3), it is clear that the more buckets we allocate to the root-bucket (i.e., the greater b becomes), the smaller the degree of hierarchical clustering achieved by our allocation. Alternatively, if we consider caching the whole root-bucket in main memory (see the following discussion), then we can assume that b does not affect hierarchical clustering (as it does not introduce additional bucket I/Os on the path from the root-chunk to a simple bucket) and can be zeroed.

In Fig. 5, we depict four different chunk-to-bucket allocations for the same chunk-tree. The maximum chunking depth is DMAX = 5, although in the figure we can only see the nodes up to depth D = 3 (i.e., the triangles correspond to sub-trees of three levels). The numbers inside the nodes represent the storage cost of the corresponding sub-trees; e.g., the whole chunk-tree has a cost of 65 units. Assume a bucket size of SB = 30 units. Below each sub-figure we depict the calculated fHC and next to it the percentage with respect to the best fHC that can be achieved for this bucket size (i.e., fHC/fHCmax × 100%). The chunk-to-bucket allocation that yields the maximum fHC can easily be identified by exhaustive search in this simple case. Observe how the fHC deteriorates gradually as we move from Fig. 5a to d. In Fig. 5a we have failed to create any bucket-regions at depth D = 2; thus each bucket stores a single sub-tree of depth 3. Note also that the occupancy of most buckets is quite low. In Fig. 5b the hierarchical clustering improves, as some bucket-regions have been formed: buckets B1, B3 and B4 each store two sub-trees of depth 3. In Fig. 5c the total number of buckets decreases by one, as a large bucket-region of four sub-trees has been formed in bucket B3. Finally, in Fig. 5d we have managed to store in bucket B3 a higher-level (i.e., lower-depth) sub-tree, namely a sub-tree of depth 2. This increases the achieved hierarchical clustering even further compared to the previous case (Fig. 5c), because the root node is included in the same bucket as the four sub-trees. In addition, the bucket occupancy of B3 is increased.

It is clear from this simple example that the hierarchical clustering factor fHC rewards the allocations that manage to store lower-depth sub-trees in buckets, that store regions of sub-trees instead of single sub-trees, and that create highly occupied buckets. The individual calculations of this example can be seen in Fig. 6.

All in all, we now have the optimization problem of finding a chunk-to-bucket allocation such that fHC is maximized. This problem is NP-Hard, as the following theorem shows.

Theorem 2 (Complexity of the HPP chunk-to-bucket allocation problem) The HPP chunk-to-bucket allocation problem is NP-Hard.

Proof Assume a typical bin packing problem [42], where we are given N items with weights wi, i = 1, ..., N, and a bin size B such that wi ≤ B for all i = 1, ..., N. The problem is to find a packing of the items into the fewest possible bins. Assume that we create N chunks of depth d and dimensionality D, such that chunk c1 has a storage cost of w1, chunk c2 has a storage cost of w2, and so on. Also assume that N − 1 of these chunks lie under the same parent chunk (e.g., the Nth chunk). This way we have created a two-level chunk-tree, where the root lies at depth d = 0 and the leaves at depth d = 1. Also assume that a bin and a bucket are equivalent terms. We have now reduced, in polynomial time, the bin packing problem to an HPP chunk-to-bucket allocation problem, namely to find an allocation of the chunks into buckets of size B such that the achieved hierarchical clustering factor fHC is maximized.

As all the chunk-trees (i.e., single chunks in our case) are of the same depth, the depth contribution c_d^i (1 ≤ i ≤ N), defined in (1), is the same for all chunk-trees. Therefore, in order to maximize the degree of hierarchical clustering HCDB of each individual bucket (and thus also increase the hierarchical clustering factor fHC), we have to maximize the region contribution c_r^i (1 ≤ i ≤ N) of each chunk-tree (1). This occurs when we pack into each bucket as many trees as possible on the one hand and, due to the region proximity factor rP, when the trees of each region are as close as possible in the multidimensional space, on the other. Finally, according to the fHC definition, the number of buckets used must be the smallest possible. If we assume that the chunk dimensions have no inherent ordering, then there is no notion of spatial proximity among the trees of the same region and the region proximity factor equals 1 for all possible regions (see also the related discussion in the following subsection).

[Figure 5: the same chunk-tree (DMAX = 5, SB = 30; the numbers inside the nodes are the storage costs of the corresponding sub-trees, the whole tree costing 65 units) under four different chunk-to-bucket allocations, with panels (a) fHC = 0.01 (14%), (b) fHC = 0.03 (42%), (c) fHC = 0.05 (69%) and (d) fHC = 0.07 (100%).]

Fig. 5 The hierarchical clustering factor fHC of the same chunk-tree for four different chunk-to-bucket allocations

In this case, the only way to maximize the HCDB of each bucket, and consequently the overall fHC, is to minimize the empty space within each bucket [i.e., maximize the bucket occupancy in (1)] and to use as few buckets as possible by packing the largest possible number of trees into each bucket. These are exactly the goals of the original bin packing problem, and thus a solution to the bin packing problem is also a solution to the HPP chunk-to-bucket allocation problem and vice versa.

As bin packing can be reduced in polynomial time to the HPP chunk-to-bucket problem, any problem in NP can be reduced in polynomial time to the HPP chunk-to-bucket problem. Furthermore, in the general case (where we have chunk-trees of varying depths and the dimensions have inherent orderings) it is not easy to find a polynomial-time verifier for a solution to the HPP chunk-to-bucket problem, as the maximum fHC that can be achieved is not known (unlike the bin packing problem, where the minimum number of bins can be computed by a simple division of the total weight of the items by the size of a bin). Thus the problem is NP-Hard.

We proceed by providing a greedy algorithm based on heuristics that solves the HPP chunk-to-bucket allocation problem in linear time. The algorithm utilizes the hierarchical clustering degree of a bucket as a criterion in order to evaluate, at each step, how close we are to an optimal solution. In particular, it traverses the chunk-tree in a top-down, depth-first manner, adopting the greedy approach that if at each step we create a bucket with a maximum value of HCDB, then the overall hierarchical clustering factor will be maximal.

Allocation | Bucket | c_r  | c_d | O_B  | HCD_B | K | fHC  | fHC/fHCmax
Fig. 5a    | B1     | 0.14 | 0.6 | 0.33 | 0.08  |   |      |
           | B2     | 0.14 | 0.6 | 0.67 | 0.16  |   |      |
           | B3     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B4     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B5     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B6     | 0.14 | 0.6 | 0.10 | 0.02  |   |      |
           | B7     | 0.14 | 0.6 | 0.07 | 0.02  | 7 | 0.01 | 14%
Fig. 5b    | B1     | 0.29 | 0.6 | 1.00 | 0.48  |   |      |
           | B2     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B3     | 0.29 | 0.6 | 0.33 | 0.16  |   |      |
           | B4     | 0.29 | 0.6 | 0.17 | 0.08  | 4 | 0.03 | 42%
Fig. 5c    | B1     | 0.29 | 0.6 | 1.00 | 0.48  |   |      |
           | B2     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B3     | 0.57 | 0.6 | 0.50 | 0.48  | 3 | 0.05 | 69%
Fig. 5d    | B1     | 0.29 | 0.6 | 1.00 | 0.48  |   |      |
           | B2     | 0.14 | 0.6 | 0.17 | 0.04  |   |      |
           | B3     | 0.50 | 0.4 | 0.73 | 0.92  | 3 | 0.07 | 100%

(c_r: region contribution; c_d: depth contribution; O_B: bucket occupancy; K: number of buckets besides the root-bucket; SB = 30, DMAX = 5, and one root-bucket, b = 1, per allocation.)

Fig. 6 The individual calculations of the example in Fig. 5
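As a sanity check, the fHC of allocation (d) in Fig. 6 can be recomputed from its three per-bucket HCDB values plus the single zero-HCDB root-bucket (a small calculation, not code from the paper):

hcds = [0.48, 0.04, 0.92, 0.0]      # B1, B2, B3 and the root-bucket
f_hc = sum(hcds) / (len(hcds) * 5)  # eq. (2) with N_B = 4, D_MAX = 5
print(round(f_hc, 2))               # -> 0.07, matching Fig. 6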

Intuitively, by trying to pack the available buckets with low-depth trees (i.e., the tallest trees) first, hence the top-down traversal, we ensure that we do not miss the chance to create the buckets with the best possible HCDB.

In Fig. 7, we present the GreedyPutChunksIntoBuckets algorithm, which receives as input the root R of a chunk-tree CT and the fixed size SB of a bucket. The output of this algorithm is a set of buckets, each containing at least one whole chunk-tree, a directory-chunk entry pointing at the root chunk R, and the root-bucket BR.

At each step the algorithm greedily tries to make the allocation decision that will maximize the HCDB of the current bucket. For example, in lines 2-7 of Fig. 7, the algorithm tries to store the whole input tree in a single bucket, thus aiming at a maximum degree of hierarchical clustering for the corresponding bucket. If this fails, it allocates the root R to the root-bucket and tries to achieve a maximum HCDB by allocating the sub-trees at the next depth, i.e., the children of R (lines 9-26).

This is essentially achieved by including all direct children sub-trees with size less than or equal to the size of a bucket (SB) in a list of candidate trees for inclusion in bucket-regions (buckRegion) (lines 14-16). Then the routine formBucketRegions is called on this list; it tries to include the corresponding trees in a minimum set of buckets by forming the bucket-regions to be stored in each bucket, so that each bucket achieves the maximum possible HCDB (lines 19-22). We will come back to this routine and discuss how it solves this problem in the next sub-section. Finally, for the children sub-trees of the root R with a storage cost greater than the size of a bucket, we recursively solve the corresponding HPP chunk-to-bucket allocation sub-problem for each one of them (lines 23-26). This, of course, corresponds to a depth-first traversal of the input chunk-tree.

Very important is also the fact that no space is allocated for empty sub-trees (lines 11-13); only a special entry is inserted in the parent node to denote a NULL sub-tree. Therefore, the allocation performed by the greedy algorithm adapts perfectly to the data distribution, coping effectively with the native sparseness of the cube.

GreedyPutChunksIntoBuckets(R, SB)
// Input:  root R of a chunk-tree CT, bucket size SB
// Output: updated R, list of allocated buckets BuckList,
//         root-bucket BR, directory entry dirEnt pointing at R
0:  {
1:    List buckRegion   // bucket-region candidates list
2:    IF (cost(CT) <= SB) {
3:      Allocate a new bucket Bn
4:      Store CT in Bn
5:      dirEnt = addressOf(R)
6:      RETURN
7:    }
8:    // R will be stored in the root-bucket BR
9:    IF (R is a directory chunk) {
10:     FOR EACH child sub-tree CTC of R {
11:       IF (CTC is empty) {
12:         Mark the corresponding entry of R with an empty tag
13:       }
14:       IF (cost(CTC) <= SB) {
15:         // insert CTC into the list of bucket-region candidates
16:         buckRegion.push(CTC)
17:       }
18:     }
19:     IF (buckRegion != empty) {
20:       // formulate the bucket-regions
21:       formBucketRegions(buckRegion, BuckList, R)
22:     }
23:     WHILE (there is a child CTC : cost(CTC) > SB) {
24:       GreedyPutChunksIntoBuckets(root(CTC), SB)
25:       Update the corresponding entry of R for CTC
26:     }
27:     Store R in the root-bucket BR
28:     dirEnt = addressOf(R)
29:   }
30:   ELSE {  // R is a data chunk and cost(R) > SB
31:     Artificially chunk R, creating a 2-level chunk-tree CTA
32:     GreedyPutChunksIntoBuckets(root(CTA), SB)
33:     // storage of R is taken care of by the previous call
34:     dirEnt = addressOf(root(CTA))
35:   }
36:   RETURN
37: }

Fig. 7 A greedy algorithm for the HPP chunk-to-bucket allocation problem

The recursive calls might eventually lead us all the way down to a data chunk (at depth DMAX). Indeed, if GreedyPutChunksIntoBuckets is called on a root R that is a data chunk, then we have come upon a data chunk with a size greater than the bucket size. This is called a large data chunk, and a more detailed discussion of how to handle such chunks follows in a later sub-section. For now, it suffices to say that, in order to resolve the problem of storing such a chunk, we extend the chunking further (with a technique called artificial chunking) so as to transform the large data chunk into a 2-level chunk-tree. Then, we solve the HPP chunk-to-bucket sub-problem for this sub-tree (lines 30-35). The termination of the algorithm is guaranteed by the fact that each recursive call deals with a sub-problem of a smaller chunk-tree than the parent problem; thus, the size of the input chunk-tree is continuously reduced.
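The control flow of Fig. 7 can be rendered compactly in Python. The sketch below is illustrative only: the Tree class, the naive first-fit-decreasing packing inside form_bucket_regions and the omission of empty sub-trees and large data chunks are simplifying assumptions, not the actual CUBE File implementation. The deeper structure of the depth-3 sub-trees is collapsed into single nodes.

from dataclasses import dataclass, field

@dataclass
class Tree:
    """A chunk-tree node: the cost of its own chunk plus child sub-trees."""
    chunk_cost: int
    children: list = field(default_factory=list)

    def cost(self):
        return self.chunk_cost + sum(c.cost() for c in self.children)

def form_bucket_regions(trees, sb, buckets):
    """Naive stand-in for formBucketRegions: first-fit-decreasing packing
    of equi-depth sibling sub-trees into fresh buckets (space proximity
    is ignored here; see Sect. 5.2)."""
    regions = []
    for t in sorted(trees, key=lambda x: -x.cost()):
        for r in regions:
            if sum(x.cost() for x in r) + t.cost() <= sb:
                r.append(t)
                break
        else:
            regions.append([t])
    buckets.extend(regions)

def greedy_allocate(tree, sb, buckets, root_bucket):
    """Greedy HPP chunk-to-bucket allocation, mirroring Fig. 7."""
    if tree.cost() <= sb:               # lines 2-7: the whole tree fits
        buckets.append([tree])
        return
    small = [c for c in tree.children if c.cost() <= sb]   # lines 14-16
    if small:                           # lines 19-22: form bucket-regions
        form_bucket_regions(small, sb, buckets)
    for c in tree.children:             # lines 23-26: recurse on the
        if c.cost() > sb:               # over-sized children sub-trees
            greedy_allocate(c, sb, buckets, root_bucket)
    root_bucket.append(tree)            # lines 27-28: R's chunk goes to BR

# Modelled after Fig. 8 (sub-tree costs read off the figure):
fig8 = Tree(3, [Tree(5, [Tree(10), Tree(20), Tree(5)]),
                Tree(7, [Tree(5), Tree(5), Tree(3), Tree(2)])])
buckets, root_bucket = [], []
greedy_allocate(fig8, 30, buckets, root_bucket)
# buckets now hold the sub-trees of costs {22}, {20, 10} and {5}, and
# two directory chunks remain for the root-bucket, as in Fig. 9.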

[Figure 8: a chunk-tree of maximum chunking depth DMAX = 5, shown down to depth D = 2; the numbers inside the nodes are the storage costs of the corresponding sub-trees (the whole tree costs 65 units, its two depth-2 sub-trees 40 and 22 units, and the depth-3 sub-trees 10, 20, 5, 5, 5, 3 and 2 units).]

Fig. 8 A chunk-tree to be allocated to buckets by the greedy algorithm

Assuming an input file consisting of the cube's data points along with their corresponding chunk-ids (or, equivalently, the corresponding h-surrogate keys per dimension), we need a single pass over this file to create the chunk-tree representation of the cube.


[Figure 9: the allocation of the chunk-tree of Fig. 8 produced by the greedy algorithm for SB = 30; the three buckets B1, B2 and B3 are drawn as rectangles around the sub-trees they store.]

Fig. 9 The chunk-to-bucket allocation for SB = 30

Then, the above greedy algorithm requires only time linear in the number of input chunks (i.e., the chunks of the chunk-tree) to perform the allocation of chunks to buckets, as each node is visited exactly once and, in the worst case, all nodes are visited.

Assume the chunk-tree with DMAX = 5 of Fig. 8. The numbers inside the nodes represent the storage cost of the corresponding sub-trees; e.g., the whole chunk-tree has a cost of 65 units. For a bucket size of SB = 30 units, the greedy algorithm yields a hierarchical clustering factor fHC = 0.072. The corresponding allocation is depicted in Fig. 9.

The solution comprises three buckets B1, B2 and B3, depicted as rectangles in the figure. The bucket with the highest clustering degree (HCDB) is B3, because it includes the lowest-depth tree. The chunks not included in a rectangle are stored in the root-bucket. In this case, the root-bucket consists of only a single bucket (i.e., b = 1 and K = 3; see (3)), as this suffices for storing the corresponding two chunks.

    5.2 Bucket-region formation

We have seen that at each step of the greedy algorithm for solving the HPP chunk-to-bucket allocation problem (corresponding to an input chunk-tree with a root node at a specific chunking depth), we try to store all the sibling trees hanging from this root in a set of buckets, forming in this way groups of trees, called bucket-regions, to be stored in each bucket. The formation of bucket-regions is essentially a special case of the HPP chunk-to-bucket allocation problem and can be described as follows:

Definition 9 (The bucket-region formation problem) We are given a set of N chunk-trees T1, T2, ..., TN of the same chunking depth d. Each tree Ti (1 ≤ i ≤ N) has a size cost(Ti) ≤ SB, where SB is the bucket size. The problem is to store these trees in a set of buckets, so that the hierarchical clustering factor fHC of this allocation is maximized.

As all the trees are of the same depth, the depth contribution c_d^i (1 ≤ i ≤ N), defined in (1), is the same for all trees. Therefore, in order to maximize the degree of hierarchical clustering HCDB of each individual bucket (and thus also increase the hierarchical clustering factor fHC), we have to maximize the region contribution c_r^i (1 ≤ i ≤ N) of each tree (1). This occurs when we create bucket-regions with as many trees as possible on the one hand and, due to the region proximity factor rP, when the trees of each region are as close as possible in the multidimensional space, on the other. Finally, according to the fHC definition, the number of buckets used must be the smallest possible.

Summarizing, in the bucket-region formation problem we seek a set of buckets for storing the input trees that fulfills the following three criteria:

1. The bucket-regions (i.e., each bucket) contain as many trees as possible.
2. The total number of buckets is minimum.
3. The trees of a region are as close in the multidimensional space as possible.

One can observe that if we focused only on the first two criteria, the bucket-region formation problem would reduce to a typical bin-packing problem, which is a well-known NP-complete problem [42]. So, intuitively, the bucket-region formation problem can be viewed as a bin-packing problem where items packed in the same bin must be neighbors in the multidimensional space.
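To sketch how the third criterion can be combined with the first two, assume that each sibling tree exposes a coordinate describing its position in the multidimensional space (for instance, its month number on an ordered TIME dimension, as discussed next). Ordering the trees by this coordinate and packing them with a next-fit pass keeps each bucket-region spatially contiguous. This is an illustrative heuristic under that assumption, not the paper's actual formBucketRegions routine:

def form_regions_by_proximity(trees, sb, cost, coord):
    """Pack equi-depth sibling trees into bucket-regions: sort the trees
    by their position in the multidimensional space (criterion 3), then
    fill each bucket greedily with as many consecutive trees as fit
    (criteria 1 and 2)."""
    regions, current, used = [], [], 0
    for t in sorted(trees, key=coord):
        if current and used + cost(t) > sb:  # bucket full: close region
            regions.append(current)
            current, used = [], 0
        current.append(t)
        used += cost(t)
    if current:
        regions.append(current)
    return regions

# Hypothetical example: 12 sibling trees identified by month number,
# each of cost 10, with a bucket size of 30.
months = list(range(1, 13))
print(form_regions_by_proximity(months, 30, cost=lambda t: 10,
                                coord=lambda t: t))
# -> [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]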

The space proximity of the trees of a region is meaningful only when we have dimension domains with inherent orderings. A typical example is the TIME dimension. For instance, we might have trees corresponding to the months of the same year (which guarantees hierarchical proximity), but we would also like consecutive months to be in the same region (space proximity). This is because such dimensions are the best candidates for expressing range predicates (e.g., months from FEB99 to AUG99). Otherwise, when there is no such inherent ordering,