26
Parallelizing the Data Cube PhD Oral Defence Todd Eavis July 23, 2003

Parallelizing the Data Cube

  • Upload
    sasha

  • View
    45

  • Download
    1

Embed Size (px)

DESCRIPTION

PhD Oral Defence Todd Eavis July 23, 2003. Parallelizing the Data Cube. Overview. Motivation for Parallel, Relational OLAP Core Algorithms and Methods Primary Systems Contributions Experimental Evaluation and Results Conclusions and Future Work. Motivation for Parallel, Relational OLAP - PowerPoint PPT Presentation

Citation preview

Page 1: Parallelizing the Data Cube

Parallelizing the Data Cube

PhD Oral DefenceTodd Eavis

July 23, 2003

Page 2: Parallelizing the Data Cube

● Motivation for Parallel, Relational OLAP● Core Algorithms and Methods● Primary Systems Contributions● Experimental Evaluation and Results● Conclusions and Future Work

Overview

Page 3: Parallelizing the Data Cube

– Motivation for Parallel, Relational OLAP– Core Algorithms and Methods– Primary Systems Contributions– Experimental Evaluation and Results– Conclusions and Future Work

Page 4: Parallelizing the Data Cube

Why study OLAP and the Data Cube?● On-line Analytical Processing: the foundation for a

range of essential business applications– sales and marketing analysis, planning and

budgeting● 4 billion dollar industry by 2005● Data Cube: a core OLAP construct, first proposed

in 1996 by Gray et al [GBLP], that supports sophisticated multi-dimensional data analysis

● Relevance to the Research Community? Results of Citeseer queries:

– OLAP: 797 papers– Data Cube: 362 papers

● Our interest: Data Cube Generation and Querying

Page 5: Parallelizing the Data Cube

Scale of OLAP Data Warehouses

● Average size of production data warehouses currently 700 GB [survey.com/Olap Report]

● Expected to reach 4 TB by 2004● 1/3 currently <= 50 GB. In two years, this number

will drop to just 6% ● Biggest data warehouses growing by a factor of 20

[Winter Report]● Biggest expected to exceed 100 TB within 2 years● Our Interest: Exploiting Parallel

Algorithms

Page 6: Parallelizing the Data Cube

Fundamental Design Alternatives● MOLAP (Multi-dimensional OLAP)

– Materialize data cube as a multi-dimensional array

– In theory: implicit indexing. In practice: hybrid schemes for sparse and dense regions

– Best for dense, low-dimensional spaces● ROLAP (Relational OLAP)

– Store data as relational tables– Requires an explicit multi-dimensional

index– Scales well to higher dimensions and higher

cardinalities● Our Interest: Highly Scalable ROLAP model

Page 7: Parallelizing the Data Cube

– Motivation for Parallel, Relational OLAP– Core Algorithms and Methods– Primary Systems Contributions– Experimental Evaluation and Results– Conclusions and Future Work

Page 8: Parallelizing the Data Cube

Computing the Full Cube in Parallel● Small number of previous projects [GC, LHL,

MM, NWY]– Speedup quite limited

● Our approach: Parallel Pipesort [DEHR2, DER3]– Model 2d views as a “task graph” – Create “Scan Pipelines” [AADG] as Minimum

Cost Spanning Tree using O(dn(m + nlogn)) bipartite matching (n = nodes, m = edges)

– Partition task graph into sub-trees with O(p3d + p2d) augmented k-min-max [BSP] and distribute sub-trees to p processors

– Use over-sampling – S * p sub-trees – to improve load balance. High-low pairing in S – 1 rounds provides approximation to NP-Complete problem

– Use performance optimized algorithms for sorting, scanning, and I/O to generation local views

Page 9: Parallelizing the Data Cube

Computing Partial Cubes in Parallel● Important in practice for environments with higher

dimensions and/or specific visualization needs ● Little previous work, only partial solutions

[BR,GC,SAG]● Our approach: Greedy algorithms for “Schedule

Tree” construction [DER2, DER5]– Solution consists of algorithms for generating

efficieint “Essential Trees” (red) and algorithms for adding beneficial non-selected nodes (blue)

– Greedy method: record “ state” information in Plan Objects. Incrementally add nodes with maximum benefit

– Pre-sorting “candidate” views by estimated size can reduce run-time from O(n3) to O(n2)

– O(d*n) heuristic extensions for higher dimensional space. A “confidence factor” β limits risk

Page 10: Parallelizing the Data Cube

A Parallel Date Cube Query Engine● Views must be indexed prior to access● Related work: sequential r-trees for data cube

[RYR] and general purpose parallel r-trees [SL] ● Our Approach: Parallel RCUBE [DER1, DER6]

– Records ordered as per Hilbert Space Filling Curve

– P-processor round-robin record striping– Construct Partial r-tree indexes on each node,

“packing” page blocks in Hilbert order● Parallel Query Engine

– Combines indexing and OLAP post-processing (query transformation, parallel Sample Sort, record permutation, etc.)

– Uses surrogate views to support Partial Cubes– Supports “linear” dimension hierarchies

Page 11: Parallelizing the Data Cube

The Virtual Data Cube

● Motivation: Hide the complexity of Data Cube algorithms and implementation

● Requires no knowledge of:– Format or extent of indexing– Degree of materialization (full or

partial)– Representation of hierarchies– Physical order of view attributes– Degree of parallelism

Page 12: Parallelizing the Data Cube

– Motivation for Parallel, Relational OLAP– Core Algorithms and Methods– Primary Systems Contributions– Experimental Evaluation and Results– Conclusions and Future Work

Page 13: Parallelizing the Data Cube

Systems Overview● Full, robust Data Cube “prototype” [DER4]● Approximately 20,000 lines of code

– C/C++, LEDA, MPI, STL● Template-based graph algorithms● Designed for, and evaluated on, contemporary

parallel machines:– “Shared nothing” Linux cluster (Dalhousie)– “Shared disk” SunFire multi-processor

(HPCVL)● Supporting systems include:

– Flex/Bison based data generator– Batch query generator– View Subset generator

Page 14: Parallelizing the Data Cube

Key Performance Issues● Dynamic selection of “best” sorting algorithm

– Radix sort versus quicksort● Minimization of data movement

– Use of horizontal and vertical indirection● New pipeline aggregation algorithm

– “Lazy” aggregation ● Streamlined I/O

– I/O manager– Independent I/O and computation threads

Page 15: Parallelizing the Data Cube

Costing model● Sophisticated cost model, common to

both full and partial cube [DEHR1]● Based upon view size estimator

– Probabilistic counting technique● Experimentally supported metrics for:

– Dynamic Sorting (linear time versus comparison based)

– In-memory scanning and data movement

– Machine specific “Read” and “Write” I/O

● Dynamically considers impact of computation versus I/O

Page 16: Parallelizing the Data Cube

A Better Search Strategy● Standard r-tree search strategy employs

Depth First Search● Our approach: Linear Breadth First

Search– Map the search algorithm to the

linearly ordered levels of the packed index

– Resolve query with a left-to-right, top-to-bottom walk of the tree

– Disk head never moves backwards– Resolution consists of a sequence of

clustered scans– Degrades gracefully to a sequential

scan of index + sequential scan of data

Page 17: Parallelizing the Data Cube

– Motivation for Parallel, Relational OLAP– Core Algorithms and Methods– Primary Systems Contributions– Experimental Evaluation and Results– Conclusions and Future Work

Page 18: Parallelizing the Data Cube

Experimental Evaluation● Default test environment includes:

– 16 to 24 processors– 2 million records/80 MB– 4 to 14 dimensions– Random query batches– 24-node Linux cluster, 16-node

SunFire MP (disk array)

Full Cube Partial Cube Query Processing

● Parallel Speedup approaching linear for all components

● Efficiency between 80 and 95%

Page 19: Parallelizing the Data Cube

Full Cube Evaluation

● Shared Disk● 80 – 90%

efficiency● Disk array is

bottleneck

● Over sampling factor

● SF = 2 consistently best

● Optimized pipeline processing

● Order of magnitude improvement

Page 20: Parallelizing the Data Cube

Partial Cube Evaluation● Using partial cube

algorithms for full cube

● All within 6% of “best” benchmark

● Recursive algorithm with 0.1%

● Partial cube of 3 dimensions or less

● Reductions over “naïve” method of 65 – 70%

● Tree pruning with confidence factor (on 14 d)

● Can eliminate up to 60% of original tree

● Virtually no reduction in tree quality

Page 21: Parallelizing the Data Cube

Query Evaluation

● Ratio of blocks retrieved to required seeks

● Random query batches

● Up to 140:1 for large, sparse spaces

● Record retrieval imbalance (16 processors)

● Only 0.3% from optimal load balance

● Overhead of using surrogate views (1 to 16 processors)

● Run time on materialized views versus time when those views were unavailable

Page 22: Parallelizing the Data Cube

– Motivation for Parallel, Relational OLAP– Core Algorithms and Methods– Primary Systems Contributions– Experimental Evaluation and Results

– Conclusions and Future Work

Page 23: Parallelizing the Data Cube

● ROLAP a viable alternative to MOLAP in parallel setting

● Partial cubes can be efficiently generated● ROLAP cubes can be efficiently indexed● Virtual cube abstraction can be efficiently

supported

Thesis Conclusions

Page 24: Parallelizing the Data Cube

● First parallel ROLAP system in the Data Cube literature ● A balanced approach to data cube research

– Algorithm design– Systems engineering– Extensive performance analysis

● Evaluated on contemporary parallel machines– Commodity-style shared nothing cluster– Shared disk architectures

● Integration of three “independent” data cube research projects into a single cohesive OLAP framework – the Virtual Cube

Research Highlights

Page 25: Parallelizing the Data Cube

Future Work

● Automated partial cube specification– Extension of “virtual cube”

● Parallel Query optimization– In addition to range queries or linear hierarchies– High volume query environments

● OLAP visualization● New projects are building on the current base

– Generation of Iceberg Cubes– Mining of association rules

Page 26: Parallelizing the Data Cube

Thank You!

DEHR1 F. Dehne, T. Eavis, S. Hambrusch and A. Rau-Chaplin, Parallelizing The Data Cube, Parallel and Distributed Databases: An International Journal, 2001

DER1 F. Dehne, Todd Eavis, A. Rau-Chaplin, Distributed Multi-dimensional ROLAP Indexing for the Data Cube, CCGrid, 2003.

DER2 F. Dehne, T. Eavis and A. Rau-Chaplin, Computing Partial Data Cubes for Parallel Data Warehousing Applications, Euro PVM-MPI, 2001.

DER3 F. Dehne, T. Eavis, and A. Rau-Chaplin, Coarse Grained Parallel On-Line Analytical Processing (OLAP) For Data Mining, ICCS, 2001.

DER4 F. Dehne, T. Eavis, and A. Rau-Chaplin, A Cluster Architecture for Parallel Data Warehousing, CCGrid, 2001.

DEHR2 F. Dehne, S. Hambrusch, T. Eavis, and A. Rau-Chaplin, Parallelizing The Data Cube, ICDT, 2001.

CHER Y. Chen, F.Dehne, Todd Eavis, A. Rau-Chaplin, Parallel ROLAP Data Cube Construction on Shared Nothing Multi-Processors, IPDPS, 2003.

DER5 F Dehne, T.Eavis, and A. Rau-Chaplin, Computing Partial Data Cubes, Submitted to HICCS, 2003.

DER6 F Dehne, T.Eavis, and A. Rau-Chaplin, RCUBE: Parallel Multi-Dimensional ROLAP Indexing, Submitted to Journal of to Data Mining and Knowledge Discovery., 2003

GBLP J. Gray and A. Bosworth and A. Layman and H. Pirahesh", Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals", ICDE, 1996

GC S. Goil and A. Choudhary, A Parallel Scalable Infrastructure for OLAP and Data Mining, IDEAS,1999

LHL H. Lu, X. Huang and Z. Li,, Computing Data Cubes Using Massively Parallel Processors, PCW '97, 1997

MM S. Muto and M. Kitsuregawa, A dynamic Load Balancing Strategy for Parallel Datacube Computation, WDW O,1999

NWY R. Ng, A. Wagner and Y. Yin, Iceberg-cube Computation with PC Clusters, SIGMOD, 2001

AADG S. Agarwal and R. Agrawal and P. Deshpande and A. Gupta and J. Naughton and R. Ramakrishnan and S. Sarawagi, On the Computation of Multidimensional aggregates, VLDB, 1996

BSP R. Becker and S. Schach and Y. Perl, A shifting algorithm for min-max tree partitioning, Journal of the ACM, 1982BR K. Beyer and R. Ramakrishnan, Bottom-up computation of sparse and Iceberg CUBEs, SIGMOD,1999RYR N. Roussopoulos and Y. Kotidis and M. Roussopolis, Cubetree: Organization of the bulk incremental updates on the data cube, SIGMOD, 1997 SL B. Schnitzer and S. Leutenegger, Master-client r-trees: a new parallel architecture, SSDM, 1999

References: Our own Virtual Data Cube Research

References: The Data Cube Literature