Implementing Data Cube Construction Using a Cluster Middleware: Algorithms, Implementation Experience, and Performance

Ge Yang, Ruoming Jin, Gagan Agrawal

Department of Computer and Information Sciences, Ohio State University

Motivation

A lot of effort has gone into developing cluster computing tools targeting scientific applications

There is an emerging class of commercial applications that are well suited for cluster environments:
  Online Analytical Processing (OLAP)
  Data mining

Can we successfully use cluster tools developed for scientific applications on commercial applications?

Overview

Focus on data cube construction, which is an OLAP problem:
  Both compute and data intensive
  Frequently used in data warehouses

Use of Active Data Repository (ADR) developed for scientific data intensive applications

Questions:
  Are new algorithms or variations of existing algorithms required?
  Implementation experience?
  Performance?

Outline

Data cube construction
  Problem definition
  Challenges

Active Data Repository (ADR)

Scalable data cube construction algorithms targeting ADR

Implementation experience

Performance evaluation

Summary

Data Cube Construction

Context: data warehouses
  Frequently store (possibly sparse) multidimensional datasets
  Example: sale information for a chain of stores, where time, item, and location can be the three dimensions
  Frequently asked queries: aggregate along one or more dimensions

Data cube construction:
  Perform all aggregations in advance to facilitate rapid response to all queries
  For the original n-dimensional array, construct C(n, m) arrays of m dimensions for each m, 0 <= m <= n
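The count of arrays on this slide can be checked with a few lines of Python. This is a minimal sketch; the function name `cube_arrays` and the dimension labels are illustrative assumptions, not part of the original work.

```python
from itertools import combinations
from math import comb

def cube_arrays(dims):
    """Return every subset of `dims`, grouped by its number of dimensions m."""
    n = len(dims)
    return {m: ["".join(c) for c in combinations(dims, m)] for m in range(n + 1)}

arrays = cube_arrays("ABC")
counts = {m: len(names) for m, names in arrays.items()}
total = sum(counts.values())  # sum over m of C(n, m), which equals 2**n
```

For n = 3 the groups hold 1, 3, 3, and 1 arrays (m = 0 is the scalar, m = 3 is the original array itself).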

Data Cube Construction Example:

Consider an original three-dimensional array ABC

The data cube comprises:
  3 two-dimensional arrays: AB, BC, AC
  3 one-dimensional arrays: A, B, and C
  A scalar value, all

Some observations:
  Large input size: data warehouses can have a lot of data
  Total amount of output could be quite large
  A lot of computation is involved
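To make the example concrete, the sketch below builds all seven aggregates for a tiny sparse array stored as a Python dictionary. It is a naive baseline that computes every output directly from ABC, which is exactly the redundant work the lattice on the next slide avoids. The data values and the names `aggregate` and `naive_cube` are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def aggregate(cells, keep):
    """Sum `cells` (dict: (a, b, c) -> value) onto the axis positions in `keep`."""
    out = defaultdict(int)
    for coords, value in cells.items():
        out[tuple(coords[i] for i in keep)] += value
    return dict(out)

def naive_cube(cells, names="ABC"):
    """Compute every aggregate directly from the original array."""
    cube = {}
    for m in range(len(names)):
        for keep in combinations(range(len(names)), m):
            label = "".join(names[i] for i in keep) or "all"
            cube[label] = aggregate(cells, keep)
    return cube

# Hypothetical sales data: (a, b, c) index -> amount
sales = {(0, 0, 0): 5, (0, 1, 2): 3, (1, 1, 2): 4, (1, 0, 1): 2}
cube = naive_cube(sales)
```

Here `cube["all"]` holds the grand total under the empty key `()`, and each other entry is keyed by the indices of the dimensions it keeps.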

Lattice for Data Cube Construction

Options for computing different output arrays can be represented by a lattice

If A is the shortest dimension and C is the longest, the arrows represent the minimal spanning tree of the lattice

AB is considered the smallest parent of A and B
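The smallest-parent rule can be sketched as follows: for each array, consider every parent that has exactly one extra dimension and pick the one with the fewest elements. The dimension sizes below are hypothetical, chosen only so that |A| <= |B| <= |C|.

```python
from math import prod

def smallest_parent(child, sizes):
    """Pick the parent (child plus one extra dimension) with the fewest elements."""
    candidates = ["".join(sorted(child + d)) for d in sizes if d not in child]
    return min(candidates, key=lambda p: prod(sizes[d] for d in p))

sizes = {"A": 10, "B": 20, "C": 40}  # hypothetical sizes with |A| <= |B| <= |C|
tree = {child or "all": smallest_parent(child, sizes)
        for child in ["AB", "AC", "BC", "A", "B", "C", ""]}
```

With these sizes, A and B are both derived from AB, C from AC, and the scalar all from A.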

Active Data Repository

Developed at University of Maryland (Chang, Kurc, Sussman, Saltz)

Targeted scientific data intensive applications

Execution model:

Divide output dataset(s) into tiles, allocate one tile at a time

Fetch input dataset one chunk at a time to compute the tile

Decide on a plan or schedule for fetching chunks that contribute to a tile

Operations involved in computing an output element must be associative and commutative
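The execution model can be illustrated with a small sketch. It is not ADR code: the function `process_tile` and the data layout are assumptions made for illustration. It shows why associativity and commutativity matter, since the chunks can then be fetched in any order without changing the result.

```python
import random

def process_tile(tile_keys, chunks, combine=lambda x, y: x + y, init=0):
    """Accumulate every chunk's contributions to the output elements of one tile."""
    tile = {k: init for k in tile_keys}
    for chunk in chunks:                  # one input chunk at a time
        for key, value in chunk:
            if key in tile:               # this input element maps into the tile
                tile[key] = combine(tile[key], value)
    return tile

# Each chunk is a list of (output key, value) pairs; the output has two tiles.
chunks = [[(0, 1), (1, 2)], [(0, 3), (2, 4)], [(1, 5), (3, 6)]]
tiles = [[0, 1], [2, 3]]

in_order = [process_tile(t, chunks) for t in tiles]
shuffled = chunks[:]
random.Random(0).shuffle(shuffled)        # a different fetch schedule
reordered = [process_tile(t, shuffled) for t in tiles]
```

Both schedules produce the same tiles, which is what lets a runtime choose the fetch plan freely.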

Goals In Algorithm Design

Must use smallest parents / minimal spanning tree

Maximal cache and memory reuse: perform all computations associated with an input chunk before it is discarded from memory

Minimize interprocessor communication volume

Minimize the amount of memory that needs to be allocated across the tiles

Fit into ADR’s computation model

Approach

Currently consider data cube construction starting from a three-dimensional array only

Partition and tile along a single dimension only

If the sizes along the dimensions A, B, and C are |A|, |B|, and |C|, assume that

|A| <= |B| <= |C|

(No loss of generality)

Partitioning and Tiling

Always partition along the dimension C
  Minimizes communication volume
  If |A| <= |B| <= |C|, then |A||B| <= |A||C| <= |B||C|

Let the size of the dimension C on each processor be |C’|

Three separate cases for tiling:
  Case I: |A| <= |B| <= |C’|
  Case II: |A| <= |C’| <= |B|
  Case III: |C’| <= |A| <= |B|

Focus on the first and second cases; the third is almost identical to the second case
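A quick worked check of both claims, under hypothetical sizes: partitioning along one dimension means the two-dimensional array that omits that dimension is only partially computed on each processor and has to be combined globally, so its size is the communication volume. The helper names `reduction_volume` and `tiling_case` are illustrative assumptions.

```python
def reduction_volume(sizes, partition_dim):
    """Elements of the 2-D array that omits `partition_dim`; every processor
    holds only a partial copy, so it must be combined globally."""
    volume = 1
    for dim, size in sizes.items():
        if dim != partition_dim:
            volume *= size
    return volume

def tiling_case(a, b, c_local):
    """Classify the tiling case, assuming a <= b and c_local = |C| per processor."""
    if b <= c_local:
        return "I"
    if a <= c_local:
        return "II"
    return "III"

sizes = {"A": 10, "B": 20, "C": 40}   # hypothetical, with |A| <= |B| <= |C|
volumes = {d: reduction_volume(sizes, d) for d in sizes}
best = min(volumes, key=volumes.get)
cases = {p: tiling_case(10, 20, 40 // p) for p in (2, 4, 8)}
```

With these sizes, partitioning along C gives the smallest volume (|A||B| = 200), and the applicable case moves from I to II to III as the processor count grows from 2 to 4 to 8.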

First Case

Tile along the dimension C on each processor

Hold AB in memory through the processing of all tiles

AC and BC are allocated separately for each tile

Algorithm for Case I

Allocate AB
Foreach tile:
    Allocate AC and BC
    Foreach input chunk to be read
        Update AB, AC, and BC
    Compute C from AC
    Write-back AC, BC, and C
    If last tile
        Perform global reduction to obtain AB
        If (proc_id == 0)
            Compute A and B from AB
            Compute all from A
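The Case I steps can be traced with a sequential simulation. This is a sketch under stated assumptions, not the authors' implementation: processors are simulated one after another, the input is a sparse dictionary rather than ADR chunks, the number of processors is assumed to divide |C|, and all names are hypothetical.

```python
from collections import defaultdict

def case_one(cells, c_size, num_procs, tile_len):
    """Simulate Case I: partition and tile along C, keep AB resident."""
    per_proc = c_size // num_procs            # assumes num_procs divides c_size
    local_ab = [defaultdict(int) for _ in range(num_procs)]
    out = {name: defaultdict(int) for name in ("AB", "AC", "BC", "A", "B", "C")}
    for p in range(num_procs):
        start = p * per_proc
        for lo in range(start, start + per_proc, tile_len):
            hi = min(lo + tile_len, start + per_proc)
            ac, bc = defaultdict(int), defaultdict(int)   # allocated per tile
            for (a, b, c), v in cells.items():            # stands in for chunk reads
                if lo <= c < hi:
                    local_ab[p][a, b] += v
                    ac[a, c] += v
                    bc[b, c] += v
            for (a, c), v in ac.items():                  # C from its smallest parent AC
                out["C"][c] += v
            out["AC"].update(ac)                          # write-back
            out["BC"].update(bc)
    for part in local_ab:                                 # global reduction of AB
        for key, v in part.items():
            out["AB"][key] += v
    for (a, b), v in out["AB"].items():                   # processor 0: A and B from AB
        out["A"][a] += v
        out["B"][b] += v
    out["all"] = sum(out["A"].values())                   # all from A
    return out

cells = {(0, 0, 0): 5, (1, 0, 1): 2, (0, 1, 2): 3, (1, 1, 2): 4, (1, 1, 3): 1}
result = case_one(cells, c_size=4, num_procs=2, tile_len=1)
```

Because tiles cover disjoint ranges of C, the per-tile AC and BC arrays never overlap and can be written back as soon as a tile finishes; only AB has to stay in memory until the end.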

Properties of the Algorithm

All arrays are computed from their smallest parents

Maximal cache and memory reuse

Minimal interprocessor communication volume among all single dimensional partitions

Portion of output arrays that need to be kept in the main memory for the entire computation is minimal of all single dimensional tiling possibilities

Second Case

Tile along the dimension B

Hold AC in main memory for the entire computation

Algorithm for Case II

Allocate AC and A
Foreach tile:
    Allocate AB and BC
    Foreach input chunk to be read
        Update AB, AC, and BC
    Perform global reduction to obtain final AB
    If (proc_id == 0)
        Compute B from AB
        Update A using AB
    Write-back AB, BC, and B
    If (last tile)
        Finish AC
        Compute C from AC
        If (proc_id == 0)
            Finish A
            Compute all from A
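The Case II steps can be traced the same way. Again this is only a sequential sketch with hypothetical names: the data is partitioned along C, every simulated processor walks the same tiles along B, and the number of processors is assumed to divide |C|.

```python
from collections import defaultdict

def case_two(cells, b_size, c_size, num_procs, tile_len):
    """Simulate Case II: partition along C, tile along B, keep AC resident."""
    per_proc = c_size // num_procs                # assumes num_procs divides c_size
    local_ac = [defaultdict(int) for _ in range(num_procs)]  # resident across tiles
    out = {name: defaultdict(int) for name in ("AB", "AC", "BC", "A", "B", "C")}
    for lo in range(0, b_size, tile_len):         # all processors share the B tiles
        hi = min(lo + tile_len, b_size)
        ab = defaultdict(int)                     # globally reduced AB for this tile
        for p in range(num_procs):
            c_lo, c_hi = p * per_proc, (p + 1) * per_proc
            local_ab, bc = defaultdict(int), defaultdict(int)
            for (a, b, c), v in cells.items():    # stands in for chunk reads
                if lo <= b < hi and c_lo <= c < c_hi:
                    local_ab[a, b] += v
                    local_ac[p][a, c] += v
                    bc[b, c] += v
            for key, v in local_ab.items():       # global reduction of the AB tile
                ab[key] += v
            out["BC"].update(bc)                  # write-back of the local BC tile
        for (a, b), v in ab.items():              # processor 0: B from AB, update A
            out["B"][b] += v
            out["A"][a] += v
        out["AB"].update(ab)                      # write-back of the AB tile
    for part in local_ac:                         # after the last tile: finish AC, then C
        for (a, c), v in part.items():
            out["AC"][a, c] += v
            out["C"][c] += v
    out["all"] = sum(out["A"].values())           # all from A
    return out

cells = {(0, 0, 0): 5, (1, 0, 1): 2, (0, 1, 2): 3, (1, 1, 2): 4, (1, 1, 3): 1}
result = case_two(cells, b_size=2, c_size=4, num_procs=2, tile_len=1)
```

In contrast to Case I, the global reduction of AB happens once per tile, and A is accumulated across tiles rather than computed in one step at the end.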

Implementation Experience Using ADR

Had to supply:
  Local reduction function - processing for each chunk
  Global reduction function - after local reduction on each tile
  A finalize function - after processing all tiles
  A specification of tiling desired

ADR’s runtime support offered:
  Fetching of input chunk corresponding to each tile
  Scheduling asynchronous operations
  Details of interprocessor communication
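The division of labor can be pictured as a set of callbacks driven by the runtime. The sketch below is purely illustrative: the class, method names, and the `run` driver are hypothetical and do not reproduce ADR's actual interface. For brevity it handles a single tile and only the AB and A arrays.

```python
from collections import defaultdict

class CubeCallbacks:
    """Hypothetical callback set; the names are illustrative, not ADR's API."""

    def __init__(self):
        self.ab = defaultdict(int)            # accumulator for the AB array

    def local_reduction(self, chunk):
        """Called per input chunk: fold its elements into the accumulator."""
        for (a, b, c), v in chunk:
            self.ab[a, b] += v

    def global_reduction(self, others):
        """Called after local reduction: merge accumulators from other processors."""
        for other in others:
            for key, v in other.ab.items():
                self.ab[key] += v

    def finalize(self):
        """Called after all tiles: derive a lower-dimensional array from AB."""
        a_arr = defaultdict(int)
        for (a, b), v in self.ab.items():
            a_arr[a] += v
        return dict(a_arr)

def run(per_proc_chunks):
    """Tiny stand-in for the runtime: drives the callbacks in the required order."""
    procs = [CubeCallbacks() for _ in per_proc_chunks]
    for proc, chunks in zip(procs, per_proc_chunks):
        for chunk in chunks:
            proc.local_reduction(chunk)
    procs[0].global_reduction(procs[1:])
    return procs[0].finalize()

result = run([[[((0, 0, 0), 5)], [((1, 0, 1), 2)]],
              [[((0, 1, 2), 3), ((1, 1, 2), 4)]]])
```

The application code contains no chunk fetching, scheduling, or message passing; in this picture those belong entirely to the driver.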

Experimental Evaluation

Goals:
  Speedups on sparse and dense datasets
  Scaling of performance with respect to dataset sizes
  Scaling of performance with respect to number of tiles
  Evaluating the impact of sparsity

Experimental platform:
  Eight 250 MHz Ultra-II processors
  1 GB of main memory on each
  Myrinet for interconnection

Scaling Input Datasets - Dense Arrays

[Figure: results on 1, 2, 4, and 8 processors for 1 GB, 2 GB, and 4 GB dense datasets; vertical axis from 0 to 1600]

Almost linear speedups up to 8 processors

Performance per element increases linearly with increase in dataset size

Scaling Dataset Sizes: Sparse Dataset

[Figure: results on 1, 2, 4, and 8 processors for 0.5 GB, 1 GB, and 2 GB sparse datasets; vertical axis from 0 to 450]

25% Sparsity level

Slightly lower speedups than dense datasets: higher communication-to-computation ratio

Execution time stays proportional to the amount of computation

Increasing Number of Tiles

[Figure: execution time for 1, 2, 4, and 8 tiles; vertical axis from 0 to 1600]

2 nodes, fixed amount of computation per tile

Execution time stays proportional to the amount of computation

Impact of Sparsity

[Figure: results on 1, 2, 4, and 8 processors at 25%, 10%, 5%, and 1% sparsity levels; vertical axis from 0 to 800]

Same number of non-zero elements in each dataset

Good speedups in all cases

Some reduction in sequential performance as sparsity increases, particularly for the 1% case

Summary

Considered data cube construction on clusters

Used a runtime system developed for scientific data intensive applications

New algorithms to combine tiling and interprocessor communication

Observations:
  Code writing simplified because of the use of the runtime system
  High speedups
  Performance scales well as dataset sizes are increased
