
Dependable Technologies for Critical Systems
Copyright Critical Software S.A. 1998-2003 All Rights Reserved.

Handling big dimensions in distributed data warehouses using the DWS technique

Marco Costa
DEI – CISUC – University of Coimbra
Critical Software S.A.


Agenda

- Introduction
- The DWS technique
  - Description
  - Problems with big dimensions
- The Selective Loading technique
- Experimental Results
- Conclusions


Critical Software Inc.

Company Profile
- International Software Engineering company.
- Founded in 1998, offices in Portugal, US, UK.
- Entrepreneurial and independent SME.
- Staff of 100: software engineers, MScs, PhDs.

Figures
- Turnover of US 6M (2004).
- International market represents +70%.
- Profitable since foundation (EBIT = 17%, 2003).

Quality, R&D
- ISO 9001:2000 TickIT certified (only in Iberia).
- ISO 15504 / CMM level 3.
- R&D focused, patents submitted.

[Photo: Headquarters, Portugal]

[Chart: Turnover in M€ per year, 1999 to 2004, split into Domestic and Exports. FY 2003: EBIT = 17%, EBITDA = 24%.]


Introduction

Companies produce and store more and more data

Data Warehouses have large and continuously growing volumes of data to process

High performance in query execution is crucial to enable interactivity in the OLAP process

Typically, this performance is achieved through very expensive hardware platforms (e.g. high-end servers)


Introduction

- Parallel processing has been explored as one of the solutions to support large DWs
  - Intra-query parallelism
- Distributed DW
  - For geographical reasons
  - For performance
    - Load balancing of data
    - Query execution
    - Reduce communication between nodes


The DWS Technique

- Distribution of a DW through a cluster of “low cost computers”
  - Data partition technique
  - Query re-write and parallel execution technique
  - Approximated query answering
- Shared-nothing architecture (federated)
- Conceived specifically for data warehouses implemented with the star-schema model
- High scalability
  - Near linear speed-up for data aggregation queries


The DWS Technique

Data partitioning / data placement
- All nodes have the same data model
- Dimension tables are replicated
- Fact tables are distributed through all nodes in a uniform way
  - Row by row
  - Random


The DWS Technique

Data partitioning / data placement: row by row example

[Figure: A star schema with one fact table (row_1 ... row_n) and four dimension tables (Dimension 1 to Dimension 4) is placed on three nodes. Every node holds all rows of every dimension table. The fact rows are dealt out in turn: node 1 receives row_1, row_4, row_7, ..., row_n-2; node 2 receives row_2, row_5, row_8, ..., row_n-1; node 3 receives row_3, row_6, row_9, ..., row_n.]
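The placement rule can be illustrated with a minimal Python sketch. It is not the DWS implementation: the function names are hypothetical, and "random" is assumed here to mean shuffle-then-deal, which the slides do not specify.

```python
import random

def partition_row_by_row(fact_rows, n_nodes):
    """Deal fact rows out in turn: the row at index i goes to node i mod n_nodes."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(fact_rows):
        nodes[i % n_nodes].append(row)
    return nodes

def partition_random(fact_rows, n_nodes, seed=0):
    """Assumed variant: shuffle first, then deal, so node sizes differ by at most one."""
    shuffled = list(fact_rows)
    random.Random(seed).shuffle(shuffled)
    return partition_row_by_row(shuffled, n_nodes)

def place_data(fact_rows, dimensions, n_nodes):
    """Each node gets a (shallow) copy of every dimension and one slice of the facts."""
    slices = partition_row_by_row(fact_rows, n_nodes)
    return [{"dimensions": dict(dimensions), "facts": s} for s in slices]

# Hypothetical fact rows, named as in the row-by-row example.
facts = [f"row_{i}" for i in range(1, 10)]
nodes = partition_row_by_row(facts, 3)
print(nodes[0])  # fact rows placed on the first node
```

With nine rows and three nodes, the first node receives row_1, row_4 and row_7, which matches the row-by-row example.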


The DWS Technique

Query re-write
- Partition the queries in steps:
  - Partial Query (independently executed in each node)
  - Merge Query
- Some queries might require more than one step
- Execution tree optimizer: determines the steps that need to be executed independently or can be included in the upper query


The DWS Technique

Query Re-write (example for 2 nodes)
A typical data aggregation query:

    select t.calendar_month_desc "Month",       -- dimensions
           c.cust_city "City",
           p.prod_category "Category",
           avg(s.quantity_sold) "Quantity",     -- facts (aggregated)
           avg(s.amount_sold) "Amount"
    from sales s, customers c, times t, products p
    where s.time_id = t.time_id
      and s.cust_id = c.cust_id
      and s.prod_id = p.prod_id
      and t.calendar_year = 2000
    group by t.calendar_month_desc, c.cust_city, p.prod_category


The DWS Technique

Query Re-write (example for 2 nodes)
Partial Query sent to all nodes:

    create table dws110517101718101 as
    select t.calendar_month_desc calendar_month_desc,
           c.cust_city cust_city,
           p.prod_category prod_category,
           sum(s.quantity_sold) as dws1_sum,       -- collect partial aggregations
           count(s.quantity_sold) as dws1_count,
           sum(s.amount_sold) as dws2_sum,
           count(s.amount_sold) as dws2_count
    from sales s, customers c, times t, products p
    where s.time_id = t.time_id
      and s.cust_id = c.cust_id
      and s.prod_id = p.prod_id
      and t.calendar_year = 2000
    group by t.calendar_month_desc, c.cust_city, p.prod_category


The DWS Technique

Query Re-write (example for 2 nodes)
Merge Query: merge the partial results

Gather partial results:

    create table dws_finalmerge_ as
      (select * from dws110517154329101@node1
       union all
       select * from dws110517154329101@node2)

Merge aggregations and build final results:

    select calendar_month_desc "month",
           cust_city "city",
           prod_category "category",
           sum(dws1_sum) / sum(dws1_count) "quantity",
           sum(dws2_sum) / sum(dws2_count) "amount"
    from dws_finalmerge_
    group by calendar_month_desc, cust_city, prod_category
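The reason avg() is rewritten into sum() and count() can be checked with a small Python sketch that uses in-memory SQLite databases as stand-ins for the nodes. The data, the reduced one-dimension schema and the helper names are hypothetical; only the sum/count rewrite itself comes from the slides.

```python
import sqlite3

# Hypothetical fact rows: (cust_city, quantity_sold).
ROWS = [("Coimbra", 2), ("Coimbra", 4), ("Lisbon", 10), ("Coimbra", 9), ("Lisbon", 20)]

PARTIAL_QUERY = """
    select cust_city,
           sum(quantity_sold) as dws1_sum,
           count(quantity_sold) as dws1_count
    from sales
    group by cust_city
"""

def make_node(rows):
    """Create one in-memory database that plays the role of a cluster node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table sales (cust_city text, quantity_sold real)")
    conn.executemany("insert into sales values (?, ?)", rows)
    return conn

def distributed_avg(rows, n_nodes):
    """Run the partial query on every node, gather the results, then merge them."""
    nodes = [make_node(rows[i::n_nodes]) for i in range(n_nodes)]
    merge = sqlite3.connect(":memory:")
    merge.execute(
        "create table dws_finalmerge (cust_city text, dws1_sum real, dws1_count integer)"
    )
    for node in nodes:
        partial = node.execute(PARTIAL_QUERY).fetchall()
        merge.executemany("insert into dws_finalmerge values (?, ?, ?)", partial)
    merged = merge.execute(
        "select cust_city, sum(dws1_sum) / sum(dws1_count) "
        "from dws_finalmerge group by cust_city"
    ).fetchall()
    return dict(merged)

def centralized_avg(rows):
    """Reference answer: plain avg() on a single node holding all rows."""
    node = make_node(rows)
    return dict(
        node.execute("select cust_city, avg(quantity_sold) from sales group by cust_city")
    )

print(distributed_avg(ROWS, 2))
print(centralized_avg(ROWS))
```

Both calls return the same averages. Averaging the per-node averages instead would give a different answer whenever the nodes hold different numbers of rows for a group, which is why the partial query ships sums and counts rather than averages.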


The DWS Technique

Achievements
- Optimal data load balance
- Optimal work load balance
  - For each query, each node processes the same amount of data as all the others, mostly within its local data
- Low communication between nodes
- High scalability
  - Near linear speed-up
  - Near linear scale-up
  - Tested with the APB-1 benchmark (OLAP Council) and 10 nodes


The DWS Technique

The problem
- Replication of dimension tables is not typically a problem (dimension tables usually represent 5% to 10% of the data)
- Businesses with big dimensions cannot apply DWS
- The businesses that have big dimensions have high potential (e.g. airlines, telecoms, e-business)


The Selective Load Technique

Selectively load the dimension tables
- Typical OLAP queries aggregate facts according to restrictions applied to dimensions
- The join between facts and dimensions only needs the dimension rows that exist in both tables
- Do not replicate the big dimension tables
- Load only the necessary rows to each node
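As an illustration of "load only the necessary rows", the following Python sketch keeps, for each node, only the dimension rows whose key appears in that node's fact partition. The table contents, the `order_id` column and the `selective_load` helper are hypothetical and are not taken from the DWS implementation.

```python
def selective_load(fact_partition, dimension, foreign_key):
    """Return only the dimension rows referenced by this node's fact rows."""
    needed = {fact[foreign_key] for fact in fact_partition}
    return {key: row for key, row in dimension.items() if key in needed}

# Hypothetical big dimension, keyed by order id.
orders = {
    1: {"status": "open"},
    2: {"status": "shipped"},
    3: {"status": "open"},
    4: {"status": "cancelled"},  # no fact row references this order
}

# Hypothetical fact rows, already split row by row over two nodes.
lineitems = [
    {"order_id": 1, "quantity": 5},
    {"order_id": 2, "quantity": 1},
    {"order_id": 1, "quantity": 7},
    {"order_id": 3, "quantity": 2},
]
node_facts = [lineitems[0::2], lineitems[1::2]]

node_orders = [selective_load(facts, orders, "order_id") for facts in node_facts]
for i, loaded in enumerate(node_orders):
    print(f"node {i} loads orders {sorted(loaded)}")
```

In this toy data the first node loads a single order and the second loads two, while order 4, which has no facts, is loaded nowhere.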


The Selective Load Technique

Selectively load the dimension tables
- Example:

[Figure: node of a DWS cluster]


The Selective Load Technique

- High reduction of the number of rows to load to each node
- Big dimensions
  - High number of rows (absolute size)
  - Significant percentage of the number of rows in fact tables
  - Produce sparse models (e.g. passengers in an airline)
  - Rows in the dimension table are related to a low number of facts
- Worst scenario is having as many dimension rows as facts in each node
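The worst case can be written as a simple bound: a node cannot need more dimension rows than it holds facts, nor more than the dimension contains. The Python sketch below assumes a uniform fact distribution (node sizes differ by at most one row); the row counts are hypothetical.

```python
import math

def max_dimension_rows_per_node(dimension_rows, fact_rows, n_nodes):
    """Upper bound on the big-dimension rows one node may need to load."""
    facts_per_node = math.ceil(fact_rows / n_nodes)
    return min(dimension_rows, facts_per_node)

# Hypothetical sizes: 1,000,000 dimension rows and 4,000,000 fact rows.
for n in (2, 5, 20):
    bound = max_dimension_rows_per_node(1_000_000, 4_000_000, n)
    print(f"{n} nodes: at most {bound} dimension rows per node")
```

The bound only starts to fall below full replication once each node holds fewer facts than the dimension has rows, so the benefit of selective loading grows with the number of nodes.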


The Selective Load Technique

Dimension browsing queries?
- No node holds a complete version of the big dimension table
  - The union of all selective load partitions of the dimension table does not give a complete version of the dimension table
  - Dimension rows with no facts won’t be loaded at all
- Apply the DWS data partitioning algorithm to the big dimension
  - Create a partitioned version of the dimension table distributed through all nodes
  - Enables the dimension queries to benefit from DWS speed-up and scale-up
  - Dimension browsing queries targeting the big dimension will be executed in parallel by all nodes
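A minimal Python sketch of a browsing query over the partitioned dimension: the dimension rows are dealt out across nodes in the same row-by-row fashion as facts, each node counts its matching rows, and the partial counts are summed. The passenger data and function names are hypothetical, and the per-node work is run sequentially here rather than in parallel.

```python
def partition_dimension(dimension_rows, n_nodes):
    """Distribute the big dimension row by row, so every row lives on exactly one node."""
    return [dimension_rows[i::n_nodes] for i in range(n_nodes)]

def browse_count(partitions, predicate):
    """Partial step: each node counts its own matches. Merge step: add the counts."""
    partial_counts = [sum(1 for row in part if predicate(row)) for part in partitions]
    return sum(partial_counts)

# Hypothetical big dimension; some passengers may have no flights (no facts) at all.
passengers = [
    {"id": 1, "country": "PT"},
    {"id": 2, "country": "US"},
    {"id": 3, "country": "PT"},
    {"id": 4, "country": "UK"},
    {"id": 5, "country": "PT"},
]

partitions = partition_dimension(passengers, 3)
print(browse_count(partitions, lambda row: row["country"] == "PT"))
```

Because the partitioned copy is built from the dimension itself and not from the facts, rows without any fact are still counted, unlike in the selectively loaded copies.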


Experimental Results

Experiments with TPC-H
- Facts: Lineitem
- Big Dimension: Orders
- Dimensions: Customer, Supplier, Region, Nation, Part

Scenarios
- Single Node: centralized DB for reference
- DWS (5, 10, 20): DWS with replication of dimensions for 5, 10 and 20 nodes
- DWS_SL (5, 10, 20): DWS with selective load of the big dimension for 5, 10 and 20 nodes


Experimental Results

Storage per node
- Replication of the big dimension has a high impact
- Selective load significantly reduces the data volume

Table size (MB)

Scenario      LineItem   Orders    Orders_dist   Total
Single Node   3576.25    1573.56   -             5149.82
DWS_5          715.25    1573.56   -             2288.81
DWS_SL_5       715.25     557.97   314.71        1587.93
DWS_10         357.63    1573.56   -             1931.19
DWS_SL_10      357.63     312.10   157.36         827.09
DWS_20         178.81    1573.56   -             1752.38
DWS_SL_20      178.81     157.01    78.68         414.51
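The per-node saving implied by the table can be worked out directly from the Total column. The Python sketch below uses only the totals reported above; the function name is hypothetical.

```python
# Per-node totals in MB from the table: (DWS with replication, DWS_SL with selective load).
TOTAL_MB = {
    5: (2288.81, 1587.93),
    10: (1931.19, 827.09),
    20: (1752.38, 414.51),
}

def reduction_percent(replicated_mb, selective_mb):
    """Percentage of per-node storage saved by selective load versus replication."""
    return 100.0 * (1.0 - selective_mb / replicated_mb)

for n_nodes, (dws, dws_sl) in TOTAL_MB.items():
    print(f"{n_nodes} nodes: {reduction_percent(dws, dws_sl):.1f}% less storage per node")
```

The saving is roughly 31%, 57% and 76% for 5, 10 and 20 nodes: with replication the Orders table stays at 1573.56 MB on every node, whereas with selective load both Orders copies shrink as nodes are added.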


Experimental Results

Performance
- DWS speed-up is nonexistent due to the replication of the big dimension
- DWS_SL speed-up is near linear


Conclusions

DWS is a technique to distribute data warehouses through a cluster of (low cost) computers with near linear speed-up and scale-up for star schema models and aggregation queries.

The current work enables the use of the DWS technique for star schema models with large dimensions, with near linear speed-up and scale-up. It also enables dimension browsing queries to benefit from parallel execution in a DWS system.


Questions and Contacts

Marco Costa, [email protected]

Henrique Madeira, [email protected]

Critical Software, S.A.
Parque Industrial de Taveiro, Lote 48
3045-504 Coimbra, PORTUGAL
Tel +351 239989100, Fax +351 239989119

Critical Software Inc.
111 North Market Street, Suite 670
San Jose, California, USA, 95113
Tel. +1(408)9711231, Fax: +1(408)3513330