Dependable Technologies for Critical Systems Copyright Critical Software S.A. 1998-2003 All Rights
Reserved.
Handling big dimensions in distributed data warehouses using the DWS technique
Marco Costa
DEI – CISUC – University of Coimbra
Critical Software S.A.
Agenda
Introduction
The DWS technique
  Description
  Problems with big dimensions
The Selective Loading technique
Experimental Results
Conclusions
Critical Software Inc.
Company Profile
  International Software Engineering company.
  Founded in 1998, offices in Portugal, US, UK.
  Entrepreneurial and independent SME.
  Staff of 100: software engineers, MSc's, PhD's.
Figures
  Turnover of US 6M (2004).
  International market represents +70%.
  Profitable since foundation (EBIT = 17%, 2003).
Quality, R&D
  ISO 9001:2000 Tick-IT certified (only in Iberia).
  ISO 15504 / CMM level 3
  R&D focused, patents submitted
Headquarters, Portugal
[Figure: Turnover in M€ per year, 1999 to 2004, split into domestic and exports. FY 2003: EBIT = 17%, EBITDA = 24%.]
Introduction
Companies produce and store more and more data
Data Warehouses have large and continuously growing volumes of data to process
High performance in query execution is crucial to enable interactivity in the OLAP process
Typically, this performance is achieved through very expensive hardware platforms (e.g. high-end servers)
Introduction
Parallel processing has been explored as one of the solutions to support large DW
  Intra-query parallelism
Distributed DW
  For geographical reasons
  For performance
Load balancing of data
Query execution
Reduce communication between nodes
The DWS Technique
Distribution of a DW through a cluster of "low cost computers"
  Data partition technique
  Query re-write and parallel execution technique
  Approximated query answering
Shared-nothing architecture – Federated
Conceived specifically for data warehouses implemented with the star-schema model
High scalability
Near linear speed-up for data aggregation queries
The DWS Technique
Data partitioning / data placement
  All nodes have the same data model
  Dimension tables are replicated
  Fact tables are distributed through all nodes in a uniform way
    Row by row
    Random
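The two placement strategies named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, rows are plain strings, and "random" is interpreted here as "shuffle, then deal the rows out evenly", which the slides do not specify.

```python
import random


def partition_row_by_row(fact_rows, num_nodes):
    """Round-robin placement: the i-th row (0-based) goes to node i mod num_nodes."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(fact_rows):
        nodes[i % num_nodes].append(row)
    return nodes


def partition_random(fact_rows, num_nodes, seed=0):
    """Random placement (assumed variant): shuffle the rows, then deal them evenly."""
    shuffled = list(fact_rows)
    random.Random(seed).shuffle(shuffled)
    return partition_row_by_row(shuffled, num_nodes)


facts = [f"row_{i}" for i in range(1, 10)]
row_by_row = partition_row_by_row(facts, 3)
randomized = partition_random(facts, 3)

for node_id, rows in enumerate(row_by_row, start=1):
    print(f"node {node_id}: {rows}")
```

Either way every node ends up with the same number of fact rows (give or take one), which is what makes the per-node work uniform.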
The DWS Technique
[Figure: Data partitioning / data placement, row by row example. A centralized star schema (one fact table with rows row_1 ... row_n, surrounded by Dimensions 1 to 4) is distributed over three nodes. Each node holds all rows of every dimension table and one slice of the fact table: row_1, row_4, row_7, ..., row_n-2 on the first node; row_2, row_5, row_8, ..., row_n-1 on the second; row_3, row_6, row_9, ..., row_n on the third.]
The DWS Technique
Query re-write
  Partition the queries in steps:
    Partial Query (independently executed in each node)
    Merge Query
  Some queries might require more than one step
  Execution tree optimizer – determines the steps that need to be executed independently or can be included in the upper query
The DWS Technique
Query Re-write (example for 2 nodes)
A typical data aggregation query:

select t.calendar_month_desc "Month",       -- dimensions
       c.cust_city "City",
       p.prod_category "Category",
       avg(s.quantity_sold) "Quantity",     -- facts (aggregated)
       avg(s.amount_sold) "Amount"
from sales s, customers c, times t, products p
where s.time_id = t.time_id
  and s.cust_id = c.cust_id
  and s.prod_id = p.prod_id
  and t.calendar_year = 2000
group by t.calendar_month_desc, c.cust_city, p.prod_category
The DWS Technique
Query Re-write (example for 2 nodes)
Partial Query sent to all nodes (collects the partial aggregations):

create table dws110517101718101 as
select t.calendar_month_desc calendar_month_desc,
       c.cust_city cust_city,
       p.prod_category prod_category,
       sum(s.quantity_sold) as dws1_sum,
       count(s.quantity_sold) as dws1_count,
       sum(s.amount_sold) as dws2_sum,
       count(s.amount_sold) as dws2_count
from sales s, customers c, times t, products p
where s.time_id = t.time_id
  and s.cust_id = c.cust_id
  and s.prod_id = p.prod_id
  and t.calendar_year = 2000
group by t.calendar_month_desc, c.cust_city, p.prod_category
The DW-SP Technology
Query Re-write (example for 2 nodes)
Merge Query – merge the partial results:

Gather partial results:

create table dws_finalmerge_ as (
  select * from dws110517154329101@node1
  union all
  select * from dws110517154329101@node2)

Build final results (merge aggregations):

select calendar_month_desc "month",
       cust_city "city",
       prod_category "category",
       sum(dws1_sum) / sum(dws1_count) "quantity",
       sum(dws2_sum) / sum(dws2_count) "amount"
from dws_finalmerge_
group by calendar_month_desc, cust_city, prod_category
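The reason the partial query returns sum and count rather than avg can be illustrated with a small Python sketch. It mirrors the partial and merge steps for a single averaged column; the function names and the sample rows are hypothetical, and the group key is reduced to a month label for brevity.

```python
from collections import defaultdict


def partial_query(fact_rows):
    """Runs on one node: per group, return (sum, count) instead of the average."""
    acc = defaultdict(lambda: [0.0, 0])
    for group_key, quantity in fact_rows:
        acc[group_key][0] += quantity
        acc[group_key][1] += 1
    return {key: (s, c) for key, (s, c) in acc.items()}


def merge_query(partials):
    """Runs once: union all partial results, then compute sum(sum) / sum(count)."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            total[key][0] += s
            total[key][1] += c
    return {key: s / c for key, (s, c) in total.items()}


# Hypothetical fact rows (group key, quantity_sold) held by two nodes.
node1 = [("Jan", 10.0), ("Jan", 20.0), ("Feb", 5.0)]
node2 = [("Jan", 60.0), ("Feb", 7.0)]

merged = merge_query([partial_query(node1), partial_query(node2)])
print(merged)
```

Averaging the per-node averages would give a different answer for "Jan" here ((15 + 60) / 2 = 37.5 instead of 90 / 3 = 30), because the nodes hold different numbers of rows for that group. Carrying sum and count separately avoids that error.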
The DWS Technique
Achievements
  Optimal data load balance
  Optimal work load balance
    For each query, each node processes the same amount of data as all the others, mostly within its local data
  Low communication between nodes
  High scalability
    Near linear speed-up
    Near linear scale-up
    Tested with the APB-1 benchmark (OLAP Council) and 10 nodes
The DWS Technique
The problem
  Replication of dimension tables is not typically a problem (dimension tables represent 5% to 10% of the data)
  Businesses with big dimensions cannot apply DWS
  The businesses that have big dimensions have high potential (e.g. airlines, telecoms, e-business)
The Selective Load Technique
Selectively load the dimension tables
  Typical OLAP queries aggregate facts according to restrictions applied to dimensions
  The join between facts and dimensions only needs the dimension rows that exist in both tables
  Do not replicate the big dimension tables
  Load only the necessary rows to each node
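A minimal Python sketch of the idea, assuming the dimension is a dictionary keyed by its primary key and each fact row carries the matching foreign key. The helper name `selective_load` and the sample data are hypothetical; the column names are borrowed from the earlier sales/customers query.

```python
def selective_load(node_facts, dimension, fk):
    """Return only the dimension rows referenced by the facts stored on this node."""
    needed_keys = {fact[fk] for fact in node_facts}
    return {key: row for key, row in dimension.items() if key in needed_keys}


# Hypothetical big dimension: cust_id -> cust_city.
customers = {1: "Lisbon", 2: "Coimbra", 3: "Porto", 4: "Faro"}

# Hypothetical fact partitions held by two nodes.
node1_sales = [{"cust_id": 1, "amount_sold": 10.0}, {"cust_id": 3, "amount_sold": 4.0}]
node2_sales = [{"cust_id": 1, "amount_sold": 2.5}, {"cust_id": 2, "amount_sold": 8.0}]

node1_customers = selective_load(node1_sales, customers, "cust_id")
node2_customers = selective_load(node2_sales, customers, "cust_id")
print(node1_customers)
print(node2_customers)
```

A dimension row referenced from several nodes (customer 1) is loaded on each of them, while a row with no facts at all (customer 4) is loaded nowhere. Every local join between facts and the dimension still finds all the rows it needs.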
The Selective Load Technique
Selectively load the dimension tables
Example:
[Figure: node of a DWS cluster]
The Selective Load Technique
High reduction of the number of rows to load to each node
Big dimensions
  High number of rows (absolute size)
  Significant percentage of the number of rows in fact tables
  Produce sparse models (e.g. the passengers of an airline)
  Rows in the dimension table are related to a low number of facts
  Worst scenario is having as many dimension rows as facts in each node
The Selective Load Technique
Dimension browsing queries?
  There is no complete version of the big dimension table
    The union of all selective load partitions of the dimension table does not give a complete version of the dimension table
    Dimension rows with no facts won't be loaded at all
  Apply the DWS data partitioning algorithm to the big dimension
    Create a partitioned version of the dimension table distributed through all nodes
    Enables the dimension queries to benefit from DWS speed-up and scale-up
    Dimension browsing queries targeting the big dimension will be executed in parallel by all nodes
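A rough Python sketch of a browsing query over the partitioned copy of the big dimension. The names are hypothetical, the row-by-row split is assumed to be the same one used for fact tables, and the loop over partitions stands in for what would be parallel execution on separate nodes.

```python
def partition_dimension(dimension_rows, num_nodes):
    """Split the complete dimension table row by row across the nodes."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(dimension_rows):
        nodes[i % num_nodes].append(row)
    return nodes


def browse(partitions, predicate):
    """Each node filters its own partition; the results are then unioned."""
    results = []
    for partition in partitions:  # conceptually one task per node, run in parallel
        results.extend(row for row in partition if predicate(row))
    return results


# Hypothetical dimension rows (cust_id, cust_city); customer 4 has no facts.
customers = [(1, "Lisbon"), (2, "Coimbra"), (3, "Porto"), (4, "Coimbra")]
partitions = partition_dimension(customers, 2)

in_coimbra = sorted(browse(partitions, lambda row: row[1] == "Coimbra"))
print(in_coimbra)
```

Because every dimension row lands in exactly one partition, the union is complete: customer 4 is returned even though a selectively loaded copy would never contain it.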
Experimental Results
Experiments with TPC-H
  Facts: Lineitem
  Big Dimension: Orders
  Dimensions: Customer, Supplier, Region, Nation, Part
Scenarios
  Single Node – Centralized DB for reference
  DWS (5,10,20) – DWS with replication of dimensions for 5, 10 and 20 nodes
  DWS_SL (5,10,20) – DWS with selective load of the big dimension for 5, 10 and 20 nodes
Experimental Results
Storage per node
  Replication of the big dimension has a high impact
  Selective load significantly reduces the data volume

Table size (MB)
              LineItem   Orders    Orders_dist   Total
Single Node   3576.25    1573.56                 5149.82
DWS_5          715.25    1573.56                 2288.81
DWS_SL_5       715.25     557.97   314.71        1587.93
DWS_10         357.63    1573.56                 1931.19
DWS_SL_10      357.63     312.10   157.36         827.09
DWS_20         178.81    1573.56                 1752.38
DWS_SL_20      178.81     157.01    78.68         414.51
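The arithmetic behind the table can be checked with a short Python sketch. The sizes are copied from the table; the helper names are hypothetical, and the formula for the replicated case (an even split of Lineitem plus a full copy of Orders on each node) is inferred from the scenario descriptions rather than stated on the slide.

```python
# Sizes in MB, copied from the storage table.
LINEITEM_MB = 3576.25
ORDERS_MB = 1573.56
totals_dws = {5: 2288.81, 10: 1931.19, 20: 1752.38}
totals_dws_sl = {5: 1587.93, 10: 827.09, 20: 414.51}


def dws_total(num_nodes):
    """Replicated case: facts are split evenly, the big dimension is copied in full."""
    return LINEITEM_MB / num_nodes + ORDERS_MB


def saving(num_nodes):
    """Fraction of per-node storage saved by selective load."""
    return 1 - totals_dws_sl[num_nodes] / totals_dws[num_nodes]


for n in (5, 10, 20):
    print(n, round(dws_total(n), 2), f"{saving(n):.0%}")
```

With replication, the per-node size can never drop below the 1573.56 MB of Orders, however many nodes are added. With selective load, the per-node total keeps shrinking as nodes are added, so the relative saving grows with cluster size.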
Experimental Results
Performance
  DWS speed-up is nonexistent due to the replication of the big dimension
  DWS_SL speed-up is near linear
Conclusions
DWS is a technique to distribute data warehouses through a cluster of (low cost) computers with near linear speed-up and scale-up for star schema models and aggregation queries
The current work enables the use of the DWS technique for star schema models with large dimensions, with near linear speed-up and scale-up. It also enables dimension browsing queries to benefit from parallel execution in a DWS system.
Questions and Contacts
Marco Costa, [email protected]
Henrique Madeira, [email protected]
Critical Software, S.A.
Parque Industrial de Taveiro, Lote 48
3045-504 Coimbra, PORTUGAL
Tel +351 239989100, Fax +351 239989119
Critical Software Inc.
111 North Market Street, Suite 670
San Jose, California, USA, 95113
Tel. +1(408)9711231, Fax: +1(408)3513330