Upload
shiv-shankar-das
View
216
Download
0
Embed Size (px)
Citation preview
7/28/2019 07_dw
1/28
1
Cube Computation and
Indexes for Data WarehousesCPS 196.03
Notes 7
7/28/2019 07_dw
2/28
2
Processing
ROLAP servers vs. MOLAP servers
Index Structures
Cube computation What to Materialize?
Algorithms
Client Client
Warehouse
Source Source Source
Query & Analysis
Integration
Metadata
7/28/2019 07_dw
3/28
3
ROLAP Server
Relational OLAP Server
relational
DBMS
ROLAP
server
tools
utilities
sale prodId date sum
p1 1 62
p2 1 19
p1 2 48
Special indices, tuning;
Schema is denormalized
7/28/2019 07_dw
4/28
4
MOLAP Server
Multi-Dimensional OLAP Server
multi-dimensional
server
M.D. tools
utilities
could alsosit on
relational
DBMS
Prod
uct
Date1 2 3 4
milk
soda
eggs
soap
AB
Sales
7/28/2019 07_dw
5/28
5
MOLAP
Total annual salesof TV in U.S.A.
Date
Countr
ysum
sumTV
VCRPC
1Qtr 2Qtr 3Qtr 4Qtr
U.S.A
Canada
Mexico
sum
7/28/2019 07_dw
6/28
6
MOLAP
A
B
29 30 31 32
1 2 3 4
5
9
13 14 15 16
6463626148474645
a1a0
c3c2
c1
c 0b3
b2
b1
b0
a2 a3
C
4428
5640
2452
3620
60
B
7/28/2019 07_dw
7/287
Challenges in MOLAP
Storing large arrays for efficient access
Row-major, column major
Chunking
Compressing sparse arrays
Creating array data from data in tables
Efficient techniques for Cube computation
Topics are discussed in the paper for reading
7/28/2019 07_dw
8/288
Index Structures
Traditional Access Methods
B-trees, hash tables, R-trees, grids,
Popular in Warehousesinverted lists
bit map indexes
join indexes
text indexes
7/28/2019 07_dw
9/28
9
Inverted Lists
20
23
18
19
20
2122
23
25
26
r4
r18
r34
r35
r5
r19
r37
r40
rId name age
r4 joe 20
r18 fred 20
r19 sally 21r34 nancy 20
r35 tom 20
r36 pat 25
r5 dave 21
r41 jeff 26
...
age
index
inverted
lists
data
records
7/28/2019 07_dw
10/28
10
Using Inverted Lists
Query:
Get people with age = 20 and name = fred
List for age = 20: r4, r18, r34, r35
List for name = fred: r18, r52
Answer is intersection: r18
7/28/2019 07_dw
11/28
11
Bit Maps
20
23
18
19
20
2122
23
25
26
id name age
1 joe 20
2 fred 20
3 sally 21
4 nancy 20
5 tom 20
6 pat 25
7 dave 21
8 jeff 26
..
.
age
indexbit
maps
data
records
1
1
0
1
1
0
00
0
0
01
0
0
0
1
01
1
7/28/2019 07_dw
12/28
12
Bitmap Index
Index on a particular column Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the
value for the indexed column
not suitable for high cardinality domains
Cust Region Type
C1 Asia Retail
C2 Europe Dealer
C3 Asia Dealer
C4 America Retail
C5 Europe Dealer
RecID Retail Dealer
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
ecI Asia Europe America
1 1 0 0
2 0 1 0
3 1 0 0
4 0 0 1
5 0 1 0
Base table Index on Region Index on Type
7/28/2019 07_dw
13/28
13
Using Bit Maps
Query:
Get people with age = 20 and name = fred
List for age = 20: 1101100000
List for name = fred: 0100000001
Answer is intersection: 010000000000
Good if domain cardinality small
Bit vectors can be compressed
7/28/2019 07_dw
14/28
14
Join
sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4
Combine SALE, PRODUCT relations
In SQL: SELECT * FROM SALE, PRODUCT WHERE ...
product id name price
p1 bolt 10
p2 nut 5
joinTb prodId name price storeId date amt
p1 bolt 10 c1 1 12
p2 nut 5 c1 1 11
p1 bolt 10 c3 1 50
p2 nut 5 c2 1 8
p1 bolt 10 c1 2 44
p1 bolt 10 c2 2 4
7/28/2019 07_dw
15/28
15
Join Indexes
product id name price jIndex
p1 bolt 10 r1,r3,r5,r6
p2 nut 5 r2,r4
sale rId prodId storeId date amt
r1 p1 c1 1 12
r2 p2 c1 1 11
r3 p1 c3 1 50
r4 p2 c2 1 8r5 p1 c1 2 44
r6 p1 c2 2 4
join index
7/28/2019 07_dw
16/28
16
Cube Computation for DataWarehouses
7/28/2019 07_dw
17/28
17
Counting Exercise
How many cuboids are there in a cube?
The full or nothing case
When dimension hierarchies are present
What is the size of each cuboid?
7/28/2019 07_dw
18/28
18
Lattice of Cuboids
city, product, date
city, product city, date product, date
city product date
all
day 2c1 c2 c3
p1 44 4
p2 c1 c2 c3
p1 12 50
p2 11 8
day 1
c1 c2 c3
p1 56 4 50
p2 11 8
c1 c2 c3
p1 67 12 50
129
7/28/2019 07_dw
19/28
19
Dimension Hierarchies
all
state
city
cities city state
c1 CA
c2 NY
7/28/2019 07_dw
20/28
20
Dimension Hierarchies
city, product
city, product, date
city, date product, date
city product date
all
state, product, date
state, date
state, product
state
not all arcs shown...
7/28/2019 07_dw
21/28
21
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids in an n-dimensional cube with Llevels?
Materialization of data cube
Materialize every (cuboid) (full materialization), none
(no materialization), orsome (partial materialization)
Selection of which cuboids to materialize
Based on size, sharing, access frequency, etc.
)11(
n
ii
LT
7/28/2019 07_dw
22/28
22
Derived Data
Derived Warehouse Data
indexes
aggregates
materialized views (next slide)
When to update derived data?
Incremental vs. refresh
7/28/2019 07_dw
23/28
23
Idea of Materialized Views
Define new warehouse tables/arrays
sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4
product id name price
p1 bolt 10
p2 nut 5
joinTb prodId name price storeId date amt
p1 bolt 10 c1 1 12p2 nut 5 c1 1 11
p1 bolt 10 c3 1 50
p2 nut 5 c2 1 8
p1 bolt 10 c1 2 44
p1 bolt 10 c2 2 4
does not exist
at any source
7/28/2019 07_dw
24/28
24
Efficient OLAP Processing
Determine which operations should be performed on available cuboids Transform drill, roll, etc. into corresponding SQL and/or OLAP operations,
e.g., dice = selection + projection
Determine which materialized cuboid(s) should be selected for OLAP:
Let the query to be processed be on {brand, province_or_state} with the
condition year = 2004, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
Explore indexing structures & compressed vs. dense arrays in MOLAP
7/28/2019 07_dw
25/28
25
What to Materialize?
Store in warehouse results useful for
common queries
Example:day 2
c1 c2 c3
p1 44 4
p2 c1 c2 c3
p1 12 50
p2 11 8
day 1
c1 c2 c3p1 56 4 50
p2 11 8
c1 c2 c3
p1 67 12 50
c1
p1 110
p2 19
129
. . .total sales
materialize
7/28/2019 07_dw
26/28
26
Materialization Factors
Type/frequency of queries
Query response time
Storage cost Update cost
Will study a concrete algorithm later
7/28/2019 07_dw
27/28
27
Iceberg Cube
Computing only the cuboid cellswhose count or other aggregates
satisfying the condition like
HAVING COUNT(*) >= minsup Motivation
Only a small portion of cube cells may be above the
water in a sparse cube
Only calculate interesting cellsdata above certainthreshold
7/28/2019 07_dw
28/28
28
Challenges in MOLAP
Storing large arrays for efficient access
Row-major, column major
Chunking
Compressing sparse arrays
Creating array data from data in tables
Efficient techniques for Cube computation
Topics are discussed in the paper for reading