Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
DW Performance Optimization (I)
Slides adapted from
Torben Bach Pedersen
Aalborg University 2008 - DWML course 2
Overview
• Choosing Aggregates
• Maintaining Views
• Bitmap Indices
Aalborg University 2008 - DWML course 3
Aggregates
• Observations
� DW queries follow the same “schema”
� Aggregate (GROUP-BY) queries
• Idea
� Compute and store query results in advance (pre-aggregation)
u Example: store “total sales per month and product”
� Yields large performance improvements (e.g., factor 1000)
� No need to store everything – reuse whenever possible
u Example: quarterly total can be quickly computed from monthly total
• Prerequisites
� Tree-structured dimensions with fixed height
� Many-to-one relationships from fact to dimensions
� Facts mapped to bottom level in all dimensions
Aalborg University 2008 - DWML course 4
Aggregate Use Example
• Imagine 1 billion sales rows, 1000 products, 100 locations
• CREATE VIEW TotalSales (pid,locid,total) ASSELECT s.pid,s.locid,SUM(s.sales)
FROM Sales s GROUP BY s.pid,s.locid
• Pre-compute the above view and store the result into a table (e.g., TotalSales with 100,000 rows)
• Rewrite the query to use the view
� SELECT p.category,SUM(s.sales) FROM Products p,
Sales s WHERE p.pid=s.pid GROUP BY p.category
u can be rewritten to
� SELECT p.category,SUM(t.total) FROM Products p,
TotalSales t WHERE p.pid=t.pid GROUP BY p.category
u Query becomes 10,000 times faster!
40323
…………
1
1
locid
2
1011
201
salespidtid
Sales
4032
………
1
locid
301
salespid
TotalSales
1 billion rows
100,000 rows
Aalborg University 2008 - DWML course 5
Pre-Aggregation Choices
• Full pre-aggregation: (all combinations of levels)
� Fast query response
� Takes a lot of space/update time (200-500 times raw data)
• No pre-aggregation
� Slow query response (for terabytes…)
• Practical pre-aggregation: chosen combinations
� A good compromise between response time and space use
� The approach supported by ROLAP tools
u IBM DB2 UDB
u Oracle 10g
u MS Analysis Services
Aalborg University 2008 - DWML course 6
Using Aggregates
• The term “aggregate” refers to materialized view
• Application: aggregates used via aggregate navigator
� A set of aggregates stored in the system (i.e., DBMS)
� Given a query, the best aggregate is found, and the query is rewritten to use it
� Done by the system, not by the user
• Four design goals for aggregate usage
� Aggregates stored separately from detail data
� “Shrunk” dimensions mapped to aggregate facts
� Connection between aggregates and detail data known by system
� All queries (SQL) refers only detail data
• SUM, MIN, MAX, COUNT, AVG functions can be handled
� The AVG function requires a trick – Why and how?
Aalborg University 2008 - DWML course 7
Choosing Aggregates
• Practical pre-aggregation, decide what aggregates to store
• Non-trivial (NP-complete) optimization problem
� Space use
� Update speed
� Response time demands
� ……
• Choose an aggregate if it is considerably smaller than
available, usable aggregates (factor 3-5-10)
• “Implementing Data Cubes Efficiently” SIGMOD’96 paper
� Greedy approach: simple but effective
� Commerical DBMS now use similar, but more advanced techniques
Aalborg University 2008 - DWML course 8
Data Cube
By make
and co
lor
By color and year
By make and yearFordChevy
19911992
19931994
Sum
The data cube stores multidimensional GROUP BY relationsof tables in data warehouses
Cross Tab
Chevy Ford By color
Sum
Red
White
Blue
By make
Group By (with total)
By color
Sum
Red
White
Blue
By make
Database capable
Data warehouse capable
Aggregate
Sum
mc
m c
none
Aalborg University 2008 - DWML course 9
Lattice Framework
• Each lattice node represents a view / query
• Dependence relationship: Q1 ≤ Q2
� Q1 is the descendant of Q2
� Q1 can be answered using only the results of Q2
• In other words, Q1 is dependent on Q2
� The ≤ operator imposes a partial ordering on the queries
• Top view = base data (i.e., most detail)
• Essentially, the lattice captures dependency relationship among
queries / views and can be represented by a lattice graph
Aalborg University 2008 - DWML course 10
psc 6M
pc 6M ps 0.8M sc 6M
p 0.2M s 0.01M c 0.1M
none 1
A Data Cube Example
Scenario:
A query asks for the sales of a part
a) If view pc is available, need to process about 6M rows; OR
b) If view p is available, need to process about 0.2M rows
(I) – 8 possible views for 3 dimensions. Each view gives the total sales as per that grouping.
1. part, supplier, customer (6M rows)
2. part, customer (6M)
3. part, supplier (0.8M)
4. supplier, customer (6M)
5. part (0.2M)
6. supplier (0.01M)
7. customer (0.1M)
8. none (1)
(II) – 8 views organized into a Lattice
Problem: Given that we have space S, what views to materialize for minimizing query costs?
(III) – Picking the right views to materialize � improve performance
(IV) – View pc and sc not needed � Effective rows reduced from 19M to 7M
19 M ro
ws to
tal
Aalborg University 2008 - DWML course 11
Cost of answering a query based on a view and its major assumptions
Cost Model
Time to answer a query is equal to the space occupied by the query from which the query is answered
1
All queries are identical to some queries in the given lattice2
An illustration:
• To answer query Q, choose an ancestor of Q, say, Qa, that has been materialized
• We thus need to process the table of Qa
• Cost of answering Q: number of rows in the table of Qa
Model simple and realistic, Enable the design and analysis of powerful algorithms
Experimental validation of the cost model: almost linear relationship between size and running time …
Source Size S Time T Ratio m
From cell itself 1 2.07 -
From view s 10,000 2.38 .000031
From view ps 0.8M 20.77 .000023
From view psc 6M 226.23 .000037
Aalborg University 2008 - DWML course 12
Greedy Approach
• We plan to solve the problem by the greedy approach
� Being as greedy as possible in each step!
� Now give you a taste of this approach ……
• Example: Pay 5 kroner with the smallest number of coins
� Suppose we have many coins of 2 kroner, 1 krone, and 50 ores (i.e., 50 cents).
• How to be greedy:
� Select the largest coin: 2 kroner
� Select the largest coin again: 2 kroner
� Remaining amount = 5 – 2 – 2 = 1 krone
� We cannot select the largest coin again
� We choose the next largest coin 1 krone instead
• Only 3 coins are needed
Aalborg University 2008 - DWML course 13
Greedy Algorithm
Greedy algorithm: each time choose the view with the maximum “benefit”
• Given a data cube lattice with space costs associated with each view
• Always include the top view because it cannot be generated from other views
• Given a number k, we are allowed to select k views (in addition to the top view)
• B(v,S), the benefit of view v (relative to S), denotes the amount that v can improve the costs of evaluating views w, including itself
• B(v,S) = Σview w max {Cost(w,S) – Cost(w,v), 0} only positive benefit taken into account
Explanation:The Greedy algorithm
S = {top view}
for i=1 to k do begin
Select a view v not in S such
that B(v,S) is maximized;
S = S union {v};
end;
return S;
Aalborg University 2008 - DWML course 14
Greedy Algorithm – Example
Benefits of possible choices in 1st round
Round 1
b
c
d
e
f
g
h
50 x 5 = 250
25 x 5 = 125
80 x 2 = 160
70 x 3 =210
60 x 2 =120
99 x 1 = 99
90 x 1 = 90
100
a
C 7550 b
20 d f 40
h 10g
1
30
e
Lattice with space costs
Initial: top view a is automatically chosen
End of round 1: view b selected
• Round 1: each query answered by a, with cost 100
• Which queries can be answered by view c?
• What is the benefit of view c? Ans: 5*(100-75)=125
• Can we find a better view than view c? How about b?
Aalborg University 2008 - DWML course 15
Greedy Algorithm – Example
Benefits of possible choices at each round
Round 1 Round 2
b
c
d
e
f
g
h
50 x 5 = 250
25 x 5 = 125
80 x 2 = 160
70 x 3 =210
60 x 2 =120
99 x 1 = 99
90 x 1 = 90
25 x 2 = 50
30 x 2 = 60
20 x 3 = 60
60 + 10 = 70
49 x 1 = 49
40 x 1 = 40
100
a
C 7550 b
20 d f 40
h 10g
1
30
e
Lattice with space costs
• Round 2: views b, d, e, g, h start with cost 50 and the other views start with cost 100
• Which queries can be improved by view f?
• Benefit of view f? Ans: (100-40)+(50-40)=70
• How about the benefit of view c?
Round 1: view b selected
End of round 2: view f selected
Aalborg University 2008 - DWML course 16
Greedy Algorithm – Example
Benefits of possible choices at each round
Round 1 Round 2 Round 3
b
c
d
e
f
g
h
50 x 5 = 250
25 x 5 = 125
80 x 2 = 160
70 x 3 =210
60 x 2 =120
99 x 1 = 99
90 x 1 = 90
25 x 2 = 50
30 x 2 = 60
20 x 3 = 60
60 + 10 = 70
49 x 1 = 49
40 x 1 = 40
25 x 1 = 25
30 x 2 = 60
2 x 20 + 10 =50
49 x 1 = 49
30 x 1 = 30
100
a
C 7550 b
20 d f 40
h 10g
1
30
e
Lattice with space costs
• Round 1: view b selected
• Round 2: view f selected
• End of round 3: view d selected
• After 3 rounds, total cost is 420 (initial cost 800)
• In summary, each round pick the view with the maximum “benefit”
• Round 3: views b, d, e, g start with cost 50; views f and h start with cost 40; the other views start with cost 100
• How about the benefit of view e now?
Aalborg University 2008 - DWML course 17
Greedy Algorithm vs Optimal Choice
Situations where the algorithm does poorly …
The benefit of Greedy is guaranteed to be at least 63% of the benefit of the optimal algorithm (see the original paper)
A Lattice not good for Greedy
200
a
d 100100 b
20 Nodes total 1000
c
99
20 Nodes total 1000
20 Nodes total 1000
20 Nodes total 1000
• Round 1: Picks c whose benefit is 4141
• Round 2: Picks b or d with benefit of 2100 each
• Greedy results in 4141+2100 = 6241
• But, the optimal choice is to pick b and d
• b and d improve by 100 for itself and all 80 nodes below resulting in total benefits of 8200
• Ratio of greedy/optimal=6241/8200=76%
Aalborg University 2008 - DWML course 18
Greedy Algorithm – Space-Time Trade-Off
Greedy order of view selection example
Selection Benefit Space
1
2
3
4
5
6
cp
7
8
9
10
11
12
ns
nt
c
p
cs
np
ct
t
n
s
none
infinite
24M rows
12M
5.9M
5.8M
1M
1M
0.01M
small
small
small
small
Time
72M rows
48M
36M
30.1M
24.3M
23.3M
23.3M
23.3M
23.3M
23.3M
23.3M
23.3M
23.3M
23.3M
23.3M
23.3M
23.3M
6M rows
6M
6M
6.1M
11.3M
16.3M
6.3M
Experiment with a lattice: Important to materialize only the first few (5) views, performance increases at first…
30
20
10
0
70
60
50
40
80
2 4 6 8 10 12
Time / Space
Views
Total Time
Total Space
Aalborg University 2008 - DWML course 19
Optimal Cases and Anomalies
Two situations where the algorithm is optimal …
� If a1 is much larger than the
other a’s, the greedy is close
to optimal
� If all the a’s are equal then
greedy is optimal
… but there are also two situations where the algorithm is not realistic
Views in a lattice are unlikely
to have the same probability
of being requested in a query
Instead of asking for some
fixed number of views to
materialize, should instead
allocate a fixed amount of
space to views
Aalborg University 2008 - DWML course 20
The size of views grows exponentially, until it reaches the size of the raw data at rank [logr m], i.e. the cliff
Number of group-by attributes
n
Size of views
m
Log r m
Effect of the number of group-by attributes on the size of views
• Each domain size is r
• Top element has m cells appearing in raw data
• If group on i attributes, cube has ri cells
• If ri ≥ m, then each cell will have at most one data point and m of the cells with be nonempty. Use m as the size of any view.
• If ri <m, then almost all ri cells will have at least one data point. Space cost is ri as several data points can be collapsed into one aggregate
Assumptions and basis of reasoning
This explains why grouping of 2 attributes (p,c), (s,c) have the same size as (p,s,c) at 6 M rows in the figure on slide #10
Hypercube Lattices – Observations
Aalborg University 2008 - DWML course 21
Space-Optimal and Time-Optimal Solutions
What is the average time for a query when the space is optimal ?
• Space is minimized when only the top view is materialized
• Every query would take time m
• Total time cost for all 2n queries is m2n
Is there sense to minimize time by materializing all views ?
• No gain past the cliff
• No point to do so
• Nature of time-optimal solution depends on
1. Rank k = [logr m] at which the cliff occurs; and
2. Rank j, such that rj = (nj) is
maximized
Aalborg University 2008 - DWML course 22
Hierarchies are important as they underlie two commonly used query operations, Drill-down and Roll-up …
Day
Week Month
Year
none
• Year ≤ Month ≤ Day
• Week ≤ Day; but
• Month ≤ week; week ≤ month
A common hierarchy … … and its dependency relations
… but, hierarchies introduce query dependencies that must be accounted for when determining which queries to materialize; and this can be complex
Hierarchies & the Lattice Framework
Aalborg University 2008 - DWML course 23
Dependencies caused by different dimensions and attribute hierarchiescan be combined into a Direct Product lattice …
Composite Lattices
… assuming views can be created by independently grouping any or no member of the hierarchy for each dimension
Combining two hierarchical dimensions
+
customer
nation
none
0.1M
25
1
product
size type
none
6M
1
50 150
A Direct Product lattice
=
cp 6M
c 0.1M
cs 5M
np 5M
nt 3750
n 25
ns 1250
p 0.2M
t 150
none 1
s 50
ct 5.99M
Aalborg University 2008 - DWML course 24
Exercise for Composite Lattice
+
Product P2
Type P1
None P0
• Consider hierarchies of two dimensions
� Product: Product (P2) � Type (P1) � ALL (P0=∅)
� Store: Store (S2) � Area (S1) � ALL (S0=∅)
• Draw the composite lattice
� How many nodes does the lattice have?
� Enumerate them
� Connect the nodes based on their dependency
Store S2
Area S1
None S0
none
P2 S2
???
Aalborg University 2008 - DWML course 25
Applicability of the Lattice Framework
Advantages of the lattice framework
• Clean framework
� A clean framework to work with dimensional hierarchies, since hierarchies are themselves lattices
• Easy to model dependencies
� Dependency relationship among queries captured by the lattice
• Order of materialization
� A simple descending-order topological sort on lattice nodes gives the required order of materialization – more in the next lecture
Aalborg University 2008 - DWML course 26
Summary on Choosing Aggregates
1. Lattice framework
2. Linear cost model
3. Greedy algorithm (How it works?)
4. Space-time trade-off
Aalborg University 2008 - DWML course 27
Maintaining Views
Aalborg University 2008 - DWML course 28
View Maintenance
• How and when should we refresh materialized views?
• Total re-computation
� Too expensive
• Incremental view maintenance
� Apply only changes since last refresh to view
� Insert new rows Ri and delete existing rows Rd
� Update = delete + insert
• Additional info must be stored in views
� To make the views self-maintainable
� Store “number of instances” cv along with each row v in V
Aalborg University 2008 - DWML course 29
View Maintenance
• Projection views (with DISTINCT in SQL)
� V=Π(R)
� (r,c): r appears “c” times in R
� If (r,ci)∈Π(Ri) and (r,cr)∈V then cr=cr+ci, otherwise insert (r,ci) into V
� If (r,cd)∈Π(Rd) and (r,cr)∈V then cr=cr–cd, delete from V if cr=0
• Example
� V=Πx(R)
� How does V change after these operations?
� Insertion
u Insert the tuple (3,7) into R
u Insert the tuple (7,9) into R
� Deletion
u Delete the tuple (3,4) from R
u Delete the tuple (6,9) from R
96
45
43
33
21
yx
6 (1)
5 (1)
3 (2)
1 (1)
x (c)
R
V
Aalborg University 2008 - DWML course 30
View Maintenance
• Join views V=R S
� Compute Ri S and add to V, update counts
� Compute Rd S and subtract from V, update
counts
• Example
� V=R S
� How does V change after these operations?
� Insertion
u Insert the tuple (3) into R
u Insert the tuple (4) into R
u Insert the tuple (7) into R
� Deletion
u Delete the tuple (2) from R
u Delete the tuple (3) from R
u Delete the tuple (7) from S
9
7
3
3
2
y
9 (1)
3 (2)
2 (1)
y (c)
V
9
4
4
3
2
y
R S
Aalborg University 2008 - DWML course 31
Aggregation View Maintenance
• COUNT� Like projection view, but we maintain a tuple for each group g
u The group g can contain multiple attributes
u E.g., GROUP BY color, year; an instance of g = blue, 1993
� Maintain <g,count> � Update count based on inserts/deletes � Insert <g,1> rows for new values, delete rows from V with count=0
• SUM� Maintain <g,sum,count> � Update count and sum based on inserts/deletes� Insert <g,a,1> rows for new values, delete rows from V with count=0
• AVG computed as sum/count
• MIN (similar for MAX)� Maintain <g,minimum,count>� Insert (g,m): update minimum,count based on m =,<,> minimum� Delete (g,m): update minimum,count based on m =,<,> minimum
� If count=0, then we need to search the base table (expensive). Why?
Aalborg University 2008 - DWML course 32
Practical View Maintenance
• Maintenance policies – When to synchronize?
� Immediate – in same transaction as base change
� Lazy – when V is used for the first time after base updates
� Periodic, e.g., once a day, often together with base load
� Forced – after a certain number of changes
• Updating aggregates
� Computation outside DBMS in flat files (still relevant?)
� Built by loader
� Computation in DBMS using SQL
� Can be expensive: DBMS must be tuned for this
• Supported by tool/DBMS
� Oracle, SQL Server, DB2
Aalborg University 2008 - DWML course 33
Indexing
• Index used in combination with aggregates� Index on dimension tables and on materialized views
• Fact table� Build primary B+-tree index on dimension keys (primary key) ?
� Build indices on each dimension key separately (index intersection)
� Indices on combinations of dimension keys? (many!)
• Sort order is important (index-organized tables)� Possible to compress data (values not repeated)
� Can save aggregates due to fast sequential scan
� Best sort order (almost) always “Time”
• Dimension tables� Build indices on many/all individual columns (think of type)
� Build indices on common combinations
• Hash indices� Efficient for un-sorted data (e.g., color)
Aalborg University 2008 - DWML course 34
Bitmap Indices
• A B+-tree index stores a list of RowIDs for each value
� A RowID takes ~8 bytes
� Large space use for columns with low cardinality (gender, color)
� Example: Index for 1 bio. rows with gender takes 8 GB
� Not efficient to do “index intersection” for these columns
• Idea: make a “position bitmap” for each value (only two)
� Female: 01110010101010…
� Male: 10001101010101…
� Takes only (num. of values)*(num. of rows)*1 bit
� Example: bitmap index on gender (as before) takes only 256 MB
� Very efficient to do “index intersection” (AND/OR) on bitmaps
u Intersection of 32 bits done in a single CPU instruction (word length=32)
� Can be improved for for higher cardinality using compression
• Supported by some RDBMS (DB2, Oracle)
Aalborg University 2008 - DWML course 35
Using Bitmap Indices
• Query example� Find female customers in Aalborg with black hair and blue eyes
� Female: 01010101010
� Aalborg: 00000011111
� Black: 10110110110
� Blue: 01101101111
� Result 00000000010 – use AND, only one such customer
• Numeric attributes can also be handled� Find … Salary BETWEEN 22000 AND 29000
� Use the binning technique, i.e., group every C values into a bin
u E.g., group every 5000 values into a bin, and assign a bitmap for it
u Bitmap for [20000-25000): 001001001
u Bitmap for [25000-30000): 010010010
u OR together: 011011011 (Why is it OR but not AND?)
u Refinement step: follow those records and check their actual salaries
� Tradeoff between storage size and index effectiveness
Do we need to consider the bitmap for [15000-20000) ?
Aalborg University 2008 - DWML course 36
Compressed Bitmaps
• Problem: space use
� With m possible values and n records: n*m bits required
� However, probability of a 1 is 1/m => very few 1’s
• Solution: compressed bitmaps
� Run-length encoding
� A run is i 0’s followed by a 1
� Concatenating binary numbers won’t work – no unique decoding
� Instead, determine j – number of bits in binary representation of i
� Encode as “<j-1 1’s>”+“0”+“<i in binary>”
� Encode next run similarly, trailing 0’s not encoded
� Example: 000000010000 encoded as 110111
� j>1 => first bit of i is 1 – this bit can be saved – encoded as 11011
� Decoding: scan bits to find j, scan next j-1 bits to find i, find next 1
� Example: j = 3, i = 7 (11) => bitmap = 00000001 + 0000 (trailing)
Aalborg University 2008 - DWML course 37
Managing Bitmaps
• Compression factor� Assume m=n (unique values)
� Each value has just one run of length i < n
� Each run takes at most 2 log2 n bits (j <= log2 n)
� Total space consumption: 2n log2 n bits (compared to n2)
• Operations on compressed bitmaps� Decompress one run at a time and produce relevant 1’s in output
• Finding and storing bitvectors� Index with B-trees + store in blocks/block chains
• Finding records� Use secondary or primary index (hashing or B-tree)
• Handling modifications� Deletion: “retire” record number + update bitmaps with 1’s
� Insertion: add new record to file + update bitmaps with 1’s (trail 0’s)
� Updates: update bitmaps with old and new 1’s (expensive)
Aalborg University 2008 - DWML course 38
Mini Project
• Subtask #5
� Build reports in Reporting Services
� Check the Reporting Services tutorial
at the mini-project website
• Subtask #6
� DW performance optimization
� Build aggregates
u SQL Server Management Studio
u Connect to Analysis Services
u Find your cube and ……
• Short demo. session on this today
Aalborg University 2008 - DWML course 39
Summary
• Choosing Aggregates
• Maintaining Views
• Bitmap Indices