DW Performance Optimization (I) - AAU

DW Performance Optimization (I)

Slides adapted from

Torben Bach Pedersen

Aalborg University 2008 - DWML course 2

Overview

• Choosing Aggregates

• Maintaining Views

• Bitmap Indices


Aggregates

• Observations

� DW queries follow the same “schema”

� Aggregate (GROUP-BY) queries

• Idea

� Compute and store query results in advance (pre-aggregation)

u Example: store “total sales per month and product”

� Yields large performance improvements (e.g., factor 1000)

� No need to store everything – reuse whenever possible

u Example: quarterly total can be quickly computed from monthly total

• Prerequisites

� Tree-structured dimensions with fixed height

� Many-to-one relationships from fact to dimensions

� Facts mapped to bottom level in all dimensions


Aggregate Use Example

• Imagine 1 billion sales rows, 1000 products, 100 locations

• CREATE VIEW TotalSales (pid,locid,total) ASSELECT s.pid,s.locid,SUM(s.sales)

FROM Sales s GROUP BY s.pid,s.locid

• Pre-compute the above view and store the result into a table (e.g., TotalSales with 100,000 rows)

• Rewrite the query to use the view

� SELECT p.category,SUM(s.sales) FROM Products p,

Sales s WHERE p.pid=s.pid GROUP BY p.category

u can be rewritten to

� SELECT p.category,SUM(t.total) FROM Products p,

TotalSales t WHERE p.pid=t.pid GROUP BY p.category

u Query becomes 10,000 times faster!

40323

…………

1

1

locid

2

1011

201

salespidtid

Sales

4032

………

1

locid

301

salespid

TotalSales

1 billion rows

100,000 rows


Pre-Aggregation Choices

• Full pre-aggregation: (all combinations of levels)

� Fast query response

� Takes a lot of space/update time (200-500 times raw data)

• No pre-aggregation

� Slow query response (for terabytes…)

• Practical pre-aggregation: chosen combinations

� A good compromise between response time and space use

� The approach supported by ROLAP tools

u IBM DB2 UDB

u Oracle 10g

u MS Analysis Services


Using Aggregates

• The term “aggregate” refers to materialized view

• Application: aggregates used via aggregate navigator

� A set of aggregates stored in the system (i.e., DBMS)

� Given a query, the best aggregate is found, and the query is rewritten to use it

� Done by the system, not by the user

• Four design goals for aggregate usage

� Aggregates stored separately from detail data

� “Shrunk” dimensions mapped to aggregate facts

� Connection between aggregates and detail data known by system

� All queries (SQL) refers only detail data

• SUM, MIN, MAX, COUNT, AVG functions can be handled

� The AVG function requires a trick – Why and how?


Choosing Aggregates

• Practical pre-aggregation, decide what aggregates to store

• Non-trivial (NP-complete) optimization problem

� Space use

� Update speed

� Response time demands

� ……

• Choose an aggregate if it is considerably smaller than

available, usable aggregates (factor 3-5-10)

• “Implementing Data Cubes Efficiently” SIGMOD’96 paper

� Greedy approach: simple but effective

� Commerical DBMS now use similar, but more advanced techniques


Data Cube

By make

and co

lor

By color and year

By make and yearFordChevy

19911992

19931994

Sum

The data cube stores multidimensional GROUP BY relationsof tables in data warehouses

Cross Tab

Chevy Ford By color

Sum

Red

White

Blue

By make

Group By (with total)

By color

Sum

Red

White

Blue

By make

Database capable

Data warehouse capable

Aggregate

Sum

mc

m c

none


Lattice Framework

• Each lattice node represents a view / query

• Dependence relationship: Q1 ≤ Q2

� Q1 is the descendant of Q2

� Q1 can be answered using only the results of Q2

• In other words, Q1 is dependent on Q2

� The ≤ operator imposes a partial ordering on the queries

• Top view = base data (i.e., most detail)

• Essentially, the lattice captures dependency relationship among

queries / views and can be represented by a lattice graph


psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

A Data Cube Example

Scenario:

A query asks for the sales of a part

a) If view pc is available, need to process about 6M rows; OR

b) If view p is available, need to process about 0.2M rows

(I) – 8 possible views for 3 dimensions. Each view gives the total sales as per that grouping.

1. part, supplier, customer (6M rows)

2. part, customer (6M)

3. part, supplier (0.8M)

4. supplier, customer (6M)

5. part (0.2M)

6. supplier (0.01M)

7. customer (0.1M)

8. none (1)

(II) – 8 views organized into a Lattice

Problem: Given that we have space S, what views to materialize for minimizing query costs?

(III) – Picking the right views to materialize � improve performance

(IV) – View pc and sc not needed � Effective rows reduced from 19M to 7M

19 M ro

ws to

tal


Cost of answering a query based on a view and its major assumptions

Cost Model

Time to answer a query is equal to the space occupied by the query from which the query is answered

1

All queries are identical to some queries in the given lattice2

An illustration:

• To answer query Q, choose an ancestor of Q, say, Qa, that has been materialized

• We thus need to process the table of Qa

• Cost of answering Q: number of rows in the table of Qa

Model simple and realistic, Enable the design and analysis of powerful algorithms

Experimental validation of the cost model: almost linear relationship between size and running time …

Source Size S Time T Ratio m

From cell itself 1 2.07 -

From view s 10,000 2.38 .000031

From view ps 0.8M 20.77 .000023

From view psc 6M 226.23 .000037


Greedy Approach

• We plan to solve the problem by the greedy approach

� Being as greedy as possible in each step!

� Now give you a taste of this approach ……

• Example: Pay 5 kroner with the smallest number of coins

� Suppose we have many coins of 2 kroner, 1 krone, and 50 ores (i.e., 50 cents).

• How to be greedy:

� Select the largest coin: 2 kroner

� Select the largest coin again: 2 kroner

� Remaining amount = 5 – 2 – 2 = 1 krone

� We cannot select the largest coin again

� We choose the next largest coin 1 krone instead

• Only 3 coins are needed


Greedy Algorithm

Greedy algorithm: each time choose the view with the maximum “benefit”

• Given a data cube lattice with space costs associated with each view

• Always include the top view because it cannot be generated from other views

• Given a number k, we are allowed to select k views (in addition to the top view)

• B(v,S), the benefit of view v (relative to S), denotes the amount that v can improve the costs of evaluating views w, including itself

• B(v,S) = Σview w max {Cost(w,S) – Cost(w,v), 0} only positive benefit taken into account

Explanation:The Greedy algorithm

S = {top view}

for i=1 to k do begin

Select a view v not in S such

that B(v,S) is maximized;

S = S union {v};

end;

return S;


Greedy Algorithm – Example

Benefits of possible choices in 1st round

Round 1

b

c

d

e

f

g

h

50 x 5 = 250

25 x 5 = 125

80 x 2 = 160

70 x 3 =210

60 x 2 =120

99 x 1 = 99

90 x 1 = 90

100

a

C 7550 b

20 d f 40

h 10g

1

30

e

Lattice with space costs

Initial: top view a is automatically chosen

End of round 1: view b selected

• Round 1: each query answered by a, with cost 100

• Which queries can be answered by view c?

• What is the benefit of view c? Ans: 5*(100-75)=125

• Can we find a better view than view c? How about b?



Benefits of possible choices at each round

Round 1 Round 2

b

c

d

e

f

g

h

50 x 5 = 250

25 x 5 = 125

80 x 2 = 160

70 x 3 =210

60 x 2 =120

99 x 1 = 99

90 x 1 = 90

25 x 2 = 50

30 x 2 = 60

20 x 3 = 60

60 + 10 = 70

49 x 1 = 49

40 x 1 = 40

100

a

C 7550 b

20 d f 40

h 10g

1

30

e


• Round 2: views b, d, e, g, h start with cost 50 and the other views start with cost 100

• Which queries can be improved by view f?

• Benefit of view f? Ans: (100-40)+(50-40)=70

• How about the benefit of view c?

Round 1: view b selected

End of round 2: view f selected



Benefits of possible choices at each round

Round 1 Round 2 Round 3

b

c

d

e

f

g

h

50 x 5 = 250

25 x 5 = 125

80 x 2 = 160

70 x 3 =210

60 x 2 =120

99 x 1 = 99

90 x 1 = 90

25 x 2 = 50

30 x 2 = 60

20 x 3 = 60

60 + 10 = 70

49 x 1 = 49

40 x 1 = 40

25 x 1 = 25

30 x 2 = 60

2 x 20 + 10 =50

49 x 1 = 49

30 x 1 = 30

100

a

C 7550 b

20 d f 40

h 10g

1

30

e


• Round 1: view b selected

• Round 2: view f selected

• End of round 3: view d selected

• After 3 rounds, total cost is 420 (initial cost 800)

• In summary, each round pick the view with the maximum “benefit”

• Round 3: views b, d, e, g start with cost 50; views f and h start with cost 40; the other views start with cost 100

• How about the benefit of view e now?


Greedy Algorithm vs Optimal Choice

Situations where the algorithm does poorly …

The benefit of Greedy is guaranteed to be at least 63% of the benefit of the optimal algorithm (see the original paper)

A Lattice not good for Greedy

200

a

d 100100 b

20 Nodes total 1000

c

99

20 Nodes total 1000

20 Nodes total 1000

20 Nodes total 1000

• Round 1: Picks c whose benefit is 4141

• Round 2: Picks b or d with benefit of 2100 each

• Greedy results in 4141+2100 = 6241

• But, the optimal choice is to pick b and d

• b and d improve by 100 for itself and all 80 nodes below resulting in total benefits of 8200

• Ratio of greedy/optimal=6241/8200=76%


Greedy Algorithm – Space-Time Trade-Off

Greedy order of view selection example

Selection Benefit Space

1

2

3

4

5

6

cp

7

8

9

10

11

12

ns

nt

c

p

cs

np

ct

t

n

s

none

infinite

24M rows

12M

5.9M

5.8M

1M

1M

0.01M

small

small

small

small

Time

72M rows

48M

36M

30.1M

24.3M

23.3M

23.3M

23.3M

23.3M

23.3M

23.3M

23.3M

23.3M

23.3M

23.3M

23.3M

23.3M

6M rows

6M

6M

6.1M

11.3M

16.3M

6.3M

Experiment with a lattice: Important to materialize only the first few (5) views, performance increases at first…

30

20

10

0

70

60

50

40

80

2 4 6 8 10 12

Time / Space

Views

Total Time

Total Space


Optimal Cases and Anomalies

Two situations where the algorithm is optimal …

� If a1 is much larger than the

other a’s, the greedy is close

to optimal

� If all the a’s are equal then

greedy is optimal

… but there are also two situations where the algorithm is not realistic

Views in a lattice are unlikely

to have the same probability

of being requested in a query

Instead of asking for some

fixed number of views to

materialize, should instead

allocate a fixed amount of

space to views


The size of views grows exponentially, until it reaches the size of the raw data at rank [logr m], i.e. the cliff

Number of group-by attributes

n

Size of views

m

Log r m

Effect of the number of group-by attributes on the size of views

• Each domain size is r

• Top element has m cells appearing in raw data

• If group on i attributes, cube has ri cells

• If ri ≥ m, then each cell will have at most one data point and m of the cells with be nonempty. Use m as the size of any view.

• If ri <m, then almost all ri cells will have at least one data point. Space cost is ri as several data points can be collapsed into one aggregate

Assumptions and basis of reasoning

This explains why grouping of 2 attributes (p,c), (s,c) have the same size as (p,s,c) at 6 M rows in the figure on slide #10

Hypercube Lattices – Observations


Space-Optimal and Time-Optimal Solutions

What is the average time for a query when the space is optimal ?

• Space is minimized when only the top view is materialized

• Every query would take time m

• Total time cost for all 2n queries is m2n

Is there sense to minimize time by materializing all views ?

• No gain past the cliff

• No point to do so

• Nature of time-optimal solution depends on

1. Rank k = [logr m] at which the cliff occurs; and

2. Rank j, such that rj = (nj) is

maximized


Hierarchies are important as they underlie two commonly used query operations, Drill-down and Roll-up …

Day

Week Month

Year

none

• Year ≤ Month ≤ Day

• Week ≤ Day; but

• Month ≤ week; week ≤ month

A common hierarchy … … and its dependency relations

… but, hierarchies introduce query dependencies that must be accounted for when determining which queries to materialize; and this can be complex

Hierarchies & the Lattice Framework


Dependencies caused by different dimensions and attribute hierarchiescan be combined into a Direct Product lattice …

Composite Lattices

… assuming views can be created by independently grouping any or no member of the hierarchy for each dimension

Combining two hierarchical dimensions

+

customer

nation

none

0.1M

25

1

product

size type

none

6M

1

50 150

A Direct Product lattice

=

cp 6M

c 0.1M

cs 5M

np 5M

nt 3750

n 25

ns 1250

p 0.2M

t 150

none 1

s 50

ct 5.99M


Exercise for Composite Lattice

+

Product P2

Type P1

None P0

• Consider hierarchies of two dimensions

� Product: Product (P2) � Type (P1) � ALL (P0=∅)

� Store: Store (S2) � Area (S1) � ALL (S0=∅)

• Draw the composite lattice

� How many nodes does the lattice have?

� Enumerate them

� Connect the nodes based on their dependency

Store S2

Area S1

None S0

none

P2 S2

???


Applicability of the Lattice Framework

Advantages of the lattice framework

• Clean framework

� A clean framework to work with dimensional hierarchies, since hierarchies are themselves lattices

• Easy to model dependencies

� Dependency relationship among queries captured by the lattice

• Order of materialization

� A simple descending-order topological sort on lattice nodes gives the required order of materialization – more in the next lecture


Summary on Choosing Aggregates

1. Lattice framework

2. Linear cost model

3. Greedy algorithm (How it works?)

4. Space-time trade-off


Maintaining Views


View Maintenance

• How and when should we refresh materialized views?

• Total re-computation

� Too expensive

• Incremental view maintenance

� Apply only changes since last refresh to view

� Insert new rows Ri and delete existing rows Rd

� Update = delete + insert

• Additional info must be stored in views

� To make the views self-maintainable

� Store “number of instances” cv along with each row v in V


View Maintenance

• Projection views (with DISTINCT in SQL)

� V=Π(R)

� (r,c): r appears “c” times in R

� If (r,ci)∈Π(Ri) and (r,cr)∈V then cr=cr+ci, otherwise insert (r,ci) into V

� If (r,cd)∈Π(Rd) and (r,cr)∈V then cr=cr–cd, delete from V if cr=0

• Example

� V=Πx(R)

� How does V change after these operations?

� Insertion

u Insert the tuple (3,7) into R

u Insert the tuple (7,9) into R

� Deletion

u Delete the tuple (3,4) from R

u Delete the tuple (6,9) from R

96

45

43

33

21

yx

6 (1)

5 (1)

3 (2)

1 (1)

x (c)

R

V


View Maintenance

• Join views V=R S

� Compute Ri S and add to V, update counts

� Compute Rd S and subtract from V, update

counts

• Example

� V=R S

� How does V change after these operations?

� Insertion

u Insert the tuple (3) into R



� Deletion

u Delete the tuple (2) from R

u Delete the tuple (3) from R

u Delete the tuple (7) from S

9

7

3

3

2

y

9 (1)

3 (2)

2 (1)

y (c)

V

9

4

4

3

2

y

R S


Aggregation View Maintenance

• COUNT� Like projection view, but we maintain a tuple for each group g

u The group g can contain multiple attributes

u E.g., GROUP BY color, year; an instance of g = blue, 1993

� Maintain <g,count> � Update count based on inserts/deletes � Insert <g,1> rows for new values, delete rows from V with count=0

• SUM� Maintain <g,sum,count> � Update count and sum based on inserts/deletes� Insert <g,a,1> rows for new values, delete rows from V with count=0

• AVG computed as sum/count

• MIN (similar for MAX)� Maintain <g,minimum,count>� Insert (g,m): update minimum,count based on m =,<,> minimum� Delete (g,m): update minimum,count based on m =,<,> minimum

� If count=0, then we need to search the base table (expensive). Why?


Practical View Maintenance

• Maintenance policies – When to synchronize?

� Immediate – in same transaction as base change

� Lazy – when V is used for the first time after base updates

� Periodic, e.g., once a day, often together with base load

� Forced – after a certain number of changes

• Updating aggregates

� Computation outside DBMS in flat files (still relevant?)

� Built by loader

� Computation in DBMS using SQL

� Can be expensive: DBMS must be tuned for this

• Supported by tool/DBMS

� Oracle, SQL Server, DB2


Indexing

• Index used in combination with aggregates� Index on dimension tables and on materialized views

• Fact table� Build primary B+-tree index on dimension keys (primary key) ?

� Build indices on each dimension key separately (index intersection)

� Indices on combinations of dimension keys? (many!)

• Sort order is important (index-organized tables)� Possible to compress data (values not repeated)

� Can save aggregates due to fast sequential scan

� Best sort order (almost) always “Time”

• Dimension tables� Build indices on many/all individual columns (think of type)

� Build indices on common combinations

• Hash indices� Efficient for un-sorted data (e.g., color)


Bitmap Indices

• A B+-tree index stores a list of RowIDs for each value

� A RowID takes ~8 bytes

� Large space use for columns with low cardinality (gender, color)

� Example: Index for 1 bio. rows with gender takes 8 GB

� Not efficient to do “index intersection” for these columns

• Idea: make a “position bitmap” for each value (only two)

� Female: 01110010101010…

� Male: 10001101010101…

� Takes only (num. of values)*(num. of rows)*1 bit

� Example: bitmap index on gender (as before) takes only 256 MB

� Very efficient to do “index intersection” (AND/OR) on bitmaps

u Intersection of 32 bits done in a single CPU instruction (word length=32)

� Can be improved for for higher cardinality using compression

• Supported by some RDBMS (DB2, Oracle)


Using Bitmap Indices

• Query example� Find female customers in Aalborg with black hair and blue eyes

� Female: 01010101010

� Aalborg: 00000011111

� Black: 10110110110

� Blue: 01101101111

� Result 00000000010 – use AND, only one such customer

• Numeric attributes can also be handled� Find … Salary BETWEEN 22000 AND 29000

� Use the binning technique, i.e., group every C values into a bin

u E.g., group every 5000 values into a bin, and assign a bitmap for it

u Bitmap for [20000-25000): 001001001

u Bitmap for [25000-30000): 010010010

u OR together: 011011011 (Why is it OR but not AND?)

u Refinement step: follow those records and check their actual salaries

� Tradeoff between storage size and index effectiveness

Do we need to consider the bitmap for [15000-20000) ?


Compressed Bitmaps

• Problem: space use

� With m possible values and n records: n*m bits required

� However, probability of a 1 is 1/m => very few 1’s

• Solution: compressed bitmaps

� Run-length encoding

� A run is i 0’s followed by a 1

� Concatenating binary numbers won’t work – no unique decoding

� Instead, determine j – number of bits in binary representation of i

� Encode as “<j-1 1’s>”+“0”+“<i in binary>”

� Encode next run similarly, trailing 0’s not encoded

� Example: 000000010000 encoded as 110111

� j>1 => first bit of i is 1 – this bit can be saved – encoded as 11011

� Decoding: scan bits to find j, scan next j-1 bits to find i, find next 1

� Example: j = 3, i = 7 (11) => bitmap = 00000001 + 0000 (trailing)


Managing Bitmaps

• Compression factor� Assume m=n (unique values)

� Each value has just one run of length i < n

� Each run takes at most 2 log2 n bits (j <= log2 n)

� Total space consumption: 2n log2 n bits (compared to n2)

• Operations on compressed bitmaps� Decompress one run at a time and produce relevant 1’s in output

• Finding and storing bitvectors� Index with B-trees + store in blocks/block chains

• Finding records� Use secondary or primary index (hashing or B-tree)

• Handling modifications� Deletion: “retire” record number + update bitmaps with 1’s

� Insertion: add new record to file + update bitmaps with 1’s (trail 0’s)

� Updates: update bitmaps with old and new 1’s (expensive)


Mini Project

• Subtask #5

� Build reports in Reporting Services

� Check the Reporting Services tutorial

at the mini-project website

• Subtask #6

� DW performance optimization

� Build aggregates

u SQL Server Management Studio

u Connect to Analysis Services

u Find your cube and ……

• Short demo. session on this today


Summary

• Choosing Aggregates

• Maintaining Views

• Bitmap Indices

Documents

DW Performance Optimization (I) - AAU