47
COMP5331 1 COMP5331 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

Embed Size (px)

Citation preview

Page 1: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 1

COMP5331

Data Warehouse

Prepared by Raymond WongPresented by Raymond Wong

raywong@cse

Page 2: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 2

Data Warehouse

Also called Online Analytical Processing (OLAP)

Many corporations use data warehouses for their analysis

Page 3: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 3

Data Warehouse

Databases Users

Databases UsersData Warehouse

Need to wait for a long time (e.g., 1 day to 1 week)

Pre-computed results

Query

Page 4: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 4

Advantages

Fast Query Response

Page 5: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 5

Data Warehouse

Problem Data Warehouse NP-hardness

Algorithm Performance Study

Page 6: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 6

Data WarehouseParts are bought from suppliers and then sold to customers at a sale price SP

part supplier customer SP

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

Page 7: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 7

Data WarehouseParts are bought from suppliers and then sold to customers at a sale price SP

part supplier customer SP

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

part

p1 p2 p3 p4 p5 suppliers1s2

s3s4

customer

c1

c2

c3

c4

4

3

Data cube

Page 8: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 8

Data WarehouseParts are bought from suppliers and then sold to customers at a sale price SP

e.g., select part, customer, SUM(SP)from table Tgroup by part, customerpart

customer

SUM(SP)

p1 c1 4

p3 c2 3

p2 c1 7

e.g., select customer, SUM(SP)from table Tgroup by customer

customer

SUM(SP)

c1 11

c2 3

pc 3 c 2

part supplier customer SP

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

AVG(SP), MAX(SP), MIN(SP), …

Page 9: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 9

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Parts are bought from suppliers and then sold to customers at a sale price SP

part supplier customer SP

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

Page 10: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 10

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Suppose we materialize all views. This wastes a lot of space.

Cost for accessing pc = 4M

Cost for accessing ps = 0.8M

Cost for accessing sc = 2MCost for accessing p = 0.2M

Cost for accessing c = 0.1MCost for accessing s = 0.01M

Page 11: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 11

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Suppose we materialize the top view only.

Cost for accessing pc = 6M(not 4M)

Cost for accessing ps = 6M(not 0.8M)

Cost for accessing sc = 6M(not 2M)Cost for accessing p = 6M

(not 0.2M)

Cost for accessing c = 6M(not 0.1M)

Cost for accessing s = 6M(not 0.01M)

Page 12: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 12

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Suppose we materialize the top view and the view for “ps” only.

Cost for accessing pc = 6M(still 6M)

Cost for accessing sc = 6M(still 6M)

Cost for accessing p = 0.8M(not 6M previously)

Cost for accessing ps = 0.8M(not 6M previously)

Cost for accessing c = 6M(still 6M)

Cost for accessing s = 0.8M(not 6M previously)

Page 13: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 13

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Suppose we materialize the top view and the view for “ps” only.

Cost for accessing pc = 6M(still 6M)

Cost for accessing sc = 6M(still 6M)

Cost for accessing p = 0.8M(not 6M previously)

Cost for accessing ps = 0.8M(not 6M previously)

Cost for accessing c = 6M(still 6M)

Cost for accessing s = 0.8M(not 6M previously)

Gain = 0Gain = 5.2M

Gain = 0Gain = 5.2M

Gain = 5.2M Gain = 0

Gain({view for “ps”, top view}, {top view}) = 5.2*3 = 15.6

Selective Materialization Problem:We can select a set V of k views such that Gain(V U {top view}, {top view}) is maximized.

Page 14: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 14

Data Warehouse

Problem Data Warehouse NP-hardness

Algorithm Performance Study

Page 15: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 15

NP-hardness

Selective Materialization Problem is NP-hard.

Selective Materialization Problem:We can select a set V of k views such that Gain(V U {top view}, {top view}) is maximized.

Page 16: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 16

NP-hardness

Selective Materialization Decision Problem (SMD) Given an integer k and a real number J,

We want to find a set V of k views such that Gain(V U {top view}, {top view}) is at least J.

Selective Materialization Decision Problem is NP-hard.

Selective Materialization Problem:We can select a set V of k views such that Gain(V U {top view}, {top view}) is maximized.

Page 17: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 17

NP-hardness Exact Cover by 3-Sets (XC)

Instance: Set X with 3q elements, and a collection C of size 3 subsets of X

Question: Does C contain an exact cover for X, i.e., a subcollection C’ C such that every element of X occurs in exactly one set of C’.

It is well-known that this problem is NP-complete.

Page 18: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 18

NP-hardness Problem XC can be transformed to Problem SMD

Create a root node with size = 200 (at level 1) Create a bottom node with size 1 (at level 4) For each element x in X,

Create a node Nx with size = 50 at level 3 Create an edge between Nx and the bottom node

For each element a C (where a = (x, y, z))

Create a node Na with size = 100 at level 2 Create an edge between Na and the root node Create an edge between Na and Nx Create an edge between Na and Ny Create an edge between Na and Nz

Set k = q Set J = 400q

Page 19: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 19

NP-hardness

E.g., X = {A, B, C, D, E, F} C = {(A, B, C), (B, C, D), (D, E, F)}

A B C D E F

200

1

50 50 50 50 50 50

100 100 100

k = 2

q = 2

J = 400x2 = 800

Page 20: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 20

NP-hardness

It is easy to verify that solving the problem SMD is equal to solving problem XC

Problem SMD is NP-hard.

Page 21: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 21

Data Warehouse

psc 6M

pc 4M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Parts are bought from suppliers and then sold to customers at a sale price SP

part supplier customer SP

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

Page 22: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 22

Data Warehouse

psc 6M

pc 6M ps 0.8M sc 2M

p 0.2M s 0.01M c 0.1M

none 1

Parts are bought from suppliers and then sold to customers at a sale price SP

part supplier customer SP

p1 s1 c1 4

p3 s1 c2 3

p2 s3 c1 7

… … … …

Table T

Page 23: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 23

Data Warehouse

Problem Date Warehouse NP-hardness

Algorithm Performance Study

Page 24: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 24

Greedy Algorithm k = number of views to be materialized

Given v is a view S is a set of views which are selected to be

materialized Define the benefit of selecting v for

materialization as B(v, S) = Gain(S U {v}, S)

Page 25: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 25

Greedy Algorithm

S {top view}; For i = 1 to k do

Select that view v not in S such that B(v, S) is maximized;

S S U {v} Resulting S is the greedy selection

Page 26: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 26

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

Benefit from pc =

Benefit

6M-6M = 0 k = 2

Page 27: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 27

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

5.2 x 3= 15.6

Benefit from ps =

Benefit

6M-0.8M = 5.2M k = 2

Page 28: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 28

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

5.2 x 3= 15.6

Benefit from sc =

Benefit

6M-6M = 0

0 x 3= 0

k = 2

Page 29: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 29

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

5.2 x 3= 15.6

Benefit from p =

Benefit

6M-0.2M = 5.8M

0 x 3= 0

5.8 x 1= 5.8

k = 2

Page 30: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 30

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

5.2 x 3= 15.6

Benefit from s =

Benefit

6M-0.01M = 5.99M

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

k = 2

Page 31: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 31

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

5.2 x 3= 15.6

Benefit from c =

Benefit

6M-0.1M = 5.9M

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

k = 2

Page 32: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 32

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

Benefit from pc = 6M-6M = 0

0 x 2= 0

k = 2

Page 33: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 33

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

Benefit from sc = 6M-6M = 0

0 x 2= 0

0 x 2= 0

k = 2

Page 34: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 34

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

Benefit from p = 0.8M-0.2M = 0.6M

0 x 2= 0

0 x 2= 0

0.6 x 1= 0.6

k = 2

Page 35: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 35

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

Benefit from s = 0.8M-0.01M = 0.79M

0 x 2= 0

0 x 2= 0

0.6 x 1= 0.6

0.79 x 1= 0.79

k = 2

Page 36: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 36

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

Benefit from c = 6M-0.1M = 5.9M

0 x 2= 0

0 x 2= 0

0.6 x 1= 0.6

0.79 x 1= 0.79

5.9 x 1= 5.9

k = 2

Page 37: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 37

1.1 Data Cube

psc 6M

pc 6M ps 0.8M sc 6M

p 0.2M s 0.01M c 0.1M

none 1

1st Choice (M)

2nd Choice (M)

pc

ps

sc

p

s

c

0 x 3= 0

5.2 x 3= 15.6

Benefit

0 x 3= 0

5.8 x 1= 5.8

5.99 x 1= 5.99

5.9 x 1= 5.9

0 x 2= 0

0 x 2= 0

0.6 x 1= 0.6

0.79 x 1= 0.79

5.9 x 1= 5.9

Two views to be materialized are1. ps2. c

V = {ps, c} Gain(V U {top view}, {top view})= 15.6 + 5.9 = 21.5

k = 2

Page 38: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 38

Data Warehouse

Problem Data Warehouse NP-hardness

Algorithm Performance Study

Page 39: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 39

Performance Study

How bad does the Greedy Algorithm perform?

Page 40: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 40

1.1 Data Cube

a 200

b 100 c 99 d 100

p1 97

none 1

1st Choice (M) 2nd Choice (M)

b

c

d

… … …

41 x 100= 4100

Benefit from b =

Benefit

200-100= 100

20 nodes …

p20 97

q1 97

q20 97

r1 97

r20 97

s1 97

…s20 97

k = 2

Page 41: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 41

1.1 Data Cube

a 200

b 100 c 99 d 100

p1 97

none 1

1st Choice (M) 2nd Choice (M)

b

c

d

… … …

41 x 100= 4100

Benefit from c =

Benefit

200-99 = 101

20 nodes …

p20 97

q1 97

q20 97

r1 97

r20 97

s1 97

…s20 97

41 x 101= 4141

k = 2

Page 42: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 42

1.1 Data Cube

a 200

b 100 c 99 d 100

p1 97

none 1

1st Choice (M) 2nd Choice (M)

b

c

d

… … …

41 x 100= 4100

Benefit

20 nodes …

p20 97

q1 97

q20 97

r1 97

r20 97

s1 97

…s20 97

41 x 101= 4141

41 x 100= 4100

k = 2

Page 43: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 43

1.1 Data Cube

a 200

b 100 c 99 d 100

p1 97

none 1

1st Choice (M) 2nd Choice (M)

b

c

d

… … …

41 x 100= 4100

Benefit

20 nodes …

p20 97

q1 97

q20 97

r1 97

r20 97

s1 97

…s20 97

41 x 101= 4141

41 x 100= 4100

Benefit from b = 200-100= 100

21 x 100= 2100

k = 2

Page 44: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 44

1.1 Data Cube

a 200

b 100 c 99 d 100

p1 97

none 1

1st Choice (M) 2nd Choice (M)

b

c

d

… … …

41 x 100= 4100

Benefit

20 nodes …

p20 97

q1 97

q20 97

r1 97

r20 97

s1 97

…s20 97

41 x 101= 4141

41 x 100= 4100

21 x 100= 2100

21 x 100= 2100

Greedy: V = {b, c} Gain(V U {top view}, {top view})= 4141 + 2100 = 6241

k = 2

Page 45: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 45

1.1 Data Cube

a 200

b 100 c 99 d 100

p1 97

none 1

1st Choice (M) 2nd Choice (M)

b

c

d

… … …

41 x 100= 4100

Benefit

20 nodes …

p20 97

q1 97

q20 97

r1 97

r20 97

s1 97

…s20 97

41 x 101= 4141

41 x 100= 4100 41 x 100= 4100

Greedy: V = {b, c} Gain(V U {top view}, {top view})= 4141 + 2100 = 6241

21 x 101 + 20 x 1= 2141

Optimal: V = {b, d} Gain(V U {top view}, {top view})= 4100 + 4100 = 8200

k = 2

Page 46: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 46

1.1 Data Cube

a 200

b 100 c 99 d 100

p1 97

none 1

20 nodes …

p20 97

q1 97

q20 97

r1 97

r20 97

s1 97

…s20 97

Greedy: V = {b, c} Gain(V U {top view}, {top view})= 4141 + 2100 = 6241

Optimal: V = {b, d} Gain(V U {top view}, {top view})= 4100 + 4100 = 8200

Greedy

Optimal=

6241

8200=0.7611

If this ratio = 1, Greedy can give an optimal solution. If this ratio 0, Greedy may give a “bad” solution.

Does this ratio has a “lower” bound?

It is proved that this ratio is at least 0.63.

k = 2

Page 47: COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

COMP5331 47

Performance Study

This is just an example to show that this greedy algorithm can perform badly.

A complete proof of the lower bound can be found in the paper.