Speeding Up Warehouse Physical Design Using A Randomized Algorithm Minsoo Lee Joachim Hammer Dept. of Computer & Information Science & Engineering University of Florida


Page 1: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Minsoo Lee

Joachim Hammer

Dept. of Computer & Information Science & Engineering University of Florida

Page 2: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

View Selection Problem

• What’s a Data Warehouse?
  – stores information collected from multiple, heterogeneous information sources to support complex querying and analysis
• Materialized Views in a DW
  – pre-computed portions of frequently asked queries
  – maintenance: incremental, periodic refresh
• View Selection Problem
  – decide which views to materialize in the DW
  – considers query response time, maintenance cost(?), and storage cost

Page 3: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Overview of Our Problem

• Maintenance-cost View Selection Problem [GM99]
  – decide which views to materialize in the DW
  – minimize query response time, given an upper bound on maintenance cost (storage space is not considered)
• DW configuration based on OR view graphs
  – any view can be computed from any of its related views

[Figure: OR view graph. Node labels: g, h, i, k; b, c, d; a. Legend: Views, Base tables. Incoming edges are grouped by OR arcs.]

Page 4: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Problems with existing Solutions

• Existing Solutions to the View Selection Problem
  – Heuristics-based Search
    • Greedy Heuristics [Gup97, GM99]
    • A* [Rou82, LQA97, GM99]
  – Exhaustive Search [TS97]
• Problems
  – Do not scale up well beyond 20 views
  – Time complexity is polynomial
  – DW evolution requires efficient re-computation of a configuration

Page 5: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Outline of Our Approach

• Use Randomized Algorithms
  – Randomized algorithms provide a good solution within a small amount of time (time/quality tradeoff)
  – Specifically, use Genetic Algorithms (GA)
• Advantages of Our Approach
  – Near-linear scaleup with a solution within 10% of optimal
  – Supports DW evolution with fast reconfiguration

Page 6: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Genetic Algorithms

Loop until termination condition = true
    t = t + 1
    Select P(t) from P(t-1)
    Recombine P(t)
    Evaluate P(t)

[Figure: a population of genomes evolving over generation t, generation t+1, generation t+2]
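The loop above can be sketched in Python as a generic generational GA. This is only an illustration, not the GAlib-based prototype: the names `run_ga` and `demo`, the toy "count the 1 bits" fitness, the tournament selection, and all parameter values are assumptions made for the example.

```python
import random

def run_ga(population, evaluate, select, recombine, max_generations, rng):
    """Generational loop: select P(t) from P(t-1), recombine, evaluate."""
    fitness = [evaluate(g) for g in population]
    best = max(zip(fitness, population))          # best (fitness, genome) so far
    t = 0
    while t < max_generations:                    # termination condition
        t = t + 1
        parents = select(population, fitness, rng)    # Select P(t) from P(t-1)
        population = recombine(parents, rng)          # Recombine P(t)
        fitness = [evaluate(g) for g in population]   # Evaluate P(t)
        best = max(best, max(zip(fitness, population)))
    return best

def demo():
    """Toy run (hypothetical settings): maximize the number of 1 bits."""
    rng = random.Random(0)
    n_bits, pop_size = 12, 30                     # pop_size is even on purpose
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]

    def select(pop, fit, rng):
        # Placeholder selection: binary tournament (not the roulette wheel).
        chosen = []
        for _ in pop:
            a, b = rng.randrange(len(pop)), rng.randrange(len(pop))
            chosen.append(pop[a] if fit[a] >= fit[b] else pop[b])
        return chosen

    def recombine(parents, rng):
        children = []
        for i in range(0, len(parents) - 1, 2):
            cut = rng.randrange(1, n_bits)        # one-point crossover
            p, q = parents[i], parents[i + 1]
            children.append(p[:cut] + q[cut:])
            children.append(q[:cut] + p[cut:])
        for child in children:                    # occasional single bit flip
            if rng.random() < 0.1:
                k = rng.randrange(n_bits)
                child[k] = 1 - child[k]
        return children

    return run_ga(population, sum, select, recombine, 40, rng)

best_fitness, best_genome = demo()
```

Passing the operators in as functions keeps the loop itself identical to the pseudocode, so selection, recombination, and evaluation strategies can be swapped without touching it.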

Page 7: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

GA : Representation of Solution

• Genome
  – Candidate solution of the problem to be solved
  – Represented as a string
  – Ordering problems: alphanumeric string
    Selection problems: binary string
• Binary String Representation
  – ex) v1 v2 v3 v4 v5
        0  1  0  0  1     views v2 and v5 are selected
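As a small illustration of the binary string representation, the sketch below decodes a genome into the list of selected views. The helper name `selected_views` is hypothetical.

```python
def selected_views(genome, view_names):
    """Return the names of the views whose bit is set to 1."""
    if len(genome) != len(view_names):
        raise ValueError("genome and view list must have the same length")
    return [name for bit, name in zip(genome, view_names) if bit == 1]

views = ["v1", "v2", "v3", "v4", "v5"]
chosen = selected_views([0, 1, 0, 0, 1], views)   # the example from the slide
```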

Page 8: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

GA : Initialization of Population

• Initial Population in our experiments
  – Pool of randomly generated bit strings
  – Population size is 300
• Future experiments
  – generate a more favorable initial population
  – use external knowledge of the problem
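A minimal sketch of the random initialization, assuming each bit is drawn independently with equal probability. The function name, the seed, and the choice of 20 views are illustrative; only the population size of 300 comes from the slide.

```python
import random

def random_population(pop_size, n_views, rng):
    """Build pop_size genomes, each a list of n_views random bits."""
    return [[rng.randint(0, 1) for _ in range(n_views)]
            for _ in range(pop_size)]

# 300 genomes as on the slide; 20 views and the seed are assumptions.
population = random_population(300, 20, random.Random(42))
```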

Page 9: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

GA : Selection, Crossover, Mutation, Termination

• Selection
  – Select superior genomes among the previous population
  – Roulette Wheel Method [Mic94]
• Crossover
  – applied to two genomes by exchanging information
• Mutation
  – applied to a single genome: ex) flip a bit in the genome
• Termination
  – termination condition: 400 generations
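The three operators can be sketched as follows. The slides do not say which crossover variant or mutation rate was used, so one-point crossover and a per-bit flip rate are assumptions here, and the function names are hypothetical. The roulette wheel assumes non-negative fitness values.

```python
import random

def roulette_select(population, fitness, rng):
    """Pick one genome with probability proportional to its fitness."""
    if any(f < 0 for f in fitness):
        raise ValueError("roulette wheel needs non-negative fitness values")
    total = sum(fitness)
    if total <= 0:
        return rng.choice(population)     # all zero: fall back to uniform
    spin = rng.random() * total           # point on the wheel, in [0, total)
    running = 0.0
    for genome, f in zip(population, fitness):
        running += f
        if spin < running:
            return genome
    return population[-1]                 # guard against float round-off

def one_point_crossover(a, b, rng):
    """Exchange the tails of two equal-length genomes at a random cut."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def flip_mutation(genome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in genome]
```

All three functions return new lists rather than modifying their inputs, so genomes from the previous generation stay intact while the next one is built.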

Page 10: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

GA : Evaluation Process

• Fitness Function
  – measures how good a genome is as a solution
  – high: close to optimal, low: further from optimal
• Use Penalty Function in Fitness Function
  – similar to the solution for the 0/1 knapsack problem [Mic94]
  – Evaluate query benefit. If the maintenance limit is exceeded, apply a penalty.

Page 11: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

GA : Evaluation Process

• Penalty Functions
  – Logarithmic (LG)
  – Linear (LN)
  – Exponential (EX)
• Penalty Application Methods
  – Subtract (S)
  – Divide (D)
  – Subtract & Divide (SD)
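The slides name the penalty shapes and application methods but do not give their formulas, so the sketch below is one plausible reading rather than the authors' definitions. Assumptions: the penalty is computed on the relative overshoot of the maintenance limit, scaled by a hypothetical factor `rho`; divide mode divides by `1 + penalty`; and SD subtracts when the benefit exceeds the penalty and divides otherwise.

```python
import math

def penalty(kind, overshoot, rho=1.0):
    """Penalty for a relative overshoot (assumed functional forms)."""
    if overshoot <= 0:
        return 0.0
    if kind == "LG":
        return math.log2(1.0 + rho * overshoot)   # logarithmic
    if kind == "LN":
        return rho * overshoot                    # linear
    if kind == "EX":
        return math.exp(rho * overshoot) - 1.0    # exponential
    raise ValueError("unknown penalty function: " + kind)

def fitness(benefit, maintenance, limit, kind="EX", mode="D", rho=1.0):
    """Query benefit, penalized only when the maintenance limit is exceeded."""
    if limit <= 0:
        raise ValueError("maintenance limit must be positive")
    overshoot = max(0.0, (maintenance - limit) / limit)
    pen = penalty(kind, overshoot, rho)
    if pen == 0.0:
        return benefit                            # constraint satisfied
    if mode == "S":
        return max(0.0, benefit - pen)            # subtract, floor at zero
    if mode == "D":
        return benefit / (1.0 + pen)              # divide
    if mode == "SD":
        # Assumed rule: subtract while that stays positive, else divide.
        return benefit - pen if benefit > pen else benefit / (1.0 + pen)
    raise ValueError("unknown application method: " + mode)
```

In practice `rho` would need tuning so that the penalty is commensurate with the benefit values; with a very small `rho`, subtract mode barely discourages infeasible genomes.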

Page 12: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Evaluation of the Algorithm

• Environment
  – Pentium II 450 MHz PC, Windows NT 4.0
• OR-view graphs
  – number of base tables: 10 tables
  – number of views: 5-20 views
  – edge density of graph: 15%, 30%, 50%, 75%
  – parameters for node (view)
      RC: 100 - 10000 for base tables
      QF: 0.1 - 0.9
      UF: 0.1 - 0.9
  – parameters for edge
      QC: 10 - 80% of RC of source view
      MC: 10 - 150% of QC

Page 13: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Results : Quality of Solutions

[Figure: Average ratios of (optimal total query cost/GA total query cost). x-axis: (density, # of views, maint. constraint). y-axis: total query cost ratio (optimal/GA) %, scale 90 to 110. Series: LN-D, LN-SD, EX-D, EX-SD.]

[Figure: Average ratios of (GA total maintenance cost/maintenance constraint). x-axis: (density, # of views, maint. constraint). y-axis: total maintenance cost ratio (GA/optimal) %, scale 0 to 450. Series: LN-D, LN-SD, EX-D, EX-SD.]

Page 14: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Results : Execution Time

[Figure: Execution time for exhaustive search algorithm. x-axis: (density, # of views, maint. constraint). y-axis: time (sec), scale 0 to 120. Series: Exhaustive Search.]

[Figure: Execution time for genetic algorithm. x-axis: (density, # of views, maint. cost). y-axis: time (sec), scale 0 to 1.6. Series: LN-D, LN-SD, EX-D, EX-SD.]

Page 15: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Prototype Development

• Used version 2.4.3 of GAlib from MIT
• Microsoft Visual C++
• Encoded our own Fitness Function
  – strategy for penalty is controlled by a control variable
• Encoded OR-view graph cost evaluation functions
  – total query cost, total maintenance cost
• OR-view graph costs
  – Node: Read Cost, Query Frequency, Update Frequency
  – Edge: Query Cost, Maintenance Cost
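The two cost evaluation functions might look like the Python sketch below. It is written in the spirit of the OR view graph model of [GM99], not taken from the prototype, and rests on several assumptions: a query on a view is answered from the cheapest reachable base table or materialized view (sum of edge query costs plus the read cost of that source); a materialized view is maintained from its cheapest reachable source (sum of edge maintenance costs); and costs are weighted by query and update frequency. All function names and the sample graph are hypothetical.

```python
import heapq

def path_costs(edges, start, weight_index):
    """Dijkstra from `start`; edges[v] = [(source, query_cost, maint_cost)]."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue                              # stale heap entry
        for edge in edges.get(node, []):
            source, nd = edge[0], d + edge[weight_index]
            if nd < dist.get(source, float("inf")):
                dist[source] = nd
                heapq.heappush(heap, (nd, source))
    return dist

def total_query_cost(nodes, edges, base_tables, materialized):
    """Sum over views of QF * cheapest way to answer a query on the view."""
    available = set(base_tables) | set(materialized)
    total = 0.0
    for view, (_, qf, _) in nodes.items():
        if view in base_tables:
            continue
        dist = path_costs(edges, view, 1)         # index 1: query cost
        total += qf * min(d + nodes[s][0] for s, d in dist.items()
                          if s in available)
    return total

def total_maintenance_cost(nodes, edges, base_tables, materialized):
    """Sum over materialized views of UF * cheapest maintenance path."""
    available = set(base_tables) | set(materialized)
    total = 0.0
    for view in materialized:
        dist = path_costs(edges, view, 2)         # index 2: maintenance cost
        total += nodes[view][2] * min(d for s, d in dist.items()
                                      if s in available and s != view)
    return total

# Hypothetical graph: base table a; views b and c from a; g from b OR c.
# nodes[name] = (read cost, query frequency, update frequency)
nodes = {"a": (1000, 0.0, 0.0), "b": (200, 0.5, 0.2),
         "c": (300, 0.3, 0.4), "g": (50, 0.9, 0.1)}
edges = {"b": [("a", 400, 100)], "c": [("a", 500, 200)],
         "g": [("b", 60, 30), ("c", 40, 20)]}
```

With these two functions, the query benefit of a genome could be taken as the drop in total query cost relative to materializing nothing, and the total maintenance cost is what gets compared against the maintenance limit.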

Page 16: Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Conclusion

• Use of Genetic Algorithm for Maintenance-cost View Selection Problem
  – yields a solution within 10% of the optimal solution
  – linear scale-up of execution time w.r.t. the number of views
  – EX-D and EX-SD strategies produce the best results
  – Suitable for use in DW evolution
• Future work
  – experiments with a better initial population
  – various crossover and mutation operators, termination conditions
  – AND-OR views, indexes
  – parallel version of GA