39
George Papastefanatos 1 , Panos Vassiliadis 2 , Alkis Simitsis 3 ,Yannis Vassiliou 1 (1) National Technical University of Athens {gpapas,yv}@dbnet.ece.ntua.gr (2) University of Ioannina [email protected] (3) IBM Almaden Research Center [email protected] What-if Analysis for Data Warehouse Evolution

George Papastefanatos 1, Panos Vassiliadis 2, Alkis Simitsis 3,Yannis Vassiliou 1 (1) National Technical University of Athens {gpapas,yv}@dbnet.ece.ntua.gr

  • View
    216

  • Download
    1

Embed Size (px)

Citation preview

George Papastefanatos1, Panos Vassiliadis2,

Alkis Simitsis3 ,Yannis Vassiliou1

(1) National Technical University of Athens{gpapas,yv}@dbnet.ece.ntua.gr

(2) University of [email protected]

(3) IBM Almaden Research [email protected]

What-if Analysis for Data Warehouse Evolution

DaWaK'07, Regensburg, September 2007 2

Outline

• Motivation

• Graph-based modeling of ETL

• Adapting ETL workflows

• Evaluation

• Conclusions

DaWaK'07, Regensburg, September 2007 3

Outline

• Motivation

• Graph-based modeling of ETL

• Adapting ETL workflows

• Evaluation

• Conclusions

DaWaK'07, Regensburg, September 2007 4

Data Warehouse environment

WWW Act1

Act2

Act3

Act4

Act5

DaWaK'07, Regensburg, September 2007 5

ETL

WWW

Sources

Extract Transform & Clean

DW

Load

DSA

Act1

Act2

Act3

Act4

Act5

DaWaK'07, Regensburg, September 2007 6

Motivation

WWW Act1

Act2

Act3

Act4

Act5

DaWaK'07, Regensburg, September 2007 7

Evolving ETL sources…

• Schema Changes on the sources of ETL processes. Design constructs are– Added, Removed, Modified

• ETL processes affected:– SyntacticallySyntactically – i.e., become invalid– SemanticallySemantically – i.e., query must conform to the new

source database semantics

• Adaptation of activities, queries and views– time-consuming task,treated in most of the cases

manually by the administrators/developers

DaWaK'07, Regensburg, September 2007 8

We would like to know…

• What part of the process is affected and how if e.g., an attribute is deleted?

• Can we predict the impact of changes? • To what extent can readjustment be automated? • Can we perform what-if analysis for potential

changes of source configurations?

DaWaK'07, Regensburg, September 2007 9

Contribution

• A general mechanism for performing what-if analysis for potential changes of ETL source configurations– A graph model for relations, queries, views, ETL activities, and

their significant properties

– A framework for annotating the graph with policies concerning the behavior of nodes in the presence of hypothetical changes.

– A set of rules that dictate the proper actions, when additions, deletions or updates are performed to relations, attributes, and conditions

– An experimental assessment of our proposal

DaWaK'07, Regensburg, September 2007 10

Outline

• Motivation

• Graph-based modeling of ETL

• Adapting ETL workflows

• Evaluation

• Conclusions

DaWaK'07, Regensburg, September 2007 11

Query representation

Q: SELECT EMP.Emp#, Sum(WORKS.Hours) as T_Hours

FROM EMP, WORKS

WHERE EMP.Emp# = WORKS.Emp#

GROUP BY EMP.Emp#

JoinJoin

DaWaK'07, Regensburg, September 2007 12

Query representation

Q: SELECT EMP.Emp#, Sum(WORKS.Hours) as T_Hours

FROM EMP, WORKS

WHERE EMP.Emp# = WORKS.Emp#

GROUP BY EMP.Emp#

JoinJoin

DaWaK'07, Regensburg, September 2007 13

Outline

• Motivation

• Graph-based modeling of ETL

• Adapting ETL workflows

• Evaluation

• Conclusions

DaWaK'07, Regensburg, September 2007 14

Annotating the graph with adaptation rules

According to prevailing policy, the proper action is taken graph transformation

Set of evolving database constructs:• relations• attributes• constraints

11

Set of potential evolution changes: • addition• deletion• modification

22

Set of reaction policies: • propagate• block• prompt

33

DaWaK'07, Regensburg, September 2007 15

Annotating the graph with adaptation rules

• Assuming that a graph construct is annotated with a policy for a particular event (e.g., an activity node is tuned to deny deletions of its provider attributes) : – (a) it performs the identification of the affected

subgraph

– (b) if the policy is appropriate, it automates the readjustment of the graph to fit the new semantics imposed by the change.

DaWaK'07, Regensburg, September 2007 16

Query Adaptation - Example

from

map-select

S

Q

SS

EMP

Emp# NameEmp#

NameS

map-select

...

On attribute addition To EMPTHEN propagate

Annotated Query Graph Event

Add attribute Phone to EMP relation

Phone

from

map-select

S

Q

S SS

EMP

PhoneEmp# NameEmp#

NameS

map-select

...

On attribute addition To EMPTHEN propagate

S

map-select

Transformed Query Graph

Q: SELECT EMP.Emp#, EMP.Name

FROM EMP

Q’: SELECT EMP.Emp#, EMP.Name, EMP.Phone

FROM EMP

DaWaK'07, Regensburg, September 2007 17

Algorithm Propagate changeS (PS)

Input: an ETL summary S over a graph Go=(Vo,Eo) and an event eOutput: a graph Gn=(Vn,En)Variables: a set of events E, and an affected node A Begin dps(S, Go, Gn, {e}, A)End

dps(S, Gn, Go, E, A) { I = Ins_by_policy(affected(E)) D = Del_by_policy(affected(E)) Gn = Go – D I E = E–{e}action(affected(E)) if consumer(A)nil for each consumer(A) dps(S,Gn,Go,E,consumer(A))}

DaWaK'07, Regensburg, September 2007 18

Conflict resolution

• Graph constructs may have contradictory policies for the same event

Policies defined on query graph structures are stronger than policies defined on view graph structures which in turn prevail on policies defined on relation graph structures

RuleRule

Query Module

View Module

Relation Module

Query Conditions

Query Attributes

Query

View Conditions

View Attributes

View

Relation Conditions

Relation Attributes

Relation

User’s Choice

DaWaK'07, Regensburg, September 2007 19

Outline

• Motivation

• Graph-based modeling of ETL

• Adapting ETL workflows

• Evaluation

• Conclusions

DaWaK'07, Regensburg, September 2007 20

Evaluation of our framework

• Reverse engineering of real-world ETL processes, extracted from an application of the Greek public sector

• Monitored the changes that took place to the sources of the studied data warehouse

• Performed what-if analysis for several evolution scenarios

DaWaK'07, Regensburg, September 2007 21

Configuration of our setting

7 ETL processes

S1 Tmp1

L1 L2 Tmp2

Join

S2

T1

T2

Add field

Load to DSA

γ V3

Load to DSA

π

π Add field

Patch text fields

Sources DW Data Staging Area (DSA)

Patch text fields

53 ETL activities3 lookup tables

7 source tables 9 target tables

DaWaK'07, Regensburg, September 2007 22

Configuration of our setting

• evolution events– renaming source tables,

– renaming attributes of source tables,

– adding and deleting attributes from source tables,

– modifying the domain of attributes

– changing the primary key of lookup tables

DaWaK'07, Regensburg, September 2007 23

Measurements

• For each event, we counted: – (a) the number of activities affected both semantically

and syntactically,

– (b) the number of activities, that have automatically been adapted by our framework (propagate or block policies) as opposed to those

– (c) that required administrator’s intervention (i.e., a prompt policy)

DaWaK'07, Regensburg, September 2007 24

Adapted activities w.r.t. the ETL scenario size

0

10

20

30

40

50

60

5 5 5 7 7 12 12

scenario size

# A

cti

vit

ies

Affected

Adapted

DaWaK'07, Regensburg, September 2007 25

Adapted activities w.r.t. the complexity of activities.

0

2

4

6

8

10

12

14

16

1 2 3 4 5

complexity

#a

cti

vit

ies

Affected

Adapted

DaWaK'07, Regensburg, September 2007 26

Outline

• Motivation

• Graph-based modeling of ETL

• Adapting ETL workflows

• Evaluation

• Conclusions

DaWaK'07, Regensburg, September 2007 27

Summary

• A uniform representation for modeling relations, queries, views and ETL activities

• A framework for annotating the graph with policies concerning the behavior of nodes in the presence of hypothetical changes

• A set of rules that dictate the proper adaptation actions, when evolution events are performed to ETL sources

• An experimental assessment of our proposal

DaWaK'07, Regensburg, September 2007 28

On-going/Future Work

• Hecataeus: A tool for visualizing and performing what-if analysis for several evolution scenarios.

• SQL extensions for annotating graph constructs with evolution semantics

• Patterns of evolution sequences

DaWaK'07, Regensburg, September 2007 29http://www.cs.uoi.gr/~pvassil/projects/architecture_graph/

Danke schön!Questions?

DaWaK'07, Regensburg, September 2007 30

DaWaK'07, Regensburg, September 2007 31

Back up Slides

DaWaK'07, Regensburg, September 2007 32

Related work

• DB schema Evolution– Schema Versioning

• DW schema Evolution

• Materialized View Evolution

• Evolution wrt Model Mappings

DaWaK'07, Regensburg, September 2007 33

Scenario 1

S1 Tmp1

L1 L2 Tmp2

Join

S2

T1

T2

Add field

Load to DSA

γ V3

Load to DSA

π

π Add field

Patch text fields

Sources DW Data Staging Area (DSA)

Patch text fields

DaWaK'07, Regensburg, September 2007 34

Evolution Metadata

• Metadata Repository maintaining– Graph constructs– Annotations– What if analysis scenarios

DaWaK'07, Regensburg, September 2007 35

Extending SQL With Evolution Semantics

• Used for annotating graph constructs

ON <event> TO <element> THEN <policy>

E.g.E.g.

SELECT Emp#, NAME, AGE

FROM V

ON condition addition TO V THEN propagate,

ON attribute deletion TO V.AGE THEN block

DaWaK'07, Regensburg, September 2007 36

Hecataeus

• A tool for visualizing and performing what-if analysis for several evolution scenarios

DaWaK'07, Regensburg, September 2007 37

Adapted activities per Event

Affected Activities Per Event

0

10

20

30

40

50

60

70

80

RenameTable

RenameAttributes

AddAttributes

DeleteAttributes

ModifyAttributes

Change PK

Affected

Adapted

DaWaK'07, Regensburg, September 2007 38

Evolution changes occurred on source and lookup tables

Activities (53) Source Name Event Type

Affected Autom. adjusted Prompt % S1 Rename 14 14 0 100% Rename Attributes 14 14 0 100% Add Attributes 34 31 3 91% Delete Attributes 19 18 1 95% Modify Attributes 18 18 0 100%

S4 Rename 4 4 0 100% Rename Attributes 4 4 0 100% Add Attributes 19 15 4 79% Delete Attributes 14 12 2 86% Modify Attributes 6 6 0 100%

S2 Rename 1 1 0 100% Rename Attributes 1 1 0 100% Add Attributes 4 4 0 100% Delete Attributes 4 3 1 75%

S3 Rename 1 1 0 100% Rename Attributes 1 1 0 100%

S5 Modify Attributes 3 3 0 100% S6 NO_CHANGES 0 0 0 - S7 Rename 1 1 0 100% Rename Attributes 1 1 0 100%

L1 NO_CHANGES 0 0 0 - L2 Add Attributes 1 0 1 0% L3 Add Attributes 9 0 9 0% Change PK 9 0 9 0%

DaWaK'07, Regensburg, September 2007 39

Annotation of graph constructs with policies for kinds of events

parts of the system affected nodes annotated with policies edges annotated with policies

event on database schema

R/V R/V Attr.

R/V Cond.

Q/V Q/V Attr.

Q/V Cond.

R Ω V C Map-Select

Op From GB/OB

Ω

C Add

R/V

Ω

C Delete

R/V

Ω

C Modify/Rename

R/V