EDBT 2009 - Provenance for Nested Subqueries


DESCRIPTION

Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation and/or user defined functions. Without support for these constructs, a provenance management system is of limited use. In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.


Provenance for Nested Subqueries

Boris Glavic

Database Technology Group

Department of Informatics, University of Zurich

glavic@ifi.uzh.ch


Gustavo Alonso

Systems Group

Department of Computer Science, ETH Zurich

alonso@inf.ethz.ch


Overview

1. Introduction
2. The Provenance of Subqueries
3. Experimental Results
4. Conclusion


1. Introduction

  • Provenance of a query: which input data item(s) influenced which output data item(s)?
  • Granularity
      - Tuple
      - Attribute value
      - ...
  • Contribution semantics
      - Influence (Lineage / Why)
      - Copy (Where)
      - ...


1. Introduction

  • Most application domains that benefit from provenance use complex queries
  • Subqueries
      - Correlated
      - Nested
  • Not supported by existing systems
      - Semantics not clear
      - Complex computation


1. Introduction

Steps to solve this problem:
  1. Establish sound semantics for the provenance of subqueries
  2. Develop algorithms for subquery provenance computation
  3. Integrate the algorithms into a provenance management system (Perm)


1. Introduction

Definition of contribution semantics
  • Why/Influence-provenance
  • Introduced in [Cui, Widom ICDE '00]
  • Provenance represented as a list of subsets of the input relations
  • Defined for a single algebra operator and a single result tuple


1. Introduction

Definition 1: For a single algebra operator $op$ with input relations $T_1, \ldots, T_n$, a list $(T_1^*, \ldots, T_n^*)$ of maximal subsets of the input relations is the provenance of a tuple $t$ from the result of $op$ iff:
  • $op(T_1^*, \ldots, T_n^*) = \{t\}$
  • For all $i$ and all $t^* \in T_i^*$: $op(T_1^*, \ldots, T_{i-1}^*, \{t^*\}, T_{i+1}^*, \ldots, T_n^*) \neq \emptyset$
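The two conditions of Definition 1, together with maximality, can be checked by brute force on toy inputs. The following is a minimal Python sketch under set semantics; the helper names (`subsets`, `satisfies_definition`, `provenance`) and the equi-join used as the operator are illustrative assumptions and not part of Perm.

```python
from itertools import combinations, product

def subsets(rel):
    """All subsets of a relation (a set of tuples), as frozensets."""
    items = list(rel)
    return [frozenset(c) for n in range(len(items) + 1)
            for c in combinations(items, n)]

def satisfies_definition(op, cand, t):
    """Check the two listed conditions for a candidate (T1*, ..., Tn*)."""
    if op(*cand) != {t}:                      # condition 1: result is exactly {t}
        return False
    for i, ti_star in enumerate(cand):        # condition 2: every single tuple contributes
        for t_star in ti_star:
            args = list(cand)
            args[i] = frozenset([t_star])
            if not op(*args):
                return False
    return True

def provenance(op, inputs, t):
    """Maximal valid candidates, found by enumeration (toy-sized inputs only)."""
    valid = [c for c in product(*(subsets(r) for r in inputs))
             if satisfies_definition(op, c, t)]

    def contained(x, y):
        return all(a <= b for a, b in zip(x, y))

    return [c for c in valid
            if not any(c != d and contained(c, d) for d in valid)]

# Hypothetical toy operator: equi-join of R(a) and S(b) on a = b.
def join(r, s):
    return {(a, b) for (a,) in r for (b,) in s if a == b}

R = {(1,), (2,)}
S = {(1,), (3,)}
print(provenance(join, [R, S], (1, 1)))
```

For the result tuple (1, 1), only the pair of single-tuple subsets containing (1) from each input passes both conditions, so the enumeration returns one candidate.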


1. Introduction

Perm
  • Provenance Extension of the Relational Model
  • Provenance Management System (PMS)
  • "Pure" relational representation of provenance
  • Provenance computation through algebraic query rewrite
  • Implemented as an extension of PostgreSQL


1. Introduction

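Provenance representation

[Figure: a query over input relations 1, 2, ..., n produces the original result; each provenance tuple carries the original attributes followed by the attributes of relation 1 through relation n.]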


1. Introduction

Provenance representation

Example: a query over R = {r1, r2} and S = {s1} with original result {t1}.

    Original attributes   Relation R attributes   Relation S attributes
    t1                    r1                      s1
    t1                    r2                      s1


1. Introduction

Provenance computation through query rewrite:
  • Given a query q, generate a query q+ that computes the provenance of q
      - Representation as defined before
  • Rewrites operate on the algebraic representation of a query
      - Rewrite rules for each operator op that transform op into an algebra statement that propagates the provenance


1. Introduction

Rewrite rules example:

    SELECT agg, G
    FROM T
    GROUP BY G

is rewritten into

    SELECT agg, G, prov(T)
    FROM
      (SELECT agg, G FROM T GROUP BY G) AS agg
      LEFT OUTER JOIN
      (SELECT G AS G', prov(T) FROM T+) AS prov
      ON G = G'


1. Introduction

Rewrite rules example:

    SELECT sum(revenue) AS sum, shop
    FROM sales
    GROUP BY shop

    sales
    shop     month   revenue
    Migros   Jan     100
    Migros   Feb     10
    Migros   Mar     10
    Coop     Jan     25
    Coop     Feb     25

    result
    sum   shop
    120   Migros
    50    Coop


1. Introduction

    SELECT sum, shop, pShop, pMonth, pRevenue
    FROM
      (SELECT sum(revenue) AS sum, shop FROM sales GROUP BY shop) AS agg
      LEFT OUTER JOIN
      (SELECT shop AS shop', pShop, pMonth, pRevenue FROM sales+) AS prov
      ON shop = shop'

    sum   shop     pShop    pMonth   pRevenue
    120   Migros   Migros   Jan      100
    120   Migros   Migros   Feb      10
    120   Migros   Migros   Mar      10
    50    Coop     Coop     Jan      25
    50    Coop     Coop     Feb      25
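The rewritten aggregation query can be tried out on the sales table from the slides. This is a sketch only: it uses Python's `sqlite3` module instead of Perm/PostgreSQL, emulates the rewritten input (sales with a plus sign on the slide) by a subquery that renames the columns of `sales` to `pShop`, `pMonth`, `pRevenue`, and uses `shop2` where the slide writes shop with a prime, since a prime is not a valid identifier character. These substitutions are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, month TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Migros", "Jan", 100), ("Migros", "Feb", 10), ("Migros", "Mar", 10),
     ("Coop", "Jan", 25), ("Coop", "Feb", 25)],
)

# Aggregation joined back to the (renamed) input tuples of each group.
REWRITTEN = """
SELECT agg.sum, agg.shop, prov.pShop, prov.pMonth, prov.pRevenue
FROM
  (SELECT sum(revenue) AS sum, shop FROM sales GROUP BY shop) AS agg
  LEFT OUTER JOIN
  (SELECT shop AS shop2, shop AS pShop, month AS pMonth, revenue AS pRevenue
   FROM sales) AS prov
  ON agg.shop = prov.shop2
"""

rows = conn.execute(REWRITTEN).fetchall()
for row in sorted(rows):
    print(row)
```

Each original result tuple is repeated once per contributing input tuple, which is the relational provenance representation shown in the result table above.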


2. The Provenance of Subqueries

  • Sublinks: subqueries in e.g. the SELECT clause
      $\sigma_{a \text{ IN } \sigma_{b=3}(S)}(R)$
  • Correlated: references outside attributes
      $\sigma_{a \text{ IN } \sigma_{b=a}(S)}(R)$
  • Nested: sublink that contains sublinks
      $\sigma_{a \text{ IN } \sigma_{b = \text{ANY}(T)}(S)}(R)$
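In SQL terms, the three algebra expressions correspond roughly to the queries below. The sketch runs them with Python's `sqlite3` module on made-up toy relations R(a), S(b), and T(c); the data is an assumption, and because SQLite has no `= ANY` operator the nested case is written with the equivalent `IN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (a INTEGER);
CREATE TABLE S (b INTEGER);
CREATE TABLE T (c INTEGER);
INSERT INTO R VALUES (1), (2), (3);
INSERT INTO S VALUES (2), (3), (4);
INSERT INTO T VALUES (3), (4);
""")

QUERIES = {
    # Sublink without outside references: evaluated once for all tuples of R.
    "uncorrelated": "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = 3)",
    # Sublink referencing the outer attribute a: re-evaluated per tuple of R.
    "correlated": "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = R.a)",
    # Sublink whose own condition contains another sublink.
    "nested": "SELECT a FROM R WHERE a IN "
              "(SELECT b FROM S WHERE b IN (SELECT c FROM T))",
}

results = {name: [row[0] for row in conn.execute(sql + " ORDER BY a")]
           for name, sql in QUERIES.items()}
print(results)
```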


2. The Provenance of Subqueries

What is the provenance of a sublink according to Definition 1?
  • Sublinks can be used in different contexts
      - Selection
      - Projection
      - ...
  • A sublink either
      - produces exactly one value, or
      - produces a boolean value


2. The Provenance of Subqueries

  • Single uncorrelated ANY-sublinks in selection conditions
  • For other types of sublinks
      - Correlated sublinks
      - Nested sublinks


2. The Provenance of Subqueries

  • For other types of sublinks
      - Correlated sublinks
      - Nested sublinks

READ THE PAPER!


2. The Provenance of Subqueries

Single uncorrelated ANY-sublinks in selection conditions
  • The result of the sublink query is fixed
  • For a given input tuple t, the sublink condition is either true or false

    $\sigma_{a = \text{ANY } \sigma_{b=3}(S)}(R)$


2. The Provenance of Subqueries

Some terminology
  • $T_{sub}$: the query of a sublink
  • $C_{sub}$: the conditional expression of a sublink

Example: $q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$
  • $T_{sub} = \Pi_b(S)$
  • $C_{sub} = (a = \text{ANY } \Pi_b(S))$


2. The Provenance of Subqueries

The sublink condition $C_{sub}$ can play different roles in a condition C of a selection (for one input tuple t):
  • Reqtrue: the selection condition is true iff $C_{sub}$ is true
  • Reqfalse: the selection condition is true iff $C_{sub}$ is false
  • Ind: the selection condition is true independent of the result of $C_{sub}$
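One way to read the three roles is to evaluate the selection condition twice for a tuple, once with the sublink result forced to true and once forced to false. The sketch below does this in Python; the function `classify` and the sample conditions are illustrative assumptions, not the classification procedure used in Perm.

```python
def classify(condition, t):
    """Role of the sublink condition for tuple t.

    `condition(t, csub)` evaluates the selection condition C for tuple t,
    with the boolean result of the sublink condition passed in as `csub`.
    """
    if_true = condition(t, True)
    if_false = condition(t, False)
    if if_true and if_false:
        return "ind"        # C holds whatever the sublink returns
    if if_true:
        return "reqtrue"    # C holds only if the sublink is true
    if if_false:
        return "reqfalse"   # C holds only if the sublink is false
    return None             # C never holds, so t is not in the result

# Hypothetical selection conditions over a tuple with a single attribute a.
only_sublink = lambda a, csub: csub
negated_sublink = lambda a, csub: not csub
sublink_or_range = lambda a, csub: csub or a > 2

print(classify(only_sublink, 1))       # reqtrue
print(classify(negated_sublink, 1))    # reqfalse
print(classify(sublink_or_range, 3))   # ind
print(classify(sublink_or_range, 1))   # reqtrue
```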


2. The Provenance of Subqueries

Some more terminology
  • $T_{sub}^{true}(t)$: all tuples from the sublink query that fulfill the "unquantified" sublink condition
  • $T_{sub}^{false}(t)$: all tuples from the sublink query that do not fulfill the "unquantified" sublink condition

Example:
  • $C_{sub} = (a = \text{ANY } \sigma_{b=3}(S))$
  • Unquantified condition: $C_{sub}^{\circ} = (a = b)$


2. The Provenance of Subqueries

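Back to ANY-sublinks in selections

Proposition:

$$T_{sub}^*(t) = \begin{cases} T_{sub}^{true}(t) & \text{if } C_{sub} \text{ is reqtrue} \\ T_{sub} & \text{if } C_{sub} \text{ is reqfalse or ind} \end{cases}$$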


2. The Provenance of Subqueries

Example: $q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$

    R      S            Result
    a      b    c       a
    1      1    100     1
    2      2    10      2
    3      4    24

Compute provenance for t = (1)


2. The Provenance of Subqueries

$q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$
  • $T_{sub} = \Pi_b(S)$
  • $C_{sub}^{\circ} = (a = b)$
  • $T_{sub}^{true}(t) = \{(1)\}$
  • $C_{sub}$ is reqtrue
  • $T_{sub}^* = T_{sub}^{true}$


2. The Provenance of Subqueries

$q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$

    T_sub    R
    b        a
    1        1
    2        2
    4        3

Compute provenance for t = (1)
  • $C_{sub}^{\circ} = (a = b)$
  • $T_{sub}^{true}(t) = \{(1)\}$


2. The Provenance of Subqueries

$q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$

    R      S            T_sub    Result    R*     T_sub*
    a      b    c       b        a         a      b
    1      1    100     1        1         1      1
    2      2    10      2        2
    3      4    24      4

Compute provenance for t = (1)
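The worked example can be replayed in a few lines of Python. This is a sketch using the R and S instances from the slides and plain lists in place of relations; the function names are assumptions, and the role is fixed to reqtrue because the selection condition here consists of the sublink condition alone.

```python
R = [1, 2, 3]                            # R(a)
S = [(1, 100), (2, 10), (4, 24)]         # S(b, c)

# T_sub = projection of S on b (the sublink query).
t_sub = [b for b, _c in S]

def c_sub(a):
    """Sublink condition a = ANY T_sub."""
    return any(a == b for b in t_sub)

# q = selection on R with the sublink condition.
result = [a for a in R if c_sub(a)]

def t_sub_true(a):
    """Tuples of T_sub fulfilling the unquantified condition a = b."""
    return [b for b in t_sub if a == b]

def t_sub_star(a, role):
    """Sublink provenance according to the proposition for ANY-sublinks."""
    return t_sub_true(a) if role == "reqtrue" else list(t_sub)

t = 1
r_star = [a for a in R if a == t]        # the input tuple that produced t
print(result, r_star, t_sub_star(t, "reqtrue"))
```

For t = (1) this yields R* = {(1)} and T_sub* = {(1)}, matching the tables above.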


2. The Provenance of Subqueries

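Definition 1 is ambiguous for queries with more than one sublink!

    $q = \sigma_{C_1 \vee C_2}(U)$
    $C_1 = (a = \text{ANY } R)$
    $C_2 = (a > \text{ALL } S)$
    $t = (5)$

[Figure: input relation U (a: 5), the sublink relations R and S (shown with columns b: 1, 2, 100 and c: 1, 5), and the query result (a: 5).]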


2. The Provenance of Subqueries

Definition 1 is ambiguous for queries with more than one sublink!

[Figure: the same example as on the previous slide ($q = \sigma_{C_1 \vee C_2}(U)$, $t = (5)$), with the sublink conditions $C_1$ and $C_2$ annotated "true" and "false".]


2. The Provenance of Subqueries

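    $q = \sigma_{C_1 \vee C_2}(U)$
    $C_1 = (a = \text{ANY } R)$
    $C_2 = (a > \text{ALL } S)$
    $t = (5)$

[Figure: two candidate provenances for t under Definition 1. Solution 1: U* (a: 5) with the tables labeled S* and R* showing b: 5 and c: 1, 5. Solution 2: U* (a: 5), R* (b: 1, 100), S* (b: 1).]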


2. The Provenance of Subqueries

[Figure: Solution 1 and Solution 2 as on the previous slide ($q = \sigma_{C_1 \vee C_2}(U)$, $t = (5)$), with the sublink conditions annotated "true" and "false".]


2. The Provenance of Subqueries

[Figure: Solution 1 and Solution 2 as on the previous slide ($q = \sigma_{C_1 \vee C_2}(U)$, $t = (5)$), with the sublink conditions annotated "false" and "true".]


2. The Provenance of Subqueries

Reasons for this ambiguity:
  • The definition requires the provenance to produce the same result
  • But not to produce the same results for the sublinks

-> Definition 1 produces false positives


2. The Provenance of Subqueries

Solution: extend Definition 1
  • Add a third condition: for each sublink, the sublink condition, computed for one result tuple t and one tuple from the provenance of the sublink, produces the same sublink result as in the original query
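The effect of the third condition can be illustrated by brute force on a tiny instance of the two-sublink query. The sketch below is an assumption-laden reading: the relations U, R, and S are toy data chosen for illustration (not necessarily those on the slides), and `sublinks_preserved` encodes one interpretation of the added condition, namely that each single provenance tuple must reproduce the original truth value of its sublink.

```python
from itertools import combinations, product

U, R, S = [5], [1, 5], [1, 5]            # assumed toy relations

def c1(a, r_sub):                        # C1: a = ANY R
    return any(a == r for r in r_sub)

def c2(a, s_sub):                        # C2: a > ALL S
    return all(a > s for s in s_sub)

def q(u_sub, r_sub, s_sub):              # selection with C1 OR C2 over U
    return {a for a in u_sub if c1(a, r_sub) or c2(a, s_sub)}

def subsets(rel):
    return [frozenset(c) for n in range(len(rel) + 1)
            for c in combinations(rel, n)]

def definition1(cand, t):
    u_s, r_s, s_s = cand
    if q(u_s, r_s, s_s) != {t}:          # condition 1
        return False
    return (all(q({u}, r_s, s_s) for u in u_s)       # condition 2
            and all(q(u_s, {r}, s_s) for r in r_s)
            and all(q(u_s, r_s, {s}) for s in s_s))

def sublinks_preserved(cand, t):
    """Assumed reading of the third condition."""
    _, r_s, s_s = cand
    orig_c1, orig_c2 = c1(t, R), c2(t, S)
    return (all(c1(t, {r}) == orig_c1 for r in r_s)
            and all(c2(t, {s}) == orig_c2 for s in s_s))

def maximal(cands):
    def contained(x, y):
        return all(a <= b for a, b in zip(x, y))
    return [c for c in cands
            if not any(c != d and contained(c, d) for d in cands)]

t = 5
candidates = list(product(subsets(U), subsets(R), subsets(S)))
by_def1 = maximal([c for c in candidates if definition1(c, t)])
by_extended = maximal([c for c in candidates
                       if definition1(c, t) and sublinks_preserved(c, t)])
print(len(by_def1), by_extended)
```

On this instance Definition 1 alone admits two incomparable maximal candidates, one of which makes C2 true although it is false in the original query, while the extended check leaves a single candidate.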


2. The Provenance of Subqueries

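    $q = \sigma_{C_1 \vee C_2}(U)$
    $C_1 = (a = \text{ANY } R)$
    $C_2 = (a > \text{ALL } S)$
    $t = (5)$

[Figure: Solution 1 and Solution 2 revisited under the extended definition. Solution 1: U* (a: 5) with the tables labeled S* and R* showing b: 5 and c: 5. Solution 2: U* (a: 5), R* (b: 1, 100), S* (b: 1).]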


2. The Provenance of Subqueries

How to compute the provenance according to the extended definition?
  • Use query rewrite
      - Generic strategy (Gen)
      - Specialized strategies
  • Use un-nesting
      - Check: does not change the provenance
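As a rough illustration of un-nesting, an uncorrelated ANY-sublink in a selection can be replaced by a join, after which provenance can be propagated as for an ordinary join. The sketch below uses Python's `sqlite3` module with the R and S instances from the earlier example; the `prov_*` column names and the exact shape of the un-nested query are assumptions, not the rewrite rules implemented in Perm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (a INTEGER);
CREATE TABLE S (b INTEGER, c INTEGER);
INSERT INTO R VALUES (1), (2), (3);
INSERT INTO S VALUES (1, 100), (2, 10), (4, 24);
""")

# Original query with the sublink (IN is equivalent to = ANY here).
nested = conn.execute(
    "SELECT a FROM R WHERE a IN (SELECT b FROM S) ORDER BY a"
).fetchall()

# Un-nested form: the sublink becomes a join, and the joined input
# tuples are carried along as provenance attributes.
unnested_prov = conn.execute("""
SELECT R.a, R.a AS prov_r_a, S.b AS prov_s_b, S.c AS prov_s_c
FROM R JOIN S ON R.a = S.b
ORDER BY R.a
""").fetchall()

# Check: projecting the un-nested result on a gives the original result.
projected = sorted({(row[0],) for row in unnested_prov})
print(nested, unnested_prov, projected == nested)
```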


2. The Provenance of Subqueries

Gen-strategy
  • For queries we cannot un-nest
  1. Join the original query with all possible provenance tuples (base relations)
  2. Rewrite the sublink query
  3. Introduce additional correlation to simulate a join between 1) and 2)


3. Experimental Results

TPC-H benchmark (10 MB size)


3. Experimental Results

TPC-H benchmark (1 GB size)


4. Conclusion

  • Definition 1 fails in the presence of sublinks
      - Can be extended to deal with sublinks
  • Provenance computation for sublinks
      - By using query rewrites
      - Implemented in Perm
  • Future work
      - Physical provenance-aware operators


Questions

? ? ?
