Why & Where: A Characterization of Data Provenance. Authors: Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. Presented by: Tamra Reutlinger

Why & Where - Technionkanza/dbseminar/2011/WhyWhere.… · OLAP (online analytical processing) Provenance = Lineage = רוקמ ,אצומ ,ןיסחוי תלשוש ,ןיסחוי ןליא


Why & Where A characterization of Data Provenance

Authors: Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan

Presented by: Tamra Reutlinger


Example

Sources: (Name, Id, address) and (Id, Telephone)

View over Name and Telephone:

("John Doe", 1234): Not valid!


Example

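Sources: (Name, Id, address) and (Id, Telephone)

View over Name and Telephone: ("John Doe", 1234)

[Figure: the two sources, labeled D1 and D2, with the annotations "Id, Telephone" and "Valid".]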


The Two Meanings Of Provenance

Why – why is the tuple in our View?

Where – where did the data "1234" come from? (What path did it go through to get here?)


Importance of Finding Provenance

Sources of different qualities

Scientific databases

On-line monitoring

OLAP (online analytical processing)

Provenance = Lineage = source, origin, pedigree, family tree


Goal

Computing provenance

A syntactic approach

A general data model


Outline:

Introduction to data provenance

A deterministic model

Syntax & operations

Encoding relations

A Query language

Why provenance

Where provenance

Conclusion


Example

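[Figure: sources D1 and D2 with schemas (Name, Id, address) and (Id, Telephone), and the view over Name and Telephone containing ("John Doe", 1234), annotated "Id, Telephone" and "Valid". Next to it, an edge-labeled tree with labels {Id:1}, {Id:2}, {Id:3}, a, b, c, num and values 3, "a", 109. Caption: "Where.. Collect.."]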


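Edge-labeled Tree Models for Semi-structured Data

The labels on the edges leaving each node are distinct (a deterministic model).

[Figure: an example of semi-structured data as an edge-labeled tree; surviving labels and values: {Id:1}, {Id:2}, {Id:3}, a, b, c, num, 3, "a", 109.]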


Syntax & Operations

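{x1:y1, x2:y2} denotes a tree whose root has two outgoing edges: one labeled x1 leading to y1, and one labeled x2 leading to y2.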


Paths: x1.x2....xn

Example: the path {Id:1} identifies the value {Name:"Kim", Rate:50}

The path {Id:1}.Rate identifies the value 50.
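A minimal sketch of path lookup, assuming trees are modeled as nested Python dicts and a structured edge label such as {Id:1} is modeled as a frozenset of (field, value) pairs. The helper names `label` and `follow` are hypothetical and only serve this illustration.

```python
def label(**fields):
    """Model a structured edge label such as {Id:1} as a hashable value."""
    return frozenset(fields.items())


def follow(tree, path):
    """Walk the edge labels in `path` from the root and return the value reached."""
    node = tree
    for edge in path:
        node = node[edge]  # raises KeyError if the path does not exist
    return node


# The example from the slide: {Id:1} leads to {Name:"Kim", Rate:50}.
db = {label(Id=1): {"Name": "Kim", "Rate": 50}}

kim = follow(db, [label(Id=1)])           # the whole record
rate = follow(db, [label(Id=1), "Rate"])  # the constant at the end of the path
```

Because every node has distinct outgoing labels, a path identifies at most one value, which is why a plain dictionary lookup per step is enough here.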


Path representation of v

The set of all the paths to the constants at the terminal nodes.

{a:{1:c, 3:d}}  =>  {(a.1, c), (a.3, d)}

[Figure: tree drawings; surviving labels: {c:3}, a, b, e1, 1, 2, 3, c, d.]
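The path representation can be sketched as a short recursive function, assuming trees are nested Python dicts and anything that is not a dict counts as a constant. The name `path_representation` is hypothetical.

```python
def path_representation(tree, prefix=()):
    """Return the set of (path, constant) pairs for every terminal node."""
    if not isinstance(tree, dict):
        # A constant: the path walked so far ends here.
        return {(prefix, tree)}
    pairs = set()
    for edge, child in tree.items():
        pairs |= path_representation(child, prefix + (edge,))
    return pairs


# The slide's example {a:{1:c, 3:d}}, with paths written as tuples of labels.
v = {"a": {1: "c", 3: "d"}}
rep = path_representation(v)
```

Here the pair `(("a", 1), "c")` plays the role of (a.1, c) on the slide.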


Substructure

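w is a substructure of v, written $w \sqsubseteq v$, if the path representation of w is a subset of the path representation of v.

$\{a:\{1:c,\ 3:d\}\} \sqsubseteq \{a:\{1:c,\ 2:b,\ 3:d\}\}$

$\{a:\{1:c,\ 3:d\}\} \not\sqsubseteq \{b.a:\{1:c,\ 3:d\}\}$

[Figure: the trees w and v of the first example.]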
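Under the same assumption (trees as nested dicts, non-dict values as constants), the substructure test reduces to a subset check on path representations. The function names below are hypothetical.

```python
def paths(tree, prefix=()):
    """Path representation: the set of (path, constant) pairs of a tree."""
    if not isinstance(tree, dict):
        return {(prefix, tree)}
    found = set()
    for edge, child in tree.items():
        found |= paths(child, prefix + (edge,))
    return found


def is_substructure(w, v):
    """w is a substructure of v if paths(w) is a subset of paths(v)."""
    return paths(w) <= paths(v)


w = {"a": {1: "c", 3: "d"}}
v = {"a": {1: "c", 2: "b", 3: "d"}}
nested = {"b": {"a": {1: "c", 3: "d"}}}  # same subtree, but one level deeper

inside = is_substructure(w, v)
not_inside = is_substructure(w, nested)
```

The second check fails because the paths in `nested` all start with `b`, so none of the paths of `w` occurs in it.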


Deep Union

v1 U v2 is the union of the path representations of v1 and v2.

[Figure: two trees v1 and v2 and their deep union v1 U v2.]


Deep Union

The union of the path representations may not be a partial function, in which case the deep union is undefined.

[Figure: v1 and v2 disagree at the same path (c:2 or c:3?), so v1 U v2 is undefined.]
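A minimal sketch of deep union on nested dicts. It assumes that "undefined" can be signaled with an exception, and that a clash between a constant and a subtree at the same path is also treated as undefined. The example trees are made up for illustration and are not the ones drawn on the slides.

```python
def deep_union(v1, v2):
    """Merge two trees path by path; raise ValueError where the union is undefined."""
    if isinstance(v1, dict) and isinstance(v2, dict):
        merged = dict(v1)
        for edge, child in v2.items():
            if edge in merged:
                merged[edge] = deep_union(merged[edge], child)
            else:
                merged[edge] = child
        return merged
    if v1 == v2:
        return v1  # the same constant on both sides
    raise ValueError(f"deep union undefined: {v1!r} versus {v2!r}")


# Compatible trees: shared edge "b" is merged recursively.
ok = deep_union({"a": 1, "b": {"c": 2}}, {"b": {"e": 5}})

# Conflicting trees: the same path b.c ends in 2 on one side and 3 on the other.
try:
    deep_union({"b": {"c": 2}}, {"b": {"c": 3}})
    conflict_detected = False
except ValueError:
    conflict_detected = True
```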


Outline:

Introduction to data provenance

A deterministic model

Syntax & operations

Encoding relations

A Query language

Why provenance

Where provenance

Conclusion


Encoding of Relations

Composers
  {name:"G.F. Handel"}
    born: 1685
    period: "baroque"
  {name:"J.S. Bach"}
    born: 1685
    period: "baroque"
  {name:"W.A. Mozart"}
    born: 1756
    period: "classical"

Works
  {name:"G.F. Handel"}
    {opus:"HMV19"}
      title: "Art thou troubled?"
  {name:"J.S. Bach"}
    {opus:"BMV82"}
      title: "I have enough"
    {opus:"BMV552"}
      title: "-"
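One way to picture the encoding is to turn a list of rows into a tree whose edges are labeled by the key fields. This is a sketch under assumptions: rows are Python dicts, a key label such as {name:"J.S. Bach"} is modeled as a frozenset of (field, value) pairs, and `encode_relation` is a hypothetical helper.

```python
def encode_relation(rows, key_fields):
    """Encode rows as a tree: key label -> {non-key field: value}."""
    tree = {}
    for row in rows:
        key = frozenset((field, row[field]) for field in key_fields)
        tree[key] = {f: v for f, v in row.items() if f not in key_fields}
    return tree


composer_rows = [
    {"name": "G.F. Handel", "born": 1685, "period": "baroque"},
    {"name": "J.S. Bach", "born": 1685, "period": "baroque"},
    {"name": "W.A. Mozart", "born": 1756, "period": "classical"},
]

# The relation name becomes the first edge, the key the second.
db = {"Composers": encode_relation(composer_rows, ["name"])}
bach = db["Composers"][frozenset({("name", "J.S. Bach")})]
```

Since keys are unique within a relation, the resulting tree has distinct labels on the edges leaving each node, as the deterministic model requires.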


Encoding of Relations

[Figure: the encoded trees, annotated level by level as Relation, Key and Tuple.]


Outline:

Introduction to data provenance

A deterministic model

Syntax & operations

Encoding relations

A Query language

Why provenance

Where provenance

Conclusion


Example

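[Figure: sources D1 and D2 with schemas (Name, Id, address) and (Id, Telephone), and the view over Name and Telephone containing ("John Doe", 1234), annotated "Id, Telephone" and "Valid". Next to it, an edge-labeled tree with labels {Id:1}, {Id:2}, {Id:3}, a, b, c, num and values 3, "a", 109. The step between them is marked "?".]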


A Query Language

A general syntactic form:

where (p1 ∈ e1,
       ...,
       pn ∈ en,
       condition)
collect e


Example

where Composers.x.born:u ∈ D,
      u < 1700
collect {year:u}:C

Output: {year:1685}:C (x2)


Example

Q =
where (Composers.{name:x}.{born:u, period:v} ∈ D,
       Works.{{name:x}.opus:w}:y ∈ D)
collect {name:x}.{born:u, {opus:w}:y}

Output:
{
  {name:"J.S. Bach"}.{born:1685, {opus:"BMV82"}:"I have enough"},
  {name:"J.S. Bach"}.{born:1685, {opus:"BMV552"}:"-"},
  {name:"G.F. Handel"}.{born:1685, {opus:"HMV19"}:"Art thou troubled?"}
}


Example

[Figure: the Composers and Works trees (top) and the parts of them that Q uses (bottom): the born:1685 edges of {name:"G.F. Handel"} and {name:"J.S. Bach"}, and the titles "I have enough", "-" and "Art thou troubled?" under {opus:"BMV82"}, {opus:"BMV552"} and {opus:"HMV19"}.]


"Collect.." - How?

For each pattern pi and each assignment of its variables that matches the data, evaluate the condition.

If the condition is true, add the value of e to the output.

"Union" together the output values (deep union).

The expression e in "collect e" is built from constants (c), variables (x) and constructors such as {e : e}.
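The evaluation steps above can be sketched as a small interpreter. This is an illustration under assumptions, not the presented system: patterns are simplified to single paths (tuples of constants and variables), the condition and the collect expression are plain Python callables, and all names (`Var`, `unify`, `match`, `evaluate`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


def unify(item, value, env):
    """Return env extended so that item equals value, or None on failure."""
    if isinstance(item, Var):
        if item.name in env:
            return env if env[item.name] == value else None
        return {**env, item.name: value}
    return env if item == value else None


def match(pattern, tree, env):
    """Yield every variable assignment under which the path pattern occurs in tree."""
    head, rest = pattern[0], pattern[1:]
    if not rest:  # last item matches the value reached
        bound = unify(head, tree, env)
        if bound is not None:
            yield bound
    elif isinstance(tree, dict):
        for edge, child in tree.items():
            bound = unify(head, edge, env)
            if bound is not None:
                yield from match(rest, child, bound)


def deep_union(v1, v2):
    if isinstance(v1, dict) and isinstance(v2, dict):
        merged = dict(v1)
        for edge, child in v2.items():
            merged[edge] = deep_union(merged[edge], child) if edge in merged else child
        return merged
    if v1 == v2:
        return v1
    raise ValueError("deep union undefined")


def evaluate(patterns, db, condition, collect):
    """where patterns ∈ db, condition collect e: union e over all satisfying assignments."""
    envs = [{}]
    for pattern in patterns:
        envs = [new for env in envs for new in match(pattern, db, env)]
    result = {}
    for env in envs:
        if condition(env):
            result = deep_union(result, collect(env))
    return result


db = {"Composers": {
    "G.F. Handel": {"born": 1685, "period": "baroque"},
    "J.S. Bach": {"born": 1685, "period": "baroque"},
    "W.A. Mozart": {"born": 1756, "period": "classical"},
}}

# where Composers.x.born:u ∈ D, u < 1700 collect {year:u}:C
result = evaluate(
    patterns=[("Composers", Var("x"), "born", Var("u"))],
    db=db,
    condition=lambda env: env["u"] < 1700,
    collect=lambda env: {("year", env["u"]): "C"},
)
```

Both Handel and Bach satisfy the condition and produce the same value, so the deep union collapses them into a single output entry.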


Example

?

where Emps.Id:x.salary:y ∈ D,
      Emps.Id:y.bonus:z ∈ D
collect Id:x.new_salary:y


Well-Formed Queries

Q is well-formed if:

a) No pi is a single variable

b) Each ei is either a (nested) query or an expression that doesn’t involve a query

c) Each comparison is between variables or between variables and constants only.

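These restrictions are needed for the soundness of the rewrite rules.

Examples: the pattern "x ∈ D" is a single variable (marked X); the comparison "u < 1700" is between a variable and a constant (marked V).

[Example query, only partly recoverable: a where/collect query over R.x.y:z, S.t.u and D that collects x.y:z.]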


Well-Defined Queries

A query may be undefined on a certain input, for example when the deep union of its output values is undefined.

Q is well-defined if it is defined on every input.

For the rest of the presentation, we consider only queries that are both well-formed and well-defined.


Singular Expression

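An expression e is singular if it is a single path terminated by a constant or a variable, and $e \neq (e_1 \cup e_2)$ for any non-empty and distinct expressions $e_1$ and $e_2$.

[Figure: an edge-labeled tree with labels {Id:1}, {Id:2}, {Id:3}, a, b, c, num and values 3, "a", 109.]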


Normal Form

Q = Q1 U .. U Qn, and each Qi =

where (sp1 ∈ D1, .., spn ∈ Dn, condition) collect (se)

spi and se: a singular pattern and a singular expression, respectively.

Di: a database constant.

condition: a Boolean predicate on the variables of the query.


Strong Normalization

The rewrite system R is strongly normalizing

Therefore, any sequence of applications of the rewrite rules takes a well-formed query to a normal form in a finite number of steps!


Outline:

Introduction to data provenance

A deterministic model

Why provenance (syntactic characterization and invariance under query rewriting)

Where provenance

Conclusion


Why Is The Tuple In Our View?

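[Figure: sources D1 and D2 with schemas (Name, Id, address) and (Id, Telephone), and the view over Name and Telephone containing ("John Doe", 1234), annotated "Id, Telephone" and "Valid". Next to it, an edge-labeled tree with labels {Id:1}, {Id:2}, {Id:3}, a, b, c, num and values 3, "a", 109. Caption: "Where.. Collect.."]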


Witnesses

A witness is a collection of values taken from D that proves an output.

s is a witness for t with respect to Q and D if:

$t \sqsubseteq Q(s)$ and $s \sqsubseteq D$
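The witness condition can be checked directly once substructure is available. The sketch below assumes trees are nested dicts and stands in for Q with an ordinary Python function; `born_before_1700` is a hypothetical query that keeps the birth year of composers born before 1700.

```python
def paths(tree, prefix=()):
    """Path representation: the set of (path, constant) pairs of a tree."""
    if not isinstance(tree, dict):
        return {(prefix, tree)}
    found = set()
    for edge, child in tree.items():
        found |= paths(child, prefix + (edge,))
    return found


def is_substructure(w, v):
    return paths(w) <= paths(v)


def is_witness(s, t, query, db):
    """s is a witness for t if t is a substructure of query(s) and s of db."""
    return is_substructure(t, query(s)) and is_substructure(s, db)


def born_before_1700(db):
    # Hypothetical query used only for this illustration.
    out = {}
    for name, fields in db.get("Composers", {}).items():
        if "born" in fields and fields["born"] < 1700:
            out[name] = {"born": fields["born"]}
    return out


D = {"Composers": {
    "G.F. Handel": {"born": 1685, "period": "baroque"},
    "W.A. Mozart": {"born": 1756, "period": "classical"},
}}
t = {"G.F. Handel": {"born": 1685}}

good = {"Composers": {"G.F. Handel": {"born": 1685}}}  # enough to produce t
bad = {"Composers": {"W.A. Mozart": {"born": 1756}}}   # part of D, but does not produce t

good_is_witness = is_witness(good, t, born_before_1700, D)
bad_is_witness = is_witness(bad, t, born_before_1700, D)
```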


Example

Q =
where (Composers.{name:x}.{born:u, period:v} ∈ D,
       Works.{{name:x}.opus:w}:y ∈ D)
collect {name:x}.{born:u, {opus:w}:y}

Output:
{
  {name:"J.S. Bach"}.{born:1685, {opus:"BMV82"}:"I have enough"},
  {name:"J.S. Bach"}.{born:1685, {opus:"BMV552"}:"-"},
  {name:"G.F. Handel"}.{born:1685, {opus:"HMV19"}:"Art thou troubled?"}
}


Example

[Figure: the Composers and Works trees (top) and the parts of them that Q uses (bottom): the born:1685 edges of {name:"G.F. Handel"} and {name:"J.S. Bach"}, and the titles "I have enough", "-" and "Art thou troubled?" under {opus:"BMV82"}, {opus:"BMV552"} and {opus:"HMV19"}.]


Example - {name:"G.F. Handel"}.born:1685

[Figure: the Composers and Works trees and the parts of them that Q uses, shown for the output value {name:"G.F. Handel"}.born:1685.]


Witnesses

A witness for the output value {name:"G.F. Handel"}.born:1685:

{Composers.{name:"G.F. Handel"}.{born:1685, period:"baroque"},
 Works.{{name:"G.F. Handel"}.opus:"HMV19"}.title:"Art thou troubled?"}


Example – Witness Basis

[Figure: the Composers and Works trees; highlighted below them: Composers.{name:"G.F. Handel"} with born:1685 and period:"baroque", and Works.{name:"G.F. Handel"}.{opus:"HMV19"} with title "Art thou troubled?".]


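Witness Basis - $W_{Q,D}(t)$

$W_{Q,D}(t)$: the set of all witnesses for a value t in Q(D).

[Diagram: for t = t1 U t2, $W_{Q,D}(t)$ is assembled (U) from $W_{Q,D}(t_1)$ and $W_{Q,D}(t_2)$; for Q = Q1 U Q2, with outputs Q1(D) and Q2(D), $W_{Q,D}(t)$ is assembled (U) from $W_{Q_1,D}(t)$ and $W_{Q_2,D}(t)$.]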


Witness Basis

Lemma 1: If Q ~> Q' via the rewrite system R, then for any value t in the output of Q(D), $W_{Q,D}(t) = W_{Q',D}(t)$.

[Diagram: a well-formed Q is rewritten (~>) to a normal-form Q'; both yield Q(D), and $W_{Q,D}(t) = W_{Q',D}(t)$.]


Algorithm: Why(t,Qi,D)

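Notation (t is the output value, D the database):

$Q_i$ = where ($p_1 \in e_1$, .., $p_n \in e_n$, condition) collect $e_i$

$\Delta$ = $p_1 \in e_1$, .., $p_n \in e_n$, condition

$\Delta'$ is built from $p_1, e_1$, .., $p_n, e_n$ [connecting operators not recoverable]

$Q_i'$ = "where ($\Delta$) collect ($\Delta'$) : "C""

Test: does $e_i$ produce t?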


Minimal Witness Basis

A witness for a value is invariant under all equivalent queries, but the witness basis is not.

The minimal witness basis is invariant under equivalent well-formed queries.


Minimal Witness, Minimal Witness Basis

s is a minimal witness for t if:

$$\forall s' \sqsubset s,\ t \not\sqsubseteq Q(s').$$

$M_{Q,D}(t)$, the minimal witness basis for t, is a maximal subset of $W_{Q,D}(t)$ such that:

$$\forall m \in M_{Q,D}(t),\ \nexists\, w \in W_{Q,D}(t);\ w \sqsubset m.$$
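Given a witness basis, the minimal witness basis keeps exactly those witnesses that have no proper substructure elsewhere in the basis. The sketch below assumes witnesses are nested dicts and compares them through their path representations; the function names and the two example witnesses are hypothetical.

```python
def paths(tree, prefix=()):
    """Path representation: the set of (path, constant) pairs of a tree."""
    if not isinstance(tree, dict):
        return {(prefix, tree)}
    found = set()
    for edge, child in tree.items():
        found |= paths(child, prefix + (edge,))
    return found


def minimal_witness_basis(witness_basis):
    """Keep each witness m for which no other witness w is a proper substructure of m."""
    reps = [paths(w) for w in witness_basis]
    return [
        w for w, rep in zip(witness_basis, reps)
        if not any(other < rep for other in reps)  # `<` is proper subset on sets
    ]


w_big = {"Composers": {"G.F. Handel": {"born": 1685, "period": "baroque"}}}
w_small = {"Composers": {"G.F. Handel": {"born": 1685}}}

minimal = minimal_witness_basis([w_big, w_small])
```

`w_big` is dropped because `w_small` is a proper substructure of it; `w_small` stays because nothing in the basis is strictly smaller.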


Example - 1685

[Figure: the Composers and Works trees; highlighted below them: Composers.{name:"G.F. Handel"} with born:1685 and period:"baroque", and Works.{name:"G.F. Handel"}.{opus:"HMV19"} with title "Art thou troubled?".]


Example - 1685

[Figure: the Composers and Works trees; highlighted below them: Composers.{name:"G.F. Handel"} with born:1685 and period:"baroque", and Works.{name:"G.F. Handel"}.{opus:"HMV19"} with title "Art thou troubled?".]

Not a proof tree for the value!!!


Invariance of Minimal Witness Basis under Equivalent Queries

Let Q and Q' be two equivalent well-formed queries, and let t be in Q(D) and Q'(D).

Then $M_{Q,D}(t) = M_{Q',D}(t)$.


Cascaded Witnesses (Query Composition): Unnesting of Witnesses

D = D1 U .. U Dn, V = V(D). For a value t in Q(D,V),

$$W_{Q',D}(t) = \{\, w \cup w' \mid (w \cup v') \in W_{Q,\{D,V(D)\}}(t),\ v' \text{ is the value taken from view } V(D),\ w' \in W_{V,D}(v') \,\}$$

where Q' is the rewritten query via our rewrite system R in which view V has been "composed out".


Outline:

Introduction to data provenance

A deterministic model

Why provenance

Where provenance (problems defining, invariance under query rewriting)

Conclusion


Reminder:

So far we have looked at what pieces of input data validate the existence of an output value. (why provenance)

We now focus on identifying what pieces of input data helped create values that appear in the output. (where provenance)


Example - 1685

[Figure: the witness basis (Composers.{name:"G.F. Handel"} with born:1685 and period:"baroque"; Works.{name:"G.F. Handel"}.{opus:"HMV19"} with title "Art thou troubled?") shown next to the where provenance (Composers.{name:"G.F. Handel"}.born).]

There are many difficulties involved in formalizing where provenance.


Invariance Over Equivalent Queries

Looking for employees with a salary of $50:

where Emps.{Id:x}.salary:$50 ∈ D
collect {Id:x}.salary:$50

where Emps.{Id:x}.salary:y ∈ D, y = $50
collect {Id:x}.salary:y

The two queries are equivalent, but what is the where-provenance of $50? In the first query the output value is the constant written in the query; in the second it is copied from the database through y.


Multiple Pieces of Data

where Emps.{Id:x}.salary:y ∈ D, Emps.{Id:x}.salary:z ∈ D, Emps.{Id:x}.bonus:z ∈ D
collect {Id:x}.new_salary:y

where Emps.{Id:x}.salary:y ∈ D, Emps.{Id:x}.bonus:y ∈ D
collect {Id:x}.new_salary:y

Is new_salary tracked by y alone, or by y and z?


Nested Queries

D = {R.1.2:3, S.1.2:3}. Output: 1.2:3

Flat query:

where R.x.y:z ∈ D, S.x.y:z ∈ D collect x.y:z

Where provenance: {R.1:2, S.1:2}

Query with a nested sub-query (where R.x.y:z ∈ D, collect x.y:z):

where R.x.y:z ∈ D, S.t.u ∈ D, t:u collect {x.y:z, t:u}

Output: {1.2:3, 1.2:3} => u = y:z

Where provenance: {R.1:2, S.1:2}


Traceable Queries

A restricted class of queries, for which where-provenance is preserved under rewriting.


Example - {name:"G.F. Handel"}.born:1685

[Figure: the witness basis (Composers.{name:"G.F. Handel"} with born:1685 and period:"baroque"; Works.{name:"G.F. Handel"}.{opus:"HMV19"} with title "Art thou troubled?") shown next to the where provenance (Composers.{name:"G.F. Handel"}.born).]


Derivation Basis (Where Provenance)

The derivation basis for l:v finds a variable x in the output expression that will generate v.

[Figure: the Composers tree (G.F. Handel: born 1685, "baroque"; J.S. Bach: born 1685, "baroque"; W.A. Mozart: born 1756, "classical").]

where Composers.x.born:u ∈ D,
      u < 1700
collect {year:u}:C

Output: {year:1685}:C (x2)


Where(l:v,Q,D)

Computes the derivation basis of l:v, written $\Gamma_{Q,D}(l:v)$:

$$\mathit{Where}(l:v, Q, D) = \Gamma_{Q,D}(l:v) = \{([\![p_0]\!] \cup \ldots \cup [\![p_n]\!],\ S)\}$$

The "collect" clause of the new query returns two things: the patterns, and the paths pointing to x in the "where" clause of Q.


Derivation Basis

Emps
  {Id:1}: bonus 100, salary 2000
  {Id:2}: bonus 17, salary 1700
  {Id:3}: bonus 300, salary 1900

where Emps.Id:x.salary:y ∈ D,    (p1)
      Emps.Id:x.bonus:y ∈ D      (p2)
collect Id:x.new_salary:y

Output: {Id:1}.new_salary:2100, {Id:2}.new_salary:1717, {Id:3}.new_salary:2200

Pattern p1 instantiated for {Id:1}: Emps.Id:1.salary:2000 ∈ D

Derivation basis of l:v for {Id:1}: the instantiated patterns p1 and p2, paired with the paths Emps.Id:1.{salary, bonus}.
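To make the idea of "the paths a value was derived from" concrete, here is a small sketch that computes new_salary and records the source paths alongside it. It rests on an assumption that the listed outputs suggest but the surviving query text does not state: new_salary is taken to be salary plus bonus. The function name and the `derived_from` field are hypothetical.

```python
# Emps data as read off the slide, with {Id:n} simplified to the integer n.
emps = {
    1: {"bonus": 100, "salary": 2000},
    2: {"bonus": 17, "salary": 1700},
    3: {"bonus": 300, "salary": 1900},
}


def new_salary_with_provenance(emps):
    """For each employee, return new_salary and the input paths it came from."""
    out = {}
    for emp_id, fields in emps.items():
        out[emp_id] = {
            # Assumption: new_salary = salary + bonus (matches 2100, 1717, 2200).
            "new_salary": fields["salary"] + fields["bonus"],
            # Where-provenance sketch: the two input paths that fed the value.
            "derived_from": [
                ("Emps", emp_id, "salary"),
                ("Emps", emp_id, "bonus"),
            ],
        }
    return out


result = new_salary_with_provenance(emps)
```

For employee 1 the recorded paths correspond to Emps.Id:1.{salary, bonus} in the slide's notation.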


Derivation Basis - $\Gamma_{Q,D}(l:v)$

[Diagram: for Q = Q1 U Q2, with outputs Q1(D) and Q2(D), $\Gamma_{Q,D}(l:v)$ is the union (U) of $\Gamma_{Q_1,D}(l:v)$ and $\Gamma_{Q_2,D}(l:v)$.]

v is an atomic value


Q Is Traceable If:

1) Each pi in the query matches either against some database constant or against a sub-query.

2) Every sub-query is a view that does not share any variables with the outer scope.

3) Only a singular pattern is allowed to match against a sub-query.

4) The pattern and output expression of the sub-query consist of a sequence of distinct variables and have the same length.


Propositions

Proposition 1: If Q is traceable and Q ~> Q', then Q' is traceable.

Proposition 2: If Q is traceable and Q ~> Q', then $\Gamma_{Q,D}(l:v) = \Gamma_{Q',D}(l:v)$ for any l:v in the output of Q(D), where $\Gamma$ denotes the derivation basis.


Outline:

Introduction to data provenance

A deterministic model

Why provenance

Where provenance

Conclusion


Why Is The Tuple In Our View?

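[Figure: sources D1 and D2 with schemas (Name, Id, address) and (Id, Telephone), and the view over Name and Telephone containing ("John Doe", 1234), annotated "Id, Telephone" and "Valid". Next to it, an edge-labeled tree with labels {Id:1}, {Id:2}, {Id:3}, a, b, c, num and values 3, "a", 109. Caption: "Where.. Collect.."]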


Conclusions

o Describing and understanding the provenance of data

o Two perspectives: Why is a piece of data in the output? Where did a piece of data come from?

o A system of rewrite rules where why-provenance is preserved over the class of well-defined queries and where-provenance is preserved over the class of traceable queries.


Thank you for listening!