Circuits for Datalog Provenance

Daniel Deutch (Tel Aviv Univ.), Tova Milo (Tel Aviv Univ.), Sudeepa Roy (Univ. of Washington), Val Tannen (Univ. of Pennsylvania)



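A Simple Example of Data Provenance

“Boolean Provenance/Lineage” as a Boolean formula:
• Q is true on D $\Leftrightarrow$ $F_{Q,D}$ is true
• Poly-size, poly-time computable (data complexity)
• But Q is an RA+ query

This talk: What if Q is a Datalog program?

Database D (each tuple is annotated with a Boolean variable):

AsthmaPatient       Friend                Smoker
Ann   $x_1$         Ann  Joe   $y_1$      Joe   $z_1$
Bob   $x_2$         Ann  Tom   $y_2$      Tom   $z_2$
                    Bob  Tom   $y_3$

Boolean query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$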


Motivation

Provenance
– Reliability and repeatability
– View management and deletion propagation
– Trust and security management
– Query answering in probabilistic databases, …

Datalog
– Datalog is popular again! (two keynotes at this ICDT/EDBT)
– Data extraction on the Web, declarative networking
– Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)

Finding a suitable “Provenance for Datalog” is important
– Both from theoretical and practical viewpoints

How do we compute, store, and interpret provenance for Datalog programs efficiently and effectively?


Overview of Our Results

Can we get poly-size Boolean formulas for Datalog provenance?
No, even if we allow unbounded time.

Do we have a solution?
Yes! Use Boolean circuits!

What about general “provenance semirings” beyond Boolean provenance [Green et al. ’07]?
It depends on the semiring.


Outline

• Background
• Circuits for Boolean Provenance
• Circuits for General Provenance Semirings



Datalog

Datalog program for Transitive Closure and Single-source Reachability:

T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)

EDB (Extensional Database, base) relation for edges: R

IDB (Intensional Database, derived) relations
─ Transitive closure (T)
─ Single-source reachability from vertex ‘a’ (S)
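A minimal sketch of how the three rules could be evaluated bottom-up by naive fixpoint iteration. The function name `evaluate` and the example edge set are illustrative assumptions, not part of the talk.

```python
def evaluate(R, source="a"):
    """Naive bottom-up evaluation of the transitive-closure program.

    R is a set of (x, y) edge tuples (the EDB relation).
    Returns the IDB relations T (transitive closure) and
    S (vertices reachable from `source`).
    """
    T = set()
    while True:
        # Rule 1: T(x, y) :- R(x, y)
        new_T = set(R)
        # Rule 2: T(x, y) :- R(x, z), T(z, y), using T from the previous round
        new_T |= {(x, y) for (x, z) in R for (z2, y) in T if z == z2}
        if new_T == T:  # nothing new was derived: fixpoint reached
            break
        T = new_T
    # Rule 3: S(x) :- T(a, x)
    S = {y for (x, y) in T if x == source}
    return T, S


# Hypothetical example graph: a -> b -> c, plus a self-loop on a.
R = {("a", "a"), ("a", "b"), ("b", "c")}
T, S = evaluate(R)
```

On this hypothetical graph the loop stops after three rounds, once the derived pair (a, c) has been added and no further tuples appear.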


Boolean Provenance: PosBool(X)-Database

Tuples are annotated with variables from a set X
– Here $X = \{x_1, x_2, y_1, y_2, \ldots\}$

For n variables in X, there are $2^n$ possible worlds, one per assignment $\mu : X \to \{\text{True}, \text{False}\}$

Useful in query evaluation on incomplete or probabilistic databases

PosBool(X)-database D:

AsthmaPatient       Friend                Smoker
Ann   $x_1$         Ann  Joe   $y_1$      Joe   $z_1$
Bob   $x_2$         Ann  Tom   $y_2$      Tom   $z_2$
                    Bob  Tom   $y_3$


RA+ over PosBool(X)-Database

Annotation propagates from input to output
– Join = $\wedge$, Projection/Union = $\vee$

Output tuples are annotated by monotone Boolean formulas
– $F_{Q,D}$ is the annotation of the unique output tuple

PosBool(X)-database D:

AsthmaPatient       Friend                Smoker
Ann   $x_1$         Ann  Joe   $y_1$      Joe   $z_1$
Bob   $x_2$         Ann  Tom   $y_2$      Tom   $z_2$
                    Bob  Tom   $y_3$

RA+ Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$


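Two Important Properties: RA+ over PosBool(X)-Database

For every RA+ query Q, database D, and assignment $\mu$:
1. (Faithful representation) $Q(\mu(D)) = \mu[Q(D)]$
2. (Poly-size overhead) The size of $F_{Q,D}$ is polynomial in |D|, and $F_{Q,D}$ can be computed in poly-time.

PosBool(X)-database D:

AsthmaPatient       Friend                Smoker
Ann   $x_1$         Ann  Joe   $y_1$      Joe   $z_1$
Bob   $x_2$         Ann  Tom   $y_2$      Tom   $z_2$
                    Bob  Tom   $y_3$

RA+ Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$

[Figure: a sample True/False assignment to the seven variables of D for which both sides of property 1 evaluate to False]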
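A small sketch that checks the faithful-representation property on this example by brute force over all $2^7$ assignments. The encoding of D as Python dictionaries and all function names are assumptions made for illustration.

```python
from itertools import product

# PosBool(X)-database D: each tuple carries its own Boolean variable.
asthma = {("Ann",): "x1", ("Bob",): "x2"}
friend = {("Ann", "Joe"): "y1", ("Ann", "Tom"): "y2", ("Bob", "Tom"): "y3"}
smoker = {("Joe",): "z1", ("Tom",): "z2"}


def provenance():
    """F_{Q,D} as a monotone DNF: join becomes AND, projection becomes OR."""
    return [
        (ax, fxy, sy)
        for (x,), ax in asthma.items()
        for (x2, y), fxy in friend.items()
        for (y2,), sy in smoker.items()
        if x == x2 and y == y2
    ]


def eval_formula(dnf, mu):
    """Evaluate the DNF under the assignment mu."""
    return any(all(mu[v] for v in term) for term in dnf)


def eval_query_on_world(mu):
    """Evaluate Q directly on the world mu(D): keep tuples whose variable is True."""
    A = {t for t, v in asthma.items() if mu[v]}
    F = {t for t, v in friend.items() if mu[v]}
    S = {t for t, v in smoker.items() if mu[v]}
    return any((x,) in A and (y,) in S for (x, y) in F)


F_QD = provenance()
variables = sorted(set(asthma.values()) | set(friend.values()) | set(smoker.values()))
faithful = all(
    eval_query_on_world(dict(zip(variables, bits)))
    == eval_formula(F_QD, dict(zip(variables, bits)))
    for bits in product([False, True], repeat=len(variables))
)
```

Here `F_QD` has exactly the three conjunctions of the formula above, and `faithful` records whether query evaluation and formula evaluation agree in every possible world.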


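Datalog over PosBool(X) Database

Semantics using derivation trees (Green et al. 2007)

T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)

R (edges of a graph on the nodes a, b):
a  a   p
a  b   q

Annotation of T(a, b):

$\bigvee_{\text{trees } \tau} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t) = q \vee (p \wedge q) \vee (p \wedge p \wedge q) \vee \cdots = q$

[Figure: derivation trees for T(a, b). The smallest has the single leaf R(a, b); each larger tree adds one more R(a, a) leaf above a subtree for T(a, b).]

• Infinitely many trees
• But the annotation always has a finite equivalent form
• The finite form is not necessarily poly-size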
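A short worked derivation of why the infinite disjunction for T(a, b) collapses, using only two standard Boolean laws (idempotence and absorption); the notation $\mathrm{Prov}(T(a,b))$ is introduced here for illustration.

```latex
\begin{align*}
\mathrm{Prov}(T(a,b))
  &= q \vee (p \wedge q) \vee (p \wedge p \wedge q) \vee \cdots \\
  &= q \vee (p \wedge q)
     && \text{(idempotence: } p \wedge p = p\text{)} \\
  &= q
     && \text{(absorption: } q \vee (p \wedge q) = q\text{)}
\end{align*}
```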


Lower Bound: Boolean Formulas for Datalog Provenance on PosBool(X)

Theorem: Given a PosBool(X)-database D and a Datalog program P, the provenance of tuples in P(D) cannot, in general, be faithfully represented using Boolean formulas of size polynomial in |D|.

Proof outline:
• st-connectivity on n nodes requires monotone Boolean formulas of size $n^{\Omega(\log n)}$ (Karchmer-Wigderson, 1988)
• Faithful representation requires: for all True/False assignments $\mu$ to X, $P(\mu(D)) = \mu[P(D)]$
• Reduce to the hard instance by choosing the right $\mu$ when P = transitive closure

Solution: Boolean circuits!


Outline

• Background
• Circuits for Boolean Provenance or PosBool(X)
• Circuits for General Provenance Semirings


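Boolean Circuits

A circuit is a DAG
– Reuses common subexpressions
– A Boolean formula is a tree

Leaf nodes:
– EDB vars in X

Internal nodes:
– $\wedge$: IDB/EDB vars used in one derivation
– $\vee$: alternative derivations

Roots:
– IDB vars

Example:

T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)

R (edges of a graph on the nodes a, b):
a  a   p
a  b   q

[Figure: circuit with root $X_{T(a,b)}$, built from the inputs $X_{R(a,b)} = q$ and $X_{R(a,a)} = p$ and a lower copy of $X_{T(a,b)}$]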
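A sketch of why sharing matters: the same gate can feed several parents in a DAG, whereas a formula must copy it. The chain-shaped circuit below is a made-up example chosen only to show the unfolding cost; it is not the hard instance from the lower bound, and all names are assumptions.

```python
# A circuit is stored as a dict of named gates; a gate may be used by many parents.
# gates[name] is ("var",) for an input, or (op, left, right) with op in {"and", "or"}.
def chain_circuit(k):
    gates = {"g0": ("var",)}
    for i in range(1, k + 1):
        gates[f"a{i}"] = ("var",)
        gates[f"b{i}"] = ("var",)
        # Both AND gates read the same gate g_{i-1}: it is shared, not copied.
        gates[f"l{i}"] = ("and", f"g{i-1}", f"a{i}")
        gates[f"r{i}"] = ("and", f"g{i-1}", f"b{i}")
        gates[f"g{i}"] = ("or", f"l{i}", f"r{i}")
    return gates


def circuit_size(gates):
    """Number of gates in the DAG."""
    return len(gates)


def formula_size(gates, root):
    """Number of nodes in the tree obtained by unfolding the DAG from `root`."""
    node = gates[root]
    if node[0] == "var":
        return 1
    return 1 + formula_size(gates, node[1]) + formula_size(gates, node[2])


gates = chain_circuit(10)
sizes = (circuit_size(gates), formula_size(gates, "g10"))
```

For k = 10 the DAG has 51 gates while its unfolding into a tree has 6139 nodes: the circuit grows linearly in k, the unfolded formula doubles at every step.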


Upper Bound: Boolean Circuits for PosBool(X)

Theorem: Given any PosBool(X)-database D and Datalog program P, the provenance of tuples in P(D) can be faithfully represented using monotone Boolean circuits of size polynomial in |D| (and these circuits can be computed in poly-time).


Proof Sketch

Two key ideas from previous work:

1. Datalog provenance can be represented by a system of equations, obtained by instantiating the variables of the Datalog program P to EDB/IDB tuples [Green et al. 2007]
• EDB tuples $\to$ constants, IDB tuples $\to$ variables
• Iteratively solve this system of equations
• Fixpoint = provenance for all IDB tuples

2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011]
• N = #IDB tuples
• Build a circuit with N+1 layers from the system of equations


Illustration

T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)

R (edges of a graph on the nodes a, b):
a  a   p
a  b   q

Step 1: Build the system of equations from all possible instantiations $x, y, z \in \{a, b\}$ (IDB tuples become variables, EDB annotations are constants):

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$

Step 2: Build a circuit with 4 + 1 layers (N = 4) …
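A sketch of Step 1 followed by an iterative solution, under stated assumptions: rules are instantiated over the domain {a, b}, underivable IDB tuples are pruned (which is how the four equations of the slide arise here), and PosBool(X) values are encoded as minimal DNFs. The encoding and all function names are illustrative, not the authors' implementation.

```python
from itertools import product

R = {("a", "a"): "p", ("a", "b"): "q"}   # EDB tuples with their annotations
domain = ["a", "b"]


def ground(R, domain, source="a"):
    """Instantiate every rule. An equation is a list of conjunctions whose
    members are EDB annotations (strings) or IDB tuples (Python tuples)."""
    eqs = {}
    for x, y in product(domain, repeat=2):
        alts = [(R[(x, y)],)] if (x, y) in R else []        # T(x,y) :- R(x,y)
        alts += [(R[(x, z)], ("T", z, y))                     # T(x,y) :- R(x,z), T(z,y)
                 for z in domain if (x, z) in R]
        eqs[("T", x, y)] = alts
    for x in domain:                                          # S(x) :- T(a,x)
        eqs[("S", x)] = [(("T", source, x),)]
    return eqs


def prune(eqs):
    """Keep only IDB tuples that have at least one finite derivation."""
    derivable, changed = set(), True
    ok = lambda alt: all(isinstance(a, str) or a in derivable for a in alt)
    while changed:
        changed = False
        for head, alts in eqs.items():
            if head not in derivable and any(ok(alt) for alt in alts):
                derivable.add(head)
                changed = True
    return {h: [alt for alt in alts if ok(alt)]
            for h, alts in eqs.items() if h in derivable}


def minimize(dnf):
    """Absorption: drop any conjunction that strictly contains another one."""
    return {m for m in dnf if not any(o < m for o in dnf)}


def solve(eqs):
    """Iterate from 'false' (the empty DNF) for N + 1 rounds, N = #IDB variables."""
    val = {h: set() for h in eqs}
    for _ in range(len(eqs) + 1):
        new = {}
        for h, alts in eqs.items():
            acc = set()
            for alt in alts:
                term = {frozenset()}                          # encodes 'true'
                for a in alt:
                    factor = {frozenset([a])} if isinstance(a, str) else val[a]
                    term = {m | n for m in term for n in factor}
                acc |= term
            new[h] = minimize(acc)
        val = new
    return val


equations = prune(ground(R, domain))
solution = solve(equations)
```

On this input `equations` contains exactly the four equations of Step 1, and `solution` assigns q to T(a, b) and S(b), and p to T(a, a) and S(a).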


Illustration

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$

[Figure: layered circuit for these equations. The inputs are p and q. Level 0 holds the leaf copies $X_{T(a,a),0}$, $X_{T(a,b),0}$, $X_{S(a),0}$, $X_{S(b),0}$; Level 1 and Level 2 hold the copies $X_{\cdot,1}$ and $X_{\cdot,2}$ of the same variables, and so on.]

• Assign leaf IDB vars to false
• Multiple roots for multiple IDB vars
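A sketch of the layered construction for the four equations, hard-coded here so the block stands alone. Gate `(X, i)` reads only gates of layer `i - 1` and the inputs p, q; layer 0 is the constant false. Whether layer 0 counts towards the N + 1 layers is not clear from the slide, so this sketch simply builds N + 1 layers on top of it (extra layers do not change the result). Names are assumptions.

```python
# Each IDB variable is an OR of AND-terms; "p" and "q" are EDB inputs.
EQS = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)":   [["T(a,a)"]],
    "S(b)":   [["T(a,b)"]],
}
INPUTS = {"p", "q"}


def build_layers(eqs):
    """Return (gates, top): gate (X, i) computes X after i rounds."""
    n = len(eqs)
    gates = {(x, 0): ("const", False) for x in eqs}       # leaf IDB vars are false
    for i in range(1, n + 2):                             # layers 1 .. N+1
        for x, alts in eqs.items():
            terms = [[a if a in INPUTS else (a, i - 1) for a in alt] for alt in alts]
            gates[(x, i)] = ("or-of-ands", terms)
    return gates, n + 1


def evaluate(gates, assignment):
    """Single bottom-up pass; dict insertion order is already layer by layer."""
    value = {}
    for name, (kind, payload) in gates.items():
        if kind == "const":
            value[name] = payload
        else:
            value[name] = any(
                all(assignment[a] if a in INPUTS else value[a] for a in term)
                for term in payload
            )
    return value


gates, top = build_layers(EQS)
value = evaluate(gates, {"p": False, "q": True})
roots = {x: value[(x, top)] for x in EQS}    # one root per IDB variable
```

With p false and q true, the roots for T(a, b) and S(b) evaluate to true and those for T(a, a) and S(a) to false.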


Optimizations

1. Store only two levels of the circuit instead of N+1 levels
– Evaluate iteratively

2. Embed circuit construction in semi-naïve evaluation
– Check for new derivations, not only new IDB variables
– Sound and complete

3. Remove self-dependency of IDB vars
– Works for PosBool(X) and also some other semirings…

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$
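One possible reading of optimizations 1 and 3, sketched on the four equations (hard-coded so the block stands alone). Only a bottom and a top level are kept and the top is fed back into the bottom; self-dependent AND-terms are dropped, which is safe in PosBool(X) because such a term is absorbed by the variable's own value. Function names are assumptions, and this is not claimed to be the authors' exact construction.

```python
EQS = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)":   [["T(a,a)"]],
    "S(b)":   [["T(a,b)"]],
}
INPUTS = {"p", "q"}


def remove_self_dependency(eqs):
    """Drop every AND-term that mentions its own head variable."""
    return {x: [t for t in alts if x not in t] for x, alts in eqs.items()}


def step(eqs, bottom, assignment):
    """Compute the top level from the bottom level (one round of the equations)."""
    return {
        x: any(all(assignment[a] if a in INPUTS else bottom[a] for a in t) for t in alts)
        for x, alts in eqs.items()
    }


def evaluate_two_levels(eqs, assignment):
    bottom = {x: False for x in eqs}          # leaf IDB variables start as false
    for _ in range(len(eqs) + 1):             # at most N + 1 rounds
        top = step(eqs, bottom, assignment)
        if top == bottom:                     # nothing changed: fixpoint
            break
        bottom = top                          # reuse the same two levels
    return bottom


SIMPLE = remove_self_dependency(EQS)
agree = all(
    evaluate_two_levels(EQS, {"p": p, "q": q}) == evaluate_two_levels(SIMPLE, {"p": p, "q": q})
    for p in (False, True) for q in (False, True)
)
```

After simplification the equations for T(a, a) and T(a, b) reduce to the single terms p and q, and both systems give the same values under all four assignments to p and q.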


Illustration (from here…)

[Figure: the layered circuit with inputs p and q, leaf IDB variables at level 0 assigned false, and copies of $X_{T(a,a)}$, $X_{T(a,b)}$, $X_{S(a)}$, $X_{S(b)}$ at Level 1 and Level 2]


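Illustration (…to here)

With all these optimizations:

[Figure: two-level circuit with inputs p and q. Bottom level: $X_{T(a,a),\text{bottom}}$, $X_{T(a,b),\text{bottom}}$, $X_{S(a),\text{bottom}}$. Top level: $X_{T(a,a),\text{top}}$, $X_{T(a,b),\text{top}}$, $X_{S(a),\text{top}}$.]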


Applications of PosBool(X)-Circuits

• Linear-time deletion propagation (in circuit size)
• Approximation for probabilistic databases
– even when only the circuit (and not the database) is available
• Circuits can be computed “offline”
– Only a linear-time evaluation is required when needed (e.g. deletion propagation), compared to storing and iteratively solving a system of equations, or re-evaluating the Datalog program
• Can use existing techniques for efficient and parallel circuit evaluation
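A sketch of deletion propagation as a single pass over a stored circuit: deleted tuples are set to false and every gate is visited once, in topological order. For brevity the circuit used is the one for the AsthmaPatient/Friend/Smoker example from the background slides; gate names and the function `survives` are assumptions.

```python
# Gates listed in topological order: (name, operator, inputs).
CIRCUIT = [
    ("g1", "and", ["x1", "y1", "z1"]),
    ("g2", "and", ["x1", "y2", "z2"]),
    ("g3", "and", ["x2", "y3", "z2"]),
    ("out", "or", ["g1", "g2", "g3"]),
]
INPUTS = ["x1", "x2", "y1", "y2", "y3", "z1", "z2"]


def survives(deleted):
    """Does the output tuple survive after deleting the given input tuples?"""
    value = {v: v not in deleted for v in INPUTS}     # deleted tuple -> False
    for name, op, args in CIRCUIT:                    # each gate and wire is visited once
        vals = [value[a] for a in args]
        value[name] = all(vals) if op == "and" else any(vals)
    return value["out"]


checks = {
    "delete z2": survives({"z2"}),
    "delete x1": survives({"x1"}),
    "delete x1 and z2": survives({"x1", "z2"}),
}
```

The database itself is not consulted: only the circuit and the set of deleted variables are needed, which is the point of computing the circuit offline.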



Commutative Semirings

$(K, +_K, \cdot_K, 0_K, 1_K)$
– Domain K
– $+_K$, $\cdot_K$: associative, commutative, with neutral elements $0_K$ and $1_K$ respectively
– $\cdot_K$ distributes over $+_K$, i.e. $a \cdot_K (b +_K c) = a \cdot_K b +_K a \cdot_K c$
– $0_K$ annihilates every element of K, i.e. $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$

Examples:
– $(\mathbb{B}, \vee, \wedge, \text{False}, \text{True})$: set semantics
– $(\mathbb{N}, +, \cdot, 0, 1)$: bag semantics
– $(\mathbb{N} \cup \{\infty\}, \min, +, \infty, 0)$: tropical semiring, used to compute cost (e.g. cost of a shortest path)
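A sketch of the three example semirings as plain Python values, with a brute-force check of the listed laws on a few sample elements. The `Semiring` class and `check_laws` are illustrative assumptions; checking samples is a sanity test, not a proof.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]
    zero: Any
    one: Any


BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)   # set semantics
BAG = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)              # bag semantics
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)                 # cost


def check_laws(K, samples):
    """Check neutral elements, annihilation, and distributivity on the samples."""
    ok = True
    for a in samples:
        ok = ok and K.add(a, K.zero) == a and K.mul(a, K.one) == a
        ok = ok and K.mul(a, K.zero) == K.zero and K.mul(K.zero, a) == K.zero
        for b in samples:
            for c in samples:
                lhs = K.mul(a, K.add(b, c))
                rhs = K.add(K.mul(a, b), K.mul(a, c))
                ok = ok and lhs == rhs
    return ok


# Cost of the cheaper of two paths: one made of edges costing 2 and 3, one costing 4.
cheapest = TROPICAL.add(TROPICAL.mul(2, 3), 4)
```

In the tropical semiring "addition" picks the cheaper alternative and "multiplication" adds up the costs along one derivation, so `cheapest` is the minimum of 5 and 4.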


Provenance Semirings

Generalization of PosBool(X)

$(K, +_K, \cdot_K, 0_K, 1_K)$
– Tuples are annotated with variables from X
– K is of the form Prov(X)
– $+_K$ denotes alternative usage
– $\cdot_K$ denotes joint usage

Examples:
– $(\mathrm{PosBool}(X), \vee, \wedge, \text{False}, \text{True})$
– $(\mathrm{Lin}(X), +, \cdot, \bot, \emptyset)$, where both $+$ and $\cdot$ act as union on sets of variables: tracks contributing tuples [Cui et al. ’00]
– $(\mathrm{Why}(X), \cup, \Cup, \emptyset, \{\emptyset\})$, where $\Cup$ is the pairwise union of subsets: tracks contributing tuples in alternative derivations [Buneman et al. ’01]


Provenance Specialization

Key property needed for applications like deletion propagation, trust management, cost computation, …

Prov(X) specializes correctly to K if any valuation $v : X \to K$ extends uniquely to a homomorphism $h_v : \mathrm{Prov}(X) \to K$ (which correctly maps the $+$ and $\cdot$ of Prov(X) to those of K)

Further, some provenance semirings are “more informative” than others
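A sketch of specialization for Prov(X) = N[X], the semiring of polynomials with natural-number coefficients: a valuation v is extended to a map on polynomials by evaluating each monomial in K. The encoding of polynomials as dictionaries, the tuple encoding of K, and the sample valuation are assumptions for illustration.

```python
import math
import operator
from collections import Counter

# A polynomial in N[X]: {monomial: coefficient}; a monomial is a sorted tuple
# of variable names, with repeats encoding exponents, e.g. ("p", "p", "q") = p^2 q.
def poly_add(f, g):
    out = Counter(f)
    out.update(g)                       # adds coefficients of equal monomials
    return dict(out)


def poly_mul(f, g):
    out = Counter()
    for m, c in f.items():
        for n, d in g.items():
            out[tuple(sorted(m + n))] += c * d
    return dict(out)


# A target semiring K is given as (add, mul, zero, one).
BAG = (operator.add, operator.mul, 0, 1)
TROPICAL = (min, operator.add, math.inf, 0)


def specialize(f, v, K):
    """h_v applied to f: evaluate the polynomial in K under the valuation v."""
    add, mul, zero, one = K
    total = zero
    for monomial, coeff in f.items():
        term = one
        for x in monomial:
            term = mul(term, v[x])
        for _ in range(coeff):          # coefficient c stands for a c-fold sum
            total = add(total, term)
    return total


f = {("p",): 1}                         # p
g = {("p", "q"): 2}                     # 2pq
v = {"p": 2, "q": 3}                    # hypothetical valuation
```

The homomorphism property can then be spot-checked: for example `specialize(poly_mul(f, g), v, K)` should equal the K-product of `specialize(f, v, K)` and `specialize(g, v, K)`, for both choices of K above.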


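Provenance Semiring Hierarchy

[Figure: hierarchy of the provenance semirings N[X], Why(X), Lin(X), PosBool(X), and Sorp(X) (defined later), arranged from more informative to less informative, together with the semirings Tropical, N (bag), Security, and Boolean (set); arrows indicate “specializes correctly”.]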


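Datalog Provenance for General Semirings

PosBool(X): $\bigvee_{\text{trees } \tau} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$

General Prov(X): $\sum_{\text{trees } \tau} \; \prod_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$, with sum and product taken as $+_K$ and $\cdot_K$

• Infinite sums should be well-defined
• Need to consider “$\omega$-continuous semirings” and “$\omega$-continuous homomorphisms”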


Provenance Semiring Hierarchy

[Figure: the hierarchy of N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, and Boolean (set), annotated with “Finite, so $\omega$-continuous” and “Need to add N[[X]] and $\mathbb{N}^{\infty}$”.]

N[[X]]: Most informative provenance semiring [Green et al. ’07]


How good is N[[X]] w.r.t. Size of Datalog Provenance?

• Poly-size overhead does not hold, because of the infinite sums
• But can outputs have finite annotations (built from X, $\cdot$, $+$) that specialize correctly to semirings with finite domains?

Theorem: It is not possible to annotate the output of Datalog programs under N[[X]]-semantics with finite provenance expressions that specialize “correctly” to the semiring Why(X)
(Finite annotations won’t specialize correctly to Why(X))

Theorem: However, we can generate poly-size circuits in poly-time directly for Why(X)
─ Need more levels in the circuit from the system of equations
─ Need a different argument for correctness


Can we still have a good general semiring w.r.t. size?

1. Specializes correctly to interesting semirings
2. Outputs can be annotated by poly-size circuits

We propose Sorp(X)
– Most general absorptive semiring: $a + a \cdot b = a$
– N[X], but keeping only the monomials that are not “absorbed” by the others, e.g.
  $pq + p^2 q^3 = pq$
  $p^2 q + p q^2 = p^2 q + p q^2$

The same algorithm, proof, and optimizations to construct poly-size circuits hold
– Circuits are more general than Boolean circuits
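A sketch of the absorption rule on the two examples, assuming a polynomial is represented as a list of monomials and a monomial as a dictionary from variable to exponent. Coefficients are omitted because $a + a \cdot b = a$ with $a = b = 1$ already gives $1 + 1 = 1$. The representation and function names are assumptions.

```python
def divides(m, n):
    """True if monomial m divides monomial n (every exponent of m fits into n)."""
    return all(n.get(x, 0) >= e for x, e in m.items())


def absorb(monomials):
    """Keep only monomials that are not absorbed by a different monomial
    of the same polynomial (rule: a + a*b = a)."""
    unique = []
    for m in monomials:
        if m not in unique:
            unique.append(m)
    return [m for m in unique
            if not any(o != m and divides(o, m) for o in unique)]


pq, p2q3 = {"p": 1, "q": 1}, {"p": 2, "q": 3}
p2q, pq2 = {"p": 2, "q": 1}, {"p": 1, "q": 2}

first = absorb([pq, p2q3])      # pq + p^2 q^3
second = absorb([p2q, pq2])     # p^2 q + p q^2
```

In the first polynomial pq divides p²q³, so only pq remains; in the second neither monomial divides the other, so both are kept.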


Provenance Semiring Hierarchy

[Figure: the provenance semiring hierarchy, including Sorp(X): N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, Boolean (set)]


Related Work

Data Provenance
– e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08]

Circuits
– Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book)

Provenance for Datalog
– System of equations, derivation trees, infinite sum [Grahne ’91, Green et al. ’07]
– Poly-size c-tables with Boolean formulas for Datalog with contradictions [Abiteboul et al. 2014]


Conclusions

Circuits to represent and store Datalog provenance
– For PosBool(X) and other semirings
– Semantics, algorithms, limitations, applicability
– Preliminary experiments support our results: we compared circuits for deletion propagation with iteratively solving the system of equations and with re-evaluating the Datalog program from scratch

Future work:
– A complete implementation, evaluation, new applications


Thank You

Questions?