A Generic Provenance Middleware for Database Queries, Updates, and Transactions

Bahareh Sadat Arab¹, Dieter Gawlick², Venkatesh Radhakrishnan², Hao Guo¹, Boris Glavic¹

¹ IIT DBGroup, ² Oracle


Outline

❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions

GProM - Provenance for Queries, Updates, and Transactions

Introduction

• Data provenance
  – Information about the origin and creation process of data
• Provenance tracking for database operations
  – Considerable interest from the database community in the last decade
• The de-facto standard for database provenance [1,2,3,4,5]
  – Model provenance as annotations on data (e.g., tuples)
  – Compute the provenance by propagating annotations (query rewrite)

SELECT DISTINCT Owner FROM CanAcc;

[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, Springer, 2013.
[2] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 2013.
[3] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373–396, 2005.
[4] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, pages 1151–1154, 2006.
[5] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 41(3):5–14, 2012.

Use Cases

• Debugging data and transformations (queries) [1]
• Probabilistic databases (queries) [5]
• Auditing and compliance (transactions and update statements) [6]
• Understanding data integration transformations (queries and transactions)
• Assessing data quality and trust (queries and transactions) [7]

Computing provenance for updates and transactions is essential for many use cases.

[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, Springer, 2013.
[5] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, 2006.
[6] D. Gawlick and V. Radhakrishnan. Fine grain provenance using temporal databases. In TaPP, 2011.
[7] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 2012.

Shortcomings of State-of-the-Art

• No practical implementation for updates
• No system or model supports transactions
• Inflexible provenance storage
  – Always on [2,3]
  – On-demand only [1]
• Query rewrites use atypical access patterns and operator sequences
  – This leads to poor execution plans
• Most systems support only one type of provenance

[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, Springer, 2013.
[2] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 2005.
[3] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 2013.

Objectives

1. Vision: Generic Provenance Database Middleware (GProM)
  – Provenance for queries, updates, and transactions
  – User decides when to compute and store provenance
  – Supports multiple provenance models
  – Database-independent
2. Tracking provenance of concurrent transactions
  – Reenactment queries

Contributions

1. First solution for provenance of transactions
2. Retroactive on-demand provenance computation
  – Using read-only reenactment
3. Only requires audit log + time travel
  – Supported by most DBMS
  – No additional storage and runtime overhead
4. Non-invasive provenance computation
  – Query rewrite + annotation propagation

Outline

❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions

System Architecture

• Database-independent middleware
  – Pluggable parser and SQL code generator
• Internal query representation
  – Relational Algebra Graph Model (AGM)
• Core driver: query rewrites
  – Provenance computation
  – Flexible storage policies for provenance
  – Provenance import/export
  – AGM optimizer (rewritten queries)
  – Extensibility: Rewrite Specification Language (RSL)
• Initial prototype built on top of Oracle


GProM Overview

Provenance Computation

• Query rewrite
  – Take the original query q and rewrite it into q+
  – q+ computes the original results + provenance
  – Propagate provenance through operations

Example Rewrite

• Input:
  SELECT DISTINCT u.Owner
  FROM USacc u, CanAcc c
  WHERE u.ID = c.ID;

• Rewrite parts (original -> rewritten):
  USacc -> SELECT ID, Owner, Balance, Type, ID AS P1, Owner AS P2, Balance AS P3, Type AS P4 FROM USacc
  CanAcc -> SELECT ID, Owner, Balance, Type, ID AS P5, Owner AS P6, Balance AS P7, Type AS P8 FROM CanAcc
  WHERE u.ID = c.ID -> WHERE u.ID = c.ID
  SELECT DISTINCT Owner -> SELECT Owner, P1, P2, P3, P4, P5, P6, P7, P8

• Output:
  SELECT u.Owner, P1, P2, P3, P4, P5, P6, P7, P8
  FROM (SELECT ID, Owner, Balance, Type,
               ID AS P1, Owner AS P2, Balance AS P3, Type AS P4
        FROM USacc) u,
       (SELECT ID, Owner, Balance, Type,
               ID AS P5, Owner AS P6, Balance AS P7, Type AS P8
        FROM CanAcc) c
  WHERE u.ID = c.ID;
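The rewritten query can be tried on any SQL engine. The sketch below runs the original and the rewritten query on an in-memory SQLite database through Python's standard `sqlite3` module. The table and column names come from the slide; the sample rows are hypothetical, and SQLite merely stands in for the Oracle backend of the prototype.

```python
import sqlite3

# Hypothetical sample rows; only table and column names come from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE USacc  (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    CREATE TABLE CanAcc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    INSERT INTO USacc  VALUES (1, 'Alice', 100, 'Standard'), (2, 'Bob', 200, 'Standard');
    INSERT INTO CanAcc VALUES (1, 'Alice', 300, 'US_dollar'), (3, 'Carol', 50, 'Savings');
""")

ORIGINAL = """
    SELECT DISTINCT u.Owner
    FROM USacc u, CanAcc c
    WHERE u.ID = c.ID
"""

# Each base table access duplicates its attributes as provenance columns P1..P8.
REWRITTEN = """
    SELECT u.Owner, P1, P2, P3, P4, P5, P6, P7, P8
    FROM (SELECT ID, Owner, Balance, Type,
                 ID AS P1, Owner AS P2, Balance AS P3, Type AS P4
          FROM USacc) u,
         (SELECT ID, Owner, Balance, Type,
                 ID AS P5, Owner AS P6, Balance AS P7, Type AS P8
          FROM CanAcc) c
    WHERE u.ID = c.ID
"""

original_rows = conn.execute(ORIGINAL).fetchall()
rewritten_rows = conn.execute(REWRITTEN).fetchall()

for row in rewritten_rows:
    print(row)
```

Each row of the rewritten result pairs one original result value with the USacc tuple (P1 to P4) and the CanAcc tuple (P5 to P8) that produced it.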

Provenance Computation

• Operates on relational algebra representation of queries
  – Fixed set of rewrite rules per provenance type
    • One per type of algebra operator
    • Recursive top-down rewrite
  – For each relation access: duplicate attributes as provenance
  – For each operator: replace with algebra graph that propagates provenance annotations
• Composable
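A minimal sketch of such a recursive rewrite, assuming a toy algebra with only table access, selection, and projection. The class names, the `rewrite` function, and the string-based expressions are hypothetical and are not GProM's internal representation; the `prov_<table>_<attribute>` naming follows the rewritten queries shown elsewhere in the deck.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Table:
    name: str
    attrs: List[str]


@dataclass
class Selection:
    pred: str
    child: "Op"


@dataclass
class Projection:
    exprs: List[str]  # e.g. "Owner" or "ID AS prov_USacc_ID"
    child: "Op"


Op = Union[Table, Selection, Projection]


def rewrite(op: Op) -> Tuple[Op, List[str]]:
    """Return the rewritten operator and the provenance attributes it outputs."""
    if isinstance(op, Table):
        # Relation access: duplicate every attribute as a provenance attribute.
        prov = [f"prov_{op.name}_{a}" for a in op.attrs]
        exprs = op.attrs + [f"{a} AS {p}" for a, p in zip(op.attrs, prov)]
        return Projection(exprs, op), prov
    if isinstance(op, Selection):
        # Selection: filter the rewritten input; provenance passes through.
        child, prov = rewrite(op.child)
        return Selection(op.pred, child), prov
    if isinstance(op, Projection):
        # Projection: keep the original expressions and append the provenance.
        child, prov = rewrite(op.child)
        return Projection(op.exprs + prov, child), prov
    raise TypeError(f"no rewrite rule for {type(op).__name__}")


q = Projection(["Owner"], Selection("Balance > 100", Table("USacc", ["ID", "Owner"])))
q_plus, prov_attrs = rewrite(q)
print(q_plus)
```

One rule per operator type keeps the rewrite composable: each rule only needs the rewritten child and the list of provenance attributes the child exposes.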

Supporting Past Queries, Updates, and Transactions

• Only needs audit log and time travel
  – Supported by most DBMS
• Sufficient for provenance of past queries [4]
• Our contribution
  – Sufficient for updates and transactions

[4] J. Zhang and H. Jagadish. Lost source provenance. In EDBT, 2010.

Provenance Generation and Storage Policies

• GProM default
  – Only compute provenance if explicitly requested
• User can register storage policies
  – When to store which type of provenance

POLICY storeOnR {
  FIRE ON Query, Insert q
  WHEN Root(q) +=> Table(R)
  COMPUTE PI-CS
  STORE AS NEW TABLE
  NAMING SCHEME Hash
}

Optimizing Rewritten Queries

• Query rewrites use atypical access patterns and operator sequences
  – This leads to poor execution plans
• Optimization for rewritten queries
  – Heuristic
  – Cost-based

SELECT ID, Owner, Balance,
       CASE WHEN Balance > 1000000 THEN 'Premium' ELSE Type END AS Type,
       prov_CanAcc_ID, prov_CanAcc_Owner, prov_CanAcc_Balance, prov_CanAcc_Type,
       prov_USacc_ID, prov_USacc_Owner, prov_USacc_Balance, prov_USacc_Type
FROM u1 ...

SELECT ID, Owner, Balance, 'Premium' AS Type,
       prov_CanAcc_ID, prov_CanAcc_Owner, prov_CanAcc_Balance, prov_CanAcc_Type,
       prov_USacc_ID, prov_USacc_Owner, prov_USacc_Balance, prov_USacc_Type
FROM u1
WHERE Balance > 1000000
UNION ALL
SELECT * FROM u1
WHERE (Balance > 1000000) IS NOT TRUE
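The two formulations above return the same rows, so an optimizer may pick whichever one yields the better plan. The following sketch checks that equivalence on SQLite with hypothetical data, including a NULL balance. The provenance columns are omitted for brevity, and `IS NOT TRUE` assumes SQLite 3.23 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE u1 (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    -- Hypothetical rows; the NULL balance exercises three-valued logic.
    INSERT INTO u1 VALUES
        (1, 'Alice', 1500000, 'Standard'),
        (2, 'Bob',   50,      'Standard'),
        (3, 'Carol', NULL,    'Standard');
""")

# Formulation 1: a single scan with a CASE expression.
CASE_FORM = """
    SELECT ID, Owner, Balance,
           CASE WHEN Balance > 1000000 THEN 'Premium' ELSE Type END AS Type
    FROM u1
"""

# Formulation 2: one branch for matching rows, one for all remaining rows.
UNION_FORM = """
    SELECT ID, Owner, Balance, 'Premium' AS Type
    FROM u1 WHERE Balance > 1000000
    UNION ALL
    SELECT ID, Owner, Balance, Type
    FROM u1 WHERE (Balance > 1000000) IS NOT TRUE
"""

case_rows = sorted(conn.execute(CASE_FORM).fetchall())
union_rows = sorted(conn.execute(UNION_FORM).fetchall())
print(case_rows == union_rows)
```

The row with the NULL balance lands in the second branch because `(Balance > 1000000) IS NOT TRUE` is true for both false and unknown conditions, matching the ELSE branch of the CASE form.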

Rewrite Extensibility

• Extensible using Rewrite Specification Language (RSL)
  – Concise specification of rewrite rules

RULE mergeSelections {
  FOR q => c => g
  WHERE q->type = selection AND c->type = selection
  REWRITE INTO selection [pred = q->pred AND c->pred] => g
}
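To make the effect of the `mergeSelections` rule concrete, here is a hand-written Python sketch of the same transformation on a tiny operator tree. The `Op` class and the `merge_selections` function are hypothetical illustrations and do not reflect how RSL rules are compiled or applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    type: str                     # "selection", "table", ...
    pred: Optional[str] = None    # predicate, for selections
    child: Optional["Op"] = None  # single input operator, if any
    name: Optional[str] = None    # table name, for table accesses


def merge_selections(q: Op) -> Op:
    """Collapse chains of adjacent selections into one conjunctive selection."""
    if q.child is not None:
        # Rewrite the input first so longer chains collapse completely.
        q = Op(q.type, q.pred, merge_selections(q.child), q.name)
    c = q.child
    if q.type == "selection" and c is not None and c.type == "selection":
        # q => c => g becomes selection[q.pred AND c.pred] => g
        return Op("selection", f"({q.pred}) AND ({c.pred})", c.child)
    return q


plan = Op(
    "selection",
    pred="Balance > 1000000",
    child=Op("selection", pred="Type = 'Standard'", child=Op("table", name="USacc")),
)
merged = merge_selections(plan)
print(merged.pred)
```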

Outline

❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions

Provenance of Transactions

INSERT INTO USacc
  (SELECT ID, Owner, Balance, 'Standard' AS Type
   FROM CanAcc
   WHERE Type = 'US_dollar');
UPDATE USacc SET Type = 'Premium' WHERE Balance > 1000000;
COMMIT;

u1: INSERT INTO USacc
      (SELECT ID, Owner, Balance, 'Standard' AS Type
       FROM CanAcc
       WHERE Type = 'US_dollar');

u2: UPDATE USacc SET Type = 'Premium' WHERE Balance > 1000000;

Provenance of Transactions

• Our approach: reenactment + provenance propagation
• Currently supports
  – Snapshot isolation
  – Statement-level snapshot isolation

1. Gather Transaction Information
2. Construct Update Reenactment Query
3. Construct Transaction Reenactment Query
4. Rewrite for Provenance Computation
5. Execute Query

1. Gather Transaction Information

• Retrieve SQL statements of the transaction from the audit log
• Update u1:
  INSERT INTO USacc
    (SELECT ID, Owner, Balance, 'Standard' AS Type
     FROM CanAcc
     WHERE Type = 'US_dollar');
• Update u2:
  UPDATE USacc SET Type = 'Premium' WHERE Balance > 1000000;

2. Translate Updates: Reenactment

• An update reads a table version and outputs an updated table version
• Multiple versions of the database
  – Each modification of a tuple t causes a new version to be created
  – Old tuple versions are kept (SI)
  – Add version annotation τ to the provenance of each updated row
• Use the semiring model

UPDATE USacc SET Type = 'Premium' WHERE Balance > 1000000;

2. Translate Updates

• Construct update reenactment query
  – Simulates the effect of the update
  – Reads the DB version seen by the update using time travel
  – Query result = updated table (annotation-equivalent)

INSERT INTO USacc
  (SELECT ID, Owner, Balance, 'Standard' AS Type
   FROM CanAcc
   WHERE Type = 'US_dollar');

  -> SELECT ID, Owner, Balance, 'Standard' AS Type
     FROM CanAcc AS OF SCN 3652
     WHERE Type = 'US_dollar'
     UNION ALL
     SELECT * FROM USacc AS OF SCN 3652;

UPDATE USacc SET Type = 'Premium' WHERE Balance > 1000000;

  -> SELECT ID, Owner, Balance, 'Premium' AS Type
     FROM USacc AS OF SCN 3652
     WHERE Balance > 1000000
     UNION ALL
     SELECT *
     FROM USacc AS OF SCN 3652
     WHERE (Balance > 1000000) IS NOT TRUE;
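The claim that a reenactment query returns the updated table can be checked directly. The sketch below assumes SQLite, which has no `AS OF SCN` time travel, so the reenactment query is simply run against the table state before the statement executes. The sample rows and the helper names `fresh_db` and `reenact_and_compare` are hypothetical.

```python
import sqlite3


def fresh_db() -> sqlite3.Connection:
    """Create the two account tables with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE USacc  (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
        CREATE TABLE CanAcc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
        INSERT INTO USacc  VALUES (1, 'Peter Peterson', 2000000, 'Standard'),
                                  (2, 'Jane Doe', 300, 'Standard');
        INSERT INTO CanAcc VALUES (3, 'Alice Bright', 1500000, 'US_dollar'),
                                  (4, 'Bob Brown', 70, 'Savings'),
                                  (5, 'Mark Smith', 50, 'US_dollar');
    """)
    return conn


def reenact_and_compare(reenactment: str, statement: str):
    """Run the reenactment on the pre-statement state, then the real statement."""
    conn = fresh_db()
    predicted = sorted(conn.execute(reenactment).fetchall())
    conn.execute(statement)
    actual = sorted(conn.execute("SELECT * FROM USacc").fetchall())
    return predicted, actual


INSERT_STMT = """
    INSERT INTO USacc
    SELECT ID, Owner, Balance, 'Standard' FROM CanAcc c WHERE c.Type = 'US_dollar'
"""
INSERT_REENACT = """
    SELECT ID, Owner, Balance, 'Standard' AS Type
    FROM CanAcc c WHERE c.Type = 'US_dollar'
    UNION ALL
    SELECT * FROM USacc
"""

UPDATE_STMT = "UPDATE USacc SET Type = 'Premium' WHERE Balance > 1000000"
UPDATE_REENACT = """
    SELECT ID, Owner, Balance, 'Premium' AS Type
    FROM USacc WHERE Balance > 1000000
    UNION ALL
    SELECT * FROM USacc WHERE (Balance > 1000000) IS NOT TRUE
"""

insert_predicted, insert_actual = reenact_and_compare(INSERT_REENACT, INSERT_STMT)
update_predicted, update_actual = reenact_and_compare(UPDATE_REENACT, UPDATE_STMT)
print(insert_predicted == insert_actual, update_predicted == update_actual)
```

On a DBMS with time travel the same queries would read the snapshot with `AS OF`, which is what makes the reenactment read-only and applicable after the fact.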

3. Construct Reenactment Query

• Simulates the whole transaction
  – Annotation-equivalent to the original transaction
• Merge reenactment queries based on the concurrency control protocol
  – Each concurrency control protocol requires a different merge process
  – SERIALIZABLE (snapshot isolation) -> sees modifications committed before the transaction started + previous updates of the transaction
  – READ COMMITTED (statement-level snapshot isolation) -> sees committed changes by concurrent transactions

WITH u1 AS (
  SELECT ID, Owner, Balance, 'Standard' AS Type
  FROM CanAcc AS OF SCN 3652
  WHERE Type = 'US_dollar'
  UNION ALL
  SELECT * FROM USacc AS OF SCN 3652
)
SELECT ID, Owner, Balance, 'Premium' AS Type
FROM u1
WHERE Balance > 1000000
UNION ALL
SELECT * FROM u1
WHERE (Balance > 1000000) IS NOT TRUE;
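As a sanity check of the merged query, the sketch below compares it with the result of actually running the two-statement transaction. It assumes SQLite with hypothetical rows and no concurrent transactions, so the `AS OF SCN` clauses are dropped and the reenactment reads the state before the transaction starts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE USacc  (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    CREATE TABLE CanAcc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    -- Hypothetical sample rows.
    INSERT INTO USacc  VALUES (1, 'Peter Peterson', 2000000, 'Standard'),
                              (2, 'Jane Doe', 300, 'Standard');
    INSERT INTO CanAcc VALUES (3, 'Alice Bright', 1500000, 'US_dollar'),
                              (4, 'Bob Brown', 70, 'Savings'),
                              (5, 'Mark Smith', 50, 'US_dollar');
""")

# u1 reenacts the INSERT; the outer query reenacts the UPDATE on top of u1.
TRANSACTION_REENACT = """
    WITH u1 AS (
        SELECT ID, Owner, Balance, 'Standard' AS Type
        FROM CanAcc c WHERE c.Type = 'US_dollar'
        UNION ALL
        SELECT * FROM USacc
    )
    SELECT ID, Owner, Balance, 'Premium' AS Type
    FROM u1 WHERE Balance > 1000000
    UNION ALL
    SELECT * FROM u1 WHERE (Balance > 1000000) IS NOT TRUE
"""

# Read-only reenactment against the state before the transaction.
predicted = sorted(conn.execute(TRANSACTION_REENACT).fetchall())

# Now run the real transaction.
conn.execute("""
    INSERT INTO USacc
    SELECT ID, Owner, Balance, 'Standard' FROM CanAcc c WHERE c.Type = 'US_dollar'
""")
conn.execute("UPDATE USacc SET Type = 'Premium' WHERE Balance > 1000000")
conn.commit()

actual = sorted(conn.execute("SELECT * FROM USacc").fetchall())
print(predicted == actual)
```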

4. Rewrite For Provenance Computation

• Rewrite the reenactment query to compute provenance using annotation propagation

WITH u1 AS (
  SELECT ID, Owner, Balance, 'Standard' AS Type,
         ID AS prov_CanAcc_ID, . . .
         NULL AS prov_USacc_ID, . . .
         1 AS updated
  FROM CanAcc AS OF SCN 3652
  WHERE Type = 'US_dollar'
  UNION ALL
  SELECT ID, Owner, Balance, Type,
         NULL AS prov_CanAcc_ID, . . .
         ID AS prov_USacc_ID, . . .
         0 AS updated
  FROM USacc AS OF SCN 3652
),
. . .
u2 AS (
  SELECT . . .

5. Execute Query

• Execute query to retrieve provenance

Updated USacc tuples                  | Provenance from CanAcc      | Provenance from USacc
ID  Owner         Balance    Type     | P1  P2            P3        | P4    P5    P6
3   Alice Bright  1,500,000  Premium  | 3   Alice Bright  1,500,000 | NULL  NULL  NULL
5   Mark Smith    50         Standard | 5   Mark Smith    50        | NULL  NULL  NULL
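Steps 4 and 5 can be sketched end to end on SQLite. The slide elides most of the rewritten query, so the body of `u2` and the final filter on `updated = 1` below are assumptions about how the elided parts might look, not the query GProM generates. The CanAcc rows mirror the result table above; the USacc row is hypothetical, and time travel is again omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE USacc  (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    CREATE TABLE CanAcc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    -- The USacc row is hypothetical; the CanAcc rows mirror the result table.
    INSERT INTO USacc  VALUES (1, 'Peter Peterson', 500, 'Standard');
    INSERT INTO CanAcc VALUES (3, 'Alice Bright', 1500000, 'US_dollar'),
                              (5, 'Mark Smith', 50, 'US_dollar');
""")

PROV = (
    "prov_CanAcc_ID, prov_CanAcc_Owner, prov_CanAcc_Balance, prov_CanAcc_Type, "
    "prov_USacc_ID, prov_USacc_Owner, prov_USacc_Balance, prov_USacc_Type"
)

QUERY = f"""
    WITH u1 AS (
        -- Reenacted INSERT: new tuples carry CanAcc provenance and updated = 1.
        SELECT c.ID, c.Owner, c.Balance, 'Standard' AS Type,
               c.ID AS prov_CanAcc_ID, c.Owner AS prov_CanAcc_Owner,
               c.Balance AS prov_CanAcc_Balance, c.Type AS prov_CanAcc_Type,
               NULL AS prov_USacc_ID, NULL AS prov_USacc_Owner,
               NULL AS prov_USacc_Balance, NULL AS prov_USacc_Type,
               1 AS updated
        FROM CanAcc c WHERE c.Type = 'US_dollar'
        UNION ALL
        -- Existing tuples carry USacc provenance and updated = 0.
        SELECT u.ID, u.Owner, u.Balance, u.Type,
               NULL, NULL, NULL, NULL,
               u.ID, u.Owner, u.Balance, u.Type,
               0
        FROM USacc u
    ),
    u2 AS (
        -- Reenacted UPDATE: provenance columns are passed through unchanged.
        SELECT ID, Owner, Balance, 'Premium' AS Type, {PROV}, 1 AS updated
        FROM u1 WHERE Balance > 1000000
        UNION ALL
        SELECT * FROM u1 WHERE (Balance > 1000000) IS NOT TRUE
    )
    SELECT * FROM u2 WHERE updated = 1 ORDER BY ID
"""

provenance_rows = conn.execute(QUERY).fetchall()
for row in provenance_rows:
    print(row)
```

With this data only the two tuples inserted from CanAcc are reported: their CanAcc provenance columns are filled and their USacc provenance columns are NULL, as in the result table.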

Conclusions

• We present our vision for GProM
  – Database-independent middleware for computing provenance of queries, updates, and transactions
• First solution for provenance of transactions
• Query rewrite techniques on steroids:
  – Provenance computation
  – Transaction reenactment
  – Provenance translation
  – Provenance storage
  – Optimization
• Extensible through RSL language

Future Work

• Implementing additional provenance types
• Comprehensive study of heuristic and cost-based optimizations
• Design and implementation of RSL
• Implementing additional provenance formats
• Study reenactment for other concurrency control mechanisms
  – Locking protocols (2PL)
• Investigate additional use cases for reenactment
  – Transaction backout
  – Retroactive what-if analysis

References

[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, pages 291–320. Springer, 2013.
[2] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373–396, 2005.
[3] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 38(3):19, 2013.
[4] J. Zhang and H. Jagadish. Lost source provenance. In EDBT, pages 311–322, 2010.
[5] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, pages 1151–1154, 2006.
[6] D. Gawlick and V. Radhakrishnan. Fine grain provenance using temporal databases. In TaPP, 2011.
[7] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 41(3):5–14, 2012.

Q-Bomb

• One pattern that arises from reenactment is a long chain of SELECT clauses using CASE
  – Each level references attributes from the next level multiple times
  – Subquery pull-up creates expressions of size exponential in the number of SELECT clauses
  – In practice: optimization never finishes
• Minimal example using a one-row table

SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
      …
      FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
            FROM R))
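The exponential growth is easy to reproduce without a database. The sketch below mimics subquery pull-up by textually substituting the inner definition of `a` into the outer CASE expression; the function name `inlined_expression` is hypothetical. Because every level references `a` twice, each pull-up step doubles the expression.

```python
def inlined_expression(levels: int) -> str:
    """Expression for the outermost `a` after pulling up `levels` nested SELECTs."""
    expr = "a"
    for _ in range(levels):
        # The inner definition of a is substituted into both CASE branches.
        expr = f"CASE WHEN b < 100 THEN {expr} ELSE {expr} + 2 END"
    return expr


# Number of CASE expressions after inlining 1..10 levels: 2**n - 1 each.
case_counts = [inlined_expression(n).count("CASE") for n in range(1, 11)]
print(case_counts)
```

At 10 levels the single output expression already contains 1023 nested CASE expressions, and the count roughly doubles with every further level.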

Example Provenance Computation

Example – Update Reenactment

Example – Trans. Reenactment

Rewrite Reenactment Query

Execute Rewritten Query

Types of Update Operations - Insert

• Insert executed at time t
• Updated version of R contains
  1. All tuples from the previous version
  2. All newly inserted tuples
     • Fixed tuple defined in the VALUES clause
     • Results of the query over the database version at t
• Union these two sets

INSERT INTO R VALUES (v1, ..., vn);
  -> (SELECT * FROM R AS OF t) UNION ALL (SELECT v1 AS a1, ..., vn AS an);

INSERT INTO R (q);
  -> (SELECT * FROM R AS OF t) UNION ALL (q(t));

Types of Update Operations - Delete

• Delete executed at time t
• Tuples in the updated version of R:
  – All tuples for which the condition is not fulfilled

DELETE FROM R WHERE C;
  -> SELECT * FROM R AS OF t WHERE (C) IS NOT TRUE;
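The reenactment uses `(C) IS NOT TRUE` rather than `NOT (C)` because a DELETE keeps tuples for which the condition evaluates to unknown. A small sketch on SQLite illustrates the difference; the table `R` with columns `id` and `a`, its rows, and the condition are hypothetical, and time travel is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (id INTEGER, a INTEGER);
    -- Hypothetical rows; the NULL makes the condition unknown for row 3.
    INSERT INTO R VALUES (1, 5), (2, 50), (3, NULL);
""")

CONDITION = "a > 10"

# Reenactment as on the slide: keep rows where the condition is not true.
reenacted = conn.execute(
    f"SELECT * FROM R WHERE ({CONDITION}) IS NOT TRUE ORDER BY id"
).fetchall()

# Naive negation: also drops rows where the condition is unknown (NULL).
naive = conn.execute(
    f"SELECT * FROM R WHERE NOT ({CONDITION}) ORDER BY id"
).fetchall()

# The real DELETE only removes rows where the condition is true.
conn.execute(f"DELETE FROM R WHERE {CONDITION}")
actual = conn.execute("SELECT * FROM R ORDER BY id").fetchall()

print(reenacted == actual, naive == actual)
```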

Types of Update Operations - Update

• Update executed at time t
• Find tuples where the condition holds and update the attribute values
• Find tuples where the condition does not hold
• Union these two sets

UPDATE R SET A WHERE C;
  -> (SELECT A' FROM R AS OF t WHERE C)
     UNION ALL
     (SELECT * FROM R AS OF t WHERE (C) IS NOT TRUE)

READ COMMITTED

• A statement of a transaction T sees changes committed by concurrent transactions
• For a given update we need to combine
  – tuples produced by previous statements of the same transaction
  – tuples produced by transactions that committed before the update
• Observations
  – Once a transaction T modifies a tuple t, no other transaction can access t until T commits
  – Let ui be the update of T, executed at time x, that first modifies t
  – ui will read the latest version of t committed before x
  – If we know ui, then updates of T before x do not have to look at t
• Consider the database version one time unit before the commit of T (C-1)
  – This contains all the tuple versions seen by the first update of T updating each individual tuple
  – Let t be a tuple version in this database version and let y be its start time
  – We know that updates from T which executed before y cannot have updated t
  – We can use version C-1 as input for reenactment as long as we hide the tuple version t starting at y from the reenactment of any update executed at x with x < y

READ COMMITTED

WITH u1 AS (
  SELECT CASE WHEN Balance <= 1000000 AND version <= 0 THEN 'Standard' ELSE Type END AS Type,
         ID, Owner, Balance,
         CASE WHEN Balance <= 1000000 AND version <= 0 THEN -1 ELSE version END AS version
  FROM USacc AS OF SCN 3652
),
u2 AS (
  SELECT CASE WHEN Balance > 1000000 AND version <= 1 THEN 'Premium' ELSE Type END AS Type,
         ID, Owner, Balance,
         CASE WHEN Balance > 1000000 AND version <= 1 THEN -1 ELSE version END AS version
  FROM u1
)
SELECT ID, Owner, Balance, Type FROM u2 WHERE version = -1;

Database Independence

• Encapsulate database-specific functionality in pluggable modules
• What needs to be adapted:
  1) Parser
  2) SQL code generator
  3) Metadata access
  4) Audit log access
  5) Time travel activation

Accessing Several Tables

• Transactions accessing several tables
  – We require the user to specify which table she is interested in
  – Replace access to a table with the query for the last update that modified the table