A Formal Approach to Finding Explanations for Database Queries
Sudeepa Roy, Dan Suciu
University of Washington, Seattle
We need to understand “Big Data”
ref. Big data whitepaper, Jagadish et al., 2011-12
[Figure: data analysis pipeline. Datasets D1, D2, D3 feed a data analysis system: 1. Acquire Data; 2. Prepare Data (clean, extract features, integrate); 3. Store in DB; 4. Run Queries; 5. Plot Graphs; 6. Ask Questions: “Do you have an explanation?”]
• Why is there a peak for #SIGMOD papers from industry during 2000-06, while #academia papers kept increasing?
• Why is #SIGMOD papers < #PODS papers in UK?
Sample Questions
Dataset: pre-processed DBLP + affiliation data
Disclaimer: not all authors have affiliation info
Explanations by our approach at the end
“What was the cause of the observation?”
• Not simple association or correlation
• e.g. people with headaches drink coffee. Does coffee cause headaches? Do headaches lead to drinking coffee?
Ideal goal: Why → Causality
But, causality is hard…
• Has been studied for many years (Hume 1748)
• Extensive study in AI over the last decade by Judea Pearl, using the notion of intervention:
  X is a cause of Y if removing X also removes Y, keeping other conditions unchanged
• Needs controlled experiments
• Not always possible with a database
Realistic, database-y goal: Why → Explanation
Causality → Explanation:
• Controlled experiment → input database and observed query outputs
• Causal paths → PK-FK constraints and their generalization
• Intervention → remove input tuples; the query output should change
• Top causes → top explanations change the output in the expected direction to a greater extent
Previous/Related Work
• Causality in databases
  – Meliou et al. ’10, Meliou et al. ’11
  – Pearl’s notion of causality and intervention
  – Causal structure from input to output by lineage
  – Cause = individual input tuples, not predicates
  – No inherent causal structure in input data
• Explanations in databases
  – Explaining outliers in aggregate queries: Wu-Madden ’13
  – Specific applications (MapReduce, access logs, user ratings, …): e.g. Khoussainova et al. ’12, Fabbri et al. ’12, Das et al. ’11
  – Informally use intervention
  – Explanation = predicate
  – Mostly single table, no joins
• Other related topics
  – Provenance, deletion propagation: e.g. Green et al. ’07, Buneman et al. ’01
  – Missing answers/why-not: e.g. Herschel et al. ’09, Huang et al. ’10, Chapman-Jagadish ’09
  – Finding causal structure/data mining: e.g. Silverstein et al. ’00
  – OLAP: e.g. Sarawagi-Sathe ’01
Upcoming VLDB 2014 tutorial: “Causality and Explanations in Databases”, Alexandra Meliou, Sudeepa Roy, Dan Suciu
This work:
• Formal framework of explanations (= predicates) and theoretical analysis
  – causal structure within the input data, independent of queries or user questions
  – allows multiple tables and joins
• Optimizations and evaluation
  – find top explanations using the data cube
Outline
• Framework
• Causal Paths and Intervention
• Computing Intervention
• Optimization: Ranking Explanations by Data Cube
• Evaluation
• Future Work
Input and Output
Run group-by queries and plot

(figure: toy DBLP database as input, output plot)

User question:
• numerical expression E
• direction: high/low

e.g. E = (q1/q3) / (q2/q4), direction = high: why is q1/q3 > q2/q4?
e.g. q1
select count(distinct z.pubid)
from Author x, Authored y, Publication z
where x.id = y.id and y.pubid = z.pubid
  and z.venue = 'SIGMOD'
  and 2000 <= z.year and z.year <= 2004
  and x.domain = 'com'
These values will vary for q2, q3, q4
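As an illustration, a q1-style query can be run on a small made-up instance. The sketch below uses Python's sqlite3 with hypothetical toy tuples (the ids, names, and counts are invented, not drawn from the real DBLP/affiliation data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Author(id INTEGER PRIMARY KEY, name TEXT, inst TEXT, domain TEXT);
CREATE TABLE Publication(pubid INTEGER PRIMARY KEY, year INTEGER, venue TEXT);
CREATE TABLE Authored(id INTEGER, pubid INTEGER);
""")
cur.executemany("INSERT INTO Author VALUES (?,?,?,?)",
                [(1, 'RR', 'M.com', 'com'), (2, 'JG', 'C.edu', 'edu')])
cur.executemany("INSERT INTO Publication VALUES (?,?,?)",
                [(10, 2002, 'SIGMOD'), (11, 2003, 'SIGMOD'), (12, 2007, 'SIGMOD')])
cur.executemany("INSERT INTO Authored VALUES (?,?)",
                [(1, 10), (1, 11), (2, 11), (2, 12)])

# q1: distinct SIGMOD papers in 2000-2004 with at least one 'com' author
cur.execute("""
SELECT COUNT(DISTINCT z.pubid)
FROM Author x, Authored y, Publication z
WHERE x.id = y.id AND y.pubid = z.pubid
  AND z.venue = 'SIGMOD' AND z.year BETWEEN 2000 AND 2004
  AND x.domain = 'com'
""")
q1 = cur.fetchone()[0]
print(q1)  # papers 10 and 11 qualify -> 2
```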
Explanation(s) φ: predicate on attributes
e.g. [name = ‘JG’]; [name = ‘JG’] ∧ [inst = ‘C.edu’]; [name = ‘JG’] ∧ [year = 2007]
Note: attributes may come from multiple tables

E should change when the database is “intervened” with φ
Causal Paths by Foreign Key Constraints
• Causal path X → Y: removing X removes Y
• Analogy in DB: foreign-key constraints and cascade-delete semantics

Author(id, name, inst, dom)
Authored(id, pubid)
Publication(pubid, year, venue)

Standard FK (cascade delete)
Back-and-forth FK (cascade delete + reverse cascade delete): forward and reverse

Intuition:
• An author can exist if one of her papers is deleted
• A paper cannot exist if any of its co-authors is deleted
Causal Paths and Intervention

Candidate explanation φ: [name = ‘RR’]
(figure: forward and reverse cascade arrows over the toy database)

Intervention Δφ: tuples T0 that satisfy φ + tuples reachable from T0
Given φ, computing Δφ requires a recursive query
Multiple tables require a universal table
Objective: top-k explanations
• Consider the user question: why is E = (q1/q2)/(q3/q4) low?
• Find the top-k explanations φ w.r.t. the score E(D − Δφ)

The obvious approach: two sources of complexity
1. For all possible predicates φ:
   – compute the intervention Δφ for φ (recursion)
   – delete the tuples in Δφ from D
   – evaluate q1, q2, q3, q4 on D − Δφ
   – compute E(D − Δφ)
2. Find the top explanations with the highest scores E(D − Δφ) (top-k)
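The obvious approach can be sketched for the single-table case, where the intervention Δφ is simply the set of tuples satisfying φ. Everything here is a hypothetical toy setup: the data, the attribute names, and the smoothed expression E are invented for illustration:

```python
# toy single-table database: one row per (author, paper-period) pair
D = [
    {"name": "RR", "inst": "M.com", "period": "2000-04"},
    {"name": "RR", "inst": "M.com", "period": "2000-04"},
    {"name": "JG", "inst": "C.edu", "period": "2000-04"},
    {"name": "JG", "inst": "C.edu", "period": "2007-11"},
    {"name": "CM", "inst": "I.com", "period": "2007-11"},
]

def E(db):
    # hypothetical numerical expression: q1/q2, with +1 smoothing to avoid /0
    q1 = sum(1 for t in db if t["period"] == "2000-04")
    q2 = sum(1 for t in db if t["period"] == "2007-11")
    return q1 / (q2 + 1)

# candidate explanations: single-equality predicates [attr = value]
candidates = [(a, v) for a in ("name", "inst") for v in {t[a] for t in D}]

def score(phi):
    attr, val = phi
    # single-table intervention: Delta_phi = the tuples satisfying phi
    return E([t for t in D if t[attr] != val])

# "why is E high?" -> the best explanations give the LOWEST post-intervention E
top = sorted(candidates, key=score)[:2]
print(top)
```

Note the nested loop: every candidate predicate triggers a full re-evaluation of E, which is exactly the cost the data-cube optimization later avoids.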
Computing Δφ by a Recursive Program

Program: delete from the universal table the tuples |= φ, then cascade delete and reverse cascade delete until fixpoint (φ is fixed).

Properties:
1. The program has a unique least fixpoint, which can be obtained in poly-time (n = |D| steps)
2. The program is not monotone in the database
   • i.e., if D ⊆ D’, not necessarily Δφ(D) ⊆ Δφ(D’)
   • therefore not expressible in datalog
   • but expressible in datalog with negation
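A minimal sketch of such a fixpoint computation, assuming a toy instance and the cascade/reverse-cascade rules from the previous slides (all ids are hypothetical):

```python
# Toy instance: authors 1 ('RR') and 2 ('JG'); papers 10 and 11;
# Authored links as (author id, pubid). All values are hypothetical.
authored = {(1, 10), (1, 11), (2, 11)}

def intervention(phi_authors):
    """Least fixpoint of the deletion program, seeded with the Author
    tuples satisfying phi. Forward cascade: a deleted Author or
    Publication deletes its Authored rows; reverse cascade: a deleted
    Authored row deletes its Publication (a paper cannot survive the
    loss of a co-author)."""
    del_a, del_p, del_ap = set(phi_authors), set(), set()
    changed = True
    while changed:
        size = len(del_p) + len(del_ap)
        del_ap |= {(a, p) for (a, p) in authored if a in del_a or p in del_p}
        del_p |= {p for (_, p) in del_ap}
        changed = len(del_p) + len(del_ap) != size
    return del_a, del_p, del_ap

# phi = [name = 'RR'] (author 1): paper 10 goes directly, and paper 11
# goes too because RR co-authored it -- which also removes JG's link.
da, dp, dap = intervention({1})
print(dp)  # {10, 11}
```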
Convergence Depends on Schema
• For one example schema (tables R, S, T in the figure): convergence in ≤ 4 steps
• For another: convergence requires n − 1 steps
• Can be generalized
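Why convergence can take n − 1 steps can be seen on a toy chain of cascade dependencies, where each fixpoint round propagates the deletion exactly one FK hop further (a hypothetical schema for illustration, not one from the talk):

```python
# Hypothetical chain: tuple i is the cascade parent of tuple i + 1,
# so each round of the fixpoint deletes one more tuple down the chain.
n = 6
edges = {(i, i + 1) for i in range(n - 1)}  # parent -> child FK edges

deleted, rounds = {0}, 0  # seed: delete tuple 0
while True:
    new = {c for (p, c) in edges if p in deleted} - deleted
    if not new:
        break
    deleted |= new
    rounds += 1

print(rounds)  # converges only after n - 1 = 5 rounds
```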
Finding Top-k Explanations with Data Cube

For all possible predicates φ:
– compute the intervention Δφ for φ
– …

• The number of possible predicates is huge
• Running the FOR loop is expensive
• Running the RECURSION is expensive

Optimization: OLAP data cube

e.g. why is (q1*q4)/(q2*q3) high?
Suppose we want predicates on the attributes [name, inst, venue] as explanations.
e.g. query for q1 with cube: … group by name, inst, venue with cube
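A WITH CUBE query computes one aggregate per subset of the grouped attributes. A small Python simulation over hypothetical rows shows the cells such a cube contains (the talk's cubes aggregate COUNT(DISTINCT pubid); this sketch uses plain counts):

```python
from itertools import combinations
from collections import Counter

# hypothetical (name, inst, venue) projections of the tuples counted by q1
rows = [("RR", "M.com", "SIGMOD"), ("JG", "C.edu", "SIGMOD")]
attrs = ("name", "inst", "venue")

cube = Counter()
for row in rows:
    named = dict(zip(attrs, row))
    for r in range(len(attrs) + 1):
        for keys in combinations(attrs, r):
            # attributes outside `keys` are rolled up, shown as '-'
            cell = tuple(named[a] if a in keys else "-" for a in attrs)
            cube[cell] += 1

print(cube[("-", "-", "-")])   # grand total: 2
print(cube[("RR", "-", "-")])  # 1
```

Each cube cell is exactly one candidate predicate over [name, inst, venue], which is why a single cube query replaces the FOR loop over predicates.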
Sketch of Algorithm with Data Cube

Cube for q1 (com, 2000-04):
  name | inst  | venue | q1()
  -    | -     | -     | 1
  RR   | -     | -     | 1
  -    | M.com | -     | 1
  …

Cube for q2 (com, 2007-11):
  name | inst  | venue | q2()
  -    | -     | -     | 1
  CM   | -     | -     | 1
  -    | I.com | -     | 1
  …

Cube for q3 (edu, 2000-04):
  name | inst  | venue | q3()
  -    | -     | -     | 1
  JG   | -     | -     | 1
  -    | C.edu | -     | 1
  …

Cube for q4 (edu, 2007-11):
  name | inst  | venue | q4()
  -    | -     | -     | 1
  JG   | -     | -     | 1
  -    | C.edu | -     | 1
  …

1. (Outer-)join the cubes + compute the score E(D − Δφ) for each cell:
   name | inst  | year | E(D − Δφ)
   -    | -     | -    |
   JG   | -     | -    |
   RR   | -     | -    |
   -    | C.edu | -    |
2. Run top-k

• All computation is done by the DBMS
• But:
  – the cube algorithm matches the theory for some inputs (e.g. single table, the DBLP examples)
  – for other inputs it is a heuristic (the recursion is necessary)
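Steps 1 and 2 can be sketched as follows, with hypothetical query totals and cube counts: outer-join the four cubes on the cell, subtract each cell's count from the totals, and rank by the post-intervention score (for a "why high" question, the smallest score wins):

```python
import heapq

# hypothetical query totals qi(D) and per-query cube counts
# (cell -> number of qualifying tuples the cell's predicate removes)
totals = {"q1": 10, "q2": 8, "q3": 20, "q4": 25}
cubes = {
    "q1": {("JG", "-"): 1, ("RR", "-"): 4},
    "q2": {("JG", "-"): 3, ("RR", "-"): 1},
    "q3": {("JG", "-"): 2},
    "q4": {("JG", "-"): 2, ("RR", "-"): 1},
}

# outer join on the cell: union of all cells, missing counts default to 0
cells = set().union(*cubes.values())

def score(cell):
    q = {name: totals[name] - cubes[name].get(cell, 0) for name in totals}
    return (q["q1"] / q["q3"]) / (q["q2"] / q["q4"])

# E(D) = (10/20)/(8/25) = 1.5625 is "high"; the best explanation is the
# cell whose intervention drives the score lowest
top = heapq.nsmallest(1, cells, key=score)
print(top)  # [('RR', '-')]
```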
Experiment 1: Data Cube Optimization vs. Iterative Algorithm

Natality Dataset 2010 (from the National Center for Health Statistics, NCHS): a single table with 233 attributes, ~4M entries, 2.89 GB.
More experiments in the paper
Data size vs. time
Experiment 2: Scalability of the Data Cube Optimization
(plots: data size vs. time; (max) no. of attributes in explanation predicates vs. time; series E1, E2)

                                                Why (q1/q2) low    Why (q1/q2)/(q3/q4) low
# aggregate queries in the user question                2                      4
# data cubes computed                                   2                      4
# joins of the data cubes for the final table           1                      3
Qualitative Evaluation (DBLP)
Q. Why is there a peak for #SIGMOD papers from industry during 2000-06, while #academia papers kept increasing?
Intuition:
1. If we remove these industrial labs and their senior researchers, the peak during 2000-04 is flattened
2. If we remove these universities with relatively new but highly prolific DB groups, the curve for academia increases less

Evaluation is hard due to the lack of a gold standard
Qualitative Evaluation (DBLP)

Q. Why is #SIGMOD papers < #PODS papers in UK?

Intuition: if we remove these leading theoretical DB researchers or their universities/cities, the bar for UK will look different.

e.g. for UK, originally: PODS = 62%, SIGMOD = 38%. Removing all publications by Libkin: PODS = 46%, SIGMOD = 54%

(figure: per-researcher publication counts, e.g. P = 32, S = 3; P = 24, S = 1; P = 15, S = 2; P = 9, S = 0. Not top explanations: Wenfei Fan, Peter Buneman, …, with P = 15, S = 12 and P = 6, S = 12. Source: DBLP)
Current/Future Work
• Optimize for arbitrary SPJUA queries and schemas
  – go beyond the data cube
• Model increasing/decreasing trends by linear regression (E = slope)
• Ranking algorithm: simple, meaningful, diverse explanations
• Prototype with a GUI
Thank you
Questions?