A Formal Approach to Finding Explanations for Database Queries
Sudeepa Roy, Dan Suciu
University of Washington, Seattle
We need to understand “Big Data”
ref. Big data whitepaper, Jagadish et al., 2011-12
[Figure: data analysis pipeline. Datasets D1, D2, D3 feed a data analysis system: 1. Acquire Data; 2. Prepare Data (clean, extract features, integrate); 3. Store in DB; 4. Run Queries; 5. Plot Graphs; 6. Ask Questions: “Do you have an explanation?”]
• Why is there a peak for #SIGMOD papers from industry during 2000-06, while #academia papers kept increasing?
• Why is #SIGMOD papers < #PODS papers in UK?
Sample Questions
Dataset: pre-processed DBLP + affiliation data
Disclaimer: not all authors have affiliation info
Explanations by our approach at the end
“What was the cause of the observation?”
• Not simple association or correlation
• e.g. people with headaches drink coffee. Does coffee cause headaches? Do headaches lead to drinking coffee?
Ideal goal: Why → Causality
But, causality is hard…
• Has been studied for many years (Hume 1748)
• Extensive study in AI over the last decade by Judea Pearl, using the notion of intervention:
  X is a cause of Y if removing X also removes Y, keeping other conditions unchanged
• Needs controlled experiments
• Not always possible with a database
Realistic, database-y goal: Why → Explanation
Causality → Explanation:
• Controlled experiment → input database and observed query outputs
• Causal paths → PK-FK constraints and their generalization
• Intervention → remove input tuples; the query output should change
• Top causes → top explanations change the output in the expected direction to a greater extent
Previous/Related Work
• Causality in databases
  – Meliou et al. ’10, Meliou et al. ’11
  – Pearl’s notion of causality and intervention
  – Causal structure from input to output by lineage
  – Cause = individual input tuples, not predicates
  – No inherent causal structure in input data
• Explanations in databases
  – Explaining outliers in aggregate queries: Wu-Madden ’13
  – Specific applications (MapReduce, access logs, user ratings, …): e.g. Khoussainova et al. ’12, Fabbri et al. ’12, Das et al. ’11
  – Informally use intervention
  – Explanation = predicate
  – Mostly single table, no joins
• Other related topics
  – Provenance, deletion propagation: e.g. Green et al. ’07, Buneman et al. ’01
  – Missing answers/why-not: e.g. Herschel et al. ’09, Huang et al. ’10, Chapman-Jagadish ’09
  – Finding causal structure/data mining: e.g. Silverstein et al. ’00
  – OLAP: e.g. Sarawagi-Sathe ’01
Upcoming VLDB 2014 tutorial: “Causality and Explanations in Databases”, Alexandra Meliou, Sudeepa Roy, Dan Suciu
This work:
• Formal framework of explanations (= predicates) and theoretical analysis
  – causal structure within the input data, independent of queries or user questions
  – allows multiple tables and joins
• Optimizations and evaluation
  – find top explanations using the data cube
Outline
• Framework
• Causal Paths and Intervention
• Computing Intervention
• Optimization: Ranking Explanations by Data Cube
• Evaluation
• Future Work
Input and Output
Run group-by queries and plot

(figure: toy DBLP database as input, output plot)

User question:
• numerical expression E
• direction: high/low

e.g. E = (q1/q3) / (q2/q4), direction = high: why is q1/q3 > q2/q4?
e.g. q1
select count(distinct z.pubid)
from Author x, Authored y, Publication z
where x.id = y.id and y.pubid = z.pubid
  and z.venue = 'SIGMOD'
  and 2000 <= z.year and z.year <= 2004
  and x.domain = 'com'
These values will vary for q2, q3, q4
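As an illustration, a q1-style query can be run on a small made-up instance. The sketch below uses Python's sqlite3 with hypothetical toy tuples (the ids, names, and counts are invented, not drawn from the real DBLP/affiliation data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Author(id INTEGER PRIMARY KEY, name TEXT, inst TEXT, domain TEXT);
CREATE TABLE Publication(pubid INTEGER PRIMARY KEY, year INTEGER, venue TEXT);
CREATE TABLE Authored(id INTEGER, pubid INTEGER);
""")
cur.executemany("INSERT INTO Author VALUES (?,?,?,?)",
                [(1, 'RR', 'M.com', 'com'), (2, 'JG', 'C.edu', 'edu')])
cur.executemany("INSERT INTO Publication VALUES (?,?,?)",
                [(10, 2002, 'SIGMOD'), (11, 2003, 'SIGMOD'), (12, 2007, 'SIGMOD')])
cur.executemany("INSERT INTO Authored VALUES (?,?)",
                [(1, 10), (1, 11), (2, 11), (2, 12)])

# q1: distinct SIGMOD papers in 2000-2004 with at least one 'com' author
cur.execute("""
SELECT COUNT(DISTINCT z.pubid)
FROM Author x, Authored y, Publication z
WHERE x.id = y.id AND y.pubid = z.pubid
  AND z.venue = 'SIGMOD' AND z.year BETWEEN 2000 AND 2004
  AND x.domain = 'com'
""")
q1 = cur.fetchone()[0]
print(q1)  # papers 10 and 11 qualify -> 2
```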
Explanation(s) φ: predicate on attributes
e.g. [name = ‘JG’]; [name = ‘JG’] ∧ [inst = ‘C.edu’]; [name = ‘JG’] ∧ [year = 2007]
Note: attributes may come from multiple tables

E should change when the database is “intervened” with φ
Causal Paths by Foreign Key Constraints
• Causal path X → Y: removing X removes Y
• Analogy in DB: foreign-key constraints and cascade-delete semantics

Author(id, name, inst, dom)
Authored(id, pubid)
Publication(pubid, year, venue)

Standard FK (cascade delete)
Back-and-forth FK (cascade delete + reverse cascade delete): forward and reverse

Intuition:
• An author can exist if one of her papers is deleted
• A paper cannot exist if any of its co-authors is deleted
Causal Paths and Intervention

Candidate explanation φ: [name = ‘RR’]
(figure: forward and reverse cascade arrows over the toy database)

Intervention Δφ: tuples T0 that satisfy φ + tuples reachable from T0
Given φ, computing Δφ requires a recursive query
Multiple tables require a universal table
Objective: top-k explanations
• Consider the user question: why is E = (q1/q2)/(q3/q4) low?
• Find the top-k explanations φ w.r.t. the score E(D − Δφ)

The obvious approach: two sources of complexity
1. For all possible predicates φ:
   – compute the intervention Δφ for φ (recursion)
   – delete the tuples in Δφ from D
   – evaluate q1, q2, q3, q4 on D − Δφ
   – compute E(D − Δφ)
2. Find the top explanations with the highest scores E(D − Δφ) (top-k)
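The obvious approach can be sketched for the single-table case, where the intervention Δφ is simply the set of tuples satisfying φ. Everything here is a hypothetical toy setup: the data, the attribute names, and the smoothed expression E are invented for illustration:

```python
# toy single-table database: one row per (author, paper-period) pair
D = [
    {"name": "RR", "inst": "M.com", "period": "2000-04"},
    {"name": "RR", "inst": "M.com", "period": "2000-04"},
    {"name": "JG", "inst": "C.edu", "period": "2000-04"},
    {"name": "JG", "inst": "C.edu", "period": "2007-11"},
    {"name": "CM", "inst": "I.com", "period": "2007-11"},
]

def E(db):
    # hypothetical numerical expression: q1/q2, with +1 smoothing to avoid /0
    q1 = sum(1 for t in db if t["period"] == "2000-04")
    q2 = sum(1 for t in db if t["period"] == "2007-11")
    return q1 / (q2 + 1)

# candidate explanations: single-equality predicates [attr = value]
candidates = [(a, v) for a in ("name", "inst") for v in {t[a] for t in D}]

def score(phi):
    attr, val = phi
    # single-table intervention: Delta_phi = the tuples satisfying phi
    return E([t for t in D if t[attr] != val])

# "why is E high?" -> the best explanations give the LOWEST post-intervention E
top = sorted(candidates, key=score)[:2]
print(top)
```

Note the nested loop: every candidate predicate triggers a full re-evaluation of E, which is exactly the cost the data-cube optimization later avoids.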
Computing Δφ by a Recursive Program

Program: delete from the universal table the tuples |= φ, then cascade delete and reverse cascade delete until fixpoint (φ is fixed).

Properties:
1. The program has a unique least fixpoint, which can be obtained in poly-time (n = |D| steps)
2. The program is not monotone in the database
   • i.e., if D ⊆ D’, not necessarily Δφ(D) ⊆ Δφ(D’)
   • therefore not expressible in datalog
   • but expressible in datalog with negation
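A minimal sketch of such a fixpoint computation, assuming a toy instance and the cascade/reverse-cascade rules from the previous slides (all ids are hypothetical):

```python
# Toy instance: authors 1 ('RR') and 2 ('JG'); papers 10 and 11;
# Authored links as (author id, pubid). All values are hypothetical.
authored = {(1, 10), (1, 11), (2, 11)}

def intervention(phi_authors):
    """Least fixpoint of the deletion program, seeded with the Author
    tuples satisfying phi. Forward cascade: a deleted Author or
    Publication deletes its Authored rows; reverse cascade: a deleted
    Authored row deletes its Publication (a paper cannot survive the
    loss of a co-author)."""
    del_a, del_p, del_ap = set(phi_authors), set(), set()
    changed = True
    while changed:
        size = len(del_p) + len(del_ap)
        del_ap |= {(a, p) for (a, p) in authored if a in del_a or p in del_p}
        del_p |= {p for (_, p) in del_ap}
        changed = len(del_p) + len(del_ap) != size
    return del_a, del_p, del_ap

# phi = [name = 'RR'] (author 1): paper 10 goes directly, and paper 11
# goes too because RR co-authored it -- which also removes JG's link.
da, dp, dap = intervention({1})
print(dp)  # {10, 11}
```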
Convergence Depends on Schema
• For one example schema (tables R, S, T in the figure): convergence in ≤ 4 steps
• For another: convergence requires n − 1 steps
• Can be generalized
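Why convergence can take n − 1 steps can be seen on a toy chain of cascade dependencies, where each fixpoint round propagates the deletion exactly one FK hop further (a hypothetical schema for illustration, not one from the talk):

```python
# Hypothetical chain: tuple i is the cascade parent of tuple i + 1,
# so each round of the fixpoint deletes one more tuple down the chain.
n = 6
edges = {(i, i + 1) for i in range(n - 1)}  # parent -> child FK edges

deleted, rounds = {0}, 0  # seed: delete tuple 0
while True:
    new = {c for (p, c) in edges if p in deleted} - deleted
    if not new:
        break
    deleted |= new
    rounds += 1

print(rounds)  # converges only after n - 1 = 5 rounds
```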
Finding Top-k Explanations with Data Cube

For all possible predicates φ:
– compute the intervention Δφ for φ
– …

• The number of possible predicates is huge
• Running the FOR loop is expensive
• Running the RECURSION is expensive

Optimization: OLAP data cube

e.g. why is (q1*q4)/(q2*q3) high?
Suppose we want predicates on the attributes [name, inst, venue] as explanations.
e.g. query for q1 with cube: … group by name, inst, venue with cube
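A WITH CUBE query computes one aggregate per subset of the grouped attributes. A small Python simulation over hypothetical rows shows the cells such a cube contains (the talk's cubes aggregate COUNT(DISTINCT pubid); this sketch uses plain counts):

```python
from itertools import combinations
from collections import Counter

# hypothetical (name, inst, venue) projections of the tuples counted by q1
rows = [("RR", "M.com", "SIGMOD"), ("JG", "C.edu", "SIGMOD")]
attrs = ("name", "inst", "venue")

cube = Counter()
for row in rows:
    named = dict(zip(attrs, row))
    for r in range(len(attrs) + 1):
        for keys in combinations(attrs, r):
            # attributes outside `keys` are rolled up, shown as '-'
            cell = tuple(named[a] if a in keys else "-" for a in attrs)
            cube[cell] += 1

print(cube[("-", "-", "-")])   # grand total: 2
print(cube[("RR", "-", "-")])  # 1
```

Each cube cell is exactly one candidate predicate over [name, inst, venue], which is why a single cube query replaces the FOR loop over predicates.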
Sketch of Algorithm with Data Cube

Cube for q1 (com, 2000-04):
  name | inst  | venue | q1()
  -    | -     | -     | 1
  RR   | -     | -     | 1
  -    | M.com | -     | 1
  …

Cube for q2 (com, 2007-11):
  name | inst  | venue | q2()
  -    | -     | -     | 1
  CM   | -     | -     | 1
  -    | I.com | -     | 1
  …

Cube for q3 (edu, 2000-04):
  name | inst  | venue | q3()
  -    | -     | -     | 1
  JG   | -     | -     | 1
  -    | C.edu | -     | 1
  …

Cube for q4 (edu, 2007-11):
  name | inst  | venue | q4()
  -    | -     | -     | 1
  JG   | -     | -     | 1
  -    | C.edu | -     | 1
  …

1. (Outer-)join the cubes + compute the score E(D − Δφ) for each cell:
   name | inst  | year | E(D − Δφ)
   -    | -     | -    |
   JG   | -     | -    |
   RR   | -     | -    |
   -    | C.edu | -    |
2. Run top-k

• All computation is done by the DBMS
• But:
  – the cube algorithm matches the theory for some inputs (e.g. single table, the DBLP examples)
  – for other inputs it is a heuristic (the recursion is necessary)
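Steps 1 and 2 can be sketched as follows, with hypothetical query totals and cube counts: outer-join the four cubes on the cell, subtract each cell's count from the totals, and rank by the post-intervention score (for a "why high" question, the smallest score wins):

```python
import heapq

# hypothetical query totals qi(D) and per-query cube counts
# (cell -> number of qualifying tuples the cell's predicate removes)
totals = {"q1": 10, "q2": 8, "q3": 20, "q4": 25}
cubes = {
    "q1": {("JG", "-"): 1, ("RR", "-"): 4},
    "q2": {("JG", "-"): 3, ("RR", "-"): 1},
    "q3": {("JG", "-"): 2},
    "q4": {("JG", "-"): 2, ("RR", "-"): 1},
}

# outer join on the cell: union of all cells, missing counts default to 0
cells = set().union(*cubes.values())

def score(cell):
    q = {name: totals[name] - cubes[name].get(cell, 0) for name in totals}
    return (q["q1"] / q["q3"]) / (q["q2"] / q["q4"])

# E(D) = (10/20)/(8/25) = 1.5625 is "high"; the best explanation is the
# cell whose intervention drives the score lowest
top = heapq.nsmallest(1, cells, key=score)
print(top)  # [('RR', '-')]
```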
Experiment 1: Data Cube Optimization vs. Iterative Algorithm

Natality Dataset 2010 (from the National Center for Health Statistics, NCHS): a single table with 233 attributes, ~4M entries, 2.89 GB.
More experiments in the paper
Data size vs. time
Experiment 2: Scalability of the Data Cube Optimization
(plots: data size vs. time; (max) no. of attributes in explanation predicates vs. time; series E1, E2)

                                                Why (q1/q2) low    Why (q1/q2)/(q3/q4) low
# aggregate queries in the user question                2                      4
# data cubes computed                                   2                      4
# joins of the data cubes for the final table           1                      3
Qualitative Evaluation (DBLP)
Q. Why is there a peak for #SIGMOD papers from industry during 2000-06, while #academia papers kept increasing?
Intuition:
1. If we remove these industrial labs and their senior researchers, the peak during 2000-04 is flattened
2. If we remove these universities with relatively new but highly prolific DB groups, the curve for academia increases less

Evaluation is hard due to the lack of a gold standard
Qualitative Evaluation (DBLP)

Q. Why is #SIGMOD papers < #PODS papers in UK?

Intuition: if we remove these leading theoretical DB researchers or their universities/cities, the bar for UK will look different.

e.g. for UK, originally: PODS = 62%, SIGMOD = 38%. Removing all publications by Libkin: PODS = 46%, SIGMOD = 54%

(figure: per-researcher publication counts, e.g. P = 32, S = 3; P = 24, S = 1; P = 15, S = 2; P = 9, S = 0. Not top explanations: Wenfei Fan, Peter Buneman, …, with P = 15, S = 12 and P = 6, S = 12. Source: DBLP)
Current/Future Work
• Optimize for arbitrary SPJUA queries and schemas
  – go beyond the data cube
• Model increasing/decreasing trends by linear regression (E = slope)
• Ranking algorithm: simple, meaningful, diverse explanations
• Prototype with a GUI
Thank you
Questions?