Generating Test Data for Killing SQL Mutants: A Constraint Based Approach

ICDE 2011 1

Generating Test Data for Killing SQL Mutants:

A Constraint Based Approach

CSE Department, IIT Bombay

Shetal Shah, S. Sudarshan, Suhas Kajbaje, Sandeep PatidarBhanu Pratap Gupta, Devang Vira

ICDE 20112

Testing SQL Queries: A Challenge Complex SQL queries hard to get right Question: How to check if an SQL query is

correct? Formal verification is not applicable since we do

not have a separate specification and an implementation

State of the art solution: manually generate test databases and check if the query gives the intended result Often misses errors

ICDE 20113

Example Find number of high paid instructors in each

department

dept instructor

σ sal>100000

deptnameGcount(instrID)

dept instructor

σ sal>100000

deptnameGcount(instrID)

Test dataset should be able to distinguish these two alternatives

Problem: does not output departments with no high paid instructors

ICDE 20114

Generating Test Data: Prior Work Automated Test Data generation

Initial work based on database constraints alone Later work: DB constraints + SQL query

Agenda, Reverse Query Processing (RQP), QAGen Our earlier work in ICDE 2010 (poster) on killing SQL

mutants More recent (independent) de la Riva et al [AST10]

consider test data generation for killing mutants But limited space of mutations, no guarantees of

completeness None of the above guarantee anything about detecting

errors in SQL queries Question: How do you model SQL errors? Answer: Query Mutation

ICDE 20115

Query Mutations Mutations: model common programming errors

Join used instead of outerjoin (or vice versa) Join/selection condition errors

< vs. <=, missing or extra condition Wrong aggregate (min vs. max)

Mutant: syntactic variation of the given query Q Mutant may be the intended query

A dataset D kills a mutant M of query Q if output of Q and M differ on D

For a mutation space, a dataset suite is complete if all non-equivalent mutants are killed

ICDE 20116

Mutation Testing of Queries Mutation testing can be used to check coverage of a

test suite Given a test suite, generate mutants in a pre-defined

mutation space; evaluate on a given test suite Our goal is different:

Generate datasets for the given query, based on possible mutations

Each dataset and query result on the dataset shown to a human, who verifies that the query result is what is expected on that dataset

We do not need to actually generate and execute mutants

ICDE 20117

Query and Mutation SpaceQuery: Single Block SQL Query with unconstrained

aggregation at the top Mutation Space: Join Type Mutation : change of any of inner join, left

outer join, right outer join, full outer join to another Across all possible join orders (unlike earlier work)

Comparison Operator Mutations: An occurrence of one of (=, <, >, <=, >=, <>) in the WHERE clause replaced by another

Unconstrained Aggregation Mutations: MIN ↔ MAX, SUM ↔ SUM(DISTINCT)

AVG ↔ AVG(DISTINCT), COUNT ↔ COUNT(DISTINCT)

ICDE 20118

Schema: r(A), s(B)

To kill this mutant: ensure that for some r tuple there is no matching s tuple

Generated test case r(A) ={(1)}; s(B) ={}

Basic idea, version 1 [ICDE 2010] run query on given database, from result extract matching tuples for r and s delete s tuple to ensure no matching tuple for r Limitation: foreign keys, repeated relations

Killing Join Mutants: Basic Idea

ICDE 20119

Schema: r(A,B), s(C,D), t(E); Extra join above mutated node

To kill this mutant we must ensure that for an r tuple there is no matching s tuple, but there is a matching t tuple

Generated test case: r(A,B)={(1,2)}; s(C,D)={}; t(E)={(2)} Our approach: generate not-matching constraints of the form

∃ r1 in r, t1 in t s.t. (r1.B=t1.E and ∃ si in s s.t. (si.C=r1.A))and solve using a solver Informally, we “nullify” s to kill the above mutation

Need to Ensure Difference in Final Query Result

ICDE 201110

Working around Foreign Key Constraints teaches instructor

is equivalent to teaches instructor if there is a foreign key from teaches.ID to instructor.ID

BUT: teaches σ dept=CS(instructor)is not equivalent to teaches σ dept=CS(instructor)

Key idea: have a teaches tuple with an instructor not from CS

Selections and joins can be used to kill mutations We show that it comes for free when we “nullify” relations

to kill other join mutations or generate data to kill selection mutations

ICDE 201111

Schema: r(A,B), s(C,D), t(E), Mutation: r s

Any result with a r.B being null will be removed by join with t Similarly equivalence can result due to selections But mutation of s to s would be non equivalent with join

order (r t) s SQL Query may not specify join orders; many join trees possible; the

datasets should kill mutants across all join orders Note: de la Riva [AST10] do not consider alternative join orders

Join Orders and Equivalent Mutations

ICDE 201112

Equivalent Trees and Equivalence Classes of Attributes Whether query conditions written as

A.x = B.x AND B.x = C.x or as A.x = B.x AND A.x = C.x should not affect set of mutants generated

Solution: Equivalence classes of attributes

ICDE 201113

Data Generation Algorithm Assumptions

Only PK/FK constraints, single block SQL queries, only simple arithmetic expressions, predicates are purely conjunctive (and a few more in paper)

Algorithm Overview For each dataset generate constraints.

1. DB constraints : primary key, foreign key constraints 2. Query constraints: join conditions, selection conditions, other

than conditions inconsistent with non-matching constraints (see below)

3. Constraints to kill mutations: not-matching constraints derived from join/selection conditions

One constraint per dataset Generate Test Data

1. Using a constraint solver (currently CVC3) on above constraints

ICDE 201114

Main Procedure: generateDataSet(query q) preprocess query tree

build equivalence classes foreign key closure

generateDataSetForOriginalQuery()A constraint set with only DB and Query constraints such that the result set for q is non-empty

killEquivalenceClasses() killOtherPredicates() killComparisonOperators() killAggregates()

ICDE 201115

Killing Join Mutants: Equijoin(killEquivalenceClasses) Query : r s t where r.A = s.B and s.B =t.C

r.A, s.B, t.C are in one equivalence class Generate one dataset for each element of each

equivalence class Three datasets generated, with not-matching

mutation constraints: D1: r1∃ εr s1∃ εs (r1.A = s1.B ti∧ ∃ εt (ti.C=r1.A)) D2: s1∃ εs t1∃ εt (s1.B = t1.C ri∧ ∃ εr (ri.A=s1.B)) D3: r1∃ εr t1∃ εt (r1.A = t1.C si∧ ∃ εs (si.B=t1.C))

But if t.C is an FK referencing s.B, instead generate D3’: r1∃ εr ( ti∃ εt, siεs (si.B=r1.A ti.C=r1.A))∧

ICDE 201116

Comparison Operation Mutations Example of comparison operation mutations: A < 5 vs. A <= 5 vs. A > 5 vs A >= 5 vs. A=5,

vs A <> 5 Idea: generate three datasets (leaving rest of

query unchanged), one each for: A < 5 A = 5 A > 5

These datasets will kill all above mutations

ICDE 201117

Aggregation Operation Mutations Aggregation operations

count(A) ↔ count(distinct A); sum(A) ↔ sum(distinct A) avg(A) ↔ avg(distinct A); min(A) ↔ max(A)

Eg: select min (salary) from instructor group by dept_name; Dataset Generated: [CS, Srinivasan, 80000] [CS, Wu, 80000] [CS, Alex, 65000] will kill each of the above pairs of mutations Issues:

Database/query constraints forcing A to be unique Database/query constraints forcing G to be unique

Carefully crafted set of constraints, which are relaxed to handle such cases Details in paper

One group with two values equal (and <> 0),Third different

ICDE 201118

On Completeness and Complexity With respect to query size:

Number of datasets generated: linear Number of mutants: exponential

Theorem: The problem of generating data to kill a mutant is NP-hard reduction from query containment; polynomial in special cases.

Theorem: For the class of queries, and the space of join-type and selection mutations considered, the suite of datasets generated by our algorithm is complete, i.e., the datasets kill all non-equivalent mutations of a given query

Also complete for a restricted classes of aggregation mutations with aggregation as top operation of tree, under some restrictions on joins in input

ICDE 201119

Performance Results University database schema from Database System

Concepts 6th Ed consisting of 7 relations Further Optimizations : Unfolding of bounded first

order quantified constraints E.g. “for all i in 1 to 4, P(i)” unfolded as

P(1) ∧ P(2) ∧ P(3) ∧ P(4) Also tested with addition of input database

constraints Forcing output tuples to come from input database

ICDE 201120

Results for inner join queries

# Datasets generated small; # mutants (killed) grows exponentially Unfolding of constraints gives substantial benefits Increasing # foreign keys reduces the total time taken Results similar for queries with left/right outer join

ICDE 201121

Results for queries with selections, aggregations

Unfolding reduces times taken, epsecially when the number of tuples are large (aggregations with joins)

(Not shown above) using input database increases time taken but not by much

ICDE 201122

Conclusions and Future Work Presented a principled approach to data generation

for testing queries And showed it works efficiently for a large class of queries

We’ve only made first steps; more work to be done: Constrained aggregations and sub-queries Other mutations: e.g., missing/extra join and selection

conditions Multiple queries Query and form parameters

Acknowledgements Bhupesh Chawda for discussions/implementation help Microsoft Research India for their generous travel support

ICDE 2011 23

Thank YouQuestions

ICDE 201124

Related Work

Tuya and Suarez-Cabal [IST07], Chan et al. [QSIC05] defined a class of SQL query mutations

Shortcoming: do not address test data generation More recently (and independent of our work) de la

Riva et al [AST10] address data generation using constraints, with the Alloy solver Do not consider alternative join orders, No

completeness results, Limitations on constraints

ICDE 201125

Dataset for Original Query Array of tuples of constraint variables, per relation One or more constraint tuple from array, for each occurrence of a

relation plus occurrences to ensure f.k. constraints

Equality conditions between variables based on equijoins Other selection and join conditions become constraints Constraints for primary and foreign keys

f.k. from R.A to S.B FORALL i EXISTS j (O_R[i].A = O_S[j].B);

p.k. on R.A FORALL i FORALL j (O_R[i].A = O_R[j].A]) => “all other attrs equal”

since range is over finite domain, p.k. and f.k. constraints can be unfolded

See sample set of constraints in file cvc3_0.cvc

ICDE 201126

Killing Join Mutants: Equijoin

killEquivalenceClasses() for each equivalence class ec do

Let allRelations := Set of all <rel, attr> pairs in ec for each element e in allRelations do

conds := empty set Let e := R.a S := (set of elements in ec which are foreign keys

referencing R.a directly or indirectly) UNION R:a P := ec - S if P:isEmpty() then

continue else … main code for generating constraints (see next

slide)

ICDE 201127

Killing Join Mutants: EquiJoins conds.add(generateEqConds(P)) conds:add( “NOT EXISTS i: R[i].a = ” + cvcMap(P[0])) for all other equivalence classes oe do

conds.add(generateEqConds(oe)) for each other predicate p do

conds:add(cvcMap(p)) conds.add(genDBConstraints()) /*P.K. and F.K*/ callSolver(conds) if solution exists then

create a dataset from solver output

ICDE 201128

Killing Other Predicates Create separate dataset for each attribute in

predicate e.g. For Join condition B.x = C.x + 10

Dataset 1 (nullifying B:x): ASSERT NOT EXISTS (i : B_INT) : (B[i].x = C[1].x +

10); Dataset 2 (nullifying C:x):

ASSERT NOT EXISTS (i : C_INT) : (B[1].x = C[i].x + 10);

Documents

Generating Test Data for Killing SQL Mutants: A Constraint Based Approach