47
Automatically Synthesizing SQL Queries from Input-Output Examples Sai Zhang University of Washington Joint work with: Yuyin Sun

Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Automatically Synthesizing SQL Queries

from Input-Output Examples

Sai Zhang

University of Washington

Joint work with: Yuyin Sun

Page 2: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Goal: making it easier for non-expert users

to write correct SQL queries

2

• Non-expert database end-users

– Business analysts, scientists, marketing managers, etc.

can describe what the query task is do not know how write a correct query

This paper: bridge the gap!

Page 3: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

An example

SELECT name, MAX(score)

FROM student, enrolled

WHERE student.stu_id = enrolled.stu_id

GROUP BY student.stu_id

HAVING COUNT(enrolled.course_id) > 1

name stu_id

Alice 1

Bob 2

Charlie 3

Dan 4

stu_id course_id score

1 504 100

1 505 99

2 504 96

3 501 60

3 502 88

3 505 68

Table: student Table: enrolled

name MAX(score)

Alice 100

Charlie 88

Output table

Find the name and the maximum course score of each student

enrolled in more than 1 course.

The correct SQL query:

Page 4: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Existing solutions for querying a database

• General programming languages

+ powerful

− learning barriers

• GUI tools

+ easy to use

− limited in customization and personalization

− hard to discover desired features in complex GUIs

4

Page 5: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

SELECT name, MAX(score)

FROM student, enrolled

WHERE student.stu_id = enrolled.stu_id

GROUP BY student.stu_id

HAVING COUNT(enrolled.course_id) > 1

name stu_id

Alice 1

Bob 2

Charlie 3

Dan 4

stu_id course_id score

1 504 100

1 505 99

2 504 96

3 501 60

3 502 88

3 505 68

Table: student Table: enrolled

name MAX(score)

Alice 100

Charlie 88

Output table

SQLSynthesizer

Our solution: programming by example

Page 6: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

How do end-users use SQLSynthesizer?

6

Real, large database tables Desired output result

SQL?

Small, representative

Input-output examples

SQLSynthesizer

Page 7: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

SQLSynthesizer’s advantages

7

• Fully automated − Only requires input-output examples

− No need of annotations, hints, or specification of any form

• Support a wide range of SQL queries − Beyond the “select-from-where” queries [Tran’09]

Page 8: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Expressiveness

Ease

of

Use

GUI tools

SQLSynthesizer

Programming

languages

Comparison of solutions

Page 9: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Outline

• Motivation

• A SQL Subset

• Synthesis Approach

• Evaluation

• Related Work

• Conclusion

9

Page 10: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Designing a SQL subset

10

The full SQL language

A SQL Subset

• 1000+ pages specification

• PSPace-Completeness [Sarma’10]

• Some features are rarely used

SQLSynthesizer’s focus: a widely-used SQL subset

Page 11: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

How to design a SQL subset?

11

The full SQL language

A SQL Subset

• Previous approaches:

– Decided by the paper authors [Kandel’11] [Tran’09]

• Our approach:

– Ask experienced IT professionals for the most widely-used SQL features

?

Page 12: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Our approach in designing a SQL subset

1. Online survey: eliciting design requirement

2. Designing the SQL subset

3. Follow-up interview: obtaining feedback

− Ask each participant to select 10 most widely-used SQL features

− Got 12 responses

− Ask each participant to rate the sufficiency of the subset

Supported SQL features

1) SELECT.. FROM…WHERE

2) JOIN

3) GROUP BY / HAVING

4) Aggregators (e.g., MAX, COUNT, SUM, etc)

5) ORDER BY

Supported in the previous work

[Tran’09]

Not sufficient at all Completely sufficient

0 5

Average rating: 4.5

Page 13: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Our approach in designing a SQL subset

1. Online survey: eliciting design requirement

1. Designing the SQL subset

2. Follow-up interview: obtaining feedback

− Ask each participant to select 10 most widely-used SQL features

− Got 12 respondents

− Ask each participant to rate the sufficiency of the subset

Supported SQL features

- SELECT.. FROM…WHERE

- JOIN

- GROUP BY / HAVING

- Aggregators (e.g., MAX, COUNT, SUM, etc)

- ORDER BY

Supported in the previous work

[Tran’09]

Not sufficient at all Completely sufficient

0 5

Average rating: 4.5

The SQL subset is enough to

write most common queries.

Page 14: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Outline

• Motivation

• Language Design

• Synthesis Approach

• Evaluation

• Related Work

• Conclusion

14

Page 15: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

SQLSynthesizer Workflow

15

Input-Output

Examples SQLSynthesizer Queries

Select the desired query, or provide more examples

Input

tables

Output

table

Page 16: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

SQLSynthesizer Workflow

16

Input-Output

Examples SQLSynthesizer Queries

Select the desired query, or provide more examples

Input

tables

Output

table

A SQL query

Page 17: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

SQLSynthesizer Workflow

17

Input-Output

Examples SQLSynthesizer Queries

Select the desired query, or provide more examples

Input

tables

Output

table

Filter Project Combine

Page 18: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

SQLSynthesizer Workflow

18

Input-Output

Examples SQLSynthesizer Queries

Select the desired query, or provide more examples

Input

tables

Output

table

Filter Project

SELECT name, MAX(score)

FROM student, enrolled

WHERE student.stu_id = enrolled.stu_id

GROUP BY student.stu_id

HAVING COUNT(enrolled.course_id) > 1 Query condition

Join condition

Projection columns

Join

condition

Query

condition

Projection

columns

A complete SQL:

Combine

Page 19: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Multiple solutions

19

Input

tables Output

table

Query 2

Query 1

Query 3

Page 20: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Multiple solutions

20

Input

tables Output

table

Filter Project

Combine

… …

Computes all solutions, ranks them, and shows them to the user.

Page 21: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Key techniques

21

Input

tables

Output

table

Filter Project

Join

condition

Query

condition

Projection

columns

Combine

1. Combine: Exhaustive search over legal combinations

(e.g., cannot join columns with different types)

2. Filter: A machine learning approach to infer query conditions

3. Project: Exhaustive search over legal columns (e.g., cannot apply AVG to a string column)

Page 22: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

22

Cast as a rule learning problem:

Finding rules that can perfectly divide a search space

into a positive part and a negative part

All rows in the

joined table

Rows contained

in the output table

The rest of

the rows

Learning query conditions

Page 23: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Search space: the joined table

23

name stu_id

Alice 1

Bob 2

Charlie 3

Dan 4

stu_id course_id score

1 504 100

1 505 99

2 504 96

3 501 60

3 502 88

3 505 68

Table: student

Table: enrolled

Name stu_id course_id score

Alice 1 504 100

Alice 1 505 99

Bob 2 504 96

Charlie 3 501 60

Charlie 3 502 88

Charlie 3 505 68

Join on the stu_id column

The joined table

(inferred in the

Combine step)

Page 24: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Finding rules selecting rows contained in

the output table

24

name stu_id course_id score

Alice 1 504 100

Alice 1 505 99

Bob 2 504 96

Charlie 3 501 60

Charlie 3 502 88

Charlie 4 505 68

The joined table

name MAX(score)

Alice 100

Charlie 88

Output table

Page 25: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Finding rules selecting rows containing

the output table

25

name stu_id course_id score

Alice 1 504 100

Alice 1 505 99

Bob 2 504 96

Charlie 3 501 60

Charlie 3 502 88

Charlie 4 505 68

The joined table

name stu_id course_id score

Alice 1 504 100

Alice 1 505 99

Charlie 3 501 60

Charlie 3 502 88

Charlie 4 505 68 name stu_id course_id score

Alice 1 504 100

Alice 1 505 99

Charlie 3 501 60

Charlie 3 502 88

Charlie 4 505 68

Page 26: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Finding rules selecting rows containing

the output table

name stu_id course_id score

Alice 1 504 100

Alice 1 505 99

Bob 2 504 96

Charlie 3 501 60

Charlie 3 502 88

Charlie 4 505 68

The joined table

No good rules!

name stu_id course_id score

Alice 1 504 100

Alice 1 505 99

Charlie 3 501 60

Charlie 3 502 88

Charlie 4 505 68

Page 27: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Solution: computing additional features

• Key idea:

– Expand the search space with additional features

• Enumerate all possibilities that a table can be aggregated

• Precompute aggregation values as features

name stu_id course_id score

Alice 1 504 100

Alice 1 505 99

Bob 2 504 96

Charlie 3 501 60

Charlie 3 502 88

Charlie 3 505 68

Suppose grouping it by stu_id

MAX(score) SUM

(score)

COUNT

(course_id)

100 199 2

100 199 2

96 96 1

88 216 3

88 216 3

88 216 3

additional features

...

The joined table

Page 28: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Finding rules without additional features

28

name stu_id course_id score

Alice 1 504 100

Alice 1 505 99

Bob 2 504 96

Charlie 3 501 60

Charlie 3 502 88

Charlie 4 505 68

The joined table

No good rules!

name stu_id course_id score

Alice 1 504 100

Alice 1 505 99

Charlie 3 501 60

Charlie 3 502 88

Charlie 4 505 68

Page 29: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Finding rules with additional features

The joined table

name stu_id course_id score COUNT(course_id) MIN(score)

Alice 1 504 100 2 99

Alice 1 505 99 2 99

Bob 2 504 96 1 96

Charlie 3 501 60 3 60

Charlie 3 502 88 3 60

Charlie 4 505 68 3 60

COUNT(course_id) > 1

(after groupping by stu_id)

SELECT name, MAX(score)

FROM student, enrolled

WHERE student.stu_id =

enrolled.stu_id

GROUP BY student.stu_id

HAVING COUNT(enrolled.course_id) > 1

after the table is grouped by stu_id

name stu_id course_id score

Alice 1 504 100

Alice 1 505 99

Charlie 3 501 60

Charlie 3 502 88

Charlie 4 505 68

Page 30: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Ranking multiple SQL queries

• Occam’s razor principle: rank simpler queries higher

– A simpler query is less likely to overfit the examples

• Approximate a query’s complexity by its text length

30

name age score

Alice 20 100

Bob 20 99

Charlie 30 99

name

Alice

Bob

Query 1: select name from student where age < 30

Query 2: select name from student where

name = ‘Alice’ || name = ‘Bob’

Input table: student Output table

Page 31: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Outline

• Motivation

• Language Design

• Synthesis Approach

• Evaluation

• Related Work

• Conclusion

31

Page 32: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Research Questions

• Success ratio in synthesizing SQL queries?

• What is the tool time cost?

• How much human effort is needed in writing examples?

• Comparison to existing techniques.

32

Page 33: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Benchmarks

• 23 SQL query related exercises from a classic textbook

– All exercises in chapters 5.1 and 5.2

• 5 forum questions

– Can be answered by using standard SQL

(Most questions are vendor-specific)

– 2 questions contain example tables

33

Page 34: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Evaluation Procedure

• Rank of the correct SQL query

• Tool time cost

• Manual cost

– Example size, time cost, and the number of interaction rounds

(All experiments are done by the second author)

34

Input-Output

Examples SQLSynthesizer Queries

Select the desired query, or provide more examples

Page 35: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Results: success ratio

35

28 SQL questions

Fail on 8

questions

Succeed on 20 questions

5 forum questions

23 textbook

exercises

Succeed on 5

forum questions

Succeed on 15

exercises

Fail on 8

exercises

Require writing sub-queries, which are not

supported in SQLSynthesizer

The correct query ranks 1st

in all succeeded questions

Page 36: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Result: tool time cost

• On average, 8 seconds per benchmark

– Min: 1 second, max: 120 seconds

– Roughly proportional to the #table and #column

36

Page 37: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Results: manual cost

• Example size

– 22 rows, on average (min: 8 rows, max: 52 rows)

• Time cost in writing examples

– 3.6 minutes per benchmark, on average

(min: 1 minute, max: 7 minutes)

• Number of interaction rounds

– 2.3 rounds per benchmark, on average

(min: 1 round, max: 5 rounds)

37

Page 38: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Comparison with an existing approach

• Query-by-Output (QBO) [Tran’09]

– Support simple “select-from-where” queries

– Use data values as machine learning features

38

name age score

Alice 20 100

Bob 20 99

Charlie 30 80

name

Alice

Bob

select name from student

where age < 30

Table: student

age MAX(score)

20 100

30 80 select age, max(score) from student

group by age

Output table

Page 39: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Query-by-Output vs. SQLSynthesizer

39

Query-by-Output SQLSynthesizer

28 SQL questions

28 SQL questions

Fail on 26 questions

Succeed on 2 questions Succeed on 20 questions

Fail on 8

questions

- Many realistic SQL queries use aggregation features.

- Users are unlikely to get stuck on simple “select-from-where” queries.

Page 40: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Experimental Conclusions

• Good success ratio (71%)

• Low tool time cost

– 8 seconds on average

• Reasonable manual cost

– 3.6 minutes on average

– 2.3 interaction rounds

• Outperform an existing technique

– Success ratio: QBO (7%) vs. SQLSynthesizer (71%)

40

Page 41: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Outline

• Motivation

• Language Design

• Synthesis Approach

• Evaluation

• Related Work

• Conclusion

41

Page 42: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Related Work

• Reverse engineering SQL queries

Query-by-Examples [Zloof’75]

A new GUI with a domain-specific language to write queries

Query-by-Output [Tran’09]

Uses data values as features, and supports a small SQL subset.

View definition Synthesis [Sarma’10]

Theoretical analysis, and is limited to 1 input/output table.

• Automated program synthesis

PADS [Fisher’08], Wrangler [Kandel’11], Excel Macro [Harris’11],

SQLShare [Howe’11], SnippetSuggest [Khoussainova’11],

SQL Inference from Java code [Cheung’13]

− Targets different problems, or requires different input.

− Inapplicable to SQL synthesis

42

Page 43: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Outline

• Motivation

• Language Design

• Synthesis Approach

• Evaluation

• Related Work

• Conclusion

43

Page 44: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Contributions

• A programming-by-example technique

– Synthesize SQL queries from input-output examples

– Core idea: using machine learning to infer query conditions

• Experiments that demonstrate its usefulness

– Accurate and efficient

• Inferred correct answers for 20 out of 28 SQL questions

• 8 seconds for each question

– Outperforms an existing technique

• The SQLSynthesizer implementation

http://sqlsynthesizer.googlecode.com

44

Page 45: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

[Backup Slides]

45

Page 46: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

The most widely-used SQL features

46

0 2 4 6 8 10 12

RIGHT JOIN

NOT EXIST

UNION

NOT NULL

LIKE

BETWEEN

LEFT JOIN

HAVING

INNER JOIN

FULL JOIN

MIN

MAX

IN

SUM

AVG

DISTINCT

COUNT

ORDER BY

GROUP BY

SELECT... FROM..

Number of votes

21 features

Aggregation features

The standard select ..

from.. where.. feature

Joining features

Existential features

Value matching features

Page 47: Automatically Synthesizing SQL Queries from Input …• Query-by-Output (QBO) [Tran’09] –Support simple “select-from-where” queries –Use data values as machine learning

Design a SQL subset

47

0 2 4 6 8 10 12

RIGHT JOIN

NOT EXIST

UNION

NOT NULL

LIKE

BETWEEN

LEFT JOIN

HAVING

INNER JOIN

FULL JOIN

MIN

MAX

IN

SUM

AVG

DISTINCT

COUNT

ORDER BY

GROUP BY

SELECT... FROM..

Special joins

Wildcard matching

Sub-query

Covered features

Uncovered features

Covered 15 features