1
Primitives for Workload Summarization and Implications for SQL
Prasanna Ganesan* (Stanford University)
Surajit Chaudhuri, Vivek Narasayya (Microsoft Research)
*Work done at Microsoft Research
2
Motivation
• Workload: a set of SQL statements
• Many tasks exploit workload information
  – DB Admin, Index Tuning, Statistics building, Approximate Query Processing
• DBMS profilers produce large workloads (+additional info)
• Most tasks need small workloads
• Goal: Summarization. Find a “representative” subset of a given, large workload.
  – Sometimes a weighted subset
3
Why Not Random Sampling?
• One size does not fit all
  – Different definitions of “representative subset”
  – Random sampling may lose valuable info
• Ignores additional info associated with statements
• Shown to work poorly, e.g., for Index Selection [chaudhuri02]
  – May oversample queries on some tables, while ignoring less frequent queries on other tables
4
Our Solution
1. Treat input as a relation
  • Each SQL statement (+associated info) is a tuple
2. Extend SQL with new language primitives
  • Allow declarative specification of desired subset
  • Usable on arbitrary relations, not just workloads
3. Implement extensions inside query engine
  • Why? Primitives appear widely applicable
  • Other implementation options available
5
The Architecture
Workload table (WkldTbl):
Query ID | SQL String           | FROM Tables | … | Estimated Cost | Execution Cost
Q1       | SELECT * FROM R1, R2 | {R1, R2}    | … | 2.5            | 3.03
Q2       | …                    | …           | … | …              | …
..       | …                    | …           | .. | …             | …
Query:
  SELECT *, DOMSUM(Count)
  FROM WkldTbl
  DOMINATE WITH
    PARTITIONING BY FromTables, JoinConds, WhereCols
    (SLAVE.GroupByCols ⊆ MASTER.GroupByCols) AND
    (SLAVE.OrderByCols PREFIX MASTER.OrderByCols)
  REPRESENT WITH
    PARTITIONING BY FromTables, JoinConds, WhereCols
    MAXIMIZING SUM(DOM_Count)
    GLOBAL CONSTRAINT Count(*) ≤ 200
    LOCAL CONSTRAINT Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))
Execution Engine → Summary → Application
6
Outline
• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing Summarization Primitives in SQL
• Experiments
7
Dominance
• Idea: Filter and aggregate using a partial order on tuples
• Specify condition for one tuple to dominate another
  – Transitive condition
  – Encapsulates application knowledge
• Output: Keep throwing away tuples that are dominated
  – Retain aggregate info about dominated tuples
9
Applying Dominance to Workloads
• Example: Index Selection
  – An index useful for Q1 is likely to be useful for Q2
  Q1: SELECT ... FROM R GROUP BY A, B, C
  Q2: SELECT ... FROM R GROUP BY A, B
  Q1 dominates Q2
• Dominance condition:
  MASTER.FromTables = SLAVE.FromTables
  AND MASTER.GroupByCols ⊇ SLAVE.GroupByCols
  AND MASTER.OrderByCols PREFIX SLAVE.OrderByCols
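The dominance condition above can be sketched in Python as a predicate over two workload tuples. This is only an illustration: the dictionary layout and field names are hypothetical, and the reading of PREFIX (the dominated statement's ORDER BY list is a leading prefix of the dominating statement's) is an assumption, not a definition taken from the slides.

```python
def is_prefix(short, long):
    """True if the sequence `short` is a leading prefix of `long`."""
    return len(short) <= len(long) and list(long[:len(short)]) == list(short)


def dominates(master, slave):
    """Assumed index-selection dominance test between two statements.

    Each statement is a dict with hypothetical keys:
    'from_tables' and 'group_by' (treated as sets), 'order_by' (a sequence).
    """
    return (
        set(master["from_tables"]) == set(slave["from_tables"])
        and set(slave["group_by"]) <= set(master["group_by"])
        and is_prefix(slave["order_by"], master["order_by"])
    )


# Q1 and Q2 from the slide: same table, GROUP BY A, B, C versus GROUP BY A, B.
q1 = {"from_tables": ["R"], "group_by": ["A", "B", "C"], "order_by": []}
q2 = {"from_tables": ["R"], "group_by": ["A", "B"], "order_by": []}
print(dominates(q1, q2), dominates(q2, q1))
```

Under these assumptions Q1 dominates Q2 but not the other way around, which matches the slide's example.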
10
Outline
• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing Summarization Primitives in SQL
• Experiments
11
Representation
• Dominance only gets us so far
  – Need a “lossier” way to select a subset
• Idea: Pick a subset that solves a Linear Program
  – Optimize some criterion
  – Satisfy lots of constraints
  – Support concept of partitioning
12
Details
• Partition tuples by a set of attributes
• Criterion: Maximize/Minimize Aggregate
  – E.g., Minimize Count(*)
• Global Constraints
  – E.g., Sum(B) in chosen subset > 60% Sum(B) in input
• Local Constraints: apply to each partition
  – E.g., Sum(B) in chosen subset > 40% Sum(B) in that partition

Input relation:
A  | B  | C
1  | 10 | ..
2  | 5  | ..
3  | 7  | ..
1  | …  | ..
2  | …  | ..
3  | …  | ..
.. | …  | ..

Partition A = 1:
A | B  | C
1 | 10 | ..
1 | .. | ..
1 | .. | ..

Partition A = 2:
A | B  | C
2 | 5  | ..
2 | .. | ..
2 | .. | ..

Partition A = 3:
A | B  | C
3 | 7  | ..
3 | .. | ..
3 | .. | ..
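To make the two kinds of constraints concrete, here is a minimal Python sketch that checks whether a candidate subset satisfies the example global and local constraints on Sum(B). The function names are hypothetical, and the B values beyond the first three rows are made up for illustration, since the slide elides them.

```python
from collections import defaultdict


def sum_by_partition(rows, key, attr):
    """Sum attribute `attr` within each partition defined by `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[attr]
    return totals


def satisfies_constraints(rows, chosen, key="A", attr="B",
                          global_frac=0.6, local_frac=0.4):
    """Check the example constraints for a chosen subset of `rows`."""
    # Global constraint: Sum(B) in chosen subset > 60% of Sum(B) in input.
    total_in = sum(r[attr] for r in rows)
    total_out = sum(r[attr] for r in chosen)
    if not total_out > global_frac * total_in:
        return False
    # Local constraint: the same test at 40%, applied to every partition.
    in_parts = sum_by_partition(rows, key, attr)
    out_parts = sum_by_partition(chosen, key, attr)
    return all(out_parts.get(p, 0.0) > local_frac * s
               for p, s in in_parts.items())


# Illustrative data: the last three B values are invented.
rows = [
    {"A": 1, "B": 10}, {"A": 2, "B": 5}, {"A": 3, "B": 7},
    {"A": 1, "B": 2}, {"A": 2, "B": 1}, {"A": 3, "B": 1},
]
print(satisfies_constraints(rows, rows[:3]))            # one tuple per partition
print(satisfies_constraints(rows, [rows[0], rows[2]]))  # partition A = 2 is empty
```

The second candidate meets the global constraint but leaves partition A = 2 unrepresented, so it fails the local one.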
13
An Index Selection Example
• Partition by Tables, Join Conditions and attributes in WHERE clause
• Criterion: Maximize Sum(ExecutionCost)
  – Need best “coverage”
• Global Constraint: Count(*) ≤ 200
• Local Constraint: Proportionate representation
  – A partition with 20% of input should have 20% of output
  – Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))
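The arithmetic behind the proportionate local constraint can be sketched as follows. The helper name and the sample partition labels are hypothetical, and int(...) is taken to mean truncation, which is floor division for the non-negative counts used here.

```python
from collections import Counter


def local_quotas(partition_keys, budget=200):
    """Minimum number of output tuples per partition:
    int(budget * LOCAL.Count(*) / GLOBAL.Count(*))."""
    sizes = Counter(partition_keys)
    total = sum(sizes.values())
    return {p: (budget * n) // total for p, n in sizes.items()}


# A workload of 1000 statements in three partitions (20%, 50%, 30%).
keys = ["X"] * 200 + ["Y"] * 500 + ["Z"] * 300
print(local_quotas(keys))

# Truncation means the quotas never add up to more than the budget.
uneven = ["X"] * 333 + ["Y"] * 333 + ["Z"] * 334
print(sum(local_quotas(uneven).values()))
```

Because each quota is rounded down, the local lower bounds can always be met together with the global bound Count(*) ≤ 200. Any slack left by rounding can go to whichever tuples best serve the optimization criterion.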
14
Putting it all together
1. Apply dominance criterion (as earlier).
2. Apply representation (as earlier, but maximize SUM(DOM_Count) ).
3. Weight each tuple by the number of tuples it dominates.
SELECT SqlString, DOMSUM(Count)
FROM WkldTbl
DOMINATE WITH
  PARTITIONING BY FromTables, JoinConds, WhereCols
  (SLAVE.GroupByCols ⊆ MASTER.GroupByCols) AND
  (SLAVE.OrderByCols PREFIX MASTER.OrderByCols)
REPRESENT WITH
  PARTITIONING BY FromTables, JoinConds, WhereCols
  MAXIMIZING SUM(DOM_Count)
  GLOBAL CONSTRAINT Count(*) ≤ 200
  LOCAL CONSTRAINT Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))
15
Outline
• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing Summarization Primitives in SQL
• Experiments
16
Implementing Summarization Primitives in SQL
• Assume set and sequence support in SQL
  – The mills of the standards bodies…
• Partitioning useful for both primitives
  – Hashing, Sort-based, Index-based…
• Implementing Dominance
  – Naïve O(n²) algorithm
  – Techniques from group-wise processing
  – Leverage Skyline optimizations
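The naïve quadratic approach, combined with hash partitioning, might look like the Python sketch below. It is not the implementation described in the talk: the function name, the `count` field and the choice to credit a dominated tuple to the first survivor found that dominates it are all assumptions. It relies on the dominance condition being transitive.

```python
from collections import defaultdict


def dominance_filter(rows, partition_key, dominates, count_field="count"):
    """Drop dominated tuples within each partition, keeping an aggregate.

    Returns (row, dom_count) pairs, where dom_count sums `count_field` over
    the surviving row and the rows folded into it. Worst case O(n^2)
    comparisons per partition.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)  # hash partitioning

    result = []
    for group in partitions.values():
        survivors = []  # mutable [row, dom_count] entries
        for row in group:
            weight = row.get(count_field, 1)
            for entry in survivors:
                if dominates(entry[0], row):
                    entry[1] += weight  # row is dominated: fold it in
                    break
            else:
                # row survives so far; absorb any survivors it dominates
                kept = []
                for entry in survivors:
                    if dominates(row, entry[0]):
                        weight += entry[1]
                    else:
                        kept.append(entry)
                kept.append([row, weight])
                survivors = kept
        result.extend((r, w) for r, w in survivors)
    return result


# Toy workload: dominance here is just "GROUP BY columns are a subset".
workload = [
    {"table": "R", "group_by": {"A", "B"}, "count": 1},
    {"table": "R", "group_by": {"A", "B", "C"}, "count": 2},
    {"table": "S", "group_by": {"A"}, "count": 5},
]
summary = dominance_filter(
    workload,
    partition_key=lambda r: r["table"],
    dominates=lambda m, s: s["group_by"] <= m["group_by"],
)
for row, dom_count in summary:
    print(row["table"], sorted(row["group_by"]), dom_count)
```

Partitioning first keeps the quadratic comparisons confined to tuples that could dominate one another.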
17
Representation
• Implementing directly is LP-hard
• Many queries are much simpler
  – Fall into one of two special cases
• Other queries are handled by a simple heuristic
  – User-guided search
• Implement as multiple operators
18
User-Guided Search
• Scan tuples in a specific order
  – User-specified, or heuristically chosen
• Will always minimize/maximize Count(*)
  – Use ordering to transform other objectives
  – Slightly different algorithms for the two cases
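One plausible reading of the ordered scan, for the case that minimizes Count(*), is sketched below in Python: tuples are visited in a given order and added until the constraints hold. The function name, the `satisfied` callback and the sample data are hypothetical, and the slides do not spell out the actual algorithm, so treat this as a guess at its shape.

```python
def scan_until_satisfied(rows, order_key, satisfied):
    """Add tuples in descending `order_key` order until `satisfied(chosen)`.

    Returns the chosen tuples, or None if even the whole input fails.
    """
    chosen = []
    for row in sorted(rows, key=order_key, reverse=True):
        if satisfied(chosen):
            break
        chosen.append(row)
    return chosen if satisfied(chosen) else None


# Example: smallest subset whose Sum(B) exceeds 60% of the input's Sum(B).
rows = [{"B": b} for b in (10, 5, 7, 2, 1, 1)]
threshold = 0.6 * sum(r["B"] for r in rows)
subset = scan_until_satisfied(
    rows,
    order_key=lambda r: r["B"],
    satisfied=lambda chosen: sum(r["B"] for r in chosen) > threshold,
)
print([r["B"] for r in subset])
```

With a single sum constraint, scanning in descending order of the constrained attribute gives the smallest qualifying subset. With several constraints, the scan order is what the user or a heuristic has to choose.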
20
Two Special Cases
• Maximize SUM(Attr)
  – All constraints are on Count(*)
  – Use partitioning and sort-order access
• Minimize Count(*)
  – Single constraint: Again easily solved
  – More special cases also solvable
  – Multiple constraints: Approximation algorithm
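For the first special case, a greedy strategy over partitions sorted by the attribute is one way to picture "partitioning and sort-order access". The Python sketch below assumes non-negative attribute values, a global upper bound on Count(*) and per-partition lower bounds. The function and field names are hypothetical and this is not claimed to be the authors' operator.

```python
from collections import defaultdict


def maximize_sum(rows, key, attr, budget, quotas):
    """Pick at most `budget` rows maximizing Sum(attr), with at least
    quotas[p] rows from each partition p (assumes attr values >= 0)."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)

    chosen, leftovers = [], []
    for p, group in parts.items():
        group.sort(key=lambda r: r[attr], reverse=True)  # sort-order access
        q = min(quotas.get(p, 0), len(group))
        chosen.extend(group[:q])      # satisfy the local lower bound first
        leftovers.extend(group[q:])

    # Spend whatever budget remains on the best remaining rows overall.
    leftovers.sort(key=lambda r: r[attr], reverse=True)
    chosen.extend(leftovers[:max(budget - len(chosen), 0)])
    return chosen


rows = [
    {"part": "X", "cost": 9}, {"part": "X", "cost": 8}, {"part": "X", "cost": 1},
    {"part": "Y", "cost": 2}, {"part": "Y", "cost": 1},
]
picked = maximize_sum(rows, "part", "cost", budget=3, quotas={"X": 1, "Y": 1})
print(sorted(r["cost"] for r in picked))
```

Taking the top tuples of each partition to meet its quota, then filling the rest of the budget from the best leftovers, is optimal under these assumptions by a simple exchange argument.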
21
Experiments
• Evaluate utility for index selection
• Compare to sophisticated Workload Compression [chaudhuri02]
  – Clusters using a complex distance function
• Simple query as described earlier
  – Constrained to output same number of statements as Workload Compression
  – Orders of magnitude faster
• TPC-H 1GB database
  – Multiple synthetic workloads introduced in [chaudhuri02]
23
Comparing Estimated Costs
[Figure: bar chart. Y-axis: Estimated Cost (0 to 90000). X-axis: Workloads (SPJ, SPJ-GB, SPJ-GBOB, SingleTable). Series: Wkld Compression, Proportionate (Syntactic).]