Transcript
Page 1: STHoles: A Multidimensional Workload-Aware Histogram Nicolas Bruno* Columbia University Luis Gravano* Columbia University Surajit Chaudhuri Microsoft Research

STHoles: A Multidimensional Workload-Aware Histogram

Nicolas Bruno*, Columbia University

Luis Gravano*, Columbia University

Surajit Chaudhuri, Microsoft Research

SIGMOD 2001

* Work done in part while the authors were visiting Microsoft Research.

Page 2:

Histograms as Succinct Data Set Summaries

Used for selectivity estimation and approximate query processing.

Data set partitioned into buckets, each approximated by aggregate statistics.

Page 3:

Histograms

Each bucket consists of a bounding box and a tuple frequency value.

Uniformity is assumed inside buckets.

– Histograms should partition the data set into buckets with uniform tuple density.

Multidimensional data makes partitioning even more challenging.
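
As a minimal sketch of how the uniformity assumption turns buckets into a selectivity estimate: each bucket contributes its frequency scaled by the fraction of its bounding box that the range query covers. The names `Box`, `Bucket`, `volume`, `intersect`, and `estimate` are illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass

Box = list[tuple[float, float]]  # one (low, high) interval per dimension


def volume(box: Box) -> float:
    """Volume of a box; empty or inverted intervals count as zero."""
    v = 1.0
    for lo, hi in box:
        v *= max(0.0, hi - lo)
    return v


def intersect(a: Box, b: Box) -> Box:
    """Per-dimension intersection; may be empty (low >= high)."""
    return [(max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]


@dataclass
class Bucket:
    box: Box          # bounding box
    frequency: float  # number of tuples inside the box


def estimate(buckets: list[Bucket], query: Box) -> float:
    """Estimated result size of a range query, assuming uniform density per bucket."""
    total = 0.0
    for b in buckets:
        v = volume(b.box)
        if v > 0:
            total += b.frequency * volume(intersect(b.box, query)) / v
    return total
```

For example, a query covering half of each of two flat buckets is estimated at half of each bucket's frequency. The estimate is only as good as the uniformity inside each bucket, which is why bucket boundaries matter.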

Page 4:

Outline

Overview of existing multidimensional histogram techniques.

Introduction to STHoles histograms.

System architecture and STHoles construction algorithm.

Experimental evaluation.

Page 5:

Histogram Techniques: EquiDepth

Figure panels: Gaussian Data Set; EquiDepth Histogram [Muralikrishna and DeWitt 1988]

Correctly identifies core of densest clusters.

Partitioning uses “equi-count” instead of “equi-density”.

Page 6:

Histogram Techniques: MHist

Figure panels: Gaussian Data Set; MHist Histogram [Poosala and Ioannidis 1997]

Works well for highly skewed data distributions.

Devotes too many buckets to the densest clusters.

Bad initial “choices” are amplified in later steps.

Page 7:

Histogram Techniques: GenHist

Figure panels: Gaussian Data Set; GenHist Histogram [Gunopulos et al. 2000]

More robust than previous techniques (based on multidimensional information).

Difficult to choose right values of various parameters.

Requires at least 5-10 passes over the data.

Page 8:

Histogram Techniques: STGrid

Figure panels: Gaussian Data Set; STGrid Histogram [Aboulnaga and Chaudhuri 1999]

Incorporates feedback from query execution.

Grid partitioning strategy is sometimes too rigid.

Focuses on efficiency rather than accuracy.

Page 9:

Our New Histogram Technique: STHoles

Flexible bucket partitioning.

Exploits workload information to allocate buckets.

Query feedback captures uniformly dense regions.

Does not examine actual data set.

Page 10:

STHoles Histograms

Tree structure among buckets.

Buckets with holes: relax the requirement that regions be rectangular while still using rectangular bucket structures.

Figure label: Non-rectangular region
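
The bucket tree with holes can be sketched as a small recursive data structure. This is an illustration under stated assumptions, not the authors' implementation: children are taken to be disjoint boxes fully enclosed by their parent, a bucket's `frequency` counts only tuples in its own region (its box minus its children), and all names are hypothetical.

```python
from dataclasses import dataclass, field

Box = list[tuple[float, float]]  # one (low, high) interval per dimension


def volume(box: Box) -> float:
    v = 1.0
    for lo, hi in box:
        v *= max(0.0, hi - lo)
    return v


def intersect(a: Box, b: Box) -> Box:
    return [(max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]


@dataclass
class Bucket:
    box: Box
    frequency: float  # tuples in the box, excluding those in child buckets
    children: list["Bucket"] = field(default_factory=list)

    def own_volume(self) -> float:
        # Region of this bucket: its box minus the holes drilled for its children.
        return volume(self.box) - sum(volume(c.box) for c in self.children)

    def estimate(self, query: Box) -> float:
        q = intersect(self.box, query)
        vq = volume(q)
        if vq == 0:
            return 0.0
        # Part of the query that falls into holes is answered by the children.
        in_holes = sum(volume(intersect(c.box, q)) for c in self.children)
        own = self.own_volume()
        est = self.frequency * (vq - in_holes) / own if own > 0 else 0.0
        return est + sum(c.estimate(query) for c in self.children)
```

Because the parent's region excludes its holes, a dense cluster stored in a child does not inflate the density assumed for the surrounding, sparser area.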

Page 11:

System Architecture for STHoles

Figure label: Range Query

Page 12:

STHoles Construction Algorithm

Initialize histogram H as an empty histogram.

For each query q in workload:

1- Gather simple statistics from query results.

2- Identify candidate holes and drill (add) them as new buckets in H.

3- Merge superfluous buckets in H.

Page 13:

Drilling New Candidate Buckets

For each query q in workload and bucket b in histogram:

Count how many tuples in result stream lie inside q ∩ b.

Drill q ∩ b as a new bucket (child of b).
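
A minimal sketch of the counting step, assuming queries and buckets are axis-aligned boxes and the result stream is an iterable of points. The function name `candidate_hole` and the box representation are assumptions made for illustration.

```python
Box = list[tuple[float, float]]  # one (low, high) interval per dimension


def intersect(a: Box, b: Box) -> Box:
    return [(max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]


def candidate_hole(query: Box, bucket_box: Box, result_tuples):
    """Return (q ∩ b, number of result tuples inside it), or None if they do not overlap."""
    c = intersect(query, bucket_box)
    if any(lo >= hi for lo, hi in c):
        return None  # query and bucket do not overlap
    count = sum(
        1
        for t in result_tuples
        if all(lo <= x <= hi for x, (lo, hi) in zip(t, c))
    )
    return c, count
```

The count comes only from tuples the query already returned, so this step never touches the underlying data set.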

Page 14:

Shrinking Candidate Buckets

Partition constraint: Bounding boxes must be rectangular.

Apply greedy technique to shrink a candidate hole to a rectangle.
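
One way such a greedy shrink could work, sketched under assumptions: the candidate hole is cut along one dimension at a time until it no longer partially overlaps any existing child bucket, and each step keeps the cut that preserves the most volume. The slide does not spell out the selection rule, so the function names and the rule itself are assumptions.

```python
Box = list[tuple[float, float]]  # one (low, high) interval per dimension


def volume(box: Box) -> float:
    v = 1.0
    for lo, hi in box:
        v *= max(0.0, hi - lo)
    return v


def intersect(a: Box, b: Box) -> Box:
    return [(max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]


def partially_overlaps(c: Box, child: Box) -> bool:
    """True if child overlaps c but is not fully contained in it."""
    inter = intersect(c, child)
    return volume(inter) > 0 and inter != list(child)


def shrink(c: Box, children: list[Box]) -> Box:
    """Greedily shrink candidate c until no child is partially overlapped."""
    c = list(c)
    while True:
        partial = [ch for ch in children if partially_overlaps(c, ch)]
        if not partial:
            return c
        best = None
        for ch in partial:
            for d, ((clo, chi), (hlo, hhi)) in enumerate(zip(c, ch)):
                # Two cuts exclude ch along dimension d: keep the part below it or above it.
                for new in ((clo, min(chi, hlo)), (max(clo, hhi), chi)):
                    cand = c[:d] + [new] + c[d + 1:]
                    if best is None or volume(cand) > volume(best):
                        best = cand
        c = best
```

Each iteration removes at least one child from the set of overlapping buckets, so the loop terminates. Children that lie completely inside the candidate are left alone.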

Page 15:

Merging Buckets

To avoid exceeding available space.

Merge most “similar” buckets in terms of tuple density.

Page 16:

Parent-Child Merges

Eliminate buckets too similar to their parents.

Example: The interesting region in bc is covered by its child b1.
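
A sketch of how "too similar" might be quantified for a parent-child merge, assuming the penalty is the change in estimated tuple counts once both regions share a single density. The function name and the exact penalty form are assumptions for illustration.

```python
def parent_child_penalty(f_p: float, v_p: float, f_c: float, v_c: float) -> float:
    """Penalty of merging a child into its parent.

    f_p, v_p: frequency and region volume of the parent (excluding holes).
    f_c, v_c: frequency and region volume of the child.
    """
    f_n = f_p + f_c  # merged frequency
    v_n = v_p + v_c  # merged region volume
    # After the merge, each old region is estimated with the merged density f_n / v_n.
    return abs(f_p - f_n * v_p / v_n) + abs(f_c - f_n * v_c / v_n)
```

When parent and child have the same density the penalty is zero, so the child carries no extra information and is a natural candidate to eliminate.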

Page 17:

Sibling-Sibling Merges

Consolidate buckets with similar densities that cover close regions.

Extrapolate frequency distributions to yet unseen regions.
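
A simplified sketch of a sibling-sibling merge, under several assumptions: the merged bucket is the bounding box of the two siblings, the siblings are disjoint and have no children, the bounding box touches no other bucket, and the parent's density is extrapolated into the newly absorbed space. All names are hypothetical.

```python
Box = list[tuple[float, float]]  # one (low, high) interval per dimension


def volume(box: Box) -> float:
    v = 1.0
    for lo, hi in box:
        v *= max(0.0, hi - lo)
    return v


def bounding_box(a: Box, b: Box) -> Box:
    return [(min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]


def merge_siblings(box1: Box, f1: float, box2: Box, f2: float, f_p: float, v_p: float):
    """Return (merged box, merged frequency, penalty).

    f_p, v_p: frequency and region volume of the common parent (excluding holes).
    """
    box_n = bounding_box(box1, box2)
    v1, v2, v_n = volume(box1), volume(box2), volume(box_n)
    v_old = v_n - v1 - v2              # parent space absorbed by the merged bucket
    f_n = f1 + f2 + f_p * v_old / v_p  # extrapolate parent density into that space
    penalty = (
        abs(f_n * v_old / v_n - f_p * v_old / v_p)
        + abs(f1 - f_n * v1 / v_n)
        + abs(f2 - f_n * v2 / v_n)
    )
    return box_n, f_n, penalty
```

The absorbed space is where extrapolation happens: the merged bucket now makes a density claim about a region that no query may have covered yet.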

Page 18:

An Example STHoles Histogram

Figure panels: Gaussian Data Set; STHoles Histogram

Page 19:

Experimental Setting

Data Sets:

– Real (UCI Repository):
• Sample of Census data set (200K tuples)
• Cover data set (500K tuples)

– Synthetic: Variations of Gaussian and Zipfian (Array) distributions. 200K to 500K tuples, 2 to 4 dimensions.

Histograms:

– 1024 available bytes per histogram.
– EquiDepth, MHist, GenHist, STGrid, STHoles.

Page 20:

Experimental Setting (cont.)

Workloads [Pagel et al. 1993]:

– 1,000 queries.
– Query centers follow different distributions: Uniform, Biased, Gaussian.
– Query boundaries follow different constraints: area covered, tuples covered.

Figure panels: Census data set; Biased (tuples) workload; Gaussian (area) workload

Accuracy Metric: Absolute Error (with some normalization; details in paper).

Page 21:

Comparison with Other Approaches: Biased Workload

Biased workload, query boundaries cover around 1% of the data domain

Page 22:

Comparison with Other Approaches: Uniform Workload

Uniform workload, query boundaries cover around 1% of the data set tuples.

Page 23:

Convergence with Workload

Biased workload

Page 24:

Handling Data Set Updates

From Gaussian to Zipfian data distributions.

Page 25:

Other Experiments

Varying:

– data skew.
– data dimensionality.
– histogram size.
– workload generation parameters.
– number of attributes in queries.

Overhead for intercepting query results in Microsoft SQL Server 2000 is less than 8%.

STHoles leads to robust selectivity estimates across data distributions and workloads.

See full paper for details!

Page 26:

Summary: STHoles, a Multidimensional Workload-Aware Histogram

Exploits query feedback.

Built without examining data set.

Allows bucket nesting to capture complex shapes using only rectangular bucket structures.

Results in robust and accurate selectivity estimations.

In many cases, outperforms the best techniques that access full data sets.

Page 27:

Related Work (Histograms)

Unidimensional:

– EquiDepth [Piatetsky-Shapiro and Connell 1984]
– MaxDiff [Poosala et al. 1996]
– V-Optimal [Jagadish et al. 1998]
– Many more!

Multidimensional:

– EquiDepth [Muralikrishna and DeWitt 1988]
– MHist [Poosala and Ioannidis 1997]
– GenHist [Gunopulos et al. 2000]
– STGrid [Aboulnaga and Chaudhuri 1999]

Page 28:

Related Work (Other Techniques)

Sampling [Olken and Rotem 1990]

Wavelets [Matias et al. 1997]

Discrete transformations [Lee et al. 1999]

Parametric Curve Fitting [Chen and Roussopoulos 1994]

Page 29:

Evaluation Metric

Absolute Error:

Normalized Absolute Error:
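
The formulas for these two metrics are not present in the transcript. A common way to define them, given here as an assumption rather than as the slide's exact notation, averages the absolute estimation error over the workload and then divides by the error of a baseline that assumes the whole data set is uniform. The symbols D (data set), H (histogram), W (workload), est, act, and est_unif are introduced for illustration.

```latex
E(D, H, W) = \frac{1}{|W|} \sum_{q \in W} \bigl| \mathrm{est}(H, q) - \mathrm{act}(D, q) \bigr|

E_{\mathrm{norm}}(D, H, W) =
  \frac{E(D, H, W)}
       {\frac{1}{|W|} \sum_{q \in W} \bigl| \mathrm{est}_{\mathrm{unif}}(D, q) - \mathrm{act}(D, q) \bigr|}
```

Here est(H, q) is the histogram's estimate for query q, act(D, q) is the true result size, and est_unif(D, q) is the estimate obtained from a single bucket covering the whole domain. Under this reading, a normalized error below 1 means the histogram beats the uniformity baseline.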

Page 30:

Overhead Evaluation over Microsoft SQL Server 2000

Page 31:

Varying Histogram Size

Figure panels: Gaussian Data Set; Zipfian Data Set; Census Data Set

Page 32:

Varying Spatial Selectivity

Figure panels: Gaussian Data Set; Zipfian Data Set; Census Data Set

