CON2803_PDF_2803_0001

7/23/2019 CON2803_PDF_2803_0001

The Evolution of Histograms

Jonathan Lewis

jonathanlewis.wordpress.com

www.jlcomp.demon.co.uk

Title

2 / 30

Jonathan Lewis

2011

Who am I ?

Independent Consultant

28+ years in IT

24+ using Oracle

Strategy, Design, Review, Briefings, Educational, Trouble-shooting

Member of the Oak Table Network

Oracle ACE Director

Oracle author of the year 2006

Select Editor's choice 2007

UKOUG Inspiring Presenter 2011

UKOUG Council member 2012

ODTUG 2012 Best Presenter (d/b)

O1 visa for USA

O-1 Visa

An alien of extraordinary ability

Highlights

Why Histograms

Current mechanisms

Problems and workarounds

New mechanisms

Sample Data (a)

S    COUNT(*)
P      52,352
C   9,416,360
O       3,499
L      86,084

CODE  DESCRIPTION
A     ASSIGNED
B     HANDED BACK
C     CLOSED
L     LOGGED
O     HANDED OVER
P     PENDING

Other ideas

Change 'commonest value' to null

Virtual columns / Function-based indexes

List partitions

Standard Strategy

Frequency histogram with literals in SQL

Problems

Coding to take advantage of histogram

Limit on distinct values

Resources needed for gathering

Accuracy of histogram

Timing of gathering

Limits (a)

select
        specifier, count(*)
from
        messages
group by
        specifier
order by
        count(*) desc
;

SPECIFIER    COUNT(*)
BVGFJB      1,851,177
LYYVLH        719,582
MTVMIE        672,823
YETSDP        659,661
DAJYGS        504,641
...
KDCFVJ         75,328
JITCRI         74,104
DNRYKC         70,029
BEWPEQ         68,681
...
JXXXRE              1
OHMNVU              1
YGOBWQ              1
UBBWQH              1

Distinct Specifiers = 352

Frequency Limit is 254

Height-balanced less precise

Popular values use lots of buckets

Limits (b)

Interesting arithmetic - for THIS data set

Top N values   % of data
         140       99.00
         210       99.90
         250       99.98

Each "bucket" represents roughly 40,000 rows (10M / 254)

A value with 40,001 rows MIGHT get captured twice

A value with 79,999 rows MIGHT NOT get captured twice

In this data set there are 25 values that WILL get captured (ct > 80,001)

There are 35 values that might be captured one day, and not the next.
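The "might / might not" arithmetic above can be checked with a small sketch. It is not Oracle's code: it assumes the bucket endpoints sit at every 40,000th row of the sorted data (the slide's rounded figure) and ignores sampling. The function names are made up for the illustration.

```python
def endpoints_covered(start, rows, bucket_size):
    """Count bucket endpoints that fall inside a run of identical values.

    Assumption: endpoints sit at sorted positions bucket_size,
    2 * bucket_size, ... and the run occupies positions
    start .. start + rows - 1 (1-based).
    """
    last = start + rows - 1
    return last // bucket_size - (start - 1) // bucket_size


def capture_range(rows, bucket_size):
    """Return the (fewest, most) endpoints a run can cover over all alignments."""
    hits = [endpoints_covered(start, rows, bucket_size)
            for start in range(1, bucket_size + 1)]
    return min(hits), max(hits)


BUCKET = 40_000  # rounded figure from the slide: 10M rows / 254 buckets

for rows in (40_001, 79_999, 80_000):
    fewest, most = capture_range(rows, BUCKET)
    print(f"{rows:>7,} rows: between {fewest} and {most} endpoints")
```

A value shows up as "popular" only when it covers at least two endpoints, so under these assumptions anything below two full buckets depends on where the run happens to start.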

Limits (c)

12c allows 2,048 buckets

The default is still 254

Don't be in a rush to use the maximum

Don't forget the optstat history tables

There are several new columns

There are some new costs

Precision (a)

select
        status, count(*)
from
        orders
group by
        status
order by
        status;

S   COUNT(*)
C    529,100
P        300
R        300
S        300
X    500,000

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'orders',
                estimate_percent => dbms_stats.auto_sample_size,
                method_opt       => 'for columns status size 10'
        );
end;
/

Precision (b)

select
        endpoint_number,
        endpoint_number - nvl(prev_endpoint,0) frequency,
        chr(to_number(substr(hex_val, 2,2),'XX')) status
from    (
        select
                endpoint_number,
                lag(endpoint_number,1) over(
                        order by endpoint_number
                ) prev_endpoint,
                to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') hex_val
        from
                user_tab_histograms
        where
                table_name = 'ORDERS'
        and     column_name = 'STATUS'
        )
order by
        endpoint_number
/

http://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/

Precision (c)

Results 11.2.0.3 - four attempts

ENDPOINT_NUMBER  FREQUENCY  STATUS

           2848       2848  C
           2849          1  P
           5629       2780  X

           2741       2741  C
           2742          1  P
           2743          1  R
           5331       2588  X

           2706       2706  C
           2708          2  P
           5355       2647  X

           2852       2852  C
           2854          2  P
           2856          2  R
           2859          3  S
           5472       2613  X

Missing values are NOT NICE
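A rough sketch of why values go missing from run to run. The first function just repeats the endpoint_number arithmetic from the query; the second assumes every sampled row is drawn independently from the 1,030,000 rows, which is a simplification of my own and not a description of how Oracle actually samples.

```python
def frequencies(endpoint_numbers):
    """Turn cumulative endpoint numbers into per-value frequencies."""
    previous = 0
    result = []
    for endpoint in endpoint_numbers:
        result.append(endpoint - previous)
        previous = endpoint
    return result


def chance_of_missing(value_rows, table_rows, sample_rows):
    """Probability that a sample contains no row for a given value.

    Assumption: each sampled row is an independent random draw.
    """
    return (1 - value_rows / table_rows) ** sample_rows


# First attempt from the results: endpoints for C, P and X.
first_attempt = frequencies([2848, 2849, 5629])

# A status with 300 rows in a 1,030,000 row table, sample of 5,629 rows.
miss = chance_of_missing(300, 1_030_000, 5629)

print(first_attempt)
print(f"chance of missing a 300-row value: {miss:.0%}")
```

With a sample of this size a 300-row value contributes only one or two rows on average, so under this assumption it is not surprising that R and S drop out of some of the four attempts.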

Basic Cost

select
        substrb(dump(val,16,0,32),1,120) ep, cnt
from    (
        select  /*+ lots of hints */
                "STATUS" val, count(*) cnt
        from    "TEST_USER"."ORDERS" t
        where   "STATUS" is not null
        group by
                "STATUS"
        )
order by val

   Rows  Row Source Operation
      5  SORT GROUP BY {various statistics etc.}
1030000   TABLE ACCESS FULL : {various statistics etc.}

-- Could extract a sample

Solution (a)

declare
        srec            dbms_stats.statrec;
        c_array         dbms_stats.chararray;

        m_distcnt       number;
        m_density       number;
        m_nullcnt       number;
        m_avgclen       number;
begin
        m_distcnt := 5;
        m_density := 0.00001;
        m_nullcnt := 0;
        m_avgclen := 1;

http://jonathanlewis.wordpress.com/2009/05/28/frequency-histograms/

Solution (b)

        c_array     := dbms_stats.chararray('C', 'P', 'R', 'S', 'X');
        srec.bkvals := dbms_stats.numarray (5000, 3, 3, 3, 5000);
        srec.epc    := 5;

        dbms_stats.prepare_column_values(srec, c_array);

        dbms_stats.set_column_stats(
                ownname => user,
                tabname => 'ORDERS',
                colname => 'STATUS',
                distcnt => m_distcnt,
                density => m_density,
                nullcnt => m_nullcnt,
                srec    => srec,
                avgclen => m_avgclen
        );
end;

Precision (12c)


11.2.0.3 (four attempts)

      2848       2848  C
      2849          1  P
      5629       2780  X

      2741       2741  C
      2742          1  P
      2743          1  R
      5331       2588  X

      2706       2706  C
      2708          2  P
      5355       2647  X

      2852       2852  C
      2854          2  P
      2856          2  R
      2859          3  S
      5472       2613  X

12.1.0.0

    529100     529100  C
    529400        300  P
    529700        300  R
    530000        300  S
   1030000     500000  X

12c has enhanced the code for the calculation of "approximate NDV" so for a small number of distinct values it can produce an accurate frequency histogram at virtually no extra cost

Basic Principle

[Figure: a 16 x 16 grid of hash buckets, numbered 0 to 255]

The square is a visual aid only.

The number of hash buckets is 2^64 (about 1.8 x 10^19).

Minimising cost

[Figure: the same 16 x 16 grid of hash buckets, numbered 0 to 255]

We only keep 16,384 items in the hash table for each column.

We discard half the table each time we reach this limit.
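The "keep 16,384 hashes, throw away half when full" idea can be sketched as follows. This is only an illustration of the principle, not Oracle's implementation: the hash function (blake2b cut to 64 bits), the choice of which half to discard (hashes with the next low-order bit set) and the function names are all assumptions.

```python
import hashlib


def hash64(value):
    """Map any value to a 64-bit integer (stand-in for Oracle's hash)."""
    digest = hashlib.blake2b(repr(value).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def approximate_ndv(values, limit=16_384):
    """Estimate the number of distinct values with a bounded hash table."""
    kept = set()
    level = 0  # how many times half the hash domain has been discarded
    for value in values:
        h = hash64(value)
        # Ignore hashes that fall in a part of the domain already discarded.
        if h & ((1 << level) - 1):
            continue
        kept.add(h)
        # Table full: discard the half whose next low-order bit is set.
        while len(kept) > limit:
            level += 1
            mask = (1 << level) - 1
            kept = {x for x in kept if not (x & mask)}
    # Each surviving hash stands for 2^level distinct values.
    return len(kept) * (1 << level)


few = approximate_ndv(i % 1000 for i in range(50_000))
many = approximate_ndv(range(100_000))
print(few, many)
```

While the number of distinct values stays under the limit nothing is discarded and the count is exact, which fits the earlier point that 12c can build an accurate frequency histogram for a small number of distinct values during the same pass.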

Top-Frequency (12c)

If a small set of values accounts for most of the data, Oracle 12c can produce a frequency histogram for the popular values and use an estimate for the rest.

select
        skewed, count(*)
from
        t1
group by
        skewed
order by
        skewed
;

SKEWED  COUNT(*)
     1         4
     2         8
     3        12
     4        16
     5        20
     6        24
     7        28
     8        32
     9        36
    10       116
    11        44
    12        48
    13        52
    14        56
    15        60
    16        64
    17        68
    18        72
    19        76
    20         4

If you wanted 18 buckets for this data (840 rows) you could (easily) fit the four least popular values into 1 bucket - leaving just 16 interesting values

Top-Frequency (12c)

select
        endpoint_value          epv,
        endpoint_number         epn,
        endpoint_number -
                lag(endpoint_number,1) over (
                        order by endpoint_number
                )               freq
from    user_tab_histograms
where   table_name  = 'T1'
and     column_name = 'SKEWED'
order by
        endpoint_value
;

EPV  EPN  FREQ
  1    1
  4   17    16
  5   37    20
  6   61    24
  7   89    28
  8  121    32
  9  157    36
 10  273   116
 11  317    44
 12  365    48
 13  417    52
 14  473    56
 15  533    60
 16  597    64
 17  665    68
 18  737    72
 19  813    76
 20  814     1

(There is still a little flaw)
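The endpoint numbers in this output can be reproduced from the SKEWED data with a few lines of arithmetic. The sketch only mimics the shape of the result shown here: keeping the low and high values with a frequency of 1, then filling the remaining buckets with the most popular values, is my reading of the output and not a statement of Oracle's algorithm.

```python
def top_frequency_endpoints(counts, buckets):
    """Return (value, endpoint_number, frequency) rows for a top-N histogram.

    Assumption, taken from the output on the slide: the low and high
    values are always kept and are recorded with a frequency of 1.
    """
    low, high = min(counts), max(counts)
    middle = {v: c for v, c in counts.items() if v not in (low, high)}
    popular = sorted(middle, key=middle.get, reverse=True)[: buckets - 2]

    freq = {v: middle[v] for v in popular}
    freq[low] = 1
    freq[high] = 1

    rows, running = [], 0
    for value in sorted(freq):
        running += freq[value]          # endpoint_number is cumulative
        rows.append((value, running, freq[value]))
    return rows


# The data from the slide: 4 rows per unit of value, except 10 and 20.
counts = {i: 4 * i for i in range(1, 20)}
counts[10] = 116
counts[20] = 4

histogram = top_frequency_endpoints(counts, buckets=18)
for row in histogram:
    print(row)
```

Values 2 and 3 (8 and 12 rows) are the ones left out, and the last endpoint number is 814 rather than 840, matching the listing.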

Too many values (a)

Building a Height-Balanced Histogram

23 23 28 24 29 36 27 13 30 46 43 29 25 20 39 38 20 33 29 35
20 38 27 20 28 29 42 26 19 16 33 26 43 18 19 31 32 35 28 22
40 33 19 34 45 28 42 33 27 38 35 21 35 12  8 59 35 34 31 24
38 33 39 16 35 22 32 38 20 34 18 37 27 29 30 50 33 27 27 15
13 12 31 26 13 35 31 41 44 29 22 30 33 33 43 31 28 32 17 28

Sort

 8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59

Too many values (b)

We have 100 items and 37 distinct values. Assume we are limited to 20 buckets.

After sorting the data we record the value of every 5th row (100 / 20).

 8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59

Lowest value, then every 5th row:

8 | 13 17 19 20 23 26 27 28 29 29 31 32 33 34 35 36 38 41 43 59

Stored with the repeated endpoint collapsed:

8 | 13 17 19 20 23 26 27 28 29 31 32 33 34 35 36 38 41 43 59

29 is the only "popular" value with two buckets (i.e. 10 rows).

All other values are assumed to have (100 - 10) / (37 - 1) = 2.5, i.e. about 3 rows. (10.2.0.4+)

There are lots more popular values in the data (for example, 33 appears 8 times and 35 appears 7 times).
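The walk-through above is easy to replay. This sketch takes the 100 sorted values, records every 5th one, and repeats the "popular / everything else" arithmetic. It follows the slide's simplified description only and makes no claim about Oracle's internal code.

```python
from collections import Counter

DATA = [int(x) for x in """
 8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59
""".split()]


def height_balanced(values, buckets):
    """Return the endpoint value of each bucket (every n-th sorted row)."""
    ordered = sorted(values)
    step = len(ordered) // buckets
    return [ordered[i * step - 1] for i in range(1, buckets + 1)]


BUCKETS = 20
endpoints = height_balanced(DATA, BUCKETS)

# A value is "popular" when it is the endpoint of more than one bucket.
popular = {v: n for v, n in Counter(endpoints).items() if n > 1}

rows_per_bucket = len(DATA) // BUCKETS
popular_rows = sum(popular.values()) * rows_per_bucket
ndv = len(set(DATA))
rows_per_other_value = (len(DATA) - popular_rows) / (ndv - len(popular))

print(endpoints)
print(popular, rows_per_other_value)

# What the histogram cannot see: the true counts of the most common values.
print(Counter(DATA).most_common(3))
```

The last line shows the weakness: 33 and 35 are more common than 29 in the real data, but only 29 happens to straddle two endpoints.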

Solution (8i - 11g)

Fake it with a frequency histogram.

Pick the 254 most popular values.

Include the low and high values

Fake selectivity for remainder

Needs one entry with double the desired cardinality

Could assign this to the low/high value if introduced

Otherwise change the value with the lowest frequency

Too many values (12c)

8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20

21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28

28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32

32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36

37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59

Hybrid Histogram

select
        endpoint_number,
        endpoint_value,
        endpoint_repeat_count
from
        user_tab_histograms
where
        table_name = 'T1'
;

EPN  EPV  REP
  1    8    1
  6   13    3
 12   18    2
 20   20    5
 26   23    2
 32   26    3
 38   27    6
 44   28    6
 50   29    6
 58   31    5
 69   33    8
 79   35    7
 86   38    5
 90   41    1
 92   42    2
 95   43    3
 96   44    1
 97   45    1
 98   46    1
100   59    1

This looks like an old frequency histogram, but each bucket has a "repeat count" showing how often the highest value appears in the bucket.

For example, the bucket ending at 38 holds 7 rows, and 38 appears 5 times.
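A short sketch of how the three columns can be read. The bucket sizes come straight from the differences in endpoint_number; the "popular" test (repeat count greater than the average bucket size of 100 / 20 = 5) is an assumption on my part about how such a histogram would be interpreted, not something stated on the slide.

```python
# (endpoint_number, endpoint_value, endpoint_repeat_count) from the output above
HYBRID = [
    (1, 8, 1), (6, 13, 3), (12, 18, 2), (20, 20, 5), (26, 23, 2),
    (32, 26, 3), (38, 27, 6), (44, 28, 6), (50, 29, 6), (58, 31, 5),
    (69, 33, 8), (79, 35, 7), (86, 38, 5), (90, 41, 1), (92, 42, 2),
    (95, 43, 3), (96, 44, 1), (97, 45, 1), (98, 46, 1), (100, 59, 1),
]


def describe(histogram):
    """Split each bucket into rows for its endpoint value and other rows."""
    previous = 0
    buckets = []
    for epn, epv, rep in histogram:
        size = epn - previous
        buckets.append({
            "value": epv,
            "bucket_rows": size,        # all rows in the bucket
            "repeat": rep,              # rows equal to the endpoint value
            "other_rows": size - rep,   # rows for smaller values
        })
        previous = epn
    return buckets


buckets = describe(HYBRID)
average_bucket = HYBRID[-1][0] / len(HYBRID)   # 100 rows / 20 buckets

# Assumed rule: a value is popular if its repeat count exceeds the average.
popular = [b["value"] for b in buckets if b["repeat"] > average_bucket]

print(next(b for b in buckets if b["value"] == 38))
print(popular)
```

Under that assumption the hybrid histogram identifies 27, 28, 29, 33 and 35 as popular, where the height-balanced histogram on the same data found only 29.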

SQL (top-N pt.1)

SQL behind creating a histogram with 18 buckets

select
        /*+ {lots of hints} */
        to_char(count("VALUE")),
        to_char(substrb(dump(min("VALUE"),16,0,64),1,240)),
        to_char(substrb(dump(max("VALUE"),16,0,64),1,240)),
        count(rowidtochar(rowid))
from
        "TEST_USER"."T1" t /* TOPN,NIL,NIL,RWID,U18U*/

SQL behind basic "approximate NDV" (single column table - 11g)

select
        /*+ {lots of hints} */
        to_char(count("VALUE")),
        to_char(substrb(dump(min("VALUE"),16,0,64),1,240)),
        to_char(substrb(dump(max("VALUE"),16,0,64),1,240))
from
        "TEST_USER"."T1" t /* NDV,NIL,NIL*/

SQL (top-N pt.2)


select
        substrb(dump("VALUE",16,0,64),1,240) val,
        rowidtochar(rowid) rwid
from
        "TEST_USER"."T1" t
where   rowid in (
        chartorowid('AAAWaHAAFAAAAEEAAB'),chartorowid('AAAWaHAAFAAAAEEAAC'),
        chartorowid('AAAWaHAAFAAAAEEAAD'),chartorowid('AAAWaHAAFAAAAEEAAE'),
        chartorowid('AAAWaHAAFAAAAEEAAF'),chartorowid('AAAWaHAAFAAAAEEAAG'),
        chartorowid('AAAWaHAAFAAAAEEAAH'),chartorowid('AAAWaHAAFAAAAEEAAI'),
        chartorowid('AAAWaHAAFAAAAEEAAJ'),chartorowid('AAAWaHAAFAAAAEEAAK'),
        chartorowid('AAAWaHAAFAAAAEEAAL'),chartorowid('AAAWaHAAFAAAAEEAAM'),
        chartorowid('AAAWaHAAFAAAAEEAAN'),chartorowid('AAAWaHAAFAAAAEEAAO'),
        chartorowid('AAAWaHAAFAAAAEEAAP'),chartorowid('AAAWaHAAFAAAAEEAAQ'),
        chartorowid('AAAWaHAAFAAAAEFAAA'),chartorowid('AAAWaHAAFAAAAEFAAB')
        )
order by "VALUE"

SQL (hybrid)

select
        substrb(dump(val,16,0,64),1,20) ep, freq, cdn, ndv,
        (sum(pop) over()) popcnt, (sum(pop * freq) over()) popfreq,
        substrb(dump(max(val) over(),16,0,64),1,20) maxval,
        substrb(dump(min(val) over(),16,0,64),1,20) minval
from    (
        select
                val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv,
                (case when freq > ((sum(freq) over())/15) then 1 else 0 end) pop
        from    (
                select  /*+ lots of hints */
                        "VALUE" val, count("VALUE") freq
                from
                        "TEST_USER"."T1" t
                where
                        "VALUE" is not null
                group by
                        "VALUE"
                )
        )
order by val
/

With only 15 buckets this dataset got a hybrid histogram

SQL (old height-balanced)

select
        min(minbkt), maxbkt,
        substrb(dump(min(val),16,0,32),1,120) minval,
        substrb(dump(max(val),16,0,32),1,120) maxval,
        sum(rep) sumrep, sum(repsq) sumrepsq, max(rep) maxrep, count(*) bktndv,
        sum(case when rep=1 then 1 else 0 end) unqrep
from    (
        select
                val, min(bkt) minbkt, max(bkt) maxbkt,
                count(val) rep, count(val)*count(val) repsq
        from    (
                select
                        "LN100" val, ntile(200) over (order by "LN100") bkt
                from    sys.ora_temp_1_ds_616t
                where   "LN100" is not null
                )
        group by val
        )
group by maxbkt
order by maxbkt

Conclusions for 12c

Use auto_sample_size

2,048 buckets is legal

The default is still 254, and it's likely to be adequate

Frequency / Top N histograms: fast and accurate

Hybrid: captures far more popular values, but still samples and is costly

Timing is still important

May still want to create some by code
