7/23/2019 CON2803_PDF_2803_0001
The Evolution of Histograms
Jonathan Lewis
jonathanlewis.wordpress.com
www.jlcomp.demon.co.uk
Title
2 / 30
Jonathan Lewis
2011
Who am I ?
Independent Consultant
28+ years in IT
24+ using Oracle
Strategy, Design, Review, Briefings, Educational, Trouble-shooting
Member of the Oak Table Network
Oracle ACE Director
Oracle author of the year 2006
Select Editor's choice 2007
UKOUG Inspiring Presenter 2011
UKOUG Council member 2012
ODTUG 2012 Best Presenter (d/b)
O1 visa for USA
O-1 Visa
An alien of extraordinary ability
Highlights
Why Histograms
Current mechanisms
Problems and workarounds
New mechanisms
Sample Data (a)
S COUNT(*)
P 52,352
C 9,416,360
O 3,499
L 86,084
CODE DESCRIPTION
A ASSIGNED
B HANDED BACK
C CLOSED
L LOGGED
O HANDED OVER
P PENDING
Other ideas
Change 'commonest value' to null
Virtual columns / Function-based indexes
List partitions
Standard Strategy
Frequency histogram with literals in SQL
Problems
Coding to take advantage of histogram
Limit on distinct values
Resources needed for gathering
Accuracy of histogram
Timing of gathering
Limits (a)
select
specifier, count(*)
from
messages
group by
specifier
order by
count(*) desc
;
SPECIFIER COUNT(*)
BVGFJB 1,851,177
LYYVLH 719,582
MTVMIE 672,823
YETSDP 659,661
DAJYGS 504,641
...
KDCFVJ 75,328
JITCRI 74,104
DNRYKC 70,029
BEWPEQ 68,681
...
JXXXRE 1
OHMNVU 1
YGOBWQ 1
UBBWQH 1
Distinct Specifiers = 352
Frequency Limit is 254
Height-balanced less precise
Popular values use lots of buckets
Limits (b)
Interesting arithmetic - for THIS data set
Top N values % of data
140 99.00
210 99.90
250 99.98
Each "bucket" represents roughly 40,000 rows (10M / 254)
A value with 40,001 rows MIGHT get captured twice
A value with 79,999 rows MIGHT NOT get captured twice
In this data set there are 25 values that WILL get captured (ct > 80,001)
There are 35 values that might be captured one day, and not the next.
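The "might / might not" arithmetic above can be checked with a short sketch. It assumes, as the slide does, a bucket size of exactly 40,000 rows, and it treats a value as "captured" when at least two bucket endpoints fall inside its run of rows in the sorted data. The function name is hypothetical.

```python
def endpoints_hit(run_length, bucket_size):
    """Return (min, max) number of bucket endpoints that can fall inside a
    run of identical values, over every possible starting position."""
    hits = set()
    for start in range(1, bucket_size + 1):
        end = start + run_length - 1
        # Endpoints sit at row numbers that are multiples of bucket_size.
        hits.add(end // bucket_size - (start - 1) // bucket_size)
    return min(hits), max(hits)


BUCKET = 40_000  # assumed exact; the slide says "roughly 40,000"
for rows in (40_001, 79_999, 80_000):
    fewest, most = endpoints_hit(rows, BUCKET)
    print(f"{rows:>6} rows: between {fewest} and {most} endpoints")
```

Where the run starts in the sorted data decides the outcome, which is why the same value can be captured one day and missed the next when the data around it shifts.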
Limits (c)
12c allows 2,048 buckets
The default is still 254
Don't be in a rush to use the maximum
Don't forget the optstat history tables
There are several new columns
There are some new costs
Precision (a)
select
status, count(*)
from
orders
group by
status
order by
status;
S COUNT(*)
C 529,100
P 300
R 300
S 300
X 500,000
begin
    dbms_stats.gather_table_stats(
        ownname          => user,
        tabname          => 'orders',
        estimate_percent => dbms_stats.auto_sample_size,
        method_opt       => 'for columns status size 10'
    );
end;
/
Precision (b)
select
    endpoint_number,
    endpoint_number - nvl(prev_endpoint,0) frequency,
    chr(to_number(substr(hex_val, 2,2),'XX')) status
from (
    select
        endpoint_number,
        lag(endpoint_number,1) over(
            order by endpoint_number
        ) prev_endpoint,
        to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') hex_val
    from
        user_tab_histograms
    where
        table_name = 'ORDERS'
    and column_name = 'STATUS'
)
order by
    endpoint_number
/
http://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/
Precision (c)
Results 11.2.0.3 - four attempts

ENDPOINT_NUMBER FREQUENCY STATUS
2848 2848 C
2849 1 P
5629 2780 X

ENDPOINT_NUMBER FREQUENCY STATUS
2741 2741 C
2742 1 P
2743 1 R
5331 2588 X

ENDPOINT_NUMBER FREQUENCY STATUS
2706 2706 C
2708 2 P
5355 2647 X

ENDPOINT_NUMBER FREQUENCY STATUS
2852 2852 C
2854 2 P
2856 2 R
2859 3 S
5472 2613 X

Missing values are NOT NICE
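To see how far these sampled histograms drift from the real data, the sketch below scales each sampled frequency up to the full 1,030,000 rows and lists the statuses that got no bucket at all. The figures are copied from the slides. The scaling rule (bucket frequency / sample size x table rows) is the usual reading of a frequency histogram, stated here as an assumption rather than as a trace of the optimizer.

```python
# Actual row counts per status, from the "Precision (a)" slide.
ACTUAL = {"C": 529_100, "P": 300, "R": 300, "S": 300, "X": 500_000}

# Sampled frequencies from the four 11.2.0.3 attempts.
ATTEMPTS = [
    {"C": 2848, "P": 1, "X": 2780},
    {"C": 2741, "P": 1, "R": 1, "X": 2588},
    {"C": 2706, "P": 2, "X": 2647},
    {"C": 2852, "P": 2, "R": 2, "S": 3, "X": 2613},
]


def scaled_estimates(sample, num_rows):
    """Scale sampled frequencies up to the size of the whole table."""
    sample_size = sum(sample.values())
    return {status: round(num_rows * n / sample_size)
            for status, n in sample.items()}


NUM_ROWS = sum(ACTUAL.values())
for attempt, sample in enumerate(ATTEMPTS, start=1):
    estimates = scaled_estimates(sample, NUM_ROWS)
    missing = sorted(set(ACTUAL) - set(sample))
    print(f"attempt {attempt}: estimates={estimates} missing={missing}")
```

A single sampled row stands for roughly 180 to 190 table rows here, so the three rare statuses (300 rows each) either vanish from the histogram or appear with an estimate that jumps from one gather to the next.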
Basic Cost
select
substrb(dump(val,16,0,32),1,120) ep, cnt
from (
select /*+ lots of hints */
"STATUS" val, count(*) cnt
from "TEST_USER"."ORDERS" t
where "STATUS" is not null
group by
"STATUS"
)
order by val
Rows Row Source Operation
5 SORT GROUP BY {various statistics etc.}
1030000 TABLE ACCESS FULL : {various statistics etc.}
-- Could extract a sample
Solution (a)
http://jonathanlewis.wordpress.com/2009/05/28/frequency-histograms/
declare
    srec        dbms_stats.statrec;
    c_array     dbms_stats.chararray;
    m_distcnt   number;
    m_density   number;
    m_nullcnt   number;
    m_avgclen   number;
begin
    m_distcnt := 5;
    m_density := 0.00001;
    m_nullcnt := 0;
    m_avgclen := 1;
Solution (b)
    c_array     := dbms_stats.chararray('C', 'P', 'R', 'S', 'X');
    srec.bkvals := dbms_stats.numarray (5000, 3, 3, 3, 5000);
    srec.epc    := 5;
    dbms_stats.prepare_column_values(srec, c_array);
    dbms_stats.set_column_stats(
        ownname => user,
        tabname => 'ORDERS',
        colname => 'STATUS',
        distcnt => m_distcnt,
        density => m_density,
        nullcnt => m_nullcnt,
        srec    => srec,
        avgclen => m_avgclen
    );
end;
/
Precision (12c)
11.2.0.3 (four attempts)

ENDPOINT_NUMBER FREQUENCY STATUS
2848 2848 C
2849 1 P
5629 2780 X

2741 2741 C
2742 1 P
2743 1 R
5331 2588 X

2706 2706 C
2708 2 P
5355 2647 X

2852 2852 C
2854 2 P
2856 2 R
2859 3 S
5472 2613 X

12.1.0.0

ENDPOINT_NUMBER FREQUENCY STATUS
529100 529100 C
529400 300 P
529700 300 R
530000 300 S
1030000 500000 X

12c has enhanced the code for the calculation of "approximate NDV", so for a small number of distinct values it can produce an accurate frequency histogram at virtually no extra cost.
Basic Principle
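[Figure: a square grid of hash buckets, numbered 0 to 15 along the top row and 240 to 255 along the bottom row]
The square is a visual aid only.
The number of hash buckets is 2^64 (roughly 1.8 x 10^19).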
Minimising cost
[Figure: the same grid of hash buckets, numbered 0 to 255]
We only keep 16,384 items in the hash table for each column.
We discard half the table each time we reach this limit.
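The "keep 16,384 items, discard half at the limit" idea can be sketched as follows. This is an illustration of the general technique, not Oracle's code: the 64-bit hash (BLAKE2b), the choice of which half to discard (hashes whose next low-order bit is set), and the function names are all assumptions made for the sketch.

```python
import hashlib


def hash64(value):
    """Map a value to a 64-bit hash (stand-in for Oracle's hash function)."""
    digest = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def approximate_ndv(values, limit=16_384):
    """Estimate the number of distinct values while keeping at most
    `limit` hash values in memory."""
    kept = {}    # hash value -> number of rows seen with that hash
    splits = 0   # number of times half the table has been discarded
    for value in values:
        h = hash64(value)
        if h & ((1 << splits) - 1):
            continue  # this hash belongs to a half that was discarded
        kept[h] = kept.get(h, 0) + 1
        while len(kept) >= limit:
            splits += 1
            mask = (1 << splits) - 1
            kept = {k: n for k, n in kept.items() if not k & mask}
    # Each split halves the share of the hash space still represented.
    return len(kept) * 2 ** splits, splits, kept


estimate, splits, kept = approximate_ndv(["C", "P", "R", "S", "X"] * 1000)
print(estimate, splits, sorted(kept.values()))
```

When the limit is never reached (`splits` stays at 0) nothing is discarded, so the table also holds an exact row count for every distinct value. That is consistent with the deck's point that 12c can build an accurate frequency histogram for a small number of distinct values at almost no extra cost.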
Top-Frequency (12c)
If a small set of values accounts for most of the data, Oracle 12c can produce a frequency histogram for the popular values and use an estimate for the rest.
select
skewed, count(*)
from
t1
group by
skewed
order by
skewed
;
SKEWED COUNT(*)
1 4
2 8
3 12
4 16
5 20
6 24
7 28
8 32
9 36
10 116
11 44
12 48
13 52
14 56
15 60
16 64
17 68
18 72
19 76
20 4
If you wanted 18 buckets for this data (840 rows) you could (easily) fit the four least popular values into 1 bucket - leaving just 16 interesting values.
Top-Frequency (12c)
EPV EPN FREQ
1 1
4 17 16
5 37 20
6 61 24
7 89 28
8 121 32
9 157 36
10 273 116
11 317 44
12 365 48
13 417 52
14 473 56
15 533 60
16 597 64
17 665 68
18 737 72
19 813 76
20 814 1
select
endpoint_value epv,
endpoint_number epn,
endpoint_number -
lag(endpoint_number,1) over (
order by endpoint_number) freq
from user_tab_histograms
where table_name = 'T1'
and column_name = 'SKEWED'
order by
endpoint_value
;
(There is still a little flaw)
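The endpoint numbers on this slide can be reproduced with a short sketch. Two things in it are assumptions rather than statements from the deck: the qualifying test (the top n values must cover at least 1 - 1/n of the rows, which is the rule as I understand it from the Oracle documentation) and the way the low and high values are forced in with a frequency of 1, each displacing the least popular remaining value. The second assumption was chosen only because it matches the figures shown.

```python
def top_frequency(counts, buckets):
    """Return (qualifies, rows); rows are (value, endpoint_number, frequency)."""
    total = sum(counts.values())
    ranked = sorted(counts, key=lambda v: (-counts[v], v))
    top = ranked[:buckets]
    coverage = sum(counts[v] for v in top) / total
    qualifies = coverage >= 1 - 1 / buckets   # assumed threshold rule

    kept = {v: counts[v] for v in top}
    low, high = min(counts), max(counts)
    for extreme in (low, high):
        if extreme not in kept:
            # Assumed: the extreme displaces the least popular other value
            # and is recorded with a frequency of 1.
            victim = min((v for v in kept if v not in (low, high)),
                         key=lambda v: (kept[v], v))
            del kept[victim]
            kept[extreme] = 1

    rows, running = [], 0
    for value in sorted(kept):
        running += kept[value]
        rows.append((value, running, kept[value]))
    return qualifies, rows


# Data from the slide: value v appears 4*v times, except 10 (116) and 20 (4).
COUNTS = {v: 4 * v for v in range(1, 20)}
COUNTS[10] = 116
COUNTS[20] = 4

qualifies, rows = top_frequency(COUNTS, 18)
print(qualifies)
for value, epn, freq in rows:
    print(value, epn, freq)
```

The sketch prints a frequency of 1 for the first row, where the query on the slide shows a blank because `lag()` returns null for the first endpoint. Values 1 and 20 are each recorded with a single row although the table holds 4 of each, and values 2 and 3 do not appear at all.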
Too many values (a)
Building a Height-Balanced Histogram
23 23 28 24 29 36 27 13 30 46 43 29 25 20 39 38 20 33 29 35
20 38 27 20 28 29 42 26 19 16 33 26 43 18 19 31 32 35 28 22
40 33 19 34 45 28 42 33 27 38 35 21 35 12 8 59 35 34 31 24
38 33 39 16 35 22 32 38 20 34 18 37 27 29 30 50 33 27 27 15
13 12 31 26 13 35 31 41 44 29 22 30 33 33 43 31 28 32 17 28
Sort
8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59
Too many values (b)
We have 100 items and 37 distinct values.
Assume we are limited to 20 buckets.
After sorting the data we record the value of every 5th row (100 / 20).
8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59
Endpoint values (bucket 0 holds the low value, 8):
8 13 17 19 20 23 26 27 28 29 29 31 32 33 34 35 36 38 41 43 59
Stored with the repeated endpoint (29) recorded only once:
8 13 17 19 20 23 26 27 28 29 31 32 33 34 35 36 38 41 43 59
29 is the only "popular" value, with two buckets (i.e. 10 rows).
All other values are assumed to have (100 - 10) / (37 - 1) = 2.5 rows, which rounds to 3. (10.2.0.4+)
Lots more popular values
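The height-balanced arithmetic on this slide can be replayed in a few lines. The sketch follows the description given here (sort, take every 5th row as an endpoint, call a value popular if it fills more than one endpoint). The function and variable names are hypothetical.

```python
from collections import Counter

# The 100 sorted values from the slide.
SORTED_DATA = [int(x) for x in """
 8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59
""".split()]


def height_balanced(data, buckets):
    """Return the endpoint values and the values that fill 2+ endpoints."""
    data = sorted(data)
    step = len(data) // buckets
    endpoints = [data[row - 1] for row in range(step, len(data) + 1, step)]
    popular = {v: n for v, n in Counter(endpoints).items() if n > 1}
    return endpoints, popular


BUCKETS = 20
endpoints, popular = height_balanced(SORTED_DATA, BUCKETS)
rows_per_bucket = len(SORTED_DATA) // BUCKETS
popular_rows = sum(popular.values()) * rows_per_bucket
ndv = len(set(SORTED_DATA))
non_popular_estimate = (len(SORTED_DATA) - popular_rows) / (ndv - len(popular))

print(endpoints)
print(popular, non_popular_estimate)
print(Counter(SORTED_DATA).most_common(5))
```

The last line lists the five most frequent values in the data. Several of them occur more often than 29 does, yet none of them spans two endpoints.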
Solution (8i - 11g)
Fake it with a frequency histogram.
Pick the 254 most popular values.
Include the low and high values
Fake selectivity for remainder
Needs one entry with double the desired cardinality
Could assign this to the low/high value if introduced
Otherwise change the value with the lowest frequency
Too many values (12c)
8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59
Hybrid Histogram
select
endpoint_number,
endpoint_value,
endpoint_repeat_count
from
user_tab_histograms
where
table_name = 'T1'
;
EPN EPV REP
1 8 1
6 13 3
12 18 2
20 20 5
26 23 2
32 26 3
38 27 6
44 28 6
50 29 6
58 31 5
69 33 8
79 35 7
86 38 5
90 41 1
92 42 2
95 43 3
96 44 1
97 45 1
98 46 1
100 59 1
This looks like an old frequency histogram, but each bucket has a "repeat count" showing how often the highest value appears in the bucket.
Example: the bucket ending at 38 holds 7 rows (86 - 79), and 38 appears 5 times.
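The three columns of the hybrid histogram can be cross-checked against the 100 sorted values. The sketch below does not try to reproduce how Oracle chooses the bucket boundaries. It takes the endpoint values from the slide as given and derives the endpoint number, repeat count and rows per bucket from the data. The names are hypothetical.

```python
from collections import Counter

# The 100 sorted values used throughout this part of the deck.
SORTED_DATA = [int(x) for x in """
 8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59
""".split()]

# Endpoint values (EPV) copied from the slide.
ENDPOINT_VALUES = [8, 13, 18, 20, 23, 26, 27, 28, 29, 31,
                   33, 35, 38, 41, 42, 43, 44, 45, 46, 59]


def describe_buckets(data, endpoint_values):
    """Derive (epn, epv, rep, rows_in_bucket) for each endpoint value."""
    freq = Counter(data)
    buckets, previous_epn = [], 0
    for epv in endpoint_values:
        epn = sum(1 for x in data if x <= epv)   # cumulative row count
        buckets.append((epn, epv, freq[epv], epn - previous_epn))
        previous_epn = epn
    return buckets


for epn, epv, rep, rows_in_bucket in describe_buckets(SORTED_DATA, ENDPOINT_VALUES):
    print(epn, epv, rep, rows_in_bucket)
```

The repeat count is simply the number of rows holding the endpoint value. The difference between rows in the bucket and the repeat count is what is left over for the other values in that bucket (for the bucket ending at 38: 7 - 5 = 2 rows, the values 36 and 37).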
SQL (top-N pt.1)
SQL behind creating a histogram with 18 buckets:
select
    /*+ {lots of hints} */
    to_char(count("VALUE")),
    to_char(substrb(dump(min("VALUE"),16,0,64),1,240)),
    to_char(substrb(dump(max("VALUE"),16,0,64),1,240)),
    count(rowidtochar(rowid))
from
    "TEST_USER"."T1" t /* TOPN,NIL,NIL,RWID,U18U*/

SQL behind basic "approximate NDV" (single column table - 11g):
select
    /*+ {lots of hints} */
    to_char(count("VALUE")),
    to_char(substrb(dump(min("VALUE"),16,0,64),1,240)),
    to_char(substrb(dump(max("VALUE"),16,0,64),1,240))
from
    "TEST_USER"."T1" t /* NDV,NIL,NIL*/
SQL (top-N pt.2)
select /*+ lots of hints */
substrb(dump("VALUE",16,0,64),1,240) val,
rowidtochar(rowid) rwid
from
"TEST_USER"."T1" t
where rowid in (
chartorowid('AAAWaHAAFAAAAEEAAB'),chartorowid('AAAWaHAAFAAAAEEAAC'),
chartorowid('AAAWaHAAFAAAAEEAAD'),chartorowid('AAAWaHAAFAAAAEEAAE'),
chartorowid('AAAWaHAAFAAAAEEAAF'),chartorowid('AAAWaHAAFAAAAEEAAG'),
chartorowid('AAAWaHAAFAAAAEEAAH'),chartorowid('AAAWaHAAFAAAAEEAAI'),
chartorowid('AAAWaHAAFAAAAEEAAJ'),chartorowid('AAAWaHAAFAAAAEEAAK'),
chartorowid('AAAWaHAAFAAAAEEAAL'),chartorowid('AAAWaHAAFAAAAEEAAM'),
chartorowid('AAAWaHAAFAAAAEEAAN'),chartorowid('AAAWaHAAFAAAAEEAAO'),
chartorowid('AAAWaHAAFAAAAEEAAP'),chartorowid('AAAWaHAAFAAAAEEAAQ'),
chartorowid('AAAWaHAAFAAAAEFAAA'),chartorowid('AAAWaHAAFAAAAEFAAB')
)
order by "VALUE"
SQL (hybrid)
select
substrb(dump(val,16,0,64),1,20) ep, freq, cdn, ndv,
(sum(pop) over()) popcnt, (sum(pop * freq) over()) popfreq,
substrb(dump(max(val) over(),16,0,64),1,20) maxval,
substrb(dump(min(val) over(),16,0,64),1,20) minval
from (
select
val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv,
(case when freq > ((sum(freq) over())/15) then 1 else 0 end) pop
from (select /*+ lots of hints */
"VALUE" val, count("VALUE") freq
from
"TEST_USER"."T1" t
where
"VALUE" is not null
group by
"VALUE"
)
)
order by val
/
With only 15 buckets this dataset got a hybrid histogram
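The inner part of this query flags a value as popular when its frequency exceeds the total row count divided by the number of buckets (`freq > sum(freq) / 15`). The sketch below applies the same test in Python. It assumes that T1 holds the 100 sorted values shown on the earlier "Too many values" slides, which the deck does not state explicitly for this query. The function name is hypothetical.

```python
from collections import Counter

# Assumed contents of T1: the 100 values from the "Too many values" slides.
SORTED_DATA = [int(x) for x in """
 8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59
""".split()]


def popularity_summary(data, buckets):
    """Mirror the cdn, ndv, popcnt and popfreq columns of the query."""
    freq = Counter(data)
    cdn = sum(freq.values())          # sum(freq) over()
    ndv = len(freq)                   # count(*) over()
    popular = {v: f for v, f in freq.items() if f > cdn / buckets}
    return {
        "cdn": cdn,
        "ndv": ndv,
        "popcnt": len(popular),             # sum(pop) over()
        "popfreq": sum(popular.values()),   # sum(pop * freq) over()
        "popular": popular,
    }


summary = popularity_summary(SORTED_DATA, 15)
print(summary)
```

With 15 buckets the threshold is 100 / 15, about 6.67 rows, so only values that appear 7 or more times count as popular under this test.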
SQL (old height-balanced)
select
min(minbkt),maxbkt,
substrb(dump(min(val),16,0,32),1,120) minval,
substrb(dump(max(val),16,0,32),1,120) maxval,
sum(rep) sumrep, sum(repsq) sumrepsq, max(rep) maxrep, count(*) bktndv,
sum(case when rep=1 then 1 else 0 end) unqrep
from (
select
val, min(bkt) minbkt, max(bkt) maxbkt,
count(val) rep, count(val)*count(val) repsq
from (
select /*+ lots of hints */
"LN100" val, ntile(200) over (order by "LN100") bkt
from sys.ora_temp_1_ds_616t
where "LN100" is not null
)
group by val
)
group by maxbkt order by maxbkt
Conclusions for 12c
Use auto_sample_size
2,048 buckets is legal
The default is still 254, and it's likely to be adequate
Frequency / Top N histograms: fast and accurate
Hybrid: captures far more popular values, but still samples and is costly
Timing is still important
May still want to create some by code
Recommended