CON2803_PDF_2803_0001

  • 7/23/2019 CON2803_PDF_2803_0001

    The Evolution of Histograms

    Jonathan Lewis
    jonathanlewis.wordpress.com

    www.jlcomp.demon.co.uk

    Title

    2 / 30

    Jonathan Lewis

    2011

    Who am I ?

    Independent Consultant

    28+ years in IT
    24+ using Oracle

    Strategy, Design, Review, Briefings, Educational, Trouble-shooting

    Member of the Oak Table Network
    Oracle ACE Director
    Oracle author of the year 2006
    Select Editor's choice 2007
    UKOUG Inspiring Presenter 2011
    UKOUG Council member 2012
    ODTUG 2012 Best Presenter (d/b)
    O1 visa for USA

    O-1 Visa

    An alien of extraordinary ability

    Highlights

    Why Histograms

    Current mechanisms

    Problems and workarounds

    New mechanisms

    Sample Data (a)

    S    COUNT(*)
    -   ---------
    P      52,352
    C   9,416,360
    O       3,499
    L      86,084

    CODE  DESCRIPTION
    ----  -----------
    A     ASSIGNED
    B     HANDED BACK
    C     CLOSED
    L     LOGGED
    O     HANDED OVER
    P     PENDING

    Other ideas

    Change 'commonest value' to null

    Virtual columns / Function-based indexes

    List partitions

    Standard Strategy

    Frequency histogram with literals in SQL

    Problems

    Coding to take advantage of histogram

    Limit on distinct values

    Resources needed for gathering

    Accuracy of histogram

    Timing of gathering

    Limits (a)

    select
            specifier, count(*)
    from
            messages
    group by
            specifier
    order by
            count(*) desc
    ;

    SPECIFIER    COUNT(*)
    ---------   ---------
    BVGFJB      1,851,177
    LYYVLH        719,582
    MTVMIE        672,823
    YETSDP        659,661
    DAJYGS        504,641
    ...
    KDCFVJ         75,328
    JITCRI         74,104
    DNRYKC         70,029
    BEWPEQ         68,681
    ...
    JXXXRE              1
    OHMNVU              1
    YGOBWQ              1
    UBBWQH              1

    Distinct Specifiers = 352
    Frequency Limit is 254

    Height-balanced less precise

    Popular values use lots of buckets

    Limits (b)

    Interesting arithmetic - for THIS data set

    Top N values   % of data
    ------------   ---------
             140       99.00
             210       99.90
             250       99.98

    Each "bucket" represents roughly 40,000 rows (10M / 254)

    A value with 40,001 rows MIGHT get captured twice

    A value with 79,999 rows MIGHT NOT get captured twice

    In this data set there are 25 values that WILL get captured (ct > 80,001)

    There are 35 values that might be captured one day, and not the next.
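The bucket arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name and dictionary keys are invented for illustration, it uses the slide's round figure of 10M rows, and it treats the histogram as an idealised "endpoint every N rows" structure.

```python
def height_balanced_thresholds(total_rows, buckets=254):
    """Row counts that matter when a height-balanced histogram is built."""
    bucket_rows = total_rows / buckets
    return {
        # rows represented by one bucket
        "bucket_rows": bucket_rows,
        # a value with more rows than this MIGHT land on two endpoints
        "might_be_popular": bucket_rows,
        # a value with at least this many rows spans two endpoints
        # wherever its run of rows happens to start
        "always_popular": 2 * bucket_rows,
    }


limits = height_balanced_thresholds(10_000_000)
for name, rows in limits.items():
    print(f"{name:18s} {rows:12,.0f}")
```

Values whose row counts fall between the two thresholds are the ones that can be captured as popular one day and missed the next.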

    Limits (c)

    12c allows 2,048 buckets

    The default is still 254

    Don't be in a rush to use the maximum

    Don't forget the optstat history tables

    There are several new columns

    There are some new costs

    Precision (a)

    select
            status, count(*)
    from
            orders
    group by
            status
    order by
            status;

    S   COUNT(*)
    -   --------
    C    529,100
    P        300
    R        300
    S        300
    X    500,000

    begin
            dbms_stats.gather_table_stats(
                    ownname          => user,
                    tabname          => 'orders',
                    estimate_percent => dbms_stats.auto_sample_size,
                    method_opt       => 'for columns status size 10'
            );
    end;
    /

    Precision (b)

    select
            endpoint_number,
            endpoint_number - nvl(prev_endpoint,0)          frequency,
            chr(to_number(substr(hex_val, 2,2),'XX'))       status
    from    (
            select
                    endpoint_number,
                    lag(endpoint_number,1) over(
                            order by endpoint_number
                    )       prev_endpoint,
                    to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') hex_val
            from
                    user_tab_histograms
            where
                    table_name  = 'ORDERS'
            and     column_name = 'STATUS'
            )
    order by
            endpoint_number
    /

    http://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/

    Precision (c)

    ENDPOINT_NUMBER  FREQUENCY  STATUS
    ---------------  ---------  ------
               2848       2848  C
               2849          1  P
               5629       2780  X

    ENDPOINT_NUMBER  FREQUENCY  STATUS
    ---------------  ---------  ------
               2741       2741  C
               2742          1  P
               2743          1  R
               5331       2588  X

    ENDPOINT_NUMBER  FREQUENCY  STATUS
    ---------------  ---------  ------
               2706       2706  C
               2708          2  P
               5355       2647  X

    ENDPOINT_NUMBER  FREQUENCY  STATUS
    ---------------  ---------  ------
               2852       2852  C
               2854          2  P
               2856          2  R
               2859          3  S
               5472       2613  X

    Results 11.2.0.3 - four attempts

    Missing values are NOT NICE
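The missing values are what simple sampling arithmetic predicts. The sketch below is an illustration, not Oracle's sampling code: it assumes each row is picked independently, and it takes a sample of about 5,500 rows, which is an assumption read off the final endpoint numbers (5,331 to 5,629) in the four results.

```python
def sample_expectations(value_rows, table_rows, sample_rows):
    """Expected sample count and chance of missing a value entirely,
    assuming every row is sampled independently with the same probability."""
    fraction = sample_rows / table_rows
    expected = value_rows * fraction
    p_missing = (1.0 - fraction) ** value_rows
    return expected, p_missing


# 'P', 'R' and 'S' each have 300 rows out of 1,030,000
expected, p_missing = sample_expectations(300, 1_030_000, 5_500)
print(f"expected rows in sample: {expected:.2f}")
print(f"chance the value is missed: {p_missing:.1%}")
```

With fewer than two rows expected in the sample, each rare status has a substantial chance of vanishing from any given histogram.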

    Basic Cost

    select
            substrb(dump(val,16,0,32),1,120) ep, cnt
    from    (
            select  /*+ lots of hints */
                    "STATUS" val, count(*) cnt
            from    "TEST_USER"."ORDERS" t
            where   "STATUS" is not null
            group by
                    "STATUS"
            )
    order by val

       Rows  Row Source Operation
    -------  ---------------------------------------------------
          5  SORT GROUP BY {various statistics etc.}
    1030000  TABLE ACCESS FULL : {various statistics etc.}

    -- Could extract a sample

    Solution (a)

    declare
            srec            dbms_stats.statrec;
            c_array         dbms_stats.chararray;

            m_distcnt       number;
            m_density       number;
            m_nullcnt       number;
            m_avgclen       number;
    begin
            m_distcnt := 5;
            m_density := 0.00001;
            m_nullcnt := 0;
            m_avgclen := 1;

    http://jonathanlewis.wordpress.com/2009/05/28/frequency-histograms/

    Solution (b)

            -- one entry per status value, with the row count we want the optimizer to see
            c_array     := dbms_stats.chararray('C', 'P', 'R', 'S', 'X');
            srec.bkvals := dbms_stats.numarray (5000, 3, 3, 3, 5000);
            srec.epc    := 5;
            dbms_stats.prepare_column_values(srec, c_array);

            dbms_stats.set_column_stats(
                    ownname => user,
                    tabname => 'ORDERS',
                    colname => 'STATUS',
                    distcnt => m_distcnt,
                    density => m_density,
                    nullcnt => m_nullcnt,
                    srec    => srec,
                    avgclen => m_avgclen
            );
    end;
    /

    Precision (12c)

    11.2.0.3

    ENDPOINT_NUMBER  FREQUENCY  STATUS
    ---------------  ---------  ------
               2848       2848  C
               2849          1  P
               5629       2780  X

               2741       2741  C
               2742          1  P
               2743          1  R
               5331       2588  X

               2706       2706  C
               2708          2  P
               5355       2647  X

               2852       2852  C
               2854          2  P
               2856          2  R
               2859          3  S
               5472       2613  X

    12.1.0.0

    ENDPOINT_NUMBER  FREQUENCY  STATUS
    ---------------  ---------  ------
             529100     529100  C
             529400        300  P
             529700        300  R
             530000        300  S
            1030000     500000  X

    12c has enhanced the code for the calculation of "approximate NDV", so for a small number of distinct values it can produce an accurate frequency histogram at virtually no extra cost.

    Basic Principle

    [Figure: a square grid of hash buckets, corners labelled 0, 15, 240 and 255]

    The square is a visual aid only

    The number of hash buckets is 2^64 (roughly 1.8 x 10^19)

    Minimising cost

    [Figure: a square grid of hash buckets, corners labelled 0, 15, 240 and 255]

    We only keep 16,384 items in the hash table for each column.

    We discard half the table each time we reach this limit.
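The "keep 16,384 hashes, discard half when full" idea can be sketched in Python. This is a generic illustration of the technique, not Oracle's implementation: SHA-256 truncated to 64 bits stands in for whatever hash function Oracle uses, and the names hash64 and approximate_ndv are invented for the example.

```python
import hashlib


def hash64(value):
    """Map a value to a 64-bit integer (stand-in for the real hash function)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def approximate_ndv(values, max_items=16_384):
    """Estimate the number of distinct values in one pass with bounded memory."""
    kept = set()   # distinct hash values currently retained
    splits = 0     # how many times half of the hash space has been discarded
    for value in values:
        h = hash64(value)
        if h & ((1 << splits) - 1):
            continue              # hash falls in a discarded part of the space
        kept.add(h)
        while len(kept) > max_items:
            splits += 1           # discard (about) half of what we hold
            mask = (1 << splits) - 1
            kept = {x for x in kept if not x & mask}
    # each retained hash stands for 2**splits distinct values
    return len(kept) * 2 ** splits, splits


print(approximate_ndv(list("CPRSX") * 100))   # few distinct values: no split needed
print(approximate_ndv(range(100_000)))        # many distinct values: an estimate
```

While the number of distinct values stays below the limit nothing is discarded, so the count is exact. That is the case where an accurate frequency histogram can fall out of the same pass.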

    Top-Frequency (12c)

    If a small set of values accounts for most of the data,
    Oracle 12c can produce a frequency histogram for
    the popular values and use an estimate for the rest.

    select
            skewed, count(*)
    from
            t1
    group by
            skewed
    order by
            skewed
    ;

    SKEWED  COUNT(*)
    ------  --------
         1         4
         2         8
         3        12
         4        16
         5        20
         6        24
         7        28
         8        32
         9        36
        10       116
        11        44
        12        48
        13        52
        14        56
        15        60
        16        64
        17        68
        18        72
        19        76
        20         4

    If you wanted 18 buckets for this data (840 rows)
    you could (easily) fit the four least popular values
    into 1 bucket - leaving just 16 interesting values
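Whether this data qualifies for a top-frequency histogram can be checked with the sketch below. The rule it applies, that the N most popular values must cover at least (1 - 1/N) of the rows, is taken from Oracle's 12c documentation rather than from the slide, and the function name is invented for illustration. The counts are the ones listed above.

```python
def top_frequency_check(counts, buckets):
    """Return (rows covered by the top-N values, threshold, qualifies?)."""
    total = sum(counts.values())
    covered = sum(sorted(counts.values(), reverse=True)[:buckets])
    threshold = (1 - 1 / buckets) * total
    return covered, threshold, covered >= threshold


# skewed value -> row count, as listed on the slide
counts = {v: 4 * v for v in range(1, 20)}
counts[10] = 116
counts[20] = 4

total_rows = sum(counts.values())
covered, threshold, qualifies = top_frequency_check(counts, 18)
print(total_rows, covered, round(threshold, 1), qualifies)
```

The 18 most popular values leave out only the two 4-row values, so they comfortably clear the threshold for an 18-bucket histogram.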

    Top-Frequency (12c)

    EPV  EPN  FREQ
    ---  ---  ----
      1    1
      4   17    16
      5   37    20
      6   61    24
      7   89    28
      8  121    32
      9  157    36
     10  273   116
     11  317    44
     12  365    48
     13  417    52
     14  473    56
     15  533    60
     16  597    64
     17  665    68
     18  737    72
     19  813    76
     20  814     1

    select
            endpoint_value                  epv,
            endpoint_number                 epn,
            endpoint_number -
                    lag(endpoint_number,1) over (
                            order by endpoint_number
                    )                       freq
    from    user_tab_histograms
    where   table_name  = 'T1'
    and     column_name = 'SKEWED'
    order by
            endpoint_value
    ;

    (There is still a little flaw)

    Too many values (a)

    Building a Height-Balanced Histogram

    23 23 28 24 29 36 27 13 30 46 43 29 25 20 39 38 20 33 29 35
    20 38 27 20 28 29 42 26 19 16 33 26 43 18 19 31 32 35 28 22
    40 33 19 34 45 28 42 33 27 38 35 21 35 12  8 59 35 34 31 24
    38 33 39 16 35 22 32 38 20 34 18 37 27 29 30 50 33 27 27 15
    13 12 31 26 13 35 31 41 44 29 22 30 33 33 43 31 28 32 17 28

    Sort

     8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
    21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
    28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
    32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
    37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59

    Too many values (b)

    We have 100 items and 37 distinct values.

    Assume we are limited to 20 buckets

    After sorting the data we record the value of every 5th row. (100/20)

    8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20

    21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28

    28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32

    32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36

    37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59

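    Bucket endpoints (the low value 8, then every 5th row):

    8 | 13 17 19 20 23 26 27 28 29 29 31 32 33 34 35 36 38 41 43 59

    Stored endpoints (xx marks the first of the two 29s, which is not stored separately):

    8 | 13 17 19 20 23 26 27 28 xx 29 31 32 33 34 35 36 38 41 43 59

    29 is the only "popular" value with two buckets (i.e. 10 rows).

    All other values are assumed to have (100 - 10) / (37 - 1) = 2.5 rows, rounded to 3. (10.2.0.4+)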

    Lots more popular values
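The construction described on this slide can be reproduced in Python from the sorted data. This is a sketch of the slide's description (endpoint every 5th row, a value is "popular" if it appears as an endpoint more than once), not Oracle's internal code, and the names are invented for illustration.

```python
from collections import Counter

# The 100 sorted values from the slide
SORTED_DATA = [int(x) for x in """
 8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59
""".split()]


def height_balanced(sorted_values, buckets):
    """Return (endpoints including the low value, popular values, rows per bucket)."""
    step = len(sorted_values) // buckets
    endpoints = [sorted_values[0]]
    endpoints += [sorted_values[i * step - 1] for i in range(1, buckets + 1)]
    hits = Counter(endpoints[1:])
    popular = {value: n for value, n in hits.items() if n > 1}
    return endpoints, popular, step


endpoints, popular, step = height_balanced(SORTED_DATA, 20)
popular_rows = sum(n * step for n in popular.values())
ndv = len(set(SORTED_DATA))
other_rows = (len(SORTED_DATA) - popular_rows) / (ndv - len(popular))

print(endpoints)
print(popular, other_rows)
```

The raw data shows why the slide ends with "Lots more popular values": 33 has 8 rows and 35 has 7, yet neither lands on two endpoints.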

    Solution (8i - 11g)

    Fake it with a frequency histogram.

    Pick the 254 most popular values.

    Include the low and high values

    Fake selectivity for remainder

    Needs one entry with double the desired cardinality

    Could assign this to the low/high value if introduced

    Otherwise change the value with the lowest frequency
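The recipe above can be sketched as a small Python function that produces the numbers you would then feed to dbms_stats.set_column_stats. Two things here are assumptions rather than statements from the slide: the optimizer is taken to estimate a value missing from a frequency histogram at half the frequency of the least popular entry (which is why one entry needs double the desired cardinality), and the "desired cardinality" is taken to be the average row count of the values left out. The function name and sample data are hypothetical.

```python
def fake_frequency_histogram(counts, max_buckets=254):
    """counts: value -> rows. Returns value -> frequency for a faked histogram."""
    low, high = min(counts), max(counts)
    must_keep = [low] if low == high else [low, high]

    # most popular values first, after reserving room for the low and high values
    ranked = sorted(counts, key=counts.get, reverse=True)
    others = [v for v in ranked if v not in must_keep]
    kept = must_keep + others[:max_buckets - len(must_keep)]
    hist = {v: counts[v] for v in kept}

    left_out = [v for v in counts if v not in hist]
    if left_out:
        desired = sum(counts[v] for v in left_out) / len(left_out)
        # the entry with the lowest frequency carries double the desired figure
        # (this sketch does not re-check that it is still the lowest afterwards)
        anchor = min(hist, key=hist.get)
        hist[anchor] = 2 * desired
    return hist


sample = {1: 100, 2: 50, 3: 40, 4: 6, 5: 4, 6: 2}
faked = fake_frequency_histogram(sample, max_buckets=4)
print(faked)
```

In this toy example the high value 6 had to be introduced, so it is the entry that carries the doubled figure, which matches the slide's suggestion.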

    Too many values (12c)

    8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20

    21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28

    28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32

    32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36

    37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59

    Hybrid Histogram

    select
            endpoint_number,
            endpoint_value,
            endpoint_repeat_count
    from
            user_tab_histograms
    where
            table_name = 'T1'
    ;

    EPN  EPV  REP
    ---  ---  ---
      1    8    1
      6   13    3
     12   18    2
     20   20    5
     26   23    2
     32   26    3
     38   27    6
     44   28    6
     50   29    6
     58   31    5
     69   33    8
     79   35    7
     86   38    5
     90   41    1
     92   42    2
     95   43    3
     96   44    1
     97   45    1
     98   46    1
    100   59    1

    This looks like an old frequency histogram, but
    each bucket has a "repeat count" showing how
    often the highest value appears in the bucket.

    (Row 86 / 38 / 5: 7 rows in the bucket, 38 appears 5 times)
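The figures in the table can be unpacked with a short Python sketch: the bucket size is the difference between consecutive endpoint numbers, and the repeat count says how many of those rows hold the endpoint value itself. The function and key names are invented for illustration, and the tuples are the (EPN, EPV, REP) rows shown above.

```python
HYBRID = [
    (1, 8, 1), (6, 13, 3), (12, 18, 2), (20, 20, 5), (26, 23, 2),
    (32, 26, 3), (38, 27, 6), (44, 28, 6), (50, 29, 6), (58, 31, 5),
    (69, 33, 8), (79, 35, 7), (86, 38, 5), (90, 41, 1), (92, 42, 2),
    (95, 43, 3), (96, 44, 1), (97, 45, 1), (98, 46, 1), (100, 59, 1),
]


def describe_buckets(rows):
    """Turn cumulative endpoint numbers into per-bucket row counts."""
    described = {}
    previous = 0
    for epn, epv, rep in rows:
        size = epn - previous
        described[epv] = {
            "bucket_rows": size,        # all rows in the bucket
            "endpoint_rows": rep,       # rows equal to the endpoint value
            "other_rows": size - rep,   # rows for lower values in the bucket
        }
        previous = epn
    return described


buckets = describe_buckets(HYBRID)
print(buckets[38])
```

Unlike the height-balanced version, bucket sizes vary here, because a bucket is allowed to grow until it has taken in every row for its endpoint value.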

    SQL (top-N pt.1)

    SQL behind creating a histogram with 18 buckets

    select
            /*+ {lots of hints} */
            to_char(count("VALUE")),
            to_char(substrb(dump(min("VALUE"),16,0,64),1,240)),
            to_char(substrb(dump(max("VALUE"),16,0,64),1,240)),
            count(rowidtochar(rowid))
    from
            "TEST_USER"."T1" t  /* TOPN,NIL,NIL,RWID,U18U*/

    SQL behind basic "approximate NDV" (single column table - 11g)

    select
            /*+ {lots of hints} */
            to_char(count("VALUE")),
            to_char(substrb(dump(min("VALUE"),16,0,64),1,240)),
            to_char(substrb(dump(max("VALUE"),16,0,64),1,240))
    from
            "TEST_USER"."T1" t  /* NDV,NIL,NIL*/

    SQL (top-N pt.2)

    select  /*+ lots of hints */
            substrb(dump("VALUE",16,0,64),1,240) val,
            rowidtochar(rowid) rwid
    from
            "TEST_USER"."T1" t
    where   rowid in (
            chartorowid('AAAWaHAAFAAAAEEAAB'),chartorowid('AAAWaHAAFAAAAEEAAC'),
            chartorowid('AAAWaHAAFAAAAEEAAD'),chartorowid('AAAWaHAAFAAAAEEAAE'),
            chartorowid('AAAWaHAAFAAAAEEAAF'),chartorowid('AAAWaHAAFAAAAEEAAG'),
            chartorowid('AAAWaHAAFAAAAEEAAH'),chartorowid('AAAWaHAAFAAAAEEAAI'),
            chartorowid('AAAWaHAAFAAAAEEAAJ'),chartorowid('AAAWaHAAFAAAAEEAAK'),
            chartorowid('AAAWaHAAFAAAAEEAAL'),chartorowid('AAAWaHAAFAAAAEEAAM'),
            chartorowid('AAAWaHAAFAAAAEEAAN'),chartorowid('AAAWaHAAFAAAAEEAAO'),
            chartorowid('AAAWaHAAFAAAAEEAAP'),chartorowid('AAAWaHAAFAAAAEEAAQ'),
            chartorowid('AAAWaHAAFAAAAEFAAA'),chartorowid('AAAWaHAAFAAAAEFAAB')
    )
    order by "VALUE"

    SQL (hybrid)

    select
            substrb(dump(val,16,0,64),1,20) ep, freq, cdn, ndv,
            (sum(pop) over()) popcnt, (sum(pop * freq) over()) popfreq,
            substrb(dump(max(val) over(),16,0,64),1,20) maxval,
            substrb(dump(min(val) over(),16,0,64),1,20) minval
    from    (
            select
                    val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv,
                    (case when freq > ((sum(freq) over())/15) then 1 else 0 end) pop
            from    (
                    select  /*+ lots of hints */
                            "VALUE" val, count("VALUE") freq
                    from
                            "TEST_USER"."T1" t
                    where
                            "VALUE" is not null
                    group by
                            "VALUE"
                    )
            )
    order by val
    /

    With only 15 buckets this dataset got a hybrid histogram
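The popularity test in the query, freq > sum(freq) / 15, can be replayed in Python. The sketch assumes T1 holds the 100-row data set from the "Too many values" slides; that link is an assumption, as are the function and key names, which simply mirror the column aliases in the SQL.

```python
from collections import Counter

# Assumed contents of T1: the 100 values from the earlier slides
DATA = [int(x) for x in """
 8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59
""".split()]


def popularity_summary(values, buckets):
    """Mirror the cdn / ndv / popcnt / popfreq columns of the query."""
    freq = Counter(values)
    cdn = sum(freq.values())                 # total non-null rows
    popular = sorted(v for v, f in freq.items() if f > cdn / buckets)
    return {
        "cdn": cdn,
        "ndv": len(freq),
        "popcnt": len(popular),
        "popfreq": sum(freq[v] for v in popular),
        "popular": popular,
    }


summary = popularity_summary(DATA, 15)
print(summary)
```

With 15 buckets a value needs more than 100 / 15 (about 6.7) rows to count as popular, so only the values with 7 or more rows pass the test.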

    SQL (old height-balanced)

    select
            min(minbkt), maxbkt,
            substrb(dump(min(val),16,0,32),1,120) minval,
            substrb(dump(max(val),16,0,32),1,120) maxval,
            sum(rep) sumrep, sum(repsq) sumrepsq, max(rep) maxrep, count(*) bktndv,
            sum(case when rep=1 then 1 else 0 end) unqrep
    from    (
            select
                    val, min(bkt) minbkt, max(bkt) maxbkt,
                    count(val) rep, count(val)*count(val) repsq
            from    (
                    select  /*+ lots of hints */
                            "LN100" val, ntile(200) over (order by "LN100") bkt
                    from    sys.ora_temp_1_ds_616t
                    where   "LN100" is not null
                    )
            group by val
            )
    group by maxbkt
    order by maxbkt

    Conclusions for 12c

    Use auto_sample_size

    2,048 buckets is legal

    The default is still 254, and it's likely to be adequate

    Frequency / Top N histograms: fast and accurate

    Hybrid: captures far more popular values, but still samples and is costly

    Timing is still important

    May still want to create some by code