55
Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology [email protected]

Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology [email protected]

Embed Size (px)

Citation preview

Page 1: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Tuning the Optimizer Statistics

>Eric MinerSenior EngineerData Server [email protected]

Page 2: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

There are Two Kinds of Optimizer Statistics

> Table/Index level- describes a table and its index(es)

> Page/row counts, cluster ratios, deleted and forwarded rows

> Some are updated dynamically as DML occurs

• page/ row counts, deleted rows, forwarded rows, cluster ratios

> Stored in systabstats

> Column level - describes the data to the optimizer

> Histogram (distribution), density values, default selectivity values

> Static, need to be updated or written directly

> Stored in sysstatistics

> This presentation deals with the column level statistics

Page 3: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

What Are These Column Level Statistics Used For?

> The Histogram Values

> Describes the distribution of values in the column

> Belongs to a column, not an index

> Used in costing SARGs

> A step is the point in the column where a value is read to obtain a ‘boundary value’

> A cell represents the rows that fall between two steps

> Each cell has a weight which is the fraction of rows in the column it represents - read as a percentage of all rows

> There are approximately the same number of rows in each cell - except Frequency count cells

Page 4: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Some Quick Definitions

Range cell density: 0.0037264745412389 Total density: 0.3208892191740000 Range selectivity: default used (0.33) In between selectivity: default used (0.25) Histogram for column: “A"Column datatype: integerRequested step count: 20Actual step count: 10 Step Weight Value 1 0.00000000 <= 154143023 2 0.05263700 <= 154220089 3 0.05245500 <= 171522361 4 0.00000000 < 800000000 5 0.34489399 = 800000000 6 0.04968300 <= 859388217 7 0.00000000 < 860000000

Page 5: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

What Are These Column Level Statistics Used For? cont.

> A cell can represent either a single value or multiple values of the column

> Range cell - more than one value 2 0.05263982 <= 31423 3 0.05263316 <= 63045

• All values between 31424 and 63045 are in cell 3

> Frequency count cell - only one value (very accurate) 5 0.00000000 < 170016 6 0.15908001 = 170016 7 0.05263815 <= 201861 8 0.10263316 = 201862 9 0.05264576 <= 317462

• Cells 6 and 8 represent only one value

> Cell (step) 1 represents the NULL values in the column

Page 6: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Statistics On Inner Columns of Composite Indexes

Stats on inner columns of composite indexes

Think of a composite index as a 3D object, columns with statistics are transparent, those without statistics are opaque

> Columns with statistics give the optimizer a clearer picture of an index – sometimes good, sometimes not

> This is a fairly common practice

> Does add maintenance

> update index statistics most commonly used to do this

update index statistics tab_name [ind_name]

Page 7: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Statistics On Inner Columns of Composite Indexes cont.

Index on columns E and B – No statistics on column bselect * from TW4 where E = "yes" and b >= 959789065 and id >= 600000 and F > "May 14, 2002“ and A_A = 959000000

Beginning selection of qualifying indexes for table TW4',varno = 0, objectid 464004684. The table (Allpages) has 1000000 rows, 24098 pages,

Estimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436.

No statistics available for B,using the default range selectivity to estimate selectivity.

Estimated selectivity for B, selectivity = 0.330000.

Page 8: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Statistics On Inner Columns of Composite Indexes cont.

The best qualifying index is ‘E_B' (indid 7)costing 49264 pages, with an estimate of 191rows to be returned per scan of the table

FINAL PLAN (total cost = 481960):

varno=0 (TW4) indexid=0 ()path=0xfbccc120 pathtype=sclausemethod=NESTED ITERATION

Table: TW4 scan count 1, logical reads:(regular=24098apf=0 total=24098)physical reads: (regular=16468 apf=0 total=16468), apf IOs used=0

Page 9: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Statistics On Inner Columns of Composite Indexes cont.

Statistics are now on column BEstimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436.

Estimated selectivity for B, selectivity = 0.022199, upper limit = 0.074835.

The best qualifying index is ‘E_B' (indid 7)costing 3317 pages,with an estimate of 13 rows tobe returned per scan of the table

FINAL PLAN (total cost = 55108):

varno=0 (TW4) indexid=7 (E_B)path=0xfbd1da08 pathtype=sclausemethod=NESTED ITERATION

Table: TW4 scan count 1, logicalreads:(regular=4070 apf=0 total=4070),physical reads: (regular=820 apf=0 total=820),

Page 10: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Statistics On Non-Indexed Columns and Joins

Stats on non-indexed columns

Can’t help with index selection but can affect join ordering

> Columns with statistics give the optimizer a clearer picture of the column – no hard coded assumptions have to be used

> When costing joins of non-indexed columns having statistics may result in better plans than using the default values

> Without statistics there will be no Total density or histogram that the optimizer can use to cost the column in the join

> Yes, in some circumstances histograms can be used in costing joins – if there is a SARG on the joining column and that column is also in the join table then the SARG from the joining table can be used to filter the join table

> If there is no SARG on the join column or on the joining column the Total density value (with stats) or the default value (w/o stats) will be used

Page 11: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Statistics On Non-Indexed Columns and Joins cont.

SARG exampleselect ....from TW1, TW4where TW1.A = TW4.A and TW1.A = 10

Selecting best index for the JOIN CLAUSE: TW4.A = TW1.A TW4.A = 10Estimated selectivity for a, selectivity = 0.003726,upper limit = 0.049683. Histogram values used

select ....from TW1, TW4where TW1.A = TW4.A and TW1.B = 10

Selecting best index for the JOIN CLAUSE: TW4.A = TW1.A

Estimated selectivity for a, selectivity = 0.320889. Total density value used

Page 12: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Statistics On Non-Indexed Columns and Joins - Example

select * from TW1,TW2 where TW1.A=TW2.A and TW1.A =805975090 A simple join with a SARG on the join column of one table

Table TW2 column A has no statistics, TW1 column A doesSelecting best index for the JOIN CLAUSE: (for TW2.A)

TW2.A = TW1.A TW2.A = 805975090 Inherited from SARG on TW1

But, can’t help…no statsEstimated selectivity for A, selectivity = 0.100000.

The best qualifying access is a table scan, costing 13384 pages, with an estimate of 50000 rows to be returned per scan of the table, using no data prefetch (size 2K I/O), in data cache 'default data cache' (cacheid 0) with MRU replacementJoin selectivity is 0.100000.

Inherited SARG from other table doesn’t help in this case

Page 13: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Statistics On Non-Indexed Columns and Joins – Example cont.

Without statistics on TW2.A the plan includes a reformat with TW1 as the outer table

FINAL PLAN (total cost = 2855774):

varno=0 (TW1) indexid=2 (A_E_F)path=0xfbd46800 pathtype=sclausemethod=NESTED ITERATION

varno=1 (TW2) indexid=0 ()path=0xfbd0bb10 pathtype=joinmethod=REFORMATTING

> Not the best plan – but the optimizer had little to go on

Page 14: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Statistics On Non-Indexed Columns and Joins – Example cont.

Table TW2 column A now has statistics. The inherited SARG on TW1.A can now be used to help

filter the join on TW2.ASelecting best index for the JOIN CLAUSE:

TW2.A = TW1.A TW2.A = 805975090

Estimated selectivity for A, selectivity = 0.001447, upper limit = 0.052948.

The best qualifying access is a table scan, costing 13384 pages, with an estimate of 724 rows to be returned per scan of the table, using no data prefetch (size 2K I/O), in data cache 'default data cache' (cacheid 0) with MRU replacement

Join selectivity is 0.001447.

Page 15: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Statistics On Non-Indexed Columns and Joins – Example cont.

With statistics on TW2.A reformatting is not used and the join order has changed

FINAL PLAN (total cost = 1252148):

varno=1 (TW2) indexid=0 ()path=0xfbd0b800 pathtype=sclausemethod=NESTED ITERATION

varno=0 (TW1) indexid=2 (A_E_F)path=0xfbd46800 pathtype=sclausemethod=NESTED ITERATION

Page 16: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

The Effects of Changing the Number of Steps (Cells)

The Number of Cells (steps) Affects SARG Costing –

As the Number Of Steps Changes Costing Does Too

Cell weights and Range cell density are used in costing SARGs

> Cell weight is used as column’s ‘upper limit’ Range cell density is used as ‘selectivity’ for Equi-SARGs – as seen in 302 output

> Result(s) of interpolation is used as column ‘selectivity’ for Range SARGs

> Increasing the number of steps narrows the average cell width, thus the weight of Range cells decreases

> Can also result in more Frequency count cells and thus change the Range cell density value

> More cells means more granular cells

Page 17: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

The Effects of Changing the Number of Steps (Cells) cont.

Average cell width = # of rows/(# of requested steps –1)

> Table has 1 million rows, requested 20 steps -

> 1,000,000/19 = 52,632 rows per cell

> 1,000,000/199 = 5,025 rows per cell

> What does this mean?

> As you increase the number of steps (cells) they become narrower – representing fewer values

> We’ll see that this has an effect on how the optimizer estimates the cost of a SARG

> “update statistics ……. using X values“create index ….. using X values

Page 18: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

The Effects of Changing the Number of Steps (Cells) cont.

Changing the number of steps – effects on Equi-SARGs

select A from TW2 where B = 842000000

With 20 cells (steps) in the histogramRange cell density: 0.0012829768785739

9 0.05263200 <= 82556933710 0.05264200 <= 842084405

SARG value falls into cell 10Estimated selectivity for B, selectivity = 0.001283, upper limit = 0.052642. Range cell weight of density qualifying cell

Page 19: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

The Effects of Changing the Number of Steps (Cells) cont.

With 200 cells (steps) in the histogramRange cell density: 0.0002303825911991 77 0.00507200 <= 839463989 78 0.00506000 <= 842019895

> SARG value falls into cell 78

Estimated selectivity for B,

selectivity = 0.000230, upper limit = 0.005060.

In this case more cells result in a lower estimated selectivity

> Increasing the number of steps has decreased the average width and lowered the Range cell density and the average cell weight.

> Range cell density decreased because Frequency count cells appeared in the histogram

Page 20: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

The Effects of Changing the Number of Steps (Cells) cont.

Changing the number of steps – effects on Range SARGs -select * from TW2 where B between 825570000 and 830000000

With 20 cells (steps) in the histogram

Range cell density: 0.0012829768785739

9 0.05263200 <= 82556933710 0.05264200 <= 842084405

> SARG values fall into cell 10

Estimated selectivity for B,

selectivity = 0.014121, upper limit = 0.052642.

> Here ‘selectivity’ is the product of interpolation, ‘upper limit’ is the weight of the qualifying cell.

> Interpolation estimates how much of cell will qualify for the range SARG

Page 21: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

The Effects of Changing the Number of Steps (Cells) cont.

select * from TW2 where B between 825570000 and 830000000

With 200 cells (steps) in the histogram

Range cell density: 0.0002303825911991 67 0.00505200 <= 825505843 68 0.00503000 <= 825570611 69 0.00508000 <= 825635378 70 0.00504000 <= 825690418 71 0.00506400 <= 825702450 72 0.00503200 <= 825767218 73 0.00510200 <= 825831945 74 0.00425800 <= 825833785 75 0.00598400 <= 839462921

Estimated selectivity for B, selectivity = 0.029624, upper limit = 0.034606.

> The SARG values now span multiple cells> Interpolation estimates amount of cells 68 and 75 to use since not

all of those two cells qualify

Page 22: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Some Statistics Related FAQs cont.

How many steps should I request?

> It will depend on your data and your queries

> Increase requested steps to get Frequency count cells when there are highly duplicated values

> FC only represents one value - very accurate weight

> Range SARGs will estimate what portion of a cell qualifies for the SARG

> More cells means narrower cells (represent fewer values)

> Narrower cells mean more accurate estimates

> Can have an affect on equi-SARGs - lower selectivity

Page 23: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Removing Statistics Can Effect Query Plans

Sometimes no statistics are better then having them

This will usually be an issue when very dense columns are involved

Histogram for column: “E" Step Weight Value 1 0.00000000 < "no" 2 0.47256401 = "no" 3 0.00000000 < "yes" 4 0.52743602 = "yes“

This can also show up when you have ‘spikes’ (Frequency count cells) in the distribution

Page 24: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Removing Statistics Can Effect Query Plans cont.

select count(*) from TW4 where E = “yes” and C = 825765940

The table…has 1000000 rows, 24098 pages,

Estimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436.

Estimating selectivity of index ‘E_AA_B', indid 6scan selectivity 0.52743602,filter selectivity 0.527436 527436 rows, 174107 pages

The best qualifying index is ‘E_AA_B' (indid 6) costing 174107 pages, with an estimate of 526 rows

FROM TABLE TW4 Nested iteration. Table Scan.

Page 25: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Removing Statistics Can Effect Query Plans cont.

delete statistics TW4(E)

Estimated selectivity for E, selectivity = 0.100000.

Estimating selectivity of index ‘E_AA_B', indid 6scan selectivity 0.100000,filter selectivity 0.100000 100000 rows, 20584 pages

The best qualifying index is ‘E_AA_B (indid 6) costing 20584 pages, with an estimate of 92 rows

FROM TABLE TW4 Nested iteration. Index : E_AA_B Forward scan. Positioning by key.

Page 26: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Maintaining Tuned Statistics

Tuned statistics will add to your maintenance

Any statistical value you write to sysstatistics either via optdiag or sp_modifystats will be overwritten by update statistics

> Keep optdiag input files for reuse

> If needed get an optdiag output file, edit it and read it in

> Keep scripts that run sp_modifystats

> Rewrite tuned statistics after running update statistics that affects the column with the modified statistics

Page 27: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Monitoring Table/Index Level Fragmentation Using The Statistics

Can Be Both An Optimizer and Space Management Concern

The more fragmentation the less efficient page reads are

> Deleted rows – fewer rows per page, affects costing

> Forwarded rows – 2 I/O each, optimizer adds to costing

> Empty data/leaf pages – more reads may be necessary

> Clustering can get worse

> Watch the DPCR of the table or APL clustered index

> In general the Cluster Ratios are not a good indicator of fragmentation since they are often normally low

> Use optdiag outputs to monitor these values

Page 28: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Monitoring Table/Index Level Fragmentation Using The Statistics cont.

> ASE 12.0 and above check the ‘Space utilization’ value

> ‘Large I/O efficiency’ is another value to watchEmpty data page count: 0Forwarded row count: 0.0000000000000000Deleted row count: 0.0000000000000000

Derived statistics: Data page cluster ratio: 0.9994653835872761 Space utilization: 0.9403543288085808 Large I/O efficiency: 1.0000000000000000

> ‘Space utilization’ and ‘Large I/O efficiency’ are not used by the optimizer

> The further from 1 the more fragmentation there is

Page 29: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Maintaining the Statistics

When data changes the statistics become out of date

In general up to date statistics are needed to get the best query plans

> Statistics are usually updated using update statistics commands

> The more statistics you have the more maintenance

> It’s a trade off between the gain in query performance and the increased statistics maintenance

> There’s no point in updating statistics if the table is static

Page 30: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Update Statistics

> Update statistics has been extended to allow for placement of statistics on columns

> update statistics table_name (col_name)

> update index statistics table_name [ind_name]

> update all statistics table_name

> Specify the requested number of steps (cells) to use when building the column’s histogram

> update statistics table_name (col_name) using 200 values

Page 31: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

How Update Statistics Works

Column and table/index values have to be read in order to gather the statistics

> What does it do?

> Reads the column to gather information for density and histogram, writes the column level statistics

> While reading the column it gathers index/table level statistics – row & page count, forwarded rows, deleted rows, the cluster ratios, etc.

> Takes a sample value every X rows for a histogram boundary value - (based on the number of rows and requested steps)

• If same value for multiple steps save it to make an FC

Page 32: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

How Update Statistics Works cont.

> Values have to be in sorted order for statistics gathering

> If it’s the leading column of an index no sort is necessary

> Just scan index leaf for statistics

> If not the leading column of an index - create a worktable, read values in, sort and scan for statistics

update statistics tab_name (col_name)- a table scan will be done to read the column

update index statistics (ind_name)- then only an index scan (with a sort of the inner columns)

> The sort is done in a worktable in tempdb.

update index and update all statistics will use a lot of tempdb space unless sampling is used

Page 33: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Some Statistics Related Myths & Legends

“Update statistics will result in improved performance”

> Only guarantees up to date statistics

> Due to distribution statistics may not give a ‘pretty’ picture of the column

“Always use update all statistics”

> Rarely need statistics on all columns of a table

> Can take a VERY long time to run, makes maintenance difficult at best

> Should consider adding stats to composite index columns

Page 34: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Statistics Tools

> Some useful tools for working with the statistics

> Some are by Sybase some are by users

> Optdiag - read, write and simulate the statistics

> Well known and documented

> sp_modifystats - make modifications to density values (more functionality coming soon - 11.9.2.4, 12.0.0.4, 12.5)

> sp__optdiag (that’s a double underscore) -

> by Kevin Sherlock

> Displays the statistics ala optdiag output - very handy

> http://www.sypron.nl/download.html

Page 35: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Sampling for Update Statistics

A new feature in 12.5.0.3

Designed to dramatically reduce the time it takes to update statistics – can dramatically speed up the running of update statistics

> ‘Opens’ your maintenance window

> Decreases the cost of using ASE

Randomly selected pages are read instead of reading all pages to gather the column level statistics – less I/O

> The percentage of pages to be sampled can be specified

update statistics tab_name with sampling = X percent

> X is the percentage of pages you want to sample

• Can be between 1 and 100

Page 36: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Definitions

> Column Level Statistics – those statistics that describe the values in the column to the optimizer – an attribute of a colum (i.e.; the histogram and density values)

> Sampling – randomly reading rows from a specified percentage (subset) of pages rather than all pages of the table in order to gather column level statistics

> Sampling Rate – the specified percentage of pages to read

> Full Scan – to gather statistics by reading all pages of the object (table or index)

> Major Attribute of an Index – the ‘leading’ column of an index as listed in the create index command

Page 37: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Sampling for Update Statistics cont.

> Unofficial tests show that a sampling rate of 10% on a 1 million row numeric column reduces the time for update statistics to run from 9 minutes to 30 seconds

Page 38: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Sampling for Update Statistics cont.

> The Resulting histogram will be based on the values that are sampled

• It will differ from a histogram obtained from a ‘full scan’ update statistics

• The lower the specified percentage of sampling the more the histogram will differ from a full scan histogram

• Test your queries against sampled statistics. In most cases you won’t see any major changes

• Density values not updated by sampling

> In most cases this won’t be an issue.

Page 39: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Why Sampling for Update Statistics?

> As datasets have grown the time it takes to run update statistics has also grown – Dramatically!!

> This became more of an issue with ‘update index statistics’ introduced in 11.9.x due to extra sort in worktable

> TCO and auto-tuning/admin require a faster way to run update statistics

> Without a faster update statistics neither efforts would succeed

> Speeding up update statistics is a long standing Customer feature request

> Random page sampling is the most I/O efficient method

> Dramatically decreased the time to run update statistics

Page 40: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Why Sampling for Update Statistics? cont.

> Some time test results – Not official, not for general release

> ‘your mileage may vary’

> Timings are from tests run by Sybase QA

> 1 million row int colum – timings based on ‘elapsed time’

20% sampling rate – Full scan time :2465850Sampling time : 398783Percentage of savings time(elapsed time):83%

10% sampling rate –Full scan time :2139013Sampling time : 153130Percentage of savings time(elapsed time):92% > Variations in full scan time are taken into account

Page 41: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

How Does It Work?

> Specify the percentage of pages to read via update statistics

> ‘with sampling = X percent’• Percent value can be between 1 and 100

> ‘with’ extensions must follow ‘using’ – • with sampling = x percent and/or with consumers = x must follow

using X values

update statistics authors(auth_id) using 40 values with percent = 10

> Sampling reads all rows from each page read

> Row values are moved to the worktable to be sorted and the statistics gathered

> This saves tempdb space since the sampled sets of values are smaller than if the whole column was read into the worktable

Page 42: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

How Does It Work? cont.

Specific update statistics syntax and their affects

update statistics table_name [index_name] with sampling = X percent

> Will full scan index pages to update/create statistics on the major attribute(s) of the specified index or all indexes

on the table ignoring the specified sampling rate – sampling will not be done

Page 43: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

How Does It Work? cont.

update index statistics tab_name [ind_name] with sampling = X percent

> Will full scan index pages to update/create statistics for the major attribute(s) of the indexes or specified index on the table, ignoring sampling.

> For minor index attributes will use sampling to scan the requested percentage of pages, read those values into a worktable, sort and gather statistics from there.

> The space used in tempdb will decrease as the sampling rate decreases

update statistics tab_name (col_name) with sampling = X percent

> Will use sampling to update/create statistics for the specified column using the specified sampling rate. This applies to all columns whether major attributes of an index or not

> Will not affect multi-column density values

Page 44: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

How Does It Work? cont.

update all statistics table_name with sampling = X percent

> Will full scan index pages to gather statistics for the major attribute of all indexes – will not use sampling on these columns

> Will use sampling to gather statistics for all columns that are not the major attribute of an index

> The space used in tempdb will decrease as the sampling rate decreases

Page 45: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

How Does It Work? cont.

> Sampling is not used for create index

> Since a full scan is required to build an index there is no additional cost for building the statistics

Page 46: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Trade Offs

> A sampled set of anything is not as accurate as examining the most effective sampling rate for a given dataset

> A histogram created with sampling is not likely to match a histogram created via a full scan

> Histogram boundary values will vary> Cell weights will vary> Minimum and maximum histogram boundary values will vary

> Since cell weight(s) and Range cell density are used to cost all SARGs a histogram from a sampled set will have an affect on SARG costing

> Variations in the upper and lower histogram values may result in ‘out of bounds costing’ by the optimizer

> The smaller the sampling rate the greater the variance is likely to be

Page 47: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Trade Offs cont.

> If there are existing density values they will not be overwritten. If there are no density values a default value of 0.100000 will be used for both Range cell and Total density values

> There is currently no information saved about the use of sampling (whether or not it was used and the sampling rate)

> Different cell types may appear

> As the sampling rate decreases it is possible that Frequency count and/or Range cells may appear where they didn’t exist prior to sampling

> The same pages will be resampled if the dataset is static and the same sampling rate used

Page 48: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Examples of Variations in the Histogram

Full scan histogram -

Step Weight Value

1 0.00000000 <= 154218543

2 0.05315000 <= 805909305

3 0.05305000 <= 808793353

4 0.05311000 <= 822687028

5 0.05304000 <= 825700873

6 0.05314000 <= 839464505

7 0.05292000 <= 842544649

8 0.05305000 <= 858863369

<edited>

20 0.04621000 <= 960051465> Note boundary values, cell weights and the upper and lower boundary values> Variations within the histogram are the main issue that needs to be tested

Page 49: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Examples of Variations in the Histogram cont.

10% sampled histogram -

Step Weight Value

1 0.00000000 <= 154218799

2 0.05253968 <= 805909300

3 0.05253968 <= 808728585

4 0.05253968 <= 822686772

5 0.05269841 <= 825636617

6 0.05349206 <= 839464498

7 0.05253968 <= 842543113

8 0.05253968 <= 858797321

<edited>

20 0.04888889 <= 960050979

> Note variations in the boundary values, cell weights and the upper and lower boundary values

Page 50: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Tuning and Troubleshooting

> “Trial-and-error” testing/tuning will need to be done to determine the most optimal sampling rate for a given dataset

> In most cases variations in the statistics will have no affectIn other cases small variations may change query plans

> There is no ‘rule of thumb’ on what sampling rate to use

> In some cases the same sampling rate may be fine across all or most tables/columns.

> In some cases sampling may not result in efficient plans

> Use showplan and traceon 302/310 outputs to track changes to the query plan as the sampling rate changes

> Using sample queries get above outputs from statistics gathered by a full scan. Update statistics with the sampling rate, rerun query and compare outputs

Page 51: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Tuning and Troubleshooting cont.

>Use optdiag to monitor changes to the histogram

>Check optdiag of full scan histogram for upper and lower boundary values these can be inserted into the histogram if needed

>Keep a copy of optdiag output file as a backup of statistics in case old values need to be reloaded

Page 52: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Future Enhancements

> This first implementation of sampling will require some enhancements

> ‘Scale’ density values gathered by sampling so that they are more accurate

> Track the min/max values in the column in order to maintain the upper and lower boundary values of the histogram

> Sampling index pages

> Will help decrease the time of running update statistics even further

> Add a mechanism to record if sampling was used and what sampling rate was last used

> Add this information to optdiag and traceon 302 (and future optimizer diagnostics)

Page 53: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Where To Get More Information

> The Sybase Customer newsgroups

> http://support.sybase.com/newsgroups

> The Sybase list server

> [email protected]

> The external Sybase FAQ

> http://www.isug.com/Sybase_FAQ/

> Join the ISUG, ISUG Technical Journal, feature requests

> http://www.isug.com

Page 54: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Where To Get More Information

> The latest Performance and Tuning Guide

> Don’t be put off by the ASE 12.0 in the title, it covers the 11.9.2 features/functionality too

> http://sybooks.sybase.com/onlinebooks/group-as/asg1200e

> Any “What’s New” docs for a new ASE release

> Tech Docs at Sybase Support

> http://techinfo.sybase.com/css/techinfo.nsf/Home

> Upgrade/Migration help page

> http://www.sybase.com/support/techdocs/migration

Page 55: Tuning the Optimizer Statistics >Eric Miner Senior Engineer Data Server Technology eric.miner@sybase.com

Sybase Developer Network (SDN)

Additional Resources for Developers/DBAs

> Single point of access to developer software, services, and up-to-date technical information:

> White papers and documentation

> Collaboration with other developers and Sybase engineers

> Code samples and beta programs

> Technical recordings

> Free software