THE STORK CORRELATION
USE & ABUSE OF STATISTICS IN CAPACITY PLANNING

Denise P. Kalm
R&D Sr. Product Specialist
BMC Software, Inc.

“Statistical analysis – Mysterious, sometimes bizarre manipulations performed upon the collected data of an experiment in order to obscure the fact that the results have no generalizable meaning for humanity. Commonly, computers are used, lending an aura of unreality to the proceedings.”

Agenda

Why a stork?
Tools of the trade
Getting the terminology right
My favorite statistics
Lies, damned lies and telling your manager what he wants to know
Summary

The Stork Correlation

In a small Welsh town, there was a .95 correlation between the arrival of storks and the arrival of babies.

Why?

The Stork Correlation

There was also a 1.0 correlation between the dates fishermen were home from the sea and the likely dates of conception.

Statistical Abuse

Correlation is the most misused statistic
Standard deviation and mean rank 2nd and 3rd
Ignorance of statistics leads to career-limiting recommendations
Statistics can be your best friend, once you understand them
But… It is more art than science.

Why Me?

Background
Training
Experience

Why You?

Determining the significance of changes
Not confusing correlation with cause-and-effect
Saving time on problem resolution
Theory testing

Big Caveats

Only use statistics with like-minded individuals. Managers typically only understand average and percentiles.

When statistics don’t appear to be working for you, check out statistics that do not require the assumption of normality.

Tools of the Trade

SAS/SAS Graph
SPSS
Statistical calculator
Excel
Brute force with the equations (not recommended, but possible)

Definitions

Sample & population
Normality
Outlier
Mean, median, mode & percentile
Standard deviation & variance
Misc. terms

“Statistics is a systematic method for getting the wrong conclusion with 95% confidence.”

Sample & Population

Population – all the data for the period of time studied, e.g., every RMF/SMF data record for an hour

Sample – a random selection of all data points/records available.

Normal Distribution

A distribution which describes many situations where observations are distributed symmetrically around the mean. About 68% of all values under the curve lie within one standard deviation of the mean and about 95% lie within two standard deviations.
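The 68% and 95% figures can be checked numerically. The sketch below is only an illustration: it relies on the standard identity that, for a normal distribution, the fraction of values within k standard deviations of the mean is erf(k / sqrt(2)). The function name `fraction_within` is made up for this example.

```python
import math


def fraction_within(k: float) -> float:
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    # Standard identity for the normal distribution: P(|X - mean| < k * SD) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {fraction_within(k):.1%}")
```

If measured data put far less than two thirds of the points within one standard deviation of the mean, that is a hint the data may not be normal.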

Central Limit Theorem

As sample size increases, the distribution of the sample mean approaches a normal distribution, where the mean = the mean of the population and the standard deviation equals the standard deviation of the population divided by the square root of the sample size.

More samples, better data.
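A small simulation can make the theorem concrete. This is a sketch under stated assumptions: the "population" is invented (exponentially distributed values with mean 1.0, deliberately not normal), the seed is fixed only for repeatability, and `sample_means` is a helper defined here for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical, clearly non-normal population: exponential values with mean 1.0.
population = [random.expovariate(1.0) for _ in range(50_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)


def sample_means(sample_size: int, trials: int = 2_000) -> list[float]:
    """Draw `trials` random samples of the given size and return the mean of each."""
    return [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(trials)
    ]


for n in (4, 25, 100):
    means = sample_means(n)
    observed_sd = statistics.stdev(means)
    predicted_sd = pop_sd / n ** 0.5  # population SD divided by sqrt(sample size)
    print(f"n={n:3d}  mean of sample means={statistics.fmean(means):.3f}  "
          f"SD observed={observed_sd:.3f}  SD predicted={predicted_sd:.3f}")
```

The sample means cluster around the population mean, and their spread shrinks roughly as the square root of the sample size grows.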

Formulas

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty$$

where µ is the mean, σ is the standard deviation, e is the base of the natural logarithm, sometimes called Euler's e (2.71...), and π is the constant Pi (3.14...)


Outlier

Outlier - A point that, because of observation noise, does not follow the characteristics of the input (or desired response) data.

“There are liars, outliers, and out-and-out liars.”

Mean

Arithmetic Mean – numeric average of all the data.

$$\bar{X} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

where $\bar{X}$ is the mean and n is the number of data points.

Assumes normality
Affected by outliers
Plot data to understand

Plot to see meaning of “mean”

[Figure: frequency plot (y-axis: Frequency, 0 to 60; x-axis: values from 0.2 to 14.8) with the Mean and the Median/Mode marked.]

Median and Mode

Median – middle value, where half the values lie on each side of the median, when they are ordered by value.
Mode – most frequently observed value. If no repeats, there is no mode value.
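A short illustration of how the three measures react to a single outlier. The response times below are invented for the example; only the standard-library `statistics` functions are real.

```python
import statistics

# Hypothetical response times in seconds; the last value is an outlier.
response_times = [0.2, 0.3, 0.3, 0.3, 0.4, 0.4, 0.5, 9.0]

mean = statistics.fmean(response_times)     # pulled upward by the outlier
median = statistics.median(response_times)  # middle of the ordered values
mode = statistics.mode(response_times)      # most frequently observed value

print(f"mean={mean:.3f}  median={median:.3f}  mode={mode:.3f}")

# Dropping the outlier moves the mean a lot and the median very little.
trimmed = response_times[:-1]
print(f"without outlier: mean={statistics.fmean(trimmed):.3f}  "
      f"median={statistics.median(trimmed):.3f}")
```

One point out of eight is enough to put the mean well above every other observation, which is why plotting the data before quoting an average is worth the effort.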

Percentile & Percentage Change

Percentile – group data by putting an equal number of data points into each group. Ex. 95th percentile – 95% of values are less than x.
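One simple way to compute a percentile is the nearest-rank method, sketched below. This is only one of several definitions in use (tools such as SAS or Excel may interpolate and give slightly different answers), and both the helper name and the data are hypothetical.

```python
import math


def percentile_nearest_rank(data, p):
    """Smallest observed value such that at least p percent of the data is at or below it."""
    ordered = sorted(data)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank in the ordered data
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in seconds (20 observations, one outlier).
response_times = [0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8,
                  0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 1.8, 2.4, 9.0]

p95 = percentile_nearest_rank(response_times, 95)
print(f"95th percentile: {p95} s")
```

Unlike the mean, the 95th percentile here is not dragged upward by the single 9.0-second observation.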

Percentage Change = (after value – before value) / before value

Risk of using percentage change
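The risk is easiest to see by putting the percentage next to the absolute change. A minimal sketch with made-up response times:

```python
def percentage_change(before: float, after: float) -> float:
    """(after value - before value) / before value, as a fraction."""
    return (after - before) / before


# Hypothetical before/after response times in seconds.
cases = {
    "fast transaction": (0.2, 0.1),
    "slow transaction": (4.0, 2.0),
}

for name, (before, after) in cases.items():
    change = percentage_change(before, after)
    print(f"{name}: {change:+.0%} change, {after - before:+.1f} s absolute")
```

Both cases report the same 50% improvement, but one saves a tenth of a second and the other saves two seconds. Quoting the percentage alone hides that difference.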

Standard Deviation & Variance

Standard Deviation – square root of the variance. For normal data, 2/3 of the data points are within 1 SD of the mean on either side.

Variance – amount of “spread” of the data around the mean:

$$s^2 = \frac{(x_1 - \bar{X})^2 + (x_2 - \bar{X})^2 + \dots + (x_n - \bar{X})^2}{n - 1}$$

where $\bar{X}$ is the mean, $x_1 \dots x_n$ are the data points, and n is the number of samples
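The variance formula translates directly into code. The sketch below writes it out by hand and compares it with the standard library's `statistics.variance`, which uses the same n-1 divisor; the CPU figures are invented for illustration.

```python
import math
import statistics


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


# Hypothetical %CPU busy readings for six intervals.
cpu_busy = [52.0, 55.0, 61.0, 48.0, 58.0, 56.0]

variance = sample_variance(cpu_busy)
std_dev = math.sqrt(variance)  # standard deviation is the square root of the variance

print(f"variance={variance:.2f}  SD={std_dev:.2f}")
print(f"library check: variance={statistics.variance(cpu_busy):.2f}  "
      f"SD={statistics.stdev(cpu_busy):.2f}")
```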

Standard Deviation of a Sample

If the SD is large, you need to inspect your sampling method. This may indicate suspect data, poor interval choices, etc.

My Favorite Statistics

Linear Regression
Correlation

“In ancient times, they had no statistics, so they had to fall back on lies.” - Stephen B. Leacock

Linear Regression

Linear Regression – describing the relationship between two data elements, by fitting a straight line to the data.
Ex. X = transaction rate, Y = %CPU utilization

Y = bX + C, where X and Y are the variables, b is the slope of the line and C is the point where the line intercepts the y-axis.
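A minimal sketch of fitting the line with ordinary least squares, the usual method behind a regression tool. The helper `fit_line` and the transaction-rate and CPU figures are hypothetical; in practice a statistics package would do this calculation.

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept c for the line y = b*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    c = mean_y - b * mean_x  # y-axis intercept
    return b, c


# Hypothetical measurements: X = transactions per second, Y = %CPU utilization.
tran_rate = [10, 20, 30, 40, 50]
cpu_pct = [14.8, 25.1, 35.3, 44.9, 54.9]

b, c = fit_line(tran_rate, cpu_pct)
print(f"%CPU = {b:.2f} * rate + {c:.2f}")
```

For this made-up data the slope is about 1% CPU per transaction per second and the intercept is about 5%, which could be read as the overhead present even with no transactions running.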

Linear Regression

Good Candidate for Regression

© 2001 BMC SOFTWARE, INC.

[Figure: "Predict Impact of Change: Potential Impact of Increasing Volume". % Proc (0 to 100) for PRD1 and PRD2 across BASELINE and PLAN1 through PLAN4.]

Bad Candidate for Regression


[Figure: "Predict Impact of Change: Impact on Response Time". Response time in seconds (0 to 2.5) across BASELINE and PLAN1 through PLAN4, broken down by BILLWEB component: Page Service, Page Wait, I/O Wait, CPU Service, I/O Service, CPU Wait.]

Gotchas

Make sure relating the variables makes sense.
Plot data when not sure of the relationship (scatter plot)
Do not throw out outliers until you are sure of why they occurred
Do not commit linear “progression”

Correlation

Correlation coefficient - r measures the degree (and direction) of the relationship between two variables. r = 1.00 indicates a perfect positive correlation; r = 0.0 means there is no linear relationship at all. A negative r means that as one variable increases, the other decreases. R², the square of r, measures only the strength of the relationship and is never negative.

Correlation is NOT cause and effect.

Though there may be a causal relationship between two variables, you cannot infer it from a correlation analysis.

A third factor may really be causing the correlation.

Don’t calculate it by hand – use a tool.

Use your brain to interpret the results.
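The "third factor" point can be illustrated with simulated data. In this sketch, two metrics are each generated from a hidden driver (number of users) plus independent noise, so neither causes the other, yet their correlation is high. Everything here is an assumption made for the example: the metric names, the coefficients, the noise levels, and the `pearson_r` helper, which implements the standard Pearson correlation coefficient.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical third factor: logged-on users in each of 200 intervals.
users = [random.uniform(100, 1000) for _ in range(200)]

# Both metrics depend on users plus independent noise; neither depends on the other.
cpu_pct = [0.08 * u + random.gauss(0, 5) for u in users]
io_rate = [1.5 * u + random.gauss(0, 100) for u in users]

print(f"r(cpu_pct, io_rate) = {pearson_r(cpu_pct, io_rate):.2f}")
print(f"r(users, cpu_pct)   = {pearson_r(users, cpu_pct):.2f}")
print(f"r(users, io_rate)   = {pearson_r(users, io_rate):.2f}")
```

A tool will happily report the strong correlation between CPU and I/O rate. Recognizing that user load is behind both takes knowledge of the system, not more arithmetic.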

“A statistician is someone who is skilled at drawing a precise line from an unwarranted assumption to a foregone conclusion.”


How to Lie With Statistics

Statisticulation

“Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital.” - Aaron Levenstein

Why Lie?

Outliers make your data look bad
You are trying to comply with a performance clause
You are too busy writing the great American novel to do your job
Your manager wouldn’t understand anyway

Averaging Averages

Why do it?
Most performance data is already averaged, so it is easier
Makes response times look better in most cases
Smooths out all variability
Mostly eliminates outliers, particularly in plotting data
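A sketch of why averaging averages flatters response time. The interval data is invented: two quiet intervals with good response times and one busy interval with poor response time. The honest figure weights each interval's average by its transaction count.

```python
# Hypothetical interval summaries: (transaction count, average response time in seconds).
intervals = [
    (100, 0.2),     # quiet interval
    (100, 0.3),     # quiet interval
    (10_000, 2.0),  # peak interval, where almost all the work ran
]

# Averaging the averages treats every interval as equally important.
average_of_averages = sum(avg for _, avg in intervals) / len(intervals)

# Weighting by transaction count gives the average a transaction actually experienced.
total_transactions = sum(count for count, _ in intervals)
weighted_average = sum(count * avg for count, avg in intervals) / total_transactions

print(f"average of averages: {average_of_averages:.2f} s")
print(f"weighted average:    {weighted_average:.2f} s")
```

With these numbers the average of averages comes out at well under half the weighted figure, even though nearly every transaction ran during the slow interval.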

Using Percentage Change

Why do it?
To exaggerate the benefit of a performance change. Ex. RT decreased 50% going from 0.2 to 0.1.
To justify a processor upgrade. Ex. Doubling application volume will increase its CPU demand 100% (even when the CPU demand was very small)

To impress or terrify

Small Sample Size

Why do it?
SAS jobs run faster
Large, randomly obtained data doesn’t give the right results; a small, selected window does
You don’t really have any data and have to invent some

Stupid Graph Tricks

Why do it?
To make your data look better

How to do it
Log functions on one axis – to diminish the impact of a change. Or just use different orders of magnitude for x and y axes
Select graph type (pie, line, stacked bar) which best misleads your audience
Eliminate actual metrics so you can draw the line to reflect your reality
Put time on the wrong axis
Eliminate all legends, data tables, etc.

Invalid Metrics

How to do it
Use your own definitions. Ex. Typical CICS tran = non-browsing, non-batch work
Use multiple decimal places to lend an air of precision to the data. Good with small or unreliable sample or poor capture ratio.
Compare apples to oranges. Ex. Compare performance after tuning using a period of low demand to compare to a “before” of high demand
Add percentage changes together. Ex. If volume changes cause a 10% inc. in DB2, a 15% inc. in CICS and a 20% increase in batch, that’s 45%.
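The last trick, adding percentage changes together, can be checked with a few lines of arithmetic. The 10%, 15% and 20% increases come from the example above; the baseline CPU shares for each workload are hypothetical, chosen only to show the calculation.

```python
# Hypothetical baseline CPU consumption per workload, in % of the processor.
baseline_cpu = {"DB2": 20.0, "CICS": 30.0, "batch": 25.0}

# Increases from the example: 10% for DB2, 15% for CICS, 20% for batch.
increase = {"DB2": 0.10, "CICS": 0.15, "batch": 0.20}

before = sum(baseline_cpu.values())
after = sum(cpu * (1 + increase[name]) for name, cpu in baseline_cpu.items())

overall_change = (after - before) / before  # the real change in total CPU demand
naive_sum = sum(increase.values())          # the "45%" obtained by adding percentages

print(f"total CPU before: {before:.1f}%  after: {after:.1f}%")
print(f"actual overall increase: {overall_change:.1%}")
print(f"sum of the percentages:  {naive_sum:.0%}")
```

Each percentage applies only to its own workload's base, so the overall increase is a weighted average of the three and can never exceed the largest of them (20% here), let alone reach 45%.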

Correlation Abuse

How to do it

Select two metrics that aren’t usually related (I/O response time and file size), draw a correlation and justify a memory upgrade.

Most people don’t know performance metrics well enough to challenge you.

Another Common Lie

Linear progression – forecasting the line past the data points you have

Unless you are sure the relationship between two variables is linear, do not attempt this. Even mostly linear relationships (such as CPU vs. volume) may go non-linear at near-saturation.
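A sketch of how linear progression goes wrong near saturation. The response-time model used here, R = S / (1 - U) for service time S and utilization U, is a textbook single-server queueing approximation brought in purely as an assumption for illustration; it is not taken from this presentation, and the service time and utilization values are invented.

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept c for the line y = b*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return b, mean_y - b * mean_x


SERVICE_TIME = 0.1  # seconds; hypothetical


def response_time(utilization):
    # Assumed single-server queueing approximation: R = S / (1 - U).
    return SERVICE_TIME / (1.0 - utilization)


# "Measurements" taken only at low-to-moderate utilization, where the curve looks nearly straight.
observed_util = [0.1, 0.2, 0.3, 0.4, 0.5]
observed_rt = [response_time(u) for u in observed_util]

b, c = fit_line(observed_util, observed_rt)

target = 0.9  # forecast well past the observed range
linear_forecast = b * target + c
modeled = response_time(target)

print(f"linear forecast at {target:.0%} busy: {linear_forecast:.2f} s")
print(f"queueing model at {target:.0%} busy:  {modeled:.2f} s")
```

The straight line fits the observed points closely, yet at 90% utilization it predicts less than a third of the response time the assumed model gives. A good fit inside the data range says nothing about behavior outside it.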

What Can Go Wrong

What you think might happen

What might really be happening

What We Didn’t Cover

Hypothesis testing – valuable if you want to see how likely it is that your theory matches reality. Is the change in the data due to chance, or did you really make a difference?

Chi-square
T-test

When you don’t have enough information about the data (population) or about cause-and-effect relationships

Summary

Turn data into information by applying statistics and your knowledge.

Practice “safe performance analysis” and protect your job.

CYA

“Numbers are like people; torture them enough and they’ll tell you anything.”

References

Huff “How to Lie with Statistics” (illustrated by Geis)
Dixon and Massey “Introduction to Statistical Analysis”
Gonick & Smith “The Cartoon Guide to Statistics”
Sziede “Statistics for the Algebraically Challenged”
Munoz “Sampling Issues in the Collection of Performance Data” CMG2002

Questions?

Denise P. Kalm
[email protected]
BMC Software, Inc.
