THE STORK CORRELATION
USE & ABUSE OF STATISTICS IN CAPACITY PLANNING
Denise P. Kalm
R&D Sr. Product Specialist
BMC Software, Inc.
“Statistical analysis – Mysterious, sometimes bizarre manipulations performed upon the collected data of an experiment in order to obscure the fact that the results have no generalizable meaning for humanity. Commonly, computers are used, lending an aura of unreality to the proceedings.”
Agenda
Why a stork?
Tools of the trade
Getting the terminology right
My favorite statistics
Lies, damned lies and telling your manager what he wants to know
Summary
The Stork Correlation
In a small Welsh town, there was a .95 correlation between the arrival of storks and the arrival of babies.
Why?
The Stork Correlation
There was also a 1.0 correlation between the dates fishermen were home from the sea and the likely dates of conception.
Statistical Abuse
Correlation is the most misused statistic
Standard deviation and mean rank 2nd and 3rd
Ignorance of statistics leads to career-limiting recommendations
Statistics can be your best friend, once you understand them
But… it is more art than science.
Why Me?
Background
Training
Experience
Why You?
Determining the significance of changes
Not confusing correlation with cause-and-effect
Saving time on problem resolution
Theory testing
Big Caveats
Only use statistics with like-minded individuals. Managers typically only understand average and percentiles.
When statistics don’t appear to be working for you, check out statistics that do not require the assumption of normality.
Tools of the Trade
SAS/SAS Graph
SPSS
Statistical calculator
Excel
Brute force with the equations (not recommended, but possible)
Definitions
Sample & population
Normality
Outlier
Mean, median, mode & percentile
Standard deviation & variance
Misc. terms
“Statistics is a systematic method for getting the wrong conclusion with 95% confidence.”
Sample & Population
Population – all the data for the period of time studied, e.g., every RMF/SMF data record for an hour
Sample – a random selection from all the data points/records available.
Normal Distribution
A distribution which describes many situations where observations are distributed symmetrically around the mean. About 68% of all values under the curve lie within one standard deviation of the mean and about 95% lie within two standard deviations.
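The 68% and 95% figures can be checked directly from the normal curve. The following is a minimal Python sketch using only the standard library; the function name `fraction_within` is illustrative and not part of the original slides.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    # For a normal distribution, P(|x - mean| < k * SD) = erf(k / sqrt(2))
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.4f}")
```

The results round to roughly 68%, 95% and 99.7%. These fractions hold only when the data really are close to normal.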
Central Limit Theorem
As sample size increases, the distribution of the sample mean approaches a normal distribution whose mean equals the mean of the population and whose standard deviation equals the standard deviation of the population divided by the square root of the sample size.
More samples, better data.
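The theorem can be seen in a small simulation. This is a sketch under stated assumptions: the "population" is hypothetical, made of exponentially distributed values chosen because they are clearly not normal, and the names `population` and `sample_means` are illustrative.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical, deliberately non-normal population (exponential, mean about 1.0)
population = [random.expovariate(1.0) for _ in range(20_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

def sample_means(n, trials=2000):
    """Means of `trials` random samples, each containing n data points."""
    return [statistics.fmean(random.sample(population, n)) for _ in range(trials)]

for n in (4, 25, 100):
    means = sample_means(n)
    # Compare the observed spread of the sample means with pop_sd / sqrt(n)
    print(n, round(statistics.fmean(means), 3),
          round(statistics.stdev(means), 3),
          round(pop_sd / n ** 0.5, 3))
```

As n grows, the spread of the sample means shrinks toward the population standard deviation divided by the square root of n, while their average stays near the population mean.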
Formulas
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty$$
where
µ is the mean
σ is the standard deviation
e is the base of the natural logarithm, sometimes called Euler's e (2.71...)
π is the constant Pi (3.14...)
Outlier
Outlier - A point that, because of observation noise, does not follow the characteristics of the input (or desired response) data.
“There are liars, outliers, and out-and-out liars.”
Mean
Arithmetic Mean – numeric average of all the data.
$$X = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
where n is the number of data points.
Assumes normality
Affected by outliers
Plot data to understand
Plot to see meaning of “mean”
[Figure: frequency histogram (x-axis values from 0.2 to 14.8, y-axis frequency from 0 to 60) with the Mean and the Median/Mode marked]
Median and Mode
Median – middle value, where half the values lie on each side of the median, when they are ordered by value.
Mode – most frequently observed value. If no repeats, there is no mode value.
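A short Python sketch shows how a single outlier pulls the mean away from the median and mode. The response times are hypothetical values invented for illustration.

```python
import statistics

# Hypothetical response times in seconds; the last value is an outlier
response_times = [0.2, 0.3, 0.3, 0.3, 0.4, 0.4, 0.5, 9.0]

mean = statistics.mean(response_times)      # pulled upward by the 9.0 outlier
median = statistics.median(response_times)  # middle of the ordered values
mode = statistics.mode(response_times)      # most frequently observed value

print(f"mean={mean:.3f} median={median:.2f} mode={mode}")
```

Here the mean is about 1.4 seconds even though seven of the eight values are 0.5 seconds or less, which is why plotting the data before quoting a mean is worthwhile.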
Percentile & Percentage Change
Percentile – rank the data and divide it into groups with an equal number of data points in each. Ex. 95th percentile – the value x such that 95% of values are less than x.
Percentage Change = (after value – before value) / before value
Risk of using percentage change
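The percentile definition above can be sketched in Python. This uses the nearest-rank method, which is only one of several percentile definitions in common use, so tools such as Excel or SAS may return slightly different values. The function name and data are illustrative.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times (seconds) for 20 transactions: 0.1, 0.2, ..., 2.0
times = [0.1 * i for i in range(1, 21)]

p95 = percentile(times, 95)
p50 = percentile(times, 50)
print(f"95th percentile: {p95:.1f}s, 50th percentile: {p50:.1f}s")
```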
Standard Deviation & Variance
Standard Deviation – square root of the variance. For normal data, 2/3 of the data points are within 1 SD of the mean on either side.
Variance – amount of “spread” of the data around the mean:
$$S^2 = \frac{(x_1 - X)^2 + (x_2 - X)^2 + \dots + (x_n - X)^2}{n - 1}$$
where X is the mean, each x_i (from x_1 to x_n) is a data point, and n is the number of samples.
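The variance formula translates almost line for line into Python. This is a minimal sketch; the CPU busy percentages are hypothetical.

```python
import math

def sample_variance(data):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)

# Hypothetical CPU busy percentages from five intervals
cpu_busy = [40, 45, 50, 55, 60]

variance = sample_variance(cpu_busy)
sd = math.sqrt(variance)  # standard deviation is the square root of the variance
print(f"variance={variance} sd={sd:.2f}")
```

The standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.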
Standard Deviation of a Sample
If the SD is large, you need to inspect your sampling method. This may indicate suspect data, poor interval choices, etc.
My Favorite Statistics
Linear Regression
Correlation
“In ancient times, they had no statistics, so they had to fall back on lies.” - Stephen B. Leacock
Linear Regression
Linear Regression – describing the relationship between two data elements, by fitting a straight line to the data.
Ex. X = transaction rate, Y = %CPU utilization
Y = bX + C, where X and Y are the variables, b is the slope of the line and C is the point where the line intercepts the y-axis.
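A least-squares fit of Y = bX + C can be sketched in a few lines of Python. The measurements are hypothetical and the function name `fit_line` is illustrative; in practice a statistics package would do this for you.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b*x + c; returns (b, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    c = mean_y - b * mean_x  # y-axis intercept
    return b, c

# Hypothetical measurements: transaction rate (X) vs. %CPU utilization (Y)
tran_rate = [10, 20, 30, 40, 50]
cpu_pct = [15, 24, 36, 45, 55]

b, c = fit_line(tran_rate, cpu_pct)
print(f"%CPU = {b:.2f} * rate + {c:.2f}")
```

With these numbers the slope is about 1.01 and the intercept about 4.7, so each additional transaction per second costs roughly one percentage point of CPU within the measured range.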
Linear Regression
Good Candidate for Regression
© 2001 BMC SOFTWARE, INC.
[Figure: Predict Impact of Change. Potential Impact of Increasing Volume: % Proc (0 to 100) for PRD1 and PRD2 across BASELINE, PLAN1, PLAN2, PLAN3 and PLAN4]
Bad Candidate for Regression
[Figure: Predict Impact of Change. Impact on Response Time: Secs (0 to 2.5) across BASELINE, PLAN1, PLAN2, PLAN3 and PLAN4, broken down into BILLWEB Page Service, Page Wait, I/O Wait, CPU Service, I/O Service and CPU Wait]
Gotchas
Make sure relating the variables makes sense.
Plot data when not sure of the relationship (scatter plot)
Do not throw out outliers until you are sure of why they occurred
Do not commit linear “progression”
Correlation
Correlation coefficient – r measures the degree (and direction) of the linear relationship between two variables. r = 1.00 indicates a perfect positive correlation; r = 0.0 means there is no linear relationship at all. A negative r means that as one variable increases, the other decreases, down to r = -1.00 for a perfect negative correlation. R2, the square of r, ranges from 0 to 1 and cannot be negative.
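A minimal Python sketch of the Pearson correlation coefficient, usually written r, shows how the sign follows the direction of the relationship. The data are hypothetical and the function name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: volume vs. %CPU busy, and volume vs. %CPU free
volume = [10, 20, 30, 40, 50]
cpu_pct = [15, 24, 36, 45, 55]
free_pct = [100 - c for c in cpu_pct]

r_pos = pearson_r(volume, cpu_pct)   # close to +1: both rise together
r_neg = pearson_r(volume, free_pct)  # close to -1: one rises as the other falls
print(f"r_pos={r_pos:.3f} r_neg={r_neg:.3f} r_squared={r_pos ** 2:.3f}")
```

Squaring either value gives the same number between 0 and 1, which is why the squared form carries no information about direction.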
Correlation is NOT cause and effect.
Though there may be a causal relationship between two variables, you cannot infer it from a correlation analysis.
A third factor may really be causing the correlation.
Don’t calculate it by hand – use a tool.
Use your brain to interpret the results.
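The point about a third factor can be illustrated with a small simulation. This is a sketch built on invented data: a hypothetical `season` variable drives both the stork and baby counts, which have no direct link to each other.

```python
import math
import random

random.seed(42)  # fixed seed so the run is repeatable

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical third factor (e.g. time of year) that drives both series
season = list(range(100))
storks = [s + random.uniform(-5, 5) for s in season]
babies = [2 * s + random.uniform(-5, 5) for s in season]

r_raw = pearson_r(storks, babies)

# Subtract the known effect of the third factor from each series
storks_resid = [x - s for x, s in zip(storks, season)]
babies_resid = [y - 2 * s for y, s in zip(babies, season)]
r_resid = pearson_r(storks_resid, babies_resid)

print(f"raw r={r_raw:.3f}, r after removing the third factor={r_resid:.3f}")
```

The raw correlation is very strong, yet once the shared driver is removed what remains is only noise. The statistic alone cannot tell you which situation you are in.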
“A statistician is someone who is skilled at drawing a precise line from an unwarranted assumption to a foregone conclusion.”
How to Lie With Statistics
Statisticulation
“Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital.” - Aaron Levenstein
Why Lie?
Outliers make your data look bad
You are trying to comply with a performance clause
You are too busy writing the great American novel to do your job
Your manager wouldn’t understand anyway
Averaging Averages
Why do it?
Most performance data is already averaged, so it is easier
Makes response times look better in most cases
Smooths out all variability
Mostly eliminates outliers, particularly in plotting data
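A short Python sketch shows why averaging averages flatters response times. The interval data are hypothetical: two quiet hours and one busy hour.

```python
# Hypothetical hourly summaries: (transaction count, average response time in seconds)
intervals = [
    (100, 0.2),     # quiet hour
    (100, 0.2),     # quiet hour
    (10_000, 2.0),  # peak hour
]

# Averaging the averages treats every interval as equally important
average_of_averages = sum(avg for _, avg in intervals) / len(intervals)

# Weighting each average by its transaction count recovers the true overall mean
total_count = sum(count for count, _ in intervals)
weighted_average = sum(count * avg for count, avg in intervals) / total_count

print(f"average of averages: {average_of_averages:.2f}s")
print(f"weighted average:    {weighted_average:.2f}s")
```

The unweighted figure is 0.8 seconds, while almost every transaction actually experienced about 2 seconds.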
Using Percentage Change
Why do it?
To exaggerate the benefit of a performance change.
Ex. RT decreased 50% going from 0.2 to 0.1.
To justify a processor upgrade
Ex. Doubling application volume will increase its CPU demand 100% (even when the CPU demand was very small)
To impress or terrify
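The percentage-change trick is easy to reproduce. In this Python sketch the response-time figures follow the slide's example, while the CPU figures (0.5% growing to 1.0%) are a hypothetical illustration of a very small demand doubling.

```python
def pct_change(before, after):
    """Percentage change = (after value - before value) / before value, as a percent."""
    return (after - before) / before * 100

# Response time drops from 0.2s to 0.1s: a 50% improvement, but only 0.1s saved
rt_pct = pct_change(0.2, 0.1)
rt_abs = 0.1 - 0.2

# Hypothetical application using 0.5% of the CPU doubles to 1.0%
cpu_pct = pct_change(0.5, 1.0)
cpu_abs = 1.0 - 0.5

print(f"response time: {rt_pct:.0f}% ({rt_abs:+.1f}s)")
print(f"CPU demand:    {cpu_pct:+.0f}% ({cpu_abs:+.1f} percentage points)")
```

Reporting the absolute change next to the percentage makes it much harder to impress or terrify by accident.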
Small Sample Size
Why do it?
SAS jobs run faster
Large, randomly obtained data doesn’t give the right results; a small, selected window does
You don’t really have any data and have to invent some
Stupid Graph Tricks
Why do it?
To make your data look better
How to do it
Log functions on one axis – to diminish the impact of a change. Or just use different orders of magnitude for x and y axes
Select graph type (pie, line, stacked bar) which best misleads your audience
Eliminate actual metrics so you can draw the line to reflect your reality
Put time on the wrong axis
Eliminate all legends, data tables, etc.
Invalid Metrics
How to do it
Use your own definitions. Ex. Typical CICS tran = non-browsing, non-batch work
Use multiple decimal places to lend an air of precision to the data. Good with small or unreliable sample or poor capture ratio.
Compare apples to oranges. Ex. Compare performance after tuning, measured during a period of low demand, to a “before” measured during high demand
Add percentage changes together. Ex. If volume changes cause a 10% increase in DB2, a 15% increase in CICS and a 20% increase in batch, that’s 45%.
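The last trick, adding percentage changes together, can be checked with simple arithmetic. In this Python sketch the growth rates are the ones from the slide, but the baseline CPU shares for each workload are hypothetical.

```python
# Hypothetical baseline CPU consumption by workload (percent of the processor)
baseline = {"DB2": 20.0, "CICS": 30.0, "batch": 10.0}

# Growth rates from the slide's example
growth = {"DB2": 0.10, "CICS": 0.15, "batch": 0.20}

before = sum(baseline.values())
after = sum(baseline[w] * (1 + growth[w]) for w in baseline)

overall_pct = (after - before) / before * 100  # the real combined increase
naive_sum_pct = sum(growth.values()) * 100     # the "lie": 10 + 15 + 20

print(f"real overall increase: {overall_pct:.1f}%")
print(f"sum of percentages:    {naive_sum_pct:.0f}%")
```

With these baselines the real increase is about 14%, a weighted blend of the three rates. It can never exceed the largest individual rate, so 45% is impossible.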
Correlation Abuse
How to do it
Select two metrics that aren’t usually related (I/O response time and file size), draw a correlation and justify a memory upgrade.
Most people don’t know performance metrics well enough to challenge you.
Another Common Lie
Linear progression – forecasting the line past the data points you have
Unless you are sure the relationship between two variables is linear, do not attempt this. Even mostly linear relationships (such as CPU vs. volume) may go non-linear at near-saturation.
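A Python sketch shows how badly a straight line can forecast near saturation. It rests on an assumption that is not in the slides: response time is modelled with a simple single-server queueing formula, R = S / (1 - U), and the service time is a hypothetical value.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b*x + c; returns (b, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x

SERVICE_TIME = 0.1  # seconds; hypothetical

def response_time(utilization):
    """Assumed queueing model for illustration only: R = S / (1 - U)."""
    return SERVICE_TIME / (1 - utilization)

# "Measurements" taken only at low to moderate utilization
observed_u = [0.1, 0.2, 0.3, 0.4, 0.5]
observed_rt = [response_time(u) for u in observed_u]

b, c = fit_line(observed_u, observed_rt)

forecast = b * 0.9 + c       # straight line extended to 90% busy
actual = response_time(0.9)  # what the assumed model gives at 90% busy

print(f"linear forecast at 90%: {forecast:.2f}s, model value: {actual:.2f}s")
```

The line fits the observed range reasonably well, yet at 90% utilization it underestimates the response time by more than a factor of three under this model.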
What Can Go Wrong
What you think might happen
What might really be happening
What We Didn’t Cover
Hypothesis testing – valuable if you want to see how likely it is that your theory matches reality. Is the change in the data due to chance, or did you really make a difference?
Chi-square
T-test
When you don’t have enough information about the data (population) or about cause-and-effect relationships
Summary
Turn data into information by applying statistics and your knowledge.
Practice “safe performance analysis” and protect your job.
CYA
“Numbers are like people; torture them enough and they’ll tell you anything.”
References
Huff & Geis “How to Lie with Statistics”
Dixon and Massey “Introduction to Statistical Analysis”
Gonick & Smith “The Cartoon Guide to Statistics”
Sziede “Statistics for the Algebraically Challenged”
Munoz “Sampling Issues in the Collection of Performance Data” CMG2002
Questions?
Denise P. Kalm
[email protected]
BMC Software, Inc.