Variance
Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland
Resources
• Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.
• Gentle, JE (2002) Elements of Computational Statistics. Springer.
• Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun).
Measure of Variability
• This is the most important quantity in statistical analysis. Data may show no central tendency, but they almost always show variation.
• The greater the variability,
– the greater the uncertainty about parameters estimated from the data.
– the lower our ability to distinguish between competing hypotheses.
Typical Measures of Variability
• The range—depends only on outlying values.
• Sum of the differences between the data and the mean—useless, because it is zero by definition.
• Sum of the absolute values of the differences between the data and the mean—hard to use, although a very good measure.
• Sum of the squares of the differences between the data and the mean—most often used. Divide by the number of data points to get the mean squared deviation.
• Variance is slightly different again….
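These measures can all be computed directly in R. The sketch below reuses gardenA's values from the ozone example later in these notes as a convenient sample:

```r
x <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)      # gardenA values, mean = 3

diff(range(x))                    # the range: max - min = 4
sum(x - mean(x))                  # zero by definition (up to rounding)
sum(abs(x - mean(x)))             # sum of absolute deviations = 8
sum((x - mean(x))^2)              # sum of squares = 12
sum((x - mean(x))^2) / length(x)  # mean squared deviation = 1.2
```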
Degrees of Freedom
• Suppose you have n data points, v.
• The mean, m(v), is the sum of the data point values, divided by n (the number of independent pieces of information).
• Suppose you know n-1 of the data point values and the mean. What is the value of the remaining point?
• Hence, an estimate involving the data points and the mean actually has only n-1 independent pieces of information. The degrees of freedom of the estimate are n-1.
• Definition: the degrees of freedom of an estimate, df, is the sample size, n, minus the number of parameters, p, already estimated from the data.
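The lost degree of freedom can be seen directly: given the mean and all but one value, the last value is fixed. A small sketch with made-up numbers:

```r
v <- c(2, 7, 4, 5)               # hypothetical data, n = 4
n <- length(v)
m <- mean(v)                     # 4.5
# Knowing m and the first n-1 values pins down the last one:
n * m - sum(v[1:(n - 1)])        # recovers v[n] = 5
```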
Variance
• If you have n points of data, v, from an unknown distribution, and you want to compute an estimate of its variability, use the following equation:
variance = s² = (sum of squares)/(n-1)
• Note you divide by n-1. This is the df for the sample variance. (The sum of squares uses the sample mean.)
• If you know the true mean, you can divide by n, but if all you have are the sample data, dividing by n gives an estimate that is biased low.
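R's var() already divides by n-1; a quick check, using gardenB's values from the ozone example later in the notes:

```r
x <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)   # gardenB values, mean = 5
ss <- sum((x - mean(x))^2)             # sum of squares about the sample mean = 12
ss / (length(x) - 1)                   # 1.333..., the sample variance
var(x)                                 # identical: var() divides by n - 1
```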
Variance and Sample Size
• The sample variance is not well behaved.
• The number of data points, n, affects the value of the variance estimate. For a small number of points, the variance estimate varies a lot. It can still vary by about a factor of three for 30 points.
• Rules of thumb:
– You want a large number of independent data points if you need to estimate the variance.
– Fewer than 10 sample points is a very small sample.
– Fewer than 30 points is a small sample.
– 30 points is a reasonable sample.
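A small simulation illustrates how unstable the sample variance is (assumed setup: standard-normal data, so the true variance is 1):

```r
set.seed(1)                              # for reproducibility
s2_10 <- replicate(10000, var(rnorm(10)))
s2_30 <- replicate(10000, var(rnorm(30)))
quantile(s2_10, c(0.025, 0.975))         # very wide for n = 10
quantile(s2_30, c(0.025, 0.975))         # narrower for n = 30, but the upper
                                         # end is still roughly 3x the lower end
```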
Measures of Unreliability
• Given a sample variance (s²), how much will the estimate of the mean vary with different samples?
• This is known as the standard error of the mean:
SEȳ = √(s²/n)
• Note that the central limit theorem implies that the estimate of the mean will converge to a normal distribution as n increases.
• You can use this fact to derive a confidence interval for your estimate of the mean. (n of 30+ allows the normal distribution to be used.)
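The formula in action, again on gardenB's values (n = 10 here, so the normal-based interval is shown only to illustrate the arithmetic, not as good practice):

```r
x <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)    # gardenB, mean = 5
se <- sqrt(var(x) / length(x))          # standard error of the mean
se                                      # about 0.365
mean(x) + c(-1.96, 1.96) * se           # normal-based 95% interval
```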
Small Sample Confidence Intervals
• For n<30, you can’t assume the normal distribution applies.
• Instead, you usually use Student’s t-distribution, which incorporates the degrees of freedom of the sample.
• You can also use bootstrap methods (advanced).
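A t-based interval for the same gardenB data; t.test() computes the identical interval:

```r
x <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)      # gardenB, n = 10 < 30
n <- length(x)
se <- sqrt(var(x) / n)
tcrit <- qt(0.975, df = n - 1)            # about 2.262 for 9 df
mean(x) + c(-1, 1) * tcrit * se           # 95% t-based interval
t.test(x)$conf.int                        # the same interval from t.test()
```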
Confidence Intervals
• Three ways of generating a confidence interval for an estimate:
– Assume a normal distribution. (You need lots of samples.)
– Assume a χ² distribution. (Fewer samples.)
– Bootstrapping (makes the fewest assumptions, but is computationally demanding).
• Demonstration (advanced)
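A minimal percentile-bootstrap sketch of the kind the demonstration refers to, applied here to gardenC's values from the data below: resample with replacement many times, take the mean of each resample, and read off the central 95% of the resampled means.

```r
set.seed(1)
x <- c(3, 3, 2, 1, 10, 4, 3, 11, 3, 10)    # gardenC values, mean = 5
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))      # percentile bootstrap 95% CI
```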
R Demonstrations of all this…
• From the book.
• ozone<-read.table("gardens.txt",header=T)
• attach(ozone)
• ozone
Ozone Data Frame
   gardenA gardenB gardenC
1        3       5       3
2        4       5       3
3        4       6       2
4        3       7       1
5        2       4      10
6        3       4       4
7        1       3       3
8        3       5      11
9        5       6       3
10       2       5      10
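If gardens.txt is not to hand, the same data frame can be built directly from the values in the table above:

```r
ozone <- data.frame(
  gardenA = c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2),
  gardenB = c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5),
  gardenC = c(3, 3, 2, 1, 10, 4, 3, 11, 3, 10)
)
attach(ozone)   # makes gardenA, gardenB, gardenC visible by name
```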
Continued
• mean(gardenA)
• 3
• mean(gardenB)
• 5
• mean(gardenC)
• 5
• Are gardenB and gardenC distinguishable?
Continued Further
• var(gardenA)
• 1.33333
• var(gardenB)
• 1.33333
• var(gardenC)
• 14.22222
• gardenA and gardenB have the same variance; gardenC does not!
Apply var.test
var.test(gardenB,gardenC)
F test to compare two variances
data:  gardenB and gardenC
F = 0.0938, num df = 9, denom df = 9, p-value = 0.001624
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.02328617 0.37743695
sample estimates:
ratio of variances
           0.09375
Implications
• Since gardenA and gardenB have the same variance, you can use the t.test to compare means and conclude they are significantly different.
• Since gardenC has a different variance, you cannot use the classical equal-variance t-test, and must use something weaker to compare their means.
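One such weaker, distribution-free alternative is the Wilcoxon rank-sum test; a sketch on the garden values (the ties in these data make R warn that exact p-values cannot be computed):

```r
gardenA <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
gardenC <- c(3, 3, 2, 1, 10, 4, 3, 11, 3, 10)
# Compares locations without assuming normality or equal variances:
wilcox.test(gardenA, gardenC)
```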
Application of t.test to gardenA and gardenB
t.test(gardenA,gardenB)
Welch Two Sample t-test
data:  gardenA and gardenB
t = -3.873, df = 18, p-value = 0.001115
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.0849115 -0.9150885
sample estimates:
mean of x mean of y
        3         5
Application of t.test to gardenA and gardenC
t.test(gardenA,gardenC)
Welch Two Sample t-test
data:  gardenA and gardenC
t = -1.6036, df = 10.673, p-value = 0.1380
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.7554137  0.7554137
sample estimates:
mean of x mean of y
        3         5