Variance
Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland
Resources
• Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.
• Gentle, JE (2002) Elements of Computational Statistics. Springer.
• Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun).
Measure of Variability
• This is the most important quantity in statistical analysis. Data may show no central tendency, but they almost always show variation.
• The greater the variability,
– the greater the uncertainty about parameters estimated from the data.
– the lower our ability to distinguish between competing hypotheses.
Typical Measures of Variability
• The range—depends only on outlying values.
• Sum of the differences between the data and the mean—useless, because it is zero by definition.
• Sum of the absolute values of the differences between the data and the mean—hard to use, although a very good measure.
• Sum of the squares of the differences between the data and the mean—most often used. Divide by the number of data points to get the mean squared deviation.
• Variance is slightly different again….
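These measures can all be computed directly in R. The sketch below reuses gardenA's values from the ozone example later in these notes as a convenient sample:

```r
x <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)      # gardenA values, mean = 3

diff(range(x))                    # the range: max - min = 4
sum(x - mean(x))                  # zero by definition (up to rounding)
sum(abs(x - mean(x)))             # sum of absolute deviations = 8
sum((x - mean(x))^2)              # sum of squares = 12
sum((x - mean(x))^2) / length(x)  # mean squared deviation = 1.2
```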
Degrees of Freedom
• Suppose you have n data points, v.
• The mean, m(v), is the sum of the data point values, divided by n (the number of independent pieces of information).
• Suppose you know n-1 of the data point values and the mean. What is the value of the remaining point?
• Hence, an estimate involving the data points and the mean actually has only n-1 independent pieces of information. The degrees of freedom of the estimate are n-1.
• Definition: the degrees of freedom of an estimate, df, is the sample size, n, minus the number of parameters, p, already estimated from the data.
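The lost degree of freedom can be seen directly: given the mean and all but one value, the last value is fixed. A small sketch with made-up numbers:

```r
v <- c(2, 7, 4, 5)               # hypothetical data, n = 4
n <- length(v)
m <- mean(v)                     # 4.5
# Knowing m and the first n-1 values pins down the last one:
n * m - sum(v[1:(n - 1)])        # recovers v[n] = 5
```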
Variance
• If you have n points of data, v, from an unknown distribution, and you want to compute an estimate of its variability, use the following equation:
variance = s² = (sum of squares)/(n-1)
• Note you divide by n-1. This is the df for the sample variance. (The sum of squares uses the sample mean.)
• If you know the true mean, you can divide by n, but if all you have are the sample data, dividing by n gives an estimate that is biased low.
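R's var() already divides by n-1; a quick check, using gardenB's values from the ozone example later in the notes:

```r
x <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)   # gardenB values, mean = 5
ss <- sum((x - mean(x))^2)             # sum of squares about the sample mean = 12
ss / (length(x) - 1)                   # 1.333..., the sample variance
var(x)                                 # identical: var() divides by n - 1
```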
Variance and Sample Size
• The sample variance is not well behaved.
• The number of data points, n, affects the value of the variance estimate. For a small number of points, the variance estimate varies a lot. It can still vary by about a factor of three for 30 points.
• Rules of thumb:
– You want a large number of independent data points if you need to estimate the variance.
– Fewer than 10 sample points is a very small sample.
– Fewer than 30 points is a small sample.
– 30 points is a reasonable sample.
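A small simulation illustrates how unstable the sample variance is (assumed setup: standard-normal data, so the true variance is 1):

```r
set.seed(1)                              # for reproducibility
s2_10 <- replicate(10000, var(rnorm(10)))
s2_30 <- replicate(10000, var(rnorm(30)))
quantile(s2_10, c(0.025, 0.975))         # very wide for n = 10
quantile(s2_30, c(0.025, 0.975))         # narrower for n = 30, but the upper
                                         # end is still roughly 3x the lower end
```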
Measures of Unreliability
• Given a sample variance (s²), how much will the estimate of the mean vary with different samples?
• This is known as the standard error of the mean:
SEȳ = √(s²/n)
• Note that the central limit theorem implies that the estimate of the mean will converge to a normal distribution as n increases.
• You can use this fact to derive a confidence interval for your estimate of the mean. (n of 30+ allows the normal distribution to be used.)
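The formula in action, again on gardenB's values (n = 10 here, so the normal-based interval is shown only to illustrate the arithmetic, not as good practice):

```r
x <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)    # gardenB, mean = 5
se <- sqrt(var(x) / length(x))          # standard error of the mean
se                                      # about 0.365
mean(x) + c(-1.96, 1.96) * se           # normal-based 95% interval
```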
Small Sample Confidence Intervals
• For n<30, you can’t assume the normal distribution applies.
• Instead, you usually use Student’s t-distribution, which incorporates the degrees of freedom of the sample.
• You can also use bootstrap methods (advanced).
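A t-based interval for the same gardenB data; t.test() computes the identical interval:

```r
x <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)      # gardenB, n = 10 < 30
n <- length(x)
se <- sqrt(var(x) / n)
tcrit <- qt(0.975, df = n - 1)            # about 2.262 for 9 df
mean(x) + c(-1, 1) * tcrit * se           # 95% t-based interval
t.test(x)$conf.int                        # the same interval from t.test()
```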
Confidence Intervals
• Three ways of generating a confidence interval for an estimate:
– Assume a normal distribution. (You need lots of samples.)
– Assume a χ² distribution. (Fewer samples.)
– Bootstrapping (makes the fewest assumptions, but is computationally demanding).
• Demonstration (advanced)
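A minimal percentile-bootstrap sketch of the kind the demonstration refers to, applied here to gardenC's values from the data below: resample with replacement many times, take the mean of each resample, and read off the central 95% of the resampled means.

```r
set.seed(1)
x <- c(3, 3, 2, 1, 10, 4, 3, 11, 3, 10)    # gardenC values, mean = 5
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))      # percentile bootstrap 95% CI
```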
R Demonstrations of all this…
• From the book.
• ozone<-read.table("gardens.txt",header=T)
• attach(ozone)
• ozone
Ozone Data Frame
   gardenA gardenB gardenC
1        3       5       3
2        4       5       3
3        4       6       2
4        3       7       1
5        2       4      10
6        3       4       4
7        1       3       3
8        3       5      11
9        5       6       3
10       2       5      10
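If gardens.txt is not to hand, the same data frame can be built directly from the values in the table above:

```r
ozone <- data.frame(
  gardenA = c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2),
  gardenB = c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5),
  gardenC = c(3, 3, 2, 1, 10, 4, 3, 11, 3, 10)
)
attach(ozone)   # makes gardenA, gardenB, gardenC visible by name
```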
Continued
• mean(gardenA)
• 3
• mean(gardenB)
• 5
• mean(gardenC)
• 5
• Are gardenB and gardenC distinguishable?
Continued Further
• var(gardenA)
• 1.33333
• var(gardenB)
• 1.33333
• var(gardenC)
• 14.22222
• gardenA and gardenB have the same variance; gardenC does not!
Apply var.test
var.test(gardenB,gardenC)
F test to compare two variances
data:  gardenB and gardenC
F = 0.0938, num df = 9, denom df = 9, p-value = 0.001624
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.02328617 0.37743695
sample estimates:
ratio of variances
           0.09375
Implications
• Since gardenA and gardenB have the same variance, you can use the t.test to compare means and conclude they are significantly different.
• Since gardenC has a different variance, you cannot use the classical equal-variance t-test, and must use something weaker to compare their means.
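One such weaker, distribution-free alternative is the Wilcoxon rank-sum test; a sketch on the garden values (the ties in these data make R warn that exact p-values cannot be computed):

```r
gardenA <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
gardenC <- c(3, 3, 2, 1, 10, 4, 3, 11, 3, 10)
# Compares locations without assuming normality or equal variances:
wilcox.test(gardenA, gardenC)
```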
Application of t.test to gardenA and gardenB
t.test(gardenA,gardenB)
Welch Two Sample t-test
data:  gardenA and gardenB
t = -3.873, df = 18, p-value = 0.001115
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.0849115 -0.9150885
sample estimates:
mean of x mean of y
        3         5
Application of t.test to gardenA and gardenC
t.test(gardenA,gardenC)
Welch Two Sample t-test
data:  gardenA and gardenC
t = -1.6036, df = 10.673, p-value = 0.1380
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.7554137  0.7554137
sample estimates:
mean of x mean of y
        3         5