Variation, uncertainties and models

Marian Scott, School of Mathematics and Statistics, University of Glasgow

June 2012

the sample mean

• Perhaps the most commonly used measure of centre is the arithmetic mean (from now on called the mean). If we have a sample of n observations denoted by x1, x2,...,xn then the mean is shown below

$$\bar{x} = \frac{\text{sum of observations}}{\text{total number of observations}} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

the sample variance

• the variance of the observations is shown below

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

the sample standard deviation

• the standard deviation is the square root of the variance and is shown below

$$s = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)}$$

the estimated standard error

• the standard error is the standard deviation divided by √n (the square root of the sample size)

• this is a measure of the precision with which we can estimate the mean

• it is sometimes called the standard deviation of the mean

the coefficient of variation

• The coefficient of variation is a simple summary, CV = (stdev/mean)*100%.

• It is a useful way of evaluating the variation relative to the mean value and also to compare different data sets, even where the mean value is quite different.
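A minimal sketch of the summaries defined so far (mean, variance, standard deviation, estimated standard error and CV). The function name `summarise` and the data values are hypothetical, chosen only for illustration; they are not the FS data shown on the slides.

```python
import math


def summarise(xs):
    """Return the basic summary statistics for a sample of observations."""
    n = len(xs)
    mean = sum(xs) / n
    # Sample variance uses the n - 1 divisor
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    stdev = math.sqrt(variance)
    # Estimated standard error of the mean: s / sqrt(n)
    std_error = stdev / math.sqrt(n)
    # Coefficient of variation as a percentage of the mean
    cv_percent = stdev / mean * 100
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "stdev": stdev,
        "se": std_error,
        "cv": cv_percent,
    }


# Hypothetical illustrative values (not the FS data)
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = summarise(sample)
for name, value in stats.items():
    print(f"{name}: {value:.4g}")
```

The CV is only meaningful for data measured on a scale with a true zero and a positive mean.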

Data summaries

[Figure: Dotplot of FS (axis 0 to 1500)]

[Figure: Dotplot of FS (axis 0 to 720)]

Case 1: all data, mean=130.5, stdev= 256.9, CV= 197%

Case 2: extreme value at 1500 removed, mean= 95.4, stdev= 133.3, CV=139%

a more sensible analysis

use the log data, as above: there are no problem data values

CV=36.9%

[Figure: Dotplot of logFS (axis 0.9 to 7.2)]

robust summary statistics

• robust summary statistics include the median, quartiles and inter-quartile range (IQR)

• the median is defined as the value below which (or equivalently above which) half of the observations lie. It is also known as the 50th percentile. This is a non-parametric percentile, since no distributional assumptions are made

robust summary statistics

• quartiles and inter-quartile range (IQR)

• Similarly, the more robust way to measure spread is to look at the lower and upper quartiles Q1 and Q3 - also known as the 25th and 75th percentiles. The IQR (interquartile range) is Q3 – Q1.

• these statistics form the basis of the construction of the boxplot
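A short sketch of why these summaries are called robust, using Python's standard `statistics` module. The data are hypothetical, and the "inclusive" quartile method is an assumption: statistical packages use slightly different quartile conventions, so values may differ a little from other software.

```python
import statistics


def robust_summary(xs):
    """Return the quartiles, median and inter-quartile range of a sample."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# Hypothetical data: the second sample replaces the largest value by an extreme one
clean = [10, 20, 30, 40, 50, 60, 70, 80, 90]
with_outlier = clean[:-1] + [1500]

print("clean:       ", robust_summary(clean), "mean =", statistics.mean(clean))
print("with outlier:", robust_summary(with_outlier), "mean =", statistics.mean(with_outlier))
```

In this example the median and quartiles are the same for both samples, while the mean changes substantially.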

Preliminary Analysis

• Bathing water example

• There is considerable variation
– Across different sites
– Within the same site across different years

• Distribution of data is highly skewed with evidence of outliers and in some cases bimodality

[Figure: Boxplots of FS by Year (2004 to 2007), SEPA location code 114567]

detecting and dealing with outliers

From the boxplot, most statistical software identifies an outlier as a value which is more than 1.5 * IQR below the lower quartile or above the upper quartile, and marks it with a special symbol.

[Figure: Boxplot of FS (axis 0 to 1600)]
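A sketch of the boxplot outlier rule, assuming the common convention in which the fences sit 1.5 * IQR beyond the lower and upper quartiles. The function name `boxplot_outliers`, the data and the quartile method are illustrative assumptions; individual packages may differ in detail.

```python
import statistics


def boxplot_outliers(xs, k=1.5):
    """Flag values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [x for x in xs if x < lower or x > upper]
    return flagged, (lower, upper)


# Hypothetical data containing one extreme value
values = [10, 20, 30, 40, 50, 60, 70, 80, 1500]
flagged, fences = boxplot_outliers(values)
print("fences:", fences)
print("flagged as outliers:", flagged)
```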

Formal tests

Formal outlier tests exist (Dixon's test, Grubbs' test, Chauvenet's criterion); all are based on a 'how far' rule, but usually how far from the mean, in terms of standard deviations.

what to do?

first check your data for any errors

second, perhaps consider an analysis both with and without the problem value

use robust statistics

Robust values

                   Q1      Median   Q3
original           10.0    60.0     23.0
outlier removed    10.0    58.0     112.5

Removing the outlier makes almost no difference to the median but the range is affected.

Simple Regression Model

• The basic regression model assumes:

• The average value of the response y is linearly related to the explanatory variable x,

• The spread of the response y about the average is the SAME for all values of x,

• The VARIABILITY of the response y about the average follows a NORMAL distribution for each value of x.

Simple Regression Model

• Model is fit typically using least squares

• Goodness of fit of model assessed based on residual sum of squares and R2

• Assumptions checked using residual plots

• Inference about model parameters
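The steps above can be sketched for a straight line using the closed-form least squares estimates. This is a minimal illustration with hypothetical data, not the software used for the slides; the function name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by least squares and report the fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Corrected sums of squares and cross-products
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residuals are observed minus fitted values
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    rss = sum(r ** 2 for r in residuals)          # residual sum of squares
    tss = sum((y - y_bar) ** 2 for y in ys)       # total sum of squares
    r_squared = 1 - rss / tss
    return slope, intercept, residuals, r_squared


# Hypothetical data for illustration
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, residuals, r_squared = fit_line(xs, ys)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R-Sq = {100 * r_squared:.1f}%")
```

The residuals returned here are what would be plotted against the fitted values, or on a normal probability plot, to check the assumptions.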

Regression- chlorophyll

[Figure: Scatterplot of chloro vs N]

Regression Output

The regression equation is
chloro = - 1.7 + 28.8 N

Predictor   Coef     StDev    T       P
Constant    -1.69    10.14    -0.17   0.869
N           28.808   4.171    6.91    0.000

S = 15.19   R-Sq = 67.5%   R-Sq(adj) = 66.1%

Analysis of Variance

Source       DF   SS      MS      F       P
Regression   1    11000   11000   47.70   0.000
Error        23   5304    231
Total        24   16304
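As a check on how the quantities in this output relate to one another, the sketch below recomputes T, S, F, R-Sq and R-Sq(adj) from the coefficients and sums of squares quoted in the output. It uses only the standard definitions; small differences from the printed values are due to rounding of the quoted inputs.

```python
import math

# Values copied from the regression output for chloro vs N
coef_n, se_n = 28.808, 4.171
ss_reg, ss_err, ss_tot = 11000.0, 5304.0, 16304.0
df_reg, df_err, df_tot = 1, 23, 24

t_slope = coef_n / se_n                      # T = Coef / StDev
ms_err = ss_err / df_err                     # error mean square
s = math.sqrt(ms_err)                        # S = sqrt(MS Error)
f_stat = (ss_reg / df_reg) / ms_err          # F = MS Regression / MS Error
r_sq = ss_reg / ss_tot                       # R-Sq = SS Regression / SS Total
r_sq_adj = 1 - (ss_err / df_err) / (ss_tot / df_tot)

print(f"T = {t_slope:.2f}, S = {s:.2f}, F = {f_stat:.2f}")
print(f"R-Sq = {100 * r_sq:.1f}%, R-Sq(adj) = {100 * r_sq_adj:.1f}%")
```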

Check Assumptions

[Figure: Residual Plots for chloro. Panels: Normal Probability Plot (Percent vs Residual), Versus Fits (Residual vs Fitted Value), Histogram (Frequency vs Residual), Versus Order (Residual vs Observation Order)]

Assess the Model Fit

[Figure: Fitted Line Plot, chloro = - 1.69 + 28.81 N, showing the regression line with 95% CI and 95% PI; S = 15.1864, R-Sq = 67.5%, R-Sq(adj) = 66.1%]

Conclusions

• The best fit straight line has an intercept of -1.7 and a slope of 28.8. Thus for every unit increase in N, the chloro measure increases on average by 28.8.

• The R2(adj) value is 66.1%, so we have explained 66% of the variation in chloro by its relationship to N. The S value is 15.19, which describes the variation in the points around this fitted line.

Conclusions

• The Analysis of Variance table shows, against the Regression term, a p-value of 0.000. Since the p-value is small (<0.05), we can conclude that the regression is significant.

• Check for unusual observations. These may have a large residual, which simply means that the observed value lies far from the fitted line, or they may be influential, which means that the value for this particular observation has been particularly important in the calculation of the best fitted line.

Example 1: simple regression

[Figure: Scatterplot of log(amm) vs pH (pH units)]

log ammonia model

• Model

log(amm) ~ pH

• Fitted model

log(amm) = 6.45 - 0.75 pH

Fitted line: simple regression

[Figure: Fitted Line Plot, log(amm) = 6.450 - 0.7503 pH (pH units); S = 0.336030, R-Sq = 19.8%, R-Sq(adj) = 19.4%]

Regression Output

The regression equation is
log(amm) = 6.45 - 0.75 pH

Predictor   Coef    ese
Constant    6.45    0.7837
pH          -0.75   0.0998

S = 0.336   R-Sq(adj) = 19.4%

So only 19.4% of variability in log(amm) explained by pH

Check Assumptions

[Figure: Versus Fits (response is log(amm)), Residual vs Fitted Value]

[Figure: Normal Probability Plot (response is log(amm)), Percent vs Residual]

Residual plot shows no pattern, probability plot looks broadly linear

Assess the Model Fit

The R2 (adjusted) value expresses the % variability in the response variable that has been explained. High values are good!!

19.4% of variability in log(amm) explained by pH

Look at the fitted values and compare with the observed data (using the residuals).

Look at the residual plots.

other features

Influential points

• they are key in determining where the fitted line goes.

• they are often at the 'ends' of the line, so have either large or small x values
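A small sketch of how a single point at the end of the x range can pull the fitted line. The data are hypothetical (five points lying exactly on y = x plus one point at a large x value) and the helper `fit_line` is an illustrative assumption, not part of the original analysis.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for a straight line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data: the last point sits far out at x = 14
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 14.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 2.0]

slope_all, intercept_all = fit_line(xs, ys)
# Refit with the suspected influential point left out
slope_trimmed, intercept_trimmed = fit_line(xs[:-1], ys[:-1])

print(f"all points:     slope = {slope_all:.3f}, intercept = {intercept_all:.3f}")
print(f"point removed:  slope = {slope_trimmed:.3f}, intercept = {intercept_trimmed:.3f}")
```

Comparing the two fits is one simple way to see whether a point is influential; formal influence diagnostics exist but are not shown here.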

[Figure: Fitted Line Plot, log(FS) = 0.6868 + 0.6140 log(FC); S = 0.954771, R-Sq = 55.8%, R-Sq(adj) = 55.3%]

[Figure: Fitted Line Plot, log(FS) = 1.393 + 0.4679 log(FC); S = 1.07879, R-Sq = 42.9%, R-Sq(adj) = 42.2%]

Model inference

The main items of note:

• Testing significance of parameters using p-values

• Testing the overall significance of the regression using the ANOVA table

• Assessing the goodness of fit using the R2 (adjusted) value and the residuals.

• Typical questions concerning the slope and intercept of the line are

• Does the line pass through the origin? (is β0 = 0)

• Is the slope significantly different from 0? (is β1 ≠ 0)

• Constructing

– a 95% confidence interval for the mean response for a given value of the explanatory variable and a 95% prediction interval for a future observation.
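A sketch of the slope question, using the coefficient and estimated standard error quoted in the log ammonia output (slope -0.75, ese 0.0998). The degrees of freedom are not given in that output, so the interval below uses a normal approximation in place of the t distribution; that substitution is an assumption and is only reasonable for a fairly large sample.

```python
from statistics import NormalDist

# Values quoted in the log(amm) ~ pH regression output
slope, ese = -0.75, 0.0998

# Test statistic for the hypothesis that the slope is 0
t_stat = slope / ese

# Normal approximation to the 97.5th percentile (assumption: large sample)
z = NormalDist().inv_cdf(0.975)
ci = (slope - z * ese, slope + z * ese)

# Two-sided p-value under the same normal approximation
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.2f}")
print(f"approximate 95% CI for the slope: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"approximate p-value: {p_value:.3g}")
```

If the interval excludes 0, the slope is significantly different from 0 at the 5% level under this approximation.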

Modelling dissolved oxygen

[Figure: Fitted Line Plot, DO = 11.89 - 0.4752 temperature (dissolved oxygen vs temperature); S = 2.05981, R-Sq = 47.3%, R-Sq(adj) = 47.3%]

Model 1: DO ~ temperature

Regression output

The regression equation is
DO = 11.9 - 0.475 temp

Predictor   Coef       SE Coef   T        P
Constant    11.8887    0.1303    91.27    0.000
temp        -0.47524   0.01133   -41.95   0.000

S = 2.0598   R-Sq(adj) = 47.3%

So only 47.3% of variability in DO is explained by temperature.

Regression output

Analysis of Variance

Source           DF     SS        MS       F         p-value
Regression       1      7467.7    7467.7   1760.08   0.000
Residual Error   1961   8320.2    4.2
Total            1962   15787.9

The ANOVA table shows the residual sum of squares as 8320.2. The p-value of 0.000 summarises a test of the null hypothesis model0: DO = error.

We would reject this model in favour of model1: DO = temperature + error
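The comparison of model0 with model1 can be written as an F statistic built from the residual sums of squares of the two models. The notation RSS0, RSS1, df0 and df1 is introduced here for illustration; under model0 the residual sum of squares is the Total line of the table. The result agrees with the tabulated F up to rounding of the quoted sums of squares.

```latex
F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(df_0 - df_1)}{\mathrm{RSS}_1/df_1}
  = \frac{(15787.9 - 8320.2)/(1962 - 1961)}{8320.2/1961}
  = \frac{7467.7}{4.2428}
  \approx 1760.1
```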

Check Assumptions

[Figure: Versus Fits (response is doa), Residual vs Fitted Value]

[Figure: Normal Probability Plot (response is doa), Percent vs Residual]

Measures of agreement

When there are two methods by which a measurement can be made, then it is important to know how well the methods agree.

As an example, we can consider a recent study of low-level total phosphorus (Nov 2007) conducted in the Edinburgh chemistry lab. This was not a situation where two different analytical techniques were being used; instead, duplicate samples of water were analysed for two different lochs over approximately one month. How well did the duplicate samples agree?


First what not to do!

don’t quote a correlation coefficient

A correlation coefficient measures the strength of the relationship between two quantities. If we have two measurement techniques, we would expect them to be related, so the correlation coefficient is not a measure of agreement.
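A small illustration of the point, using hypothetical numbers: if one method always reads 5 units higher than the other, the correlation is perfect even though the two methods never agree. The function name `pearson` and the data are assumptions made for this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements: method B has a constant bias of +5
method_a = [5.0, 10.0, 15.0, 20.0, 25.0]
method_b = [a + 5.0 for a in method_a]

r = pearson(method_a, method_b)
differences = [b - a for a, b in zip(method_a, method_b)]

print(f"correlation = {r:.3f}")
print("differences (B - A):", differences)
```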


A further tool commonly used is the scatterplot.

In this situation care must be taken in constructing the scatterplot:

the scale on both the x- and y-axis must be the same, and

as a useful visual aid, it would be common to sketch the line of equality (y=x).

[Figure: Scatterplot of TP, duplicate 2 vs duplicate 1 (legend: LLLM, C8)]

The scatterplot with the line y=x is shown.

If the two sets are in agreement, then the points should be scattered closely round the line

assessing agreement

The scatterplot with the line y=x is shown. The blue line is the best fitting straight line.

So the results are clearly related, but we knew that anyway.

[Figure: Scatterplot of TP, duplicate 2 vs duplicate 1 (legend: LLLM, C8)]

assessing agreement

Bland-Altman method

• This method involves studying the distribution of the between-method differences, and summarizing these data by the mean and 95% range of the differences. (These are called the 95% limits of agreement).

• This is then backed up with a Bland and Altman plot which plots the differences against the mean of the paired measurements, to ensure that the difference data are well behaved.
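The two steps above can be sketched as follows. The data are hypothetical duplicate measurements, not the TP results, and the function name `bland_altman` is an assumption. The limits are taken as the mean difference plus or minus 2 standard deviations, which is a common approximation to the 95% range; a multiplier of 1.96 is also used.

```python
import statistics


def bland_altman(first, second, multiplier=2.0):
    """Return the plot coordinates and summary values for a Bland-Altman analysis."""
    # Differences and means of each pair of measurements
    diffs = [a - b for a, b in zip(first, second)]
    means = [(a + b) / 2 for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 divisor)
    limits = (mean_diff - multiplier * sd_diff, mean_diff + multiplier * sd_diff)
    return means, diffs, mean_diff, sd_diff, limits


# Hypothetical duplicate measurements
dup1 = [10.0, 12.0, 15.0, 20.0, 25.0]
dup2 = [11.0, 12.5, 17.0, 21.0, 28.0]

means, diffs, mean_diff, sd_diff, limits = bland_altman(dup1, dup2)
print(f"mean difference = {mean_diff:.2f}, sd of differences = {sd_diff:.2f}")
print(f"limits of agreement: {limits[0]:.2f} to {limits[1]:.2f}")
# Plotting diffs against means gives the Bland and Altman plot
```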

[Figure: Bland-Altman plot, difference vs mean TP (legend: LLLM, C8), with a reference line at 0]

The mean difference is -1.30 and the standard deviation of the differences is 1.974.

But is there a suggestion that the difference is larger for higher levels of TP?

Bland-Altman approach

mean difference is -1.30 and standard deviation of the differences is 1.974.

limits of agreement are indicated.
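As a worked check, assuming the limits of agreement are taken as the mean difference plus or minus two standard deviations of the differences (an assumption, though it reproduces the values 2.65 and -5.25 marked on the plot):

```latex
\bar{d} \pm 2 s_d = -1.30 \pm 2 \times 1.974 = -1.30 \pm 3.948
\quad\Rightarrow\quad (-5.25,\; 2.65)
```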

Bland-Altman approach

[Figure: Bland-Altman plot, difference vs mean TP (legend: LLLM, C8), with a reference line at 0 and limits of agreement at 2.65 and -5.25]
