Variation, uncertainties and models. Marian Scott School of Mathematics and Statistics, University of Glasgow June 2012. the sample mean. - PowerPoint PPT Presentation
Variation, uncertainties and models
Marian Scott
School of Mathematics and Statistics, University of Glasgow
June 2012
the sample mean
• Perhaps the most commonly used measure of centre is the arithmetic mean (from now on called the mean). If we have a sample of n observations denoted by x1, x2,...,xn then the mean is shown below
$$\bar{x} = \frac{\text{sum of observations}}{\text{total number of observations}} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
the sample variance
• the variance of the observations is shown below
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$
the sample standard deviation
• the standard deviation is the square root of the variance and is shown below
$$s = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)}$$
the estimated standard error
• the standard error is the standard deviation divided by the square root of n, that is s/√n
• this is a measure of the precision with which we can estimate the mean
• it is sometimes called the standard deviation of the mean
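The four summaries defined so far can be sketched in a few lines of Python. This is a minimal illustration only: the function name `summarise` and the data values are hypothetical and do not come from the bathing water data.

```python
import math

def summarise(xs):
    """Return the mean, sample variance, standard deviation and standard error."""
    n = len(xs)
    mean = sum(xs) / n
    # The sample variance uses the n - 1 divisor.
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    sd = math.sqrt(variance)
    # The estimated standard error of the mean is s / sqrt(n).
    se = sd / math.sqrt(n)
    return mean, variance, sd, se

# Hypothetical data, for illustration only.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, variance, sd, se = summarise(sample)
print(mean, variance, sd, se)
```

The standard error shrinks as n grows even when the standard deviation stays the same, which is why it measures the precision of the mean rather than the spread of the data.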
the coefficient of variation
• The coefficient of variation is a simple summary, CV = (stdev/mean)*100%.
• It is a useful way of evaluating the variation relative to the mean value and also of comparing different data sets, even where the mean values are quite different.
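As a small sketch of why the CV helps when comparing data sets, the two hypothetical samples below differ tenfold in mean and standard deviation but have the same relative variation. The function name and the numbers are illustrative assumptions.

```python
import statistics

def coefficient_of_variation(xs):
    """CV as a percentage: (sample standard deviation / mean) * 100."""
    return statistics.stdev(xs) / statistics.mean(xs) * 100

# Hypothetical samples on very different scales.
low_level = [10.0, 12.0, 14.0]
high_level = [100.0, 120.0, 140.0]

cv_low = coefficient_of_variation(low_level)
cv_high = coefficient_of_variation(high_level)
print(round(cv_low, 1), round(cv_high, 1))
```

The standard deviations are 2 and 20, yet both samples have a CV of roughly 16.7%.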
Data summaries
[Figure: Dotplot of FS, all data (axis 0 to 1500)]
[Figure: Dotplot of FS, extreme value removed (axis 0 to 720)]
Case 1: all data, mean=130.5, stdev= 256.9, CV= 197%
Case 2: extreme value at 1500 removed, mean= 95.4, stdev= 133.3, CV=139%
a more sensible analysis
use the log data: on the log scale there are no problem data values
CV=36.9%
[Figure: Dotplot of logFS (axis 0.9 to 7.2)]
robust summary statistics
• robust summary statistics include the median, quartiles and inter-quartile range (IQR)
• the median is defined as the value below which (or equivalently above which) half of the observations lie. It is also known as the 50th percentile. This is a non-parametric percentile, since no distributional assumptions are made
robust summary statistics
• quartiles and inter-quartile range (IQR)
• Similarly, the more robust way to measure spread is to look at the lower and upper quartiles Q1 and Q3 - also known as the 25th and 75th percentiles. The IQR (interquartile range) is Q3 – Q1.
• these statistics form the basis of the construction of the boxplot
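A minimal sketch of these robust summaries, using Python's `statistics.quantiles` with its default quartile convention (an assumption: statistical packages differ slightly in how they interpolate quartiles). The data are hypothetical.

```python
import statistics

def robust_summary(xs):
    """Return Q1, median, Q3 and the inter-quartile range."""
    q1, median, q3 = statistics.quantiles(xs, n=4)
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}

# Hypothetical data, then the same data with one extreme value added.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_extreme = clean + [1000]

print(robust_summary(clean), statistics.mean(clean))
print(robust_summary(with_extreme), statistics.mean(with_extreme))
```

Adding the extreme value moves the mean from 5 to 104.5, while the median only moves from 5 to 5.5.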
Preliminary Analysis
• Bathing water example
• There is considerable variation
– Across different sites
– Within the same site across different years
• Distribution of data is highly skewed with evidence of outliers and in some cases bimodality
[Figure: Boxplots of FS by Year (2004 to 2007), SEPA location code 114567]
detecting and dealing with outliers
from the boxplot, most statistical software identifies an outlier as a value which lies more than 1.5 * IQR beyond the quartiles (below Q1 or above Q3) and marks it by a special symbol.
[Figure: Boxplot of FS (axis 0 to 1600)]
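The boxplot rule can be sketched as follows, assuming the common convention that the fences sit 1.5 * IQR below Q1 and above Q3. The function name, the multiplier argument `k` and the data are illustrative, and the quartile convention is the Python default, which may differ slightly from a given statistics package.

```python
import statistics

def boxplot_outliers(xs, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in xs if x < lower_fence or x > upper_fence]

# Hypothetical data with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flagged = boxplot_outliers(values)
print(flagged)
```

A flagged value is a prompt to check the data, not an automatic reason to delete it.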
Formal tests
Formal outlier tests exist: Dixon's, Grubbs' and Chauvenet's criterion. All are based on the 'how far' rule, but usually how far from the mean, in terms of standard deviations.
what to do?
first check your data for any errors
second, perhaps consider an analysis both with and without the problem value
use robust statistics
Robust values
                   Q1      Median   Q3
original           10.0    60.0     23.0
outlier removed    10.0    58.0     112.5
Removing the outlier makes almost no difference to the median but the range is affected.
Simple Regression Model
• The basic regression model assumes:
• The average value of the response y is linearly related to the explanatory x,
• The spread of the response y about the average is the SAME for all values of x,
• The VARIABILITY of the response y about the average follows a NORMAL distribution for each value of x.
Simple Regression Model
• Model is fit typically using least squares
• Goodness of fit of model assessed based on residual sum of squares and R2
• Assumptions checked using residual plots
• Inference about model parameters
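The least squares fit and the R2 summary can be sketched directly from their definitions. This is a minimal illustration under stated assumptions: the function name `fit_line` and the data are hypothetical, and a statistics package would normally be used in practice.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by least squares; also return R-squared and residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    rss = sum(r ** 2 for r in residuals)          # residual sum of squares
    tss = sum((yi - y_bar) ** 2 for yi in y)      # total sum of squares
    r_squared = 1 - rss / tss
    return intercept, slope, r_squared, residuals

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope, r_squared, residuals = fit_line(x, y)
print(intercept, slope, r_squared)
```

The returned residuals are what would be plotted against the fitted values and on a normal probability plot when checking the assumptions.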
Regression- chlorophyll
[Figure: Scatterplot of chloro vs N]
Regression Output
The regression equation is
chloro = - 1.7 + 28.8 N

Predictor   Coef     StDev   T       P
Constant    -1.69    10.14   -0.17   0.869
N           28.808   4.171   6.91    0.000

S = 15.19   R-Sq = 67.5%   R-Sq(adj) = 66.1%

Analysis of Variance
Source       DF   SS      MS      F       P
Regression   1    11000   11000   47.70   0.000
Error        23   5304    231
Total        24   16304
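The summary figures in this output are linked by simple arithmetic, which the following sketch retraces from the printed sums of squares and coefficients. The variable names are my own; small differences in the last digit can arise because the printed values are rounded.

```python
import math

# Values copied from the regression output and ANOVA table.
ss_regression, ss_error, ss_total = 11000.0, 5304.0, 16304.0
df_regression, df_error, df_total = 1, 23, 24
coef_n, se_n = 28.808, 4.171

ms_error = ss_error / df_error                      # residual mean square
r_sq = ss_regression / ss_total                     # R-Sq
r_sq_adj = 1 - ms_error / (ss_total / df_total)     # R-Sq(adj)
s = math.sqrt(ms_error)                             # S, residual standard deviation
f_stat = (ss_regression / df_regression) / ms_error
t_slope = coef_n / se_n                             # T for the slope

print(round(100 * r_sq, 1), round(100 * r_sq_adj, 1))
print(round(s, 2), round(f_stat, 2), round(t_slope, 2))
```

These reproduce R-Sq = 67.5%, R-Sq(adj) = 66.1%, S = 15.19, F = 47.70 and T = 6.91.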
Check Assumptions
[Figure: Residual Plots for chloro. Panels: Normal Probability Plot (Percent vs Residual), Versus Fits (Residual vs Fitted Value), Histogram (Frequency vs Residual), Versus Order (Residual vs Observation Order)]
Assess the Model Fit
[Figure: Fitted Line Plot, chloro = - 1.69 + 28.81 N, showing the regression line with 95% CI and 95% PI. S = 15.1864, R-Sq = 67.5%, R-Sq(adj) = 66.1%]
Conclusions
• the best fit straight line has an intercept of -1.7 and a slope of 28.8. Thus for every unit increase in N, the chloro measure increases on average by 28.8.
• The R2(adj) value is 66.1%, so we have explained 66% of the variation in chloro by its relationship to N. The S value is 15.19, which describes the variation in the points around this fitted line.
Conclusions
• In the Analysis of Variance table, the Regression term has a p-value of 0.000. Since the p-value is small (<0.05), we can conclude that the regression is significant.
• Check for unusual observations. These may have a large residual, which simply means that the observed value lies far from the fitted line, or they may be influential, which means that the observation has been particularly important in the calculation of the best fitted line.
Example 1: simple regression
[Figure: Scatterplot of log(amm) vs pH (pH units)]
log ammonia model
• Model
log(amm) ~ pH
• Fitted model
log(amm) = 6.45 - 0.75 pH
Fitted line: simple regression
[Figure: Fitted Line Plot, log(amm) = 6.450 - 0.7503 pH (pH units). S = 0.336030, R-Sq = 19.8%, R-Sq(adj) = 19.4%]
Regression Output
The regression equation is
log(amm) = 6.45 - 0.75 pH

Predictor   Coef    ese
Constant    6.45    0.7837
pH          -0.75   0.0998
S = 0.336 R-Sq(adj) = 19.4%
So only 19.4% of variability in log(amm) explained by pH
Check Assumptions
[Figure: Versus Fits (response is log(amm)), Residual vs Fitted Value]
[Figure: Normal Probability Plot (response is log(amm)), Percent vs Residual]
Residual plot shows no pattern, probability plot looks broadly linear
Assess the Model Fit
The R2 (adjusted) value expresses the % variability in the response variable that has been explained. High values are good!!
19.4% of variability in log(amm) explained by pH
Look at the fitted values and compare with the observed data (using the residuals).
Look at the residual plots.
other features
Influential points
• they are key in determining where the fitted line goes.
• they are often at the 'ends' of the line, so at either large or small x values
[Figure: Fitted Line Plot, log(FS) = 0.6868 + 0.6140 log(FC). S = 0.954771, R-Sq = 55.8%, R-Sq(adj) = 55.3%]
[Figure: Fitted Line Plot, log(FS) = 1.393 + 0.4679 log(FC). S = 1.07879, R-Sq = 42.9%, R-Sq(adj) = 42.2%]
Model inference
The main items of note:
• Testing significance of parameters using p-values
• Testing the overall significance of the regression using the ANOVA table
• Assessing the goodness of fit using the R2 (adjusted) value and the residuals.
• Typical questions concerning the slope and intercept of the line are
– Does the line pass through the origin? (is β0 = 0?)
– Is the slope significantly different from 0? (is β1 ≠ 0?)
• Constructing
– a 95% confidence interval for the mean response for a given value of the explanatory variable and a 95% prediction interval for a future observation.
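For reference, the two intervals can be written out in their standard textbook form for simple linear regression. The notation is my own assumption rather than the slides': β0 and β1 denote the intercept and slope, hats denote estimates, S is the residual standard deviation from the regression output, x0 is the chosen value of the explanatory variable, and t is the 97.5% point of the t distribution on n - 2 degrees of freedom.

```latex
\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0, \qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2

% 95% confidence interval for the mean response at x_0
\hat{y}_0 \pm t_{n-2,\,0.975}\; S \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

% 95% prediction interval for a future observation at x_0
\hat{y}_0 \pm t_{n-2,\,0.975}\; S \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
```

The extra 1 under the square root is why the prediction interval is always wider than the confidence interval: it allows for the scatter of a single new observation about the line as well as the uncertainty in the line itself.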
Modelling dissolved oxygen
[Figure: Fitted Line Plot, DO = 11.89 - 0.4752 temperature. S = 2.05981, R-Sq = 47.3%, R-Sq(adj) = 47.3%]
Model 1: DO ~ temperature
Regression output
The regression equation is
DO = 11.9 - 0.475 temp

Predictor   Coef       SE Coef   T        P
Constant    11.8887    0.1303    91.27    0.000
temp        -0.47524   0.01133   -41.95   0.000

S = 2.0598   R-Sq(adj) = 47.3%
So only 47.3% of variability in DO is explained by temperature.
Regression output
Analysis of Variance
Source           DF     SS        MS       F         p-value
Regression       1      7467.7    7467.7   1760.08   0.000
Residual Error   1961   8320.2    4.2
Total            1962   15787.9
The ANOVA table shows the residual sum of squares as 8320.2. The p-value of 0.000 summarises a test of the null hypothesis, model0: DO = error.
We would reject this model in favour of model1: DO = temperature + error
Check Assumptions
[Figure: Versus Fits (response is doa), Residual vs Fitted Value]
[Figure: Normal Probability Plot (response is doa), Percent vs Residual]
Measures of agreement
When there are two methods by which a measurement can be made, then it is important to know how well the methods agree.
As an example, we can consider a recent study of low-level total phosphorus (Nov 2007) conducted in the Edinburgh chemistry lab. This was not a situation where two different analytical techniques were being used; instead, duplicate samples of water were analysed for two different lochs over approximately one month. How well did the duplicate samples agree?
Measures of agreement
First what not to do!
don’t quote a correlation coefficient
A correlation coefficient measures the strength of the relationship between two quantities. If we have two measurement techniques for the same quantity, we would expect them to be related, so a high correlation tells us little. The correlation coefficient is therefore not a measure of agreement.
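A small sketch makes the point concrete. In the hypothetical example below, the second method reads exactly 3 units higher than the first every time, so the two methods never agree, yet the correlation is perfect. The data and function name are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements: method B has a constant bias of +3.
method_a = [5.0, 10.0, 15.0, 20.0, 25.0]
method_b = [a + 3.0 for a in method_a]

r = pearson_r(method_a, method_b)
differences = [b - a for a, b in zip(method_a, method_b)]
print(r, differences)
```

The correlation is 1 although every pair differs by 3, which is why agreement has to be judged from the differences themselves.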
Measures of agreement
A further tool commonly used is the scatterplot.
in this situation care must be taken in constructing the scatterplot:
• the scale on both the x- and y-axis must be the same, and
• as a useful visual aid, it is common to sketch the line of equality (y=x).
[Figure: Scatterplot of TP, duplicate 2 vs duplicate 1]
The scatterplot with the line y=x is shown.
If the two sets are in agreement, then the points should be scattered closely round the line
assessing agreement
The scatterplot with the line y=x is shown. The blue line is the best fitting straight line.
So the results are clearly related, but we knew that anyway.
[Figure: Scatterplot of TP, duplicate 2 vs duplicate 1]
assessing agreement
Bland-Altman method
• This method involves studying the distribution of the between-method differences, and summarizing these data by the mean and 95% range of the differences. (These are called the 95% limits of agreement).
• This is then backed up with a Bland and Altman plot which plots the differences against the mean of the paired measurements, to ensure that the difference data are well behaved.
[Figure: Bland-Altman plot, difference vs mean TP]
The mean difference is -1.30 and the standard deviation of the differences is 1.974.
But is there a suggestion that the difference is larger for higher levels of TP?
Bland-Altman approach
mean difference is -1.30 and standard deviation of the differences is 1.974.
limits of agreement are indicated.
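The limits of agreement can be sketched from the two quoted summaries, assuming they are taken as the mean difference plus or minus 2 standard deviations (a multiplier of 1.96 is also commonly used). The function names and the paired-data helper are hypothetical.

```python
import statistics

def limits_from_summary(mean_diff, sd_diff, k=2.0):
    """Limits of agreement as mean difference -/+ k standard deviations."""
    return mean_diff - k * sd_diff, mean_diff + k * sd_diff

def bland_altman_points(first, second):
    """Return (pair mean, pair difference) for each pair, as plotted on a Bland-Altman plot."""
    return [((a + b) / 2, a - b) for a, b in zip(first, second)]

# Summary values quoted for the total phosphorus duplicates.
lower, upper = limits_from_summary(-1.30, 1.974)
print(round(lower, 2), round(upper, 2))
```

With a multiplier of 2 the limits come out at about -5.25 and 2.65, the values marked on the Bland-Altman plot for these data.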
Bland-Altman approach
[Figure: Bland-Altman plot, difference vs mean TP, with reference lines at 0, 2.65 and -5.25]