Statistics for Water Science
Module 17.1: Descriptive Statistics
Developed by: Host Updated 2/2004: U5-m17-s2
Module 17: Statistics
Statistics: a branch of mathematics dealing with the collection, analysis, interpretation and presentation of masses of numerical data
Descriptive Statistics (Lecture 17.1): basic description of a variable
Exploratory Data Analysis (Lecture 17.2): techniques for understanding data
Hypothesis Testing (Lecture 17.3): asks the question – is X different from Y?
Simple graphical representations of data
Descriptive statistics
Describe basic characteristics of a population of numbers
Central tendency or “middleness”: means, medians and others
Variance or “spread” of data: standard deviation
The range of data: min, max and percentiles
Adapted from Ratti and Garton (1994)
Precision, accuracy and bias
Precision: Tendency to have values closely clustered around the mean
Accuracy: Tendency of an estimator to predict the value it was intended to estimate
Bias: A systematic error in prediction
[Figure: curling targets in a 2 x 2 grid. Panel labels across: Unbiased / Biased and Accurate / Not Accurate; panel labels down: Not Precise / Precise]
The yellow curling rocks represent means from repeated samples
Green dots are the mean value
Spread is analogous to the standard error
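A small simulation can make precision and bias concrete. The sketch below is illustrative only: the true value, the offsets and the scatter values are made-up numbers, and `summarize` is a hypothetical helper that is not part of the lecture.

```python
import random
import statistics

def summarize(estimates, true_value):
    """Return (bias, spread) for repeated estimates of a known true value."""
    bias = statistics.mean(estimates) - true_value   # systematic error
    spread = statistics.stdev(estimates)             # small spread = high precision
    return bias, spread

rng = random.Random(1)   # fixed seed so the run is repeatable
TRUE_VALUE = 10.0        # hypothetical quantity being estimated

# Four hypothetical estimators: (offset from the truth, scatter)
cases = {
    "unbiased, precise":     (0.0, 0.2),
    "unbiased, not precise": (0.0, 2.0),
    "biased, precise":       (3.0, 0.2),
    "biased, not precise":   (3.0, 2.0),
}

results = {}
for label, (offset, scatter) in cases.items():
    estimates = [rng.gauss(TRUE_VALUE + offset, scatter) for _ in range(5000)]
    results[label] = summarize(estimates, TRUE_VALUE)
    bias, spread = results[label]
    print(f"{label:22s} bias = {bias:+.2f}  spread = {spread:.2f}")
```

A precise estimator can still be badly biased, which is why the two properties are reported separately.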
Finding the middle: The arithmetic mean
Between 1998 and 2002, the Ice Lake RUSS unit collected 2120 temperature readings at depths of 1-4 m
What is the average June temperature?
[Figure: histogram of surface temperature. x-axis: Temperature; y-axis: # of Observations]
Finding the middle: The arithmetic mean
Not too hard: add ’em up, divide by n
[Figure: histogram of surface temperature. x-axis: Temperature; y-axis: # of Observations]
Sum of temperatures = 39179.3
$\bar{x} = \frac{\sum x_i}{n} = \frac{39179.3}{2120} = 18.48$ C
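The same arithmetic can be sketched in a few lines of Python. The sum and the number of readings are the values quoted on the slide; `arithmetic_mean` is a hypothetical helper for the case where the raw readings are available as a list.

```python
# Values quoted on the slide.
total_temperature = 39179.3   # sum of all surface temperature readings (deg C)
n_readings = 2120             # number of readings

mean_temperature = total_temperature / n_readings
print(f"Mean surface temperature: {mean_temperature:.2f} C")

def arithmetic_mean(values):
    """Add 'em up, divide by n."""
    return sum(values) / len(values)
```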
Expressing variability: Standard deviation (SD)
Note that there is ‘scatter’ around the mean
The standard deviation quantifies how wide or narrow this scatter is
For this data set, the SD is 2.34 C
Mean and SD are often combined: 18.48 +/- 2.34 C
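For readers who want to see where an SD comes from, here is a minimal sketch of the sample standard deviation (the n - 1 form). The readings are hypothetical stand-ins, not the Ice Lake data.

```python
import math
import statistics

# Hypothetical readings (deg C), NOT the Ice Lake data.
readings = [16.2, 17.9, 18.4, 19.1, 21.3]

mean = statistics.mean(readings)

# Sample SD by hand: square root of the summed squared deviations over (n - 1).
squared_deviations = [(x - mean) ** 2 for x in readings]
sd_by_hand = math.sqrt(sum(squared_deviations) / (len(readings) - 1))

# The standard library computes the same quantity.
sd_library = statistics.stdev(readings)

print(f"{mean:.2f} +/- {sd_library:.2f} C")
```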
Comparing data sets
Let’s consider a second data set, shown in blue. This is the mean seasonal temperature in the lower reaches of the lake (8-13 m)
n = 3097
Comparing data sets
Two things to note:
It’s a lot colder at the bottom of the lake!
The temperatures are much less variable – why?
Means and standard deviations for epilimnetic and hypolimnetic temperatures
         Mean (C)  SD (C)
Surface  18.48     2.34
Bottom   5.96      0.85
Standard deviation: Fun facts
The SD is always in the same units as the mean
For roughly normally distributed data, about 68% of the values fall within +/- 1 SD of the mean and about 95% within +/- 2 SD
If the SD is larger than the mean (e.g. 20 +/- 24), your data are pretty flaky
Definition of flaky – the data are so widely scattered that the mean is, well, meaningless. In this case, use some other measure of middleness, such as the geometric mean or median
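The 68% and 95% figures can be checked empirically on data that are close to normally distributed. The sketch below uses simulated values; the mean and SD passed to the generator echo the surface temperature summary, but these are not the real readings, and `fraction_within` is a hypothetical helper.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated, normally distributed values (NOT the real readings).
values = [rng.gauss(18.48, 2.34) for _ in range(50_000)]

mean = statistics.mean(values)
sd = statistics.stdev(values)

def fraction_within(k):
    """Fraction of values within +/- k standard deviations of the mean."""
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)

within_1sd = fraction_within(1)
within_2sd = fraction_within(2)
print(f"within 1 SD: {within_1sd:.1%}, within 2 SD: {within_2sd:.1%}")
```

For strongly skewed data these fractions can be very different, which is one reason the mean +/- SD summary breaks down for "flaky" data.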
Using geometric means: Fecal coliform example
What about data that are not well behaved?
Fecal coliform counts are often used by management agencies as an indicator of water quality
For non-contact water recreation (boating and fishing), Colorado Public Health states that the fecal coliform count shall not exceed 2000 fecal coliforms per 100 mL (based on the geometric mean of representative samples)
The problem
Fecal coliform counts can range over several orders of magnitude.
For such data, the geometric mean is a more appropriate indicator of central tendency.
Boulder Creek Longitudinal Fecal Coliform Profiles for July, 2000
Sample           F. coli. counts
1                160
2                700
3                60
7                12000
Arithmetic mean  3230
The geometric mean
Multiply ’em together, take the nth root
To be honest, this is a pain without a good calculator, but there’s a shortcut…
Geometric mean = $\sqrt[4]{160 \times 700 \times 60 \times 12000}$
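With a computer the direct "multiply and take the nth root" route is no pain at all. A minimal sketch using the four counts listed for Boulder Creek:

```python
import math

counts = [160, 700, 60, 12000]   # fecal coliforms per 100 mL, from the slide

product = math.prod(counts)                     # multiply 'em together
geometric_mean = product ** (1 / len(counts))   # take the nth root

print(f"Geometric mean: {geometric_mean:.1f} per 100 mL")
```

For long series the product can become extremely large, which is another reason the logarithm shortcut is usually preferred.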
The geometric mean: The easy way
Take the logarithm of each data point (easy)
Average the log values (easier)
Calculate the antilog (sounds hard, is easy)
Sample   F. coli. counts  Log(10)
1        160              2.20
2        700              2.85
3        60               1.78
7        12000            4.08
Average                   2.727
Antilog = 10^2.727 = 533
The geometric mean is about 533 cells/100 ml
Lower than the state regulatory standard of 2000 cells/100 ml
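The three steps (log, average, antilog) translate directly into code. This is a sketch for the four listed counts; `STANDARD` is just a name chosen here for the 2000 per 100 mL limit quoted earlier.

```python
import math

counts = [160, 700, 60, 12000]   # fecal coliforms per 100 mL
STANDARD = 2000                  # regulatory limit quoted in the lecture

# Step 1: take the base-10 logarithm of each data point.
logs = [math.log10(c) for c in counts]

# Step 2: average the log values.
mean_log = sum(logs) / len(logs)

# Step 3: take the antilog of the average.
geometric_mean = 10 ** mean_log

for count, log_value in zip(counts, logs):
    print(f"{count:>6d}  {log_value:.2f}")
print(f"average log = {mean_log:.3f}")
print(f"geometric mean = {geometric_mean:.1f} per 100 mL")
print("below standard" if geometric_mean < STANDARD else "exceeds standard")
```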
Fun facts about geometric means
For positive values, the geometric mean is never greater than the arithmetic mean; the two are equal only when all values are identical.
The ‘shortcut’ calculation works with either natural logs or base 10 logs.
The geometric mean tends to dampen the effect of very low or very high values, and is useful when values range from 10-10,000 over a given period.
Excel has a GEOMEAN function. Life is good.
Use of the geometric mean is a standard for most wastewater discharge and beach monitoring programs: beach standards are typically 200 counts/100 ml.
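Two of these facts are easy to verify numerically: the choice of log base does not matter, and the geometric mean does not exceed the arithmetic mean. In the sketch below `geomean_from_logs` is a hypothetical helper; `statistics.geometric_mean` is the standard-library function (Python 3.8 or later).

```python
import math
import statistics

counts = [160, 700, 60, 12000]   # fecal coliforms per 100 mL

def geomean_from_logs(values, log, antilog):
    """Average the logs, then undo the log with the matching antilog."""
    mean_log = sum(log(v) for v in values) / len(values)
    return antilog(mean_log)

gm_base10 = geomean_from_logs(counts, math.log10, lambda y: 10 ** y)
gm_natural = geomean_from_logs(counts, math.log, math.exp)
gm_library = statistics.geometric_mean(counts)
arithmetic = statistics.mean(counts)

print(f"base 10: {gm_base10:.1f}  natural: {gm_natural:.1f}  library: {gm_library:.1f}")
print(f"arithmetic mean: {arithmetic}")
```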
Descriptive statistics: Min, Max, and Median
Ice Lake  Mean   SD    Min   Max   Median
Surface   19.59  2.28  12.1  27.1  18.2
Bottom    5.96   0.85  4.3   9.0   5.9
When to use medians: Stream turbidity levels
Background:
• Turbidity in streams makes the water appear cloudy (muddy), mostly from suspended sediments. It’s bad for fish, their eggs and their food (bugs) – particularly cold water species such as brook trout.
• Minnesota Water Pollution Rules set a Chronic Standard of 10 NTU - the highest level to which these organisms can be exposed indefinitely without causing chronic toxicity (see Notes for reference website).
• Tischer Creek is a trout stream in Duluth, MN with a nearly continuous turbidity record in summer/fall 2002. Let’s look at a 30 day period in midsummer and decide what the level of exposure was for the fish.
Medians: the middlemost value
Prevents being misled by a few very small or very large values
Consider salaries within a hypothetical company
Position        Salary
CEO             $350,000
Middle manager  $88,000
Worker 1        $24,000
Worker 2        $22,000
Worker 3        $18,000
Mean            $100,400
Median          $24,000
Which is the more appropriate measure of a typical salary?
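The salary comparison is easy to reproduce from the five listed salaries. A minimal sketch using Python's standard library:

```python
import statistics

# Salaries from the hypothetical company on the slide.
salaries = {
    "CEO": 350_000,
    "Middle manager": 88_000,
    "Worker 1": 24_000,
    "Worker 2": 22_000,
    "Worker 3": 18_000,
}

values = list(salaries.values())
mean_salary = statistics.mean(values)       # pulled upward by the CEO
median_salary = statistics.median(values)   # the middlemost value

print(f"Mean:   ${mean_salary:,.0f}")
print(f"Median: ${median_salary:,.0f}")
```

Four of the five employees earn less than the mean, so the median is the better description of a typical salary here.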
Medians: a real world example
Tischer Creek: July 13 - Aug 12, 2002
~ 30 days straddling the late July storm
Statistic                  Value
Mean                       13.1
Standard Error             0.93
Median                     1.0
Mode                       0.0
Standard Deviation         48.0
Sample Variance            2301.1
Kurtosis                   153.9
Skewness                   9.6
Range                      1017.2
Minimum                    0
Maximum                    1017.2
Sum                        35061.7
Count                      2679
Confidence Level (95.0%)   1.82
[Figure: Tischer Creek turbidity, 13 Jul - 12 Aug 2002, 30 d spanning the late July storm. x-axis: Date 2002; y-axis: Turbidity (NTUs)]
Summary, 30 d (7/13 - 8/12/02): Mean +/- s.e. 13.1 +/- 0.9 NTU; Median 1 NTU; Range 0.0 - 1017 NTU
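The standard error and the 95% confidence level in the summary follow from the standard deviation and the count. The sketch below assumes the normal critical value 1.96; a t-based value would be almost identical for a sample this large.

```python
import math

sd = 48.0    # standard deviation (NTU), from the summary statistics
n = 2679     # number of turbidity readings

# Standard error of the mean: SD divided by the square root of n.
standard_error = sd / math.sqrt(n)

# Half-width of an approximate 95% confidence interval for the mean.
# 1.96 is the normal critical value (an assumption made for this sketch).
ci_half_width = 1.96 * standard_error

print(f"standard error = {standard_error:.2f} NTU")
print(f"95% confidence half-width = {ci_half_width:.2f} NTU")
```

Note how much smaller the standard error is than the SD: it describes uncertainty in the mean, not the scatter of individual readings.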
Frequency Distribution: Jul 13 - Aug 12
Tischer Creek – Summer 2002
[Figure: histogram of turbidity. x-axis: Turbidity (NTUs), bins from 0 to 957 plus “More”; y-axis: Frequency, 0 to 2500]
Note that these data are highly skewed, with >80% of the values in the 20-40 NTU range
There is one value of 1017 NTU, and there is no valid reason to delete it.
Tischer Creek – Summer 2002 Storm Period
Stream Data Visualization
Another plot of Tischer from midsummer 2002
Means vs Medians: Which represent the data better?
The mean of 13 NTU for the 30 day period suggests that the chronic toxicity standard was violated
The standard deviation was high (48 NTU) relative to the mean, so the coefficient of variation was a whopping 369%: CV = (48/13)*100
Although the range was high, from 0 to 1017 NTU, “most of the time” the stream ran clear with values <<10 NTU. The mode (most common value) was in fact 0
The median (the 50th percentile) was 1.0 NTU and perhaps best characterizes the state of turbidity in the stream and the level of exposure of the fish.
Determining chronic exposure values for “flashy” data is not trivial
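The coefficient of variation and the comparison against the 10 NTU chronic standard can be sketched as follows. `coefficient_of_variation` and `CHRONIC_STANDARD_NTU` are names chosen here for illustration; the numbers are the ones quoted for Tischer Creek.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: the SD expressed relative to the mean."""
    return sd / mean * 100

# Rounded values, as used in the CV calculation on the slide.
cv = coefficient_of_variation(sd=48, mean=13)
print(f"CV = {cv:.0f}%")

CHRONIC_STANDARD_NTU = 10          # Minnesota chronic turbidity standard
mean_ntu, median_ntu = 13.1, 1.0   # 30 day summary for Tischer Creek

# The two measures of "middleness" lead to opposite conclusions.
mean_exceeds = mean_ntu > CHRONIC_STANDARD_NTU
median_exceeds = median_ntu > CHRONIC_STANDARD_NTU
print(f"mean exceeds standard: {mean_exceeds}")
print(f"median exceeds standard: {median_exceeds}")
```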
Excel functions for descriptive statistics
Format: =STATISTIC(datarange)
Statistic           Function
Mean                =AVERAGE()
Median              =MEDIAN()
Standard Deviation  =STDEV()
Minimum             =MIN()
Maximum             =MAX()
Geometric mean      =GEOMEAN()
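The same statistics are available outside a spreadsheet. As a rough equivalent, the sketch below computes each one with Python's standard library; the data values are hypothetical, and the comments name the corresponding Excel function.

```python
import statistics

data = [2.0, 4.0, 4.0, 8.0, 16.0]   # hypothetical values

summary = {
    "mean": statistics.mean(data),               # AVERAGE
    "median": statistics.median(data),           # MEDIAN
    "sd": statistics.stdev(data),                # STDEV (sample SD, n - 1)
    "min": min(data),                            # MIN
    "max": max(data),                            # MAX
    "geomean": statistics.geometric_mean(data),  # GEOMEAN
}

for name, value in summary.items():
    print(f"{name:8s} {value:.3f}")
```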
Upcoming: How can we tell if two populations of numbers are different?