1 lab basicstatisticsfall2013

9/3/2013


Lab #1 Basic Statistics, EVEN 3321

• Definition of STATISTICS

• 1: a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data

• 2: a collection of quantitative data

• Origin of STATISTICS: German Statistik study of political facts and figures, from New Latin statisticus of politics, from Latin status state

• First Known Use: 1770

• Rhymes with STATISTICS: ballistics, ekistics, linguistics, logistics, patristics, stylistics

• http://www.merriam-webster.com/dictionary/statistics

Statistics


Why is this important?

Environmental Sampling

∗ Need to know relationships between quantities

∗ Parameters (examples):

– pH

– Conductivity

– Particle concentration

– Amount of a chemical or other material in air, water, soil

– Bacteria counts

Instrumentation

∗ pH meter

∗ Micro-balance

∗ Gas Chromatography

∗ Ozone monitor

∗ ICPMS

∗ TOC

Morning Session of FE Exam

Engineering Probability and Statistics Topic Area

The following subtopics are covered in the Engineering Probability and Statistics portion of the FE Examination:

A. Measures of central tendencies and dispersions (e.g., mean, mode, standard deviation)

B. Probability distributions (e.g., discrete, continuous, normal, binomial)

C. Conditional probabilities

D. Estimation (e.g., point, confidence intervals) for a single mean

E. Regression and curve fitting

F. Expected value (weighted average) in decision-making

G. Hypothesis testing

The Engineering Probability and Statistics portion covers approximately 7% of the morning session test content. Reference: http://www.feexam.org/ProbStats.html

FE Exam


• “Sample” versus “population”

• Random variables

• Population mean (μ), variance (σ²), and standard deviation (σ); kurtosis, skewness

• Sample counterparts: sample mean (ȳ), variance (s²), and standard deviation (s)

• Frequency distribution/histogram (relates to skewness)

• Boxplots

• Precision and accuracy, Confidence interval

• Linear regression

Some Key Ideas

• It is impossible to determine the concentrations of a given pollutant at every possible location at a site.

• Statistical methods allow us to use a small number of samples to make inferences about the entire site.

• A single sample is one member of the set of all possible samples that could be taken from a given site.

–Multivariate data sets have several data values generated for each location and time.

–As opposed to univariate data sets.

• The hypothetical set of all possible values is referred to as the population.

Key Ideas: continued


• Number of samples collected is the sample size (n).

• A random variable is a quantity whose value depends on the outcome of a random process, so it cannot be predicted exactly in advance.

• Experimental observations are considered random variables.

• Experimental errors

Key Ideas continued

∗ Experimental measurements are always imperfect:

∗ Measured value = true value ± error

∗ The error is a combined measure of the inherent variation of the phenomenon we are observing and the numerous factors that interfere with the measurement.

∗ Any quantitative result should be reported with an accompanying estimate of its error.

∗ Systematic errors (or determinate errors) can be traced to their source (e.g., improper sampling or analytical methods).

∗ Random errors (or indeterminate errors) are random fluctuations and cannot be identified or corrected for.

Experimental Errors


Example: Population versus Sample

[Figure: two panels of ozone concentration (ppb): by day for April 2013 (days 1 to 30), and by hour over 24 hours (hours 1 to 24).]

• Accuracy is the degree of agreement of a measured value with the true or expected value.

• Precision is the degree of mutual agreement among individual measurements (x1, x2, …xn) made under the same conditions.

• Precision measures the variation among measurements and may be expressed as sample standard deviation (s):

Accuracy and Precision

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} $$



Example: Five analysts were each given five samples that were prepared to have a known concentration of 8.0 mg/L. The results are summarized in the figure below.
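A minimal Python sketch of how accuracy and precision could be quantified for an example like this one. The analyst names and measured values are hypothetical (the data behind the figure are not reproduced here); only the known concentration of 8.0 mg/L comes from the example. Accuracy is summarized as the bias (mean minus true value) and precision as the sample standard deviation s.

```python
import statistics

TRUE_VALUE = 8.0  # mg/L, known concentration from the example

# Hypothetical results (mg/L) for two analysts; NOT the data behind the figure.
results = {
    "analyst_A": [7.9, 8.1, 8.0, 7.8, 8.2],  # centered on the true value
    "analyst_B": [9.0, 9.1, 8.9, 9.0, 9.0],  # tightly grouped but offset
}

def bias_and_precision(values, true_value):
    """Return (bias, s): mean error and sample standard deviation (n - 1)."""
    bias = statistics.mean(values) - true_value
    s = statistics.stdev(values)
    return bias, s

summary = {name: bias_and_precision(v, TRUE_VALUE) for name, v in results.items()}
for name, (bias, s) in summary.items():
    print(f"{name}: bias = {bias:+.2f} mg/L, s = {s:.2f} mg/L")
```

In this made-up case, analyst B is more precise (smaller s) but less accurate (larger bias) than analyst A, which is the kind of distinction the example is meant to draw out.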



• A random variable, y, is characterized by:

– A set of possible values.

– An associated set of relative likelihoods (this is called a probability distribution).

• Random variables can be discrete or continuous.

– e.g., a die toss is a discrete random variable.

– e.g., ozone concentration is a continuous random variable.

• Experimental observations are considered random variables.

Random Variables

• When we sample the environment, the sample values are known, but not the population values.

• For a sample size n, the number of times a specific value occurs is called the frequency.

• The frequency divided by the sample size n is the relative frequency.

• The relative frequency is an estimate of the probability that a given value occurs in the population.

• If we compute the relative frequencies for each possible value of a random variable, we have an estimate of the probability distribution of the random variable (see next slide).

Frequency Distribution
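The frequency and relative frequency calculations above can be sketched in a few lines of Python. The die tosses are hypothetical values chosen for illustration, not data from the lab.

```python
from collections import Counter

# Hypothetical sample: 12 tosses of a six-sided die (a discrete random variable).
tosses = [3, 1, 6, 3, 2, 5, 3, 6, 1, 4, 3, 6]
n = len(tosses)  # sample size

frequency = Counter(tosses)  # value -> number of occurrences
relative_frequency = {value: count / n for value, count in sorted(frequency.items())}

for value, rel in relative_frequency.items():
    print(f"value {value}: frequency {frequency[value]}, relative frequency {rel:.3f}")
```

The relative frequencies sum to 1 and together form an estimate of the probability distribution of the toss.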


• For continuous random variables, we can group the measured values into intervals (or “bins”).

• Plotting the number of values measured in each interval gives a frequency histogram (see next slide).

• Plotting the total number of measured values in or below a given interval gives a cumulative frequency distribution (see next slide).

• To obtain the relative frequency, the number of measured values falling within a given interval is divided by the sample size n.

• The shape of a histogram can allow us to infer the distribution of the population.

Continuous Frequency Distributions
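A minimal sketch of the binning steps described above: counts per interval, cumulative counts, and relative frequencies. The ozone values and the 10 ppb bin width are assumptions made for illustration only.

```python
import math

# Hypothetical ozone measurements in ppb (illustrative values only).
ozone_ppb = [12, 18, 23, 27, 31, 33, 35, 35, 38, 41, 44, 52]
bin_width = 10  # assumed bin width in ppb

n = len(ozone_ppb)
n_bins = math.floor(max(ozone_ppb) / bin_width) + 1

# Count how many values fall in each interval [k*width, (k+1)*width).
counts = [0] * n_bins
for value in ozone_ppb:
    counts[int(value // bin_width)] += 1

# Cumulative frequency: number of values in or below each interval.
cumulative = []
running = 0
for c in counts:
    running += c
    cumulative.append(running)

# Relative frequency: count in each interval divided by the sample size.
relative = [c / n for c in counts]

for k, (c, cum, rel) in enumerate(zip(counts, cumulative, relative)):
    low, high = k * bin_width, (k + 1) * bin_width
    print(f"[{low:2d}, {high:2d}) count={c} cumulative={cum} relative={rel:.2f}")
```

Plotting `counts` against the intervals gives the frequency histogram, and plotting `cumulative` gives the cumulative frequency distribution.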

Histograms: Normal (Gaussian) and skewed


Histograms (cont.): Bimodal and Uniform

∗ In general, we do not know the mean and standard deviation of the underlying population.

∗ The population mean and standard deviation can be estimated from the sample mean ȳ and the sample standard deviation s:

∗ Note that in environmental monitoring, the standard deviation s for the sample depends on the amount of sample collected.

Sample Mean and Standard Deviation

$$ \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i $$

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} $$
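The sample mean and sample standard deviation can be computed directly from their definitions. The sketch below uses hypothetical measurements and cross-checks the result against Python's standard `statistics` module.

```python
import math
import statistics

# Hypothetical measurements (illustrative values only).
y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(y)

y_bar = sum(y) / n                            # sample mean
sum_sq = sum((yi - y_bar) ** 2 for yi in y)   # sum of squared deviations
s = math.sqrt(sum_sq / (n - 1))               # sample standard deviation

print(f"mean = {y_bar:.3f}, s = {s:.3f}")

# Cross-check against the standard library (stdev also divides by n - 1).
assert math.isclose(y_bar, statistics.mean(y))
assert math.isclose(s, statistics.stdev(y))
```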


In many situations, environmental data involves working with a small sample set.

Dividing by n − 1 rather than n is known as Bessel’s correction; it gives an unbiased estimate of the population variance. http://en.wikipedia.org/wiki/Bessel%27s_correction

Another way of looking at it:

The POPULATION VARIANCE (σ²) is a PARAMETER of the population.

The SAMPLE VARIANCE (s²) is a STATISTIC of the sample.

We use the sample statistic to estimate the population parameter.

The sample variance s² is an estimate of the population variance σ².

Note: Excel 2010 has two functions for standard deviation: one for a population (=STDEV.P(range)) and one for a sample (=STDEV.S(range)).

Short video: https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/review-and-intuition-why-we-divide-by-n-1-for-the-unbiased-sample-variance

A note about (n-1)
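One way to see why n − 1 is used is a small simulation: draw many small samples from a population whose variance is known, and compare the average of the two estimators. This is an illustrative sketch, not part of the lab; the population (normal, σ = 1), the sample size, the trial count, and the seed are all assumptions.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VARIANCE = 1.0  # samples are drawn from a normal population with sigma = 1
n = 5                # small sample size
trials = 20000

sum_biased = 0.0
sum_unbiased = 0.0
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    sum_biased += ss / n          # divide by n
    sum_unbiased += ss / (n - 1)  # divide by n - 1 (Bessel's correction)

avg_biased = sum_biased / trials
avg_unbiased = sum_unbiased / trials
print(f"true variance:   {TRUE_VARIANCE:.3f}")
print(f"divide by n:     {avg_biased:.3f}")
print(f"divide by n - 1: {avg_unbiased:.3f}")
```

On average, dividing by n underestimates the population variance by a factor of (n − 1)/n, which is 0.8 for n = 5, while dividing by n − 1 lands close to the true value.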

• Most random variables have two important characteristic values: the mean (μ) and the variance (σ²).

• The square root of the variance is the standard deviation (σ).

• The mean is also called the expected value of the random variable.

• The mean represents the balance point of the distribution on a graph.

• The variance and standard deviation both quantify how much the possible values disperse away from the mean.

• For a normal distribution, about 68% of values lie within µ ± σ, 95% within µ ± 2σ, and 99.7% within µ ± 3σ.

Mean, Variance, Standard Deviation
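The 68%, 95%, and 99.7% figures for a normal distribution can be checked with the `NormalDist` class in Python's standard `statistics` module. The helper name `fraction_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def fraction_within(k):
    """Fraction of a normal population lying within mu +/- k*sigma."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

coverage = {k: fraction_within(k) for k in (1, 2, 3)}
for k, frac in coverage.items():
    print(f"within {k} sigma: {100 * frac:.1f}%")
```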


Mean, Median, Mode

∗ The coefficient of variation (CV) is a simplistic test to determine whether the data can be characterized by a normal distribution. The CV is the standard deviation divided by the mean. The closer the ratio is to zero, the better the possibility that the data have a normal distribution. A number greater than unity indicates a non-normal distribution.

∗ Skewness is a measure of symmetry, or the lack of it, and can be zero (symmetric), negative, or positive.

∗ Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.

Coefficient of Variation, Skewness, Kurtosis
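A minimal sketch of the three quantities above: the ratio of standard deviation to mean (commonly called the coefficient of variation), skewness, and kurtosis. It uses simple moment-based formulas, which is an assumption; spreadsheet functions such as Excel's SKEW and KURT apply sample-size adjustments, so their values differ somewhat for small samples. The data are hypothetical.

```python
import statistics

def shape_statistics(values):
    """Return (cv, skewness, excess_kurtosis) using simple moment formulas."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)            # sample standard deviation
    cv = s / mean                           # standard deviation divided by mean
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5               # 0 for perfectly symmetric data
    excess_kurtosis = m4 / m2 ** 2 - 3.0    # 0 for a normal distribution
    return cv, skewness, excess_kurtosis

# Hypothetical right-skewed data (illustrative values only).
data = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0]
cv, skew, kurt = shape_statistics(data)
print(f"CV = {cv:.2f}, skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

A positive skewness here reflects the long right tail created by the single large value.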


Skewness/Kurtosis

Box-and-Whisker Plot


Normal Distribution at 68%, 95%, 99%

The α value is the probability that a random variable will fall in the upper or lower tail of a probability distribution.

For example, α = 0.05 implies that there is a 0.95 probability that a random variable will not fall in the upper or lower tail of the probability distribution.

Statistical tables of probability distributions (e.g., normal and Student's t) list probabilities that a random variable will fall in the upper tail only.

α Values for Probability Distributions
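Because tables list upper-tail probabilities only, a two-sided statement with total tail probability α uses the critical value that leaves α/2 in the upper tail. The sketch below looks up that value for the normal distribution with Python's standard `NormalDist`; the function name is illustrative, and a Student's t table would be used instead for small samples.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def upper_tail_critical_value(confidence):
    """z value leaving alpha/2 in the upper tail for a two-sided interval."""
    alpha = 1.0 - confidence
    return standard_normal.inv_cdf(1.0 - alpha / 2.0)

critical = {c: upper_tail_critical_value(c) for c in (0.90, 0.95, 0.99)}
for confidence, z in critical.items():
    alpha = 1.0 - confidence
    print(f"{confidence:.0%} confidence: alpha/2 = {alpha / 2:.3f}, z = {z:.3f}")
```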


• We typically want to determine a confidence interval for which we are 90% confident that a random variable will not fall in either tail.

• In this case, we use α/2 = 0.05.

• Similarly, to determine 95% and 99% confidence intervals, we would use α/2 = 0.025 and 0.005, respectively.

α values and confidence intervals

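$$ \mathrm{CI} = \bar{y} \pm t \frac{s}{\sqrt{n}} = \bar{y} \pm (t)(s_{\bar{y}}) $$

where t is the Student's t value for the chosen α/2 and n − 1 degrees of freedom, and s_ȳ = s/√n is the standard error of the mean.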
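As a worked example of a confidence interval for a mean, the sketch below uses the mean, standard deviation, and count from the Excel summary statistics reported later in this lab (mean 74.93, standard deviation 18.76, count 14). The t value of 2.160 is an assumption read from a standard Student's t table for α/2 = 0.025 and 13 degrees of freedom.

```python
import math

# Values taken from the Excel summary statistics reported in this lab.
mean = 74.92857143
s = 18.75946647  # sample standard deviation
n = 14           # count

# Assumption: two-sided 95% Student's t value for n - 1 = 13 degrees of
# freedom, read from a standard t table (alpha/2 = 0.025).
t_value = 2.160

standard_error = s / math.sqrt(n)
half_width = t_value * standard_error
lower, upper = mean - half_width, mean + half_width

print(f"standard error = {standard_error:.3f}")
print(f"95% CI: {mean:.2f} +/- {half_width:.2f}  ({lower:.2f} to {upper:.2f})")
```

The computed standard error and half-width agree, to table precision, with the "Standard Error" (5.01) and "Confidence Level(95.0%)" (10.83) entries in the Excel output.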

Regression analysis (dependency): an analysis focused on the degree to which one variable (the dependent variable) depends on one or more other variables (the independent variables).

(examples: ozone vs. temperature, bacteria counts vs. chlorination treatment)

Correlation analysis: neither variable is identified as more important than the other, but the investigator is interested in their interdependence or joint behavior.

NOTE: Correlation or association is not causation.

Linear Regression


Linear Regression Examples

• Equation of the fitted line (slope-intercept form): y = mx + b, where m is the slope and b is the y-intercept.

• The coefficient of determination, R², is used in the context of statistical models whose main purpose is the prediction of future outcomes on the basis of other related information. It is the proportion of variability in a data set that is accounted for by the statistical model. It provides a measure of how well future outcomes are likely to be predicted by the model.

R² does NOT tell whether:

– the independent variables are a true cause of the changes in the dependent variable

– omitted-variable bias exists

– the correct regression was used

– the most appropriate set of independent variables has been chosen

– there is co-linearity present in the data

– the model might be improved by using transformed versions of the existing set of independent variables

R², Slope Equation
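A minimal sketch of an ordinary least-squares fit of y = mx + b together with the coefficient of determination. The temperature and ozone values are hypothetical and chosen only to illustrate the calculation; the function name `linear_fit` is also illustrative.

```python
def linear_fit(x, y):
    """Least-squares line y = m*x + b and coefficient of determination R^2."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    m = s_xy / s_xx                      # slope
    b = y_bar - m * x_bar                # intercept
    ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))  # unexplained
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)                     # total
    r_squared = 1.0 - ss_res / ss_tot
    return m, b, r_squared

# Hypothetical data: temperature (deg C) vs. ozone (ppb), illustrative only.
temperature = [15.0, 18.0, 21.0, 24.0, 27.0, 30.0]
ozone = [28.0, 31.0, 37.0, 40.0, 47.0, 49.0]

m, b, r2 = linear_fit(temperature, ozone)
print(f"ozone = {m:.2f} * temperature + {b:.2f}, R^2 = {r2:.3f}")
```

A high R² for data like these says only that the line accounts for most of the variability in the sample; as the list above notes, it says nothing about causation or about whether a straight line is the right model.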


Statistics Excel 2010

Summary Statistics

http://academic.brooklyn.cuny.edu/economic/friedman/descstatexcel.htm

Column1

Mean 74.92857143

Standard Error 5.013678308

Median 78.5

Mode 80

Standard Deviation 18.75946647

Sample Variance 351.9175824

Kurtosis 1.923164749

Skewness -1.31355395

Range 71

Minimum 29

Maximum 100

Sum 1049

Count 14

Confidence Level(95.0%) 10.83139138

Ozone April 2013 Histogram and Summary Statistics

Mean 35.48948

Median 35

Mode 35

Standard Dev 10.72231

Sample Variance 114.968

Kurtosis -0.20548

Skewness 0.146677

Minimum 2

Maximum 68

Sum 25304

Count 713


April 2013 Ozone Box-Whisker

Population size: 713

Median: 35

Minimum: 2

Maximum: 68

First quartile: 28

Third quartile: 43

Interquartile Range: 15

Outliers: 2 5 5 5 6 8 10 11 11 68 65 64 62 62 61 61 60 59 58 58 58

∗ Access TCEQ web site data.

∗ Importing files into Excel and Matlab.

∗ Using Excel for statistical work, Matlab for statistics. Plotting histograms.

∗ Read the papers posted on Blackboard: Statistics for Analysis of Experimental Data, Errors and Limitation Associated with Regression, and Why we divide by n-1.

∗ Lab will be assigned.

Lab Thursday


Video

https://www.khanacademy.org/math/probability

Statistics Handbook

http://www.itl.nist.gov/div898/handbook/index.htm

Elementary Statistics https://www.udacity.com/course/st095

Self Study/Supplemental

Technology
