BIOL 582
Lecture Set 1
Some basics
Data, Parameters, and Statistics
BIOL 582 What are data?
A datum (plural = data) describes some characteristic or count that is generally numerable. A statistic summarizes data in some way (e.g., average).
Example: Said to the careless driver by a concerned friend, “Hey, don’t become a statistic!”
Should be: “Hey, don’t become a datum!”
Example: In the USA, 75% of all auto accidents were alcohol related. In this case, 75% is a statistic that summarizes the proportion of accidents related to alcohol.
Data, Parameters, and Statistics
Population: N = Number of individuals (population size)
Sample: n = Number of individuals (sample size)
X = Variable of interest
Data = The values of X for the individuals
Parameter = A number summarizing all data from the N individuals of the population
Statistic = A number summarizing the n data of the sample
BIOL 582 What are data?
Types of Data (Variables):
1. Qualitative or Categorical (e.g., sex, color, presence/absence).
2. Quantitative
i. Discrete (e.g., number of fin rays, days to maturation).
ii. Continuous (e.g., height, weight, oxygen consumption rate).
BIOL 582 Data types
Dichotomous key for variable types:
1 a. Data fit into a non-numerical category: Qualitative
1 b. Data are certainly numerical: Quantitative (Go to 2)
2 a. Data are countable in distinct units and data represent an order: Discrete
2 b. Data are points along a spectrum of infinite possibilities: Continuous
An example for distinguishing between continuous and discrete quantitative data:
Team             BA         Rank (# hits)
Texas Rangers    0.278564   3
Chicago Cubs     0.281582   1
Boston Red Sox   0.279417   2
BA is Continuous; Rank is Discrete.
BIOL 582 Organizing Qualitative Data
Examples:
[Figure: Bar Graph (Frequency Distribution). x-axis: Gender (F, M); y-axis: Number of Students (0-30).]
[Figure: Pie Chart of Gender (F, M); slices labeled 36% and 64%.]
Different types of bar graphs:
Data for class and gender
Class       Gender   Number
Freshman    Female   11
Freshman    Male     3
Sophomore   Female   8
Sophomore   Male     2
Junior      Female   5
Junior      Male     9
Senior      Female   4
Senior      Male     2
[Figure: Relative Frequency Distribution. x-axis: Year and Gender (F and M within Fr, So, Ju, Se); y-axis: Percent of Students (0-30).]
[Figure: Frequency Distribution. x-axis: Year and Gender (F and M within Fr, So, Ju, Se); y-axis: Number of Students (0-12).]
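As a rough sketch, a relative frequency distribution can be obtained from the class and gender counts by dividing each count by the total. The variable names here are illustrative and are not part of the lecture.

```python
# Counts from the class-and-gender table in this lecture.
counts = {
    ("Freshman", "Female"): 11,
    ("Freshman", "Male"): 3,
    ("Sophomore", "Female"): 8,
    ("Sophomore", "Male"): 2,
    ("Junior", "Female"): 5,
    ("Junior", "Male"): 9,
    ("Senior", "Female"): 4,
    ("Senior", "Male"): 2,
}

total = sum(counts.values())  # total number of students

# Relative frequency (in percent) = count / total * 100
relative = {group: 100 * n / total for group, n in counts.items()}

for (year, gender), pct in relative.items():
    print(f"{year:<10} {gender:<7} {counts[(year, gender)]:>3} {pct:5.1f}%")
```

The counts and the percentages carry the same information; only the scale of the vertical axis changes between the two bar graphs.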
BIOL 582 Organizing Qualitative Data
Different types of bar graphs (same data for class and gender):
[Figure: Frequency Distribution. x-axis: Year and Gender (F and M within Fr, So, Ju, Se); y-axis: Number of Students (0-12).]
[Figure: Pareto Chart. The same bars ordered from most to fewest students: Fr-F, Ju-M, So-F, Ju-F, Se-F, Fr-M, So-M, Se-M; y-axis: Number of Students (0-12).]
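A Pareto chart is a bar graph with the categories sorted from the highest to the lowest frequency. A minimal sketch of that ordering, using the class and gender counts (the short labels such as "Fr-F" are illustrative):

```python
# Counts from the class-and-gender table, keyed by short labels.
counts = {
    "Fr-F": 11, "Fr-M": 3,
    "So-F": 8, "So-M": 2,
    "Ju-F": 5, "Ju-M": 9,
    "Se-F": 4, "Se-M": 2,
}

# Sort categories by count, largest first. Python's sort is stable,
# so tied categories keep their original relative order.
pareto_order = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for label, n in pareto_order:
    print(f"{label:<5} {'#' * n} {n}")
```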
BIOL 582 Organizing Quantitative Data
Recall that quantitative data can be
1. Discrete, or
2. Continuous
But, some of the ways of displaying data are similar to the ways used with Qualitative data.
Consider the following example for discrete data, Exam Scores:
67 64 32 59 85 65 51 73
79 66 67 35 58 57 68 27
23 56 70 46 86 72 65 96
34 87 67 61 81 64 23 53
88 49 73 33 70 80 74 67
77 70 93 69 69 78 74 85
White = Exam 1, Yellow = Exam 2 (color coding on the original slide)
[Figure: Dot Diagram of the exam scores.]
[Figure: Stem and Leaf Plot of the exam scores.]
Notice that displays of quantitative data can be used to show the distribution of data.
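A stem and leaf plot can be sketched in a few lines: the tens digit of each score is the stem and the ones digit is the leaf. This sketch pools all 48 scores from both exams and is only one possible way to build the display.

```python
from collections import defaultdict

# All 48 exam scores listed in this lecture (both exams pooled).
scores = [
    67, 64, 32, 59, 85, 65, 51, 73,
    79, 66, 67, 35, 58, 57, 68, 27,
    23, 56, 70, 46, 86, 72, 65, 96,
    34, 87, 67, 61, 81, 64, 23, 53,
    88, 49, 73, 33, 70, 80, 74, 67,
    77, 70, 93, 69, 69, 78, 74, 85,
]

# Group leaves (ones digit) under their stem (tens digit).
stems = defaultdict(list)
for score in sorted(scores):
    stem, leaf = divmod(score, 10)
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```

Each printed row is a bar of the histogram turned on its side, with the digits themselves forming the bar.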
BIOL 582 Organizing Quantitative Data
Back-to-Back Stem and Leaf Plot
       Exam 1 |   | Exam 2
          3 3 | 2 | 7
          4 2 | 3 | 3 5
              | 4 | 6 9
          8 1 | 5 | 3 6 7 9
  9 8 7 7 7 5 | 6 | 1 4 4 5 6 7 9
9 7 4 4 3 0 0 | 7 | 0 2 3 8
      8 6 5 1 | 8 | 0 5 7
            3 | 9 | 6
BIOL 582 Organizing Quantitative Data
Or, we can simply make a table of the data:

Class Interval      Frequency   Relative Frequency   Cumulative Relative Frequency
90 ≤ score ≤ 100    2           2/48                 2/48
80 ≤ score < 90     7           7/48                 9/48
70 ≤ score < 80     11          11/48                20/48
60 ≤ score < 70     13          13/48                33/48
50 ≤ score < 60     6           6/48                 39/48
40 ≤ score < 50     2           2/48                 41/48
30 ≤ score < 40     4           4/48                 45/48
score < 30          3           3/48                 48/48

Which may look like this if you presented it to someone:

Interval   Frequency   Relative Frequency   Cumulative Relative Frequency
90-100     2           4.2%                 4.2%
80-89      7           14.6%                18.8%
70-79      11          22.9%                41.7%
60-69      13          27.1%                68.8%
50-59      6           12.5%                81.3%
40-49      2           4.2%                 85.4%
30-39      4           8.3%                 93.8%
< 30       3           6.3%                 100.0%
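The frequency table can be rebuilt from the raw scores with a short tally. This is a sketch under two assumptions: scores are whole numbers, and the "< 30" class starts at 0. It prints relative and cumulative relative frequencies as fractions of 48.

```python
# All 48 exam scores listed in this lecture.
scores = [
    67, 64, 32, 59, 85, 65, 51, 73,
    79, 66, 67, 35, 58, 57, 68, 27,
    23, 56, 70, 46, 86, 72, 65, 96,
    34, 87, 67, 61, 81, 64, 23, 53,
    88, 49, 73, 33, 70, 80, 74, 67,
    77, 70, 93, 69, 69, 78, 74, 85,
]

# (label, lowest score, highest score), listed from the top class down.
intervals = [
    ("90-100", 90, 100), ("80-89", 80, 89), ("70-79", 70, 79),
    ("60-69", 60, 69), ("50-59", 50, 59), ("40-49", 40, 49),
    ("30-39", 30, 39), ("< 30", 0, 29),  # assumes no negative scores
]

n = len(scores)
rows = []
cumulative = 0
for label, low, high in intervals:
    freq = sum(low <= s <= high for s in scores)  # scores in this class
    cumulative += freq                            # running total so far
    rows.append((label, freq, cumulative))

for label, freq, cum in rows:
    print(f"{label:<7} {freq:>3}   {freq}/{n}   {cum}/{n}")
```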
BIOL 582 Organizing Quantitative Data
Or like this in graph form:
[Figure: Frequency of Exam scores. x-axis: Class Interval (90-100, 80-89, 70-79, 60-69, 50-59, 40-49, 30-39, < 30); y-axis: Percent Frequency (0%-100%); series: Relative Frequency and Cumulative Relative Frequency.]
BIOL 582 Organizing Quantitative Data
Whereas we called a bar graph like this a frequency (or relative frequency) distribution for qualitative data, for quantitative data we refer to it as a frequency (or relative frequency) HISTOGRAM.
[Figure: Frequency of Exam scores. x-axis: Class Interval (90-100 down to < 30); y-axis: Percent Frequency (0%-100%); series: Relative Frequency and Cumulative Relative Frequency.]
BIOL 582 Organizing Quantitative Data
Something to consider:
[Figure: Frequency of exam scores, with intervals ordered from 90-100 down to < 30. x-axis: Interval; y-axis: Number of scores (0-14).]
&
[Figure: Frequency of exam scores, with intervals ordered from < 30 up to 90-100. x-axis: Interval; y-axis: Number of scores (0-14).]
Are the same!!!
BIOL 582 Organizing Quantitative Data
Something to consider:
[Figure: Frequency of exam scores, shown with a second plot. x-axis: Interval (< 30 up to 90-100); y-axis: Number of scores (0-14).]
Are the same!!!
Just rotate 90°
BIOL 582 Organizing Quantitative Data
Something to consider:
Continuous quantitative data, like discrete quantitative data, can be summarized with frequency histograms, but in both cases, intervals are arbitrary choices.
E.g., GPA scores
3.95 2.70 2.27
3.51 2.64 2.39
3.59 2.83 2.45
3.32 2.86 2.37
3.37 2.77 2.40
3.39 2.27 1.65
3.19 2.15 1.98
3.16 2.35 1.84
3.02 2.40 1.85
3.10 2.10 1.37
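[Figure: Frequency histogram of GPA with three intervals (1-1.99, 2-2.99, 3-3.99); y-axis: Frequency (0-15).]
[Figure: Frequency histogram of GPA with six intervals (1-1.49, 1.5-1.99, 2-2.49, 2.5-2.99, 3-3.49, 3.5-3.99); y-axis: Frequency (0-15).]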
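To see how the choice of interval width changes the histogram, the GPA data can be tallied twice, once with a width of 1.0 and once with 0.5. This sketch uses a hypothetical helper, bin_counts, and works in whole hundredths of a grade point to avoid floating-point edge effects.

```python
from collections import Counter

# The 30 GPA values listed in this lecture.
gpas = [
    3.95, 2.70, 2.27, 3.51, 2.64, 2.39, 3.59, 2.83, 2.45, 3.32,
    2.86, 2.37, 3.37, 2.77, 2.40, 3.39, 2.27, 1.65, 3.19, 2.15,
    1.98, 3.16, 2.35, 1.84, 3.02, 2.40, 1.85, 3.10, 2.10, 1.37,
]


def bin_counts(values, width_hundredths, start_hundredths=100):
    """Count values per interval; widths are given in hundredths of a point."""
    counts = Counter()
    for value in values:
        hundredths = round(value * 100)  # e.g. 2.27 -> 227
        index = (hundredths - start_hundredths) // width_hundredths
        low = start_hundredths + index * width_hundredths
        high = low + width_hundredths - 1
        counts[(low / 100, high / 100)] += 1
    return dict(sorted(counts.items()))


wide = bin_counts(gpas, 100)   # intervals 1-1.99, 2-2.99, 3-3.99
narrow = bin_counts(gpas, 50)  # intervals 1-1.49, 1.5-1.99, ...

for name, table in (("width 1.0", wide), ("width 0.5", narrow)):
    print(name)
    for (low, high), freq in table.items():
        print(f"  {low:.2f}-{high:.2f}: {freq}")
```

The same 30 values give a single broad peak with three intervals and a more detailed shape with six, which is why the interval choice should be reported with the histogram.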
BIOL 582 Organizing Quantitative Data
Something to consider:
When there are enough intervals, the plot can look more like a curve
[Figure: Four frequency curves, each with x-axis 1-13 and y-axis Frequency (0-10): Symmetrical (bell-shaped), Uniform, Skewed Right, Skewed Left.]
These are referred to as distribution shapes.
BIOL 582 Measures of Central Tendency and Dispersion
Distributions have describable shapes: symmetrical, uniform, or skewed (left or right).
In general, there are two characteristics of distributions that we usually wish to describe (there are also others, but we will leave these for other statistics courses):
Center and Spread
When we measure attributes of a population that describe “central tendency” or “dispersion”, we first have to identify whether we are dealing with the whole population or a sample of the population.
A Parameter is a descriptive measure of a population of size N.
A Statistic is a descriptive measure of a sample of size n.
Consider a population with N observations (units). We can represent the population as a set:
$$\{x_1, x_2, \ldots, x_N\}$$
The population arithmetic mean, μ (pronounced “mew”), is defined as follows:
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$
Where $\sum_{i=1}^{N} x_i$ (Σ = sigma) means a summation of xi, with i going from 1 to N.
Likewise, if we take a sample of n observations from the population, we can represent the sample as a set:
$$\{x_1, x_2, \ldots, x_n\}$$
The sample arithmetic mean, x̄ (pronounced “x-bar”), is as follows:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
BIOL 582 Measures of Central Tendency and Dispersion
Example: Recall our GPA data set. Let’s consider this to be the whole population.
3.95 2.70 2.27
3.51 2.64 2.39
3.59 2.83 2.45
3.32 2.86 2.37
3.37 2.77 2.40
3.39 2.27 1.65
3.19 2.15 1.98
3.16 2.35 1.84
3.02 2.40 1.85
3.10 2.10 1.37
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{3.95 + 3.51 + 3.59 + \cdots + 1.85 + 1.37}{30} = \frac{79.24}{30} = 2.64$$
Now let’s take a random sample of n = 9 from the population:
1.84 1.65 3.51
3.32 2.64 2.7
3.16 3.59 3.19
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1.84 + 3.32 + 3.16 + 1.65 + 2.64 + 3.59 + 3.51 + 2.70 + 3.19}{9} = \frac{25.6}{9} = 2.84$$
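The two means can be checked with a few lines of code. This is a minimal sketch; the list names are illustrative, and the sample is the one listed in this example, not a new random draw.

```python
from statistics import fmean

# GPA data set, treated as the whole population (N = 30).
population = [
    3.95, 2.70, 2.27, 3.51, 2.64, 2.39, 3.59, 2.83, 2.45, 3.32,
    2.86, 2.37, 3.37, 2.77, 2.40, 3.39, 2.27, 1.65, 3.19, 2.15,
    1.98, 3.16, 2.35, 1.84, 3.02, 2.40, 1.85, 3.10, 2.10, 1.37,
]

# The sample of n = 9 values listed in the example.
sample = [1.84, 3.32, 3.16, 1.65, 2.64, 3.59, 3.51, 2.70, 3.19]

mu = fmean(population)  # parameter: population mean
x_bar = fmean(sample)   # statistic: sample mean

print(f"mu    = {mu:.2f}  (N = {len(population)})")
print(f"x-bar = {x_bar:.2f}  (n = {len(sample)})")
```

A different random sample would generally give a different x̄, while μ stays fixed for the population.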
BIOL 582 Measures of Central Tendency and Dispersion
Other measures of Central Tendency:
Consider, as an example, the income for 9-month appointments of faculty from two departments at WKU, each with 9 faculty members (purely fictional):
A: {$45,000, $48,000, $48,000, $51,000, $51,000, $51,000, $54,000, $54,000, $57,000}
B: {$45,000, $45,000, $45,000, $45,000, $48,000, $51,000, $54,000, $57,000, $69,000}
For both departments, the mean income is $51,000. Thus, if the population is all 18 incomes, then:
$$\mu = \bar{x}_a = \bar{x}_b = \$51{,}000$$
But, look at the shape of the distributions:
[Figure: Frequency histograms for A and B. x-axis: x $1000 (45, 48, 51, 54, 57, More); y-axis: Frequency (0-4).]
BIOL 582 Measures of Central Tendency and Dispersion
Other measures of Central Tendency (also called measures of location):
Consider, as an example, the income for 9-month appointments of faculty from two departments at SFA, each with 9 faculty members (purely fictional):
A: {$45,000, $48,000, $48,000, $51,000, $51,000, $51,000, $54,000, $54,000, $57,000}
B: {$45,000, $45,000, $45,000, $45,000, $48,000, $51,000, $54,000, $57,000, $69,000}
Median (M): value of the variable that lies in the middle of a data set when arranged in order (i.e., half of the data are above and half the data are below). If there are an even number of values, then the two middle values are averaged.
Midrange: (smallest value + largest value)/2. Midrange indicates where the center of the range is, irrespective of the shape of the distribution.
Mode: The most frequently occurring value of a variable.
For A: M = $51,000, Midrange = $51,000, and Mode = $51,000.
For B: M = $48,000, Midrange = $57,000, and Mode = $45,000.
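The values for A and B can be reproduced with a short script. This is a sketch; midrange is a small helper defined here because Python's statistics module does not provide one.

```python
from statistics import mean, median, mode

# Fictional 9-month incomes for the two departments.
dept_a = [45000, 48000, 48000, 51000, 51000, 51000, 54000, 54000, 57000]
dept_b = [45000, 45000, 45000, 45000, 48000, 51000, 54000, 57000, 69000]


def midrange(values):
    """Midpoint between the smallest and largest value."""
    return (min(values) + max(values)) / 2


for name, incomes in (("A", dept_a), ("B", dept_b)):
    print(
        f"{name}: mean={mean(incomes)}, M={median(incomes)}, "
        f"midrange={midrange(incomes)}, mode={mode(incomes)}"
    )
```

For A all four measures agree, while for B they spread apart even though the mean is unchanged.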
BIOL 582 Measures of Central Tendency and Dispersion
[Figure: Frequency histograms for A and B. x-axis: x $1000 (45, 48, 51, 54, 57, More); y-axis: Frequency (0-4).]
So, how do the mean, M, midrange, and mode relate to distribution shape?
BIOL 582 Measures of Central Tendency and Dispersion
Obviously, the shape of a distribution is influenced by how values are distributed
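[Figure: Two distribution curves. Symmetric: mean, mode, and median coincide at the center. Skewed (right): the mode is at the peak, the mean lies farthest toward the right tail, and the median falls between them.]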
One can consider if values are close to a central measure or are spread out far from the center. The measures of such parameters (population) or statistics (sample) are called measures of dispersion.
BIOL 582 Measures of Central Tendency and Dispersion
There are several measures of dispersion. Range, Variance, and Standard Deviation are the three that we will generally talk about.
The Range, R, of a variable is the difference between the largest and smallest values: R = max(xi) - min(xi)
[Figure: Two distributions, each with mean, mode, and median at the center and its range, R, marked.]
BIOL 582 Measures of Central Tendency and Dispersion
Is using Range adequate for describing dispersion?
• Values may vary a lot or just slightly.
• Extremely large or extremely small values will “deviate” more from the mean.
• Therefore, a measure of variation (dispersion) that incorporates deviation from the mean would be useful.
BIOL 582 Measures of Central Tendency and Dispersion
Deviation from the mean
[Figure: Points x1, x2, and x3 on a number line around μ, with the deviations x1 - μ, x2 - μ, and x3 - μ marked.]
So, the deviation from the mean (xi - μ) is like a distance from the mean that has a positive value if expressed as |xi - μ| or if squared, (xi - μ)².
The Population Variance, σ², is the arithmetic mean of squared deviations from the population mean:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
for a population of size N
The Sample Variance, s², is the sum of squared deviations from the sample mean, divided by n-1 degrees of freedom:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
for a sample of size n
BIOL 582 3.2 Measures of Dispersion
Notice the difference between population and sample variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \qquad s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Why not n?
n-1 is called the degrees of freedom (df). Using n, the sample variance consistently underestimates the population variance (i.e., is biased). Dividing by n-1 instead makes s² an unbiased estimate of σ². Another way to think of it: the definition of variance includes the arithmetic mean, and the mean is fixed in the calculation, so only n-1 values are free to vary in the sample, not n.
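The bias can be seen exactly on a toy case. The sketch below assumes a hypothetical population {1, 2, 3} (not from the lecture) and lists every possible sample of size 2 drawn with replacement. It then averages the sample variance over all of those samples, once dividing by n and once by n-1.

```python
from fractions import Fraction
from itertools import product


def variance(values, ddof):
    """Sum of squared deviations from the mean, divided by (len - ddof)."""
    n = len(values)
    mean = Fraction(sum(values), n)
    squared_deviations = sum((Fraction(v) - mean) ** 2 for v in values)
    return squared_deviations / (n - ddof)


population = [1, 2, 3]            # hypothetical toy population
sigma2 = variance(population, 0)  # population variance (divide by N)

# Every ordered sample of size 2, drawn with replacement (9 samples).
samples = list(product(population, repeat=2))

mean_dividing_by_n = sum(variance(s, 0) for s in samples) / len(samples)
mean_dividing_by_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print("population variance:      ", sigma2)
print("average using n:          ", mean_dividing_by_n)
print("average using n - 1:      ", mean_dividing_by_n_minus_1)
```

Averaged over all samples, the n-1 version matches the population variance, while the n version falls short of it.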
BIOL 582 3.2 Measures of Dispersion
A computationally easier way to calculate variance uses the following formula (which, with a little algebra, can be shown to be the same thing):
$$\sigma^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N}$$
* $\sum x_i^2$ means to square each xi first, then take the sum
* $(\sum x_i)^2$ means to add all xi first, then square the sum
The same is true for sample variance, using n in place of N in the upper fraction and n-1 instead of N in the lower denominator.
Example (from an introductory stats course) for home runs by American League baseball teams:
i   x_i (home runs)   x_i^2
1 158 24,964
2 136 18,496
3 198 39,204
4 214 45,796
5 212 44,944
6 139 19,321
7 152 23,104
8 164 26,896
9 203 41,209
10 199 39,601
11 169 28,561
12 121 14,641
13 246 60,516
14 195 38,025
Total 2,506 465,278
$$\sigma^2 = \frac{465{,}278 - \frac{(2{,}506)^2}{14}}{14} = 1{,}193.1$$
One problem with variance! The units are now squared!
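As a check, the definitional formula and the computational shortcut can both be applied to the home run data. This sketch treats the 14 teams as the whole population, as the example does; the variable names are illustrative.

```python
import math

# Home runs for the 14 teams in the example.
home_runs = [158, 136, 198, 214, 212, 139, 152, 164, 203, 199, 169, 121, 246, 195]

N = len(home_runs)
mu = sum(home_runs) / N

# Definition: mean of squared deviations from the population mean.
var_definition = sum((x - mu) ** 2 for x in home_runs) / N

# Shortcut: (sum of squares - (square of the sum) / N) / N
sum_x = sum(home_runs)                  # add all x_i
sum_x2 = sum(x * x for x in home_runs)  # square each x_i, then add
var_shortcut = (sum_x2 - sum_x ** 2 / N) / N

print(f"sum x = {sum_x}, sum x^2 = {sum_x2}")
print(f"variance (definition) = {var_definition:.1f}")
print(f"variance (shortcut)   = {var_shortcut:.1f}")
print("agree:", math.isclose(var_definition, var_shortcut))
```

The result is in squared home runs, which is the units problem noted above.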
BIOL 582 Measures of Central Tendency and Dispersion
Introducing the Standard Deviation!
WARNING, the following calculation involves some tricky math!!!
Population Standard Deviation:
$$\sigma = \sqrt{\sigma^2}$$
Sample Standard Deviation:
$$s = \sqrt{s^2}$$
Now we have a measure of dispersion in the same units as the mean!
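Continuing with the home run data as an illustration, the standard deviation is the square root of the variance. The sketch below uses Python's statistics module: pstdev divides by N (population) and stdev divides by n-1 (sample). Whether the 14 teams count as a population or a sample is a modeling choice, so both are shown.

```python
from statistics import pstdev, stdev

# Home runs for the 14 teams from the variance example.
home_runs = [158, 136, 198, 214, 212, 139, 152, 164, 203, 199, 169, 121, 246, 195]

sigma = pstdev(home_runs)  # square root of the population variance
s = stdev(home_runs)       # square root of the sample variance

print(f"population standard deviation = {sigma:.2f} home runs")
print(f"sample standard deviation     = {s:.2f} home runs")
```

Both values are in home runs, not squared home runs, so they can be compared directly with the mean.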
BIOL 582 Measures of Central Tendency and Dispersion
Why is the Standard Deviation important?
• Same units as mean
• The Empirical Rule: If data from units of a population are “normally” distributed (meaning perfectly bell-shaped), then approximately
  68% of the data will fall within μ ± 1σ,
  95% of the data will fall within μ ± 2σ,
  99.7% of the data will fall within μ ± 3σ.
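The percentages in the Empirical Rule come from areas under the normal curve. As a sketch, they can be recovered from the cumulative distribution function of a standard normal distribution (μ = 0, σ = 1), which is used here only for convenience since the proportions are the same for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of a normal distribution lying within k standard
# deviations of the mean: cdf(k) - cdf(-k).
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, proportion in within.items():
    print(f"within mu ± {k} sigma: {proportion:.4f}")
```

The rule's 68%, 95%, and 99.7% are rounded versions of these proportions.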
BIOL 582 Measures of Position
Speaking of exams, ever wonder after taking something like the ACT, SAT, or GRE, what it means to be in the 80th percentile?
The kth percentile, Pk, is a value such that k% of all values are below Pk (or (100-k)% of values are above Pk).
This is important for the next measures of position: Quartiles.
Quartiles, Qi, divide data sets into 4 equal parts (25th, 50th, and 75th percentiles). We have already defined the 50th percentile, or second quartile, as the median.
[Figure: Box and Whisker Plot (showing a 5-point summary: Min, Q1, Q2, Q3, Max). 50% of the data fall within the box, centered around the median.]
BIOL 582 Measures of Position
Quartiles (and Box Plots) can tell us about the shape of a distribution.
Quartiles can tell us if there are outliers.
How? First, we define the interquartile range (IQR): IQR = Q3-Q1
Thus, IQR = 50% interior range of data. Next, determine fences: Lower Fence = Q1-1.5*IQR, Upper Fence = Q3+1.5*IQR
If xi < lower fence, or xi > upper fence, then it is considered an outlier.
[Figure: Box plots for Symmetric, Skewed Left, and Skewed Right distributions.]
BIOL 582 Measures of Position
Example: The following is a data set from a collection of pupfish (Cyprinodon), small fishes (usually endangered) that inhabit desert aquatic habitats in North America. The variable of interest is standard length, in mm.
Xi = {21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39}
First Step: Put in ascending order
Xi = {21.56, 27.00, 28.50, 28.87, 28.96, 29.39, 30.39, 32.50, 36.77}
Second Step: Calculate percentiles, IQR (n = 9)
Position of the kth percentile = n*k/100 (use integer rules; here each position is rounded up to the next whole number):
P25 = Q1: (9*25)/100 = 2.25 → x3
P50 = Q2 = M: (9*50)/100 = 4.5 → x5
P75 = Q3: (9*75)/100 = 6.75 → x7
IQR = x7 - x3 = (30.39 - 28.50) = 1.89; M = 28.96
Third Step: Lower and Upper Fences
Lower Fence = Q1 - 1.5*IQR = 28.50 - 1.5*1.89 = 25.665
Upper Fence = Q3 + 1.5*IQR = 30.39 + 1.5*1.89 = 33.225
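The three steps can be sketched in code. One assumption is labeled loudly: the position n*k/100 is rounded up to the next whole number, which matches the worked values in this example (2.25 → 3, 4.5 → 5, 6.75 → 7). The lecture's "integer rules" may treat a whole-number position differently, and this sketch does not cover that case.

```python
import math

# Standard length (mm) of the nine pupfish.
lengths = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]

# First step: ascending order.
ordered = sorted(lengths)
n = len(ordered)


def percentile(k):
    """Value at position ceil(n*k/100), counting from 1 (assumed rule)."""
    position = math.ceil(n * k / 100)
    return ordered[position - 1]


# Second step: quartiles and IQR.
q1, m, q3 = percentile(25), percentile(50), percentile(75)
iqr = q3 - q1

# Third step: fences, then flag anything outside them.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, M = {m}, Q3 = {q3}, IQR = {iqr:.2f}")
print(f"fences: {lower_fence:.3f} to {upper_fence:.3f}")
print("outliers:", outliers)
```

With these data, the smallest and largest fish fall outside the fences and are flagged as outliers.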
BIOL 582 Measures of Position
Xi = {21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39}
Last Step: Make the Box Plot
[Figure: Box plot on an axis from 20 to 36, showing the IQR and M, with two outliers marked by asterisks (*).]
Change of Pace:
What if some data collected from a population have a dependency on something independent of the population?
E.g., time, temperature, season, photoperiod
BIOL 582 Considering Multiple Variables
Sometimes we are interested in trends. A plot that shows how some variable (attribute) changes over time is a Time Series Plot.
A Time Series Plot is obtained by plotting the time at which a variable is measured on the horizontal axis and the corresponding value for the variable on the vertical axis. Lines are drawn connecting the points.
E.g., The Texas Parks and Wildlife Department is concerned about non-game fish in East Texas Rivers. They launch a survey to estimate numbers of darters (Etheostoma) using beach seine hauls over 10 m transects in the Neches River. The variable assessed is average number of darters per seine haul. Concerned that darter populations may fluctuate in time, they do this every month of the year.
BIOL 582 Considering Multiple Variables
Month fish/haul
January 4
February 3.1
March 2.5
April 2.2
May 1.9
June 10.6
July 15.7
August 13.3
September 11.2
October 8.4
November 7.3
December 5.1
[Figure: Time Series Plot, Darters in the Neches River. x-axis: Month (2003); y-axis: Average number of fish/seine haul (0-18).]
BIOL 582 Considering Multiple Variables
Time is an obvious independent variable. But what if we had a different “independent variable” that was an attribute of the subjects of the population we were sampling?
So far, we have been concerned with univariate data (in which a single variable is measured). For example, data from standard length collected on pupfish:
Xi = {21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39} (units = mm)
Maybe, while measuring the length of each fish, we could measure the weight also:
Yi = {0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61} (units = g)
Or we could write the data as:
Xi,Yi = {(21.56, 0.32), (28.87, 0.81), (28.50, 0.63),….., (29.39, 0.61)}
We call this bivariate data (when two variables are measured for each individual unit).
When we measure three or more variables for each individual unit, then we have multivariate data.
Xi,Yi,Zi = {(x1, y1, z1), (x2, y2, z2), …., (xn, yn, zn)}
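In code, bivariate data are naturally stored as ordered pairs, one pair per individual. This is a small sketch using the pupfish lengths and weights; it assumes the two lists are in the same fish order, as they are listed in this lecture.

```python
# Standard length (mm) and weight (g) for the same nine pupfish,
# listed in the same order.
lengths_mm = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
weights_g = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]

# Pair the i-th length with the i-th weight: (x_i, y_i).
pairs = list(zip(lengths_mm, weights_g))

for x, y in pairs:
    print(f"({x:.2f} mm, {y:.2f} g)")
```

Adding a third measured variable per fish would turn each pair into a triple (xi, yi, zi), giving multivariate data.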
BIOL 582 Considering Multiple Variables
With bivariate data, two values are measured on each population (or experimental) unit. We denote the data as ordered pairs (xi, yi). The data can be both qualitative, one qualitative and one quantitative, or both quantitative. In some examples, xi is the independent (predictor) variable and yi is the dependent (response) variable.
Bivariate Quantitative variables
[Figure: Scatter Plot, Weight vs. Length for pupfish data. x-axis: Length (mm), 0-40; y-axis: Weight (g), 0-1.6.]
BIOL 582 Considering Multiple Variables
We will come back to this later
BIOL 582 Summary/Future directions
• Everything we learned to this point falls under the category, Descriptive Statistics
• Descriptive Stats generally represent measures of
  • Central Tendency
  • Position
  • Dispersion
• Data can be either qualitative or quantitative, and if quantitative, either discrete or continuous
• Subjects are the organisms, objects, things, sampled from a population
• Variables are measurable or definable aspects of subjects
• Subjects can have multiple variables
• Going forward, we will concern ourselves more with asking if populations are similar or different in terms of variables of interest
• This part of statistics involves testing hypotheses
• Hypothesis tests fall under the category of Inferential Statistics