
BIOL 582

Lecture Set 1

Some basics

Data, Parameters, and Statistics

BIOL 582 What are data?

A datum (plural = data) describes some characteristic or count that is generally numerable. A statistic summarizes data in some way (e.g., average).

Example: Said to the careless driver by a concerned friend, “Hey, don’t become a statistic!”

Should be: “Hey, don’t become a datum!”

Example: In the USA, 75% of all auto accidents were alcohol related. In this case, 75% is a statistic that summarizes the proportion of accidents related to alcohol.

Data, Parameters, and Statistics

Population (N)

Sample

n = Number of individuals (sample size)

X = Variable of interest

Data = The values of individuals for the population

Parameter = A number summarizing all data from N

Statistic = A number summarizing the n data of the sample

BIOL 582 What are data?

Types of Data (Variables):

1. Qualitative or Categorical (e.g., sex, color, presence/absence).

2. Quantitative

i. Discrete (e.g., number of fin rays, days to maturation).

ii. Continuous (e.g., height, weight, oxygen consumption rate).

BIOL 582 Data types

Dichotomous key for variable types:

1 a. Data fit into a non-numerical category: Qualitative

1 b. Data are certainly numerical: Quantitative (Go to 2)

2 a. Data are countable in distinct units and data represent an order: Discrete

2 b. Data are points along a spectrum of infinite possibilities: Continuous

An example for distinguishing between continuous and discrete quantitative data:

Team             BA         Rank (# hits)
Texas Rangers    0.278564   3
Chicago Cubs     0.281582   1
Boston Red Sox   0.279417   2

BA is Continuous; Rank is Discrete.

BIOL 582 Organizing Qualitative Data

[Figure: Bar Graph (Frequency Distribution) of Number of Students (0-30) by Gender (F, M), and Pie Chart of the same data by Gender (F, M), with slices of 36% and 64%.]

Examples

Different types of bar graphs:

Data for class and gender

Class       Gender   Number
Freshman    Female   11
            Male     3
Sophomore   Female   8
            Male     2
Junior      Female   5
            Male     9
Senior      Female   4
            Male     2

[Figure: two bar graphs by Year and Gender (Fr, So, Ju, Se; F, M). Frequency Distribution: y-axis Number of Students (0-12). Relative Frequency Distribution: y-axis Percent of Students (0-30).]



Pareto Chart

[Figure: Pareto chart of Number of Students (0-12) by Year and Gender, with bars in descending order: Fr F, Ju M, So F, Ju F, Se F, Fr M, So M, Se M.]
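A Pareto chart is a frequency bar graph with the bars sorted from most to least frequent. The following is a minimal Python sketch of that ordering, using the class and gender counts from the table; the variable names are illustrative and not part of the lecture.

```python
# Counts from the class-and-gender table: (class, gender) -> number of students.
counts = {
    ("Fr", "F"): 11, ("Fr", "M"): 3,
    ("So", "F"): 8,  ("So", "M"): 2,
    ("Ju", "F"): 5,  ("Ju", "M"): 9,
    ("Se", "F"): 4,  ("Se", "M"): 2,
}

total = sum(counts.values())  # all students in the table

# Pareto order: sort categories by frequency, largest first.
# sorted() is stable, so tied categories keep their table order.
pareto = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for (year, gender), n in pareto:
    rel = 100 * n / total  # relative frequency, in percent
    print(f"{year} {gender}: {n:2d}  ({rel:.1f}%)")
```

The same loop also gives the relative frequency of each category, which is what the Relative Frequency Distribution plots.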


BIOL 582 Organizing Quantitative Data

Recall that quantitative data can be

1. Discrete, or

2. Continuous

But, some of the ways of displaying data are similar to the ways used with Qualitative data.

Consider the following example for discrete data, Exam Scores:

67 64 32 59 85 65 51 73

79 66 67 35 58 57 68 27

23 56 70 46 86 72 65 96

34 87 67 61 81 64 23 53

88 49 73 33 70 80 74 67

77 70 93 69 69 78 74 85

White = Exam 1, Yellow = Exam 2

[Figure: Dot Diagram and Stem and Leaf Plot of the exam scores.]

Notice that displays of quantitative data can be used to show the distribution of data.


Back-to-Back Stem and Leaf Plot

       Exam 1 | Stem | Exam 2
          3 3 |  2   | 7
          4 2 |  3   | 3 5
              |  4   | 6 9
          8 1 |  5   | 3 6 7 9
  9 8 7 7 7 5 |  6   | 1 4 4 5 6 7 9
9 7 4 4 3 0 0 |  7   | 0 2 3 8
      8 6 5 1 |  8   | 0 5 7
            3 |  9   | 6


Or, we can simply make a table of the data:

Class Intervals     Frequency   Relative Frequency   Cumulative Relative Frequency
90 ≤ score ≤ 100    2           2/48                 2/48
80 ≤ score < 90     7           7/48                 9/48
70 ≤ score < 80     11          11/48                20/48
60 ≤ score < 70     13          13/48                33/48
50 ≤ score < 60     6           6/48                 39/48
40 ≤ score < 50     2           2/48                 41/48
30 ≤ score < 40     4           4/48                 45/48
score < 30          3           3/48                 48/48

Which may look like this if you presented it to someone:

Interval   Frequency   Relative Frequency   Cumulative Relative Frequency
90-100     2           4.2%                 4.2%
80-89      7           14.6%                18.8%
70-79      11          22.9%                41.7%
60-69      13          27.1%                68.8%
50-59      6           12.5%                81.3%
40-49      2           4.2%                 85.4%
30-39      4           8.3%                 93.8%
< 30       3           6.3%                 100.0%
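The frequency table can be rebuilt directly from the 48 exam scores. The sketch below is one way to do it in Python; the helper name `interval` and the table layout are assumptions made for illustration.

```python
# Exam scores from the lecture (both exams pooled, 48 scores).
scores = [
    67, 64, 32, 59, 85, 65, 51, 73,
    79, 66, 67, 35, 58, 57, 68, 27,
    23, 56, 70, 46, 86, 72, 65, 96,
    34, 87, 67, 61, 81, 64, 23, 53,
    88, 49, 73, 33, 70, 80, 74, 67,
    77, 70, 93, 69, 69, 78, 74, 85,
]

labels = ["90-100", "80-89", "70-79", "60-69", "50-59", "40-49", "30-39", "< 30"]

def interval(score):
    """Return the class-interval label for one score (hypothetical helper)."""
    if score < 30:
        return "< 30"
    if score >= 90:
        return "90-100"
    low = (score // 10) * 10
    return f"{low}-{low + 9}"

# Frequency: how many scores fall in each interval.
freq = {label: 0 for label in labels}
for s in scores:
    freq[interval(s)] += 1

n = len(scores)
cumulative = 0
table = []  # rows of (interval, frequency, relative %, cumulative relative %)
for label in labels:
    cumulative += freq[label]
    table.append((label, freq[label], 100 * freq[label] / n, 100 * cumulative / n))

for label, f, rel, cum in table:
    print(f"{label:>6}  {f:2d}  {rel:6.2f}%  {cum:6.2f}%")
```

Accumulating from the top interval downward means the last row always ends at 100%.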



Or like this in graph form:

[Figure: "Frequency of Exam scores". Bar graph of Relative Frequency (y-axis: Percent Frequency, 0%-100%) by Class Interval (90-100, 80-89, 70-79, 60-69, 50-59, 40-49, 30-39, < 30).]


For qualitative data, we called a bar graph like this a frequency (or relative frequency) distribution. For quantitative data, we refer to it as a frequency (or relative frequency) HISTOGRAM.



Something to consider:

[Figure: two plots titled "Frequency of exam scores", showing Number of scores (0-14) by Interval. One orders the intervals from 90-100 down to < 30; the other orders them from < 30 up to 90-100.]

The displays are the same!!! Just rotate 90°.



Continuous quantitative data, like discrete quantitative data, can be summarized with frequency histograms, but in both cases, intervals are arbitrary choices.

E.g., GPA scores

3.95 2.70 2.27

3.51 2.64 2.39

3.59 2.83 2.45

3.32 2.86 2.37

3.37 2.77 2.40

3.39 2.27 1.65

3.19 2.15 1.98

3.16 2.35 1.84

3.02 2.40 1.85

3.10 2.10 1.37

[Figure: two frequency histograms of the GPA data (y-axis: Frequency, 0-15). One uses the intervals 1-1.99, 2-2.99, 3-3.99; the other uses 1-1.49, 1.5-1.99, 2-2.49, 2.5-2.99, 3-3.49, 3.5-3.99.]
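The point that intervals are arbitrary choices can be checked by counting the same GPA data under both sets of intervals. This is a minimal Python sketch; the function `bin_counts` and its parameters are hypothetical names, and it works in integer hundredths to avoid floating-point trouble at interval edges.

```python
from collections import Counter

# GPA data from the lecture (30 values).
gpas = [
    3.95, 2.70, 2.27, 3.51, 2.64, 2.39, 3.59, 2.83, 2.45, 3.32,
    2.86, 2.37, 3.37, 2.77, 2.40, 3.39, 2.27, 1.65, 3.19, 2.15,
    1.98, 3.16, 2.35, 1.84, 3.02, 2.40, 1.85, 3.10, 2.10, 1.37,
]

def bin_counts(values, width_hundredths, start_hundredths=100):
    """Count values per interval, keyed by the lower edge of each interval."""
    counts = Counter()
    for v in values:
        h = round(v * 100)  # e.g. 2.70 -> 270
        index = (h - start_hundredths) // width_hundredths
        low = start_hundredths + index * width_hundredths
        counts[low / 100] += 1
    return dict(sorted(counts.items()))

wide = bin_counts(gpas, 100)   # intervals 1-1.99, 2-2.99, 3-3.99
narrow = bin_counts(gpas, 50)  # intervals 1-1.49, 1.5-1.99, ...

print(wide)
print(narrow)
```

With the wide intervals the data look like a single peak at 2-2.99; the narrow intervals show more detail from the same 30 values.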



When there are enough intervals, the plot can look more like a curve.

[Figure: four frequency histograms (y-axis: Frequency, 0-10; x-axis: intervals 1-13) with the panel titles Symmetrical (bell-shaped), Uniform, Skewed Right, and Skewed Left.]

These are referred to as distribution shapes.


BIOL 582

Distributions have describable shape.

Symmetrical, Uniform, or skewed (left or right)

In general, there are two characteristics about distributions that we usually wish to describe (there are also others, but we will leave these for other statistics courses):

Center and Spread

When we measure attributes of a population that describe “central tendency” or “dispersion”, we have to first identify if we are dealing with the whole population or a sample of the population

Measures of Central Tendency and Dispersion

A Parameter is a descriptive measure of a population of size N.

A Statistic is a descriptive measure of a sample of size n.

Consider a population with N observations (units). We can represent the population as a set:

$$\{x_1, x_2, \ldots, x_N\}$$

The population arithmetic mean, $\mu$ (pronounced “mew”), is defined as follows:

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$

Where $\sum_{i=1}^{N} x_i$ ($\Sigma$ is sigma) means a summation of $x_i$, with i going from 1 to N.

Likewise, if we take a sample of size n observations from the population, we can represent the sample as a set:

$$\{x_1, x_2, \ldots, x_n\}$$

The sample arithmetic mean, $\bar{x}$ (pronounced “x-bar”), is as follows:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

BIOL 582 Measures of Central Tendency and Dispersion

Example: Recall our GPA data set. Let’s consider this to be the whole population.

3.95 2.70 2.27

3.51 2.64 2.39

3.59 2.83 2.45

3.32 2.86 2.37

3.37 2.77 2.40

3.39 2.27 1.65

3.19 2.15 1.98

3.16 2.35 1.84

3.02 2.40 1.85

3.10 2.10 1.37

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{3.95 + 3.51 + 3.59 + \cdots + 1.85 + 1.37}{30} = \frac{79.24}{30} = 2.64$$

Now let’s take a random sample of n = 9 from the population:

1.84 1.65 3.51

3.32 2.64 2.7

3.16 3.59 3.19

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1.84 + 3.32 + 3.16 + 1.65 + 2.64 + 3.59 + 3.51 + 2.70 + 3.19}{9} = \frac{25.6}{9} = 2.84$$
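The two means in this example can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `arithmetic_mean` is an illustrative choice.

```python
# GPA data treated as the whole population (N = 30).
population = [
    3.95, 2.70, 2.27, 3.51, 2.64, 2.39, 3.59, 2.83, 2.45, 3.32,
    2.86, 2.37, 3.37, 2.77, 2.40, 3.39, 2.27, 1.65, 3.19, 2.15,
    1.98, 3.16, 2.35, 1.84, 3.02, 2.40, 1.85, 3.10, 2.10, 1.37,
]

# The random sample of n = 9 taken in the lecture.
sample = [1.84, 1.65, 3.51, 3.32, 2.64, 2.70, 3.16, 3.59, 3.19]

def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

mu = arithmetic_mean(population)  # parameter: describes the population
x_bar = arithmetic_mean(sample)   # statistic: describes the sample

print(f"mu = {mu:.2f}, x-bar = {x_bar:.2f}")
```

The formula is identical in both cases; only the set of values (all N units versus the n sampled units) differs, which is why the statistic need not equal the parameter.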


Other measures of Central Tendency:

Consider, as an example, the income for 9-month appointments of faculty from two departments at WKU, each with 9 faculty members (purely fictional):

A: {$45,000, $48,000, $48,000, $51,000, $51,000, $51,000, $54,000, $54,000, $57,000}

B: {$45,000, $45,000, $45,000, $45,000, $48,000, $51,000, $54,000, $57,000, $69,000}

For both departments, the mean income is $51,000. Thus, if the population is all 18 incomes, then:

$$\mu = \bar{x}_a = \bar{x}_b = \$51{,}000$$

But, look at the shape of the distributions:

[Figure: frequency histograms of income for A and B (x-axis: x $1000, bins 45, 48, 51, 54, 57, More; y-axis: Frequency, 0-4).]


Other measures of Central Tendency (also called measures of location):




Median (M): value of the variable that lies in the middle of a data set when arranged in order (i.e., half of the data are above and half the data are below). If there are an even number of values, then the two middle values are averaged.

Midrange: (smallest value + largest value)/2. Midrange (midpoint) indicates where the center of the range is, irrespective of the shape of the distribution.

Mode: The most frequently occurring value of a variable.

For A: M = $51,000, Midrange = $51,000, and Mode = $51,000.

For B: M = $48,000, Midrange = $57,000, and Mode = $45,000.
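These values can be checked with Python's standard `statistics` module. The sketch below uses the two fictional departments from the example; `midrange` is a small helper written here because the module does not provide one.

```python
from statistics import mean, median, mode

# Fictional 9-month incomes for the two departments.
dept_a = [45000, 48000, 48000, 51000, 51000, 51000, 54000, 54000, 57000]
dept_b = [45000, 45000, 45000, 45000, 48000, 51000, 54000, 57000, 69000]

def midrange(values):
    """(smallest value + largest value) / 2."""
    return (min(values) + max(values)) / 2

for name, incomes in (("A", dept_a), ("B", dept_b)):
    print(
        f"{name}: mean = {mean(incomes)}, M = {median(incomes)}, "
        f"midrange = {midrange(incomes)}, mode = {mode(incomes)}"
    )
```

Both departments share the same mean, yet the median, midrange, and mode of B all differ from it, which reflects the different shapes of the two distributions.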

[Figure: frequency histograms of income for departments A and B (x-axis: x $1000, bins 45, 48, 51, 54, 57, More; y-axis: Frequency, 0-4).]

So, how do the mean, median (M), midrange, and mode relate to distribution shape?


Obviously, the shape of a distribution is influenced by how values are distributed.

[Figure: two distribution curves. Symmetric: mean, mode, and median coincide. Skewed (right): mode, median, and mean fall at different points.]

One can consider if values are close to a central measure or are spread out far from the center. The measures of such parameters (population) or statistics (sample) are called measures of dispersion.


There are several measures of dispersion. Range, Variance, and Standard Deviation are the three that we will generally talk about.



The Range, R, of a variable is the difference between the largest and smallest values: $R = \max(x_i) - \min(x_i)$


Is using Range adequate for describing dispersion?

• Values may vary a lot or just slightly.

• Extremely large or extremely small values will “deviate” more from the mean.

• Therefore, a measure of variation (dispersion) that incorporates deviation from the mean would be useful.


Deviation from the mean

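[Figure: number line marking μ and three values x1, x2, x3, with the deviations x1 − μ, x2 − μ, and x3 − μ shown as distances from μ.]

So, the deviation from the mean, $(x_i - \mu)$, is like a distance from the mean. It is never negative if expressed as an absolute value, $|x_i - \mu|$, or if squared, $(x_i - \mu)^2$.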

The Population Variance, $\sigma^2$, is the arithmetic mean of squared deviations from the population mean:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

for a population of size N

The Sample Variance, $s^2$, is the sum of squared deviations from the sample mean, divided by n-1 degrees of freedom:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

for a sample of size n


BIOL 582 3.2 Measures of Dispersion

Notice the difference between population and sample variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \qquad s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Why not n?

n-1 is called the degrees of freedom (df). If we divide by n, the sample variance consistently underestimates the population variance (i.e., it is biased). Dividing by n-1 makes $s^2$ an unbiased estimate of $\sigma^2$. Another way to think of it: the definition of variance includes the arithmetic mean, and the mean is fixed in the calculation, so only n-1 values are free to vary in the sample, not n.
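The bias can be seen exactly on a tiny example. The sketch below assumes a hypothetical population of four values (not from the lecture) and assumes sampling with replacement, so every ordered sample of size 2 is equally likely; it then averages both versions of the sample variance over all possible samples.

```python
from itertools import product

population = [2, 4, 6, 8]  # hypothetical population, N = 4
n = 2                      # sample size

def variance(values, divisor):
    """Sum of squared deviations from the mean of `values`, over `divisor`."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / divisor

sigma2 = variance(population, len(population))  # population variance (divide by N)

# Every possible ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Average of the sample variance over all samples, for each choice of divisor.
avg_div_n = sum(variance(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(f"population variance:     {sigma2}")
print(f"average with divisor n:  {avg_div_n}")
print(f"average with divisor n-1: {avg_div_n_minus_1}")
```

Under these assumptions, dividing by n averages out to less than the population variance, while dividing by n-1 averages out to the population variance itself.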


A computationally easier way to calculate variance uses the following formula (which, with a little algebra, can be shown to be the same thing):

$$\sigma^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N}$$

* $\sum x_i^2$ means to square each $x_i$ first, then take the sum

* $\left(\sum x_i\right)^2$ means to add all $x_i$ first, then square the sum

The same is true for sample variance, using n in place of N in the upper denominator and n-1 instead of N in the lower denominator.


Example (from an introductory stats course) for home runs by American League baseball teams:

i       xi (home runs)   xi^2
1       158              24,964
2       136              18,496
3       198              39,204
4       214              45,796
5       212              44,944
6       139              19,321
7       152              23,104
8       164              26,896
9       203              41,209
10      199              39,601
11      169              28,561
12      121              14,641
13      246              60,516
14      195              38,025
Total   2,506            465,278

$$\sigma^2 = \frac{465{,}278 - \frac{(2{,}506)^2}{14}}{14} = 1{,}193.1$$

One problem with variance! The units are now squared!
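As a check that the shortcut formula and the definition agree, the home run example can be computed both ways. This is a short Python sketch; the variable names are illustrative.

```python
# Home runs by the 14 teams in the example, treated as a population.
home_runs = [158, 136, 198, 214, 212, 139, 152, 164, 203, 199, 169, 121, 246, 195]
N = len(home_runs)

# Definition: mean of squared deviations from the population mean.
mu = sum(home_runs) / N
var_definition = sum((x - mu) ** 2 for x in home_runs) / N

# Shortcut: needs only the sum of x and the sum of x squared.
sum_x = sum(home_runs)                  # add all x_i
sum_x2 = sum(x * x for x in home_runs)  # square each x_i first, then add
var_shortcut = (sum_x2 - sum_x ** 2 / N) / N

print(f"sum x = {sum_x}, sum x^2 = {sum_x2}")
print(f"variance (definition) = {var_definition:.1f}")
print(f"variance (shortcut)   = {var_shortcut:.1f}")
```

Note that the result is in squared home runs, which is the units problem raised above.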


Introducing, the Standard Deviation!

WARNING, the following calculation involves some tricky math!!!

Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$

Sample Standard Deviation: $s = \sqrt{s^2}$

Now we have a measure of dispersion in the same units as the mean!


Why is the Standard Deviation important?

•Same units as mean

•The Empirical Rule: If data from units of a population are “normally” distributed (meaning perfectly bell-shaped), then approximately
68% of the data will fall within μ ± 1σ,
95% of the data will fall within μ ± 2σ, and
99.7% of the data will fall within μ ± 3σ.
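The percentages in the Empirical Rule are rounded values of areas under the normal curve. A minimal sketch using `NormalDist` from Python's standard `statistics` module (Python 3.8 or later) shows where they come from; the helper name `share_within` is an assumption.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * share_within(k):.1f}%")
```

Because any normal distribution can be rescaled to the standard one, the same proportions hold for every μ and σ.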


BIOL 582 Measures of Position

Speaking of exams, ever wonder after taking something like the ACT, SAT, or GRE, what it means to be in the 80th percentile?

The kth percentile, Pk, is a value such that k% of all values are below Pk (or (100-k)% of all values are above Pk).

This is important for the next measures of position: Quartiles.

Quartiles, Qi, divide data sets into 4 equal parts (25th, 50th, and 75th percentiles). We have already defined the 50th percentile, or second quartile, as the median.

[Figure: Box and Whisker Plot (showing a 5-point summary: Min, Q1, Q2, Q3, Max). 50% of the data lie within the box, centered around the median.]


Quartiles (and Box Plots) can tell us about the shape of a distribution.

[Figure: box plots for Symmetric, Skewed Left, and Skewed Right distributions.]

Quartiles can tell us if there are outliers.

How? First, we define the interquartile range (IQR): IQR = Q3-Q1

Thus, the IQR is the range of the interior 50% of the data. Next, determine fences:

Lower Fence = Q1-1.5*IQR
Upper Fence = Q3+1.5*IQR

If xi < lower fence, or xi > upper fence, then it is considered an outlier.


Example: The following is a data set from a collection of pupfish (Cyprinodon), small fishes (usually endangered) that inhabit desert aquatic habitats in North America. The variable of interest is standard length, in mm.

Xi = {21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39}

First Step: Put in ascending order

Xi = {21.56, 27.00, 28.50, 28.87, 28.96, 29.39, 30.39, 32.50, 36.77}

Second Step: Calculate percentiles, IQR (n = 9)

Position of the kth percentile = n*k/100 (use integer rules; here each position is rounded up):
P25 = Q1 : (9*25)/100 = 2.25 → x3
P50 = Q2 = M : (9*50)/100 = 4.5 → x5
P75 = Q3 : (9*75)/100 = 6.75 → x7

IQR = x7-x3 = (30.39-28.50) = 1.89; M = 28.96

Third Step: Lower and Upper Fences

Lower Fence = Q1 - 1.5*IQR = 28.50-1.5*1.89 = 25.665
Upper Fence = Q3 + 1.5*IQR = 30.39+1.5*1.89 = 33.225

The values 21.56 and 36.77 fall outside the fences, so both are outliers.


Last Step: Make the Box Plot

[Figure: box plot of the pupfish standard lengths on an axis from 20 to 36 mm, showing the IQR and M, with two Outliers marked by asterisks.]
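The steps of the pupfish example can be sketched in Python as follows. The `percentile` helper follows the position rule used in this example (n*k/100, rounded up to the next whole position); other textbooks and software use slightly different rules, so treat this as one convention rather than the only one.

```python
import math

# Pupfish standard lengths (mm) from the example.
lengths = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]

# First step: put the data in ascending order.
ordered = sorted(lengths)
n = len(ordered)

def percentile(k):
    """kth percentile: value at position n*k/100, rounded up (1-based)."""
    position = math.ceil(n * k / 100)
    return ordered[position - 1]

# Second step: quartiles and interquartile range.
q1, m, q3 = percentile(25), percentile(50), percentile(75)
iqr = q3 - q1

# Third step: fences, then flag anything outside them.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, M = {m}, Q3 = {q3}, IQR = {iqr:.2f}")
print(f"fences: {lower_fence:.3f} to {upper_fence:.3f}")
print(f"outliers: {outliers}")
```

The five-point summary for the box plot is then the minimum, Q1, M, Q3, and the maximum of `ordered`.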

Change of Pace:

What if some data collected from a population have a dependency on something independent of the population?

E.g., time, temperature, season, photoperiod

BIOL 582 Considering Multiple Variables

Sometimes we are interested in trends. A plot that shows how some variable (attribute) changes over time is a Time Series Plot.

A Time Series Plot is obtained by plotting the time in which a variable is measured on the horizontal axis and the corresponding value for the variable on the vertical axis. Lines are drawn connecting the points.

E.g., The Texas Parks and Wildlife Department is concerned about non-game fish in East Texas Rivers. They launch a survey to estimate numbers of darters (Etheostoma) using beach seine hauls over 10 m transects in the Neches River. The variable assessed is average number of darters per seine haul. Concerned that darter populations may fluctuate in time, they do this every month of the year.


Month fish/haul

January 4

February 3.1

March 2.5

April 2.2

May 1.9

June 10.6

July 15.7

August 13.3

September 11.2

October 8.4

November 7.3

December 5.1

[Figure: Time Series Plot, "Darters in the Neches River". x-axis: Month (2003); y-axis: Average number of fish/seine haul (0-18).]


Time is an obvious independent variable. But what if we had a different “independent variable” that was an attribute of the subjects of the population we were sampling?

So far, we have been concerned with univariate data (in which a single variable is measured). For example, data from standard length collected on pupfish:

Xi = {21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39} (units = mm)

Maybe, while measuring the length of each fish, we could measure the weight also:

Yi = {0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61} (units = g)

Or we could write the data as:

Xi,Yi = {(21.56, 0.32), (28.87, 0.81), (28.50, 0.63),….., (29.39, 0.61)}

We call this bivariate data (when two variables are measured for each individual unit).

When we measure three or more variables for each individual unit, then we have multivariate data.

Xi,Yi,Zi = {(x1, y1, z1), (x2, y2, z2), …., (xn, yn, zn)}
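In code, bivariate data are naturally stored as ordered pairs, one pair per individual. A minimal Python sketch with the pupfish lengths and weights (the variable names are illustrative):

```python
# Standard length (mm) and weight (g), measured on the same nine pupfish.
lengths_mm = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
weights_g = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]

# zip() pairs the i-th length with the i-th weight: (x_i, y_i).
pairs = list(zip(lengths_mm, weights_g))

for x, y in pairs:
    print(f"({x:.2f} mm, {y:.2f} g)")
```

Keeping the values paired matters: each (x_i, y_i) belongs to one fish, so sorting one list without the other would destroy the relationship between the variables.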


With bivariate data, two values are measured on each population (or experimental) unit. We denote the data as ordered pairs (xi, yi). The data can be both qualitative, one qualitative and one quantitative, or both quantitative. In some examples, xi is the independent (predictor) variable and yi is the dependent (response) variable.

Bivariate Quantitative variables

[Figure: Scatter Plot, Weight vs. Length for pupfish data. x-axis: Length (mm), 0-40; y-axis: Weight (g), 0-1.6.]

We will come back to this later

BIOL 582 Summary/Future directions

• Everything we learned to this point falls under the category of Descriptive Statistics

• Descriptive Stats generally represent measures of
  • Central Tendency
  • Position
  • Dispersion

• Data can be either qualitative or quantitative, and if quantitative, either discrete or continuous

• Subjects are the organisms, objects, or things sampled from a population

• Variables are measurable or definable aspects of subjects

• Subjects can have multiple variables

• Going forward, we will concern ourselves more with asking if populations are similar or different in terms of variables of interest

• This part of statistics involves testing hypotheses

• Hypothesis tests fall under the category of Inferential Statistics