STA 291 Summer 2010


Lecture 4
Dustin Lueker

Population Distribution

The population distribution for a continuous variable is usually represented by a smooth curve
◦ Like a histogram that gets finer and finer
◦ Similar to the idea of using smaller and smaller rectangles to calculate the area under a curve when learning how to integrate

Symmetric distributions
◦ Bell-shaped
◦ U-shaped
◦ Uniform

Not symmetric distributions
◦ Left-skewed
◦ Right-skewed
◦ Skewed

Summarizing Data Numerically

Center of the data
◦ Mean
◦ Median
◦ Mode

Dispersion of the data (sometimes referred to as spread)
◦ Variance, standard deviation
◦ Interquartile range
◦ Range

Measures of Central Tendency

Mean
◦ Arithmetic average

Median
◦ Midpoint of the observations when they are arranged in order, smallest to largest

Mode
◦ Most frequently occurring value
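The three measures of central tendency can be computed with Python's standard `statistics` module. This is a minimal sketch; the data values are made up for illustration and do not come from the lecture.

```python
import statistics

# Hypothetical sample, chosen only for illustration (not from the lecture).
data = [2, 3, 3, 5, 7]

mean_value = statistics.mean(data)      # arithmetic average: 20 / 5
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

Here the median and the mode coincide while the mean is pulled slightly upward by the larger values.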

Sample Mean

Sample size n
Observations x1, x2, …, xn
Sample mean “x-bar”

$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Here $\sum$ stands for SUM.

Population Mean

Population size N
Observations x1, x2, …, xN
Population mean “mu”

$\mu = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i$

Here $\sum$ stands for SUM.

Note: This is for a finite population of size N

Mean

Requires numerical values
◦ Only appropriate for quantitative data
◦ Does not make sense to compute the mean for nominal variables
◦ Can be calculated for ordinal variables, but this does not always make sense
    Should be careful when using the mean on ordinal variables
    Example: “Weather” (on an ordinal scale)
        Sun=1, Partly Cloudy=2, Cloudy=3, Rain=4, Thunderstorm=5
        Mean (average) weather = 2.8
    Another example: “GPA = 3.8” is also a mean of observations measured on an ordinal scale

Mean

Center of gravity for the data set
Sum of the differences from values above the mean is equal to the sum of the differences from values below the mean
◦ 3 + 2 + 2 = 3 + 4
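The balance property can be checked numerically. The sketch below uses a hypothetical data set, chosen only because its deviations reproduce the slide's 3 + 2 + 2 = 3 + 4; the lecture does not say which data the slide was based on.

```python
# Hypothetical data chosen so the deviations match the slide's 3 + 2 + 2 = 3 + 4.
data = [1, 2, 7, 7, 8]
mean = sum(data) / len(data)  # 25 / 5 = 5.0

# Total distance of the values above the mean: 2 + 2 + 3
above = sum(x - mean for x in data if x > mean)
# Total distance of the values below the mean: 4 + 3
below = sum(mean - x for x in data if x < mean)

print(mean, above, below)
```

Both totals come out to 7, which is why the mean acts as the balance point of the data.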

Mean (Average)

Mean
◦ Sum of observations divided by the number of observations

Example
◦ {7, 12, 11, 18}
◦ Mean = ___
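The example can be worked directly from the definition. A minimal sketch, where the function name `sample_mean` is only an illustrative choice:

```python
def sample_mean(observations):
    """Sum of the observations divided by the number of observations."""
    return sum(observations) / len(observations)


data = [7, 12, 11, 18]
result = sample_mean(data)  # (7 + 12 + 11 + 18) / 4 = 48 / 4
print(result)
```

For this data set the mean is 12.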

Mean

Highly influenced by outliers
◦ Data points that are far from the rest of the data
◦ Example: monthly income for five people
    1,000  2,000  3,000  4,000  100,000
    Average monthly income = ___
    What is the problem with using the average to describe this data set?
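The effect of the outlier is easy to see by computing the average and counting how many incomes fall below it. A short sketch using the five incomes from the example:

```python
incomes = [1_000, 2_000, 3_000, 4_000, 100_000]

average = sum(incomes) / len(incomes)  # 110,000 / 5
below_average = [x for x in incomes if x < average]

print(average, len(below_average))
```

The average is 22,000, yet four of the five people earn less than that, so the mean does not describe a typical income here.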

Median

Measurement that falls in the middle of the ordered sample

When the sample size n is odd, there is a middle value
◦ It has the ordered index (n+1)/2
    Ordered index is where that value falls when the sample is listed from smallest to largest
    An index of 2 means the second smallest value
◦ Example: 1.7, 4.6, 5.7, 6.1, 8.3
    n = 5, (n+1)/2 = 6/2 = 3, index = 3
    Median = 3rd smallest observation = 5.7

Median

When the sample size n is even, average the two middle values
◦ Example: 3, 5, 6, 9
    n = 4, (n+1)/2 = 5/2 = 2.5, index = 2.5
    Median = midpoint between 2nd and 3rd smallest observations = (5+6)/2 = 5.5
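The ordered-index rule (n+1)/2 covers both the odd and the even case. The following is a minimal sketch of that rule, not a library routine; the function name `median_by_index` is an illustrative choice and the function assumes a non-empty sample.

```python
def median_by_index(observations):
    """Median via the ordered index (n + 1) / 2, assuming a non-empty sample."""
    ordered = sorted(observations)
    n = len(ordered)
    index = (n + 1) / 2  # 1-based ordered index, may end in .5
    lower = ordered[int(index) - 1]  # observation at the index rounded down
    if n % 2 == 1:
        return lower  # odd n: the index is a whole number
    upper = ordered[int(index)]  # even n: next observation up
    return (lower + upper) / 2


odd_example = median_by_index([1.7, 4.6, 5.7, 6.1, 8.3])
even_example = median_by_index([3, 5, 6, 9])
print(odd_example, even_example)
```

With the two lecture examples this gives 5.7 for the odd-sized sample and 5.5 for the even-sized one.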

Mean and Median

For skewed distributions, the median is often a more appropriate measure of central tendency than the mean
◦ The median usually better describes a “typical value” when the sample distribution is highly skewed

Example
◦ Monthly income for five people
    1,000  2,000  3,000  4,000  100,000
◦ Median monthly income: ___
    Why is the median better to use with this data than the mean?
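Computing both measures for the income example makes the comparison concrete. A short sketch using Python's standard `statistics` module:

```python
import statistics

incomes = [1_000, 2_000, 3_000, 4_000, 100_000]

median_income = statistics.median(incomes)  # 3rd smallest of 5 values
mean_income = statistics.mean(incomes)      # 110,000 / 5

print(median_income, mean_income)
```

The median of 3,000 sits in the middle of the four ordinary incomes, while the mean of 22,000 is driven almost entirely by the single large value.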

Measures of Central Tendency

Mode - Most frequent value

Notation: subscripted variables
◦ n = number of units in the sample
◦ N = number of units in the population
◦ x = variable to be measured
◦ xi = measurement of the ith unit

Mean - Arithmetic average
◦ Mean of a sample - $\bar{x}$
◦ Mean of a population - μ

Median - Midpoint of the observations when they are arranged in increasing order

Median for Grouped or Ordinal Data

Example: Highest Degree Completed

Highest Degree                                          Frequency   Percentage
Not a high school graduate                                 38,012         21.4
High school only                                           65,291         36.8
Some college, no degree                                    33,191         18.7
Associate, Bachelor, Master, Doctorate, Professional       41,124         23.2
Total                                                     177,618        100

Calculate the Median

n = 177,618
(n+1)/2 = 88,809.5
Median = midpoint between the 88,809th smallest and 88,810th smallest observations
◦ Both are in the category “High school only”

Mean wouldn’t make sense here since the variable is ordinal

Median
◦ Can be used for interval data and for ordinal data
◦ Cannot be used for nominal data because the observations cannot be ordered on a scale
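For grouped ordinal data, the median category can be located by accumulating the frequencies in category order until the middle positions are reached. A sketch using the frequencies from the Highest Degree Completed example; the helper name `category_at` is an illustrative choice.

```python
# Categories in their natural order, with the frequencies from the example table.
categories = [
    ("Not a high school graduate", 38_012),
    ("High school only", 65_291),
    ("Some college, no degree", 33_191),
    ("Associate, Bachelor, Master, Doctorate, Professional", 41_124),
]


def category_at(position, ordered_counts):
    """Return the category holding the observation at a 1-based ordered position."""
    cumulative = 0
    for name, count in ordered_counts:
        cumulative += count
        if position <= cumulative:
            return name
    raise ValueError("position exceeds the number of observations")


n = sum(count for _, count in categories)
lower_pos = (n + 1) // 2  # 88,809: the index (n + 1) / 2 rounded down
upper_pos = (n + 2) // 2  # 88,810: the index (n + 1) / 2 rounded up

print(n, category_at(lower_pos, categories), category_at(upper_pos, categories))
```

The first category covers positions 1 to 38,012 and the second covers 38,013 to 103,303, so both middle positions land in “High school only”.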

Mean vs. Median

Mean
◦ Interval data with an approximately symmetric distribution

Median
◦ Interval data
◦ Ordinal data

Mean is sensitive to outliers, median is not

Mean vs. Median

Symmetric distribution
◦ Mean = Median

Skewed distribution
◦ Mean lies more toward the direction in which the distribution is skewed

Median

While the median is better than the mean for skewed distributions, there is one large disadvantage to using the median
◦ Insensitive to changes within the lower or upper half of the data
◦ Example
    1, 2, 3, 4, 5
    1, 2, 3, 100, 100
◦ Sometimes, the mean is more informative even when the distribution is skewed
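The two example data sets can be compared directly. A short sketch with the standard `statistics` module:

```python
import statistics

first = [1, 2, 3, 4, 5]
second = [1, 2, 3, 100, 100]

medians = (statistics.median(first), statistics.median(second))
means = (sum(first) / len(first), sum(second) / len(second))

print(medians, means)
```

Both medians are 3, so the median does not register the change in the upper half of the data, while the mean moves from 3 to 41.2.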

Example

Keeneland Sales

Deviations

The deviation of the ith observation xi from the sample mean $\bar{x}$ is the difference between them, $(x_i - \bar{x})$
◦ Sum of all deviations is zero
◦ Therefore, we use either the sum of the absolute deviations or the sum of the squared deviations as a measure of variation

Sample Variance

Variance of n observations is the sum of the squared deviations, divided by n-1

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Example

Observation   Mean   Deviation   Squared Deviation
1
3
4
7
10

Sum of the Squared Deviations
n-1
Sum of the Squared Deviations / (n-1)
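The table can be filled in step by step. The sketch below assumes the observations are 1, 3, 4, 7, 10; the extracted slide shows them run together as "134710", so that split is an assumption rather than something the source states.

```python
# Assumed reading of the slide's run-together "134710".
observations = [1, 3, 4, 7, 10]

n = len(observations)
mean = sum(observations) / n                    # 25 / 5
deviations = [x - mean for x in observations]   # each observation minus the mean
squared = [d ** 2 for d in deviations]          # squared deviations

sum_of_squares = sum(squared)
sample_variance = sum_of_squares / (n - 1)      # divide by n - 1, not n

for x, d, sq in zip(observations, deviations, squared):
    print(x, mean, d, sq)
print(sum_of_squares, n - 1, sample_variance)
```

Under that assumption the mean is 5, the deviations are -4, -2, -1, 2, 5 (which sum to zero), the squared deviations sum to 50, and the sample variance is 50 / 4 = 12.5.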

Interpreting Variance

About the average of the squared deviations
◦ “Average squared distance from the mean”

Unit
◦ Square of the unit for the original data

Difficult to interpret
◦ Solution
    Take the square root of the variance, and the unit is the same as for the original data
    Standard deviation

Properties of Standard Deviation

s ≥ 0
◦ s = 0 only when all observations are the same

If data is collected for the whole population instead of a sample, then n-1 is replaced by N

s is sensitive to outliers

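Variance and Standard Deviation

Sample
◦ Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
◦ Standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Population
◦ Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
◦ Standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$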
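Python's standard `statistics` module provides both versions: `variance` and `stdev` divide by n-1 (sample), while `pvariance` and `pstdev` divide by N (population). The sketch below uses a hypothetical data set that is not from the lecture.

```python
import statistics

# Hypothetical data, chosen only for illustration; squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_var = statistics.variance(data)       # 32 / (8 - 1)
sample_sd = statistics.stdev(data)           # square root of the sample variance
population_var = statistics.pvariance(data)  # 32 / 8
population_sd = statistics.pstdev(data)      # square root of the population variance

print(sample_var, sample_sd, population_var, population_sd)
```

Treating the same eight values as a whole population gives a variance of 4 and a standard deviation of 2; treating them as a sample gives slightly larger values because the divisor is 7 instead of 8.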

Population Parameters and Sample Statistics

Population mean and population standard deviation are denoted by the Greek letters μ (mu) and σ (sigma)
◦ They are unknown constants that we would like to estimate

Sample mean and sample standard deviation are denoted by $\bar{x}$ and s
◦ They are random variables, because their values vary according to the random sample that has been selected

Empirical Rule

If the data is approximately symmetric and bell-shaped, then
◦ About 68% of the observations are within one standard deviation from the mean
◦ About 95% of the observations are within two standard deviations from the mean
◦ About 99.7% of the observations are within three standard deviations from the mean

Example

Scores on a standardized test are scaled so they have a bell-shaped distribution with a mean of 1000 and standard deviation of 150
◦ About 68% of the scores are between ___ and ___
◦ About 95% of the scores are between ___ and ___
◦ If you have a score above 1300, you are in the top ___ %
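The Empirical Rule intervals for this example follow from adding and subtracting multiples of the standard deviation. A minimal sketch; the "top %" step relies on the symmetry of a bell-shaped distribution, so it is approximate in the same way the rule itself is.

```python
mean, sd = 1000, 150

within_one_sd = (mean - sd, mean + sd)          # about 68% of scores
within_two_sd = (mean - 2 * sd, mean + 2 * sd)  # about 95% of scores

# A score of 1300 is two standard deviations above the mean. About 5% of
# scores fall outside the two-SD interval, and by symmetry half of those
# are in the upper tail.
top_percent = (100 - 95) / 2

print(within_one_sd, within_two_sd, top_percent)
```

So about 68% of scores lie between 850 and 1150, about 95% lie between 700 and 1300, and a score above 1300 is in roughly the top 2.5%.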
