Chapter 3: Statistics for describing, exploring, and ...cms.cerritos.edu/uploads/itse/math112/Lecture_chapter_3.pdf · Chapter 3: Statistics for describing, exploring, and comparing

Chapter 3: Statistics for describing, exploring, and comparing data

Chapter Problem: A common belief is that women talk more than men.

Is that belief founded in fact, or is it a myth?

Data set 8 in Appendix B includes word counts from several sample groups. The results provided by researchers show that the sample mean for males is 15,668.5 words per day (sample size 186), and for females it is 16,215.0 words per day (sample size 210).

Chapter 3-1 Overview and 3-2.1 Measures of Center

3-1 Overview:

Discuss the characteristics of a data set: CVDOT (center, variation, distribution, outliers, and changes over time)

Statistics:

Descriptive statistics: summarize or describe the characteristics of a data set; Chapters 2 and 3 discuss the fundamental principles of descriptive statistics

Inferential statistics: use sample data to make inferences (or generalizations) about a population; focus of later chapters

3-2 Measures of Center:

Objective: discuss the characteristic “center”, the mean and median of a data set, and the effect of outliers on the mean and median

Definition: a measure of center is a value at the center or middle of a data set

Definition: The arithmetic mean of a set of values is the sum of the data values divided by the total number n of values. We call it the “mean” (short for arithmetic mean), and it is denoted by x-bar:

$\bar{x} = \frac{\sum x}{n} = \frac{\text{sum of all sample values}}{\text{number of sample values}}$

3-2.2 Mean, Median & Notations

The mean is relatively reliable: because it takes every data value into account, means of samples drawn from the same population don't vary as much as other measures of center. But the mean is sensitive to every value, so it can be strongly affected by outliers (a disadvantage).

The Greek letter sigma (Σ) denotes the sum of a set of values

x is the variable for individual data value

n is the number of values in a sample

N represents the number of values in a population

The median (x-tilde, $\tilde{x}$) of a data set is the measure of center that is the middle value when the original data values are arranged in ascending order

If the number of values is odd, the median $\tilde{x}$ is exactly the middle value of the list

If the number of values is even, the median is the mean (average) of the two middle numbers; Note: the median is not affected by outliers

$\bar{x} = \frac{\sum x}{n}$ is the mean of a set of sample values

$\mu = \frac{\sum x}{N}$ is the mean of all values in a population

3-2.3 Example of mean and median

Monitoring Lead in Air

Data taken are 5.4, 1.10, 0.42, 0.73, 0.48, 1.10

The mean is 9.23/6 = 1.538 μg/m³

To find the median, first arrange the data in ascending order

0.42, 0.48, 0.73, 1.10, 1.10, 5.4 (the number of values is 6); the median is (0.73 + 1.10)/2 = 0.915 μg/m³

0.42, 0.48, 0.66, 0.73, 1.10, 1.10, 5.4 (the number of values is 7, after adding the value 0.66); the median is 0.73 μg/m³

These examples show that the median is not sensitive to extreme values, so the median is often used for data sets with a few extreme values.

Example 2

Find the mean and median for the word counts from 5 men: 27,531; 15,684; 5,638; 27,997; and 25,433. (Answer: mean = 20,456.6; median = 25,433)

Find the mean and median after including an additional count of 8,077 words. (Answer: mean = 18,393.3; median = 20,558.5)
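The answers to Example 2 can be checked with a short script. This is only a sketch using Python's standard `statistics` module; the variable names `men` and `men_plus` are chosen here for illustration and do not come from the text.

```python
from statistics import mean, median

# Word counts from the 5 men in Example 2
men = [27531, 15684, 5638, 27997, 25433]
# Same data after including the additional count of 8,077 words
men_plus = men + [8077]

# Round the mean to one more decimal place than the original data
print(round(mean(men), 1), median(men))
print(round(mean(men_plus), 1), median(men_plus))
```

With an odd number of values `median` returns the middle value of the sorted list; with an even number it averages the two middle values, as in the definition above.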


3-2.3 Mode and Midrange

The mode of a data set is the value that occurs most frequently

A data set is called bimodal when two values occur with the same greatest frequency; each one is a mode

A data set is called multimodal when more than two values occur with the same greatest frequency; each one is a mode

A data set has no mode when no value is repeated

The midrange is the measure of center calculated as the average of the minimum and maximum values of the data set: Midrange = (max + min)/2

Example from lead in the air: Midrange = (5.4 + 0.42)/2 = 2.910 μg/m³

Example from word counts: 27,531; 15,684; 5,638; 27,997; 25,433; Midrange = ?

The midrange is rarely used, since it uses only the max and min and is too sensitive to those extremes. It is easy to compute, however, and is one of the values that can define the “center” of the data set. The “midrange” is different from the “mean”
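A minimal sketch of the mode and midrange definitions, applied to the lead-in-air data. The helper names `modes` and `midrange` are made up for this illustration; returning an empty list for "no mode" is a convention assumed here.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the greatest frequency, or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value is repeated, so there is no mode
    return [v for v, c in counts.items() if c == top]

def midrange(values):
    """Average of the maximum and minimum values."""
    return (max(values) + min(values)) / 2

lead = [5.4, 1.10, 0.42, 0.73, 0.48, 1.10]
word_counts = [27531, 15684, 5638, 27997, 25433]

print(modes(lead))                # the one repeated value
print(round(midrange(lead), 3))   # (5.4 + 0.42) / 2
print(modes(word_counts))         # empty list: no mode
```

The same `midrange` function can be applied to `word_counts` to check the exercise above.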

Round-off rule – Carry one more decimal place than is present in the original set of values


3-2.4 Examples

Example 1: Comparison of ages of best actresses and best actors

Comparison of Ages of Best Actresses and Best Actors

            Best Actresses   Best Actors
Mean        35.7             43.9
Median      33.5             42
Mode        35               41 and 42
Midrange    50.5             52.5

What do these data tell us? The measures of center suggest that best actresses are younger than best actors.

In Ch 9, we will discuss methods for determining whether such differences are statistically significant.

Example 2: Find the mean, median, mode, and midrange of the randomly selected cans of Coke: 12.3, 12.1, 12.2, 12.3, 12.2 (Answer: mean = 12.22; median = 12.20; modes = 12.2 and 12.3; midrange = 12.2)

3-2.5 More Examples

The following examples identify a major reason why the mean and median are not always meaningful statistics that accurately and effectively serve as measures of center.

Find the mean and median of the following

Zip codes: 12601, 90210, 02116, 76177, 19102

Ranks of stress level from different jobs: 2 3 1 7 9

Surveyed respondents coded as 1 (for democrat), 2 (for republican), 3 (for liberal), 4 (for conservative), or 5 (for others)

Mean salary of secondary school teachers: from the 50 states, $37,200, $49,400, $40,000, …, $37,800. The mean is $42,210, but is this the mean salary of all secondary school teachers in the U.S.? Why or why not?

The above example did not take into consideration the number of secondary school teachers in each state. The mean for all secondary school teachers in the U.S. is $45,200, not $42,210.


3-2.6 Mean from a Frequency Distribution

The mean from a frequency distribution is defined as

Formula 3-2: $\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$

where x is the class midpoint and f is the frequency for that class

Example:

Age of actress   Frequency f   Class midpoint x   f · x
21 - 30          28            25.5               714
31 - 40          30            35.5               1065
41 - 50          12            45.5               546
51 - 60          2             55.5               111
61 - 70          2             65.5               131
71 - 80          2             75.5               151
Totals           76                               2718

$\bar{x} = \frac{\sum (f \cdot x)}{\sum f} = \frac{2718}{76} = 35.8$

You can use a TI calculator, with the midpoints in L1 and the frequencies f in L2, then calculate the mean
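As a sketch of Formula 3-2, the actress-age table can be reproduced in a few lines. The list names `midpoints` and `freqs` are illustrative; the numbers are the class midpoints and frequencies from the table.

```python
# Class midpoints x and frequencies f from the actress-age table
midpoints = [25.5, 35.5, 45.5, 55.5, 65.5, 75.5]
freqs = [28, 30, 12, 2, 2, 2]

# Numerator: sum of f * x; denominator: sum of f
total_fx = sum(f * x for f, x in zip(freqs, midpoints))
total_f = sum(freqs)

mean_from_table = total_fx / total_f
print(total_fx, total_f, round(mean_from_table, 1))
```

Because each age is replaced by its class midpoint, this mean is an approximation of the mean of the original ages.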

3-2.7 Weighted Mean

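Weighted mean – used when the values have different degrees of importance (weights w); it is defined as

Formula 3-3: $\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$

Example: mean of 3 test scores (85, 90, 75)

Test 1: 20%

Test 2: 30%

Test 3: 50%

$\bar{x} = \frac{\sum (w \cdot x)}{\sum w} = \frac{(20 \times 85) + (30 \times 90) + (50 \times 75)}{20 + 30 + 50} = \frac{8150}{100} = 81.5$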
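Formula 3-3 can be sketched the same way. The function name `weighted_mean` is hypothetical; the scores and weights are those of the three tests above.

```python
def weighted_mean(values, weights):
    """Sum of w * x divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

scores = [85, 90, 75]
weights = [20, 30, 50]   # percent weights of Test 1, Test 2, Test 3

print(weighted_mean(scores, weights))
print(sum(scores) / len(scores))   # ordinary mean, for comparison
```

The unweighted mean of the three scores is higher, because the lowest score (75) carries the largest weight.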

3-2.8 Best measure of Center?

Measures of center:

• Mean: find the sum of all values, then divide by the number of values. (1) It is sensitive to extreme values. (2) Sample means tend to vary less than other measures of center.

• Median: sort the data. With an odd number of values, the median is the value in the exact middle. With an even number of values, add the 2 middle numbers, then divide by 2. The median is a good choice if there are some extreme values.

• Mode: the value that occurs most frequently. The mode is good for data at the nominal level of measurement.

• Midrange: (max + min)/2. The midrange is rarely used.

Use the mean most, then the median

3-2.9 Skewness of data

Definition: a distribution of data is skewed if it is not symmetric and extends more to one side than the other

Skewed to the left (negatively skewed; has a longer left tail) if the mean and the median are to the left of the mode

Skewed to the right (positively skewed; has a longer right tail) if the mean and the median are to the right of the mode

A distribution is symmetric (zero skewness) if the left half of its histogram is roughly a mirror image of its right half; then mean = median = mode


3-2.10 Summary and Homework #8 Section 3-2

We have learned the types of measures of center of a data set (mean, median, mode, and midrange), the mean from a frequency distribution, the weighted mean, the best measure of center, and skewness.

The mean and median cannot always be used to identify the shape of the distribution.

Question: What is the highest point of the graph, whether it is symmetric or skewed?

HW #8, Pages 94-96, #5-17 odd, 33-34 (answer for 34: mean = 84.8, grade = B)


3-3.1 Measures of Variation

Objective 3-3:

Learn the characteristic of variation, including measures such as the standard deviation and variance

Learn how to use a data set to find the values of the range and standard deviation

Interpret values of standard deviations and understand the reasoning behind the standard deviation

Definition: The range of a set of data is the difference between the maximum value and the minimum value

Range = (maximum value) – (minimum value)

The range is not as useful as other measures of variation, since it depends only on the max and min (i.e. it is extremely sensitive to the extreme values)

Definition: The standard deviation of a set of sample values is a measure of variation of the values about the mean.

Formula 3-4, sample standard deviation:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Formula 3-5, shortcut formula for sample standard deviation (formula used by calculators and computer programs):

$s = \sqrt{\frac{n \sum (x^2) - (\sum x)^2}{n(n - 1)}}$

3-3.2 Properties of Standard Deviation (S.D.)

S.D. is a measure of variation of all values from the mean

The value of the S.D. is never negative (s ≥ 0). It is zero only when all of the data values are the same; larger S.D. values indicate a greater amount of variation

S.D. can increase dramatically with the inclusion of one or more outliers

The units of the S.D. s are the same as the units of the original values, e.g. minutes, feet, pounds, etc.

Procedure for finding the sample standard deviation with Formula 3-4:

1. Compute the mean $\bar{x}$

2. Subtract the mean from each value: $x - \bar{x}$

3. Square each difference: $(x - \bar{x})^2$

4. Add all the squares: $\sum (x - \bar{x})^2$

5. Divide the total by n-1 (i.e. one less than the number of values): $\frac{\sum (x - \bar{x})^2}{n - 1}$

6. Find the square root of the result of step 5: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Example: Find the standard deviation of the waiting times (in minutes) 1, 3, 14.
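The six steps can be traced on the waiting times 1, 3, 14. This is a sketch only; the variable names are illustrative, and the last line checks the result against the shortcut Formula 3-5.

```python
import math

times = [1, 3, 14]          # waiting times in minutes
n = len(times)

xbar = sum(times) / n                        # step 1: mean
deviations = [x - xbar for x in times]       # step 2: x - mean
squares = [d ** 2 for d in deviations]       # step 3: squared deviations
total = sum(squares)                         # step 4: sum of squares
variance = total / (n - 1)                   # step 5: divide by n - 1
s = math.sqrt(variance)                      # step 6: square root

# Shortcut Formula 3-5: sqrt((n * sum(x^2) - (sum x)^2) / (n * (n - 1)))
shortcut = math.sqrt((n * sum(x * x for x in times) - sum(times) ** 2) / (n * (n - 1)))

print(xbar, deviations, squares, total, variance, s, shortcut)
```

Step 5 is the sample variance (in squared minutes) and step 6 is the sample standard deviation (in minutes); both formulas give the same value of s.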

3-3.3 Standard Deviation of a Population

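The standard deviation of a population uses the same formula as the sample standard deviation, except that the deviations are taken from the population mean μ and the sum is divided by N (the population size) instead of n − 1

The population standard deviation is defined as

$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

Since we generally deal with sample data, we usually use Formula 3-4:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$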

3-3.4 Variance of a Sample and Population

Definition – The variance of a set of values is a measure of variation equal to the square of the standard deviation

Sample variance: s² = square of the sample standard deviation

Population variance: σ² = square of the population standard deviation

s² is called an unbiased estimator of the population variance σ²

Example: Use the waiting times of 1 min, 3 min, and 14 min to find the variance of the waiting times

Q: Is a smaller variance better?

Note: The units of the variance are the squares of the units of the original data set (e.g. min²); the standard deviation has the same units as the data set

Notations:

s = sample standard deviation

s² = sample variance

σ = population standard deviation

σ² = population variance

SD – standard deviation

VAR – variance

Round-Off Rule – carry one more decimal place than is present in the original set of data for the final answer (don't round off in the middle of a calculation)

3-3.5 Why learn Standard Deviation and interpretation

Standard deviation measures the variation among values

Small standard deviation means values are close together, while large standard deviation means values are spread farther apart

Range Rule of Thumb: used to roughly estimate the standard deviation. It is based on the principle that for most data sets the vast majority (such as 95%) of sample values lie within 2 standard deviations of the mean, so s ≈ range/4 (range = max – min)

If the standard deviation s is known, we can use it to estimate the min and max “usual” sample values

Minimum “usual” value = (mean) – 2 * (standard deviation)

Maximum “usual” value = (mean) + 2 * (standard deviation)

Example 1: IQ test, mean is 100, S.D. is 15; min is 100 – 2(15) = 70, max is 100 + 2(15) = 130

Interpretation: Based on these results, we expect that typical IQ scores fall between 70 and 130. How do you interpret IQ 65 or IQ 135?

Example 2: Pulse rates of women: mean is 76, S.D. is 12.5; min is 51 beats/min, and max is 101 beats/min

Interpretation: Typical women's pulse rates are from 51 to 101 beats/min

A pulse rate of 110 would be unusual, since 110 is outside these limits
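The range rule of thumb for "usual" values is easy to sketch in code. The function names `usual_limits` and `is_unusual` are invented for this illustration.

```python
def usual_limits(mean, sd):
    """Return (minimum usual value, maximum usual value) = mean -/+ 2 * sd."""
    return mean - 2 * sd, mean + 2 * sd

def is_unusual(value, mean, sd):
    """A value is unusual if it falls outside the usual limits."""
    low, high = usual_limits(mean, sd)
    return value < low or value > high

print(usual_limits(100, 15))        # IQ scores
print(usual_limits(76, 12.5))       # women's pulse rates
print(is_unusual(110, 76, 12.5))    # pulse rate of 110
print(is_unusual(65, 100, 15), is_unusual(135, 100, 15))  # IQ 65 and IQ 135
```

Note that these limits are only a rough guide; they are not the actual minimum and maximum of any particular sample.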


3-3.6 Empirical (or 68-95-99.7) Rule for data with normal distribution

Empirical rule – a data set having a distribution that is approximately bell-shaped has the following properties:

About 68% of all values fall within 1 standard deviation of the mean, i.e. between (mean - s) and (mean + s)

About 95% of all values fall within 2 standard deviations of the mean, i.e. between (mean - 2s) and (mean + 2s)

About 99.7% of all values fall within 3 standard deviations of the mean, i.e. between (mean - 3s) and (mean + 3s)

Example: IQ scores have mean 100 and standard deviation 15. What percentage of IQ scores are between 70 and 130?


3-3.7 Chebyshev's Theorem

The proportion (or fraction) of any data set lying within K standard deviations of the mean is always at least $1 - 1/K^2$, where K > 1

When K = 2, we can interpret that at least 3/4 (75%) of all values lie within 2 standard deviations of the mean

When K = 3, we can interpret that at least 8/9 (or 89%) of all values lie within 3 standard deviations of the mean

Example – IQ score:

At least 75% of people have IQ between 70 and 130 (2 SD from mean)

At least 89% of people have IQ between 55 and 145 (3 SD from mean)

Comparison: Example – IQ score using empirical rule

About 68% of people have IQ between 85 and 115 (1 SD from mean)

About 95% of people have IQ between 70 and 130 (2 SD from mean)

About 99.7% of people have IQ between 55 and 145 (3 SD from mean)
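The comparison above can be tabulated with a short sketch. `chebyshev_bound` is a hypothetical helper name; the empirical-rule percentages are simply the 68-95-99.7 figures from the text, and the IQ mean 100 and S.D. 15 come from the example.

```python
def chebyshev_bound(k):
    """Minimum proportion of any data set within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}   # bell-shaped data only

mean, sd = 100, 15   # IQ scores
for k in (2, 3):
    low, high = mean - k * sd, mean + k * sd
    print(f"{low} to {high}: at least {chebyshev_bound(k):.0%} (Chebyshev), "
          f"about {EMPIRICAL[k]:.1%} (empirical rule)")
```

Chebyshev's bound is weaker because it must hold for any distribution, while the empirical rule assumes an approximately bell-shaped one.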


3-3.8 Coefficient of Variation in Different Populations

The coefficient of variation (CV) for a set of nonnegative sample or population data, expressed as a percent, describes the standard deviation relative to the mean:

Sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$

Population: $CV = \frac{\sigma}{\mu} \cdot 100\%$

Example: Heights and Weights of Men (data set 1 in Appendix B)

For heights: mean $\bar{x}$ = 68.34 in, s.d. = 3.02 in

For weights: mean $\bar{x}$ = 172.55 lb, s.d. = 26.33 lb

We want to compare the variation among heights to the variation among weights.

Heights: CV = 4.42%; weights: CV = 15.26%

Interpretation: The heights have considerably less variation than the weights. Does that make sense?
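A small sketch of the CV calculation for the heights and weights of men. The function name `coefficient_of_variation` is illustrative only.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation relative to the mean, as a percent."""
    return sd / mean * 100

cv_height = coefficient_of_variation(3.02, 68.34)     # inches
cv_weight = coefficient_of_variation(26.33, 172.55)   # pounds

print(f"heights: {cv_height:.2f}%  weights: {cv_weight:.2f}%")
```

Because the units cancel in s divided by the mean, the two percentages can be compared directly even though one data set is in inches and the other in pounds.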

3-3.9 Summary and HW #9 (3-3)

Range rule of thumb: s ≈ range/4

Min (usual) = mean – 2*s

Max (usual) = mean + 2*s

Empirical rule (only applicable to a normal (bell-shaped) distribution)

68% within 1 S.D. means about 68% of data values are between (mean - s) and (mean + s)

95% within 2 S.D. means about 95% of data values are between (mean - 2s) and (mean + 2s)

99.7% within 3 S.D. means about 99.7% of data values are between (mean - 3s) and (mean + 3s)

Chebyshev's Theorem helps to approximate the values of a data set (applicable to any data set, but has limited usefulness)

HW #9 (3-3) Pp. 110 -113, # 5-11odd, 17, 31- 35odd


3-4.1 Measure of Relative Standing and Boxplots

Objective: To learn the “measures” that can be used to compare values from the same or different data sets: z scores (and how to convert data values to z scores), quartiles, percentiles, and boxplots

Definition: a z score (or standardized value) is the number of standard deviations that a given value x is above or below the mean

For sample data: $z = \frac{x - \bar{x}}{s}$

For population data: $z = \frac{x - \mu}{\sigma}$

Round z to two decimal places

Example: A man is 76.2 in tall and weighs 237.1 lb. Find the z scores for his height and weight. (mean height = 68.34 in, s.d. = 3.02 in; mean weight = 172.55 lb, s.d. = 26.33 lb) The z score for this man is 2.60 for height and 2.45 for weight.

Interpretation: The man is 2.60 standard deviations above the mean height and 2.45 standard deviations above the mean weight. His height is more extreme than his weight.

Example: Lyndon Johnson 75” (mean 71.5”, S.D. 2.1”), Shaquille O'Neal 85” (mean 80”, S.D. 3.3”)

Interpretation: ?
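The z score formula can be sketched as a one-line function. The name `z_score` is illustrative, and the man's height is taken here as 76.2 in, an assumption based on it being the value consistent with the stated z score of 2.60.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Height and weight of the man in the example (height assumed to be 76.2 in)
print(round(z_score(76.2, 68.34, 3.02), 2))
print(round(z_score(237.1, 172.55, 26.33), 2))

# Lyndon Johnson among presidents, Shaquille O'Neal among his comparison group
print(round(z_score(75, 71.5, 2.1), 2))
print(round(z_score(85, 80, 3.3), 2))
```

Since z scores have no units, the height and weight results (and the two men in the second example) can be compared on the same scale.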

3-4.2 z-score and unusual values

Using the range rule of thumb, a value is “unusual” if it is more than 2 S.D. from the mean:

min (usual) = mean – 2*s, and

max (usual) = mean + 2*s

Using z scores, a value is “unusual” if its z score is less than -2 or greater than +2

Ordinary values: -2 ≤ z score ≤ +2

Unusual values: z score < -2 or z score > +2

z scores are measures of position relative to the mean: a z score of +2 means 2 standard deviations above the mean, and a z score of -3 means 3 standard deviations below the mean

Example: Over the past 30 years, heights of basketball players at Newport University have had a mean of 74.5 in and an s.d. of 2.5 in. The latest recruit has a height of 79.0 in. Find the z score.

Is the height of 79.0 in unusual among the heights of players over the past 30 years? Why or why not?
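One way to check the Newport University example is to compute the z score and apply the ±2 cutoff. This is only a sketch; `z_score` and `classify` are names made up here.

```python
def z_score(x, mean, sd):
    return (x - mean) / sd

def classify(z):
    """Apply the rule: ordinary if -2 <= z <= 2, otherwise unusual."""
    return "unusual" if z < -2 or z > 2 else "ordinary"

z = z_score(79.0, 74.5, 2.5)   # latest recruit's height
print(round(z, 2), classify(z))
```

The classification only says where the value sits relative to the cutoff; explaining why is still part of the exercise.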

3-4.3 Percentiles

Definition:

Percentiles are one type of quantiles (or fractiles), which partition data into groups with roughly the same number of values in each group

Percentiles are measures of location. There are 99 percentiles, denoted by P1, P2, P3, …, P99, which divide a set of data into 100 groups with about 1% of the values in each group

Example: the 50th percentile, denoted by P50, has about 50% of the data values below it and about 50% of the data values above it; the 50th percentile is the same as the median.

The formula for the percentile of a value x is (round the result to the nearest whole number)

$\text{percentile of value } x = \frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$

Another way, going from the kth percentile to its location L in the sorted data, is

$L = \frac{k}{100} \cdot n$

n = total number of values in the data set

k = percentile being used

L = location that gives the position of a value

Example: Find the percentile for the value of $29 million in the sorted movie budgets (in millions of dollars) of Table 3-4 in the text:

4.5   5     6.5   7     20    20    29    30    35    40
40    41    50    52    60    65    68    68    70    70
70    72    74    75    80    100   113   116   120   125
132   150   160   200   225
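The percentile of the $29 million budget can be sketched as follows. The function name `percentile_of` is illustrative; the list holds the 35 movie budgets from the table.

```python
budgets = [4.5, 5, 6.5, 7, 20, 20, 29, 30, 35, 40,
           40, 41, 50, 52, 60, 65, 68, 68, 70, 70,
           70, 72, 74, 75, 80, 100, 113, 116, 120, 125,
           132, 150, 160, 200, 225]

def percentile_of(value, data):
    """(number of values less than value) / (total number of values) * 100, rounded."""
    below = sum(1 for v in data if v < value)
    return round(below / len(data) * 100)

print(percentile_of(29, budgets))
```

Six of the 35 budgets are below 29, so the calculation is 6/35 · 100 ≈ 17.1, which rounds to the 17th percentile.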

3-4.4 Converting from kth percentile to the corresponding data value

1. Start: sort the data (arrange the data from low to high)

2. Compute L = (k/100)n, where n = number of values and k = percentile

3. Is L a whole number?

Yes: the value of Pk is the mean of the values at the Lth location and the (L+1)th location

No: change L by rounding it up to the next whole number; the value of Pk is then the Lth value, counting from the lowest

Example: Find the 17th percentile of the previous data set.
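The procedure for converting a percentile to a data value might be sketched like this. The function name `percentile_value` is hypothetical, k is assumed to be a whole number from 1 to 99, and the whole-number test on L is done with integer arithmetic (k·n divisible by 100) to avoid floating-point surprises.

```python
budgets = [4.5, 5, 6.5, 7, 20, 20, 29, 30, 35, 40,
           40, 41, 50, 52, 60, 65, 68, 68, 70, 70,
           70, 72, 74, 75, 80, 100, 113, 116, 120, 125,
           132, 150, 160, 200, 225]

def percentile_value(data, k):
    """Value of the kth percentile P_k, following the procedure above."""
    ordered = sorted(data)                 # sort from low to high
    n = len(ordered)
    if (k * n) % 100 == 0:                 # L = (k/100) * n is a whole number
        L = k * n // 100
        # mean of the Lth and (L+1)th values (list indexes start at 0)
        return (ordered[L - 1] + ordered[L]) / 2
    L = k * n // 100 + 1                   # round L up to the next whole number
    return ordered[L - 1]                  # the Lth value from the lowest

print(percentile_value(budgets, 17))   # L = 5.95 rounds up to 6
print(percentile_value(budgets, 50))   # L = 17.5 rounds up to 18
```

Other textbooks and software use slightly different percentile conventions, so results from a calculator or library function may not match this procedure exactly.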

3-4.5 Example: Setting Speed Limits

The table lists randomly selected recorded speeds (in miles/hour) on Highway 405

68 68 72 73 65 74 73 72 68 65
65 73 66 71 68 74 66 71 65 73
59 75 70 56 66 75 68 75 62 72
60 73 61 75 58 74 60 73 58 75

Find the 85th percentile of the listed speeds

Given that speed limits are usually rounded to a multiple of 5, what speed limit is suggested by these data? Explain your choice

Does the existing speed limit on Highway 405 conform to the 85th percentile rule (i.e. the speed limit is set so that 85% of drivers are at or below the speed limit)?

3-4.6 Quartiles

Definition

Quartiles are measures of location, denoted by Q1, Q2, and Q3, which divide a data set into four groups with about 25% of the values in each group (percentiles divide the data into 100 groups)

The three quartiles Q1, Q2, Q3 divide the sorted data values into 4 equal parts

Q1 (first quartile) separates the bottom 25% from the rest

Q2 (second quartile) separates the bottom 50% from the rest (Q2 is also the median and the 50th percentile)

Q3 (third quartile) separates the bottom 75% from the rest

Interquartile range (IQR) = Q3 - Q1

Semi-interquartile range = (Q3 - Q1)/2

Mid-quartile = (Q3 + Q1)/2

10-90 percentile range = P90 - P10

Example: find the values of Min, Q1, Q2, Q3, Max, and IQR of the movie budgets (Table 3-4 in the text)

3-4.8 5-Number Summary and Boxplot

Definition – for a set of data, the 5-number summary consists of:

(1) Minimum, (2) Q1, (3) Q2 (the median), (4) Q3, (5) Maximum

A boxplot (or box-and-whisker diagram) is a graph of a data set that consists of a line extending from the min to the max, and a box with lines drawn at Q1, the median, and Q3 (summary and example follow)

A boxplot is a graph that is useful for revealing

• The center of the data

• The spread of the distribution of the data

• The presence of outliers

An outlier is a value that is located far away from almost all of the other values, an extreme value that falls outside the general pattern

A data value x is an outlier if x – Q3 > 1.5 × IQR or Q1 – x > 1.5 × IQR

An outlier can have a dramatic effect on

• The mean

• The standard deviation

• The scale of the histogram, so that the true nature of the distribution is totally obscured
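The 1.5 × IQR rule can be sketched with a small function. `is_outlier` is a name made up here; the quartiles and extremes are taken from the five-number summary of men's and women's word counts given later in this chapter.

```python
def is_outlier(x, q1, q3):
    """True if x is more than 1.5 * IQR above Q3 or below Q1."""
    iqr = q3 - q1
    return (x - q3 > 1.5 * iqr) or (q1 - x > 1.5 * iqr)

# Men: min 695, Q1 10009, Q3 20565, max 47016
print(is_outlier(47016, 10009, 20565), is_outlier(695, 10009, 20565))

# Women: min 1674, Q1 11010, Q3 20571, max 40055
print(is_outlier(40055, 11010, 20571), is_outlier(1674, 11010, 20571))
```

For the men, IQR = 10,556 and 1.5 × IQR = 15,834; the maximum lies 26,451 above Q3, so it is flagged, while the minimum lies only 9,314 below Q1.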


3-4.9 Procedure for Constructing a Boxplot, HW #10

Find the 5-number summary: min, Q1, median, Q3, and max

Construct a scale with values that include the min and max data values

Construct a box (rectangle) extending from Q1 to Q3, and draw a line in the box at the median value

Draw lines extending outward from the box to the min and max data values

Example on the board: use the 5-number summary of the movie budgets

Boxplots don't show as much detailed information as histograms or stem-and-leaf plots, so they are not the best choice when dealing with a single data set; but they are great for comparing different data sets (use the same scale)

Do women really talk more than men? Use the 5-number summary

Read Table 3-3, Comparison of word counts of men and women, for the “mean”, “median”, “midrange”, “range”, and “S.D.”

Example: Here are measured reaction times (in seconds) in a test of driving skills: 2.4, 2.5, 2.8, 2.0, 2.4, 2.9, 3.2, 3.5, 2.7, 2.7, 2.8, 2.6; find the five-number summary.
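A sketch of the five-number summary for the reaction times, using the chapter's percentile procedure (L = (k/100)n; average two values when L is a whole number, otherwise round L up). The function names are illustrative, and other software may use different quartile conventions.

```python
times = [2.4, 2.5, 2.8, 2.0, 2.4, 2.9, 3.2, 3.5, 2.7, 2.7, 2.8, 2.6]

def percentile_value(data, k):
    """Value of the kth percentile (k a whole number from 1 to 99)."""
    ordered = sorted(data)
    n = len(ordered)
    if (k * n) % 100 == 0:                 # L is a whole number
        L = k * n // 100
        return (ordered[L - 1] + ordered[L]) / 2
    L = k * n // 100 + 1                   # round L up
    return ordered[L - 1]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), rounded to 2 decimal places."""
    q1, q2, q3 = (percentile_value(data, k) for k in (25, 50, 75))
    return tuple(round(v, 2) for v in (min(data), q1, q2, q3, max(data)))

print(five_number_summary(times))
```

With n = 12, the locations for Q1, Q2, and Q3 are L = 3, 6, and 9, all whole numbers, so each quartile is the mean of two neighboring sorted values. These five values are what the boxplot procedure needs.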

HW #10: P. 127-128, #1, 5-7, 9, 13, 15, 19, 23, 27

Five-number summary of word counts:

         Min    Q1      Q2      Q3      Max
Men      695    10009   14290   20565   47016
Women    1674   11010   15917   20571   40055

Review (1)

Ch 1 – You should be able to do the following: distinguish between a population and a sample, and between a parameter and a statistic

Understand the importance of good experimental design, including the control of variable effects, replication, and randomization

Recognize the importance of good sampling methods in general, a simple random sample in particular

Understand if sample data are not collected in an appropriate way, the data may be completely useless

Ch 2: You should be able to do the following:

Summarize data by constructing a frequency distribution or relative frequency distribution

Visually display the nature of the distribution by constructing a histogram or relative frequency histogram

Investigate important characteristics of a data set by creating visual displays, such as a frequency polygon, dotplot, stemplot, Pareto chart, pie chart, scatterplot, or time-series graph

Understand and interpret those results

Review (3) - Continued

You should be able to:

Calculate measures of center by finding the mean and median

Calculate measures of variation by finding the standard deviation, variance, and range

Understand and interpret the standard deviation by using tools such as the range rule of thumb

Compare individual values by using z scores, quartiles, or percentiles, and identify outliers

Investigate and explore the spread of the data, the center of the data, and the range of values by constructing a boxplot

Understand and interpret those results, such as the standard deviation as a measure of how much data vary, and use the standard deviation to distinguish between values that are usual and unusual

Examples

Always consider certain key factors:

• Context of the data

• Source of the data

• Sampling method

• Measures of center

• Measures of variation

• Distribution

• Outliers

• Changing patterns over time

• Conclusion

• Practical implications
