Chapter 3 Numerical Methods for Describing Data Distributions Created by Kathy Fritz


Suppose that you have just received your score on an exam in one of your classes. What would you want to know about the distribution of scores for this exam?

Measures of center describe where the data distribution is located along the number line. A measure of center provides information about what is “typical”.

Measures of spread describe how much variability there is in a data distribution. A measure of spread provides information about how much individual values tend to differ from one another.

What was the class average? You want to know a “typical” exam score or a number that best describes the entire set of scores.

What were the high and low scores? You want to know about the variability of the data set.

The stress of the final years of medical training can contribute to depression and burnout. The authors of the paper “Rates of Medication Errors Among Depressed and Burnt Out Residents” (British Medical Journal [2008]: 488) studied 24 residents in pediatrics. Medical records of patients treated by these residents during a fixed time period were examined for errors in ordering or administering medications. The accompanying dotplot displays the total number of medication errors for each of the 24 residents.

Which is more appropriate as the “typical” value of this data set: mean = 1 or median = 0? Explain.

Choosing Appropriate Measures for Describing Center and Spread

If the shape of the data distribution is …     Describe center and spread using …
Approximately symmetric                         Mean and standard deviation
Skewed or has outliers                          Median and interquartile range

Describing Center and Spread For Data Distributions That Are Approximately Symmetric

Mean

Standard Deviation

Mean

The sample mean is the arithmetic average of values in a data set.

It is denoted by the symbol $\bar{x}$ (pronounced x-bar).

Formula:

$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum x}{n}$

Some notation:

$x$ = the variable of interest
$n$ = the sample size
$x_1, x_2, \ldots, x_n$ are the individual observations in the data set

The population mean, $\mu$ (the Greek letter mu), is the arithmetic average of all the x values in an entire population.

Measuring Variability

Consider the three sets of six exam scores displayed below:

Each data set has a mean exam score of 75. Does that completely describe these data sets?

Range

The simplest numeric measure of variability is the range.

Range = largest observation – smallest observation

What are the ranges of the three data sets?

50 60 70 80 90 100

Deviations

The most widely used measures of variability are based on how far each observation deviates from the sample mean. These quantities are called deviations from the mean.

Mean = 75
Deviation = 5
Deviation = -25

What does a negative deviation tell you about how the data value compares to the mean?

Be sure to subtract the mean from the data value (data value – mean), so that data values above the mean have a positive deviation from the mean and data values below the mean have a negative deviation from the mean.

If we found each of the deviations from the mean (-25, -15, -5, 5, 15, 25), what is the sum of these deviations?

The sum of the deviations from the mean will always equal zero!
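This fact is easy to check numerically. The sketch below assumes the six exam scores are 50, 60, 70, 80, 90, and 100, which is consistent with the mean of 75 and the deviations listed above; the variable names are illustrative only.

```python
# Assumed data set: six exam scores whose mean is 75.
scores = [50, 60, 70, 80, 90, 100]

# Sample mean: add the observations and divide by the sample size n.
mean = sum(scores) / len(scores)

# Deviation from the mean = data value - mean, so values above the
# mean get a positive deviation and values below get a negative one.
deviations = [x - mean for x in scores]

print(mean)             # 75.0
print(deviations)       # [-25.0, -15.0, -5.0, 5.0, 15.0, 25.0]
print(sum(deviations))  # 0.0
```

The positive and negative deviations cancel exactly, which is why a plain average of the deviations cannot serve as a measure of variability.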

Variance and Standard Deviation

Suppose that we are interested in finding the “typical” or average deviation from the mean. Can we just calculate the arithmetic average of the deviations from the mean? Why or why not?

Since the sum of the deviations from the mean is always zero, you cannot just add the deviations and then divide by the number of deviations. Instead, first square each deviation, so that all the squared deviations are positive.

The deviations from the mean were -25, -15, -5, 5, 15, and 25. The squares of these deviations from the mean are 625, 225, 25, 25, 225, and 625.

Now we can average these.

Wait a minute . . .

If the data values represented the entire population, then we would divide by the number of values (n). However, more often than not, the data values represent a sample from the population, and we divide by (n – 1).

Why?

If the spread of the population were from 50 to 100, samples would rarely have the same spread. The samples would tend to have a smaller spread (less variability). By dividing by the smaller number n – 1, we get a better estimate of the true “typical” deviation from the mean.

50 60 70 80 90 100

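The sample variance, $s^2$, is the sum of the squared deviations from the mean divided by n – 1:

$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} = \dfrac{625 + 225 + 25 + 25 + 225 + 625}{6 - 1} = \dfrac{1750}{5} = 350$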


The standard deviation is a more natural measure of variability than the variance

because it is expressed in the same units as the original data values.

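The sample standard deviation, $s$, is the positive square root of the sample variance:

$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{350} \approx 18.708$

The data values deviate from the mean of 75 by about 18.708 units, on average.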
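The calculation can be sketched in a few lines. As before, the data set 50, 60, ..., 100 is an assumption inferred from the deviations listed earlier, and the variable names are hypothetical.

```python
import math

# Assumed data set (consistent with a mean of 75 and the listed deviations).
scores = [50, 60, 70, 80, 90, 100]
n = len(scores)
mean = sum(scores) / n

# Square each deviation so that negatives cannot cancel positives.
squared_deviations = [(x - mean) ** 2 for x in scores]
sum_sq = sum(squared_deviations)

# Divide by n only if the values are the entire population ...
population_variance = sum_sq / n
# ... and by n - 1 when the values are a sample from the population.
sample_variance = sum_sq / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(sum_sq)               # 1750.0
print(sample_variance)      # 350.0
print(round(sample_sd, 3))  # 18.708
```

Dividing by n – 1 gives a slightly larger value than dividing by n, which compensates for the tendency of samples to show less spread than the population.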

50 60 70 80 90 100

Notation to remember

                      Population     Sample
Mean                  $\mu$          $\bar{x}$
Variance              $\sigma^2$     $s^2$
Standard Deviation    $\sigma$       $s$

Putting it Together

1. Select appropriate measures of center and spread (look at the shape of the distribution).

2. Compute the values of the selected measures.

3. Interpret the values in context.

Since the data distribution is approximately symmetric with no outliers, we should use the mean and standard deviation as the measures of center and variability.

On average, the bats in the sample made 69.09 attempts to drink from the smooth, metal surface.

The sample standard deviation is 56.35. This is relatively large compared to the values in the data set, indicating a lot of variability from bat to bat in number of drinking attempts.

Describing Center and Spread For Data Distributions That Are Skewed or Have Outliers

Median

Interquartile Range

Median

The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then . . .

sample median = the single middle value if n is odd

sample median = the average of the two middle values if n is even

Forty students were enrolled in a statistical reasoning course at a California college. The instructor made course materials, grades, and lecture notes available to students on a class web site. Course management software kept track of how often each student accessed any of these web pages. The data set below (in order from smallest to largest) is the number of times each of the 40 students had accessed the class web page during the first month.

0 0 0 0 0 0 3 4 4 4 5 5 7 7

8 8 8 12 12 13 13 13 14 14 16 18 19 19

20 20 21 22 23 26 36 36 37 42 84 331

These braces split the data set into two equal halves with 20 values above and 20 values below.

The median is the average of these two middle values. Median = 13

Why is the sample mean so much larger than the sample median?

The sample mean can be sensitive to even a single outlier.

The median is quite insensitive to outliers.
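A short sketch makes the contrast concrete, using the 40 website-visit counts listed above. Dropping the largest value is only an illustration of sensitivity, not a recommendation to discard data.

```python
from statistics import mean, median

# Number of visits to the class web page for the 40 students (sorted).
visits = [0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 5, 5, 7, 7,
          8, 8, 8, 12, 12, 13, 13, 13, 14, 14, 16, 18, 19, 19,
          20, 20, 21, 22, 23, 26, 36, 36, 37, 42, 84, 331]

print(mean(visits))    # about 23.1
print(median(visits))  # 13.0 (average of the 20th and 21st values)

# Leave out the single largest observation (331) and recompute.
without_largest = visits[:-1]
print(round(mean(without_largest), 2))  # 15.21
print(median(without_largest))          # 13
```

One observation moves the mean by almost 8 visits, while the median does not change at all.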

Interquartile range (iqr) is based on quantities called quartiles which divide the data set into four equal parts (quarters).

Lower quartile (Q1) = the median of the lower half of the data

Upper quartile (Q3) = the median of the upper half of the data

If n is odd, the median of the entire data set is excluded from both halves when computing quartiles.

iqr = Q3 – Q1

Interquartile Range

The sample standard deviation, s, can also be greatly affected by the presence of even one outlier.

The interquartile range is a measure of variability that is resistant to the effects of outliers.

Recall the website data set:

0 0 0 0 0 0 3 4 4 4 5 5 7 7

8 8 8 12 12 13 13 13 14 14 16 18 19 19

20 20 21 22 23 26 36 36 37 42 84 331

The lower quartile (Q1) is the median of the lower 20 data values: Q1 = 4.5

Median = 13

The upper quartile (Q3) is the median of the upper 20 data values: Q3 = 20.5

The interquartile range (iqr) is the difference between the upper and lower quartiles.

iqr = 20.5 – 4.5 = 16

The interquartile range also measures how spread out the middle half of the data set is.

If the interquartile range is small, then the middle half of the data set is tightly clustered together.

If the interquartile range is large, then the middle half of the data set is more spread out.

Since the interquartile range of 16 is relatively large, the middle half of the data set for the number of visits to the class website is fairly spread out.
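The quartile rule described above (median of each half, with the overall median left out of both halves when n is odd) can be sketched as follows. The function name `quartiles` is hypothetical, and statistical software often uses slightly different quartile conventions, so other tools may report somewhat different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When n is odd, the overall median is excluded from both halves.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[n - half:]  # skips the middle value when n is odd
    return median(lower), median(upper)

visits = [0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 5, 5, 7, 7,
          8, 8, 8, 12, 12, 13, 13, 13, 14, 14, 16, 18, 19, 19,
          20, 20, 21, 22, 23, 26, 36, 36, 37, 42, 84, 331]

q1, q3 = quartiles(visits)
iqr = q3 - q1
print(q1, q3, iqr)  # 4.5 20.5 16.0
```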

How many data values are in the first quarter of the data set?

. . . in the second quarter of the data set?

. . . in the third quarter of the data set?

. . . in the fourth quarter of the data set?

The Chronicle of Higher Education (Almanac Issue, 2009-2010) published the accompanying data on the percentage of the population with a bachelor’s degree or graduate degree in 2007 for each of the 50 U.S. states and the District of Columbia. The data distribution is shown in the histogram below.

Putting it Together

Step 1: Select
We will use the median and interquartile range as measures of center and spread for this data distribution since it is skewed and has an outlier.

Step 2: Calculations
median = 26
iqr = 6

Step 3: Interpret
The median for this data set is 26. For half the states, the percentage of the population with a bachelor’s or graduate degree is 26% or less. For the other half of the states, 26% or more of the population have a bachelor’s or graduate degree. The interquartile range of 6 indicates that the middle half of the data is spread out over an interval of 6 percentage points.

Boxplots

General Boxplots

Modified Boxplots

Five-Number Summary

The five-number summary consists of the following:

1. Smallest observation in the data set (minimum)
2. Lower quartile (Q1)
3. Median
4. Upper quartile (Q3)
5. Largest observation in the data set (maximum)

A boxplot is a graph of the five-number summary.

Boxplots

When to use: Univariate numerical data

How to construct:
1. Compute the values in the five-number summary.
2. Draw a horizontal line and add an appropriate scale.
3. Draw a box above the line that extends from the lower quartile (Q1) to the upper quartile (Q3).
4. Draw a line segment inside the box at the location of the median.
5. Draw two line segments, called whiskers, which extend from the box to the smallest observation and from the box to the largest observation.

What to look for: center, spread, and shape of the data distribution, and whether there are any unusual features
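Step 1 of the construction, computing the five-number summary, can be sketched as below. The function name `five_number_summary` is hypothetical, the quartiles follow the median-of-halves rule used in this chapter, and the website-visit counts from earlier serve as the example data.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]       # lower half of the data
    upper = ordered[n - half:]   # upper half (middle value skipped if n is odd)
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

visits = [0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 5, 5, 7, 7,
          8, 8, 8, 12, 12, 13, 13, 13, 14, 14, 16, 18, 19, 19,
          20, 20, 21, 22, 23, 26, 36, 36, 37, 42, 84, 331]

print(five_number_summary(visits))  # (0, 4.5, 13.0, 20.5, 331)
```

For these data the box would run from 4.5 to 20.5 with a line at 13, and the whiskers of a basic boxplot would extend to 0 and 331.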

The authors of the paper “Striatal Volume Predicts Level of Video Game Skill Acquisition” (Cerebral Cortex [2010]: 2522-2530) studied a number of factors that affect performance in a complex video game. One factor was practice strategy. Forty college students who never played the game Space Fortress were assigned at random to one of two groups: 1) told to improve total score or 2) told to improve a different aspect, such as speed. The data distribution for the first group (improving total score) is shown in the dotplot below, along with the median and the lower and upper quartiles.

We already have a horizontal line and an appropriate scale.

Draw a box from the lower quartile to the upper quartile.

Draw a line inside the box for the median.

Draw line segments for the whiskers.

Comparative Boxplots

A comparative boxplot is two or more boxplots drawn on the same numerical scale.

Recall the video game study. There were two groups: 1) told to improve total score or 2) told to improve a different aspect, such as speed.

[Figure: comparative boxplot of improvement scores for the 1st and 2nd groups]

Both distributions are approximately symmetric. However, the median improvement score for the second group is much larger than the median improvement score for the first group.

The improvement scores for the first group are more consistent than the improvement scores for the second group.

Outliers

An observation is an outlier if it is . . .

greater than upper quartile + 1.5(iqr)

or

less than lower quartile – 1.5(iqr)

A modified boxplot is a boxplot that shows outliers.

Modified boxplots

How to construct:
1. Compute the values in the five-number summary.
2. Draw a horizontal line and add an appropriate scale.
3. Draw a box above the line that extends from the lower quartile (Q1) to the upper quartile (Q3).
4. Draw a line segment inside the box at the location of the median.
5. Determine if there are any outliers in the data set.
6. Add whiskers that extend from the box to the smallest observation that is not an outlier and to the largest observation that is not an outlier.
7. If there are outliers, add dots to the plot to indicate the positions of the outliers.

Big Mac prices in U.S. dollars for 44 different countries were given in the article “Big Mac Index 2010”. The following 44 Big Mac prices are arranged in order from the lowest price (Ukraine) to the highest price (Norway).

1.84 1.86 1.90 1.95 2.17 2.19 2.19 2.28 2.33 2.34 2.45

2.46 2.50 2.51 2.60 2.62 2.67 2.71 2.80 2.82 2.99 3.08

3.33 3.34 3.43 3.48 3.54 3.56 3.59 3.67 3.73 3.74 3.83

3.84 3.86 3.89 4.00 4.33 4.39 4.90 4.91 6.19 6.56 7.20

Compute the five-number summary.

Smallest observation = 1.84
Lower quartile = 2.455
Median = 3.205
Upper quartile = 3.835
Largest observation = 7.20

The median is the average of the two middle values, 3.08 and 3.33.

Check if there are any outliers.

iqr = 3.835 – 2.455 = 1.38

2.455 – 1.5(1.38) = 0.385

3.835 + 1.5(1.38) = 5.905

There are no outliers on the lower end of the data set, but there are three outliers on the upper end: 6.19, 6.56, and 7.20.
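The outlier check for the Big Mac prices can be reproduced with a short sketch. It applies the 1.5(iqr) rule stated earlier; the variable names are illustrative, and the quartiles use the median-of-halves rule from this chapter.

```python
from statistics import median

# The 44 Big Mac prices (U.S. dollars), lowest to highest.
prices = [1.84, 1.86, 1.90, 1.95, 2.17, 2.19, 2.19, 2.28, 2.33, 2.34, 2.45,
          2.46, 2.50, 2.51, 2.60, 2.62, 2.67, 2.71, 2.80, 2.82, 2.99, 3.08,
          3.33, 3.34, 3.43, 3.48, 3.54, 3.56, 3.59, 3.67, 3.73, 3.74, 3.83,
          3.84, 3.86, 3.89, 4.00, 4.33, 4.39, 4.90, 4.91, 6.19, 6.56, 7.20]

ordered = sorted(prices)
half = len(ordered) // 2
q1 = median(ordered[:half])                 # lower quartile
q3 = median(ordered[len(ordered) - half:])  # upper quartile
iqr = q3 - q1

# Outlier boundaries: more than 1.5(iqr) beyond either quartile.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [p for p in ordered if p < lower_limit or p > upper_limit]
inside = [p for p in ordered if lower_limit <= p <= upper_limit]

# In a modified boxplot the whiskers stop at the most extreme
# observations that are not outliers.
whisker_ends = (inside[0], inside[-1])

print(outliers)      # [6.19, 6.56, 7.2]
print(whisker_ends)  # (1.84, 4.91)
```

The whiskers of the modified boxplot therefore end at 1.84 and 4.91, and the three outliers are plotted as separate dots.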

Big Mac Prices Continued . . .

Draw a horizontal line with an appropriate scale.

Draw the box from Q1 to Q3 and add a line at the median.

Draw the whiskers from the box to the smallest and largest observations that are not outliers.

Add dots for the outliers.

[Figure: modified boxplot of Big Mac prices on a scale from 1 to 8]

Interpret the graph.

There are three outliers at $6.19, $6.56, and $7.20. The typical or median price for a Big Mac is $3.205, and the interquartile range of the prices is $1.38. There is quite a bit of variability in the Big Mac prices. The distribution of prices is skewed right.

How does the mean price of Big Macs compare to the median price of $3.205?

The 2009-2010 salaries of NBA players published on the web site hoopshype.com were used to construct the comparative boxplot of salary data for five teams.

Discuss the similarities and differences.

Measures of Relative Standing

z-scores

Percentiles

z-scores

The z-score corresponding to a particular data value is

$z\text{-score} = \dfrac{\text{data value} - \text{mean}}{\text{standard deviation}}$

When you obtain your score after taking a test, you probably want to know how it compares to the scores of others.

Is your score above or below the mean, and by how much?

Does your score place you among the top 5% of the class or only among the top 25%?

Answering these questions involves measuring the position, or relative standing, of a particular value in a data set.

One measure of relative standing is a z-score.

The z-score tells you how many standard deviations the data value is from the mean.

The process of subtracting the mean and then dividing by the standard deviation is sometimes referred to as standardization.

A z-score is one example of a standardized score.

What do these z-scores mean?

-2.3: The data value is 2.3 standard deviations below the mean.

1.8: The data value is 1.8 standard deviations above the mean.

Suppose that two graduating seniors, one a marketing major and one an accounting major, are comparing job offers. The accounting major has an offer for $45,000 per year, and the marketing major has an offer for $43,000 per year.

Which is the better offer? z-scores help us to compare these two offers.

Accounting: mean = 46,000, standard deviation = 1,500
Marketing: mean = 42,500, standard deviation = 1,000

Accounting z-score = (45,000 – 46,000)/1,500 = –0.67
Marketing z-score = (43,000 – 42,500)/1,000 = 0.50

Relative to their distributions, the marketing offer is actually more attractive than the accounting offer, even though the marketing major may not be happy!
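The comparison of the two offers can be sketched with a small helper. The function name `z_score` is hypothetical; the salary figures, means, and standard deviations are the ones given in the example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Each offer is standardized against its own field's salary distribution.
accounting_z = z_score(45_000, mean=46_000, sd=1_500)
marketing_z = z_score(43_000, mean=42_500, sd=1_000)

print(round(accounting_z, 2))  # -0.67
print(round(marketing_z, 2))   # 0.5
```

The accounting offer is about two-thirds of a standard deviation below the mean for accounting offers, while the marketing offer is half a standard deviation above the mean for marketing offers.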

Empirical Rule

If the data distribution is mound shaped and approximately symmetric, then . . .

• Approximately 68% of the observations are within 1 standard deviation of the mean

• Approximately 95% of the observations are within 2 standard deviations of the mean

• Approximately 99.7% of the observations are within 3 standard deviations of the mean

The z-score is particularly useful when the data distribution is mound shaped and approximately symmetric.

[Figure: mound-shaped distribution illustrating the 68%, 95%, and 99.7% percentages given by the Empirical Rule]

Empirical Rule

Number of Standard Deviations    Interval            Actual    Empirical Rule
1                                60.094 to 64.875    72.1%     Approximately 68%
2                                57.704 to 67.264    96.2%     Approximately 95%
3                                55.314 to 69.654    99.2%     Approximately 99.7%

Remember, the Empirical Rule is only an approximation of the actual percentages, but it is a good estimate as long as the data distribution is mound shaped and approximately symmetric.
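One way to see how close the approximation is: count the fraction of observations within 1, 2, and 3 standard deviations of the mean. The sketch below does this for simulated data, not the data set summarized above; the 10,000 values are drawn from a normal distribution with a fixed seed, and the helper name `fraction_within` is hypothetical.

```python
import random
from statistics import mean, stdev

# Simulated mound-shaped, symmetric data (an assumption for illustration).
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(10_000)]

def fraction_within(values, k):
    """Fraction of values within k standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    return sum(1 for x in values if abs(x - m) <= k * s) / len(values)

for k in (1, 2, 3):
    print(k, round(fraction_within(data, k), 3))
```

For data like these the three fractions should land near 0.68, 0.95, and 0.997. Running the same function on a strongly skewed data set would show how far the actual percentages can drift from the rule.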

Percentiles

For a number r between 0 and 100, the rth percentile is a value such that r percent of the observations fall AT or BELOW that value.

This diagram illustrates the 90th percentile.

In addition to weight and length, head circumference is another measure of health in newborn babies. The National Center for Health Statistics reports the following summary values for head circumference (in cm) at birth for boys.

Percentile                  5      10     25     50     75     90     95
Head circumference (cm)     32.2   33.2   34.5   35.8   37.0   38.2   38.6

What value of head circumference is at the 75th percentile? 37.0 cm

What is the median value of head circumference? 35.8 cm
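Reading the table amounts to a lookup by percentile, and the definition of a percentile can be checked by counting. In the sketch below the dictionary holds the reported head-circumference values; the helper `percent_at_or_below` and the ten-value list are hypothetical and used only to illustrate the definition.

```python
# Reported head circumference (cm) at birth for boys, keyed by percentile.
head_circumference_cm = {5: 32.2, 10: 33.2, 25: 34.5, 50: 35.8,
                         75: 37.0, 90: 38.2, 95: 38.6}

print(head_circumference_cm[75])  # 37.0
print(head_circumference_cm[50])  # 35.8 (the 50th percentile is the median)

def percent_at_or_below(data, value):
    """Percent of observations that fall at or below the given value."""
    return 100 * sum(1 for x in data if x <= value) / len(data)

# Hypothetical data: 9 of the 10 values are at or below 9.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percent_at_or_below(sample, 9))  # 90.0
```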

Common Mistakes

Avoid these Common Mistakes

1. Watch out for categorical data that look numerical! Often, categorical data is coded numerically. For example, gender might be coded as 0 = female and 1 = male, but this does not make gender a numerical variable. Categorical data CANNOT be summarized using the mean and standard deviation or the median and interquartile range.

2. Measures of center don’t tell all. Although measures of center, such as the mean and the median, do give you a sense of what might be a typical value for a variable, this is only one characteristic of a data set. Without additional information about variability and distribution shape, you don’t really know much about the behavior of the variable.

3. Data distributions with different shapes can have the same mean and standard deviation. For example, consider the following two histograms:

Both histograms have the same mean of 10 and standard deviation of 2, but VERY different shapes.

4. Both the mean and the standard deviation are sensitive to extreme values in a data set, especially if the sample size is small.

If the data distribution is markedly skewed or if the data set has outliers, the median and interquartile range are a better choice for describing center and spread.

5. Measures of center and measures of variability describe values of a variable, not frequencies in a frequency distribution or heights of bars in a histogram. For example, consider the following two frequency distributions and histograms:

Distribution A has a larger standard deviation even though the frequencies are equal.

6. Be careful with boxplots based on small sample sizes. Boxplots convey information about center, variability, and shape, but interpreting shape information is problematic when the sample size is small.

[Figure: three boxplots for samples of size n = 5, n = 10, and n = 20, each on a scale from 1 to 8]


7. Not all distributions are mound shaped. Using the Empirical Rule in situations where you are not convinced that the data distribution is mound shaped and approximately symmetric can lead to incorrect statements.

8. Watch for outliers! Unusual observations in a data set often provide important information about the variable under study, so it is important to consider outliers in addition to describing what is typical.

Outliers can also be problematic because the values of some summaries are influenced by outliers and because some methods for drawing conclusions from data are not appropriate if the data set has outliers.
