Introduction

Our daily lives often involve a great deal of data, or numbers in context. It is important to understand how data is found, what it means, and how the information is used. The focus of this lesson is on how to calculate and understand statistics: the numbers that summarize, describe, or represent sets of data.

1.1.1: Describing Data Sets

Key Concepts

• Data can be described, summarized, and graphed in a variety of ways.

• We can represent a data set using a measure of center.

Measures of Center

• A measure of center is a single number used to represent the middle value, expected value, or most typical value of a data set.

• Two commonly used measures of center are the median and the mean.

• The median is the middle-most value of a data set; 50% of the data is less than this value, and 50% is greater than it.

• To find the median, arrange the data values from least to greatest. The median is the middle value in an ordered data set if the number of data values is odd. If the data set contains an even number of values, the median is the average of the two middle numbers.
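The two cases in the rule above (odd and even numbers of values) can be sketched in Python. This is a minimal illustration, not part of the original lesson; the function name `median` and the sample values are made up for the example.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)      # arrange from least to greatest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value.
        return ordered[mid]
    # Even number of values: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 3, 9]))      # hypothetical data, odd count
print(median([7, 3, 9, 5]))   # hypothetical data, even count
```

Sorting first matters: taking the middle position of the unsorted list would generally give a different, incorrect result.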

• The mean is found by adding the values in a data set and then dividing the sum by the number of values in the data set. It is also considered the average of all the values in a data set. The mean can be found using the formula $\bar{x} = \frac{\sum x_i}{n}$, where $\bar{x}$ (pronounced “x bar”) represents the mean and n is the number of data values.

• $\Sigma$ is the uppercase Greek letter sigma, and is used to represent a sum.

• So, $\sum x_i$ represents the sum of the n data values in the data set: $\sum x_i = x_1 + x_2 + \cdots + x_n$.
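As a rough sketch of the mean formula, the sum and the division can be written out in Python. The function name `mean` is illustrative; the data set 3, 5, 6, 8, 8 is the one used later in this lesson.

```python
def mean(values):
    """Return the mean: the sum of the values divided by how many there are."""
    total = sum(values)   # the sum of x_1 through x_n
    n = len(values)       # the number of data values
    return total / n


print(mean([3, 5, 6, 8, 8]))   # 30 / 5
```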

The Five-Number Summary

• The five-number summary of a data set consists of the following key numbers: the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum.

• The minimum is the smallest value in the data set and the maximum is the largest value in the data set.

• The median, also known as the second quartile, is represented by Q2.

• When the data values are ordered from least to greatest, the first quartile, Q1, is the value that identifies the lower 25% of the data. It is also the median of the lower half of the data set; 75% of all data is greater than this value.

• The third quartile, Q3, is the value that identifies the upper 25% of the data. It is also the median of the upper half of the data set; 75% of all data is less than this value.
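The five-number summary can be sketched in Python using the "median of each half" description above. One assumption is worth flagging: when the number of values is odd, this sketch leaves the middle value out of both halves, a convention the lesson does not state explicitly. The function names and the sample data are made up for illustration.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # lower half of the data
    upper = ordered[half + n % 2:]    # upper half (assumption: skip the middle value when n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


print(five_number_summary([1, 3, 4, 7, 8, 10, 12]))   # hypothetical data
```

Other quartile conventions exist (statistical software often interpolates), so results from a calculator or library may differ slightly from this method.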

Measures of Spread or Variability

• A measure of spread is a number used to describe how far apart certain key values are from each other, or how far a typical value is from the mean of a data set. Measures of spread are also known as measures of variability.

• The most common measures of spread are the range, interquartile range, and standard deviation.

• The range is the difference from the minimum to the maximum in a data set; that is, range = maximum – minimum. The range describes the spread of the entire data set.

• The interquartile range, IQR, is the difference from the first quartile to the third quartile: IQR = Q3 – Q1. The interquartile range describes the spread of the middle “half” of the data set.

• Note: In some cases, the data values between Q1 and Q3 do not form exactly half the data set. But data sets often have many values, and in those cases the middle “half” is very close to half, so the distinction is not important. For example, if a data set has 1,001 values, then the middle “half” has 501 values, which is approximately 50.05% of the data set.
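The 1,001-value figure in the note can be checked with a short Python sketch. The data set here (the whole numbers 1 through 1,001) is a made-up stand-in, and the sketch assumes the middle value is left out of both halves when finding Q1 and Q3.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = list(range(1, 1002))     # hypothetical data: 1, 2, ..., 1001 (already ordered)
half = len(data) // 2
q1 = median(data[:half])        # median of the lower 500 values
q3 = median(data[half + 1:])    # median of the upper 500 values
middle = [x for x in data if q1 <= x <= q3]
percent = round(100 * len(middle) / len(data), 2)
print(q1, q3, len(middle), percent)
```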

• The mean absolute deviation, MAD, is the average absolute value of the difference between each data point in a data set and the mean. It is found by summing the absolute value of each difference (or deviation from the mean), then dividing the sum by the total number of data points.

• The formula for mean absolute deviation is $\text{MAD} = \frac{\sum |x_i - \bar{x}|}{n}$, where $\bar{x}$ is the mean and n is the number of data values.

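• Shown in expanded form, the formula looks like this: $\text{MAD} = \frac{|x_1 - \bar{x}| + |x_2 - \bar{x}| + \cdots + |x_n - \bar{x}|}{n}$

• Consider this data set: 3, 5, 6, 8, 8.

• The mean is 6: $\bar{x} = \frac{3 + 5 + 6 + 8 + 8}{5} = \frac{30}{5} = 6$.

• Use the mean to find the mean absolute deviation by substituting each of the values in the data set for $x_i$ and 6 for $\bar{x}$:

$\text{MAD} = \frac{|3 - 6| + |5 - 6| + |6 - 6| + |8 - 6| + |8 - 6|}{5} = \frac{3 + 1 + 0 + 2 + 2}{5} = \frac{8}{5} = 1.6$

• The mean absolute deviation is 1.6.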
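The same calculation can be sketched in Python, one step per line. The variable names are illustrative only.

```python
data = [3, 5, 6, 8, 8]

mean = sum(data) / len(data)               # 30 / 5
abs_devs = [abs(x - mean) for x in data]   # distance of each value from the mean
mad = sum(abs_devs) / len(data)            # average of those distances

print(abs_devs)
print(mad)
```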


• The lowercase Greek letter sigma, σ, is used in two measures of spread, or variability: variance and standard deviation.

• The variance, σ², is a measure of spread, or variability; it is the average of the squares of the deviations of all the data values in a data set from the mean.

• The variance is found using the formula $\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}$, where $\bar{x}$ is the mean and n is the number of data values.

• Consider the same data set as before: 3, 5, 6, 8, 8, with a mean of 6.

• Find the variance by substituting each of the values in the data set for $x_i$ and 6 for $\bar{x}$:

$\sigma^2 = \frac{(3 - 6)^2 + (5 - 6)^2 + (6 - 6)^2 + (8 - 6)^2 + (8 - 6)^2}{5} = \frac{9 + 1 + 0 + 4 + 4}{5} = \frac{18}{5} = 3.6$

• The variance is 3.6.
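A short Python sketch of the variance calculation, dividing by n as in the lesson's formula. The variable names are illustrative only.

```python
data = [3, 5, 6, 8, 8]

mean = sum(data) / len(data)
squared_devs = [(x - mean) ** 2 for x in data]   # square of each deviation from the mean
variance = sum(squared_devs) / len(data)         # average of the squared deviations

print(squared_devs)
print(variance)
```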


• The standard deviation, σ, is another measure of spread, or variability; it is the square root of the variance, that is, the square root of the average squared difference from the mean.

• The standard deviation is found using the formula $\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$, where $x_i$ is a data point, $\bar{x}$ is the mean, and n is the number of data values.

• Consider the same data set as earlier: 3, 5, 6, 8, 8.

• The variance, found previously, is 3.6. Take the square root of the variance to find the standard deviation:

$\sigma = \sqrt{3.6}$

$\sigma \approx 1.897$
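Putting the pieces together, a minimal Python sketch of the standard deviation as the square root of the variance. The function name `standard_deviation` is an illustrative choice.

```python
import math


def standard_deviation(values):
    """Square root of the average squared deviation from the mean (dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return math.sqrt(variance)


print(round(standard_deviation([3, 5, 6, 8, 8]), 3))
```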

• The standard deviation describes how much the data values vary, or deviate, from the mean. That is, it describes the deviation of a typical data value from the mean.

• When the mean is used as the measure of center, the standard deviation should be used as a measure of spread.

Outliers and Extreme Values

• An outlier is a data value that is much less or much greater than most of the values in the data set.

• A data value is an outlier if it is less than Q1 – 1.5(IQR) or if it is greater than Q3 + 1.5(IQR).

• An extreme value is a data value that seems to be much less or much greater than most of the other data values. Note: All outliers are extreme values, but not all extreme values are outliers.

• The term “extreme value” is less precise than the term “outlier” because there is no rule for identifying extreme values; they are a matter of opinion.

• Nevertheless, extreme values can affect the choices of measures of center and spread.

• Extreme values that are not outliers are those values that fall within the limits discussed previously for outliers.

• When there are no outliers or other extreme data values, the mean is generally a better measure of center than the median.

• When there is an outlier, or in some cases one or more other extreme values, the median is generally a better measure of center than the mean.

Box Plots and Dot Plots

• A box plot is a graph that shows the five-number summary of a data set.

• The vertical line segment inside the box in a box plot represents the median (Q2).

• The length of the box in a box plot is the interquartile range (IQR).

• A dot plot is a graph that uses dots to show the number of times each value in a data set appears in that data set.

• The mean is the balance point on the dot plot of any data set; that is, if the dots were weights on a scale, the mean would be the point at which the scale would be balanced, or level.

• A data distribution is an arrangement of data values. When the data values are displayed in a dot plot, the distribution might have a shape that can be named. Two shapes of particular interest are symmetric and skewed.
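The balance-point description of the mean has a simple numerical counterpart: the deviations below the mean and the deviations above it cancel, so they sum to zero. A small Python sketch, using the data set 3, 5, 6, 8, 8 from earlier in the lesson (variable names are illustrative):

```python
data = [3, 5, 6, 8, 8]
mean = sum(data) / len(data)

deviations = [x - mean for x in data]
left_pull = sum(-d for d in deviations if d < 0)    # total distance of values below the mean
right_pull = sum(d for d in deviations if d > 0)    # total distance of values above the mean

print(deviations, sum(deviations))
print(left_pull, right_pull)
```

The two totals match, which is what "balanced" means here. The median does not have this property in general.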

• In a symmetric distribution, a line can be drawn so that the left and right sides are mirror images of each other.

• In a skewed distribution, most of the data values are clustered on one side, with a “tail” of more spread-out values on the other side.

• A distribution in which there is a “tail” of isolated, spread-out data points to the right of the median is called skewed to the right. (“Tail” describes the visual appearance of the data points.) Data that is skewed to the right is also called positively skewed.

• A distribution is skewed to the right if most of the data values are concentrated on the left. That is, many of the values are clustered on the left side of the distribution, and few values are on the right side (creating the “tail”). There may be one or more outliers or other extreme values on the right.

• A distribution in which there is a tail to the left of the median is called skewed to the left. Data that is skewed to the left is also called negatively skewed.

• A distribution is skewed to the left if most of the data values are concentrated on the right. That is, many of the values are clustered on the right side of the distribution, and few values are on the left side (creating the “tail”). There may be one or more outliers or other extreme values on the left.

Representing a Given Data Set Accurately

• It is not always obvious how to choose the most appropriate measures of center and spread as well as the most appropriate graph for a data set. Furthermore, it is not always clear that one particular choice is better than another. Use the following table to help guide your decisions.

Selecting Appropriate Measures of Center and Spread and Appropriate Graphs

| | If there is an outlier, use: | If there is no outlier, use: |
|---|---|---|
| Measure of center | Median (Q2) | Mean |
| Rough measure of spread | Range | Range |
| Additional measure of spread | Interquartile range (IQR) | Standard deviation (σ)* |
| Graph | Box plot (The median is the vertical segment inside the box.) | Dot plot (The mean is the balance point.) |

*Mean absolute deviation (MAD) and variance (σ²) may be used sometimes as well.

Common Errors/Misconceptions

• confusing the terms mean and median, and how to calculate each measure

• confusing the terms mean absolute deviation, variance, and standard deviation, and how to calculate each measure

• forgetting to order the data values from least to greatest before calculating the median, first and third quartiles, and interquartile range

• choosing the data value whose position number is $\frac{n}{2}$ as the median when there are n data values and n is even; for example, choosing the fifth data value as the median when there are ten data values

• forgetting that when the median is used as the measure of center, the interquartile range should be used as a measure of spread

• confusing the terms skewed to the left and skewed to the right

Guided Practice

Example 1

The following data set shows the numbers of minutes it took 10 chemistry students to complete a quiz:

9 13 10 10 2 11 2 11 11 12

Describe the data set, using appropriate measures of center and spread. Identify any outliers or other extreme values and describe their effects.

1. Make a plan.

The choice of spread depends on the choice of center.

The choice of center depends on whether there are any outliers.

To identify outliers, you need the interquartile range.

To find the interquartile range, you need to first find the quartiles Q1 and Q3.

So, begin by finding the five-number summary of the data set.

2. Find the five-number summary.

The five-number summary includes the minimum value, the first quartile (Q1), the second quartile (Q2) or median, the third quartile (Q3), and the maximum value.

Begin by ordering the data values from least to greatest.

2 2 9 10 10 11 11 11 12 13

The minimum is 2 and the maximum is 13.

The median, Q2, is the average of the two middle values because the number of values, 10, is even.

The two middle values are 10 and 11, so add and divide by 2 to find the median.

$Q_2 = \frac{10 + 11}{2} = 10.5$

The median is 10.5.

There are 5 data values on either side of 10.5; since the number of data values in each half is odd, we can find Q1 and Q3 without averaging values.

The first quartile, Q1, is the middle value of the lower half (the data values to the left of the median): 9.

The third quartile, Q3, is the middle value of the upper half (the data values to the right of the median): 11.

The five-number summary is: minimum = 2, Q1 = 9, Q2 = 10.5, Q3 = 11, maximum = 13.

3. Find the interquartile range (IQR).

The interquartile range is the difference between Q3 (11) and Q1 (9).

IQR = Q3 – Q1

IQR = (11) – (9)

IQR = 2

The interquartile range is 2.

4. Identify any outliers.

A data value is an outlier if it is less than Q1 – 1.5(IQR) or greater than Q3 + 1.5(IQR).

Calculate Q1 – 1.5(IQR) for Q1 = 9 and IQR = 2.

Q1 – 1.5(IQR) = (9) – 1.5(2)

Q1 – 1.5(IQR) = 9 – 3

Q1 – 1.5(IQR) = 6

The data values 2 and 2 are outliers because 2 < 6.

Calculate Q3 + 1.5(IQR) for Q3 = 11 and IQR = 2.

Q3 + 1.5(IQR) = (11) + 1.5(2)

Q3 + 1.5(IQR) = 11 + 3

Q3 + 1.5(IQR) = 14

There are no data values greater than 14.

The only outliers are 2 and 2.
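The outlier check in this step can be sketched in Python. The function name `find_outliers` is made up for illustration, and the quartiles are passed in as the values found in step 2 rather than recomputed.

```python
def find_outliers(values, q1, q3):
    """Return (lower limit, upper limit, outliers) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers


quiz_minutes = [9, 13, 10, 10, 2, 11, 2, 11, 11, 12]
print(find_outliers(quiz_minutes, q1=9, q3=11))
```

Note that a value exactly equal to a limit is not an outlier under this rule, because the comparisons are strict.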

5. Choose an appropriate measure of center for the data.

The median, 10.5, is an appropriate measure of center because there are two extreme values, 2 and 2, that are also outliers of the data set.

6. Choose an appropriate measure of spread for the data.

The range is useful for any data set, but it is only a rough measure because it does not give any information about data values between the minimum and the maximum.

Because the median has been chosen as the more appropriate measure of center, the additional measure of spread should be the interquartile range.

7. Draw a box plot and a dot plot to display the data set.

Use the five-number summary to create the box plot.

Create the dot plot by marking occurrences of each data set value on a number line that has the same increments as your box plot.




8. Use the plots to describe the data set.

The distribution is skewed to the left because there are two values that are on the left, relatively far from the rest of the data, which is concentrated at the right.

The median, Q2 = 10.5, represents the data set.

The median is represented by the vertical line segment inside the box of the box plot.

The interquartile range, 2, is the difference between the upper quartile (Q3), which is 11, and the lower quartile (Q1), which is 9.

The data values 2 and 2 are extreme values in this data set; their effect is to make the mean too low to be an accurate measure of center.

The extreme data values 2 and 2 can be called outliers because they are less than Q1 – 1.5(IQR).

On a box plot, outliers are data values that are outside the box by a distance of more than 1.5 times the interquartile range; that is, outside the box by a distance of more than 1.5 times the length of the box. Looking at the box plot, it appears that the distance between 2 and the left side of the box is more than twice the length of the box itself.
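To see how strongly the two outliers pull on the mean compared with the median, the measures can be computed with and without them. This Python sketch is an illustration added for comparison, not a step in the original solution; it uses the standard library's `statistics.median`.

```python
import statistics

quiz_minutes = [9, 13, 10, 10, 2, 11, 2, 11, 11, 12]
without_outliers = [x for x in quiz_minutes if x != 2]   # drop the two outliers

mean_all = sum(quiz_minutes) / len(quiz_minutes)
mean_trimmed = sum(without_outliers) / len(without_outliers)
median_all = statistics.median(quiz_minutes)
median_trimmed = statistics.median(without_outliers)

print(mean_all, mean_trimmed)       # the mean shifts noticeably
print(median_all, median_trimmed)   # the median shifts much less
```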




http://www.walch.com/ei/00617

Guided Practice

Example 2

Eight friends are discussing their part-time jobs. They worked the following numbers of hours last week:

8 6 8 4 8 14 10 14

Describe the data set, using appropriate measures of center and spread. Identify any outliers or other extreme values and describe their effects.




1. Make a plan.

The choice of spread depends on the choice of center.

The choice of center depends on whether there are any outliers.

To identify outliers, you need the interquartile range.

To find the interquartile range, you need to first find the quartiles Q1 and Q3.

So, begin by finding the five-number summary of the data set.




2. Find the five-number summary.

Order the data values from least to greatest.

4 6 8 8 8 10 14 14

The minimum is 4 and the maximum is 14.

The median is the average of the two middle values, because the number of data values is even.

$Q_2 = \frac{8 + 8}{2} = 8$

The median, 8, lies between the fourth and fifth values of the ordered data set, so it splits the data set into two halves of four values each. Because each half has an even number of data values, we will need to average values to find Q1 and Q3.

Q1 is the average of the two middle values of the lower half of the data set (the data to the left of the median).

$Q_1 = \frac{6 + 8}{2} = 7$

Q3 is the average of the two middle values of the upper half of the data set (the data to the right of the median).

$Q_3 = \frac{10 + 14}{2} = 12$

The five-number summary is: minimum = 4, Q1 = 7, Q2 = 8, Q3 = 12, maximum = 14.




3. Find the interquartile range (IQR).

The interquartile range is the difference between Q3 (12) and Q1 (7).

IQR = Q3 – Q1

IQR = (12) – (7)

IQR = 5

4. Identify any outliers.

A data value is an outlier if it is less than Q1 – 1.5(IQR) or greater than Q3 + 1.5(IQR).

Calculate Q1 – 1.5(IQR) for Q1 = 7 and IQR = 5.

Q1 – 1.5(IQR) = (7) – 1.5(5)

Q1 – 1.5(IQR) = 7 – 7.5

Q1 – 1.5(IQR) = –0.5

There are no data values less than –0.5.

Calculate Q3 + 1.5(IQR) for Q3 = 12 and IQR = 5.

Q3 + 1.5(IQR) = (12) + 1.5(5)

Q3 + 1.5(IQR) = 12 + 7.5

Q3 + 1.5(IQR) = 19.5

There are no data values greater than 19.5. There are no outliers.




5. Choose an appropriate measure of center.

There are no outliers; therefore, look at the ordered list of data values and decide whether there are any values that seem to be extreme, even if they do not qualify as outliers. Do this by informally comparing the differences between consecutive values.

Ordered data values: 4, 6, 8, 8, 8, 10, 14, 14

There are no large differences between consecutive data values, so there do not seem to be any extreme values.

The mean is an appropriate measure of center because there are no outliers or other extreme values.
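The informal comparison of consecutive values can be made concrete by listing the gaps between neighbors in the ordered data. This Python sketch is one possible way to do that; the helper name `gaps` is made up, and what counts as a "large" gap remains a judgment call, as the lesson notes. The Example 1 data is included for contrast.

```python
def gaps(values):
    """Differences between consecutive values once the data is ordered."""
    ordered = sorted(values)
    return [b - a for a, b in zip(ordered, ordered[1:])]


hours = [8, 6, 8, 4, 8, 14, 10, 14]                    # Example 2
quiz_minutes = [9, 13, 10, 10, 2, 11, 2, 11, 11, 12]   # Example 1, for contrast

print(gaps(hours), max(gaps(hours)))
print(gaps(quiz_minutes), max(gaps(quiz_minutes)))
```

The largest gap in the Example 2 data is much smaller than the jump that separates the outliers in Example 1.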

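6. Find the mean, $\bar{x}$.

The mean is the average of all the data values.

$\bar{x} = \frac{\sum x_i}{n}$ (Formula for calculating mean; $\sum x_i$ is the sum of the n data values.)

$\bar{x} = \frac{4 + 6 + 8 + 8 + 8 + 10 + 14 + 14}{8}$ (Substitute values from the data set for $x_1$, etc. There are 8 data values, so n = 8.)

$\bar{x} = \frac{72}{8} = 9$ (Simplify.)

The mean is 9.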

7. Choose appropriate measures of spread.

Because the mean has been chosen as the measure of center, appropriate measures of spread are the range, mean absolute deviation (MAD), variance (σ²), and standard deviation (σ).

8. Find the range.

The range is the difference between the maximum and minimum.

In this data set, the maximum is 14 and the minimum is 4.

range = maximum – minimum

range = (14) – (4)

range = 10

The range is 10.

9. Calculate the deviation, absolute deviation, and squared deviation for each data value.

For each value, find its deviation from the mean, then take the absolute value of the deviation, and then square the deviation.

Organize the data values and results in a table, as shown next.

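| Data value ($x_i$) | Mean ($\bar{x}$) | Deviation from mean ($x_i - \bar{x}$) | Absolute deviation ($\lvert x_i - \bar{x} \rvert$) | Deviation squared ($(x_i - \bar{x})^2$) |
|---|---|---|---|---|
| 4 | 9 | –5 | 5 | 25 |
| 6 | 9 | –3 | 3 | 9 |
| 8 | 9 | –1 | 1 | 1 |
| 8 | 9 | –1 | 1 | 1 |
| 8 | 9 | –1 | 1 | 1 |
| 10 | 9 | 1 | 1 | 1 |
| 14 | 9 | 5 | 5 | 25 |
| 14 | 9 | 5 | 5 | 25 |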


10. Find the mean absolute deviation (MAD), the variance, and the standard deviation for the data set.

Find the sum in each of the last two columns of the table from the previous step.

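| Data value ($x_i$) | Mean ($\bar{x}$) | Deviation from mean ($x_i - \bar{x}$) | Absolute deviation ($\lvert x_i - \bar{x} \rvert$) | Deviation squared ($(x_i - \bar{x})^2$) |
|---|---|---|---|---|
| 4 | 9 | –5 | 5 | 25 |
| 6 | 9 | –3 | 3 | 9 |
| 8 | 9 | –1 | 1 | 1 |
| 8 | 9 | –1 | 1 | 1 |
| 8 | 9 | –1 | 1 | 1 |
| 10 | 9 | 1 | 1 | 1 |
| 14 | 9 | 5 | 5 | 25 |
| 14 | 9 | 5 | 5 | 25 |
| Sum | | | 22 | 88 |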
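The table and its column sums can be reproduced with a few lines of Python. This is a minimal sketch with illustrative variable names; each row holds the data value, its deviation from the mean, the absolute deviation, and the squared deviation.

```python
hours = [4, 6, 8, 8, 8, 10, 14, 14]
mean = sum(hours) / len(hours)

# One row per data value: (value, deviation, absolute deviation, squared deviation)
rows = [(x, x - mean, abs(x - mean), (x - mean) ** 2) for x in hours]
for row in rows:
    print(row)

sum_abs = sum(row[2] for row in rows)   # total of the absolute-deviation column
sum_sq = sum(row[3] for row in rows)    # total of the squared-deviation column
print("Sum", sum_abs, sum_sq)
```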

The sum of the absolute deviations for the individual data values is 22.

The sum of the squares of the deviations is 88.

The mean absolute deviation is the sum of the absolute deviations divided by the number of data values:

$\text{MAD} = \frac{\sum |x_i - \bar{x}|}{n}$ (Formula for mean absolute deviation)

$\text{MAD} = \frac{22}{8}$ (Substitute 22 for $\sum |x_i - \bar{x}|$, the sum of the absolute deviations, and 8 for n, the number of data values.)

$\text{MAD} = 2.75$ (Simplify.)

The mean absolute deviation is 2.75.



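The variance is the sum of the squares of the deviations divided by the number of data values:

$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}$ (Formula for variance)

$\sigma^2 = \frac{88}{8}$ (Substitute 88 for $\sum (x_i - \bar{x})^2$, the sum of the squares of the deviations, and 8 for n, the number of data values.)

$\sigma^2 = 11$ (Simplify.)

The variance is 11.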



The standard deviation is the square root of the variance:

$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$ (Formula for standard deviation)

$\sigma = \sqrt{11}$ (Substitute 11 for the variance, σ².)

$\sigma \approx 3.32$ (Simplify.)

The standard deviation is approximately 3.32.
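These three results can be cross-checked with Python's standard `statistics` module. Its `pvariance` and `pstdev` functions divide by n, which matches the formulas in this lesson; the similarly named `variance` and `stdev` divide by n – 1 and would give different answers. The mean absolute deviation is computed by hand in this sketch.

```python
import statistics

hours = [8, 6, 8, 4, 8, 14, 10, 14]

mean = statistics.mean(hours)
mad = sum(abs(x - mean) for x in hours) / len(hours)   # mean absolute deviation, by hand
variance = statistics.pvariance(hours)                 # divides by n
std_dev = statistics.pstdev(hours)                     # square root of pvariance

print(mean, mad, variance, round(std_dev, 2))
```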



11. Draw a box plot.

Use the five-number summary to create the box plot.

12. Draw a dot plot.

Create the dot plot by marking occurrences of each data set value on a number line that has the same increments as your box plot.




13. Use the plots to describe the data set.

The distribution is neither significantly skewed nor symmetric, though it is nearly symmetric about the value 8.

The mean, $\bar{x} = 9$, and median, Q2 = 8, are both reasonable choices as appropriate measures of center. But the mean is a slightly better choice because it is the balance point of the entire data set, and the data set has no outliers or other extreme values.

The range, 10, describes the spread of the entire data set, from minimum to maximum.

The standard deviation, σ ≈ 3.32, describes the difference, or deviation, between a typical data value and the mean. (The mean absolute deviation, MAD = 2.75, and the variance, σ² = 11, are associated with the standard deviation.)

There are no extreme values or outliers.



http://www.walch.com/ei/00618
