AP Statistics – Ch. 1 Notes
Exploring Data
Statistics is the art and science of learning from data. This may include:
• Designing appropriate tools to collect data.
• Organizing data in a meaningful way.
• Displaying data with appropriate graphs.
• Summarizing data with numbers.
• Using data to make predictions.
Data are numbers or other information in context.
Individuals are the objects described by a set of data. They may be people, animals, or things.
A variable is any characteristic of an individual. It can take different values for different individuals.
A categorical (or qualitative) variable places an individual into one of several groups or categories.
A quantitative variable takes numerical values for which it makes sense to find an average.
Note: Not every variable with a number value is quantitative!
Examples: zip codes, phone numbers, social security numbers, barcodes, etc.
The distribution of a variable shows what values the variable takes and how often it takes each value.
How to Explore Data
• Begin by examining each variable by itself. Then move on to study relationships among the data.
• Start with a graph or graphs. Then add numerical summaries.
Example: The following table shows information about several popular cell phone models.
Phone | Operating System | Screen Size (inches) | Internal Storage (GB) | Expandable Storage | Rear Camera (megapixels) | Battery Life, Talk Time (hours)
Apple iPhone 6S Plus | iOS 9 | 5.5 | 16 | No | 12 | 24
Apple iPhone 6s | iOS 9 | 4.7 | 16 | No | 12 | 14
Apple iPhone 6 | iOS 8 | 4.7 | 16 | No | 8 | 14
BlackBerry DTEK 50 | Android 6.0 | 5.2 | 16 | Yes | 13 | 17
BlackBerry Priv | Android 5.1 | 5.4 | 32 | Yes | 18 | 24
BlackBerry Leap | BlackBerry 10 | 5.0 | 16 | Yes | 8 | 25
LG X Skin | Android 6.0 | 5.0 | 16 | Yes | 8 | 7
LG G5 SE | Android 6.0 | 5.3 | 32 | Yes | 16 | 20
LG G5 | Android 6.0 | 5.3 | 32 | Yes | 16 | 20
Microsoft Lumia 650 | Windows 10 | 5.0 | 16 | Yes | 8 | 13
Microsoft Lumia 950 | Windows 10 | 5.2 | 32 | Yes | 20 | 13
Microsoft Lumia 950 XL | Windows 10 | 5.7 | 32 | Yes | 20 | 19
Samsung Galaxy Note 7 | Android 6.0 | 5.7 | 64 | Yes | 12 | 24
Samsung Galaxy On 7 Pro | Android 6.0 | 5.5 | 16 | Yes | 13 | 11
Samsung Galaxy S7 Edge | Android 6.0 | 5.5 | 32 | Yes | 12 | 33
a) Who/what are the individuals in this data set?
Cell phone models
b) What variables are measured? Identify each as categorical or quantitative. In what units were the quantitative variables measured?
Operating system (categorical), screen size (quantitative – inches), amount of internal storage (quantitative – GB), whether or not it has expandable memory (categorical), rear camera resolution (quantitative – megapixels), battery life (quantitative – hours)
c) Give the distributions of the following for the data set: screen size, internal storage, and presence of expandable memory.
Screen Size (inches) | Count
4.7 | 2
5.0 | 3
5.2 | 2
5.3 | 2
5.4 | 1
5.5 | 3
5.7 | 2
Internal Storage (GB) | Count
16 | 8
32 | 6
64 | 1
Expandable Memory? | Count
Yes | 12
No | 3
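The counts in part (c) can be checked with a short script. This is only a sketch: the lists are typed in from the phone table above, and the helper name `distribution` is made up for this illustration.

```python
from collections import Counter

# Values copied from the phone table, in table order (15 phones).
screen_sizes = [5.5, 4.7, 4.7, 5.2, 5.4, 5.0, 5.0, 5.3, 5.3, 5.0,
                5.2, 5.7, 5.7, 5.5, 5.5]
internal_storage = [16, 16, 16, 16, 32, 16, 16, 32, 32, 16,
                    32, 32, 64, 16, 32]
expandable = ["No"] * 3 + ["Yes"] * 12


def distribution(values):
    """Return (value, count) pairs sorted by value (hypothetical helper)."""
    return sorted(Counter(values).items())


screen_distribution = distribution(screen_sizes)
storage_distribution = distribution(internal_storage)
expandable_distribution = distribution(expandable)

for name, dist in [("Screen size", screen_distribution),
                   ("Internal storage", storage_distribution),
                   ("Expandable memory", expandable_distribution)]:
    print(name)
    for value, count in dist:
        print(f"  {value}: {count}")
```

A distribution is just "each value and how often it occurs," which is what `Counter` tallies.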
Analyzing Categorical Data
The values of a categorical variable are labels for the categories, such as “male” or “female”. The distribution of a categorical variable gives the categories and either the count or proportion of individuals who fall into each category.
Proportion: The fraction of the total that possesses a certain attribute. Proportions can be expressed as fractions, decimals, or percentages.
Frequency: The number (count) of individuals in each category.
Relative Frequency: The proportion of individuals in each category.
Often, we organize categorical data into either a frequency table or a relative frequency table. Categorical data is often displayed using bar graphs and pie charts.
Example: The following is a frequency table showing the distribution of responses to the question, “How do you eat corn on the cob?” Find the relative frequency distribution.
How do you eat corn on the cob? | Frequency | Relative Frequency
In rows | 22 | 22/50 = 44%
In circles | 13 | 13/50 = 26%
Bite wherever | 6 | 6/50 = 12%
I don’t eat corn on the cob | 5 | 5/50 = 10%
Cut the corn off the cob | 4 | 4/50 = 8%
Total | 50 | 100%
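The relative frequency column is each count divided by the total. A minimal sketch of that calculation, using the counts from the table (the function name `relative_frequencies` is an assumption for illustration):

```python
# Frequencies from the corn-on-the-cob table.
frequencies = {
    "In rows": 22,
    "In circles": 13,
    "Bite wherever": 6,
    "I don't eat corn on the cob": 5,
    "Cut the corn off the cob": 4,
}


def relative_frequencies(freq):
    """Divide each count by the total count (hypothetical helper)."""
    total = sum(freq.values())
    return {category: count / total for category, count in freq.items()}


rel_freq = relative_frequencies(frequencies)
for category, proportion in rel_freq.items():
    print(f"{category}: {proportion:.0%}")
```

The proportions must add to 1 (100%), which is a quick check that no category was left out.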
[Figure: pie chart, “Method of Eating Corn on the Cob”: Rows 44%, Circles 26%, Wherever 12%, Don't Eat 10%, Cut Off 8%]
Bar Graph Procedure:
1. Draw a horizontal line. Write the category names or labels below the line at regularly spaced intervals.
2. Draw a vertical line. Label the scale using either frequency or relative frequency.
3. Place a rectangular bar above each category label with the height corresponding to the frequency or relative frequency. Make sure all bars have the same width. Bars are historically drawn with gaps between them to indicate that the data is not continuous.
Pie Chart Procedure:
1. Draw a circle to represent the entire data set.
2. For each category, calculate the size of the central angle of the “slice”:
   slice size = 360° ⋅ (relative frequency of category)
3. Use a protractor to divide the circle into slices with the appropriate central angles.
4. Label the slices appropriately!
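Step 2 of the pie chart procedure is a single multiplication per category. As a hedged sketch, here it is applied to the corn-on-the-cob relative frequencies (the short category labels and the function name `central_angles` are illustrative choices, not part of the notes):

```python
# Relative frequencies from the corn-on-the-cob example.
rel_freq = {
    "Rows": 0.44,
    "Circles": 0.26,
    "Wherever": 0.12,
    "Don't eat": 0.10,
    "Cut off": 0.08,
}


def central_angles(relative_frequencies):
    """Slice size in degrees = 360 * relative frequency of the category."""
    return {category: 360 * p for category, p in relative_frequencies.items()}


angles = central_angles(rel_freq)
for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```

Because the relative frequencies add to 1, the angles add to 360°.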
An alternative to a pie chart is a segmented bar graph, where there is a single bar with “segments” corresponding to the different categories. Segmented bar graphs use relative frequencies on the vertical axis. The height of each segment corresponds to the relative frequency of that category.
Example: Draw a well-labeled bar graph, a well-labeled segmented bar graph, and a well-labeled pie chart of the corn-on-the-cob data from the previous example.
Note: Bar graphs are more flexible than pie charts! Bar graphs can compare any set of quantities that are measured in the same units, but pie charts can only compare all parts of a single whole. For example, you couldn’t compare the percentages of sophomores, juniors, and seniors who approve of the parking policies with a pie chart. These are not parts of the same whole. The percentages wouldn’t add to 100%. You could use a bar graph for this data, however.
[Figure: bar graph, “Method of Eating Corn on the Cob”; vertical axis: Relative Frequency (0% to 50%)]
Angles for pie chart:
Rows: (0.44)(360°) = 158.4°
Circles: (0.26)(360°) = 93.6°
Wherever: (0.12)(360°) = 43.2°
Don’t eat: (0.10)(360°) = 36.0°
Cut off: (0.08)(360°) = 28.8°
[Figure: segmented bar graph, “Method of Eating Corn on the Cob”; segments: Rows, Circles, Wherever, Don't Eat, Cut Off; vertical axis: Relative Frequency (0% to 100%)]
Deceptive Graphs:
• Watch out for graphs in which the width changes in addition to the height. The eye responds to area, so this makes the graph misleading. This happens a lot in pictographs.
• Watch out for graphs where the axes don’t start at zero (and/or are missing).
[Figure: 3D pie chart, “Perception of 3D Pie Charts”; categories: Cool, Confusing, Misleading, Unreadable; slice labels: 30%, 20%, 20%, 30%]
• Watch out for unequally-spaced intervals.
• Watch out for pie charts or segmented bar graphs where the percentages don’t add to 100%. Pie charts and segmented bar graphs represent all the parts of a single whole. Bar graphs, on the other hand, can represent proportions of different groups, so it is okay if the percentages on a bar graph don’t add to 100%.
• Watch out for 3D graphs or graphs set at an angle. This distorts the data.
Two-Way Tables (also called Contingency Tables)
Often, we are interested in the relationship between two categorical variables. In this case, we can organize the data into a table, with the rows representing possible values of one variable and the columns representing possible values of the other.
The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all the individuals described by the table. (Use the totals in the margins to calculate this.)
Example: A sample of 200 children from the United Kingdom aged 9-17 was selected from the CensusAtSchool website. The gender of each child was recorded along with which superpower they would most like to have. Here are the results. Calculate the marginal distribution of superpower preferences. Make a graph to display the marginal distribution. Describe what you see.
Counts by superpower (rows) and gender (columns):
Superpower | Female | Male | Total
Invisibility | 17 | 13 | 30
Superstrength | 3 | 17 | 20
Telepathy | 39 | 5 | 44
Fly | 36 | 18 | 54
Freeze Time | 20 | 32 | 52
Total | 115 | 85 | 200
Marginal distribution of superpower preference:
Superpower | Relative Frequency
Invisibility | 30/200 = 15%
Superstrength | 20/200 = 10%
Telepathy | 44/200 = 22%
Fly | 54/200 = 27%
Freeze Time | 52/200 = 26%
[Figure: bar graph of the marginal distribution; horizontal axis: Superpower Choice; vertical axis: Percent (0% to 30%)]
For all the students surveyed, flying is the most popular choice of superpower, followed closely by freezing time. Superstrength is least popular.
To examine the relationships between variables, we need to calculate some well-chosen percentages from the counts in the table. A conditional distribution of a variable describes the values of that variable among individuals with one specific value of another variable. There is a separate conditional distribution for each value of the other variable. For example, the conditional distribution of superpower preferences for females would tell us what percentage of females prefers each superpower.
Example: Using the data above, calculate the conditional distribution of superpower preference for each gender. (This means figure out what percentage of girls want each superpower and what percentage of boys want each superpower.)
Superpower | Female | Male
Invisibility | 17/115 = 14.8% | 13/85 = 15.3%
Superstrength | 3/115 = 2.6% | 17/85 = 20.0%
Telepathy | 39/115 = 33.9% | 5/85 = 5.9%
Fly | 36/115 = 31.3% | 18/85 = 21.2%
Freeze Time | 20/115 = 17.4% | 32/85 = 37.6%
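The difference between a marginal and a conditional distribution is only the choice of denominator: the grand total for the marginal distribution, one group's total for a conditional distribution. The sketch below works through the superpower counts; the nested-dictionary layout and the function names are assumptions made for this illustration.

```python
# Two-way table of counts from the superpower example.
table = {
    "Invisibility": {"Female": 17, "Male": 13},
    "Superstrength": {"Female": 3, "Male": 17},
    "Telepathy": {"Female": 39, "Male": 5},
    "Fly": {"Female": 36, "Male": 18},
    "Freeze Time": {"Female": 20, "Male": 32},
}


def marginal_distribution(two_way):
    """Row totals divided by the grand total."""
    grand_total = sum(sum(row.values()) for row in two_way.values())
    return {power: sum(row.values()) / grand_total
            for power, row in two_way.items()}


def conditional_distribution(two_way, group):
    """Counts for one group divided by that group's column total."""
    group_total = sum(row[group] for row in two_way.values())
    return {power: row[group] / group_total for power, row in two_way.items()}


marginal = marginal_distribution(table)
females = conditional_distribution(table, "Female")
males = conditional_distribution(table, "Male")

for power in table:
    print(f"{power}: all {marginal[power]:.1%}, "
          f"female {females[power]:.1%}, male {males[power]:.1%}")
```

Each of the three distributions adds to 100% on its own, which is why the female and male columns can be compared fairly even though the groups have different sizes (115 and 85).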
To compare the conditional distributions of a categorical variable, we use side-by-side bar graphs (or comparative bar graphs). These use the same horizontal and vertical axes for the bar graphs of two or more groups. The bar charts are usually color-coded to indicate which bars correspond to each group. Relative frequency should be used for the vertical axis.
If specific values of one variable tend to occur in common with specific values of the other, we say there is an association (or relationship) between the two variables. For example, if males prefer certain superpowers more than females (or vice-versa), we would say that there is an association between gender and superpower choice.
Note: Do not use the word correlation when you mean association. Correlation has a very specific meaning in statistics, which we will talk about later in the year.
Example: Draw a side-by-side bar graph comparing the superpower choices of males and females. Use percentages for the vertical axis. Then draw a segmented bar graph for each gender. Describe what you see. Does there appear to be an association between gender and superpower choice?
There are two choices for a side-by-side bar graph:
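[Figure: side-by-side bar graph, “Choice of Superpower by Gender”; bars grouped by superpower choice with one bar per gender (Female, Male); vertical axis: Percent (0% to 40%)]
[Figure: side-by-side bar graph, “Choice of Superpower by Gender”; bars grouped by gender (Female, Male) with one bar per superpower (Invisibility, Superstrength, Telepathy, Fly, Freeze Time); vertical axis: Percent (0% to 40%)]
Segmented bar graph:
[Figure: segmented bar graph, “Choice of Superpower by Gender”; one bar per gender (Female, Male) with segments for Invisibility, Superstrength, Telepathy, Fly, Freeze Time; vertical axis: Percent (0% to 100%)]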
There is a definite association between
superpower choice and gender. Specifically,
females are much more likely to choose
telepathy than males, and males are much
more likely to choose superstrength than
females. Flying is a more popular choice for
females and freezing time is a more popular
choice for males. Invisibility was chosen by
similar percentages of males and females.
Displaying Quantitative Data with Graphs
One of the most common parts of a statistical problem is finding an appropriate way to display data. Quantitative data can’t be displayed the same way as categorical data (bar graphs and pie charts don’t work). The most common ways to display quantitative data are dotplots, stemplots, histograms, and boxplots.
How to Examine the Distribution of a Quantitative Variable
• Describe the overall pattern of a distribution by describing its shape, center, and spread.
• Point out any outliers (unusually small or unusually large data values).
• Always put your descriptions in context!
Describing Shape:
• How many peaks does the distribution have? Don’t count minor ups and downs, only major peaks.
  o Unimodal: One peak.
  o Bimodal: Two peaks.
  o Multimodal: Three or more peaks.
• Is the distribution symmetric or skewed?
  o If the right and left sides of the graph are close to mirror images of each other, describe the distribution as “approximately symmetric.” Always use the words “approximately” or “roughly”, because in real life, distributions of data are almost never perfectly symmetric.
  o If the right side of the graph is much longer than the left side (tail to the right), describe the distribution as “skewed to the right” or “skewed to positive values” or “positively skewed.”
  o If the left side of the graph is much longer than the right side (tail to the left), describe the distribution as “skewed to the left” or “skewed to negative values” or “negatively skewed.”
Describing Center: Use the median or the mean.
Describing Spread: Use the range, interquartile range, or standard deviation, or say something like, “The data vary from a low of _____ to a high of _____.”
Dotplots:
1. Draw a horizontal line and mark it with an appropriate measurement scale.
2. Locate each value in the data set along the measurement scale and represent it by a dot above the line. If there are two or more observations with the same value, stack the dots vertically.
To compare two distributions, stack the dotplots on top of each other, using the same scales. Make sure to label the two groups being compared.
Example: Here are the hair lengths of 26 AP Statistics students, in inches. Make a dotplot. Describe the distribution.
1 1 2 2 2 2 2 2 3 3 3 4 6 7 7 8 12 13 15 19 20 20 22 22 24 26
Here are the lengths sorted by gender. Make parallel dotplots of the hair lengths for males and females. Comment on what you see.
Male: 1 1 2 2 2 2 2 2 3 3 3 4 6 7
Female: 7 8 12 13 15 19 20 20 22 22 24 26
[Figure: dotplot of Hair Length (in.) for all students, scale 0 to 26]
[Figure: parallel dotplots of Hair Length (in.) for males and females, scale 0 to 26]
The distribution of hair lengths is bimodal
with peaks at 2-3 inches and 20-22 inches.
The distribution is slightly skewed to the
right, meaning that in this sample, shorter
hair lengths are slightly more common than
longer hair lengths. The hair lengths vary
from 1 inch to 26 inches (range = 25 inches).
There do not appear to be any outliers.
The distributions of hair length are unimodal for both
genders. The distribution for males is skewed to the
right, meaning that short hair is more common than long
hair, while the distribution for females is skewed to the
left, meaning that long hair is more common than short
hair.
Females tend to have much longer hair, on average, than
males (median = 19.5 inches for females and 2 inches for
males). There is hardly any overlap in the two
distributions. The male with the longest hair has the
same length hair as the female with the shortest hair.
There is much more variability in hair length for females
(range 19 inches) than for males (range 6 inches).
There do not appear to be any outliers for either gender.
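A dotplot can be roughed out in plain text by stacking one mark per observation. The sketch below is turned on its side (the scale runs down the page and the dots stack to the right); the function name `text_dotplot` and the use of the letter "o" as the dot are arbitrary choices for illustration, and the helper assumes whole-number data.

```python
from collections import Counter

# Hair lengths (inches) from the example, all students combined.
hair_lengths = [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 6, 7, 7, 8,
                12, 13, 15, 19, 20, 20, 22, 22, 24, 26]


def text_dotplot(values):
    """One row per scale value from min to max; one 'o' per observation."""
    counts = Counter(values)
    rows = []
    for value in range(min(values), max(values) + 1):
        # Values with no observations still get a row, so gaps stay visible.
        rows.append((f"{value:>2} | " + "o" * counts[value]).rstrip())
    return rows


dotplot_rows = text_dotplot(hair_lengths)
print("Hair Length (in.)")
print("\n".join(dotplot_rows))
```

Keeping the empty rows matters: the gap between about 8 and 12 inches is part of what makes the two peaks visible.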
Stemplots (or Stem-and-Leaf Plots):
Each number in the data set is broken into two pieces—a stem and a leaf. The stem is the first part of the number and consists of the beginning digits. The leaf is the last part of the number and consists of the final digit(s).
1. Choose stems (one or more of the leading digits) that divide the data into a reasonable number of groups (at least 5, but not too many). List possible stem values (not just those that actually appear in the data set—don’t skip stems) in a vertical column. Draw a vertical line to the right of the stems.
2. The next digit(s) after the stem become(s) the leaf. List the leaf for every observation to the right of the corresponding stem.
3. Include a key explaining what the stems and leaves represent, e.g., “2 | 5 represents 2.5 seconds”.
Note: It is common to round and/or truncate (leave off) the remaining digits. For example, in a stemplot of annual salary, we might represent $35,360 as 35 | 3, 35 | 4, or as 3 | 5, depending on our data set.
Note: If necessary, consider using split stems. Write each stem more than once, and assign the lower group of leaves to the first stem and the higher group of leaves to the next. For example, put the leaves 0-4 with the first stem and the leaves 5-9 with the second. If you do this, be sure that each stem is assigned an equal number of possible leaf digits (two stems, with five possible leaves each; or five stems, with two possible leaves each).
Note: To compare two groups, make a back-to-back stemplot. Use the same set of stems and write the leaves for one group to the right and for the other group to the left. Be sure to label each side to indicate which group is being represented.
Example: The data below shows the number of pairs of shoes owned by male and female AP Statistics students. Make a back-to-back stemplot of the data using split stems. Comment on the main differences between the two data sets.
Female: 8 8 9 10 10 11 11 15 15 15 15 15 15 20 22 22 23 31 40 50 64
Male: 1 4 4 4 4 4 4 5 5 5 5 5 5 6 6 7 7 8 8 10 10 10 12 12 15
Number of Pairs of Shoes
      Female |   | Male
             | 0 | 1 4 4 4 4 4 4
       9 8 8 | 0 | 5 5 5 5 5 5 6 6 7 7 8 8
     1 1 0 0 | 1 | 0 0 0 2 2
 5 5 5 5 5 5 | 1 | 5
     3 2 2 0 | 2 |
             | 2 |
           1 | 3 |
             | 3 |
           0 | 4 |
             | 4 |
           0 | 5 |
             | 5 |
           4 | 6 |
Key: 1|5 represents 15 pairs of shoes
Both distributions are unimodal and skewed to higher numbers – it is more common to own fewer pairs of shoes than more pairs. Females tend to own more pairs of shoes (median 15) than males (median 5). There is much more variability in the number of pairs of shoes owned by females (range 56) than the number of pairs of shoes owned by males (range 14). The females who own 50 and 64 pairs of shoes are possible outliers.
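The split-stem procedure can be sketched in code for two-digit whole numbers: the tens digit is the stem, the ones digit is the leaf, and each stem is listed twice (leaves 0-4, then leaves 5-9). The function names below are hypothetical, and the fixed 12-character width for the left-hand leaves is an assumption that happens to fit this data set.

```python
def split_stem_rows(values, top_stem):
    """Return (stem, leaves) rows, two rows per tens digit, no stems skipped."""
    rows = []
    for stem in range(top_stem + 1):
        for low, high in ((0, 4), (5, 9)):
            leaves = sorted(v % 10 for v in values
                            if v // 10 == stem and low <= v % 10 <= high)
            rows.append((stem, leaves))
    return rows


def back_to_back(left, right):
    """Build the text lines of a back-to-back stemplot with split stems."""
    top_stem = max(max(left), max(right)) // 10
    lines = []
    for (stem, left_leaves), (_, right_leaves) in zip(
            split_stem_rows(left, top_stem), split_stem_rows(right, top_stem)):
        # Left-hand leaves are written outward from the stem, so reverse them.
        left_text = " ".join(str(d) for d in reversed(left_leaves))
        right_text = " ".join(str(d) for d in right_leaves)
        lines.append(f"{left_text:>12} | {stem} | {right_text}".rstrip())
    return lines


female = [8, 8, 9, 10, 10, 11, 11, 15, 15, 15, 15, 15, 15,
          20, 22, 22, 23, 31, 40, 50, 64]
male = [1, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5,
        6, 6, 7, 7, 8, 8, 10, 10, 10, 12, 12, 15]

stemplot_lines = back_to_back(female, male)
print(f"{'Female':>12} |   | Male")
print("\n".join(stemplot_lines))
print("Key: 1 | 5 represents 15 pairs of shoes")
```

Both groups share one set of stems, which is what makes the side-by-side comparison of shape, center, and spread possible.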
Histograms:
1. Divide the range of the data into intervals of equal width. The intervals are called “bins.” The low value in each bin is included in the bin, but the high value is not. For example, the bins might be 0 to < 3, 3 to < 6, 6 to < 9, etc.
   • If the data are discrete (the observations take only whole number values) and are tightly packed, the bins are usually centered at the integer values with a width of one unit, so the rectangle for 1 is centered at 1 (0.5 to < 1.5), the rectangle for 2 is centered at 2 (1.5 to < 2.5), etc.
   • There are no set-in-stone rules for how many bins to use (5 to 10 is a common number), but it may be a good idea to see what the graph looks like with different width bins. It can change quite a bit!
2. Find the count (frequency) or percent (relative frequency) of individuals in each bin.
3. Label and scale your axes. Mark the boundaries of each interval on a horizontal axis and use either frequency or relative frequency on the vertical axis.
4. Draw a rectangle for each interval. The edges of the rectangle are at the interval boundaries, and the height of the rectangle represents the frequency or relative frequency of the individuals in each interval.
Note: Histograms and bar graphs are different!
  o Bar graphs are used for categorical data. Histograms are used for quantitative data.
  o With histograms, you have one variable, and you sort the values of the variable by placing them into bins. The histogram shows how many measurements are in each bin. The rectangles can’t be rearranged.
  o With bar graphs, the bars represent different categories. The height compares some characteristic of each category. The bars can be rearranged.
  o The bars in bar graphs are generally unconnected. The bars in histograms are connected.
Example: The following data gives the average points scored per game (PTSG) for the 30 NBA teams in the 2016-2017 regular season. Draw two relative frequency histograms using different bin widths. Describe the distribution.
97.9 100.2 100.4 101.1 101.3 102.4 102.4 102.8 102.9 103.2
103.3 104.3 104.3 104.6 104.9 105.3 105.6 105.6 105.8 105.8
106.5 107.5 107.6 107.7 107.8 109.0 111.4 111.7 114.4 116.5
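Bins of width 2:
Bin | Frequency
97 to < 99 | 1
99 to < 101 | 2
101 to < 103 | 6
103 to < 105 | 6
105 to < 107 | 6
107 to < 109 | 4
109 to < 111 | 1
111 to < 113 | 2
113 to < 115 | 1
115 to < 117 | 1
Bins of width 4:
Bin | Frequency
97 to < 101 | 3
101 to < 105 | 12
105 to < 109 | 10
109 to < 113 | 3
113 to < 117 | 2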
The distribution of average points scored per game is unimodal and skewed to the right (lower scoring averages are more common than higher ones). The median is 105.1 points per game. There is an 18.6 point per game difference between the team with the highest scoring average and the team with the lowest scoring average. There do not appear to be any outliers. Notice that there is more noise in the histogram with smaller bin widths.
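The bin counts can be reproduced with a few lines of code. This sketch assumes equal-width bins that include their low edge and exclude their high edge, as in the procedure; the function name `bin_counts` and its parameters are made up for the illustration.

```python
import math

# Average points per game for the 30 teams, from the example.
ptsg = [97.9, 100.2, 100.4, 101.1, 101.3, 102.4, 102.4, 102.8, 102.9, 103.2,
        103.3, 104.3, 104.3, 104.6, 104.9, 105.3, 105.6, 105.6, 105.8, 105.8,
        106.5, 107.5, 107.6, 107.7, 107.8, 109.0, 111.4, 111.7, 114.4, 116.5]


def bin_counts(values, start, width, n_bins):
    """Count values in bins [start, start + width), [start + width, ...), etc."""
    counts = [0] * n_bins
    for value in values:
        index = math.floor((value - start) / width)
        if not 0 <= index < n_bins:
            raise ValueError(f"{value} falls outside the bins")
        counts[index] += 1
    return counts


narrow = bin_counts(ptsg, start=97, width=2, n_bins=10)
wide = bin_counts(ptsg, start=97, width=4, n_bins=5)
print("width 2:", narrow)
print("width 4:", wide)

# Relative frequencies: divide each count by the number of teams.
print("width 4, relative:", [count / len(ptsg) for count in wide])
```

Running both bin widths side by side shows the point made above: the same data look smoother with wider bins and noisier with narrower ones.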
[Figure: histogram of Average Points per Game with bins of width 4 (97 to 117); vertical axis: Frequency (# of Teams), 0 to 12]
[Figure: histogram of Average Points per Game with bins of width 2 (97 to 117); vertical axis: Frequency (# of Teams), 0 to 6]
Describing Quantitative Data with Numbers
Population: The entire collection of individuals or objects that you want to learn about.
Sample: A part of the population that is selected for study.
Resistant Measure: A measure that is not influenced very much by strong skewness or extreme values.
Measures of Center: The most common measures of center are the mean and the median.
Mean: The sum of the values divided by the number of observations.
• If the n observations in a sample are $x_1, x_2, \ldots, x_n$, the mean is
  $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum x_i}{n}$
• The mean can be thought of as the “average” value, the “fair share” value, or the “balance point” of a distribution.
• The mean is not a resistant measure. It is very sensitive to outliers and skewness.
The mean of a sample is abbreviated $\bar{x}$ (pronounced “x-bar”) and the mean of a population is abbreviated $\mu_x$ (the Greek letter mu, pronounced “myoo”). They are both calculated the same way. The distinction will be important later in the year. If the problem doesn’t specify whether the data represent a population or a sample, assume you are dealing with a sample and use $\bar{x}$.
Median (M): The midpoint of a distribution. Half of the observations are smaller than the median and half of the values are larger than the median. To find the median:
1. Put the n observations in order from smallest to largest.
2. If the number of observations, n, is odd, the median is the middle observation of the ordered list.
3. If the number of observations, n, is even, the median is the average (mean) of the two middle observations in the ordered list.
• The median can be thought of as the “typical” value of a variable.
• The median is a resistant measure. It is not changed greatly by strong skewness or outliers.
Comparing the Mean and the Median: The mean and median of a roughly symmetric distribution are close together. If the distribution is exactly symmetric, they are equal. However, outliers and other extreme values drag the mean toward them without having much effect on the median. As a result, in skewed distributions, the mean will be further out in the long tail than is the median.
Example: Here are the amounts of fat (in grams) in McDonald’s beef sandwiches. Make a stemplot of the distribution and comment on its shape. Then calculate the mean and the median amount of fat.
Sandwich | Fat (g)
Hamburger | 9
Cheeseburger | 12
Double Cheeseburger | 23
McDouble | 19
Quarter Pounder | 19
Quarter Pounder with Cheese | 26
Double Quarter Pounder with Cheese | 42
Big Mac | 29
Big N’ Tasty | 24
Big N’ Tasty with Cheese | 28
McRib | 26
Mac Snack Wrap | 19
Angus Bacon & Cheese | 39
Angus Deluxe | 39
Angus Mushroom & Swiss | 40
Grams of Fat
0 | 9
1 | 2 9 9 9
2 | 3 4 6 6 8 9
3 | 9 9
4 | 0 2
Key: 1|2 represents 12 fat grams
The distribution of fat content is unimodal and approximately symmetric, so we would expect the median to be close to the mean.
Median = 26 grams
Mean = (9 + 12 + … + 40)/15 = 394/15 ≈ 26.3 grams
Example: Forty students were enrolled in a statistical reasoning course at a California college. The instructor made course materials, grades, and lecture notes available to students on a class web site, and course management software kept track of how often each student accessed any of these web pages. One month after the course began, the instructor requested a report of how many times each student had accessed a class web page. The 40 observations are below. Wasn’t it nice of me to put them in order?
0 0 0 0 0 0 3 4 4 4 5 5 7 7 8 8 8 12 12 13 13 13 14 14 16 18 19 19 20 20 21 22 23 26 36 36 37 42 84 331 (not a typo)
Here is a dotplot of the data. Describe the distribution. Based on the graph, do you expect the mean or the median to be higher? Calculate the mean and the median to see if you were right. Which measure would be the best choice to describe center in this situation?
[Figure: dotplot of Number of Visits to Class Website, axis marked from 0 to 300]
The distribution is unimodal and extremely skewed to the right. Most students accessed the website between 0 and 20 times. The students who accessed the website 84 and 331 times are possible outliers. Since the distribution is so skewed and has high outliers, the mean will be pulled towards the high values, and will be much higher than the median. The median will be more representative of the class as a whole.
Median = (13 + 13)/2 = 13
$\bar{x} = \dfrac{0 + 0 + \cdots + 331}{40} = 23.1$
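To see the resistance of the median in action, the sketch below recomputes the center of the website-visit data with and without the student who logged 331 visits. It uses Python's standard `statistics` module; dropping the largest value is done here only to illustrate resistance, not as a recommendation to delete outliers.

```python
from statistics import mean, median

# Number of visits to the class website for the 40 students (already sorted).
visits = [0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 5, 5, 7, 7, 8, 8, 8, 12, 12, 13,
          13, 13, 14, 14, 16, 18, 19, 19, 20, 20, 21, 22, 23, 26, 36, 36,
          37, 42, 84, 331]

# Same data with the single largest observation (331) left out.
visits_without_max = visits[:-1]

print("All 40 students:  mean =", round(mean(visits), 1),
      " median =", median(visits))
print("Without 331:      mean =", round(mean(visits_without_max), 1),
      " median =", median(visits_without_max))
```

One observation moves the mean by several visits but leaves the median where it was, which is why the median is the better description of a typical student here.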
Measures of Spread: Numbers that describe how spread out the data are. The most common are the range, the interquartile range, and the standard deviation.
Range: The difference between the maximum and minimum values.
Interquartile Range (IQR): First, calculate the quartiles:
1. Arrange the data in increasing order and locate the median, M. (The median is sometimes called the second quartile, or Q2).
2. The first quartile (Q1) is the median of all the observations lower than the median.
3. The third quartile (Q3) is the median of all the observations higher than the median.
The interquartile range is calculated as follows: IQR = Q3 – Q1
The IQR is the range of the middle 50% of the data.
Note: The range and interquartile range are numbers! Don’t say “The range is 5 to 30.” In that case, the range would be 25.
1.5 × IQR Rule for Outliers: Any observation that falls more than 1.5 × IQR above the third quartile or below the first quartile is considered an outlier.
Note: Always check for outliers and examine them closely! They may be errors, or they may tell you something important about your data that you need to pay attention to. Don’t ignore them.
Boxplots (or Box and Whisker Plots):
1. Find the Five-Number Summary: Minimum, Q1, M, Q3, Maximum
2. Check for outliers. You must always show this step.
   • Calculate the IQR.
   • Find Q1 – (1.5 × IQR) and Q3 + (1.5 × IQR).
   • If you have any data points outside these thresholds, they are outliers.
3. Draw the boxplot:
   • Draw a central box from Q1 to Q3.
   • Draw a vertical line in the box to mark the median.
   • Draw the “whiskers”: lines extending from the box out to the smallest and largest observations that are not outliers.
   • Mark outliers with dots in the appropriate locations.
Note: Each section of a boxplot contains about 25% of the data.
   • The lower quartile is higher than about 25% of the data.
   • The median (or second quartile) is higher than about 50% of the data.
   • The upper quartile is higher than about 75% of the data.
Note: Boxplots are useful for comparing distributions, but you have to be careful with them. They sometimes mask important features of a distribution. For instance, you can’t tell from a boxplot whether the distribution is unimodal or bimodal.
Example: The data below shows the number of text messages sent by a random sample of students in a day. Draw parallel boxplots of the number of texts sent for male and female students. You must show how you determined whether there are outliers. Compare the distributions. What conclusions can you draw about the texting habits of males and females?
Male 3 6 6 7 10 10 12 17 24 25 40 45 50 87 111
Female 7 11 20 26 38 52 59 79 90 156
Male: 3 6 6 7 10 10 12 17 24 25 40 45 50 87 111
Min = 3, Q1 = 7, Med = 17, Q3 = 45, Max = 111
IQR = 45 – 7 = 38
Q1 – 1.5(IQR) = 7 – 1.5(38) = –50
Q3 + 1.5(IQR) = 45 + 1.5(38) = 102
No low outliers because there are no numbers less than –50. 111 is an outlier because it is higher than 102.
Female: 7 11 20 26 38 | 52 59 79 90 156
Median = (38 + 52)/2 = 45
Min = 7, Q1 = 20, Med = 45, Q3 = 79, Max = 156
IQR = 79 – 20 = 59
Q1 – 1.5(IQR) = 20 – 1.5(59) = –68.5
Q3 + 1.5(IQR) = 79 + 1.5(59) = 167.5
No outliers because there are no numbers less than –68.5 or higher than 167.5.
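The five-number summary and the 1.5 × IQR check can be sketched in code. The quartile rule below follows these notes (Q1 and Q3 are the medians of the observations below and above the median's position); be aware that calculators and software packages may use slightly different quartile conventions and can give slightly different answers. The function names are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the quartile rule in the notes."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For odd n the middle observation belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return min(data), median(lower), median(data), median(upper), max(data)


def outliers(values):
    """Observations more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


male = [3, 6, 6, 7, 10, 10, 12, 17, 24, 25, 40, 45, 50, 87, 111]
female = [7, 11, 20, 26, 38, 52, 59, 79, 90, 156]

for label, texts in (("Male", male), ("Female", female)):
    print(label, five_number_summary(texts), "outliers:", outliers(texts))
```

When drawing the boxplot, the whiskers stop at the most extreme observations that are not in the `outliers` list.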
Standard Deviation: The most common measure of spread is the standard deviation. It measures the “average” or typical distance of the observations from the mean. The formula for standard deviation is slightly different depending on whether you have all the data for the entire population or are dealing with a sample from the population.
For a Sample:
If the n observations in a sample are $x_1, x_2, \ldots, x_n$ and the mean is $\bar{x}$, the standard deviation is given by:
$s_x = \sqrt{\dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
The sample standard deviation is abbreviated $s_x$. The square of the standard deviation is called the variance, abbreviated $s_x^2$.
The females in the sample text much more, on average, than
the males (median = 45 for females and 17 for males). Since the
median number of texts for females is equal to the third
quartile for males, we can see that the top 50% of females text
more than the bottom 75% of the males. Both distributions are
skewed to the right, meaning that it is more common to send
smaller numbers of texts than larger ones. There is more
variation in # of texts sent for females than for males (IQR = 59
for females and 38 for males). There is one outlier for the
males. He sent 111 texts, which is unusually high. There were
no outliers for the females.
[Figure: parallel boxplots of # of Texts Sent in Past 24 Hrs. for Female and Male students, scale 0 to 160]
For the Population:
The standard deviation of a population of size N with mean $\mu$ and observations $x_1, x_2, \ldots, x_N$ is given by:
$\sigma_x = \sqrt{\dfrac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}} = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$
The population standard deviation is abbreviated $\sigma_x$ (the Greek letter sigma). The population variance is abbreviated $\sigma_x^2$.
The reason that we divide by $n - 1$ in a sample is complicated. We’ll discuss it later in the year.
Note: Always use $s_x$ rather than $\sigma_x$ unless you know that the data represent the entire population, which is rare!
Calculating the standard deviation by hand:
1. Calculate the mean, $\bar{x}$.
2. Find the distance of each observation from the mean (the deviations).
3. Square each of these distances to eliminate negative numbers.
4. “Average” the squared distances by adding them together and dividing by $n - 1$. This gives the variance, $s_x^2$.
5. Take the square root of the variance to get the standard deviation, $s_x$.
6. Interpret your result. The standard deviation is the “average” or typical distance of the observations from the mean.
Example: The table below shows the sugar content in several types of candy bar. Find the mean and standard deviation of the data. Interpret your result in context.
Candy Bar | Sugar (grams) | Deviation $x_i - \bar{x}$ | Squared Deviation $(x_i - \bar{x})^2$
Hershey’s Milk Chocolate | 31 | 4.4 | 19.5
Krackel | 24 | –2.6 | 6.7
Kit Kat | 22 | –4.6 | 21.1
York Peppermint Pattie | 25 | –1.6 | 2.5
Reese’s Peanut Butter Cups | 25 | –1.6 | 2.5
Snickers | 30 | 3.4 | 11.6
Milky Way | 35 | 8.4 | 70.6
Twix | 27 | 0.4 | 0.2
3 Musketeers | 40 | 13.4 | 179.9
Baby Ruth | 33 | 6.4 | 41.1
Nestle Crunch | 24 | –2.6 | 6.7
Butterfinger | 29 | 2.4 | 5.8
Mounds | 21 | –5.6 | 31.2
Almond Joy | 20 | –6.6 | 43.4
Mr. Goodbar | 22 | –4.6 | 21.1
Payday | 21 | –5.6 | 31.2
Heath | 23 | –3.6 | 12.9
Total | 452 | 0 | 508.118
$\bar{x} = \dfrac{452}{17} \approx 26.6$ grams
$s_x^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1} = \dfrac{508.118}{17 - 1} \approx 31.8$ grams²
$s_x = \sqrt{s_x^2} = \sqrt{31.8} \approx 5.6$ grams
The sugar contents of the individual candy bars typically differ from the mean sugar content by about 5.6 grams.
- OR -
On average, the sugar contents of the individual candy bars differ from the mean sugar content by about 5.6 grams.
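The by-hand steps translate directly into code. The sketch below follows them for the candy bar data and then compares the result with `statistics.stdev` from Python's standard library, which also divides by n − 1. The function name `sample_std_dev` is made up for the illustration.

```python
import math
from statistics import stdev

# Sugar content (grams) of the 17 candy bars in the table.
sugar = [31, 24, 22, 25, 25, 30, 35, 27, 40, 33, 24, 29, 21, 20, 22, 21, 23]


def sample_std_dev(values):
    """Sample standard deviation, following the by-hand steps."""
    n = len(values)
    x_bar = sum(values) / n                          # Step 1: the mean
    deviations = [x - x_bar for x in values]         # Step 2: deviations
    squared = [d ** 2 for d in deviations]           # Step 3: square them
    variance = sum(squared) / (n - 1)                # Step 4: divide by n - 1
    return math.sqrt(variance)                       # Step 5: square root


s_x = sample_std_dev(sugar)
print("mean =", round(sum(sugar) / len(sugar), 1), "grams")
print("s_x  =", round(s_x, 1), "grams")
print("statistics.stdev gives", round(stdev(sugar), 1), "grams")
```

If the data were an entire population, the divisor would be N instead of n − 1; the standard library's `statistics.pstdev` computes that version.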
Example: Each of these distributions has a mean of 5. Rank the standard deviations from lowest to highest.
Properties of the Standard Deviation
• The standard deviation measures the spread about the mean and should only be used when the mean is chosen as the measure of center.
• The standard deviation is always greater than or equal to zero. If there is no variability (all observations have the same value), the standard deviation is zero. Otherwise, it is greater than zero.
• The standard deviation has the same units of measurement as the original observations. This is one reason we often prefer the standard deviation over the variance.
• The standard deviation is not a resistant measure. A few outliers can change its value dramatically.
Choosing Measures of Center and Spread:
• Use the median and IQR for describing a skewed distribution or a distribution with strong outliers.
• Use the mean and standard deviation for describing reasonably symmetric distributions without outliers.
Note: ALWAYS GRAPH YOUR DATA! Numerical measures of center and spread report specific facts about a distribution, but don’t give information about its entire shape. You may miss something important if you don’t graph the data.
[Figure: three dotplots, each on a scale from 0 to 10]
Highest standard deviation: The points are furthest from the mean, on average. $s_x \approx 4.1$
Lowest standard deviation: The points are closest to the mean, on average. $s_x \approx 2.9$
Middle: $s_x \approx 3.6$