AP Statistics
Chapter 1: Exploring Data
WHAT IS STATISTICS?
• Statistics is the study of how we:
- Collect Data
- Organize Data
- Analyze Data
- Use data to make predictions
Statistics is the tool we use to extract information from data!
Lesson Objectives
• Identify individuals and variables in a set of data.
• Classify a variable as being a quantitative or categorical variable.
• Identify the units of measurement for a quantitative value.
VARIABLES
• Individuals are the objects described by a set of data.
• Variables are characteristics that can take different values from individual to individual.
• A variable can be considered either categorical or quantitative.
EXAMPLE
• Suppose we observed a bag of M&M candies and were studying the different colors of the pieces.
• What would be the individuals in the study? What would be the variable?
Categorical vs. Quantitative
• Categorical variables place an individual into a group or category.
• Quantitative variables assign a numerical value to an individual.
EXAMPLES: Which type of variable is each?
A person’s height … Quantitative
A person’s eye color … Categorical
A person’s ZIP code … Categorical
While ZIP codes are numeric in form, you would not use arithmetic to combine them in any way.
Distribution
• The type of data collected can be a determining factor of the way the values are organized.
• Quantitative values can be very close together or very spread out.
• The pattern of variation between these values in a set of data is called the distribution.
• Distribution – a description of the values a variable takes and how often it takes these values.
DISTRIBUTION
• Both quantitative and categorical data will have differences from individual to individual.
• The pattern of variation of a variable is referred to as its distribution.
• In order to get a grasp of a variable’s distribution, we may use a graphical display of the data.
AP EXAM Tip
You will often be asked to “describe the distribution” of a set of data. When you are asked to do this, make sure that you have your SOCkS on!
S – Shape
O – Outliers
C – Center
S – Spread
When you describe these four characteristics of the data, you will be effectively describing the distribution!
ACTIVITY: Sexual Discrimination?????
25 airplane pilots have applied to fill 8 positions to be pilots with an airline company. 15 of them are males and 10 are females. To be fair, the managers select the 8 pilots to be employed by a lottery.
A day later, the managers announce the 8 pilots to be hired. 5 of them are female and only 3 are males.
Many of the males claimed that the lottery had to have been “rigged” since there was no way that so many females were selected.
ACTIVITY CONTINUED
To simulate the situation, select ten red cards (females) and fifteen black cards (males).
Use the cards within your group to conduct your own lottery by drawing 8 cards.
Count the number of red cards (females hired), and record that number.
Put the cards back, and shuffle the cards.
Repeat the process four more times.
Report your results to be recorded.
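The card lottery can also be repeated many more times on a computer. The sketch below is one possible way to do it, not part of the original activity: the function names, the seed, and the choice of 10,000 trials are assumptions made for illustration, and the exact probabilities (a counting argument using combinations) are included only as a cross-check on the simulation.

```python
import random
from collections import Counter
from math import comb

FEMALES, MALES, HIRES = 10, 15, 8  # 10 red cards, 15 black cards, 8 positions


def simulate_lotteries(trials, seed=2024):
    """Run the lottery `trials` times and tally the number of females hired."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    applicants = ["F"] * FEMALES + ["M"] * MALES
    tally = Counter()
    for _ in range(trials):
        hired = rng.sample(applicants, HIRES)  # draw 8 cards without replacement
        tally[hired.count("F")] += 1
    return tally


def exact_probability(k):
    """Chance that exactly k of the 8 hires are female, by counting combinations."""
    return comb(FEMALES, k) * comb(MALES, HIRES - k) / comb(FEMALES + MALES, HIRES)


trials = 10_000
tally = simulate_lotteries(trials)
for k in range(HIRES + 1):
    print(f"{k} females: simulated {tally[k] / trials:.3f}, exact {exact_probability(k):.3f}")

simulated_5_plus = sum(tally[k] for k in range(5, HIRES + 1)) / trials
exact_5_plus = sum(exact_probability(k) for k in range(5, HIRES + 1))
print(f"5 or more females: simulated {simulated_5_plus:.3f}, exact {exact_5_plus:.3f}")
```

Under these assumptions, a fair lottery hires five or more females in roughly 13% of the drawings, which is worth keeping in mind when judging the "rigged" claim.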
What do we see?
[Dotplot of the class results; axis: Number of Females Hired, 0 to 8]
Do you think that it is possible that the number of females hired in the problem was a coincidence???
HOMEWORK
Complete the assignment listed in the packet. This assignment will be due at the beginning of the next class session.
Analyzing Categorical Data
• In this section we will learn about:
– Bar graphs/pie charts
– Problems with graphs
– Two-way Tables and Marginal Distribution
– Conditional Distributions
– Simpson’s Paradox
EXAMPLE
The Arbitron radio ratings service places each of the country’s 13,838 stations in categories that describe the type of music they play. Here is the distribution of the data.
Frequency Table
Format Count of Stations
Adult Contemporary 1556
Adult Standards 1196
Contemporary Hit 569
Country 2066
News/Talk 2179
Oldies 1060
Religious 2014
Rock 869
Spanish Language 750
Other Formats 1579
Total 13838
Relative Frequency Table
Format Percent of Stations
Adult Contemporary 11.2
Adult Standards 8.6
Contemporary Hit 4.1
Country 14.9
News/Talk 15.7
Oldies 7.7
Religious 14.6
Rock 6.3
Spanish Language 5.4
Other Formats 11.4
Total 99.9
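The relative frequency table is obtained by dividing each count by the total number of stations. A minimal sketch of that arithmetic follows; the variable names are illustrative, and rounding to one decimal place is assumed because that is how the table is printed.

```python
# Counts of stations by format, copied from the frequency table.
counts = {
    "Adult Contemporary": 1556,
    "Adult Standards": 1196,
    "Contemporary Hit": 569,
    "Country": 2066,
    "News/Talk": 2179,
    "Oldies": 1060,
    "Religious": 2014,
    "Rock": 869,
    "Spanish Language": 750,
    "Other Formats": 1579,
}

total = sum(counts.values())

# Relative frequency (as a percent) = count / total * 100, rounded to one decimal.
percents = {fmt: round(100 * n / total, 1) for fmt, n in counts.items()}

for fmt, pct in percents.items():
    print(f"{fmt:<20}{pct:>6.1f}")
print(f"{'Total':<20}{sum(percents.values()):>6.1f}")
```

The rounded percents add to 99.9 rather than 100 because each entry is rounded separately; this is roundoff error, not a mistake in the table.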
Continued
Sometimes, we may wish to use a graph instead of a table to clarify relationships.
[Bar graph: Count of Stations by format, vertical axis 0 to 2500]
[Pie chart: Percent of Stations by format]
Be Careful!!!
Because of their appeal to the eyes, graphical displays can sometimes be misleading. Always look for things like scaling and relevance. Pictographs can almost always be misleading.
Pictograph
• What is the issue with this ad that was used by Apple Computers to show the people that were buying their new iMac Computer?
Activity
• Use the table below.
• A.) Make a well-labeled graph to display the data.
• B.) Would it be appropriate to make a pie chart here? Why?
Possible Answer
Two-Way Tables
• A survey of 4826 randomly selected young adults (19-25 yrs old) asked, “What do you think are the chances you will have much more than a middle-class income at age 30?”
• This is an example of a two-way table.
Two-Way Table
• A two-way table describes two categorical variables, organizing counts according to a row variable and a column variable.
• Marginal Distribution
– The distribution of values of one of the variables among all individuals described by the table.
• To examine a marginal distribution:
– Use the table data to compute percents of the row or column totals.
– Make a graph to display the marginal distribution.
Young adults by gender and chance of getting rich
Female Male Total
Almost no chance 96 98 194
Some chance, but probably not 426 286 712
A 50-50 chance 696 720 1416
A good chance 663 758 1421
Almost certain 486 597 1083
Total 2367 2459 4826
Two-Way Tables and Marginal Distributions
Examine the marginal distribution of chance of getting rich.
Response Percent
Almost no chance 194/4826 = 4.0%
Some chance 712/4826 = 14.8%
A 50-50 chance 1416/4826 = 29.3%
A good chance 1421/4826 = 29.4%
Almost certain 1083/4826 = 22.4%
[Bar graph: Chance of being wealthy by age 30; horizontal axis: Survey Response; vertical axis: Percent, 0 to 35]
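The marginal percents can be reproduced directly from the two-way table of counts. The sketch below is illustrative only: the dictionary layout and variable names are assumptions, and the conditional distributions by gender (a topic listed for this section but not worked on the slides) are included as an extra step.

```python
# Two-way table of counts: response -> {gender: count}, copied from the slide.
table = {
    "Almost no chance": {"Female": 96, "Male": 98},
    "Some chance, but probably not": {"Female": 426, "Male": 286},
    "A 50-50 chance": {"Female": 696, "Male": 720},
    "A good chance": {"Female": 663, "Male": 758},
    "Almost certain": {"Female": 486, "Male": 597},
}

# Marginal distribution of the response: row total / grand total.
row_totals = {resp: sum(by_gender.values()) for resp, by_gender in table.items()}
grand_total = sum(row_totals.values())
marginal = {resp: round(100 * n / grand_total, 1) for resp, n in row_totals.items()}

# Conditional distribution of the response within each gender: cell / column total.
col_totals = {g: sum(row[g] for row in table.values()) for g in ("Female", "Male")}
conditional = {
    g: {resp: round(100 * row[g] / col_totals[g], 1) for resp, row in table.items()}
    for g in col_totals
}

for resp, pct in marginal.items():
    print(f"{resp:<32}{pct:>5.1f}%")
for g, dist in conditional.items():
    print(g, dist)
```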
Simpson’s Paradox
Accident victims are often transported by helicopter to a medical facility. Does this act help save lives?
What is the percentage of deaths for each of the two categories?
…not too positive, huh?!
Continued…
• Let’s look at the data differently.
• Compute the percentages now.
• …is that right?
• This phenomenon is referred to as Simpson’s Paradox.
• It is caused by what is referred to as a lurking variable.
• What was the lurking variable here?
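The reversal can be seen with a small numerical sketch. The counts below are hypothetical, chosen only to illustrate the effect (the slide's own tables did not survive as text), and the split by accident severity is an assumed example of a lurking variable, as are the two transport categories.

```python
# Hypothetical (deaths, total victims) by transport mode and accident severity.
data = {
    "helicopter": {"serious": (48, 100), "less serious": (16, 100)},
    "road": {"serious": (60, 100), "less serious": (200, 1000)},
}


def death_rate(deaths, total):
    """Percent of victims who died."""
    return 100 * deaths / total


# Combined percentages, ignoring severity.
overall = {}
for transport, groups in data.items():
    deaths = sum(d for d, _ in groups.values())
    total = sum(t for _, t in groups.values())
    overall[transport] = death_rate(deaths, total)

# Percentages within each severity group.
by_severity = {
    transport: {sev: death_rate(d, t) for sev, (d, t) in groups.items()}
    for transport, groups in data.items()
}

print("Overall:", overall)
print("By severity:", by_severity)
```

With these made-up numbers the helicopter group has the higher death rate overall but the lower death rate within each severity group, because the helicopter carries a much larger share of the serious cases.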
HOMEWORK
Complete the assignment listed in the packet. This assignment will be due at the beginning of the next class session.
1.2 – Displaying Quantitative Data
• Dotplots are a commonly used method of displaying quantitative data.
• To make a dotplot, DRAW a horizontal number line, labeled with the name of the variable.
• SCALE the number line, including the minimum and maximum values.
• MARK a dot above the corresponding location on the axis for each data value.
EXAMPLE
• The table here displays goals scored by the US Women’s Soccer Team in 2004.
• Create a dotplot to represent the data.
Number of Goals Scored Per Game by the 2004 US Women’s Soccer Team
3 0 2 7 8 2 4 3 5 1 1 4 5 3 1 1 3
3 3 2 1 2 2 2 4 3 5 6 1 5 5 1 1 5
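The DRAW, SCALE, MARK steps can be mimicked with a rough text dotplot. This is only a sketch: the list below reads the table as 34 games in two rows of 17, and the plotting character and layout are arbitrary choices.

```python
from collections import Counter

# Goals per game, read as two rows of 17 values (34 games in all).
goals = [3, 0, 2, 7, 8, 2, 4, 3, 5, 1, 1, 4, 5, 3, 1, 1, 3,
         3, 3, 2, 1, 2, 2, 2, 4, 3, 5, 6, 1, 5, 5, 1, 1, 5]

counts = Counter(goals)
lo, hi = min(goals), max(goals)  # SCALE: the axis runs from the minimum to the maximum

# MARK: print the stacks of dots from the tallest level down to the axis.
for level in range(max(counts.values()), 0, -1):
    print(" ".join("o" if counts[v] >= level else " " for v in range(lo, hi + 1)))

# DRAW: the number line, labeled with the variable name.
print(" ".join(str(v) for v in range(lo, hi + 1)))
print("Goals scored per game")
```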
EXAMPLE 2
• The table and dotplot below display the Environmental Protection Agency’s estimates of highway gas mileage in miles per gallon (MPG) for a sample of 24 model year 2009 midsize cars.
• Use the dotplot to describe the distribution.
2009 Fuel Economy Guide
MODEL MPG
Acura RL 22
Audi A6 Quattro 23
Bentley Arnage 14
BMW 528i 28
Buick Lacrosse 28
Cadillac CTS 25
Chevrolet Malibu 33
Chrysler Sebring 30
Dodge Avenger 30
Hyundai Elantra 33
Jaguar XF 25
Kia Optima 32
Lexus GS 350 26
Lincoln MKZ 28
Mazda 6 29
Mercedes-Benz E350 24
Mercury Milan 29
Mitsubishi Galant 27
Nissan Maxima 26
Rolls Royce Phantom 18
Saturn Aura 33
Toyota Camry 31
Volkswagen Passat 29
Volvo S80 25
[Dotplot: 2009 Fuel Economy Guide; axis: MPG, 14 to 34]
Describing the Shape
• We can describe the shape of a distribution by using the following terms.
• Symmetric – if the right and left sides of the graph are approximately mirror images of each other.
• Skewed Right – if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.
• Skewed Left – just the opposite of skewed right.
• Bimodal – a set of data that has two peaks.
Identify Each
[Dotplot: DiceRolls, axis 0 to 12]
[Dotplot: Score, axis 70 to 100]
[Dotplot: Siblings, axis 0 to 7]
Shapes to identify: Symmetric, Skewed left, Skewed right, Bimodal
Applying the Concepts
• Complete the “Check Your Understanding” questions on pg. 31.
VIDEO #2: Decisions Through Data
Stemplots
Stemplots
• Stemplots are often used as a means of representing quantitative values.
• The data is organized by separating each observation into a stem (all but the last digit) and a leaf (the last digit).
• The leaf values are then paired with their stem and ordered.
• Trends and patterns in the distribution can be seen here.
Caffeine content (mg) of an 8 oz. serving of many popular soft drinks.
A&W Cream 20
Barq’s Root Beer 15
Cherry Coke 23
Cherry RC Cola 29
Coke Classic 23
Diet A&W Cream 15
Diet Cherry Coke 23
Diet Coke 31
Diet Dr. Pepper 28
Diet Mello Yello 35
Diet Mtn Dew 37
Diet Mr. Pibb 27
Diet Pepsi 24
Diet Red Squirt 26
Diet Sun Drop 47
Diet Sunkist 28
Diet Cherry Pepsi 24
Dr. Nehi 28
Dr. Pepper 28
IBC Cherry 16
Kick 38
KMX 36
Mello Yello 35
Mountain Dew 37
Mr. Pibb 27
Nehi Wild Red 33
Pepsi One 37
Pepsi 25
RC Edge 47
Red Flash 27
Royal Crown 29
Red Squirt 26
Sun Drop Cherry 43
Sun Drop 43
Sunkist 28
Surge 35
Tab 31
Cherry Pepsi 25
Arrange all of this data into a stemplot, and observe the distribution.
Stemplot from the Example
1 | 5 5 6
2 | 0 3 3 3 4 4 5 5 6 6 7 7 7 8 8 8 8 8 9 9
3 | 1 1 3 5 5 5 6 7 7 7 8
4 | 3 3 7 7
Key: 3|5 means 35 mg of caffeine per 8 oz. serving
Caffeine Content (mg) per 8oz. Serving of Various Soft Drinks
An Alternative (better) Plot
1 | 5 5 6
2 | 0 3 3 3 4 4
2 | 5 5 6 6 7 7 7 8 8 8 8 8 9 9
3 | 1 1 3
3 | 5 5 5 6 7 7 7 8
4 | 3 3
4 | 7 7
Caffeine Content (mg) per 8oz. Serving of Various Soft Drinks
Key: 3|5 means 35 mg of caffeine per 8 oz. serving
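Both versions of the plot can be generated from the raw caffeine values. The function below is a minimal sketch rather than a standard library routine: `stemplot` and its `split` flag are names invented here, and stems (or half-stems) with no leaves are simply skipped, which matches the plots above but would hide gaps in other data sets.

```python
from collections import defaultdict

# Caffeine content (mg per 8 oz. serving) for the 38 soft drinks in the table.
caffeine = [20, 15, 23, 29, 23, 15, 23, 31, 28, 35, 37, 27, 24, 26,
            47, 28, 24, 28, 28, 16, 38, 36, 35, 37, 27, 33, 37, 25,
            47, 27, 29, 26, 43, 43, 28, 35, 31, 25]


def stemplot(values, split=False):
    """Return stemplot rows; with split=True, leaves 0-4 and 5-9 go on separate rows."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)          # stem = all but the last digit, leaf = last digit
        half = leaf // 5 if split else 0    # 0 for leaves 0-4, 1 for leaves 5-9
        rows[(stem, half)].append(leaf)
    return [f"{stem} | {' '.join(map(str, leaves))}"
            for (stem, _), leaves in sorted(rows.items())]


print("\n".join(stemplot(caffeine)))
print()
print("\n".join(stemplot(caffeine, split=True)))
print("Key: 3|5 means 35 mg of caffeine per 8 oz. serving")
```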
Tips for Stemplots
• When you split stems, make sure each part is assigned an equal number of possibilities.
• There is no set number of stems.
• Too few stems makes a “skyscraper” shape.
• Too many stems makes a “pancake” shape.
• As a rule, a minimum of five stems is good to follow.
• Always include a title and a key to show how the stems were formed.
HOMEWORK
Complete the assignment listed in the packet. This assignment will be due at the beginning of the next class session.
VIDEO #3: Decisions Through Data
Histograms
HISTOGRAMS
• Histograms are different than bar graphs as they are represented on a continuum of values.
• As with stemplots, there is no set number of classes to use.
• Five classes is a good minimum.
• Remember: area of each bar is what matters. Make sure width is constant and height varies.
Relative Frequency Histograms
• A relative frequency histogram is based on relative frequencies of each category.
• Relative Frequency = number of occurrences in the category/ total number of occurrences.
• Relative frequency is often used to find percentiles, or the portion of data that is at or below a value.
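As a sketch of how the relative frequencies behind such a histogram are computed, the example below groups the 24 highway MPG values from the 2009 Fuel Economy Guide table into classes. The class width of 5 and the starting value of 10 are assumptions chosen to give five equal-width classes; other choices are equally valid.

```python
# Highway MPG for the 24 midsize cars in the 2009 Fuel Economy Guide table.
mpg = [22, 23, 14, 28, 28, 25, 33, 30, 30, 33, 25, 32,
       26, 28, 29, 24, 29, 27, 26, 18, 33, 31, 29, 25]

START, WIDTH, N_CLASSES = 10, 5, 5  # assumed classes: 10-14, 15-19, ..., 30-34

# Count the observations that fall in each class.
counts = [0] * N_CLASSES
for x in mpg:
    counts[(x - START) // WIDTH] += 1

# Relative frequency = count in the class / total number of observations.
rel_freq = [c / len(mpg) for c in counts]

# Running total of relative frequencies: the share of data at or below each class.
cumulative = []
running = 0.0
for r in rel_freq:
    running += r
    cumulative.append(running)

for i in range(N_CLASSES):
    low = START + i * WIDTH
    print(f"{low}-{low + WIDTH - 1}: count {counts[i]:>2}, "
          f"relative {rel_freq[i]:.3f}, cumulative {cumulative[i]:.3f}")
```

Reading the cumulative column, 17 of the 24 cars (about 71%) get 29 MPG or less, which is the kind of percentile statement the last bullet refers to.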
Applying the Concepts
• Complete the “Check Your Understanding” questions on pg. 39.
HOMEWORK
Complete the assignment listed in the packet. This assignment will be due at the beginning of the next class session.
VIDEO #4: Decisions Through Data
Measures of Center
1.3 - Describing Distributions with Numbers
• To describe a distribution, we must identify its center.
• One measure of the center of a set of data is the mean.
• The mean is the sum of all observations in a set divided by the total number of observations.
FORMULA for Mean
• The mean of a set of data is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
or…
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Median
• The median of a set of data is the midpoint of the data.
• Arrange all of the numbers from least to greatest.
• If there is an odd number of observations, the median is the center of the list.
• If there is an even number of observations, the median is the mean of the two center observations.
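The definitions of the mean and the median translate directly into a few lines of code. The sketch below applies them to the soccer goals data from Section 1.2 (read as 34 games); the function names are illustrative, and Python's built-in `statistics` module is used only as a cross-check.

```python
import statistics

# Goals per game for the 2004 US Women's Soccer Team (34 games).
goals = [3, 0, 2, 7, 8, 2, 4, 3, 5, 1, 1, 4, 5, 3, 1, 1, 3,
         3, 3, 2, 1, 2, 2, 2, 4, 3, 5, 6, 1, 5, 5, 1, 1, 5]


def mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Middle of the ordered list; mean of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(f"mean = {mean(goals):.2f}")
print(f"median = {median(goals)}")

# Cross-check against the standard library.
assert abs(mean(goals) - statistics.mean(goals)) < 1e-9
assert median(goals) == statistics.median(goals)
```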
Mean vs. Median
• The median is a more resistant measure than the mean.
• This means that the mean can be more easily influenced by extreme values.
• Differences between mean and median can indicate skewness in the data.
• Skewed Left data will have a mean that is less than the median.
• Skewed Right data will have a mean that is greater than the median.
• Mean/Median Applet
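A tiny numerical experiment shows what "resistant" means. The data here are hypothetical values invented for illustration, not taken from the slides: one extreme observation is removed to see how far each measure moves, and the data are then mirrored to put the long tail on the left.

```python
from statistics import mean, median

# Hypothetical right-skewed data: one extreme value in the upper tail.
skewed_right = [2, 3, 3, 4, 5, 6, 40]
without_extreme = skewed_right[:-1]

# Mirror image of the same data, so the long tail is on the left instead.
pivot = min(skewed_right) + max(skewed_right)
skewed_left = sorted(pivot - x for x in skewed_right)

print("right-skewed:    mean", mean(skewed_right), "median", median(skewed_right))
print("extreme removed: mean", round(mean(without_extreme), 2),
      "median", median(without_extreme))
print("left-skewed:     mean", mean(skewed_left), "median", median(skewed_left))

mean_shift = abs(mean(skewed_right) - mean(without_extreme))
median_shift = abs(median(skewed_right) - median(without_extreme))
print("shift in mean:", round(mean_shift, 2), " shift in median:", median_shift)
```

In this made-up example, dropping the single extreme value moves the mean far more than the median, and the mean sits on the same side of the median as the long tail.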
HOMEWORK
Complete the assignment listed in the packet. This assignment will be due at the beginning of the next class session.
VIDEO #5: Decisions Through Data
Boxplots
Spread of Data
• The spread of a distribution is another important characteristic of the data.
• One simple measure of the spread of a set of data is the range.
• Range = (highest value – lowest value)
• Since the range only involves two values, the range can be strongly influenced by outliers.
Quartiles
• Sometimes it is best to split the data into quartiles.
• The quartiles divide the ordered data into four groups, each containing about 25% of the observations.
• Often, the difference between the 3rd and 1st quartiles is used to describe the spread of the data.
• This difference is referred to as the interquartile range (IQR).
Example of Quartiles
• Barry Bonds’s home run counts (arranged in order) are:
16 19 24 25 25 33 33 34 34 37 37 40 42 46 49 73
A Five-Number Summary can be used to describe the data as well. It includes the min value, Q1, Median, Q3, and max value.
What is the Five-Number Summary of Barry Bonds’s home run data?
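One way to check a five-number summary by hand is sketched below. It assumes the quartile convention that appears to be used in this chapter: Q1 is the median of the observations below the median's position and Q3 is the median of those above it (the median itself is left out when the number of observations is odd). Calculators and software sometimes use other conventions, so results can differ slightly.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the split-at-the-median rule."""
    xs = sorted(values)
    n = len(xs)
    lower = xs[: n // 2]         # observations below the median's position
    upper = xs[(n + 1) // 2:]    # observations above the median's position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 46, 49, 73]
print(five_number_summary(bonds))
```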
BOXPLOTS
• A boxplot can be used to graphically represent the spread of the data in a set.
• To form a boxplot, we plot points at all values in the Five Number Summary.
• Form a box between the Q1 and Q3 values, with a bar inside at the median.
• Connect the max and min values to the box with line segments.
• This can also be done on the calculator in stat plot mode.
Modified Boxplot
• A boxplot will sometimes be modified by excluding any outliers from the plot.
• An outlier is defined as any value more than 1.5 times the IQR above Q3 or below Q1.
• The outliers will still be graphed as single points excluded from the boxplot.
ACTIVITY
Hank Aaron’s home run data (in numerical order) are:
13 20 24 26 27 29 30 32 34 34 38 39 39 40 40 44 44 44 44 45 47
Identify the Five Number Summary of this data.
13 28 38 44 47
Now construct boxplots in the calculator to compare Bonds’s data to Aaron’s.
ACTIVITY CONTINUED
• What do you see about the data that you have plotted?
• Do you think that there are any outliers in the data? If so, what values?
• Now construct a modified boxplot on the calculator and describe what you see.
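The 1.5 × IQR rule from the Modified Boxplot slide can be applied to both players' data as a check on the calculator work. This is a sketch under the same assumed quartile convention (split the ordered data at the median, leaving the median out when the count is odd); the helper names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    xs = sorted(values)
    n = len(xs)
    return median(xs[: n // 2]), median(xs[(n + 1) // 2:])


def outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 46, 49, 73]
aaron = [13, 20, 24, 26, 27, 29, 30, 32, 34, 34, 38, 39, 39,
         40, 40, 44, 44, 44, 44, 45, 47]

for name, data in (("Bonds", bonds), ("Aaron", aaron)):
    q1, q3 = quartiles(data)
    print(f"{name}: Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}, outliers = {outliers(data)}")
```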
HOMEWORK
Complete the assignment listed in the packet. This assignment will be due at the beginning of the next class session.
VIDEO #6: Decisions Through Data
Standard Deviation
Measuring the Spread
• The most common way of measuring the spread of data is by using the variance or standard deviation.
• The standard deviation (s) measures roughly the average distance of the observations from the mean.
• The variance and standard deviation both provide a numerical value to assess the spread of the data.
FORMULAS
VARIANCE
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
Or…
$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
FORMULAS
STANDARD DEVIATION
$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
Notice that standard deviation is simply the square root of the variance.
EXAMPLE
A study on dieting and exercise analyzed the metabolic rates of 7 men. Their rates, in calories per 24 hours, were:
1792 1666 1362 1614 1460 1867 1439
What are the mean and standard deviation of this data?
EXAMPLE CONTINUED
$$\bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = 1600$$
Observed   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
1792   192   36864
1666   66   4356
1362   -238   56644
1614   14   196
1460   -140   19600
1867   267   71289
1439   -161   25921
Sum: 11200   0   214870
EXAMPLE CONTINUED
$$s^2 = \frac{214870}{6} = 35811.67$$
…and we can use this to find the standard deviation.
$$s = \sqrt{35811.67} = 189.24 \text{ calories}$$
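The hand calculation can be verified step by step. The sketch below follows the same route (mean, deviations, squared deviations, divide by n − 1, square root) and then compares the result with `statistics.stdev`, which also uses the n − 1 divisor; the variable names are illustrative.

```python
import math
import statistics

# Metabolic rates (calories per 24 hours) of the 7 men in the study.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
mean = sum(rates) / n

deviations = [x - mean for x in rates]          # x_i - x-bar (these sum to 0)
squared_devs = [d ** 2 for d in deviations]     # (x_i - x-bar)^2

variance = sum(squared_devs) / (n - 1)          # divide by n - 1, not n
s = math.sqrt(variance)

print(f"mean = {mean:.0f}")
print(f"sum of squared deviations = {sum(squared_devs):.0f}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {s:.2f} calories")

# Cross-check against the standard library's sample standard deviation.
assert abs(s - statistics.stdev(rates)) < 1e-9
```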
Properties of Standard Deviation
• s measures spread about the mean. It should only be used when the mean is chosen as the measure of center.
• s = 0 means that there is no spread. This only happens when all observations are the same. Otherwise, s will always be positive.
• s is not a resistant measure. Skewness and outliers can make s very large.
Linear Transformation
• If we multiply all of the data in the set by a constant, the mean and standard deviation will be changed by that factor.
• If we add the same amount to each piece of data, the mean will move up or down that same amount.
• The standard deviation will not be affected by this addition.
EXAMPLE
Suppose teachers have an annual salary with a mean of $35,000 and a standard deviation of $7,000.
What would happen to the mean and standard deviation if all teachers received a $3,000 BONUS for high test scores?
What would happen to the mean and standard deviation if teachers were given a 10% raise?
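The two transformation rules can be tested on any data set. The five salaries below are hypothetical values made up for this sketch (the slide gives only the mean and standard deviation, not individual salaries); the last lines then apply the same rules to the summary figures from the example.

```python
import math
import statistics

# Hypothetical salaries, used only to check the rules numerically.
salaries = [28000, 31000, 35000, 39000, 42000]


def summary(values):
    """Return (mean, sample standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)


base_mean, base_sd = summary(salaries)
bonus_mean, bonus_sd = summary([x + 3000 for x in salaries])   # add a constant
raise_mean, raise_sd = summary([x * 1.10 for x in salaries])   # multiply by a constant

# Adding 3000 shifts the mean by 3000 and leaves the spread unchanged.
assert math.isclose(bonus_mean, base_mean + 3000)
assert math.isclose(bonus_sd, base_sd)

# Multiplying by 1.10 multiplies both the mean and the standard deviation by 1.10.
assert math.isclose(raise_mean, 1.10 * base_mean)
assert math.isclose(raise_sd, 1.10 * base_sd)

# Applying the same rules to the example's summary values.
print("After $3,000 bonus: mean", 35000 + 3000, "sd", 7000)
print("After 10% raise:    mean", round(1.10 * 35000), "sd", round(1.10 * 7000))
```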
HOMEWORK
Complete the assignment listed in the packet. This assignment will be due at the beginning of the next class session.