AP Statistics

Chapter 1: Exploring Data

WHAT IS STATISTICS?
• Statistics is the study of how we:
– Collect data
– Organize data
– Analyze data
– Use data to make predictions

Statistics is the tool we use to extract information from data!

Lesson Objectives
• Identify individuals and variables in a set of data.
• Classify a variable as being a quantitative or categorical variable.
• Identify the units of measurement for a quantitative value.

VARIABLES
• Individuals are the objects described by a set of data.
• Variables are characteristics that can take different values from individual to individual.
• A variable can be considered either categorical or quantitative.

EXAMPLE
• Suppose we observed a bag of M&M candies and were studying the different colors of the pieces.
• What would be the individuals in the study? What would be the variable?

Categorical vs. Quantitative
• Categorical variables place an individual into a group or category.
• Quantitative variables assign a numerical value to an individual.

EXAMPLES: Which type of variable is each?
A person’s height … Quantitative
A person’s eye color … Categorical
A person’s ZIP code … Categorical

While ZIP codes are numeric in form, arithmetic on them (adding or averaging, for example) is not meaningful, so ZIP code is a categorical variable.

Distribution
• The type of data collected can be a determining factor of the way the values are organized.
• Quantitative values can be very close together or very spread out.
• The pattern of variation between these values in a set of data is called the distribution.
• Distribution – a description of the values a variable takes and how often it takes these values.

DISTRIBUTION
• Both quantitative and categorical data will have differences from individual to individual.
• The pattern of variation of a variable is referred to as its distribution.
• In order to get a grasp of a variable’s distribution, we may use a graphical display of the data.

AP EXAM Tip
You will often be asked to “describe the distribution” of a set of data. When you are asked to do this, make sure that you have your SOCkS on!
S – Shape
O – Outliers
C – Center
S – Spread

When you describe these four characteristics of the data, you will be effectively describing the distribution!

ACTIVITY: Sexual Discrimination?

25 airplane pilots have applied to fill 8 positions to be pilots with an airline company. 15 of them are males and 10 are females. To be fair, the managers select the 8 pilots to be employed by a lottery.

A day later, the managers announce the 8 pilots to be hired. 5 of them are female and only 3 are males.

Many of the males claimed that the lottery had to have been “rigged” since there was no way that so many females were selected.

ACTIVITY CONTINUED
• To simulate the situation, select ten red cards (the females) and fifteen black cards (the males).
• Use the cards within your group to conduct your own lottery by drawing 8 cards.
• Count the number of females (red cards), and record that number.
• Put the cards back, and shuffle the cards.
• Repeat the process four more times.
• Report your results to be recorded.

What do we see?

[Dotplot of class results; horizontal axis: Number of Females Hired, 0 to 8]

Do you think it is possible that the number of females hired in the problem was a coincidence?
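The card activity can also be run many times on a computer. The sketch below is one possible way to do that, not part of the original activity: it draws 8 "cards" at random from 10 F and 15 M, repeats the lottery, and compares the simulated share of lotteries with 5 or more females to the exact value from counting. The function names, the seed, and the number of trials are illustrative assumptions.

```python
import random
from math import comb

FEMALES, MALES, HIRED = 10, 15, 8   # counts taken from the activity

def simulate_lottery(trials, seed=0):
    """Return the number of females hired in each simulated lottery."""
    rng = random.Random(seed)                  # fixed seed so the run is repeatable
    deck = ["F"] * FEMALES + ["M"] * MALES     # red cards = F, black cards = M
    return [rng.sample(deck, HIRED).count("F") for _ in range(trials)]

def exact_prob_at_least(k):
    """Exact chance that k or more of the 8 hired pilots are female."""
    total = comb(FEMALES + MALES, HIRED)
    ways = sum(comb(FEMALES, f) * comb(MALES, HIRED - f)
               for f in range(k, HIRED + 1))
    return ways / total

results = simulate_lottery(trials=20_000)
share = sum(r >= 5 for r in results) / len(results)
print(f"simulated P(5 or more females) = {share:.3f}")
print(f"exact     P(5 or more females) = {exact_prob_at_least(5):.3f}")
```

The exact value works out to roughly 0.128, so a fair lottery would produce 5 or more females about one time in eight.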

HOMEWORK
Complete the assignment listed in the packet. This assignment will be due at the beginning of the next class session.

Analyzing Categorical Data
• In this section we will learn about:
– Bar graphs/pie charts
– Problems with graphs
– Two-way Tables and Marginal Distribution
– Conditional Distributions
– Simpson’s Paradox

EXAMPLE
The radio rating service Arbitron places each of the country’s 13,838 stations in categories that describe the type of programming they broadcast. Here is the distribution of the data.

Frequency Table

Format                Count of Stations
Adult Contemporary    1556
Adult Standards       1196
Contemporary Hit      569
Country               2066
News/Talk             2179
Oldies                1060
Religious             2014
Rock                  869
Spanish Language      750
Other Formats         1579
Total                 13838

Relative Frequency Table

Format                Percent of Stations
Adult Contemporary    11.2
Adult Standards       8.6
Contemporary Hit      4.1
Country               14.9
News/Talk             15.7
Oldies                7.7
Religious             14.6
Rock                  6.3
Spanish Language      5.4
Other Formats         11.4
Total                 99.9 (not 100 because of rounding)
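The step from the frequency table to the relative frequency table is one division per row. A minimal sketch of that calculation, using the station counts from the table (the variable names are illustrative):

```python
# Station counts by format, copied from the frequency table.
counts = {
    "Adult Contemporary": 1556,
    "Adult Standards": 1196,
    "Contemporary Hit": 569,
    "Country": 2066,
    "News/Talk": 2179,
    "Oldies": 1060,
    "Religious": 2014,
    "Rock": 869,
    "Spanish Language": 750,
    "Other Formats": 1579,
}

total = sum(counts.values())

# Relative frequency (as a percent) = count / total * 100, rounded to one decimal.
percents = {fmt: round(100 * n / total, 1) for fmt, n in counts.items()}

for fmt, pct in percents.items():
    print(f"{fmt:<20}{counts[fmt]:>7}{pct:>7.1f}")
print(f"{'Total':<20}{total:>7}{sum(percents.values()):>7.1f}")
```

Because each percent is rounded separately, the rounded percents add to 99.9 rather than 100.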

Continued
Sometimes we may wish to use a graph instead of a table to clarify relationships.

[Bar graph: Count of Stations by format; vertical axis from 0 to 2500]

[Pie chart: Percent of Stations by format: Adult Contemporary 11%, Adult Standards 9%, Contemporary Hit 4%, Country 15%, News/Talk 16%, Oldies 8%, Religious 15%, Rock 6%, Spanish 5%, Other 11%]

Be Careful!

Because they appeal to the eye, graphical displays can sometimes be misleading. Always check the scaling of the axes and whether the graph is relevant to the question. Pictographs are especially likely to mislead.

Pictograph
• What is the issue with this ad that was used by Apple Computers to show the people that were buying their new iMac Computer?

Activity
• Use the table below.
• A.) Make a well-labeled graph to display the data.
• B.) Would it be appropriate to make a pie chart here? Why?

Possible Answer

Two-Way Tables
• A survey of 4826 randomly selected young adults (19-25 yrs old) asked, “What do you think are the chances you will have much more than a middle-class income at age 30?”
• This is an example of a two-way table.

Two-Way Table
• A two-way table describes two categorical variables, organizing counts according to a row variable and a column variable.
• Marginal Distribution
– The distribution of values of one of the variables among all individuals described by the two-way table.
• To examine a marginal distribution:
– Use the table data to compute percents of the row or column totals.
– Make a graph to display the marginal distribution.

Young adults by gender and chance of getting rich

Response                         Female   Male   Total
Almost no chance                 96       98     194
Some chance, but probably not    426      286    712
A 50-50 chance                   696      720    1416
A good chance                    663      758    1421
Almost certain                   486      597    1083
Total                            2367     2459   4826

Two-Way Tables and Marginal Distributions

Response            Percent
Almost no chance    194/4826 = 4.0%
Some chance         712/4826 = 14.8%
A 50-50 chance      1416/4826 = 29.3%
A good chance       1421/4826 = 29.4%
Almost certain      1083/4826 = 22.4%

Examine the marginal distribution of chance of getting rich.

[Bar graph: Chance of being wealthy by age 30; horizontal axis: Survey Response (Almost none, Some chance, 50-50 chance, Good chance, Almost certain); vertical axis: Percent, 0 to 35]
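The marginal percents come from dividing each row total by the grand total. A conditional distribution, one of the topics listed for this section, instead divides each count by its own column total. The sketch below shows both calculations on the survey counts; the dictionary layout and names are illustrative choices, not part of the original slides.

```python
# Counts from the two-way table: response -> {gender: count}.
table = {
    "Almost no chance":              {"Female": 96,  "Male": 98},
    "Some chance, but probably not": {"Female": 426, "Male": 286},
    "A 50-50 chance":                {"Female": 696, "Male": 720},
    "A good chance":                 {"Female": 663, "Male": 758},
    "Almost certain":                {"Female": 486, "Male": 597},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of the response: row total / grand total.
marginal = {resp: round(100 * sum(row.values()) / grand_total, 1)
            for resp, row in table.items()}

# Conditional distribution of the response for each gender: count / column total.
col_totals = {g: sum(row[g] for row in table.values()) for g in ("Female", "Male")}
conditional = {g: {resp: round(100 * row[g] / col_totals[g], 1)
                   for resp, row in table.items()}
               for g in col_totals}

for resp in table:
    print(f"{resp:<32}{marginal[resp]:>6.1f}"
          f"{conditional['Female'][resp]:>8.1f}{conditional['Male'][resp]:>8.1f}")
```

The three printed columns are the marginal percent, the percent among females, and the percent among males for each response.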

Simpson’s Paradox
• Accident victims are often transported by helicopter to a medical facility. Does this help save lives?
• What is the percentage of deaths for each of the two categories?
• …not too positive, huh?!

Continued…
• Let’s look at the data differently.
• Compute the percentages now.
• …is that right?
• This phenomenon is referred to as Simpson’s Paradox.
• It is caused by what is referred to as a lurking variable.
• What was the lurking variable here?
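The reversal can be checked with a few lines of arithmetic. The counts below are hypothetical numbers chosen only to illustrate the effect; they are not the data from the slides, and the grouping variable (`severity`) and the labels "helicopter" and "road" are assumptions made for the illustration.

```python
# HYPOTHETICAL counts, for illustration only: (died, survived)
# for each transport method, split by a third variable (accident severity).
counts = {
    "helicopter": {"serious": (48, 52), "less serious": (16, 84)},
    "road":       {"serious": (60, 40), "less serious": (200, 800)},
}

def death_rate(died, survived):
    """Percent of victims who died."""
    return 100 * died / (died + survived)

def overall_rate(method):
    """Death rate for a method with the severity groups combined."""
    died = sum(d for d, _ in counts[method].values())
    survived = sum(s for _, s in counts[method].values())
    return death_rate(died, survived)

for method in counts:
    print(f"{method:<11} overall: {overall_rate(method):5.1f}%")
    for severity, (d, s) in counts[method].items():
        print(f"{'':<11} {severity:<13}{death_rate(d, s):5.1f}%")
```

With these made-up counts one method looks worse overall yet better within each severity group, because it is used far more often for the serious accidents.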


1.2 – Displaying Quantitative Data

• Dotplots are a commonly used method of displaying quantitative data.

• To make a dotplot, DRAW a horizontal number line, labeled with the name of the variable.

• SCALE the number line, including the minimum and maximum values.

• MARK a dot above the corresponding location on the axis for each data value.

EXAMPLE
• The table here displays goals scored by the US Women’s Soccer Team in 2004.
• Create a dotplot to represent the data.

Number of Goals Scored Per Game by the 2004 US Women’s Soccer Team

3 0 2 7 8 2 4 3 5 1 1 4 5 3 1 1 3
3 3 2 1 2 2 2 4 3 5 6 1 5 5 1 1 5
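The DRAW, SCALE, MARK steps can be imitated in plain text. This is a rough sketch for small whole-number data such as the 34 goal counts above; the function name `dotplot` and the use of `*` as the dot are illustrative choices, and the column alignment assumes single-digit values.

```python
from collections import Counter

goals = [3, 0, 2, 7, 8, 2, 4, 3, 5, 1, 1, 4, 5, 3, 1, 1, 3,
         3, 3, 2, 1, 2, 2, 2, 4, 3, 5, 6, 1, 5, 5, 1, 1, 5]

def dotplot(values, label):
    """Build a text dotplot: one '*' stacked above the axis per observation."""
    counts = Counter(values)
    lo, hi = min(values), max(values)          # SCALE: axis runs from min to max
    lines = []
    for level in range(max(counts.values()), 0, -1):
        # MARK: put a dot in this row wherever the stack is at least this tall
        row = " ".join("*" if counts[v] >= level else " " for v in range(lo, hi + 1))
        lines.append(row.rstrip())
    lines.append(" ".join(str(v) for v in range(lo, hi + 1)))   # DRAW: the number line
    lines.append(label)                                          # name of the variable
    return "\n".join(lines)

print(dotplot(goals, "Goals scored per game"))
```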

EXAMPLE 2
• The table and dotplot below display the Environmental Protection Agency’s estimates of highway gas mileage in miles per gallon (MPG) for a sample of 24 model year 2009 midsize cars.
• Use the dotplot to describe the distribution.

2009 Fuel Economy Guide

      MODEL                  MPG
 1    Acura RL               22
 2    Audi A6 Quattro        23
 3    Bentley Arnage         14
 4    BMW 528i               28
 5    Buick Lacrosse         28
 6    Cadillac CTS           25
 7    Chevrolet Malibu       33
 8    Chrysler Sebring       30
 9    Dodge Avenger          30
10    Hyundai Elantra        33
11    Jaguar XF              25
12    Kia Optima             32
13    Lexus GS 350           26
14    Lincoln MKZ            28
15    Mazda 6                29
16    Mercedes-Benz E350     24
17    Mercury Milan          29
18    Mitsubishi Galant      27
19    Nissan Maxima          26
20    Rolls Royce Phantom    18
21    Saturn Aura            33
22    Toyota Camry           31
23    Volkswagen Passat      29
24    Volvo S80              25

[Dotplot: 2009 Fuel Economy Guide; horizontal axis: MPG, 14 to 34]

Describing the Shape
• We can describe the shape of a distribution by using the following terms.
• Symmetric – if the right and left sides of the graph are approximately mirror images of each other.
• Skewed Right – if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.
• Skewed Left – if the left side of the graph (containing the half of the observations with smaller values) is much longer than the right side.
• Bimodal – a set of data that has two peaks.

Identify Each

[Dotplots, including: DiceRolls (axis 0 to 12), Score (axis 70 to 100), Siblings (axis 0 to 7)]

Symmetric, Skewed left, Skewed right, Bimodal

Applying the Concepts
• Complete the “Check Your Understanding” questions on pg. 31.

VIDEO #2: Decisions Through Data

Stemplots

Stemplots
• Stemplots are often used as a means of representing quantitative values.
• The data is organized by separating each observation into a stem (all but the last digit) and a leaf (the last digit).
• The leaf values are then paired with their stem and ordered.
• Trends and patterns in the distribution can be seen here.

Caffeine content (mg) of an 8 oz. serving of many popular soft drinks.

A&W Cream           20
Barq’s Root Beer    15
Cherry Coke         23
Cherry RC Cola      29
Coke Classic        23
Diet A&W Cream      15
Diet Cherry Coke    23
Diet Coke           31
Diet Dr. Pepper     28
Diet Mello Yello    35
Diet Mtn Dew        37
Diet Mr. Pibb       27
Diet Pepsi          24
Diet Red Squirt     26
Diet Sun Drop       47
Diet Sunkist        28
Diet Cherry Pepsi   24
Dr. Nehi            28
Dr. Pepper          28
IBC Cherry          16
Kick                38
KMX                 36
Mello Yello         35
Mountain Dew        37
Mr. Pibb            27
Nehi Wild Red       33
Pepsi One           37
Pepsi               25
RC Edge             47
Red Flash           27
Royal Crown         29
Red Squirt          26
Sun Drop Cherry     43
Sun Drop            43
Sunkist             28
Surge               35
Tab                 31
Cherry Pepsi        25

Arrange all of this data into a stemplot, and observe the distribution.

Stemplot from the Example

Caffeine Content (mg) per 8 oz. Serving of Various Soft Drinks

1 | 5 5 6
2 | 0 3 3 3 4 4 5 5 6 6 7 7 7 8 8 8 8 8 9 9
3 | 1 1 3 5 5 5 6 7 7 7 8
4 | 3 3 7 7

Key: 3|5 means 35 mg of caffeine per 8 oz. serving

An Alternative (better) Plot

Caffeine Content (mg) per 8 oz. Serving of Various Soft Drinks

1 | 5 5 6
2 | 0 3 3 3 4 4
2 | 5 5 6 6 7 7 7 8 8 8 8 8 9 9
3 | 1 1 3
3 | 5 5 5 6 7 7 7 8
4 | 3 3
4 | 7 7

Key: 3|5 means 35 mg of caffeine per 8 oz. serving
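Both stemplots can be produced by the same grouping rule. The sketch below is one possible implementation for two-digit whole numbers; the function name `stemplot` and its `split` flag are illustrative, and split stems put leaves 0 to 4 on the first row and 5 to 9 on the second.

```python
# Caffeine values (mg per 8 oz. serving) from the soft drink table, sorted.
caffeine = [15, 15, 16, 20, 23, 23, 23, 24, 24, 25, 25, 26, 26, 27, 27, 27,
            28, 28, 28, 28, 28, 29, 29, 31, 31, 33, 35, 35, 35, 36, 37, 37,
            37, 38, 43, 43, 47, 47]

def stemplot(values, split=False):
    """Return the rows of a stemplot; with split=True each stem gets two rows."""
    halves = 2 if split else 1

    def row_index(v):
        stem, leaf = divmod(v, 10)             # stem = tens digit, leaf = last digit
        return stem * halves + (leaf // 5 if split else 0)

    rows = {}
    for v in sorted(values):                   # sorting keeps leaves in order
        rows.setdefault(row_index(v), []).append(v % 10)

    first, last = min(rows), max(rows)         # keep empty rows between the extremes
    return [f"{i // halves} | {' '.join(str(leaf) for leaf in rows.get(i, []))}".rstrip()
            for i in range(first, last + 1)]

print("\n".join(stemplot(caffeine)))
print()
print("\n".join(stemplot(caffeine, split=True)))
```

Keeping the empty rows matters: dropping a stem with no leaves would hide a gap in the distribution.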

Tips for Stemplots
• When you split stems, make sure each part is assigned an equal number of possibilities.
• There is no set number of stems.
• Too few stems makes a “skyscraper” shape.
• Too many stems makes a “pancake” shape.
• As a rule of thumb, use at least five stems.
• Always include a title and a key to show how the stems were formed.



Histograms

HISTOGRAMS
• Histograms differ from bar graphs in that the bars sit on a continuous scale of values.
• As with stemplots, there is no set number of classes to use.
• Five classes is a good minimum.
• Remember: the area of each bar is what matters. Keep the class width constant so that only the height varies.

Relative Frequency Histograms

• A relative frequency histogram is based on the relative frequency of each class.
• Relative Frequency = number of occurrences in the class / total number of occurrences.
• Relative frequency is often used to find percentiles, or the portion of the data that is at or below a value.
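As a sketch of the bookkeeping behind a relative frequency histogram, the code below sorts the caffeine values from the stemplot example into classes of equal width and reports each class's relative and cumulative relative frequency. The class boundaries (10 to 49 in steps of 10) and the names are assumptions made for the illustration.

```python
from itertools import accumulate

# Caffeine values (mg per 8 oz. serving) from the stemplot example.
caffeine = [15, 15, 16, 20, 23, 23, 23, 24, 24, 25, 25, 26, 26, 27, 27, 27,
            28, 28, 28, 28, 28, 29, 29, 31, 31, 33, 35, 35, 35, 36, 37, 37,
            37, 38, 43, 43, 47, 47]

def class_counts(values, start, width, n_classes):
    """Count the observations in each class [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return counts

START, WIDTH, N_CLASSES = 10, 10, 4            # assumed classes: 10-19, 20-29, 30-39, 40-49
counts = class_counts(caffeine, START, WIDTH, N_CLASSES)
relative = [c / len(caffeine) for c in counts]
cumulative = list(accumulate(relative))        # share of the data at or below each class

for k, (c, r, cum) in enumerate(zip(counts, relative, cumulative)):
    low = START + k * WIDTH
    print(f"{low}-{low + WIDTH - 1}: count={c:2d}  rel={r:.3f}  cum={cum:.3f}  {'#' * c}")
```

Four classes is below the suggested minimum of five; a class width of 5 would give a more detailed picture of the same data.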

Applying the Concepts
• Complete the “Check Your Understanding” questions on pg. 39.



Measures of Center

1.3 - Describing Distributions with Numbers
• To describe a distribution, we must identify its center.
• One measure of the center of a set of data is the mean.
• The mean is the sum of all observations in a set divided by the total number of observations.

FORMULA for Mean

• The mean of a set of data is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

or…

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Median
• The median of a set of data is the midpoint of the data.
• Arrange all of the numbers from least to greatest.
• If there is an odd number of observations, the median is the center of the list.
• If there is an even number of observations, the median is the mean of the two center observations.
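The two definitions translate directly into code. A minimal sketch, applied to the 34 goal counts from the soccer example earlier in the chapter (the function names are illustrative):

```python
goals = [3, 0, 2, 7, 8, 2, 4, 3, 5, 1, 1, 4, 5, 3, 1, 1, 3,
         3, 3, 2, 1, 2, 2, 2, 4, 3, 5, 6, 1, 5, 5, 1, 1, 5]

def mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered list (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(f"mean   = {mean(goals):.2f}")
print(f"median = {median(goals)}")
```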

Mean vs. Median
• The median is a more resistant measure than the mean.
• This means that the mean can be more easily influenced by extreme values.
• Differences between mean and median can indicate skewness in the data.
• Skewed Left data will typically have a mean that is less than the median.
• Skewed Right data will typically have a mean that is greater than the median.
• Mean/Median Applet

http://bcs.whfreeman.com/yates2e/pages/bcs-main.asp?v=category&s=00020&n=99000&i=99020.01&o=
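Resistance can be seen on the highway MPG data from the 2009 Fuel Economy Guide example. The sketch below compares the mean and median with and without the two lowest values (14 and 18); the cutoff of 20 MPG used to drop them is an arbitrary choice made for this illustration.

```python
from statistics import mean, median

# Highway MPG for the 24 midsize cars in the Fuel Economy Guide example.
mpg = [22, 23, 14, 28, 28, 25, 33, 30, 30, 33, 25, 32,
       26, 28, 29, 24, 29, 27, 26, 18, 33, 31, 29, 25]

# Drop the two unusually low values to see how each measure reacts.
trimmed = [x for x in mpg if x > 20]

print(f"all 24 cars:      mean = {mean(mpg):.1f}, median = {median(mpg):.1f}")
print(f"without 14 and 18: mean = {mean(trimmed):.1f}, median = {median(trimmed):.1f}")
```

With all 24 cars the mean sits below the median, as expected for a distribution with a tail to the left; removing the two low values shifts the mean while the median stays put.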



Boxplots

Spread of Data
• The spread of a distribution is another important characteristic of the data.
• One simple measure of the spread of a set of data is the range.
• Range = (highest value – lowest value)
• Since the range only involves two values, the range can be strongly influenced by outliers.

Quartiles
• Sometimes it is best to split the data into quartiles.
• The quartiles divide the ordered data into four parts, each holding about 25% of the data.
• Often, the difference between the 3rd and 1st quartiles is used to describe the spread of the data.
• This difference is referred to as the interquartile range (IQR).

Example of Quartiles
• Barry Bonds’s home run counts (arranged in order) are:
16 19 24 25 25 33 33 34 34 37 37 40 42 46 49 73
• A Five-Number Summary can be used to describe the data as well. It includes the min value, Q1, Median, Q3, and max value.
• What is the Five-Number Summary of Barry Bonds’s home run data?

BOXPLOTS
• A boxplot can be used to graphically represent the spread of the data in a set.
• To form a boxplot, we plot points at all values in the Five Number Summary.
• Form a box between the Q1 and Q3 values, with a bar inside at the median.
• Connect the max and min values to the box with line segments.
• This can also be done on the calculator in stat plot mode.

Modified Boxplot
• A boxplot will sometimes be modified by excluding any outliers from the plot.
• An outlier is defined as any value more than 1.5 times the IQR above Q3 or below Q1.
• The outliers will still be graphed as single points excluded from the boxplot.
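The five-number summary and the 1.5 × IQR rule can be sketched as follows, using Barry Bonds’s home run counts from the quartile example. The quartile convention here (median of the lower half and median of the upper half, leaving out the middle value when n is odd) is an assumption; calculators and software sometimes use slightly different rules. The function names are illustrative.

```python
bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 46, 49, 73]

def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]            # observations below the median position
    upper = ordered[(n + 1) // 2:]       # observations above the median position
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

print("five-number summary:", five_number_summary(bonds))
print("outliers:", outliers(bonds))
```

A modified boxplot would draw any value returned by `outliers` as a separate point and end the whisker at the most extreme value that is not an outlier.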

ACTIVITY
Hank Aaron’s home run data (in numerical order) are:
13 20 24 26 27 29 30 32 34 34 38 39 39 40 40 44 44 44 44 45 47

Identify the Five Number Summary of this data.
Answer: 13 28 38 44 47

Now construct boxplots in the calculator to compare Bonds’s data to Aaron’s.

ACTIVITY CONTINUED
• What do you see about the data that you have plotted?
• Do you think that there are any outliers in the data? If so, what values?
• Now construct a modified boxplot on the calculator and describe what you see.



Standard Deviation

Measuring the Spread
• The most common way of measuring the spread of data is by using the variance or standard deviation.
• The standard deviation (s) measures the typical distance between the observations and the mean.
• The variance and standard deviation both provide a numerical value to assess the spread of the data.

FORMULAS
VARIANCE

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

Or…

$$s^2 = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$$

FORMULAS
STANDARD DEVIATION

Notice that standard deviation is simply the square root of the variance.

$$s = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$$

EXAMPLE
A study on dieting and exercise analyzed the metabolic rates of 7 men. The rates, in calories per 24 hours, were:

1792 1666 1362 1614 1460 1867 1439

What are the mean and standard deviation of this data?

EXAMPLE CONTINUED

$$\bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = 1600$$

Observed    (x_i − x̄)    (x_i − x̄)²
1792         192          36864
1666          66           4356
1362        −238          56644
1614          14            196
1460        −140          19600
1867         267          71289
1439        −161          25921
Sum 11200      0         214870

EXAMPLE CONTINUED

$$s^2 = \frac{214870}{6} = 35811.67$$

…and we can use this to find the standard deviation.

$$s = \sqrt{35811.67} = 189.24 \text{ calories}$$
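The table and the two formulas can be reproduced step by step. A minimal sketch using the seven metabolic rates from the example (variable names are illustrative):

```python
from math import sqrt

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]   # calories per 24 hours

n = len(rates)
x_bar = sum(rates) / n                                # mean

deviations = [x - x_bar for x in rates]               # (x_i - x_bar)
squared_devs = [d ** 2 for d in deviations]           # (x_i - x_bar)^2

variance = sum(squared_devs) / (n - 1)                # divide by n - 1, not n
s = sqrt(variance)                                    # standard deviation

for x, d, d2 in zip(rates, deviations, squared_devs):
    print(f"{x:>6}{d:>8.0f}{d2:>8.0f}")
print(f"mean = {x_bar:.0f}, variance = {variance:.2f}, s = {s:.2f} calories")
```

The deviations always sum to zero, which is why they are squared before being averaged.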

Properties of Standard Deviation

• s measures spread about the mean. It should only be used when the mean is chosen as the measure of center.

• s = 0 means that there is no spread. This only happens when all observations are the same. Otherwise, s will always be positive.

• s is not a resistant measure. Skewness and outliers can make s very large.

Linear Transformation
• If we multiply all of the data in the set by a constant, the mean and standard deviation are both multiplied by that constant (for a negative constant, the standard deviation is multiplied by its absolute value).
• If we add the same amount to each piece of data, the mean will move up or down that same amount.
• The standard deviation will not be affected by this addition.

EXAMPLE
• Suppose teachers have an annual salary with a mean of $35,000 and a standard deviation of $7,000.
• What would happen to the mean and standard deviation if all teachers received a $3,000 BONUS for high test scores?
• What would happen to the mean and standard deviation if teachers were given a 10% raise?
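The two rules can be written as one small function for a transformation of the form y = a·x + b, and then checked against actual data. The function name `transform_summary` is illustrative, and the five salaries used for the check are made-up values chosen only so that their mean is $35,000; they are not data from the example.

```python
from statistics import mean, stdev

def transform_summary(mean_x, sd_x, a=1.0, b=0.0):
    """Mean and standard deviation of y = a*x + b, given those of x."""
    return a * mean_x + b, abs(a) * sd_x

# The two questions from the example.
bonus = transform_summary(35_000, 7_000, b=3_000)     # add $3,000 to every salary
raise_10 = transform_summary(35_000, 7_000, a=1.10)   # multiply every salary by 1.10
print("after bonus:     mean = %.0f, sd = %.0f" % bonus)
print("after 10%% raise: mean = %.0f, sd = %.0f" % raise_10)

# Check the rules on HYPOTHETICAL salaries (illustration only).
salaries = [28_000, 31_000, 35_000, 39_000, 42_000]
with_bonus = [x + 3_000 for x in salaries]
with_raise = [x * 1.10 for x in salaries]
print(mean(salaries), stdev(salaries))
print(mean(with_bonus), stdev(with_bonus))
print(mean(with_raise), stdev(with_raise))
```

Adding a constant shifts every value by the same amount, so the distances from the mean, and therefore the standard deviation, do not change.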
