The Practice of Statistics for Business and Economics, Third Edition

David S. Moore, George P. McCabe, Layth C. Alwan, Bruce A. Craig, William M. Duckworth

© 2011 W.H. Freeman and Company

Examining Relationships Scatterplots

PSBE Chapter 2.1

© 2010 Pearson Education 3

Class Exercise – Real Estate, House Prices

[Figure: Scatter Plot of Price versus Number of Bathrooms. x-axis: Number of Bathrooms; y-axis: Price (in $1,000)]

A scatterplot, which plots one quantitative variable against another, can be an effective display for data.

Scatterplots are the ideal way to picture associations between two quantitative variables.

Assigning Roles to Variables in Scatterplots

To make a scatterplot of two quantitative variables, assign one to the y-axis and the other to the x-axis.

Be sure to label the axes clearly, and indicate the scales of the axes with numbers.

Each variable has units, and these should appear with the display—usually near each axis.

Assigning Roles to Variables in Scatterplots

Each point is placed on a scatterplot at a position that corresponds to values of the two variables.

The point’s horizontal location is specified by its x-value, and its vertical location is specified by its y-value.

Together, these values are known as coordinates and are written (x, y).

Assigning Roles to Variables in Scatterplots

One variable plays the role of the explanatory or predictor variable, while the other takes on the role of the response variable.

We place the explanatory variable on the x-axis and the response variable on the y-axis.

The x- and y-variables are sometimes referred to as the independent and dependent variables, respectively. In this class, use the terms explanatory or predictor variable (x) and the response variable (y).

Looking at Scatterplots – Diamond Prices

Carat  Price
0.33   1079
0.33   1079
0.39   1030
0.40   1150
0.41   1110
0.42   1210
0.42   1210
0.46   1570
0.47   2113
0.48   2147
0.51   1770
0.56   1720
0.61   2500
0.62   3116
0.63   3165
0.64   2600
0.70   3080
0.70   3390
0.71   3440
0.71   3530
0.71   4481
0.72   4562
0.75   5069
0.80   5847
0.83   4930

Which variable will be the explanatory variable and which will be the response variable?

Looking at Scatterplots

The direction of the association is important.

A pattern that runs from the upper left to the lower right is said to be negative.

A pattern running from the lower left to the upper right is called positive.
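As a rough numerical companion to reading direction off a plot, the sketch below classifies direction from the sign of the summed products of deviations from the two means. This is an illustration under that assumption, not a method given in the slides; the helper name `direction` is hypothetical, and the data are eight (carat, price) pairs taken from the diamond table.

```python
def direction(xs, ys):
    """Classify direction from the sign of the summed products of
    deviations from the two means (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "no direction"


# Eight (carat, price) pairs from the diamond table.
carat = [0.33, 0.39, 0.46, 0.51, 0.61, 0.70, 0.75, 0.83]
price = [1079, 1030, 1570, 1770, 2500, 3080, 5069, 4930]

print(direction(carat, price))                 # pattern runs lower left to upper right
print(direction(carat, [-p for p in price]))   # mirrored prices reverse the direction
```

A sign check like this only summarizes direction; form, strength, and outliers still have to be judged from the plot itself.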

Looking at Scatterplots – Diamond Prices

Direction?

Positive

Looking at Scatterplots

The second thing to look for in a scatterplot is its form.

If there is a straight-line relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form. This is called linear form.

Sometimes the relationship curves gently, while still increasing or decreasing steadily; sometimes it curves sharply up then down.

Looking at Scatterplots – Diamond Prices

Form?

Linear

Looking at Scatterplots

The third feature to look for in a scatterplot is the strength of the relationship.

Do the points appear tightly clustered in a single stream or do the points seem to be so variable and spread out that we can barely discern any trend or pattern?

Looking at Scatterplots – Diamond Prices

Strength?

Moderately Strong

Looking at Scatterplots

Finally, always look for the unexpected.

An outlier is an unusual observation, standing away from the overall pattern of the scatterplot.

Looking at Scatterplots – Diamond Prices

Outliers?

No Outliers

Examining relationships

Most statistical studies involve more than one variable.

Questions: What individuals do the data describe?

What variables are present and how are they measured?

Are all of the variables quantitative?

Do some of the variables explain or even cause changes in other variables?

Looking at relationships

Start with a graph

Look for an overall pattern and deviations from the pattern

Use numerical descriptions of the data and overall pattern (if appropriate)

Explanatory and response variables

A response variable measures or records an outcome of a study. It is also called a dependent variable.

An explanatory variable explains changes in the response variable. It is also called an independent variable.

Scatterplot

A scatterplot shows the relationship between two quantitative variables measured on the same individuals.

Typically, the explanatory or independent variable is plotted on the x axis, and the response or dependent variable is plotted on the y axis.

Each individual in the data appears as a point in the plot.

Scatterplot example

Botnet    Bots  Spams
Srizbi    315   60
Bobax     185   9
Rustock   150   30
Cutwail   125   16
Storm     85    3
Grum      50    2
Ozdok     35    10
Nucrypt   20    5
Wopla     20    0.06
Spamthru  10    0.035

Here, we have two quantitative variables for each of 10 botnets:

•Number of bots (thousands)

•Spams per day (billions)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?
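A scatterplot of these ten botnets can be roughed out even without a plotting library. The sketch below is a minimal text-only illustration, with a hypothetical helper `text_scatter`; the grid size is an arbitrary choice, and the helper assumes each variable takes at least two distinct values. Bots are on the x-axis and spams per day on the y-axis.

```python
def text_scatter(xs, ys, width=40, height=12):
    """Return a rough text scatterplot: x runs left to right,
    y runs bottom to top (illustrative helper)."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid[0] is the top line
    return "\n".join("".join(line) for line in grid)


bots = [315, 185, 150, 125, 85, 50, 35, 20, 20, 10]    # thousands
spams = [60, 9, 30, 16, 3, 2, 10, 5, 0.06, 0.035]      # billions per day

plot = text_scatter(bots, spams)
print(plot)
```

Points that land in the same grid cell share one marker, so a real plotting tool is still preferable for anything beyond a quick look.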

Scatterplots

Some plots don’t have clear explanatory and response variables.

Do calories explain sodium amounts?

Does percent return on Treasury bills explain percent return on common stocks?

Interpreting scatterplots

After plotting two variables on a scatterplot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern …

Form: linear, curved, clusters, no pattern

Direction: positive, negative, no direction

Strength: how closely the points fit the “form”

… and deviations from that pattern.

Outliers

Form and direction of an association

Linear

Nonlinear

No relationship

Positive association: High values of one variable tend to occur together with high values of the other variable.

Negative association: High values of one variable tend to occur together with low values of the other variable.

No relationship: X and Y vary independently. Knowing X tells you nothing about Y.

Strength of the association

The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

With a strong relationship, you can get a pretty good estimate of y if you know x.

With a weak relationship, for any x you might get a wide range of y values.

Strength of the association

This is a weak relationship. For a particular state median household income, you can’t predict the state per capita income very well.

Strength of the association

This is a very strong relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value.

Stronger association?

Two scatterplots of the same data.

The straight-line pattern in the lower plot appears stronger because of the surrounding open space.

How to scale a scatterplot

Using an inappropriate scale for a scatterplot can give an incorrect impression.

Both variables should be given a similar amount of space:

• Plot roughly square

• Points should occupy all the plot space (no blank space)

Outliers

An outlier is a data value that has a low probability of occurrence (i.e., it is unusual or unexpected).

In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.

Outliers

Not an outlier: The upper-right-hand point here is not an outlier of the relationship. It is what you would expect for this number of bots given the linear relationship between spams per day and bots.

Outlier: This point is not in line with the others, so it is an outlier of the relationship.

Outliers

IQ score and Grade Point Average

a) Describe in words what this plot shows.

b) Describe the direction, shape, and strength. Are there outliers?

c) What might explain these people?

Categorical variables in scatterplots

Often, things are not simple and one-dimensional. We need to group the data into categories to reveal trends.

What may look like a positive linear relationship is in fact a series of negative linear associations.

Plotting different habitats in different colors allows us to make that important distinction.
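The effect described above can be reproduced with a small made-up data set. The numbers and habitat labels below are purely hypothetical (they are not the data behind the slide), and the sketch uses the correlation coefficient r, which is defined later in these slides, as a convenient summary of direction.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical habitats: y falls with x inside each habitat,
# but habitats with larger x also sit higher on the y-axis.
habitats = {
    "A": ([1, 2, 3], [5, 4, 3]),
    "B": ([4, 5, 6], [8, 7, 6]),
    "C": ([7, 8, 9], [11, 10, 9]),
}

all_x = [x for xs, _ in habitats.values() for x in xs]
all_y = [y for _, ys in habitats.values() for y in ys]

overall = pearson_r(all_x, all_y)
within = {name: pearson_r(xs, ys) for name, (xs, ys) in habitats.items()}

print("overall:", round(overall, 2))
for name, r in within.items():
    print(name, round(r, 2))
```

Pooling the groups gives a positive value even though every group on its own is negative, which is why the categorical variable needs to be shown on the plot.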

Categorical variables in scatterplots

Comparison of men’s and women’s racing records over time. Each group shows a very strong negative linear relationship that would not be apparent without the gender categorization.

Categorical variables in scatterplots

Relationship between lean body mass and metabolic rate in men and women. Both men and women follow the same positive linear trend, but women show a stronger association. As a group, males typically have larger values for both variables.

Categorical explanatory variables

When the explanatory variable is categorical, you cannot make a scatterplot, but you can compare the different categories side by side on the same graph (boxplots, or mean ± standard deviation).

Comparison of income (quantitative response variable) for different education levels (five categories).

But be careful in your interpretation: This is NOT a positive association because education is not quantitative.

Examining Relationships Correlation

PSBE Chapter 2.2

Understanding Correlation

Correlation Conditions

Correlation measures the strength of the linear association between two quantitative variables.

Objectives (PSBE Chapter 2.2)

Correlation

• The correlation coefficient “r”

• r does not distinguish between x and y

• r has no units of measurement

• r ranges from -1 to +1

• r is strongly affected by influential points and outliers

Understanding Correlation

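The ratio of the sum of the products z_x z_y for every point in the scatterplot to n – 1 is called the correlation coefficient:

\[ r = \frac{\sum z_x z_y}{n - 1} \]

Two of the more common alternative formulas for correlation are:

\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y} \]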

The correlation coefficient “r”

Bots: x̄ = 99.5, s_x = 96.9

Spams per day: ȳ = 13.51, s_y = 18.71

r = 0.885
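The figures on this slide can be checked with a short sketch that follows the z-score definition of r: standardize each variable with its mean and sample standard deviation (dividing by n – 1), multiply the paired z-scores, sum, and divide by n – 1. The function names are illustrative; the data are the ten botnets from the scatterplot example.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1)."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = sample_sd(xs), sample_sd(ys)
    products = (
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


bots = [315, 185, 150, 125, 85, 50, 35, 20, 20, 10]    # thousands
spams = [60, 9, 30, 16, 3, 2, 10, 5, 0.06, 0.035]      # billions per day

print(round(mean(bots), 1), round(sample_sd(bots), 1))     # 99.5 96.9
print(round(mean(spams), 2), round(sample_sd(spams), 2))   # 13.51 18.71
print(round(correlation(bots, spams), 3))                  # 0.885
```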

Understanding Correlation

Correlation Conditions

Before you use correlation, you must check three conditions:

• Quantitative Variables Condition: Correlation applies only to quantitative variables.

• Linearity Condition: Correlation measures the strength only of the linear association.

• Outlier Condition: Unusual observations can distort the correlation.

No matter how strong the association, r does not describe curved relationships.

Correlation only describes linear relationships
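A small constructed example makes the point concrete. In the sketch below, y is determined exactly by x through y = x², yet r comes out as zero because the relationship is curved and symmetric rather than linear. The data are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]    # a perfect, but curved, relationship

r_curved = pearson_r(xs, ys)
print(r_curved)
```

A value of r near zero therefore does not rule out a strong association; it only says there is no linear one, which is why the scatterplot is checked first.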

Understanding Correlation

Correlation Properties

• The sign of a correlation coefficient gives the direction of the association.

• Correlation is always between –1 and +1.

• Correlation measures the strength of the linear association between the two variables.

• Correlation treats x and y symmetrically.

• Correlation has no units.

• Correlation is not affected by changes in the center or scale of either variable.

• Correlation is sensitive to unusual observations.
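Several of these properties can be checked numerically. The sketch below uses the botnet data from the scatterplot example and recomputes r after swapping the variables, changing units, and shifting one variable. The shift of 50 and the factor of 1000 are arbitrary illustrative choices; note that only positive scale factors are used here, since multiplying one variable by a negative number would flip the sign of r.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


bots = [315, 185, 150, 125, 85, 50, 35, 20, 20, 10]    # thousands
spams = [60, 9, 30, 16, 3, 2, 10, 5, 0.06, 0.035]      # billions per day

r = pearson_r(bots, spams)
r_swapped = pearson_r(spams, bots)                      # x and y exchanged
r_rescaled = pearson_r([1000 * b for b in bots],        # bots in single units
                       [1000 * s for s in spams])       # spams in millions
r_shifted = pearson_r([b + 50 for b in bots], spams)    # center moved

print(r, r_swapped, r_rescaled, r_shifted)
```

All four values agree up to floating-point rounding, and each lies between –1 and +1.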

The correlation coefficient “r”

The correlation coefficient is a measure of the direction and strength of a linear relationship.

It is calculated using the mean and the standard deviation of both the x and y variables.

Correlation can only be used to describe quantitative variables. Categorical variables don’t have means and standard deviations.

Facts about correlation

r ignores the distinction between response and explanatory variables

r measures the strength and direction of a linear relationship between two quantitative variables

r is not affected by changes in the unit of measurement

A positive value of r means the association between the two variables is positive

A negative value of r means the association between the two variables is negative

r is always between -1 and +1

r is strongly affected by outliers
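To see how much a single extreme observation can move r, the sketch below recomputes the correlation for the botnet data with and without Srizbi, the point farthest to the upper right. Dropping a point here is only a sensitivity check for illustration, not a recommendation to delete data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# (botnet, bots in thousands, spams per day in billions)
botnets = [
    ("Srizbi", 315, 60), ("Bobax", 185, 9), ("Rustock", 150, 30),
    ("Cutwail", 125, 16), ("Storm", 85, 3), ("Grum", 50, 2),
    ("Ozdok", 35, 10), ("Nucrypt", 20, 5), ("Wopla", 20, 0.06),
    ("Spamthru", 10, 0.035),
]

r_all = pearson_r([b for _, b, _ in botnets], [s for _, _, s in botnets])

rest = [row for row in botnets if row[0] != "Srizbi"]
r_without = pearson_r([b for _, b, _ in rest], [s for _, _, s in rest])

print(round(r_all, 3), round(r_without, 3))
```

The correlation drops noticeably once that one point is left out, so a reported r should always be read alongside the scatterplot.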

“r” ranges from -1 to +1

Strength: how closely the points follow a straight line.

Direction: is positive when individuals with higher X values tend to have higher values of Y.

Review example

Estimate r

1. r = 1.00

2. r = -0.94

3. r = 1.12

4. r = 0.94

5. r = 0.21

Understanding Correlation

Correlation Tables

Sometimes the correlations between each pair of variables in a data set are arranged in a table like the one below.

Lurking Variables and Causation

There is no way to conclude from a high correlation alone that one variable causes the other.

There’s always the possibility that some third variable—a lurking variable—is simultaneously affecting both of the variables you have observed.

What Can Go Wrong?

• Don’t say “correlation” when you mean “association.”

• Don’t correlate categorical variables.

• Make sure the association is linear.

• Beware of outliers.

• Don’t confuse correlation with causation.

• Watch out for lurking variables.

What Have We Learned?

• Begin our investigation by looking at a scatterplot.

• The sign of the correlation tells us the direction of the association.

• The magnitude of the correlation tells us of the strength of a linear association.

• Correlation has no units, so shifting or scaling the data, standardizing, or even swapping the variables has no effect on the numerical value.

What Have We Learned?

To use correlation, we have to check certain conditions for the analysis to be valid:

• Check the Linearity Condition.

• Watch out for unusual observations.

We’ve learned not to make the mistake of assuming that a high correlation or strong association is evidence of a cause-and-effect relationship.