FPP 7-9 Exploratory Data Analysis: Two Variables 1


Exploratory data analysis: two variables

There are three combinations of variables we must consider. We do so in the following order:

1 qualitative/categorical, 1 quantitative variable
    Side-by-side box plots, counts, etc.

2 quantitative variables
    Scatter plots, correlations, regressions

2 qualitative/categorical variables
    Contingency tables (we will cover these later in the semester)


Box plots

A box plot is a graph of five numbers (often called the five-number summary): minimum, 1st quartile, median, 3rd quartile, and maximum.

We know how to compute three of the numbers (min, max, median).

To compute the 1st quartile, find the median of the 50% of observations that are smaller than the median.

To compute the 3rd quartile, find the median of the 50% of observations that are bigger than the median.
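The quartile rule above can be sketched in Python. This is a minimal illustration, not the method any particular package (such as JMP) uses: the function names and the data are made up here, and it assumes that when the number of observations is odd, the median itself is left out of both halves.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the halves-of-the-data rule.

    Assumption: with an odd number of observations, the middle
    observation is excluded from both the lower and the upper half.
    """
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]             # observations below the median position
    upper = s[half + (n % 2):]   # observations above the median position
    return s[0], median(lower), median(s), median(upper), s[-1]


# Hypothetical data, for illustration only.
data = [7, 1, 3, 9, 5, 11, 13]
summary = five_number_summary(data)
print(summary)
```

Statistical software often uses slightly different quartile conventions, so values from a package may not match this rule exactly on small data sets.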


JMP box plots


Side-by-side box plots

Side-by-side box plots are graphical summaries of data when one variable is categorical and the other quantitative.

These plots can be used to compare the distributions of the quantitative variable across the levels of the categorical variable.
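As a rough numerical companion to a side-by-side plot, the sketch below groups a quantitative variable by the levels of a categorical variable and summarizes each group. The observations, level names, and the helper `summarize_by_group` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (category, value) observations, for illustration only.
observations = [
    ("A", 4.0), ("A", 6.0), ("A", 8.0),
    ("B", 1.0), ("B", 2.0), ("B", 9.0),
]


def summarize_by_group(pairs):
    """Map each category level to (min, median, max) of its values."""
    groups = defaultdict(list)
    for level, value in pairs:
        groups[level].append(value)
    return {level: (min(v), median(v), max(v))
            for level, v in sorted(groups.items())}


summaries = summarize_by_group(observations)
for level, (low, mid, high) in summaries.items():
    print(f"{level}: min={low}, median={mid}, max={high}")
```

Comparing the per-group summaries line by line is the tabular analogue of comparing boxes drawn next to each other.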


Pets and stress

Are there any differences in stress levels when doing tasks with your pet, a good friend, or alone?

Allen et al. (1988) asked 45 people to count backwards by 13s and 17s.

People were randomly assigned to one of the three groups: pet, friend, alone.

The response is the subject’s average heart rate during the task.


Pets and stress

It looks like the task is most stressful around friends and least stressful around pets.

Note that we are comparing a quantitative variable (heart rate) across the levels of a categorical variable (group).

Figure: heart rate (vertical axis, 60 to 100) by group (C, F, P).


Vietnam draft lottery

In 1970, the US government drafted young men for military service in the Vietnam War. These men were drafted by means of a random lottery. Basically, paper slips containing all dates in January were placed in a wooden box and then mixed. Next, all dates in February (including 2/29) were added to the box and mixed. This procedure was repeated until all 366 dates were mixed in the box. Finally, dates were successively drawn without replacement. The first date drawn (Sept. 14) was assigned rank 1, the second date drawn (April 24) was assigned rank 2, and so on. Those eligible for the draft who were born on Sept. 14 were called first to service, then those born on April 24 were called, and so on.

Soon after the lottery, people began to complain that the randomization system was not completely fair. They believed that birth dates later in the year had lower lottery numbers than those earlier in the year (Fienberg, 1971).

What do the data say? Was the draft lottery fair? Let’s do a statistical analysis of the data to find out.
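One way to get a feel for what a fair lottery would look like is to simulate one and summarize draft rank by month. The sketch below does not use the actual 1970 draft data; it shuffles the 366 dates with Python's random number generator (the seed and function names are arbitrary choices made here) and reports the median rank in each month.

```python
import random
from statistics import median

# Days per month in a year that includes Feb 29 (366 dates in total).
DAYS_IN_MONTH = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def simulate_fair_lottery(seed):
    """Assign ranks 1..366 to (month, day) dates uniformly at random."""
    dates = [(m, d)
             for m, days in enumerate(DAYS_IN_MONTH, start=1)
             for d in range(1, days + 1)]
    rng = random.Random(seed)
    rng.shuffle(dates)  # position in the shuffled list is the draw order
    return {date: rank for rank, date in enumerate(dates, start=1)}


def median_rank_by_month(ranks):
    """Median draft rank for each month 1..12."""
    by_month = {m: [] for m in range(1, 13)}
    for (month, _day), rank in ranks.items():
        by_month[month].append(rank)
    return {m: median(r) for m, r in by_month.items()}


ranks = simulate_fair_lottery(seed=1970)  # simulated, not the real lottery
monthly = median_rank_by_month(ranks)
for month, med in monthly.items():
    print(month, med)
```

In a well-mixed lottery the monthly medians should scatter around the middle of the rank range with no steady drift across the year; the same summary applied to the real ranks can be compared against that pattern.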


Draft rank by month in the Vietnam draft lottery: Raw data

Figure: draft rank (vertical axis, 0 to 350) plotted against month of year (horizontal axis, 1 to 12).


Draft rank by month in the Vietnam draft lottery: Box plots

Figure: box plots of draft rank (vertical axis, 0 to 350) for each month of year (horizontal axis, 1 to 12).


Exploratory data analysis: two quantitative variables

Scatter plots

A scatter plot shows one variable vs. the other in a 2-dimensional graph.

Always plot the explanatory variable, if there is one, on the horizontal axis.

We usually call the explanatory variable x and the response variable y. Alternatively, x is called the independent variable and y the dependent variable.

If there is no explanatory-response distinction, either variable can go on the horizontal axis.


Example

Gross Sales    Items
890.5          115
197            17
231            26
170            21
202.5          30
225.5          35
489.7          84
234.8          42
161.5          21
284            44
422            65
300.7          59
412.4          69
346.8          59
92.3           19
255.8          42
118.5          16
286.5          39
594            72
263.29         43
244.08         45
394.28         64
241.31         36
299.97         40
649.04         103


Describing scatter plots

Form
    Linear, quadratic, exponential

Direction
    Positive association: an increase in one variable is accompanied by an increase in the other
    Negative association: a decrease in one variable is accompanied by an increase in the other

Strength
    How closely the points follow a clear form


Describing scatter plots

Form: Linear

Direction: Positive

Strength: Strong


Correlation coefficient

We need something more than an arbitrary ocular guess to assess the strength of an association between two variables.

We need a value that can summarize the strength of a relationship
    That doesn’t change when units change
    That makes no distinction between the response and explanatory variables


Correlation Coefficient

Definition: The correlation coefficient is a quantity used to measure the direction and strength of a linear relationship between two quantitative variables.

We will denote this value as r


Computing correlation coefficient

Let x, y be any two quantitative variables for n individuals.

\[ r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \mu_x}{\sigma_x} \right) \left( \frac{y_i - \mu_y}{\sigma_y} \right) \]

where \mu_x and \mu_y are the means and \sigma_x and \sigma_y are the standard deviations of the variables x and y respectively.


Correlation coefficient

Remember that \( (x_i - \mu_x)/\sigma_x \) and \( (y_i - \mu_y)/\sigma_y \) are the standardized values of variables x and y respectively.

The correlation r is an average of the products of the standardized values of the two variables x and y for the n observations.
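The "average of products of standardized values" description can be sketched directly in Python. The data are the gross sales and items pairs as read from the earlier example slide. The helper names are made up here, and the standard deviation uses the divisor n, which is an assumption chosen to match the 1/n in the formula for r.

```python
from math import sqrt

# (gross sales, items) pairs as read from the example slide's table.
data = [
    (890.5, 115), (197, 17), (231, 26), (170, 21), (202.5, 30),
    (225.5, 35), (489.7, 84), (234.8, 42), (161.5, 21), (284, 44),
    (422, 65), (300.7, 59), (412.4, 69), (346.8, 59), (92.3, 19),
    (255.8, 42), (118.5, 16), (286.5, 39), (594, 72), (263.29, 43),
    (244.08, 45), (394.28, 64), (241.31, 36), (299.97, 40), (649.04, 103),
]


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Standard deviation with divisor n (assumed, to match 1/n in r)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def standardize(values):
    """Convert each value to standard units: (value - mean) / sd."""
    m, s = mean(values), sd(values)
    return [(v - m) / s for v in values]


def correlation(x, y):
    """Average of the products of the standardized values of x and y."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


sales = [s for s, _ in data]
items = [i for _, i in data]
r = correlation(items, sales)
print(round(r, 3))
```

Swapping the arguments, `correlation(sales, items)`, gives the same value, since the products of standardized values do not depend on which variable is called x.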


Properties of r

Makes no distinction between explanatory and response variables

Both variables must be quantitative
    No ordering with qualitative variables

Is invariant to change of units

Is between -1 and 1

Is affected by outliers

Measures strength of association for only linear relationships!
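Several of these properties can be checked numerically. The sketch below uses hypothetical data and a `correlation` helper written in the sum-of-products form, which is algebraically equivalent to averaging products of standardized values. The rescaling `100 * v + 32` stands in for an arbitrary change of units.

```python
from math import sqrt


def correlation(x, y):
    """Correlation coefficient r of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = correlation(x, y)

# No distinction between explanatory and response: swap the roles.
r_swapped = correlation(y, x)

# Invariant to change of units: rescale and shift y.
r_rescaled = correlation(x, [100 * v + 32 for v in y])

# Affected by outliers: add one point far from the pattern.
r_outlier = correlation(x + [6.0], y + [-20.0])

print(r, r_swapped, r_rescaled, r_outlier)
```

With these made-up numbers a single added point is enough to turn a strongly positive r into a negative one, which is why a scatter plot should accompany any reported correlation.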


True or False

Let X be GNP for the U.S. in dollars and Y be GNP for Mexico, in pesos. Changing Y to U.S. dollars changes the value of the correlation.


Figure: four scatter plots, each captioned "Correlation Coefficient is ____".


Think about it

In each case, say which correlation is higher.

Height at age 4 and height at age 18, or height at age 16 and height at age 18

Height at age 4 and height at age 18, or weight at age 4 and weight at age 18

Height and weight at age 4, or height and weight at age 18


Correlation coefficient

Correlation is not an appropriate measure of association for non-linear relationships.

What would r be for this scatter plot?
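The scatter plot from the slide is not reproduced here. As an illustration under assumed data, the sketch below uses a perfectly quadratic relationship, y = x², on x values symmetric about zero. The relationship is exact, yet r comes out as zero because r only measures linear association.

```python
from math import sqrt


def correlation(x, y):
    """Correlation coefficient r of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Assumed data: an exact parabola, symmetric about x = 0.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v ** 2 for v in x]

r = correlation(x, y)
print(r)
```

The positive and negative products of deviations cancel, so the sum in the numerator is zero even though y is completely determined by x.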


Correlation coefficient


Correlation coefficient

CORRELATION IS NOT CAUSATION

A substantial correlation between two variables might indicate the influence of other variables on both.

Or, lack of substantial correlation might mask the effect of the other variables.


Correlation coefficient

CORRELATION IS NOT CAUSATION

Plot of life expectancy of population and number of people per TV for 22 countries (1991 data)

Figure: Bivariate Fit of Life exp. By People per TV (life expectancy, 40 to 80, on the vertical axis; people per TV, 0 to 250, on the horizontal axis).


Correlation coefficient

CORRELATION IS NOT CAUSATION

A study showed that there was a strong correlation between the number of firefighters at a fire and the property damage that the fire causes. We should send fewer firefighters to fight fires, right?

This is an example of a lurking variable. What might it be?


Interpreting correlations

A newspaper article contains a quote from a psychologist, who says, “The evidence indicates the correlation between the research productivity and teaching rating of faculty members is close to zero.” The paper reports this as “The professor said that good researchers tend to be poor teachers, and vice versa.”

Did the newspaper get it right?


Correlation coefficient

What’s wrong with each of these statements?

There exists a high correlation between the gender of American workers and their income.

The correlation between amount of sunlight and plant growth was r = 0.35 centimeters.

There is a correlation of r = 1.78 between speed of reading and years of practice.


Examining many correlations simultaneously

The correlation matrix displays correlations for all pairs of variables.
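A correlation matrix can be sketched as a table with one row and one column per variable, where each cell holds r for that pair. The variable names and values below are hypothetical, and `correlation_matrix` is a helper written for this illustration rather than a function from any particular package.

```python
from math import sqrt


def correlation(x, y):
    """Correlation coefficient r of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(variables):
    """Nested dict: matrix[a][b] is r between variables a and b."""
    names = list(variables)
    return {a: {b: correlation(variables[a], variables[b]) for b in names}
            for a in names}


# Hypothetical measurements on five individuals, for illustration only.
variables = {
    "height": [150.0, 160.0, 165.0, 172.0, 180.0],
    "weight": [52.0, 58.0, 63.0, 70.0, 79.0],
    "shoe": [36.0, 38.0, 39.0, 42.0, 44.0],
}

matrix = correlation_matrix(variables)
for name, row in matrix.items():
    print(name, [round(v, 3) for v in row.values()])
```

The diagonal is always 1 (each variable with itself) and the matrix is symmetric, so only the entries above or below the diagonal carry new information.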


Ecological correlation

Correlations based on rates or averages.

How will using rates or averages affect r?
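A contrived example can make the question concrete. In the sketch below (all numbers are made up and deliberately extreme), individuals within each group show a negative pattern, the groups themselves line up along a rising line, and the two cancel at the individual level. The correlation of the group averages is then much stronger than the correlation across individuals, which is the usual direction of the effect: averaging removes the within-group spread.

```python
from math import sqrt


def correlation(x, y):
    """Correlation coefficient r of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical groups: (x values, y values) for the individuals in each.
groups = {
    "A": ([0.0, 1.0, 2.0], [2.0, 1.0, 0.0]),
    "B": ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),
    "C": ([2.0, 3.0, 4.0], [4.0, 3.0, 2.0]),
}

# Correlation across all individuals, ignoring group membership.
all_x = [v for xs, _ in groups.values() for v in xs]
all_y = [v for _, ys in groups.values() for v in ys]
r_individuals = correlation(all_x, all_y)

# Ecological correlation: one (average x, average y) point per group.
avg_x = [sum(xs) / len(xs) for xs, _ in groups.values()]
avg_y = [sum(ys) / len(ys) for _, ys in groups.values()]
r_averages = correlation(avg_x, avg_y)

print(r_individuals, r_averages)
```

Real data are rarely this clean, but the comparison shows why a correlation computed from rates or averages should not be read as describing individuals.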
