Not in FPP Exploratory data analysis with two qualitative variables 1


Not in FPP

Exploratory data analysis with two qualitative variables


Exploratory data analysis with two qualitative/categorical variables

Main tools

Contingency tables
Conditional, marginal, and joint frequencies


Motivating example

Surviving the Titanic

Was there class discrimination in survival of the wreck of the Titanic?

“It has been suggested before the Enquiry that the third-class passengers had been unfairly treated, that their access to the boat deck had been impeded; and that when they reached the deck the first and second-class passengers were given precedence in getting places in the boats.” Lord Mersey, 1912


Titanic: Class by survival

         1st Class   2nd Class   3rd Class   Crew   Total
Dead        122         167         528       696    1513
Alive       203         118         178       212     711
Total       325         285         706       908    2224


Titanic: Marginal frequencies

% Dead = 1513/2224 = 0.68
% Alive = 711/2224 = 0.32

% in first class = 325/2224 = 0.15
% in second class = 285/2224 = 0.13
% in third class = 706/2224 = 0.32
% crew = 908/2224 = 0.41
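As a rough check of the arithmetic, the marginal frequencies can be recomputed from the class-by-survival counts. This is a minimal Python sketch; the names `counts` and `marginals` are illustrative and do not come from the slides.

```python
# Counts from the Titanic class-by-survival table (rows: outcome, columns: class).
counts = {
    "Dead":  {"1st": 122, "2nd": 167, "3rd": 528, "Crew": 696},
    "Alive": {"1st": 203, "2nd": 118, "3rd": 178, "Crew": 212},
}

def marginals(table):
    """Return (row proportions, column proportions, grand total) of a two-way table."""
    total = sum(sum(row.values()) for row in table.values())
    row_freq = {r: sum(cells.values()) / total for r, cells in table.items()}
    columns = list(next(iter(table.values())).keys())
    col_freq = {c: sum(table[r][c] for r in table) / total for c in columns}
    return row_freq, col_freq, total

row_freq, col_freq, total = marginals(counts)

# Print each marginal proportion to three decimals.
for name, p in {**row_freq, **col_freq}.items():
    print(f"{name}: {p:.3f}")
```

Each marginal frequency uses only a row total or a column total divided by the grand total, so it says nothing yet about how survival differs between classes.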


Titanic: Conditional frequencies

% (Alive | 1st) = 203/325 = 0.625
% (Alive | 2nd) = 118/285 = 0.414
% (Alive | 3rd) = 178/706 = 0.252
% (Alive | Crew) = 212/908 = 0.233

Based on these frequencies does there appear to be class discrimination?
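The conditional frequencies divide each class's survivor count by that class's own total rather than by the grand total. The sketch below reproduces them in Python; `survival_given_class` is a hypothetical helper name used only for illustration.

```python
# Counts from the Titanic class-by-survival table, keyed by class.
alive = {"1st": 203, "2nd": 118, "3rd": 178, "Crew": 212}
dead = {"1st": 122, "2nd": 167, "3rd": 528, "Crew": 696}

def survival_given_class(alive, dead):
    """Proportion alive within each class: alive / (alive + dead)."""
    return {cls: alive[cls] / (alive[cls] + dead[cls]) for cls in alive}

rates = survival_given_class(alive, dead)

# Print the conditional survival proportion for each class.
for cls, rate in rates.items():
    print(f"(Alive | {cls}) = {rate:.3f}")
```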


Titanic: Class by person type

            1st Class   2nd Class   3rd Class   Crew   Total
Children        6          24          79         0     109
Women         144          93         165        23     425
Men           175         168         462       885    1690
Total         325         285         706       908    2224


Titanic: Percentage of men in each class

% (Man | 1st) = 175/325 = 0.54
% (Man | 2nd) = 168/285 = 0.59
% (Man | 3rd) = 462/706 = 0.65
% (Man | Crew) = 885/908 = 0.97

The percentages of men are larger in third class and among the crew.


Surviving the Titanic

A reason for class differences in survival:

Larger percentages of men died.
3rd class consisted mostly of men.
Hence, a larger percentage of 3rd class passengers died.

Once again keep in mind possible lurking variables that could be driving the relationship seen between two measured variables


Relative risk and odds ratios

Motivating example

Physicians’ health study (1989): randomized experiment with 22071 male physicians at least 40 years old

Half the subjects assigned to take aspirin every other day

Other half assigned to take a placebo, a dummy pill that looked and tasted like aspirin


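Physicians’ health study

Here is the number of people in each cell:

           Heart attack   No heart attack   Total
Aspirin         104            10933        11037
Placebo         189            10845        11034
Total           293            21778        22071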


Relative risk

        y1     y2     Total
x1      a      b      a+b
x2      c      d      c+d
Total   a+c    b+d

Risk of y1 for level x1 = a/(a+b)

Risk of y1 for level x2 = c/(c+d)

\[ \text{Relative risk} = \frac{a/(a+b)}{c/(c+d)} \]


Relative risk for physicians’ health study

The relative risk of a heart attack when taking aspirin versus when taking a placebo equals

\[ RR = \frac{104/(104+10933)}{189/(189+10845)} = 0.55 \]

People who took aspirin were 0.55 times as likely to have a heart attack as people who took the placebo.

Equivalently, people who took the placebo were 1/0.55 = 1.82 times as likely to have a heart attack as people who took aspirin.
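The relative risk calculation can be sketched in a few lines of Python. The function name `relative_risk` and its argument order (a, b for level x1; c, d for level x2) are assumptions chosen to mirror the a, b, c, d layout of the generic table.

```python
def relative_risk(a, b, c, d):
    """Risk of y1 at level x1 divided by risk of y1 at level x2.

    a, b: counts of y1 and y2 at level x1
    c, d: counts of y1 and y2 at level x2
    """
    return (a / (a + b)) / (c / (c + d))

# Aspirin group: 104 heart attacks, 10933 without.
# Placebo group: 189 heart attacks, 10845 without.
rr = relative_risk(104, 10933, 189, 10845)

print(round(rr, 2))      # aspirin relative to placebo
print(round(1 / rr, 2))  # placebo relative to aspirin
```

Swapping the two groups inverts the ratio, which is why the same data can be reported either as 0.55 or as 1.82 depending on which group is taken as the reference.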


Odds ratios

        y1     y2
x1      a      b
x2      c      d

Odds of y1 for level x1 = a/b

Odds of y1 for level x2 = c/d

\[ \text{Odds ratio} = \frac{a/b}{c/d} \]
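Clearing the inner fractions gives the cross-product form of the odds ratio. This is a standard identity that the slides do not state explicitly; it assumes b, c, and d are all nonzero:

\[ \text{Odds ratio} = \frac{a/b}{c/d} = \frac{a\,d}{b\,c} \]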


Odds ratios for physicians’ health study

The relative risk of a heart attack when taking aspirin versus taking a placebo is

\[ RR = \frac{104/(104+10933)}{189/(189+10845)} = 0.55 \]

The odds of having a heart attack when taking aspirin over the odds of a heart attack when taking a placebo (the odds ratio) is

\[ OR = \frac{104/10933}{189/10845} = 0.546 \]
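The odds ratio for the study can be checked the same way. This is a minimal sketch; `odds_ratio` is an illustrative name, and the arguments follow the a, b, c, d layout of the generic table.

```python
import math

def odds_ratio(a, b, c, d):
    """Odds of y1 at level x1 (a/b) divided by odds of y1 at level x2 (c/d)."""
    return (a / b) / (c / d)

# Aspirin group: 104 heart attacks, 10933 without.
# Placebo group: 189 heart attacks, 10845 without.
or_aspirin = odds_ratio(104, 10933, 189, 10845)
print(round(or_aspirin, 3))

# The cross-product form a*d / (b*c) gives the same value up to rounding error.
cross_product = (104 * 10845) / (10933 * 189)
print(math.isclose(or_aspirin, cross_product))
```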


Interpreting odds ratios and relative risks

When the variables X and Y are independent:
odds ratio = 1
relative risk = 1

When subjects with level x1 are more likely to have y1 than subjects with level x2, then:
odds ratio > 1
relative risk > 1

When subjects with level x1 are less likely to have y1 than subjects with level x2, then:
odds ratio < 1
relative risk < 1


Which one should be used?

If the relative risk is available, then it should be used.

In a cohort study, the relative risk can be calculated directly.

In a case-control study, the relative risk cannot be calculated directly, so an odds ratio is used instead.

Case-control studies compare subjects who have a “condition” (cases) to similar subjects who don’t (controls).

In this type of study we know %(exposure|disease). But to compute the RR we need %(disease|exposure).

Recall that RR = %(disease|exposure)/%(disease|no exposure)

The relative risk is also not available in more complex modeling (logistic regression), where odds ratios are used instead.


Odds ratio vs relative risk

When is the odds ratio a good approximation of the relative risk?

When cases are representative of the diseased population
When controls are representative of the population without disease
When the disease being studied occurs at low frequency

Of itself, an odds ratio is a useful measure of association.
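The low-frequency condition can be seen algebraically. Writing \(p_1 = a/(a+b)\) and \(p_2 = c/(c+d)\) for the two risks (notation introduced here, not in the slides), the odds at each level are \(p_1/(1-p_1)\) and \(p_2/(1-p_2)\), so

\[ OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{p_1}{p_2} \cdot \frac{1-p_2}{1-p_1} = RR \cdot \frac{1-p_2}{1-p_1} \]

When both risks are small, the factor \((1-p_2)/(1-p_1)\) is close to 1 and the odds ratio is close to the relative risk. In the physicians’ health study both risks are below 2%, so the factor is about 0.99, which is why the odds ratio (0.546) and the relative risk (0.55) nearly coincide.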


Relative risk vs absolute risk

% smokers who get lung cancer: 8% (conservative guess here)

Relative risk of lung cancer for smokers: 800%

Getting lung cancer is not commonplace, even for smokers. But, smokers’ chances of getting lung cancer are much, much higher than non-smokers’ chances.
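To make the contrast concrete, take the guessed figures at face value and read a relative risk of 800% as "8 times as likely" (an interpretation assumed here, not stated on the slide). The implied absolute risk for non-smokers is then

\[ \text{risk for non-smokers} = \frac{\text{risk for smokers}}{\text{relative risk}} = \frac{0.08}{8} = 0.01 \]

So under these illustrative numbers the absolute risks are 8% versus 1%: a large ratio, but a difference of only 7 percentage points.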


Simpson’s paradox

When a third variable seemingly reverses the association between two other variables

Hot hand example
