Multiple Regression II, 4/11/12 • Categorical explanatory variables • Adjusted R² • Not in book • Professor Kari Lock Morgan, Duke University


Page 1: Multiple Regression II 4/11/12

Multiple Regression II
4/11/12

• Categorical explanatory variables

• Adjusted R²

Not in book

Professor Kari Lock Morgan
Duke University

Page 2: Multiple Regression II 4/11/12

Project 1 Regrade Requests

• Regrade requests must be submitted in writing, with the original project, by Friday, 4/13/12, at 5pm

• If regraded, I will grade the entire project, and the grade may go up or down

Page 3: Multiple Regression II 4/11/12

To Do

• Project 2 Proposal (due TODAY, 5pm)

• Homework 9 (due Monday, 4/16)

• Project 2 Presentation (Thursday, 4/19)

• Project 2 Paper (Wednesday, 4/25)

Page 5: Multiple Regression II 4/11/12

Lab

• Your group members are in the same lab (mostly) so this is a great time to meet with your group in person

• There is no “On Your Own” and I’ve tried to make the lab short, so you should have some time with your group to get started with your project

• If you want to work through the lab on your own before coming to class, you can have the whole lab time with your group to work on the project

• Come to lab tomorrow!

Page 6: Multiple Regression II 4/11/12

US States

• We will build a model to predict the % of the state that voted Republican in the 2008 US presidential election, using the 50 states as cases

• Sample? Population?

• This can help us to understand how certain features of a state are associated with political beliefs

Page 7: Multiple Regression II 4/11/12

US States

• Response variable: % of the state that voted Republican (for McCain) in the 2008 US presidential election

• Our first explanatory variable is region of the country: Midwest, Northeast, South, or West

Page 8: Multiple Regression II 4/11/12

Categorical Variables

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

• For this to make any sense, each x value has to be a number.

• How do we include categorical variables in a regression setting?

Page 9: Multiple Regression II 4/11/12

Categorical Variables

• Take one categorical variable, and replace it with several “dummy” variables

• A dummy variable is 1 if the case falls into the category represented by the dummy variable, and 0 otherwise

• Create one dummy variable for each category of the categorical variable

Page 10: Multiple Regression II 4/11/12

Dummy Variables

State        Region     South  West  Northeast  Midwest
Alabama      South        1     0        0         0
Alaska       West         0     1        0         0
Arkansas     South        1     0        0         0
California   West         0     1        0         0
Colorado     West         0     1        0         0
Connecticut  Northeast    0     0        1         0
Delaware     Northeast    0     0        1         0
Florida      South        1     0        0         0
Georgia      South        1     0        0         0
Hawaii       West         0     1        0         0
…            …            …     …        …         …

(The South, West, Northeast, and Midwest columns are the dummy variables.)
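The dummy columns in the table above can be built mechanically from the Region column. The following is a minimal Python sketch for illustration only (the course uses R, which does this automatically); `make_dummies` is a hypothetical helper, not a library function, and the input is just the first six states from the table.

```python
def make_dummies(values, categories):
    """Build one 0/1 dummy column per category.

    Hypothetical helper for illustration, not a library call.
    Returns a dict mapping each category to its list of 0/1 indicators.
    """
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Regions of the first six states in the table:
# Alabama, Alaska, Arkansas, California, Colorado, Connecticut
regions = ["South", "West", "South", "West", "West", "Northeast"]
categories = ["South", "West", "Northeast", "Midwest"]

dummies = make_dummies(regions, categories)
for cat in categories:
    print(f"{cat:<10}", dummies[cat])
```

Every state falls in exactly one region, so each row has a single 1 across the four dummy columns.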

Page 11: Multiple Regression II 4/11/12

Dummy Variables

• When using dummy variables, one has to be left out of the model

• The dummy variable left out is called the reference level

• When using region of the country (Northeast, South, Midwest, West) to predict % McCain vote, how many dummy variables will be included?

a) One b) Two c) Three d) Four

Page 12: Multiple Regression II 4/11/12

Dummy Variables

• Predicting % vote for McCain with one categorical variable: region of the country

• If “Midwest” is the reference level:

$$\%\ \text{Rep. vote} = \beta_0 + \beta_1\,\text{Northeast} + \beta_2\,\text{South} + \beta_3\,\text{West}$$
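With a single categorical explanatory variable, the least-squares fit takes a simple form: the intercept is the mean response in the reference level, and each dummy coefficient is that category's mean minus the reference mean. Below is a minimal Python sketch of that idea. The vote percentages are made up for illustration (they are NOT the 2008 data), and `fit_one_categorical` and `predict` are hypothetical helpers.

```python
from statistics import mean


def fit_one_categorical(categories, response, reference):
    """Least-squares fit for a model with one categorical predictor.

    Hypothetical helper: returns (intercept, coefficients), where the
    intercept is the reference-level mean and each coefficient is the
    difference between a category's mean and the reference-level mean.
    """
    groups = {}
    for cat, y in zip(categories, response):
        groups.setdefault(cat, []).append(y)
    intercept = mean(groups[reference])
    coefs = {cat: mean(ys) - intercept
             for cat, ys in groups.items() if cat != reference}
    return intercept, coefs


# Made-up data for illustration only (two states per region).
regions = ["Midwest", "Midwest", "Northeast", "Northeast",
           "South", "South", "West", "West"]
pct_rep = [50.0, 52.0, 40.0, 42.0, 58.0, 60.0, 47.0, 49.0]

intercept, coefs = fit_one_categorical(regions, pct_rep, reference="Midwest")


def predict(region):
    """Prediction = intercept + coefficient of the region's dummy (0 for the reference)."""
    return intercept + coefs.get(region, 0.0)


print(intercept, coefs)
print(predict("Midwest"), predict("Northeast"))
```

A state in the reference level is predicted by the intercept alone; a state in any other region is predicted by the intercept plus that region's coefficient.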

Page 13: Multiple Regression II 4/11/12

Voting by Region

Based on the output above, which region had the highest percent vote for McCain?

a) Midwest b) Northeast c) South d) West

Page 14: Multiple Regression II 4/11/12

Voting by Region

What is the predicted % Republican vote for a state in the Northeast?

a) –10.2% b) 48.6% c) 38.4% d) 58.8%

Page 15: Multiple Regression II 4/11/12

Voting by Region

What is the predicted % Republican vote for a state in the Midwest?

a) 50% b) 48.6% c) 0% d) 58.8%

Page 16: Multiple Regression II 4/11/12

Categorical Variables

• The p-value for each dummy variable tests for a significant difference between that category and the reference level

• For an overall p-value for the significance of the categorical variable with multiple categories, use

a) z-test b) t-test c) Chi-square test d) ANOVA

Page 17: Multiple Regression II 4/11/12

Categorical Variables

Page 18: Multiple Regression II 4/11/12

Categorical Variables in R

• R automatically creates dummy variables for you if you include a categorical explanatory variable

• The first level alphabetically is usually the reference level

• If you want to change the reference level, see me

Page 19: Multiple Regression II 4/11/12

Categorical Variables

• Either all dummy variables associated with a categorical variable have to be included in the model, or none of them

• RegionW is not significant, but leaving it out would clump West with the reference level, Midwest, which does not make sense

Page 20: Multiple Regression II 4/11/12

Variables

• Let’s include some more explanatory variables!

• What helps to predict % voting Republican?

Page 21: Multiple Regression II 4/11/12

Categorical Variables

• Be careful not to include a categorical variable for which every case is its own category

• Example: using “State” as an explanatory variable would be silly, even though R² = 1!

• If you want to know how each state voted, it would make more sense to just look directly at McCainVote, rather than fitting a model and giving each state its own coefficient

Page 22: Multiple Regression II 4/11/12

Explanatory Variables

• Also, be careful not to include explanatory variables that are essentially just another form of the response variable

• For example, ObamaMcCain is “M” if the state went for McCain, and “O” if the state went for Obama

• This is certainly associated with the % of people in the state that voted for McCain, but tells us nothing interesting

Page 23: Multiple Regression II 4/11/12

Explanatory Variables

• Models should be created either to learn about relationships between explanatory variables and the response, or for prediction

• Make sure the explanatory variables you include in the model do not defeat the purpose of the model

Page 24: Multiple Regression II 4/11/12

Visualization

How would we visualize the association between region and % vote for McCain?

a) Scatterplot b) Side-by-side boxplots c) Bar chart d) Pie chart e) Mosaic plot

Page 25: Multiple Regression II 4/11/12

Side-by-Side Boxplots

Page 26: Multiple Regression II 4/11/12

Test for Association

How would we test for an association between region and % vote for McCain?

a) t-test for difference in means b) test for a correlation c) ANOVA d) chi-square test e) test for a difference in proportions

Page 27: Multiple Regression II 4/11/12

ANOVA
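As a reminder of what an ANOVA computes, here is a minimal Python sketch of the one-way F-statistic, which compares variability between group means to variability within groups. The three groups below are made-up numbers chosen so the arithmetic is easy to check by hand, not the regional voting data, and `anova_f` is a hypothetical helper. Turning F into a p-value requires the F-distribution, which this sketch does not cover.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups (each a list of numbers).

    F = [SS_between / (k - 1)] / [SS_within / (n - k)],
    where k is the number of groups and n the total number of cases.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([y for g in groups for y in g])
    # Variability of the group means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of each case around its own group mean.
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up data: group means 2, 3, 7; overall mean 4.
groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
f_stat = anova_f(groups)
print(f_stat)
```

Here SS_between = 42 on 2 degrees of freedom and SS_within = 6 on 6 degrees of freedom, so F = 21.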

Page 28: Multiple Regression II 4/11/12

Visualization

All of the other potential explanatory variables are quantitative. How would we visualize the association between each of them and % vote for McCain?

a) Scatterplot b) Side-by-side boxplots c) Bar chart d) Pie chart e) Mosaic plot

Page 29: Multiple Regression II 4/11/12

What do you see?

Page 30: Multiple Regression II 4/11/12

Test for Association

How would we test the association between each of these variables and % vote for McCain?

a) t-test for difference in means b) test for a correlation c) ANOVA d) chi-square test e) test for a difference in proportions

Page 31: Multiple Regression II 4/11/12

Test for Correlation
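For reference, a test for correlation is based on the sample correlation r and the statistic t = r·sqrt((n − 2)/(1 − r²)), which is compared to a t-distribution with n − 2 degrees of freedom. A minimal Python sketch follows, using small made-up x and y values (not the state data); `correlation_t` is a hypothetical helper.

```python
from math import sqrt


def correlation_t(x, y):
    """Return (r, t): the sample correlation and its t-statistic (n - 2 df).

    Assumes |r| < 1; a perfect correlation would divide by zero.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)
    t = r * sqrt((n - 2) / (1 - r ** 2))
    return r, t


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r, t = correlation_t(x, y)
print(round(r, 3), round(t, 3))
```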

Page 32: Multiple Regression II 4/11/12

Regression Model

Page 33: Multiple Regression II 4/11/12

Physical Activity

Given all the other variables in the model, states with a higher percentage of physically active citizens are more likely to vote

(a) Republican

(b) Democratic

Page 34: Multiple Regression II 4/11/12

West Region

• With only region as an explanatory variable, interpret the meaning of the negative coefficient of RegionW.

In this data set, states in the West voted less Republican than states in the Midwest.

• With all the other explanatory variables included, interpret the meaning of the positive coefficient of RegionW.

States in the West voted more Republican than would be expected based on the other variables in the model, as compared to states in the Midwest.

Page 35: Multiple Regression II 4/11/12

Goal of the Model?

• If the goal of the model is to see which variables are associated with a state’s voting patterns, and how, given all the other variables in the model, then we are done

• If the goal is to predict the % of the vote that will be Republican, say in the 2012 election, we want to prune out insignificant variables to improve the model

Page 36: Multiple Regression II 4/11/12

Over-fitting

• It is possible to over-fit a model: to include too many explanatory variables

• The fewer the coefficients being estimated, the better they will be estimated

• Usually, a good model has pruned out explanatory variables that are not helping

Page 37: Multiple Regression II 4/11/12

R²

• Adding more explanatory variables will only make R² increase or stay the same

• Adding another explanatory variable cannot make the model explain less, because the other variables are all still in the model

• Is the best model always the one with the highest proportion of variability explained, and so the highest R²?

(a) Yes (b) No

Page 38: Multiple Regression II 4/11/12

Adjusted R²

• Adjusted R² is like R², but takes into account the number of explanatory variables

• Adjusted R² is never larger than R², and the penalty grows as the number of explanatory variables increases

• One way to choose a model is to choose the model with the highest adjusted R²
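The slides do not give the formula, but the usual definition is adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of cases and k the number of explanatory variables. The Python sketch below applies it to two hypothetical models for the 50 states; the R² values are invented for illustration and do not come from the lecture's output.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1).

    n: number of cases, k: number of explanatory variables
    (counting each dummy variable separately).
    """
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


n = 50  # the 50 states

# Hypothetical models: the larger one has a slightly higher R^2.
small = adjusted_r_squared(0.60, n, k=3)
large = adjusted_r_squared(0.61, n, k=10)

print(round(small, 3), round(large, 3))
```

In this made-up comparison the larger model has the higher R² but the lower adjusted R², so choosing by adjusted R² favors the smaller model.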

Page 39: Multiple Regression II 4/11/12

Adjusted R²

You now know how to interpret all of these numbers!