STAT E100 Section Week 3 - Regression

STAT E100 Section Week 3 - Regression. Review Descriptive Statistics versus Hypothesis Testing Outliers Sample vs. Population Residual Plots

Review

Descriptive Statistics versus Hypothesis Testing

Outliers

Sample vs. Population

Residual Plots

Key Equations and Graphs: 

 

Residual = observed value (y) - predicted value (ŷ)

 

Key Concepts:

Correlation ≠ Causation

R² is the coefficient of determination: the proportion of the variability in y that is explained by the linear regression on x.

Leverage refers to data points whose x-values stand away from the mean of x. These points are said to exert leverage on a linear model. High-leverage points pull the line close to them, and so they can have a large effect on the line.
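A minimal sketch of the leverage idea, using made-up numbers rather than the course data. It assumes the standard leverage formula for simple linear regression, 1/n + (x - mean of x)² / Sxx, which is not stated on the slides; the names `leverages` and `fit_slope` are hypothetical helpers.

```python
def leverages(x):
    """Leverage of each point in simple linear regression:
    1/n + (x_i - mean)^2 / Sxx (standard formula, assumed here)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


def fit_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical data: the last point has an x-value far from the mean of x.
x = [1, 2, 3, 4, 10]
y = [2, 3, 4, 5, 5]

h = leverages(x)
most_leveraged = h.index(max(h))  # position of the highest-leverage point

# Refit without that point to see how much it pulls the line.
x_rest = x[:most_leveraged] + x[most_leveraged + 1:]
y_rest = y[:most_leveraged] + y[most_leveraged + 1:]

slope_all = fit_slope(x, y)
slope_without = fit_slope(x_rest, y_rest)

print("leverages:", [round(v, 2) for v in h])
print("slope with all points:", round(slope_all, 2))
print("slope without the high-leverage point:", round(slope_without, 2))
```

In this toy data set the single point at x = 10 flattens the fitted line considerably, which is the sense in which a high-leverage point "pulls" the line toward itself.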

Homework Tips:

[Figure from http://stattrek.com/statistics/charts/boxplot.aspx]

SAMPLE QUESTION #1

1) Data comes from: http://www.cbssports.com/mlb/salaries/teams.

Data was collected on the opening day payroll (in millions of $) and the number of wins that year for the 30 major league baseball (MLB) teams in 2010. Below are the histograms of the 2 variables:

 

a) Based on these histograms, describe the distribution of each of these variables. The scatterplot of y = wins vs. x = payroll is on the next slide.

It seems that the distributions are more or less symmetrical (although there is a slight skewness in the variable wins). Also, note the isolated point on the payroll histogram.

b) What do you think the correlation coefficient is between these 2 variables?

Here is some more SPSS output:

Correlation coefficient = 0.536

c) What is the equation for the best fit line for this data?

ŷ_wins = 0.167 x_payroll + 67.982

For your reference (probably should show how to run the regression and correlation in SPSS):

d) What is the estimated number of wins for a team with a payroll of $120.1 million (the Red Sox)? How would this be predicted to change if it increased by $10 million?

ŷ_wins = 0.167(120.1) + 67.982 = 88.0387 ≈ 88 wins

If payroll increased by $10 million, the predicted number of wins would increase by 0.167 × 10 = 1.67, because the slope is positive.
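The arithmetic in part (d) can be checked with a short sketch. The slope and intercept are the values from the fitted line in part (c); the function name `predict_wins` is a hypothetical helper introduced only for illustration.

```python
# Coefficients of the fitted line from part (c).
SLOPE = 0.167       # predicted wins per $1 million of opening day payroll
INTERCEPT = 67.982  # predicted wins at a payroll of $0


def predict_wins(payroll_millions):
    """Predicted wins for a given payroll (in millions of $)."""
    return SLOPE * payroll_millions + INTERCEPT


red_sox_prediction = predict_wins(120.1)

# Raising payroll by $10 million changes the prediction by slope * 10.
change_for_10_million = predict_wins(130.1) - predict_wins(120.1)

print(round(red_sox_prediction, 4))     # predicted wins for the Red Sox
print(round(change_for_10_million, 2))  # predicted change in wins
```

The predicted change depends only on the slope, not on the starting payroll, because the model is a straight line.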

e) The Red Sox actually won 86 games in 2010. Calculate their residual.

Residual = 86 − 88.0387 = −2.0387 ≈ −2 wins
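As a quick check of part (e), the sketch below applies the definition residual = observed value − predicted value to the Red Sox numbers. The helper name `residual` is hypothetical.

```python
def residual(observed, predicted):
    """Residual = observed value (y) - predicted value (y-hat)."""
    return observed - predicted


# Prediction from the fitted line at a payroll of $120.1 million.
predicted_wins = 0.167 * 120.1 + 67.982

# The Red Sox actually won 86 games in 2010.
red_sox_residual = residual(86, predicted_wins)

print(round(red_sox_residual, 4))
```

A negative residual means the team won fewer games than the line predicted for its payroll.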

f) What is R² for this model? What is its interpretation?

R² = (0.536)² = 0.287. The regression line explains 28.7% of the variability in wins.
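Part (f) relies on the fact that, for a regression with one predictor, R² equals the square of the correlation coefficient. The sketch below first squares the reported correlation, then checks the identity on a small made-up data set (the 30-team data are not reproduced here). The helper names `fit_line`, `r_squared`, and `correlation` are hypothetical.

```python
from math import sqrt

# Squaring the correlation reported in part (b).
r_squared_mlb = 0.536 ** 2


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(x, y):
    """Proportion of the variability in y explained by the line: 1 - SSE/SST."""
    slope, intercept = fit_line(x, y)
    y_bar = sum(y) / len(y)
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


def correlation(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Hypothetical data, used only to illustrate that R^2 = r^2.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r2_from_fit = r_squared(x, y)
r2_from_r = correlation(x, y) ** 2

print(round(r_squared_mlb, 3))
print(round(r2_from_fit, 3), round(r2_from_r, 3))
```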

g) Based on the scatterplot, do you have any concerns for using this regression model?

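There may be a point with a high residual. In addition, the isolated high-payroll team noted in part (a) lies far from the mean payroll, so it may have high leverage on the fitted line.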

h) How would the regression line change if the Yankees were removed (payroll = $194.66 million, wins = 97)?

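This point has high leverage, because its payroll is far from the mean payroll. Its residual is modest: 97 − (0.167(194.66) + 67.982) = 97 − 100.49 ≈ −3.5 wins. Because the point lies below the fitted line at the far right of the plot, removing it would be expected to make the slope somewhat steeper.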