Page 1: Regression Analysis

1

Regression Analysis

The contents in this chapter are from Chapters 20-23 of the textbook.

The cntry15.sav data will be used. The data cover 15 countries. Two variables are used:

lifeexpf: female life expectancy
Birthrat: births per 1000 population

Both are scale variables.

Page 2: Regression Analysis

2

Linear regression model

Page 3: Regression Analysis

3

Obviously, the points are not randomly scattered over the grid. Instead, there appears to be a pattern.

As birthrate increases, life expectancy decreases.

How do we choose the “best” line? The least squares principle is recommended.

Linear regression model

Page 4: Regression Analysis

4

Least squares principle

Page 5: Regression Analysis

5

Least squares principle

Dependent variable: the variable you wish to predict

Independent variable: variables used to make the prediction

Simple linear regression: in which a single numerical independent variable X is used to predict the numerical dependent variable Y.

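$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where $\beta_0$ and $\beta_1$ are regression coefficients, $\varepsilon$ is random error with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\beta_0$, $\beta_1$, and $\sigma^2$ are unknown.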

Page 6: Regression Analysis

6

Least squares principle

To fit a data set $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$ to the above model we have

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where
$Y_i$ = dependent variable (sometimes referred to as the response variable)
$X_i$ = independent variable (sometimes referred to as the explanatory variable)
$\beta_0$ = intercept for the population
$\beta_1$ = slope for the population
$\varepsilon_i$ = random error in $Y$ for observation $i$; the $\varepsilon_i$'s are iid with mean 0 and variance $\sigma^2$

Page 7: Regression Analysis

7

Least squares principle

The least squares method can help us to estimate the regression coefficients $\beta_0$ and $\beta_1$ and the variance $\sigma^2$ of the random error. The idea of the least squares estimation for $\beta_0$ and $\beta_1$ is to minimize

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \bigl(Y_i - (b_0 + b_1 X_i)\bigr)^2$$

where

$$\hat{Y}_i = b_0 + b_1 X_i$$

$\hat{Y}_i$ = predicted value of $Y$ for observation $i$
$X_i$ = value of $X$ for observation $i$
$b_0$ = sample intercept
$b_1$ = sample slope

Page 8: Regression Analysis

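8

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}$$

$$SS_{x} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2$$

$$b_1 = \frac{SS_{xy}}{SS_{x}}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

$$\hat{\sigma}^2 = s^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \text{where } \hat{y}_i = b_0 + b_1 x_i,\ i = 1, \ldots, n.$$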

Least squares principle
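The closed-form estimates above can be checked numerically. The following is a minimal sketch in plain Python, assuming the 15 (birth rate, life expectancy) pairs are those listed in the Case Summaries table of this deck; the function name `least_squares` is made up for illustration.

```python
# Data transcribed from the Case Summaries table (cntry15.sav), in table order.
birth_rate = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life_exp = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]


def least_squares(x, y):
    """Return (b0, b1, s2) for the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Shortcut forms of SS_xy and SS_x
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    b1 = ss_xy / ss_x                 # sample slope
    b0 = y_bar - b1 * x_bar           # sample intercept
    # Residual sum of squares divided by n - 2 estimates the error variance
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    return b0, b1, s2


b0, b1, s2 = least_squares(birth_rate, life_exp)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, s^2 = {s2:.3f}")
```

The results should agree, up to rounding, with the intercept, slope, and residual mean square reported in the SPSS output on the later slides.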

Page 9: Regression Analysis

9

Linear regression model

Page 10: Regression Analysis

10

Coefficients (dependent variable: Female life expectancy)

Model 1                      B        Std. Error   Beta    t         Sig.
(Constant)                   89.985   1.765                50.995    .000
Births per 1000 population   -.697    .050         -.968   -13.988   .000

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Linear regression model

The regression model becomes life expectancy = 90 - (0.70 x birthrate)

This tells us that for each increase of 1 birth per 1000 population, the predicted female life expectancy decreases by about 0.70 years.

Page 11: Regression Analysis

11

Case Summaries

Country        Births per 1000   Female life   Unstandardized    Unstandardized
               population        expectancy    Predicted Value   Residual
Algeria        31                68            68.36833          -.36833
Burkina Faso   50                53            55.11929          -2.11929
Cuba           18                79            77.43346          1.56654
Equador        28                72            70.46028          1.53972
France         13                82            80.92004          1.07996
Mongolia       34                68            66.27637          1.72363
Namibia        45                63            58.60588          4.39412
Netherlands    13                81            80.92004          .07996
North Korea    24                72            73.24955          -1.24955
Somalia        46                55            57.90856          -2.90856
Tanzania       50                55            55.11929          -.11929
Thailand       20                71            76.03882          -5.03882
Turkey         28                72            70.46028          1.53972
Zaire          45                56            58.60588          -2.60588
Zambia         48                59            56.51393          2.48607

Prediction and residuals

Page 12: Regression Analysis

12

Coefficient of Correlation

It measures the strength of the linear relationship between two numerical variables.

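$$r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}$$

where

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}, \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$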

Page 13: Regression Analysis

13

Coefficient of Correlation

Coefficient of correlation

$-1 \le r \le 1$
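As an illustration, the coefficient of correlation can be computed directly from its definition (sample covariance divided by the product of the sample standard deviations). This sketch assumes the 15 data pairs from the Case Summaries table; `pearson_r` is an illustrative name, not an SPSS function.

```python
import math

# Data transcribed from the Case Summaries table (cntry15.sav), in table order.
birth_rate = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life_exp = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]


def pearson_r(x, y):
    """Coefficient of correlation r = cov(X, Y) / (S_X * S_Y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)


r = pearson_r(birth_rate, life_exp)
print(f"r = {r:.3f}")
```

The value is strongly negative, which matches the downward pattern in the scatterplot and the standardized coefficient (Beta) in the coefficients table.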

Page 14: Regression Analysis

14

Prediction and residuals

The coefficient of determination

$$R^2 = \frac{\text{regression sum of squares}}{\text{total sum of squares}} = \frac{SSR}{SST}$$

Page 15: Regression Analysis

15

Model Summary (dependent variable: Female life expectancy)

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .968   .938       .933                2.537

Predictors: (Constant), Births per 1000 population

ANOVA (dependent variable: Female life expectancy)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   1259.263         1    1259.263      195.653   .000
Residual     83.671           13   6.436
Total        1342.933         14

Predictors: (Constant), Births per 1000 population

ANOVA
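As a rough check of the ANOVA table, the sums of squares can be recomputed from the data. This sketch assumes the 15 data pairs from the Case Summaries table and uses the decomposition SST = SSR + SSE; the variable names are illustrative.

```python
# Data transcribed from the Case Summaries table (cntry15.sav), in table order.
birth_rate = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life_exp = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]

n = len(birth_rate)
x_bar = sum(birth_rate) / n
y_bar = sum(life_exp) / n

# Least squares fit
ss_x = sum((x - x_bar) ** 2 for x in birth_rate)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(birth_rate, life_exp))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in birth_rate]

# Sums of squares
sst = sum((y - y_bar) ** 2 for y in life_exp)                # total
ssr = sum((f - y_bar) ** 2 for f in fitted)                  # regression
sse = sum((y - f) ** 2 for y, f in zip(life_exp, fitted))    # residual

r_square = ssr / sst
f_stat = (ssr / 1) / (sse / (n - 2))   # df: 1 for regression, n - 2 for residual

print(f"SSR = {ssr:.3f}, SSE = {sse:.3f}, SST = {sst:.3f}")
print(f"R^2 = {r_square:.3f}, F = {f_stat:.3f}")
```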

Page 16: Regression Analysis

16

Testing hypotheses about the assumptions

Independence: all of the observations are independent

Homogeneity of variance: the variance of the distribution of the dependent variable must be the same for all values of the independent variable.

Normality: for each value of the independent variable, the distribution of the related dependent variable follows a normal distribution.

Page 17: Regression Analysis

17

Coefficients (dependent variable: Female life expectancy)

Model 1                      B        Std. Error   Beta    t         Sig.   95% Confidence Interval for B
                                                                            Lower Bound   Upper Bound
(Constant)                   89.985   1.765                50.995    .000   86.173        93.797
Births per 1000 population   -.697    .050         -.968   -13.988   .000   -.805         -.590

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Model Summary (dependent variable: Female life expectancy)

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .968   .938       .933                2.537

Predictors: (Constant), Births per 1000 population

Testing hypotheses

Page 18: Regression Analysis

18

Testing that the slope is zero

In this example, the sample slope is about -0.70 and its standard error is 0.05, so the value of the t statistic is -0.70/0.05 = -14, and the related p-value is less than 0.0005. We should reject the hypothesis that the population slope is zero. There appears to be a linear relationship between 1992 female life expectancy and birthrate.

The 95% confidence interval for the population slope is (-0.805, -0.590).
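The t statistic and the confidence interval for the slope can be reproduced as follows. This is a sketch under two assumptions: the data are the 15 pairs from the Case Summaries table, and the critical value 2.160 (the 97.5th percentile of the t distribution with n - 2 = 13 degrees of freedom) is taken from a standard t table rather than computed.

```python
import math

# Data transcribed from the Case Summaries table (cntry15.sav), in table order.
birth_rate = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life_exp = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]

T_CRIT = 2.160  # assumed table value: t(0.975, df = 13)

n = len(birth_rate)
x_bar = sum(birth_rate) / n
y_bar = sum(life_exp) / n
ss_x = sum((x - x_bar) ** 2 for x in birth_rate)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(birth_rate, life_exp))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

# Estimated error variance and standard error of the slope
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(birth_rate, life_exp))
s2 = sse / (n - 2)
se_b1 = math.sqrt(s2 / ss_x)

t_stat = b1 / se_b1                     # tests H0: population slope = 0
ci_lower = b1 - T_CRIT * se_b1
ci_upper = b1 + T_CRIT * se_b1

print(f"se(b1) = {se_b1:.3f}, t = {t_stat:.3f}")
print(f"95% CI for the slope: ({ci_lower:.3f}, {ci_upper:.3f})")
```

The interval excludes 0, which is consistent with rejecting the hypothesis that the population slope is zero.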

Testing hypotheses

Page 19: Regression Analysis

19

Prediction

The regression equation obtained can be used to predict life expectancy from birthrate.

For a country with a birthrate of 30 per 1000 population:

Predicted life expectancy = 89.99 - 0.697 x 30 = 69.08 years

Page 20: Regression Analysis

20

Predicting means and individual observations

The plot on the next page gives the standard error of the predicted mean life expectancy for different values of birthrate.

The vertical line at 32.9 is the average birthrate for all cases.

The farther birthrates are from the sample mean, the larger the standard error of the predicted means.
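The behaviour described here follows from the usual textbook formulas for simple linear regression: the standard error of the predicted mean at x0 is s·sqrt(1/n + (x0 - x̄)²/SS_x), and for an individual observation an extra 1 is added under the square root. The sketch below assumes the 15 data pairs from the Case Summaries table; the function names are illustrative. It uses unrounded coefficients, so the prediction at a birthrate of 30 differs slightly from the 69.08 obtained with rounded coefficients.

```python
import math

# Data transcribed from the Case Summaries table (cntry15.sav), in table order.
birth_rate = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life_exp = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]

n = len(birth_rate)
x_bar = sum(birth_rate) / n
y_bar = sum(life_exp) / n
ss_x = sum((x - x_bar) ** 2 for x in birth_rate)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(birth_rate, life_exp))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(birth_rate, life_exp))
s = math.sqrt(sse / (n - 2))  # standard error of the estimate


def predict(x0):
    """Predicted life expectancy at birthrate x0."""
    return b0 + b1 * x0


def se_mean(x0):
    """Standard error of the predicted mean at x0."""
    return s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_x)


def se_individual(x0):
    """Standard error for predicting a single new observation at x0."""
    return s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ss_x)


for x0 in (15, 30, x_bar, 50):
    print(f"x0 = {x0:6.2f}  predicted = {predict(x0):6.2f}  "
          f"se(mean) = {se_mean(x0):.3f}  se(individual) = {se_individual(x0):.3f}")
```

The standard error of the predicted mean is smallest at the sample mean birthrate (about 32.9) and grows as x0 moves away from it in either direction.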

Page 21: Regression Analysis

21

Plot of standard error of predicted mean

Page 22: Regression Analysis

22

The 95% fitting confidence region

Page 23: Regression Analysis

23

Statistical diagnostics

Is the model correct?

Are there any outliers?

Is the variance constant?

Is the error normally distributed?

Page 24: Regression Analysis

24

Statistical diagnostics

Residuals can provide much useful information on the above four issues in statistical diagnostics.

You can’t judge the relative size of a residual by looking at its value alone, because it depends on the unit of the dependent variable. Raw residuals are therefore not convenient to use.

Standardized residuals: divide the residual by the estimated standard deviation of the residuals.

Page 25: Regression Analysis

25

Statistical diagnostics

If the distribution of residuals is approximately normal, about 95% of the standardized residuals should be between -2 and 2; 99% should be between -2.58 and 2.58. It is easy to see whether there are some outliers.

Page 26: Regression Analysis

26

Statistical diagnostics

When you compute standardized residuals, all of the observed residuals are divided by the same number.

However, the variability of the residuals is not constant for all points; it depends on the value of the independent variable.

The studentized residual takes into account the differences in variability from point to point.

We calculate it by dividing the residual by an estimate of the standard deviation of the residual at that point.

Page 27: Regression Analysis

27

Statistical diagnostics

A residual divided by an estimate of the standard deviation of the residual at that point is called its studentized residual.

The studentized residuals make it easier to see violations of the regression assumptions.
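The two kinds of residuals can be sketched as follows. This assumes the 15 data pairs from the Case Summaries table and the standard simple-regression leverage h_i = 1/n + (x_i - x̄)²/SS_x, so that the estimated standard deviation of residual i is s·sqrt(1 - h_i). Whether SPSS computes its studentized residuals in exactly this way is an assumption here.

```python
import math

# Data transcribed from the Case Summaries table (cntry15.sav), in table order.
countries = ["Algeria", "Burkina Faso", "Cuba", "Equador", "France",
             "Mongolia", "Namibia", "Netherlands", "North Korea", "Somalia",
             "Tanzania", "Thailand", "Turkey", "Zaire", "Zambia"]
birth_rate = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life_exp = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]

n = len(birth_rate)
x_bar = sum(birth_rate) / n
y_bar = sum(life_exp) / n
ss_x = sum((x - x_bar) ** 2 for x in birth_rate)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(birth_rate, life_exp))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(birth_rate, life_exp)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

standardized = []
studentized = []
for x, e in zip(birth_rate, residuals):
    h = 1 / n + (x - x_bar) ** 2 / ss_x      # leverage of this point
    standardized.append(e / s)               # same divisor for every case
    studentized.append(e / (s * math.sqrt(1 - h)))  # divisor varies by case

# Cases whose studentized residual is larger than 2 in absolute value
flagged = [c for c, r in zip(countries, studentized) if abs(r) > 2]

for c, z, t in zip(countries, standardized, studentized):
    print(f"{c:<13} standardized = {z:8.5f}  studentized = {t:8.5f}")
print("Flagged:", flagged)
```

Because 1 - h_i is less than 1, each studentized residual is larger in absolute value than the corresponding standardized residual, and the difference is greatest for countries whose birthrate is far from the mean.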

Page 28: Regression Analysis

28

Statistical diagnostics

Case Summaries

Country        Births per 1000   Female life   Unstandardized    Unstandardized   Standardized   Studentized
               population        expectancy    Predicted Value   Residual         Residual       Residual
Algeria        31                68            68.36833          -.36833          -.14518        -.15039
Burkina Faso   50                53            55.11929          -2.11929         -.83536        -.92252
Cuba           18                79            77.43346          1.56654          .61749         .67055
Equador        28                72            70.46028          1.53972          .60691         .63132
France         13                82            80.92004          1.07996          .42569         .48171
Mongolia       34                68            66.27637          1.72363          .67940         .70344
Namibia        45                63            58.60588          4.39412          1.73204        1.85005
Netherlands    13                81            80.92004          .07996           .03152         .03566
North Korea    24                72            73.24955          -1.24955         -.49254        -.51832
Somalia        46                55            57.90856          -2.90856         -1.14647       -1.23146
Tanzania       50                55            55.11929          -.11929          -.04702        -.05193
Thailand       20                71            76.03882          -5.03882         -1.98616       -2.13011
Turkey         28                72            70.46028          1.53972          .60691         .63132
Zaire          45                56            58.60588          -2.60588         -1.02716       -1.09715
Zambia         48                59            56.51393          2.48607          .97994         1.06610

Page 29: Regression Analysis

29

Standardized Residual Stem-and-Leaf Plot

Frequency   Stem & Leaf
3.00        -1 . 019
4.00        -0 . 0148
7.00         0 . 0466669
1.00         1 . 7

Stem width: 1.00000
Each leaf: 1 case(s)

Standardized Residuals

Page 30: Regression Analysis

30

Checking for normality

Page 31: Regression Analysis

31

If the data are a sample from a normal distribution, you expect the points to fall more or less on a straight line.

You can see the two largest residuals in absolute value (Thailand and Namibia) are stragglers from the line.

The next page shows a detrended normal plot. If the data are from a normal distribution, the points in the detrended normal plot should fall randomly in a band around 0.
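The coordinates behind a normal probability plot and its detrended version can be sketched as below. The standardized residuals are copied from the Case Summaries table. The plotting positions (i - 3/8)/(n + 1/4), often called Blom's formula, are an assumption made for illustration; the plots in this deck may have been produced with a different convention.

```python
from statistics import NormalDist

# Standardized residuals transcribed from the Case Summaries table.
std_resid = [-0.14518, -0.83536, 0.61749, 0.60691, 0.42569,
             0.67940, 1.73204, 0.03152, -0.49254, -1.14647,
             -0.04702, -1.98616, 0.60691, -1.02716, 0.97994]

observed = sorted(std_resid)
n = len(observed)

# Expected standard normal quantile for the i-th smallest value
# (assumed plotting position: Blom's formula).
expected = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
            for i in range(1, n + 1)]

# Normal plot: observed against expected; points near a straight line
# suggest normality. Detrended plot: deviation against observed.
deviation = [o - e for o, e in zip(observed, expected)]

for o, e, d in zip(observed, expected, deviation):
    print(f"observed = {o:8.5f}  expected = {e:8.5f}  deviation = {d:8.5f}")
```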

Checking for normality

Page 32: Regression Analysis

32

Checking for normality

Page 33: Regression Analysis

33

Tests of Normality (Standardized Residual)

Test                     Statistic   df   Sig.
Kolmogorov-Smirnov (a)   .137        15   .200*
Shapiro-Wilk             .971        15   .866

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

Testing for normality

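Many statistical tests for normality have been proposed; one of them is the Kolmogorov-Smirnov test. Here neither the Kolmogorov-Smirnov test (Sig. = .200) nor the Shapiro-Wilk test (Sig. = .866) gives evidence against normality of the standardized residuals.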

Page 34: Regression Analysis

34

Checking for constant variance

Residual plot: plot of studentized residuals against the estimated values.

From the residual plot you can see whether there is any pattern.

When the assumptions hold, the residuals appear to be randomly scattered around a horizontal line through 0.

Page 35: Regression Analysis

35

Checking for constant variance

Page 36: Regression Analysis

36

Checking linearity

When the relationship between two variables is not linear, you can sometimes transform the variables to make the relationship linear, for example by taking a logarithm, sine, or exponential.

Scale plot of female life expectancy against natural log of phones per 100.
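To illustrate why a log transformation can help, the sketch below uses made-up numbers, not the country.sav data: a hypothetical `phones` variable and a hypothetical `life` variable that rises roughly with the logarithm of `phones`. It compares the linear correlation before and after taking the natural log.

```python
import math

# Hypothetical data for illustration only (not from country.sav).
phones = [0.1, 0.5, 1, 2, 5, 10, 20, 50, 80]
wobble = [0.5, -0.4, 0.3, -0.2, 0.4, -0.5, 0.2, -0.3, 0.1]
life = [60 + 5 * math.log(p) + w for p, w in zip(phones, wobble)]


def pearson_r(x, y):
    """Coefficient of correlation between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_x * ss_y)


r_raw = pearson_r(phones, life)                          # curved relationship
r_log = pearson_r([math.log(p) for p in phones], life)   # after transformation

print(f"r with raw phones:     {r_raw:.3f}")
print(f"r with ln(phones):     {r_log:.3f}")
```

With data of this shape, the correlation with the log-transformed variable is much closer to 1, which is the sense in which the transformation makes the relationship linear.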

Page 37: Regression Analysis

37

Multiple Regression Models

Considering the country.sav data, you are interested in predicting female life expectancy from:

Urban: percentage of the population living in urban areas
Docs: number of doctors per 10,000 people
Beds: number of hospital beds per 10,000 people
Gdp: per capita gross domestic product in dollars
Radios: radios per 100 people

Page 38: Regression Analysis

38

Multiple Regression Models

A linear regression model is

$$\text{Predicted life expectancy} = \text{Constant} + B_1\,\text{urban} + B_2\,\text{doc} + B_3\,\text{beds} + B_4\,\text{gdp} + B_5\,\text{radios}$$

A scatterplot matrix is useful.

Page 39: Regression Analysis

39

Scatterplot matrix

Page 40: Regression Analysis

40

Scatterplot matrix

The relationship between female life expectancy and the percentage of the population living in urban areas appears to be more or less linear.

The other four independent variables appear to be related to female life expectancy, but the relationship is not linear.

We therefore take the natural log of the values of these four independent variables.

Page 41: Regression Analysis

41

Page 42: Regression Analysis

42

Correlations

Pearson Correlation                         (1)     (2)     (3)     (4)     (5)     (6)
(1) Female life expectancy                  1.000   .730    .880    .836    .693    .697
(2) Natural log hospital beds/10,000        .730    1.000   .711    .741    .616    .576
(3) Natural log of doctors per 10000        .880    .711    1.000   .824    .633    .763
(4) Natural log of GDP                      .836    .741    .824    1.000   .716    .748
(5) Natural log of radios per 100 people    .693    .616    .633    .716    1.000   .579
(6) Percent urban                           .697    .576    .763    .748    .579    1.000

Sig. (1-tailed): .000 for every pair of variables.
N: 116 for every pair of variables.

Correlation matrix

Page 43: Regression Analysis

43

Coefficients (dependent variable: Female life expectancy)

Model 1                                B        Std. Error   Beta    t        Sig.
(Constant)                             40.767   3.174                12.845   .000
Natural log hospital beds/10,000       1.147    .749         .095    1.532    .128
Natural log of doctors per 10000       4.069    .563         .569    7.228    .000
Natural log of GDP                     1.709    .616         .236    2.776    .006
Natural log of radios per 100 people   1.542    .686         .130    2.247    .027
Percent urban                          -.020    .029         -.045   -.686    .494

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Regression coefficients

The estimated regression model, using the coefficients in the table above, is

Y = 40.77 - 0.020 urban + 4.07 lndocs + 1.15 lnbeds + 1.71 lngdp + 1.54 lnradio
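To show how the fitted multiple regression is used, the sketch below plugs values into the equation using the unstandardized coefficients (B) from the coefficients table. The input values describe a hypothetical country invented for illustration, and the function name `predict_life_expectancy` is likewise made up.

```python
import math

# Unstandardized coefficients (B) from the SPSS coefficients table.
COEF = {
    "constant": 40.767,
    "lnbeds": 1.147,    # natural log of hospital beds per 10,000
    "lndocs": 4.069,    # natural log of doctors per 10,000
    "lngdp": 1.709,     # natural log of per capita GDP
    "lnradio": 1.542,   # natural log of radios per 100 people
    "urban": -0.020,    # percent urban (not log-transformed)
}


def predict_life_expectancy(urban, docs, beds, gdp, radios):
    """Predicted female life expectancy from the fitted model."""
    return (COEF["constant"]
            + COEF["lnbeds"] * math.log(beds)
            + COEF["lndocs"] * math.log(docs)
            + COEF["lngdp"] * math.log(gdp)
            + COEF["lnradio"] * math.log(radios)
            + COEF["urban"] * urban)


# Hypothetical country: 50% urban, 10 doctors and 30 beds per 10,000 people,
# GDP of 2000 dollars per capita, 20 radios per 100 people.
base = predict_life_expectancy(urban=50, docs=10, beds=30, gdp=2000, radios=20)
more_docs = predict_life_expectancy(urban=50, docs=20, beds=30, gdp=2000, radios=20)

print(f"Predicted life expectancy: {base:.2f} years")
print(f"Effect of doubling doctors: {more_docs - base:.2f} years")
```

Because four of the predictors enter as natural logs, their coefficients describe the effect of a proportional change: doubling doctors per 10,000 changes the prediction by 4.069 x ln 2, whatever the starting level.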

Page 44: Regression Analysis

44

Variables Entered/Removed (dependent variable: Female life expectancy)

Model   Variables Entered                                        Variables Removed   Method
1       Percent urban, Natural log hospital beds/10,000,         .                   Enter
        Natural log of radios per 100 people,
        Natural log of doctors per 10000, Natural log of GDP

All requested variables entered.

Model Summary (dependent variable: Female life expectancy)

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .910   .827       .819                4.742

Predictors: (Constant), Percent urban, Natural log hospital beds/10,000, Natural log of radios per 100 people, Natural log of doctors per 10000, Natural log of GDP

SPSS output: model summary statistics

Page 45: Regression Analysis

45

ANOVA (dependent variable: Female life expectancy)

Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   11844.633        5     2368.927      105.336   .000
Residual     2473.807         110   22.489
Total        14318.440        115

Predictors: (Constant), Percent urban, Natural log hospital beds/10,000, Natural log of radios per 100 people, Natural log of doctors per 10000, Natural log of GDP

SPSS output: ANOVA

This regression is statistically significant, as the observed significance level is less than 0.0005.

The residual variance (the residual mean square) is 22.489.
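The summary statistics for the multiple regression follow from the ANOVA sums of squares by simple arithmetic. The sketch below takes the sums of squares and degrees of freedom from the SPSS ANOVA table and recomputes the mean squares, F, R Square, Adjusted R Square, and the standard error of the estimate, using the standard formulas; the variable names are illustrative.

```python
import math

# Values from the SPSS ANOVA table for the multiple regression.
ss_regression = 11844.633
ss_residual = 2473.807
df_regression = 5          # number of predictors
df_residual = 110          # n - number of predictors - 1, with n = 116

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual       # residual variance
f_stat = ms_regression / ms_residual

r_square = ss_regression / ss_total
# Adjusted R Square penalizes for the number of predictors
adj_r_square = 1 - (ss_residual / df_residual) / (ss_total / df_total)
std_error = math.sqrt(ms_residual)            # standard error of the estimate

print(f"MS regression = {ms_regression:.3f}, MS residual = {ms_residual:.3f}")
print(f"F = {f_stat:.3f}")
print(f"R^2 = {r_square:.3f}, adjusted R^2 = {adj_r_square:.3f}")
print(f"Std. error of the estimate = {std_error:.3f}")
```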