Econometrics using R
Rajat Tayal
Rmetrics Panel Data Workshop
February 16-February 18, 2011
Indira Gandhi Institute of Development Research, Mumbai
February 17, 2012
Rajat Tayal Econometrics using R
Outline of the presentation
Linear regression
  Simple linear regression
  Multiple linear regression
Regression diagnostics
  Leverage and standardized residuals
  Deletion diagnostics
  The function influence.measures()
  Testing for heteroskedasticity
  Testing for functional form
  Testing for autocorrelation
  Robust standard errors and tests
Linear regression with panel data
Part I
Linear regression
Introduction
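The linear regression model, typically estimated by ordinary least squares (OLS), is the workhorse of applied econometrics. The model is
$$y_i = x_i^\top \beta + \varepsilon_i, \quad i = 1, 2, \dots, n, \tag{1}$$
or, in matrix form,
$$y = X\beta + \varepsilon. \tag{2}$$
For cross-sections:
$$E(\varepsilon \mid X) = 0, \tag{3}$$
$$\mathrm{Var}(\varepsilon \mid X) = \sigma^2 I. \tag{4}$$
For time series:
$$E(\varepsilon_j \mid x_i) = 0, \quad i \le j. \tag{5}$$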
Introduction
The OLS estimator of $\beta$ is
$$\hat\beta = (X^\top X)^{-1} X^\top y. \tag{6}$$
The corresponding fitted values are $\hat y = X\hat\beta$, the residuals are $\hat\varepsilon = y - \hat y$, and the residual sum of squares is $\hat\varepsilon^\top \hat\varepsilon$.
In R, models are typically fitted by calling a model-fitting function, in this case lm(), with a formula object describing the model and a data.frame object containing the variables used in the formula.
fm <- lm(formula, data, ...)
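As a quick check of equation (6), the following sketch computes the OLS estimator from the normal equations and compares it with lm(). The simulated data, the seed, and all variable names are assumptions made for illustration; they are not part of the workshop material.

```r
# Simulated data (an assumption for illustration only)
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Design matrix with an intercept column
X <- cbind("(Intercept)" = 1, x = x)

# OLS estimator: solve (X'X) b = X'y instead of inverting X'X explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Fitted values, residuals, and residual sum of squares
y_hat <- X %*% beta_hat
e_hat <- y - y_hat
rss   <- sum(e_hat^2)

# The same quantities obtained from lm()
fm <- lm(y ~ x)
all.equal(as.vector(beta_hat), unname(coef(fm)))
all.equal(rss, deviance(fm))
```

Solving the linear system is numerically preferable to forming the inverse; lm() itself relies on a QR decomposition.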
The first example
Data from Stock & Watson (2007) on subscriptions to economics journals at US libraries for the year 2000:
> install.packages("AER", dependencies = TRUE)
> library(AER)
> data("Journals"); names(Journals)
> journals <- Journals[, c("subs", "price")]
> journals$citeprice <- Journals$price/Journals$citations
> summary(journals)
subs price citeprice
Min. : 2.0 Min. : 20.0 Min. : 0.005223
1st Qu.: 52.0 1st Qu.: 134.5 1st Qu.: 0.464495
Median : 122.5 Median : 282.0 Median : 1.320513
Mean : 196.9 Mean : 417.7 Mean : 2.548455
3rd Qu.: 268.2 3rd Qu.: 540.8 3rd Qu.: 3.440171
Max. :1098.0 Max. :2120.0 Max. :24.459459
The first example
In view of the wide range of the variables, combined with a considerable amount of skewness, it is useful to take logarithms.
The goal is to estimate the effect of the price per citation on the number of library subscriptions.
To explore this issue quantitatively, we will fit a linear regression model,
$$\log(\text{subs})_i = \beta_1 + \beta_2 \log(\text{citeprice})_i + \varepsilon_i. \tag{7}$$
The first example
Here, the formula of interest is log(subs) ~ log(citeprice). This can be used both for plotting and for model fitting:
> plot(log(subs) ~ log(citeprice), data = journals)
> jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)
> abline(jour_lm)
abline() extracts the coefficients of the fitted model and adds the corresponding regression line to the plot.
The first example
The function lm() returns a fitted-model object, here stored as jour_lm. It is an object of class lm.
> class(jour_lm)
[1] "lm"
> names(jour_lm)
[1] "coefficients" "residuals" "effects" "rank" "fitted
[7] "qr" "df.residual" "xlevels" "call" "terms" "model"
The first example
> summary(jour_lm)
Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)
Residuals:
Min 1Q Median 3Q Max
-2.72478 -0.53609 0.03721 0.46619 1.84808
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.76621 0.05591 85.25
Generic functions for fitted (linear) model objects
Function                        Description
print()                         simple printed display
summary()                       standard regression output
coef() (or coefficients())      extracting the regression coefficients
residuals() (or resid())        extracting residuals
fitted() (or fitted.values())   extracting fitted values
anova()                         comparison of nested models
predict()                       predictions for new data
plot()                          diagnostic plots
confint()                       confidence intervals for the regression coefficients
deviance()                      residual sum of squares
vcov()                          (estimated) variance-covariance matrix
logLik()                        log-likelihood (assuming normally distributed errors)
AIC()                           information criteria including AIC, BIC/SBC (assuming normally distributed errors)
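The extractor functions in the table can be tried on any fitted lm object. The sketch below uses the built-in cars data as a stand-in (an assumption made so that the example runs without extra packages); the same calls apply to jour_lm.

```r
# 'cars' ships with R (datasets package); it stands in for the journals data here
fm <- lm(dist ~ speed, data = cars)

coef(fm)              # regression coefficients
head(residuals(fm))   # first few residuals
head(fitted(fm))      # first few fitted values
confint(fm)           # 95% confidence intervals for the coefficients
deviance(fm)          # residual sum of squares
vcov(fm)              # estimated variance-covariance matrix
logLik(fm)            # log-likelihood under normal errors
AIC(fm)               # Akaike information criterion
BIC(fm)               # Bayesian (Schwarz) information criterion

# deviance() is the sum of squared residuals
all.equal(deviance(fm), sum(residuals(fm)^2))
```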
The first example
It is instructive to take a brief look at what the summary() method returns for a fitted lm object:
> jour_slm <- summary(jour_lm)
> class(jour_slm)
[1] "summary.lm"
> names(jour_slm)
[1] "call" "terms" "residuals" "coefficients" "aliased" "sigma"
[7] "df" "r.squared" "adj.r.squared" "fstatistic" "cov.unscale
> jour_slm$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.7662121 0.05590908 85.24934 2.953913e-146
log(citeprice) -0.5330535 0.03561320 -14.96786 2.563943e-33
Analysis of variance
> anova(jour_lm)
Analysis of Variance Table
Response: log(subs)
Df Sum Sq Mean Sq F value Pr(>F)
log(citeprice) 1 125.93 125.934 224.04 < 2.2e-16 ***
Residuals 178 100.06 0.562
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table breaks the sum of squares about the mean (for the dependent variable, here log(subs)) into two parts: a part that is accounted for by a linear function of log(citeprice) and a part attributed to residual variation.
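The decomposition can be verified with a few lines of arithmetic. This sketch only reuses the numbers printed in the ANOVA table; the variable names are illustrative assumptions.

```r
# Values copied from the ANOVA table above
ss_reg <- 125.934   # sum of squares explained by log(citeprice), 1 df
ss_res <- 100.06    # residual sum of squares, 178 df
df_reg <- 1
df_res <- 178

# F statistic = explained mean square / residual mean square
f_stat <- (ss_reg / df_reg) / (ss_res / df_res)

# R-squared = explained share of the total sum of squares
r2 <- ss_reg / (ss_reg + ss_res)

# p-value from the F distribution
p_val <- pf(f_stat, df_reg, df_res, lower.tail = FALSE)

round(c(F = f_stat, R2 = r2), 3)
```

The F statistic agrees with the table up to rounding of the printed sums of squares.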
Point and Interval estimates
To extract the estimated regression coefficients $\hat\beta$, the function coef() can be used:
> coef(jour_lm)
   (Intercept) log(citeprice)
     4.7662121     -0.5330535
> confint(jour_lm, level = 0.95)
2.5 % 97.5 %
(Intercept) 4.6558822 4.8765420
log(citeprice) -0.6033319 -0.4627751
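The intervals returned by confint() for an lm fit are the usual t-based intervals, estimate plus or minus a t quantile times the standard error. A minimal sketch, assuming only the estimates and standard errors printed in the regression output (variable names are illustrative):

```r
# Values copied from the regression output of jour_lm
est <- c("(Intercept)" = 4.7662121, "log(citeprice)" = -0.5330535)
se  <- c(0.05590908, 0.03561320)
df  <- 178   # residual degrees of freedom

# 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 4)
```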
Prediction
Two types of predictions:
1 the prediction of points on the regression line and
2 the prediction of a new data value.
The standard errors of predictions for new data take into account both the uncertainty in the regression line and the variation of the individual points about the line.
Thus, the prediction interval for a new data value is wider than that for points on the line. The function predict() provides both types of standard errors.
Prediction
> predict(jour_lm, newdata = data.frame(citeprice = 2.11),
interval = "confidence")
fit lwr upr
4.368188 4.247485 4.48889
> predict(jour_lm, newdata = data.frame(citeprice = 2.11),
interval = "prediction")
fit lwr upr
4.368188 2.883746 5.852629
The point estimates (fit) are identical, but the intervals differ.
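The same comparison can be reproduced without the AER package. The sketch below uses the built-in cars data as a stand-in (an assumption; the variable names and the value speed = 15 are illustrative) and shows that both calls give the same fit while the prediction interval is wider.

```r
# Built-in 'cars' data as a stand-in (assumption; not the journals data)
fm  <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)

ci_line <- predict(fm, newdata = new, interval = "confidence")
pi_new  <- predict(fm, newdata = new, interval = "prediction")

# Same point estimate in both cases
all.equal(ci_line[, "fit"], pi_new[, "fit"])

# Interval widths: the prediction interval adds the error variance of a new observation
width_ci <- unname(ci_line[, "upr"] - ci_line[, "lwr"])
width_pi <- unname(pi_new[, "upr"] - pi_new[, "lwr"])
c(confidence = width_ci, prediction = width_pi)
```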
Prediction
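The prediction intervals can also be used for computing and visualizing confidence bands.
> lciteprice <- seq(from = -6, to = 4, by = 0.25)
> jour_pred <- predict(jour_lm, interval = "prediction",
+ newdata = data.frame(citeprice = exp(lciteprice)))
> plot(log(subs) ~ log(citeprice), data = journals)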
> lines(jour_pred[, 1] ~ lciteprice, col = 1)
> lines(jour_pred[, 2] ~ lciteprice, col = 1, lty = 2)
> lines(jour_pred[, 3] ~ lciteprice, col = 1, lty = 2)
Prediction
Figure: Scatterplot with prediction intervals for the journals data
Plotting lm objects
The plot() method for class lm provides six types of diagnostic plots, four of which are shown by default. We set the graphical parameter mfrow to c(2, 2) using the par() function, creating a 2 x 2 matrix of plotting areas to see all four plots simultaneously:
> par(mfrow = c(2, 2))
> plot(jour_lm)
> par(mfrow = c(1, 1))
Plotting lm objects
Figure: Diagnostic plots for the journals data
Testing a linear hypothesis
The standard regression output as provided by summary() only indicates individual significance of each regressor and joint significance of all regressors in the form of t and F statistics, respectively. Often it is necessary to test more general hypotheses. This is possible using the function linear.hypothesis() from the car package (named linearHypothesis() in current versions of car).
Suppose we want to test the hypothesis that the elasticity of the number of library subscriptions with respect to the price per citation equals -0.5:
$$H_0 : \beta_2 = -0.5 \tag{8}$$
Testing a linear hypothesis
> linearHypothesis(jour_lm, "log(citeprice) = -0.5")
Linear hypothesis test
Hypothesis:
log(citeprice) = - 0.5
Model 1: restricted model
Model 2: log(subs) ~ log(citeprice)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 179 100.54
2 178 100.06 1 0.48421 0.8614 0.3546
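With a single restriction, the F statistic is the square of the corresponding t statistic. The following sketch recomputes the test from the coefficient estimate and standard error printed in the regression output; the variable names are illustrative assumptions.

```r
# Values copied from the summary of jour_lm
b2    <- -0.5330535   # estimated coefficient on log(citeprice)
se_b2 <- 0.03561320   # its standard error
df    <- 178          # residual degrees of freedom

# t statistic for H0: beta_2 = -0.5, and the equivalent F statistic
t_stat <- (b2 - (-0.5)) / se_b2
f_stat <- t_stat^2

# Two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
round(c(t = t_stat, F = f_stat, p = p_val), 4)
```

The hypothesis of an elasticity of -0.5 is not rejected at conventional levels.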
Part II
Regression diagnostics
Review
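> data("Journals")
> journals <- Journals[, c("subs", "price")]
> journals$citeprice <- Journals$price/Journals$citations
> journals$age <- 2000 - Journals$foundingyear
> jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)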
Testing for heteroskedasticity
For cross-section regressions, the assumption $\mathrm{Var}(\varepsilon_i \mid x_i) = \sigma^2$ is typically in doubt. A popular test for checking this assumption is the Breusch-Pagan test (Breusch and Pagan 1979).
For our model fitted to the journals data, stored in jour_lm, the diagnostic plots suggest that the variance decreases with the fitted values or, equivalently, increases with the price per citation.
Hence, the regressor log(citeprice) used in the main model should also be employed for the auxiliary regression.
Under H0, the test statistic of the Breusch-Pagan test approximately follows a $\chi^2_q$ distribution, where q is the number of regressors in the auxiliary regression (excluding the constant term).
Testing for heteroskedasticity
The function bptest() implements all these flavors of the Breusch-Pagan test. By default, it computes the studentized statistic for the auxiliary regression utilizing the original regressors X.
> bptest(jour_lm)
studentized Breusch-Pagan test
data: jour_lm
BP = 9.803, df = 1, p-value = 0.001742
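The idea of the auxiliary regression can be sketched in base R. The example below assumes the studentized (Koenker) form of the statistic, n times the R-squared of a regression of the squared residuals on the original regressors, and uses simulated heteroskedastic data; the data, seed, and variable names are illustrative assumptions.

```r
# Simulated data whose error standard deviation grows with x (assumption)
set.seed(42)
n <- 200
x <- runif(n, 1, 5)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x)

fm <- lm(y ~ x)

# Auxiliary regression: squared residuals on the original regressors
e2  <- residuals(fm)^2
aux <- lm(e2 ~ x)

# Studentized statistic: n * R^2 of the auxiliary regression
bp_stat <- n * summary(aux)$r.squared
q       <- length(coef(aux)) - 1   # regressors excluding the constant
p_val   <- pchisq(bp_stat, df = q, lower.tail = FALSE)

c(BP = bp_stat, df = q, p.value = p_val)
```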
Testing for heteroskedasticity
Alternatively, the White test picks up the heteroskedasticity. It uses the original regressors as well as their squares and interactions in the auxiliary regression, which can be passed as a second formula to bptest().
> bptest(jour_lm, ~ log(citeprice) + I(log(citeprice)^2),
+ data = journals)
studentized Breusch-Pagan test
data: jour_lm
BP = 10.912, df = 2, p-value = 0.004271
Testing the functional form
The assumption $E(\varepsilon \mid X) = 0$ is crucial for consistency of the least-squares estimator.
A typical source for violation of this assumption is a misspecification of the functional form, e.g., by omitting relevant variables.
One strategy for testing the functional form is to construct auxiliary variables and assess their significance using a simple F test. This is what Ramsey's RESET does.
Testing the functional form
The function resettest() defaults to using second and third powers of the fitted values as auxiliary variables.
> resettest(jour_lm)
RESET test
data: jour_lm
RESET = 1.4409, df1 = 2, df2 = 176, p-value = 0.2395
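The construction behind RESET can be sketched in base R: add powers of the fitted values to the model and test them jointly with an F test. The simulated data (with a quadratic term that the linear fit omits), the seed, and the variable names are assumptions for illustration.

```r
# Simulated data with a quadratic relationship that a linear fit misses (assumption)
set.seed(7)
n <- 150
x <- runif(n, 0, 4)
y <- 1 + x + 0.5 * x^2 + rnorm(n)

fm <- lm(y ~ x)   # deliberately misspecified linear model

# Augment the model with the second and third powers of the fitted values
yhat   <- fitted(fm)
fm_aug <- lm(y ~ x + I(yhat^2) + I(yhat^3))

# F test of the two auxiliary terms
reset_tab <- anova(fm, fm_aug)
reset_tab
```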
Testing the functional form
The rainbow test (Utts 1982) takes a different approach to testing the functional form. It fits a model to a subsample (typically the middle 50%) and compares it with the model fitted to the full sample using an F test.
> raintest(jour_lm, order.by = ~ age, data = journals)
Rainbow test
data: jour_lm
Rain = 1.774, df1 = 90, df2 = 88, p-value = 0.003741
With the observations ordered by age, the test rejects the hypothesis that the fit for the middle 50% of journals equals the fit for the full sample, signaling that the relationship between the number of subscriptions and the price per citation also depends on the age of the journal.
Testing for autocorrelation
Let us reconsider the first model for the US consumption function.
> library(dynlm)
> data("USMacroG")
> consump1 <- dynlm(consumption ~ dpi + L(dpi), data = USMacroG)
> dwtest(consump1)
Durbin-Watson test
data: consump1
DW = 0.0866, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
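The Durbin-Watson statistic itself is simple to compute from the residuals. The sketch below uses a simulated regression with AR(1) errors (the data, seed, and names are assumptions for illustration) and shows the link between DW and the first-order residual autocorrelation.

```r
# Simulated regression with AR(1) errors (illustrative assumption)
set.seed(123)
n   <- 200
x   <- rnorm(n)
eps <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y   <- 1 + 2 * x + eps

e <- unname(residuals(lm(y ~ x)))

# Durbin-Watson statistic: squared first differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# First-order residual autocorrelation; DW is roughly 2 * (1 - rho_hat)
rho_hat <- sum(e[-1] * e[-n]) / sum(e^2)
c(DW = dw, approx = 2 * (1 - rho_hat))
```

Values of DW near 2 indicate no first-order autocorrelation, while values near 0, as in the consumption example, point to strong positive autocorrelation.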
Testing for autocorrelation
Further tests for autocorrelation are the Box-Pierce test and the Ljung-Box test, both being implemented in the function Box.test() in base R.
> Box.test(residuals(consump1), type = "Ljung-Box")
Box-Ljung test
data: residuals(consump1)
X-squared = 176.0698, df = 1, p-value < 2.2e-16
Multiple linear regression
Labour economics example: Estimation of wage equation
> data("CPS1988")
> summary(CPS1988)
wage education experience ethnicity
Min. : 50.05 Min. : 0.00 Min. :-4.0 cauc:25923
1st Qu.: 308.64 1st Qu.:12.00 1st Qu.: 8.0 afam: 2232
Median : 522.32 Median :12.00 Median :16.0
Mean : 603.73 Mean :13.07 Mean :18.2
3rd Qu.: 783.48 3rd Qu.:15.00 3rd Qu.:27.0
Max. :18777.20 Max. :18.00 Max. :63.0
The model of interest is
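$$\log(\text{wage}) = \beta_1 + \beta_2\,\text{experience} + \beta_3\,\text{experience}^2 + \beta_4\,\text{education} + \beta_5\,\text{ethnicity} + \varepsilon \tag{9}$$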
Multiple linear regression
> cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
+ data = CPS1988)
> summary(cps_lm)
Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education +
ethnicity, data = CPS1988)
Residuals:
Min 1Q Median 3Q Max
-2.9428 -0.3162 0.0580 0.3756 4.3830
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.321e+00 1.917e-02 225.38
Comparison of models
With more than a single explanatory variable, it is interesting to test for the relevance of subsets of regressors.
> cps_noeth <- lm(log(wage) ~ experience + I(experience^2) + education,
+ data = CPS1988)
> anova(cps_noeth, cps_lm)
Analysis of Variance Table
Model 1: log(wage) ~ experience + I(experience^2) + education
Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28151 9719.6
2 28150 9598.6 1 121.02 354.91 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This reveals that the effect of ethnicity is significant at any reasonable level.
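The F statistic for comparing nested models follows directly from the two residual sums of squares. A minimal sketch using only the numbers printed in the table (variable names are illustrative assumptions):

```r
# Values copied from the ANOVA table above
rss_restricted   <- 9719.6   # model without ethnicity
rss_unrestricted <- 9598.6   # model with ethnicity
q      <- 1                  # number of restrictions
df_res <- 28150              # residual degrees of freedom of the larger model

# F = (reduction in RSS per restriction) / (residual mean square of the larger model)
f_stat <- ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df_res)
p_val  <- pf(f_stat, q, df_res, lower.tail = FALSE)
f_stat
```

The result agrees with the reported F value up to rounding of the printed RSS values.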
Comparison of models
> cps_noeth <- update(cps_lm, formula = . ~ . - ethnicity)
> waldtest(cps_lm, . ~ . - ethnicity)  # waldtest() is from the package lmtest
Wald test
Model 1: log(wage) ~ experience + I(experience^2) +
education + ethnicity
Model 2: log(wage) ~ experience + I(experience^2) + education
Res.Df Df F Pr(>F)
1 28150
2 28151 -1 354.91 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Part III
Linear regression with panel data
Introduction
There has been considerable interest in panel data econometrics over the last two decades.
The package plm (Croissant and Millo 2008) contains the relevant fitting functions and methods for these specifications in R.
Two types of panel data models:
1 Static linear models
2 Dynamic linear models
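The Grunfeld data comprise 20 annual observations on the three variables real gross investment (invest), real value of the firm (value), and real value of the capital stock (capital) for 11 large US firms for the years 1935-1954.
> data("Grunfeld", package = "AER")
> library("plm")
> gr <- subset(Grunfeld, firm %in% c("General Electric", "General Motors", "IBM"))
> pgr <- pdata.frame(gr, index = c("firm", "year"))
> gr_pool <- plm(invest ~ value + capital, data = pgr, model = "pooling")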
One-way panel regression
A two-way model could have been estimated upon setting effect = "twoways".
If fixed effects need to be inspected, a fixef() method and an associated summary() method are available.
To check whether the fixed effects are really needed, we compare the fixed-effects and the pooled OLS fits by means of pFtest().
> gr_fe <- plm(invest ~ value + capital, data = pgr, model = "within")
> pFtest(gr_fe, gr_pool)
F test for individual effects
data: invest ~ value + capital
F = 56.8247, df1 = 2, df2 = 55, p-value = 4.148e-14
alternative hypothesis: significant effects
It is also possible to fit a random-effects model using the same fitting function upon setting model = "random" and selecting a method for estimating the variance components.
Four methods are available: Swamy-Arora, Amemiya, Wallace-Hussain, and Nerlove.
> gr_re <- plm(invest ~ value + capital, data = pgr, model = "random",
+ random.method = "walhus")
> summary(gr_re)
Oneway (individual) effect Random Effect Model
(Wallace-Hussain's transformation)
Call:
plm(formula = invest ~ value + capital, data = pgr, model = "random",
random.method = "walhus")
Balanced Panel: n=3, T=20, N=60
Effects:
var std.dev share
idiosyncratic 4389.31 66.25 0.352
individual 8079.74 89.89 0.648
theta: 0.8374
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-187.00 -32.90 6.96 31.40 210.00
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) -109.976572 61.701384 -1.7824 0.08001 .
value 0.104280 0.014996 6.9539 3.797e-09 ***
capital 0.344784 0.024520 14.0613 < 2.2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TSS: 1988300 RSS: 257520
R-Squared : 0.87048 Adj. R-Squared : 0.82696
F-statistic: 191.545 on 2 and 57 DF, p-value: < 2.22e-16
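The variance shares and the value of theta in this output can be checked by hand. The sketch below assumes the standard quasi-demeaning formula for a balanced one-way random-effects model and reuses the printed variance components; the variable names are illustrative.

```r
# Variance components and panel length copied from the output above
sigma2_idios <- 4389.31   # idiosyncratic variance
sigma2_indiv <- 8079.74   # individual-effect variance
T_len <- 20               # time periods per firm

# Share of each component in the total variance
share <- c(idiosyncratic = sigma2_idios, individual = sigma2_indiv) /
  (sigma2_idios + sigma2_indiv)

# Quasi-demeaning parameter for a balanced one-way random-effects model
theta <- 1 - sqrt(sigma2_idios / (sigma2_idios + T_len * sigma2_indiv))
round(c(share, theta = theta), 4)
```

A theta close to 1 means the random-effects transformation is close to the within (fixed-effects) transformation, which is consistent with the similar coefficient estimates.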
A comparison of the regression coefficients shows that fixed- and random-effects methods yield rather similar results for these data.
To check whether the random effects are really needed, a Lagrange multiplier test is available in plmtest(), defaulting to the test proposed by Honda (1985).
> plmtest(gr_pool)
Lagrange Multiplier Test - (Honda)
data: invest ~ value + capital
normal = 15.4704, p-value < 2.2e-16
alternative hypothesis: significant effects
Random-effects methods are more efficient than the fixed-effects estimator under more restrictive assumptions, namely exogeneity of the individual effects. It is therefore important to test for endogeneity, and the standard approach employs a Hausman test. The relevant function phtest() requires two panel regression objects, in our case yielding
> phtest(gr_re, gr_fe)
Hausman Test
data: invest ~ value + capital
chisq = 0.0404, df = 2, p-value = 0.98
alternative hypothesis: one model is inconsistent
In line with the rather similar estimates presented above, endogeneity does not appear to be a problem here.
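The reported p-value is the upper-tail probability of a chi-squared distribution with 2 degrees of freedom. A short check that reuses only the printed statistic (variable names are illustrative assumptions):

```r
# Statistic and degrees of freedom copied from the Hausman test output above
chisq <- 0.0404
df    <- 2

# Upper-tail probability of the chi-squared distribution
p_val <- pchisq(chisq, df = df, lower.tail = FALSE)
round(p_val, 2)
```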
Part IV
Useful resources
Econometrics Task View in R
Linear regression models: stats, lmtest, car, sandwich, AER.
Nonlinear regression models: stats.
Quantile regression: quantreg.
Panel data: plm.
Linear structural equation models: sem.
Books on Econometrics in R
Kleiber C. & A. Zeileis (2008), Applied Econometrics with R, Springer.
Vinod H.D. (2008), Hands-on Econometrics using R, World Scientific Publishing.
Hatekar N.R. (2010), Principles of Econometrics: An introduction using R, Sage Publications.
Thank You