Econometrics using R
Rajat Tayal
Fourth Quantitative Finance Workshop
December 21-December 24, 2012
Indian Institute of Technology, Kanpur
23 December 2012
Rajat Tayal (IIT Kanpur), Introduction to Estimation/Computing Environment - II
23 December 2012
Outline of the presentation
Linear regression
- Simple linear regression
- Multiple linear regression
- Partially linear models
- Factors, interactions, and weights
- Linear regression with time series data
- Linear regression with panel data
- Systems of linear equations
Regression diagnostics
- Leverage and standardized residuals
- Deletion diagnostics
- The function influence.measures()
- Testing for heteroskedasticity
- Testing for functional form
- Testing for autocorrelation
- Robust standard errors and tests
Part I
Linear regression
Introduction
The linear regression model, typically estimated by ordinary least squares (OLS), is the workhorse of applied econometrics. The model is

y_i = x_i^T β + ε_i,   i = 1, 2, ..., n,   (1)

or, in matrix form,

y = Xβ + ε.   (2)

For cross-sections:

E(ε|X) = 0   (3)
Var(ε|X) = σ² I   (4)

For time series:

E(ε_j | x_i) = 0,   i ≤ j.   (5)
Introduction
The OLS estimator is

β̂ = (X^T X)^(-1) X^T y.   (6)

The corresponding fitted values are ŷ = Xβ̂, the residuals are ε̂ = y - ŷ, and the residual sum of squares is ε̂^T ε̂.
In R, models are typically fitted by calling a model-fitting function, in this case lm(), with a formula object describing the model and a data.frame object containing the variables used in the formula.
fm <- lm(formula, data, ...)
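As a quick sanity check, equation (6) can be computed directly from the normal equations and compared with lm(). This is a minimal sketch on simulated data; the data, seed, and variable names are made up for illustration:

```r
## OLS by hand: beta_hat = (X'X)^{-1} X'y, compared against lm()
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)            # simulated data, true beta = (2, 0.5)
X <- cbind(1, x)                       # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solve the normal equations
fit <- lm(y ~ x)
all.equal(as.vector(beta_hat), unname(coef(fit)))   # TRUE
```

In practice one never forms (X'X)^{-1} explicitly; lm() uses a QR decomposition, which is numerically more stable.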
The first example
In view of the wide range of the variables, combined with a considerable amount of skewness, it is useful to take logarithms.
The goal is to estimate the effect of the price per citation on the number of library subscriptions.
To explore this issue quantitatively, we will fit a linear regression model,

log(subs)_i = β₁ + β₂ log(citeprice)_i + ε_i.   (7)
The first example
Here, the formula of interest is log(subs) ~ log(citeprice). This can be used both for plotting and for model fitting:

> plot(log(subs) ~ log(citeprice), data = journals)
> jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)
> abline(jour_lm)

abline() extracts the coefficients of the fitted model and adds the corresponding regression line to the plot.
The first example
The function lm() returns a fitted-model object, here stored as jour_lm. It is an object of class "lm".

> class(jour_lm)
[1] "lm"
> names(jour_lm)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
The first example
> summary(jour_lm)

Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)

Residuals:
     Min       1Q   Median       3Q      Max
-2.72478 -0.53609  0.03721  0.46619  1.84808

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     4.76621    0.05591   85.25   <2e-16 ***
log(citeprice) -0.53305    0.03561  -14.97   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7497 on 178 degrees of freedom
Multiple R-squared: 0.5573, Adjusted R-squared: 0.5548
F-statistic: 224 on 1 and 178 DF, p-value: < 2.2e-16
Generic functions for fitted (linear) model objects
Function                        Description
print()                         simple printed display
summary()                       standard regression output
coef() (or coefficients())      extracting the regression coefficients
residuals() (or resid())        extracting residuals
fitted() (or fitted.values())   extracting fitted values
anova()                         comparison of nested models
predict()                       predictions for new data
plot()                          diagnostic plots
confint()                       confidence intervals for the regression coefficients
deviance()                      residual sum of squares
vcov()                          (estimated) variance-covariance matrix
logLik()                        log-likelihood (assuming normally distributed errors)
AIC()                           information criteria including AIC, BIC/SBC (assuming normally distributed errors)
The first example
It is instructive to take a brief look at what the summary() method returns for a fitted "lm" object:

> jour_slm <- summary(jour_lm)
> class(jour_slm)
[1] "summary.lm"
> names(jour_slm)
 [1] "call"          "terms"         "residuals"     "coefficients"
 [5] "aliased"       "sigma"         "df"            "r.squared"
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
> jour_slm$coefficients
                 Estimate  Std. Error   t value      Pr(>|t|)
(Intercept)     4.7662121  0.05590908  85.24934 2.953913e-146
log(citeprice) -0.5330535  0.03561320 -14.96786  2.563943e-33
Analysis of variance
> anova(jour_lm)
Analysis of Variance Table
Response: log(subs)
Df Sum Sq Mean Sq F value Pr(>F)
log(citeprice) 1 125.93 125.934 224.04 < 2.2e-16 ***
Residuals 178 100.06 0.562
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table breaks the sum of squares about the mean (for the dependent variable, here log(subs)) into two parts: a part that is accounted for by a linear function of log(citeprice) and a part attributed to residual variation.
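This decomposition can be verified numerically. A minimal sketch on simulated (made-up) data, checking that the total sum of squares about the mean splits exactly into the model and residual parts for a model with an intercept:

```r
## ANOVA identity: TSS = model SS + residual SS
set.seed(42)
x <- rnorm(50)
y <- 1 + x + rnorm(50)                      # simulated data
fit <- lm(y ~ x)
tss <- sum((y - mean(y))^2)                 # total SS about the mean
mss <- sum((fitted(fit) - mean(y))^2)       # SS explained by the model
rss <- deviance(fit)                        # residual sum of squares
all.equal(tss, mss + rss)                   # TRUE
```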
Point and Interval estimates
To extract the estimated regression coefficients β̂, the function coef() can be used:
> coef(jour_lm)
(Intercept) log(citeprice)
  4.7662121     -0.5330535
> confint(jour_lm, level = 0.95)
2.5 % 97.5 %
(Intercept) 4.6558822 4.8765420
log(citeprice) -0.6033319 -0.4627751
Prediction
Two types of predictions:
1. the prediction of points on the regression line, and
2. the prediction of a new data value.
The standard errors of predictions for new data take into account both the uncertainty in the regression line and the variation of the individual points about the line.
Thus, the prediction interval for prediction of new data is larger than that for prediction of points on the line. The function predict() provides both types of standard errors.
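The difference between the two interval types is easy to see numerically. A small sketch on simulated (made-up) data: the point estimates coincide, while the prediction interval is strictly wider:

```r
## Confidence vs. prediction intervals from predict()
set.seed(7)
x <- rnorm(80)
y <- 3 - 0.5 * x + rnorm(80)               # simulated data
fit <- lm(y ~ x)
nd <- data.frame(x = 0.5)                   # a new x value
cint <- predict(fit, newdata = nd, interval = "confidence")
pint <- predict(fit, newdata = nd, interval = "prediction")
cint[, "fit"] == pint[, "fit"]              # TRUE: same point estimate
## prediction interval is wider (it adds the new observation's variance)
(pint[, "upr"] - pint[, "lwr"]) > (cint[, "upr"] - cint[, "lwr"])   # TRUE
```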
Prediction
> predict(jour_lm, newdata = data.frame(citeprice = 2.11),
interval = "confidence")
fit lwr upr
1 4.368188 4.247485 4.48889
> predict(jour_lm, newdata = data.frame(citeprice = 2.11),
interval = "prediction")
fit lwr upr
1 4.368188 2.883746 5.852629
The point estimates are identical (fit), but the intervals differ. The prediction intervals can also be used for computing and visualizing confidence bands.
Prediction
> lciteprice <- seq(from = -6, to = 4, by = 0.25)
> jour_pred <- predict(jour_lm, interval = "prediction",
+   newdata = data.frame(citeprice = exp(lciteprice)))
> plot(log(subs) ~ log(citeprice), data = journals)
> lines(jour_pred[, 1] ~ lciteprice, col = 1)
> lines(jour_pred[, 2] ~ lciteprice, col = 1, lty = 2)
> lines(jour_pred[, 3] ~ lciteprice, col = 1, lty = 2)
Prediction
Figure: Scatterplot with prediction intervals for the journals data
Plotting lm objects
The plot() method for class "lm" provides six types of diagnostic plots, four of which are shown by default. We set the graphical parameter mfrow to c(2, 2) using the par() function, creating a 2 x 2 matrix of plotting areas so that all four plots are displayed simultaneously:
> par(mfrow = c(2, 2))
> plot(jour_lm)
> par(mfrow = c(1, 1))
The first plot is a graph of residuals versus fitted values, the second is a QQ plot for normality, and plots three and four are a scale-location plot and a plot of standardized residuals against leverages, respectively.
Plotting lm objects
Figure: Diagnostic plots for the journals data
Testing a linear hypothesis
The standard regression output as provided by summary() only indicates individual significance of each regressor and joint significance of all regressors, in the form of t and F statistics, respectively. Often it is necessary to test more general hypotheses.
This is possible using the function linear.hypothesis() from the car package. Suppose we want to test the hypothesis that the elasticity of the number of library subscriptions with respect to the price per citation equals -0.5:

H0: β₂ = -0.5   (8)
Testing a linear hypothesis
> linear.hypothesis(jour_lm, "log(citeprice) = -0.5")
Linear hypothesis test
Hypothesis:
log(citeprice) = - 0.5
Model 1: restricted model
Model 2: log(subs) ~ log(citeprice)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    179 100.54
2    178 100.06  1   0.48421 0.8614 0.3546
Multiple linear regression
In economics, most regression analyses comprise more than a single regressor. Often there are regressors of a special type, usually referred to as dummy variables in econometrics, which are used for coding categorical variables.
> data("CPS1988")
> summary(CPS1988)
      wage           education       experience    ethnicity     smsa          region         parttime
 Min.   :   50.05   Min.   : 0.00   Min.   :-4.0   cauc:25923   no : 7223   northeast:6441   no :25631
 1st Qu.:  308.64   1st Qu.:12.00   1st Qu.: 8.0   afam: 2232   yes:20932   midwest  :6863   yes: 2524
 Median :  522.32   Median :12.00   Median :16.0                            south    :8760
 Mean   :  603.73   Mean   :13.07   Mean   :18.2                            west     :6091
 3rd Qu.:  783.48   3rd Qu.:15.00   3rd Qu.:27.0
 Max.   :18777.20   Max.   :18.00   Max.   :63.0
The model of interest is
The model of interest is

log(wage) = β₁ + β₂ experience + β₃ experience² + β₄ education + β₅ ethnicity + ε.   (9)
Multiple linear regression
> cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education +
+   ethnicity, data = CPS1988)
> summary(cps_lm)
Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education +
ethnicity, data = CPS1988)
Residuals:
Min 1Q Median 3Q Max
-2.9428 -0.3162  0.0580  0.3756  4.3830

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.321e+00 1.917e-02 225.38
Comparison of models
With more than a single explanatory variable, it is interesting to test for the relevance of subsets of regressors. For any two nested models, this can be done using the function anova(). E.g., to test for the relevance of the variable ethnicity, we explicitly fit the model without ethnicity and then compare both models:

> cps_noeth <- lm(log(wage) ~ experience + I(experience^2) + education,
+   data = CPS1988)
> anova(cps_noeth, cps_lm)
Analysis of Variance Table
Model 1: log(wage) ~ experience + I(experience^2) + education
Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28151 9719.6
2  28150 9598.6  1    121.02 354.91 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This reveals that the effect of ethnicity is significant at any reasonable level.
Comparison of models
> waldtest(cps_lm, . ~ . - ethnicity)

Wald test

Model 1: log(wage) ~ experience + I(experience^2) + education + ethnicity
Model 2: log(wage) ~ experience + I(experience^2) + education
  Res.Df Df      F    Pr(>F)
1  28150
2  28151 -1 354.91 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Part II
Linear regression with panel data
Introduction
There has been considerable interest in panel data econometrics over the last two decades.
The package plm (Croissant and Millo 2008) contains the relevant fitting functions and methods for such specifications in R.
Two types of panel data models:
1. Static linear models
2. Dynamic linear models
Introduction
For illustrating the basic fixed- and random-effects methods, we use the well-known Grunfeld data (Grunfeld 1958), comprising 20 annual observations on the three variables real gross investment (invest), real value of the firm (value), and real value of the capital stock (capital) for 11 large US firms for the years 1935-1954.
> data("Grunfeld", package = "AER")
> library("plm")
> gr
One-way panel regression
invest_it = β₁ value_it + β₂ capital_it + α_i + ν_it   (10)

where i = 1, ..., n, t = 1, ..., T, and the α_i denote the individual-specific effects. A fixed-effects version is estimated by running OLS on a within-transformed model:

> gr_fe <- plm(invest ~ value + capital, data = pgr, model = "within")
> summary(gr_fe)
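The within transformation itself is simple to carry out by hand. A sketch on simulated (made-up) panel data, checking that OLS on individually demeaned data reproduces the slope from a least-squares dummy-variable (LSDV) regression; only base R is used here, so this does not rely on plm:

```r
## Within (fixed-effects) transformation by hand
set.seed(3)
n <- 5; T <- 10
id <- rep(1:n, each = T)                    # individual index
alpha <- rep(rnorm(n, sd = 2), each = T)    # individual-specific effects
x <- rnorm(n * T)
y <- alpha + 0.7 * x + rnorm(n * T)         # simulated panel, true slope 0.7
demean <- function(v, g) v - ave(v, g)      # subtract individual means
dy <- demean(y, id)
dx <- demean(x, id)
fe_within <- coef(lm(dy ~ dx - 1))[["dx"]]           # within estimator
fe_lsdv <- coef(lm(y ~ x + factor(id)))[["x"]]       # LSDV estimator
all.equal(fe_within, fe_lsdv)               # TRUE: identical slopes
```

The point estimates coincide exactly; only the degrees-of-freedom correction for the standard errors differs between the two formulations, which plm's summary() handles for you.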
Oneway (individual) effect Within Model
Call:
plm(formula = invest ~ value + capital, data = pgr, model = "within")
Balanced Panel: n=3, T=20, N=60
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-167.00 -26.10 2.09 26.80 202.00
Coefficients :
         Estimate Std. Error t-value  Pr(>|t|)
value    0.104914   0.016331  6.4242 3.296e-08 ***
capital  0.345298   0.024392 14.1564 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 1888900
Residual Sum of Squares: 243980
R-Squared: 0.87084
Adj. R-Squared: 0.79827
F-statistic: 185.407 on 2 and 55 DF, p-value: < 2.22e-16
One-way panel regression
A two-way model could have been estimated upon setting effect = "twoways".
If fixed effects need to be inspected, a fixef() method and an associated summary() method are available.
To check whether the fixed effects are really needed, we compare the fixed-effects and the pooled OLS fits by means of pFtest():

> gr_pool <- plm(invest ~ value + capital, data = pgr, model = "pooling")
> pFtest(gr_fe, gr_pool)
F test for individual effects
data: invest ~ value + capital
F = 56.8247, df1 = 2, df2 = 55, p-value = 4.148e-14
alternative hypothesis: significant effects
One-way panel regression
It is also possible to fit a random-effects version of (10) using the same fitting function upon setting model = "random" and selecting a method for estimating the variance components. Four methods are available: Swamy-Arora, Amemiya, Wallace-Hussain, and Nerlove.

> gr_re <- plm(invest ~ value + capital, data = pgr, model = "random",
+   random.method = "walhus")
> summary(gr_re)
One-way panel regression
Oneway (individual) effect Random Effect Model
(Wallace-Hussain's transformation)

Call:
plm(formula = invest ~ value + capital, data = pgr, model = "random",
    random.method = "walhus")
Balanced Panel: n=3, T=20, N=60
Effects:
var std.dev share
idiosyncratic 4389.31 66.25 0.352
individual 8079.74 89.89 0.648
theta: 0.8374
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-187.00 -32.90 6.96 31.40 210.00
Coefficients:
               Estimate Std. Error t-value  Pr(>|t|)
(Intercept) -109.976572  61.701384 -1.7824   0.08001 .
value          0.104280   0.014996  6.9539 3.797e-09 ***
capital        0.344784   0.024520 14.0613 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares: 1988300
Residual Sum of Squares: 257520
R-Squared: 0.87048
Adj. R-Squared: 0.82696
F-statistic: 191.545 on 2 and 57 DF, p-value: < 2.22e-16
One-way panel regression
A comparison of the regression coefficients shows that fixed- and random-effects methods yield rather similar results for these data.
To check whether the random effects are really needed, a Lagrange multiplier test is available in plmtest(), defaulting to the test proposed by Honda (1985):

> plmtest(gr_pool)
Lagrange Multiplier Test - (Honda)
data: invest ~ value + capital
normal = 15.4704, p-value < 2.2e-16
alternative hypothesis: significant effects
One-way panel regression
Random-effects methods are more efficient than the fixed-effects estimator under more restrictive assumptions, namely exogeneity of the individual effects. It is therefore important to test for endogeneity, and the standard approach employs a Hausman test. The relevant function phtest() requires two panel regression objects, in our case yielding:
> phtest(gr_re, gr_fe)
Hausman Test
data: invest ~ value + capital
chisq = 0.0404, df = 2, p-value = 0.98
alternative hypothesis: one model is inconsistent

In line with the rather similar estimates presented above, endogeneity does not appear to be a problem here.
Dynamic linear models
To conclude this section, we present a more advanced example, the dynamic panel data model

y_it = Σ_{j=1}^{p} ρ_j y_{i,t-j} + x_it^T β + u_it,   u_it = α_i + θ_t + ν_it.   (11)

This is estimated by the method of Arellano and Bond (1991), viz. a generalized method of moments (GMM) estimator utilizing lagged endogenous regressors after a first-differences transformation.
Dynamic linear models
> data("EmplUK", package = "plm")
> form <- log(emp) ~ log(wage) + log(capital) + log(output)
> empl_ab <- pgmm(dynformula(form, list(2, 1, 0, 1)), data = EmplUK,
+   index = c("firm", "year"), effect = "twoways", model = "twosteps",
+   gmm.inst = ~ log(emp), lag.gmm = list(c(2, 99)))
> summary(empl_ab)
8/9/2019 Introduction Econometrics R
39/48
Dynamic linear models
Twoways effects Two steps model
Call:
pgmm(formula = dynformula(form, list(2, 1, 0, 1)), data = EmplUK,
    effect = "twoways", model = "twosteps", index = c("firm", "year"),
    ... = list(gmm.inst = ~log(emp), lag.gmm = list(c(2, 99))))
Unbalanced Panel: n=140, T=7-9, N=1031
Number of Observations Used: 611
Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.6191000 -0.0255700 0.0000000 -0.0001339 0.0332000 0.6410000
Coefficients
Estimate Std. Error z-value Pr(>|z|)
lag(log(emp), c(1, 2))1 0.474151 0.085303 5.5584 2.722e-08 ***
lag(log(emp), c(1, 2))2 -0.052967   0.027284  -1.9413 0.0522200 .
log(wage)               -0.513205   0.049345 -10.4003 < 2.2e-16 ***
lag(log(wage), 1) 0.224640 0.080063 2.8058 0.0050192 **
log(capital) 0.292723 0.039463 7.4177 1.191e-13 ***
log(output) 0.609775 0.108524 5.6188 1.923e-08 ***
lag(log(output), 1) -0.446373 0.124815 -3.5763 0.0003485 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dynamic linear models
Sargan Test: chisq(25) = 30.11247 (p.value=0.22011)
Autocorrelation test (1): normal = -2.427829 (p.value=0.0075948)
Autocorrelation test (2): normal = -0.3325401 (p.value=0.36974)
Wald test for coefficients: chisq(7) = 371.9877 (p.value = < 2.22e-16)
Wald test for time dummies: chisq(6) = 26.9045 (p.value = 0.0001509)

The results suggest that autoregressive dynamics are important for these data.
Part III
Regression diagnostics
Review
> data("Journals")
> journals <- Journals[, c("subs", "price")]
> journals$citeprice <- Journals$price/Journals$citations
> journals$age <- 2000 - Journals$foundingyear
> jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)
8/9/2019 Introduction Econometrics R
43/48
Testing for heteroskedasticity
For cross-section regressions, the assumption Var(ε_i|x_i) = σ² is typically in doubt. A popular test for checking this assumption is the Breusch-Pagan test (Breusch and Pagan 1979).
For our model fitted to the journals data, stored in jour_lm, the diagnostic plots suggest that the variance decreases with the fitted values or, equivalently, increases with the price per citation.
Hence, the regressor log(citeprice) used in the main model should also be employed for the auxiliary regression.
Under H0, the test statistic of the Breusch-Pagan test approximately follows a χ²_q distribution, where q is the number of regressors in the auxiliary regression (excluding the constant term).
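The studentized (Koenker) flavor of the statistic can be reproduced by hand as n times the R² of the auxiliary regression of the squared residuals on the regressors. A sketch on simulated (made-up) heteroskedastic data:

```r
## Breusch-Pagan test by hand (studentized version): BP = n * R^2 of the
## auxiliary regression of squared residuals on the regressors
set.seed(9)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + x)    # error variance rises with x
fit <- lm(y ~ x)
aux <- lm(residuals(fit)^2 ~ x)             # auxiliary regression
bp <- n * summary(aux)$r.squared            # approx. chi^2 with q = 1 df
pval <- pchisq(bp, df = 1, lower.tail = FALSE)
c(BP = bp, p.value = pval)
```

With the lmtest package loaded, bptest(fit) should give the same studentized statistic on these data.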
Testing for heteroskedasticity
The function bptest() implements all these flavors of the Breusch-Pagan test. By default, it computes the studentized statistic for the auxiliary regression utilizing the original regressors X.
> bptest(jour_lm)
studentized Breusch-Pagan test
data: jour_lm
BP = 9.803, df = 1, p-value = 0.001742

Alternatively, the White test also picks up the heteroskedasticity. It uses the original regressors as well as their squares and interactions in the auxiliary regression, which can be passed as a second formula to bptest():
> bptest(jour_lm, ~ log(citeprice) + I(log(citeprice)^2),
+ data = journals)
studentized Breusch-Pagan test
data: jour_lm
BP = 10.912, df = 2, p-value = 0.004271
Testing the functional form
The assumption E(ε|X) = 0 is crucial for consistency of the least-squares estimator. A typical source for violation of this assumption is a misspecification of the functional form, e.g., by omitting relevant variables. One strategy for testing the functional form is to construct auxiliary variables and assess their significance using a simple F test. This is what Ramsey's RESET does.
The function resettest() defaults to using second and third powers of the fitted values as auxiliary variables.
> resettest(jour_lm)
RESET test

data: jour_lm
RESET = 1.4409, df1 = 2, df2 = 176, p-value = 0.2395
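The idea behind RESET can be replicated with a few lines of base R: fit a deliberately misspecified model, then F-test the added powers of its fitted values. A sketch on simulated (made-up) data where the true relationship is quadratic:

```r
## RESET by hand: add powers of the fitted values and F-test them
set.seed(11)
x <- rnorm(100)
y <- 1 + x + 0.3 * x^2 + rnorm(100)   # true relationship is quadratic
fit <- lm(y ~ x)                       # deliberately misspecified
yhat <- fitted(fit)
aug <- lm(y ~ x + I(yhat^2) + I(yhat^3))
anova(fit, aug)                        # F test on the auxiliary powers
```

A small p-value here signals that the fitted values' powers add explanatory content, i.e. the functional form of the original model is inadequate.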
Testing the functional form
The rainbow test (Utts 1982) takes a different approach to testing the functional form. It fits a model to a subsample (typically the middle 50%) and compares it with the model fitted to the full sample using an F test.

> raintest(jour_lm, order.by = ~ age, data = journals)

Rainbow test
data: jour_lm
Rain = 1.774, df1 = 90, df2 = 88, p-value = 0.003741
The null hypothesis is clearly rejected, signaling that the relationship between the number of subscriptions and the price per citation also depends on the age of the journal.
Testing for autocorrelation
Let us reconsider the first model for the US consumption function:

> library("dynlm")
> data("USMacroG")
> consump1 <- dynlm(consumption ~ dpi + L(dpi), data = USMacroG)
> dwtest(consump1)
Durbin-Watson test
data: consump1
DW = 0.0866, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0

Further tests for autocorrelation are the Box-Pierce test and the Ljung-Box test, both implemented in the function Box.test() in base R.

> Box.test(residuals(consump1), type = "Ljung-Box")
Box-Ljung test
data: residuals(consump1)