
    Econometrics using R

    Rajat Tayal

    Fourth Quantitative Finance Workshop

    December 21-December 24, 2012

    Indian Institute of Technology, Kanpur

    23 December 2012


    Outline of the presentation

    Linear regression

Simple linear regression
Multiple linear regression
Partially linear models
Factors, interactions, and weights
Linear regression with time series data
Linear regression with panel data
Systems of linear equations

    Regression diagnostics

Leverage and standardized residuals
Deletion diagnostics
The function influence.measures()
Testing for heteroskedasticity
Testing for functional form
Testing for autocorrelation
Robust standard errors and tests


    Part I

    Linear regression


    Introduction

The linear regression model, typically estimated by ordinary least squares (OLS), is the workhorse of applied econometrics. The model is

y_i = x_i^T β + ε_i,   i = 1, 2, . . . , n,   (1)

or, in matrix form,

y = Xβ + ε.   (2)

For cross-sections:

E(ε|X) = 0   (3)

Var(ε|X) = σ² I   (4)

For time series:

E(ε_j | x_i) = 0,   i ≤ j.   (5)


    Introduction

The OLS estimator of β is

β̂ = (X^T X)^{-1} X^T y   (6)

The corresponding fitted values are ŷ = Xβ̂, the residuals are ε̂ = y − ŷ, and the residual sum of squares is ε̂^T ε̂.

In R, models are typically fitted by calling a model-fitting function, in this case lm(), with a formula object describing the model and a data.frame object containing the variables used in the formula:

fm <- lm(formula, data, ...)
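A minimal sketch (simulated data; the variable names are illustrative, not from the slides) showing that the matrix formula (6) reproduces the coefficients returned by lm():

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                   # simulated data satisfying the OLS assumptions
X <- cbind(1, x)                            # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y, cf. equation (6)
beta_hat
coef(lm(y ~ x))                             # lm() returns the same estimates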


    The first example

In view of the wide range of the variables, combined with a considerable amount of skewness, it is useful to take logarithms.

The goal is to estimate the effect of the price per citation on the number of library subscriptions.

To explore this issue quantitatively, we will fit a linear regression model,

log(subs)_i = β1 + β2 log(citeprice)_i + ε_i   (7)
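The examples below fit models to a data frame called journals. A minimal sketch of how it might be constructed, assuming the Journals data shipped with the AER package (the same construction reappears on the Review slide in Part III):

library("AER")
data("Journals")
journals <- Journals[, c("subs", "price")]
journals$citeprice <- Journals$price / Journals$citations   # price per citation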


    The first example

Here, the formula of interest is log(subs) ~ log(citeprice). This can be used both for plotting and for model fitting:

> plot(log(subs) ~ log(citeprice), data = journals)
> jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)
> abline(jour_lm)

abline() extracts the coefficients of the fitted model and adds the corresponding regression line to the plot.


The first example

[Figure: scatterplot of log(subs) against log(citeprice) for the journals data, with the fitted regression line added by abline()]

    The first example

The function lm() returns a fitted-model object, here stored as jour_lm. It is an object of class "lm".

> class(jour_lm)
[1] "lm"

> names(jour_lm)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"


    The first example

> summary(jour_lm)

Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)

Residuals:
     Min       1Q   Median       3Q      Max
-2.72478 -0.53609  0.03721  0.46619  1.84808

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     4.76621    0.05591   85.25   <2e-16 ***
log(citeprice) -0.53305    0.03561  -14.97   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


    Generic functions for fitted (linear) model objects

Function                         Description
print()                          simple printed display
summary()                        standard regression output
coef() (or coefficients())       extracting the regression coefficients
residuals() (or resid())         extracting residuals
fitted() (or fitted.values())    extracting fitted values
anova()                          comparison of nested models
predict()                        predictions for new data
plot()                           diagnostic plots
confint()                        confidence intervals for the regression coefficients
deviance()                       residual sum of squares
vcov()                           (estimated) variance-covariance matrix
logLik()                         log-likelihood (assuming normally distributed errors)
AIC()                            information criteria including AIC, BIC/SBC (assuming normally distributed errors)
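For illustration, a short sketch applying a few of these extractors to the fitted model jour_lm (any "lm" object works the same way):

coef(jour_lm)          # regression coefficients
head(fitted(jour_lm))  # first few fitted values
deviance(jour_lm)      # residual sum of squares
vcov(jour_lm)          # estimated variance-covariance matrix
logLik(jour_lm)        # log-likelihood (assuming normal errors)
AIC(jour_lm)           # Akaike information criterion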


    The first example

It is instructive to take a brief look at what the summary() method returns for a fitted "lm" object:

> jour_slm <- summary(jour_lm)
> class(jour_slm)
[1] "summary.lm"

> names(jour_slm)
 [1] "call"          "terms"         "residuals"     "coefficients"  "aliased"       "sigma"
 [7] "df"            "r.squared"     "adj.r.squared" "fstatistic"    "cov.unscaled"

> jour_slm$coefficients
                 Estimate  Std. Error   t value      Pr(>|t|)
(Intercept)     4.7662121  0.05590908  85.24934 2.953913e-146
log(citeprice) -0.5330535  0.03561320 -14.96786  2.563943e-33


    Analysis of variance

    > anova(jour_lm)

    Analysis of Variance Table

    Response: log(subs)

    Df Sum Sq Mean Sq F value Pr(>F)

    log(citeprice) 1 125.93 125.934 224.04 < 2.2e-16 ***

    Residuals 178 100.06 0.562

    ---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table breaks the sum of squares about the mean (for the dependent variable, here log(subs)) into two parts: a part that is accounted for by a linear function of log(citeprice) and a part attributed to residual variation.
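The decomposition can be checked directly: the regression share of the total sum of squares is the R² reported by summary(). A minimal sketch:

ss <- anova(jour_lm)[["Sum Sq"]]   # regression SS and residual SS
ss[1] / sum(ss)                    # share explained by log(citeprice), about 0.557
summary(jour_lm)$r.squared         # the same value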


    Point and Interval estimates

To extract the estimated regression coefficients β̂, the function coef() can be used:

> coef(jour_lm)

   (Intercept) log(citeprice)
     4.7662121     -0.5330535

    > confint(jour_lm, level = 0.95)

    2.5 % 97.5 %

    (Intercept) 4.6558822 4.8765420

    log(citeprice) -0.6033319 -0.4627751
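The intervals reported by confint() can be reproduced from the coefficient estimates, their standard errors, and a t quantile (a sketch using base R functions only):

est   <- coef(jour_lm)
se    <- sqrt(diag(vcov(jour_lm)))
tcrit <- qt(0.975, df = df.residual(jour_lm))
cbind("2.5 %" = est - tcrit * se, "97.5 %" = est + tcrit * se)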


    Prediction

Two types of predictions:

1. the prediction of points on the regression line, and
2. the prediction of a new data value.

The standard errors of predictions for new data take into account both the uncertainty in the regression line and the variation of the individual points about the line.

Thus, the prediction interval for predicting new data is larger than that for predicting points on the line. The function predict() provides both types of standard errors.


    Prediction

> predict(jour_lm, newdata = data.frame(citeprice = 2.11),
+   interval = "confidence")

       fit      lwr     upr
1 4.368188 4.247485 4.48889

> predict(jour_lm, newdata = data.frame(citeprice = 2.11),
+   interval = "prediction")

       fit      lwr      upr
1 4.368188 2.883746 5.852629

The point estimates are identical (fit), but the intervals differ. The prediction intervals can also be used for computing and visualizing confidence bands.


    Prediction

> lciteprice <- seq(from = -6, to = 4, by = 0.25)
> jour_pred <- predict(jour_lm, interval = "prediction",
+   newdata = data.frame(citeprice = exp(lciteprice)))
> plot(log(subs) ~ log(citeprice), data = journals)
> lines(jour_pred[, 1] ~ lciteprice, col = 1)
> lines(jour_pred[, 2] ~ lciteprice, col = 1, lty = 2)
> lines(jour_pred[, 3] ~ lciteprice, col = 1, lty = 2)


    Prediction

    Figure: Scatterplot with prediction intervals for the journals data


    Plotting lm objects

The plot() method for class "lm" provides six types of diagnostic plots, four of which are shown by default. We set the graphical parameter mfrow to c(2, 2) using the par() function, creating a 2 × 2 matrix of plotting areas so that all four plots are shown simultaneously:

    > par(mfrow = c(2, 2))

    > plot(jour_lm)

    > par(mfrow = c(1, 1))

The first plot shows residuals versus fitted values, the second is a QQ plot for normality, and plots three and four are a scale-location plot and a plot of standardized residuals against leverages, respectively.


    Plotting lm objects

    Figure: Diagnostic plots for the journals data


    Testing a linear hypothesis

The standard regression output as provided by summary() only indicates individual significance of each regressor and joint significance of all regressors, in the form of t and F statistics, respectively. Often it is necessary to test more general hypotheses. This is possible using the function linear.hypothesis() from the car package.

Suppose we want to test the hypothesis that the elasticity of the number of library subscriptions with respect to the price per citation equals -0.5:

H0: β2 = -0.5   (8)
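Since the hypothesis involves a single coefficient, it can equivalently be checked with a t statistic, whose square is the F statistic reported by linear.hypothesis(). A minimal sketch using base R only:

est   <- coef(jour_lm)["log(citeprice)"]
se    <- sqrt(vcov(jour_lm)["log(citeprice)", "log(citeprice)"])
tstat <- unname((est - (-0.5)) / se)                              # test H0: beta_2 = -0.5
pval  <- 2 * pt(abs(tstat), df = df.residual(jour_lm), lower.tail = FALSE)
c(t = tstat, F = tstat^2, p.value = pval)                         # p-value matches the output on the next slide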


    Testing a linear hypothesis

    > linear.hypothesis(jour_lm, "log(citeprice) = -0.5")

    Linear hypothesis test

    Hypothesis:

    log(citeprice) = - 0.5

    Model 1: restricted model

    Model 2: log(subs) ~ log(citeprice)

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    179 100.54
2    178 100.06  1   0.48421 0.8614 0.3546


    Multiple linear regression

In economics, most regression analyses comprise more than a single regressor. Often there are regressors of a special type, usually referred to as dummy variables in econometrics, which are used for coding categorical variables.

    > data("CPS1988")

    > summary(CPS1988)

      wage            education       experience    ethnicity     smsa         region         parttime
 Min.   :   50.05   Min.   : 0.00   Min.   :-4.0   cauc:25923   no : 7223   northeast:6441   no :25631
 1st Qu.:  308.64   1st Qu.:12.00   1st Qu.: 8.0   afam: 2232   yes:20932   midwest  :6863   yes: 2524
 Median :  522.32   Median :12.00   Median :16.0                            south    :8760
 Mean   :  603.73   Mean   :13.07   Mean   :18.2                            west     :6091
 3rd Qu.:  783.48   3rd Qu.:15.00   3rd Qu.:27.0
 Max.   :18777.20   Max.   :18.00   Max.   :63.0

    The model of interest is

log(wage) = β1 + β2 experience + β3 experience² + β4 education + β5 ethnicity + ε   (9)


    Multiple linear regression

> cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity,
+   data = CPS1988)
> summary(cps_lm)

Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education +
    ethnicity, data = CPS1988)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9428 -0.3162  0.0580  0.3756  4.3830

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.321e+00  1.917e-02  225.38   <2e-16


    Comparison of models

With more than a single explanatory variable, it is interesting to test for the relevance of subsets of regressors. For any two nested models, this can be done using the function anova(). E.g., to test for the relevance of the variable ethnicity, we explicitly fit the model without ethnicity and then compare both models.

> cps_noeth <- lm(log(wage) ~ experience + I(experience^2) + education,
+   data = CPS1988)
> anova(cps_noeth, cps_lm)

    Analysis of Variance Table

    Model 1: log(wage) ~ experience + I(experience^2) + education

    Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity

    Res.Df RSS Df Sum of Sq F Pr(>F)

    1 28151 9719.6

2  28150 9598.6  1    121.02 354.91 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    This reveals that the effect of ethnicity is significant at any reasonable level.


    Comparison of models

> waldtest(cps_lm, . ~ . - ethnicity)

Wald test

Model 1: log(wage) ~ experience + I(experience^2) + education + ethnicity
Model 2: log(wage) ~ experience + I(experience^2) + education

  Res.Df Df      F    Pr(>F)
1  28150
2  28151 -1 354.91 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The same comparison can thus be carried out with waldtest() from the lmtest package, without explicitly fitting the restricted model.


    Part II

    Linear regression with panel data


    Introduction

There has been considerable interest in panel data econometrics over the last two decades.

The package plm (Croissant and Millo 2008) contains the relevant fitting functions and methods for such specifications in R.

Two types of panel data models:

1. Static linear models
2. Dynamic linear models


    Introduction

For illustrating the basic fixed- and random-effects methods, we use the well-known Grunfeld data (Grunfeld 1958), comprising 20 annual observations on the three variables real gross investment (invest), real value of the firm (value), and real value of the capital stock (capital) for 11 large US firms for the years 1935-1954.

> data("Grunfeld", package = "AER")
> library("plm")
> gr <- subset(Grunfeld, firm %in% c("General Electric", "General Motors", "IBM"))
> pgr <- plm.data(gr, index = c("firm", "year"))


    One-way panel regression

invest_it = β1 value_it + β2 capital_it + α_i + ν_it   (10)

where i = 1, . . . , n, t = 1, . . . , T, and the α_i denote the individual-specific effects. A fixed-effects version is estimated by running OLS on a within-transformed model:

> gr_fe <- plm(invest ~ value + capital, data = pgr, model = "within")
> summary(gr_fe)

    Oneway (individual) effect Within Model

    Call:

    plm(formula = invest ~ value + capital, data = pgr, model = "within")

    Balanced Panel: n=3, T=20, N=60

    Residuals :

    Min. 1st Qu. Median 3rd Qu. Max.

    -167.00 -26.10 2.09 26.80 202.00

    Coefficients :

        Estimate Std. Error t-value  Pr(>|t|)
value   0.104914   0.016331  6.4242 3.296e-08 ***

    capital 0.345298 0.024392 14.1564 < 2.2e-16 ***

    ---

    Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Total Sum of Squares: 1888900
Residual Sum of Squares: 243980

    R-Squared : 0.87084 Adj. R-Squared : 0.79827

F-statistic: 185.407 on 2 and 55 DF, p-value: < 2.22e-16


    One-way panel regression

A two-way model could have been estimated upon setting effect = "twoways" (see the sketch below).

If fixed effects need to be inspected, a fixef() method and an associated summary() method are available.

To check whether the fixed effects are really needed, we compare the fixed-effects and the pooled OLS fits by means of pFtest().

> gr_pool <- plm(invest ~ value + capital, data = pgr, model = "pooling")
> pFtest(gr_fe, gr_pool)

    F test for individual effects

    data: invest ~ value + capital

    F = 56.8247, df1 = 2, df2 = 55, p-value = 4.148e-14

    alternative hypothesis: significant effects
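A brief sketch of the variants mentioned above (the object name gr_fe2 is illustrative; effect = "twoways" and fixef() are the plm argument and extractor referred to on this slide, and their output is not shown here):

gr_fe2 <- plm(invest ~ value + capital, data = pgr,
              model = "within", effect = "twoways")   # two-way fixed effects
fixef(gr_fe)           # individual-specific effects of the one-way model
summary(fixef(gr_fe))  # with standard errors and significance tests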


    One-way panel regression

It is also possible to fit a random-effects version of model (10) using the same fitting function upon setting model = "random" and selecting a method for estimating the variance components.

Four methods are available: Swamy-Arora, Amemiya, Wallace-Hussain, and Nerlove.

> gr_re <- plm(invest ~ value + capital, data = pgr, model = "random",
+   random.method = "walhus")
> summary(gr_re)


    One-way panel regression

    Oneway (individual) effect Random Effect Model

(Wallace-Hussain's transformation)

Call:
plm(formula = invest ~ value + capital, data = pgr, model = "random",
    random.method = "walhus")

    Balanced Panel: n=3, T=20, N=60

    Effects:

    var std.dev share

    idiosyncratic 4389.31 66.25 0.352

    individual 8079.74 89.89 0.648

    theta: 0.8374

    Residuals :

    Min. 1st Qu. Median 3rd Qu. Max.

    -187.00 -32.90 6.96 31.40 210.00

    Coefficients :

               Estimate Std. Error t-value  Pr(>|t|)
(Intercept) -109.976572  61.701384 -1.7824   0.08001 .

    value 0.104280 0.014996 6.9539 3.797e-09 ***

    capital 0.344784 0.024520 14.0613 < 2.2e-16 ***

    ---

    Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Total Sum of Squares: 1988300
Residual Sum of Squares: 257520

R-Squared : 0.87048
Adj. R-Squared : 0.82696
F-statistic: 191.545 on 2 and 57 DF, p-value: < 2.22e-16


    One-way panel regression

A comparison of the regression coefficients shows that fixed- and random-effects methods yield rather similar results for these data.

To check whether the random effects are really needed, a Lagrange multiplier test is available in plmtest(), defaulting to the test proposed by Honda (1985).

> plmtest(gr_pool)

Lagrange Multiplier Test - (Honda)

data: invest ~ value + capital
normal = 15.4704, p-value < 2.2e-16

    alternative hypothesis: significant effects


    One-way panel regression

Random-effects methods are more efficient than the fixed-effects estimator under more restrictive assumptions, namely exogeneity of the individual effects. It is therefore important to test for endogeneity, and the standard approach employs a Hausman test. The relevant function phtest() requires two panel regression objects, in our case yielding

    > phtest(gr_re, gr_fe)

    Hausman Test

data: invest ~ value + capital
chisq = 0.0404, df = 2, p-value = 0.98

    alternative hypothesis: one model is inconsistent

In line with the rather similar estimates presented above, endogeneity does not appear to be a problem here.


    Dynamic linear models

To conclude this section, we present a more advanced example, the dynamic panel data model:

y_it = Σ_{j=1}^{p} γ_j y_{i,t-j} + x_it^T β + u_it,   u_it = α_i + λ_t + ν_it   (11)

This is estimated by the method of Arellano and Bond (1991), viz. a generalized method of moments (GMM) estimator utilizing lagged endogenous regressors after a first-differences transformation.


    Dynamic linear models

    > data("EmplUK", package = "plm")

    > form empl_ab


    Dynamic linear models

    Twoways effects Two steps model

    Call:

pgmm(formula = dynformula(form, list(2, 1, 0, 1)), data = EmplUK,
    effect = "twoways", model = "twosteps", index = c("firm", "year"),
    ... = list(gmm.inst = ~log(emp), lag.gmm = list(c(2, 99))))

    Unbalanced Panel: n=140, T=7-9, N=1031

    Number of Observations Used: 611

    Residuals

    Min. 1st Qu. Median Mean 3rd Qu. Max.

    -0.6191000 -0.0255700 0.0000000 -0.0001339 0.0332000 0.6410000

    Coefficients

    Estimate Std. Error z-value Pr(>|z|)

    lag(log(emp), c(1, 2))1 0.474151 0.085303 5.5584 2.722e-08 ***

lag(log(emp), c(1, 2))2 -0.052967   0.027284  -1.9413 0.0522200 .
log(wage)               -0.513205   0.049345 -10.4003 < 2.2e-16 ***

    lag(log(wage), 1) 0.224640 0.080063 2.8058 0.0050192 **

    log(capital) 0.292723 0.039463 7.4177 1.191e-13 ***

    log(output) 0.609775 0.108524 5.6188 1.923e-08 ***

    lag(log(output), 1) -0.446373 0.124815 -3.5763 0.0003485 ***

    ---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


    Dynamic linear models

    Sargan Test: chisq(25) = 30.11247 (p.value=0.22011)

    Autocorrelation test (1): normal = -2.427829 (p.value=0.0075948)

    Autocorrelation test (2): normal = -0.3325401 (p.value=0.36974)

Wald test for coefficients: chisq(7) = 371.9877 (p.value = < 2.22e-16)
Wald test for time dummies: chisq(6) = 26.9045 (p.value = 0.0001509)

The results suggest that autoregressive dynamics are important for these data.


    Part III

    Regression diagnostics


    Review

    > data("Journals")

    > journals journals$citeprice journals$age jour_lm


    Testing for heteroskedasticity

For cross-section regressions, the assumption Var(ε_i | x_i) = σ² is typically in doubt. A popular test for checking this assumption is the Breusch-Pagan test (Breusch and Pagan 1979).

For our model fitted to the journals data, stored in jour_lm, the diagnostic plots suggest that the variance decreases with the fitted values or, equivalently, increases with the price per citation.

Hence, the regressor log(citeprice) used in the main model should also be employed for the auxiliary regression.

Under H0, the test statistic of the Breusch-Pagan test approximately follows a χ²_q distribution, where q is the number of regressors in the auxiliary regression (excluding the constant term).
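The mechanics can be illustrated by hand: the studentized Breusch-Pagan statistic (the default of bptest(), shown on the next slide) is n · R² from the auxiliary regression of the squared residuals on log(citeprice). A minimal sketch:

aux <- lm(residuals(jour_lm)^2 ~ log(citeprice), data = journals)   # auxiliary regression
bp  <- nrow(journals) * summary(aux)$r.squared                      # n * R^2
c(BP = bp, p.value = pchisq(bp, df = 1, lower.tail = FALSE))        # compare with bptest(jour_lm)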


Testing for heteroskedasticity

The function bptest() implements all these flavors of the Breusch-Pagan test. By default, it computes the studentized statistic for the auxiliary regression utilizing the original regressors X.

> bptest(jour_lm)

studentized Breusch-Pagan test

data: jour_lm
BP = 9.803, df = 1, p-value = 0.001742

Alternatively, the White test also picks up this heteroskedasticity. It uses the original regressors as well as their squares and interactions in the auxiliary regression, which can be passed as a second formula to bptest().

    > bptest(jour_lm, ~ log(citeprice) + I(log(citeprice)^2),

    + data = journals)

    studentized Breusch-Pagan test

    data: jour_lm

BP = 10.912, df = 2, p-value = 0.004271


Testing the functional form

The assumption E(ε|X) = 0 is crucial for consistency of the least-squares estimator. A typical source of violation of this assumption is a misspecification of the functional form, e.g., by omitting relevant variables. One strategy for testing the functional form is to construct auxiliary variables and assess their significance using a simple F test. This is what Ramsey's RESET (regression specification error test) does.

The function resettest() defaults to using second and third powers of the fitted values as auxiliary variables.

> resettest(jour_lm)

RESET test

data: jour_lm
RESET = 1.4409, df1 = 2, df2 = 176, p-value = 0.2395


Testing the functional form

The rainbow test (Utts 1982) takes a different approach to testing the functional form. It fits a model to a subsample (typically the middle 50%) and compares it with the model fitted to the full sample using an F test.

> raintest(jour_lm, order.by = ~ age, data = journals)

Rainbow test

data: jour_lm
Rain = 1.774, df1 = 90, df2 = 88, p-value = 0.003741

This appears to be the case here, signaling that the relationship between the number of subscriptions and the price per citation also depends on the age of the journal.


Testing for autocorrelation

Let us reconsider the first model for the US consumption function.

> library(dynlm)
> data("USMacroG")
> consump1 <- dynlm(consumption ~ dpi + L(dpi), data = USMacroG)
> dwtest(consump1)

Durbin-Watson test

data: consump1
DW = 0.0866, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0

Further tests for autocorrelation are the Box-Pierce test and the Ljung-Box test, both being implemented in the function Box.test() in base R.

> Box.test(residuals(consump1), type = "Ljung-Box")

Box-Ljung test

data: residuals(consump1)
