
5 What Do All these Tests and Statistics Mean?

LEARNING OBJECTIVES

• Be able to interpret OLS coefficients, and construct elasticities, in a linear model

• Be able to use the ‘F’ test in three ways in a linear regression – as a structural stability test, as a variable addition/deletion test and as a test of the hypothesis that R squared is greater than 0

• Know how R squared can be ‘adjusted’ and the limitations of doing this
• Understand how to construct a forecast from a linear regression model and be able to use forecast evaluation statistics

CHAPTER SUMMARY

5.1 Introduction: Typical test statistics in computer output
5.2 Telling the story of the regression coefficients: The use of elasticities

5.3 The construction and use of ‘F’ tests in regression
5.4 Adjusting the R squared: How and why?
5.5 Be careful with all R squareds

5.6 Basic econometric forecasting
5.7 Review studies

5.8 Conclusion
Dos and Don’ts • Exercises • References • Weblink


5.1 INTRODUCTION: TYPICAL TEST STATISTICS IN COMPUTER OUTPUT

By this stage, you should be familiar with single-equation multiple regression models

estimated by ordinary least squares (OLS). We continue with the assumption that the model is

strictly linear (i.e. a two-variable relationship will be a straight line if plotted on a graph). Once

you have estimated such an equation, you will usually be presented with a display which

contains, at a minimum, estimated coefficients, standard errors of these, an ‘F’ ratio, sum of

squared residuals and an R squared. You may also be given a ‘critical significance level’, an R

squared adjusted, sum of residuals and other statistics that do not concern us at the moment.

A specimen output is shown in Table 5.1, which is an exact copy of a regression estimated in

a version of SST (Statistical Software Tools). The data used here is a cross-section of American

states (including Hawaii) in 1980.

The estimating equation is:

PCBUR = b0 + b1CRROB + b2CRBUR + b3CRTHF + b4UR + b5PINC + b6POV + b7PCBLK + u    (5.1)

The equation has per capita burglary rates (PCBUR) as the dependent variable and the

independent variables are clearance rates (that is, the fraction of crimes ‘solved’ or written off by

the police) for burglary, robbery and theft (CRBUR, CRROB, CRTHF), unemployment rates
(UR), per capita income (PINC), percentage of households in poverty (POV) and percentage of the population classified as black (PCBLK).

Table 5.1 Ordinary least squares estimation of a burglary equation

Dependent variable: PCBUR

Independent variable    Estimated coefficient    Standard error    t-statistic
INT                     15.10120                 10.13702          1.48971
CRROB                   -0.11814                 9.88712e-002      -1.19485
CRBUR                   -0.34167                 0.26948           -1.26789
CRTHF                   6.35066e-002             0.19346           0.32827
UR                      -0.18846                 0.40588           -0.46434
PINC                    9.34656e-004             6.92072e-004      1.35052
POV                     -9.20597e-002            0.37353           -0.24646
PCBLK                   11.41881                 7.53019           1.51640

Number of Observations              51
R-squared                           0.36813
Corrected R-squared                 0.26527
Sum of Squared Residuals            8.39292e+002
Standard Error of the Regression    4.41797
Durbin–Watson Statistic             1.99202
Mean of Dependent Variable          14.88223


The clearance rates here have been converted to percentages by multiplying

by 100 and the crime rate has been converted to per 100 000 population. You can find this data

on the website in an Excel file called USADAT.XLS.

If you wish to replicate Table 5.1, you will need to create the variables PCBUR and PCBLK

by dividing the burg and black variables by the pop variable. I have replicated this regression

using SPSS, on the same data set, and obtained results which differed in all cases, in terms of

the coefficients and the ‘t’ ratios. This was sometimes by as much as 10 per cent but it would

not have made any appreciable difference to hypothesis testing.
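If you want to try the replication yourself, the following minimal sketch (not part of the original text) shows how the regression in Table 5.1 could be estimated with the statsmodels package; the column names for the raw variables are assumptions about how USADAT.XLS is laid out.

```python
import pandas as pd
import statsmodels.api as sm

# The file name comes from the text; the column names below (burg, black, pop,
# CRROB, CRBUR, CRTHF, UR, PINC, POV) are assumptions about the data layout.
data = pd.read_excel("USADAT.XLS")

# Create the per capita variables as described above
data["PCBUR"] = data["burg"] / data["pop"]
data["PCBLK"] = data["black"] / data["pop"]

y = data["PCBUR"]
X = sm.add_constant(data[["CRROB", "CRBUR", "CRTHF", "UR", "PINC", "POV", "PCBLK"]])

result = sm.OLS(y, X).fit()
print(result.summary())   # coefficients, standard errors, 't' ratios, R squared, etc.
```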

The ‘t’ ratios shown here are the values of the coefficients divided by the corresponding

standard errors, which are in the middle column of figures. These are based on the assumption

that you want initially to test the null hypothesis of a zero coefficient. We can refer to this as the

default ‘t’ statistic, as it is the one every computer package will give us without asking for it.

Almost all published articles, by economists, will give this ‘t’ statistic. If they do not, they will

tend to give the standard error in brackets below the coefficient. In cases where the standard

error or ‘t’ statistic is not shown, it will be impossible for a reader to carry out any hypothesis

tests, on the individual coefficients, whatsoever. It is thus highly desirable that you do present

either, or both, of these in your written reports. Where the ‘t’ or standard error are not given

they may be replaced by the ‘asterisk’ system of giving a number of * beside the coefficients to

represent the critical significance level (usually 10, 5, 1 or 0.1 per cent) at which the

null is rejected. The ‘Corrected R squared’ is a reformulation of the R squared which we discuss

later in this chapter (Section 5.4).

The mean of the dependent variable is shown here but the means of the independent

variables have not been provided. They are readily available on any package, however; all you

need to do is ask for basic descriptive statistics. You will find considerable variation in practice

over descriptive statistics in published work. Sometimes means and standard deviations of all

the variables are given and sometimes they are not. If they are not given, it often makes it

impossible for the reader to make further interpretative calculations using the equations, such as
elasticities and predictions (both of which are discussed later in this chapter, in Sections 5.2

and 5.6 respectively).

The standard error of the regression is a formula which is closely related to the R squared. It

is given by the following formula:

√[ Σᵢ ûᵢ² / (n − k) ]    (5.2)

which is the square root of the sum of squared residuals after it has been divided by the number

of degrees of freedom in the regression where k stands for the number of parameters estimated

(including the intercept).
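As a quick numerical check (my own sketch, not from the text), formula (5.2) can be computed directly from the residuals:

```python
import numpy as np

def standard_error_of_regression(residuals, k):
    """Formula (5.2): the square root of the sum of squared residuals divided
    by the degrees of freedom n - k (k counts all parameters, intercept included)."""
    u = np.asarray(residuals, dtype=float)
    n = u.size
    return np.sqrt(np.sum(u ** 2) / (n - k))

# Check against Table 5.1: sqrt(839.292 / 43) is roughly 4.418, matching the
# reported Standard Error of the Regression of 4.41797.
```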

The sum of squared residuals is 839.292, as we move the decimal point two places to the right

because of the use of scientific notation in the results readout. We cannot say if this is telling us

anything about how well this set of variables predicts the behaviour of the burglary rate, as it is

dependent on units of measurement. Likewise, the standard error of the regression is not

informative about goodness of fit, so people will use the R squared statistics to judge goodness

of fit. In the present case, R squared is 0.36813, which says that 36.813 per cent of the variation

in the dependent variable (burglary rates), around its mean, is accounted for by the independent


variables. Obviously, 63.187 per cent of the variation is not accounted for and can be attributed

to omitted variables, errors in the data and other errors in the specification of the model.

The Durbin–Watson test relates to one of the assumptions of the CLRM (that the u terms

are not covariant with u’s in other periods) and will be covered in Chapter 9. It is given

automatically in many packages, although there are some situations, such as that in Table 5.1,

in which it will be a meaningless statistic for the type of data involved (see Chapter 9). This

output did not report the so-called ‘F ratio for the equation’ with its degrees of freedom for the

default hypothesis test assumed by the package. This, and the alternative ‘F’ tests you might

want, are explained in Section 5.3 of this chapter.

Any package will provide you with a number of results that do not greatly interest you as well

as those you do want. This particular table did not provide the sum of residuals, which you may

well find on the package you are using. This should be a very small number given that we saw in

Chapter 3 that the OLS technique constrains the sum of residuals to be zero. They will not be

exactly zero due to small rounding errors on the computer. Therefore, any figure you see given
for the sum of residuals is likely to be in scientific notation, with E followed by a minus sign and
an exponent, indicating a number very close to zero.

This results table does not show the degrees of freedom for the ‘t’ test and it shows the

number of observations written in words rather than simply giving a figure for n (sample size).

The degrees of freedom for the ‘t’ tests will be n − 8, in this case being 43. The critical value

for ‘t’ at 43 degrees of freedom at the 5 per cent level on a two-tailed test is between 2.021 and

2.014. On a one-tailed test, the critical value is between 1.684 and 1.679.
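If you prefer not to interpolate from printed tables, these critical values can be obtained from any statistical library; a small sketch using scipy is:

```python
from scipy import stats

# Critical values for 't' with 43 degrees of freedom at the 5 per cent level
two_tailed = stats.t.ppf(1 - 0.025, 43)   # about 2.02
one_tailed = stats.t.ppf(1 - 0.05, 43)    # about 1.68
```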

The model estimated in Table 5.1 could be related to the rational choice theory of criminal

activity put forward by Becker in 1968 and the subject of econometric work by numerous

economists, including the published studies of basketball and ‘skyjacking’ reviewed in this

chapter (see Section 5.7). Most economists expect that there will be substitution effects from

the risk of punishment for a crime and hence the parameter on CRBUR should be negative.

CRTHF and CRROB represent the prices of crimes, which will be either substitutes or

complements, and hence their coefficients will be indeterminate. The remaining variables are, in

the case of UR and PCBLK, expected to have positive coefficients as they proxy labour market

conditions. The PINC variable is problematic: most writers suggest that it should have a

positive coefficient if it is understood to be a measure of the expected rate of return for a

burglary. However, it could represent the opposite in that higher wealth means that people

have, on average, less incentive to steal. The model does not include variables for the length of

the prison sentence for burglary.

We can summarize the above discussion in a list of expectations:

b0 ?   b1 ?   b2 < 0   b3 ?   b4 > 0   b5 ?   b6 > 0   b7 > 0    (5.3)

Some of the coefficients in this table are very small numbers. In particular, the PINC

coefficient rounds to 0.000935. Such small coefficients may represent a small relationship

between the two variables but quite often, as in this case, it will represent the fact that the left-

and right-hand side variables are measured in very different units. PCBUR is measured in the

range of quite small fractions of 1, while PINC is in the order of thousands of dollars.

There is some element of doubt as to whether these coefficients should be the subject of two-

tailed or one-tailed tests. Looking at these results, we can see that this is not a problem as the ‘t’

ratios are all extremely weak in terms of statistical significance. The largest ‘t’ ratio is only


1.5164 (on the PCBLK variable). This is of the correct sign but it falls short of the 5 per cent
critical value on a one-tailed test. In summary then, this is a regression which performs poorly in terms

of supporting the usual theoretical model, as it is not possible to convincingly reject the null for

any of the coefficients. In a typical article by economists, this is likely to be expressed in the

more loose form: ‘these variables are not significant’.

5.2 TELLING THE STORY OF THE REGRESSION COEFFICIENTS: THE USE OF ELASTICITIES

Let us now focus on the items in the regression output that do interest us, and which we already

have some experience of from Chapter 3. The first thing we should look at is the regression

coefficients themselves. You will notice that the package producing Table 5.1 has given the full

set of information for the constant (intercept term). Some published work does drop this term

in reporting the results (for example, studies from the world’s leading economic journal

examined in Chapter 7) which may seem to be of little interest given that we do not have any

particular interest in hypotheses about this coefficient. However, its absence means that we are

unable to construct predictions based on assumptions about changes in the level of the

independent variables.

Let us turn now to the coefficients for the so-called ‘explanatory variables’. We shall apply the

method of checking the three S’s.

SIGN
Do we have the correct signs? In Table 5.1 CRBUR, POV and UR seem to have the ‘correct’

signs: that is, they are negative as the theory most people would put forward would suggest.

The other variables either have the wrong sign on their coefficients or are cases where the

prediction is ambiguous (that is, we cannot readily tell whether the sign is negative or positive).

SIZE AND SIGNIFICANCE
Unfortunately, regression coefficients by themselves do not tell us very much. There are two

reasons for this:

• They are not the ‘true’ values. They are just the most likely values to occur. You need a

hypothesis test in order to establish the distance of this central value from zero. You will be

hoping to find a coefficient that is statistically significant and of the correct sign except, of

course, in the case where you are engaged in ‘confrontational’ hypothesis testing where you

include a variable to demonstrate that it is not influential.

• Even if such a hypothesis test brings us the result we are looking for, we still face the problem

that the numerical value of the coefficient may not mean very much. To put it bluntly, we

could quite easily produce a regression in which there is a very economically significant

relationship between the dependent and independent variable (i.e. small changes in the latter

produce large changes in the former) but a very small coefficient and vice versa. This can

happen because the estimators produced by OLS are dependent on the units of measurement

in the variables we have chosen to enter into the estimation routine. The coefficient of a


regression only tells us the estimated impact of a one-unit increase in an independent variable

(in the units in which it is measured) on the dependent variable in the units in which it is

measured.

If we are going to present our work to others to convince them of the importance of the

relationships we have identified then we are going to need some measures that do not depend

on units of measurement, as well as convincing hypothesis tests. To obtain such ‘unit free’

estimators there are three approaches you could use:

(i) Make sure the variables are in units which can provide meaningful measures of impact

before you do the estimation. There are a few cases where this is easy. Take the simple

consumption function as shown in Table 3.1. If we regress per capita consumption on per

capita income, then the resulting coefficient will be the estimated marginal propensity to

consume.

(ii) Use beta coefficients as alternatives to regression coefficients. These are not popular with
economists, although they are found in other social sciences, particularly those that use
‘Path Analysis’, which is a multiple equation system (related to those we discuss in
Chapters 12 and 13). Beta coefficients are routinely printed alongside the regression
coefficients in the SPSS program. A beta coefficient is obtained by multiplying the regression
coefficient by the ratio of the standard deviation of the independent variable to that of the
dependent variable (a short sketch of this calculation appears after this list). This takes a value
between −1 and +1 and can be used to rank strength of relationships; that is, if we had six
independent variables and the beta coefficient of the third was twice that of the fourth and
four times that of the fifth we would know that their impact goes in this order. However, we
could not say the relationship is two times or four times as much.

(iii) Use elasticities. This is by far the most popular approach among economists for summariz-

ing the numerical strength of individual relationships estimated from a regression.
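As promised in point (ii), here is a minimal sketch (my own illustration, not from the text) of the beta-coefficient calculation:

```python
import numpy as np

def beta_coefficient(b, x, y):
    """Standardised ('beta') coefficient: the OLS slope b rescaled by the ratio
    of the standard deviation of the regressor x to that of the dependent
    variable y, making it unit free."""
    return b * np.std(x, ddof=1) / np.std(y, ddof=1)
```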

The notion of elasticity is familiar to anyone who has ever studied economics but the idea of

elasticity itself is merely a form of measurement, which has no economic content as such. The

elasticity is a unit-free form of measurement because we divide the percentage change in one

variable by the percentage change in another variable. We usually make the presumption that

the variable on the top line is the dependent variable and that on the bottom is an independent

variable. There are three elasticities for the equation estimated in Chapter 3 and shown in Table

3.4. The elasticity of demand with respect to prices (price elasticity), the elasticity of demand

with respect to income (income elasticity) and the elasticity of demand with respect to the take-

up of tertiary education (education elasticity). By the same reasoning, there are seven elasticities

for the supply of burglary in the equation in Table 5.1.

The formula for point elasticity can be written as follows:

Elasticity of Y with respect to X = (∂Y/∂X) · (X/Y)    (5.4)

which simplifies to the estimated regression coefficient (b̂) multiplied by X/Y. The regression coefficient,

in a linear equation, will be the partial derivative of Y with respect to X. Y and X values will vary

as we move along the fitted regression line although, until we reach Chapter 6, the estimated b

value does not change as the levels of X and Y vary.


We have seven point elasticities in the present model:

�PCBUR=�UR:(UR=PCBUR) (5:5a)

�PCBUR=�PCBLK :(PCBLK=PCBUR) (5:5b)

�PCBUR=�CRBUR(CRBUR=PCBUR) (5:5c)

�PCBUR=�XCRROB(CRROB=PCBUR) (5:5d)

�PCBUR=CRTHF :(CRTHF=PCBUR) (5:5e)

�PCBUR=�PINC :(PINC=PCBUR) (5:5f)

�PCBUR=�POV :(POV =PCBUR) (5:5g)

To calculate an elasticity, then, we have to set values of X and Y. We could choose any X and

Y, although it would not really make sense to calculate an elasticity that does not lie on the

estimated function. As well as this, we might want an elasticity that sums up in one figure the

size of the relationship between two variables. The obvious candidate for this is to evaluate the

elasticity at the means of the sample data. If you see an article in which an elasticity is produced

from a linear equation then it has been calculated in this way. It may be described as an

‘elasticity at the means’. There have been some econometrics packages (such as SHAZAM),

which give the elasticity at the means in the tabulated output. If your package does not give the

means of all variables (Table 5.1 only gives the mean of the dependent variable not the

independent variables), then you can usually obtain the means of data as one of the options for

descriptive statistics on a regression command.

As an illustration of the calculation, I have gone back to the data set used in Table 5.1 and

obtained the mean of CRBUR as 15.53137. Using this and the figures in the table gives a

burglary own clear-up rate elasticity of

−0.34167 × (15.53137/14.88223) = −0.3565731
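In code, the same ‘elasticity at the means’ calculation, using the figures quoted above, is simply:

```python
# Elasticity at the means for CRBUR, using the figures quoted in the text
b_crbur = -0.34167       # estimated coefficient from Table 5.1
mean_crbur = 15.53137    # sample mean of CRBUR obtained from the data set
mean_pcbur = 14.88223    # mean of the dependent variable from Table 5.1

elasticity = b_crbur * mean_crbur / mean_pcbur   # roughly -0.3566
```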

At this stage we perhaps should point out some important facts, in case there is any

confusion:

• It is just a coincidence that the elasticity is similar to the point estimate; a coincidence which

is due to the units of measurement of the Y and X variables being on a similar scale in this

case. You will see examples in Chapter 7 where the elasticity is quite different from the point

estimate.

• Don’t become too impressed by the false impression of accuracy given by the large number of

decimal places to which the calculations are taken. I have simply used the full estimate

reported by a computer package to the number of decimal places it uses as a default.

• The simple method of multiplying regression coefficients by the ratios of the means of two

variables is only applicable to the linear estimating equation. The correct formulae for other

functional forms are to be found in Chapter 7.


5.3 THE CONSTRUCTION AND USE OF ‘F’ TESTS IN REGRESSION

As we saw in Chapter 2, the ‘F’ test is formed by calculating the ratio of two variances, v1 and

v2, where v1 is greater than v2, and comparing the ratio v1/v2 with the critical value for the
appropriate degrees of freedom. The degrees of freedom will depend on the sample size and

the number of regression coefficients estimated. In analysis of regression, we use the ‘F’ ratio to

carry out tests that involve more than one parameter. The ‘t’ test is limited to being able to test

only hypotheses concerning one parameter at a time. There are circumstances where we want to

test several hypotheses simultaneously. One example is where we are concerned with sums of

parameters, as in the case of a production function where we want to test the hypothesis of

returns to scale. To deal with this and more complicated cases of multiple hypothesis testing we

have to use the ‘F’ test. Any ‘F’ test for a regression is based on the notion of there being two

regressions, which give us the two different variances to be compared. One regression is called

the restricted regression while the other is called the unrestricted regression. The difference

between the two is due to our choice of restrictions. We have already been making restrictions

when we did ‘t’ tests. The default two-tailed ‘t’ test is based on imposing the restriction that a

parameter equals zero. Any restriction imposed on an equation must lead to a rise in the sum of

squared residuals compared with the original unrestricted model. Thus, the restricted model

must have a higher variance of its residuals than the unrestricted model. If the ratio of the two is

not significantly different from 1, then it appears that the model has not been seriously affected

by the restrictions in terms of its goodness of fit/ability to ‘explain’ the dependent variable. The

general formula for F in regressions is:

F = [(Sr − S)/g] / [S/(n − k)]    (5.6)

where g and n − k are the degrees of freedom we use to look up the critical value of the ‘F’

ratio in Appendix 4. Sr is the residual sum of squares from the restricted regression and S is the

residual sum of squares in the unrestricted regression. n is the sample size, k the number of

parameters estimated in the unrestricted regression and g the number of restrictions imposed in

moving from the unrestricted form to the restricted form. The number of restrictions in a ‘t’

test is only one. There is no need for an ‘F’ test and if we did an ‘F’ test for such a case we
would always find that the calculated ‘F’ ratio is exactly equal to the square of the corresponding
‘t’ ratio. We would always find exactly the same result doing the hypothesis test using the ‘F’

ratio as we would using the ‘t’ ratio.
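A small helper function makes the mechanics of formula (5.6) explicit; this is an illustrative sketch rather than anything reproduced from the text:

```python
def f_statistic(ssr_restricted, ssr_unrestricted, g, n, k):
    """Formula (5.6): F = [(Sr - S)/g] / [S/(n - k)], where g is the number of
    restrictions and k the number of parameters in the unrestricted model."""
    numerator = (ssr_restricted - ssr_unrestricted) / g
    denominator = ssr_unrestricted / (n - k)
    return numerator / denominator
```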

‘F’ FOR THE EQUATION
The specimen computer package table shown in Table 6.2 in the next chapter shows an ‘F’

statistic of 0.57685[0.457], while that of the regression in Table 6.4 gives a figure of

460.9527[0.000] for the same statistic without telling us what it is supposed to be testing. This

is because all packages, which give an ‘F for the equation’ assume a default hypothesis of the

following form:

H0: b1 = b2 = b3 = … = bg = 0    (5.7)


against the alternative that these coefficients are not all simultaneously zero.

The g here represents the total number of parameters, other than the intercept, which have

been estimated. Many published studies also report this statistic without further information,

although some will call it ‘F for the equation’. This tests the hypothesis that all parameters on

the explanatory variables are simultaneously zero. This null hypothesis is stating that the

regression has no explanatory power whatsoever. The restricted equation is simply a regression

of Y on the intercept. In this case, g will equal k − 1 where k is the number of parameters but

not in any other case. For this case, Sr will be the sum of squared deviation of Y from its mean.

This means that this F test can be calculated from the R squared of the unrestricted regression

as follows:

{R²/(1 − R²)} · {(n − k)/(k − 1)}    (5.8)

This formula cannot be used for any other null hypotheses about multiple restrictions.

The larger the goodness of fit (R²), other things being equal, the more likely the null is to
be rejected. This does not guarantee that a large R² will reject the null, as small degrees

of freedom for the model could lead to the acceptance of the null. The degrees of freedom for

the example given in Table 5.1 will be (7,43). The computer output in Table 5.1 did not

actually give the F for the equation but using the formula above, it is:

(0.36813/0.63187) × (43/7) = 3.5789 (rounded)

The critical value at (7,43) d.f. is between 2.20 and 2.25 at the 5 per cent level, and therefore

we can reject the null that the coefficients in the equation are jointly insignificant. You are

unlikely to ever have to do this calculation as most articles will inevitably give you the ‘F’

statistic, as will computer packages. Computer packages will also tend to report the critical

significance level, meaning you will not need to look up the ‘F’ tables for this particular

hypothesis test.
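As a sketch of how this could be checked on a computer (the figures are the Table 5.1 values used above), formula (5.8) and the corresponding critical value can be obtained as follows:

```python
from scipy import stats

r2, n, k = 0.36813, 51, 8                         # Table 5.1 values
f_value = (r2 / (1 - r2)) * ((n - k) / (k - 1))   # formula (5.8), roughly 3.58
critical_5pc = stats.f.ppf(0.95, k - 1, n - k)    # 5 per cent critical value, roughly 2.2
p_value = 1 - stats.f.cdf(f_value, k - 1, n - k)  # the 'critical significance level'
```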

Finding that the ‘F for the equation’ is highly significant is not necessarily a good reason to

get excited because it is not demonstrating that the model is particularly useful or powerful.

Our focus should always be on the plausibility and significance of the individual coefficients as

this is what distinguishes a subject-based application of statistical methods (whether in
econometrics, psychometrics or biometrics) from purely automated line fitting. It is even possible that we

could have an equation in which none of the coefficients is statistically significant yet the ‘F’ for

the equation strongly rejects the null. The results in Table 5.1 are tending towards this

situation. The extreme case of this is symptomatic of multicollinearity, which will be explored in

Chapter 10.

On the other hand, finding ‘F for the equation’ to be insignificant at say the 5 per cent level

would worry most economists, as it implies the so-called ‘explanatory’ (i.e. independent)

variables in our model are not doing any ‘explaining’ at all. However, we should note that the

acceptance of the null from an insignificant ‘F for the equation’ does not necessarily conflict with
findings of statistical significance for some of the individual coefficients in the model.

This could, for example, come about if we had included a large number of irrelevant variables,

which bring down the degrees of freedom.

The use of ‘F for the equation’, that we have just encountered, is effectively a test of the null

hypothesis that R2 is zero against the one-sided alternative that it is positive.


‘F’ TEST FOR EXCLUSION/INCLUSION OF TWO OR MORE VARIABLES
Let us say that we have instead the following null:

H0: b2 = b3 = 0    (5.9)

in a model such as:

Y = b0 + b1X1 + b2X2 + b3X3 + u    (5.10)

When we substitute the restriction in (5.9) into equation (5.10) we get equation (5.11):

Y = b0 + b1X1 + u    (5.11)

We use formula (5.6) for this case, once we have estimated the regression on X1 only, and

the unrestricted regression, and retrieved the sums of squared residuals from

the output. Most articles will not reproduce the sum of squared residuals in the output. If you

are not given the residual sum of squares but are given the R squared for regressions in which

some are ‘nested’ within another (i.e. you can get from the more general equation to the other

ones by simply deleting some variables), then you can test the significance of the exclusion of

variables using the formula below, which is equivalent to Equation (5.6) for this case:

{(R₂² − R₁²)/(1 − R₂²)} · {(n − k)/g}    (5.12)

where the 1 subscript refers to the smaller R squared (i.e. for the restricted equation) and the 2

subscript to the larger R squared (i.e. for the unrestricted equation). You should make sure you

are using R2 and not adjusted R2 which is reported in some articles rather than the unadjusted.

You would need to convert adjusted back to unadjusted if only the former was given.

We have just looked at a test of the null hypothesis that the change in R2 is zero against the

one-sided alternative that it is positive. Most packages provide options for this test in terms of

‘variable addition’ or ‘variable deletion’ tests once you have instructed the package which

variables are to be deleted or added. A specimen of this kind of output, from Microfit, is shown

in Table 7.2 in Chapter 7, which shows the use of forecast dummy variables. This test is

equivalent to testing that the change in R2 between the larger and smaller models is greater

than zero. In packages aimed at the general social scientist (such as SPSS) you may find the

program offers the option of doing the test described in this way.

For the moment, let us illustrate the variable deletion/addition ‘F’ test using the burglary

equation from Table 5.1. I re-estimated this equation with the three punishment variables

(CRBUR, CRTHF, CRROB) excluded, which gave an R squared of 0.18969 and a sum of

squared residuals of 1076.31. Following the formula in Equation (5.6) we get:

[(1076.31 − 839.292)/3] / (839.292/43) = 4.048 (rounded)

You should get the same answer using the formula in Equation (5.12), that is:

[(0.36813 − 0.18969)/(1 − 0.36813)] × 43/3.


As the degrees of freedom are 3,43, the null hypothesis will be rejected at the 5 per cent level

because the critical value is between 2.84 and 2.76 and these are the boundaries given by (3,40)

and (3,60) degrees of freedom. You can easily check that the null will be accepted at the 1 per

cent level by looking up the ‘F’ tables.

This result implies that the inclusion of these variables has made a significant difference to the

extent to which the model ‘explains’ the burglary rate. You might notice a possible contra-

diction here. The individual ‘t’ ratios for these three variables are not significant but the test on

all three simultaneously is significant.

Why would we want to do this kind of ‘F’ test? There are two ways of looking at this: in terms

of model reduction or model expansion. The reduction idea uses tests to get down to the most

efficient model by deleting unnecessary variables that might have been included. This is the idea

behind the ‘general to specific’ modelling strategy pioneered by the Scottish econometrician

David Hendry. His work was directed against a tendency which had sprung up in applied

econometrics to go from the ‘specific to the general’. That is, write down what seems like a

reasonable model then keep adding bits to it to see if you can make it better. Once this process

has been finished an ‘F’ test for variable addition might be used to check whether the collection

of ‘add ons’ has made a statistically significant contribution as a block or group. One might

view the ‘F’ test on the results from Table 5.1 as a test that punishment variables matter as we

have deleted them all from the model. In a similar vein we could construct a basic economic

model of demand and then add ‘sociological’ taste variables to it and use an ‘F’ test on the

block of these to see if they merit inclusion on purely goodness of fit grounds.

AN ‘F’ TEST FOR THE RESTRICTION THAT THE PARAMETERS ARE EQUAL ACROSS TWO SAMPLES
We have just looked at two variations on the use of the ‘F’ ratio to test hypotheses about

multiple parameter restrictions in regression. The first of these was a special case of the second,

where the restriction involved deleting all explanatory variables from the model. In

the first case, we did not literally have to run two regressions as we could use the R squared

from the unrestricted regression. We now look at a test, often called the ‘Chow test’ (after the

American economist Gregory Chow), which requires you to run two separate regressions on

different parts of the sample for exactly the same equation, as well as the full sample

regression. Having said this, it should be pointed out that the process can be simplified to one

of running just two regressions once we have learnt the technique of dummy variables (see

Section 7.5 in Chapter 7).

Here we modify formula (5.6) as follows:

(a) You must run three regressions – one for each sample and one for the two samples

combined.

(b) The sum of the squared residuals from the whole sample regression is to be used as Sr (the

restricted sum of squared residuals); you must add the sum of squared residuals from the

other two regressions together to get S (the unrestricted sum of squared residuals).

(c) g in this case should be the total number of parameters in the equation including the

intercept, that is, k.

(d) n − k should be replaced by n − 2k where n is the number of observations in the whole

sample.


Why would we go to the bother of doing a test like this?

• It might be considered a good idea to test whether the sample has constant parameters. In the

absence of a prior hypothesis about how the parameters might be shifting, we could simply

divide the sample in two – if it is a time series we might as well divide it into a first and

second half. This is the use of the ‘F’ ratio as a stability test. If the null were rejected then we

would be concluding the parameters are not stable, as they have shifted from one part of the

sample to the other.

• We might have a sample which is constructed by pooling a number of samples which might

give us prior reason to suspect that parameters are not equal across all samples. For example,

we might have pooled a sample of men and women, in which case the ‘F’ test is a test as to

whether the functional forms for men and women are completely identical. Note that because

this is a ratio test, it can only be used to test the pooling of two samples at a time. The

stability test is also only concerned with the overall effect of the parameters, not instability for

each individual parameter. Given this, it seems better to use the form of the ‘F’ test, using

slope and shift dummies, that is presented in Chapter 8.

EXAMPLE OF HOW TO DO A CHOW TEST

Here is an example of how to do a Chow test using a simple equation to explain the rates of

death from motor vehicle accidents, in the USA, using a cross-section of data from 1980. If

you wish to replicate this, the data is in the Excel file USADAT.XLS. The variable DEAD does

not appear in this data set. You would need to create it by dividing vad (Vehicle Accident

Deaths) by mvr (motor vehicle registrations). The three variables used are:

DEAD = deaths in motor accidents per registered driver

AVSPD = average driving speed in miles per hour

PINC = median per capita income.

The results are shown in Table 5.2. I have divided the full sample of 42 observations (42

states for which data were available) into sub-samples of 19 and 23. This was done purely

arbitrarily as a stability test (the sample of 19 is the first 19 observations on the data file and

that of 23 is the second 23) not based on any prior hypothesis about what might cause the

parameters to differ between sections of the sample.

Before we do the ‘F’ (Chow) test for stability, a little bit of economic theory might be called

on to justify such an equation. Vehicle deaths would be regarded here as a choice variable

resulting from the private decisions of drivers about risk taking. The speed of driving might

also be seen as a choice and therefore might strictly speaking be seen as endogenous (but we

do not deal with this type of problem until Chapters 12 and 13). Faster driving by other

motorists would lead us to expect a higher risk, therefore predicting a positive coefficient on

the AVSPD variable. There is quite a large literature on this subject, in which some economists

(see, e.g., Lave, 1985) have argued that speed, as such, is not the crucial factor (at least not

within the range normally observed on highways), rather it is the variation in driver speed


which causes accidents because it represents the number of opportunities where there is an

accident risk.

The other variable here, per capita income, measures roughly the value of time to a driver,

in that higher wage rates mean a greater opportunity cost of consumption and production

opportunities foregone. The coefficients in this regression show, for AVSPD the impact of

drivers driving, on average, one mile per hour faster on deaths per registered driver; and for

PINC, the effect of a one-dollar increase in median income on the number of deaths per

registered driver. The coefficients are extremely small because of the units of measurement

chosen.

The sum of the sum of squared residuals from the split samples is 0.115562. The top line of
the formula (5.6), as modified above, [(Sr − S)/g], becomes (0.088208/4) = 0.022052. The bottom line [S/(n − 2k)]
is 0.115562/34, which equals 0.003399. Therefore, the ‘F’ value is:

(0.088208/4) / (0.115562/34) = 6.4878

The degrees of freedom, at which we should look up the critical value, are 4 and 34. At the

5 per cent level this gives a value somewhere between 2.34 and 2.42. We would then reject

the null at the 5 per cent level in favour of the alternate hypothesis that the equation is

‘structurally unstable’, in the sense that its parameters have shifted between one part of the

sample and the other.
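For readers who want to automate steps (a) to (d) above, a minimal sketch of the Chow calculation is given below; it is an illustration rather than code from the text:

```python
def chow_f(ssr_pooled, ssr_sample1, ssr_sample2, n, k):
    """Chow test following steps (a)-(d): Sr is the whole-sample SSR, S is the
    sum of the two sub-sample SSRs, g = k restrictions (all parameters,
    including the intercept) and n - 2k denominator degrees of freedom."""
    s = ssr_sample1 + ssr_sample2
    return ((ssr_pooled - s) / k) / (s / (n - 2 * k))
```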

5.4 ADJUSTING THE R SQUARED: HOW AND WHY?

In Chapter 3 we met the R squared statistic as a summary of the goodness of fit of a model. As

such, it is a summary of the ‘within sample’ variation in Y ‘explained’ by the linear combination

of X’s. It is not a hypothesis test, nor does it necessarily indicate the forecasting power of a
model, which is examined later in this chapter. Some version of the R squared is reported in almost

every published study using a multiple regression equation. There seems to be no hard and fast

rule whether a paper will present the R squared or the adjusted R squared, or indeed, both.

Table 5.2 OLS equations for the motor vehicle death rate in the USA in 1980

Equation                  (1)              (2)              (3)
Dependent variable:       DEAD             DEAD             DEAD
Intercept                 -0.83593         7.64293E-002     -1.30151
                          0.49540          0.47222          0.53941
AVSPD                     2.38871E-002     8.47768E-003     3.19664E-002
                          8.77144E-003     1.06294E-005     9.14286E-006
PINC                      -1.34079E-005    -2.48367E-005    -5.77049E-006
                          9.07173E-006     1.06294E-005     9.14286E-006
R squared                 0.20377          0.33281          0.35993
SSR                       0.22310          2.59454E-002     8.96172E-002
N                         42               19               23

Note: The figures below the coefficients are the standard errors.


In the eyes of the average researcher it seems that a large R squared seems to be a good thing,

and a small one a bad thing. However, many factors should lead us not to jump to such a hasty

conclusion. Before I deal with these, let us look at the adjusted or ‘corrected’ R squared, which

was included in Table 5.1. You may see it written in words as ‘R bar squared’ to represent the

fact that it is normally written exactly as its unadjusted counterpart but with a bar on the top of

it. This makes it look exactly like it is the arithmetic mean but you should not be confused by

this. The use of a bar is just a convention, which economists have adopted to distinguish the

adjusted R squared from the unadjusted R squared.

The form of adjustment, or correction, is performed to take account of the fact that when we

add variables to a regression equation we lose degrees of freedom. The formula is often

described as incorporating a ‘penalty’ for the loss of degrees of freedom. The formula is as

follows:

R̄² = 1 − {[(n − 1)/(n − k)] · (1 − R²)}    (5.13)

The unadjusted R squared has the property that it will tend towards one as we add variables
or decrease the sample size. It is impossible for the R squared to fall when we add more variables,
as each new parameter estimated cannot increase the sum of squared residuals. The

as each new parameter being estimated will reduce the size of the sum of squared residuals. The

formula for adjusted R squared does not have this property. When we add variables to a model,

this statistic can either rise, fall or stay the same. If we are adding variables one at a time, then

the R squared adjusted will rise so long as the ‘t’ value on entry is greater than 1 in absolute

value. It may be noted that, given the conventions we use in social sciences in general, this

means that adding ‘statistically insignificant variables’ can increase the R squared adjusted.
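A one-line function reproduces formula (5.13); as a check (my own illustration), it returns the ‘Corrected R-squared’ reported in Table 5.1:

```python
def adjusted_r_squared(r2, n, k):
    """Formula (5.13): R-bar squared = 1 - [(n - 1)/(n - k)] * (1 - R squared)."""
    return 1 - ((n - 1) / (n - k)) * (1 - r2)

# Check against Table 5.1: adjusted_r_squared(0.36813, 51, 8) gives roughly 0.2653,
# matching the reported Corrected R-squared of 0.26527.
```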

So, this is how you adjust R squared and almost any computer package will give you this

statistic alongside the ‘raw’ or unadjusted R squared. Why would you prefer this statistic to the

corresponding unadjusted measure of goodness of fit? Writers of econometrics textbooks have

taken the view that it is a means of preventing you from falling into bad habits. Specifically from

committing the crime of ‘data mining’. For example, Hebden (1981, p. 112) says that it

‘prevents the researcher from thinking he is doing better just by putting in more variables’,

while Gujarati (1988, p. 135, quoting from Theil, 1978) says ‘R squared tends to give an overly

optimistic view of the fit of the regression’.

These remarks are written from the perspective that inexperienced researchers or those with a

sketchy statistical knowledge might become seduced into ‘playing a game’ of trying to

maximize R squared simply by desperately searching for variables to put in. If they are doing

this but are directed towards the target of a maximum adjusted R squared instead, then the

game becomes one of searching for variables which enter with an absolute ‘t’ of more than 1. This does

not seem a great improvement. The best thing would be if you are encouraged to remember

that your statistical model should be based on a well-specified theoretical model. It should not

be an exercise in ‘fitting up’ the data. This leads us to the warning of the next section.

5.5 BE CAREFUL WITH ALL R SQUAREDS

It is clear that you should not imagine that guiding your efforts by the criterion of size of R

squared adjusted is any better than that of R squared unadjusted. You should be careful in your

handling of all R squareds. It is tempting to think that since it is a ‘goodness of fit’ statistic, and


a good fit is a desirable thing, that a large value of R squared shows high quality research and a

low value shows low quality research. However, this is something that you should not think.

The R squared is not a hypothesis test. It is a descriptive statistic that derives from the sum of

squared residuals and as such it will be maximized automatically by OLS for a given sample and

specification of variables and functional form. Although you will read people remarking on the

fact that their R squared is quite good or fairly good and so on, this is not particularly

meaningful in the absence of some hypothesis about how large it might have been expected to

be. There are a number of factors which influence the size of R squared, some of which can be

deliberately varied and which are not a reflection of the success of researchers in producing a

‘good’ model. The following list of these makes many references to topics that are only

introduced in the later chapters of the book. You may wish to come back to this list when you

have finished the book.

(i) The type of data matters. We discussed the different types of data in Chapter 2. Only in

the case where the dependent variable is ratio/continuous in nature, measured in scalar units,

and there is an intercept will the R squared formula have any sensible meaning. Beyond this, the

source of the data matters. Given exactly the same underlying model, we would normally find

that the size of the R squared tends to be largest in aggregate time-series data. R squareds of

near 1 are not at all rare in such studies, especially of standard macroeconomic functions like the

consumption function (as you can see in Tables 3.1 and 7.8) and the demand for money. The

next largest R squareds tend to occur in aggregate cross-section data. Typical values for studies

across regions or nations might be in the 0.4–0.7 zone. The lowest values will tend to be found

in disaggregated individual level samples such as those from large interview studies. If you were

asked to review the literature in a subject area, it would be a big mistake to give more

prominence to the time-series studies than the cross-section studies just because they had larger

R squareds.

(ii) Basic errors of measurement in the data may lead to a low R squared. The simplest case to

imagine is a high variance in the dependent variable due to errors of measurement which will

make it harder to predict and thus have a lower R squared than otherwise. If we were comparing

identical models estimated in two different countries, by two different researchers, there is no

reason to suppose that one researcher is better than the other because of an R squared

difference due to measurement error.

(iii) Too little variation in the data may lead to a low R squared. This might seem like the

opposite problem to the last problem. Let us think of a specific example. Say you were

estimating an equation to explain the charitable donations of the same group of suburban

dwellers, with identical tastes and similar incomes over a period of time. Donations may simply

not change enough for any kind of meaningful regression model to be estimated and hence a

low R squared would be obtained. The low value here may be telling us that we need to go and

get a more appropriate set of data rather than abandoning the specified equation as a failure. In

this case, there is very little variation around the mean of the dependent variable and you are

likely to get a large ‘t’ ratio on the intercept and a very low R squared.

(iv) The R squareds for equations where the dependent variable is measured on different
scales of units (such as logarithmically; see Chapter 6) are not comparable. There is an even more

serious problem when weights are used (for example, to deal with heteroscedasticity; see

Chapter 11), in conjunction with a logarithmic dependent variable as the R squared can be

varied by changing the weight.


(v) There are circumstances where the conventional R squared and adjusted formulae are

inappropriate, such as when there is no intercept in the regression or the equation forms part of

a system of equations (see Chapter 15).

(vi) The specification of the dependent variable matters. There is often more than one way we

could define the dependent variable of an equation. For example, we might or might not divide

it by some measure of population size. In the macroeconomic literature there are parallel

literatures on the savings ratio and the consumption function, even though these are linked

through the underlying C + S = Y identity. Consumption function studies are in levels such as

the results you have seen in Table 3.1. Switching to a ratio dependent variable (C/Y) would

imply an underlying consumption function in levels that is non-linear. Leaving this issue aside,

we may note that an equation to explain C/Y is the same model as an equation to explain S/Y

and an equation to explain S in terms of Y is the same model as an equation to explain C in

terms of Y. This is because of the national income identity. The R squared will not be the same

because C is a much bigger fraction of Y than S is. In the case of the Belgian consumption

function (with intercept) shown in Chapter 3, Table 3.1, the R squared falls to 0.96997 (the

re-estimated equation is not reproduced here) if we use S instead of C as the dependent variable.

This is not a big change in the present case but other data and model set-ups may produce more

dramatic differences with such changes in the choice of dependent variable.

(vii) A high R squared may be due to the inclusion of variables that could well be argued to

be not worthy of being termed ‘explanatory’. One example of this is a model with seasonal

dummy variables in it (see Section 7.4 in Chapter 7). If it turns out that a large part of the R

squared is due to these seasonal adjustments then we are, in effect, ‘explaining’ the dependent

variable by the simple fact that the dependent variable has quite different averages across the

season. This would be exposed if the sample was divided up and separate equations were

estimated for each season, as the R squared would then fall drastically if the more ‘genuine’

explanatory variables, grounded in some kind of theory, are not very influential. The point I am

making here applies to all types of dummy variables being used to combine different samples.

For example, if we had 20 country cross-section demand equations with 19 dummy variables to

represent the countries and these 19 variables are largely the source of the size of the R squared

then the ‘explanatory’ power of a model is somewhat dubious.

(viii) Correlated measurement errors may produce a spuriously high R squared. This may be

illustrated in the case where the right-hand side variables contain the dependent variable. For

example, the crime model in Table 5.1 has an ‘explanatory’ variable (CRBUR) which has the

top line of the dependent variable as its bottom line. This makes the errors in the variable

correlated with the u term in the equation thus breaking one of the classical assumptions. This

may force up the R squared artificially although it has the far more important problem of

causing bias in the parameter estimates. Chapter 13 discusses how we deal with this problem.

(ix) It is possible to get a very high R squared when there is no relationship whatsoever, in

time-series data, in the case of a ‘spurious regression’ caused by ‘non-stationary’ data, as will be

explained in Chapter 14.

5.6 BASIC ECONOMETRIC FORECASTING

Most packages will allow for forecasting or prediction and some measures of how accurate this

has been. As indicated in Chapter 1, prediction is a major use of the CLRM. It can be used by


businesses in an attempt to work out strategies to increase their profits. Governments use it to

attempt to figure out policies to control the economy.

It is very simple to extend an OLS regression equation into a forecasting model. We first

make the highly restrictive assumption that the parameters in the prediction sample will be the

same as in the estimation sample. Say we are advising a government by forecasting the future

rate of unemployment (UR) in the economy from an equation of the form:

UR = b0 + b1X1 + b2X2 + b3X3 + b4X4 + u    (5.14)

where X are explanatory variables and u is a classical disturbance term. You should be aware

that, in reality, a government is unlikely to use a single equation model to inform its decisions.

It is more likely to use models using many dozens of equations which draw on the techniques

developed in Chapters 12 and 13.

Let us assume Equation (5.14) has been estimated on annual data for the years 1980–2002

and is to be used to forecast the years 2003–2010. The model will yield numerical estimates of

each b, which can be tested against their null hypotheses with 18 degrees of freedom using the

‘t’ test. To predict the unemployment rates we simply take the value for b0 and add to it the

products of each b by the value of the appropriate X for the year in question. Clearly, these

forecasts will not be 100 per cent accurate and will over- or under-predict the value of UR in

each year. The differences are called forecast errors and should not be confused with the

residuals found in ‘within sample’ estimation. Forecast errors may reflect measurement error in

the ‘out of sample’ data but could also be signs of a shift in the parameters of the specified

equation or the importance of an omitted variable. As an example of the latter, suppose

employer confidence is an important factor in determining labour demand but no measure of

this appeared in the variables X1, X2, X3, X4, then movements in this would produce

fluctuations in UR that the model could not track.
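The mechanics of producing such forecasts are no more than multiplying the estimated coefficients by the assumed future X values and summing. The sketch below uses made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical figures only: b holds assumed estimates of b0..b4 from (5.14);
# each row of X_future is [1, X1, X2, X3, X4] for one out-of-sample year
# (the future X values would themselves have to be forecasted).
b = np.array([2.1, 0.5, -0.3, 0.8, 0.1])
X_future = np.array([[1.0, 3.2, 1.5, 0.7, 4.0],
                     [1.0, 3.4, 1.4, 0.8, 4.1]])

ur_forecast = X_future @ b              # b0 + b1*X1 + ... + b4*X4 for each year
actual_ur = np.array([5.9, 6.3])        # hypothetical out-turn values of UR
forecast_errors = actual_ur - ur_forecast
```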

You may have noticed an obvious problem that was overlooked in the above discussion – where can we possibly get data for future events that have not yet happened? The simple answer is, of course, that we cannot. The future values of the explanatory variables will themselves have to be forecast, whether by some form of guessing, expert opinion or the use of forecasts from other econometric equations.

Most interest in prediction in academic circles is concerned with forecasting in the literal

sense of trying to predict the future, but this is not the only way to use a regression equation for

prediction. We can ‘hold out’ part of a cross-section sample and use the retained portion to

estimate the parameters which are then used to see how well the fitted model explains the ‘hold

out’ sample. This is a very simple thing to do on a computer package. There are several reasons

why we might want to do this:

• It might be seen as good scientific method in protecting us from accusations of data mining.

That is, it is an antidote to the ‘data mining’ strategy of sifting through a collection of variables

until you get a model which maximizes the within sample R squared adjusted. In technical

terms, the use of a ‘hold out’ sample protects us from the problem of ‘pre-test bias’.

• It may be a test of the stability of the model. If it has systematic forecast errors then the

model may suffer from shifting parameters (assuming it is not due to omitted variables). The

Chow test given above is another way to do this.

• It may be of commercial usefulness. Take the example of a manufacturer engaged in quality


control. They could estimate a model to predict the rate of producing faulty goods as a

function of variables describing characteristics of the workers and the work situation. The

expected fault rate could be projected from the success of the estimated equation in predicting

the hold out sample.
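A minimal sketch of the hold-out idea, using simulated data and ordinary least squares via NumPy's least-squares routine; the sample sizes and coefficient values are invented for illustration only.

import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])    # constant plus three regressors
true_b = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ true_b + rng.normal(size=n)                           # simulated dependent variable

# Estimate on the first 80 observations and hold out the last 20.
est, hold = slice(0, 80), slice(80, None)
b_hat, *_ = np.linalg.lstsq(X[est], y[est], rcond=None)       # OLS on the estimation portion

forecast_errors = y[hold] - X[hold] @ b_hat                   # errors in the hold-out sample
print(forecast_errors.round(2))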

MEASURING FORECAST ACCURACY
A set of forecasts is never going to be 100 per cent accurate. Therefore, anyone employed to

make forecasts would like to find some way of judging the accuracy of a model. We now review

some basic forecast statistics. When you get to Chapter 7, a new technique using dummy

variables will be added to this collection. All methods of forecast appraisal start with the forecast

errors. The simplest thing to do with these would be to add them to get the sum of forecast errors. If the forecast were 100 per cent accurate, this sum would be zero (although a zero sum does not guarantee accuracy, since positive and negative errors can cancel out). So, other things being equal, the larger the sum of forecast errors is, the worse the predictive power of the model would be. The size of this statistic will be influenced by the units of measurement and by the

number of observations. Therefore, it would seem these need to be adjusted for in order to get

a more useful evaluation statistic.

The first adjustment to the sum of forecast errors which suggests itself is to make sure that

positive and negative errors do not cancel each other out. This leads us to our first statistic.

Forecast statistic 1: MAD [mean absolute deviation]
If we use the following formula:

$\text{MAD} = \sum_i |\,fe_i\,| \,/\, j$   (5.15)

where fe is the forecast error residual and j is the number of observations in the prediction

sample, then we will get the MAD (Mean Absolute Deviation). That is, take the absolute

prediction errors, add them up and divide by the size of the forecast sample.

The MAD, however, is still dependent on the units of measurement, so it is not easy to judge forecast accuracy simply by looking at it.
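As a minimal sketch of Equation (5.15), with invented forecast errors:

fe = [1.2, -0.7, 0.3, -1.8, 0.5]             # hypothetical forecast errors
mad = sum(abs(e) for e in fe) / len(fe)      # Equation (5.15): mean absolute deviation
print(mad)                                   # roughly 0.9 with these invented errors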

Forecast statistic 2: MAPE [mean absolute percentage error]
One way of overcoming the units of measurement problem is to compute the MAPE:

$\text{MAPE} = \sum \left| \, 100\,(\hat{Y}_{i+j} - Y_{i+j}) / Y_{i+j} \, \right| / j$   (5.16)

which is formed by computing the absolute percentage errors and then averaging them over the

j prediction periods.
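A minimal sketch of Equation (5.16), with invented actual and predicted values:

actual    = [5.0, 5.4, 6.1, 5.8]       # hypothetical out-of-sample values of Y
predicted = [4.8, 5.7, 6.0, 6.3]       # hypothetical forecasts of Y
mape = sum(abs(100 * (p - a) / a) for a, p in zip(actual, predicted)) / len(actual)
print(mape)                            # average absolute percentage error over the prediction periods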

The alternative means of avoiding positive and negative errors cancelling out is squaring,

which leads us to the next two statistics.


Forecast statistic 3: Out of sample R squared
The use of squared forecast errors leads to the possibility of using an R squared for the forecast

period.

The formula becomes

$R^2 = 1 - \left[ \sum fe_{i+j}^{\,2} \Big/ \sum y_{i+j}^{\,2} \right]$   (5.17)

In this case, the sum of squared errors would be divided by the sum of squared deviations of

Y in the out of sample (prediction) period. The interpretation of this R squared is that it is,

when multiplied by 100, the percentage variation in the predicted variable ‘explained’ by the

model parameters estimated in the within sample period.
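A minimal sketch of Equation (5.17), again with invented numbers; y in the formula means the deviation of Y from its mean in the prediction period:

actual    = [5.0, 5.4, 6.1, 5.8]                                  # hypothetical out-of-sample Y values
predicted = [4.8, 5.7, 6.0, 6.3]                                  # hypothetical forecasts of Y
mean_y = sum(actual) / len(actual)
ss_fe = sum((a - p) ** 2 for a, p in zip(actual, predicted))      # sum of squared forecast errors
ss_y  = sum((a - mean_y) ** 2 for a in actual)                    # sum of squared deviations of Y
r2_out = 1 - ss_fe / ss_y                                         # Equation (5.17)
print(r2_out)

Note that, unlike the within sample R squared, this statistic can be negative: if the forecasts fit the prediction period worse than simply using the out of sample mean of Y, the ratio in Equation (5.17) exceeds one.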

Forecast statistic 4: Root mean squared error (RMSE)
The RMSE is Equation (5.18):

$\text{RMSE} = \sqrt{ \sum fe_{i+j}^{\,2} \,/\, j }$   (5.18)

If we take the average of the squared prediction errors and then take the square root of this,

then we have the RMSE. If the forecast is perfect then this statistic, like the first two above, will

be zero. As the forecast is improved, ceteris paribus, the RMSE will tend towards zero. This is

another symmetric statistic in that it weights equal-sized positive and negative deviations

equally. However, it gives more weight to larger forecast errors. So, if our aim in developing a

forecasting model was to minimize this statistic this would be consistent with a loss function

(see Section 2.6) in which the costs of a prediction error increase at the rate of the square of the

error.
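A minimal sketch of Equation (5.18), using invented forecast errors:

fe = [1.2, -0.7, 0.3, -1.8, 0.5]                      # hypothetical forecast errors
rmse = (sum(e ** 2 for e in fe) / len(fe)) ** 0.5     # Equation (5.18): root mean squared error
print(rmse)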

Forecast statistic 5: Theil’s U statistic
Named after Henri Theil, this is a development of the mean squared error as follows:

$U = \sqrt{ \text{MSE} \Big/ \left( \sum A_i^2 / n \right) }$   (5.19)

where MSE is the mean squared error of the forecasts, n is the sample size and A is the actual change in the dependent variable from one time period to the next. If the predictions are 100 per cent

accurate then this statistic is equal to zero. If U is equal to 1 then this means that we have a

forecast that is no better than a simple prediction of no change from the last period. If U is

greater than 1 then the model is even worse at prediction than a simple ‘no change’ forecast.

Obviously, what we hope for is a U statistic between 0 and 1, ideally the closer to 0 the better.

This statistic is not valid if the model suffers from autocorrelation (tests for which are examined

in Section 9.7 of Chapter 9).
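A minimal sketch of Equation (5.19) as defined above, interpreting A as the actual change in the dependent variable from one period to the next; all numbers are invented for illustration:

actual  = [5.0, 5.4, 6.1, 5.8, 6.0]               # hypothetical actual values of the dependent variable
predict = [4.8, 5.7, 6.0, 6.3, 5.9]               # hypothetical forecasts
mse = sum((a - p) ** 2 for a, p in zip(actual, predict)) / len(actual)   # mean squared forecast error

changes = [actual[t] - actual[t - 1] for t in range(1, len(actual))]     # actual period-to-period changes
denom = sum(c ** 2 for c in changes) / len(changes)                      # (sum of squared changes) / n

u = (mse / denom) ** 0.5                          # Equation (5.19); u below 1 beats a 'no change' forecast
print(u)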

Finally, we should mention a statistic obtained by running the regression on the within sample and out of sample observations combined:


Forecast statistic 6: ‘T’ tests on individual forecast errors using dummies
This will be explained in Chapter 7 (Section 7.7).

The usual approach to forecast evaluation treats errors symmetrically. That is, the cost of an

over-prediction error is treated as the same as that of an equivalent size under-prediction. A

good forecast is judged in the same light as we treated the OLS estimators in Chapter 3. That

is, unbiasedness and minimum variance are the criteria for the best forecast. There might be

some circumstances where we do not want to treat errors symmetrically; for example, if you open a restaurant and are trying to forecast the required size and pricing structure, then the cost of having an empty seat may not be identical to the cost of turning away a customer because of excess demand. If you were dealing with such a problem then the forecast evaluation statistic would have to be adjusted to reflect the asymmetric loss function.

Forecasting is a large and complicated subject in its own right and this section has only been

a gentle introduction to it. Nevertheless, a thorough background in the CLRM model provides

a good jumping off point for you to progress into the forecasting area.

5.7 REVIEW STUDIES

The review studies in this chapter involve some methods which we have not covered yet, but

this should not be an obstacle to finding these studies useful. The studies reviewed here share a

common perspective and address the same problem despite the seeming disparity of their

research topics – that is, one (Landes) looks at the skyjacking (i.e. hijacking aircraft)

phenomenon and the other (McCormick and Tollison) looks at the behaviour of basketball

players. Neither of these papers is a specialist study in the field of terrorism or sport psychology.

Rather, the authors have seized upon these activities as suitable sources of data for testing the

economic model of crime which is a straightforward application of standard utility-maximizing

models in risky situations. The dominant theme in the early work on this subject, inspired by

the paper by Becker (1968), was that increased punishment deters crime by substitution effects

which lead the criminal to retire from crime or to switch time to leisure or non-crime activities.

This was implied in the discussion of the regression shown in Table 5.1.

A major problem in the econometric work, which followed Becker, was non-random error in

the form of correlated measurement error shared by the dependent variable and one of the

independent variables. In other words, the same violation of the classical assumption discussed

in Section 5.4, point (viii). In this particular case we have the problem of bias in the coefficient

of the punishment variables. That is, the number of crimes counted in official statistics is not

the true figure as there are more crimes which go unrecorded. If the rate of crime recording is

correlated with the number of police officers, then there will be correlated errors between these two variables and any punishment variables which may be a function of the number of police officers – such as the clearance rate variable in Table 5.1. This variable has the further serious

problem that the bottom line of the formula for clearance rates is the same as the top line of the

formula for crime rates. This may create a serious negative bias in the clearance rate coefficient,

which could lead us to wrongly reject the null hypothesis.

Economists are unlikely to get access to genuinely ‘experimental’ data on crime (or indeed

most other areas), so one way round this problem is to make a clever choice. The articles

considered here use data which is deliberately chosen to control for some of the problems in the

crime data of the FBI and comparable organizations around the world. Both studies are time


series in nature, although there is a slight difference in that Landes uses quarterly data while

McCormick and Tollison use data on a time series of sport contests which are not equally

spaced in the same way. Both studies are linear and use a ‘count’ variable on the left-hand side,

which means it is not strictly correct to use an OLS model. Landes uses forecasting to work out

the effect of changes in his focus variables over time, while McCormick and Tollison do not.

They rely simply on the regression coefficient for the number of referees variable.


The Basic Facts
Authors: William M. Landes

Title: An Economic Study of US Aircraft Hijacking, 1961–1976

Where Published: Journal of Law and Economics, 21(2), 1978, 1–32.

Reason for Study: To account for the dramatic decline in US aircraft hijackings after 1972, with

particular reference to the risk and size of punishment.

Conclusion Reached: That the risks faced by hijackers are a statistically significant deterrent of

aircraft hijacking. The forecasting equation is used to come to the conclusion that extra

prevention measures (mandatory screening and increased post-attempt apprehension risk) saved

the USA from between 47 and 67 additional hijackings in the 1973–76 period.

How Did They Do It?
Data: Quarterly observations from the USA, 1961–76

Sample Size: 59–60 in Table 3. 140 in Table 4.

Technique: OLS but some results use the Cochrane–Orcutt (GLS) technique. You should

concentrate mainly on the OLS equation in levels in Table 3.

Dependent Variable: Number of hijackings (Table 3) and time between hijackings (Table 4).

Focus Variables: Probability of apprehension, conditional probability of imprisonment,

proportion of offenders killed, and average length of prison sentence.

Control Variables: Time trend (see Section 7.8 in Chapter 7), number of foreign hijackings, per

capita consumption expenditure, unemployment rates, population, number of flights.

Presentation
How Results Are Listed: Table of coefficients with absolute ‘t’ ratios in brackets underneath. The

constant term is also given in this manner.

Goodness of Fit: R squared.

Tests: Default ‘t’ tests as explained above. Durbin–Watson test. No ‘F’ tests.

Diagnosis/Stability: No explicit diagnosis/stability testing but the author does use a number of

approaches to estimation and checks whether his main conclusion might be explained by a rival

hypothesis (the ‘fad effect’).

Anomalous Results: There are a number of control variables (population, number of flights and

consumption expenditure) which are not significant in the hijack numbers equation (Table 3)

but are consistently significant in the Time between hijackings results (Table 4).

Student Reflection
Things to Look Out For: There are quite a lot of adjustments to the individual variables in this

model in terms of use of lags (see Chapter 7) and moving averages.

Problems in Replication: It might be quite hard to get hold of the air flight-related data. The

other statistics all come from standard sources of macroeconomic statistics but the flight data

require access to airline industry publications.


The Basic Facts
Authors: Robert E. McCormick and Robert D. Tollison

Title: Crime on the Court

Where Published: Journal of Political Economy, 92(3), 1984, 223–235.

Reason for Study: To test the economic theory of crime in the form of the hypothesis that more

referees in a ball game means fewer fouls by the players.

Conclusion Reached: ‘We find a large reduction, 34 per cent, in the number of fouls committed

during a basketball game when the number of referees increases from two to three’ (p. 223).

How Did They Do It?
Data: Games played in Atlantic Coast Conference basketball tournaments 1954–83, but data is

missing for 1955 and 1962. Data is divided into separate winner and loser samples for

estimation.

Sample Size: 201 in total games.

Technique: OLS but also SUR (see Chapter 13) and logit (described as logistic regression – see

Chapter 8). You should concentrate on the OLS results.

Dependent Variable: Number of fouls.

Focus Variables: The variable OFFICIAL, measuring the number of referees, which is either 2 or 3. It is 2 up to the end of 1978.

Control Variables: See p. 227: measure of experience differential between teams, total score in

match, year of tournament, difference in coaching experience between teams, attendance at the

game, experience of the referees, dummy variables (see Chapter 8) to control for rule changes,

measures of the other team’s accuracy.

Presentation
How Results Are Listed: Four columns – parameter estimate, standard error, ‘t’ ratio, Prob value.

Goodness of Fit: R squared in a bracket above the results.

Tests: Default ‘t’ tests. ‘F’ statistic given in brackets above the results. Two-tailed Prob value.

Diagnosis/Stability: They make some attempt to look at the influence of ‘false arrests’ which

would be a source of measurement error (see Table 2). Footnote 8 on p. 229 reports that they

tried several other control variables which did not alter their main conclusions.

Anomalous Results: Attendance at the game is not significant for winners or losers. The

experience variables are not significant in the loser’s fouling equation.

Student Reflection
Things to Look Out For: None.

Problems in Replication: You might be able to find cases for other sports (in various countries) in

which there have been referee or rule changes, but if these have not taken place during your

sample then it will not be possible to use sport as a ‘laboratory’ for the testing of hypotheses

about the economics of crime.


These studies are extremely typical of what appears in the mainstream American economics

journals. They start with a very clear message about the focus variables that will be the subject of

the paper, and produce results that quite strongly support the main hypotheses about the focus

variables. The discussion of the control variables is fairly brief. In both studies, the control

variables are loosely specified as measures of the expected costs and benefits of the rule-breaking

choice (fouling at basketball or hijacking a plane). Not too much concern is shown if some of

the control variables are not significant. The question that remains is the extent to which the

supportive results for the focus variables are dependent on the specific set of controls,

definitions etc. used in the papers. Or to put it more crudely, might the authors be guilty of

data mining? There is a slight degree of exploration of this in these papers but it is not very

comprehensive.

We should not, of course, jump to the conclusion that there has been any data mining taking

place. To answer this would require a thorough replication in which we can draw on the data as

originally defined, but also possibly additional definitions and variables which might be relevant

but are not mentioned in the original studies. It is certainly not possible to do a direct replication by checking the reported estimates against the authors’ own data because, in 1978 and 1984 as is still generally true today, economics journals did not require authors to provide in print (or deposit in an archive) the data used in their papers.

5.8 CONCLUSION

This chapter has extended the multiple linear regression model first encountered at the end of

Chapter 3. The main purpose has been to improve your ability to interpret regression

coefficients and to test hypotheses about them. Estimates of equations for the burglary rate and

the motor vehicle death rate, in the USA, were used to illustrate the use of elasticities and ‘F’

tests to provide more information about our results. Another pair of articles from economics

journals were used to show how the knowledge you have gained so far can help you understand

a research paper. These two studies (on the hijacking of aircraft in the USA and the extent of

fouling in basketball games) did involve some methods and concepts that have not been

covered, but you can understand these studies to a high degree by concentrating on the features

shown in our panels of review studies in Chapter 3. They are, again, simple multiple linear

regressions. These studies are included to extend the skills developed in looking at the Chapter

3 review studies. You may find it profitable to go back to Chapter 3 and re-examine the review

studies to see if you now feel more comfortable in working out what these papers were trying to

do and the conclusions they came to. In this chapter we began the process of extending the

usefulness of multiple regression by showing how it can be used for forecasting as well as for

hypothesis testing. It is, fortunately, possible to make the CLRM model even more useful, in a

number of ways, without having to learn any new statistical techniques or ideas. This is done in

Chapters 6 and 7, which take us much further into the heart of the subject.


DOS AND DON’TS

Do
✓ Provide accurate information on how your data is defined and constructed.
✓ Make sure you understand why there are three different forms of the ‘F’ test in Section 5.3.
✓ Try to implement all reasonable hypothesis tests on your coefficients, as the default tests from the package may not be that informative.
✓ Consider the use of some form of evaluation of the stability of your model, whether it be the use of one set of data to predict performance in another set or the ‘Chow’ F test.
✓ Try to find the time to understand the pair of studies reviewed in the panel in Section 5.7 (or a similar pair which may have been given to you). If you can grasp the use of econometrics in these papers then you have made good progress and can expect to continue to do so.

Don’t
✗ Forget that the size of coefficients in a regression is dependent on the units of observation.
✗ Fall into the trap of relying on either the point estimate or the level of statistical significance to tell a story. You need to use both to make sense of your results.
✗ Forget that reports of ‘the elasticity’ in a linear regression model are usually calculations at the means of the data. You will get a different elasticity at other points on the fitted line.
✗ Get over-excited at finding large values of R squared, as this is not necessarily a sign of the success of your model.
✗ Forget that the ‘F test for the equation’ is equivalent to a test of the null that R squared is zero.

EXERCISES

5.1 Looking at Table 5.1, work out the impact on the burglary rate of a rise of 5 percentage points in the unemployment rate and a fall of 5 percentage points in the percentage of the population classified as black. Now have a look at the ‘t’ values on the coefficients for UR and PCBLK and comment on the accuracy of the calculations you have just made.

5.2 Go back to Tables 3.1 and 3.2 in Chapter 3 and Table 4.4 in Chapter 4. For each of these regressions calculate the R squared adjusted and the ‘F’ for the equation (note: this has already been given in Table 4.4 but you might like to check it).

5.3 Using Table 5.2, Equation 1, calculate the impact of a 10 mile per hour increase in average driving speed on the number of drivers killed per 100 000 registered drivers.

5.4 Do you think that forcing people to use R squared adjusted is a way of preventing ‘data mining’? Give reasons for your conclusion.

5.5 State whether the following statements are true, false or indeterminate:
(i) The R2 of a linear equation will be equal to the R2 for its ‘out of sample’ forecasts.
(ii) A good forecast should have the same properties as a good estimator, that is, they should be BLUE.
(iii) An equation which has a high R2 adjusted will be good for making out of sample predictions.


(iv) Attempting to minimize the value of RMSE for the out of sample predictions should be the aim of every good forecaster.

REFERENCES
Becker, G.S. (1968) Crime and punishment: An economic approach. Journal of Political Economy, 76(1), 169–217.
Gujarati, D.N. (1988) Basic Econometrics, 2nd edn, McGraw-Hill, New York.
Hebden, J. (1981) Statistics for Economists, Philip Allan, Oxford.
Landes, W.M. (1978) An economic study of US aircraft hijacking, 1961–1976. Journal of Law and Economics, 21(2), 1–32.
Lave, C. (1985) Speeding, coordination, and the 55 mph limit. American Economic Review, 75(5), 1159–1164.
McCormick, R.E. and Tollison, R.D. (1984) Crime on the court. Journal of Political Economy, 92(3), 223–235.
Theil, H. (1978) Introduction to Econometrics, Prentice-Hall, Englewood Cliffs, NJ.

WEBLINK

http://www.paritech.com/education/technical/indicators/trend/r-squared.asp

A basic explanation of R squared for stock market forecasters (the definition is not entirely

correct).
