An introduction to nonparametric and semi-parametric econometric methods

Robert Breunig
Australian National University
[email protected]
http://econrsss.anu.edu.au/Staff/breunig/course_bb.htm

March 1, 2011
Outline

1. Introduction
2. Density Estimation
(a) Kernel techniques
(b) Bandwidth Selection
(c) Estimating derivatives of densities
(d) Non-kernel techniques
3. Conditional Mean Estimation
4. Semi-parametric estimation
(a) Robinson’s method
(b) Differencing
(c) Binary Choice models
(d) Mixed categorical and continuous variables
Objectives

1. Introduce nonparametric and semiparametric techniques
2. Introduce some of the key issues in the literature
3. Introduce several key tools and techniques
4. Provide examples of the use of techniques
5. Provide reference literature so that interested students can pursue these techniques in their applied work
Objects of Interest

All statistical objects studied by applied econometricians may be expressed as functions of unknown distributions.

• Measurement of inequality

F_a(x) - F_b(x) = \int_{-\infty}^{x} f_a(t)\,dt - \int_{-\infty}^{x} f_b(t)\,dt

• Regression modelling

m(x) = E[Y|x] = \int_{-\infty}^{\infty} y\,\frac{f(y,x)}{f_1(x)}\,dy
• Measurement of response

\beta(x) = \frac{dE[Y|x]}{dx} = \frac{d\left[\int y\,\frac{f(y,x)}{f_1(x)}\,dy\right]}{dx}

• Market risk

\sigma^2(x) = \int_{-\infty}^{\infty} (y - E[Y|x])^2\,\frac{f(y,x)}{f_1(x)}\,dy

• Discrete choice

Prob[Y = 1|x] = \frac{f(y,x)}{f_1(x)} > .5
Parametric Models

Parametric econometric methods require the prior specification of the functional form of the object being estimated.

For example, one might assume that the conditional mean function is linear:

m(x) = E[Y|x] = \int_{-\infty}^{\infty} y\,\frac{f(y,x)}{f_1(x)}\,dy = \beta_0 + \beta_1 x

This specification implies a constant response:

\beta(x) = \frac{dE[Y|x]}{dx} = \frac{d\left[\int y\,\frac{f(y,x)}{f_1(x)}\,dy\right]}{dx} = \frac{d(\beta_0 + \beta_1 x)}{dx} = \beta_1
Parametric Methods: Drawbacks

Parametric models impose a priori structure on the underlying DGP.

Having assumed that this structure is known, we then estimate a handful of unknown parameters.

Choice of models is frequently not based upon any attempt to select the correct parametric specification from the space of admissible models. Rather, model selection is usually made on the basis of tractability and ease of interpretation.

The risk is that inference, prediction, and policy are all based upon an incorrectly specified parametric model. The consequences of such mis-specification are well known.
Nonparametric Methods

Nonparametric estimators estimate objects of interest to economists by replacing unknown densities and distribution functions with their nonparametric density estimators.

They are consistent under less restrictive assumptions than those underlying their parametric counterparts.

When there is sufficient data, these estimators frequently reveal features of the data that are invisible under parametric techniques.

Different features and structures revealed by nonparametric estimators often lead to different conclusions and policy prescriptions than those based upon parametric methods.
Four uses of nonparametric methods

1. Visualizing the data
2. Testing and comparing models
3. Conditional Mean Estimation (regression)
4. Combining parametric and nonparametric methods (semi-parametric estimation)
Basic building block: Nonparametric Kernel Density Estimator

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)
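As a concrete sketch, a minimal NumPy implementation of this estimator, assuming a Gaussian kernel and a user-chosen bandwidth h (both choices are discussed below):

```python
import numpy as np

def kernel_density(x_grid, data, h):
    """f_hat(x) = (1/nh) * sum_i K((x_i - x)/h) with a standard normal K."""
    n = len(data)
    # psi has shape (len(x_grid), n): one row of (x_i - x)/h per grid point
    psi = (data[None, :] - x_grid[:, None]) / h
    K = np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)  # standard normal kernel
    return K.sum(axis=1) / (n * h)

# Example: estimate the density of a simulated sample
rng = np.random.default_rng(0)
data = rng.normal(size=500)
grid = np.linspace(-4, 4, 200)
f_hat = kernel_density(grid, data, h=0.4)
```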
We would like to estimate the density, f(x), from a sample x_1, x_2, \ldots, x_n.

• Histogram

• Naive Non-parametric/Local Histogram

\hat{f}_I(x) = \frac{1}{nh} \sum_{i=1}^{n} I\left(-\frac{1}{2} < \frac{x_i - x}{h} < \frac{1}{2}\right) = \frac{n^*}{nh}

n^* is the number of points which lie between x - h/2 and x + h/2.

h will determine the smoothness of the estimate.
• Replace the indicator function with a smooth weighting function called a Kernel:

\hat{f}(y) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{y_i - y}{h}\right) \quad \text{where} \quad \int K(\psi)\,d\psi = 1

K(\psi) large for small \psi
K(\psi) small for large \psi
K(\cdot) should be symmetric.
A large class of functions satisfy these assumptions, for example:

(i) Standard Normal:

K(\psi) = (2\pi)^{-1/2} \exp\left[-\frac{1}{2}\psi^2\right]

(ii) Uniform:

K(\psi) = (2c)^{-1}, \quad -c < \psi < c, and 0 otherwise.

(iii) Epanechnikov (1969) [optimal kernel]:

K_0(\psi) = \frac{3}{4}(1 - \psi^2), \quad |\psi| \le 1, and 0 otherwise.
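As a sketch, these three kernels written as NumPy functions (the uniform kernel with c = 1/2 reproduces the naive local histogram above):

```python
import numpy as np

def normal_kernel(psi):
    # (i) standard normal: integrates to 1 over the real line
    return np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)

def uniform_kernel(psi, c=0.5):
    # (ii) uniform on (-c, c)
    return np.where(np.abs(psi) < c, 1.0 / (2 * c), 0.0)

def epanechnikov_kernel(psi):
    # (iii) Epanechnikov: (3/4)(1 - psi^2) on |psi| <= 1, zero otherwise
    return np.where(np.abs(psi) <= 1, 0.75 * (1 - psi**2), 0.0)
```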
In order to implement this estimator, we have to make two choices:

1. Kernel (weight) function: K
2. The smoothing parameter (bandwidth): h

It turns out that the choice of kernel does not have much effect on the optimality of the estimator, but the choice of bandwidth (or window width) has important repercussions for our results.
Bandwidth selection methods

1. Plug-in methods
2. Likelihood cross-validation
3. Least-squares cross-validation
All of these methods begin from the same starting point, which is that the bandwidth, h, should be chosen so that the estimated density, \hat{f}(x), is as close as possible to the true density, f(x). Most of the time we employ some kind of global criterion. The most common is the integrated squared error (ISE)

\int_{-\infty}^{\infty} \left(\hat{f}(x) - f(x)\right)^2 dx \qquad (ISE)

or its expected value, the mean integrated squared error (MISE)

E\left[\int_{-\infty}^{\infty} \left(\hat{f}(x) - f(x)\right)^2 dx\right] \qquad (MISE)

These two quantities correspond to loss and risk.
For independent and identically distributed (i.i.d.) data, it is straightforward to show that

Bias(\hat{f}) = E\hat{f} - f = \int K(\psi)\left[f(h\psi + x) - f(x)\right] d\psi

and

V(\hat{f}) = (nh)^{-1} \int K^2(\psi)\,f(h\psi + x)\,d\psi - n^{-1}\left[\int K(\psi)\,f(h\psi + x)\,d\psi\right]^2
The expressions for exact bias and variance are not useful without knowledge of the quantity that we are attempting to estimate: the true underlying density.

We can, however, derive approximations to these quantities by expanding f(hψ + x) in a Taylor series, for small h:

f(h\psi + x) = f(x) + h\psi f^{(1)}(x) + \frac{h^2}{2}\psi^2 f^{(2)}(x) + \ldots
Given the i.i.d. assumption above, the assumptions made regarding the kernel function, and the following additional assumptions:

(A3) The second order derivatives of f are continuous and bounded in some neighborhood of x.
(A4) h = h_n → 0 as n → ∞.
(A5) nh_n → ∞ as n → ∞.

we can show that up to O(h^2) the bias is given by

Bias(\hat{f}) = \frac{h^2}{2}\mu_2 f^{(2)}(x), \quad \text{where } \mu_2 = \int \psi^2 K(\psi)\,d\psi,

and up to O((nh)^{-1}), the variance is given by

V(\hat{f}) = (nh)^{-1} f(x) \int K^2(\psi)\,d\psi.
MISE = \int \left[\left(Bias(\hat{f})\right)^2 + Var(\hat{f})\right] dx

The approximate MISE, using the above expressions, is

AMISE = \frac{h^4}{4}\mu_2^2 \int \left(f''(x)\right)^2 dx + (nh)^{-1} \int f(x)\,dx \int K^2(\psi)\,d\psi = \frac{1}{4}\lambda_1 h^4 + \lambda_2 (nh)^{-1}

where

\lambda_1 = \mu_2^2 \int \left(f^{(2)}(x)\right)^2 dx, \qquad \lambda_2 = \int K^2(\psi)\,d\psi. \qquad (1)
The optimal window width, in the sense that the approximate integrated mean squared error is minimized, will be

h = c\,n^{-1/5};

minimizing (1) with respect to h shows that c = (\lambda_2/\lambda_1)^{1/5}.
Assuming a normal kernel and a normal density, f(x), both λ_1 and λ_2 can be evaluated numerically. This provides

h = 1.06\,\sigma_x\,n^{-1/5}

Software packages which implement nonparametric density estimation (SAS, Shazam, STATA) use this as the default window width. For non-normal distributions it works well as a first approximation. It can also provide a good starting point for data-driven methods of bandwidth selection (see more below). It is by far the most commonly used window width in the literature.
Silverman (1986) provides several other alternatives which work well for heavily skewed data or multi-modal data.

A simple improvement is to replace σ by a robust estimator of spread, and he specifies two alternatives that seem to work well:

h = .79\,R\,n^{-1/5}
h = .9\,A\,n^{-1/5},

where R is the inter-quartile range and A = min(σ, R/1.34).
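These rules are simple to code; a sketch returning all three for a univariate sample:

```python
import numpy as np

def rule_of_thumb_bandwidths(data):
    """Normal-reference and Silverman (1986) rule-of-thumb bandwidths."""
    n = len(data)
    sigma = data.std(ddof=1)
    q75, q25 = np.percentile(data, [75, 25])
    R = q75 - q25                      # inter-quartile range
    A = min(sigma, R / 1.34)           # robust measure of spread
    return {
        "normal_reference": 1.06 * sigma * n ** (-1 / 5),
        "silverman_iqr":    0.79 * R * n ** (-1 / 5),
        "silverman_robust": 0.90 * A * n ** (-1 / 5),
    }
```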
Least Squares Cross-validation

This is essentially a data-driven technique for choosing the optimal bandwidth. The idea is to minimize a particular criterion function. In least squares cross-validation the function minimized is

ISE(h) = \int \left(\hat{f}(x) - f(x)\right)^2 dx = \int \hat{f}^2\,dx + \int f^2\,dx - 2\int \hat{f} f\,dx.

Since \int f^2\,dx does not depend upon \hat{f}, the function that people minimize in practice is actually

\int \hat{f}^2\,dx - 2\int \hat{f} f\,dx.
Further manipulation yields

ISE^*(h) = n^{-2} h^{-1} \sum_{i=1}^{n} \sum_{j=1}^{n} K \circ K\left(\frac{x_i - x_j}{h}\right) - 2n^{-1} \sum_{i=1}^{n} \hat{f}_{-i}(x_i)

as the function that is actually minimized, where K ∘ K denotes the convolution of the kernel with itself.

n^{-1} \sum_{i=1}^{n} \hat{f}_{-i}(x_i) is the leave-one-out estimator, which is formed as a standard kernel density estimator omitting the i-th observation. This provides an unbiased estimate of \int \hat{f} f\,dx. Most programs actually implement the leave-one-out estimator as the actual density estimate since it minimizes the influence of solitary observations.
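For the Gaussian kernel, K ∘ K is the N(0, 2) density, so the criterion can be coded directly. A sketch, with the bandwidth chosen by grid search over user-supplied candidates:

```python
import numpy as np

def gaussian_K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def lscv_criterion(h, data):
    """ISE*(h) for a Gaussian kernel; K∘K is the N(0, 2) density."""
    n = len(data)
    u = (data[:, None] - data[None, :]) / h          # (x_i - x_j)/h
    KK = np.exp(-0.25 * u**2) / np.sqrt(4 * np.pi)   # (K∘K)(u)
    term1 = KK.sum() / (n**2 * h)
    # leave-one-out density estimates f_{-i}(x_i)
    Kmat = gaussian_K(u)
    loo = (Kmat.sum(axis=1) - gaussian_K(0.0)) / ((n - 1) * h)
    return term1 - 2 * loo.mean()

def lscv_bandwidth(data, grid):
    """Choose h by minimizing ISE*(h) over a grid of candidates."""
    return min(grid, key=lambda h: lscv_criterion(h, data))
```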
Likelihood Cross-validation

The basic idea behind this method is to choose an h which maximizes the log likelihood \log L = \sum_{i=1}^{n} \log f(x_i). An estimated log likelihood or pseudo log likelihood can be written as

\log \hat{L} = \sum_{i=1}^{n} \log \hat{f}(x_i) = \log \hat{L}(h),

where \hat{f}(x_i) is a density estimator of f and it depends on h.

Maximizing \log \hat{L} with respect to h produces a trivial maximum at h = 0. To overcome this problem, the cross-validation principle might be adopted, in which \hat{f}(x_i) is replaced by \hat{f}_{-i}(x_i).
This "leave one out" version of the estimator can be written as

\hat{f}_{-i}(x_i) = ((n-1)h)^{-1} \sum_{j=1,\, j \ne i}^{n} K\left(\frac{x_j - x_i}{h}\right).

Thus the likelihood CV principle is to choose h such that

\log \hat{L}(h) = \sum_{i=1}^{n} \log \hat{f}_{-i}(x_i)

is a maximum. The procedure is also known as Kullback-Leibler cross-validation in the sense that it gives an h for which the Kullback-Leibler distance measure between the two densities f and \hat{f},

I(f, \hat{f}) = \int f(x) \log\left\{\frac{f(x)}{\hat{f}(x)}\right\} dx,

is a minimum; see Hall (1987).
A disadvantage of the h obtained by likelihood CV is that it can be severely affected by the tail behavior of f. Furthermore, Hall (1987) has indicated that selecting h by minimizing the Kullback-Leibler measure may be useful for the statistical discrimination problem but not for curve estimation. Thus the likelihood CV procedure has not attracted much current interest in the literature.
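A minimal sketch of the likelihood CV criterion, using the same leave-one-out construction as in the least-squares sketch above:

```python
import numpy as np

def likelihood_cv(h, data):
    """Leave-one-out log likelihood sum_i log f_{-i}(x_i), Gaussian kernel."""
    n = len(data)
    u = (data[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(K, 0.0)                 # leave the i-th observation out
    f_loo = K.sum(axis=1) / ((n - 1) * h)
    return np.log(f_loo).sum()

def likelihood_cv_bandwidth(data, grid):
    """Choose h by maximizing the criterion over a grid of candidates."""
    return max(grid, key=lambda h: likelihood_cv(h, data))
```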
Other density estimation techniques

• Nearest Neighbor Density Estimation

Let d(x_1, x) represent the distance of point x_1 from the point x, and for each x denote by d_k(x) the distance of x from its k-th nearest neighbor (k-NN) among x_1, ..., x_n. Then, taking h = 2d_k(x), the estimator can be written as

\hat{f}_{k-NN}(x) = \frac{\#(x_1 \ldots x_n) \text{ in } [x - d_k(x),\, x + d_k(x)]}{2n\,d_k(x)} = \frac{k}{2n\,d_k(x)} = \frac{1}{2n\,d_k(x)} \sum_{i=1}^{n} I\left(\left|\frac{x_i - x}{2d_k(x)}\right| < \frac{1}{2}\right).

The degree of smoothing is controlled by an integer k, typically k ≈ n^{1/2}.
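A direct sketch of this estimator, computing d_k(x) by sorting distances at each evaluation point:

```python
import numpy as np

def knn_density(x_grid, data, k):
    """k-nearest-neighbor density estimate f_hat(x) = k / (2 n d_k(x))."""
    n = len(data)
    f = np.empty(len(x_grid))
    for m, x in enumerate(x_grid):
        d = np.sort(np.abs(data - x))   # distances from x, increasing
        d_k = d[k - 1]                  # distance to the k-th nearest neighbor
        f[m] = k / (2 * n * d_k)
    return f

# Example with the typical choice k ≈ n^(1/2)
rng = np.random.default_rng(0)
data = rng.normal(size=400)
grid = np.linspace(-3, 3, 100)
f_hat = knn_density(grid, data, k=int(np.sqrt(len(data))))
```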
• Series estimation

Suppose X is a random variable with density f on the unit interval [0,1]. Under these circumstances it can be expressed as the Fourier series

f(x) = \sum_{j=0}^{\infty} a_j \zeta_j(x),

where, for each j ≥ 0, the coefficients are

a_j = \int_0^1 f(x)\,\zeta_j(x)\,dx = E\zeta_j(X),

and the sequence \zeta_j(x) is given by \zeta_0(x) = 1, \zeta_j(x) = \sqrt{2}\cos \pi(j+1)x when j is odd, and \zeta_j(x) = \sqrt{2}\sin \pi j x when j is even.
Using \hat{a}_j = n^{-1} \sum_{i=1}^{n} \zeta_j(x_i) as an estimator of a_j, the orthogonal series estimator is defined as

\hat{f}(x) = \sum_{j=0}^{m} \hat{a}_j \zeta_j(x),

where m is the cutoff point in the infinite sum and determines the amount of smoothing.

The regression analog of this is to express the conditional mean of y as an infinite polynomial in x.
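A sketch of the orthogonal series estimator under the slide's basis convention (the data are assumed to lie in [0, 1]):

```python
import numpy as np

def zeta(j, x):
    """Orthonormal Fourier basis on [0,1], following the convention above."""
    if j == 0:
        return np.ones_like(x)
    if j % 2 == 1:                              # j odd
        return np.sqrt(2) * np.cos(np.pi * (j + 1) * x)
    return np.sqrt(2) * np.sin(np.pi * j * x)   # j even

def series_density(x_grid, data, m):
    """Orthogonal series estimator with cutoff m."""
    f = np.zeros(len(x_grid))
    for j in range(m + 1):
        a_j = zeta(j, data).mean()      # a_hat_j = n^{-1} sum_i zeta_j(x_i)
        f += a_j * zeta(j, x_grid)
    return f

# Example: data on the unit interval, modest cutoff m
rng = np.random.default_rng(0)
data = rng.beta(2, 5, size=500)
grid = np.linspace(0, 1, 200)
f_hat = series_density(grid, data, m=8)
```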
• Variable window width estimators

Another option is to let the window width vary with each point in the data according to some rule. The estimator will then have the form

\hat{f}_{vww}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_{ni}} K\left(\frac{x_i - x}{h_{ni}}\right).

In general, the rule should allow larger h in regions where there are few observations and smaller h in those where observations are densely located.
• Penalized likelihood estimators

• Local log-likelihood estimators

Both of these techniques treat f(x) as an unknown parameter and try to employ likelihood methods to estimate the unknown quantity. The global likelihood has no finite maximum over the class of all densities, so the options are instead to maximize a penalized likelihood function (which imposes some pre-determined amount of smoothness on the function) or the local, kernel-weighted, log-likelihood.
• Example 1: Eruption Length of Old Faithful Geyser in Yellowstone National Park (3 files: oldfaithful∗.pdf)

• Example 2: Hamilton and Lin (1996) model of excess stock returns from the Standard and Poor 500

• Example 3: Ait-Sahalia (1996) nonparametric test of interest rate diffusion models
Multivariate Density Estimation

Consider a bivariate distribution where the i-th sample observation is given by (y_i, x_i) and z = (y, x) is a fixed point.

This can be estimated nonparametrically by

\hat{f}(y, x) = \hat{f}(z) = \frac{1}{nh^2} \sum_{i=1}^{n} K_1\left(\frac{z_i - z}{h}\right)
The kernel estimator of the marginal density f_1(x) of X is

\hat{f}_1(x) = \int \hat{f}(y, x)\,dy = \frac{1}{nh^2} \sum_{i=1}^{n} \int K_1\left(\frac{y_i - y}{h}, \frac{x_i - x}{h}\right) dy = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)

where K(x) = \int K_1(y, x)\,dy is such that \int K(x)\,dx = 1. The estimator of the conditional density of Y given X can then be written as

\hat{f}(y|x) = \frac{\hat{f}(y, x)}{\hat{f}_1(x)}.
In general, for a multivariate density estimation problem of dimension d, the optimal h which minimizes the approximate MISE can be found by substituting nh^d for nh in the MISE expression given earlier and minimizing with respect to h. It is easy to show that

h = c\,n^{-1/(4+d)},

and, for this h, AMISE = O(n^{-4/(4+d)}). When the kernel is multivariate standard normal, c = \{4/(2d+1)\}^{1/(d+4)}.
Curse of dimensionality

It is clear from this result that the higher the dimension d, the slower will be the speed of convergence of \hat{f} to f. Thus one may need a large data size to estimate the multivariate density in high dimensions.
Multivariate Kernels

• standard multivariate normal density,

K(\psi) = (2\pi)^{-d/2} \exp\left(-\frac{1}{2}\psi'\psi\right)

where d = dim(ψ)

• multivariate Epanechnikov kernel

K(\psi) = \frac{1}{2}\,c_d^{-1}(d+2)(1 - \psi'\psi)

if ψ'ψ < 1, and equaling 0 otherwise, where c_d is the volume of the unit d-dimensional sphere (c_1 = 2, c_2 = π, c_3 = 4π/3).
One disadvantage with direct application of the kernels above is that the variables may exhibit disparate variation. To overcome this problem it is good practice to work with standardized data, i.e., data normalized by the standard deviation or some other measure of scale. Then each of the elements in ψ will have unit variance and application of a kernel such as the multivariate standard normal is appropriate.
Conditional mean estimation

Consider q + 1 = p economic variables (Y, X') where Y is the dependent variable and X is a (q × 1) vector of regressors; these p variables are taken to be completely characterized by their unknown joint density f(y, x_1, \ldots, x_q) = f(y, x), at the points y, x.

As noted in the introduction, interest frequently centres upon the conditional mean m(x) = E(Y|X = x), where x is some fixed value of X.

Now suppose that we have n data points (y_i, x_i'). By definition,

y_i = E(Y_i|X_i = x_i) + u_i = m(x_i) + u_i,

where the error term u_i has the properties E(u_i|x_i) = 0 and E(u_i^2|x_i) = \sigma^2(x_i).
Parametric Estimation

Parametric methods specify a form for m(x_i). In the case of a linear specification,

y_i = \alpha + x_i\beta + u_i.

The least squares estimators of α and β are \alpha^* = \bar{y} - \bar{x}\beta^* and

\beta^* = \left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{-1}\left(\sum_{i=1}^{n}(x_i - \bar{x})y_i\right).

The best unbiased parametric estimator of m(x) = \alpha + x\beta is

m^*(x) = \alpha^* + x\beta^* = \sum_{i=1}^{n} a_{ni}(x)\,y_i \qquad (2)

where a_{ni}(x) = n^{-1} + (x - \bar{x})(x_i - \bar{x})\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{-1}. The m^* in (2) is a weighted sum of the y_i, where the weights a_{ni} are linear in x and depend on the distances of x and x_i from \bar{x}.
The assumption that m(x_i) = \alpha + x_i\beta implies certain assumptions about the data generating process (joint density). For example, if (y_i, x_i) is bivariate normal then it can be shown that the mean of the conditional density of y_i given x_i is E(y_i|x_i) = \alpha + x_i\beta, where \alpha = Ey_i - (Ex_i)\beta and \beta = (var(x_i))^{-1} cov(x_i, y_i). This implies that the assumption of a linear specification for m(x) holds if the data come from the normal distribution. However, if the true distribution is not normal then the linear specification for the conditional expectation may be invalid, and so the least squares estimator of m(x) will become biased and inconsistent.
For example, suppose the true relationship is

y_i = \alpha + x_i\beta + x_i^2\gamma + u_i;

then the "parameter of interest" is \beta + 2\gamma x_i = \partial y_i/\partial x_i. However, if a linear approximation is taken, \partial y_i/\partial x_i is being estimated under the false restriction that \gamma = 0. Typically, the exact functional form connecting m(x) with x is unknown. Because of the possibility that forcing the function to be linear or quadratic may affect the accuracy of estimation of m(x), it is worthwhile considering nonparametric estimation of the unknown function, and this task is taken up in the following sections.
Kernel-Based Estimation

Suppose that the x_i are i.i.d. random variables. Because m(x_i) is the mean of the conditional density f(y_i|x_i) = f(Y|X = x_i), there is a potential to employ the methods of density estimation seen earlier. By definition the conditional mean is

m = \int_{-\infty}^{\infty} \left(y\,f(y, x)/f_1(x)\right) dy, \qquad (3)

where f_1(x) is the marginal density of X at x.

Nadaraya (1964) and Watson (1964) therefore proposed that m be estimated by replacing f(y, x) by \hat{f}(y, x) and f_1(x) by \hat{f}_1(x), where these density estimators are the kernel estimators discussed above.
The expressions for \hat{f}(y, x) and \hat{f}_1(x) from the first part of this talk may be substituted into (3) to give

\hat{m} = \int_{-\infty}^{\infty} y \left[\frac{(nh^p)^{-1} \sum_{i=1}^{n} K_1\left(\frac{y_i - y}{h}, \frac{x_i - x}{h}\right)}{(nh^q)^{-1} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)}\right] dy, \qquad (4)

where p = q + 1 and h is the window width. Some simplification yields

\hat{m} = \left[(nh^q)^{-1} \sum_{i=1}^{n} y_i K\left(\frac{x_i - x}{h}\right)\right] \Bigg/ \left[(nh^q)^{-1} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)\right] = \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right) y_i \Bigg/ \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right).
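In code, the estimator is just a ratio of two kernel sums. A minimal sketch with a Gaussian kernel and scalar x:

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, h):
    """m_hat(x) = sum_i K((x_i - x)/h) y_i / sum_i K((x_i - x)/h)."""
    psi = (x[None, :] - x_grid[:, None]) / h
    K = np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)   # Gaussian kernel weights
    return (K * y[None, :]).sum(axis=1) / K.sum(axis=1)

# Example: recover a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 300)
y = np.sin(x) + 0.3 * rng.normal(size=300)
grid = np.linspace(0, 2 * np.pi, 100)
m_hat = nadaraya_watson(grid, x, y, h=0.3)
```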
A feature of the Nadaraya-Watson estimator is that it is a weighted sum of those y_i's that correspond to x_i in a neighborhood of x. The weights are low for x_i's far away from x and high for x_i's closer to x.

With this motivation, a general class of nonparametric estimators of m(x) can be written as

\hat{m} = \hat{m}(x) = \sum_{i=1}^{n} w_{ni}(x)\,y_i

where w_{ni}(x) = w_n(x_i, x) represents the weight assigned to the i-th observation y_i, and it depends on the distance of x_i from the point x.

Note that the parametric estimator m^*(x) in (2) is a special case with linear weights w_{ni}(x) = a_{ni}(x) such that \sum w_{ni}(x) = 1, but w_{ni}(x) \ge 0 is not necessarily true.
An implicit assumption in nonparametric estimation is that m(x) is smooth over x, implying that y_i contains information about m(x) whenever x_i is near to x. The estimator \hat{m}(x) is a smoothed estimator in the sense that it is constructed, at every point, by local averaging of the observations y_i corresponding to those x_i close to x in some sense.

In parametric regression, a functional form is specified for the conditional mean m(x). This functional form, say m(x, β), depends on a finite number of unknown parameters β. The least squares estimate of m = m(x) is m(x, \hat{\beta}), where \hat{\beta} is chosen to minimize

\sum_{i=1}^{n} \left(y_i - m(x_i, \beta)\right)^2. \qquad (5)
Compare (5) with the following weighted least squares criterion for the nonparametric estimation of m(x):

\sum_{i=1}^{n} w^*_{ni}(x) \left[y_i - m(x)\right]^2. \qquad (6)

In (6), m(x) replaces the m(x, β) that appears in (5). If m(x) is regarded as a single unknown parameter m, it may be estimated by minimizing

\sum_{i=1}^{n} w^*_{ni}(x) \left[y_i - m\right]^2. \qquad (7)

The resulting estimate, \hat{m}, of m(x) is precisely the Nadaraya-Watson estimator. Thus the kernel estimator \hat{m} is also a least squares estimator, with w^*_{ni}(x) = K((x_i - x)/h).
One might also think of \hat{m}(x) as a method of moments estimator. Since E(u_i|x_i) = 0,

E\left[w^*_{ni}(x)(y_i - m(x_i))\right] = 0 \qquad (8)

or

E\left[w^*_{ni}(x)(y_i - m) + w^*_{ni}(x)(m - m(x_i))\right] = 0. \qquad (9)

If the second term in (9) is ignored and a sample estimate of the first, n^{-1}\sum_{i=1}^{n} w^*_{ni}(x)(y_i - m), is used, the value of m for which this is zero is again the Nadaraya-Watson estimator.
Whether the second term can be ignored depends upon the weights w^*_{ni}(x). If the weights were the indicator functions of the "local histogram" presented earlier, the second term would be identically zero, whereas with kernel weights it is only asymptotically zero. Because the orthogonality relation only holds as n → ∞, the situation is outside the normal framework described by Hansen (1982), but it is close to work reported in Powell (1986), in that the expected value of the function the parameter solves changes with the sample size (through h) and so its large sample limit has to be used instead.
Local Linear Nonparametric Regression

The Nadaraya-Watson estimator of m(x) minimizes

\sum_{i=1}^{n} \{y_i - \alpha\}^2 K\left(\frac{x_i - x}{h}\right)

with respect to α, giving

\hat{m}(x) = \hat{\alpha} = \left[\sum K\left(\frac{x_i - x}{h}\right)\right]^{-1} \sum K\left(\frac{x_i - x}{h}\right) y_i.

Stone (1977) and Cleveland (1979) suggested that one instead minimize

\sum_{i=1}^{n} \{y_i - \alpha - (x_i - x)\beta\}^2 K\left(\frac{x_i - x}{h}\right)

with respect to α and β and set \hat{m}(x) equal to the resulting estimate of α.
This estimate can be found by performing a weighted least squares regression of y_i against z_i' = (1, (x_i - x)) with weights [K((x_i - x)/h)]^{1/2}. Thus, while the Nadaraya-Watson estimator is fitting a constant to the data close to x, the local linear approximation is fitting a straight line.

This local linear smoothing estimator has been extensively investigated by Fan (1992a, 1993), Fan and Gijbels (1992) and Ruppert and Wand (1994).
The resulting estimator has the form

\hat{m}_{LL}(x) = \sum_{i=1}^{n} w^{LL}_{ni}(x)\,y_i,

with weights w^{LL}_{ni} = e_1'\left(\sum z_i K_i z_i'\right)^{-1} z_i K_i, where e_1 is a column vector of the same dimension as z_i with unity as its first element and zeros elsewhere.

One advantage of this estimator is that it can be analysed with standard regression techniques, and it has the same first order statistical properties irrespective of whether the x_i are stochastic or non-stochastic. The optimal window width is proportional to n^{-1/5}.
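A sketch of the local linear estimator as the weighted least squares problem just described, solving for the intercept point by point (a Gaussian kernel is assumed for concreteness):

```python
import numpy as np

def local_linear(x_grid, x, y, h):
    """At each x0, weighted least squares of y_i on z_i' = (1, x_i - x0)
    with kernel weights K_i; m_hat(x0) is the intercept estimate."""
    m = np.empty(len(x_grid))
    for j, x0 in enumerate(x_grid):
        psi = (x - x0) / h
        K = np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)   # weights K_i
        Z = np.column_stack([np.ones_like(x), x - x0])   # z_i'
        W = np.diag(K)
        # alpha_hat is the first element of (Z'WZ)^{-1} Z'Wy
        coef = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
        m[j] = coef[0]
    return m

# Usage mirrors the Nadaraya-Watson example above
rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 300)
y = np.sin(x) + 0.3 * rng.normal(size=300)
m_hat = local_linear(np.linspace(0, 2 * np.pi, 100), x, y, h=0.4)
```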
Applications of the idea in econometrics include McManus (1994) to estimation of cost functions, Gourieroux and Scaillet (1994) to the term structure, Lin and Shu (1994) to estimation of a disequilibrium transition model, Bossaerts and Hillion (1997) to options prices and their determinants, and Ullah and Roy (1996) to a nutrition/income relation. Implementation and computations are discussed in Cleveland et al. (1988). Hastie and Loader (1993) provide an excellent account of the history and potential of the method.
The logic of local linear regression smoothing can be seen by expanding m(x_i) around x to get

m(x_i) = m(x) + \frac{\partial m}{\partial x}(x^*)(x_i - x), \qquad (10)

where x^* lies between x_i and x. This may be expressed as

m(x_i) = \alpha + \beta(x^*)(x_i - x). \qquad (11)
Now, since E(y_i|x_i) = m(x_i), the objective function

\sum \left(y_i - m(x_i)\right)^2 K_i = \sum \left(y_i - \alpha - \beta(x^*)(x_i - x)\right)^2 K_i

is essentially the residual sum of squares from a regression using only observations close to x_i = x. Notice that this means that \beta(x^*) will be very close to constant, as x^* must lie between x_i and x. This also points to the fact that improvements might be available from expanding m(x_i) as a j-th order polynomial in (x_i - x), but doing so requires the derivatives m^{(j)} to exist.
Example 4: Eruption Length of Old Faithful Geyser Conditional on Waiting Time
Other Notes

• The optimal h can be found by minimizing the MISE as in the density case, and it can be shown that

h_{opt} \propto n^{-1/(q+4)}

• Cross-validation may be performed by minimizing the estimated prediction error (EPE), n^{-1}\sum (y_i - \hat{m}(x_i))^2, where \hat{m}(x_i) is computed as the "leave-one-out" estimator deleting the i-th observation in the sums (a sketch follows this list). To appreciate why minimizing EPE is sensible, notice that when the "leave one out" estimator is employed and observations are independent, \hat{m}_i is independent of y_i, meaning that E(\hat{m}_i(y_i - m_i)) = 0, and so

E(EPE) = \sigma^2 + E\left(n^{-1}\sum (\hat{m}_i - m_i)^2\right) = \sigma^2 + MASE.

Minimizing E(EPE) with respect to h is therefore equivalent to minimizing MASE with respect to h. Unfortunately, minimizing the sample EPE tends to produce an estimator of h that converges only extremely slowly, at order n^{-1/10}, to the value of h minimizing E(EPE).

• The curse of dimensionality means that "pure" nonparametric regression is difficult to use in higher dimension problems.
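A sketch of the leave-one-out cross-validation described in the second bullet, with the Nadaraya-Watson estimator and a Gaussian kernel assumed for concreteness:

```python
import numpy as np

def epe(h, x, y):
    """Estimated prediction error with leave-one-out Nadaraya-Watson fits."""
    psi = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(K, 0.0)              # delete the i-th observation
    m_loo = (K @ y) / K.sum(axis=1)       # leave-one-out fitted values
    return np.mean((y - m_loo) ** 2)

def cv_bandwidth(x, y, grid):
    """Choose h by minimizing EPE over a grid of candidate bandwidths."""
    return min(grid, key=lambda h: epe(h, x, y))
```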
Semi-parametric estimation

A number of models exist in the literature which have the distinguishing feature that part of the model is linear and part constitutes an unknown non-linear format:

y_i = x_{1i}'\beta + g_1(x_{2i}) + u_i, \qquad (12)

which could be written in matrix form as

y = X_1\beta + g_1 + u. \qquad (13)

In (12), x_{2i} cannot have unity as an element.

This intercept restriction is an identification condition arising from the fact that g_1(x_{2i}) is unconstrained and therefore can have a constant term as part of its definition. Hence, it would always be possible to add any constant number to (12) and then absorb it into g_1(x_{2i}), showing that, without some further restriction upon the nature of g_1(x_{2i}), it is impossible to consistently estimate an intercept.

This issue of identification of parameters, particularly in regards to the intercept, but sometimes a scale parameter as well, arises a good deal in the semi-parametric literature and needs to be dealt with by imposing some restrictions.

The parameter of interest is β, so the issue is how to estimate it in the presence of the unknown function g_1.
A Semi-Parametric Estimator of β

Taking the conditional expectation of (13) leads to

E(y_i|x_{2i}) = E(x_{1i}|x_{2i})'\beta + g_1(x_{2i}).

Consequently,

y_i - E(y_i|x_{2i}) = (x_{1i} - E(x_{1i}|x_{2i}))'\beta + u_i \qquad (14)

and

g_1(x_{2i}) = E(y_i|x_{2i}) - E(x_{1i}|x_{2i})'\beta. \qquad (15)
Since (14) has the properties of a linear regression model with dependent variable y_i - E(y_i|x_{2i}) and independent variables (x_{1i} - E(x_{1i}|x_{2i})), an obvious estimator of β is

\hat{\beta} = \left[\sum_{i=1}^{n} (x_{1i} - \hat{m}_{12i})(x_{1i} - \hat{m}_{12i})'\right]^{-1} \left[\sum_{i=1}^{n} (x_{1i} - \hat{m}_{12i})(y_i - \hat{m}_{2i})\right], \qquad (16)

where \hat{m}_{12i} and \hat{m}_{2i} are the kernel based estimators of m_{12i} = E(x_{1i}|x_{2i}) and m_{2i} = E(y_i|x_{2i}).

Once \hat{\beta} is found, g_1(x_{2i}) can be estimated from (15) as

\hat{g}_1(x_{2i}) = \hat{m}_{2i} - \hat{m}_{12i}'\hat{\beta}; \qquad (17)

for example, Stock (1989) works with this model but is particularly interested in estimating g_1(x_{2i}) rather than β.

The kernel estimator for β in the context of (13) was analyzed by Robinson (1988).
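A minimal sketch of this double-residual estimator for scalar x_1 and x_2, using Nadaraya-Watson fits (from the previous section) for the two conditional means; the simulated example and the fixed bandwidth are illustrative:

```python
import numpy as np

def nw_fit(x, target, h):
    """Nadaraya-Watson fitted values E(target | x) at the sample points."""
    psi = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)
    return (K @ target) / K.sum(axis=1)

def robinson_beta(y, x1, x2, h):
    """Equation (16) with scalar x1: OLS of (y - E(y|x2)) on (x1 - E(x1|x2))."""
    e_x1 = x1 - nw_fit(x2, x1, h)      # x1 - E(x1|x2)
    e_y = y - nw_fit(x2, y, h)         # y  - E(y|x2)
    return (e_x1 @ e_y) / (e_x1 @ e_x1)

# Example: true beta = 2 with a nonlinear g1
rng = np.random.default_rng(0)
x2 = rng.uniform(0, 1, 500)
x1 = np.sin(2 * np.pi * x2) + rng.normal(size=500)
y = 2.0 * x1 + np.cos(2 * np.pi * x2) + 0.5 * rng.normal(size=500)
print(robinson_beta(y, x1, x2, h=0.1))
```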
Differencing

Consider again the partial linear model

y_i = x_{1i}\beta + g_1(x_{2i}) + \varepsilon_i, \qquad (18)

where x_1 is a scalar.

Order the x_2 from smallest to largest, so that x_{21} \le x_{22} \le \ldots \le x_{2n}.

Suppose that x_1 is a smooth function of x_2, where

E[x_1|x_2] = g(x_2)

and therefore

x_1 = g(x_2) + u.
Differencing adjacent observations gives

y_i - y_{i-1} = (x_{1i} - x_{1,i-1})\beta + (g_1(x_{2i}) - g_1(x_{2,i-1})) + \varepsilon_i - \varepsilon_{i-1}
= (g(x_{2i}) - g(x_{2,i-1}))\beta + (u_i - u_{i-1})\beta + (g_1(x_{2i}) - g_1(x_{2,i-1})) + \varepsilon_i - \varepsilon_{i-1}

Provided that the functions g_1 and g are sufficiently smooth and that the data are sufficiently dense, the differences g_1(x_{2i}) - g_1(x_{2,i-1}) and g(x_{2i}) - g(x_{2,i-1}) should be very small, giving the approximations (writing z_i \equiv x_{1i})

z_i - z_{i-1} \approx u_i - u_{i-1}
y_i - y_{i-1} \approx (u_i - u_{i-1})\beta + \varepsilon_i - \varepsilon_{i-1}
The non-parametric difference estimator of β is simply

\hat{\beta}_{diff} = \frac{\sum (z_i - z_{i-1})(y_i - y_{i-1})}{\sum (z_i - z_{i-1})^2},

which converges at the usual rate of \sqrt{n}, with normal distribution, so that

\hat{\beta}_{diff} \overset{D}{\sim} N\left(\beta,\; \frac{1.5}{n}\,\frac{\sigma^2_\varepsilon}{\sigma^2_u}\right)
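A sketch of the difference estimator; the only steps are sorting by x_2, first-differencing, and a no-intercept regression of the differenced y on the differenced x_1 (the simulated data are illustrative):

```python
import numpy as np

def differencing_beta(y, x1, x2):
    """Difference estimator for the partially linear model with scalar x1."""
    order = np.argsort(x2)               # order the data by x2
    dy = np.diff(y[order])               # y_i - y_{i-1}
    dz = np.diff(x1[order])              # z_i - z_{i-1}
    return (dz @ dy) / (dz @ dz)

# Same simulated design as the Robinson sketch above: true beta = 2
rng = np.random.default_rng(0)
x2 = rng.uniform(0, 1, 500)
x1 = np.sin(2 * np.pi * x2) + rng.normal(size=500)
y = 2.0 * x1 + np.cos(2 * np.pi * x2) + 0.5 * rng.normal(size=500)
print(differencing_beta(y, x1, x2))
```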
Example 5: Yatchew and No (2001), Gasoline Demand in Canada
Binary Choice Models

We often start with the idea of an underlying linear (latent variable) model

y^*_i = x_i'\beta + u_i \qquad (19)

where y_i = 1 when y^*_i > 0 and takes the value 0 otherwise.

The standard approach to estimating β in (19) is via maximum likelihood. The log likelihood function is formed for a sample of size n as

L = \sum_{i=1}^{n} \left[y_i \ln(G_i) + (1 - y_i)\ln(1 - G_i)\right] \qquad (20)

G_i = \int_{-\infty}^{x_i'\beta} g(u)\,du = Prob(u_i < x_i'\beta)
G is assumed to be normal (probit) or logistic (logit) in most applications.

Klein and Spady (1993) propose to estimate a smooth version of the likelihood that locally approximates the parametric likelihood. Note that x_i'\beta could be written in more general terms, but Klein and Spady do retain the linear index function in their method.

The key transformation is to note that G in (20) is the probability that u is less than x_i'\beta, conditional on the index function and the parameter β. By Bayes' rule, Prob(A|B) = Prob(A ∩ B)/Prob(B) = Prob(B|A)Prob(A)/Prob(B), so this can be written as

G[x_i'\beta; \beta] = Prob(y = 1)\,\frac{g_{\upsilon|y=1}}{g_{\upsilon}} \qquad (21)

where g_{\upsilon|y=1} is the density of the index function conditional on y = 1 and g_{\upsilon} is the unconditional density of the index function.

These can both be estimated nonparametrically using standard kernel techniques, while Prob(y = 1) can be estimated as the sample fraction of observations with y_i = 1.
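Since the right-hand side of (21) is exactly the kernel regression of y on the index, a stylized sketch of the estimator is possible. This is not Klein and Spady's own implementation: the Gaussian kernel, the fixed bandwidth, the normalization of the first coefficient to one (for scale identification), and the crude clipping of the estimated probabilities away from 0 and 1 (a simple form of trimming) are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def G_hat(v, y, h):
    """Leave-one-out kernel estimate of Prob(y = 1 | index = v), which by
    Bayes' rule equals Prob(y=1) g_{v|y=1}(v) / g_v(v) as in (21)."""
    u = (v[:, None] - v[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(K, 0.0)
    G = (K @ y) / K.sum(axis=1)
    return np.clip(G, 1e-6, 1 - 1e-6)        # crude trimming

def klein_spady(y, X, h):
    """Maximize the semiparametric likelihood over beta (first coef = 1)."""
    def neg_loglik(b_rest):
        v = X @ np.concatenate(([1.0], b_rest))
        G = G_hat(v, y, h)
        return -np.sum(y * np.log(G) + (1 - y) * np.log(1 - G))
    res = minimize(neg_loglik, np.zeros(X.shape[1] - 1), method="Nelder-Mead")
    return np.concatenate(([1.0], res.x))

# Example: two-regressor index with true beta = (1, -0.5)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X @ np.array([1.0, -0.5]) + rng.logistic(size=400) > 0).astype(float)
print(klein_spady(y, X, h=0.3))
```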
Ichimura and Thompson (1998) propose a wider class of estimators based upon a random coefficients approach:

y^*_i = x_i'\beta_i + u_i \qquad (22)

where y_i = 1 when y^*_i > 0 and takes the value 0 otherwise.

The distribution of β_i is estimated by nonparametric methods with few restrictions.

Ai and Chen (Econometrica, 2003) have proposed a better method for estimating binary choice models which is currently considered the state of the art.
Additional notes on bandwidth selection

• Plug-in methods
Usually reserved for simple density estimation. Fan and Gijbels (1996) provide plug-in estimators for regression estimation.

• Least-squares cross-validation is popular in many applications. Ichimura and Todd (2004, Handbook of Econometrics V) find that this method works well in a simulation study. The biggest problem with least-squares cross-validation arises when the data are sparse. In this case the method tends to choose a bandwidth which is too large in order to avoid having zero densities in any area (the criterion takes on an unbounded value if the density is zero at any point).

• Variable bandwidth selection methods result in estimates that are no longer densities. Thus global bandwidth selection methods tend to be preferred.

• There are also bootstrap bandwidth selection methods, which tend to be very computationally intensive.
Reducing the curse of dimensionality

• Restricting the class of models
ex: Separable models of Robinson and Yatchew
ex: Klein and Spady Binary Choice Model

• Changing the Parameter of Interest
ex: Average derivative methods

• Specifying different stochastic assumptions
see Powell (1984, J. of Econometrics)
I won't discuss this last one. But these methods essentially involve making some restriction on the conditional distribution of observable variables, but not enough to estimate the model parametrically. Powell applies these to various limited dependent variable models, including the Tobit model.
Average Derivative Method

Consider the model

y_i = g(x_i) + u_i. \qquad (23)

Suppose that instead of estimating the derivative g'(x) at every point, we are interested in

E(g'(x)). \qquad (24)

The advantage is that by taking the average over all points, the curse of dimensionality is eliminated. Even though the function g cannot be estimated at the rate of parametric convergence, the average of its derivatives can.
These estimators have achieved great popularity and are discussed in

Stoker (1986, Econometrica)
Hardle and Stoker (1989, JASA)
Powell, Stock and Stoker (1989, Econometrica)

The simplest form is the direct average derivative estimator, which is simply

\hat{\beta} = \frac{\sum_{i=1}^{n} \frac{\partial \hat{E}(y_i|x_i)}{\partial x}\, t_i}{\sum_{i=1}^{n} t_i} \qquad (25)

where t is a trimming function that removes points which have zero or negative estimated densities.
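A stylized sketch of the direct estimator in (25) for scalar x. The derivative of the kernel regression is taken here by a centered finite difference rather than analytically, and the trimming function t_i simply drops points whose estimated density falls below a threshold; both are illustrative simplifications:

```python
import numpy as np

def gaussian_K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def nw(x0, x, y, h):
    """Nadaraya-Watson estimate of E(y|x) at the point x0."""
    K = gaussian_K((x - x0) / h)
    return (K @ y) / K.sum()

def average_derivative(x, y, h, trim=0.05):
    """Direct average derivative estimator, equation (25)."""
    n = len(x)
    f_hat = np.array([gaussian_K((x - xi) / h).sum() / (n * h) for xi in x])
    t = f_hat > trim                     # trimming indicator t_i
    eps = 1e-4 * x.std()
    # derivative of E(y|x) at each x_i by a centered finite difference
    deriv = np.array([(nw(xi + eps, x, y, h) - nw(xi - eps, x, y, h)) / (2 * eps)
                      for xi in x])
    return deriv[t].mean()
```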
What affects the results?

• Bandwidth Choice
• Trimming
Trimming

Trimming essentially refers to the practice of dropping some observations which meet a particular criterion. In other cases, it may mean rounding values at or near zero up to some acceptable level (ex: Klein and Spady).

• Practical reasons
In all of the regression estimators that we have looked at, some type of density estimate appears in the denominator of the expression. If this is zero or near zero, the estimate of the conditional mean function is undefined. So it is sometimes necessary to drop data points in order to avoid the "boundary problem".
• Technical reasons
Semiparametric estimators use nonparametric estimators in their construction. The nonparametric estimators need to have uniform rates of convergence in order to establish the asymptotic properties of the semiparametric estimators. This generally involves the use of bounded kernels and densities (for x, typically) that are bounded. So most technical proofs involve the introduction of some trimming function. (See Robinson (1988) or Klein and Spady (1993) for examples.)
Additively Separable Models

This represents another way to restrict the class of models:

y_i = \beta_0 + g_1(x_{i1}) + g_2(x_{i2}) + \ldots + g_k(x_{ik}) + u \qquad (26)

Less restrictive than it appears, because some variables could involve interactions with other variables.
Estimates achieve the univariate rate of convergence: n^{2/5}.
Complicated to estimate. Use backfitting or the integration approach of Newey (1994, Econometric Theory) and Hardle and Linton (1996, Biometrika).
Less commonly applied than the partially linear model.
Partially Linear Models: Recent developments

Refinements have been proposed by

• Ahn and Powell (1993, Journal of Econometrics)
• Heckman, Ichimura, and Todd (1998, U. of Chicago, still unpublished)

These deal with the case where instrumental variables are needed and where a sample selection correction of unknown functional form is estimated.
Other Notes

The book by Pagan and Ullah (1998) remains an excellent reference. The new book by Li and Racine (2006) is written to serve more as a teaching text, complete with problem sets and examples. More recent developments are discussed by Ichimura and Todd (Handbook of Econometrics, Volume 5, 2004). I particularly like their section on bandwidth selection (chapter 6) for semi-parametric, parametric, and average derivative regression estimation techniques.