An introduction to nonparametric and semi-parametric econometric methods

Robert Breunig
Australian National University
[email protected]
http://econrsss.anu.edu.au/Staff/breunig/course_bb.htm

March 1, 2011
Outline

1. Introduction
2. Density Estimation
(a) Kernel techniques
(b) Bandwidth Selection
(c) Estimating derivatives of densities
(d) Non-kernel techniques
3. Conditional Mean Estimation
4. Semi-parametric estimation
(a) Robinson’s method
(b) Differencing
(c) Binary Choice models
(d) Mixed categorical and continuous variables
Objectives

1. Introduce nonparametric and semiparametric techniques
2. Introduce some of the key issues in the literature
3. Introduce several key tools and techniques
4. Provide examples of the use of techniques
5. Provide reference literature so that interested students can pursue these techniques in their applied work
Objects of Interest

All statistical objects studied by applied econometricians may be expressed as functions of unknown distributions.

• Measurement of inequality

F_a(x) - F_b(x) = \int_{-\infty}^{x} f_a(t)\,dt - \int_{-\infty}^{x} f_b(t)\,dt

• Regression modelling

m(x) = E[Y|x] = \int_{-\infty}^{\infty} y\,\frac{f(y,x)}{f_1(x)}\,dy
• Measurement of response

\beta(x) = \frac{dE[Y|x]}{dx} = \frac{d\left[\int y\,\frac{f(y,x)}{f_1(x)}\,dy\right]}{dx}

• Market risk

\sigma^2(x) = \int_{-\infty}^{\infty} (y - E[Y|x])^2\,\frac{f(y,x)}{f_1(x)}\,dy

• Discrete choice

Prob[Y = 1|x] = \frac{f(y,x)}{f_1(x)} > .5
Parametric Models

Parametric econometric methods require the prior specification of the functional form of the object being estimated.

For example, one might assume that the conditional mean function is linear:

m(x) = E[Y|x] = \int_{-\infty}^{\infty} y\,\frac{f(y,x)}{f_1(x)}\,dy = \beta_0 + \beta_1 x

This specification implies a constant response:

\beta(x) = \frac{dE[Y|x]}{dx} = \frac{d\left[\int y\,\frac{f(y,x)}{f_1(x)}\,dy\right]}{dx} = \frac{d(\beta_0 + \beta_1 x)}{dx} = \beta_1
Parametric Methods: Drawbacks

Parametric models impose a priori structure on the underlying DGP.

Having assumed that this structure is known, we then estimate a handful of unknown parameters.

Choice of models is frequently not based upon any attempt to select the correct parametric specification from the space of admissible models. Rather, model selection is usually made on the basis of tractability and ease of interpretation.

The risk is that inference, prediction, and policy are all based upon an incorrectly specified parametric model. The consequences of such mis-specification are well known.
Nonparametric Methods

Nonparametric estimators estimate objects of interest to economists by replacing unknown densities and distribution functions with their nonparametric density estimators.

They are consistent under less restrictive assumptions than those underlying their parametric counterparts.

When there is sufficient data, these estimators frequently reveal features of the data that are invisible under parametric techniques.

Different features and structures revealed by nonparametric estimators often lead to different conclusions and policy prescriptions than those based upon parametric methods.
Four uses of nonparametric methods

1. Visualizing the data
2. Testing and comparing models
3. Conditional Mean Estimation (regression)
4. Combining parametric and nonparametric methods (semi-parametric estimation)
Basic building block: Nonparametric Kernel Density Estimator

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)
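As a concrete sketch, a minimal NumPy implementation of this estimator, assuming a Gaussian kernel and a user-chosen bandwidth h (both choices are discussed below):

```python
import numpy as np

def kernel_density(x_grid, data, h):
    """f_hat(x) = (1/nh) * sum_i K((x_i - x)/h) with a standard normal K."""
    n = len(data)
    # psi has shape (len(x_grid), n): one row of (x_i - x)/h per grid point
    psi = (data[None, :] - x_grid[:, None]) / h
    K = np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)  # standard normal kernel
    return K.sum(axis=1) / (n * h)

# Example: estimate the density of a simulated sample
rng = np.random.default_rng(0)
data = rng.normal(size=500)
grid = np.linspace(-4, 4, 200)
f_hat = kernel_density(grid, data, h=0.4)
```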
We would like to estimate the density, f(x), from a sample x_1, x_2, \ldots, x_n.

• Histogram

• Naive Non-parametric/Local Histogram

\hat{f}_I(x) = \frac{1}{nh} \sum_{i=1}^{n} I\left(-\frac{1}{2} < \frac{x_i - x}{h} < \frac{1}{2}\right) = \frac{n^*}{nh}

n^* is the number of points which lie between x - h/2 and x + h/2.

h will determine the smoothness of the estimate.
• Replace the indicator function with a smooth weighting function called a Kernel:

\hat{f}(y) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{y_i - y}{h}\right) \quad \text{where} \quad \int K(\psi)\,d\psi = 1

K(\psi) large for small \psi
K(\psi) small for large \psi
K(\cdot) should be symmetric.
A large class of functions satisfy these assumptions, for example:

(i) Standard Normal:

K(\psi) = (2\pi)^{-1/2} \exp\left[-\frac{1}{2}\psi^2\right]

(ii) Uniform:

K(\psi) = (2c)^{-1}, \quad -c < \psi < c, and 0 otherwise.

(iii) Epanechnikov (1969) [optimal kernel]:

K_0(\psi) = \frac{3}{4}(1 - \psi^2), \quad |\psi| \le 1, and 0 otherwise.
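As a sketch, these three kernels written as NumPy functions (the uniform kernel with c = 1/2 reproduces the naive local histogram above):

```python
import numpy as np

def normal_kernel(psi):
    # (i) standard normal: integrates to 1 over the real line
    return np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)

def uniform_kernel(psi, c=0.5):
    # (ii) uniform on (-c, c)
    return np.where(np.abs(psi) < c, 1.0 / (2 * c), 0.0)

def epanechnikov_kernel(psi):
    # (iii) Epanechnikov: (3/4)(1 - psi^2) on |psi| <= 1, zero otherwise
    return np.where(np.abs(psi) <= 1, 0.75 * (1 - psi**2), 0.0)
```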
In order to implement this estimator, we have to make two choices:

1. Kernel (weight) function: K
2. The smoothing parameter (bandwidth): h

It turns out that the choice of kernel does not have much effect on the optimality of the estimator, but the choice of bandwidth (or window width) has important repercussions for our results.
Bandwidth selection methods

1. Plug-in methods
2. Likelihood cross-validation
3. Least-squares cross-validation
All of these methods begin from the same starting point, which is that the bandwidth, h, should be chosen so that the estimated density, \hat{f}(x), is as close as possible to the true density, f(x). Most of the time we employ some kind of global criterion. The most common is the integrated squared error (ISE)

\int_{-\infty}^{\infty} \left(\hat{f}(x) - f(x)\right)^2 dx \qquad (ISE)

or its expected value, the mean integrated squared error (MISE)

E\left[\int_{-\infty}^{\infty} \left(\hat{f}(x) - f(x)\right)^2 dx\right] \qquad (MISE)

These two quantities correspond to loss and risk.
For independent and identically distributed (i.i.d.) data, it is straightforward to show that

Bias(\hat{f}) = E\hat{f} - f = \int K(\psi)\left[f(h\psi + x) - f(x)\right] d\psi

and

V(\hat{f}) = (nh)^{-1} \int K^2(\psi)\,f(h\psi + x)\,d\psi - n^{-1}\left[\int K(\psi)\,f(h\psi + x)\,d\psi\right]^2
The expressions for exact bias and variance are not useful without knowledge of the quantity that we are attempting to estimate: the true underlying density.

We can, however, derive approximations to these quantities by expanding f(hψ + x) in a Taylor series, for small h:

f(h\psi + x) = f(x) + h\psi f^{(1)}(x) + \frac{h^2}{2}\psi^2 f^{(2)}(x) + \ldots
Given the i.i.d. assumption above, the assumptions made regarding the kernel function, and the following additional assumptions:

(A3) The second order derivatives of f are continuous and bounded in some neighborhood of x.
(A4) h = h_n → 0 as n → ∞.
(A5) nh_n → ∞ as n → ∞.

we can show that up to O(h^2) the bias is given by

Bias(\hat{f}) = \frac{h^2}{2}\mu_2 f^{(2)}(x), \quad \text{where } \mu_2 = \int \psi^2 K(\psi)\,d\psi,

and up to O((nh)^{-1}), the variance is given by

V(\hat{f}) = (nh)^{-1} f(x) \int K^2(\psi)\,d\psi.
MISE = \int \left[\left(Bias(\hat{f})\right)^2 + Var(\hat{f})\right] dx

The approximate MISE, using the above expressions, is

AMISE = \frac{h^4}{4}\mu_2^2 \int \left(f''(x)\right)^2 dx + (nh)^{-1} \int f(x)\,dx \int K^2(\psi)\,d\psi = \frac{1}{4}\lambda_1 h^4 + \lambda_2 (nh)^{-1}

where

\lambda_1 = \mu_2^2 \int \left(f^{(2)}(x)\right)^2 dx, \qquad \lambda_2 = \int K^2(\psi)\,d\psi. \qquad (1)
The optimal window width, in the sense that the approximate integrated mean squared error is minimized, will be

h = c\,n^{-1/5};

minimizing (1) with respect to h shows that c = (\lambda_2/\lambda_1)^{1/5}.
Assuming a normal kernel and a normal density, f(x), both λ_1 and λ_2 can be evaluated numerically. This provides

h = 1.06\,\sigma_x\,n^{-1/5}

Software packages which implement nonparametric density estimation (SAS, Shazam, STATA) use this as the default window width. For non-normal distributions it works well as a first approximation. It can also provide a good starting point for data-driven methods of bandwidth selection (see more below). It is by far the most commonly used window width in the literature.
Silverman (1986) provides several other alternatives which work well for heavily skewed data or multi-modal data.

A simple improvement is to replace σ by a robust estimator of spread, and he specifies two alternatives that seem to work well:

h = .79\,R\,n^{-1/5}
h = .9\,A\,n^{-1/5},

where R is the inter-quartile range and A = min(σ, R/1.34).
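These rules are simple to code; a sketch returning all three for a univariate sample:

```python
import numpy as np

def rule_of_thumb_bandwidths(data):
    """Normal-reference and Silverman (1986) rule-of-thumb bandwidths."""
    n = len(data)
    sigma = data.std(ddof=1)
    q75, q25 = np.percentile(data, [75, 25])
    R = q75 - q25                      # inter-quartile range
    A = min(sigma, R / 1.34)           # robust measure of spread
    return {
        "normal_reference": 1.06 * sigma * n ** (-1 / 5),
        "silverman_iqr":    0.79 * R * n ** (-1 / 5),
        "silverman_robust": 0.90 * A * n ** (-1 / 5),
    }
```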
Least Squares Cross-validation

This is essentially a data-driven technique for choosing the optimal bandwidth. The idea is to minimize a particular criterion function. In least squares cross-validation the function minimized is

ISE(h) = \int \left(\hat{f}(x) - f(x)\right)^2 dx = \int \hat{f}^2\,dx + \int f^2\,dx - 2\int \hat{f} f\,dx.

Since \int f^2\,dx does not depend upon \hat{f}, the function that people minimize in practice is actually

\int \hat{f}^2\,dx - 2\int \hat{f} f\,dx.
Further manipulation yields

ISE^*(h) = n^{-2} h^{-1} \sum_{i=1}^{n} \sum_{j=1}^{n} K \circ K\left(\frac{x_i - x_j}{h}\right) - 2n^{-1} \sum_{i=1}^{n} \hat{f}_{-i}(x_i)

as the function that is actually minimized, where K ∘ K denotes the convolution of the kernel with itself.

n^{-1} \sum_{i=1}^{n} \hat{f}_{-i}(x_i) is the leave-one-out estimator, which is formed as a standard kernel density estimator omitting the i-th observation. This provides an unbiased estimate of \int \hat{f} f\,dx. Most programs actually implement the leave-one-out estimator as the actual density estimate since it minimizes the influence of solitary observations.
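For the Gaussian kernel, K ∘ K is the N(0, 2) density, so the criterion can be coded directly. A sketch, with the bandwidth chosen by grid search over user-supplied candidates:

```python
import numpy as np

def gaussian_K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def lscv_criterion(h, data):
    """ISE*(h) for a Gaussian kernel; K∘K is the N(0, 2) density."""
    n = len(data)
    u = (data[:, None] - data[None, :]) / h          # (x_i - x_j)/h
    KK = np.exp(-0.25 * u**2) / np.sqrt(4 * np.pi)   # (K∘K)(u)
    term1 = KK.sum() / (n**2 * h)
    # leave-one-out density estimates f_{-i}(x_i)
    Kmat = gaussian_K(u)
    loo = (Kmat.sum(axis=1) - gaussian_K(0.0)) / ((n - 1) * h)
    return term1 - 2 * loo.mean()

def lscv_bandwidth(data, grid):
    """Choose h by minimizing ISE*(h) over a grid of candidates."""
    return min(grid, key=lambda h: lscv_criterion(h, data))
```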
Likelihood Cross-validation

The basic idea behind this method is to choose an h which maximizes the log likelihood \log L = \sum_{i=1}^{n} \log f(x_i). An estimated log likelihood or pseudo log likelihood can be written as

\log \hat{L} = \sum_{i=1}^{n} \log \hat{f}(x_i) = \log \hat{L}(h),

where \hat{f}(x_i) is a density estimator of f and it depends on h.

Maximizing \log \hat{L} with respect to h produces a trivial maximum at h = 0. To overcome this problem, the cross-validation principle might be adopted, in which \hat{f}(x_i) is replaced by \hat{f}_{-i}(x_i).
This "leave one out" version of the estimator can be written as

\hat{f}_{-i}(x_i) = ((n-1)h)^{-1} \sum_{j=1,\, j \ne i}^{n} K\left(\frac{x_j - x_i}{h}\right).

Thus the likelihood CV principle is to choose h such that

\log \hat{L}(h) = \sum_{i=1}^{n} \log \hat{f}_{-i}(x_i)

is a maximum. The procedure is also known as Kullback-Leibler cross-validation in the sense that it gives an h for which the Kullback-Leibler distance measure between the two densities f and \hat{f},

I(f, \hat{f}) = \int f(x) \log\left\{\frac{f(x)}{\hat{f}(x)}\right\} dx,

is a minimum; see Hall (1987).
A disadvantage of the h obtained by likelihood CV is that it can be severely affected by the tail behavior of f. Furthermore, Hall (1987) has indicated that selecting h by minimizing the Kullback-Leibler measure may be useful for the statistical discrimination problem but not for curve estimation. Thus the likelihood CV procedure has not attracted much current interest in the literature.
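A minimal sketch of the likelihood CV criterion, using the same leave-one-out construction as in the least-squares sketch above:

```python
import numpy as np

def likelihood_cv(h, data):
    """Leave-one-out log likelihood sum_i log f_{-i}(x_i), Gaussian kernel."""
    n = len(data)
    u = (data[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(K, 0.0)                 # leave the i-th observation out
    f_loo = K.sum(axis=1) / ((n - 1) * h)
    return np.log(f_loo).sum()

def likelihood_cv_bandwidth(data, grid):
    """Choose h by maximizing the criterion over a grid of candidates."""
    return max(grid, key=lambda h: likelihood_cv(h, data))
```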
Other density estimation techniques

• Nearest Neighbor Density Estimation

Let d(x_1, x) represent the distance of point x_1 from the point x, and for each x denote by d_k(x) the distance of x from its k-th nearest neighbor (k-NN) among x_1, ..., x_n. Then, taking h = 2d_k(x), the estimator can be written as

\hat{f}_{k-NN}(x) = \frac{\#(x_1 \ldots x_n) \text{ in } [x - d_k(x),\, x + d_k(x)]}{2n\,d_k(x)} = \frac{k}{2n\,d_k(x)} = \frac{1}{2n\,d_k(x)} \sum_{i=1}^{n} I\left(\left|\frac{x_i - x}{2d_k(x)}\right| < \frac{1}{2}\right).

The degree of smoothing is controlled by an integer k, typically k ≈ n^{1/2}.
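A direct sketch of this estimator, computing d_k(x) by sorting distances at each evaluation point:

```python
import numpy as np

def knn_density(x_grid, data, k):
    """k-nearest-neighbor density estimate f_hat(x) = k / (2 n d_k(x))."""
    n = len(data)
    f = np.empty(len(x_grid))
    for m, x in enumerate(x_grid):
        d = np.sort(np.abs(data - x))   # distances from x, increasing
        d_k = d[k - 1]                  # distance to the k-th nearest neighbor
        f[m] = k / (2 * n * d_k)
    return f

# Example with the typical choice k ≈ n^(1/2)
rng = np.random.default_rng(0)
data = rng.normal(size=400)
grid = np.linspace(-3, 3, 100)
f_hat = knn_density(grid, data, k=int(np.sqrt(len(data))))
```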
• Series estimation

Suppose X is a random variable with density f on the unit interval [0,1]. Under these circumstances it can be expressed as the Fourier series

f(x) = \sum_{j=0}^{\infty} a_j \zeta_j(x),

where, for each j ≥ 0, the coefficients are

a_j = \int_0^1 f(x)\,\zeta_j(x)\,dx = E\zeta_j(X),

and the sequence \zeta_j(x) is given by \zeta_0(x) = 1, \zeta_j(x) = \sqrt{2}\cos \pi(j+1)x when j is odd, and \zeta_j(x) = \sqrt{2}\sin \pi j x when j is even.
Using \hat{a}_j = n^{-1} \sum_{i=1}^{n} \zeta_j(x_i) as an estimator of a_j, the orthogonal series estimator is defined as

\hat{f}(x) = \sum_{j=0}^{m} \hat{a}_j \zeta_j(x),

where m is the cutoff point in the infinite sum and determines the amount of smoothing.

The regression analog of this is to express the conditional mean of y as an infinite polynomial in x.
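A sketch of the orthogonal series estimator under the slide's basis convention (the data are assumed to lie in [0, 1]):

```python
import numpy as np

def zeta(j, x):
    """Orthonormal Fourier basis on [0,1], following the convention above."""
    if j == 0:
        return np.ones_like(x)
    if j % 2 == 1:                              # j odd
        return np.sqrt(2) * np.cos(np.pi * (j + 1) * x)
    return np.sqrt(2) * np.sin(np.pi * j * x)   # j even

def series_density(x_grid, data, m):
    """Orthogonal series estimator with cutoff m."""
    f = np.zeros(len(x_grid))
    for j in range(m + 1):
        a_j = zeta(j, data).mean()      # a_hat_j = n^{-1} sum_i zeta_j(x_i)
        f += a_j * zeta(j, x_grid)
    return f

# Example: data on the unit interval, modest cutoff m
rng = np.random.default_rng(0)
data = rng.beta(2, 5, size=500)
grid = np.linspace(0, 1, 200)
f_hat = series_density(grid, data, m=8)
```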
• Variable window width estimators

Another option is to let the window width vary with each point in the data according to some rule. The estimator will then have the form

\hat{f}_{vww}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_{ni}} K\left(\frac{x_i - x}{h_{ni}}\right).

In general, the rule should allow larger h in regions where there are few observations and smaller h in those where observations are densely located.
• Penalized likelihood estimators

• Local log-likelihood estimators

Both of these techniques treat f(x) as an unknown parameter and try to employ likelihood methods to estimate the unknown quantity. The global likelihood has no finite maximum over the class of all densities, so the options are instead to maximize a penalized likelihood function (which imposes some pre-determined amount of smoothness on the function) or the local, kernel-weighted, log-likelihood.
• Example 1: Eruption Length of Old Faithful Geyser in Yellowstone National Park (3 files: oldfaithful∗.pdf)

• Example 2: Hamilton and Lin (1996) model of excess stock returns from the Standard and Poor 500

• Example 3: Ait-Sahalia (1996) nonparametric test of interest rate diffusion models
Multivariate Density Estimation

Consider a bivariate distribution where the i-th sample observation is given by (y_i, x_i) and z = (y, x) is a fixed point.

This can be estimated nonparametrically by

\hat{f}(y, x) = \hat{f}(z) = \frac{1}{nh^2} \sum_{i=1}^{n} K_1\left(\frac{z_i - z}{h}\right)
The kernel estimator of the marginal density f_1(x) of X is

\hat{f}_1(x) = \int \hat{f}(y, x)\,dy = \frac{1}{nh^2} \sum_{i=1}^{n} \int K_1\left(\frac{y_i - y}{h}, \frac{x_i - x}{h}\right) dy = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)

where K(x) = \int K_1(y, x)\,dy is such that \int K(x)\,dx = 1. The estimator of the conditional density of Y given X can then be written as

\hat{f}(y|x) = \frac{\hat{f}(y, x)}{\hat{f}_1(x)}.
In general, for a multivariate density estimation problem of dimension d, the optimal h which minimizes the approximate MISE can be found by substituting nh^d for nh in the MISE expression given earlier and minimizing with respect to h. It is easy to show that

h = c\,n^{-1/(4+d)},

and, for this h, AMISE = O(n^{-4/(4+d)}). When the kernel is multivariate standard normal, c = \{4/(2d+1)\}^{1/(d+4)}.
Curse of dimensionality

It is clear from this result that the higher the dimension d, the slower will be the speed of convergence of \hat{f} to f. Thus one may need a large data size to estimate the multivariate density in high dimensions.
Multivariate Kernels

• standard multivariate normal density,

K(\psi) = (2\pi)^{-d/2} \exp\left(-\frac{1}{2}\psi'\psi\right)

where d = dim(ψ)

• multivariate Epanechnikov kernel

K(\psi) = \frac{1}{2}\,c_d^{-1}(d+2)(1 - \psi'\psi)

if ψ'ψ < 1, and equaling 0 otherwise, where c_d is the volume of the unit d-dimensional sphere (c_1 = 2, c_2 = π, c_3 = 4π/3).
One disadvantage with direct application of the kernels above is that the variables may exhibit disparate variation. To overcome this problem it is good practice to work with standardized data, i.e., data normalized by the standard deviation or some other measure of scale. Then each of the elements in ψ will have unit variance and application of a kernel such as the multivariate standard normal is appropriate.
Conditional mean estimation

Consider q + 1 = p economic variables (Y, X') where Y is the dependent variable and X is a (q × 1) vector of regressors; these p variables are taken to be completely characterized by their unknown joint density f(y, x_1, \ldots, x_q) = f(y, x), at the points y, x.

As noted in the introduction, interest frequently centres upon the conditional mean m(x) = E(Y|X = x), where x is some fixed value of X.

Now suppose that we have n data points (y_i, x_i'). By definition,

y_i = E(Y_i|X_i = x_i) + u_i = m(x_i) + u_i,

where the error term u_i has the properties E(u_i|x_i) = 0 and E(u_i^2|x_i) = \sigma^2(x_i).
Parametric Estimation

Parametric methods specify a form for m(x_i). In the case of a linear specification,

y_i = \alpha + x_i\beta + u_i.

The least squares estimators of α and β are \alpha^* = \bar{y} - \bar{x}\beta^* and

\beta^* = \left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{-1}\left(\sum_{i=1}^{n}(x_i - \bar{x})y_i\right).

The best unbiased parametric estimator of m(x) = \alpha + x\beta is

m^*(x) = \alpha^* + x\beta^* = \sum_{i=1}^{n} a_{ni}(x)\,y_i \qquad (2)

where a_{ni}(x) = n^{-1} + (x - \bar{x})(x_i - \bar{x})\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{-1}. The m^* in (2) is a weighted sum of the y_i, where the weights a_{ni} are linear in x and depend on the distances of x and x_i from \bar{x}.
The assumption that m(x_i) = \alpha + x_i\beta implies certain assumptions about the data generating process (joint density). For example, if (y_i, x_i) is bivariate normal then it can be shown that the mean of the conditional density of y_i given x_i is E(y_i|x_i) = \alpha + x_i\beta, where \alpha = Ey_i - (Ex_i)\beta and \beta = (var(x_i))^{-1} cov(x_i, y_i). This implies that the assumption of a linear specification for m(x) holds if the data come from the normal distribution. However, if the true distribution is not normal then the linear specification for the conditional expectation may be invalid, and so the least squares estimator of m(x) will become biased and inconsistent.
For example, suppose the true relationship is

y_i = \alpha + x_i\beta + x_i^2\gamma + u_i;

then the "parameter of interest" is \beta + 2\gamma x_i = \partial y_i/\partial x_i. However, if a linear approximation is taken, \partial y_i/\partial x_i is being estimated under the false restriction that \gamma = 0. Typically, the exact functional form connecting m(x) with x is unknown. Because of the possibility that forcing the function to be linear or quadratic may affect the accuracy of estimation of m(x), it is worthwhile considering nonparametric estimation of the unknown function, and this task is taken up in the following sections.
Kernel-Based Estimation

Suppose that the x_i are i.i.d. random variables. Because m(x_i) is the mean of the conditional density f(y_i|x_i) = f(Y|X = x_i), there is a potential to employ the methods of density estimation seen earlier. By definition the conditional mean is

m = \int_{-\infty}^{\infty} \left(y\,f(y, x)/f_1(x)\right) dy, \qquad (3)

where f_1(x) is the marginal density of X at x.

Nadaraya (1964) and Watson (1964) therefore proposed that m be estimated by replacing f(y, x) by \hat{f}(y, x) and f_1(x) by \hat{f}_1(x), where these density estimators are the kernel estimators discussed above.
The expressions for \hat{f}(y, x) and \hat{f}_1(x) from the first part of this talk may be substituted into (3) to give

\hat{m} = \int_{-\infty}^{\infty} y \left[\frac{(nh^p)^{-1} \sum_{i=1}^{n} K_1\left(\frac{y_i - y}{h}, \frac{x_i - x}{h}\right)}{(nh^q)^{-1} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)}\right] dy, \qquad (4)

where p = q + 1 and h is the window width. Some simplification yields

\hat{m} = \left[(nh^q)^{-1} \sum_{i=1}^{n} y_i K\left(\frac{x_i - x}{h}\right)\right] \Bigg/ \left[(nh^q)^{-1} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)\right] = \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right) y_i \Bigg/ \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right).
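In code, the estimator is just a ratio of two kernel sums. A minimal sketch with a Gaussian kernel and scalar x:

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, h):
    """m_hat(x) = sum_i K((x_i - x)/h) y_i / sum_i K((x_i - x)/h)."""
    psi = (x[None, :] - x_grid[:, None]) / h
    K = np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)   # Gaussian kernel weights
    return (K * y[None, :]).sum(axis=1) / K.sum(axis=1)

# Example: recover a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 300)
y = np.sin(x) + 0.3 * rng.normal(size=300)
grid = np.linspace(0, 2 * np.pi, 100)
m_hat = nadaraya_watson(grid, x, y, h=0.3)
```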
A feature of the Nadaraya-Watson estimator is that it is a weighted sum of those y_i's that correspond to x_i in a neighborhood of x. The weights are low for x_i's far away from x and high for x_i's closer to x.

With this motivation, a general class of nonparametric estimators of m(x) can be written as

\hat{m} = \hat{m}(x) = \sum_{i=1}^{n} w_{ni}(x)\,y_i

where w_{ni}(x) = w_n(x_i, x) represents the weight assigned to the i-th observation y_i, and it depends on the distance of x_i from the point x.

Note that the parametric estimator m^*(x) in (2) is a special case with linear weights w_{ni}(x) = a_{ni}(x) such that \sum w_{ni}(x) = 1, but w_{ni}(x) \ge 0 is not necessarily true.
An implicit assumption in nonparametric estimation is that m(x) is smooth over x, implying that y_i contains information about m(x) whenever x_i is near to x. The estimator \hat{m}(x) is a smoothed estimator in the sense that it is constructed, at every point, by local averaging of the observations y_i corresponding to those x_i close to x in some sense.

In parametric regression, a functional form is specified for the conditional mean m(x). This functional form, say m(x, β), depends on a finite number of unknown parameters β. The least squares estimate of m = m(x) is m(x, \hat{\beta}), where \hat{\beta} is chosen to minimize

\sum_{i=1}^{n} \left(y_i - m(x_i, \beta)\right)^2. \qquad (5)
Compare (5) with the following weighted least squares criterion for the nonparametric estimation of m(x):

\sum_{i=1}^{n} w^*_{ni}(x) \left[y_i - m(x)\right]^2. \qquad (6)

In (6), m(x) replaces the m(x, β) that appears in (5). If m(x) is regarded as a single unknown parameter m, it may be estimated by minimizing

\sum_{i=1}^{n} w^*_{ni}(x) \left[y_i - m\right]^2. \qquad (7)

The resulting estimate, \hat{m}, of m(x) is precisely the Nadaraya-Watson estimator. Thus the kernel estimator \hat{m} is also a least squares estimator, with w^*_{ni}(x) = K((x_i - x)/h).
One might also think of \hat{m}(x) as a method of moments estimator. Since E(u_i|x_i) = 0,

E\left[w^*_{ni}(x)(y_i - m(x_i))\right] = 0 \qquad (8)

or

E\left[w^*_{ni}(x)(y_i - m) + w^*_{ni}(x)(m - m(x_i))\right] = 0. \qquad (9)

If the second term in (9) is ignored and a sample estimate of the first, n^{-1}\sum_{i=1}^{n} w^*_{ni}(x)(y_i - m), is used, the value of m for which this is zero is again the Nadaraya-Watson estimator.
Whether the second term can be ignored depends upon the weights w^*_{ni}(x). If the weights were the indicator functions of the "local histogram" presented earlier, the second term would be identically zero, whereas with kernel weights it is only asymptotically zero. Because the orthogonality relation only holds as n → ∞, the situation is outside the normal framework described by Hansen (1982), but it is close to work reported in Powell (1986), in that the expected value of the function the parameter solves changes with the sample size (through h) and so its large sample limit has to be used instead.
Local Linear Nonparametric Regression

The Nadaraya-Watson estimator of m(x) minimizes

\sum_{i=1}^{n} \{y_i - \alpha\}^2 K\left(\frac{x_i - x}{h}\right)

with respect to α, giving

\hat{m}(x) = \hat{\alpha} = \left[\sum K\left(\frac{x_i - x}{h}\right)\right]^{-1} \sum K\left(\frac{x_i - x}{h}\right) y_i.

Stone (1977) and Cleveland (1979) suggested that one instead minimize

\sum_{i=1}^{n} \{y_i - \alpha - (x_i - x)\beta\}^2 K\left(\frac{x_i - x}{h}\right)

with respect to α and β and set \hat{m}(x) equal to the resulting estimate of α.
This estimate can be found by performing a weighted least squares regression of y_i against z_i' = (1, (x_i - x)) with weights [K((x_i - x)/h)]^{1/2}. Thus, while the Nadaraya-Watson estimator is fitting a constant to the data close to x, the local linear approximation is fitting a straight line.

This local linear smoothing estimator has been extensively investigated by Fan (1992a, 1993), Fan and Gijbels (1992) and Ruppert and Wand (1994).
The resulting estimator has the form

\hat{m}_{LL}(x) = \sum_{i=1}^{n} w^{LL}_{ni}(x)\,y_i,

with weights w^{LL}_{ni} = e_1'\left(\sum z_i K_i z_i'\right)^{-1} z_i K_i, where e_1 is a column vector of the same dimension as z_i with unity as its first element and zeros elsewhere.

One advantage of this estimator is that it can be analysed with standard regression techniques, and it has the same first order statistical properties irrespective of whether the x_i are stochastic or non-stochastic. The optimal window width is proportional to n^{-1/5}.
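A sketch of the local linear estimator as the weighted least squares problem just described, solving for the intercept point by point (a Gaussian kernel is assumed for concreteness):

```python
import numpy as np

def local_linear(x_grid, x, y, h):
    """At each x0, weighted least squares of y_i on z_i' = (1, x_i - x0)
    with kernel weights K_i; m_hat(x0) is the intercept estimate."""
    m = np.empty(len(x_grid))
    for j, x0 in enumerate(x_grid):
        psi = (x - x0) / h
        K = np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)   # weights K_i
        Z = np.column_stack([np.ones_like(x), x - x0])   # z_i'
        W = np.diag(K)
        # alpha_hat is the first element of (Z'WZ)^{-1} Z'Wy
        coef = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
        m[j] = coef[0]
    return m

# Usage mirrors the Nadaraya-Watson example above
rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 300)
y = np.sin(x) + 0.3 * rng.normal(size=300)
m_hat = local_linear(np.linspace(0, 2 * np.pi, 100), x, y, h=0.4)
```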
Applications of the idea in econometrics include McManus (1994) to estimation of cost functions, Gourieroux and Scaillet (1994) to the term structure, Lin and Shu (1994) to estimation of a disequilibrium transition model, Bossaerts and Hillion (1997) to options prices and their determinants, and Ullah and Roy (1996) to a nutrition/income relation. Implementation and computations are discussed in Cleveland et al. (1988). Hastie and Loader (1993) provide an excellent account of the history and potential of the method.
The logic of local linear regression smoothing can be seen by expanding m(x_i) around x to get

m(x_i) = m(x) + \frac{\partial m}{\partial x}(x^*)(x_i - x), \qquad (10)

where x^* lies between x_i and x. This may be expressed as

m(x_i) = \alpha + \beta(x^*)(x_i - x). \qquad (11)
Now, since E(y_i|x_i) = m(x_i), the objective function

\sum \left(y_i - m(x_i)\right)^2 K_i = \sum \left(y_i - \alpha - \beta(x^*)(x_i - x)\right)^2 K_i

is essentially the residual sum of squares from a regression using only observations close to x_i = x. Notice that this means that \beta(x^*) will be very close to constant, as x^* must lie between x_i and x. This also points to the fact that improvements might be available from expanding m(x_i) as a j-th order polynomial in (x_i - x), but doing so requires the derivatives m^{(j)} to exist.
Example 4: Eruption Length of Old Faithful Geyser Conditional on Waiting Time
Other Notes

• The optimal h can be found by minimizing the MISE as in the density case, and it can be shown that

h_{opt} \propto n^{-1/(q+4)}

• Cross-validation may be performed by minimizing the estimated prediction error (EPE), n^{-1}\sum (y_i - \hat{m}(x_i))^2, where \hat{m}(x_i) is computed as the "leave-one-out" estimator deleting the i-th observation in the sums (a sketch follows this list). To appreciate why minimizing EPE is sensible, notice that when the "leave one out" estimator is employed and observations are independent, \hat{m}_i is independent of y_i, meaning that E(\hat{m}_i(y_i - m_i)) = 0, and so

E(EPE) = \sigma^2 + E\left(n^{-1}\sum (\hat{m}_i - m_i)^2\right) = \sigma^2 + MASE.

Minimizing E(EPE) with respect to h is therefore equivalent to minimizing MASE with respect to h. Unfortunately, minimizing the sample EPE tends to produce an estimator of h that converges only extremely slowly, at order n^{-1/10}, to the value of h minimizing E(EPE).

• The curse of dimensionality means that "pure" nonparametric regression is difficult to use in higher dimension problems.
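A sketch of the leave-one-out cross-validation described in the second bullet, with the Nadaraya-Watson estimator and a Gaussian kernel assumed for concreteness:

```python
import numpy as np

def epe(h, x, y):
    """Estimated prediction error with leave-one-out Nadaraya-Watson fits."""
    psi = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(K, 0.0)              # delete the i-th observation
    m_loo = (K @ y) / K.sum(axis=1)       # leave-one-out fitted values
    return np.mean((y - m_loo) ** 2)

def cv_bandwidth(x, y, grid):
    """Choose h by minimizing EPE over a grid of candidate bandwidths."""
    return min(grid, key=lambda h: epe(h, x, y))
```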
Semi-parametric estimation

A number of models exist in the literature which have the distinguishing feature that part of the model is linear and part constitutes an unknown non-linear format:

y_i = x_{1i}'\beta + g_1(x_{2i}) + u_i, \qquad (12)

which could be written in matrix form as

y = X_1\beta + g_1 + u. \qquad (13)

In (12), x_{2i} cannot have unity as an element.

This intercept restriction is an identification condition arising from the fact that g_1(x_{2i}) is unconstrained and therefore can have a constant term as part of its definition. Hence, it would always be possible to add any constant number to (12) and then absorb it into g_1(x_{2i}), showing that, without some further restriction upon the nature of g_1(x_{2i}), it is impossible to consistently estimate an intercept.

This issue of identification of parameters, particularly in regards to the intercept, but sometimes a scale parameter as well, arises a good deal in the semi-parametric literature and needs to be dealt with by imposing some restrictions.

The parameter of interest is β, so the issue is how to estimate it in the presence of the unknown function g_1.
A Semi-Parametric Estimator of β

Taking the conditional expectation of (13) leads to

E(y_i|x_{2i}) = E(x_{1i}|x_{2i})'\beta + g_1(x_{2i}).

Consequently,

y_i - E(y_i|x_{2i}) = (x_{1i} - E(x_{1i}|x_{2i}))'\beta + u_i \qquad (14)

and

g_1(x_{2i}) = E(y_i|x_{2i}) - E(x_{1i}|x_{2i})'\beta. \qquad (15)
Since (14) has the properties of a linear regression model with dependent variable y_i - E(y_i|x_{2i}) and independent variables (x_{1i} - E(x_{1i}|x_{2i})), an obvious estimator of β is

\hat{\beta} = \left[\sum_{i=1}^{n} (x_{1i} - \hat{m}_{12i})(x_{1i} - \hat{m}_{12i})'\right]^{-1} \left[\sum_{i=1}^{n} (x_{1i} - \hat{m}_{12i})(y_i - \hat{m}_{2i})\right], \qquad (16)

where \hat{m}_{12i} and \hat{m}_{2i} are the kernel based estimators of m_{12i} = E(x_{1i}|x_{2i}) and m_{2i} = E(y_i|x_{2i}).

Once \hat{\beta} is found, g_1(x_{2i}) can be estimated from (15) as

\hat{g}_1(x_{2i}) = \hat{m}_{2i} - \hat{m}_{12i}'\hat{\beta}; \qquad (17)

for example, Stock (1989) works with this model but is particularly interested in estimating g_1(x_{2i}) rather than β.

The kernel estimator for β in the context of (13) was analyzed by Robinson (1988).
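A minimal sketch of this double-residual estimator for scalar x_1 and x_2, using Nadaraya-Watson fits (from the previous section) for the two conditional means; the simulated example and the fixed bandwidth are illustrative:

```python
import numpy as np

def nw_fit(x, target, h):
    """Nadaraya-Watson fitted values E(target | x) at the sample points."""
    psi = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)
    return (K @ target) / K.sum(axis=1)

def robinson_beta(y, x1, x2, h):
    """Equation (16) with scalar x1: OLS of (y - E(y|x2)) on (x1 - E(x1|x2))."""
    e_x1 = x1 - nw_fit(x2, x1, h)      # x1 - E(x1|x2)
    e_y = y - nw_fit(x2, y, h)         # y  - E(y|x2)
    return (e_x1 @ e_y) / (e_x1 @ e_x1)

# Example: true beta = 2 with a nonlinear g1
rng = np.random.default_rng(0)
x2 = rng.uniform(0, 1, 500)
x1 = np.sin(2 * np.pi * x2) + rng.normal(size=500)
y = 2.0 * x1 + np.cos(2 * np.pi * x2) + 0.5 * rng.normal(size=500)
print(robinson_beta(y, x1, x2, h=0.1))
```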
Differencing

Consider again the partial linear model

y_i = x_{1i}\beta + g_1(x_{2i}) + \varepsilon_i, \qquad (18)

where x_1 is a scalar.

Order the x_2 from smallest to largest, so that x_{21} \le x_{22} \le \ldots \le x_{2n}.

Suppose that x_1 is a smooth function of x_2, where

E[x_1|x_2] = g(x_2)

and therefore

x_1 = g(x_2) + u.
Differencing adjacent observations gives

y_i - y_{i-1} = (x_{1i} - x_{1,i-1})\beta + (g_1(x_{2i}) - g_1(x_{2,i-1})) + \varepsilon_i - \varepsilon_{i-1}
= (g(x_{2i}) - g(x_{2,i-1}))\beta + (u_i - u_{i-1})\beta + (g_1(x_{2i}) - g_1(x_{2,i-1})) + \varepsilon_i - \varepsilon_{i-1}

Provided that the functions g_1 and g are sufficiently smooth and that the data are sufficiently dense, the differences g_1(x_{2i}) - g_1(x_{2,i-1}) and g(x_{2i}) - g(x_{2,i-1}) should be very small, giving the approximations (writing z_i \equiv x_{1i})

z_i - z_{i-1} \approx u_i - u_{i-1}
y_i - y_{i-1} \approx (u_i - u_{i-1})\beta + \varepsilon_i - \varepsilon_{i-1}
The non-parametric difference estimator of β is simply

\hat{\beta}_{diff} = \frac{\sum (z_i - z_{i-1})(y_i - y_{i-1})}{\sum (z_i - z_{i-1})^2},

which converges at the usual rate of \sqrt{n}, with normal distribution, so that

\hat{\beta}_{diff} \overset{D}{\sim} N\left(\beta,\; \frac{1.5}{n}\,\frac{\sigma^2_\varepsilon}{\sigma^2_u}\right)
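A sketch of the difference estimator; the only steps are sorting by x_2, first-differencing, and a no-intercept regression of the differenced y on the differenced x_1 (the simulated data are illustrative):

```python
import numpy as np

def differencing_beta(y, x1, x2):
    """Difference estimator for the partially linear model with scalar x1."""
    order = np.argsort(x2)               # order the data by x2
    dy = np.diff(y[order])               # y_i - y_{i-1}
    dz = np.diff(x1[order])              # z_i - z_{i-1}
    return (dz @ dy) / (dz @ dz)

# Same simulated design as the Robinson sketch above: true beta = 2
rng = np.random.default_rng(0)
x2 = rng.uniform(0, 1, 500)
x1 = np.sin(2 * np.pi * x2) + rng.normal(size=500)
y = 2.0 * x1 + np.cos(2 * np.pi * x2) + 0.5 * rng.normal(size=500)
print(differencing_beta(y, x1, x2))
```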
Example 5: Yatchew and No (2001), Gasoline Demand in Canada
Binary Choice Models

We often start with the idea of an underlying linear (latent variable) model

y^*_i = x_i'\beta + u_i \qquad (19)

where y_i = 1 when y^*_i > 0 and takes the value 0 otherwise.

The standard approach to estimating β in (19) is via maximum likelihood. The log likelihood function is formed for a sample of size n as

L = \sum_{i=1}^{n} \left[y_i \ln(G_i) + (1 - y_i)\ln(1 - G_i)\right] \qquad (20)

G_i = \int_{-\infty}^{x_i'\beta} g(u)\,du = Prob(u_i < x_i'\beta)
G is assumed to be normal (probit) or logistic (logit) in most applications.

Klein and Spady (1993) propose to estimate a smooth version of the likelihood that locally approximates the parametric likelihood. Note that x_i'\beta could be written in more general terms, but Klein and Spady do retain the linear index function in their method.

The key transformation is to note that G in (20) is the probability that u is less than x_i'\beta, conditional on the index function and the parameter β. By Bayes' rule, Prob(A|B) = Prob(A ∩ B)/Prob(B) = Prob(B|A)Prob(A)/Prob(B), so this can be written as

G[x_i'\beta; \beta] = Prob(y = 1)\,\frac{g_{\upsilon|y=1}}{g_{\upsilon}} \qquad (21)

where g_{\upsilon|y=1} is the density of the index function conditional on y = 1 and g_{\upsilon} is the unconditional density of the index function.

These can both be estimated nonparametrically using standard kernel techniques, while Prob(y = 1) can be estimated as the sample fraction of observations with y_i = 1.
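Since the right-hand side of (21) is exactly the kernel regression of y on the index, a stylized sketch of the estimator is possible. This is not Klein and Spady's own implementation: the Gaussian kernel, the fixed bandwidth, the normalization of the first coefficient to one (for scale identification), and the crude clipping of the estimated probabilities away from 0 and 1 (a simple form of trimming) are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def G_hat(v, y, h):
    """Leave-one-out kernel estimate of Prob(y = 1 | index = v), which by
    Bayes' rule equals Prob(y=1) g_{v|y=1}(v) / g_v(v) as in (21)."""
    u = (v[:, None] - v[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(K, 0.0)
    G = (K @ y) / K.sum(axis=1)
    return np.clip(G, 1e-6, 1 - 1e-6)        # crude trimming

def klein_spady(y, X, h):
    """Maximize the semiparametric likelihood over beta (first coef = 1)."""
    def neg_loglik(b_rest):
        v = X @ np.concatenate(([1.0], b_rest))
        G = G_hat(v, y, h)
        return -np.sum(y * np.log(G) + (1 - y) * np.log(1 - G))
    res = minimize(neg_loglik, np.zeros(X.shape[1] - 1), method="Nelder-Mead")
    return np.concatenate(([1.0], res.x))

# Example: two-regressor index with true beta = (1, -0.5)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X @ np.array([1.0, -0.5]) + rng.logistic(size=400) > 0).astype(float)
print(klein_spady(y, X, h=0.3))
```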
Ichimura and Thompson (1998) propose a wider class of estimators based upon a random coefficients approach:

y^*_i = x_i'\beta_i + u_i \qquad (22)

where y_i = 1 when y^*_i > 0 and takes the value 0 otherwise.

The distribution of β_i is estimated by nonparametric methods with few restrictions.

Ai and Chen (Econometrica, 2003) have proposed a better method for estimating binary choice models which is currently considered the state of the art.
Additional notes on bandwidth selection

• Plug-in methods
Usually reserved for simple density estimation. Fan and Gijbels (1996) provide plug-in estimators for regression estimation.

• Least-squares cross-validation is popular in many applications. Ichimura and Todd (2004, Handbook of Econometrics V) find that this method works well in a simulation study. The biggest problem with least-squares cross-validation arises when the data are sparse. In this case the method tends to choose a bandwidth which is too large in order to avoid having zero densities in any area (the criterion takes on an unbounded value if the density is zero at any point).

• Variable bandwidth selection methods result in estimates that are no longer densities. Thus global bandwidth selection methods tend to be preferred.

• There are also bootstrap bandwidth selection methods, which tend to be very computationally intensive.
Reducing the curse of dimensionality

• Restricting the class of models
ex: Separable models of Robinson and Yatchew
ex: Klein and Spady Binary Choice Model

• Changing the Parameter of Interest
ex: Average derivative methods

• Specifying different stochastic assumptions
see Powell (1984, J. of Econometrics)
I won't discuss this last one. But these methods essentially involve making some restriction on the conditional distribution of observable variables, but not enough to estimate the model parametrically. Powell applies these to various limited dependent variable models, including the Tobit model.
Average Derivative Method

Consider the model

y_i = g(x_i) + u_i. \qquad (23)

Suppose that instead of estimating the derivative g'(x) at every point, we are interested in

E(g'(x)). \qquad (24)

The advantage is that by taking the average over all points, the curse of dimensionality is eliminated. Even though the function g cannot be estimated at the rate of parametric convergence, the average of its derivatives can.
These estimators have achieved great popularity and are discussed in

Stoker (1986, Econometrica)
Hardle and Stoker (1989, JASA)
Powell, Stock and Stoker (1989, Econometrica)

The simplest form is the direct average derivative estimator, which is simply

\hat{\beta} = \frac{\sum_{i=1}^{n} \frac{\partial \hat{E}(y_i|x_i)}{\partial x}\, t_i}{\sum_{i=1}^{n} t_i} \qquad (25)

where t is a trimming function that removes points which have zero or negative estimated densities.
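A stylized sketch of the direct estimator in (25) for scalar x. The derivative of the kernel regression is taken here by a centered finite difference rather than analytically, and the trimming function t_i simply drops points whose estimated density falls below a threshold; both are illustrative simplifications:

```python
import numpy as np

def gaussian_K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def nw(x0, x, y, h):
    """Nadaraya-Watson estimate of E(y|x) at the point x0."""
    K = gaussian_K((x - x0) / h)
    return (K @ y) / K.sum()

def average_derivative(x, y, h, trim=0.05):
    """Direct average derivative estimator, equation (25)."""
    n = len(x)
    f_hat = np.array([gaussian_K((x - xi) / h).sum() / (n * h) for xi in x])
    t = f_hat > trim                     # trimming indicator t_i
    eps = 1e-4 * x.std()
    # derivative of E(y|x) at each x_i by a centered finite difference
    deriv = np.array([(nw(xi + eps, x, y, h) - nw(xi - eps, x, y, h)) / (2 * eps)
                      for xi in x])
    return deriv[t].mean()
```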
What affects the results?

• Bandwidth Choice
• Trimming
Trimming

Trimming essentially refers to the practice of dropping some observations which meet a particular criterion. In other cases, it may mean rounding values at or near zero up to some acceptable level (ex: Klein and Spady).

• Practical reasons
In all of the regression estimators that we have looked at, some type of density estimate appears in the denominator of the expression. If this is zero or near zero, the estimate of the conditional mean function is undefined. So it is sometimes necessary to drop data points in order to avoid the "boundary problem".
• Technical reasons
Semiparametric estimators use nonparametric estimators in their construction. The nonparametric estimators need to have uniform rates of convergence in order to establish the asymptotic properties of the semiparametric estimators. This generally involves the use of bounded kernels and densities (for x, typically) that are bounded. So most technical proofs involve the introduction of some trimming function. (See Robinson (1988) or Klein and Spady (1993) for examples.)
Additively Separable Models

This represents another way to restrict the class of models:

y_i = \beta_0 + g_1(x_{i1}) + g_2(x_{i2}) + \ldots + g_k(x_{ik}) + u \qquad (26)

Less restrictive than it appears, because some variables could involve interactions with other variables.
Estimates achieve the univariate rate of convergence: n^{2/5}.
Complicated to estimate. Use backfitting or the integration approach of Newey (1994, Econometric Theory) and Hardle and Linton (1996, Biometrika).
Less commonly applied than the partially linear model.
Partially Linear Models: Recent developments

Refinements have been proposed by

• Ahn and Powell (1993, Journal of Econometrics)
• Heckman, Ichimura, and Todd (1998, U. of Chicago, still unpublished)

These deal with the case where instrumental variables are needed and where a sample selection correction of unknown functional form is estimated.
Other Notes

The book by Pagan and Ullah (1998) remains an excellent reference. The new book by Li and Racine (2006) is written to serve more as a teaching text, complete with problem sets and examples. More recent developments are discussed by Ichimura and Todd (Handbook of Econometrics, Volume 5, 2004). I particularly like their section on bandwidth selection (chapter 6) for semi-parametric, parametric, and average derivative regression estimation techniques.