
Smoothing Spline Gaussian Regression: More Scalable Computation via Efficient Approximation

YOUNG-JU KIM

Yale University, USA

CHONG GU∗

Purdue University, USA

∗Address for correspondence: Department of Statistics, Purdue University, West Lafayette, IN 47907, USA.

SUMMARY

Smoothing splines via the penalized least squares method provide versatile and effective nonparametric models for regression with Gaussian responses. The computation of smoothing splines is generally of the order O(n³), n being the sample size, which severely limits its practical applicability. In this article, we study more scalable computation of smoothing spline regression via certain low-dimensional approximations that are asymptotically as efficient. A simple algorithm is presented and the Bayes model associated with the approximations is derived, with the latter guiding the porting of Bayesian confidence intervals. The practical choice of the dimension of the approximating space is determined through simulation studies, and empirical comparisons of the approximations with the exact solution are presented. Also evaluated is a simple modification of the generalized cross-validation method for smoothing parameter selection, which to a large extent fixes the occasional undersmoothing problem suffered by generalized cross-validation.

Keywords: BAYESIAN CONFIDENCE INTERVAL; COMPUTATION; GENERALIZED CROSS-VALIDATION; PENALIZED LEAST SQUARES.

1 Introduction

Consider a regression problem with observations Y_i = η(x_i) + ε_i, i = 1, ..., n, where ε_i ∼ N(0, σ²), independent. In a classical parametric regression analysis, η is assumed to be of the form η(x, β), known up to the parameters β, which are to be estimated from the data; when η(x, β) is linear in β, one has a standard linear model. A parametric model characterizes a set of rigid constraints on η. The dimension of the model space, i.e., the number of unknown parameters, is presumably much smaller than the sample size n.

Parametric models often incur model bias. To avoid this, an alternative approach to estimation is to allow η to vary in a high (possibly infinite) dimensional function space, leading to various nonparametric or semiparametric estimation methods. A popular approach to the nonparametric


estimation of η is via the minimization of a penalized least squares functional,

\[
\frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \eta(x_i)\bigr)^2 + \lambda J(\eta), \qquad (1.1)
\]

where J(η) is a quadratic functional measuring the roughness of η. The first term in (1.1) discourages the lack of fit of η to the data, the second term penalizes the roughness of η, and the smoothing parameter λ controls the tradeoff between the two conflicting goals. An example of this is the cubic smoothing spline for univariate regression, with J(η) = ∫(η″)²dx.

Pioneered by the work of Kimeldorf and Wahba (1970a, 1970b, 1971), penalized least squares regression and generalizations thereof have been studied extensively over the years; see, e.g., Wahba (1990), Green and Silverman (1994), and Gu (2002) for comprehensive treatments.

The minimizer η_λ of (1.1) resides in an n-dimensional space, and despite the existence of O(n) algorithms for univariate regression, the computation in multivariate settings is generally of the order O(n³). A recent result of Gu and Kim (2002) shows that the minimizers of (1.1) in certain q-dimensional spaces share the same asymptotic convergence rate with η_λ, so long as q → ∞ at a rate no slower than n^{2/(pr+1)+ε} for some p ∈ [1, 2], r > 1, and ∀ε > 0; the computation of such approximations is of the order O(nq²). For the cubic splines with J(η) = ∫(η″)²dx, r = 4, and for tensor products thereof, r = 4 − δ, ∀δ > 0. The constant p depends on the smoothness of η: for η "barely" satisfying J(η) < ∞, p = 1, and for η satisfying more stringent smoothness conditions, p > 1, up to 2; for example, with J(η) = ∫(η″)²dx, one has p = 1.5 for η satisfying ∫(η^{(3)})²dx < ∞ and p = 2 for η satisfying ∫(η^{(4)})²dx < ∞. With q ≍ n^{2/(pr+1)+ε} where r = 4 − δ and p ∈ [1, 2], the orders of computation range from O(n^{13/9+ε*}) to O(n^{9/5+ε*}), where ε* > 0 is arbitrary. The main purpose of this article is to develop algorithms for the computation of such approximations and to resolve a host of practical and theoretical issues related to the approximations.

The performance of penalized least squares regression hinges on the proper selection of the smoothing parameter λ in (1.1), for which a popular method is the generalized cross-validation (GCV) of Craven and Wahba (1979). Despite its theoretical justification (Li 1986) and adequate practical performance, GCV may yield severe undersmoothing (too small a λ) in up to 10% of cases. In this article, we also present empirical studies to suggest a simple modification of GCV that may curb undersmoothing for the "bad" cases without sacrificing the generally good performance for the other cases.

The rest of the article is organized as follows. In §2, we shall briefly review some basic facts concerning the penalized least squares regression of (1.1). The basic algorithm is outlined in §3, followed by empirical studies concerning the modification of GCV in §4 and the practical choice of q in §5. The minimizer η_λ of (1.1) is known to be a Bayes estimate with a Gaussian process prior for η, and in §6 we present the Bayes model associated with the q-dimensional approximations; technical details are relegated to an appendix. Further empirical results are given in §7 comparing the exact minimizer and the q-dimensional approximations. Real-data examples are shown in §8. A few remarks in §9 conclude the article.


2 Background

We first review some basic facts concerning the penalized least squares regression of (1.1) and set up the notation. For general references concerning these facts, see, e.g., Wahba (1990) and Gu (2002).

2.1 Reproducing Kernel and Solution Expression

The minimization of (1.1) is in a space H ⊆ {η : J(η) < ∞} in which J(η) is a square (semi) norm, or a subspace therein. The evaluation [x]f = f(x) appears in the first term, which is assumed to be continuous in H. A space H in which the evaluation is continuous is called a reproducing kernel Hilbert space (RKHS) possessing a reproducing kernel (RK) R(·, ·), a non-negative definite function satisfying ⟨R(x, ·), f(·)⟩ = f(x), ∀f ∈ H, where ⟨·, ·⟩ is the inner product in H. The norm and the RK determine each other uniquely.

Let N_J = {η : J(η) = 0} be the null space of J(η) and consider the tensor sum decomposition H = N_J ⊕ H_J. The space H_J is an RKHS with J(η) as the square norm. The minimizer of (1.1) has an expression

\[
\eta(x) = \sum_{\nu=1}^{m} d_\nu\phi_\nu(x) + \sum_{i=1}^{n} c_i R_J(x_i, x),
\]

where {φ_ν} is a basis of N_J and R_J is the RK in H_J.

For X a product domain, certain ANOVA decompositions can be built in through the construction of J(η). The decomposition can be characterized by H = ⊕_{β=0}^{g} H_β and J(η) = ∑_{β=1}^{g} θ_β^{-1} J_β(η_β), where η_β ∈ H_β, 0 < θ_β < ∞, and J_β is the square norm in H_β, β > 0. One has N_J = H_0, H_J = ⊕_{β=1}^{g} H_β, and R_J = ∑_{β=1}^{g} θ_β R_β, where R_β is the RK in H_β. The θ_β's are an extra set of smoothing parameters to be selected, but they may not appear explicitly in the notation.

2.2 Asymptotic Convergence Rate

Let f(x) be the limiting density of x_i on the covariate domain X, assumed to be bounded from above and below, and define the bilinear form V(g, h) = ∫_X g(x)h(x)f(x)dx. The asymptotic convergence rate of η_λ is characterized by an eigenvalue analysis of J(η) with respect to V(η) = V(η, η).

Let ψ_ν be the eigenfunctions satisfying V(ψ_ν, ψ_µ) = δ_{ν,µ} and J(ψ_ν, ψ_µ) = ρ_ν δ_{ν,µ}, where J(g, h) is the bilinear form associated with J(g) and δ_{ν,µ} is Kronecker's delta. Assume ρ_ν > Cν^r for some C > 0, r > 1, and ν sufficiently large. Assuming ∑_ν ρ_ν^p η_ν² < ∞, where η_ν = V(ψ_ν, η) and p ∈ [1, 2], it can be shown that as λ → 0 and nλ^{2/r} → ∞,

\[
V(\eta_\lambda - \eta) = O_p(\lambda^p + n^{-1}\lambda^{-1/r});
\qquad (2.1)
\]

note that J(η) = ∑_ν ρ_ν η_ν². If H* ⊂ H satisfies V(h) = O_p(λJ(h)), ∀h ∈ H ⊖ H*, then the rate of (2.1) also holds for the minimizer of (1.1) in H*. The optimal convergence rate is O_p(n^{-pr/(pr+1)}), achieved with λ ≍ n^{-r/(pr+1)}. For H_q = N_J ⊕ span{R_J(z_j, ·), j = 1, ..., q}, where z_j have the limiting density f(x), it can be shown that

\[
V(h) = (V + \lambda J)(h)\,O_p(q^{-1/2}\lambda^{-1/r});
\]

random subsets {z_j} ⊂ {x_i} have the limiting density f(x). Setting q ≍ λ^{-2/r-ε} ≍ n^{2/(pr+1)+ε}, where λ ≍ n^{-r/(pr+1)} is optimal and ε > 0 is arbitrary, one has V(h) = o_p(λJ(h)), ∀h ∈ H ⊖ H_q. See Gu and Kim (2002).

2.3 Generalized Cross-Validation

The proper selection of the smoothing parameters λ and θ_β is essential for good practical performance. Evaluating η_λ at the sampling points x_i, one has Ŷ = A(λ)Y, where A(λ) is known as the smoothing matrix. The popular GCV method selects the λ that minimizes

\[
V(\lambda) = \frac{n^{-1}Y^T(I - A(\lambda))^2Y}{\{n^{-1}\mathrm{tr}(I - A(\lambda))\}^2}.
\qquad (2.2)
\]

Let L(λ) = n^{-1}∑_{i=1}^n (η_λ(x_i) − η(x_i))² be the mean square error loss at the data points; note that L(λ) ≈ V(η_λ − η). Under mild conditions, it was shown by Li (1986) that

\[
V(\lambda) - L(\lambda) - n^{-1}\varepsilon^T\varepsilon = o_p(L(\lambda)),
\]

where ε = (ε_1, ..., ε_n)^T; see also Gu (2002, §3.2) for a short proof of the result.

2.4 Bayes Model

The minimizer η_λ of (1.1) is known to be a Bayes estimate. Suppose η = η_0 + η̃, where η_0 has a diffuse prior in N_J and η̃ has a Gaussian process prior with mean 0 and covariance function E[η̃(x_1)η̃(x_2)] = bR_J(x_1, x_2), independent of each other. Observing Y_i = η(x_i) + ε_i, one has

\[
E[\eta(x)\,|\,Y] = \eta_\lambda(x),
\]

where η_λ minimizes (1.1) with σ²/(nλ) = b. One can also calculate the posterior variance under the Bayes model, which forms the basis for the Bayesian confidence intervals of Wahba (1983).

For R_J = ∑_{β=1}^{g} θ_β R_β, η̃ = ∑_{β=1}^{g} η_β decomposes into independent components with covariance functions bθ_β R_β. Posterior analysis of the components yields the component-wise Bayesian confidence intervals of Gu and Wahba (1993b).

3 Computation

Algorithms for the computation of η_λ were developed by Gu et al. (1989) and Gu and Wahba (1991), which employed a certain numerical structure not shared by the approximation in H_q = N_J ⊕ span{R_J(z_j, ·), j = 1, ..., q}. In this section, we shall develop an algorithm for the computation of the approximation.

Functions in H_q can be written as

\[
\eta(x) = \sum_{\nu=1}^{m} d_\nu\phi_\nu(x) + \sum_{j=1}^{q} c_j R_J(z_j, x) = \phi^T d + \xi^T c,
\qquad (3.1)
\]

where φ and ξ are vectors of functions and d and c are vectors of coefficients. Plugging (3.1) into (1.1), one estimates d and c through the minimization of

\[
(Y - Sd - Rc)^T(Y - Sd - Rc) + (n\lambda)\,c^TQc,
\qquad (3.2)
\]

where S is n × m with the (i, ν)th entry φ_ν(x_i), R is n × q with the (i, j)th entry R_J(x_i, z_j), and Q is q × q with the (j, k)th entry R_J(z_j, z_k); it is known that J(R_J(z_j, ·), R_J(z_k, ·)) = R_J(z_j, z_k), where J(f, g) denotes the semi inner product corresponding to the square semi norm J(f). We shall assume a full column rank for S, which ensures a unique minimizer of (1.1) even though the coefficients d and c may not be unique; see, e.g., Gu (2002, §3.1).

Differentiating (3.2) with respect to d and c and setting the derivatives to 0, some algebra yields

\[
\begin{pmatrix} S^TS & S^TR \\ R^TS & R^TR + (n\lambda)Q \end{pmatrix}
\begin{pmatrix} d \\ c \end{pmatrix}
=
\begin{pmatrix} S^TY \\ R^TY \end{pmatrix}.
\qquad (3.3)
\]

Fixing the smoothing parameters λ and θ_β (if present, hidden in R and Q) and assuming a full column rank of R, the linear system (3.3) can be easily solved by a Cholesky decomposition of the (m + q) × (m + q) matrix followed by forward and backward substitutions; see, e.g., Golub and Van Loan (1989, §4.2, §3.1).
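For concreteness, a minimal R sketch of forming and solving (3.3) for fixed smoothing parameters is given below. The helper name solve_coef and the availability of S, R, Q, and nλ are illustrative assumptions; R is taken here to have full column rank (the singular case is treated next).

```r
## Minimal sketch: solve the linear system (3.3) for fixed smoothing parameters.
## S (n x m), R (n x q), Q (q x q), and nlambda = n*lambda are assumed given.
solve_coef <- function(S, R, Q, Y, nlambda) {
  m <- ncol(S); q <- ncol(R)
  A11 <- crossprod(S)                # S^T S
  A12 <- crossprod(S, R)             # S^T R
  A22 <- crossprod(R) + nlambda * Q  # R^T R + (n*lambda) Q
  lhs <- rbind(cbind(A11, A12), cbind(t(A12), A22))
  rhs <- c(crossprod(S, Y), crossprod(R, Y))
  ## Cholesky decomposition followed by forward and backward substitutions
  G  <- chol(lhs)
  dc <- backsolve(G, forwardsolve(t(G), rhs))
  list(d = dc[1:m], c = dc[m + 1:q])
}
```

With the coefficients in hand, the fitted values at the data points are simply Sd + Rc.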

Care must be taken when R is singular. Write the Cholesky decomposition

\[
\begin{pmatrix} S^TS & S^TR \\ R^TS & R^TR + (n\lambda)Q \end{pmatrix}
=
\begin{pmatrix} G_1^T & O \\ G_2^T & G_3^T \end{pmatrix}
\begin{pmatrix} G_1 & G_2 \\ O & G_3 \end{pmatrix},
\qquad (3.4)
\]

where S^TS = G_1^TG_1, G_2 = G_1^{-T}S^TR, and G_3^TG_3 = R^T(I − S(S^TS)^{-1}S^T)R + (nλ)Q. Possibly with an exchange of indices known as pivoting, one may write

\[
G_3 = \begin{pmatrix} H_1 & H_2 \\ O & O \end{pmatrix} = \begin{pmatrix} H \\ O \end{pmatrix},
\]

where H_1 is nonsingular. Now define

\[
\tilde{G}_3 = \begin{pmatrix} H_1 & H_2 \\ O & \delta I \end{pmatrix},
\qquad
\tilde{G} = \begin{pmatrix} G_1 & G_2 \\ O & \tilde{G}_3 \end{pmatrix};
\]

one has

\[
\tilde{G}^{-1} = \begin{pmatrix} G_1^{-1} & -G_1^{-1}G_2\tilde{G}_3^{-1} \\ O & \tilde{G}_3^{-1} \end{pmatrix}.
\qquad (3.5)
\]

Premultiplying (3.3) by G̃^{-T}, some algebra yields

\[
\begin{pmatrix} I & O \\ O & \tilde{G}_3^{-T}G_3^TG_3\tilde{G}_3^{-1} \end{pmatrix}
\begin{pmatrix} \tilde{d} \\ \tilde{c} \end{pmatrix}
=
\begin{pmatrix} G_1^{-T}S^TY \\ \tilde{G}_3^{-T}R^T(I - S(S^TS)^{-1}S^T)Y \end{pmatrix},
\qquad (3.6)
\]

where (d̃, c̃)^T = G̃(d, c)^T. Partitioning G̃_3^{-1} = (K, L), one has HK = I and HL = O, so

\[
\tilde{G}_3^{-T}G_3^TG_3\tilde{G}_3^{-1}
= \begin{pmatrix} K^T \\ L^T \end{pmatrix} G_3^TG_3\,(K, L)
= \begin{pmatrix} K^T \\ L^T \end{pmatrix} H^TH\,(K, L)
= \begin{pmatrix} I & O \\ O & O \end{pmatrix}.
\]

L^TG_3^TG_3L = O implies L^TR^T(I − S(S^TS)^{-1}S^T)RL = O, so L^TR^T(I − S(S^TS)^{-1}S^T)Y = 0. The linear system (3.6) is thus of the form

\[
\begin{pmatrix} I & O & O \\ O & I & O \\ O & O & O \end{pmatrix}
\begin{pmatrix} \tilde{d} \\ \tilde{c}_1 \\ \tilde{c}_2 \end{pmatrix}
=
\begin{pmatrix} * \\ * \\ 0 \end{pmatrix},
\qquad (3.7)
\]

which is a solvable system but c̃_2 can be arbitrary. Replacing the lower-right block O in the matrix on the left-hand side by I, which amounts to replacing G_3 in (3.4) by G̃_3, one sets c̃_2 = 0 in (3.7). In practice, one may simply perform the Cholesky decomposition of (3.4) with pivoting, replace the trailing O (if present) by δI with an appropriate value of δ, then proceed as if R were of full column rank.

It is easy to see that

\[
\hat{Y} = Sd + Rc = (S, R)\,\tilde{G}^{-1}\tilde{G}^{-T}
\begin{pmatrix} S^T \\ R^T \end{pmatrix} Y = A(\lambda)Y.
\]

From this, the evaluation of the GCV score of (2.2) is straightforward. The numerical accuracy of the GCV evaluation through this may however deteriorate badly for nλ very small, which amounts to virtual interpolation; a stable, much more accurate algorithm for GCV evaluation is given in Appendix B, which is due to Simon Wood. The price one pays for the stable algorithm is a typical two to five fold increase in computing time, and we have yet to observe practical differences between the two evaluation algorithms when care is taken to prevent interpolation, which one should do with noisy data. Some timing results are given in §8.

For the minimization of the GCV score with respect to λ and θ_β, the quasi-Newton methods of Dennis and Schnabel (1996) can be employed, which use carefully scaled finite differences to approximate the derivatives; high quality code is available in the public domain FORTRAN routine optif9, which is accessible in R through the wrapper function nlm.
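As an illustration of this outer optimization, the following R sketch evaluates the GCV score of (2.2) from the Cholesky factor of (3.3) and minimizes it over log_10(nλ) with nlm; all object names are illustrative, and a single smoothing parameter (no θ_β) is assumed.

```r
## Sketch: GCV evaluation through Yhat = A(lambda) Y, minimized over log10(n*lambda).
gcv_score <- function(log10nlambda, S, R, Q, Y) {
  n <- length(Y); nlambda <- 10^log10nlambda
  A11 <- crossprod(S); A12 <- crossprod(S, R)
  A22 <- crossprod(R) + nlambda * Q
  G  <- chol(rbind(cbind(A11, A12), cbind(t(A12), A22)))
  TT <- cbind(S, R)
  dc <- backsolve(G, forwardsolve(t(G), crossprod(TT, Y)))
  yhat <- drop(TT %*% dc)           # A(lambda) Y
  Z    <- forwardsolve(t(G), t(TT)) # G^{-T}(S,R)^T, one column per data point
  trA  <- sum(Z^2)                  # tr A(lambda) via n forward substitutions
  rss  <- sum((Y - yhat)^2)
  (rss / n) / (1 - trA / n)^2       # the GCV score (2.2)
}
## outer minimization via the quasi-Newton routine behind nlm
## fit <- nlm(gcv_score, p = -2, S = S, R = R, Q = Q, Y = Y)
```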

The formation of the linear system (3.3) takes O(nq²) floating point operations, or flops. The Cholesky decomposition involves O(q³) flops, and the forward and backward substitutions need O(q²) flops. The calculation of tr A(λ) involves n forward substitutions, so is of the order O(nq²). The overall computation thus takes O(nq²) flops.


4 Modification of GCV

The GCV score of (2.2) was proposed by Craven and Wahba (1979) and justified by Li (1986), and is widely used for the selection of smoothing parameters in penalized least squares regression; like other versions of cross-validation in the same and similar settings, GCV may occasionally lead to severe undersmoothing. An alternative method for smoothing parameter selection is the generalized maximum likelihood (GML) derived by Wahba (1985), which never interpolates but, when η(x) is "supersmooth" (i.e., p > 1) and n is large, consistently undersmoothes to a mild extent. The derivation and computation of GML in this context are discussed in §6.

In this section, we evaluate the empirical performance of a simple modification of GCV and compare it with GCV and GML. The modified GCV score is of the form

\[
V_\alpha(\lambda) = \frac{n^{-1}Y^T(I - A(\lambda))^2Y}{\{n^{-1}\mathrm{tr}(I - \alpha A(\lambda))\}^2},
\qquad (4.1)
\]

where α ≥ 1; it reduces to (2.2) when α = 1 and yields smoother estimates as α increases. Our experiments suggest good values of α for practical use in the range of 1.2 ∼ 1.4. Modifications of this sort have been suggested in various venues including software manuals, see, e.g., Nychka, Haaland, O'Connell, and Ellner (1998), but we are not aware of existing accounts of quantitative evaluations for its practical use.
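Given fitted values ŷ = A(λ)Y and tr A(λ) (e.g., as computed in the sketch of §3), the modified score (4.1) and the variance estimate σ̂² of (4.2) below are one-liners; the following R fragment is illustrative only.

```r
## Modified GCV score (4.1) and variance estimate (4.2),
## given yhat = A(lambda) Y and trA = tr A(lambda).
V_alpha <- function(Y, yhat, trA, alpha = 1.4) {
  n <- length(Y)
  (sum((Y - yhat)^2) / n) / (1 - alpha * trA / n)^2
}
sigma2_hat <- function(Y, yhat, trA) {
  n <- length(Y)
  (sum((Y - yhat)^2) / n) / (1 - trA / n)
}
```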

As will be seen in §5, our approach to the determination of empirical formulas of q needs "full basis" estimates (q = n) to serve as references, but one would need to avoid using suboptimal cases, such as undersmoothing ones, as references. The modified GCV provides indispensable technical support for the development of §5.

Two sets of simulations were conducted, one univariate and one multivariate. For the univariate simulation, data were generated from Y_i = η_1(x_i) + ε_i, i = 1, ..., n, where η_1(x) = 1 + 3 sin(2πx), x_i = (i − 0.5)/n, and ε_i ∼ N(0, 1). For sample sizes n = 100, 500, one hundred replicates each were generated and cubic splines were calculated with q = n and λ on a grid log_10(nλ) = (−5)(.05)(−1). The mean square error loss L(λ) = n^{-1}∑_{i=1}^n (η_λ(x_i) − η_1(x_i))² was recorded for all the fits, along with the GCV scores V_α(λ) with α = 1, 1.2, 1.4, 1.6, 1.8 and the GML score M(λ) of (6.6). The smoothing parameters minimizing L(λ), V_α(λ), and M(λ) on the grid were identified and the corresponding losses extracted. Also calculated for the GCV fits was a variance estimate

\[
\hat{\sigma}^2 = \frac{n^{-1}Y^T(I - A(\lambda_v))^2Y}{n^{-1}\mathrm{tr}(I - A(\lambda_v))}
\qquad (4.2)
\]

proposed by Wahba (1983), where λ_v is the minimizer of V_α(λ). The performances of V_α(λ) and M(λ) in the simulation are summarized in Figure 4.1. As seen in the left and center frames of Figure 4.1, the occasional wild failures of GCV were effectively eliminated with α = 1.4. The box plots in the right frame of Figure 4.1 show that the best performance of GCV was achieved with α in the range 1.2 ∼ 1.4, and that the performance of GML degraded as n increased, as forecast by the asymptotic analysis of Wahba (1985).


Figure 4.1: Performances of V_α(λ) and M(λ) in Univariate Simulation. Left: Performances of V_α(λ) with α = 1 (faded) and α = 1.4 for n = 100. Center: Performances of V_α(λ) with α = 1 (faded) and α = 1.4 for n = 500. Right: min L(λ)/L(λ) with λ minimizing M(λ) or V_α(λ) at α = 1, 1.2, 1.4, 1.6, 1.8, for n = 100 (fatter boxes) and n = 500 (thinner boxes).

Figure 4.2: Further Results from Univariate Simulation. Left: the λ minimizing V_1(λ) versus that minimizing M(λ) (faded) or V_{1.4}(λ), for n = 100. Center: L(λ_m)/L(λ_v), with λ_m minimizing M(λ) and λ_v minimizing V_α(λ), for n = 100 (fatter boxes) and n = 500 (thinner boxes). Right: σ̂² of (4.2), for n = 100 (fatter boxes) and n = 500 (thinner boxes).

Further empirical results are shown in Figure 4.2, where the left frame compares the minimizers of V_1(λ), V_{1.4}(λ), and M(λ), the center frame compares the performances of V_α(λ) and M(λ), and the right frame assesses the performance of the variance estimate of (4.2).

In the notation of §2.1 and §3, the formulation of cubic spline on [0, 1] used in the simulation has m = 2 with φ_1(x) = 1, φ_2(x) = x − 0.5, and

RJ(x1, x2) = Rc(x1, x2) = k2(x1)k2(x2)− k4(|x1 − x2|), (4.3)


Figure 4.3: Performance of V_α(λ), M(λ), and σ̂² in Multivariate Simulation. Left: Performances of V_α(λ) with α = 1 (faded) and α = 1.4. Center: min L(λ)/L(λ) with λ minimizing M(λ) or V_α(λ) at α = 1, 1.2, 1.4, 1.6, 1.8. Right: σ̂² of (4.2).

where

\[
k_2(x) = \tfrac{1}{2}\Bigl(k_1^2(x) - \tfrac{1}{12}\Bigr),
\qquad
k_4(x) = \tfrac{1}{24}\Bigl(k_1^4(x) - \tfrac{k_1^2(x)}{2} + \tfrac{7}{240}\Bigr),
\]

and k_1(x) = x − 0.5. See, e.g., Gu (2002, §2.3.3) for further details.

For the multivariate simulation, data were generated from Y_i = η_2(x_i) + ε_i, i = 1, ..., n, where x_i ∼ U(0, 1)², ε_i ∼ N(0, 3²), and

\[
\eta_2(x) = 5 + \exp(3x_{\langle1\rangle}) + 10^6 x_{\langle2\rangle}^{11}(1 - x_{\langle2\rangle})^{6} + 10^4 x_{\langle2\rangle}^{3}(1 - x_{\langle2\rangle})^{10} + 5\cos\bigl(2\pi(x_{\langle1\rangle} - x_{\langle2\rangle})\bigr),
\]

with the notation x = (x_〈1〉, x_〈2〉) ∈ (0, 1)². For a sample size n = 300, one hundred replicates were generated and tensor product cubic splines were fitted with q = n and with the smoothing parameters minimizing L(λ), V_α(λ) with α = 1, 1.2, 1.4, 1.6, 1.8, and M(λ), respectively. The simulation results are summarized in Figure 4.3.

In the notation of §2.1 and §3, the formulation of tensor product cubic spline on [0, 1]² used in the simulation has m = 4 with φ_1(x) = 1, φ_2(x) = k_1(x_〈1〉), φ_3(x) = k_1(x_〈2〉), φ_4(x) = k_1(x_〈1〉)k_1(x_〈2〉), where k_1(u) = u − 0.5, and

\[
\begin{aligned}
R_J(x_1, x_2) ={}& \theta_1 R_c(x_{1\langle1\rangle}, x_{2\langle1\rangle}) + \theta_2 R_c(x_{1\langle2\rangle}, x_{2\langle2\rangle}) \\
&+ \theta_3 R_c(x_{1\langle1\rangle}, x_{2\langle1\rangle})k_1(x_{1\langle2\rangle})k_1(x_{2\langle2\rangle}) + \theta_4 k_1(x_{1\langle1\rangle})k_1(x_{2\langle1\rangle})R_c(x_{1\langle2\rangle}, x_{2\langle2\rangle}) \\
&+ \theta_5 R_c(x_{1\langle1\rangle}, x_{2\langle1\rangle})R_c(x_{1\langle2\rangle}, x_{2\langle2\rangle}) \\
={}& \sum_{\beta=1}^{5}\theta_\beta R_\beta(x_1, x_2)
\end{aligned}
\qquad (4.4)
\]

with 5 θ_β's, where R_c(u_1, u_2) is given in (4.3) and x_1 = (x_{1〈1〉}, x_{1〈2〉}), x_2 = (x_{2〈1〉}, x_{2〈2〉}). See, e.g., Gu (2002, §2.4.3) for further details.
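For reference, the kernels are short enough to transcribe directly; the following R sketch (function names ours) evaluates R_c of (4.3) through k_1, k_2, k_4 and the tensor-product RK (4.4) for a given θ = (θ_1, ..., θ_5).

```r
## Cubic spline reproducing kernel (4.3) on [0,1] via k1, k2, k4.
k1 <- function(x) x - 0.5
k2 <- function(x) (k1(x)^2 - 1/12) / 2
k4 <- function(x) (k1(x)^4 - k1(x)^2 / 2 + 7/240) / 24
Rc <- function(u, v) k2(u) * k2(v) - k4(abs(u - v))

## Tensor-product RK (4.4) on [0,1]^2; x1, x2 are length-2 vectors (x_<1>, x_<2>).
RJ2 <- function(x1, x2, theta) {
  theta[1] * Rc(x1[1], x2[1]) +
  theta[2] * Rc(x1[2], x2[2]) +
  theta[3] * Rc(x1[1], x2[1]) * k1(x1[2]) * k1(x2[2]) +
  theta[4] * k1(x1[1]) * k1(x2[1]) * Rc(x1[2], x2[2]) +
  theta[5] * Rc(x1[1], x2[1]) * Rc(x1[2], x2[2])
}
```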

5 Empirical Choice of q

As mentioned in §2.2, a q ≍ n^{2/(pr+1)+ε}, ∀ε > 0, is sufficient for asymptotic efficiency. For the cubic spline, r = 4, and for the tensor product cubic splines, r = 4 − δ, ∀δ > 0; see, e.g., Gu (2002, Chapter 8). Since ε, δ > 0 can be arbitrarily small, one may use q = kn^{2/(4p+1)} in practice. In this section, we conduct some simulations to suggest adequate values of k for practical use.

Consider the test functions η_1(x) and η_2(x) as used in §4, which are smooth enough so p = 2. Samples of sizes n = 100, 300, 500 were drawn from the two simulation settings of §4, respectively. For each of the 6 samples and every k on the grid k = 5(1)15, 30 different random subsets {z_j} ⊂ {x_i} of size q = kn^{2/9} were generated, and (tensor product) cubic splines were fitted to the data with the smoothing parameters minimizing V_α(λ) with α = 1.4. The fits with q = n were also calculated. The losses were recorded for all the fits and the results are summarized in Figure 5.1 in box plots. The fact that the box width gradually decreases as k increases indicates that q ≍ n^{2/9} is the "correct" scale; similar plots on the q ≍ n^{2/5} scale (not shown here) have also been inspected, but the box width shrinks at a much faster rate there. The plots suggest that a k around 10 could be stable enough for practical use.

In practice, we suggest the use of q = kn^{2/9} with k around 10 for (tensor product) cubic splines; examples with "barely" square integrable second derivatives may be artificially constructed but we doubt there are many such "true" functions out in the real world. Since the computation is so much faster (some timing results can be found in §8), quick checks on the stability can be performed simply by comparing estimates with different random subsets {z_j} ⊂ {x_i}.
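A one-line illustration of the default rule (the function name is ours):

```r
## Default basis dimension q = 10 n^(2/9) for (tensor product) cubic splines.
q_default <- function(n, k = 10) ceiling(k * n^(2/9))
q_default(c(100, 330, 500))   # roughly 28, 37, 40
```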

6 Bayes Model

Based on the Bayes model briefly discussed in §2.4, Bayesian confidence intervals were derived by Wahba (1983) and Gu and Wahba (1993b) to provide interval estimates in penalized least squares regression. The Bayes model is also the basis for the GML method for smoothing parameter selection derived by Wahba (1985). In this section, we extend the Bayes model to the approximation in the space H_q = N_J ⊕ span{R_J(z_j, ·), j = 1, ..., q}; secondary technical details are relegated to Appendix A.

Similar to §2.4, consider η = η_0 + η̃, where η_0 has a diffuse prior in N_J and η̃ has a mean 0 Gaussian process prior with the covariance function

\[
E[\tilde\eta(x_1)\tilde\eta(x_2)] = bR_J(x_1, z^T)Q^{+}R_J(z, x_2),
\]

where Q^+ is the Moore-Penrose inverse of Q = R_J(z, z^T). Recall the notation ξ_j = R_J(z_j, ·) from §3 and R^T = ξ(x^T), where x = (x_1, ..., x_n)^T, and write nλ = σ²/b and M = RQ^+R^T + nλI.


Figure 5.1: Effect of q on Estimation Consistency. Boxplots of L(λ) with 30 different random subsets {z_j} of size q, for each of q = kn^{2/9}. Left: η_1(x). Right: η_2(x). Top: from high to low, n = 100, 300, 500. Bottom: n = 500 with better resolution. The dashed lines correspond to q = n.

Under the prior specified above, it can be shown that

\[
E[\eta(x)\,|\,Y] = \phi^T d + \xi^T c,
\qquad (6.1)
\]

where

\[
\begin{aligned}
d &= (S^TM^{-1}S)^{-1}S^TM^{-1}Y, \\
c &= Q^{+}R^T\bigl(M^{-1} - M^{-1}S(S^TM^{-1}S)^{-1}S^TM^{-1}\bigr)Y,
\end{aligned}
\qquad (6.2)
\]

and that

\[
\begin{aligned}
\mathrm{var}[\eta(x)\,|\,Y]/b ={}& \xi^TQ^{+}\xi + \phi^T(S^TM^{-1}S)^{-1}\phi \\
&- \phi^T(S^TM^{-1}S)^{-1}S^TM^{-1}RQ^{+}\xi - \xi^TQ^{+}R^TM^{-1}S(S^TM^{-1}S)^{-1}\phi \\
&- \xi^TQ^{+}R^T\bigl(M^{-1} - M^{-1}S(S^TM^{-1}S)^{-1}S^TM^{-1}\bigr)RQ^{+}\xi.
\end{aligned}
\qquad (6.3)
\]


It is straightforward to verify that the d and c given in (6.2) solve the linear system (3.3); the fact QQ^+R^T = R^T is needed in the verification. Detailed derivations can be found in Appendix A.

From (6.2), it is easy to verify that

A(λ) = I − nλ(M−1 −M−1S(STM−1S)−1STM−1). (6.4)

Replacing ξ by R^T and φ by S^T in (6.3), the matrix reduces to nλA(λ) after some algebra, where A(λ) is as given in (6.4). The posterior variances at the sampling points are thus given by the diagonals of σ²A(λ).

For the computation of (6.3) away from the sampling points, the second and third lines involve the formulas of d and c given in (6.2) but with RQ^+ξ replacing Y, which are available through forward and backward substitutions given the Cholesky factor G̃ of §3. As shown in (A.2) in Appendix A, (S^TM^{-1}S)^{-1} is nλ times the upper-left block of G̃^{-1}G̃^{-T}, where G̃^{-1} is given in (3.5), so the term φ^T(S^TM^{-1}S)^{-1}φ is available through forward substitution.
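For exposition, the formulas (6.2) and (6.3) can also be transcribed directly into R; the sketch below does so with a brute-force O(n³) inversion of M (the article's algorithm instead works through the Cholesky factor at O(nq²) cost), and all names are illustrative. MASS::ginv supplies the Moore-Penrose inverse Q^+.

```r
## Direct (unoptimized) evaluation of the posterior mean (6.1)-(6.2) and the
## posterior variance (6.3) at a point x; phi.x and xi.x are phi(x) and xi(x).
library(MASS)  # for ginv(), the Moore-Penrose inverse

posterior <- function(S, R, Q, Y, nlambda, phi.x, xi.x) {
  Qp <- ginv(Q)                                  # Q^+
  M  <- R %*% Qp %*% t(R) + nlambda * diag(nrow(R))
  Mi <- solve(M)
  W  <- solve(t(S) %*% Mi %*% S)                 # (S^T M^{-1} S)^{-1}
  P  <- Mi - Mi %*% S %*% W %*% t(S) %*% Mi      # M^{-1} - M^{-1}S(.)^{-1}S^T M^{-1}
  d.coef <- W %*% t(S) %*% Mi %*% Y              # d of (6.2)
  c.coef <- Qp %*% t(R) %*% P %*% Y              # c of (6.2)
  mean.x <- sum(phi.x * d.coef) + sum(xi.x * c.coef)
  ## posterior variance (6.3), up to the factor b = sigma^2 / (n*lambda)
  v <- t(xi.x) %*% Qp %*% xi.x + t(phi.x) %*% W %*% phi.x -
       2 * t(phi.x) %*% W %*% t(S) %*% Mi %*% R %*% Qp %*% xi.x -
       t(xi.x) %*% Qp %*% t(R) %*% P %*% R %*% Qp %*% xi.x
  list(mean = mean.x, var.over.b = drop(v))
}
```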

For R_J = ∑_{β=1}^{g} θ_β R_β, η̃ = ∑_{β=1}^{g} η_β decomposes into multiple components with the prior covariance functions given by

\[
E[\eta_\beta(x_1)\eta_\gamma(x_2)] = b\,(\theta_\beta R_\beta(x_1, z^T))\,Q^{+}(\theta_\gamma R_\gamma(z, x_2)),
\qquad \beta, \gamma = 1, \ldots, g.
\]

Also decompose the diffuse term η_0 = ∑_{ν=1}^{m} ψ_ν, where ψ_ν = b_ν φ_ν. The posterior means and variances of arbitrary linear combinations of ψ_ν and η_β can be obtained by simple modifications of (6.1) and (6.3). For example, with ψ_1 + η_1 + η_2, one simply replaces φ in (6.1) and (6.3) by (φ_1(x), 0, ..., 0)^T and ξ by θ_1R_1(z, x) + θ_2R_2(z, x).

The Bayes model under study can be perceived as a mixed effect model, with η_0 = φ^Tβ being the fixed effects and η̃ being the random effects. The GML method of Wahba (1985) is virtually an application of the popular restricted maximum likelihood (REML) method, which is widely used for the estimation of variance components.

Let S = FT = (F_1, F_2)\bigl(\begin{smallmatrix}T\\O\end{smallmatrix}\bigr) = F_1T be the QR-decomposition of S, where F is orthogonal and T is upper triangular. REML maximizes the marginal likelihood of F_2^TY, which is normal with mean 0 and covariance matrix bF_2^TRQ^+R^TF_2 + σ²I = bF_2^TMF_2. It can be shown that the REML estimates of (b, λ) are given by

\[
\hat b = \frac{Y^TF_2(F_2^TMF_2)^{-1}F_2^TY}{n - m}
\qquad (6.5)
\]

and λ_m = arg min M(λ), where

\[
M(\lambda) = \frac{Y^T(I - A(\lambda))Y}{|I - A(\lambda)|_{+}^{1/(n-m)}},
\qquad (6.6)
\]

with | · |_+ denoting the product of positive eigenvalues; note that b̂ depends on λ_m through M = RQ^+R^T + nλI. It also holds that (nλ)^{-1}(I − A(λ)) = F_2(F_2^TMF_2)^{-1}F_2^T. See Appendix A.

The numerator of (6.6) is readily available. For the denominator, note that

\[
|I - A(\lambda)|_{+}^{-1} = |(n\lambda)^{-1}F_2^TMF_2| = |(n\lambda)^{-1}F_2^TRQ^{+}R^TF_2 + I| = |(n\lambda)^{-1}Q^{+}R^TF_2F_2^TR + I|.
\]

Let Q^+ = (P_1, P_2)\bigl(\begin{smallmatrix}D_Q^{-1} & O\\ O & O\end{smallmatrix}\bigr)\bigl(\begin{smallmatrix}P_1^T\\ P_2^T\end{smallmatrix}\bigr) = P_1D_Q^{-1}P_1^T be the eigenvalue decomposition of Q^+, where D_Q is diagonal with the positive eigenvalues of Q. Noting that P_2^TR^T = O, one has

\[
|(n\lambda)^{-1}Q^{+}R^TF_2F_2^TR + I| = |D_Q^{-1}(n\lambda)^{-1}P_1^TR^TF_2F_2^TRP_1 + I| = |Q + (n\lambda)^{-1}R^TF_2F_2^TR|_{+}\big/|Q|_{+}.
\]

The formation of R^TF_2F_2^TR is O(nq²) and the eigenvalue problem is O(q³).
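For exposition, the GML score (6.6) can also be evaluated directly from the smoothing matrix A(λ); the sketch below does so at O(n³) cost, whereas the route described above forms R^TF_2F_2^TR in O(nq²) and then solves a q × q eigenproblem. Names are illustrative.

```r
## Direct evaluation of the GML score (6.6) from A = A(lambda);
## m is the dimension of the null space N_J.
gml_score <- function(Y, A, m) {
  n  <- length(Y)
  IA <- diag(n) - A
  num <- drop(t(Y) %*% IA %*% Y)
  ev  <- eigen(IA, symmetric = TRUE, only.values = TRUE)$values
  ev  <- ev[ev > 1e-10 * max(ev)]          # keep the positive eigenvalues
  num / exp(sum(log(ev)) / (n - m))        # |I - A|_+^{1/(n-m)} via log-sum
}
```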

7 Numerical Accuracy

Based on the posterior mean and posterior standard deviation under the Bayes model for the exact solution with q = n, Wahba (1983) constructed the so-called Bayesian confidence intervals. Despite the derivation under the Bayes model, the intervals demonstrate a certain frequentist across-the-function coverage property with the smoothing parameter selected by GCV; see, e.g., Wahba (1983) and Nychka (1988). Component-wise intervals were studied by Gu and Wahba (1993b).

Under the Bayes model specified in §6 for q < n, the formulas of the posterior mean and posterior standard deviation on the sampling points, when expressed in terms of the smoothing matrix A(λ), match those of Wahba (1978, 1983) for q = n; the formula of M(λ) also matches that in Wahba (1985). The analysis of §6 further allows the approximation of the posterior mean and posterior standard deviation away from the sampling points and in component-wise calculations. We now assess the numerical accuracy of the approximation through simulation studies.

Consider again the simulation settings of §4 and §5. For sample size n = 100, one hundred replicates were drawn from the η_1(x) simulation, fits were calculated with q = n, and the posterior means η̂(x_i) and the posterior standard deviations ŝ_η(x_i) were calculated on the sampling points. For each of the replicates, 10 different random subsets {z_j} ⊂ {x_i} of size q = 10n^{2/9} were generated and fits were calculated, and the posterior means η̃(x_i) and the posterior standard deviations s̃_η(x_i) were calculated. All fits were calculated with λ minimizing V_α(λ) for α = 1.4. The standardized differences |η̃(x_i) − η̂(x_i)|/√L in the posterior mean and the ratios s̃_η(x_i)/ŝ_η(x_i) in the posterior standard deviation were recorded, where L = e² = n^{-1}∑_{i=1}^n (η̂(x_i) − η(x_i))² was the mean square error loss of the fit with q = n. This yielded 100(10)(100) = 10^5 entries of differences and ratios. The experiment was repeated for sample size n = 300 on 30 replicates, yielding 30(10)(300) = 9 × 10^4 entries of differences and ratios. These results are summarized in Table 7.1.

For the η_2(x) simulation, the same experiment was conducted on ten replicates for sample size n = 300. Functions on [0, 1]² can be decomposed as

η(x) = η∅ + η〈1〉(x〈1〉) + η〈2〉(x〈2〉) + η〈1,2〉(x〈1〉, x〈2〉), (7.1)


                        0%      1%      5%      25%     50%     75%     95%     99%     100%
|η̃ − η̂|/e:   n = 100   0.0000  0.0000  0.0003  0.0018  0.0043  0.0089  0.0244  0.0515  0.4392
              n = 300   0.0000  0.0000  0.0003  0.0020  0.0047  0.0096  0.0278  0.0716  0.3942
s̃_η/ŝ_η:      n = 100   0.9288  0.9765  0.9868  0.9968  0.9993  1.0003  1.0019  1.0049  1.0233
              n = 300   0.9222  0.9703  0.9836  0.9960  0.9990  1.0003  1.0019  1.0058  1.0306

Table 7.1: Quantiles of |η̃(x_i) − η̂(x_i)|/√L and s̃_η(x_i)/ŝ_η(x_i) in the η_1(x) Simulation: n = 100, 300.

where η_〈1〉, η_〈2〉, and η_〈1,2〉 satisfy certain side conditions to ensure identifiability; this is an ANOVA decomposition with η_〈1〉 and η_〈2〉 being the main effects and η_〈1,2〉 being the interaction. For the formulation of tensor product cubic spline as described in §4, one has the ANOVA decomposition of the posterior mean

\[
\begin{aligned}
\eta(x) ={}& \sum_{\nu=1}^{4} d_\nu\phi_\nu(x) + \sum_{j=1}^{q} c_j\Bigl(\sum_{\beta=1}^{5}\theta_\beta R_\beta(z_j, x)\Bigr) \\
={}& \bigl[d_1\phi_1(x)\bigr]
   + \Bigl[d_2\phi_2(x) + \sum_{j=1}^{q} c_j\theta_1 R_1(z_j, x)\Bigr]
   + \Bigl[d_3\phi_3(x) + \sum_{j=1}^{q} c_j\theta_2 R_2(z_j, x)\Bigr] \\
&  + \Bigl[d_4\phi_4(x) + \sum_{j=1}^{q} c_j\Bigl(\sum_{\beta=3}^{5}\theta_\beta R_\beta(z_j, x)\Bigr)\Bigr] \\
={}& \eta_\emptyset + \eta_{\langle1\rangle} + \eta_{\langle2\rangle} + \eta_{\langle1,2\rangle},
\end{aligned}
\]

with the side conditions

\[
\int_0^1 \eta_{\langle1\rangle}\,dx_{\langle1\rangle}
= \int_0^1 \eta_{\langle2\rangle}\,dx_{\langle2\rangle}
= \int_0^1 \eta_{\langle1,2\rangle}\,dx_{\langle1\rangle}
= \int_0^1 \eta_{\langle1,2\rangle}\,dx_{\langle2\rangle} = 0;
\]

see (4.4) and the text prior to that for the notation of φ_ν and R_β. Besides those of the overall function η(x), component-wise differences |η̃(x_i) − η̂(x_i)|/√L and ratios s̃_η(x_i)/ŝ_η(x_i) were also calculated, where the loss L = e² was taken as the respective component-wise versions. The results are summarized in Table 7.2. For η, the range of e = √L for the 10 replicates was (0.497, 0.835), but for η_〈1〉, η_〈2〉, and η_〈1,2〉, the e ranges were (1.876, 10.388), (2.223, 15.395), and (3.567, 10.081), respectively; note that the identifiability of the components is defined through integrations over the domain but the comparison of fits and the calculation of L were done on the sampling points. Things were not as favorable as in the η_1(x) simulation, but the overall accuracy appears to be reasonable given the moderate signal to noise ratio.

8 Examples

We now apply the techniques developed to two real data sets. The primary goal here is not data analysis, however; the purpose is to visually compare the approximation with the exact solution with q = n. Also of interest are timing results in real-data applications.

All fits presented below were calculated with the smoothing parameters minimizing V_α(λ) for α = 1.4. The timing results were obtained on a workstation with an Athlon MP2800+ and 3GB RAM, running FreeBSD 4.4 and R 1.6.2.

                        0%      1%      5%      25%     50%     75%     95%     99%     100%
|η̃ − η̂|/e:   η         0.0000  0.0004  0.0019  0.0099  0.0232  0.0539  0.2073  0.5221  1.2974
              η〈1〉      0.0000  0.0001  0.0008  0.0065  0.0199  0.05113 0.2253  0.4700  0.8836
              η〈2〉      0.0000  0.0004  0.0019  0.0122  0.0334  0.0804  0.3918  0.8949  1.4798
              η〈1,2〉    0.0000  0.0000  0.0006  0.0058  0.0227  0.0689  0.3185  0.9683  2.4799
s̃_η/ŝ_η:      η         0.8065  0.9069  0.9525  0.9866  0.9971  1.0022  1.0173  1.1057  1.5936
              η〈1〉      0.8046  0.8514  0.9468  0.9950  1.0011  1.0150  1.1278  2.5762  43.4178
              η〈2〉      0.0464  0.5183  0.8909  0.9931  1.0022  1.0136  1.0821  1.4542  9.7915
              η〈1,2〉    0.0014  0.2307  0.4979  0.9896  1.0104  1.0611  1.5764  3.6381  130.69

Table 7.2: Quantiles of |η̃(x_i) − η̂(x_i)|/√L and s̃_η(x_i)/ŝ_η(x_i) in the η_2(x) Simulation: n = 300.

8.1 Ozone in Los Angeles Basin

Daily measurements of ozone concentration and eight meteorological quantities in the Los Angeles basin were recorded for 330 days of 1976. The data were used by Breiman and Friedman (1985) to illustrate their ACE algorithm (alternating conditional expectation) and by Buja et al. (1989) to illustrate nonparametric additive models through the back-fitting algorithm. An analysis of the data using penalized least squares regression with q = n can be found in Gu (2002, §3.7.2).

Following Gu (2002, §3.7.2), a tensor product cubic spline model of the form

Y = η∅ + η1(x〈1〉) + η2(x〈2〉) + η3(x〈3〉) + η1,3(x〈1〉, x〈3〉) + ε

was fitted to the data, where Y was log_10(ozone concentration) (ppm), x_〈1〉 was inversion base temperature (°F), x_〈2〉 was Dagget pressure gradient (mmHg), and x_〈3〉 was visibility (miles). The observed x_i were mapped into (0, 1)³, on which the tensor product cubic spline was formulated with φ_1(x) = 1, φ_2(x) = k_1(x_〈1〉), φ_3(x) = k_1(x_〈2〉), φ_4(x) = k_1(x_〈3〉), and φ_5(x) = k_1(x_〈1〉)k_1(x_〈3〉), where k_1(u) = u − 0.5, and

RJ(x1, x2) = θ1Rc(x1〈1〉, x2〈1〉) + θ2Rc(x1〈2〉, x2〈2〉) + θ3Rc(x1〈3〉, x2〈3〉)

+ θ4Rc(x1〈1〉, x2〈1〉)k1(x1〈3〉)k1(x2〈3〉) + θ5k1(x1〈1〉)k1(x2〈1〉)Rc(x1〈3〉, x2〈3〉)

+ θ6Rc(x1〈1〉, x2〈1〉)Rc(x1〈3〉, x2〈3〉),

where R_c(u_1, u_2) is given in (4.3). The model was fitted with q = n = 330 and q = 37 ≈ 10(330)^{2/9}, respectively. The main effects of the fits are plotted in Figure 8.1.

The fitting for q = 330 took about 136 CPU seconds and that for q = 37 took about 7.3 CPU seconds. Using the stable GCV evaluation algorithm of Appendix B, we got virtually the same fits, but after 455 and 12.3 CPU seconds, respectively, for q = 330, 37. The timing may vary greatly with the subsets {z_j}, and the same-data timing ratios also vary from machine to machine.

Figure 8.1: Main Effects of the Ozone Fits. Plotted are the fitted terms and 95% Bayesian confidence intervals, with q = n (faded) and q = 10n^{2/9}. The rugs on the bottom of the plots mark the sampling points, with the visibility jittered.

8.2 Global Temperature Map

Maps of meteorological quantities constructed from records registered at weather stations are valuable tools in various applications such as climate change studies. A data set involving 690 weather stations over the globe was derived by Wang and Ke (2002, §8.2), which contained the locations of the stations (x) and the average temperatures from December 1980 to February 1981 (Y, °C). To illustrate their S-PLUS package assist, Wang and Ke (2002) fitted a global temperature map to the data using the spherical spline constructed by Wahba (1981); a similar illustration based on 725 stations can be found in Luo and Wahba (1997).

For points x_1 and x_2 on the sphere, write w = cos γ(x_1, x_2), where γ(x_1, x_2) is the angle between x_1 and x_2, and W = (1 − w)/2. The spherical spline of order 2 has m = 1 with φ(x) = 1 and

\[
R_J(x_1, x_2) = \log(1 + W^{-1/2})(12W^2 - 4W) - 12W^{3/2} + 6W + 1;
\]

see Wahba (1981) for details. The formulation is apparently invariant of the coordinate system used on the sphere. With the limiting distribution f(x) of x_i bounded from above and below, the eigenvalues of the corresponding J(η) with respect to V(η) = ∫_S f(x)η²(x)dx grow at a rate ρ_ν = O(ν⁴), where S denotes the sphere, so r = 4; see §2.2 for the notation and Wahba (1981) for technical details.
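For readers implementing this kernel, here is a small R sketch of R_J as given above, for points specified by latitude and longitude in degrees; the function name is ours, and coincident points are handled by the W → 0 limit of the expression, which equals 1.

```r
## Order-2 spherical spline RK as given above; inputs in degrees.
R_sphere <- function(lat1, lon1, lat2, lon2) {
  to_xyz <- function(lat, lon) {
    phi <- lat * pi / 180; th <- lon * pi / 180
    c(cos(phi) * cos(th), cos(phi) * sin(th), sin(phi))
  }
  w <- sum(to_xyz(lat1, lon1) * to_xyz(lat2, lon2))  # cos of the angle gamma
  W <- (1 - w) / 2
  if (W < .Machine$double.eps) return(1)             # W -> 0 limit of the expression
  log(1 + W^(-1/2)) * (12 * W^2 - 4 * W) - 12 * W^(3/2) + 6 * W + 1
}
```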

The majority of the weather stations in the data set are on or near the continents, so the distribution density f(x) is not bounded from below. The distribution is actually much denser in Europe and Japan, making the upper bound also a shaky one. With the highly non-uniform distribution of x_i, the asymptotics are not even remotely plausible, so it is no surprise that the empirical formula q = 10n^{2/9} does not work here.

In an effort to understand how one may achieve accurate approximation in the situation, the following experiment was conducted. First, the fit η̂ with q = n = 690 was computed and evaluated on the sampling points x_i.

Figure 8.2: Approximation Accuracy of Spherical Spline. Left and Center: The effect of subset size q on δ_r and δ_a, for simple random placement of z_j (fatter boxes) and "space-filling" random placement (thinner boxes). Right: δ_a (pair average) versus δ_r, for simple random placement of z_j (faded) and "space-filling" random placement.

For each of the subset sizes q = 50, 100, 150, 200, 250, twenty pairs of random subsets {z_j} ⊂ {x_i} were generated, and for each pair, fits η̃_1 and η̃_2 were calculated and the quantities

\[
\delta_r = \frac{\sum_{i=1}^{n}\bigl(\tilde\eta_1(x_i) - \tilde\eta_2(x_i)\bigr)^2}{\sum_{i=1}^{n}\bigl(Y_i - \tilde\eta_1(x_i)\bigr)^2 + \sum_{i=1}^{n}\bigl(Y_i - \tilde\eta_2(x_i)\bigr)^2},
\qquad
\delta_a = \frac{1}{n}\sum_{i=1}^{n}\bigl(\tilde\eta_j(x_i) - \hat\eta(x_i)\bigr)^2, \quad j = 1, 2,
\]

were recorded; δ_r assesses the relative discrepancy between the pair and δ_a measures the accuracy of η̃ as an approximation of η̂. Following a hunch that a "uniform" distribution of z_j over the covered area may yield better approximation, we also tried a pseudo "space-filling" filter in the random selection of {z_j} by disallowing mutual distances that were less than 3 angular degrees, and repeated the above experiment with such a filter in place. The results of the experiments are summarized in Figure 8.2. It can be seen that the "space-filling" placement of z_j generally leads to better accuracy, and that δ_r is roughly monotone in δ_a.

The fitting for q = 690 took about 281 CPU seconds, and that for q = 50, 100, 150, 200, 250 took about 1.2, 2.1, 6.3, 12.2, and 19.8 CPU seconds, respectively.

Using the stable GCV algorithm of Appendix B, we again obtained virtually the same fits, but the timing results were about 3.8, 11.2, 28.2, 50.5, 79.8, and 683 CPU seconds, respectively, for q = 50, 100, 150, 200, 250, 690.

Plotted in Figure 8.3 are the contours of the q = 690 fit and a q = 200 fit with a "space-filling" random placement of z_j; the rough appearance of the standard error is largely due to its overall flatness and the resulting fine scale. Densely sampled areas fetch smaller standard errors, whereas places like the southern Pacific Ocean, the southern Indian Ocean, and the Antarctic fare worse.

Figure 8.3: Global Temperature Map. Top: Posterior means. Bottom: Posterior standard deviations. The fit η̂ with q = 690 is in faded lines and the fit η̃ with q = 200 is in solid lines. The stations are superimposed as the dots.


The evaluation of the posterior mean from the fits took little time compared to the fitting, but the evaluation of the posterior standard deviation posed some load: on the 121 × 61 grid used to draw the maps of Figure 8.3, the evaluation of the q = 200 fit took about 14 CPU seconds and that of the q = 690 fit took about 185 CPU seconds.

9 Discussion

In this article, we studied more scalable computation of smoothing spline Gaussian regression through asymptotically efficient low-dimensional approximations. Algorithms were developed, Bayesian confidence intervals derived, empirical rules proposed, and numerical accuracies assessed. Also evaluated was a simple modification of generalized cross-validation, which compared favorably against standard GCV and GML for smoothing parameter selection.

Although O(n) algorithms do exist for the calculation of univariate smoothing splines, the posterior standard deviations are only available through O(n³) algorithms, to our knowledge. This, plus the desire to check the validity of the approach in the simplest possible setting, justify the univariate simulations of §5 and §7.

The "optimal" α to use in V_α(λ), if there is such a thing, would no doubt depend on the truth and possibly other factors, but those are largely beyond reach in practice. The default value of α = 1.4 provides adequate performance over a range of simulation settings; more experiments were conducted than presented. Similarly, the empirical formula q = 10n^{2/9} works well for (tensor product) cubic splines over a range of simulation settings, and makes a reasonable default.

The idea of fast computation through low-dimensional approximation is an old one: it is simply a version of penalized regression splines. The quantification of the adequate dimension q through a combination of asymptotic analysis and numerical simulation is new, however; see also Gu and Wang (2003). Through more delicate placement of the "knots" z_j, it should be possible to achieve asymptotic efficiency with q smaller than what we prescribe here, but our random placement of "knots" is simple to operate and is "universally" applicable. The empirical formula q = 10n^{2/9} is certainly not universally applicable, with a counter example already seen in §8.2, but the theoretical consideration and numerical experiments leading to it may serve as a model for the discovery of similar empirical formulas in targeted application settings.

On product domains permitting ANOVA decompositions, a feature of our approach inherited from smoothing splines but unusual for regression splines is the terms in different function spaces bundled together in the basis R_J(z_j, x) = ∑_{β=1}^{g} θ_β R_β(z_j, x). For better or worse, we are able to accommodate multiple terms with no increase in q yet with (hopefully) little loss in flexibility, which is impossible with separate terms, especially when interactions are included. Also, because of the presence of the smoothing parameters θ_β in the basis R_J(z_j, x), the algorithm of Wood (2000) does not apply, as the analytical derivatives of V_α(λ) and M(λ) are not available here.

Most of the calculations reported in this article were performed in R (Ihaka and Gentleman 1996), an open-source clone of S/S-PLUS. A polished user interface is provided in the ssanova1 suite in the R package gss by the second author, as of version 0.7-4. Fits with α = 1 and q = n by ssanova1 have been checked against fits by the ssanova suite powered by the O(n³) algorithms of Gu et al. (1989) and Gu and Wahba (1991) for numerical consistency. With q = n, ssanova1 is much slower than ssanova; analytical derivatives of V(λ) and M(λ) are used by ssanova but are not available to ssanova1.

A Detailed Derivations of Bayes Model

Since J is the square norm in span{ξ_j}, J(ξ^Tc) = c^TQc = 0 implies ξ^Tc = 0, so ξ(x) is in the column space of Q, ∀x, and hence QQ^+R^T = R^T, where QQ^+ = Q^+Q is the projection matrix in the column space of Q.

Following Wahba (1978), we first assume η_0(x) = b^Tφ(x) with b ∼ N(0, τ²I), then let τ² → ∞. It is easy to see that Y and η(x) are jointly normal with mean 0 and covariance matrix

\[
\begin{pmatrix}
bRQ^{+}R^T + \tau^2SS^T + \sigma^2I & bRQ^{+}\xi + \tau^2S\phi \\
b\xi^TQ^{+}R^T + \tau^2\phi^TS^T & b\xi^TQ^{+}\xi + \tau^2\phi^T\phi
\end{pmatrix},
\qquad (A.1)
\]

where S^T = φ(x^T) as given in §3. Standard calculation yields

\[
\begin{aligned}
E[\eta(x)\,|\,Y] &= (b\xi^TQ^{+}R^T + \tau^2\phi^TS^T)(bRQ^{+}R^T + \tau^2SS^T + \sigma^2I)^{-1}Y \\
&= \rho\phi^TS^T(M + \rho SS^T)^{-1}Y + \xi^TQ^{+}R^T(M + \rho SS^T)^{-1}Y,
\end{aligned}
\]

where ρ = τ²/b, nλ = σ²/b, and M = RQ^+R^T + nλI. Letting ρ → ∞, by (2.7) and (2.8) of Wahba (1978), that

\[
\begin{aligned}
\lim_{\rho\to\infty}(\rho SS^T + M)^{-1} &= M^{-1} - M^{-1}S(S^TM^{-1}S)^{-1}S^TM^{-1}, \\
\lim_{\rho\to\infty}\rho S^T(\rho SS^T + M)^{-1} &= (S^TM^{-1}S)^{-1}S^TM^{-1},
\end{aligned}
\]

one obtains (6.1). Similarly, one has from (A.1),

\[
\mathrm{var}[\eta(x)\,|\,Y]/b = \xi^TQ^{+}\xi + \rho\phi^T\phi - (\xi^TQ^{+}R^T + \rho\phi^TS^T)(M + \rho SS^T)^{-1}(RQ^{+}\xi + \rho S\phi),
\]

which, as ρ → ∞, yields (6.3), where (2.15) of Wahba (1983), that

\[
\lim_{\rho\to\infty}\bigl(\rho I - \rho^2S^T(\rho SS^T + M)^{-1}S\bigr) = (S^TM^{-1}S)^{-1},
\]

is also needed.


We now show that

\[
\begin{aligned}
(S^TM^{-1}S)^{-1} &= (n\lambda)\bigl(G_1^{-1}G_1^{-T} + G_1^{-1}G_2\tilde{G}_3^{-1}\tilde{G}_3^{-T}G_2^TG_1^{-T}\bigr) \\
&= (n\lambda)\bigl\{(S^TS)^{-1} + (S^TS)^{-1}S^TR\,\tilde{G}_3^{-1}\tilde{G}_3^{-T}R^TS(S^TS)^{-1}\bigr\},
\end{aligned}
\qquad (A.2)
\]

where the notation follows §3. First note that M^{-1} = (nλ)^{-1}(I − R(nλQ + R^TR)^{+}R^T); multiply and simplify using the fact that QQ^+R^T = (nλQ + R^TR)(nλQ + R^TR)^{+}R^T = R^T. Multiplying S^T(I − R(nλQ + R^TR)^{+}R^T)S and the right-hand side of (A.2) and using the relations G_3^TG_3 = R^T(I − S(S^TS)^{-1}S^T)R + nλQ and G_3^TG_3\tilde{G}_3^{-1}\tilde{G}_3^{-T}R^T = R^T, straightforward algebra yields (A.2).

The minus log likelihood of F_2^TY is seen to be

\[
\frac{1}{2b}Y^TF_2(F_2^TMF_2)^{-1}F_2^TY + \frac{1}{2}\log|F_2^TMF_2| + \frac{n-m}{2}\log b.
\qquad (A.3)
\]

Fixing λ in M and maximizing (A.3) with respect to b, one has (6.5), and the profile log likelihood of λ is monotone in

\[
\log Y^TF_2(F_2^TMF_2)^{-1}F_2^TY + \frac{1}{n-m}\log|F_2^TMF_2|.
\qquad (A.4)
\]

Partition

\[
U = (F^TMF)^{-1} = F^TM^{-1}F =
\begin{pmatrix}
F_1^TM^{-1}F_1 & F_1^TM^{-1}F_2 \\
F_2^TM^{-1}F_1 & F_2^TM^{-1}F_2
\end{pmatrix}.
\]

By a standard result in Rao (1973), page 33, the bottom-right block of U^{-1} = F^TMF is given by

\[
\bigl(F_2^TM^{-1}F_2 - F_2^TM^{-1}F_1(F_1^TM^{-1}F_1)^{-1}F_1^TM^{-1}F_2\bigr)^{-1}.
\]

Note that (6.4) holds with F_1 replacing S, so one has (F_2^TMF_2)^{-1} = (nλ)^{-1}F_2^T(I − A(λ))F_2. Now since S^T(I − A(λ)) = O, F_2(F_2^TMF_2)^{-1}F_2^T = (nλ)^{-1}(I − A(λ)), and hence (A.4) is equivalent to (6.6).

B Stable Algorithm for GCV Evaluation

Note that equation (3.2) can be minimized through the least squares problem

\[
\min\left\|
\begin{pmatrix} Y \\ 0 \end{pmatrix}
-
\begin{pmatrix} S & R \\ O & C \end{pmatrix}
\begin{pmatrix} d \\ c \end{pmatrix}
\right\|^2,
\]

where C is the Cholesky factor of (nλ)Q = C^TC. Write the QR-decomposition

\[
\begin{pmatrix} S & R \\ O & C \end{pmatrix}
= (U, V)\begin{pmatrix} T \\ O \end{pmatrix}
= UT,
\]

where U is (n + q) × (m + q) orthogonal and T is (m + q) × (m + q) upper triangular. The solution is seen to satisfy U_1^TY = T\bigl(\begin{smallmatrix}d\\c\end{smallmatrix}\bigr), where U_1 is the first n rows of U. It follows that Ŷ = U_1U_1^TY and tr A(λ) = tr(U_1^TU_1). T could be singular and the solution may not be unique, but the evaluations of Ŷ and tr A(λ), all that is needed in GCV scores, only involve U_1. The flop count is of the order O(nq²).

This algorithm is due to Simon Wood.
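A minimal R transcription of this algorithm is sketched below, assuming Q is nonsingular so that chol() applies directly; with a singular Q one would use a pivoted Cholesky factor as discussed in §3. All names are illustrative.

```r
## Stable GCV evaluation via the augmented least squares problem and QR.
gcv_stable <- function(S, R, Q, Y, nlambda) {
  n <- length(Y); q <- ncol(R); m <- ncol(S)
  C <- chol(nlambda * Q)                     # (n*lambda) Q = C^T C
  X <- rbind(cbind(S, R),
             cbind(matrix(0, q, m), C))      # the augmented design matrix
  qrX <- qr(X)
  U  <- qr.Q(qrX)                            # (n+q) x (m+q), orthonormal columns
  U1 <- U[1:n, , drop = FALSE]               # the first n rows of U
  yhat <- U1 %*% crossprod(U1, Y)            # A(lambda) Y = U1 U1^T Y
  trA  <- sum(U1^2)                          # tr A(lambda) = tr(U1^T U1)
  (sum((Y - yhat)^2) / n) / (1 - trA / n)^2  # the GCV score (2.2)
}
```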

Acknowledgements

The authors are indebted to Simon Wood, who, as a referee, contributed many thought-provoking comments and suggestions, including the stable GCV evaluation algorithm described in Appendix B. The work of the first author was done while she was a graduate student at Purdue University. This research was supported by the U.S. National Institutes of Health under Grant R33HL68515, and by the Purdue Research Foundation under a research grant.

References

Breiman, L. and J. H. Friedman (1985). Estimating optimal transformations for multiple regression and correlation. J. Amer. Statist. Assoc. 80, 580–598 (with discussions).

Buja, A., T. Hastie, and R. Tibshirani (1989). Linear smoothers and additive models. Ann. Statist. 17, 453–555 (with discussions).

Craven, P. and G. Wahba (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math. 31, 377–403.

Dennis, J. E. and R. B. Schnabel (1996). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Philadelphia: SIAM. Corrected reprint of the 1983 original.

Golub, G. and C. Van Loan (1989). Matrix Computations (2nd ed.). Baltimore, MD: The Johns Hopkins University Press.

Green, P. J. and B. W. Silverman (1994). Nonparametric Regression and Generalized Linear Models. London: Chapman & Hall.

Gu, C. (2002). Smoothing Spline ANOVA Models. New York: Springer-Verlag.

Gu, C., D. M. Bates, Z. Chen, and G. Wahba (1989). The computation of GCV functions through Householder tridiagonalization with application to the fitting of interaction spline models. SIAM J. Matrix Anal. Applic. 10, 457–480.

Gu, C. and Y.-J. Kim (2002). Penalized likelihood regression: General formulation and efficient approximation. Can. J. Statist. 30, 619–628.

Gu, C. and G. Wahba (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM J. Sci. Statist. Comput. 12, 383–398.

Gu, C. and G. Wahba (1993a). Semiparametric analysis of variance with tensor product thin plate splines. J. Roy. Statist. Soc. Ser. B 55, 353–368.

Gu, C. and G. Wahba (1993b). Smoothing spline ANOVA with component-wise Bayesian "confidence intervals". J. Comput. Graph. Statist. 2, 97–117.

Gu, C. and J. Wang (2003). Penalized likelihood density estimation: Direct cross-validation and scalable approximation. Statist. Sin. 13, 811–826.

Ihaka, R. and R. Gentleman (1996). R: A language for data analysis and graphics. J. Comput. Graph. Statist. 5, 299–314.

Kimeldorf, G. and G. Wahba (1970a). A correspondence between Bayesian estimation of stochastic processes and smoothing by splines. Ann. Math. Statist. 41, 495–502.

Kimeldorf, G. and G. Wahba (1970b). Spline functions and stochastic processes. Sankhya Ser. A 32, 173–180.

Kimeldorf, G. and G. Wahba (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Applic. 33, 82–85.

Li, K.-C. (1986). Asymptotic optimality of C_L and generalized cross-validation in the ridge regression with application to spline smoothing. Ann. Statist. 14, 1101–1112.

Luo, Z. and G. Wahba (1997). Hybrid adaptive splines. J. Amer. Statist. Assoc. 92, 107–116.

Nychka, D. (1988). Bayesian confidence intervals for smoothing splines. J. Amer. Statist. Assoc. 83, 1134–1143.

Nychka, D., P. D. Haaland, M. A. O'Connell, and S. Ellner (1998). FUNFITS: Data analysis and statistical tools for estimating functions. In D. Nychka, W. W. Piegorsch, and L. H. Cox (Eds.), Case Studies in Environmental Statistics, pp. 159–179. New York: Springer-Verlag.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications. New York: Wiley.

Wahba, G. (1978). Improper priors, spline smoothing and the problem of guarding against model errors in regression. J. Roy. Statist. Soc. Ser. B 40, 364–372.

Wahba, G. (1981). Spline interpolation and smoothing on the sphere. SIAM J. Sci. Statist. Comput. 2, 5–16.

Wahba, G. (1983). Bayesian "confidence intervals" for the cross-validated smoothing spline. J. Roy. Statist. Soc. Ser. B 45, 133–150.

Wahba, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Ann. Statist. 13, 1378–1402.

Wahba, G. (1990). Spline Models for Observational Data, Volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.

Wang, Y. and C. Ke (2002). ASSIST: A Suite of S-plus Functions Implementing Spline Smoothing Techniques.

Wood, S. N. (2000). Modelling and smoothing parameter estimation with multiple quadratic penalties. J. Roy. Statist. Soc. Ser. B 62, 413–428.
