
The Annals of Statistics 2000, Vol. 28, No. 5, 1356-1378

ASYMPTOTICS FOR LASSO-TYPE ESTIMATORS

By Keith Knight¹ and Wenjiang Fu²

University of Toronto and Michigan State University

We consider the asymptotic behavior of regression estimators that minimize the residual sum of squares plus a penalty proportional to Σ|β_j|^γ for some γ > 0. These estimators include the Lasso as a special case when γ = 1. Under appropriate conditions, we show that the limiting distributions can have positive probability mass at 0 when the true value of the parameter is 0. We also consider asymptotics for "nearly singular" designs.

1. Introduction. Consider the linear regression model

\[ (1)\qquad Y_i = \beta_0 + \mathbf{x}_i^T \beta + \varepsilon_i \quad \text{for } i = 1, \dots, n, \]

where ε₁, …, ε_n are i.i.d. random variables with mean 0 and variance σ².

Without loss of generality, we will assume that the covariates are centered to have mean 0 and take β₀ = Ȳ, in which case we can replace Y_i in (1) by Y_i − Ȳ and concentrate on estimating β. Again without loss of generality, we will assume that Ȳ = 0.

We estimate β by minimizing the penalized least squares (LS) criterion

\[ (2)\qquad \sum_{i=1}^n (Y_i - \mathbf{x}_i^T \phi)^2 + \lambda_n \sum_{j=1}^p |\phi_j|^\gamma \]

for a given λ_n, where γ > 0. Such estimators were called Bridge estimators by Frank and Friedman (1993), who introduced them as a generalization of ridge regression (which occurs for γ = 2). The special case γ = 1 is related to the "Lasso" of Tibshirani (1996) (hence the term "Lasso-type" in the title); in the case of wavelet regression, this approach to estimation is called basis pursuit [Chen, Donoho and Saunders (1999)]. Some other proposals for penalties are made in Fan and Li (1999). For γ ≤ 1, the estimators minimizing (2) have the potentially attractive feature of being exactly 0 if λ_n is sufficiently large, thus combining parameter estimation and model selection; indeed, model selection methods that penalize by the number of nonzero parameters [such as AIC and BIC; Linhart and Zucchini (1986)] can be viewed as limiting cases of Bridge estimation as γ → 0, since

\[ \lambda_n \sum_{j=1}^p |\phi_j|^\gamma \to \lambda_n \sum_{j=1}^p I(\phi_j \neq 0) \quad \text{as } \gamma \to 0. \]
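For concreteness, criterion (2) is easy to state in code. The following minimal numpy sketch (the function name and arguments are ours, not the authors') evaluates the Bridge objective for a design matrix X, response y, penalty level lam and exponent gamma:

import numpy as np

def bridge_objective(phi, X, y, lam, gamma):
    # Penalized LS criterion (2): RSS plus lam * sum_j |phi_j|^gamma
    resid = y - X @ phi
    return resid @ resid + lam * np.sum(np.abs(phi) ** gamma)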

Received June 1999; revised May 2000.

¹Supported by the Natural Sciences and Engineering Research Council of Canada.

²Supported in part by grant R030A83010 from the National Cancer Institute.

AMS 1991 subject classifications. Primary 62J05, 62J07; secondary 62E20, 60F05.

Key words and phrases. Penalized regression, Lasso, shrinkage estimation, epi-convergence in distribution.



For a given λ_n, we will denote the estimator minimizing (2) by β̂_n. Of course, λ_n = 0 corresponds to the ordinary LS estimator; this estimator will be denoted by β̂_n^{(0)}.

We will assume the following regularity conditions for the design:

\[ (3)\qquad C_n = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T \to C, \]

where C is a nonnegative definite matrix, and

\[ (4)\qquad \frac{1}{n} \max_{1 \le i \le n} \mathbf{x}_i^T \mathbf{x}_i \to 0. \]

Typically in practice, the covariates are scaled so that the diagonal elements of C_n (and hence those of C) are all identically 1.
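A minimal sketch of this standardization (our helper, not part of the paper):

import numpy as np

def standardize_columns(X):
    # Centre each column and scale so that diag(C_n) = 1,
    # i.e. (1/n) * sum_i x_ij^2 = 1 for every column j.
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt((Xc ** 2).mean(axis=0))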

The parametrization of the linear model (1) is unique if the matrix C_n is nonsingular or, equivalently, the design matrix has full rank. It is worth noting, however, that a unique minimum of (2) may exist even if C_n is singular; indeed, this is one of the benefits of this type of estimation. Define the "equivalence class"

\[ \Theta_n = \{ \xi : \xi = \beta + v \text{ where } C_n v = 0 \}, \]

where β satisfies (1). When C_n is singular, we could define a unique parametrization of (1) from the equivalence class Θ_n by defining β₀ (for a given γ > 0) as

\[ \beta_0 = \arg\min_{\xi \in \Theta_n} \sum_{j=1}^p |\xi_j|^\gamma. \]

However, this will not be pursued further here; we will assume that C_n is nonsingular for all n.

An important class of designs is the class of "nearly singular" designs. For such designs, C_n is nonsingular but may have one or more small eigenvalues (indicating the presence of collinearity among the covariates) such that (asymptotically) C_n → C where C is singular. In practice, nearly singular designs can arise when many covariates are available, increasing the possibility of nearly linear dependencies between two or more covariates. These designs are considered in Section 5.

Under conditions (3) and (4) (with C nonsingular), it is well known that the LS estimator is consistent and that

\[ \sqrt{n}\big( \hat{\beta}_n^{(0)} - \beta \big) \to_d N(0,\ \sigma^2 C^{-1}). \]

In fact, conditions (3) and (4) can be weakened considerably without losing asymptotic normality of the LS estimator [Srivastava (1971)]; however, we will assume the existence of the limit C in (3) throughout the paper. For the most part, we will assume that C is nonsingular.


In Section 2, we discuss consistency and limiting distributions of Bridge estimators, while in Section 3 we try to examine the small-sample behavior by considering "local asymptotics." Asymptotics for bootstrapped Bridge estimation are considered in Section 4. In Section 5, we consider "nearly singular" designs as defined above. Finally, in Section 6 we try to tie up some other loose ends by considering Bridge estimation for singular designs as well as computation of Bridge estimators when γ < 1.

2. Limiting distributions. In this section, as well as in Section 3, we will assume that the matrix C defined in (3) is nonsingular.

The limiting behavior of the Bridge estimator β̂_n can be determined by studying the asymptotic behavior of the objective function (2). For example, to consider consistency of β̂_n, we will define the (random) function

\[ (5)\qquad Z_n(\phi) = \frac{1}{n} \sum_{i=1}^n (Y_i - \mathbf{x}_i^T \phi)^2 + \frac{\lambda_n}{n} \sum_{j=1}^p |\phi_j|^\gamma, \]

which is minimized at φ = β̂_n. The following result shows that β̂_n is consistent provided λ_n = o(n).

THEOREM 1. If C in (3) is nonsingular and λ_n/n → λ₀ ≥ 0, then β̂_n →_p argmin(Z), where

\[ Z(\phi) = (\phi - \beta)^T C (\phi - \beta) + \lambda_0 \sum_{j=1}^p |\phi_j|^\gamma. \]

Thus if λ_n = o(n), argmin(Z) = β and so β̂_n is consistent.

PROOF. Define Z_n as in (5). We need to show that

\[ (6)\qquad \sup_{\phi \in K} \big| Z_n(\phi) - Z(\phi) - \sigma^2 \big| \to_p 0 \]

for any compact set K, and that

\[ (7)\qquad \hat{\beta}_n = O_p(1). \]

Under (6) and (7), we have

\[ \arg\min(Z_n) = \hat{\beta}_n \to_p \arg\min(Z). \]

For γ ≥ 1, Z_n is convex; thus (6) and (7) follow from the pointwise convergence in probability of Z_n(φ) to Z(φ) + σ² by applying standard results [Andersen and Gill (1982); Pollard (1991)]. For γ < 1, Z_n is no longer convex but (6) follows easily. To prove (7), note that

\[ Z_n(\phi) \ge Z_n^{(0)}(\phi) \]

for all φ, where Z_n^{(0)} is Z_n with λ_n = 0. Since argmin(Z_n^{(0)}) = O_p(1), it follows that argmin(Z_n) = O_p(1). □


Even though λ_n = o(n) is sufficient for consistency, we require that λ_n grow more slowly for √n-consistency of the Bridge estimator. However, if λ_n grows too slowly then √n(β̂_n − β) will have the same limiting distribution as √n(β̂_n^{(0)} − β). In fact, the rate of growth of λ_n needed to get an "interesting" limiting distribution depends on whether γ > 1 or γ < 1. Theorem 2 below indicates that we need λ_n = O(√n) for √n-consistency for γ ≥ 1; Theorem 3 suggests λ_n = O(n^{γ/2}) is necessary for γ < 1. [In fact, λ_n = O(√n) suffices for γ < 1.]

THEOREM 2. Suppose that γ ≥ 1. If λ_n/√n → λ₀ ≥ 0 and C is nonsingular, then

\[ \sqrt{n}(\hat{\beta}_n - \beta) \to_d \arg\min(V), \]

where

\[ V(u) = -2u^T W + u^T C u + \lambda_0 \gamma \sum_{j=1}^p u_j \operatorname{sgn}(\beta_j) |\beta_j|^{\gamma-1} \]

if γ > 1,

\[ V(u) = -2u^T W + u^T C u + \lambda_0 \sum_{j=1}^p \big[ u_j \operatorname{sgn}(\beta_j) I(\beta_j \neq 0) + |u_j| I(\beta_j = 0) \big] \]

if γ = 1, and W has a N(0, σ²C) distribution.

PROOF. Define V_n(u) by

\[ V_n(u) = \sum_{i=1}^n \big[ (\varepsilon_i - u^T \mathbf{x}_i/\sqrt{n})^2 - \varepsilon_i^2 \big] + \lambda_n \sum_{j=1}^p \big[ |\beta_j + u_j/\sqrt{n}|^\gamma - |\beta_j|^\gamma \big] \]

[where u = (u₁, …, u_p)^T] and note that V_n is minimized at √n(β̂_n − β). First note that

\[ \sum_{i=1}^n \big[ (\varepsilon_i - u^T \mathbf{x}_i/\sqrt{n})^2 - \varepsilon_i^2 \big] \to_d -2u^T W + u^T C u, \]

with finite-dimensional convergence holding trivially. If γ > 1 then

\[ \lambda_n \sum_{j=1}^p \big[ |\beta_j + u_j/\sqrt{n}|^\gamma - |\beta_j|^\gamma \big] \to \lambda_0 \gamma \sum_{j=1}^p u_j \operatorname{sgn}(\beta_j) |\beta_j|^{\gamma-1}, \]

while for γ = 1, we have

\[ \lambda_n \sum_{j=1}^p \big[ |\beta_j + u_j/\sqrt{n}| - |\beta_j| \big] \to \lambda_0 \sum_{j=1}^p \big[ u_j \operatorname{sgn}(\beta_j) I(\beta_j \neq 0) + |u_j| I(\beta_j = 0) \big]. \]


Thus V_n(u) →_d V(u) (as defined above), with the finite-dimensional convergence holding trivially. Since V_n is convex and V has a unique minimum, it follows [Geyer (1996)] that

\[ \arg\min(V_n) = \sqrt{n}(\hat{\beta}_n - \beta) \to_d \arg\min(V). \]

Note that when λ₀ = 0, argmin(V) = C^{-1}W ∼ N(0, σ²C^{-1}). □

Proponents of ridge regression (that is, γ = 2) may be disappointed with the conclusion of Theorem 2, as it gives

\[ \sqrt{n}(\hat{\beta}_n - \beta) \to_d C^{-1}W - \lambda_0 C^{-1}\beta \sim N\big( -\lambda_0 C^{-1}\beta,\ \sigma^2 C^{-1} \big), \]

which suggests that ridge estimation is inferior to ordinary LS estimation. However, the asymptotic perspective used here is somewhat unfair to ridge estimation; see Theorem 4 in Section 3 for a more flattering asymptotic perspective on ridge estimation. Theorem 2 does illustrate, though, that for γ > 1 the amount of shrinkage towards 0 increases with the magnitude of the parameter being estimated; thus, for "large" parameters, the bias of their estimators for γ > 1 may be unacceptably large.

THEOREM 3. Suppose that γ < 1. If λ_n/n^{γ/2} → λ₀ > 0, then

\[ \sqrt{n}(\hat{\beta}_n - \beta) \to_d \arg\min(V), \]

where

\[ V(u) = -2u^T W + u^T C u + \lambda_0 \sum_{j=1}^p |u_j|^\gamma I(\beta_j = 0). \]

PROOF. The proof is similar to that of Theorem 2; however, there are some added complexities due to the nonconvexity of the objective function. Define

\[ (8)\qquad V_n(u) = \sum_{i=1}^n \big[ (\varepsilon_i - u^T \mathbf{x}_i/\sqrt{n})^2 - \varepsilon_i^2 \big] + \lambda_n \sum_{j=1}^p \big[ |\beta_j + u_j/\sqrt{n}|^\gamma - |\beta_j|^\gamma \big]. \]

Since λ_n = O(n^{γ/2}) = o(√n), it follows that

\[ \lambda_n \big[ |\beta_j + u_j/\sqrt{n}|^\gamma - |\beta_j|^\gamma \big] \to 0 \]

if β_j ≠ 0, while λ_n|u_j/√n|^γ = (λ_n/n^{γ/2})|u_j|^γ → λ₀|u_j|^γ if β_j = 0. Thus

\[ \lambda_n \sum_{j=1}^p \big[ |\beta_j + u_j/\sqrt{n}|^\gamma - |\beta_j|^\gamma \big] \to \lambda_0 \sum_{j=1}^p |u_j|^\gamma I(\beta_j = 0), \]

and the convergence is uniform over u in compact sets. It follows then that

\[ V_n \to_d V \]


on the space of functions topologized by uniform convergence on compact sets. To prove that argmin(V_n) →_d argmin(V), it suffices to show that argmin(V_n) = O_p(1) [Kim and Pollard (1990)]. However, for all u and n sufficiently large, V_n(u) is bounded below by a process V_n^{(1)}(u) whose quadratic terms grow faster than its |u_j|^γ terms; it follows that argmin(V_n^{(1)}) = O_p(1), and hence that argmin(V_n) = O_p(1). Since argmin(V) is unique with probability 1, the conclusion follows. □

The conclusion of Theorem 3 is quite interesting. It suggests that if γ < 1, we can estimate nonzero regression parameters at the usual rate without asymptotic bias while shrinking the estimates of zero regression parameters to 0 with positive probability. This is in contrast to what happens when γ > 1; Theorem 2 indicates that nonzero parameters are estimated with some asymptotic bias if λ₀ > 0.

As an alternative to the conditions on λ_n given in Theorem 3, we can also consider what happens if λ_n/√n → λ₀ ≥ 0 while λ_n/n^{γ/2} → ∞. Suppose that β₁, …, β_r are nonzero while β_{r+1}, …, β_p are zero. Then, defining V_n(u) as in (8), it follows that V_n(u) →_d V(u), where

\[ V(u) = \begin{cases} -2u^T W + u^T C u + \lambda_0 \gamma \sum_{j=1}^r u_j \operatorname{sgn}(\beta_j) |\beta_j|^{\gamma-1}, & \text{if } u_{r+1} = \cdots = u_p = 0, \\ \infty, & \text{otherwise.} \end{cases} \]

[In fact, since V can be infinite, we can no longer define convergence of V_n to V via uniform convergence on compact sets but instead define it via epiconvergence, which allows for extended real-valued functions; see Geyer (1994, 1996) and Pflug (1995) for more details on epiconvergence.] Applying the arguments given in the proof of Theorem 3, it follows that √n(β̂_n − β) →_d argmin(V), where the last (p − r) elements of argmin(V) are exactly 0. This result suggests that we could achieve the best of both worlds (at least asymptotically) by taking λ_n ∼ λ₀ n^{α/2} where γ < α < 1. However, this latter formulation does not really capture what happens for finite samples.

When some of the β_j's are exactly 0, the limiting distributions (as given by Theorems 2 and 3) put positive probability at 0 when γ ≤ 1. We will illustrate this in the case γ = 1. Suppose that β₁, …, β_r are nonzero and β_{r+1} = ⋯ = β_p = 0. In this case,

\[ V(u) = -2u^T W + u^T C u + \lambda_0 \Big[ \sum_{j=1}^r u_j \operatorname{sgn}(\beta_j) + \sum_{j=r+1}^p |u_j| \Big]. \]


Now rewrite the matrix C, W and u as follows:

\[ C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}, \qquad W = \begin{pmatrix} W_1 \\ W_2 \end{pmatrix}, \qquad u = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}, \]

where C₁₁ is r × r, C₂₂ is (p − r) × (p − r), C₂₁ = C₁₂^T, and W₁ and u₁ are r-vectors. If V(u) is minimized at u₂ = 0 then it follows that

\[ (9)\qquad 2C_{11} u_1 = 2W_1 - \lambda_0\, s, \qquad s = \big( \operatorname{sgn}(\beta_1), \dots, \operatorname{sgn}(\beta_r) \big)^T, \]

and

\[ (10)\qquad -\lambda_0 \mathbf{1} \le 2\big( W_2 - C_{21} u_1 \big) \le \lambda_0 \mathbf{1}, \]

where 1 is a vector of 1's and the inequalities are interpreted coordinatewise. Solving for u₁ in (9), we get

\[ u_1 = C_{11}^{-1} \big( W_1 - \tfrac{1}{2}\lambda_0 s \big), \]

and substituting into (10), it follows that u₂ = 0 if

\[ -\lambda_0 \mathbf{1} \le 2\big( W_2 - C_{21} C_{11}^{-1} W_1 \big) + \lambda_0\, C_{21} C_{11}^{-1} s \le \lambda_0 \mathbf{1}. \]

In the case where β₁ = β₂ = ⋯ = β_p = 0, this reduces to

\[ -\lambda_0 \mathbf{1} \le 2W \le \lambda_0 \mathbf{1}. \]

Note that this same rationale applies for finite samples as well; for example, β̂_n = 0 if and only if

\[ -\lambda_n \mathbf{1} \le 2 \sum_{i=1}^n Y_i \mathbf{x}_i \le \lambda_n \mathbf{1}. \]
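This finite-sample condition is directly checkable in code. A small sketch (our helper; it assumes centred data so that criterion (2) with γ = 1 applies):

import numpy as np

def lasso_estimate_is_zero(X, y, lam):
    # beta_hat = 0 for criterion (2) with gamma = 1 if and only if
    # -lam * 1 <= 2 * sum_i Y_i x_i <= lam * 1 coordinatewise.
    return np.all(np.abs(2 * X.T @ y) <= lam)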

EXAMPLE 1. Consider a quadratic regression model

\[ Y_i = \beta_0 + \beta_1 z_{1i} + \beta_2 z_{2i} + \varepsilon_i \quad \text{for } i = 1, \dots, n, \]

where the x_i's are uniformly distributed over the interval [0, 1] with

\[ z_{1i} = x_i \]

and

\[ z_{2i} = x_i^2. \]

In this case,

\[ C = \begin{pmatrix} 1/12 & 1/12 \\ 1/12 & 4/45 \end{pmatrix}, \]

the covariance matrix of (x_i, x_i²). We define estimators β̂_{n1}, β̂_{n2} to minimize

\[ \sum_{i=1}^n (Y_i - \phi_1 z_{1i} - \phi_2 z_{2i})^2 + \lambda_n \big( |\phi_1|^\gamma + |\phi_2|^\gamma \big). \]

In this example, we will consider the cases γ = 1 and γ = 1/2 with λ_n/n^{γ/2} → λ₀ > 0 and β₁ > 0, β₂ = 0. Then

\[ \sqrt{n}\,\big( \hat{\beta}_{n1} - \beta_1,\ \hat{\beta}_{n2} - \beta_2 \big) \to_d (\hat{U}_1, \hat{U}_2) = \arg\min(V), \]

where

\[ V(u) = -2u^T W + u^T C u + \lambda_0 \big[ u_1 \operatorname{sgn}(\beta_1) + |u_2| \big] \]

for γ = 1,

\[ V(u) = -2u^T W + u^T C u + \lambda_0 |u_2|^{1/2} \]

for γ = 1/2, and (W₁, W₂) is a zero-mean bivariate normal random vector with covariance matrix σ²C.

For γ = 1 (assuming that β₁ > 0 and β₂ = 0), it is fairly straightforward to evaluate argmin(V) explicitly; letting (Û₁, Û₂) = argmin(V), we have three


cases to consider, depending on the value of

\[ r(W_1, W_2, \lambda_0) = 2\big( W_2 - C_{21} C_{11}^{-1} W_1 \big) + \lambda_0\, C_{21} C_{11}^{-1}, \]

the quantity appearing in conditions (9) and (10) when there is a single nonzero parameter with sgn(β₁) = 1:

1. If |r(W₁, W₂, λ₀)| ≤ λ₀ then

\[ \hat{U}_2 = 0 \quad \text{and} \quad \hat{U}_1 = C_{11}^{-1} \big( W_1 - \tfrac{1}{2}\lambda_0 \big). \]

2. If r(W₁, W₂, λ₀) < −λ₀ then Û₂ < 0 and (Û₁, Û₂) solves

\[ 2C \hat{U} = 2W - \lambda_0 (1, -1)^T. \]

3. If r(W₁, W₂, λ₀) > λ₀ then Û₂ > 0 and (Û₁, Û₂) solves

\[ 2C \hat{U} = 2W - \lambda_0 (1, 1)^T. \]

For γ = 1/2, an exact representation of the distribution of (Û₁, Û₂) is more difficult to obtain, but this distribution can be easily simulated.
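A minimal simulation sketch along these lines, taking C to be the covariance matrix of (x, x²) for x uniform on [0, 1] as in our reconstruction of this example, with σ² = 1 and λ₀ = 0.5 (the grid, its width and all names are our choices):

import numpy as np

rng = np.random.default_rng(0)
C = np.array([[1/12, 1/12],
              [1/12, 4/45]])
sigma2, lam0, gamma = 1.0, 0.5, 0.5

def argmin_V(W):
    # Minimize V(u) = -2u'W + u'Cu + lam0*|u2|^gamma by profiling out
    # u1 (quadratic for fixed u2) and gridding u2 (0 included exactly).
    u2 = np.concatenate(([0.0], np.linspace(-60.0, 60.0, 12001)))
    u1 = (W[0] - C[0, 1] * u2) / C[0, 0]
    U = np.stack([u1, u2], axis=1)
    V = -2 * U @ W + np.einsum('ij,jk,ik->i', U, C, U) + lam0 * np.abs(u2) ** gamma
    return U[np.argmin(V)]

W_draws = rng.multivariate_normal([0.0, 0.0], sigma2 * C, size=500)
U_hat = np.array([argmin_V(W) for W in W_draws])
print("P(U2_hat = 0) approx.", np.mean(U_hat[:, 1] == 0.0))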

Tables 1 and 2 give the means, variances and correlations of Û₁ and Û₂, as well as P(Û₂ = 0), for various values of λ₀ (scaled by σ). As one might expect, the asymptotic variances of β̂_{n1} and β̂_{n2} decrease as λ₀ increases. On the other hand, the asymptotic bias of β̂_{n1} becomes increasingly negative as λ₀ increases, while the asymptotic bias of β̂_{n2} increases away from zero and then decreases to 0 as λ₀ increases.

Figures 1, 2 and 3 show scatterplots of random samples of 500 drawn from the limiting distributions for the LS estimator (λ₀ = 0), λ₀ = 1.0 for γ = 1, and λ₀ = 0.5 for γ = 1/2, respectively (with σ² = 1). [These values of λ₀ for the different values of γ give approximately equal values of P(Û₂ = 0).] To facilitate comparison, the same values of (W₁, W₂) were used to generate the three

TABLE 1. Properties of the distribution of argmin(V) for γ = 1 and various values of λ₀.


TABLE 2. Properties of the distribution of argmin(V) for γ = 0.5 and various values of λ₀.

scatterplots. These scatterplots illustrate the effect of Bridge estimation relative to LS estimation from an asymptotic point of view. In LS estimation, the strong correlation between the two variables means that overestimation of β₁ is generally accompanied by underestimation of β₂ (and vice versa); moreover, this effect holds regardless of the true values of β₁ and β₂. In contrast, Lasso estimation (γ = 1) compensates for underestimation of β₁ by overestimation of β₂ but effectively sets the estimate of β₂ to zero if β₁ is overestimated. For γ = 1/2, the shrinkage to zero in the estimation of β₂ is more selective, and "larger" estimates are essentially unchanged from their corresponding LS estimates. Also note that there is a "no man's land" in the distribution of Û₂ when γ = 1/2; for each λ₀, there is an open interval I(λ₀) = (0, c(λ₀)) [with c(λ₀) > 0] such that P[|Û₂| ∉ I(λ₀)] = 1. For λ₀ = 1/2, c(λ₀) ≈ 0.86.

How well do the asymptotic distributions approximate finite-sample distributions? There are a number of factors involved, including the accuracy of normal approximations.

FIG. 1. Sample of 500 from the limiting distribution of the LS estimator in Example 1.


FIG. 2. Sample of 500 from the limiting distribution of the Bridge estimator in Example 1 with γ = 1 and λ₀ = 1. The probability that Û₂ is strictly less than 0 is approximately 4.1 × …, which explains the absence of negative Û₂ values.

FIG. 3. Sample of 500 from the limiting distribution of the Bridge estimator in Example 1 with γ = 1/2 and λ₀ = 1/2.


However, the key factor here would seem to be the extent to which the asymptotic penalty in V(u) (defined in Theorems 2 and 3) approximates the true penalty term. For example, when γ ≤ 1, these approximations may not be particularly good for finite samples, as the function |x|^γ is not particularly smooth when x is close to 0. This is addressed to some extent in the next section.

3. Local asymptotics and small parameters. A distinguishing feature of Bridge estimation for γ ≤ 1 is the possibility of obtaining exact 0 parameter estimates. In the previous section, we showed that the limiting distributions have positive mass at 0 when the true parameter value is 0 but are absolutely continuous (with respect to Lebesgue measure) otherwise. In this section, we will try to illustrate how this "exact 0" phenomenon can occur in finite samples when the true parameter is small but nonzero.

To do this, we will assume that we have a triangular array of observations. That is, define

\[ (11)\qquad Y_{ni} = \beta_n^T \mathbf{x}_{ni} + \varepsilon_{ni} \quad \text{for } i = 1, \dots, n, \]

where for each n, ε_{n1}, …, ε_{nn} are i.i.d. random variables with mean 0 and variance σ². We assume that the x_{ni}'s satisfy the conditions

\[ (12)\qquad \frac{1}{n} \sum_{i=1}^n \mathbf{x}_{ni} \mathbf{x}_{ni}^T \to C \]

for some positive definite matrix C and

\[ (13)\qquad \frac{1}{n} \max_{1 \le i \le n} \mathbf{x}_{ni}^T \mathbf{x}_{ni} \to 0; \]

these are the obvious analogues of (3) and (4). Suppose that β_n = β + t/√n and define β̂_n to minimize

\[ (14)\qquad \sum_{i=1}^n (Y_{ni} - \mathbf{x}_{ni}^T \phi)^2 + \lambda_n \sum_{j=1}^p |\phi_j|^\gamma. \]

This formulation allows us to examine the asymptotic properties of Bridge estimators when one or more of the regression parameters are close to 0 but nonzero. The idea here is to get a hint of the small sample behavior of Bridge estimation.

THEOREM 4. Assume the model (11) with β_n = β + t/√n, and assume that (12) and (13) are satisfied. Let β̂_n minimize (14) for some γ > 1.

(a) If λ_n/√n → λ₀ ≥ 0 then

\[ \sqrt{n}\big( \hat{\beta}_n - \beta_n \big) \to_d \arg\min(V), \]

where

\[ V(u) = -2u^T W + u^T C u + \lambda_0 \gamma \sum_{j=1}^p u_j \operatorname{sgn}(\beta_j) |\beta_j|^{\gamma-1}. \]


4. Bootstrapping. Attaching standard error estimates to Bridge parameter estimates is nontrivial, especially when γ ≤ 1. For the Lasso (γ = 1), Tibshirani (1996) gives an approximation of the covariance matrix of the estimators. However, his approximation leads to standard error estimates of 0 when the estimate is 0, which is clearly unsatisfactory; Osborne, Presnell and Turlach (1998) give an alternative approximation that leads to apparently more satisfactory standard error estimates. However, these approximations to the covariance matrix implicitly assume that the estimators are approximately linear transformations of the data, which is clearly not the case when γ ≤ 1. An alternative approach to obtaining standard error estimates is to use the bootstrap.

In regression models, there are effectively two approaches to bootstrapping depending on whether the design is considered fixed or random.

1. (Random design). We draw a bootstrap sample (Y₁*, x₁*), …, (Y_n*, x_n*) with replacement from {(Y₁, x₁), …, (Y_n, x_n)}.

2. (Fixed design). The bootstrap sample is (Y₁*, x₁), …, (Y_n*, x_n), where

\[ Y_i^* = \bar{Y} + \tilde{\beta}_n^T \mathbf{x}_i + \varepsilon_i^* \quad \text{for } i = 1, \dots, n, \]

with ε₁*, …, ε_n* sampled with replacement from "residuals" {e₁, …, e_n} and β̃_n some estimator of β (not necessarily a Bridge estimator).

Using the bootstrap sample, we can then obtain a bootstrap Bridge estimator of β (call it β̂_n*) by minimizing an appropriate version of (2) for some γ and λ_n; a sketch for the fixed-design case is given below. The idea is that the bootstrap distribution of √n(β̂_n* − β̃_n) should approximate the sampling distribution of √n(β̂_n − β), where β̂_n is the Bridge estimator minimizing (2).
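A minimal sketch of the fixed-design residual bootstrap for γ = 1, using scikit-learn's Lasso as the solver (its criterion is ||y − Xb||²/(2n) + α||b||₁, so α = λ_n/(2n) matches (2); all names and defaults here are our choices):

import numpy as np
from sklearn.linear_model import Lasso

def fixed_design_bootstrap(X, y, lam, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha = lam / (2 * n)
    beta_tilde = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    resid = y - X @ beta_tilde
    resid = resid - resid.mean()  # centre the residuals
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        y_star = X @ beta_tilde + rng.choice(resid, size=n, replace=True)
        boot = Lasso(alpha=alpha, fit_intercept=False).fit(X, y_star).coef_
        draws[b] = np.sqrt(n) * (boot - beta_tilde)
    return draws  # approximates the law of sqrt(n)*(beta_hat_n - beta)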

The asymptotics of approach 2 (the fixed design) are quite simple. Assume that √n(β̃_n − β) →_d U and that the set of residuals sampled from has mean 0. For each bootstrap sample, we will centre the Y_i*'s by their sample mean. Define

\[ V_n^*(u) = \sum_{i=1}^n \big[ (\varepsilon_i^* - u^T \mathbf{x}_i/\sqrt{n})^2 - (\varepsilon_i^*)^2 \big] + \lambda_n \sum_{j=1}^p \big[ |\tilde{\beta}_{nj} + u_j/\sqrt{n}|^\gamma - |\tilde{\beta}_{nj}|^\gamma \big]. \]

Conditional on β̃_n, the randomness of V_n* comes from the bootstrap sampling producing the ε_i*'s. The idea here is exactly the same as before: if V_n* converges to some V* then the bootstrap distribution of argmin(V_n*) = √n(β̂_n* − β̃_n) should converge (in some sense) to that of argmin(V*). What complicates matters is the fact that there are two layers of randomness: one due to the original sample (reflected through β̃_n) and one due to the bootstrap sampling.

We will assume the conditions on λ_n stated in Theorems 2 and 3; that is, λ_n/√n → λ₀ ≥ 0 if γ ≥ 1 and λ_n/n^{γ/2} → λ₀ > 0 if γ < 1. The simple case is when all of the β_j's are nonzero. Then β̃_{nj} →_p β_j ≠ 0. Thus

\[ \lambda_n \sum_{j=1}^p \big[ |\tilde{\beta}_{nj} + u_j/\sqrt{n}|^\gamma - |\tilde{\beta}_{nj}|^\gamma \big] = \lambda_n \sum_{j=1}^p \big[ |\beta_j + u_j/\sqrt{n}|^\gamma - |\beta_j|^\gamma \big] + R_n^{(1)}(u) \]

and

\[ V_n^*(u) = -2u^T W_n^* + u^T C_n u + \lambda_n \sum_{j=1}^p \big[ |\beta_j + u_j/\sqrt{n}|^\gamma - |\beta_j|^\gamma \big] + R_n^{(1)}(u), \]

where R_n^{(1)}(u) = o_p(1) for each u and W_n* has a limiting normal distribution (with covariance matrix σ²C). From this it follows that

\[ P^*\big( \sqrt{n}(\hat{\beta}_n^* - \tilde{\beta}_n) \in A \big) \to_p P\big( \arg\min(V) \in A \big), \]

where V is as defined in Theorems 2 or 3. If one or more of the β_j's is 0 then the argument given above still works for γ > 1 but fails for γ ≤ 1; a more sophisticated argument is needed for this latter case. Suppose that √n(β̃_n − β) → U a.s. Then under the conditions on λ_n given above, we have for γ = 1,

\[ \lambda_n \big| \tilde{\beta}_{nj} + u_j/\sqrt{n} \big| - \lambda_n \big| \tilde{\beta}_{nj} \big| = \lambda_0 \big[ u_j \operatorname{sgn}(\beta_j) I(\beta_j \neq 0) + \big( |u_j + U_j| - |U_j| \big) I(\beta_j = 0) \big] + R_{nj}^{(2)}(U), \]

and for γ < 1,

\[ \lambda_n \big| \tilde{\beta}_{nj} + u_j/\sqrt{n} \big|^\gamma - \lambda_n \big| \tilde{\beta}_{nj} \big|^\gamma = \lambda_0 \big( |u_j + U_j|^\gamma - |U_j|^\gamma \big) I(\beta_j = 0) + R_{nj}^{(2)}(U), \]

where R_{nj}^{(2)}(U) = o(1) with probability 1. From this, it follows that

\[ \arg\min(V_n^*) = \sqrt{n}\big( \hat{\beta}_n^* - \tilde{\beta}_n \big) \to_d \arg\min(V^*), \]

where for γ = 1,

\[ V^*(u) = -2u^T W + u^T C u + \lambda_0 \sum_{j=1}^p \big[ u_j \operatorname{sgn}(\beta_j) I(\beta_j \neq 0) + |u_j + U_j| I(\beta_j = 0) \big] \]

and for γ < 1,

\[ V^*(u) = -2u^T W + u^T C u + \lambda_0 \sum_{j=1}^p |u_j + U_j|^\gamma I(\beta_j = 0). \]
Note the parallels between these results and the results of Section 3. In our case, we do not have almost sure convergence of √n(β̃_n − β) but rather convergence in distribution; however, by the Skorokhod representation theorem [cf. van der Vaart and Wellner (1996)], given U_n →_d U there exists a probability space and random elements {U_n'}, U' having the same distributions as {U_n}, U such that U_n' → U' a.s. From this fact, we can deduce that

\[ P^*\big( \sqrt{n}(\hat{\beta}_n^* - \tilde{\beta}_n) \in A \big) \to_d P^*\big( \arg\min(V^*) \in A \big), \]

where the probability in the limit is in fact a random variable if β_j = 0 for at least one j. On the other hand, if β_j ≠ 0 for all j then the limiting probability is nonrandom and is the same as that given in Theorems 2 and 3.


The asymptotic results presented above indicate that the bootstrap may have some problems in estimating the sampling distribution of Bridge estimators for γ ≤ 1 when some true parameter values are either exactly 0 or close to 0; in such cases, bootstrap sampling introduces a bias (due to U) that does not vanish asymptotically. One possible solution is to choose an estimator β̃_n that has P(β̃_{nj} = 0) ≈ 1 when β_j = 0 but P(β̃_{nj} = 0) ≈ 0 when β_j ≠ 0; there are a variety of ways to do this, for example, by using a consistent model selection procedure. While this may seem attractive from an asymptotic viewpoint, such an approach may cause more problems in practice than it solves.

5. Asymptotics for nearly singular designs. In this section, we will consider the asymptotic behavior of Bridge estimators when the design is nearly singular. More precisely, suppose that C_n [as defined in (3)] is nonsingular but tends to a singular matrix C. In particular, we will assume that

\[ (15)\qquad a_n (C_n - C) \to D \]

for some sequence {a_n} tending to infinity, where D is positive definite on the null space of C (that is, v^T D v > 0 for nonzero v with Cv = 0). (Note that D is necessarily nonnegative definite on the null space of C, so it is not too stringent to require it to be positive definite on this null space.)

The consistency and limiting distribution arguments given in Section 2 require that the functions Z and V (defined in Theorems 1, 2 and 3) have unique minimizers. When the matrix C is singular this is not generally the case. For example, define V(u) as in Theorem 2. If γ > 1 then u ∈ argmin(V) satisfies

\[ (16)\qquad Cu = W - \tfrac{1}{2}\lambda_0\, T(\beta) \]

for some function T. If v lies in the null space of C then clearly u + tv ∈ argmin(V) for any t, and so argmin(V) consists of a single point if, and only if, C is nonsingular. Likewise, when γ = 1, u ∈ argmin(V) satisfies a modification of (16), namely

\[ (17)\qquad Cu \in W - \tfrac{1}{2}\lambda_0\, T(\beta, u), \]

where now T is possibly a set-valued function (or multifunction). Again u + tv ∈ argmin(V) for any v in the null space of C, and so (17) has a unique solution if, and only if, C is nonsingular.

When γ < 1, the situation is somewhat more complicated. Define V(u) as in Theorem 3. In general, if C is singular then argmin(V) will not be unique; if u ∈ argmin(V) and v lies in the null space of C then for some nonzero t, V(u) = V(u + tv). However, suppose that β_{r+1} = ⋯ = β_p = 0 and that the null space of C is spanned by the standard basis vectors e_{r+1}, …, e_p; then we have

\[ V(u) = -2u^T W + u^T C u + \lambda_0 \sum_{j=r+1}^p |u_j|^\gamma, \]


which has a unique minimizer. Note that this condition on the null space of C implies that the strongest collinearity in the design is restricted to the covariates that have no influence on the response.

We will now consider the asymptotic behavior of nearly singular designs under fairly weak conditions. We will assume that C_n is nonsingular for all n and satisfies (15) for some sequence {a_n}. Define b_n = (n/a_n)^{1/2} and redefine V_n to be

\[ (18)\qquad V_n(u) = \sum_{i=1}^n \big[ (\varepsilon_i - u^T \mathbf{x}_i/b_n)^2 - \varepsilon_i^2 \big] + \lambda_n \sum_{j=1}^p \big[ |\beta_j + u_j/b_n|^\gamma - |\beta_j|^\gamma \big], \]

which is minimized at b_n(β̂_n − β). Note that since b_n = o(√n), the estimators will have a slower rate of convergence than when C is nonsingular.

THEOREM 6. Assume a nearly singular model with C_n satisfying (15). Let W be a zero-mean multivariate normal random vector such that var(u^T W) = u^T D u > 0 for each nonzero u satisfying Cu = 0.

(a) If γ > 1 and λ_n/b_n → λ₀ > 0, then b_n(β̂_n − β) →_d argmin(V), where

\[ V(u) = \begin{cases} -2u^T W + u^T D u + \lambda_0 \gamma \sum_{j=1}^p u_j \operatorname{sgn}(\beta_j) |\beta_j|^{\gamma-1}, & \text{if } Cu = 0, \\ \infty, & \text{otherwise.} \end{cases} \]

(b) If γ = 1 and λ_n/b_n → λ₀ > 0, then b_n(β̂_n − β) →_d argmin(V), where

\[ V(u) = \begin{cases} -2u^T W + u^T D u + \lambda_0 \sum_{j=1}^p \big[ u_j \operatorname{sgn}(\beta_j) I(\beta_j \neq 0) + |u_j| I(\beta_j = 0) \big], & \text{if } Cu = 0, \\ \infty, & \text{otherwise.} \end{cases} \]

(c) If γ < 1 and λ_n/b_n^γ → λ₀ > 0, then b_n(β̂_n − β) →_d argmin(V), where

\[ V(u) = \begin{cases} -2u^T W + u^T D u + \lambda_0 \sum_{j=1}^p |u_j|^\gamma I(\beta_j = 0), & \text{if } Cu = 0, \\ \infty, & \text{otherwise.} \end{cases} \]
PROOF. The proofs of (a), (b) and (c) are essentially the same as before. Define V_n as in (18). Then in each case, for u in the null space of C, we have V_n(u) →_d V(u), while for u outside this null space, V_n(u) →_p ∞. □


EXAMPLE 2. Consider a design with

\[ C_n = (1 - \rho_n) I + \rho_n \mathbf{1}\mathbf{1}^T \]

(unit diagonal, all off-diagonal elements equal to ρ_n), where ρ_n → 1 and a_n(1 − ρ_n) → ψ > 0. In this case, {C_n} converges to a matrix C (of all 1's) and a_n(C_n − C) → D, where

\[ D = \psi \big( I - \mathbf{1}\mathbf{1}^T \big). \]

If the matrices are p × p then the null space of C is the space of vectors u with u₁ + ⋯ + u_p = 0. For the sake of illustration, let us suppose that β₁ ≠ 0, β₂ = ⋯ = β_p = 0 and take γ < 1. Then the limiting objective function V in Theorem 6 is

\[ V(u) = \begin{cases} -2u^T W + \psi\, u^T u + \lambda_0 \sum_{j=2}^p |u_j|^\gamma, & \text{if } u_1 + \cdots + u_p = 0, \\ \infty, & \text{otherwise,} \end{cases} \]

where

\[ u^T D u = \psi\, u^T u \quad \text{for } u_1 + \cdots + u_p = 0. \]

The limiting distribution of (n/a_n)^{1/2}(β̂_n − β) is simply Û = argmin{V(u) : u₁ + ⋯ + u_p = 0}. Each component of Û has positive probability mass at 0 and these components must sum to 0. Thus, if one uses Bridge estimation as a method for model selection (that is, to estimate the number of nonzero parameters) then asymptotically the probability of selecting the true model is P(Û = 0).

When p = 2 (with β₁ ≠ 0 and β₂ = 0), it is relatively straightforward to compute the limiting distribution. In this case, define u = u₁ = −u₂ and W = W₁ = −W₂ (since u₁ + u₂ = 0 and W₁ + W₂ = 0), where W ∼ N(0, ψ/2). Then

\[ V(u) = -4uW + 2\psi u^2 + \lambda_0 |u|^\gamma. \]

It is possible to show (see Lemma A in the Appendix) that V is minimized at 0 if

\[ (19)\qquad \frac{\lambda_0}{2\psi} \ge \frac{2}{2-\gamma} \Big( \frac{2(1-\gamma)}{2-\gamma} \Big)^{1-\gamma} \Big| \frac{W}{\psi} \Big|^{2-\gamma}; \]

otherwise, V is minimized at û satisfying

\[ (20)\qquad \hat{u} = \frac{W}{\psi} - \frac{\lambda_0 \gamma}{4\psi} \operatorname{sgn}(\hat{u}) |\hat{u}|^{\gamma-1}. \]


FIG. 4. Densities for λ₀ = 0 (solid line), λ₀ = 0.5 (dotted line) and λ₀ = 1 (dashed line). For λ₀ = 0.5 and λ₀ = 1, these are the densities of the absolutely continuous part of the distribution, as the distribution in these cases has positive probability mass at 0.

The density of the absolutely continuous part of Û follows from (20) by a change of variables, restricted to those values of u for which the solution of (20) is the global minimizer of V.

Setting ψ = 1 and γ = 1/2, the densities of the absolutely continuous part of Û for λ₀ = 0, 0.5 and 1 are given in Figure 4; for these parameter values, P(Û = 0) = 0, 0.448 and 0.655, respectively. Note that when λ₀ > 0, these densities have a "gap" [that is, f(u) = 0] for values of u violating either or both of (19) and (20).

6. Other issues.

Singular designs. In developing our asymptotic results, we have assumed almost exclusively in the previous sections that the matrix C_n [defined in (3)] is nonsingular for each n and hence that the parametrization in (1) is unique. In most situations, this is a reasonable assumption, as a singular design can be made nonsingular by judiciously removing covariates or reparametrizing


the model. However, in some problems, singular designs are unavoidable. For example, in epidemiologic age-period-cohort studies of disease rates, singular designs result from linear relationships among the variables [Kupper, Janis, Karmous and Greenberg (1985)]. Also, in chemometric studies, singular designs result from the number of parameters exceeding the number of observations [Frank and Friedman (1993)].

When λ_n > 0 and γ > 1, the objective function (2) is strictly convex and hence has a unique minimizer β̂_n; this may be true even for γ ≤ 1. In this section, we will consider n fixed and examine the behavior of the estimator as λ = λ_n → 0.

Define β̂_λ to minimize the objective function

\[ (21)\qquad \sum_{i=1}^n (Y_i - \mathbf{x}_i^T \phi)^2 + \lambda \sum_{j=1}^p |\phi_j|^\gamma \]

and note that if β̂_λ minimizes (21), it also minimizes

\[ (22)\qquad h_\lambda(\phi) = \frac{1}{\lambda} \Big[ \sum_{i=1}^n (Y_i - \mathbf{x}_i^T \phi)^2 - \sum_{i=1}^n \big( Y_i - \mathbf{x}_i^T \hat{\beta}^{(0)} \big)^2 \Big] + \sum_{j=1}^p |\phi_j|^\gamma, \]

where β̂^{(0)} is a LS estimator of β; that is, β̂^{(0)} satisfies

\[ \sum_{i=1}^n \mathbf{x}_i \big( Y_i - \mathbf{x}_i^T \hat{\beta}^{(0)} \big) = 0. \]

It is easy to see that, as λ → 0, h_λ in (22) epiconverges to the function

\[ (23)\qquad h_0(\phi) = \begin{cases} \sum_{j=1}^p |\phi_j|^\gamma, & \text{if } \sum_{i=1}^n \mathbf{x}_i (Y_i - \mathbf{x}_i^T \phi) = 0, \\ \infty, & \text{otherwise,} \end{cases} \]

and hence, if argmin(h₀) is unique (as would be the case if γ > 1),

\[ (24)\qquad \hat{\beta}_\lambda \to \hat{\beta}_0 = \arg\min(h_0) \]

as λ → 0. When γ = 2, the estimator β̂₀ defined in (24) is simply a projection of the possible estimators onto the space spanned by the eigenvectors of C_n with positive eigenvalues; this estimator is called the intrinsic estimator by Fu (1999). If we view β̂₀ as a regularized LS estimator then β̂_λ can be used to approximate β̂₀ by taking λ close to 0. Effectively, we are using an unconstrained optimization problem [minimizing h_λ in (22) for small λ] to approximate a constrained optimization problem [minimizing h₀ in (23)]; this is a standard trick in optimization [Fiacco and McCormick (1990)].
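For γ = 2 this limit is easy to check numerically: as λ → 0 the ridge estimator converges to the minimum-norm LS solution (the pseudoinverse solution), which is exactly the γ = 2 minimizer of h₀ in (23). A quick sketch with a deliberately singular design (all values are our choices):

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
X = np.column_stack([X, X.sum(axis=1)])  # third column = sum of first two
y = X[:, 0] + rng.standard_normal(50)

lam = 1e-8
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
beta_min_norm = np.linalg.pinv(X) @ y   # gamma = 2 minimizer of h0 in (23)
print(np.allclose(beta_ridge, beta_min_norm, atol=1e-5))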


Computation. To this point, we have not explicitly mentioned computation of the estimators. For γ > 1, the objective function is a smooth convex function and numerical algorithms such as Newton-Raphson or reweighted least squares work very well. For γ = 1, the objective function is also convex and so methods such as those discussed in Tibshirani (1996), Fu (1998) and Osborne, Presnell and Turlach (1998) can be used. In the context of wavelet regression, algorithms have been proposed by Chen, Donoho and Saunders (1999) and Sardy, Bruce and Tseng (1998).

When γ < 1, the objective function (2) is no longer convex and so computation of β̂_n is nontrivial, particularly if p is large; the objective function can have multiple local minima at which it is nondifferentiable. Here we will briefly describe some simple algorithms for computing Bridge estimates when γ < 1; a more detailed treatment will be given elsewhere.

Although the objective function is generally nontrivial to minimize, it is interesting to note that the one-variable problem is quite easy to solve. For example, given a and λ > 0, define

\[ g(u) = u^2 - 2au + \lambda |u|^\gamma. \]

It is simple to verify (see Lemma A in the Appendix) that g is minimized at u = 0 if and only if

\[ \lambda \ge \frac{2}{2-\gamma} \Big( \frac{2(1-\gamma)}{2-\gamma} \Big)^{1-\gamma} |a|^{2-\gamma}. \]

Otherwise, g is minimized at u = û satisfying g′(û) = 0 and g″(û) > 0. This latter equation can be solved in a variety of ways, including the fixed-point iteration

\[ u^{(k)} = a - \frac{\lambda \gamma}{2} \operatorname{sgn}\big( u^{(k-1)} \big) \big| u^{(k-1)} \big|^{\gamma-1}, \qquad k = 1, 2, 3, \dots. \]
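A sketch of the one-variable solver (the naming is ours; started at u = a, the fixed-point iteration decreases monotonically to the nonzero minimizer whenever λ is below the Lemma A threshold):

import numpy as np

def bridge_univariate(a, lam, gamma, tol=1e-12, max_iter=200):
    # Minimize g(u) = u^2 - 2*a*u + lam*|u|^gamma for 0 < gamma <= 1.
    if a == 0:
        return 0.0
    threshold = (2 / (2 - gamma)) * (2 * (1 - gamma) / (2 - gamma)) ** (1 - gamma) \
                * abs(a) ** (2 - gamma)
    if lam >= threshold:      # Lemma A: 0 is a global minimizer
        return 0.0
    u = a
    for _ in range(max_iter):
        u_new = a - 0.5 * lam * gamma * np.sign(u) * np.abs(u) ** (gamma - 1)
        if abs(u_new - u) < tol:
            break
        u = u_new
    return u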

The feasibility of the one-variable problem suggests that a Gauss-Seidel or ICM [Besag (1986)] algorithm (which iteratively minimizes one variable at a time) might be appropriate for computing β̂_n; a sketch is given below. This is true to some extent (as the objective function decreases at each iteration) but with some caveats. Due to the nature of the objective function, it is very easy for a naive Gauss-Seidel algorithm to get "trapped" in a local minimum. However, this can be avoided to some extent by keeping estimates away from 0 until it is absolutely necessary to set them to 0. Alternatively, we can try multiple starting points in different parts of the parameter space.
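A sketch of this coordinate-wise scheme, reusing bridge_univariate from the sketch above (the LS start is one way of keeping estimates away from 0 initially; all names are ours):

import numpy as np

def bridge_coordinate_descent(X, y, lam, gamma, n_iter=100):
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # LS starting point
    col_ss = (X ** 2).sum(axis=0)                # sum_i x_ij^2
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            a = X[:, j] @ r / col_ss[j]
            # coordinate problem: col_ss[j]*(u^2 - 2*a*u) + lam*|u|^gamma
            beta[j] = bridge_univariate(a, lam / col_ss[j], gamma)
    return beta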

A second approach is to solve a sequence of ridge regression problems. For example, we may start with the ridge regression (γ = 2) estimate

\[ \hat{\beta}^{(0)} = \Big( \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T + \lambda_n I \Big)^{-1} \sum_{i=1}^n \mathbf{x}_i Y_i. \]


We can define successive estimates by

\[ \hat{\beta}^{(k)} = D\big( \hat{\beta}^{(k-1)} \big) \Big[ D\big( \hat{\beta}^{(k-1)} \big) \Big( \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T \Big) D\big( \hat{\beta}^{(k-1)} \big) + \lambda_n I \Big]^{-1} D\big( \hat{\beta}^{(k-1)} \big) \sum_{i=1}^n \mathbf{x}_i Y_i \]

for k = 1, 2, 3, …, where D(φ) is a diagonal matrix with diagonal elements |φ₁|^{1−γ/2}, …, |φ_p|^{1−γ/2}; each step solves the ridge-type problem obtained by replacing the penalty λ_n Σ|φ_j|^γ with the weighted quadratic penalty λ_n Σ φ_j² |β̂_j^{(k−1)}|^{γ−2}, which agrees with it at φ = β̂^{(k−1)}. Again, the sequence {β̂^{(k)}} does not necessarily converge to the global minimum but seems to work quite well if multiple starting points are used.
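A compact numpy sketch of this reweighting, written with the algebraically equivalent weights |β_j|^{γ−2} (the eps guard, which keeps the weights finite once a coordinate collapses to 0, and all names are ours):

import numpy as np

def bridge_ridge_sequence(X, y, lam, gamma, n_iter=50, eps=1e-10):
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # ridge start
    for _ in range(n_iter):
        w = np.maximum(np.abs(beta), eps) ** (gamma - 2)  # |beta_j|^(gamma-2)
        beta = np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ y)
    return beta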

APPENDIX

Let g(u) = u² − 2au + λ|u|^γ, where λ ≥ 0 and 0 < γ ≤ 1. If a = 0 then argmin(g) = 0; thus we shall focus on the case where a ≠ 0.

LEMMA A. Suppose that a ≠ 0. Then 0 ∈ argmin(g) if, and only if,

\[ \lambda \ge \frac{2}{2-\gamma} \Big( \frac{2(1-\gamma)}{2-\gamma} \Big)^{1-\gamma} |a|^{2-\gamma}. \]

Moreover, if γ < 1 then argmin(g) = 0 if and only if we have strict inequality above.

PROOF. Define

\[ h(t) = g(at) = a^2 \big( t^2 - 2t + \lambda |t|^\gamma |a|^{\gamma-2} \big). \]

For t < 0, h′(t) < 0, and thus h is strictly decreasing on the interval (−∞, 0]. For t > 1, h′(t) > 0, and so h is strictly increasing on the interval (1, ∞). Thus argmin(g) = ta for some t ∈ [0, 1].

If 0 ∈ argmin(g) then we must have

\[ h(t) \ge h(0) = 0 \]

for all 0 ≤ t ≤ 1. In other words,

\[ \lambda \ge |a|^{2-\gamma} \max_{0 \le t \le 1} t^{1-\gamma} (2 - t). \]

Using calculus, it is easy to verify that the right-hand side above is maximized at t = 2(1 − γ)/(2 − γ), and so 0 ∈ argmin(g) if and only if

\[ \lambda \ge \frac{2}{2-\gamma} \Big( \frac{2(1-\gamma)}{2-\gamma} \Big)^{1-\gamma} |a|^{2-\gamma}. \]

Moreover, if γ < 1 and a ≠ 0, then strict inequality implies that argmin(g) = 0. If equality holds then argmin(g) = {0, 2a(1 − γ)/(2 − γ)}; note that this set contains a single point when γ = 1. □
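A numerical spot-check of Lemma A on illustrative values (a = 1.3 and γ = 0.5 are arbitrary choices of ours):

import numpy as np

a, gamma = 1.3, 0.5
threshold = (2 / (2 - gamma)) * (2 * (1 - gamma) / (2 - gamma)) ** (1 - gamma) \
            * abs(a) ** (2 - gamma)
u = np.linspace(-2 * abs(a), 2 * abs(a), 200001)   # symmetric grid containing 0
for lam in (0.9 * threshold, 1.1 * threshold):
    g = u ** 2 - 2 * a * u + lam * np.abs(u) ** gamma
    print("lam/threshold = %.1f: minimizer approx. %.4f" % (lam / threshold, u[np.argmin(g)]))
# Expect a nonzero minimizer below the threshold and 0 at or above it.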


Acknowledgments. The authors thank the referees and the Associate Editor for their valuable comments and suggestions. The support and encouragement of Rob Tibshirani is also gratefully acknowledged.

REFERENCES

ANDERSEN, P. K. and GILL, R. D. (1982). Cox's regression model for counting processes: a large sample study. Ann. Statist. 10 1100-1120.

BESAG, J. (1986). On the statistical analysis of dirty pictures (with discussion). J. Roy. Statist. Soc. Ser. B 48 259-302.

CHEN, S. S., DONOHO, D. L. and SAUNDERS, M. A. (1999). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33-61.

FAN, J. and LI, R. (1999). Variable selection via penalized likelihood. Unpublished manuscript.

FIACCO, A. V. and MCCORMICK, G. P. (1990). Nonlinear Programming: Sequential Unconstrained Minimization Techniques. SIAM, Philadelphia.

FRANK, I. E. and FRIEDMAN, J. H. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35 109-148.

FU, W. J. (1998). Penalized regressions: the Bridge versus the Lasso. J. Comput. Graph. Statist. 7 397-416.

FU, W. J. (1999). Estimating the effective trends in age-period-cohort studies. Unpublished manuscript.

GEYER, C. J. (1994). On the asymptotics of constrained M-estimation. Ann. Statist. 22 1993-2010.

GEYER, C. J. (1996). On the asymptotics of convex stochastic optimization. Unpublished manuscript.

KIM, J. and POLLARD, D. (1990). Cube root asymptotics. Ann. Statist. 18 191-219.

KUPPER, L. L., JANIS, J. M., KARMOUS, A. and GREENBERG, B. G. (1985). Statistical age-period-cohort analysis: a review and critique. J. Chronic Disease 38 811-830.

LINHART, H. and ZUCCHINI, W. (1986). Model Selection. Wiley, New York.

OSBORNE, M. R., PRESNELL, B. and TURLACH, B. A. (1998). On the Lasso and its dual. Research Report 9811, Dept. Statistics, Univ. Adelaide.

PFLUG, G. CH. (1995). Asymptotic stochastic programs. Math. Oper. Res. 20 769-789.

POLLARD, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory 7 186-199.

SARDY, S., BRUCE, A. G. and TSENG, P. (1998). Block coordinate relaxation methods for nonparametric signal denoising with wavelet dictionaries. Technical report, Dept. Mathematics, EPFL, Lausanne.

SRIVASTAVA, M. S. (1971). On fixed width confidence bounds for regression parameters. Ann. Math. Statist. 42 1403-1411.

TIBSHIRANI, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267-288.

VAN DER VAART, A. W. and WELLNER, J. A. (1996). Weak Convergence and Empirical Processes with Applications to Statistics. Springer, New York.