Expected length of post-model-selection confidence intervals conditional on polyhedral constraints

Danijel Kivaranovic
Department of Statistics and Operations Research, University of Vienna
and
Hannes Leeb
Department of Statistics and Operations Research, University of Vienna

January 25, 2018
Abstract
Valid inference after model selection is currently a very active area of research. The polyhedral method, pioneered by Lee et al. (2016), allows for valid inference after model selection if the model selection event can be described by polyhedral constraints. In that reference, the method is exemplified by constructing two valid confidence intervals when the Lasso estimator is used to select a model. We here study the expected length of these intervals. For the first of these confidence intervals, which is easier to compute, we find that its expected length is always infinite. For the second, whose computation is more demanding, we give a necessary and sufficient condition for its expected length to be infinite. In simulations, we find that this condition is typically satisfied.
Keywords: Lasso, inference after model selection, hypothesis test.
1 Introduction
Lee et al. (2016) recently introduced a new technique for valid inference after model selection, the so-called polyhedral method. Using this method, and using the Lasso for model selection in linear regression, Lee et al. (2016) derived two new confidence sets that are valid conditional on the outcome of the model selection step. More precisely, let $\hat{m}$ denote the model containing those regressors that correspond to non-zero coefficients of the Lasso estimator, and let $\hat{s}$ denote the sign-vector of those non-zero Lasso coefficients. Then Lee et al. (2016) constructed confidence intervals $[L_{m,s}, U_{m,s}]$ and $[L_m, U_m]$ whose coverage probability is $1-\alpha$, conditional on the events $\{\hat{m} = m, \hat{s} = s\}$ and $\{\hat{m} = m\}$, respectively (provided that the probability of the conditioning event is positive). The computational effort in constructing these intervals is considerably lighter for $[L_{m,s}, U_{m,s}]$. In simulations, Lee et al. (2016) noted that this latter interval can be quite long in some cases; cf. Figure 10 in that reference. We here analyze the (conditional) expected length of these intervals.
1.1 Overview of findings
Throughout, we use the same setting and assumptions as Lee et al. (2016). In particular, we assume that the response vector is distributed as $N(\mu, \sigma^2 I_n)$ with unknown mean $\mu \in \mathbb{R}^n$ and known variance $\sigma^2 > 0$ (our results carry over to the unknown-variance case; see the end of Section 3), and that the non-stochastic regressor matrix has columns in general position. Write $P_{\mu,\sigma^2}$ and $E_{\mu,\sigma^2}$ for the probability measure and the expectation operator, respectively, corresponding to $N(\mu, \sigma^2 I_n)$.

For the interval $[L_{m,s}, U_{m,s}]$, we find the following: Fix a non-empty model $m$, a sign-vector $s$, as well as $\mu \in \mathbb{R}^n$ and $\sigma^2 > 0$. If $P_{\mu,\sigma^2}(\hat{m} = m, \hat{s} = s) > 0$, then
$$E_{\mu,\sigma^2}\big[\, U_{m,s} - L_{m,s} \,\big|\, \hat{m} = m, \hat{s} = s \,\big] \;=\; \infty. \tag{1}$$
Obviously, this statement continues to hold if the event $\hat{m} = m, \hat{s} = s$ is replaced by the larger event $\hat{m} = m$ throughout (because there are only finitely many possible values for the sign-vector $s$). And this statement continues to hold if the condition $P_{\mu,\sigma^2}(\hat{m} = m, \hat{s} = s) > 0$ is dropped and the conditional expectation in (1) is replaced by the unconditional one.
For the interval $[L_m, U_m]$, we derive a necessary and sufficient condition for its expected length to be infinite, conditional on the event $\hat{m} = m$. That condition depends on the regressor matrix, on the model $m$, and also on the linear contrast that defines the quantity of interest, and is daunting to verify in all but the most basic examples. We also provide a sufficient condition for infinite expected length that is easy to verify. In simulations, we find that this sufficient condition for infinite expected length is typically satisfied when the model $m$ excludes a significant portion of all the available regressors (e.g., if the selected model is 'sparse'). And even if the model $m$ is not sparse, we find that this condition is still satisfied for a sizable fraction of the linear contrasts that define the quantity of interest. See Table 1 and the attending discussion for more detail.
The methods developed in this paper can also be used if the Lasso, as the model selector, is replaced by any other procedure that allows for application of the polyhedral method. In particular, we see that confidence intervals based on the polyhedral method in Gaussian regression can have infinite expected length. Our findings suggest that the expected length of confidence intervals based on the polyhedral method should be closely scrutinized, in Gaussian regression but also in non-Gaussian settings and in other variations of the polyhedral method.
The rest of the paper is organized as follows: We conclude this section by discussing a number of related results that put our findings in context. Section 2 describes the confidence intervals of Lee et al. (2016) in detail and introduces some notation. Section 3 contains two core results, Propositions 1 and 2, which entail our main findings, the simulation study mentioned earlier, as well as a discussion of the unknown-variance case. The appendix contains the proofs of our core results and some auxiliary lemmas.
1.2 Context and related results
There are currently several exciting ongoing developments based on the polyhedral method, not least because it proved to be applicable to more complicated settings, and there are several generalizations of this framework. See, among others, Tibshirani et al. (2016), Taylor & Tibshirani (2017), Tian & Taylor (2015). Certain optimality results of the method of Lee et al. (2016) are given in Fithian et al. (2017). Using a different approach, Berk et al. (2013) proposed the so-called PoSI-intervals, which are unconditionally valid. A benefit of the PoSI-intervals is that they are valid after selection with any possible model selector, instead of a particular one like the Lasso; however, as a consequence, the PoSI-intervals are typically very conservative (that is, the actual coverage probability is above the nominal level). Nonetheless, Bachoc et al. (2016) showed in a Monte Carlo simulation that, in certain scenarios, the PoSI-intervals can be shorter than the intervals of Lee et al. (2016). The results of the present paper are based on the first author's master's thesis.
It is important to note that all confidence sets discussed so far are non-standard, in the sense that the parameter to be covered is not the true parameter in an underlying correct model (or components thereof), but instead is a model-dependent quantity of interest. (See Section 2 for details and the references in the preceding paragraph for more extensive discussions.) An advantage of this non-standard approach is that it does not rely on the assumption that any of the candidate models is correct. Valid inference for an underlying true parameter is a more challenging task, as demonstrated by the impossibility results in Leeb & Pötscher (2006a,b, 2008). There are several proposals of valid confidence intervals
after model selection (in the sense that the actual coverage probability of the true parameter is at or above the nominal level), but these intervals are rather large compared to the standard confidence intervals from the full model (supposing that one can fit the full model); see Pötscher (2009), Pötscher & Schneider (2010), Schneider (2016). In fact, Leeb & Kabaila (2017) showed that the usual confidence interval obtained by fitting the full model is admissible also in the unknown-variance case; therefore, one cannot obtain uniformly smaller valid confidence sets for a component of the true parameter by any other method.
2 Assumptions and confidence intervals
Let $Y$ denote the $N(\mu, \sigma^2 I_n)$-distributed response vector, $n \geq 1$, where $\mu \in \mathbb{R}^n$ is unknown and $\sigma^2 > 0$ is known. Let $X = (x_1, \dots, x_p)$, $p \geq 1$, with $x_i \in \mathbb{R}^n$ for each $i = 1, \dots, p$, be the non-stochastic $n \times p$ regressor matrix. We assume that the columns of $X$ are in general position (this mild assumption is further discussed in the following paragraph). The full model $\{1, \dots, p\}$ is denoted by $m_F$. All subsets of the full model are collected in $\mathcal{M}$, that is, $\mathcal{M} = \{m : m \subseteq m_F\}$. The cardinality of a model $m$ is denoted by $|m|$. For any $m = \{i_1, \dots, i_k\} \in \mathcal{M} \setminus \{\emptyset\}$ with $i_1 < \dots < i_k$, we set $X_m = (x_{i_1}, \dots, x_{i_k})$. Analogously, for any vector $v \in \mathbb{R}^p$, we set $v_m = (v_{i_1}, \dots, v_{i_k})'$. If $m$ is the empty model, then $X_m$ is to be interpreted as the zero vector in $\mathbb{R}^n$ and $v_m$ as $0$.
The Lasso estimator, denoted by $\hat\beta(y)$, is a minimizer of the least squares problem with an additional penalty on the absolute size of the regression coefficients (Frank & Friedman 1993, Tibshirani 1996):
$$\min_{\beta \in \mathbb{R}^p} \ \frac{1}{2}\,\|y - X\beta\|_2^2 + \lambda\,\|\beta\|_1, \qquad y \in \mathbb{R}^n, \ \lambda > 0.$$
The Lasso has the property that some coefficients of $\hat\beta(y)$ are zero with positive probability.
A minimizer of the Lasso objective function always exists, but it is not necessarily unique.
Uniqueness of $\hat\beta(y)$ is guaranteed here by our assumption that the columns of $X$ are in general position (Tibshirani 2013). This assumption is relatively mild; e.g., if the entries of the matrix $X$ are drawn from a (joint) distribution that has a Lebesgue density, then the columns of $X$ are in general position with probability 1 (Tibshirani 2013). The model $\hat{m}(y)$ selected by the Lasso and the sign-vector $\hat{s}(y)$ of non-zero Lasso coefficients can now formally be defined through
$$\hat{m}(y) = \big\{j : \hat\beta_j(y) \neq 0\big\} \quad\text{and}\quad \hat{s}(y) = \mathrm{sign}\big(\hat\beta_{\hat{m}(y)}(y)\big).$$
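As an aside, the selected model and sign-vector are easy to extract from any Lasso solver. The following is a minimal sketch of ours (not part of the paper), assuming scikit-learn, whose penalty parameterization differs from the display above by a factor of $1/n$:

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_model_and_signs(X, y, lam):
    """Return the Lasso minimizer, the selected model m(y), and the sign
    vector s(y). scikit-learn minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1,
    so alpha = lam/n matches the objective displayed above."""
    n = X.shape[0]
    beta = Lasso(alpha=lam / n, fit_intercept=False,
                 max_iter=100000, tol=1e-10).fit(X, y).coef_
    m = np.flatnonzero(beta)     # indices of the non-zero coefficients
    s = np.sign(beta[m])         # their signs
    return beta, m, s

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 20)), rng.standard_normal(100)
beta, m, s = lasso_model_and_signs(X, y, lam=1.0)
```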
Recall that $\mathcal{M}$ denotes the set of all possible submodels and set $\mathcal{S}_m = \{-1, 1\}^{|m|}$ for each $m \in \mathcal{M}$. For later use we also denote by $\mathcal{M}^+$ and $\mathcal{S}_m^+$ the collection of non-empty models, and the collection of corresponding sign-vectors, that occur with positive probability, i.e.,
$$\mathcal{M}^+ = \big\{m \in \mathcal{M} \setminus \{\emptyset\} : P_{\mu,\sigma^2}(\hat{m}(Y) = m) > 0\big\},$$
$$\mathcal{S}_m^+ = \big\{s \in \mathcal{S}_m : P_{\mu,\sigma^2}(\hat{m}(Y) = m, \hat{s}(Y) = s) > 0\big\} \qquad (m \in \mathcal{M} \setminus \{\emptyset\}).$$
These sets do not depend on $\mu$ and $\sigma^2$, as the measure $P_{\mu,\sigma^2}$ is equivalent to Lebesgue measure with respect to null sets. Also, our assumption that the columns of $X$ are in general position guarantees that $\mathcal{M}^+$ only contains models $m$ for which $X_m$ has column rank $|m|$ (Tibshirani 2013).
Inference is focused on a non-standard, model-dependent quantity of interest. Throughout the following, fix $m \in \mathcal{M}^+$ and let
$$\beta^m = E_{\mu,\sigma^2}\big[(X_m' X_m)^{-1} X_m' Y\big] = (X_m' X_m)^{-1} X_m' \mu.$$
For $\gamma^m \in \mathbb{R}^{|m|} \setminus \{0\}$, the goal is to construct a confidence interval for $\gamma^{m\prime} \beta^m$ with conditional coverage probability $1 - \alpha$ on the event $\{\hat{m} = m\}$. Clearly, the quantity of interest can also be written as $\gamma^{m\prime} \beta^m = \eta^{m\prime} \mu$ for $\eta^m = X_m (X_m' X_m)^{-1} \gamma^m$. For later use, write $P_{\eta^m}$ for the orthogonal projection on the space spanned by $\eta^m$.
At the core of the polyhedral method lies the observation that the event where $\hat{m} = m$ and where $\hat{s} = s$ describes a convex polytope in the sample space $\mathbb{R}^n$ (up to a Lebesgue null set): For each $m \in \mathcal{M}^+$ and each $s \in \mathcal{S}_m$, we have
$$\{y : \hat{m}(y) = m, \hat{s}(y) = s\} \;\overset{\text{a.s.}}{=}\; \{y : A_{m,s}\, y < b_{m,s}\}, \tag{2}$$
cf. Theorem 3.3 in Lee et al. (2016) (explicit formulas for the matrix $A_{m,s}$ and the vector $b_{m,s}$ are also repeated in Appendix C in our notation). Fix $z \in \mathbb{R}^n$ orthogonal to $\eta^m$. Then the set of $y$ satisfying $(I_n - P_{\eta^m}) y = z$ and $A_{m,s}\, y < b_{m,s}$ is either empty or a line segment. In either case, that set can be written as $\{z + c^m w : \mathcal{V}^-_{m,s}(z) < w < \mathcal{V}^+_{m,s}(z)\}$ with $c^m = \eta^m / \|\eta^m\|^2$. The endpoints satisfy $-\infty \leq \mathcal{V}^-_{m,s}(z) \leq \mathcal{V}^+_{m,s}(z) \leq \infty$ (see Lemma 4.1 of Lee et al. 2016; formulas for these quantities are also given in Appendix C in our notation). Now decompose $Y$ into the sum of two independent Gaussians $P_{\eta^m} Y$ and $(I_n - P_{\eta^m}) Y$, where the first one is a linear function of $\eta^{m\prime} Y \sim N(\eta^{m\prime}\mu, \sigma^2 \eta^{m\prime}\eta^m)$. With this, the conditional distribution of $\eta^{m\prime} Y$, conditional on the event that $\hat{m}(Y) = m$, $\hat{s}(Y) = s$ and $(I_n - P_{\eta^m}) Y = z$, is the conditional $N(\eta^{m\prime}\mu, \sigma^2 \eta^{m\prime}\eta^m)$-distribution, conditional on the set $(\mathcal{V}^-_{m,s}(z), \mathcal{V}^+_{m,s}(z))$ (in the sense that the latter conditional distribution is a regular conditional distribution if one starts with the conditional distribution of $\eta^{m\prime} Y$ given $\hat{m} = m$ and $\hat{s} = s$, which is always well-defined, and if one then conditions on the random variable $(I_n - P_{\eta^m}) Y$).
To use these observations for the construction of confidence sets, consider first the conditional distribution of a random variable $W \sim N(\theta, \varsigma^2)$ conditional on the event $W \in T$, where $\theta \in \mathbb{R}$, where $\varsigma^2 > 0$, and where $T \subseteq \mathbb{R}$ is the union of finitely many non-empty open intervals. Write $F^T_{\theta,\varsigma^2}(\cdot)$ for the cumulative distribution function (c.d.f.) of $W$ given $W \in T$. The corresponding law can be viewed as a 'truncated normal' distribution and will be denoted by $TN(\theta, \varsigma^2, T)$ in the following. A confidence interval for $\theta$ with coverage probability $1 - \alpha$ conditional on the event $W \in T$ is obtained by the usual method of collecting all values $\theta_0$ for which a hypothesis test of $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$ does not reject. In particular, for $w \in \mathbb{R}$, define $L(w)$ and $U(w)$ through
$$F^T_{L(w),\varsigma^2}(w) = 1 - \frac{\alpha}{2} \quad\text{and}\quad F^T_{U(w),\varsigma^2}(w) = \frac{\alpha}{2}$$
(these quantities are well-defined in view of Lemma A.2). With this, we have $P\big(\theta \in [L(W), U(W)] \,\big|\, W \in T\big) = 1 - \alpha$, irrespective of $\theta \in \mathbb{R}$.
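In code, this construction amounts to one-dimensional root finding in $\theta$. The following is a minimal sketch of ours (not the authors' implementation), assuming SciPy; the bracket of $\pm 20\varsigma$ is ad hoc and fails near the boundary of a bounded truncation set, where $L(w)$ and $U(w)$ diverge (cf. Figure 1 below):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def trunc_cdf(w, theta, sig, T):
    """c.d.f. F^T_{theta,sig^2}(w): N(theta, sig^2) conditioned on the union
    T = [(a_1,b_1), ..., (a_K,b_K)] of open intervals, evaluated at w."""
    num = den = 0.0
    for (a, b) in T:
        den += norm.cdf(b, theta, sig) - norm.cdf(a, theta, sig)
        num += norm.cdf(min(w, b), theta, sig) - norm.cdf(min(w, a), theta, sig)
    return num / den

def conf_interval(w, sig, T, alpha=0.05):
    """[L(w), U(w)] by inverting the truncated-normal c.d.f. in theta;
    the c.d.f. is strictly decreasing in theta (Lemma A.2), so a simple
    bracketing search suffices. The bracket width is a crude assumption."""
    lo, hi = w - 20 * sig, w + 20 * sig
    U = brentq(lambda t: trunc_cdf(w, t, sig, T) - alpha / 2, lo, hi)
    L = brentq(lambda t: trunc_cdf(w, t, sig, T) - (1 - alpha / 2), lo, hi)
    return L, U

# Example with the unbounded truncation set from the right panel of Figure 1:
T = [(-np.inf, -2.0), (-1.0, 1.0), (2.0, np.inf)]
print(conf_interval(w=0.5, sig=1.0, T=T))
```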
Fix $m \in \mathcal{M}^+$ and $s \in \mathcal{S}_m^+$, and let $\sigma^2_m = \sigma^2 \eta^{m\prime} \eta^m$ and $T_{m,s}(z) = (\mathcal{V}^-_{m,s}(z), \mathcal{V}^+_{m,s}(z))$ for $z$ orthogonal to $\eta^m$. With this, we have
$$\eta^{m\prime} Y \,\big|\, \{\hat{m} = m, \hat{s} = s, (I_n - P_{\eta^m}) Y = z\} \;\sim\; TN(\eta^{m\prime}\mu, \sigma^2_m, T_{m,s}(z))$$
for each $z \in \{(I_n - P_{\eta^m}) y : A_{m,s}\, y < b_{m,s}\}$. Now define $L_{m,s}(y)$ and $U_{m,s}(y)$ through
$$F^{T_{m,s}((I_n - P_{\eta^m})y)}_{L_{m,s}(y),\, \sigma^2_m}(\eta^{m\prime} y) = 1 - \frac{\alpha}{2} \quad\text{and}\quad F^{T_{m,s}((I_n - P_{\eta^m})y)}_{U_{m,s}(y),\, \sigma^2_m}(\eta^{m\prime} y) = \frac{\alpha}{2}$$
for each $y$ so that $A_{m,s}\, y < b_{m,s}$. Then, by construction, the random interval $[L_{m,s}(Y), U_{m,s}(Y)]$ covers $\eta^{m\prime}\mu$ with probability $1 - \alpha$ conditional on the event that $\hat{m} = m$ and $\hat{s} = s$.
In a similar fashion, fix $m \in \mathcal{M}^+$, set $T_m(z) = \bigcup_{s \in \mathcal{S}_m^+} T_{m,s}(z)$ for $z$ orthogonal to $\eta^m$, and define $L_m(y)$ and $U_m(y)$ through
$$F^{T_m((I_n - P_{\eta^m})y)}_{L_m(y),\, \sigma^2_m}(\eta^{m\prime} y) = 1 - \frac{\alpha}{2} \quad\text{and}\quad F^{T_m((I_n - P_{\eta^m})y)}_{U_m(y),\, \sigma^2_m}(\eta^{m\prime} y) = \frac{\alpha}{2}.$$
Again by construction, the random interval $[L_m(Y), U_m(Y)]$ covers $\eta^{m\prime}\mu$ with probability $1 - \alpha$ conditional on the event that $\hat{m} = m$.
Remark. (i) If $\hat{m} = \hat{m}(y)$ is any other model selection procedure, so that the event $\{y : \hat{m} = m\}$ is the union of a finite number of polyhedra (up to null sets), then the polyhedral method can be applied to obtain a confidence set for $\eta^{m\prime}\mu$ with conditional coverage probability $1 - \alpha$, conditional on the event $\{\hat{m} = m\}$, if that event has positive probability. Indeed, for such a model selection procedure, the arguments following (2) also apply, mutatis mutandis.
(ii) So far, we have defined confidence intervals only on the events $W \in T$; $\hat{m} \in \mathcal{M}^+$ and $\hat{s} \in \mathcal{S}_{\hat{m}}^+$; and $\hat{m} \in \mathcal{M}^+$, respectively. In the remaining cases, the interval endpoints (and the corresponding quantity of interest) can be chosen arbitrarily (measurable) without affecting our results. It is easy to choose constructions so that one obtains meaningful confidence intervals that are defined everywhere in sample space.
(iii) In Theorem 3.3 of Lee et al. (2016), relation (2) is stated as an equality, not as an equality up to null sets, and with the right-hand side replaced by $\{y : A_{m,s}\, y \leq b_{m,s}\}$ (in our notation). Because (2) differs from this only on a Lebesgue null set, the difference is inconsequential for the purpose of the present paper. The statement in Lee et al. (2016) is based on the fact that $\hat{m}$ was defined as the equicorrelation set (Tibshirani 2013) in that paper. But if $\hat{m}$ is the equicorrelation set, then there can exist vectors $y \in \{\hat{m} = m\}$ such that some coefficients of $\hat\beta(y)$ are zero, which clashes with the idea that $\hat{m}$ contains those variables whose Lasso coefficients are non-zero. However, for any $m \in \mathcal{M}^+$, the set of such $y$'s is a Lebesgue null set.
3 Core results
We first analyze the simple confidence set $[L(W), U(W)]$ that was introduced in the preceding section, which covers $\theta$ with probability $1 - \alpha$, conditional on $W \in T$, where $W \sim N(\theta, \varsigma^2)$. By assumption, $T$ is of the form $T = \bigcup_{i=1}^K (a_i, b_i)$ where $K < \infty$ and $-\infty \leq a_1 < b_1 < \dots < a_K < b_K \leq \infty$. Figure 1 exemplifies the length of $[L(w), U(w)]$ when $T$ is bounded (left panel) and when $T$ is unbounded (right panel). The dashed line is the length of the standard (unconditional) confidence interval for $\theta$. In the left panel, we see that the length of $[L(w), U(w)]$ diverges as $w$ approaches the far left or the far right boundary point of the truncation set (i.e., $-3$ and $3$). On the other hand, in the right panel we see that the length of $[L(w), U(w)]$ is bounded and converges to the length of the standard interval as $|w| \to \infty$.
[Figure 1 appears here; the two panels plot $U(w) - L(w)$ against $w$.]

Figure 1: Length of the interval $[L(w), U(w)]$ for the case where $T = (-3,-2) \cup (-1,1) \cup (2,3)$ (left panel) and the case where $T = (-\infty,-2) \cup (-1,1) \cup (2,\infty)$ (right panel). In both cases, we took $\varsigma^2 = 1$ and $\alpha = 0.05$.
Write $\Phi(w)$ and $\phi(w)$ for the c.d.f. and p.d.f. of the standard normal distribution, respectively, where we adopt the usual convention that $\Phi(-\infty) = 0$ and $\Phi(\infty) = 1$.
Proposition 1. If $T$ is bounded either from above or from below, then
$$E[U(W) - L(W) \,|\, W \in T] = \infty.$$
If $T$ is unbounded from above and from below, then
$$\frac{U(W) - L(W)}{\varsigma} \;\overset{\text{a.s.}}{\leq}\; 2\,\Phi^{-1}\!\big(1 - p^*\alpha/2\big) \;\leq\; 2\,\Phi^{-1}\!\big(1 - \alpha/2\big) + \frac{a_K - b_1}{\varsigma},$$
where $p^* = \inf_{\vartheta \in \mathbb{R}} P(N(\vartheta, \varsigma^2) \in T)$ and where $a_K - b_1$ is to be interpreted as $0$ in case $K = 1$. [The first inequality trivially continues to hold if $T$ is bounded, as then $p^* = 0$.]
Intuitively, one expects confidence intervals to be wide if one conditions on a bounded set, because extreme values cannot be observed on a bounded set and the confidence intervals have to take this into account. We find that the conditional expected length is infinite in this case. If, for example, $T$ is bounded from below, i.e., if $-\infty < a_1$, then the first statement in the proposition follows from two facts: First, the length $U(w) - L(w)$ behaves like $1/(w - a_1)$ as $w$ approaches $a_1$ from above; and, second, the p.d.f. of the truncated normal distribution at $w$ is bounded away from zero as $w$ approaches $a_1$ from above. See the proof in Section B for a more detailed version of this argument. On the other hand, if the truncation set is unbounded, extreme values are observable and confidence intervals, therefore, do not have to be extremely wide. The second upper bound provided by the proposition for that case will be useful later.
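A quick numerical illustration (our sketch, not from the paper): for $T = (-3, 3)$ the product $(b_K - w)(U(w) - L(w))$ settles near $\varsigma^2 \log((1-\alpha/2)/(\alpha/2)) \approx 3.66$ as $w$ approaches the boundary, in line with the proof in Appendix B. Since $L(w)$ and $U(w)$ grow like $1/(b_K - w)$, the c.d.f. is evaluated via log-tails to survive the extreme arguments; the bracketing bounds are ad hoc:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import log_ndtr

a1, b1, sig, alpha = -3.0, 3.0, 1.0, 0.05   # single bounded interval T = (a1, b1)

def F(w, theta):
    """c.d.f. of N(theta, sig^2) truncated to (a1, b1), computed via log-tails
    so that it survives the very large theta arising near the boundary."""
    lw = log_ndtr((w - theta) / sig)
    la = log_ndtr((a1 - theta) / sig)
    lb = log_ndtr((b1 - theta) / sig)
    return (np.exp(lw - lb) - np.exp(la - lb)) / (1.0 - np.exp(la - lb))

def interval_length(w):
    # U(w), L(w) solve F_theta(w) = alpha/2 resp. 1 - alpha/2; F decreases in theta.
    U = brentq(lambda t: F(w, t) - alpha / 2, w - 10, 5000.0)
    L = brentq(lambda t: F(w, t) - (1 - alpha / 2), w - 10, 5000.0)
    return U - L

for w in [2.9, 2.99, 2.999]:                # w approaching b1 from below
    print(w, (b1 - w) * interval_length(w))
print(sig**2 * np.log((1 - alpha / 2) / (alpha / 2)))   # limiting value, ~3.664
```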
We see that the boundedness of the truncation set $T$ is critical for the interval length. When the Lasso is used as a model selector, this prompts the question of whether the truncation sets $T_{m,s}(z)$ and $T_m(z)$ are bounded or not, because the intervals $[L_{m,s}(y), U_{m,s}(y)]$ and $[L_m(y), U_m(y)]$ are obtained from conditional normal distributions with truncation sets $T_{m,s}((I_n - P_{\eta^m})y)$ and $T_m((I_n - P_{\eta^m})y)$, respectively. For $m \in \mathcal{M}^+$, $s \in \mathcal{S}_m^+$, and $z$ orthogonal to $\eta^m$, recall that $T_{m,s}(z) = (\mathcal{V}^-_{m,s}(z), \mathcal{V}^+_{m,s}(z))$, and that $T_m(z)$ is the union of these intervals over $s \in \mathcal{S}_m^+$. Write $[\eta^m]^\perp$ for the orthogonal complement of the span of $\eta^m$.
Proposition 2. For each $m \in \mathcal{M}^+$ and each $s \in \mathcal{S}_m$, we have
$$\forall z \in [\eta^m]^\perp: \ -\infty < \mathcal{V}^-_{m,s}(z) \qquad\text{or}\qquad \forall z \in [\eta^m]^\perp: \ \mathcal{V}^+_{m,s}(z) < \infty,$$
or both.
For the confidence interval $[L_{m,s}(Y), U_{m,s}(Y)]$, the statement in (1) now follows immediately: If $m$ is a non-empty model and $s$ is a sign-vector so that the event $\{\hat{m} = m, \hat{s} = s\}$ has positive probability, then $m \in \mathcal{M}^+$ and $s \in \mathcal{S}_m^+$. Now Proposition 2 entails that $T_{m,s}((I_n - P_{\eta^m})Y)$ is almost surely bounded from above or from below on the event $\{\hat{m} = m, \hat{s} = s\}$, and Proposition 1 entails that (1) holds.
For the confidence interval $[L_m(Y), U_m(Y)]$, we obtain that its conditional expected length is finite, conditional on $\hat{m} = m$ with $m \in \mathcal{M}^+$, if and only if its corresponding truncation set $T_m((I_n - P_{\eta^m})Y)$ is almost surely unbounded from above and from below on that event. More precisely, for $m \in \mathcal{M}^+$, we have
$$E_{\mu,\sigma^2}[U_m(Y) - L_m(Y) \,|\, \hat{m} = m] = \infty \tag{3}$$
if and only if there exist an $s \in \mathcal{S}_m^+$ and a vector $y$ satisfying $A_{m,s}\, y < b_{m,s}$, so that
$$T_m((I_n - P_{\eta^m})y) \ \text{ is bounded from above or from below.} \tag{4}$$
Before proving this equivalence, recall that $T_m((I_n - P_{\eta^m})y)$ is the union of the intervals $(\mathcal{V}^-_{m,s}((I_n - P_{\eta^m})y), \mathcal{V}^+_{m,s}((I_n - P_{\eta^m})y))$ with $s \in \mathcal{S}_m^+$. Inspection of the explicit formulas for the interval endpoints given in Appendix C now immediately reveals the following: The lower endpoint $\mathcal{V}^-_{m,s}((I_n - P_{\eta^m})y)$ is either constant equal to $-\infty$ on the set $\{y : A_{m,s}\, y < b_{m,s}\}$, or it is the maximum of a finite number of linear functions of $y$ (and hence finite and continuous) on that set. Similarly, the upper endpoint $\mathcal{V}^+_{m,s}((I_n - P_{\eta^m})y)$ is either constant equal to $\infty$ on that set, or it is the minimum of a finite number of linear functions of $y$ (and hence finite and continuous) on that set.
To prove the equivalence, we first assume, for some $s$ and $y$ with $s \in \mathcal{S}_m^+$ and $A_{m,s}\, y < b_{m,s}$, that the set in (4) is bounded from above (the case of boundedness from below is similar). Then there is an open neighborhood $O$ of $y$, so that each point $w \in O$ satisfies $A_{m,s}\, w < b_{m,s}$ and also so that $T_m((I_n - P_{\eta^m})w)$ is bounded from above. Because $O$ has positive Lebesgue measure, (3) now follows from Proposition 1. To prove the converse, assume for each $s \in \mathcal{S}_m^+$ and each $y$ satisfying $A_{m,s}\, y < b_{m,s}$ that $T_m((I_n - P_{\eta^m})y)$ is unbounded from above and from below. Because the sets $\{y : A_{m,s}\, y < b_{m,s}\}$ for $s \in \mathcal{S}_m^+$ are disjoint by construction, the same is true for the sets $T_{m,s}((I_n - P_{\eta^m})y)$ for $s \in \mathcal{S}_m^+$. Using Proposition 1, we then obtain that $U_m(Y) - L_m(Y)$ is bounded by a linear function of
$$\max\{\mathcal{V}^-_{m,s}((I_n - P_{\eta^m})Y) : s \in \mathcal{S}_m^+\} \;-\; \min\{\mathcal{V}^+_{m,s}((I_n - P_{\eta^m})Y) : s \in \mathcal{S}_m^+\}$$
Lebesgue almost everywhere on the event $\{\hat{m} = m\}$. (The maximum and the minimum in the preceding display correspond to $a_K$ and $b_1$, respectively, in Proposition 1.) It remains to show that the expression in the preceding display has finite conditional expectation on the event $\{\hat{m} = m\}$. But this expression is the maximum of a finite number of Gaussians minus the minimum of a finite number of Gaussians. Its unconditional expectation, and hence also its conditional expectation on the event $\{\hat{m} = m\}$, is finite.
In order to infer (3) from (4), the latter condition needs to be checked for every point $y$ in a union of polyhedra. While this is easy in some simple examples like, say, the situation depicted in Figure 1 of Lee et al. (2016), searching over polyhedra in $\mathbb{R}^n$ is hard in general. In our simulations, we therefore use a simpler sufficient condition that implies (3): After observing the data, i.e., after observing a particular value $y^*$ of $Y$, and hence also observing $\hat{m}(y^*) = m$ and $\hat{s}(y^*) = s$, we check whether $T_m((I_n - P_{\eta^m})y^*)$ is bounded from above or from below (and also whether $A_{m,s}\, y^* < b_{m,s}$, which, if satisfied, entails that $m \in \mathcal{M}^+$ and that $s \in \mathcal{S}_m^+$). If this is the case, then it follows, ex post, that (3) holds. Note that these computations occur naturally during the computation of $[L_m(y^*), U_m(y^*)]$ and can hence be performed as a safety precaution with little extra effort.
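The following is a minimal numerical sketch of this ex-post check, ours rather than the authors' code. It assumes NumPy, uses a crude tolerance for the zero tests, enumerates all $2^{|m|}$ sign vectors (feasible only for small models), and defers the construction of $(A_{m,s}, b_{m,s})$ to a helper `make_ab` whose explicit form is sketched in Appendix C:

```python
import itertools
import numpy as np

def v_bounds(A, b, eta, z):
    """Endpoints of {w : A(z + c*w) < b} with c = eta/||eta||^2, following the
    formulas for V^-, V^+ and V^0 given in Appendix C."""
    c = eta / (eta @ eta)
    Ac = A @ c
    r = b - A @ z
    neg, pos, zero = Ac < -1e-12, Ac > 1e-12, np.abs(Ac) <= 1e-12
    lo = np.max(r[neg] / Ac[neg]) if neg.any() else -np.inf
    hi = np.min(r[pos] / Ac[pos]) if pos.any() else np.inf
    feasible = bool(np.all(r[zero] > 0))          # the V^0 > 0 condition
    return lo, hi, feasible

def tm_is_bounded_one_side(make_ab, m, eta, z):
    """Check, ex post, whether T_m(z), the union of (V^-_{m,s}(z), V^+_{m,s}(z))
    over sign vectors s contributing a non-empty interval, is bounded from above
    or from below, i.e., whether condition (4) holds at the observed data.
    `make_ab(m, s)` must return (A_{m,s}, b_{m,s})."""
    los, his = [], []
    for s in itertools.product([-1.0, 1.0], repeat=len(m)):
        A, b = make_ab(m, np.array(s))
        lo, hi, ok = v_bounds(A, b, eta, z)
        if ok and lo < hi:                        # this s contributes an interval
            los.append(lo)
            his.append(hi)
    if not los:                                   # empty T_m(z); should not occur
        return False                              # for observed data
    return min(los) > -np.inf or max(his) < np.inf
```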
Remark. If $\hat{m}$ is any other model selection procedure, so that the event $\{y : \hat{m} = m\}$ is the union of a finite number of polyhedra (up to null sets), then the polyhedral method can be applied to obtain a confidence set for $\eta^{m\prime}\mu$ with conditional coverage probability $1 - \alpha$, conditional on the event $\{\hat{m} = m\}$, if that event has positive probability. Clearly, for such a model selection procedure, an equivalence similar to (3)-(4) holds. Indeed, the derivation of this equivalence relies on Proposition 1 but not on the Lasso-specific Proposition 2.
3.1 Simulation results
To investigate whether condition (4) is restrictive or not, we perform an exploratory simulation exercise consisting of repeated samples of size $n = 100$ in the nine scenarios corresponding to the rows in Table 1, which cover all combinations of the cases $p = 20$, $p = 50$ and $p = 200$ (i.e., $p$ small, moderate, large) with the cases $\lambda = 0.1$, $\lambda = 1$ and $\lambda = 10$ (i.e., $\lambda$ small, moderate, large). For each of the nine scenarios, we generate an $n \times p$ regressor matrix $X$ whose rows follow a $p$-variate Gaussian distribution with mean zero, so that the diagonal elements of the covariance matrix all equal 1 and the off-diagonal elements all equal 0.2. We then generate an $n$-vector $y^*$ whose entries are i.i.d. standard Gaussians, compute the Lasso estimator $\hat\beta(y^*)$ and the resulting selected model $m = \hat{m}(y^*)$ (if $m = \emptyset$, this process is repeated with a newly generated vector $y^*$). Finally, we generate 2000 directions $\gamma^m_j$ that are i.i.d. uniform on the unit sphere in $\mathbb{R}^{|m|}$ and set $\eta^m_j = X_m (X_m' X_m)^{-1} \gamma^m_j$, $1 \leq j \leq 2000$. For each $\eta^m_j$ we now check if the sufficient condition outlined in the preceding paragraph is satisfied with $\eta^m_j$ replacing $\eta^m$. If it is satisfied, the corresponding confidence set $[L_m(Y), U_m(Y)]$ (with $\eta^m_j$ replacing $\eta^m$) is guaranteed to have infinite expected length conditional on the event that $\hat{m} = m$. The fraction of indices $j$, $1 \leq j \leq 2000$, for which this is the case, together with the number of parameters in the selected model, are displayed in the cells of Table 1, for 50 independent replications of $y^*$ in each scenario (row). In each of the nine scenarios, and for each of the 50 replications in each, this fraction estimates a lower bound for the probability that the confidence interval $[L_m(Y), U_m(Y)]$ has infinite expected length conditional on $\hat{m} = m$ if the direction of interest, i.e., $\gamma^m$ or, equivalently, $\eta^m$, is chosen at random.
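For concreteness, here is a hedged sketch of one replication (one cell of Table 1) along the lines just described; it assumes scikit-learn for the Lasso step and reuses `tm_is_bounded_one_side` from the sketch in Section 3 and `make_A_b` from the sketch in Appendix C. The seed and helper names are ours, not the authors':

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, lam = 100, 20, 1.0

# Rows of X: i.i.d. p-variate Gaussian, unit variances, all correlations 0.2.
Sigma = np.full((p, p), 0.2) + 0.8 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

m = np.array([], dtype=int)
while m.size == 0:   # redraw y* until the selected model is non-empty
    y = rng.standard_normal(n)
    beta = Lasso(alpha=lam / n, fit_intercept=False, max_iter=100000).fit(X, y).coef_
    m = np.flatnonzero(beta)

Xm = X[:, m]
G = np.linalg.inv(Xm.T @ Xm)
make_ab = lambda mm, ss: make_A_b(X, mm, ss, lam)   # helper from the Appendix C sketch
hits = 0
for _ in range(2000):
    gamma = rng.standard_normal(m.size)
    gamma /= np.linalg.norm(gamma)                  # uniform direction on the sphere
    eta = Xm @ (G @ gamma)
    z = y - eta * (eta @ y) / (eta @ eta)           # z = (I_n - P_eta) y*
    hits += tm_is_bounded_one_side(make_ab, m, eta, z)
print(hits / 2000)   # fraction reported in one cell of Table 1
```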
As soon as the selected model excludes more than a few variables, we see that the conditional expected length of $[L_m(Y), U_m(Y)]$ is guaranteed to be infinite in a substantial number of cases. In particular, this always occurs if $p > n$. (Also keep in mind that we only check a sufficient condition for infinite expected length, not a necessary one.) Also, within each row of the table, the number of cases with infinite conditional expected length is roughly increasing as the number of parameters in the selected model decreases. Beyond these observations, there appears to be no simple relation between the number of selected variables in the model and the percentage of cases where the interval has infinite conditional expected length. We also stress here that a simulation study cannot be exhaustive and that other simulation scenarios will give different results.

  p    λ     y*_1        y*_2        y*_3        ...   y*_48       y*_49       y*_50
  20   0.1   0%   (20)   0%   (20)   0%   (20)   ...   80%  (19)   100% (19)   100% (19)
  20   1     0%   (20)   0%   (20)   0%   (20)   ...   100% (16)   100% (16)   100% (14)
  20   10    49%  (4)    60%  (7)    72%  (8)    ...   100% (3)    100% (3)    100% (3)
  50   0.1   0%   (50)   0%   (50)   0%   (50)   ...   100% (48)   100% (47)   100% (47)
  50   1     96%  (47)   99%  (44)   100% (47)   ...   100% (39)   100% (38)   100% (37)
  50   10    100% (18)   100% (18)   100% (17)   ...   100% (7)    100% (4)    100% (3)
  200  0.1   100% (100)  100% (100)  100% (100)  ...   100% (97)   100% (97)   100% (95)
  200  1     100% (96)   100% (94)   100% (94)   ...   100% (84)   100% (83)   100% (82)
  200  10    100% (41)   100% (37)   100% (36)   ...   100% (21)   100% (20)   100% (19)

Table 1: Percentage of cases where $\eta^m$ is such that the confidence interval $[L_m(Y), U_m(Y)]$ for $\eta^{m\prime}\mu$ is guaranteed to have infinite expected length conditional on $\hat{m} = m$, with $m = \hat{m}(y^*_i)$ and, in parentheses, the number of parameters in the model, i.e., $|m|$. The entries in each row are ordered to improve readability, first by percentage (increasing) and second by number of parameters (decreasing).
3.2 The unknown variance case
Suppose here that $\sigma^2 > 0$ is unknown and that $\hat\sigma^2$ is an estimator for $\sigma^2$. Fix $m \in \mathcal{M}^+$ and $s \in \mathcal{S}_m^+$. Note that the set $\{y : A_{m,s}\, y < b_{m,s}\}$ does not depend on $\sigma^2$, and hence also $\mathcal{V}^-_{m,s}((I_n - P_{\eta^m})y)$ and $\mathcal{V}^+_{m,s}((I_n - P_{\eta^m})y)$ do not depend on $\sigma^2$. For each $\varsigma^2 > 0$ and for each $y$ so that $A_{m,s}\, y < b_{m,s}$, define $L_{m,s}(y, \varsigma^2)$, $U_{m,s}(y, \varsigma^2)$, $L_m(y, \varsigma^2)$, and $U_m(y, \varsigma^2)$ like
$L_{m,s}(y)$, $U_{m,s}(y)$, $L_m(y)$, and $U_m(y)$ in Section 2, with $\varsigma^2$ replacing $\sigma^2$ in the formulas. (Note that, say, $L_{m,s}(y)$ depends on $\sigma^2$ through $\sigma^2_m = \sigma^2 \eta^{m\prime}\eta^m$.) The asymptotic coverage probability of the intervals $[L_{m,s}(Y, \hat\sigma^2), U_{m,s}(Y, \hat\sigma^2)]$ and $[L_m(Y, \hat\sigma^2), U_m(Y, \hat\sigma^2)]$, conditional on the events $\{\hat{m} = m, \hat{s} = s\}$ and $\{\hat{m} = m\}$, respectively, is discussed in Lee et al. (2016). If $\hat\sigma^2$ is independent of $\eta^{m\prime} Y$ and positive with positive probability, then it is easy to see that (1) continues to hold with $[L_{m,s}(Y, \hat\sigma^2), U_{m,s}(Y, \hat\sigma^2)]$ replacing $[L_{m,s}(Y), U_{m,s}(Y)]$ for each $m \in \mathcal{M}^+$ and each $s \in \mathcal{S}_m^+$. And if, in addition, $\hat\sigma^2$ has finite mean conditional on the event $\{\hat{m} = m\}$ for $m \in \mathcal{M}^+$, then it is elementary to verify that the equivalence (3)-(4) continues to hold with $[L_m(Y, \hat\sigma^2), U_m(Y, \hat\sigma^2)]$ replacing $[L_m(Y), U_m(Y)]$ (upon repeating the arguments following (3)-(4) and upon using the finite conditional mean of $\hat\sigma^2$ in the last step).

In the case where $p < n$, the usual variance estimator $\|Y - X(X'X)^{-1}X'Y\|^2/(n - p)$ is independent of $\eta^{m\prime} Y$, is positive with probability 1, and has finite unconditional (and hence also conditional) mean. For variance estimators in the case where $p \geq n$, we refer to Lee et al. (2016) and the references therein.
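As a small aside (our sketch, not from the paper), this estimator is immediate to compute when $p < n$:

```python
import numpy as np

def sigma2_hat(X, y):
    """Usual variance estimator ||y - X(X'X)^{-1}X'y||^2 / (n - p), for p < n.
    It is a function of the least-squares residuals (I - P_X)y only, and is
    therefore independent of eta'Y whenever eta lies in the column space of X."""
    n, p = X.shape
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return (resid @ resid) / (n - p)
```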
Appendix A Auxiliary results
In this section, we collect some properties of functions like $F^T_{\theta,\varsigma^2}(w)$ that will be needed in the proofs of Proposition 1 and Proposition 2. The following result will be used repeatedly and is easily verified using L'Hospital's method.

Lemma A.1. For all $a, b$ with $-\infty \leq a < b \leq \infty$, the following holds:
$$\lim_{\theta \to \infty} \frac{\Phi(a - \theta)}{\Phi(b - \theta)} = 0.$$
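For finite $a$ and $b$, the following short computation (a supplementary sketch of ours, using the Mills-ratio asymptotics $\Phi(-x) \sim \phi(x)/x$ from Feller 1957, Lemma VII.1.2, which also appear in Appendix B) makes the limit explicit; the cases $a = -\infty$ or $b = \infty$ are immediate:
$$\frac{\Phi(a-\theta)}{\Phi(b-\theta)} \;\sim\; \frac{\phi(\theta-a)/(\theta-a)}{\phi(\theta-b)/(\theta-b)} \;=\; \frac{\theta-b}{\theta-a}\,\exp\!\Big(\tfrac{1}{2}(a-b)(2\theta-a-b)\Big) \;\longrightarrow\; 0 \quad (\theta \to \infty),$$
since $a - b < 0$.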
Write $F^T_{\theta,\varsigma^2}(w)$ and $f^T_{\theta,\varsigma^2}(w)$ for the c.d.f. and p.d.f. of the $TN(\theta, \varsigma^2, T)$-distribution, where $T = \bigcup_{i=1}^K (a_i, b_i)$ with $-\infty \leq a_1 < b_1 < a_2 < \dots < a_K < b_K \leq \infty$. For $w \in T$ and for $k$ so that $a_k < w < b_k$, we have
$$F^T_{\theta,\varsigma^2}(w) \;=\; \frac{\Phi\big(\tfrac{w-\theta}{\varsigma}\big) - \Phi\big(\tfrac{a_k-\theta}{\varsigma}\big) + \sum_{i=1}^{k-1}\Big[\Phi\big(\tfrac{b_i-\theta}{\varsigma}\big) - \Phi\big(\tfrac{a_i-\theta}{\varsigma}\big)\Big]}{\sum_{i=1}^{K}\Big[\Phi\big(\tfrac{b_i-\theta}{\varsigma}\big) - \Phi\big(\tfrac{a_i-\theta}{\varsigma}\big)\Big]}\,;$$
if $k = 1$, the sum in the numerator is to be interpreted as $0$. And for $w$ as above, the density $f^T_{\theta,\varsigma^2}(w)$ is equal to $\phi((w-\theta)/\varsigma)/\varsigma$ divided by the denominator in the preceding display.
Lemma A.2. For each fixed $w \in T$, $F^T_{\theta,\varsigma^2}(w)$ is continuous and strictly decreasing in $\theta$, and
$$\lim_{\theta \to -\infty} F^T_{\theta,\varsigma^2}(w) = 1 \qquad\text{and}\qquad \lim_{\theta \to \infty} F^T_{\theta,\varsigma^2}(w) = 0.$$

Proof. Continuity is obvious, and monotonicity has been shown in Lee et al. (2016) for the case where $T$ is a single interval, i.e., $K = 1$; it is easy to adapt that argument to also cover the case $K > 1$. Next consider the formula for $F^T_{\theta,\varsigma^2}(w)$. As $\theta \to \infty$, Lemma A.1 implies that the leading term in the numerator is $\Phi((w-\theta)/\varsigma)$ while the leading term in the denominator is $\Phi((b_K-\theta)/\varsigma)$. Using Lemma A.1 again gives $\lim_{\theta\to\infty} F^T_{\theta,\varsigma^2}(w) = 0$. Finally, it is easy to see that $F^T_{\theta,\varsigma^2}(w) = 1 - F^{-T}_{-\theta,\varsigma^2}(-w)$ (upon using the relation $\Phi(t) = 1 - \Phi(-t)$ and a little algebra). With this, we also obtain that $\lim_{\theta\to-\infty} F^T_{\theta,\varsigma^2}(w) = 1$.
For $\gamma \in (0,1)$ and $w \in T$, define $Q_\gamma(w)$ through
$$F^T_{Q_\gamma(w),\varsigma^2}(w) = \gamma.$$
Lemma A.2 ensures that $Q_\gamma(w)$ is well-defined. Note that $L(w) = Q_{1-\alpha/2}(w)$ and $U(w) = Q_{\alpha/2}(w)$.
Lemma A.3. For fixed $w \in T$, $Q_\gamma(w)$ is strictly decreasing in $\gamma$ on $(0,1)$. And for fixed $\gamma \in (0,1)$, $Q_\gamma(w)$ is continuous and strictly increasing in $w \in T$, with $\lim_{w \searrow a_1} Q_\gamma(w) = -\infty$ and $\lim_{w \nearrow b_K} Q_\gamma(w) = \infty$.

Proof. Fix $w \in T$. Strict monotonicity of $Q_\gamma(w)$ in $\gamma$ follows from strict monotonicity of $F^T_{\theta,\varsigma^2}(w)$ in $\theta$; cf. Lemma A.2.

Fix $\gamma \in (0,1)$ throughout the following. To show that $Q_\gamma(\cdot)$ is strictly increasing on $T$, fix $w, w' \in T$ with $w < w'$. We get
$$\gamma = F^T_{Q_\gamma(w),\varsigma^2}(w) < F^T_{Q_\gamma(w),\varsigma^2}(w'),$$
where the inequality holds because the density of $F^T_{Q_\gamma(w),\varsigma^2}(\cdot)$ is positive on $T$. The definition of $Q_\gamma(w')$ and Lemma A.2 entail that $Q_\gamma(w) < Q_\gamma(w')$.

To show that $Q_\gamma(\cdot)$ is continuous on $T$, we first note that $F^T_{\theta,\varsigma^2}(w)$ is continuous in $(\theta, w) \in \mathbb{R} \times T$ (which is easy to see from the formula for $F^T_{\theta,\varsigma^2}(w)$ given after Lemma A.1). Now fix $w \in T$. Because $Q_\gamma(\cdot)$ is monotone, it suffices to show that $Q_\gamma(w_n) \to Q_\gamma(w)$ for any increasing sequence $w_n$ in $T$ converging to $w$ from below, and for any decreasing sequence $w_n$ converging to $w$ from above. If the $w_n$ increase towards $w$ from below, the sequence $Q_\gamma(w_n)$ is increasing and bounded by $Q_\gamma(w)$ from above, so that $Q_\gamma(w_n)$ converges to a finite limit $Q$. With this, and because $F^T_{\theta,\varsigma^2}(w)$ is continuous in $(\theta, w)$, it follows that
$$\lim_n F^T_{Q_\gamma(w_n),\varsigma^2}(w_n) = F^T_{Q,\varsigma^2}(w).$$
In the preceding display, the sequence on the left-hand side is constant equal to $\gamma$ by definition of $Q_\gamma(w_n)$, so that $F^T_{Q,\varsigma^2}(w) = \gamma$. It follows that $Q = Q_\gamma(w)$. If the $w_n$ decrease towards $w$ from above, a similar argument applies.
To show that $\lim_{w \nearrow b_K} Q_\gamma(w) = \infty$, let $w_n$, $n \geq 1$, be an increasing sequence in $T$ that converges to $b_K$. It follows that $Q_\gamma(w_n)$ converges to a (not necessarily finite) limit $Q$ as $n \to \infty$. If $Q < \infty$, we get for each $b < b_K$ that
$$\liminf_n F^T_{Q_\gamma(w_n),\varsigma^2}(w_n) \;\geq\; \liminf_n F^T_{Q_\gamma(w_n),\varsigma^2}(b) \;=\; F^T_{Q,\varsigma^2}(b).$$
In this display, the inequality holds because $F^T_{Q_\gamma(w_n),\varsigma^2}(\cdot)$ is a c.d.f., and the equality holds because $F^T_{\theta,\varsigma^2}(b)$ is continuous in $\theta$. As this holds for each $b < b_K$, we obtain that $\liminf_n F^T_{Q_\gamma(w_n),\varsigma^2}(w_n) = 1$. But in this equality, the left-hand side equals $\gamma$, a contradiction. By similar arguments, it also follows that $\lim_{w \searrow a_1} Q_\gamma(w) = -\infty$.
Lemma A.4. The function $Q_\gamma(\cdot)$ satisfies
$$\lim_{w \nearrow b_K} (b_K - w)\, Q_\gamma(w) = -\varsigma^2 \log(\gamma) \ \text{ if } b_K < \infty, \qquad \lim_{w \searrow a_1} (a_1 - w)\, Q_\gamma(w) = -\varsigma^2 \log(1-\gamma) \ \text{ if } a_1 > -\infty.$$

Proof. As both statements follow from similar arguments, we only give the details for the first one. As $w$ approaches $b_K$ from below, $Q_\gamma(w)$ converges to $\infty$ by Lemma A.3. This observation, the fact that $F^T_{Q_\gamma(w),\varsigma^2}(w) = \gamma$ holds for each $w$, and Lemma A.1 together imply that
$$\lim_{w \nearrow b_K} \frac{\Phi\big(\tfrac{w - Q_\gamma(w)}{\varsigma}\big)}{\Phi\big(\tfrac{b_K - Q_\gamma(w)}{\varsigma}\big)} = \gamma.$$
Because $\Phi(-x)/(\phi(x)/x) \to 1$ as $x \to \infty$ (cf. Feller 1957, Lemma VII.1.2), we get that
$$\lim_{w \nearrow b_K} \frac{\phi\big(\tfrac{w - Q_\gamma(w)}{\varsigma}\big)}{\phi\big(\tfrac{b_K - Q_\gamma(w)}{\varsigma}\big)} = \gamma.$$
The claim now follows by plugging in the formula for $\phi(\cdot)$ on the left-hand side, simplifying, and then taking the logarithm of both sides.
Appendix B Proof of Proposition 1
Proof of the first statement in Proposition 1. Assume that $b_K < \infty$ (the case where $a_1 > -\infty$ is treated similarly). Lemma A.4 entails that $\lim_{w \nearrow b_K} (b_K - w)(U(w) - L(w)) = \varsigma^2 C$, where $C = \log((1-\alpha/2)/(\alpha/2))$ is positive. Hence, there exists a constant $\varepsilon > 0$ so that
$$U(w) - L(w) \;>\; \frac{1}{2}\,\frac{\varsigma^2 C}{b_K - w}$$
whenever $w \in (b_K - \varepsilon, b_K) \cap T$. Set $B = \inf\{f^T_{\theta,\varsigma^2}(w) : w \in (b_K - \varepsilon, b_K) \cap T\}$. For $w \in T$, $f^T_{\theta,\varsigma^2}(w)$ is a Gaussian density divided by a constant scaling factor, so that $B > 0$. Because $U(w) - L(w) \geq 0$ in view of Lemma A.3, we obtain that
$$E_{\theta,\varsigma^2}[U(W) - L(W) \,|\, W \in T] \;\geq\; \frac{\varsigma^2 B C}{2} \int_{(b_K - \varepsilon,\, b_K) \cap T} \frac{1}{b_K - w}\, dw \;=\; \infty.$$
Proof of the first inequality in Proposition 1. Define $R_\gamma(w)$ through $\Phi((w - R_\gamma(w))/\varsigma) = \gamma$, i.e., $R_\gamma(w) = w - \varsigma\,\Phi^{-1}(\gamma)$. Then, on the one hand, we have
$$F^T_{R_\gamma(w),\varsigma^2}(w) \;=\; \frac{P\big(N(R_\gamma(w), \varsigma^2) \leq w,\, N(R_\gamma(w), \varsigma^2) \in T\big)}{P\big(N(R_\gamma(w), \varsigma^2) \in T\big)} \;\leq\; \frac{P\big(N(R_\gamma(w), \varsigma^2) \leq w\big)}{\inf_\vartheta P\big(N(\vartheta, \varsigma^2) \in T\big)} \;=\; \frac{\gamma}{p^*},$$
while, on the other,
$$F^T_{R_\gamma(w),\varsigma^2}(w) \;\geq\; \frac{P\big(N(R_\gamma(w), \varsigma^2) \leq w\big) - P\big(N(R_\gamma(w), \varsigma^2) \notin T\big)}{P\big(N(R_\gamma(w), \varsigma^2) \in T\big)} \;\geq\; \inf_\vartheta \frac{P\big(N(R_\gamma(w), \varsigma^2) \leq w\big) - 1 + P\big(N(\vartheta, \varsigma^2) \in T\big)}{P\big(N(\vartheta, \varsigma^2) \in T\big)} \;=\; \frac{\gamma - 1 + p^*}{p^*}.$$
The inequalities in the preceding two displays, together with the fact that $F^T_{\theta,\varsigma^2}(w)$ is decreasing in $\theta$, imply that
$$R_{1 - p^*(1-\gamma)}(w) \;\leq\; Q_\gamma(w) \;\leq\; R_{p^*\gamma}(w).$$
(Note that the inequality in the third-to-last display continues to hold with $p^*\gamma$ replacing $\gamma$; in that case, the upper bound reduces to $\gamma$. And, similarly, the inequality in the second-to-last display continues to hold with $1 - p^*(1-\gamma)$ replacing $\gamma$, in which case the lower bound reduces to $\gamma$.) In particular, we get that $U(w) = Q_{\alpha/2}(w) \leq R_{p^*\alpha/2}(w) = w - \varsigma\,\Phi^{-1}(p^*\alpha/2)$ and that $L(w) = Q_{1-\alpha/2}(w) \geq R_{1-p^*\alpha/2}(w) = w - \varsigma\,\Phi^{-1}(1 - p^*\alpha/2)$. The last two inequalities, and the symmetry of $\Phi(\cdot)$ around zero, imply the first inequality in the proposition.
Proof of the second inequality in Proposition 1. Note that $p^* \geq p^\circ = \inf_\vartheta P\big(N(\vartheta,\varsigma^2) < b_1 \text{ or } N(\vartheta,\varsigma^2) > a_K\big)$, because $T$ is unbounded above and below. Setting $\delta = (a_K - b_1)/(2\varsigma)$, we note that $\delta \geq 0$ and that it is elementary to verify that $p^\circ = 2\Phi(-\delta)$. Because $\Phi^{-1}(1 - p^*\alpha/2) \leq \Phi^{-1}(1 - p^\circ\alpha/2)$, the inequality will follow if we can show that $\Phi^{-1}(1 - p^\circ\alpha/2) \leq \Phi^{-1}(1 - \alpha/2) + \delta$ or, equivalently, that $\Phi^{-1}(p^\circ\alpha/2) \geq \Phi^{-1}(\alpha/2) - \delta$. Because $\Phi(\cdot)$ is strictly increasing, this is equivalent to
$$p^\circ \alpha/2 = \Phi(-\delta)\,\alpha \;\geq\; \Phi\big(\Phi^{-1}(\alpha/2) - \delta\big).$$
To this end, we set $f(\delta) = \alpha\,\Phi(-\delta)/\Phi(\Phi^{-1}(\alpha/2) - \delta)$ and show that $f(\delta) \geq 1$ for $\delta \geq 0$. Because $f(0) = 1$, it suffices to show that $f'(\delta)$ is non-negative for $\delta > 0$. The derivative can be written as a fraction with positive denominator and with numerator equal to
$$-\alpha\,\phi(-\delta)\,\Phi\big(\Phi^{-1}(\alpha/2) - \delta\big) + \alpha\,\Phi(-\delta)\,\phi\big(\Phi^{-1}(\alpha/2) - \delta\big).$$
The expression in the preceding display is non-negative if and only if
$$\frac{\Phi(-\delta)}{\phi(-\delta)} \;\geq\; \frac{\Phi\big(\Phi^{-1}(\alpha/2) - \delta\big)}{\phi\big(\Phi^{-1}(\alpha/2) - \delta\big)}.$$
This will follow if the function $g(x) = \Phi(-x)/\phi(x)$ is decreasing in $x \geq 0$. The derivative $g'(x)$ can be written as a fraction with positive denominator and with numerator equal to
$$-\phi(x)^2 + x\,\Phi(-x)\,\phi(x) \;=\; x\,\phi(x)\Big(\Phi(-x) - \frac{\phi(x)}{x}\Big).$$
Using the well-known inequality $\Phi(-x) \leq \phi(x)/x$ for $x > 0$ (Feller 1957, Lemma VII.1.2), we see that the expression in the preceding display is non-positive for $x > 0$.
Appendix C Proof of Proposition 2
From Lee et al. (2016), we recall the formulas for the expressions on the right-hand side of (2), namely $A_{m,s} = (A^{0\prime}_{m,s}, A^{1\prime}_{m,s})'$ and $b_{m,s} = (b^{0\prime}_{m,s}, b^{1\prime}_{m,s})'$, with $A^0_{m,s}$ and $b^0_{m,s}$ given by
$$A^0_{m,s} = \frac{1}{\lambda} \begin{pmatrix} X'_{m^c}(I_n - P_{X_m}) \\ -X'_{m^c}(I_n - P_{X_m}) \end{pmatrix} \qquad\text{and}\qquad b^0_{m,s} = \begin{pmatrix} \iota - X'_{m^c} X_m (X'_m X_m)^{-1} s \\ \iota + X'_{m^c} X_m (X'_m X_m)^{-1} s \end{pmatrix},$$
respectively, and with $A^1_{m,s} = -\mathrm{diag}(s)(X'_m X_m)^{-1} X'_m$ and $b^1_{m,s} = -\lambda\,\mathrm{diag}(s)(X'_m X_m)^{-1} s$ (in the preceding display, $P_{X_m}$ denotes the orthogonal projection matrix onto the column space spanned by $X_m$ and $\iota$ denotes an appropriate vector of ones). Moreover, it is easy to see that the set $\{y : A_{m,s}\, y < b_{m,s}\}$ can be written as $\{y : \text{for } z = (I_n - P_{\eta^m})y, \text{ we have } \mathcal{V}^-_{m,s}(z) < \eta^{m\prime} y < \mathcal{V}^+_{m,s}(z) \text{ and } \mathcal{V}^0_{m,s}(z) > 0\}$, where
$$\mathcal{V}^-_{m,s}(z) = \max\big(\{(b_{m,s} - A_{m,s} z)_i / (A_{m,s} c^m)_i : (A_{m,s} c^m)_i < 0\} \cup \{-\infty\}\big),$$
$$\mathcal{V}^+_{m,s}(z) = \min\big(\{(b_{m,s} - A_{m,s} z)_i / (A_{m,s} c^m)_i : (A_{m,s} c^m)_i > 0\} \cup \{\infty\}\big),$$
$$\mathcal{V}^0_{m,s}(z) = \min\big(\{(b_{m,s} - A_{m,s} z)_i : (A_{m,s} c^m)_i = 0\} \cup \{\infty\}\big),$$
with $c^m = \eta^m / \|\eta^m\|^2$; cf. also Lee et al. (2016).
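These displays translate directly into code. The following sketch (ours, assuming NumPy; not the authors' implementation) assembles $A_{m,s}$ and $b_{m,s}$ and is the `make_A_b` helper assumed in the sketches of Section 3:

```python
import numpy as np

def make_A_b(X, m, s, lam):
    """Assemble A_{m,s} and b_{m,s} from the displays above; m lists the
    selected columns, s is the sign vector, lam is the Lasso penalty."""
    n, p = X.shape
    mc = np.setdiff1d(np.arange(p), m)              # m^c, the excluded columns
    Xm, Xmc = X[:, m], X[:, mc]
    G = np.linalg.inv(Xm.T @ Xm)
    M = Xmc.T @ (np.eye(n) - Xm @ G @ Xm.T)         # X'_{m^c} (I_n - P_{X_m})
    ones = np.ones(len(mc))
    A0 = np.vstack([M, -M]) / lam
    b0 = np.concatenate([ones - Xmc.T @ Xm @ G @ s, ones + Xmc.T @ Xm @ G @ s])
    A1 = -np.diag(s) @ G @ Xm.T
    b1 = -lam * np.diag(s) @ G @ s
    return np.vstack([A0, A1]), np.concatenate([b0, b1])
```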
Proof of Proposition 2. Set $I^- = \{i : (A_{m,s} c^m)_i < 0\}$ and $I^+ = \{i : (A_{m,s} c^m)_i > 0\}$. In view of the formulas for $\mathcal{V}^-_{m,s}(z)$ and $\mathcal{V}^+_{m,s}(z)$ given earlier, it suffices to show that either $I^-$ or $I^+$ is non-empty. To this end, assume that $I^- = I^+ = \emptyset$. Then $A_{m,s} c^m = 0$ and hence also $A^1_{m,s} c^m = 0$. Using the explicit formula for $A^1_{m,s}$ and the definition of $\eta^m$, i.e., $\eta^m = X_m (X'_m X_m)^{-1} \gamma^m$, it follows that $\gamma^m = 0$, which contradicts our assumption that $\gamma^m \in \mathbb{R}^{|m|} \setminus \{0\}$.
References
Bachoc, F., Leeb, H. & Pötscher, B. M. (2016), 'Valid confidence intervals for post-model-selection predictors', arXiv preprint arXiv:1412.4605.

Berk, R., Brown, L., Buja, A., Zhang, K. & Zhao, L. (2013), 'Valid post-selection inference', Ann. Statist. 41, 802–837.

Feller, W. (1957), An Introduction to Probability Theory and its Applications, Vol. 1, 2nd edn, Wiley, New York, NY.

Fithian, W., Sun, D. & Taylor, J. (2017), 'Optimal inference after model selection', arXiv preprint arXiv:1410.2597.

Frank, I. E. & Friedman, J. H. (1993), 'A statistical view of some chemometrics regression tools', Technometrics 35, 109–135.

Lee, J. D., Sun, D. L., Sun, Y. & Taylor, J. E. (2016), 'Exact post-selection inference, with application to the Lasso', Ann. Statist. 44, 907–927.

Leeb, H. & Kabaila, P. (2017), 'Admissibility of the usual confidence set for the mean of a univariate or bivariate normal population: The unknown variance case', J. R. Stat. Soc. Ser. B Stat. Methodol. 79, 801–813.

Leeb, H. & Pötscher, B. M. (2006a), 'Can one estimate the conditional distribution of post-model-selection estimators?', Ann. Statist. 34, 2554–2591.

Leeb, H. & Pötscher, B. M. (2006b), 'Performance limits for estimators of the risk or distribution of shrinkage-type estimators, and some general lower risk-bound results', Econometric Theory 22, 69–97.

Leeb, H. & Pötscher, B. M. (2008), 'Can one estimate the unconditional distribution of post-model-selection estimators?', Econometric Theory 24, 338–376.

Pötscher, B. M. (2009), 'Confidence sets based on sparse estimators are necessarily large', Sankhyā Ser. A 71, 1–18.

Pötscher, B. M. & Schneider, U. (2010), 'Confidence sets based on penalized maximum likelihood estimators in Gaussian regression', Electron. J. Statist. 4, 334–360.

Schneider, U. (2016), 'Confidence sets based on thresholding estimators in high-dimensional Gaussian regression models', Econometric Rev. 35, 1412–1455.

Taylor, J. & Tibshirani, R. (2017), 'Post-selection inference for $\ell_1$-penalized likelihood models', Canad. J. Statist. forthcoming.

Tian, X. & Taylor, J. E. (2015), 'Selective inference with a randomized response', arXiv preprint arXiv:1507.06739.

Tibshirani, R. (1996), 'Regression shrinkage and selection via the Lasso', J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288.

Tibshirani, R. J. (2013), 'The Lasso problem and uniqueness', Electron. J. Statist. 7, 1456–1490.

Tibshirani, R. J., Taylor, J., Lockhart, R. & Tibshirani, R. (2016), 'Exact post-selection inference for sequential regression procedures', J. Amer. Statist. Assoc. 111, 600–620.