Expected length of post-model-selection confidence intervals conditional on polyhedral constraints

Danijel Kivaranovic
Department of Statistics and Operations Research, University of Vienna
and
Hannes Leeb
Department of Statistics and Operations Research, University of Vienna

January 25, 2018
Abstract
Valid inference after model selection is currently a very active area of research. The polyhedral method, pioneered by Lee et al. (2016), allows for valid inference after model selection if the model selection event can be described by polyhedral constraints. In that reference, the method is exemplified by constructing two valid confidence intervals when the Lasso estimator is used to select a model. We here study the expected length of these intervals. For the first of these confidence intervals, which is easier to compute, we find that its expected length is always infinite. For the second, whose computation is more demanding, we give a necessary and sufficient condition for its expected length to be infinite. In simulations, we find that this condition is typically satisfied.
Keywords: Lasso, inference after model selection, hypothesis test.
1 Introduction
Lee et al. (2016) recently introduced a new technique for valid inference after model selection, the so-called polyhedral method. Using this method, and using the Lasso for model selection in linear regression, Lee et al. (2016) derived two new confidence sets that are valid conditional on the outcome of the model selection step. More precisely, let $\hat{m}$ denote the model containing those regressors that correspond to non-zero coefficients of the Lasso estimator, and let $\hat{s}$ denote the sign-vector of those non-zero Lasso coefficients. Then Lee et al. (2016) constructed confidence intervals $[L_{m,s}, U_{m,s}]$ and $[L_m, U_m]$ whose coverage probability is $1-\alpha$, conditional on the events $\{\hat{m} = m, \hat{s} = s\}$ and $\{\hat{m} = m\}$, respectively (provided that the probability of the conditioning event is positive). The computational effort in constructing these intervals is considerably lighter for $[L_{m,s}, U_{m,s}]$. In simulations, Lee et al. (2016) noted that this latter interval can be quite long in some cases; cf. Figure 10 in that reference. We here analyze the (conditional) expected length of these intervals.
1.1 Overview of findings
Throughout, we use the same setting and assumptions as Lee et al. (2016). In particular, we assume that the response vector is distributed as $N(\mu, \sigma^2 I_n)$ with unknown mean $\mu \in \mathbb{R}^n$ and known variance $\sigma^2 > 0$ (our results carry over to the unknown-variance case; see the end of Section 3), and that the non-stochastic regressor matrix has columns in general position. Write $P_{\mu,\sigma^2}$ and $E_{\mu,\sigma^2}$ for the probability measure and the expectation operator, respectively, corresponding to $N(\mu, \sigma^2 I_n)$.

For the interval $[L_{m,s}, U_{m,s}]$, we find the following: Fix a non-empty model $m$, a sign-vector $s$, as well as $\mu \in \mathbb{R}^n$ and $\sigma^2 > 0$. If $P_{\mu,\sigma^2}(\hat{m} = m, \hat{s} = s) > 0$, then
$$E_{\mu,\sigma^2}\big[\, U_{m,s} - L_{m,s} \,\big|\, \hat{m} = m, \hat{s} = s \,\big] \;=\; \infty. \tag{1}$$
Obviously, this statement continues to hold if the event $\hat{m} = m, \hat{s} = s$ is replaced by the larger event $\hat{m} = m$ throughout (because there are only finitely many possible values for the sign-vector $s$). And this statement continues to hold if the condition $P_{\mu,\sigma^2}(\hat{m} = m, \hat{s} = s) > 0$ is dropped and the conditional expectation in (1) is replaced by the unconditional one.
For the interval $[L_m, U_m]$, we derive a necessary and sufficient condition for its expected length to be infinite, conditional on the event $\hat{m} = m$. That condition depends on the regressor matrix, on the model $m$, and also on the linear contrast that defines the quantity of interest, and is daunting to verify in all but the most basic examples. We also provide a sufficient condition for infinite expected length that is easy to verify. In simulations, we find that this sufficient condition for infinite expected length is typically satisfied when the model $m$ excludes a significant portion of all the available regressors (e.g., if the selected model is 'sparse'). And even if the model $m$ is not sparse, we find that this condition is still satisfied for a sizable fraction of the linear contrasts that define the quantity of interest. See Table 1 and the attending discussion for more detail.
The methods developed in this paper can also be used if the Lasso, as the model selector, is replaced by any other procedure that allows for application of the polyhedral method. In particular, we see that confidence intervals based on the polyhedral method in Gaussian regression can have infinite expected length. Our findings suggest that the expected length of confidence intervals based on the polyhedral method should be closely scrutinized, in Gaussian regression but also in non-Gaussian settings and in other variations of the polyhedral method.
The rest of the paper is organized as follows: We conclude this section by discussing a number of related results that put our findings in context. Section 2 describes the confidence intervals of Lee et al. (2016) in detail and introduces some notation. Section 3 contains two core results, Propositions 1 and 2, which entail our main findings, the simulation study mentioned earlier, as well as a discussion of the unknown-variance case. The appendix contains the proofs of our core results and some auxiliary lemmas.
1.2 Context and related results
There are currently several exciting ongoing developments based on the polyhedral method, not least because it proved to be applicable to more complicated settings, and there are several generalizations of this framework. See, among others, Tibshirani et al. (2016), Taylor & Tibshirani (2017), Tian & Taylor (2015). Certain optimality results of the method of Lee et al. (2016) are given in Fithian et al. (2017). Using a different approach, Berk et al. (2013) proposed the so-called PoSI-intervals, which are unconditionally valid. A benefit of the PoSI-intervals is that they are valid after selection with any possible model selector, instead of a particular one like the Lasso; however, as a consequence, the PoSI-intervals are typically very conservative (that is, the actual coverage probability is above the nominal level). Nonetheless, Bachoc et al. (2016) showed in a Monte Carlo simulation that, in certain scenarios, the PoSI-intervals can be shorter than the intervals of Lee et al. (2016). The results of the present paper are based on the first author's master's thesis.
It is important to note that all confidence sets discussed so far are non-standard, in the sense that the parameter to be covered is not the true parameter in an underlying correct model (or components thereof), but instead is a model-dependent quantity of interest. (See Section 2 for details and the references in the preceding paragraph for more extensive discussions.) An advantage of this non-standard approach is that it does not rely on the assumption that any of the candidate models is correct. Valid inference for an underlying true parameter is a more challenging task, as demonstrated by the impossibility results in Leeb & Pötscher (2006a,b, 2008). There are several proposals of valid confidence intervals
after model selection (in the sense that the actual coverage probability of the true parameter is at or above the nominal level), but these intervals are rather large compared to the standard confidence intervals from the full model (supposing that one can fit the full model); see Pötscher (2009), Pötscher & Schneider (2010), Schneider (2016). In fact, Leeb & Kabaila (2017) showed that the usual confidence interval obtained by fitting the full model is admissible also in the unknown-variance case; therefore, one cannot obtain uniformly smaller valid confidence sets for a component of the true parameter by any other method.
2 Assumptions and confidence intervals
Let $Y$ denote the $N(\mu, \sigma^2 I_n)$-distributed response vector, $n \geq 1$, where $\mu \in \mathbb{R}^n$ is unknown and $\sigma^2 > 0$ is known. Let $X = (x_1, \dots, x_p)$, $p \geq 1$, with $x_i \in \mathbb{R}^n$ for each $i = 1, \dots, p$, be the non-stochastic $n \times p$ regressor matrix. We assume that the columns of $X$ are in general position (this mild assumption is further discussed in the following paragraph). The full model $\{1, \dots, p\}$ is denoted by $m_F$. All subsets of the full model are collected in $\mathcal{M}$, that is, $\mathcal{M} = \{m : m \subseteq m_F\}$. The cardinality of a model $m$ is denoted by $|m|$. For any $m = \{i_1, \dots, i_k\} \in \mathcal{M} \setminus \{\emptyset\}$ with $i_1 < \dots < i_k$, we set $X_m = (x_{i_1}, \dots, x_{i_k})$. Analogously, for any vector $v \in \mathbb{R}^p$, we set $v_m = (v_{i_1}, \dots, v_{i_k})'$. If $m$ is the empty model, then $X_m$ is to be interpreted as the zero vector in $\mathbb{R}^n$ and $v_m$ as $0$.
The Lasso estimator, denoted by $\hat\beta(y)$, is a minimizer of the least squares problem with an additional penalty on the absolute size of the regression coefficients (Frank & Friedman 1993, Tibshirani 1996):
$$\min_{\beta \in \mathbb{R}^p} \ \frac{1}{2}\,\|y - X\beta\|_2^2 + \lambda\,\|\beta\|_1, \qquad y \in \mathbb{R}^n, \ \lambda > 0.$$
The Lasso has the property that some coefficients of $\hat\beta(y)$ are zero with positive probability.
A minimizer of the Lasso objective function always exists, but it is not necessarily unique.
Uniqueness of $\hat\beta(y)$ is guaranteed here by our assumption that the columns of $X$ are in general position (Tibshirani 2013). This assumption is relatively mild; e.g., if the entries of the matrix $X$ are drawn from a (joint) distribution that has a Lebesgue density, then the columns of $X$ are in general position with probability 1 (Tibshirani 2013). The model $\hat{m}(y)$ selected by the Lasso and the sign-vector $\hat{s}(y)$ of non-zero Lasso coefficients can now formally be defined through
$$\hat{m}(y) = \big\{j : \hat\beta_j(y) \neq 0\big\} \quad\text{and}\quad \hat{s}(y) = \mathrm{sign}\big(\hat\beta_{\hat{m}(y)}(y)\big).$$
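As an aside, the selected model and sign-vector are easy to extract from any Lasso solver. The following is a minimal sketch of ours (not part of the paper), assuming scikit-learn, whose penalty parameterization differs from the display above by a factor of $1/n$:

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_model_and_signs(X, y, lam):
    """Return the Lasso minimizer, the selected model m(y), and the sign
    vector s(y). scikit-learn minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1,
    so alpha = lam/n matches the objective displayed above."""
    n = X.shape[0]
    beta = Lasso(alpha=lam / n, fit_intercept=False,
                 max_iter=100000, tol=1e-10).fit(X, y).coef_
    m = np.flatnonzero(beta)     # indices of the non-zero coefficients
    s = np.sign(beta[m])         # their signs
    return beta, m, s

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 20)), rng.standard_normal(100)
beta, m, s = lasso_model_and_signs(X, y, lam=1.0)
```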
Recall that $\mathcal{M}$ denotes the set of all possible submodels and set $\mathcal{S}_m = \{-1, 1\}^{|m|}$ for each $m \in \mathcal{M}$. For later use we also denote by $\mathcal{M}^+$ and $\mathcal{S}_m^+$ the collection of non-empty models, and the collection of corresponding sign-vectors, that occur with positive probability, i.e.,
$$\mathcal{M}^+ = \big\{m \in \mathcal{M} \setminus \{\emptyset\} : P_{\mu,\sigma^2}(\hat{m}(Y) = m) > 0\big\},$$
$$\mathcal{S}_m^+ = \big\{s \in \mathcal{S}_m : P_{\mu,\sigma^2}(\hat{m}(Y) = m, \hat{s}(Y) = s) > 0\big\} \qquad (m \in \mathcal{M} \setminus \{\emptyset\}).$$
These sets do not depend on $\mu$ and $\sigma^2$, as the measure $P_{\mu,\sigma^2}$ is equivalent to Lebesgue measure with respect to null sets. Also, our assumption that the columns of $X$ are in general position guarantees that $\mathcal{M}^+$ only contains models $m$ for which $X_m$ has column rank $|m|$ (Tibshirani 2013).
Inference is focused on a non-standard, model-dependent quantity of interest. Throughout the following, fix $m \in \mathcal{M}^+$ and let
$$\beta^m = E_{\mu,\sigma^2}\big[(X_m' X_m)^{-1} X_m' Y\big] = (X_m' X_m)^{-1} X_m' \mu.$$
For $\gamma^m \in \mathbb{R}^{|m|} \setminus \{0\}$, the goal is to construct a confidence interval for $\gamma^{m\prime} \beta^m$ with conditional coverage probability $1 - \alpha$ on the event $\{\hat{m} = m\}$. Clearly, the quantity of interest can also be written as $\gamma^{m\prime} \beta^m = \eta^{m\prime} \mu$ for $\eta^m = X_m (X_m' X_m)^{-1} \gamma^m$. For later use, write $P_{\eta^m}$ for the orthogonal projection on the space spanned by $\eta^m$.
At the core of the polyhedral method lies the observation that the event where $\hat{m} = m$ and where $\hat{s} = s$ describes a convex polytope in the sample space $\mathbb{R}^n$ (up to a Lebesgue null set): For each $m \in \mathcal{M}^+$ and each $s \in \mathcal{S}_m$, we have
$$\{y : \hat{m}(y) = m, \hat{s}(y) = s\} \;\overset{\text{a.s.}}{=}\; \{y : A_{m,s}\, y < b_{m,s}\}, \tag{2}$$
cf. Theorem 3.3 in Lee et al. (2016) (explicit formulas for the matrix $A_{m,s}$ and the vector $b_{m,s}$ are also repeated in Appendix C in our notation). Fix $z \in \mathbb{R}^n$ orthogonal to $\eta^m$. Then the set of $y$ satisfying $(I_n - P_{\eta^m}) y = z$ and $A_{m,s}\, y < b_{m,s}$ is either empty or a line segment. In either case, that set can be written as $\{z + c^m w : \mathcal{V}^-_{m,s}(z) < w < \mathcal{V}^+_{m,s}(z)\}$ with $c^m = \eta^m / \|\eta^m\|^2$. The endpoints satisfy $-\infty \leq \mathcal{V}^-_{m,s}(z) \leq \mathcal{V}^+_{m,s}(z) \leq \infty$ (see Lemma 4.1 of Lee et al. 2016; formulas for these quantities are also given in Appendix C in our notation). Now decompose $Y$ into the sum of two independent Gaussians $P_{\eta^m} Y$ and $(I_n - P_{\eta^m}) Y$, where the first one is a linear function of $\eta^{m\prime} Y \sim N(\eta^{m\prime}\mu, \sigma^2 \eta^{m\prime}\eta^m)$. With this, the conditional distribution of $\eta^{m\prime} Y$, conditional on the event that $\hat{m}(Y) = m$, $\hat{s}(Y) = s$ and $(I_n - P_{\eta^m}) Y = z$, is the conditional $N(\eta^{m\prime}\mu, \sigma^2 \eta^{m\prime}\eta^m)$-distribution, conditional on the set $(\mathcal{V}^-_{m,s}(z), \mathcal{V}^+_{m,s}(z))$ (in the sense that the latter conditional distribution is a regular conditional distribution if one starts with the conditional distribution of $\eta^{m\prime} Y$ given $\hat{m} = m$ and $\hat{s} = s$, which is always well-defined, and if one then conditions on the random variable $(I_n - P_{\eta^m}) Y$).
To use these observations for the construction of confidence sets, consider first the conditional distribution of a random variable $W \sim N(\theta, \varsigma^2)$ conditional on the event $W \in T$, where $\theta \in \mathbb{R}$, where $\varsigma^2 > 0$, and where $T \subseteq \mathbb{R}$ is the union of finitely many non-empty open intervals. Write $F^T_{\theta,\varsigma^2}(\cdot)$ for the cumulative distribution function (c.d.f.) of $W$ given $W \in T$. The corresponding law can be viewed as a 'truncated normal' distribution and will be denoted by $TN(\theta, \varsigma^2, T)$ in the following. A confidence interval for $\theta$ with coverage probability $1 - \alpha$ conditional on the event $W \in T$ is obtained by the usual method of collecting all values $\theta_0$ for which a hypothesis test of $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$ does not reject. In particular, for $w \in \mathbb{R}$, define $L(w)$ and $U(w)$ through
$$F^T_{L(w),\varsigma^2}(w) = 1 - \frac{\alpha}{2} \quad\text{and}\quad F^T_{U(w),\varsigma^2}(w) = \frac{\alpha}{2}$$
(these quantities are well-defined in view of Lemma A.2). With this, we have $P\big(\theta \in [L(W), U(W)] \,\big|\, W \in T\big) = 1 - \alpha$, irrespective of $\theta \in \mathbb{R}$.
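In code, this construction amounts to one-dimensional root finding in $\theta$. The following is a minimal sketch of ours (not the authors' implementation), assuming SciPy; the bracket of $\pm 20\varsigma$ is ad hoc and fails near the boundary of a bounded truncation set, where $L(w)$ and $U(w)$ diverge (cf. Figure 1 below):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def trunc_cdf(w, theta, sig, T):
    """c.d.f. F^T_{theta,sig^2}(w): N(theta, sig^2) conditioned on the union
    T = [(a_1,b_1), ..., (a_K,b_K)] of open intervals, evaluated at w."""
    num = den = 0.0
    for (a, b) in T:
        den += norm.cdf(b, theta, sig) - norm.cdf(a, theta, sig)
        num += norm.cdf(min(w, b), theta, sig) - norm.cdf(min(w, a), theta, sig)
    return num / den

def conf_interval(w, sig, T, alpha=0.05):
    """[L(w), U(w)] by inverting the truncated-normal c.d.f. in theta;
    the c.d.f. is strictly decreasing in theta (Lemma A.2), so a simple
    bracketing search suffices. The bracket width is a crude assumption."""
    lo, hi = w - 20 * sig, w + 20 * sig
    U = brentq(lambda t: trunc_cdf(w, t, sig, T) - alpha / 2, lo, hi)
    L = brentq(lambda t: trunc_cdf(w, t, sig, T) - (1 - alpha / 2), lo, hi)
    return L, U

# Example with the unbounded truncation set from the right panel of Figure 1:
T = [(-np.inf, -2.0), (-1.0, 1.0), (2.0, np.inf)]
print(conf_interval(w=0.5, sig=1.0, T=T))
```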
Fix $m \in \mathcal{M}^+$ and $s \in \mathcal{S}_m^+$, and let $\sigma^2_m = \sigma^2 \eta^{m\prime} \eta^m$ and $T_{m,s}(z) = (\mathcal{V}^-_{m,s}(z), \mathcal{V}^+_{m,s}(z))$ for $z$ orthogonal to $\eta^m$. With this, we have
$$\eta^{m\prime} Y \,\big|\, \{\hat{m} = m, \hat{s} = s, (I_n - P_{\eta^m}) Y = z\} \;\sim\; TN(\eta^{m\prime}\mu, \sigma^2_m, T_{m,s}(z))$$
for each $z \in \{(I_n - P_{\eta^m}) y : A_{m,s}\, y < b_{m,s}\}$. Now define $L_{m,s}(y)$ and $U_{m,s}(y)$ through
$$F^{T_{m,s}((I_n - P_{\eta^m})y)}_{L_{m,s}(y),\, \sigma^2_m}(\eta^{m\prime} y) = 1 - \frac{\alpha}{2} \quad\text{and}\quad F^{T_{m,s}((I_n - P_{\eta^m})y)}_{U_{m,s}(y),\, \sigma^2_m}(\eta^{m\prime} y) = \frac{\alpha}{2}$$
for each $y$ so that $A_{m,s}\, y < b_{m,s}$. Then, by construction, the random interval $[L_{m,s}(Y), U_{m,s}(Y)]$ covers $\eta^{m\prime}\mu$ with probability $1 - \alpha$ conditional on the event that $\hat{m} = m$ and $\hat{s} = s$.
In a similar fashion, fix $m \in \mathcal{M}^+$, set $T_m(z) = \bigcup_{s \in \mathcal{S}_m^+} T_{m,s}(z)$ for $z$ orthogonal to $\eta^m$, and define $L_m(y)$ and $U_m(y)$ through
$$F^{T_m((I_n - P_{\eta^m})y)}_{L_m(y),\, \sigma^2_m}(\eta^{m\prime} y) = 1 - \frac{\alpha}{2} \quad\text{and}\quad F^{T_m((I_n - P_{\eta^m})y)}_{U_m(y),\, \sigma^2_m}(\eta^{m\prime} y) = \frac{\alpha}{2}.$$
Again by construction, the random interval $[L_m(Y), U_m(Y)]$ covers $\eta^{m\prime}\mu$ with probability $1 - \alpha$ conditional on the event that $\hat{m} = m$.
Remark. (i) If $\hat{m} = \hat{m}(y)$ is any other model selection procedure, so that the event $\{y : \hat{m} = m\}$ is the union of a finite number of polyhedra (up to null sets), then the polyhedral method can be applied to obtain a confidence set for $\eta^{m\prime}\mu$ with conditional coverage probability $1 - \alpha$, conditional on the event $\{\hat{m} = m\}$, if that event has positive probability. Indeed, for such a model selection procedure, the arguments following (2) also apply, mutatis mutandis.
(ii) So far, we have defined confidence intervals only on the events $W \in T$; $\hat{m} \in \mathcal{M}^+$ and $\hat{s} \in \mathcal{S}_{\hat{m}}^+$; and $\hat{m} \in \mathcal{M}^+$, respectively. In the remaining cases, the interval endpoints (and the corresponding quantity of interest) can be chosen arbitrarily (measurable) without affecting our results. It is easy to choose constructions so that one obtains meaningful confidence intervals that are defined everywhere in sample space.
(iii) In Theorem 3.3 of Lee et al. (2016), relation (2) is stated as an equality, not as an equality up to null sets, and with the right-hand side replaced by $\{y : A_{m,s}\, y \leq b_{m,s}\}$ (in our notation). Because (2) differs from this only on a Lebesgue null set, the difference is inconsequential for the purpose of the present paper. The statement in Lee et al. (2016) is based on the fact that $\hat{m}$ was defined as the equicorrelation set (Tibshirani 2013) in that paper. But if $\hat{m}$ is the equicorrelation set, then there can exist vectors $y \in \{\hat{m} = m\}$ such that some coefficients of $\hat\beta(y)$ are zero, which clashes with the idea that $\hat{m}$ contains those variables whose Lasso coefficients are non-zero. However, for any $m \in \mathcal{M}^+$, the set of such $y$'s is a Lebesgue null set.
3 Core results
We first analyze the simple confidence set $[L(W), U(W)]$ that was introduced in the preceding section, which covers $\theta$ with probability $1 - \alpha$, conditional on $W \in T$, where $W \sim N(\theta, \varsigma^2)$. By assumption, $T$ is of the form $T = \bigcup_{i=1}^K (a_i, b_i)$ where $K < \infty$ and $-\infty \leq a_1 < b_1 < \dots < a_K < b_K \leq \infty$. Figure 1 exemplifies the length of $[L(w), U(w)]$ when $T$ is bounded (left panel) and when $T$ is unbounded (right panel). The dashed line is the length of the standard (unconditional) confidence interval for $\theta$. In the left panel, we see that the length of $[L(w), U(w)]$ diverges as $w$ approaches the far left or the far right boundary point of the truncation set (i.e., $-3$ and $3$). On the other hand, in the right panel we see that the length of $[L(w), U(w)]$ is bounded and converges to the length of the standard interval as $|w| \to \infty$.
[Figure 1 appears here; the two panels plot $U(w) - L(w)$ against $w$.]

Figure 1: Length of the interval $[L(w), U(w)]$ for the case where $T = (-3,-2) \cup (-1,1) \cup (2,3)$ (left panel) and the case where $T = (-\infty,-2) \cup (-1,1) \cup (2,\infty)$ (right panel). In both cases, we took $\varsigma^2 = 1$ and $\alpha = 0.05$.
Write $\Phi(w)$ and $\phi(w)$ for the c.d.f. and p.d.f. of the standard normal distribution, respectively, where we adopt the usual convention that $\Phi(-\infty) = 0$ and $\Phi(\infty) = 1$.
Proposition 1. If $T$ is bounded either from above or from below, then
$$E[U(W) - L(W) \,|\, W \in T] = \infty.$$
If $T$ is unbounded from above and from below, then
$$\frac{U(W) - L(W)}{\varsigma} \;\overset{\text{a.s.}}{\leq}\; 2\,\Phi^{-1}\!\big(1 - p^*\alpha/2\big) \;\leq\; 2\,\Phi^{-1}\!\big(1 - \alpha/2\big) + \frac{a_K - b_1}{\varsigma},$$
where $p^* = \inf_{\vartheta \in \mathbb{R}} P(N(\vartheta, \varsigma^2) \in T)$ and where $a_K - b_1$ is to be interpreted as $0$ in case $K = 1$. [The first inequality trivially continues to hold if $T$ is bounded, as then $p^* = 0$.]
Intuitively, one expects confidence intervals to be wide if one conditions on a bounded set, because extreme values cannot be observed on a bounded set and the confidence intervals have to take this into account. We find that the conditional expected length is infinite in this case. If, for example, $T$ is bounded from below, i.e., if $-\infty < a_1$, then the first statement in the proposition follows from two facts: First, the length $U(w) - L(w)$ behaves like $1/(w - a_1)$ as $w$ approaches $a_1$ from above; and, second, the p.d.f. of the truncated normal distribution at $w$ is bounded away from zero as $w$ approaches $a_1$ from above. See the proof in Section B for a more detailed version of this argument. On the other hand, if the truncation set is unbounded, extreme values are observable and confidence intervals, therefore, do not have to be extremely wide. The second upper bound provided by the proposition for that case will be useful later.
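A quick numerical illustration (our sketch, not from the paper): for $T = (-3, 3)$ the product $(b_K - w)(U(w) - L(w))$ settles near $\varsigma^2 \log((1-\alpha/2)/(\alpha/2)) \approx 3.66$ as $w$ approaches the boundary, in line with the proof in Appendix B. Since $L(w)$ and $U(w)$ grow like $1/(b_K - w)$, the c.d.f. is evaluated via log-tails to survive the extreme arguments; the bracketing bounds are ad hoc:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import log_ndtr

a1, b1, sig, alpha = -3.0, 3.0, 1.0, 0.05   # single bounded interval T = (a1, b1)

def F(w, theta):
    """c.d.f. of N(theta, sig^2) truncated to (a1, b1), computed via log-tails
    so that it survives the very large theta arising near the boundary."""
    lw = log_ndtr((w - theta) / sig)
    la = log_ndtr((a1 - theta) / sig)
    lb = log_ndtr((b1 - theta) / sig)
    return (np.exp(lw - lb) - np.exp(la - lb)) / (1.0 - np.exp(la - lb))

def interval_length(w):
    # U(w), L(w) solve F_theta(w) = alpha/2 resp. 1 - alpha/2; F decreases in theta.
    U = brentq(lambda t: F(w, t) - alpha / 2, w - 10, 5000.0)
    L = brentq(lambda t: F(w, t) - (1 - alpha / 2), w - 10, 5000.0)
    return U - L

for w in [2.9, 2.99, 2.999]:                # w approaching b1 from below
    print(w, (b1 - w) * interval_length(w))
print(sig**2 * np.log((1 - alpha / 2) / (alpha / 2)))   # limiting value, ~3.664
```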
We see that the boundedness of the truncation set $T$ is critical for the interval length. When the Lasso is used as a model selector, this prompts the question of whether the truncation sets $T_{m,s}(z)$ and $T_m(z)$ are bounded or not, because the intervals $[L_{m,s}(y), U_{m,s}(y)]$ and $[L_m(y), U_m(y)]$ are obtained from conditional normal distributions with truncation sets $T_{m,s}((I_n - P_{\eta^m})y)$ and $T_m((I_n - P_{\eta^m})y)$, respectively. For $m \in \mathcal{M}^+$, $s \in \mathcal{S}_m^+$, and $z$ orthogonal to $\eta^m$, recall that $T_{m,s}(z) = (\mathcal{V}^-_{m,s}(z), \mathcal{V}^+_{m,s}(z))$, and that $T_m(z)$ is the union of these intervals over $s \in \mathcal{S}_m^+$. Write $[\eta^m]^\perp$ for the orthogonal complement of the span of $\eta^m$.
Proposition 2. For each $m \in \mathcal{M}^+$ and each $s \in \mathcal{S}_m$, we have
$$\forall z \in [\eta^m]^\perp: \ -\infty < \mathcal{V}^-_{m,s}(z) \qquad\text{or}\qquad \forall z \in [\eta^m]^\perp: \ \mathcal{V}^+_{m,s}(z) < \infty,$$
or both.
For the confidence interval $[L_{m,s}(Y), U_{m,s}(Y)]$, the statement in (1) now follows immediately: If $m$ is a non-empty model and $s$ is a sign-vector so that the event $\{\hat{m} = m, \hat{s} = s\}$ has positive probability, then $m \in \mathcal{M}^+$ and $s \in \mathcal{S}_m^+$. Now Proposition 2 entails that $T_{m,s}((I_n - P_{\eta^m})Y)$ is almost surely bounded from above or from below on the event $\{\hat{m} = m, \hat{s} = s\}$, and Proposition 1 entails that (1) holds.
For the confidence interval $[L_m(Y), U_m(Y)]$, we obtain that its conditional expected length is finite, conditional on $\hat{m} = m$ with $m \in \mathcal{M}^+$, if and only if its corresponding truncation set $T_m((I_n - P_{\eta^m})Y)$ is almost surely unbounded from above and from below on that event. More precisely, for $m \in \mathcal{M}^+$, we have
$$E_{\mu,\sigma^2}[U_m(Y) - L_m(Y) \,|\, \hat{m} = m] = \infty \tag{3}$$
if and only if there exist an $s \in \mathcal{S}_m^+$ and a vector $y$ satisfying $A_{m,s}\, y < b_{m,s}$, so that
$$T_m((I_n - P_{\eta^m})y) \ \text{ is bounded from above or from below.} \tag{4}$$
Before proving this equivalence, recall that $T_m((I_n - P_{\eta^m})y)$ is the union of the intervals $(\mathcal{V}^-_{m,s}((I_n - P_{\eta^m})y), \mathcal{V}^+_{m,s}((I_n - P_{\eta^m})y))$ with $s \in \mathcal{S}_m^+$. Inspection of the explicit formulas for the interval endpoints given in Appendix C now immediately reveals the following: The lower endpoint $\mathcal{V}^-_{m,s}((I_n - P_{\eta^m})y)$ is either constant equal to $-\infty$ on the set $\{y : A_{m,s}\, y < b_{m,s}\}$, or it is the maximum of a finite number of linear functions of $y$ (and hence finite and continuous) on that set. Similarly, the upper endpoint $\mathcal{V}^+_{m,s}((I_n - P_{\eta^m})y)$ is either constant equal to $\infty$ on that set, or it is the minimum of a finite number of linear functions of $y$ (and hence finite and continuous) on that set.
To prove the equivalence, we first assume, for some $s$ and $y$ with $s \in \mathcal{S}_m^+$ and $A_{m,s}\, y < b_{m,s}$, that the set in (4) is bounded from above (the case of boundedness from below is similar). Then there is an open neighborhood $O$ of $y$, so that each point $w \in O$ satisfies $A_{m,s}\, w < b_{m,s}$ and also so that $T_m((I_n - P_{\eta^m})w)$ is bounded from above. Because $O$ has positive Lebesgue measure, (3) now follows from Proposition 1. To prove the converse, assume for each $s \in \mathcal{S}_m^+$ and each $y$ satisfying $A_{m,s}\, y < b_{m,s}$ that $T_m((I_n - P_{\eta^m})y)$ is unbounded from above and from below. Because the sets $\{y : A_{m,s}\, y < b_{m,s}\}$ for $s \in \mathcal{S}_m^+$ are disjoint by construction, the same is true for the sets $T_{m,s}((I_n - P_{\eta^m})y)$ for $s \in \mathcal{S}_m^+$. Using Proposition 1, we then obtain that $U_m(Y) - L_m(Y)$ is bounded by a linear function of
$$\max\{\mathcal{V}^-_{m,s}((I_n - P_{\eta^m})Y) : s \in \mathcal{S}_m^+\} \;-\; \min\{\mathcal{V}^+_{m,s}((I_n - P_{\eta^m})Y) : s \in \mathcal{S}_m^+\}$$
Lebesgue almost everywhere on the event $\{\hat{m} = m\}$. (The maximum and the minimum in the preceding display correspond to $a_K$ and $b_1$, respectively, in Proposition 1.) It remains to show that the expression in the preceding display has finite conditional expectation on the event $\{\hat{m} = m\}$. But this expression is the maximum of a finite number of Gaussians minus the minimum of a finite number of Gaussians. Its unconditional expectation, and hence also its conditional expectation on the event $\{\hat{m} = m\}$, is finite.
In order to infer (3) from (4), the latter condition needs to be checked for every point $y$ in a union of polyhedra. While this is easy in some simple examples like, say, the situation depicted in Figure 1 of Lee et al. (2016), searching over polyhedra in $\mathbb{R}^n$ is hard in general. In our simulations, we therefore use a simpler sufficient condition that implies (3): After observing the data, i.e., after observing a particular value $y^*$ of $Y$, and hence also observing $\hat{m}(y^*) = m$ and $\hat{s}(y^*) = s$, we check whether $T_m((I_n - P_{\eta^m})y^*)$ is bounded from above or from below (and also whether $A_{m,s}\, y^* < b_{m,s}$, which, if satisfied, entails that $m \in \mathcal{M}^+$ and that $s \in \mathcal{S}_m^+$). If this is the case, then it follows, ex post, that (3) holds. Note that these computations occur naturally during the computation of $[L_m(y^*), U_m(y^*)]$ and can hence be performed as a safety precaution with little extra effort.
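The following is a minimal numerical sketch of this ex-post check, ours rather than the authors' code. It assumes NumPy, uses a crude tolerance for the zero tests, enumerates all $2^{|m|}$ sign vectors (feasible only for small models), and defers the construction of $(A_{m,s}, b_{m,s})$ to a helper `make_ab` whose explicit form is sketched in Appendix C:

```python
import itertools
import numpy as np

def v_bounds(A, b, eta, z):
    """Endpoints of {w : A(z + c*w) < b} with c = eta/||eta||^2, following the
    formulas for V^-, V^+ and V^0 given in Appendix C."""
    c = eta / (eta @ eta)
    Ac = A @ c
    r = b - A @ z
    neg, pos, zero = Ac < -1e-12, Ac > 1e-12, np.abs(Ac) <= 1e-12
    lo = np.max(r[neg] / Ac[neg]) if neg.any() else -np.inf
    hi = np.min(r[pos] / Ac[pos]) if pos.any() else np.inf
    feasible = bool(np.all(r[zero] > 0))          # the V^0 > 0 condition
    return lo, hi, feasible

def tm_is_bounded_one_side(make_ab, m, eta, z):
    """Check, ex post, whether T_m(z), the union of (V^-_{m,s}(z), V^+_{m,s}(z))
    over sign vectors s contributing a non-empty interval, is bounded from above
    or from below, i.e., whether condition (4) holds at the observed data.
    `make_ab(m, s)` must return (A_{m,s}, b_{m,s})."""
    los, his = [], []
    for s in itertools.product([-1.0, 1.0], repeat=len(m)):
        A, b = make_ab(m, np.array(s))
        lo, hi, ok = v_bounds(A, b, eta, z)
        if ok and lo < hi:                        # this s contributes an interval
            los.append(lo)
            his.append(hi)
    if not los:                                   # empty T_m(z); should not occur
        return False                              # for observed data
    return min(los) > -np.inf or max(his) < np.inf
```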
Remark. If $\hat{m}$ is any other model selection procedure, so that the event $\{y : \hat{m} = m\}$ is the union of a finite number of polyhedra (up to null sets), then the polyhedral method can be applied to obtain a confidence set for $\eta^{m\prime}\mu$ with conditional coverage probability $1 - \alpha$, conditional on the event $\{\hat{m} = m\}$, if that event has positive probability. Clearly, for such a model selection procedure, an equivalence similar to (3)-(4) holds. Indeed, the derivation of this equivalence relies on Proposition 1 but not on the Lasso-specific Proposition 2.
3.1 Simulation results
To investigate whether condition (4) is restrictive or not, we perform an exploratory simulation exercise consisting of repeated samples of size $n = 100$ in the nine scenarios corresponding to the rows in Table 1, which cover all combinations of the cases $p = 20$, $p = 50$ and $p = 200$ (i.e., $p$ small, moderate, large) with the cases $\lambda = 0.1$, $\lambda = 1$ and $\lambda = 10$ (i.e., $\lambda$ small, moderate, large). For each of the nine scenarios, we generate an $n \times p$ regressor matrix $X$ whose rows follow a $p$-variate Gaussian distribution with mean zero, so that the diagonal elements of the covariance matrix all equal 1 and the off-diagonal elements all equal 0.2. We then generate an $n$-vector $y^*$ whose entries are i.i.d. standard Gaussians, compute the Lasso estimator $\hat\beta(y^*)$ and the resulting selected model $m = \hat{m}(y^*)$ (if $m = \emptyset$, this process is repeated with a newly generated vector $y^*$). Finally, we generate 2000 directions $\gamma^m_j$ that are i.i.d. uniform on the unit sphere in $\mathbb{R}^{|m|}$ and set $\eta^m_j = X_m (X_m' X_m)^{-1} \gamma^m_j$, $1 \leq j \leq 2000$. For each $\eta^m_j$ we now check if the sufficient condition outlined in the preceding paragraph is satisfied with $\eta^m_j$ replacing $\eta^m$. If it is satisfied, the corresponding confidence set $[L_m(Y), U_m(Y)]$ (with $\eta^m_j$ replacing $\eta^m$) is guaranteed to have infinite expected length conditional on the event that $\hat{m} = m$. The fraction of indices $j$, $1 \leq j \leq 2000$, for which this is the case, together with the number of parameters in the selected model, are displayed in the cells of Table 1, for 50 independent replications of $y^*$ in each scenario (row). In each of the nine scenarios, and for each of the 50 replications in each, this fraction estimates a lower bound for the probability that the confidence interval $[L_m(Y), U_m(Y)]$ has infinite expected length conditional on $\hat{m} = m$ if the direction of interest, i.e., $\gamma^m$ or, equivalently, $\eta^m$, is chosen at random.
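For concreteness, here is a hedged sketch of one replication (one cell of Table 1) along the lines just described; it assumes scikit-learn for the Lasso step and reuses `tm_is_bounded_one_side` from the sketch in Section 3 and `make_A_b` from the sketch in Appendix C. The seed and helper names are ours, not the authors':

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, lam = 100, 20, 1.0

# Rows of X: i.i.d. p-variate Gaussian, unit variances, all correlations 0.2.
Sigma = np.full((p, p), 0.2) + 0.8 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

m = np.array([], dtype=int)
while m.size == 0:   # redraw y* until the selected model is non-empty
    y = rng.standard_normal(n)
    beta = Lasso(alpha=lam / n, fit_intercept=False, max_iter=100000).fit(X, y).coef_
    m = np.flatnonzero(beta)

Xm = X[:, m]
G = np.linalg.inv(Xm.T @ Xm)
make_ab = lambda mm, ss: make_A_b(X, mm, ss, lam)   # helper from the Appendix C sketch
hits = 0
for _ in range(2000):
    gamma = rng.standard_normal(m.size)
    gamma /= np.linalg.norm(gamma)                  # uniform direction on the sphere
    eta = Xm @ (G @ gamma)
    z = y - eta * (eta @ y) / (eta @ eta)           # z = (I_n - P_eta) y*
    hits += tm_is_bounded_one_side(make_ab, m, eta, z)
print(hits / 2000)   # fraction reported in one cell of Table 1
```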
As soon as the selected model excludes more than a few variables, we see that the conditional expected length of $[L_m(Y), U_m(Y)]$ is guaranteed to be infinite in a substantial number of cases. In particular, this always occurs if $p > n$. (Also keep in mind that we only check a sufficient condition for infinite expected length, not a necessary one.) Also, within each row of the table, the number of cases with infinite conditional expected length is roughly increasing as the number of parameters in the selected model decreases. Beyond these observations, there appears to be no simple relation between the number of selected variables in the model and the percentage of cases where the interval has infinite conditional expected length. We also stress here that a simulation study cannot be exhaustive and that other simulation scenarios will give different results.

  p    λ     y*_1        y*_2        y*_3        ...   y*_48       y*_49       y*_50
  20   0.1   0%   (20)   0%   (20)   0%   (20)   ...   80%  (19)   100% (19)   100% (19)
  20   1     0%   (20)   0%   (20)   0%   (20)   ...   100% (16)   100% (16)   100% (14)
  20   10    49%  (4)    60%  (7)    72%  (8)    ...   100% (3)    100% (3)    100% (3)
  50   0.1   0%   (50)   0%   (50)   0%   (50)   ...   100% (48)   100% (47)   100% (47)
  50   1     96%  (47)   99%  (44)   100% (47)   ...   100% (39)   100% (38)   100% (37)
  50   10    100% (18)   100% (18)   100% (17)   ...   100% (7)    100% (4)    100% (3)
  200  0.1   100% (100)  100% (100)  100% (100)  ...   100% (97)   100% (97)   100% (95)
  200  1     100% (96)   100% (94)   100% (94)   ...   100% (84)   100% (83)   100% (82)
  200  10    100% (41)   100% (37)   100% (36)   ...   100% (21)   100% (20)   100% (19)

Table 1: Percentage of cases where $\eta^m$ is such that the confidence interval $[L_m(Y), U_m(Y)]$ for $\eta^{m\prime}\mu$ is guaranteed to have infinite expected length conditional on $\hat{m} = m$, with $m = \hat{m}(y^*_i)$ and, in parentheses, the number of parameters in the model, i.e., $|m|$. The entries in each row are ordered to improve readability, first by percentage (increasing) and second by number of parameters (decreasing).
3.2 The unknown variance case
Suppose here that $\sigma^2 > 0$ is unknown and that $\hat\sigma^2$ is an estimator for $\sigma^2$. Fix $m \in \mathcal{M}^+$ and $s \in \mathcal{S}_m^+$. Note that the set $\{y : A_{m,s}\, y < b_{m,s}\}$ does not depend on $\sigma^2$, and hence also $\mathcal{V}^-_{m,s}((I_n - P_{\eta^m})y)$ and $\mathcal{V}^+_{m,s}((I_n - P_{\eta^m})y)$ do not depend on $\sigma^2$. For each $\varsigma^2 > 0$ and for each $y$ so that $A_{m,s}\, y < b_{m,s}$, define $L_{m,s}(y, \varsigma^2)$, $U_{m,s}(y, \varsigma^2)$, $L_m(y, \varsigma^2)$, and $U_m(y, \varsigma^2)$ like
$L_{m,s}(y)$, $U_{m,s}(y)$, $L_m(y)$, and $U_m(y)$ in Section 2, with $\varsigma^2$ replacing $\sigma^2$ in the formulas. (Note that, say, $L_{m,s}(y)$ depends on $\sigma^2$ through $\sigma^2_m = \sigma^2 \eta^{m\prime}\eta^m$.) The asymptotic coverage probability of the intervals $[L_{m,s}(Y, \hat\sigma^2), U_{m,s}(Y, \hat\sigma^2)]$ and $[L_m(Y, \hat\sigma^2), U_m(Y, \hat\sigma^2)]$, conditional on the events $\{\hat{m} = m, \hat{s} = s\}$ and $\{\hat{m} = m\}$, respectively, is discussed in Lee et al. (2016). If $\hat\sigma^2$ is independent of $\eta^{m\prime} Y$ and positive with positive probability, then it is easy to see that (1) continues to hold with $[L_{m,s}(Y, \hat\sigma^2), U_{m,s}(Y, \hat\sigma^2)]$ replacing $[L_{m,s}(Y), U_{m,s}(Y)]$ for each $m \in \mathcal{M}^+$ and each $s \in \mathcal{S}_m^+$. And if, in addition, $\hat\sigma^2$ has finite mean conditional on the event $\{\hat{m} = m\}$ for $m \in \mathcal{M}^+$, then it is elementary to verify that the equivalence (3)-(4) continues to hold with $[L_m(Y, \hat\sigma^2), U_m(Y, \hat\sigma^2)]$ replacing $[L_m(Y), U_m(Y)]$ (upon repeating the arguments following (3)-(4) and upon using the finite conditional mean of $\hat\sigma^2$ in the last step).

In the case where $p < n$, the usual variance estimator $\|Y - X(X'X)^{-1}X'Y\|^2/(n - p)$ is independent of $\eta^{m\prime} Y$, is positive with probability 1, and has finite unconditional (and hence also conditional) mean. For variance estimators in the case where $p \geq n$, we refer to Lee et al. (2016) and the references therein.
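As a small aside (our sketch, not from the paper), this estimator is immediate to compute when $p < n$:

```python
import numpy as np

def sigma2_hat(X, y):
    """Usual variance estimator ||y - X(X'X)^{-1}X'y||^2 / (n - p), for p < n.
    It is a function of the least-squares residuals (I - P_X)y only, and is
    therefore independent of eta'Y whenever eta lies in the column space of X."""
    n, p = X.shape
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return (resid @ resid) / (n - p)
```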
Appendix A Auxiliary results
In this section, we collect some properties of functions like $F^T_{\theta,\varsigma^2}(w)$ that will be needed in the proofs of Proposition 1 and Proposition 2. The following result will be used repeatedly and is easily verified using L'Hospital's method.

Lemma A.1. For all $a, b$ with $-\infty \leq a < b \leq \infty$, the following holds:
$$\lim_{\theta \to \infty} \frac{\Phi(a - \theta)}{\Phi(b - \theta)} = 0.$$
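For finite $a$ and $b$, the following short computation (a supplementary sketch of ours, using the Mills-ratio asymptotics $\Phi(-x) \sim \phi(x)/x$ from Feller 1957, Lemma VII.1.2, which also appear in Appendix B) makes the limit explicit; the cases $a = -\infty$ or $b = \infty$ are immediate:
$$\frac{\Phi(a-\theta)}{\Phi(b-\theta)} \;\sim\; \frac{\phi(\theta-a)/(\theta-a)}{\phi(\theta-b)/(\theta-b)} \;=\; \frac{\theta-b}{\theta-a}\,\exp\!\Big(\tfrac{1}{2}(a-b)(2\theta-a-b)\Big) \;\longrightarrow\; 0 \quad (\theta \to \infty),$$
since $a - b < 0$.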
Write $F^T_{\theta,\varsigma^2}(w)$ and $f^T_{\theta,\varsigma^2}(w)$ for the c.d.f. and p.d.f. of the $TN(\theta, \varsigma^2, T)$-distribution, where $T = \bigcup_{i=1}^K (a_i, b_i)$ with $-\infty \leq a_1 < b_1 < a_2 < \dots < a_K < b_K \leq \infty$. For $w \in T$ and for $k$ so that $a_k < w < b_k$, we have
$$F^T_{\theta,\varsigma^2}(w) \;=\; \frac{\Phi\big(\tfrac{w-\theta}{\varsigma}\big) - \Phi\big(\tfrac{a_k-\theta}{\varsigma}\big) + \sum_{i=1}^{k-1}\Big[\Phi\big(\tfrac{b_i-\theta}{\varsigma}\big) - \Phi\big(\tfrac{a_i-\theta}{\varsigma}\big)\Big]}{\sum_{i=1}^{K}\Big[\Phi\big(\tfrac{b_i-\theta}{\varsigma}\big) - \Phi\big(\tfrac{a_i-\theta}{\varsigma}\big)\Big]}\,;$$
if $k = 1$, the sum in the numerator is to be interpreted as $0$. And for $w$ as above, the density $f^T_{\theta,\varsigma^2}(w)$ is equal to $\phi((w-\theta)/\varsigma)/\varsigma$ divided by the denominator in the preceding display.
Lemma A.2. For each fixed $w \in T$, $F^T_{\theta,\varsigma^2}(w)$ is continuous and strictly decreasing in $\theta$, and
$$\lim_{\theta \to -\infty} F^T_{\theta,\varsigma^2}(w) = 1 \qquad\text{and}\qquad \lim_{\theta \to \infty} F^T_{\theta,\varsigma^2}(w) = 0.$$

Proof. Continuity is obvious, and monotonicity has been shown in Lee et al. (2016) for the case where $T$ is a single interval, i.e., $K = 1$; it is easy to adapt that argument to also cover the case $K > 1$. Next consider the formula for $F^T_{\theta,\varsigma^2}(w)$. As $\theta \to \infty$, Lemma A.1 implies that the leading term in the numerator is $\Phi((w-\theta)/\varsigma)$ while the leading term in the denominator is $\Phi((b_K-\theta)/\varsigma)$. Using Lemma A.1 again gives $\lim_{\theta\to\infty} F^T_{\theta,\varsigma^2}(w) = 0$. Finally, it is easy to see that $F^T_{\theta,\varsigma^2}(w) = 1 - F^{-T}_{-\theta,\varsigma^2}(-w)$ (upon using the relation $\Phi(t) = 1 - \Phi(-t)$ and a little algebra). With this, we also obtain that $\lim_{\theta\to-\infty} F^T_{\theta,\varsigma^2}(w) = 1$.
For $\gamma \in (0,1)$ and $w \in T$, define $Q_\gamma(w)$ through
$$F^T_{Q_\gamma(w),\varsigma^2}(w) = \gamma.$$
Lemma A.2 ensures that $Q_\gamma(w)$ is well-defined. Note that $L(w) = Q_{1-\alpha/2}(w)$ and $U(w) = Q_{\alpha/2}(w)$.
Lemma A.3. For fixed $w \in T$, $Q_\gamma(w)$ is strictly decreasing in $\gamma$ on $(0,1)$. And for fixed $\gamma \in (0,1)$, $Q_\gamma(w)$ is continuous and strictly increasing in $w \in T$, with $\lim_{w \searrow a_1} Q_\gamma(w) = -\infty$ and $\lim_{w \nearrow b_K} Q_\gamma(w) = \infty$.

Proof. Fix $w \in T$. Strict monotonicity of $Q_\gamma(w)$ in $\gamma$ follows from strict monotonicity of $F^T_{\theta,\varsigma^2}(w)$ in $\theta$; cf. Lemma A.2.

Fix $\gamma \in (0,1)$ throughout the following. To show that $Q_\gamma(\cdot)$ is strictly increasing on $T$, fix $w, w' \in T$ with $w < w'$. We get
$$\gamma = F^T_{Q_\gamma(w),\varsigma^2}(w) < F^T_{Q_\gamma(w),\varsigma^2}(w'),$$
where the inequality holds because the density of $F^T_{Q_\gamma(w),\varsigma^2}(\cdot)$ is positive on $T$. The definition of $Q_\gamma(w')$ and Lemma A.2 entail that $Q_\gamma(w) < Q_\gamma(w')$.

To show that $Q_\gamma(\cdot)$ is continuous on $T$, we first note that $F^T_{\theta,\varsigma^2}(w)$ is continuous in $(\theta, w) \in \mathbb{R} \times T$ (which is easy to see from the formula for $F^T_{\theta,\varsigma^2}(w)$ given after Lemma A.1). Now fix $w \in T$. Because $Q_\gamma(\cdot)$ is monotone, it suffices to show that $Q_\gamma(w_n) \to Q_\gamma(w)$ for any increasing sequence $w_n$ in $T$ converging to $w$ from below, and for any decreasing sequence $w_n$ converging to $w$ from above. If the $w_n$ increase towards $w$ from below, the sequence $Q_\gamma(w_n)$ is increasing and bounded by $Q_\gamma(w)$ from above, so that $Q_\gamma(w_n)$ converges to a finite limit $Q$. With this, and because $F^T_{\theta,\varsigma^2}(w)$ is continuous in $(\theta, w)$, it follows that
$$\lim_n F^T_{Q_\gamma(w_n),\varsigma^2}(w_n) = F^T_{Q,\varsigma^2}(w).$$
In the preceding display, the sequence on the left-hand side is constant equal to $\gamma$ by definition of $Q_\gamma(w_n)$, so that $F^T_{Q,\varsigma^2}(w) = \gamma$. It follows that $Q = Q_\gamma(w)$. If the $w_n$ decrease towards $w$ from above, a similar argument applies.
To show that $\lim_{w \nearrow b_K} Q_\gamma(w) = \infty$, let $w_n$, $n \geq 1$, be an increasing sequence in $T$ that converges to $b_K$. It follows that $Q_\gamma(w_n)$ converges to a (not necessarily finite) limit $Q$ as $n \to \infty$. If $Q < \infty$, we get for each $b < b_K$ that
$$\liminf_n F^T_{Q_\gamma(w_n),\varsigma^2}(w_n) \;\geq\; \liminf_n F^T_{Q_\gamma(w_n),\varsigma^2}(b) \;=\; F^T_{Q,\varsigma^2}(b).$$
In this display, the inequality holds because $F^T_{Q_\gamma(w_n),\varsigma^2}(\cdot)$ is a c.d.f., and the equality holds because $F^T_{\theta,\varsigma^2}(b)$ is continuous in $\theta$. As this holds for each $b < b_K$, we obtain that $\liminf_n F^T_{Q_\gamma(w_n),\varsigma^2}(w_n) = 1$. But in this equality, the left-hand side equals $\gamma$, a contradiction. By similar arguments, it also follows that $\lim_{w \searrow a_1} Q_\gamma(w) = -\infty$.
Lemma A.4. The function $Q_\gamma(\cdot)$ satisfies
$$\lim_{w \nearrow b_K} (b_K - w)\, Q_\gamma(w) = -\varsigma^2 \log(\gamma) \ \text{ if } b_K < \infty, \qquad \lim_{w \searrow a_1} (a_1 - w)\, Q_\gamma(w) = -\varsigma^2 \log(1-\gamma) \ \text{ if } a_1 > -\infty.$$

Proof. As both statements follow from similar arguments, we only give the details for the first one. As $w$ approaches $b_K$ from below, $Q_\gamma(w)$ converges to $\infty$ by Lemma A.3. This observation, the fact that $F^T_{Q_\gamma(w),\varsigma^2}(w) = \gamma$ holds for each $w$, and Lemma A.1 together imply that
$$\lim_{w \nearrow b_K} \frac{\Phi\big(\tfrac{w - Q_\gamma(w)}{\varsigma}\big)}{\Phi\big(\tfrac{b_K - Q_\gamma(w)}{\varsigma}\big)} = \gamma.$$
Because $\Phi(-x)/(\phi(x)/x) \to 1$ as $x \to \infty$ (cf. Feller 1957, Lemma VII.1.2), we get that
$$\lim_{w \nearrow b_K} \frac{\phi\big(\tfrac{w - Q_\gamma(w)}{\varsigma}\big)}{\phi\big(\tfrac{b_K - Q_\gamma(w)}{\varsigma}\big)} = \gamma.$$
The claim now follows by plugging in the formula for $\phi(\cdot)$ on the left-hand side, simplifying, and then taking the logarithm of both sides.
Appendix B Proof of Proposition 1
Proof of the first statement in Proposition 1. Assume that $b_K < \infty$ (the case where $a_1 > -\infty$ is treated similarly). Lemma A.4 entails that $\lim_{w \nearrow b_K} (b_K - w)(U(w) - L(w)) = \varsigma^2 C$, where $C = \log((1-\alpha/2)/(\alpha/2))$ is positive. Hence, there exists a constant $\varepsilon > 0$ so that
$$U(w) - L(w) \;>\; \frac{1}{2}\,\frac{\varsigma^2 C}{b_K - w}$$
whenever $w \in (b_K - \varepsilon, b_K) \cap T$. Set $B = \inf\{f^T_{\theta,\varsigma^2}(w) : w \in (b_K - \varepsilon, b_K) \cap T\}$. For $w \in T$, $f^T_{\theta,\varsigma^2}(w)$ is a Gaussian density divided by a constant scaling factor, so that $B > 0$. Because $U(w) - L(w) \geq 0$ in view of Lemma A.3, we obtain that
$$E_{\theta,\varsigma^2}[U(W) - L(W) \,|\, W \in T] \;\geq\; \frac{\varsigma^2 B C}{2} \int_{(b_K - \varepsilon,\, b_K) \cap T} \frac{1}{b_K - w}\, dw \;=\; \infty.$$
Proof of the first inequality in Proposition 1. Define $R_\gamma(w)$ through $\Phi((w - R_\gamma(w))/\varsigma) = \gamma$, i.e., $R_\gamma(w) = w - \varsigma\,\Phi^{-1}(\gamma)$. Then, on the one hand, we have
$$F^T_{R_\gamma(w),\varsigma^2}(w) \;=\; \frac{P\big(N(R_\gamma(w), \varsigma^2) \leq w,\, N(R_\gamma(w), \varsigma^2) \in T\big)}{P\big(N(R_\gamma(w), \varsigma^2) \in T\big)} \;\leq\; \frac{P\big(N(R_\gamma(w), \varsigma^2) \leq w\big)}{\inf_\vartheta P\big(N(\vartheta, \varsigma^2) \in T\big)} \;=\; \frac{\gamma}{p^*},$$
while, on the other,
$$F^T_{R_\gamma(w),\varsigma^2}(w) \;\geq\; \frac{P\big(N(R_\gamma(w), \varsigma^2) \leq w\big) - P\big(N(R_\gamma(w), \varsigma^2) \notin T\big)}{P\big(N(R_\gamma(w), \varsigma^2) \in T\big)} \;\geq\; \inf_\vartheta \frac{P\big(N(R_\gamma(w), \varsigma^2) \leq w\big) - 1 + P\big(N(\vartheta, \varsigma^2) \in T\big)}{P\big(N(\vartheta, \varsigma^2) \in T\big)} \;=\; \frac{\gamma - 1 + p^*}{p^*}.$$
The inequalities in the preceding two displays, together with the fact that $F^T_{\theta,\varsigma^2}(w)$ is decreasing in $\theta$, imply that
$$R_{1 - p^*(1-\gamma)}(w) \;\leq\; Q_\gamma(w) \;\leq\; R_{p^*\gamma}(w).$$
(Note that the inequality in the third-to-last display continues to hold with $p^*\gamma$ replacing $\gamma$; in that case, the upper bound reduces to $\gamma$. And, similarly, the inequality in the second-to-last display continues to hold with $1 - p^*(1-\gamma)$ replacing $\gamma$, in which case the lower bound reduces to $\gamma$.) In particular, we get that $U(w) = Q_{\alpha/2}(w) \leq R_{p^*\alpha/2}(w) = w - \varsigma\,\Phi^{-1}(p^*\alpha/2)$ and that $L(w) = Q_{1-\alpha/2}(w) \geq R_{1-p^*\alpha/2}(w) = w - \varsigma\,\Phi^{-1}(1 - p^*\alpha/2)$. The last two inequalities, and the symmetry of $\Phi(\cdot)$ around zero, imply the first inequality in the proposition.
Proof of the second inequality in Proposition 1. Note that $p^* \geq p^\circ = \inf_\vartheta P\big(N(\vartheta,\varsigma^2) < b_1 \text{ or } N(\vartheta,\varsigma^2) > a_K\big)$, because $T$ is unbounded above and below. Setting $\delta = (a_K - b_1)/(2\varsigma)$, we note that $\delta \geq 0$ and that it is elementary to verify that $p^\circ = 2\Phi(-\delta)$. Because $\Phi^{-1}(1 - p^*\alpha/2) \leq \Phi^{-1}(1 - p^\circ\alpha/2)$, the inequality will follow if we can show that $\Phi^{-1}(1 - p^\circ\alpha/2) \leq \Phi^{-1}(1 - \alpha/2) + \delta$ or, equivalently, that $\Phi^{-1}(p^\circ\alpha/2) \geq \Phi^{-1}(\alpha/2) - \delta$. Because $\Phi(\cdot)$ is strictly increasing, this is equivalent to
$$p^\circ \alpha/2 = \Phi(-\delta)\,\alpha \;\geq\; \Phi\big(\Phi^{-1}(\alpha/2) - \delta\big).$$
To this end, we set $f(\delta) = \alpha\,\Phi(-\delta)/\Phi(\Phi^{-1}(\alpha/2) - \delta)$ and show that $f(\delta) \geq 1$ for $\delta \geq 0$. Because $f(0) = 1$, it suffices to show that $f'(\delta)$ is non-negative for $\delta > 0$. The derivative can be written as a fraction with positive denominator and with numerator equal to
$$-\alpha\,\phi(-\delta)\,\Phi\big(\Phi^{-1}(\alpha/2) - \delta\big) + \alpha\,\Phi(-\delta)\,\phi\big(\Phi^{-1}(\alpha/2) - \delta\big).$$
The expression in the preceding display is non-negative if and only if
$$\frac{\Phi(-\delta)}{\phi(-\delta)} \;\geq\; \frac{\Phi\big(\Phi^{-1}(\alpha/2) - \delta\big)}{\phi\big(\Phi^{-1}(\alpha/2) - \delta\big)}.$$
This will follow if the function $g(x) = \Phi(-x)/\phi(x)$ is decreasing in $x \geq 0$. The derivative $g'(x)$ can be written as a fraction with positive denominator and with numerator equal to
$$-\phi(x)^2 + x\,\Phi(-x)\,\phi(x) \;=\; x\,\phi(x)\Big(\Phi(-x) - \frac{\phi(x)}{x}\Big).$$
Using the well-known inequality $\Phi(-x) \leq \phi(x)/x$ for $x > 0$ (Feller 1957, Lemma VII.1.2), we see that the expression in the preceding display is non-positive for $x > 0$.
Appendix C Proof of Proposition 2
From Lee et al. (2016), we recall the formulas for the expressions on the right-hand side of (2), namely $A_{m,s} = (A^{0\prime}_{m,s}, A^{1\prime}_{m,s})'$ and $b_{m,s} = (b^{0\prime}_{m,s}, b^{1\prime}_{m,s})'$, with $A^0_{m,s}$ and $b^0_{m,s}$ given by
$$A^0_{m,s} = \frac{1}{\lambda} \begin{pmatrix} X'_{m^c}(I_n - P_{X_m}) \\ -X'_{m^c}(I_n - P_{X_m}) \end{pmatrix} \qquad\text{and}\qquad b^0_{m,s} = \begin{pmatrix} \iota - X'_{m^c} X_m (X'_m X_m)^{-1} s \\ \iota + X'_{m^c} X_m (X'_m X_m)^{-1} s \end{pmatrix},$$
respectively, and with $A^1_{m,s} = -\mathrm{diag}(s)(X'_m X_m)^{-1} X'_m$ and $b^1_{m,s} = -\lambda\,\mathrm{diag}(s)(X'_m X_m)^{-1} s$ (in the preceding display, $P_{X_m}$ denotes the orthogonal projection matrix onto the column space spanned by $X_m$ and $\iota$ denotes an appropriate vector of ones). Moreover, it is easy to see that the set $\{y : A_{m,s}\, y < b_{m,s}\}$ can be written as $\{y : \text{for } z = (I_n - P_{\eta^m})y, \text{ we have } \mathcal{V}^-_{m,s}(z) < \eta^{m\prime} y < \mathcal{V}^+_{m,s}(z) \text{ and } \mathcal{V}^0_{m,s}(z) > 0\}$, where
$$\mathcal{V}^-_{m,s}(z) = \max\big(\{(b_{m,s} - A_{m,s} z)_i / (A_{m,s} c^m)_i : (A_{m,s} c^m)_i < 0\} \cup \{-\infty\}\big),$$
$$\mathcal{V}^+_{m,s}(z) = \min\big(\{(b_{m,s} - A_{m,s} z)_i / (A_{m,s} c^m)_i : (A_{m,s} c^m)_i > 0\} \cup \{\infty\}\big),$$
$$\mathcal{V}^0_{m,s}(z) = \min\big(\{(b_{m,s} - A_{m,s} z)_i : (A_{m,s} c^m)_i = 0\} \cup \{\infty\}\big),$$
with $c^m = \eta^m / \|\eta^m\|^2$; cf. also Lee et al. (2016).
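These displays translate directly into code. The following sketch (ours, assuming NumPy; not the authors' implementation) assembles $A_{m,s}$ and $b_{m,s}$ and is the `make_A_b` helper assumed in the sketches of Section 3:

```python
import numpy as np

def make_A_b(X, m, s, lam):
    """Assemble A_{m,s} and b_{m,s} from the displays above; m lists the
    selected columns, s is the sign vector, lam is the Lasso penalty."""
    n, p = X.shape
    mc = np.setdiff1d(np.arange(p), m)              # m^c, the excluded columns
    Xm, Xmc = X[:, m], X[:, mc]
    G = np.linalg.inv(Xm.T @ Xm)
    M = Xmc.T @ (np.eye(n) - Xm @ G @ Xm.T)         # X'_{m^c} (I_n - P_{X_m})
    ones = np.ones(len(mc))
    A0 = np.vstack([M, -M]) / lam
    b0 = np.concatenate([ones - Xmc.T @ Xm @ G @ s, ones + Xmc.T @ Xm @ G @ s])
    A1 = -np.diag(s) @ G @ Xm.T
    b1 = -lam * np.diag(s) @ G @ s
    return np.vstack([A0, A1]), np.concatenate([b0, b1])
```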
Proof of Proposition 2. Set $I^- = \{i : (A_{m,s} c^m)_i < 0\}$ and $I^+ = \{i : (A_{m,s} c^m)_i > 0\}$. In view of the formulas for $\mathcal{V}^-_{m,s}(z)$ and $\mathcal{V}^+_{m,s}(z)$ given earlier, it suffices to show that either $I^-$ or $I^+$ is non-empty. To this end, assume that $I^- = I^+ = \emptyset$. Then $A_{m,s} c^m = 0$ and hence also $A^1_{m,s} c^m = 0$. Using the explicit formula for $A^1_{m,s}$ and the definition of $\eta^m$, i.e., $\eta^m = X_m (X'_m X_m)^{-1} \gamma^m$, it follows that $\gamma^m = 0$, which contradicts our assumption that $\gamma^m \in \mathbb{R}^{|m|} \setminus \{0\}$.
References
Bachoc, F., Leeb, H. & Pötscher, B. M. (2016), 'Valid confidence intervals for post-model-selection predictors', arXiv preprint arXiv:1412.4605.

Berk, R., Brown, L., Buja, A., Zhang, K. & Zhao, L. (2013), 'Valid post-selection inference', Ann. Statist. 41, 802–837.

Feller, W. (1957), An Introduction to Probability Theory and its Applications, Vol. 1, 2nd edn, Wiley, New York, NY.

Fithian, W., Sun, D. & Taylor, J. (2017), 'Optimal inference after model selection', arXiv preprint arXiv:1410.2597.

Frank, I. E. & Friedman, J. H. (1993), 'A statistical view of some chemometrics regression tools', Technometrics 35, 109–135.

Lee, J. D., Sun, D. L., Sun, Y. & Taylor, J. E. (2016), 'Exact post-selection inference, with application to the Lasso', Ann. Statist. 44, 907–927.

Leeb, H. & Kabaila, P. (2017), 'Admissibility of the usual confidence set for the mean of a univariate or bivariate normal population: The unknown variance case', J. R. Stat. Soc. Ser. B Stat. Methodol. 79, 801–813.

Leeb, H. & Pötscher, B. M. (2006a), 'Can one estimate the conditional distribution of post-model-selection estimators?', Ann. Statist. 34, 2554–2591.

Leeb, H. & Pötscher, B. M. (2006b), 'Performance limits for estimators of the risk or distribution of shrinkage-type estimators, and some general lower risk-bound results', Econometric Theory 22, 69–97.

Leeb, H. & Pötscher, B. M. (2008), 'Can one estimate the unconditional distribution of post-model-selection estimators?', Econometric Theory 24, 338–376.

Pötscher, B. M. (2009), 'Confidence sets based on sparse estimators are necessarily large', Sankhyā Ser. A 71, 1–18.

Pötscher, B. M. & Schneider, U. (2010), 'Confidence sets based on penalized maximum likelihood estimators in Gaussian regression', Electron. J. Statist. 4, 334–360.

Schneider, U. (2016), 'Confidence sets based on thresholding estimators in high-dimensional Gaussian regression models', Econometric Rev. 35, 1412–1455.

Taylor, J. & Tibshirani, R. (2017), 'Post-selection inference for $\ell_1$-penalized likelihood models', Canad. J. Statist. forthcoming.

Tian, X. & Taylor, J. E. (2015), 'Selective inference with a randomized response', arXiv preprint arXiv:1507.06739.

Tibshirani, R. (1996), 'Regression shrinkage and selection via the Lasso', J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288.

Tibshirani, R. J. (2013), 'The Lasso problem and uniqueness', Electron. J. Statist. 7, 1456–1490.

Tibshirani, R. J., Taylor, J., Lockhart, R. & Tibshirani, R. (2016), 'Exact post-selection inference for sequential regression procedures', J. Amer. Statist. Assoc. 111, 600–620.