


Valid Post-selection Inference in Assumption-lean Linear Regression

Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Richard Berk, Linda Zhao and Edward George

31 March 2017

1 Introduction: What this paper is about

Data-dependent tuning of estimation and inference procedures in analyzing real data is one of the most important components usually unaccounted for in reaching final conclusions or policy decisions. Exploratory data analysis is one such data-dependent tuning procedure. For instance, plotting a histogram of a univariate data set and deciding whether fitting a normal location-scale family would be an appropriate choice, or considering a scatter plot to decide what transformations to use in linear regression, brings in additional randomness due to the data-dependent (visual) decision made. In these examples, no formal hypothesis test is performed, but to convince yourself of the additional randomness, think of what you would have done if the univariate data set did not look entirely normal but a little skewed. Quoting Wikipedia about exploratory data analysis (EDA),

... primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task ...

John Tukey suggested that more emphasis should be placed on using the data to suggest what hypotheses to look for, but he also warned about the danger of systematic bias (due to additional randomness) when EDA and inference are based on the same data set. Practitioners tend to follow the former part of this advice much more than they adjust for the latter. Classical statistical inference was designed to work with hypotheses and models chosen prior to obtaining the data. Unaccounted additional randomness due to data-dependent tuning is only one of the many reasons for invalid final conclusions based on classical statistical inference. Intentionally or unintentionally, many statistics textbooks present this kind of data-dependent practice for practical data analysis. [reference needed]

The second prominent reason for invalid conclusions obtained using tools from mathematical statistics is the "blind" use of results that are derived under the assumption of correct modeling. Even if no data-dependent tuning is involved, inference provided under a correct-model assumption may not lead to valid conclusions, as we often cannot be sure of the correctness of the model used. Suppose we fit a simple linear regression model for two random variables in a data set. The classical linear regression theory, under the linear model assumption with homoscedastic Gaussian errors, suggests properties of the linear regression slope estimator such as unbiasedness, being the best linear unbiased estimator (BLUE), and a particular form of the asymptotic variance, as well as a normal distribution centered at the true slope. A big question is: "how drastically will the conclusions thus obtained change if this linear model with homoscedastic Gaussian errors does not hold? What are we estimating?" A similar question can be raised in almost every modeling setting in mathematical statistics. When we use a maximum likelihood estimator with a particular parametric family for a given data set, the classical inference is in question if the data were not generated from an element of the parametric family used. See Huber (1967). It is to be understood from these comments that the (optimality or inference) guarantees obtained under correct model assumptions do not usually carry over when the model assumptions are not satisfied.

To summarize, we have given at least two ways in which the inference generated using classical tools from mathematical statistics can be invalidated, namely,

    (P1) Data-dependent tuning; and

    (P2) Invalid model assumptions.

We are not the first to realize these problems with classical inference. To mention a few, the following papers or books address or put forward problem (P1). Buehler and Feddersen (1963), Olshen (1973), Rencher and Pun (1980) and Freedman (1983) discuss the problem of model selection in linear regression and its effect on t- and F-test results, while Sen (1979) and Saleh (2006) discuss results related to preliminary-test based inference for general parametric modeling. In linear regression, model selection or data-dependent tuning often refers to selecting a subset of covariates to be used in the final linear regression fit. Most of these papers discuss ways of adjusting inference when the method of model selection is automated, fully specified and fixed a priori. Here the word automated signifies that the selection method should be replicable in exactly the same way on any other data set presented. Note that the visual decision given as an example at the beginning of this section does not fall into this category. Data-dependent tuning is not limited to model selection; it has effects in statistical estimation too. To be concrete, most of the procedures used in (modern) non-parametric and high-dimensional inference have one or more tuning parameters that need to be selected appropriately to guarantee good/optimal performance of the estimator. Often in practice, such a tuning parameter is left unspecified with a side note that some sort of cross-validation can be used. However, the performance of the resulting estimator is mostly left unaddressed. See Arcones (2005), Dodge and Jurečková (2000) and the references therein for techniques to deal with such tuning. There is yet another kind of data-dependent tuning, related to outliers. There is a whole field of robust statistics (starting from Huber (1964)) providing many ways of automatically ignoring outliers in the data for estimation and inference. But in some practical scenarios, it is very possible that the outliers themselves are interesting, or that the robust procedure used fails because of a masking effect. In these cases, one might manually look at each data point to identify outliers and end up using only those observations that the data analyst certified as "good" observations. Under certain assumptions and in comparatively restricted settings, a particular outlier detection method was accounted for in Gervini and Yohai (2002) and Agostinelli et al. (2015). The references and practical data-dependent tuning methods above are by no means exhaustive; they are presented just as a sample. As an end note on data-dependent tuning, most of the literature seems to be focused on the first type, where tuning refers to model selection or, more precisely, to variable selection in regression/generalized linear model problems. Some concrete examples of these three different kinds of data-dependent tuning are as follows:

(a) Suppose we have observations (Xi, Yi) ∈ Rp × R for 1 ≤ i ≤ n that satisfy Yi = Xi⊤β0 + εi with independent and identically distributed εi ∼ N(0, 1). Also, β0 is sparse with known sparsity. For each subset M ⊆ {1, 2, . . . , p}, let β̂M denote the least squares estimator obtained by regressing Y on {X(j) : j ∈ M} (where X(j) denotes the j-th coordinate of the vector X). Now choose a model M̂ based on best subsets regression (we write a hat because the choice is estimated). We want to use β̂M̂. What are its properties? This is model/variable selection.

(b) Suppose we have univariate observations X1, X2, . . . , Xn that are distributed symmetrically around θ0 for some unknown θ0, and consider the sequence of location estimators {θ̂(α) : α ≥ 1}, where

    θ̂(α) := arg min_{µ∈R} ∑_{i=1}^n |Xi − µ|^α.

Here arg min denotes the minimizer of the function. Now we want to choose an optimal α (possibly to get an efficient estimator) by minimizing an estimate of the asymptotic variance of θ̂(α). Let that optimal choice be α̂. What are the properties of θ̂(α̂)?

(c) For an example from this century, suppose we have observations (Xi, Yi) ∈ Rp × R for 1 ≤ i ≤ n as in Example (a). Define the LASSO estimator for every λ > 0 as

    β̂(λ) := arg min_{β∈Rp} (1/n) ∑_{i=1}^n (Yi − Xi⊤β)² + λ‖β‖1.

We do not know what λ to use. So, we use the "universal" tool of cross-validation, or one of its many variants, and obtain a random choice λ̂ based on the same data. How do we do inference on β̂(λ̂)? What does it converge to?

(d) Suppose we have random vectors (Xi, Yi) ∈ R², (p = 1), for 1 ≤ i ≤ n as in Example (a). We fit the linear regression and look at the residual plot. Remove all observations whose residuals fall outside the "3σ" limits. Now refit the regression with the remaining observations. What are the properties of the resulting estimator? (A code sketch following this list illustrates examples (c) and (d).)
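To make examples (c) and (d) concrete, here is a minimal simulation of the two tuning steps: a cross-validated choice λ̂ for the lasso and a refit after discarding observations outside the "3σ" residual limits. The simulated design, the use of scikit-learn's LassoCV, and all variable names are our own illustration (and we reuse the p > 1 design for (d) rather than p = 1); the sketch only exhibits the data-dependent tuning under discussion, not the paper's proposed inference.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta0 = np.zeros(p); beta0[:3] = 1.0          # sparse "true" coefficients
y = X @ beta0 + rng.normal(size=n)

# Example (c): lambda chosen by cross-validation on the same data.
lasso = LassoCV(cv=5).fit(X, y)
lambda_hat = lasso.alpha_                      # data-dependent tuning parameter
beta_lasso = lasso.coef_                       # beta-hat(lambda-hat)

# Example (d): refit after removing observations outside the "3 sigma" limits.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols
keep = np.abs(resid) <= 3 * resid.std()        # data-dependent screening of outliers
beta_refit, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
```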

    3

The first three of these examples are very closely related, but the main difference is that in the second example the estimator sequence θ̂(α) varies in some sense smoothly over α ≥ 1. Such smoothness is absent in variable selection.

There has been a recent push in the literature for methods that adjust classical inference procedures under arbitrary model selection, primarily selections made visually, using diagnostic or influence plots and so on. One of the first such works was Berk et al. (2013), which dealt with the problem of multiple linear regression under some classical assumptions and where uniform inference was made possible by accounting for arbitrary model selection. The technicalities will be made clear in the subsequent sections. We refer to Berk et al. (2013) for many other prior works related to problem (P1). Some more recent works that build on the uniform inference method of Berk et al. (2013), or give an alternative solution, include Lee et al. (2016), Tibshirani et al. (2016), Tian et al. (2016) and Bachoc et al. (2016). It should be mentioned here that the former three papers address post model-selection inference by conditioning on the selected model, are very dependent on certain distributional assumptions, and cannot deal with arbitrary model selection as Berk et al. (2013) can. The present manuscript is close in ideology to Berk et al. (2013) and to Bachoc et al. (2016), where uniform inference was produced for general M-estimation problems. Our setting seems more general, though.

Most of the above mentioned solutions related to problem (P1) take certain correct-model or distributional assumptions for granted and can easily break down once such assumptions fail, which brings us to problem (P2). For instance, think of what we are trying to estimate in examples (b) and (d) above if the assumptions of symmetry and homoscedastic Gaussian errors, respectively, fail. In the modern terminology of mathematical statistics, one would need semi-parametric rather than parametric inference when dealing with many of the classical inference procedures that, in their early stages of development, were treated as purely parametric. To expand on this: when considering a linear regression model, the classical framework states that the conditional distribution is completely known except for the slope vector parameter and possibly the conditional homoscedastic variance parameter. This is a purely parametric setting. In contrast, we can view linear regression as fitting a linear function as (only) an approximation to the conditional expectation function, with no other assumptions, in the sense that the joint (and hence conditional) distributions are left completely unspecified. From this viewpoint, the problem has a parametric component of interest with an infinite dimensional nuisance component and hence is a semi-parametric problem. See Kosorok (2008) and Section 2 for more on this view. The inference differs with the change in the view of linear regression. This was pointed out, and expanded further to other problems with an indication of correct inference, in Buja et al. (2014) and Buja et al. (2016).

The present manuscript addresses problems (P1) and (P2) in the case of linear regression with data-dependent tuning limited to choosing a subset of covariates. We do not ask for any model assumptions. In this respect, this can be seen as a classical non-parametric statistical procedure, with the term non-parametric used in the sense of distribution-free as in Gibbons and Chakraborti (2011). From now on, the term model selection is used synonymously with variable selection. The paper has two important goals. Firstly, we provide confidence regions for linear regression coefficients obtained after model selection. Secondly, we prove simple but powerful results about linear regression itself; we also need these simple results to prove the validity of the confidence regions. The main contributions of the current paper can be listed as follows:

1. We treat linear regression as only a working model in the sense of Buja et al. (2014) and do not assume the correctness of a linear model. We consider the case where the observations are random vectors that comprise covariates and response. So, in contrast to the classical case, we treat covariates as random variables, not as fixed values. This requires us to interpret and understand more carefully what we are estimating. This framework seems more natural for any regression problem. Fixed covariates is the setting of Berk et al. (2013) and Bachoc et al. (2016).

2. We decouple the inference from the model-selection procedure, in the sense that the method presented provides correct inference no matter how the model selection was done. The post model-selection procedures proposed in Berk et al. (2013) and Bachoc et al. (2016) have the very same feature. This feature is both a blessing and a curse. It is a blessing because it can account for ad-hoc visual decisions made by the data analyst. It is a curse because this decoupling implies that the inference does not take into account whether a true model exists and the selection procedure is "consistent".

3. We allow for a very large number of covariates (almost exponential in the sample size) when considering variable selection. This is in accordance with the current trend of high-dimensional statistics. Although we allow the number of covariates to exceed the sample size, we require the final selected model to be of size smaller than the sample size, since the least squares linear regression estimator is not unique otherwise.

4. We will mostly concentrate on one simple strategy for valid post-selection inference and will indicate two other strategies that also provide valid post-selection inference. Table 1 distinguishes the motivations and characteristics of these three strategies in comparison with Berk et al. (2013). The computational complexity is relative to that of computing the least squares estimator. The Inference column indicates whether the inference is about each coordinate separately or about a vector of coefficients as a single entity. The last column, Conf. Region, gives the shape of the resulting confidence regions. Because of the similarity of approach 1 with the inequalities used in the Dantzig selector and/or the lasso, one might think of approach 1 as a convex-relaxation-based solution to the post-selection inference problem, while the approach of Berk et al. (2013) and approaches 2 and 3 can be seen as best-subset selection versions. See Remark 3.7 for more details.

5. The main focus of the paper is post model-selection inference in multiple linear regression, but all three strategies can easily be extended to generalized linear models, as will be indicated in Section 7.


Table 1: Characteristics of Different Approaches to Post-Selection Inference.

Strategy              Computation    Inference     Conf. Region
Berk et al. (2013)    exponential    Coordinate    Rectangles
Approach 1            linear         Global        Parallelepipeds
Approach 2            exponential    Global        Ellipsoids
Approach 3            exponential    Coordinate    Rectangles

6. Most of the results are based on deterministic inequalities, which allows one to do post model-selection inference even when the random vectors involved are structurally dependent. Since we rely on deterministic inequalities, we may not get the best possible rates in some contexts, but we are more robust to dependence assumptions on the observations, and our results are more widely applicable. To the best of our knowledge, some of our results are new for linear regression with increasing dimension.

In this paper, we do not address the question of whether performing a simple procedure like linear regression on a data set is good enough for prediction or for any other purpose. We infer about linear regression after the decision to use linear regression has been made.

The remainder of the paper is organized as follows. Section 2 provides the notation required for formulating the post model-selection problem on a rigorous footing. The first approach (or strategy) for valid post-selection inference is presented in Section 3, along with its main features. Section 4 describes the method for implementing or computing the valid post-selection confidence regions, which involves the multiplier bootstrap. Some important results related to the least squares linear regression estimator are presented in Section 5; the proofs are deferred to the supplementary material. Section 6 sketches the methodology for approaches 2 and 3, which also provide valid post-selection inference. We do not provide technical results related to approaches 2 and 3, so as not to overload the ideas; these will be taken up in a future paper. We indicate how these approaches extend to other problems in Section 7. Finally, we end with some conclusions and future related work in Section 8.

    2 Notation and Problem Formulation

Suppose we have a sample of n observations (Xi, Yi) ∈ Rp × R, 1 ≤ i ≤ n. The covariate vectors Xi are column vectors. We first describe the notation and get to the required stochastic assumptions later on. In case p varies with n, this should be interpreted as a triangular array. We do not require any specific assumption about the relation between p and n. Throughout the paper, we implicitly think of p as a function of n, and the term "model" is used to specify the subset of covariates used in the regression. We do not assume a linear model (in any sense) to be true anywhere, for any choice of covariates, in this or in the subsequent sections.

For any vector v ∈ Rp and 1 ≤ j ≤ p, v(j) denotes the j-th coordinate of v. For any non-empty model M given by a subset of {1, 2, . . . , p} and a vector v ∈ Rp, we use v(M) to denote the sub-vector of v with indices in M. The notation |M| will be used to denote the cardinality of M. For instance, if M = {2, 4} and p ≥ 4, then v(M) = (v(2), v(4)). If M = {j} is a singleton, then we write v(j) instead of v({j}). For any non-empty model M ⊆ {1, 2, . . . , p} and any symmetric matrix A ∈ R^{p×p}, let A(M) denote the sub-matrix of A with indices in M × M, and for 1 ≤ j, k ≤ p, let A(j, k) denote the value in the j-th row and k-th column of A. For any vector v ∈ Rq for some q ≥ 1, define the r-norm of v by

    ‖v‖_r^r := ∑_{j=1}^q |v(j)|^r, for 1 ≤ r < ∞, and ‖v‖_∞ := max_{1≤j≤q} |v(j)|.

For a matrix A, let ‖A‖∞ denote the maximum of the absolute values of its entries. With this notation, for any symmetric matrix A ∈ R^{q×q} and any vector v ∈ Rq,

    |v⊤Av| ≤ ‖A‖∞ ‖v‖1², and ‖Av‖∞ ≤ ‖A‖∞ ‖v‖1. (1)

    For any 1 ≤ k ≤ p, define the set of models

    M(k) := {M : M ⊆ {1, 2, . . . , p}, 1 ≤ |M | ≤ k},

so that M(p) is the power set of {1, 2, . . . , p} with the empty set removed. The set M(k) denotes the set of all non-empty models of size bounded by k. Traditionally, it is common to include an intercept term when fitting a linear regression. To avoid extra notation, we implicitly assume that all covariates under consideration are included in the vectors Xi. So, if one wants an intercept, without loss of generality take the first coordinate of all Xi's to be 1, i.e., Xi(1) = 1 for all 1 ≤ i ≤ n.

We have not yet mentioned any assumptions on the sample of observations. To proceed further, let us make our first assumption on the observations.

The sample {(Xi, Yi) ∈ Rp × R : 1 ≤ i ≤ n} is a triangular array of identically distributed random variables defined on a measurable space (Ω(n), F(n), P(n)).

We say "triangular array" to cover the case where p = pn varies with n. As mentioned before, this assumption differs from the classical linear regression setting, where the covariates Xi, 1 ≤ i ≤ n, are assumed to be fixed (deterministic) vectors rather than random vectors. To avoid any confusion, by saying the observations are real-valued we do not require continuous random variables; they can be indicator or discrete random variables. Notice that we are not assuming independence of the random vectors. The assumption of identically distributed random vectors can also be relaxed completely at the expense of messier formulae. From now on, let (X, Y) ∈ Rp × R denote a generic random vector that is independent of the sample {(Xi, Yi) : 1 ≤ i ≤ n} with distribution identical to that of (X1, Y1). See Remark 3.3 for more details about relaxing identical distribution. Throughout, the notations E and P will be used to denote expectation and probability computed with respect to all the randomness involved. This implies that the probability space needs to be enlarged whenever there is additional randomness (due to the data analyst) in the event.

For any M ∈ M(p), define the ordinary least squares empirical risk (or objective) function as

    R̂M(θ) := (1/n) ∑_{i=1}^n {Yi − Xi⊤(M)θ}², for all θ ∈ R^{|M|}. (2)

Using this, define the expected risk (or objective) function as

    RM(θ) := E[{Y − X⊤(M)θ}²], for all θ ∈ R^{|M|}.

Define the related matrices and vectors as

    Σn := (1/n) ∑_{i=1}^n XiXi⊤ ∈ R^{p×p}, and Γn := (1/n) ∑_{i=1}^n XiYi ∈ Rp,
    Σ := E[XX⊤] ∈ R^{p×p}, and Γ := E[XY] ∈ Rp.

For any model M ⊆ {1, 2, . . . , p}, we have the sub-matrices and sub-vectors

    Σn(M) := (1/n) ∑_{i=1}^n Xi(M)Xi⊤(M), and Γn(M) := (1/n) ∑_{i=1}^n Xi(M)Yi,
    Σ(M) := E[X(M)X⊤(M)], and Γ(M) := E[X(M)Y].

For the first approach to post-selection inference, we need the estimation errors of Σ and Γ, defined by

    D1n := ‖Γn − Γ‖∞ = sup_{M∈M(1)} ‖Γn(M) − Γ(M)‖∞, (3)
    D2n := ‖Σn − Σ‖∞ = sup_{M∈M(2)} ‖Σn(M) − Σ(M)‖∞. (4)

Finally, define the least squares estimator and the corresponding target as

    β̂M := arg min_{θ∈R^{|M|}} R̂M(θ), and β0,M := arg min_{θ∈R^{|M|}} RM(θ),

for all M ⊆ {1, 2, . . . , p}. Observe that β̂M and β0,M are vectors in R^{|M|}. They are not sub-vectors of any big fixed vector. This is the reason we specifically write M as a subscript and not in parentheses. See Berk et al. (2013) for a related discussion.


Remark 2.1 Note that the objective functions R̂M(θ) and RM(θ) are both convex quadratic functions of θ ∈ R^{|M|}, so minimizers always exist, though with no guarantee of uniqueness. Under the assumption of strict positive definiteness of Σ (or just of Σ(M)), RM(θ) is strictly convex, which implies a unique minimizer β0,M. This uniqueness does not require any specific relation between the model size |M| and the sample size n. In other words, β0,M is well-defined even if |M| > n, as long as Σ(M) is non-singular. However, β̂M is only well-defined for |M| ≤ n, since Σn(M) has rank at most min{|M|, n}. It is clear that the condition that Σ(M) be non-singular becomes increasingly hard to satisfy as the model size |M| grows. □

Remark 2.2 To relate to the classical case and to aid the understanding of β0,M, it is useful to suppose (for instance) that a true linear model holds. Suppose for a model M we have Yi = Xi⊤(M)ξ0 + εi, where εi ∼ N(0, σ²) for some fixed σ² > 0, independently of Xi. In this case,

    RM(θ) = E[(Y − X⊤(M)θ)²] = E[(Y − X⊤(M)ξ0)²] + E[(X⊤(M)θ − X⊤(M)ξ0)²],

since the cross term vanishes: E[(Y − X⊤(M)ξ0) X⊤(M)(ξ0 − θ)] = E[ε X⊤(M)](ξ0 − θ) = 0. The second term on the right hand side is exactly

    (θ − ξ0)⊤ Σ(M) (θ − ξ0).

Hence ξ0 is a minimizer of RM(θ), and it is the only minimizer if Σ(M) is non-singular. Therefore, the target β0,M coincides with the usual "true" parameter under the classical linear model. □

Under very mild assumptions, β̂M − β0,M converges to zero as n tends to infinity (see Buja et al. (2016) for a detailed discussion and Theorem 5 in Section 5). In this respect, β̂M is estimating β0,M. So, if we fix a set of covariates and apply linear regression, we are essentially estimating β0,M. When we use the t- or F-tests or the corresponding confidence intervals or regions, we are trying to test or cover β0,M. The classical linear regression analysis based on fixed covariates and homoscedastic Gaussian errors is invalid under the setting of random covariates where no linear model is assumed. Understanding how much invalidity this corresponds to was the main focus of Buja et al. (2016), for a fixed model M. Under the additional assumption of independent observations and mild moment conditions, it can be proved that β̂M (for a fixed M) has an asymptotic normal distribution, i.e.,

    n^{1/2}(β̂M − β0,M) →_L N_{|M|}(0, AV_M),

for some positive definite matrix AV_M that depends on M and on some moments of (X, Y). See Theorem 5 in Section 5 for more details. Here →_L denotes convergence in law (or distribution), and N_{|M|}(0, AV_M) denotes the multivariate normal distribution on R^{|M|} with mean zero and covariance matrix AV_M. Fix a real α ∈ [0, 1]. Under this asymptotic normality, we can construct a (1 − α)-confidence region R̂M such that

    lim inf_{n→∞} P(β0,M ∈ R̂M) ≥ 1 − α.


The confidence region R̂M depends on all the observations {(Xi, Yi) : 1 ≤ i ≤ n} and on M. This implies that for any two models M1 and M2,

    lim inf_{n→∞} inf_{M∈{M1,M2}} P(β0,M ∈ R̂M) ≥ 1 − α. (5)

Suppose now we use the data to decide between the models M1 and M2 by some criterion, formal or otherwise. Let the final model (which is now random) be denoted by M̂. This is what was described in example (a) of Section 1. About this random model, we only know that P(M̂ ∈ {M1, M2}) = 1. Observe that

    P(β0,M̂ ∈ R̂M̂) = P(β0,M1 ∈ R̂M1 | M̂ = M1) P(M̂ = M1) + P(β0,M2 ∈ R̂M2 | M̂ = M2) P(M̂ = M2). (6)

There are a few comments to be made from this equation.

• After choosing a random model M̂, if we use the linear regression estimator β̂M̂, then we are essentially trying to estimate β0,M̂, since this is what happens if M̂ is degenerate at a particular model. It is crucial to understand that β0,M̂ is a random quantity. In the scenario above (in general), we have

    P(β0,M̂ = β0,M1) = P(M̂ = M1), and P(β0,M̂ = β0,M2) = P(M̂ = M2).

If M1 and M2 are of different cardinalities, then β0,M̂ takes values in two different Euclidean spaces with non-zero probability.

• Having a guarantee as stated in the confidence bound (5) does not imply anything (in general) about the conditional probabilities in Equation (6), unless M̂ does not depend on the data. The recent paper Rinaldo et al. (2016) uses sample splitting as a method of inference after model selection. Note that if M̂ is chosen independently of the data used to construct R̂M, then the conditional probabilities in (6) become the marginal probabilities, and so we get valid post-selection inference. One important disadvantage of sample splitting is its inability to extend to dependent observations, even though for independent observations it gives very general and powerful results. For more comments, see Section 7 of Rinaldo et al. (2016).

• By ignoring the randomness in the model, as if the final model used were not obtained from the same data, and by using the confidence region R̂M̂, the practitioner or data analyst is implicitly claiming that, asymptotically, the left hand side of (6) is at least 1 − α.

• The conditional inference put forward by Lee et al. (2016), Tibshirani et al. (2016) and Tian et al. (2016), among others, tries to bound the conditional probabilities on the right hand side of (6) for some fixed method of choosing M̂. Berk et al. (2013) lower bounds the left hand side of (6), under the assumption of fixed covariates and homoscedastic Gaussian errors, for an arbitrary method of choosing M̂.


• The present paper is concerned with the left hand side of (6), with the model chosen from a possibly large number of models, while dispensing with most of the assumptions made by Berk et al. (2013). We will still require some moment assumptions, which were implicit in the Gaussianity assumption made there.

The problem of post model-selection inference is to construct a set of confidence regions {R̂M : M ∈ M}, for some fixed (non-random) set of models M, such that when a random model M̂ (possibly) depending on the same data and satisfying P(M̂ ∈ M) = 1 is chosen, we have

    lim inf_{n→∞} P(β0,M̂ ∈ R̂M̂) ≥ 1 − α. (7)

Throughout the paper, α ∈ [0, 1] is fixed and the confidence regions R̂M do depend on α. We require the confidence statement only asymptotically, since we believe that no finite sample confidence statements can be made without substantially strengthening the assumptions once we generalize to other M-estimation problems. The following remarks clarify the setting.

Remark 2.3 If we are able to produce a sequence of confidence regions R̂M that satisfies (7), then we say that we are equipped to perform post-selection inference, since by the duality of confidence regions and hypothesis testing we can formally test hypotheses related to the randomly chosen model M̂. □

Remark 2.4 Comparing the confidence statements (7) and (5), we note a major difference: in (7) we are trying to catch a random parameter with a random confidence region, while in (5) we are trying to catch a fixed (non-random) parameter with a random confidence region. It is often useful to place an infimum over a class of data-generating distributions before the probability in (7), which is referred to as honest confidence in Rinaldo et al. (2016). This "honesty" holds for our results too; however, we do not stress it further. □

Remark 2.5 We require the full set of covariates {X(1), X(2), . . . , X(p)} and the set of models M used for model selection based on the same data to be fixed a priori. In order for the results stated in the following sections to be valid, one should not look at the data set to decide whether or not to include some covariate in model selection, or whether some models should be included in M. For example, one should not do the following. Suppose we have observations {(Yi, Xi(1), Xi(2)) : 1 ≤ i ≤ n}. Choose, by any method, some model M̂ from the set {{1}, {2}, {1, 2}}. Now make a residual plot, and suppose the residuals seem correlated with the covariates in M̂. So, you now add quadratic terms for the covariates in M̂. This gives 2|M̂| covariates in total. Then we do another model selection with all possible combinations of the 2|M̂| new covariates. Continue until the residual plot looks good enough. This procedure does not have a fixed value of p and a fixed set M in our notation. This scenario is possible when trying to fit a polynomial regression where we keep adding higher powers until we are satisfied with the residual plot. This comment also helps clarify why one cannot take M to be the set of models generated by the LASSO path, because which models constitute the LASSO path is data-dependent. The confidence regions to be introduced in Section 3 have Lebesgue measure scaling in p as some power of log p. Hence, when in doubt, include as many covariates as possible in the full set of covariates. □

Remark 2.6 Even though we deal with a deceptively simple procedure like ordinary least squares linear regression, it encompasses a lot of complicated non-parametric models. Most non-parametric regression function estimation problems can be cast into a linear regression problem by a series expansion using a Fourier, wavelet, spline or any other appropriate basis. The results are applicable as long as the gram matrix Σ stays invertible. □

Remark 2.7 The present paper deals in its entirety with real-valued covariates, but with some care the results can be generalized to covariates taking values in other inner product spaces; we do not pursue this any further. □

    3 First Approach for Post-Selection Inference

The first step towards achieving the goal of constructing a set of confidence regions {R̂M : M ∈ M} satisfying (7) is to obtain a lower bound on the probability in (7). Such a lower bound is provided in Lemma 1. This lemma is, in a loose sense, a generalization of Lemma 4.1 of Berk et al. (2013).

Lemma 1. For any set of confidence regions {R̂M : M ∈ M} and any random model choice M̂ satisfying P(M̂ ∈ M) = 1, the following lower bound holds:

    P(β0,M̂ ∈ R̂M̂) ≥ P( ⋂_{M∈M} {β0,M ∈ R̂M} ).

Proof. Define, for every M ∈ M, the event AM := {β0,M ∈ R̂M}. Since P(M̂ ∈ M) = 1, we get, by extending the equality (6),

    P(β0,M̂ ∈ R̂M̂) = P(A_M̂)
                 = ∑_{M′∈M} P(β0,M′ ∈ R̂M′ | M̂ = M′) P(M̂ = M′)
                 = ∑_{M′∈M} P(A_{M′} | M̂ = M′) P(M̂ = M′)
                 ≥ ∑_{M′∈M} P( ⋂_{M∈M} AM | M̂ = M′ ) P(M̂ = M′)
                 = P( ⋂_{M∈M} AM ) = P( ⋂_{M∈M} {β0,M ∈ R̂M} ).

Combining these relations proves the result.

Remark 3.1 It should be clear from the proof why we require the set of models M to be fixed a priori, independently of the data. See Remark 2.5. □


Remark 3.2 We remark, in view of Lemma 1, that valid post-selection inference is inherently a high-dimensional problem, in the sense that the number of parameters we want to estimate and infer about is, in general, (much) larger than the available sample size. For illustration, consider the "not-so-bad" usual regression setting where we have p = 10 covariates and n = 500 observations. We have 50 observations per parameter if we consider the full model, which is considered a fixed-dimensional problem. Now, if we consider post-selection inference with all non-empty sub-models, then we get 2^p − 1 = 1023 vector parameters of varying dimension. So, we want to infer about 5120 scalar parameters from 500 observations, which falls into the high-dimensional category. It should be mentioned here that the price we pay is only of order log p, not log(2^p), which is of order p, as is usual in high-dimensional statistics. This is the reason why we can use far more covariates than the sample size. □
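The parameter counts in this remark can be verified directly; a minimal check (plain combinatorics, nothing specific to the paper's method):

```python
from math import comb

p = 10
num_models = 2 ** p - 1                                            # all non-empty sub-models
num_scalar_params = sum(k * comb(p, k) for k in range(1, p + 1))   # equals p * 2**(p-1)
print(num_models, num_scalar_params)                               # 1023 5120
```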

Using Lemma 1, we observe that in order to achieve the goal (7) of valid post-selection inference, it suffices to construct a set of confidence regions R̂M such that

    P( ⋂_{M∈M} {β0,M ∈ R̂M} ) ≥ 1 − α. (8)

To proceed further, note that for M ∈ M(p) and θ ∈ R^{|M|},

    R̂M(θ) = (1/n) ∑_{i=1}^n Yi² − 2 (1/n) ∑_{i=1}^n θ⊤Xi(M)Yi + (1/n) ∑_{i=1}^n θ⊤Xi(M)Xi⊤(M)θ
           = (1/n) ∑_{i=1}^n Yi² − 2θ⊤Γn(M) + θ⊤Σn(M)θ.

By a very similar expansion, we get that

    RM(θ) = E[Y²] − 2θ⊤Γ(M) + θ⊤Σ(M)θ.

Since the first term in each of R̂M(θ) and RM(θ) does not change with θ, we get that

    β̂M = arg min_{θ∈R^{|M|}} {θ⊤Σn(M)θ − 2θ⊤Γn(M)}, and
    β0,M = arg min_{θ∈R^{|M|}} {θ⊤Σ(M)θ − 2θ⊤Γ(M)}.

Recall the definitions of D1n and D2n. From the definitions of β̂M and β0,M, note that the difference between the sample and population objective functions is bounded in terms of D1n and D2n; that is, for any M ∈ M(p) and θ ∈ R^{|M|},

    |θ⊤Σn(M)θ − 2θ⊤Γn(M) − θ⊤Σ(M)θ + 2θ⊤Γ(M)| ≤ ‖θ‖1² D2n + 2 ‖θ‖1 D1n,

by the inequalities in (1). This is the fundamental reason why we can derive most of the results in terms of D1n and D2n.


Consider the confidence regions

    R̂M := {θ ∈ R^{|M|} : ‖Σn(M){β̂M − θ}‖∞ ≤ C1(α) + C2(α) ‖β̂M‖1},

for every M ∈ M(p), where C1(α) and C2(α) are bivariate joint quantiles of D1n and D2n, that is,

    P(D1n ≤ C1(α) and D2n ≤ C2(α)) ≥ 1 − α. (9)

The constants C1(α) and C2(α), of course, depend on the sample size n and on the number p of covariates in the full set; this dependence will be suppressed throughout. The errors D1n and D2n converge to zero by the law of large numbers as n → ∞ under very mild conditions. Therefore, C1(α) and C2(α) converge to zero as n → ∞.

The following theorem, which is the main result of the paper, proves the validity of (8) for a particular choice of M with the confidence regions R̂M given above. To state the theorem, define the strength of regression as

    S_{r,k} := sup_{M∈M(k)} ‖β0,M‖_r, for r ≥ 1,

and consider the following assumptions:

(A1) The observations {(Xi, Yi) ∈ Rp × R : 1 ≤ i ≤ n} are identically distributed on some measurable space.

(A2) The constants C1(α) and C2(α) satisfy the probability inequality (9).

(A3) The matrix Σ ∈ R^{p×p} is uniformly non-singular.

(A4)(k) The estimation errors D1n and D2n, in connection with k, satisfy

    k [D1n + D2n (1 + S_{1,k})] = op(1) as n → ∞.

Some comments on the assumptions could be helpful. Note that S_{1,k} is a measure of sparsity used in high-dimensional statistics. As mentioned before, (A1) is not really needed, and a discussion of its relaxation is given in Remark 3.3. Assumption (A2) is, strictly speaking, not an assumption; rather, we are asking for access to the values C1(α) and C2(α). We use the word "uniformly" in Assumption (A3) since we are implicitly thinking of a sequence p = pn that could increase with n; by Assumption (A3) we mean that the smallest eigenvalue of Σ is bounded uniformly away from zero as p changes. This assumption is necessary for all the targets {β0,M : M ∈ M(p)} to be well-defined (unique). Assumption (A4)(k) is the main, somewhat non-trivial, component. In general, S_{1,k} = O(k), and the rate of convergence of D1n and D2n to zero implies a rate constraint on k. Here too we think of k = kn as a sequence depending on n. As can be expected, the dependence structure between the random vectors (Xi, Yi), 1 ≤ i ≤ n, and their moments decide the rate at which D1n and D2n converge to zero; see Lemma 2 for more details. This is the place where we do not get an optimal rate for k if we stick with independent and identically distributed observations. However, the optimal rate can easily be recovered from the proofs presented. The theorem is stated with these high-level assumptions so that it is more widely applicable, in particular under various structural dependencies among the observations.

Theorem 1. Under assumptions (A1)–(A3) and for every 1 ≤ k ≤ p that satisfies (A4)(k), we have

    lim inf_{n→∞} P( ⋂_{M∈M(k)} {β0,M ∈ R̂M} ) ≥ 1 − α.

Furthermore, for any random model M̂ satisfying P(M̂ ∈ M(k)) = 1, we get

    lim inf_{n→∞} P(β0,M̂ ∈ R̂M̂) ≥ 1 − α.

Proof. The proof is surprisingly elementary and involves simple manipulation of estimating equations. Assumption (A1) is only required for the definition of β0,M. Since the empirical and the expected risk functions are both convex quadratic functions, the estimator β̂M and the target β0,M satisfy the equations

    Σn(M)β̂M − Γn(M) = 0, for all M ∈ M(p), (10)
    Σ(M)β0,M − Γ(M) = 0, for all M ∈ M(p). (11)

The first of these equations, (10), is just the normal equation for linear regression. Adding and subtracting β0,M from β̂M in equation (10) implies

    Σn(M)β0,M − Γn(M) = Σn(M){β0,M − β̂M}, for all M ∈ M(p).

Subtracting Equation (11) from this equation leads to

    [Σn(M) − Σ(M)]β0,M − [Γn(M) − Γ(M)] = Σn(M){β0,M − β̂M}, for all M ∈ M(p).

Taking ‖·‖∞ on both sides and using the triangle inequality, we get for all M ∈ M(p),

    ‖Σn(M){β0,M − β̂M}‖∞ ≤ ‖Γn(M) − Γ(M)‖∞ + ‖[Σn(M) − Σ(M)]β0,M‖∞. (12)

By an application of the second inequality in (1), we obtain for all M ∈ M(p),

    ‖Σn(M){β0,M − β̂M}‖∞ ≤ ‖Γn(M) − Γ(M)‖∞ + ‖Σn(M) − Σ(M)‖∞ ‖β0,M‖1. (13)

Since Γn(M) and Σn(M) are a sub-vector and sub-matrix of Γn and Σn, we get trivially, for all M ∈ M(p),

    ‖Σn(M){β0,M − β̂M}‖∞ ≤ ‖Γn − Γ‖∞ + ‖Σn − Σ‖∞ ‖β0,M‖1. (14)

Note that all the equations and inequalities above are deterministic and hold for any sample. Restating this as a probability-one statement and using the notation from Equations (3) and (4), we have

    P( ⋂_{M∈M(p)} { ‖Σn(M){β0,M − β̂M}‖∞ ≤ D1n + D2n ‖β0,M‖1 } ) = 1.

Under assumption (A2), we get

    P( ⋂_{M∈M(p)} { ‖Σn(M){β0,M − β̂M}‖∞ ≤ C1(α) + C2(α) ‖β0,M‖1 } ) ≥ 1 − α. (15)

Note that for any M ∈ M(k), by the triangle inequality,

    C2(α) ‖β0,M‖1 ≤ C2(α) ‖β̂M‖1 + C2(α) sup_{M′∈M(k)} ‖β̂M′ − β0,M′‖1. (16)

Combining assumptions (A3) and (A4)(k) with Lemma 4, we get that

    sup_{M∈M(k)} ‖β̂M − β0,M‖1 = op(1) as n → ∞.

This implies that the second term in inequality (16) is always of smaller order than the first term. Also, note that the second term in (16) does not depend on M, and hence

    lim inf_{n→∞} P( ⋂_{M∈M(k)} { ‖Σn(M){β0,M − β̂M}‖∞ ≤ C1(α) + C2(α) ‖β̂M‖1 } ) ≥ 1 − α.

Therefore,

    lim inf_{n→∞} P( ⋂_{M∈M(k)} {β0,M ∈ R̂M} ) ≥ 1 − α.

The second result follows by an application of Lemma 1.

Remark 3.3 For simplicity, in the discussion above we have used the assumption of identically distributed random vectors (Xi, Yi), 1 ≤ i ≤ n. In case this assumption is not satisfied, we need to redefine the target β0,M, the matrix Σ and the vector Γ as follows. The expected objective function in the case of non-identically distributed random vectors is given by

    RM(θ) := (1/n) ∑_{i=1}^n E[{Yi − Xi⊤(M)θ}²], for all θ ∈ R^{|M|}.

Now, define accordingly

    β0,M := arg min_{θ∈R^{|M|}} RM(θ), Σ := (1/n) ∑_{i=1}^n E[XiXi⊤], and Γ := (1/n) ∑_{i=1}^n E[XiYi].

This gives the definitions of Σ(M) and Γ(M) for M ∈ M(p). Also, the error norms D1n and D2n change to

    D1n := ‖Γn − Γ‖∞ and D2n := ‖Σn − Σ‖∞,

now with the redefined Γ and Σ. It is easily seen that the proof of Theorem 1 remains valid with these new definitions, and so Theorem 1 holds without assumption (A1). □

Remark 3.4 Our main technical assumption (A4)(k) depends on the quantity S_{1,k}, and our forthcoming results also depend on the quantities S_{r,k} for some r ≥ 1. Thus, it might be helpful to give general bounds on these quantities. Under the setting of identically distributed observations, observe that, by the definition of β0,M,

    0 ≤ E[{Y − X⊤(M)β0,M}²] ≤ E[Y²] − β0,M⊤ Σ(M) β0,M.

Hence, for every M ∈ M(p),

    ‖β0,M‖2² λmin(Σ(M)) ≤ β0,M⊤ Σ(M) β0,M ≤ E[Y²].

Therefore, using ‖β0,M‖1 ≤ |M|^{1/2} ‖β0,M‖2, we get under assumption (A3),

    ‖β0,M‖1 ≤ |M|^{1/2} ‖β0,M‖2 ≤ ( |M| E[Y²] / λmin(Σ(M)) )^{1/2} = O(|M|^{1/2}).

See Foygel and Srebro (2011) for a similar calculation. □

A closer look at the proof of Theorem 1 reveals a different sequence of confidence regions that has finite sample validity for post-selection inference under conditions weaker than those needed for Theorem 1. In view of Remark 3.3, we drop assumption (A1) for the following theorem.

Theorem 2. Under assumption (A2) alone, we have for any n ≥ 1 and p ≥ 1,

    P( ⋂_{M∈M(p)} {β0,M ∈ R̃M} ) ≥ 1 − α,

where for every M ∈ M(p),

    R̃M := {θ ∈ R^{|M|} : ‖Σn(M){β̂M − θ}‖∞ ≤ C1(α) + C2(α) ‖θ‖1} ⊆ R^{|M|}.

As before, for any random model M̂ with P(M̂ ∈ M(p)) = 1, we get

    P(β0,M̂ ∈ R̃M̂) ≥ 1 − α.

Proof. The proof is the same as the proof of Theorem 1 up to inequality (15). Note the difference between the intersection in Theorem 1, which is over M ∈ M(k), and that in Theorem 2, which is over the bigger set M ∈ M(p).


Remark 3.5 It is easy to see that the confidence regions R̂M are easy to compute and to picture, which is not readily the case for R̃M. But if finite sample validity is of far more importance in an application, then Theorem 2 is the way to go. Also, if the number p of covariates in the full set is allowed to increase with n, or if p > n, then Σn or Σn(M) could be singular, which makes the confidence regions in Theorems 1 and 2 infinitely wide. We will show in Lemma 3 that, under assumption (A3) and for every 1 ≤ k ≤ p satisfying kD2n = op(1) as n → ∞, Σn(M) is uniformly non-singular over M ∈ M(k) with probability approaching 1. □

In order not to disrupt the flow of post-selection inference results, we present one generalization of Theorems 1 and 2 that gives valid post-selection confidence regions in linear regression type problems. The importance of this generalization will be clear when we discuss some disadvantages of the previous confidence regions in Sub-section 3.1. To describe this generalization, consider the following setting. Suppose we are given two p × p matrices Σ*n, Σ* and two p-dimensional vectors Γ*n, Γ*. Consider the error norms

    D*1n := ‖Γ*n − Γ*‖∞ and D*2n := ‖Σ*n − Σ*‖∞.

Define, for every M ∈ M(p), the estimator and the corresponding target as

    ξ̂M := arg min_{θ∈R^{|M|}} {θ⊤Σ*n(M)θ − 2θ⊤Γ*n(M)},
    ξ0,M := arg min_{θ∈R^{|M|}} {θ⊤Σ*(M)θ − 2θ⊤Γ*(M)}.

Consider, for any M ∈ M(p), the confidence regions R̂*M and R̃*M, analogous to those before:

    R̂*M := {θ ∈ R^{|M|} : ‖Σ*n(M){ξ̂M − θ}‖∞ ≤ C*1(α) + C*2(α) ‖ξ̂M‖1},
    R̃*M := {θ ∈ R^{|M|} : ‖Σ*n(M){ξ̂M − θ}‖∞ ≤ C*1(α) + C*2(α) ‖θ‖1},

where C*1(α) and C*2(α) are constants (or joint quantiles) that satisfy

    P(D*1n ≤ C*1(α) and D*2n ≤ C*2(α)) ≥ 1 − α.

Finally, define S*_{1,k} := sup{‖ξ0,M‖1 : M ∈ M(k)}.

Theorem 3. If Σ* is uniformly non-singular, then for every 1 ≤ k ≤ p that satisfies k[D*1n + D*2n(1 + S*_{1,k})] = op(1) as n → ∞, we have

    lim inf_{n→∞} P( ⋂_{M∈M(k)} {ξ0,M ∈ R̂*M} ) ≥ 1 − α.

Under no assumptions, we have for every n ≥ 1 and p ≥ 1,

    P( ⋂_{M∈M(p)} {ξ0,M ∈ R̃*M} ) ≥ 1 − α.

Proof. The proof is exactly the same as that of Theorem 1. The reader only has to note that the proof there used no structure of Σn and Γn, nor the fact that they are unbiased estimators of Σ and Γ, respectively.

Remark 3.6 Theorem 3 allows one to deal with missing data or outliers in the linear regression setting. In the case of missing data, or when the data are suspected to contain outliers, it might be more useful to use estimators of Σ and Γ that take these into account. For the case of missing data/errors-in-covariates/multiplicative noise, see Loh and Wainwright (2012), Examples 1, 2 and 3, and the references therein. For the case of outliers, either in the classical sense or in the adversarial corruption setting, see Chen et al. (2013b). For correct usage of this theorem, it is crucial that we use the sub-matrix and sub-vector of Σ*n and Γ*n, respectively, for sub-models. For example, if we use full covariate imputation in the case of missing data, then the sub-model estimator should be based on a sub-matrix of this imputed matrix. □

Remark 3.7 At this point we cannot avoid making a statement about the relation between these confidence regions and the (by now) well-known estimator in the high-dimensional literature called the Dantzig selector, put forward by Candes and Tao (2007), and its close cousins proposed by Rosenbaum and Tsybakov (2010) and Chen et al. (2013b) to deal with matrix uncertainty and robustness to outliers. These papers or methods are not related to post-selection inference, and they all work under a linear model assumption. In our notation, the Dantzig selector is defined by the optimization problem

    minimize ‖β‖1 subject to ‖Γn − Σnβ‖∞ ≤ λn,

for some tuning parameter λn that converges to zero as n increases. This problem tries to estimate β0 ∈ Rp, where the observations satisfy Yi = Xi⊤β0 + εi for independent and identically distributed errors εi with a mean zero normal distribution. To relate this to our confidence regions R̂M, note that for β in the constraint set, the quantity inside the norm equals Σn(β̂ − β), where β̂ is any least squares estimator. Also, Candes and Tao (2007), as many others, assume fixed covariates Xi, 1 ≤ i ≤ n. The estimator defined in Chen et al. (2013b) resembles

    minimize ‖β‖1 subject to ‖Γn − Σnβ‖∞ ≤ λn + δn ‖β‖1,

for tuning parameters λn and δn, both converging to zero as n increases. This constraint set corresponds to our confidence regions R̃M in Theorem 2. A curious and well-informed reader can imagine how to generalize our approach 1 from linear regression to other generalized linear models. See Section 7 and James and Radchenko (2009) for more details. □

After Remark 3.7, it is natural to ask whether there are versions of valid post-selection confidence regions in line with the lasso, and the answer is given by the following theorem. The proof, as well as some generalizations, is given in the supplementary material. Define, for every M ∈ M(p), the confidence regions

    ŘM := {θ ∈ R^{|M|} : R̂M(θ) ≤ R̂M(β̂M) + 4C1(α) ‖β̂M‖1 + 2C2(α) ‖β̂M‖1²},

where R̂M(·) is the empirical least squares objective function defined in Equation (2).

Theorem 4. Under assumptions (A2)–(A3) and for every 1 ≤ k ≤ p that satisfies (A4)(k), we have

    lim inf_{n→∞} P( ⋂_{M∈M(k)} {β0,M ∈ ŘM} ) ≥ 1 − α.

Our main results above, and all the results to be presented in Section 5, are given in terms of the error norms D1n and D2n. We first prove a result that gives the rate at which these error norms converge to zero in two settings that we think are general enough. For Theorem 3 and the settings mentioned in Remark 3.6, these error rates can be found in the references mentioned there. Set Zi = (Xi⊤, Yi)⊤ for 1 ≤ i ≤ n and define

    Ωn := (1/n) ∑_{i=1}^n ZiZi⊤, and Ω := (1/n) ∑_{i=1}^n E[ZiZi⊤] ∈ R^{(p+1)×(p+1)}.

Observe that max{D1n, D2n} ≤ ‖Ωn − Ω‖∞. The following lemma essentially lifts some of the high-dimensional covariance matrix estimation results in various settings. Surprisingly, this may be the only mathematically demanding result in this paper.

Lemma 2. The following statements hold true.

1. Suppose the observations Zi = (Xi⊤, Yi)⊤ ∈ R^{p+1} are independent random vectors.

(a) If there exists a σ ∈ (0, ∞) such that for all 1 ≤ i ≤ n and 1 ≤ j ≤ p + 1,

    E[exp(t{Zi(j) − E[Zi(j)]})] ≤ exp(t²σ²/2) for all t ∈ R,

then

    max{D1n, D2n} ≤ ‖Ωn − Ω‖∞ = Op((n^{−1} log p)^{1/2}).

(b) If there exist an m ≥ 1 and a constant Km ∈ R such that for all 1 ≤ i ≤ n,

    max_{1≤j≤p+1} E[|Zi(j) − E[Zi(j)]|^{4m}] ≤ Km,

then

    max{D1n, D2n} ≤ ‖Ωn − Ω‖∞ = Op(n^{−1/2} p^{1/m}).

2. Suppose the observations Zi = (Xi⊤, Yi)⊤ ∈ R^{p+1} form a stationary process such that Zi = g(Fi), where g(Fi) = (g1(Fi), . . . , g_{p+1}(Fi)) is an R^{p+1}-valued measurable function, Fi = (. . . , e_{i−1}, e_i) is a shift process and the ei are iid random vectors. Define the functional dependence measure of Zi(j) = gj(Fi), for w > 0, as

    θ_{i,w,j} := (E|Zi(j) − Z′i(j)|^w)^{1/w},

where Z′i(j) = gj(F′i), F′i = (. . . , e_{−1}, e′_0, e_1, . . . , e_i), and e′_0 is such that e′_0 and e_ℓ, ℓ ∈ Z, are iid. Assume that the short-range dependence condition holds, i.e.,

    Θ_{m,w} := max_{1≤j≤p+1} ∑_{ℓ=m}^∞ θ_{ℓ,w,j} < ∞, for some w > 2, α > 0, µ

3.1 Pros and Cons of Approach 1

We call the approach using either of the confidence regions R̂M or R̃M approach 1. In this sub-section, we discuss various advantages and disadvantages of this approach. These comments also apply to the confidence regions used in Theorem 3.

The following are some of the advantages of this approach. The confidence regions are asymptotically valid for post-selection inference; to the best of our knowledge, we are the first to give such valid confidence regions at this level of generality. Our confidence region for any particular model M depends only on the joint quantiles C1(α), C2(α) and on the least squares estimator β̂M corresponding to the model M. So, the complexity of computing these confidence regions is no more than a constant multiple of the complexity of computing the estimator β̂M. It will be clear from Section 4 that computing C1(α), C2(α) takes no more than a number of operations linear in p. This computational complexity is in sharp contrast to the valid post-selection inference methods proposed by Berk et al. (2013) and Bachoc et al. (2016), which essentially require solving for all the least squares estimators; implementation of their procedure is therefore NP-hard. Another advantage of approach 1, from the theoretical side, is that the confidence regions obtained have the best possible rate for their Lebesgue measure, although we do not know the optimal constants at present. Note that the Lebesgue measure of the confidence region for model M is computed on R^{|M|}.

There are some considerable disadvantages, and some irking factors, associated with this approach. Firstly, notice that the confidence regions are not invariant under linear transformations of the observations. This is true for most methods used in high-dimensional linear regression that induce sparsity. Even from a naive point of view, we want invariance under changes of units for all variables involved. This translates to invariance under diagonal linear transformations of the observations. A common way suggested (at least in the case of iid observations) to attain this diagonal invariance is to normalize all the variables involved to have unit standard deviation. Formally, we need to use

    X̃i := ( (Xi(1) − X̄(1))/sn(1), . . . , (Xi(p) − X̄(p))/sn(p) ), and Ỹi := (Yi − Ȳ)/sn(0),

in place of (Xi, Yi), 1 ≤ i ≤ n, where for 1 ≤ j ≤ p,

    X̄(j) := (1/n) ∑_{i=1}^n Xi(j), and sn²(j) := (1/n) ∑_{i=1}^n [Xi(j) − X̄(j)]²,

and

    Ȳ := (1/n) ∑_{i=1}^n Yi, and sn²(0) := (1/n) ∑_{i=1}^n [Yi − Ȳ]².

This leads to the matrix and vector

    Σn := (1/n) ∑_{i=1}^n X̃iX̃i⊤, and Γn := (1/n) ∑_{i=1}^n X̃iỸi.


Note that the observations (X̃i, Ỹi), 1 ≤ i ≤ n, are not independent even if we start with independent observations (Xi, Yi). This is one of the reasons why we did not assume independence for Theorem 1 and why we stated Theorem 3. Of course, one needs to prove the rates for the error norms D1n and D2n in this case separately. We leave it to the reader to verify that the rates are exactly the same as those obtained in Lemma 2 (one needs to use a Slutsky-type argument). See Cui et al. (2016) for more details. We conjecture that much weaker conditions than those listed in Lemma 2 suffice for these same rates; in particular, we do not need exponential moments. See van de Geer and Muro (2014, Theorem 5.3) for a result in this direction. Getting back to invariance under arbitrary linear transformations, we do not know if it is possible to come up with a procedure that retains the computational complexity of approach 1 while being invariant under arbitrary linear transformations, and we conjecture that this is not possible. By this conjecture we mean that there is a strict trade-off between computational efficiency and affine invariance. On this point, we mention that approaches 2 and 3, to be described in Section 6, obey this affine invariance but are NP-hard to compute.

    Another disadvantage of approach 1 is that it is mostly based on deterministic inequalities. As many readers will have recognized, this can lead to some conservativeness of the method. Again, we conjecture that this is another price we need to pay for smaller computational complexity. Exactly how much conservativeness approach 1 leads to is studied in the simulations presented in the supplementary material. We re-emphasize, before ending this sub-section, that the two main motivations of approach 1 are validity and computational complexity.

    Need to make some comments about the shape of the confidence region and about the fact that the confidence statement is for the whole vector rather than for each coordinate separately.

    4 Computations and Multiplier Bootstrap

    One important component of an application of approach 1 for valid post-selection inference is computing or estimating the joint bivariate quantiles C1(α) and C2(α). We note that either the classical bootstrap or the recently popularized multiplier bootstrap works for estimating the joint quantiles C1(α) and C2(α) in the settings described in Lemma 2. See Chernozhukov et al. (2013) and Zhang and Cheng (2014) for more details. For simplicity, we describe only the multiplier bootstrap for the case of independent and identically distributed random vectors, and we refer to Zhang and Cheng (2014) for the dependent settings described in Lemma 2, part 2.

    Suppose Z1, Z2, . . . , Zn is a sequence of mean zero, independent and identically distributed random vectors in R^q. Define the scaled mean as
\[
S_n^Z := \frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i \in \mathbb{R}^q.
\]

    Let e1, e2, . . . , en be independent and identically distributed standard normal random variables and define
\[
S_n^{Z*} := \frac{1}{\sqrt{n}}\sum_{i=1}^n e_i\big(Z_i - \bar{Z}_n\big), \quad\text{where}\quad \bar{Z}_n := \frac{1}{n}\sum_{i=1}^n Z_i.
\]

    The Berry-Esseen bound for the multiplier bootstrap is defined as
\[
\rho_n^* := \sup_{t\in\mathbb{R}} \left| P\left(\left\|S_n^{Z*}\right\|_\infty \le t \,\Big|\, \{Z_i : 1 \le i \le n\}\right) - P\left(\left\|S_n^{Z}\right\|_\infty \le t\right) \right|.
\]
Under certain moment assumptions, ρ*n converges in probability to zero as n → ∞, with rates specified in the results of Chernozhukov et al. (2013). For the application of the multiplier bootstrap in approach 1, we take Zi = (X⊤i, Yi)⊤. The following algorithm gives pseudo-code for implementing the bootstrap.

    1. Generate N replicates of n standard normal random variables. Let these be denoted by {ei,j : 1 ≤ i ≤ n, 1 ≤ j ≤ N}.

    2. Compute the j-th bootstrap replicate of ‖SZn‖∞ as
\[
T_{n,j} := \left\| \frac{1}{\sqrt{n}} \sum_{i=1}^n e_{i,j}\big(Z_i - \bar{Z}_n\big) \right\|_\infty, \quad\text{for } 1 \le j \le N.
\]

    3. Find the (1 − α)-upper quantile of {Tn,j : 1 ≤ j ≤ N} as q(SZn, α) such that
\[
\frac{1}{N}\sum_{j=1}^N \mathbf{1}\{T_{n,j} \ge q(S_n^Z, \alpha)\} \le \alpha.
\]
    Here 1{A} is the indicator function of a set A. This q(SZn, α) is an estimate of the (1 − α)-upper quantile of ‖SZn‖∞ (a code sketch of these steps follows).

5 Uniform Convergence Results for Linear Regression

    In Section 4, we described the multiplier bootstrap technique with the observations (Xi, Yi) being independent or following some general yet known dependence structure. We first present a result, under the assumption of independent and identically distributed observations (Xi, Yi), 1 ≤ i ≤ n, that shows asymptotic normality and related properties of β̂M for any fixed model M. The following theorem also establishes some semi-parametric efficiency properties of β̂M. The sandwich estimator of the asymptotic variance of β̂M is well known in the literature as a heteroscedasticity-robust variance estimator; see Buja et al. (2014) for more details. This sandwich estimator has also been criticized in the literature for its high variability. The following theorem shows, in part, that the sandwich estimator is semi-parametrically efficient, so, informally, no other estimator can beat the sandwich estimator in terms of variance. Even though the result is fairly straightforward, the theorem as a whole appears to be new. All the results in this section are proved in the supplementary material.


    Theorem 5. Suppose that the observations (Xi, Yi) are independent and identically distributed with finite fourth moments. For any fixed model M of size not changing with n, under assumption (A3), we have the following results:

    (a) The least squares estimator β̂M converges in probability to β0,M as n→∞;

    (b) The following asymptotic normality result holds:
\[
n^{1/2}\big(\hat{\beta}_M - \beta_{0,M}\big) \overset{\mathcal{L}}{\to} N_{|M|}\big(0,\, B_M^{-1} V_M B_M^{-1}\big),
\]
    where $B_M = \Sigma(M)$ and $V_M = \mathbb{E}\big[X(M)X^\top(M)\{Y - X^\top(M)\beta_{0,M}\}^2\big]$.

    (c) The estimator β̂M is a semi-parametrically efficient estimator of β0,M over alldistributions of (X1, Y1) with finite fourth moments.

    (d) The well-known sandwich estimator $[\Sigma_n(M)]^{-1} V_{n,M} [\Sigma_n(M)]^{-1}$ is a consistent estimator of $B_M^{-1} V_M B_M^{-1}$, where
\[
V_{n,M} = \frac{1}{n}\sum_{i=1}^n X_i(M)X_i^\top(M)\{Y_i - X_i^\top(M)\hat{\beta}_M\}^2.
\]
    Moreover, $[\Sigma_n(M)]^{-1} V_{n,M} [\Sigma_n(M)]^{-1}$ is a semi-parametrically efficient estimator of $B_M^{-1} V_M B_M^{-1}$.
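As an illustration of Theorem 5(d), the following numpy sketch (our own; the helper name sandwich_for_model is hypothetical) computes β̂M and the sandwich estimator [Σn(M)]^{-1} Vn,M [Σn(M)]^{-1} for a given submodel M.

```python
import numpy as np

def sandwich_for_model(X, Y, M):
    """OLS fit on submodel M together with the heteroscedasticity-robust
    sandwich estimator of Theorem 5(d).  M is a list of column indices."""
    XM = X[:, M]
    n = XM.shape[0]
    Sigma_n = XM.T @ XM / n                              # bread, Sigma_n(M)
    beta_hat = np.linalg.solve(Sigma_n, XM.T @ Y / n)    # least squares estimator
    resid = Y - XM @ beta_hat
    V_n = (XM * resid[:, None] ** 2).T @ XM / n          # meat, V_{n,M}
    Sigma_inv = np.linalg.inv(Sigma_n)
    return beta_hat, Sigma_inv @ V_n @ Sigma_inv         # estimates B_M^{-1} V_M B_M^{-1}
```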

    For the remaining results in this section, we return to the original setting where we do not need the independence or identical distribution assumption. The following notion of stochastic ordering is useful in some of the results. For any two non-negative random variables U and V, let U ⪯ V signify the inequality

    P (U ≥ δ) ≤ P (V ≥ δ) for all δ > 0.

    This stochastic ordering implies that E|U|^r ≤ E|V|^r for all r ≥ 1. For any matrix A ∈ R^{q×q}, we use λmin(A) to denote the smallest eigenvalue of A.

    In Section 3, we proved a rate theorem for the error norms D1n and D2n in specific dependent settings (including iid). But to accommodate any user-specified setting, we mainly prove deterministic results, in the sense that we do not require the random vectors {(Xi, Yi) : 1 ≤ i ≤ n} to be either independent or identically distributed. We do not know of any previous work that proved such deterministic results for linear regression. Also, most of our results are uniform over a (possibly growing) collection of models. All of our results are finite sample. See Table 2 for a quick list of results.

    The confidence regions proposed depend on the matrix-vector product Σn(M)(β̂M − β0,M), which could potentially be too large because of a possible near-singularity of Σn(M). But under a condition on k weaker than that required by assumption (A4)(k), it can be shown that Σn(M) is non-singular uniformly over M ∈ M(k). This is the core of Lemma 3. Note the similarity with the restricted isometry property (RIP) well known in the high-dimensional linear regression and compressed sensing literature.


    Lemma 3. If k satisfies $kD_{2n} = o_p(1)$, then
\[
\sup_{M\in\mathcal{M}(k)} \|\Sigma_n(M) - \Sigma(M)\|_{\mathrm{op}} \le kD_{2n} = o_p(1).
\]

    The rate obtained from this lemma in combination with setting 1(a) of Lemma 2 is not optimal, and using deterministic inequalities it is not clear whether the rate in the lemma can be improved. For notational convenience, define for any $k \in \{1, 2, \ldots, p\}$ the restricted isometry property (RIP) rate as
\[
\mathrm{RIP}_n(k) = \sup_{M\in\mathcal{M}(k)} \|\Sigma_n(M) - \Sigma(M)\|_{\mathrm{op}}.
\]
    The usual definition of the restricted isometry property involves the assumption Σ = I. In terms of this definition, the previous lemma proved RIPn(k) ≤ kD2n. Lemma 3 is a necessary ingredient in all the results that follow, and so the non-optimality of the rate propagates. However, we let the reader figure out the little tweaks to be made in the proofs to recover optimal rates.
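For small p, the quantity RIPn(k) can be computed directly by brute-force enumeration, as in the following sketch (ours; the helper name rip_rate is hypothetical). The enumeration over models is what makes a direct computation expensive, which is why Lemma 3 instead bounds RIPn(k) by kD2n.

```python
import numpy as np
from itertools import combinations

def rip_rate(X, Sigma, k):
    """RIP_n(k) = sup over |M| <= k of ||Sigma_n(M) - Sigma(M)||_op."""
    n, p = X.shape
    Delta = X.T @ X / n - Sigma          # Sigma_n - Sigma
    worst = 0.0
    for size in range(1, k + 1):
        for M in combinations(range(p), size):
            sub = Delta[np.ix_(M, M)]
            worst = max(worst, np.linalg.norm(sub, 2))   # operator norm of submatrix
    return worst
```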

    The sequence of lemmas to follow presents some uniform consistency results related to the linear regression estimators defined in Section 2. All these results also apply to the linear-regression-type estimator used in Theorem 3. We use the same notation as introduced in Section 2.

    Lemma 4. For any $k \ge 1$ and $n \ge 1$ that satisfy $2kD_{2n} \le \lambda_{\min}(\Sigma)$, the following stochastic ordering holds:
\[
\sup_{M\in\mathcal{M}(k)} \left\|\hat{\beta}_M - \beta_{0,M}\right\|_1 \preceq \frac{4k\big(D_{1n} + D_{2n}S_{1,k}\big)}{\lambda_{\min}(\Sigma) - 2kD_{2n}} = o_p(1),
\]
    where the last equality holds under assumptions (A3) and (A4)(k) of Theorem 1.

    As a complementary result, we present the rate of convergence for uniform consistency in the ‖·‖2-norm.

    Lemma 5. For any $k \ge 1$ and $n \ge 1$ that satisfy $2kD_{2n} \le \lambda_{\min}(\Sigma)$, the following stochastic ordering holds:
\[
\sup_{M\in\mathcal{M}(k)} \left\|\hat{\beta}_M - \beta_{0,M}\right\|_2 \preceq \frac{4\big(k^{1/2}D_{1n} + \mathrm{RIP}_n(k)S_{2,k}\big)}{\lambda_{\min}(\Sigma) - 2\,\mathrm{RIP}_n(k)} \preceq \frac{4\big(k^{1/2}D_{1n} + kD_{2n}S_{2,k}\big)}{\lambda_{\min}(\Sigma) - 2kD_{2n}}.
\]
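The uniform consistency statements of Lemmas 4 and 5 can be checked numerically in a small simulation, since there the projection target β0,M = Σ(M)^{-1}Γ(M) is available from the known design. The sketch below is ours and only illustrative; it enumerates all models of size at most k, which is exactly the exhaustive computation the deterministic bounds let one avoid.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p, k = 500, 6, 3
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # known AR(1) design covariance
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0, 0.0])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = X @ beta + rng.standard_normal(n)
Gamma = Sigma @ beta                                  # population E[X Y]

sup_l1 = sup_l2 = 0.0
for size in range(1, k + 1):
    for M in combinations(range(p), size):
        M = list(M)
        XM = X[:, M]
        beta_hat = np.linalg.solve(XM.T @ XM / n, XM.T @ Y / n)
        beta_0 = np.linalg.solve(Sigma[np.ix_(M, M)], Gamma[M])   # projection target beta_{0,M}
        diff = beta_hat - beta_0
        sup_l1 = max(sup_l1, np.abs(diff).sum())
        sup_l2 = max(sup_l2, np.linalg.norm(diff))
print(sup_l1, sup_l2)   # both shrink as n grows, in line with Lemmas 4 and 5
```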

    The confidence regions proposed through approach 1 are simple parallelepipeds and can be seen as linear transformations of ‖·‖∞-norm balls. Hence, their Lebesgue measures can be computed exactly. Since we give confidence regions that are valid over a large number of models, we have to give a relative Lebesgue measure result uniform over a set of models. For A ⊆ R^q with q ≥ 1, let ν(A) denote the Lebesgue measure of A supported on R^q. For convenience, we do not use different notations for the Lebesgue measure for different q ≥ 1.


    Lemma 6. For any $k \ge 1$ such that assumptions (A3) and (A4)(k) are satisfied, the following uniform relative Lebesgue measure result holds:
\[
\sup_{M\in\mathcal{M}(k)} \frac{\nu\big(\hat{R}_M\big)}{\big(C_1(\alpha) + C_2(\alpha)S_{1,k}\big)^{|M|}} = O_p(1).
\]
    Hence, it can be said that $\nu(\hat{R}_M) = O_p\big(D_{1n} + D_{2n}S_{1,k}\big)^{|M|}$ uniformly for $M \in \mathcal{M}(k)$. Moreover, under setting 1(a) of Lemma 2, we additionally have
\[
\nu\big(\hat{R}_M\big) = O_p\left(\sqrt{\frac{|M|\log p}{n}}\right)^{|M|}
\]
    uniformly for $M \in \mathcal{M}(k)$.

    Our final result in this section concerns the uniform consistency of the sandwich estimator. As mentioned in Theorem 5, for any fixed M with fixed cardinality, a correct asymptotic variance estimator is the sandwich estimator. We have shown in Lemma 3 that the sample bread part, Σn(M), converges uniformly to its population counterpart Σ(M), even with increasing dimension. The following result shows that the meat part also satisfies a form of uniform consistency. Recall the meat part of the sandwich estimator for the model M,

\[
V_{n,M} := \frac{1}{n}\sum_{i=1}^n X_i(M)X_i^\top(M)\big(Y_i - X_i^\top(M)\hat{\beta}_M\big)^2.
\]
    Its population counterpart for the model M is given by
\[
V_M := \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[X_i(M)X_i^\top(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big)^2\right].
\]
    Define an intermediate “estimator” as
\[
\tilde{V}_{n,M} := \frac{1}{n}\sum_{i=1}^n X_i(M)X_i^\top(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big)^2.
\]

    Observe that Ṽn,M is a simple mean whose expectation matches the required matrix VM, and the consistency of Ṽn,M for VM depends on the dependence structure. The difference Ṽn,M − VM cannot, in general, be bounded in operator norm in terms of D1n and D2n alone. The following result implies uniform convergence in operator norm of Vn,M − Ṽn,M to zero as n → ∞. Define the error norm

\[
D_{0n} := \left|\frac{1}{n}\sum_{i=1}^n Y_i^2 - \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[Y_i^2\big]\right|.
\]
    Observe that, in the notation used for Lemma 2, we have
\[
\max\{D_{0n}, D_{1n}, D_{2n}\} = \|\Omega_n - \Omega\|_\infty.
\]


    Lemma 7. Under the condition $D_{0n} = o_p(1)$ and assumptions (A3) and (A4)(k), we have
\[
\sup_{M\in\mathcal{M}(k)} \left\|V_{n,M} - \tilde{V}_{n,M}\right\|_{\mathrm{op}} = k \max_{1\le i\le n} \|X_i\|_\infty\, O_p\big(D_{1n} + D_{2n}S_{1,k}\big) = o_p(1).
\]
    Furthermore, if $X_i(j)$ is sub-Gaussian for every $1 \le j \le p$ as in setting 1(a) of Lemma 2, then
\[
\sup_{M\in\mathcal{M}(k)} \left\|V_{n,M} - \tilde{V}_{n,M}\right\|_{\mathrm{op}} = k\,(\log(pn))^{1/2}\, O_p\big(D_{1n} + D_{2n}S_{1,k}\big).
\]
    If instead we have bounded covariates, then
\[
\sup_{M\in\mathcal{M}(k)} \left\|V_{n,M} - \tilde{V}_{n,M}\right\|_{\mathrm{op}} = k\, O_p\big(D_{1n} + D_{2n}S_{1,k}\big).
\]
    In fact, the following exact stochastic ordering is true. Suppose there exists a real constant $\mu$ such that
\[
\frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[Y_i^2\big] \le \mu^2.
\]

6 Other Approaches for Post-Selection Inference

    Post-selection inference based on approach 1, proposed in Section 3, can often be conservative and has the other disadvantages mentioned in Sub-section 3.1. Also, if we do not perform model selection and decide a priori to use model M, then the confidence regions from approach 1 do not match the “optimal” confidence regions for β0,M. We do not know of any result in the literature on “optimal” confidence regions in the assumption-lean framework. It should also be mentioned that approach 1 is not based on any pivotal statistics, which leads to certain non-optimalities.

    For these reasons, we present two other approaches for valid post-selection inference that are based on pivotal statistics and are invariant under linear transformations. As mentioned in the introduction, these two approaches are NP-hard to compute. To proceed, recall that we have R^p × R-valued observations (Xi, Yi), 1 ≤ i ≤ n, and for every model M ∈ M(p),

\[
\hat{\beta}_M = \operatorname*{arg\,min}_{\theta\in\mathbb{R}^{|M|}} \frac{1}{n}\sum_{i=1}^n \big\{Y_i - X_i^\top(M)\theta\big\}^2, \qquad
\beta_{0,M} = \operatorname*{arg\,min}_{\theta\in\mathbb{R}^{|M|}} \frac{1}{n}\sum_{i=1}^n \mathbb{E}\Big[\big\{Y_i - X_i^\top(M)\theta\big\}^2\Big].
\]

    Even though the techniques given in this section also generalize to settings other than independent observations, we restrict attention to observations that are independent random vectors. Before getting into approaches 2 and 3, we remark that the confidence regions from approaches 2 and 3 depend on the set of all models considered in model selection; this is not the case for approach 1.

    6.1 Approach 2: Post-selection using a Global Test

    The basic normal equations for linear regression with model M give the equality

\[
\Sigma_n(M)\big\{\hat{\beta}_M - \beta_{0,M}\big\} = \frac{1}{n}\sum_{i=1}^n X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big). \tag{17}
\]

    We know that the right-hand side has zero expectation from the definition of β0,M. By the classical central limit theorem, under finite second moments, we obtain for any fixed M ∈ M(p) with fixed |M|,

\[
n^{1/2}\,\frac{1}{n}\sum_{i=1}^n X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big) \overset{\mathcal{L}}{\to} N_{|M|}(0, V_M), \tag{18}
\]
    for the matrix VM given in Theorem 5. The convergence stated in relation (18) may not be uniform over M ∈ M(k). By the delta method, we obtain
\[
n\left(\frac{1}{n}\sum_{i=1}^n X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big)\right)^\top V_M^{-1} \left(\frac{1}{n}\sum_{i=1}^n X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big)\right) \overset{\mathcal{L}}{\to} \chi^2_{|M|}.
\]


    Here χ²_{|M|} denotes the central chi-square distribution with |M| degrees of freedom. This can equivalently be written as

\[
n\big(\hat{\beta}_M - \beta_{0,M}\big)^\top \Sigma_n(M)\, V_M^{-1}\, \Sigma_n(M)\big(\hat{\beta}_M - \beta_{0,M}\big) \overset{\mathcal{L}}{\to} \chi^2_{|M|}.
\]

    Recall the estimator matrix Vn,M defined in Theorem 5 and the intermediate matrix Ṽn,M defined in Section 5. As can be seen from the proof of Theorem 5(d), or by an application of Lemma 7, Vn,M is consistent for VM. Hence, by an application of Slutsky's theorem, we get, under the assumption of positive definiteness of VM,

\[
\begin{aligned}
T_{n,M} &:= n\big(\hat{\beta}_M - \beta_{0,M}\big)^\top \Sigma_n(M)\, V_{n,M}^{-1}\, \Sigma_n(M)\big(\hat{\beta}_M - \beta_{0,M}\big) \overset{\mathcal{L}}{\to} \chi^2_{|M|},\\
\tilde{T}_{n,M} &:= n\big(\hat{\beta}_M - \beta_{0,M}\big)^\top \Sigma_n(M)\, \tilde{V}_{n,M}^{-1}\, \Sigma_n(M)\big(\hat{\beta}_M - \beta_{0,M}\big) \overset{\mathcal{L}}{\to} \chi^2_{|M|}.
\end{aligned}
\]

    The statistics Tn,M and T̃n,M are invariant under linear transformations. Observe that

\[
\tilde{T}_{n,M} = \sup_{\alpha\in\mathbb{R}^{|M|},\,\alpha\ne 0} \frac{\Big(\sum_{i=1}^n \alpha^\top X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big)\Big)^2}{\sum_{i=1}^n \big\{\alpha^\top X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big)\big\}^2}.
\]
    This is very similar to Hotelling's T²-statistic used in the one-sample test of a zero-mean hypothesis and is usually referred to as a multivariate self-normalized statistic (a generalization of the univariate analogue). Also, note that this corresponds to the Wald test statistic for the hypothesis H0 : β0,M = 0, with the only change that in the denominator β0,M is replaced by β̂M. Using Lemma 7, it can be proved that Tn,M is equivalent to T̃n,M (that is, Tn,M − T̃n,M = op(1)) uniformly over all models M ∈ M(k), with k satisfying certain rate constraints.

    The essential idea of approach 2 for valid post-selection inference is to take the maximum over M ∈ M(k) of (a function of) Tn,M. It is important to take into account the fact that Tn,M and T̃n,M have mean and variance increasing with the size |M| of the model M, which implies that Tn,M (and T̃n,M) is stochastically larger for models M with larger cardinality. Therefore, we should not take the maximum over {M : |M| ≤ k} of Tn,M (or T̃n,M) itself as a statistic for valid post-selection inference. Instead, we should transform the random variables by the cumulative distribution function of χ²_{|M|} to put them on the same scale (since they are then all asymptotically uniform on [0, 1]) or, alternatively, we could normalize them by their mean and standard deviation to put them on the same scale.

    To conclude with a precise formulation, for 1 ≤ k ≤ p and α ∈ [0, 1], define Fk,α as the real number that satisfies

\[
P\left(\sup_{M\in\mathcal{M}(k)} \frac{T_{n,M} - |M|}{\sqrt{2|M|}} \ge F_{k,\alpha}\right) \le \alpha.
\]

    This gives the valid post-selection confidence regions as

\[
\hat{R}^{(2)}_M = \left\{\theta\in\mathbb{R}^{|M|} :\; n\big(\hat{\beta}_M - \theta\big)^\top \Sigma_n(M)\, V_{n,M}^{-1}\, \Sigma_n(M)\big(\hat{\beta}_M - \theta\big) \le |M| + (2|M|)^{1/2} F_{k,\alpha}\right\},
\]


    for models M ∈ M(k). Regarding the computation and computational complexity, we postpone the discussion until after the next approach, given in the following sub-section, since that approach is also based on self-normalized statistics.
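As a sketch of how the region R̂(2)M would be used in practice (our own illustration, assuming the critical value Fk,α has already been obtained, e.g. by the bootstrap of Sub-section 6.3), the following numpy function checks whether a candidate θ belongs to R̂(2)M.

```python
import numpy as np

def approach2_region_contains(X, Y, M, theta, F_k_alpha):
    """True if theta lies in R^(2)_M, i.e. if
    T_{n,M}(theta) <= |M| + sqrt(2|M|) * F_{k,alpha}."""
    XM = X[:, M]
    n, m = XM.shape
    Sigma_n = XM.T @ XM / n
    beta_hat = np.linalg.solve(Sigma_n, XM.T @ Y / n)
    resid = Y - XM @ beta_hat
    V_n = (XM * resid[:, None] ** 2).T @ XM / n      # meat matrix V_{n,M}
    diff = Sigma_n @ (beta_hat - theta)
    T = n * diff @ np.linalg.solve(V_n, diff)        # Wald-type statistic at theta
    return T <= m + np.sqrt(2 * m) * F_k_alpha
```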

    6.2 Approach 3: Post-selection using t-statistics

    Approaches 1 and 2, presented in Section 3 and Sub-section 6.1, are geared towards confidence regions for the full vector, in the sense that these regions are not aligned with the coordinate axes and hence do not directly yield precise intervals for each coordinate parameter β0,M(j), j ∈ M. The focus of the present approach (referred to as approach 3) is on obtaining precise intervals for these coordinate parameters. This approach is more along the lines of the PoSI approach proposed in Berk et al. (2013). To proceed, recall once again the normal equation (17),

\[
\Sigma_n(M)\big\{\hat{\beta}_M - \beta_{0,M}\big\} = \frac{1}{n}\sum_{i=1}^n X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big).
\]

    By the law of large numbers, or from Lemma 3, we know that Σn(M) converges almost surely to Σ(M) uniformly in M. So,

\[
\hat{\beta}_M - \beta_{0,M} \approx \frac{1}{n}\sum_{i=1}^n [\Sigma(M)]^{-1} X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big).
\]

    Let σj,M denote the j-th row of (Σ(M))^{-1} and similarly let σ̂j,M denote the j-th row of (Σn(M))^{-1}. Then,

\[
\hat{\beta}_M(j) - \beta_{0,M}(j) \approx \frac{1}{n}\sum_{i=1}^n \sigma_{j,M}^\top X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big).
\]

    Here too, the right hand side has zero expectation. The classical t-statistic for testingmean zero of the random variables on the right-hand side is given (almost) by

\[
\tilde{t}_{j,M} = \frac{\sum_{i=1}^n \sigma_{j,M}^\top X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big)}{\Big(\sum_{i=1}^n \big\{\sigma_{j,M}^\top X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big)\big\}^2\Big)^{1/2}}.
\]
    This is also the t-statistic that should be used for testing β0,M(j) = 0 under random covariates, with the exception that the unknown σj,M and β0,M in the denominator are replaced by their sample analogues. Note that t̃j,M is a univariate self-normalized statistic. The actual t-statistic is given by

\[
t_{j,M} = \frac{n\big(\hat{\beta}_M(j) - \beta_{0,M}(j)\big)}{\Big(\sum_{i=1}^n \big\{\hat{\sigma}_{j,M}^\top X_i(M)\big(Y_i - X_i^\top(M)\hat{\beta}_M\big)\big\}^2\Big)^{1/2}}.
\]


    Asymptotically, for any fixed M and j ∈ M, tj,M converges in distribution to a standard normal distribution. Using Lemmas 3 and 7, we get that tj,M is equivalent to t̃j,M uniformly over all M ∈ M(k) and j ∈ M, under certain rate constraints on k. To define the post-selection confidence regions, for 1 ≤ k ≤ p and α ∈ [0, 1], let Γk,α denote the real number that satisfies

\[
P\left(\sup_{M\in\mathcal{M}(k),\, j\in M} |t_{j,M}| \ge \Gamma_{k,\alpha}\right) \le \alpha.
\]
    Then the confidence regions for model M are given by

\[
\hat{R}^{(3)}_M = \left\{\theta\in\mathbb{R}^{|M|} :\; \sup_{j\in M} \frac{n\big|\hat{\beta}_M(j) - \theta(j)\big|}{\Big(\sum_{i=1}^n \big\{\hat{\sigma}_{j,M}^\top X_i(M)\big(Y_i - X_i^\top(M)\hat{\beta}_M\big)\big\}^2\Big)^{1/2}} \le \Gamma_{k,\alpha}\right\}.
\]

    Note that these confidence regions are rectangles with faces parallel to the coordinate axes. It is also easy to modify this method to get shorter confidence regions if one is only interested in models that contain a particular variable, in which case the computation (or definition) of Γk,α involves only those models.
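Because R̂(3)M is a rectangle, it can be reported as simultaneous intervals for the coordinates of β0,M. The following sketch is ours, with the critical value Γk,α assumed given (e.g. from the bootstrap of Sub-section 6.3); it computes these intervals for a single model M.

```python
import numpy as np

def approach3_intervals(X, Y, M, Gamma_k_alpha):
    """Simultaneous intervals beta_hat_M(j) +/- width_j from R^(3)_M."""
    XM = X[:, M]
    n, m = XM.shape
    Sigma_n_inv = np.linalg.inv(XM.T @ XM / n)
    beta_hat = Sigma_n_inv @ (XM.T @ Y / n)
    resid = Y - XM @ beta_hat
    # scores[i, j] = sigma_hat_j' X_i(M) * residual_i
    scores = (XM * resid[:, None]) @ Sigma_n_inv
    denom = np.sqrt((scores ** 2).sum(axis=0))
    width = Gamma_k_alpha * denom / n            # |beta_hat(j) - theta(j)| <= width_j
    return np.column_stack([beta_hat - width, beta_hat + width])
```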

    6.3 Computation of Confidence Regions

    For approach 1, the method of computation is described in Section 4. For approaches 2 and 3, we describe the computation in this sub-section. As mentioned before, these computations are essentially NP-hard in the sense that they involve an almost complete enumeration of the models considered in model selection. The method of estimation is again a type of multiplier bootstrap, as in Section 4. Firstly, recognize that the statistics involved in approaches 2 and 3 can be cast in terms of a self-normalized empirical process defined as follows.

    Suppose F is a given class of functions (possibly uncountable) and let Z1, Z2, . . . , Zn be independent random variables on some measurable space (which need not be a Euclidean space) such that Ef(Zi) = 0. Define

\[
W_n(f) = \frac{\sum_{i=1}^n f(Z_i)}{\big(\sum_{i=1}^n f^2(Z_i)\big)^{1/2}}, \quad\text{and}\quad W_n = \sup_{f\in\mathcal{F}} |W_n(f)|.
\]

    The function class F and observations Zi for approaches 2 and 3 are as follows:

    1. For approach 2, the random variables are Zi = (Xi, Yi) and F contains, for all M ∈ M(k) and α ∈ R^{|M|} with ‖α‖ = 1, the functions
\[
f_{M,\alpha}(Z_i) = \alpha^\top X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big).
\]


    So, F constitutes an uncountable class of functions. Note that
\[
\sup_{\alpha\in\mathbb{R}^{|M|},\,\|\alpha\|=1} |W_n(f_{M,\alpha})|^2 = \tilde{T}_{n,M},
\]
    which is uniformly equivalent to Tn,M.

    2. For approach 3, the random variables are (again) Zi = (Xi, Yi) and F contains, for all M ∈ M(k) and j ∈ M, the functions
\[
f_{j,M}(Z_i) = \sigma_{j,M}^\top X_i(M)\big(Y_i - X_i^\top(M)\beta_{0,M}\big).
\]
    In this case, F is actually a finite set of functions, but one possibly growing with n if k grows with n.

    Getting back to Wn, fix α ∈ [0, 1]. We want to estimate the (1 − α)-quantile q(F, α) that satisfies
\[
P\big(W_n \ge q(\mathcal{F}, \alpha)\big) \le \alpha.
\]
    The method of computation is described as follows.

    • Fix a natural number N ≥ 1, the number of bootstrap replicates needed.

    • Let εi,j, 1 ≤ i ≤ n, 1 ≤ j ≤ N, be iid Rademacher random variables, i.e.,
\[
P(\varepsilon_{i,j} = 1) = P(\varepsilon_{i,j} = -1) = 1/2.
\]

    • Construct the j-th replicate of Wn as
\[
W_n^{(j)} = \sup_{f\in\mathcal{F}} \frac{\big|\sum_{i=1}^n \varepsilon_{i,j} f(Z_i)\big|}{\big(\sum_{i=1}^n \varepsilon_{i,j}^2 f^2(Z_i)\big)^{1/2}} = \sup_{f\in\mathcal{F}} \frac{\big|\sum_{i=1}^n \varepsilon_{i,j} f(Z_i)\big|}{\big(\sum_{i=1}^n f^2(Z_i)\big)^{1/2}}.
\]

    • Find the (1 − α)-th sample quantile q̂N(F, α) of {W_n^{(j)} : 1 ≤ j ≤ N}; this is an estimate of q(F, α) (a sketch of the whole procedure follows).
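For the finite function class of approach 3, the Rademacher multiplier bootstrap above can be written in a few lines. The sketch below is our own illustration, taking as input the n × |F| matrix with entries f(Zi) (with β0,M and σj,M replaced by their sample analogues, as described in the next paragraph). For approach 2 the supremum over α must additionally be computed exactly for each replicate, which is what drives the cost.

```python
import numpy as np

def self_normalized_bootstrap_quantile(scores, alpha=0.05, N=1000, rng=None):
    """Rademacher multiplier bootstrap for the (1 - alpha)-quantile of
    W_n = sup_f |sum_i f(Z_i)| / (sum_i f(Z_i)^2)^{1/2}, where the finite
    class F is represented by scores[i, f] = f(Z_i)."""
    rng = np.random.default_rng() if rng is None else rng
    n, q = scores.shape
    denom = np.sqrt((scores ** 2).sum(axis=0))   # unchanged by eps_i^2 = 1
    W_reps = np.empty(N)
    for j in range(N):
        eps = rng.choice([-1.0, 1.0], size=n)    # Rademacher multipliers
        W_reps[j] = np.max(np.abs(eps @ scores) / denom)
    return np.quantile(W_reps, 1 - alpha)
```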

    A variant of this multiplier bootstrap with standard normal random variables is also possible. Application of this method to approaches 2 and 3 involves replacing σj,M and β0,M with σ̂j,M and β̂M, respectively. Recall that the resulting statistics are uniformly close to the actual self-normalized statistics, so asymptotically this should not be a problem; some rigorous statements will be presented in a future paper. This method shows why approaches 2 and 3 are NP-hard: their computation essentially requires enumerating all models considered for model selection, because one needs to compute the supremum over f ∈ F exactly for each replicate.

    Remark 6.1 The required quantiles q(F, α) of the self-normalized empirical process Wn can easily be bounded if a tail bound is available. Hence, a quick way to bound the quantile q(F, α) is to use a concentration inequality for Wn. One such concentration inequality is given in Bercu et al. (2002). However, it should be understood that these concentration inequalities are not sharp in terms of constants and lead to confidence regions that are wider by a constant factor. □


7 Extensions to Generalized Linear Models and M-Estimation Problems

    In the previous sections, we focused on ordinary least squares linear regression. It is of interest to understand how the techniques generalize when other M-estimation techniques are used. For example, if the response is categorical, then it might be more appealing to apply logistic regression rather than linear regression. In this section, we show how the three approaches for linear regression generalize to such modeling methods. We do not present any theoretical results but indicate the required ingredients. It may be helpful to realize here that any smooth function is locally a convex quadratic near its minimum. From this point of view, once we prove the results for linear regression, the method for other M-estimation problems should in principle follow from a Taylor series expansion.

    Suppose Ψ is a real-valued function defined on the real line and, as before, we have independent and identically distributed observations (Xi, Yi), 1 ≤ i ≤ n. Let (X, Y) be a generic random vector. Assume that Ψ is twice differentiable with first and second derivatives denoted by Ψ̇ and Ψ̈. Recall the notation from Section 2. Consider the empirical and the expected objective functions for M ∈ M(p),

\[
\hat{O}_n(\theta) := \frac{1}{n}\sum_{i=1}^n \big\{\Psi(X_i^\top(M)\theta) - Y_i X_i^\top(M)\theta\big\}, \quad \theta\in\mathbb{R}^{|M|},
\]
\[
O(\theta) := \mathbb{E}\big[\Psi(X^\top(M)\theta) - Y X^\top(M)\theta\big], \quad \theta\in\mathbb{R}^{|M|}.
\]

    Observe that most generalized linear model problems fall into this category for specific choices of Ψ. For linear regression, we take Ψ(t) = t²/2. See Erdogdu et al. (2016) for more related applications. In general, the function Ψ is required to be convex, but we do not ask for this. Define, for M ∈ M(p), the estimator and the corresponding target as

\[
\hat{\eta}_M := \operatorname*{arg\,min}_{\theta\in\mathbb{R}^{|M|}} \frac{1}{n}\sum_{i=1}^n \big\{\Psi(X_i^\top(M)\theta) - Y_i X_i^\top(M)\theta\big\}, \qquad
\eta_{0,M} := \operatorname*{arg\,min}_{\theta\in\mathbb{R}^{|M|}} \mathbb{E}\big[\Psi(X^\top(M)\theta) - Y X^\top(M)\theta\big].
\]

    By an application of results from Hjort and Pollard (1993), we get that for any fixed M with fixed cardinality, η̂M converges in probability to η0,M as n → ∞. Using the proofs of Lemmas 4 and 5 in combination with the results of Negahban et al. (2009), it is possible to prove the uniform consistency result
\[
\sup_{M\in\mathcal{M}(k)} \|\hat{\eta}_M - \eta_{0,M}\|_1 = o_p(1), \quad\text{as } n\to\infty, \tag{19}
\]

    under some rate constraints on k. Approach 1 does not readily extend beyond linear regression in a computationally efficient way, because Ψ̈ ≡ 1 for linear regression, which is what allows the decoupling of some terms in approach 1. For approaches 2 and 3, note that η̂M satisfies the equation

\[
\frac{1}{n}\sum_{i=1}^n X_i(M)\dot{\Psi}(X_i^\top(M)\hat{\eta}_M) - \frac{1}{n}\sum_{i=1}^n Y_i X_i(M) = 0.
\]

    By a Taylor series expansion of Ψ̇ and by uniform consistency (19), we get that
\[
\left(\frac{1}{n}\sum_{i=1}^n X_i(M)X_i^\top(M)\ddot{\Psi}(X_i^\top(M)\eta_{0,M})\right)\big(\hat{\eta}_M - \eta_{0,M}\big) \approx \frac{1}{n}\sum_{i=1}^n \big\{X_i(M)\dot{\Psi}(X_i^\top(M)\eta_{0,M}) - X_i(M)Y_i\big\}. \tag{20}
\]

    Here the right hand side has mean zero by definition of η0,M and the approximationholds uniformly over M ∈ M(k). Define, for notational convenience, Zi = (Xi, Yi) for1 ≤ i ≤ n, the score function,

\[
\psi_M(Z_i, \theta) := X_i(M)\dot{\Psi}(X_i^\top(M)\theta) - X_i(M)Y_i,
\]

    and the matrices

\[
\begin{aligned}
B_{n,M} &:= \frac{1}{n}\sum_{i=1}^n X_i(M)X_i^\top(M)\ddot{\Psi}(X_i^\top(M)\hat{\eta}_M), &
\tilde{B}_{n,M} &:= \frac{1}{n}\sum_{i=1}^n X_i(M)X_i^\top(M)\ddot{\Psi}(X_i^\top(M)\eta_{0,M}),\\
B_M &:= \mathbb{E}\big[X(M)X^\top(M)\ddot{\Psi}(X^\top(M)\eta_{0,M})\big], &
V_{n,M} &:= \frac{1}{n}\sum_{i=1}^n \psi_M(Z_i, \hat{\eta}_M)\psi_M^\top(Z_i, \hat{\eta}_M),\\
\tilde{V}_{n,M} &:= \frac{1}{n}\sum_{i=1}^n \psi_M(Z_i, \eta_{0,M})\psi_M^\top(Z_i, \eta_{0,M}), &
V_M &:= \mathbb{E}\big[\psi_M(Z_i, \eta_{0,M})\psi_M^\top(Z_i, \eta_{0,M})\big].
\end{aligned}
\]

    The sample mean on the right hand side of the approximation (20) has an asymptoticnormal distribution with a suitable scaling. Hence by an application of the delta method,we get

\[
\begin{aligned}
T_{n,M} &:= n\,(\hat{\eta}_M - \eta_{0,M})^\top B_{n,M}\, V_{n,M}^{-1}\, B_{n,M}\, (\hat{\eta}_M - \eta_{0,M}) \overset{\mathcal{L}}{\to} \chi^2_{|M|},\\
\tilde{T}_{n,M} &:= n\,(\hat{\eta}_M - \eta_{0,M})^\top B_{n,M}\, \tilde{V}_{n,M}^{-1}\, B_{n,M}\, (\hat{\eta}_M - \eta_{0,M}) \overset{\mathcal{L}}{\to} \chi^2_{|M|}.
\end{aligned}
\]

    As before in approach 2, T̃n,M is a multivariate self-normalized statistic. Now followapproach 2 verbatim to construct a valid post-selection inference procedure.
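To illustrate the extension, the following sketch (ours; logistic regression with Ψ(t) = log(1 + e^t), so Ψ̇ is the logistic function and Ψ̈ = Ψ̇(1 − Ψ̇)) computes Bn,M, Vn,M and the statistic Tn,M for one model M, fitting η̂M by a fixed number of Newton steps. The function name logistic_Tn and the iteration count are our own choices, not part of the paper.

```python
import numpy as np

def logistic_Tn(X, Y, M, theta):
    """T_{n,M} for logistic regression; theta plays the role of eta_{0,M}."""
    XM = X[:, M]
    n, m = XM.shape
    eta = np.zeros(m)
    for _ in range(25):                                  # Newton-Raphson on the empirical objective
        mu = 1.0 / (1.0 + np.exp(-XM @ eta))             # Psi_dot evaluated at X'eta
        grad = XM.T @ (mu - Y) / n
        hess = (XM * (mu * (1 - mu))[:, None]).T @ XM / n
        eta -= np.linalg.solve(hess, grad)
    mu = 1.0 / (1.0 + np.exp(-XM @ eta))
    psi = XM * (mu - Y)[:, None]                         # rows psi_M(Z_i, eta_hat_M)
    B_n = (XM * (mu * (1 - mu))[:, None]).T @ XM / n     # bread B_{n,M}
    V_n = psi.T @ psi / n                                # meat V_{n,M}
    diff = B_n @ (eta - theta)
    return n * diff @ np.linalg.solve(V_n, diff)
```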

    For approach 3, let σj,M and σ̂j,M denote the j-th rows of $B_M^{-1}$ and $B_{n,M}^{-1}$, respectively. Then for any M ∈ M(k) and j ∈ M,

\[
\hat{\eta}_M(j) - \eta_{0,M}(j) \approx \frac{1}{n}\sum_{i=1}^n \sigma_{j,M}^\top \psi_M(Z_i, \eta_{0,M}),
\]


    where again the right-hand side has mean zero. We can construct the t-statistic, as in approach 3, as

\[
\tilde{t}_{j,M} := \frac{\sum_{i=1}^n \sigma_{j,M}^\top \psi_M(Z_i, \eta_{0,M})}{\Big(\sum_{i=1}^n \big\{\sigma_{j,M}^\top \psi_M(Z_i, \eta_{0,M})\big\}^2\Big)^{1/2}}.
\]
    Here too, the statistic depends on σj,M and η0,M, which are unknown. As in approach 3, we replace them in the denominator by σ̂j,M and η̂M to get the t-statistic

\[
t_{j,M} := \frac{n\,\big(\hat{\eta}_M(j) - \eta_{0,M}(j)\big)}{\Big(\sum_{i=1}^n \big\{\hat{\sigma}_{j,M}^\top \psi_M(Z_i, \hat{\eta}_M)\big\}^2\Big)^{1/2}}.
\]
    Now follow approach 3 verbatim to construct a valid post-selection inference procedure for each coordinate. The computation of quantiles for these two approaches is the same as given in Sub-section 6.3.

    Remark 7.1 As can be understood from this section, the main ingredient in the construction of valid post-selection inference is the linearization equation (20). The next three major ingredients are to prove, for some sequence k = kn, that

    (a) $\sup\big\{\|\hat{\eta}_M - \eta_{0,M}\|_1 : M \in \mathcal{M}(k)\big\} = o_p(1)$ as $n\to\infty$;

    (b) $\sup\big\{\|B_{n,M} - B_M\|_{\mathrm{op}} : M \in \mathcal{M}(k)\big\} = o_p(1)$ as $n\to\infty$; and

    (c) $\sup\big\{\|\tilde{V}_{n,M} - V_{n,M}\|_{\mathrm{op}} : M \in \mathcal{M}(k)\big\} = o_p(1)$ as $n\to\infty$.

    Once these three requirements are fulfilled, it is possible to get valid post-selection inference in any M-estimation problem, even with an increasing number of parameters. Also, it should be mentioned here that the linearization equation (20) holds for many estimators with not-necessarily-differentiable objective functions; it is one of the most basic tools available for proving asymptotic normality of a sequence of minimizers. See Hjort and Pollard (1993) for more details. □

    8 Closing Remarks and Future Work

    Acknowledgments

    The first author is grateful to Abhishek Chakrabortty and Bikram Karmakar for helpfuldiscussions and their comments on approach 1.


Bibliography

    Agostinelli, C., Leung, A., Yohai, V. J., and Zamar, R. H. (2015). Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination. TEST, 24(3):441–461.

    Arcones, M. A. (2005). Convergence of the optimal M-estimator over a parametric family of M-estimators. TEST, 14(1):281–315.

    Bachoc, F., Preinerstorfer, D., and Steinberger, L. (2016). Uniformly valid confidence intervals post-model-selection. arXiv e-prints.

    Bercu, B., Gassiat, E., and Rio, E. (2002). Concentration inequalities, large and moderate deviations for self-normalized empirical processes. Ann. Probab., 30(4):1576–1604.